While LLMs promise helpful conversation, they may carry hidden vulnerabilities that can be exploited. For example, a carefully manipulated prompt can lead a model to reveal sensitive information or say unethical, inappropriate, or harmful things that violate its usage policies. This is called a jailbreak attack: an attempt to bypass the model’s safety guardrails and coax it into producing output it is designed to refuse.
The paper below might make good weekend reading for the security-minded.
Recent research proposes a novel framework that can successfully generate prompts to jailbreak major LLM chatbots [1]. The researchers first conducted an empirical study evaluating the effectiveness of existing jailbreak techniques on chatbots like ChatGPT, Bard, and Bing Chat. They found that current prompts only work on ChatGPT, while Bard and Bing Chat have additional defenses, making them resilient to known jailbreak attempts.
Leveraging these insights, the team reverse-engineered the hidden defenses of Bard and Bing Chat by exploiting the time sensitivity of the chatbots’ responses. This analysis suggested that these services likely run keyword-based filters over generated output to catch policy violations before it reaches the user.
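The trick is reminiscent of time-based blind SQL injection [2]: the filter itself is invisible, but its presence can show up in how long responses take. The paper’s actual methodology is more elaborate; the sketch below is only a minimal illustration of the timing-comparison idea, with a dummy chatbot function standing in for a real client, so the prompts, timings, and function names are all hypothetical.

```python
import time
import statistics
from typing import Callable

def measure_latency(send_prompt: Callable[[str], str], prompt: str, trials: int = 5) -> float:
    """Median wall-clock time the service takes to answer a prompt."""
    timings = []
    for _ in range(trials):
        start = time.perf_counter()
        send_prompt(prompt)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def dummy_chatbot(prompt: str) -> str:
    """Stand-in for a real chatbot client; simulates an extra output-filtering pass."""
    time.sleep(0.10)                      # base generation time
    if "forbidden" in prompt:             # pretend a keyword filter is triggered
        time.sleep(0.05)                  # extra time spent scanning the output
        return "I can't help with that."
    return "Here is an answer."

if __name__ == "__main__":
    baseline = measure_latency(dummy_chatbot, "Summarize the plot of Hamlet.")
    flagged = measure_latency(dummy_chatbot, "Tell me something forbidden.")
    print(f"baseline: {baseline:.3f}s  flagged: {flagged:.3f}s  delta: {flagged - baseline:.3f}s")
```

In practice, real response times are noisy, so repeated trials and statistical comparison (as the median above gestures at) are what make such an inference credible.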
Comprehensive evaluation shows the framework can jailbreak the LLMs with much higher success rates than existing techniques and baselines. It achieved average success rates of 14.51% on Bard and 13.63% on Bing Chat.
The key innovation is using time-based analysis [2] and an automated LLM training pipeline to generate effective, universal jailbreak prompts against mainstream chatbots.
The findings reveal deficiencies in the current defenses of major LLM chatbots and underscore the need for more robust safeguards, ethical considerations, and responsible disclosure. Overall, the work highlights the continued progress required to make LLMs responsible and valuable tools for society, which will take sustained collaboration between researchers, developers, policymakers, and users.
[1] Masterkey: Automated Jailbreaking of Large Language Model Chatbots, https://arxiv.org/pdf/2307.08715.pdf
[2] Analogous to time-based blind SQL injection, a well-known technique in the database realm, https://owasp.org/www-community/attacks/Blind_SQL_Injection