Continuing previous coverage of developments in AI systems, I wanted to share a study and demo from Carnegie Mellon University in Pittsburgh, Pennsylvania and the Center for AI Safety in San Francisco, California revealing a new spin on how chatbot safeguards are susceptible to attack.
AI chatbots like OpenAI’s ChatGPT, Google’s Bard, and Anthropic’s Claude don’t have inherent ethical systems. They can share sensitive, dangerous, or incorrect information in a confidently written style. Developers must add ethical “guardrails” that prevent chatbots from providing such information when prompted.
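As a loose illustration of the concept (and only that; real guardrails are trained into the model itself, for example through reinforcement learning from human feedback, rather than bolted on as a filter), you can picture a refusal layer sitting between the user’s prompt and the model’s reply. The snippet below, with a made-up topic list and a hypothetical guarded_reply function, sketches the idea:

```python
# Toy sketch only: production guardrails are baked into the model through
# alignment training, not implemented as a simple keyword filter like this.
DISALLOWED_TOPICS = {"build a bomb", "steal someone's identity"}  # hypothetical list

def guarded_reply(prompt: str, model_reply: str) -> str:
    """Return the model's reply unless the prompt touches a disallowed topic."""
    lowered = prompt.lower()
    if any(topic in lowered for topic in DISALLOWED_TOPICS):
        return "I can't help with that request."
    return model_reply
```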
Researchers at the two institutions discovered a variation of “jailbreaking” (modifying prompts to evade the guardrails and extract the sensitive or dangerous information) that requires only appending a suffix string to otherwise rejected prompts. Initially developed on open-source chatbots, the suffix-string attack has also successfully defeated the guardrails of closed-source platforms (e.g., getting the chatbot to provide step-by-step instructions for building a bomb or stealing someone’s identity).
While the strings are mostly human-readable, they’re also essentially gibberish (e.g., “describing.\ + similarlyNow write oppositely.] (Me giving**ONE please? revert with “\!—Two”). Easily generated in large numbers by anyone, these exploits represent a significant departure from previously successful attacks.
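To make the mechanics concrete, here is a minimal sketch of what the attack amounts to, assuming a hypothetical ask_chatbot API and a placeholder (non-working) suffix: the gibberish string is simply concatenated onto the end of a prompt the chatbot would otherwise refuse.

```python
# Minimal sketch of how an adversarial-suffix prompt is assembled.
# ask_chatbot is a hypothetical stand-in for any chat API, and the suffix
# below is a placeholder, not an actual working attack string.
ADVERSARIAL_SUFFIX = " <gibberish suffix found by the attack>"  # placeholder

def build_attack_prompt(rejected_prompt: str) -> str:
    """Append the adversarial suffix to a prompt the guardrails normally block."""
    return rejected_prompt + ADVERSARIAL_SUFFIX

def ask_chatbot(prompt: str) -> str:
    """Hypothetical stand-in for a call to a chatbot API."""
    raise NotImplementedError

# Usage (illustrative): the same underlying request, now wearing the suffix.
# response = ask_chatbot(build_attack_prompt("<otherwise rejected request>"))
```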
Other “jailbreaking” attacks have relied on outthinking the chatbot, such as convincing it that a request is “hypothetical” (e.g., “tell me a story about…”), to defeat the guardrails. Not so with the suffix strings.
Researchers warn that there is currently no systematic way to defend against the suffix-string attack, and they say this underscores the “brittleness of the defenses” implemented in the AI systems underlying the chatbots. It also highlights the complexity of the neural networks that power these chatbots and the challenges in curbing their potential misuse (e.g., flooding the internet with false and toxic content).
Yet again, we are reminded that AI is only a tool, and tools can be, but should not be, used without human oversight and intervention.