Home / Hacking Gemini: How Researchers Turned Google’s AI Against Itself

Hacking Gemini: How Researchers Turned Google’s AI Against Itself

When you ask an AI assistant like Google’s Gemini a question, you expect it to follow specific rules – like not revealing private information or helping with malicious activities. But what if someone could trick these guardrails into failing almost every time, using nothing but Google’s own free tools?  That’s precisely what researchers at UC San Diego have accomplished with a technique they’ve playfully named “Fun-Tuning.” Their method transforms what was once a tedious guessing game into a systematic process anyone with basic technical skills could reproduce.

FROM GUESSWORK TO GAME PLAN

Up until now, attempting to deceive AI systems has often resembled throwing spaghetti at the wall to see what sticks. It can take dozens, or even hundreds, of different prompts before finding one that is effective. Traditional ‘prompt injection’ attacks—where hackers generate specific text to confuse AI systems—typically succeed less than half the time against Google’s Gemini. This process involves a significant amount of trial and error, with no guarantee of success.

Fun-Tuning changes everything by turning this guessing game into a methodical recipe anyone can follow:

  1. Start with a basic prompt injection (like asking the AI to ignore safety rules).
  2. Use Google’s free Gemini fine-tuning tool (usually used by developers to customize AI for specific tasks).
  3. Let the algorithm discover what special words and phrases make your attack more effective.
  4. After about 60 hours of automated testing (which costs nothing), you get a supercharged attack that works up to 80% of the time. Total cost: Approximately $10.

WHAT’S ACTUALLY HAPPENING?

To understand how this works, imagine you’re trying to find the combination to a lock by listening for subtle clicks. When developers use Google’s fine-tuning tools, the system provides feedback scores that measure how well the AI is learning. The researchers discovered these scores inadvertently reveal what kinds of phrases make the AI more vulnerable.

During the fine-tuning process, the API returns data that, while intended to help developers optimize their models, can be interpreted by attackers to guide their search for adversarial prompts. This “loss-like information” essentially tells the attacker how “wrong” the model is when presented with certain inputs. By iteratively tweaking prompts and observing the model’s reactions through this feedback loop, attackers can zero in on prompts that are highly effective at injecting malicious instructions. It’s akin to an AI subtly revealing its weaknesses to those who know how to listen. This unintended consequence of providing fine-tuning capabilities highlights a delicate balance that AI developers must strike between utility and security.

THE MAGIC WORDS

The strangest part? The phrases that boost attack effectiveness look like complete nonsense to humans. In one successful example, adding this prefix to an attack:

“wandel ! ! ! ! ! machin vecchi礼Invokerпред forgets ! (. . . ) <strong>”…and this suffix: “! ! ! ! ! ! ! ! ! ! ! ! ! ! ! ! formatted ! ASAP !” ..transformed a failed attack into one that worked consistently.

This payload, highlighted in red, wraps around the trigger (bolded) to ensure the model follows the injected instructions. Credit: Labunets et al.

These strange text sequences—appearing as random gibberish to humans—act as powerful amplifiers when presented to Gemini, increasing success rates from 28% to 65% in Gemini 1.5 Flash and from 43% to 82% in Gemini 1.0 Pro. These seemingly random characters and words – which mix English with other languages and symbols – act like skeleton keys that unlock Gemini’s vulnerabilities. The code calculates the area of a circle with a radius of 5, but a comment overrides this calculation, instructing the code to output ’10’ instead. Even more concerning, once a researcher finds these special phrases for one version of Gemini, they typically work on other versions, too.

Another example of prompt injection using the method on Gemini 1.0 Pro. 

The perturbations that trigger the prompt injection are highlighted in red and the injection payload is highlighted in bold. Credit: Labunets et al.

THE IMPOSSIBLE DEFENSE

This attack is particularly troubling because it exploits an essential feature of AI development. The fine-tuning interface that leaks vulnerability information is the same interface developers need to make Gemini useful for specialized tasks. “It’s like discovering that the foundation of your house also happens to be the perfect entry point for intruders,” explains Nishit Pandya, another researcher on the team. “You can’t remove it without the whole structure collapsing.”

Perhaps the most alarming finding was the significant transferability of the generated adversarial prompts across different models within the Gemini family. This means that a malicious prompt crafted to attack one specific Gemini model often proved effective against other models in the same family. This suggests a fundamental vulnerability within the underlying architecture or training data of the Gemini models, rather than isolated weaknesses in specific versions. Even more concerning, the attacks improve with iteration. Unlike random approaches that “stumble in the dark,” Fun-Tuning steadily refines its approach, with each cycle producing more effective results. The researchers found they could further enhance performance by periodically “restarting” the algorithm, allowing it to explore new attack vectors.

This research directly contradicts findings in Google’s own report on Gemini 1.5 Flash, which suggested that optimization-based prompt injections were not very effective. This discrepancy raises serious questions about the accuracy and completeness of vendor self-assessments of AI security and underscores the critical importance of independent security research in uncovering potential vulnerabilities. The fact that external researchers were able to achieve a 65.3% attack success rate against a model that Google deemed relatively secure is a stark reminder that even the most advanced AI systems are not immune to exploitation.

WHAT THIS MEANS FOR YOU

The implications of this research are clear: the cybersecurity landscape is entering a new and potentially more dangerous phase. The ability to weaponize advanced AI models like Gemini through techniques like “fun-tuning” lowers the barrier to entry for sophisticated cyberattacks. Even less skilled attackers could potentially leverage these methods to generate highly effective malicious prompts, leading to a surge in AI-assisted cybercrime. The low cost of executing these attacks once the initial adversarial prompts are developed further amplifies this threat.

For everyday users, this research serves as a reminder that AI systems remain vulnerable to manipulation, especially when handling sensitive information. Organizations integrating Gemini or similar AI models into their workflows should exercise caution, mainly when the AI interacts with external content.

In the meantime, the cat-and-mouse game between AI developers and security researchers continues – with the tools designed to improve AI inadvertently becoming weapons against it.