
The AI Trojan Horse: How Images Threaten AI Assistants

The latest AI systems, known as OS agents, are transforming how we interact with computers. An OS agent, or Operating System agent, is a type of AI that can directly and autonomously interact with your computer’s operating system and user interface.

Unlike traditional assistants like Siri or Alexa, which are confined to specific apps or limited commands, OS agents can “see” your screen and perform actions across multiple applications just as a human would. Built on advanced large language models (LLMs) trained to interpret visual information and translate user instructions into actionable steps, these agents can execute complex, multi-application tasks without manual guidance. Examples include Anthropic’s Computer Use, which uses its Claude models to operate a computer, and Apple Intelligence, which is expected to offer similar cross-application capabilities on Apple devices.

Researchers at the University of Oxford in the U.K. [1] discovered a critical weakness in these powerful agents: they can be fooled by specially crafted images. When an OS agent takes a screenshot that includes one of these images, it misinterprets the visual data and outputs a malicious command chosen in advance by the attacker. This turns the AI’s own capabilities against the user.


Key Distinction from Traditional Steganography

This new threat is fundamentally different from traditional steganography.

Traditional Steganography is the art of hiding information inside other files. Its goal is to smuggle a secret message or a malicious file without anyone knowing it exists. The hidden information is embedded in the digital “container” (like an image) and is only revealed by a recipient with the correct tool or key.

  • This is done by hiding data in image metadata or by using techniques like Least Significant Bit (LSB) manipulation, which alters a pixel’s final bit, a change imperceptible to the human eye (see the sketch after this list).
  • The attacker’s goal is to embed and conceal a payload, be it a text file or an executable file, so that traditional security tools won’t detect it.
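
To make the contrast concrete, here is a minimal sketch of classic LSB embedding in Python (using NumPy). It is illustrative only and not taken from the research; the image and payload are toy stand-ins.

    import numpy as np

    def embed_lsb(pixels: np.ndarray, message: bytes) -> np.ndarray:
        """Hide `message` in the least significant bit of each pixel value."""
        bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))
        flat = pixels.flatten()                      # flatten() already returns a copy
        if len(bits) > len(flat):
            raise ValueError("message too long for this image")
        flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | bits   # overwrite only the last bit
        return flat.reshape(pixels.shape)

    def extract_lsb(pixels: np.ndarray, n_bytes: int) -> bytes:
        """Recover n_bytes hidden by embed_lsb."""
        bits = pixels.flatten()[:n_bytes * 8] & 1
        return np.packbits(bits).tobytes()

    # Toy demo on a fake 64x64 grayscale "image"
    image = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    stego = embed_lsb(image, b"secret payload")
    assert extract_lsb(stego, len(b"secret payload")) == b"secret payload"
    print(np.abs(stego.astype(int) - image.astype(int)).max())   # 0 or 1: invisible to the eye

The point is that the carrier image is almost untouched, and the hidden data is meaningless until someone deliberately extracts it.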

The Malicious Image Attack on OS Agents is a form of adversarial machine learning. The goal isn’t to hide a file but to trick the AI’s neural network into misinterpreting what it “sees.”

  • The attacker uses adversarial machine learning techniques to create minute, carefully calculated pixel changes.
  • The attack exploits weaknesses in the AI’s vision model, not flaws in the image file itself.
  • The result is a misinterpretation that leads the AI to produce an output the user never asked for.

The key difference is that with traditional steganography, the image is a container for a hidden payload. With the new AI attack, the image is the payload. It doesn’t hide anything; it directly manipulates the AI’s perception to make it generate and execute a malicious command.
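
For intuition, here is a minimal sketch of the textbook fast-gradient-sign idea behind such perturbations, applied to a toy PyTorch model. This is not the researchers’ method or their target model; every name and number below is an illustrative stand-in.

    import torch
    import torch.nn as nn

    # Toy stand-in for an agent's vision model; the real target would be a large VLM.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    model.eval()

    image = torch.rand(1, 3, 32, 32)     # benign image, pixel values in [0, 1]
    target = torch.tensor([7])           # the class the attacker wants the model to "see"

    # One gradient step: nudge every pixel a tiny amount in the direction that
    # makes the model more confident in the attacker-chosen class.
    image.requires_grad_(True)
    loss = nn.functional.cross_entropy(model(image), target)
    loss.backward()
    adversarial = (image - (2 / 255) * image.grad.sign()).clamp(0, 1).detach()

    # The change is far below what a human can notice, yet it shifts the model's output.
    print((adversarial - image.detach()).abs().max())   # about 0.008 per pixel

Nothing is hidden in the file: scanning the pixels reveals no payload, because the “payload” is the pattern itself, and it only has meaning to the model that perceives it.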

How The Attack Works

The malicious image itself contains no code. Instead, it’s a visual trigger. Attackers start with a normal-looking image and use an optimization process to make pixel changes invisible to the human eye. These changes form a pattern that, to the AI’s vision system, looks like a command. This is similar to creating an optical illusion for an AI.

Here’s how such an image is crafted:

  1. Define the Goal: The hacker first decides on the malicious action, such as “crash the system” or “go to a dangerous website.”
  2. Work Backwards: Using optimization algorithms, they calculate what specific pattern of pixel changes would cause the AI to output the commands needed to achieve that goal (see the sketch after this list).
  3. Modify the Image: The calculated pixel changes are applied to the original image, creating a weaponized version.
  4. Deploy: The image is then shared online, waiting for an AI assistant to “see” it.
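
Sketching step 2 in code: a toy, PGD-style optimization loop that “works backwards” from the attacker’s goal, run against a tiny stand-in model. A real attack targets a full vision-language model and its generated text, so the vocabulary, model, and parameters below are illustrative assumptions only.

    import torch
    import torch.nn as nn

    # Stand-in for the agent's model: it scores a small vocabulary of "actions".
    ACTIONS = ["summarize_feed", "open_browser_to_attacker_site", "crash_loop"]
    vlm = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, len(ACTIONS)))
    vlm.eval()

    meme = torch.rand(1, 3, 64, 64)               # the innocent-looking starting image
    target = torch.tensor([1])                    # attacker's goal: the browser action
    epsilon, step, iters = 8 / 255, 1 / 255, 200  # keep the total change imperceptible

    # "Work backwards": iteratively adjust the pixels until the model's preferred
    # action becomes the attacker's goal.
    patch = torch.zeros_like(meme, requires_grad=True)
    for _ in range(iters):
        loss = nn.functional.cross_entropy(vlm((meme + patch).clamp(0, 1)), target)
        loss.backward()
        with torch.no_grad():
            patch -= step * patch.grad.sign()     # move toward the target action
            patch.clamp_(-epsilon, epsilon)       # stay below the visibility budget
        patch.grad.zero_()

    weaponized = (meme + patch).clamp(0, 1).detach()
    print(ACTIONS[vlm(weaponized).argmax().item()])   # the toy model now picks the attacker's action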

A Real-World Attack Scenario

Imagine this: a hacker embeds a weaponized image into a popular meme. They post this meme on a social media platform like X (formerly Twitter) or Facebook.

You’re a user with an OS agent enabled. Your AI assistant is a valuable tool you use for productivity—it helps you manage your emails, summarize articles, and even find information on your social media feeds. One morning, you tell your AI, “Summarize my feed from the last two hours.”

Your OS agent begins to scroll through your feed, taking screenshots as it goes. As it processes the meme with the malicious image, its vision system is tricked. Instead of generating a summary, it outputs the malicious commands, such as opening a browser and navigating to a phishing site. Because the AI is designed to follow its own outputted commands, it executes the action instantly.

From your perspective, it looks like a strange glitch. The browser opens for a second and then closes, or maybe a program you didn’t ask for briefly appears. But the damage is already done – the malicious site has initiated a download or stolen data in the blink of an eye. The AI’s autonomous action makes the attack nearly invisible, as you never saw a suspicious link or file.
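
What makes this scenario possible is one architectural detail: the agent executes whatever action its model emits, with no human confirmation in between. Below is a minimal sketch of that loop, using hypothetical stand-in functions rather than any real product’s API.

    # capture_screen, vision_language_model, and run_action are hypothetical stubs.
    def capture_screen() -> bytes:
        return b"...screenshot pixels, possibly containing a poisoned meme..."

    def vision_language_model(screenshot: bytes, instruction: str) -> str:
        # A poisoned screenshot can steer this output to an attacker-chosen command.
        return 'computer.os.open_program("msedge")'

    def run_action(action: str) -> None:
        print(f"executing: {action}")   # in a real agent, this drives the mouse and keyboard

    def agent_step(instruction: str) -> None:
        screenshot = capture_screen()
        action = vision_language_model(screenshot, instruction)
        run_action(action)              # executed immediately: no confirmation step

    agent_step("Summarize my feed from the last two hours.")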


Malicious Code Examples

Researchers demonstrated this with two attacks, showing the exact commands the AI assistants were tricked into executing:

  1. Memory Overflow Crash: An image causes the AI to open a command prompt and run an endless loop that writes junk data until the system becomes unusable.
     computer.os.open_program("cmd")
     computer.keyboard.write(":loop & echo junk >> junk.txt & goto loop")
     computer.keyboard.press("enter")
  2. Forced Website Navigation: A different image tricks the AI into opening a browser and navigating to a malicious website.
     computer.os.open_program("msedge")
     computer.mouse.move_abs(x=0.1, y=0.05)
     computer.mouse.single_click()
     computer.keyboard.write("https://malicious-website.com")
     computer.keyboard.press("enter")

Why This is Dangerous

This threat is particularly alarming because the malicious changes are invisible to humans, making the images easy to spread on social media and in emails. Unlike traditional malware, the image files contain no executable code and will not be flagged by antivirus software; the attack only happens when an AI agent processes the image. In the researchers’ tests, these attacks succeeded at a high rate.


What Can Be Done

This discovery highlights a new frontier in cybersecurity. As AI agents become more powerful, securing the data they interact with becomes as important as securing the agents themselves. Users should be cautious and monitor their AI assistants’ activity. For developers, this means building detection systems for adversarial images and adding verification steps before an agent executes a command it has generated.
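
As one concrete form of such verification, a developer could place a policy gate between the model’s output and the automation layer, holding risky actions for user approval. The rules and function names below are illustrative assumptions, not an existing product’s API.

    import re

    ALLOWED_DOMAINS = {"mail.example.com", "calendar.example.com"}   # assumed allowlist

    def requires_confirmation(action: str) -> bool:
        """Flag actions that launch programs or visit unapproved sites."""
        if "open_program" in action:
            return True
        for domain in re.findall(r"https?://([\w.-]+)", action):
            if domain not in ALLOWED_DOMAINS:
                return True
        return False

    def gated_execute(action: str, user_approves) -> None:
        if requires_confirmation(action) and not user_approves(action):
            print(f"blocked: {action}")
            return
        print(f"executing: {action}")   # hand off to the real automation layer here

    # The forced-navigation payload from the demo above would be held for review:
    gated_execute('computer.keyboard.write("https://malicious-website.com")',
                  user_approves=lambda action: False)

Even a simple gate like this breaks the instant-execution property that made the earlier scenario nearly invisible to the user.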

Long-term solutions will require a coordinated effort to create and enforce security standards for these new technologies. This includes establishing mandatory security protocols, promoting responsible disclosure of vulnerabilities through bug bounty programs, and investing in research to keep pace with AI advancements.


[1] Based on research by Aichberger et al., “Attacking Multimodal OS Agents with Malicious Image Patches” (2025)