Despite the hype, a new study reveals today’s AI agents fail at more than 97% of real remote work tasks
In a world where ChatGPT writes poetry and DALL-E creates art, you’d think AI would be eating our digital lunch by now. You’d be wrong.
The $140,000 Reality Check
A team of researchers from the Center for AI Safety and Scale AI just dropped a bombshell on the AI automation hype train. They created the Remote Labor Index (RLI) — a comprehensive test that pits today’s most advanced AI agents against real freelance work worth over $140,000. The results? Let’s just say the robots aren’t quite ready for their close-up.
The highest-performing AI agent, Manus, completed only 2.5% of projects at a professional level. That’s not a typo: a 97.5% failure rate. GPT-5? Just 1.7%. Claude Sonnet 4.5? 2.1%. These are the same models that can write sonnets and debug code in seconds.
What Makes Remote Work So Hard?
Here’s the thing: The AI benchmarks we usually hear about are like asking a chef to chop one onion perfectly. The RLI is like asking them to run an entire restaurant kitchen during dinner rush.
The study tested AI agents on 240 real freelance projects sourced from Upwork, including:
- 3D modeling and CAD designs
- Video production and animation
- Audio mixing and music production
- Architectural renderings
- Web development
- Data analysis and visualization
These weren’t simplified test cases. Each project came with the original client brief, all necessary input files, and a gold-standard deliverable produced by a human professional who actually got paid for the work. The average project took professionals 28.9 hours to complete and cost $632.60.
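For a rough sense of scale, the figures quoted so far can be turned into a back-of-envelope calculation. The sketch below uses only the rounded numbers in this article (240 projects, the $632.60 and 28.9-hour averages, and the reported completion rates). The variable names are made up for illustration, and dividing one average by the other only approximates a typical hourly rate.

```python
# Back-of-envelope arithmetic using only the rounded figures quoted in this article.
NUM_PROJECTS = 240
AVG_COST_USD = 632.60
AVG_HOURS = 28.9

# Share of projects completed at a professional level, as reported in the article.
completion_rates = {"Manus": 0.025, "Claude Sonnet 4.5": 0.021, "GPT-5": 0.017}

total_value = NUM_PROJECTS * AVG_COST_USD
total_hours = NUM_PROJECTS * AVG_HOURS
# Ratio of the two averages, not the average of per-project hourly rates.
implied_hourly_rate = AVG_COST_USD / AVG_HOURS

print(f"Approximate total value: ${total_value:,.0f}")
print(f"Approximate total hours: {total_hours:,.0f}")
print(f"Implied rate: ${implied_hourly_rate:.2f}/hour")

# Convert each percentage into an approximate project count.
approx_completed = {
    name: round(rate * NUM_PROJECTS) for name, rate in completion_rates.items()
}
for name, count in approx_completed.items():
    print(f"{name}: roughly {count} of {NUM_PROJECTS} projects")
```

On these rounded inputs, the best agent’s 2.5% works out to about six projects out of 240.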
Where AI Actually Succeeds
The study wasn’t a complete washout. AI agents showed flashes of competence in specific areas:
- Audio Production: Creating sound effects for retro games, mixing voice-overs with music, separating vocals from accompaniment
- Visual Creation: Generating logos and ad creatives, simple image editing tasks
- Code Generation: Building interactive data visualizations and simple web tools
- Writing: Producing reports and documentation
But here’s the catch: these successes represent a tiny slice of the remote work pie. Most real-world projects require juggling multiple skills, maintaining consistency across different file types, and, crucially, knowing when your work is actually good enough.
The Four Ways AI Fails
The researchers identified four main failure patterns that kept showing up:
- Technical Meltdowns: Corrupt files, wrong formats, empty deliverables
- Incomplete Work: Missing components, truncated videos, absent source files
- Quality Issues: Work that technically meets requirements but looks like a child did it
- Consistency Nightmares: Different files in the same project don’t match up
Imagine asking an AI to design a house, and it produces beautiful blueprints, but the 3D render shows a completely different building. That’s happening a lot.
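The first two failure patterns are the kind a simple script could catch before a client ever opens the files. Here is a minimal sketch, assuming a hypothetical `check_deliverables` helper and invented file names. It is not drawn from the study, it does not detect corrupt files, and it cannot judge quality or cross-file consistency, which still take a human eye.

```python
import tempfile
from pathlib import Path


def check_deliverables(folder: Path, expected: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed.

    `expected` maps a file stem (e.g. "render") to its required extension (e.g. ".png").
    This helper is hypothetical and only illustrates the idea.
    """
    problems = []
    for stem, ext in expected.items():
        matches = [p for p in folder.glob(f"{stem}.*") if p.is_file()]
        if not matches:
            problems.append(f"missing: {stem}{ext}")
            continue
        right_format = [p for p in matches if p.suffix.lower() == ext.lower()]
        if not right_format:
            problems.append(f"wrong format: {matches[0].name} (expected {ext})")
        elif right_format[0].stat().st_size == 0:
            problems.append(f"empty file: {right_format[0].name}")
    return problems


# Demo with made-up deliverables in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "blueprint.pdf").write_bytes(b"%PDF-1.4 placeholder")
    (folder / "render.jpg").write_bytes(b"placeholder bytes")  # brief asked for .png
    (folder / "walkthrough.mp4").touch()  # zero-byte video
    problems = check_deliverables(
        folder,
        {"blueprint": ".pdf", "render": ".png", "walkthrough": ".mp4", "source": ".blend"},
    )

for line in problems:
    print(line)
```

A check like this flags the wrong-format render, the empty video, and the absent source file, but it would happily pass a blueprint and a render that show two different buildings.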
Why This Matters for Your Job
If you’re a remote worker, this study should be both reassuring and terrifying. Reassuring because your job isn’t disappearing tomorrow. Terrifying because the gap, while wide, is closing.
The researchers used an Elo rating system (like in chess) to track relative performance between AI models. While all models scored far below the human baseline of 1,000, the newer models are steadily climbing. Manus leads the pack at 509.9, with older models trailing behind.
“We observe that models are steadily approaching higher automation rates across projects,” the researchers note. “This demonstrates that RLI is sensitive enough to detect ongoing progress in AI capabilities.”
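To get a feel for what a gap of that size means, here is a small sketch that plugs the two ratings into the standard chess Elo formula. The base-10 logistic curve and the 400-point scale are assumptions borrowed from chess; the study may scale its ratings differently, so treat the result as an illustration rather than a figure from the paper.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard chess Elo formula.

    The 400-point scale is the chess convention and is an assumption here.
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


HUMAN_BASELINE = 1000.0  # human baseline rating quoted in the article
MANUS_RATING = 509.9     # top agent's rating quoted in the article

p = elo_expected_score(MANUS_RATING, HUMAN_BASELINE)
print(f"Expected score for the agent vs. the human baseline: {p:.3f}")
```

Under chess-style scaling, a gap of roughly 490 points corresponds to an expected score of between 5% and 6% in a head-to-head comparison with the human baseline.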
The Coming Transition
What makes AI different from previous automation technologies is its potential for generalization. A calculator could automate math but couldn’t learn to write. An AI that truly masters remote work would likely have the cognitive flexibility to adapt to new types of jobs as they emerge.
But we’re not there yet. Not close.
The researchers deliberately excluded certain categories of work, like customer service, project management, and anything requiring live human interaction, because they’re even harder to automate. The 2.5% success rate is for the easy stuff.
What Comes Next
The Remote Labor Index isn’t just a one-off study; it’s designed to be an ongoing measurement tool. As new AI models are released, they’ll be tested against the same benchmark, giving us a clear view of how quickly the automation frontier is advancing.
For now, though, the message is clear: the next time someone tells you AI will replace all remote workers within five years, you can tell them about the study where the world’s most advanced AI agents couldn’t even complete a basic freelance video project without producing something that looked like it was made by a first-year film student with a hangover.
Bottom line: The robots aren’t coming for your remote job tomorrow. But they’re practicing, and someone’s keeping score.
Source: “Measuring AI Automation of Remote Work” (arXiv:2510.26787v1) by Mazeika et al.