The demos look slick, the promises even slicker. In slides and keynotes, agentic assistants plan, click, and ship your work while you sip coffee. Promoters like McKinsey call it the agentic AI advantage.
Then you put these systems on real client work and the wheels come off. The newest empirical benchmark from researchers at the Center for AI Safety and Scale AI finds current AI agents completing only a tiny fraction of jobs at a professional standard.
Benchmarks, not buzzwords, describe reality
Headlines say agents are here. The data say otherwise. The new Remote Labor Index (RLI), a multidomain benchmark built from 240 real freelance-type projects in 23 categories, reports an automation rate topping out at 2.5% across leading agents, meaning almost all deliverables would be rejected by a reasonable client. The dataset spans design, operations, business intelligence, audio-video, game development, CAD, architecture, and more, reflecting the work that actually shows up in remote markets, not cherry-picked lab tasks.
The point is not that AI fails everywhere. The RLI describes scattered wins in text-heavy data visualization, audio editing, and simple image generation. But the failures are systematic. Reviewers cite empty or corrupt files, missing assets, low-grade visuals, and inconsistencies in deliverables—the kinds of misses that doom client work regardless of clever reasoning traces.
These aren’t close calls. Inter-annotator agreement on the accept-or-reject decision sits at 94.4%, so we’re not talking about differences in taste.
If you need a concrete sense of difficulty, the benchmark’s human reference projects averaged 28.9 hours to complete, with a median of 11.5 hours and an average price of $632.60. Those are realistic project sizes. They include work like a world-happiness report dashboard, a 2D promo for a tree services firm, 3D animations for new earbuds, an IEEE-formatted paper, an architectural concept for a container home, and a “Watermelon Game”-style casual web game. This is the right yardstick for agent claims.
Other grounded evaluations, such as the WebArena benchmark for realistic web tasks, tell a similar story. And in software, SWE-bench shows that turning model skill into working patches across real repositories remains hard without tight scaffolding.
Tasks automate; projects still require adults in the room
When I work with companies on AI adoption, I push a simple framing. Use AI to do well-scoped tasks inside a project, not to run the project. That rule aligns with the published evidence from benchmarks. The RLI team notes pockets of success in content drafting, audio cleanup, image assets, and basic data visualization, which pair nicely with human review in marketing, product, and analytics teams. In my client work, this shows up as faster ad variants, cleaner query logic, quicker explainer scripts, and first-pass chart code that a developer can polish.
Contrast those gains with multihour, multifile builds that require iterative verification. In METR’s HCAST findings, agents succeed on 70–80% of tasks humans finish in under an hour, but on fewer than 20% of tasks that take humans more than four hours. That’s the difference between automating a component and carrying a project across the finish line.
This gap explains why the RLI authors also track a relative “Elo” progress signal, which rises over time even as absolute project completion stays low. Improvement is real. Hype overstates what that improvement means for near-term automation of whole projects.
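To see why those two signals can diverge, here is a minimal sketch, entirely my own illustration rather than the RLI authors’ scoring code, of a generic Elo-style update: each pairwise win nudges a relative rating upward, even while a separate accept-or-reject tally stays stuck near zero.

```python
# Illustrative only: a generic Elo-style update, not the RLI scoring code.
# Assumed setup: two agents' deliverables are compared head-to-head by a
# reviewer, while a separate accept/reject check tracks absolute quality.

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# A newer agent can keep "winning" comparisons against an older one,
# so its relative rating climbs...
old_agent, new_agent = 1000.0, 1000.0
for _ in range(20):
    new_agent, old_agent = elo_update(new_agent, old_agent, a_wins=True)

# ...even if almost none of its work would actually be accepted by a client.
accepted_projects, total_projects = 6, 240  # hypothetical counts in the RLI's ballpark
print(f"relative rating: {new_agent:.0f} vs {old_agent:.0f}")
print(f"absolute automation rate: {accepted_projects / total_projects:.1%}")
```

The point of the sketch is simply that a relative ranking and an absolute acceptance bar measure different things, which is exactly the gap between “agents are improving” and “agents can be handed whole projects.”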
Plan for augmentation now, not mass replacement
Hype has a business model. The agentic AI advantage storyline promises proactive, goal-driven assistants that automate complex processes throughout the firm. Markets respond to bold claims, then teams inherit the risk. Gartner even warns that more than two out of five so-called agentic initiatives will be scrapped by 2027 due to unclear value and rising costs, and points to a wave of “agent washing” in which conventional tooling gets relabeled as autonomy.
The balanced plan is to redesign work so humans direct, verify, and integrate agent outputs, then let evidence guide any increase in scope. OpenAI’s GDPval report shows that with human oversight, frontier models are approaching expert quality on carefully defined, economically valuable tasks. That supports staffing models where you automate slices of jobs, not the jobs themselves. It also matches early labor data. A recent Stanford employment analysis reports wage gains in AI-exposed roles without broad, immediate job loss, consistent with a world where AI changes the task mix before it wipes out occupations.
The near-term playbook is straightforward. Use AI to reduce cycle time on repeatable tasks. Assign owners to verify outputs. Track acceptance rates and defect types, the same way the RLI evaluators categorized corrupt files, missing components, inconsistent renders, and low-quality assets. Expect head count to shift as pieces of marketing, writing, programming, and analysis take fewer people while roles that specify goals, judge quality, and integrate outputs become more central.
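As a concrete illustration, here is a minimal sketch of what that tracking could look like; the schema, field names, and defect categories are hypothetical, loosely modeled on the failure types listed above rather than on any published RLI tooling.

```python
# Illustrative sketch of a review log for agent-produced deliverables.
# Category names are assumptions modeled on the failure types mentioned
# above (corrupt files, missing components, inconsistent or low-quality output).
from collections import Counter
from dataclasses import dataclass, field

DEFECT_TYPES = {"corrupt_file", "missing_component", "inconsistent_output", "low_quality_asset"}

@dataclass
class Review:
    deliverable_id: str
    accepted: bool
    defects: list[str] = field(default_factory=list)

def summarize(reviews: list[Review]) -> dict:
    """Compute the acceptance rate and a count of each defect type."""
    total = len(reviews)
    accepted = sum(r.accepted for r in reviews)
    defect_counts = Counter(d for r in reviews for d in r.defects if d in DEFECT_TYPES)
    return {
        "acceptance_rate": accepted / total if total else 0.0,
        "defect_counts": dict(defect_counts),
    }

# Example usage with made-up reviews.
log = [
    Review("ad-variant-003", accepted=True),
    Review("promo-video-cut", accepted=False, defects=["missing_component", "low_quality_asset"]),
    Review("dashboard-v1", accepted=False, defects=["inconsistent_output"]),
]
print(summarize(log))
```

Even a log this simple makes acceptance rates and recurring defect types visible enough to decide where agent scope can safely grow.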
On current trend lines, more capable AI agents will arrive during the next few years, helped by scaffolded workflows and better tool use. Yet the evidence says that whole-project autonomy for general remote-capable work isn’t a short-term outcome, regardless of hype from McKinsey and others.
Conclusion
Agentic AI is exciting, but real benchmarks beat glossy promises. The Remote Labor Index shows tiny automation rates on the kinds of projects companies actually pay for, backed by strong evaluation methods and consistent with other grounded benchmarks on web and software tasks.
Progress will continue. The smart move is to treat agents as force multipliers inside projects while humans stay accountable for outcomes. Leaders who adopt with discipline will bank the gains today and be ready for tomorrow without buying into a bubble.
