What happens when you take humans out of a negotiation entirely and let AI agents haggle with each other over real money? Anthropic decided to find out, and the results are more complicated than the press release version would have you believe.
In 2026, Anthropic ran a pilot called Project Deal: a classifieds-style marketplace where AI agents acted as both buyers and sellers, negotiating and executing actual transactions. No human in the loop closing the deal. Just models talking to models, moving real dollars. The pilot processed $4,000 in transactions before Anthropic pulled back the curtain on what they learned.
Four thousand dollars is not a lot of money. That’s the first thing worth saying out loud. In the context of what Anthropic is building toward, it’s a rounding error. But that’s also exactly why this experiment is interesting — because even at that small scale, performance gaps showed up.
What “Performance Gaps” Actually Means
Anthropic’s own reporting flagged that the experiment revealed performance gaps in AI negotiation. That’s a carefully chosen phrase. It doesn’t say the agents failed. It doesn’t say they succeeded cleanly either. It says there were gaps — places where the models didn’t perform the way you’d want them to when real value is on the line.
As someone who reviews AI tools for a living, I’ve learned to pay close attention to that kind of language. When a company runs an internal experiment and then describes the outcome using words like “gaps,” they’re telling you something important without fully committing to it. The agents could negotiate. They could execute. But somewhere between intent and outcome, things got messy.
That’s not a knock on Anthropic specifically. That’s just what happens when you stress-test a system against reality instead of benchmarks. Benchmarks are clean. Real commerce is not.
The Actual Idea Here Is Bigger Than $4,000
Agent-to-agent commerce is not a niche concept. It’s the logical endpoint of where agentic AI is heading. Right now, most AI agents are built to serve a single user — fetch this, summarize that, book this flight. The next layer is agents that operate on behalf of organizations, negotiating with other agents operating on behalf of other organizations, without a human approving every step.
Think about what that means at scale. Procurement agents negotiating with supplier agents. Ad-buying agents bidding against inventory agents. Legal agents drafting terms with counterpart legal agents. The speed and volume of transactions that become possible are genuinely hard to picture from where we sit today.
Anthropic’s marketplace was a small, controlled test of whether that future is technically feasible. The answer appears to be: sort of, with caveats.
The Part Nobody Wants to Talk About
Here’s what I keep coming back to. When two AI agents negotiate, who are they actually optimizing for? Each agent is presumably acting in the interest of whoever deployed it. But models don’t have loyalty in any meaningful sense — they have objectives. And objectives can drift, especially when the other party in a negotiation is also a model that’s probing for weaknesses.
The performance gaps Anthropic identified could mean a lot of things. Agents that over-conceded. Agents that got stuck in loops. Agents that reached agreements that technically satisfied their instructions but missed the spirit of what a human would have wanted. We don’t have the granular breakdown, and Anthropic hasn’t published it.
That opacity is a problem for anyone trying to evaluate this seriously. A $4,000 pilot with undisclosed failure modes is interesting as a proof of concept. It’s not enough to build confidence in the underlying system.
What I Actually Think
Anthropic running this experiment is the right call. You don’t learn where agent-to-agent commerce breaks down by theorizing about it — you build the thing and watch it break. The fact that they ran it in a controlled environment, kept the dollar amounts low, and were honest enough to surface the gaps publicly puts them ahead of labs that would have quietly buried the results.
But I’d push back on any reading of this that treats $4,000 in completed transactions as validation. It’s a starting point. The real test comes when the transaction values go up, the agents get more autonomous, and human oversight gets thinner. That’s when the gaps stop being interesting data points and start being expensive problems.
Agent commerce is coming. Anthropic just showed us one early, honest look at how rough the road still is.