Agent harnesses don't transfer

2026-04-04

The most quoted number in agent design right now is the gap between Claude Opus running inside Claude Code at 77% and the same Opus running inside Cursor at 93% on the same SWE-bench problems. Sixteen percentage points from the harness alone, with the model held constant. The takeaway people keep drawing is that the harness matters as much as the model, sometimes more.

That’s true and at this point a bit boring. The interesting part of that result is the one almost nobody is saying out loud. The harness can move performance by sixteen points on SWE-bench because SWE-bench is a coding benchmark, and coding is the one domain where agent harnesses have somewhere genuinely useful to put pressure. In almost every other domain you don’t see a sixteen-point harness gap: no benchmark is sensitive enough to register one, because the underlying feedback signals don’t exist.

My argument is that the harness pattern everyone converged on, the Claude Code shape, is a coding-domain artifact. For coding it works, and works well. Outside coding the people copying it are quietly making worse products than the chatbot they were going to replace. I run a non-coding agent product at Spradley AI, so I’m partly diagnosing the field and partly writing down the things I had to learn by getting them wrong.

What I mean by harness

By harness I mean the runtime around the model. The prompt loop, the tool definitions, the file system or working directory the model writes into, the planner that decides what to do next, the sub-agents that get spawned, the memory and context management between turns, the permission and sandboxing layer. Roughly everything except the weights.

A model on its own is just a chat completion. The harness is what makes the completion run somewhere that has state, take an action, see what changed, and pick what to do next.

If you’ve used Claude Code, Cursor, Cline, Aider, Codex, or Devin, you’ve used a harness. They look superficially different and they’re all doing roughly the same thing: a model with a turn-taking loop, a sandboxed working directory it can read and write to, a small set of tools centered on file manipulation and shell execution, a planning step that produces a list of subtasks, optional sub-agents for the harder ones, a permission system layered over the dangerous tools, and a transcript that survives across runs.
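To make that shape concrete, here is a minimal sketch of the loop in Python. Everything in it is an assumption for illustration: the Action type, the run_harness function, the two file tools, and the scripted stand-in for the model are hypothetical names, not how any of the products above are actually built. It shows only the skeleton: a turn-taking loop, a working directory, a small tool set, a permission check over the dangerous tool, and a transcript.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Action:
    tool: str                                  # name of the tool the model wants to call
    args: dict = field(default_factory=dict)   # keyword arguments for that tool


def run_harness(model, workdir, approve=lambda action: False, max_turns=10):
    """Drive a model through a tool loop inside workdir and return the transcript.

    model is a hypothetical callable: it receives the transcript so far and
    returns the next Action. This is not a real sandbox; paths are not
    checked for escapes out of workdir.
    """
    def read_file(path):
        return (workdir / path).read_text()

    def write_file(path, text):
        (workdir / path).write_text(text)
        return f"wrote {len(text)} chars to {path}"

    tools = {"read_file": read_file, "write_file": write_file}
    needs_approval = {"write_file"}            # the permission layer covers only this tool
    transcript = []

    for _ in range(max_turns):
        action = model(transcript)
        if action.tool == "finish":
            break
        if action.tool not in tools:
            observation = f"unknown tool: {action.tool}"
        elif action.tool in needs_approval and not approve(action):
            observation = "denied by permission layer"
        else:
            try:
                observation = tools[action.tool](**action.args)
            except Exception as exc:           # feed failures back to the model as text
                observation = f"error: {exc}"
        transcript.append((action, observation))
    return transcript


def scripted_model(transcript):
    """Stand-in for a real model: replays a fixed script, one action per turn."""
    script = [
        Action("write_file", {"path": "notes.txt", "text": "hello"}),
        Action("read_file", {"path": "notes.txt"}),
        Action("finish"),
    ]
    return script[len(transcript)]


def demo(approve):
    """Run the scripted model in a throwaway working directory."""
    with tempfile.TemporaryDirectory() as tmp:
        return run_harness(scripted_model, Path(tmp), approve=approve)
```

Calling demo(lambda action: True) lets the write through and the read returns the file contents. Calling demo(lambda action: False) blocks the write, and the read then fails with an error the model gets to see. Real harnesses add planning, sub-agents and context management on top, but the loop underneath looks much like this.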

That’s roughly the shape the field has settled on. The wiring of each piece still varies enough to produce a sixteen-point gap on SWE-bench, but the outline of “what an agent looks like” is now shared between most of the serious products.

Why coding is unusually good for this shape

Coding has a set of properties that almost no other domain has, and the harness pattern was built around them.

The feedback signal is essentially free. You wrote a function, you run the test, and a couple of seconds later the language itself tells you whether the function works. The harness gets to drive its own loop without a human in the middle, because the loop closes on its own. In almost any other domain the feedback either requires a human to produce or doesn’t exist on a useful timescale, and the harness has no real way of knowing whether it’s making progress.
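A small sketch of what “the loop closes on its own” looks like in practice. The check function and the toy add example are hypothetical, written for this post; the only real machinery is Python’s own subprocess module. The candidate code and its test run in a child process, and the exit code is the feedback signal, with no human involved.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def check(candidate_source, test_source, timeout=10.0):
    """Run candidate code plus its test in a child interpreter.

    Returns (passed, stderr). passed is True when the process exits with
    code 0, which here means every assert in test_source held.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_source + "\n\n" + test_source + "\n")
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    return result.returncode == 0, result.stderr


# Toy example: one correct and one broken implementation of the same function.
GOOD = "def add(a, b):\n    return a + b"
BAD = "def add(a, b):\n    return a - b"
TEST = "assert add(2, 3) == 5"
```

check(GOOD, TEST) comes back passing and check(BAD, TEST) comes back failing with an AssertionError traceback in stderr, which a harness can hand straight back to the model. I can’t write an equivalent function for “was this support reply the right call”, and that is the whole point.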

The output is also structured in a way most domains aren’t. A change to a codebase is a diff, and you can apply, revert, compare, and tool-build around diffs cheaply. In something like legal contract review there is no diff. The output is a recommendation about a clause that might be “weaken this” or “strengthen this” or “ask the counterparty for a concession on X,” where the right answer depends on who’s reading the contract and what they’re trying to do with it, and where there is nothing to mechanically apply or revert.

The action space is small and bounded. Coding agents need maybe ten tools, and the set genuinely closes: read file, write file, run command, search, list, plan, ask for clarification, sub-agent, finish, abort. A customer support agent’s action space is open in a much messier way. It includes things like deciding whether to respond at all, deciding whether to escalate, picking a tone, choosing how much detail to give, knowing when to say “we won’t do that,” and a long tail of judgement calls that don’t reduce neatly to tool calls.

Artefacts persist between runs in a way that turns out to be load-bearing. Code lives on disk. The agent can come back tomorrow, look at the project, see what state it’s in, and pick up where it left off. In an interview agent the equivalent state is much fuzzier. There’s an evolving understanding of the company, themes that aren’t quite settled, a half-formed sense of who’s been talked to and what’s still missing, and none of it lives in a directory you can ls and orient against.

The harness pattern was designed around these properties. That wasn’t a deliberate attempt to exclude anyone else; coding was simply the use case driving most of the product investment over the last two years. The shape was carved to fit a specific domain, and then quietly picked up by everyone else as the generic answer to “what does an agent look like.”

What goes wrong outside coding

I’ll keep my own product out of this for now and use a hypothetical. Imagine you’re building a customer support agent for a SaaS company. You read the agent harness literature, you pick up the standard pattern, and you build something that has a working directory (ticket history), tools (look up customer, check entitlement, draft response), a planning step (what’s the customer’s actual problem), sub-agents (one to check the docs, one to check the customer account, one to draft the reply), and a permission layer (the agent can’t issue refunds without approval).

It sounds reasonable. It demos well. The version I keep seeing in the wild ends up performing worse, after a few months of real traffic, than a much simpler system that retrieves the right doc and puts it next to a chat box for the human to read. The reason is that most of the failure modes in customer support are not the failure modes the harness pattern was designed to handle. The agent is fine at writing the reply. Where it falls down is in deciding whether to write a reply at all, in recognising that the customer is angry and needs an escalation more than an answer, in noticing that the question being asked is really the symptom of a different question, in catching that its own tools are returning stale data. None of those have a “test” that runs in two seconds and tells you that you got it wrong, so the harness doesn’t know it’s failing, because the harness was built for a domain where failure was self-evident.

The same shape ends up getting copied into sales outreach, scheduling, legal review, medical chart summarisation, due diligence, employee research (yes, mine), HR ops, compliance, the whole list. It gets copied because it’s the only published example most teams have seen of “this is what a serious agent product looks like.” And it ends up not fitting, because the feedback loops aren’t there, the output isn’t shaped like a diff, the action space isn’t bounded the same way, and the artefacts don’t persist into a directory you can read off disk.

The thing that ships is more complicated than the chatbot it replaced, not actually faster on the hard cases, more brittle on the edges, and significantly worse to debug, because the planner has invented a plan that’s plausible but wrong and the sub-agents have already chased it down a wrong path together by the time anyone notices.

The hard problem the harness pattern hides

The real problem in non-coding agent products is that the feedback loop the harness assumes does not exist yet, and you have to build it yourself before any harness will be worth anything.

This is the thing I learned the hard way at Spradley. We’re doing employee research, which means the unit of work is a conversation, the output is a synthesis across thousands of those conversations, and the question of whether the synthesis is correct is much harder than the question of whether a test passes. We tried building a Claude-Code-shaped harness early. It looked impressive in demos. It produced summaries that read well, themes that sounded right, and quotes that were technically grounded in transcripts. None of that told us whether the synthesis was actually useful for the leadership team that had to act on it.

What we ended up needing first was an eval, but not in the sense the word usually gets used. Not the model’s eval, the product’s eval. A way to take a synthesis the agent produced, hand it to the people whose job it is to act on it, and find out whether the synthesis actually changed the decision they would have made otherwise. That kind of feedback signal is slow to gather, expensive in human time, and impossible to fully automate, and without it any tweak we made to the harness was theatre.

This is the part of agent product work that doesn’t get talked about because it isn’t glamorous and it doesn’t transfer between companies. Coding has the cheap automatic feedback signal pre-built and donated by the language itself, which is why coding agents have been able to make the kind of progress they have. Everyone else has to build the equivalent themselves, inside their own domain, with their own users, on their own timeline. The harness is comparatively easy. Most of the actual difficulty is in defining what good output even looks like for the thing you’re building, and most teams skip that step because the field hasn’t made it look important yet.

What this means if you’re building a non-coding agent product

I’m not telling anyone to avoid harnesses. The harness pattern is useful even outside coding, in the way the MVC pattern is useful even when you’re not writing Smalltalk. What I’m saying is that you should not invest in your harness until you have invested in the thing your harness is supposed to be optimising for.

In practice this means building the eval before the planner. Sit down and decide what “good output” actually means in your domain, write down the test, run it with humans for a while, and see whether you can get inter-rater agreement that’s meaningfully better than chance. If you can’t, you don’t have an agent problem; you have a definition-of-good problem, and no amount of harness work is going to rescue you from it.
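One common way to put a number on “better than chance” is Cohen’s kappa, which compares the agreement two raters actually reached with the agreement you’d expect if each had labelled at random using their own label frequencies. That choice of statistic is my assumption for this sketch, not the only option, and the function name and example labels are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    1.0 is perfect agreement, 0.0 is what chance alone would produce,
    and negative values mean worse than chance.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Fraction of items on which the raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


# Two hypothetical reviewers judging the same eight agent outputs.
reviewer_1 = ["good", "good", "bad", "bad", "good", "bad", "good", "good"]
reviewer_2 = ["good", "bad", "bad", "bad", "good", "good", "good", "good"]
kappa = cohens_kappa(reviewer_1, reviewer_2)
```

Here the reviewers agree on six of eight items (0.75), chance alone would give 34/64 (about 0.53), and kappa works out to 7/15, roughly 0.47. Where you draw the line for “meaningfully better” is a judgement call for your domain. A kappa near zero is the definition-of-good problem showing up in a number.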

It also means starting simpler than the literature would suggest. A chatbot with the right tools and the right retrieval will beat a poorly-grounded agent on most non-coding tasks for the foreseeable future, and you should promote a chatbot to an agent only when your eval is telling you that the chatbot pattern is leaving real value on the table and you can name which value and where.

Closely related: resist the instinct to copy the file-system-and-bash shape of the standard harness. Your domain probably doesn’t have files. It might have entities, conversations, documents, people, or accounts. The state you need to persist between turns isn’t a directory; it’s something closer to a graph of entities and their relationships, and if the harness pretends otherwise you’ll spend a year fighting an abstraction that was never designed for you.
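As a rough sketch of what that could look like, here is a tiny in-memory entity graph. The Entity and EntityGraph classes, the relation names, and the example data are all hypothetical; this is not our production schema, and a real system would want persistence and richer attributes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    id: str
    kind: str   # for example "person", "conversation", "theme"


@dataclass
class EntityGraph:
    entities: dict = field(default_factory=dict)   # id -> Entity
    edges: set = field(default_factory=set)        # (src_id, relation, dst_id)

    def add(self, entity):
        self.entities[entity.id] = entity
        return entity

    def relate(self, src_id, relation, dst_id):
        # Refuse edges that point at entities the graph has never seen.
        for entity_id in (src_id, dst_id):
            if entity_id not in self.entities:
                raise KeyError(f"unknown entity: {entity_id}")
        self.edges.add((src_id, relation, dst_id))

    def targets(self, src_id, relation):
        """Ids reachable from src_id over relation, sorted for stable output."""
        return sorted(d for s, r, d in self.edges if s == src_id and r == relation)

    def sources(self, relation, dst_id):
        """Ids that point at dst_id over relation, sorted for stable output."""
        return sorted(s for s, r, d in self.edges if r == relation and d == dst_id)

    def of_kind(self, kind):
        return sorted(e.id for e in self.entities.values() if e.kind == kind)


# Hypothetical state for an interview agent part-way through a project.
graph = EntityGraph()
for entity in [
    Entity("p1", "person"), Entity("p2", "person"),
    Entity("c1", "conversation"), Entity("c2", "conversation"),
    Entity("t1", "theme"),
]:
    graph.add(entity)
graph.relate("c1", "with", "p1")
graph.relate("c2", "with", "p1")
graph.relate("c1", "mentions", "t1")
graph.relate("c2", "mentions", "t1")

# "Who hasn't been talked to yet?" is a graph query, not a directory listing.
not_yet_interviewed = [
    p for p in graph.of_kind("person") if not graph.sources("with", p)
]
```

The agent orients itself by asking the graph questions like which conversations mention a theme, or which people have no conversation yet, where a coding agent would list a directory.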

The model question is also worth keeping in proportion. The gap between Sonnet, Opus, GPT-5, and Gemini for your specific use case is almost certainly smaller than the gap between your current harness and a better harness, which is almost certainly smaller than the gap between having no eval and having one. If you’re spending engineering time on which model to use before you’ve spent equivalent time on your own evaluation, you have the priorities backwards. Once you’ve done the eval work, the model selection question tends to mostly answer itself.

What I’m not saying

I’m not saying agents don’t work outside coding. They do. Some of the agent products I find most exciting are decidedly not coding products. What I’m saying is that the visible success of coding agents has produced an architectural template that the rest of the field is copying with too little thought, and the cost of that copying is being paid quietly inside companies that don’t realise the chatbot they were about to ship would have been better than the agent they ended up shipping.

I’m also not saying you should never build your own harness. Building your own is correct when the generic harness genuinely doesn’t fit your domain, which in most non-coding cases it doesn’t. What I’d push back on is teams who spend six months building a harness before they’ve spent a week building an eval, because the harness without the eval is just code that runs.

About my own bias

I run a company that built its own harness for a non-coding domain. I had to learn most of this the hard way and not all of it is settled in my head. The version of this argument I trust most is the one I won’t fully be able to write for another year, when I’ll have more data on what survived contact with reality and what didn’t. The version I can write today is the one above, and I’d rather post it now than wait for the perfect version, because the people copying the wrong pattern are doing it this quarter and probably this week.

If you’re reading this and building a non-coding agent product and the eval question is making you slightly defensive, my honest guess is that’s the part of the work you should probably be doing this week instead of whatever you had planned.

Sources I used: the harness gap numbers are from a 2026 comparison of Claude Opus across Claude Code, Cursor, Codex, Aider and others. The SWE-bench Verified standings I quote are from the Awesome Agents leaderboard. The framing of harness as the unit of distribution, rather than framework, is sharpest in this Save piece on the harness pattern.