Your Product Instincts Are Expiring

Agent = Model + Harness.

The Shift

If you’re a PM shipping AI features, here’s something that should bother you more than it does.

A general-purpose agent (Claude Code, Codex, take your pick) wired up with a couple of public APIs and an MCP server will quietly outperform the bespoke AI features that data-rich companies ship inside their own products.

The incumbent owns the data. The context. The distribution.

The generic tool owns none of it.

And it laps them anyway.

Not everywhere. Where the moat is structural, it still holds. A model with twenty years of your CRM history or your actual repo loaded in context is hard to beat. But a generic agent can pull the context it needs at inference time now, and in synchronous, read-heavy work that often matches what the incumbent baked in. That’s exactly where the moat has already gone soft. Where it stays hard (messy enterprise auth, low latency, deep workflow lock-in), the incumbent keeps the edge, for now. That edge is narrowing faster than they’d like.

It’s already spooking them. Eighteen months ago Benioff was calling Microsoft’s agent “Clippy 2.0” and selling Agentforce as the smartest agent in the room. Now the pitch is that Salesforce is the trusted platform where any agent, theirs or anyone’s, comes to act: the operating system for the Agentic Enterprise, the landing zone. You could read that as a winning move, owning the layer every agent has to land on, the way the platform layer beat on-prem. Maybe it is. But notice the shift either way. The product stopped being the intelligence and became the place the intelligence lands. The value is draining out of the app and into whatever wraps the model.

That wrapper has a name now. The harness.

Here’s my claim, and I’ll mark it plainly as opinion: this is not a blip, and it is not an engineering footnote. It’s a shift in where product value lives, and if you’re doing the PM job the way it’s always been done, you’re standing on the wrong side of it.

The path here is clean. Prompt, then context, then harness.

First we squeezed single prompts. Then we wrapped models in context (tools, retrieval, memory) to do more with a small window. The harness is the rung above that. Nobody invented it last year; the field just finally understood enough of the system to point at it and name it.

And the name arrived with numbers. Hold the model fixed, change only the harness around it: a Stanford and MIT team (the Meta-Harness paper, March 2026) measured up to a 6x swing on the same benchmark. That’s the best case, and yes, it’s one harness beating another (Claude Code is a harness too), not the model being beaten by magic. LangChain did the same trick in the open, taking a coding agent from outside the Top 30 to Top 5 on a hard benchmark by changing the scaffolding alone, model untouched. The exact multiple isn’t the point. The point is that a layer most PMs never look at moves the result that much.
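
If "same model, different harness" sounds abstract, here is a toy sketch of the idea. Everything in it is made up for illustration: the stand-in model, the two-entry knowledge base, the two harnesses. It is not the Meta-Harness method or LangChain's setup, and the scores say nothing about real benchmarks. It only shows that the wrapper can decide the result while the model stays untouched.

```python
from typing import Callable

# Hypothetical stand-in for a fixed model: it can only answer from what it is shown.
def model(prompt: str) -> str:
    for line in prompt.splitlines():
        if line.startswith("FACT:"):
            return line[len("FACT:"):].strip()
    return "unknown"

# Toy knowledge base (an assumption) that a harness may or may not surface.
DOCS = {"capital of France": "Paris", "capital of Japan": "Tokyo"}

def bare_harness(question: str) -> str:
    # Harness A: pass the question straight through to the model.
    return model(question)

def retrieval_harness(question: str) -> str:
    # Harness B: pull matching context first, then call the very same model.
    facts = [f"FACT: {answer}" for topic, answer in DOCS.items() if topic in question]
    return model("\n".join(facts + [question]))

def score(harness: Callable[[str], str]) -> float:
    # Fraction of questions answered exactly right.
    cases = [
        ("What is the capital of France?", "Paris"),
        ("What is the capital of Japan?", "Tokyo"),
    ]
    return sum(harness(q) == a for q, a in cases) / len(cases)

print(score(bare_harness), score(retrieval_harness))  # prints: 0.0 1.0
```

The `model` function never changes between the two runs. Only what it gets to see does.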

If you're not the model, you're the harness.

That’s where the leverage went. And here’s the part that should worry you, not just your eng team: the work moved down to a layer you have no instincts for, and the instincts you spent a career building were for the layer above it.

That's why this is coming for us.

The Ground

The PM job was built on solid ground.

For most of our careers, the stack moved in a line. You could hold a sharp opinion on a feature without knowing whether the team reached for SQL or NoSQL, REST or GraphQL, a cache or a read replica. Those were stable. Well understood. Someone else’s call. The intuition we built on top aged slowly: how things break, what’s cheap, what’s expensive, where to push and where to defer.

That slow aging was the whole deal. It’s why a PM could stand at the intersection of tech, business, and UX and stay credible without climbing back down into the weeds every quarter.

The ground isn't holding still anymore.

And the intuition we built on it is going stale faster than we're used to.

Watch how we cope, because it gives us away. Ask a PM what they do for AI today and you’ll hear two words: prompts and evals. We craft the prompt, then we grade the output. That’s the whole toolkit.

Except the toolkit is moving under us. Take the prompt. Tools like DSPy now generate and tune prompts programmatically. You stop hand-writing the sentence and start defining the objective it gets compiled against. The craft didn’t disappear; it moved upstream, into deciding what good looks like. Which is a product judgment, not a wording trick. (The same researcher behind DSPy, Omar Khattab, is on that Meta-Harness paper. The people optimizing prompts and the people optimizing harnesses are the same people. That’s not a coincidence; it’s the direction.)
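
Here is roughly what "defining the objective" looks like, as a minimal sketch. To be clear about the assumptions: this is not DSPy's API, just plain Python. The stub model, the two labeled examples, the candidate instructions, and the `compile_prompt` helper are all hypothetical. A real optimizer searches a far larger space; the shape is the same.

```python
from typing import List

# Hypothetical stub standing in for an LLM call. It only returns a bare
# label when the instruction asks for a one-word answer.
def stub_model(instruction: str, text: str) -> str:
    label = "positive" if "love" in text else "negative"
    if "one word" in instruction:
        return label
    return f"I think this review is {label}."

# The objective. Deciding that "good" means an exact label match is the
# part a person still has to own.
def metric(prediction: str, expected: str) -> bool:
    return prediction == expected

EXAMPLES = [("I love it", "positive"), ("It broke in a day", "negative")]

CANDIDATES = [
    "Classify the sentiment of the review.",
    "Classify the sentiment. Answer in one word: positive or negative.",
]

def accuracy(instruction: str) -> float:
    # Score one candidate instruction against the labeled examples.
    hits = sum(metric(stub_model(instruction, text), label) for text, label in EXAMPLES)
    return hits / len(EXAMPLES)

def compile_prompt(candidates: List[str]) -> str:
    # Keep whichever candidate scores best on the metric.
    return max(candidates, key=accuracy)

best = compile_prompt(CANDIDATES)
print(best, accuracy(best))
```

Nobody hand-tuned the winning sentence. Someone chose the metric and the examples, and the search did the rest.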

Now the other half of the toolkit, the eval. And here’s what one actually is.

A way to grade a system you've decided not to understand.

Evals are useful. I’m not dunking on them. They catch failures your intuition would miss, at scale. But they tell you that the system failed, not why. And the why is where the work is. They let you keep your distance, and distance is the problem now.

Because there’s a PM this shift is coming for first. Call him the Figma-and-wishlist PM. Mockups over the wall, requirements over the wall, grade the output with an eval. The problem isn’t the Figma. It’s that he stays one layer up while the decisions that actually shape the product, the ones inside the harness, get made without him, by default, by whoever’s closest to the code.

He's about to get a lot less valuable. And he doesn't see it yet.

The Work

The work moved. Shaping the environment around a model isn’t only an engineering problem anymore.

It's a taste problem.

What should the model see, and when? Where do you want determinism, and where do you want judgment? Should the recruiting assistant be allowed to rank candidates by their salary expectations? That last one isn’t an engineering question. It’s a product call about what the system should refuse to do. And right now it’s getting answered by whoever wired up the tool, if it’s getting answered at all.

You’re not going to out-engineer Claude Code; that was never the job. But the lab only builds the general harness. Your company still has to build the one that wraps the model inside your product: what it sees from your domain, what it’s allowed to do with your users’ data, where it has to stop and defer to a human. Those are product decisions. They’re the gap. And they’re being made by default.
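
To make "made by default" concrete, here is a small sketch of those three decisions written down as an explicit policy. Every name in it is hypothetical (the field names, the action names, the `HarnessPolicy` class), and the answers it encodes are one possible set of product calls, not a recommendation. In this version the recruiting assistant simply never sees salary expectations, and sending an offer always stops for a human.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HarnessPolicy:
    visible_fields: frozenset   # what the model is allowed to see
    allowed_actions: frozenset  # what the model is allowed to do
    needs_human: frozenset      # where it must stop and defer to a person

# Hypothetical policy for the recruiting assistant example.
POLICY = HarnessPolicy(
    visible_fields=frozenset({"name", "skills", "years_experience"}),
    allowed_actions=frozenset({"rank_candidates", "draft_email", "send_offer"}),
    needs_human=frozenset({"send_offer"}),
)

def redact(candidate: dict, policy: HarnessPolicy) -> dict:
    # Drop every field the policy does not explicitly expose to the model.
    return {k: v for k, v in candidate.items() if k in policy.visible_fields}

def decide(action: str, policy: HarnessPolicy) -> str:
    # Refuse unknown actions, escalate sensitive ones, allow the rest.
    if action not in policy.allowed_actions:
        return "refuse"
    if action in policy.needs_human:
        return "ask_human"
    return "allow"

candidate = {"name": "Ada", "skills": ["sql"], "salary_expectation": 90000}
print(redact(candidate, POLICY))      # the salary field is gone
print(decide("send_offer", POLICY))   # prints: ask_human
```

The code is trivial. The contents of `POLICY` are the product decision, and someone is filling them in whether or not a PM is in the room.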

Taste is supposed to be our job. So this is our ground to climb back down into.

Be clear about what I’m not saying. I’m not saying become an engineer, write production code, or live in the repo. I’m saying something less comfortable. The floor we stood on dropped, and staying abstract (Figma, wishlist, eval, repeat) makes us less relevant, not more. You go down a layer not to take the engineer’s job, but to keep having an opinion worth hearing.

If you want to feel this for yourself, don’t read more about it. Take the next AI feature on your roadmap and build the dumbest working version of it this weekend. Public APIs, an MCP server, no production anything. You’ll surface more of the real product questions in two days than two months of specs will hand you.

That’s where this goes next. This piece is the why. Part two is the what: what a harness actually is, and which parts of it a PM should hold real opinions about. Part three is the how: I’m building one in the open, start to finish, as the worked example. But you don’t have to wait for me.

This one was just the why.

I might be early. And here’s the bear case that would make me wrong: if the tooling matures fast enough that building a harness becomes a commodity, abstracted away the way the cloud abstracted away servers, then the judgment moves up, not down, and PMs stay right where they are. I don’t think it goes that way. The plumbing commoditizes; the taste calls don’t. But that’s the bet.

The ground is moving. I’d start walking down.

This is Part 1 of 3 — "Beyond Evals." Part 2: what a harness is, and the parts a PM should hold opinions about. Part 3: the how — one real build, start to finish.

Sources

Salesforce positioning: Benioff's 2024–25 attacks on Microsoft Copilot as "Clippy 2.0" and a "science project"; the later open-interop pivot (A2A protocol with Google and 50+ partners; MCP support in Agentforce 3, June 2025); and the current "operating system for the Agentic Enterprise / landing zone" framing.
6x harness swing: "Meta-Harness: End-to-End Optimization of Model Harnesses" (Lee, Nair, Zhang, Lee, Khattab, Finn; Stanford and MIT; March 2026). Reference code.
Top 30 → Top 5 without touching the model: LangChain, "Evaluating Deep Agents CLI on Terminal-Bench 2.0": 52.8% → 66.5% with the model (GPT-5.2-Codex) held fixed; gains from system prompt, tools, and middleware.
Programmatic prompts: DSPy (Stanford NLP; Khattab et al.), which compiles and optimizes prompts against eval metrics rather than hand-writing them.
"Agent = Model + Harness": framing popularized in the agent-engineering community; see LangChain's deepagents, "the batteries-included agent harness."