
Harness Engineering for Coding Agents

Matt Hammond · 14 min read

Harness engineering is the practice of building the controls around an AI coding agent so its output becomes trustworthy enough to need less supervision. Everything outside the language model is the harness: vendor controls (the inner harness) plus the rules, sensors, generators, and integrations you add (the outer harness). For UK teams shipping on .NET, React, TypeScript, and SQL Server, a strong outer harness combines portable assets like AGENTS.md and MCP servers with stack-specific computational controls.

  • The inner harness is fixed by the vendor. The outer harness is where engineering effort pays back.
  • Guides are feedforward controls. Sensors are feedback controls. You need both.
  • Deterministic checks beat prose rules. Every text rule worth keeping should aspire to become a check or a generator.
  • The most portable outer-harness investments are MCP servers, AGENTS.md, and CI checks. The least portable are tool-specific rules and skills.
  • Switching between Cursor and Claude Code is realistic if you invest in the portable layer first.

What is harness engineering and why does it matter?

The term comes from a 2 April 2026 article by Birgitta Böckeler at Thoughtworks, where she frames a coding agent as Agent = Model + Harness. The model is the language model. The harness is everything else: system prompts, retrieval, tool registries, planners, rules files, MCP servers, linters, test runners, review agents, and the way they all compose.

Harness engineering matters because language models are non-deterministic, do not know your codebase or your conventions, and do not understand code the way an engineer does. They produce tokens that look like code. The harness is what turns that output into something you can trust enough to merge, ship, and run.

Done well, a harness has two effects. It increases the chance the agent gets it right first time, and it provides a feedback loop that self-corrects most issues before they reach a human reviewer. The output is less review toil, fewer wasted tokens, and higher system quality.

Inner harness vs outer harness: what is the difference?

The harness is not one thing. It exists at two levels.

The inner harness is what the tool vendor builds in. With Cursor, that is the modes, the planner, the codebase index, the inline diff review UI, and the way it handles tool calls. With Claude Code, it is the terminal agent loop, the codebase reasoning, the file-edit primitives, and how CLAUDE.md and AGENTS.md get loaded. You cannot change the inner harness. You live with whatever the vendor ships.

The outer harness is what you add on top. AGENTS.md, project rules, MCP servers, linters, structural tests, custom analysers, code generators, hooks, and review-agent skills all sit here. The outer harness is where engineering effort pays back. It is also where the practical work of harness engineering happens.

The two harnesses interact. A choice in the inner harness, for example whether the agent can run terminal commands, changes which outer-harness controls make sense to build. That is why understanding what your tools provide before you start building rules is the first step.

What is in Cursor’s inner harness?

Cursor’s inner harness is IDE-resident. The main controls users interact with are:

  • Plan, Agent, and Ask modes that change how aggressive the agent is allowed to be.
  • Project rules in .cursor/rules/*.mdc with globs frontmatter for path-targeted activation.
  • Codebase indexing so the agent retrieves relevant files automatically rather than relying on a single context dump.
  • Tool calls for read, edit, run terminal, search, and structured diff review.
  • Background agents for longer-running tasks that run in cloud sandboxes.
  • MCP support so the agent can talk to external systems through the Model Context Protocol.
  • Edit-by-edit review UI that shows every change as a diff before it lands.
  • Model picker so a team can route work to a faster, cheaper model where appropriate.
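A project rule with globs frontmatter might look like the sketch below. The frontmatter field names follow Cursor's `.mdc` convention; the file name and rule text are illustrative, not a prescription:

```markdown
---
description: Component conventions for the React packages
globs: src/components/**/*.tsx
alwaysApply: false
---

- Keep component files under 200 lines; extract hooks and subcomponents instead.
- Never use `any`; declare explicit prop interfaces.
- Co-locate tests as `ComponentName.test.tsx` next to the component.
```

Because the `globs` pattern scopes the rule to matching paths, the agent only pays the context cost when it is actually editing a component.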

Talk Think Do uses Cursor as the primary IDE for all engineers. The Q1 2026 AI Velocity Report records 84% of code as AI-authored, with Cursor Rules and Reusable Skills doing most of the work to keep the agent inside our conventions.

What is in Claude Code’s inner harness?

Claude Code’s inner harness is terminal and filesystem-resident. The main controls are:

  • A CLI agent that runs in your shell rather than in an IDE.
  • Full-codebase reasoning that lets the agent read large parts of the project before acting.
  • Multi-step planning with explicit todo lists the user can monitor and adjust.
  • File edits and command execution through standard Unix primitives.
  • Test execution as a first-class operation, so the agent can verify its own changes.
  • MCP support that mirrors Cursor’s, so the same MCP servers work in both.
  • CLAUDE.md and AGENTS.md conventions as the primary instruction surface, loaded from the repository root.
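A minimal AGENTS.md sketch for the stack this article assumes (the conventions listed are illustrative examples, not a complete policy):

```markdown
# AGENTS.md

## Stack
- .NET 10, React, TypeScript, SQL Server

## Conventions
- Nullable reference types are enabled everywhere; never suppress with `!` without a comment.
- Use `TimeProvider`, never `DateTime.Now`.
- All data access goes through repository interfaces; no inline SQL in controllers.

## Verification
- Run `dotnet build && dotnet test` and `npm run verify` before declaring a task done.
```

Keeping this file at the repository root means both Claude Code and Cursor pick it up without tool-specific duplication.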

Our .NET Framework to .NET 10 migration guide goes deep on CLAUDE.md setup as a feedforward control. The pattern carries to any project: target framework, coding conventions, forbidden APIs, testing framework, and architectural rules all live in the markdown the agent reads first.

How do those inner-harness differences shape your outer harness?

The two tools push you in different directions.

Cursor pulls you toward IDE-resident controls. Project rules in .cursor/rules/, Skills, and inline review work because the agent and the engineer share an editor. The cost of changing a rule is low and the feedback loop is tight.

Claude Code pulls you toward filesystem and CLI-resident controls. AGENTS.md and CLAUDE.md carry the conventions. Pre-commit hooks, scripts the agent can shell out to, and MCP servers do the heavy lifting. Because Claude Code is happy running commands, deterministic CLI tools become first-class harness assets.

Both tools support MCP, which is the single most important shared surface. MCP servers expose your work tracker, your test runner, your CI events, and your internal APIs to whichever agent is connected. They do not care which IDE the agent lives in.

The practical implication is simple. If a control depends on Cursor’s UI, you have a Cursor-only asset. If it lives on the filesystem or behind an MCP endpoint, you have a portable asset. Engineers should make conscious choices about which kind they are building.

Switching between Cursor and Claude Code: what stays, what changes?

Talk Think Do uses both. Cursor is the primary IDE. Claude Code handles agent workflows and longer-running tasks. The lesson from running both side-by-side is that some outer-harness assets carry over and some do not.

Assets that carry over cleanly:

  • MCP servers for work tracking, test execution, logs, Azure, CI/CD, and GitHub.
  • AGENTS.md at the repository root.
  • CI checks that run on every push (build, test, lint, structural tests, security scans).
  • Pre-commit and pre-push hooks that bundle fast sensors.
  • Code generators and project templates invoked through scripts.
  • Test suites and mutation tests.

Assets that do not carry over:

  • Cursor Rules in .cursor/rules/*.mdc. Claude Code does not read these natively.
  • Cursor Skills that depend on the Cursor runtime.
  • IDE-only review UIs and inline-diff controls.

The lesson is to invest first in the portable layer. MCP servers, AGENTS.md, CI sensors, and generators reward the work whether the team is on Cursor today, Claude Code tomorrow, or a mix. Tool-specific rules then sit on top as the cheap, fast, IDE-resident layer.

Guides as feedforward controls in a React, TypeScript, .NET, and SQL stack

Feedforward controls (Böckeler calls them guides) anticipate what the agent might do wrong and steer it before it acts. The strongest guides are stack-specific. Generic prose helps, but a Roslyn analyser is harder to ignore than a sentence in a rule file.

For React and TypeScript:

  • An AGENTS.md plus a .cursor/rules/components.mdc capping component file size, forbidding any, and enforcing prop-type conventions.
  • A tsconfig.json with strict: true, noUncheckedIndexedAccess: true, and exactOptionalPropertyTypes: true.
  • An ESLint config the agent must run before declaring a task done, with custom rules for hook usage, accessibility, and dependency boundaries.
  • A package.json script bundle (lint, typecheck, test) the agent is told to run as a unit.
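The script bundle in the last bullet can be as simple as a single `verify` entry the agent is instructed to run as a unit (script names are illustrative):

```json
{
  "scripts": {
    "lint": "eslint . --max-warnings 0",
    "typecheck": "tsc --noEmit",
    "test": "vitest run",
    "verify": "npm run lint && npm run typecheck && npm run test"
  }
}
```

One command means one instruction in AGENTS.md, which is harder for the agent to drop under context pressure than three separate steps.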

For .NET 10:

  • A CLAUDE.md documenting target framework, dependency injection conventions, EF Core conventions, and forbidden APIs (WebClient, HttpClient.GetAsync without timeout, DateTime.Now instead of DateTime.UtcNow or TimeProvider).
  • A Directory.Build.props enforcing <TreatWarningsAsErrors>true</TreatWarningsAsErrors>, <Nullable>enable</Nullable>, and <EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild> for the whole solution.
  • An EditorConfig shared across the codebase.
  • A custom Roslyn analyser as deterministic feedforward, for example one that fails the build on await inside a loop without explicit batching.
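The Directory.Build.props described above is a few lines placed at the solution root, where every project inherits it automatically (the `AnalysisLevel` property is an optional extra, not mentioned in the list):

```xml
<Project>
  <PropertyGroup>
    <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
    <Nullable>enable</Nullable>
    <EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild>
    <AnalysisLevel>latest</AnalysisLevel>
  </PropertyGroup>
</Project>
```

Because this is a build-level control rather than a prose rule, the agent cannot comply with it selectively: the build fails or it does not.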

For SQL Server:

  • A db/conventions.md that the agent ingests describing naming, indexing, and migration conventions.
  • A SQLFluff config covering stored procedures and views.
  • A migration template forcing explicit up and down scripts.
  • A code-generation step (see the harness templates guide) that scaffolds entity, DTO, repository, and migration in one go from a model description, so the agent extends rather than reinvents.
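A starting-point SQLFluff configuration for this stack might look like the following sketch (the rule selection is illustrative; teams should tune it to their own conventions):

```ini
[sqlfluff]
dialect = tsql
templater = raw

[sqlfluff:rules:capitalisation.keywords]
capitalisation_policy = upper
```

Checking this file into the repository root makes the same linting behaviour available to the agent, the pre-commit hook, and CI.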

Sensors as feedback controls in a React, TypeScript, .NET, and SQL stack

Feedback controls (sensors) observe what the agent did and help it self-correct. The cheapest sensors run on every commit. The expensive ones run post-integration.

For React and TypeScript sensors:

  • TypeScript compiler (tsc --noEmit).
  • ESLint, Stylelint, and Prettier in CI.
  • Vitest for unit tests, Playwright for end-to-end tests.
  • Agent-readable error messages: tests and lint output that include explicit fix instructions, not just failure descriptions. This is what Böckeler calls a positive kind of prompt injection.
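One way to make sensor output agent-readable is to append an explicit fix instruction to every failure message. A minimal sketch of the idea (the helper names, the `FIX:` convention, and the size-cap check are illustrative assumptions, not a library API):

```typescript
// agentError: pair a failure description with an explicit fix instruction,
// so an agent reading the output knows what to do, not just what broke.
function agentError(failure: string, fix: string): string {
  return `${failure}\nFIX: ${fix}`;
}

// Example sensor: a lint-style check that a component file is under a size cap.
// Returns null on success, or an agent-readable message on failure.
function checkFileSize(path: string, lineCount: number, cap = 200): string | null {
  if (lineCount <= cap) return null;
  return agentError(
    `${path} has ${lineCount} lines (cap is ${cap}).`,
    `Extract hooks or subcomponents from ${path} until it is under ${cap} lines. Do not raise the cap.`
  );
}

console.log(checkFileSize("src/components/BookingForm.tsx", 312));
```

The same pattern applies to custom ESLint rule messages and test assertion messages: the fix instruction rides along in the output the agent already reads.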

For .NET sensors:

  • dotnet build, dotnet test, and dotnet format.
  • ArchUnitNET for module-boundary tests so the agent cannot quietly leak a layer.
  • Stryker.NET for mutation testing as a quality-of-tests sensor.
  • GitHub Advanced Security with CodeQL for security findings as a sensor.

For SQL Server sensors:

  • Schema-diff via SqlPackage or DACPAC compare on every PR.
  • EF Core migration validator scripts that fail if a migration includes destructive operations without an explicit override flag.
  • A stored-procedure linter and query plan regression checks.
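The migration validator in the list can be a few lines of script. This sketch (the override marker and the set of destructive patterns are assumptions, not a standard) rejects a migration containing destructive statements unless it carries an explicit opt-out comment:

```typescript
// Destructive operations we refuse to apply without a human decision.
const DESTRUCTIVE: RegExp[] = [
  /\bDROP\s+TABLE\b/i,
  /\bDROP\s+COLUMN\b/i,
  /\bTRUNCATE\s+TABLE\b/i,
];

// A migration may opt out with this marker, forcing an explicit, reviewable choice.
const OVERRIDE_MARKER = "-- ALLOW-DESTRUCTIVE";

function validateMigration(sql: string): { ok: boolean; reason?: string } {
  const hit = DESTRUCTIVE.find((re) => re.test(sql));
  if (hit && !sql.includes(OVERRIDE_MARKER)) {
    return { ok: false, reason: `Destructive operation ${hit} without ${OVERRIDE_MARKER}` };
  }
  return { ok: true };
}

console.log(validateMigration("ALTER TABLE Bookings DROP COLUMN Notes;"));
```

Wired into CI, the failure reason doubles as an agent-readable sensor: the message tells the agent exactly which marker to add, and adding it is a visible diff a reviewer can challenge.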

Across all stacks, an inferential review-agent (a Cursor Skill or a Claude Code subagent) runs against the diff. The prompt explicitly tells it to ground its findings in the computational sensor output before it speculates. Without that grounding, inferential review tends to hallucinate problems and miss real ones.

Why deterministic tools beat over-reliance on text

This is the central practical claim of harness engineering for any team using current models.

Language models reading prose rules still hallucinate. Under context pressure, when a long task or a noisy diff fills the window, the rule the agent followed at the start gets quietly dropped at the end. Type checkers, linters, structural tests, and migration validators do not have moods. They catch the same mistake the same way every time, in milliseconds.

The cost difference is huge. An inferential check on every commit is expensive in tokens and latency. A computational check on every commit is essentially free. Computational sensors are also more truthful. They cannot be talked out of a finding.

The headline rule is short:

Every text rule worth keeping should aspire to become a deterministic check or a generator.

A rule that says “use the repository pattern” is fragile prose. A Roslyn analyser that fails the build when the convention is breached is not. Better still is a generator that scaffolds the right shape from a model so the mistake is structurally impossible. We cover the generator pattern in depth in the harness templates guide.

Computational vs inferential controls: when to use each

Böckeler’s distinction is useful. Computational controls are deterministic, run on the CPU, and are cheap and fast. Inferential controls use a model, are slower and more expensive, and are non-deterministic. Both have a place. The question is when to reach for which.

Use computational controls where you can. They are cheap enough to run on every commit and reliable enough to gate a merge. Type checking, linting, structural tests, mutation testing, schema diffs, dependency-cruiser, and security scans all fit here.

Use inferential controls where computational ones cannot answer the question. Examples include “is this code semantically duplicating existing logic?”, “does this match our domain language?”, and “does this PR change the architectural intent?”. Inferential checks are powerful, but downstream of cheap checks, not in front of them. The pattern that works is computational sensors run first, the inferential review agent reads their output, and only then does it weigh in. The agent stops being a critic in a vacuum.

A good practical heuristic: a control that should always produce the same answer should be computational. A control that requires judgement should be inferential. If you find an inferential control producing the same answer every time, it is a candidate for promotion to computational.

Harness templates: when to invest, and where

Böckeler ends her article with an open question: will service templates evolve into harness templates? A harness template is a bundle of guides and sensors that leashes a coding agent to a known topology, like a CRUD business service or an event processor. Templates narrow what the agent has to invent, which is exactly what makes the agent’s output reliable.

Talk Think Do has a working answer. Years before LLMs, we invested in Codenative™, a model-driven templating tool covering scaffolding, ongoing feature generation, conflict-free regeneration on model changes, and template-enforced conventions. It is entirely deterministic. It underpins our Accelerators without being a public product.

When coding agents arrived, Codenative was re-implemented as native Cursor Skills inside .cursor/skills/. The same templates that engineers used to invoke directly are now invocable by the agent. The natural-language layer rides on top of an unchanged deterministic core: an engineer says “add a Booking entity with a date, party size, and customer reference”, the agent translates that into the YAML domain model the templates accept, and the deterministic generator produces the artefacts. The non-deterministic step is bounded; the deterministic step does the heavy lifting.
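The shape of that bounded hand-off is easy to sketch. Below, a hypothetical model (a plain object here, since Codenative's real YAML format is not public) is expanded deterministically into C# entity source; the agent's only job is producing the model, and the template does the rest the same way every time:

```typescript
// A minimal domain model of the kind an agent might produce from
// "add a Booking entity with a date, party size, and customer reference".
interface Field { name: string; type: string }
interface Entity { name: string; fields: Field[] }

// Deterministic template: same model in, same source out, every run.
function generateEntity(e: Entity): string {
  const props = e.fields
    .map((f) => `    public required ${f.type} ${f.name} { get; set; }`)
    .join("\n");
  return `public class ${e.name}\n{\n${props}\n}\n`;
}

const booking: Entity = {
  name: "Booking",
  fields: [
    { name: "Date", type: "DateOnly" },
    { name: "PartySize", type: "int" },
    { name: "CustomerRef", type: "string" },
  ],
};

console.log(generateEntity(booking));
```

A real generator emits far more (DTOs, repositories, migrations), but the division of labour is the same: the non-deterministic step is confined to a small, validatable model.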

We treat this as the strongest computational feedforward control available, because a generator does not just steer the agent away from mistakes, it makes whole classes of mistake structurally impossible.

If you do not have a Codenative-equivalent, you almost certainly have a scaffolder you can elevate: dotnet new templates, Yeoman, Plop, T4, Hygen, NSwag, OpenAPI generators, or Cookiecutter. Wrap each in a Cursor Skill or expose it through MCP. Then add a sensor that detects drift from the generated shape (ArchUnitNET on .NET, custom ESLint rules on TypeScript, schema-diff on SQL). Resist hand-evolving the generated parts so the regeneration loop stays useful.

For the full treatment, including Codenative’s natural-language to YAML translation, conflict-free regeneration, and a worked end-to-end example, read the harness templates guide.

A worked outer harness for a React, TypeScript, .NET 10, and SQL Server system

The diagram below shows where each kind of control sits in a typical system. The model is at the bottom; the inner harness is what the vendor builds; the outer harness is where engineering effort lives; human review and production sit above.

The same picture maps onto a concrete repo for a React, TypeScript, .NET 10, and SQL Server system as a list of assets.

Feedforward, computational:

  • Codenative-derived Cursor Skills for entity, API, and migration scaffolding.
  • Custom Roslyn analysers for forbidden APIs and architectural rules.
  • Directory.Build.props enforcing nullable reference types and treat-warnings-as-errors.
  • EF Core conventions configured globally.
  • tsconfig.strict.json for the React/TypeScript packages.
  • Shared ESLint and Stylelint configs.

Feedforward, inferential:

  • AGENTS.md at the repo root for both Cursor and Claude Code.
  • CLAUDE.md with project-specific Claude guidance.
  • .cursor/rules/*.mdc for path-targeted Cursor guidance.
  • OpenSpec specifications for spec-driven development on substantial features.

Feedback, computational, every commit:

  • dotnet build, dotnet test, dotnet format.
  • tsc --noEmit, ESLint, Vitest.
  • ArchUnitNET module-boundary tests.
  • EF Core migration validator scripts.
  • SQLFluff for stored procedures and views.
  • A pre-push hook that runs the fast subset of the above.

Feedback, computational, post-integration:

  • Stryker.NET mutation testing.
  • GitHub Advanced Security with CodeQL.
  • SqlPackage schema-diff on each PR.
  • Query plan regression checks.
  • Dependency scanners.

Feedback, inferential:

  • A review-agent skill that runs against the PR diff, grounded in the computational sensor output.
  • AI test critique on AI-generated tests, targeting the behaviour-harness gap.

Glue:

  • An MCP server exposing the work tracker, CI/CD events, and the test runner so the agent self-corrects against real signals rather than its own model of the system.

That stack covers most failure modes a coding agent has on a typical .NET, React, and SQL system. It is also achievable: every component listed exists today and can be assembled in a few weeks for a new project, or rolled in incrementally for an existing one.

How Talk Think Do applies harness engineering today

The Q1 2026 AI Velocity Report is the public record of what our outer harness looks like in practice. The headline numbers:

  • 84% of code is AI-authored, up from 51% the previous quarter.
  • 40-50% faster delivery, repeatable across active projects.
  • Six live custom MCP servers covering work items, test execution, logging, Azure, CI/CD, and GitHub.
  • Cursor Rules and Reusable Cursor Skills as the IDE-resident layer.
  • Codenative templates re-housed inside .cursor/skills/ as the deterministic backbone.
  • OpenSpec for spec-driven development, second quarter in production use.
  • A competitive tender won at 55% of the conventional cost, delivered on time and to spec.

Every line of AI-authored code goes through senior engineer review and ISTQB-qualified QA validation, inside our ISO 27001-certified security framework. The outer harness is what makes that economics possible. Without it, 84% AI-authored would be a quality problem rather than a productivity number.

If you are starting from scratch, our experience says the order to invest is: AGENTS.md and a Directory.Build.props first (an afternoon’s work), then the fast computational sensors in CI (a few days), then one or two MCP servers for your highest-friction workflows (a week or two each), then code generators or Skills wrapped around your existing scaffolders. The inferential review-agent comes last, once the computational layer below it is reliable enough to ground its findings.

For more on how this fits a UK Microsoft-stack engagement, see our AI integration service, Claude Code development service, and AI approach pages. To see what a Codenative-style template harness looks like up close, read the companion harness templates guide. To talk through a harness review of an existing codebase, book a free consultation.

Frequently asked questions

What is harness engineering?
Harness engineering is the practice of building the controls around an AI coding agent so it produces trustworthy output with less supervision. The harness is everything outside the language model itself: rules files, MCP servers, linters, type checkers, test suites, code generators, and review agents. A good harness increases the chance the agent gets it right first time and helps it self-correct when it does not.
What is the difference between inner and outer harnesses?
The inner harness is what the tool vendor builds in: Cursor's modes, planner, and indexing, or Claude Code's terminal agent and codebase reasoning. The outer harness is what you add: AGENTS.md, project rules, MCP servers, linters, structural tests, and review skills. You cannot change the inner harness; you live with it. The outer harness is where almost all engineering effort goes.
How does Cursor's harness compare with Claude Code's?
Cursor's inner harness is IDE-resident: Plan, Agent and Ask modes, project rules in `.cursor/rules/`, codebase indexing, and inline diff review. Claude Code's inner harness is terminal and filesystem-resident: a CLI agent that reads `CLAUDE.md` and `AGENTS.md`, runs commands, and edits files. Both support MCP. Practical implication: invest in MCP servers and `AGENTS.md` if you want your harness to survive a tool switch.
Should we invest in tool-specific rules or portable ones like MCP servers?
Both, but with weighting. Tool-specific assets like Cursor Rules and Skills give you the tightest integration today. Portable assets like AGENTS.md, MCP servers, CI checks, pre-commit hooks, and code generators carry over if you switch tools or add a second agent. As a rule of thumb, put the cheap and tool-specific work close to the IDE and the expensive and durable work in portable layers.
When should we use a deterministic check instead of an AI prompt rule?
Whenever you can. A type checker, linter, structural test, or schema-diff sensor catches a problem the same way every time. A prose rule in AGENTS.md telling the agent to avoid the same problem can be hallucinated past, especially under context pressure. Every prose rule worth keeping should aspire to become a deterministic check or a generator that makes the mistake structurally impossible.
What does a minimum viable outer harness look like for a new .NET project?
A `Directory.Build.props` enforcing nullable reference types and treat-warnings-as-errors, an `AGENTS.md` and `CLAUDE.md` describing target framework, conventions, and forbidden APIs, `.cursor/rules/` for Cursor-specific guidance, `dotnet build` and `dotnet test` as fast feedback sensors on every commit, ArchUnitNET for module boundaries, and an MCP server exposing your work tracker. That is enough to get most agent failure modes under control.

Ready to transform your software?

Let's talk about your project. Contact us for a free consultation and see how we can deliver a business-critical solution at startup speed.