Article 3 of the series “From Specification to Execution.” The previous article covered the executable spec, that is, how to formalize an intention without ambiguity. Here I move to the next question: once the spec is in place, who writes what?

The useful question is not whether AI will replace developers. That does not help my daily work. The question that matters is simpler. In the pipeline from spec to code, who writes the tests, who writes the code, and who checks that the suite has not drifted along the way?

Many teams let a single agent write both tests and code in the same motion. The result often looks clean at first, then turns out to be fragile as soon as I look closely. In article 1, I described three recurring drifts: test deletion, over-mocking, and perpetually green tests. That is my starting point.

My point is direct. The critical test should not come from the same actor as the code it constrains. If I want a real constraint, I need at least two separate roles, and sometimes three. Humans, AI, and deterministic tools do not play the same part.

The test as executable spec

In his augmented coding workflow (2025), Kent Beck gives his agent a precise instruction. The agent must take the next unchecked test in plan.md, implement it, and then write the minimum code needed to make it pass. The agent does not choose the test. It implements the one it is given. The test comes before the code, and it comes from somewhere other than the agent.

That detail changes the workflow. Two papers, from 2024 and 2025, point in the same direction. Test-Driven Development for Code Generation (TiCoder, arXiv 2402.13521) shows that iterating on tests and code with an LLM improves the quality of the result. Tests as Prompt: A TDD Benchmark for LLM Code Generation (TENET, arXiv 2505.09027) goes further. On real repositories, giving the tests to the AI in the prompt reaches the state of the art for test-guided code generation.

I keep one simple rule. The test is not a by-product of the code. It is the constraint that guides the code.

The consequence follows immediately. If the agent writes both the test and the code, there is no independent constraint left. The test mostly checks that the code does what the agent just made it do. That is a closed loop, not a useful workflow.

So I am left with three possible sources for the test: me, another AI agent, or a deterministic tool.

Deterministic tools still have a place

The AI discussion sometimes makes older tools disappear from view. That is a mistake. There are already deterministic tools, with no model involved, that generate tests from code, a signature, or a spec. I keep them in mind because they cover ground where AI is weaker.

Search-based

EvoSuite in Java, Pynguin in Python, and Randoop in Java explore the input space with evolutionary algorithms or feedback-directed random search. They focus on coverage, branches, paths, and mutation score.

I find them useful to lock down regressions in lightly tested legacy code. Their limit is simple. They capture what the code does, not what it should do. If the code contains a bug, they can test the bug very effectively.

Symbolic and concolic execution

KLEE for C and C++ on LLVM bitcode, angr for binaries, and their cousins run code symbolically, generate path constraints, and then solve those constraints with an SMT solver to produce concrete inputs.

Their strength is clear. They give formal guarantees, are deterministic, and reproduce well. I keep them mostly for security work: reachable paths, vulnerabilities, and pathological inputs. Their well-known limit is path explosion, which makes them scale poorly on larger application codebases.
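The principle can be shown with a toy, under strong assumptions. The path constraints below are written by hand for a hypothetical `classify` function, where a real engine derives them from the program's branches. The solver is an exhaustive search over a tiny domain, a stand-in for an SMT solver and nothing more.

```python
from itertools import product


def classify(x: int, y: int) -> str:
    """Hypothetical function under analysis, with three paths."""
    if x > 10:
        if y == x * 2:
            return "rare"  # unlikely to be hit by random inputs
        return "high"
    return "low"


# One list of branch conditions per path, written out by hand here.
PATHS = {
    "low": [lambda x, y: not x > 10],
    "high": [lambda x, y: x > 10, lambda x, y: not y == x * 2],
    "rare": [lambda x, y: x > 10, lambda x, y: y == x * 2],
}


def solve(constraints, domain=range(-50, 51)):
    """Stand-in for an SMT solver: exhaustive search over a small domain."""
    for x, y in product(domain, repeat=2):
        if all(constraint(x, y) for constraint in constraints):
            return x, y
    return None


# One concrete input per path, found by solving that path's constraints.
inputs = {label: solve(constraints) for label, constraints in PATHS.items()}
```

Each entry of `inputs` drives `classify` down exactly one path, including the `rare` one. The number of paths grows with every branch, which is where path explosion comes from.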

Model-based testing

GraphWalker, ModelJUnit, and SpecExplorer follow another logic. I describe the system as a model, usually a state machine with transitions, and the tool generates test sequences according to a chosen strategy such as random walk, all-edges, or all-paths.

I think they are underused in modern application code. Yet they have proven themselves for years in telecom, especially on switches and systems where state matters more than the happy path.
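A small sketch of the idea in plain Python, without GraphWalker or ModelJUnit. The call model, its state names, and its events are hypothetical. The sketch shows two of the strategies named above: an all-edges generator and a seeded random walk.

```python
import random
from collections import deque

# Hypothetical call model: state -> {event: next_state}.
MODEL = {
    "idle": {"invite": "ringing"},
    "ringing": {"answer": "connected", "cancel": "idle"},
    "connected": {"bye": "idle"},
}


def shortest_events(model, start, goal):
    """Breadth-first search for the shortest event sequence start -> goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, events = queue.popleft()
        if state == goal:
            return events
        for event, nxt in model[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, events + [event]))
    raise ValueError(f"{goal} is unreachable from {start}")


def all_edges_sequences(model, start):
    """One test sequence per transition: reach its source, then fire it."""
    return [
        shortest_events(model, start, state) + [event]
        for state, transitions in model.items()
        for event in transitions
    ]


def random_walk(model, start, steps, seed):
    """Seeded random walk, so the generated sequence is reproducible."""
    rng = random.Random(seed)
    state, events = start, []
    for _ in range(steps):
        event = rng.choice(sorted(model[state]))
        events.append(event)
        state = model[state][event]
    return events


sequences = all_edges_sequences(MODEL, "idle")
```

Each generated sequence is then replayed against the real system, and the observed states are compared with the ones the model predicts. The model is the spec. The generator only decides which paths through it get exercised.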

Spec-driven property-based testing

The most interesting case for my workflow is also the most modest. The Hypothesis ghostwriter in Python, icontract-hypothesis, and the QuickCheck-style derivations in Haskell generate test skeletons from types, signatures, and contracts. There is no AI here, only templates, type inference, and a few lookup tables.

A command like this sets the tone:

hypothesis write mon_module.parse_sip_uri

Ghostwriter inspects the signature, type annotations, and docstring, then produces a property-based test skeleton. I can then complete it with useful invariants. It is deterministic, reproducible, and well suited to critical modules where I do not want invention.

With icontract, preconditions and postconditions become usable too. The tool generates tests that respect the precondition and check the postcondition. Again, what I gain is control.
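The mechanism can be sketched by hand with the standard library only. This is not the icontract or icontract-hypothesis API. The `require` and `ensure` functions, the `isqrt_floor` function, and the `check_contract` helper are all hypothetical names chosen for the illustration.

```python
import random


def require(value: int) -> bool:
    """Precondition (hypothetical contract): the input is non-negative."""
    return value >= 0


def ensure(value: int, result: int) -> bool:
    """Postcondition: result is the integer square root of value."""
    return result * result <= value < (result + 1) * (result + 1)


def isqrt_floor(value: int) -> int:
    """Hypothetical function under contract."""
    result = 0
    while (result + 1) * (result + 1) <= value:
        result += 1
    return result


def check_contract(func, pre, post, trials=200, seed=0):
    """Draw inputs, keep those that satisfy the precondition,
    and return the inputs that violate the postcondition."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        candidate = rng.randint(-1000, 1000)
        if not pre(candidate):
            continue  # outside the contract: not a valid test input
        if not post(candidate, func(candidate)):
            failures.append(candidate)
    return failures


failures = check_contract(isqrt_floor, require, ensure)
```

The test contains no expected values. The oracle is the postcondition, which exists before the implementation and independently of it. A real tool replaces the naive random draw with smarter input generation and shrinking.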

Why people talk about them less

I see three reasons.

The first is setup cost. Modeling a state machine, spelling out contracts, and preparing the ground all take time.

The second is more structural. These tools generate coverage, not meaning. I still need a business spec on top.

The third is simpler. In public debate, AI takes the space, and these tools have been pushed to the side. That is a shame, because they are often more reliable in their own domain.

The task × author matrix

I summarize the split in a simple grid. Three actors: humans, AI, and deterministic tools.

| Task | Human | AI writer-agent | Deterministic |
| --- | --- | --- | --- |
| Business spec and intent | | | |
| Acceptance test, Gherkin | ✅ validates | ✅ proposes | |
| Unit test, concrete cases | ✅ writes or validates before code | ✅ proposes | |
| Property-based test, invariants | ✅ proposes the invariants | ✅ co-writes | ghostwriter for the skeleton |
| Structural coverage | | | ✅ EvoSuite, Pynguin |
| State machine tests | ✅ models | | ✅ GraphWalker, ModelJUnit |
| Security and path tests | | | ✅ KLEE, angr |
| Production code | | | |
| Refactoring | ✅ validates | ✅ proposes and executes | |
| Test critique | | ✅ critic-agent | ✅ mutation, lint, coverage gate |

I take three rules from that grid.

The critical test comes from me or from a deterministic tool. I do not let the same agent write the code and the test that constrains it. Otherwise I fall back to the tautologies described in article 1.

Production code comes from AI, under constraint. The agent does not freely propose a shape and then validate it itself. It implements a frame that is already in place.

The critic is a first-class actor. It is not there to decorate the pipeline. It is there to stop the writer-agent from turning a test into a confirmation of what it already does.

The writer + critic pattern

The single-agent image is still common, but it already feels dated to me. What works better, in recent work and in my own practice, is a clean split between the one who writes and the one who critiques.

Agent-as-a-Judge

Meta’s 2024 work goes in that direction. One agent evaluates another by looking at the full chain of actions, not just the final result. Applied to test generation, that means the critic-agent does not only check whether the test passes. It looks at how the test was designed, which mocks were added, and whether the spec is still visible.

Multi-agent verification

Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers (arXiv 2502.20379, 2025) proposes specialized verifiers, with no extra training, each centered on a different aspect: semantics, security, performance, style. For tests, the idea translates well. One verifier for coverage, another for oracles, a third for Gherkin compliance, a fourth for anti-patterns. The final vote decides.

Multi-agent debate

Multi-Agent Debate for LLM Judges (arxiv 2510.12697, 2025) shows that debate between opposing agents improves judgment quality. I can translate that into my workflow. One agent defends the quality of the test, another looks for its weaknesses, then a judge decides. I get better precision than with a single agent talking to itself.

In practice

The pattern that looks most robust to me is this:

Spec (me) -> writer-agent -> critic-agent -> deterministic judge -> human validation

The critic-agent applies hard rules, like the ones I versioned in the public repo. It does not modify a test to make the code pass. It rejects mocks on internal code when they hide the real behavior. It flags tautological oracles.

The deterministic judge does not argue. It applies objective gates: mutation score, lint, coverage. It is what turns an opinion into a pass threshold.
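A sketch of such a judge, assuming the numbers have already been collected by a mutation tool, a coverage tool, and a linter. The `Report` fields and both thresholds are hypothetical. The right values depend on the module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    mutants_killed: int
    mutants_total: int
    line_coverage: float  # fraction between 0 and 1
    lint_errors: int


# Hypothetical thresholds, chosen only for the example.
MIN_MUTATION_SCORE = 0.80
MIN_LINE_COVERAGE = 0.90


def judge(report: Report) -> list[str]:
    """Return the failed gates; an empty list means the suite passes."""
    failures = []
    # With no mutants at all, the score counts as 0 rather than as a pass.
    if report.mutants_total:
        score = report.mutants_killed / report.mutants_total
    else:
        score = 0.0
    if score < MIN_MUTATION_SCORE:
        failures.append(f"mutation score {score:.2f} < {MIN_MUTATION_SCORE:.2f}")
    if report.line_coverage < MIN_LINE_COVERAGE:
        failures.append(
            f"line coverage {report.line_coverage:.2f} < {MIN_LINE_COVERAGE:.2f}"
        )
    if report.lint_errors > 0:
        failures.append(f"{report.lint_errors} lint error(s)")
    return failures
```

The judge returns reasons, not a score. Each failed gate is a sentence the writer-agent or I can act on, and the thresholds live in version control next to the code.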

I keep one simple conclusion. The writer-agent is replaceable. The critic-agent is much less so. It is the part that stops test theatre, and therefore the actor whose quality shapes everything else.

Three points to keep

1. The critical test should come from me or from a deterministic tool. If it comes from the same agent as the code, I lose the independent constraint.

2. Deterministic tools still cover useful ground. EvoSuite, Pynguin, KLEE, GraphWalker, Hypothesis ghostwriter, and icontract-hypothesis cover needs that AI handles less well: systematic coverage, symbolic paths, state models, and invariants.

3. The writer + critic + deterministic judge pattern is better than a single agent. The critic-agent is often the most useful actor in the pipeline.

The next step is easier to state and harder to measure. Does a test suite actually have value? Coverage, mutation testing, test smells, robustness. Article 4 will go through the four axes I use to answer that question.


Further reading


This article is the third in a seven-part series. The previous article covered executable spec. The next one will talk about how to measure whether a test suite is worth anything.