AI Didn’t Kill Testing — It Made It Essential

Article 1 of the series “From Specification to Execution — a workflow for reliable code in the AI era.”

You might think that with Claude, Cursor, or Copilot able to generate a large portion of your code, tests would have become secondary. The idea is intuitive: if the machine produces code in seconds, why not let it? Yet Kent Beck, one of the founding voices of TDD, explained in 2025 in Augmented Coding: Beyond the Vibes that he now practices more TDD with an AI agent than before. That observation deserves a closer look.

This article opens a seven-part series. It sets the context: why tests are regaining importance just when AI seemed poised to relegate them, and what workflow is emerging for producing reliable code with an agent. The following six articles detail this workflow step by step, from specification to execution.

Quick definition — TDD (Test-Driven Development) means writing a failing test first, then writing the code that makes it pass. The cycle is short: Red, Green, Refactor. This practice forces you to formalize expected behavior before writing the implementation.
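
To make the cycle concrete, here is a minimal sketch in Python. The `slugify` function and its expected behavior are hypothetical, chosen only for illustration. The test is written first and would fail (Red) until the implementation below it exists (Green).

```python
import re

# Step 1 (Red): the test comes first. It states the expected behavior of a
# hypothetical `slugify` function before any implementation exists.
def test_slugify_replaces_spaces_and_lowercases():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  AI and TDD  ") == "ai-and-tdd"

# Step 2 (Green): the simplest implementation that makes the test pass.
def slugify(title: str) -> str:
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

# Step 3 (Refactor): with the test green, the implementation can be reshaped
# freely, because the test guards the behavior.
test_slugify_replaces_spaces_and_lowercases()
```

The point of the ordering is that the assertions were fixed before the implementation was written, so they cannot have been shaped by it.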

The Economic Inversion

To understand what changed, you need to look at where costs sat before AI, and where they sit now.

Before AI, writing a test took time. Not a huge amount, but enough that teams under pressure would often postpone or abandon tests. You wrote few, and paid the price later as production bugs. This is what fueled years of recurring criticism of TDD: cost deemed too high, lack of time, priority given to delivery.

With AI, generating fifty tests takes seconds. The cost of writing has dropped sharply, but the work has shifted: it now consists of evaluating whether a test actually provides value. In other words, the primary cost is no longer writing — it’s assessing quality. That cost has increased.

Before AI                      With AI
─────────                      ───────
Cost of writing a test         Cost of writing a test
■■■■■■■■■■                     ▏

Cost of judging its quality    Cost of judging its quality
■■■                            ■■■■■■■■■■■■■

Birgitta Böckeler at ThoughtWorks summarizes the problem well: GenAI amplifies indiscriminately. AI cannot automatically distinguish a good codebase from a bad one, so it amplifies whatever it finds. On mediocre code, it reproduces existing flaws. On weak tests, it generates more of the same kind, only in greater numbers.

The hard part is no longer writing the test; it is specifying behavior upstream and evaluating the tests that get produced. Both remain human activities, and this is where code reliability is now decided.

Three Documented Pathologies

If you let an AI agent write its own tests without guardrails, three recurring drifts emerge.

Pathology 1 — Silent Test Deletion

Kent Beck described this in his interview on the Pragmatic Engineer podcast (2025): faced with a failing test, an AI agent may modify or delete the test rather than fix the production code. This is often the simplest solution for the agent, since it makes the test pass quickly.

The problem runs deeper than simple cheating. The test formalizes expected behavior. If the agent modifies it to make the code pass, it doesn’t just fix a discrepancy — it alters the original contract. And since the agent often works autonomously, this drift can go unnoticed until the next code review.
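
One way to make this drift visible is to compare the set of test names before and after an agent's change. The sketch below is an assumption about how such a guardrail could look, not a tool mentioned by the sources above. The helper names `collect_test_names` and `removed_tests` are hypothetical, and the two source strings stand in for two versions of a test file.

```python
import ast

def collect_test_names(source: str) -> set[str]:
    """Return the names of all functions whose name starts with `test_`."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test_")
    }

def removed_tests(before: str, after: str) -> set[str]:
    """Names of tests present in `before` but missing from `after`."""
    return collect_test_names(before) - collect_test_names(after)

# Hypothetical test file before the agent's change.
BEFORE = '''
def test_rejects_negative_amount():
    assert True

def test_accepts_zero():
    assert True
'''

# Same file after the change: one test has silently disappeared.
AFTER = '''
def test_accepts_zero():
    assert True
'''

print(removed_tests(BEFORE, AFTER))  # prints {'test_rejects_negative_amount'}
```

This only catches deleted or renamed tests. A test whose assertions were weakened keeps its name, so it still needs human review or the mutation-based checks discussed later in this article.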

Pathology 2 — Systematic Over-Mocking

A 2025 empirical study, Are Coding Agents Generating Over-Mocked Tests?, shows that agents tend to use mocks by default. It’s a convenient strategy: it avoids setting up fixtures, instantiating real objects, or handling concrete cases. The test passes, coverage goes up. On the surface, everything works.

In reality, the test barely verifies anything. It mostly checks that a mock was called correctly. James Shore called this test theatre: a form of testing that reassures but protects poorly. Defects that emerge from component interactions — which are common — remain invisible.
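
The difference is easier to see side by side. In this sketch, `register`, `InMemoryRepo`, and the requirement that emails be stored in lowercase are all hypothetical. The production code contains a deliberate bug: the over-mocked test stays green, while a test that inspects real state catches it.

```python
from unittest.mock import Mock

class InMemoryRepo:
    """A small real collaborator, used instead of a mock."""
    def __init__(self):
        self.saved = {}

    def save(self, user_id, email):
        self.saved[user_id] = email

def register(repo, user_id, email):
    # Deliberate bug: the (hypothetical) requirement says emails are stored
    # in lowercase, but no normalization happens here.
    repo.save(user_id, email)

def over_mocked_test():
    # Only checks that the mock was called, not what ends up stored.
    repo = Mock()
    register(repo, 1, "Alice@Example.COM")
    repo.save.assert_called_once()

def state_based_test():
    # Checks the observable outcome against the requirement.
    repo = InMemoryRepo()
    register(repo, 1, "Alice@Example.COM")
    assert repo.saved[1] == "alice@example.com"

def passes(test) -> bool:
    """Run a test function and report whether it passed."""
    try:
        test()
        return True
    except AssertionError:
        return False

print(passes(over_mocked_test))   # prints True: the bug goes unnoticed
print(passes(state_based_test))   # prints False: the bug is caught
```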

Pathology 3 — Perpetually Green Tests

The most extreme case is a test that stays green regardless of what changes you make to the production code. This can stem from missing assertions, mocks too detached from their role, or tautological oracles — tests that merely restate what the code does, instead of verifying what it should do.

The ThoughtWorks Technology Radar v33 (2025) highlights mutation testing to detect this type of test. The idea is simple: if you intentionally modify the production code, a solid test should fail. If the test continues to pass, it probably isn’t testing much. The survival rate of mutants then becomes a useful indicator of the actual quality of the test suite.
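
The mechanism can be shown by hand on a toy case. In this sketch the function `is_adult`, its mutant, and both test suites are hypothetical. Dedicated mutation testing tools generate and run mutants automatically; here a single mutant is written manually to show what "surviving" means.

```python
def is_adult(age: int) -> bool:
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    # Mutation: `>=` replaced by `>`.
    return age > 18

def weak_suite(fn) -> bool:
    # No boundary value: 18 is never exercised.
    return fn(30) is True and fn(5) is False

def strong_suite(fn) -> bool:
    # Adds the boundary values on both sides of the threshold.
    return (
        fn(30) is True
        and fn(5) is False
        and fn(18) is True
        and fn(17) is False
    )

def mutant_survives(suite) -> bool:
    """A mutant survives if the suite passes on the original AND on the mutant."""
    return suite(is_adult) and suite(is_adult_mutant)

print(mutant_survives(weak_suite))    # prints True: the suite cannot see the change
print(mutant_survives(strong_suite))  # prints False: the mutant is killed
```

Both suites are green on the original code and would report the same line coverage. Only the mutant tells them apart.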

A fourth, more subtle drift deserves mention: the oracle from code. The agent relies on the implementation to deduce what the test should expect. The test becomes circular: it verifies that the code does what the code does, with no explicit reference to expected behavior.

The Resulting Workflow

Faced with these drifts, two options seem to exist: give up on AI and forfeit the productivity it can bring, or grant it full autonomy and risk the fragile code described above. A third path emerged in 2024–2025, where the positions of Kent Beck, Martin Fowler, and ThoughtWorks converge: frame AI within a workflow where tests are written or strictly validated by a human and serve as constraints on the agent's work.

1. SPECIFICATION    the human formalizes intent
                   (Gherkin, examples, properties, types)
                              ↓
2. TESTS            the human writes (or strictly validates) tests
                   BEFORE any production code is written
                              ↓
3. AI EXECUTION     the agent implements under the constraint of tests
                   one cycle = one test = one commit
                              ↓
4. VERIFICATION     multi-agent (writer + critic) + deterministic tools
                   (mutation testing, lint, coverage)
                              ↓
5. AUDIT            loopback: test quality checked regularly
                   (monthly or quarterly depending on criticality)

The principle is this: tests serve as executable specifications that the AI must satisfy. Without upstream tests, the AI produces plausible implementations, but errors become harder to spot. With upstream tests, the AI proposes a solution, the tests validate or reject it, and the cycle remains under human control.
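
As a rough sketch of that gate, the example below treats a list of human-written examples as the executable specification and accepts a candidate implementation only if it satisfies all of them. Everything here is hypothetical: the `clamp` behavior, the `SPEC` examples, and the two candidates, which stand in for what an agent might propose.

```python
from typing import Callable

# Human-written executable specification for a hypothetical
# `clamp(value, low, high)`: (arguments, expected result).
SPEC = [
    ((5, 0, 10), 5),    # inside the range: unchanged
    ((-3, 0, 10), 0),   # below the range: raised to `low`
    ((42, 0, 10), 10),  # above the range: lowered to `high`
    ((0, 0, 10), 0),    # on the lower boundary: unchanged
]

def satisfies_spec(candidate: Callable[[int, int, int], int]) -> bool:
    """Accept a candidate only if it matches every example in SPEC."""
    return all(candidate(*args) == expected for args, expected in SPEC)

def candidate_a(value: int, low: int, high: int) -> int:
    # Plausible at a glance, but ignores the lower bound.
    return min(value, high)

def candidate_b(value: int, low: int, high: int) -> int:
    return max(low, min(value, high))

print(satisfies_spec(candidate_a))  # prints False: rejected by the spec
print(satisfies_spec(candidate_b))  # prints True: accepted
```

The candidates can change as often as the agent likes; the specification stays in human hands.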

This approach is supported by several works published in 2025. Martin Fowler describes it in Exploring Gen AI: Spec-Driven Development. An industrial case study from the same year, Acceptance Test Generation with Large Language Models: An Industrial Case Study, also reports encouraging results: Gherkin scenarios accepted in the vast majority of cases, often correct on first generation, with clear improvement after correction or added context. The remaining work then falls to humans, primarily for specification validation.

Plan for the Six Following Articles

This series presents the workflow step by step. Each article is self-contained, but read in order they cover the full workflow from start to finish.

Article 2 — Executable Specification: Gherkin, Examples, Properties. How to formalize an intent so that a human and an agent can start from a common baseline. Levels of formalization, common pitfalls, and lessons from a 2025 industrial study on what AI produces from precise Gherkin.

Article 3 — Who Writes the Test, Who Writes the Code? The division of work between humans, AI agents, and deterministic tools (yes, they exist — and the AI hype has unjustly eclipsed them). The multi-agent writer + verifier pattern emerging in recent academic research.

Article 4 — Measuring Whether Tests Are Worth Anything. Four complementary axes: coverage (and its pitfalls), mutation testing, test smells, and robustness. A lightweight audit applicable to an existing project.

Article 5 — Property-Based Testing: Defense Against Weak Oracles. Why example-based tests often miss edge cases, and why property-based testing lends itself well to AI-assisted usage. With an example on a SIP parser.

Article 6 — Putting It into Practice: React + Go (Gin / GORM / PostgreSQL). Applying the workflow to a modern stack. Tool selection per layer, the value of testcontainers-go over sqlmock, and using Pact to limit end-to-end tests.

Article 7 — Launching Execution: From Plan to PR. Using a persistent plan.md, pre-commit hooks as guardrails, a multi-stage CI pipeline, and how to audit an existing project to set up this workflow.

The accompanying material for the series (demos, extended TDD skill for Claude Code, Pact examples) will be published in a public repository, with the link in the next article.

Three Key Points That Underpin What Follows

One. AI didn’t eliminate the need for tests. It mainly shifted the effort: writing a test costs little, but evaluating its quality requires more work than before. Without upstream tests, using an AI agent increases the risk of invisible technical debt.

Two. Three drifts recur in test suites produced autonomously by an agent: silent test deletion, systematic over-mocking, and perpetually green tests. They are documented and measurable.

Three. A five-step workflow (specification, tests, AI execution, verification, audit) makes it possible to combine productivity with control. The point is not to discard AI, but to frame it.

Next week, we’ll start with specification: how to formalize an intent so that a human and an agent can rely on the same baseline, without ambiguity.


Further Reading

  • Kent Beck, Augmented Coding: Beyond the Vibes (2025) — signals.aktagon.com
  • Kent Beck (interview) on The Pragmatic Engineer podcast (2025) — newsletter.pragmaticengineer.com
  • Martin Fowler, Exploring Gen AI: Spec-Driven Development (2025) — martinfowler.com
  • Are Coding Agents Generating Over-Mocked Tests? An Empirical Study (arxiv 2602.00409, 2025) — arxiv.org
  • Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents (arxiv 2602.07900, 2025) — arxiv.org
  • Acceptance Test Generation with LLMs: An Industrial Case Study (arxiv 2504.07244, 2025) — arxiv.org
  • ThoughtWorks Technology Radar volume 33 (2025) — thoughtworks.com

This article is the first in a 7-part series. If this topic interests you, subscribe to the blog’s RSS feed so you don’t miss the rest.