Article 2 of the series “From Specification to Execution, a workflow for making code more reliable in the AI era.” The previous article set the context: with AI, part of the work has moved from writing code to writing specs and exercising judgment. Here, I start with the spec.

In 2025, a team of engineers measured what GPT-4 Turbo produces when asked to generate Cypress acceptance tests from Gherkin scenarios. 60 percent of the tests were correct on the first generation, and 92 percent after a minor correction or extra context. Of 20 problematic cases, 12 shared the same cause: insufficient context in the user story. These results are published in Acceptance Test Generation with Large Language Models. The point is clear. The bottleneck is not generation. It is specification.

I start from a simple idea. A useful spec is not decorative text. It has to say the same thing to the human and the agent, without leaving blanks to fill in.

Why spec changes with AI

Martin Fowler describes the mechanism well in Spec-Driven Development (2025). Models complete patterns very well. They infer intention much less well. They do not have access to the business context I keep in my head, unless I make it explicit.

The more precise the spec is, the less room there is for interpretation. The vaguer it is, the more the agent invents. The result can look plausible, which is exactly the problem. A plausible bug shows up later, and usually at a higher cost.

Take a request like: “make me an endpoint that returns users.” The agent can produce code without difficulty. It still has to choose pagination, filters, JSON shape, error codes, and cache policy. If those elements are missing, it fills the blanks with patterns learned elsewhere. Sometimes they are good. Often they are foreign to the system I am building.

If I provide a precise Gherkin scenario, one request example, one response example, and, when useful, an OpenAPI type, the room for invention shrinks sharply. The agent completes what is already framed.
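
To make that framing concrete, here is a minimal sketch of what "one request example, one response example" could look like once written down as data. The `User` and `UserPage` types, the `list_users` function, the page-size limits, and the sample users are all assumptions made for illustration; none of them comes from a real API.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class User:
    id: int
    email: str

@dataclass(frozen=True)
class UserPage:
    # Hypothetical response shape: pagination is stated, not left to the agent.
    items: list[User]
    page: int
    page_size: int
    total: int

def list_users(users: list[User], page: int = 1, page_size: int = 2) -> UserPage:
    """Hypothetical endpoint logic over an in-memory list."""
    if page < 1 or not 1 <= page_size <= 100:
        raise ValueError("page must be >= 1 and page_size between 1 and 100")
    start = (page - 1) * page_size
    return UserPage(users[start:start + page_size], page, page_size, len(users))

# One request example and its expected response, fixed before any code is generated.
USERS = [User(1, "a@example.com"), User(2, "b@example.com"), User(3, "c@example.com")]
EXPECTED = {
    "items": [{"id": 3, "email": "c@example.com"}],
    "page": 2,
    "page_size": 2,
    "total": 3,
}
assert asdict(list_users(USERS, page=2, page_size=2)) == EXPECTED
```

With the response shape and one worked example pinned down, the remaining open choices (filters, error codes, cache policy) are at least visible as gaps rather than silently filled in.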

The role changes. The work is no longer about producing more text. It is about stating an intention clearly. That is where quality is decided.

The four levels of formalization

A spec is not binary. It is not just present or absent. I see it as a slider.

Level 1, the user story in natural language

The classic format still helps:

As a <role>, I want <action> so that <benefit>.

This format answers the why. It does not give enough structure for the expected behavior. A user story alone is not an executable specification. It is a starting point.

Level 2, Gherkin scenarios

Gherkin gives me a form that both business and tools can read.

Scenario: Rating a standard call to a known destination
  Given a rate of 0.02 €/min for destination "33"
  When I rate a 120-second CDR to "+33612345678"
  Then the cost should be 0.04

It stays readable on the business side. It is also executable by tools like Cucumber, godog, behave or cucumber-js.
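
As a rough illustration of what the step definitions behind that scenario end up doing, the same Given/When/Then can be mirrored in a plain Python test. The `rate_cdr` function, the `RATES` table, and the rule that every started minute is billed in full are assumptions made for this sketch. In a real project the steps would be bound through behave or Cucumber rather than written by hand like this.

```python
from decimal import Decimal
import math

# Hypothetical rate table: destination prefix -> price per minute in euros.
RATES = {"33": Decimal("0.02")}

def rate_cdr(duration_s: int, destination: str) -> Decimal:
    """Assumed rule: bill each started minute at the rate of the longest matching prefix."""
    number = destination.lstrip("+")
    prefix = next(p for p in sorted(RATES, key=len, reverse=True) if number.startswith(p))
    minutes = math.ceil(duration_s / 60)
    return RATES[prefix] * minutes

def test_standard_call_to_known_destination() -> None:
    # Given a rate of 0.02 €/min for destination "33"
    assert RATES["33"] == Decimal("0.02")
    # When I rate a 120-second CDR to "+33612345678"
    cost = rate_cdr(120, "+33612345678")
    # Then the cost should be 0.04
    assert cost == Decimal("0.04")

test_standard_call_to_known_destination()
```

Using `Decimal` instead of floats is a design choice of the sketch: it keeps the comparison with 0.04 exact.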

The 2025 study I mentioned above used this level as an intermediate step. The flow was simple: user story, then Gherkin generated by GPT-4, then Cypress tests generated by GPT-4 from the Gherkin. That step matters. It leaves room to correct the spec before test generation goes too far.

Level 3, typed examples

When Gherkin is still too verbal, I move to concrete examples in code.

EXAMPLES = [
    # (duration_s, destination, expected_cost_millicents)
    (60,  "+33612345678", 2000),
    (120, "+33612345678", 4000),
    (1,   "+33612345678", 2000),   # round up to the next minute
    (0,   "+33612345678", 0),
    (61,  "+33612345678", 4000),   # boundary
]

Typed examples bring two things. They plug directly into parameterized tests or table-driven tests. They also force edge cases into the open. That is often where I find the fuzzy zones I left in the user story.
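
The table above can be consumed by a table-driven test almost as is. The sketch below assumes a hypothetical `rate_call` that bills every started minute at 2000 millicents (0.02 €/min, as in the Gherkin scenario). The actual rating rule is not given in this article, so treat it as a placeholder.

```python
import math

RATE_MILLICENTS_PER_MIN = 2000  # assumption: 0.02 €/min expressed in millicents

EXAMPLES = [
    # (duration_s, destination, expected_cost_millicents)
    (60,  "+33612345678", 2000),
    (120, "+33612345678", 4000),
    (1,   "+33612345678", 2000),   # round up to the next minute
    (0,   "+33612345678", 0),
    (61,  "+33612345678", 4000),   # boundary
]

def rate_call(duration_s: int, destination: str) -> int:
    """Hypothetical implementation: every started minute is billed in full."""
    return math.ceil(duration_s / 60) * RATE_MILLICENTS_PER_MIN

def run_examples() -> list[str]:
    """Run every row of the table and collect readable failure messages."""
    failures = []
    for duration_s, destination, expected in EXAMPLES:
        actual = rate_call(duration_s, destination)
        if actual != expected:
            failures.append(f"{duration_s}s to {destination}: got {actual}, want {expected}")
    return failures

assert run_examples() == []
```

With pytest, the same table would typically be passed to `@pytest.mark.parametrize` so that each row is reported as its own test.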

Level 4, properties and contracts

This is the most formal level. It is also the one that frames AI best when the code is critical.

# Properties of call cost calculation, with d a duration and r a rate
# 1. cost(d, r) >= 0 for all d, r
# 2. cost(d1, r) <= cost(d2, r) if d1 <= d2
# 3. cost(d, 2*r) == 2 * cost(d, r)

Here I am no longer describing a specific case. I am describing a business invariant. These properties translate directly into property-based tests, with Hypothesis in Python, fast-check in TypeScript, or rapid in Go. I come back to that in article 5.
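
To give an idea of how those three properties become executable, here is a dependency-free sketch that checks them on random inputs. It is only a stand-in: Hypothesis would generate and shrink the inputs far better than a hand-rolled loop. The `cost` function, its per-started-minute rule, and the input ranges are assumptions made for the example.

```python
import math
import random

def cost(duration_s: int, rate: int) -> int:
    """Hypothetical cost in millicents: each started minute is billed at `rate`."""
    return math.ceil(duration_s / 60) * rate

def check_properties(trials: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)  # fixed seed so a failure can be reproduced
    for _ in range(trials):
        d1 = rng.randint(0, 86_400)
        d2 = rng.randint(0, 86_400)
        rate = rng.randint(0, 100_000)
        # 1. The cost is never negative.
        assert cost(d1, rate) >= 0
        # 2. A longer call never costs less than a shorter one at the same rate.
        shorter, longer = sorted((d1, d2))
        assert cost(shorter, rate) <= cost(longer, rate)
        # 3. Doubling the rate doubles the cost.
        assert cost(d1, 2 * rate) == 2 * cost(d1, rate)

check_properties()
```

None of the assertions names a specific expected amount. Each one states a relation that must hold for any input, which is what separates this level from typed examples.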

Contracts play a similar role at the signature level. icontract in Python, pydantic for models, Zod in TypeScript, and OpenAPI types all reduce the room left for interpretation. A typed function like def rate_call(duration: int, tariff: Tariff) -> Cost is already a partial spec. The agent cannot decide on its own that the function returns a string.
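
A minimal sketch of that idea, using only the standard library: the fields of `Tariff` and `Cost`, the billing rule, and the specific checks are assumptions for illustration. Libraries such as icontract or pydantic would express the same preconditions and postconditions declaratively instead of with hand-written checks.

```python
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class Tariff:
    rate_millicents_per_min: int  # hypothetical field

    def __post_init__(self) -> None:
        # Invariant on the model: a tariff can never carry a negative rate.
        if self.rate_millicents_per_min < 0:
            raise ValueError("rate must be non-negative")

@dataclass(frozen=True)
class Cost:
    millicents: int  # hypothetical field

def rate_call(duration: int, tariff: Tariff) -> Cost:
    # Precondition: the caller must pass a non-negative duration in seconds.
    if duration < 0:
        raise ValueError("duration must be non-negative")
    result = Cost(math.ceil(duration / 60) * tariff.rate_millicents_per_min)
    # Postcondition: whatever the implementation does, the cost is not negative.
    assert result.millicents >= 0
    return result

assert rate_call(120, Tariff(2000)) == Cost(4000)
```

The signature already rules out a string return value. The checks add what the types alone cannot say, such as the sign of the duration.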

Which level for which topic

| Level | Audience | Tools | Writing cost | Constraint strength |
|---|---|---|---|---|
| 1. User story | Everyone | Confluence, Jira, README | Low | Low |
| 2. Gherkin | Business + dev | Cucumber, godog, behave | Medium | Medium |
| 3. Typed examples | Dev | pytest.mark.parametrize, table-driven tests | Medium | High |
| 4. Properties + contracts | Dev | Hypothesis, icontract, types | High | Very high |

The practical rule is simple. The higher the cost of a bug, the higher I go in formalization. A user settings page can live with levels 1 and 2. A pricing library, a protocol parser, or a dialplan routing engine needs level 4. On critical code, staying too low on the scale ends up costing more than the formalization itself.

Three common traps

Three mistakes come up often when I write a spec meant to be read by an agent.

Trap 1, implicit context

The user story says: “the user can log in.” It is still missing concrete elements: the authentication method, the password policy, what happens after three failures, and the rate limit rule.

When I keep that context in my head, I leave a blank, and the agent fills it with an assumption of its own. That is exactly what the 2025 study shows. Of 20 problematic cases, 12 came from insufficient context in the user story. The problem is not the model. It is the text I give it.

The remedy is a list. I write down the preconditions, system state, configuration, environment variables, and data that already exists. When a feature depends on them, I even prefer a shared file that fixes the context once and for all.

Trap 2, examples that are too tame

It is easy to write only scenarios that pass: a valid user, clean data, the happy path. Edge cases stay out. The agent completes the pattern instead of questioning it, then produces tests of the same kind. Coverage goes up, but robustness does not always follow.

For each feature, I aim for at least three failure cases: invalid input, missing resource, and incompatible state. If I only have scenarios that succeed, the spec is incomplete. A useful spec does not describe only the path that works. It also describes what the system must refuse.
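
As an illustration, the three failure families can sit next to the happy path in the same test file. Everything below is hypothetical: the exception names, the dictionary-shaped CDR, the `rated` flag, and the `RATES` table are assumptions chosen to give each family one concrete case.

```python
import math

class InvalidInput(ValueError):
    """The CDR itself is malformed."""

class UnknownDestination(LookupError):
    """No rate exists for the dialed prefix."""

class AlreadyRated(RuntimeError):
    """The CDR is in a state where rating it again must be refused."""

RATES = {"33": 2000}  # hypothetical: prefix -> millicents per minute

def rate_cdr(cdr: dict) -> int:
    if cdr["duration_s"] < 0:
        raise InvalidInput("duration must be non-negative")
    if cdr.get("rated", False):
        raise AlreadyRated("CDR was already rated")
    number = cdr["destination"].lstrip("+")
    prefix = next((p for p in RATES if number.startswith(p)), None)
    if prefix is None:
        raise UnknownDestination(number)
    return math.ceil(cdr["duration_s"] / 60) * RATES[prefix]

def raises(exc_type: type, cdr: dict) -> bool:
    """Return True if rating `cdr` is refused with the expected error."""
    try:
        rate_cdr(cdr)
    except exc_type:
        return True
    return False

ok = {"duration_s": 120, "destination": "+33612345678"}
assert rate_cdr(ok) == 4000                                        # happy path
assert raises(InvalidInput, {**ok, "duration_s": -1})              # invalid input
assert raises(UnknownDestination, {**ok, "destination": "+999"})   # missing resource
assert raises(AlreadyRated, {**ok, "rated": True})                 # incompatible state
```

Naming a distinct error per family forces the spec to say what "refuse" means in each case, instead of leaving a generic failure for the agent to interpret.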

Trap 3, the tautological property

When I want to reach level 4, I can write a property that merely repeats the result.

# Property: the sorting function returns a sorted list.

That property tests almost nothing. A good property describes a relation between two states or two executions. For the same sorting function, I prefer for example:

  • len(sort(xs)) == len(xs), length is preserved.
  • sort(sort(xs)) == sort(xs), idempotence.
  • set(sort(xs)) == set(xs), elements are preserved.
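
These three relations can be checked mechanically. The sketch below uses a plain random loop from the standard library as a stand-in for a property-based testing tool. The helper names and the deliberately wrong `dedup_sort` are made up for the example.

```python
import random

def properties_hold(sort, xs: list[int]) -> bool:
    """Check the three relations for one input list."""
    ys = sort(xs)
    return (
        len(ys) == len(xs)        # length is preserved
        and sort(ys) == ys        # idempotence
        and set(ys) == set(xs)    # elements are preserved
    )

def check_sort(sort, trials: int = 500, seed: int = 0) -> bool:
    """Check the relations on randomly generated lists, with a fixed seed."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if not properties_hold(sort, xs):
            return False
    return True

def dedup_sort(xs: list[int]) -> list[int]:
    """Hypothetical buggy sort: it silently drops duplicates."""
    return sorted(set(xs))

assert check_sort(sorted)
# The buggy version is idempotent and keeps the same set of elements,
# yet the length relation exposes it.
assert not properties_hold(dedup_sort, [2, 1, 2])
```

No single relation is enough on its own. It is the combination that leaves a wrong implementation little room to hide.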

The practical rule fits in one sentence. If I have to read the implementation to find the property, it is too weak. A good property comes before the code.

A usable template

Here is a SPEC.md format I can drop into a ticket and hand to an AI agent as is.

# Spec - <feature name>

## Intention (user story)
As a <role>, I want <action> so that <benefit>.

## Context to make explicit
- Expected system state before: ...
- Configuration / environment variables: ...
- External dependencies: ...

## Gherkin scenarios

Scenario: <happy path>
  Given ...
  When ...
  Then ...

Scenario: <failure case 1>
  Given ...
  When ...
  Then ... (expected error)

[at least one, ideally three failure scenarios]

## Typed examples
| Input | Expected output |
|---|---|
| ... | ... |

## Candidate properties, level 4 if the code is critical
1. Invariant: ...
2. Round-trip: ...
3. Metamorphic: ...

## Non-goals
- What this feature does not do.

I keep this template in a versioned file, then I pass it to the agent when I want it to work from a shared baseline. The goal is simple. Reduce unnecessary interpretation.

Three points to keep

1. A spec is no longer optional. With AI, it becomes the direct input to the workflow, and its quality shapes the code that comes out.

2. I can formalize at four levels depending on the criticality of the topic: user story, Gherkin, typed examples, and properties with contracts. The more expensive the bug, the higher I go.

3. Three traps show up often: implicit context, examples that are too tame, and tautological properties. Fixing them early avoids asking the agent to guess what I did not write.

Article 3 will cover how to split the work between humans, AI, and deterministic tools, and when to leave each of them in place while AI hype tries to push them aside.


Further reading


This article is the second in a seven-part series. If you found it useful, the previous article sets the context, and the next one will cover how to split responsibilities between humans, AI, and deterministic tools in the workflow.