Measuring If Tests Are Worth Anything: Four Axes
Article 4 of the series “From Specification to Execution.” The previous article discussed the distribution of roles between human, AI, and deterministic tools. We can now write tests. The more useful question remains: How do we know if they are worth anything?
A case often cited by the authors of Hypothesis summarizes the problem well. A production JSON parser showed 95% coverage on an example-based test suite. The score looked good. Yet as soon as Hypothesis fed it arbitrary Unicode inputs, it broke in less than 30 seconds on a surrogate character. The tests existed. They didn’t protect what they claimed to protect.
That is my starting point. Coverage can be useful. It can also reassure too quickly. If I want to know whether a test suite is worth anything, I have to look at something other than the percentage displayed in the CI.
Why coverage alone lies
Coverage measures execution, not verification.
Three levels come up often:
- Line coverage says a line was executed. It doesn’t say if the output was correct.
- Branch coverage adds the notion of branches. It’s better, but it remains an indicator of passage, not quality.
- MC/DC (Modified Condition / Decision Coverage) goes further. It requires tests showing that each condition in a boolean expression independently affects the outcome. This is the standard in critical fields like avionics (DO-178C) or automotive safety (ISO 26262).
None of these metrics say whether the assertions are solid. A test can execute code without really checking anything. A suite can also reach very high coverage with weak oracles. The review paper A Brief Survey on Oracle-based Test Adequacy Metrics summarizes it well. Coverage observes the structure. It doesn’t measure the quality of the verification.
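A minimal sketch of that gap between execution and verification. The function and test names are hypothetical, chosen only for illustration. Both tests execute every line of `apply_discount`, so they earn the same coverage, but only the second one would notice a wrong result.

```python
def apply_discount(price: float, percent: float) -> float:
    """Return the price after applying a percentage discount."""
    return price * (1 - percent / 100)


def test_executes_but_verifies_nothing():
    # Runs every line of apply_discount, so coverage counts it as covered,
    # yet there is no assertion: any return value would pass.
    apply_discount(200.0, 25.0)


def test_verifies_the_result():
    # Same coverage as the test above, but the oracle pins the expected value.
    assert apply_discount(200.0, 25.0) == 150.0
```

If the formula were changed to `price * percent / 100`, the first test would stay green and the second would fail.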
With AI, this point becomes more visible. If an agent can generate fifty tests in a few seconds, the real issue is no longer volume. It’s what that volume actually verifies. Thoughtworks pointed this out in Technology Radar v33, refocusing the debate on mutation testing for codebases where AI produces a lot of tests.
I therefore keep four axes.
Axis 1: Coverage
Coverage remains useful. I always enable it, but I don’t give it the main role.
I prefer branch coverage over just line coverage. The difference is clear as soon as the code contains non-trivial branches.
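Here is a small sketch of where the two metrics diverge, assuming a hypothetical `shipping_fee` rule. A single test runs every line of the function, so line coverage reports nothing missing. The path where the `if` is false is never taken, and only branch coverage reports it.

```python
def shipping_fee(total: float, is_member: bool) -> float:
    """Hypothetical pricing rule: members get free shipping."""
    fee = 5.0
    if is_member:
        fee = 0.0
    return fee


def test_member_ships_free():
    # Every line of shipping_fee runs, so line coverage is complete.
    # The non-member path (the `if` evaluating to False) is never exercised;
    # a branch-aware run such as pytest --cov-branch flags it as a partial branch.
    assert shipping_fee(40.0, is_member=True) == 0.0
```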
| Language | Command |
|---|---|
| Python | pytest --cov=src --cov-branch --cov-report=term-missing |
| TypeScript / JavaScript | vitest run --coverage |
| Go | go test -cover -covermode=atomic ./... |
I also reason per module, not just globally. A critical business module might target 75%, sometimes more. Wiring code, like a thin HTTP handler, might stay lower. A single global threshold masks gaps and often leads to a chase for 100% that costs time without improving quality.
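As a sketch of per-module reasoning, the check below compares each module against its own minimum. The module paths and the 40% figure for wiring code are assumptions made up for the example. Only the 75% target comes from the text above.

```python
# Hypothetical per-module minimums (branch coverage, in percent).
THRESHOLDS = {
    "src/billing": 75.0,  # critical business module
    "src/http": 40.0,     # thin wiring code (illustrative value)
}


def modules_below_threshold(
    measured: dict[str, float], thresholds: dict[str, float]
) -> list[str]:
    """Return the modules whose measured coverage is under their own minimum."""
    return sorted(
        module
        for module, minimum in thresholds.items()
        if measured.get(module, 0.0) < minimum
    )


# The plain average of these two modules is 80%, which would pass a single
# global threshold of 75% and hide the gap in billing.
measured = {"src/billing": 62.0, "src/http": 98.0}
print(modules_below_threshold(measured, THRESHOLDS))  # ['src/billing']
```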
The important point is simple. Coverage is a prerequisite, not a proof.
Axis 2: Mutation testing
This is the axis that changes the reading of a test suite the most.
The principle is mechanical. I mutate the production code. I replace a == with !=, I invert a boolean, I delete a line, I shift an index. Then I rerun the tests. If the tests pass despite the mutation, the mutant survived. This means no test detected the behavior change.
The metric is simple:
score = killed mutants / (killed mutants + surviving mutants)
It measures the real strength of the suite. Not its size. Not its volume. Its ability to detect a regression.
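To make the mechanics concrete, here is a hand-rolled sketch of the loop. It is a toy, not how mutmut or any real tool is implemented. The `is_adult` function, the four mutations, and both suites are hypothetical. Each suite is a function that returns True when all its checks pass.

```python
SOURCE = """
def is_adult(age):
    return age >= 18
"""

# Each mutant is an (original fragment, replacement) pair applied to SOURCE.
MUTATIONS = [(">=", ">"), (">=", "<"), ("18", "19"), (">=", "==")]


def weak_suite(is_adult) -> bool:
    """One value far from the boundary. Full line coverage, little protection."""
    return is_adult(30) is True


def strong_suite(is_adult) -> bool:
    """Checks both sides of the boundary and the boundary itself."""
    return is_adult(17) is False and is_adult(18) is True and is_adult(30) is True


def mutation_score(suite) -> float:
    killed = survived = 0
    for old, new in MUTATIONS:
        namespace = {}
        exec(SOURCE.replace(old, new, 1), namespace)  # build the mutant
        if suite(namespace["is_adult"]):
            survived += 1  # the suite still passes: nobody noticed the change
        else:
            killed += 1    # the suite fails: the mutant is detected
    return killed / (killed + survived)


print(mutation_score(weak_suite))    # 0.5
print(mutation_score(strong_suite))  # 1.0
```

Both suites execute the single line of `is_adult`, so coverage cannot tell them apart. The score can.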
In the AI era, this tool becomes even more useful. An agent can produce many tests in a short time. The risk is confusing quantity with real protection. Mutation testing separates the two.
When a mutant survives, I read it in one of three ways:
- A test is missing: the mutation touches a behavior that no test covers.
- An assertion is too weak: the mutated code executes, but nothing verifies its result.
- The mutant is equivalent to the original code. This case is rarer but real, and I mark it as such with a justification.
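The third case is the least intuitive, so here is a hypothetical illustration. Mutating `<` to `!=` in this loop condition changes the source but not the behavior, because `i` starts at 0 and only grows by 1. No test can kill this mutant, and writing more tests would not help.

```python
def index_of(items, target):
    """Return the first index of target in items, or -1 if absent."""
    i = 0
    while i < len(items):
        if items[i] == target:
            return i
        i += 1
    return -1


def index_of_mutant(items, target):
    """Same function with `<` mutated to `!=` in the loop condition."""
    i = 0
    # i starts at 0 and increases by exactly 1, so it cannot skip len(items):
    # both conditions stop the loop at the same moment.
    while i != len(items):
        if items[i] == target:
            return i
        i += 1
    return -1
```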
Tools vary by language:
| Language | Tool |
|---|---|
| Python | mutmut |
| TypeScript / JavaScript | StrykerJS |
| Go | gremlins |
| Java / Kotlin | PIT |
| Rust | cargo-mutants |
| Ruby | mutant |
| .NET | Stryker.NET |
The cost is real. Mutation testing is much slower than normal tests, often by a factor of 10 to 100. So I don’t run it everywhere. I limit it to files modified in PRs, with a full run at night or on the main branch, and thresholds per module.
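One way to scope a run to a PR is to derive the target list from the diff. The sketch below assumes a `src` layout, a base branch named `origin/main`, and `test_` file naming. All three are assumptions to adapt. How the resulting list is passed to the mutation tool depends on the tool and is left out.

```python
import subprocess
from pathlib import PurePosixPath


def changed_files(base: str = "origin/main") -> list[str]:
    """List files that differ between the merge base with `base` and HEAD."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


def mutation_targets(paths: list[str], source_root: str = "src") -> list[str]:
    """Keep only Python production files under source_root; skip tests."""
    targets = []
    for path in paths:
        p = PurePosixPath(path)
        is_test = p.name.startswith("test_") or "tests" in p.parts
        if p.suffix == ".py" and p.parts[:1] == (source_root,) and not is_test:
            targets.append(path)
    return targets
```

In CI, `mutation_targets(changed_files())` would give the files worth mutating for the current PR, while the nightly job keeps covering the whole tree.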
Axis 3: Test smells
A test can pass, be covered, and remain poorly written.
Test smells are anti-patterns in test code. They don’t always break the suite immediately, but they often predict future degradation. The Garousi and Küçük paper, Smells in software test code: A survey of knowledge in industry and academia, remains a good baseline to spot them.
I watch mainly for five cases:
| Smell | Description |
|---|---|
| Assertion roulette | Multiple assertions with no clear message |
| Eager test | A test verifies too many behaviors at once |
| Mystery guest | The test depends on an obscure external resource |
| Over-mocking | Mocks on code I control |
| Empty test | Empty or almost empty test |
With an AI agent in the loop, these drifts appear quickly. The agent sometimes likes to mask complexity behind mocks, or stack several checks in the same test. The suite passes, then becomes hard to maintain.
I therefore rely on static tools per language.
- In Python, `ruff` with rules `PT`, `S`, and `B` already covers a good portion of the ground. To go further, PyNose detects several test smells.
- In TypeScript and JavaScript, I combine ESLint with `eslint-plugin-vitest` or `eslint-plugin-jest`, plus `eslint-plugin-testing-library`. The `expect-expect`, `valid-expect`, and `no-conditional-expect` rules remain the most useful.
- In Go, `golangci-lint` with `testifylint`, `thelper`, `paralleltest`, `errcheck`, and `bodyclose` already catches a large share of false greens.
I pay particular attention to testifylint. It spots assert.Nil(t, err) that should be assert.NoError(t, err), or comparisons to nil that pass too easily. Each finding looks minor on screen, but they add up over time.
Axis 4: Robustness
The last axis looks at the suite under less comfortable conditions.
Flakiness
A flaky test doesn’t say the same thing on every run. It depends on timing, execution order, shared state, or an external resource. Generally, it’s not a capricious test. It’s a poorly isolated test.
Atlassian estimates the average loss tied to flaky tests in a mid-sized organization at around 150,000 developer hours per year. I don’t need the exact number to retain the idea. A flaky suite costs time and ultimately erodes trust.
I test stability with simple repetitions.
| Language | Command |
|---|---|
| Python | pytest --count=10 -x (requires the pytest-repeat plugin) |
| TypeScript | vitest run --retry=3 (reruns failures only; a pass on retry signals flakiness) |
| Go | go test -shuffle=on -count=10 -race ./... |
To go further, I keep in mind DeFlaker and FlakeFlagger. Commercial detection tools exist too, but the most important part remains diagnosis. If a test is flaky, I first look for the source of shared state, the temporal dependency, or the poorly isolated external resource.
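The repetition idea itself fits in a few lines. This sketch is a stand-in for what the commands above do at scale, with hypothetical test functions. A call counter simulates a hidden dependency so that the example stays deterministic.

```python
from typing import Callable


def classify(test: Callable[[], None], runs: int = 10) -> str:
    """Run a test several times and label it stable-pass, stable-fail, or flaky."""
    outcomes = set()
    for _ in range(runs):
        try:
            test()
            outcomes.add("pass")
        except AssertionError:
            outcomes.add("fail")
    if outcomes == {"pass"}:
        return "stable-pass"
    if outcomes == {"fail"}:
        return "stable-fail"
    return "flaky"


calls = {"n": 0}


def test_depends_on_call_count():
    # Stand-in for a timing or shared-state dependency: fails on every third call.
    calls["n"] += 1
    assert calls["n"] % 3 != 0


def test_pure():
    assert sorted([3, 1, 2]) == [1, 2, 3]


print(classify(test_pure))                   # stable-pass
print(classify(test_depends_on_call_count))  # flaky
```

A "flaky" label only says that something varies between runs. Finding which shared state or timing assumption varies is still the diagnosis work described above.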
Property-based testing
Property-based testing covers the input space rather than a list of hand-written cases. This is exactly what the JSON parser in the introduction was missing.
I see it as the main weapon against forgotten edges. It complements the coverage axis well, and fixes many weaknesses of example-based tests. The topic deserves its own article, which I am saving for the next part of the series.
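To give a feel for the idea before that article, here is a hand-rolled sketch using only the standard library. It is not Hypothesis, which adds far better generators and shrinking of failing inputs. The `roundtrip` function is a hypothetical piece of code under test, and the property is that encoding then decoding returns the original text.

```python
import random

# Code points around encoding boundaries, including the surrogate range U+D800..U+DFFF.
BOUNDARIES = [0x00, 0x7F, 0x80, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF]


def random_text(rng: random.Random, max_len: int = 5) -> str:
    """Generate a short string, biased toward boundary code points."""
    chars = []
    for _ in range(rng.randint(0, max_len)):
        if rng.random() < 0.5:
            chars.append(chr(rng.choice(BOUNDARIES)))
        else:
            chars.append(chr(rng.randint(0, 0x10FFFF)))
    return "".join(chars)


def roundtrip(text: str) -> str:
    """Hypothetical code under test: serialize to UTF-8 bytes and back."""
    return text.encode("utf-8").decode("utf-8")


def find_counterexample(trials: int = 500, seed: int = 0):
    """Return the first generated input that breaks the property, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        text = random_text(rng)
        try:
            if roundtrip(text) != text:
                return text
        except UnicodeError:
            return text  # the code under test crashed on this input
    return None


counterexample = find_counterexample()
print(repr(counterexample))
```

An example-based suite made of ASCII strings would never reach the failing input. The generator finds one because lone surrogates cannot be encoded as UTF-8.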
Mini-audit you can run this weekend
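I can run a baseline audit on an existing project in less than an hour. The Python commands assume the pytest-repeat plugin for --count and mutmut 2.x for the --paths-to-mutate flag; mutmut 3 reads that setting from its configuration file instead.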
Python
pytest --cov=src --cov-branch --cov-report=term-missing
ruff check src tests
mutmut run --paths-to-mutate=src/<critical_module>
mutmut results
pytest --count=10 -x
TypeScript
pnpm exec vitest run --coverage
pnpm exec eslint . --max-warnings 0
pnpm exec stryker run --mutate "src/<critical_module>/**/*.ts"
pnpm exec vitest run --retry=3
Go
go test -cover -covermode=atomic ./...
golangci-lint run
gremlins unleash ./internal/<critical_package>/
go test -shuffle=on -count=10 -race ./...
I’m not looking for a perfect score. I’m looking for a signal. If coverage is high but the mutation score is low, the protection the suite seems to offer is an illusion. If tests are stable but full of smells, I simplify them. If tests look good on paper but are flaky, I restore stability first.
Three points to keep
1. Coverage alone doesn’t say much. It measures execution, not verification.
2. The four axes complement each other. Coverage, mutation testing, test smells, and robustness cannot replace one another.
3. AI increases volume, not value by default. If I want to know if a suite is worth anything, I have to measure something other than the coverage percentage.
The next step is more naturally handed to an AI agent. Property-based testing explores the input space, finds edge cases, and fits well with this workflow. This is the subject of article 5.
Further reading
- A Brief Survey on Oracle-based Test Adequacy Metrics (arXiv 2212.06118) - arxiv.org
- Garousi V. & Küçük B. (2018), Smells in software test code: A survey of knowledge in industry and academia - sciencedirect.com
- Thoughtworks Technology Radar v33 (2025) - thoughtworks.com/radar
- DeFlaker: Automatically Detecting Flaky Tests (ICSE 2018) - experts.illinois.edu
- FlakeFlagger: Predicting Flakiness Without Rerunning Tests (ICSE 2021) - jonbell.net
- mutmut - github.com/boxed/mutmut
- StrykerJS - stryker-mutator.io
- gremlins (Go) - github.com/go-gremlins/gremlins
- PyNose - github.com/JetBrains-Research/PyNose
- eslint-plugin-testing-library - github.com/testing-library/eslint-plugin-testing-library
- golangci-lint - golangci-lint.run
- The series public repo - github.com/mwolff44/spec-to-tests
This article is the fourth in a seven-part series. The previous article discussed role distribution. The next one will cover property-based testing and how it helps catch forgotten edges.