Your agent says done but the tests fail
“Claude Code says done but the code does not run or the tests fail”
Your agent announces the work is finished, but the tests fail and the app does not even start. It calls the job done the moment the code exists, before running anything, because writing plausible code and verifying it are different acts and only the first is its default. The fix is a definition of done that requires proof, plus a hook that runs your tests and blocks completion until they pass.
Why it calls done too early
When an agent says done, it is making a prediction, not reporting an observation. By default it has not run your code; it has produced text that looks like correct code and then produced text that looks like a completion message. Both are high-probability outputs given the task, and neither required the code to actually execute. Done is a guess about the code, dressed as a fact.
The training bias points the same way. A confident this should work is the natural end of a helpful reply, and it scores well, while stopping to run the tests, read a failure, and fix it is slower and less certain. So the agent is pulled toward declaring success at the moment the artifact exists, which is exactly one step before the moment you actually care about, when it runs.
The reported version of this is an agent that prioritizes completing the task quickly over checking constraints before it acts. False completion is that same speed-over-verification instinct aimed at the finish line.
The manual fix (free, and complete)
The fix is a definition of done that requires evidence, and a way to make the evidence automatic. First, paste this rule into CLAUDE.md or AGENTS.md so done means proven, not typed.
## Prove it before you say done
- "Done" is not allowed until you have actually run the code. Run the tests and
the app, and paste the real output. No "this should work".
- If tests fail or the app does not start, that is not done. Fix it and run
again. Report the failure honestly; do not hide it.
- State exactly what you ran and what you observed: the command, and the result.
- If you cannot run something, say so plainly and tell me what to run, rather
than claiming it works.- Redefine done as proven. Require the agent to actually run the tests and the app before it can say done, and paste the real output.
- Ban this should work. Forbid confident completion language that is not backed by an observed result.
- Make failures honest. Require the agent to report failing tests rather than hiding them, then fix and re-run.
- Automate the check with a hook. Add a Stop hook that runs your test command so completion is gated on a real pass.
The rule gets far stronger with a hook that runs on the agent's stop event. A minimal Claude Code hook that runs your tests and refuses to let the turn end on a failure looks like this.
The durable fix (the kit)
A Stop hook you write yourself is exactly the right idea, and it is also the thing most people never get around to wiring up per project, per test runner, per edge case. The kit ships the enforcement version out of the box: hooks that run the proof step and hold the agent to a real pass, plus a moment-triggered tutor that, the first time your agent tries to declare done without running anything, explains why proven beats plausible.
The definition-of-done rule travels to Codex, Cursor, and Antigravity through AGENTS.md, so the ask-first, prove-it behavior is cross-tool. The hooks that actually run your tests and block a false done, and the live HUD that shows the proof, run in Claude Code only. One install, one-time price.
The real numbers
The prove-it step is why the kit was faster and cheaper on three of five tasks in our first clean-room suite: catching a failure inside the loop is cheaper than shipping it and debugging later. When the agent runs the tests before it stops, the broken done never reaches you.
The honest cost is the same one that shows up everywhere: on already-clear, already-working tasks the extra run-and-verify step is overhead. On a plain todo app the kit was about twice as slow, because there was nothing to catch. Proving the work pays for itself when the work can break, and costs you when it cannot.
| Measure | Vanilla agent | Kit-configured |
|---|---|---|
| Ran the code before saying done | not by default | yes, gated by a hook |
| Faster and cheaper overall | 2 of 5 tasks | 3 of 5 tasks |
| Already-clear task (todo app) | faster | about 2x slower |
Directional, n=1 per task. Proof wins where code can break, and costs you where it cannot.
It is in Anthropic's own tracker
False completion is a behavior, not a crash, so it is not filed as its own bug. The closest documented root cause is the speed-over-verification pattern.
The report describes an agent that prioritizes completing the task quickly over checking constraints before acting. Saying done before running anything is that instinct at the finish line: complete first, verify never.
Common questions
Why does my agent say done when the code does not work?
Because done is a prediction, not an observation. By default the agent has not run your code; it produced text that looks like working code and then text that looks like a completion. Both are likely outputs that never required execution. A rule that redefines done as proven, plus a hook that runs your tests, closes the gap.
How do I force Claude Code to run tests before finishing?
Add a Stop hook in .claude/settings.json that runs your test command; a non-zero exit keeps the agent working instead of letting the turn end. Pair it with a CLAUDE.md rule that bans this should work and requires pasted output. The rule sets the expectation and the hook enforces it.
Can the agent still lie about test results?
A rule alone can be talked around, which is the whole problem with rules. A hook that actually runs the command and reads the real exit code cannot: the tests either pass or they do not, independent of what the agent claims. That is the difference between asking for proof and requiring it.
What if there are no tests yet?
Then done should at least mean the app starts and the changed path runs without error, with the command and output shown. The kit supports a tests-after or no-tests-yet policy, so the proof step scales down to running the app rather than demanding a suite you have not written.

Make your agent ask before it builds
One kit. Ask-first behavior, lean output, no AI slop. Works with Claude Code, Codex, and Antigravity.
Get the kit · $29One-time. Less than one hour of cleaning up AI slop.