AI can write tests that pass. That’s not the same as writing tests that test something. The difference matters more than developers initially appreciate, and the gap is wide enough that AI-written test suites often increase technical debt rather than reducing it. A test that runs is not the same as a test that fails when the code is broken, and only the second kind earns its place in the codebase.

Last updated: May 3, 2026

This article walks through the specific patterns of fake-but-passing tests that AI tools generate, how to recognize them, and how to use Claude to write tests that actually test something. The patterns repeat across all AI coding tools; they are not a Claude-specific or Copilot-specific problem. They are structural to how language models approach test-writing.

Key Takeaways

  • The most common fake-test pattern: AI writes a test that mocks the function under test, calls a wrapper, and asserts that the mock was called.
  • AI writes a test that calls a function and asserts the return value is “not None” or “is a string.” This passes if the function returns anything at all.
  • AI tends to write happy-path tests and skip the edge cases (empty input, malformed data, boundary values, error conditions) that cause real bugs.
  • A subtle pattern: AI writes tests that pass because they happen to match the current implementation rather than the function’s contract.
  • The prompt that produces real tests asks for a comment stating what failure each test would catch, coverage of at least one edge case, no mocking of the function under test, and assertions on observable output.

The rest of this article walks through the reasoning behind each of these claims, with specific tools, numbers, and methodology where relevant. Skim the section headings if you are short on time, or read straight through for the full case.

How We Tested

The recommendations in this article come from hands-on use, not vendor talking points. Bloxtra’s methodology is consistent across categories: we run each tool on twenty fixed prompts at default settings, accept the first three outputs without re-rolls, and grade the median rather than the cherry-pick. Reviews stay open for at least two weeks of daily use before publishing, and we revisit them whenever the underlying tool changes meaningfully. We don’t accept paid placements, and our rankings are not influenced by affiliate revenue.

Scoring follows a published rubric called the Bloxtra Score: Quality (30%), Usefulness in real work (25%), Trust and honesty (20%), Speed (15%), Value for money (10%). The same rubric applies across every category, so a 78 in Chatbots and a 78 in Coding mean genuinely comparable tools. Read the full methodology on our About page, where we publish our review process, conflict-of-interest policy, and editorial standards.

Pattern 1: Mock and Assert The Mock

The most common fake-test pattern: AI writes a test that mocks the function under test, calls a wrapper, and asserts that the mock was called. The test passes. The actual function could be deleted entirely and the test would still pass.
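A minimal sketch of the pattern in pytest-style Python, using a hypothetical `orders` module whose `checkout` wrapper calls `calculate_total`:

```python
from unittest.mock import patch

import orders  # hypothetical module under test


def test_checkout_calculates_total():
    # The function under test is mocked away, so its real logic never runs.
    with patch("orders.calculate_total", return_value=100) as mock_total:
        orders.checkout(cart_id=1)
        # This only proves the wrapper called the mock. Delete the body of
        # calculate_total and the test still passes.
        mock_total.assert_called_once()
```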

Recognize this by reading the test. If the assertion is “this function was called” rather than “this function returned the right thing,” the test is checking that the test was written, not that the code works. Delete it or rewrite it.

The fix: tests should check observable behavior, meaning return values, side effects on real (or test) databases, and output written to files. Mocks are useful for isolating the unit under test, but the assertion should be on the result, not on the mock.
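A rewritten version against the same hypothetical module asserts on the value the real function returns:

```python
from decimal import Decimal

import orders  # same hypothetical module


def test_calculate_total_sums_line_items():
    # Would catch: wrong arithmetic, ignored quantities, dropped items.
    cart = [
        {"price": Decimal("9.99"), "qty": 2},
        {"price": Decimal("5.00"), "qty": 1},
    ]
    assert orders.calculate_total(cart) == Decimal("24.98")
```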

Pattern 2: Trivial Assertions

AI writes a test that calls a function and asserts the return value is “not None” or “is a string.” This passes if the function returns anything at all. It doesn’t check that the value is correct, just that it exists.

Recognize this by reading the assertion. “assert result” with nothing more is suspicious. “assert isinstance(result, dict)” without checking what is in the dict is checking very little. Real assertions look like “assert result.total == 42” or “assert result == expected_data”.

The fix: assertions should be specific. The test should fail if the function produces wrong output, not just if it produces no output.
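As an illustration, here are a trivial and a specific version of the same test against a hypothetical `parse_user` function; only the second one fails when a field is parsed wrong:

```python
import orders  # hypothetical module with a parse_user(json_str) helper


def test_parse_user_trivial():
    result = orders.parse_user('{"name": "Ada", "age": 36}')
    assert result is not None        # passes for any return value at all
    assert isinstance(result, dict)  # passes even if every field is wrong


def test_parse_user_specific():
    # Would catch: dropped fields, wrong types, mis-parsed values.
    result = orders.parse_user('{"name": "Ada", "age": 36}')
    assert result == {"name": "Ada", "age": 36}
```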

Pattern 3: Tests That Don’t Cover Edge Cases

AI tends to write happy-path tests. Function called with normal input, returns normal output. The edge cases (empty input, oversized input, unicode, concurrency, malformed data) are skipped because the AI focused on the central case.

Recognize this by checking what the test covers. If the input is always valid and the output is always expected, the test is missing the cases that cause real bugs. The fix: ask Claude explicitly to “write tests for the edge cases: empty input, malformed input, boundary values, error conditions.”
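A sketch of those edge-case tests for the hypothetical `calculate_total` from earlier, assuming it returns zero for an empty cart and raises `ValueError` on malformed line items:

```python
import pytest

import orders  # hypothetical module


def test_calculate_total_empty_cart():
    # Would catch: crashing or returning None instead of zero on empty input.
    assert orders.calculate_total([]) == 0


def test_calculate_total_rejects_negative_quantity():
    # Would catch: silently accepting malformed data.
    with pytest.raises(ValueError):
        orders.calculate_total([{"price": 10, "qty": -1}])
```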

Pattern 4: Tests That Check The Implementation, Not The Contract

A subtle pattern: AI writes tests that pass because they happen to match the current implementation rather than the function’s contract. Refactor the implementation while preserving behavior, and the tests fail even though nothing is broken.

Recognize this by asking: would this test still make sense if the implementation changed? If the test breaks because the function uses a different sort algorithm but produces the same sorted output, the test is checking the wrong thing.

The fix: tests should check the contract (what the function promises to do), not the implementation details. Avoid asserting on private state, internal call sequences, or implementation-specific behavior.
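To make the distinction concrete, here is a hypothetical `top_customers` function tested both ways: the first test is coupled to an internal `_quicksort` helper, the second only to the promised output.

```python
from unittest.mock import patch

import orders  # hypothetical module with top_customers() and a private _quicksort()

CUSTOMERS = {"alice": 120, "bob": 80, "carol": 200}


def test_top_customers_implementation_coupled():
    # Breaks when the sort strategy changes, even if the output is identical.
    with patch("orders._quicksort", wraps=sorted) as mock_sort:
        orders.top_customers(CUSTOMERS, limit=2)
        mock_sort.assert_called_once()


def test_top_customers_contract():
    # Would catch: wrong ordering, ignored limit, dropped customers.
    assert orders.top_customers(CUSTOMERS, limit=2) == ["carol", "alice"]
```

Swap `_quicksort` for Python’s built-in `sorted` and the first test fails while the behavior stays correct; the second keeps passing.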

How to Write Real Tests with Claude

The prompt that produces real tests: “Write tests for this function. For each test, state in a comment what specific failure the test would catch. Cover at least one edge case (empty input, boundary value, or error condition). Don’t mock the function under test. Assertions should be on observable output, not on mock calls.”

The “state what failure the test catches” constraint forces the model to think about the test’s purpose, not just to produce code that passes. Tests that can’t articulate what they would catch are usually not catching anything.
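Run against a hypothetical `slugify(text)` function, that prompt should produce something shaped like this, with each test naming the failure it exists to catch:

```python
import pytest

from textutils import slugify  # hypothetical module and function


def test_slugify_replaces_spaces_with_hyphens():
    # Catches: spaces left in the slug or replaced with the wrong separator.
    assert slugify("Hello World") == "hello-world"


def test_slugify_empty_string():
    # Catches: crashing or returning None on empty input instead of an empty slug.
    assert slugify("") == ""


def test_slugify_rejects_non_string_input():
    # Catches: silently coercing invalid input instead of raising.
    with pytest.raises(TypeError):
        slugify(None)
```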

After the AI produces tests, run them against a deliberately broken version of the function. If the tests pass when the function is wrong, the tests are not testing. Fix or delete them.

The Mutation Testing Check

A more rigorous check: introduce a deliberate bug into the function (change a > to a <, swap two arguments, return a constant), and run the tests. If they still pass, they are not testing the changed behavior. This is mutation testing, manually applied.
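A quick illustration with a hypothetical `is_adult` function whose `>=` has been deliberately flipped to `>`:

```python
def is_adult(age: int) -> bool:
    # Deliberate mutant: the original used `>=`, so 18 is now misclassified.
    return age > 18


def test_is_adult_trivial():
    # Still passes against the mutant, so it is not really testing the boundary.
    assert is_adult(30) is True


def test_is_adult_boundary():
    # Fails against the mutant, which is exactly what a real test should do.
    assert is_adult(18) is True
```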

You don’t need to do this for every test, but doing it occasionally on tests you wrote with AI help builds intuition for which patterns produce real tests and which produce fake ones. After a few rounds, you start writing AI prompts that produce real tests by default.

When AI Test Generation Is Worth It

For boilerplate test setup (fixtures, mocks of external services, test data factories), AI generation saves real time. For simple unit tests where the contract is clear and the assertions are obvious, AI generation is fine.
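For example, a fixture for a faked payment gateway and a small test-data factory are exactly the kind of setup worth generating; every name below is illustrative:

```python
from unittest.mock import MagicMock

import pytest


@pytest.fixture
def payment_gateway():
    # Stand-in for an external service the tests should never really call.
    gateway = MagicMock()
    gateway.charge.return_value = {"status": "ok", "id": "ch_test"}
    return gateway


def make_order(**overrides):
    # Test-data factory: sensible defaults, overridable per test.
    order = {"id": 1, "items": [{"sku": "A1", "qty": 2}], "currency": "USD"}
    order.update(overrides)
    return order
```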

Tests that require thinking about edge cases, error conditions, or non-obvious failure modes are where AI most often produces fake-but-passing output. Generate the easy parts; write the hard parts manually or with heavy AI iteration.

Frequently Asked Questions

Are all AI-written tests bad?

No. AI is competent at simple unit tests with clear contracts. The problems show up on edge cases, error handling, and tests that should check non-obvious behavior.

How do I tell if a test is fake?

Read the assertion. If it doesn’t say what the function should do (returned this specific value, produced this specific output), it’s probably not testing anything useful.

Should I write all my tests by hand?

No. Use AI for boilerplate and simple cases, write the harder tests manually. Mutation-test occasionally to catch fake passes.

What is the best AI tool for writing tests?

Claude in our testing produces somewhat fewer fake-test patterns than competitors, but no AI is immune. The prompt matters more than the tool.

Should I run AI-written tests against broken code?

Yes. Occasionally introduce deliberate bugs and check that AI-written tests catch them. This catches fake-pass tests early.

What This Means in Practice

The honest answer for most readers: pick the option that fits your specific situation, test it on real work for at least two weeks before committing, and revisit the decision when the underlying tools change. AI tools update frequently enough that what is correct today may not be correct in six months. Build in a re-evaluation step every quarter for any tool that occupies a meaningful slot in your workflow.

Avoid the temptation to over-stack tools. The friction of switching between five tools eats into the productivity gain that any individual tool provides. The teams that get the most from AI are usually the ones using two or three tools deeply, not the ones with subscriptions to a dozen.

My Take

AI tests that pass are not the same as tests that test. Watch for mock-and-assert-the-mock, trivial assertions, missing edge cases, and implementation-checking tests. Use Claude with the right prompt to produce real tests, and verify with manual mutation testing. Try Claude free at claude.ai on real work this week.

If you have questions about anything covered here, or want us to test a specific tool, email editorial@bloxtra.com. We read every message and reply within a working day. Corrections are dated and public: when we get something wrong or when a tool changes meaningfully after we publish, we update the article and note the change at the bottom.

Related reading: AI code review with Claude, Coding AI failure modes, Best AI coding tools.