What the Tests Can’t Tell You

There is a particular kind of silence that settles over a room when you’re sitting with a codebase you didn’t build.

Maybe it came from an external agency at the end of a fixed-term engagement. Maybe it was built by a team that has since reorganised, and you are the one picking up the work. Maybe you have just taken on a new role and this is what your predecessor left behind. The circumstances vary, but the moment is recognisable: the project is, by all accounts, complete. There are services, schemas, API contracts, and a test suite. The folder structure looks reasonable. The coverage report says 78%. Someone gives you a walkthrough, pointing at the green checkmarks in the pipeline and calling it confidence.

You take the repository. You start reading. And somewhere in that process, between the third service you didn’t architect and the fifth test file you didn’t write, a question starts forming that you can’t quite shake: How do I know this is actually good?

Not “does it work today.” You can see that. The tests are passing, the pipeline is green, and the system is in front of you. But that isn’t the question you’re really asking. What you actually want to know is: Did they build this with discipline? Was testing a practice that ran through the entire engagement, or a formality applied at the end? Were failures fixed promptly, or allowed to accumulate? Did coverage grow naturally as features were built, or was it bolted on in the week before delivery?

You go back to the test files. You look at the coverage. You try to form a view.

And the uncomfortable truth is, you can’t.

A single frame from a film you weren’t there to watch

The codebase is a snapshot. It tells you what exists right now: the tests that are present, the coverage as it stands, the assertions that are currently written. What it cannot tell you is anything about the process that produced it.

Test files can be added the day before delivery. Coverage can be manufactured by writing tests that pass without asserting meaningful behaviour. A suite that looks healthy on the day of handover can have spent most of the engagement in a broken state, with failures quietly ignored or marked as expected until someone found time to address them.
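A minimal sketch of how that manufacturing works, using hypothetical function and test names: both tests below execute the same lines of `apply_discount`, so a line-coverage tool would credit them equally, but only the second would fail if the arithmetic were wrong.

```python
def apply_discount(price: float, percent: float) -> float:
    """Return the price reduced by the given percentage."""
    if percent < 0 or percent > 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)


def test_discount_runs():
    # Executes the happy path but asserts nothing:
    # it passes as long as no exception is raised.
    apply_discount(200.0, 25.0)


def test_discount_value():
    # Executes the same path and checks the result.
    assert apply_discount(200.0, 25.0) == 150.0
```

Reading the test file, you can spot the first kind if you look closely. What you cannot see is when either test was added, or how often it was run.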

There is no way to read any of this from the code. The git log will show you commits, but it won’t tell you whether the test suite was green when each one was made. The coverage report will show you a percentage, but not whether that number was stable for six months or reached in the final week. The test files will show you what was written, but not whether those tests were run consistently, run at all, or run and then quietly skipped when they became inconvenient.

You are reading the autopsy, not watching the patient.

This isn’t a failure of due diligence. It’s a structural problem. The information you need to evaluate testing discipline (the full history of how the test suite behaved throughout development) is rarely kept anywhere durable. CI systems often retain logs only for a limited period. Developers don’t keep records. Nobody maintains a running, day-by-day account. At the moment of handover, that history vanishes, replaced by a single static frame.

And so you make a judgement call based on insufficient evidence. You look at the coverage, you run the suite, you ask a few questions, and you sign off. Maybe you’re right. Maybe the agency was rigorous throughout. Maybe not. You don’t really know.

Trust is not a quality standard

The handover scenario is vivid, but the underlying problem isn’t unique to it. The same gap appears anywhere you’re asked to verify testing quality rather than simply take it on trust.

When you acquire a company and need to assess whether their engineering team operates to standard. When you bring an outsourced squad in-house and want to understand the habits they’ve built. When a key engineer leaves and the team inheriting their services needs to understand the state of what they’re taking on. When a compliance audit requires evidence that quality controls were in place throughout a project, not just at the point of submission.

In each of these situations, you need the history. The current snapshot, however convincing it looks, isn’t evidence of discipline. It’s evidence of what was true on a single day.

The history is the evidence

Obvyr is built around a simple idea: every test run is an event worth recording.

When a team integrates Obvyr (via the CLI tool or the Gradle plugin, with no changes to how developers actually work), every test execution becomes an observation: the command run, the exit code, the duration, and the full breakdown of individual test results. These observations accumulate over time, building a complete record of how the test suite has behaved throughout development.
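To make the idea of an observation concrete, here is a rough sketch of what recording one could look like. It is an illustration only: the `Observation` class, its field names, and the `observe` helper are assumptions made for this example, not Obvyr’s actual schema, CLI, or plugin behaviour.

```python
import subprocess
import sys
import time
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One test execution. Field names are illustrative, not Obvyr's schema."""
    command: list[str]
    exit_code: int
    duration_seconds: float
    # Per-test outcomes, e.g. {"test_login": "passed"}. Left empty here;
    # filling it would mean parsing the test runner's report.
    results: dict[str, str] = field(default_factory=dict)


def observe(command: list[str]) -> Observation:
    """Run a command and record what happened, whatever the outcome."""
    started = time.monotonic()
    completed = subprocess.run(command, capture_output=True, text=True)
    duration = time.monotonic() - started
    return Observation(command, completed.returncode, duration)
```

Calling `observe([sys.executable, "-m", "pytest"])` would return a record of that run whether it passed or failed. The point of the design is that failures are recorded just as faithfully as successes.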

From that record, a project dashboard surfaces what a static snapshot never could: pass rates over time, execution volume, flakiness scores, and the trajectory of quality across the engagement. For any individual test, a Test Health Profile shows its full history: how many times it has been run, its pass rate across every execution, when it first failed, when it last failed, and how many consecutive passes it has accumulated since. Not what state it is in today. What it looked like at every point across the engagement.
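The per-test figures described above can all be derived from a plain list of timestamped results. The sketch below shows one way to do it, under stated assumptions: `Run`, `HealthProfile`, and `health_profile` are hypothetical names chosen for this example and do not describe how Obvyr computes its profiles.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Run:
    at: datetime   # when the test was executed
    passed: bool   # outcome of that execution


@dataclass
class HealthProfile:
    runs: int
    pass_rate: float
    first_failed: Optional[datetime]
    last_failed: Optional[datetime]
    consecutive_passes: int


def health_profile(history: list[Run]) -> HealthProfile:
    """Summarise one test's recorded executions."""
    if not history:
        raise ValueError("history must contain at least one run")
    ordered = sorted(history, key=lambda run: run.at)
    failures = [run.at for run in ordered if not run.passed]

    # Count passes backwards from the most recent run until a failure.
    streak = 0
    for run in reversed(ordered):
        if not run.passed:
            break
        streak += 1

    return HealthProfile(
        runs=len(ordered),
        pass_rate=sum(run.passed for run in ordered) / len(ordered),
        first_failed=failures[0] if failures else None,
        last_failed=failures[-1] if failures else None,
        consecutive_passes=streak,
    )
```

None of these values can be reconstructed from the repository alone. They exist only if each execution was recorded at the time it happened.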

At the moment of handover, or at any point during an engagement, a VP of Engineering can open that dashboard and see exactly how testing discipline played out in practice. Not a coverage number. Not a green pipeline. The actual record of what happened.

If the suite was consistently green throughout development, that is visible. If tests were failing and ignored, that is visible too. If coverage was assembled in a rush before delivery, the test execution history tells a different story than the coverage report does. The data cannot be retrospectively tidied up in the way a codebase can.

Start building the record

If you are about to take ownership of a codebase, about to accept delivery from a third party, or about to hand off a project to a new team: the time to collect this evidence is during the engagement, not after it ends.

Obvyr captures testing discipline as it happens, so that when accountability matters, the record is already there.

To talk about what this looks like for your team, reach out at hello@obvyr.com.
