How we prove every change saves you tokens

Why we treat token savings as a measurement problem, and the discipline behind every update we ship.

The Lineman team

It's easy to claim a tool saves tokens. Proving that a specific change actually saved them, without quietly costing you quality somewhere else, is much harder. We treat that as the real problem it is. Every change to how Lineman behaves has to earn its place against measurement before it ships.

Savings is a measurement problem

The naive way to evaluate a token optimiser is to run it once, see a smaller number, and ship. That's a trap. Agent runs are noisy. The same task can use very different token counts from one attempt to the next, for reasons that have nothing to do with your change. A single before-and-after comparison can show a "win" that's really just noise. It can just as easily hide a real regression underneath.

So we hold ourselves to two rules:

  • Measure against a noise floor. We don't trust a result unless the effect is bigger than the run-to-run variation we'd expect by chance. A change that can't clear that bar is unproven, not a win.
  • Measure cost and quality together. A change that saves tokens but degrades the answer isn't a saving. It's a hidden cost. Every evaluation pairs the token delta with a quality check, so we see both at once.
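A minimal sketch of how the two rules could be combined into a single verdict. It is an illustration under stated assumptions, not Lineman's actual harness: the function name `evaluate_change`, the noise floor defined as two standard errors of the difference in means, and the `max_quality_drop` tolerance are all hypothetical choices made for the example.

```python
from math import sqrt
from statistics import mean, stdev


def evaluate_change(baseline_tokens, candidate_tokens,
                    baseline_quality, candidate_quality,
                    z=2.0, max_quality_drop=0.01):
    """Pair a token delta with a quality check (hypothetical helper).

    Each argument is a list with one entry per repeated run of the same
    task set. Quality scores are assumed to lie in [0, 1], e.g. pass rate.
    """
    # Average tokens saved by the candidate relative to the baseline.
    token_delta = mean(baseline_tokens) - mean(candidate_tokens)

    # Assumed noise floor: z standard errors of the difference in means.
    standard_error = sqrt(
        stdev(baseline_tokens) ** 2 / len(baseline_tokens)
        + stdev(candidate_tokens) ** 2 / len(candidate_tokens)
    )
    noise_floor = z * standard_error

    # Positive value means the candidate scored worse on quality.
    quality_drop = mean(baseline_quality) - mean(candidate_quality)

    if quality_drop > max_quality_drop:
        verdict = "hidden cost"   # quality degraded, whatever the token delta
    elif token_delta > noise_floor:
        verdict = "win"           # saving is larger than run-to-run variation
    else:
        verdict = "unproven"      # effect cannot be told apart from noise

    return {
        "token_delta": token_delta,
        "noise_floor": noise_floor,
        "quality_drop": quality_drop,
        "verdict": verdict,
    }


# Illustrative numbers only, not measured results.
clear = evaluate_change([1000, 1100, 900, 1050, 950],
                        [600, 650, 550, 620, 580],
                        [1, 1, 0, 1, 1], [1, 1, 0, 1, 1])
noisy = evaluate_change([1000, 1400, 700, 1200, 900],
                        [950, 1350, 720, 1100, 880],
                        [1, 1, 0, 1, 1], [1, 1, 0, 1, 1])
degraded = evaluate_change([1000, 1100, 900, 1050, 950],
                           [600, 650, 550, 620, 580],
                           [1, 1, 0, 1, 1], [1, 0, 0, 1, 0])
print(clear["verdict"], noisy["verdict"], degraded["verdict"])
```

In the second call the candidate's average is lower, but the gap is far smaller than the spread between runs, so the sketch reports it as unproven instead of a win. In the third, the token saving is large but the quality scores fall, so it is reported as a hidden cost.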

Tested on real work

We lean on real-world tasks, including public benchmarks like SWE-Bench Pro built from actual GitHub issues, rather than tidy synthetic cases that flatter the tool. Real repositories are messy in the ways that matter, and that's exactly where a compression layer either holds up or falls down.

If we can't measure that a change helps, we don't ship it as if it did.

Always reversible

The discipline goes one step further. Behavioural changes are gated, so the previous behaviour is always one switch away. If something looks good in measurement but behaves badly in the wild, we can fall back without a scramble. That safety net is what lets us move quickly without betting your sessions on every experiment.
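One way to picture a gate like this is a small wrapper that holds both behaviours and a single switch. The sketch below is hypothetical: the class `GatedBehaviour`, its fields, and the whitespace-collapsing example are invented for illustration and say nothing about how Lineman implements its gates.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GatedBehaviour:
    """Hypothetical gate: route calls to the new or the previous behaviour."""
    name: str
    previous: Callable[[str], str]
    candidate: Callable[[str], str]
    enabled: bool = False  # default to the previous, known-good behaviour

    def __call__(self, text: str) -> str:
        implementation = self.candidate if self.enabled else self.previous
        return implementation(text)


def keep_as_is(text: str) -> str:
    # Previous behaviour in this toy example: pass the text through untouched.
    return text


def collapse_whitespace(text: str) -> str:
    # Candidate behaviour in this toy example: squeeze runs of whitespace.
    return " ".join(text.split())


gate = GatedBehaviour("collapse-whitespace", keep_as_is, collapse_whitespace)

gate.enabled = True
with_change = gate("def  f():\n    return   1")

gate.enabled = False  # one switch restores the previous behaviour
without_change = gate("def  f():\n    return   1")
```

Because the previous behaviour stays in the code path, falling back is a configuration change and needs no new release.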

Why we bother telling you this

The numbers we publish only mean something if the method behind them is sound. When we say Lineman saves 40% or more of your tokens while holding quality essentially flat, that's a claim we've held to this standard, not the result of a single lucky run. You can see the methodology and the results on our benchmarks page, and the underlying argument in our whitepaper.
