It's easy to claim a tool saves tokens. Proving that a specific change actually saved them, without quietly costing you quality somewhere else, is much harder. We treat that as the real problem it is. Every change to how Lineman behaves has to earn its place against measurement before it ships.
Savings is a measurement problem
The naive way to evaluate a token optimiser is to run it once, see a smaller number, and ship. That's a trap. Agent runs are noisy. The same task can use very different token counts from one attempt to the next, for reasons that have nothing to do with your change. A single before-and-after comparison can show a "win" that's really just noise. It can just as easily hide a real regression underneath.
So we hold ourselves to two rules:
- Measure against a noise floor. We don't trust a result unless the effect is bigger than the run-to-run variation we'd expect by chance. A change that can't clear that bar is unproven, not a win.
- Measure cost and quality together. A change that saves tokens but degrades the answer isn't a saving. It's a hidden cost. Every evaluation pairs the token delta with a quality check, so we see both at once.
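To make the two rules concrete, here is a minimal sketch of how they could be combined into a single verdict. It is an illustration, not Lineman's actual evaluation harness: the function name `evaluate_change`, the two-standard-error noise floor, and the quality tolerance are all assumptions chosen for the example.

```python
from math import sqrt
from statistics import mean, variance


def evaluate_change(base_tokens, cand_tokens, base_quality, cand_quality,
                    noise_multiplier=2.0, max_quality_drop=0.02):
    """Classify a change from repeated runs of the same tasks.

    All names and thresholds here are hypothetical, chosen for illustration.
    Token lists hold one token count per run; quality lists hold one score
    per run (for example 1.0 for a passing task, 0.0 for a failing one).
    """
    # Observed effect: mean tokens saved per run.
    saving = mean(base_tokens) - mean(cand_tokens)

    # Noise floor: a multiple of the standard error of the difference
    # between the two means, estimated from run-to-run variation.
    std_error = sqrt(variance(base_tokens) / len(base_tokens)
                     + variance(cand_tokens) / len(cand_tokens))
    noise_floor = noise_multiplier * std_error

    # Quality is checked alongside cost, never separately.
    quality_drop = mean(base_quality) - mean(cand_quality)

    if quality_drop > max_quality_drop:
        verdict = "hidden cost"   # saves tokens (maybe) but degrades answers
    elif saving > noise_floor:
        verdict = "win"           # effect is larger than expected noise
    else:
        verdict = "unproven"      # cannot be told apart from noise
    return {"saving": saving, "noise_floor": noise_floor,
            "quality_drop": quality_drop, "verdict": verdict}


# Five runs each; the saving is far larger than the run-to-run spread.
result = evaluate_change(
    base_tokens=[1000, 1100, 900, 1050, 950],
    cand_tokens=[600, 650, 550, 620, 580],
    base_quality=[1, 1, 0, 1, 1],
    cand_quality=[1, 1, 0, 1, 1],
)
print(result["verdict"])
```

With only a few noisy runs per side, the same function returns "unproven" even when the candidate's average is lower, which is the point of the first rule.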
Tested on real work
We lean on real-world tasks, including public benchmarks like SWE-Bench Pro built from actual GitHub issues, rather than tidy synthetic cases that flatter the tool. Real repositories are messy in the ways that matter, and that's exactly where a compression layer either holds up or falls down.
If we can't measure that a change helps, we don't ship it as if it did.
Always reversible
The discipline goes one step further. Behavioural changes are gated, so the previous behaviour is always one switch away. If something looks good in measurement but behaves badly in the wild, we can fall back without a scramble. That safety net is what lets us move quickly without betting your sessions on every experiment.
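One common way to keep the previous behaviour one switch away is a feature flag that defaults to off. The sketch below shows that general pattern only. The flag name `EXAMPLE_NEW_COMPRESSION` and both functions are hypothetical stand-ins, not Lineman's real settings or code.

```python
import os


def compress_previous(text):
    """Stand-in for the existing, trusted behaviour: trim the ends only."""
    return text.strip()


def compress_experimental(text):
    """Stand-in for a new behaviour under evaluation: collapse whitespace."""
    return " ".join(text.split())


def compress(text, env=os.environ):
    # The flag name is hypothetical. Anything other than "1" (including the
    # flag being absent) selects the previous behaviour, so falling back
    # means flipping one setting rather than redeploying code.
    if env.get("EXAMPLE_NEW_COMPRESSION", "0") == "1":
        return compress_experimental(text)
    return compress_previous(text)


# Passing the environment explicitly keeps the example easy to test.
print(repr(compress("  a   b  ", env={})))
print(repr(compress("  a   b  ", env={"EXAMPLE_NEW_COMPRESSION": "1"})))
```

Defaulting to the old path means a missing or mistyped setting fails safe: the experiment stays off unless it is explicitly enabled.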
Why we bother telling you this
The numbers we publish only mean something if the method behind them is sound. When we say Lineman saves 40%+ of tokens while holding quality essentially flat, that claim rests on repeated measurement against a noise floor, not on a single lucky run. You can see the methodology and the results on our benchmarks page, and the underlying argument in our whitepaper.