How we measured 40%+ token savings

A headline number is only as good as the method behind it. When we say Lineman cuts token usage by 53% on average while holding quality essentially flat, we owe you the part most launch posts skip: what we measured, what we deliberately held constant, and how we checked that the savings didn't quietly cost accuracy. So here it is.

What we measured

We ran 180 tasks across six independent benchmark suites, including SWE-bench Lite, which draws from real GitHub issues in popular open-source repositories. Each task ran twice. Once as a baseline with the agent untouched, and once with Lineman in front of the data-heavy tool calls. Same task, same agent, same prompt. The only difference was whether the secondary model compressed tool output before it reached the primary model.

For every run we recorded two things that matter independently:

Tokens consumed by the primary model, the cost axis.
Output quality of the final result, the axis that keeps the cost number honest.

What we held constant

A token-savings number is meaningless if you change the task to get it, so we didn't. The benchmark task, the primary model, and the agent behaviour were identical between baseline and treatment. The only variable was Lineman. That isolation is the whole point: any difference in tokens or quality comes from the routing, not from a friendlier test set.

The figures

Result	Figure
Tasks measured	180
Benchmark suites	6
Average token reduction	53%
SWE-bench Lite token reduction	~48%
Quality retention (external benchmarks)	98.3%
Internal data-heavy tasks	up to 75%

The internal figure is higher because those tasks are exactly the data-heavy shape Lineman is built for. The external benchmarks are the conservative, generalisable read.

Why the quality check matters most

Saving tokens is trivial if you don't care about the answer. Throw away half the input and the number looks great. The discipline is proving the answer survives.

A token saved is only a saving if the result is still right. Otherwise it's a regression with good marketing.

That's why every run carries a paired quality score, not just a token count. Across external benchmarks, treatment runs held 98.3% of baseline output quality. The compression strips redundancy rather than signal. The agent still gets the symbols, the failing assertion, the relevant section. It just stops paying frontier rates to read the parts it was going to ignore.

The full per-suite breakdown, with token deltas, quality deltas, and success rates side by side, is published on the benchmarks page. We'd rather you interrogate the method than take the headline on faith.

How we measured 40%+ token savings

What we measured

What we held constant

The figures

Why the quality check matters most

Related

SWE-Bench Pro, in the open

Why a sidekick beats a bigger context window

The case for task-specific model routing