← All news
Research

How we measured 40%+ token savings

The methodology behind our headline numbers: 180 tasks, six benchmark suites, and what we held constant.

The Lineman team

A headline number is only as good as the method behind it. When we say Lineman cuts token usage by 53% on average while holding quality essentially flat, we owe you the part most launch posts skip: what we measured, what we deliberately held constant, and how we checked that the savings didn't quietly cost accuracy. So here it is.

What we measured

We ran 180 tasks across six independent benchmark suites, including SWE-bench Lite, which draws from real GitHub issues in popular open-source repositories. Each task ran twice. Once as a baseline with the agent untouched, and once with Lineman in front of the data-heavy tool calls. Same task, same agent, same prompt. The only difference was whether the secondary model compressed tool output before it reached the primary model.

For every run we recorded two things that matter independently:

  • Tokens consumed by the primary model, the cost axis.
  • Output quality of the final result, the axis that keeps the cost number honest.

What we held constant

A token-savings number is meaningless if you change the task to get it, so we didn't. The benchmark task, the primary model, and the agent behaviour were identical between baseline and treatment. The only variable was Lineman. That isolation is the whole point: any difference in tokens or quality comes from the routing, not from a friendlier test set.

The figures

ResultFigure
Tasks measured180
Benchmark suites6
Average token reduction53%
SWE-bench Lite token reduction~48%
Quality retention (external benchmarks)98.3%
Internal data-heavy tasksup to 75%

The internal figure is higher because those tasks are exactly the data-heavy shape Lineman is built for. The external benchmarks are the conservative, generalisable read.

Why the quality check matters most

Saving tokens is trivial if you don't care about the answer. Throw away half the input and the number looks great. The discipline is proving the answer survives.

A token saved is only a saving if the result is still right. Otherwise it's a regression with good marketing.

That's why every run carries a paired quality score, not just a token count. Across external benchmarks, treatment runs held 98.3% of baseline output quality. The compression strips redundancy rather than signal. The agent still gets the symbols, the failing assertion, the relevant section. It just stops paying frontier rates to read the parts it was going to ignore.

The full per-suite breakdown, with token deltas, quality deltas, and success rates side by side, is published on the benchmarks page. We'd rather you interrogate the method than take the headline on faith.

Related