Research

SWE-Bench Pro, in the open

Browse our benchmark runs task by task: Lineman against a baseline, with the receipts.

The Lineman team

Benchmark averages are easy to trust and easy to hide behind. A single "53% savings" headline tells you the aggregate held up. It doesn't tell you what happened on any one task, where the wins were biggest, or where Lineman barely moved the needle. So we publish the per-task runs too, in the open, on real software-engineering problems.

Our SWE-Bench Pro results page shows Lineman against a baseline Claude Code session, task by task, on real-world engineering work.

What "in the open" means

For each task we run the same problem twice, once with a baseline Claude Code session and once with Lineman routing the data-heavy tool calls (the treatment), and we publish what we found side by side:

  • Token usage and cost for baseline and treatment, so you can see the saving in dollars, not just percentages.
  • Turn count, because an efficient run shouldn't need more steps to get there.
  • Resolution outcome, whether the task was actually solved, so a cheaper run that broke the answer can't hide inside an average.

Every task carries its own row. The wins are visible and so are the ties, because both belong in an honest picture.
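As a rough illustration of what one such row holds, here is a minimal sketch in Python. The `Run` record, the `compare` function, the field names, and all the numbers are hypothetical, made up for this example; they are not Lineman's actual schema or published results.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One agent run on one task (hypothetical fields)."""
    tokens: int
    cost_usd: float
    turns: int
    resolved: bool


def compare(baseline: Run, treatment: Run) -> dict:
    """Build one side-by-side row from a baseline run and a treatment run."""
    return {
        # Percentage of baseline tokens the treatment avoided.
        "token_saving_pct": round(
            100 * (baseline.tokens - treatment.tokens) / baseline.tokens, 1
        ),
        # The same saving expressed in dollars.
        "cost_saving_usd": round(baseline.cost_usd - treatment.cost_usd, 2),
        # Positive means the treatment needed more steps.
        "extra_turns": treatment.turns - baseline.turns,
        # True when the baseline solved the task but the treatment did not.
        "regressed": baseline.resolved and not treatment.resolved,
    }


# Illustrative numbers only.
row = compare(
    Run(tokens=200_000, cost_usd=1.20, turns=30, resolved=True),
    Run(tokens=90_000, cost_usd=0.55, turns=28, resolved=True),
)
print(row)
```

Keeping `regressed` as its own field, rather than folding it into a score, is the point of the exercise: a cheaper run that broke the answer stays visible in its own row.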

Why per-task, not just an average

An average can launder a bad result. A method that saves 70% on five tasks and quietly fails the sixth can still post a strong headline, and you'd never know unless someone showed you the sixth. Publishing each run is how we stay honest and give you something you can actually audit.

If a benchmark only shows you the average, it's showing you the number it likes. We show you all of them.
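The six-task scenario above can be worked through in a few lines. This is a sketch under one stated assumption: the failed sixth run also saves 70%, which is what lets it disappear into the headline. The numbers are illustrative, not measured results.

```python
from statistics import mean

# (token saving in percent, task resolved?) for six hypothetical runs.
# Assumption: the sixth run is just as cheap as the others but fails the task.
runs = [(70.0, True)] * 5 + [(70.0, False)]

# The headline number: average saving across all six runs.
avg_saving = mean(saving for saving, _ in runs)

# What the headline omits: how many tasks were actually solved, and which were not.
resolved = sum(1 for _, ok in runs if ok)
failures = [i for i, (_, ok) in enumerate(runs, start=1) if not ok]

print(f"average saving: {avg_saving:.0f}%")
print(f"resolved: {resolved}/{len(runs)}, failed tasks: {failures}")
```

The average alone reports a clean 70%. Only the per-task rows show that task 6 was not solved.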

It also reflects how the savings really behave. Lineman's biggest wins land on the most data-heavy tasks: large files to read, long logs to triage, sprawling search results to digest. On a lighter task there's less data tax to cut in the first place. The per-task view lets you see that pattern rather than take our word for it.

The picture it adds up to

Across our broader benchmark work, 180 tasks in six suites, Lineman delivers a 53% average token reduction while retaining 98.3% of baseline output quality, with up to 75% savings on the internal data-heavy tasks it's built for. The SWE-Bench Pro page is where you drill past that average into the individual real-world runs behind it.

Go read the runs, then read the methodology behind every number on the page. We'd rather you check our work than trust our headline.
