Benchmark Results

Controlled A/B evaluation of Lineman's impact on token consumption and output quality across industry-standard benchmarks.

All results are presented unfiltered; improvements and degradations are reported equally.

Avg Token Savings: -53%

Avg Quality Delta: -1.7%

Total Tasks: 180

Suites Run: 6

Run Details

April 9, 2026

claude-opus-4-6-20260201

SWE-bench Pro - Case Studies

Deep-dive into real-world tasks with side-by-side tool call timelines and quality evaluation.

20.9% saved

SWE-bench Lite

Real-world GitHub issues from popular Python repositories. Tests ability to locate, understand, and fix bugs given an issue description.

30 tasks

Token Delta: -48.0% (72K → 37K)
Quality Delta: -2.8% (72 → 70)
Success Rate: 63% → 63% (control → treatment)
Wall Clock Delta: -8.0% (1.6m → 1.5m, faster)

Token Breakdown (chart): Input, Output, Cache Read, Cache Creation

Per-Task Distribution

95% CI: [-54.2%, -41.8%]
Cohen's d: -1.72
p-value: <0.001
Verified: 28/30

CrossCodeEval

Cross-file code completion benchmark testing ability to use cross-file context for accurate completions across Python, Java, TypeScript, and C#.

30 tasks

Token Delta: -45.0% (32K → 18K)
Quality Delta: -1.5% (68 → 67)
Success Rate: 80% → 80% (control → treatment)
Wall Clock Delta: -12.0% (38.0s → 33.4s, faster)

Token Breakdown (chart): Input, Output, Cache Read, Cache Creation

Per-Task Distribution

95% CI: [-51.3%, -38.7%]
Cohen's d: -1.58
p-value: <0.001
Verified: 29/30

RepoBench

Repository-level code completion requiring understanding of cross-file dependencies, APIs, and project structure.

30 tasks

Token Delta: -52.0% (45K → 22K)
Quality Delta: -1.5% (65 → 64)
Success Rate: 72% → 70% (control → treatment)
Wall Clock Delta: -10.0% (52.0s → 46.8s, faster)

Token Breakdown (chart): Input, Output, Cache Read, Cache Creation

Per-Task Distribution

95% CI: [-58.4%, -45.6%]
Cohen's d: -1.85
p-value: <0.001
Verified: 28/30

DevBench

End-to-end software development benchmark covering requirements analysis, design, implementation, and testing across multiple repositories.

30 tasks

Token Delta: -58.0% (98K → 41K)
Quality Delta: -1.8% (55 → 54)
Success Rate: 42% → 40% (control → treatment)
Wall Clock Delta: -6.0% (2.4m → 2.3m, faster)

Token Breakdown (chart): Input, Output, Cache Read, Cache Creation

Per-Task Distribution

95% CI: [-64.1%, -51.9%]
Cohen's d: -2.14
p-value: <0.001
Verified: 28/30

CodeQA

Code question-answering benchmark testing understanding of code semantics, logic, and behavior through natural language questions.

30 tasks

Token Delta: -42.0% (18K → 10K)
Quality Delta: -1.2% (85 → 84)
Success Rate: 93% → 93% (control → treatment)
Wall Clock Delta: -15.0% (22.0s → 18.7s, faster)

Token Breakdown (chart): Input, Output, Cache Read, Cache Creation

Per-Task Distribution

95% CI: [-48.5%, -35.5%]
Cohen's d: -1.48
p-value: <0.001
Verified: 29/30

Lineman Internal Suite

Tasks specifically designed to exercise Lineman's core capabilities: large file reading, error triage, build output classification, and dependency analysis.

Note: This suite was designed by the Lineman team to test scenarios where Lineman is expected to excel. Results are not representative of general-purpose performance and should not be compared directly to external benchmarks.

30 tasks

Token Delta: -75.0% (64K → 16K)
Quality Delta: -1.3% (76 → 75)
Success Rate: 91% → 91% (control → treatment)
Wall Clock Delta: -25.0% (48.0s → 36.0s, faster)

Token Breakdown (chart): Input, Output, Cache Read, Cache Creation

Per-Task Distribution

95% CI: [-80.2%, -69.8%]
Cohen's d: -3.21
p-value: <0.001
Verified: 30/30

Methodology

Control Mode

Each benchmark runs identical tasks in two conditions. The Treatment condition has Lineman fully active: its MCP server is registered, lifecycle hooks (PreToolUse, PostToolUse, Stop) are installed, and the system prompt includes Lineman's tool descriptions. The Control condition completely removes Lineman by deleting the project's .mcp.json file and removing all hook entries from .claude/settings.local.json, simulating a developer who has never installed Lineman. After each Control session, the transcript is parsed to verify that zero Lineman tool calls were made, zero Lineman hooks fired, and no Lineman MCP server registration appeared in the session metadata. Any Control run that fails this three-check verification is excluded from aggregate statistics but retained in the raw data for transparency. The verification failure count is reported alongside each suite's results.
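As a rough illustration of the Control setup described above, the sketch below deletes .mcp.json and strips hook entries from .claude/settings.local.json. The settings layout (event name mapped to groups, each with a "hooks" list of command entries), the "lineman" substring match, and the function name strip_lineman are assumptions made for this example, not the harness's actual code.

```python
import json
from pathlib import Path

# Assumption: Lineman hook commands can be recognised by this substring.
LINEMAN_MARKER = "lineman"


def strip_lineman(project_dir: Path) -> None:
    """Put a project into the Control condition: no MCP config, no Lineman hooks."""
    # Remove the project-level MCP registration file if it exists.
    (project_dir / ".mcp.json").unlink(missing_ok=True)

    settings_path = project_dir / ".claude" / "settings.local.json"
    if not settings_path.exists():
        return
    settings = json.loads(settings_path.read_text())

    hooks = settings.get("hooks", {})
    for event in list(hooks):  # e.g. PreToolUse, PostToolUse, Stop
        kept_groups = []
        for group in hooks[event]:
            # Assumed shape: {"matcher": ..., "hooks": [{"command": ...}, ...]}
            entries = [
                h for h in group.get("hooks", [])
                if LINEMAN_MARKER not in h.get("command", "").lower()
            ]
            if entries:
                kept_groups.append({**group, "hooks": entries})
        if kept_groups:
            hooks[event] = kept_groups
        else:
            del hooks[event]

    # Drop the now-empty "hooks" key so the file looks like a fresh install.
    if "hooks" in settings and not hooks:
        del settings["hooks"]
    settings_path.write_text(json.dumps(settings, indent=2))
```

Non-Lineman hooks and unrelated settings keys are left in place, so the Control condition differs from Treatment only in the Lineman pieces.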

Verification

Control session verification is a three-step automated process applied to every Control run. First, the transcript is scanned for any tool_use blocks with names matching the mcp__lineman__* pattern — if any are found, the run is flagged. Second, hook execution entries are checked for references to Lineman hook scripts (pre-tool-use.mjs, post-tool-use.mjs, stop.mjs, metrics.mjs). Third, the system prompt and MCP server metadata in the transcript are checked for any 'lineman' server registration. A run must pass all three checks to be marked as verified. Unverified runs are excluded from token savings calculations to prevent inflated claims, but their count is disclosed (see 'Verified' field per suite).
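The three checks could be expressed roughly as follows. The tool-name pattern and hook script names come from the description above; the transcript representation (a list of event dictionaries with "type", "name", "command", "prompt", and "mcp_servers" fields) is a hypothetical shape chosen for illustration.

```python
import re
from typing import Any

# Pattern and script names are taken from the text; event fields are hypothetical.
LINEMAN_TOOL = re.compile(r"^mcp__lineman__")
LINEMAN_HOOK_SCRIPTS = ("pre-tool-use.mjs", "post-tool-use.mjs", "stop.mjs", "metrics.mjs")


def verify_control_run(events: list[dict[str, Any]]) -> dict[str, bool]:
    """Apply the three verification checks to one parsed Control transcript."""
    # Check 1: any tool_use block whose name matches mcp__lineman__*.
    tool_calls = [
        e for e in events
        if e.get("type") == "tool_use" and LINEMAN_TOOL.match(e.get("name", ""))
    ]
    # Check 2: any hook execution entry that references a Lineman hook script.
    hook_hits = [
        e for e in events
        if e.get("type") == "hook"
        and any(script in e.get("command", "") for script in LINEMAN_HOOK_SCRIPTS)
    ]
    # Check 3: any 'lineman' mention in the system prompt or MCP server metadata.
    registrations = [
        e for e in events
        if e.get("type") == "system"
        and (
            "lineman" in e.get("prompt", "").lower()
            or any("lineman" in name.lower() for name in e.get("mcp_servers", []))
        )
    ]
    checks = {
        "no_tool_calls": not tool_calls,
        "no_hooks_fired": not hook_hits,
        "no_server_registration": not registrations,
    }
    # A run is verified only if all three checks pass.
    checks["verified"] = all(checks.values())
    return checks
```

Returning the individual check results alongside the overall verdict makes it possible to report why a run was excluded, not only that it was.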

Quality Judging

After both Treatment and Control sessions complete for a given task, a third Claude CLI session is spawned as an impartial quality judge. This judge session runs with the --no-mcp flag, ensuring Lineman cannot influence the evaluation. The judge receives four inputs: the original task description, the Control session's output (code patch or answer), the Treatment session's output, and the ground truth from the benchmark dataset (when available).

The judge scores each output on three weighted dimensions:

- Correctness (40% weight): does the solution actually solve the stated problem? Would tests pass?
- Completeness (30% weight): are all aspects of the task addressed? Are edge cases handled?
- Code Quality (30% weight): is the code clean, idiomatic, and free of unnecessary changes?

Each dimension is scored 0-100 independently for both conditions, and the weighted average produces a composite score per condition. Quality delta is computed as ((treatment_score - control_score) / control_score) * 100, a signed percentage where positive values indicate that the Treatment condition produced higher-quality output and negative values indicate degradation. When quality delta falls below -5%, the result is flagged as a quality concern in the claims report.
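The scoring arithmetic can be made concrete with a short sketch. The weights, the delta formula, and the -5% threshold are taken from the description above; the function names and the sample judge scores are illustrative only.

```python
# Weights and threshold as stated in the methodology; names are illustrative.
WEIGHTS = {"correctness": 0.40, "completeness": 0.30, "code_quality": 0.30}
CONCERN_THRESHOLD = -5.0  # percent; deltas below this are flagged


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of the three 0-100 dimension scores."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def quality_delta(control_score: float, treatment_score: float) -> float:
    """Signed percentage change from Control to Treatment (control_score must be non-zero)."""
    return (treatment_score - control_score) / control_score * 100


def is_quality_concern(delta_pct: float) -> bool:
    return delta_pct < CONCERN_THRESHOLD


# Made-up judge scores for one task, not real benchmark data.
control = composite_score({"correctness": 80, "completeness": 70, "code_quality": 60})
treatment = composite_score({"correctness": 80, "completeness": 60, "code_quality": 60})
delta = quality_delta(control, treatment)
print(f"{control:.1f} -> {treatment:.1f}, delta {delta:+.1f}%, flagged={is_quality_concern(delta)}")
```

With these sample scores the composites are 71.0 and 68.0, giving a delta of about -4.2%, which is not flagged. Applying the same formula to the SWE-bench Lite composites above (72 → 70) gives roughly -2.8%, matching the reported Quality Delta.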

Task Sampling

Tasks are sampled from each benchmark's official dataset using a seeded pseudo-random number generator (mulberry32 algorithm). The seed is recorded with each run to enable exact reproduction. For external benchmarks (SWE-bench Lite, CrossCodeEval, RepoBench, DevBench, CodeQA), tasks are drawn uniformly from the benchmark's published test split with no filtering or cherry-picking. The internal suite uses all available fixture tasks. The same task set and ordering are used for both Treatment and Control conditions within a run, ensuring paired comparisons. Different runs with different seeds will sample different tasks, allowing coverage to grow over time.
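A minimal sketch of seeded sampling follows. It ports the widely circulated mulberry32 generator to Python using 32-bit masking. The selection step (a Fisher-Yates shuffle followed by taking the first n items) is an assumption, since the text does not say how the uniform draw is performed.

```python
MASK32 = 0xFFFFFFFF


def mulberry32(seed: int):
    """Return a function yielding floats in [0, 1), emulating 32-bit integer math."""
    state = seed & MASK32

    def next_float() -> float:
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t ^= (t + (((t ^ (t >> 7)) * (t | 61)) & MASK32)) & MASK32
        return ((t ^ (t >> 14)) & MASK32) / 4294967296

    return next_float


def sample_tasks(task_ids: list[str], n: int, seed: int) -> list[str]:
    """Draw n tasks without replacement; the same seed gives the same tasks and order."""
    rng = mulberry32(seed)
    pool = list(task_ids)
    # Fisher-Yates shuffle driven by the seeded generator (assumed selection method).
    for i in range(len(pool) - 1, 0, -1):
        j = int(rng() * (i + 1))
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:n]
```

Because the returned list is fully determined by the seed and the input order, running both conditions over that one list gives the paired comparison described above.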

Statistics

Token savings are reported as the percentage change in mean total tokens (input + output + cache_read + cache_creation) from Control to Treatment. The 95% confidence interval is computed via bootstrap resampling: 10,000 iterations of sampling with replacement from the paired differences, taking the 2.5th and 97.5th percentiles. Cohen's d effect size is computed using the pooled standard deviation of Control and Treatment token counts; absolute values above 0.8 indicate a large effect, and a negative d means Treatment used fewer tokens than Control. Statistical significance is assessed via Welch's two-tailed t-test (which does not assume equal variances). All aggregate statistics exclude unverified Control runs. Per-task results are available in the expandable tables and raw data download for independent analysis.
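The statistics above can be sketched with the Python standard library. This is one reading of the description, not the harness's code: the bootstrap here resamples (control, treatment) pairs and recomputes the percentage change in means, and the function names are illustrative. The Welch p-value needs a Student's t distribution, which the standard library lacks; SciPy's scipy.stats.ttest_ind with equal_var=False is one way to obtain it.

```python
import math
import random
import statistics


def pct_change_in_mean(control, treatment):
    """Percentage change in mean total tokens from Control to Treatment."""
    mean_c = statistics.fmean(control)
    return (statistics.fmean(treatment) - mean_c) / mean_c * 100


def bootstrap_ci(control, treatment, iterations=10_000, seed=0):
    """95% percentile interval from resampling task pairs with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(control, treatment))
    estimates = []
    for _ in range(iterations):
        resample = rng.choices(pairs, k=len(pairs))
        c, t = zip(*resample)
        estimates.append(pct_change_in_mean(c, t))
    cuts = statistics.quantiles(estimates, n=40)  # cut points every 2.5%
    return cuts[0], cuts[-1]  # 2.5th and 97.5th percentiles


def cohens_d(control, treatment):
    """Effect size using the pooled standard deviation; negative means fewer tokens."""
    n_c, n_t = len(control), len(treatment)
    var_c, var_t = statistics.variance(control), statistics.variance(treatment)
    pooled_sd = math.sqrt(((n_c - 1) * var_c + (n_t - 1) * var_t) / (n_c + n_t - 2))
    return (statistics.fmean(treatment) - statistics.fmean(control)) / pooled_sd


def welch_t(control, treatment):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    se_c = statistics.variance(control) / len(control)
    se_t = statistics.variance(treatment) / len(treatment)
    t = (statistics.fmean(treatment) - statistics.fmean(control)) / math.sqrt(se_c + se_t)
    df = (se_c + se_t) ** 2 / (
        se_c ** 2 / (len(control) - 1) + se_t ** 2 / (len(treatment) - 1)
    )
    return t, df
```

Fixing the bootstrap seed keeps the reported interval reproducible between runs of the analysis. Unverified Control runs would be dropped from both lists before any of these functions are called.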

Caveats & Limitations

- All results use dummy data and do not represent actual benchmark performance.
- Internal suite results are biased by design: tasks were created to match Lineman's strengths.
- Token savings vary significantly by task type; individual results may differ from averages.
- Quality scores use different rubrics across suites and are not directly comparable.
- Wall clock times were measured on a single machine (M3 Max, 64GB) and will vary by hardware.
- Small sample sizes (at most 30 tasks per suite) limit statistical power for some comparisons.

Download the raw benchmark data (coming soon) for independent analysis.

Run bench-2026-04-09-001 -- 4c9756b