Controlled A/B evaluation of Lineman's impact on token consumption and output quality across industry-standard benchmarks.
All results are presented unfiltered. Improvements and degradations are reported equally.
SWE-bench Lite
Real-world GitHub issues from popular Python repositories. Tests the ability to locate, understand, and fix bugs given an issue description.
Token Breakdown
Per-Task Distribution
CrossCodeEval
Cross-file code completion benchmark testing the ability to use cross-file context for accurate completions across Python, Java, TypeScript, and C#.
Token Breakdown
Per-Task Distribution
RepoBench
Repository-level code completion requiring understanding of cross-file dependencies, APIs, and project structure.
Token Breakdown
Per-Task Distribution
DevBench
End-to-end software development benchmark covering requirements analysis, design, implementation, and testing across multiple repositories.
Token Breakdown
Per-Task Distribution
CodeQA
Code question-answering benchmark testing understanding of code semantics, logic, and behavior through natural language questions.
Token Breakdown
Per-Task Distribution
Internal Suite
This suite was designed by the Lineman team to test scenarios where Lineman is expected to excel. Results are not representative of general-purpose performance and should not be compared directly to external benchmarks.
Tasks specifically designed to exercise Lineman's core capabilities: large file reading, error triage, build output classification, and dependency analysis.
Token Breakdown
Per-Task Distribution
Each benchmark runs identical tasks in two conditions. The Treatment condition has Lineman fully active: its MCP server is registered, lifecycle hooks (PreToolUse, PostToolUse, Stop) are installed, and the system prompt includes Lineman's tool descriptions. The Control condition completely removes Lineman by deleting the project's .mcp.json file and removing all hook entries from .claude/settings.local.json, simulating a developer who has never installed Lineman. After each Control session, the transcript is parsed to verify that zero Lineman tool calls were made, zero Lineman hooks fired, and no Lineman MCP server registration appeared in the session metadata. Any Control run that fails this three-check verification is excluded from aggregate statistics but retained in the raw data for transparency. The verification failure count is reported alongside each suite's results.
Control session verification is a three-step automated process applied to every Control run. First, the transcript is scanned for any tool_use blocks with names matching the mcp__lineman__* pattern — if any are found, the run is flagged. Second, hook execution entries are checked for references to Lineman hook scripts (pre-tool-use.mjs, post-tool-use.mjs, stop.mjs, metrics.mjs). Third, the system prompt and MCP server metadata in the transcript are checked for any 'lineman' server registration. A run must pass all three checks to be marked as verified. Unverified runs are excluded from token savings calculations to prevent inflated claims, but their count is disclosed (see 'Verified' field per suite).
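The three checks can be sketched as a single function. This is a minimal illustration, not Lineman's actual verifier: the transcript is assumed to be already parsed into a list of dicts, and the keys "type", "name", "command", "mcp_servers", and "system_prompt" are hypothetical stand-ins for the real transcript schema. Only the mcp__lineman__* pattern and the hook script names come from the text above.

```python
# Hook script names taken from the methodology text above.
LINEMAN_HOOK_SCRIPTS = ("pre-tool-use.mjs", "post-tool-use.mjs", "stop.mjs", "metrics.mjs")


def verify_control_run(events):
    """Apply the three checks to one Control transcript.

    `events` is a list of dicts. The keys used here ("type", "name",
    "command", "mcp_servers", "system_prompt") are hypothetical and stand
    in for whatever the real transcript format provides.
    Returns (verified, failures).
    """
    failures = []

    # Check 1: no tool_use block whose name matches mcp__lineman__*.
    if any(
        e.get("type") == "tool_use" and e.get("name", "").startswith("mcp__lineman__")
        for e in events
    ):
        failures.append("lineman_tool_call")

    # Check 2: no hook execution entry that references a Lineman hook script.
    if any(
        e.get("type") == "hook"
        and any(script in e.get("command", "") for script in LINEMAN_HOOK_SCRIPTS)
        for e in events
    ):
        failures.append("lineman_hook_fired")

    # Check 3: no 'lineman' registration in the system prompt or MCP metadata.
    for e in events:
        if e.get("type") != "session_metadata":
            continue
        servers = e.get("mcp_servers", [])
        prompt = e.get("system_prompt", "")
        if any("lineman" in s.lower() for s in servers) or "lineman" in prompt.lower():
            failures.append("lineman_server_registered")
            break

    return (not failures, failures)
```

A run counts as verified only when the returned failure list is empty, which mirrors the rule that all three checks must pass.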
After both Treatment and Control sessions complete for a given task, a third Claude CLI session is spawned as an impartial quality judge. This judge session runs with the --no-mcp flag, ensuring Lineman cannot influence the evaluation. The judge receives four inputs: the original task description, the Control session's output (code patch or answer), the Treatment session's output, and the ground truth from the benchmark dataset (when available).
The judge scores each output on three weighted dimensions:
Correctness (40% weight): does the solution actually solve the stated problem? Would tests pass?
Completeness (30% weight): are all aspects of the task addressed? Are edge cases handled?
Code Quality (30% weight): is the code clean, idiomatic, and free of unnecessary changes?
Each dimension is scored 0-100 independently for both conditions, and the weighted average produces a composite score per condition. Quality delta is computed as ((treatment_score - control_score) / control_score) * 100, yielding a signed percentage: positive values indicate that the Treatment condition produced higher-quality output, and negative values indicate degradation. A quality delta below -5% is flagged as a quality concern in the claims report.
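The scoring arithmetic can be worked through in a few lines. The weights, the delta formula, and the -5% threshold come from the description above; the function names are illustrative, and returning None for a zero Control score is an assumption, since the text does not say how that case is handled.

```python
# Dimension weights as stated in the methodology.
WEIGHTS = {"correctness": 0.40, "completeness": 0.30, "code_quality": 0.30}


def composite_score(scores):
    """Weighted average of the three 0-100 dimension scores."""
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


def quality_delta(control_scores, treatment_scores):
    """Signed percentage change in composite score from Control to Treatment."""
    control = composite_score(control_scores)
    treatment = composite_score(treatment_scores)
    if control == 0:
        # Assumption: the delta is undefined when the Control composite is 0.
        return None
    return (treatment - control) / control * 100


def is_quality_concern(delta):
    """True when the delta is below the -5% threshold."""
    return delta is not None and delta < -5.0


# Invented scores, for illustration only.
control = {"correctness": 80, "completeness": 70, "code_quality": 60}
treatment = {"correctness": 80, "completeness": 60, "code_quality": 60}
delta = quality_delta(control, treatment)
print(f"delta = {delta:.1f}%, flagged = {is_quality_concern(delta)}")
```

With these invented scores the composites are 71 for Control and 68 for Treatment, so the delta is about -4.2%, which stays inside the -5% threshold and is not flagged.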
Tasks are sampled from each benchmark's official dataset using a seeded pseudo-random number generator (mulberry32 algorithm). The seed is recorded with each run to enable exact reproduction. For external benchmarks (SWE-bench Lite, CrossCodeEval, RepoBench, DevBench, CodeQA), tasks are drawn uniformly from the benchmark's published test split with no filtering or cherry-picking. The internal suite uses all available fixture tasks. The same task set and ordering are used for both Treatment and Control conditions within a run, ensuring paired comparisons. Different runs with different seeds will sample different tasks, allowing coverage to grow over time.
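Seeded sampling of this kind can be sketched as below. The generator is a Python port of the widely circulated mulberry32 routine, with explicit 32-bit masking in place of JavaScript's integer coercion. How the generator is turned into a task sample is not specified in the text, so the Fisher-Yates shuffle and the sample_tasks helper are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def mulberry32(seed):
    """Return a function that yields floats in [0, 1) from a 32-bit seed."""
    state = seed & MASK32

    def next_float():
        nonlocal state
        state = (state + 0x6D2B79F5) & MASK32
        t = state
        # Each product is truncated to 32 bits, like Math.imul in JavaScript.
        t = ((t ^ (t >> 15)) * (t | 1)) & MASK32
        t = ((t + (((t ^ (t >> 7)) * (t | 61)) & MASK32)) ^ t) & MASK32
        return (t ^ (t >> 14)) / 4294967296

    return next_float


def sample_tasks(task_ids, n, seed):
    """Draw n tasks uniformly without replacement (hypothetical helper)."""
    rng = mulberry32(seed)
    pool = list(task_ids)
    # Fisher-Yates shuffle driven by the seeded generator.
    for i in range(len(pool) - 1, 0, -1):
        j = int(rng() * (i + 1))
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:n]


# The same seed always yields the same tasks in the same order, so the
# Treatment and Control conditions can be paired task by task.
treatment_tasks = sample_tasks(range(300), 10, seed=42)
control_tasks = sample_tasks(range(300), 10, seed=42)
```

Because the sample depends only on the seed and the input ordering, recording the seed with each run is enough to reproduce the task set, provided the dataset is read in the same order.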
Token savings are reported as the percentage change in mean total tokens (input + output + cache_read + cache_creation) from Control to Treatment. The 95% confidence interval is computed via bootstrap resampling: 10,000 iterations of sampling with replacement from the paired differences, taking the 2.5th and 97.5th percentiles. Cohen's d effect size is computed using the pooled standard deviation of Control and Treatment token counts; absolute values above 0.8 indicate a large effect. Statistical significance is assessed via Welch's two-tailed t-test (which does not assume equal variances). All aggregate statistics exclude unverified Control runs. Per-task results are available in the expandable tables and raw data download for independent analysis.
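For readers who want to recompute these figures from per-task totals, the following is a standard-library sketch of the four quantities. It makes two assumptions the text leaves open: the bootstrap statistic is the mean of the resampled paired differences, and percentiles are taken by nearest rank. It stops at the Welch t statistic and degrees of freedom, because the two-tailed p-value needs a t-distribution CDF that the standard library does not provide. All function names are illustrative.

```python
import math
import random
import statistics


def percent_change(control, treatment):
    """Percentage change in mean total tokens from Control to Treatment."""
    mean_c = statistics.fmean(control)
    return (statistics.fmean(treatment) - mean_c) / mean_c * 100


def bootstrap_ci(paired_diffs, iterations=10_000, seed=0):
    """95% percentile interval for the mean paired difference (assumed statistic)."""
    rng = random.Random(seed)
    n = len(paired_diffs)
    means = sorted(
        statistics.fmean(rng.choices(paired_diffs, k=n)) for _ in range(iterations)
    )
    # Nearest-rank 2.5th and 97.5th percentiles.
    return means[iterations * 25 // 1000], means[iterations * 975 // 1000 - 1]


def cohens_d(control, treatment):
    """Effect size using the pooled standard deviation of both conditions."""
    n_c, n_t = len(control), len(treatment)
    var_c, var_t = statistics.variance(control), statistics.variance(treatment)
    pooled = math.sqrt(((n_c - 1) * var_c + (n_t - 1) * var_t) / (n_c + n_t - 2))
    return (statistics.fmean(treatment) - statistics.fmean(control)) / pooled


def welch_t(control, treatment):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    a = statistics.variance(control) / len(control)
    b = statistics.variance(treatment) / len(treatment)
    t = (statistics.fmean(treatment) - statistics.fmean(control)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a**2 / (len(control) - 1) + b**2 / (len(treatment) - 1))
    return t, df


# Invented per-task token totals, for illustration only.
control = [100, 120, 140]
treatment = [80, 90, 100]
diffs = [t - c for c, t in zip(control, treatment)]
print(percent_change(control, treatment), bootstrap_ci(diffs))
print(cohens_d(control, treatment), welch_t(control, treatment))
```

With Treatment minus Control as the sign convention, token savings show up as negative values throughout, so the effect size is judged by its magnitude.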
Download the raw benchmark data (coming soon) for independent analysis.
Run bench-2026-04-09-001 -- 4c9756b