Task-Specific Model Routing: Reducing AI Coding Assistant Token Costs by 27-58% Through Intelligent Workload Delegation

Goodex, Inc. | April 2026

Aspects of the technology described in this paper are patent-pending.

Token cost reduction: 27-58%
Latency per task: <2s
Quality score: 86/100
Inference: CPU-only, no GPU required

Executive Summary

Frontier large language models (LLMs) power today's AI coding assistants, but a significant portion of their token consumption goes to mechanical, data-processing tasks -- summarizing files, filtering search results, triaging build output -- that require no deep reasoning. This paper presents our research into task-specific model routing, a technique that classifies coding assistant workloads by cognitive complexity and delegates data-heavy, low-reasoning tasks to small, specialized models while reserving frontier models for tasks that require genuine intelligence.

Our key findings: a 27-58% token cost reduction on files ranging from 250 to 2,000 lines, with no measurable degradation in task quality; sub-2-second latency per delegated task on CPU-only inference, with no GPU required; quality scores averaging 86/100 under an automated evaluation framework; and higher quality on structured extraction tasks when chain-of-thought reasoning is disabled, across all models tested.

1. Introduction: The Hidden Cost of Intelligence

AI coding assistants have transformed software development. Developers interact with frontier models -- Claude, GPT-4, and others -- that can reason about code architecture, debug complex issues, and generate production-quality implementations. But this power comes at a cost: frontier model inference is expensive, and much of that expense is wasted.

Consider a typical AI-assisted coding session. A developer asks the assistant to understand a codebase, fix a bug, and implement a feature. The assistant reads files, searches for references, analyzes build output, and filters search results. Each of these operations sends tokens through the frontier model. A 2,000-line source file costs approximately $0.06 in input tokens alone on a model like Claude Sonnet. Over a productive session involving 50 file reads, that's $3.00 spent purely on reading -- before any reasoning, code generation, or decision-making occurs.

The insight behind our research is that these tasks fall into two fundamentally different categories: high-reasoning tasks that require the full capability of a frontier model (code generation, architectural decisions, multi-step debugging), and mechanical tasks that require data processing but not deep reasoning (file summarization, search filtering, build output triage, error classification).

The mechanical category often accounts for the majority of token throughput in a coding session, yet these tasks share key properties: they have high input volume, require structured output in a predictable format, tolerate some information loss, and do not require chain-of-thought reasoning.

Our research explores whether a small, cheap model -- running at roughly 1/100th the per-token cost of a frontier model -- can handle these mechanical tasks at acceptable quality, and what infrastructure is needed to route tasks effectively between models.

2. Background and Related Work

2.1 Cost Optimization in LLM Systems

The AI industry has developed several approaches to reducing inference costs. Prompt caching stores frequently used prompts to avoid re-processing, but does not reduce the fundamental token volume of data-heavy tasks. Context window optimization (RAG, chunking, sliding windows) reduces what gets sent to the model but does not change which model sees it. Model distillation creates smaller versions of large models but requires significant training infrastructure. Mixture-of-Experts architectures route different tokens to different components within a single model; our approach operates at a higher level, routing entire tasks to different models.

2.2 Multi-Model Architectures

The concept of using multiple models with different capabilities is not new. Ensemble methods, cascading classifiers, and speculative decoding all involve multiple models cooperating. Our contribution is applying this principle to the specific domain of AI coding assistants, where the task taxonomy is well-defined and the cost asymmetry between model tiers is extreme.

2.3 The Model Context Protocol (MCP)

Our architecture leverages the Model Context Protocol (MCP), an open standard for connecting AI models to external tools and data sources. MCP provides a standardized interface through which a primary model can invoke secondary models as tools, making multi-model routing a natural extension of existing tool-use patterns.

3. Task Classification Framework

3.1 Classification Criteria

The core of our approach is a principled framework for determining which tasks can be safely delegated to a smaller model. We define four properties that characterize delegable tasks:

| Property | Description | Example |
| --- | --- | --- |
| High-volume input | Task involves processing hundreds or thousands of lines | Reading a 1,500-line source file |
| Low-reasoning requirement | Requires extraction or classification, not planning | Summarizing exports and structure |
| Structured output | Response format is predictable and verifiable | JSON with known fields |
| Loss tolerance | Some information loss is acceptable | A summary capturing 90% of key facts |

3.2 Task Taxonomy

| Task Category | Input Volume | Reasoning | Delegable? |
| --- | --- | --- | --- |
| File summarization | High | Low | Yes |
| Search result filtering | Medium-High | Low | Yes |
| Build output triage | High | Low | Yes |
| Error classification | Medium | Low | Yes |
| Content compression | High | Low | Yes |
| Code generation | Low-Medium | High | No |
| Architecture decisions | Low | High | No |
| Bug diagnosis | Variable | High | No |
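The taxonomy can be expressed as a small lookup table. The sketch below is illustrative rather than our production classifier: the snake_case category keys are hypothetical, and the rule "delegable when reasoning is low" is inferred from the table rows, not a stated algorithm.

```python
from enum import Enum


class Reasoning(Enum):
    LOW = "low"
    HIGH = "high"


# Rows copied from the taxonomy table: category -> (input volume, reasoning).
# The snake_case keys are hypothetical identifiers.
TAXONOMY = {
    "file_summarization": ("high", Reasoning.LOW),
    "search_result_filtering": ("medium-high", Reasoning.LOW),
    "build_output_triage": ("high", Reasoning.LOW),
    "error_classification": ("medium", Reasoning.LOW),
    "content_compression": ("high", Reasoning.LOW),
    "code_generation": ("low-medium", Reasoning.HIGH),
    "architecture_decisions": ("low", Reasoning.HIGH),
    "bug_diagnosis": ("variable", Reasoning.HIGH),
}


def is_delegable(category: str) -> bool:
    """Return True when the category may go to the secondary model."""
    entry = TAXONOMY.get(category)
    if entry is None:
        # ASSUMPTION: unknown categories stay on the frontier model.
        return False
    _volume, reasoning = entry
    return reasoning is Reasoning.LOW
```

Defaulting unknown categories to the frontier model keeps the failure mode conservative: a misclassified task costs more tokens but does not lose quality.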

3.3 The Compressor/Filter Principle

A key design principle that emerged from our research: the secondary model should act exclusively as a compressor, filter, or classifier -- never as a reasoner. This bright line dramatically simplifies the system. The primary model trusts the secondary model to reduce data, not to make decisions. This asymmetric trust relationship is critical for maintaining overall quality while achieving cost savings.

4. System Architecture

4.1 Overview

Our architecture follows a delegation pattern where the frontier model acts as orchestrator and explicitly decides when to invoke the secondary model. The primary model sees the secondary model as a tool -- one of many available actions it can take. This leverages the primary model's existing tool-use capabilities for routing, without requiring a separate routing layer.

Developer <-> Primary LLM (Frontier) <-> [MCP Protocol] <-> Task Router <-> Secondary LLM (Small)

4.2 Component Design

The system comprises five key components: a Task Router that receives requests and dispatches to the secondary model; Prompt Templates optimized for small models as tightly scoped single-turn prompts; a Response Shaper using regex-based parsing (more reliable than JSON mode with small models); Authority Framing that directs the primary model not to re-read source material; and a Fallback Mechanism that transparently reverts to the primary model on failure.
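A minimal sketch of how the Task Router, Prompt Templates, and Fallback Mechanism might fit together is shown below. The function name `route_task`, the template wording, and the task type names are hypothetical, and the two models are represented as plain callables rather than real clients.

```python
from typing import Callable

ModelFn = Callable[[str], str]  # hypothetical signature: prompt in, text out

# Hypothetical single-turn templates, one per delegable task type.
PROMPT_TEMPLATES = {
    "summarize_file": "Summarize the exports and structure of this file:\n{payload}",
    "triage_build": "List the distinct errors in this build output:\n{payload}",
}


def route_task(task_type: str, payload: str,
               secondary: ModelFn, primary: ModelFn) -> dict:
    """Dispatch to the small model, reverting to the frontier model on failure."""
    template = PROMPT_TEMPLATES.get(task_type)
    if template is None:
        # Task types without a template are not delegated at all.
        return {"handled_by": "primary", "text": primary(payload)}

    prompt = template.format(payload=payload)
    try:
        text = secondary(prompt)
        if not text.strip():
            raise ValueError("empty response from secondary model")
        return {"handled_by": "secondary", "text": text}
    except Exception:
        # Fallback: the caller still gets an answer, just without the savings.
        return {"handled_by": "primary", "text": primary(prompt)}
```

Recording which model handled the request makes it possible to measure how often the fallback path is taken.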

4.3 Deployment Flexibility

| Topology | Description | Use Case |
| --- | --- | --- |
| Co-located | Secondary model on the same machine | Development, privacy-sensitive |
| Disaggregated | Secondary model as a cloud service | Production, team environments |

Both topologies present the same interface. A strategy pattern abstracts the routing, with automatic fallback from disaggregated to co-located mode if the cloud service is unreachable.
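The strategy pattern described here could look roughly like the following. This is a sketch under stated assumptions: the class names are hypothetical, the transports are stand-in callables, and unreachability is assumed to surface as `ConnectionError` or `TimeoutError`.

```python
from typing import Callable, Protocol


class SecondaryBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class CallableBackend:
    """Wraps any transport (local runtime or HTTP client) behind one interface."""

    def __init__(self, name: str, send: Callable[[str], str]) -> None:
        self.name = name
        self._send = send

    def complete(self, prompt: str) -> str:
        return self._send(prompt)


class FallbackBackend:
    """Prefer the disaggregated backend; use the co-located one if unreachable."""

    def __init__(self, remote: SecondaryBackend, local: SecondaryBackend) -> None:
        self._remote = remote
        self._local = local
        self.last_used = ""

    def complete(self, prompt: str) -> str:
        try:
            result = self._remote.complete(prompt)
            self.last_used = "disaggregated"
        except (ConnectionError, TimeoutError):
            # ASSUMPTION: the transport signals unreachability with these errors.
            result = self._local.complete(prompt)
            self.last_used = "co-located"
        return result
```

Because `FallbackBackend` itself satisfies `SecondaryBackend`, the router can hold a single backend reference without knowing which topology is in use.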

4.4 CPU-Only Inference

A practical finding: small models (1.7B-14B parameters) running structured extraction tasks perform adequately on CPU-only inference. On an Apple M4 Pro (24GB), our recommended 8B parameter model processes requests at approximately 25 tokens/second, completing most tasks in under 2 seconds.

5. Evaluation Framework

5.1 Multi-Mode Evaluation

We developed a three-mode evaluation framework: Fast Mode for deterministic structural validation (under 5 seconds, no external deps); Full Mode using a frontier model as automated judge with 3-shot median scoring per dimension; and Compare Mode for statistical A/B testing with Cohen's d effect size and bootstrap confidence intervals.
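The statistics named above can be sketched with the Python standard library alone. The function names are hypothetical, the pooled-standard-deviation form of Cohen's d and the percentile bootstrap are assumed variants, and the resample count and seed are arbitrary choices for illustration.

```python
import random
import statistics


def median_of_three(scores: list[float]) -> float:
    """Full Mode: take the median of three judge scores for one dimension."""
    if len(scores) != 3:
        raise ValueError("expected exactly three judge scores")
    return statistics.median(scores)


def cohens_d(a: list[float], b: list[float]) -> float:
    """Effect size of b relative to a, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    # Raises ZeroDivisionError if both samples have zero variance.
    return (statistics.mean(b) - statistics.mean(a)) / pooled_var ** 0.5


def bootstrap_ci(a: list[float], b: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resampled_a = [rng.choice(a) for _ in a]
        resampled_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resampled_b) - statistics.mean(resampled_a))
    diffs.sort()
    low_index = round((alpha / 2) * n_resamples)
    high_index = round((1 - alpha / 2) * n_resamples) - 1
    return diffs[low_index], diffs[high_index]
```

An interval that excludes zero suggests the difference between the two configurations is unlikely to be resampling noise.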

5.2 Quality Rubrics

Each task type defines a weighted rubric with dimensions specific to that task. Weights are integers summing to 100.

| Dimension | Weight | What It Measures |
| --- | --- | --- |
| Completeness | 30 | Key symbols, module purpose, relationships captured |
| Accuracy | 30 | No hallucinated names, types, or incorrect facts |
| Conciseness | 20 | Summary length proportionate to input |
| Structure | 20 | Required output fields present |
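As a worked example of how the weights combine, the sketch below computes a composite score as a weighted mean. It assumes each dimension is scored on a 0-100 scale; the function name and dictionary keys are hypothetical.

```python
# Weights from the rubric table; integers summing to 100.
RUBRIC_WEIGHTS = {
    "completeness": 30,
    "accuracy": 30,
    "conciseness": 20,
    "structure": 20,
}


def composite_score(dimension_scores: dict, weights: dict = RUBRIC_WEIGHTS) -> float:
    """Weighted mean of per-dimension scores (ASSUMPTION: each is 0-100)."""
    if sum(weights.values()) != 100:
        raise ValueError("rubric weights must sum to 100")
    if set(dimension_scores) != set(weights):
        raise ValueError("scores must cover exactly the rubric dimensions")
    return sum(weights[d] * dimension_scores[d] for d in weights) / 100


example = composite_score(
    {"completeness": 90, "accuracy": 80, "conciseness": 100, "structure": 70}
)
# 0.30*90 + 0.30*80 + 0.20*100 + 0.20*70 = 85.0
```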

5.3 Quality Gates

| Gate | Threshold | Enforcement |
| --- | --- | --- |
| Structural validation | All required fields | Blocks any merge |
| Average quality score | >= 70/100 | Blocks merge to main |
| Per-dimension regression | No drop > 5 points | Blocks merge to main |
| Composite regression | No drop > 3 points | Alerts and blocks |
| Token budget | Output < 70% of input | Warning only |
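The gates can be read as a single check run against a stored baseline. The following sketch assumes a hypothetical result shape (a `composite` score plus a `dimensions` mapping) and treats regressions as baseline minus candidate; neither detail is specified above.

```python
def check_gates(baseline: dict, candidate: dict, *, fields_present: bool,
                input_tokens: int, output_tokens: int) -> tuple:
    """Return (blocking failures, warnings) for one benchmark run."""
    failures, warnings = [], []

    if not fields_present:
        failures.append("structural validation")
    if candidate["composite"] < 70:
        failures.append("average quality score")
    for dim, old_score in baseline["dimensions"].items():
        # A drop of more than 5 points on any single dimension blocks.
        if old_score - candidate["dimensions"][dim] > 5:
            failures.append(f"per-dimension regression: {dim}")
    if baseline["composite"] - candidate["composite"] > 3:
        failures.append("composite regression")
    if output_tokens >= 0.70 * input_tokens:
        # Warning only: the output is not compressing the input enough.
        warnings.append("token budget")

    return failures, warnings
```

Keeping warnings separate from failures mirrors the enforcement column: only the first four gates block a merge.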

6. Results

6.1 Token Savings

| File Size | Without Routing | With Routing | Savings | Turn Reduction |
| --- | --- | --- | --- | --- |
| 250 lines | 87K tokens | 64K tokens | 27% | 4 -> 1 turn |
| 1,000 lines | 125K tokens | 52K tokens | 58% | 6 -> 1 turn |
| 2,000 lines | 148K tokens | 105K tokens | 29% | 7 -> 2 turns |

The routing layer cost is approximately constant at ~53K tokens regardless of file size. The sweet spot for maximum savings is files in the 500-1,500 line range.
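The savings column follows from the two token columns. The short sketch below recomputes it from the table; because the token counts are rounded to the nearest thousand, recomputed percentages can differ from the published column by about a point (the 250-line row comes to roughly 26%).

```python
# (file size in lines, tokens without routing, tokens with routing), from the table.
MEASUREMENTS = [
    (250, 87_000, 64_000),
    (1_000, 125_000, 52_000),
    (2_000, 148_000, 105_000),
]


def savings_pct(without_routing: int, with_routing: int) -> float:
    """Percentage of tokens saved relative to the unrouted baseline."""
    return 100 * (without_routing - with_routing) / without_routing


for lines, without_routing, with_routing in MEASUREMENTS:
    pct = savings_pct(without_routing, with_routing)
    print(f"{lines:>5} lines: {pct:.1f}% saved")
```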

6.2 Model Comparison

| Size class | Parameters | Quality | Speed | Notes |
| --- | --- | --- | --- | --- |
| Small | 1.7B | 73/100 | 99 tok/s | Fastest, adequate for simple tasks |
| Medium | 8B | 86/100 | 25 tok/s | Best balance |
| Medium-large | 12B | 90/100 | 17 tok/s | Highest overall scores |
| Large | 14B | 88/100 | 14 tok/s | Best at code understanding |

Notable negative results: a 4B candidate had a 0% success rate, and one 14B candidate from a different model family was inconsistent, scoring 0 on some medium-difficulty tasks. These results underscore the importance of empirical benchmarking over assumptions based on parameter count.

6.3 The Think-Mode Finding

Think Mode ON: 79/100 quality score
Think Mode OFF: 86/100 quality score

Disabling chain-of-thought reasoning improves performance for structured extraction tasks across all models tested. The explanation: chain-of-thought introduces unnecessary deliberation for tasks that are fundamentally pattern-matching and extraction, adding latency and occasionally leading the model to overthink straightforward tasks.

7. Design Insights and Lessons Learned

Regex extraction outperforms JSON mode

Small models frequently produce invalid JSON but reliably follow structural patterns. Regex-based extraction proved significantly more reliable for structured data.
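A minimal illustration of regex-based response shaping follows. The `PURPOSE` / `EXPORTS` / `DEPENDENCIES` field labels and the function name are hypothetical; the point is that labeled lines can be recovered even when the model wraps them in extra chatter.

```python
import re

# Hypothetical field labels; each is expected at the start of its own line.
FIELD_PATTERN = re.compile(
    r"^(PURPOSE|EXPORTS|DEPENDENCIES):[ \t]*(.*)$", re.MULTILINE
)


def shape_response(raw: str) -> dict:
    """Extract labeled fields from a small model's free-text reply."""
    fields = {}
    for name, value in FIELD_PATTERN.findall(raw):
        # Keep the first occurrence if the model repeats a label.
        fields.setdefault(name.lower(), value.strip())
    return fields


reply = (
    "Sure! Here is the summary.\n"
    "PURPOSE: HTTP client wrapper\n"
    "EXPORTS: get, post\n"
    "Let me know if you need more."
)
parsed = shape_response(reply)
```

Unlike a JSON parser, this approach degrades field by field: a missing label yields a missing key rather than a failed parse, which the structural validation step can then flag.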

Authority framing prevents redundant work

Each response includes a directive to the primary model not to re-read source material. Without this, frontier models frequently re-read files after receiving summaries, negating the token savings.
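As an illustration only, the directive can be appended to every delegated result before it is returned to the primary model. The wording of the directive and the function name below are hypothetical, not the text used in our system.

```python
# Hypothetical wording; the actual directive text is not reproduced here.
AUTHORITY_DIRECTIVE = (
    "This summary is authoritative for the requested task. "
    "Do not re-read {path} unless you need to edit it."
)


def frame_response(summary: str, path: str) -> str:
    """Attach the authority directive to a secondary-model summary."""
    directive = AUTHORITY_DIRECTIVE.format(path=path)
    return f"{summary.rstrip()}\n\n[{directive}]"
```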

Single-tool interface reduces schema overhead

Consolidating all task types behind a single tool with a task_type discriminator reduced schema token overhead by 68% (1,642 to 524 tokens). This compounds over an entire session.
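A single-tool definition with a discriminator might be declared as follows. MCP tools are described by a name, a description, and a JSON Schema for their input; the tool name `delegate`, the task type names, and the `content` field are hypothetical.

```python
# Hypothetical task type names; the real system defines 7 core types.
TASK_TYPES = ["summarize_file", "filter_search", "triage_build"]

# One tool definition instead of one per task type.
DELEGATE_TOOL = {
    "name": "delegate",
    "description": "Compress, filter, or classify bulk content with a small model.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "task_type": {"type": "string", "enum": TASK_TYPES},
            "content": {"type": "string"},
        },
        "required": ["task_type", "content"],
    },
}


def validate_call(arguments: dict) -> str:
    """Minimal check of a call against the schema; returns the task type."""
    schema = DELEGATE_TOOL["inputSchema"]
    missing = [key for key in schema["required"] if key not in arguments]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    task_type = arguments["task_type"]
    if task_type not in schema["properties"]["task_type"]["enum"]:
        raise ValueError(f"unknown task_type: {task_type}")
    return task_type
```

Adding a task type then extends one `enum` list rather than introducing another full schema into every request.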

Graceful degradation is non-negotiable

Every delegated task has an automatic fallback. The developer never experiences a failure due to the optimization layer -- at worst, they lose cost savings for that request.

8. Limitations and Future Work

Files above approximately 3,000 lines require chunked processing, introducing coordination overhead and potential information loss at chunk boundaries. Our current implementation covers 7 core task types with 2 fully benchmarked -- each new task type requires its own prompt engineering, fixtures, and rubrics. Two of the six models tested were unsuitable, underscoring that model selection requires empirical validation.

Future research directions include adaptive routing using learned task complexity classifiers, multi-model cascading across capability levels, cross-session learning from accumulated benchmark data, and expanding the task taxonomy beyond the current 7 core types.

9. Conclusion

Task-specific model routing is a practical and effective technique for reducing the token cost of AI coding assistants. By classifying workloads into high-reasoning and mechanical categories, and delegating mechanical tasks to small, specialized models, we achieved 27-58% token savings while maintaining an average quality score of 86/100 on a rigorous benchmark framework.

The key contributions of this research are: a principled task classification framework based on four measurable properties; an architecture for multi-model routing leveraging existing tool-use protocols (MCP); a rigorous evaluation methodology combining deterministic validation, LLM-as-judge scoring, and statistical significance testing; and empirical findings including the counterintuitive result that disabling chain-of-thought reasoning improves structured extraction performance.

These results suggest that the AI coding assistant industry can achieve significant cost reductions without requiring better models -- only smarter routing of the models already available.

For inquiries about this research, contact research@goodex.dev