Task-Specific Model Routing: Reducing AI Coding Assistant Token Costs by 27-58% Through Intelligent Workload Delegation

Goodex, Inc. | April 2026

Aspects of the technology described in this paper are patent-pending.

Key results: 27-58% token cost reduction | <2s latency per task | 86/100 quality score | CPU-only inference (no GPU required)

Contents

Executive Summary
1. Introduction
2. Background and Related Work
3. Task Classification Framework
4. System Architecture
5. Evaluation Framework
6. Results
7. Design Insights
8. Limitations and Future Work
9. Conclusion

Executive Summary

Frontier large language models (LLMs) power today's AI coding assistants, but a significant portion of their token consumption goes to mechanical, data-processing tasks -- summarizing files, filtering search results, triaging build output -- that require no deep reasoning. This paper presents our research into task-specific model routing, a technique that classifies coding assistant workloads by cognitive complexity and delegates data-heavy, low-reasoning tasks to small, specialized models while reserving frontier models for tasks that require genuine intelligence.

Our key findings: a 27-58% token cost reduction on files ranging from 250 to 2,000 lines, with no measurable degradation in task quality; sub-2-second latency per delegated task on CPU-only inference, with no GPU required; quality scores averaging 86/100 under an automated evaluation framework; and better performance on structured extraction tasks when chain-of-thought reasoning is disabled, across all models tested.

1. Introduction: The Hidden Cost of Intelligence

AI coding assistants have transformed software development. Developers interact with frontier models -- Claude, GPT-4, and others -- that can reason about code architecture, debug complex issues, and generate production-quality implementations. But this power comes at a cost: frontier model inference is expensive, and much of that expense is wasted.

Consider a typical AI-assisted coding session. A developer asks the assistant to understand a codebase, fix a bug, and implement a feature. The assistant reads files, searches for references, analyzes build output, and filters search results. Each of these operations sends tokens through the frontier model. A 2,000-line source file costs approximately $0.06 in input tokens alone on a model like Claude Sonnet. Over a productive session involving 50 file reads, that's $3.00 spent purely on reading -- before any reasoning, code generation, or decision-making occurs.
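The arithmetic behind these figures can be reproduced with a short sketch. The tokens-per-line ratio and the per-token price below are illustrative assumptions, chosen only so that the estimate matches the approximate $0.06 per file quoted above; they are not measurements from this paper.

```python
# Illustrative assumptions (not from the paper): about 10 tokens per line of
# source and $3.00 per million input tokens, chosen so that a 2,000-line file
# comes out at roughly the $0.06 quoted in the text.
TOKENS_PER_LINE = 10
USD_PER_MILLION_INPUT_TOKENS = 3.00


def read_cost_usd(lines: int) -> float:
    """Estimated input-token cost of sending one file to the frontier model."""
    tokens = lines * TOKENS_PER_LINE
    return tokens * USD_PER_MILLION_INPUT_TOKENS / 1_000_000


def session_read_cost_usd(lines_per_file: int, file_reads: int) -> float:
    """Estimated cost of reading `file_reads` files of the same size."""
    return read_cost_usd(lines_per_file) * file_reads


if __name__ == "__main__":
    print(f"per file: ${read_cost_usd(2000):.2f}")
    print(f"session:  ${session_read_cost_usd(2000, 50):.2f}")
```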

The insight behind our research is that these tasks fall into two fundamentally different categories: high-reasoning tasks that require the full capability of a frontier model (code generation, architectural decisions, multi-step debugging), and mechanical tasks that require data processing but not deep reasoning (file summarization, search filtering, build output triage, error classification).

The mechanical category often accounts for the majority of token throughput in a coding session, yet these tasks share key properties: they have high input volume, require structured output in a predictable format, tolerate some information loss, and do not require chain-of-thought reasoning.

Our research explores whether a small, cheap model -- running at roughly 1/100th the per-token cost of a frontier model -- can handle these mechanical tasks at acceptable quality, and what infrastructure is needed to route tasks effectively between models.

2. Background and Related Work

2.1 Cost Optimization in LLM Systems

The AI industry has developed several approaches to reducing inference costs. Prompt caching stores frequently used prompts to avoid re-processing, but does not reduce the fundamental token volume of data-heavy tasks. Context window optimization (RAG, chunking, sliding windows) reduces what is sent to the model but does not change which model processes it. Model distillation creates smaller versions of large models but requires significant training infrastructure. Mixture-of-Experts architectures route different tokens to different components within a single model -- our approach operates at a higher level, routing entire tasks to different models.

2.2 Multi-Model Architectures

The concept of using multiple models with different capabilities is not new. Ensemble methods, cascading classifiers, and speculative decoding all involve multiple models cooperating. Our contribution is applying this principle to the specific domain of AI coding assistants, where the task taxonomy is well-defined and the cost asymmetry between model tiers is extreme.

2.3 The Model Context Protocol (MCP)

Our architecture leverages the Model Context Protocol (MCP), an open standard for connecting AI models to external tools and data sources. MCP provides a standardized interface through which a primary model can invoke secondary models as tools, making multi-model routing a natural extension of existing tool-use patterns.

3. Task Classification Framework

3.1 Classification Criteria

The core of our approach is a principled framework for determining which tasks can be safely delegated to a smaller model. We define four properties that characterize delegable tasks:

Property	Description	Example
High-volume input	Task involves processing hundreds or thousands of lines	Reading a 1,500-line source file
Low-reasoning requirement	Requires extraction or classification, not planning	Summarizing exports and structure
Structured output	Response format is predictable and verifiable	JSON with known fields
Loss tolerance	Some information loss is acceptable	A summary capturing 90% of key facts

3.2 Task Taxonomy

Task Category	Input Volume	Reasoning	Delegable?
File summarization	High	Low	Yes
Search result filtering	Medium-High	Low	Yes
Build output triage	High	Low	Yes
Error classification	Medium	Low	Yes
Content compression	High	Low	Yes
Code generation	Low-Medium	High	No
Architecture decisions	Low	High	No
Bug diagnosis	Variable	High	No
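The taxonomy above can be expressed as a small lookup table. The sketch below is one possible encoding, not the implementation described in this paper: the identifier names are hypothetical, and the rule that unknown task types stay on the frontier model is an assumption. In the table, every task marked delegable has a low reasoning requirement, so the sketch derives delegability from that column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    input_volume: str  # "low", "medium", "high", or a range such as "medium-high"
    reasoning: str     # "low" or "high"


# Rows copied from the task taxonomy table; the snake_case keys are hypothetical.
TAXONOMY = {
    "file_summarization": TaskProfile("high", "low"),
    "search_result_filtering": TaskProfile("medium-high", "low"),
    "build_output_triage": TaskProfile("high", "low"),
    "error_classification": TaskProfile("medium", "low"),
    "content_compression": TaskProfile("high", "low"),
    "code_generation": TaskProfile("low-medium", "high"),
    "architecture_decisions": TaskProfile("low", "high"),
    "bug_diagnosis": TaskProfile("variable", "high"),
}


def is_delegable(task_type: str) -> bool:
    """Return True when the task may go to the secondary model."""
    profile = TAXONOMY.get(task_type)
    if profile is None:
        # Assumption: anything outside the taxonomy stays on the frontier model.
        return False
    return profile.reasoning == "low"


if __name__ == "__main__":
    for name in TAXONOMY:
        print(f"{name}: {'delegate' if is_delegable(name) else 'frontier'}")
```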

3.3 The Compressor/Filter Principle

A key design principle that emerged from our research: the secondary model should act exclusively as a compressor, filter, or classifier -- never as a reasoner. This bright line dramatically simplifies the system. The primary model trusts the secondary model to reduce data, not to make decisions. This asymmetric trust relationship is critical for maintaining overall quality while achieving cost savings.

4. System Architecture

4.1 Overview

Our architecture follows a delegation pattern where the frontier model acts as orchestrator and explicitly decides when to invoke the secondary model. The primary model sees the secondary model as a tool -- one of many available actions it can take. This reuses the primary model's existing tool-use capabilities to make the routing decision, so no separate routing classifier is required.

Developer <-> Primary LLM (Frontier) <-> [MCP] <-> Task Router <-> Secondary LLM (Small)

4.2 Component Design

The system comprises five key components: a Task Router that receives requests and dispatches them to the secondary model; Prompt Templates written for small models as tightly scoped, single-turn prompts; a Response Shaper that uses regex-based parsing (more reliable than JSON mode with small models); Authority Framing that directs the primary model not to re-read source material; and a Fallback Mechanism that transparently reverts to the primary model on failure.
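A minimal sketch of how a task router and fallback mechanism of this kind might fit together is shown here. The template text, the task type names, and the `route_task` function are hypothetical; the models are passed in as plain callables so the example runs without any inference backend.

```python
from typing import Callable

ModelFn = Callable[[str], str]

# Hypothetical single-turn templates; the actual templates are not reproduced here.
PROMPT_TEMPLATES = {
    "summarize_file": "Summarize the following file. State its purpose and exports.\n\n{payload}",
    "triage_build": "List only the errors in this build output.\n\n{payload}",
}


def route_task(task_type: str, payload: str,
               secondary: ModelFn, primary: ModelFn) -> tuple[str, str]:
    """Return (model_used, response), falling back to the primary model on failure."""
    template = PROMPT_TEMPLATES.get(task_type)
    if template is None:
        # No template means the task is not delegable in this sketch.
        return "primary", primary(payload)
    try:
        response = secondary(template.format(payload=payload))
        if not response.strip():
            raise ValueError("empty response from secondary model")
        return "secondary", response
    except Exception:
        # Transparent fallback: the caller still gets an answer, only the
        # cost saving for this request is lost.
        return "primary", primary(payload)


if __name__ == "__main__":
    def small_model(prompt: str) -> str:
        return "PURPOSE: demo module"

    def frontier_model(prompt: str) -> str:
        return "frontier answer"

    print(route_task("summarize_file", "x = 1", small_model, frontier_model))
```

Catching every exception is a deliberate simplification; a production router would likely distinguish timeouts, malformed output, and programming errors.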

4.3 Deployment Flexibility

Topology	Description	Use Case
Co-located	Secondary model on the same machine	Development, privacy-sensitive
Disaggregated	Secondary model as a cloud service	Production, team environments

Both topologies present the same interface. A strategy pattern abstracts the routing, with automatic fallback from disaggregated to co-located mode if the cloud service is unreachable.
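The strategy pattern with automatic fallback could look roughly like the following. All class names are hypothetical, and the backends return canned strings instead of calling a real model; the point is only that both topologies share one interface and that an unreachable cloud service triggers the co-located path.

```python
from typing import Protocol


class InferenceBackend(Protocol):
    def complete(self, prompt: str) -> str: ...


class CoLocatedBackend:
    """Stand-in for a secondary model running on the same machine."""

    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"


class DisaggregatedBackend:
    """Stand-in for a secondary model exposed as a cloud service."""

    def __init__(self, reachable: bool = True) -> None:
        self.reachable = reachable

    def complete(self, prompt: str) -> str:
        if not self.reachable:
            raise ConnectionError("cloud service unreachable")
        return f"[cloud] {prompt}"


class SecondaryModel:
    """Presents one interface and falls back from cloud to local."""

    def __init__(self, preferred: InferenceBackend, local: InferenceBackend) -> None:
        self.preferred = preferred
        self.local = local

    def complete(self, prompt: str) -> str:
        try:
            return self.preferred.complete(prompt)
        except ConnectionError:
            return self.local.complete(prompt)


if __name__ == "__main__":
    model = SecondaryModel(DisaggregatedBackend(reachable=False), CoLocatedBackend())
    print(model.complete("summarize main.py"))
```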

4.4 CPU-Only Inference

A practical finding: small models (1.7B-14B parameters) running structured extraction tasks perform adequately on CPU-only inference. On an Apple M4 Pro (24GB), our recommended 8B parameter model processes requests at approximately 25 tokens/second, completing most tasks in under 2 seconds.

5. Evaluation Framework

5.1 Multi-Mode Evaluation

We developed a three-mode evaluation framework: Fast Mode for deterministic structural validation (under 5 seconds, no external dependencies); Full Mode using a frontier model as automated judge with 3-shot median scoring per dimension; and Compare Mode for statistical A/B testing with Cohen's d effect size and bootstrap confidence intervals.
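The statistics named above can be sketched with the Python standard library. This is an illustration under assumptions, not the framework's actual code: it uses the pooled-standard-deviation form of Cohen's d and a percentile bootstrap on the difference of means, and the sample scores are made up.

```python
import random
import statistics
from typing import Sequence


def median_of_shots(shots: Sequence[float]) -> float:
    """Median of repeated judge scores for one dimension (e.g. 3 shots)."""
    return statistics.median(shots)


def cohens_d(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(baseline), len(candidate)
    var_a = statistics.variance(baseline)
    var_b = statistics.variance(candidate)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(candidate) - statistics.mean(baseline)) / pooled_sd


def bootstrap_ci(baseline: Sequence[float], candidate: Sequence[float],
                 n_resamples: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for mean(candidate) - mean(baseline)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resampled_a = rng.choices(baseline, k=len(baseline))
        resampled_b = rng.choices(candidate, k=len(candidate))
        diffs.append(statistics.mean(resampled_b) - statistics.mean(resampled_a))
    diffs.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return diffs[lo_idx], diffs[hi_idx]


if __name__ == "__main__":
    baseline = [70.0, 72.0, 68.0, 71.0, 69.0]   # made-up scores
    candidate = [80.0, 82.0, 78.0, 81.0, 79.0]  # made-up scores
    print("median of shots:", median_of_shots([84.0, 88.0, 86.0]))
    print("Cohen's d:", round(cohens_d(baseline, candidate), 2))
    print("95% CI for difference:", bootstrap_ci(baseline, candidate))
```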

5.2 Quality Rubrics

Each task type defines a weighted rubric with dimensions specific to that task. Weights are integers summing to 100.

Dimension	Weight	What It Measures
Completeness	30	Key symbols, module purpose, relationships captured
Accuracy	30	No hallucinated names, types, or incorrect facts
Conciseness	20	Summary length proportionate to input
Structure	20	Required output fields present
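As an illustration of how such a rubric yields a single score, the sketch below computes a weighted composite from the four dimensions in the table. It assumes each dimension is scored on a 0-100 scale, which the table does not state explicitly; the function name is hypothetical.

```python
# Weights copied from the rubric table above.
RUBRIC = {
    "completeness": 30,
    "accuracy": 30,
    "conciseness": 20,
    "structure": 20,
}


def composite_score(dimension_scores: dict[str, float],
                    rubric: dict[str, int] = RUBRIC) -> float:
    """Weighted average of per-dimension scores (assumed to be 0-100 each)."""
    if sum(rubric.values()) != 100:
        raise ValueError("rubric weights must sum to 100")
    missing = [name for name in rubric if name not in dimension_scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    return sum(rubric[name] * dimension_scores[name] for name in rubric) / 100


if __name__ == "__main__":
    scores = {"completeness": 90, "accuracy": 80, "conciseness": 85, "structure": 95}
    print(composite_score(scores))  # 87.0
```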

5.3 Quality Gates

Gate	Threshold	Enforcement
Structural validation	All required fields	Blocks any merge
Average quality score	>= 70/100	Blocks merge to main
Per-dimension regression	No drop > 5 points	Blocks merge to main
Composite regression	No drop > 3 points	Alerts and blocks
Token budget	Output < 70% of input	Warning only
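The gates in the table can be read as a simple check run against a baseline. The sketch below is one interpretation: the function and argument names are hypothetical, and it treats the candidate's composite score as the "average quality score", which is an assumption.

```python
def evaluate_gates(*, required_fields_present: bool,
                   baseline_scores: dict[str, float],
                   candidate_scores: dict[str, float],
                   baseline_composite: float,
                   candidate_composite: float,
                   input_tokens: int,
                   output_tokens: int) -> dict[str, list[str]]:
    """Apply the thresholds from the quality gates table."""
    blocking: list[str] = []
    warnings: list[str] = []

    if not required_fields_present:
        blocking.append("structural validation")
    if candidate_composite < 70:
        blocking.append("average quality score")
    for dimension, base in baseline_scores.items():
        # A missing dimension is treated as a score of 0 (assumption).
        if base - candidate_scores.get(dimension, 0.0) > 5:
            blocking.append(f"per-dimension regression: {dimension}")
    if baseline_composite - candidate_composite > 3:
        blocking.append("composite regression")
    if output_tokens >= 0.70 * input_tokens:
        warnings.append("token budget")  # warning only, never blocks

    return {"blocking": blocking, "warnings": warnings}


if __name__ == "__main__":
    result = evaluate_gates(
        required_fields_present=True,
        baseline_scores={"completeness": 90, "accuracy": 85},
        candidate_scores={"completeness": 83, "accuracy": 85},
        baseline_composite=87, candidate_composite=85,
        input_tokens=1000, output_tokens=800,
    )
    print(result)
```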

6. Results

6.1 Token Savings

File Size	Without Routing	With Routing	Savings	Turn Reduction
250 lines	87K tokens	64K tokens	27%	4 -> 1 turn
1,000 lines	125K tokens	52K tokens	58%	6 -> 1 turn
2,000 lines	148K tokens	105K tokens	29%	7 -> 2 turns

The routing layer's cost is roughly constant at about 53K tokens regardless of file size. Savings peak for files in the 500-1,500 line range.
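The savings column can be recomputed from the token counts in the table, as in this short sketch. Because the table reports counts rounded to the nearest thousand, the recomputed values can differ slightly from the published percentages: the 250-line row comes out near 26% rather than 27%, presumably a rounding effect.

```python
# (file size in lines, tokens without routing in K, tokens with routing in K),
# copied from the token savings table.
ROWS = [
    (250, 87, 64),
    (1000, 125, 52),
    (2000, 148, 105),
]


def savings_pct(without_k: float, with_k: float) -> float:
    """Percentage of tokens saved by routing."""
    return 100.0 * (without_k - with_k) / without_k


if __name__ == "__main__":
    for lines, without_k, with_k in ROWS:
        print(f"{lines:>5} lines: {savings_pct(without_k, with_k):.1f}% saved")
```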

6.2 Model Comparison

Size class	Parameters	Quality	Speed	Notes
Small	1.7B	73/100	99 tok/s	Fastest, adequate for simple tasks
Medium	8B	86/100	25 tok/s	Best balance
Medium-large	12B	90/100	17 tok/s	Highest overall scores
Large	14B	88/100	14 tok/s	Best at code understanding

Notable negative results: a 4B candidate had a 0% success rate, and one 14B candidate from a different model family was inconsistent, scoring 0 on some medium-difficulty tasks. These results underscore the importance of empirical benchmarking over assumptions based on parameter count.

6.3 The Think-Mode Finding

Figure: Quality score (/100), Think Mode ON vs. Think Mode OFF.

Disabling chain-of-thought reasoning improves performance for structured extraction tasks across all models tested. The explanation: chain-of-thought introduces unnecessary deliberation for tasks that are fundamentally pattern-matching and extraction, adding latency and occasionally leading the model to overthink straightforward tasks.

7. Design Insights and Lessons Learned

Regex extraction outperforms JSON mode

Small models frequently produce invalid JSON but reliably follow structural patterns. Regex-based extraction proved significantly more reliable for structured data.
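A minimal sketch of regex-based extraction follows. The `PURPOSE:` and `EXPORTS:` line format is a hypothetical output convention invented for this example, not the format used in our system; the idea is that labeled lines can be recovered even when the model wraps them in chatter or emits broken JSON.

```python
import re
from typing import Optional

# Hypothetical field labels the prompt would ask the small model to emit.
REQUIRED_FIELDS = ("PURPOSE", "EXPORTS")

FIELD_RE = re.compile(
    r"^[ \t]*(PURPOSE|EXPORTS)[ \t]*:[ \t]*(.+?)[ \t]*$",
    re.MULTILINE | re.IGNORECASE,
)


def shape_response(raw: str) -> Optional[dict[str, str]]:
    """Extract labeled fields; return None if any required field is missing."""
    fields: dict[str, str] = {}
    for match in FIELD_RE.finditer(raw):
        # Keep the first occurrence of each label.
        fields.setdefault(match.group(1).upper(), match.group(2))
    if any(name not in fields for name in REQUIRED_FIELDS):
        return None  # signals the caller to fall back
    return fields


if __name__ == "__main__":
    raw = (
        "Sure, here is the summary:\n"
        "PURPOSE: Parses config files\n"
        "exports: load_config, Config\n"
        "{not valid json"
    )
    print(shape_response(raw))
```

Returning `None` instead of raising keeps the shaper easy to combine with a fallback path.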

Authority framing prevents redundant work

Each response includes a directive to the primary model not to re-read source material. Without this, frontier models frequently re-read files after receiving summaries, negating the token savings.
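Authority framing amounts to appending a fixed directive to every delegated result. The sketch below illustrates the idea; the directive wording, the `frame_response` name, and the source-path header are all hypothetical and not the text used in our system.

```python
# Hypothetical wording; the actual directive text is not reproduced here.
AUTHORITY_DIRECTIVE = (
    "This summary is authoritative for the requested task. "
    "Do not re-read the source material."
)


def frame_response(summary: str, source_path: str) -> str:
    """Wrap a secondary-model result so the primary model treats it as final."""
    return f"[source: {source_path}]\n{summary.strip()}\n\n{AUTHORITY_DIRECTIVE}"


if __name__ == "__main__":
    print(frame_response("PURPOSE: Parses config files\n", "src/config.py"))
```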

Single-tool interface reduces schema overhead

Consolidating all task types behind a single tool with a task_type discriminator reduced schema token overhead by 68% (1,642 to 524 tokens). This compounds over an entire session.
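The consolidation can be illustrated with a tool definition shaped like an MCP tool (`name`, `description`, `inputSchema`). The tool name, task type names, and descriptions below are hypothetical, and serialized character count is used only as a rough proxy for tokens; the sketch does not attempt to reproduce the measured 68% figure.

```python
import json

# Hypothetical task type identifiers.
TASK_TYPES = ["summarize_file", "filter_search", "triage_build"]

# One tool with a task_type discriminator.
DELEGATE_TOOL = {
    "name": "delegate",
    "description": "Run a mechanical task on the secondary model.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "task_type": {"type": "string", "enum": TASK_TYPES},
            "content": {"type": "string"},
        },
        "required": ["task_type", "content"],
    },
}


def per_task_tools() -> list[dict]:
    """The alternative design: one separate tool per task type."""
    return [
        {
            "name": task,
            "description": f"Run the {task} task on the secondary model.",
            "inputSchema": {
                "type": "object",
                "properties": {"content": {"type": "string"}},
                "required": ["content"],
            },
        }
        for task in TASK_TYPES
    ]


def schema_size(tools: list[dict]) -> int:
    """Serialized size in characters, a rough stand-in for schema tokens."""
    return len(json.dumps(tools))


if __name__ == "__main__":
    print("single tool:", schema_size([DELEGATE_TOOL]))
    print("per-task tools:", schema_size(per_task_tools()))
```

The gap widens as task types are added, since each new type costs one enum entry in the single-tool design but a full schema in the per-task design.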

Graceful degradation is non-negotiable

Every delegated task has an automatic fallback. The developer never experiences a failure due to the optimization layer -- at worst, they lose cost savings for that request.

8. Limitations and Future Work

Files above approximately 3,000 lines require chunked processing, which introduces coordination overhead and potential information loss at chunk boundaries. Our current implementation covers 7 core task types, of which 2 are fully benchmarked -- each new task type requires its own prompt engineering, fixtures, and rubrics. Two of the six models tested were unsuitable, underscoring that model selection requires empirical validation.

Future research directions include adaptive routing using learned task complexity classifiers, multi-model cascading across capability levels, cross-session learning from accumulated benchmark data, and expanding the task taxonomy beyond the current 7 core types.

9. Conclusion

Task-specific model routing is a practical and effective technique for reducing the token cost of AI coding assistants. By classifying workloads into high-reasoning and mechanical categories, and delegating mechanical tasks to small, specialized models, we achieved 27-58% token savings while maintaining an average quality score of 86/100 under our benchmark framework.

The key contributions of this research are: a principled task classification framework based on four measurable properties; an architecture for multi-model routing leveraging existing tool-use protocols (MCP); a rigorous evaluation methodology combining deterministic validation, LLM-as-judge scoring, and statistical significance testing; and empirical findings including the counterintuitive result that disabling chain-of-thought reasoning improves structured extraction performance.

These results suggest that the AI coding assistant industry can achieve significant cost reductions without requiring better models -- only smarter routing of the models already available.

For inquiries about this research, contact research@goodex.dev
