Notion's Task-Based Model Routing Architecture
TRIGGER
AI features span different task types with conflicting requirements—some need deep reasoning and long-form coherence, others need fast responses at high volume with simpler outputs. Using a single model for all tasks means either overpaying for simple tasks or underperforming on complex ones.
APPROACH
Notion's AI team built a task-based routing layer that classifies each user request and dispatches it to the optimal model. Input: user request plus inferred task category. Output: a response from the category-optimal model, backed by ongoing regression testing across dozens of models and hundreds of prompts. Writing product specs routes to high-reasoning models (e.g., Claude, GPT-4) for fluency and long-form coherence. Question-answering about workspace history routes to models with large context windows and exhaustive reasoning for citation accuracy. High-volume structured tasks such as auto-filling database fields route to specialized fine-tuned models, cutting latency by 50% while improving output quality. The system is validated by AI Data Specialists (a hybrid QA/prompt-engineering role) using an LLM-as-a-judge evaluation framework with custom criteria per feature.
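A minimal sketch of what such a routing layer could look like. All names here are illustrative assumptions, not Notion's actual implementation: the toy keyword classifier stands in for whatever learned classifier or feature-level tagging a production router would use, and the model names are placeholders.

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str            # placeholder model identifier
    max_latency_ms: int  # latency budget for this task category

# Per-category model choices mirroring the examples in the text:
# high-reasoning generalists for spec writing, long-context models
# for workspace Q&A, a fine-tuned specialist for field auto-fill.
ROUTES = {
    "spec_writing": ModelConfig("high-reasoning-generalist", 30_000),
    "workspace_qa": ModelConfig("long-context-model", 15_000),
    "field_autofill": ModelConfig("fine-tuned-specialist", 500),
}

def classify(request: str) -> str:
    """Toy keyword classifier; a real router would infer the task
    category from a learned model or the invoking feature itself."""
    text = request.lower()
    if "fill" in text or "field" in text:
        return "field_autofill"
    if "history" in text or "when did" in text:
        return "workspace_qa"
    return "spec_writing"

def route(request: str) -> ModelConfig:
    """Dispatch a request to the category-optimal model config."""
    return ROUTES[classify(request)]

config = route("Fill in the Status field for these rows")
print(config.name)  # fine-tuned-specialist
```

The key design choice is that the routing table, not the caller, owns the model decision, so swapping a generalist for a new fine-tuned specialist in one category touches a single entry rather than every feature.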
PATTERN
“A fine-tuned specialist beats your expensive generalist on both speed AND quality for narrow tasks—so 'which LLM' is the wrong question. Model selection is a per-task architecture decision. The trap is treating it as global configuration when your product's tasks have fundamentally different optimization targets.”
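The APPROACH above mentions validation via an LLM-as-a-judge framework with per-feature criteria. A hedged sketch of how such per-feature rubrics might be wired into a grading harness follows; every name and criterion is a hypothetical stand-in, and `call_judge_model` is a placeholder for whatever judge-model API is actually used.

```python
import json

# Hypothetical per-feature rubrics, echoing the quality targets the
# text names for each task category (coherence, citation accuracy, etc.).
CRITERIA = {
    "spec_writing": ["fluency", "long-form coherence", "completeness"],
    "workspace_qa": ["citation accuracy", "groundedness"],
    "field_autofill": ["schema validity", "exact-match correctness"],
}

def build_judge_prompt(feature: str, request: str, response: str) -> str:
    """Assemble a grading prompt from the feature's custom rubric."""
    rubric = "\n".join(f"- {c}: score 1-5" for c in CRITERIA[feature])
    return (
        f"Grade this {feature} response against each criterion.\n"
        f"Request: {request}\nResponse: {response}\n"
        f"Rubric:\n{rubric}\n"
        'Reply as JSON mapping each criterion to its score.'
    )

def judge(feature, request, response, call_judge_model):
    """Run the judge model and parse its JSON scorecard."""
    raw = call_judge_model(build_judge_prompt(feature, request, response))
    return json.loads(raw)

# Usage with a stubbed judge model standing in for a real API call:
fake_model = lambda prompt: '{"citation accuracy": 5, "groundedness": 4}'
scores = judge("workspace_qa", "When did we ship v2?",
               "March, per doc X.", fake_model)
print(scores["groundedness"])  # 4
```

Because each feature carries its own rubric, the same harness can regression-test every category against its own optimization target rather than a single global quality bar.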
✓ WORKS WHEN
- Product has distinct AI task categories with different quality/latency/cost priorities
- High-volume simple tasks (field extraction, classification) coexist with low-volume complex tasks (generation, reasoning)
- You have enough task-specific training data to fine-tune specialist models for high-volume categories
- Latency requirements vary significantly across features (sub-second for autocomplete, multi-second acceptable for document generation)
- Cost scales with volume and some task categories dominate inference spend
✗ FAILS WHEN
- All tasks have similar complexity and latency requirements
- Volume is too low to justify fine-tuning costs or maintaining multiple model integrations
- Tasks are highly interdependent and need consistent reasoning across a single context
- The overhead of routing and switching between models exceeds the gains from specialization
- Team lacks infrastructure to manage routing logic and multiple model deployments