How Canva Tests Search Understanding with Layered Query Difficulty
TRIGGER
Search evaluation that uses only 'happy path' queries (exact title matches and well-formed searches) fails to catch regressions in spell correction, synonym handling, and query understanding, which users hit daily through typos and reformulations.
APPROACH
Canva's team built a query difficulty ladder by programmatically degrading generated queries. Input: a base query extracted from a synthetic document. Output: multiple query variants at different difficulty levels, all targeting the same relevant document. Starting from an 'easy' query (words sampled directly from the document's title or content), they applied transformations: misspelling one or more words, replacing words with synonyms, or rewording the entire query via GPT-4o. Each transformation represents a difficulty tier. Because every variant shares one known relevant document, the team could segment evaluation metrics by tier and identify which query-understanding components were failing.
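As an illustration only, the ladder could be sketched in Python roughly as follows. This is not Canva's implementation: the tier names, the SYNONYMS table, the adjacent-character swap used as a typo model, and the build_ladder and misspell helpers are all assumptions made for the example. The paraphrase tier (done with GPT-4o in the source) is left as a caller-supplied function so the sketch stays self-contained.

```python
import random
from typing import Callable, Dict, Optional

# Hypothetical synonym table; a real system would use a domain thesaurus.
SYNONYMS: Dict[str, str] = {
    "birthday": "anniversary",
    "poster": "flyer",
    "template": "layout",
}


def misspell(word: str, rng: random.Random) -> str:
    """Swap two adjacent interior characters to imitate a transposition typo."""
    if len(word) < 4:
        return word  # very short words are left untouched
    i = rng.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def build_ladder(
    base_query: str,
    seed: int = 0,
    paraphrase: Optional[Callable[[str], str]] = None,
) -> Dict[str, str]:
    """Return query variants keyed by (assumed) tier name, all for one document."""
    rng = random.Random(seed)  # seeded so the evaluation set is reproducible
    words = base_query.split()
    ladder = {"easy": base_query}

    # Typo tier: misspell one word that is long enough to transpose.
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    typo_words = list(words)
    if candidates:
        idx = rng.choice(candidates)
        typo_words[idx] = misspell(words[idx], rng)
    ladder["typo"] = " ".join(typo_words)

    # Synonym tier: replace every word that has an entry in the table.
    ladder["synonym"] = " ".join(SYNONYMS.get(w.lower(), w) for w in words)

    # Paraphrase tier: delegated to the caller (for example, an LLM call).
    if paraphrase is not None:
        ladder["paraphrase"] = paraphrase(base_query)
    return ladder


if __name__ == "__main__":
    for tier, query in build_ladder("birthday poster template", seed=1).items():
        print(f"{tier}: {query}")
```

Every variant keeps pointing at the same source document, so one relevance label serves all tiers.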
PATTERN
“You'll ship regressions to the components most users depend on if you only test easy queries. Easy queries test indexing; hard queries (typos, synonyms, paraphrases) test the layers of query processing that handle real user messiness.”
✓ WORKS WHEN
- Your search pipeline has multiple query-processing stages (spell correction, synonym expansion, semantic matching) that can fail independently
- You can programmatically generate query variants that stress different components (misspellings for spell-check, synonyms for semantic matching)
- You want to measure robustness across difficulty levels rather than just aggregate recall/precision
- Query transformations can be applied systematically without human judgment on what constitutes realistic mistakes
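Segmenting by difficulty level can be sketched as below, assuming each evaluation record carries its tier, the one document the query was generated from, and the ranked results the search system returned. The EvalRecord type, the hit_rate_by_tier function, and the document ids are hypothetical names for this example. With a single relevant document per query, the hit rate at k is the same as recall@k.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class EvalRecord(NamedTuple):
    tier: str               # difficulty tier of the query variant
    target_doc: str         # id of the document the query was generated from
    ranked_docs: List[str]  # ids returned by the search system, best first


def hit_rate_by_tier(records: Iterable[EvalRecord], k: int = 10) -> Dict[str, float]:
    """Fraction of queries per tier whose target document appears in the top k."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for record in records:
        totals[record.tier] += 1
        if record.target_doc in record.ranked_docs[:k]:
            hits[record.tier] += 1
    return {tier: hits[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Made-up records purely to show the call shape.
    sample = [
        EvalRecord("easy", "doc1", ["doc1", "doc7"]),
        EvalRecord("easy", "doc2", ["doc2", "doc1"]),
        EvalRecord("typo", "doc1", ["doc1", "doc3"]),
        EvalRecord("typo", "doc2", ["doc9", "doc4"]),
    ]
    print(hit_rate_by_tier(sample, k=2))
```

A drop confined to one tier (for example, the typo tier falling while the easy tier holds) points at the component that tier stresses, which an aggregate number would hide.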
✗ FAILS WHEN
- Your search is primarily semantic/embedding-based, and the embedding model already handles typos and synonyms uniformly
- Real user query errors are domain-specific in ways that random misspelling doesn't capture (e.g., medical terminology errors follow different patterns than general typos)
- The difficulty ladder creates unrealistic query patterns that don't match actual user behavior
- You can't segment production metrics by query difficulty to validate that offline difficulty tiers correlate with real-world performance