How Canva Tests Search Understanding with Layered Query Difficulty

TRIGGER

Search evaluation that uses only 'happy path' queries (exact title matches and well-formed searches) fails to catch regressions in spell correction, synonym handling, and query understanding. Users hit those components daily through typos and reformulations.

APPROACH

Canva's team built a query difficulty ladder by programmatically degrading generated queries. The input is a base query extracted from a synthetic document; the output is a set of query variants at different difficulty levels, all targeting the same relevant document. Starting from an 'easy' query (words sampled directly from the document's title or content), they applied three transformations: misspelling one or more words, replacing words with synonyms, and rewording the entire query via GPT-4o. Each transformation represents a difficulty tier. Segmenting evaluation metrics by tier let them identify which query-understanding components were failing.
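The ladder described above can be sketched in a few lines of Python. This is a minimal illustration, not Canva's implementation: the function names, the tiny `SYNONYMS` table, the adjacent-character-swap typo model, and the `rephrase` callable (a stand-in for the GPT-4o rewording step) are all assumptions made for the example.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical synonym table; a real setup might use a thesaurus or an LLM.
SYNONYMS: Dict[str, str] = {"poster": "flyer", "birthday": "anniversary"}


def misspell(word: str, rng: random.Random) -> str:
    """Swap two adjacent, differing inner characters to simulate a typo."""
    candidates = [i for i in range(1, len(word) - 2) if word[i] != word[i + 1]]
    if not candidates:
        return word  # too short, or no swap would change the word
    i = rng.choice(candidates)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def misspell_query(query: str, rng: random.Random, n_words: int = 1) -> str:
    """Misspell up to n_words words that are at least four characters long."""
    words = query.split()
    eligible = [i for i, w in enumerate(words) if len(w) >= 4]
    for i in rng.sample(eligible, min(n_words, len(eligible))):
        words[i] = misspell(words[i], rng)
    return " ".join(words)


def swap_synonyms(query: str, synonyms: Dict[str, str]) -> str:
    """Replace every word that has an entry in the synonym table."""
    return " ".join(synonyms.get(w, w) for w in query.split())


def build_ladder(
    base_query: str,
    doc_id: str,
    rng: random.Random,
    rephrase: Optional[Callable[[str], str]] = None,
) -> List[dict]:
    """Return one test case per difficulty tier, all pointing at doc_id."""
    ladder = {
        "easy": base_query,
        "misspelled": misspell_query(base_query, rng),
        "synonym": swap_synonyms(base_query, SYNONYMS),
    }
    if rephrase is not None:  # assumed hook for an LLM-based rewording step
        ladder["reworded"] = rephrase(base_query)
    return [
        {"tier": tier, "query": query, "relevant_doc": doc_id}
        for tier, query in ladder.items()
    ]


if __name__ == "__main__":
    for case in build_ladder("birthday poster template", "doc-1", random.Random(0)):
        print(case)
```

Because every variant carries the same `relevant_doc`, a drop in retrieval quality at one tier can be attributed to the transformation applied at that tier rather than to a change in the target document.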

PATTERN

“You'll ship regressions to the components most users depend on if you only test easy queries. Easy queries test indexing; hard queries (typos, synonyms, paraphrases) test the layers of query processing that handle real user messiness.”

✓ WORKS WHEN

Your search pipeline has multiple query-processing stages (spell correction, synonym expansion, semantic matching) that can fail independently
You can programmatically generate query variants that stress different components (misspellings for spell-check, synonyms for semantic matching)
You want to measure robustness across difficulty levels rather than just aggregate recall/precision
Query transformations can be applied systematically without human judgment on what constitutes realistic mistakes
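
Measuring robustness per difficulty level, rather than as one aggregate number, might look like the sketch below. It assumes test cases shaped as dictionaries with `tier`, `query`, and `relevant_doc` keys and a `search` callable that returns ranked document IDs; both are hypothetical interfaces chosen for illustration, and hit rate at k stands in for whichever metric the team actually tracks.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def hit_rate_by_tier(
    cases: Iterable[dict],
    search: Callable[[str], List[str]],
    k: int = 10,
) -> Dict[str, float]:
    """Fraction of queries per tier whose relevant doc appears in the top k."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for case in cases:
        top_k = search(case["query"])[:k]
        totals[case["tier"]] += 1
        hits[case["tier"]] += case["relevant_doc"] in top_k
    return {tier: hits[tier] / totals[tier] for tier in totals}


if __name__ == "__main__":
    # Toy exact-match search, standing in for a real search pipeline.
    index = {"birthday poster": ["doc-1", "doc-2"]}

    def toy_search(query: str) -> List[str]:
        return index.get(query, [])

    cases = [
        {"tier": "easy", "query": "birthday poster", "relevant_doc": "doc-1"},
        {"tier": "misspelled", "query": "birhtday poster", "relevant_doc": "doc-1"},
    ]
    print(hit_rate_by_tier(cases, toy_search))
```

With the toy exact-match search, the easy tier scores 1.0 and the misspelled tier scores 0.0, which is the kind of gap an aggregate metric over mostly easy queries would hide.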

✗ FAILS WHEN

Your search is primarily semantic or embedding-based, so typos and synonyms are already handled uniformly by the embedding model
Real user query errors are domain-specific in ways that random misspelling doesn't capture (e.g., medical terminology errors follow different patterns than general typos)
The difficulty ladder creates unrealistic query patterns that don't match actual user behavior
You can't segment production metrics by query difficulty to validate that offline difficulty tiers correlate with real-world performance

Stage

build

Source

Canva

From

November 2024