build

How Canva Tests Search Understanding with Layered Query Difficulty

TRIGGER

Search evaluation that uses only 'happy path' queries (exact title matches and well-formed searches) fails to catch regressions in spell correction, synonym handling, and query understanding. Users hit those components daily through typos and reformulations.

APPROACH

Canva's team built a query difficulty ladder by programmatically degrading generated queries. The input is a base query extracted from a synthetic document; the output is a set of query variants at different difficulty levels, all targeting the same relevant document. Starting from an 'easy' query (words sampled directly from the document's title or content), they applied three transformations: misspelling one or more words, replacing words with synonyms, and rewording the entire query via GPT-4o. Each transformation represents a difficulty tier. Segmenting evaluation metrics by tier let them identify which query-understanding components were failing.
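A minimal sketch of such a ladder, not Canva's implementation. The synonym table, the adjacent-character swap used as a typo model, the tier names, and every function name here are assumptions made for illustration. The rewording step is passed in as a plain callable (`paraphrase`) so the example runs without calling any LLM API.

```python
import random

# Hypothetical synonym table; a real system would use a thesaurus or domain lexicon.
SYNONYMS = {"poster": "flyer", "birthday": "bday", "template": "layout"}

def easy_query(title, rng, n_words=3):
    """Sample up to n_words words from the title, keeping their original order."""
    words = title.lower().split()
    picked = sorted(rng.sample(range(len(words)), min(n_words, len(words))))
    return " ".join(words[i] for i in picked)

def misspell(query, rng):
    """Swap one pair of adjacent, differing characters in a word of 4+ letters."""
    words = query.split()
    # (word index, char position) pairs where a swap actually changes the word
    swaps = [(i, p) for i, w in enumerate(words) if len(w) >= 4
             for p in range(len(w) - 1) if w[p] != w[p + 1]]
    if not swaps:
        return query  # nothing safe to degrade
    i, p = rng.choice(swaps)
    w = words[i]
    words[i] = w[:p] + w[p + 1] + w[p] + w[p + 2:]
    return " ".join(words)

def synonymize(query):
    """Replace every word that has an entry in the synonym table."""
    return " ".join(SYNONYMS.get(w, w) for w in query.split())

def build_ladder(doc_id, title, rng, paraphrase=None):
    """Return one query per tier, all pointing at the same relevant doc_id."""
    base = easy_query(title, rng)
    ladder = [
        {"doc_id": doc_id, "tier": "easy", "query": base},
        {"doc_id": doc_id, "tier": "misspelled", "query": misspell(base, rng)},
        {"doc_id": doc_id, "tier": "synonym", "query": synonymize(base)},
    ]
    if paraphrase is not None:
        # paraphrase stands in for the LLM rewording step (assumed interface)
        ladder.append({"doc_id": doc_id, "tier": "paraphrase",
                       "query": paraphrase(base)})
    return ladder
```

Passing a seeded `random.Random` keeps the generated set reproducible between evaluation runs, so a metric change reflects the search pipeline and not a reshuffled query set.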

PATTERN

If you only test easy queries, you'll ship regressions in the components most users depend on. Easy queries test indexing; hard queries (typos, synonyms, paraphrases) test the layers of query processing that handle real user messiness.

WORKS WHEN

  • Your search pipeline has multiple query-processing stages (spell correction, synonym expansion, semantic matching) that can fail independently
  • You can programmatically generate query variants that stress different components (misspellings for spell-check, synonyms for semantic matching)
  • You want to measure robustness across difficulty levels rather than just aggregate recall/precision
  • Query transformations can be applied systematically without human judgment on what constitutes realistic mistakes
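Measuring robustness per difficulty level rather than in aggregate could be sketched as below. This is an illustration under stated assumptions: the case format (`tier`, `query`, `doc_id`), the hit-rate-at-k metric, and `toy_search` (an exact-token matcher standing in for a real search backend) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical two-document index used only to make the example runnable.
TOY_INDEX = {"doc-1": "birthday poster", "doc-2": "wedding invitation"}

def toy_search(query, k):
    """Return up to k doc ids whose title contains every query token exactly."""
    tokens = set(query.split())
    matches = [doc_id for doc_id, title in TOY_INDEX.items()
               if tokens <= set(title.split())]
    return matches[:k]

def hit_rate_by_tier(cases, search, k=10):
    """Fraction of cases per tier whose relevant doc appears in the top k."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in cases:
        totals[case["tier"]] += 1
        if case["doc_id"] in search(case["query"], k):
            hits[case["tier"]] += 1
    return {tier: hits[tier] / totals[tier] for tier in totals}

cases = [
    {"tier": "easy", "query": "birthday poster", "doc_id": "doc-1"},
    {"tier": "easy", "query": "wedding invitation", "doc_id": "doc-2"},
    {"tier": "misspelled", "query": "birhtday poster", "doc_id": "doc-1"},
    {"tier": "misspelled", "query": "wedding invitaiton", "doc_id": "doc-2"},
]
report = hit_rate_by_tier(cases, toy_search)
```

With this toy backend the easy tier scores 1.0 and the misspelled tier 0.0, which is the kind of gap an aggregate number (0.5 here) would hide: it points at spell correction specifically, not at indexing.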

FAILS WHEN

  • Your search is primarily semantic/embedding-based where typos and synonyms are already handled uniformly by the embedding model
  • Real user query errors are domain-specific in ways that random misspelling doesn't capture (e.g., medical terminology errors follow different patterns than general typos)
  • The difficulty ladder creates unrealistic query patterns that don't match actual user behavior
  • You can't segment production metrics by query difficulty to validate that offline difficulty tiers correlate with real-world performance

Stage

build

Source

Canva

From

November 2024
