
How Canva Evaluates Search Without Seeing User Queries

TRIGGER

Search teams at companies with privacy constraints cannot view user queries or content to build evaluation datasets. The standard approach, human judges labeling real query-document pairs, is impossible when the data is private user designs.

APPROACH

Canva's team used GPT-4o to generate fully synthetic evaluation data. For each test case:

  • Seed the LLM with a realistic topic and design type sampled from real distribution statistics
  • Generate titles and text content matching aggregate character-length distributions from production
  • Create queries by sampling words from the generated content, then programmatically modifying them (misspellings, synonyms, LLM rewording) at multiple difficulty levels
  • Generate non-relevant documents by creating partial-match content and modified variants

Input: aggregate statistics about design distributions plus prompts. Output: 1000+ labeled test cases with relevant/non-relevant document pairs. The team ran evaluation locally using Testcontainers to replicate production Elasticsearch and ML models, producing results in under 10 minutes versus the 2-3 days an online A/B test takes.
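The query-creation step (sample words from generated content, then perturb them at difficulty levels) can be sketched as below. This is an illustrative reconstruction, not Canva's code: the function names, the synonym table, and the specific perturbations are assumptions.

```python
import random

def misspell(word, rng):
    # Simulate a typo by swapping two adjacent characters.
    if len(word) < 3:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def make_queries(title, rng, synonyms):
    # Sample up to two words from the generated title as the query seed,
    # then produce one query per difficulty level.
    words = title.lower().split()
    base = rng.sample(words, k=min(2, len(words)))
    return {
        "easy": " ".join(base),                                   # exact words
        "medium": " ".join(misspell(w, rng) for w in base),       # typos
        "hard": " ".join(synonyms.get(w, misspell(w, rng))        # synonyms,
                         for w in base),                          # else typos
    }

rng = random.Random(7)
queries = make_queries("Birthday Party Invitation", rng,
                       synonyms={"party": "celebration"})
```

In the real pipeline the "hard" tier also used LLM rewording; a lookup table stands in for that here to keep the sketch self-contained.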

PATTERN

Aggregate statistics (length distributions, type frequencies) seed LLM-generated synthetic data that correlates with production without accessing private queries. Run hundreds of evaluations locally while online tests take days.
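The seeding idea can be sketched as sampling generation parameters from aggregate distributions and templating them into a prompt. The histograms and weights below are invented placeholders, not Canva's real statistics:

```python
import random

# Hypothetical aggregate stats: title length (in words) -> frequency.
# Individual records stay private; only these distributions are shared.
length_hist = {2: 0.15, 3: 0.35, 4: 0.30, 5: 0.20}

def sample_title_length(rng):
    lengths, weights = zip(*length_hist.items())
    return rng.choices(lengths, weights=weights, k=1)[0]

def build_prompt(rng, design_types):
    # Draw a title length and a design type from aggregate distributions,
    # then embed them in the generation prompt sent to the LLM.
    n = sample_title_length(rng)
    dtype = rng.choices(list(design_types),
                        weights=list(design_types.values()), k=1)[0]
    return (f"Write a {n}-word title for a {dtype} design, "
            f"plus one paragraph of body text.")

rng = random.Random(42)
prompt = build_prompt(rng, {"presentation": 0.4, "poster": 0.35,
                            "social post": 0.25})
```

Because the prompt is parameterized by production distributions rather than production records, the synthetic corpus tracks real data shape without exposing any private design.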

WORKS WHEN

  • You have access to aggregate statistics about real data distributions even when individual records are private
  • The search task is re-finding (one correct answer per query) rather than exploratory discovery
  • Your search pipeline can run locally via containers for rapid iteration (Canva ran 300+ evaluations in the time of one production experiment)
  • Query patterns are somewhat predictable: users search by words in titles/content, with common error types like misspellings and synonyms
  • Offline evaluation results correlate with online A/B test outcomes (validate this before trusting the synthetic pipeline)
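The last point, validating that offline results track online outcomes, can be checked by rank-correlating per-experiment offline metric deltas against their online A/B deltas over past experiments. A minimal Spearman sketch (the deltas are made-up illustrative numbers; real ones would come from historical experiments; ties are not handled):

```python
def rank(xs):
    # Rank values from 1 (smallest) to n (largest); assumes no ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos + 1
    return r

def spearman(offline, online):
    # Spearman rho via the classic sum-of-squared-rank-differences formula.
    rx, ry = rank(offline), rank(online)
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-experiment metric changes (offline eval vs. online A/B).
offline_deltas = [0.8, -0.2, 1.5, 0.1, -0.6]
online_deltas  = [0.5, -0.1, 1.2, 0.0, -0.4]
rho = spearman(offline_deltas, online_deltas)
```

A high rho means offline wins reliably predict online wins, which is the precondition for trusting the synthetic pipeline as a fast proxy.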

FAILS WHEN

  • Search relevance depends on user-specific context that can't be captured in aggregate statistics (collaborative filtering, personalization based on behavior history)
  • The query vocabulary is highly specialized or domain-specific in ways that LLMs can't realistically generate without real examples
  • You need to evaluate ranking among many relevant results rather than single-target retrieval
  • The LLM consistently refuses or fails to generate edge cases you need to test (Canva found GPT-4o wouldn't create 12-15 word titles)
  • Your production infrastructure can't be replicated locally for deterministic evaluation

Stage

build

Source

Canva

From

November 2024
