Canva's Hierarchical Criteria Approach to Embedding Model Selection
TRIGGER
The team needed to select an embedding model for image similarity search, but 'similarity' is subjective and standard benchmarks don't capture domain-specific requirements: what makes two images 'similar enough' for template replacement differs from academic similarity metrics.
APPROACH
Canva defined an explicit similarity hierarchy (subject > color/tone > positioning > background > emotion) before evaluation. They generated embeddings for 50,000 images across 5 models (DINOv2, CLIP, ViTMAE, DreamSim, CaiT), stored them in Faiss, then had engineers and designers manually review the 3 nearest neighbors for each of 200 sample query images. Each model was scored against the hierarchical criteria. Input: 200 query images × 5 models. Output: a ranked model selection based on human judgment against the defined hierarchy. DINOv2 won for their photo-heavy use case.
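The retrieval step above can be sketched in a few lines. This is a hypothetical illustration, not Canva's code: the 768-dim embeddings and the reduced 5,000-image corpus are assumptions, and brute-force cosine similarity stands in for the Faiss index (an exact `faiss.IndexFlatIP` over L2-normalized vectors computes the same ranking).

```python
import numpy as np

# Assumed shapes: 768-dim embeddings, 5,000-image corpus (Canva used 50,000),
# 200 review queries. Random vectors stand in for real model embeddings.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((5_000, 768)).astype("float32")
queries = rng.standard_normal((200, 768)).astype("float32")

def top_k_neighbors(queries: np.ndarray, corpus: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k most cosine-similar corpus vectors per query, best first."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T                                    # (n_queries, n_corpus)
    top = np.argpartition(-sims, k, axis=1)[:, :k]    # unordered top-k
    order = np.argsort(-np.take_along_axis(sims, top, axis=1), axis=1)
    return np.take_along_axis(top, order, axis=1)     # sorted best-first

neighbors = top_k_neighbors(queries, corpus, k=3)
print(neighbors.shape)  # (200, 3): 3 candidate neighbors per query to review
```

In the evaluation, this runs once per candidate model; the resulting 200×3 neighbor grids are what the engineers and designers score by hand.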
PATTERN
Endless "which model looks better" debates with no consensus? Without an explicit priority hierarchy (subject > color > positioning > background), reviewers optimize for different things. Define the criteria ranking before evaluation, or waste weeks on subjective arguments.
✓ WORKS WHEN
- Task involves subjective matching where 'correct' depends on unstated priorities (similarity, relevance, quality)
- You have access to domain experts (designers, domain specialists) who can evaluate samples
- Sample size of 100-500 images is sufficient to reveal model behavior patterns
- Evaluation criteria can be decomposed into rankable factors
- Standard academic benchmarks don't capture your specific use case requirements
✗ FAILS WHEN
- Ground truth is objective and measurable (classification accuracy, exact match)
- No domain experts available for manual review
- Criteria genuinely cannot be prioritized (all factors equally weighted)
- Scale requires automated evaluation (millions of samples needed for statistical significance)
- Task is well-represented by existing benchmarks