The Training Data Mismatch Trap in Embedding Models
TRIGGER
Image similarity system performed well on photographs but produced poor results on graphics, cartoons, text-heavy images, and symbols—despite using the same embedding model and infrastructure for all content types.
APPROACH
Canva deployed DINOv2 for reverse image search across 150M+ images. Post-deployment analysis revealed strong results on photos but weak results on cartoons, symbols, and text-containing images. Root cause: DINOv2 was trained on the LVD-142M dataset, which is composed primarily of photographs rather than graphics or symbolic imagery. The model also wasn't trained for symbol or text recognition, focusing instead on object, texture, and scene categorization. Result: a 4.5x speed improvement for photo replacement, but designers must bypass the suggestions for graphic content.
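The post-deployment analysis above amounts to breaking retrieval quality out per content type instead of trusting one aggregate score. A minimal sketch of that segmented evaluation follows; the numbers and the `(content_type, hit)` result format are illustrative assumptions, not Canva's actual data or pipeline.

```python
from collections import defaultdict

def recall_by_content_type(results):
    """Aggregate top-k retrieval hit rate per content type.

    `results` is a list of (content_type, hit) pairs, where `hit` is
    True if the query's ground-truth match appeared in the top-k
    neighbours returned by the embedding index. Format is assumed
    for illustration.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for content_type, hit in results:
        totals[content_type] += 1
        hits[content_type] += int(hit)
    return {t: hits[t] / totals[t] for t in totals}

# Illustrative numbers only: photos score well, graphics poorly.
eval_results = (
    [("photo", True)] * 95 + [("photo", False)] * 5
    + [("graphic", True)] * 40 + [("graphic", False)] * 60
)
scores = recall_by_content_type(eval_results)
# scores["photo"] is 0.95 while scores["graphic"] is 0.40 — the gap
# an aggregate benchmark score would hide.
```

Reporting a single blended number over this data would show ~78% recall and mask the failure entirely, which is why the segmented view matters.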
PATTERN
The "training data mismatch trap": 95% benchmark accuracy on photos but garbage results on graphics, icons, and diagrams. Models trained on LVD-142M photos treat your non-photo content as noise. Check dataset provenance before trusting benchmark scores.
✓ WORKS WHEN
- Your content distribution differs significantly from model training data (graphics vs photos, technical diagrams vs natural images)
- Model documentation or papers specify training data composition
- You can segment content types and route to different models or fallbacks
- Failure modes are acceptable with human bypass option
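The "segment and route" condition above can be sketched as a simple dispatch layer with a human-bypass fallback. The model functions here are hypothetical stand-ins, not real APIs: one represents a DINOv2-style photo model, the other a model better suited to graphics.

```python
def embed_with_photo_model(image):
    # Stand-in for a DINOv2-style model that handles photographs well.
    return ("photo-embedding", image)

def embed_with_graphic_model(image):
    # Stand-in for a model (or service) trained on graphics/icons.
    return ("graphic-embedding", image)

def route_embedding(content_type, image):
    """Route each image to a model suited to its distribution.

    Returning None signals the UI to skip automated suggestions and
    fall back to manual selection — the human-bypass option.
    """
    routes = {
        "photo": embed_with_photo_model,
        "graphic": embed_with_graphic_model,
    }
    handler = routes.get(content_type)
    if handler is None:
        return None  # no trusted model for this type: bypass
    return handler(image)
```

The design choice here is to make the bypass path explicit rather than silently returning low-quality neighbours from a mismatched model.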
✗ FAILS WHEN
- Content is homogeneous and matches common training distributions (natural photos)
- Training data composition is undocumented or proprietary
- Single-model requirement for latency or infrastructure reasons
- No way to detect content type for routing decisions
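The last failure mode — no way to detect content type — can sometimes be softened with a cheap heuristic: flat graphics and icons reuse few distinct colors, while photographs vary almost continuously. This unique-color-ratio check and its 5% threshold are illustrative assumptions to tune on real data, not a production detector.

```python
def looks_like_graphic(pixels, max_unique_ratio=0.05):
    """Heuristic classifier: graphic/icon vs photograph.

    `pixels` is a flat list of (r, g, b) tuples. If the fraction of
    distinct colors is at or below `max_unique_ratio` (an assumed,
    tunable threshold), treat the image as a flat graphic.
    """
    unique = len(set(pixels))
    return unique / len(pixels) <= max_unique_ratio

# Synthetic examples: a two-color icon vs a high-variation "photo".
flat_icon = [(255, 0, 0)] * 900 + [(255, 255, 255)] * 100
noisy_photo = [(i % 256, (i * 7) % 256, (i * 13) % 256) for i in range(1000)]
```

A heuristic like this only enables routing when it is reliably better than random; when it isn't, the single-model failure mode above still applies.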