The Training Data Mismatch Trap in Embedding Models

TRIGGER

An image similarity system performed well on photographs but poorly on graphics, cartoons, text-heavy images, and symbols, even though the same embedding model and infrastructure served all content types.

APPROACH

Canva deployed DINOv2 for reverse image search across 150M+ images. Post-deployment analysis revealed strong results on photos but weak results on cartoons, symbols, and text-containing images. Root cause: DINOv2 was trained on the LVD-142M dataset, which is composed primarily of photographs rather than graphics or symbolic imagery. The model also wasn't trained for symbol or text recognition, focusing instead on object, texture, and scene categorization. Result: a 4.5x speed improvement for photo replacement, but designers must bypass suggestions for graphic content.

PATTERN

The "training data mismatch trap": 95% benchmark accuracy on photos but garbage results on graphics, icons, and diagrams. Models trained on LVD-142M photos treat your non-photo content as noise. Check dataset provenance before trusting benchmark scores.
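One way to catch this trap before deployment is to score retrieval quality per content segment instead of trusting a single pooled benchmark number. A minimal sketch with NumPy, assuming you already have embeddings and ground-truth match labels per query (all function names and data are illustrative, not Canva's pipeline):

```python
import numpy as np

def recall_at_1(query_embs, index_embs, true_ids, index_ids):
    """Fraction of queries whose cosine nearest neighbor is the true match."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    x = index_embs / np.linalg.norm(index_embs, axis=1, keepdims=True)
    nearest = np.asarray(index_ids)[np.argmax(q @ x.T, axis=1)]
    return float(np.mean(nearest == np.asarray(true_ids)))

def per_segment_recall(segments):
    """Score each content segment (photo, graphic, icon, ...) separately.

    segments maps a label to (query_embs, index_embs, true_ids, index_ids).
    A model can look excellent on the pooled set while one segment is broken.
    """
    return {label: recall_at_1(*data) for label, data in segments.items()}
```

A large gap between segments (say, 0.95 on photos vs 0.40 on icons) is exactly the signature of a training data mismatch that an aggregate score would hide.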

WORKS WHEN

  • Your content distribution differs significantly from model training data (graphics vs photos, technical diagrams vs natural images)
  • Model documentation or papers specify training data composition
  • You can segment content types and route to different models or fallbacks
  • Failure modes are acceptable with human bypass option
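The segmentation-and-routing idea in the list above can be sketched as a simple content-type gate in front of the embedder. The classifier, threshold, and model names here are hypothetical placeholders, not the source's implementation:

```python
# Assumed confidence cutoff for the content-type classifier (illustrative).
PHOTO_THRESHOLD = 0.8

def route(image, classify_content_type, embed_photo, embed_fallback):
    """Send photos to the photo-trained embedder; everything else to a fallback.

    classify_content_type returns (label, confidence), e.g. ("photo", 0.93).
    The fallback could be a graphics-trained model or plain metadata search.
    """
    label, confidence = classify_content_type(image)
    if label == "photo" and confidence >= PHOTO_THRESHOLD:
        return ("photo_model", embed_photo(image))
    # Non-photo content, or low classifier confidence: bypass the photo
    # embedder rather than return confidently wrong neighbors.
    return ("fallback", embed_fallback(image))
```

Routing on low confidence as well as on label matters: an uncertain classification is itself a signal that the content may sit outside the photo model's training distribution.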

FAILS WHEN

  • Content is homogeneous and matches common training distributions (natural photos)
  • Training data composition is undocumented or proprietary
  • Single-model requirement for latency or infrastructure reasons
  • No way to detect content type for routing decisions

Stage

build

Source

Canva

From

January 2025
