build

Notion's AI Data Specialist Role

TRIGGER

Scaling AI quality across many features requires evaluation that's faster than manual review but more nuanced than generic benchmarks. Standard benchmarks don't capture feature-specific quality criteria, and pure human review can't keep pace with deployment velocity.

APPROACH

Notion employs 'AI Data Specialists', a hybrid role combining QA expertise, prompt engineering, and product thinking. These specialists design custom evaluation criteria for each feature and teach judge models what to look for in different contexts. They also analyze real user behavior patterns, so prompt improvements are driven by actual usage rather than synthetic tests. Evaluations run continuously rather than as one-time gates, so regressions surface early whenever a new model launches from OpenAI, Anthropic, Google, or the open-source community.

Input: model outputs plus feature-specific evaluation criteria designed by specialists.
Output: quality scores that enable rapid model deployment decisions across dozens of models and hundreds of prompts.
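To make the input/output contract concrete, here is a minimal sketch in Python. It is not Notion's implementation: the `Rubric` shape, the weighting scheme, the `toy_judge` stand-in (a keyword check where a real system would call a judge model), the model names, and the 0.05 regression tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A judge maps (criterion, model_output) to a score in [0, 1].
# Assumption: a real system would call a judge model here.
Judge = Callable[[str, str], float]


@dataclass(frozen=True)
class Rubric:
    """Feature-specific criteria written by a specialist (hypothetical shape)."""
    feature: str
    criteria: Dict[str, float]  # criterion text -> weight


def score_output(rubric: Rubric, output: str, judge: Judge) -> float:
    """Weighted average of the judge's per-criterion scores for one output."""
    total_weight = sum(rubric.criteria.values())
    weighted = sum(w * judge(c, output) for c, w in rubric.criteria.items())
    return weighted / total_weight


def evaluate_models(
    rubric: Rubric, outputs_by_model: Dict[str, List[str]], judge: Judge
) -> Dict[str, float]:
    """Mean rubric score per model over that model's sampled outputs."""
    return {
        model: sum(score_output(rubric, o, judge) for o in outs) / len(outs)
        for model, outs in outputs_by_model.items()
    }


def find_regressions(
    baseline: Dict[str, float], current: Dict[str, float], tolerance: float = 0.05
) -> List[str]:
    """Models whose score dropped by more than `tolerance` since the baseline run."""
    return sorted(
        m for m, s in current.items() if m in baseline and baseline[m] - s > tolerance
    )


def toy_judge(criterion: str, output: str) -> float:
    """Stand-in for a judge model: one trivial check per known criterion."""
    if criterion == "mentions action items":
        return 1.0 if "action item" in output.lower() else 0.0
    if criterion == "is concise":
        return 1.0 if len(output.split()) <= 20 else 0.0
    return 0.0


# Hypothetical feature, models, and outputs.
rubric = Rubric(
    feature="meeting_summary",
    criteria={"mentions action items": 2.0, "is concise": 1.0},
)
outputs_by_model = {
    "model_a": ["Action items: ship the doc by Friday."],
    "model_b": ["We talked about many things."],
}

scores = evaluate_models(rubric, outputs_by_model, toy_judge)
regressions = find_regressions({"model_a": 1.0, "model_b": 0.9}, scores)
print(scores, regressions)
```

Keeping the rubric as plain data means a specialist can revise criteria and weights without touching the scoring loop. Rerunning `evaluate_models` on a schedule and comparing against a stored baseline is one simple way to get continuous evaluation rather than a one-time gate.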

PATTERN

Generic rubrics miss feature-specific failures, and engineers tend to treat evaluation as a testing problem. The real bottleneck is encoding what "good" means for each feature. That takes someone who understands users, prompts, and quality at the same time; better measurement infrastructure alone does not solve it.

WORKS WHEN

  • Product has many distinct AI features with different quality definitions
  • Model ecosystem is evolving rapidly and you need to evaluate new options continuously
  • Quality criteria are nuanced enough that generic metrics miss important failures
  • Team can support a dedicated evaluation role (not just engineers part-time)
  • Deployment velocity is high enough that manual review of every change is infeasible

FAILS WHEN

  • Single AI feature with stable, well-defined success metrics
  • Quality can be measured with deterministic tests (exact match, format validation)
  • Model selection is infrequent and can be done with periodic manual evaluation
  • Organization lacks access to people with combined QA/prompt/product skills
  • Feature-specific criteria change faster than specialists can encode them
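
For contrast, the deterministic case in the list above needs no judge model or specialist-written rubric. A minimal Python sketch, in which the function names and the required keys are hypothetical:

```python
import json
from typing import Set


def exact_match(output: str, expected: str) -> bool:
    """Exact-match check, ignoring leading and trailing whitespace."""
    return output.strip() == expected.strip()


def is_valid_json_with_keys(output: str, required_keys: Set[str]) -> bool:
    """Format validation: output must be a JSON object containing the required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


# Hypothetical outputs from a feature with a fixed, checkable format.
match_ok = exact_match(" 42\n", "42")
format_ok = is_valid_json_with_keys('{"title": "Q3 plan", "tags": []}', {"title", "tags"})
format_bad = is_valid_json_with_keys("Sure! Here is your JSON:", {"title"})
```

When checks like these fully capture quality, an ordinary test suite does the job and the dedicated role adds little.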

Stage

build

From

May 2025
