Notion's AI Data Specialist Role

TRIGGER

Scaling AI quality across many features requires evaluation that's faster than manual review but more nuanced than generic benchmarks. Standard benchmarks don't capture feature-specific quality criteria, and pure human review can't keep pace with deployment velocity.

APPROACH

Notion employs 'AI Data Specialists', a hybrid role combining QA expertise, prompt engineering, and product thinking. These specialists design custom evaluation criteria for each feature and teach judge models what to look for in different contexts. They also analyze real user behavior patterns, so prompts improve based on actual usage rather than synthetic tests. Evaluations run continuously rather than as one-time gates, which catches regressions early whenever a new model launches from OpenAI, Anthropic, Google, or the open-source community.

Input: model outputs plus feature-specific evaluation criteria designed by specialists.
Output: quality scores that enable rapid model deployment decisions across dozens of models and hundreds of prompts.
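The input/output flow described above can be sketched roughly as follows. This is a minimal illustration, not Notion's actual tooling, which the source does not describe. Every name here (Criterion, FeatureRubric, stub_judge, the "meeting_summary" feature, the 0.05 tolerance) is a hypothetical stand-in, and the stub judge uses simple string checks where a real setup would call a judge model.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    instruction: str  # what a judge model would be told to look for


@dataclass(frozen=True)
class FeatureRubric:
    feature: str
    criteria: tuple[Criterion, ...]


# A judge maps (criterion, model output) to a score in [0, 1].
Judge = Callable[[Criterion, str], float]


def score_output(rubric: FeatureRubric, output: str, judge: Judge) -> dict[str, float]:
    """Score one model output against every criterion in the feature's rubric."""
    return {c.name: judge(c, output) for c in rubric.criteria}


def evaluate_model(rubric: FeatureRubric, outputs: list[str], judge: Judge) -> float:
    """Average the per-output rubric scores into one quality score for a model."""
    return mean(mean(score_output(rubric, o, judge).values()) for o in outputs)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.05) -> bool:
    """Flag a candidate model whose score drops below the baseline by more than tolerance."""
    return candidate < baseline - tolerance


def stub_judge(criterion: Criterion, output: str) -> float:
    """Placeholder judge (assumption): string checks stand in for a judge-model call."""
    if criterion.name == "concise":
        return 1.0 if len(output.split()) <= 10 else 0.0
    if criterion.name == "has_action_items":
        return 1.0 if "action" in output.lower() else 0.0
    raise ValueError(f"unknown criterion: {criterion.name}")


# Hypothetical rubric for a hypothetical feature.
rubric = FeatureRubric(
    feature="meeting_summary",
    criteria=(
        Criterion("concise", "The summary is ten words or fewer."),
        Criterion("has_action_items", "The summary names at least one action item."),
    ),
)

baseline_outputs = [
    "Action: ship the fix on Friday.",
    "Action: Dana reviews the draft.",
]
candidate_outputs = [
    "The team talked about many different things during the long meeting held on Friday.",
    "Action: Dana reviews the draft.",
]

baseline_score = evaluate_model(rubric, baseline_outputs, stub_judge)
candidate_score = evaluate_model(rubric, candidate_outputs, stub_judge)
regressed = is_regression(baseline_score, candidate_score)
```

Keeping the rubric as data separate from the judge means the same criteria can be rerun unchanged each time a new model is considered, which is what lets the evaluation run continuously instead of as a one-time gate.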

PATTERN

“Generic rubrics miss feature-specific failures while engineers assume evaluation is a testing problem. The bottleneck is encoding what "good" means per feature—requiring someone who understands users, prompts, and quality simultaneously, not better measurement infrastructure.”

✓ WORKS WHEN

Product has many distinct AI features with different quality definitions
Model ecosystem is evolving rapidly and you need to evaluate new options continuously
Quality criteria are nuanced enough that generic metrics miss important failures
Team can support a dedicated evaluation role (not just engineers part-time)
Deployment velocity is high enough that manual review of every change is infeasible

✗ FAILS WHEN

Single AI feature with stable, well-defined success metrics
Quality can be measured with deterministic tests (exact match, format validation)
Model selection is infrequent and can be done with periodic manual evaluation
Organization lacks access to people with combined QA/prompt/product skills
Feature-specific criteria change faster than specialists can encode them

Stage

build

Source

Notion

From

May 2025