
The Verifiability Gap That Limits Agent Capability

TRIGGER

AI coding agents improved rapidly through reinforcement learning while agents for other knowledge work plateaued, creating a growing capability gap between programming assistance and general knowledge work automation.

APPROACH

Ivan Zhao (Notion CEO) observed that coding agents improved rapidly because code outputs can be verified programmatically: tests pass or fail, compilers emit errors, and these signals enable reinforcement learning at scale. The pattern takes task-domain characteristics as input and produces an assessment of whether automated verification can drive RL-based improvement. Model makers exploit this loop to train coding agents that improve continuously. For knowledge work such as project management, strategy memos, or judging relationship appropriateness, no equivalent automated verification exists; assessing quality requires human judgment that cannot be scaled into a training loop. The source article is a theoretical framework rather than a case study: it explains the structural constraint instead of documenting a specific implementation. Notion's response has been to deploy 700+ agents alongside its 1,000 employees for verifiable tasks (meeting notes, IT requests, status reports) while keeping humans in the loop for judgment-dependent work.
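The verification-as-reward loop described above can be sketched in a few lines. Everything here is illustrative (the function name, the binary reward, running tests in a subprocess), not Notion's or any model maker's actual training code:

```python
import os
import subprocess
import sys
import tempfile

def verifiable_reward(solution: str, tests: str) -> float:
    """Binary reward signal: 1.0 if the candidate code passes its tests.

    Hypothetical sketch. The solution and its tests are written to a temp
    file and run in a subprocess; any assertion failure, exception, or
    timeout yields 0.0. This cheap pass/fail signal is what lets code
    generation be improved by reinforcement learning at scale.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=10
        )
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)

tests = "assert add(2, 3) == 5"
print(verifiable_reward("def add(a, b): return a + b", tests))  # 1.0
print(verifiable_reward("def add(a, b): return a - b", tests))  # 0.0
```

By contrast, there is no equivalent of `verifiable_reward` for a strategy memo: writing the "tests" would itself require the human judgment the loop is trying to replace.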

PATTERN

Your coding agents will keep getting better while your knowledge work agents plateau—and no amount of prompting will close the gap. Agent capability compounds only where outputs are programmatically verifiable. The path to better knowledge work agents runs through inventing verification mechanisms, not waiting for smarter models.
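One way to "invent a verification mechanism" for judgment-dependent work is to carve out the checkable structure and score only that. The rubric below (the section names and the partial-credit scoring) is a hypothetical illustration of the idea, not anything Notion ships:

```python
# Hypothetical rubric: required sections of a status report.
REQUIRED_SECTIONS = ["Summary", "Progress", "Blockers", "Next steps"]

def verify_status_report(report: str) -> float:
    """Partial-credit verifier for a knowledge-work artifact.

    Illustrative sketch: it cannot judge whether the report is strategically
    sound, but it can check that every required section exists and is
    nonempty, turning that slice of quality into an automatable signal.
    """
    sections: dict[str, str] = {}
    current = None
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.endswith(":") and stripped.rstrip(":") in REQUIRED_SECTIONS:
            current = stripped.rstrip(":")
            sections[current] = ""
        elif current:
            sections[current] += stripped
    filled = sum(1 for s in REQUIRED_SECTIONS if sections.get(s, "").strip())
    return filled / len(REQUIRED_SECTIONS)

report = """Summary:
Shipped the import pipeline.
Progress:
Migrated 3 of 5 services.
Blockers:

Next steps:
Finish migration by Friday.
"""
print(verify_status_report(report))  # 0.75 -- "Blockers" is empty
```

The score is deliberately partial: the unverifiable remainder (is the summary accurate? are the blockers the real blockers?) still needs a human, which is exactly the gap the pattern describes.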

WORKS WHEN

  • Outputs have objective correctness criteria (code compiles, tests pass, calculations verify)
  • Feedback signals can be generated automatically at scale
  • Task has clear success/failure states that don't require subjective judgment
  • Domain has existing automated quality checks that can serve as reward signals
  • Volume of tasks is high enough to generate meaningful training signal

FAILS WHEN

  • Quality is inherently subjective (writing style, strategic soundness, relationship appropriateness)
  • Success depends on long-term outcomes that can't be measured immediately
  • Task requires judgment calls where reasonable people disagree
  • Verification would require the same expertise as doing the task
  • Ground truth labels are expensive or impossible to obtain at scale

Stage

build

From

December 2025
