
The Verifiability Gap That Limits Agent Capability

TRIGGER

AI coding agents improved rapidly through reinforcement learning while agents for other knowledge work plateaued, creating a growing capability gap between programming assistance and general knowledge work automation.

APPROACH

Ivan Zhao (Notion CEO) observed that coding agents improved rapidly because code outputs can be verified programmatically: tests pass or fail, compilers emit errors, and these signals enable reinforcement learning at scale. The pattern takes task-domain characteristics as input and produces an assessment of whether automated verification can drive RL-based improvement. Model makers exploit this loop to train coding agents that improve continuously. For knowledge work such as project management, strategy memos, or judging relationship appropriateness, no equivalent automated verification exists; assessing quality requires human judgment that cannot be scaled into a training loop. The source article is a theoretical framework rather than a case study: it explains the structural constraint instead of documenting a specific implementation. Notion's response has been to deploy 700+ agents alongside its 1,000 employees for verifiable tasks (meeting notes, IT requests, status reports) while keeping humans in the loop for judgment-dependent work.
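The verification-as-reward loop described above can be sketched in a few lines. Everything here is illustrative (the function name, the binary reward, running tests in a subprocess), not Notion's or any model maker's actual training code:

```python
import os
import subprocess
import sys
import tempfile

def verifiable_reward(solution: str, tests: str) -> float:
    """Binary reward signal: 1.0 if the candidate code passes its tests.

    Hypothetical sketch. The solution and its tests are written to a temp
    file and run in a subprocess; any assertion failure, exception, or
    timeout yields 0.0. This cheap pass/fail signal is what lets code
    generation be improved by reinforcement learning at scale.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + tests + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=10
        )
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)

tests = "assert add(2, 3) == 5"
print(verifiable_reward("def add(a, b): return a + b", tests))  # 1.0
print(verifiable_reward("def add(a, b): return a - b", tests))  # 0.0
```

By contrast, there is no equivalent of `verifiable_reward` for a strategy memo: writing the "tests" would itself require the human judgment the loop is trying to replace.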

PATTERN

Your coding agents will keep getting better while your knowledge work agents plateau—and no amount of prompting will close the gap. Agent capability compounds only where outputs are programmatically verifiable. The path to better knowledge work agents runs through inventing verification mechanisms, not waiting for smarter models.
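One way to "invent a verification mechanism" for judgment-dependent work is to carve out the checkable structure and score only that. The rubric below (the section names and the partial-credit scoring) is a hypothetical illustration of the idea, not anything Notion ships:

```python
# Hypothetical rubric: required sections of a status report.
REQUIRED_SECTIONS = ["Summary", "Progress", "Blockers", "Next steps"]

def verify_status_report(report: str) -> float:
    """Partial-credit verifier for a knowledge-work artifact.

    Illustrative sketch: it cannot judge whether the report is strategically
    sound, but it can check that every required section exists and is
    nonempty, turning that slice of quality into an automatable signal.
    """
    sections: dict[str, str] = {}
    current = None
    for line in report.splitlines():
        stripped = line.strip()
        if stripped.endswith(":") and stripped.rstrip(":") in REQUIRED_SECTIONS:
            current = stripped.rstrip(":")
            sections[current] = ""
        elif current:
            sections[current] += stripped
    filled = sum(1 for s in REQUIRED_SECTIONS if sections.get(s, "").strip())
    return filled / len(REQUIRED_SECTIONS)

report = """Summary:
Shipped the import pipeline.
Progress:
Migrated 3 of 5 services.
Blockers:

Next steps:
Finish migration by Friday.
"""
print(verify_status_report(report))  # 0.75 -- "Blockers" is empty
```

The score is deliberately partial: the unverifiable remainder (is the summary accurate? are the blockers the real blockers?) still needs a human, which is exactly the gap the pattern describes.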

WORKS WHEN

  • Outputs have objective correctness criteria (code compiles, tests pass, calculations verify)
  • Feedback signals can be generated automatically at scale
  • Task has clear success/failure states that don't require subjective judgment
  • Domain has existing automated quality checks that can serve as reward signals
  • Volume of tasks is high enough to generate meaningful training signal

FAILS WHEN

  • Quality is inherently subjective (writing style, strategic soundness, relationship appropriateness)
  • Success depends on long-term outcomes that can't be measured immediately
  • Task requires judgment calls where reasonable people disagree
  • Verification would require the same expertise as doing the task
  • Ground truth labels are expensive or impossible to obtain at scale

Stage

build

From

December 2025
