The Unit Test Satisficing Trap

TRIGGER

AI coding agents were marking features complete after making code changes and running unit tests or curl commands, yet the features frequently failed end-to-end. The agents did not recognize that passing code-level tests does not guarantee that a feature works as a user would experience it.

APPROACH

Anthropic equipped its coding agent with browser automation tools (Puppeteer MCP) and explicitly prompted it to test features as a human user would before marking them complete. The agent runs through user flows (navigating to the interface, clicking buttons, verifying that responses appear) rather than relying solely on code-level tests. Anthropic also implemented a bootstrap routine in which each session starts by checking that basic functionality still works (start the server, open a new chat, send a message, receive a response) before attempting new features.
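The bootstrap routine described above can be sketched roughly as follows. This is a minimal illustration, not Anthropic's implementation: the `Browser` protocol is a hypothetical stand-in for whatever browser automation tool the agent is given, the CSS selectors are invented for the example, and the server is assumed to be already running at `base_url`.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class Browser(Protocol):
    """Hypothetical stand-in for a browser automation tool (not a real API)."""

    def goto(self, url: str) -> None: ...
    def click(self, selector: str) -> None: ...
    def type_text(self, selector: str, text: str) -> None: ...
    def wait_for_text(self, selector: str, timeout_s: float) -> str: ...


@dataclass
class SmokeResult:
    ok: bool
    failed_step: Optional[str] = None
    detail: str = ""


def bootstrap_smoke_check(browser: Browser, base_url: str) -> SmokeResult:
    """Walk the basic chat flow as a user would and stop at the first failure."""
    step = "open app"
    try:
        browser.goto(base_url)
        step = "open new chat"
        browser.click("#new-chat")  # selector is an assumption
        step = "send message"
        browser.type_text("#message-input", "ping")  # selector is an assumption
        browser.click("#send")  # selector is an assumption
        step = "receive response"
        reply = browser.wait_for_text(".assistant-message", timeout_s=30.0)
    except Exception as exc:
        # Report which user-visible step broke, not just that something failed.
        return SmokeResult(ok=False, failed_step=step, detail=str(exc))
    if not reply.strip():
        return SmokeResult(ok=False, failed_step=step, detail="empty response")
    return SmokeResult(ok=True)
```

Running a check like this at the start of each session surfaces regressions in the basic flow before any new feature work begins. Returning the failed step gives the agent a concrete place to start debugging.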

PATTERN

“‘Feature complete’ with broken features: the ‘unit test satisficing trap’ where your agent declares victory because tests pass while the UI is completely broken. The verification tool you provide becomes the agent's definition of done. Give it browser automation; it will verify like a user would.”
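One way to make the "definition of done" explicit is to gate completion on user-level evidence. The sketch below is illustrative only: the `Feature` and `Evidence` types and the `mark_complete` function are hypothetical names, not part of any tool described in the source.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    unit_tests_passed: bool = False
    e2e_flow_passed: bool = False  # set only after driving the real UI


@dataclass
class Feature:
    name: str
    evidence: Evidence = field(default_factory=Evidence)
    complete: bool = False


def mark_complete(feature: Feature) -> bool:
    """Mark a feature done only when the user-level flow has also been verified."""
    # Passing unit tests is treated as necessary but not sufficient.
    feature.complete = (
        feature.evidence.unit_tests_passed and feature.evidence.e2e_flow_passed
    )
    return feature.complete
```

With a gate like this, a green unit test run alone leaves the feature open, which is the behavior the pattern is after.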

✓ WORKS WHEN

Features have user-visible behavior that can be verified through UI interaction (web apps, desktop apps, CLI tools)
Browser automation or equivalent E2E testing tools are available and can be provided as agent tools
The feedback loop from E2E test to code change is fast enough to iterate (under 60 seconds per test cycle)
The UI state is inspectable: elements have stable selectors and responses are visible in the DOM
Features don't rely heavily on visual appearance that automation tools can't verify (layout, animation, color)

✗ FAILS WHEN

End-to-end testing requires human judgment that can't be automated (visual polish, UX feel, performance perception)
Critical behavior happens in browser-native elements that automation can't inspect (native alert modals, file pickers, print dialogs)
The application requires authentication or state that's difficult to bootstrap in test environments
E2E test setup time dominates iteration cycles (more than 5 minutes to run a single verification)
Features are backend-only with no user-facing component to verify
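The two timing thresholds in the lists above (under 60 seconds works, more than 5 minutes fails) can be checked with a small timer around each verification run. This is a sketch under stated assumptions: the function names are hypothetical, and the "borderline" label for the range between the two thresholds is an assumption, since the source does not say how to treat it.

```python
import time
from typing import Callable, Tuple

FAST_CYCLE_S = 60.0   # "works when" threshold: under 60 seconds per test cycle
SLOW_CYCLE_S = 300.0  # "fails when" threshold: more than 5 minutes per verification


def classify_cycle(seconds: float) -> str:
    """Classify one verification cycle against the two thresholds."""
    if seconds < FAST_CYCLE_S:
        return "fast"
    if seconds > SLOW_CYCLE_S:
        return "too slow"
    return "borderline"  # assumption: the source leaves this range unspecified


def timed_verification(run: Callable[[], bool]) -> Tuple[bool, float, str]:
    """Run one verification and report (passed, elapsed seconds, classification)."""
    start = time.perf_counter()
    passed = run()
    elapsed = time.perf_counter() - start
    return passed, elapsed, classify_cycle(elapsed)
```

Logging the classification alongside each result makes it easier to notice when setup time starts to dominate the iteration loop.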

Stage

build

Source

Anthropic Engineering

From

November 2025