What Anthropic Learned About Teaching Agents Domain Reasoning
TRIGGER
Adding a think tool improved agent performance, but gains varied dramatically by domain—in complex policy environments, the agent didn't know what aspects of the problem to reason about without guidance.
APPROACH
Anthropic paired the think tool with domain-specific prompting that showed example reasoning patterns. For the airline domain, the examples demonstrated: listing applicable rules, confirming that required information has been collected, verifying policy compliance, and iterating over tool results. Example patterns included baggage fee calculations by membership tier and payment-method combination rules. Results: the think tool with an optimized prompt achieved 0.570 pass^1 vs 0.404 for the think tool alone (a 41% relative improvement). In the simpler retail domain, the think tool alone achieved 0.812 without additional prompting. Instructions were placed in the system prompt rather than the tool description for better integration.
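The setup above can be sketched in code. This is an illustrative reconstruction, not Anthropic's exact implementation: the tool follows the Anthropic Messages API tool format, the description paraphrases the published think-tool description, and the airline policies in the example (tier rules, fee amounts) are invented for illustration.

```python
# A no-op "think" tool in the Anthropic Messages API tool-definition format.
# The description paraphrases the published think-tool wording.
THINK_TOOL = {
    "name": "think",
    "description": (
        "Use the tool to think about something. It will not obtain new "
        "information or change anything; it only appends the thought to the log. "
        "Use it when complex reasoning is needed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {
                "type": "string",
                "description": "A thought to think about.",
            }
        },
        "required": ["thought"],
    },
}

# Domain reasoning guidance goes in the system prompt, not the tool description.
# The policy details below (tiers, fees) are hypothetical.
SYSTEM_PROMPT = """You are an airline customer-service agent.

Before taking any action, use the think tool to:
- List the specific rules that apply to the current request
- Check that all required information has been collected
- Verify that the planned action complies with all policies
- Iterate over tool results to confirm they are correct

<think_tool_example>
User wants to add a checked bag.
- Membership tier: silver -> first bag free, second bag $50
- Payment method on file: travel certificate -> certificates cannot cover bag fees
- Therefore: collect a credit card before charging the $50 fee
</think_tool_example>"""
```

Putting the reasoning examples in the system prompt (rather than the tool description) matches the placement the experiments found to integrate better.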
PATTERN
The agent knows HOW to reason but not WHAT to verify in your domain. Show the specific checklists: "check membership tier, then payment method, then calculate fee." Generic think prompts leave domain knowledge on the table.
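A single think call that follows such a checklist might look like the sketch below; the tier names and fee amounts are invented for illustration, not real airline policy.

```python
# Hypothetical content of one think-tool call walking the domain checklist:
# tier -> payment method -> fee calculation. Values are invented.
checklist_thought = "\n".join([
    "1. Membership tier: gold -> two free checked bags",
    "2. Payment method: travel certificate -> cannot cover bag fees",
    "3. Fee: third bag at $75 -> must collect a credit card first",
])
tool_call = {"name": "think", "input": {"thought": checklist_thought}}
```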
✓ WORKS WHEN
- Domain has complex, enumerable rules that benefit from explicit checklists (multi-step calculations, tiered policies, conditional logic)
- You can articulate what "good reasoning" looks like for common scenarios in your domain
- Policy complexity is high enough that generic reasoning misses important verification steps
- You have representative examples of the decision patterns the agent will encounter
✗ FAILS WHEN
- Domain is simple enough that the agent reasons correctly without examples (τ-bench retail showed minimal gain from prompting)
- Reasoning patterns are too varied to capture in a few examples
- You can't identify the specific verification steps that matter for your domain
- System prompt length constraints prevent including detailed examples