Anthropic's Dedicated Think Tool for Sequential Reasoning
TRIGGER
AI agents making sequential tool calls were failing to maintain policy compliance and making costly errors mid-chain: the model would retrieve information but act on it incorrectly, because it had no structured space to verify constraints before each action.
APPROACH
Anthropic added a "think" tool to Claude's tool set: a no-op tool that logs reasoning without affecting external state. Input: a thought string describing the agent's current reasoning. Output: nothing (the thought is simply appended to the conversation log). On the τ-bench airline domain, the think tool with domain-specific prompting achieved 0.570 pass^1 vs. a 0.370 baseline, a 54% relative improvement. On the τ-bench retail domain, the think tool alone achieved 0.812 vs. 0.783 baseline. On SWE-bench, the think tool contributed a 1.6% improvement (statistically significant, p < .001, effect size d = 1.47). The optimized prompt included examples showing how to enumerate applicable rules, check required information, and verify policy compliance before acting.
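A minimal sketch of the tool, assuming the standard Anthropic Messages API tool format (`name`, `description`, `input_schema`); the exact description wording here is illustrative, and `handle_think` is a hypothetical client-side handler:

```python
# A no-op "think" tool: it takes a thought string and changes nothing.
# The value comes from forcing the model to articulate reasoning as a
# discrete step in the tool-call transcript.
THINK_TOOL = {
    "name": "think",
    "description": (
        "Use this tool to think about something. It will not obtain new "
        "information or change any state; it only appends the thought to "
        "the log. Use it when complex reasoning is needed before acting."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {
                "type": "string",
                "description": "A thought to think about.",
            }
        },
        "required": ["thought"],
    },
}


def handle_think(tool_input: dict) -> str:
    """No-op handler: the thought already lives in the conversation log,
    so the tool result returned to the model is empty."""
    return ""
```

In an agent loop, `THINK_TOOL` is passed alongside the real tools, and when the model calls it the client simply returns the empty result and continues.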
PATTERN
Making "stop and think" a tool invocation forces a structural pause: the model must choose to reason before acting. Without this, reasoning and action blur together, and agents skip constraint checks mid-chain.
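The domain-specific prompting that drove the airline-domain gains can be sketched as a system-prompt snippet. This is illustrative, not Anthropic's published prompt; the rules shown are paraphrased examples of the kind of airline policies τ-bench contains:

```python
# Hypothetical system-prompt guidance showing the model what a good
# "think" entry looks like: enumerate rules, check information, verify
# compliance before acting.
THINK_GUIDANCE = """\
Before taking any action, use the think tool to:
- List the specific policy rules that apply to the current request
- Check that you have all information those rules require
- Verify that the planned action complies with every rule

Example thought for a cancellation request:
- Rule: basic economy tickets cannot be modified
- Rule: cancellations within 24 hours of booking are fully refundable
- Need: booking time and fare class (not yet confirmed)
- Plan: ask the user for the booking reference before cancelling
"""
```

The key design choice is putting concrete worked examples in the prompt, not just the instruction "think before acting"; the APPROACH section above attributes the largest gains to this pairing of the tool with domain-specific examples.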
✓ WORKS WHEN
- Agent performs sequential tool calls where each step depends on previous results (not parallel/independent calls)
- Environment has complex policies or constraints the agent must verify before acting (τ-bench airline policy had detailed baggage, cancellation, and payment rules)
- Mistakes are costly and irreversible—you can't easily undo a wrong action
- You can provide domain-specific examples of what good thinking looks like in your prompt
- The additional output tokens for thinking are acceptable given the reliability gains
✗ FAILS WHEN
- Agent only needs single tool calls or parallel independent calls with no dependencies
- Task has simple instruction following without multi-step policy verification
- All necessary information is available upfront before any tool calls (use extended thinking instead)
- Token budget is severely constrained and you can't afford the reasoning overhead
- Domain is simple enough that the agent's default behavior already achieves acceptable accuracy