Anthropic's Dedicated Think Tool for Sequential Reasoning
TRIGGER
AI agents making sequential tool calls were failing to maintain policy compliance and making costly errors mid-chain: the model would retrieve information but act on it incorrectly, because it had no structured space to verify constraints before each action.
APPROACH
Anthropic added a "think" tool to Claude's tool set: a no-op tool that logs reasoning without affecting external state. Input: a thought string describing the agent's current reasoning. Output: nothing (the thought is simply appended to the conversation log). On the τ-bench airline domain, the think tool with domain-specific prompting achieved 0.570 pass^1 vs. a 0.370 baseline, a 54% relative improvement. On the τ-bench retail domain, the think tool alone achieved 0.812 vs. 0.783 baseline. On SWE-bench, the think tool contributed a 1.6% improvement (statistically significant, p < .001, effect size d = 1.47). The optimized prompt included examples showing how to enumerate applicable rules, check required information, and verify policy compliance before acting.
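A minimal sketch of the tool, assuming the standard Anthropic Messages API tool format (`name`, `description`, `input_schema`); the exact description wording here is illustrative, and `handle_think` is a hypothetical client-side handler:

```python
# A no-op "think" tool: it takes a thought string and changes nothing.
# The value comes from forcing the model to articulate reasoning as a
# discrete step in the tool-call transcript.
THINK_TOOL = {
    "name": "think",
    "description": (
        "Use this tool to think about something. It will not obtain new "
        "information or change any state; it only appends the thought to "
        "the log. Use it when complex reasoning is needed before acting."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "thought": {
                "type": "string",
                "description": "A thought to think about.",
            }
        },
        "required": ["thought"],
    },
}


def handle_think(tool_input: dict) -> str:
    """No-op handler: the thought already lives in the conversation log,
    so the tool result returned to the model is empty."""
    return ""
```

In an agent loop, `THINK_TOOL` is passed alongside the real tools, and when the model calls it the client simply returns the empty result and continues.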
PATTERN
Making "stop and think" a tool invocation forces a structural pause: the model must choose to reason before acting. Without this, reasoning and action blur together, and agents skip constraint checks mid-chain.
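The domain-specific prompting that drove the airline-domain gains can be sketched as a system-prompt snippet. This is illustrative, not Anthropic's published prompt; the rules shown are paraphrased examples of the kind of airline policies τ-bench contains:

```python
# Hypothetical system-prompt guidance showing the model what a good
# "think" entry looks like: enumerate rules, check information, verify
# compliance before acting.
THINK_GUIDANCE = """\
Before taking any action, use the think tool to:
- List the specific policy rules that apply to the current request
- Check that you have all information those rules require
- Verify that the planned action complies with every rule

Example thought for a cancellation request:
- Rule: basic economy tickets cannot be modified
- Rule: cancellations within 24 hours of booking are fully refundable
- Need: booking time and fare class (not yet confirmed)
- Plan: ask the user for the booking reference before cancelling
"""
```

The key design choice is putting concrete worked examples in the prompt, not just the instruction "think before acting"; the APPROACH section above attributes the largest gains to this pairing of the tool with domain-specific examples.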
✓ WORKS WHEN
- Agent performs sequential tool calls where each step depends on previous results (not parallel/independent calls)
- Environment has complex policies or constraints the agent must verify before acting (τ-bench airline policy had detailed baggage, cancellation, and payment rules)
- Mistakes are costly and irreversible—you can't easily undo a wrong action
- You can provide domain-specific examples of what good thinking looks like in your prompt
- The additional output tokens for thinking are acceptable given the reliability gains
✗ FAILS WHEN
- Agent only needs single tool calls or parallel independent calls with no dependencies
- Task has simple instruction following without multi-step policy verification
- All necessary information is available upfront before any tool calls (use extended thinking instead)
- Token budget is severely constrained and you can't afford the reasoning overhead
- Domain is simple enough that the agent's default behavior already achieves acceptable accuracy