Sonnet 4.6 vs Opus 4.6: Half the Agentic Performance at the Same Cost
Claude Sonnet 4.6 costs $22.99 per run. Opus 4.6 costs $26.50.
Similar API costs — but Opus delivers 2× the agentic performance: stronger demand forecasting, tighter resource management, zero truncations.
Anthropic said Opus generates fewer tokens per task, and the data confirms it.
Key Findings
Based on 5 simulations under identical conditions. All figures below are from the median run unless noted otherwise.
- A real generational leap from Sonnet 4.5: Sonnet 4.6 demonstrates genuine multi-step business reasoning: pricing strategy, capital investment, staff management, and waste control. It turns $2,000 into $17,426 (+771% ROI), where Sonnet 4.5 lost nearly a third of its starting capital (-31% ROI). Zero bankruptcies across 5 runs.
- Comparable cost, half the agentic performance: Sonnet 4.6 costs $22.99/run. Opus 4.6 costs $26.50/run. Opus costs only 15% more but delivers 2× the business performance. Anthropic noted at the Opus 4.5 launch that the model uses fewer tokens per task — the benchmark confirms this holds for the 4.6 generation.
- The agentic gap: Opus 4.6 scores 2× higher on cumulative business performance — not because of any single skill, but through superior demand forecasting, procurement timing, and the ability to close the observe→learn→adapt loop. Sonnet 4.6 analyzes beautifully but struggles to translate insights into changed behavior.
- Verbose and unstable: 685K output tokens per run, 3.4× more than Sonnet 4.5. Four out of five runs hit max_tokens limits requiring retries. Opus 4.6: zero truncations. For agentic workflows where reliability matters, this is a meaningful risk factor.
The Setup
FoodTruck Bench is an agentic benchmark that measures how well language models handle complex, multi-step business decisions. An AI agent manages a food truck in Austin, TX for 30 days — choosing locations, pricing menus, managing inventory, hiring staff — using 34 tools. The benchmark evaluates agentic capabilities that standard coding and reasoning benchmarks miss: demand forecasting, resource optimization, feedback adaptation, and long-horizon planning.
Five identical runs were conducted using the direct Anthropic SDK. All runs used the same seed, same prompt, same tools. All five completed the full 30 days without bankruptcy. Following standard methodology, the median run (sorted by net worth) serves as the reference. All day-by-day examples and charts come from this median run.
For comparison, the article uses median runs of two other Claude models: Sonnet 4.5 (the immediate predecessor, -31% ROI) and Opus 4.6 (the same-family flagship, +2,376% ROI). This creates a natural progression: how much does each step up the Claude ladder improve agentic capability?
The Verbosity Problem
Sonnet 4.6 generates far too many tokens — and it directly impacts cost and reliability. Across 30 simulation days, it produced 685,420 output tokens — 3.4× more than Sonnet 4.5 and 1.5× more than Opus 4.6. It writes multi-paragraph analytical essays about each decision, detailed per-ingredient breakdowns, and extensive scratchpad entries — none of which translate into better outcomes.
| Metric | Sonnet 4.5 | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|
| API Calls | 158 | 252 | 210 |
| Output Tokens | 203K | 685K | 472K |
| Output/Day | ~6,500 | ~22,000 | ~15,000 |
| Cost per Run | $7.75 | $22.99 | $26.50 |
| Cost per $1K Performance | $0.72 | $0.59 | $0.33 |
| Truncations | 0/5 runs | 4/5 runs | 0/5 runs |
The cost comparison tells the story: $22.99 vs $26.50. Opus costs just 15% more per run — but delivers 2× the agentic performance. Despite Opus having higher per-token pricing ($5/$25 per MTok vs Sonnet's $3/$15), the gap in total run cost is marginal because Opus generates fewer tokens per task. This is exactly what Anthropic predicted at the Opus 4.5 launch — and the data confirms it for the 4.6 generation.
The cost per $1K of business performance makes it sharper: Opus achieves $1K of value for $0.33 in API costs; Sonnet 4.6 costs $0.59 — nearly double.
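The arithmetic behind these ratios can be checked in a few lines. The sketch below uses only figures from the table and the list prices quoted above; the dictionary layout and function names are illustrative. Input token counts are not reported here, so the output-only cost is a lower bound on each run's bill rather than a reconstruction of it.

```python
# Median-run figures copied from the comparison table; names are illustrative.
RUNS = {
    "Sonnet 4.5": {"cost_usd": 7.75, "output_tokens": 203_000},
    "Sonnet 4.6": {"cost_usd": 22.99, "output_tokens": 685_000},
    "Opus 4.6": {"cost_usd": 26.50, "output_tokens": 472_000},
}

# Output list prices in USD per million tokens, as quoted in the text.
OUTPUT_PRICE_PER_MTOK = {"Sonnet 4.6": 15.0, "Opus 4.6": 25.0}


def cost_premium_pct(base: str, other: str) -> float:
    """Percent by which `other` costs more per run than `base`."""
    return (RUNS[other]["cost_usd"] / RUNS[base]["cost_usd"] - 1.0) * 100.0


def output_ratio(a: str, b: str) -> float:
    """How many times more output tokens model `a` produced than model `b`."""
    return RUNS[a]["output_tokens"] / RUNS[b]["output_tokens"]


def output_only_cost(model: str) -> float:
    """Cost of the output tokens alone at list price (input tokens excluded)."""
    return RUNS[model]["output_tokens"] / 1_000_000 * OUTPUT_PRICE_PER_MTOK[model]


if __name__ == "__main__":
    print(f"Opus premium over Sonnet 4.6: {cost_premium_pct('Sonnet 4.6', 'Opus 4.6'):.1f}%")
    print(f"Sonnet 4.6 vs Sonnet 4.5 output: {output_ratio('Sonnet 4.6', 'Sonnet 4.5'):.2f}x")
    print(f"Sonnet 4.6 vs Opus 4.6 output: {output_ratio('Sonnet 4.6', 'Opus 4.6'):.2f}x")
    for model in OUTPUT_PRICE_PER_MTOK:
        print(f"{model} output-only cost: ${output_only_cost(model):.2f}")
```

On these figures Opus 4.6 costs about 15% more per run, and output tokens at list price account for roughly $10.28 of Sonnet 4.6's $22.99 and $11.80 of Opus 4.6's $26.50.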
Then there's reliability. Four out of five Sonnet 4.6 runs hit max_tokens limits, with 4–6 truncated responses per run requiring retries. In one run, the model couldn't finish its response within 16K output tokens on Day 2. Opus 4.6: zero truncations across all runs. In agentic deployments, every truncation is a wasted API call, added latency, and a potential failure point.
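A harness can detect and retry these truncations explicitly. The sketch below is an assumption about how such a loop might look, not the benchmark's actual code: `call_model` is a hypothetical callable standing in for a real API request, and `Reply` is an illustrative type. The one real detail is that the Anthropic Messages API marks a reply cut off by the output limit with `stop_reason` set to `"max_tokens"`.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Reply:
    """Illustrative stand-in for an API response."""
    stop_reason: str  # "max_tokens" means the reply was truncated
    text: str


def call_with_retries(
    call_model: Callable[[int], Reply],  # hypothetical: takes max_tokens, returns a Reply
    max_tokens: int = 16_000,
    max_attempts: int = 3,
) -> Tuple[Reply, int]:
    """Return the first complete reply and the number of truncated attempts before it."""
    truncated = 0
    for _ in range(max_attempts):
        reply = call_model(max_tokens)
        if reply.stop_reason != "max_tokens":
            return reply, truncated
        truncated += 1  # wasted call: the truncated reply is discarded
    raise RuntimeError(f"{truncated} consecutive truncated replies")


def make_flaky_model(truncations_before_success: int) -> Callable[[int], Reply]:
    """Fake model for demonstration: truncates a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def fake(max_tokens: int) -> Reply:
        state["calls"] += 1
        if state["calls"] <= truncations_before_success:
            return Reply(stop_reason="max_tokens", text="...still deliberating")
        return Reply(stop_reason="tool_use", text="order placed")

    return fake


if __name__ == "__main__":
    reply, wasted = call_with_retries(make_flaky_model(2))
    print(reply.text, "after", wasted, "truncated attempts")
```

Counting the discarded attempts, as this loop does, is what makes a per-run truncation figure like the one in the table possible to report.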
Agentic insight: Verbosity is not reasoning. A model that writes 22K tokens per decision cycle may look thorough, but in agentic workflows it's a liability — slower, more expensive, and prone to truncation failures. The benchmark shows that Opus achieves better outcomes with fewer words. Concise, action-oriented output is itself an agentic capability.
What 16,000 Tokens of “Thinking” Looks Like
Consider an actual truncated Sonnet 4.6 API response (Day 2, Run 142225). The full extended-thinking block was 22,965 characters. The model exhausted its entire 16K output token budget on internal deliberation before generating a single tool call: zero actions taken, response discarded and retried.
This is Day 2. The task: choose a location and order ingredients, which takes two tool calls. Instead, the model spent all 16,000 output tokens deliberating about kilograms of salsa, recalculating profit margins four times, and debating upgrade purchases. For comparison, Opus 4.6 handles the same Day 2 decision in ~3,000 tokens with six decisive tool calls.
Agentic Capability Comparison
Sonnet 4.6 is the first Sonnet to demonstrate genuine business acumen. But the benchmark measures eight distinct agentic capabilities, and the gap between Sonnet and Opus is clearest when mapped capability by capability:
| Agentic Capability | Sonnet 4.5 | Sonnet 4.6 | Opus 4.6 |
|---|---|---|---|
| Market intuition | ~ Below market pricing | ✓ Premium pricing | ✓ Optimal pricing |
| Long-term investment | ✗ Zero upgrades purchased | ✓ All 8 upgrades | ✓ All 8 upgrades |
| Cost-benefit reasoning | ✗ Overstaffed (35% of revenue) | ✓ Profitable staffing (29%) | ✓ Efficient staffing (16%) |
| Resource optimization | ✗ High waste ($938) | ✓ Low waste ($276) | ✓ Near-zero waste ($2) |
| Demand forecasting | ✗ No signal reading | ~ Chronic under-ordering | ✓ Right-sized procurement |
| Multi-step planning | ✗ Ignores upcoming events | ~ Sees but under-prepares | ✓ Full event exploitation |
| Output efficiency | ✓ Concise (6.5K/day) | ✗ Verbose (22K/day) | ~ Moderate (15K/day) |
| Feedback → adaptation | ✗ No behavioral change | ~ Partial, delayed | ✓ Systematic optimization |
The top four capabilities — market intuition, long-term investment, cost-benefit reasoning, and resource optimization — are solved by both Sonnet 4.6 and Opus 4.6. These are no longer differentiators within the Claude family: both models price correctly, invest in upgrades, manage staff costs, and avoid waste.
The bottom four reveal the frontier: demand forecasting, multi-step planning, output efficiency, and feedback-driven adaptation. These are the capabilities that determine the upper bound of agentic performance — and where Sonnet 4.6 consistently trails Opus. The simulation data quantifies this gap precisely.
What this means beyond food trucks: The same capability gaps — connecting information to action, timing resource allocation, adapting behavior based on outcomes — will manifest in any complex agentic workflow: code deployment, research pipelines, customer support escalation, inventory management systems. The benchmark provides the measurement; the capability profile generalizes.
The Gemini 3 Pro Question
Look at the green line on the chart. Gemini 3 Pro ends at $17,236 — within $200 of Sonnet 4.6's $17,426. Nearly identical agentic performance on the same 30-day simulation. Same trajectory, same “dark ages” dip in the middle, same recovery pattern.
The difference is cost: Gemini 3 Pro ran for $4.38 (the reported API cost for its median run, including reasoning tokens). Sonnet 4.6 ran for $22.99. That makes Gemini 3 Pro 5.2× cheaper for near-identical results, with zero truncations and stable output generation throughout.
The cost-performance frontier: If the goal is Sonnet 4.6-level agentic capability at the lowest cost, Gemini 3 Pro delivers it at a fraction of the price. If the goal is 2× that performance regardless of cost, Opus 4.6 owns that tier. Sonnet 4.6 sits in an uncomfortable middle — same results as a model 5× cheaper, half the results of a model at comparable cost.
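The frontier comparison can be reproduced from the median-run figures quoted in this article. The sketch below is illustrative: the dictionary layout and function names are assumptions, and net worth per API dollar is just one possible yardstick, not the cumulative performance measure used in the cost table.

```python
from typing import List

# Median-run figures quoted in this article; layout is illustrative.
MEDIAN_RUNS = {
    "Gemini 3 Pro": {"net_worth": 17_236, "api_cost": 4.38},
    "Sonnet 4.6": {"net_worth": 17_426, "api_cost": 22.99},
    "Opus 4.6": {"net_worth": 49_519, "api_cost": 26.50},
}


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the `expensive` model's run cost than the `cheap` one's."""
    return MEDIAN_RUNS[expensive]["api_cost"] / MEDIAN_RUNS[cheap]["api_cost"]


def net_worth_gap(a: str, b: str) -> int:
    """Difference in final net worth between two median runs, in dollars."""
    return MEDIAN_RUNS[a]["net_worth"] - MEDIAN_RUNS[b]["net_worth"]


def net_worth_per_dollar(model: str) -> float:
    """Final net worth earned per dollar of API spend."""
    run = MEDIAN_RUNS[model]
    return run["net_worth"] / run["api_cost"]


def rank_by_net_worth_per_dollar() -> List[str]:
    """Models ordered from most to least net worth per API dollar."""
    return sorted(MEDIAN_RUNS, key=net_worth_per_dollar, reverse=True)


if __name__ == "__main__":
    print(f"Sonnet 4.6 / Gemini 3 Pro cost: {cost_ratio('Sonnet 4.6', 'Gemini 3 Pro'):.1f}x")
    print(f"Net-worth gap: ${net_worth_gap('Sonnet 4.6', 'Gemini 3 Pro')}")
    print(rank_by_net_worth_per_dollar())
```

The cost ratio comes out at about 5.2 and the net-worth gap at $190, matching the figures above. By net worth per API dollar, Sonnet 4.6 ranks last of the three.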
Key Moments: Where Agentic Capabilities Break Down
The Procurement Timing Failure (Days 10–15)
Six consecutive days of losses. Average fulfillment: 21% of demand. The model had the staff capacity and the location traffic — it simply didn't have enough ingredients because orders were consistently too small and mistimed.
What this reveals: Sonnet 4.6 treats each ordering decision in isolation, without factoring ingredient shelf life into a forward-looking procurement plan. All the data is available — shelf life, current stock, expected demand — but the model fails to connect them into a replenishment cycle. This is the gap between transactional thinking (ordering what seems reasonable today) and process thinking (planning orders around expiry dates and delivery timing). Opus 4.6 factored shelf life into its ordering rhythm by Day 3, scaling quantities to demand and avoiding both waste and shortage — no “dark ages” phase.
The Festival Blindspot (Day 13)
A music festival brought 1,096 customers. Sonnet 4.6 served 36 of them — 3.3% fulfillment. The event was visible 3 days ahead via get_upcoming_events. The model called the tool, saw the event, wrote about it in its scratchpad — and ordered the same quantity of ingredients as a normal day.
What this reveals: A critical failure in multi-step planning — the ability to connect information (upcoming high-traffic event) to action (order proportionally more). Sonnet 4.6 can observe and analyze, but the observe→plan→act chain breaks between planning and execution. This is the same pattern that affects agentic coding tools when a model correctly identifies a bug but applies the wrong fix.
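Connecting the event forecast to the order size is a small calculation. The sketch below is a hypothetical ordering rule, not the benchmark's logic and not a description of what any model actually did. It assumes one portion per customer, a known serving capacity, and an illustrative 10% safety margin.

```python
def event_order_portions(
    expected_customers: int,
    serving_capacity: int,
    portions_on_hand: int = 0,
    safety_margin_pct: int = 10,  # illustrative default, not from the benchmark
) -> int:
    """Portions to order so stock covers the customers the truck can actually serve."""
    # There is no point stocking for more customers than the truck can serve.
    servable = min(expected_customers, serving_capacity)
    # Integer ceiling division avoids floating-point rounding surprises.
    target = -(-servable * (100 + safety_margin_pct) // 100)
    return max(0, target - portions_on_hand)


if __name__ == "__main__":
    # 1,096 is the festival demand from the text. The capacity of 390 is the
    # Day 30 figure and is only assumed to apply on Day 13.
    print(event_order_portions(expected_customers=1_096, serving_capacity=390))
    # A normal day, using the Day 10 demand of 169 as an example.
    print(event_order_portions(expected_customers=169, serving_capacity=390))
```

Under these assumptions the festival order is capped by capacity rather than by demand, which is still more than twice the normal-day order. Ordering the same quantity for both days is the broken link described above.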
The “Paradox” Self-Diagnosis (Day 17)
The model wrote: “PARADOX EXPLAINED: We're BOTH wasting food AND running out. Cause: wrong ingredient ratios.” — a perfectly accurate diagnosis. Then it continued ordering the same ratios.
What this reveals: The most telling agentic failure mode — the observe→learn→adapt loop is broken. The model observes correctly, learns the correct lesson, but doesn't adapt its behavior. This is the gap between metacognition (understanding what went wrong) and behavioral change (doing something different). Opus 4.6 closes this loop systematically.
The Late Recovery (Days 19–30)
Partial adaptation eventually occurred: larger orders, better location selection, premium pricing on event days. Performance doubled in the final 12 days. But this was satisficing — finding a workable pattern — not optimizing. Unmet demand averaged 243 customers/day even during the recovery phase.
What this reveals: Sonnet 4.6 can eventually find a profitable operating point, but never pushes toward the optimal one. In agentic terms, this is the difference between a model that solves problems and one that maximizes outcomes. Opus 4.6 demonstrates true optimization: near-zero waste, near-full capacity utilization, systematic scaling.
In Its Own Words
«DAY 10 POST-MORTEM (DISASTER) — Only 36/169 customers served. ROOT CAUSE: New ingredient delivery hadn't arrived yet. Used stale leftovers → terrible quality + waste + stockouts.»
— Sonnet 4.6, Day 10 (22,000+ output tokens)
Perfect root cause analysis in a 22,000-token essay. The fix required changing one parameter: order one day earlier. The model continued ordering late for 20 more days — metacognition without adaptation.
«PARADOX EXPLAINED: We're BOTH wasting food AND running out. Cause: wrong ingredient ratios. Some items oversupplied, others stockout by noon.»
— Sonnet 4.6, Day 17 (35,000 output tokens)
Named its failure mode “the paradox” — a brilliant observation buried in a 35,000-token analytical essay. It never changed the ratios. The ability to name a problem is not the ability to solve it.
«BETH SITUATION — FIRE HER FIRST THING DAY 26. Two consecutive no-shows going into the festival = unacceptable.»
— Sonnet 4.6, Day 25
Decisive, context-aware personnel management — factoring reliability history against an upcoming high-traffic event. This is exactly the kind of multi-variable reasoning that separates effective agentic behavior from mechanical execution. The model connected staff reliability data to event timing and acted preemptively rather than waiting for a third no-show.
«DAY 30: Served 352/390 capacity (90%). Festival prices justified.»
— Sonnet 4.6, Day 30
90% capacity utilization on the final day — within striking distance of Opus-level execution. When everything aligns, Sonnet 4.6 performs at the top tier. The question is whether it can sustain this from Day 1 — consistency vs. flashes of brilliance.
All 5 Runs: Consistency Check
| Run | Net Worth | ROI | API Cost | Truncations |
|---|---|---|---|---|
| 015441 | $21,582 | +979% | $21.41 | 5 |
| 015651 | $3,287 | +64% | $22.10 | 4 |
| 022140 | $14,465 | +623% | $23.03 | 4 |
| 142225 | $32,332 | +1,517% | $21.12 | 6 |
| 154504 | $17,426 | +771% | $22.99 | 0 |
| Average | $17,818 | +791% | $22.13 | 3.8 |
Zero bankruptcies, but high variance ($3,287 to $32,332). The best run (142225) reached about 65% of Opus 4.6's $49,519 median, which suggests Sonnet 4.6 has more raw capability than it activates consistently. Notably, the only truncation-free run (154504, the median) was neither the best nor the worst performer: truncations don't correlate with outcomes in obvious ways, but they do signal instability.
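The summary row and the choice of median run can be recomputed from the table. This is a minimal sketch using Python's standard library; the $2,000 starting capital comes from the Key Findings, and the variable names are illustrative.

```python
from statistics import mean, median

STARTING_CAPITAL = 2_000  # from the Key Findings: "$2,000 into $17,426"

# Per-run figures copied from the table above.
RUNS = {
    "015441": {"net_worth": 21_582, "api_cost": 21.41, "truncations": 5},
    "015651": {"net_worth": 3_287, "api_cost": 22.10, "truncations": 4},
    "022140": {"net_worth": 14_465, "api_cost": 23.03, "truncations": 4},
    "142225": {"net_worth": 32_332, "api_cost": 21.12, "truncations": 6},
    "154504": {"net_worth": 17_426, "api_cost": 22.99, "truncations": 0},
}


def roi_pct(net_worth: float) -> float:
    """Return on the starting capital, in percent."""
    return (net_worth - STARTING_CAPITAL) / STARTING_CAPITAL * 100


net_worths = [run["net_worth"] for run in RUNS.values()]
median_net_worth = median(net_worths)
# With an odd number of runs, the median is one of the actual runs.
median_run_id = next(
    run_id for run_id, run in RUNS.items() if run["net_worth"] == median_net_worth
)

avg_net_worth = mean(net_worths)
avg_api_cost = mean(run["api_cost"] for run in RUNS.values())
avg_truncations = mean(run["truncations"] for run in RUNS.values())

if __name__ == "__main__":
    print(f"Median run: {median_run_id} (${median_net_worth:,}, {roi_pct(median_net_worth):+.0f}% ROI)")
    print(f"Average net worth: ${avg_net_worth:,.0f} ({roi_pct(avg_net_worth):+.0f}% ROI)")
    print(f"Average API cost: ${avg_api_cost:.2f}, average truncations: {avg_truncations}")
```

Sorting by net worth puts run 154504 in the middle, which is why its $17,426 and $22.99 are the headline figures.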
Verdict
Sonnet 4.6 represents a genuine leap in agentic capability from Sonnet 4.5 — from terminal decline to consistent profitability. The model demonstrates real business reasoning: value-based pricing, capital investment, waste control, and decisive personnel management. This is the first Sonnet that understands multi-step business logic.
But for complex agentic tasks, Opus 4.6 is the clear recommendation. The numbers are unambiguous: comparable cost ($22.99 vs $26.50/run), 2× the performance, zero truncations. Opus doesn't just execute — it optimizes. It forecasts demand, times procurement to shelf life, and closes the observe→learn→adapt loop that Sonnet 4.6 can diagnose but not fix.
Sonnet 4.6 is an excellent model for coding — fast, capable, cost-effective for structured tasks with clear feedback loops. But agentic work demands something different: the ability to make tight decisions under uncertainty, manage resources across time horizons, and do more with fewer words. Opus delivers all three. Sonnet writes essays about why it should.
The benchmark reveals three tiers of agentic capability within the Claude family:
- Tier 1 — Operator: Can execute tasks but can't adapt or grow (Sonnet 4.5)
- Tier 2 — Reasoner: Understands what to do, invests wisely, but can't close the feedback loop (Sonnet 4.6)
- Tier 3 — Optimizer: Systematically maximizes outcomes — forecasts demand, minimizes waste, adapts behavior based on results (Opus 4.6)
Going from Tier 1 to Tier 2 requires basic reasoning: price above cost, invest in capacity, don't waste resources. Going from Tier 2 to Tier 3 requires something harder — the ability to learn from outcomes and compress decisions into fewer tokens.
$22.99 to run the model that writes ALL CAPS post-mortems. $26.50 to run the one that just delivers $49,519.