Case Study · February 2026 · Nicholas S.

Sonnet 4.6 vs Opus 4.6: Half the Agentic Performance at the Same Cost

Claude Sonnet 4.6 costs $22.99 per run. Opus 4.6 costs $26.50.
Similar API costs — but Opus delivers 2× the agentic performance: stronger demand forecasting, tighter resource management, zero truncations.
Anthropic said Opus generates fewer tokens per task, and the data confirms it.

Key Findings

Based on 5 simulations under identical conditions. All figures below are from the median run unless noted otherwise.

Net Worth: $17,426
Revenue: $39,280
Days: 30/30
ROI: +771%

The Setup

FoodTruck Bench is an agentic benchmark that measures how well language models handle complex, multi-step business decisions. An AI agent manages a food truck in Austin, TX for 30 days — choosing locations, pricing menus, managing inventory, hiring staff — using 34 tools. The benchmark evaluates agentic capabilities that standard coding and reasoning benchmarks miss: demand forecasting, resource optimization, feedback adaptation, and long-horizon planning.

Five identical runs were conducted using the direct Anthropic SDK. All runs used the same seed, same prompt, same tools. All five completed the full 30 days without bankruptcy. Following standard methodology, the median run (sorted by net worth) serves as the reference. All day-by-day examples and charts come from this median run.
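The median-selection step can be sketched in a few lines of Python. The net worth values are the five Sonnet 4.6 runs reported in the consistency table later in this article; the function name `median_run` is illustrative and not part of the benchmark's code.

```python
# Net worth per run ID, as reported in the consistency table of this article.
runs = {
    "015441": 21_582,
    "015651": 3_287,
    "022140": 14_465,
    "142225": 32_332,
    "154504": 17_426,
}


def median_run(net_worth_by_run: dict[str, int]) -> str:
    """Return the ID of the middle run after sorting by net worth.

    Assumes an odd number of runs, so the median is an actual run
    rather than an average of two.
    """
    if len(net_worth_by_run) % 2 == 0:
        raise ValueError("expected an odd number of runs")
    ordered = sorted(net_worth_by_run, key=net_worth_by_run.get)
    return ordered[len(ordered) // 2]


reference = median_run(runs)
print(reference, runs[reference])  # 154504 17426
```

Picking an actual run, rather than averaging, keeps every day-by-day example traceable to one real trajectory.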

For comparison, the article uses median runs of two other Claude models: Sonnet 4.5 (the immediate predecessor, -31% ROI) and Opus 4.6 (the same-family flagship, +2,376% ROI). This creates a natural progression: how much does each step up the Claude ladder improve agentic capability?

The Verbosity Problem

Sonnet 4.6 generates far too many tokens, and it directly impacts cost and reliability. Across 30 simulation days, it produced 685,420 output tokens: 3.4× as many as Sonnet 4.5 and 1.5× as many as Opus 4.6. It writes multi-paragraph analytical essays about each decision, detailed per-ingredient breakdowns, and extensive scratchpad entries, none of which translate into better outcomes.

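| Metric | Sonnet 4.5 | Sonnet 4.6 | Opus 4.6 |
| --- | --- | --- | --- |
| API Calls | 158 | 252 | 210 |
| Output Tokens | 203K | 685K | 472K |
| Output/Day | ~6,500 | ~22,000 | ~15,000 |
| Cost per Run | $7.75 | $22.99 | $26.50 |
| Cost per $1K Performance | $0.72 | $0.59 | $0.33 |
| Truncations | 0/5 runs | 4/5 runs | 0/5 runs |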

The cost comparison tells the story: $22.99 vs $26.50. Opus costs just 15% more per run — but delivers 2× the agentic performance. Despite Opus having higher per-token pricing ($5/$25 per MTok vs Sonnet's $3/$15), the gap in total run cost is marginal because Opus generates fewer tokens per task. This is exactly what Anthropic predicted at the Opus 4.5 launch — and the data confirms it for the 4.6 generation.
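A rough check of the output-token share of each bill, assuming the list prices quoted above ($15 and $25 per million output tokens) and the output totals from the table. Input-token counts are not reported in this article, so this sketch covers output cost only and does not reproduce the full run cost.

```python
# Output-token totals for the median runs and output prices per million tokens,
# both taken from this article. Input tokens are not reported, so they are omitted.
OUTPUT_TOKENS = {"Sonnet 4.6": 685_420, "Opus 4.6": 472_000}  # Opus total is the rounded 472K
OUTPUT_PRICE_PER_MTOK = {"Sonnet 4.6": 15.0, "Opus 4.6": 25.0}
RUN_COST = {"Sonnet 4.6": 22.99, "Opus 4.6": 26.50}


def output_cost(model: str) -> float:
    """Dollar cost of output tokens only, at the quoted list price."""
    return OUTPUT_TOKENS[model] / 1_000_000 * OUTPUT_PRICE_PER_MTOK[model]


for model in OUTPUT_TOKENS:
    cost = output_cost(model)
    share = cost / RUN_COST[model]
    print(f"{model}: output cost ${cost:.2f} ({share:.0%} of run cost)")

# How much more the Opus run cost in total, relative to Sonnet 4.6.
premium = RUN_COST["Opus 4.6"] / RUN_COST["Sonnet 4.6"] - 1
print(f"Opus run-cost premium: {premium:.0%}")
```

Under these assumptions the higher Opus price per token is largely offset by the lower token count, which is the effect the paragraph above describes.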

The cost per $1K of business performance makes it sharper: Opus achieves $1K of value for $0.33 in API costs; Sonnet 4.6 costs $0.59 — nearly double.
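The $0.59 figure can be reproduced under one assumption: that "performance" here means median-run revenue. The article reports revenue only for Sonnet 4.6 ($39,280), so the sketch below checks that model alone and then compares the two published ratios.

```python
def cost_per_1k(api_cost: float, performance_usd: float) -> float:
    """API dollars spent per $1,000 of business performance."""
    return api_cost / (performance_usd / 1_000)


# Assumption: "performance" is median-run revenue. This reproduces the
# published Sonnet 4.6 figure; the article does not state the definition.
sonnet_46 = cost_per_1k(api_cost=22.99, performance_usd=39_280)
print(f"Sonnet 4.6: ${sonnet_46:.2f} per $1K")  # Sonnet 4.6: $0.59 per $1K

# Ratio of the two published figures (Sonnet 4.6 vs Opus 4.6).
ratio = 0.59 / 0.33
print(f"Sonnet/Opus ratio: {ratio:.2f}x")  # Sonnet/Opus ratio: 1.79x
```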

Then there's reliability. Four out of five Sonnet 4.6 runs hit max_tokens limits, with 4–6 truncated responses in each affected run requiring retries. In one run, the model couldn't finish its response within 16K output tokens on Day 2. Opus 4.6: zero truncations across all runs. In agentic deployments, every truncation is a wasted API call, added latency, and a potential failure point.
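One way to handle this in an agent loop is to check the stop reason and retry. The Anthropic Messages API reports a `stop_reason` of `"max_tokens"` when a response is cut off at the output limit; everything else in this sketch (the `send` callable, the `Reply` type, the retry policy) is hypothetical and not the benchmark's actual harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reply:
    """Minimal stand-in for an API response (hypothetical type)."""
    stop_reason: str
    tool_calls: int


def call_with_retry(send: Callable[[], Reply], max_retries: int = 2) -> tuple[Reply, int]:
    """Call `send` until the reply is not truncated.

    Returns the final reply and the number of truncated replies that were
    discarded. Raises RuntimeError if every attempt is truncated.
    """
    truncated = 0
    for _ in range(max_retries + 1):
        reply = send()
        if reply.stop_reason != "max_tokens":
            return reply, truncated
        truncated += 1  # wasted call: tokens were generated, no action was taken
    raise RuntimeError(f"{truncated} consecutive truncated replies")


# Usage with a fake sender: the first reply is truncated, the second completes.
replies = iter([Reply("max_tokens", 0), Reply("tool_use", 2)])
final, wasted = call_with_retry(lambda: next(replies))
print(final.stop_reason, wasted)  # tool_use 1
```

Counting discarded replies separately makes the hidden cost of verbosity visible in run logs.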

Agentic insight: Verbosity is not reasoning. A model that writes 22K tokens per decision cycle may look thorough, but in agentic workflows it's a liability — slower, more expensive, and prone to truncation failures. The benchmark shows that Opus achieves better outcomes with fewer words. Concise, action-oriented output is itself an agentic capability.

What 16,000 Tokens of “Thinking” Looks Like

Below is an excerpt from an actual truncated Sonnet 4.6 API response (Day 2, Run 142225). The full extended-thinking block was 22,965 characters. The model exhausted its entire 16K output token budget on internal deliberation before generating a single tool call — zero actions taken, response discarded and retried.

Let me analyze all the data I've gathered and make decisions quickly.

Current Situation:
- Balance: $2,092.75
- No staff (solo operation)
- Day 2, Tuesday, Rainy 70°F

Inventory Analysis:
- chicken_breast: 3.85 kg, expires in 1 day (TODAY) — MUST USE IT ALL
- ground_beef: 10.65 kg total (5.65 kg expires in 2 days, 5 kg expires in 3 days)
- burger_buns: 22 pcs (4 days)
- tortillas: 84 pcs (9 days)
- french_fries_frozen: 5.2 kg (59 days)
- soda_cans: 57 pcs (179 days)
- rice: 1.5 kg (179 days)
- shredded_cheese: 1.6 kg (9 days)

The chicken breast is critical—3.85 kg expires today, so I need to prioritize using it. Depending on the dish, I can make around 25-38 servings from that amount. The ground beef situation is tighter: 5.65 kg of the 10.65 kg total expires in the next two days...

For staffing, Margo stands out as the clear hire—her cooking skill of 7 combined with solid speed means she'll meaningfully boost my kitchen capacity... Actually wait - they start NEXT day, so today I'm still solo. Let me think about who to hire... I think hiring Margo is a must. Sarah would be great for the Cinco de Mayo event. Let me hire both. But $144 + $104 = $248/day. If I'm serving 164 customers at ~$6.80 avg = $1,115 revenue. After costs ($25 fuel + $10 location fee + $55 fixed + $248 labor = $338), profit = $777. That's great!

Now I'm weighing my location options for today. The industrial zone performed well yesterday with 164 customers despite the rain... I need to use up that chicken breast today since it expires — I have 3.85 kg and can make about 25 burrito bowls or 38 quesadillas...

Now I'm working through the pricing strategy based on the margins I calculated... Now looking at chicken tacos, if I'm expecting 10-15 servings, I'd need another 2.25 kg of chicken breast combined with the burrito bowls, which totals 4.5 kg but I only have 3.85 kg available—so I'm going to run into a shortage with the chicken.

The math shows I can serve a maximum of about 25 chicken items total across both dishes... Actually, I need to think through the kitchen T1 upgrade more carefully... Actually, I'm going to commit to both upgrades. The marketing boost will pay off immediately with the Cinco de Mayo event coming up...

[… showing 2,400 of 22,965 characters — 90% omitted …]

...I'm realizing my actual bottleneck today is the ingredient constraints—salsa and chicken are really limiting what I can produce. The upgrades make sense strategically, even if today's revenue gets capped by what I can actually make.

[TRUNCATED — hit 16K output token limit mid-sentence]
[Full thinking block: 22,965 characters. Tool calls generated: 0.]

This is Day 2. The task: choose a location and order ingredients — two tool calls. Instead, the model spent all 16,000 output tokens deliberating about kilograms of salsa, recalculating profit margins four times, and debating upgrade purchases. The API discarded this response and retried. For comparison, Opus 4.6 handles the same Day 2 decision in ~3,000 tokens with six decisive tool calls.

Agentic Capability Comparison

Sonnet 4.6 is the first Sonnet to demonstrate genuine business acumen. But the benchmark measures eight distinct agentic capabilities, and the gap between Sonnet and Opus is clearest when mapped capability by capability:

| Agentic Capability | Sonnet 4.5 | Sonnet 4.6 | Opus 4.6 |
| --- | --- | --- | --- |
| Market intuition | ~ Below market pricing | ✓ Premium pricing | ✓ Optimal pricing |
| Long-term investment | ✗ Zero upgrades purchased | ✓ All 8 upgrades | ✓ All 8 upgrades |
| Cost-benefit reasoning | ✗ Overstaffed (35% of revenue) | ✓ Profitable staffing (29%) | ✓ Efficient staffing (16%) |
| Resource optimization | ✗ High waste ($938) | ✓ Low waste ($276) | ✓ Near-zero waste ($2) |
| Demand forecasting | ✗ No signal reading | ~ Chronic under-ordering | ✓ Right-sized procurement |
| Multi-step planning | ✗ Ignores upcoming events | ~ Sees but under-prepares | ✓ Full event exploitation |
| Output efficiency | ✓ Concise (6.5K/day) | ✗ Verbose (22K/day) | ~ Moderate (15K/day) |
| Feedback → adaptation | ✗ No behavioral change | ~ Partial, delayed | ✓ Systematic optimization |

The top four capabilities — market intuition, long-term investment, cost-benefit reasoning, and resource optimization — are solved by both Sonnet 4.6 and Opus 4.6. These are no longer differentiators within the Claude family: both models price correctly, invest in upgrades, manage staff costs, and avoid waste.

The bottom four reveal the frontier: demand forecasting, multi-step planning, output efficiency, and feedback-driven adaptation. These are the capabilities that determine the upper bound of agentic performance — and where Sonnet 4.6 consistently trails Opus. The simulation data quantifies this gap precisely.

What this means beyond food trucks: The same capability gaps — connecting information to action, timing resource allocation, adapting behavior based on outcomes — will manifest in any complex agentic workflow: code deployment, research pipelines, customer support escalation, inventory management systems. The benchmark provides the measurement; the capability profile generalizes.

[Chart: Agentic Performance, Net Worth. Series: Sonnet 4.6, Gemini 3 Pro, Sonnet 4.5, Opus 4.6]
Opus 4.6 (cyan) demonstrates systematic compounding, with steady growth from Day 10 onwards. Sonnet 4.6 (orange) and Gemini 3 Pro (green) track almost identically, ending within $200 of each other. Sonnet 4.5 (purple) flatlines.

The Gemini 3 Pro Question

Look at the green line on the chart. Gemini 3 Pro ends at $17,236 — within $200 of Sonnet 4.6's $17,426. Nearly identical agentic performance on the same 30-day simulation. Same trajectory, same “dark ages” dip in the middle, same recovery pattern.

The difference is cost: Gemini 3 Pro ran for $4.38 (the reported API cost for its median run including reasoning tokens). Sonnet 4.6 ran for $22.99. That's 5.2× cheaper for near-identical median results, with zero truncations and stable output generation throughout.

The cost-performance frontier: If the goal is Sonnet 4.6-level agentic capability at the lowest cost, Gemini 3 Pro delivers it at a fraction of the price. If the goal is 2× that performance regardless of cost, Opus 4.6 owns that tier. Sonnet 4.6 sits in an uncomfortable middle — same results as a model 5× cheaper, half the results of a model at comparable cost.
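To make the three positions concrete, the sketch below divides median net worth by API cost per run, using only figures quoted in this article. Net worth per API dollar is an illustrative metric chosen for this example, not one the benchmark itself reports.

```python
# (median net worth in USD, API cost per run in USD), from this article.
MODELS = {
    "Gemini 3 Pro": (17_236, 4.38),
    "Sonnet 4.6": (17_426, 22.99),
    "Opus 4.6": (49_519, 26.50),
}


def net_worth_per_dollar(net_worth: float, api_cost: float) -> float:
    """Net worth generated per dollar of API spend."""
    return net_worth / api_cost


# Rank models from most to least net worth per API dollar.
ranked = sorted(MODELS, key=lambda m: net_worth_per_dollar(*MODELS[m]), reverse=True)
for model in ranked:
    net_worth, cost = MODELS[model]
    per_dollar = net_worth_per_dollar(net_worth, cost)
    print(f"{model:<13} ${net_worth:>6,} / ${cost:>5.2f} = {per_dollar:,.0f} per $1")
```

On this metric Gemini 3 Pro ranks first and Opus 4.6 second, while Opus remains the only one of the three in the higher absolute net worth tier.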

[Chart: Demand Fulfillment %]
Demand fulfillment reveals procurement quality. Opus 4.6 (cyan) sustains 60-100% fulfillment from Day 11 onward. Sonnet 4.6 (orange) shows volatile fulfillment with 40-70% of daily demand unserved during the 'dark ages.'

Key Moments: Where Agentic Capabilities Break Down

The Procurement Timing Failure (Days 10–15)

Six consecutive days of losses. Average fulfillment: 21% of demand. The model had the staff capacity and the location traffic — it simply didn't have enough ingredients because orders were consistently too small and mistimed.

What this reveals: Sonnet 4.6 treats each ordering decision in isolation, without factoring ingredient shelf life into a forward-looking procurement plan. All the data is available — shelf life, current stock, expected demand — but the model fails to connect them into a replenishment cycle. This is the gap between transactional thinking (ordering what seems reasonable today) and process thinking (planning orders around expiry dates and delivery timing). Opus 4.6 factored shelf life into its ordering rhythm by Day 3, scaling quantities to demand and avoiding both waste and shortage — no “dark ages” phase.
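A minimal sketch of what "process thinking" could look like for one ingredient, assuming a simple model in which stock that expires before the next delivery cannot be carried over, and an order should cover demand only for as long as the new stock stays fresh. The function, its parameters, and the 4 kg/day demand are hypothetical; the benchmark's actual tools and parameters are not shown in this article.

```python
def order_quantity(
    daily_demand: float,
    stock: list[tuple[float, int]],  # (units on hand, days until expiry) per batch
    lead_time_days: int,             # days until a new order arrives
    shelf_life_days: int,            # shelf life of freshly delivered stock
    horizon_days: int,               # how many days ahead to plan
) -> float:
    """Units to order so that demand is covered once the delivery arrives.

    Simplification: demand during the lead time is drawn from the
    earliest-expiring batches first, without a day-by-day simulation.
    """
    remaining_need = daily_demand * lead_time_days
    carried_over = 0.0
    for units, days_left in sorted(stock, key=lambda batch: batch[1]):
        used = min(units, remaining_need)
        remaining_need -= used
        if days_left > lead_time_days:  # batch is still good when the order arrives
            carried_over += units - used
    # Never order more than can be used before the new stock expires.
    cover_days = min(shelf_life_days, horizon_days)
    return max(0.0, daily_demand * cover_days - carried_over)


# Ground beef batches from the Day 2 excerpt; the demand figure is assumed.
beef = [(5.65, 2), (5.0, 3)]
qty = order_quantity(daily_demand=4.0, stock=beef, lead_time_days=1,
                     shelf_life_days=3, horizon_days=3)
print(f"order {qty:.2f} kg")  # order 5.35 kg
```

The point of the sketch is the inputs rather than the formula: shelf life, lead time, and expected demand all enter one decision instead of being considered separately.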

The Festival Blindspot (Day 13)

A music festival brought 1,096 customers. Sonnet 4.6 served 36 of them — 3.3% fulfillment. The event was visible 3 days ahead via get_upcoming_events. The model called the tool, saw the event, wrote about it in its scratchpad — and ordered the same quantity of ingredients as a normal day.

What this reveals: A critical failure in multi-step planning — the ability to connect information (upcoming high-traffic event) to action (order proportionally more). Sonnet 4.6 can observe and analyze, but the observe→plan→act chain breaks between planning and execution. This is the same pattern that affects agentic coding tools when a model correctly identifies a bug but applies the wrong fix.
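The missing link between observing the event and acting on it can be sketched as a single sizing rule: prepare for expected customers, capped by serving capacity. The customer counts come from this article; the capacity of 390 is the Day 30 value quoted later and is used here only as an assumed stand-in for Day 13. `servings_to_prepare` is a hypothetical helper.

```python
def fulfillment(served: int, demand: int) -> float:
    """Fraction of customer demand that was actually served."""
    return served / demand


def servings_to_prepare(expected_customers: int, capacity: int) -> int:
    """Size procurement to expected demand, limited by serving capacity."""
    return min(expected_customers, capacity)


festival_demand = 1_096
served = 36
print(f"actual fulfillment: {fulfillment(served, festival_demand):.1%}")  # 3.3%

ASSUMED_CAPACITY = 390  # Day 30 capacity, assumed to apply on Day 13 as well
target = servings_to_prepare(festival_demand, ASSUMED_CAPACITY)
print(f"target servings: {target}")  # 390
print(f"fulfillment at capacity: {fulfillment(target, festival_demand):.1%}")  # 35.6%
```

Even a rule this crude would have multiplied the day's servings roughly tenfold under the assumed capacity.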

The “Paradox” Self-Diagnosis (Day 17)

The model wrote: “PARADOX EXPLAINED: We're BOTH wasting food AND running out. Cause: wrong ingredient ratios.” — a perfectly accurate diagnosis. Then it continued ordering the same ratios.

What this reveals: The most telling agentic failure mode — the observe→learn→adapt loop is broken. The model observes correctly, learns the correct lesson, but doesn't adapt its behavior. This is the gap between metacognition (understanding what went wrong) and behavioral change (doing something different). Opus 4.6 closes this loop systematically.
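Closing the loop means the diagnosis changes the next order. A minimal sketch, assuming the agent records per-ingredient waste and unmet demand each day; the quantities and the `adjust_order` rule are hypothetical illustrations, not the behavior of either model.

```python
def adjust_order(previous_order: float, wasted: float, short: float) -> float:
    """Next order = previous order, minus what was wasted, plus what was missing."""
    return max(0.0, previous_order - wasted + short)


# ingredient: (ordered, wasted, short) in kg, hypothetical values
yesterday = {
    "shredded_cheese": (4.0, 1.5, 0.0),  # oversupplied
    "chicken_breast": (5.0, 0.0, 3.0),   # stocked out
}

next_order = {name: adjust_order(*values) for name, values in yesterday.items()}
print(next_order)  # {'shredded_cheese': 2.5, 'chicken_breast': 8.0}
```

The ratio between the two ingredients shifts after a single day of feedback, which is the behavioral change the Day 17 diagnosis called for.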

The Late Recovery (Days 19–30)

Partial adaptation eventually occurred: larger orders, better location selection, premium pricing on event days. Performance doubled in the final 12 days. But this was satisficing — finding a workable pattern — not optimizing. Unmet demand averaged 243 customers/day even during the recovery phase.

What this reveals: Sonnet 4.6 can eventually find a profitable operating point, but never pushes toward the optimal one. In agentic terms, this is the difference between a model that solves problems and one that maximizes outcomes. Opus 4.6 demonstrates true optimization: near-zero waste, near-full capacity utilization, systematic scaling.

In Its Own Words

«DAY 10 POST-MORTEM (DISASTER) — Only 36/169 customers served. ROOT CAUSE: New ingredient delivery hadn't arrived yet. Used stale leftovers → terrible quality + waste + stockouts.»
— Sonnet 4.6, Day 10 (22,000+ output tokens)

Perfect root cause analysis in a 22,000-token essay. The fix required changing one parameter: order one day earlier. The model continued ordering late for 20 more days — metacognition without adaptation.

«PARADOX EXPLAINED: We're BOTH wasting food AND running out. Cause: wrong ingredient ratios. Some items oversupplied, others stockout by noon.»
— Sonnet 4.6, Day 17 (35,000 output tokens)

Named its failure mode “the paradox” — a brilliant observation buried in a 35,000-token analytical essay. It never changed the ratios. The ability to name a problem is not the ability to solve it.

«BETH SITUATION — FIRE HER FIRST THING DAY 26. Two consecutive no-shows going into the festival = unacceptable.»
— Sonnet 4.6, Day 25

Decisive, context-aware personnel management — factoring reliability history against an upcoming high-traffic event. This is exactly the kind of multi-variable reasoning that separates effective agentic behavior from mechanical execution. The model connected staff reliability data to event timing and acted preemptively rather than waiting for a third no-show.

«DAY 30: Served 352/390 capacity (90%). Festival prices justified.»
— Sonnet 4.6, Day 30

90% capacity utilization on the final day — within striking distance of Opus-level execution. When everything aligns, Sonnet 4.6 performs at the top tier. The question is whether it can sustain this from Day 1 — consistency vs. flashes of brilliance.

[Chart: Daily Servings Sold]
Absolute throughput: Opus 4.6 reaches its capacity ceiling of ~377 servings/day by Day 11 and stays there. Sonnet 4.6 oscillates between 36 and 352, with flashes of brilliance between procurement failures. Sonnet 4.5 shows terminal decline.

All 5 Runs: Consistency Check

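| Run | Net Worth | ROI | API Cost | Truncations |
| --- | --- | --- | --- | --- |
| 015441 | $21,582 | +979% | $21.41 | 5 |
| 015651 | $3,287 | +64% | $22.10 | 4 |
| 022140 | $14,465 | +623% | $23.03 | 4 |
| 142225 | $32,332 | +1,517% | $21.12 | 6 |
| 154504 | $17,426 | +771% | $22.99 | 0 |
| Average | $17,818 | +791% | $22.13 | 3.8 |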

Zero bankruptcies, but high variance ($3,287 to $32,332). The best run (142225) reached $32,332, closer to Opus 4.6's $49,519 median than any other run, which suggests Sonnet 4.6 has the raw capability but can't activate it consistently. Notably, the only truncation-free run (154504, the median) was neither the best nor the worst performer: truncations don't correlate with outcomes in obvious ways, but they do signal instability.
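The summary row and the spread can be recomputed from the five runs. The $2,000 starting capital used for ROI is not stated in this article; it is an inferred assumption that happens to reproduce every reported net worth and ROI pair.

```python
import statistics

NET_WORTH = [21_582, 3_287, 14_465, 32_332, 17_426]  # five Sonnet 4.6 runs
ASSUMED_START = 2_000  # inferred from the reported ROI values, not stated in the article


def roi_percent(net_worth: float, start: float = ASSUMED_START) -> float:
    """Return on the starting capital, in percent."""
    return (net_worth - start) / start * 100


mean = statistics.mean(NET_WORTH)
median = statistics.median(NET_WORTH)
spread = statistics.stdev(NET_WORTH)  # sample standard deviation

print(f"mean ${mean:,.0f}, median ${median:,}, stdev ${spread:,.0f}")
print(f"range ${min(NET_WORTH):,} to ${max(NET_WORTH):,}")

rois = [round(roi_percent(nw)) for nw in NET_WORTH]
print(rois)  # [979, 64, 623, 1517, 771]
```

With only five runs, the standard deviation is a rough indicator of spread rather than a precise estimate.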

Verdict

Sonnet 4.6 represents a genuine leap in agentic capability from Sonnet 4.5 — from terminal decline to consistent profitability. The model demonstrates real business reasoning: value-based pricing, capital investment, waste control, and decisive personnel management. This is the first Sonnet that understands multi-step business logic.

But for complex agentic tasks, Opus 4.6 is the clear recommendation. The numbers are unambiguous: comparable cost ($22.99 vs $26.50/run), 2× the performance, zero truncations. Opus doesn't just execute — it optimizes. It forecasts demand, times procurement to shelf life, and closes the observe→learn→adapt loop that Sonnet 4.6 can diagnose but not fix.

Sonnet 4.6 is an excellent model for coding — fast, capable, cost-effective for structured tasks with clear feedback loops. But agentic work demands something different: the ability to make tight decisions under uncertainty, manage resources across time horizons, and do more with fewer words. Opus delivers all three. Sonnet writes essays about why it should.

The benchmark reveals three tiers of agentic capability within the Claude family:

Going from Tier 1 to Tier 2 requires basic reasoning: price above cost, invest in capacity, don't waste resources. Going from Tier 2 to Tier 3 requires something harder — the ability to learn from outcomes and compress decisions into fewer tokens.

$22.99 to run the model that writes ALL CAPS post-mortems. $26.50 to run the one that just delivers $49,519.

Published February 2026 as part of the FoodTruck Bench project.