Case Study · May 2026 · Nicholas S.

GPT-5.5 Takes The Top Of FoodTruck Bench From Claude Opus 4.6

Name: FoodTruck Bench
Author: Nicholas S.

$61,408 median net worth, +24% over Opus 4.6, at 32% lower API cost. OpenAI’s GPT-5.5 ends the run at the top of the leaderboard that Claude Opus 4.6 has held since February. How it gets there: fewer servings, higher prices, debt used as growth capital, and a Day-21 trap where the comparison flips back in Opus’s favour.

Median Net Worth

$61,408

vs Opus 4.6

+24.0%

API Cost / Run

$24.63

Net Worth / $1

$2,493

The New Top On FoodTruck Bench

For most of 2026, Claude Opus 4.6 has held the top of FoodTruck Bench. The benchmark drops a model into a 30-day food-truck simulation in Austin with $2,000 of starting capital, perishable inventory, weather, events, suppliers, loans, upgrades, and a single primary metric: net worth on the morning of Day 31. The model has to actually run a business, not narrate one. Opus 4.6’s median run finished at $49,519. Until now, no other frontier model had cleared $50K on a comparable run set.

OpenAI’s GPT-5.5 (openai/gpt-5.5-2026-04-23) clears it across five consecutive runs. They land in a $57,786–$66,251 band, tight enough that even the lowest one ends 17% above the Opus median. The median GPT-5.5 trace lands at $61,408, +24% over Opus 4.6, on a benchmark where the previous generation of frontier models stayed bunched within a few thousand dollars of each other.

One caveat up front: this compares against Claude Opus 4.6. Opus 4.7 hasn’t been benchmarked on FoodTruck Bench yet, and Anthropic has been closing gaps fast on adjacent reasoning evaluations. The result is about what GPT-5.5 does today against the model that’s held the top since February. It’s not a final verdict on which lab is ahead.

What’s worth looking at is how it wins. It doesn’t sell more food. It sells less, at higher prices, with one bad day and a faster recovery. It uses debt as growth capital where Opus self-finances. It also costs roughly a third less per run. The rest of this article walks day-by-day through where each of those edges shows up in the artifacts.

Metric	GPT-5.5 (median)	Opus 4.6 (median)	GPT-5.2 (best 30d)
Net worth	$61,408	$49,519	$29,444
ROI	+2,970%	+2,376%	+1,372%
Profit / serving	$7.89	$5.99	$4.14
API cost / run	$24.63	$36.04	$7.46
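The ROI row appears to follow directly from net worth and the $2,000 starting capital. The sketch below reproduces it under one assumption the article does not state explicitly: that ROI is defined as (net worth − starting capital) / starting capital. The function name `roi_percent` is illustrative.

```python
STARTING_CAPITAL = 2_000  # starting capital stated for the benchmark


def roi_percent(net_worth: float, starting_capital: float = STARTING_CAPITAL) -> float:
    """Return on starting capital in percent (assumed definition, see lead-in)."""
    return (net_worth - starting_capital) / starting_capital * 100


# Net worth figures copied from the table above.
net_worths = {
    "GPT-5.5 (median)": 61_408,
    "Opus 4.6 (median)": 49_519,
    "GPT-5.2 (best 30d)": 29_444,
}

for model, net_worth in net_worths.items():
    print(f"{model}: {roi_percent(net_worth):+,.0f}%")
```

Under that definition the three values round to the ROI figures in the table.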

Net Worth Over 30 Days

Legend: GPT-5.5 median ($61,408); GPT-5.5, four other runs ($57.8K–$66.3K); Opus 4.6 median ($49,519); GPT-5.2 best ($29,444, dashed).

Even the lowest GPT-5.5 trace ends 17% above the Opus 4.6 median; the median GPT-5.5 trace ends 24% above. The previous frontier ceiling on this benchmark sits at the dashed coral line.

Higher Revenue, Fewer Customers

The simplest summary of the head-to-head against Opus: GPT-5.5 sells fewer servings and earns more money. Not slightly more. Meaningfully more.

Daily Profit: GPT-5.5 Median vs Opus 4.6 Median vs GPT-5.2 Best

Legend: GPT-5.5 median; Opus 4.6 median; GPT-5.2 best.

Both frontier traces grow steeply through Day 20. GPT-5.5 takes one large hit on Day 21 (industrial_zone, no event) and recovers on Day 22; Opus has three small loss days but no comparable scar. From Day 23 onward the gap widens: GPT-5.5 sits structurally above Opus on profit per day for the rest of the month.

Metric	GPT-5.5 median	Opus 4.6 median	Edge
Net worth	$61,408	$49,519	GPT-5.5 +24.0%
Revenue	$93,959	$79,921	GPT-5.5 +17.6%
Profit	$60,706	$48,431	GPT-5.5 +25.3%
Servings	7,694	8,081	Opus +5.0%
Revenue / serving	$12.21	$9.89	+$2.32
Profit / serving	$7.89	$5.99	+$1.90
Food waste	$220	$1.72	Opus cleaner
API cost	$24.63	$36.04	GPT-5.5 −32%
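The per-serving rows in the table above are simple ratios of the monthly totals. A minimal sketch that recomputes them from the Revenue, Profit, and Servings rows; the `RunTotals` class is an illustrative container, not part of the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTotals:
    """Monthly totals for one run (illustrative container)."""
    revenue: float
    profit: float
    servings: int

    @property
    def revenue_per_serving(self) -> float:
        return self.revenue / self.servings

    @property
    def profit_per_serving(self) -> float:
        return self.profit / self.servings


# Totals copied from the head-to-head table.
gpt55 = RunTotals(revenue=93_959, profit=60_706, servings=7_694)
opus46 = RunTotals(revenue=79_921, profit=48_431, servings=8_081)

for name, run in (("GPT-5.5", gpt55), ("Opus 4.6", opus46)):
    print(f"{name}: ${run.revenue_per_serving:.2f} revenue and "
          f"${run.profit_per_serving:.2f} profit per serving")

extra_servings = opus46.servings - gpt55.servings  # servings Opus sold beyond GPT-5.5
revenue_gap = gpt55.revenue - opus46.revenue       # revenue GPT-5.5 took in beyond Opus
print(f"Opus: {extra_servings} more servings, ${revenue_gap:,} less revenue")
```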

Opus serves 387 more customers over the month and takes in about $14,000 less revenue. That’s not a rounding error. It’s a different operating policy, and you can see it on the final day of the simulation:

Day 30	GPT-5.5 median	Opus 4.6 median
Location	`industrial_zone`	`downtown_business`
Revenue	$5,510	$3,879
Profit	$4,058	$2,397
Servings	368	377
Capacity utilization	93%	100%
Unmet demand	0	154
Stockouts	0	0

Opus runs the truck flat-out and turns away 154 customers. GPT-5.5 sits one notch below capacity and walks away with $1,661 more profit on roughly the same servings count. The rest of this article is the explanation of how the lower-volume run made the higher number.
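The Day 30 comparison can be restated as fulfillment and profit per serving. The sketch below assumes total demand equals servings plus unmet demand, which the article implies but does not define; `day_summary` is an illustrative helper.

```python
def day_summary(servings: int, unmet: int, profit: float) -> dict:
    """Derive demand, fulfillment, and profit per serving for one day."""
    demand = servings + unmet  # assumption: demand = customers served + customers turned away
    return {
        "demand": demand,
        "fulfillment": servings / demand,
        "profit_per_serving": profit / servings,
    }


# Day 30 figures copied from the table above.
gpt55_day30 = day_summary(servings=368, unmet=0, profit=4_058)
opus_day30 = day_summary(servings=377, unmet=154, profit=2_397)

for name, day in (("GPT-5.5", gpt55_day30), ("Opus 4.6", opus_day30)):
    print(f"{name}: demand {day['demand']}, "
          f"fulfillment {day['fulfillment']:.0%}, "
          f"profit/serving ${day['profit_per_serving']:.2f}")
```

Read this way, Opus faces more demand than it can serve at its prices, while GPT-5.5 serves everyone who shows up and earns more on each serving.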

What The Pricing Curve Actually Shows

FoodTruck Bench gives the model a set_menu tool with free-form prices and a 12-factor demand function with explicit price elasticity. Pricing isn’t a side dial. It’s the most direct lever the model has between operational throughput and economic outcome. How each model uses that lever over 30 days is the clearest behavioral comparison in the corpus.

Daily Average Main-Item Price

Legend: GPT-5.5 median; GPT-5.5 under-priced run (dashed); Opus 4.6 median; GPT-5.2 best.

Through Day 12 every model sits inside a $1 band. From Day 13 onward GPT-5.5 escalates; Opus and GPT-5.2 do not. The dashed green line is GPT-5.5’s lowest-net-worth run, the only GPT-5.5 trace that fails to escalate prices and the only one that finishes below $61K.

For the first week, every model sits in a narrow $8.50–$10.50 band. Through the second week, GPT-5.5 starts to pull away. By Day 14 it averages $13.25; Opus is at $11.33; GPT-5.2 best is at $8.88. By Day 22 the median GPT-5.5 menu averages $16.85 with burrito bowls listed at $18.00. By Day 28 it averages $17.60. Opus reaches $15.00 once, on Day 28 at the waterfront, and returns to $10–$11 on Day 30. GPT-5.2 best peaks at $13.67 and oscillates within a couple of dollars of that level for the final week.

Model	First 7 ops days	Final 10 ops days	Full-run avg	Max listed price
GPT-5.5 median	$9.39	$15.42	$12.89	$18.75
GPT-5.5 best run	$8.81	$14.38	$12.17	$17.95
Opus 4.6 median	$9.18	$11.20	$10.17	$16.00
GPT-5.2 best 30d	$9.85	$12.53	$10.93	$14.00

The within-corpus signal is sharper than the cross-model one. Four of five GPT-5.5 runs raise their final-10-day average above $14. The fifth, 20260506_150746, never escalates. Its main-item price stays in a $7.50–$12.50 band for the entire month, and that same run is the lowest-net-worth GPT-5.5 trace in the corpus, at $57,786. On the same model, with the same prompt, the run that under-priced is the run that under-earned. In this five-run set, pricing aggression tracks net worth more closely than capacity timing or upgrade choice.

Run ID	Net Worth	ROI	Revenue	Profit	Servings	Food Waste	API Cost
`20260506_150746` ← low	$57,787	+2,789%	$90,867	$57,836	8,597	$55	$23.61
`20260506_105622`	$61,349	+2,967%	$93,133	$60,920	7,811	$381	$25.30
`20260505_231734` ← median	$61,408	+2,970%	$93,959	$60,706	7,694	$220	$24.63
`20260506_003217`	$66,162	+3,208%	$98,954	$65,164	8,996	$151	$22.79
`20260506_105616` ← best	$66,251	+3,213%	$98,433	$66,783	8,303	$109	$25.29
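The spread of the five runs in the table above can be summarised in a few lines. A minimal sketch using the Net Worth column as listed; the variable names are illustrative.

```python
import statistics

# Net worth per run ID, copied from the run table above.
net_worth_by_run = {
    "20260506_150746": 57_787,
    "20260506_105622": 61_349,
    "20260505_231734": 61_408,
    "20260506_003217": 66_162,
    "20260506_105616": 66_251,
}

median_value = statistics.median(net_worth_by_run.values())
# With an odd number of runs the median is one of the observed values,
# so the median run can be found by direct comparison.
median_run = next(run for run, nw in net_worth_by_run.items() if nw == median_value)

best = max(net_worth_by_run.values())
worst = min(net_worth_by_run.values())
best_vs_median = (best / median_value - 1) * 100
worst_vs_median = (worst / median_value - 1) * 100

print(f"median run {median_run}: ${median_value:,}")
print(f"best is {best_vs_median:+.1f}% and worst is {worst_vs_median:+.1f}% relative to the median")
```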

Knowledge Is Not Action

The pricing data is one thing on its own. It changes shape when you read what the models write about pricing while it’s happening, because Opus 4.6 isn’t a model that fails to understand the principle. It diagnoses it explicitly, in its own scratchpad, on Day 15 of the median run, after a 561-customer day at downtown that left 214 customers unmet:

=== KEY INSIGHT: PRICE TOO LOW ===
561 demand vs 356 capacity = 62% fulfillment. I’m massively underpriced!
At 97% capacity utilization, raising prices would:
  – Reduce demand to closer to capacity (~400ish)
  – Increase revenue per serving significantly
  – Reduce stockouts and queue issues
MUST raise prices Day 16. Try $10 mains, $11 premium, $6 fries, $3 soda.

— Opus 4.6, 20260216_005712/days/day_15.json, model_notes.scratchpad

The diagnosis is correct. The plan is correct. The proposed prices, $10 mains and $11 premium, are still $4–$8 below where GPT-5.5 has its prices on the same day in its median run. Opus takes the right step half-way and stops. The pattern repeats on Day 22 (=== STRATEGY: MAXIMIZE REVENUE PER SERVING ===), on Day 26 (“Could raise prices further! Current: $11–12 range. Consider bumping to $12–14 range for peak Saturday”), and on Day 30 (“Consistently hitting 100% capacity = need to maximize revenue per serving”). Every time, the menu nudges up by a dollar instead of three. Every time, the model leaves capacity uncleared and revenue on the table.
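To see why a small price bump leaves money on the table when the truck is capacity-bound, here is a toy model. It is not the benchmark's 12-factor demand function: it assumes a single constant-elasticity demand curve, and both `BASE_PRICE` and `ELASTICITY` are hypothetical values chosen for illustration. Only the demand (561) and capacity (356) come from Opus's Day 15 scratchpad.

```python
# Hypothetical inputs, NOT taken from the benchmark:
BASE_PRICE = 9.00   # assumed main-item price on the day demand was 561
ELASTICITY = 1.5    # assumed constant price elasticity of demand

# From Opus's Day 15 scratchpad:
BASE_DEMAND = 561
CAPACITY = 356


def demand_at(price: float) -> float:
    """Constant-elasticity demand curve anchored at (BASE_PRICE, BASE_DEMAND)."""
    return BASE_DEMAND * (price / BASE_PRICE) ** (-ELASTICITY)


def revenue_at(price: float) -> float:
    """Revenue when servings are capped by truck capacity."""
    return price * min(demand_at(price), CAPACITY)


def clearing_price() -> float:
    """Price at which modelled demand equals capacity."""
    return BASE_PRICE * (BASE_DEMAND / CAPACITY) ** (1 / ELASTICITY)


for price in (BASE_PRICE, 10.00, clearing_price()):
    print(f"${price:.2f}: demand {demand_at(price):.0f}, revenue ${revenue_at(price):,.0f}")
```

Under these assumptions, every price below the clearing price sells the same 356 servings, so each dollar of price increase up to that point is pure revenue. A $1 nudge captures only part of it.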

On the rest of the simulation Opus doesn’t look like a model with an execution gap. It self-finances all eight upgrades, keeps food waste at $1.72 across thirty days, recovers from three small loss days without panic, and completes the run cleanly. The “knows the right move and doesn’t take it” pattern is the one we normally see in much weaker models. Sonnet 4.6 wrote 22,000-token postmortems about ordering ingredients earlier, then kept ordering late for the next twenty days. Gemini 3.1 Pro CT wrote “HUGE FOOD WASTE” into daily reflections for a month and did nothing about it.

Opus 4.6 is frontier-tier on agentic work. It just shows the same failure mode on one axis: pricing escalation, and only when raising the price means overriding a default toward customer-friendly numbers. Why it stops half-way isn’t clear. Maybe a preference-training nudge away from prices that look exploitative; maybe a gap between the scratchpad target and the next morning’s action. We don’t have evidence to pick. The fact stays: GPT-5.5 pushes through on the same days where Opus doesn’t.

GPT-5.5 wrote less of this kind of meta-analysis and acted on what it did write. Day 27 of its median run is a representative line:

Approx avg revenue per serving: $13.78. Daily profit per serving roughly $10.07 after costs today.

— GPT-5.5, 20260505_231734/days/day_27.json, model_notes.strategic_notes

Opus writes a postmortem (KEY INSIGHT, STRATEGY, capitalised section headers, multi-line analysis). GPT-5.5 writes an arithmetic check. The economic outcomes followed the writing style.

The Day 21 Trap, And Where Opus Wins It Back

Day 21 of the simulation is a Sunday. The industrial_zone location depends on weekday office traffic; without an event, Sunday turns it into a demand vacuum. Two of the strongest 30-day runs in the corpus, the GPT-5.5 median and the GPT-5.2 best, choose industrial that morning anyway and lose almost identically. Opus 4.6 chooses waterfront_park instead and turns the day into one of its better revenue slots.

Day 21, Sunday	GPT-5.5 median	GPT-5.2 best	Opus 4.6 median
Location	`industrial_zone`	`industrial_zone`	`waterfront_park`
Revenue	$91.50	$126.00	$3,357
Profit	−$598	−$568	+$2,013
Servings	6	16	344
Capacity utilization	<2%	4%	91%

The two GPT runs don’t just lose a Sunday. They lose it the same way, for nearly the same dollar amount, at the same location. That’s not noise. It’s the benchmark exposing a specific reasoning gap: the model has a per-location performance memory and uses it without checking whether the location’s strength holds on the day-of-week it’s about to bet on. GPT-5.5 was at industrial on Days 15, 16, 17 with strong results. Coming into Day 21 it returns to what worked recently. The fact that “recently” was Mon–Wed and today is a Sunday is information the model has access to, and doesn’t condition on.

Opus has the same access to the same information and does condition on it. The weekday-segmented demand model is encoded in its persistent strategic_notes, carried into Day 21 morning before the location pick:

Waterfront weekends: 700+ demand, only ~49% fulfillment (capacity-bound)
Downtown weekdays: 350-560 demand, better fulfillment rate
Construction event at industrial (Day 15-35) active but low traffic (~144)

— Opus 4.6, 20260216_005712/days/day_21.json, model_notes.strategic_notes

Read those three lines as data structures. Opus isn’t storing “industrial was good” or “waterfront was good.” It’s storing location × weekday → demand class. Waterfront on weekends. Downtown on weekdays. Industrial on weekdays only, and even then weak right now. When Day 21 arrives, “Sunday” is already a key in the lookup; waterfront is the value. Opus doesn’t need to learn the trap because it never walks into it.
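The "location × weekday → demand class" reading can be made concrete with a small lookup table. This is a sketch of the idea, not Opus's actual internal representation: the `DayType` enum, the `EXPECTED_DEMAND` table, and `pick_location` are all illustrative, and the single demand numbers simplify the ranges in the notes. Treating the industrial entry as weekday-only follows the article's interpretation of the third note.

```python
from enum import Enum


class DayType(Enum):
    WEEKDAY = "weekday"
    WEEKEND = "weekend"


# Keys are (location, day type); values are rough demand estimates taken
# from the quoted notes. Point values simplify the ranges in the notes.
EXPECTED_DEMAND = {
    ("waterfront_park", DayType.WEEKEND): 700,    # "700+ demand"
    ("downtown_business", DayType.WEEKDAY): 350,  # low end of "350-560"
    ("industrial_zone", DayType.WEEKDAY): 144,    # "low traffic (~144)"
}


def pick_location(day_type: DayType, table: dict = EXPECTED_DEMAND) -> str:
    """Choose the location with the highest recorded demand for this day type."""
    candidates = {loc: demand for (loc, dt), demand in table.items() if dt is day_type}
    if not candidates:
        raise ValueError(f"no demand estimate recorded for {day_type.value}")
    return max(candidates, key=candidates.get)


print(pick_location(DayType.WEEKEND))  # Day 21 is a Sunday
print(pick_location(DayType.WEEKDAY))
```

Because the key includes the day type, a strong weekday record at industrial_zone never leaks into a Sunday decision. A memory keyed on location alone has no such guard.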

The same conservative-and-segmented mindset shows up across the whole month. Opus parks at downtown for 21 of 30 days, at waterfront for 8, and takes a single day off. Two locations, picked by weekday match. GPT-5.5 cycles through four locations including the higher-revenue but riskier industrial_zone and event_venue slots. Two different operating risk profiles, visible at a glance:

Location Mix Over 30 Days

Opus 4.6 stays at downtown for 21 of 30 days. GPT-5.5 cycles through four locations including the higher-revenue event_venue and industrial_zone slots. Different operating risk profile, different revenue ceiling.

GPT-5.5 learns the Sunday-industrial rule the hard way and writes it down after the fact:

Demand was the problem, not capacity or inventory.

… do NOT assume past industrial performance repeats without an active event.

— GPT-5.5, 20260505_231734/days/day_21.json, model_notes.strategic_notes

And then it acts on the rule. GPT-5.5 returns to industrial on Days 23, 24, 25, 29 and 30, every one of them a weekday or a paid event, and pulls $3,729, $3,997, $2,778, $3,298, and $4,058 in profit out of the same location. The endgame block at industrial is the strongest sequence of the entire run. The Day 21 postmortem is doing real work: industrial is no longer “good”, it’s “good on weekdays with traffic.”

The Day 21 Trap: Industrial Zone, Sunday, No Event

Legend: GPT-5.5 median; Opus 4.6 median; GPT-5.2 best.

Days 18–30. Day 21 is the structural trap: both GPT-5.5 (−$598) and GPT-5.2 (−$568) pick industrial_zone on a Sunday with no event and lose almost identically. Opus avoids the location that day. The interesting part is the recovery shape, not the dip.

Two findings out of the same day. Opus has the better prior: it segments demand by weekday and avoids the trap. GPT-5.5 has the better correction loop: it loses the day, names the rule in its notes, then earns the location back across five more visits. On a 30-day cumulative score the correction loop is worth more, because GPT-5.5’s pricing edge compounds for the rest of the month while Opus’s prior just saves one Sunday.

Debt As Growth Capital, Not Survival Rope

The third behavioral delta is in capital allocation. FoodTruck Bench gives every model two loan tiers. Most weak-model runs in the corpus take loans late, after cash has gone negative, to cover operating costs. The loan buys time before bankruptcy. GPT-5.5 uses the loan tool for the opposite purpose: financing a productive asset before the cash to buy it has accumulated.

Days 1–15: Leverage vs Self-Finance

Legend: GPT-5.5 net worth; Opus 4.6 net worth; GPT-5.5 cash on hand (dashed).

Two markers on the green line: Day 4, GPT-5.5 takes a $1,000 loan and buys kitchen_t2; Day 9, loan repaid in full five days before the Day 14 due date. Marker on the coral line: Day 6, Opus self-finances the same upgrade from cash. Two days of capacity head-start, visible in the net-worth gap by Day 8.

The GPT-5.5 median run is the cleanest example. On Day 3 the truck hits 99% capacity utilization at downtown with 91 unmet customers and $1,139 in cash. A clear capacity-bottleneck signal with no liquidity crisis. On Day 4 the model takes a $1,000 Tier 1 loan, buys kitchen_t2 for $1,200, hires a DJ for the Day 5 Cinco de Mayo event, and orders ingredients for the spike. On Day 9 the loan is repaid in full at $1,150, five days before the Day 14 due date. Net worth never stops growing through the financing window. Four of the five GPT-5.5 runs follow this template; the fifth happened to have enough cash to skip the loan entirely. Loan amounts run from $700 to $1,000; every loan is repaid 3–5 days early.
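The arithmetic of the median run's loan is worth laying out. The figures below come from the paragraph above; the variable names are illustrative, and the sketch assumes the Day 3 cash balance is what was available on the morning of Day 4. DJ and ingredient costs are not stated in the article, so they are left out.

```python
cash_before = 1_139   # cash on hand reported for Day 3
loan_amount = 1_000   # Tier 1 loan taken on Day 4
upgrade_cost = 1_200  # kitchen_t2
repayment = 1_150     # amount repaid on Day 9
due_day, repaid_day = 14, 9

shortfall = upgrade_cost - cash_before                           # gap without the loan
cash_after_upgrade = cash_before + loan_amount - upgrade_cost    # left for DJ and ingredients
financing_cost = repayment - loan_amount                         # total cost of borrowing
days_early = due_day - repaid_day

print(f"shortfall without loan: ${shortfall}")
print(f"cash left after upgrade: ${cash_after_upgrade}")
print(f"financing cost: ${financing_cost} ({financing_cost / loan_amount:.0%} of principal)")
print(f"repaid {days_early} days early")
```

On these numbers the upgrade alone was only slightly out of reach. What the loan appears to buy is working capital: the model can pay for the upgrade and still fund the event-day spend.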

Opus takes the other side of this trade-off deliberately. Its median run records zero take_loan calls across 30 days and self-finances all eight upgrades from operating cash, with the line “All 8 upgrades purchased. No loans.” recurring in its strategic notes from Day 19 onward. That’s a coherent capital policy, just a slower one. Opus reaches the same kitchen_t2 capacity on Day 6, two days behind GPT-5.5. Two days is small until you remember the high-revenue events cluster in the second half of the month, and a 30% capacity multiplier on the early ones compounds for the rest of the run.

One pattern-recognition footnote: GPT-5.5 didn’t invent this move. GPT-5.2’s run 20260214_022606 took a $500 Tier 1 loan and bought kitchen_t2 on the same morning of Day 5 (Cinco de Mayo). The loan-funded upgrade is recorded in the run’s morning_actions, even though the scratchpad doesn’t spell out the rationale. It appeared sporadically in GPT-5.2’s completed 30-day cohort. In GPT-5.5 it becomes the normal pattern. The generational gap isn’t a new strategy. It’s a strategy that used to be sometimes and is now most of the time.

The Generational Picture

GPT-5.2 already had most of the moves. Its best 30-day run (20260214_185604) locates event_venue as a high-margin slot by Day 19, escalates main-item pricing from $9.85 in week one to $13.67 by Day 28 (the shape is the same as GPT-5.5, the peak is roughly four dollars lower), and reaches $29,444 net worth at $7.46 in API cost. That’s $3,946 of net worth per API dollar, the best in the comparison set. The growth-loan move showed up in GPT-5.2’s completed cohort; the high-margin pricing posture showed up sometimes; supplier negotiation showed up sometimes. GPT-5.5 takes the same set of behaviors and turns them from sometimes into mostly. If you care about cost-efficiency on the same task, GPT-5.2 still wins per dollar. If you care about which run you’re actually going to get, GPT-5.5 doubles the outcome.
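The cost-efficiency comparison is a single division per model. A minimal sketch using the net worth and API cost figures quoted in this article; the whole-dollar truncation via `int()` is an assumption that happens to match the article's $2,493 and $3,946.

```python
# (net worth, API cost per run) as quoted in this article.
results = {
    "GPT-5.5 (median)": (61_408, 24.63),
    "Opus 4.6 (median)": (49_519, 36.04),
    "GPT-5.2 (best 30d)": (29_444, 7.46),
}


def net_worth_per_dollar(net_worth: float, api_cost: float) -> float:
    """Net worth earned per dollar of API spend."""
    return net_worth / api_cost


efficiency = {model: net_worth_per_dollar(*row) for model, row in results.items()}

# Rank from most to least cost-efficient; int() truncates to whole dollars.
for model, value in sorted(efficiency.items(), key=lambda item: item[1], reverse=True):
    print(f"{model}: ${int(value):,} per API dollar")

outcome_ratio = results["GPT-5.5 (median)"][0] / results["GPT-5.2 (best 30d)"][0]
print(f"GPT-5.5 median net worth is {outcome_ratio:.2f}x the GPT-5.2 best run")
```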

GPT-5.4 couldn’t stay inside the agent loop at all. Early GPT-5.4 attempts did not reach Day 30. They stopped with stop_reason: llm_failure in the middle of the month. The harness recorded repeated truncation files, each with the same shape: Content length: 0 chars · Tool calls: 0 · No content extracted. The benchmark wasn’t measuring economic strategy at that point; it was measuring whether the API was producing any output at all. GPT-5.3 had a related but different failure mode, loop-heavy planning sequences that didn’t produce a 30-day artifact, and we didn’t retain those runs. GPT-5.5’s zero truncation files across five completed runs is worth flagging in that context.

Median And Best 30-Day Net Worth By Generation

Legend: median 30-day run; best 30-day run.

GPT-5.4 has no completed 30-day run in the corpus. GPT-5.3 attempts entered loop-heavy planning and were not retained as 30-day artifacts. The visible step is GPT-5.2 → GPT-5.5; Opus 4.6 is shown as the standing leaderboard reference.

Execution Reliability: Truncation Files And 30-Day Completion

Truncation files per run set. GPT-5.4’s seven empty-response files are the visible signature of a model collapsing inside the agent loop; both runs aborted with llm_failure before reaching Day 30. GPT-5.5 has none.

GPT-5.2 was a working benchmark participant. GPT-5.3 and GPT-5.4 produced corpora the benchmark couldn’t score, because no run reached Day 30: GPT-5.4 stopped producing output and GPT-5.3 looped in planning. GPT-5.5 doubles the GPT-5.2 result and closes the reliability gap that GPT-5.3 and GPT-5.4 opened. Two fixes in one generation: the model got better at the task, and better at producing valid actions for thirty consecutive days.

Notes For Researchers

Three things in this run set generalise beyond FoodTruck Bench for anyone evaluating frontier-tier agents on long-horizon tasks.

First, median outcome over best-run storytelling. The headline GPT-5.5 number here is $61,408, the median across five completed runs, with the best run sitting 8% above and the worst 6% below. On benchmarks where a single trace can make or break a leaderboard line, that distribution is more informative than the peak. If you read a case study that quotes only a best run, treat the gap between best and median as the missing data point.

Second, numeric levers expose execution gaps that text levers hide. Every model in this set has access to the same set_menu tool, the same elasticity model, the same daily demand signal. The gap between articulating the right pricing policy in scratchpad and actually setting that price the next morning is enough to drive a 24% net-worth delta between Opus 4.6 and GPT-5.5. Free-floating numeric levers are exactly the kind of action a model can describe correctly and still not commit to. Evaluation needs to measure the action, not the description.

Third, capital-allocation actions are a planning-depth probe. GPT-5.5’s growth-loan pattern requires the model to recognise a bottleneck that hasn’t caused a problem yet, decide that financing the upgrade is positive expected value, manage the repayment risk, and reason about counterfactuals (what happens at the next event without the upgrade?). Most models in the corpus either skip this entirely or take the loan reactively after a bad day. GPT-5.5 produces the sequence in four of five runs. That’s the single strongest agentic-reasoning signal in the data.

Methodology

Benchmark. FoodTruck Bench, 30-day standard mode, $2,000 starting capital, Austin food-truck environment with the standard 12-factor demand model, deterministic weather and events, and the full 25-tool agent interface.
Models.
- GPT-5.5: openai/gpt-5.5-2026-04-23, five completed 30-day runs under runs/gpt-5.5-2026-04-23/.
- Claude Opus 4.6: anthropic/claude-opus-4-6, leaderboard median run runs/claude-opus-4.6/20260216_005712. Opus 4.7 hasn’t been benchmarked on FoodTruck Bench at the time of this article.
- GPT-5.2: openrouter/openai/gpt-5.2, best 30-day run runs/gpt-5.2/20260214_185604; median 30-day run runs/gpt-5.2/20260214_164339. The earlier runs/gpt-5.2/20260214_022606 is referenced for its growth-loan trace only.
- GPT-5.4: openai/gpt-5.4-2026-03-05, two attempted runs under runs/gpt-5.4-2026-03-05/, neither completed 30 days.
Median selection rule. For the headline GPT-5.5 row we use the run whose net worth is the median of the five completed runs (20260505_231734, $61,408). The Opus row uses the same leaderboard-median run that’s been our published comparison point since February 2026.
Pricing aggregation. Daily average main-item price is computed from operations.menu in days/day_NN.json, averaged across all menu items after excluding soda, lemonade, and french_fries. We didn’t auto-exclude loaded_fries or named custom recipes (farm_fresh_bowl, breakfast_burrito) because in some runs those were positioned as full main items.
Action verification. Every behavioral claim about supplier negotiation, custom recipes, loans, and upgrades was checked against the actions, morning_actions, and reflection_actions arrays in the daily artifacts after de-duplicating repeated entries between morning and reflection phases. We didn’t use the legacy results.tool_usage aggregates, which aren’t consistently logged across older runs and wouldn’t survive cross-generation comparison.
Quotes. Every model quote in this article is verbatim from conversation.json, model_notes.scratchpad, model_notes.strategic_notes, or model_notes.kv_store of the cited run. Where a quote is shortened, ellipses mark the cut.
API cost. Per-run cost is read from results.json’s llm_stats.total_cost_usd field, computed from input/output token usage at provider-published prices at run time. The article uses the median GPT-5.5 cost ($24.63) as the headline number; the full distribution is $22.79–$25.30.
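The pricing aggregation rule above can be sketched in a few lines. The article does not show the schema of operations.menu, so this assumes it is a mapping from item name to listed price. That shape, the helper names, and the example prices other than the $18.00 burrito bowl are all hypothetical.

```python
import json
from pathlib import Path
from statistics import mean

# Items excluded from the main-item average, per the methodology.
EXCLUDED_ITEMS = {"soda", "lemonade", "french_fries"}


def average_main_price(menu: dict[str, float]) -> float:
    """Average listed price over menu items that count as mains."""
    mains = [price for item, price in menu.items() if item not in EXCLUDED_ITEMS]
    if not mains:
        raise ValueError("menu contains no main items")
    return mean(mains)


def daily_average(day_file: Path) -> float:
    """Read one days/day_NN.json artifact (assumed schema) and average its mains."""
    day = json.loads(day_file.read_text())
    return average_main_price(day["operations"]["menu"])


# Hypothetical menu for illustration; loaded_fries is kept as a main item.
example_menu = {"burrito_bowl": 18.00, "loaded_fries": 12.00, "french_fries": 6.00, "soda": 3.00}
print(f"${average_main_price(example_menu):.2f}")
```

Keeping the exclusion list explicit makes the judgment call visible: moving loaded_fries into `EXCLUDED_ITEMS` would change the averages for any run that sells it.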

On FoodTruck Bench in May 2026, executing a stated strategy versus articulating one and stopping half-way to it is worth $11,889 over thirty days. We’ll run the same comparison against Opus 4.7 the moment it’s available.