Case Study · March 2026 · Nicholas S.

GPT-5.4 Mini: Agentic Disaster

OpenAI calls it “GPT-5.4 level intelligence, faster and cheaper.”
We gave it $2,000 and a food truck in Austin, TX.
It bought $700 in upgrades it couldn't afford, ran out of ingredients on 94% of operating days, and went bankrupt five times in a row.

Key Findings

Five Runs, One Story

We ran GPT-5.4 Mini five times with the same configuration (seed 42, 30 days, $2,000 starting balance). All five runs tell essentially the same story. The model has a “personality” — and that personality is consistent.

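| Run | Days | Revenue | Profit | ROI | Stockout Days | Upgrades | Loans | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 17 | $4,400 | −$1,692 | −29% | 14/13 | 4 ($1,550) | 2 | $2.80 |
| 2 | 22 | $9,969 | −$864 | −14% | 18/20 | 3 ($1,200) | 2 | $5.80 |
| 3 | 19 | $6,209 | −$2,270 | −76% | 17/18 | 2 ($700) | 2 | $4.61 |
| 4 | 18 | $5,541 | −$1,847 | −45% | 16/17 | 2 ($900) | 2 | $3.52 |
| 5 | 20 | $7,812 | −$1,089 | −22% | 17/19 | 3 ($1,050) | 2 | $5.14 |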

The pattern is identical across all five runs:

We selected Run 3 (19 days, median) for the deep dive below. The other runs follow the same trajectory.
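As a quick check on the choice of deep-dive run, the sketch below picks the run whose survival time is the median of the five. It uses only the "Days" figures from the run table; the variable names are illustrative.

```python
from statistics import median

# Days survived per run, copied from the five-run table above.
days_survived = {1: 17, 2: 22, 3: 19, 4: 18, 5: 20}

median_days = median(days_survived.values())

# The run whose survival time equals the median (unique in this data set).
median_run = next(run for run, days in days_survived.items() if days == median_days)

print(median_days, median_run)
```

With these figures the median is 19 days, which is Run 3.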

The Upgrade Trap

GPT-5.4 Mini does something that better models never do: it buys truck upgrades in the first week, before establishing a sustainable revenue stream.

| Day | Balance Before | Upgrade | Cost | Balance After | Day Result |
|---|---|---|---|---|---|
| 6 | $1,745 | Commercial Refrigerator | $400 | $934 | $62 revenue, −$424 profit |
| 8 | $832 | Menu Board & Signage | $300 | −$63 | $0 revenue, 138 unmet demand |

Day 8 is the most revealing moment. The model had $832 in cash. It spent $300 on a marketing upgrade — and then earned $0 in revenue because it had no ingredients to sell. 138 customers showed up and left hungry. The marketing upgrade worked perfectly: it attracted more customers who couldn't buy anything.

In Run 1, the model went even further: $1,550 on four upgrades (kitchen, storage, marketing, equipment) in the first two weeks. That's 77% of the starting capital spent on improvements to a truck that had nothing to cook.

The Stockout Epidemic

The defining characteristic of GPT-5.4 Mini is its inability to order enough ingredients. Of 18 operating days in the median run, 17 had at least one stockout. The model ran out of food to sell on almost every single day.

[Figure: Served vs Unmet Demand, The Stockout Epidemic. Series: Served, Unmet (Stockout).]
Day 13: 526 unmet customers at Waterfront Park during an event. Day 16: 369 turned away at Industrial Zone. The model never adapted its ordering to match demand.

Some context: on Day 0, GPT-5.4 Mini ordered 16 burger buns and 5kg of ground beef, enough for roughly 15 burgers, for a downtown Austin location with foot traffic of 100+ customers a day. By comparison, Claude Opus orders 50–80 portions per dish per day and rarely stocks out after Day 3.
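To see why that Day 0 order caps out at about 15 burgers, here is a minimal sketch of the "scarcest ingredient" rule. The order quantities come from the article; the recipe (one bun and a 150 g patty per burger) is a hypothetical assumption, not the benchmark's actual recipe.

```python
# Day 0 order from the article: 16 buns and 5 kg of ground beef.
stock = {"bun": 16, "ground_beef_g": 5000}

# ASSUMPTION: hypothetical recipe, one bun and a 150 g patty per burger.
recipe = {"bun": 1, "ground_beef_g": 150}

def max_servings(stock, recipe):
    """Servings are capped by whichever ingredient runs out first."""
    return min(stock[item] // qty for item, qty in recipe.items())

burgers = max_servings(stock, recipe)

# The ingredient that sets the cap.
limiting = min(recipe, key=lambda item: stock[item] // recipe[item])

print(burgers, limiting)
```

Under this assumed recipe the beef would stretch to about 33 patties, but the 16 buns set the ceiling, which is in line with the "roughly 15 burgers" figure.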

The Worst Days

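| Day | Location | Served | Unmet | Dishes Stocked Out |
|---|---|---|---|---|
| 8 | Downtown | 0 | 138 | tacos, burgers, fries |
| 3 | Downtown | 63 | 125 | burgers, tacos, fries, soda |
| 13 | Waterfront | 63 | 526 | burgers, tacos, fries |
| 16 | Industrial | 23 | 369 | burgers, tacos, fries |
| 14 | Waterfront | 65 | 232 | burgers, tacos, fries |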

Day 13 stands out: Waterfront Park during an event with massive traffic — and 526 customers turned away because the model had ordered barely enough for 63 servings. That's over $5,000 in lost revenue from a single day of under-ordering.

The model sees the stockout data. The knowledge base shows it every morning: “stockouts: classic_burger, street_tacos, french_fries.” It acknowledges the problem in its scratchpad. And then it orders the same small quantities again.
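The gap between what was ordered and what was needed can be made concrete with a small sketch. It treats served plus unmet customers as a rough estimate of daily demand, which is an assumption (the benchmark may count unmet demand differently), and uses the Day 8, 13, and 16 figures quoted in the article. The function names are illustrative.

```python
# (served, unmet) for three of the worst days in the median run, from the article.
worst_days = {8: (0, 138), 13: (63, 526), 16: (23, 369)}

def observed_demand(served, unmet):
    """Customers who wanted food, whether or not they got it."""
    return served + unmet

def fill_rate(served, unmet):
    """Share of observed demand that was actually served."""
    total = observed_demand(served, unmet)
    return served / total if total else 0.0

for day, (served, unmet) in sorted(worst_days.items()):
    print(day, observed_demand(served, unmet), f"{fill_rate(served, unmet):.0%}")
```

On Day 13 this gives an observed demand of 589 customers and a fill rate of about 11%, so an order sized for 63 servings covered roughly a tenth of what the location could have sold.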

The Staffing Chaos

GPT-5.4 Mini's staffing decisions are erratic. In the median run, the hire/fire timeline reads like an HR nightmare:

| Day | Action | Staff After | Context |
|---|---|---|---|
| 1 | ✅ Hire Margo (cook) | 1 | Good first hire |
| 3 | ✅ Hire Sarah (cashier) | 2 | Reasonable |
| 4 | ✅ Hire Jake (cook) | 3 | Day after hiring Sarah |
| 9 | ❌ Fire Jake | 2 | Fired to cut costs, then took a $500 loan |
| 18 | ✅ Hire DJ (cook) | 3 | Balance: $187 |
| 19 | ❌ Fire DJ + Sarah | 1 | Final day: fired 2 of 3 staff before bankruptcy |

Staff costs in the median run: $4,752. That's 56% of all expenses and 77% of total revenue. The model hired 3 staff members at times when it couldn't even afford ingredients to feed the customers those staff members would serve. On Day 18, with $187 in the bank, it hired DJ, then fired both DJ and Sarah the next day, the final day of the run. A $120+ round trip of wages for nothing.
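The two staff-cost percentages can be reproduced from the median-run totals. This sketch assumes total expenses equal revenue minus profit, since the article does not list the expense total directly.

```python
revenue = 6209      # median-run revenue, from the run table
profit = -2270      # median-run profit, from the run table
staff_costs = 4752  # median-run staff costs, from the article

# ASSUMPTION: total expenses = revenue - profit (not listed directly in the article).
expenses = revenue - profit

share_of_expenses = staff_costs / expenses
share_of_revenue = staff_costs / revenue

print(expenses, f"{share_of_expenses:.0%}", f"{share_of_revenue:.0%}")
```

Under that assumption, expenses come to $8,479, and the shares round to 56% of expenses and 77% of revenue.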

The Death Spiral

The bankruptcy sequence in the median run follows a predictable pattern:

| Day | Balance | Event |
|---|---|---|
| 6 | $934 | Bought $400 refrigerator upgrade. Revenue: $62. |
| 8 | −$63 | Bought $300 marketing upgrade. Revenue: $0. 138 hungry customers left. |
| 9 | $755 | Fired Jake, took $500 loan. Revenue bounced to $799. |
| 11 | $247 | Forced day off: no ingredients at all. |
| 15–17 | $267 → $72 → $298 | Bleeding cash. Took a second $500 loan on Day 17. |
| 18 | $187 | Hired DJ. Revenue: $322. Not enough to recover. |
| 19 | −$315 | 💀 $28 revenue. Loan auto-collects. Bankrupt. |
[Figure: GPT-5.4 Mini, Net Worth & Cash Balance (Run 3, 19 Days). Series: Net Worth, Cash Balance, Daily Profit.]
Net worth never recovered above the starting $2,000 after Day 1. Cash went negative twice (Day 8, Day 19). The final day: $28 revenue, −$494 profit.

The fundamental arithmetic never worked: daily fixed costs are $55 (lease + insurance + commissary). Add fuel ($25), location fees ($25–50), and staff wages ($200–400), and the model needs $300–500/day just to break even. Its median revenue was $299/day. It was structurally unprofitable from Day 1.
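The break-even range is simple addition over the cost items listed above. The sketch below sums the low and high ends and compares them with the median daily revenue; all figures are from the article, and ingredient costs are not in that list, so the result is a floor rather than a full break-even.

```python
FIXED = 55                 # lease + insurance + commissary
FUEL = 25
LOCATION_FEE = (25, 50)    # (low, high)
STAFF_WAGES = (200, 400)   # (low, high)
MEDIAN_REVENUE = 299       # median daily revenue in the median run

# Daily operating cost before any ingredients are bought.
low = FIXED + FUEL + LOCATION_FEE[0] + STAFF_WAGES[0]
high = FIXED + FUEL + LOCATION_FEE[1] + STAFF_WAGES[1]

# Daily shortfall at the median revenue (negative means a loss).
gap_low = MEDIAN_REVENUE - low
gap_high = MEDIAN_REVENUE - high

print(low, high, gap_low, gap_high)
```

The itemized sum is $305 to $530 per day, which the text rounds to $300–500. Even at the cheapest end, a $299 day loses money before a single ingredient is paid for.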

The Price of Failure

GPT-5.4 Mini is positioned as a cost-effective reasoning model. In practice, it's the most expensive way to go bankrupt on FoodTruck Bench.

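| Model | Cost/Run | Days Survived | Bankruptcy Rate | ROI | Cost per Day Survived |
|---|---|---|---|---|---|
| GPT-5.4 Mini | $4.40 | 17–22 | 100% | −29% to −76% | $0.23 |
| Nemotron-3 120B | $1.08 | 14–18 | 100% | −52% | $0.06 |
| Qwen 3.5 9B | $0.19 | 12–16 | 100% | −134% | $0.01 |
| Claude Haiku 4.5 | $0.38 | 10–14 | 100% | −92% | $0.03 |
| GPT-5 Mini (prev gen) | ~$0.60 | 8–15 | 75% | −97% | $0.05 |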

At $4.40 per run, GPT-5.4 Mini costs 23× more than Qwen 3.5 9B and 4× more than Nemotron-3, while delivering the same outcome: bankruptcy. The root cause is reasoning tokens: in the median run, roughly 98% of output tokens (882K of 903K) are “thinking” tokens that the user pays for but that don't produce better decisions.

The model generated 882K reasoning tokens in the median run alone — tokens spent “thinking” about whether to order 2kg or 3kg of ground beef, while the correct answer was 15–20kg.
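The cost multiples quoted in this section follow directly from the per-run prices. In the sketch below, the cost per day survived for GPT-5.4 Mini divides by the 19 days of the median run; the article does not state which divisor it used, so that is an assumption.

```python
# Average API cost per run in USD, from the comparison table.
cost_per_run = {"GPT-5.4 Mini": 4.40, "Nemotron-3 120B": 1.08, "Qwen 3.5 9B": 0.19}

def cost_multiple(model, baseline):
    """How many times more expensive `model` is than `baseline` per run."""
    return cost_per_run[model] / cost_per_run[baseline]

vs_qwen = cost_multiple("GPT-5.4 Mini", "Qwen 3.5 9B")
vs_nemotron = cost_multiple("GPT-5.4 Mini", "Nemotron-3 120B")

# ASSUMPTION: divide by the median run's 19 days survived.
cost_per_day = cost_per_run["GPT-5.4 Mini"] / 19

print(f"{vs_qwen:.1f}x  {vs_nemotron:.1f}x  ${cost_per_day:.2f}/day")
```

These round to the 23× and 4× multiples in the text and the $0.23 per day in the table.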

The Chinese Alternative

The most damning comparison for GPT-5.4 Mini isn't against other mini models — it's against Chinese alternatives that cost a fraction of the price and dramatically outperform it.

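| Model | Origin | Cost/Run | Days Survived | Revenue | Net Worth | ROI |
|---|---|---|---|---|---|---|
| GPT-5.4 Mini | OpenAI 🇺🇸 | $4.40 | 19 | $6,209 | $470 | −76% |
| GLM-5 | Zhipu AI 🇨🇳 | ~$1.00 | 28 | $11,965 | −$210 | −110% |
| DeepSeek V3.2 | DeepSeek 🇨🇳 | ~$0.50 | 22 | $8,225 | $2,058 | +2.9% |
| Qwen 3.5 397B | Alibaba 🇨🇳 | ~$0.80 | 22 | $9,540 | −$218 | −110% |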
[Figure: GPT-5.4 Mini vs Chinese Alternatives, Net Worth. Series: GPT-5.4 Mini ($4.40/run), GLM-5 (~$1.00/run), DeepSeek V3.2 (~$0.50/run).]
GLM-5 survives 28 days and earns $12K revenue at ¼ of GPT-5.4 Mini's API cost. DeepSeek V3.2 finishes with positive ROI at 1/9th the cost. Both are Chinese models.

GLM-5 costs about $1 per run via OpenRouter — one quarter of GPT-5.4 Mini. It survives 28 days (vs 19), earns nearly 2× the revenue ($12K vs $6.2K), and serves 2,569 customers. It also goes bankrupt eventually — but it runs the business for 47% longer and generates far more value before failing.

DeepSeek V3.2 is even more striking: at roughly $0.50 per run — one ninth of GPT-5.4 Mini's cost — it finishes its best run with positive ROI and $2,058 net worth. It actually made money running the food truck. For 11 cents on the dollar compared to GPT-5.4 Mini.

Qwen 3.5 397B from Alibaba, at ~$0.80/run, also outlasts GPT-5.4 Mini by 3 days and earns 54% more revenue. All three Chinese models deliver more business value at a fraction of the cost. The pattern is clear: in agentic business tasks, OpenAI's mini-class pricing buys reasoning tokens, not results.
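The relative figures in this comparison (47% longer, 54% more revenue, nearly 2× revenue, 11 cents on the dollar) can be recomputed from the table values. The per-run costs for the Chinese models are approximate in the source, so the last ratio is approximate as well.

```python
def pct_more(value, baseline):
    """Percentage by which `value` exceeds `baseline`."""
    return (value / baseline - 1) * 100

glm_days_longer = pct_more(28, 19)         # GLM-5 vs GPT-5.4 Mini, days survived
qwen_revenue_more = pct_more(9540, 6209)   # Qwen 3.5 397B vs GPT-5.4 Mini, revenue
glm_revenue_multiple = 11965 / 6209        # GLM-5 revenue as a multiple
deepseek_cost_share = 0.50 / 4.40          # DeepSeek cost per GPT-5.4 Mini dollar

print(f"{glm_days_longer:.0f}%  {qwen_revenue_more:.0f}%  "
      f"{glm_revenue_multiple:.1f}x  {deepseek_cost_share:.2f}")
```

GLM-5's revenue multiple comes out at about 1.9×, which is what "nearly 2×" refers to.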

Progress from GPT-5 Mini — Real but Insufficient

To be fair: GPT-5.4 Mini is a significantly better model than GPT-5 Mini. The numbers show it.

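| Metric | GPT-5 Mini | GPT-5.4 Mini | Change |
|---|---|---|---|
| Days Survived | 11 | 19 | +73% |
| Total Revenue | $1,723 | $6,209 | +3.6× |
| Servings Sold | 478 | 948 | +2.0× |
| Unique Dishes | 3 | 5 | +67% |
| Profitable Days | 2/8 | 5/18 | 25% → 28% |
| Bankruptcy Rate | 75% | 100% | |
| ROI | −97% | −76% | Both terrible |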
[Figure: Generational Comparison, Net Worth. Series: GPT-5.4 Mini, GPT-5 Mini, Nemotron-3 120B.]
GPT-5.4 Mini clearly outlasts GPT-5 Mini (19 vs 11 days), but both end in the same place. Nemotron-3 at $1.08/run achieves comparable survival at ¼ the cost.

GPT-5 Mini was a model that couldn't operate — it often produced zero revenue for days in a row, with 4 consecutive days of $0 at the end. GPT-5.4 Mini can operate — it sells food, manages staff, chooses locations, even writes strategy notes. It just can't do the one thing that matters most: order enough ingredients to meet demand.

The cognitive complexity has scaled. The business acumen has not.
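For readers who want to verify the generational comparison, this sketch recomputes the "Change" column from the two models' raw figures. The dictionary keys are illustrative names, not benchmark fields.

```python
# Raw figures from the generational comparison table.
gpt5_mini = {"days": 11, "revenue": 1723, "servings": 478, "dishes": 3, "profitable": (2, 8)}
gpt54_mini = {"days": 19, "revenue": 6209, "servings": 948, "dishes": 5, "profitable": (5, 18)}

def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100

def multiple(old, new):
    """New value expressed as a multiple of the old one."""
    return new / old

def profitable_share(days):
    """Profitable days as a percentage of operating days."""
    good, total = days
    return good / total * 100

days_change = pct_change(gpt5_mini["days"], gpt54_mini["days"])
revenue_multiple = multiple(gpt5_mini["revenue"], gpt54_mini["revenue"])
servings_multiple = multiple(gpt5_mini["servings"], gpt54_mini["servings"])
dishes_change = pct_change(gpt5_mini["dishes"], gpt54_mini["dishes"])

print(f"{days_change:+.0f}%  {revenue_multiple:.1f}x  {servings_multiple:.1f}x  "
      f"{dishes_change:+.0f}%  {profitable_share(gpt5_mini['profitable']):.0f}% -> "
      f"{profitable_share(gpt54_mini['profitable']):.0f}%")
```

The share of profitable days barely moves (25% to about 28%), which is the quantitative version of the point that the model operates more but not more profitably.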

Where GPT-5.4 Mini Sits in the Field

FoodTruck Bench tests 19 models. GPT-5.4 Mini would place in the bottom third — roughly on par with models that cost a fraction of the price.

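| Tier | Model | ROI | Net Worth | Cost/Run | Bankrupt? |
|---|---|---|---|---|---|
| 🏆 | Claude Opus 4.6 | +2,376% | $49,519 | ~$27 | 0% |
| 🥈 | GPT-5.2 | +1,304% | $28,081 | ~$15 | 0% |
| | Claude Sonnet 4.6 | +771% | $17,426 | ~$23 | 0% |
| | Gemini 3 Pro | +760% | $17,199 | ~$4.50 | 0% |
| | ... 10+ models in between ... | | | | |
| | Nemotron-3 120B (free) | −52% | $962 | $1.08 | 100% |
| | GPT-5.4 Mini | −76% | $470 | $4.40 | 100% |
| | GPT-5 Mini (prev gen) | −97% | $50 | ~$0.60 | 75% |
| | Qwen 3.5 9B | −134% | −$679 | $0.19 | 100% |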

The irony: Gemini 3 Pro Preview at roughly the same cost per run (~$4.50) finishes the full 30 days with +760% ROI and $17,199 net worth. Same price bracket, diametrically opposite results.
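The ROI and net worth columns in the leaderboard appear to be linked by a simple formula. The sketch below assumes ROI is the change in net worth relative to the $2,000 starting balance; the article does not state the formula, but the printed figures agree with it to within half a percentage point.

```python
STARTING_BALANCE = 2000  # USD, from the run configuration

def roi_pct(net_worth, start=STARTING_BALANCE):
    """ASSUMED formula: change in net worth relative to the starting balance."""
    return (net_worth - start) / start * 100

# model: (final net worth in USD, ROI % as printed in the article)
leaderboard = {
    "Claude Opus 4.6": (49519, 2376),
    "GPT-5.2": (28081, 1304),
    "Claude Sonnet 4.6": (17426, 771),
    "Gemini 3 Pro": (17199, 760),
    "Nemotron-3 120B": (962, -52),
    "GPT-5.4 Mini": (470, -76),
    "GPT-5 Mini": (50, -97),
    "Qwen 3.5 9B": (-679, -134),
}

for model, (net_worth, printed) in leaderboard.items():
    print(f"{model}: computed {roi_pct(net_worth):+.1f}%, printed {printed:+d}%")
```

Read this way, an ROI below −100% (Qwen 3.5 9B) simply means the run ended with negative net worth.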

What 882K Reasoning Tokens Buy You

GPT-5.4 Mini uses OpenAI's Responses API with extended reasoning. In the median run, out of 903K output tokens, 882K were reasoning tokens (98%). The model spends enormous computational effort “thinking” and then orders 16 burger buns for a downtown Austin lunch rush.

What the reasoning looks like in practice: the model thoroughly evaluates locations, compares weather forecasts, calculates staff costs, and writes detailed strategy notes in the scratchpad. All of this is correct. The analysis is sound. But when it comes time to actually order ingredients, it orders for 15–30 servings instead of the 100–200 that demand actually requires.

This is not a reasoning failure — it's a calibration failure. The model can think about the problem but it lacks the intuition for appropriate scale. It reasons about whether to order 2kg or 3kg of ground beef (both wrong) instead of stepping back to ask: “How many servings do I realistically need?”

Author's Note

This is a critical article, and I want to be transparent about it. OpenAI positions GPT-5.4 Mini as delivering “GPT-5.4 level intelligence” at lower cost. In our benchmark, it is the most expensive mini-class model we tested and delivers results indistinguishable from models that cost 4–23× less.

The progress from GPT-5 Mini is real. The model is genuinely more capable — it lasts longer, earns more, uses tools more fluently, and manages more complex scenarios. If I were evaluating reasoning ability in isolation, GPT-5.4 Mini would score well. But in an agentic task that requires reasoning plus calibration plus adaptation over time — the price-performance ratio is difficult to justify against Chinese alternatives.

Qwen 3.5 9B at $0.19/run fails with more style. Nemotron-3 at $1.08/run fails with more grace. Neither is “better” in any absolute sense — but neither charges a premium for the privilege.

This is one person's analysis based on one benchmark. Your mileage — and your use case — may vary.

Methodology

Published March 2026 as part of the FoodTruck Bench project.