GPT-5.4 Mini: Agentic Disaster
OpenAI calls it “GPT-5.4 level intelligence, faster and cheaper.”
We gave it $2,000 and a food truck in Austin, TX.
It bought $700 in upgrades it couldn't afford, ran out of ingredients on 94% of operating days, and went bankrupt five times in a row.
Key Findings
- 100% bankruptcy: All 5 runs ended in bankruptcy within 17–22 days
- 94% stockout rate: 17 of 18 operating days had at least one dish run out of ingredients
- $4.40/run average: 23× more expensive than Qwen 3.5 9B ($0.19), 4× more than Nemotron-3 ($1.08) — with comparable results
- Upgrade addiction: Spent $700 on truck upgrades (Days 6 and 8) as its balance slid from $1,745 to below zero
- Loan trap: Took 2 loans in every run trying to recover — none of them helped
- Real progress: Survives 19 days vs 11 (GPT-5 Mini), 3.6× more revenue — but still bankrupt
Five Runs, One Story
We ran GPT-5.4 Mini five times with the same configuration (seed 42, 30 days, $2,000 starting balance). All five runs tell essentially the same story. The model has a “personality” — and that personality is consistent.
| # | Days | Revenue | Profit | ROI | Stockout Days | Upgrades | Loans | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | 17 | $4,400 | −$1,692 | −29% | 14/13 | 4 ($1,550) | 2 | $2.80 |
| 2 | 22 | $9,969 | −$864 | −14% | 18/20 | 3 ($1,200) | 2 | $5.80 |
| 3 | 19 | $6,209 | −$2,270 | −76% | 17/18 | 2 ($700) | 2 | $4.61 |
| 4 | 18 | $5,541 | −$1,847 | −45% | 16/17 | 2 ($900) | 2 | $3.52 |
| 5 | 20 | $7,812 | −$1,089 | −22% | 17/19 | 3 ($1,050) | 2 | $5.14 |
The pattern is identical across all five runs:
- Buy upgrades early with limited capital
- Run out of ingredients almost every day
- Take 2 loans trying to recover — they don't help
- Go bankrupt anyway
We selected Run 3 (19 days, median) for the deep dive below. The other runs follow the same trajectory.
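As a quick sanity check on the summary figures, the sketch below recomputes the median survival time and the mean cost per run from the table above. The variable names are illustrative, and the values are copied by hand from the table. By this calculation the mean cost comes out near $4.37, which the headline figure appears to round to $4.40.

```python
from statistics import mean, median

# (days survived, API cost in USD) for runs 1-5, copied from the table above
runs = [(17, 2.80), (22, 5.80), (19, 4.61), (18, 3.52), (20, 5.14)]

days = [d for d, _ in runs]
costs = [c for _, c in runs]

median_days = median(days)   # the run with this survival time is the "median run"
mean_cost = mean(costs)      # average API spend across the five runs

print(f"median days survived: {median_days}")
print(f"mean cost per run: ${mean_cost:.2f}")
```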
The Upgrade Trap
GPT-5.4 Mini does something that better models never do: it buys truck upgrades in the first week, before establishing a sustainable revenue stream.
| Day | Balance Before | Upgrade | Cost | Balance After | Day Result |
|---|---|---|---|---|---|
| 6 | $1,745 | Commercial Refrigerator | $400 | $934 | $62 revenue, −$424 profit |
| 8 | $832 | Menu Board & Signage | $300 | −$63 | $0 revenue, 138 unmet demand |
Day 8 is the most revealing moment. The model had $832 in cash. It spent $300 on a marketing upgrade — and then earned $0 in revenue because it had no ingredients to sell. 138 customers showed up and left hungry. The marketing upgrade worked perfectly: it attracted more customers who couldn't buy anything.
In Run 1, the model went even further: $1,550 on four upgrades (kitchen, storage, marketing, equipment) in the first two weeks. That's 77% of the starting capital spent on improvements to a truck that had nothing to cook.
The Stockout Epidemic
The defining characteristic of GPT-5.4 Mini is its inability to order enough ingredients. Of 18 operating days in the median run, 17 had at least one stock-out. The model was literally running out of food to sell almost every single day.
Some context: on Day 0, GPT-5.4 Mini ordered 16 burger buns and 5kg of ground beef, enough for roughly 15 burgers, for a downtown Austin location with foot traffic of 100+ customers a day. By comparison, Claude Opus orders 50–80 portions per dish per day and rarely stocks out after Day 3.
The Worst Days
| Day | Location | Served | Unmet | Dishes Stocked Out |
|---|---|---|---|---|
| 8 | Downtown | 0 | 138 | tacos, burgers, fries |
| 3 | Downtown | 63 | 125 | burgers, tacos, fries, soda |
| 13 | Waterfront | 63 | 526 | burgers, tacos, fries |
| 16 | Industrial | 23 | 369 | burgers, tacos, fries |
| 14 | Waterfront | 65 | 232 | burgers, tacos, fries |
Day 13 stands out: Waterfront Park during an event with massive traffic — and 526 customers turned away because the model had ordered barely enough for 63 servings. That's over $5,000 in lost revenue from a single day of under-ordering.
The model sees the stockout data. The knowledge base shows it every morning: “stockouts: classic_burger, street_tacos, french_fries.” It acknowledges the problem in its scratchpad. And then it orders the same small quantities again.
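The scale of the gap is easier to see when served and unmet counts are added together. The sketch below assumes that served plus unmet is a reasonable proxy for a day's demand (an assumption made here for illustration, not something the benchmark defines) and uses hypothetical helper names. The rows are copied from "The Worst Days" table.

```python
# (day, served, unmet) rows copied from "The Worst Days" table
worst_days = [(8, 0, 138), (3, 63, 125), (13, 63, 526), (16, 23, 369), (14, 65, 232)]

def observed_demand(served: int, unmet: int) -> int:
    """Assumed proxy for demand: customers served plus customers turned away."""
    return served + unmet

def fill_rate(served: int, unmet: int) -> float:
    """Share of observed demand that was actually served (0.0 if nobody showed up)."""
    total = observed_demand(served, unmet)
    return served / total if total else 0.0

for day, served, unmet in worst_days:
    demand = observed_demand(served, unmet)
    print(f"Day {day}: demand ~{demand}, served {served} ({fill_rate(served, unmet):.0%})")
```

On Day 13, for example, this proxy puts demand at 589 customers, of whom about one in ten was served.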
The Staffing Chaos
GPT-5.4 Mini's staffing decisions are erratic. In the median run, the hire/fire timeline reads like an HR nightmare:
| Day | Action | Staff After | Context |
|---|---|---|---|
| 1 | ✅ Hire Margo (cook) | 1 | Good first hire |
| 3 | ✅ Hire Sarah (cashier) | 2 | Reasonable |
| 4 | ✅ Hire Jake (cook) | 3 | Day after hiring Sarah |
| 9 | ❌ Fire Jake | 2 | Fired to cut costs — then took a $500 loan |
| 18 | ✅ Hire DJ (cook) | 3 | Balance: $187 |
| 19 | ❌ Fire DJ + Sarah | 1 | Final day — fired 2 of 3 staff before bankruptcy |
Staff costs in the median run: $4,752. That's 56% of all expenses and 77% of total revenue. The model kept up to 3 staff on payroll at times when it couldn't even afford ingredients to feed the customers those staff would serve. On Day 18, with $187 in the bank, it hired DJ, then fired both DJ and Sarah the next day. A $120+ round trip of wages for nothing.
The Death Spiral
The bankruptcy sequence in the median run follows a predictable pattern:
| Day | Balance | Event |
|---|---|---|
| 6 | $934 | Bought $400 refrigerator upgrade. Revenue: $62. |
| 8 | −$63 | Bought $300 marketing upgrade. Revenue: $0. 138 hungry customers left. |
| 9 | $755 | Fired Jake, took $500 loan. Revenue bounced to $799. |
| 11 | $247 | Forced day off — no ingredients at all. |
| 15–17 | $267→$72→$298 | Bleeding cash. Took a second $500 loan on Day 17. |
| 18 | $187 | Hired DJ. Revenue: $322. Not enough to recover. |
| 19 | −$315 | 💀 $28 revenue. Loan auto-collects. Bankrupt. |
The fundamental arithmetic never worked: daily fixed costs are $55 (lease + insurance + commissary). Add fuel ($25), location fees ($25–50), and staff wages ($200–400), and the model needs roughly $305–530/day just to cover overhead, before buying any ingredients. Its median revenue was $299/day. It was structurally unprofitable from Day 1.
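The overhead arithmetic can be laid out explicitly. This is a minimal sketch using only the cost figures quoted above. The constant names are illustrative, and ingredient costs are deliberately left out, so the real break-even point would be higher.

```python
# Daily cost figures in USD, as quoted in the paragraph above
FIXED = 55                 # lease + insurance + commissary
FUEL = 25
LOCATION_FEE = (25, 50)    # (low, high)
STAFF_WAGES = (200, 400)   # (low, high)
MEDIAN_REVENUE = 299       # median daily revenue in the median run

# Overhead before any ingredient purchases
overhead_low = FIXED + FUEL + LOCATION_FEE[0] + STAFF_WAGES[0]
overhead_high = FIXED + FUEL + LOCATION_FEE[1] + STAFF_WAGES[1]

# How far median revenue falls short of overhead alone
shortfall_low = overhead_low - MEDIAN_REVENUE
shortfall_high = overhead_high - MEDIAN_REVENUE

print(f"daily overhead: ${overhead_low}-${overhead_high}")
print(f"shortfall at median revenue: ${shortfall_low}-${shortfall_high}")
```

Even at the cheapest location with the smallest payroll, median revenue does not cover overhead.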
The Price of Failure
GPT-5.4 Mini is positioned as a cost-effective reasoning model. In practice, it's the most expensive way to go bankrupt on FoodTruck Bench.
| Model | Cost/Run | Days Survived | Bankruptcy Rate | ROI | Cost per Day Survived |
|---|---|---|---|---|---|
| GPT-5.4 Mini | $4.40 | 17–22 | 100% | −29% to −76% | $0.23 |
| Nemotron-3 120B | $1.08 | 14–18 | 100% | −52% | $0.06 |
| Qwen 3.5 9B | $0.19 | 12–16 | 100% | −134% | $0.01 |
| Claude Haiku 4.5 | $0.38 | 10–14 | 100% | −92% | $0.03 |
| GPT-5 Mini (prev gen) | ~$0.60 | 8–15 | 75% | −97% | $0.05 |
At $4.40 per run, GPT-5.4 Mini costs 23× more than Qwen 3.5 9B and 4× more than Nemotron-3 while delivering the same outcome: bankruptcy. The root cause is reasoning tokens: the large majority of output tokens (98% in the median run) are “thinking” tokens that the user pays for but that don't produce better decisions.
The model generated 882K reasoning tokens in the median run alone — tokens spent “thinking” about whether to order 2kg or 3kg of ground beef, while the correct answer was 15–20kg.
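The cost multiples quoted above follow directly from the per-run figures in the table. A small sketch, with a hypothetical helper name and the approximate ("~") prices taken at face value:

```python
# Average API cost per run in USD, copied from the table above
cost_per_run = {
    "GPT-5.4 Mini": 4.40,
    "Nemotron-3 120B": 1.08,
    "Qwen 3.5 9B": 0.19,
    "Claude Haiku 4.5": 0.38,
    "GPT-5 Mini (prev gen)": 0.60,
}

def cost_multiple(model: str, baseline: str = "GPT-5.4 Mini") -> float:
    """How many times the baseline's per-run cost exceeds that of `model`."""
    return cost_per_run[baseline] / cost_per_run[model]

for name in cost_per_run:
    if name != "GPT-5.4 Mini":
        print(f"GPT-5.4 Mini costs {cost_multiple(name):.1f}x as much per run as {name}")
```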
The Chinese Alternative
The most damning comparison for GPT-5.4 Mini isn't against other mini models — it's against Chinese alternatives that cost a fraction of the price and dramatically outperform it.
| Model | Origin | Cost/Run | Days Survived | Revenue | Net Worth | ROI |
|---|---|---|---|---|---|---|
| GPT-5.4 Mini | OpenAI 🇺🇸 | $4.40 | 19 | $6,209 | $470 | −76% |
| GLM-5 | Zhipu AI 🇨🇳 | ~$1.00 | 28 | $11,965 | −$210 | −110% |
| DeepSeek V3.2 | DeepSeek 🇨🇳 | ~$0.50 | 22 | $8,225 | $2,058 | +2.9% |
| Qwen 3.5 397B | Alibaba 🇨🇳 | ~$0.80 | 22 | $9,540 | −$218 | −110% |
GLM-5 costs about $1 per run via OpenRouter — one quarter of GPT-5.4 Mini. It survives 28 days (vs 19), earns nearly 2× the revenue ($12K vs $6.2K), and serves 2,569 customers. It also goes bankrupt eventually — but it runs the business for 47% longer and generates far more value before failing.
DeepSeek V3.2 is even more striking: at roughly $0.50 per run — one ninth of GPT-5.4 Mini's cost — it finishes its best run with positive ROI and $2,058 net worth. It actually made money running the food truck. For 11 cents on the dollar compared to GPT-5.4 Mini.
Qwen 3.5 397B from Alibaba, at ~$0.80/run, also outlasts GPT-5.4 Mini by 3 days and earns 54% more revenue. All three Chinese models deliver more business value at a fraction of the cost. The pattern is clear: in agentic business tasks, OpenAI's mini-class pricing buys reasoning tokens, not results.
Progress from GPT-5 Mini — Real but Insufficient
To be fair: GPT-5.4 Mini is a significantly better model than GPT-5 Mini. The numbers show it.
| Metric | GPT-5 Mini | GPT-5.4 Mini | Change |
|---|---|---|---|
| Days Survived | 11 | 19 | +73% |
| Total Revenue | $1,723 | $6,209 | +3.6× |
| Servings Sold | 478 | 948 | +2.0× |
| Unique Dishes | 3 | 5 | +67% |
| Profitable Days | 2/8 | 5/18 | 25% → 28% |
| Bankruptcy Rate | 75% | 100% | Worse |
| ROI | −97% | −76% | Both terrible |
GPT-5 Mini was a model that couldn't operate — it often produced zero revenue for days in a row, with 4 consecutive days of $0 at the end. GPT-5.4 Mini can operate — it sells food, manages staff, chooses locations, even writes strategy notes. It just can't do the one thing that matters most: order enough ingredients to meet demand.
The cognitive complexity has scaled. The business acumen has not.
Where GPT-5.4 Mini Sits in the Field
FoodTruck Bench tests 19 models. GPT-5.4 Mini would place in the bottom third — roughly on par with models that cost a fraction of the price.
| Tier | Model | ROI | Net Worth | Cost/Run | Bankrupt? |
|---|---|---|---|---|---|
| 🏆 | Claude Opus 4.6 | +2,376% | $49,519 | ~$27 | 0% |
| 🥈 | GPT-5.2 | +1,304% | $28,081 | ~$15 | 0% |
| | Claude Sonnet 4.6 | +771% | $17,426 | ~$23 | 0% |
| | Gemini 3 Pro | +760% | $17,199 | ~$4.50 | 0% |
| | ... 10+ models in between ... | | | | |
| | Nemotron-3 120B (free) | −52% | $962 | $1.08 | 100% |
| | GPT-5.4 Mini | −76% | $470 | $4.40 | 100% |
| | GPT-5 Mini (prev gen) | −97% | $50 | ~$0.60 | 75% |
| | Qwen 3.5 9B | −134% | −$679 | $0.19 | 100% |
The irony: Gemini 3 Pro Preview at roughly the same cost per run (~$4.50) finishes the full 30 days with +760% ROI and $17,199 net worth. Same price bracket, diametrically opposite results.
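The ROI column in the table is consistent with net worth measured against the $2,000 starting balance. The formula below is inferred from the table values rather than taken from the benchmark's documentation, so treat it as an assumption. The function name is illustrative.

```python
STARTING_BALANCE = 2_000  # USD, the starting capital for every run

def roi_percent(net_worth: float, start: float = STARTING_BALANCE) -> float:
    """Assumed ROI definition: change in net worth relative to starting capital."""
    return (net_worth - start) * 100 / start

# Net worth values copied from the leaderboard table above
leaderboard = {
    "Claude Opus 4.6": 49_519,
    "GPT-5.2": 28_081,
    "Gemini 3 Pro": 17_199,
    "GPT-5.4 Mini": 470,
    "Qwen 3.5 9B": -679,
}

for model, net_worth in leaderboard.items():
    print(f"{model}: net worth ${net_worth:,} -> ROI {roi_percent(net_worth):+.1f}%")
```

Under this definition an ROI below −100% simply means the run ended with negative net worth, that is, owing more than it held.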
What 882K Reasoning Tokens Buy You
GPT-5.4 Mini uses OpenAI's Responses API with extended reasoning. In the median run, out of 903K output tokens, 882K were reasoning tokens (98%). The model spends enormous computational effort “thinking” and then orders 16 burger buns for a downtown Austin lunch rush.
What the reasoning looks like in practice: the model thoroughly evaluates locations, compares weather forecasts, calculates staff costs, and writes detailed strategy notes in the scratchpad. All of this is correct. The analysis is sound. But when it comes time to actually order ingredients, it orders for 15–30 servings instead of the 100–200 that demand actually requires.
This is not a reasoning failure — it's a calibration failure. The model can think about the problem but it lacks the intuition for appropriate scale. It reasons about whether to order 2kg or 3kg of ground beef (both wrong) instead of stepping back to ask: “How many servings do I realistically need?”
Author's Note
This is a critical article, and I want to be transparent about it. OpenAI positions GPT-5.4 Mini as delivering “GPT-5.4 level intelligence” at lower cost. In our benchmark, it is the most expensive mini-class model on the market and delivers results indistinguishable from models that cost 4–23× less.
The progress from GPT-5 Mini is real. The model is genuinely more capable — it lasts longer, earns more, uses tools more fluently, and manages more complex scenarios. If I were evaluating reasoning ability in isolation, GPT-5.4 Mini would score well. But in an agentic task that requires reasoning plus calibration plus adaptation over time — the price-performance ratio is difficult to justify against Chinese alternatives.
Qwen 3.5 9B at $0.19/run fails with more style. Nemotron-3 at $1.08/run fails with more grace. Neither is “better” in any absolute sense — but neither charges a premium for the privilege.
This is one person's analysis based on one benchmark. Your mileage — and your use case — may vary.