How does GPT-5.4 Mini perform on agentic business tasks?

GPT-5.4 Mini went bankrupt in all 5 runs (17–22 days) on FoodTruck Bench. The core failure is inventory management: stockouts occurred on 94% of operating days. The model consistently orders insufficient ingredients, spends capital on upgrades instead of inventory, and takes loans it cannot repay. Despite using reasoning tokens extensively, it cannot translate analysis into effective purchasing decisions.

Is GPT-5.4 Mini worth the cost compared to other mini models?

At $4.40 per simulation run, GPT-5.4 Mini is 23× more expensive than Qwen 3.5 9B ($0.19/run) and 4× more expensive than Nemotron-3 ($1.08/run), while delivering comparable or worse results. All three models go bankrupt 100% of the time. The high cost comes from extensive reasoning token usage (50–60% of output tokens are reasoning), but this reasoning does not translate into better business decisions.

How does GPT-5.4 Mini compare to GPT-5 Mini?

GPT-5.4 Mini shows real improvement over GPT-5 Mini: it survives 17–22 days (vs 11 days), earns 3.6× more revenue ($6,209 vs $1,723), and manages staff more effectively. However, both models ultimately go bankrupt. The core inventory management and financial planning weaknesses persist across generations. GPT-5.4 Mini is a better model that still cannot run a food truck profitably.

Case Study · March 2026 · Nicholas S.

GPT-5.4 Mini: Agentic Disaster

OpenAI calls it “GPT-5.4 level intelligence, faster and cheaper.”
We gave it $2,000 and a food truck in Austin, TX.
It bought $700 in upgrades it couldn't afford, ran out of ingredients on 94% of operating days, and went bankrupt five times in a row.

Key Findings

100% bankruptcy: All 5 runs ended in bankruptcy within 17–22 days
94% stockout rate: 17 of 18 operating days had at least one dish run out of ingredients
$4.40/run average: 23× more expensive than Qwen 3.5 9B ($0.19), 4× more than Nemotron-3 ($1.08) — with comparable results
Upgrade addiction: Spent $700 on truck upgrades (Days 6 and 8), leaving the balance under $1,000
Loan trap: Took 2 loans in every run trying to recover — none of them helped
Real progress: Survives 19 days vs 11 (GPT-5 Mini), 3.6× more revenue — but still bankrupt

Five Runs, One Story

We ran GPT-5.4 Mini five times with the same configuration (seed 42, 30 days, $2,000 starting balance). All five runs tell essentially the same story. The model has a “personality” — and that personality is consistent.

#	Days	Revenue	Profit	ROI	Stockout Days	Upgrades	Loans	Cost
1	17	$4,400	−$1,692	−29%	14/13	4 ($1,550)	2	$2.80
2	22	$9,969	−$864	−14%	18/20	3 ($1,200)	2	$5.80
3	19	$6,209	−$2,270	−76%	17/18	2 ($700)	2	$4.61
4	18	$5,541	−$1,847	−45%	16/17	2 ($900)	2	$3.52
5	20	$7,812	−$1,089	−22%	17/19	3 ($1,050)	2	$5.14

The pattern is identical across all five runs:

Buy upgrades early with limited capital
Run out of ingredients almost every day
Take 2 loans trying to recover — they don't help
Go bankrupt anyway

We selected Run 3 (19 days, median) for the deep dive below. The other runs follow the same trajectory.
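The run selection and headline figures can be re-derived from the table above. This is a minimal sketch with the rows transcribed by hand; the tuple layout and variable names are illustrative and not part of the benchmark's own code.

```python
from statistics import mean, median

# (run, days survived, API cost in USD), transcribed from the runs table
runs = [(1, 17, 2.80), (2, 22, 5.80), (3, 19, 4.61), (4, 18, 3.52), (5, 20, 5.14)]

# The deep-dive run is the one whose survival time is the median
median_days = median(days for _, days, _ in runs)
median_run = next(run for run, days, _ in runs if days == median_days)

# Mean API cost across the five runs
avg_cost = mean(cost for _, _, cost in runs)

# Stockout rate for Run 3: 17 of 18 operating days
stockout_rate = 17 / 18

print(f"median survival: {median_days} days (Run {median_run})")
print(f"average cost: ${avg_cost:.2f}/run")
print(f"Run 3 stockout rate: {stockout_rate:.0%}")
```

The mean cost comes out at about $4.37, which the article rounds to $4.40.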

The Upgrade Trap

GPT-5.4 Mini does something that better models never do: it starts buying truck upgrades in the first week, before establishing a sustainable revenue stream.

Day	Balance Before	Upgrade	Cost	Balance After	Day Result
6	$1,745	Commercial Refrigerator	$400	$934	$62 revenue, −$424 profit
8	$832	Menu Board & Signage	$300	−$63	$0 revenue, 138 unmet demand

Day 8 is the most revealing moment. The model had $832 in cash. It spent $300 on a marketing upgrade — and then earned $0 in revenue because it had no ingredients to sell. 138 customers showed up and left hungry. The marketing upgrade worked perfectly: it attracted more customers who couldn't buy anything.

In Run 1, the model went even further: $1,550 on four upgrades (kitchen, storage, marketing, equipment) in the first two weeks. That's 77% of the starting capital spent on improvements to a truck that had nothing to cook.

The Stockout Epidemic

The defining characteristic of GPT-5.4 Mini is its inability to order enough ingredients. Of 18 operating days in the median run, 17 had at least one stockout. The model ran out of food to sell almost every single day.

[Chart: Served vs Unmet Demand. Series: Served, Unmet (Stockout)]

Day 13: 526 unmet customers at Waterfront Park during an event. Day 16: 369 turned away at Industrial Zone. The model never adapted its ordering to match demand.

Some context: on Day 0, GPT-5.4 Mini ordered 16 burger buns and 5kg of ground beef, enough for roughly 15 burgers, at a downtown Austin location with daily foot traffic of 100+. By comparison, Claude Opus orders 50–80 portions per dish per day and rarely stocks out after Day 3.

The Worst Days

Day	Location	Served	Unmet	Dishes Stocked Out
8	Downtown	0	138	tacos, burgers, fries
3	Downtown	63	125	burgers, tacos, fries, soda
13	Waterfront	63	526	burgers, tacos, fries
16	Industrial	23	369	burgers, tacos, fries
14	Waterfront	65	232	burgers, tacos, fries

Day 13 stands out: Waterfront Park during an event with massive traffic — and 526 customers turned away because the model had ordered barely enough for 63 servings. That's over $5,000 in lost revenue from a single day of under-ordering.
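To put the worst days on a common scale, one option is a fill rate: the share of total demand that was actually served. This is a small sketch using the rows of the table above; the `fill_rate` helper is a name introduced here for illustration, not a metric reported by the benchmark.

```python
# (day, served, unmet), transcribed from the "Worst Days" table
worst_days = [(8, 0, 138), (3, 63, 125), (13, 63, 526), (16, 23, 369), (14, 65, 232)]

def fill_rate(served: int, unmet: int) -> float:
    """Share of total demand (served + unmet) that was actually served."""
    total = served + unmet
    return served / total if total else 0.0

for day, served, unmet in sorted(worst_days):
    print(f"Day {day:>2}: {fill_rate(served, unmet):.0%} of demand served")
```

On Day 13 this works out to roughly 11% of demand served.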

The model sees the stockout data. The knowledge base shows it every morning: “stockouts: classic_burger, street_tacos, french_fries.” It acknowledges the problem in its scratchpad. And then it orders the same small quantities again.

The Staffing Chaos

GPT-5.4 Mini's staffing decisions are erratic. In the median run, the hire/fire timeline reads like an HR nightmare:

Day	Action	Staff After	Context
1	✅ Hire Margo (cook)	1	Good first hire
3	✅ Hire Sarah (cashier)	2	Reasonable
4	✅ Hire Jake (cook)	3	Day after hiring Sarah
9	❌ Fire Jake	2	Fired to cut costs — then took a $500 loan
18	✅ Hire DJ (cook)	3	Balance: $187
19	❌ Fire DJ + Sarah	1	Final day — fired 2 of 3 staff before bankruptcy

Staff costs in the median run: $4,752. That's 56% of all expenses and 77% of total revenue. The model kept up to 3 staff on payroll at times when it couldn't even afford ingredients to feed the customers those staff members would serve. On Day 18, with $187 in the bank, it hired DJ, then fired both DJ and Sarah the next day. A $120+ round trip of wages for nothing.
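The two staff-cost percentages can be checked against the Run 3 row of the runs table. This sketch assumes that total expenses equal revenue minus profit, which is an inference rather than a definition given in the article.

```python
revenue = 6209      # Run 3 total revenue (USD)
profit = -2270      # Run 3 profit (USD)
staff_cost = 4752   # Run 3 staff costs (USD)

# Assumption: expenses = revenue - profit
expenses = revenue - profit

share_of_expenses = staff_cost / expenses
share_of_revenue = staff_cost / revenue

print(f"expenses: ${expenses:,}")
print(f"staff share of expenses: {share_of_expenses:.0%}")
print(f"staff share of revenue: {share_of_revenue:.0%}")
```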

The Death Spiral

The bankruptcy sequence in the median run follows a predictable pattern:

Day	Balance	Event
6	$934	Bought $400 refrigerator upgrade. Revenue: $62.
8	−$63	Bought $300 marketing upgrade. Revenue: $0. 138 hungry customers left.
9	$755	Fired Jake, took $500 loan. Revenue bounced to $799.
11	$247	Forced day off — no ingredients at all.
15–17	$267→$72→$298	Bleeding cash. Took a second $500 loan on Day 17.
18	$187	Hired DJ. Revenue: $322. Not enough to recover.
19	−$315	💀 $28 revenue. Loan auto-collects. Bankrupt.

[Chart: GPT-5.4 Mini, Net Worth & Cash Balance (Run 3, 19 Days). Series: Net Worth, Cash Balance, Daily Profit]

Net worth never recovered above the starting $2,000 after Day 1. Cash went negative twice (Day 8, Day 19). The final day: $28 revenue, −$494 profit.

The fundamental arithmetic never worked: daily fixed costs are $55 (lease + insurance + commissary). Add fuel ($25), location fees ($25–50), and staff wages ($200–400), and the model needs roughly $300–530/day just to cover these costs, before buying any ingredients. Its median revenue was $299/day. It was structurally unprofitable from Day 1.
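Summing the quoted cost ranges makes the gap explicit. This is a minimal sketch using only the figures in the paragraph above; the constant names are illustrative, and ingredient costs are deliberately left out because the article does not quote them.

```python
FIXED = 55                 # lease + insurance + commissary, per day
FUEL = 25                  # per day
LOCATION_FEE = (25, 50)    # (low, high) per day
STAFF_WAGES = (200, 400)   # (low, high) per day

low = FIXED + FUEL + LOCATION_FEE[0] + STAFF_WAGES[0]
high = FIXED + FUEL + LOCATION_FEE[1] + STAFF_WAGES[1]

median_revenue = 299       # median daily revenue in Run 3

print(f"daily operating costs: ${low} to ${high}")
print(f"shortfall at median revenue: ${low - median_revenue} to ${high - median_revenue}")
```

Even at the low end of every range, the median day's revenue falls short before a single ingredient is purchased.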

The Price of Failure

GPT-5.4 Mini is positioned as a cost-effective reasoning model. In practice, it's the most expensive way to go bankrupt on FoodTruck Bench.

Model	Cost/Run	Days Survived	Bankruptcy Rate	ROI	Cost per Day Survived
GPT-5.4 Mini	$4.40	17–22	100%	−14% to −76%	$0.23
Nemotron-3 120B	$1.08	14–18	100%	−52%	$0.06
Qwen 3.5 9B	$0.19	12–16	100%	−134%	$0.01
Claude Haiku 4.5	$0.38	10–14	100%	−92%	$0.03
GPT-5 Mini (prev gen)	~$0.60	8–15	75%	−97%	$0.05

At $4.40 per run, GPT-5.4 Mini costs 23× more than Qwen 3.5 9B and 4× more than Nemotron-3 — while delivering the same outcome: bankruptcy. The root cause is reasoning tokens: 50–60% of output tokens are “thinking” tokens that the user pays for but that don't produce better decisions.
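The cost multiples follow directly from the Cost/Run column. This is a short sketch; it takes 19 days (the median run) as the survival figure for GPT-5.4 Mini's cost per day, which is an assumption about how that column was computed.

```python
# Cost per run in USD, from the table above
cost_per_run = {
    "GPT-5.4 Mini": 4.40,
    "Nemotron-3 120B": 1.08,
    "Qwen 3.5 9B": 0.19,
    "Claude Haiku 4.5": 0.38,
}

baseline = cost_per_run["GPT-5.4 Mini"]

# How many times more expensive GPT-5.4 Mini is than each other model
ratios = {
    model: baseline / cost
    for model, cost in cost_per_run.items()
    if model != "GPT-5.4 Mini"
}

# Assumption: cost per day survived uses the 19-day median run
cost_per_day = baseline / 19

for model, ratio in ratios.items():
    print(f"GPT-5.4 Mini costs {ratio:.1f}x as much as {model}")
print(f"cost per day survived: ${cost_per_day:.2f}")
```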

The model generated 882K reasoning tokens in the median run alone — tokens spent “thinking” about whether to order 2kg or 3kg of ground beef, while the correct answer was 15–20kg.

The Chinese Alternative

The most damning comparison for GPT-5.4 Mini isn't against other mini models — it's against Chinese alternatives that cost a fraction of the price and dramatically outperform it.

Model	Origin	Cost/Run	Days Survived	Revenue	Net Worth	ROI
GPT-5.4 Mini	OpenAI 🇺🇸	$4.40	19	$6,209	$470	−76%
GLM-5	Zhipu AI 🇨🇳	~$1.00	28	$11,965	−$210	−110%
DeepSeek V3.2	DeepSeek 🇨🇳	~$0.50	22	$8,225	$2,058	+2.9%
Qwen 3.5 397B	Alibaba 🇨🇳	~$0.80	22	$9,540	−$218	−110%

[Chart: GPT-5.4 Mini vs Chinese Alternatives, Net Worth. Series: GPT-5.4 Mini ($4.40/run), GLM-5 (~$1.00/run), DeepSeek V3.2 (~$0.50/run)]

GLM-5 survives 28 days and earns $12K revenue at ¼ of GPT-5.4 Mini's API cost. DeepSeek V3.2 finishes with positive ROI at 1/9th the cost. Both are Chinese models.

GLM-5 costs about $1 per run via OpenRouter — one quarter of GPT-5.4 Mini. It survives 28 days (vs 19), earns nearly 2× the revenue ($12K vs $6.2K), and serves 2,569 customers. It also goes bankrupt eventually — but it runs the business for 47% longer and generates far more value before failing.

DeepSeek V3.2 is even more striking: at roughly $0.50 per run — one ninth of GPT-5.4 Mini's cost — it finishes its best run with positive ROI and $2,058 net worth. It actually made money running the food truck. For 11 cents on the dollar compared to GPT-5.4 Mini.

Qwen 3.5 397B from Alibaba, at ~$0.80/run, also outlasts GPT-5.4 Mini by 3 days and earns 54% more revenue. All three Chinese models deliver more business value at a fraction of the cost. The pattern is clear: in agentic business tasks, OpenAI's mini-class pricing buys reasoning tokens, not results.
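The relative figures quoted in this section can be reproduced from the comparison table. This is a hedged sketch: the `relative` helper and the "revenue per API dollar" metric are introduced here for illustration, and the costs marked "~" in the table are approximate, so the outputs are too.

```python
# (cost per run in USD, days survived, revenue in USD), from the comparison table
models = {
    "GPT-5.4 Mini": (4.40, 19, 6209),
    "GLM-5": (1.00, 28, 11965),
    "DeepSeek V3.2": (0.50, 22, 8225),
    "Qwen 3.5 397B": (0.80, 22, 9540),
}

base_cost, base_days, base_revenue = models["GPT-5.4 Mini"]

def relative(name: str) -> dict:
    """Compare one model against GPT-5.4 Mini on cost, survival and revenue."""
    cost, days, revenue = models[name]
    return {
        "cost_fraction": cost / base_cost,
        "extra_days_pct": (days / base_days - 1) * 100,
        "extra_revenue_pct": (revenue / base_revenue - 1) * 100,
        "revenue_per_api_dollar": revenue / cost,
    }

for name in models:
    r = relative(name)
    print(
        f"{name}: {r['cost_fraction']:.2f}x cost, "
        f"{r['extra_days_pct']:+.0f}% days, "
        f"{r['extra_revenue_pct']:+.0f}% revenue, "
        f"${r['revenue_per_api_dollar']:,.0f} revenue per API dollar"
    )
```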

Progress from GPT-5 Mini — Real but Insufficient

To be fair: GPT-5.4 Mini is a significantly better model than GPT-5 Mini. The numbers show it.

Metric	GPT-5 Mini	GPT-5.4 Mini	Change
Days Survived	11	19	+73%
Total Revenue	$1,723	$6,209	3.6×
Servings Sold	478	948	2.0×
Unique Dishes	3	5	+67%
Profitable Days	2/8	5/18	25% → 28%
Bankruptcy Rate	75%	100%	↑ (worse)
ROI	−97%	−76%	Both terrible
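The Change column mixes percentage changes and multiples. A small sketch of both calculations, using the values in the table above (the helper names are illustrative):

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100

def multiple(old: float, new: float) -> float:
    """How many times larger new is than old."""
    return new / old

print(f"Days survived: {pct_change(11, 19):+.0f}%")
print(f"Total revenue: {multiple(1723, 6209):.1f}x")
print(f"Servings sold: {multiple(478, 948):.1f}x")
print(f"Unique dishes: {pct_change(3, 5):+.0f}%")
```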

[Chart: Generational Comparison, Net Worth. Series: GPT-5.4 Mini, GPT-5 Mini, Nemotron-3 120B]

GPT-5.4 Mini clearly outlasts GPT-5 Mini (19 vs 11 days) — but both end in the same place. Nemotron-3 at $1.08/run achieves comparable survival at ¼ the cost.

GPT-5 Mini was a model that couldn't operate — it often produced zero revenue for days in a row, with 4 consecutive days of $0 at the end. GPT-5.4 Mini can operate — it sells food, manages staff, chooses locations, even writes strategy notes. It just can't do the one thing that matters most: order enough ingredients to meet demand.

The cognitive complexity has scaled. The business acumen has not.

Where GPT-5.4 Mini Sits in the Field

FoodTruck Bench tests 19 models. GPT-5.4 Mini would place in the bottom third — roughly on par with models that cost a fraction of the price.

Tier	Model	ROI	Net Worth	Cost/Run	Bankrupt?
🏆	Claude Opus 4.6	+2,376%	$49,519	~$27	0%
🥈	GPT-5.2	+1,304%	$28,081	~$15	0%
	Claude Sonnet 4.6	+771%	$17,426	~$23	0%
	Gemini 3 Pro	+760%	$17,199	~$4.50	0%
... 10+ models in between ...
	Nemotron-3 120B (free)	−52%	$962	$1.08	100%
	GPT-5.4 Mini	−76%	$470	$4.40	100%
	GPT-5 Mini (prev gen)	−97%	$50	~$0.60	75%
	Qwen 3.5 9B	−134%	−$679	$0.19	100%

The irony: Gemini 3 Pro Preview at roughly the same cost per run (~$4.50) finishes the full 30 days with +760% ROI and $17,199 net worth. Same price bracket, diametrically opposite results.
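The ROI and Net Worth columns in the table above appear to be linked by a simple formula. This is a minimal sketch, assuming ROI is measured against the $2,000 starting balance; the relationship is inferred from the table values and is not stated in the methodology.

```python
STARTING_BALANCE = 2000  # USD, from the run configuration

def roi_pct(final_net_worth: float) -> float:
    """Assumed ROI definition: change in net worth relative to starting balance."""
    return (final_net_worth - STARTING_BALANCE) / STARTING_BALANCE * 100

# Final net worth in USD, from the leaderboard table
net_worth = {
    "Claude Opus 4.6": 49519,
    "Gemini 3 Pro": 17199,
    "Nemotron-3 120B": 962,
    "Qwen 3.5 9B": -679,
}

for model, worth in net_worth.items():
    print(f"{model}: {roi_pct(worth):+.0f}% ROI")
```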

What 882K Reasoning Tokens Buy You

GPT-5.4 Mini uses OpenAI's Responses API with extended reasoning. In the median run, out of 903K output tokens, 882K were reasoning tokens (98%). The model spends enormous computational effort “thinking”, and then orders 16 burger buns for a downtown Austin lunch rush.

What the reasoning looks like in practice: the model thoroughly evaluates locations, compares weather forecasts, calculates staff costs, and writes detailed strategy notes in the scratchpad. All of this is correct. The analysis is sound. But when it comes time to actually order ingredients, it orders for 15–30 servings instead of the 100–200 that demand actually requires.

This is not a reasoning failure — it's a calibration failure. The model can think about the problem but it lacks the intuition for appropriate scale. It reasons about whether to order 2kg or 3kg of ground beef (both wrong) instead of stepping back to ask: “How many servings do I realistically need?”
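The "step back" question can be written as a demand-first calculation: start from expected servings, then derive the order quantity. This is only a sketch under loudly stated assumptions: the 125 g portion size and the 10% buffer are hypothetical values chosen for illustration, and `order_kg` is not part of the benchmark or of any model's actual tooling.

```python
def order_kg(expected_servings: int, grams_per_serving: int, buffer_pct: int = 10) -> float:
    """Order quantity in kg, rounded up to the nearest 0.5 kg.

    grams_per_serving and buffer_pct are hypothetical inputs, not benchmark values.
    Integer arithmetic is used to avoid floating-point rounding surprises.
    """
    if expected_servings < 0 or grams_per_serving <= 0 or buffer_pct < 0:
        raise ValueError("servings and buffer must be non-negative, portion size positive")
    # grams needed, scaled by 100 so the percentage buffer stays in integers
    scaled_grams = expected_servings * grams_per_serving * (100 + buffer_pct)
    # ceiling division into 0.5 kg (500 g) steps
    half_kilos = -(-scaled_grams // (100 * 500))
    return half_kilos / 2

# Sizing from the buns on hand (16 servings) vs sizing from demand (120 servings)
print(order_kg(16, 125))   # small order, in the 2-3 kg range the model debated
print(order_kg(120, 125))  # demand-based order, in the 15-20 kg range
```

The point of the design is the order of operations: the quantity falls out of the demand estimate instead of being chosen directly.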

Author's Note

This is a critical article, and I want to be transparent about it. OpenAI positions GPT-5.4 Mini as delivering “GPT-5.4 level intelligence” at lower cost. In our benchmark, it is the most expensive mini-class model on the market and delivers results indistinguishable from models that cost 4–23× less.

The progress from GPT-5 Mini is real. The model is genuinely more capable — it lasts longer, earns more, uses tools more fluently, and manages more complex scenarios. If I were evaluating reasoning ability in isolation, GPT-5.4 Mini would score well. But in an agentic task that requires reasoning plus calibration plus adaptation over time — the price-performance ratio is difficult to justify against Chinese alternatives.

Qwen 3.5 9B at $0.19/run fails with more style. Nemotron-3 at $1.08/run fails with more grace. Neither is “better” in any absolute sense — but neither charges a premium for the privilege.

This is one person's analysis based on one benchmark. Your mileage — and your use case — may vary.

Methodology

Benchmark: FoodTruck Bench v1.0
Simulation: Business sim in Austin, TX. Fixed random seed (42), 30 days configured
API: Direct OpenAI SDK → GPT-5.4 Mini (gpt-5.4-mini-2026-03-17), Responses API with reasoning
Tools: 34 morning tools + 5 reflection tools (OpenAI function-calling schema)
Runs: 5 runs, all same seed — consistent bankruptcy pattern
Primary run: Run 3, 19 days (median survival), selected for deep dive
Compared against: 18 other models on the leaderboard, 5 runs each
Cost tracking: Token-level billing from OpenAI API responses (input, output, reasoning, cached)
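Token-level cost tracking of the kind listed above can be sketched as a pure function over per-call token counts. Everything specific here is an assumption: the `Usage` container is a stand-in for whatever the harness records, the prices are HYPOTHETICAL placeholders and not the model's real price list, and the sketch assumes reasoning tokens are counted inside output tokens and cached tokens inside input tokens.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    input_tokens: int      # total prompt tokens, including cached ones
    cached_tokens: int     # subset of input_tokens served from cache
    output_tokens: int     # total output tokens, including reasoning
    reasoning_tokens: int  # subset of output_tokens

# HYPOTHETICAL prices in USD per 1M tokens, for illustration only
PRICE_INPUT = 1.00
PRICE_CACHED_INPUT = 0.10
PRICE_OUTPUT = 5.00

def call_cost(u: Usage) -> float:
    """Cost of one API call in USD under the assumed prices."""
    uncached = u.input_tokens - u.cached_tokens
    total = (
        uncached * PRICE_INPUT
        + u.cached_tokens * PRICE_CACHED_INPUT
        + u.output_tokens * PRICE_OUTPUT  # reasoning tokens are included here
    )
    return total / 1_000_000

def reasoning_share(u: Usage) -> float:
    """Fraction of output tokens that were reasoning tokens."""
    return u.reasoning_tokens / u.output_tokens if u.output_tokens else 0.0

example = Usage(input_tokens=1_000_000, cached_tokens=0,
                output_tokens=1_000_000, reasoning_tokens=500_000)
print(f"cost: ${call_cost(example):.2f}, reasoning share: {reasoning_share(example):.0%}")
```

Summing `call_cost` over every call in a run would give a per-run figure comparable to the Cost column, provided real prices are substituted.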

Published March 2026 as part of the FoodTruck Bench project.