How does Grok 4.20 Reasoning perform on agentic business tasks?

Grok 4.20 Reasoning achieved a 40% survival rate on FoodTruck Bench (2 of 5 runs completed 30 days). The best run reached +69% ROI with $3,378 net worth. All 3 bankruptcies were caused by loan defaults, not fundamental business failures — the model's daily operations are generally sound. At $2.43/run, it's the cheapest reasoning model with any survival.

How does Grok 4.20 compare to GPT-5.4 Mini?

Grok 4.20 Reasoning dramatically outperforms GPT-5.4 Mini: 40% survival vs 0%, median ROI -0.2% vs -76%, and $2.43/run vs $4.40/run. Grok costs about 45% less per run and is the only one of the two that ever survives. GPT-5.4 Mini's core weakness is inventory management (94% stockout rate), while Grok's weakness is loan maturity management, a more sophisticated failure mode.

Why does Grok 4.20 go bankrupt?

All 3 Grok 4.20 bankruptcies were caused by loan defaults. The model takes loans to fund operations or upgrades but fails to manage balloon payment deadlines. FoodTruck Bench has no grace period — if you can't pay the full loan amount on the due date, you go bankrupt instantly. The surviving runs either repaid loans on time or repaid them early.


Case Study · March 2026 · Nicholas S.

Grok 4.20 Reasoning: The Cheapest Model That Actually Thinks

Name: FoodTruck Bench
Author: Nicholas S.

xAI’s flagship reasoning model. $2.43 average per run.
40% survival rate. +69% ROI at its peak.
The cheapest reasoning model on the leaderboard that actually builds a profitable business — when the dice roll right.

Key Findings

40% survival rate: 2 of 5 runs completed all 30 days profitably
Best run ROI +69%: $3,378 net worth — better than most mid-tier models
All 3 bankruptcies from loan defaults: not cash depletion, but failed balloon payments
$2.43 average cost per run: the cheapest reasoning model we’ve tested
1.8× cheaper than GPT-5.4 Mini ($4.40/run) and far more effective (40% vs 0% survival)
High variance: same model, same seed — ROI ranges from +69% to −58%

Five Runs, Two Fates

We ran Grok 4.20 Reasoning five times with the same configuration (seed 42, 30 days, $2,000 starting balance). The results split cleanly into two groups: two survivors and three bankruptcies.

#	Days	Revenue	Net Worth	ROI	Loans	Upgrades	Waste	Cost	Result
1	30	$23,297	$3,378	+69%	2	5 ($2,750)	$672	$3.95	✅
2	30	$20,629	$2,665	+33%	5 (3 repaid)	4 ($1,550)	$530	$2.04	✅
3	22	$15,179	$1,996	−0.2%	4	4 ($1,550)	$512	$2.80	💀
4	24	$13,246	$1,338	−33%	3	3 ($1,250)	$528	$1.80	💀 median
5	20	$11,680	$841	−58%	2	2 ($900)	$839	$1.58	💀

The divergence is dramatic. The best run earned $23,297 in revenue with 13 unique dishes and invested $2,750 in upgrades, including a tier-2 kitchen. The worst run made only $11,680 with 11 dishes and wasted $839 in expired food — the highest waste of any run. All five used the same API endpoint, the same system prompt, and the same simulation environment.
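The ROI column appears to follow directly from net worth and the $2,000 starting balance. The sketch below recomputes ROI, survival rate, average cost, and median ROI from the table. The formula `(net_worth - start) / start` is inferred from the published numbers rather than taken from a documented benchmark definition, and all variable names are illustrative.

```python
from statistics import mean, median

START_BALANCE = 2_000  # starting balance from the run configuration

# (days survived, net worth in $, API cost in $), copied from the table above
runs = [
    (30, 3378, 3.95),
    (30, 2665, 2.04),
    (22, 1996, 2.80),
    (24, 1338, 1.80),
    (20, 841, 1.58),
]

def roi_percent(net_worth, start=START_BALANCE):
    """ROI as a percentage of the starting balance (assumed definition)."""
    return (net_worth - start) / start * 100

rois = [roi_percent(nw) for _, nw, _ in runs]

# A run counts as a survivor only if it completed all 30 days.
survival_rate = sum(days == 30 for days, _, _ in runs) / len(runs)
avg_cost = mean(cost for _, _, cost in runs)
median_roi = median(rois)

print(f"survival {survival_rate:.0%}, avg cost ${avg_cost:.2f}, median ROI {median_roi:.1f}%")
```

Under this assumed definition, the median ROI across all five runs comes from Run 3, while Run 4 is the median by days survived.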

The Divergence

The most striking thing about Grok 4.20 is the chart below. Five runs, same everything — and yet you see two paths: growth or collapse.

[Figure: Net Worth, All 5 Runs (The Divergence). Series: Best (+69%), Second (+33%), bankrupt at Day 22, median bankrupt at Day 24 (dashed), worst (bankrupt at Day 20).]

Same model, same seed — yet two runs thrive while three crash. All bankruptcies end between Day 20–24 via loan defaults. The survivors diverge upward after Day 15.

For the first 10 days, the runs look remarkably similar. All five hover around $2,000–$2,800 net worth. Then around Day 12–15, the futures split. The survivors push past $3,000; the bankrupts start declining. By Day 20, the gap is over $2,000.

This is the “temperature divergence” in action. With reasoning models at temperature > 0, small stochastic differences in early decisions — which dish to add on Day 3, whether to hire that second cook on Day 5, how much to order on Day 8 — compound into completely different business trajectories. It’s a form of path dependence that no benchmark score captures.

The Survivors — What Goes Right

The best run (ROI +69%) reads like a textbook on food truck management:

13 unique dishes sold over 30 days — broad menu, diverse audience coverage
5 upgrades ($2,750 total), including kitchen_t2 on Day 11 — strategic investment when cash allowed
Zero overdraft across all 30 days
17 profitable days out of 30 (57%)
2 loans taken, both repaid on time
Best day: $2,016 revenue

The second survivor (ROI +33%) took a riskier path. It took 5 loans — the most of any run — but critically, it repaid 3 of them early. This is the key behavioral signal: the model understood loan maturity as a hard deadline and acted accordingly. It dipped into overdraft ($15.95 total) but never let it spiral. It had a near-death experience on Day 22 (balance: −$229) but recovered by Day 25 with a well-timed loan and strong sales.

The Loan Trap — Why Three Runs Die

All three bankruptcies share the same death sequence, and it’s not gradual cash depletion. The model’s daily operations are actually reasonable — revenue is $500–800/day, enough to operate. The fatal flaw is loan maturity management.

Run	Loans	Repaid Early	Defaulted	Overdraft	Death Cause
Best ✅	2	0	No	$0	—
Second ✅	5	3	No	$15.95	—
22-day 💀	4	0	Yes	$48.32	Balloon payment Day 22
24-day 💀	3	0	Yes	$42.82	Balloon payment Day 24
20-day 💀	2	0	Yes	$17.70	Balloon payment Day 20

Here’s the pattern: the surviving runs either take few loans and repay them on time, or take many loans but actively manage repayment. The bankrupt runs take loans hoping revenue will cover the balloon payment at maturity — and it never does. There’s no grace period in FoodTruck Bench: if you can’t pay the full amount on the due date, you go bankrupt instantly.
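The no-grace-period rule described above can be sketched as a simple check. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: the `Loan` class, both function names, and the dollar amounts in the usage example are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Loan:
    amount_due: float  # full balloon payment owed at maturity (hypothetical field)
    due_day: int       # simulation day on which the payment is due

def is_bankrupt(day, cash, loans):
    """True if loans mature today and cash cannot cover the total due."""
    due_today = sum(loan.amount_due for loan in loans if loan.due_day == day)
    return due_today > 0 and cash < due_today

def days_until_next_maturity(day, loans):
    """Days left before the nearest open loan matures, or None if there are none."""
    upcoming = [loan.due_day - day for loan in loans if loan.due_day >= day]
    return min(upcoming) if upcoming else None

# Illustrative numbers only: one loan with a balloon payment due on Day 22.
loans = [Loan(amount_due=1500.0, due_day=22)]
print(days_until_next_maturity(15, loans))  # 7 days of warning
print(is_bankrupt(22, 1200.0, loans))       # short on the due date: default
print(is_bankrupt(22, 1600.0, loans))       # enough cash: survives
```

An agent that checked something like `days_until_next_maturity` each morning would have a week or more of warning to build a cash buffer or repay early, which is the behavior the surviving runs showed.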

The median bankrupt run (24 days) is particularly revealing. It had 100% tool diversity: it was the only run to use every single tool in the toolbox, including supplier negotiations and days off. It was the most “creative” run in terms of strategy, but it couldn’t manage a loan repayment schedule. Intelligence without financial discipline is expensive.

The Temperature Divergence

Grok 4.20 Reasoning demonstrates something fundamental about LLM agents: consistency matters more than peak intelligence.

The difference between the best and worst run isn’t one catastrophic decision. It’s dozens of small ones:

Decision Area	Survivors	Bankrupts
Menu diversity	7–13 dishes	5–11 dishes
Upgrade investment	$1,550–$2,750	$900–$1,550
Upgrade timing	kitchen_t2 by Day 11	No tier-2 upgrades
Loan behavior	Repay on time / early	Take & hope
Days off	0	0–3
Daily revenue	$688–$777	$552–$723
Food waste	$530–$672	$512–$839

None of these individual differences look decisive. But they compound. Slightly more dishes mean slightly higher revenue. Slightly better upgrade timing means slightly more capacity at critical moments. Loan repayment discipline means you survive to Day 30 instead of crashing on Day 22.

This is why bankruptcy rate, not peak ROI, is the most important benchmark metric. A 40% survival rate means 60% of the time your agent will fail, regardless of how brilliant its best run looks.

The Price-Performance Story

This is where Grok 4.20 shines. At $2.43 per run, it’s the cheapest reasoning model with any survival on FoodTruck Bench.

Model	Cost/Run	Survival	Median ROI	Median NW	Value Verdict
Qwen 3.5 9B	$0.19	0%	−134%	−$679	Token-cheap, but bankrupt
GLM-5	~$1.00	29%	−111%	−$210	Survives rarely, burns cash
Grok 4.20	$2.43	40%	−0.2%	$1,996	Best value reasoning
GPT-5.4 Mini	$4.40	0%	−76%	$470	1.8× the cost, 0% survival
Gemini 3 Pro	~$4.50	100%	+760%	$17,199	~1.9× the cost, never bankrupt
Sonnet 4.6	~$23	100%	+771%	$17,426	10× pricier, guaranteed win
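The cost multiples relative to Grok 4.20 can be checked with a few lines of arithmetic. The sketch below uses the per-run costs from the table; the Gemini 3 Pro and Sonnet 4.6 figures are approximate in the source, so their ratios are approximate as well. The names `GROK_COST`, `other_costs`, and `cost_ratio` are illustrative.

```python
GROK_COST = 2.43  # average $ per run for Grok 4.20 Reasoning

# Per-run costs from the table; the last two are approximate ("~") in the source.
other_costs = {"GPT-5.4 Mini": 4.40, "Gemini 3 Pro": 4.50, "Sonnet 4.6": 23.0}

def cost_ratio(other, baseline=GROK_COST):
    """How many times more expensive a model is than the baseline."""
    return other / baseline

for name, cost in other_costs.items():
    print(f"{name}: {cost_ratio(cost):.2f}x Grok's cost, ${cost - GROK_COST:.2f} more per run")
```

By this arithmetic, GPT-5.4 Mini and Gemini 3 Pro both cost a little under twice as much per run as Grok 4.20, and Sonnet 4.6 costs roughly nine to ten times as much.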

[Figure: Cost per Run vs Survival Rate (% of runs completing 30 days without bankruptcy). Qwen 3.5 9B, $0.19: 0% (all bankrupt). GLM-5, ~$1.00: 29%. Grok 4.20, $2.43: 40%. GPT-5.4 Mini, $4.40: 0% (all bankrupt). Gemini 3 Pro, ~$4.50: 100%. Sonnet 4.6, ~$23: 100%.]

Grok 4.20 at $2.43/run is the cheapest reasoning model with any survival. GPT-5.4 Mini costs about 1.8× as much and never survives.

The most damning comparison for OpenAI: GPT-5.4 Mini at $4.40/run goes bankrupt 100% of the time, while Grok 4.20 at $2.43/run survives 40% of the time. Grok costs about 45% less and is the only one of the two that ever finishes a run. OpenAI’s flagship mini model can’t do what xAI’s model achieves at roughly half the cost.

Against Chinese alternatives, Grok holds up well. GLM-5 ($1.00/run) survives only 29% with a −111% median ROI. DeepSeek V3.2 ($0.50/run) has a 20% survival rate. Grok costs more but delivers better consistency.

The elephant in the room: Gemini 3 Pro costs about $2 more per run (~$4.50 vs $2.43, roughly 1.9× as much) and has a 100% survival rate with a +760% median ROI. The absolute price gap is small; the performance gap is astronomical. If you’re willing to spend $4.50 per run, there’s no reason to choose Grok over Gemini.

Grok vs GPT-5.4 Mini — The Direct Comparison

This comparison is the most favorable angle for Grok 4.20. Against OpenAI’s mini-class reasoning model, it wins on every dimension:

Metric	GPT-5.4 Mini	Grok 4.20	Winner
Cost per run	$4.40	$2.43	Grok (1.8× cheaper)
Survival rate	0%	40%	Grok (∞ better)
Median ROI	−76%	−0.2%	Grok
Best ROI	−14%	+69%	Grok
Max revenue	$9,969	$23,297	Grok (2.3×)
Best net worth	$1,730	$3,378	Grok (2×)
Menu diversity	3–5 dishes	5–13 dishes	Grok
Tool diversity	68%	82–100%	Grok
Stockout rate	94%	Moderate	Grok

GPT-5.4 Mini’s defining failure is inventory management: it consistently orders enough food for 15–30 servings when demand is 100–200. Grok 4.20 doesn’t have this problem. Its failure mode is different: loan management, not restocking. From a business intelligence standpoint, Grok understands food truck operations better. It just sometimes can’t manage its debt.

The Bigger Picture

For xAI’s flagship reasoning model, a 40% survival rate is a complicated result. It’s genuinely impressive for the price: no cheaper reasoning model on the leaderboard survives at all. But it’s not competitive with the models that xAI aspires to challenge.

There’s also the question of Grok 4.1 Fast, xAI’s non-reasoning model, which scored 100% bankruptcy across 5 runs with a median net worth of $817 and only 11 days of survival. Grok 4.20 Reasoning is a massive upgrade: 3× longer survival, 3× more revenue, and actual business viability. The reasoning mode clearly works.

Tier	Model	Cost	Survival	Median ROI
🏆 S-tier	Claude Opus 4.6	~$27	100%	+2,376%
🥈 A-tier	Gemini 3 Pro	~$4.50	100%	+760%
🥈 A-tier	Claude Sonnet 4.6	~$23	100%	+771%
B-tier	Grok 4.20 Reasoning	$2.43	40%	−0.2%
C-tier	GLM-5	~$1.00	29%	−111%
D-tier	GPT-5.4 Mini	$4.40	0%	−76%
F-tier	Grok 4.1 Fast	~$0.50	0%	−59%

Grok 4.20 sits in B-tier: capable of running a profitable business but not reliably. It’s the best model in the “sometimes survives” bracket and the cheapest reasoning model in the “ever survives” bracket. For researchers interested in cost-efficient agentic evaluation, it’s a compelling option. For anyone who needs reliability, the gap to A-tier (Gemini 3 Pro at about $2 more per run) remains the elephant in the room.

Author's Note

Grok 4.20 Reasoning deserves more credit than it typically gets. At $2.43 per run, it’s genuinely the most cost-effective reasoning model we’ve tested for complex agentic tasks. It uses tools fluently (82–100% diversity), manages memory effectively (up to 12 notes/day), and demonstrates real business thinking — adapting menus, investing in upgrades, responding to events.

The 40% survival rate is a real finding, not a cherry-picked outlier. Two out of five runs produced genuinely profitable businesses. The fatal flaw is narrow and specific: loan maturity management. A model that could track a repayment deadline would score dramatically better.

For xAI as a company, this is a foundation to build on. Grok 4.20 Reasoning already outperforms OpenAI’s comparable offering at roughly half the cost. The gap to Google and Anthropic’s best models is real, but it’s a gap in consistency, not capability. On its best day, Grok runs the food truck just fine.

This is one person’s analysis based on one benchmark. Your mileage — and your use case — may vary.

Methodology

Benchmark: FoodTruck Bench v1.0
Simulation: Business sim in Austin, TX. Fixed random seed (42), 30 days configured, $2,000 starting balance
API: xAI API via litellm (model: xai/grok-4.20-reasoning), reasoning mode enabled
Tools: 34 morning tools + 5 reflection tools (OpenAI function-calling schema)
Runs: 5 runs — 2 survived 30 days, 3 bankrupt (days 20, 22, 24)
Primary run: Run 4 (median) — 24 days, selected for leaderboard representation
Compared against: 19 other models on the leaderboard, 5 runs each
Cost tracking: Token-level billing from xAI API responses (input, output, reasoning, cached)
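
For readers unfamiliar with the OpenAI function-calling schema mentioned above, a tool definition has the general shape sketched below. The tool name `repay_loan` and its parameter are hypothetical, invented here for illustration; the benchmark's actual tool definitions are not listed in this article.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
# Only the outer structure ("type", "function", JSON Schema "parameters")
# reflects the schema itself; the name and fields are illustrative.
repay_loan_tool = {
    "type": "function",
    "function": {
        "name": "repay_loan",
        "description": "Repay an outstanding loan in full on or before its due date.",
        "parameters": {
            "type": "object",
            "properties": {
                "loan_id": {
                    "type": "string",
                    "description": "Identifier of the loan to repay.",
                },
            },
            "required": ["loan_id"],
        },
    },
}

print(json.dumps(repay_loan_tool, indent=2))
```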

Published March 2026 as part of the FoodTruck Bench project.