Case Study · March 2026 · Nicholas S.

Grok 4.20 Reasoning: The Cheapest Model That Actually Thinks

xAI’s flagship reasoning model. $2.43 average per run.
40% survival rate. +69% ROI at its peak.
The cheapest reasoning model on the leaderboard that actually builds a profitable business — when the dice roll right.

Key Findings

Five Runs, Two Fates

We ran Grok 4.20 Reasoning five times with the same configuration (seed 42, 30 days, $2,000 starting balance). The results split cleanly into two groups: two survivors and three bankruptcies.

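| # | Days | Revenue | Net Worth | ROI | Loans | Upgrades | Waste | Cost | Result |
|---|------|---------|-----------|-----|-------|----------|-------|------|--------|
| 1 | 30 | $23,297 | $3,378 | +69% | 2 | 5 ($2,750) | $672 | $3.95 | ✅ |
| 2 | 30 | $20,629 | $2,665 | +33% | 5 (3 repaid) | 4 ($1,550) | $530 | $2.04 | ✅ |
| 3 | 22 | $15,179 | $1,996 | −0.2% | 4 | 4 ($1,550) | $512 | $2.80 | 💀 |
| 4 | 24 | $13,246 | $1,338 | −33% | 3 | 3 ($1,250) | $528 | $1.80 | 💀 median |
| 5 | 20 | $11,680 | $841 | −58% | 2 | 2 ($900) | $839 | $1.58 | 💀 |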

The divergence is dramatic. The best run earned $23,297 in revenue with 13 unique dishes and invested $2,750 in upgrades, including a tier-2 kitchen. The worst run made only $11,680 with 11 dishes and wasted $839 in expired food — the highest waste of any run. All five used the same API endpoint, the same system prompt, and the same simulation environment.
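The headline figures can be rechecked from the table with a few lines of arithmetic. This is a minimal sketch that assumes ROI is computed as (final net worth − starting balance) / starting balance and that "survival" means reaching Day 30; both definitions are inferred from the numbers in the table, not taken from the benchmark's code, and the variable names are illustrative.

```python
from statistics import mean, median

START_BALANCE = 2_000  # starting balance from the run configuration

# (days survived, final net worth in USD, API cost in USD), copied from the table
RUNS = [
    (30, 3378, 3.95),
    (30, 2665, 2.04),
    (22, 1996, 2.80),
    (24, 1338, 1.80),
    (20, 841, 1.58),
]

def roi_percent(net_worth, start=START_BALANCE):
    """ROI in percent, assuming ROI = (net worth - start) / start."""
    return (net_worth - start) / start * 100

rois = [roi_percent(net_worth) for _, net_worth, _ in RUNS]
median_roi = median(rois)

# Assumed definition: a run survives if it completes all 30 days.
survival_rate = sum(days == 30 for days, _, _ in RUNS) / len(RUNS)
avg_cost = mean(cost for _, _, cost in RUNS)

print(f"ROI per run: {[round(r, 1) for r in rois]}")
print(f"Median ROI: {median_roi:.1f}%")
print(f"Survival rate: {survival_rate:.0%}")
print(f"Average cost per run: ${avg_cost:.2f}")
```

Under those assumptions the script reproduces the reported 40% survival rate, the $2.43 average cost, and the −0.2% median ROI.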

The Divergence

The most striking thing about Grok 4.20 is the chart below. Five runs, same everything — and yet you see two paths: growth or collapse.

[Figure: Net Worth, All 5 Runs (The Divergence). Series: Best (+69%); Second (+33%); bankrupt at 22 days; median bankrupt at 24 days; worst, bankrupt at 20 days.]
Same model, same seed — yet two runs thrive while three crash. All bankruptcies end between Day 20–24 via loan defaults. The survivors diverge upward after Day 15.

For the first 10 days, the runs look remarkably similar. All five hover around $2,000–$2,800 net worth. Then around Day 12–15, the futures split. The survivors push past $3,000; the bankrupts start declining. By Day 20, the gap is over $2,000.

This is the “temperature divergence” in action. With reasoning models at temperature > 0, small stochastic differences in early decisions — which dish to add on Day 3, whether to hire that second cook on Day 5, how much to order on Day 8 — compound into completely different business trajectories. It’s a form of path dependence that no benchmark score captures.
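Path dependence of this kind is easy to see in a toy model. The sketch below is not the FoodTruck Bench simulation: it is a hypothetical multiplicative random walk in which every parameter (`noise`, the uniform distribution, the use of a per-run `sample_seed` as a stand-in for sampling noise at temperature > 0) is an assumption chosen for illustration. Each day's result scales the base the next day builds on, so small early nudges carry through the whole trajectory.

```python
import random

def simulate_run(sample_seed, days=30, start=2000.0, noise=0.03):
    """Toy trajectory: net worth is nudged by a small random factor each day.

    sample_seed stands in for the model's sampling noise; everything else
    (environment, starting balance, horizon) is identical across runs.
    """
    rng = random.Random(sample_seed)
    net_worth = start
    history = [net_worth]
    for _ in range(days):
        # Multiplicative update: today's outcome becomes tomorrow's base.
        net_worth *= 1.0 + rng.uniform(-noise, noise)
        history.append(net_worth)
    return history

# Five runs with identical configuration, differing only in sampling noise.
finals = [simulate_run(seed)[-1] for seed in range(5)]
spread = max(finals) - min(finals)

for seed, final in enumerate(finals):
    print(f"run {seed}: final net worth ${final:,.0f}")
print(f"spread between best and worst: ${spread:,.0f}")
```

Even with daily nudges capped at a few percent, the five trajectories end at different net worths despite sharing the same configuration, which is the qualitative pattern the chart shows.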

The Survivors — What Goes Right

The best run (ROI +69%) reads like a textbook on food truck management:

The second survivor (ROI +33%) took a riskier path. It took 5 loans — the most of any run — but critically, it repaid 3 of them early. This is the key behavioral signal: the model understood loan maturity as a hard deadline and acted accordingly. It dipped into overdraft ($15.95 total) but never let it spiral. It had a near-death experience on Day 22 (balance: −$229) but recovered by Day 25 with a well-timed loan and strong sales.

The Loan Trap — Why Three Runs Die

All three bankruptcies share the same death sequence, and it’s not gradual cash depletion. The model’s daily operations are actually reasonable — revenue is $500–800/day, enough to operate. The fatal flaw is loan maturity management.

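| Run | Loans | Repaid Early | Defaulted | Overdraft | Death Cause |
|-----|-------|--------------|-----------|-----------|-------------|
| Best ✅ | 2 | 0 | No | $0 | n/a |
| Second ✅ | 5 | 3 | No | $15.95 | n/a |
| 22-day 💀 | 4 | 0 | Yes | $48.32 | Balloon payment Day 22 |
| 24-day 💀 | 3 | 0 | Yes | $42.82 | Balloon payment Day 24 |
| 20-day 💀 | 2 | 0 | Yes | $17.70 | Balloon payment Day 20 |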

Here’s the pattern: the surviving runs either take few loans and repay them on time, or take many loans but actively manage repayment. The bankrupt runs take loans hoping revenue will cover the balloon payment at maturity — and it never does. There’s no grace period in FoodTruck Bench: if you can’t pay the full amount on the due date, you go bankrupt instantly.
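The no-grace-period rule can be sketched in a few lines. This is an illustration of the rule as described above, not the benchmark's actual implementation: the `Loan` class, its field names, and `settle_due_loans` are hypothetical, and the dollar amounts in the usage example are invented.

```python
from dataclasses import dataclass

@dataclass
class Loan:
    balloon: float  # full amount due at maturity (hypothetical field name)
    due_day: int    # day on which the balloon payment falls due

def settle_due_loans(balance, loans, day):
    """Pay every loan maturing today in full, or report bankruptcy.

    Returns (new_balance, bankrupt). There are no partial payments and
    no grace period: one unpayable balloon ends the run immediately.
    """
    for loan in loans:
        if loan.due_day != day:
            continue  # not due today
        if balance < loan.balloon:
            return balance, True  # cannot cover the full amount
        balance -= loan.balloon
    return balance, False

# Invented numbers: a $1,100 balloon payment due on Day 22.
loans = [Loan(balloon=1100.0, due_day=22)]
print(settle_due_loans(900.0, loans, day=22))   # (900.0, True)
print(settle_due_loans(1500.0, loans, day=22))  # (400.0, False)
```

Under a rule like this, healthy daily revenue does not help unless the cash is actually on hand on the due date, which is why early repayment is such a strong survival signal.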

The median bankrupt run (24 days) is particularly revealing: it had 100% tool diversity and was the only run to use every single tool in the toolbox, including supplier negotiations and days off. It was the most “creative” run in terms of strategy. But it couldn’t manage a loan repayment schedule. Intelligence without financial discipline is expensive.

The Temperature Divergence

Grok 4.20 Reasoning demonstrates something fundamental about LLM agents: consistency matters more than peak intelligence.

The difference between the best and worst run isn’t one catastrophic decision. It’s dozens of small ones:

| Decision Area | Survivors | Bankrupts |
|---------------|-----------|-----------|
| Menu diversity | 7–13 dishes | 5–11 dishes |
| Upgrade investment | $1,550–$2,750 | $900–$1,550 |
| Upgrade timing | kitchen_t2 by Day 11 | No tier-2 upgrades |
| Loan behavior | Repay on time / early | Take & hope |
| Days off | 0 | 0–3 |
| Daily revenue | $688–$777 | $552–$723 |
| Food waste | $530–$672 | $512–$839 |

None of these individual differences look decisive. But they compound. Slightly more dishes mean slightly higher revenue. Slightly better upgrade timing means slightly more capacity at critical moments. Loan repayment discipline means you survive to Day 30 instead of crashing on Day 22.

This is why bankruptcy rate, not peak ROI, is the most important benchmark metric. A 40% survival rate means 60% of the time your agent will fail, regardless of how brilliant its best run looks.

The Price-Performance Story

This is where Grok 4.20 shines. At $2.43 per run, it’s the cheapest reasoning model with any survival on FoodTruck Bench.

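| Model | Cost/Run | Survival | Median ROI | Median NW | Value Verdict |
|-------|----------|----------|------------|-----------|---------------|
| Qwen 3.5 9B | $0.19 | 0% | −134% | −$679 | Token-cheap, but bankrupt |
| GLM-5 | ~$1.00 | 29% | −111% | −$210 | Survives rarely, burns cash |
| Grok 4.20 | $2.43 | 40% | −0.2% | $1,996 | Best value reasoning |
| GPT-5.4 Mini | $4.40 | 0% | −76% | $470 | 2× more expensive, 0% survival |
| Gemini 3 Pro | ~$4.50 | 100% | +760% | $17,199 | Same price, 190× the ROI |
| Sonnet 4.6 | ~$23 | 100% | +771% | $17,426 | 10× pricier, guaranteed win |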
[Figure: Cost per Run vs Survival Rate. Bars show survival rate (% of runs completing 30 days without bankruptcy) by model, ordered by cost per run: Qwen 3.5 9B ($0.19) 0%, GLM-5 (~$1.00) 29%, Grok 4.20 ($2.43) 40%, GPT-5.4 Mini ($4.40) 0%, Gemini 3 Pro (~$4.50) 100%, Sonnet 4.6 (~$23) 100%.]
Grok 4.20 at $2.43/run is the cheapest reasoning model with any survival. GPT-5.4 Mini costs nearly 2× as much and never survives.

The most damning comparison for OpenAI: GPT-5.4 Mini at $4.40/run goes bankrupt 100% of the time, while Grok 4.20 at $2.43 survives 40% of its runs at roughly 55% of the cost. OpenAI’s flagship mini model can’t do what xAI’s model achieves for about half the price.

Against Chinese alternatives, Grok holds up well. GLM-5 ($1.00/run) survives only 29% with a −111% median ROI. DeepSeek V3.2 ($0.50/run) has a 20% survival rate. Grok costs more but delivers better consistency.

The elephant in the room: Gemini 3 Pro costs about $2 more per run ($4.50 vs $2.43, roughly 1.9× the price) and has a 100% survival rate with a +760% median ROI. The absolute price gap is small; the performance gap is astronomical. If you’re willing to spend $4.50 per run, there’s no reason to choose Grok over Gemini.
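One way to put the price and survival columns on a single scale is the expected spend per surviving run, that is, cost per run divided by survival rate. This metric is an illustrative assumption rather than something FoodTruck Bench reports, and it deliberately ignores how profitable the surviving runs are. The figures are copied from the price-performance table; the function name is hypothetical.

```python
import math

# (cost per run in USD, survival rate) from the price-performance table
MODELS = {
    "Qwen 3.5 9B": (0.19, 0.00),
    "GLM-5": (1.00, 0.29),
    "Grok 4.20": (2.43, 0.40),
    "GPT-5.4 Mini": (4.40, 0.00),
    "Gemini 3 Pro": (4.50, 1.00),
    "Sonnet 4.6": (23.00, 1.00),
}

def cost_per_surviving_run(cost, survival):
    """Expected spend to obtain one surviving run; infinite if none survive."""
    return cost / survival if survival > 0 else math.inf

for name, (cost, survival) in MODELS.items():
    value = cost_per_surviving_run(cost, survival)
    print(f"{name:13s} ${value:,.2f} per surviving run")
```

On this measure Grok 4.20 comes out at roughly $6 per surviving run against $4.50 for Gemini 3 Pro, so the cheaper per-run price does not translate into a cheaper successful run.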

Grok vs GPT-5.4 Mini — The Direct Comparison

This comparison is the most favorable angle for Grok 4.20. Against OpenAI’s mini-class reasoning model, it wins on every dimension:

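| Metric | GPT-5.4 Mini | Grok 4.20 | Winner |
|--------|--------------|-----------|--------|
| Cost per run | $4.40 | $2.43 | Grok (1.8× cheaper) |
| Survival rate | 0% | 40% | Grok (∞ better) |
| Median ROI | −76% | −0.2% | Grok |
| Best ROI | −14% | +69% | Grok |
| Max revenue | $9,969 | $23,297 | Grok (2.3×) |
| Best net worth | $1,730 | $3,378 | Grok (2×) |
| Menu diversity | 3–5 dishes | 5–13 dishes | Grok |
| Tool diversity | 68% | 82–100% | Grok |
| Stockout rate | 94% | Moderate | Grok |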

GPT-5.4 Mini’s defining failure is inventory management: it consistently orders enough food for 15–30 servings when demand is 100–200. Grok 4.20 doesn’t have this problem. Its failure mode is different: loan management, not restocking. From a business intelligence standpoint, Grok understands food truck operations better. It just sometimes can’t manage its debt.

The Bigger Picture

For xAI’s flagship reasoning model, a 40% survival rate is a complicated result. It’s genuinely impressive for the price: the cheaper models on the leaderboard either never survive or, like GLM-5 (29%) and DeepSeek V3.2 (20%), survive less often. But it’s not competitive with the models that xAI aspires to challenge.

There’s also the question of Grok 4.1 Fast, xAI’s non-reasoning model, which scored 100% bankruptcy across 5 runs with a median net worth of $817 and only 11 days of survival. Grok 4.20 Reasoning is a massive upgrade: 3× longer survival, 3× more revenue, and actual business viability. The reasoning mode clearly works.

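| Tier | Model | Cost | Survival | Median ROI |
|------|-------|------|----------|------------|
| 🏆 S-tier | Claude Opus 4.6 | ~$27 | 100% | +2,376% |
| 🥈 A-tier | Gemini 3 Pro | ~$4.50 | 100% | +760% |
| 🥈 A-tier | Claude Sonnet 4.6 | ~$23 | 100% | +771% |
| B-tier | Grok 4.20 Reasoning | $2.43 | 40% | −0.2% |
| C-tier | GLM-5 | ~$1.00 | 29% | −111% |
| D-tier | GPT-5.4 Mini | $4.40 | 0% | −76% |
| F-tier | Grok 4.1 Fast | ~$0.50 | 0% | −59% |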

Grok 4.20 sits in B-tier: capable of running a profitable business but not reliably. It’s the best model in the “sometimes survives” bracket and the cheapest reasoning model in the “ever survives” bracket. For researchers interested in cost-efficient agentic evaluation, it’s a compelling option. For anyone who needs reliability, the gap to A-tier (Gemini 3 Pro at about $2 more per run) remains the elephant in the room.

Author's Note

Grok 4.20 Reasoning deserves more credit than it typically gets. At $2.43 per run, it’s genuinely the most cost-effective reasoning model we’ve tested for complex agentic tasks. It uses tools fluently (82–100% diversity), manages memory effectively (up to 12 notes/day), and demonstrates real business thinking — adapting menus, investing in upgrades, responding to events.

The 40% survival rate is a real finding, not a cherry-picked outlier. Two out of five runs produced genuinely profitable businesses. The fatal flaw is narrow and specific: loan maturity management. A model that could track a repayment deadline would score dramatically better.

For xAI as a company, this is a foundation to build on. Grok 4.20 Reasoning already outperforms OpenAI’s comparable offering at roughly half the cost. The gap to Google’s and Anthropic’s best models is real, but it’s a gap in consistency, not capability. On its best day, Grok runs the food truck just fine.

This is one person’s analysis based on one benchmark. Your mileage — and your use case — may vary.

Methodology

Published March 2026 as part of the FoodTruck Bench project.