Grok 4.20 Reasoning: The Cheapest Model That Actually Thinks
xAI’s flagship reasoning model. $2.43 average per run.
40% survival rate. +69% ROI at its peak.
The cheapest reasoning model on the leaderboard that actually builds a profitable business — when the dice roll right.
Key Findings
- 40% survival rate: 2 of 5 runs completed all 30 days profitably
- Best run ROI +69%: $3,378 net worth — better than most mid-tier models
- All 3 bankruptcies from loan defaults: not cash depletion, but failed balloon payments
- $2.43 average cost per run: the cheapest reasoning model we’ve tested
- 1.8× cheaper than GPT-5.4 Mini ($4.40/run) and infinitely better (40% vs 0% survival)
- High variance: same model, same seed — ROI ranges from +69% to −58%
Five Runs, Two Fates
We ran Grok 4.20 Reasoning five times with the same configuration (seed 42, 30 days, $2,000 starting balance). The results split cleanly into two groups: two survivors and three bankruptcies.
| # | Days | Revenue | Net Worth | ROI | Loans | Upgrades | Waste | Cost | Result |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 30 | $23,297 | $3,378 | +69% | 2 | 5 ($2,750) | $672 | $3.95 | ✅ |
| 2 | 30 | $20,629 | $2,665 | +33% | 5 (3 repaid) | 4 ($1,550) | $530 | $2.04 | ✅ |
| 3 | 22 | $15,179 | $1,996 | −0.2% | 4 | 4 ($1,550) | $512 | $2.80 | 💀 |
| 4 | 24 | $13,246 | $1,338 | −33% | 3 | 3 ($1,250) | $528 | $1.80 | 💀 (median bankrupt run) |
| 5 | 20 | $11,680 | $841 | −58% | 2 | 2 ($900) | $839 | $1.58 | 💀 |
The divergence is dramatic. The best run earned $23,297 in revenue with 13 unique dishes and invested $2,750 in upgrades, including a tier-2 kitchen. The worst run made only $11,680 with 11 dishes and wasted $839 in expired food — the highest waste of any run. All five used the same API endpoint, the same system prompt, and the same simulation environment.
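The headline numbers can be re-derived from the table. The sketch below is a minimal check, assuming ROI is (final net worth − starting balance) / starting balance and that "survival" means reaching Day 30. Both assumptions reproduce the figures quoted above, but the variable and function names are illustrative and not taken from the benchmark's code.

```python
from statistics import mean

STARTING_BALANCE = 2000  # dollars, from the run configuration

# (days survived, final net worth in $, API cost in $) for runs 1-5, copied from the table
runs = [
    (30, 3378, 3.95),
    (30, 2665, 2.04),
    (22, 1996, 2.80),
    (24, 1338, 1.80),
    (20, 841, 1.58),
]

def roi_percent(net_worth, start=STARTING_BALANCE):
    """Return on investment as a percentage of the starting balance (assumed formula)."""
    return (net_worth - start) / start * 100

rois = [roi_percent(net_worth) for _, net_worth, _ in runs]

# A run counts as a survivor here if it lasted the full 30 days (assumption).
survival_rate = sum(1 for days, _, _ in runs if days == 30) / len(runs)
average_cost = mean(cost for _, _, cost in runs)

print(f"ROI per run: {[f'{r:+.1f}%' for r in rois]}")
print(f"Survival rate: {survival_rate:.0%}")
print(f"Average cost per run: ${average_cost:.2f}")
```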
The Divergence
The most striking thing about Grok 4.20 is the chart below. Five runs, same everything — and yet you see two paths: growth or collapse.
For the first 10 days, the runs look remarkably similar. All five hover around $2,000–$2,800 net worth. Then around Day 12–15, the futures split. The survivors push past $3,000; the bankrupts start declining. By Day 20, the gap is over $2,000.
This is the “temperature divergence” in action. With reasoning models at temperature > 0, small stochastic differences in early decisions — which dish to add on Day 3, whether to hire that second cook on Day 5, how much to order on Day 8 — compound into completely different business trajectories. It’s a form of path dependence that no benchmark score captures.
The Survivors — What Goes Right
The best run (ROI +69%) reads like a textbook on food truck management:
- 13 unique dishes sold over 30 days — broad menu, diverse audience coverage
- 5 upgrades ($2,750 total), including kitchen_t2 on Day 11 — strategic investment when cash allowed
- Zero overdraft across all 30 days
- 17 profitable days out of 30 (57%)
- 2 loans taken, both repaid on time
- Best day: $2,016 revenue
The second survivor (ROI +33%) took a riskier path. It took 5 loans — the most of any run — but critically, it repaid 3 of them early. This is the key behavioral signal: the model understood loan maturity as a hard deadline and acted accordingly. It dipped into overdraft ($15.95 total) but never let it spiral. It had a near-death experience on Day 22 (balance: −$229) but recovered by Day 25 with a well-timed loan and strong sales.
The Loan Trap — Why Three Runs Die
All three bankruptcies share the same death sequence, and it’s not gradual cash depletion. The model’s daily operations are actually reasonable — revenue is $500–800/day, enough to operate. The fatal flaw is loan maturity management.
| Run | Loans | Repaid Early | Defaulted | Overdraft | Death Cause |
|---|---|---|---|---|---|
| Best ✅ | 2 | 0 | No | $0 | — |
| Second ✅ | 5 | 3 | No | $15.95 | — |
| 22-day 💀 | 4 | 0 | Yes | $48.32 | Balloon payment Day 22 |
| 24-day 💀 | 3 | 0 | Yes | $42.82 | Balloon payment Day 24 |
| 20-day 💀 | 2 | 0 | Yes | $17.70 | Balloon payment Day 20 |
Here’s the pattern: the surviving runs either take few loans and repay them on time, or take many loans but actively manage repayment. The bankrupt runs take loans hoping revenue will cover the balloon payment at maturity — and it never does. There’s no grace period in FoodTruck Bench: if you can’t pay the full amount on the due date, you go bankrupt instantly.
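The no-grace-period rule and the discipline it demands can be sketched in a few lines. This is an illustration only: `settle_loan_on_due_date` and `daily_reserve_needed` are hypothetical helpers, not part of FoodTruck Bench, and the sketch assumes "can't pay" means the cash balance is below the amount due on the due date.

```python
def settle_loan_on_due_date(balance, amount_due):
    """Apply the all-or-nothing rule on the due date.

    Returns (new_balance, bankrupt). Hypothetical helper for illustration.
    """
    if balance >= amount_due:
        return balance - amount_due, False
    # No partial payment and no extension: the run ends immediately.
    return balance, True

def daily_reserve_needed(balance, amount_due, days_until_due):
    """Cash that must be set aside per day to cover the balloon payment in time."""
    shortfall = max(0.0, amount_due - balance)
    if days_until_due <= 0:
        return shortfall  # already due: the whole shortfall is needed now
    return shortfall / days_until_due

# Example with made-up numbers: $400 on hand, $1,000 due in 6 days.
print(daily_reserve_needed(400, 1000, 6))    # 100.0 per day
print(settle_loan_on_due_date(900, 1000))    # (900, True): bankrupt
```

An agent that recomputes the daily reserve each morning treats maturity as a hard deadline. An agent that only checks its balance on the due date is relying on "take and hope".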
The median bankrupt run (24 days) is particularly revealing: it had 100% tool diversity — the only run to use every single tool in the toolbox, including supplier negotiations and day-offs. It was the most “creative” run in terms of strategy. But it couldn’t manage a loan repayment schedule. Intelligence without financial discipline is expensive.
The Temperature Divergence
Grok 4.20 Reasoning demonstrates something fundamental about LLM agents: consistency matters more than peak intelligence.
The difference between the best and worst run isn’t one catastrophic decision. It’s dozens of small ones:
| Decision Area | Survivors | Bankrupts |
|---|---|---|
| Menu diversity | 7–13 dishes | 5–11 dishes |
| Upgrade investment | $1,550–$2,750 | $900–$1,550 |
| Upgrade timing | kitchen_t2 by Day 11 | No tier-2 upgrades |
| Loan behavior | Repay on time / early | Take & hope |
| Days off | 0 | 0–3 |
| Daily revenue | $688–$777 | $552–$723 |
| Food waste | $530–$672 | $512–$839 |
None of these individual differences look decisive. But they compound. Slightly more dishes mean slightly higher revenue. Slightly better upgrade timing means slightly more capacity at critical moments. Loan repayment discipline means you survive to Day 30 instead of crashing on Day 22.
This is why bankruptcy rate, not peak ROI, is the most important benchmark metric. A 40% survival rate means 60% of the time your agent will fail, regardless of how brilliant its best run looks.
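To see why the survival rate dominates, consider what it implies across repeated deployments. The sketch below assumes runs are independent and that the 40% observed over five runs is the true rate. With so few runs that is a rough estimate, so treat the outputs as illustrative.

```python
def prob_all_survive(survival_rate, n_runs):
    """Probability that every one of n independent runs survives."""
    return survival_rate ** n_runs

def prob_at_least_one_bankruptcy(survival_rate, n_runs):
    """Probability that at least one of n independent runs goes bankrupt."""
    return 1 - prob_all_survive(survival_rate, n_runs)

for n in (1, 2, 3, 5):
    p_fail = prob_at_least_one_bankruptcy(0.40, n)
    print(f"{n} run(s): {p_fail:.1%} chance of at least one bankruptcy")
```

At a 40% survival rate, three runs in a row all survive only about 6% of the time. A model with 100% survival gives the same answer every time, whatever its peak ROI.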
The Price-Performance Story
This is where Grok 4.20 shines. At $2.43 per run, it’s the cheapest reasoning model with any survival on FoodTruck Bench.
| Model | Cost/Run | Survival | Median ROI | Median NW | Value Verdict |
|---|---|---|---|---|---|
| Qwen 3.5 9B | $0.19 | 0% | −134% | −$679 | Token-cheap, but bankrupt |
| GLM-5 | ~$1.00 | 29% | −111% | −$210 | Survives rarely, burns cash |
| Grok 4.20 | $2.43 | 40% | −0.2% | $1,996 | Best value reasoning |
| GPT-5.4 Mini | $4.40 | 0% | −76% | $470 | 1.8× more expensive, 0% survival |
| Gemini 3 Pro | ~$4.50 | 100% | +760% | $17,199 | 1.9× the price, 100% survival |
| Sonnet 4.6 | ~$23 | 100% | +771% | $17,426 | 10× pricier, guaranteed win |
The most damning comparison for OpenAI: GPT-5.4 Mini at $4.40/run goes bankrupt 100% of the time. Grok 4.20 at $2.43 survives 40% of its runs for a little over half the cost. OpenAI’s mini model cannot do at $4.40 what xAI’s model sometimes manages at $2.43.
Against Chinese alternatives, Grok holds up well. GLM-5 ($1.00/run) survives only 29% with a −111% median ROI. DeepSeek V3.2 ($0.50/run) has a 20% survival rate. Grok costs more but delivers better consistency.
The elephant in the room: Gemini 3 Pro at under twice the cost ($4.50 vs $2.43) has a 100% survival rate with +760% ROI. The price gap is about $2 per run; the performance gap is astronomical. If you’re willing to spend $4.50 per run, there’s no reason to choose Grok over Gemini.
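One way to put price and reliability on the same scale is cost per surviving run: the per-run cost divided by the survival rate. This metric is not part of the benchmark. It is an assumption introduced here for illustration, and it ignores how profitable the surviving runs are.

```python
import math

# (cost per run in $, survival rate) from the table above; "~" costs are approximate
models = {
    "Grok 4.20": (2.43, 0.40),
    "GPT-5.4 Mini": (4.40, 0.00),
    "Gemini 3 Pro": (4.50, 1.00),
}

def cost_per_surviving_run(cost, survival):
    """Expected spend to obtain one run that reaches Day 30."""
    if survival == 0:
        return math.inf  # no amount of spending yields a survivor
    return cost / survival

for name, (cost, survival) in models.items():
    print(f"{name}: ${cost_per_surviving_run(cost, survival):.2f} per surviving run")

price_ratio = models["Gemini 3 Pro"][0] / models["Grok 4.20"][0]
print(f"Gemini 3 Pro costs {price_ratio:.2f}x as much per run as Grok 4.20")
```

Under these assumptions Grok works out to roughly $6 per surviving run against $4.50 for Gemini 3 Pro, so the cheaper sticker price does not survive the reliability adjustment.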
Grok vs GPT-5.4 Mini — The Direct Comparison
This comparison is the most favorable angle for Grok 4.20. Against OpenAI’s mini-class reasoning model, it wins on every dimension:
| Metric | GPT-5.4 Mini | Grok 4.20 | Winner |
|---|---|---|---|
| Cost per run | $4.40 | $2.43 | Grok (1.8× cheaper) |
| Survival rate | 0% | 40% | Grok (∞ better) |
| Median ROI | −76% | −0.2% | Grok |
| Best ROI | −14% | +69% | Grok |
| Max revenue | $9,969 | $23,297 | Grok (2.3×) |
| Best net worth | $1,730 | $3,378 | Grok (2×) |
| Menu diversity | 3–5 dishes | 5–13 dishes | Grok |
| Tool diversity | 68% | 82–100% | Grok |
| Stockout rate | 94% | Moderate | Grok |
GPT-5.4 Mini’s defining failure is inventory management: it consistently orders enough food for 15–30 servings when demand is 100–200. Grok 4.20 doesn’t have this problem. Its failure mode is loan management, not restocking. From a business intelligence standpoint, Grok understands food truck operations better. It just sometimes can’t manage its debt.
The Bigger Picture
For xAI’s flagship reasoning model, a 40% survival rate is a complicated result. It’s genuinely impressive for the price: no cheaper model in this comparison survives as often. But it’s not competitive with the models that xAI aspires to challenge.
There’s also the question of Grok 4.1 Fast, xAI’s non-reasoning model, which scored 100% bankruptcy across 5 runs with a median net worth of $817 and only 11 days of survival. Grok 4.20 Reasoning is a massive upgrade: roughly 2–3× longer survival (20–30 days versus 11), 3× more revenue, and actual business viability. The reasoning mode clearly works.
| Tier | Model | Cost | Survival | Median ROI |
|---|---|---|---|---|
| 🏆 S-tier | Claude Opus 4.6 | ~$27 | 100% | +2,376% |
| 🥈 A-tier | Gemini 3 Pro | ~$4.50 | 100% | +760% |
| 🥈 A-tier | Claude Sonnet 4.6 | ~$23 | 100% | +771% |
| B-tier | Grok 4.20 Reasoning | $2.43 | 40% | −0.2% |
| C-tier | GLM-5 | ~$1.00 | 29% | −111% |
| D-tier | GPT-5.4 Mini | $4.40 | 0% | −76% |
| F-tier | Grok 4.1 Fast | ~$0.50 | 0% | −59% |
Grok 4.20 sits in B-tier: capable of running a profitable business but not reliably. It’s the best model in the “sometimes survives” bracket, ahead of the cheaper GLM-5 (29%) and DeepSeek V3.2 (20%). For researchers interested in cost-efficient agentic evaluation, it’s a compelling option. For anyone who needs reliability, the gap to A-tier (Gemini 3 Pro at under twice the cost) remains the elephant in the room.
Author's Note
Grok 4.20 Reasoning deserves more credit than it typically gets. At $2.43 per run, it’s genuinely the most cost-effective reasoning model we’ve tested for complex agentic tasks. It uses tools fluently (82–100% diversity), manages memory effectively (up to 12 notes/day), and demonstrates real business thinking — adapting menus, investing in upgrades, responding to events.
The 40% survival rate is a real finding, not a cherry-picked outlier. Two out of five runs produced genuinely profitable businesses. The fatal flaw is narrow and specific: loan maturity management. A model that could track a repayment deadline would score dramatically better.
For xAI as a company, this is a foundation to build on. Grok 4.20 Reasoning already outperforms OpenAI’s comparable offering at a little over half the cost. The gap to Google and Anthropic’s best models is real, but it’s a gap in consistency, not capability. On its best day, Grok runs the food truck just fine.
This is one person’s analysis based on one benchmark. Your mileage — and your use case — may vary.