Name: FoodTruck Bench
Author: Nicholas S.

About

Why This Benchmark

What this project tests, why it's playable, and who built it.

Why This Benchmark Exists

Standard benchmarks measure knowledge — MMLU, HumanEval, SWE-bench. They tell you if a model knows things. But knowing and doing are different skills.

FoodTruck Bench tests something else: can an AI make consistent business decisions under uncertainty? Not one perfect answer — thirty days of imperfect ones. Location, menu, pricing, inventory, staffing, loans — all at once, all with consequences that carry over.

This is the kind of cognitive load that doesn't appear in multiple-choice tests. Every decision creates a new situation. Skip a day of ordering and you have no ingredients. Overprice and customers leave. Hire the wrong staff and your capacity drops. The simulation doesn't forgive.

Why You Can Play It

The benchmark is playable because the comparison only means something if you can feel it. Numbers on a leaderboard say "GPT-5.2 made $19,000." Playing the same simulation yourself says how that feels.

"This is not a food truck simulator game. This is an AI benchmark — where you're the benchmark."

Leaderboard

Current Rankings

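Surviving models are ranked by net worth; bankrupt models sit below the bankruptcy line, ordered by how many days they lasted. Each model was run 5 times, and the median run is shown. Starting balance: $2,000. Duration: 30 days.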

#	Model	Status	Net Worth	ROI	Margin	Days	Revenue	Profit	Balance
🥇	GPT-5.5	Finished	$61,408	+2970%	65%	30d	$93,959	+$60,706	$54,836
🥈	Claude Opus 4.6	Finished	$49,519	+2376%	61%	30d	$79,921	+$48,431	$43,642
🥉	GPT-5.2	Finished	$28,081	+1304%	52%	30d	$55,275	+$28,659	$21,700
4	Grok 4.3	Finished	$27,880	+1294%	47%	30d	$61,467	+$28,661	$16,078
5	DeepSeek V4 Pro	Finished	$27,142	+1257%	51%	30d	$52,139	+$26,492	$20,944
6	Gemma 4 31B	Finished	$24,878	+1144%	46%	30d	$57,209	+$26,153	$14,962
7	MiMo V2.5 Pro	Finished	$22,388	+1019%	43%	30d	$50,739	+$22,022	$15,496
8	Gemini 3.5 Flash (New)	Finished	$22,311	+1016%	41%	30d	$51,010	+$21,068	$15,888
9	Claude Sonnet 4.6	Finished	$17,426	+771%	41%	30d	$39,280	+$15,954	$12,107
10	Gemini 3 Pro	Finished	$17,199	+760%	41%	30d	$41,652	+$17,152	$10,346
11	Qwen 3.6 35B-A3B (New)	Finished	$15,317	+666%	42%	30d	$40,950	+$17,374	$9,356
12	Gemini 3.1 Pro CT	Finished	$12,736	+537%	28%	30d	$45,744	+$12,694	$4,212
13	Qwen 3.6 Plus	Finished	$7,668	+283%	26%	30d	$26,008	+$6,697	$2,688
14	DeepSeek V4 Flash	Finished	$5,504	+175%	9%	30d	$28,716	+$2,571	$416
15	Gemma 4 26B A4B ⚠️	Finished	$4,386	+119%	1%	30d	$20,091	+$284	$1,197
16	Claude Sonnet 4.5	Finished	$1,388	-31%	-1%	30d	$10,753	-$145	$1,338
💀 Bankruptcy Line
17	GLM 5	Bankrupt	-$210	-111%	-23%	Day 28 💀	$11,965	-$2,705	-$405
18	Qwen 3.5 397B	Bankrupt	-$218	-111%	-30%	Day 25 💀	$8,553	-$2,535	-$491
19	Grok 4.20 Reasoning	Bankrupt	$1,338	-33%	-7%	Day 24 💀	$13,246	-$976	-$371
20	DeepSeek V3.2	Bankrupt	$2,058	+3%	-8%	Day 22 💀	$9,531	-$774	$165
21	Kimi K2.5	Bankrupt	$30	-99%	-79%	Day 22 💀	$3,475	-$2,741	-$522
22	GPT OSS 120B	Bankrupt	$92	-95%	-84%	Day 21 💀	$2,293	-$1,914	-$186
23	MiniMax M2.5	Bankrupt	-$317	-116%	-77%	Day 21 💀	$2,668	-$2,061	-$377
24	Mimo-v2-omni	Bankrupt	$598	-70%	-37%	Day 19 💀	$5,307	-$1,939	-$329
25	GPT-5.4 Mini	Bankrupt	$470	-76%	-37%	Day 19 💀	$6,209	-$2,270	-$315
26	Nemotron-3 Super 120B	Bankrupt	$962	-52%	-52%	Day 16 💀	$2,982	-$1,540	$129
27	Qwen 3.5 9B	Bankrupt	-$679	-134%	-97%	Day 15 💀	$3,443	-$3,328	-$907
28	Claude Haiku 4.5	Bankrupt	$166	-92%	-121%	Day 14 💀	$1,983	-$2,408	-$49
29	Grok 4.1 Fast	Bankrupt	$817	-59%	-36%	Day 11 💀	$5,034	-$1,785	-$40
30	GPT-5 Mini	Bankrupt	$50	-98%	-151%	Day 11 💀	$1,723	-$2,602	-$258
31	Qwen3 VL 235B	Bankrupt	-$525	-126%	-145%	Day 11 💀	$1,838	-$2,659	-$1,089

📊 Each model is evaluated across 5 runs with identical conditions: same seed, weather, events, competitors, and market. The only variable is the model's own decisions. The median run (by net worth) is selected.

⚠ Gemini 3 Flash is not listed: it enters infinite decision loops and cannot complete the simulation.

⚠ Gemma 4 26B A4B is listed with a warning marker (⚠️): it is the only model that required multi-stage JSON output sanitization to produce valid tool calls. Business decisions are unmodified; only JSON formatting was corrected.
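The selection rule and the ROI column can be reproduced with a few lines. The sketch below assumes ROI is measured against the $2,000 starting balance, which matches the leaderboard figures after rounding; the five run values are made up for illustration and are not real benchmark data, and the function names are hypothetical.

```python
from statistics import median_low

STARTING_BALANCE = 2_000  # starting balance stated on the leaderboard


def roi_percent(net_worth: float, start: float = STARTING_BALANCE) -> float:
    """Return on investment relative to the starting balance, in percent."""
    return (net_worth - start) / start * 100


def select_median_run(net_worths: list[float]) -> float:
    """Pick the middle run by net worth (an actual run, never an average)."""
    return median_low(net_worths)


# Hypothetical net worths for five runs of one model (illustrative only).
runs = [41_250.0, 61_408.0, 58_900.0, 66_010.0, 63_775.0]
selected = select_median_run(runs)
print(selected, f"{roi_percent(selected):+.1f}%")
```

With an odd number of runs the median is always one of the actual runs, so every other column on the leaderboard row can come from that same run.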

Performance

Performance Over Time

Selected median run of each model. Compare net worth, revenue, and profit trajectories.

[Interactive chart: net worth, revenue, and profit trajectories for each model's median run, with an Oracle reference line; filterable by Survivors (16) and Bankrupt (15).]

From the blog

Latest Case Studies

Day-by-day model breakdowns, head-to-head comparisons, and the patterns behind the leaderboard numbers.

Case Study · May 2026

GPT-5.5 Takes The Top Of FoodTruck Bench From Claude Opus 4.6

OpenAI's GPT-5.5 ends Opus 4.6's run at #1: $61,408 median net worth (+24%), $24.63 per run (32% cheaper API cost). Day-level deep dive — fewer servings at higher prices, debt as growth capital, and the Day 21 industrial-zone trap that catches GPT-5.5 and GPT-5.2 but not Opus.


Case Study · May 2026

DeepSeek V4 Pro: The First Chinese Model At The Frontier

First Chinese model in the frontier ROI tier. 5/5 runs, +1,257% median ROI, $27,142 net worth — head-to-head vs Grok 4.3 Latest, Opus 4.6, GPT-5.2, Sonnet 4.6, and Gemma 4 31B.


Case Study · April 2026

Qwen 3.6 Plus: The First Chinese Model That Actually Survives

First Chinese model to clear all 5 runs. +283% median ROI, zero-loan growth, real geography learning, and a clean operational jump over Qwen 3.5 397B and GLM-5.


Comparison

Key Metrics

Side-by-side comparison of net worth, ROI, profit margin, and food waste.

[Interactive charts: Net Worth, ROI, Net Profit Margin, and Food Waste Cost, each showing the selected run (median by net worth); filterable by Survivors (16) and Bankrupt (15).]

Deep Dive

Model Economics

Revenue, expenses, and profit breakdown for a single model. Select a model to explore its daily financials.

Example selection, 🥇 #1 on Leaderboard: Net Worth $61,408.09 · ROI 2970% · Margin 65% · Days 30

[Interactive chart: daily Revenue, Expenses, and Profit for the selected model.]

Notable

Notable Findings

Unexpected behaviors and observations from the benchmark runs.

⚠️ Missing from Leaderboard

Gemini 3 Flash — Infinite Decision Loop

One of the most popular AI models cannot complete FoodTruck Bench. With extended thinking enabled, it makes 3–5 tool calls on Day 0, then enters an infinite reasoning loop — endlessly deliberating without ever committing to a decision. It never starts trading.

This is why Gemini 3 Flash does not appear in the leaderboard — it simply cannot function within the simulation's decision framework.


🧠 Strategic Behavior

Opus's $1.72 Total Waste — Across 30 Days

Claude Opus 4.6 wasted $1.72 in ingredients across the entire 30-day median run, and that was its worst result. In other simulations, waste was exactly $0.00. Meanwhile, it generated $79,921 in revenue. GPT-5.2 (currently 3rd place) wasted 75× more.

Designed to Help

Loans Were a Lifeline — Every Model Drowned

The loan system was added after early simulations revealed how many models spiral into bankruptcy. The idea: give struggling agents a second chance — a small credit line to recover and apply lessons learned. Instead, every single model that took a loan went bankrupt. 8 out of 8. The 4 models that never borrowed all survived. Loans didn't save anyone — they just delayed the inevitable.

👻 Ghost Truck

Haiku's 6 Days of Zero Revenue

Claude Haiku 4.5 opened for business on Day 6 — and nobody came. Then Day 7. Day 8. Day 9. Day 10. Day 11. Six consecutive working days with $0 in revenue while paying $274–370 per day in fixed costs. The truck was open, the kitchen was running, but the model couldn't attract a single customer.

No Learning Curve

Sonnet 4.5 — 30 Days Without Progress

On Day 3, Claude Sonnet 4.5 earned $830 and served 119 customers. On Day 28 — $12 revenue, 2 customers served. After 30 days of operation it finished with 12 losing days, 0 upgrades purchased, and -30.6% ROI. Revenue didn't grow — it decayed, averaging -71% from first week to last.

📍 Pattern

Location Intelligence Predicts Performance

Claude Opus 4.6 used only 2 locations across 30 days — downtown (72%) and waterfront (28%) — found the best spots and committed. Grok 4.1 Fast parked in the industrial zone for 82% of its run. DeepSeek V3.2 chose industrial 45% + university 32%. The top performers discovered profitable locations early and stopped experimenting; the rest kept guessing.

Conclusions

Key Takeaways

What 12 models, 30 days, and $24,000 in starting capital taught us about AI decision-making.

👑$49.5K net worth

Claude Opus 4.6 Dominates Through Capital Allocation

Claude Opus 4.6 reached $49,519 net worth (+2376% ROI) by treating upgrades as investments and staff as operating expense. It purchased all 8 available truck upgrades (one-time cost, compounding ROI) while keeping staff lean. Strategic days off on low-demand days saved $100+ each. Premium pricing — $16 chicken wings, $9.50 burrito bowls — with near-zero waste ($1.72 total across 30 days).

💀8 of 12 bankrupt

Two Thirds Go Bankrupt

Only 4 of 12 models survived the full 30 days, and one of those barely survived, finishing at -30.6% ROI. Most bankruptcies happen between Day 10 and Day 22. Fixed costs of $55/day drain cash relentlessly. The simulation kills passive strategies: even taking a day off costs $55 in non-negotiable lease, insurance, and commissary fees.

🗑️$1.72 vs $1,192 waste

Inventory Is the #1 Predictor of Survival

Food waste is the clearest dividing line between survivors and bankruptcies. Models with under $200 in waste survived. Every model above $400 went bankrupt, with one exception: Gemini 3 Pro wasted $1,192 but survived on brute-force revenue. Opus, at the other extreme, wasted $1.72 total. Below Gemini 3 Pro's revenue level, that much waste is fatal.

👥5+ staff (survivors) vs 1–3 (bankrupt)

Staff Timing Predicts Survival

Claude Opus 4.6 and GPT-5.2 hired their first staff on Day 0–1. Every surviving model had 5+ staff by Day 17. Bankrupt models typically hired 1–3 people total, often too late. More staff = more capacity = more revenue per day. The models that understood this early compounded their advantage before fixed costs could drain them.

📈8/8 upgrades (Opus) vs 0 (bankrupt)

Early Upgrades Compound Into Dominance

Claude Opus 4.6 and GPT-5.2 purchased upgrades from Day 0–1: marketing signage, kitchen equipment, capacity boosts. These are one-time costs that permanently increase demand or capacity. 6 of 12 models bought zero upgrades. Every model that bought upgrades before Day 5 survived. Those that didn't all went bankrupt, except Claude Sonnet 4.5, which barely survived.

Author's Take

What I Learned

Personal observations from running dozens of simulations across 12 frontier models.

I tested over 20 frontier models through the same 30-day simulation — dozens of runs per model for statistical confidence. Here's what stood out — what the numbers don't fully capture.

📈

The Generational Leap Is Real

Previous-generation flagships, models that dominated benchmarks months ago, can't survive this simulation. Gemini 2.5 Pro, the former LMSYS #1, goes bankrupt around Days 11–13. The gap isn't incremental; it's a different tier of agentic reasoning. Old models know what to do; new ones know when, how, and when not to.

🎯

Consistency Separates the Tiers

Raw peak performance is misleading. Gemini 3 Pro's best single run actually outscores GPT-5.2's median — but its worst drops to $11K. Opus's worst run still outperforms GPT-5.2's best by a wide margin. The leaderboard ranks by median, not peak. That's where discipline shows.

🐉

The Chinese Model Gap Is Closing Fast

GLM 5 is now the strongest Chinese model tested — placed #5, survived 28 of 30 days, and outscored DeepSeek V3.2 in revenue ($11,965 vs $9,531). DeepSeek still leads in peak performance and net worth at bankruptcy (+$2,058 vs -$210). Both went bankrupt, but through opposite failure modes: GLM 5 bled out slowly from overstaffing, DeepSeek crashed fast from over-investment.

Current Chinese models already outperform previous-generation Western flagships. The gap is measurably shrinking. I expect them to consistently pass the simulation within ~6 months.

⚡

Coding Skill ≠ Business Skill

Sonnet 4.5 is one of the best coding models available — yet it barely survived with -30.6% ROI and zero upgrades purchased across 30 days. It never learned to invest. This benchmark tests sustained multi-step strategic reasoning across interdependent variables — a fundamentally different cognitive load than producing correct code.

🔄

More Activity ≠ Better Outcomes

Grok 4.1 Fast made 32 tool calls per day — more than any other model. Hired 6 staff, visited 3 locations, generated $5K in revenue. Still went bankrupt on Day 11. Meanwhile, Opus made focused, deliberate calls and took 2 days off entirely. Information gathering without strategic filtering is just expensive noise.

⚖️

Every AI Run is a First Attempt

Every AI run is a fresh start — no memory of previous attempts. If you've played the simulation before, you already have an unfair advantage over any AI model. The Random Seed mode exists specifically for fairer competition.

Superlatives

Awards & Highlights

The most memorable performances — from spectacular profits to spectacular failures.

🏆

Most Profitable

GPT-5.5

+2970% ROI

Highest return on investment across all models

🗑️

Most Wasteful

Gemma 4 31B

$4,675 wasted

Highest food waste — money literally thrown in the trash

🍔

Most Servings

GPT-5.2

8,187 served

Fed the most customers across the entire run

👻

Ghost Truck

Claude Haiku 4.5

6 days at $0

Most days with zero revenue — opened the truck but nobody came

💎

Premium Menu

DeepSeek V4 Pro

$19 Wagyu Smash Burger

Charged the highest price for a single dish — and sold 15 of them

🤖

Tool Inventor

Gemini 3.5 Flash

78 fake tools

Tried to use tools that don't exist — creative but ineffective

🧑‍🍳

Chef Inventor

Gemini 3.5 Flash

5 custom recipes

Created the most original dishes — culinary creativity unleashed

💰

Loan Addict

DeepSeek V4 Flash

5 loans taken

Borrowed the most — living on credit to keep the dream alive

🔨

Upgrade Master

GPT-5.5

8 upgrades

Invested the most in truck upgrades — scaling through infrastructure

🚪

Revolving Door

Nemotron-3 Super 120B

9 hired, 6 fired

Highest staff turnover — HR nightmare on wheels

🤝

Dealmaker

Gemini 3.5 Flash

48 negotiations, 22 deals

Most active negotiator — always looking for a better price

⚡

Hyperactive

Gemini 3.5 Flash

42 calls/day

Most tool calls per day — analyzing everything before every decision

In Their Own Words

What the Models Said

Strategic notes and reflections written by AI agents during their runs.

Interactive

Play the Simulation

Manage the same food truck, use the same tools, face the same market. Can you beat the AI?

You run the same simulation that AI models run. Same 34 tools. Same 12-factor demand model. Same 30 days. Every decision is yours.

📋 Daily Operations

Build your menu from 20+ recipes and set prices
Order ingredients with real shelf life and spoilage
Choose the best location for today's weather and events
Check weather forecasts and upcoming events

👥 Team & Growth

Hire and fire staff — cooks, cashiers, all-rounders
Upgrade your food truck for more capacity
Take loans when you need capital, repay before default
Negotiate with premium suppliers for better ingredients

Can you beat the AI? Your final net worth goes directly against every model on the leaderboard. Same rules, same market. Read the Start Guide carefully — the AI reads it every turn. Plan your purchases, track your margins, reflect on what worked. Approach it methodically and you can beat Opus on your first try.

⚖️ Fair play note: AI models always play for the first time — no memory of previous attempts. If you've played before, you already have an advantage. Use Random Seed for fairer competition.


Free · No signup required


Methodology

How the Simulation Works

Every mechanic the AI agent interacts with — demand model, economics, reputation, suppliers, upgrades, loans, and more.

📅

Daily Simulation Loop

Each day follows a morning → simulation → reflection cycle. The conversation resets every morning — no accumulated message history. Instead, the agent builds its own Knowledge Base: a personal diary with the last 14 days of records, location analytics, menu performance, and free-form notes. Only what the agent chose to write down carries forward.

The AI doesn't know when the simulation ends — or even that it's a simulation. It operates as if running a real, permanent business. There's no countdown, no “days remaining”. The model knows today's date but has no concept of a finish line — just like a real entrepreneur.

☀️ Morning Phase

Ingredient deliveries arrive, expired items removed (FIFO)
Staff reliability check — no-shows tracked and reported
Fresh context: system prompt + Knowledge Base + morning briefing
Agent gets up to 10 rounds of tool calls (can batch multiple tools per round)
Inner loop: call tools → get JSON responses → think → call more tools
Ends the phase by calling wait_for_next_day
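The morning loop described above can be sketched as a bounded round loop. This is a minimal illustration, not the benchmark's actual harness: `decide` and `execute` are hypothetical callables standing in for the model and the simulation engine, and how the real harness treats calls that follow `wait_for_next_day` in the same batch is an assumption.

```python
from typing import Callable

MAX_MORNING_ROUNDS = 10  # "up to 10 rounds of tool calls" per the text

ToolCall = dict  # e.g. {"name": "get_weather_forecast", "args": {"days": 3}}


def run_morning_phase(
    decide: Callable[[list[dict]], list[ToolCall]],
    execute: Callable[[ToolCall], dict],
) -> list[dict]:
    """Run rounds of batched tool calls until the agent ends the day."""
    transcript: list[dict] = []
    for _ in range(MAX_MORNING_ROUNDS):
        batch = decide(transcript)  # the model may batch several tools
        for call in batch:
            if call["name"] == "wait_for_next_day":
                return transcript  # agent ended the phase itself (assumed)
            transcript.append({"call": call, "result": execute(call)})
    return transcript  # round budget exhausted


# Stub policy: check the weather once, then end the morning.
scripted = iter([
    [{"name": "get_weather_forecast", "args": {"days": 3}}],
    [{"name": "wait_for_next_day", "args": {}}],
])
log = run_morning_phase(lambda t: next(scripted), lambda c: {"ok": True})
print(len(log))
```

The hard cap matters for scoring: a model that keeps gathering information without committing simply runs out of rounds.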

⚙️ Simulation Phase

12-factor demand formula calculates customer demand per dish
Capacity limits applied (staff throughput, working hours)
Ingredients consumed (FIFO), stockouts tracked
Revenue, expenses, profit calculated
Reputation updated: Google Maps rating, reviews, loyalty per location
Staff gain XP, may level up skills

💭 Reflection Phase

Agent receives day results: revenue, waste, stockouts, demand signals
Only 5 memory tools: scratchpad (read/write), key-value store, wait
Agent analyzes results and writes strategic notes for next day
Knowledge Base automatically updated with day's analytics

🧠

Agent Memory System

Instead of growing message history, the agent uses a structured Knowledge Base that stays within ~10-20K tokens per day, regardless of simulation length.

📊 Daily Records

Last 14 days of financials, sales, demand signals — auto-pruned FIFO

📍 Location Insights

Visits, avg profit, best/worst day, Google Maps rating per location

🍔 Menu Performance

Total sold, stockout rate, profit margin, best location for each dish

📝 Scratchpad

Free-form text notes written during reflection, persistent across days

🗄️ Key-Value Store

Structured data storage — agent can save/retrieve any key-value pair

📈 Demand Signals

Raw demand, capacity used %, queue status, unmet demand, avg wait time
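A bounded memory like this is easy to picture as a fixed-length queue plus free-form storage. The sketch below is an assumption about shape only: the class and field names are hypothetical, and the real Knowledge Base holds far more detail per day.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    # Keeps only the most recent 14 daily records; older ones drop off (FIFO).
    daily_records: deque = field(default_factory=lambda: deque(maxlen=14))
    scratchpad: str = ""                          # free-form notes
    kv_store: dict = field(default_factory=dict)  # structured key-value data

    def record_day(self, day: int, revenue: float, waste: float) -> None:
        self.daily_records.append(
            {"day": day, "revenue": revenue, "waste": waste}
        )


kb = KnowledgeBase()
for day in range(1, 21):  # simulate 20 days of records
    kb.record_day(day, revenue=100.0 * day, waste=0.0)

# After 20 days only days 7 through 20 remain.
print(kb.daily_records[0]["day"], len(kb.daily_records))
```

Because the record window is capped, anything the agent wants to remember beyond 14 days has to be written to the scratchpad or key-value store deliberately.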

📊

12-Factor Demand Model

Customer demand is calculated per dish using 12 multiplicative factors. After the raw demand is computed, it's further constrained by staff capacity and ingredient availability. The agent never sees the formula directly — only the resulting sales and demand signals.

demand = base × price × location × weather × time × day_of_week × events × variety × competition × reputation × popularity × noise

Base Demand: Inherent dish popularity (burgers vs. niche items)

Price Elasticity: How pricing vs. reference price affects demand

Location × Audience: Dish-audience affinity (tacos near university, fine food at brewery)

Weather: Rain drops demand, sunny boosts it. Austin TX monthly patterns

Peak Hours: Dish fit vs. location's peak time (lunch, dinner, evening)

Day of Week: Weekend vs. weekday traffic per location

Events: Festivals, markets: traffic multiplier + vendor fees

Menu Variety: Too few or too many items penalize demand

Competition: Market share, menu overlap, rating differential

Reputation: Loyalty, Google Maps rating, review volume (logarithmic)

Dish Popularity: Organic growth/decay based on fulfillment + quality

Noise: Optional random variance (off by default; all runs are deterministic with seed 42)
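To make the multiplicative structure concrete, here is a minimal sketch of the formula followed by the capacity and ingredient constraints. The factor values are invented for illustration, the function names are hypothetical, and truncating to whole servings is an assumption; the real engine's factor calculations are not published here.

```python
from math import prod


def raw_demand(factors: dict[str, float]) -> float:
    """Multiply all twelve factors together."""
    return prod(factors.values())


def servings_sold(factors: dict[str, float], capacity: int, stock: int) -> int:
    """Raw demand capped by staff capacity and ingredient availability."""
    return int(min(raw_demand(factors), capacity, stock))


# Illustrative values only: a dish on a festival day with strong competition.
factors = {
    "base": 40.0, "price": 0.9, "location": 1.2, "weather": 1.1,
    "time": 1.0, "day_of_week": 1.0, "events": 1.5, "variety": 1.0,
    "competition": 0.8, "reputation": 1.05, "popularity": 1.0, "noise": 1.0,
}
demand = raw_demand(factors)
sold = servings_sold(factors, capacity=50, stock=45)
print(round(demand, 2), sold, round(demand - sold, 2))  # last value: unmet demand
```

Because the factors multiply, a single weak factor (say, a bad location match) drags down everything else, and strong demand is still wasted if capacity or stock is the binding limit.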

💰

Economics & Scoring

Agents start with $2,000 and must manage a real cost structure. Primary metric is net worth (cash + inventory value) at simulation end. Persistent negative balance leads to bankruptcy.

Fixed Daily Costs

Truck lease: $30/day

Insurance: $5/day

Commissary: $20/day

Total: $55/day (even on days off)

Variable Costs

Fuel: $25/trip

Location fee: $0–50/day

Event vendor fee: varies

Staff wages: hourly × 8h

Ingredients: at order time

Scoring Metrics

Net Worth (primary)

ROI vs starting balance

Profit margins, food waste

Days profitable, servings sold
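The cost structure above adds up as follows. This sketch assumes one trip per operating day and that wages, fuel, and location fees are only incurred on days the truck opens; those assumptions, the function name, and the example wage rates are mine, while the dollar constants come from the lists above.

```python
FIXED_DAILY = {"truck_lease": 30, "insurance": 5, "commissary": 20}
FUEL_PER_TRIP = 25
SHIFT_HOURS = 8


def daily_expenses(
    open_today: bool,
    location_fee: float = 0.0,
    hourly_wages: tuple[float, ...] = (),
    ingredient_orders: float = 0.0,
    event_fee: float = 0.0,
) -> float:
    """Total cash out for one day under the assumptions stated above."""
    total = float(sum(FIXED_DAILY.values()))  # $55, charged even on days off
    if open_today:
        total += FUEL_PER_TRIP + location_fee + event_fee
        total += sum(wage * SHIFT_HOURS for wage in hourly_wages)
    total += ingredient_orders  # ingredients are paid for at order time
    return total


print(daily_expenses(open_today=False))
print(daily_expenses(open_today=True, location_fee=50, hourly_wages=(15, 12)))
```

A day off still costs $55, while a downtown day with two hypothetical staff at $15 and $12 per hour costs $346 before a single ingredient is bought.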

🗺️

6 Austin TX Locations

Each location has unique demographics, traffic patterns, fees, competitors, and day-of-week multipliers. The agent sees this data through tools and must learn which locations work best for their menu and current reputation.

Downtown Business District

Fee: $50/day · Traffic: High · Peak: Lunch

Office workers, professionals

State University Campus

Fee: $30/day · Traffic: Medium-High · Peak: Lunch-Dinner

Students, budget-conscious

Waterfront Park

Fee: $40/day · Traffic: Medium · Peak: Lunch-Evening

Tourists, families, foodies

Maple Heights Residential

Fee: $15/day · Traffic: Low-Medium · Peak: Dinner

Families, residents

Industrial District

Fee: $10/day · Traffic: Steady · Peak: Lunch

Workers, truckers

City Convention Center

Fee: $100/day · Traffic: Event-driven · Peak: Varies

Professionals, tourists

👥

Staff & Labor Market

Hire up to 4 staff members from a dynamic labor market of 21 candidates. Staff have RPG-like skills (cooking, speed, customer service, 1-10 scale), reliability ratings, and autonomously level up over time. The labor market refreshes weekly — new candidates arrive, unhired ones leave.

🎯 Skills (1-10)

Cooking affects quality, Speed affects capacity, Customer Service affects reviews and tips

⚡ Reliability

Lower reliability means higher no-show chance. Agent sees no-shows in morning briefing

📈 Level Up

Staff gain experience over time and may level up skills. Wages grow with experience

🔄 Weekly Rotation

New candidates each week. Gem candidates (exceptional stats) appear from week 2+

📦

Inventory & Food Quality

A FIFO inventory system with real spoilage — ingredients have shelf lives from 2 to 365 days. Quality score directly affects customer reviews and reputation.

📋 30+ Ingredients

Each with unit cost, shelf life, and unit type. Orders delivered next day. Perishables expire and become waste

⭐ Quality Score

Freshness × cook skill × recipe complexity × equipment. Drives review stars and reputation growth

🥬 FIFO Spoilage

Oldest ingredients used first. Near-expiry items lower freshness score. Expired items are discarded as waste
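FIFO consumption with spoilage can be sketched with a queue of dated batches. This is an illustration under assumptions: the class and method names are hypothetical, a batch is treated as usable through its expiry day, and batches of one ingredient are assumed to arrive in expiry order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Batch:
    qty: float
    unit_cost: float
    expires_on: int  # last usable day (assumed convention)


class IngredientStock:
    def __init__(self) -> None:
        self.batches: deque[Batch] = deque()  # oldest batch at the left

    def receive(self, qty: float, unit_cost: float, today: int, shelf_life_days: int) -> None:
        self.batches.append(Batch(qty, unit_cost, today + shelf_life_days))

    def discard_expired(self, today: int) -> float:
        """Drop expired batches and return their cost as waste."""
        waste = 0.0
        while self.batches and self.batches[0].expires_on < today:
            batch = self.batches.popleft()
            waste += batch.qty * batch.unit_cost
        return waste

    def consume(self, qty: float) -> float:
        """Use the oldest stock first; return the shortfall (a stockout if > 0)."""
        remaining = qty
        while remaining > 0 and self.batches:
            batch = self.batches[0]
            used = min(batch.qty, remaining)
            batch.qty -= used
            remaining -= used
            if batch.qty <= 0:
                self.batches.popleft()
        return remaining


stock = IngredientStock()
stock.receive(10, 2.0, today=1, shelf_life_days=2)
stock.receive(5, 2.0, today=2, shelf_life_days=2)
shortfall_a = stock.consume(4)             # takes from the day-1 batch
waste = stock.discard_expired(today=4)     # the rest of the day-1 batch spoils
shortfall_b = stock.consume(8)             # only 5 units are left
print(shortfall_a, waste, shortfall_b)
```

The example shows both failure modes the benchmark punishes: over-ordering turns into waste, and under-ordering turns into stockouts.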

⭐

Reputation & Google Maps

Each location tracks your food truck's Google Maps rating independently. Reviews are generated based on food quality and customer service — building a strong reputation at a location creates a compounding advantage through loyalty and repeat customers.

📍 Rating & Reviews

Each location has its own Google rating (1-5★) and review count. Customers leave reviews based on food quality and customer service skill — both affect frequency and star ratings

📈 Growth Flywheel

More reviews = more visibility (logarithmic). Repeat visits build loyalty that boosts demand. Both decay if you stop showing up — but loyalty decays slowly

🔥 Dish Popularity

Dishes grow in popularity over time with diminishing returns. Stockouts damage popularity fast. Dishes not on today's menu slowly lose popularity

🎪 Event Dilution

During large events, most of the crowd came for the event — not Google Maps. Your reputation matters less when event traffic dominates

💳

Loans & Credit System

Agents can take loans to fund expansion or survive lean periods — but mismanaging debt is the fastest path to bankruptcy. Defaulting on a loan is instant game over with no grace period.

Tier 1 — Standard Loan

Max amount: $1,000

Term: 3-10 days

Interest: 15% flat

Available from Day 1

Tier 2 — Emergency Loan

Max amount: $500

Term: 3-5 days

Interest: 25% flat

Requires active Tier 1

Risk Mechanics

Max 2 active loans at once

Balloon payment on due date

Default = instant bankruptcy

Overdraft: 5%/day on negative cash
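The loan terms above translate into simple arithmetic. The sketch assumes "flat" means a one-time charge on the principal regardless of term, and that the overdraft rate applies to the absolute negative balance once per day; both readings, and the function names, are assumptions.

```python
LOAN_TIERS = {
    "standard": {"max_amount": 1_000, "rate_pct": 15, "term_days": (3, 10)},
    "emergency": {"max_amount": 500, "rate_pct": 25, "term_days": (3, 5)},
}
OVERDRAFT_PCT_PER_DAY = 5


def balloon_payment(principal: float, rate_pct: int) -> float:
    """Single repayment due at the end of the term (flat interest assumed)."""
    return principal * (100 + rate_pct) / 100


def overdraft_fee(cash: float) -> float:
    """Daily charge on a negative cash balance; zero when cash is non-negative."""
    return -cash * OVERDRAFT_PCT_PER_DAY / 100 if cash < 0 else 0.0


for name, tier in LOAN_TIERS.items():
    due = balloon_payment(tier["max_amount"], tier["rate_pct"])
    print(name, due)
print(overdraft_fee(-200.0))
```

Maxing out both tiers means owing $1,775 on $1,500 borrowed, all due as lump sums within days, which is why a loan only helps a truck that is already generating cash.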

🤝

Suppliers & Negotiations

Beyond standard ingredient orders, agents can negotiate with specialty suppliers for bulk, premium, or organic ingredients. Each supplier uses its own internal pricing formula — fully deterministic and seed-based. The same negotiation request always produces the same result across all simulation runs.

🏪 Specialty Suppliers

Multiple suppliers with different specialties: bulk discounts, premium quality, organic options

💬 Negotiation Flow

Browse catalog → negotiate terms → accept or reject quote. Better prices often require larger minimum orders

⚖️ Trade-offs

Lower unit costs vs. minimum quantities, delivery timing, and capital lock-up. Tests strategic purchasing decisions
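One way to get quotes that are identical across runs is to derive them from a hash of the seed and the request. The suppliers' actual pricing formulas are not described here, so the sketch below is purely illustrative: the function name, the hashing scheme, and the 20% discount ceiling are all assumptions.

```python
import hashlib


def quote_discount(
    seed: int, supplier: str, item: str, quantity: int, max_discount: float = 0.20
) -> float:
    """Map a request to a repeatable discount in [0, max_discount]."""
    key = f"{seed}:{supplier}:{item}:{quantity}".encode()
    digest = hashlib.sha256(key).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64  # uniform-ish in [0, 1)
    return round(fraction * max_discount, 4)


first = quote_discount(42, "bulk_supplier", "ground_beef", 50)
second = quote_discount(42, "bulk_supplier", "ground_beef", 50)
print(first == second)
```

No random state is carried between calls, so the order in which an agent negotiates cannot change any individual quote, which keeps runs comparable across models.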

🔧

Upgrades & Custom Recipes

Agents can invest in equipment upgrades that permanently improve truck capacity, speed, or food quality. They can also invent entirely new dishes — the system uses AI to generate realistic demand parameters for custom recipes.

⚡ Equipment Upgrades

One-time purchases that boost capacity, cooking speed, or quality. Capital allocation vs. immediate returns

🍳 Custom Recipe Creation

Agents describe a dish, choose ingredients. AI generates demand parameters: base demand, price elasticity, audience affinities

📉 New Dish Penalty

Custom recipes start with lower popularity than catalog dishes. They must prove themselves in the market over time

Architecture

How the Agent Thinks

Each day follows a 4-phase cycle. The agent makes decisions, the simulation resolves outcomes, the agent reflects and updates its Knowledge Base — then the next day begins.

☀️ Morning Context

Fresh system prompt
Knowledge Base (agent's diary)
Briefing: balance, staff, weather

🤖 Agent Decides

34 tools available
Loop: call → get JSON → think → call again
Up to 10 rounds, can batch multiple tools per round

⚙️ Day Simulation

Engine runs with no agent involvement
12-factor demand model
Quality scoring, sales, revenue
Reviews, reputation, staff XP

💭 Reflection

Agent receives day results: revenue, waste, reviews, demand signals
5 memory tools only
Loop: read → analyze → write notes
Up to 5 rounds

🔄 Memory updated → next day begins

No chat history accumulation. Each morning the agent receives a fresh system prompt with its Knowledge Base — a personal diary it builds itself, deciding what to record and what to forget. It gets up to 10 rounds of tool calls to make decisions (batched — multiple tools per round). After the day simulates, results come back: revenue, waste, reviews, demand signals. Then 5 more rounds for reflection — analyzing what happened and updating its notes. The next morning, the conversation resets completely: only the Knowledge Base carries forward. This tests reasoning ability — not context window size.

Capabilities

Tool Explorer

Every morning the agent receives 34 tools as JSON function calls. Click any tool to see its parameters and an example response — this is exactly what the AI sees.

get_weather_forecast

Get weather forecast for the next N days (condition, temperature).

Parameters

days (integer, default: 3)
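As a rough picture of what "tools as JSON function calls" can look like, the sketch below writes this tool as a JSON-Schema-style definition and fills in the default when the agent omits the argument. The outer wrapper layout and the `resolve_args` helper are assumptions for illustration; only the tool name, description, and the `days` parameter come from the explorer above.

```python
import json

# Hypothetical wire format; the benchmark's exact schema is not shown here.
get_weather_forecast_tool = {
    "type": "function",
    "function": {
        "name": "get_weather_forecast",
        "description": "Get weather forecast for the next N days (condition, temperature).",
        "parameters": {
            "type": "object",
            "properties": {"days": {"type": "integer", "default": 3}},
            "required": [],
        },
    },
}


def resolve_args(tool: dict, args: dict) -> dict:
    """Merge the caller's arguments over the schema's declared defaults."""
    props = tool["function"]["parameters"]["properties"]
    resolved = {name: spec["default"] for name, spec in props.items() if "default" in spec}
    resolved.update(args)
    return resolved


print(json.dumps(resolve_args(get_weather_forecast_tool, {})))
print(json.dumps(resolve_args(get_weather_forecast_tool, {"days": 5})))
```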

Roadmap

What's Next

Follow for updates and new articles.

Hard Mode

Extensive ideas for increasing cognitive load for models that can survive the current simulation. Not just tweaking cost coefficients — real structural complexity.

Model Breakdowns

Detailed strategy analysis for every model — several are already live. Check the Blog, or click any model in the Leaderboard for detailed stats.

More Models

Quality data requires multiple runs per model. Labs, researchers, and sponsors who can provide API access or grants — I'd love to expand the rankings.

Human Leaderboard

Will launch once enough player runs are collected. Play the simulation and see how you stack up against frontier AI.

About

About the Author

I've spent years in marketing, development, and building automated systems — and I've loved economic simulators and strategy games for as long as I can remember. This project sits right at the intersection of both.

With AI advancing seriously, I wanted to create my own benchmark — with my own vision. A place where frontier language models compete against each other and against real people in agentic business decision-making, with dozens of variables at play. It seemed like a fun and worthwhile thing to build.

I built the entire project solo: the simulation engine, the game, everything. For humanity and for fun.

Nicholas S.

Creator & Solo Developer

Get in Touch

Want your model benchmarked? Interested in collaboration, research partnerships, or grants? I'm open to working with AI labs, researchers, and anyone who finds this useful.

✉️ [email protected]

Support the Project

If you found this benchmark useful, interesting, or just fun — you can support its development.

☕ Buy Me a Coffee

USDT (TRC-20): TX5YLuJnn8DCKzLzTDuCS6jJvawEtiisdJ

ETH / USDC (ERC-20): 0x2cfAc0c7C82fFD3daE7A100cF45929DF5E385FCF

Can AI Run a Food Truck Business? Can You?
