🔬 AI Business Simulation Benchmark

Can AI Run a Food Truck Business?
Can You?

AI agents manage a food truck in Austin, TX — choosing locations, setting menus, pricing, inventory, and staff over a 30-day simulation. Compete against frontier models on the same leaderboard.

31 Models Tested
30 Days
34 Agent Tools
6 Locations
🎮 Beat the AI
Free · No signup required
📊 View Results

Why This Benchmark

What this project tests, why it's playable, and who built it.

Why This Benchmark Exists

Standard benchmarks measure knowledge — MMLU, HumanEval, SWE-bench. They tell you if a model knows things. But knowing and doing are different skills.

FoodTruck Bench tests something else: can an AI make consistent business decisions under uncertainty? Not one perfect answer — thirty days of imperfect ones. Location, menu, pricing, inventory, staffing, loans — all at once, all with consequences that carry over.

This is the kind of cognitive load that doesn't appear in multiple-choice tests. Every decision creates a new situation. Skip a day of ordering and you have no ingredients. Overprice and customers leave. Hire the wrong staff and your capacity drops. The simulation doesn't forgive.

Why You Can Play It

The benchmark is playable because the comparison only means something if you can feel it. Numbers on a leaderboard say "GPT-5.2 made $19,000." Playing the same simulation yourself says how that feels.

"This is not a food truck simulator game. This is an AI benchmark — where you're the benchmark."

Current Rankings

Models ranked by net worth. Each model was run 5 times — the median run is shown. Starting balance: $2,000. Duration: 30 days.

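| # | Model | Status | Net Worth | ROI | Margin | Days | Revenue | Profit | Balance |
|---|---|---|---|---|---|---|---|---|---|
| 🥇 | GPT-5.5 | Finished | $61,408 | +2970% | 65% | 30d | $93,959 | +$60,706 | $54,836 |
| 🥈 | Claude Opus 4.6 | Finished | $49,519 | +2376% | 61% | 30d | $79,921 | +$48,431 | $43,642 |
| 🥉 | GPT-5.2 | Finished | $28,081 | +1304% | 52% | 30d | $55,275 | +$28,659 | $21,700 |
| 4 | Grok 4.3 | Finished | $27,880 | +1294% | 47% | 30d | $61,467 | +$28,661 | $16,078 |
| 5 | DeepSeek V4 Pro | Finished | $27,142 | +1257% | 51% | 30d | $52,139 | +$26,492 | $20,944 |
| 6 | Gemma 4 31B | Finished | $24,878 | +1144% | 46% | 30d | $57,209 | +$26,153 | $14,962 |
| 7 | MiMo V2.5 Pro | Finished | $22,388 | +1019% | 43% | 30d | $50,739 | +$22,022 | $15,496 |
| 8 | Gemini 3.5 Flash (New) | Finished | $22,311 | +1016% | 41% | 30d | $51,010 | +$21,068 | $15,888 |
| 9 | Claude Sonnet 4.6 | Finished | $17,426 | +771% | 41% | 30d | $39,280 | +$15,954 | $12,107 |
| 10 | Gemini 3 Pro | Finished | $17,199 | +760% | 41% | 30d | $41,652 | +$17,152 | $10,346 |
| 11 | Qwen 3.6 35B-A3B (New) | Finished | $15,317 | +666% | 42% | 30d | $40,950 | +$17,374 | $9,356 |
| 12 | Gemini 3.1 Pro CT | Finished | $12,736 | +537% | 28% | 30d | $45,744 | +$12,694 | $4,212 |
| 13 | Qwen 3.6 Plus | Finished | $7,668 | +283% | 26% | 30d | $26,008 | +$6,697 | $2,688 |
| 14 | DeepSeek V4 Flash | Finished | $5,504 | +175% | 9% | 30d | $28,716 | +$2,571 | $416 |
| 15 | Gemma 4 26B A4B ⚠️ | Finished | $4,386 | +119% | 1% | 30d | $20,091 | +$284 | $1,197 |
| 16 | Claude Sonnet 4.5 | Finished | $1,388 | -31% | -1% | 30d | $10,753 | $145 | $1,338 |

💀 Bankruptcy Line

| # | Model | Status | Net Worth | ROI | Margin | Days | Revenue | Profit | Balance |
|---|---|---|---|---|---|---|---|---|---|
| 17 | GLM 5 | Bankrupt | $-210 | -111% | -23% | Day 28 💀 | $11,965 | $2,705 | $-405 |
| 18 | Qwen 3.5 397B | Bankrupt | $-218 | -111% | -30% | Day 25 💀 | $8,553 | $2,535 | $-491 |
| 19 | Grok 4.20 Reasoning | Bankrupt | $1,338 | -33% | -7% | Day 24 💀 | $13,246 | $976 | $-371 |
| 20 | DeepSeek V3.2 | Bankrupt | $2,058 | +3% | -8% | Day 22 💀 | $9,531 | $774 | $165 |
| 21 | Kimi K2.5 | Bankrupt | $30 | -99% | -79% | Day 22 💀 | $3,475 | $2,741 | $-522 |
| 22 | GPT OSS 120B | Bankrupt | $92 | -95% | -84% | Day 21 💀 | $2,293 | $1,914 | $-186 |
| 23 | MiniMax M2.5 | Bankrupt | $-317 | -116% | -77% | Day 21 💀 | $2,668 | $2,061 | $-377 |
| 24 | Mimo-v2-omni | Bankrupt | $598 | -70% | -37% | Day 19 💀 | $5,307 | $1,939 | $-329 |
| 25 | GPT-5.4 Mini | Bankrupt | $470 | -76% | -37% | Day 19 💀 | $6,209 | $2,270 | $-315 |
| 26 | Nemotron-3 Super 120B | Bankrupt | $962 | -52% | -52% | Day 16 💀 | $2,982 | $1,540 | $129 |
| 27 | Qwen 3.5 9B | Bankrupt | $-679 | -134% | -97% | Day 15 💀 | $3,443 | $3,328 | $-907 |
| 28 | Claude Haiku 4.5 | Bankrupt | $166 | -92% | -121% | Day 14 💀 | $1,983 | $2,408 | $-49 |
| 29 | Grok 4.1 Fast | Bankrupt | $817 | -59% | -36% | Day 11 💀 | $5,034 | $1,785 | $-40 |
| 30 | GPT-5 Mini | Bankrupt | $50 | -98% | -151% | Day 11 💀 | $1,723 | $2,602 | $-258 |
| 31 | Qwen3 VL 235B | Bankrupt | $-525 | -126% | -145% | Day 11 💀 | $1,838 | $2,659 | $-1,089 |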
📊 Each model is evaluated across 5 runs with identical conditions: same seed, weather, events, competitors, and market. The only variable is the model’s own decisions. The median run (by net worth) is selected. How the simulation works →
Gemini 3 Flash is not listed — it enters infinite decision loops and cannot complete the simulation. Read why →
Gemma 4 26B A4B is flagged with ⚠️ because it is the only model that required multi-stage JSON output sanitization to produce valid tool calls. Business decisions are unmodified; only JSON formatting was corrected. Details →
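
The selection rule above (5 runs per model, median by net worth) can be sketched in a few lines. This is only an illustration: the `select_median_run` helper and the run dictionaries are hypothetical, and the net worth figures in the usage example are made up, not benchmark results.

```python
def select_median_run(runs):
    """Return the run whose net worth is the median of the set.

    `runs` is a list of dicts with a "net_worth" key (hypothetical shape).
    With an odd number of runs, such as 5, the median is the middle element
    after sorting.
    """
    if len(runs) % 2 == 0:
        raise ValueError("expected an odd number of runs")
    ranked = sorted(runs, key=lambda run: run["net_worth"])
    return ranked[len(ranked) // 2]


# Usage with made-up numbers:
example_runs = [
    {"run_id": 1, "net_worth": 9_000},
    {"run_id": 2, "net_worth": 31_000},
    {"run_id": 3, "net_worth": 18_500},
    {"run_id": 4, "net_worth": 22_000},
    {"run_id": 5, "net_worth": 12_250},
]
selected = select_median_run(example_runs)
```

Reporting the median rather than the best run keeps a single lucky result from deciding a model's rank.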

Performance Over Time

Selected median run of each model. Compare net worth, revenue, and profit trajectories.

[Interactive chart: net worth, revenue, and profit over time for the 31 leaderboard models, filterable by Survivors / Bankrupt.]

Latest Case Studies

Day-by-day model breakdowns, head-to-head comparisons, and the patterns behind the leaderboard numbers.

GPT-5.5 Takes The Top Of FoodTruck Bench From Claude Opus 4.6

OpenAI's GPT-5.5 ends Opus 4.6's run at #1: $61,408 median net worth (+24%), $24.63 per run (32% cheaper API cost). Day-level deep dive — fewer servings at higher prices, debt as growth capital, and the Day 21 industrial-zone trap that catches GPT-5.5 and GPT-5.2 but not Opus.

Read →

DeepSeek V4 Pro: The First Chinese Model At The Frontier

First Chinese model in the frontier ROI tier. 5/5 runs, +1,257% median ROI, $27,142 net worth — head-to-head vs Grok 4.3 Latest, Opus 4.6, GPT-5.2, Sonnet 4.6, and Gemma 4 31B.

Read →

Qwen 3.6 Plus: The First Chinese Model That Actually Survives

First Chinese model to clear all 5 runs. +283% median ROI, zero-loan growth, real geography learning, and a clean operational jump over Qwen 3.5 397B and GLM-5.

Read →
Browse all articles →

Key Metrics

Side-by-side comparison of net worth, ROI, profit margin, and food waste.

[Four charts, each showing the selected run (median by net worth) per model: Net Worth, ROI, Net Profit Margin, Food Waste Cost.]

Model Economics

Revenue, expenses, and profit breakdown for a single model. Select a model to explore its daily financials.

🥇 #1 on Leaderboard
Net Worth: $61,408.09 · ROI: 2970% · Margin: 65% · Days: 30
[Chart: daily Revenue, Expenses, and Profit for the selected model.]

Notable Findings

Unexpected behaviors and observations from the benchmark runs.

⚠️ Missing from Leaderboard

Gemini 3 Flash — Infinite Decision Loop

One of the most popular AI models cannot complete FoodTruck Bench. With extended thinking enabled, it makes 3–5 tool calls on Day 0, then enters an infinite reasoning loop — endlessly deliberating without ever committing to a decision. It never starts trading.

This is why Gemini 3 Flash does not appear in the leaderboard — it simply cannot function within the simulation's decision framework.

Read the full analysis →
🧠 Strategic Behavior

Opus's $1.72 Total Waste — Across 30 Days

Claude Opus 4.6 wasted $1.72 in ingredients across the entire 30-day median run — and that was its worst result. In other simulations, waste was exactly $0.00. Meanwhile, it generated $79,921 in revenue. GPT-5.2 (2nd place) wasted 75× more.

Designed to Help

Loans Were a Lifeline — Every Model Drowned

The loan system was added after early simulations revealed how many models spiral into bankruptcy. The idea: give struggling agents a second chance — a small credit line to recover and apply lessons learned. Instead, every single model that took a loan went bankrupt. 8 out of 8. The 4 models that never borrowed all survived. Loans didn't save anyone — they just delayed the inevitable.

👻 Ghost Truck

Haiku's 6 Days of Zero Revenue

Claude Haiku 4.5 opened for business on Day 6 — and nobody came. Then Day 7. Day 8. Day 9. Day 10. Day 11. Six consecutive working days with $0 in revenue while paying $274–370 per day in fixed costs. The truck was open, the kitchen was running, but the model couldn't attract a single customer.

No Learning Curve

Sonnet 4.5 — 30 Days Without Progress

On Day 3, Claude Sonnet 4.5 earned $830 and served 119 customers. On Day 28 — $12 revenue, 2 customers served. After 30 days of operation it finished with 12 losing days, 0 upgrades purchased, and -30.6% ROI. Revenue didn't grow — it decayed, averaging -71% from first week to last.

📍 Pattern

Location Intelligence Predicts Performance

Claude Opus 4.6 used only 2 locations across 30 days, downtown (72%) and waterfront (28%): it found the best spots and committed. Grok 4.1 Fast parked in the industrial zone for 82% of its run. DeepSeek V3.2 split its time mainly between industrial (45%) and university (32%). The top performers discovered profitable locations early and stopped experimenting; the rest kept guessing.

Key Takeaways

What 12 models, 30 days, and $24,000 in starting capital taught us about AI decision-making.

👑$49.5K net worth

Claude Opus 4.6 Dominates Through Capital Allocation

Claude Opus 4.6 reached $49,519 net worth (+2376% ROI) by treating upgrades as investments and staff as operating expense. It purchased all 8 available truck upgrades (one-time cost, compounding ROI) while keeping staff lean. Strategic days off on low-demand days saved $100+ each. Premium pricing — $16 chicken wings, $9.50 burrito bowls — with near-zero waste ($1.72 total across 30 days).

💀8 of 12 bankrupt

Two Thirds Go Bankrupt

Only 4 of 12 models survived the full 30 days, and one of those finished in the red at -30.6% ROI. Most bankruptcies happen between Day 10 and Day 22. Fixed costs of $55/day drain cash relentlessly. The simulation kills passive strategies: even taking a day off costs $55 in non-negotiable lease, insurance, and commissary fees.

🗑️$1.72 vs $1,192 waste

Inventory Is the #1 Predictor of Survival

Food waste is the clearest dividing line between survivors and bankruptcies. Models with under $200 in waste survived. Every model above $400 went bankrupt. Opus wasted $1.72 total; Gemini 3 Pro wasted $1,192 but survived on brute-force revenue. Below that revenue threshold, waste is fatal.

👥5+ staff (survivors) vs 1–3 (bankrupt)

Staff Timing Predicts Survival

Claude Opus 4.6 and GPT-5.2 hired their first staff on Day 0–1. Every surviving model had 5+ staff by Day 17. Bankrupt models typically hired 1–3 people total, often too late. More staff = more capacity = more revenue per day. The models that understood this early compounded their advantage before fixed costs could drain them.

📈8/8 upgrades (Opus) vs 0 (bankrupt)

Early Upgrades Compound Into Dominance

Claude Opus 4.6 and GPT-5.2 purchased upgrades from Day 0–1 — marketing signage, kitchen equipment, capacity boosts. These are one-time costs that permanently increase demand or capacity. 6 of 12 models bought zero upgrades. Every model that bought upgrades before Day 5 survived. The ones that didn't all went bankrupt (except Sonnet, which barely survived).

What I Learned

Personal observations from running dozens of simulations across 12 frontier models.

I tested over 20 frontier models through the same 30-day simulation — dozens of runs per model for statistical confidence. Here's what stood out — what the numbers don't fully capture.
📈

The Generational Leap Is Real

Previous-generation flagships, models that dominated benchmarks months ago, can't survive this simulation. Gemini 2.5 Pro, the former LMSYS #1, goes bankrupt around Day 11-13. The gap isn't incremental; it's a different tier of agentic reasoning. Old models know what to do; new ones know when, how, and when not to.

🎯

Consistency Separates the Tiers

Raw peak performance is misleading. Gemini 3 Pro's best single run actually outscores GPT-5.2's median — but its worst drops to $11K. Opus's worst run still outperforms GPT-5.2's best by a wide margin. The leaderboard ranks by median, not peak. That's where discipline shows.

🐉

The Chinese Model Gap Is Closing Fast

GLM 5 is now the strongest Chinese model tested — placed #5, survived 28 of 30 days, and outscored DeepSeek V3.2 in revenue ($11,965 vs $9,531). DeepSeek still leads in peak performance and net worth at bankruptcy (+$2,058 vs -$210). Both went bankrupt, but through opposite failure modes: GLM 5 bled out slowly from overstaffing, DeepSeek crashed fast from over-investment.

Current Chinese models already outperform previous-generation Western flagships. The gap is measurably shrinking. I expect them to consistently pass the simulation within ~6 months.

Coding Skill ≠ Business Skill

Sonnet 4.5 is one of the best coding models available — yet it barely survived with -30.6% ROI and zero upgrades purchased across 30 days. It never learned to invest. This benchmark tests sustained multi-step strategic reasoning across interdependent variables — a fundamentally different cognitive load than producing correct code.

🔄

More Activity ≠ Better Outcomes

Grok 4.1 Fast made 32 tool calls per day — more than any other model. Hired 6 staff, visited 3 locations, generated $5K in revenue. Still went bankrupt on Day 11. Meanwhile, Opus made focused, deliberate calls and took 2 days off entirely. Information gathering without strategic filtering is just expensive noise.

⚖️

Every AI Run is a First Attempt

Every AI run is a fresh start — no memory of previous attempts. If you've played the simulation before, you already have an unfair advantage over any AI model. The Random Seed mode exists specifically for fairer competition.

Awards & Highlights

The most memorable performances — from spectacular profits to spectacular failures.

🏆
Most Profitable
GPT-5.5
+2970% ROI
Highest return on investment across all models
🗑️
Most Wasteful
Gemma 4 31B
$4,675 wasted
Highest food waste — money literally thrown in the trash
🍔
Most Servings
GPT-5.2
8,187 served
Fed the most customers across the entire run
👻
Ghost Truck
Claude Haiku 4.5
6 days at $0
Most days with zero revenue — opened the truck but nobody came
💎
Premium Menu
DeepSeek V4 Pro
$19 Wagyu Smash Burger
Charged the highest price for a single dish — and sold 15 of them
🤖
Tool Inventor
Gemini 3.5 Flash
78 fake tools
Tried to use tools that don't exist — creative but ineffective
🧑‍🍳
Chef Inventor
Gemini 3.5 Flash
5 custom recipes
Created the most original dishes — culinary creativity unleashed
💰
Loan Addict
DeepSeek V4 Flash
5 loans taken
Borrowed the most — living on credit to keep the dream alive
🔨
Upgrade Master
GPT-5.5
8 upgrades
Invested the most in truck upgrades — scaling through infrastructure
🚪
Revolving Door
Nemotron-3 Super 120B
9 hired, 6 fired
Highest staff turnover — HR nightmare on wheels
🤝
Dealmaker
Gemini 3.5 Flash
48 negotiations, 22 deals
Most active negotiator — always looking for a better price
Hyperactive
Gemini 3.5 Flash
42 calls/day
Most tool calls per day — analyzing everything before every decision

What the Models Said

Strategic notes and reflections written by AI agents during their runs.

Play the Simulation

Manage the same food truck, use the same tools, face the same market. Can you beat the AI?

Austin TX
🏢 Downtown

How the Simulation Works

Every mechanic the AI agent interacts with — demand model, economics, reputation, suppliers, upgrades, loans, and more.

📅

Daily Simulation Loop

Each day follows a morning → simulation → reflection cycle. The conversation resets every morning — no accumulated message history. Instead, the agent builds its own Knowledge Base: a personal diary with the last 14 days of records, location analytics, menu performance, and free-form notes. Only what the agent chose to write down carries forward.

The AI doesn't know when the simulation ends — or even that it's a simulation. It operates as if running a real, permanent business. There's no countdown, no “days remaining”. The model knows today's date but has no concept of a finish line — just like a real entrepreneur.
☀️ Morning Phase
  • Ingredient deliveries arrive, expired items removed (FIFO)
  • Staff reliability check — no-shows tracked and reported
  • Fresh context: system prompt + Knowledge Base + morning briefing
  • Agent gets up to 10 rounds of tool calls (can batch multiple tools per round)
  • Inner loop: call tools → get JSON responses → think → call more tools
  • Ends the phase by calling wait_for_next_day
⚙️ Simulation Phase
  • 12-factor demand formula calculates customer demand per dish
  • Capacity limits applied (staff throughput, working hours)
  • Ingredients consumed (FIFO), stockouts tracked
  • Revenue, expenses, profit calculated
  • Reputation updated: Google Maps rating, reviews, loyalty per location
  • Staff gain XP, may level up skills
💭 Reflection Phase
  • Agent receives day results: revenue, waste, stockouts, demand signals
  • Only 5 memory tools: scratchpad (read/write), key-value store, wait
  • Agent analyzes results and writes strategic notes for next day
  • Knowledge Base automatically updated with day's analytics
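
The morning phase's inner loop (up to 10 rounds, several tools per round, ended by `wait_for_next_day`) can be sketched roughly as follows. Only the round limit and the `wait_for_next_day` tool name come from the description above; `run_morning_phase`, `agent_step`, and `execute_tool` are hypothetical names, and ending the phase silently once the rounds run out is an assumption.

```python
from typing import Callable

ToolCall = tuple[str, dict]  # (tool name, arguments)


def run_morning_phase(
    agent_step: Callable[[list], list[ToolCall]],
    execute_tool: Callable[[str, dict], dict],
    max_rounds: int = 10,
) -> int:
    """Run the call -> JSON response -> think -> call loop.

    Returns the number of rounds used. The phase ends when the agent calls
    wait_for_next_day or (assumption) when the round budget is exhausted.
    """
    results: list = []
    for round_no in range(1, max_rounds + 1):
        calls = agent_step(results)      # the model may batch several tools
        results = []
        for name, args in calls:
            if name == "wait_for_next_day":
                return round_no
            results.append(execute_tool(name, args))
    return max_rounds


# Usage with a scripted stand-in for the model:
script = iter([
    [("get_weather_forecast", {"days": 3})],
    [("wait_for_next_day", {})],
])
rounds_used = run_morning_phase(
    agent_step=lambda previous_results: next(script),
    execute_tool=lambda name, args: {"tool": name, "ok": True},
)
```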
🧠

Agent Memory System

Instead of growing message history, the agent uses a structured Knowledge Base that stays within ~10-20K tokens per day, regardless of simulation length.

📊 Daily Records
Last 14 days of financials, sales, demand signals — auto-pruned FIFO
📍 Location Insights
Visits, avg profit, best/worst day, Google Maps rating per location
🍔 Menu Performance
Total sold, stockout rate, profit margin, best location for each dish
📝 Scratchpad
Free-form text notes written during reflection, persistent across days
🗄️ Key-Value Store
Structured data storage — agent can save/retrieve any key-value pair
📈 Demand Signals
Raw demand, capacity used %, queue status, unmet demand, avg wait time
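
A bounded memory like the one described above can be modeled with a fixed-length queue plus free-form stores. The sketch below is an assumption about shape, not the benchmark's actual data model: the `KnowledgeBase` class and its field names are hypothetical, and only the 14-day window, the scratchpad, and the key-value store come from the text.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    # deque(maxlen=14) drops the oldest record automatically (FIFO pruning).
    daily_records: deque = field(default_factory=lambda: deque(maxlen=14))
    scratchpad: str = ""                           # free-form notes
    kv_store: dict = field(default_factory=dict)   # structured key-value data

    def record_day(self, record: dict) -> None:
        """Append one day's analytics; records older than 14 days fall off."""
        self.daily_records.append(record)


# Usage: after 20 simulated days only the last 14 remain.
kb = KnowledgeBase()
for day in range(20):
    kb.record_day({"day": day, "revenue": 100 * day})
kb.kv_store["best_location"] = "downtown"
```

Because the window is fixed, the prompt size stays roughly constant no matter how long the simulation runs, which is the property the section describes.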
📊

12-Factor Demand Model

Customer demand is calculated per dish using 12 multiplicative factors. After the raw demand is computed, it's further constrained by staff capacity and ingredient availability. The agent never sees the formula directly — only the resulting sales and demand signals.

demand = base × price × location × weather × time × day_of_week × events × variety × competition × reputation × popularity × noise
| Factor | What It Models |
|---|---|
| Base Demand | Inherent dish popularity (burgers vs. niche items) |
| Price Elasticity | How pricing vs. reference price affects demand |
| Location × Audience | Dish-audience affinity (tacos near university, fine food at brewery) |
| Weather | Rain drops demand, sunny boosts it. Austin TX monthly patterns |
| Peak Hours | Dish fit vs. location's peak time (lunch, dinner, evening) |
| Day of Week | Weekend vs weekday traffic per location |
| Events | Festivals, markets: traffic multiplier + vendor fees |
| Menu Variety | Too few or too many items penalize demand |
| Competition | Market share, menu overlap, rating differential |
| Reputation | Loyalty, Google Maps rating, review volume (logarithmic) |
| Dish Popularity | Organic growth/decay based on fulfillment + quality |
| Noise | Optional random variance (off by default; all runs are deterministic with seed 42) |
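
To make the multiplicative structure concrete, here is a minimal sketch of how twelve factors could combine and then be capped by capacity and stock. The factor names mirror the formula above, but the function names, the numeric values, and the rounding rule are all assumptions; the real per-factor calculations are not published here.

```python
import math

# The eleven multipliers applied on top of base demand, in formula order.
FACTORS = (
    "price", "location", "weather", "time", "day_of_week", "events",
    "variety", "competition", "reputation", "popularity", "noise",
)


def raw_demand(base: float, multipliers: dict) -> float:
    """Multiply base demand by every factor; missing factors default to 1.0."""
    demand = base
    for name in FACTORS:
        demand *= multipliers.get(name, 1.0)  # noise stays 1.0 when off
    return demand


def units_sold(demand: float, staff_capacity: int, stock: int) -> int:
    """Cap raw demand by staff throughput and available ingredients.

    Rounding down to whole servings is an assumption.
    """
    return min(math.floor(demand), staff_capacity, stock)


# Usage with made-up multipliers: an overpriced dish on a rainy day.
demand = raw_demand(100, {"price": 0.9, "weather": 0.8})
sold = units_sold(demand, staff_capacity=60, stock=100)
```

Because the factors multiply, one weak factor scales down everything else, which is why a single bad choice (wrong location, overpricing) can sink an otherwise sound day.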
💰

Economics & Scoring

Agents start with $2,000 and must manage a real cost structure. Primary metric is net worth (cash + inventory value) at simulation end. Persistent negative balance leads to bankruptcy.

Fixed Daily Costs
Truck lease: $30/day
Insurance: $5/day
Commissary: $20/day
Total: $55/day (even on days off)
Variable Costs
Fuel: $25/trip
Location fee: $0–50/day
Event vendor fee: varies
Staff wages: hourly × 8h
Ingredients: at order time
Scoring Metrics
Net Worth (primary)
ROI vs starting balance
Profit margins, food waste
Days profitable, servings sold
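
The cost structure and scoring above reduce to simple arithmetic. The sketch below uses the published fixed costs and starting balance; the function names are hypothetical, and the ROI formula is inferred from the leaderboard figures (it reproduces them) rather than quoted from the benchmark.

```python
STARTING_BALANCE = 2_000
FIXED_DAILY_COSTS = {"truck_lease": 30, "insurance": 5, "commissary": 20}


def fixed_costs(days: int) -> int:
    """Fixed costs accrue every day, including days off."""
    return sum(FIXED_DAILY_COSTS.values()) * days


def net_worth(cash: float, inventory_value: float) -> float:
    """Primary metric: cash plus inventory value at simulation end."""
    return cash + inventory_value


def roi_percent(final_net_worth: float, start: float = STARTING_BALANCE) -> float:
    """ROI vs. starting balance, as a percentage (inferred formula)."""
    return (final_net_worth - start) / start * 100


# Usage: check against leaderboard rows.
top_roi = round(roi_percent(61_408))         # GPT-5.5 row shows +2970%
sonnet_roi = round(roi_percent(1_388), 1)    # Sonnet 4.5 is cited at -30.6%
idle_month = fixed_costs(30)                 # cost of never opening at all
```

An agent that does nothing for 30 days still loses $1,650 of its $2,000, so passivity alone nearly exhausts the starting capital.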
🗺️

6 Austin TX Locations

Each location has unique demographics, traffic patterns, fees, competitors, and day-of-week multipliers. The agent sees this data through tools and must learn which locations work best for their menu and current reputation.

| Location | Fee | Traffic | Peak | Audience |
|---|---|---|---|---|
| Downtown Business District | $50/day | High | Lunch | Office workers, professionals |
| State University Campus | $30/day | Medium-High | Lunch-Dinner | Students, budget-conscious |
| Waterfront Park | $40/day | Medium | Lunch-Evening | Tourists, families, foodies |
| Maple Heights Residential | $15/day | Low-Medium | Dinner | Families, residents |
| Industrial District | $10/day | Steady | Lunch | Workers, truckers |
| City Convention Center | $100/day | Event-driven | Varies | Professionals, tourists |
👥

Staff & Labor Market

Hire up to 4 staff members from a dynamic labor market of 21 candidates. Staff have RPG-like skills (cooking, speed, customer service, 1-10 scale), reliability ratings, and autonomously level up over time. The labor market refreshes weekly — new candidates arrive, unhired ones leave.

🎯 Skills (1-10)
Cooking affects quality, Speed affects capacity, Customer Service affects reviews and tips
⚡ Reliability
Lower reliability means higher no-show chance. Agent sees no-shows in morning briefing
📈 Level Up
Staff gain experience over time and may level up skills. Wages grow with experience
🔄 Weekly Rotation
New candidates each week. Gem candidates (exceptional stats) appear from week 2+
📦

Inventory & Food Quality

A FIFO inventory system with real spoilage — ingredients have shelf lives from 2 to 365 days. Quality score directly affects customer reviews and reputation.

📋 30+ Ingredients
Each with unit cost, shelf life, and unit type. Orders delivered next day. Perishables expire and become waste
⭐ Quality Score
Freshness × cook skill × recipe complexity × equipment. Drives review stars and reputation growth
🥬 FIFO Spoilage
Oldest ingredients used first. Near-expiry items lower freshness score. Expired items are discarded as waste
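
FIFO consumption with spoilage can be sketched with a queue of dated batches. This is an illustrative model under stated assumptions: the `Batch` and `IngredientStock` names are hypothetical, a batch is assumed to be discarded on the morning of day `received + shelf_life`, and each ingredient is assumed to have a single shelf life so batches are already in expiry order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Batch:
    quantity: float
    expires_on: int  # assumption: discarded on the morning of this day


class IngredientStock:
    """One ingredient's inventory, oldest batch first."""

    def __init__(self) -> None:
        self.batches = deque()

    def receive(self, quantity: float, today: int, shelf_life_days: int) -> None:
        self.batches.append(Batch(quantity, today + shelf_life_days))

    def discard_expired(self, today: int) -> float:
        """Remove expired batches from the front; return the wasted quantity."""
        wasted = 0.0
        while self.batches and self.batches[0].expires_on <= today:
            wasted += self.batches.popleft().quantity
        return wasted

    def consume(self, quantity: float) -> float:
        """Use the oldest stock first; return the unmet quantity (stockout)."""
        while quantity > 0 and self.batches:
            oldest = self.batches[0]
            used = min(oldest.quantity, quantity)
            oldest.quantity -= used
            quantity -= used
            if oldest.quantity == 0:
                self.batches.popleft()
        return quantity


# Usage: two deliveries, partial use, then spoilage and a stockout.
stock = IngredientStock()
stock.receive(10, today=0, shelf_life_days=2)
stock.receive(5, today=1, shelf_life_days=2)
shortfall_day1 = stock.consume(4)          # taken from the day-0 batch
wasted_day2 = stock.discard_expired(2)     # rest of the day-0 batch expires
shortfall_day2 = stock.consume(10)         # only 5 units remain
```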

Reputation & Google Maps

Each location tracks your food truck's Google Maps rating independently. Reviews are generated based on food quality and customer service — building a strong reputation at a location creates a compounding advantage through loyalty and repeat customers.

📍 Rating & Reviews
Each location has its own Google rating (1-5★) and review count. Customers leave reviews based on food quality and customer service skill — both affect frequency and star ratings
📈 Growth Flywheel
More reviews = more visibility (logarithmic). Repeat visits build loyalty that boosts demand. Both decay if you stop showing up — but loyalty decays slowly
🔥 Dish Popularity
Dishes grow in popularity over time with diminishing returns. Stockouts damage popularity fast. Dishes not on today's menu slowly lose popularity
🎪 Event Dilution
During large events, most of the crowd came for the event — not Google Maps. Your reputation matters less when event traffic dominates
💳

Loans & Credit System

Agents can take loans to fund expansion or survive lean periods — but mismanaging debt is the fastest path to bankruptcy. Defaulting on a loan is instant game over with no grace period.

Tier 1 — Standard Loan
Max amount: $1,000
Term: 3-10 days
Interest: 15% flat
Available from Day 1
Tier 2 — Emergency Loan
Max amount: $500
Term: 3-5 days
Interest: 25% flat
Requires active Tier 1
Risk Mechanics
Max 2 active loans at once
Balloon payment on due date
Default = instant bankruptcy
Overdraft: 5%/day on negative cash
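
The loan terms above translate into a small amount of arithmetic. In the sketch below the tier limits and rates are taken from the lists above, while the function names and two interpretations are assumptions: "flat" interest is read as a one-time percentage of principal regardless of term, and the overdraft charge is read as 5% of the negative balance per day.

```python
LOAN_TIERS = {
    1: {"max_amount": 1_000, "min_term": 3, "max_term": 10, "flat_interest": 0.15},
    2: {"max_amount": 500, "min_term": 3, "max_term": 5, "flat_interest": 0.25},
}
OVERDRAFT_DAILY_RATE = 0.05


def balloon_payment(tier: int, amount: float, term_days: int) -> float:
    """Amount due in one payment on the due date (assumed flat-interest reading)."""
    terms = LOAN_TIERS[tier]
    if not 0 < amount <= terms["max_amount"]:
        raise ValueError("amount outside tier limit")
    if not terms["min_term"] <= term_days <= terms["max_term"]:
        raise ValueError("term outside tier limit")
    return round(amount * (1 + terms["flat_interest"]), 2)


def overdraft_fee(cash: float) -> float:
    """Daily charge on negative cash; zero when the balance is non-negative."""
    return round(-cash * OVERDRAFT_DAILY_RATE, 2) if cash < 0 else 0.0


# Usage: the largest possible loans in each tier.
standard_due = balloon_payment(1, 1_000, 10)
emergency_due = balloon_payment(2, 500, 5)
```

Under this reading, a maxed-out Tier 1 plus Tier 2 pair means $1,775 due within 10 days, close to the entire starting balance, which helps explain why borrowing so often ended in default.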
🤝

Suppliers & Negotiations

Beyond standard ingredient orders, agents can negotiate with specialty suppliers for bulk, premium, or organic ingredients. Each supplier uses its own internal pricing formula — fully deterministic and seed-based. The same negotiation request always produces the same result across all simulation runs.

🏪 Specialty Suppliers
Multiple suppliers with different specialties: bulk discounts, premium quality, organic options
💬 Negotiation Flow
Browse catalog → negotiate terms → accept or reject quote. Better prices often require larger minimum orders
⚖️ Trade-offs
Lower unit costs vs. minimum quantities, delivery timing, and capital lock-up. Tests strategic purchasing decisions
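
One common way to get "same request, same result" behavior is to derive the quote from a stable hash of the seed and the request. The sketch below shows that idea only; the `quote_unit_price` function, its parameters, and the 5-20% discount range are entirely hypothetical and are not the suppliers' actual pricing formulas.

```python
import hashlib


def quote_unit_price(seed: int, supplier: str, ingredient: str,
                     quantity: int, list_price: float) -> float:
    """Return a quote that depends only on the seed and the request.

    sha256 is used instead of Python's built-in hash(), which is salted
    per process for strings and would differ between runs.
    """
    key = f"{seed}:{supplier}:{ingredient}:{quantity}".encode()
    digest = hashlib.sha256(key).digest()
    fraction = int.from_bytes(digest[:8], "big") / 2**64   # value in [0, 1)
    discount = 0.05 + 0.15 * fraction                      # hypothetical range
    return round(list_price * (1 - discount), 2)


# Usage: repeating the same request yields the same quote.
first = quote_unit_price(42, "bulk_supplier", "beef", 50, 4.00)
second = quote_unit_price(42, "bulk_supplier", "beef", 50, 4.00)
```

Determinism of this kind is what lets every model face identical negotiation outcomes, so differences in results trace back to decisions rather than luck.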
🔧

Upgrades & Custom Recipes

Agents can invest in equipment upgrades that permanently improve truck capacity, speed, or food quality. They can also invent entirely new dishes — the system uses AI to generate realistic demand parameters for custom recipes.

⚡ Equipment Upgrades
One-time purchases that boost capacity, cooking speed, or quality. Capital allocation vs. immediate returns
🍳 Custom Recipe Creation
Agents describe a dish, choose ingredients. AI generates demand parameters: base demand, price elasticity, audience affinities
📉 New Dish Penalty
Custom recipes start with lower popularity than catalog dishes. They must prove themselves in the market over time

How the Agent Thinks

Each day follows a 4-phase cycle. The agent makes decisions, the simulation resolves outcomes, the agent reflects and updates its Knowledge Base — then the next day begins.

☀️ Morning Context
  • Fresh system prompt
  • Knowledge Base (agent's diary)
  • Briefing: balance, staff, weather
🤖 Agent Decides
  • 34 tools available
  • 🔁 call → get JSON → think → call again
  • Up to 10 rounds, can batch multiple tools per round
⚙️ Day Simulation
  • Engine runs with no agent involvement
  • 12-factor demand model
  • Quality scoring, sales, revenue
  • Reviews, reputation, staff XP
💭 Reflection
  • Agent receives day results: revenue, waste, reviews, demand signals
  • 5 memory tools only
  • 🔁 read → analyze → write notes
  • Up to 5 rounds
🔄 Memory updated → next day begins
No chat history accumulation. Each morning the agent receives a fresh system prompt with its Knowledge Base — a personal diary it builds itself, deciding what to record and what to forget. It gets up to 10 rounds of tool calls to make decisions (batched — multiple tools per round). After the day simulates, results come back: revenue, waste, reviews, demand signals. Then 5 more rounds for reflection — analyzing what happened and updating its notes. The next morning, the conversation resets completely: only the Knowledge Base carries forward. This tests reasoning ability — not context window size.

Tool Explorer

Every morning the agent receives 34 tools as JSON function calls. Click any tool to see its parameters and an example response — this is exactly what the AI sees.

get_weather_forecast

Get weather forecast for the next N days (condition, temperature).

Parameters
days · integer · default: 3
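
For readers unfamiliar with JSON function calling, a tool like this is typically described to the model as a small schema. The layout below follows the JSON-Schema style used by common function-calling APIs and is an assumption; only the tool name, description, and the `days` parameter (integer, default 3) come from the page. The `resolve_arguments` helper is hypothetical.

```python
import json

GET_WEATHER_FORECAST = {
    "name": "get_weather_forecast",
    "description": "Get weather forecast for the next N days (condition, temperature).",
    "parameters": {
        "type": "object",
        "properties": {"days": {"type": "integer", "default": 3}},
        "required": [],
    },
}


def resolve_arguments(tool: dict, raw_arguments: str) -> dict:
    """Parse the model's JSON arguments and fill in declared defaults."""
    args = json.loads(raw_arguments) if raw_arguments else {}
    for name, spec in tool["parameters"]["properties"].items():
        if name not in args and "default" in spec:
            args[name] = spec["default"]
    return args


# Usage: an empty call falls back to the 3-day default.
default_args = resolve_arguments(GET_WEATHER_FORECAST, "{}")
explicit_args = resolve_arguments(GET_WEATHER_FORECAST, '{"days": 5}')
```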

What's Next

Follow for updates and new articles.

Hard Mode

Extensive ideas for increasing cognitive load for models that can survive the current simulation. Not just tweaking cost coefficients — real structural complexity.

Model Breakdowns

Detailed strategy analysis for every model — several are already live. Check the Blog, or click any model in the Leaderboard for detailed stats.

More Models

Quality data requires multiple runs per model. Labs, researchers, and sponsors who can provide API access or grants — I'd love to expand the rankings.

Human Leaderboard

Will launch once enough player runs are collected. Play the simulation and see how you stack up against frontier AI.

About the Author

I've spent years in marketing, development, and building automated systems — and I've loved economic simulators and strategy games for as long as I can remember. This project sits right at the intersection of both.

With AI advancing seriously, I wanted to create my own benchmark — with my own vision. A place where frontier language models compete against each other and against real people in agentic business decision-making, with dozens of variables at play. It seemed like a fun and worthwhile thing to build.

I built the entire project solo: the simulation engine, the game, everything. For humanity and for fun.

Nicholas S.
Creator & Solo Developer

Get in Touch

Want your model benchmarked? Interested in collaboration, research partnerships, or grants? I'm open to working with AI labs, researchers, and anyone who finds this useful.

Support the Project

If you found this benchmark useful, interesting, or just fun — you can support its development.

☕ Buy Me a Coffee
USDT (TRC-20): TX5YLuJnn8DCKzLzTDuCS6jJvawEtiisdJ
ETH / USDC (ERC-20): 0x2cfAc0c7C82fFD3daE7A100cF45929DF5E385FCF