Methodology
How we test AI models, what we measure, and why you can trust the results.
Benchmark Version
All results on the leaderboard are produced under v1.0 Standard Mode — a 30-day simulation with fixed costs, static ingredient prices, and deterministic demand. When future versions are released (Hard Mode, Arena), results will be separated by version and clearly labeled.
Testing Protocol
Every model is tested under identical conditions. There is no prompt tuning, no cherry-picking runs, and no manual intervention during simulation.
Deterministic Environment
Weather, events, competitor behavior, and demand noise are seed-based. Same seed = same world. This guarantees reproducibility across runs.
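As a rough illustration of "same seed = same world", the sketch below derives a 30-day schedule from a single seed. The function name, the weather categories, and the noise range are assumptions made for the example, not the benchmark's actual values.

```python
import random


def generate_world(seed, days=30):
    """Build a per-day schedule of weather and demand noise from one seed."""
    rng = random.Random(seed)  # private generator, isolated from global state
    conditions = ["sunny", "cloudy", "rainy"]  # hypothetical weather categories
    return [
        {
            "day": day,
            "weather": rng.choice(conditions),
            "demand_noise": round(rng.uniform(0.9, 1.1), 3),  # hypothetical range
        }
        for day in range(1, days + 1)
    ]


# Two worlds built from the same seed are identical, draw for draw.
print(generate_world(42) == generate_world(42))
```

Using a dedicated `random.Random` instance rather than the module-level functions keeps the world independent of any other random draws in the process.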
Identical System Prompt
All models receive the exact same system prompt: rules, tool descriptions, and simulation mechanics. No model-specific tuning or hints.
5 Runs per Model
Each model is run 5 times with different seeds. The median run (by net worth) is selected for the leaderboard.
No Human Intervention
Once a simulation starts, it runs to completion with zero human input. The agent makes all decisions autonomously through tool calls.
Simulation Parameters
These parameters are fixed for all v1.0 Standard Mode runs:
| Parameter | Value | Notes |
|---|---|---|
| Duration | 30 days | Day 0 (setup) + Day 1–30 (operations) |
| Starting balance | $2,000 | Same for all models |
| Location | Austin, TX | 6 available locations with distinct traits |
| Fixed daily costs | $55/day | Truck lease $30 + Insurance $5 + Commissary $20 |
| Recipes available | 20 | Burgers, tacos, sides, drinks, desserts, bowls |
| Ingredients | 30+ | Static prices in v1.0, shelf life 2–365 days |
| Staff pool | 21 candidates | Weekly rotation, max 4 hired at once |
| Tools available | 34 + 5 reflection | Information, actions, memory, finance |
| Demand model | 12 factors | Price, location, weather, day, events, reputation… |
| Seed | 42 (default) | Controls weather, events, demand noise |
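To put the fixed costs in perspective, the arithmetic below totals them over the run. It assumes costs are charged on each of the 30 operating days and not on Day 0, which the table does not state explicitly.

```python
# Values taken from the parameter table above.
FIXED_DAILY_COSTS = {"truck_lease": 30, "insurance": 5, "commissary": 20}
STARTING_BALANCE = 2_000
OPERATING_DAYS = 30  # assumption: Day 0 (setup) incurs no fixed costs

daily_total = sum(FIXED_DAILY_COSTS.values())
total_fixed = daily_total * OPERATING_DAYS
remaining = STARTING_BALANCE - total_fixed

print(f"Daily fixed costs: ${daily_total}")
print(f"Fixed costs over the run: ${total_fixed}")
print(f"Starting balance left if nothing is sold or bought: ${remaining}")
```

Under that assumption, fixed costs alone would absorb most of the starting balance, so an agent has to generate revenue early to stay solvent.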
Agent Architecture
The simulation is not one continuous conversation. Each day is a separate API session — but the agent is never starting from scratch. Every morning, the model receives its full operational context:
- System prompt — simulation rules and all 34 tool descriptions (~4K tokens, fixed)
- Knowledge Base — rolling 14-day history with daily P&L records, location performance insights, menu analytics, and the agent's own strategic notes
- Morning briefing — current balance, full inventory, weather forecast, upcoming events, staff status and reliability
Total context per day: ~5–50K tokens, depending on how much the model writes to its scratchpad and key-value store. Models that keep detailed analytical notes naturally consume more tokens — there is no imposed structure or format for note-taking. Each model decides on its own how to organize its memory for maximum effectiveness.
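A minimal sketch of how a rolling 14-day history and a per-day context could be assembled is shown here. The class, field names, and text layout are hypothetical; the real Knowledge Base format is not published.

```python
from collections import deque


class KnowledgeBase:
    """Rolling daily history plus free-form notes (structure is hypothetical)."""

    def __init__(self, window_days=14):
        # deque(maxlen=...) drops the oldest record once the window is full
        self.daily_records = deque(maxlen=window_days)
        self.notes = []

    def record_day(self, day, revenue, costs):
        self.daily_records.append(
            {"day": day, "revenue": revenue, "costs": costs, "profit": revenue - costs}
        )

    def render(self):
        lines = [f"Day {r['day']}: profit ${r['profit']:.2f}" for r in self.daily_records]
        return "\n".join(lines + self.notes)


def build_daily_context(system_prompt, kb, briefing):
    """Concatenate the three parts the agent receives each morning."""
    return "\n\n".join([system_prompt, kb.render(), briefing])


kb = KnowledgeBase()
for day in range(1, 21):
    kb.record_day(day, revenue=500.0, costs=400.0)
kb.notes.append("Note: downtown sells best on weekdays.")
context = build_daily_context("RULES...", kb, "Morning briefing for day 21")
print(len(kb.daily_records))  # only the most recent 14 days are kept
```

In this sketch the notes never expire; whether the real system trims them is not stated in the description above.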
Daily Cycle
Each simulated day follows a strict three-phase cycle:
- Phase 1, morning: The agent receives the briefing and all 34 tools. It gathers information (check inventory, weather, events), then acts (choose location, set menu and prices, order ingredients, hire/fire staff). The phase ends when the agent calls wait_for_next_day.
- Phase 2, engine: The engine calculates demand (12 factors), resolves sales, consumes ingredients (FIFO), applies costs, updates reputation, checks bankruptcy conditions, and saves a checkpoint.
- Phase 3, reflection: The agent sees its day results (revenue, waste, stockouts, staff no-shows) and has access to 5 memory tools only: write_scratchpad, read_scratchpad, store_kv, retrieve_kv, wait_for_next_day. Strategic notes written here carry into the next day's Knowledge Base.
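The FIFO ingredient consumption mentioned in the engine step can be sketched as follows. The `Batch` type and `consume_fifo` function are illustrative names, not the engine's actual code.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Batch:
    quantity: float
    expires_on_day: int  # hypothetical field; shelf life is tracked per batch


def consume_fifo(batches, needed):
    """Consume from the oldest batches first; return the amount actually consumed."""
    consumed = 0.0
    while batches and consumed < needed:
        oldest = batches[0]
        take = min(oldest.quantity, needed - consumed)
        oldest.quantity -= take
        consumed += take
        if oldest.quantity == 0:
            batches.popleft()  # batch exhausted, move on to the next oldest
    return consumed


stock = deque([Batch(quantity=5, expires_on_day=3), Batch(quantity=10, expires_on_day=6)])
used = consume_fifo(stock, needed=8)
print(used, stock)
```

Consuming the oldest batch first means stock that is closest to expiry is used before fresher deliveries, which is why over-ordering shows up as waste rather than as stale sales.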
Thinking Mode & Model Settings
Every model is tested at its maximum available performance settings. For models that support extended thinking / reasoning mode, thinking is always enabled with the maximum available token budget for reasoning. No model is artificially restricted in context length, thinking depth, token limits, or any other parameter.
This is indicated by the 🧠 icon on the leaderboard. The goal is to evaluate each model at its best, not to create artificial handicaps.
Agent Tools
The agent interacts with the simulation through 34 tools during the morning phase and 5 memory tools during reflection. Tools are provided as JSON function-calling schemas, the same format used in production AI applications. Each tool comes with a description, its parameters, and an example response, exactly as the AI agent sees them. For example, the weather tool is described as:
Get weather forecast for the next N days (condition, temperature).
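For readers unfamiliar with function-calling schemas, here is what such a definition might look like for the weather tool, using one common layout. Only the description string comes from this page; the tool name `get_weather_forecast` and the `days` parameter are assumptions for illustration.

```python
import json

weather_tool_schema = {
    "type": "function",
    "function": {
        "name": "get_weather_forecast",  # hypothetical tool name
        "description": "Get weather forecast for the next N days (condition, temperature).",
        "parameters": {
            "type": "object",
            "properties": {
                "days": {  # hypothetical parameter name
                    "type": "integer",
                    "minimum": 1,
                    "description": "Number of days to forecast.",
                }
            },
            "required": ["days"],
        },
    },
}

# The schema is plain JSON-serializable data and can be sent with an API request.
print(json.dumps(weather_tool_schema, indent=2))
```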
Scoring Methodology
The primary ranking metric is net worth (cash + inventory value at simulation end). Additional metrics provide a fuller picture:
| Metric | Formula | What It Shows |
|---|---|---|
| Net Worth | Cash + Inventory Value | Primary ranking metric — total wealth at end |
| ROI | (Net Worth − $2,000) ÷ $2,000 | Return on initial investment |
| Revenue | Sum of all daily sales | Total income generated |
| Profit Margin | Net Profit ÷ Revenue | Operational efficiency |
| Food Waste | Value of expired ingredients | Inventory management skill |
| Days Profitable | Days with positive daily profit | Consistency of performance |
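The formulas in the table can be applied as in the sketch below. The function and argument names are illustrative, and the input figures are made up for the example.

```python
def score_run(cash, inventory_value, revenue, net_profit, daily_profits,
              starting_balance=2000.0):
    """Compute the leaderboard metrics from end-of-run figures."""
    net_worth = cash + inventory_value
    return {
        "net_worth": net_worth,
        "roi": (net_worth - starting_balance) / starting_balance,
        # Guard against division by zero for a run with no sales.
        "profit_margin": net_profit / revenue if revenue else 0.0,
        "days_profitable": sum(1 for profit in daily_profits if profit > 0),
    }


metrics = score_run(
    cash=3100.0,
    inventory_value=150.0,
    revenue=10000.0,
    net_profit=1250.0,
    daily_profits=[-40.0, 120.0, 0.0, 85.0],
)
print(metrics)
```

With these inputs the net worth is $3,250, giving an ROI of 62.5% and a profit margin of 12.5%. A day with exactly zero profit is not counted as profitable in this sketch.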
Run Selection
For each model, 5 complete runs are performed. The run with the median net worth is selected for the leaderboard. Median (not mean) is used to reduce sensitivity to outliers — both lucky and unlucky runs.
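Median selection over an odd number of runs can be sketched as below. The run records and seed values are invented for the example.

```python
def select_median_run(runs):
    """Return the run whose net worth is the median (expects an odd number of runs)."""
    ordered = sorted(runs, key=lambda run: run["net_worth"])
    return ordered[len(ordered) // 2]


runs = [
    {"seed": 1, "net_worth": 2400.0},
    {"seed": 2, "net_worth": 9100.0},  # lucky outlier
    {"seed": 3, "net_worth": 3100.0},
    {"seed": 4, "net_worth": -150.0},  # unlucky outlier
    {"seed": 5, "net_worth": 2950.0},
]
selected = select_median_run(runs)
print(selected)
```

Here the selected run has a net worth of $2,950, while the mean of the five would be $3,460, pulled upward by the single lucky run.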
Bankruptcy Rules
A model goes bankrupt (simulation ends early) if any of these conditions are met:
- Persistent overdraft: Balance below −$200 for 3 consecutive days
- Negative net worth: Net worth below $0 for 3 consecutive days
- Loan default: Unable to repay a loan on its due date — instant bankruptcy, no grace period
Bankrupt models receive their final net worth at the time of bankruptcy. They are included in the leaderboard — bankruptcy is a valid and telling outcome.
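One way to express the three conditions in code is sketched here, assuming the check runs once at the end of each day over the history so far. Function names and return values are hypothetical.

```python
def trailing_streak(values, predicate):
    """Count how many of the most recent values in a row satisfy the predicate."""
    count = 0
    for value in reversed(values):
        if not predicate(value):
            break
        count += 1
    return count


def bankruptcy_reason(balances, net_worths, loan_defaulted=False):
    """Return the triggered rule, or None if the business survives the day."""
    if loan_defaulted:
        return "loan_default"  # instant, no grace period
    if trailing_streak(balances, lambda b: b < -200) >= 3:
        return "persistent_overdraft"
    if trailing_streak(net_worths, lambda w: w < 0) >= 3:
        return "negative_net_worth"
    return None


print(bankruptcy_reason([100, -250, -300, -210], [500, 400, 300, 200]))
```

Because the streak is counted from the most recent day backwards, a single day above the threshold resets the count.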
Demand Model
Customer demand for each dish is calculated using a 12-factor multiplicative model. All factors are deterministic (seed-based), ensuring identical conditions across models:
demand = base × price × location × weather × time × day × event × variety × competition × reputation × popularity × noise
After raw demand is computed, it's constrained by kitchen capacity (limited by staff and hours) and inventory (limited by ingredient stock, FIFO). These constraints create real trade-offs: hiring more staff increases capacity but raises costs, ordering more ingredients risks spoilage.
The exact coefficients, ranges, and formulas for each factor are proprietary and not disclosed. This prevents models from being optimized against specific parameters and keeps the benchmark meaningful over time.
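Since the coefficients are undisclosed, the sketch below only shows the overall shape of the calculation: multiply the factors, then cap the result by capacity and inventory. All multiplier values are placeholders, and treating both constraints as a simple minimum is an assumption.

```python
import math


def served_demand(base, factors, kitchen_capacity, inventory_portions):
    """Raw demand is the product of all factors; sales are capped by both constraints."""
    raw = base * math.prod(factors.values())
    return min(raw, kitchen_capacity, inventory_portions)


# Placeholder multipliers: 1.0 means "no effect" for that factor.
factors = {
    "price": 0.5, "location": 1.0, "weather": 1.5, "time": 1.0,
    "day": 1.0, "event": 1.0, "variety": 1.0, "competition": 1.0,
    "reputation": 1.0, "popularity": 1.0, "noise": 1.0,
}

unconstrained = served_demand(100, factors, kitchen_capacity=500, inventory_portions=500)
capacity_bound = served_demand(100, factors, kitchen_capacity=60, inventory_portions=500)
print(unconstrained, capacity_bound)
```

Because the model is multiplicative, one weak factor scales down the whole result; a strong location cannot fully compensate for a poor price factor.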
Fairness Guarantees
Same Prompt
No model-specific instructions, hints, or strategy suggestions.
Same World
Deterministic environment — same weather, events, and competitors per seed.
Same Tools
All 34 tools available. No tool advantages or restrictions per model.
Same Evaluation
Identical scoring. Median selection. No post-hoc adjustments.
Closed Source, Open Results
The simulation engine, system prompts, and demand model parameters are not open-sourced. This is a deliberate choice:
- Prevents gaming: If the exact demand formula is known, models (or their trainers) could be optimized against specific coefficients rather than demonstrating genuine business reasoning
- Protects longevity: The benchmark remains useful as long as the internals are unknown — similar to how standardized tests don't publish answer keys in advance
- Industry standard: LMSYS Chatbot Arena, Anthropic's internal evals, and many established benchmarks use this model — public results, private methodology details
What is published: all results, all metrics, scoring formulas, demand factor names, agent architecture, tool list, and this methodology document. What is not published: source code, system prompt text, exact coefficients, and internal simulation parameters.
Scripted Baseline
In addition to LLM agents, a rule-based scripted agent is included as a baseline. It uses simple heuristics (pick high-traffic locations, 2.5–3× markup, buy what you need) with zero API cost.
This provides a "floor": if an LLM can't outperform a script with hardcoded rules, its business reasoning is genuinely lacking.
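The kind of heuristics such a baseline uses might look like the sketch below. The function names, the 2.75× default, and the traffic scores are hypothetical; the actual script is not published.

```python
def baseline_price(ingredient_cost, markup=2.75):
    """Price a dish at a fixed multiple of its ingredient cost (within the 2.5-3x range)."""
    return round(ingredient_cost * markup, 2)


def pick_location(traffic_by_location):
    """Choose the location with the highest traffic score."""
    return max(traffic_by_location, key=traffic_by_location.get)


def order_quantity(expected_portions, portions_in_stock):
    """Buy only what is needed to cover expected sales."""
    return max(0, expected_portions - portions_in_stock)


print(baseline_price(2.0, markup=2.5))
print(pick_location({"downtown": 120, "campus": 95, "park": 60}))
print(order_quantity(expected_portions=80, portions_in_stock=30))
```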
Known Limitations
- Simulation, not reality: The demand model, while complex, is still a model. Real food truck economics involve weather uncertainty, health inspections, and many factors we don't simulate
- English-only prompts: All models receive English prompts. Performance of non-English-native models may be affected
- No visual input: Models interact through text-based tools only; no images, maps, or other visual data are provided
- Cost asymmetry: More expensive models (Opus, GPT-5.2) consume more tokens, which is tracked but not penalized in the primary ranking
- Single-agent only (v1.0): The current version tests individual agents. Multi-agent delegation and competition are planned for future versions
Reproducibility
Every simulation run produces a complete audit trail:
- Day logs — every tool call, every decision, every result per day
- Checkpoints — full simulation state snapshots for resume/replay
- Conversation history — complete LLM interaction logs
- Knowledge Base snapshots — what the agent "remembered" each day
- Run metadata — config, timing, token usage, API costs
Any run can be replayed day-by-day using the Simulation Replay feature on the leaderboard. This provides full transparency into every decision the model made.
Questions?
For methodology questions, model submission requests, or research collaboration inquiries: