PROTOCOLLast updated: February 2026

Methodology

How we test AI models, what we measure, and why you can trust the results.

Benchmark Version

Current Versionv1.0 — Standard Mode

All results on the leaderboard are produced under v1.0 Standard Mode — a 30-day simulation with fixed costs, static ingredient prices, and deterministic demand. When future versions are released (Hard Mode, Arena), results will be separated by version and clearly labeled.

Testing Protocol

Every model is tested under identical conditions. There is no prompt tuning, no cherry-picking runs, and no manual intervention during simulation.

🎲

Deterministic Environment

Weather, events, competitor behavior, and demand noise are seed-based. Same seed = same world. This guarantees reproducibility across runs.

📝

Identical System Prompt

All models receive the exact same system prompt — rules, tools descriptions, and simulation mechanics. No model-specific tuning or hints.

📊

5 Runs per Model

Each model is run 5 times with different seeds. The median run (by net worth) is selected for the leaderboard.

🔒

No Human Intervention

Once a simulation starts, it runs to completion with zero human input. The agent makes all decisions autonomously through tool calls.

Simulation Parameters

These parameters are fixed for all v1.0 Standard Mode runs:

Parameter	Value	Notes
Duration	30 days	Day 0 (setup) + Day 1–30 (operations)
Starting balance	$2,000	Same for all models
Location	Austin, TX	6 available locations with distinct traits
Fixed daily costs	$55/day	Truck lease $30 + Insurance $5 + Commissary $20
Recipes available	20	Burgers, tacos, sides, drinks, desserts, bowls
Ingredients	30+	Static prices in v1.0, shelf life 2–365 days
Staff pool	21 candidates	Weekly rotation, max 4 hired at once
Tools available	34 + 5 reflection	Information, actions, memory, finance
Demand model	12 factors	Price, location, weather, day, events, reputation…
Seed	42 (default)	Controls weather, events, demand noise

Agent Architecture

The simulation is not one continuous conversation. Each day is a separate API session — but the agent is never starting from scratch. Every morning, the model receives its full operational context:

System prompt — simulation rules and all 34 tool descriptions (~4K tokens, fixed)
Knowledge Base — rolling 14-day history with daily P&L records, location performance insights, menu analytics, and the agent's own strategic notes
Morning briefing — current balance, full inventory, weather forecast, upcoming events, staff status and reliability

Total context per day: ~5–50K tokens, depending on how much the model writes to its scratchpad and key-value store. Models that keep detailed analytical notes naturally consume more tokens — there is no imposed structure or format for note-taking. Each model decides on its own how to organize its memory for maximum effectiveness.

Why per-day sessions? A single continuous conversation across 30 days would balloon to hundreds of thousands of tokens. Instead, the Knowledge Base acts as structured, persistent memory — the model has access to all its past performance data, insights, and strategic notes without carrying raw conversation history. Think of it like a business owner who checks their dashboards and journal each morning, not someone re-reading every email from the past month.

Daily Cycle

Each simulated day follows a strict three-phase cycle:

🌅Morning — Agent Decides

The agent receives the briefing and all 34 tools. It gathers information (check inventory, weather, events), then acts (choose location, set menu and prices, order ingredients, hire/fire staff). Ends by calling wait_for_next_day.

🌙Night — Simulation Resolves

The engine calculates demand (12 factors), resolves sales, consumes ingredients (FIFO), applies costs, updates reputation, checks bankruptcy conditions, and saves a checkpoint.

💭Reflection — Agent Learns

The agent sees its day results (revenue, waste, stockouts, staff no-shows) and has access to 5 memory tools only: write_scratchpad, read_scratchpad,store_kv, retrieve_kv, wait_for_next_day. Strategic notes written here carry into the next day's Knowledge Base.

Thinking Mode & Model Settings

Every model is tested at its maximum available performance settings. For models that support extended thinking / reasoning mode, thinking isalways enabled with the maximum available token budget for reasoning. No model is artificially restricted in any way — context length, thinking depth, token limits, or any other parameter.

This is indicated by the 🧠 icon on the leaderboard. The goal is to evaluate each model at its best, not to create artificial handicaps.

Agent Tools

The agent interacts with the simulation through 34 tools during the morning phase and 5 memory tools during reflection. Tools are provided as JSON function-calling schemas — the same format used in production AI applications. Click any tool below to see its parameters and an example response — exactly as the AI agent sees them.

get_weather_forecast

Get weather forecast for the next N days (condition, temperature).

Parameters

daysintegerdefault: 3

Scoring Methodology

The primary ranking metric is net worth (cash + inventory value at simulation end). Additional metrics provide a fuller picture:

Metric	Formula	What It Shows
Net Worth	Cash + Inventory Value	Primary ranking metric — total wealth at end
ROI	(Net Worth − $2,000) ÷ $2,000	Return on initial investment
Revenue	Sum of all daily sales	Total income generated
Profit Margin	Net Profit ÷ Revenue	Operational efficiency
Food Waste	Value of expired ingredients	Inventory management skill
Days Profitable	Days with positive daily profit	Consistency of performance

Run Selection

For each model, 5 complete runs are performed. The run with the median net worth is selected for the leaderboard. Median (not mean) is used to reduce sensitivity to outliers — both lucky and unlucky runs.

What counts as a valid run? A run is valid if it reaches Day 30 or ends in bankruptcy. Runs that fail due to API errors, rate limits, or infrastructure issues are discarded and restarted — the model is not penalized for backend failures.

Bankruptcy Rules

A model goes bankrupt (simulation ends early) if any of these conditions are met:

Persistent overdraft: Balance below −$200 for 3 consecutive days
Negative net worth: Net worth below $0 for 3 consecutive days
Loan default: Unable to repay a loan on its due date — instant bankruptcy, no grace period

Bankrupt models receive their final net worth at the time of bankruptcy. They are included in the leaderboard — bankruptcy is a valid and telling outcome.

Demand Model

Customer demand for each dish is calculated using a 12-factor multiplicative model. All factors are deterministic (seed-based), ensuring identical conditions across models:

demand = base × price × location × weather × time × day × event × variety × competition × reputation × popularity × noise

After raw demand is computed, it's constrained by kitchen capacity (limited by staff and hours) and inventory (limited by ingredient stock, FIFO). These constraints create real trade-offs: hiring more staff increases capacity but raises costs, ordering more ingredients risks spoilage.

The exact coefficients, ranges, and formulas for each factor are proprietary and not disclosed. This prevents models from being optimized against specific parameters and keeps the benchmark meaningful over time.

Fairness Guarantees

⚖️

Same Prompt

No model-specific instructions, hints, or strategy suggestions.

🌍

Same World

Deterministic environment — same weather, events, and competitors per seed.

🛠️

Same Tools

All 34 tools available. No tool advantages or restrictions per model.

📏

Same Evaluation

Identical scoring. Median selection. No post-hoc adjustments.

Closed Source, Open Results

The simulation engine, system prompts, and demand model parameters are not open-sourced. This is a deliberate choice:

Prevents gaming: If the exact demand formula is known, models (or their trainers) could be optimized against specific coefficients rather than demonstrating genuine business reasoning
Protects longevity: The benchmark remains useful as long as the internals are unknown — similar to how standardized tests don't publish answer keys in advance
Industry standard: LMSYS Chatbot Arena, Anthropic's internal evals, and many established benchmarks use this model — public results, private methodology details

What is published: all results, all metrics, scoring formulas, demand factor names, agent architecture, tool list, and this methodology document. What is not published: source code, system prompt text, exact coefficients, and internal simulation parameters.

Scripted Baseline

In addition to LLM agents, a rule-based scripted agent is included as a baseline. It uses simple heuristics (pick high-traffic locations, 2.5–3× markup, buy what you need) with zero API cost.

This provides a "floor" — if an LLM model can't outperform a script with hardcoded rules, its business reasoning is genuinely lacking.

Known Limitations

Simulation, not reality: The demand model, while complex, is still a model. Real food truck economics involve weather uncertainty, health inspections, and many factors we don't simulate
English-only prompts: All models receive English prompts. Performance of non-English-native models may be affected
No visual input: Models interact through text-based tools only — no images, maps, or visual data is provided
Cost asymmetry: More expensive models (Opus, GPT-5.2) consume more tokens, which is tracked but not penalized in the primary ranking
Single-agent only (v1.0): The current version tests individual agents. Multi-agent delegation and competition are planned for future versions

Reproducibility

Every simulation run produces a complete audit trail:

Day logs — every tool call, every decision, every result per day
Checkpoints — full simulation state snapshots for resume/replay
Conversation history — complete LLM interaction logs
Knowledge Base snapshots — what the agent "remembered" each day
Run metadata — config, timing, token usage, API costs

Any run can be replayed day-by-day using the Simulation Replay feature on the leaderboard. This provides full transparency into every decision the model made.

Questions?

For methodology questions, model submission requests, or research collaboration inquiries:

[email protected]