← Back to FoodTruck Bench
PROTOCOLLast updated: February 2026

Methodology

How we test AI models, what we measure, and why you can trust the results.

Benchmark Version

Current Versionv1.0 — Standard Mode

All results on the leaderboard are produced under v1.0 Standard Mode — a 30-day simulation with fixed costs, static ingredient prices, and deterministic demand. When future versions are released (Hard Mode, Arena), results will be separated by version and clearly labeled.

Testing Protocol

Every model is tested under identical conditions. There is no prompt tuning, no cherry-picking runs, and no manual intervention during simulation.

🎲

Deterministic Environment

Weather, events, competitor behavior, and demand noise are seed-based. Same seed = same world. This guarantees reproducibility across runs.

📝

Identical System Prompt

All models receive the exact same system prompt — rules, tools descriptions, and simulation mechanics. No model-specific tuning or hints.

📊

5 Runs per Model

Each model is run 5 times with different seeds. The median run (by net worth) is selected for the leaderboard.

🔒

No Human Intervention

Once a simulation starts, it runs to completion with zero human input. The agent makes all decisions autonomously through tool calls.

Simulation Parameters

These parameters are fixed for all v1.0 Standard Mode runs:

ParameterValueNotes
Duration30 daysDay 0 (setup) + Day 1–30 (operations)
Starting balance$2,000Same for all models
LocationAustin, TX6 available locations with distinct traits
Fixed daily costs$55/dayTruck lease $30 + Insurance $5 + Commissary $20
Recipes available20Burgers, tacos, sides, drinks, desserts, bowls
Ingredients30+Static prices in v1.0, shelf life 2–365 days
Staff pool21 candidatesWeekly rotation, max 4 hired at once
Tools available34 + 5 reflectionInformation, actions, memory, finance
Demand model12 factorsPrice, location, weather, day, events, reputation…
Seed42 (default)Controls weather, events, demand noise

Agent Architecture

The simulation is not one continuous conversation. Each day is a separate API session — but the agent is never starting from scratch. Every morning, the model receives its full operational context:

Total context per day: ~5–50K tokens, depending on how much the model writes to its scratchpad and key-value store. Models that keep detailed analytical notes naturally consume more tokens — there is no imposed structure or format for note-taking. Each model decides on its own how to organize its memory for maximum effectiveness.

Why per-day sessions? A single continuous conversation across 30 days would balloon to hundreds of thousands of tokens. Instead, the Knowledge Base acts as structured, persistent memory — the model has access to all its past performance data, insights, and strategic notes without carrying raw conversation history. Think of it like a business owner who checks their dashboards and journal each morning, not someone re-reading every email from the past month.

Daily Cycle

Each simulated day follows a strict three-phase cycle:

🌅Morning — Agent Decides

The agent receives the briefing and all 34 tools. It gathers information (check inventory, weather, events), then acts (choose location, set menu and prices, order ingredients, hire/fire staff). Ends by calling wait_for_next_day.

🌙Night — Simulation Resolves

The engine calculates demand (12 factors), resolves sales, consumes ingredients (FIFO), applies costs, updates reputation, checks bankruptcy conditions, and saves a checkpoint.

💭Reflection — Agent Learns

The agent sees its day results (revenue, waste, stockouts, staff no-shows) and has access to 5 memory tools only: write_scratchpad, read_scratchpad,store_kv, retrieve_kv, wait_for_next_day. Strategic notes written here carry into the next day's Knowledge Base.

Thinking Mode & Model Settings

Every model is tested at its maximum available performance settings. For models that support extended thinking / reasoning mode, thinking isalways enabled with the maximum available token budget for reasoning. No model is artificially restricted in any way — context length, thinking depth, token limits, or any other parameter.

This is indicated by the 🧠 icon on the leaderboard. The goal is to evaluate each model at its best, not to create artificial handicaps.

Agent Tools

The agent interacts with the simulation through 34 tools during the morning phase and 5 memory tools during reflection. Tools are provided as JSON function-calling schemas — the same format used in production AI applications. Click any tool below to see its parameters and an example response — exactly as the AI agent sees them.

get_weather_forecast

Get weather forecast for the next N days (condition, temperature).

Parameters
daysintegerdefault: 3

Scoring Methodology

The primary ranking metric is net worth (cash + inventory value at simulation end). Additional metrics provide a fuller picture:

MetricFormulaWhat It Shows
Net WorthCash + Inventory ValuePrimary ranking metric — total wealth at end
ROI(Net Worth − $2,000) ÷ $2,000Return on initial investment
RevenueSum of all daily salesTotal income generated
Profit MarginNet Profit ÷ RevenueOperational efficiency
Food WasteValue of expired ingredientsInventory management skill
Days ProfitableDays with positive daily profitConsistency of performance

Run Selection

For each model, 5 complete runs are performed. The run with the median net worth is selected for the leaderboard. Median (not mean) is used to reduce sensitivity to outliers — both lucky and unlucky runs.

What counts as a valid run? A run is valid if it reaches Day 30 or ends in bankruptcy. Runs that fail due to API errors, rate limits, or infrastructure issues are discarded and restarted — the model is not penalized for backend failures.

Bankruptcy Rules

A model goes bankrupt (simulation ends early) if any of these conditions are met:

Bankrupt models receive their final net worth at the time of bankruptcy. They are included in the leaderboard — bankruptcy is a valid and telling outcome.

Demand Model

Customer demand for each dish is calculated using a 12-factor multiplicative model. All factors are deterministic (seed-based), ensuring identical conditions across models:

demand = base × price × location × weather × time × day × event × variety × competition × reputation × popularity × noise

After raw demand is computed, it's constrained by kitchen capacity (limited by staff and hours) and inventory (limited by ingredient stock, FIFO). These constraints create real trade-offs: hiring more staff increases capacity but raises costs, ordering more ingredients risks spoilage.

The exact coefficients, ranges, and formulas for each factor are proprietary and not disclosed. This prevents models from being optimized against specific parameters and keeps the benchmark meaningful over time.

Fairness Guarantees

⚖️

Same Prompt

No model-specific instructions, hints, or strategy suggestions.

🌍

Same World

Deterministic environment — same weather, events, and competitors per seed.

🛠️

Same Tools

All 34 tools available. No tool advantages or restrictions per model.

📏

Same Evaluation

Identical scoring. Median selection. No post-hoc adjustments.

Closed Source, Open Results

The simulation engine, system prompts, and demand model parameters are not open-sourced. This is a deliberate choice:

What is published: all results, all metrics, scoring formulas, demand factor names, agent architecture, tool list, and this methodology document. What is not published: source code, system prompt text, exact coefficients, and internal simulation parameters.

Scripted Baseline

In addition to LLM agents, a rule-based scripted agent is included as a baseline. It uses simple heuristics (pick high-traffic locations, 2.5–3× markup, buy what you need) with zero API cost.

This provides a "floor" — if an LLM model can't outperform a script with hardcoded rules, its business reasoning is genuinely lacking.

Known Limitations

Reproducibility

Every simulation run produces a complete audit trail:

Any run can be replayed day-by-day using the Simulation Replay feature on the leaderboard. This provides full transparency into every decision the model made.

Questions?

For methodology questions, model submission requests, or research collaboration inquiries:

[email protected]