Qwen 3.5 9B: Small Model, Big Game
Nine billion parameters. The smallest model on a 17-model leaderboard.
It made 286 tool calls across 15 days. Checked inventory, hired staff, took loans, wrote strategy notes.
It earned $3,443 in revenue serving 9 different recipes across Austin.
It went bankrupt anyway — but the fact that it played the game at all is the story.
Key Findings
Based on 5 simulations under identical conditions. All figures below are from the primary run (122511, 15 days) unless noted otherwise.
- The smallest model works: Qwen 3.5 9B — at just 9 billion parameters — successfully operated a 34-tool agentic simulation for 15 days. It made 286 tool calls, used 12 different tools, served 9 recipes, managed a staff of 2, and took strategic loans. For context, this is a benchmark where even 120B+ models regularly go bankrupt. A 9B model actively participating is genuinely impressive.
- The observe → act gap: The model made 112 information tool calls and wrote 82 strategy notes (all auto-injected into its morning context). It diagnosed problems accurately — «FOOD WASTE CATASTROPHE» — but couldn't translate that diagnosis into corrective action. It's not a perception failure. It's a reasoning composition ceiling at 9B parameters.
- 100% bankruptcy, but a tight cluster: All 5 runs ended in bankruptcy with an average lifespan of 13.8 days (range: 12–16). The tight clustering suggests a hard but consistent capability ceiling, not random failure. The model hits the same wall every time.
- $0.19/run — absurdly cheap: At under 20 cents per simulation, this is the cheapest model tested. And it works. Not profitably, but it runs the full agentic loop: observe, plan, act, reflect. The cost-to-capability ratio for small models in agentic tasks is remarkable.
The Setup
FoodTruck Bench is a benchmark for evaluating the business capabilities of language models. An AI agent gets $2,000 and a food truck in Austin, TX. Every day it must choose a location, compose a menu, set prices, manage inventory, and hire staff. The simulation runs 30 days with realistic demand, weather, events, and competition. The food truck is scaffolding; the real test is forward planning, feedback integration, cost-benefit reasoning, and risk assessment under compound uncertainty.
I ran Qwen 3.5 9B five times via OpenRouter under identical conditions (same seed, same prompt, same tools). All 5 runs ended in bankruptcy with a tight 12–16 day lifespan cluster. The primary run (122511, 15 days, −$679 net worth) serves as the reference for all day-by-day examples and quotes in this article. Aggregate statistics use data from all 5 runs.
For comparison I use the median runs of two leaderboard neighbors: MiniMax M2.5 (#12, −$317 NW, 22 days, 100% bankrupt) and Claude Haiku 4.5 (#14, $166 NW, 14 days, 100% bankrupt). All three sit in the bankruptcy tier — models #10–#17 that consistently fail to survive.
What 9 Billion Parameters Can Do
Before we get to where it breaks, let's appreciate what works. This is a model that fits on a laptop GPU. It has never seen this benchmark. It gets a system prompt, 34 tool definitions, and a morning briefing — and it runs a food truck business.
| Capability | What it did |
|---|---|
| Tool discovery | Used 12 of 34 tools autonomously — no examples, no few-shot |
| Multi-step workflows | Correct sequence: check inventory → choose location → set menu → order ingredients → wait |
| Menu composition | Served 9 different recipes, rotated based on location type |
| Staff management | Searched candidates, hired 2 employees, attempted to fire staff (failed — passed names instead of IDs) |
| Financial operations | Took strategic loans when cash was low, managed repayment timing |
| Information gathering | 112 info calls — checked balance, inventory, weather, competitors, suppliers |
| Self-reflection | 82 strategy notes with accurate terminology and precise numbers |
| Revenue generation | $3,443 total revenue across 15 days — real sales to simulated customers |
For perspective: GPT-5 Mini (a much larger model) went bankrupt in 75% of its runs. Grok 4.1 Fast: 100%. GPT OSS 120B, at 120 billion parameters: 80%. The fact that a 9B model plays in the same league as models 10–15× its size is the headline, not the bankruptcy.
The Disconnect: Seeing ≠ Understanding ≠ Doing
Most failing models on this benchmark fail because they can't use tools. Qwen 3.5 9B fails despite using them. It made 286 tool calls in 15 days: 112 to info tools (39%) and 174 to action/memory tools (61%). That's about 19 tool calls per day. For context, top-performing models like Opus 4.6 average ~20 calls/day.
| Tool Category | Calls | Examples |
|---|---|---|
| Information gathering | 112 (39%) | get_inventory, get_balance, get_weather, get_staff_info |
| Actions | 72 | choose_location, set_menu, order_ingredients, hire/fire |
| Memory writes | 82 | store_kv (58), write_scratchpad (24) — all auto-injected into context |
The model checked inventory every morning — then set menus with items it didn't have stock for. It checked balance — then placed orders it couldn't afford (28% order failure rate due to insufficient funds). It checked weather — then chose locations with no weather-appropriate items. The data went in, but the planning pipeline couldn't connect observation to action.
Agentic signal — the observe→plan→act pipeline: At 9B parameters, the model can execute each stage individually — it can observe (112 info calls), it can plan (82 strategy notes), and it can act (72 action calls). But it cannot chain them: observation doesn't inform planning, and planning doesn't constrain action. Each stage operates in isolation. This is a fundamentally different failure from larger models like Qwen 3.5 397B, which at least shows evidence of observation influencing tool selection. The 9B model does all the right things — in the wrong order, with no causal links between them.
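One way to picture the missing link is a gate that sits between the info tools and the action tools. The sketch below is purely illustrative: the `Observation` shape, the recipe names, the stock levels, and the `validate_*` helpers are my own assumptions, not part of the benchmark's tool API. It shows the two checks the model never effectively made: can every menu item be produced from current stock, and does the observed balance cover the order?

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Snapshot of what the info tools returned this morning (hypothetical shape)."""
    balance: float              # cash on hand, in dollars
    inventory: dict[str, int]   # ingredient name -> units in stock


def validate_menu(menu: dict[str, dict[str, int]], obs: Observation) -> list[str]:
    """Return the recipes that cannot be made even once from current stock.

    `menu` maps a recipe name to the ingredient units one serving needs.
    """
    return [
        recipe
        for recipe, needs in menu.items()
        if any(obs.inventory.get(ing, 0) < qty for ing, qty in needs.items())
    ]


def validate_order(order_cost: float, obs: Observation) -> bool:
    """An order is only placeable if the observed balance covers it."""
    return order_cost <= obs.balance


# Illustrative numbers only; none of these come from the actual run logs.
obs = Observation(balance=150.0, inventory={"tortilla": 40, "beef": 0, "cheese": 12})
menu = {
    "beef_taco": {"tortilla": 1, "beef": 1},
    "quesadilla": {"tortilla": 1, "cheese": 2},
}

unservable = validate_menu(menu, obs)    # recipes to drop before calling set_menu
can_order = validate_order(180.0, obs)   # False: the order would bounce
```

Both checks are single comparisons against data the model had already fetched. The gap the run exposes is not in fetching the data but in making the action conditional on it.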
The Diagnosis Loop
58 store_kv calls. 24 write_scratchpad calls. Every note is auto-injected into the next morning's context — the model sees all its previous analyses. The system works exactly as designed. The failure is downstream: the model reads «FOOD WASTE: $204 lost» in its morning briefing, then orders $180 more ingredients it can't sell. The information reaches the model. The model can't translate information into behavioral change.
Compare with Qwen 3.5 397B: that model also wrote extensively (74 KV entries) and also failed to close the loop — but at least showed occasional evidence of diagnosis influencing behavior. The 9B model shows zero evidence of strategy adaptation. It's the same diagnose→ignore→repeat loop, just more severe.
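To make "diagnosis influencing behavior" concrete, here is a minimal sketch of the kind of rule a waste note could feed. Everything in it is hypothetical: the function name, the proportional-shrink rule, and the $300 previous order are assumptions chosen for illustration. Only the $204 waste figure comes from the model's own note.

```python
def next_order_budget(last_order: float, wasted: float, floor: float = 0.0) -> float:
    """Shrink the next ingredient order by the share of the last one that was wasted.

    This is one simple proportional rule, not the benchmark's or any model's method.
    """
    if last_order <= 0:
        return floor
    used_share = max(0.0, 1.0 - wasted / last_order)
    return max(floor, last_order * used_share)


# Assumed: a $300 previous order. Sourced: $204 of food waste in the note.
budget = next_order_budget(last_order=300.0, wasted=204.0)
```

Under these assumed numbers the rule would cap the next order near $96, well below the $180 the model actually spent after reading its own note.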
Three Models, One Destination
Plot daily revenue for the three models and the pattern jumps out: Qwen 3.5 9B has the most volatile revenue, with sharp spikes of $422, $728, and $640 alternating with zero-revenue days. MiniMax M2.5 stops earning entirely after Day 5, running on fumes for 17 more days. Claude Haiku 4.5 has a similar dead zone from Days 6–11. The common pattern: a burst of early revenue followed by structural collapse.
Agentic signal — the $2,000 death clock: All three models peak between Days 1–4, then begin a monotonic net worth decline. The $2,000 starting capital provides a buffer that masks bad decisions for about a week. By the time the consequences become visible, it's already too late. A model with forward planning would recognize the burn rate early and adjust. None of these models can.
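The burn-rate check a forward-planning agent would run is short. The sketch below is a rough approximation under stated assumptions: it spreads the primary run's total costs ($6,771, the sum of the categories in the cost breakdown later in this article) and total revenue ($3,443) evenly over 15 days, and it ignores loans and overdraft. The `runway_days` helper is my own illustration, not a benchmark tool.

```python
def runway_days(cash: float, daily_revenue: float, daily_cost: float) -> float:
    """Days until cash reaches zero at the current net burn (inf if not burning)."""
    net_burn = daily_cost - daily_revenue
    if net_burn <= 0:
        return float("inf")
    return cash / net_burn


DAYS = 15
TOTAL_COST = 6771      # sum of the cost-breakdown categories, primary run
TOTAL_REVENUE = 3443   # primary run revenue
STARTING_CASH = 2000

runway = runway_days(STARTING_CASH, TOTAL_REVENUE / DAYS, TOTAL_COST / DAYS)
```

With an even spread, the net burn is roughly $222/day, which puts the runway at about nine days. That lines up loosely with the "about a week" buffer described above.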
Agentic Capability Map
How does the smallest model on the benchmark compare to its sibling (Qwen 3.5 397B) and the top performers?
| Agentic Capability | Qwen 3.5 9B | Qwen 3.5 397B | Top Models |
|---|---|---|---|
| Tool usage | ~ 286 calls, 39% info | ✓ Selective | ✓ 70–85% diversity |
| Observation → action | ✗ No causal link | ~ Weak link | ✓ Data-driven decisions |
| Feedback integration | ✗ 82 notes, no adaptation | ✗ 74 writes, not applied | ~ Partial to strong |
| Resource optimization | ✗ $1,272 waste (37% of rev) | ✗ $1,113 waste | ✓ $2–$302 waste |
| Cash flow management | ✗ Overdraft by Day 7 | ✗ Overdraft by Day 19 | ✓ Positive balance |
| Staff ROI | ✗ 77% of revenue on wages | ✗ 73% of revenue | ✓ Under 35% |
| Self-diagnosis quality | ~ Accurate but ignored | ✓ Detailed and precise | ✓ Leads to adaptation |
The 9B model is an informed incompetent — it gathers data, writes analyses, but can't connect them to decisions. The 397B model is a competent non-learner — it makes better initial decisions but still can't close the feedback loop. 44× more parameters buys better initial calibration, not adaptability.
Where the Money Went
| Category | Amount | % of Total |
|---|---|---|
| Staff | $2,645 | 39.1% |
| Food Waste | $1,272 | 18.8% |
| Ingredients | $986 | 14.6% |
| Fixed (lease+insurance+commissary) | $825 | 12.2% |
| Location fees | $600 | 8.9% |
| Fuel | $325 | 4.8% |
| Overdraft interest | $118 | 1.7% |
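The percentages in the table can be reproduced from the dollar amounts alone. A minimal check, assuming "% of Total" means each category's share of total spending:

```python
# Dollar amounts copied from the cost table (primary run).
costs = {
    "Staff": 2645,
    "Food Waste": 1272,
    "Ingredients": 986,
    "Fixed": 825,
    "Location fees": 600,
    "Fuel": 325,
    "Overdraft interest": 118,
}
REVENUE = 3443

total_cost = sum(costs.values())
shares = {name: round(100 * amount / total_cost, 1) for name, amount in costs.items()}
cost_to_revenue = total_cost / REVENUE
```

Total spending comes to $6,771, nearly twice the run's $3,443 in revenue.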
Staff = 77% of revenue ($2,645 vs $3,443 revenue). The model called get_staff_info regularly, so it knew who was on staff and what they cost. It hired 2 employees, spending $264/day on labor for a truck averaging $230/day in revenue. Staff cost exceeded revenue on most days. The data was there. The math was simple. The model never did the math.
Agentic signal — arithmetic in context: The model can retrieve two numbers (staff cost, daily revenue) from its own tool calls. It cannot compare them. At 9B parameters, the reasoning chain «$264/day staff cost > $230/day revenue → fire someone» requires holding multiple values in working memory, performing arithmetic, and initiating an action — a three-hop chain that the model's capacity doesn't support in-context.
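Written out as code, the three hops are retrieve, compare, and act. The sketch below assumes hypothetical helper names and action labels. The 35% threshold is borrowed from the "Under 35%" figure for top models in the capability map and is used here only as an illustrative target.

```python
import math

TARGET_STAFF_SHARE = 0.35  # illustrative target, taken from the top-model row


def staff_cost_share(daily_staff_cost: float, daily_revenue: float) -> float:
    """Wages as a fraction of revenue; infinite on a zero-revenue day."""
    if daily_revenue <= 0:
        return math.inf
    return daily_staff_cost / daily_revenue


def staffing_action(daily_staff_cost: float, daily_revenue: float) -> str:
    """Map the comparison to a (hypothetical) action label."""
    share = staff_cost_share(daily_staff_cost, daily_revenue)
    if share > 1.0:
        return "reduce_staff"   # wages alone exceed revenue
    if share > TARGET_STAFF_SHARE:
        return "review_staff"   # above target but not yet underwater
    return "keep_staff"


share = staff_cost_share(264, 230)   # figures from the primary run
action = staffing_action(264, 230)
```

At $264/day against $230/day the share is about 1.15, so even the crudest rule flags the staffing level on an average day.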
All 5 Runs
| Run | Days | Revenue | Net Worth | Waste | ROI | Result |
|---|---|---|---|---|---|---|
| 122516 | 12 | $3,092 | −$551 | $1,391 | −128% | 💀 |
| 122519 | 13 | $4,034 | −$1,030 | $307 | −151% | 💀 |
| 153511 | 13 | $5,074 | $328 | $1,190 | −84% | 💀 |
| 122511 | 15 | $3,443 | −$679 | $1,272 | −134% | 💀 |
| 153517 | 16 | $4,699 | $402 | $1,140 | −80% | 💀 |
| Average | 13.8 | $4,068 | −$306 | $1,060 | −115% | 5/5 💀 |
100% bankruptcy rate. Average lifespan: 13.8 days. Average waste: $1,060, so 26% of revenue goes straight to the dumpster. The spread in lifespan is remarkably narrow (12–16 days), which suggests this isn't bad luck; it's a hard ceiling.
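The averages can be recomputed directly from the five rows of the table. One detail worth noting: the 26% waste figure matches average waste divided by average revenue, which is how the sketch below computes it. The tuple layout is my own choice.

```python
from statistics import mean

# (run id, days survived, revenue, net worth, food waste), copied from the table
RUNS = [
    ("122516", 12, 3092, -551, 1391),
    ("122519", 13, 4034, -1030, 307),
    ("153511", 13, 5074, 328, 1190),
    ("122511", 15, 3443, -679, 1272),
    ("153517", 16, 4699, 402, 1140),
]

days = [r[1] for r in RUNS]
avg_days = mean(days)
avg_revenue = mean([r[2] for r in RUNS])
avg_net_worth = mean([r[3] for r in RUNS])
avg_waste = mean([r[4] for r in RUNS])

waste_share = avg_waste / avg_revenue   # ratio of averages
lifespan_spread = max(days) - min(days)
```

Averaging the per-run waste ratios instead would give a somewhat higher figure, because run 122516 wasted about 45% of its revenue while run 122519 wasted under 8%.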
Leaderboard Position: #13 of 17
| # | Model | Net Worth | ROI | Bankruptcy rate |
|---|---|---|---|---|
| 1 | Claude Opus 4.6 | $49,519 | +2,376% | 0% |
| 2 | GPT-5.2 | $28,081 | +1,304% | 0% |
| 3–5 | Sonnet 4.6 / G3 Pro / G3.1 CT | $12.7K–$17.4K | +537–771% | 0–17% |
| … | (profitability cliff) | | | |
| 10 | Kimi K2.5 | $30 | −99% | 80% |
| 11 | GPT OSS 120B | $92 | −95% | 80% |
| 12 | MiniMax M2.5 | −$317 | −116% | 100% |
| 13 | Qwen 3.5 9B | −$679 | −134% | 100% |
| 14 | Claude Haiku 4.5 | $166 | −92% | 100% |
| 15 | Grok 4.1 Fast | $817 | −59% | 100% |
| 16 | GPT-5 Mini | $50 | −98% | 75% |
| 17 | Qwen3 VL 235B | −$525 | −126% | 100% |
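The article does not define the ROI column, but every single-model row is consistent with ROI = (net worth − starting capital) / starting capital, using the $2,000 starting capital. Treat that formula as an inference from the numbers, not a documented definition. A quick check, allowing one percentage point of rounding slack:

```python
STARTING_CAPITAL = 2000


def roi_pct(net_worth: float) -> float:
    """Inferred ROI definition: percentage change relative to starting capital."""
    return 100 * (net_worth - STARTING_CAPITAL) / STARTING_CAPITAL


# model -> (net worth, ROI % as reported in the leaderboard)
ROWS = {
    "Claude Opus 4.6": (49519, 2376),
    "GPT-5.2": (28081, 1304),
    "Kimi K2.5": (30, -99),
    "GPT OSS 120B": (92, -95),
    "MiniMax M2.5": (-317, -116),
    "Qwen 3.5 9B": (-679, -134),
    "Claude Haiku 4.5": (166, -92),
    "Grok 4.1 Fast": (817, -59),
    "GPT-5 Mini": (50, -98),
    "Qwen3 VL 235B": (-525, -126),
}

mismatches = [
    model
    for model, (net_worth, reported) in ROWS.items()
    if abs(roi_pct(net_worth) - reported) > 1.0
]
```

An ROI below −100% simply means the run ended with negative net worth, that is, owing more than it owned.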
In Its Own Words
«FOOD WASTE CRISIS: $204 lost = ~30% of revenue wasted!»
— Qwen 3.5 9B, Day 4 reflection
The first mention of waste. It had already checked inventory that morning. It knew what was expiring. It would write this same observation for 9 more days without ever reducing order sizes.
«11% utilization — TERRIBLE. $563.53 in FOOD WASTE»
— Qwen 3.5 9B, Day 6 reflection
Perfect diagnosis. The model checked get_balance, get_inventory, and get_staff_info that morning — it knew staff cost $264/day against $58 revenue. It kept both employees.
«Lost $378 with $900.45 in expired ingredients (UNACCEPTABLE!)»
— Qwen 3.5 9B, Day 9 reflection
$0 revenue that day. It called get_inventory in the morning (10 info calls that day). It knew exactly what was there. 45 customers wanted food. It set a menu with items that stocked out after 2 servings.
«FOOD WASTE CATASTROPHE: $1,272 lost!»
— Qwen 3.5 9B, Day 15 (final day)
Last words before bankruptcy. Fifteen days of diagnosing the same problem. 112 info calls. 82 strategy notes. Not a single corrective action. The model that documents every failure in real time — and contributes to the next one.
Verdict
Qwen 3.5 9B is the most fascinating model on this benchmark — and the most underestimated.
Yes, it went bankrupt. Every time. But step back and look at what it did: a 9-billion-parameter model, small enough to run on a laptop, walked into a simulation with 34 tools it had never seen and started running a food truck. It chose locations. It composed menus. It hired staff. It took loans. It checked its balance, its inventory, the weather. It wrote strategic analyses that accurately diagnosed every problem. 286 tool calls in 15 days.
Where it broke is equally instructive: the diagnose → adapt chain. The model can do each step of the agentic loop individually — observe, plan, act, reflect — but it can't compose them. It reads «FOOD WASTE CATASTROPHE» in its morning notes and then orders more ingredients. Not because it can't see the data. Because connecting «I see problem X» to «therefore I should do Y instead of Z» requires a multi-hop reasoning chain that 9B parameters can't sustain in a 34-tool context.
This is a capability boundary, not a failure. It tells us exactly where the reasoning composition threshold sits for agentic tasks: somewhere between 9B and the larger models that can close this loop. The 9B model has all the building blocks. It just can't snap them together.
$0.19/run. Fits on a laptop. Runs a 34-tool business simulation for 15 days. Goes bankrupt, but makes you wonder what 15B or 32B could do.