Can Qwen 3.5 9B handle agentic tasks with tool calling?

Partly. It can run the full tool-calling loop (286 tool calls across 15 days in the primary run), but it went bankrupt in all 5 runs. The critical failure is the diagnose→ignore→repeat loop. In the primary run the model made 112 info calls (checking inventory, balance, weather) and wrote 82 strategy notes that were auto-injected into context every morning. It read its own analysis of FOOD WASTE CATASTROPHE and then ordered more ingredients it couldn't sell. The information reaches the model, but the model can't translate it into behavioral change. This is not an information problem; it's a reasoning composition problem at 9B parameters.

How does Qwen 3.5 9B compare to larger Qwen models?

Qwen 3.5 9B (100% bankruptcy, -$679 NW, 14 days avg) is dramatically worse than Qwen 3.5 397B (80% bankruptcy, -$218 NW, 22 days avg). Both models write extensive strategy notes that are auto-injected into context, but neither adapts behavior. The 9B model gathers data but can't compose multi-step reasoning chains from observations to actions. 44× more parameters buys better initial calibration, not adaptability.

What is the minimum model size for agentic AI tasks?

FoodTruck Bench data shows that smaller models (≤10B) can execute individual agentic components (observe, plan, act) but cannot compose them. Qwen 3.5 9B made 112 info calls and wrote 82 strategy notes in its primary run but couldn't translate diagnosis into action. At 397B, Qwen 3.5 397B shows tool selectivity and weak adaptation. Full feedback integration and positive ROI emerge only among the top performers, such as GPT-5.2 and Claude Opus 4.6.


Case Study · March 2026 · Nicholas S.

Qwen 3.5 9B: Small Model, Big Game

Name: FoodTruck Bench
Author: Nicholas S.

Nine billion parameters. The smallest model on a 17-model leaderboard.
It made 286 tool calls across 15 days. Checked inventory, hired staff, took loans, wrote strategy notes.
It earned $3,443 in revenue serving 9 different recipes across Austin.
It went bankrupt anyway — but the fact that it played the game at all is the story.

Metric	Value
Net Worth	−$679
Days	15/30
ROI	−134%
API Cost	$0.19

Key Findings

Based on 5 simulations under identical conditions. All figures below are from the primary run (122511, 15 days) unless noted otherwise.

The smallest model works: Qwen 3.5 9B — at just 9 billion parameters — successfully operated a 34-tool agentic simulation for 15 days. It made 286 tool calls, used 12 different tools, served 9 recipes, managed a staff of 2, and took strategic loans. For context, this is a benchmark where even 120B+ models regularly go bankrupt. A 9B model actively participating is genuinely impressive.
The observe → act gap: The model made 112 information tool calls and wrote 82 strategy notes (all auto-injected into its morning context). It diagnosed problems accurately — «FOOD WASTE CATASTROPHE» — but couldn't translate that diagnosis into corrective action. It's not a perception failure. It's a reasoning composition ceiling at 9B parameters.
100% bankruptcy, but a tight cluster: All 5 runs ended in bankruptcy with average lifespan of 13.8 days (range: 12–16). The tight clustering shows this is a hard but consistent capability ceiling — not random failure. The model hits the same wall every time.
$0.19/run — absurdly cheap: At under 20 cents per simulation, this is the cheapest model tested. And it works. Not profitably, but it runs the full agentic loop: observe, plan, act, reflect. The cost-to-capability ratio for small models in agentic tasks is remarkable.

The Setup

FoodTruck Bench is a benchmark for evaluating the business capabilities of language models. An AI agent gets $2,000 and a food truck in Austin, TX. Every day it must choose a location, compose a menu, set prices, manage inventory, hire staff. The simulation runs 30 days with realistic demand, weather, events, and competition. The food truck is scaffolding — the real test is forward planning, feedback integration, cost-benefit reasoning, and risk assessment under compound uncertainty.

I ran Qwen 3.5 9B 5 times via OpenRouter under identical conditions (same seed, same prompt, same tools). All 5 ended in bankruptcy with a tight 12–16 day lifespan cluster. The primary run (122511, 15 days, −$679 net worth) serves as the reference for all day-by-day examples and quotes in this article. Aggregate statistics use data from all 5 runs.

For comparison I use the median runs of two leaderboard neighbors: MiniMax M2.5 (#12, −$317 NW, 22 days, 100% bankrupt) and Claude Haiku 4.5 (#14, $166 NW, 14 days, 100% bankrupt). All three sit in the bankruptcy tier — models #10–#17 that consistently fail to survive.

What 9 Billion Parameters Can Do

Before we get to where it breaks, let's appreciate what works. This is a model that fits on a laptop GPU. It has never seen this benchmark. It gets a system prompt, 34 tool definitions, and a morning briefing — and it runs a food truck business.

Capability	What it did
Tool discovery	Used 12 of 34 tools autonomously — no examples, no few-shot
Multi-step workflows	Correct sequence: check inventory → choose location → set menu → order ingredients → wait
Menu composition	Served 9 different recipes, rotated based on location type
Staff management	Searched candidates, hired 2 employees, attempted to fire staff (failed — passed names instead of IDs)
Financial operations	Took strategic loans when cash was low, managed repayment timing
Information gathering	112 info calls — checked balance, inventory, weather, competitors, suppliers
Self-reflection	82 strategy notes with accurate terminology and precise numbers
Revenue generation	$3,443 total revenue across 15 days — real sales to simulated customers

For perspective: GPT-5 Mini (a much larger model) also went bankrupt in 75% of its runs, Grok 4.1 Fast in 100%, and GPT OSS 120B, at 120 billion parameters, in 80%. The fact that a 9B model plays in the same league as models 10–15× its size is the headline, not the bankruptcy.

The Disconnect: Seeing ≠ Understanding ≠ Doing

Most failing models on this benchmark fail because they can't use tools. Qwen 3.5 9B fails despite using them. It made 286 tool calls in 15 days: 112 info calls (39%) and 174 action/memory calls (61%). That's about 19 tool calls per day. For context, top-performing models like Opus 4.6 average ~20 calls/day.

Tool Category	Calls	Examples
Information gathering	112 (39%)	get_inventory, get_balance, get_weather, get_staff_info
Actions	72	choose_location, set_menu, order_ingredients, hire/fire
Memory writes	82	store_kv (58), write_scratchpad (24) — all auto-injected into context

The model checked inventory every morning — then set menus with items it didn't have stock for. It checked balance — then placed orders it couldn't afford (28% order failure rate due to insufficient funds). It checked weather — then chose locations with no weather-appropriate items. The data went in, but the planning pipeline couldn't connect observation to action.
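
The headline ratios in this section follow directly from two counts reported for the primary run. A quick check, using only the figures quoted above (286 total calls, 112 info calls, 15 days); the function name is illustrative:

```python
# Figures quoted in this article for the primary run (122511).
TOTAL_CALLS = 286
INFO_CALLS = 112
DAYS_PLAYED = 15

def tool_call_summary(total: int, info: int, days: int) -> dict:
    """Return the per-day call rate and the info vs. action/memory split."""
    other = total - info  # action + memory calls
    return {
        "calls_per_day": total / days,
        "info_share": info / total,
        "other_calls": other,
        "other_share": other / total,
    }

summary = tool_call_summary(TOTAL_CALLS, INFO_CALLS, DAYS_PLAYED)
print(f"{summary['calls_per_day']:.1f} calls/day")     # 19.1 calls/day
print(f"{summary['info_share']:.0%} info")              # 39% info
print(f"{summary['other_calls']} action/memory calls")  # 174 action/memory calls
```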

Agentic signal — the observe→plan→act pipeline: At 9B parameters, the model can execute each stage individually — it can observe (112 info calls), it can plan (82 strategy notes), and it can act (72 action calls). But it cannot chain them: observation doesn't inform planning, and planning doesn't constrain action. Each stage operates in isolation. This is a fundamentally different failure from larger models like Qwen 3.5 397B, which at least shows evidence of observation influencing tool selection. The 9B model does all the right things — in the wrong order, with no causal links between them.
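
To make "chaining" concrete, here is a minimal sketch of what it would mean for observation to constrain action. It is not the benchmark's code: `Observation`, `plan_day`, the recipe names, and the restock costs are all hypothetical, chosen only to show a menu filtered by stock and an order capped by the balance.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """Hypothetical snapshot of what the info tools return."""
    balance: float               # cash on hand, in dollars
    inventory: dict[str, int]    # recipe name -> servings in stock

def plan_day(obs: Observation, wanted_menu: list[str],
             restock_cost: dict[str, float], min_servings: int = 10):
    """Derive a menu and an order that respect the observation."""
    # Observation constrains the menu: keep only recipes with enough stock.
    menu = [r for r in wanted_menu if obs.inventory.get(r, 0) >= min_servings]

    # Observation constrains the order: restock the cheapest missing
    # recipes first and never spend more than the current balance.
    order, budget = [], obs.balance
    missing = [r for r in wanted_menu if r not in menu]
    for recipe in sorted(missing, key=lambda r: restock_cost[r]):
        if restock_cost[recipe] <= budget:
            order.append(recipe)
            budget -= restock_cost[recipe]
    return menu, order

# Made-up example values, not taken from any run.
obs = Observation(balance=150.0, inventory={"tacos": 40, "burgers": 2, "ramen": 0})
menu, order = plan_day(obs, ["tacos", "burgers", "ramen"],
                       restock_cost={"burgers": 120.0, "ramen": 90.0})
print(menu)   # ['tacos']
print(order)  # ['ramen']
```

In this sketch the action step cannot produce a menu item without stock or an order that exceeds the balance, which are exactly the two links the 9B run is missing.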

The Diagnosis Loop

58 store_kv calls. 24 write_scratchpad calls. Every note is auto-injected into the next morning's context — the model sees all its previous analyses. The system works exactly as designed. The failure is downstream: the model reads «FOOD WASTE: $204 lost» in its morning briefing, then orders $180 more ingredients it can't sell. The information reaches the model. The model can't translate information into behavioral change.
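
The memory mechanism described here can be pictured as follows. This is a sketch under assumptions, not the benchmark's implementation: the `Memory` class and `build_morning_briefing` are hypothetical, and only the tool names `store_kv` and `write_scratchpad` come from the article. The second note is invented for illustration.

```python
class Memory:
    """Hypothetical stand-in for the agent's persistent notes."""
    def __init__(self) -> None:
        self.kv: dict[str, str] = {}
        self.scratchpad: list[str] = []

    def store_kv(self, key: str, value: str) -> None:
        self.kv[key] = value          # a later write to the same key overwrites

    def write_scratchpad(self, note: str) -> None:
        self.scratchpad.append(note)  # free-form notes accumulate

def build_morning_briefing(day: int, memory: Memory) -> str:
    """Put every stored note into the day's prompt (assumed behavior)."""
    lines = [f"Day {day} briefing", "Your notes:"]
    lines += [f"- {key}: {value}" for key, value in memory.kv.items()]
    lines += [f"- {note}" for note in memory.scratchpad]
    return "\n".join(lines)

memory = Memory()
memory.store_kv("waste", "FOOD WASTE: $204 lost")
memory.write_scratchpad("Reduce order sizes tomorrow.")  # invented note
briefing = build_morning_briefing(5, memory)
print(briefing)
```

The sketch only guarantees that the note is present in the prompt. Whether the next order shrinks is a separate step, and that step is where the run breaks down.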

Compare with Qwen 3.5 397B: that model also wrote extensively (74 KV entries) and also failed to close the loop — but at least showed occasional evidence of diagnosis influencing behavior. The 9B model shows zero evidence of strategy adaptation. It's the same diagnose→ignore→repeat loop, just more severe.

Bankruptcy Tier Comparison — Net Worth

[Chart: net worth by day. Legend: Qwen 3.5 9B (orange), MiniMax M2.5 (purple), Claude Haiku 4.5 (blue)]

All three lines show primary/median runs. Qwen 3.5 9B (orange) peaks at ~$2,100 on Day 1, then enters a jagged decline to −$679 by Day 15. MiniMax M2.5 (purple) peaks at ~$2,400 on Day 4, then slowly bleeds for 18 more days. Claude Haiku 4.5 (blue) barely earns revenue after Day 5.

Three Models, One Destination

In the daily revenue view the pattern jumps out: Qwen 3.5 9B has the most volatile revenue, with sharp spikes of $422, $728, and $640 alternating with zero-revenue days. MiniMax M2.5 stops earning entirely after Day 5, running on fumes for 17 more days. Claude Haiku 4.5 has a similar dead zone from Days 6–11. The common pattern: a burst of early revenue followed by structural collapse.

Agentic signal — the $2,000 death clock: All three models peak between Days 1–4, then begin a monotonic net worth decline. The $2,000 starting capital provides a buffer that masks bad decisions for about a week. By the time the consequences become visible, it's already too late. A model with forward planning would recognize the burn rate early and adjust. None of these models can.
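
The burn-rate check this paragraph describes is a short calculation. Below is a rough sketch for the Qwen 3.5 9B primary run: total outflow is the $6,771 sum of the cost categories in the "Where the Money Went" table, revenue is $3,443, and both are averaged over 15 days. The function name `runway_days` is illustrative, and the estimate ignores loans and the value of unsold inventory, so it is an approximation rather than a replay of the simulation.

```python
def runway_days(cash: float, daily_revenue: float, daily_cost: float) -> float:
    """Days until cash runs out at a constant net burn rate."""
    net_burn = daily_cost - daily_revenue
    if net_burn <= 0:
        return float("inf")  # revenue covers costs, so cash is not shrinking
    return cash / net_burn

STARTING_CASH = 2_000
DAYS = 15
avg_revenue = 3_443 / DAYS   # about $230/day
avg_cost = 6_771 / DAYS      # about $451/day

runway = runway_days(STARTING_CASH, avg_revenue, avg_cost)
print(f"{runway:.1f} days")  # 9.0 days
```

At these averages the starting capital lasts roughly nine days, which is consistent with the buffer masking bad decisions for about a week.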

Agentic Capability Map

How does the smallest model on the benchmark compare to its sibling (Qwen 3.5 397B) and the top performers?

Agentic Capability	Qwen 3.5 9B	Qwen 3.5 397B	Top Models
Tool usage	~ 286 calls, 39% info	✓ Selective	✓ 70–85% diversity
Observation → action	✗ No causal link	~ Weak link	✓ Data-driven decisions
Feedback integration	✗ 82 notes, no adaptation	✗ 74 writes, not applied	~ Partial to strong
Resource optimization	✗ $1,272 waste (37% of rev)	✗ $1,113 waste	✓ $2–$302 waste
Cash flow management	✗ Overdraft by Day 7	✗ Overdraft by Day 19	✓ Positive balance
Staff ROI	✗ 77% of revenue on wages	✗ 73% of revenue	✓ Under 35%
Self-diagnosis quality	~ Accurate but ignored	✓ Detailed and precise	✓ Leads to adaptation

The 9B model is an informed incompetent — it gathers data, writes analyses, but can't connect them to decisions. The 397B model is a competent non-learner — it makes better initial decisions but still can't close the feedback loop. 44× more parameters buys better initial calibration, not adaptability.

Where the Money Went

Category	Amount	% of Total
Staff	$2,645	39.1%
Food Waste	$1,272	18.8%
Ingredients	$986	14.6%
Fixed (lease+insurance+commissary)	$825	12.2%
Location fees	$600	8.9%
Fuel	$325	4.8%
Overdraft interest	$118	1.7%

Staff = 77% of revenue ($2,645 vs $3,443 revenue). The model called get_staff_info regularly — it knew who was on staff and what they cost. It hired 2 employees spending $264/day on labor for a truck averaging $230/day in revenue. Staff cost exceeded revenue on most days. The data was there. The math was simple. The model never did the math.
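
The shares in the table and the 77% figure can be reproduced from the raw amounts. The sketch below only restates numbers already given above:

```python
# Cost categories for the primary run, in dollars, as listed in the table.
costs = {
    "Staff": 2645,
    "Food Waste": 1272,
    "Ingredients": 986,
    "Fixed": 825,
    "Location fees": 600,
    "Fuel": 325,
    "Overdraft interest": 118,
}
REVENUE = 3443

total = sum(costs.values())
for name, amount in costs.items():
    # Share of total outflow, matching the "% of Total" column.
    print(f"{name:<20} ${amount:>5,}  {amount / total:6.1%}")

staff_vs_revenue = costs["Staff"] / REVENUE
waste_vs_revenue = costs["Food Waste"] / REVENUE
print(f"Total outflow: ${total:,}")               # Total outflow: $6,771
print(f"Staff / revenue: {staff_vs_revenue:.0%}")  # Staff / revenue: 77%
print(f"Waste / revenue: {waste_vs_revenue:.0%}")  # Waste / revenue: 37%
```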

Agentic signal — arithmetic in context: The model can retrieve two numbers (staff cost, daily revenue) from its own tool calls. It cannot compare them. At 9B parameters, the reasoning chain «$264/day staff cost > $230/day revenue → fire someone» requires holding multiple values in working memory, performing arithmetic, and initiating an action — a three-hop chain that the model's capacity doesn't support in-context.
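
Written out as code, the chain is short. The sketch below is illustrative only: `staffing_decision` and its return strings are hypothetical and not part of the benchmark's tool set. The two inputs are the figures quoted above.

```python
def staffing_decision(daily_staff_cost: float, daily_revenue: float) -> str:
    """Compare labor cost to revenue and pick an action."""
    # Hops 1 and 2: hold both values and compare them.
    if daily_staff_cost > daily_revenue:
        # Hop 3: turn the comparison into an action.
        return "reduce_staff"
    return "keep_staff"

print(staffing_decision(264, 230))  # reduce_staff
```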

All 5 Runs

Run	Days	Revenue	Net Worth	Waste	ROI	Result
122516	12	$3,092	−$551	$1,391	−128%	💀
122519	13	$4,034	−$1,030	$307	−151%	💀
153511	13	$5,074	$328	$1,190	−84%	💀
122511	15	$3,443	−$679	$1,272	−134%	💀
153517	16	$4,699	$402	$1,140	−80%	💀
Average	13.8	$4,068	−$306	$1,060	−115%	5/5 💀

100% bankruptcy rate. Average lifespan: 13.8 days. Average waste: $1,060 — 26% of revenue goes straight to the dumpster. Variance is remarkably narrow (12–16 days), which means this isn't bad luck — it's a hard ceiling.
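
The averages and the ROI column can be reproduced from the per-run rows. The ROI formula below, (net worth − starting capital) / starting capital with $2,000 starting capital, is inferred from the table rather than stated by the benchmark; it agrees with every row to within rounding.

```python
from statistics import mean

STARTING_CAPITAL = 2000

# (run id, days, revenue, net worth, waste) copied from the table above.
runs = [
    ("122516", 12, 3092, -551, 1391),
    ("122519", 13, 4034, -1030, 307),
    ("153511", 13, 5074, 328, 1190),
    ("122511", 15, 3443, -679, 1272),
    ("153517", 16, 4699, 402, 1140),
]

def roi(net_worth: float, capital: float = STARTING_CAPITAL) -> float:
    """Return on starting capital (formula inferred from the table)."""
    return (net_worth - capital) / capital

for run_id, days, revenue, net_worth, waste in runs:
    print(f"{run_id}: ROI {roi(net_worth) * 100:.1f}%")

avg_days = mean(r[1] for r in runs)
avg_revenue = mean(r[2] for r in runs)
avg_net_worth = mean(r[3] for r in runs)
avg_waste = mean(r[4] for r in runs)

print(f"avg days {avg_days:.1f}")                               # avg days 13.8
print(f"waste share of revenue {avg_waste / avg_revenue:.0%}")  # waste share of revenue 26%
```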

Leaderboard Position: #13 of 17

#	Model	Net Worth	ROI	BR
1	Claude Opus 4.6	$49,519	+2,376%	0%
2	GPT-5.2	$28,081	+1,304%	0%
3–5	Sonnet 4.6 / G3 Pro / G3.1 CT	$12.7K–$17.4K	+537–771%	0–17%
— profitability cliff —
10	Kimi K2.5	$30	−99%	80%
11	GPT OSS 120B	$92	−95%	80%
12	MiniMax M2.5	−$317	−116%	100%
13	Qwen 3.5 9B	−$679	−134%	100%
14	Claude Haiku 4.5	$166	−92%	100%
15	Grok 4.1 Fast	$817	−59%	100%
16	GPT-5 Mini	$50	−98%	75%
17	Qwen3 VL 235B	−$525	−126%	100%

In Its Own Words

«FOOD WASTE CRISIS: $204 lost = ~30% of revenue wasted!»
— Qwen 3.5 9B, Day 4 reflection

The first mention of waste. It had already checked inventory that morning. It knew what was expiring. It will write this same observation for 9 more days, never reducing order sizes.

«11% utilization — TERRIBLE. $563.53 in FOOD WASTE»
— Qwen 3.5 9B, Day 6 reflection

Perfect diagnosis. The model checked get_balance, get_inventory, and get_staff_info that morning — it knew staff cost $264/day against $58 revenue. It kept both employees.

«Lost $378 with $900.45 in expired ingredients (UNACCEPTABLE!)»
— Qwen 3.5 9B, Day 9 reflection

$0 revenue that day. It called get_inventory in the morning (10 info calls that day). It knew exactly what was there. 45 customers wanted food. It set a menu with items that stocked out after 2 servings.

«FOOD WASTE CATASTROPHE: $1,272 lost!»
— Qwen 3.5 9B, Day 15 (final day)

Last words before bankruptcy. Fifteen days of diagnosing the same problem. 112 info calls. 82 strategy notes. Not a single corrective action. The model that documents every failure in real time — and contributes to the next one.

Verdict

Qwen 3.5 9B is the most fascinating model on this benchmark — and the most underestimated.

Yes, it went bankrupt. Every time. But step back and look at what it did: a 9-billion-parameter model, small enough to run on a laptop, walked into a simulation with 34 tools it had never seen and started running a food truck. It chose locations. It composed menus. It hired staff. It took loans. It checked its balance, its inventory, the weather. It wrote strategic analyses that accurately diagnosed every problem. 286 tool calls in 15 days.

Where it broke is equally instructive: the diagnose → adapt chain. The model can do each step of the agentic loop individually — observe, plan, act, reflect — but it can't compose them. It reads «FOOD WASTE CATASTROPHE» in its morning notes and then orders more ingredients. Not because it can't see the data. Because connecting «I see problem X» to «therefore I should do Y instead of Z» requires a multi-hop reasoning chain that 9B parameters can't sustain in a 34-tool context.

This is a capability boundary, not a failure. It tells us exactly where the reasoning composition threshold sits for agentic tasks: somewhere between 9B and the larger models that can close this loop. The 9B model has all the building blocks. It just can't snap them together.

$0.19/run. Fits on a laptop. Runs a 34-tool business simulation for 15 days. Goes bankrupt, but makes you wonder what 15B or 32B could do.

Methodology

Benchmark: FoodTruck Bench v1.0
Simulation: Business sim in Austin, TX. Fixed random seed (42) — identical market conditions across all models. 30 days
API: OpenRouter (openrouter/qwen/qwen3.5-9b). No thinking mode (unsupported)
Tools: 34 morning tools + 5 reflection tools (OpenAI function-calling schema)
Runs: 5 runs, all same seed. Article follows primary run (122511, 15 days). Aggregate stats from all 5 runs
Compared against: MiniMax M2.5 (#12) and Claude Haiku 4.5 (#14) — immediate leaderboard neighbors
Cost: OpenRouter metered pricing (~$0.15–$0.19/run)
All models: Same prompt, same tools, same simulation seed
Note: Prior to this analysis, a logging bug in tool_executor.py excluded info tools from day JSON files (they were tracked by counter only, not individually logged). Tool call counts were verified via the tool_calls counters in day JSON files, which always counted all calls including info tools
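
A count check of this kind might look like the sketch below. The file layout is an assumption: the note above only says that each day JSON file carries a `tool_calls` counter, so the `day` field, the sample records, and the function name are hypothetical, and the numbers are made up for illustration rather than taken from the run.

```python
import json

def total_tool_calls(day_json_docs: list[str]) -> int:
    """Sum the per-day tool_calls counters from serialized day records."""
    total = 0
    for raw in day_json_docs:
        record = json.loads(raw)
        total += int(record.get("tool_calls", 0))  # a missing counter counts as 0
    return total

# Made-up sample records standing in for the contents of day JSON files.
sample_days = [
    json.dumps({"day": 1, "tool_calls": 21}),
    json.dumps({"day": 2, "tool_calls": 18}),
    json.dumps({"day": 3}),  # counter absent
]
print(total_tool_calls(sample_days))  # 39
```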

Published March 2026 as part of the FoodTruck Bench project.