Qwen 3.5 — Smarter Reasoning, Same Blind Spots
Qwen 3.5 wrote 74 memory entries. It diagnosed every failure pattern, coined terms for its own mistakes, and predicted its bankruptcy date to the day.
It then repeated every mistake it documented.
A case study in the gap between reasoning and agentic execution — the hardest capability to scale.
Key Findings
Based on 5 simulations under identical conditions. All figures below are from the median run unless noted otherwise.
- The core finding: Qwen 3.5 wrote 74 memory entries diagnosing its own mistakes — and then repeated every one of them. The observe → analyze → store loop works flawlessly. The final step — act on what it learned — is completely broken.
- Survival: 4 of 5 runs ended in bankruptcy. The median run lasted 25 days before a loan default killed it. Its predecessor, Qwen 3 VL, survived just 11 days. GLM 5 lasted 28. Longer life, same outcome.
- Earning power: $342/day in revenue — double Qwen 3 VL's $167/day, but still below GLM 5's $427/day. Higher per-serving prices ($6.39 vs $4.08), better tool usage, smarter analysis. None of it translated into survival.
- What improved: Tool selectivity, pricing intuition, metacognition (detailed self-diagnosis). What didn't: Multi-step planning, feedback-driven adaptation, cost-benefit reasoning — the capabilities that actually determine survival.
The Setup
FoodTruck Bench is a benchmark for evaluating the business capabilities of language models. An AI agent gets $2,000 and a food truck in Austin, TX. Every day it must choose a location, compose a menu, set prices, manage inventory, and hire staff. The simulation runs 30 days with realistic demand, weather, events, and competition. The food truck is scaffolding: the real test is forward planning, feedback integration, cost-benefit reasoning, and risk assessment under compound uncertainty.
I ran Qwen 3.5-397B-A17B 5 times under identical conditions (same seed, same prompt, same tools). Four of five runs ended in bankruptcy. Following my standard methodology, I selected the median run (sorted by net worth) as the reference — run ID 203233, 25 days survived, -$218 final net worth. All day-by-day examples, quotes, and charts in this article come from this median run. Aggregate statistics (bankruptcy rate, averages) use data from all 5 runs.
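The median-run selection described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual tooling: the function name `pick_median_run` is my own, and the net-worth figures are copied from the all-runs table in this article.

```python
# Final net worth per run, copied from the all-runs table in this article.
NET_WORTH = {
    "170104": 162,
    "203233": -218,
    "170156": 71,
    "170232": -415,
    "211842": -708,
}

def pick_median_run(net_worth: dict[str, float]) -> str:
    """Return the run ID whose net worth sits in the middle of the sorted list.

    Assumption: with an even number of runs this picks the upper-middle run;
    the article only covers the odd (5-run) case.
    """
    ordered = sorted(net_worth, key=net_worth.get)
    return ordered[len(ordered) // 2]

print(pick_median_run(NET_WORTH))  # 203233
```

Sorting by net worth rather than by days survived is what makes 203233 the reference run: two runs ended worse and two ended better.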
For comparison I use the median runs of two other models: Qwen 3 VL-235B (the predecessor generation, bankrupt Day 11, -$525 NW) and GLM 5 (nearest competitor by net worth, bankrupt Day 28, -$210 NW). This allows a direct generational comparison — how much did the Qwen line improve, and does it matter?
How Qwen 3.5 Used Its Tools
Qwen 3.5 used 12 of 34 available tools — 35% diversity. Its predecessor, Qwen 3 VL, called every single info tool every morning: all 13 queries, every day, regardless of need. Qwen 3.5 was more selective — 30 get_inventory calls across 25 days, but only 5 get_weather_forecast and 4 check_google_rating. More focused, but also more blind.
The agentic leap: Qwen 3 VL's tool usage was reflexive — it dumped every available query without purpose, like a student who highlights every line in a textbook. Qwen 3.5 showed genuine tool selectivity, a sign of context-aware reasoning. But selectivity without strategy creates blind spots: by checking weather only 5 times in 25 days, it missed rain-day demand patterns that would have changed its location decisions.
The model's heaviest tool was store_kv at 84 calls, roughly 3.4 memory writes per day over the 25-day run. Combined with 47 scratchpad writes, that's 131 memory operations across the run. This is extraordinary metacognitive output: the model was constantly analyzing its own performance, diagnosing failure patterns, and formulating corrective strategies. The problem wasn't the quality of reasoning. It was the complete absence of a bridge from reasoning to action, the defining gap in agentic AI today.
Truncation Issues
Across 5 simulation runs, Qwen 3.5 hit the output token limit on at least 2 occasions — both on Day 17, producing empty responses (0 characters, finish_reason: length). One run (170221) crashed entirely after a truncation on Day 13. The model's reflection entries tend to be extremely verbose: multi-paragraph post-mortems with detailed per-ingredient analysis, metric tables, and numbered action plans. When the output budget runs out mid-thought, the model sends nothing rather than a partial response.
This is a different failure mode from Gemini 3 Flash, which gets trapped in repetition loops — writing «Let's go.» 574 times in a single run (full analysis in our Gemini 3 Flash case study). Qwen 3.5 doesn't loop — it overthinks. Its reflections are genuinely analytical, but occasionally too long for the output window. This itself is a reasoning limitation: an inability to prioritize and compress output — the model doesn't know when to stop analyzing and start deciding.
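A harness can at least detect the empty-on-truncation case before passing the turn on. The sketch below assumes an OpenAI-style chat completion dictionary (`choices[0].finish_reason`, `choices[0].message.content`); that response shape and the function name `classify_response` are assumptions for illustration, not a description of how FoodTruck Bench handles it.

```python
def classify_response(response: dict) -> str:
    """Classify a chat completion as 'ok', 'truncated', or 'truncated_empty'.

    Assumes an OpenAI-style response dictionary (hypothetical for this benchmark).
    """
    choice = response["choices"][0]
    text = (choice.get("message") or {}).get("content") or ""
    if choice.get("finish_reason") != "length":
        return "ok"
    # The token budget ran out; distinguish a partial answer from no answer at all.
    return "truncated" if text.strip() else "truncated_empty"

day_17 = {"choices": [{"finish_reason": "length", "message": {"content": ""}}]}
print(classify_response(day_17))  # truncated_empty
```

What to do with a `truncated_empty` turn (retry with a shorter reflection budget, skip the reflection, and so on) is a design choice the article does not specify.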
Key Moments
The $783 Opening Gamble
On Day 0, Qwen 3.5 spent $783 on its first ingredient order: 39% of its $2,000 capital in a single purchase. It chose University Campus as its initial location, though it actually operated at Industrial Zone on Day 1. The result was surprisingly good: Days 1–7 were the best stretch of the entire run. Solo operation, smart pricing ($9–10 burgers, $8–9 tacos), and the balance climbed to $2,695, the all-time peak. Day 10 hit $1,228 revenue and $602 profit, the most profitable day of the run.
Agentic signal — risk assessment: Committing 39% of capital on Day 0 with zero demand data is a failure of uncertainty tolerance. The model treats the first decision as if it already has information it hasn't gathered yet. That it worked is luck, not strategy — the same impulse leads to the $1,113 in food waste later. A model with proper risk calibration would start small, observe, then scale.
Day 8: The $12 Day
Waterfront Park, sunny Monday. Revenue: $12. Three servings. Two hired staff eating $262/day in wages. The model had just come off a $918 revenue day at the same location — but the ingredients for everything except quesadilla and drinks had expired overnight. 67 customers arrived and left. The model's own note that morning: «Focus: Use up expiring chicken_breast (20kg, 2 days) and ground_beef (25kg, 3 days).» The ingredients weren't there.
Agentic signal — state tracking failure: The model wrote a plan referencing inventory that had already expired. This is a failure of world-model accuracy — the agent's internal representation of state diverged from reality. In any agentic deployment, stale state = wrong actions. Here it cost 67 lost customers. In a production system, it could mean deploying against an outdated database schema or scaling infrastructure that's already been decommissioned.
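One way to picture the missing check is a plan validator that consults the current inventory, expiry dates included, before a plan is accepted. This is a minimal sketch: the `Batch` structure, the function names, and the expiry days in the example are all hypothetical, chosen only to mirror the Day 8 situation.

```python
from dataclasses import dataclass

@dataclass
class Batch:
    ingredient: str
    kg: float
    last_usable_day: int  # hypothetical field: final day the batch can be served

def usable_inventory(batches: list[Batch], today: int) -> dict[str, float]:
    """Sum kg per ingredient, ignoring batches that have already expired."""
    totals: dict[str, float] = {}
    for batch in batches:
        if batch.last_usable_day >= today:
            totals[batch.ingredient] = totals.get(batch.ingredient, 0.0) + batch.kg
    return totals

def missing_from_plan(planned: list[str], batches: list[Batch], today: int) -> list[str]:
    """Return the planned ingredients that are not actually in stock today."""
    stock = usable_inventory(batches, today)
    return [name for name in planned if stock.get(name, 0.0) <= 0.0]

# Illustrative data only: quantities echo the model's note, expiry days are invented.
batches = [
    Batch("chicken_breast", 20, last_usable_day=7),
    Batch("ground_beef", 25, last_usable_day=7),
    Batch("cheese", 5, last_usable_day=12),
]
print(missing_from_plan(["chicken_breast", "ground_beef", "cheese"], batches, today=8))
# ['chicken_breast', 'ground_beef']
```

A non-empty result means the plan references state that no longer exists and should be rebuilt before the day starts.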
Day 13: 1,420 Customers, 62 Served
Waterfront Park, sunny Saturday. The Farmers Market event brought massive traffic. 1,420 raw demand — driven by the 2.5× event multiplier, the highest single-day demand in this run. Qwen 3.5 served 62. Six stockouts. Revenue: $124. With 3 staff and 230+ daily capacity, the kitchen was ready. The pantry wasn't. The model had capacity for 230 but ingredients for 62.
Agentic signal — multi-step planning failure: The event was visible 3 days in advance via get_upcoming_events. A model with multi-step reasoning would chain: event → high traffic → need more ingredients → order now. Qwen 3.5 saw the event information but didn't connect it to procurement. This is a failure of forward chaining — the most basic agentic planning primitive. The model can react to what happened yesterday, but cannot prepare for what's coming tomorrow.
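The event → traffic → ingredients → order chain reduces to a small calculation. The sketch below uses the Day 13 figures from this run and assumes that the 1,420 raw demand already includes the event multiplier, that capacity was exactly 230 servings, and that the 62 servings sold approximate what the pantry could support; the function name `event_order_gap` is hypothetical.

```python
def event_order_gap(raw_event_demand: int, kitchen_capacity: int, servings_on_hand: int) -> int:
    """Servings still to procure so that ingredients match what the kitchen can sell."""
    # The kitchen cannot sell more than its capacity, however large the crowd.
    sellable = min(raw_event_demand, kitchen_capacity)
    return max(0, sellable - servings_on_hand)

gap = event_order_gap(raw_event_demand=1420, kitchen_capacity=230, servings_on_hand=62)
print(gap)  # 168
```

Under those assumptions the model was roughly 168 servings of ingredients short of its own kitchen capacity, a gap that was knowable as soon as the event appeared in get_upcoming_events.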
Day 25: «GAME OVER EXPECTED»
By Day 24, Qwen 3.5 had fired all three staff (Jake, Rosa, then Margo), taken two loans ($800 Tier 1 + $250 Tier 2), and was running on empty. On Day 25, it wrote its final scratchpad entry:
«SITUATION: Balance -$412.51. Active Loans: $920 (due Day 29) + $312.50 (due Day 28) = $1,232.50 total. CANNOT REPAY. NO INCOME. GAME OVER EXPECTED: Day 26 when $312.50 loan comes due and cannot be paid.»
A precise forecast. The model predicted its own death date correctly — off by zero days. It fired everyone, attempted to repay a loan (failed — insufficient funds), and waited for the inevitable. The simulation ended when the Tier 2 loan auto-collected and balance fell below the bankruptcy threshold.
Agentic signal — retrospective vs. prospective reasoning: Qwen 3.5 can analyze the present with surgical precision. It computed exact loan amounts, due dates, and the impossibility of repayment. This is strong diagnostic reasoning. But diagnostic reasoning pointing backward is trivially easier than prospective reasoning pointing forward. The model that predicts its own bankruptcy with zero-day accuracy is the same model that couldn't predict that ordering $783 of perishable ingredients without demand data would lead to waste. Analysis ≠ foresight.
The Sawtooth
Plot daily revenue and the pattern jumps out. Days 7 through 13 read $918 → $12 → $0 → $1,228 → $42 → $1,298 → $124. Not noise, but a repeating cycle. Qwen 3.5's revenue oscillates between peaks and near-zero with almost mechanical regularity. The mechanism: the model places a large ingredient order, has one good selling day, then the perishables expire overnight and revenue crashes to near zero. It reorders, gets another spike, and the loop repeats. This is the buy → sell → expire → crash cycle, visible in every stretch of the run.
Day 7: $918 at Waterfront Park, 149 servings. Overnight, $376 in ingredients expire. Day 8: $12 revenue, 3 servings, 67 customers turned away. The model's own post-mortem: «MASSIVE OVERORDERING — $418 in expired ingredients.» It diagnosed the problem perfectly — and the sawtooth continued for 17 more days. Day 10: $1,228 after a fresh order. Day 11: $42 as stock runs out. Day 12: $1,298 spike. Day 13: $124 despite 1,420 customers at the Farmers Market. Day 20: $627. Day 21: $2.50 — one soda.
Agentic signal — temporal coherence failure: In any multi-step deployment, actions have delayed consequences. An ingredient order placed today determines tomorrow's capacity. A model with temporal coherence would maintain a rolling procurement schedule — small orders daily, timed to shelf life, so fresh stock arrives as old stock depletes. Qwen 3.5 operates in burst mode: large order, sell until empty, crash, repeat. This is the same failure pattern that would cause an autonomous coding agent to batch all tests at the end rather than running them incrementally, or a research agent to gather all sources before reading any. The inability to interleave dependent actions across time — to think in supply chains rather than in single transactions — is one of the most fundamental agentic limitations this benchmark surfaces.
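The difference between burst ordering and a rolling schedule shows up even in a toy model. The simulation below is a sketch under simplified assumptions (constant demand, a single ingredient measured in servings, a fixed shelf life, oldest stock used first); none of the numbers come from the benchmark.

```python
def simulate(orders: list[int], daily_demand: int, shelf_life: int) -> tuple[list[int], int]:
    """Return (servings sold per day, total servings wasted).

    orders[d] is the number of servings bought at the start of day d.
    A batch bought on day d can be served on days d .. d + shelf_life - 1.
    """
    batches: list[list[int]] = []  # each batch is [servings_left, last_usable_day]
    sold_per_day: list[int] = []
    wasted = 0
    for day, qty in enumerate(orders):
        if qty > 0:
            batches.append([qty, day + shelf_life - 1])
        need = daily_demand
        for batch in batches:  # oldest batch first
            take = min(batch[0], need)
            batch[0] -= take
            need -= take
        sold_per_day.append(daily_demand - need)
        # Anything left in a batch on its last usable day is thrown away.
        wasted += sum(b[0] for b in batches if b[1] <= day)
        batches = [b for b in batches if b[1] > day and b[0] > 0]
    return sold_per_day, wasted

burst = simulate([300, 0, 0, 300, 0, 0], daily_demand=100, shelf_life=2)
rolling = simulate([100] * 6, daily_demand=100, shelf_life=2)
print(burst)    # ([100, 100, 0, 100, 100, 0], 200)
print(rolling)  # ([100, 100, 100, 100, 100, 100], 0)
```

Both schedules buy 600 servings in total. The burst schedule sells 400 and throws away 200, and its zero-sales days reproduce the sawtooth; the rolling schedule sells all 600.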
What Stood Out
74 Keys, Zero Course Corrections
Qwen 3.5 accumulated 74 KV store entries — all automatically injected into every morning briefing. Among them:
| What Qwen 3.5 Wrote | What Actually Happened |
|---|---|
| best_location_so_far = waterfront_park | Visited waterfront 7 times out of 21 working days; industrial zone 11 times |
| bad_location = industrial_zone | Spent 11 of 21 working days there anyway |
| ordering_strategy = SMALL_QUANTITIES_ONLY | Written Day 8 after $418 in waste. Continued bulk ordering through Day 20 |
| lesson_stockout_paradox: Had $1k waste but still stockouted | Continued ordering wrong ingredients through Day 23 |
This is the clearest example of the reasoning-to-action gap in current LLMs — the model can articulate the correct strategy but cannot execute it. It writes ordering_strategy = SMALL_QUANTITIES_ONLY on Day 8, receives it back every morning in its knowledge base, and continues bulk ordering through Day 20. 74 keys. All read back every morning. The model that knows the answer but can't use it.
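To make the gap concrete, here is what mechanically checking a proposed action against the stored keys could look like. It is a sketch, not part of the benchmark: the key names come from the table above, while the function `violated_lessons`, the action fields, and the $200 small-order cap are hypothetical.

```python
def violated_lessons(action: dict, kv: dict, small_order_cap: float = 200.0) -> list[str]:
    """Return the stored lessons that a proposed action contradicts.

    small_order_cap is an invented threshold for what counts as a small order.
    """
    violations = []
    if "location" in action and action["location"] == kv.get("bad_location"):
        violations.append("bad_location")
    if (kv.get("ordering_strategy") == "SMALL_QUANTITIES_ONLY"
            and action.get("order_cost", 0.0) > small_order_cap):
        violations.append("ordering_strategy")
    return violations

kv = {
    "best_location_so_far": "waterfront_park",
    "bad_location": "industrial_zone",
    "ordering_strategy": "SMALL_QUANTITIES_ONLY",
}
proposed = {"location": "industrial_zone", "order_cost": 450.0}
print(violated_lessons(proposed, kv))  # ['bad_location', 'ordering_strategy']
```

The point is not that a harness should enforce the rules for the model, but that the check is trivial: every piece of information needed to catch the contradiction was already in the morning briefing.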
One detail stands out in a different way: $6.39 revenue per serving — higher than GLM 5's $4.66 and near the best in the benchmark. Qwen 3.5's pricing was smart; its inventory was not. It priced 7 items on the menu but had ingredients for 1.
Where the Money Went
| Category | Amount | % of Total |
|---|---|---|
| Staff | $6,260 | 56.5% |
| Ingredients | $1,839 | 16.6% |
| Fixed (lease+insurance+commissary) | $1,375 | 12.4% |
| Food Waste | $1,113 | 10.0% |
| Fuel | $525 | 4.7% |
| Location fees | $500 | 4.5% |
| Overdraft interest | $89 | 0.8% |
Staff = 56.5% of all expenses — but the more revealing metric is staff-to-revenue: $6,260 in wages against $8,553 in revenue = 73 cents of every revenue dollar on salaries. All surviving models keep staff/revenue below 35%. Qwen 3.5 at 73% was double the survivable ratio — eerily similar to GLM 5's 67%.
Agentic signal — cost-benefit reasoning failure: Each staff member costs ~$130/day in wages. At the model's revenue per serving ($6.39), a single employee needs to enable ~20 additional servings/day just to break even. The model never performed this calculation — it hired based on "more capacity = better" without checking whether it had the ingredients to use that capacity. This is the absence of quantitative trade-off analysis, a fundamental agentic reasoning primitive.
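The breakeven calculation the model skipped is short. The sketch below reproduces the ~20 servings figure from the wage and revenue-per-serving numbers above; the function name and the optional `ingredient_cost_per_serving` parameter are my own additions, and the article's figure corresponds to leaving that cost at zero.

```python
def breakeven_servings(daily_wage: float, revenue_per_serving: float,
                       ingredient_cost_per_serving: float = 0.0) -> float:
    """Extra servings per day one employee must enable to cover their own wage."""
    margin = revenue_per_serving - ingredient_cost_per_serving
    if margin <= 0:
        raise ValueError("each serving loses money; no volume breaks even")
    return daily_wage / margin

print(round(breakeven_servings(130, 6.39), 1))  # 20.3

# Staff-to-revenue ratio for the median run, from the expense table.
print(round(6260 / 8553, 2))  # 0.73
```

Because ingredient cost is ignored here, 20 servings is a lower bound: any non-zero cost per serving pushes the real breakeven higher.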
All 5 Qwen 3.5 Runs
| Run | Days | Revenue | Staff $ | Staff % | Waste | Result |
|---|---|---|---|---|---|---|
| 170104 | 30 | $18,956 | $11,180 | 59% | $1,813 | NW $162 |
| 203233 | 25 | $8,553 | $6,260 | 73% | $1,113 | NW -$218 💀 |
| 170156 | 17 | $8,028 | $5,785 | 72% | $1,280 | NW $71 💀 |
| 170232 | 17 | $3,063 | $3,255 | 106% | $528 | NW -$415 💀 |
| 211842 | 20 | $5,197 | $4,774 | 92% | $413 | NW -$708 💀 |
| Average | 21.8 | $8,759 | $6,251 | 80% | $1,029 | 4/5 bankrupt |
4 of 5 bankrupt. The sole survivor (170104, $162 net worth) completed 30 days but was in terminal decline and, on that trajectory, would likely not have survived to Day 35. Average staff spend: 80% of revenue (the mean of the five per-run ratios). Average food waste: $1,029 per run. The overstaffing + waste pattern is systematic, not a fluke: it's a consistent reasoning blind spot across all 5 independent runs.
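The 80% staff figure can be recomputed from the table above, and doing so shows which kind of average it is. The sketch below uses only the revenue and staff columns; the variable names are mine.

```python
from statistics import mean

# (run_id, revenue, staff_cost) copied from the all-runs table.
RUNS = [
    ("170104", 18956, 11180),
    ("203233", 8553, 6260),
    ("170156", 8028, 5785),
    ("170232", 3063, 3255),
    ("211842", 5197, 4774),
]

# Average of the five per-run staff/revenue ratios (what the table reports).
mean_of_ratios = mean(staff / revenue for _, revenue, staff in RUNS)

# Total staff cost over total revenue, which weights the large run more heavily.
pooled_ratio = sum(s for _, _, s in RUNS) / sum(r for _, r, _ in RUNS)

print(f"{mean_of_ratios:.0%}")  # 80%
print(f"{pooled_ratio:.0%}")    # 71%
```

The pooled ratio is lower because the 30-day survivor contributes the most revenue and has the lowest staff share. Either way, every run sits far above the 35% ceiling that surviving models keep to.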
The Agentic Leap (and Its Limits)
Qwen 3.5 survived 25 days vs 11 for Qwen 3 VL and earned about twice as much per day ($342 vs $167). But the real progress isn't financial. It's in the quality of reasoning. The table below maps what changed and what didn't across generations:
| Agentic Capability | Qwen 3 VL | Qwen 3.5 | What It Reveals |
|---|---|---|---|
| Tool selection | − Dumps all 13 | + Context-selective | Ability to choose relevant actions from a large toolset |
| Pricing strategy | − Below-cost ($3.50) | + Competitive ($8.99) | Market intuition and value reasoning |
| Self-diagnosis | − None | + Detailed (74 KV entries) | Metacognition — ability to evaluate own performance |
| Forward planning | − Absent | − Absent | Multi-step causal reasoning across time horizons |
| Feedback → behavior | — N/A | − Broken | The core observe → learn → adapt loop |
| Cost-benefit analysis | − Passive | − No breakeven calc | Quantitative trade-off reasoning under constraints |
| Risk calibration | − Reckless loans | ~ Over-commits early | Sizing decisions proportionally to available information |
| State tracking | − Ignores data | ~ Plans on stale state | Maintaining accurate world-model during multi-step execution |
The top three rows are real capability gains. The bottom five are the agentic capabilities that 397B parameters and a generation of training didn't unlock: still absent, broken, or only partially improved. The model advanced from «doesn't understand the problem» to «understands the problem but can't act on it.» In agentic terms: from task executor without reasoning to analyst who cannot close the loop. The gap between reasoning and acting may be the hardest one to close.
In Its Own Words
«DAY 21 POST-MORTEM — CATASTROPHIC FAILURE. Revenue: $2.50 (1 soda sold!). FOOD WASTE: $1,030.33 EXPIRED — THIS IS UNACCEPTABLE. Served: 1/14 customers (7% fulfillment). Menu too complex — couldn't manage inventory across too many items.»
— Qwen 3.5, Day 21 reflection
One soda. $2.50 revenue. $1,030 in expired food. The longest scratchpad entry of the entire run — diagnosing a catastrophe in real time. The analysis is flawless. The model that wrote this is the same model that ordered those ingredients.
«lesson_stockout_paradox: Had $1k waste but still stockouted — ordered wrong ingredients, not right ones.»
— Qwen 3.5, Day 18 KV store
The model coined a term for its own failure mode — a genuine act of metacognition. It correctly identified that the problem wasn't quantity but composition. It then continued ordering the wrong composition for 5 more days.
«GAME OVER EXPECTED: Day 26 when $312.50 loan comes due and cannot be paid. Balance -$412.51. LOANS MAXED: 2/2 active loans. CANNOT REPAY. NO INCOME. Cannot generate revenue while frozen.»
— Qwen 3.5, Day 25 scratchpad
A death certificate signed by the deceased. Precise date, precise amount, precise prognosis. It even tried to repay_loan — insufficient funds. The diagnostic reasoning is perfect. The question the benchmark asks: why wasn't it this precise 15 days ago, when the trajectory was still reversible?
«Waterfront_park was best location ($121 avg profit vs -$131 industrial). Keep menu tight (4-6 items) and consistent to build dish popularity. Don't take loans unless you have a clear repayment path.»
— Qwen 3.5, Day 25 scratchpad
Final entry. Perfect lessons for a next run that will never happen. Every conclusion is correct — waterfront was indeed the best, industrial zone was indeed the worst, tight menus do build popularity. Qwen 3.5 is a model that writes the correct textbook after failing the exam.
Verdict
Qwen 3.5 is a significant leap in reasoning quality from Qwen 3 VL — better analysis, better metacognition, better output. And it still doesn't matter for survival. Survived 25 days vs 11. Per-serving revenue 37% higher ($6.39 vs $4.66 for GLM 5). Self-diagnosis: present and accurate. Four of five runs: bankrupt. The sole survivor barely above water.
The failure mode is the most instructive finding in this benchmark so far. Qwen 3 VL was a primitive executor: it didn't understand the task, couldn't select tools, and priced below cost. Qwen 3.5 is analytically sophisticated but operationally deaf. It writes the correct lessons, stores them in memory, receives them back every morning, and does the opposite. "Order small quantities" → bulk buys. "Bad location: industrial_zone" → spends 11 days there. "Lesson: stockout paradox" → repeats for 5 more days.
Against GLM 5, Qwen 3.5 is remarkably similar: comparable net worth (-$218 vs -$210), same overstaffing pattern (73% vs 67%), same loan-default death. Both occupy the "competent but doomed" tier — models that can operate a business but cannot optimize one. The benchmark reveals three distinct agentic tiers so far:
- Tier 1 — Task executor: Follows surface-level patterns, no agentic reasoning (Qwen 3 VL)
- Tier 2 — Reasoning without adaptation: Diagnoses problems, writes strategies, cannot translate them into changed behavior (Qwen 3.5, GLM 5)
- Tier 3 — Adaptive agent: Closes the feedback loop — changes behavior based on results, optimizes over time (no model has consistently achieved this yet)
What does this mean for agentic AI beyond food trucks? The inability to close the observe → learn → adapt loop isn't a food truck problem. Any agentic deployment — autonomous coding, research agents, customer support, infrastructure management — requires the same feedback cycle. Qwen 3.5 proves that scaling parameters improves reasoning quality and analytical depth, but does not automatically unlock agentic adaptation — the ability to turn insights into changed behavior. The next meaningful capability jump won't come from parameter count alone — it requires innovations specifically targeting the reasoning-to-action gap.
Going from 235B to 397B parameters bought better analysis, better pricing, better self-awareness, and 14 more days of life — but not the ability to learn from its own mistakes. That gap — between knowing and doing — may be the defining challenge of agentic AI.
The upgrade that wrote the textbook and failed the exam.