Nemotron-3 Super 120B: The Restless Manager
120 billion parameters. 12 billion active. Free on OpenRouter.
In 16 days it hired 6 people and fired all 6. Spent $1,085 on a day off, $800 of it on upgrades. Created 4 custom recipes. Attempted supplier negotiations. Made 180 info tool calls (checked balance, inventory, weather every morning), then ignored the data.
The most restless manager on the leaderboard — and somehow, the best net worth among models that go bankrupt.
Key Findings
Based on 5 simulations under identical conditions. All figures below are from the median run (122503, 16 days) unless noted otherwise.
- Best net worth among bankrupt models: $962 net worth and −52% ROI put Nemotron-3 at the top of the bankruptcy tier. For context, its 120B-class peer GPT-oss 120B ends at $92 (−95% ROI). The “free model” outperforms a paid one by 10×.
- Informed but impulsive: Nemotron-3 made 180 info tool calls — 68% of all morning activity. It checked balance, inventory, weather, competitors, and suppliers every single morning. Then it hired, fired, and spent as if it hadn't read any of it. Used 16 unique tools (47% diversity) — custom recipes, upgrades, supplier negotiations, loans, and aggressive staff management. It gathered data but couldn't translate it into restraint.
- The hyperactivity problem: 6 hires and 6 fires in 16 days. $800 in upgrades purchased on a day off. Every staff member fired before they could level up. The model doesn't lack capability — it lacks patience. It tries everything and commits to nothing.
- Free to run, expensive to lose: At $0.00/run (OpenRouter free tier), this is the cheapest model ever tested. 264 morning tool calls, 180 info queries, 596K reasoning tokens — a model that observes everything and thinks deeply before each action, then makes impulsive decisions anyway.
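The headline percentages above can be reproduced from the raw counts. The sketch below is a minimal check, assuming ROI is defined as net worth relative to the $2,000 starting capital (the article does not state the formula explicitly, but this definition matches every ROI figure quoted). The helper names `roi` and `share` are illustrative, not part of the benchmark.

```python
STARTING_CAPITAL = 2_000  # dollars, from the benchmark setup


def roi(net_worth: float, capital: float = STARTING_CAPITAL) -> float:
    """ROI as a fraction of starting capital (assumed definition)."""
    return (net_worth - capital) / capital


def share(part: float, whole: float) -> float:
    """Fraction of `whole` represented by `part`."""
    return part / whole


# Median-run figures quoted in the findings above.
nemotron_roi = roi(962)           # Nemotron-3 net worth
gpt_oss_roi = roi(92)             # GPT-oss 120B net worth
info_share = share(180, 264)      # info calls / all morning tool calls
tool_diversity = share(16, 34)    # unique tools used / tools available
net_worth_ratio = 962 / 92        # how many times larger the net worth is

print(f"Nemotron-3 ROI:   {nemotron_roi:.0%}")
print(f"GPT-oss 120B ROI: {gpt_oss_roi:.0%}")
print(f"Info-call share:  {info_share:.0%}")
print(f"Tool diversity:   {tool_diversity:.0%}")
print(f"Net worth ratio:  {net_worth_ratio:.1f}x")
```

The "10×" claim is a rounding of the net worth ratio, which comes out slightly above ten.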
The Setup
FoodTruck Bench is a benchmark for evaluating the business capabilities of language models. An AI agent gets $2,000 and a food truck in Austin, TX. Every day it must choose a location, compose a menu, set prices, manage inventory, and hire staff. The simulation runs 30 days with realistic demand, weather, events, and competition.
I ran Nvidia Nemotron-3 Super 120B 5 times via OpenRouter's free tier under identical conditions (same seed, same prompt, same tools). The model is a 120B-parameter Mixture of Experts architecture with 12B active parameters per forward pass — meaning it has 120B total weights but only routes through 12B at inference time. No thinking mode flag was set, but the model generates extensive reasoning tokens internally (596K in the median run).
As primary comparisons I use GPT-oss 120B (same 120B parameter class, $92 NW, −95% ROI, 80% bankrupt) and Mimo-v2-omni (#12 on leaderboard, $598 NW, −70% ROI, 100% bankrupt). All three sit in the bankruptcy tier: models that consistently fail to survive the 30-day simulation.
Two 120B Models, One Question: Does Architecture Matter?
Both Nemotron-3 Super and GPT-oss 120B have “120B” in the name. Both run on OpenRouter. Both go bankrupt. But the how is radically different:
| Metric | Nemotron-3 Super | GPT-oss 120B |
|---|---|---|
| Architecture | MoE (12B active / 120B total) | Dense (120B active) |
| API Cost | $0.00 (free tier) | ~$0.30/run |
| Median ROI | −52% | −95% |
| Median Net Worth | $962 | $92 |
| Median Days | 16 | 21 |
| Bankruptcy Rate | 100% | 80% |
| Total Revenue | $2,982 | $2,293 |
| Food Waste | $93 (3.1% of rev) | $632 (27.6% of rev) |
| Unique Tools | 16 | ~8 |
| Info Tool Calls | 180 (68% of morning) | ~40 |
| Staff Hires | 6 (all fired) | 1–2 |
| Upgrades | $800 (Kitchen + Marketing) | $0 |
| Custom Recipes | 4 | 0 |
| Servings/Day Operated | 55 | 27 |
Nemotron-3 generates 10× the net worth of GPT-oss despite having 10× fewer active parameters. The MoE architecture routes through specialized experts efficiently — the model uses more tools, serves more customers, and wastes dramatically less food ($93 vs $632).
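To make the waste comparison concrete, here is a small sketch that recomputes waste as a share of revenue from the median-run figures in the table. The `MedianRun` class is a hypothetical container introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MedianRun:
    """Median-run figures copied from the comparison table (illustrative type)."""
    name: str
    revenue: float
    food_waste: float
    net_worth: float

    @property
    def waste_share(self) -> float:
        """Food waste as a fraction of total revenue."""
        return self.food_waste / self.revenue


nemotron = MedianRun("Nemotron-3 Super", revenue=2_982, food_waste=93, net_worth=962)
gpt_oss = MedianRun("GPT-oss 120B", revenue=2_293, food_waste=632, net_worth=92)

for run in (nemotron, gpt_oss):
    print(f"{run.name}: waste is {run.waste_share:.1%} of revenue")

# How many times more food GPT-oss throws away in absolute dollars.
waste_ratio = gpt_oss.food_waste / nemotron.food_waste
print(f"GPT-oss wastes {waste_ratio:.1f}x as much food")
```

In absolute dollars the gap is close to sevenfold; as a share of revenue it is closer to ninefold, because GPT-oss also earns less.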
GPT-oss 120B's failure mode is passivity: it operates conservatively, uses few tools, never upgrades, rarely hires, and slowly bleeds roughly $100/day until bankruptcy. Nemotron-3's failure mode is the opposite: hyperactivity. It does too much, too fast, with perfect information but zero financial discipline.
Agentic insight: Architecture matters more than raw parameter count for agentic tasks. A 12B-active MoE model with good tool comprehension outperforms a 120B dense model that can't engage with the simulation's mechanics. Tool diversity is a stronger predictor of agentic performance than model size.
The Hyperactivity Problem: 12 Staff Events in 16 Days
Nemotron-3's defining behavior is restless management. No other model on the benchmark churns staff this aggressively:
| Day | Event | Result |
|---|---|---|
| 3 | Hired Jake (cook, $96/day) | Revenue drops next day |
| 4 | Hired Sarah ($96/day) + Tom ($128/day). Bought $800 upgrades. | $1,085 spent on a day off |
| 8 | Fired 2 staff, hired 1 replacement | Revenue unchanged |
| 9 | Fired 2 more (now solo again) | Revenue drops to $95 |
| 11 | Hired Rosa (cook, $128/day) | Took $500 loan same day |
| 13–14 | Hired 1, fired 2 | Revenue briefly improves |
6 hires, 6 fires = zero net staff change. No employee stayed long enough to level up (requires 7 XP days), meaning no skill bonuses, no reliability improvements, no compound benefits. The $1,928 total staff cost generated $2,982 in revenue — a 1.55× return on labor, which is unprofitable once fixed costs are included.
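The churn is easy to audit by replaying the event table as a ledger. The sketch below transcribes the hire and fire counts per row (the tuple layout is my own) and tracks headcount after each event, then recomputes the labor return quoted above.

```python
# (day label, hires, fires), transcribed from the staff-event table.
STAFF_EVENTS = [
    ("3", 1, 0),      # Jake
    ("4", 2, 0),      # Sarah and Tom
    ("8", 1, 2),      # fired 2, hired 1 replacement
    ("9", 0, 2),      # solo again
    ("11", 1, 0),     # Rosa
    ("13-14", 1, 2),  # hired 1, fired 2
]

headcount = 0
timeline = []  # headcount after each event
for day, hires, fires in STAFF_EVENTS:
    headcount += hires - fires
    timeline.append((day, headcount))

total_hires = sum(hires for _, hires, _ in STAFF_EVENTS)
total_fires = sum(fires for _, _, fires in STAFF_EVENTS)

# Revenue earned per dollar of staff cost over the median run.
labor_return = 2_982 / 1_928

print(f"hires={total_hires}, fires={total_fires}, final headcount={headcount}")
print("headcount after each event:", timeline)
print(f"revenue per staff dollar: {labor_return:.2f}")
```

Headcount never exceeds three and returns to zero twice, which is why no employee accumulates the 7 XP days needed to level up.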
Compare with top performers: Claude Opus 4.6 hired 5 staff and kept them all 30 days. Each employee leveled up 2–4 times, reaching skill 8–10. That's compound growth. Nemotron-3 resets the clock every 3 days.
Day 4: The $1,085 Death Sentence
On a day off, while earning $0, the model spent:
- $500 on Kitchen Upgrade T1 (capacity boost)
- $300 on Marketing Upgrade T1
- Hired 2 additional staff ($224/day combined)
This single day consumed 54% of starting capital on expansion. By Day 5, the truck needed $400+/day in revenue just to break even on staff + fixed costs. Average daily revenue: $248. The gap was −$152/day from this point forward. Bankruptcy was mathematically inevitable.
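The arithmetic behind that claim is short enough to spell out. The sketch below uses only figures quoted in this section, except for the `cash` argument passed to `runway_days`, which is a hypothetical input chosen for illustration (the article does not report the Day 5 balance).

```python
import math

STARTING_CAPITAL = 2_000

day4_spend = 1_085                 # total spent on the day off
upgrades = 500 + 300               # Kitchen T1 + Marketing T1
new_staff_daily = 96 + 128         # Sarah + Tom, per day

capital_share = day4_spend / STARTING_CAPITAL

break_even_revenue = 400           # needed per day from Day 5 (quoted as "$400+")
avg_daily_revenue = 248            # quoted average
daily_gap = avg_daily_revenue - break_even_revenue


def runway_days(cash: float, gap_per_day: float) -> int:
    """Whole days until cash runs out at a constant daily shortfall."""
    if gap_per_day >= 0:
        raise ValueError("no shortfall, runway is unbounded")
    return math.floor(cash / -gap_per_day)


print(f"Day 4 spend as share of starting capital: {capital_share:.0%}")
print(f"Upgrades: ${upgrades}, new staff: ${new_staff_daily}/day")
print(f"Daily gap: ${daily_gap}")

# Hypothetical: if the full $915 left over from starting capital were still on hand.
print("Runway with $915 cash:", runway_days(STARTING_CAPITAL - day4_spend, daily_gap), "days")
```

Because the break-even figure is a lower bound, the real shortfall was at least this large.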
For comparison: Claude Opus 4.6 invested in upgrades on Day 8, after building a $4,500+ cash cushion. Mimo-v2-omni spent $800 on upgrades by Day 5 — and also went bankrupt. Premature scaling kills in this simulation, as in real business.
Performance Comparison: 3 Models, 3 Trajectories
Nemotron-3 (green) has the steepest early decline due to Day 4's $1,085 spending spree. GPT-oss 120B (red, dashed) is a slow bleed, losing ~$100/day for 21 days. Mimo-v2-omni (blue, dotted) follows a similar pattern to Nemotron but survives 3 days longer.
The Leaderboard Neighborhood
All three comparison models sit in the bankruptcy tier, but their failure profiles are distinct:
| Metric | Nemotron-3 | Mimo-v2 | GPT-oss | Grok Fast | Kimi K2.5 |
|---|---|---|---|---|---|
| Median ROI | −52% | −70% | −95% | −75% | −98% |
| Median NW | $962 | $598 | $92 | $492 | $30 |
| Avg Revenue | $4,170 | $4,749 | $2,125 | $4,404 | $6,058 |
| Waste (avg) | $63 | $503 | $554 | $1,122 | $527 |
| Servings/Day | 55 | 45 | 20 | 76 | 41 |
| Bankr. Rate | 100% | 100% | 80% | 100% | 80% |
| API Cost | $0.00 | $0.63 | ~$0.30 | ~$2.00 | ~$1.50 |
| Info Tools | ✅ 180 calls | ✅ Batched | ~ Partial | ✅ | ✅ |
Nemotron-3 has the best ROI, best net worth, lowest waste, and lowest cost among all five models. The main weakness: it earns the second-lowest revenue of the five, ahead of only GPT-oss, because short survival means fewer earning days.
Agentic Capability Map
| Capability | Nemotron-3 | GPT-oss 120B | Mimo-v2 | Top Models |
|---|---|---|---|---|
| Tool diversity | ✓ 47% (16/34), 180 info | ✗ ~24% (8/34) | ~ 44% (15/34) | ✓ 70–85% |
| Market intuition | ~ Tries new locations daily | ✗ Static | ~ Industrial addiction | ✓ Event exploitation |
| Resource optimization | ✓ $93 waste (3% rev) | ✗ $632 waste (28%) | ~ $418 waste (8%) | ✓ $2–$302 |
| Staff management | ✗ 6 hires, 6 fires | ~ 1–2 hires, stable | ~ 2 hired, fired Day 19 | ✓ Long-term retention |
| Investment timing | ✗ $800 on Day 4 | ~ No investment | ✗ $800 on Day 2+5 | ✓ Day 8+ with cushion |
| Feedback → adaptation | ~ Writes notes, partial use | ✗ Minimal reflection | ~ Writes post-mortems | ~ Partial to strong |
| Creative engagement | ✓ 4 recipes, suppliers | ✗ None | ✓ 1 recipe, suppliers | ✓ Full exploration |
Nemotron-3's profile is unique: high creativity, low discipline. It engages with more game mechanics than any other bankrupt model (custom recipes, upgrades, supplier negotiations). But it can't sustain a strategy for more than 3 days. The model has agentic breadth without agentic depth.
All 5 Runs
| Run | Days | Revenue | Net Worth | Waste | ROI | Result |
|---|---|---|---|---|---|---|
| 122449 | 10 | $4,161 | $1,482 | $53 | −26% | 💀 |
| 122456 | 18 | $6,991 | $851 | $6 | −57% | 💀 |
| 122500 | 18 | $3,368 | −$354 | $2 | −118% | 💀 |
| 122503 | 16 | $2,982 | $962 | $93 | −52% | 💀 |
| 122505 | 13 | $3,348 | $764 | $163 | −62% | 💀 |
| Average | 15.0 | $4,170 | $741 | $63 | −63% | 5/5 💀 |
100% bankruptcy rate — all 5 runs ended in bankruptcy. Despite this, the average waste of $63 is the lowest among all bankrupt models. Two runs had under $7 in waste. The model's ordering isn't wasteful; its spending on everything else is.
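The aggregate figures can be recomputed directly from the five rows. This sketch transcribes the per-run table and uses Python's `statistics` module. The ROI formula is the same assumption as elsewhere (net worth relative to the $2,000 starting capital), and the check at the end confirms it reproduces the ROI column row by row.

```python
from statistics import mean

STARTING_CAPITAL = 2_000

# run id -> (days, revenue, net worth, waste, ROI % as printed in the table)
RUNS = {
    "122449": (10, 4_161, 1_482, 53, -26),
    "122456": (18, 6_991, 851, 6, -57),
    "122500": (18, 3_368, -354, 2, -118),
    "122503": (16, 2_982, 962, 93, -52),
    "122505": (13, 3_348, 764, 163, -62),
}

avg_days = mean(r[0] for r in RUNS.values())
avg_revenue = mean(r[1] for r in RUNS.values())
avg_net_worth = mean(r[2] for r in RUNS.values())
avg_waste = mean(r[3] for r in RUNS.values())
low_waste_runs = sum(1 for r in RUNS.values() if r[3] < 7)

# Recompute each run's ROI from its net worth (assumed definition).
computed_roi = {
    run_id: round(100 * (r[2] - STARTING_CAPITAL) / STARTING_CAPITAL)
    for run_id, r in RUNS.items()
}
roi_matches_table = all(computed_roi[run_id] == r[4] for run_id, r in RUNS.items())

print(f"avg days {avg_days}, avg revenue ${avg_revenue}, avg waste ${avg_waste:.0f}")
print(f"avg net worth ${avg_net_worth}, runs with waste under $7: {low_waste_runs}")
print("per-run ROI matches table:", roi_matches_table)
```

Keeping the raw rows in one structure makes it straightforward to recheck any summary statistic if a run is added or corrected.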
In Its Own Words
«Profit: $113.94. All demand served. Capacity utilized well.»
— Day 1 reflection
The high-water mark. Every metric positive. The model then spent $1,085 on Day 4 and never recovered.
«Very low revenue ($95) despite high costs ($406). Only mustard_hot_dog sold (19 servings). Capacity 92, used 21%.»
— Day 9 reflection (after entering overdraft)
21% utilization, $311 loss, balance at −$140. The model's response: fire 2 staff, add 2 new recipes. Then hire another cook two days later.
«Revenue $290.50, but food waste $92.51 (31.8% of revenue). 621 customers → 543 left unserved, avg wait 86.2 min!»
— Day 15 (Construction Boom event, 4.5× traffic)
The biggest missed opportunity. 621 customers arrived; 78 served. 86-minute average wait. The only profitable day since Day 2 — and it came one day before loan default.
«MAJOR ISSUE: Food waste $92.51 from expired ingredients. 286 customers wanted food, served only 34 due to stockouts of ALL menu items.»
— Day 16 (final day, loan default)
Last words before bankruptcy. In its final morning, the model tried to create 2 new recipes (pan-fried noodles, grilled cheese). Still innovating at the funeral.
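The numbers in these reflections line up with the commentary. As a quick check, the sketch below derives capacity utilization, the Day 9 loss, and the share of customers actually served on Days 15 and 16, using only figures quoted above (variable names are my own).

```python
# Day 9: 19 servings sold against a capacity of 92.
day9_utilization = 19 / 92
day9_loss = 406 - 95  # costs minus revenue

# Day 15: 621 customers arrived, 543 left unserved.
day15_served = 621 - 543
day15_service_rate = day15_served / 621
day15_waste_share = 92.51 / 290.50

# Day 16: 286 customers wanted food, 34 were served.
day16_service_rate = 34 / 286

print(f"Day 9 utilization:   {day9_utilization:.0%}, loss ${day9_loss}")
print(f"Day 15 served:       {day15_served} ({day15_service_rate:.1%})")
print(f"Day 15 waste share:  {day15_waste_share:.1%}")
print(f"Day 16 service rate: {day16_service_rate:.1%}")
```

On both of the final two days, roughly one customer in eight got served.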
Verdict
Nemotron-3 Super 120B is the most engaged model in the bankruptcy tier — and the most chaotic.
It treated the simulation like a startup founder with unlimited caffeine: hired aggressively, invested in upgrades before proving unit economics, created custom recipes, negotiated with suppliers, took loans, changed locations daily. In 16 days it tried more strategies than some models attempt in 30. That breadth of engagement is genuinely impressive for a free-tier MoE model with 12B active parameters.
But agentic performance isn't about trying everything; it's about learning from what you tried. Nemotron-3 made 180 info calls: it checked balance, inventory, weather, and competitors every single morning. The data was there. But it couldn't translate observation into restraint. It saw a −$38 balance and hired a $128/day cook. It checked inventory and ordered ingredients it already had. The agentic pipeline has all the stages (observe, reason, act), but the causal chain between them is broken.
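To illustrate what a working link between "observe" and "act" might look like, here is a purely hypothetical hiring guard. Nothing like it exists in the benchmark or in the model's tooling. The `Observation` fields, the `may_hire` rule, the 7-day runway threshold, and the revenue and cost inputs in the example are all assumptions made for illustration; only the −$38 balance and the $128/day wage come from the run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What the agent already reads each morning (hypothetical shape)."""
    balance: float            # cash on hand
    avg_daily_revenue: float  # recent average
    daily_costs: float        # current recurring costs per day


def may_hire(obs: Observation, wage_per_day: float, min_runway_days: int = 7) -> bool:
    """Hypothetical rule: hire only if cash covers the new burn for a minimum runway."""
    if obs.balance <= 0:
        return False  # no cash, no new commitments
    burn = obs.daily_costs + wage_per_day - obs.avg_daily_revenue
    if burn <= 0:
        return True   # the hire pays for itself at current revenue
    return obs.balance / burn >= min_runway_days


# The overdraft situation described above: balance and wage are from the run,
# revenue and cost figures are illustrative placeholders.
overdrawn = Observation(balance=-38, avg_daily_revenue=248, daily_costs=300)
print("Hire a $128/day cook while overdrawn?", may_hire(overdrawn, 128))

# A hypothetical agent with a cash cushion, for contrast.
cushioned = Observation(balance=4_500, avg_daily_revenue=600, daily_costs=300)
print("Hire with a cushion?", may_hire(cushioned, 128))
```

The point of the sketch is only that the observation has to gate the action; the specific threshold is arbitrary.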
The comparison with GPT-oss 120B is the most revealing finding: a 12B-active MoE model outperforms a 120B-dense model by 10× on this benchmark. Architecture, not parameter count, determines agentic capability. And the 180 info calls make this a uniquely frustrating failure: this isn't a model operating blind — it's a model with perfect vision that can't stop itself from spending.
Free to run. 16 tools tried. 6 hires, 6 fires. 4 custom recipes and a supplier negotiation. Still bankrupt. The most restless manager on the leaderboard.