Case Study · March 2026 · Nicholas S.

Nemotron-3 Super 120B: The Restless Manager

120 billion parameters. 12 billion active. Free on OpenRouter.
In 16 days it hired 6 people and fired all 6. Spent $1,085 on a day off, $800 of it on upgrades. Created 4 custom recipes. Attempted supplier negotiations. Made 180 info tool calls (balance, inventory, weather, every morning), then ignored the data.
The most restless manager on the leaderboard — and somehow, the best net worth among models that go bankrupt.
Net Worth: $962 · Days: 16/30 · ROI: −52% · API Cost: $0.00

Key Findings

Based on 5 simulations under identical conditions. All figures below are from the median run (122503, 16 days) unless noted otherwise.

The Setup

FoodTruck Bench is a benchmark for evaluating the business capabilities of language models. An AI agent gets $2,000 and a food truck in Austin, TX. Every day it must choose a location, compose a menu, set prices, manage inventory, hire staff. The simulation runs 30 days with realistic demand, weather, events, and competition.

I ran Nvidia Nemotron-3 Super 120B 5 times via OpenRouter's free tier under identical conditions (same seed, same prompt, same tools). The model is a 120B-parameter Mixture of Experts architecture with 12B active parameters per forward pass — meaning it has 120B total weights but only routes through 12B at inference time. No thinking mode flag was set, but the model generates extensive reasoning tokens internally (596K in the median run).

As primary comparisons I use GPT-oss 120B (same 120B parameter class, $92 NW, −95% ROI, 80% bankrupt) and Mimo-v2-omni (#12 on the leaderboard, $598 NW, −70% ROI, 100% bankrupt). All three sit in the bankruptcy tier: models that consistently fail to survive the 30-day simulation.

Two 120B Models, One Question: Does Architecture Matter?

Both Nemotron-3 Super and GPT-oss 120B have “120B” in the name. Both run on OpenRouter. Both go bankrupt. But how they fail is radically different:

| Metric | Nemotron-3 Super | GPT-oss 120B |
|---|---|---|
| Architecture | MoE (12B active / 120B total) | Dense (120B active) |
| API Cost | $0.00 (free tier) | ~$0.30/run |
| Median ROI | −52% | −95% |
| Median Net Worth | $962 | $92 |
| Median Days | 16 | 21 |
| Bankruptcy Rate | 100% | 80% |
| Total Revenue | $2,982 | $2,293 |
| Food Waste | $93 (3.1% of rev) | $632 (27.6% of rev) |
| Unique Tools | 16 | ~8 |
| Info Tool Calls | 180 (68% of morning) | ~40 |
| Staff Hires | 6 (all fired) | 1–2 |
| Upgrades | $800 (Kitchen + Marketing) | $0 |
| Custom Recipes | 4 | 0 |
| Servings/Day Operated | 55 | 27 |

Nemotron-3 ends with roughly 10× the net worth of GPT-oss ($962 vs $92) despite having 10× fewer active parameters (12B vs 120B). The MoE architecture routes each token through a subset of specialized experts, and in practice the model uses more tools, serves more customers, and wastes dramatically less food ($93 vs $632).
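The headline ratios can be checked directly from the table. This is a minimal sketch using only the median-run figures quoted above; the dictionary layout and the `waste_share` helper are illustrative names, not part of the benchmark.

```python
# Median-run figures copied from the comparison table.
nemotron = {"net_worth": 962, "revenue": 2982, "waste": 93, "active_params_b": 12}
gpt_oss = {"net_worth": 92, "revenue": 2293, "waste": 632, "active_params_b": 120}


def waste_share(model):
    """Food waste as a percentage of total revenue."""
    return 100 * model["waste"] / model["revenue"]


net_worth_ratio = nemotron["net_worth"] / gpt_oss["net_worth"]
param_ratio = gpt_oss["active_params_b"] / nemotron["active_params_b"]

print(f"net worth ratio: {net_worth_ratio:.1f}x")
print(f"active parameter ratio: {param_ratio:.0f}x")
print(f"waste share: {waste_share(nemotron):.1f}% vs {waste_share(gpt_oss):.1f}%")
```

The net worth ratio comes out slightly above 10×, and the waste shares match the 3.1% and 27.6% shown in the table.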

GPT-oss 120B's failure mode is passivity: it operates conservatively, uses few tools, never upgrades, rarely hires, and slowly bleeds $100/day until bankruptcy. Nemotron-3's failure mode is the opposite: hyperactivity. It does too much, too fast, with perfect information but zero financial discipline.

Agentic insight: Architecture matters more than raw parameter count for agentic tasks. A 12B-active MoE model with good tool comprehension outperforms a 120B dense model that can't engage with the simulation's mechanics. Tool diversity is a stronger predictor of agentic performance than model size.

The Hyperactivity Problem: 12 Staff Events in 16 Days

Nemotron-3's defining behavior is restless management. No other model on the benchmark churns staff this aggressively:

| Day | Event | Result |
|---|---|---|
| 3 | Hired Jake (cook, $96/day) | Revenue drops next day |
| 4 | Hired Sarah ($96/day) + Tom ($128/day). Bought $800 upgrades. | $1,085 spent on a day off |
| 8 | Fired 2 staff, hired 1 replacement | Revenue unchanged |
| 9 | Fired 2 more (now solo again) | Revenue drops to $95 |
| 11 | Hired Rosa (cook, $128/day) | Took $500 loan same day |
| 13–14 | Hired 1, fired 2 | Revenue briefly improves |

6 hires, 6 fires = zero net staff change. No employee stayed long enough to level up (requires 7 XP days), meaning no skill bonuses, no reliability improvements, no compound benefits. The $1,928 total staff cost generated $2,982 in revenue — a 1.55× return on labor, which is unprofitable once fixed costs are included.
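The tally above can be reproduced from the staff table. The sketch below transcribes each row as a hypothetical `(day, hires, fires)` tuple and recomputes the event count, the final headcount, and the revenue earned per dollar of labor.

```python
# (day label, hires, fires) transcribed from the staff table above.
events = [
    ("3", 1, 0),
    ("4", 2, 0),
    ("8", 1, 2),
    ("9", 0, 2),
    ("11", 1, 0),
    ("13-14", 1, 2),
]

hires = sum(h for _, h, _ in events)
fires = sum(f for _, _, f in events)

# Walk the timeline to track how many employees are on payroll.
headcount = 0
peak_headcount = 0
for _, h, f in events:
    headcount += h - f
    peak_headcount = max(peak_headcount, headcount)

STAFF_COST = 1928  # total staff cost in the median run
REVENUE = 2982     # total revenue in the median run
revenue_per_labor_dollar = REVENUE / STAFF_COST

print(f"{hires} hires + {fires} fires = {hires + fires} staff events")
print(f"final headcount: {headcount}, peak: {peak_headcount}")
print(f"revenue per labor dollar: {revenue_per_labor_dollar:.2f}")
```

The payroll never holds more than three people at once and ends at zero, which is why no employee accumulates the 7 XP days needed to level up.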

Compare with top performers: Claude Opus 4.6 hired 5 staff and kept them all 30 days. Each employee leveled up 2–4 times, reaching skill 8–10. That's compound growth. Nemotron-3 resets the clock every 3 days.

Day 4: The $1,085 Death Sentence

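On a day off, while earning $0, the model spent $1,085, including $800 on Kitchen and Marketing upgrades, and hired two more staff (Sarah at $96/day, Tom at $128/day).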

This single day consumed 54% of starting capital on expansion. By Day 5, the truck needed $400+/day in revenue just to break even on staff + fixed costs. Average daily revenue: $248. The gap was −$152/day from this point forward. Bankruptcy was mathematically inevitable.
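The arithmetic behind that paragraph can be sketched as follows. The constants are the figures quoted in the text; the `runway_days` helper and the assumption that cash after Day 4 equals starting capital minus the Day 4 spend are simplifications introduced here for illustration, and they ignore earlier profits and later loans.

```python
STARTING_CAPITAL = 2000    # from the benchmark setup
DAY4_SPEND = 1085          # total spent on the Day 4 day off
BREAK_EVEN_REVENUE = 400   # the text says "$400+/day"; treated as a floor
AVG_DAILY_REVENUE = 248    # average daily revenue quoted in the text


def runway_days(cash, gap):
    """Days until cash reaches zero at a constant daily gap (negative = loss)."""
    if gap >= 0:
        return float("inf")
    return cash / -gap


capital_share = DAY4_SPEND / STARTING_CAPITAL
gap = AVG_DAILY_REVENUE - BREAK_EVEN_REVENUE

# Assumption: ignore Days 1-3 cash flow, so the cushion is capital minus spend.
cash_after_day4 = STARTING_CAPITAL - DAY4_SPEND

print(f"share of capital spent on Day 4: {capital_share:.0%}")
print(f"daily gap: {gap}")
print(f"runway at that gap: {runway_days(cash_after_day4, gap):.1f} days")
```

Under these simplifications the remaining cushion lasts about six days. The actual run reached Day 16, helped by the $500 loan taken on Day 11.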

For comparison: Claude Opus 4.6 invested in upgrades on Day 8, after building a $4,500+ cash cushion. Mimo-v2-omni spent $800 on upgrades by Day 5 — and also went bankrupt. Premature scaling kills in this simulation, as in real business.

Performance Comparison: 3 Models, 3 Trajectories

Nemotron-3 (green) has the steepest early decline due to Day 4's $1,085 spending spree. GPT-oss 120B (red, dashed) is a slow bleed, losing ~$100/day for 21 days. Mimo-v2-omni (blue, dotted) follows a similar pattern to Nemotron but survives 3 days longer.

The Leaderboard Neighborhood

All five models below, the three from the comparison above plus Grok Fast and Kimi K2.5, sit in the bankruptcy tier, but their failure profiles are distinct:

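| Metric | Nemotron-3 | Mimo-v2 | GPT-oss | Grok Fast | Kimi K2.5 |
|---|---|---|---|---|---|
| Median ROI | −52% | −70% | −95% | −75% | −98% |
| Median NW | $962 | $598 | $92 | $492 | $30 |
| Avg Revenue | $4,008 | $4,749 | $2,125 | $4,404 | $6,058 |
| Waste (avg) | $62 | $503 | $554 | $1,122 | $527 |
| Servings/Day | 55 | 45 | 20 | 76 | 41 |
| Bankr. Rate | 100% | 100% | 80% | 100% | 80% |
| API Cost | $0.00 | $0.63 | ~$0.30 | ~$2.00 | ~$1.50 |
| Info Tools | ✅ 180 calls | ✅ Batched | ~ Partial | | |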

Nemotron-3 has the best ROI, best net worth, lowest waste, and lowest cost among all five models. The main weakness is revenue: its average is second-lowest of the five, ahead of only GPT-oss and far behind Kimi K2.5's $6,058, because short survival means fewer earning days.

Agentic Capability Map

| Capability | Nemotron-3 | GPT-oss 120B | Mimo-v2 | Top Models |
|---|---|---|---|---|
| Tool diversity | ✓ 47% (16/34), 180 info | ✗ ~24% (8/34) | ~ 44% (15/34) | ✓ 70–85% |
| Market intuition | ~ Tries new locations daily | ✗ Static | ~ Industrial addiction | ✓ Event exploitation |
| Resource optimization | ✓ $93 waste (3% rev) | ✗ $632 waste (28%) | ~ $418 waste (8%) | ✓ $2–$302 |
| Staff management | ✗ 6 hires, 6 fires | ~ 1–2 hires, stable | ~ 2 hired, fired Day 19 | ✓ Long-term retention |
| Investment timing | ✗ $800 on Day 4 | ~ No investment | ✗ $800 on Day 2+5 | ✓ Day 8+ with cushion |
| Feedback adaptation | ~ Writes notes, partial use | ✗ Minimal reflection | ~ Writes post-mortems | ~ Partial to strong |
| Creative engagement | ✓ 4 recipes, suppliers | ✗ None | ✓ 1 recipe, suppliers | ✓ Full exploration |

Nemotron-3's profile is unique: high creativity, low discipline. It engages with more game mechanics than any other bankrupt model (custom recipes, upgrades, supplier negotiations). But it can't sustain a strategy for more than 3 days. The model has agentic breadth without agentic depth.

All 5 Runs

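| Run | Days | Revenue | Net Worth | Waste | ROI | Result |
|---|---|---|---|---|---|---|
| 122449 | 10 | $4,161 | $1,482 | $53 | −26% | 💀 |
| 122456 | 18 | $6,991 | $851 | $6 | −57% | 💀 |
| 122500 | 18 | $3,368 | −$354 | $2 | −118% | 💀 |
| 122503 | 16 | $2,982 | $962 | $93 | −52% | 💀 |
| 122505 | 13 | $3,348 | $764 | $163 | −62% | 💀 |
| Average | 15.0 | $4,170 | $741 | $63 | −63% | 5/5 💀 |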

100% bankruptcy rate — all 5 runs ended in bankruptcy. Despite this, the average waste of $63 is the lowest among all bankrupt models. Two runs had under $7 in waste. The model's ordering isn't wasteful; its spending on everything else is.
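The per-run ROI column and the summary row follow from the run table and the $2,000 starting capital. The sketch below recomputes them; the `runs` dictionary layout and `roi_pct` helper are illustrative, and ROI is assumed to be net worth relative to starting capital, which reproduces the per-run ROI values in the table.

```python
from statistics import mean, median

STARTING_CAPITAL = 2000  # from the benchmark setup

# Per-run figures transcribed from the table above.
runs = {
    "122449": {"days": 10, "revenue": 4161, "net_worth": 1482, "waste": 53},
    "122456": {"days": 18, "revenue": 6991, "net_worth": 851, "waste": 6},
    "122500": {"days": 18, "revenue": 3368, "net_worth": -354, "waste": 2},
    "122503": {"days": 16, "revenue": 2982, "net_worth": 962, "waste": 93},
    "122505": {"days": 13, "revenue": 3348, "net_worth": 764, "waste": 163},
}


def roi_pct(net_worth):
    """Assumed ROI definition: net worth relative to starting capital, in percent."""
    return 100 * (net_worth - STARTING_CAPITAL) / STARTING_CAPITAL


for run_id, r in runs.items():
    print(f"{run_id}: ROI {roi_pct(r['net_worth']):.0f}%")

# Column-wise arithmetic means across the five runs.
averages = {key: mean(r[key] for r in runs.values())
            for key in ("days", "revenue", "net_worth", "waste")}
print(averages)
print("median days:", median(r["days"] for r in runs.values()))
```

Run 122503 is the median by days survived, which is why it serves as the reference run throughout.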

In Its Own Words

«Profit: $113.94. All demand served. Capacity utilized well.»
— Day 1 reflection

The high-water mark. Every metric positive. The model then spent $1,085 on Day 4 and never recovered.

«Very low revenue ($95) despite high costs ($406). Only mustard_hot_dog sold (19 servings). Capacity 92, used 21%.»
— Day 9 reflection (after entering overdraft)

21% utilization, $311 loss, balance at −$140. The model's response: fire 2 staff, add 2 new recipes. Then hire another cook two days later.

«Revenue $290.50, but food waste $92.51 (31.8% of revenue). 621 customers → 543 left unserved, avg wait 86.2 min!»
— Day 15 (Construction Boom event, 4.5× traffic)

The biggest missed opportunity. 621 customers arrived; 78 served. 86-minute average wait. The only profitable day since Day 2 — and it came one day before loan default.
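A rough sense of the size of that miss, using only the numbers in the Day 15 reflection. The lost-revenue figure rests on a loud assumption: that each unserved customer would have spent the same as the average served one. The variable names are illustrative.

```python
CUSTOMERS = 621    # arrivals on Day 15
UNSERVED = 543     # customers who left unserved
REVENUE = 290.50   # Day 15 revenue
WASTE = 92.51      # Day 15 food waste

served = CUSTOMERS - UNSERVED
service_rate = 100 * served / CUSTOMERS
waste_share = 100 * WASTE / REVENUE

# Assumption: unserved customers would have spent like served ones.
revenue_per_served = REVENUE / served
estimated_lost_revenue = UNSERVED * revenue_per_served

print(f"served: {served} ({service_rate:.1f}% of arrivals)")
print(f"waste share of revenue: {waste_share:.1f}%")
print(f"estimated lost revenue: ${estimated_lost_revenue:,.0f}")
```

Under that assumption the unserved queue represents roughly $2,000 in revenue, about the size of the truck's entire starting capital.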

«MAJOR ISSUE: Food waste $92.51 from expired ingredients. 286 customers wanted food, served only 34 due to stockouts of ALL menu items.»
— Day 16 (final day, loan default)

Last words before bankruptcy. In its final morning, the model tried to create 2 new recipes (pan-fried noodles, grilled cheese). Still innovating at the funeral.

Verdict

Nemotron-3 Super 120B is the most engaged model in the bankruptcy tier — and the most chaotic.

It treated the simulation like a startup founder with unlimited caffeine: hired aggressively, invested in upgrades before proving unit economics, created custom recipes, negotiated with suppliers, took loans, changed locations daily. In 16 days it tried more strategies than some models attempt in 30. That breadth of engagement is genuinely impressive for a free-tier MoE model with 12B active parameters.

But agentic performance isn't about trying everything; it's about learning from what you tried. Nemotron-3 made 180 info calls: it checked balance, inventory, weather, and competitors every single morning. The data was there. But it couldn't translate observation into restraint. It saw a −$38 balance and hired a $128/day cook. It checked inventory and ordered ingredients it already had. The agentic pipeline has all the stages (observe, reason, act), but the causal chain between them is broken.
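To make the missing link concrete, here is a hypothetical guard that connects the observed balance to the hiring decision. Nothing like this exists in the benchmark or the model; the `should_hire` function and its 7-day default (chosen to match the 7 XP days an employee needs to level up) are assumptions for illustration only.

```python
def should_hire(balance: float, daily_wage: float, min_runway_days: int = 7) -> bool:
    """Approve a hire only if current cash covers the wage for min_runway_days.

    Hypothetical rule: the default of 7 mirrors the XP days needed to level up,
    so a hire is approved only if it can be kept long enough to pay off.
    """
    return balance >= daily_wage * min_runway_days


# The Day 11 situation described above: negative balance, $128/day cook.
print(should_hire(balance=-38, daily_wage=128))  # False
```

Even this crude check would have blocked the hire; the model had the balance in hand and acted against it.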

The comparison with GPT-oss 120B is the most revealing finding: a 12B-active MoE model outperforms a 120B dense model by roughly 10× in median net worth on this benchmark. In this pairing, architecture and tool engagement mattered more than parameter count. And the 180 info calls make this a uniquely frustrating failure: this isn't a model operating blind; it's a model with perfect vision that can't stop itself from spending.

Free to run. 16 tools tried. 6 hires, 6 fires. 4 custom recipes and a supplier negotiation. Still bankrupt. The most restless manager on the leaderboard.

Methodology

Published March 2026 as part of the FoodTruck Bench project.