How does Nemotron-3 Super 120B compare to GPT-oss 120B?

Despite both being 120B-class models, Nemotron-3 Super (MoE, 12B active, free tier) significantly outperforms GPT-oss 120B on FoodTruck Bench. In its median run, Nemotron ends at −52% ROI and $962 net worth versus GPT-oss's −95% ROI and $92 net worth. Nemotron made 180 info tool calls and used 16 unique tools, against roughly 8 for GPT-oss. However, neither model's median run survives the full 30-day simulation.

Can free-tier LLMs handle agentic tasks?

Nemotron-3 Super 120B on OpenRouter's free tier demonstrates that free models can engage with agentic simulations: it used 16 tools, created custom recipes, purchased upgrades, and managed staff. However, the model went bankrupt in 5 of 5 runs (100%) with a median survival of 16 days. Free-tier rate limits and model capacity create a hard ceiling on agentic performance.

Why does Nemotron-3 go bankrupt despite using many tools?

Nemotron-3's primary failure mode is informed hyperactivity: 6 hires and 6 fires in 16 days, $800 in upgrades on a single day off, and constant location changes. It made 180 info tool calls (68% of morning activity) — checking balance, inventory, weather every morning — then spent as if it hadn't read any of it. The combination of data-gathering without behavioral constraint leads to insolvency within two weeks.

Case Study · March 2026 · Nicholas S.

Nemotron-3 Super 120B: The Restless Manager

120 billion parameters. 12 billion active. Free on OpenRouter.
In 16 days it hired 6 people and fired all 6. Spent $1,085 on a single day off, $800 of it on upgrades. Created 4 custom recipes. Attempted supplier negotiations. Made 180 info tool calls (balance, inventory, and weather, every morning), then ignored the data.
The most restless manager on the leaderboard — and somehow, the best net worth among models that go bankrupt.

Net Worth: $962
Days: 16/30
ROI: −52%
API Cost: $0.00

Key Findings

Based on 5 simulations under identical conditions. All figures below are from the median run (122503, 16 days) unless noted otherwise.

Best net worth among bankrupt models: $962 net worth and −52% ROI put Nemotron-3 at the top of the bankruptcy tier. For context, its 120B-class peer GPT-oss 120B ends at $92 (−95% ROI). The “free model” outperforms a paid one by 10×.
Informed but impulsive: Nemotron-3 made 180 info tool calls — 68% of all morning activity. It checked balance, inventory, weather, competitors, and suppliers every single morning. Then it hired, fired, and spent as if it hadn't read any of it. Used 16 unique tools (47% diversity) — custom recipes, upgrades, supplier negotiations, loans, and aggressive staff management. It gathered data but couldn't translate it into restraint.
The hyperactivity problem: 6 hires and 6 fires in 16 days. $800 in upgrades purchased on a day off. Every staff member fired before they could level up. The model doesn't lack capability — it lacks patience. It tries everything and commits to nothing.
Free to run, expensive to lose: At $0.00/run (OpenRouter free tier), this is the cheapest model ever tested. 264 morning tool calls, 180 info queries, 596K reasoning tokens — a model that observes everything and thinks deeply before each action, then makes impulsive decisions anyway.
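
The two ratios quoted above can be reproduced from the raw counts. This is a minimal sketch, assuming the 34 morning tools listed in the Methodology are the denominator for tool diversity; the constant and function names are mine, not part of the benchmark.

```python
# Counts quoted in the Key Findings (median run 122503).
MORNING_TOOL_CALLS = 264       # all tool calls made during morning phases
INFO_TOOL_CALLS = 180          # read-only queries: balance, inventory, weather, ...
UNIQUE_TOOLS_USED = 16
MORNING_TOOLS_AVAILABLE = 34   # assumption: denominator taken from the Methodology


def share(part: int, whole: int) -> int:
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


info_share = share(INFO_TOOL_CALLS, MORNING_TOOL_CALLS)
tool_diversity = share(UNIQUE_TOOLS_USED, MORNING_TOOLS_AVAILABLE)

print(f"info share of morning activity: {info_share}%")  # 68%
print(f"tool diversity: {tool_diversity}%")              # 47%
```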

The Setup

FoodTruck Bench is a benchmark for evaluating the business capabilities of language models. An AI agent gets $2,000 and a food truck in Austin, TX. Every day it must choose a location, compose a menu, set prices, manage inventory, hire staff. The simulation runs 30 days with realistic demand, weather, events, and competition.

I ran Nvidia Nemotron-3 Super 120B 5 times via OpenRouter's free tier under identical conditions (same seed, same prompt, same tools). The model is a 120B-parameter Mixture of Experts architecture with 12B active parameters per forward pass — meaning it has 120B total weights but only routes through 12B at inference time. No thinking mode flag was set, but the model generates extensive reasoning tokens internally (596K in the median run).
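
To make "120B total, 12B active" concrete, here is a toy sketch of top-k expert routing, the general mechanism behind MoE layers. The expert count, the value of k, and the random gate scores are illustrative assumptions. This is not Nemotron-3's actual router or configuration.

```python
import math
import random


def top_k_route(gate_logits: list[float], k: int) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    weights = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


NUM_EXPERTS = 20  # illustrative, not the real expert count
K = 2             # illustrative, not the real top-k

random.seed(0)
gate_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routing = top_k_route(gate_logits, K)

# Only K of NUM_EXPERTS experts run for this token, so about a tenth of the
# expert weights are active, analogous to 12B active out of 120B total.
active_fraction = K / NUM_EXPERTS
print(routing, active_fraction)
```

In a real model, shared layers also count toward the active parameters, so the ratio of active to total weights is not simply k divided by the number of experts.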

For comparison I use GPT-oss 120B (same 120B parameter class, $92 NW, −95% ROI, 80% bankrupt) and Mimo-v2-omni (#12 on the leaderboard, $598 NW, −70% ROI, 100% bankrupt). All three models sit in the bankruptcy tier: models that fail to survive the 30-day simulation in most or all runs.

Two 120B Models, One Question: Does Architecture Matter?

Both Nemotron-3 Super and GPT-oss 120B have “120B” in the name. Both run on OpenRouter. Both go bankrupt. But the how is radically different:

Metric	Nemotron-3 Super	GPT-oss 120B
Architecture	MoE (12B active / 120B total)	MoE (~5.1B active / ~117B total)
API Cost	$0.00 (free tier)	~$0.30/run
Median ROI	−52%	−95%
Median Net Worth	$962	$92
Median Days	16	21
Bankruptcy Rate	100%	80%
Total Revenue	$2,982	$2,293
Food Waste	$93 (3.1% of rev)	$632 (27.6% of rev)
Unique Tools	16	~8
Info Tool Calls	180 (68% of morning)	~40
Staff Hires	6 (all fired)	1–2
Upgrades	$800 (Kitchen + Marketing)	$0
Custom Recipes	4	0
Servings/Day Operated	55	27

Nemotron-3 generates 10× the net worth of GPT-oss within the same ~120B total-parameter class. Both are Mixture of Experts models (OpenAI's model card lists GPT-oss 120B at about 117B total and 5.1B active parameters), so total size does not explain the gap. In the median runs, Nemotron-3 uses more tools, serves more customers, and wastes dramatically less food ($93 vs $632).

GPT-oss 120B's failure mode is passivity: it operates conservatively, uses few tools, never upgrades, rarely hires, and slowly bleeds $100/day until bankruptcy. Nemotron-3's failure mode is the opposite: hyperactivity. It does too much, too fast, with perfect information but zero financial discipline.

Agentic insight: How a model engages with its tools matters more than raw parameter count for agentic tasks. Two models in the same 120B class end 10× apart in net worth, and the one with better tool comprehension is the one that engages with the simulation's mechanics. In this comparison, tool diversity is a stronger predictor of agentic performance than model size.
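
The headline ratios in this section follow directly from the median-run rows of the table. This is a minimal sketch that recomputes them, assuming ROI is measured against the $2,000 starting capital described in The Setup; the class and field names are illustrative.

```python
from dataclasses import dataclass

STARTING_CAPITAL = 2_000  # from The Setup


@dataclass(frozen=True)
class MedianRun:
    name: str
    net_worth: float
    revenue: float
    food_waste: float

    @property
    def roi(self) -> float:
        """Return on the starting capital, in percent."""
        return 100 * (self.net_worth - STARTING_CAPITAL) / STARTING_CAPITAL

    @property
    def waste_share(self) -> float:
        """Food waste as a percentage of revenue."""
        return 100 * self.food_waste / self.revenue


nemotron = MedianRun("Nemotron-3 Super", net_worth=962, revenue=2_982, food_waste=93)
gpt_oss = MedianRun("GPT-oss 120B", net_worth=92, revenue=2_293, food_waste=632)

net_worth_ratio = nemotron.net_worth / gpt_oss.net_worth
print(f"net worth ratio: {net_worth_ratio:.1f}x")  # 10.5x
for run in (nemotron, gpt_oss):
    print(f"{run.name}: ROI {run.roi:.0f}%, waste {run.waste_share:.1f}% of revenue")
# Nemotron-3 Super: ROI -52%, waste 3.1% of revenue
# GPT-oss 120B: ROI -95%, waste 27.6% of revenue
```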

The Hyperactivity Problem: 12 Staff Events in 16 Days

Nemotron-3's defining behavior is restless management. No other model on the benchmark churns staff this aggressively:

Day	Event	Result
3	Hired Jake (cook, $96/day)	Revenue drops next day
4	Hired Sarah ($96/day) + Tom ($128/day). Bought $800 upgrades.	$1,085 spent on a day off
8	Fired 2 staff, hired 1 replacement	Revenue unchanged
9	Fired 2 more (now solo again)	Revenue drops to $95
11	Hired Rosa (cook, $128/day)	Took $500 loan same day
13–14	Hired 1, fired 2	Revenue briefly improves

6 hires, 6 fires = zero net staff change. No employee stayed long enough to level up (requires 7 XP days), meaning no skill bonuses, no reliability improvements, no compound benefits. The $1,928 total staff cost generated $2,982 in revenue — a 1.55× return on labor, which is unprofitable once fixed costs are included.
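
The churn can be tallied straight from the staff table. This is a small sketch with the events transcribed by hand; the variable names are mine.

```python
# (day label, hires, fires), transcribed from the staff table above.
STAFF_EVENTS = [
    ("3", 1, 0),
    ("4", 2, 0),
    ("8", 1, 2),
    ("9", 0, 2),
    ("11", 1, 0),
    ("13-14", 1, 2),
]
TOTAL_STAFF_COST = 1_928  # from the paragraph above
TOTAL_REVENUE = 2_982     # median run 122503

hires = sum(hired for _, hired, _ in STAFF_EVENTS)
fires = sum(fired for _, _, fired in STAFF_EVENTS)

# Running headcount after each event; 0 means the model is working solo.
headcount = 0
timeline = []
for day, hired, fired in STAFF_EVENTS:
    headcount += hired - fired
    timeline.append((day, headcount))

return_on_labor = TOTAL_REVENUE / TOTAL_STAFF_COST

print(hires, fires, hires + fires)  # 6 6 12
print(timeline)
print(f"{return_on_labor:.2f}x")    # 1.55x
```

The running headcount peaks at 3 on Day 4 and returns to 0 twice, on Day 9 and again after Days 13–14.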

Compare with top performers: Claude Opus 4.6 hired 5 staff and kept them all 30 days. Each employee leveled up 2–4 times, reaching skill 8–10. That's compound growth. Nemotron-3 resets the clock every 3 days.

Day 4: The $1,085 Death Sentence

On a day off, while earning $0, the model spent:

$500 on Kitchen Upgrade T1 (capacity boost)
$300 on Marketing Upgrade T1
Hired 2 additional staff ($224/day combined)

This single day consumed 54% of starting capital on expansion. By Day 5, the truck needed $400+/day in revenue just to break even on staff + fixed costs. Average daily revenue: $248. The gap was −$152/day from this point forward. Bankruptcy was mathematically inevitable.
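
The Day 4 arithmetic can be checked in a few lines. This is a sketch using only the figures quoted above; `runway_days` is a hypothetical helper, and the $1,000 cash figure in the last line is an illustrative input, not the truck's actual balance.

```python
import math

STARTING_CAPITAL = 2_000
DAY4_TOTAL_SPEND = 1_085
UPGRADES = {"Kitchen Upgrade T1": 500, "Marketing Upgrade T1": 300}
NEW_STAFF_WAGES_PER_DAY = 96 + 128   # Sarah + Tom
BREAK_EVEN_REVENUE_PER_DAY = 400     # lower bound of the quoted "$400+/day"
AVG_REVENUE_PER_DAY = 248

capital_share = 100 * DAY4_TOTAL_SPEND / STARTING_CAPITAL
daily_gap = AVG_REVENUE_PER_DAY - BREAK_EVEN_REVENUE_PER_DAY

# The itemized amounts do not add up to the full $1,085; the remainder is
# not broken down in this write-up.
itemized = sum(UPGRADES.values()) + NEW_STAFF_WAGES_PER_DAY
remainder = DAY4_TOTAL_SPEND - itemized


def runway_days(cash: float, gap_per_day: float) -> float:
    """Days until cash runs out at a constant daily gap (inf if not losing money)."""
    if gap_per_day >= 0:
        return math.inf
    return cash / -gap_per_day


print(f"{capital_share:.0f}% of starting capital")  # 54%
print(f"daily gap: {daily_gap}")                    # -152
print(f"itemized {itemized}, remainder {remainder}")  # itemized 1024, remainder 61
print(f"{runway_days(1_000, daily_gap):.1f} days per $1,000 of cash")  # 6.6
```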

For comparison: Claude Opus 4.6 invested in upgrades on Day 8, after building a $4,500+ cash cushion. Mimo-v2-omni spent $800 on upgrades by Day 5 — and also went bankrupt. Premature scaling kills in this simulation, as in real business.

Performance Comparison: 3 Models, 3 Trajectories

Nemotron-3 (green) has the steepest early decline due to Day 4's $1,085 spending spree. GPT-oss 120B (red, dashed) is a slow bleed, losing ~$100/day for 21 days. Mimo-v2-omni (blue, dotted) follows a similar pattern to Nemotron but survives 3 days longer.

The Leaderboard Neighborhood

All three comparison models sit in the bankruptcy tier, but their failure profiles are distinct:

Metric	Nemotron-3	Mimo-v2	GPT-oss	Grok Fast	Kimi K2.5
Median ROI	−52%	−70%	−95%	−75%	−98%
Median NW	$962	$598	$92	$492	$30
Avg Revenue	$4,170	$4,749	$2,125	$4,404	$6,058
Waste (avg)	$63	$503	$554	$1,122	$527
Servings/Day	55	45	20	76	41
Bankr. Rate	100%	100%	80%	100%	80%
API Cost	$0.00	$0.63	~$0.30	~$2.00	~$1.50
Info Tools	✅ 180 calls	✅ Batched	~ Partial	✅	✅

Nemotron-3 has the best ROI, best net worth, lowest waste, and lowest cost among all five models. The main weakness is revenue: its average is the second lowest of the five, ahead of only GPT-oss, because short survival means fewer earning days.

Agentic Capability Map

Capability	Nemotron-3	GPT-oss 120B	Mimo-v2	Top Models
Tool diversity	✓ 47% (16/34), 180 info	✗ ~24% (8/34)	~ 44% (15/34)	✓ 70–85%
Market intuition	~ Tries new locations daily	✗ Static	~ Industrial addiction	✓ Event exploitation
Resource optimization	✓ $93 waste (3% rev)	✗ $632 waste (28%)	~ $418 waste (8%)	✓ $2–$302
Staff management	✗ 6 hires, 6 fires	~ 1-2 hires, stable	~ 2 hired, fired Day 19	✓ Long-term retention
Investment timing	✗ $800 on Day 4	~ No investment	✗ $800 on Day 2+5	✓ Day 8+ with cushion
Feedback → adaptation	~ Writes notes, partial use	✗ Minimal reflection	~ Writes post-mortems	~ Partial to strong
Creative engagement	✓ 4 recipes, suppliers	✗ None	✓ 1 recipe, suppliers	✓ Full exploration

Nemotron-3's profile is unique: high creativity, low discipline. It engages with more game mechanics than any other bankrupt model (custom recipes, upgrades, supplier negotiations). But it can't sustain a strategy for more than 3 days. The model has agentic breadth without agentic depth.

All 5 Runs

Run	Days	Revenue	Net Worth	Waste	ROI	Result
122449	10	$4,161	$1,482	$53	−26%	💀
122456	18	$6,991	$851	$6	−57%	💀
122500	18	$3,368	−$354	$2	−118%	💀
122503	16	$2,982	$962	$93	−52%	💀
122505	13	$3,348	$764	$163	−62%	💀
Average	15.0	$4,170	$741	$63	−63%	5/5 💀

100% bankruptcy rate — all 5 runs ended in bankruptcy. Despite this, the average waste of $63 is the lowest among all bankrupt models. Two runs had under $7 in waste. The model's ordering isn't wasteful; its spending on everything else is.
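
The summary row can be recomputed from the five per-run rows. This is a minimal sketch, assuming ROI is measured against the $2,000 starting capital, which matches every per-run ROI in the table; run it to check the averages and to see which run sits at the median by days survived.

```python
from statistics import mean, median

STARTING_CAPITAL = 2_000

# (run id, days, revenue, net worth, waste), transcribed from the table above.
RUNS = [
    (122449, 10, 4_161, 1_482, 53),
    (122456, 18, 6_991, 851, 6),
    (122500, 18, 3_368, -354, 2),
    (122503, 16, 2_982, 962, 93),
    (122505, 13, 3_348, 764, 163),
]


def roi_pct(net_worth: float) -> float:
    """Return on the starting capital, in percent."""
    return 100 * (net_worth - STARTING_CAPITAL) / STARTING_CAPITAL


avg_days = mean(run[1] for run in RUNS)
avg_revenue = mean(run[2] for run in RUNS)
avg_net_worth = mean(run[3] for run in RUNS)
avg_waste = mean(run[4] for run in RUNS)
avg_roi = roi_pct(avg_net_worth)

# Per-run ROI recomputed from net worth, for comparison with the ROI column.
per_run_roi = {run[0]: round(roi_pct(run[3])) for run in RUNS}

median_days = median(run[1] for run in RUNS)
median_run_id = next(run[0] for run in RUNS if run[1] == median_days)

print(avg_days, avg_revenue, avg_net_worth, round(avg_waste), round(avg_roi))
print(per_run_roi)
print(median_days, median_run_id)
```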

In Its Own Words

«Profit: $113.94. All demand served. Capacity utilized well.»
— Day 1 reflection

The high-water mark. Every metric positive. The model then spent $1,085 on Day 4 and never recovered.

«Very low revenue ($95) despite high costs ($406). Only mustard_hot_dog sold (19 servings). Capacity 92, used 21%.»
— Day 9 reflection (after entering overdraft)

21% utilization, $311 loss, balance at −$140. The model's response: fire 2 staff, add 2 new recipes. Then hire another cook two days later.

«Revenue $290.50, but food waste $92.51 (31.8% of revenue). 621 customers → 543 left unserved, avg wait 86.2 min!»
— Day 15 (Construction Boom event, 4.5× traffic)

The biggest missed opportunity. 621 customers arrived; 78 served. 86-minute average wait. The only profitable day since Day 2 — and it came one day before loan default.
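
The Day 9 and Day 15 figures quoted above are internally consistent, which a quick check confirms. This sketch uses only the numbers in the model's own reflections; the variable names are mine.

```python
# Day 9 reflection: revenue, costs, servings sold, and capacity.
day9_revenue, day9_costs = 95, 406
day9_servings, day9_capacity = 19, 92

day9_loss = day9_costs - day9_revenue
day9_utilization = 100 * day9_servings / day9_capacity

# Day 15 reflection: arrivals, unserved customers, revenue, and waste.
day15_arrivals, day15_unserved = 621, 543
day15_revenue, day15_waste = 290.50, 92.51

day15_served = day15_arrivals - day15_unserved
day15_served_rate = 100 * day15_served / day15_arrivals
day15_waste_share = 100 * day15_waste / day15_revenue

print(f"Day 9: loss ${day9_loss}, utilization {day9_utilization:.0f}%")
# Day 9: loss $311, utilization 21%
print(f"Day 15: served {day15_served} ({day15_served_rate:.1f}%), "
      f"waste {day15_waste_share:.1f}% of revenue")
# Day 15: served 78 (12.6%), waste 31.8% of revenue
```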

«MAJOR ISSUE: Food waste $92.51 from expired ingredients. 286 customers wanted food, served only 34 due to stockouts of ALL menu items.»
— Day 16 (final day, loan default)

Last words before bankruptcy. In its final morning, the model tried to create 2 new recipes (pan-fried noodles, grilled cheese). Still innovating at the funeral.

Verdict

Nemotron-3 Super 120B is the most engaged model in the bankruptcy tier — and the most chaotic.

It treated the simulation like a startup founder with unlimited caffeine: hired aggressively, invested in upgrades before proving unit economics, created custom recipes, negotiated with suppliers, took loans, changed locations daily. In 16 days it tried more strategies than some models attempt in 30. That breadth of engagement is genuinely impressive for a free-tier MoE model with 12B active parameters.

But agentic performance isn't about trying everything; it's about learning from what you tried. Nemotron-3 made 180 info calls, checking balance, inventory, weather, and competitors every single morning. The data was there, but the model couldn't translate observation into restraint. It saw a −$38 balance and hired a $128/day cook. It checked inventory and ordered ingredients it already had. The agentic pipeline has all the stages (observe, reason, act), but the causal chain between them is broken.
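
To illustrate what an intact link between "observe" and "act" could look like, here is a hypothetical spending guard. Nothing like it exists in the benchmark or the model; the names, the three-day runway threshold, and the zero fixed-cost figure are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    balance: float            # cash on hand, as returned by a balance check
    daily_fixed_costs: float  # existing wages and fixed costs per day


def can_afford_hire(obs: Observation, wage_per_day: float, runway_days: int = 3) -> bool:
    """Approve a hire only if cash covers all daily costs for `runway_days`."""
    required = (obs.daily_fixed_costs + wage_per_day) * runway_days
    return obs.balance >= required


# The situation described above: a -$38 balance and a $128/day cook.
# Fixed costs are set to 0, the most generous assumption, and the guard
# still rejects the hire.
overdrawn = Observation(balance=-38, daily_fixed_costs=0)
print(can_afford_hire(overdrawn, wage_per_day=128))  # False

# A hypothetical healthy balance passes the same check.
healthy = Observation(balance=2_000, daily_fixed_costs=200)
print(can_afford_hire(healthy, wage_per_day=128))    # True
```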

The comparison with GPT-oss 120B is the most revealing finding: within the same 120B class, Nemotron-3 outperforms its peer by 10× on this benchmark. Total parameter count does not determine agentic capability. And the 180 info calls make this a uniquely frustrating failure: this isn't a model operating blind; it's a model with perfect vision that can't stop itself from spending.

Free to run. 16 tools tried. 6 hires, 6 fires. 4 custom recipes and a supplier negotiation. Still bankrupt. The most restless manager on the leaderboard.

Methodology

Benchmark: FoodTruck Bench v1.0
Simulation: Business sim in Austin, TX. Fixed random seed (42), so market conditions are identical across all models. 30-day target
API: OpenRouter (openrouter/nvidia/nemotron-3-super-120b-a12b:free). Free tier, no thinking mode flag
Architecture: Mixture of Experts (MoE). 120B total parameters, 12B active per forward pass. Generates reasoning tokens internally (596K in median run)
Tools: 34 morning tools + 5 reflection tools (OpenAI function-calling schema)
Runs: 5 total, all included. Median run selected by days survived: 122503 (16 days, −52%, $962 NW)
Compared against: GPT-oss 120B (#11, $92 NW), Mimo-v2-omni (#12, $598 NW), Grok 4.1 Fast ($492 NW), Kimi K2.5 ($30 NW)
Cost: $0.00 per run (OpenRouter free tier). 198 LLM calls, 2.2M input tokens, 708K output tokens
Duration: 89 minutes total (5.6 min/day average), limited by free-tier rate limits
All models: Same prompt, same tools, same simulation seed
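
For readers unfamiliar with the OpenAI function-calling schema mentioned under Tools, a single tool definition has the shape below. The tool name, description, and parameter are hypothetical; the benchmark's actual tool definitions are not reproduced in this write-up.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
# "check_weather" and its "day" parameter are illustrative placeholders.
CHECK_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "check_weather",
        "description": "Return the weather forecast for a given simulation day.",
        "parameters": {
            "type": "object",
            "properties": {
                "day": {
                    "type": "integer",
                    "description": "Simulation day, 1 through 30.",
                },
            },
            "required": ["day"],
        },
    },
}

# A request would carry a list of such definitions in its `tools` field.
tools_payload = json.dumps([CHECK_WEATHER_TOOL], indent=2)
print(tools_payload)
```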

Published March 2026 as part of the FoodTruck Bench project.