How does MiniMax M3 perform on long-horizon agentic tasks?

Poorly, despite being marketed for exactly this. On FoodTruck Bench, a 30-day business simulation that requires multi-day planning and feedback integration, MiniMax M3 went bankrupt in 4 of 5 runs. The median run defaulted on Day 25 with $882 net worth and −56% ROI, ranking #18 of 32 models. The model served only 17.8% of its customers, spent 62% of revenue on staff, and three times opened a fully stocked, fully staffed truck that sold nothing because it had run out of the cheap consumables (ketchup, onion, bottled water) that every menu item requires.

Why does MiniMax M3 fail if it diagnoses its own mistakes correctly?

This is the knowing-doing gap. M3 made 579 tool calls and wrote 170 notes to itself, all auto-injected into its morning context. It accurately diagnosed the headline problems — 'capacity is the bottleneck,' 'INDUSTRIAL ZONE IS A TRAP,' 'staff cost too much' — and then took the opposite action, returning to the 'trap' location seven more times. It can observe, plan, and act individually, but cannot let its own analysis change its next decision, and frequently misattributes failures to phantom external causes.

How does MiniMax M3 compare to the models MiniMax benchmarks it against?

MiniMax benchmarks M3 against GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.7, Claude Sonnet 4.6, DeepSeek V4 Pro, Kimi K2.6, and GLM 5.1. On FoodTruck Bench, the nearest tested version of every Western frontier model on that list survived all 30 days and turned a large profit, as did DeepSeek V4 Pro (final net worth on $2,000 starting capital: GPT-5.5 $61,408; Claude Opus 4.6 $49,519; DeepSeek V4 Pro $27,142). M3 went bankrupt alongside the other open-weight models, GLM 5 and Kimi K2.5. Its own benchmark numbers were self-reported and, at launch, not independently verified.

Is MiniMax M3 better than MiniMax M2.5?

Yes, measurably, but it fails the same way. M3 triples M2.5's throughput (56 vs 18 servings/day), crosses from negative to positive net worth, cuts food waste by more than half, and climbs from #24 to #18. But both models die of the identical pathology — an inability to match perishable supply to observed demand and close the observe-adapt loop — so M3 still goes bankrupt in 80% of runs. It is a stronger food truck that still can't avoid going broke.

Case Study · June 2026 · Nicholas S.

MiniMax M3: The Long-Horizon Model That Can't Survive 30 Days

Name: FoodTruck Bench
Author: Nicholas S.

MiniMax shipped M3 on June 1, 2026 and sold it on one word: long-horizon. A model that «runs autonomously for days» and «adjusts its plans in real time.»

So I gave it a food truck. $2,000, 30 days, and one job: don't go broke.
It went bankrupt on Day 25 — after writing down the right diagnosis, morning after morning, and doing the opposite every time.

Net Worth: $882
Days: 25/30
ROI: −56%
Leaderboard: #18/32

The Short Version

Bankrupt in 4 of 5 runs. The median run defaulted on Day 25 at −56% ROI, and M3 lands #18 of 32.
It knew the big problems the whole time. 170 notes naming capacity, payroll, and over-ordering — and not one course-correction. It wrote «INDUSTRIAL ZONE IS A TRAP» and went back seven times.
It couldn't keep the basics in stock. Three times it opened a full, staffed truck and sold nothing to hundreds of customers, because it was out of the cheap condiments every dish needs (ketchup, onion, a $0.25 bottle of water), then spiraled into loan default.
Cheaper models do better. DeepSeek V4 Flash costs a third less than M3 ($0.48 per run) and survives with $5,504 net worth; Gemma 4 31B costs $0.21 per run and finishes at $24,878. M3 costs $0.72 per run and goes broke.

The Setup

FoodTruck Bench hands an AI agent $2,000 and a food truck in Austin, TX. Every day it picks a location, sets a menu and prices, orders perishable ingredients, and manages staff, across 30 days of realistic demand, weather, events, and competition. The truck is scaffolding. The real test is planning ahead, learning from yesterday, and managing cash. That is the “long-horizon agentic” regime every lab, MiniMax included, says its flagship is built for.

I ran M3 five times via OpenRouter, with the same seed, prompt, and tools each time. Four went bankrupt; one limped to Day 30 at −32% ROI. The primary run below is the median of the five by final net worth: 25 days, bankrupt, $882 net worth.

Marketing vs The Truck

MiniMax positions M3 against the Western frontier. Its launch tables line it up next to GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.7, DeepSeek-V4-pro, Kimi-k2.6, and GLM-5.1, and claim it «surpasses GPT-5.5» on coding while it «approaches Opus 4.7.» Those are M3's own launch numbers, run by MiniMax on its own setup, and at the time of writing no third party had verified them. So I took the models it names, found the closest one I've tested, and put them all in front of a food truck.

Final net worth: the models MiniMax benchmarks M3 against

Legend: Survived 30 days · Bankrupt · MiniMax M3

Final net worth after 30 days. Every Western frontier model M3 is benchmarked against cleared $12K and survived. M3 finished at $882 and went bankrupt, down with the other open-weight bankruptcies (GLM 5, Kimi). The $2K line is the starting capital: the survivors multiplied it roughly 6× to 31× (GPT-5.5's $61,408 is about 31×), while M3 lost more than half.

Every Western frontier model on M3's own comparison list survived all 30 days and turned a large profit. MiniMax claims M3 «surpasses GPT-5.5.» I ran GPT-5.5 through the same truck: it finished first on the entire benchmark at $61,408. M3 went bankrupt at $882. The open-weight peer M3 most wants to beat, DeepSeek V4 Pro, also survived and cleared $27K. M3 landed instead with the other bankrupt open-weight models, GLM 5 and Kimi.

GPT-5.5 and Gemini 3.1 Pro are the exact models MiniMax names. For Opus (I have 4.6, not 4.7), Kimi (2.5, not 2.6), and GLM (5, not 5.1) I use the nearest version I've tested. Those gaps are tiny next to “+1,300% ROI” vs “bankrupt.”

The Knowing-Doing Gap

Most models that fail here fail because they can't work the tools. M3 fails despite working them well. It used 88% of the toolbox and wrote 170 notes to itself, every one fed back into the next morning's briefing. So it read its own analysis every day. The notes are sharp and specific, with names and numbers. Then it does the opposite of what they say.

What it wrote in its notes	Day	What it did next
«Capacity is the #1 bottleneck, 71 lost customers»	1	Served 18% of demand for 24 more days
«I CAN DO THIS ALONE» (its solo day was its best)	7	Kept paid staff through every $0 day
«INDUSTRIAL ZONE IS A TRAP, never go back»	11	Went back 7 more times
«Verify the menu has ALL required ingredients before opening»	24	Wrote this only after three $0 days caused by exactly that

The break isn't in reading the briefing or working the tools; it does both fine. It just can't let any of it change the next decision. The clearest symptom is that supply never meets demand: across 22 operating days, 8,179 customers showed up and M3 served 1,452 of them, 17.8%. On Day 10 it served 26 of 267 and sold zero fries and zero lemonade while both sat in the truck, then shrugged it off in its notes as a «mystery, maybe demand overwhelmed?» Some days it under-orders and stocks out; other days it over-orders and lets food rot. It never matches what it buys to what it actually sells.
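For anyone who wants to check the arithmetic, here is a minimal sketch of the serve-rate calculation. The function name is mine, not something from the benchmark; the inputs are the figures quoted in this piece (the Day 5 numbers come from the run narrative further down).

```python
def serve_rate(served: int, demand: int) -> float:
    """Percentage of arriving customers who were actually served."""
    if demand <= 0:
        raise ValueError("demand must be positive")
    return 100.0 * served / demand


# Figures quoted in the article for the median run.
whole_run = serve_rate(1_452, 8_179)  # all 22 operating days
day_10 = serve_rate(26, 267)          # the "mystery" day
day_5 = serve_rate(172, 173)          # the one day supply matched demand

print(f"whole run: {whole_run:.1f}%")
print(f"Day 10:    {day_10:.1f}%")
print(f"Day 5:     {day_5:.1f}%")
```

The gap between the Day 5 rate and the whole-run rate is the supply-demand mismatch in one number.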

The Days It Sold Nothing

Three times, on Days 18, 20, and 24, M3 opened for business fully staffed, with a truck full of food, and served zero customers. Day 18 is the clearest. 787 people wanted lunch, the truck held $670 of inventory (18 kg of beef, 10 kg of chicken, 120 buns, 123 tortillas), the cook was on shift, and it sold nothing. The reason is almost funny: every item on the menu was missing one cheap consumable. The burgers and fries had no ketchup. The tacos had no onion or sour cream. The lemonade had no bottled water. For two weeks M3 had obsessively reordered expensive proteins, and let hundreds of dollars of them spoil, while never once restocking the cheap consumables that every recipe needs to assemble — a few cents of ketchup, onion or sour cream per dish, and a $0.25 bottle of water. So it rolled up to its biggest crowds of the month with a loaded truck it could not turn into a single sale. (Days 20 and 24 are the same story: out of ketchup, onion, and bottled water.)

When cornered, it blamed ghosts. For those $0 days M3 blamed the location («INDUSTRIAL ZONE IS A TRAP»), the cook («skill = 1,» though Kenji's skill was 8), and «failed tool calls» (the logs show every call succeeded); at the very end it blamed «a hard deadlock in the game mechanics.» Not once did it name the actual cause sitting in its own inventory screen: out of ketchup, onion, and bottled water. A model that misreads why it's failing can't fix it, because by its own account there's nothing to fix.
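The check M3 finally wrote down on Day 24 («Verify the menu has ALL required ingredients before opening») is small enough to sketch. What follows is an illustration under stated assumptions, not the benchmark's real data model: the recipe table, the per-serving quantities, and the potato and lemon entries are hypothetical. Only the missing consumables (ketchup, onion, sour cream, bottled water) and the bun and tortilla counts come from the Day 18 description.

```python
from typing import Dict, List

# Hypothetical recipes: units per serving are illustrative, not benchmark data.
RECIPES: Dict[str, Dict[str, int]] = {
    "burger": {"beef": 1, "bun": 1, "ketchup": 1},
    "fries": {"potato": 1, "ketchup": 1},
    "taco": {"chicken": 1, "tortilla": 1, "onion": 1, "sour cream": 1},
    "lemonade": {"lemon": 1, "bottled water": 1},
}


def missing_ingredients(
    menu: List[str],
    recipes: Dict[str, Dict[str, int]],
    inventory: Dict[str, int],
) -> Dict[str, List[str]]:
    """Map each menu item to the ingredients that block even one serving."""
    blocked: Dict[str, List[str]] = {}
    for item in menu:
        short = sorted(
            name
            for name, qty in recipes[item].items()
            if inventory.get(name, 0) < qty
        )
        if short:
            blocked[item] = short
    return blocked


# A Day 18-style truck: plenty of expensive stock, zero cheap consumables.
inventory = {
    "beef": 180, "chicken": 100, "bun": 120, "tortilla": 123,
    "potato": 50, "lemon": 40,
    "ketchup": 0, "onion": 0, "sour cream": 0, "bottled water": 0,
}

blocked = missing_ingredients(list(RECIPES), RECIPES, inventory)
for item, short in blocked.items():
    print(f"{item}: cannot sell, missing {', '.join(short)}")
```

With this inventory every menu item is blocked, which is the Day 18 outcome: a loaded truck and nothing to sell.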

The Rise and Fall

MiniMax M3 vs M2.5 vs GLM 5 — Net Worth

Legend: MiniMax M3 (25d 💀) · MiniMax M2.5 (22d 💀) · GLM 5 (28d 💀)

Three median runs. M3 (orange) peaks at $3,428 on Day 5, then treads water near $2,500 for two weeks on a handful of good days, before falling off a cliff through the Day 18–24 empty-truck stockouts (a full, staffed truck it couldn't sell from because it kept running out of cheap condiments) and going bankrupt on Day 25. Its predecessor M2.5 (purple) peaks lower and dies on Day 22. GLM 5 (blue), M3's leaderboard neighbor, runs the same shape to Day 28.

For the first five days, M3 looks like a winner. It runs the truck solo, keeps the menu tight, turns a profit four days out of five, and peaks at $3,428 in net worth on Day 5 after a $1,266 Cinco de Mayo blowout. That money is real, not luck: on Day 5 it stocked enough to serve 172 of 173 customers, the one day all month it matched supply to demand. Lean and stocked to the crowd, M3 makes money easily.

Then it spends two weeks looking like it might make it, which is the genuinely confusing part if you watch it live. Net worth drifts in the $2,400–2,800 range from Day 6 to Day 16, with green days scattered through it (Day 9 +$217, Day 13 +$374, Day 15 +$285, a second little peak). But it's quietly losing on average. By Day 5 it had already hired three staff and bought two upgrades, locking in about $344 a day in fixed cost before it had one repeatable profitable day. Staff alone ends up eating 62% of revenue. So the good days only tread water, and the bad ones, where it serves 10–30% of the crowd because it keeps stocking out, pull it slowly down. The $2,000 starting cushion hides the leak for two weeks.

The real dive runs from Day 17 to Day 25. On three of those days (18, 20, 24) the truck serves zero customers while fully stocked and staffed, because it has run out of the cheap condiments every dish needs (ketchup, onion, bottled water) even as the expensive protein spoils. Cash goes negative on Day 19, M3 borrows twice betting on one big rally, the rally nets +$89 against a wall of spoiled food, and on Day 25 it can't make a loan payment and defaults. So the cause of death isn't a single bad day. It's a business that was never profitable enough to carry the costs it took on in week one, treading water on its cushion until it stopped being able to sell anything at all, then drowning in the debt it borrowed to stay afloat.

The $2,000 death clock. The starting cushion is what makes this hard to see coming. A model that planned ahead would track every ingredient its menu needs and watch its own burn rate (it writes the numbers down every night). M3 did neither: it kept reordering proteins it then let spoil, let the cheap consumables hit zero, and paid a full crew through all of it. It can do the math. It just can't let the math change the plan.
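The burn-rate side of that can be sketched in a few lines. This is a rough illustration, not how the benchmark scores anything: the $2,000 cushion and the roughly $344 a day in fixed cost are from the run, but the daily contribution figure (revenue minus ingredient cost) is a number I made up to show the shape of the calculation.

```python
import math
from typing import Optional


def runway_days(
    cash: float, daily_contribution: float, daily_fixed_cost: float
) -> Optional[int]:
    """Full days the cash lasts at a constant daily loss.

    Returns None when the business is not losing money at this rate.
    """
    daily_net = daily_contribution - daily_fixed_cost
    if daily_net >= 0:
        return None
    return max(0, math.floor(cash / -daily_net))


STARTING_CASH = 2_000       # from the article
FIXED_COST_PER_DAY = 344    # approximate figure quoted in the article
ASSUMED_CONTRIBUTION = 250  # hypothetical: daily revenue minus ingredient cost

days_left = runway_days(STARTING_CASH, ASSUMED_CONTRIBUTION, FIXED_COST_PER_DAY)
print(f"runway at this burn rate: {days_left} days")
```

The point of the sketch is the comparison, not the exact output: any average contribution below the fixed cost puts a countdown on the cushion, and any agent that logs its own numbers every night has what it needs to compute it.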

M2.5 vs M3: Real Progress, Same Disease

Is M3 better than the last MiniMax I tested? Yes, clearly. It triples M2.5's throughput, scales revenue several times over, halves the waste, and climbs six leaderboard places, from a dead business to a live one that nearly makes it.

Metric	MiniMax M2.5	MiniMax M3
Leaderboard rank	#24	#18
Median net worth	−$317	$882
Bankruptcy rate	100% (3/3)	80% (4/5)
Days survived (median)	21	25
Servings / day	18	56
Food waste (avg)	$1,017	$440

But the disease is identical. M2.5 lost because it couldn't close the loop between watching and acting; it literally forgot to set a menu on some mornings and sold nothing with a full fridge. M3 does the same thing with far more fluent self-talk. It diagnoses the broken loop in clean prose and still can't close it. The scale grew and the vocabulary improved. The behavior didn't.

Where It Lands

Four of M3's five runs went bankrupt, ranging from +$1,639 down to −$1 net worth. The lone survivor didn't play smarter, it just braked harder: it fired all its staff in the back half, took 12 days off, and never overborrowed. Survival was cash discipline, not strategy, and even that run finished at −32% ROI. That puts M3 at #18 of 32, in the bankrupt tier, just below GLM 5 and miles below the frontier it markets against.

All Five Runs

People always ask for the full spread rather than just the median, so here are all five, ranked by final net worth. The top line is the giveaway: M3's best run by net worth still went bankrupt, defaulting on a loan on Day 29 with $1,639 on the books. Only the second-best run lasted the month, and it did it by braking hardest, not by running the truck better.

Run	Net worth	ROI	Days	Servings	Outcome
1	$1,639	−18%	29	2,517	💀 bankrupt (loan default)
2	$1,361	−32%	30	1,584	✅ survived
3	$882	−56%	25	1,452	💀 bankrupt · median run
4	$2	−100%	22	807	💀 bankrupt
5	−$1	−100%	22	795	💀 bankrupt
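The ROI column, the median pick, and the bankruptcy rate can all be reproduced from the table above. A minimal sketch, assuming ROI is simply final net worth against the $2,000 starting capital (the published numbers are consistent with that, but the formula is my reading, and the variable names are mine):

```python
from statistics import median

STARTING_CAPITAL = 2_000

# (run, final net worth in USD, went bankrupt), copied from the table above.
RUNS = [
    (1, 1_639, True),
    (2, 1_361, False),
    (3, 882, True),
    (4, 2, True),
    (5, -1, True),
]


def roi_percent(net_worth: float, starting_capital: float = STARTING_CAPITAL) -> float:
    """Return on the starting capital, as a percentage."""
    return 100.0 * (net_worth - starting_capital) / starting_capital


median_net_worth = median(nw for _, nw, _ in RUNS)
median_run = next(run for run, nw, _ in RUNS if nw == median_net_worth)
bankruptcy_rate = 100.0 * sum(bankrupt for _, _, bankrupt in RUNS) / len(RUNS)

for run, nw, bankrupt in RUNS:
    status = "bankrupt" if bankrupt else "survived"
    print(f"run {run}: ${nw:,} net worth, ROI {roi_percent(nw):+.0f}%, {status}")
print(f"median run: {median_run} (${median_net_worth:,})")
print(f"bankruptcy rate: {bankruptcy_rate:.0f}%")
```

The same `roi_percent` applied to the leaderboard net worths gives the ROI figures listed there.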

The median (Run 3, $882, bankrupt on Day 25) is the one this piece follows. Across all five, M3 served a median of 1,452 customers per run, borrowed in every run, and defaulted in four of them. Where the median lands against the rest of the board:

#	Model	Net Worth	ROI	Bankruptcy
1	GPT-5.5	$61,408	+2,970%	0%
2	Claude Opus 4.6	$49,519	+2,376%	0%
3	GPT-5.2	$28,081	+1,304%	0%
5	DeepSeek V4 Pro	$27,142	+1,257%	0%
the survival cliff: everything below went bankrupt
17	GLM 5	−$210	−111%	71%
18	MiniMax M3	$882	−56%	80%
19	Qwen 3.5 397B	−$218	−111%	71%
24	MiniMax M2.5	−$317	−116%	100%

Cost vs. Effectiveness

There's an honest case to make for M3 that the scorecard hides: it's cheap. Running the full 30-day benchmark costs about $0.72 in API spend, a fraction of what the frontier costs (GPT-5.5 ran $24.63 a go, Opus 4.6 $36.04). Cost-efficiency is a real axis of model quality — a long-horizon agent you can run for pennies is genuinely useful, and it's where open-weight models like M3 are supposed to win. So the fair question isn't «is M3 cheaper than GPT-5.5» (it is, and so is everything in its class). It's whether that low price buys you anything once you put it next to the open models in its own size and price bracket.

It doesn't. The table below is sorted by what each model costs to run, and M3 is beaten from both directions.

Model	Cost / run	Net worth	ROI	30 days?
Gemma 4 31B	$0.21	$24,878	+1,144%	✅ survived
MiniMax M2.5	$0.42	−$317	−116%	💀 bankrupt
DeepSeek V4 Flash	$0.48	$5,504	+175%	✅ survived
MiniMax M3	$0.72	$882	−56%	💀 bankrupt
Kimi K2.5	$0.72	$30	−98%	💀 bankrupt
GLM 5	$1.64	−$210	−111%	💀 bankrupt
DeepSeek V4 Pro	$3.51	$27,142	+1,257%	✅ survived

Two models in M3's class cost less and still ran a profitable truck for 30 days. DeepSeek V4 Flash (DeepSeek's small, fast model, about the closest thing to a like-for-like peer) runs at $0.48, a third cheaper than M3, and turns the $2,000 into $5,504 (+175%) without a single bankruptcy across five runs. Gemma 4 31B is cheaper still at $0.21, roughly 3.4× cheaper than M3, and finishes at $24,878. Step up one tier and DeepSeek V4 Pro, the open-weight model M3 most wants to beat, costs $3.51 and finishes at $27,142.
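The cost comparisons here are plain ratios over the table above, and a short sketch makes them checkable. The rows are copied from the table; the helper names are mine.

```python
# (model, cost per run in USD, final net worth in USD, survived 30 days)
ROWS = [
    ("Gemma 4 31B", 0.21, 24_878, True),
    ("MiniMax M2.5", 0.42, -317, False),
    ("DeepSeek V4 Flash", 0.48, 5_504, True),
    ("MiniMax M3", 0.72, 882, False),
    ("Kimi K2.5", 0.72, 30, False),
    ("GLM 5", 1.64, -210, False),
    ("DeepSeek V4 Pro", 3.51, 27_142, True),
]

m3_cost = next(cost for name, cost, _, _ in ROWS if name == "MiniMax M3")


def discount_vs_m3(cost: float) -> float:
    """How much cheaper a run is than M3's, as a percentage of M3's cost."""
    return 100.0 * (m3_cost - cost) / m3_cost


# Models that cost less than M3 per run and still lasted the month.
cheaper_survivors = [
    name for name, cost, _, survived in ROWS if cost < m3_cost and survived
]

print("cheaper than M3 and survived:", ", ".join(cheaper_survivors))
for name, cost, _, _ in ROWS:
    if name in cheaper_survivors:
        print(f"{name}: {discount_vs_m3(cost):.0f}% cheaper, "
              f"{m3_cost / cost:.1f}x lower cost per run")
```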

So M3's low price buys nothing here. Everything cheaper than it either makes money (Gemma, DeepSeek V4 Flash) or is its own dead predecessor (M2.5). The only model that costs about the same and also goes broke is Kimi K2.5 — $0.72 a run, $30 net worth, 80% bankruptcy — the other open model whose marketing outruns its truck. Across the whole leaderboard, run cost simply doesn't predict survival: the cheapest model on it (Gemma) and the most expensive (Opus, GPT-5.5) all survive and profit, while M3 sits in the cheap-and-bankrupt corner with Kimi and GLM. What separates the survivors from the failures isn't the price of the tokens. It's whether the model acts on what it already knows — and you can't discount your way out of not doing that.

Cost per run is the OpenRouter bill for each model's median run by net worth (input, output, reasoning, and cached tokens at the prices in effect when it was tested). Several open-weight models here — MiniMax, DeepSeek, Kimi, GLM — were partly on OpenRouter promotional pricing at test time, so their real long-run cost may be higher once those promos end; it wouldn't change the order.

In Its Own Words

«INDUSTRIAL ZONE IS A TRAP, never go back without a specific high-margin event.»
— MiniMax M3, Day 11

It went back seven more times, including the three worst days of the run.

«116 customers with 0 served = I had a CONFIGURATION error (menu, missing ingredients, or operational issue).»
— MiniMax M3, Day 24

It was out of ketchup and bottled water. Every item on the menu needed one or the other, while $450 of beef, buns, and tortillas sat unused. It never worked that out.

«Can't operate, can't take more loans, can't repay. This is a hard deadlock in the game mechanics.»
— MiniMax M3, Day 25 (bankruptcy)

The “deadlock” was built entirely from its own choices: borrow twice against a truck it couldn't stock, then default. Its last act was to blame the rules.

Verdict

MiniMax M3 diagnoses its business perfectly and runs it into the ground anyway. It named the big problems in advance (capacity, staff cost, over-ordering) and acted against its own analysis every time. It scaled fixed costs before it had a profitable day, served only 18% of its customers, and three times opened a full truck it couldn't sell a thing from because it had run out of ketchup and water. When the numbers cornered it, it blamed the location, the cook, the tools, and finally the rules, never the empty ketchup bottle.

Against its predecessor, that's real progress. Against the frontier it's benchmarked against, it isn't close: every Western model on its own comparison list survived 30 days and made money. M3's own launch scores were self-reported and, at the time, not independently verified. This is one independent long-horizon test, and on it the “long-horizon” model couldn't last a month.

If this were a real business, M3 would be the owner who keeps a flawless journal of everything going wrong and files for bankruptcy still convinced the landlord rigged the lease.

One person, one benchmark. If the MiniMax team wants the full run logs, I'd welcome it: [email protected].

Methodology

Benchmark: FoodTruck Bench v1.0, Austin TX, fixed seed (42), 30 days, identical conditions for every model
Model: openrouter/minimax/minimax-m3 via OpenRouter, ~$0.72 per run
Runs: 5, under identical conditions. The article follows the median run (234214, 25 days); aggregates use all 5
Compared against: MiniMax M2.5 (predecessor), GLM 5 (leaderboard neighbor), and the frontier models MiniMax officially benchmarks M3 against
Marketing claims: from MiniMax's official M3 launch blog (June 1, 2026). No independent benchmark scores existed for M3 at the time of writing

Published June 2026 as part of the FoodTruck Bench project.