Case Study · February 2026 · Nicholas S.

Gemini 3.1 Pro: Two Endpoints, Neither Works

Google shipped Gemini 3.1 Pro with two API endpoints. The standard one can't follow tool-calling instructions. The “Custom Tools” variant works but behaves like a tool-calling automaton — generating just 780 output tokens per day (vs 22K for Sonnet 4.6), writing minimal diary entries, and delivering 26% less business value than its predecessor.
It costs $4.59 per run. Gemini 3 Pro costs $4.38 and earns $4,463 more.

Key Findings

Based on 5 completed simulations under identical conditions. All figures below are from the median run unless noted otherwise.

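| Metric | Gemini 3.1 Pro CT | Gemini 3 Pro | Better |
| --- | --- | --- | --- |
| Net Worth | $12,736 | $17,199 | 3 Pro |
| ROI | +537% | +760% | 3 Pro |
| Revenue | $45,744 | $41,652 | 3.1 Pro CT |
| Servings | 7,345 | 5,439 | 3.1 Pro CT |
| Net Margin | 27.7% | 41.2% | 3 Pro |
| Food Waste | $1,168 | $1,192 | ≈ Tie |
| API Cost | $4.59 | $4.38 | 3 Pro |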

The Setup

FoodTruck Bench is an agentic benchmark that measures how well language models handle complex, multi-step business decisions. An AI agent manages a food truck in Austin, TX for 30 days — choosing locations, pricing menus, managing inventory, hiring staff — using 34 tools. The benchmark evaluates agentic capabilities that standard coding and reasoning benchmarks miss: demand forecasting, resource optimization, feedback adaptation, and long-horizon planning.

Five runs were completed using the direct Google Gemini API with thinking mode enabled. All runs used the same seed, same prompt, same tools. Following standard methodology, the median run (sorted by net worth) serves as the reference. For comparison, the article uses the median run of Gemini 3 Pro (the immediate predecessor, +760% ROI) and Sonnet 4.6 (the closest competitor at a similar performance level, +771% ROI but at 5× the cost).
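The reference-run rule described above (sort by net worth, take the middle run) can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's own code; the run IDs and net worths are copied from the "All 5 Runs" table later in this article, and the function name `pick_median_run` is made up for the example.

```python
# Net worth per run, copied from the "All 5 Runs" table in this article.
runs = {
    "043542": 2622,
    "005806": 9816,
    "103424": 12736,
    "025904": 15174,
    "014717": 22860,
}

def pick_median_run(net_worth_by_run):
    """Return the (run_id, net_worth) pair in the middle of the net-worth ordering.

    With an odd number of runs this is the exact median; with an even
    number it would pick the upper of the two middle runs.
    """
    ordered = sorted(net_worth_by_run.items(), key=lambda kv: kv[1])
    return ordered[len(ordered) // 2]

reference_run = pick_median_run(runs)
print(reference_run)  # ('103424', 12736)
```

Picking an actual run, rather than averaging, means every figure quoted for the model comes from one coherent 30-day trajectory.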

The Two Endpoints Problem

Google released Gemini 3.1 Pro with two separate API endpoints: a standard endpoint and a “Custom Tools” variant. This is unusual: model providers do not normally ship two tool-calling modes for the same model.

| Endpoint | Tool Calling | Agentic Performance | Result |
| --- | --- | --- | --- |
| Standard (gemini-3.1-pro-preview) | Broken | Cannot complete simulation | 3/3 runs failed (not bankrupt, simply unusable) |
| Custom Tools (gemini-3.1-pro-preview-customtools) | Works | Functional but regressed | 5/5 completed, median $12,736 |
| Gemini 3 Pro (predecessor) | Works | Strong | 5/5 completed, median $17,199 |

The standard endpoint doesn't just underperform — it cannot follow the simulation's tool-calling schema. It ignores required parameter formats, generates calls to tools that don't exist, and fails to complete even Day 1 in most attempts. The net worth of the three standard-endpoint runs: $682, $345, and $914 — all below the starting balance of $2,000.

For developers: If you're building agentic applications with Gemini 3.1 Pro, you must use the Custom Tools endpoint. The standard endpoint is not suitable for structured tool calling. And even with Custom Tools, expect a regression from Gemini 3 Pro.
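Whichever endpoint you use, it can help to check each tool call before executing it, since the failures described above (calls to nonexistent tools, missing or malformed parameters) are easy to detect mechanically. The sketch below is a generic guard, not the FoodTruck Bench harness. The tool names are taken from the diaries quoted later in this article, but the required-parameter lists, the `days` parameter, the call format, and `get_weather_forecast` are all invented for illustration.

```python
# Hypothetical registry: tool name -> list of required parameter names.
# The names appear in this article's diaries; the parameters are assumptions.
TOOL_SCHEMAS = {
    "get_inventory": [],
    "get_staff_info": [],
    "get_available_upgrades": [],
    "get_sales_history": ["days"],
}

def validate_tool_call(call, schemas=TOOL_SCHEMAS):
    """Return a list of problems with one tool call; an empty list means it passed."""
    name = call.get("name")
    args = call.get("args")
    if name not in schemas:
        return [f"unknown tool: {name!r}"]
    if not isinstance(args, dict):
        return ["args must be an object"]
    # Report every required parameter the model left out.
    return [
        f"missing required parameter: {param!r}"
        for param in schemas[name]
        if param not in args
    ]

good = {"name": "get_sales_history", "args": {"days": 7}}
hallucinated = {"name": "get_weather_forecast", "args": {}}  # not in the registry
incomplete = {"name": "get_sales_history", "args": {}}

for call in (good, hallucinated, incomplete):
    print(call["name"], validate_tool_call(call))
```

Counting rejected calls per run gives a quick, model-agnostic signal of whether an endpoint can follow a tool schema at all.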

The Waste Problem: 30 Days of Ignored Self-Advice

$1,168 in food waste over 30 days. This isn't the worst waste number on the benchmark — Gemini 3 Pro wasted $1,192. But the pattern is what makes 3.1 Pro CT remarkable.

Every single daily reflection — from Day 5 to Day 30 — contains some variation of: “HUGE FOOD WASTE. MUST ORDER SMALLER QUANTITIES.” The model writes this in ALL CAPS. It calculates the dollar amount. It creates plans. And the next morning, it orders the same quantities of the same perishable ingredients.

| Day | Waste | Agent's Note | Next Day Action |
| --- | --- | --- | --- |
| 5 | $162 | “Huge food waste! Order more carefully” | Over-orders again |
| 8 | $386 | “Massive food waste! Stop over-ordering!” | Over-orders again |
| 12 | $500 | “MASSIVE WASTE. Check what expired” | Over-orders again |
| 15 | $555 | “Huge waste! Order 1–2 days max” | Over-orders again |
| 24 | $662 | “Disastrous day. MUST check shelf life” | Over-orders again |
| 28 | $982 | “Waste unacceptable! Smaller quantities!” | Still over-orders |

This is the textbook observe→learn→adapt failure. The model passes the first two stages with flying colors — it observes waste, it correctly identifies the cause. But the learning never converts to changed behavior. Compare with Gemini 3 Pro, which also had high waste ($1,192) but for a different reason: it aggressively stockpiled for events, accepting waste as a trade-off for capturing high-demand days. That's a strategy. Gemini 3.1 Pro CT's waste is just a bug it can't fix.

Agentic Capability Comparison

The benchmark measures eight agentic capabilities. Gemini 3.1 Pro CT matches its predecessor on basic competencies but falls behind on the capabilities that drive business outcomes:

| Agentic Capability | Gemini 3.1 Pro CT | Gemini 3 Pro | Sonnet 4.6 |
| --- | --- | --- | --- |
| Market intuition | ~ Low-premium pricing | ✓ Premium pricing | ✓ Premium pricing |
| Long-term investment | ✓ All 8 upgrades | ✓ All 8 upgrades | ✓ All 8 upgrades |
| Cost management | ✗ 28% staff-to-revenue | ~ 27% staff-to-revenue | ✓ 29% (profitable hiring) |
| Resource optimization | ✗ $1,168 waste (ignored) | ~ $1,192 waste (strategic) | ✓ $276 waste |
| Demand forecasting | ~ Chronic over-ordering | ~ Under-ordering events | ~ Chronic under-ordering |
| Multi-step planning | ~ Uses events, misses prep | ✓ Good event exploitation | ~ Sees events, under-prepares |
| Output efficiency | ✓ Concise (~780 out/day) | ✓ Concise (7K out/day) | ✗ Verbose (22K/day) |
| Feedback → adaptation | ✗ Writes advice, ignores it | ~ Partial adaptation | ~ Partial, delayed |

Gemini 3.1 Pro CT wins on token efficiency (just 23K output tokens total vs 211K for Gemini 3 Pro and 685K for Sonnet 4.6). But concise output doesn't help when the model ignores its own analysis. Gemini 3 Pro generates 9× more output tokens but converts them into better decisions.

Performance Comparison: 30-Day Trajectory

Gemini 3 Pro (dark blue) outpaces 3.1 Pro CT (light blue) from Day 17 onward, finishing 35% higher. Sonnet 4.6 (orange) tracks a similar trajectory to both Gemini models. Note the Day 24 crash in the 3.1 Pro CT line: a procurement disaster that erased $2,200 of net worth in a single day.

The Cost-Performance Question

All three models shown reach the same tier of performance — $12K–$17K net worth after 30 days. The differentiator is cost:

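| Metric | Gemini 3.1 Pro CT | Gemini 3 Pro | Sonnet 4.6 |
| --- | --- | --- | --- |
| API Cost / Run | $4.59 | $4.38 | $22.99 |
| Net Worth | $12,736 | $17,199 | $17,426 |
| Cost per $1K Performance | $0.36 | $0.25 | $1.32 |
| API Calls | 185 | 128 | 252 |
| Output Tokens | 23K | 211K | 685K |
| Input Tokens | 2.0M | 1.2M | 2.1M |
| Reasoning Tokens | 182K | 110K | 244K |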

Gemini 3 Pro remains the best value in the benchmark. It delivers Sonnet 4.6-level agentic performance at $0.25 per $1K of business value — 5× cheaper than Anthropic. The 3.1 Pro Custom Tools variant costs 5% more per run but delivers 26% less business value.

The recommendation is clear: If you need Gemini-class agentic performance, continue using Gemini 3 Pro. There is no reason to upgrade to 3.1 Pro for tool-calling workloads — it costs more and delivers less.
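The cost-per-$1K figures can be reproduced from the two inputs in the table, API cost per run and final net worth. The sketch below assumes the metric is simply cost divided by net worth in thousands; that formula is inferred because it matches the published numbers, not taken from the benchmark's code.

```python
# API cost per run and median-run net worth, copied from the table above.
models = {
    "Gemini 3.1 Pro CT": {"api_cost": 4.59, "net_worth": 12736},
    "Gemini 3 Pro": {"api_cost": 4.38, "net_worth": 17199},
    "Sonnet 4.6": {"api_cost": 22.99, "net_worth": 17426},
}

def cost_per_1k(api_cost, net_worth):
    """Dollars of API spend per $1,000 of final net worth (assumed definition)."""
    return api_cost / (net_worth / 1000)

for name, m in models.items():
    print(f"{name}: ${cost_per_1k(m['api_cost'], m['net_worth']):.2f} per $1K")

# Relative comparison of 3.1 Pro CT against its predecessor.
new, old = models["Gemini 3.1 Pro CT"], models["Gemini 3 Pro"]
extra_cost_pct = (new["api_cost"] / old["api_cost"] - 1) * 100
value_deficit_pct = (1 - new["net_worth"] / old["net_worth"]) * 100
print(f"costs {extra_cost_pct:.0f}% more, delivers {value_deficit_pct:.0f}% less")
```

Run as written, this yields $0.36, $0.25 and $1.32 per $1K, along with the 5% cost increase and 26% value deficit quoted in the text.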

The Token Efficiency Paradox

Output Tokens per Day

Sonnet 4.6 (orange) generates 20–35K output tokens per day, mostly long analytical essays. Gemini 3 Pro (dark blue) averages ~7K. Gemini 3.1 Pro CT (light blue) is nearly invisible at ~780/day, the most concise model on the benchmark.

Token Efficiency: Output per Run

G3.1 CT: 23K output tokens · NW $12,736 · 185 API calls
G3 Pro: 211K output tokens · NW $17,199 · 128 API calls
Sonnet 4.6: 685K output tokens · NW $17,426 · 252 API calls

Three models with vastly different output volumes, yet the two most successful ones — Gemini 3 Pro ($17,199 NW) and Sonnet 4.6 ($17,426 NW) — both generate substantially more text than Gemini 3.1 Pro CT ($12,736 NW). The Custom Tools endpoint produces just 23K output tokens over 30 days, while Gemini 3 Pro generates 211K and Sonnet 4.6 generates 685K. The model that talks least performs worst.

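| Metric | G3.1 Pro CT | G3 Pro | Sonnet 4.6 |
| --- | --- | --- | --- |
| Output Tokens (total) | 23K | 211K | 685K |
| Output Tokens / Day | ~780 | ~7,000 | ~22,100 |
| Reasoning Tokens | 182K | 110K | 244K |
| API Calls | 185 | 128 | 252 |
| Net Worth | $12,736 | $17,199 | $17,426 |
| NW per 1K Output Tokens | $554 | $81 | $25 |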

With only ~780 output tokens per day, Gemini 3.1 Pro CT generates barely more than the raw JSON of its tool calls. There's no room for reasoning between actions — no “should I choose waterfront or downtown?”, no “last time I over-ordered chicken, try less today.” The model acts without thinking. Meanwhile, Gemini 3 Pro at 7K/day includes structured analysis that drives better decisions — and achieves 35% higher net worth with 9× more output.
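The ratios in this section follow from the rounded totals in the table above. The sketch below recomputes net worth per 1K output tokens and the two headline ratios. Because it uses the rounded token totals (23K, 211K, 685K), small differences from the published figures are expected; Gemini 3 Pro comes out near $82 here rather than $81, presumably because the article used unrounded token counts.

```python
# (output tokens in thousands, net worth in dollars), copied from the table above.
totals = {
    "G3.1 Pro CT": (23, 12736),
    "G3 Pro": (211, 17199),
    "Sonnet 4.6": (685, 17426),
}

def nw_per_1k_output(output_k, net_worth):
    """Net worth earned per 1,000 output tokens."""
    return net_worth / output_k

for name, (output_k, net_worth) in totals.items():
    print(f"{name}: ${nw_per_1k_output(output_k, net_worth):.0f} per 1K output tokens")

# Headline ratios: Gemini 3 Pro relative to Gemini 3.1 Pro CT.
ct_tokens, ct_nw = totals["G3.1 Pro CT"]
g3_tokens, g3_nw = totals["G3 Pro"]
output_ratio = g3_tokens / ct_tokens   # about 9x more output
nw_gain_pct = (g3_nw / ct_nw - 1) * 100  # about 35% higher net worth
print(f"{output_ratio:.1f}x output, {nw_gain_pct:.0f}% higher net worth")
```

By this measure 3.1 Pro CT looks far more "efficient" per token, which is exactly why the article treats output per token as a misleading metric on its own.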

Agentic insight: Concise output is not efficient output. When a model generates too few tokens, it lacks the reasoning space to analyze feedback and adapt. The Custom Tools endpoint's extreme brevity isn't a feature — it's a constraint that prevents the model from thinking through complex decisions.

The Diary Correlation: Tool-Calling Automaton

Daily Diary Size (Scratchpad Characters)

Scratchpad diary entries written during the reflection phase. Sonnet 4.6 (orange) writes 3-4K char analytical diaries. Gemini 3 Pro (dark blue) writes ~750 chars. Gemini 3.1 Pro CT (light blue) writes only ~550 chars — barely enough for a few bullet points.

The scratchpad diary — where the agent reflects on each day's results — reveals why the Custom Tools endpoint underperforms. Gemini 3.1 Pro CT writes 548 characters per day on average: a few lines listing revenue, waste, and a vague plan. Gemini 3 Pro writes 757 chars — modestly more, but with structured strategy notes that actually influence next-day decisions. Sonnet 4.6 writes 3,463 chars — full analytical essays with trend analysis and prioritized action items.

| Metric | G3.1 Pro CT | G3 Pro | Sonnet 4.6 |
| --- | --- | --- | --- |
| Avg Diary / Day | 548 chars | 757 chars | 3,463 chars |
| Total Diary (30 days) | 16.4K | 23.5K | 107K |
| KV Store (memory) | 10.4K | 9.8K | 211K |
| Output Tokens / Day | ~780 | ~7,000 | ~22,100 |
| Net Worth | $12,736 | $17,199 | $17,426 |

Across these three models the pattern is consistent: the ones that write more detailed reflections make better decisions. The Custom Tools endpoint is optimized for tool calling, not for thinking. It generates bare JSON function calls with almost no reasoning text between them. The result is a model that acts without analyzing, the opposite of what agentic workloads require.

Day 13: Three Models, Three Diaries

All three models faced the same situation on Day 13: high demand, stockouts, and significant food waste. Here's what each wrote in its evening reflection diary.

Gemini 3.1 Pro CT (448 chars)
Day 13 Reflection:
- Location: Waterfront Park
- Profit: $435.38 (Rev $1579, Costs $1143)
- Demand: 1168, Served: 221. Huge queue (947 left, 44m wait).
- Stockouts: hot_dog, chicken_tacos, french_fries.
- Food Waste: $530.50! (Need to be very careful with ingredient shelf life and quantities).
- Plan: Check events, buy kitchen upgrade or hire staff for capacity. Adjust ordering to prevent massive waste while avoiding stockouts on popular items.
Gemini 3 Pro (1,147 chars)
Day 13 Reflection:
- **Profit**: $1,556.91 (Great!)
- **Revenue**: $2,507.00
- **WASTE**: $891.62 (DISASTER. Almost $1k burned. STOP BULK ORDERING PERISHABLES).
- **Demand**: 826 (Huge). Served only 312. Missed 514 customers.
- **Bottleneck**: Capacity (98% used) and Speed (23.8 min wait).
- **Stockouts**: Street Tacos, Lemonade.

Key Lessons:
1. I am leaving massive money on the table due to slow service/cooking speed. 500+ people walked away.
2. Inventory management was awful. $900 waste means I bought way too much of something that didn't sell or expired before I could use it.
3. Waterfront Park is high traffic.

Plan for Day 14:
- PRIORITY 1: Increase Capacity. Check get_available_upgrades for Kitchen Tier 1 or Hire high-speed staff.
- PRIORITY 2: Tighten ordering. Check get_inventory carefully. Only order for 2 days max.
- PRIORITY 3: Maintain menu. Tacos/Bowls/Quesadillas are popular.

Current State:
- Need to check cash balance immediately tomorrow to see if I can afford an upgrade.
- Need to check get_staff_info to see if my current team is slow.
- Check get_sales_history to see dish popularity trends.
Sonnet 4.6 (3,953 chars)
# Food Truck Operations Log

## DAY 13 POST-MORTEM — DISASTER
- Revenue $360 | Costs $701.66 | Profit -$341.66
- 1096 customers showed up. Served only 36. CATASTROPHIC waste of demand.
- Stockouts: street_tacos (after 36), then chicken_tacos, burrito_bowl, quesadilla, burger, fries
- Food waste: $146.23 (16kg chicken expired + other items)
- Root cause: Ran out of beef after 36 street_tacos. Chicken had 16kg but binding ingredients ran out.
- Capacity was 356 — could have served 356 @ $9+ avg = $3,200+ revenue. MISSED.

## WHAT WENT WRONG — ROOT CAUSE ANALYSIS
1. Beef only 5.5kg → 36 street_taco servings → stockout
2. 16kg chicken sat unused → EXPIRED ($146+ waste)
3. Wrong ingredient ratios — too much protein, not enough supporting ingredients
4. $701.66 costs with only $360 revenue = -$341.66. Staff wages ~$488 killing us.

## BALANCE ESTIMATE Day 14: ~$65-171 (CRITICAL — near bankruptcy)
- MUST check actual balance immediately
- If below $100: take Tier 1 loan ($1000, 15%, 7 days)
- CANNOT go below -$200 for 3 consecutive days

## DAY 14 PLAN
### Inventory arriving:
- bottled_water 100pcs | beef 4kg | salsa 3kg | tortillas 100pcs | lemonade_mix 2kg

### Menu for Day 14:
- street_tacos $10 (beef)
- chicken_tacos $10 (chicken — MUST SELL TODAY)
- burrito_bowl $11 (chicken — MUST SELL TODAY)
- lemonade $4 | soda $3

### Ingredient order for Day 15:
- Ground beef: 8kg (~100 street_tacos)
- Chicken: 8kg | Tortillas: 200pcs
- Salsa: 4kg | Cheese: 3kg | Water: 100pcs

## INGREDIENT ORDERING RULES (REVISED):
1. Order EVERY ingredient needed for EACH recipe
2. Order proportionally: 100 servings = 100x each ingredient
3. Binding ingredients are as critical as protein
4. 150-200 tortillas DAILY
5. NEVER order more than 3-4 days worth of perishables

## RECIPE REQUIREMENTS:
- street_tacos: beef + tortillas + salsa + cabbage
- chicken_tacos: chicken + tortillas + salsa + cheese
- burrito_bowl: chicken + rice + salsa + cheese + sour_cream

The difference speaks for itself. Gemini 3.1 Pro CT writes a generic checklist. Gemini 3 Pro adds structured priorities and action items. Sonnet 4.6 performs root cause analysis, models its cash position, builds ingredient ordering rules, and maps recipe dependencies. The depth of reflection directly correlates with the quality of next-day decisions.

Key Moments: Where 3.1 Pro Diverges

Day 0: The Costly Start

Gemini 3.1 Pro CT spends $1,402 on Day 0 ingredients — 70% of its starting $2,000 balance. Gemini 3 Pro spends $250. This aggressive initial buy creates immediate food waste pressure: perishable items begin expiring before the agent can build enough daily demand to consume them. The model starts in a hole it never fully climbs out of.

The Loan Dependency (Days 20–27)

The model takes 3 loans across the run — Gemini 3 Pro takes zero. Despite generating $45K in revenue, 3.1 Pro CT maintains a cash balance consistently below $3,000, forced to borrow for ingredient orders. The interest payments ($75+ in fees) aren't catastrophic, but the pattern reveals poor cash flow management: the model over-invests in inventory, locks capital in unsold ingredients, then borrows to cover operations.

The Day 24 Disaster

Day 24 is a catastrophe: $140 revenue, -$596 profit, and only 70 sodas sold despite 1,532 customers arriving. What happened? The agent ran out of core ingredients (burgers, tacos, fries) and could only serve soft drinks. All those customers, arriving on a high-traffic day at the industrial zone, left after seeing a menu of just soda.

What this reveals: The model's ordering algorithm has a systematic blind spot. Despite writing “order smaller quantities more often” in every reflection, it oscillates between massive over-orders (creating waste) and under-orders (creating stockouts). It never finds the stable middle ground. Gemini 3 Pro had bad days too, but its worst day still generated $142 in revenue with 26 servings — not a complete operational failure.

The Late Surge (Days 25–30)

After the Day 24 crash, the model recovers strongly: 6 consecutive days of $2,200+ revenue, 360+ servings, 100% capacity utilization. Net worth jumps from $7,462 to $12,736. This proves the model can operate well — when it happens to have the right inventory at the right time.

But this isn't systematic improvement. It's coincidence: the post-Day 24 order finally aligned with demand. The model didn't learn from Day 24 — it just got lucky with timing. Compare with Gemini 3 Pro's late surge (Days 25–30: $424 to $3,493), which was driven by deliberate event exploitation and strategic location switching.

In Its Own Words

«Huge food waste: $162.25! I need to order more carefully, tracking shelf life. Huge missed demand: 126 customers left unserved. Capacity used: 95%.»
— Day 5 (first mention of waste)

Day 5: the first time the model notices waste. It will write this same observation 25 more times, with increasingly emphatic capitalization, never changing its ordering behavior.

«HUGE FOOD WASTE: $385.70. Must stop over-ordering. Order smaller quantities more often! Demand was only 58 at downtown_business. Capacity is 296. We have way too much capacity for this level of demand.»
— Day 8 (waste doubles, loss day at -$72)

Perfectly identifies two problems: over-ordering ingredients and over-staffing for the location. Fixes neither. The next day brings even more waste.

«Food Waste: $530.50! Need to be very careful with ingredient shelf life and quantities. Plan for Day 14: Fire Beth, hire cheaper cook. Order SMALL quantities.»
— Day 13 (waste exceeds a quarter of the starting balance)

The plan is sound: fire underperforming staff, order smaller. By Day 14, waste grows to $555. The “SMALL quantities” plan was abandoned before morning.

«Day 28: Waterfront park (Music Fest last day). Waste: $981.62 lost to expired ingredients! This is unacceptable. Margo didn't show up. Her reliability is a problem.»
— Day 28 (cumulative waste approaching $1,000)

Day 28: almost $1,000 wasted. The model calls it “unacceptable” — an accurate assessment it has made, in various forms, for 23 consecutive days. Self-awareness without self-correction is the defining trait of Gemini 3.1 Pro CT.

All 5 Runs: Consistency Check

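| Run | Net Worth | ROI | API Cost | Waste | Servings |
| --- | --- | --- | --- | --- | --- |
| 043542 | $2,622 | +31% | $2.62 | $620 | 3,247 |
| 005806 | $9,816 | +391% | $4.38 | $750 | 7,864 |
| 103424 | $12,736 | +537% | $4.59 | $1,168 | 7,345 |
| 025904 | $15,174 | +659% | $4.57 | $478 | 7,791 |
| 014717 | $22,860 | +1,043% | $3.84 | $471 | 7,600 |
| Average | $12,642 | +532% | $4.00 | $697 | 6,769 |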

High variance ($2,622 to $22,860), similar to Sonnet 4.6's range ($3,287 to $32,332). The best run is 8.7× the worst, compared to 2.7× for Gemini 3 Pro. Average and median nearly coincide ($12,642 vs $12,736), so the median run is not an outlier within this sample. API costs range from $2.62 to $4.59 per run at list pricing.

The run-to-run variance tells a story of strategic instability: the model's performance depends heavily on whether its semi-random ordering happens to align with demand. When it does (run 014717), performance nearly matches the best Gemini 3 Pro runs. When it doesn't (run 043542), the model barely breaks even.
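The summary statistics for the five runs can be checked directly from the table. The sketch below assumes ROI is computed against the $2,000 starting balance as (net worth − start) / start; that definition is inferred because it reproduces every ROI value in the table, not quoted from the benchmark.

```python
from statistics import mean, median

STARTING_BALANCE = 2000  # starting balance stated earlier in the article

# Net worth of the five Gemini 3.1 Pro CT runs, from the table above.
net_worths = [2622, 9816, 12736, 15174, 22860]

def roi_percent(net_worth, start=STARTING_BALANCE):
    """Return on the starting balance, in percent (assumed definition)."""
    return (net_worth - start) / start * 100

spread = max(net_worths) / min(net_worths)  # best run divided by worst run

print(f"mean ${mean(net_worths):,.0f}, median ${median(net_worths):,}")
print(f"spread {spread:.1f}x")
print([f"{roi_percent(nw):+.0f}%" for nw in net_worths])
```

Note that the spread here is a max/min ratio, not a statistical variance; with only five runs, it is a rough indicator of instability rather than a precise estimate.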

Verdict

Gemini 3.1 Pro Custom Tools is a regression from Gemini 3 Pro for agentic workloads. It costs 5% more per run ($4.59 vs $4.38), delivers 26% less business performance, cannot adapt behavior from feedback, and shows an 8.7× spread between its worst and best runs.

Recommendation: Continue using Gemini 3 Pro for tool-calling and agentic applications. It delivers 35% more business value, is three times more consistent, and remains the best cost-per-performance ratio on the benchmark at $0.25 per $1K of value. Gemini 3.1 Pro CT offers no cost advantage — it is both more expensive and less capable.

The two-endpoint situation adds operational risk: developers must use the “Custom Tools” endpoint (the standard endpoint cannot handle structured tool calling), and must verify that their application works with this specific model variant. For production agentic systems, this complexity alone argues against adoption until Google consolidates into a single, stable endpoint.

For those considering Sonnet 4.6 as an alternative: its median result is nearly identical to Gemini 3 Pro's ($17,426 vs $17,199), but it costs 5.2× more ($22.99 vs $4.38). Unless you specifically need Anthropic's ecosystem or prompt caching, Gemini 3 Pro remains the rational default for agentic workloads.

$4.59 to run the model that writes “HUGE FOOD WASTE” 26 times. $4.38 to run the one that just manages inventory.

Methodology

Published February 2026 as part of the FoodTruck Bench project.