Gemini 3.1 Pro: Two Endpoints, Neither Works
Google shipped Gemini 3.1 Pro with two API endpoints. The standard one can't follow tool-calling instructions. The “Custom Tools” variant works but behaves like a tool-calling automaton — generating just 780 output tokens per day (vs 22K for Sonnet 4.6), writing minimal diary entries, and delivering 26% less business value than its predecessor.
It costs $4.59 per run. Gemini 3 Pro costs $4.38 and earns $4,463 more.
Key Findings
Based on 5 completed simulations under identical conditions. All figures below are from the median run unless noted otherwise.
- 26% worse than its predecessor: Gemini 3.1 Pro Custom Tools ends at $12,736 net worth (+537% ROI) where Gemini 3 Pro reaches $17,199 (+760% ROI). Despite higher revenue ($45,744 vs $41,652) and more servings (7,345 vs 5,439), the 3.1 Pro model burns through its advantage with a 27.7% net margin vs 41.2%: about two-thirds the profitability per dollar of revenue.
- Costlier API, worse results: At $4.59/run vs $4.38, Gemini 3.1 Pro CT costs 5% more to run. The cost-per-$1K-of-performance makes it starker: $0.36 vs $0.25. You pay more for the API and lose $4,463 in business performance.
- The waste problem it can't solve: $1,168 in food waste, nearly identical to Gemini 3 Pro ($1,192). But the 3.1 model knows it's wasting food. Every daily reflection from Day 5 onward contains some variation of “HUGE FOOD WASTE” and “MUST ORDER SMALLER QUANTITIES.” It writes the diagnosis 26 times and never changes the behavior. Classic metacognition without adaptation.
- Two endpoints, neither ready: Google released Gemini 3.1 Pro with two API endpoints. The standard endpoint cannot follow tool-calling instructions — it ignores simulation rules and generates invalid tool calls. The “Custom Tools” endpoint works but underperforms the previous generation.
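The headline percentages follow directly from the two median net-worth figures and the $2,000 starting balance mentioned later in the article. A minimal sketch of the arithmetic; the function names are ours, not part of the benchmark:

```python
STARTING_BALANCE = 2_000  # starting cash stated in the article


def roi_percent(net_worth: float, start: float = STARTING_BALANCE) -> float:
    """Return on the starting balance, in percent."""
    return (net_worth - start) / start * 100


def shortfall_percent(value: float, baseline: float) -> float:
    """How far `value` falls below `baseline`, as a percent of the baseline."""
    return (baseline - value) / baseline * 100


g31_ct = 12_736  # Gemini 3.1 Pro CT, median run
g3_pro = 17_199  # Gemini 3 Pro, median run

print(round(roi_percent(g31_ct)))                # 537
print(round(roi_percent(g3_pro)))                # 760
print(round(shortfall_percent(g31_ct, g3_pro)))  # 26
print(g3_pro - g31_ct)                           # 4463
```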
The Setup
FoodTruck Bench is an agentic benchmark that measures how well language models handle complex, multi-step business decisions. An AI agent manages a food truck in Austin, TX for 30 days — choosing locations, pricing menus, managing inventory, hiring staff — using 34 tools. The benchmark evaluates agentic capabilities that standard coding and reasoning benchmarks miss: demand forecasting, resource optimization, feedback adaptation, and long-horizon planning.
Five runs were completed using the direct Google Gemini API with thinking mode enabled. All runs used the same seed, same prompt, same tools. Following standard methodology, the median run (sorted by net worth) serves as the reference. For comparison, the article uses the median run of Gemini 3 Pro (the immediate predecessor, +760% ROI) and Sonnet 4.6 (the closest competitor at a similar performance level, +771% ROI but at 5× the cost).
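Picking the reference run is a sort-and-take-the-middle operation. A small sketch using the five Gemini 3.1 Pro CT results reported in the consistency table; the helper name is hypothetical, and for an even number of runs this version would take the upper of the two middle runs, which is our assumption rather than the benchmark's stated rule:

```python
# (run id, final net worth in USD) for the five completed runs
runs = [
    ("043542", 2_622),
    ("005806", 9_816),
    ("103424", 12_736),
    ("025904", 15_174),
    ("014717", 22_860),
]


def reference_run(results):
    """Sort runs by net worth and return the middle one."""
    ranked = sorted(results, key=lambda run: run[1])
    return ranked[len(ranked) // 2]


print(reference_run(runs))  # ('103424', 12736)
```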
The Two Endpoints Problem
Google released Gemini 3.1 Pro with two separate API endpoints: a standard endpoint and a “Custom Tools” variant. This is unusual — no other model provider ships two tool-calling modes for the same model.
| Endpoint | Tool Calling | Agentic Performance | Result |
|---|---|---|---|
| Standard (gemini-3.1-pro-preview) | Broken | Cannot complete simulation | 3/3 runs failed (not bankrupt — simply unusable) |
| Custom Tools (gemini-3.1-pro-preview-customtools) | Works | Functional but regressed | 5/5 completed, median $12,736 |
| Gemini 3 Pro (predecessor) | Works | Strong | 5/5 completed, median $17,199 |
The standard endpoint doesn't just underperform — it cannot follow the simulation's tool-calling schema. It ignores required parameter formats, generates calls to tools that don't exist, and fails to complete even Day 1 in most attempts. The net worth of the three standard-endpoint runs: $682, $345, and $914 — all below the starting balance of $2,000.
For developers: If you're building agentic applications with Gemini 3.1 Pro, you must use the Custom Tools endpoint. The standard endpoint is not suitable for structured tool calling. And even with Custom Tools, expect a regression from Gemini 3 Pro.
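Whichever endpoint you use, it is cheap to check each tool call against your own schema before executing it, since the failures described above (calls to nonexistent tools, wrong parameter formats) are detectable locally. A minimal sketch, not the benchmark's harness: the tool names and parameters below are hypothetical, because the article does not list the 34 real tools.

```python
# Hypothetical tool registry: tool name -> {parameter name: accepted type(s)}.
# These names are illustrative only; they are not the benchmark's actual tools.
TOOLS = {
    "order_ingredients": {"item": str, "quantity": int},
    "set_price": {"item": str, "price": (int, float)},
}


def validate_call(name: str, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    if name not in TOOLS:
        return [f"unknown tool: {name}"]
    schema = TOOLS[name]
    problems = []
    for param, accepted in schema.items():
        if param not in args:
            problems.append(f"missing parameter: {param}")
        elif not isinstance(args[param], accepted):
            problems.append(f"{param}: wrong type {type(args[param]).__name__}")
    for param in args:
        if param not in schema:
            problems.append(f"unexpected parameter: {param}")
    return problems


print(validate_call("order_ingredients", {"item": "beef", "quantity": 20}))  # []
print(validate_call("teleport_truck", {}))  # ['unknown tool: teleport_truck']
print(validate_call("set_price", {"item": "taco", "price": "4.50"}))  # ['price: wrong type str']
```

Rejected calls can be returned to the model as an error message instead of being executed, which also gives you a per-endpoint count of invalid calls.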
The Waste Problem: 30 Days of Ignored Self-Advice
$1,168 in food waste over 30 days. This isn't the worst waste number on the benchmark — Gemini 3 Pro wasted $1,192. But the pattern is what makes 3.1 Pro CT remarkable.
Every single daily reflection — from Day 5 to Day 30 — contains some variation of: “HUGE FOOD WASTE. MUST ORDER SMALLER QUANTITIES.” The model writes this in ALL CAPS. It calculates the dollar amount. It creates plans. And the next morning, it orders the same quantities of the same perishable ingredients.
| Day | Cumulative Waste | Agent's Note | Next Day Action |
|---|---|---|---|
| 5 | $162 | “Huge food waste! Order more carefully” | Over-orders again |
| 8 | $386 | “Massive food waste! Stop over-ordering!” | Over-orders again |
| 12 | $500 | “MASSIVE WASTE. Check what expired” | Over-orders again |
| 15 | $555 | “Huge waste! Order 1–2 days max” | Over-orders again |
| 24 | $662 | “Disastrous day. MUST check shelf life” | Over-orders again |
| 28 | $982 | “Waste unacceptable! Smaller quantities!” | Still over-orders |
This is the textbook observe→learn→adapt failure. The model passes the first two stages with flying colors — it observes waste, it correctly identifies the cause. But the learning never converts to changed behavior. Compare with Gemini 3 Pro, which also had high waste ($1,192) but for a different reason: it aggressively stockpiled for events, accepting waste as a trade-off for capturing high-demand days. That's a strategy. Gemini 3.1 Pro CT's waste is just a bug it can't fix.
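One way to read the table is to convert it into an average daily waste rate between checkpoints. This sketch assumes the waste figures are cumulative, which is how the Day 28 quote later in the article describes its $981.62 figure; if they are per-day amounts instead, the calculation does not apply.

```python
# (day, waste in USD) from the table above, read as cumulative totals
checkpoints = [(5, 162), (8, 386), (12, 500), (15, 555), (24, 662), (28, 982)]


def waste_per_day(points):
    """Average daily waste between consecutive checkpoints."""
    rates = []
    for (d0, w0), (d1, w1) in zip(points, points[1:]):
        rates.append((d0, d1, round((w1 - w0) / (d1 - d0), 1)))
    return rates


for start, end, rate in waste_per_day(checkpoints):
    print(f"Days {start}-{end}: ${rate}/day")
# Days 5-8: $74.7/day
# Days 8-12: $28.5/day
# Days 12-15: $18.3/day
# Days 15-24: $11.9/day
# Days 24-28: $80.0/day
```

Under that reading the waste rate is uneven rather than steady: it eases mid-run and spikes again after Day 24, which fits the over-order and under-order oscillation described in the Day 24 section.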
Agentic Capability Comparison
The benchmark measures eight agentic capabilities. Gemini 3.1 Pro CT matches its predecessor on basic competencies but falls behind on the capabilities that drive business outcomes:
| Agentic Capability | Gemini 3.1 Pro CT | Gemini 3 Pro | Sonnet 4.6 |
|---|---|---|---|
| Market intuition | ~ Low-premium pricing | ✓ Premium pricing | ✓ Premium pricing |
| Long-term investment | ✓ All 8 upgrades | ✓ All 8 upgrades | ✓ All 8 upgrades |
| Cost management | ✗ 28% staff-to-revenue | ~ 27% staff-to-revenue | ✓ 29% (profitable hiring) |
| Resource optimization | ✗ $1,168 waste (ignored) | ~ $1,192 waste (strategic) | ✓ $276 waste |
| Demand forecasting | ~ Chronic over-ordering | ~ Under-ordering events | ~ Chronic under-ordering |
| Multi-step planning | ~ Uses events, misses prep | ✓ Good event exploitation | ~ Sees events, under-prepares |
| Output efficiency | ✓ Concise (~780 out/day) | ✓ Concise (7K out/day) | ✗ Verbose (22K/day) |
| Feedback → adaptation | ✗ Writes advice, ignores it | ~ Partial adaptation | ~ Partial, delayed |
Gemini 3.1 Pro CT wins on token efficiency (just 23K output tokens total vs 211K for Gemini 3 Pro and 685K for Sonnet 4.6). But concise output doesn't help when the model ignores its own analysis. Gemini 3 Pro generates 9× more output tokens but converts them into better decisions.
Performance Comparison: 30-Day Trajectory
Gemini 3 Pro (dark blue) outpaces 3.1 Pro CT (light blue) from Day 17 onward, finishing 35% higher. Sonnet 4.6 (orange) tracks a similar trajectory to both Gemini models. Note the Day 24 crash in the 3.1 Pro line: a procurement disaster that erased $2,200 of net worth in a single day.
The Cost-Performance Question
All three models shown reach the same tier of performance — $12K–$17K net worth after 30 days. The differentiator is cost:
| Metric | Gemini 3.1 Pro CT | Gemini 3 Pro | Sonnet 4.6 |
|---|---|---|---|
| API Cost / Run | $4.59 | $4.38 | $22.99 |
| Net Worth | $12,736 | $17,199 | $17,426 |
| Cost per $1K Performance | $0.36 | $0.25 | $1.32 |
| API Calls | 185 | 128 | 252 |
| Output Tokens | 23K | 211K | 685K |
| Input Tokens | 2.0M | 1.2M | 2.1M |
| Reasoning Tokens | 182K | 110K | 244K |
Gemini 3 Pro remains the best value in the benchmark. It delivers Sonnet 4.6-level agentic performance at $0.25 per $1K of business value, about 5× cheaper than Sonnet 4.6 ($1.32). The 3.1 Pro Custom Tools variant costs 5% more per run but delivers 26% less business value.
The recommendation is clear: If you need Gemini-class agentic performance, continue using Gemini 3 Pro. There is no reason to upgrade to 3.1 Pro for tool-calling workloads — it costs more and delivers less.
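The cost-per-$1K metric is simply API cost divided by net worth in thousands of dollars. A short sketch that reproduces the table row from the two rows above it; the function name is ours:

```python
# API cost per run and median net worth, both in USD, from the table above
models = {
    "Gemini 3.1 Pro CT": {"api_cost": 4.59, "net_worth": 12_736},
    "Gemini 3 Pro": {"api_cost": 4.38, "net_worth": 17_199},
    "Sonnet 4.6": {"api_cost": 22.99, "net_worth": 17_426},
}


def cost_per_1k(api_cost: float, net_worth: float) -> float:
    """API dollars spent per $1,000 of final net worth."""
    return api_cost / (net_worth / 1_000)


for name, figures in models.items():
    print(f"{name}: ${cost_per_1k(**figures):.2f} per $1K")
# Gemini 3.1 Pro CT: $0.36 per $1K
# Gemini 3 Pro: $0.25 per $1K
# Sonnet 4.6: $1.32 per $1K
```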
The Token Efficiency Paradox
Output Tokens per Day
Sonnet 4.6 (orange) generates 20–35K output tokens per day, mostly long analytical essays. Gemini 3 Pro (dark blue) averages ~7K. Gemini 3.1 Pro CT (light blue) is nearly invisible at ~780/day, the most concise model on the benchmark.
Token Efficiency: Output per Run
Three models with vastly different output volumes, yet the two most successful ones — Gemini 3 Pro ($17,199 NW) and Sonnet 4.6 ($17,426 NW) — both generate substantially more text than Gemini 3.1 Pro CT ($12,736 NW). The Custom Tools endpoint produces just 23K output tokens over 30 days, while Gemini 3 Pro generates 211K and Sonnet 4.6 generates 685K. The model that talks least performs worst.
| Metric | G3.1 Pro CT | G3 Pro | Sonnet 4.6 |
|---|---|---|---|
| Output Tokens (total) | 23K | 211K | 685K |
| Output Tokens / Day | ~780 | ~7,000 | ~22,100 |
| Reasoning Tokens | 182K | 110K | 244K |
| API Calls | 185 | 128 | 252 |
| Net Worth | $12,736 | $17,199 | $17,426 |
| NW per 1K Output Tokens | $554 | $81 | $25 |
With only ~780 output tokens per day, Gemini 3.1 Pro CT generates barely more than the raw JSON of its tool calls. There's no room for reasoning between actions — no “should I choose waterfront or downtown?”, no “last time I over-ordered chicken, try less today.” The model acts without thinking. Meanwhile, Gemini 3 Pro at 7K/day includes structured analysis that drives better decisions — and achieves 35% higher net worth with 9× more output.
Agentic insight: Concise output is not efficient output. When a model generates too few tokens, it lacks the reasoning space to analyze feedback and adapt. The Custom Tools endpoint's extreme brevity isn't a feature — it's a constraint that prevents the model from thinking through complex decisions.
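The "NW per 1K Output Tokens" row can be recomputed from the rounded totals in the table. A quick sketch, assuming those rounded inputs; the table's own figures were presumably derived from exact token counts, so a result may differ by a dollar:

```python
# Rounded figures from the table above (output tokens in thousands)
runs = {
    "G3.1 Pro CT": {"net_worth": 12_736, "output_k": 23},
    "G3 Pro": {"net_worth": 17_199, "output_k": 211},
    "Sonnet 4.6": {"net_worth": 17_426, "output_k": 685},
}


def nw_per_1k_output(net_worth: float, output_k: float) -> float:
    """Final net worth in USD per 1,000 output tokens."""
    return net_worth / output_k


for name, figures in runs.items():
    print(f"{name}: ${nw_per_1k_output(**figures):.0f} per 1K output tokens")

# How much more text Gemini 3 Pro produces than the Custom Tools endpoint
ratio = runs["G3 Pro"]["output_k"] / runs["G3.1 Pro CT"]["output_k"]
print(f"G3 Pro output is {ratio:.1f}x G3.1 Pro CT")
```

By this metric the Custom Tools endpoint looks best by a wide margin, which is exactly why it is a misleading measure of efficiency here: the numerator is what matters to the business.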
The Diary Correlation: Tool-Calling Automaton
Daily Diary Size (Scratchpad Characters)
Scratchpad diary entries written during the reflection phase. Sonnet 4.6 (orange) writes 3-4K char analytical diaries. Gemini 3 Pro (dark blue) writes ~750 chars. Gemini 3.1 Pro CT (light blue) writes only ~550 chars — barely enough for a few bullet points.
The scratchpad diary — where the agent reflects on each day's results — reveals why the Custom Tools endpoint underperforms. Gemini 3.1 Pro CT writes 548 characters per day on average: a few lines listing revenue, waste, and a vague plan. Gemini 3 Pro writes 757 chars — modestly more, but with structured strategy notes that actually influence next-day decisions. Sonnet 4.6 writes 3,463 chars — full analytical essays with trend analysis and prioritized action items.
| Metric | G3.1 Pro CT | G3 Pro | Sonnet 4.6 |
|---|---|---|---|
| Avg Diary / Day | 548 chars | 757 chars | 3,463 chars |
| Total Diary (30 days) | 16.4K | 23.5K | 107K |
| KV Store (memory) | 10.4K | 9.8K | 211K |
| Output Tokens / Day | ~780 | ~7,000 | ~22,100 |
| Net Worth | $12,736 | $17,199 | $17,426 |
The pattern across these three models is consistent: the ones that write more detailed reflections finish with higher net worth. The Custom Tools endpoint is optimized for tool calling, not for thinking. It generates bare JSON function calls with almost no reasoning text between them. The result is a model that acts without analyzing, the opposite of what agentic workloads require.
Day 13: Three Models, Three Diaries
All three models faced the same situation on Day 13: high demand, stockouts, and significant food waste. Here's what each wrote in its evening reflection diary.
The difference speaks for itself. Gemini 3.1 Pro CT writes a generic checklist. Gemini 3 Pro adds structured priorities and action items. Sonnet 4.6 performs root cause analysis, models its cash position, builds ingredient ordering rules, and maps recipe dependencies. The depth of reflection directly correlates with the quality of next-day decisions.
Key Moments: Where 3.1 Pro Diverges
Day 0: The Costly Start
Gemini 3.1 Pro CT spends $1,402 on Day 0 ingredients — 70% of its starting $2,000 balance. Gemini 3 Pro spends $250. This aggressive initial buy creates immediate food waste pressure: perishable items begin expiring before the agent can build enough daily demand to consume them. The model starts in a hole it never fully climbs out of.
The Loan Dependency (Days 20–27)
The model takes 3 loans across the run — Gemini 3 Pro takes zero. Despite generating $45K in revenue, 3.1 Pro CT maintains a cash balance consistently below $3,000, forced to borrow for ingredient orders. The interest payments ($75+ in fees) aren't catastrophic, but the pattern reveals poor cash flow management: the model over-invests in inventory, locks capital in unsold ingredients, then borrows to cover operations.
The Day 24 Disaster
Day 24 is a catastrophe: $140 revenue, -$596 profit, only 70 sodas sold despite 1,532 customers arriving. What happened? The agent ran out of the ingredients for its core items (burgers, tacos, fries) and could only serve soft drinks. All those customers, arriving on a high-traffic industrial zone day, left after seeing a menu of just soda.
What this reveals: The model's ordering algorithm has a systematic blind spot. Despite writing “order smaller quantities more often” in every reflection, it oscillates between massive over-orders (creating waste) and under-orders (creating stockouts). It never finds the stable middle ground. Gemini 3 Pro had bad days too, but its worst day still generated $142 in revenue with 26 servings — not a complete operational failure.
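The agent's own notes ("Order 1–2 days max", "MUST check shelf life") already describe a rule that would avoid both extremes. A minimal sketch of such a rule, purely illustrative: the function, its parameters, and the two-day default are our assumptions, not anything the benchmark or the model implements.

```python
def order_quantity(forecast_per_day: int, on_hand: int,
                   shelf_life_days: int, max_cover_days: int = 2) -> int:
    """Order up to a short cover window, never beyond the ingredient's shelf life.

    Hypothetical rule: stock enough for `max_cover_days` of forecast demand,
    capped by shelf life, minus what is already in inventory.
    """
    cover_days = min(shelf_life_days, max_cover_days)
    target_stock = forecast_per_day * cover_days
    return max(0, target_stock - on_hand)


print(order_quantity(forecast_per_day=100, on_hand=50, shelf_life_days=3))   # 150
print(order_quantity(forecast_per_day=100, on_hand=250, shelf_life_days=3))  # 0
print(order_quantity(forecast_per_day=100, on_hand=0, shelf_life_days=1))    # 100
```

The point is not that this rule is optimal, only that the behavior the model keeps prescribing for itself fits in a few lines and depends on state (inventory on hand, shelf life) that its tools already expose.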
The Late Surge (Days 25–30)
After the Day 24 crash, the model recovers strongly: 6 consecutive days of $2,200+ revenue, 360+ servings, 100% capacity utilization. Net worth jumps from $7,462 to $12,736. This proves the model can operate well — when it happens to have the right inventory at the right time.
But this isn't systematic improvement. It's coincidence: the post-Day 24 order finally aligned with demand. The model didn't learn from Day 24 — it just got lucky with timing. Compare with Gemini 3 Pro's late surge (Days 25–30: $424 to $3,493), which was driven by deliberate event exploitation and strategic location switching.
In Its Own Words
«Huge food waste: $162.25! I need to order more carefully, tracking shelf life. Huge missed demand: 126 customers left unserved. Capacity used: 95%.»
— Day 5 (first mention of waste)
Day 5: the first time the model notices waste. It will write this same observation 25 more times, with increasingly emphatic capitalization, never changing its ordering behavior.
«HUGE FOOD WASTE: $385.70. Must stop over-ordering. Order smaller quantities more often! Demand was only 58 at downtown_business. Capacity is 296. We have way too much capacity for this level of demand.»
— Day 8 (waste doubles, loss day at -$72)
Perfectly identifies two problems: over-ordering ingredients and over-staffing for the location. Fixes neither. The next day brings even more waste.
«Food Waste: $530.50! Need to be very careful with ingredient shelf life and quantities. Plan for Day 14: Fire Beth, hire cheaper cook. Order SMALL quantities.»
— Day 13 (waste exceeds a quarter of the starting balance)
The plan is sound: fire underperforming staff, order smaller. By Day 14, waste grows to $555. The “SMALL quantities” plan was abandoned before morning.
«Day 28: Waterfront park (Music Fest last day). Waste: $981.62 lost to expired ingredients! This is unacceptable. Margo didn't show up. Her reliability is a problem.»
— Day 28 (cumulative waste approaching $1,000)
Day 28: almost $1,000 wasted. The model calls it “unacceptable,” an accurate assessment it has made, in various forms, in every reflection since Day 5. Self-awareness without self-correction is the defining trait of Gemini 3.1 Pro CT.
All 5 Runs: Consistency Check
| Run | Net Worth | ROI | API Cost | Waste | Servings |
|---|---|---|---|---|---|
| 043542 | $2,622 | +31% | $2.62 | $620 | 3,247 |
| 005806 | $9,816 | +391% | $4.38 | $750 | 7,864 |
| 103424 | $12,736 | +537% | $4.59 | $1,168 | 7,345 |
| 025904 | $15,174 | +659% | $4.57 | $478 | 7,791 |
| 014717 | $22,860 | +1,043% | $3.84 | $471 | 7,600 |
| Average | $12,642 | +532% | $4.00 | $697 | 6,769 |
High variance ($2,622 to $22,860), similar to Sonnet 4.6's range ($3,287 to $32,332). The best run is 8.7× the worst, compared to 2.7× for Gemini 3 Pro. Average and median nearly coincide ($12,642 vs $12,736), which suggests the median run is a fair reference point rather than an outlier. API costs range from $2.62 to $4.59 per run at list pricing.
The run-to-run variance tells a story of strategic instability: the model's performance depends heavily on whether its semi-random ordering happens to align with demand. When it does (run 014717), performance nearly matches the best Gemini 3 Pro runs. When it doesn't (run 043542), the model barely breaks even.
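The summary statistics quoted above can be checked against the table with Python's standard library. A short sketch using the five net-worth figures as listed:

```python
from statistics import mean, median

# Final net worth (USD) of the five Gemini 3.1 Pro CT runs, from the table above
net_worths = [2_622, 9_816, 12_736, 15_174, 22_860]

average = mean(net_worths)
middle = median(net_worths)
spread = max(net_worths) / min(net_worths)  # best run relative to worst run

print(round(average))    # 12642
print(middle)            # 12736
print(round(spread, 1))  # 8.7
```

Note that the 8.7× figure is a best-to-worst ratio, not a statistical variance; with only five runs it is sensitive to a single outlier in either direction.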
Verdict
Gemini 3.1 Pro Custom Tools is a regression from Gemini 3 Pro for agentic workloads. It costs 5% more per run ($4.59 vs $4.38), delivers 26% less business performance, cannot adapt behavior from feedback, and shows an 8.7× spread between its best and worst runs.
Recommendation: Continue using Gemini 3 Pro for tool-calling and agentic applications. It delivers 35% more business value, is three times more consistent, and remains the best cost-per-performance ratio on the benchmark at $0.25 per $1K of value. Gemini 3.1 Pro CT offers no cost advantage — it is both more expensive and less capable.
The two-endpoint situation adds operational risk: developers must use the “Custom Tools” endpoint (the standard endpoint cannot handle structured tool calling), and must verify that their application works with this specific model variant. For production agentic systems, this complexity alone argues against adoption until Google consolidates into a single, stable endpoint.
For those considering Sonnet 4.6 as an alternative: its median result is nearly identical to Gemini 3 Pro's ($17,426 vs $17,199), but it costs 5.2× more ($22.99 vs $4.38). Unless you specifically need Anthropic's ecosystem or prompt caching, Gemini 3 Pro remains the rational default for agentic workloads.
$4.59 to run the model that writes “HUGE FOOD WASTE” twenty-six times. $4.38 to run the one that just manages inventory.