Gemini 3.1 Pro Tops Reasoning Benchmarks With 94.3% on GPQA Diamond
Google's Gemini 3.1 Pro scores highest on reasoning benchmarks, edges past GPT-5.4 and Claude Opus 4.6 on academic tasks, and brings a new thinking_level parameter for developers.
Maya Johnson
Google released Gemini 3.1 Pro on February 19 as an update to the Gemini 3 Pro series launched in November. The headline number: 94.3% on GPQA Diamond, a reasoning benchmark built from graduate-level questions in biology, chemistry, and physics. That's the highest score any model has achieved on the benchmark.
Where Gemini 3.1 Pro Leads
GPQA Diamond is designed to be "Google-proof": PhD holders in adjacent fields struggle with its questions even with unrestricted web access. Gemini 3.1 Pro's 94.3% places it above both GPT-5.4 and Claude Opus 4.6 on pure reasoning tasks.
The release also adds a thinking_level parameter that lets developers control how much internal reasoning the model performs before answering, and a media_resolution parameter that sets how much detail vision inputs are processed at. Function responses can now carry multimodal content such as images and PDFs.
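To make the developer-facing change concrete, here is a minimal sketch of a request that sets both parameters through the google-genai Python SDK. The model id and the exact field names for thinking_level and media_resolution are assumptions based on the description above, not confirmed API surface; check the Gemini API reference before relying on them.

```python
# Minimal sketch using the google-genai Python SDK (pip install google-genai).
# The model id "gemini-3.1-pro" and the exact config field names below are
# assumptions based on this article's description; the shipped names and
# accepted values may differ.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-3.1-pro",  # hypothetical model id for this release
    contents="Walk through the proof that the square root of 2 is irrational.",
    config=types.GenerateContentConfig(
        # thinking_level trades latency and cost for deeper internal reasoning
        thinking_config=types.ThinkingConfig(thinking_level="high"),
        # media_resolution controls how much detail image/video inputs keep
        media_resolution="MEDIA_RESOLUTION_HIGH",
    ),
)
print(response.text)
```

The intended trade-off is straightforward: lower thinking levels keep latency and cost down on routine prompts, while higher levels buy more internal reasoning for the kind of multi-step problems GPQA Diamond tests.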
Where It Doesn't Lead
On practical coding benchmarks like SWE-bench Verified, Claude Opus 4.6 still holds the top spot. On computer-use tasks — navigating real software interfaces — GPT-5.4 leads with record scores on OSWorld and WebArena.
The LLM leaderboard has fragmented: no single model wins everywhere. Gemini leads reasoning, Claude leads coding, GPT leads computer use. Companies choosing a model now need to match the benchmark category to their actual use case.
Availability
Gemini 3.1 Pro is available through the Gemini API, Google AI Studio, and Vertex AI. It rolled out to Gemini app users across the AI Plus, Pro, and Ultra subscription tiers.
Our Take
Google has quietly built the best reasoning model available. Gemini 3.1 Pro's GPQA score is a genuine achievement, not benchmark gaming. But reasoning benchmarks don't directly translate to product quality — and Google's consumer AI products still trail behind ChatGPT and Claude in user experience and adoption. The model is excellent. The distribution challenge remains.