Gemini 3.1 Pro Tops Reasoning Benchmarks With 94.3% on GPQA Diamond
Google's Gemini 3.1 Pro scores highest on reasoning benchmarks, edges past GPT-5.4 and Claude Opus 4.6 on academic tasks, and brings a new thinking_level parameter for developers.
Maya Johnson
Google released Gemini 3.1 Pro on February 19 as an update to the Gemini 3 Pro series launched in November. The headline number: 94.3% on GPQA Diamond, a reasoning benchmark built from graduate-level questions in biology, chemistry, and physics. That's the highest score any model has achieved on the benchmark.
Where Gemini 3.1 Pro Leads
GPQA Diamond is designed to be "Google-proof": PhD holders in adjacent fields struggle with its questions even with unrestricted web access. Gemini 3.1 Pro's 94.3% places it above both GPT-5.4 and Claude Opus 4.6 on pure reasoning tasks.
The release also adds a thinking_level parameter that lets developers control how much internal reasoning the model performs before answering, and a media_resolution parameter that sets how much detail vision inputs are processed at. Function responses can now carry multimodal content such as images and PDFs.
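To make the developer-facing change concrete, here is a minimal sketch of a request that sets both parameters through the google-genai Python SDK. The model id and the exact field names for thinking_level and media_resolution are assumptions based on the description above, not confirmed API surface; check the Gemini API reference before relying on them.

```python
# Minimal sketch using the google-genai Python SDK (pip install google-genai).
# The model id "gemini-3.1-pro" and the exact config field names below are
# assumptions based on this article's description; the shipped names and
# accepted values may differ.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-3.1-pro",  # hypothetical model id for this release
    contents="Walk through the proof that the square root of 2 is irrational.",
    config=types.GenerateContentConfig(
        # thinking_level trades latency and cost for deeper internal reasoning
        thinking_config=types.ThinkingConfig(thinking_level="high"),
        # media_resolution controls how much detail image/video inputs keep
        media_resolution="MEDIA_RESOLUTION_HIGH",
    ),
)
print(response.text)
```

The intended trade-off is straightforward: lower thinking levels keep latency and cost down on routine prompts, while higher levels buy more internal reasoning for the kind of multi-step problems GPQA Diamond tests.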
Where It Doesn't Lead
On practical coding benchmarks like SWE-bench Verified, Claude Opus 4.6 still holds the top spot. On computer-use tasks — navigating real software interfaces — GPT-5.4 leads with record scores on OSWorld and WebArena.
The LLM leaderboard has fragmented: no single model wins everywhere. Gemini leads reasoning, Claude leads coding, GPT leads computer use. Companies choosing a model now need to match the benchmark category to their actual use case.
Availability
Gemini 3.1 Pro is available through the Gemini API, Google AI Studio, and Vertex AI. It rolled out to Gemini app users across the AI Plus, Pro, and Ultra subscription tiers.
Our Take
Google has quietly built the best reasoning model available. Gemini 3.1 Pro's GPQA score is a genuine achievement, not benchmark gaming. But reasoning benchmarks don't directly translate to product quality — and Google's consumer AI products still trail behind ChatGPT and Claude in user experience and adoption. The model is excellent. The distribution challenge remains.