Claude Sonnet 4.5 Takes SWE-bench Crown With 82% Under High Compute
Anthropic's Sonnet 4.5 hits 77.2% on SWE-bench Verified at standard settings and 82% with high compute. The company also ships the Claude Agent SDK and releases the model under its ASL-3 safety classification.
Maya Johnson
Anthropic released Claude Sonnet 4.5 on September 29, 2025, and the benchmark numbers speak for themselves: 77.2% on SWE-bench Verified at standard settings, climbing to 82.0% with high compute. According to Anthropic's blog, that makes it the best coding model available by a significant margin.
Where Sonnet 4.5 Leads
The SWE-bench score is the headline, but the more telling number is OSWorld: 61.4%, up from 42.2% for Sonnet 4. OSWorld tests practical computer use — navigating real desktops, operating software, completing multi-step tasks. A 19-point jump suggests genuine improvement in agent capabilities, not just benchmark optimization.
Sonnet 4.5 can maintain extended focus for 30+ hours on complex multi-step tasks. That's not a typo — Anthropic reports the model working continuously on long-horizon agent workflows for over a day without degradation.
Pricing stays at $3 per million input tokens and $15 per million output tokens, unchanged from Sonnet 4. The 200K context window and 64K max output also remain the same. The model ships under ASL-3, the strictest safety tier Anthropic has applied to a released model.
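At those rates, estimating a workload's cost is simple arithmetic. A minimal sketch (the helper name and example token counts are illustrative, not part of Anthropic's SDK):

```python
# Token prices for Claude Sonnet 4.5 (USD per million tokens), per Anthropic.
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate API cost in USD for a single request."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# A worst-case request: the full 200K context plus the 64K max output.
print(round(estimate_cost(200_000, 64_000), 2))  # → 1.56
```

Even a maxed-out request stays under two dollars, which is the point the "Our Take" section makes about value.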
Claude Agent SDK
Alongside the model, Anthropic released the Claude Agent SDK — a framework for building multi-step, tool-using AI agents. Combined with Claude Code checkpoints (which let you save and resume agent sessions) and the VS Code extension, this creates a complete developer platform around Claude.
The SDK is significant because it standardizes how developers build with Claude agents, rather than everyone implementing their own orchestration logic.
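To see what that orchestration looks like when hand-rolled, here is a generic tool-use loop of the kind the SDK is meant to replace. Every name below is illustrative; it is not the Agent SDK's actual interface:

```python
# A generic tool-using agent loop: the orchestration logic developers
# previously wrote themselves. All names here are illustrative.
from typing import Callable

def run_agent(model_step: Callable[[list], dict],
              tools: dict[str, Callable],
              task: str,
              max_steps: int = 10) -> str:
    """Loop: ask the model, execute any tool it requests, feed results back."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model_step(history)      # model decides: final answer or tool call
        if reply.get("tool"):            # model requested a tool
            result = tools[reply["tool"]](**reply.get("args", {}))
            history.append({"role": "tool", "content": str(result)})
        else:                            # model produced a final answer
            return reply["content"]
    return "max steps reached"
```

Multiply this by retries, streaming, permissions, and checkpointing, and the appeal of a standard framework becomes clear.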
Code Execution and File Creation
Claude can now execute code and create files directly within Claude apps — not just suggest code, but run it and show results. This moves Claude closer to being a development environment, not just a chat interface.
The Three-Way Race
At the time of launch, the LLM leaderboard looked like this: Claude led coding (SWE-bench), GPT-5 led general knowledge and reasoning, and Gemini 2.5 Pro led academic benchmarks. Sonnet 4.5 widened Claude's coding lead specifically.
Google had released Gemini 2.5 Pro earlier in the year with strong reasoning scores, and OpenAI had shipped GPT-5 in August with broad improvements. But neither could match Sonnet 4.5 on the benchmarks that matter most for professional developers.
Our Take
Sonnet 4.5 at $3/$15 is absurd value. It outperforms models costing 5x more on the benchmarks developers actually care about. The 30-hour sustained focus claim is bold — if it holds up in production, it fundamentally changes what's possible with AI agents. Anthropic is building a moat around the developer experience, and the Agent SDK is the foundation. The question isn't whether Claude is the best coding model. It's whether anyone else can catch up.
FAQ
How much does Claude Sonnet 4.5 cost?
Claude Sonnet 4.5 costs $3 per million input tokens and $15 per million output tokens — identical pricing to Sonnet 4. It's available through the Anthropic API with the model ID claude-sonnet-4-5-20250929.
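A request to Anthropic's Messages API pairs that model ID with a messages array and a max_tokens cap. A minimal sketch of the request body only (the helper name is illustrative; see Anthropic's API docs for the endpoint and required headers):

```python
import json

# Model ID for Sonnet 4.5, as published by Anthropic.
MODEL_ID = "claude-sonnet-4-5-20250929"

def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build the JSON body for a call to Anthropic's Messages API."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_request("Summarize the SWE-bench Verified results.")
print(json.dumps(body, indent=2))
```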
What is the Claude Agent SDK?
The Claude Agent SDK is a framework released alongside Sonnet 4.5 for building multi-step AI agents that can use tools, make decisions, and work on complex tasks autonomously. It standardizes agent development patterns for the Claude ecosystem.
How does Sonnet 4.5 compare to GPT-5?
Sonnet 4.5 leads on coding benchmarks like SWE-bench Verified (77.2%-82.0%) while GPT-5 leads on general reasoning and knowledge tasks. The models are competitive, with each excelling in different categories.
What is ASL-3?
ASL-3 is a tier in Anthropic's AI Safety Level classification system. Sonnet 4.5 was the second model released under ASL-3 (after the Claude 4 family), indicating it meets Anthropic's most rigorous safety and deployment requirements.