Anthropic Launches Claude Opus 4.5 with 80.9% SWE-bench Score and 66% Price Drop


Anthropic has released Claude Opus 4.5 today, claiming the industry’s top coding score and introducing a significant architectural shift to lower costs.

By slashing pricing 66% to $5 per million input tokens and deploying “Tool Search” to reduce context overhead by 85%, the company directly attacks the primary economic barrier to autonomous AI agents.

The model achieves an 80.9% score on SWE-bench Verified, narrowly edging out recent releases from Google and OpenAI to reclaim the performance crown for complex software engineering tasks.

The Benchmark Wars: Reclaiming the Crown

Opus 4.5 arrives with a score of 80.9% on SWE-bench Verified, the current gold standard for evaluating autonomous software engineering capabilities. That result edges out Google’s Gemini 3 Pro, which launched at 76.2%, and OpenAI’s GPT-5.1-Codex-Max at 77.9%.

Internal evaluations suggest the model now scores higher than human candidates on Anthropic’s own engineering take-home tests. “Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done,” the company stated in its announcement.

To balance cost versus capability, a new “effort” parameter allows developers to dynamically adjust the model’s reasoning depth during API calls. At “medium” effort, Opus 4.5 matches the peak performance of the previous Sonnet 4.5 model but consumes 76% fewer output tokens.
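In practice, the effort knob might be wired into a request along these lines. This is a minimal sketch: the field name `effort`, its placement at the top level of the payload, and the model id are assumptions based on the announcement, so verify them against Anthropic’s current API reference before use.

```python
# Sketch: selecting reasoning depth per request via an "effort" setting.
# The "effort" field name/placement and model id are assumptions here;
# check the current Anthropic API reference for the exact shape.

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Build a messages-API payload carrying a reasoning-effort hint."""
    if effort not in {"low", "medium", "high"}:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "model": "claude-opus-4-5",
        "max_tokens": 4096,
        # "medium" reportedly matches Sonnet 4.5's peak performance
        # while consuming ~76% fewer output tokens; "high" pushes beyond it.
        "effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

# Cheap triage pass at medium effort; escalate to "high" only when needed.
req = build_request("Refactor this function to remove the race condition.")
```

A plausible pattern is to run routine tasks at `"medium"` and retry failures at `"high"`, paying for deeper reasoning only when the cheaper pass falls short.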

At the “high” setting, the model pushes past Sonnet 4.5’s peak performance by 4.3 percentage points. November has proven to be an active month in AI, with all three major labs deploying their flagship coding models between the 18th and 24th.

The Economic Shift: Pricing and Architecture

Addressing enterprise concerns about the viability of expensive “reasoning” models, Anthropic has aggressively repriced the model at $5 per million input tokens and $25 per million output tokens.

Compared to the previous Opus generation ($15/$75), the new rate offers a 66% discount, as detailed in Introducing Claude Opus 4.5.
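The arithmetic behind the headline discount checks out: both rates drop by the same factor, roughly two-thirds, and the savings compound on agent runs that burn millions of tokens.

```python
# Pricing comparison: previous Opus generation vs. Opus 4.5
# (dollars per million tokens, figures from Anthropic's announcement).
old_input, old_output = 15.00, 75.00
new_input, new_output = 5.00, 25.00

input_discount = (old_input - new_input) / old_input    # ~0.667
output_discount = (old_output - new_output) / old_output  # ~0.667

# Example: an agent run consuming 2M input and 0.5M output tokens.
old_cost = 2 * old_input + 0.5 * old_output  # $67.50
new_cost = 2 * new_input + 0.5 * new_output  # $22.50
```

Strictly speaking the cut is 66.7%, which the announcement rounds down to 66%.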

Under the hood, the architecture tackles the “Context Bloat” problem. Traditionally, loading 50+ tool definitions could consume approximately 55,000 tokens before a single user query was processed.

According to the advanced tool use documentation, the new system fundamentally changes this dynamic:

“Instead of loading all tool definitions upfront, the Tool Search Tool discovers tools on-demand. Claude only sees the tools it actually needs for the current task.”

“This represents an 85% reduction in token usage while maintaining access to your full tool library. Internal testing showed significant accuracy improvements on MCP evaluations when working with large tool libraries.”

Complementing this is “Programmatic Tool Calling” (PTC), which allows the model to write orchestration code rather than relying on chat-based turn-taking.

The technical documentation further explains the mechanics of PTC:

“Instead of Claude requesting tools one at a time with each result being returned to its context, Claude writes code that calls multiple tools, processes their outputs, and controls what information actually enters its context window.”

“Claude excels at writing code and by letting it express orchestration logic in Python rather than through natural language tool invocations, you get more reliable, precise control flow.”

PTC eliminates the need for round-trip inference steps for every individual tool call, significantly reducing latency. Processing extensive datasets, such as 200KB of raw expense data, becomes viable as the model returns only the 1KB final result to the context window.
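The orchestration pattern described above looks something like the following sketch. The tool stubs are hypothetical stand-ins for real tool calls; the point is that the large intermediate dataset is aggregated inside the generated script, and only the small verdict is handed back to the model.

```python
# Sketch of the orchestration pattern PTC enables: one generated script
# calls several tools, reduces their output locally, and returns only a
# compact summary to the model's context. Tool stubs are hypothetical.

def query_expenses(quarter: str) -> list[dict]:
    """Stub standing in for a tool that returns ~200KB of raw records."""
    return [{"team": t, "amount": 10.0} for t in ["eng", "sales", "ops"] * 500]

def get_budget(team: str) -> float:
    """Stub standing in for a per-team budget lookup tool."""
    return {"eng": 6000.0, "sales": 4000.0, "ops": 5000.0}[team]

def run() -> dict:
    records = query_expenses("Q3")      # large intermediate result...
    totals: dict[str, float] = {}
    for r in records:                   # ...aggregated inside the script,
        totals[r["team"]] = totals.get(r["team"], 0.0) + r["amount"]
    # ...so only a ~1KB verdict enters the context window:
    return {team: {"spent": spent, "over_budget": spent > get_budget(team)}
            for team, spent in totals.items()}

summary = run()
```

In chat-based turn-taking, each of those tool results would round-trip through the model; here the 1,500 records never touch its context at all.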

“To build effective agents, they need to work with unlimited tool libraries without stuffing every definition into context upfront,” noted the Anthropic Engineering Team.

Ecosystem Expansion: Chrome, Excel, and Safety

Beyond the core model, “Claude Code” has graduated from beta to general availability, offering a full desktop application for developer workflows. New integrations allow the model to control the Chrome browser directly, moving beyond text generation to active research and task execution.

 

Targeting financial modeling, a dedicated Excel integration allows the model to manipulate spreadsheets with thousands of rows. 

Dianne Na Penn, Head of Product Management for Research at Anthropic, emphasized the importance of this capability: “Knowing the right details to remember is really important in complement to just having a longer context window.”

 

Safety remains a central pillar of the release. The Claude Opus 4.5 system card highlights significant investments in mitigating Chemical, Biological, Radiological, and Nuclear (CBRN) risks.

The System Card explicitly outlines the model’s alignment status:

“Opus 4.5 is the most robustly aligned model we have released to date and, we suspect, the best-aligned frontier model by any developer.”

“Opus 4.5 is harder to trick with prompt injection than any other frontier model in the industry.”

Market Reality: The Agentic Era

Contextualizing the launch, the “November AI Rush” has seen Google, OpenAI, and Anthropic all pivot simultaneously toward autonomous agents, with narratives shifting from “chatbots” to “agents” capable of sustaining tasks for 24+ hours.

[Image: Claude Opus 4.5 benchmarks]

While Anthropic leads in raw benchmarks (80.9%), the margin is razor-thin, with less than 5 percentage points separating the top three contenders. A key trade-off exists in the new architecture: “Tool Search” introduces a search step that may add latency compared to having all tools pre-loaded in context.

Unlike OpenAI’s Windows-native optimization with Codex-Max, Anthropic is betting on a platform-agnostic desktop approach. Memory management has emerged as the new battleground, with OpenAI utilizing “compaction” and Anthropic deploying “Tool Search” to solve the same context-window bottleneck.

 


