GPT-5.4 Targets Anthropic’s Claude With Premium Pricing and Coding Muscle
OpenAI has released GPT-5.4, once again positioning it as its most powerful reasoning model. The new model combines advanced capabilities in reasoning, programming, and professional workflows in a single system. GPT-5.4 is now available in ChatGPT, the API, and Codex, and is aimed particularly at professional users tackling complex tasks.
Direct comparisons with the top models from Anthropic, Google, or xAI are not yet available, neither on Arena.ai nor on Artificial Analysis. The focus on coding and work-related tasks such as presentations makes clear what OpenAI is pitting GPT-5.4 against: Claude Cowork and Claude Code from rival Anthropic. This also explains the higher costs (see below); after all, Anthropic’s top model Claude Opus 4.6 is also quite expensive.
Core Features and Improvements
GPT-5.4 integrates the programming capabilities of GPT-5.3-Codex while improving work with tools, software environments, and professional tasks in spreadsheets, presentations, and documents. The model delivers more precise results with fewer follow-up questions and posts new best scores across a range of evaluations.
In ChatGPT, GPT-5.4 Thinking can now display a work plan in advance, allowing users to make adjustments during processing. This leads to results that better align with actual requirements. Additionally, web search has been improved, particularly for highly specific queries, while context is better preserved even during longer thinking processes.
Performance in Professional Applications
On the GDPval benchmark, which tests the ability of AI agents to produce professional work products across 44 professions, GPT-5.4 achieves a rate of 83.0 percent. This means the model meets or exceeds the performance of industry experts in more than four out of five cases. On spreadsheet tasks, such as those a junior analyst in investment banking would perform, GPT-5.4 achieves an average score of 87.3 percent.
The model also shows significant progress in reducing hallucinations and errors. Individual statements from GPT-5.4 are 33 percent less likely to be false, and complete answers contain 18 percent fewer errors compared to GPT-5.2.
Computer Use and Visual Capabilities
GPT-5.4 is OpenAI’s first universal model with native computer use capabilities. It can operate websites and software systems by writing code or executing mouse and keyboard commands in response to screenshots. On the OSWorld-Verified benchmark, which tests a model’s ability to navigate a desktop environment, GPT-5.4 achieves a success rate of 75.0 percent, even surpassing human performance of 72.4 percent.
The improved visual capabilities are also evident in other areas. On the MMMU-Pro test, which examines visual understanding and reasoning, GPT-5.4 achieves a success rate of 81.2 percent. OpenAI also introduces a new level of detail for image inputs that supports full resolution up to 10.24 million pixels.
Programming Capabilities and Tool Use
In the area of programming, GPT-5.4 combines the strengths of GPT-5.3-Codex with enhanced tool use capabilities. On the SWE-Bench Pro benchmark, the model achieves 57.7 percent while maintaining lower latency. In Codex, the /fast mode delivers up to 1.5 times higher token throughput.
An important innovation is tool search in the API. Instead of including all tool definitions in the prompt in advance, GPT-5.4 can search for specific tools as needed. This significantly reduces the number of tokens required. In tests with 250 tasks from the MCP Atlas benchmark, tool search reduced token usage by 47 percent while maintaining accuracy.
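The idea behind tool search can be illustrated with a minimal keyword-matching sketch. The registry, tool names, and matching logic below are hypothetical stand-ins for illustration only; in the actual API, the model itself decides when to search for and load a tool definition.

```python
# Hypothetical tool registry: names mapped to short descriptions.
# In a naive setup, all of these definitions would be sent with
# every request; with tool search, only matching ones are loaded.
TOOL_REGISTRY = {
    "get_weather": "Return the current weather for a city.",
    "send_email": "Send an email with a subject and body to a recipient.",
    "query_crm": "Look up a customer record in the CRM by name or ID.",
    "render_chart": "Render a bar or line chart from tabular data.",
}

def search_tools(query: str, registry: dict[str, str]) -> list[str]:
    """Return the names of tools whose description mentions a query term."""
    terms = query.lower().split()
    return [
        name
        for name, description in registry.items()
        if any(term in description.lower() for term in terms)
    ]

# Only the matching tool definition enters the prompt, not all four.
print(search_tools("email", TOOL_REGISTRY))  # → ['send_email']
```

The token savings come from the same effect this sketch shows: the prompt carries one relevant definition on demand instead of the full catalog up front.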
Pricing and Availability
GPT-5.4 is now available in ChatGPT for Plus, Team, and Pro users as well as in the API. The model is more expensive per token than GPT-5.2, though higher token efficiency can reduce overall costs for many tasks.
Token Prices in Comparison
| Model | Input (per 1M Tokens) | Cached Input (per 1M Tokens) | Output (per 1M Tokens) |
|---|---|---|---|
| GPT-5.4 | $2.50 | $0.25 | $15.00 |
| GPT-5.4 Pro | $30.00 | not available | $180.00 |
| GPT-5.2 | $1.75 | $0.175 | $14.00 |
| GPT-5.2 Pro | $21.00 | not available | $168.00 |
For the API, batch and flex pricing are available at half the standard price, while priority processing is offered at double the standard price. In Codex, GPT-5.4 experimentally supports a context window of 1 million tokens, with requests over 272,000 tokens counted toward usage limits at double the rate.
For comparison: Anthropic’s Claude Opus 4.6 is priced at $5 per 1M input tokens and $25 per 1M output tokens. The standard GPT-5.4 undercuts that, while the Pro version is significantly more expensive.
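As a quick sanity check on the table, a small calculator shows how batch pricing and model choice play out. The workload of 2M input and 0.5M output tokens is an arbitrary example; the prices are taken from the figures above.

```python
# USD per 1M tokens, (input, output), from the pricing table above.
PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "GPT-5.4 Pro": (30.00, 180.00),
    "GPT-5.2": (1.75, 14.00),
    "Claude Opus 4.6": (5.00, 25.00),
}

def workload_cost(model: str, m_in: float, m_out: float,
                  batch: bool = False) -> float:
    """Cost in USD for m_in/m_out million tokens; batch halves the rate."""
    p_in, p_out = PRICES[model]
    factor = 0.5 if batch else 1.0
    return (m_in * p_in + m_out * p_out) * factor

print(workload_cost("GPT-5.4", 2, 0.5))              # → 12.5
print(workload_cost("GPT-5.4", 2, 0.5, batch=True))  # → 6.25
print(workload_cost("Claude Opus 4.6", 2, 0.5))      # → 22.5
```

Note that raw per-token price is only part of the picture: if GPT-5.4 needs fewer tokens per task, as OpenAI claims, effective costs can still come out lower than the table suggests.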
Security Measures and Restrictions
OpenAI classifies GPT-5.4 as “High Capability” in the cybersecurity domain and has implemented corresponding protective measures. These include enhanced security systems, access controls, and asynchronous blocking for higher-risk requests from customers with zero data retention.
The model was trained to reject requests with harmful intent. On surfaces with zero data retention where the user does not participate in the Trusted Access for Cyber program, asynchronous message-based classifiers are used to block high-risk cyber content. Accounts that reach certain thresholds may be subject to deeper analysis.
In security evaluations, GPT-5.4 generally shows comparable or slightly improved values compared to GPT-5.2. On challenging prompts in prohibited content categories, the model achieves high scores in avoiding unsafe responses. However, there are slight fluctuations in some categories such as violence and sexual content.
Limitations and Challenges
Despite the improvements, GPT-5.4 also has limitations. In chain-of-thought monitorability, which measures whether a monitor can derive security-relevant properties from the model’s reasoning trace, GPT-5.4 shows overall lower values than GPT-5 Thinking. Particularly in areas such as health queries lacking evidence, memory, and impossible tasks, the model performs worse.
In some cyber range scenarios that test the ability to perform complete end-to-end cyber operations, GPT-5.4, with a combined success rate of 73.33 percent, lags behind GPT-5.3-Codex (80 percent) but significantly outperforms earlier models.
In health evaluations, GPT-5.4 achieves 62.6 percent on the HealthBench benchmark, which represents a slight decline compared to GPT-5.2 (63.3 percent). The model also generates longer responses (averaging 3,311 characters compared to 2,676 characters for GPT-5.2), which brings both advantages and disadvantages.
Replacement for GPT-5.2 Thinking
GPT-5.4 represents a significant advance in AI model development and combines advanced reasoning, programming, and computer use capabilities in a single system for the first time. The improvements in professional workflows, error reduction, and tool efficiency make it a valuable tool for demanding applications.
At the same time, the higher costs, mixed results in some evaluation areas, and necessary security measures present challenges. Users must weigh whether the enhanced capabilities justify the higher prices and whether the limitations in certain areas are relevant to their specific use cases.
OpenAI emphasizes that the model continues to be improved and security measures are continuously adjusted. GPT-5.2 Thinking remains available for three months before being discontinued on June 5, 2026.


