OpenAI’s GPT-5.1-Codex-Max Works on Coding Tasks for More Than 24 Hours

Published November 20, 2025

OpenAI debuts the GPT‑5.1-Codex-Max coding model, which has already completed a 24-hour task internally

OpenAI has unveiled a new coding model, GPT‑5.1-Codex-Max, now available in its Codex developer environment. The release marks a notable advance in AI-powered software engineering, with improved long-horizon reasoning, efficiency, and real-time interactive capabilities. GPT‑5.1-Codex-Max replaces GPT‑5.1-Codex as the default model across Codex-integrated interfaces.

The latest model is engineered to function as a persistent, high-context software development agent, proficient in managing intricate refactors, debugging processes, and project-scale tasks spanning multiple context windows.

The release follows Google’s recent launch of its Gemini 3 Pro model, and GPT‑5.1-Codex-Max matches or surpasses it on key coding benchmarks. On SWE-Bench Verified, GPT‑5.1-Codex-Max achieved 77.9% accuracy at extra-high reasoning effort, ahead of Gemini 3 Pro’s 76.2%. It also led on Terminal-Bench 2.0, with 58.1% accuracy compared to Gemini’s 54.2%, and matched Gemini’s score of 2,439 on LiveCodeBench Pro, a competitive coding Elo benchmark.

Compared with Gemini 3 Pro’s most advanced configuration, its Deep Think mode, Codex-Max also holds a slight edge in agentic coding benchmarks.

Performance Benchmarks: Enhanced Performance Across Critical Tasks

GPT‑5.1-Codex-Max shows measurable improvements over GPT‑5.1-Codex across several standard software engineering benchmarks. It achieved 79.9% accuracy on SWE-Lancer IC SWE, up from GPT‑5.1-Codex’s 66.3%. On SWE-Bench Verified (n=500), it reached 77.9% accuracy at extra-high reasoning effort, compared with GPT‑5.1-Codex’s 73.7%.

Gains on Terminal-Bench 2.0 (n=89) were more modest than on SWE-Lancer, with GPT‑5.1-Codex-Max achieving 58.1% accuracy compared to GPT‑5.1-Codex’s 52.8%.

All evaluations were conducted with compaction and extra-high reasoning effort enabled. These results suggest that the new model offers a higher ceiling on both benchmarked correctness and real-world usability under extended reasoning loads.
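To put the figures above side by side, the short Python sketch below computes the percentage-point and relative gains of GPT‑5.1-Codex-Max over GPT‑5.1-Codex. The scores are copied from this article; the function and variable names are illustrative only and nothing here is newly measured.

```python
# Scores in percent, as reported in this article: (GPT-5.1-Codex, GPT-5.1-Codex-Max).
SCORES = {
    "SWE-Lancer IC SWE": (66.3, 79.9),
    "SWE-Bench Verified": (73.7, 77.9),
    "Terminal-Bench 2.0": (52.8, 58.1),
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    points = new - old
    relative = points / old * 100
    return round(points, 1), round(relative, 1)


for name, (old, new) in SCORES.items():
    points, relative = gains(old, new)
    print(f"{name}: +{points} points ({relative}% relative)")
```

The percentage-point gain is the plain difference between two scores, while the relative gain expresses that difference as a share of the older model's score.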

Technical Architecture: Enhanced Long-Horizon Reasoning via Compaction

A significant architectural enhancement in GPT‑5.1-Codex-Max is its capacity to effectively reason over extended input-output sessions through a mechanism called compaction. This feature enables the model to retain crucial contextual information while discarding irrelevant details as it approaches its context window limit, allowing for continuous work across millions of tokens without performance deterioration.
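OpenAI has not published how compaction is implemented, so the following Python sketch only illustrates the general idea described above: when a session's history exceeds a token budget, older messages are collapsed into a short summary and the most recent ones are kept verbatim. Every name here (CompactingHistory, count_tokens, summarize, the limit value) is hypothetical, the word-count "tokenizer" is a crude stand-in, and the summarizer is a placeholder for what would presumably be a model-generated summary.

```python
from dataclasses import dataclass, field


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def summarize(messages: list[str]) -> str:
    """Placeholder summarizer: keeps only the text before the first period of each message."""
    heads = [m.removeprefix("SUMMARY: ").split(".")[0].strip() for m in messages]
    return "SUMMARY: " + "; ".join(h for h in heads if h)


@dataclass
class CompactingHistory:
    limit: int            # hypothetical context budget, in tokens
    keep_recent: int = 2  # number of newest messages that are never summarized
    messages: list[str] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(count_tokens(m) for m in self.messages)

    def add(self, message: str) -> None:
        """Append a message, compacting older ones if the budget is exceeded."""
        self.messages.append(message)
        if self.total_tokens() > self.limit and len(self.messages) > self.keep_recent:
            self.compact()

    def compact(self) -> None:
        """Replace everything except the newest messages with one summary line."""
        cut = len(self.messages) - self.keep_recent
        old, recent = self.messages[:cut], self.messages[cut:]
        self.messages = [summarize(old)] + recent


history = CompactingHistory(limit=12, keep_recent=1)
history.add("Read the parser module. It has three functions and no tests.")
history.add("Wrote tests for parse_expr. Two of them fail on nested input.")
print(history.messages)
```

In this toy version the second message pushes the history over budget, so the first message is reduced to a summary line while the latest message survives unchanged. A production system would need a far better summarizer and a real tokenizer.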

Internally, the model has been observed to complete tasks lasting more than 24 hours, including multi-step refactors, test-driven iteration, and autonomous debugging processes.

Compaction also improves token efficiency. At medium reasoning effort, GPT‑5.1-Codex-Max used approximately 30% fewer thinking tokens than GPT‑5.1-Codex while achieving comparable or better accuracy, which lowers both cost and latency.
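As a rough illustration of what a 30% reduction in thinking tokens means in practice, the Python sketch below applies that saving to a baseline token count and converts the result into cost. The 30% figure comes from this article; the baseline of 100,000 tokens and the price of 10.0 per million tokens are made-up values chosen only to keep the arithmetic easy to follow, not real usage data or OpenAI pricing.

```python
def tokens_after_saving(baseline_tokens: int, saving: float = 0.30) -> int:
    """Thinking tokens left after applying a fractional saving (0.30 means 30% fewer)."""
    return round(baseline_tokens * (1 - saving))


def thinking_cost(tokens: int, price_per_million: float) -> float:
    """Cost of a token count at a given price per one million tokens."""
    return tokens / 1_000_000 * price_per_million


BASELINE_TOKENS = 100_000   # hypothetical thinking tokens for the older model
PRICE_PER_MILLION = 10.0    # hypothetical price, not an actual OpenAI rate

reduced = tokens_after_saving(BASELINE_TOKENS)
print("tokens:", BASELINE_TOKENS, "->", reduced)
print("cost:", thinking_cost(BASELINE_TOKENS, PRICE_PER_MILLION),
      "->", thinking_cost(reduced, PRICE_PER_MILLION))
```

Because cost scales linearly with token count in this simple model, a 30% token saving translates into a 30% saving on the thinking-token portion of the bill, whatever the actual price.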

Platform Integration and Use Cases

GPT‑5.1-Codex-Max is currently available across multiple Codex-based environments, meaning OpenAI’s integrated tools and interfaces built specifically for code-focused AI agents. These include Codex CLI, IDE extensions, interactive coding environments, and internal code review tooling.

GPT‑5.1-Codex-Max is not yet available via the public API, though OpenAI indicates that API access is forthcoming. Users who want to work with the model in terminal environments can do so now by installing and using the Codex CLI.

OpenAI has not confirmed whether or how the model will integrate into third-party IDEs that are not built on the CLI or a future API.

The model can interact with live tools and simulations, as demonstrated by examples including an interactive CartPole policy gradient simulator and a Snell’s Law optics explorer. These interfaces showcase the model’s ability to reason in real time while maintaining an interactive development session, combining computation, visualization, and implementation within a single loop.

Cybersecurity and Safety Constraints

While GPT‑5.1-Codex-Max does not meet OpenAI’s “High” capability threshold for cybersecurity according to its Preparedness Framework, it stands as the most capable cybersecurity model OpenAI has deployed to date. It supports use cases like automated vulnerability detection and remediation, albeit with stringent sandboxing and disabled network access by default.

OpenAI reports no escalation in scaled malicious usage but has introduced enhanced monitoring systems, including activity routing and disruption mechanisms for suspicious behavior. Codex remains isolated to a local workspace unless developers opt for broader access, mitigating risks such as prompt injection from untrusted sources.

Deployment Context and Developer Usage

GPT‑5.1-Codex-Max is presently accessible to users on ChatGPT Plus, Pro, Business, Edu, and Enterprise plans. It will also become the new default in Codex-based environments, replacing GPT‑5.1-Codex, which served as a more general-purpose model.

OpenAI notes that 95% of its internal engineers engage with Codex on a weekly basis, and since adoption, these engineers have delivered approximately 70% more pull requests on average, underscoring the tool’s impact on internal development velocity.

Despite its autonomy and persistence, OpenAI emphasizes that Codex-Max should be regarded as a coding assistant rather than a substitute for human review. The model generates terminal logs, test citations, and tool call outputs to promote transparency in the generated code.

Outlook

GPT‑5.1-Codex-Max marks a substantial step in OpenAI’s approach to agentic development tools, offering greater reasoning depth, token efficiency, and interactive capabilities across software engineering tasks. By extending its context management and compaction strategies, the model is positioned to tackle tasks at the scale of entire repositories rather than individual files or snippets.

With a continued focus on agentic workflows, secure sandboxes, and real-world evaluation metrics, Codex-Max sets the stage for the next era of AI-assisted programming environments, underscoring the significance of oversight in increasingly autonomous systems.