Introducing Terminal-Bench 2.0: Harbor – A Revolutionary Framework for Containerized Agent Testing

Terminal-Bench 2.0 launches alongside Harbor, a new framework for testing agents in containers

The Latest in AI Agent Performance Evaluation: Terminal-Bench 2.0 and Harbor Framework

The developers behind Terminal-Bench, a widely used benchmark suite for assessing autonomous AI agents on real-world terminal tasks, have unveiled version 2.0 alongside Harbor, a new framework for testing, improving, and optimizing AI agents in containerized environments.

The dual release addresses persistent challenges in testing and optimizing AI agents, especially those designed to operate autonomously in realistic developer environments.

Terminal-Bench 2.0, featuring a more demanding and meticulously validated task set, supersedes its predecessor, version 1.0, as the industry standard for evaluating the capabilities of advanced models.

Harbor, the complementary runtime framework, empowers developers and researchers to scale evaluations across a multitude of cloud containers, seamlessly integrating with both open-source and proprietary agents and training pipelines.

Co-creator Alex Shaw highlighted Harbor's value for developers and researchers who want to evaluate and improve agents and models.

Raising the Bar with Enhanced Task Quality

Terminal-Bench 1.0 gained rapid traction after its launch in May 2025, emerging as the go-to benchmark for evaluating agent performance in developer-style terminal environments. These agents interact with systems through the command line, mirroring how developers work beneath graphical user interfaces.

Despite its widespread adoption, Terminal-Bench 1.0 encountered inconsistencies, as certain tasks were deemed poorly defined or unstable due to changes in external services.

Version 2.0 directly addresses these issues by introducing an updated suite comprising 89 tasks, each meticulously validated through manual and LLM-assisted processes. The emphasis is on enhancing task solvability, realism, and specificity, elevating the difficulty level while ensuring improved reliability and reproducibility.


For instance, the download-youtube task was either eliminated or revamped in 2.0 due to its reliance on unreliable third-party APIs.

Posting on X, Shaw acknowledged that state-of-the-art performance looks similar between Terminal-Bench 1.0 and 2.0, while emphasizing that the updated benchmark's tasks are of significantly higher quality.

Harbor: Enabling Unified Rollouts at Scale

Complementing the benchmark upgrade, the team rolled out Harbor, a novel framework tailored for executing and evaluating agents within cloud-deployed containers.

Harbor provides large-scale rollout infrastructure and is compatible with major cloud container providers such as Daytona and Modal.

Designed to accommodate diverse agent architectures, Harbor supports:

  • Evaluation of any container-installable agent

  • Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines

  • Custom benchmark creation and deployment

  • Seamless integration with Terminal-Bench 2.0

Harbor was instrumental in conducting tens of thousands of rollouts internally during the development of the new benchmark. It is now accessible to the public via harborframework.com, complete with documentation for testing and submitting agents to the public leaderboard.

Initial Results: GPT-5 Leads Task Performance

Early results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI running GPT-5 leading the pack with a 49.6% success rate, the highest among all tested agents.

Following closely behind are other GPT-5 iterations and agents based on Claude Sonnet 4.5.

Top 5 Agent Performances (Terminal-Bench 2.0):

  1. Codex CLI (GPT-5) — 49.6%

  2. Codex CLI (GPT-5-Codex) — 44.3%

  3. OpenHands (GPT-5) — 43.8%

  4. Terminus 2 (GPT-5-Codex) — 43.4%

  5. Terminus 2 (Claude Sonnet 4.5) — 42.8%

The tight spread among the top entries points to close competition across labs, with no single agent solving more than half of the tasks.

Agent Submission and Utilization

Users can install Harbor and start benchmarking with simple CLI commands to test or submit agents. Leaderboard submissions require five benchmark runs, with results and job directories emailed to the benchmark's developers for validation.

harbor run -d terminal-bench@2.0 -m "<model>" -a "<agent>" --n-attempts 5 --jobs-dir <path/to/output>
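In this command, the <model> and <agent> placeholders name the model and agent under test, -d terminal-bench@2.0 targets the new task set, --n-attempts 5 corresponds to the five required runs, and --jobs-dir sets the output directory whose contents accompany a submission.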

Terminal-Bench 2.0 is already being integrated into research workflows on agentic reasoning, code generation, and tool use. Mike Merrill, a co-creator and postdoctoral researcher at Stanford, said an upcoming preprint will detail the benchmark's verification process and design methodology.

Striving for Standardization

The simultaneous release of Terminal-Bench 2.0 and Harbor signifies a significant stride toward establishing consistent and scalable agent evaluation infrastructure. As LLM agents become more prevalent in developer and operational settings, the demand for controlled, reproducible testing has escalated.

These tools lay the groundwork for a potential unified evaluation stack, supporting model enhancement, environment simulation, and benchmark standardization across the AI landscape.
