Polymath

AI coding agents have become remarkably capable within the IDE, progressing from autocomplete to single-file edits to changes across an entire repository. However, true software engineering requires operating beyond the editor.

In practice, engineers reason across time, tools, and uncertainty: implementing a spec that requires modifying multiple services, managing CI/CD pipelines, resolving Sentry and Linear tickets, inspecting GCP metrics and logs, coordinating with teammates on Slack, and more. These workflows unfold over hours or days, involve reasoning about changes in the environment, and require a variety of tools.

Agents aren't good at this yet. We believe the next step-change in agentic coding will come from training and evaluating agents in realistic, long-horizon environments that span the full software development lifecycle.

Building these environments is non-trivial. We're creating environments that behave like real production systems: tasks unfold over time, dependencies are enforced, and performance is defined by verifiable outcomes. We obsess over realism and quality because RL on bad data only degrades model performance, and we're building the core infrastructure to produce and run high-fidelity environments and tasks at scale.

As professional software engineers and researchers, we've experienced firsthand how AI has transformed our work. We believe the next frontier is enabling agents to operate reliably outside of the IDE. The most difficult engineering problems live at the seams: deploying code, debugging production systems, and coordinating with teammates. Agents should work reliably at these boundaries, not just inside the repository.

We're excited about this future, and we work with frontier labs to customize and scale these environments, unlocking greater autonomy and reliability in software engineering agents. As agents improve, we introduce harder tasks, more complex environments, and longer horizons, continuously pushing the frontier of what's possible.

Towards Greater Reliability and Autonomy in Software Engineering Agents

Introducing Polymath