Polymath

We build the most advanced environments for training and evaluating AI agents on long-horizon, multi-tool tasks in any domain.

An environment has five components:

  1. Applications & tools: the tools agents interact with (e.g. Slack, email, web, Excel, GitHub, Linear)
  2. Data: information seeded into the environment which represents the initial state
  3. Tasks: descriptions of what the agent should accomplish
  4. Verifiers: rubrics that evaluate how well agents perform on tasks in the environment
  5. Agent(s): AI actor(s) that navigate the environment and complete tasks using the available tools
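
As a rough illustration of how the five components listed above might fit together, here is a minimal Python sketch. It is not Polymath's implementation: every name in it (`Tool`, `Task`, `Environment`, `send_email`, `scripted_agent`, `build_demo_environment`) is hypothetical, the "agent" is a hard-coded script rather than a model, and the verifier is a single toy check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of these names come from Polymath.
State = Dict[str, List[str]]  # seeded data, e.g. {"email": ["...message..."]}


@dataclass
class Tool:
    """1. Applications & tools: something the agent can call."""
    name: str
    run: Callable[[State, str], str]  # mutates state, returns an observation


@dataclass
class Task:
    """3. Tasks: a description of what the agent should accomplish."""
    description: str


@dataclass
class Environment:
    tools: Dict[str, Tool]
    data: State                                 # 2. Data: the initial state
    task: Task
    verifiers: List[Callable[[State], float]]   # 4. Verifiers: each returns a score in [0, 1]

    def step(self, tool_name: str, argument: str) -> str:
        """Let the agent invoke one tool against the current state."""
        return self.tools[tool_name].run(self.data, argument)

    def score(self) -> float:
        """Average the verifier scores over the current state."""
        return sum(v(self.data) for v in self.verifiers) / len(self.verifiers)


def send_email(state: State, body: str) -> str:
    """Toy email tool: appends the message body to the outbox."""
    state.setdefault("email", []).append(body)
    return "sent"


def scripted_agent(env: Environment) -> None:
    """5. Agent: a stand-in that performs one fixed action."""
    env.step("email", "Weekly report: all systems nominal")


def build_demo_environment() -> Environment:
    return Environment(
        tools={"email": Tool("email", send_email)},
        data={"email": []},
        task=Task("Send the weekly report by email."),
        verifiers=[
            lambda s: 1.0 if any("report" in m.lower() for m in s["email"]) else 0.0
        ],
    )


env = build_demo_environment()
before = env.score()   # verifier fails on the untouched initial state
scripted_agent(env)
after = env.score()    # verifier passes once the report has been sent
```

Keeping verifiers as plain functions over the environment state means the same seeded data and task can be scored under different rubrics without changing the tools or the agent.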

Introducing Systems-Bench
A long-horizon, multi-tool SWE agent benchmark