Exploring the Limits of LLMs through Simulation.
We’re building simulation harnesses that run LLMs and AI agents through realistic scenarios, push them to their breaking points, and learn where they succeed and fail.
OUR MISSION
Today’s LLM benchmarks are scored on static datasets, which don’t capture how models behave in open-ended, real-world conditions. We want to take them to their breaking points in controlled, simulated environments.
What We're Working On
Simulated Workflows
Creating controlled environments that mirror real-world tasks.
Stress-Testing Agents
Running multi-step tool use and reasoning chains until they fail (a minimal sketch follows below).

Failure Analysis
Identifying edge cases, bottlenecks, and brittle behaviors.

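To make the workflow concrete, here is a minimal sketch of the loop the three items above describe: a simulated task, an agent stepped through it until it breaks, and the failure point recorded for analysis. The names (SimulatedEnvironment, stress_test, brittle_agent) are illustrative assumptions, not our actual harness.

import random


class SimulatedEnvironment:
    """Toy task: the agent must report a running total across steps."""

    def __init__(self, num_steps, seed=0):
        rng = random.Random(seed)
        self.values = [rng.randint(1, 9) for _ in range(num_steps)]
        self.step_idx = 0

    def observation(self):
        return {"step": self.step_idx, "value": self.values[self.step_idx]}

    def check(self, answer):
        # The correct answer is the cumulative sum up to the current step.
        return answer == sum(self.values[: self.step_idx + 1])

    def advance(self):
        self.step_idx += 1
        return self.step_idx < len(self.values)


def stress_test(agent, max_steps=50, seed=0):
    """Step the agent through the task until it fails or the task ends."""
    env = SimulatedEnvironment(max_steps, seed)
    history = []
    while True:
        obs = env.observation()
        answer = agent(obs, history)  # agent sees the observation and its own history
        history.append((obs, answer))
        if not env.check(answer):
            return {"failed": True, "failure_step": obs["step"], "history": history}
        if not env.advance():
            return {"failed": False, "failure_step": None, "history": history}


def brittle_agent(obs, history):
    """Stand-in agent that loses track of the total after ten steps."""
    running_total = sum(o["value"] for o, _ in history) + obs["value"]
    return running_total if obs["step"] < 10 else running_total + 1


print(stress_test(brittle_agent))  # reports the first step where the agent breaks

In practice the environment wraps real tools and the agent is an LLM, but the shape of the loop is the same: observe, act, verify, and stop at the first divergence.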
Why We’re Doing This
Current benchmarks don’t capture the complexity of how LLMs and AI agents behave in production. We believe in systematically stress-testing them in simulated environments.
Complexity Outpaces Testing
As agents grow more capable, their decision spaces expand faster than our ability to evaluate them. We need new methods to keep up.
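A rough back-of-the-envelope sketch makes the point; the numbers below are illustrative assumptions, not measurements from our experiments.

# Illustrative only: how fast the space of possible agent trajectories grows
# when an agent picks one of a handful of tools at every step.
tools_per_step = 5  # assumed number of available actions per step
for steps in (1, 5, 10, 20):
    trajectories = tools_per_step ** steps
    print(f"{steps:>2} steps -> {trajectories:,} possible trajectories")
# At 20 steps that is roughly 95 trillion trajectories, far more than any
# static benchmark can enumerate, let alone label.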
Benchmarks Miss Reality
Standard leaderboards measure narrow skills, but they don’t reflect how LLMs and agents perform in open-ended environments. We want to close that gap.