We've looked at over 200 companies pitching in the AI agents space over the last two years. Maybe a fifth of them are building something that will exist in five years. The rest are demos — impressive demos, often — that haven't yet confronted the hard problems of running autonomously in production environments where things break, context limits are hit, and the user isn't watching.
Let's be direct about what separates the companies we fund from the ones we pass on.
What "AI Agent" Actually Means
The definition matters because it's gotten blurred. A chatbot with memory is not an agent. A retrieval-augmented pipeline is not an agent. An agent, as we use the term, is a system that takes a goal, breaks it into subtasks, executes those subtasks using tools or other models, handles errors and unexpected outputs, and produces a result without a human in the loop for each step.
That last part — no human in the loop for each step — is where almost every demo breaks down. Doing one step autonomously with a curated example is easy. Doing twelve steps in a real enterprise environment, where APIs rate-limit, tools return malformed responses, and the original goal turns out to be underspecified — that's the product.
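To make that concrete, here is a minimal sketch of the loop we have in mind when we say "agent." Every name in it (plan, execute_step, recover, Step) is a hypothetical stand-in rather than a real library; the point is the shape: decompose the goal, execute each subtask with a tool, and handle failures without a human per step.

```python
# A minimal agent loop, for illustration only. plan, execute_step, and
# recover are hypothetical callables supplied by the orchestrator.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Step:
    description: str
    tool: str            # which tool this subtask calls
    result: Any = None
    done: bool = False

def run_agent(goal: str,
              plan: Callable[[str], list[Step]],
              execute_step: Callable[[Step], Any],
              recover: Callable[[Step, Exception], Any]) -> list[Step]:
    """Drive a goal to completion with no human in the loop per step."""
    steps = plan(goal)                     # decompose the goal into subtasks
    for step in steps:
        try:
            step.result = execute_step(step)   # call a tool or sub-model
        except Exception as err:
            # The hard part: deciding what a failure means -- retry,
            # replan, or escalate -- without asking the user each time.
            step.result = recover(step, err)
        step.done = True
    return steps
```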
The Failure Mode We Keep Seeing
Most agent startups we talk to have solved the happy path. The demo works. In one pitch meeting in early 2025, Aiden ran a live demo in which the agent completed a multi-step workflow flawlessly. Then we asked what happens when step 4 fails. The answer was essentially "the user retries." That's not an agent — that's a fancy UI on top of a manual process.
Production agent systems need error recovery that doesn't require human intervention on every failure. They need to understand the difference between a recoverable error (retry with backoff), a fatal error (stop and escalate), and an ambiguous state (request clarification with enough context for the user to make a decision in ten seconds, not ten minutes). Building that error taxonomy is unglamorous work. It's also the work that determines whether your product can be trusted with anything important.
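Here is a sketch of what that taxonomy can look like in code. The three severity classes come straight from the paragraph above; the classify() mapping and handle() policy are illustrative assumptions, since in practice the mapping is built per tool from observed failures.

```python
# An illustrative error taxonomy: recoverable, fatal, ambiguous.
# The classify() rules are placeholders; a real mapping is built per
# tool from observed failures, not from exception types alone.
import time
from enum import Enum, auto

class Severity(Enum):
    RECOVERABLE = auto()   # transient: rate limits, timeouts
    FATAL = auto()         # invariant broken: stop and escalate
    AMBIGUOUS = auto()     # goal underspecified: ask the user, with context

def classify(err: Exception) -> Severity:
    if isinstance(err, (TimeoutError, ConnectionError)):
        return Severity.RECOVERABLE
    if isinstance(err, PermissionError):
        return Severity.FATAL
    return Severity.AMBIGUOUS

def handle(step, err: Exception, attempt: int, max_retries: int = 3):
    severity = classify(err)
    if severity is Severity.RECOVERABLE and attempt < max_retries:
        time.sleep(2 ** attempt)           # retry with exponential backoff
        return ("retry", {"attempt": attempt + 1})
    if severity is Severity.AMBIGUOUS:
        # Enough context for a ten-second decision: what was attempted,
        # what came back, and what the options are.
        return ("clarify", {"step": step, "error": str(err)})
    # Fatal, or recoverable but out of retries: stop and escalate.
    return ("escalate", {"step": step, "error": str(err)})
```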
What We Actually Look For
When we evaluate an agent company, we're assessing four things:
1. State management. Agents that run tasks over minutes or hours need durable state. If the process crashes at step 7 of 12, can it resume? If the underlying model is updated, does the agent's behavior change in ways that break ongoing workflows? The companies we like have thought carefully about state as a first-class primitive — not an afterthought serialized to a JSON blob. (A minimal checkpointing sketch follows this list.)
2. Tool reliability and graceful degradation. Every interesting agent uses external tools. Every external tool fails sometimes. The orchestration layer needs to handle tool failures without losing the thread of the task. We pay close attention to how teams have modeled tool contracts — what the tool promises to return, what happens when it doesn't, and how that flows back into the task graph. (See the tool-contract sketch after this list.)
3. Evaluation infrastructure. Our data shows that the agent companies that ship reliably into enterprise accounts have invested heavily in evaluation before their first production deployment. Not "we ran some manual tests" evaluation — systematic evals across task types, failure modes, and model versions. If a team can't show us their eval harness, we treat that as a significant risk factor. (A minimal harness is sketched after this list.)
4. The right domain. We're skeptical of horizontal agent platforms trying to do everything. The economics of building agent tooling are hard enough without also fighting for every possible use case. The companies that get traction fast pick a domain — legal contract review, software engineering workflows, finance operations — where the task space is bounded, the failure modes are predictable, and there's a clear buyer who feels the pain directly.
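On the first point, here is a minimal sketch of durable, per-step state, using sqlite and a made-up schema as stand-ins for whatever store a team actually runs. Checkpointing each step as its own row, with the model version pinned, is what makes "resume at step 7 of 12" a query rather than a rerun.

```python
# Durable per-step state, sketched with sqlite. The schema is an
# assumption for illustration; the property that matters is that each
# step commits before the next one starts.
import json
import sqlite3

class RunState:
    def __init__(self, path: str, run_id: str, model_version: str):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            " run_id TEXT, idx INTEGER, result TEXT, model TEXT,"
            " PRIMARY KEY (run_id, idx))")
        self.run_id = run_id
        # Pin the model version so an upstream model update can't
        # silently change behavior mid-workflow.
        self.model_version = model_version

    def completed(self) -> int:
        """Steps already finished; resume from this index after a crash."""
        row = self.db.execute(
            "SELECT COUNT(*) FROM steps WHERE run_id = ?",
            (self.run_id,)).fetchone()
        return row[0]

    def record(self, idx: int, result: dict) -> None:
        # One row per step, not one opaque blob for the whole run:
        # resumable, queryable, and diffable across model versions.
        self.db.execute(
            "INSERT OR REPLACE INTO steps VALUES (?, ?, ?, ?)",
            (self.run_id, idx, json.dumps(result), self.model_version))
        self.db.commit()   # durable before the next step starts
```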
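On the second point, one plausible way to model a tool contract: declare what the tool promises to return, validate every response against it, and turn violations into typed failures the task graph can route on. The names and dict-shaped responses here are assumptions for illustration.

```python
# An illustrative tool contract. required_fields is what the tool
# promises to return; violations become typed failures rather than
# crashes, so the orchestrator can retry, fall back, or escalate.
from dataclasses import dataclass

class TransientToolFailure(Exception):
    """Transient failure: worth retrying with backoff."""

class ContractViolation(Exception):
    """The tool returned something outside its declared contract."""

@dataclass
class ToolContract:
    name: str
    required_fields: tuple[str, ...]
    retryable_errors: tuple[type, ...]   # e.g. (TimeoutError,)

def call_tool(contract: ToolContract, tool_fn, *args) -> dict:
    try:
        response = tool_fn(*args)        # assumed to return a dict
    except contract.retryable_errors as err:
        raise TransientToolFailure(contract.name) from err
    missing = [f for f in contract.required_fields if f not in response]
    if missing:
        # A malformed response is a distinct failure class: the task
        # graph can re-prompt, pick a fallback tool, or escalate.
        raise ContractViolation(f"{contract.name} missing {missing}")
    return response
```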
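And on the third point, a minimal sketch of the kind of eval harness we ask to see: a fixed set of task cases, each tagged by task type or injected failure mode, scored per tag so regressions show up per slice rather than as one blended number. The case format and runner signature are assumptions.

```python
# A minimal eval harness sketch. agent is any callable from task text
# to output text; the EvalCase format is an assumption for illustration.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    task: str
    check: Callable[[str], bool]    # did the output satisfy the task?
    tags: tuple[str, ...] = ()      # task type, injected failure mode, ...

def run_evals(agent: Callable[[str], str],
              cases: list[EvalCase]) -> dict[str, float]:
    results: dict[str, list[bool]] = {}
    for case in cases:
        try:
            passed = case.check(agent(case.task))
        except Exception:
            passed = False           # a crash counts as a failure
        for tag in case.tags or ("untagged",):
            results.setdefault(tag, []).append(passed)
    # Pass rate per tag: run this across model versions and diff the
    # slices, not the blended average.
    return {tag: sum(v) / len(v) for tag, v in results.items()}
```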
The Trust Problem
There's a meta-issue that underlies all of this. Agents are being asked to act with real-world consequences — sending emails, executing transactions, updating databases. The trust model for that is completely different from a search interface or a chatbot. Users need to understand what the agent did, why it did it, and what they'd need to do to undo it.
Explainability in agents isn't about understanding the model's weights. It's about giving users a readable audit trail of decisions and actions. That's a product design problem, not a research problem.
The companies building that audit trail well — structured action logs, human-readable decision summaries, clear rollback mechanisms — are earning enterprise trust faster than those with better benchmark numbers but opaque execution. We've seen this play out in several portfolio companies. Buyers who said they "weren't ready for agents" became early adopters once they understood exactly what was happening.
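As a sketch of that product surface, consider logging every action as a structured record that pairs machine-readable inputs with a one-sentence human summary and an explicit undo. The field names here are illustrative, not a standard.

```python
# An illustrative structured action log: each record carries what the
# agent did, why a reviewer should care, and how to reverse it.
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ActionRecord:
    timestamp: float
    action: str              # e.g. "send_email", "update_record"
    inputs: dict             # what the agent acted on
    summary: str             # one sentence a reviewer can read quickly
    undo: Optional[str]      # how to reverse it, or None if irreversible

def log_action(trail: list, action: str, inputs: dict,
               summary: str, undo: Optional[str]) -> None:
    record = ActionRecord(time.time(), action, inputs, summary, undo)
    trail.append(record)
    # JSON lines keep the trail machine-queryable and exportable into
    # whatever review UI sits on top of it.
    print(json.dumps(asdict(record)))
```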
Market Timing
One more thing worth saying plainly: most of the agentic value creation is still ahead of us. The current generation of models can handle multi-step reasoning on well-scoped tasks. The next generation will handle ambiguity significantly better. And the generation after that will start to take on judgment calls that previously required an experienced human.
We're investing now because the infrastructure layer — the orchestration runtimes, the tool registries, the eval frameworks, the observability tools — needs to be built before the applications layer can mature. If you wait for perfect models, the plumbing will be owned by someone else.
Building agentic infrastructure or an agent-native product? Send us your deck.