Founder Advice

From Research to Product:
The Gap That Kills AI Startups

We've passed on some companies we regret, and funded some we shouldn't have. In both directions, the most common mistake was misjudging how far along a team was in the transition from research to product. It's the gap that kills more AI startups than bad technology, bad timing, or bad luck combined — and it's one of the hardest things to assess from outside.

What the Gap Looks Like

The transition from research to product isn't a single step. It's a series of them, each requiring a different set of skills and a different way of thinking about what success means.

Research success: your method achieves a new state-of-the-art on a benchmark. The metric is clean, the comparison is fair, and the peer review process validates the result. The fact that your benchmark was carefully constructed, that the test set was curated, and that the deployment environment has nothing in common with your training environment — none of that disqualifies the result.

Product success: a user completes a task with your tool, pays for it, and comes back. The metric is messy. The user has context your model doesn't have, asks questions your training data never covered, and judges the output against a standard of "is this actually useful" rather than "does this score higher on BLEU."

The translation between these two success definitions is where teams get lost.

The Three Failure Modes

Failure mode 1: Benchmark overfitting masquerading as capability. A model that scores 92% on a task-specific benchmark built by the same team that built the model is not necessarily 92% useful. We've seen this play out in demo environments that work flawlessly and production deployments that fall apart on the first day. The eval was too close to the training distribution. The team knew the failure cases and avoided them in the benchmark design. When real users showed up with real edge cases, the model had no defenses.

The fix is brutal: bring in users early, set up blind evaluations, and actively seek out the inputs that break your system before you ship. Teams that do this before Series A arrive at the conversation with a much cleaner story — not "our benchmark shows X" but "we tested with 50 domain experts over 3 months and here's what they told us."
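What "blind evaluation" means in practice is simple to sketch. The following is a minimal, hypothetical illustration (the Sample type and the collect_rating callback are our inventions, not any particular team's tooling): raters see two outputs for the same prompt in shuffled order, so they can't favor the system they're rooting for.

```python
# Minimal sketch of a blind pairwise evaluation, assuming you already have
# paired outputs from your model and a baseline on the same prompts.
# Sample and collect_rating are hypothetical names used for illustration.
import random
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    model_output: str
    baseline_output: str

def blind_eval(samples: list[Sample], collect_rating) -> dict[str, int]:
    """Show raters both outputs in shuffled order so they cannot tell
    which system produced which, then tally wins per system."""
    wins = {"model": 0, "baseline": 0, "tie": 0}
    for s in samples:
        pair = [("model", s.model_output), ("baseline", s.baseline_output)]
        random.shuffle(pair)  # hide which side came from which system
        # collect_rating returns 0 (first shown wins), 1 (second wins), or None (tie)
        choice = collect_rating(s.prompt, pair[0][1], pair[1][1])
        if choice is None:
            wins["tie"] += 1
        else:
            wins[pair[choice][0]] += 1
    return wins
```

The point of the shuffle is that the rater's judgment, not the team's hopes, decides which output is better; the same harness also becomes the place to collect the breaking inputs you went looking for.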

Failure mode 2: Research culture in a product organization. Research culture optimizes for generality, rigor, and novelty. Product culture optimizes for solving a specific customer problem, shipping it, and learning fast. These are genuinely different operating modes, and a team that has only ever done research often doesn't realize it's stuck in the wrong mode until users start churning.

Signals: sprint cycles dominated by architecture experiments rather than user feedback loops. Decisions made by consensus in long meetings rather than by designated decision-makers. Discomfort with shipping anything that isn't "ready." A roadmap organized around techniques rather than user jobs-to-be-done.

Failure mode 3: The demo is the product. AI is particularly susceptible to this because demos are easy to make impressive. You can choose the inputs, preload the context, hide the latency, and show only the outputs that work. A compelling demo gets you meetings, sometimes gets you early customers, occasionally gets you funded. What it doesn't do is get you to retention.

We've funded companies that had extraordinary demos and very thin products underneath. We regret some of those investments. What we've learned to ask is: "Show me the worst output your system has produced in the last week." The founders who can answer that question with specificity — who have logging, who have error analysis, who are actively studying failures — are building real products. The ones who deflect or can't answer are still in demo mode.
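The infrastructure behind a good answer to that question doesn't need to be elaborate. Here is a hypothetical sketch of what "we have logging and can pull our worst outputs" can look like at its smallest; the table name, columns, and rating field are assumptions for illustration, not a prescription.

```python
# Hypothetical minimal logging that makes "show me your worst output from
# the last week" an answerable question. Schema and field names are assumed.
import sqlite3
import time

conn = sqlite3.connect("outputs.db")
conn.execute("""CREATE TABLE IF NOT EXISTS outputs (
    ts REAL, prompt TEXT, output TEXT, user_rating INTEGER)""")

def log_output(prompt: str, output: str, user_rating: int) -> None:
    """Record every production response along with the user's rating."""
    conn.execute("INSERT INTO outputs VALUES (?, ?, ?, ?)",
                 (time.time(), prompt, output, user_rating))
    conn.commit()

def worst_outputs_last_week(limit: int = 10):
    """The diligence question as a query: lowest-rated outputs, past 7 days."""
    week_ago = time.time() - 7 * 24 * 3600
    return conn.execute(
        "SELECT prompt, output, user_rating FROM outputs "
        "WHERE ts > ? ORDER BY user_rating ASC LIMIT ?",
        (week_ago, limit)).fetchall()
```

Whether it's a spreadsheet, a SQLite table, or a full observability stack matters far less than the habit: every output is recorded, failures are rated, and someone reads the worst ones every week.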

What the Successful Transition Looks Like

The companies that make the research-to-product transition well share a few characteristics in our experience:

They pick one specific task and get very good at it before expanding. Not "AI for legal work" but "AI for first-pass contract review of NDAs under 20 pages." The narrow scope forces contact with real users on a real task, builds domain knowledge, and creates evaluation infrastructure that actually measures product quality.

They find a domain expert co-founder or an early customer who can be a genuine collaborator on product direction. Research teams building for a domain they don't deeply know are flying blind. The best AI founders we've backed have someone in the organization who understands the workflow they're automating from the inside.

They hire product engineers early. Not "product engineers" in the pejorative sense of people who only do frontend work: we mean engineers who treat reliability, error handling, user experience, and failure recovery as primary concerns. AI companies often delay this hire too long because the founding team is made up of researchers who don't always know what they're missing.

The best signal that a team has made the transition: they can describe their product's failure modes in more detail than its capabilities. That means they've been in production long enough to know what breaks.

What We Ask in Diligence

When we evaluate an AI company at Seed or Series A, we spend a significant fraction of our diligence on the research-to-product question. We ask to see the eval infrastructure. We ask about the worst user complaints in the last 30 days. We ask what the team has deliberately decided not to build, and why. We ask how decisions get made when the right answer isn't technically obvious.

The teams that give us confident, specific answers to these questions usually have something real. The teams that are still working it out are earlier than they think they are. That's not disqualifying — it just changes how we price the investment and what kind of support we think they need from us.

Building an AI product and navigating this transition? We've seen this pattern a lot — let's talk.