The training race is over. Or rather, it stopped being a race that any infrastructure startup could meaningfully participate in around mid-2024. When it costs $100M+ to train a frontier model, the pool of serious competitors collapses to half a dozen organizations with the GPU commitments and research depth to sustain it. For everyone else, the interesting work shifted.
Inference is where the money is actually flowing now. And not just money — architectural complexity, engineering talent, and strategic moats. We've spent the last eighteen months digging into this space, and the pattern is clear: the companies building serving infrastructure are being valued like early SaaS. That's still too cheap.
What Training vs. Inference Actually Means
Training is the phase where you take raw compute and data, run gradient descent for weeks or months, and produce a model checkpoint. It's expensive, relatively rare, and largely completed by the frontier labs. Inference is what happens when that model answers a question. Every single API call, every agent invocation, every embedded feature in a product — that's inference.
The ratio matters here. A model might be trained once every six months. It gets called for inference billions of times a day. The economic weight shifted dramatically the moment models started getting embedded in real applications, and that happened faster than most infrastructure investors priced in.
The Serving Layer Is Not a Commodity
The naive take is: inference is just matrix multiplications, GPUs are fungible, cloud providers will commoditize it. That view misses three things.
First, the hardware isn't uniform. A100s, H100s, custom ASICs from both established players and startups — each has different memory bandwidth and a different throughput profile for a given model architecture. Routing a request to the right hardware for that specific model and batch size is a real optimization problem. The companies solving it well are shaving 30-50% off cost at scale.
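To make the optimization concrete, here's a minimal sketch of cost-per-token routing across heterogeneous pools. The pool names and numbers are illustrative placeholders, not benchmarks, and real schedulers also fold in queue depth, batching, and interconnect topology.

```python
from dataclasses import dataclass

# Hypothetical hardware profiles; the numbers are illustrative, not measured.
@dataclass
class Pool:
    name: str
    mem_gb: int             # per-device memory
    tokens_per_sec: int     # rough decode throughput for this model family
    usd_per_hour: float

POOLS = [
    Pool("a100-80g", mem_gb=80, tokens_per_sec=2400, usd_per_hour=3.50),
    Pool("h100-80g", mem_gb=80, tokens_per_sec=5200, usd_per_hour=8.00),
    Pool("custom-asic", mem_gb=64, tokens_per_sec=4000, usd_per_hour=4.20),
]

def cheapest_pool(model_mem_gb: float, required_tps: int) -> Pool:
    """Pick the lowest cost-per-token pool that fits the model and meets throughput."""
    feasible = [p for p in POOLS
                if p.mem_gb >= model_mem_gb and p.tokens_per_sec >= required_tps]
    if not feasible:
        raise ValueError("no pool satisfies the memory/throughput constraints")
    # Cost per token, not cost per hour, is the quantity that matters.
    return min(feasible, key=lambda p: p.usd_per_hour / p.tokens_per_sec)
```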
Second, model architectures are diverging fast. Mixture-of-experts, speculative decoding, quantization variants, adapter-based routing — each changes what an optimal serving system looks like. A serving infrastructure that was designed for dense transformers in 2023 is already showing its age. The companies that built with architecture-agnostic abstractions have a durable advantage.
Third, latency is load-bearing. For consumer products, users notice anything above 500ms to first token. For agentic workflows making dozens of sequential LLM calls, per-call latency compounds across the chain. The difference between 80ms and 200ms time-to-first-token isn't an engineering nicety — it's the difference between a product that ships and one that doesn't.
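A back-of-the-envelope sketch of how that compounding plays out in a sequential agent chain; the per-step decode and tool overhead is an assumption for illustration, not a measurement.

```python
def workflow_latency(ttft_ms: float, calls: int, other_ms: float = 50.0) -> float:
    """End-to-end latency of a chain of sequential LLM calls.
    ttft_ms: time to first token per call; other_ms: assumed decode/tool overhead per step."""
    return calls * (ttft_ms + other_ms)

# A 20-step agent chain: a 120ms-per-call gap becomes a 2.4-second gap end to end.
fast = workflow_latency(80, 20)    # 2,600 ms
slow = workflow_latency(200, 20)   # 5,000 ms
```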
Where We're Investing
We look at three categories within inference infrastructure:
Routing and orchestration. Not every query needs a 70B model. A lot of production systems are dramatically over-provisioned because they lack the tooling to route intelligently. A routing layer that sends 70% of requests to a smaller specialized model while escalating only the hard ones to a frontier model can cut cost by an order of magnitude with negligible quality loss. We've seen this deployed at production scale and the unit economics are striking.
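In pseudocode-level terms, the pattern looks something like the toy router below. The model names and the difficulty heuristic are placeholders; production routers typically use a learned classifier or the small model's own confidence signal to decide when to escalate.

```python
SMALL_MODEL = "small-specialist-8b"   # hypothetical model names
FRONTIER_MODEL = "frontier-70b"

def is_hard(prompt: str) -> bool:
    """Stand-in difficulty check. In practice this is a cheap classifier
    or a confidence score, not a string heuristic."""
    return len(prompt) > 2000 or "prove" in prompt.lower()

def route(prompt: str) -> str:
    """Send easy traffic to the small model; escalate only the hard queries."""
    return FRONTIER_MODEL if is_hard(prompt) else SMALL_MODEL
```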
Hardware abstraction. The organizations deploying AI at scale don't want to think about which GPU cluster is running their request. They want SLAs, cost ceilings, and a clean API. The infrastructure layer that sits between raw compute and application developers — handling scheduling, failover, and cost optimization transparently — is genuinely valuable and genuinely difficult to build.
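One way to picture that contract: the application team declares constraints, and the abstraction layer turns them into placement, autoscaling, and failover decisions. The field names below are illustrative, not any particular vendor's API.

```python
from dataclasses import dataclass

@dataclass
class DeploymentSpec:
    """What an application team actually wants to declare (illustrative fields)."""
    model: str
    p95_ttft_ms: int          # latency SLA the scheduler must satisfy
    max_usd_per_month: float  # hard cost ceiling
    min_availability: float   # e.g. 0.999; drives the failover policy

spec = DeploymentSpec(model="frontier-70b", p95_ttft_ms=200,
                      max_usd_per_month=50_000, min_availability=0.999)
# The layer's job: satisfy this declaration on whatever hardware is available and cheapest,
# without the application developer ever naming a GPU cluster.
```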
Serving optimization for specific architectures. Some of the most compelling technical companies we've seen are essentially doing compiler work for neural networks. Custom attention kernels, fused operations, KV cache management that's specific to an architecture — this is deep systems work that produces a defensible performance edge.
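For readers less familiar with the internals, here is a stripped-down illustration of the KV cache idea that much of this kernel work revolves around. Shapes are simplified, and real systems layer paging, quantization, and eviction on top of this.

```python
import numpy as np

class KVCache:
    """Minimal preallocated KV cache for one attention layer (illustrative only)."""
    def __init__(self, max_tokens: int, n_heads: int, head_dim: int):
        self.k = np.zeros((max_tokens, n_heads, head_dim), dtype=np.float16)
        self.v = np.zeros((max_tokens, n_heads, head_dim), dtype=np.float16)
        self.len = 0

    def append(self, k_new: np.ndarray, v_new: np.ndarray) -> None:
        """Store keys/values for newly decoded tokens so attention never recomputes them."""
        n = k_new.shape[0]
        if self.len + n > self.k.shape[0]:
            raise RuntimeError("cache full; a real server would page or evict here")
        self.k[self.len:self.len + n] = k_new
        self.v[self.len:self.len + n] = v_new
        self.len += n
```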
The Cloud Parallel
In our experience, the best analog for what's happening is the 2006-2012 period in cloud infrastructure. Most enterprise developers were still managing servers. The companies building cloud-native tooling — monitoring, deployment, databases-as-a-service — looked like niche picks at the time. By 2016 they were essential infrastructure that no serious engineering org could function without.
Inference infrastructure is at roughly the same point. Most AI products today are built on direct API calls to foundation models, with minimal optimization. As those products scale and cost pressures mount, the tooling layer becomes mandatory. The companies building it now have 12-24 months to establish market position before the layer gets crowded.
One number we keep coming back to: inference cost for a leading model has dropped roughly 80% in the last two years through algorithmic and hardware improvements. That reduction unlocked entirely new use cases — longer context, agent chains, real-time audio. Each new use case requires its own serving infrastructure. The market isn't shrinking as inference gets cheaper. It's expanding.
What We Look for in Founders
The founders building the best inference infrastructure companies we've seen share a specific profile. They've spent time inside a large-scale serving system — whether at a lab, a hyperscaler, or a large AI-native product. They have opinions about batch scheduling, memory allocation, and KV cache design that come from fighting real production fires, not from reading papers. They understand that users of their system don't care about the internals — they care about the SLA and the bill at the end of the month.
That combination of deep systems knowledge and product pragmatism is rare. When we find it, we move fast.
Building in LLM serving or inference infrastructure? Talk to us.