
It’s OK to ask why AI prototypes are not getting to production

Enterprise GenAI prototypes are everywhere. GenAI apps in production are far fewer. Why?

Management consultancies report different success numbers – we’ve heard 3% or 5% going to prod across tens of thousands of AI proofs of concept. For obvious reasons, it’s not a number LLM enthusiasts want to talk about. But we probably should.

Because the number isn’t zero; there are solid examples of LLMs providing value – accelerating support resolutions, improving internal search, transforming documents between types, assisting self-serve portals of all sorts, and generally giving dev teams new ways to solve problems at the intersection of software, people, and natural language. But product development isn’t easy, and inserting a wild new technology makes it harder.

We propose this gentle pushback on the GenAI hype. There’s real opportunity here, both realized today and coming soon. But much of the work of delivering business outcomes with LLMs remains the fundamentals of product innovation: define a measurable outcome, create small teams working cross-functionally, provide robust platform support, and let them cook. 

The work is still the work. 

Hype isn’t helping

“RAG-LLM is dead.” It has to be true; we read it on LinkedIn.

Or maybe it’s this: RAG-LLM – the retrieval-augmented generation pattern, used to wire up chatbots and plaintext search engines to a corpus of useful business documents – is a complex solution with measurable results and a long road to high accuracy. The arrival of very large context windows didn’t change the math on inference costs, which were already ominously high.

The challenge is to get accuracy high enough, and inference costs low enough, for a given product-market fit. You can stack models and guardrails until you’re accurate, or you can quantize and trim tokens until your per-call costs are low. Doing both is hard.
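To make the pattern concrete, here’s a minimal sketch of a RAG call in Java using Spring AI’s ChatClient with a QuestionAnswerAdvisor handling the retrieval step. Treat it as a sketch, not gospel: the class and question are illustrative, and we assume a ChatModel and VectorStore are already provisioned elsewhere (e.g. by Spring Boot auto-configuration).

```java
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.chat.client.advisor.QuestionAnswerAdvisor;
import org.springframework.ai.chat.model.ChatModel;
import org.springframework.ai.vectorstore.VectorStore;

public class SupportSearch {

    private final ChatClient chatClient;

    // The ChatModel and VectorStore are provisioned by the platform,
    // not by the product team writing this class.
    public SupportSearch(ChatModel chatModel, VectorStore vectorStore) {
        this.chatClient = ChatClient.builder(chatModel)
                // Retrieval step: pull relevant corpus documents into the
                // prompt before the model call.
                .defaultAdvisors(new QuestionAnswerAdvisor(vectorStore))
                .build();
    }

    public String answer(String question) {
        // One inference call per question; accuracy and per-call cost are
        // tuned in the retriever and the model choice, not in this code.
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }
}
```

Note that the accuracy-versus-cost tradeoff lives almost entirely behind those two injected dependencies – retriever quality, model choice, token budgets – which is exactly why platform support matters.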

RAG-LLM, like most GenAI, responds well to iterative prototyping, engineering/design/product partnership, and institutional dev team support like database and model provisioning, stable APIs, and platform guardrails. But these engineering platforms and practices are the messy work of software development, which isn’t very exciting for AI influencers because it suggests that AI engineering relies on mature prior art (including the VMware Tanzu Platform that I work on).

Design matters

There’s also a design component that is frequently missed. A system that produces good results 20% of the time might be delightful (think product recommendations) and a system that produces good results 99% of the time might be irresponsible (think autonomous driving). 

Starting with pessimistic assumptions about model performance and looking for value is a safer, better path than assuming that 50% accuracy will become 100% someday soon. LLM accuracy remains mostly unsolved in high-complexity business contexts. As one example, in June of 2024 Meta hosted a bake-off for advanced RAG applications. After 6,038 submissions, the winning solution was still incorrect about half of the time. These aren’t edge cases.

Design thinking can help. 

Many dead ends can be avoided by starting with a cross-functional team that envisions complete systems deployed in human contexts, using well-proven design patterns like low-fidelity mockups, quick iteration, end-user feedback, and attention to messy human concerns like community impact, accessibility, and risk. The people inside organizations who are asking pointed questions about LLM carbon footprints are also the people who can reduce inference costs to a level that makes ROI sense; it’s the same metric.

Agents are next

Meanwhile, YouTube marches on: RAG is done, now it’s ✨The Year Of AI Agents✨. Although… the industry is not aligned on what “agentic” AI means, exactly. Is it multi-step calls to inference models? LLMs calling functions? Business users defining workflows in natural language? Or merely the notion of LLMs contributing inside a business workflow? Is a wizard an agent?

We think many agentic patterns are promising, but maybe we should start with the smallest slice of the agentic vision – perhaps “automate validation of outputs” – and ship it? 

The patterns that make an LLM-powered application agentic are similar to the patterns in all business software. Work happens in a sequence of small steps, with humans and systems collaborating to query, refine and verify a bit of work, then register a user’s decision or other work output into a system. VMware Tanzu’s Spring AI framework is agentic because it thoughtfully manages state and complexity, not because we’re chasing one particular sequence of LLM calls. 
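As a sketch of that smallest slice – generate, then automatically validate before anything ships – here is what a validation loop might look like with Spring AI’s ChatClient. The prompts, the PASS/FAIL protocol, and the retry budget are our illustrative assumptions, not a prescribed Spring AI pattern:

```java
import org.springframework.ai.chat.client.ChatClient;

public class DraftWithValidation {

    private final ChatClient chatClient;

    public DraftWithValidation(ChatClient chatClient) {
        this.chatClient = chatClient;
    }

    public String draftReply(String ticket) {
        for (int attempt = 0; attempt < 3; attempt++) {
            // Step 1: generate a candidate output.
            String draft = chatClient.prompt()
                    .user("Draft a support reply for this ticket:\n" + ticket)
                    .call()
                    .content();

            // Step 2: a second model call acts as the validator. In a real
            // system this could be a cheaper model or a rules engine.
            String verdict = chatClient.prompt()
                    .user("Answer PASS or FAIL: does this reply address the ticket"
                            + " without inventing policy?\nTicket: " + ticket
                            + "\nReply: " + draft)
                    .call()
                    .content();

            // Step 3: only register output that survived validation.
            if (verdict != null && verdict.contains("PASS")) {
                return draft;
            }
        }
        // Nothing passed – escalate to a human rather than guess.
        throw new IllegalStateException("No draft passed validation");
    }
}
```

Small steps, explicit checkpoints, a human fallback: the same shape as any business workflow, with an LLM doing some of the work.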

LLMs for consumers vs LLMs for creators

There is a massive cultural moment around language models and image generators: hundreds of millions of active users, strong feelings both positive and negative, ad campaigns everywhere (though the ads remain fuzzy on why having AI everywhere is desirable).

With AI, it’s helpful to split the consumer chat experience – ChatGPT, Claude, or Gemini as a coach or editor or tutor – from software teams’ use of LLMs inside their own applications.

For developers, there’s a slower, quieter GenAI revolution happening under the hood of existing software applications. We believe many software teams are going to use LLMs and other machine learning models to solve narrow problems everywhere. As AWS CEO Adam Selipsky argued in a recent AWS re:Invent keynote, there’s a likely future where “every application is going to use inference.” But doing this in production is often blocked – not by proof-of-concept vision questions (there’s plenty of hype for the vision) but by solvable execution problems: accuracy, safety, model and data supply chains, source data ethics, cost controls, and legal risks.

Off-the-shelf code assistants like GitHub Copilot straddle both modes. You can buy a ready-made code assistant, but many teams are also building assistants in-house that spot org-specific patterns and antipatterns. We see software teams building successful LLM products for software developers; having your users down the hall solves for iterative development and user feedback, and, as expected, these teams are notching some wins. We expect this little-assistants-everywhere pattern to extend to ops, support, research, and other business domains as LLM expertise grows: smaller, specialist models everywhere.

Developer platforms accelerate developers

So we just threw a bunch of new problems at LLM adopters: product-market fit questions, end-user context questions, ethics and efficiency questions. These teams were already struggling to graduate proofs of concept into viable products. So how can we get them moving?

We believe a key is to let dev teams focus on product development by abstracting away everything except the question of whether they are providing value to their users. That means an app platform and self-serve patterns that let teams access the resources they need, promptly, and with low risk.

Product development teams should not be in the business of certifying models. They should not be configuring hardware, storage, or networks. They should not be the people proving to governance committees that they are operating within legal, privacy, and ethics constraints. Instead, they should have a menu of options, and clean abstractions between their software and these concerns.

This is not exotic stuff; we’ve been building developer accelerators for decades, and the ROI is well established. What’s new is that we stopped doing all of these things when it came time for LLM proofs of concept. Everything was one-off and home-brewed for a bit. And those efforts mostly didn’t go to production.

Consider an alternate world where there’s an API between dev teams and the LLM. Data science and AI teams certify the models for production use, working closely with legal, privacy, and ethics watchdogs. Dev teams pick the best-fitting tool available. Logging of outputs and model feedback is built in. Privacy guardrails like tokenization are available, as are on-prem and even air-gapped AI models. Cost controls and related bookkeeping are enforced by the platform, not by the developers or a model vendor. When models are refined – more accurate, lower cost, specialist fine-tuning – the apps don’t have to change.
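To picture that boundary, here is a hypothetical sketch: the product team codes against a narrow interface, and the platform binds it to whichever certified model is current. CertifiedModelGateway and the “customer-comms” use case are our invented names for illustration, not a specific Tanzu API:

```java
// The contract a product team sees. Model choice, logging, cost
// accounting, and privacy guardrails all live behind this line.
public interface CertifiedModelGateway {

    /** Ask a platform-certified model; the platform logs the exchange. */
    String complete(String useCase, String prompt);
}

// A product team's code never names a model vendor or version.
class RenewalEmailDrafter {

    private final CertifiedModelGateway gateway;

    RenewalEmailDrafter(CertifiedModelGateway gateway) {
        this.gateway = gateway;
    }

    String draft(String accountSummary) {
        // When the platform swaps in a cheaper or more accurate model for
        // the "customer-comms" use case, this code does not change.
        return gateway.complete("customer-comms",
                "Draft a renewal email for this account:\n" + accountSummary);
    }
}
```

The swap – a cheaper model, a fine-tuned specialist, a new vendor – happens entirely on the platform side of the interface.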

With good platform support, many of the “AI” parts of the application simply go away. All of this frees up product teams to ask age-old product questions like “Is solving the problem technically feasible?” and “Do people like it?” and “Is this a viable business?”

Which is hard enough, it turns out. If you’d like to learn how VMware’s Tanzu AI Solutions can accelerate prototyping with GenAI and AI agents, and safely scale production apps that access AI models, we’d be happy to run a demo for you.