Generative AI has taken the technology world by storm. While today’s generative models are built upon a decade of progress–thanks to Moore’s Law, better algorithms, and bigger datasets–2022 does feel like an “AlexNet” moment. In the way that computer vision breakthroughs meant computers could see, today’s computers can now create in ways never before thought possible, producing impressive text, imagery, video, 3D assets, and more.
Much has already been written about the market opportunities for the Cambrian explosion of startups now emerging, which aim to remake everything from creative tools to software development.
But equally important is what’s missing today. Other computing paradigm shifts necessitated new infrastructure, from cloud tools at the advent of SaaS to MLOps in the early deep learning era. What problems are unsolved in the generative AI stack? And what tools and infrastructure does the ecosystem need to truly take off?
To start: what does the term “generative AI” actually refer to? It’s often used interchangeably with “large language model” or “foundation model.” These terms overlap–for instance, some foundation models are large language models, which are in turn generative models–but it’s still worthwhile delineating them to the extent possible.
Large language models (LLMs) are many-parameter machine learning models trained on large amounts of text data. LLMs can be used across a variety of language tasks, including text generation, summarization, and classification. As the name suggests, they’re big: GPT-3, for instance, has 175B parameters.
Foundation models are models trained using self-supervised learning on large datasets. They are not task-specific and can be applied to different tasks with minimal fine-tuning. Foundation models have emerged most notably in NLP, with seminal examples including BERT, RoBERTa, and T5. The term was introduced by Stanford HAI in 2021 and named to capture the significance of this new paradigm.
“Generative AI” is a broad term that refers to AI systems trained on large amounts of real-world data in order to generate data themselves. They are often multimodal, e.g. with inputs and/or outputs in one or more modalities (text, image, video). Various model architectures, including diffusion models and Transformer-based LLMs, can be used for generative tasks, which include image generation, language generation, image inpainting, and more.
In this piece, I broadly discuss generative models, but focus in particular on large language and text-to-image models.
So-called “generative” AI has been an active research area for some time, but only recently have factors coalesced to produce the performance and popularity of today’s generative systems.
Model architecture was one key component. In image generation, generative adversarial networks, or GANs, were the state of the art until the popularization of diffusion models. Similarly, the language domain was revolutionized by the seminal 2017 “Attention is All You Need” paper introducing the Transformer architecture. Aidan Gomez, a co-author of the paper, would go on to co-found the NLP startup Cohere. Powerful multimodal systems emerged by combining these architectures with models like CLIP. Add parameter scale-up and training optimization, and today’s algorithms are higher-performing than ever. In parallel, of course, compute has gotten cheaper: image generation model Stable Diffusion reportedly cost ~$600K to train.
Perhaps equally important have been accessibility and usability. OpenAI’s GPT-3 and DALL-E-2 were originally restricted access. In 2022, however, Midjourney made its model available through a Discord frontend, and Stable Diffusion publicly released model weights. At the same time, semantic inputs made interaction simpler. This broader accessibility has fostered experimentation at the application layer.
As more people build on top of publicly available models, or train generative models themselves, opportunities have emerged across the generative AI stack.
At the highest level, the core inputs for today’s generative models are the same as for “traditional” neural networks: algorithms, data, and compute. Algorithms will no doubt continue to progress, but of this trio, the real bottlenecks for generative AI are data and compute. That said, generative models, particularly LLMs, also have idiosyncrasies that require specialized tooling to make them easier to build and deploy, including prompt engineering, fine-tuning, and evaluation.
The generative AI renaissance will depend to a large extent on data.
For one, it’s clear that training LLMs requires more of it. Research has shown that many of today’s large language models are under-trained. Moreover, fine-tuning with smaller datasets remains key for LLM performance on domain-specific tasks.
Several companies are working to bridge the data gap. One recent Radical investment, still in stealth, is building foundation models to generate sophisticated synthetic text training datasets tailor-made for particular end applications. Established players like Gretel.ai generate statistically equivalent synthetic datasets for model training. Others include Syntegra, which is generating data specific to healthcare, and Tonic.ai.
Compute optimization and accessibility are also key to the generative AI boom. Compute costs may have decreased, but generative models, and large language models in particular, are still nontrivially expensive to train and serve in inference. What’s more, spinning up an AWS or other provider instance can be complex and/or costly for nontechnical folks building AI applications.
Several players are tackling compute optimization, both at the training and inference levels. Radical portfolio company CentML focuses on reducing the speed and cost of training and inference at the compute layer, while MosaicML offers a suite of optimizations at the model layer. OctoML is improving model inference speed in production.
Other companies are working to abstract away compute infrastructure entirely, making it faster and easier for people and organizations to deploy ML. With Replicate, users can run ML models in the cloud without having to manage their own compute infrastructure. Banana offers “serverless GPUs,” reducing hosting costs and enabling easier scale-up.
Finally, compute providers themselves will be big beneficiaries of generative AI activity, and there is increasing competition in the space. Nvidia, of course, is a key player as the perceived best-in-class provider of GPUs, but Google Cloud, AWS, and Oracle are all fighting for market share. There are also increasingly “boutique” cloud providers like Coreweave (which recently announced $100M in new funding) and Lambda Labs specifically offering compute resources for machine learning and other heavy workloads.
As Stable Diffusion, GPT-3, Midjourney, and other “text-to-modality” systems have exploded in popularity, semantic prompt inputs have become critical. However, prompting right now is more art than science, leading many to herald the “prompt engineer” as an important new role.
We’re early days in prompt interfaces, but there are many questions as this form factor evolves, including how to evaluate prompts, improve them, and link them together for sequential actions.
Prompt evaluation and experimentation
Anyone who’s played with a semantic input system knows that the right prompt can be key to generating high-quality, appropriate outputs. For businesses building only at the application layer, this matters a lot for usability, product consistency, and, ultimately, churn. Currently, it’s challenging to capture user feedback on model outputs or experiment with prompts in a systematic way.
Humanloop is one startup working to systematize prompt evaluation and experimentation. Its SDK enables customers building on top of Cohere, GPT-3, or other model providers to capture user feedback and measure performance. Customers can also run prompt experiments and fine-tune custom models.
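To make the idea concrete, here is a minimal sketch of what systematic prompt experimentation might look like. This is not Humanloop’s actual SDK; `generate` is a hypothetical stand-in for a real model API call, and the variant names are illustrative.

```python
from collections import defaultdict

# Hypothetical stand-in for a call to a hosted model (e.g. GPT-3 via an API).
def generate(prompt: str) -> str:
    return f"output for: {prompt}"

# Log each (prompt, output, user feedback) triple per prompt variant, so
# variants can be compared quantitatively rather than by gut feel.
feedback_log = defaultdict(list)

def record_feedback(variant: str, prompt: str, thumbs_up: bool) -> str:
    output = generate(prompt)
    feedback_log[variant].append((prompt, output, thumbs_up))
    return output

def approval_rate(variant: str) -> float:
    entries = feedback_log[variant]
    return sum(1 for _, _, up in entries if up) / len(entries)

# Simulated A/B test between two prompt phrasings.
record_feedback("terse", "Summarize: ...", True)
record_feedback("terse", "Summarize: ...", False)
record_feedback("polite", "Please summarize the following text: ...", True)

print(approval_rate("terse"))   # 0.5
print(approval_rate("polite"))  # 1.0
```

The point is simply that once feedback is logged against named prompt variants, comparing prompts becomes a measurement problem rather than guesswork.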
In model development, a technique called reinforcement learning from human feedback, or RLHF, is also promising as a mechanism to better align model outputs with prompts. At a very high level, RLHF uses human feedback on model output to fine-tune LLMs. OpenAI used RLHF in developing ChatGPT, making the model more conversationally fluent and even able to exhibit memory.
Prompting right now is 1:1 input-task completion, but many tasks in the real world are multi-step, requiring multiple runs of a generative system or LLM. Prompt chaining is the notion of connecting multiple prompts, with each output serving as the input to the next, in order to accomplish more complex tasks. Chained prompts could enable more agent-like experiences, with interfaces that can take action on behalf of users.
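The chaining pattern above can be sketched in a few lines of Python. This is a toy illustration, not any particular library’s API: `call_llm` is a placeholder for a real model call, and the step templates are made up for the example.

```python
# Each step's output becomes part of the next step's prompt.
def call_llm(prompt: str) -> str:
    # Placeholder for a real model API call.
    return f"[response to: {prompt}]"

def chain(task: str, step_templates: list[str]) -> str:
    result = task
    for template in step_templates:
        # Feed the previous output into the next prompt.
        result = call_llm(template.format(input=result))
    return result

# Example: extract facts from a text, then draft an email from those facts.
steps = [
    "Extract the key facts from this text: {input}",
    "Write a short email summarizing these facts: {input}",
]
final = chain("Q3 revenue grew 12%...", steps)
```

Each intermediate output stays inspectable, which is part of what makes chained prompts a plausible substrate for more agent-like behavior.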
Dust is one open source project working to make it easier to build on top of LLMs, including by incorporating chaining. With Dust, users can chain model API calls, as well as to external APIs, making it possible to execute multi-step tasks. Langchain is another library that enables prompt chaining, in addition to other tools for LLM app development.
There’s a school of thought that argues that prompt engineering will become obsolete as models improve. Regardless, the prompt as an interface has captured the public imagination, and for good reason: people can talk to computers in natural language, rather than code.
While the art and the science of the prompt may evolve, I believe that conversational interfaces are here to stay, and may even become more entrenched as models get better–just look at all the excitement around ChatGPT.
It bears mentioning the thorny stuff, which, unfortunately, is all too relevant in the world of generative AI.
Most obviously, the hyperrealism of today’s video and image generation poses all kinds of governance challenges, including mis/disinformation and proliferation of hateful and malicious content. Deepfake detection has long been a challenge in AI, even prior to Transformers. It may have just gotten a lot harder.
Crucially, misinformation can also occur even without malicious intent. One recent case was Meta’s Galactica, which purportedly regurgitated falsehoods after being trained on large corpuses of scientific data. ChatGPT is also fallible: Ben Thompson writes about one example where the system incorrectly (but authoritatively) described the political philosophy of Thomas Hobbes. In response, some players, like Perplexity AI, are including citations in LLM output to improve trustworthiness.
System safety and security are other pressing areas. “Prompt injection” and adversarial attacks against generative models pose meaningful risks, as recent research has indicated. There is discussion about how susceptible widely used models like ChatGPT are to these attacks. For startups building on top of generative models, prompt injection can pose real usability risks.
Earlier this month, Stanford HAI released the Holistic Evaluation of Language Models (HELM) benchmark, which aims to bring transparency and more holistic benchmarking to LLMs, including assessing them on key risks like toxicity. This is an important step. I’m also keen to meet more entrepreneurs and researchers thinking about these challenges.
We’re just getting started
It’s a brave new world in generative AI, and it’s easy to focus solely on the application layer opportunities made possible by today’s technology. However, equally important and interesting opportunities exist across the value chain–from problems in compute and data to prompting and security.
If you’re building in any of the areas I discussed here, let’s chat. You can find me at firstname.lastname@example.org.
Molly Welch is an Investor at Radical Ventures, based in the Bay Area.
Prior to Radical, Molly worked with Google’s artificial intelligence research division in public policy and marketing roles. Previously, she has worked on AI policy at Lyft and as a venture investor at deeptech VC Playground Global.
Molly holds a B.A. with distinction from Stanford, a Masters in Public Policy from the Harvard Kennedy School, and an MBA with distinction from Harvard Business School.