The Modern ML Stack is broken, but not for long

Understanding what the most important technological breakthrough of our generation can learn from cable television


It seems like in the last few weeks everyone on planet Earth has woken up and realized the AI revolution is here, and we might be about to see the largest wave of business disruption since the rise of the internet. However, what you will not see behind the slick UI of ChatGPT is the gigantic mess of infrastructure that is needed to make it tick. While giant, well-funded organizations like OpenAI and the Big Tech companies can brute-force their way into building that infrastructure by hiring armies of machine learning (ML) engineers and infrastructure (infra) engineers, the rest of the global economy cannot. If modern organizations are really going to participate in the AI revolution, someone is going to need to help them build it.

But equally important is what is missing today. Other computing paradigm shifts necessitated new infrastructure, from cloud tools at the advent of SaaS to machine learning operations (MLOps) in the early deep learning era. What problems are unsolved in the AI stack? And what tools and infrastructure does the ecosystem need to truly take off?

Most people are familiar with the rise of the “modern data stack”, in which organizations are now able to capture and leverage the proprietary data that they receive from their customers. This was the outcome of the “big data” revolution of the 2010s and powered the emergence of arguably the most successful startup of the decade, Snowflake. As technology reached a point at which businesses could successfully capture their data (around 2010), data proliferation necessitated a change in the way that we actually worked with it (image from Statista):

Gone were the days of gut-based decision-making, and in came the rise of the “data driven” manager. Suddenly business leaders could actually track what did and did not work, and access to this data quickly became table stakes. This was (and continues to be) a key secular driver behind the pivot to cloud-based software and the rise of the global cloud providers: all of a sudden everybody wanted more and more data, but there was nowhere to store it.

AI might be having a similar moment today. With the emergence of a “killer app” (ChatGPT), all of a sudden everyone on Earth wants access to this same technology. Unfortunately, similar to the pre-data warehouse era, the infrastructure to build and leverage this technology is limited. AI also needs data, but it needs to use it in different ways than what the modern data stack offers. If modern businesses are going to benefit from the AI revolution the same way they benefited from the big data revolution, they are going to have to be able to leverage and utilize their own proprietary data.

Today, most breakthrough AI models are trained on the same dataset: the Internet. While this is useful in isolation, the real holy grail of insights for business leaders is using their own proprietary data, either by training and serving their own models or fine-tuning general-purpose ones (foundation models).

That, however, is much easier said than done. Given its roots in academia and research and the speed of innovation in the last five years, the ability to build AI into a process is far from enterprise-grade today. ML development tools from the global cloud providers (e.g., SageMaker) are not core to their business and have largely been an afterthought: it is hard to build production-grade products with them alone. As such, the first-generation AI companies were forced to build everything they needed in-house (notably autonomous driving companies and the rest of Big Tech). As ML teams in these firms matured, a generation of startups emerged to provide a variety of point solutions so that new companies would not have to build everything themselves. This brings us to today, where the proliferation of point solutions has left us with this (landscape image from our friends at a16z):

In short, it is a mess. If you want to build production-grade products in ML, you need to duct tape together 10+ point solutions to stand up an infrastructure that can get you into production. With that come obvious issues with cost (paying too many vendors), unwieldiness (painful to onboard and train new hires), and brittleness (multiple single points of failure, high-friction integrations, and frequent breakages and downtime whenever an update is pushed out to one of the point solutions). What began as a few “must have” point solutions that outperformed one-off SageMaker tasks has transformed into a problem in itself. It is a bit of a heavy-handed (and overused) analogy, but it is reminiscent of what we are seeing in the consumer media industry (image per The Hustle):

Jim Barksdale (former President and CEO of Netscape) famously said that there are “only two ways to make money in business: One is to bundle; the other is unbundle.” This held true at the emergence of the cable television era, when networks bundled together local and national channels and consumers converted to cable (bundling). As cable got bloated and too expensive, consumers flocked to over-the-top (OTT) streaming providers such as Netflix that provided better value (unbundling). Today, however, as OTT providers have proliferated, a full cord-cutter stack has become more expensive than traditional cable. We are in the early innings of this evolution, but companies are already trying to find ways to “rebundle” offerings by combining adjacent channels and services (e.g., Amazon Prime & Prime Video, Disney+ & ESPN, Hulu & Spotify). It remains to be seen how this plays out in the near term, but we can imagine the seesaw between unbundling and rebundling will continue in some form for the foreseeable future.

Unfortunately, where the above (perhaps lazy) analogy falls apart is that ML infrastructure has another problem: nobody makes any money. Or, at least, nobody makes enough money to warrant the amount of VC dollars that have been dumped into the space in the last few years. According to PitchBook, of the $90 billion of venture capital dollars that have been poured into AI software so far (excluding hardware and autonomous vehicle companies), more than 30% ($27.4 billion) has been poured into horizontal technology/infrastructure. That is a significant amount given the distinct lack of $100 million+ ARR players that have emerged in the space.

In the traditional modern data stack, large enterprises are already familiar with paying large sums to ETL (extract, transform, and load) providers, data warehouses, and BI (business intelligence) tools. Several startups in this space have crossed the $100M ARR threshold, while several others are approaching it. There are many reasons for this, but one of the main ones is that the big data trend has been around long enough that most of these modern data stack startups have their roots in engineering and production-quality systems. These solutions are battle-tested and enterprise-ready and are tackling C-suite priorities that unlock clear customer willingness to pay. Executives know they need to invest in their data stack to stay relevant, and the solutions are mature enough to be deployed in the most complex and demanding environments.

Bringing this back to ML infra, ChatGPT’s emergence has evidently captured the attention of C-suite executives at Fortune 500 companies, who are now eager to integrate ML capabilities into their own organizations. Where it did not exist a few weeks ago, the demand side of the equation is now taking off. However, on the supply side, the proliferation of available point solutions makes for a messy, complicated experience that is distinctly not ready for turnkey enterprise deployment.

Most ML infrastructure products in the market today were designed by and for highly technical ML engineers, researchers, and academics. They require significant technical knowledge to operate and often lack, or have only brittle solutions for, critical enterprise requirements such as integrations with IT infrastructure, SOC 2 compliance/data security, and private cloud deployment. Notably, as many of these solutions emerged out of research and academia, they also frequently emerge as open source offerings, which Fortune 500 executives are often hesitant to adopt. It is true that giant open source businesses, such as Confluent, can be built. However, these businesses require a tight go-to-market motion and an immediately available, strongly managed solution (which most open source ML infra companies lack today).

To be considered enterprise-ready, AI infra startups are going to need to solve these data security and IT integration problems before they begin their sales motions with Fortune 500 customers. As excited as they are about AI, many Fortune 500 companies are increasingly concerned about data security and liabilities and are not going to skimp on compliance requirements to try out new technology. In fact, large organizations are increasingly looking to ban the use of ChatGPT as private client data is being inadvertently shared with OpenAI. Some of the above data and IT security requirements can be avoided when selling into other fast-moving AI startups, but eventually, AI infra companies need to graduate to larger customers with more complex needs (and larger ACVs) if they want to reach venture scale. In an economic downturn where many of these startup customers run out of runway or cut back spend, this effect only becomes more pronounced.

In short, our belief is that if early companies are going to build and scale sustainable infrastructure software businesses (particularly in this economic climate), they need to be able to sell to blue chip, Fortune 500 customers. To do that, these startups need to be enterprise-ready and able to regularly interact and work with non-technical buyers. Until a few weeks ago, these buyers were not excited about deploying budget to build their own ML organizations. That is changing, and the current landscape of offerings is going to struggle to meet these buyers’ requirements.

Prelude aside, venture capitalists are supposed to be optimists. We are big believers that we are in the early days of the greatest period of technological innovation since the emergence of the Internet more than twenty years ago. The AI revolution that began with the Deep Learning breakthroughs in 2012 has hit the mainstream, and a common tagline at Radical is that we believe that “AI is eating software.” We believe that we will see almost all of the world’s software replaced by AI over the next decade as static code bases are largely replaced by constantly evolving AI models. This secular shift is a runaway train that will displace whatever issues stand in its path. In short, all of the above issues regarding the state of ML infrastructure will get fixed: they have to be.

How does this happen? First of all, something has to be done about the existing state of ML infrastructure. While some point solutions will naturally die off, we believe that a handful of centers of gravity will start to emerge where offerings are consolidated. This might happen through M&A where the most well-capitalized players can scoop up valuable adjacent offerings, or via product development and cross-selling as companies that own the most customer relationships are able to naturally expand their offerings. One area that we view as a strategic position in the stack is labeling, which is typically the first thing a new customer looks to as they think about adopting ML and building their own models. Because labeling providers own the first and longest-standing relationship with customers, they are natural organizations to offer adjacent solutions as those customers mature along their ML adoption journey. Snorkel has started to run away with this market in the world of text, while players like Scale AI and Radical portfolio company V7 are quickly starting to do the same in unstructured data such as photos, videos, and audio.

Secondly, we are seeing increased customer demand for new end-to-end (E2E) infrastructure platforms (rebundling!) that can help people easily build ML applications. Some “AI 1.0” organizations like Domino Data Lab provide E2E infrastructure for the pre-Transformer generation of AI. However, we believe a much bigger market exists in a post-Transformer world (particularly given the emerging ability to work with unstructured data). There are giant businesses to be built providing this type of solution for both non-technical end users (e.g., business analysts, financial analysts, data teams) and technical end users (e.g., ML engineers, data scientists, infrastructure teams). It is an open question whether we will eventually see a convergence of these personas/stacks into singular offerings, but for now, that does not seem to be happening.

Lastly, AI startups are beginning to realize that they need to be talking to customers and hiring experienced business leaders earlier in their development curve. By augmenting their technical talent with leaders who have lived and breathed enterprise sales and blue chip customer deployments, AI startups can move toward enterprise readiness faster and set product roadmaps appropriately to ensure they are building solutions that buyers are actually looking for. (Sometimes it is worth pushing off an extra 1% of performance to pull forward a slicker UI/UX that customers can easily use.)

In summary, we believe that the combination of a massive increase in demand for AI and what is at best a highly messy current set of offerings means that there is a significant opportunity for entrepreneurs to build enduring businesses in this category. As we think about what we are looking for, we would identify the following:

  • End-to-end offerings that make it seamless and easy to build ML applications for both power users and non-technical users (think one-stop shops).

  • Businesses well positioned to benefit from the emergence of the nascent “unstructured data stack” (e.g., SQL for unstructured data).

  • Data platforms that can help unclog the data glut that exists within ML teams (across data quality, governance, and orchestration).

  • Very high technical bar solutions that require world-class expertise to build (e.g., world-leading researchers and their students emerging from academia, teams leaving cutting-edge AI research organizations, teams with experience building mission-critical infrastructure in Big Tech, etc.).

  • Across the board, we are focused on deeply technical founders who are world leaders in their respective spaces and are customer and product obsessed. This intersection is crucial to seize this explosive moment of AI interest. 

As an AI-focused investment firm, we are confident that the AI revolution is here to stay and will exceed expectations. Just as Snowflake emerged in the big data revolution, there are multiple generational businesses to be built in the age of AI. Beyond the secular adoption, there are a few general business characteristics that get investors (such as ourselves) excited about data infrastructure businesses:

  • Offer a “picks-and-shovels” (yet another overused VC analogy) approach to playing the macro trend.

  • Once adopted, these approaches often become mission-critical parts of organizations with high switching costs (sticky and recession-proof).

  • Often become “need to have” offerings as markets mature, where the leading providers become industry standard (pricing power).

  • In some cases can get access to large volumes of customer data which can be used to train other offerings (data moats).

  • Benefit from the ongoing demand shift toward cloud-agnostic offerings as customers are not interested in being locked into a single global cloud provider (defensibility from Big Tech).

If nothing else, we are very excited about the space and what is to come. If you are building in ML infra, data infra, or just AI in general and would like to reach out to chat and share notes, our inboxes are always open!

Ryan Shannon is an Investor at Radical Ventures.

Prior to joining Radical, Ryan was a Private Equity Investor at TPG Capital in San Francisco, where he focused on Leveraged Buyouts, Corporate Carve-outs, and Take-privates of North American businesses. Previously, Ryan worked as an Investment Banker in the Financial Sponsors group at Barclays in Los Angeles. 

Ryan received an HBA from the Ivey Business School at Western University, where he graduated as an Ivey Scholar, and an MBA from Harvard Business School.

© 2023 Radical Ventures Investments Inc.