The Next Frontier for Large Language Models is Biology

Radical Ventures Partner Rob Toews published an article in Forbes this week on applications of large language models in the life sciences. Below is a brief excerpt.

Large language models like GPT-4 have taken the world by storm thanks to their astonishing command of natural language. Yet the most significant long-term opportunity for LLMs will entail an entirely different type of language: the language of biology.

One striking theme has emerged from the long march of research progress across biochemistry, molecular biology and genetics over the past century: it turns out that biology is a decipherable, programmable, in some ways even digital system.

DNA encodes the complete genetic instructions for every living organism on earth using just four variables—A (adenine), C (cytosine), G (guanine) and T (thymine). Compare this to modern computing systems, which use two variables—0 and 1—to encode all the world’s digital electronic information. One system is binary and the other is quaternary, but the two have a surprising amount of conceptual overlap; both systems can properly be thought of as digital.
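
To make the binary/quaternary comparison concrete, here is a minimal sketch in Python. It assumes an arbitrary mapping of each base to a 2-bit code; the mapping and the function names are illustrative, not a standard. Because four symbols fit exactly into two bits, any DNA string can be packed into an ordinary binary number and recovered without loss.

```python
# Hypothetical mapping: any one-to-one assignment of the four bases
# to the four possible 2-bit codes would work equally well.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def encode_dna(sequence: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in sequence.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_dna(value: int, length: int) -> str:
    """Unpack an integer produced by encode_dna back into a DNA string."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])  # read the lowest two bits
        value >>= 2
    return "".join(reversed(bases))
```

For example, `encode_dna("GATTACA")` packs seven bases into 14 bits, and `decode_dna(encode_dna("GATTACA"), 7)` returns the original string. The length is passed explicitly because leading `A` bases encode as zero bits and would otherwise be lost.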

To take another example, every protein in every living being consists of and is defined by a one-dimensional string of amino acids linked together in a particular order. Proteins range from a few dozen to several thousand amino acids in length, with 20 different amino acids to choose from.

This, too, represents an eminently computable system, one that language models are well-suited to learn.
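
As a rough illustration of why a protein looks like text to a language model, the sketch below treats the 20 standard one-letter amino acid codes as a vocabulary and turns a sequence into integer token IDs. The ID assignment and function names are assumptions made for this example; real protein language models may use different vocabularies and extra special tokens.

```python
# The 20 standard amino acids, written with their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: the ID of each amino acid is its position above.
TOKEN_ID = {aa: index for index, aa in enumerate(AMINO_ACIDS)}


def tokenize_protein(sequence: str) -> list[int]:
    """Convert a protein sequence into a list of integer token IDs."""
    return [TOKEN_ID[aa] for aa in sequence.upper()]


def sequence_space_size(length: int) -> int:
    """Count the distinct amino acid sequences of a given length."""
    return len(AMINO_ACIDS) ** length
```

With this mapping, `tokenize_protein("MKT")` gives `[10, 8, 16]`. The second function hints at the scale of the design problem: `sequence_space_size(3)` is 8000, and the count grows by a factor of 20 with every added position.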

As DeepMind CEO and Co-founder Demis Hassabis put it: “At its most fundamental level, I think biology can be thought of as an information processing system, albeit an extraordinarily complex and dynamic one. Just as mathematics turned out to be the right description language for physics, biology may turn out to be the perfect type of regime for the application of AI.”

Large language models are at their most powerful when they can feast on vast volumes of signal-rich data, inferring latent patterns and deep structure that go well beyond the capacity of any human to absorb. They can then use this intricate understanding of the subject matter to generate novel, stunningly sophisticated output.

By ingesting all of the text on the internet, for instance, tools like ChatGPT have learned to converse with thoughtfulness and nuance on any imaginable topic. By ingesting billions of images, text-to-image models like Midjourney have learned to produce creative original imagery on demand.

Pointing large language models at biological data—enabling them to learn the language of life—will unlock similarly breathtaking possibilities.

What, concretely, will this look like?

In the near term, the most compelling opportunity to apply large language models in the life sciences is to design novel proteins.

Continue reading the full article in Forbes. Rob writes a regular column for Forbes about artificial intelligence.

AI News This Week

AI could solve some of humanity’s hardest problems. It already has. (The New York Times)

In this podcast, Demis Hassabis, CEO of Google DeepMind, explores the potential of AI systems to advance scientific research. He proposes a shift in how AI systems are built: rather than mimicking human capabilities, development should focus on solving the problems humans find most challenging. He highlights the success of Google DeepMind’s AlphaFold, which tackled the long-standing protein-folding problem by accurately predicting the three-dimensional shapes of millions of proteins from their amino acid sequences alone. Such specialized AI systems may serve as valuable resources for generalized AI systems, which could eventually call on their expertise when specialized skills are required (an idea previewed in the Mixture-of-Experts approach). Over time, however, these capabilities may be integrated back into the general system, further expanding what a general model can do.

Data revolts break out against AI (The New York Times)

Comedian Sarah Silverman and authors Christopher Golden and Richard Kadrey have filed separate lawsuits against OpenAI and Meta, alleging copyright infringement. The complaints allege that the companies trained their AI models, ChatGPT and LLaMA respectively, using datasets that included unauthorized copies of the authors’ works. Concerns about unauthorized use of proprietary data by AI technologies are coming from a wide range of sources, including fan fiction writers, social media platforms, news organizations, and individual artists. Beyond filing lawsuits, aggrieved creators are blocking scrapers from their files and boycotting websites. The protests reflect a shift in how the value of data is understood and a pushback against the indiscriminate use of information.

Listen: Jay Alammar on building LLM apps (What’s AI Podcast)

Jay Alammar, an AI educator and co-creator of LLM University at Radical Ventures portfolio company Cohere, shares his view on practical applications of large language models (LLMs). Jay emphasizes the importance of accessibility and community in learning about LLMs. In the episode, he draws on his experience developing LLM-based applications, including addressing biases in training data and ensuring ethical deployment. Jay also provides context on transformers, the foundational architecture of LLMs, and explains how their attention mechanism captures long-range dependencies.

GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE (SemiAnalysis)

SemiAnalysis has shared details on OpenAI’s GPT-4 model, which it reports to be a Mixture-of-Experts (MoE) model (discussed above) with 16 experts, each having about 111 billion parameters. In an MoE model, a routing layer sends each input to a small subset of specialized expert sub-networks and combines their outputs, so only a fraction of the total parameters is active for any one input. The report underlines how companies are competing to reduce their inference cost and memory footprint.
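
The routing idea behind a Mixture-of-Experts can be sketched in a few lines of Python. This is a toy illustration of one common formulation (top-k gating with softmax weights), not a description of how GPT-4 is built. The function names, the choice of k = 2, and the use of pre-computed gate scores and scalar "experts" are all assumptions made for the example; in a real model the gate scores are produced by a learned layer and the experts are neural sub-networks.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_scores, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)),
                    key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_forward(x, experts, gate_scores, k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    return sum(weight * experts[i](x)
               for i, weight in route_top_k(gate_scores, k))
```

Only the k selected experts are evaluated for a given input, which is where the inference savings come from: the model can hold many parameters in total while computing with a small share of them on each call.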

Research spotlight: What should data science education do with Large Language Models? (Universities of Washington and Pennsylvania, Stanford, Rutgers)

In a paper released this week, the authors propose that large language models (LLMs) are shifting the focus of data scientists from hands-on coding and standard analyses to evaluating and managing analyses performed by AI. This shift calls for a significant change in data science education, with near-future curricula incorporating AI-guided programming, LLM-informed creativity, and interdisciplinary knowledge. Concurrently, OpenAI’s Code Interpreter, rolling out this week, promises to make data science more accessible to individuals without coding or data visualization skills.

Radical Reads is edited by Ebin Tomy (Analyst, Radical Ventures)
