Radical Talks

Reinforcement Learning’s Renaissance: How RL is Reshaping AI’s Future

Featured speaker: General Reasoning Co-founder and CEO Ross Taylor

The following post is based on the latest Radical Talks episode with hosts Molly Welch and Aaron Rosenberg, featuring General Reasoning Co-founder and CEO Ross Taylor. Listen to the podcast on Apple Podcasts, Spotify, or YouTube.

Reinforcement learning (RL) solves a fundamental challenge in AI: enabling models to learn from interaction and experience. While next-token prediction has proven remarkably effective for many tasks, from drafting documents to generating code, it learns by imitating patterns in pre-existing data.

RL takes a different approach: models learn by performing actions, observing outcomes, and refining their behavior based on feedback. This matters because learning through experience enables models to tackle cognitively demanding, long-horizon tasks that require trial, error, and iterative improvement — capabilities that emerge from interaction rather than imitation alone.
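
To make the contrast concrete, the sketch below shows the bare interaction loop described above: an agent acts, observes a noisy outcome, and incrementally refines its behavior from that feedback alone. The three-armed bandit, its payoff numbers, and the exploration rate are invented for illustration, not drawn from the episode.

```python
import random

# A toy learning-from-experience loop: a three-armed bandit where the agent
# discovers the best action purely from reward feedback, never from examples.
# The payoff probabilities and exploration rate are illustrative placeholders.

TRUE_PAYOFFS = [0.2, 0.5, 0.8]           # hidden success probability per action

def pull(action: int) -> float:
    """The 'environment': taking an action returns a noisy outcome (reward)."""
    return 1.0 if random.random() < TRUE_PAYOFFS[action] else 0.0

estimates = [0.0, 0.0, 0.0]              # the agent's current value estimates
counts = [0, 0, 0]

for _ in range(5000):
    # Explore occasionally; otherwise exploit the best-looking action.
    if random.random() < 0.1:
        action = random.randrange(3)
    else:
        action = max(range(3), key=lambda a: estimates[a])
    reward = pull(action)                # act and observe the outcome
    counts[action] += 1
    # Refine the estimate from feedback (incremental running average).
    estimates[action] += (reward - estimates[action]) / counts[action]

print("learned value estimates:", [round(e, 2) for e in estimates])
```

Nothing about next-token prediction appears in that loop; the only training signal is the reward the agent earns by acting.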

RL is not new, but it has returned to center stage. With reasoning models like OpenAI’s o1 and o3 and DeepSeek’s R1, RL has reclaimed its position as a critical driver of AI capabilities. The field experienced its first golden age in the mid-to-late 2010s with achievements like AlphaGo and AlphaZero, but the emergence of the Transformer and the rise of self-supervised learning for autoregressive language models temporarily shifted focus elsewhere. Now, with foundation models providing rich world knowledge to build upon, RL is unlocking capabilities that seemed out of reach just months ago.

The RL Advantage

Ross Taylor, CEO of General Reasoning and former leader of Meta AI’s reasoning team during Llama 2 and Llama 3 development, explains the advantage: “reinforcement learning allows you to optimize the objective you actually care about.” Previous generations of LLMs relied on imitation learning, matching patterns from internet data. Even LLMs that performed poorly on math possessed the underlying mathematical knowledge; what they lacked was the reliability to apply it. RL connects those dots by optimizing directly for the outcomes that matter.

Taylor’s company General Reasoning aims to enable agents to work autonomously for extended periods. Current agents can handle specific tasks but lack persistence for open-ended work — the messy, sustained problem-solving that defines many knowledge work tasks. The company is approaching this through both capability-driven and domain-driven lenses, examining what abilities agents need for long-horizon tasks while also deeply analyzing specific fields to identify ideal AI-suitable problems.

Strong base models are essential. There is general consensus that a robust pre-trained LLM was necessary for reasoning to emerge. Much as DeepMind bootstrapped systems to master games like Go by training initially on human data and then improving via self-play, the combination of the world knowledge baked into foundation models with RL’s ability to optimize for actual outcomes creates a more powerful paradigm.

The current resurgence builds on three critical components, according to Aaron Rosenberg, Partner at Radical Ventures and former head of Strategy and Operations at DeepMind. First, massive compute capacity; many believe the compute dedicated to RL will soon dwarf what is spent on pre-training. Second, high-fidelity simulators and verifiers. And third, algorithmic innovations like DeepMind’s DQN, OpenAI’s PPO, and DeepSeek’s GRPO.
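
As a rough illustration of what the newer algorithms do differently, the snippet below computes GRPO-style group-relative advantages: several answers are sampled for the same prompt, each is scored, and each is credited relative to the group average, with no learned value network. The reward values and surrounding training loop are placeholders, and clipping and KL terms are omitted.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style credit assignment: score each sampled answer against the
    average of its own group rather than against a learned value estimate."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one math prompt, scored 1.0 if the final
# answer verifies as correct and 0.0 otherwise (a verifiable reward).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))     # correct answers get positive credit
# A PPO-like clipped policy-gradient update (not shown) would then raise the
# probability of the positively-credited answers and lower the others.
```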

The Environment Challenge

Environments have become central to RL advancement — simulations of the parts of the world agents should interact with. As the field moves toward the agentic era, these often capture real enterprise workflows like a bank processing transactions, a law firm preparing contracts, or an agent using enterprise software tools.
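
A minimal sketch, assuming a gym-style reset/step interface, of what such a workflow environment might look like. The invoice-approval task, its observation fields, and the reward logic are invented purely for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class InvoiceEnv:
    """A hypothetical enterprise-workflow environment: review invoices one by
    one, approving clean ones and escalating mismatches."""
    queue: list = field(default_factory=lambda: [
        {"id": 1, "amount": 420.0, "po_match": True},
        {"id": 2, "amount": 9800.0, "po_match": False},
    ])

    def reset(self) -> dict:
        return self.queue[0]                       # first invoice to review

    def step(self, action: str) -> tuple[dict, float, bool]:
        invoice = self.queue.pop(0)
        # Reward the decision a careful clerk would make on this invoice.
        correct = "approve" if invoice["po_match"] else "escalate"
        reward = 1.0 if action == correct else -1.0
        done = not self.queue
        next_obs = self.queue[0] if self.queue else {}
        return next_obs, reward, done

env = InvoiceEnv()
obs = env.reset()
obs, reward, done = env.step("approve")            # the agent acts, gets feedback
print(reward, done)                                 # 1.0 False
```

A production environment would, of course, expose the real tools, documents, and edge cases of the workflow, which is where the expertise discussed below comes in.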

Getting environments right is difficult. One challenge is “reward hacking”: the tendency of RL agents to exploit poorly designed reward systems when environments lack sufficient detail. Recent examples include software engineering agents going to the internet to find answers rather than solving problems legitimately. Expertise also matters in environment generation: ideal environments require deep domain knowledge from people working in specific fields. Taylor explains: “I could probably code up a very good environment for the kind of domains I’ve been in, like finance or sports betting, but if you asked me to do an environment for insurance, I wouldn’t be best placed to do it.”
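
To make the reward-hacking risk concrete, here is a hedged sketch contrasting a naive coding-environment reward, which pays out for green tests however they were obtained, with one that also inspects the execution trace for disallowed network access (the shortcut described above). The trace format and penalty are invented for illustration.

```python
def naive_reward(tests_passed: int, tests_total: int) -> float:
    # Easy to hack: any route to passing tests is rewarded equally,
    # including looking the answer up online.
    return tests_passed / tests_total

def hardened_reward(tests_passed: int, tests_total: int, trace: list[str]) -> float:
    # Penalize trajectories that reached green tests through disallowed means.
    used_network = any(event.startswith("net:") for event in trace)
    return -1.0 if used_network else tests_passed / tests_total

trace = ["fs:read solver.py", "net:GET https://example.com/answers", "exec:pytest"]
print(naive_reward(10, 10))              # 1.0  -- looks perfect
print(hardened_reward(10, 10, trace))    # -1.0 -- the shortcut is caught
```

Even this patched reward closes only one loophole; anticipating the rest is exactly where domain expertise becomes necessary.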

Beyond simulating workflows accurately, environments need verification mechanisms to determine what counts as a “good” outcome. Some fields are easier to verify, such as math, physics, and software development. Others present gray areas (what does “good” look like in creative writing, for instance?). Universal verifiers are an emerging approach in which models themselves assess correctness by understanding the underlying rules rather than comparing to reference answers. For math proofs, models can use programming languages like Lean to verify outputs. For subjective domains like legal contracts, models that reason extensively might produce more reliable assessments. This brute-force approach has limitations: expecting models to think extensively about both the problem and its verification becomes expensive quickly. Taylor suggests that if you can find verifiable rewards, you should pursue them, while acknowledging that universal verifiers expand what’s feasible.
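
As a small example of the machine-checkable verification Taylor points to, a Lean proof either compiles or it does not, so correctness can be established without a reference answer or a human grader. The toy theorem below is illustrative only.

```lean
-- If this Lean 4 file type-checks, the proof is verified: no reference
-- answer or human judgment is involved.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```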

Ross, Molly, and Aaron discuss how much recent research focuses, perhaps too narrowly, on benchmarks rather than real-world results. Even within coding, serious problems remain: available coding environments don’t accurately represent real software engineering tasks. While labs claim 70-71% performance on benchmarks like SWE-bench, when one research group tested how many of the resulting pull requests would actually be accepted, the rate was 0%. Taylor sees current environments as too artificial, with too many toy problems, and notes continued opportunity in clearly verifiable domains that better capture real, economically valuable workflows.

Economic Value and the Human Element

Taylor emphasizes that the cost of inference for reasoning models shouldn’t be considered in isolation, but rather relative to the value of the work being automated. Fields like law and financial analysis, for instance, involve expensive work that justifies high inference costs. But reliability may be the most underappreciated factor. Self-driving cars performed well, yet until very recently they weren’t reliable enough to be economically useful. The question isn’t just whether a model can perform work adequately, but whether it’s trustworthy enough for deployment.
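
One back-of-the-envelope way to frame this, not taken from the episode, is to write the expected net benefit of automating a task as

```latex
% Illustrative framing: V = value of the automated work, p = reliability,
% C_inf = inference cost, C_fail = cost of a bad output (rework, errors, lost trust)
\mathbb{E}[\text{net benefit}] \approx p \, V \;-\; C_{\text{inf}} \;-\; (1 - p)\, C_{\text{fail}}
```

In high-value domains such as law, a large V can absorb a hefty inference bill, but when the cost of failure is large, as with driving, the expression only turns positive at very high reliability.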

This balance between capability and reliability mirrors a broader shift in how we think about AI development. Nearly a decade ago, Yann LeCun proposed his now-famous layer cake analogy: if intelligence were a cake, unsupervised learning would be the bulk, supervised learning the icing, and RL just the cherry on top. Taylor calls this deeply perceptive, accurately describing the state of the field before 2023. But the new reasoning paradigm has shifted this balance: moving from RLHF focused on human preferences toward verifiable or machine-driven rewards has produced powerful results. Rather than being bottlenecked by the highest quality human annotators, models can now go beyond human performance in some domains. The appropriate analogy may now be a cherry cake, where RL overwhelms all else.

Taylor offers a modification: humans remain essential for now. Models lack the domain knowledge and workflow understanding that lives in people’s heads and in companies’ internal data, none of which appears in pre-training data. Humans must still specify what good outcomes look like. Taylor believes humans may eventually step away, but not yet: either reliability in human-centric domains needs to improve, or progress will come from pushing into domains where verification happens through grounded experiential data rather than human judgment.

Building the Future

The convergence of large-scale foundation models, massive compute infrastructure, improving verification methods, and sophisticated environments creates fundamentally new capabilities. The field has moved from proving concepts in games to tackling economically valuable real-world tasks.

Success requires building high-fidelity environments that capture real workflows, developing reliable verification for complex domains, and connecting domain experts with RL practitioners. As Taylor emphasizes, following curiosity and moving fast when opportunities emerge remains essential. Reinforcement learning has returned not as a cherry on top, but as a central ingredient in building AI systems that can work alongside humans on complex, valuable tasks.

This post is based on insights from Radical Talks, a podcast from Radical Ventures exploring innovation at the forefront of AI. For more cutting-edge AI discussions, subscribe on Apple Podcasts, Spotify, or YouTube.