Radical Reads

Road Rules for LLMs

By Leah Morris, Senior Director, Velocity Program


“The world needs to establish the rules of the road so that any downsides of AI are far outweighed by its benefits.”

— Bill Gates, "Age of AI" (March, 2023)

The Holistic Evaluation of Language Models (HELM) benchmark recently updated its results for evaluating the performance of large language models (LLMs) on various tasks. Unlike traditional benchmarks that often focus on accuracy and precision, HELM provides a comprehensive evaluation of a model, including its societal impact. It features models from different organizations advancing the field of LLM research, including Google, Meta, OpenAI, and Radical Ventures portfolio company Cohere.

Cohere’s latest 52B Command model is now tied at the top with OpenAI’s text-davinci-002 (part of the GPT-3 model family) and tops the list for fairness in core scenarios.


HELM updated benchmark: core scenarios v0.2.2 (last updated 2023-03-19)

Performing well across tasks is difficult, as is benchmarking these models’ performance. Foundation models’ performance tends to vary depending on the task at hand. In a different approach to evaluation, GPT-4’s success was underscored by its ability to ace a variety of exams originally designed for humans, including the bar exam and the AP Biology exam.

While impressive, this success is not uniformly accepted in the scientific community. Standardized tests for humans make assumptions about typical human failure points (i.e., these tests have poor construct validity for machines). Simply put, automated systems do not fail in the same way that humans fail. For example, an AI may ace a human driving test, but this success does not directly translate to fully autonomous self-driving capabilities.

That being said, how can we determine whether one model is “better” than another? Whether generalized AI will win out across the board or specialized models will always outperform on specific tasks remains a subject of debate. Regardless, standardized benchmarks designed for AI are a crucial part of transparency in evaluating LLMs. Benchmarks like HELM encourage fair and reproducible evaluations through a standardized set of tasks and evaluation metrics. The companies that participate help the global advancement of responsible AI by enabling more people to participate in the development and use of AI, promoting innovation and creativity, and ultimately helping to address social and economic inequalities.

AI News This Week

  • Watch: “Godfather of artificial intelligence” weighs in on the past and potential of AI  (CBS News)

    Geoffrey Hinton, known as the “Godfather of Deep Learning” and an investor in Radical Ventures, discusses the surge in AI interest, as well as his work on backpropagation and deep learning, the ethical implications of AI development, and the potential for AI to improve healthcare and other fields. The segment also discusses concerns surfacing around AI, including the worry that the technology could take a lot of jobs. Nick Frosst, who was mentored by Hinton and is the co-founder of the Radical Ventures portfolio company Cohere, suggests AI is “going to make a whole lot of jobs easier and a whole lot of jobs faster.”

  • Biotech AI startup Unlearn adds $15 mln and OpenAI CTO to board  (Reuters)

    Radical Ventures portfolio company Unlearn has added Mira Murati, CTO of OpenAI, to its Board of Directors. Unlearn has built a machine learning platform that creates “digital twin” profiles of patients in clinical trials. Murati noted, “The team at Unlearn is working on applications of AI that have incredible potential to revolutionize healthcare, diagnostics and treatment.” The addition of Murati is expected to provide Unlearn with valuable insights and guidance as it continues to develop its technology.

  • The Age of AI has begun  (Bill Gates)

    “In my lifetime, I’ve seen two demonstrations of technology that struck me as revolutionary. The first time was in 1980… The second big surprise came just last year.” Bill Gates shares his thesis that the AI revolution is already underway from healthcare to manufacturing. He predicts a sustained explosion of companies working on new uses of AI as well as ways to improve the technology itself. In spite of the unique challenges and risks associated with the technology, he highlights the overwhelming benefits and shares our belief that AI will transform software and drive rapid innovation across sectors.

  • Our new Promethean moment  (The New York Times – subscription may be required)

    ChatGPT has suddenly made everyone aware of the potential of AI to transform society. Like most revolutions, this technological moment has been brewing for decades. Columnist Thomas Friedman has come to terms with the power of AI and deems this point in time our next “Promethean Era.” He likens it to earlier moments when society fundamentally changed due to inventions such as the printing press and the steam engine. He sees the AI era as being underpinned by a “technology super-cycle.” A notable feature of this era, in his view, is that AI does not simply solve one problem in society but advances our abilities across almost every field, from human biology to fusion energy and climate change.

  • Why self-driving pioneer Raquel Urtasun is determined to solve our supply chain woes  (The Globe and Mail – subscription may be required)

    Raquel Urtasun, the founder of Radical portfolio company Waabi, discusses improving supply chain logistics in this look at her work advancing an AI-focused approach to self-driving. Delivered goods are more important than ever before, and yet global supply chains continue to face challenges such as product shortages and delays, labour shortages including a lack of transport workers, and local regulatory policies. Self-driving technology has the potential to transform how goods are transported and delivered. Raquel shares her perspectives on this industry and the transformation that is about to occur.

Radical Reads is edited by Leah Morris (Senior Director, Velocity Program, Radical Ventures).