Many of the popular large language models in use today are trained predominantly on English-language data. English is the primary language of the internet, representing 63% of all website content, yet it is spoken by only 16% of the global population. Cohere For AI, a non-profit research lab run by Radical Ventures portfolio company Cohere, focuses on tackling complex machine learning challenges and expanding access to machine learning research. It recently released a primer on the “language gap” in AI, which examines the gap's origins and potential consequences and presents actionable strategies for policymakers and governance bodies. This week, we share a brief summary.
More than 7,000 languages are spoken around the world today, but current state-of-the-art large language models cover only a small fraction of them and favor North American linguistic and cultural perspectives. This is in part because many non-English languages are considered “low-resource”: they are less prominent within computer science research and lack the high-quality datasets necessary for training language models.
This language gap in AI has several undesirable consequences:
- Speakers and communities whose languages are not covered may be left behind as language models become increasingly integral to economies and societies.
- The lack of linguistic diversity in models can introduce biases that reflect Anglo-centric and North American viewpoints and undermine other cultural perspectives.
- Without multilingual capabilities, the safety of all language models is compromised, creating opportunities for malicious actors and exposing users to harm; for example, safety guardrails tuned primarily on English data can often be bypassed by prompts written in low-resource languages.
There are many global efforts to address the language gap in AI, including Cohere For AI's Aya project, a global initiative that has developed and publicly released multilingual language models and datasets covering 101 languages. However, more work is needed.
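For readers who want to try the released models themselves (this is an illustration, not part of the primer), the Aya 101 model is openly available on Hugging Face. Below is a minimal sketch assuming the `transformers` library and the `CohereForAI/aya-101` checkpoint:

```python
# Minimal sketch (not from the primer): loading the openly released Aya 101
# model via Hugging Face transformers. The checkpoint name below is an
# assumption based on Cohere For AI's public release.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "CohereForAI/aya-101"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)  # Aya 101 is mT5-based (seq2seq)

# Aya 101 is instruction-tuned across 101 languages, so it can be prompted
# directly, e.g. with a translation instruction.
prompt = "Translate to Swahili: Language technology should serve everyone."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because Aya 101 is a sequence-to-sequence model, the seq2seq classes are used here rather than the causal-LM classes typical of chat models.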
To help address the AI language gap, the primer offers four considerations for those working in policy and governance around the world:
- Direct resources towards multilingual research and development.
- Support multilingual dataset creation.
- Recognize that the safety of all language models is improved through multilingual approaches.
- Foster knowledge-sharing and transparency among researchers, developers, and communities.
The evidence outlined in the primer suggests that closing the AI language gap will require concerted effort across the ecosystem, from those developing and deploying AI models to those working in policy and governance settings.