What Gets Lost When AI Only Speaks English
A case for linguistic diversity at the frontier of artificial intelligence

Today’s large language models are trained predominantly on one language: English. That is understandable, since English is the most widely used lingua franca when speakers from different countries do not share a language.
The share of English in AI training data is staggering: 92%, a near-total dominance. Meanwhile, about 7,000 living languages are spoken on Earth, and 40% of them face extinction, an alarming figure from a linguist’s perspective.
The world’s linguistic diversity is a diversity of meaning: words and characters carry viewpoints that may not translate into English. Every language encodes a unique way of understanding the world, and when a language is underrepresented in AI training, its entire body of knowledge risks becoming invisible.
Training data that focuses almost solely on English may therefore pose serious risks to intellectual diversity. The concern goes beyond the technical: a model trained overwhelmingly on English text learns to interpret and reason through an English cultural lens, so its reasoning largely reflects Western assumptions.
The following examples show why cultural knowledge is at risk:
Aymara conceptualizes time in the opposite way from English: the past lies in front, the future behind.
Many Indigenous Australian languages use cardinal directions (north, south, east, west) rather than relative terms like left and right.
Many Sub-Saharan African languages contain kinship and land-relationship concepts with no English equivalent.
As the examples above illustrate, the loss is not merely linguistic. The dominance of a single language in AI training data can narrow humanity’s collective self-understanding.
Even where models have been trained on languages other than English, output quality largely remains low. Scarce data, under-training, and infrequent updates produce ineffective results. Consequently, users notice the gap and switch to English models for better results.
Perhaps the main reason one language dominates training is that the potential of artificial intelligence, and of LLMs in particular, was recognized early in only a few places.
Most models are built in just two countries, the United States and China. These countries started the revolution, with hundreds of great minds working to establish a new ecosystem that could entirely change how humans live.
Most developing countries, and even some developed ones, have yet to fully embrace artificial intelligence. That mindset must change before they can commit to building datasets for their own languages and, subsequently, to training models on them.
Some initiatives are already under way, such as Masakhane and GhanaNLP in Africa, AI4Bharat in India, and AmericasNLP in the Americas. They insist that communities must have agency over their own language data, so that the data serves those communities rather than being extracted from them.
Long-term forecasts indicate that English will remain a dominant force in AI reasoning, a prospect made more urgent by the alarming rate at which languages disappear: one every 40 days. The barriers to changing course are more political and economic than technical.
What we require is a sustainable policy that preserves the world’s linguistic heritage, and with it a body of knowledge that belongs to everyone.
Two initiatives matter most in this context: mandatory multilingual benchmarking in AI regulation, and funded partnerships between AI labs.
Each language is a door to viewing the world differently. Recognizing this should compel decision-makers to change their perspective and to reduce language bias in training data, which would also yield better models.
After all, an AI that functions primarily in English is not a neutral technology, and certainly not a universal one.

