LLMs and bias in “non-standard” language

Linguistic bias in Large Language Models (LLMs) is a pressing concern, particularly when it comes to minoritized and marginalized language varieties. While there is much discussion of minority languages in LLMs, the bias also applies to “non-standard” varieties of dominant languages, like English. One of the primary sources of this bias is the input data used to train the models. LLMs are typically trained on large corpora of text, which are often dominated by data from standard English language varieties, such as American English or British English. This means that minority language varieties, such as Hong Kong English, African American English, Mid Ulster English or Zimbabwean English, are underrepresented, misrepresented through lenses of race, colonialism and class, or entirely absent from these training datasets.

This lack of representation can lead to a range of negative consequences on both the input and output side. On the input side, LLMs may struggle to recognize and accurately transcribe minority language varieties, leading to errors in transcription and translation. On the output side, LLMs may perpetuate stereotypes and biases by generating text that reinforces dominant cultural narratives and marginalizes minority languages and their speakers. For example, an LLM trained predominantly on standard English texts may be more likely to generate text that uses derogatory language or perpetuates racist stereotypes when discussing marginalized groups.

Furthermore, LLMs trained predominantly on “standard” English texts will disproportionately classify language as English, even when it is not. This is damaging both to speakers who self-identify differently and to speakers of related minority languages. For instance, an LLM may incorrectly categorize languages like Gullah Geechee and Scots as dialects of English rather than distinct languages with their own linguistic and cultural identities. This erasure can have significant consequences for the preservation and promotion of these languages and cultures. It also perpetuates the dominant ideology that standard English is the superior form of the language, with minority languages and language varieties seen as inferior or substandard.
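
This failure mode is easy to reproduce with the off-the-shelf language identification tools commonly used to filter web text for training corpora. Below is a minimal sketch, assuming the Python langdetect library, whose label set of 55 languages does not include Scots; the Scots sentence is an illustrative one of my own, not the tweet discussed below.

```python
# A minimal sketch of the misclassification described above. langdetect
# identifies 55 languages; Scots ("sco") is not among them, so Scots text
# structurally cannot receive its own label and is, in practice, almost
# always returned as English.
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # langdetect is probabilistic; fix the seed for repeatability

# An illustrative Scots sentence (my own hypothetical example).
scots = "Gey monie fowk speak an scrieve in Scots the day, but the leid is aften mistaen for a dialect o English."

print(detect(scots))  # -> "en": the closest available label wins, erasing Scots
```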

This is an example of DeepAI perpetuating linguistic bias against Scots. The text is a tweet from the writer Billy Kay. When asked to describe a person who speaks like this, the LLM responds that the person is speaking English with Scottish vocabulary and must have a strong connection to their cultural identity.

To mitigate this bias, it is essential to diversify the input data used to train LLMs and to recognize the structural prejudice ingrained in the way languages are and have been represented. This requires incorporating data written in a variety of minority and marginalized language varieties, and minimizing the data written about these varieties in dominant languages. Moreover, it is crucial to consider how speakers self-identify their languages, to ensure that their voices are represented accurately and authentically. This involves not only including texts written in these languages but also acknowledging the historical and ongoing power dynamics that have shaped the way language has been used and represented.
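
What diversification looks like in practice will vary, but even the mechanical bookkeeping matters. Below is a toy sketch of one small piece of it, auditing and rebalancing a corpus’s composition; the (variety, text) pairs and target shares are hypothetical, and the variety labels are assumed to come from speakers’ own self-identification rather than from an automatic classifier.

```python
# A toy sketch: audit each variety's share of a training corpus, then
# upsample under-represented varieties toward a minimum share. The
# "variety" labels and target floors are hypothetical placeholders.
import math
import random
from collections import Counter, defaultdict

def audit(corpus):
    """Return each variety's share of a corpus of (variety, text) pairs."""
    counts = Counter(variety for variety, _ in corpus)
    return {variety: n / len(corpus) for variety, n in counts.items()}

def upsample(corpus, floors, seed=0):
    """Duplicate documents (sampling with replacement) until every variety
    in `floors` reaches its minimum share of the corpus."""
    rng = random.Random(seed)
    texts = defaultdict(list)
    for variety, text in corpus:
        texts[variety].append(text)

    out = list(corpus)
    for variety, floor in floors.items():
        if not texts[variety]:
            continue  # duplication cannot conjure data for an absent variety
        have = sum(1 for v, _ in out if v == variety)
        # Smallest k such that (have + k) / (len(out) + k) >= floor.
        k = max(0, math.ceil((floor * len(out) - have) / (1 - floor)))
        out.extend((variety, rng.choice(texts[variety])) for _ in range(k))
    rng.shuffle(out)
    return out

# Hypothetical corpus: 98 standard-English documents, 2 Scots documents.
corpus = [("en-standard", "...")] * 98 + [("sco", "...")] * 2
print(audit(corpus))                       # {'en-standard': 0.98, 'sco': 0.02}
balanced = upsample(corpus, {"sco": 0.10})
print(audit(balanced)["sco"])              # ~0.10
```

Upsampling is a blunt instrument: it raises the weight of documents that already exist but adds no new voices, so the substantive work remains collecting, and fairly representing, self-identified data.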