The importance of NLP in global languages
Upwards of 7,000 unique languages are in use around the globe. However, NLP research has had a stark tendency to concentrate on English, the de facto global lingua franca. This blog post by AICoreSpot highlights the value of NLP research in other languages. Natural Language Processing (NLP) research has focused primarily on developing methods that work well for English, despite the clear advantages of working in other languages: an outsized societal impact, the opportunity to model a wealth of linguistic features that guard against overfitting, and fascinating challenges for machine learning (ML). Most of these 7,000+ languages are spoken in regions such as Asia, Africa, the Pacific, and South America, where native regional languages dominate everyday communication. There have been significant advances across natural language processing tasks over the past decade or so, but the lion's share of these results have been achieved in English and a small selection of other high-resource languages. A hierarchy of resources can be established on the basis of the labelled and unlabelled data available on the internet; Joshi et al., in a recent ACL 2020 paper, propose such a classification based on data availability.
Joshi et al. classify languages into six categories, numbered 0 through 5, based on the labelled and unlabelled data available for each, along with the number of languages and speakers falling into each category. Category 5 languages, such as English, have by far the most data at their disposal, while Category 0 languages have essentially none.
Categories 4 and 5 occupy the enviable position of possessing large amounts of both labelled and unlabelled data, and they are well studied in NLP research. Languages in the other categories have been mostly ignored. This blog post by AICoreSpot puts forth the argument for working in languages other than English, shifting the field's dominant focus. At a more detailed level, we will explore this rationale from societal, linguistic, ML, cultural/normative, and cognitive perspectives.
The societal perspective
Tech is severely limiting itself if it is only apt for English speakers with ‘conventional’ accents (American, Canadian, British, Australian, etc.).
The language we use to communicate every day determines our access to information, education, and even human connections. While we like to view the internet as a democratic platform, it is relatively autocratic with regard to language: an inherent divide exists between predominantly Western languages and other regional languages. Western languages carry a strong bias in that they are viewed as the ‘normal’, with regional languages taking a back seat. As few as a hundred languages are represented online, and speakers of the remaining languages are severely restricted in the amount and quality of data available to them.
Even as informal discourse on social media brings many more languages online, this divide permeates all levels of tech. At the most fundamental level, low-resource languages lack basic features such as spell checking and even keyboard support; you cannot write in these languages on a standard operating system, which essentially renders their users mute. To make matters worse, languages without a widespread written culture are even more restricted when it comes to keyboard support. At a higher level, many algorithms carry inherent biases and discriminate against speakers of non-English languages, and even against speakers with non-standard accents.
The issue of accents is an obstacle to inclusion because current research treats a high-resource language like English as homogeneous. The contrary is true: English is a heterogeneous melting pot, not a homogeneous language. Various regional dialects of English exist even within the Western Hemisphere, and the picture becomes far more varied when we consider dialects in non-Western regions.
Current models consequently do not perform as expected across the associated linguistic subcommunities, dialects, and accents. Practically speaking, the lines of demarcation between language varieties are much blurrier than we perceive them to be, and identifying similar languages and dialects remains a challenging problem. For example, although Italian is the lingua franca of Italy, approximately 34 regional languages and dialects are in use throughout the country.
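To make the identification problem concrete, here is a minimal sketch of the classic character n-gram approach to language identification. The two-language setup and the training samples are hypothetical toys for illustration, not a real system; distinguishing closely related dialects is far harder than this sketch suggests.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Character n-gram profile of a text, as a Counter of n-grams."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram Counters."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))

# Toy training samples (made up for this sketch).
samples = {
    "english": "the cat sat on the mat and the dog barked at the cat",
    "italian": "il gatto dorme sul tappeto e il cane abbaia al gatto",
}
profiles = {lang: char_ngrams(text) for lang, text in samples.items()}

print(identify("il cane dorme", profiles))  # "italian"
```

Real systems use far larger profiles and smoothing, but the frequency-profile idea is the same; it is precisely this approach that struggles when two varieties share most of their character statistics.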
An ongoing trend of tech exclusion will not just intensify the language divide; it may compel speakers of unsupported languages and dialects to migrate to high-resource languages with better tech support, potentially driving minority language varieties to extinction. To ensure that non-English speakers are not left behind, to offset the prevailing imbalance, and to reduce language and literacy barriers, we need to apply our models to non-English languages.
The linguistic perspective
Although we claim to be interested in producing general language understanding methods, our methodologies are inherently biased towards a single language: English.
English and the small selection of other high-resource languages are largely unrepresentative of the global picture. Most high-resource languages belong to the Indo-European family, are spoken mainly in the Western world, and are morphologically poor: information is expressed primarily through syntax, i.e., a relatively fixed word order and many separate words, rather than through variation at the word level.
For a more in-depth understanding, we can examine the typological features of various languages. The World Atlas of Language Structures catalogues 192 typological features, i.e., structural and semantic properties of a language. For example, one typological feature describes the usual order of subject, object, and verb in a language. Each feature has 5.93 categories on average. Nearly half of these feature categories appear only in the low-resource languages of categories 0-2 above and are never observed in languages of categories 3-5. Neglecting such a large subset of typological features means our NLP models may be missing out on a wealth of information that could be valuable for generalization.
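As a toy illustration of this coverage gap, the sketch below checks which values of one WALS-style feature (the order of subject, object, and verb) are attested only among low-resource languages. The miniature table and its resource-category assignments are assumptions for illustration, not actual WALS or Joshi et al. data.

```python
# Toy WALS-style table: language -> (feature value, resource category 0-5).
# The feature values are real typological categories; the assignments here
# are simplified assumptions, not the actual Joshi et al. taxonomy.
word_order = {
    "english":  ("SVO", 5),
    "mandarin": ("SVO", 5),
    "hindi":    ("SOV", 4),
    "yoruba":   ("SVO", 2),
    "warlpiri": ("free", 0),   # no dominant order
}

high = {value for value, cat in word_order.values() if cat >= 3}
low = {value for value, cat in word_order.values() if cat <= 2}
only_low = low - high          # values a high-resource-only model never sees

print(only_low)  # {'free'}
```

A model developed and evaluated only on the high-resource rows would never encounter free word order at all; scaling this check to all 192 WALS features is what yields the "nearly half" figure above.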
Shifting the focus from English can also help us obtain new knowledge about the relationships between the languages of the world. Conversely, it can help us discover which linguistic features our models are capable of capturing. You could leverage your knowledge of a specific language to probe aspects that differ from English, such as the use of diacritics, extensive compounding, inflection, derivation, reduplication, agglutination, and fusion.
The ML Perspective
We encode assumptions into the architectures of our models based on the data we intend to apply them to. Although we intend our models to be general, many of their inductive biases are specific to English and to languages that are structurally and typologically similar to it.
The lack of any overtly encoded language-specific information in a model does not mean it is truly language agnostic. A typical example is n-gram language models, which perform considerably worse for languages with complex morphology and relatively free word order.
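To see why word-level n-gram counts thin out for morphologically rich languages, consider this toy comparison (the suffixes are invented for the sketch): fusing case and number into the noun multiplies the number of distinct word forms, so each form, and every n-gram containing it, is observed far less often in a corpus of the same size.

```python
# An analytic language marks case with separate words (prepositions),
# so the three noun forms are reused everywhere; a hypothetical
# agglutinative language fuses case and number into the noun itself.
nouns = ["dog", "cat", "bird"]
preps = ["of", "to", "in", "from"]          # analytic: separate function words
cases = ["", "ka", "ta", "lla", "sta"]      # agglutinative: made-up case suffixes
numbers = ["", "it"]                        # made-up plural suffix

analytic_word_types = set(nouns) | set(preps)
agglutinative_word_types = {n + num + c
                            for n in nouns for num in numbers for c in cases}

# Same set of meanings, very different vocabulary sizes.
print(len(analytic_word_types), len(agglutinative_word_types))  # 7 30
```

With roughly four times as many word types covering the same meanings, the agglutinative corpus yields much sparser n-gram counts, which is exactly the regime where count-based language models degrade.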
Likewise, neural architectures often gloss over the complexities of morphologically rich languages: subword tokenization performs poorly on languages featuring reduplication, byte pair encoding aligns poorly with morphology, and languages with large vocabularies are more difficult for language models. Variations in grammar, word order, and syntax also cause complications for neural models. Additionally, we typically assume that pre-trained embeddings readily encode all relevant information, which may not hold for all languages.
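The byte pair encoding point can be illustrated with a minimal sketch of the algorithm (following Sennrich et al.'s formulation) on a tiny hypothetical corpus: merges are driven purely by pair frequency, so the segmentation of an unseen, morphologically complex word need not line up with its actual morphemes.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a word -> frequency dict."""
    vocab = {tuple(w) + ("</w>",): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Apply the learned merges, in order, to an unseen word."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical word frequencies; "lowest" itself never occurs in training,
# so its segmentation falls out of whatever merges the frequent words drove.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, 10)
print(segment("lowest", merges))
```

The merge inventory is shaped entirely by surface frequency: a language where morphemes surface in many fused variants (or where reduplication copies material) gives BPE far weaker statistics to latch onto than this English-like toy does.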
These challenges include modelling structure at both the word and the sentence level, handling sparsity, few-shot learning, encoding relevant information in pre-trained representations, and transfer between related languages, among many other fascinating directions. Existing methodologies do not handle them well, so a fresh set of language-aware approaches is the need of the hour.
New models have consistently matched human-level performance on ever tougher benchmarks, at least in English, leveraging labelled datasets with thousands of instances and unlabelled corpora with millions. In the process, as a community we have overfit to the characteristics and conditions of English-language data. In particular, by concentrating on high-resource languages, we have prioritized methodologies that work well only when large amounts of labelled and unlabelled data are available.
A majority of prevailing methodologies break down when applied to the data-scarce conditions typical of most of the world's languages. Even recent advances in pre-trained language models, which drastically reduce the sample complexity of downstream tasks, require huge amounts of clean, unlabelled data that is unavailable for most languages. Performing well with less data is therefore an ideal setting in which to probe the limits of current models, and evaluation on low-resource languages is arguably their most impactful real-world application.
The cultural perspective
The data our models are trained on reveals not only the characteristics of the particular language but also the cultural norms and common-sense knowledge of its speakers.
This common-sense knowledge can differ across cultures. For example, the notion of ‘free’ versus ‘non-free’ goods varies cross-culturally, where ‘free’ goods and services are those that individuals can use without any special permission, such as salt and pepper in a restaurant. Taboo topics likewise vary across the cultures of the globe.
Furthermore, cultures vary significantly in how they assess relative power and social distance, among other factors. In addition, many practical scenarios, such as those contained in the COPA dataset, do not correspond to the direct experience of much of the world's population, and equally fail to reflect everyday scenarios that are obvious background knowledge to that population.
As a consequence, an agent exposed only to English data originating predominantly from the Western world may be able to hold a meaningful and relevant conversation with users from Western countries, but communication with someone from a different culture, say an Eastern one, would suffer from numerous pragmatic failures.
Beyond cultural norms and common-sense knowledge, the data we use to train a model also reflects the values of the underlying society. As NLP researchers, analysts, and practitioners, we have to ask ourselves whether we want our NLP systems to exclusively embody the values and morals of a particular country or language community.
While this decision may be less critical for present systems that primarily handle simple tasks such as text classification, it will become more important as systems grow smarter and need to handle more complex decision-making tasks.
The cognitive perspective
Human children can learn any natural language, and this capacity to acquire language is remarkably consistent across very different types of languages. To attain human-level language comprehension, our systems should be able to display a similar level of consistency across languages from diverse linguistic families and typologies.
Our systems should ultimately develop the ability to learn abstractions that are not specific to the structure of any one language but that generalize to languages with very different properties.
What practitioners can do
- Datasets: If you develop a new dataset, reserve 50% of your annotation budget for creating a dataset of the same size in a different language.
- Assessment: If you are interested in a particular task, consider evaluating your model on the same task in a different language.
- Bender Rule: Specify the language that you are working on.
- Assumptions: Be explicit about the signals your model leverages and the assumptions it makes. Consider which are specific to the language you are studying and which might be more general.
- Language diversity: Evaluate the language diversity of the sample of languages you are studying.
- Research: Work on methodologies that tackle the challenges facing low-resource languages.
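For the language-diversity point above, one simple way to quantify the diversity of a language sample, a rough illustrative metric assumed here rather than one prescribed by the post, is the entropy over the language families it covers:

```python
import math
from collections import Counter

def family_diversity(languages):
    """Shannon entropy (in bits) over the language families in a sample.

    `languages` maps language -> family; a higher score means the sample
    spreads more evenly over more families.
    """
    counts = Counter(languages.values())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical evaluation samples; the family labels are standard,
# the choice of samples is made up for illustration.
narrow = {"english": "Indo-European", "german": "Indo-European",
          "french": "Indo-European"}
broad = {"english": "Indo-European", "mandarin": "Sino-Tibetan",
         "swahili": "Niger-Congo", "turkish": "Turkic"}

print(family_diversity(narrow), family_diversity(broad))  # 0.0 2.0
```

Family entropy is a blunt instrument (it ignores typological distance within families), but even this crude score makes it obvious when an "multilingual" evaluation is really three Indo-European neighbours.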