NLP Datasets – how effective is your deep learning model?

In an upcoming blog article, we will talk about how tokenizers are critical to understanding how deep learning Natural Language Processing (NLP) models read and process text. Once a model is able to read text, it can begin to learn how to perform various NLP tasks. At this point, we need to start working out how well the model performs across the range of tasks it has learned. Is it good at just one task, or does it perform well on a variety of tasks? Does it learn new tasks easily and with minimal training? And how does it compare to other NLP models that perform the same or similar tasks?

This is a critical part of the process. Think of BERT: it made an impact when it burst onto the scene in 2018 because of its performance on a broad variety of NLP tasks. Organizations choose which models to put into production based on their performance on different datasets. But it is hard to know precisely which dataset to use as the baseline. What if one model performs better on a translation task, for instance, but badly on a question answering task? How do you compare this with a model that did just adequately on both? In this blog article, to answer these questions, we will look at one of the most recently released NLP datasets and see how it tries to deal with these issues when providing a benchmark for comparing cross-lingual NLP models. But to start with, what do datasets have to do with Adam Driver?

Datasets, the film, featuring Adam Driver? 

It’s fair to say that datasets are not the “sexiest” aspect of the exciting new advances in deep learning NLP. When you read a new paper detailing the human-like performance levels of the most advanced model, you are not going to hear people say “wow, I cannot wait till I grab the dataset, it sounds amazing!” Likewise, unluckily for the NLP community, it is doubtful that Hollywood would want to produce a movie featuring Adam Driver about developing the planet’s most exhaustive machine learning dataset. However, Adam Driver or not, good datasets are the primary pinch point in the AI pipeline behind the progress in deep learning NLP. It’s no longer Moore’s Law that’s driving the rate of change in computational technologies, it’s datasets.

There was a time, even in the early 2000s, when computational power was a limiting factor. Now, however, we are moving towards a new kind of software, or software 2.0 as described by Andrej Karpathy, where the objectives of a program are more abstractly defined and things like algorithmic improvements are likely to deliver more gains than any advance in computational performance. At the base of this new software 2.0 paradigm are datasets.

As we no longer give specific step-by-step instructions in our programs, we define learning objectives instead. For instance: here are tons of pictures of Adam Driver, figure out how to tell them apart from pictures that do not contain Adam Driver. We do not tell the application to use his hair, or his eyes, or his nose, or any other feature to distinguish him from everybody else. That was software in its first iteration: here is a list of rules for how you, the computer, should identify Adam Driver. This is where datasets come into play. We need them first in order to train our Adam Driver image classifiers. And next, we need adequate datasets to benchmark the best Adam Driver classifiers in the wild.

NLP models like GPT-2 have already outgrown earlier datasets, which anticipated a much steadier increase in machine learning capabilities on core NLP tasks. By present standards, they already perform close to or above human level on a variety of NLP tasks. But this says more about the limitations of our present datasets than it shows that these models are as good as humans at things like question answering. We’ve not hit HAL 9000 levels of machine understanding yet. It will likely take a few more years before technology evolves to a point where HAL can exist.

Improved datasets mean improved models. If you understand how these datasets are structured, then you will understand what the model can learn from them. Also, with the growing prevalence of transfer learning and fine-tuning of pre-trained models, it is likely that you will need to develop your own dataset to tailor these models to your own organizational domain. This is simpler if you already know the tasks and datasets on which the model was initially trained. If you get your dataset right, who knows, you might develop your own version of HAL and Adam Driver will play you in the movie!

Previously, it was difficult to get a good overview of the NLP dataset landscape. There appeared to be no adequate map to help navigate the plethora of different NLP tasks and their corresponding datasets. A few models used translation datasets from Stanford, others used the Penn Treebank dataset to evaluate part-of-speech (POS) tagging, and then models like BERT used a broad variety of tasks to show their capabilities. Which tasks should we evaluate our models on, and which datasets should we use for those evaluations? Also, until a few years ago, if you were not deep into academia or did not have an enormous budget, there appeared to be little value in trying to understand these datasets. It would not have been viable to train models from the ground up on them. Thanks to more affordable computational resources, improved learning architectures in models such as ELECTRA or XLNet, and resources like HuggingFace, it is now both simple and affordable to do so. Given that, as we have seen, it is now more important than ever to take the time to get acquainted with these datasets.

The time is ripe to get started on these datasets, as Google has just released a brand new multi-lingual, multi-task, multi-everything benchmark called XTREME, which promises to help launch a new wave of linguistically talented NLP super models. The good thing about this benchmark is that it tries to give some structure to the NLP tasks needed to develop a better model. It does this by defining four categories, which are then made up of nine individual tasks.

There are other resources that have tried to provide an all-in-one collection of NLP tasks, for example the Universal Dependencies framework. However, XTREME appears to be the first to try to organize them in a top-down, category-to-task approach, which makes it simple to understand the capabilities of your model. We therefore believe it is a good resource, and we will use it here to give structure to the discussion.

Using XTREME as a guide, we will first briefly review the categories and tasks discussed in the XTREME paper. This is not a deep dive, just enough to make you aware of what each task is trying to accomplish and how it is structured. For instance, how is the data labelled, and how do you use it for training or testing? Then we will look at the various categories in more detail and run some code to explore the datasets. We may also point out other tasks or datasets that are useful but are not included in XTREME.

It should be noted that one of the primary benefits of the XTREME benchmark is its broad language coverage. There are many brilliant datasets out in the wild, but several of them are tailored only to English or other high-resource languages. XTREME tackles this by making sure that every task covers a wide range of languages. XTREME tries to provide a benchmark that measures a model’s general language capabilities. This is both in terms of a broad range of tasks, from low-level word syntax tasks to higher-level reasoning, and in terms of the model’s ability to transfer what it has learned to different languages, for example from English to a lower-resourced language. You can therefore use XTREME to perform zero-shot learning on a fresh set of tasks and languages. We will not be able to go into that level of detail here, but hopefully by raising your awareness of the issue you can dig deeper into XTREME’s more sophisticated features.

The primary NLP tasks, for XTREME or for any other dataset, mostly fall into two broad categories. Either they try to teach the model about the word-level syntactic sugar that makes up a language: things such as verbs, nouns, and named entities, the parts and pieces that give a language its structure. Or they deal with the broader level of conceptual understanding of meaning, which human beings typically take for granted. For instance, it is simple for human beings to see that “colorless green ideas sleep furiously” or “time flies like an arrow, fruit flies like a banana” are syntactically correct sentences but semantically meaningless or confusing. It is hard for NLP models to reach this high level of semantic understanding.

The holy grail of NLP is to train a model that learns a high-level or “general” understanding of language rather than task-specific knowledge. This is like the notion of Artificial General Intelligence (AGI), which posits that machines can attain a degree of intelligence that enables them to adapt and learn new concepts and skills, much like a human being would.

BERT was celebrated because it demonstrated a high level of performance on a broad variety of NLP tasks. If models such as BERT can perform well on a broad range of tasks, would it be possible for them to approach any NLP task and, much like humans, adapt their knowledge to the task and understand the semantic problems with our sentences above? There is a lot of debate about whether it is even possible for a machine to attain this human-like ability. Some people even believe that AGI is impossible, as there is no general level of intelligence for machines to replicate.

This is relevant to XTREME because the authors observed that to understand a language model’s multilingual capabilities, you need to assess it on a broad range of NLP tasks. It is not enough for a model to be good at translation, as it may perform very poorly at classification, question answering, or lower-level tasks. A good model ought to perform well on a variety of tasks across a broad array of NLP skills. This is why the authors divide the NLP tasks into several groups.

XTREME task layout

Broadly speaking, we can define the variety of NLP tasks as follows:

Structured prediction: Concentrates on low-level syntactic features of a language, such as part-of-speech (POS) tagging and named entity recognition (NER). These datasets provide sentences, typically split into lists of individual words, with corresponding tags. NER uses a smaller set of tags, and there are a few tagging schemes you can use.

Sentence classification: There is no rigid hierarchy here, but sentence classification is at the lower end of the syntactic-semantic tree. You could use classification tasks to train a model to distinguish between our “fruit flies” and “green ideas” type sentences mentioned earlier. We would add a classification layer on top of a pre-trained BERT and train it with lots of syntactically and semantically correct sentences against syntactically correct but semantically odd sentences. It could perform well, but it’s a bit of a stretch to say that it has “learned” any semantic knowledge. These models are clever and can “cheat” on these tasks to get results without acquiring semantic understanding. The classification can also be for things like sentiment, where the model has to decide whether something is mostly positive or negative. The reason classification is a somewhat higher-level task is that it works at the sentence level rather than the word level of tasks such as NER and POS. Sentences are vastly more complicated than words since there is, theoretically speaking, a fixed number of words but a limitless number of ways in which they can be strung together in a sentence.

Sentence retrieval: Retrieval is another step up the complexity stack. These tasks try to identify semantically similar sentences. This is harder than classification because there can be more than one result for a sentence. It is not entirely obvious how these models detect similar sentences. Try interpreting a 512-dimensional vector and telling someone what each dimension is for! You might claim that the 265th dimension is for cats, but you would only know that if you had made the effort to interpret them all. For instance, are the sentences “I like burgers” and “I don’t like burgers” similar? To some degree they are, but if you work at a restaurant and your chatbot sends someone who says they don’t like pizza to the pizza section of the menu, they will not be too pleased. Semantic retrieval or semantic textual similarity (STS) tasks work with many pairs of sentences, each with a label. The label can indicate whether they are similar or not, or even whether they are duplicates of each other.

Question answering: Close to the top of the stack we have question answering (QA) tasks. These are more complicated than the similarity tasks because the questions are not structurally similar to the answers and the model needs to take context into account. You can have simple variants of these tasks like “What is the capital of Spain?” You could try to match this more easily than something such as “What is the largest town in Spain?” Firstly, what do you mean by largest? Land area, length of road, population? Then do you exclude cities and only look at towns? The complexity depends on the dataset since, for instance, some QA datasets have an answer for every question. This makes a world of difference, as the model knows it can always find an answer. Others include questions that have no answer, questions that have both a short answer and a long answer, and ones with only a long answer. So there is a lot going on here.
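
The different flavours of QA dataset described here can be pictured with a small data structure. This is only a minimal sketch and not the schema of any particular dataset: the `QAExample` class and its field names are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    """Hypothetical container for one QA item; field names are illustrative."""
    question: str
    context: str
    short_answer: Optional[str] = None  # e.g. a single entity or phrase
    long_answer: Optional[str] = None   # e.g. the sentence that contains it

    def is_answerable(self) -> bool:
        # Some datasets guarantee an answer for every question; others do not.
        return self.short_answer is not None or self.long_answer is not None


examples = [
    QAExample(
        question="What is the capital of Spain?",
        context="Madrid is the capital of Spain.",
        short_answer="Madrid",
        long_answer="Madrid is the capital of Spain.",
    ),
    # The context says nothing about towns, so this one has no answer.
    QAExample(
        question="What is the largest town in Spain?",
        context="Madrid is the capital of Spain.",
    ),
]

answerable = [ex for ex in examples if ex.is_answerable()]
print(f"{len(answerable)} of {len(examples)} questions can be answered from their context")
```

A model trained only on data like the first example never has to learn to say "there is no answer here", which is part of why the presence of unanswerable questions changes the difficulty so much.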

Other tasks? What about other tasks that are not included in the XTREME benchmark, like text summarization and translation? Well, let’s look at them in terms of what we just discussed.

First, the length of the text. Both tasks need text strings longer than a single word, which would place them a bit higher up the complexity ladder than word-level tasks like POS and NER.

Next, the tasks themselves. Some tasks are based on searching for things contained within the body of the text itself. For instance, classification tasks search for cues that help put something in one bucket instead of another.

QA tasks, on the other hand, try to make an inference that might not even be possible to begin with. Is there even an answer here? If there is, then it may not simply be a case of copying the text and using that as the answer. Is it a long answer or a short answer? Is knowledge from outside the text required? For translation tasks you typically have the data contained in the text, e.g., this sentence is the German version of that sentence. Cross-lingual NLP models try to avoid this by training models to learn general language structures in an unsupervised manner.

Likewise, summarization consists of condensing a given chunk of text into a shorter one. This appears to be a somewhat harder task than translation, as you are trying to work out what to omit, and it may involve more combinations of syntax and semantics. So it appears to be more complicated than translation but somewhere below question answering on the complexity ladder.

Chatbots? While not a conventional task, you could say that the eventual goal of developing more complicated tasks is to enable human-like chatbot interactions. This represents the highest level of complexity in NLP, as it involves a broad variety of tasks, from syntax to semantics to applying something learned in one domain to another. How models perform at this chatbot-like level is measured by contests like the Turing Test, where a person has to work out whether they are speaking to a human being or a machine.

None of this is a hard and fast rule about the order of our complexity ladder. You may have a different view of what makes a task complicated. Indeed, the reason XTREME was developed in the first place is that no single task can tell us which NLP model is the best. Some models that perform better on translation tasks have limited success when you try to transfer their skills to tasks like POS and NER. Likewise, some models that perform well on POS or NER in one language struggle to transfer that performance to other languages. You could make the case for a complexity element within different groups of languages and use that to re-order the layout above. The main thing is to think about it for yourself, come up with some rough guidelines of your own, and use those going forward. Most importantly, you should be open to changing them when you come across a task or a language that challenges your view of the world order.

What about the coding? 

Too much talk and not enough coding. Let’s look at the layout of XTREME and also at other potential datasets that are not included in the XTREME benchmark.

Preliminary setup 

The XTREME repo has all the code you need to download most of the datasets automatically. There is only one dataset that needs to be downloaded manually. To make it simpler to follow, the individual steps are outlined in the README. To begin, go through those setup steps and make sure you have obtained all of the datasets.

XTREME task layouts 

  • Structured prediction (NER and POS) 
  • Sentence classification
  • Sentence retrieval  
  • Question answering (QA) 

Structured Prediction 

Named Entity Recognition (NER) 

Let’s begin by looking at the structured prediction datasets in the XTREME package. Do not worry too much for now about details like what each tag means, we will get to that later. Just try to understand the general layout and how you might be able to use it to train or fine-tune your own models.

NER – dissected 

In layman’s terms, a named entity is any term or word that represents an object in the physical world. In the sentence “Tom Brady is the quarterback for the Tampa Bay Buccaneers”, “Tom Brady” and “Tampa Bay Buccaneers” are the named entities, as they represent a person and an NFL team respectively. “Quarterback”, by comparison, is not a named entity since it can refer to any number of people. To put it differently, it is not specific to the individual Tom Brady. A named entity might be any person, place, organization, product, or object. Its use is typically extended to include temporal and numeric expressions as well, even if they do not fit within the strict definition of the term.

NER can help with tasks like information extraction, when you want to answer questions like how much a product costs, when a product is due for delivery, or which person a piece of text is referring to. NER might also be very specific to your field, with unique product names or business-specific terms. As a result, it’s critical to find or develop a good dataset to handle these potentially unique elements.

NER – what tags are used in your dataset?

NER datasets will typically be structured as word-tag pairs, where the tag indicates whether or not the word is a named entity and, if so, the kind of named entity it represents. As an example, take the NLP library spaCy. Its NER model identifies named entities with tags like “PERSON”, which is a person, “ORG”, which means organization, and so on.

By comparison, other NER models and tools use slightly different tags to identify a named entity. Many datasets follow what is referred to as Inside-Outside-Beginning tagging, or IOB tagging. This identifies, to begin with, when something is not a named entity: the O or Outside tag. When something is a named entity, it can be either the beginning of a named entity or inside a named entity. The examples in the XTREME dataset use this IOB format for the NER tasks. It’s worth noting the different approaches in case you are using a different NER dataset and come across a different tagging variant.
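
To make the IOB idea concrete, here is a minimal sketch that turns entity spans into one tag per word. The helper `spans_to_iob` is our own illustration and not part of XTREME or any library, and the label names `PER` and `ORG` are assumptions, so check the tag set of the dataset you are actually using.

```python
def spans_to_iob(tokens, spans):
    """Turn (start, end, label) token spans into one IOB tag per token.

    `end` is exclusive. Tokens outside every span get the "O" tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first word of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining words of the entity
    return tags


tokens = "Tom Brady is the quarterback for the Tampa Bay Buccaneers".split()
# Label names are assumptions made for this example.
spans = [(0, 2, "PER"), (7, 10, "ORG")]

for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(f"{token}\t{tag}")
```

Each word ends up paired with exactly one tag, which is the word-tag layout a model is trained on. Note that “quarterback” receives the O tag because it is not a named entity.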

NER examples

As an example of different tagging strategies, in the XTREME examples that use the IOB format, the “Empire State Building” would be identified by three separate tags.

By comparison, in the spaCy default example, these three words would be covered by the single tag FAC, which refers to buildings, highways, airports, bridges, etc.

Note that spaCy lets you use other models, as well as different tagging schemes like IOB or BILUO. These use a different format, but the spaCy default model uses the more direct strategy shown here.
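
The two styles carry the same information, and you can move from per-word IOB tags to whole-entity labels with a few lines of code. The sketch below is a hand-rolled helper, not spaCy's or XTREME's own code, and pairing the `FAC` label with IOB prefixes is an assumption made purely to connect the two examples.

```python
def iob_to_spans(tokens, tags):
    """Collapse per-token IOB tags into (text, label) entity spans."""
    spans = []
    current_tokens, current_label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            current_tokens.append(token)    # still inside the open entity
            continue
        # Any other tag closes the entity that was open, if there was one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_label))
        if prefix in ("B", "I"):
            # A stray "I-" with no open entity is treated as a beginning.
            current_tokens, current_label = [token], label
        else:
            current_tokens, current_label = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans


tokens = ["The", "Empire", "State", "Building", "is", "tall"]
tags = ["O", "B-FAC", "I-FAC", "I-FAC", "O", "O"]
print(iob_to_spans(tokens, tags))
```

Three word-level tags collapse into one entity with one label, which is the more direct view of the same annotation.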

While NER looks for specific portions of a sentence to identify entities, POS tries to understand the entire sentence and how each word interacts with the others. Go back to our earlier “time flies like an arrow, fruit flies like a banana”. The meaning of this sentence changes considerably depending on how you read the words in it. For instance, does “time flies like an arrow” mean that time moves quickly, or is it describing a type of fly, the time fly, and its fondness for arrows? It’s all relative.

POS tagging tries to help here by identifying whether something is a noun, verb, adverb, and all that grammatical sauce you were taught in school and have now forgotten. spaCy provides a neat way to visualize the POS tags, the output of which we see above. According to that tagging, the spaCy model does not think “time flies” is a thing.



Either way, POS is in many ways more familiar to us from a simple grammar viewpoint than NER. We probably recognize more of the tags immediately and understand what it is trying to do. With NER, on the other hand, the task may seem odd, as not many of us have stopped to explicitly identify the named entities in a sentence. It’s just too obvious to a human being what is and is not a thing. For HAL 9000, however, this is a hard task, hence the training needed.

While POS tagging makes sense to us, it is still a very hard thing to learn, as there is no set way to identify precisely what a word represents. However, as noted, there is less confusion about the tagging scheme than with NER, so you will find that most datasets use some form of VERB, NOUN, ADV, and so on.
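
The ambiguity in “time flies like an arrow” can be written down as two competing tag sequences. The sketch below uses hand-written tags rather than the output of any model. The tag names follow the VERB/NOUN style mentioned in this section, and `ADP` (adposition) and `DET` (determiner) are assumed to be available in the tag set.

```python
# Two hand-written readings of the same sentence; neither comes from a model.
tokens = ["time", "flies", "like", "an", "arrow"]
reading_time_passes = ["NOUN", "VERB", "ADP", "DET", "NOUN"]  # time moves quickly
reading_time_flies = ["NOUN", "NOUN", "VERB", "DET", "NOUN"]  # insects fond of arrows


def differing_tokens(tokens, tags_a, tags_b):
    """Return (token, tag_a, tag_b) for every position where the readings disagree."""
    return [(tok, a, b) for tok, a, b in zip(tokens, tags_a, tags_b) if a != b]


for tok, a, b in differing_tokens(tokens, reading_time_passes, reading_time_flies):
    print(f"{tok}: {a} vs {b}")
```

Only “flies” and “like” change tags between the two readings, yet that is enough to change the meaning of the whole sentence. A POS dataset commits to one tag sequence per sentence, and that is what the model has to learn to reproduce.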


Using the spaCy tokenizer as an example of other tagging schemes.


Some examples of the different tags available through spaCy. Check the POS tags to see if they differ from the examples in the XTREME POS tasks.

POS tagging is a critical foundation of common NLP applications. As it is such a fundamental task, its usefulness can often seem hidden, since the output of a POS tagger, e.g. whether something is a noun or a verb, is typically not the output of the application itself. However, POS tagging is critical for:

  • Training of models: Training NLP models like BERT or RoBERTa on POS-like tasks is important for developing general linguistic skills that help the model work in different domains. If these models learn POS skills, they have a better chance of developing general language skills that help them perform better on a broad array of tasks.
  • Text to speech: Enabling computers to speak in a more human-like way is more important now with the growing use of applications like Alexa and Siri. POS tagging plays a critical part here, as languages like English have words that are pronounced differently depending on their position and context within a sentence. E.g., “read” pronounced as “reed” or “red”, and “live” as in “alive” or “live” which rhymes with “give”.
  • Word sense disambiguation: Go back to our time flies sentence. POS tagging helps us understand whether the fly is a thing that flies around the place or whether it refers to how swiftly time goes by when you’re having fun.
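
As a toy illustration of the text-to-speech point, a pronunciation can be looked up by word and POS tag together rather than by word alone. Everything in this sketch is hypothetical: the table, the informal respellings, and the `pronounce` helper are made up for illustration and do not reflect how any production speech system works.

```python
# Hypothetical pronunciation table keyed by (word, POS tag); respellings are informal.
PRONUNCIATIONS = {
    ("live", "VERB"): "liv",   # rhymes with "give"
    ("live", "ADJ"): "lyve",   # as in "alive"
}


def pronounce(word, pos_tag):
    """Pick a pronunciation using the POS tag, falling back to the spelling."""
    return PRONUNCIATIONS.get((word.lower(), pos_tag), word)


# A hand-tagged sentence; a POS tagger would normally supply these tags.
tagged = [("They", "PRON"), ("live", "VERB"), ("near", "ADP"),
          ("a", "DET"), ("live", "ADJ"), ("wire", "NOUN")]
print(" ".join(pronounce(word, tag) for word, tag in tagged))
```

The same spelling maps to two different sounds, and the POS tag is the only thing in the lookup that tells them apart.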

Sentence classification 


Are these paraphrased variants of each other? 

Some of the other datasets are easier to get a handle on than NER and POS. Some aspects of those tasks appear quite technical and focused on the syntactic nuances of grammar and linguistics. Then again, you may find that the NER and POS tasks make perfect sense. One of the fascinating quirks of NLP tasks is that as the task complexity increases at the model level, that is, as the task becomes harder for a machine to learn, the task appears simpler to us human beings.

In the XTREME sense, “sentence classification” refers to a variety of tasks that require a model to sort sentences into a set of linguistic categories. You might have encountered classification tasks before, where you classified something as spam, classified reviews of a restaurant or a product as negative or positive, or sorted articles into topic buckets like news or sport. The classification tasks implemented in XTREME are a bit different and harder than conventional classifiers. They still require that a sentence be placed into one of two or more categories, so the fundamental structure has not changed much. However, the buckets themselves are different. Instead of sentiment or topics, they are based on semantic relatedness, i.e., whether two sentences are paraphrases of each other, contradict each other, or whether one entails the other. This moves this kind of classification task a bit higher up our complexity ladder, closer to things like sentence similarity. But, at its core, it is still classification with a set number of outcomes or buckets.

Looking at pairs of sentences and deciding whether or not they are alike seems mundane to us human beings, right? Well, there are some tricky ones, as any language will invariably contain exceptions and peculiarities, but on the whole it seems like a reasonable task. Not so much in the world of machines. Before Skynet or HAL 9000 can go rogue, they need to learn a lot more about the whole crazy realm of human language.

Paraphrase Adversaries From Word Scrambling (PAWS) 

If you want your model to detect whether one sentence is a paraphrase of another, you will need something like the PAWS dataset. Paraphrase detection is an important task in NLP. Think of a simple chatbot for a delivery company. A customer may want to update their delivery address or their account settings. A typical way to phrase this might be:

  • Hi, how can I update my account? 

But there are several ways you could paraphrase this query. 

  • I want to edit my account. 
  • I want to update my settings. 
  • I want to modify my account details
  • Can I alter my account details? 

And this goes on as we find more ways of asking for the same thing. It becomes even more complex as the text grows from a sentence to a paragraph, to several pages, and so on. It is definitely not ideal to try to hand-code all of these examples, yet as human beings we can see that they are all basically saying the same thing. Datasets such as PAWS have been developed from the ground up to help with this.

Going through some PAWS examples of paraphrased sentences.


Going through some PAWS examples of sentences that are not paraphrases of each other.

One of the problems PAWS tries to address is that other NLP entailment datasets were basically too easy. Recall that neural networks are lazy, lazy entities. They will try to lie and cheat and do as little as possible to get to where you want them to go. Some of these datasets did not have many sentence pairs with high “lexical overlap” that were not paraphrases. To put it differently, any sentences that shared similar words were typically paraphrases. This creates an opportunity for a neural network to “cheat” by simply guessing that if there is some lexical overlap, the two sentences are very probably paraphrases of each other.

The example used in the paper is “flights from Florida to New York” and “flights from New York to Florida”. While they sound similar, they are obviously very different things. You would not want your chatbot to think these are identical when you try to book your flight back to New York. However, most datasets don’t have this degree of subtlety, so it can be easy for a shifty model to get a high score on some tasks. PAWS tackles this specifically through word “swapping” or “scrambling” within pairs of sentences to produce negative paraphrase examples.
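
A few lines of code show why word overlap alone is a poor signal for this pair. This is a minimal sketch: `word_overlap` is a plain Jaccard score over lowercased words, and `swap_words` is a crude stand-in for the idea of swapping, not the procedure actually used to build PAWS.

```python
def word_overlap(sentence_a, sentence_b):
    """Jaccard overlap between the sets of lowercased words in two sentences."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


def swap_words(sentence, first, second):
    """Swap two words or phrases in a sentence using a temporary placeholder."""
    placeholder = "\x00"
    return (sentence.replace(first, placeholder)
                    .replace(second, first)
                    .replace(placeholder, second))


original = "flights from Florida to New York"
swapped = swap_words(original, "Florida", "New York")

print(swapped)
print(word_overlap(original, swapped))
```

The swapped sentence uses exactly the same words as the original, so a model that relies on overlap would score the pair as a perfect match even though the two requests are for opposite journeys.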

Your dataframe for these examples should look something like this:


Where you can see the labels indicating whether the sentences are either:

  • 1: The two sentences are paraphrases of each other.
  • 0: The sentences have different meanings. Note that this does not necessarily mean they contradict each other. The point is that they are not paraphrases.
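
If you prefer to inspect the pairs without a dataframe, the standard library is enough. The sketch below assumes a tab-separated layout with `id`, `sentence1`, `sentence2`, and `label` columns, and the two rows are made up for illustration, so check the header of the files you actually downloaded.

```python
import csv
import io

# Two made-up rows in an assumed tab-separated layout: id, sentence1, sentence2, label.
sample_tsv = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tHi, how can I update my account?\tCan I alter my account details?\t1\n"
    "2\tflights from Florida to New York\tflights from New York to Florida\t0\n"
)


def read_pairs(handle):
    """Yield (sentence1, sentence2, is_paraphrase) from a tab-separated file."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row["sentence1"], row["sentence2"], row["label"] == "1"


pairs = list(read_pairs(io.StringIO(sample_tsv)))
paraphrases = [p for p in pairs if p[2]]
non_paraphrases = [p for p in pairs if not p[2]]
print(len(paraphrases), "paraphrase pair(s),", len(non_paraphrases), "non-paraphrase pair(s)")
```

To read a real file, pass an open file handle to `read_pairs` instead of the in-memory `io.StringIO` sample.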

Cross Lingual Natural Language Inference (XNLI) 

The XNLI dataset has a larger number of buckets than PAWS-X, as a sentence pair can be a contradiction, an entailment, or just neutral. It extends existing work on natural language inference to a number of lower-resource languages. The goal is to provide a way to perform sentence classification beyond English without needing a separate dataset for each language.

This avoids having to train a system from the ground up for every language. Instead, a system is trained in one language and evaluated on data from other languages. This, the authors claim, is more scalable than having separate systems for different languages. It is great that training, evaluation, and standardization for low-resource languages are seen as a priority. This will hopefully make it a lot easier to find pre-trained models for most of the languages you are looking for in the future.
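
The "train once, evaluate everywhere" idea can be sketched as a loop over per-language evaluation sets. This is only an illustration under stated assumptions: `accuracy_by_language` is our own helper, the `always_neutral` function stands in for a real model trained on English, and the evaluation sentences are made up.

```python
def accuracy_by_language(predict, eval_sets):
    """Score one model on several languages without retraining it.

    `predict` maps a sentence pair to a label; `eval_sets` maps a language
    code to a list of ((sentence1, sentence2), gold_label) items.
    """
    scores = {}
    for language, examples in eval_sets.items():
        correct = sum(predict(pair) == gold for pair, gold in examples)
        scores[language] = correct / len(examples)
    return scores


# Stand-in for a model trained on English only: it always answers "neutral".
def always_neutral(pair):
    return "neutral"


# Tiny made-up evaluation sets, for illustration only.
eval_sets = {
    "en": [(("A man eats.", "Someone is eating."), "entailment"),
           (("A man eats.", "It is raining."), "neutral")],
    "es": [(("Un hombre come.", "Está lloviendo."), "neutral")],
}

scores = accuracy_by_language(always_neutral, eval_sets)
print(scores)
print("average:", sum(scores.values()) / len(scores))
```

Reporting a score per language as well as the average makes it easier to spot a model that does well in English but fails to transfer.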

The layout of XNLI is a lot like PAWS in that there are pairs of sentences with a label. What distinguishes it from PAWS is that there is more nuance to the label. The label can be:

  • Neutral: The second sentence neither follows from nor contradicts the first
  • Contradictory: The sentences contain information that contradicts each other
  • Entailment: The meaning of one sentence follows from the other
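A quick way to get a feel for these labels is to count them and pull out a few examples of one class. The sentence pairs and the label strings below are made up for illustration, and the exact label names in your copy of the data may differ.

```python
from collections import Counter

# Made-up (premise, hypothesis, label) triples; the label strings are assumed
examples = [
    ("A man is playing a guitar on stage.", "A man is performing music.", "entailment"),
    ("A man is playing a guitar on stage.", "The stage is empty.", "contradiction"),
    ("A man is playing a guitar on stage.", "The man is a famous musician.", "neutral"),
    ("Two children are running in a park.", "Children are outdoors.", "entailment"),
]

# Which labels are present, and how often does each one occur?
label_counts = Counter(label for _, _, label in examples)

# Keep only the pairs labelled as entailment
entailment_pairs = [
    (premise, hypothesis)
    for premise, hypothesis, label in examples
    if label == "entailment"
]

print(dict(label_counts))
for premise, hypothesis in entailment_pairs:
    print(premise, "=>", hypothesis)
```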


Looking at the available labels in the XNLI dataset.


Looking through some examples of sentences labelled as entailment.

Sentence Retrieval 

How similar are these sentences? This can be particularly tough when comparing similarity across languages. Unlike classification, there is no longer a small number of buckets to choose from, as the output is a range of scores.

After sentence classification we move up the stack of complexity to sentence retrieval tasks. Again, the hierarchy of difficulty is not a hard and fast rule, but with classification we typically have defined buckets of outcomes. With the XNLI dataset, for instance, an outcome could be classified into one of three bins. By comparison, with sentence retrieval, the result is less defined.
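To make the difference from classification concrete, here is a minimal sketch of retrieval as scoring: every candidate gets a similarity score and we pick the highest. The three-dimensional vectors are hand-made stand-ins for sentence embeddings, which in practice would come from a model, and cosine similarity is just one common choice of score, not necessarily the one used by any particular XTREME baseline.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-made vectors standing in for real sentence embeddings
query_vector = [0.9, 0.1, 0.0]  # pretend embedding of "Where is the train station?"
candidates = {
    "Wo ist der Bahnhof?": [0.8, 0.2, 0.1],
    "Ich mag Kaffee.": [0.1, 0.9, 0.2],
    "Das Wetter ist gut.": [0.0, 0.2, 0.9],
}

# Every candidate gets a score on a continuous scale rather than a bucket
scores = {
    sentence: cosine_similarity(query_vector, vector)
    for sentence, vector in candidates.items()
}
best_match = max(scores, key=scores.get)

for sentence, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}  {sentence}")
```

Instead of one of three bins, the output is a ranking, and the quality of the model shows up in whether the true counterpart lands at the top.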

Building and Using Comparable Corpora (BUCC)

BUCC is a fascinating dataset as it tries to minimize the chance for models to use external cues, such as metadata (for example, URLs), to cheat and gain heuristic hints when trying to detect sentences that are similar to each other. The BUCC dataset therefore contains no such metadata, and its creators have even made it difficult to get hold of the original dataset from which the sentences were collected. This makes it nearly impossible for someone to find a sentence in the original dataset and so exploit meta-information as a heuristic cue.

BUCC itself appears to be a series of workshops that brings people together to build better training data from computational, cross-lingual, and linguistic perspectives. This is an innovative approach as it has both the linguistic and computational (that is, ML, DL, statistical and so on) communities working together to create better text corpora. Combining the skills of both communities is a brilliant way to find the best way forward in this field.

The BUCC dataset in XTREME contains a “gold” file for every language pair. This holds the matching pairs for the corresponding files. This way you can match an English sentence with its counterpart in the paired language.
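The matching step can be sketched as a simple lookup: read each line of the gold file, split it into two sentence ids, and fetch the sentences those ids point to. The ids, the sentences, and the tab-separated "other language id, English id" layout below are all made up for illustration, so check the real files before relying on this.

```python
# Made-up sentence tables keyed by id; real files would be read from disk
english_sentences = {
    "en-001": "The weather is nice today.",
    "en-002": "The train is full.",
    "en-003": "She reads a book.",
}
german_sentences = {
    "de-101": "Der Zug ist voll.",
    "de-102": "Das Wetter ist heute gut.",
    "de-103": "Er kocht das Abendessen.",
}

# Assumed gold layout: one matching pair of ids per line, separated by a tab
gold_lines = ["de-102\ten-001", "de-101\ten-002"]

matched_pairs = []
for line in gold_lines:
    german_id, english_id = line.split("\t")
    matched_pairs.append((english_sentences[english_id], german_sentences[german_id]))

for english, german in matched_pairs:
    print(english, "<->", german)
```

Sentences that never appear in the gold file, such as `en-003` and `de-103` here, have no counterpart, and a model should leave them unmatched.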


Let’s join the tables so we can compare them. This is done for exploratory purposes so we can see the layout.

Tatoeba Sentence Translations 

Tatoeba is an open source project that aims to build sentence translation pairs by getting people to translate from a foreign language they know into their mother tongue. The Tatoeba dataset is described in a recent research paper and is used to provide a much broader range of language translations to train on. The files contain 1,000 English-aligned sentences for every language pair. So you can look at a Spanish to English translation, for instance. For this there will be an English file and a Spanish file which should be aligned so you can find the corresponding sentence pairs.

Note that in the XTREME dataset there seems to be only an evaluation set available for the Tatoeba dataset. You can see this below, as the sentences are not properly aligned.


For most of the other datasets in XTREME there is some training data available, so you can fine-tune your model on different languages and test its zero-shot learning capabilities. But this dataset appears to only allow for evaluating your model.


Running these commands in your terminal or notebook should give you access to the original data.


Then we can look at this new data and get some parallel translation sentences if you are looking for a multilingual dataset to train your model on.

Question Answering (QA) 

And last but by no means least, we come to question answering tasks. By now you should be seeing a pattern where the tasks demand a bit more from our models in terms of their linguistic capabilities. For instance, in similarity tasks a model can look for statistical or heuristic cues to guess that things are likely to be related in some way.

This is also true of QA, but it is harder to find these cues as questions are more open-ended by nature. There can be a wide range of answers to a specific question, or none at all if it cannot be answered. The structure of the dataset defines the difficulty of the task by, for instance, not including an answer for each question, or including a specific answer for every question. These factors play a part in how the model behaves in the real world when searching through a document to find an answer to somebody’s query. Datasets that are designed to mimic real-world situations will help models perform better out in the wild, as they do not expect the data to be too clean or perfectly aligned.

Stanford Question Answering Dataset (SQuAD) 

The first thing to note about the three QA datasets is that they all follow what is known as the SQuAD format. SQuAD is a crowdsourced dataset created by crowdworkers who write questions based on Wikipedia articles. They view an article and then come up with a question whose answer is contained within a specific span of text in that article. There are two versions of the SQuAD dataset, SQuAD1.1 and SQuAD2.0.

One of the main differences between these versions is that in version 1.1 all the questions could be answered. The model could therefore cheat, knowing that there was always an answer. Remember, neural networks are very lazy and will take the path of least resistance if you let them. In version 2.0, there are 50,000 questions which cannot be answered, so models now have to establish whether there is in fact an answer to each question before providing one.

The dataset is essentially a JSON file which contains groups of questions and answers with an associated paragraph provided for context. For instance, here are a few questions on the subject of “Black Death”.

The fact that the QA datasets in XTREME are all in this format is great for us, as it means we can explore them in much the same way and don’t have to learn a new format for each one. This means we can write a simple class that uses a generator so that we can look through the examples one by one, and then use it on all of the QA datasets.
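A minimal sketch of such a class might look like the following. The class name `SquadExamples` and the toy record are hypothetical and only illustrate the idea. The nesting of `data`, `paragraphs`, `qas` and `answers` follows the published SQuAD layout, and `is_impossible` is the flag SQuAD2.0 uses for questions that cannot be answered.

```python
class SquadExamples:
    """Walk a SQuAD-format dictionary and yield one question at a time."""

    def __init__(self, squad_data):
        self.squad_data = squad_data

    def __iter__(self):
        # SQuAD nests questions as: data -> paragraphs -> qas -> answers
        for article in self.squad_data["data"]:
            for paragraph in article["paragraphs"]:
                for qa in paragraph["qas"]:
                    yield {
                        "title": article.get("title", ""),
                        "context": paragraph["context"],
                        "question": qa["question"],
                        "answers": [answer["text"] for answer in qa.get("answers", [])],
                        # Only SQuAD2.0-style files carry this flag
                        "is_impossible": qa.get("is_impossible", False),
                    }


# A tiny made-up record in SQuAD layout, standing in for a real JSON file
context = "The Black Death struck Europe in the 14th century."
toy_data = {
    "data": [{
        "title": "Black_Death",
        "paragraphs": [{
            "context": context,
            "qas": [
                {"id": "toy-1",
                 "question": "When did the Black Death strike Europe?",
                 "answers": [{"text": "14th century",
                              "answer_start": context.index("14th century")}],
                 "is_impossible": False},
                {"id": "toy-2",
                 "question": "Which city was affected first?",
                 "answers": [],
                 "is_impossible": True},
            ],
        }],
    }]
}

examples = list(SquadExamples(toy_data))
for example in examples:
    print(example["question"], "->", example["answers"] or "no answer")
```

For a real file you would load it with `json.load` and pass the resulting dictionary to the class. Because the iteration is lazy, you can also step through a large dataset one example at a time with `next(iter(...))` rather than building the full list.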


Cross-lingual Question Answering Dataset (XQuAD) 

XQuAD is essentially the SQuAD1.1 dataset extended to several languages. Firstly, the authors wanted to make sure that these datasets were available for languages other than English. But they also wanted to show that it is possible to train a model on one language and then use transfer learning to enable it to learn another language. This is in contrast to other approaches which need pre-training on every individual language in order to produce a multilingual model.

This has several advantages. If we can use monolingual models, then we can train them on high-resource languages like English, where we have tons and tons of good training data. This is a key objective of the XTREME dataset as a whole: it is trying to do for a wide range of NLP tasks what XQuAD did for QA. Training on a high-resource language means we can use the shared language understanding of these models to learn other languages without re-training on a translated or related version of that dataset in a different language.

In the authors’ own words, this shows that deep monolingual models learn some abstractions that generalize across languages. Which, when you think about it, is pretty cool. Perhaps HAL is closer than we initially thought?

The XQuAD version in XTREME does not include the answers to the questions, so you can get the original dataset if you want to look at them. There is an example of this in the notebook, and you can see an example question and answer below:


Multilingual Question Answering (MLQA) 

MLQA is also aimed at making it easier to build and evaluate multilingual QA models. As noted, it uses the SQuAD format to create a purpose-built multilingual evaluation benchmark which covers a wide variety of diverse languages.

The fascinating thing about MLQA is the way in which it is constructed:

  • English sentences are identified in Wikipedia which have the same or similar meaning in other languages
  • These are then extracted along with surrounding sentences to create a context paragraph
  • Questions are then crowdsourced about these English paragraphs. The questions should relate to the sentence identified in the first step.
  • The questions are then translated into all the relevant languages and the answer span is annotated in the context paragraph for the target language

This way you can find questions in English which have an answer in a corresponding English paragraph, or you can find a question in French, for instance, with a corresponding answer in English.


MLQA comes with a training set and a test set, so we don’t need to download the original to look at the answers.

Typologically Diverse Question Answering (TyDiQA) 

This is another brilliant new addition to the QA evaluation toolkit. Again, it is aimed at providing the tools needed to train and evaluate cross-lingual deep learning models. The TyDiQA dataset tries to raise the bar in terms of the skill needed to answer a question. It does this by getting people who want to know the answer to write the questions before they know the answer. In other datasets, like SQuAD, people read the answer before writing the question, so this could influence how the question is phrased.

The hope is that this will help produce more natural questions and prevent models from relying on “priming” effects (that is, questions written by someone who already knows the answer) and using statistical cues to “cheat” and find the answer. This should, hopefully, help models generalise across many different languages.

Datasets are critical for Software 2.0 

As we noted at the beginning of this blog article, we are in the middle of a shift from conventional software architecture to a new software paradigm where we no longer provide machines with a set of rules and instructions along with data. Instead, we give neural networks a general goal or outcome and then provide them with tons of data. So rather than giving someone a starting point A and directions on how to get to point B, we give them points A and B and tell them to figure out how to get there. This new paradigm needs a ton of data. It needs datasets to figure out the rules in the first place but, just as importantly, datasets are needed to check whether the right instructions were found. What if there are different sets of instructions? Which one gets you from A to B in the shortest time or by the best route?

XTREME is an example of a critical new building block in the future development of Software 2.0. It lets us see which models are genuinely developing the best rules to understand different human languages and adapt that knowledge to other tasks. As such, it helps both the people using the models and those building them. With the rise of transfer learning and fine-tuning, a dataset like XTREME will help you understand how to build your domain-specific dataset to tune the model to your particular business requirements.
