Resources for question answering in multiple languages.
Recent advances in NLP with language models such as BERT, GPT-2, XLNet or XLM have allowed surpassing human performance on Reading Comprehension tasks on large-scale datasets (e.g. SQuAD), which opens up many perspectives for Conversational AI. However, task-specific datasets are mostly in English, which makes it difficult to acknowledge progress in other languages. Fortunately, state-of-the-art models are now being pre-trained on multiple languages (e.g. BERT was released in a multilingual version managing a hundred languages) and exhibit zero-shot transfer abilities from English to other languages on XNLI. In this paper, we run experiments showing that multilingual BERT, trained to solve the complex Question Answering task defined by the English SQuAD dataset, is able to achieve the same task in Japanese and French. It even outperforms the best published results of a baseline that explicitly combines an English model for Reading Comprehension with a Machine Translation model for transfer. We run further tests on crafted cross-lingual QA datasets (context in one language and question in another) to provide intuition on the mechanisms that allow BERT to transfer the task from one language to another. Finally, we introduce our application Kate, a conversational agent dedicated to HR support for employees that exploits multilingual models to accurately answer questions, in several languages, directly from information web pages.
Over the past two years, we have witnessed a real revolution in machine learning for NLP Vaswani et al. (2017); Devlin et al. (2018). Perhaps motivated by competitions (SQuAD Rajpurkar et al. (2016), GLUE Wang et al. (2018)) transcribing real needs, as was the case for image processing with ImageNet Deng et al. (2009) a few years ago, more powerful proposals are made every day by public and private research laboratories around the world to solve complex language processing tasks: Natural Language Inference Williams et al. (2017); Rajpurkar et al. (2016); Levesque et al. (2012), Sentence Similarity and Paraphrasing Dolan and Brockett (2005); Agirre et al. (2012), Text Classification Socher et al. (2013); Warstadt et al. (2018), and Reading Comprehension and Question Answering Rajpurkar et al. (2016); Nguyen et al. (2016); Lai et al. (2017); Joshi et al. (2017).
Recent language models such as GPT-2 Radford et al. (2019), BERT Devlin et al. (2018), RoBERTa Liu et al. (2019b) and XLNet Yang et al. (2019) have even been able to outperform humans on the competition datasets. This progress opens up many opportunities for conversational AI, such as the ability to accurately identify a user’s intention, reformulate sentences or answer questions. More specifically, the task defined by the SQuAD dataset consists of identifying, whenever possible, the answer to an input question in an input source. An algorithm such as BERT that is able to solve this task would be highly valuable in a virtual assistant. Let us take the example of Kate, the use-case motivating this study. Kate is a conversational assistant dedicated to employees that we developed for HR support. It is able to detect pre-defined intents and provide associated scripted answers. It is sometimes unable to answer, usually because the user asks for something that is not expected. Yet, we have at our disposal a very large number of HR resources in the form of web pages on the company intranet. Hence, it would be very useful to exploit technologies able to automatically provide answers to questions based on these sources. The main issue that we would then run into is that the users’ questions and/or the sources are not necessarily in English, whereas existing datasets to train models for the QA task (e.g. SQuAD) are almost exclusively in English. This raises the question of the applicability of current state-of-the-art models to other languages. One solution could be to build labeled datasets in all targeted languages, but this is inflexible and highly resource-consuming. Another possible direction is zero-shot transfer, where a model trained for a task in one language is exploited to solve the same task in another language Hardalov et al. (2019); Liu et al. (2019a); Loginova et al. (2018).
On the one hand, many specific strategies have been proposed for zero-shot transfer with explicit language alignment Firat et al. (2016); Johnson et al. (2017); Artetxe and Schwenk (2018). On the other hand, state-of-the-art language models are now pre-trained and released on hundreds of languages. They embed representations of sub-word units (characters or frequent character sequences) and seem to naturally integrate language alignment, as suggested by their surprisingly good performance on zero-shot transfer.
In this paper, we contribute by demonstrating the capability of the multilingual BERT model, when trained to solve the SQuAD task in English, to do the same in other languages (French and Japanese) without labeled data. We also build six new cross-lingual QA datasets (question and source in different languages) to further evaluate the model. The promising results allow us to provide intuition on why BERT is able to transfer knowledge for this specific task from English to foreign languages. We finally present our application use-case: an HR virtual assistant named Kate. We explain the conversation process and how BERT is integrated within the architecture. We also display some convincing examples which confirm the interest and power of language models applied in conversational agents.
Natural Language Processing (NLP) has been a key application of machine learning techniques for several decades now. Text Classification was one of the first tasks where results were very promising (e.g. for spam detection Pantel et al. (1998); Blanzieri and Bryl (2008)). The proposed techniques were mainly based on preprocessing heuristics (e.g. TF-IDF Jones (2004)) followed by the training of a multi-class classifier such as one-vs-all SVM Manevitz and Yousef (2001). At the time, model training was restricted to a set of data gathered specifically for the targeted task. Later, researchers started to learn word representations on large external unsupervised text sources as a way of augmenting the algorithms with syntactic and semantic word relationships. The most emblematic example is word2vec Mikolov et al. (2013), which trains a simple neural network (one hidden layer), with several tricks, to predict a word in a sentence from its context (surrounding words) and vice versa. Word representations can then be derived from the network’s weights. These word vectors have many interesting algebraic properties, such as the fact that vector similarity reflects semantic similarity (e.g. synonyms have close vectors). Furthermore, many studies showed that replacing the TF-IDF representation of sentences with a combination of word embeddings brought significant improvements on many NLP tasks, especially on limited-size datasets Joulin et al. (2016). This highlighted the potential of transfer-learning strategies for NLP. Since then, language embedding has been widely adopted and has incrementally evolved from representing a sentence as the average of its component word embeddings to more complex sentence/paragraph embedding techniques Kiros et al. (2015); Logeswaran and Lee (2018); Le and Mikolov (2014), with recurrent neural networks for instance. Research also focused on context-dependent (left-to-right and right-to-left) embeddings for disambiguation Peters et al. (2018).
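As an illustration of the similarity property of word2vec mentioned above, here is a minimal sketch using hypothetical 3-dimensional toy vectors (real embeddings have hundreds of dimensions and are learned from data):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen to illustrate the property only.
emb = {
    "happy": [0.90, 0.10, 0.20],
    "glad":  [0.85, 0.15, 0.25],  # near-synonym: close vector
    "table": [0.10, 0.90, 0.30],  # unrelated word: distant vector
}

print(cosine(emb["happy"], emb["glad"]))   # high similarity
print(cosine(emb["happy"], emb["table"]))  # much lower similarity
```

The near-synonym pair obtains a much higher cosine similarity than the unrelated pair, which is the algebraic property exploited when word vectors replace TF-IDF features.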
Very recently, proposals in the literature switched to a different methodology to tackle NLP tasks. As opposed to the previously described feature-based approaches, recent models are categorized as fine-tuning approaches or universal language models. The idea is not really novel, since it had already made its way into research a decade ago Collobert and Weston (2008). It consists of wrapping the embedding part, which takes as input a text represented as a sequence of tokens, within the first layers of a model that can be used to solve several NLP tasks Dai and Le (2015); Howard and Ruder (2018); Radford et al. (2018). Tokens generally consist of single characters or the most frequent character sequences of the language, generated with WordPiece tokenization Schuster and Nakajima (2012). Just like in feature-based approaches, the representation part of the model can be pre-trained on unlabeled text data to learn contextual token representations. The output layers can then be adapted to the task at hand and the whole model can be fine-tuned on the task data. ULM-Fit Howard and Ruder (2018) is a notable example in this category, as it helped highlight the power of such a strategy: it has been shown to significantly improve text classification using a hundred times less training data than previous state-of-the-art models.
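To give an intuition of how WordPiece-style tokenization splits a word at inference time, the following sketch implements the greedy longest-match-first decomposition over a hypothetical toy vocabulary (real vocabularies are learned from data and contain tens of thousands of pieces):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first decomposition of a word into
    sub-word units. Continuation pieces carry a '##' prefix,
    following the convention used by BERT's tokenizer."""
    tokens, start = [], 0
    while start < len(word):
        end, current = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark as word continuation
            if piece in vocab:
                current = piece
                break
            end -= 1  # shrink the candidate piece and retry
        if current is None:
            return ["[UNK]"]  # no piece matches: unknown token
        tokens.append(current)
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration.
vocab = {"question", "answer", "##ing", "##s", "play"}
print(wordpiece_tokenize("answering", vocab))  # ['answer', '##ing']
```

Out-of-vocabulary material thus decomposes into known pieces, which is what lets the model share representations across morphologically related words.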
Although very efficient, ULM-Fit relies on an LSTM-based architecture without attention. Attention was introduced as a layer on top of recurrent models to exploit all memory states instead of only the last one, which helped deal with sequences more efficiently Bahdanau et al. (2014). The Transformer Vaswani et al. (2017) completely bypasses recurrent models: attention layers are applied directly to the projected input sequence rather than to a sequence of memory states. The most recent fine-tuning approaches such as GPT-2 Radford et al. (2019), BERT Devlin et al. (2018), RoBERTa Liu et al. (2019b) and XLNet Yang et al. (2019) all rely on pre-trained Transformer-based architectures and tackle complex NLP tasks more efficiently than ULM-Fit. They reach ground-breaking performance in machine reading comprehension (better than human level on the GLUE/SQuAD benchmarks). Since they are grounded on fine-tuning, they have the advantage that only a few parameters need to be learned from scratch for the task at hand. Compared to feature-based approaches, this kind of technique can be seen as a step ahead in transfer learning applied to NLP.
Language models (BERT, XLNet, XLM, etc.) all provide competitive performance on public benchmarks. In this section, we focus on BERT because it was recently released in a multilingual version which manages more than a hundred languages.
BERT learns token representations from unlabeled texts and jointly conditions them on both left and right context using deep bidirectional Transformers. The bi-directionality is what distinguishes it from earlier Transformer-based proposals such as GPT Radford et al. (2018), which consider text from left to right only.
The model is designed to take, as an input, either a single sentence or a pair of sentences separated by a special token (figure 1). It also has a wide set of outputs so that it can easily be adapted and fine-tuned into a high-performance model for a wide range of tasks without substantial task-specific architecture modifications.
During pre-training, the model is optimized on unlabeled data (BooksCorpus and English Wikipedia) for two unsupervised tasks. The first task is masked language modeling: samples are pairs of sentences in which around 15% of the tokens, sampled at random, are corrupted (by replacing them with a special [MASK] token or a random token, or by keeping the original token), and the goal is to retrieve the original tokens. The second task is Next Sentence Prediction, where the model has to predict whether the second sentence of the pair actually follows the first one. This second unsupervised task has been shown to be very useful when the model is later fine-tuned on QA/NLI tasks, because the pieces of information that link one sentence to another are relevant for such applications.
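The masked-language-model corruption described above can be sketched as follows. The 80/10/10 split among [MASK], random token and unchanged token follows the BERT paper; the helper names are ours:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Masked-LM corruption sketch: select ~mask_rate of positions;
    a selected token is replaced by [MASK] 80% of the time, by a
    random vocabulary token 10% of the time, and kept unchanged 10%
    of the time. The model must predict the original token at every
    selected position (returned in `targets`)."""
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = token  # prediction target at this position
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, targets

tokens = "the cat sat on the mat".split()
corrupted, targets = mask_tokens(tokens, vocab=["dog", "tree"], mask_rate=0.3, seed=0)
```

Keeping some selected tokens unchanged forces the model to maintain a meaningful representation of every input position, not only of [MASK] positions.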
Pre-training BERT is really expensive: it takes four days on four to sixteen Cloud TPUs. However, the pre-trained weights have been made available online by the authors (https://github.com/google-research/bert). In contrast, fine-tuning is relatively inexpensive. For instance, fine-tuning on SQuAD, with more than a hundred thousand samples, takes at most two hours on a regular GPU (Tesla V100).
BERT was released in two versions (base and large). They differ in the dimensionality of the Transformer (the dimension of the hidden layers, the number of Transformer blocks and the number of self-attention heads). The original and most accurate version is the large one, but it was reduced into the base one so that it had the same memory footprint as GPT and could be fairly compared to it. Note that only the base version fits in a GPU with 16 GB of RAM.
Many language models pre-trained in English are now available online, and several English task-specific datasets (QA, Classification, Similarity) can easily be obtained for fine-tuning. On the contrary, data for the same kind of tasks remain scarce in other languages. Hence, it seems difficult to exploit the existing technologies for applications in these languages.
In recent years, specific strategies have been introduced to tackle this issue from the angle of zero-shot transfer. The idea is to transfer a model able to solve a given task in one language to another language for which there is no labeled data. One way is to learn an alignment function between the two languages and combine it with the model Artetxe and Schwenk (2018); Joty et al. (2017). Other, more implicit strategies have been proposed. For instance, the authors of Johnson et al. (2017) train a sequence-to-sequence model to translate between specific language pairs and add a special token to the input sequence to denote the target language; they show that their model is then able to translate between new pairs. Another proposal, based on a many-to-one strategy, also allowed translation between languages without a parallel corpus between them, as long as both had a parallel corpus with the same third language Firat et al. (2016).
Recently, language models Devlin et al. (2018); Lample and Conneau (2019), including BERT, have been released in multilingual versions and, without any explicit strategy for language alignment, it has been noted on XNLI Conneau et al. (2018), a corpus of sentence pairs translated into 14 languages, that their ability for zero-shot transfer is surprisingly good (close to language-specific models). Multilingual BERT is pre-trained on the hundred languages with the largest number of Wikipedia articles (https://meta.wikimedia.org/wiki/List_of_Wikipedias). The pre-training set has been built from samples of the entire Wikipedia dump of each language. Since the languages do not have the same number of articles, sampling probabilities depend on language frequency so as to balance them to some extent; token counts are also weighted based on language frequency. The model is intentionally not informed of the language of the sample, so that pre-trained embeddings cannot be explicitly language-specific.
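The frequency-dependent sampling can be sketched as an exponential smoothing of the raw language shares, as described in the multilingual BERT release notes (the exponent 0.7 is the value reported there; the article counts below are hypothetical):

```python
def smoothed_sampling_probs(article_counts, alpha=0.7):
    """Exponentially smoothed sampling: raise each language's raw
    share to the power alpha (< 1), then renormalize. This
    under-samples high-resource languages and over-samples
    low-resource ones, without ever equalizing them completely."""
    total = sum(article_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in article_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

# Hypothetical article counts: English dominates the raw corpus.
probs = smoothed_sampling_probs({"en": 6_000_000, "fr": 2_000_000, "ja": 1_000_000})
```

With these toy counts, English's sampling probability drops below its raw share of 2/3 while Japanese's rises above its raw share of 1/9, which is exactly the balancing effect described above.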
With our multilingual chatbot application in mind, we run two preliminary empirical studies on multilingual BERT fine-tuned on the English SQuAD dataset, evaluating its ability to achieve the same task in other languages. In a first experiment, we carry out tests on French and Japanese SQuAD test sets. In a second experiment, we go further and analyze its behavior on crafted cross-lingual QA datasets (context paragraph and question in different languages).
The SQuAD dataset corresponds to a task where the model has to identify, in a context paragraph, the answer to a question. Each sample’s input is a paragraph-question pair and its output is the location of the answer in the paragraph (figure 2). From the input pair, BERT outputs, for each token of the paragraph, a probability of being the beginning/ending of the answer (see the right part of figure 1). During fine-tuning, the model weights are updated to minimize the gap between these probabilities and the true location of the answer. The fine-tuning script is provided in BERT’s github repository, where all the hyperparameters and options (batch size, learning rate, etc.) suggested by the authors are also specified.
A sample of the SQuAD v1.1 test set (only the first paragraph of each of the 48 Wikipedia pages) has been translated by humans into French and Japanese. The translated sets are available online (https://github.com/AkariAsai/extractive_rc_by_runtime_mt). We evaluate the performance of fine-tuned multilingual BERT on them and compare the results to a baseline Asai et al. (2018). The latter is a pivot approach for zero-shot transfer whose results remain the best published ones for the selected datasets. It explicitly combines two models: a language-specific model for Reading Comprehension (RC) in a pivot language (English) and an attentive Machine Translation (MT) model between the pivot language and the target language (Japanese or French). The algorithm is summarized in figure 3: first, the MT model translates the passage-question pair from the target language to the pivot language, then the RC model extracts the answer in the pivot language, and the algorithm finally recovers the answer in the original language using soft-alignment attention scores from the MT model.
Table 1 displays the Exact Match (EM) and F1-score of the baseline and multilingual BERT on the selected datasets. We can observe that multilingual BERT is able to significantly outperform the baseline on both the Japanese and the French question answering task.
To run cross-lingual tests, we build six additional datasets from the existing ones by mixing context in one language with question in another. The mixed datasets will be made available online in a github repository. The performance of BERT on all datasets is displayed in Table 2. Since the model was trained for the task in English, performance is best on the En-En dataset. Performance on Fr-Fr and Jap-Jap is also very good, as noted in the first experiment. We additionally note that results on cross-lingual sets are close to monolingual results: as good, slightly worse, or slightly better. For instance, the exact match on the En-Fr dataset is higher than on the Fr-Fr dataset. We also observe that, in general, the exact match and F1-score are close together when the context is in Japanese, whereas there is a larger gap for the other two languages. The reason is that, in Japanese, tokens cover larger parts of words, so there is less room for partial overlap between predictions and ground truth.
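For reference, Exact Match and F1 compare a predicted answer string with the ground truth at the token level. Below is a simplified sketch of the official SQuAD evaluation (text normalization reduced to lowercasing and whitespace splitting):

```python
def exact_match(pred, gold):
    # 1 if the normalized strings are identical, else 0.
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-overlap F1 between prediction and ground truth, as in
    the official SQuAD evaluation (normalization simplified here)."""
    p, g = pred.lower().split(), gold.lower().split()
    common, g_left = 0, list(g)
    for tok in p:
        if tok in g_left:
            common += 1
            g_left.remove(tok)  # each gold token matches at most once
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("more than 3000", "More than 3000"))  # 1
print(token_f1("than 3000", "more than 3000"))          # 0.8
```

A partially overlapping prediction thus scores 0 on EM but a non-zero F1, which explains the EM/F1 gap discussed above: coarser tokens (as in Japanese) leave fewer ways to overlap only partially.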
Overall, multilingual BERT is very powerful in the three languages. The performance for English was already noted in the public benchmarks, and we add here that BERT has a high ability for QA zero-shot transfer. It even significantly outperforms the baseline from Asai et al. (2018), which is all the more impressive as, contrary to the baseline, it was not trained to map languages onto each other. Moreover, as it was pre-trained on a hundred languages, it is much more flexible: it can directly be applied to a new language, whereas the baseline requires an additional trained MT model.
The observed performance of BERT in this section, given that it was only trained on the English SQuAD, allows us to draw two hypotheses: (i) the structure of the QA task is partially similar in every language, and (ii) BERT somehow learned, during pre-training, an alignment between the embeddings of the vocabularies of the different languages. Intuitively, both hypotheses could be correct. Indeed, (i) is possible because there are language-independent characteristics in the task, such as identifying passages in the paragraph where a great portion of the vocabulary matches the question (or its synonyms), or capturing the subject/object complement directly associated with a verb that is common to the passage and the question. And (ii) is also plausible because many words share at least the same roots across languages, and proper nouns (e.g. people, or famous techniques such as Boosting) occur in the Wikipedia of most languages. During pre-training, these common tokens may have played the role of landmarks and allowed the model to align the embedding spaces of the different languages. Both hypotheses are also supported by the empirical results. Hypothesis (ii) is coherent with the good results in the cross-lingual experiments: the fact that the attention layers connecting the context in one language and the question in another allow a focus on the location of the answer suggests that representations of words from both languages are similar. Note then that, since the performance on Jap-En is lower than on Jap-Jap whereas the performance on En-Fr is higher than on Fr-Fr, the En-Fr alignment must be stronger than the En-Jap one, possibly due to fewer words with common roots between English and Japanese. Hypothesis (i) is supported by the experiments in Japanese: results on the Jap-Jap dataset are better than on the Jap-En dataset, which suggests that the En-En SQuAD task learned by BERT was transferred to Jap-Jap by mechanisms other than language alignment alone, possibly inner characteristics of QA that are agnostic to the language.
In conclusion, after the multilingual pre-training and the fine-tuning on the English SQuAD, it seems that BERT has learned (1) a semantic structure of every language, (2) an alignment between the embedding spaces of the different languages (at least the most frequent ones Pires et al. (2019)), (3) to solve a QA task in English, and (4) intrinsic concepts of Question Answering that apply to any language. All of these elements contribute to its high performance for zero-shot question answering in non-English languages.
Given these convincing empirical results, we integrated BERT into our HR virtual assistant, KATE (Knowledge shAring experT for Employees). In this section, we present the use-case, the conversation process and how BERT is integrated within the architecture. We also display some concrete examples of the impressive results achieved by BERT on question answering.
Our application is a conversational assistant dedicated to employees. We validated this use-case of HR support with an internal survey: it highlighted the complexity of finding the right information in multiple corporate sources (e.g. web pages, word documents) that can be poorly structured and insufficiently indexed. Several respondents indicated that when they are lost in these documents, they either ask a colleague directly or give up their search. Therefore, we decided to create a chatbot to ease access to this kind of HR data.
Our chatbot is available in a webpage with standard text messaging interactions, connected to a self-developed framework. We added two buttons, “like” and “dislike”, on the most recent bot message (figure 4), to get direct and explicit feedback from the user about the relevance of the last answer.
Within Kate’s conversational system (figure 5), the messaging platform first sends the last user utterance to an intent classifier. For our experiment, we used Google Dialogflow NLU, but our framework can also handle other tools for this task such as Rasa, Snips, Luis.ai or even BERT trained for classification. Dialog management is conducted within the framework. If the NLU engine succeeds in recognizing the user’s intent, we use scripted answer generation: the developers prepared a set of pre-written answers matching the current intent, and the system randomly chooses one of these sentences and completes it with the appropriate entities. The Question Answering part is called when the NLU engine fails, typically because the user asked for something that was not expected, or when we receive negative feedback. In that case, we rely on our knowledge database, consisting of a list of URLs corresponding to external contents that might include the right information; in our case, several corporate intranet webpages with HR explanations and rules. We take the HTML code of each webpage and apply a common HTML-to-text function to filter out the HTML tags and keep only the actual content. We then send these simplified texts to an instance of multilingual BERT running on a server to perform question answering, using the user utterance as the question and each text as a context. In the last step, the best result among all sources is formatted and displayed by the messaging platform.
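The routing logic described above can be sketched as follows (all helper names, nlu, qa_model and html_to_text, are hypothetical placeholders for the actual components):

```python
def respond(utterance, nlu, scripted_answers, knowledge_urls, qa_model, html_to_text):
    """Dialog routing sketch:
    1. try intent classification and return a scripted answer;
    2. on failure, run extractive QA over every knowledge source
       and return the highest-scoring answer."""
    intent = nlu(utterance)  # returns an intent name, or None on failure
    if intent is not None and intent in scripted_answers:
        return scripted_answers[intent]

    best_answer, best_score = None, float("-inf")
    for url in knowledge_urls:
        context = html_to_text(url)  # strip tags, keep textual content
        answer, score = qa_model(question=utterance, context=context)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer
```

In production the QA call runs against a served multilingual BERT instance; here it is abstracted as a function returning an answer string and a confidence score so that the control flow stands on its own.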
In figure 6, we give two examples of responses from our QA API, which runs multilingual BERT using three URLs as information sources. The first question, asked in French, is “How many employees does the company have?” and the answer “more than 3000” is found in the second web page. Note that the question and the context passage “en France, c’est plus de 3000 collaborateurs” have no words in common. The second question is asked in English whereas the source is in French. BERT is still able to find the answer “par une période d’essai”, which translates to “with a trial period”.
With its large-scale multilingual pre-training, its suitability for complex QA tasks and its ability for zero-shot transfer, BERT presents impressive results on monolingual QA in several languages as well as on cross-lingual QA. Applied in a real-world chatbot setting, first results indicate that it allows the conversational process to handle new user questions from web information sources. The response is very good when the webpage contains an expression close to the expected answer, even when the question and the context do not share any words.
The inclusion of a language model for question answering inside a conversational engine has huge advantages in terms of maintenance. To set up a standard chatbot system, developers need to enter by hand, for each user intent, many examples of sentences that they want the bot to recognize, and write the set of every possible answer. Question Answering brings automation and significant time savings for the developers, linguists and integrators working on the chatbot: they now just have to maintain an up-to-date list of external documents in the knowledge database and an accurate language model. Moreover, the chatbot becomes able to deal with questions that its designers did not expect, by exploring the whole knowledge database. It is also worth mentioning the major benefit provided by the cross-lingual abilities of multilingual BERT. In many cases, corporate documents may be available in different languages, especially in global organizations. With cross-lingual question answering, the chatbot can explore any content and may find a potential answer, even if the reference texts are not in the same language as the user utterance.
There is still room for improvement. For instance, we noticed that when the right answer to a question like “what are the main kinds of [anything]?” is scattered across several parts of the source (subtitles, bullet lists), multilingual BERT is only able to give a partial answer. We are investigating ways to obtain and aggregate multiple answers. Besides, studies have noted that BERT’s multilingual ability, in terms of language alignment, can sometimes be limited, in particular for rare languages Pires et al. (2019). Alignment could be improved with, for instance, additional fine-tuning on aligned corpora. We also plan to investigate alternatives to BERT with faster inference, to improve the current pipeline and deal with larger knowledge databases.
Radford et al. (2018). Improving language understanding with unsupervised learning. Technical report, OpenAI.