There is growing research interest in conversational interfaces for biomedical natural language processing (laranjo2018conversational). Dialogue systems involve several components (jokinen2009spoken): a natural language understanding (NLU) module, a dialogue manager, a generation module, and a module for querying the database. We focus here on the NLU component, which allows the system to understand user utterances through semantic analysis and the formalization of queries.
To develop an NLU model using a machine learning approach, one of the first requirements is a training dataset. This dataset must contain user utterances (the input of the NLU) paired with their formal representations (the output of the NLU), and it needs to be large enough to be representative of the task. But what if this dataset does not exist? Depending on the task and the language, it is likely that no suitable training dataset can be found. The biomedical domain is a good use case for low-resource settings, especially in languages other than English: due to privacy issues, it is difficult to share real-world medical data. Dialogue systems in the medical domain have been applied to patient counselling on a wide range of topics, from medical conditions to medication intake (azevedo2018). In most of these systems, interactive capabilities are based on limited, constrained natural language input: for example, users are presented with a menu of multiple-choice questions. In contrast, dialogue systems allow users to access data in a much more natural way, through speech or typed input (laranjo2018conversational).
One solution to overcome the absence of training data is to generate a training dataset from a few examples and augment it using known terminologies, external knowledge and paraphrases. In this paper, we assess how well models trained on such a generated dataset perform on real-world data, and compare their performances on a biomedical NLU task in French.
2 Data generation and augmentation
In this section, we describe the task and the methods we use to generate the training and development sets. We explain the methods for generating data using templates and terminologies, generating paraphrases of templates using pivot translations and incorporating external knowledge with word embeddings and language models. Figure S2 details the general schema of this work.
2.1 Description of the task: NLU in a dialogue task to query EHRs
The aim of the task is to perform natural language understanding of user input. This step will enable physicians to query Electronic Health Records (EHRs) in natural language. The set of queries a physician may have about the characteristics and results of a patient is broad and diverse; therefore, enabling queries in natural language may help access information more efficiently. For this purpose, we asked medical doctors from a French university hospital for examples of questions they would ask a dialogue system designed to query information about biological test results. We collected a set of 178 questions that we annotated manually as a gold standard.
NLU tasks can usually be divided into 3 sub-tasks: domain classification, intent classification, and slot-filling (tur2011). For this task in a restricted domain (bio-medicine), we focus on the latter two: slot-filling (sequence labelling) and intent classification.
Sequence labelling. We distinguish two types of labels: lab mentions (e.g. "créatinine" creatinine, "protéine C réactive" C reactive protein) and dates (e.g. "27/03/2015", "depuis 3 jours" for 3 days). In the training set (generated data, see 2.2), the number of distinct lab mentions is 336, with lengths ranging from 1 to 11 tokens and a median length of 2. The overlap between the vocabularies of the train set and the test set (real-world data) is 28%. The date labels include actual dates, relative dates and time ranges. Their length is more stable, with a median of 3, ranging from 0 to 6 tokens. The vocabulary overlap between the train and test sets is 38% (see Table S1).
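As an illustration, the two label types can be encoded with a standard BIO tagging scheme (an assumption for illustration; the paper does not state the exact scheme it uses). The sketch below converts labelled spans into token-level tags:

```python
# Illustrative sketch only: we assume a BIO encoding of the two label types
# (LAB for lab mentions, DATE for dates); the paper's actual scheme may differ.
def bio_tags(tokens, spans):
    """Convert labelled spans (start, end, label) into BIO tags, one per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["résultat", "de", "la", "protéine", "C", "réactive", "depuis", "3", "jours"]
spans = [(3, 6, "LAB"), (6, 9, "DATE")]  # "protéine C réactive" / "depuis 3 jours"
tags = bio_tags(tokens, spans)
```

Multi-token mentions (up to 11 tokens for labs) are what makes this a sequence labelling rather than a token classification problem.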
Classification. There are 4 sub-tasks representing 4 axes of classification. For each utterance, we assign one label per axis. Two axes concern the results of the lab exams: the type of result (5 categories, e.g. value, evolution, date) and the interpretation of the result (5 categories, e.g. normality, value, low, high, presence). The other two concern temporal aspects: the time of the result (3 categories, e.g. first, last, all) and constraints on time (4 categories, e.g. none, range, date, number).
2.2 Data generation
Given the lack of a suitable dataset for training, we generate a training dataset using a tailored generator. Inspired by bordes2016, we developed question templates: 223 for the core of the question (e.g. "quel est le résultat du dernier <lab mention>", what is the result of the last <lab mention>), 23 temporal modifier templates (e.g. "depuis <date|duration|event>", since <date|duration|event>), and a list of 409 mentions of laboratory test results (hereafter, lab mentions, e.g. "créatinine" creatinine, "hémoglobine" hemoglobin). Each generated question randomly associates a base template, a temporal modifier template and a lab mention to create unique questions (see Figure S1).
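The generation step can be sketched as follows; the template and mention lists here are tiny illustrative stand-ins for the real resources (223 core templates, 23 temporal modifier templates, 409 lab mentions):

```python
import random

# Hypothetical miniature resources; the real generator uses far larger lists.
CORE_TEMPLATES = ["quel est le résultat du dernier <lab>",
                  "quelle est la valeur de <lab>"]
TEMPORAL_TEMPLATES = ["depuis <date>", "avant <date>"]
LAB_MENTIONS = ["créatinine", "hémoglobine"]
DATES = ["3 jours", "le 27/03/2015"]

def generate_question(rng):
    """Randomly combine a core template, a temporal modifier and a lab mention."""
    core = rng.choice(CORE_TEMPLATES).replace("<lab>", rng.choice(LAB_MENTIONS))
    modifier = rng.choice(TEMPORAL_TEMPLATES).replace("<date>", rng.choice(DATES))
    return core + " " + modifier

rng = random.Random(0)
# Collect into a set so each generated question is unique, as in the paper.
questions = {generate_question(rng) for _ in range(100)}
```

With the full resources, the combinatorics of templates and mentions yield far more than the 20,000 utterances actually sampled.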
2.3 Data augmentation with paraphrases
Researchers working with data-intensive methods (such as neural networks) already resort to the generation of paraphrases, even for question-answering tasks (D17-1091). We refer the reader to available reviews on methods of paraphrase generation (androutsopoulos2010survey; madnani2010generating), including recent advances using neural approaches (iyyer2018adversarial); here we focus on methods applied to paraphrasing questions. A recent approach makes use of the Paraphrase Database (PPDB) (ganitkevitch2013ppdb), a large multilingual collection of paraphrase pairs (over 100 million pairs for English) with lexical, syntactic and phrasal variations. Another method relies on machine translation (duboue2006answering). For example, Zhang:2015:EKC:2887007.2887065 derive paraphrases of key words in questions by translating them to a pivot language (they experimented with 11 languages), then back to the source language.
We use a machine translation method to increase the variability of the training set by producing paraphrases of the question templates. We translate each sentence into one or several pivot languages and translate the result back to the source language, using the Google Translate API (zotero-5148). For each template, we randomly select 10 of the 60+ languages available in the API. For each language, we perform the pivot translation and keep the unique paraphrases obtained. We then add these paraphrases to the set of templates used for the generation of the datasets.
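A minimal sketch of the pivot-translation step, with a stub `translate` function standing in for the Google Translate API so the example is self-contained (the lookup table is purely hypothetical):

```python
def translate(text, src, tgt):
    """Stub for an MT API; the real system called the Google Translate API.
    The lookup table below is a hypothetical simulation of round-trip output."""
    table = {
        ("quel est le dernier résultat", "fr", "en"): "what is the last result",
        ("what is the last result", "en", "fr"): "quel est le résultat le plus récent",
    }
    return table.get((text, src, tgt), text)

def pivot_paraphrases(template, pivots):
    """Round-trip a template through each pivot language; keep unique outputs."""
    outputs = set()
    for pivot in pivots:
        forward = translate(template, "fr", pivot)
        back = translate(forward, pivot, "fr")
        if back != template:  # keep only genuine paraphrases, discard identities
            outputs.add(back)
    return outputs

paras = pivot_paraphrases("quel est le dernier résultat", ["en", "de"])
```

Deduplication matters here: many pivot languages round-trip back to the original template, and only the distinct outputs enrich the template set.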
2.4 Incorporating latent knowledge
Using word embeddings learned on a large domain-specific corpus of unlabeled data can be an effective source of latent knowledge (wang2018). We use one million clinical notes from the clinical data warehouse of a local hospital in France. Leveraging this corpus, we compare three types of methods: 1) word embeddings (continuous skip-gram) trained only on the training set (without external knowledge), as a baseline; 2) a continuous skip-gram model with sub-word information (i.e. each word is represented as a bag of character n-grams), as implemented in FastText (bojanowski2016); 3) embeddings from language models (ELMo), where the vectors are learned from the internal states of a deep bidirectional language model, as described in Peters2018.
We split the question templates and the lab mentions into two sets: training (170 templates and 336 mentions) and development (53 templates and 73 mentions). From each, we create two datasets by generating paraphrases (see section 2.3). We generate 16,000 utterances for the training set (80%) and 4,000 for the development set (20%) using templates without paraphrases, and the same quantities for the sets with paraphrases (Table S2). The test set (real world data) is kept aside for the evaluations.
3 Models

A usual way of producing specialized NLU systems is to elaborate rule-based algorithms to perform the semantic parsing of user utterances (weston2015). However, developing such a system can be time-consuming, and the result is often difficult to maintain. Most modern NLU systems instead use statistical learning models to perform this task (young2013a). Before the rise of neural-based systems, state-of-the-art systems used conditional random fields (CRF) (lafferty2001conditional). Nowadays, these systems tend to be outperformed by neural approaches, such as convolutional neural networks (CNN) and recurrent neural networks (RNN). On sequence labelling tasks, RNNs, and more specifically long short-term memory units (LSTM) (hochreiter1997long), are the most used. More recent work combines bidirectional LSTMs (biLSTM) and CRF (lample2016neural).
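At prediction time, the CRF layer's contribution is a Viterbi search over per-token emission scores (e.g. produced by the biLSTM) and tag-transition scores. A minimal, dependency-free sketch (the scores below are illustrative, not from a trained model):

```python
def viterbi(emissions, transitions):
    """emissions: one {tag: score} dict per token;
    transitions: {(prev_tag, tag): score}. Returns the best tag sequence."""
    tags = list(emissions[0])
    # best[i][t] = (score of the best path ending with tag t at token i, backpointer)
    best = [{t: (emissions[0][t], None) for t in tags}]
    for scores in emissions[1:]:
        step = {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + transitions[(p, t)])
            step[t] = (best[-1][prev][0] + transitions[(prev, t)] + scores[t], prev)
        best.append(step)
    last = max(tags, key=lambda t: best[-1][t][0])  # backtrack from the best final tag
    path = [last]
    for step in reversed(best[1:]):
        path.append(step[path[-1]][1])
    return list(reversed(path))

# Illustrative two-token example with hypothetical scores.
result = viterbi(
    [{"O": 1.0, "LAB": 0.0}, {"O": 0.0, "LAB": 1.0}],
    {("O", "O"): 0.0, ("O", "LAB"): 0.5, ("LAB", "O"): 0.0, ("LAB", "LAB"): 0.0},
)
```

The transition scores are what lets the CRF enforce sequence-level consistency (e.g. penalizing an inside tag that does not follow a begin tag), which per-token softmax decoding cannot do.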
To assess the capacity of the models to generalize to new data, we evaluate three types of models for this task: CRF, bidirectional LSTM (biLSTM), and a combination of biLSTM and CRF (lample2016neural). The input layer is fed with the questions generated from templates only or from templates with paraphrases. The embeddings are learned either directly on the training set (no external knowledge), or on clinical notes using FastText or ELMo. For each combination, we test the three models: CRF, biLSTM and biLSTM+CRF. The details of the models and the tuning parameters are described in the supplementary materials (section S1). All the results are reported in terms of weighted F-measure, computed using 10 repetitions of five-fold cross-validation over the test set.
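For reference, the weighted F-measure averages per-label F1 scores weighted by each label's support (number of true instances). A small self-contained sketch; in practice a library such as scikit-learn provides this computation:

```python
from collections import Counter

def weighted_f1(gold, pred):
    """Per-label F1, averaged with weights proportional to each label's support."""
    support = Counter(gold)
    total = 0.0
    for label, count in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += count * f1
    return total / len(gold)
```

Weighting by support keeps frequent labels (such as the outside tag in sequence labelling) from being drowned out by rare ones, and vice versa.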
4 Results and discussion
Overall, the best results on the sequence labelling and classification tasks are obtained with the models using ELMo representations as the embeddings injecting external knowledge. On the sequence labelling task, the ELMo-biLSTM and ELMo-biLSTM-CRF models obtained F1-scores of 0.76 (95% CI [0.74-0.77]) and 0.77 (95% CI [0.76-0.79]) respectively (see Table 2, Figure S3). On the classification task, the best results are obtained with ELMo on three of the four sub-tasks and with FastText-paraphrases on the fourth one (see Table 2, Figure S4).
On the sequence labelling task, adding latent knowledge with FastText or ELMo using a million clinical records increases the generalizability of the models regardless of the type of the downstream model. Models with ELMo have an average F1-score of , with FastText and without external knowledge . Adding paraphrases to the templates does not improve the results on this task and even tends to lower the results: ELMo without paraphrases and with; FastText versus ; no external embedding versus . Regarding the type of model, biLSTM and biLSTM-CRF perform better than CRF only with F1-scores of , and , respectively.
On the classification tasks, we also observe better results with ELMo and FastText than without external embedding: mean F1-scores of with ELMo, with FastText and without external embedding. Unlike for the sequence labelling task, adding paraphrases to the training set tends to give better results with F1-scores of without and .
Interestingly, the results obtained with the best models on each task show that it is possible to use our method to provide a baseline system for NLU tasks in the absence of pre-existing data. Our results not only confirm those of wang2018 regarding the interest of incorporating external knowledge using a large domain-specific corpus; they also highlight the interest of using language models rather than plain embeddings to incorporate this knowledge. In our study, the results using ELMo are systematically better than those with FastText, although the models were learned on the same data. This may come from the better representation of context in ELMo compared to FastText: FastText takes into account the tokens within a specified window, which can be described as a "bag of context", whereas ELMo is a language model and considers the full context of a token (at the sentence level). Of note, this sequence labelling task is not very complex, given the number of different labels; the results on a task with more labels might be lower. Moreover, the results with the paraphrases are more difficult to interpret: they are slightly better on the classification tasks but not on the sequence labelling task. This might come from the pivot translation method used to produce these paraphrases: their quality may not be sufficient for the task, and more sophisticated paraphrasing methods could lead to different results.
NLU models learned on data generated with the proposed method achieve promising performance. These methods can be used to learn a baseline model to bootstrap a dialogue system and start collecting data from end users. We are interested in exploring to what extent other sources for training embeddings (e.g. medical, non-clinical texts) yield similar results. It would also be interesting to conduct similar experiments on related tasks where data are scarce (e.g. NLU in dialogue systems for patient counselling or virtual patients).
Table S1: Characteristics of the date and lab mention labels.

| | Dates | Lab mentions |
|---|---|---|
| Mentions in the test set | 34 | 177 |
| Median length [min-max] | 3 [0-6] | 2 [1-11] |
| Vocabulary in train set | 1,364 | 451 |
| Vocabulary in test set (intersection with train) | 58 (0.38) | 250 (0.28) |
Table S2: Composition of the development and test sets.

| | Utterances | Templates | Lab mentions | Words (*) | OOVs (*) | Perplexity (*) |
|---|---|---|---|---|---|---|
| development | 4,000 | 53 | 73 | 36,211 (36,211) | 4,724 (4,544) | 137.5 (171.1) |
| test | 178 | - | - | 1,579 | 467 (390) | 194.5 (240.1) |
S1 Tuning parameters
For each model (except ELMo), we added standard features to the input: normalized lemmas and part-of-speech (POS) tags. The sequence labelling part of the model consisted of either a CRF alone, 2 layers of biLSTM, or 2 layers of biLSTM followed by a CRF. The tuning parameters were: the dimension of the embeddings (50, 100, 300), except for ELMo (fixed to its default dimension); the number of units in the biLSTM (64, 128, 256); and the fraction of dropout after the embedding layer and after the LSTM layers (0.1, 0.2, 0.3, 0.4, 0.5). The classification part of the model consisted of a 1-dimensional convolution layer (kernel sizes of 2 to 5, 50 to 250 filters, ReLU activation) followed by a max-pooling layer. Models were tuned using a random sample of parameters. The optimization function was ADAM. All the models were implemented using Keras (chollet2015keras) with a TensorFlow (tensorflow2015-whitepaper) backend.