Semi-Supervised Few-Shot Learning for Dual Question-Answer Extraction

04/08/2019 · by Jue Wang, et al. · University of California, Irvine; Zhejiang University

This paper addresses the problem of key phrase extraction from sentences. Existing state-of-the-art supervised methods require large amounts of annotated data to achieve good performance and generalization. Collecting labeled data is, however, often expensive. In this paper, we redefine the problem as question-answer extraction and present SAMIE: Self-Asking Model for Information Extraction, a semi-supervised model which dually learns to ask and to answer questions by itself. Briefly, given a sentence s and an answer a, the model needs to choose the most appropriate question q̂; meanwhile, for the same sentence s and the question q̂ selected in the previous step, the model predicts an answer â. The model supports few-shot learning with very limited supervision and can also be used to perform clustering analysis when no supervision is provided. Experimental results show that the proposed method outperforms typical supervised methods, especially when given little labeled data.




1 Introduction

Information Extraction (IE) refers to a spectrum of tasks which automatically extract structured information from unstructured texts. One important task in IE is to extract key phrases and their respective categories from sentences. Given a sentence s, the task is to extract a (p, c)-pair, where p is a phrase appearing in s and c indicates the category of p.

State-of-the-art approaches to this task employ supervised methods based on deep learning. However, these approaches assume that sufficient annotated data is available, which is not the case in most situations. Acquiring labels is costly, and it is probably the biggest obstacle to the application of these methods. This motivates the need for effective semi-supervised learning techniques that leverage unlabeled data.

By considering a question as a category, and an answer to it as a phrase, we define a new problem called dual question-answer extraction to address key phrase extraction via question-answering. The new problem is described as: given a sentence s and an answer a, retrieve an appropriate question q̂ for it; meanwhile, given the sentence s and the question q̂, predict an answer â to q̂. These sub-problems are named Question Selection (QS) and Answer Extraction (AE) respectively. In effect, our problem extracts (q, a)-pairs for a given sentence s, as illustrated in figure 1.

Figure 1: (s, q, a)-triplets of a given sentence

The main challenge comes from the lack of labeled (s, q, a)-triplets. While a conventional solution requires a large number of labeled triplets, we advocate a novel approach where the need for labeled data is significantly reduced. In fact, a large number of (s, a)-pairs without q can be easily collected, as it is reasonable to believe that almost all phrases in a sentence can be regarded as answers, i.e. given a sentence, there is always a question corresponding to each phrase (answer) in the sentence. Since no supervision is needed in this step, (s, a)-pairs can be regarded as unlabeled data. The unlabeled (s, a)-pairs are useful in theory, because if there is a model to predict the question q based on the answer a, we can compensate for the lack of q by acquiring it from a.

Figure 2: Illustration of how SAMIE works.

In this paper, we propose a novel neural network model called SAMIE: Self-Asking Model for Information Extraction, which learns to ask and answer jointly with limited labeled data, thus dually extracting question-answer pairs from given sentences. Training has two parts: the first is a regular supervised method, where the QS sub-model and the AE sub-model are trained individually on labeled (s, q, a)-triplets; in the second, the QS sub-model predicts questions q̂ from (s, a)-pairs, and the AE sub-model then gives answers based on (s, q̂), so the supervision comes merely from (s, a)-pairs. Specifically, by learning how to ask over numerous (s, a)-pairs, the model is forced to find patterns of answers in sentences, which reinforces its abstraction and generalization ability and prevents it from overfitting on a small number of (s, q, a)-triplets.

Figure 2 illustrates an example of how the model works. Given the sentence “The plane from Shanghai will arrive in Beijing on November 2nd”, any phrase in it is a potential answer. The QS sub-model gives scores to all candidate questions (the higher the score, the darker the color in the figure), while the AE sub-model gives answers to all candidate questions. Only the answer whose question has the highest score is the relevant one, which is “Beijing”.
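The selection step just described can be sketched as follows; the intent names, scores, and answers are illustrative toy values, not outputs of the actual model.

```python
# Hypothetical QS scores and AE predictions for the sentence in Figure 2;
# all values below are illustrative.
qs_scores = {"toloc": 0.85, "fromloc": 0.05, "arrive_time": 0.10}
ae_answers = {"toloc": "Beijing", "fromloc": "Shanghai",
              "arrive_time": "November 2nd"}

# The relevant answer is the one predicted for the highest-scoring question.
best_question = max(qs_scores, key=qs_scores.get)
relevant_answer = ae_answers[best_question]
```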

It is possible to train without any (s, q, a)-triplets (neglecting the supervised part of SAMIE), but how the model converges then depends on its initialization. In fact, without pre-training or supervision, it may mistake the categories and thus be reduced to a clustering method. To avoid this issue and enhance robustness, the model requires some supervision to ensure the direction in which it converges.

We have evaluated our method on public datasets. Experimental results show that the proposed method significantly outperforms traditional supervised methods, especially in the case of an extreme lack of labeled (s, q, a)-triplets.

The key contributions of our work are as follows:

  • We propose a framework that redefines key phrase extraction. Different from the restrictive problem definition in the literature, which conducts key phrase extraction relying on fixed categories, we study the more open problem of dual question-answer extraction.

  • We present a semi-supervised model for dual question-answer extraction. Supplied with very few labeled sentences (below 500), the model can be effectively trained (F1 ≈ 0.95) on large datasets of (s, a)-pairs.

  • We demonstrate that the model can be used to perform phrase clustering without supervision.

2 Proposed Method

Figure 3: The SAMIE framework. The yellow part is related to question selection and the red part to answer extraction. The supervised and unsupervised parts are outlined by dashed lines.

2.1 Problem definition

Denote by S, Q, and A the spaces of sentences, questions, and answers respectively, with s ∈ S, q ∈ Q, and a ∈ A.

The dual question-answer extraction task can be represented as a triplet (s, q, a), consisting of a sentence s, a question q, and an answer a to q. Note that the answer is a phrase or a single word in the sentence, which requires the ability to exploit context information in both the sentence and the question.

There are two sub-problems to be solved, namely Question Selection and Answer Extraction.

2.1.1 Question Selection

The model needs to evaluate the relevance of each candidate question to the given sentence and answer. The most appropriate question for the given sentence and answer should be the most relevant one.

More formally, given a sentence s and an answer a, our model's main objective is to learn the conditional probability distribution P(q | s, a; θ_QS), where θ_QS denotes all parameters related to QS, in order to select the most appropriate question from the candidate questions provided by us:

q̂ = argmax_{q ∈ Q} P(q | s, a; θ_QS)
2.1.2 Answer Extraction

For the AE task, the model aims at predicting an answer to the given sentence and question. Because we assume the answer is included in the given sentence, the task can be reduced to a conditional sequence labeling problem, where the condition is given by a question.

Formally, given a sentence s and a question q, the goal is to learn the conditional probability distribution P(a | s, q; θ_AE), where θ_AE denotes all parameters related to AE, in order to extract the correct answer:

â = argmax_a P(a | s, q; θ_AE)
2.2 Dataset Preparation

When preparing the data, it is necessary to note that, although (s, a)-pairs are not directly available, collecting them is not expensive. With raw sentences available, (s, a)-pairs can be obtained using existing POS taggers and named entity recognizers, e.g. Stanford CoreNLP. The collection may miss some possible answers, as this does not affect performance much, but it cannot include too many errors on the boundaries of phrases.

Without pre-training or well-trained word embeddings, the model may not converge to the solution we wish, i.e. the definitions of the questions may be mistaken. One way to overcome this challenge is to provide a small sample of (s, q, a)-triplets as supervision.

We need to define groups of questions written in natural language, where each group contains questions expressing the same intent. Although each group is allowed to contain as few as a single question, it is recommended to include more for robust understanding of questions. During the training phase, for each query, one question is picked randomly from each group to form the candidate questions.
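A minimal sketch of this candidate-forming step, assuming hypothetical question groups (the intents and wordings below are illustrative, not the paper's actual question set):

```python
import random

# Hypothetical question groups: each group expresses one intent in several
# natural-language forms.
QUESTION_GROUPS = {
    "toloc": ["Where does the plane arrive?", "What is the destination city?"],
    "fromloc": ["Where does the plane depart from?", "What is the origin city?"],
}

def sample_candidates(groups, rng=random):
    """Pick one question at random from each group to form the candidate set."""
    return {intent: rng.choice(questions) for intent, questions in groups.items()}

candidates = sample_candidates(QUESTION_GROUPS)
```

Sampling a different surface form each time exposes the QS sub-model to varied phrasings of the same intent.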

2.3 Framework

Figure 3 presents the framework of SAMIE. The QS sub-model predicts the probability of a question given a sentence and an answer, noted P(q | s, a) (or P_QS for the whole distribution). Meanwhile, the AE sub-model predicts the distribution of answers given a sentence and a question, noted P_AE (or P(a | s, q) for the probability of a specific answer). Note that θ_QS ∩ θ_AE need not be empty, i.e. the two sub-models can share parameters such as the embeddings of s, q, and a.

During the training phase, the framework has two parts: the supervised part trained on (s, q, a)-triplets, where variables are marked with “s”, and the unsupervised part trained on (s, a)-pairs, where variables are marked with “u”.

There are a fixed number of categories of candidate questions; as mentioned, questions in the same category express the same intent but take different forms. The candidate questions mainly serve the QS sub-model, which gives probabilities for the candidate questions and finds the most appropriate question for the given sentence and answer.

In particular, in the unsupervised part, the candidate questions are also fed into the AE sub-model, yielding answers, only one of which matches the input answer a. With the help of the QS sub-model, which produces probabilities for all candidate questions, the answers are weighted during loss computation so that the irrelevant ones are mitigated. As a result, the loss can guide the iterative learning of both the QS and AE sub-models. In other words, considering each sub-model to define a “hyper-dimension”, we treat the parameter spaces of QS and AE as orthogonal hyper-dimensions, so both can be trained by the same loss simultaneously.
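The weighting just described can be sketched as a probability-weighted sum; the numbers below are toy values, and in the real model the weights come from the QS softmax and the per-question losses from the AE sub-model.

```python
import numpy as np

def unsupervised_loss(qs_probs, ae_losses):
    """Weight each candidate question's answer-extraction loss by the QS
    sub-model's probability for that question, so that irrelevant questions
    contribute little to the gradient."""
    qs_probs = np.asarray(qs_probs, dtype=float)
    ae_losses = np.asarray(ae_losses, dtype=float)
    return float(np.sum(qs_probs * ae_losses))

# Toy values: the second candidate question is both the most probable and the
# one whose predicted answer best matches the input answer (lowest AE loss).
loss = unsupervised_loss([0.1, 0.8, 0.1], [2.0, 0.3, 1.5])  # 0.59
```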

2.4 Training Method

2.4.1 Supervised part

The supervised part computes the sum of the losses of the two sub-tasks, with the following notations:

  • s: the input sentence

  • a: the input answer

  • Q_c: the candidate questions

  • q*: the labeled question, where q* ∈ Q_c, i.e. q* is among the candidates

Then P_QS and P_AE can be obtained from the QS sub-model and the AE sub-model, where P_QS = P(q | s, a) and P_AE = P(a | s, q*). And the total loss is given as

L_sup = L_QS(q*, P_QS) + L_AE(a, P_AE)
2.4.2 Unsupervised part

In the unsupervised part, we do not have any labeled question, and the prediction of the QS sub-model is actually an intermediate state in the training phase. With the notations

  • s: the input sentence

  • a: the input answer

  • Q_c: the candidate questions

the loss of the unsupervised part weights each candidate question's answer-extraction loss by the QS sub-model's probability for that question:

L_unsup = Σ_{q ∈ Q_c} P(q | s, a) · L_AE(a, P(a | s, q))
The softmax layer is required to avoid a trivial solution, i.e. the QS sub-model driving all question weights toward zero and thus the unsupervised loss toward zero. In addition, the output of the QS sub-model should be multiplied by a constant factor greater than 1 before the softmax layer, which lowers the entropy of the softmax output, emphasizing the probable question and suppressing the irrelevant ones. Otherwise, all questions may be given too much attention, causing the QS sub-model to hesitate and update continuously while the AE sub-model cannot learn anything; eventually, the model may not converge.
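The sharpening effect of scaling the logits can be seen in a small sketch; the scale value 5.0 is an illustrative choice, as the paper does not specify the constant.

```python
import numpy as np

def sharpened_softmax(logits, scale=1.0):
    """Softmax over QS logits after multiplying by a scale factor; a scale
    greater than 1 lowers the entropy of the output distribution."""
    z = scale * np.asarray(logits, dtype=float)
    z -= z.max()                     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

plain = sharpened_softmax([1.0, 0.5, 0.2], scale=1.0)
sharp = sharpened_softmax([1.0, 0.5, 0.2], scale=5.0)
# The scaled version concentrates more probability mass on the best question.
```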

2.4.3 Joint training

During the training phase, we train the models jointly by minimizing the total loss:

L = (1 − λ) L_sup + λ L_unsup

where λ is a parameter giving the importance of the unsupervised part. When λ is close to 0, the optimizer focuses more on the supervised part; when λ is close to 1, it focuses more on the unsupervised part. Because variables are initialized randomly, the unsupervised part may be very confused at the beginning, so it works better to begin with a small λ, i.e. pay more attention to the supervised part, and then increase λ gradually as training goes on.
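One way to realize this schedule is a convex combination of the two losses with a linearly increasing weight; the linear ramp and the cap of 0.9 are illustrative assumptions, not values from the paper.

```python
def unsupervised_weight(step, total_steps, lam_max=0.9):
    """Ramp lambda linearly from 0 toward lam_max so that early training
    focuses on the supervised part."""
    return lam_max * min(1.0, step / total_steps)

def total_loss(sup_loss, unsup_loss, lam):
    """Convex combination of the supervised and unsupervised losses."""
    return (1.0 - lam) * sup_loss + lam * unsup_loss
```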

2.5 Evaluation

2.5.1 Question Selection

Given a sentence s and its answer a, as well as all candidate questions Q_c, the model predicts which question is the most appropriate. The selected question is:

q̂ = argmax_{q ∈ Q_c} P(q | s, a)

When s and a are irrelevant, i.e. a is not a valid answer in s, no question should be selected. Thus a threshold δ_q should be set, and q̂ is accepted only when

P(q̂ | s, a) ≥ δ_q

Otherwise, no question is selected.
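A sketch of the thresholded selection rule (the threshold value 0.5 is an illustrative assumption):

```python
def select_question(scores, threshold=0.5):
    """Return the index of the highest-scoring candidate question, or None
    when even the best score falls below the threshold, i.e. the answer is
    judged irrelevant to the sentence."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None
```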

2.5.2 Answer Extraction

Given a sentence s and a question q, the relevant answer is predicted as

â = argmax_a P(a | s, q)

When s and q are irrelevant, no answer should be extracted. This must be taken into account because a sentence may not contain all kinds of slots, i.e. answers cannot always be found in the sentence for every question. However, the AE sub-model always does its best to find the most probable answer no matter which question is posed, i.e. it always produces an answer. Therefore we should check the result by verifying

P(â | s, q) ≥ δ_a

Otherwise, no answer is extracted.

2.6 Implementation

The input layer represents a single text sentence where each word is mapped to a high-dimensional vector space to obtain a fixed embedding. For a given word, its input representation is constructed by summing the corresponding word and position embeddings. Positional embeddings [Gehring et al. 2017] are required because we use the encoder and decoder of the Transformer [Vaswani et al. 2017], which contains no recurrence or convolution.

We denote the multi-layer bidirectional Transformer encoder and decoder as Encoder and Decoder. All Encoders share the same word and position embeddings in our implementation, and the Decoder is used to merge two token sequences into one.

The sentence, question and answer are first encoded with Encoder to capture their contextual semantic information.

In our setting, the answer is a short phrase in a sentence, so the AE task is a sequence labeling task. The AE sub-model can then be implemented with Decoder followed by a fully-connected layer. The resulting logits represent the non-normalized possibility of whether a word of the original sentence should be regarded as part of the answer, and all words whose logit is positive form the predicted answer.
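The decoding rule for the predicted answer span follows directly from this description; the sentence and logits below are toy values.

```python
def extract_answer(words, logits):
    """All words of the sentence whose logit is positive form the predicted
    answer; everything else is left out."""
    return [w for w, z in zip(words, logits) if z > 0]

words = "The plane will arrive in Beijing".split()
logits = [-1.2, -0.8, -2.0, -0.5, -0.3, 1.7]  # illustrative per-word logits
```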

We expect the QS sub-model to give the relevance of a question q to a (sentence, answer) pair. Therefore, the question and the (sentence, answer) pair are each encoded to a vector, and their cosine similarity is computed as the relevance score.

L_QS and L_AE can be arbitrary losses defined for multiclass classification and for labeling over two sequences respectively; both can use the prevalent cross-entropy.
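The cosine-similarity relevance score can be sketched as follows, with small toy vectors standing in for the encoder outputs:

```python
import numpy as np

def cosine_relevance(sa_vec, q_vec):
    """Cosine similarity between the encoded (sentence, answer) vector and
    an encoded question vector, used as the QS relevance score."""
    sa = np.asarray(sa_vec, dtype=float)
    q = np.asarray(q_vec, dtype=float)
    return float(sa @ q / (np.linalg.norm(sa) * np.linalg.norm(q)))
```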

3 Experiments

Figure 4: QS accuracy and AE precision, recall, and F1 versus the number of annotated sentences on ATIS.

3.1 Dataset

We evaluate on the Airline Travel Information System (ATIS) dataset and the Chinese Emergency Corpus (CEC). The former contains spoken queries on flight-related information, with associated intents and key slots in the statements; the latter contains news about emergencies with key elements and their properties. We extract the slots of each query by extracting question-answer pairs. To compare the different methods more fairly and clearly, we filtered out data that does not meet the requirements; the remaining data still accounts for the majority of the original data. As these datasets do not originally contain any question-answer pairs, we wrote a group of questions ourselves for each type of slot. The question types for ATIS include “airline”, “arrive_time”, “depart_time”, “return_time”, “fromloc”, “toloc”, and “stoploc”; those for CEC are “time”, “location”, “denoter”, and “participant”.

3.2 Comparative Models

We compare our approach to the following classical supervised learning methods.

  • biLSTM+attn: For the QS task, a bidirectional LSTM and an attention layer encode the given sentence and answer to predict the most relevant question; for the AE task, a bidirectional LSTM encodes the given sentence and question, and attention combines the two sequences before the output layer.

  • Transformer: Similar to the above, with the biLSTM replaced by a Transformer.

  • SAMIE: Our proposed model (an implementation of the framework with a Transformer), trained on (s, q, a)-triplets and (s, a)-pairs.

  • (Small): Any model marked with “Small” is a small version of the original one with fewer units and layers, and therefore fewer trainable parameters.

Note that the Transformer-based approach has the same QS sub-model and AE sub-model as SAMIE.

3.3 Evaluation Settings

Only a small amount of data is used as labeled (s, q, a)-triplets. Note that one sentence may yield several (s, q, a)-triplets; thus we can extract over ten thousand triplets from ATIS and about eight thousand from CEC. We randomly select 15% of the data from each of the two datasets as their respective test sets.

We test all models with 64, 128, 256, 512, 1024, and 2048 labeled sentences, which are converted to (s, q, a)-triplets for training. The additional (s, a)-pairs for SAMIE are all prepared from the remaining training set. In addition, we conduct a detailed study on the 512 labeled sentences to evaluate SAMIE against overfitting.

We evaluate the QS task and the AE task separately. Given sentences with answers, the QS sub-model should predict their corresponding questions; and given sentences with questions, the AE sub-model should extract possible answers. Note that the precision, recall, and F1-score for answer extraction are word-based rather than slot-based, so they should be a little higher than slot-based scores would be.
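Word-based scoring can be sketched as follows, here simplified to sets of words; the real evaluation presumably counts token positions, so this is illustrative only.

```python
def word_prf(predicted, gold):
    """Word-based precision, recall, and F1 for answer extraction: each
    correctly labeled word counts, rather than requiring a whole slot to
    match exactly. Duplicate words are ignored in this simplified sketch."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Predicting one extra word lowers precision but not recall.
p, r, f1 = word_prf(["Beijing", "on"], ["Beijing"])
```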

Finally, we train SAMIE without labeled (s, q, a)-triplets and inspect the confusion matrix to see the effect of our method on phrase clustering.

3.4 Results

Model                    prec   recall  F1
biLSTM+attn (Small)      0.91   0.92    0.92
biLSTM+attn (Regular)    0.91   0.88    0.90
Transformer (Small)      0.93   0.91    0.92
Transformer (Regular)    0.93   0.90    0.91
SAMIE (Small)            0.88   0.88    0.88
SAMIE (Regular)          0.95   0.94    0.95

Table 1: Detailed comparison with 512 labeled sentences on ATIS.

Due to the space limit, we only present results on the ATIS dataset; our approach performs similarly on CEC (the accuracy reaches 0.9 for the QS task with about 500 triplets). Figure 4 shows the performance of the different methods on ATIS. Among them, the QS task deserves special attention, because when using (s, a)-pairs, the choice of question is only an intermediate state without any supervision, and if questions were selected incorrectly, the error would propagate to the AE sub-model. Therefore, the improvement of SAMIE on the QS task shows that it has learned well. On the other hand, the AE results also indicate the contribution of leveraging (s, a)-pairs.

Due to the limited training data, none of the supervised baselines performed well. SAMIE, however, leveraging (s, a)-pairs, improved significantly over the others. When the amount of data is small, the gap between them is particularly noticeable. The results confirm SAMIE as a promising solution for quickly extracting information of interest at a very limited expense in human annotation. As the number of labeled triplets increases, the performance margins between SAMIE and the baseline methods narrow; SAMIE nevertheless remains slightly better than the rest, even when the number of labeled sentences becomes very close to the total number (3439) of sentences.

As the amount of labeled data is very small, overfitting is inevitable; our approach nonetheless generalizes well. This is verified in the detailed study with 512 labeled sentences shown in table 1. Conventional models (such as the Transformer) typically address overfitting by reducing the number of variables in the network; this is confirmed by the results of Transformer (Small) and Transformer (Regular) on the AE task, where the former outperforms the latter. However, shrinking the network also restricts its final performance. SAMIE, in contrast, alleviates overfitting directly: SAMIE (Regular) outperforms SAMIE (Small) on all metrics, while both still beat their respective Transformer counterparts. These results reveal that SAMIE works better with a more powerful neural network.

Figure 5: Confusion matrix without labeled triplets on ATIS

Finally, we show that it is possible to perform a clustering analysis with our model. We trained SAMIE without any labeled (s, q, a)-triplets, i.e. only the unsupervised part was used. The confusion matrix shown in figure 5 demonstrates that, although the model made mistakes, it only placed (s, a)-pairs into a limited number of wrong clusters. The model mainly confused “fromloc” and “stoploc”, an understandable mistake given the similarity between the two categories.

4 Related Work

The IE task is usually treated as a sequence labeling problem in which contiguous sequences of words are assigned semantic class labels. Standard approaches include probabilistic sequence models such as maximum entropy Markov models (MEMMs) [McCallum et al. 2000] and conditional random fields (CRFs) [Raymond and Riccardi 2007]. Deep learning approaches such as recurrent neural networks (RNNs) have attracted much attention because of their superior performance in language modeling and understanding tasks. Many researchers [Mesnil et al. 2013, Mesnil et al. 2015, Yao et al. 2013, Yao et al. 2014] have applied RNNs to this task with promising results. [Liu and Lane 2016] added an attention mechanism [Bahdanau et al. 2014] on top of RNNs and further improved accuracy. However, although RNN-based approaches have proved effective on this kind of problem, they are relatively slow because they cannot be parallelized. Similar work has been done with CNNs [Xu and Sarikaya 2013], which are parallelizable.

These approaches have achieved significant results, but they rely on the strong assumption that there is enough labeled data. Without it, these models fail to perform well, and the problem of overfitting becomes especially severe. Furthermore, the categories of key phrases (or slots) must be predefined and are difficult to expand.

Some researchers cast the IE problem as reading comprehension. [Levy et al. 2017] reduced relation extraction to answering simple reading comprehension questions: given a sentence with an annotated (e1, r, e2)-triplet, where e1 and e2 are entities and r is their relation, it models the relation (together with one entity) as a question and the remaining entity as its answer. [Roth et al. 2018] further improved model performance in a similar problem setting. Building on these works, [Qiu et al. 2018] focused on cross-sentence relation argument extraction.

Several studies explore the relationship between asking and answering questions so that each can enhance the other. [Golub et al. 2017, Kumar et al. 2018] used a two-stage process, generating questions based on answers that another network selects; formally, they factorized the joint distribution of questions and answers into an answer model and a conditional question model computed by two networks. [Tang et al. 2017] used regularization to connect the two networks and trained them simultaneously, improving the performance of both. [Wang et al. 2017] used a single seq2seq model to generate questions and answers, an idea that is simple but very effective. [Sachan and Xing 2018] applied a self-training strategy for jointly learning to ask and answer questions, leveraging unlabeled text along with labeled question-answer pairs.

5 Conclusion

In this paper, we studied the task of extracting question-answer pairs. The problem was addressed by solving two sub-tasks: Question Selection and Answer Extraction. We observed that these two sub-tasks are intrinsically linked and that (s, a)-pairs are much easier to acquire than (s, q, a)-triplets. Therefore, we proposed SAMIE, a semi-supervised learning model that can be trained on (s, a)-pairs and a small number of labeled (s, q, a)-triplets. Our experimental results showed that SAMIE works especially well when given a very small amount of labeled data, outperforming the baseline methods. In the future, we plan to further reduce the need for labeled data and to pursue unsupervised learning with the help of pre-training techniques.