MilkQA: a Dataset of Consumer Questions for the Task of Answer Selection

by   Marcelo Criscuolo, et al.
Universidade de São Paulo

We introduce MilkQA, a question answering dataset from the dairy domain dedicated to the study of consumer questions. The dataset contains 2,657 pairs of questions and answers, written in the Portuguese language and originally collected by the Brazilian Agricultural Research Corporation (Embrapa). All questions were motivated by real situations and written by thousands of authors with very different backgrounds and levels of literacy, while answers were elaborated by specialists from Embrapa's customer service. Our dataset was filtered and anonymized by three human annotators. Consumer questions are a challenging kind of question that is commonly used to seek information. Although several question answering datasets are available, most of these resources are not suitable for research on answer selection models for consumer questions. We aim to fill this gap by making MilkQA publicly available. We study the behavior of four answer selection models on MilkQA: two baseline models and two convolutional neural network architectures. Our results show that MilkQA poses real challenges to computational models, particularly due to the linguistic characteristics of its questions and to their unusual length. Only one of the tested models gives reasonable results, at the cost of high computational requirements.




I Introduction

In question answering, the task of answer selection consists of finding the best answer for a question in a pool of pre-selected candidate answers. More formally, given a question $q$ and a set of candidate answers $A = \{a_1, \dots, a_n\}$, the goal is to find a candidate answer $a \in A$ that belongs to $G_q$, where $G_q$ is the ground truth set of question $q$.

Recent research on answer selection models has been focused on objective and well-formed factoids, while little attention has been given to other kinds of questions, such as consumer questions [1, 2], which are a common form of seeking information that usually occurs in Q&A community sites, forums and customer services.

Consumer questions differ from factoids mainly in their structure. They are often formed by multiple sentences that are related to each other at the semantic level by textual cohesion mechanisms. A striking feature of consumer questions is the presence of a problem description (context) that is followed by one or more sub-questions. Those sub-questions are short and seem incomplete when viewed in isolation, since they rely strongly on information provided previously in the context. Occasionally, the context is not given explicitly, and the consumer question is posed as a sequence of interrelated, complementary sub-questions. Another noticeable characteristic of this kind of question is that sub-questions are often posed as declarative sentences beginning with “I would like to know” [1], instead of direct interrogative sentences.

We define a consumer question as a text segment that fulfills one of the following criteria: (a) it contains a context, or problem description, and sub-questions that rely on that previous information to be completely understood; (b) it contains two or more related sub-questions that reinforce each other. By this definition, the context description is not mandatory, and what really characterizes a text segment as a consumer question is the relation of complementary meaning between its sentences.

Consumer questions pose several challenges to computational models. They often contain misspellings, poor punctuation, and layman’s terms that do not match the vocabulary of potential answers [1, 2]. Context descriptions commonly contain excessive detail, which makes it hard to identify the information that is really relevant [3]. Finally, since consumer questions are usually much longer than factoid questions, answer selection models tend to require a great deal of computational resources.

Several datasets are available for the development of answer selection models [4, 5, 6, 7, 8, 9]. However, those resources are not suitable for the development of answer selection models for consumer questions, since they are usually focused on short interrogative sentences. Moreover, questions in some datasets are elaborated artificially and do not represent the real use of natural language. For example, annotators are asked to write questions for given pieces of text. The results of such a process are simple questions of limited practical use.

Realistic datasets have been responsible for driving fields forward. The contribution of the Penn Treebank [10] to the field of syntactic parsing is a well-known example. In this paper, we present MilkQA, a publicly available, realistic dataset of consumer questions from the dairy domain in the Portuguese language. This dataset was formed from thousands of consumer questions collected within the period of twelve years by the customer service of an important agricultural research institution. Every question comes from a real situation and is answered in detail by a specialist. The current version of MilkQA contains 2,657 pairs of messages (questions and respective answers) selected and anonymized by three annotators. The average length of questions is 57 words, while the average length of questions in factoid datasets, like WikiQA, is less than 10 words (see Table II). Answers in MilkQA are even longer than questions, and may have hundreds of words.

We implemented four answer selection models to study their behavior on MilkQA. The performance of two state-of-the-art Convolutional Neural Network (CNN) models is compared to results achieved by two baseline models. The first baseline is idf-weighted word matching, and the second is the cosine of idf-weighted sums of word embeddings. We experiment with MilkQA and with the short-question dataset WikiQA [5]. Although relatively simple models achieve good results on the short-question dataset, their results on MilkQA are only modest. Better results are achieved by a complex model, though it imposes high requirements on computational resources.

This paper is organized as follows. In Section II, we review the main datasets that are available for the task of answer selection. In Section III, we describe our new dataset, MilkQA, in detail. In Section IV, we present and discuss our experiments. Finally, Section V is dedicated to the conclusion and future work.

II Related Work

Although some large question answering datasets are available (Table I), most are composed of rather simple questions, either formulated artificially or derived from queries submitted to search engines. SelQA [8] was introduced as a benchmark dataset for the task of answer selection. It contains nearly 8 thousand questions, formulated by crowdsourcing workers from text segments extracted from Wikipedia. The same process was used to build SQuAD [7], which is much larger, with 100 thousand questions. Questions derived from previously known text segments are generally simpler than natural questions and share many words with their answers. One significant concern with this approach is that the lexical overlap will make sentence selection easier and might inflate the performance of systems [5].

Collecting queries submitted to search engines is an alternative that avoids artificial questions. WikiQA [5] is a dataset of natural factoid questions submitted by users of the Bing search engine. Each question is related to a set of candidate answer sentences extracted from Wikipedia articles, where correct answers are identified by human annotators. Some of the questions in this dataset do not have correct answers in their candidate sets. These examples are useful for the task of answer triggering, where models are required to identify the lack of appropriate answers for a question in its candidate set. The process of collecting questions from search engine logs was also used to build MS MARCO [9], which may be regarded as a larger version of WikiQA. MS MARCO contains 100 thousand factoid questions, as well as questions with no correct answers. However, unlike in WikiQA, answers for MS MARCO questions were not restricted to Wikipedia, but were collected from thousands of web documents.

Another important dataset is TREC-QA [4], also known as QASent, which became the standard benchmark dataset for the answer selection task. It contains only 227 questions, chosen from the Text REtrieval Conference (TREC) QA dataset. To generate candidate answer sets, the authors selected sentences from each question’s document pool that contained one or more non-stopwords from the question. Thus, TREC-QA also shows significant lexical overlap between questions and answers, inducing a strong bias towards models based on word matching.

Finally, InsuranceQA [6] is a non-factoid dataset from the insurance domain. It was created from data collected from an Internet site, where insurance experts answer questions received from users. While answers in InsuranceQA are detailed explanations, significantly longer than typical answers in other datasets, the source site limits questions to single sentences, making them short and objective, with few words only (see Tables II and III).

Question answering datasets in the Portuguese language are less abundant. The Págico [11] dataset was created for a shared task on information retrieval, organized for the Portuguese language. It contains only 153 manually formulated factoid questions, which are not particularly adequate for the question-answer matching techniques typically used in the answer selection task.

| Dataset | Question Source | # Questions |
|---|---|---|
| SelQA | crowdsourcing | 7,904 |
| SQuAD | crowdsourcing | 100K |
| WikiQA | user query logs | 3,047 |
| MS MARCO | user query logs | 100K |
| TREC-QA | user query logs + editor | 227 |
| InsuranceQA | users (single sentences) | 16,889 |
| Págico | editors | 153 |
| MilkQA | consumer emails | 2,657 |

TABLE I: Answer Selection Datasets

Fig. 1: Label frequencies in the MilkQA dataset. Labels with frequencies below 5% are not shown.

III MilkQA Dataset

In this section, we describe the process of creating our dataset, MilkQA, comparing its unique features to those of other datasets.

III-A Data Collection and Preparation

MilkQA data was originally collected by the Brazilian Agricultural Research Corporation (Embrapa). Like all of Embrapa’s research units, Embrapa Dairy Cattle maintains a customer service dedicated to assisting any citizen interested in its business. This service receives many email messages, which are archived with their respective responses after being assigned labels that describe their contents, such as Breed and Diseases. There are 58 different labels. On average, two labels are applied to each message. (MilkQA may also be an interesting dataset for the problem of multi-label classification [12].)

Our dataset is derived from a message archive created by Embrapa Dairy Cattle’s customer service, containing nearly 27 thousand message pairs (requests and responses), collected from the year 2003 to 2012. Questions are written by thousands of authors with very different backgrounds and levels of literacy, while answers are elaborated by customer service specialists.

The customer service works as a counter that receives all sorts of requests, ranging from questions about dairy cattle to job applications. However, to build our answer selection dataset, only messages containing consumer questions were considered. Thus, the many messages in the archive that were not knowledge requests had to be discarded. The filtering process was carried out in two phases. First, we used an automated process to discard most noisy and unwanted messages. Then, the remaining messages went through careful manual selection and cleaning.

The automated process relied on the labels to decide which messages to discard. We estimated usage rates for each label by manually analyzing a random sample of messages drawn from the full archive. This statistical analysis showed that messages marked with some labels, such as Training and Internship, were rarely usable. Such messages were automatically discarded. A number of duplicated messages were identified and removed as well. The resulting pre-selected archive contained about 10 thousand messages, which needed to go through manual selection.
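The label-based filter described above can be illustrated with a minimal sketch. It assumes that a label's usage rate is the fraction of sampled messages carrying that label that were judged usable, and that a message is discarded when any of its labels falls below a cutoff. The cutoff value, the function names and the toy sample are hypothetical and are not taken from the original procedure.

```python
from collections import defaultdict

def usage_rates(sample):
    """Estimate, per label, the fraction of sampled messages judged usable.

    sample: list of (labels, is_usable) pairs from a manual analysis.
    """
    seen = defaultdict(int)
    usable = defaultdict(int)
    for labels, is_usable in sample:
        for label in labels:
            seen[label] += 1
            usable[label] += int(is_usable)
    return {label: usable[label] / seen[label] for label in seen}

def filter_archive(messages, rates, min_rate):
    """Keep messages that carry no label with a usage rate below min_rate.

    min_rate is a hypothetical cutoff; the source does not report one.
    """
    low = {label for label, rate in rates.items() if rate < min_rate}
    return [m for m in messages if not (set(m["labels"]) & low)]

# Toy annotated sample (invented for illustration only).
sample = [
    (["Feed"], True),
    (["Feed", "Diseases"], True),
    (["Training and Internship"], False),
    (["Training and Internship"], False),
    (["Diseases"], False),
]
rates = usage_rates(sample)
archive = [
    {"id": 1, "labels": ["Feed"]},
    {"id": 2, "labels": ["Training and Internship", "Feed"]},
]
kept = filter_archive(archive, rates, min_rate=0.25)
```

In this toy run only the first message survives, because the second one carries a label whose estimated usage rate is zero.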

In the second phase, three annotators worked on the selection of the remaining messages, performing two activities simultaneously: while rejecting non-consumer questions, they also anonymized and cleaned the accepted messages. Message cleaning consisted of removing personal data, such as people's names and contact information, and data unrelated to the questions, such as corporate signatures and advertisements, typically found in email messages.

  (a) What causes destruction of the ozone layer?

  (b) What is the mortality rate for lightning strikes?

  (c) What are the different types of homeowners insurance?

Fig. 2: Examples of objective questions drawn from datasets WikiQA (a, b) and InsuranceQA (c).

I have some dairy cows and noticed that the milk production has dropped a lot. I took tests and noticed the presence of mastitis. (a) I do not know which medication to apply. (b) What is the best treatment? (c) What is the quickest way to get production back to normal?

Fig. 3: A typical consumer question from MilkQA. This question was originally written in Portuguese.

  1. Can calves eat oat? Is it better to mix some mineral?

  2. Can I apply ivermectin to lactating cows? Is it bad for the calf? What about humans?

Fig. 4: Consumer questions with no explicit context.

For this first version of MilkQA, about half of the messages in the pre-selected archive were examined by the annotators. Approximately 53% of the examined messages were selected for the dataset, which means MilkQA currently contains 2,657 anonymized message pairs. We computed the frequency of the labels applied to these messages and show the most frequent ones in the graph in Fig. 1. The graph shows, for example, that the label Feed was applied to almost 35% of the messages.

Those labels provide an overview of common message contents. For example, the frequency of Elephant Grass, Forage and similar labels reveals that there is a high number of questions related to herd feeding. Other labels, like Farm Planning and Herd Management, show that milk production is also a major concern. Indeed, label frequencies clearly reflect that questions were motivated by the interest of consumers in obtaining knowledge about the dairy cattle domain.

III-B Consumer Questions in MilkQA

To highlight the features of MilkQA questions, we contrast examples of consumer questions extracted from this dataset with common examples of objective questions from other datasets. The examples shown in Fig. 2 were extracted from WikiQA and InsuranceQA, and illustrate common features of objective questions. They are represented by single interrogative sentences that are short, direct and complete. In Fig. 3, we show a question drawn from MilkQA. As a typical consumer question, it is represented as a short text containing multiple sentences.

Words per question:

| Dataset | min | avg | 50-p | 99-p | max |
|---|---|---|---|---|---|
| WikiQA | 2 | 7 | 7 | 16 | 23 |
| InsuranceQA | 2 | 7 | 7 | 14 | 57 |
| MilkQA | 4 | 57 | 46 | 217 | 681 |

TABLE II: Comparison of Question Length Statistics

Words per answer:

| Dataset | min | avg | 50-p | 99-p | max |
|---|---|---|---|---|---|
| WikiQA | 1 | 22 | 21 | 57 | 165 |
| InsuranceQA | 15 | 100 | 78 | 395 | 1,180 |
| MilkQA | 8 | 237 | 157 | 688 | 3,427 |

TABLE III: Comparison of Answer Length Statistics

Consumer questions are, on average, several times longer than objective questions. In Table II, we present statistics about question lengths in MilkQA compared to those of two other answer selection datasets. Minimum, average and maximum question lengths are shown for each dataset. The columns 50-p and 99-p refer to the length of questions at the 50th and 99th percentiles, respectively. The table shows, for instance, that 99% of the questions in WikiQA contain 16 words or fewer. MilkQA contains a few non-consumer questions that are really short. Those are responsible for the small number of words shown in the first column of the table. On the other hand, MilkQA contains some really long questions, with hundreds of words. Sub-questions are found among full problem descriptions in such cases. Table III shows analogous statistics for MilkQA answers, which are even longer than questions.
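The statistics reported in Tables II and III can be reproduced for any list of texts with a few lines of code. The sketch below assumes whitespace tokenization and the nearest-rank definition of a percentile; the paper does not state which tokenizer or percentile convention was used, so both are assumptions, and the function name is hypothetical.

```python
import math

def length_stats(texts):
    """Return min, avg, 50-p, 99-p and max word counts for a list of texts."""
    # Assumption: words are separated by whitespace.
    lengths = sorted(len(text.split()) for text in texts)
    n = len(lengths)

    def percentile(p):
        # Nearest-rank percentile: the smallest length such that at least
        # p% of the texts are that long or shorter.
        rank = max(1, math.ceil(p / 100 * n))
        return lengths[rank - 1]

    return {
        "min": lengths[0],
        "avg": sum(lengths) / n,
        "50-p": percentile(50),
        "99-p": percentile(99),
        "max": lengths[-1],
    }

# Toy input built from the example questions quoted in this section.
questions = [
    "Can calves eat oat?",
    "What is the best treatment?",
    "I have some dairy cows and noticed that the milk production has dropped a lot.",
]
stats = length_stats(questions)
```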

Consumer questions are usually composed of a context description and one or more sub-questions. Most of the questions in MilkQA follow this pattern. The example in Fig. 3 contains three sub-questions focused on the problem of a decrease in milk production caused by a specific disease (mastitis). Furthermore, many sub-questions in MilkQA are expressed indirectly, like the example sentence in Fig. 3(a), which implies the question “Which medication should I apply?”. Another indirect construction that is very frequent in the dataset is “I would like to know…”. MilkQA also contains consumer questions that provide no explicit context, like the two examples in Fig. 4, composed of several interrelated interrogative sentences.

It is clear from the examples in Figures 3 and 4 that sub-questions alone rarely have complete meanings. The meaning of one sub-question – like Fig. 3(b), for instance – usually depends on information given in the context, or found in other sub-questions. This dependency occurs by means of linguistic mechanisms, such as substitution and ellipsis [13], which are really challenging for computational processing.

Other common features of MilkQA questions are misspellings, typos and poor use of grammar and punctuation. All of these may have a strong impact on tools such as syntactic parsers. Answers, in turn, are generally well-written texts and show few writing problems. In fact, most answers are detailed technical explanations that tend to be reused with small changes made to better meet the needs of a particular question.

III-C Answer Selection Pool

The task of answer selection requires a pool of candidate answers associated with every question in the dataset. For each question, one or more ground truth answers must be included in the pool.

To build candidate pools from the (question, answer) pairs in MilkQA, we performed a cluster analysis on the full answer set. We identified some clusters of almost identical answers, with only minor differences between elements. This is due to the use of answer templates in the customer service. Such nearly identical answers can cause problems for answer selection models if a ground truth answer conflicts with a negative candidate in the same pool. To avoid this problem, we do not allow more than one answer from the same cluster in each pool. We used a density-based algorithm with parameters tuned to generate very tight clusters. Answers are represented by tf-idf vectors, which are reduced by an autoencoder to 100 dimensions. This analysis identified 97 clusters, containing an average of 6 answers each. The largest and the smallest clusters contain 104 and 2 answers, respectively. In fact, 75% of the identified clusters contain 4 answers or fewer, so the larger clusters represent the remaining 25% of the total.

Each answer pool contains 50 candidates, including one ground truth answer, which was originally provided by the customer service for the corresponding question. The other candidates are the answers nearest to the ground truth, starting from an initial distance determined empirically. Such a strategy aims to reproduce challenging situations where answer selection models have to distinguish between similar answers.
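The pool construction can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: it reads "starting from an initial distance" as a minimum-distance threshold, it takes precomputed distances to the ground truth as input, and it marks unclustered answers by their absence from the cluster map. The function name `build_pool` and its parameters are hypothetical.

```python
def build_pool(gt_id, distances, cluster_of, pool_size=50, min_distance=0.0):
    """Build a candidate pool around one ground truth answer.

    distances:  dict mapping answer id -> distance to the ground truth answer.
    cluster_of: dict mapping answer id -> cluster id; answers that belong to
                no cluster are simply absent from the dict.
    """
    pool = [gt_id]
    used_clusters = set()
    if cluster_of.get(gt_id) is not None:
        used_clusters.add(cluster_of[gt_id])

    # Visit candidates from the nearest to the farthest one.
    for cand in sorted(distances, key=distances.get):
        if len(pool) == pool_size:
            break
        if cand == gt_id or distances[cand] < min_distance:
            continue
        cluster = cluster_of.get(cand)
        if cluster is not None:
            # At most one answer per cluster of near-duplicates.
            if cluster in used_clusters:
                continue
            used_clusters.add(cluster)
        pool.append(cand)
    return pool

# Toy data: a2 shares a cluster with the ground truth, a3 and a4 share another.
distances = {"a1": 0.05, "a2": 0.2, "a3": 0.3, "a4": 0.4, "a5": 0.5}
cluster_of = {"gt": 0, "a2": 0, "a3": 1, "a4": 1}
pool = build_pool("gt", distances, cluster_of, pool_size=3, min_distance=0.1)
```

Here `a1` is skipped for being closer than the threshold, `a2` for sharing the ground truth's cluster, and `a4` for sharing a cluster with the already selected `a3`.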

IV Experiments

We evaluated the behavior of baseline and state-of-the-art answer selection models on WikiQA and MilkQA. All models approach answer selection as a ranking task, where each candidate answer is assigned a relevance score. The highest score should indicate the correct answer. The standard metrics Precision at top one (P@1, that is, precision at $k$ with $k = 1$) and Mean Average Precision (MAP) are adopted for performance evaluation.
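Both metrics are simple to compute once each pool has been sorted by descending relevance score. The sketch below is a minimal illustration with hypothetical function names; it is not the evaluation script or the trec_eval tool used in the experiments. It assumes that every correct answer of a question appears in its pool, so average precision is normalized by the number of correct answers found in the ranking.

```python
def precision_at_1(rankings):
    """Fraction of questions whose top-ranked candidate is correct.

    rankings: one list per question with 0/1 correctness labels, ordered by
    descending model score.
    """
    return sum(labels[0] for labels in rankings) / len(rankings)

def average_precision(labels):
    """Average of the precision values at the rank of each correct answer."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    return sum(average_precision(labels) for labels in rankings) / len(rankings)

# Three toy questions with the correct answer at ranks 1, 2 and 3.
rankings = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
p_at_1 = precision_at_1(rankings)
mean_ap = mean_average_precision(rankings)
```

When a pool holds a single ground truth answer, as in MilkQA, average precision reduces to the reciprocal of that answer's rank.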

IV-A Answer Selection Models

We consider two baseline models: Weighted Word Matching (WWM) and Weighted Sum of word Embeddings (WSE). The first model computes the sum of IDF values for each non-stopword in the question that also occurs in the answer, while the second computes the cosine similarity between question and answer represented as IDF-weighted sums of their word vectors.
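The two baselines can be sketched in plain Python. The code below is an illustration under assumptions: IDF is computed as log(N / df) over a given collection, each shared word is counted once in WWM, and words without a vector are ignored in WSE. None of these details are specified in the text, and the function names, toy collection and toy vectors are hypothetical.

```python
import math
from collections import Counter

def idf_table(documents):
    """IDF of each word over a collection of tokenized documents."""
    n = len(documents)
    df = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n / df[word]) for word in df}

def wwm_score(question, answer, idf, stopwords=frozenset()):
    """Weighted Word Matching: sum of IDF over shared non-stopwords."""
    answer_words = set(answer)
    return sum(idf.get(word, 0.0) for word in set(question)
               if word not in stopwords and word in answer_words)

def weighted_sum(tokens, idf, vectors, dim):
    """IDF-weighted sum of the word vectors of a token sequence."""
    total = [0.0] * dim
    for word in tokens:
        if word in vectors:
            weight = idf.get(word, 0.0)
            for i, value in enumerate(vectors[word]):
                total[i] += weight * value
    return total

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def wse_score(question, answer, idf, vectors, dim):
    """Weighted Sum of word Embeddings: cosine of the two weighted sums."""
    return cosine(weighted_sum(question, idf, vectors, dim),
                  weighted_sum(answer, idf, vectors, dim))

# Toy collection and 2-dimensional vectors (invented for illustration).
docs = [["cow", "milk", "mastitis"], ["cow", "feed", "oat"], ["cow", "calf", "oat"]]
idf = idf_table(docs)
vectors = {"milk": [1.0, 0.0], "mastitis": [0.8, 0.6],
           "oat": [0.0, 1.0], "cow": [0.5, 0.5]}
question = ["mastitis", "cow", "treatment"]
wwm = wwm_score(question, docs[0], idf)
wse = wse_score(question, docs[0], idf, vectors, dim=2)
```

Since "cow" occurs in every toy document, its IDF is zero and it contributes nothing to either score, which is the intended effect of the weighting.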

We also implemented two Convolutional Neural Network (CNN) models. The first, CNN-STD, is a simplified version of the ranking model proposed in [14]. The inputs to this model are two matrices representing a question and a candidate answer. Each matrix is initialized with pre-trained word embeddings and mapped to a feature vector by a convolutional layer. The two feature vectors are concatenated and passed to a hidden layer. A relevance score is obtained by applying the sigmoid function to the hidden layer output. Our implementation differs from the original model in that it considers neither additional features nor an intermediate similarity matrix. We use the hyperbolic tangent as the non-linearity, and dropout is applied to the fully-connected layer for regularization.

The second model, CNN-LDC, is implemented exactly as described in [15]. The general architecture resembles the previous model, but CNN-LDC employs two-channel convolutions. Before mapping matrices to feature vectors, the model decomposes each word into two components that capture semantic similarities and dissimilarities between the current question and candidate answer. This lexical decomposition relies on an attention matrix computed for each input pair.

In all experiments, we used 300-dimensional word vectors trained with word2vec [16], using the Continuous Bag-of-Words model (CBOW) [17]. For WikiQA (English), we used the freely available vectors from Google News, and for MilkQA (Portuguese), we used vectors trained by our research group on a wide range of sources, such as the LX-Corpus [18], texts crawled from the Portuguese versions of Wikipedia and Google News, movie subtitles, newspaper articles, and children’s story books. The text sources used to train the Portuguese word vectors total approximately 1.4 billion tokens.

Hyperparameters are the same for both CNN models. Filter lengths are 1, 2, and 3, with 50 feature maps each, dropout uses a drop rate of 0.20, and the Adam optimizer is used to minimize the squared error. In the training sets, answers are labeled with 1 if they are correct and 0 otherwise. Training is interrupted by early stopping if no performance improvement is observed after two evaluations on the development set. Batches are single triples of the form $(q_i, a_i^+, a_i^-)$, where $a_i^+$ is a ground truth answer for question $q_i$, and $a_i^-$ is a random incorrect answer selected from the question pool. To build another triple (batch), the training algorithm selects the next question, $q_{i+1}$. That means a given question is used only once in each epoch. The maximum lengths for questions and answers are, respectively, 20 and 40 for WikiQA, and 315 and 710 for MilkQA.
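The batching scheme described above can be sketched as a generator that yields one triple per question in each epoch, together with the truncation to the maximum lengths. The function names and data layout are hypothetical, and the sketch assumes that questions are visited in their stored order, which the text does not specify.

```python
import random

def epoch_triples(questions, positives, pools, rng):
    """Yield one (question, correct answer, incorrect answer) triple per question.

    positives: dict question id -> list of ground truth answer ids.
    pools:     dict question id -> list of all candidate answer ids.
    """
    for q in questions:
        positive = rng.choice(positives[q])
        # Negative candidates are the pool members that are not ground truth.
        negatives = [a for a in pools[q] if a not in positives[q]]
        yield q, positive, rng.choice(negatives)

def truncate(tokens, max_len):
    """Cut a token sequence to the maximum length accepted by the model."""
    return tokens[:max_len]

questions = ["q1", "q2"]
positives = {"q1": ["a1"], "q2": ["b1"]}
pools = {"q1": ["a1", "a2", "a3"], "q2": ["b1", "b2"]}
triples = list(epoch_triples(questions, positives, pools, random.Random(0)))

# 315 is the maximum question length reported for MilkQA.
long_question = ["word"] * 400
clipped = truncate(long_question, 315)
```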

The dataset was partitioned into train, dev and test subsets containing 2,307, 50 and 300 questions, respectively. The test set size was chosen to keep most examples in the training set, while the dev set size was chosen to avoid slowing down the training process. We found 50 examples to be a good compromise between performance assessment and computation time.

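| Model | WikiQA P@1 | WikiQA MAP | MilkQA P@1 | MilkQA MAP |
|---|---|---|---|---|
| WWM | 0.5062 | 0.5100 | 0.2467 | 0.3836 |
| WSE | 0.3951 | 0.5838 | 0.1733 | 0.2552 |
| CNN-STD | 0.4135 | 0.5746 | 0.4100 | 0.5573 |
| CNN-LDC | 0.5485 | 0.6848 | 0.5700 | 0.6899 |

TABLE IV: Experiment Results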

IV-B Results

Table IV summarizes the results of our experiments. Performance is measured with P@1, computed by our own evaluation script, and with MAP, computed by the official TREC scorer (trec_eval, version 8.1). These two metrics reflect different system capabilities. MAP measures how good a system is at placing correct answers at the top rank positions, while P@1 represents the fraction of questions that are correctly answered, with a ground truth answer assigned exactly to the first rank position.

As shown in the first row of Table IV, the word matching baseline (WWM) achieves good results on the WikiQA dataset and outperforms CNN-STD, according to P@1. This good performance may be due to frequent lexical overlap in WikiQA. Word matching shows much lower performance on MilkQA.

The best performance on both datasets is achieved by CNN-LDC, while WSE gives the worst overall results. The very low scores of WSE on MilkQA suggest that the weighted sum of vectors is not powerful enough to capture the semantics of longer text segments, such as MilkQA questions and answers. Although word matching (WWM) performs better than WSE on MilkQA, both baselines give very low results compared to the CNN models. In fact, CNN-LDC significantly outperforms the other models on this dataset.

Despite the good MAP scores achieved by CNN-LDC on both datasets, compared to other models, P@1 still indicates large room for improvement (see results of the literature on answer selection models [4, 5, 6, 7, 8, 9]). Even on MilkQA, where P@1 score was higher, the value 0.57 may be interpreted as only 57% of the questions being correctly answered by the best of the models.

IV-C Discussion

At first glance, the lower number of training samples in MilkQA, compared to other datasets (Table I), may seem to be the cause of the modest results in the experiments (Table IV). However, much higher results have been achieved on very small datasets. CNN-LDC, for instance, achieves a MAP score of 0.77 on TREC-QA [15], whose size is only a fraction of the size of MilkQA. Thus, we believe the unique features of consumer questions are the real cause of the modest results observed in the experiments.

MilkQA features also impose severe restrictions on answer selection models. The length of questions and answers limits the size of training batches as well as the number and length of convolution filters in CNNs. To deal with this obstacle, very conservative parameters are chosen for our models. For instance, while CNN-LDC is trained with 500 feature maps in the original paper, we train our model with only 150 feature maps to reduce memory consumption and lower the number of model parameters that must be learned. We also truncate questions and answers to avoid the waste of resources caused by some very long outliers. However, we tried to keep the greatest possible number of texts untouched by choosing cutoff lengths that cause the truncation of only 0.3% of the examples in the dataset. We also truncate sentences in WikiQA, so that our results are comparable to those reported in the CNN-LDC paper [15].


Training and evaluation of CNN-LDC on MilkQA take long periods of time, even when running on GPUs. To compute scores for 50 candidate answers, this model takes about 20 seconds. At this speed, 1.67 hours were taken to evaluate the full test set, while 6.7 hours were necessary to train the model. To reduce training time, we limited the dev set size to only 50 questions, and evaluated the model every 500 batches to decide on early stopping. Each evaluation round on this tiny dev set takes around 16 minutes.
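The evaluation times quoted above follow directly from the per-pool scoring time and the subset sizes. The short check below only uses figures reported in the text (about 20 seconds per pool, 300 test questions, 50 dev questions); the variable names are illustrative.

```python
SECONDS_PER_POOL = 20   # reported time to score one pool of 50 candidates
TEST_QUESTIONS = 300    # size of the test set
DEV_QUESTIONS = 50      # size of the dev set

# One pool is scored per question, so total time scales linearly.
test_hours = TEST_QUESTIONS * SECONDS_PER_POOL / 3600
dev_minutes = DEV_QUESTIONS * SECONDS_PER_POOL / 60

print(f"test set: {test_hours:.2f} h, dev set: {dev_minutes:.1f} min")
```

The result is about 1.67 hours for the test set and between 16 and 17 minutes for each dev evaluation, in line with the reported figures.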

V Conclusion

We introduced MilkQA, a dataset of consumer questions for the task of answer selection. MilkQA contains 2,657 pairs of real questions and answers in the Portuguese language, asked by a large number of authors with different levels of literacy. MilkQA poses difficult challenges to answer selection models due to the linguistic characteristics of its questions and to their much greater length, compared to traditional QA datasets. In our experiments, only modest results could be achieved by simple answer selection models on MilkQA, while a complex model could achieve better results at the cost of high consumption of computational resources. We hope that MilkQA will contribute to further research on answer selection involving consumer questions.

We plan to release a new version of MilkQA in the future, which we expect to contain twice the number of questions in the current version. To achieve this goal, we intend to continue the work of message cleaning and anonymization that is carried out by human annotators.


The work of Marcelo Criscuolo was funded by the Federal Institute of Education, Science and Technology of São Paulo (IFSP). The work of Erick Fonseca was funded by Fapesp grant number 2013/22973-0. The authors are grateful to Embrapa Dairy Cattle for providing the MilkQA data.