HyperQA
Reference Implementation for WSDM 2018 Paper
The dominant neural architectures in question answer retrieval are based on recurrent or convolutional encoders configured with complex word matching layers. Given that recent architectural innovations are mostly new word interaction layers or attention-based matching mechanisms, it is widely assumed that these components are mandatory for good performance. Unfortunately, the memory and computation costs incurred by these complex mechanisms are undesirable for practical applications. As such, this paper tackles the question of whether it is possible to achieve competitive performance with simple neural architectures. We propose a simple but novel deep learning architecture for fast and efficient question-answer ranking and retrieval. More specifically, our proposed model, HyperQA, is a parameter-efficient neural network that outperforms other parameter-intensive models such as Attentive Pooling BiLSTMs and Multi-Perspective CNNs on multiple QA benchmarks. The novelty behind HyperQA is a pairwise ranking objective that models the relationship between question and answer embeddings in Hyperbolic space instead of Euclidean space. This empowers our model with a self-organizing ability and enables automatic discovery of latent hierarchies while learning embeddings of questions and answers. Our model requires no feature engineering, no similarity matrix matching, no complicated attention mechanisms and no over-parameterized layers, and yet it outperforms or remains competitive with many models that have these functionalities on multiple benchmarks.
Neural ranking models are commonplace in many modern question answering (QA) systems (Severyn and Moschitti, 2015; He and Lin, 2016). In these applications, the problem of question answering is concerned with learning to rank candidate answers in response to questions. Intuitively, this is reminiscent of document retrieval, albeit with shorter text, which aggravates the long-standing problem of the lexical chasm (Berger et al., 2000). For this purpose, a wide assortment of neural ranking architectures has been proposed. The basic intuition behind many of these models is as follows. Firstly, representations of questions and answers are learned via a neural encoder such as the long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) or a convolutional neural network (CNN). Secondly, these representations of questions and answers are composed by an interaction function to produce an overall matching score.
The design of the interaction function between question and answer representations lies at the heart of deep learning QA research. While it is possible to combine QA representations with simple feed-forward neural networks or other composition functions (Qiu and Huang, 2015; Tay et al., 2017a), the bulk of recent work is concerned with designing novel word interaction layers that model the relationship between the words in the QA pairs. For example, similarity matrix based matching (Wan et al., 2016), soft attention alignment (Parikh et al., 2016) and attentive pooling (dos Santos et al., 2016) are highly popular techniques for improving the performance of neural ranking models. It appears to be widely accepted that grid-based matching is essential to good performance. Notably, these innovations come with trade-offs such as a huge computational cost, which leads to significantly longer training times and a larger memory footprint. Additionally, the base neural encoder also contributes to the computational cost of these neural ranking models, e.g., LSTM networks are known to be over-parameterized and incur a parameter and runtime cost of quadratic scale. It is likewise widely assumed that a neural encoder (such as the LSTM, Gated Recurrent Unit (GRU), CNN, etc.) must first be selected for learning individual representations of questions and answers, and such an encoder is generally treated as mandatory for good performance.
In this paper, we propose an extremely simple neural ranking model for question answering that achieves highly competitive results on several benchmarks with only a fraction of the runtime and only 40K-90K parameters (as opposed to millions). Our neural ranking model captures the relationships between QA pairs in Hyperbolic space instead of Euclidean space. Hyperbolic space is an embedding space with a constant negative curvature in which distances grow exponentially towards the border. Intuitively, this makes it suitable for learning embeddings that reflect a natural hierarchy (e.g., networks, text, etc.), which we believe might benefit neural ranking models for QA. Notably, our work is inspired by the recently proposed Poincaré embeddings (Nickel and Kiela, 2017), which demonstrate the effectiveness of inducing a structural (hierarchical) bias in the embedding space for improved generalization. In our early empirical experiments, we discovered that a simple feed-forward neural network trained in Hyperbolic space is capable of outperforming more sophisticated models on several standard benchmark datasets. We believe that this can be attributed to two reasons. Firstly, latent hierarchies are prominent in QA. Aside from the natural hierarchy of questions and answers, conceptual hierarchies also exist. Secondly, natural language is inherently hierarchical, which can be traced to power-law distributions such as Zipf’s law (Ravasz and Barabási, 2003). The key contributions in this paper are as follows:
We propose a new neural ranking model for ranking question answer pairs. To the best of our knowledge, our proposed model, HyperQA, is the first to perform matching of questions and answers in Hyperbolic space. While hyperbolic geometry and embeddings have been explored in the domains of complex networks or graphs (Krioukov et al., 2010), our work is the first to investigate the suitability of this metric space for question answering.
HyperQA is an extremely fast and parameter-efficient model that achieves very competitive results on multiple QA benchmarks such as TrecQA, WikiQA and YahooCQA. The efficiency and speed of HyperQA are attributable to the fact that we do not use any sophisticated neural encoder and have no complicated word interaction layer. In fact, HyperQA is a mere single-layered neural network with only 90K parameters. Very surprisingly, HyperQA actually outperforms many state-of-the-art models such as Attentive Pooling BiLSTMs (dos Santos et al., 2016; Zhang et al., 2017) and Multi-Perspective CNNs (He and Lin, 2016). We believe that this invites us to reconsider whether many of these complex word interaction layers are really necessary for good performance.
We conduct extensive qualitative analysis of both the learned QA embeddings and word embeddings. We discover several interesting properties of QA embeddings in Hyperbolic space. Due to its compositional nature, we find that our model learns to self-organize not only at the QA level but also at the word-level. Our qualitative studies enable us to gain a better intuition pertaining to the good performance of our model.
Many prior works have established the fact that there are mainly two key ingredients to a powerful neural ranking model. First, an effective neural encoder and second, an expressive word interaction layer. The first ingredient is often treated as a given, i.e., the top performing models always use a neural encoder such as the CNN or LSTM. In fact, many top performing models adopt convolutional encoders for sentence representation (He et al., 2015; Qiu and Huang, 2015; Severyn and Moschitti, 2015; He and Lin, 2016; Zhang et al., 2017; Shen et al., 2014). The usage of recurrent models is also notable (Mueller and Thyagarajan, 2016; Severyn and Moschitti, 2015; Tay et al., 2017a, b).
The key component in which many recent models differ is the interaction layer. Early works often combined QA embeddings ‘as is’, i.e., representations are learned first and then combined. For example, Yu et al. (Yu et al., 2014) used CNN representations as feature inputs to a logistic regression model. The end-to-end CNN-based model of Severyn and Moschitti (Severyn and Moschitti, 2015) combines the CNN encoded representations of question and answer using a multi-layered perceptron (MLP). Recently, a myriad of composition functions have been proposed as well, e.g., tensor layers in Qiu et al. (Qiu and Huang, 2015) and holographic layers in Tay et al. (Tay et al., 2017a).

It has recently become fashionable to model the relationships between question and answer using similarity matrices. Intuitively, this enables more fine-grained matching across words in question and answer sentences. The Multi-Perspective CNN (MP-CNN) (He et al., 2015) compared two sentences via a wide diversity of pooling functions and filter widths, aiming to capture ‘multi-perspectives’ between two sentences. The attention based neural matching (aNMM) model of Yang et al. (Yang et al., 2016) performed soft-attention alignment by first measuring the pairwise word similarity between each word in question and answer. The attentive pooling models of dos Santos et al. (dos Santos et al., 2016) (AP-BiLSTM and AP-CNN) utilized this soft-attention alignment to learn weighted representations of question and answer that are dependent on each other. Zhang et al. (Zhang et al., 2017) extended AP-CNN to 3D tensor-based attentive pooling (AI-CNN). A recent work, the Cross Temporal Recurrent Network (CTRN) (Tay et al., 2017b), proposed a pairwise gating mechanism for joint learning of QA pairs.
Unfortunately, these models actually introduce a prohibitive computational cost to the model usually for a very marginal performance gain. Notably, it is easy to see that similarity matrix based matching incurs a computational cost of quadratic scale. Representation ability such as dimension size of word or CNN/RNN embeddings are naturally also quite restricted, i.e., increasing any of these dimensions can cause computation or memory requirements to explode. Moreover, it is not uncommon for models such as AI-CNN or AP-BiLSTM to spend more than
minutes on a single epoch on QA datasets that are only medium sized. Let us not forget that these models still have to be extensively tuned which aggravates the impracticality problem posed by some of these models.
In this paper, we seek a new paradigm for neural ranking for QA. While many recent works try to out-stack each other with new layers, we strip down our network instead. Our work is inspired by the very recent Poincaré embeddings (Nickel and Kiela, 2017), which demonstrate the superiority and efficiency of generalization in Hyperbolic space. Moreover, this alleviates many overfitting and complexity issues that Euclidean embeddings might face, especially if the data has intrinsic hierarchical structure. It is worth noting that power-law distributions, such as Zipf’s law, are known to arise from innate hierarchical structure (Ravasz and Barabási, 2003). Specifically, the defining characteristic of Hyperbolic space is a much quicker expansion relative to that of Euclidean space, which makes it naturally equipped for modeling hierarchical structure. The concept of Hyperbolic spaces has been applied to domains such as complex network modeling (Krioukov et al., 2010), social networks (Verbeek and Suri, 2016) and geographic routing (Kleinberg, 2007).
There are several key geometric intuitions regarding Hyperbolic spaces. Firstly, the concept of distance and area is warped in Hyperbolic spaces. Specifically, each tile in Figure 1(a) is of equal area in Hyperbolic space but diminishes towards zero in Euclidean space towards the boundary. Secondly, Hyperbolic spaces are conformal, i.e., angles in Hyperbolic spaces and Euclidean spaces are identical. In Figure 1(b), the arcs on the curve are parallel lines that are orthogonal to the boundary. Finally, hyperbolic spaces can be regarded as larger spaces relative to Euclidean spaces due to the fact that the concept of relative distance can be expressed much better, i.e., not only does the distance between two vectors encode information but also where a vector is placed in Hyperbolic space. This enables efficient representation learning.

In Nickel et al. (Nickel and Kiela, 2017), the authors applied the hyperbolic distance (specifically, the Poincaré distance) to model taxonomic entities and graph nodes. Notably, our work, to the best of our knowledge, is the only work that learns QA embeddings in Hyperbolic space. Moreover, questions and answers introduce an interesting layer of complexity to the problem since QA embeddings are in fact compositions of their constituent word embeddings. On the other hand, nodes in a graph and taxonomic entities in (Nickel and Kiela, 2017) are already in their most abstract form, i.e., symbolic objects. As such, we believe it would be interesting to investigate the impact of learning QA embeddings in Hyperbolic space in light of this added compositional nature.
This section outlines the overall architecture of our proposed model. Similar to many neural ranking models for QA, our network has ‘two’ sides with shared parameters, i.e., one for question and another for answer. However, since we optimize for a pairwise ranking loss, the model takes in a positive (correct) answer and a negative (wrong) answer and aims to maximize the margin between the scores of the correct QA pair and the negative QA pair. Figure 2 depicts the overall model architecture.
Our model accepts three sequences as input, i.e., the question (denoted as $q$), the correct answer (denoted as $a$) and a randomly sampled corrupted answer (denoted as $a'$). Each sequence consists of $L$ words, where $L_q$ and $L_a$ are predefined maximum sequence lengths for questions and answers respectively. Each word is represented as a one-hot vector (representing a word in the vocabulary). As such, this layer is a look-up layer that converts each word into a low-dimensional vector by indexing into the word embedding matrix. In our implementation, we initialize this layer with pretrained word embeddings (Pennington et al., 2014). Note that this layer is not updated during training. Instead, we utilize a projection layer that learns a task-specific projection of the embeddings.
In order to learn a task-specific representation for each word, we utilize a projection layer. The projection layer is essentially a single layered neural network that is applied to each word in all three sequences.
$$x = \sigma(W_p\, z + b_p) \qquad (1)$$

where $W_p \in \mathbb{R}^{d \times n}$, $z \in \mathbb{R}^{n}$, $x \in \mathbb{R}^{d}$ and $\sigma$ is a non-linear function such as the rectified linear unit (ReLU). The output of this layer is a sequence of $d$-dimensional embeddings for each sequence (question, positive answer and negative answer). Note that the parameters of this projection layer are shared for both question and answer.

In order to learn question and answer representations, we simply take the sum of all word embeddings in the sequence.
$$y = \sum_{i=1}^{L} x_i \qquad (2)$$

where $y \in \mathbb{R}^{d}$. $L$ is the predefined max sequence length (specific to question and answer) and $x_1, x_2, \ldots, x_L$ are $d$-dimensional embeddings of the sequence. This is essentially the neural bag-of-words (NBoW) representation. Unlike popular neural encoders such as LSTM or CNN, the NBoW representation does not add any parameters and is much more efficient. Additionally, we constrain the question and answer embeddings to the unit ball before passing them to the next layer, i.e., $\|y\| \leq 1$. This is easily done via $y = y / \|y\|$ when $\|y\| > 1$. Note that this projection of QA embeddings onto the unit ball is mandatory and absolutely crucial for HyperQA to even work.
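The projection, summation and unit-ball constraint described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the toy weights and the small margin `eps` (which keeps the vector strictly inside the ball so that a later hyperbolic distance stays finite) are hypothetical and are not taken from the reference implementation.

```python
import math

def project_word(W, b, z):
    """Single-layer projection with ReLU: max(0, W z + b) for one word vector z."""
    return [max(0.0, sum(w * zj for w, zj in zip(row, z)) + bi)
            for row, bi in zip(W, b)]

def encode_sequence(W, b, word_vectors, eps=1e-5):
    """Sum the projected word vectors (NBoW), then keep the sum inside the unit ball."""
    y = [0.0] * len(b)
    for z in word_vectors:
        y = [yi + xi for yi, xi in zip(y, project_word(W, b, z))]
    norm = math.sqrt(sum(yi * yi for yi in y))
    if norm > 1.0:
        # Assumption: rescale to norm (1 - eps) instead of exactly 1,
        # so the point never lands on the boundary of the ball.
        y = [yi * (1.0 - eps) / norm for yi in y]
    return y

# Toy example with hypothetical values: identity projection, two "words".
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
question_words = [[3.0, 0.0], [0.0, 4.0]]
y_q = encode_sequence(W, b, question_words)
print(y_q, math.sqrt(sum(v * v for v in y_q)))
```

Because the same `W` and `b` would be applied to the question, the positive answer and the negative answer, the encoder adds no parameters beyond the projection layer itself.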
Neural ranking models are mainly characterized by the interaction function between question and answer representations. In our work, we mainly adopt the hyperbolic distance function to model the relationships between questions and answers. (While there exist multiple models of Hyperbolic geometry such as the Beltrami-Klein model or the Hyperboloid model, we adopt the Poincaré ball / disk due to its ease of differentiability and freedom from constraints (Nickel and Kiela, 2017).) Formally, let $\mathcal{B}^d = \{x \in \mathbb{R}^d \mid \|x\| < 1\}$ be the open $d$-dimensional unit ball. Our model corresponds to the Riemannian manifold $(\mathcal{B}^d, g_x)$ and is equipped with the Riemannian metric tensor given as follows:

$$g_x = \left(\frac{2}{1 - \|x\|^2}\right)^2 g^E \qquad (3)$$

where $g^E$ is the Euclidean metric tensor. The hyperbolic distance function between question and answer is defined as:

$$d(q, a) = \operatorname{arcosh}\left(1 + 2\,\frac{\|q - a\|^2}{(1 - \|q\|^2)(1 - \|a\|^2)}\right) \qquad (4)$$

where $\|\cdot\|$ denotes the Euclidean norm and $q, a \in \mathbb{R}^d$ are the question and answer embeddings respectively. Note that $\operatorname{arcosh}$ is the inverse hyperbolic cosine function, i.e., $\operatorname{arcosh}(x) = \ln(x + \sqrt{x^2 - 1})$. Notably, $d(q, a)$ changes smoothly with respect to the position of $q$ and $a$, which enables the automatic discovery of latent hierarchies. As mentioned earlier, the distance increases exponentially as the norm of the vectors approaches 1. As such, the latent hierarchies of QA embeddings are captured through the norm of the vectors. From a geometric perspective, the origin can be seen as the root of a tree that branches out towards the boundaries of the hyperbolic ball. This self-organizing ability of the hyperbolic distance is visually and qualitatively analyzed in later sections.
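To make the behaviour of the distance near the boundary concrete, the sketch below evaluates the standard Poincaré distance, arcosh(1 + 2‖q − a‖² / ((1 − ‖q‖²)(1 − ‖a‖²))), for a few points. The function name and the sample points are hypothetical, and the inputs are assumed to lie strictly inside the unit ball.

```python
import math

def poincare_distance(q, a):
    """Poincaré distance between two points strictly inside the unit ball."""
    sq_dist = sum((qi - ai) ** 2 for qi, ai in zip(q, a))
    q_gap = 1.0 - sum(qi * qi for qi in q)   # 1 - ||q||^2
    a_gap = 1.0 - sum(ai * ai for ai in a)   # 1 - ||a||^2
    return math.acosh(1.0 + 2.0 * sq_dist / (q_gap * a_gap))

origin = [0.0, 0.0]
for r in (0.5, 0.9, 0.99, 0.999):
    # Equal Euclidean steps towards the boundary cost more and more hyperbolic distance.
    print(r, poincare_distance(origin, [r, 0.0]))
```

For a point at Euclidean radius r, the distance from the origin reduces to ln((1 + r) / (1 − r)), so it is unbounded as r approaches 1. This is why the norm of an embedding can act as a proxy for its depth in a latent hierarchy.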
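Among the models of Hyperbolic geometry, the Poincaré distance is convenient because it is differentiable. Let $\theta, x \in \mathcal{B}^d$. The partial derivative with respect to $\theta$ is defined as:

$$\frac{\partial d(\theta, x)}{\partial \theta} = \frac{4}{\beta \sqrt{\gamma^2 - 1}} \left( \frac{\|x\|^2 - 2\langle \theta, x \rangle + 1}{\alpha^2}\, \theta - \frac{x}{\alpha} \right) \qquad (5)$$

where $\alpha = 1 - \|\theta\|^2$, $\beta = 1 - \|x\|^2$ and $\gamma = 1 + \frac{2}{\alpha \beta} \|\theta - x\|^2$.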
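As a sanity check on the closed-form derivative, the sketch below compares it against a central finite-difference estimate. The closed form used here is the standard gradient of the Poincaré distance given by Nickel and Kiela (2017), with α = 1 − ‖θ‖², β = 1 − ‖x‖² and γ = 1 + 2‖θ − x‖² / (αβ). Function names, the step size `h` and the sample points are hypothetical.

```python
import math

def poincare_distance(theta, x):
    sq_dist = sum((t - xi) ** 2 for t, xi in zip(theta, x))
    alpha = 1.0 - sum(t * t for t in theta)
    beta = 1.0 - sum(xi * xi for xi in x)
    return math.acosh(1.0 + 2.0 * sq_dist / (alpha * beta))

def poincare_grad(theta, x):
    """Closed-form partial derivative of d(theta, x) with respect to theta."""
    alpha = 1.0 - sum(t * t for t in theta)
    beta = 1.0 - sum(xi * xi for xi in x)
    sq_dist = sum((t - xi) ** 2 for t, xi in zip(theta, x))
    gamma = 1.0 + 2.0 * sq_dist / (alpha * beta)
    dot = sum(t * xi for t, xi in zip(theta, x))
    x_sq = sum(xi * xi for xi in x)
    coef = 4.0 / (beta * math.sqrt(gamma * gamma - 1.0))
    return [coef * ((x_sq - 2.0 * dot + 1.0) / alpha ** 2 * t - xi / alpha)
            for t, xi in zip(theta, x)]

def numerical_grad(theta, x, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(theta)):
        up = list(theta); up[i] += h
        down = list(theta); down[i] -= h
        grad.append((poincare_distance(up, x) - poincare_distance(down, x)) / (2.0 * h))
    return grad

theta, x = [0.3, -0.2], [-0.1, 0.4]
print(poincare_grad(theta, x))
print(numerical_grad(theta, x))
```

The closed form is undefined at θ = x, where γ = 1 and the square root vanishes, so a practical implementation would need to guard that case.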
Finally, we pass the hyperbolic distance through a linear transformation described as follows:

$$s(q, a) = w_f\, d(q, a) + b_f \qquad (6)$$

where $w_f \in \mathbb{R}$ and $b_f \in \mathbb{R}$ are scalar parameters of this layer. The design of this layer is empirically motivated and was selected over other variants, including non-linear activations such as the sigmoid function and the raw hyperbolic distance.
This section describes the optimization and learning process of HyperQA. Our model learns via a pairwise ranking loss, which is well suited for metric-based learning algorithms.
Our network minimizes the pairwise hinge loss which is defined as follows:
$$L = \sum_{(q, a) \in \Delta_q} \; \sum_{(q, a') \notin \Delta_q} \max\left(0,\; s(q, a) + \lambda - s(q, a')\right) \qquad (7)$$

where $\Delta_q$ is the set of all QA pairs for question $q$, $s(q, a)$ is the score between $q$ and $a$, and $\lambda$ is the margin which controls the extent of discrimination between positive QA pairs and corrupted QA pairs. The adoption of the pairwise hinge loss is motivated by the good empirical results demonstrated in Rao et al. (Rao et al., 2016). Additionally, we also adopt the mix sampling strategy for sampling negative samples as described in their work.
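The pairwise hinge loss can be sketched as follows. The sketch assumes a distance-like score in which the correct pair should score lower than the corrupted pair by at least the margin; that sign convention, the function names and the toy numbers are assumptions made for illustration, not details confirmed by the text above.

```python
def linear_score(distance, w, b):
    """Scalar linear layer applied to a hyperbolic distance: w * d + b."""
    return w * distance + b

def pairwise_hinge(pos_scores, neg_scores, margin):
    """Sum of max(0, s_pos + margin - s_neg) over (positive, negative) score pairs.

    Assumption: lower scores mean better matches, so the loss is zero once the
    positive pair scores at least `margin` below the negative pair.
    """
    return sum(max(0.0, sp + margin - sn)
               for sp, sn in zip(pos_scores, neg_scores))

# Toy distances for one question: correct answer vs. corrupted answer.
s_pos = linear_score(0.5, w=1.0, b=0.0)
s_neg = linear_score(3.0, w=1.0, b=0.0)
print(pairwise_hinge([s_pos], [s_neg], margin=1.0))   # already separated by the margin
print(pairwise_hinge([2.0], [2.5], margin=1.0))       # margin violated, positive loss
```

Only pairs that violate the margin contribute to the loss, which is what makes the choice of negative samples matter during training.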
Since our network learns in Hyperbolic space, parameters have to be learned via stochastic Riemannian optimization methods such as RSGD (Bonnabel, 2013).

$$\theta_{t+1} = \mathfrak{R}_{\theta_t}\left(-\eta\, \nabla_R\, \ell(\theta_t)\right) \qquad (8)$$

where $\mathfrak{R}_{\theta_t}$ denotes a retraction onto $\mathcal{B}^d$ at $\theta$. $\eta$ is the learning rate and $\nabla_R\, \ell(\theta_t)$ is the Riemannian gradient with respect to $\theta_t$. Fortunately, the Riemannian gradient can be easily derived from the Euclidean gradient in this case (Bonnabel, 2013). In order to do so, we can simply scale the Euclidean gradient by the inverse of the metric tensor $g_\theta^{-1}$. Overall, the final gradients used to update the parameters are:

$$\nabla_R = \frac{(1 - \|\theta_t\|^2)^2}{4}\, \nabla_E \qquad (9)$$
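A single Riemannian SGD step on the Poincaré ball can be sketched as below: the Euclidean gradient is rescaled by (1 − ‖θ‖²)² / 4, the inverse of the metric tensor's conformal factor, and the result is pulled back inside the ball if it escapes. The retraction used here (a plain subtraction followed by rescaling to norm 1 − `eps`) follows the common practice described by Nickel and Kiela (2017); it, the function name and the toy values are assumptions rather than the reference implementation.

```python
import math

def rsgd_step(theta, euclid_grad, lr, eps=1e-5):
    """One RSGD update for a parameter vector inside the open unit ball."""
    sq_norm = sum(t * t for t in theta)
    scale = (1.0 - sq_norm) ** 2 / 4.0           # inverse metric tensor factor
    new_theta = [t - lr * scale * g for t, g in zip(theta, euclid_grad)]
    norm = math.sqrt(sum(t * t for t in new_theta))
    if norm >= 1.0:
        # Assumed retraction: project back to just inside the boundary.
        new_theta = [t * (1.0 - eps) / norm for t in new_theta]
    return new_theta

print(rsgd_step([0.0, 0.0], [1.0, 0.0], lr=0.1))     # full-size step near the origin
print(rsgd_step([0.9, 0.0], [1.0, 0.0], lr=0.1))     # same gradient, much smaller step
print(rsgd_step([0.9, 0.0], [-100.0, 0.0], lr=1.0))  # overshoot is projected back
```

The rescaling shrinks updates for points near the boundary, where small Euclidean moves correspond to large hyperbolic ones.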
Due to the lack of space, we refer interested readers to (Nickel and Kiela, 2017; Bonnabel, 2013) for more details. For practical purposes, we simply utilize the automatic gradient feature of TensorFlow but convert the gradients with Equation (9) before updating the parameters.

This section describes our empirical evaluation and its results.
In the spirit of experimental rigor, we conduct our empirical evaluation based on four popular and well-studied benchmark datasets for question answering.
YahooCQA - This is a benchmark dataset for community-based question answering that was collected from Yahoo Answers. In this dataset, the answer lengths are relatively longer than TrecQA and WikiQA. Therefore, we filtered answers that have more than words and less than characters. The train-dev-test splits for this dataset are provided by (Tay et al., 2017a).
WikiQA - This is a recently popular benchmark dataset (Yang et al., 2015) for open-domain question answering based on factual questions from Wikipedia and Bing search logs.
SemEvalCQA - This is a well-studied benchmark dataset from SemEval-2016 Task 3 Subtask A (CQA). This is a real world dataset obtained from Qatar Living Forums. In this dataset, there are ten answers in each question ‘thread’ which are marked as ‘Good’, ‘Potentially Useful’ or ‘Bad’. We treat ‘Good’ as positive and anything else as negative labels.
TrecQA - This is the benchmark dataset provided by Wang et al. (Wang et al., 2007). This dataset was collected from TREC QA tracks 8-13 and is comprised of factoid based questions which mainly answer the ‘who’, ‘what’, ‘where’, ‘when’ and ‘why’ types of questions. There are two versions, namely clean and raw, as noted by (Rao et al., 2016) which we evaluate our models on.
Statistics pertaining to each dataset are given in Table 1.

Statistic | YahooCQA | WikiQA | SemEvalCQA | TrecQA |
---|---|---|---|---|
Train Qns | 50.1K | 94 | 4.8K | 1229 |
Dev Qns | 6.2K | 65 | 224 | 82 |
Test Qns | 6.2K | 68 | 327 | 100 |
Train Pairs | 253K | 5.9K | 36K | 53K |
Dev Pairs | 31.7K | 1.1K | 2.4K | 1.1K |
Test Pairs | 31.7K | 1.4K | 3.2K | 1.5K |
In this section, we introduce the baselines for comparison.
YahooCQA - The key competitors of this dataset are the Neural Tensor LSTM (NTN-LSTM) and HD-LSTM from Tay et al. (Tay et al., 2017a) along with their implementation of the Convolutional Neural Tensor Network (Qiu and Huang, 2015), vanilla CNN model, and the Okapi BM-25 (Robertson et al., 1994) benchmark. Additionally, we also report our own implementations of QA-BiLSTM, QA-CNN, AP-BiLSTM and AP-CNN on this dataset based on our experimental setup.
WikiQA - The key competitors on this dataset are the Paragraph Vector (PV) and PV + Cnt models of Le and Mikolov (Le and Mikolov, 2014), the CNN + Cnt model from Yu et al. (Yu et al., 2014) and LCLR (Yih et al., 2013). These baselines are reported in the original WikiQA paper (Yang et al., 2015), which also includes variations that use handcrafted features. Additional strong baselines include QA-BiLSTM and QA-CNN from (dos Santos et al., 2016) along with AP-BiLSTM and AP-CNN, which are attentive pooling improvements of the former. Finally, we also report the Pairwise Ranking MP-CNN from Rao et al. (Rao et al., 2016).
SemEvalCQA - The key competitors of this dataset are the CNN-based ARC-I/II architecture by Hu et al. (Hu et al., 2014), the Attentive Pooling CNN (dos Santos et al., 2016), Kelp (Filice et al., 2016) a feature engineering based SVM method, ConvKN (Barrón-Cedeño et al., 2016) a combination of convolutional tree kernels with CNN and finally AI-CNN (Attentive Interactive CNN) (Zhang et al., 2017), a tensor-based attentive pooling neural model. A comparison with AI-CNN (with features) is also included.
TrecQA - The key competitors on this dataset are mainly the CNN model of Severyn and Moschitti (S&M) (Severyn and Moschitti, 2015), the Attention-based Neural Matching Model (aNMM) of Yang et al. (Yang et al., 2016), HD-LSTM (Tay et al., 2017a) and the Multi-Perspective CNN (MP-CNN) proposed by He et al. (He et al., 2015). Lastly, we also compare with the pairwise ranking adaptation of MP-CNN (Rao et al., 2016). Additionally, due to the long-standing nature of this dataset, there have been a huge number of works based on traditional feature engineering approaches (Wang et al., 2007; Heilman and Smith, 2010; Severyn et al., 2014; Yao et al., 2013), which we also report. For the clean version of this dataset, we also compare with AP-CNN and QA-BiLSTM/CNN (dos Santos et al., 2016).
Since the training splits are standard, we are able to directly report the results from the original papers.
This section describes the key evaluation protocol / metrics and implementation details of our experiments.
We adopt a dataset specific evaluation protocol in which we follow the prior work in their evaluation protocols. Specifically, TrecQA and WikiQA adopt the Mean Reciprocal Rank (MRR) and MAP (Mean Average Precision) metrics which are commonplace in IR research. On the other hand, YahooCQA and SemEvalCQA evaluate on MAP and Precision@1 (abbreviated P@1) which is determined based on whether the top predicted answer is the ground truth. For all competitor methods, we report the performance results from the original paper.
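For readers unfamiliar with these metrics, the sketch below computes MRR, MAP and P@1 from per-question relevance labels that are already sorted by the model's ranking (best candidate first). The function names and the toy labels are hypothetical, and returning 0 for a question with no relevant answer is an assumption; evaluation scripts often drop such questions instead.

```python
def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant answer, or 0.0 if there is none."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def average_precision(ranked_labels):
    """Mean of precision@k taken at each rank k that holds a relevant answer."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def precision_at_1(ranked_labels):
    """1.0 if the top-ranked answer is relevant, else 0.0."""
    return 1.0 if ranked_labels and ranked_labels[0] else 0.0

def evaluate(all_ranked_labels):
    """Average each metric over questions."""
    n = len(all_ranked_labels)
    return {
        "MRR": sum(reciprocal_rank(r) for r in all_ranked_labels) / n,
        "MAP": sum(average_precision(r) for r in all_ranked_labels) / n,
        "P@1": sum(precision_at_1(r) for r in all_ranked_labels) / n,
    }

# Two toy questions; 1 marks a correct answer, listed in ranked order.
print(evaluate([[0, 1, 0, 1], [1, 0, 0]]))
```

MRR only rewards the position of the first correct answer, whereas MAP accounts for every correct answer in the list, so the two can diverge on questions with several correct candidates.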
Additionally, we report the parameter size and runtime (seconds per epoch) of selected models. We selectively re-implement some of the key competitors with the best performance and benchmark their training time on our machine/GPU (a single Nvidia GTX1070). For reporting the parameter size and training time, we try our best to follow the hyperparameters stated in the original papers. As such, the same model can have different training time and parameter size on different datasets.
HyperQA is implemented in TensorFlow. We adopt the AdaGrad (Duchi et al., 2011) optimizer with initial learning rate tuned amongst . The batch size is tuned amongst . Models are trained for epochs and the model parameters are saved each time the performance on the validation set is topped. The dimension of the projection layer is tuned amongst . L2 regularization is tuned amongst . The negative sampling rate is tuned from to . Finally, the margin is tuned amongst . For TrecQA, WikiQA and YahooCQA, we initialize the embedding layer with GloVe (Pennington et al., 2014) and use the version with and trained on 840 billion words. For SemEvalCQA, we train our own Skipgram model using the unannotated corpus provided by the task. In this case, the embedding dimension is tuned amongst . Embeddings are not updated during training. For the SemEvalCQA dataset, we concatenated the raw QA embeddings before passing into the final layer since we found that it improves performance.
In this section, we present our empirical results on all datasets. For all reported results, the best result is in boldface and the second best is underlined.
Table 2 reports our results on the WikiQA dataset. Firstly, we observe that HyperQA outperforms a myriad of complex neural architectures. Notably, we obtain a clear performance gain in terms of MAP/MRR against models such as AP-CNN (0.712/0.727 vs. 0.688/0.696) and AP-BiLSTM (0.671/0.684). Our model also outperforms MP-CNN, which is heavily equipped with parameterized word matching mechanisms. We achieve competitive results relative to the Rank MP-CNN. Finally, HyperQA is extremely efficient and fast, clocking 2s per epoch compared to 33s per epoch for Rank MP-CNN. The parameter cost is also 90K vs. 10 million, which is a significant improvement.
Model | MAP | MRR | #Params | Time |
---|---|---|---|---|
PV | 0.511 | 0.516 | - | - |
PV + Cnt | 0.599 | 0.609 | - | - |
LCLR | 0.599 | 0.609 | - | - |
CNN + Cnt | 0.652 | 0.665 | - | - |
QA-BiLSTM (Santos et al.) | 0.656 | 0.670 | - | - |
QA-CNN (Santos et al.) | 0.670 | 0.682 | - | - |
AP-BiLSTM (Santos et al.) | 0.671 | 0.684 | - | - |
AP-CNN (Santos et al.) | 0.688 | 0.696 | - | - |
MP-CNN (He et al.) | 0.693 | 0.709 | 10.0M | 35s |
Rank MP-CNN (Rao et al.) | 0.701 | 0.718 | 10.0M | 33s |
HyperQA (This work) | 0.712 | 0.727 | 90K | 2s |
Table 3 reports the experimental results on YahooCQA. First, we observe that HyperQA outperforms AP-BiLSTM and AP-CNN significantly. Specifically, we outperform AP-BiLSTM, the runner-up model, on both P@1 (0.683 vs. 0.568) and MRR (0.801 vs. 0.731). Notably, HyperQA is 32 times faster than AP-BiLSTM (20s vs. 640s per epoch) and has 20 times fewer parameters (90.0K vs. 1.80M). Our approach shows that complicated attentive pooling mechanisms are not necessary for good performance.
Model | P@1 | MRR | # Params | Time |
---|---|---|---|---|
Random Guess | 0.200 | 0.457 | - | - |
BM-25 | 0.225 | 0.493 | - | - |
CNN | 0.413 | 0.632 | - | - |
CNTN (Qiu et al.) | 0.465 | 0.632 | - | - |
LSTM | 0.465 | 0.669 | - | - |
NTN-LSTM (Tay et al.) | 0.545 | 0.731 | - | - |
HD-LSTM (Tay et al.) | 0.557 | 0.735 | - | - |
QA-BiLSTM (Santos et al.) | 0.508 | 0.683 | 1.40M | 440s |
QA-CNN (Santos et al.) | 0.564 | 0.727 | 90.9K | 60s |
AP-CNN (Santos et al.) | 0.560 | 0.726 | 540K | 110s |
AP-BiLSTM (Santos et al.) | 0.568 | 0.731 | 1.80M | 640s |
HyperQA (This work) | 0.683 | 0.801 | 90.0K | 20s |
Table 4 reports the experimental results on SemEvalCQA. Our proposed approach achieves highly competitive performance on this dataset. Specifically, we obtain the best P@1 performance overall, outperforming the state-of-the-art AI-CNN model (0.809 vs. 0.769 with features and 0.763 without). The MAP of our model falls marginally short of the best performing model. Notably, AI-CNN has benefited from external handcrafted features. As such, comparing AI-CNN (w/o features) with HyperQA shows that our proposed model is a superior neural ranking model. Next, we draw the reader's attention to the time cost of AI-CNN. Its training time is 3250s per epoch, which is more than 300 times longer than that of our model (10s per epoch). AI-CNN is extremely cost prohibitive, i.e., attentive pooling is already very expensive and yet AI-CNN performs 3D attentive pooling. Evidently, its performance can be matched or exceeded with a much smaller training time and parameter cost. This raises questions about the effectiveness of the 3D attentive pooling mechanism.
Model | P@1 | MAP | #Params | Time |
---|---|---|---|---|
ARC-I (Hu et al.) | 0.741 | 0.771 | - | - |
ARC-II (Hu et al.) | 0.753 | 0.780 | - | - |
AP-CNN (Santos et al.) | 0.755 | 0.771 | - | - |
Kelp (Filice et al.) | 0.751 | 0.792 | - | - |
ConvKN (Barrón-Cedeño et al.) | 0.755 | 0.777 | - | - |
AI-CNN (Zhang et al.) | 0.763 | 0.792 | 140K | 3250s |
AI-CNN + Feats (Zhang et al.) | 0.769 | 0.801 | 140K | 3250s |
HyperQA (This work) | 0.809 | 0.795 | 45K | 10s |

Table 5: Results on TrecQA (raw).

Model | MAP | MRR | # Params | Time |
---|---|---|---|---|
Wang et al. (2007) | 0.603 | 0.685 | - | - |
Heilman et al. (2010) | 0.609 | 0.692 | - | - |
Wang et al. (2010) | 0.595 | 0.695 | - | - |
Yao (2013) | 0.631 | 0.748 | - | - |
Severyn and Moschitti (2013) | 0.678 | 0.736 | - | - |
Yih et al (2014) | 0.709 | 0.770 | - | - |
CNN (Yu et al) | 0.711 | 0.785 | - | - |
BLSTM + BM25 (Wang & Nyberg) | 0.713 | 0.791 | - | - |
CNN (Severyn & Moschitti) | 0.746 | 0.808 | - | - |
aNMM (Yang et al.) | 0.750 | 0.811 | - | - |
HD-LSTM (Tay et al.) | 0.750 | 0.815 | - | - |
MP-CNN (He et al.) | 0.762 | 0.822 | 10.0M | 141s |
Rank MP-CNN (Rao et al.) | 0.780 | 0.830 | 10.0M | 130s |
HyperQA (This work) | 0.770 | 0.825 | 90K | 12s |

Table 6: Results on TrecQA (clean).

Model | MAP | MRR | # Params | Time |
---|---|---|---|---|
QA-LSTM / CNN (Santos et al.) | 0.728 | 0.832 | - | - |
AP-CNN (Santos et al.) | 0.753 | 0.851 | - | - |
MP-CNN (He et al.) | 0.777 | 0.836 | 10M | 141s |
Rank MP-CNN (Rao et al.) | 0.801 | 0.877 | 10M | 130s |
HyperQA | 0.784 | 0.865 | 90K | 12s |
Table 5 reports the results on TrecQA (raw). HyperQA achieves very competitive performance on both MAP and MRR metrics. Specifically, HyperQA outperforms the basic CNN model of (S&M) on both MAP (0.770 vs. 0.746) and MRR (0.825 vs. 0.808). Moreover, the CNN (S&M) model uses handcrafted features which HyperQA does not require. Similarly, the aNMM model and HD-LSTM also benefit from additional features but are outperformed by HyperQA. HyperQA also outperforms MP-CNN while being around 10 times faster (12s vs. 141s per epoch) and having roughly 100 times fewer parameters (90K vs. 10.0M). MP-CNN consists of a huge number of filter banks and utilizes heavy parameterization to match multiple perspectives of questions and answers. On the other hand, our proposed HyperQA is merely a single-layered neural network with 90K parameters and yet outperforms MP-CNN. Table 6 reports the results on TrecQA (clean). Here, HyperQA also outperforms MP-CNN, AP-CNN and QA-CNN. On both datasets, the performance of HyperQA is competitive with Rank MP-CNN.
Overall, we summarize the key findings of our experiments.
It is possible to achieve very competitive performance with small parameterization, and no word matching or interaction layers. HyperQA outperforms complex models such as MP-CNN and AP-BiLSTM on multiple datasets.
The relative performance of HyperQA is significantly better on large datasets, e.g., YahooCQA (253K training pairs) as opposed to smaller ones like WikiQA (5.9K training pairs). We believe that this is due to the fact that Hyperbolic space is seemingly larger than Euclidean space.
HyperQA is extremely fast and trains around 10 times faster than complex models like MP-CNN. Note that if CPUs were used instead of GPUs (which speed up convolutions significantly), this disparity would be significantly larger.
Our proposed approach does not require handcrafted features and yet outperforms models that benefit from them. This is evident on all datasets, i.e., HyperQA outperforms CNN model with features (TrecQA and WikiQA) and AI-CNN + features on SemEvalCQA.
Ours against | Performance | Params | Speed |
---|---|---|---|
AP-BiLSTM | 1-7% better | 20x less | 32x faster |
AP-CNN | 1-12% better | Same | 3x faster |
AI-CNN | Competitive | 3x less | 300x faster |
MP-CNN | 1-2% better | 100x less | 10x faster |
Rank MP-CNN | Competitive | 100x less | 10x faster |
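The rounded MP-CNN and Rank MP-CNN factors in the table above can be checked against the raw figures reported for TrecQA. The sketch below simply divides the parameter counts and times from the TrecQA tables; the dictionary layout and names are illustrative assumptions, and the time values are taken as seconds, as written in those tables.

```python
# Parameter counts and times as reported in the TrecQA results tables.
reported = {
    "MP-CNN": {"params": 10_000_000, "seconds": 141},
    "Rank MP-CNN": {"params": 10_000_000, "seconds": 130},
}
hyperqa = {"params": 90_000, "seconds": 12}


def ratios(baseline, ours):
    """Return (parameter ratio, time ratio) of a baseline relative to ours."""
    return (baseline["params"] / ours["params"],
            baseline["seconds"] / ours["seconds"])


for name, stats in reported.items():
    param_ratio, time_ratio = ratios(stats, hyperqa)
    print(f"{name}: {param_ratio:.0f}x the parameters, {time_ratio:.1f}x the time")
```

Both baselines come out at roughly 111 times the parameters and 11 to 12 times the time, consistent with the "100x less" and "10x faster" entries.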
In this section, we study the effects of the QA embedding size on performance. Figure 3 describes the relationship between QA embedding size and MAP on the WikiQA dataset. Additionally, we include a simple baseline (CosineQA), which is exactly the same as HyperQA but uses cosine similarity instead of hyperbolic distance. The MAP scores of three other models (MP-CNN, CNN-Cnt and PV-Cnt) are also reported for reference. Firstly, we notice the disparity between HyperQA and CosineQA in terms of performance. This is also observed across other datasets but is not reported due to lack of space. While CosineQA maintains a stable performance across embedding sizes, the performance of HyperQA improves rapidly as the embedding size increases. In fact, with only 45K parameters, HyperQA already performs similarly to the Multi-Perspective CNN (He et al., 2015), which contains 10 million parameters. Moreover, HyperQA outperforms MP-CNN at larger embedding sizes.
This section delves into a qualitative analysis of our model and aims to investigate the following research questions:
RQ1: Is there any hierarchical structure learned in the QA embeddings? How are QA embeddings organized in the final embedding space of HyperQA?
RQ2: What are the impacts of embedding compositional embeddings in hyperbolic space? Is there an impact on the constituent word embeddings?
RQ3: Are we able to derive any insight about how word interaction and matching happens in HyperQA?
Question | | H1 | H2 | H3 | H4 | H5 |
---|---|---|---|---|---|---|
What is the gross sale of Burger King | Q | are | sales, today | gross | is, what | burger, king |
 | A | based | sales, 14, billion, 183 | diageo | contributed | burger, corp |
What is Florence Nightingale famous for | Q | in, the | for | famous | what | florence, nightingale |
 | A | of, in | was | nursing | founder, modern, born | nightingale, italy |
Who is the founder of twitter? | Q | the, of | - | twitter, founder | - | who, is |
 | A | and, the | networking, launched | twitter, jack dorsey | match, social | - |
Vector norm ‖w‖ | Words (w) |
---|---|
0-1 | to, and, an, on, in, of, its, the, had, or, go |
1-2 | be, a, was, up, put, said, but |
2-3 | judging, returning, volunteered, managing, meant, cited |
3-4 | responsibility, engineering, trading, prosecuting |
4-5 | turkish, autonomous, cowboys, warren, seven, what |
5-6 | ebay, magdalena, spielberg, watson, nova |
Figure 4(a) shows a visualization of QA embeddings on the test set of TrecQA, projected into 3-dimensional space using t-SNE (van der Maaten, 2009). QA embeddings are extracted from the network as discussed in Section 3.3. We observe that question embeddings form a ‘sphere’ over answer embeddings. In contrast, this is not exhibited when cosine similarity is used, as shown in Figure 4(b). It is important to note that these are embeddings from the test set, which the model has not been trained on, and that the model is never explicitly told whether a particular textual input is a question or an answer. This demonstrates the innate ability of HyperQA to self-organize and learn latent hierarchies, which directly answers RQ1. Additionally, Figure 5(a) shows a histogram of the vector norms of question and answer embeddings (we extract QA embeddings right before the constraining / normalization layer). We can clearly see that questions in general have a higher vector norm and are at a different hierarchical level from answers. To further understand what the model is doing, we delve deeper into the visualization at word level.
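To make the role of vector norms concrete, the sketch below contrasts cosine similarity with a hyperbolic distance. It assumes the standard Poincaré-ball distance, d(x, y) = arcosh(1 + 2‖x − y‖² / ((1 − ‖x‖²)(1 − ‖y‖²))), as the hyperbolic metric; the helper names and the toy 2-dimensional vectors are illustrative and are not taken from the HyperQA implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Depends only on the angle between u and v, not on their norms."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def hyperbolic_distance(u, v, eps=1e-9):
    """Poincare-ball distance; assumes ||u|| < 1 and ||v|| < 1."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - dot(u, u)) * (1.0 - dot(v, v))
    return math.acosh(1.0 + 2.0 * diff_sq / max(denom, eps))


# Same direction, different norms: cosine similarity cannot tell them apart.
question, answer = (0.9, 0.0), (0.1, 0.0)
print(cosine_similarity(question, answer))    # maximal similarity
print(hyperbolic_distance(question, answer))  # clearly non-zero

# The same Euclidean gap (0.1) costs more near the boundary of the ball.
near_origin = hyperbolic_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = hyperbolic_distance((0.85, 0.0), (0.95, 0.0))
print(near_origin, near_boundary)
```

Because the distance depends on how far each point is from the origin, a model trained with it can use the norm as an extra degree of freedom, which is one plausible reading of why questions and answers separate into different norm ranges while the cosine baseline shows no such structure.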
Table 9 shows some examples of words at each hierarchical level of the sphere on TrecQA. Recall that the vector norms (note that word embeddings are not norm-constrained) allow us to infer the distance of a word embedding from the origin, which depicts its hierarchical level in our context. Interestingly, we found that HyperQA exhibits a self-organizing ability even at word level. Specifically, we notice that the words closer to the origin are common words such as ‘to’ and ‘and’, which do not carry much semantic value for QA problems. In the middle of the hierarchy (e.g., the 2-3 band in Table 9), we notice that there are more verbs. Finally, as we move towards the surface of the ‘sphere’, the words become rarer and more domain-specific, such as ‘ebay’ and ‘spielberg’. Moreover, we also found many names and proper nouns occurring at this hierarchical level.
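The grouping behind Table 9 can be sketched as a simple binning of words by the norm of their embedding. The snippet below assumes unit-width norm bands, matching the 0-1, 1-2, ... rows of the table; the function name and the toy 2-dimensional vectors are hypothetical and do not come from the trained model.

```python
import math
from collections import defaultdict


def norm_bands(embeddings, width=1.0):
    """Group words by the band that their embedding norm falls into.

    Returns a dict mapping band index k (covering norms in
    [k * width, (k + 1) * width)) to the list of words in that band.
    """
    bands = defaultdict(list)
    for word, vector in embeddings.items():
        norm = math.sqrt(sum(x * x for x in vector))
        bands[int(norm // width)].append(word)
    return dict(sorted(bands.items()))


# Hypothetical word vectors chosen only to illustrate the binning.
toy_embeddings = {
    "the": [0.3, 0.4],        # norm 0.5
    "was": [1.2, 0.9],        # norm 1.5
    "returning": [2.0, 2.0],  # norm ~2.83
    "spielberg": [4.0, 3.5],  # norm ~5.32
}

for band, words in norm_bands(toy_embeddings).items():
    print(f"{band}-{band + 1}: {', '.join(words)}")
```

Applied to real word embeddings, the same binning would reproduce the layout of Table 9, with low bands near the origin and high bands near the surface of the ‘sphere’.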
Additionally, we observe that words such as ‘where’ or ‘what’ have relatively high vector norms and are located quite high up in the hierarchy. This is consistent with Figure 4, which shows that the question embeddings form a sphere around the answer embeddings. Lastly, we parsed QA pairs word by word according to hierarchical level (based on their vector norm). Table 8 reports the outcome of this experiment, where H1-H5 are hierarchical levels based on vector norms. First, we find that questions often start with the overall context and drill down into more specific query words. Take the first sample in Table 8, for example: it begins at the top level with ‘burger king’ and then drills down progressively to ‘what is gross sales?’. Similarly, the second example begins with ‘florence nightingale’ and drills down to ‘famous’ at H3, where a match is found with ‘nursing’ at the same hierarchical level.
Overall, based on our qualitative analysis, we observe that HyperQA builds two hierarchical structures at the word level (in vector space) whose leaves lie towards the middle, which strongly facilitates word-level matching. Pertaining to answers, it seems that the model builds a hierarchy by splitting on conjunctive words (‘and’), i.e., the root node of the tree starts at conjunctive words and splits sentences into semantic phrases. Figure 6 depicts our key intuitions regarding the inner workings of HyperQA, which addresses both RQ2 and RQ3. This is also supported by Figure 5(b), which shows that the majority of the word norms are clustered in the middle of the range. This is reasonable considering that the leaf nodes of both question and answer hierarchies would reside in the middle.
We proposed a new neural ranking model for question answering. Our proposed HyperQA achieves very competitive performance on four well-studied benchmark datasets. Our model is lightweight, fast and efficient, outperforming many state-of-the-art models with complex word interaction layers, attentive mechanisms or rich neural encoders. Our model has only 40K-90K parameters, as opposed to the millions of parameters that plague many competitor models. Moreover, we derive qualitative insights pertaining to our model which enable us to further understand its inner workings. Finally, we observe that the superior generalization of our model (despite its small parameter count) can be attributed to the self-organizing properties of not only question and answer embeddings but also word embeddings.
Journal of Machine Learning Research 12 (2011), 2121–2159.
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. 1576–1586.
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA. 2786–2792.
A Decomposable Attention Model for Natural Language Inference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. 2249–2255.