Paraphrase identification is an important topic in artificial intelligence: the task is to judge whether two sentences expressed in different forms are semantically similar, Chitra & Rajkumar (2016). For example, “On Sunday, the boy runs in the yard” and “The child runs outside at the weekend” are identified as paraphrases. This task directly benefits many industrial applications, such as plagiarism identification Chitra & Rajkumar (2016), machine translation Kravchenko (2017) and removing redundant questions on the Quora website Wang et al. (2017). Recently, many methods have emerged, such as ABCNN Yin et al. (2015), Siamese LSTM Wang et al. (2017) and L.D.C Wang et al. (2016b).
Conventionally, neural methods align the sentence pair and then generate a matching score for paraphrase identification, Wang et al. (2016b, 2017). Regarding the alignment, we conjecture that the aligned unmatched parts are semantically critical, where we define aligned unmatched parts as the corresponding word pairs with low similarity. For example, in “On Sunday, the boy runs in the yard” and “The child runs inside at the weekend”, the matched parts (i.e. (Sunday, weekend), (boy, child), run) contribute little to the semantic similarity of the sentences, whereas the unmatched parts (i.e. “yard” and “inside”) determine that the two sentences are semantically dissimilar. As another example, in “On Sunday, the boy runs in the yard” and “The child runs outside at the weekend”, the aligned unmatched parts (i.e. “yard” and “outside”) are semantically similar, which makes the two sentences paraphrases. In conclusion, if the aligned unmatched parts are semantically consistent, the two sentences are paraphrases; otherwise they are non-paraphrases.
Traditional alignment methods take advantage of the attention mechanism Wang et al. (2017), which is a soft-max weighting technique. A weighting technique can pick out the most similar/dissimilar parts, but is weak in modeling the aligned unmatched parts, which are the crucial evidence for identifying paraphrases. For the input sentences in Figure 1, the weight between “Sunday” and “run” is lower than the weight between “yard” and “inside”, but the former weight is not evidence of paraphrase/non-paraphrase, because the two most dissimilar words should not be aligned in the first place; comparing them is inappropriate.
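The non-exclusive nature of soft-max weighting can be seen in a small sketch. The similarity values below are hypothetical numbers chosen only for illustration, and `softmax` is a plain-Python helper rather than any particular attention implementation.

```python
import math

def softmax(row):
    """Numerically stable soft-max over one row of similarities."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarities: source words (rows) vs. target words (columns).
source = ["boy", "yard"]
target = ["child", "inside"]
similarity = [
    [0.9, 0.0],   # boy  vs. (child, inside)
    [0.1, -0.6],  # yard vs. (child, inside)
]

attention = [softmax(row) for row in similarity]
# Each source word attends most strongly to its highest-scoring target word.
best = [target[row.index(max(row))] for row in attention]
```

In this toy case both source words put most of their weight on “child”, so nothing forces “yard” to be compared with “inside”; an exclusive assignment would.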
To extract the aligned unmatched parts, in this paper, we embed the Hungarian algorithm Wright (1990) into the neural architecture as the Hungarian layer (Algorithm 1). As illustrated in Figure 1, the alignment in sentence matching can be formulated as a task-assignment problem, which is solved by the Hungarian algorithm. Simply put, the Hungarian algorithm works out the theoretically optimal alignment in an exclusive manner, and this exclusiveness characterizes the aligned unmatched parts. For the example in Figure 1, because the Hungarian layer allocates the aligned pairs exclusively, the matched parts (i.e. (Sunday, weekend), (boy, child), run) are aligned first; the word “yard” is then assigned to the word “inside” with a negative similarity, which provides strong evidence for discrimination.
Specifically, our model performs this task in three steps. First, our model applies a BiLSTM to parse the input sentences into hidden representations. Then, the Hungarian layer leverages the hidden representations to extract the aligned unmatched parts. Last, we apply cosine similarity to measure the aligned unmatched parts for the final discrimination. Regarding the training process of the Hungarian layer, we modify the back-propagation algorithm in both directions. In the forward pass, the Hungarian layer works out the alignment relationship, according to which the computational graph is dynamically constructed, as demonstrated in Figure 3. Once the computational graph has been dynamically constructed, the backward propagation can be performed as usual in a conventional graph.
We conduct our experiments on the public benchmark dataset of “Quora Question Pairs” for the task of paraphrase identification. Experimental results demonstrate that our model outperforms other baselines extensively and significantly, which verifies our theory about the aligned unmatched parts and illustrates the effectiveness of our methodology.
Contributions. (1.) We offer a new perspective for paraphrase identification, which focuses on the aligned unmatched parts of two sentences. Accordingly, we propose the Hungarian layer to extract the aligned unmatched parts. The proposed method can achieve hard and exclusive alignments between two sequences, while we can learn parameters by end-to-end back-propagation. (2.) Our model outperforms other baselines extensively, verifying the effectiveness of our theory and method.
Organization. In Section 2, we survey the related work on paraphrase identification and dynamic differentiable computational graphs. In Section 3, we introduce our neural architecture. In Section 4, we conduct the experiments. In Section 5, we conclude our paper and release our code.
2 Related Work
We have surveyed this task and categorized related papers into three lines.
2.1 Non-Neural Architecture for Paraphrase Identification
The topic of paraphrase identification has risen over the last decade. Before neural architectures, its development went through four stages: word specific, syntactic tree specific, semantic matching and probabilistic graph modeling.
Firstly, Bilotti et al. (2007) focus on simple surface-form matching between bags of words, which produces poor accuracy because of word ambiguities and syntactic complexity. Therefore, syntactic analysis is introduced into this task for semantic understanding, such as deeper semantic analysis Shen & Lapata (2007), quasi-synchronous grammars Wang et al. (2009) and tree edit distance Heilman & Smith (2010). Notably, most of these methods compare the grammar trees (e.g. syntactic tree, dependency tree, etc.) of the sentence pair. Further, semantic information such as negation, hypernym, synonym and antonym is integrated into this task for better prediction precision, Lai & Hockenmaier (2014). Finally, Yao et al. (2013) leverage a semi-Markov CRF to align phrases rather than words, which consumes too many resources for industrial applications.
In summary, the advantage of this branch, which is rooted in linguistics, is that it is semantically interpretable, while the disadvantage is that it is too simple to capture complex language phenomena.
2.2 Neural Architecture for Paraphrase Identification: Independent Sentence Encoder
With the popularity of deep neural networks, several neural architectures have been proposed to analyze complex language phenomena in a data-fitting way, which improves the performance. First, the neural network extracts abstract features from each sentence independently, then measures the similarity of the abstract feature pair. There are two frameworks: CNN-based and RAE-based.
Commonly, a CNN can be treated as an n-gram method, which corresponds to a language model. Specifically, Yu et al. (2014) apply a bi-gram CNN to jointly model source and target sequences. Yang et al. (2015) achieve better performance by following this work. Socher et al. (2011) propose an RAE-based model to characterize phrase-level representations, which improves on the simple pooling method, Blacoe & Lapata (2012). Multi-perspective methods Wang et al. (2017) take advantage of multiple metric aspects to boost the accuracy.
In summary, the advantage of this branch is that it models complex and ambiguous linguistic phenomena in a black-box style. However, the disadvantage is that the encoder cannot adjust the abstract representations according to the correlation of the sentence pair, which makes the matching process imperfect.
2.3 Neural Architecture for Paraphrase Identification: Interdependent Sentence Encoder
To emphasize the correlation of the sentence pair in the encoder, researchers propose attention-based neural architectures, which guide the encoding process according to the corresponding part. We introduce two representative methods: ABCNN Yin et al. (2015) and L.D.C Wang et al. (2016b).
ABCNN is a CNN-based model. In a single stage, this model computes the attention similarity matrix for the convolution layer, then sums each row and column as the weights of the pooling layer. The output of the convolution layer is weighted by the pooling layer in an average manner as the output of this stage. ABCNN can stack at most three stages. This method achieves satisfactory performance in many tasks, because it models correlation in the sentence encoder. The L.D.C model Wang et al. (2016b) is an attention-based method, which decomposes the hidden representations into similar and dissimilar parts, then processes each part separately to generate the final result. Notably, L.D.C is the state-of-the-art method.
In summary, the advantage of this branch is that it models alignment or correlation in the encoding process. However, the disadvantage is that it focuses on the matched parts rather than the unmatched parts, which are critical in this task as previously discussed.
2.4 Dynamic Differentiable Computational Graphs
Neural Turing Machine (NTM) Graves et al. (2014); Gulcehre et al. (2016b) is a seminal work to implement instrument-based algorithm in the neural architecture, which attempts to express algorithms by simulating memory and controller. However, NTM leverages the weighting technique, which involves too much noise and makes the learned algorithm fuzzy. Thus, we propose a hard way to embed algorithms into neural architectures.
There also exist some papers for dynamical computational graph construction. At the lower level, pointer-switch networks Gulcehre et al. (2016a) are a kind of dynamic differentiable neural model. At the higher level, some architecture search models Pham et al. (2018); Fernando et al. (2017) construct new differentiable computational graphs dynamically at every iteration.
First, we introduce the basic components of our neural architecture. Then, we analyze the training process of the Hungarian layer, namely how the computational graph is dynamically constructed.
3.1 Neural Architecture
Our neural architecture is illustrated in Figure 2. Basically, our model is composed of four components, namely, word embedding, bi-directional LSTM (BiLSTM), Hungarian layer and cosine similarity.
Word Embedding. The goal of this layer is to represent each word in every sentence with $d$-dimensional semantic vectors. The word representations, which are pre-trained by GloVe Pennington et al. (2014), are unmodified within the learning procedure. The inputs of this layer are a pair of sentences as source and target word sequences, while the outputs are the corresponding embedding matrices.
Bi-Directional LSTM (BiLSTM). The purpose of this layer is to transform lexical representations into hidden contextual representations. For hidden contextual encoding, we employ a parameter-shared bi-directional LSTM (BiLSTM) Hochreiter & Schmidhuber (1997) to parse the word embeddings into hidden representations, where the $i$-th hidden representation corresponds to the $i$-th word embedding in the source/target sentence.
Hungarian Layer. This layer, which is the matching component of our model, extracts the aligned unmatched parts from the source and target sentences. This layer is composed of two sequential stages.
Algorithm 1 demonstrates the first stage. The objective of this stage is to align the source and target hidden representations. The inputs of this stage are the source hidden representation vectors and the target hidden representation vectors, while the outputs are aligned hidden representation vector pairs $(a^S_i, a^T_i)$, where $a^S_i$/$a^T_i$ corresponds to the $i$-th aligned source/target hidden representation vector, respectively.
Specifically, this stage has three steps. First, dot products are taken between every source and every target hidden representation to generate the pairwise similarity matrix. Then, the Hungarian algorithm works out the aligned source-target position pairs from this similarity matrix. For the example in Figure 1, taking the left/top sentence as the source/target sequence, the aligned source-target position pairs are those illustrated in the figure. Last, the input hidden representation vectors are re-organized into the aligned source-target hidden representation vector pairs $(a^S_i, a^T_i)$, according to the aligned source-target position pairs.
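The three steps of the first stage can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' Algorithm 1: the names `hungarian_min_cost` and `align`, the toy two-dimensional vectors, and the requirement that the source is no longer than the target are hypothetical choices made for the example. The assignment routine is a standard potential-based formulation of the Hungarian algorithm, applied to negated dot-product similarities so that the total similarity is maximized.

```python
def hungarian_min_cost(cost):
    """Return (row, col) pairs of a minimum-cost assignment; needs rows <= cols."""
    n, m = len(cost), len(cost[0])
    assert n <= m, "sketch assumes the source is no longer than the target"
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials
    v = [0.0] * (m + 1)   # column potentials
    p = [0] * (m + 1)     # p[j]: 1-based row matched to column j (0 = free)
    way = [0] * (m + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # flip matches along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def align(source_h, target_h):
    """Step 1: similarity matrix; step 2: assignment; step 3: regroup vectors."""
    sim = [[dot(s, t) for t in target_h] for s in source_h]
    neg = [[-x for x in row] for row in sim]  # negate to maximize similarity
    pairs = hungarian_min_cost(neg)
    aligned = [(source_h[i], target_h[j]) for i, j in pairs]
    return pairs, aligned

# Toy hidden representations (hypothetical values).
source_h = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1]]
target_h = [[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]]
pairs, aligned = align(source_h, target_h)
```

Here the first two source vectors each claim their best target, and the exclusive assignment leaves the third source vector paired with the remaining target at a negative similarity, which mirrors the “yard”/“inside” situation described earlier.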
The second stage attempts to extract the aligned unmatched parts by weighting the aligned hidden representations from the first stage. To extract the unmatched parts, the weight of two aligned representations should be small if they are matched, whereas a large dissimilarity should lead to a large weight. For this reason, we introduce cosine dissimilarity: for the $i$-th aligned pair, the cosine dissimilarity $d_i$ is derived from the $i$-th aligned cosine similarity $c_i$ of the first stage, so that $d_i$ is small when $c_i$ is large. The aligned hidden representations are then concatenated and weighted by cosine dissimilarity:
$$o_i = d_i \otimes [a^S_i; a^T_i],$$
where $o_i$ is the $i$-th output of the Hungarian layer, $a^S_i$/$a^T_i$ is the $i$-th aligned source/target hidden representation generated by Algorithm 1 and $\otimes$ is the scalar-vector multiplication. In the practical setting, most of the cosine dissimilarities approach $0$, and the remaining hidden representations indicate the aligned unmatched parts.
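The weighting step of the second stage can be sketched as follows. The exact form of the cosine dissimilarity is an assumption here: the sketch uses `(1 - cosine) / 2`, which is one simple choice that is near zero for matched pairs and large for unmatched ones. The function names and toy vectors are likewise hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hungarian_layer_output(aligned_pairs):
    """Weight each concatenated aligned pair by its cosine dissimilarity."""
    outputs = []
    for src, tgt in aligned_pairs:
        d = (1.0 - cosine(src, tgt)) / 2.0  # assumed form, lies in [0, 1]
        outputs.append([d * x for x in src + tgt])
    return outputs

aligned_pairs = [
    ([1.0, 0.0], [2.0, 0.0]),   # matched: same direction
    ([0.0, 1.0], [0.0, -1.0]),  # unmatched: opposite direction
]
out = hungarian_layer_output(aligned_pairs)
```

The matched pair is scaled to the zero vector, while the unmatched pair passes through unchanged, so only the unmatched parts carry information to the next layer.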
Cosine Similarity. Last, we average the concatenated hidden representations as the final sentence representation $r$, which is a conventional procedure in neural natural language processing, Wang et al. (2016b). Then, we employ a cosine similarity as the output:
$$s = \frac{r_S \cdot r_T}{\lVert r_S \rVert \, \lVert r_T \rVert},$$
where $s$ is the matching score, $\lVert \cdot \rVert$ is the length of a vector and $r_S$/$r_T$ is the corresponding source/target part of the final sentence representation $r$. Thus, our output ranges in $[-1, 1]$, where $+1$ means the two sentences are similar/paraphrase, and $-1$ means otherwise. For further evaluation of accuracy, we also apply a threshold learned on the development dataset to binarize the cosine similarity as paraphrase/non-paraphrase. Notably, the introduction of the concatenation layer facilitates the inference and training of the Hungarian layer.
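The output stage described above can be sketched in a few lines. The names `matching_score` and `predict`, the threshold value `0.5`, and the zero-vector guard are assumptions made for the example; in the paper the threshold is learned on the development set.

```python
import math

def matching_score(layer_outputs):
    """Average the layer outputs, split into source/target halves, take cosine."""
    k, width = len(layer_outputs), len(layer_outputs[0])
    mean = [sum(o[c] for o in layer_outputs) / k for c in range(width)]
    half = width // 2
    src, tgt = mean[:half], mean[half:]
    dot = sum(x * y for x, y in zip(src, tgt))
    norm = math.sqrt(sum(x * x for x in src)) * math.sqrt(sum(y * y for y in tgt))
    return dot / norm if norm else 0.0  # guard for all-zero halves (assumption)

def predict(score, threshold):
    """Binarize the cosine score with a threshold."""
    return "paraphrase" if score >= threshold else "non-paraphrase"

# Two aligned positions, each a concatenated [source; target] vector.
layer_outputs = [[0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
score = matching_score(layer_outputs)
label = predict(score, threshold=0.5)  # hypothetical threshold
```

With these toy inputs the only surviving position holds opposite source and target vectors, so the score sits at the bottom of the range and the pair is labeled non-paraphrase.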
3.2 Training Hungarian Layer
As previously discussed, the Hungarian algorithm is embedded into the neural architecture, which poses a challenge for the learning process. We tackle this issue by modifying the back-propagation algorithm in a dynamically graph-constructing manner. In the forward pass, we dynamically construct the links between the Hungarian layer and the next layer according to the aligned position pairs, while in the backward pass, the back-propagation is performed through the dynamically constructed links. Next, we illustrate how the computational graph is dynamically constructed in the Hungarian layer, as Figure 3 shows.
As Figure 3 shows, in the forward propagation, the Hungarian algorithm works out the aligned position pairs, according to which the neural components are dynamically connected to the next layer. In the example of Figure 3, the 1st source and 2nd target word representations are jointly linked to the 1st aligned position of the concatenation layer. Once the computational graph has been dynamically constructed in the forward pass, the backward pass can propagate through the dynamically constructed links between layers, without any branching or non-differentiability issues. In the example of Figure 3, the backward pass first propagates to the 1st aligned position of the concatenation layer, then propagates to the 1st source and 2nd target word representations, respectively. In this way, the optimization framework can still adjust the parameters of the neural architecture in an end-to-end manner.
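The forward linking and backward routing can be sketched without any deep-learning framework. The sketch assumes that the aligned position pairs are treated as constants once the forward pass has computed them, and it leaves out the dissimilarity weighting for brevity; `forward`, `backward` and the toy numbers are hypothetical.

```python
def forward(source_h, target_h, pairs):
    """Concatenate the representations selected by the aligned position pairs."""
    return [source_h[i] + target_h[j] for i, j in pairs]

def backward(grad_out, pairs, n_source, n_target, dim):
    """Route each gradient slice back along the link built in the forward pass."""
    grad_source = [[0.0] * dim for _ in range(n_source)]
    grad_target = [[0.0] * dim for _ in range(n_target)]
    for g, (i, j) in zip(grad_out, pairs):
        for c in range(dim):
            grad_source[i][c] += g[c]        # first half belongs to the source
            grad_target[j][c] += g[dim + c]  # second half belongs to the target
    return grad_source, grad_target

source_h = [[1.0, 2.0], [3.0, 4.0]]
target_h = [[5.0, 6.0], [7.0, 8.0]]
pairs = [(0, 1), (1, 0)]  # 1st source linked to 2nd target, as in the example

concat = forward(source_h, target_h, pairs)
grad_out = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
grad_source, grad_target = backward(grad_out, pairs, 2, 2, 2)
```

The gradient arriving at the 1st aligned position is split between the 1st source and the 2nd target representations, which is the routing described for Figure 3.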
In this section, we verify our model performance on the famous public benchmark dataset of “Quora Question Pairs”. First, we introduce the experimental settings, in Section 4.1. Then, in Section 4.2, we conduct the performance evaluation. Last, in order to further test our assumptions, that the aligned unmatched parts are semantically critical, we conduct a case study for illustration in Section 4.3.
4.1 Experimental Setting
We initialize the word embedding with 300-dimensional GloVe word vectors pre-trained on the 840B Common Crawl corpus Pennington et al. (2014). For the out-of-vocabulary (OOV) words, we directly apply the zero vector as the word representation. Regarding the hyper-parameters, we set the hidden dimension as 150 for each BiLSTM. To train the model, we leverage AdaDelta Zeiler (2012) as our optimizer, with hyper-parameters as moment factor and . We train the model until convergence, but for at most 30 rounds. We apply the batch size as .
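The OOV handling above amounts to a lookup with a zero-vector fallback. The sketch below uses a tiny hypothetical vocabulary in place of the real GloVe table; `embed` and `toy_vectors` are illustrative names only.

```python
EMBED_DIM = 300  # GloVe dimensionality stated in the text

def embed(tokens, vectors, dim=EMBED_DIM):
    """Look up each token; out-of-vocabulary words get the zero vector."""
    return [vectors[tok] if tok in vectors else [0.0] * dim for tok in tokens]

# Stand-in for the pre-trained table (hypothetical values).
toy_vectors = {"boy": [0.1] * EMBED_DIM, "runs": [0.2] * EMBED_DIM}
matrix = embed(["boy", "runs", "zzyzx"], toy_vectors)
```

Since the embeddings stay fixed during training, an OOV word remains the zero vector throughout.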
4.2 Performance Evaluation
Dataset. To demonstrate the effectiveness of our model, we perform our experiments on the well-known public benchmark dataset “Quora Question Pairs” (the URL of the dataset: https://data.quora.com). For a fair comparison, we follow the splitting rules of Wang et al. (2017). Specifically, there are over 400,000 question pairs in this dataset, and each question pair is annotated with a binary value indicating whether the two questions are paraphrases of each other or not. We randomly select 5,000 paraphrases and 5,000 non-paraphrases as the development set, and sample another 5,000 paraphrases and 5,000 non-paraphrases as the test set. We keep the remaining instances as the training set.
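The balanced sampling described above can be sketched as follows. This does not reproduce the exact partition of Wang et al. (2017); the function name `split_pairs`, the seed, and the toy examples are assumptions made for illustration.

```python
import random

def split_pairs(examples, per_class=5000, seed=0):
    """examples: list of (question1, question2, label) with label in {0, 1}."""
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for ex in examples:
        by_label[ex[2]].append(ex)
    train, dev, test = [], [], []
    for label in (0, 1):
        items = by_label[label][:]
        rng.shuffle(items)
        dev += items[:per_class]                 # balanced development set
        test += items[per_class:2 * per_class]   # balanced test set
        train += items[2 * per_class:]           # everything else
    return train, dev, test

# Toy data: 100 pairs with alternating labels, 5 per class for dev and test.
toy = [("q%d" % i, "p%d" % i, i % 2) for i in range(100)]
train, dev, test = split_pairs(toy, per_class=5)
```

With `per_class=5000` on the full dataset, the development and test sets would each hold 10,000 pairs and the training set the remainder.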
Baselines. To make a sufficient comparison, we choose five state-of-the-art baselines: Siamese CNN, Multi-Perspective CNN, Siamese LSTM, Multi-Perspective LSTM, and L.D.C. Specifically, Siamese CNN and LSTM encode the two input sentences into two sentence vectors by CNN and LSTM, respectively, Wang et al. (2016a). Based on the two sentence vectors, a cosine similarity is leveraged to make the final decision. Multi-Perspective methods leverage different metric aspects to promote the performance, Wang et al. (2017). L.D.C model Wang et al. (2016b) is an attention-based method, which decomposes the hidden representations into similar and dissimilar parts. L.D.C is a powerful model which achieves the state-of-the-art performance.
We have tested L.D.C. and our model five times to evaluate the mean and variance, then perform the test for statistical significance.
|Method|Accuracy (%)|
|Siamese CNN Wang et al. (2016a)|79.60|
|Multi-Perspective CNN Wang et al. (2017)|81.38|
|Siamese LSTM Wang et al. (2016a)|82.58|
|Multi-Perspective LSTM Wang et al. (2017)|83.21|
|L.D.C. Wang et al. (2016b)|84.75 ± 0.42|
|Our Model|85.53 ± 0.18|
We apply a t-test and obtain $p < 0.01$. Thus, the improvement is statistically significant.
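As a rough check on the comparison between L.D.C. and our model, a t statistic can be computed from the summary numbers in the table. The sketch below makes two assumptions that the text does not confirm: the values after ± are read as standard deviations over the five runs, and Welch's unequal-variance form of the t-test is used. The function name `welch_t` is hypothetical.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and approximate degrees of freedom from summary stats."""
    va, vb = std_a ** 2 / n_a, std_b ** 2 / n_b
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (n_a - 1) + vb ** 2 / (n_b - 1))
    return t, df

# Our model vs. L.D.C., five runs each (± read as standard deviation).
t, df = welch_t(85.53, 0.18, 5, 84.75, 0.42, 5)
```

Under these assumptions the statistic is about 3.8 with between five and six degrees of freedom; it would then be compared against the t distribution to obtain a p-value.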
Results. Our results are reported in Table 1. We can conclude that:
Our method outperforms all the baselines, which illustrates the effectiveness of our model.
In order to evaluate the reliability of the comparison between L.D.C and our model, the results are tested for statistical significance using a t-test. In this case, we obtain a p-value = 0.003 < 0.01. Therefore, the null hypothesis that values are drawn from the same population (i.e., the accuracies of two approaches are virtually equivalent) can be rejected, which means that the improvement is statistically significant.
Compared with Siamese LSTM Wang et al. (2016a), which lacks the matching layer, our model can precisely align the input sentences. Thus, our method improves the performance.
Compared with L.D.C. Wang et al. (2016b), which is an attention-based method and still analyzes the dissimilar part, our model can exactly extract the aligned unmatched parts rather than the fuzzy dissimilar parts. Thus, our performance is better.
Notably, L.D.C. is a very complex model, yet it is outperformed by our simple model with a statistically significant improvement. This comparison illustrates that our model is indeed simple but effective. Thus, it is very suitable for industrial applications.
4.3 Case Study
We have conducted a case study in the practical setting of “Quora Question Pairs” with our model for paraphrase identification. As illustrated in Figure 4, the slashed grids correspond to the aligned matched parts, while the crossed ones indicate the aligned unmatched parts. Notably, in this case study we mark aligned pairs whose pairwise similarity falls below a threshold as unmatched.
In example (a), the two input sentences are: “What is your review of Hidden Figures -LRB- 2016 movie -RRB-” and “What are your impressions of Hidden Figures -LRB- 2017 movie -RRB-”. In our case analysis, most of the aligned parts are matched, while the few aligned unmatched parts are similar. Thus, our method judges the two sentences as paraphrases. This accords with our assumption.
In example (b), the two input sentences are: “Why is saltwater taffy candy imported in Austria” and “Why is salt water taffy candy unknown in Japan”. There are two unmatched parts, “imported/unknown” and “Austria/Japan”, which conflict. Thus, the case is classified as non-paraphrase.
In example (c), the two sentences are: “How can I stop being addicted to love” and “How can I stop being so addicted to my phone”. In our case analysis, there is an extreme conflict, “love/phone”, which makes this case non-paraphrase, according to our assumption.
In example (d), the two sentences are: “Is a(n) APK file just a hidden app” and “Where do APK files get stored in Android Studio”. There are many conflicts in this case, which yields a very low similarity score and a non-paraphrase decision.
In summary, this case study supports our assumption that “the aligned unmatched parts are semantically critical”.
In this paper, we leverage Hungarian algorithm to design Hungarian layer, which extracts the aligned matched and unmatched parts exclusively from the sentence pair. Then our model is designed by assuming the aligned unmatched parts are semantically critical. Experimental results on benchmark datasets verify our theory and demonstrate the effectiveness of our proposed method.
- Bilotti et al. (2007) Bilotti, Matthew W., Ogilvie, Paul, Callan, Jamie, and Nyberg, Eric. Structured retrieval for question answering. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 351–358, 2007.
- Blacoe & Lapata (2012) Blacoe, William and Lapata, Mirella. A comparison of vector-based representations for semantic composition. In Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 546–556, 2012.
- Chitra & Rajkumar (2016) Chitra, A. and Rajkumar, Anupriya. Plagiarism detection using machine learning-based paraphrase recognizer. Journal of Intelligent Systems, 25(3):351–359, 2016.
- Fernando et al. (2017) Fernando, Chrisantha, Banarse, Dylan, Blundell, Charles, Zwols, Yori, Ha, David, Rusu, Andrei A, Pritzel, Alexander, and Wierstra, Daan. Pathnet: Evolution channels gradient descent in super neural networks. Arxiv, 2017.
- Graves et al. (2014) Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural turing machines. Computer Science, 2014.
- Gulcehre et al. (2016a) Gulcehre, Caglar, Ahn, Sungjin, Nallapati, Ramesh, Zhou, Bowen, and Bengio, Yoshua. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 140–149, 2016a.
- Gulcehre et al. (2016b) Gulcehre, Caglar, Chandar, Sarath, Cho, Kyunghyun, and Bengio, Yoshua. Dynamic neural turing machine with soft and hard addressing schemes. Arxiv, 2016b.
- Heilman & Smith (2010) Heilman, Michael and Smith, Noah A. Tree edit models for recognizing textual entailments, paraphrases, and answers to questions. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 2-4, 2010, Los Angeles, California, USA, pp. 1011–1019, 2010.
- Hochreiter & Schmidhuber (1997) Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735, 1997.
- Kravchenko (2017) Kravchenko, Dmitry. Paraphrase detection using machine translation and textual similarity algorithms. In Conference on Artificial Intelligence and Natural Language, pp. 277–292, 2017.
- Lai & Hockenmaier (2014) Lai, Alice and Hockenmaier, Julia. Illinois-lh: A denotational and distributional approach to semantics. In International Workshop on Semantic Evaluation, pp. 329–334, 2014.
- Pennington et al. (2014) Pennington, Jeffrey, Socher, Richard, and Manning, Christopher. Glove: Global vectors for word representation. In Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543, 2014.
- Pham et al. (2018) Pham, Hieu, Guan, Melody Y, Zoph, Barret, Le, Quoc V, and Dean, Jeff. Efficient neural architecture search via parameter sharing. Arxiv, 2018.
- Shen & Lapata (2007) Shen, Dan and Lapata, Mirella. Using semantic roles to improve question answering. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pp. 12–21, 2007.
- Socher et al. (2011) Socher, Richard, Huang, Eric H, Pennington, Jeffrey, Manning, Christopher D, and Ng, Andrew Y. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pp. 801–809, 2011.
- Wang et al. (2009) Wang, Mengqiu, Smith, Noah A., and Mitamura, Teruko. What is the jeopardy model? a quasi-synchronous grammar for qa. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pp. 22–32, 2009.
- Wang et al. (2016a) Wang, Zhiguo, Mi, Haitao, and Ittycheriah, Abraham. Semi-supervised clustering for short text via deep representation learning. In the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2016a.
- Wang et al. (2016b) Wang, Zhiguo, Mi, Haitao, and Ittycheriah, Abraham. Sentence similarity learning by lexical decomposition and composition. In the 26th International Conference on Computational Linguistics, 2016b.
- Wang et al. (2017) Wang, Zhiguo, Hamza, Wael, and Florian, Radu. Bilateral multi-perspective matching for natural language sentences. In Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), 2017.
- Wright (1990) Wright, M. B. Speeding up the hungarian algorithm. Computers and Operations Research, 17(1):95–96, 1990.
- Yang et al. (2015) Yang, Yi, Yih, Wen-tau, and Meek, Christopher. Wikiqa: A challenge dataset for open-domain question answering. In Conference on Empirical Methods on Natural Language Processing, pp. 2013–2018, 2015.
- Yao et al. (2013) Yao, X., Durme, B. Van, Callison-Burch, C., and Clark, P. Semi-markov phrase-based monolingual alignment. 2013 Conference on Empirical Methods in Natural Language Processing, 2013.
- Yin et al. (2015) Yin, Wenpeng, Schütze, Hinrich, Xiang, Bing, and Zhou, Bowen. Abcnn: Attention-based convolutional neural network for modeling sentence pairs. Computer Science, 2015.
- Yu et al. (2014) Yu, Lei, Hermann, Karl Moritz, Blunsom, Phil, and Pulman, Stephen. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632, 2014.
- Zeiler (2012) Zeiler, Matthew D. Adadelta: An adaptive learning rate method. Computer Science, 2012.