A substantial body of literature in the field of natural language processing is devoted to the architectural design of word embedding-based neural networks. Over the years, painstaking progress has been made toward developing the most effective network components. Important advancements include hierarchical attention Yang et al. (2016), multi-perspective convolutions He et al. (2015), and tree-structured networks Tai et al. (2015).
With the rise of the transformer-based pretrained language models, however, many of these components have been all but forgotten. Nowadays, the dominant paradigm is to pretrain a transformer Vaswani et al. (2017) on large text corpora, then fine-tune on a broad range of downstream single-sentence and sentence-pair tasks alike. Prominent examples include BERT Devlin et al. (2019) and XLNet Yang et al. (2019), which currently represent the state of the art across many natural language understanding tasks.
Self-evidently, these models dispense with much of the age-old wisdom that has so well guided the design of neural networks in the past. Perhaps that’s the beauty of it all: a simple, universal architecture just “works.” However, it certainly raises the following question: which neural architectural design choices from the past can we still use?
In this paper, we explore precisely this question in the context of semantic similarity modeling for English. For this task, one important component is the very deep pairwise word interaction (VDPWI) module, first introduced by He and Lin (2016), which serves as a template for many succeeding works Lan and Xu (2018). Conceptually, He and Lin (2016) propose to explicitly compute pairwise distance matrices for the distinct word representations of the two sentences. The matrices are then fed into a convolutional neural network, which treats semantic similarity modeling as a pattern recognition problem. Clearly, transformers lack such an explicit mechanism, instead modeling pairwise word interactions in an unconstrained, implicit manner through self-attention.
We take the anachronistic position that the pairwise word interaction module is still useful. Concretely, we hypothesize that appending this module to pretrained transformers increases their effectiveness in semantic similarity modeling—we argue that this module is more than a historical artifact. Using BERT Devlin et al. (2019), a pretrained transformer-based language model, we validate our hypothesis on four tasks in semantic textual similarity and answer sentence selection.
Our core contribution is that, to the best of our knowledge, we are the first to explore whether incorporating the pairwise word interaction module improves pretrained transformers for semantic similarity modeling. We consistently improve the effectiveness of BERT on all four tasks by adding an explicit pairwise word interaction module.
2 Background and Related Work
Presently, the predominant approach to many NLP tasks is to first train an expressive language model (LM) on large text corpora, then fine-tune it on downstream task-specific data. As pioneers of this approach, Peters et al. (2018) pretrain their bidirectional long short-term memory network (BiLSTM; Hochreiter and Schmidhuber, 1997), called ELMo, on the Billion Word Corpus Chelba et al. (2014). Then, for each task-specific neural network, they use the contextualized LM embeddings in place of the usual GloVe- or word2vec-based word embeddings Pennington et al. (2014); Mikolov et al. (2013), fine-tuning the entire model end-to-end. Using this method, they achieve state-of-the-art results across question answering, sentiment classification, and textual entailment.
Pretrained transformers. Recent transformer-based pretrained language models Vaswani et al. (2017) disregard the task-specific neural network altogether. Instead, the language model is the downstream model. Devlin et al. (2019) are the first to espouse this approach, calling their bidirectional transformer-based model BERT. They pretrain BERT using a cloze and next sentence prediction task on Wikipedia and BooksCorpus Zhu et al. (2015), then swap out the LM output layer with a task-specific one at fine-tuning time.
Concretely, during fine-tuning, a word-tokenized sentence pair $s_1$ and $s_2$ is first encoded as $[\text{CLS}] \oplus s_1 \oplus [\text{SEP}] \oplus s_2 \oplus [\text{SEP}]$, where $\oplus$ denotes concatenation, and [CLS] and [SEP] are special class and separator tokens. Next, BERT passes the input through a sequence of layers composed of nonlinear positionwise operators and multiheaded self-attention mechanisms, matching the transformer model; see Vaswani et al. (2017) for specific details. Crucially, the pairwise word interaction modeling occurs in the self-attention mechanisms, defined as
$$\text{Attn}(X^{(l)}) = \text{softmax}\left(c \, (X^{(l)} W_q)(X^{(l)} W_k)^\top\right) X^{(l)} W_v,$$
where $c$ is a scaling constant, $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ are linear operators, and $X^{(l)} \in \mathbb{R}^{L \times d}$ is the stacked word representations at layer $l$ across an input of length $L$. A minor point is that, for multiheaded attention, there are $h$ attention heads ($h$ divides $d$), the output representations of which are concatenated. The key point is that this mechanism models pairwise context in a global and unconstrained manner; that is, any pair of words, even two words within the same sentence or a word and itself, is free to attend to each other.
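The self-attention mechanism described above can be illustrated with a minimal NumPy sketch. It is not BERT's implementation: the toy sizes, the random weights, the choice of scaling constant, and the function names are all assumptions made for illustration, and the output projection that follows head concatenation in a full transformer layer is omitted.

```python
import numpy as np

def softmax(scores):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v, c):
    # x: (L, d) stacked word representations; w_*: (d, d_head) projections.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    weights = softmax(c * (q @ k.T))  # (L, L): every word attends to every word
    return weights @ v, weights

def multihead_self_attention(x, heads, c):
    # heads: list of (w_q, w_k, w_v) triples; head outputs are concatenated.
    outputs = [self_attention(x, w_q, w_k, w_v, c)[0] for w_q, w_k, w_v in heads]
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
L, d, h = 5, 8, 2            # toy sizes (assumed); h must divide d
d_head = d // h
x = rng.normal(size=(L, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(h)]
c = 1.0 / np.sqrt(d_head)    # assumed choice of the scaling constant
out = multihead_self_attention(x, heads, c)
_, attn = self_attention(x, *heads[0], c)
print(out.shape, attn.shape)
```

Each row of `attn` is a distribution over all input positions, including the word's own position and words from the other sentence, which is what makes the interaction modeling implicit and unconstrained.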
3 Our Approach
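Given a sentence pair $s_1$ and $s_2$, He and Lin (2016) first embed each word using GloVe word embeddings Pennington et al. (2014), pretrained on Wikipedia and GigaWord-5. They then use BiLSTMs for modeling the context of input sentences, obtaining forward and backward context vectors $u^{f}_{i}, u^{b}_{i}$ and $v^{f}_{j}, v^{b}_{j}$ for the words of $s_1$ and $s_2$; the superscript indicates directionality: $f$ for forward and $b$ for backward.
Pairwise interaction layer. From these context vectors, the distances between all context vectors across both sentences are computed to obtain a similarity cube (SimCube) of size $4 \times k \times |s_1| \times |s_2|$, where $k$ is the length of the similarity vector:
$$\begin{aligned}
\text{SimCube}[1, :, i, j] &= \text{coU}(u^{f}_{i}, v^{f}_{j}), \\
\text{SimCube}[2, :, i, j] &= \text{coU}(u^{b}_{i}, v^{b}_{j}), \\
\text{SimCube}[3, :, i, j] &= \text{coU}(u^{f}_{i} + u^{b}_{i}, v^{f}_{j} + v^{b}_{j}), \\
\text{SimCube}[4, :, i, j] &= \text{coU}(u^{f}_{i} \oplus u^{b}_{i}, v^{f}_{j} \oplus v^{b}_{j}).
\end{aligned}$$
He and Lin (2016) define the comparison unit (coU) as $\text{coU}(u, v) = [\delta(u, v), \|u - v\|_2, u \cdot v]$, where $\delta$ denotes the cosine distance between two vectors. The similarity cube is finally reshaped into $\mathbb{R}^{4k \times |s_1| \times |s_2|}$. To reduce the effects of unimportant interactions, He and Lin (2016) further apply a pairwise focus function and reduce the magnitudes of those interactions by a factor of ten.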
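To make the construction of the similarity cube concrete, the following NumPy sketch builds one from forward and backward context vectors. It is an illustration rather than the authors' code: the four comparison views (forward, backward, summed, concatenated) follow our reading of He and Lin (2016), the cosine term is computed as cosine similarity, the pairwise focus function is left out, and random vectors stand in for BiLSTM outputs.

```python
import numpy as np

def cou(u, v, eps=1e-8):
    """Comparison unit: [cosine term, Euclidean distance, dot product]."""
    dot = float(u @ v)
    cos = dot / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
    return np.array([cos, np.linalg.norm(u - v), dot])

def sim_cube(u_f, u_b, v_f, v_b):
    """u_f, u_b: (n1, d) forward/backward context vectors of sentence 1;
    v_f, v_b: (n2, d) likewise for sentence 2. Returns (4 * 3, n1, n2)."""
    views = [
        (u_f, v_f),                                      # forward only
        (u_b, v_b),                                      # backward only
        (u_f + u_b, v_f + v_b),                          # summed directions
        (np.hstack([u_f, u_b]), np.hstack([v_f, v_b])),  # concatenated directions
    ]
    n1, n2 = len(u_f), len(v_f)
    cube = np.zeros((len(views), 3, n1, n2))
    for a, (us, vs) in enumerate(views):
        for i in range(n1):
            for j in range(n2):
                cube[a, :, i, j] = cou(us[i], vs[j])
    # Merge the view axis and the similarity-vector axis into channels.
    return cube.reshape(-1, n1, n2)

rng = np.random.default_rng(0)
u_f, u_b = rng.normal(size=(3, 5)), rng.normal(size=(3, 5))  # sentence 1: 3 words
v_f, v_b = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))  # sentence 2: 4 words
cube = sim_cube(u_f, u_b, v_f, v_b)
print(cube.shape)  # (12, 3, 4)
```

The result can be read as a 12-channel "image" indexed by word positions, which is what allows a convolutional network to treat similarity modeling as pattern recognition.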
Classification. The problem is then converted to a pattern recognition one, where a 19-layer convolutional neural network models the patterns of strong pairwise interactions in the similarity cube. A final softmax layer is used for classification.
3.1 BERT with VDPWI
We use the same procedure as He and Lin (2016) for word interaction modeling, except that we feed sentence input pairs to BERT Devlin et al. (2019) for context modeling as the first step. The contextualized embeddings from BERT are used in the downstream model for constructing the similarity cube, and the entire model is fine-tuned end-to-end.
Sentence encoding schemes. We also explore the effectiveness of different encoding methods, as well as the contribution of the BiLSTMs in our experimental settings:
Joint vs. separate encoding: we jointly or separately encode the sentence pair for BERT.
Removing the BiLSTM: we experiment with keeping or removing the BiLSTM.
In the first scheme, for joint encoding, we concatenate the tokens from the two sentences and use the regular [SEP] token to mark the end of the first sentence. For separate encoding, we feed the sentences to BERT one at a time, so the two sentences do not interact with each other before the explicit interaction modeling.
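The two encoding schemes can be sketched on token lists as follows. This is only an illustration: the helper names are hypothetical, wrapping each separately encoded sentence in [CLS] and [SEP] is an assumption, and the returned index ranges are one possible way to pick out each sentence's word positions from BERT's output before building the similarity cube.

```python
CLS, SEP = "[CLS]", "[SEP]"

def encode_joint(s1, s2):
    # One BERT input; [SEP] marks the end of the first sentence.
    tokens = [CLS] + s1 + [SEP] + s2 + [SEP]
    span1 = range(1, 1 + len(s1))                      # positions of s1's words
    span2 = range(2 + len(s1), 2 + len(s1) + len(s2))  # positions of s2's words
    return tokens, span1, span2

def encode_separate(s1, s2):
    # Two independent BERT inputs; the sentences cannot attend to each other.
    return [CLS] + s1 + [SEP], [CLS] + s2 + [SEP]

s1 = ["a", "man", "plays"]
s2 = ["someone", "plays", "guitar"]
tokens, span1, span2 = encode_joint(s1, s2)
first, second = encode_separate(s1, s2)
print(tokens)
print(first, second)
```

Under joint encoding, self-attention already mixes information across the two sentences before the explicit interaction layer; under separate encoding, the similarity cube is the first place where the sentences meet.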
In the second scheme, our motivation for removing the BiLSTM is that pretrained transformers already provide deep contextualized word embeddings, so further context modeling may be unnecessary; we may need to perform explicit pairwise word interaction modeling only. Note that, since different forward and backward context vectors exist only with the BiLSTM, the SimCube without BiLSTMs is in $\mathbb{R}^{k \times |s_1| \times |s_2|}$ (one $k$-dimensional similarity vector per word pair).
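Without the BiLSTM there is a single context vector per word, so the cube reduces to one similarity vector per word pair. A vectorized NumPy sketch of this case is given below, assuming a three-element similarity vector (cosine term, Euclidean distance, dot product) and using random matrices as stand-ins for BERT's contextualized embeddings; the function name is hypothetical.

```python
import numpy as np

def pairwise_cube(u, v, eps=1e-8):
    # u: (n1, d), v: (n2, d) contextualized embeddings of the two sentences.
    dot = u @ v.T                                                 # (n1, n2)
    norms = np.linalg.norm(u, axis=1)[:, None] * np.linalg.norm(v, axis=1)[None, :]
    cos = dot / (norms + eps)                                     # cosine term
    l2 = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1)   # Euclidean distance
    return np.stack([cos, l2, dot])                               # (3, n1, n2)

rng = np.random.default_rng(1)
u = rng.normal(size=(3, 6))   # sentence 1: 3 words, 6-dim toy embeddings
v = rng.normal(size=(4, 6))   # sentence 2: 4 words
cube = pairwise_cube(u, v)
print(cube.shape)  # (3, 3, 4)
```

Broadcasting replaces the explicit loops over word pairs, which matters in practice because the cube is rebuilt for every batch during end-to-end fine-tuning.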
We represent separate and joint encoding for BERT by appending “SEP” or “JOINT”, respectively, to the subscript of the model name. We indicate the removal of the BiLSTM by appending “− BiLSTM” to the name.
4 Experimental Setup
We run our experiments on machines with two Titan V GPUs and CUDA v10.0. Our models are implemented in PyTorch v1.2.0.
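Table 1: Results on the four datasets (STS-B and SICK: Pearson’s $r$ / Spearman’s $\rho$; WikiQA and TrecQA: MAP / MRR).
| # | Model | STS-B | WikiQA | TrecQA | SICK |
|---|-------|-------|--------|--------|------|
| 1 | PWIM Liu et al. (2019a) | 74.4/71.8 | 70.9/72.3 | 75.9/82.2 | 87.1/80.9 |
| 2 | BERT Devlin et al. (2019) | 84.7/83.9 | 76.3/77.6 | 81.2/86.2 | 87.9/82.3 |
| 5 | BERT$_{\text{SEP}}$ + PWIM − BiLSTM | 85.2/84.0 | 70.6/72.0 | 68.7/72.5 | 88.5/83.7 |
| 6 | BERT$_{\text{JOINT}}$ + PWIM − BiLSTM | 85.0/83.7 | 73.0/74.5 | 82.7/87.5 | 88.8/84.0 |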
We conduct experiments on two question-answering (QA) datasets and two semantic similarity datasets, all in English:
WikiQA Yang et al. (2015) comprises question–answer pairs from Bing query logs. We follow their preprocessing procedure to filter out questions with no correct candidate answer sentences, after which 12K binary-labeled pairs are left.
TrecQA Wang et al. (2007) is an open-domain QA dataset from information retrieval conferences, consisting of 56K question–answer pairs.
STS-B. The Semantic Textual Similarity Benchmark (STS-B; Cer et al., 2017) contains sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Human annotators assign to each pair a similarity score between zero and five, inclusive.
SICK Marelli et al. (2014) consists of 10K sentence pairs originally from Task 1 of the SemEval 2014 competition. A similarity score between one and five, inclusive, is provided for each pair.
SICK and STS-B are evaluated using Pearson’s $r$ and Spearman’s $\rho$, and TrecQA and WikiQA using mean average precision (MAP) and mean reciprocal rank (MRR).
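For reference, the two ranking metrics can be computed as in the sketch below. It assumes that, for each question, the candidate answers have already been sorted by model score and are represented by their binary relevance labels; the function names and the toy data are illustrative.

```python
def average_precision(labels):
    # labels: relevance (1 or 0) of candidates, sorted by descending model score.
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank   # precision at each relevant position
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    # Inverse rank of the first relevant candidate, or 0 if there is none.
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def map_mrr(ranked_lists):
    n = len(ranked_lists)
    mean_ap = sum(average_precision(labels) for labels in ranked_lists) / n
    mean_rr = sum(reciprocal_rank(labels) for labels in ranked_lists) / n
    return mean_ap, mean_rr

# Two toy questions: correct answers at ranks 2 and 3, and at rank 1.
mean_ap, mean_rr = map_mrr([[0, 1, 1], [1, 0, 0]])
print(round(mean_ap, 4), mean_rr)  # 0.7917 0.75
```

Questions with no correct candidate would contribute zero to both metrics here, which is one reason such questions are filtered out during preprocessing.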
4.2 Training and Hyperparameters
For fine-tuning BERT, we follow a similar procedure to Devlin et al. (2019). Specifically, we perform grid search across the learning rate and the number of epochs, choosing the configuration with the best development set scores. Following the original setup, we use the Adam optimizer Kingma and Ba (2014) with a batch size of 32. For our experiments on SICK and STS-B, which use noncategorical scores, we minimize the Kullback–Leibler divergence, while we use the NLL loss on WikiQA and TrecQA, which are classification tasks; these objective functions are standard on these datasets He and Lin (2016).
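One common way to train on real-valued similarity scores with a KL-divergence objective is to convert each gold score into a sparse distribution over the integer grades, as done by Tai et al. (2015). The sketch below assumes that construction; the text above does not spell out the exact target distribution, so treat the helper names, the default score range, and the toy logits as illustrative.

```python
import math

def target_distribution(score, low=1, high=5):
    # Spread a real-valued score over its two neighboring integer grades
    # so that the expected grade equals the score.
    probs = [0.0] * (high - low + 1)
    floor = math.floor(score)
    if floor == score:
        probs[floor - low] = 1.0
    else:
        probs[floor - low] = floor + 1 - score
        probs[floor - low + 1] = score - floor
    return probs

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q); terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def expected_score(q, low=1):
    # Predicted similarity is the expected grade under the model distribution.
    return sum((low + i) * qi for i, qi in enumerate(q))

p = target_distribution(3.6)                 # mass on grades 3 and 4
q = softmax([0.1, 0.2, 1.5, 1.8, 0.3])       # toy model output
loss = kl_divergence(p, q)
print(p, loss, expected_score(q))
```

At evaluation time, the expected grade under the predicted distribution is compared against the gold scores using Pearson's and Spearman's correlation.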
For training the pairwise word interaction model, following He and Lin (2016), we use the RMSProp optimizer Tieleman and Hinton (2012) with a batch size of 8. To tune the hyperparameters on the development set, we run random search across learning rates and across the number of epochs, the latter between 3 and 15, inclusive.
We present our results in Table 1. The original VDPWI model results (first row) for WikiQA, TrecQA, and SICK are copied from Liu et al. (2019a), while we train their model on STS-B, which they do not use. The second row is the result from directly fine-tuning BERT on the four datasets. We report our BERT with VDPWI results in rows 3–6.
5.1 Model Quality
For all four datasets, we find that adding explicit PWI modeling improves the effectiveness of BERT in the original joint encoding scheme (see rows 2 and 4), where we observe an average improvement of 0.9 points. The one-sided Wilcoxon signed-rank (WSR) test reveals that this difference is statistically significant.
Although no single setting achieves the best result on all datasets (i.e., the best numbers appear in different rows in the table), two of our methods (rows 4 and 6) consistently improve upon the original BERT (row 2). Differences between BERT (row 2) and BERT with VDPWI without the BiLSTM (row 6) are not statistically significant according to the one-sided WSR test.
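As an illustration of the one-sided WSR test, the sketch below compares rows 2 and 6 of Table 1 with an exact test implemented from scratch. Pairing the eight metric values across the four datasets is our assumption about how the test is applied, and the code makes no attempt to reproduce the exact p-values of the analysis above; it also assumes no zero differences are informative and no tied magnitudes, which holds for these numbers.

```python
from itertools import product

def signed_rank_statistic(x, y):
    # W+ = sum of the ranks of |y - x| over pairs where y > x (zeros dropped).
    diffs = [b - a for a, b in zip(x, y) if b != a]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank            # assumes no tied magnitudes
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    return w_plus, len(diffs)

def one_sided_p_value(w_plus, n):
    # Exact null distribution: each rank is positive with probability 1/2.
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) >= w_plus
    )
    return count / 2 ** n

bert = [84.7, 83.9, 76.3, 77.6, 81.2, 86.2, 87.9, 82.3]       # Table 1, row 2
bert_pwim = [85.0, 83.7, 73.0, 74.5, 82.7, 87.5, 88.8, 84.0]  # Table 1, row 6
w_plus, n = signed_rank_statistic(bert, bert_pwim)
p_value = one_sided_p_value(w_plus, n)
print(w_plus, n, p_value)
```

With only eight paired values, the exact enumeration over all sign assignments is cheap and avoids relying on a normal approximation.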
5.2 Encoding Scheme Analysis
For joint versus separate sentence encoding schemes, we observe that, on all but STS-B, joint encoding achieves better results than separate encoding; see rows 3 and 5, which represent the separate encoding scheme, and rows 4 and 6, which represent the joint scheme. With or without the BiLSTM, we find that separate encoding results in a degenerate solution on TrecQA, where the model underperforms the original nonpretrained model (row 1); the gap between separate and joint encoding can be up to 14 points. Adjusting for multiple comparisons using the Holm–Bonferroni correction, one-sided WSR tests reveal significant differences between all four separate–joint encoding pairs, except for the jointly encoded BERT with VDPWI (row 4) and the separately encoded BERT with the BiLSTM-removed VDPWI (row 5). We conclude that, to avoid potentially degenerate solutions, jointly encoding the sentences is necessary.
For the BiLSTM ablation experiments, we do not find a statistically significant difference between keeping and removing the BiLSTM according to the two-sided WSR test, corrected using the Holm–Bonferroni method. Additionally, the magnitudes of the differences in the results are minor; compare rows 3 and 5, and rows 4 and 6. We conclude that incorporating the BiLSTM may not be entirely necessary; the pairwise interaction layer and convolutional classifier stack suffice.
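The Holm–Bonferroni correction used in this section can be sketched as a step-down procedure over sorted p-values. The p-values in the example are hypothetical and serve only to show the mechanics; they are not the values from our experiments.

```python
def holm_bonferroni(p_values, alpha=0.05):
    # Returns one reject/keep decision per hypothesis, in the input order.
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        # The smallest p-value is tested at alpha / m, the next at alpha / (m - 1), ...
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break   # once one test fails, all larger p-values are retained
    return reject

hypothetical_p_values = [0.01, 0.04, 0.03, 0.005]
decisions = holm_bonferroni(hypothetical_p_values)
print(decisions)  # [True, False, False, True]
```

Compared with a plain Bonferroni correction, which would test every hypothesis at alpha / m, the step-down thresholds are less conservative while still controlling the family-wise error rate.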
6 Conclusions and Future Work
We explore incorporating explicit pairwise word interaction modeling into BERT, a pretrained transformer-based language model. We demonstrate its effectiveness on four tasks in English semantic similarity modeling. We find consistent improvements in quality across all datasets.
This research was supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, and enabled by computational resources provided by Compute Ontario and Compute Canada.
- Cer et al. (2017) Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation.
- Chelba et al. (2014) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Philipp Koehn, and Tony Robinson. 2014. One billion word benchmark for measuring progress in statistical language modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- He et al. (2015) Hua He, Kevin Gimpel, and Jimmy Lin. 2015. Multi-perspective sentence similarity modeling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- He and Lin (2016) Hua He and Jimmy Lin. 2016. Pairwise word interaction modeling with deep neural networks for semantic similarity measurement. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation.
- Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980.
- Lan and Xu (2018) Wuwei Lan and Wei Xu. 2018. Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering. In Proceedings of the 27th International Conference on Computational Linguistics.
- Liu et al. (2019a) Linqing Liu, Wei Yang, Jinfeng Rao, Raphael Tang, and Jimmy Lin. 2019a. Incorporating contextual and syntactic structures improves semantic similarity modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing.
- Liu et al. (2019b) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
- Marelli et al. (2014) Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. SemEval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of the 8th International Workshop on Semantic Evaluation.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing.
- Tieleman and Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. 2012. Lecture 6.5-RMSProp, Coursera: Neural networks for machine learning. Technical Report.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems.
- Wang et al. (2007) Mengqiu Wang, Noah A. Smith, and Teruko Mitamura. 2007. What is the Jeopardy model? A quasi-synchronous grammar for QA. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.
- Yang et al. (2015) Yi Yang, Wen-tau Yih, and Christopher Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv:1906.08237.
- Yang et al. (2016) Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision.