Semantic Text Similarity (STS) is an important task in Natural Language Processing (NLP) applications such as information retrieval, classification, extraction, question answering, and plagiarism detection. The STS task measures the degree of similarity between two texts and can be expressed as follows: given two sentences, a system returns a continuous score on a scale from 1 to 5, with 1 indicating that the semantics of the sentences are completely independent and 5 meaning that there is a semantic equivalence.
STS is a difficult problem because languages contain numerous ambiguities and synonymous expressions, while sentences vary in length and can have complex structures. Basic models such as bag-of-words or TF-IDF are therefore limited: they disregard word order and ignore syntactic as well as semantic relationships. Recent successes in sentence similarity have been obtained using Neural Networks (RNNs: Recurrent Neural Networks Siamese_LSTM; Kiros; Tai and CNNs: Convolutional Neural Networks Similarity_Convolutional). Neural Networks (NNs) perform a deeper analysis of sentences and words, taking better account of both the semantics and the structure of sentences in order to predict sentence similarity.
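The word-order limitation of bag-of-words models can be illustrated with a minimal sketch. The function name `bow_cosine` and the example sentences are assumptions made for illustration only; they are not part of the systems cited above.

```python
from collections import Counter
import math


def bow_cosine(sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between raw term-count vectors of two sentences."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)


# Same words, opposite meaning: the count vectors are identical,
# so the similarity is (up to rounding) the maximum value of 1.
print(bow_cosine("the dog bit the man", "the man bit the dog"))
```

Because both sentences contain exactly the same words, the model cannot tell who bit whom, which is the kind of structural information that sequence-aware models are meant to capture.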
In this paper, we describe our NN-based technique for measuring similarity. First, we use a Siamese CNN to analyze the local context of words in a sentence and to generate a representation of the relevance of a word and its neighborhood. Then, we use a Siamese LSTM to analyze the entire sentence based on its words and their local contexts. Finally, we predict the semantic similarity of pairs of sentences using the Manhattan distance.
We applied our framework to the SemEval data for the STS task and obtained competitive results, which suggests that our model provides useful information for improving sentence analysis.
2 Related Work
To deal with the STS task, previous studies have resorted to various features (e.g. word overlap, synonym/antonym), linguistic resources (e.g. WordNet and pre-trained word embeddings) and a wide assortment of learning algorithms (e.g. Support Vector Regression (SVR), regression functions and NNs). Among these works, several techniques extract multiple features of sentences and apply regression functions to estimate the similarity scores lai:2014; zhao:2014; bjerva:2014; Severyn. lai:2014 analyzed distinctive word relations (e.g. synonyms, antonyms, and hyperonyms) with features based on counts of co-occurrences with other words and similarities between captions of images. zhao:2014 predicted the sentence similarity from syntactic relationships, distinctive content similarities, length and string features. bjerva:2014 also used a regression algorithm to predict the STS from different features (WordNet, word overlap, and so forth). Finally, Severyn combined relational syntactic structures with SVR.
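As a rough illustration of the feature-plus-regression family described above, the sketch below computes a single word-overlap feature (Jaccard overlap) and fits a one-variable least-squares line to map it to a score. The function names and the choice of Jaccard overlap are assumptions; the cited systems use many more features and other regressors such as SVR.

```python
def word_overlap(sentence_a: str, sentence_b: str) -> float:
    """Jaccard overlap between the word sets of two sentences."""
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


# Hypothetical training data: overlap feature values and gold scores.
slope, intercept = fit_line([0.0, 0.5, 1.0], [1.0, 3.0, 5.0])
feature = word_overlap("a woman is cooking fish", "a man is cooking fish")
print(slope * feature + intercept)
```

A single surface feature like this cannot distinguish a paraphrase from a negation, which motivates the richer feature sets and the neural approaches discussed next.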
The development of NNs has improved the results of many NLP applications and especially the STS task Similarity_Convolutional; Siamese_LSTM; Tsubaki; Rychalska. Architectures such as RNNs and CNNs further improve the semantic analysis and the prediction of sentence relatedness.
RNNs differ from other NN models in their ability to process sequential information. They update a memory cell to make sense of the data read in a sentence over time. Rychalska used a Recursive AutoEncoder (RAE) and a WordNet grant framework to produce sentence embeddings. They combined these embeddings with a Support Vector Machine (SVM) classifier to compute a semantic relatedness score. Long Short Term Memory (LSTM) enhances RNNs to handle long-term dependencies Siamese_LSTM; greff:2015; Tai. The LSTM architecture is composed of a memory cell and non-linear gating units that update its state over time and regulate the flow of information into and out of the cell. Siamese_LSTM used a Siamese LSTM to encode sentences using pre-trained word embedding vectors. Siamese LSTMs use the same weights to encode both sentences and thus produce comparable sentence representations for similar sentences. They then predicted the similarity of pairs of sentences using the Manhattan distance between the sentence representations. Tai introduced the Tree-LSTM, a generalization of the LSTM to tree-structured network topologies. They used this Tree-LSTM to encode a pair of sentences and to predict their similarity with a NN that analyzes the distance and the angle between the sentence embeddings.
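The gating mechanism described above can be sketched as a single LSTM time step in plain Python. This is the common textbook formulation (input, forget and output gates plus a candidate memory); the parameter layout and the names `lstm_step` and `encode` are assumptions of this sketch, and the cited papers may use variants.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step for input vector x; returns (h, c).

    params maps each gate name ("i", "f", "o", "g") to (W, U, b), where
    W is hidden x input, U is hidden x hidden and b has length hidden.
    """
    def affine(name):
        W, U, b = params[name]
        return [
            sum(w * xi for w, xi in zip(W[j], x))
            + sum(u * hi for u, hi in zip(U[j], h_prev))
            + b[j]
            for j in range(len(b))
        ]

    i = [sigmoid(v) for v in affine("i")]    # input gate
    f = [sigmoid(v) for v in affine("f")]    # forget gate
    o = [sigmoid(v) for v in affine("o")]    # output gate
    g = [math.tanh(v) for v in affine("g")]  # candidate memory
    # The memory cell keeps part of its old state and adds gated new content.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def encode(word_vectors, params, hidden_size):
    """Run the LSTM over a sentence and return its last output."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in word_vectors:
        h, c = lstm_step(x, h, c, params)
    return h
```

In a Siamese setting, `encode` would be called on both sentences with the same `params`, so that similar sentences are mapped to nearby representations.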
CNNs have achieved excellent results in classification Kim:2014 and other NLP tasks Collobert:2011. Similarity_Convolutional generated sentence embeddings using a Siamese CNN architecture with various convolution and pooling operations to extract information at different granularities. Their convolution uses filters that analyze entire word embeddings and each dimension of word embeddings with multiple window sizes. They applied several pooling types (max, mean, and min) to the output of the convolution operation. Finally, they predicted the sentence similarity from several measurements (horizontal and vertical comparisons) that compare local regions of the sentence representations.
In this work, we combine the ideas examined in Siamese_LSTM and Kim:2014 to produce more accurate semantic sentence embeddings. The next section presents our model and its characteristics w.r.t. previous work.
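A minimal sketch of the convolution-then-pooling idea follows, assuming a single filter over whole word embeddings and a tanh activation. The function names are illustrative, and the architecture of Similarity_Convolutional is considerably richer (per-dimension filters, several window sizes, multiple comparison units).

```python
import math


def convolve(embeddings, filt, bias, window):
    """Slide one filter over every window of `window` consecutive word vectors."""
    outputs = []
    for start in range(len(embeddings) - window + 1):
        # Flatten the window into one long vector before applying the filter.
        flat = [v for vec in embeddings[start:start + window] for v in vec]
        outputs.append(math.tanh(sum(w * v for w, v in zip(filt, flat)) + bias))
    return outputs


def pool_features(feature_map):
    """Collapse one filter's outputs over all windows into three scalars."""
    return {
        "max": max(feature_map),
        "mean": sum(feature_map) / len(feature_map),
        "min": min(feature_map),
    }


# Hypothetical 1-dimensional embeddings for a three-word sentence.
feature_map = convolve([[1.0], [2.0], [3.0]], filt=[1.0, 1.0], bias=0.0, window=2)
print(pool_features(feature_map))
```

Each pooling type summarizes the filter responses differently, which is why combining them gives the comparison step several views of the same sentence.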
3 Our model
A sentence is composed of words which can form phrases and clauses. Examining a sentence and its components helps us to understand its meaning. NNs are structures that can inspect relationships between words from multiple points of view. On the one hand, LSTMs can recognize and process the semantics of a sentence by analyzing the words through time. They update their state to capture the gist of the sentence (global context) following the order of the words. In this procedure, LSTMs filter out unimportant data and retain only the main information. On the other hand, CNNs use layers with convolution filters that are applied to local features Kim:2014. They enable the analysis of a sentence from multiple perspectives (filters). This type of NN is less sensitive to sentence length than LSTMs, since CNNs examine all the words of the sentence together. Nonetheless, CNNs do not consider the order of words in their analysis, so these structures cannot capture sequential relationships in the sentence.
Unlike Siamese_LSTM, which analyzes only the general context of words, and Similarity_Convolutional, which does not consider the order of words in the sentences, we analyze the words from two perspectives: general and local contexts. Words are considered through time from the general information of a word (word embedding) and its specific semantic and syntactic features (local context) based on its previous and following words. We apply a CNN to analyze the local context of each word in a sentence. The CNN analyzes all the words of the local context together and generates their representation as a single structure. Then, we use an LSTM to examine the words of the sentence one by one (Figure 1). Our NN has a Siamese structure Siamese_LSTM; Similarity_Convolutional, i.e. our $\mathrm{CNN}_A$ and our $\mathrm{LSTM}_A$ are equal to our $\mathrm{CNN}_B$ and our $\mathrm{LSTM}_B$, respectively. The following subsection describes our CNN, our LSTM, and the similarity metric used to predict the sentence similarity.
3.1 Neural Network Architecture
Kim trained a simple CNN on top of pre-trained word vectors for the sentence classification task Kim:2014. His simple model, composed of one convolution layer, achieved excellent results on multiple benchmarks. Inspired by these results, we use a Siamese CNN to generate a local context for each word in a sentence from its previous and following words. We use pre-trained word embeddings (publicly available at code.google.com/p/word2vec) to represent these words. Let $x_i \in \mathbb{R}^k$ be the $k$-dimensional word vector corresponding to the $i$-th word in a sentence. A local context of length $2n+1$ (e.g. 3, 5, 7, or 9 words in our experiments) is represented as:
$$x_{i-n:i+n} = x_{i-n} \oplus \cdots \oplus x_i \oplus \cdots \oplus x_{i+n}$$
where $\oplus$ is the concatenation operator. Our convolution operation involves a filter $w \in \mathbb{R}^{(2n+1)k}$, which is applied to a window of $2n+1$ words to produce a local context. In more detail, our CNN generates the local context $lc_i$ of word $x_i$ by:
$$lc_i = f(w \cdot x_{i-n:i+n} + b)$$
where $b \in \mathbb{R}$ is a bias term and $f$ is the hyperbolic tangent function. This filter is applied to every window of words in a sentence to produce a local context for every word.
In order to analyze the general and the local contexts of the word $x_i$, we concatenate its pre-trained word embedding (general semantic and syntactic features that were learned on a large corpus) and its local context $lc_i$. Our LSTM updates its state and produces an output at each time step $t$ in a sentence using the equations described in Siamese_LSTM. The last output of our LSTM represents the meaning of the sentence.
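The local-context computation and the concatenation step described above can be sketched as follows. Several choices here are assumptions of the sketch rather than details confirmed by the text: the window is parameterized as `2n + 1` words centered on the current word, sentence boundaries are padded with zero vectors, and several filters are stacked so that each word receives a local-context vector.

```python
import math


def local_contexts(embeddings, filters, biases, n):
    """Return one local-context vector per word, using a window of 2n+1 words.

    Zero vectors pad the sentence boundaries (an assumption of this sketch).
    """
    k = len(embeddings[0])
    pad = [[0.0] * k for _ in range(n)]
    padded = pad + embeddings + pad
    contexts = []
    for i in range(len(embeddings)):
        # Concatenate the vectors of the window centered on word i.
        window = [v for vec in padded[i:i + 2 * n + 1] for v in vec]
        contexts.append([
            math.tanh(sum(w * v for w, v in zip(filt, window)) + b)
            for filt, b in zip(filters, biases)
        ])
    return contexts


def lstm_inputs(embeddings, contexts):
    """Concatenate each word embedding with its local context."""
    return [emb + ctx for emb, ctx in zip(embeddings, contexts)]


# Hypothetical 1-dimensional embeddings and a single filter of length 3.
embeddings = [[1.0], [2.0], [3.0]]
contexts = local_contexts(embeddings, filters=[[1.0, 1.0, 1.0]], biases=[0.0], n=1)
print(lstm_inputs(embeddings, contexts))
```

The resulting vectors, one per word, are what the LSTM would read step by step, so each step sees both the general embedding of the word and a summary of its neighborhood.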
We tested several similarity metrics (cosine, Euclidean and Manhattan distances) and obtained the best results with the Manhattan distance. Since these scores are not calibrated to the similarity range (1-5), we apply a post-processing step that uses local regression with a bandwidth parameter to project our predictions onto the correct scale, similarly to Li2003.
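The two steps of this paragraph can be sketched under explicit assumptions. The similarity below uses the form `exp(-L1 distance)` from the Siamese LSTM formulation of Siamese_LSTM; whether our system uses exactly this form is an assumption. The rescaling uses a Gaussian-kernel weighted average (a Nadaraya-Watson estimator) as a simple stand-in for local regression; the kernel, the bandwidth value and the calibration data are all hypothetical.

```python
import math


def manhattan_similarity(h_a, h_b):
    """exp(-L1 distance): 1.0 for identical vectors, approaching 0 as they diverge."""
    return math.exp(-sum(abs(a - b) for a, b in zip(h_a, h_b)))


def local_regression(x, train_x, train_y, bandwidth):
    """Kernel-weighted average of train_y around x (Nadaraya-Watson estimator)."""
    weights = [math.exp(-0.5 * ((x - tx) / bandwidth) ** 2) for tx in train_x]
    total = sum(weights)
    return sum(w * ty for w, ty in zip(weights, train_y)) / total


# Hypothetical calibration data: raw similarities and their gold scores.
raw_scores = [0.1, 0.4, 0.6, 0.9]
gold_scores = [1.0, 2.5, 3.5, 5.0]

raw = manhattan_similarity([0.2, 0.1], [0.3, 0.4])
print(local_regression(raw, raw_scores, gold_scores, bandwidth=0.2))
```

A smaller bandwidth makes the mapping follow the calibration points more closely, while a larger one smooths it; with a very small bandwidth and a query far from all calibration points, the weights can underflow, so a practical implementation would guard against a zero denominator.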
4 Experimental Setup
We use the SICK dataset to analyze and test the performance of our system. This dataset contains 9,927 sentence pairs sick, which we split into 4,927/2,000/3,000 for training/validation/test. Each sentence pair is annotated with a relatedness label in [1, 5] corresponding to the average relatedness judged by 10 different individuals. The gold relatedness scores are distributed as follows: 923 pairs within the [1,2) range, 1,373 pairs within the [2,3) range, 3,872 pairs within the [3,4) range, and 3,672 pairs within the [4,5] range.
We initialize our CNN and our LSTM weights with small random Gaussian entries. Our CNN has filters and our LSTM has 50-dimensional hidden representations $h_t$ and memory cells $c_t$. We use a forget bias of 2.5 to model long-range dependencies, the Adadelta method to optimize the parameters, and a learning rate of 0.01. We did not observe any improvement with deep LSTMs because of the small amount of data. Like Siamese_LSTM, we also augmented our training dataset and pre-trained our network on the dataset of the SemEval 2013 STS task.
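The training setup can be sketched as follows. The standard deviation of the Gaussian initialization, the decay rate `rho`, the constant `eps`, and the way the learning rate enters the update (as a multiplier on the Adadelta step) are assumptions of this sketch; the text only states a forget bias of 2.5, Adadelta, and a learning rate of 0.01.

```python
import math
import random


def init_weights(rows, cols, std=0.01, seed=0):
    """Small random Gaussian entries; std=0.01 is an assumed value."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


# Forget-gate bias for a 50-dimensional LSTM, as stated in the text.
forget_bias = [2.5] * 50


class Adadelta:
    """Adadelta update with running averages of squared gradients and updates."""

    def __init__(self, size, rho=0.95, eps=1e-6, lr=1.0):
        self.rho, self.eps, self.lr = rho, eps, lr
        self.avg_sq_grad = [0.0] * size
        self.avg_sq_delta = [0.0] * size

    def step(self, params, grads):
        for j, g in enumerate(grads):
            self.avg_sq_grad[j] = self.rho * self.avg_sq_grad[j] + (1 - self.rho) * g * g
            delta = -(math.sqrt(self.avg_sq_delta[j] + self.eps)
                      / math.sqrt(self.avg_sq_grad[j] + self.eps)) * g
            self.avg_sq_delta[j] = self.rho * self.avg_sq_delta[j] + (1 - self.rho) * delta * delta
            params[j] += self.lr * delta
        return params


# Toy usage: a few steps on f(x) = x^2, whose gradient is 2x.
optimizer = Adadelta(size=1, lr=0.01)
x = [1.0]
for _ in range(5):
    x = optimizer.step(x, [2 * x[0]])
print(x)
```

A large positive forget bias keeps the forget gate close to 1 at the start of training, so the memory cell retains information over many time steps by default.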
In order to understand the relevance of the local context for sentence similarity, we evaluated the original Siamese LSTM without local context and compared it with our method using various lengths for the local context: 3, 5, 7, and 9 (Table 1). The original Siamese LSTM analyzes a sentence considering only the general context of words. As expected, the analysis of both general and local contexts of words improved the sentence analysis, according to the Pearson's and Spearman's correlation coefficients and the Mean Squared Error (MSE) scores. Neither short nor long local contexts produced the best results, which suggests that a short local context (3 words) does not capture enough information about the neighborhood of words and that long local contexts (7 or 9 words) include irrelevant information.
|Method||Pearson's r||Spearman's ρ||MSE|
|Siamese LSTM Siamese_LSTM||0.8822||0.8345||0.2286|
|Siamese LSTM (publicly available version)*||0.8500||0.7860||0.3017|
|Siamese #local context: 3 + Siamese LSTM||0.8536||0.7909||0.2915|
|Siamese #local context: 5 + Siamese LSTM||0.8549||0.7933||0.2898|
|Siamese #local context: 7 + Siamese LSTM||0.8540||0.7922||0.2911|
|Siamese #local context: 9 + Siamese LSTM||0.8533||0.7890||0.2923|
|Non-Linear Similarity Tsubaki||0.8480||0.7968||0.2904|
|Constituency Tree LSTM Tai||0.8582||0.7966||0.2734|
|Skip-thought+COCO (Kiros et al. 2015)||0.8655||0.7995||0.2561|
|Dependency Tree LSTM Tai||0.8676||0.8083||0.2532|
* We used the public version of Siamese LSTM Siamese_LSTM available at https://github.com/aditya1503/Siamese-LSTM; however, we did not obtain the same results as those reported in their paper.
The bottom part of Table 1 compares the results of our system and the best state-of-the-art systems. Although our method did not generate the best results, our system is among the top systems and the results were improved with respect to the publicly available version of the original Siamese LSTM.
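For reference, the three evaluation measures used in Table 1 can be computed as in the sketch below, written with the standard library only. The helper names are illustrative, and the toy scores at the end are hypothetical values, not results from our experiments.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = avg
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman's correlation: Pearson's correlation computed on ranks."""
    return pearson(ranks(xs), ranks(ys))


def mse(gold, pred):
    """Mean squared error between gold and predicted scores."""
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)


# Hypothetical gold and predicted scores.
gold = [1.0, 2.5, 3.0, 4.5, 5.0]
pred = [1.4, 2.2, 3.3, 4.0, 4.6]
print(pearson(gold, pred), spearman(gold, pred), mse(gold, pred))
```

Higher correlations and a lower MSE indicate better agreement with the gold scores, which is how the rows of Table 1 should be compared.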
In order to illustrate how our local context acts on sentence analysis, Table 2 shows the word-level similarity of a pair of paraphrases: “Her life spanned years of incredible change for women.” and “Mary lived through an era of liberating reform for women.” For each pair of words taken from the two sentences, the similarity, measured as a cosine distance, is computed either from general word embeddings (Table 2a) or from local contexts of length 5 (Table 2b). The first thing to notice is that the two tables have different ranges of values because each represents a different dimensional space; values must therefore be compared only within each table. Table 2a shows that word embeddings preserve general semantic and syntactic relationships between words. In this case, words are closest to the words that have similar semantics (1-"Her", 2-"Mary" and 2-"women"; 1-"life" and 2-"lived"; 1-"change" and 2-"reform") and/or similar syntactic roles (1-"of" and 2-"for"). Table 2b highlights that the local context of a word takes its semantic and syntactic features from the words in its window; e.g. the nearest contexts to 1-"life" are 2-"Mary", 2-"lived", 2-"through" and 2-"women", since these local contexts have directly (2-"lived") or indirectly (2-"Mary", 2-"through" and 2-"women") similar semantics. The same holds for the syntactic features of local contexts, e.g. the nearest local contexts of 1-"for" are 2-"lived", 2-"of", 2-"for" and 2-"women". The relevance of the local context is even greater for phrasal verbs and multi-word expressions, whose meaning depends strongly on the previous and following words.
|Table 2a. Cosine distance between word embeddings.|
|Table 2b. Cosine distance between local contexts of length 5.|
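A word-by-word comparison like the one in Table 2 can be produced with the sketch below. It assumes the usual definition of cosine distance, one minus the cosine of the angle between two vectors; the exact formula used for Table 2 is not preserved in this text, and the two-dimensional vectors are hypothetical stand-ins for word embeddings or local contexts.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity (assumed definition of the cosine distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distance_table(vectors_a, vectors_b):
    """Rows: words of the first sentence; columns: words of the second."""
    return [[cosine_distance(u, v) for v in vectors_b] for u in vectors_a]


# Hypothetical 2-dimensional vectors for two short sentences.
sentence_1 = [[1.0, 0.0], [0.5, 0.5]]
sentence_2 = [[0.9, 0.1], [0.0, 1.0]]
for row in distance_table(sentence_1, sentence_2):
    print([round(d, 3) for d in row])
```

Under this definition, smaller values mean closer vectors, so the nearest word or context in the other sentence is the column with the minimum value in each row.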
Table 3 shows four examples of STS scores at different levels of similarity. The first pair of sentences is an example of active and passive voice with the same meaning (gold score of 4.9). The second pair is an example of affirmative and negative sentences (gold score of 3.3). The third pair is composed of sentences that do not share the same meaning (gold score of 1.0). Finally, in the last example, our method helps to capture the semantic relationship between the phrasal verb "wipe off" and the verb "clean". In all four examples, our approach produces scores closer to the gold scores than the Siamese LSTM. The local context helps to better identify not only similar sentences but also negation and sentences with different meanings. This local information provides the LSTM with a smoother analysis of words and of how they connect in a sentence.
|Pair of sentences||Gold score||Siamese LSTM||Our approach|
|Fish is being cooked by a woman. / A woman is cooking fish.||4.9||3.84||4.05|
|The bearded man is not sitting on a train. / The bearded man is sitting on a train.||3.3||3.49||3.35|
|Someone is playing with a toad. / The trumpet is being played by a man.||1.0||1.51||1.46|
|I will wash up if you wipe off the table. / I will wash up if you clean the table.||5.0||3.67||4.08|
To sum up, the local context of words refined the general context analysis. Our approach identified more details about the words and their local as well as general contexts, which usually leads to improved STS scores.
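The improvement on the four examples of Table 3 can be checked with a short calculation of the mean absolute error against the gold scores. The values are copied from Table 3; the helper name is illustrative, and four pairs are of course only an illustration, not a statistical evaluation.

```python
# Scores copied from Table 3, in row order.
gold = [4.9, 3.3, 1.0, 5.0]
siamese_lstm = [3.84, 3.49, 1.51, 3.67]
our_approach = [4.05, 3.35, 1.46, 4.08]


def mean_abs_error(gold_scores, predictions):
    """Average absolute gap between predictions and gold scores."""
    return sum(abs(g - p) for g, p in zip(gold_scores, predictions)) / len(gold_scores)


print(round(mean_abs_error(gold, siamese_lstm), 4))  # about 0.7725
print(round(mean_abs_error(gold, our_approach), 4))  # about 0.57
```

On these four pairs, the average gap to the gold score drops from roughly 0.77 to 0.57, and the prediction of our approach is closer to the gold score in every row.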
STS is an important task for various NLP applications, e.g. Automatic Text Summarization (ATS), Question-Answering, Information Retrieval, etc. Our system combines CNN and LSTM structures to analyze, identify and preserve the relevant information in each part of a sentence and in the sentence as a whole. The local context proved useful for obtaining complementary information about a word in a sentence and for improving the sentence analysis. In our experiments, the local context improved the prediction of sentence similarity by reducing the mean squared error and increasing the correlation scores.
We plan to test other methods to analyze the local context Ermakova; Zhu. Unfortunately, the dataset we used for the experiments is of modest size and we did not find larger annotated corpora for this task. Therefore, we also want to conduct extrinsic evaluations by measuring how STS affects ATS systems, depending on whether the original or the modified Siamese LSTM model is used.
This work was partially financed by the European Project CHISTERA-AMIS ANR-15-CHR2-0001.