Over the last few years, vector space representations of words and sentences, extracted from encoders trained on large text corpora, have become primary components in modeling natural language processing (NLP) tasks, especially with neural or deep learning methods. Training stable word embeddings requires words to appear with high frequency in the corpus (sahin2017consistent), so word embeddings generated from resource-constrained (small-corpus) languages have a limited vocabulary. Similarly, corpora collected from domains such as healthcare or computational social science (foulds2017mixed) are typically small, reducing a model's capacity to generalize while learning to perform a task. Thus, it is common to initialize neural NLP models with pretrained word embeddings learned using word2vec (mikolov2013distributed) or GloVe (pennington2014glove) and to fine-tune sentence encoders like BERT (devlin2018bert)
in a number of tasks from part-of-speech tagging, named entity recognition, and machine translation to measuring textual similarity.
Let us consider two types of tasks: vector space alignment, where the purpose is to learn a mapping between two independently trained embeddings (e.g., crosslingual word alignment), and classification (e.g., natural language inference (NLI)). Learning bilingual word embeddings alleviates low-resource problems by aligning embeddings from a source language rich in available text to a target language with a small corpus and limited vocabulary. Recent work largely focuses on learning a linear mapping that aligns two embedding spaces by minimizing the mean squared error (MSE) between embeddings of words projected from the source domain and their counterparts in the target domain (mikolov2013distributed; ruder2017survey). Minimizing MSE is useful when a large set of translated word pairs (between source and target languages) is provided, but the mapping overfits when the parallel corpus is small, or may require non-linear transformations (sogaard2018limitations).
. In order to reduce overfitting and improve word alignment, we propose an auxiliary loss function called locality preserving loss (LPL) that trains the model to align two sets of word embeddings while maintaining the local neighborhood structure around words in the source domain.
For classification tasks with two inputs (e.g., NLI), we show how alignment between the two input subspaces acts as a regularizer, improving the model's accuracy both with MSE alone and when MSE and LPL are combined.
Specifically, our main contributions are:
We propose a new loss function called locality preserving loss (LPL) to improve vector space alignment and show that it improves performance on crosslingual word alignment by up to 4.1% (13.8% relative) and on downstream tasks such as SNLI by up to 8.9% (19.3% relative) when trained with just 1000 samples.
We demonstrate how LPL reduces the size of the supervised set of labeled items required to train the model while maintaining equivalent performance.
We show how manifold alignment acts as a regularizer while performing natural language inference and that LPL when combined with MSE leads to higher overall accuracy.
2 Background & Related Work
Our work is inspired by generalized autoencoders and the locally linear embedding model.
2.1 Dimensionality Reduction & Manifold Alignment
Manifold learning methods represent high-dimensional datapoints in a lower-dimensional space by extracting important features from the data, making it easier to cluster and search for similar datapoints. The methods are broadly categorized into linear algorithms, such as principal component analysis (PCA), and non-linear algorithms. Non-linear methods include multi-dimensional scaling (MDS; cox2000multidimensional), locally linear embedding (LLE; roweis2000nonlinear), and Laplacian eigenmaps (LE; belkin2002laplacian). he2004locality compute the Euclidean distance between points to construct an adjacency graph and create a linear map that preserves the neighborhood structure of each point in the manifold. Another popular approach to learning manifolds is the autoencoder, where a self-reconstruction loss is used to train a neural network (rumelhart1985learning). vincent2008extracting design an autoencoder that is robust to noise by training it with a noisy input and reconstructing the original noise-free input.
In locally linear embedding (LLE), datapoints are assumed to have a linear relation with their neighbors. The neighbors of a datapoint can be computed in various ways, e.g., by Euclidean distance. The projection of each point is computed in a two-step process. First, a reconstruction loss is used to learn the linear relation between a point and its neighbors (roweis2000nonlinear),

$$\min_{W} \sum_i \Big\| x_i - \sum_{j} w_{ij} x_j \Big\|^2, \qquad (1)$$

where $x_i$ is the datapoint and the $x_j$'s represent its neighbors. An additional constraint $\sum_j w_{ij} = 1$ is imposed on the weights ($W$) to make the transform scale invariant. In (1) the weights form an $N \times N$ matrix in a dataset of $N$ points (i.e., each point has its own weights). Learning the transformation therefore requires learning the projections $y_i$:

$$\min_{Y} \sum_i \Big\| y_i - \sum_{j} w_{ij} y_j \Big\|^2, \qquad (2)$$
where $y_i$ is the projection of $x_i$ (typically with reduced dimensions). wang2014generalized extend the autoencoder model by modifying the reconstruction loss to use neighbors, similar to the non-linear methods described above:

$$L = \sum_i \sum_j s_{ij} \, \| x_j - \hat{x}_i \|^2, \qquad (3)$$

where $\| x_j - \hat{x}_i \|^2$ is the loss between the reconstruction $\hat{x}_i$ of point $x_i$ and an associated point $x_j$ (i.e., a neighbor), and $s_{ij}$ is a weight that represents the relationship between the points. For example, $s_{ij}$ can be $1$ when they are nearest neighbors and $0$ when they are not. Depending on the type of non-linear method (described above) retrofitted into the model, $s_{ij}$ can take various forms.
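The weight-fitting step in (1) has a standard closed-form solution via a local Gram matrix. A minimal NumPy sketch, where the function name and the regularization constant are our assumptions:

```python
import numpy as np

def lle_weights(x, neighbors, reg=1e-3):
    """Solve for weights w minimizing ||x - sum_j w_j * neighbors[j]||^2
    subject to sum_j w_j = 1 (scale invariance), via the regularized
    Gram-matrix system used in standard LLE."""
    Z = neighbors - x                       # shift neighbors so x is the origin
    C = Z @ Z.T                             # local Gram matrix, k x k
    C += reg * np.trace(C) * np.eye(len(neighbors))  # regularize for stability
    w = np.linalg.solve(C, np.ones(len(neighbors)))
    return w / w.sum()                      # enforce the sum-to-one constraint
```

If `x` lies exactly in the affine span of its neighbors, the recovered weights reconstruct it almost perfectly; the regularizer only matters when the local Gram matrix is (near-)singular.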
benaim2017one utilize a GAN to learn a unidirectional mapping. The total loss applied to train the generator is a combination of different losses, namely, an adversarial loss, a cyclic constraint (inspired by zhu2017unpaired), MSE, and an additional distance constraint where the distance between a point and its neighbors in the source domain is maintained in the target domain. Similarly, conneau2017word learn to translate words without any parallel data using a GAN, applying cross-domain similarity local scaling to resolve the hubness problem (dinu2014improving).
These methods are the foundation for learning a mapping between two lower-dimensional spaces (manifold alignment, fig. 1). wang2011manifold propose a manifold alignment method that preserves the local similarity between points in the manifold being transformed and the correspondence between points that are common to both manifolds. boucher2015aligning replace the manifold alignment algorithm that uses the nearest neighbor graph with a low rank alignment. cui2014generalized align two manifolds without any pairwise data (unsupervised) by assuming the structures of the lower-dimensional manifolds are similar.
2.2 Cross Embedding Word Alignment
One way to alleviate the problem of limited text is to align words between two languages that have similar meanings to initialize the embeddings for unknown words. mikolov2013distributed learn a linear mapping by optimizing the MSE between the source and target language. xing2015normalized improve the mapping by adding an orthogonal constraint to the weights. In BilBOWA (gouws2015bilbowa), the cross-lingual mappings are learned by training monolingual representations for the source and target language, with additional training on a cross-lingual objective over a sentence-aligned corpus. faruqui2014retrofitting use external information to adjust the existing word embeddings. artetxe2017learning reduce the need for a parallel word corpus by iteratively inducing a dictionary. faruqui2014improving learn to map the embeddings to a joint space with canonical correlation analysis (CCA). lu2015deep extend this with deep canonical correlation analysis. Our work is similar to bollegala2017think, where the meta-embedding (a common embedding space) for different vector representations is generated using a locally linear embedding (LLE) that preserves locality. One drawback, though, is that LLE does not learn a single mapping between the source and target vector spaces: a linear mapping between a word and its neighbors is learned for each new word, and the meta-embedding for every word in the vocabulary must be re-learned each time new words are added. nakashole2018norma propose NORMA, which uses neighborhood-sensitive maps where the neighbors are learned rather than extracted from the existing embedding space.
3 Locality Preserving Alignment (LPA)
3.1 Locality Preservation Criteria
The locality preserving loss (LPL, eq. 5) is based on an important assumption about the source manifold: for a pre-defined neighborhood of $k$ points ($k$ is chosen manually) in the source embedding space, we assume the points are "close" to a given point such that it can be reconstructed using a linear map of its neighbors. This assumption is similar to that made in locally linear embedding (roweis2000nonlinear). The same principle can be applied to the target space in order to learn a reverse mapping too.
As individual embeddings can represent words or sentences, we call each individual embedding a unit. Consider two manifolds $\mathcal{X}$ (source domain) and $\mathcal{Y}$ (target domain) that are vector space representations of units. We make no assumptions about the methods used to learn each manifold; they may be different. We also do not assume they share a prima facie common lexical vocabulary. For example, $\mathcal{X}$ can be created using a standard distributed representation method like word2vec (mikolov2013distributed) and consist of English word embeddings, while $\mathcal{Y}$ is created using GloVe (pennington2014glove) and contains Italian embeddings. Let $V_x$ and $V_y$ be the respective vocabularies (collections of units) of the two manifolds, i.e., $V_x$ and $V_y$ are sets of units of size $n_x$ and $n_y$. The distributed representations of the units in each manifold are $X = \{x_1, \ldots, x_{n_x}\}$ and $Y = \{y_1, \ldots, y_{n_y}\}$.
While we do not assume that $V_x$ and $V_y$ must have common items, we do assume that there is some set of unit pairs that are connected by some consistent relationship. Let $C \subseteq V_x \times V_y$ be the set of unit pairs; we consider it a supervised training set, perhaps derived from a parallel corpus. For example, in crosslingual word alignment this consistent relationship is whether one word can be translated as the other; in natural language inference, the relationship is whether one sentence entails the other (the second must logically follow from the first). We assume this common set is much smaller than the individual vocabularies ($|C| \ll n_x$ and $|C| \ll n_y$). The mapping (manifold alignment) function is $f : \mathcal{X} \rightarrow \mathcal{Y}$.
In this paper, we experiment with two types of tasks: crosslingual word alignment and natural language inference. In crosslingual word alignment, $V_x$ and $V_y$ represent the source and target vocabularies, $C$ is a bilingual dictionary, and $\mathcal{X}$ and $\mathcal{Y}$ are the source and target manifolds. The mapping $f$, with parameters $\theta$, is a linear projection with a single weight matrix. For NLI, $V_x$ and $V_y$ contain the premise and hypothesis sentences, with $\mathcal{X}$ and $\mathcal{Y}$ being their manifolds, and $f$ is a 3-layer MLP.
3.3 Locality Preserving Loss (LPL)
We use a mapping function $f$ to align the manifold $\mathcal{X}$ to $\mathcal{Y}$. The exact structure of $f$ is task-specific: for example, in our experiments $f$ is a linear function for crosslingual word alignment and a single-layer neural network (non-linear mapping) for NLI. The mapping is optimized using three loss functions: an orthogonality constraint (xing2015normalized), represented as $L_O$ (i.e., constraining the weight matrix to be orthogonal); mean squared error, $L_{MSE}$ (eq. 4); and the locality preserving loss (LPL), $L_{LPL}$ (eq. 5).
The standard loss function used to align two manifolds is mean squared error (MSE) (ruder2017survey; artetxe2016learning),

$$L_{MSE}(\theta) = \sum_{(x_i, y_i) \in C} \big\| y_i - f(x_i; \theta) \big\|^2, \qquad (4)$$
which minimizes the distance between the unit's representation $y_i$ in $\mathcal{Y}$ (the target manifold) and the vector $f(x_i; \theta)$ projected from $\mathcal{X}$. The function $f$ has learnable parameters $\theta$. MSE can lead to an optimal alignment when there is a large number of units in the parallel corpus to train the mapping between the two manifolds (ruder2017survey). However, when the parallel corpus is small, the mapping is prone to overfitting (glavas2019properly).
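Under these definitions, the MSE objective and its ordinary least-squares minimizer can be sketched as follows (a minimal sketch of the classic linear-map baseline with illustrative names, not the paper's exact implementation):

```python
import numpy as np

def mse_alignment(W, X_pairs, Y_pairs):
    """Eq. 4: squared error between projected source units X W^T and their
    paired target units, summed over the parallel corpus C."""
    return np.sum((X_pairs @ W.T - Y_pairs) ** 2)

def fit_linear_map(X_pairs, Y_pairs):
    """Ordinary least-squares fit of the linear map; with a small parallel
    corpus this is exactly the regime where the mapping overfits."""
    W_t, *_ = np.linalg.lstsq(X_pairs, Y_pairs, rcond=None)
    return W_t.T
```

When the pairs are generated by an exact linear map, the least-squares solution recovers it and drives eq. 4 to zero; with few, noisy pairs the same solver memorizes the dictionary instead.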
Locality preserving loss (LPL; eq. 5) optimizes the mapping to project a unit together with its neighbors. For a small neighborhood of $k$ units, the source representation of a unit $x_i$ is assumed to be a linear combination of its source neighbors. We represent this small neighborhood (of the source embedding of word $i$) with $N(x_i)$, and we compute the local linear reconstruction using $w_{ij}$, a learned weight associated with each word $x_j$ in the neighborhood of the current word $x_i$. LPL requires that the projected source embedding is a weighted average of the projected vectors of its neighbors $x_j \in N(x_i)$. Formally, over the common items $(x_i, y_i) \in C$, LPL minimizes

$$L_{LPL}(\theta) = \sum_{(x_i, y_i) \in C} \Big\| y_i - \sum_{x_j \in N(x_i)} w_{ij} \, f(x_j; \theta) \Big\|^2, \qquad (5)$$
with $x_j \in N(x_i)$. Intuitively, $w_{ij}$ represents the relation between a word and its neighbors in the source domain. We learn it by minimizing the LLE-inspired loss. For the common items this is

$$L_{LLE}(W) = \sum_{x_i} \Big\| x_i - \sum_{x_j \in N(x_i)} w_{ij} \, x_j \Big\|^2, \qquad (6)$$
with $x_j \in N(x_i)$. The weights are subject to the constraint $\sum_j w_{ij} = 1$, making the projected embeddings invariant to scaling (roweis2000nonlinear). We can formalize this with the objective $\min_W L_{LLE}(W)$ subject to $\sum_j w_{ij} = 1$. LPL reduces overfitting because the mapping function does not simply learn the mapping between unit embeddings in the parallel corpus: it also optimizes for a projection of the unit's neighbors that are not part of the parallel corpus, effectively expanding the size of the training set by a factor of $k$.
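A minimal NumPy sketch of the LPL computation (variable names are ours; the weights $w_{ij}$ are assumed to have been fit beforehand by the LLE-inspired loss):

```python
import numpy as np

def lpl_loss(W_map, Y_pairs, nbr_embeds, lle_w):
    """Eq. 5: for each supervised pair (x_i, y_i), reconstruct y_i from the
    *projected* source neighbors of x_i, mixed with the LLE weights w_ij
    learned in the source space. The neighbors themselves need not appear
    in the parallel corpus."""
    loss = 0.0
    for y_i, N_i, w_i in zip(Y_pairs, nbr_embeds, lle_w):
        recon = w_i @ (N_i @ W_map.T)   # project each neighbor, then mix
        loss += np.sum((y_i - recon) ** 2)
    return loss
```

The loss is zero exactly when the weighted combination of projected neighbors lands on the paired target embedding.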
[Figure 3: Accuracy of alignment regularization on SNLI. Left: accuracy, averaged across 3 runs, for differing training-set sizes (total: 500K). Right: standard deviation of accuracy for the baseline, baseline + MSE, and baseline + MSE + LPL models; LPL yields more consistently optimal systems.]
[Algorithm: model training with locality preserving alignment; we indicate which steps can easily be combined with backpropagation.]
3.3.1 Model Training with Locality Preserving Alignment
The total supervised loss becomes:

$$L(\theta, W) = L_{MSE}(\theta) + \lambda \, L_{LPL}(\theta, W) + L_O(\theta). \qquad (7)$$

We introduce a constant $\lambda$ to allow control over the contribution of LPL to the total loss.
Although we minimize the total loss (7), shown explicitly with its variable dependence, the optimization can be unstable because there are two sets of independent parameters, $\theta$ and $W$, representing different relationships between datapoints. To reduce the instability, we split training into two phases. In the first phase, $W$ is learned by minimizing $L_{LLE}$ alone while the mapping parameters $\theta$ are frozen. Once $W$ is learned, $L_{MSE}$ and $L_{LPL}$ are minimized while keeping $W$ fixed.
One key difference between our work and artetxe2016learning is that they optimize the mapping function by taking the singular value decomposition (SVD) of the squared loss, while we use gradient descent to find optimal values of $\theta$. As our experimental results show, while both can be empirically advantageous, our approach allows LPL to be easily added as just another term in the loss function: with the exception of the alternating optimization of $W$ and $\theta$, it does not require special optimization updates to be derived.
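The two-phase schedule can be sketched end-to-end on a toy problem. This is an illustrative NumPy sketch, not the paper's implementation; hyperparameters, names, and the plain gradient-descent update (gradients written up to a constant factor) are our assumptions:

```python
import numpy as np

def train_two_phase(X_src, Y_tgt, nbr_idx, lam=0.1, lr=0.2, steps=500, seed=0):
    """Phase 1: fit LLE weights w_ij over source neighborhoods (mapping frozen).
    Phase 2: gradient descent on MSE + lam * LPL with the weights held fixed."""
    n, d = X_src.shape
    # Phase 1: per-point LLE weights via a regularized Gram-matrix solve.
    lle_w = []
    for i in range(n):
        Z = X_src[nbr_idx[i]] - X_src[i]
        C = Z @ Z.T + 1e-3 * np.eye(len(nbr_idx[i]))
        w = np.linalg.solve(C, np.ones(len(nbr_idx[i])))
        lle_w.append(w / w.sum())
    # Phase 2: update the linear map W; the weights w_ij stay fixed.
    W = np.random.default_rng(seed).normal(scale=0.01, size=(d, d))
    for _ in range(steps):
        grad = np.zeros_like(W)
        for i in range(n):
            r_mse = W @ X_src[i] - Y_tgt[i]        # eq. 4 residual
            xbar = lle_w[i] @ X_src[nbr_idx[i]]    # neighbor reconstruction
            r_lpl = W @ xbar - Y_tgt[i]            # eq. 5 residual
            grad += np.outer(r_mse, X_src[i]) + lam * np.outer(r_lpl, xbar)
        W -= lr * grad / n
    return W
```

On synthetic data generated by an exact linear map, phase 2 drives the supervised MSE close to zero while the LPL term keeps the neighborhood reconstructions aligned.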
3.4 Alignment as Regularization
MSE and LPL can be used to align two vector spaces: in particular, we show that the objectives can align two subspaces in the same manifold. When combined with cross entropy loss in a classification task, this subspace alignment effectively acts as a regularizer.
Fig. 2 shows an example architecture where alignment is used as a regularizer for the NLI task. The architecture contains a two-layer MLP used to perform language inference, i.e., to predict whether the given sentence pairs are entailed, contradictory, or neutral. The input to the network is a pair of sentence vectors. The initial representations are generated from any sentence/language encoder, e.g., BERT. The source/sentence1/premise embeddings are first projected into the hypothesis space. The projected vector is then concatenated with the original pair of embeddings and given as input to the network. The alignment losses (MSE and LPL) are computed between the projected premise and original hypothesis embeddings. If the baseline network is optimized with cross entropy (CE) loss to predict label $l$, the total loss becomes:

$$L = L_{CE} + \gamma_l \, (L_{MSE} + L_{LPL}), \qquad (8)$$
where $\gamma_l$ is an empirical hyperparameter that controls the impact of the alignment loss. Thus, loss (8) is an extension of (7) for a classification task, but without $L_O$, which is not applied because $f$ is a 3-layer MLP (non-linear mapping) and the orthogonality constraint cannot be guaranteed for each layer's weights. The alignment loss becomes a vehicle to bias the model based upon our knowledge of the task, forcing a specific behavior on the network. The behavior can be controlled with $\gamma_l$, which can be a positive or negative value specific to each label. A positive $\gamma_l$ optimizes the network to align the embeddings, while a negative $\gamma_l$ is a divergence loss. In NLI we assign a constant scalar to all samples with a specific label (i.e., 100 for entailment, 1.0 for contradiction and -5.0 for neutral). The scalars were assigned while optimizing the network's hyperparameters. As the optimizer minimizes the loss, a divergence loss tends to $-\infty$; in practice, the negative loss is thresholded.
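A minimal sketch of this label-weighted loss for a single example, using the scalars quoted above; the softmax details and the exact flooring of the divergence loss are our assumptions:

```python
import numpy as np

# Per-label alignment weights from the text: entailment, contradiction, neutral.
GAMMA = {"entailment": 100.0, "contradiction": 1.0, "neutral": -5.0}

def nli_total_loss(logits, label_idx, label_name, premise_proj, hypothesis,
                   floor=-10.0):
    """Cross entropy plus the label-weighted alignment term of eq. 8; the
    negative (divergence) branch is floored so it cannot run to -infinity."""
    probs = np.exp(logits - logits.max())          # stable softmax
    probs /= probs.sum()
    ce = -np.log(probs[label_idx])
    align = np.mean((premise_proj - hypothesis) ** 2)
    return ce + max(GAMMA[label_name] * align, floor)
```

For an entailment pair with perfectly aligned embeddings only the CE term remains; for a neutral pair the alignment term is rewarded for diverging, down to the floor.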
4 Experiment Results & Analysis
We demonstrate the effectiveness of locality preserving alignment (LPA) on two types of tasks: natural language inference and crosslingual word alignment. To compute local neighborhoods, as needed for, e.g., eq. (5), we build a standard KD-tree with Euclidean distance.
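The neighborhood lookup can be sketched as follows; we show a brute-force NumPy stand-in for the KD-tree query (in practice `scipy.spatial.cKDTree` answers the same query faster):

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest Euclidean neighbors of each row of X,
    excluding the point itself (the KD-tree query this stands in for)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)    # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]
```

Brute force is O(n^2) in the vocabulary size, which is why a KD-tree (or approximate search) is preferable beyond toy scales.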
4.1 Natural Language Inference
To test the effectiveness of alignment as a regularizer, a 2-layer MLP is used as shown in Figure 2, and we measure the change in accuracy with respect to this baseline. An additional single-layer network is used to perform the alignment between the premise and hypothesis spaces. We examine the impact of the loss function on two datasets: the Stanford natural language inference dataset (SNLI) (bowman2015large) and the multi-genre natural language inference dataset (MNLI) (N18-1101). SNLI consists of 500K sentence pairs, while MNLI contains about 433K pairs. MNLI provides two test sets: the matched set contains sentences sampled from the same genres as the training samples, while the mismatched set tests the model's accuracy on out-of-genre text.
Figures 3(a), 4, and 5 show the accuracy of the models when optimized with a standard cross-entropy loss (baseline), with an additional MSE alignment loss, and finally with MSE and LPL combined. Accuracy is measured as the size of the training set is reduced; the reduced datasets are created by randomly sampling the required number of pairs from the entire dataset. The graphs show that an alignment loss consistently boosts the model's accuracy over the baseline, and that LPL combined with MSE provides higher gains than MSE alone. The difference in accuracy relative to the baseline is larger when the number of training samples is small and shrinks as the training set grows. This is because we compute the neighbors of each premise from the training dataset only, rather than from external text such as Wikipedia (i.e., generating embeddings for Wikipedia sentences and using them as neighbors). As the training size increases, LPL has diminishing returns because the neighbors tend to be part of the training pairs themselves. Figure 3(b) plots the standard deviation of accuracy across 3 runs (training data randomly sampled each time). We observe that models regularized with MSE and LPL are more likely to reach optimal parameters consistently.
4.2 Crosslingual Word Alignment
The crosslingual word alignment dataset is from dinu2014improving. It is extracted from the Europarl corpus (http://opus.lingfil.uu.se/) and consists of word pairs split into training (5K pairs) and test (1.5K pairs) sets. Of the 5K word pairs available for training, only 3K pairs are used to train the model with LPA, and an additional 150 pairs are used as the validation set (for Finnish, 2.5K pairs are used). This is a reduced set in comparison to the models in table 0(a), which are trained with all pairs.
Source ($\mathcal{X}$): nt4, 95/98/nt, nt/2000, nt/2000/xp, windows98
Target ($\mathcal{Y}$): winzozz, mac, nt, osx, msdos
Aligned: winzozz, nt4, ntfs, mac, 95/98/nt, nt, osx, msdos
Table 2 shows the neighbors for the word “windows” from the source embedding (English) and the target embedding (Italian). Compared to previous methods that look at explicit mapping of points between the two spaces, LPA tries to maintain the relations between words and their neighbors in the source domain while projecting them into the target domain. In this example, the word “nt/2000” is not a part of the supervised pairs available and will not have an explicit projection in the target domain to be optimized without a locality preserving loss.
Along with the mapping methods in table 0(a), previous methods also apply additional pre-/post-processing transforms to the word embeddings, as documented in artetxe2018generalizing (described in table 0(b)). Cross-domain similarity local scaling (CSLS) (conneau2017word) is used to retrieve the translated word. Table 0(a) shows the accuracy of our approach in comparison to other methods.
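CSLS replaces raw cosine retrieval by discounting "hub" targets that are close to everything; a compact sketch (function and variable names are ours):

```python
import numpy as np

def csls(X_mapped, Y, k=10):
    """CSLS score matrix (Conneau et al., 2017): 2*cos(x, y) minus the mean
    cosine of x to its k nearest targets and of y to its k nearest mapped
    sources, which deflates hub targets. Translation = argmax over each row."""
    Xn = X_mapped / np.linalg.norm(X_mapped, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Xn @ Yn.T
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # hubness of sources
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # hubness of targets
    return 2 * cos - r_src[:, None] - r_tgt[None, :]
```

When source and target embeddings coincide, CSLS retrieval returns each point's own copy, as expected.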
The accuracy of our proposed approach is better than or comparable to previous methods that use a similar number of transforms. It is similar to artetxe2018generalizing while requiring fewer preprocessing steps. This is because we optimize using gradient descent rather than a matrix factorization approach. Thus, our implementation of artetxe2016learning (MSE loss only) underperforms the original baseline while still giving improvements with LPA. We adopt gradient descent because the loss function can then be easily incorporated into any future neural architecture, whereas matrix factorization methods force future architectures into a two-step training process.
P: Family members standing outside a home.
H: A family is standing outside.
1P: People standing outside of a building.
1H: One person is sitting inside.
2P: Airline workers standing under a plane.
2H: People are standing under the plane.
3P: A group of four children dancing in a backyard.
3H: A group of children are outside.
4P: People standing outside of a building.
4H: One person is sitting inside.
5P: A family doing a picnic in the park.
5H: A family is eating outside.
6P: Airline workers standing under a plane.
6H: People are standing under the plane.
Table 3 shows the 2 nearest neighbors for a premise-hypothesis pair (P, H) from each classifier, i.e., baseline, MSE only, and MSE + LPL, after training (the dataset is small, at just 2000 samples). Since NLI is a reasoning task, the sentence pair representations would ideally cluster around a pattern that represents entailment, contradiction, or neutral. Instead, we observe that when samples are limited, the sentence pair representations of the baseline model have nearest neighbors that are merely syntactically similar (NNs 1 and 2). The predicted labels for these NN pairs are not clustered into entailment but are a combination of all 3 classes. This problem is reduced for models trained with MSE and MSE + LPL (NNs 3 and 4 for MSE; NNs 5 and 6 for MSE + LPL): the predicted labels of the NNs are clustered into entailment only. Clusters of sentence pair representations containing a single label suggest the models are better at extracting a pattern for entailment (improving the model's ability to reason). This semantic clustering of representations can be attributed to the initial alignment (or divergence) between the premise and hypothesis, with the additional locality preserving loss increasing the effective training size.
Apart from better accuracy when the training dataset is small, in figures 3, 4, and 5 we observe that the accuracies of models trained with an alignment loss using MSE only, and using MSE in combination with LPL, converge as the number of training samples increases. This happens because of the way the k-nearest neighbors (k-NN) are computed for each embedding in the source domain. We use BERT to generate the embedding of each sentence in the SNLI and MNLI datasets, but BERT itself is trained on millions of sentences from Wikipedia and the Books Corpus, and searching for k-NN embeddings over such a corpus for every sentence in the training sample is computationally difficult. To make the k-NN search tractable, neighbors are extracted from the dataset itself (500K sentences in SNLI and 300K sentences in MNLI). This limits the overall improvement in accuracy from LPL, as the reconstruction of a datapoint from its neighbors is imperfect. Initially, when the dataset is small, the neighbors are unique. As the dataset size increases, the unique neighbors are subsumed by the supervised dataset itself (hence MSE begins to perform better). Thus, the impact of LPL diminishes as the number of unique neighbors decreases and the entire dataset is used to train the model. This is unlikely to happen when NNs from a larger text corpus (unrelated to the task) are used to reconstruct the local manifold.
In this paper, we introduce a new loss function, locality preserving loss (LPL), that learns a linear relation between a given word and its neighbors and then uses it to learn a mapping for the neighborhood words that are not part of the word pairs (parallel corpus). We show that the results of the method are comparable to current supervised models while requiring a reduced set of word pairs for training. Additionally, the same alignment loss is applied as a regularizer in a classification task like NLI to demonstrate how it can improve the accuracy of the model over the baseline.