Translations as Additional Contexts for Sentence Classification

06/14/2018 ∙ by Reinald Kim Amplayo, et al. ∙ POSTECH (포항공과대학교), Yonsei University

In sentence classification tasks, additional contexts, such as the neighboring sentences, may improve the accuracy of the classifier. However, such contexts are domain-dependent and thus cannot be used for another classification task with an inappropriate domain. In contrast, we propose the use of translated sentences as context that is always available regardless of the domain. We find that naive feature expansion of translations gains only marginal improvements and may decrease the performance of the classifier, because inaccurate translations can produce noisy sentence vectors. To this end, we present multiple context fixing attachment (MCFA), a series of modules attached to multiple sentence vectors to fix the noise in the vectors using the other sentence vectors as context. We show that our method performs competitively compared to previous models, achieving the best classification performance on multiple data sets. We are the first to use translations as domain-free contexts for sentence classification.




1 Introduction

One of the primary tasks in natural language processing (NLP) is sentence classification, where given a sentence (e.g. a sentence of a review) as input, we are tasked to classify it into one of multiple classes (e.g. into positive or negative). This task is important as it is widely used in many subareas of NLP, such as sentiment classification for sentiment analysis [Pang and Lee2007] and question type classification for question answering [Li and Roth2002], to name a few. While past methods require feature engineering, recent methods use neural networks to automatically encode the sentences into low-dimensional dense vectors [Kim2014, Joulin et al.2017]. Despite the success of these methods, the major challenge in this task is that extracting features from a single sentence limits the performance.

To overcome this limitation, recent works attempted to augment the sentence with different kinds of features, such as the neighboring sentences [Lin et al.2015] and the topics of the sentences [Zhao et al.2017]. However, these methods used domain-dependent contexts that are only effective when the domain of the task is appropriate. For one thing, neighboring sentences may not be available in some tasks such as question type classification. Moreover, topic models may produce less useful topics when the data set is domain-specific, as in movie review sentiment classification [Mimno et al.2011].

Figure 1: PCA visualizations of unaltered sentence vectors on TREC data set, where each language is effective for a specific class, highlighted using a yellow circle.

In this paper, we propose the use of translations as compelling and effective domain-free contexts, or contexts that are always available no matter what the task domain is. We observe two opportunities when using translations.

First, each language has its own linguistic and cultural characteristics that may contain different signals to effectively classify a specific class. Figure 1 contrasts the sentence vectors of the original English sentences and their Arabic-translated sentences in the question type classification task. A yellow circle signifies a clear separation of a class. For example, the green class, or the numeric question type, is circled in the Arabic space as it is clearly separated from other classes, while such separation cannot be observed in English. Meanwhile, location type questions (in orange) are better classified in English.

Second, the original sentences may include language-specific ambiguity, which may be resolved when presented with their translations. Consider the example English sentence “The movie is terribly amazing” for the sentiment classification task. In this case, terribly can be used in both a positive and a negative sense, thus introducing ambiguity in the sentence. When translated to Korean, it becomes “영화는 대단히 훌륭합니다” which means “The movie is greatly magnificent”, removing the ambiguity.

The above two observations hold only when translations are supported for (nearly) arbitrary language pairs with sufficiently high quality. Thankfully, translation services (e.g. Google Translate) are now widely available. Moreover, recent research on neural machine translation (NMT) [Bahdanau et al.2014] improved the efficiency of models and even enabled zero-shot translation [Johnson et al.2016] for languages with no parallel data. This provides an opportunity to leverage as many languages as possible in any domain, providing a much wider context compared to the limited contexts provided by past studies.

Figure 2: PCA visualizations of unaltered sentence vectors (left) and the corresponding MCFA-altered vectors (right) on the MR data set. The distance shown is the Mahalanobis distance between the two class clusters.

However, despite the maturity of translation, naively concatenating the vectors of the translations to the original sentence vector may introduce more noise than signal. The unaltered translation space on the left of Figure 2 shows an example where translation noise makes the two classes indistinguishable.

In this paper, we propose a method to mitigate the possible problems when using translated sentences as context, based on the following observations. Suppose there are two translated sentences $a$ and $b$ with slight errors. We posit that $a$ can be used to fix $b$ when $a$ is used as a context of $b$, and vice versa. (Hereon, we use “fix” to mean “correct, repair, or alter.”) Revisiting the example above, to fix the vector of the English sentence “The movie is terribly amazing”, we use the Korean translation to move the vector towards the location where the vector of “The movie is greatly magnificent” is.

Based on these observations, we present a neural attention-based multiple context fixing attachment (MCFA). MCFA is a series of modules that uses all the sentence vectors (e.g. Arabic, English, Korean, etc.) as context to fix a sentence vector (e.g. Korean). Fixing the vectors is done by selectively moving the vectors to a location in the same vector space that better separates the classes, as shown in Figure 2. Noise from translation may adversely affect the vector itself (e.g. when a noisy vector is directly used for the task) and other vectors (e.g. when a noisy vector is used to fix another noisy vector). MCFA computes two sentence usability metrics to control the noise when fixing vectors: (a) self usability weighs the confidence of using sentence $a$ in solving the task; (b) relative usability weighs the confidence of using sentence $a$ in fixing sentence $b$.

Listed below are the three main strengths of the MCFA attachment. (1) MCFA is attached after encoding the sentence, which makes it widely adaptable to other models. (2) MCFA is extensible and improves the accuracy as the number of translated sentences increases. (3) MCFA moves the vectors inside the same space, thus preserving the meaning of the vector dimensions. Results show that a convolutional neural network (CNN) attached with MCFA significantly improves on the classification performance of the plain CNN, achieving state of the art performance on multiple data sets.

2 Preliminaries

(a) Self and relative usability modules
(b) Vector fixing module
Figure 3: Full architecture of the MCFA attachment. An arrow marked with a variable is a matrix multiplication of the vector and the variable. An arrow without a variable simply carries the previous element to the next element.

2.1 Problem: Translated Sentences as Context

In this paper, the ultimate task that we solve is the sentence classification task where, given a sentence and a list of classes, one is tasked with deciding which class (e.g. positive or negative sentiment) the sentence belongs to. However, the main challenge that we tackle is how to utilize translated sentences as additional context in order to improve the performance of the classifier. Specifically, the problem states: given the original sentence, the goal is to use its translations, i.e. sentences in other languages which are translated from the original sentence, as additional context.

Base Model: Convolutional Neural Network.

The base model used is the convolutional neural network (CNN) for sentences [Kim2014]. It is a simple variation of the original CNN for texts [Collobert et al.2011] to be used on sentences. Let $w_i$ be the $d$-dimensional word vector of the $i$-th word in a sentence of length $n$. A convolution operation involves applying a filter matrix $W$ to a window of $h$ words and producing a new feature vector $c_i$ using the equation $c_i = f(W \cdot w_{i:i+h-1} + b)$, where $b$ is a bias vector and $f(\cdot)$ is a non-linear function. By doing this on all possible windows of words we produce a feature map $c = [c_1, c_2, \dots, c_{n-h+1}]$. We then apply a max-over-time pooling operation [Collobert et al.2011] over the feature map and take the maximum value as the feature vector of the filter. We do this on all feature vectors and concatenate all the feature vectors to obtain the final feature vector of the sentence. We can then use this vector as input features to train a classifier such as logistic regression. We use CNN to create sentence vectors for the original sentence and all of its translations. From here on, we refer to them collectively as the sentence vectors.
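As a rough illustration of the convolution and max-over-time pooling described above, here is a minimal pure-Python sketch. It assumes scalar-valued filters in the style of [Kim2014] and a ReLU non-linearity; the function names, toy embeddings, and filter values are hypothetical and not taken from the paper.

```python
def relu(x):
    return max(0.0, x)

def conv_feature_map(words, filt, bias, h):
    """Slide one filter over every window of h consecutive word vectors.

    words: list of n word vectors (each a list of d floats)
    filt:  list of h*d floats (one filter, flattened window by window)
    Returns one activation per window, i.e. a feature map of length n-h+1.
    """
    feature_map = []
    for i in range(len(words) - h + 1):
        # Concatenate the h word vectors of the current window.
        window = [x for w in words[i:i + h] for x in w]
        feature_map.append(relu(sum(f * x for f, x in zip(filt, window)) + bias))
    return feature_map

def sentence_vector(words, filters, biases, h):
    """Max-over-time pooling: keep the largest activation of each filter."""
    return [max(conv_feature_map(words, f, b, h)) for f, b in zip(filters, biases)]

# Toy sentence (assumed values): 4 words, 2-dimensional embeddings, 2 filters, h = 2.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [[1.0, 0.0, 0.0, 1.0], [-1.0, -1.0, -1.0, -1.0]]
biases = [0.0, 0.5]
vec = sentence_vector(words, filters, biases, h=2)
```

Each filter contributes one dimension of the sentence vector, so the dimensionality of the final vector equals the total number of filters.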

Baseline 1: Naive Concatenation.

A simple way to use the translated sentences as additional context is to naively concatenate their vectors with the vector of the original sentence. That is, we create a wide vector from all of the sentence vectors and use this as the input feature vector of the sentence to the classifier. This method works fine if the sentences are translated properly. However, sentences translated using machine translation models often contain translation errors, which have adverse effects on the overall performance of the classifier. This becomes especially evident as the number of additional sentences increases.
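The naive concatenation baseline amounts to joining the vectors end to end. A minimal sketch, with hypothetical toy vectors standing in for real CNN outputs:

```python
def concat_vectors(vectors):
    """Baseline 1: join the original and translated sentence vectors end to end."""
    wide = []
    for v in vectors:
        wide.extend(v)
    return wide

# Hypothetical 2-dimensional vectors for an English sentence and two translations.
v_en = [0.2, 0.7]
v_ko = [0.9, 0.1]
v_ar = [0.4, 0.4]
wide = concat_vectors([v_en, v_ko, v_ar])
```

The width of the classifier input grows linearly with the number of translations, and every noisy dimension is passed through unchanged.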

Baseline 2: L2 Regularization.

In order to alleviate the problems above, we can use L2 regularization to automatically select useful features by weakening the appropriate weights. The main problem of this method occurs when almost all of the weights coming from the vectors of the translated sentences are weakened. This makes the additional context vectors useless and leads to a performance similar to that obtained when there is no additional context. Ultimately, this method does not make use of the full potential of the additional context.
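To make the effect of the L2 penalty concrete, the following sketch computes a logistic loss plus a squared-norm penalty for one example. It assumes a binary classifier and a penalty strength named `lam`; both are illustrative choices, not details given in the paper.

```python
import math

def l2_logistic_loss(weights, bias, features, label, lam):
    """Binary logistic loss for one example plus an L2 penalty on the weights.

    label is 0 or 1; lam (assumed name) controls how strongly weights are shrunk.
    """
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    nll = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    penalty = lam * sum(w * w for w in weights)
    return nll + penalty
```

Because the penalty grows with every non-zero weight, weights attached to noisy translation dimensions are pushed towards zero, which is the weakening behaviour the paragraph above describes.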

3 Model

To solve the problems of the baselines discussed above, we introduce an attention-based neural multiple context fixing attachment (MCFA), a series of modules attached to the sentence vectors. (The code we use in this paper is publicly shared.) The MCFA attachment is used to fix the sentence vectors, by slightly modifying the per-dimension values of each vector, before concatenating them into the final feature vector. The sentence vectors are altered using other sentence vectors as context (e.g. the vector of one sentence is altered using the vectors of the other sentences). This results in moving the vectors within the same vector space. The full architecture is shown in Figure 3.

3.1 Self Usability Module

To fix a source sentence vector (hereon, a source sentence vector is the vector currently being altered), we use the other sentence vectors as a guide to know which dimensions to fix and to what extent we need to fix them. However, other vectors might also contain errors, which may propagate to the fixing of the source sentence vector. In order to cope with this, we introduce self usability modules. A self usability module contains the self usability of the vector of a sentence $a$, which measures how confident sentence $a$ is for the task at hand. For example, an ambiguous sentence (e.g. “The movie is terribly amazing”) may receive a low self usability, while a clear and definite sentence (e.g. “The movie is very good”) may receive a high self usability.

Mathematically, we calculate the self usability of the vector of sentence $a$ from that vector and a matrix to be learned. The produced value is a single real number from 0 to 1. We pre-calculate the self usability of all sentence vectors. These are used in the next module, the relative usability module.
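One plausible reading of this computation is a learned linear score squashed by a sigmoid, which yields a single number between 0 and 1. The sketch below assumes exactly that; the sigmoid and the names `self_usability` and `weights` are assumptions, not the paper's stated equation.

```python
import math

def self_usability(vec, weights):
    """Assumed form: sigmoid of a learned linear score of one sentence vector."""
    score = sum(w * x for w, x in zip(weights, vec))
    return 1.0 / (1.0 + math.exp(-score))
```

Under this assumption a vector whose score is strongly positive gets a usability close to 1, and a vector with a strongly negative score gets a usability close to 0.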

3.2 Relative Usability Module

Relative usability measures how useful a sentence $a$ can be when fixing a sentence $b$, relative to other sentences. There are two main differences between the self usability of $a$ and the relative usability of $a$ with respect to $b$. First, self usability is calculated before $a$ knows about $b$, while relative usability is calculated when $a$ knows about $b$. Second, relative usability can be low even though self usability is not. This means that $a$ is not able to help in fixing the wrong information in $b$. Here, we extend the additive attention module [Bahdanau et al.2014] and use it as a method to calculate the relative usability of two sentences of different languages. To better visualize the original attention mechanism, we present the equations below.
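For reference, additive attention as introduced in [Bahdanau et al.2014] is commonly written as follows. The symbol names are assumptions made here ($s$ for the source vector, $c_t$ for the $t$-th context vector, $W$, $U$, and $v$ for learned parameters) and may differ from the notation used in the paper.

```latex
\begin{align}
e_t &= v^\top \tanh\left(W s + U c_t\right) \\
\alpha_t &= \frac{\exp(e_t)}{\sum_{j} \exp(e_j)}
\end{align}
```

The scores $e_t$ compare the source with each context vector, and the softmax turns them into attention weights $\alpha_t$ that sum to one.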


One major challenge in using the attention mechanism in our problem is that the sentence vectors do not belong to the same vector space. Moreover, one characteristic of our problem is that a sentence vector can be both a source and a context vector (i.e. the same vector can take either role in Equation 1). Because of these, we cannot directly use the additive attention module. We extend the module such that (1) each sentence vector has its own projection matrix, and (2) each projection matrix can be used as the projection matrix of both the source (e.g. when the sentence is the current source) and the context vectors. Finally, we incorporate the self usability function to reflect the self usability of a sentence. Ultimately, the relative usability is calculated using the equations below, where the multiplication of a vector and a scalar is done through broadcasting.
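The following is a hedged sketch of how such an extension could look, not the paper's exact formulation. It assumes one projection matrix per language (reused whether that language is the source or a context), a shared scoring vector `v`, a `tanh` over the summed projections, and softmax normalisation; the self usability scales each projected context vector by broadcasting. All names are hypothetical.

```python
import math

def matvec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relative_usability(src, vectors, projections, self_scores, v):
    """Attention weights over all sentence vectors when fixing vectors[src].

    projections[k] is the (assumed) per-language projection matrix, used for
    language k both as a source and as a context.
    """
    proj_src = matvec(projections[src], vectors[src])
    scores = []
    for k, vec in enumerate(vectors):
        # Scale the projected context by its scalar self usability (broadcasting).
        proj_ctx = [self_scores[k] * x for x in matvec(projections[k], vec)]
        hidden = [math.tanh(a + b) for a, b in zip(proj_src, proj_ctx)]
        scores.append(sum(w * h for w, h in zip(v, hidden)))
    return softmax(scores)

# Toy example (assumed values): two languages, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = relative_usability(
    src=0,
    vectors=[[1.0, 0.0], [0.0, 1.0]],
    projections=[identity, identity],
    self_scores=[1.0, 0.0],
    v=[1.0, 1.0],
)
```

In this toy setting the second sentence has a self usability of 0, so its projected vector is zeroed out and it receives the smaller attention weight.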


3.3 Vector Fixing Module

The vector fixing module applies the attention weights to the sentence vectors and creates an integrated context vector. We then use this vector alongside the source sentence vector to create a weighted gate vector. The weighted gate vector is used to determine to what extent a dimension of the source sentence vector should be altered.

The common way to apply the attention weights to the context vectors and create an integrated context vector is to directly take the weighted sum of all the context vectors. However, this is not possible because the context vectors are not in the same space. Thus, we use a projection matrix to linearly project each sentence vector into a common vector space. The integrated context vector is then calculated as the attention-weighted sum of the projected vectors.
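A minimal sketch of this step, assuming one projection matrix per sentence vector and attention weights that have already been computed; the function and argument names are hypothetical.

```python
def integrated_context(vectors, projections, weights):
    """Weighted sum of sentence vectors after projecting each into a common space."""
    dim = len(projections[0])  # dimensionality of the common space
    context = [0.0] * dim
    for vec, proj, w in zip(vectors, projections, weights):
        projected = [sum(a * b for a, b in zip(row, vec)) for row in proj]
        for d in range(dim):
            context[d] += w * projected[d]
    return context
```

With identity projections this reduces to the ordinary attention-weighted sum, which is the case the paragraph above rules out when the vectors live in different spaces.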

Finally, we construct a weighted gate vector and use it to fix the source sentence vectors using the equations below, which involve a trainable parameter and element-wise multiplication. The weighted gate vector is a vector of real numbers between 0 and 1 that modifies the intensity of the per-dimension values of the sentence vector. This causes the vector to move in the same vector space towards the correct direction.
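The gating step can be sketched as below. This assumes the gate is a sigmoid of a learned linear map applied to the concatenated context and source vectors; that specific form and the name `gate_matrix` are assumptions, since only the use of a trainable parameter and element-wise multiplication is stated.

```python
import math

def fix_vector(source, context, gate_matrix):
    """Return the gated (fixed) source vector and the gate itself.

    gate_matrix (assumed name) has one row per source dimension and one column
    per entry of the concatenated [context; source] vector.
    """
    joined = context + source
    gate = [
        1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, joined))))
        for row in gate_matrix
    ]
    # Element-wise multiplication rescales each dimension by a value in (0, 1).
    fixed = [g * s for g, s in zip(gate, source)]
    return fixed, gate
```

Since each gate entry lies between 0 and 1, the fixed vector stays in the space of the source vector and each dimension is only attenuated, never replaced.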


An alternative approach to vector correction is a residual-style correction, where instead of multiplying by a gate vector, a residual vector [He et al.2016] is added to the original vector. However, this approach makes the correction not interpretable; it is hard to explain what adding a value to a specific dimension means. One major advantage of MCFA is that the corrections in the vectors are interpretable; the weights in the gate vector correspond to the importance of the per-dimension features of the vector. The altered vectors are then concatenated and fed directly as an input vector to the logistic regression classifier for training.

4 Experiments

4.1 Experimental Setting

We test our model on four different data sets as listed below and summarized in Table 1. (a) MR [Pang and Lee2005]: Movie reviews data where the task is to classify whether the review sentence has positive or negative sentiment. (b) SUBJ [Pang and Lee2004]: Subjectivity data where the task is to classify whether the sentence is subjective or objective. (c) CR [Hu and Liu2004]: Customer reviews where the task is to classify whether the review sentence is positive or negative. (d) TREC [Li and Roth2002]: TREC question data set where the task is to classify the type of question.

Data set | Classes | Avg. words | Instances | Test
MR | 2 | 20 | 10662 | CV
SUBJ | 2 | 19 | 10000 | CV
CR | 2 | 23 | 3775 | CV
TREC | 6 | 10 | 5952 | 500
Table 1: Statistics of the four data sets used in this paper. Classes: number of target classes. Avg. words: average number of words. Instances: number of data instances. Test: size of the test data, if available. If not, we use 10-fold cross validation (marked as CV) with random split.
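For the data sets marked CV, a 10-fold split with random assignment can be sketched as follows. The seed and the round-robin fold assignment are assumptions made for illustration; the paper only states that the split is random.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle indices once, then yield (train, test) index lists for each fold."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# Example with the MR data set size from Table 1.
splits = list(k_fold_indices(10662, k=10))
```

Each instance lands in exactly one test fold, and the reported accuracy would be the average over the ten folds.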

All our data sets are in English. For the additional contexts, we use ten other languages, selected based on their diversity and their performance on prior experiments: Arabic, Finnish, French, Italian, Korean, Mongolian, Norwegian, Polish, Russian, and Ukrainian. We translate the data sets using Google Translate. Tokenization is done using the polyglot library. We experiment on using only one additional context (N=1) and using all ten languages at once (N=10). For N=1, we only show the accuracy of the best classifier for conciseness.

For our CNN, we use rectified linear units and three filters with different window sizes (3, 4, and 5) with 100 feature maps each, following [Kim2014]. For the final sentence vector, we concatenate the feature maps to get a 300-dimension vector. We use dropout [Srivastava et al.2014] on all non-linear connections with a dropout rate of 0.5. We also use an $l_2$ constraint of 3, following [Kim2014] for accurate comparisons. We use FastText pre-trained vectors [Bojanowski et al.2016] for all our data sets and their corresponding additional context. During training, we use a mini-batch size of 50. Training is done via stochastic gradient descent over shuffled mini-batches with the Adadelta update rule. We perform early stopping using a random portion of the training set as the development set.
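The norm constraint of 3 mentioned above is usually implemented, following [Kim2014], by rescaling a weight vector whenever its L2 norm exceeds the limit. A minimal sketch of that rescaling, with a hypothetical function name:

```python
import math

def clip_l2_norm(weights, max_norm=3.0):
    """Rescale a weight vector so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]
```

In training, such a rescaling would typically be applied after each gradient update; where exactly it is applied in the authors' implementation is not stated here.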

We present several competing models, listed below, to compare against the performance of our model. (A) Aside from the base model (CNN) [Kim2014], we use Dependency-based CNN (Dep-CNN) [Ma et al.2015], which parses the sentences first and does convolution on ancestor paths, and Dependency-sensitivity CNN (DSCNN) [Zhang et al.2016], which uses LSTM to capture dependency information within each sentence; (B) AdaSent [Zhao et al.2015] adopts a hierarchical structure, where consecutive levels are connected through gated recursive composition of adjacent segments, and feeds the hierarchy as a multi-scale representation through a gating network; (C) Topic-aware Convolutional Neural Network (TopCNN) [Zhao et al.2017] uses topics as additional contexts and changes the CNN architecture. TopCNN uses two types of topics: word-specific topic and sentence-specific topic; and (D) CNN+B1 and CNN+B2 are the two baselines presented in this paper.

We do not show results from RNN models because they were shown to be less effective in sentence classification in our prior experiments. For models with additional context, we further use an ensemble classification model that averages the class probability scores generated by the multiple variants (in our model’s case, the N=1 and N=10 models), following [Zhao et al.2017].
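Averaging class probabilities across model variants can be sketched in a few lines. The probability values below are made up for illustration.

```python
def ensemble_predict(prob_lists):
    """Average per-class probabilities over several models and pick the best class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return avg.index(max(avg)), avg

# Hypothetical outputs of two variants for one sentence (two classes).
label, avg = ensemble_predict([[0.6, 0.4], [0.3, 0.7]])
```

Here the two variants disagree, and the averaged scores favour the second class because the second variant is more confident.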

Model | MR | SUBJ | CR | TREC
CNN | 81.5 | 93.4 | 85.0 | 93.6
Dep-CNN | 81.9 | - | - | 95.4
DSCNN | 82.2 | 93.2 | - | 95.6
AdaSent | 83.1 | 95.5 | 86.3 | 92.4
C = Topic | word / sent / ens | word / sent / ens | word / sent / ens | word / sent / ens
TopCNN | 81.7 (+0.2) / 81.3 (-0.2) / 83.0 (+1.5) | 93.4 (+0.0) / 93.4 (+0.0) / 95.0 (+1.6) | 84.9 (-0.1) / 84.8 (-0.2) / 86.4 (+1.4) | 92.5 (-1.1) / 92.0 (-1.6) / 94.0 (+0.4)
C = Trans | N=1 / N=10 / ens | N=1 / N=10 / ens | N=1 / N=10 / ens | N=1 / N=10 / ens
CNN+B1 | 81.9 (+0.4) / 81.4 (-0.1) / 82.6 (+1.1) | 94.6 (+1.2) / 93.8 (+0.4) / 94.9 (+1.5) | 86.2 (+1.2) / 85.9 (+0.9) / 86.7 (+1.7) | 95.4 (+1.8) / 95.0 (+1.4) / 96.4 (+3.0)
CNN+B2 | 82.1 (+0.6) / 82.1 (+0.6) / 82.2 (+0.7) | 94.6 (+1.2) / 94.0 (+0.6) / 94.8 (+1.4) | 86.1 (+1.1) / 86.3 (+1.3) / 86.6 (+1.6) | 95.4 (+1.8) / 95.2 (+1.6) / 96.4 (+3.0)
CNN+MCFA | 82.3 (+0.8) / 82.7 (+1.2) / 83.2 (+1.7) | 94.7 (+1.3) / 94.8 (+1.4) / 95.2 (+1.8) | 87.6 (+2.6) / 88.6 (+3.6) / 89.4 (+4.4) | 95.4 (+1.8) / 96.0 (+2.4) / 96.8 (+3.4)
Table 2: Classification accuracies of competing models. C refers to the additional context, N refers to the number of translations. In TopCNN, word refers to using word-specific topic while sent refers to using sentence-specific topic; ens refers to the ensemble. Accuracies colored red are accuracies that perform worse than CNN. Previous state of the art results and the results of our best model are bold-faced. The winning result is underlined. The number inside the parenthesis indicates the increase from the base model, CNN.

4.2 Results and Discussion

We report the classification accuracy of the competing models in Table 2. We show that CNN+MCFA achieves state of the art performance on three of the four data sets and performs competitively on one data set. When N=1, MCFA increases the performance of a normal CNN from 85.0 to 87.6, beating the current state of the art on the CR data set. When N=10, MCFA additionally beats the state of the art on the TREC data set. Finally, our ensemble classifier additionally outperforms all competing models on the MR data set. We emphasize that we only use the basic CNN as our sentence encoder for our experiments, yet still achieve state of the art performance on most data sets. Hence, MCFA is successful in effectively using translations as additional context to improve the performance of the classifier.

We compare our model (CNN+MCFA) and the baselines discussed above (CNN+B1, CNN+B2). In all settings, our model performs at least as well as the baselines. When N=10, the performance of our model increases over the performance when N=1, whereas the performance of CNN+B1 decreases compared to its performance when N=1. We also show the accuracies of the worst classifiers when N=1 in Table 3. On all data sets except SUBJ, the accuracy of CNN+B1 decreases from the base CNN accuracy, while the accuracy of our model always improves over the base CNN accuracy. CNN+B2 mitigates this by applying L2 regularization; however, the increase in performance is marginal.

Model | MR | SUBJ | CR | TREC
CNN | 81.5 | 93.4 | 85.0 | 93.6
CNN+B1 | 81.4 | 94.2 | 83.8 | 93.0
CNN+B2 | 81.7 | 94.2 | 84.0 | 93.2
CNN+MCFA | 81.8 | 94.4 | 85.8 | 94.2
Table 3: Accuracies of the worst CNN+translation classifiers when N=1. Accuracies less than CNN accuracies are highlighted in red.

We also compare two different kinds of additional context: topics (TopCNN) and translations (CNN+B1, CNN+B2, CNN+MCFA). Overall, we conclude that translations are better additional contexts than topics. When using a single context (i.e. TopCNN with word-specific topics, TopCNN with sentence-specific topics, and our models when N=1), translations always outperform topics even when using the baseline methods. Using topics as additional context also decreases the performance of the CNN classifier on most data sets, giving an adverse effect to the CNN classifier.

5 Model Interpretation

(a) Example where English attention weight is larger
(b) Example where Korean attention weight is larger
Figure 4: Attention weights of example Korean sentences from the MR data set. The red color fill represents the attention weights given to each sentence. The darker the fill, the larger the attention weight.


Original sentence:
skip this turd and pick your nose instead because you’re sure to get more out of the latter experience .
Korean translation:
후자의 경험에서 더 많은 것을 얻으려면 이 웅덩이를 건너 뛰고 코를 골라야합니다 .
Human re-translation:
In order to get more from the latter experience , you need to skip this puddle and choose your nose .
Self Usability: 0.3958
(a) Low self usability example
Original sentence:
michael moore’s latest documentary about america’s thirst for violence is his best film yet . . .
Korean translation:
마이클 무어 ( Michael Moore ) 의 최근 미국 다큐멘터리 “ 폭력 장면 ” 은 그의 최고의 영화 다 . . .
Human re-translation:
Michael Moore’s latest American documentary “ Violent Scene ” is his best film yet . . .
Self Usability: 1.0000
(b) High self usability example
Table 4: Two examples of self usability of Korean sentences from the MR data set. Texts colored in red are mistranslated texts.

We first provide examples shown in Table 4 on how the self usability module determines the score of sentences. In the first example, it is hard to classify whether the translated sentence is positive or negative, thus it is given a low self usability score. In the second example, although the sentence contains mistranslations, these are minimal and may actually help the classifier by telling it that thirst for violence is not a negative phrase. Thus, it is given a high self usability score.

Figure 5: PCA visualization of unaltered (left) and altered (right) vectors of the MR data set. The distance shown is the Mahalanobis distance between the two class clusters.
Sentence (English): may take its sweet time to get wherever it’s going, but if you have the patience for it, you won’t feel like it’s wasted yours.
NN, unaltered space: you know that ten bucks you’d spend on a ticket? just send it to cranky. we don’t get paid enough to sit through crap like this.
NN, altered space: what might have been readily dismissed as the tiresome rant of an aging filmmaker still thumbing his nose at convention takes a surprising, subtle turn at the midway point.
Sentence (Korean): every nanosecond of the new guy reminds that you could be doing something else more pleasurable. like scrubbing the toilet. emptying rat traps. or doing last year’s taxes with your ex-wife.
NN, unaltered space: in the new release of cinema paradiso, the tale has turned from sweet to bittersweet, and when the tears come during that final, beautiful scene, they finally feel absolutely earned.
NN, altered space: after scenes of nonsense, you’ll be wistful for the testosterone-charged wizardry of jerry bruckheimer productions, especially because half past dead is like the rock on walmart budget.
Table 5: Two example sentences, from English (first) and Korean (second) vector spaces, and their nearest neighbors (NN) on both the unaltered and altered vector spaces. We only show the original English sentences for the Korean example for conciseness.

Figure 4 shows two example data instances together with the attention weights given to the other contexts when fixing a Korean sentence. The larger the attention weight is, the more the context is used to fix the Korean sentence. In the first example, the Korean sentence contains translation errors; in particular, the words bore and climactic setpiece were not translated and were only spelled using the Korean alphabet. In this example, the English attention weight is larger than the Korean attention weight. In the second example, the Korean sentence correctly translates all parts of the English sentence, except for the phrase as it does in trouble. However, this phrase is not necessary to classify the sentence correctly, and may induce possible vagueness because of the word trouble. Thus, the Korean attention weight is larger.

Figure 5 shows the PCA visualization of the unaltered and the altered vectors of four different languages. In the first example, the unaltered sentence vectors are mostly in the middle of the vector space, making it hard to draw a boundary between the two classes. After the fixing, the boundary is much clearer. We also show the English sentence vectors in the second example. Even without fixing the unaltered English sentence vectors, it is easy to distinguish both classes. After the fix, the sentence vectors in the middle of the space are moved, making the distinction clearer. We also provide quantitative evidence by showing that the Mahalanobis distance between the two classes is significantly larger for the altered vectors than for the unaltered vectors.
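How the distance between two class clusters is computed is not spelled out here. One common choice, assumed in the sketch below, is the Mahalanobis distance between the two cluster means under a pooled covariance, written out for 2-D points such as PCA projections. All function names are hypothetical.

```python
import math

def mean2(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def pooled_cov2(a, b):
    """Pooled 2x2 covariance of two clusters of 2-D points: (sxx, sxy, syy)."""
    sxx = sxy = syy = 0.0
    for pts in (a, b):
        m = mean2(pts)
        for x, y in pts:
            dx, dy = x - m[0], y - m[1]
            sxx += dx * dx
            sxy += dx * dy
            syy += dy * dy
    dof = len(a) + len(b) - 2
    return sxx / dof, sxy / dof, syy / dof

def mahalanobis_between(a, b):
    """Assumed definition: distance between cluster means under pooled covariance."""
    ma, mb = mean2(a), mean2(b)
    sxx, sxy, syy = pooled_cov2(a, b)
    det = sxx * syy - sxy * sxy  # must be non-zero (points not all collinear)
    dx, dy = ma[0] - mb[0], ma[1] - mb[1]
    # d^2 = diff^T * inverse(cov) * diff, with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)
```

Under this definition, moving two clusters further apart without changing their spread increases the distance, which matches the qualitative reading of the figure.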

We also show two example sentences from the English and Korean vector spaces and their corresponding nearest neighbors on both the unaltered and altered vector spaces in Table 5. In the first example, the unaltered vector focuses on the meaning of “wasted yours” in the sentence, which puts it near sentences regarding wasted time or money. After fixing, the sentence vector focuses its meaning on the slow yet worth-the-wait pace of the movie, thus moving it closer to the correct vectors. In the second example, all three sentences have highly descriptive tones. However, the nearest neighbor on the altered space is hyperbolically negative, comparing the movie to a description unrelated to the movie itself.
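A nearest-neighbor lookup of this kind can be sketched as below. Cosine similarity is an assumption; the similarity measure used for Table 5 is not stated.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor(query, candidates):
    """Return the index of the candidate vector most similar to the query."""
    return max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))
```

Running the same lookup once on the unaltered vectors and once on the altered vectors would give the two neighbors compared in the table.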

6 Related Work

One way to improve the performance of a sentence classifier is to introduce new context. Common and obvious kinds of context are the neighboring sentences of the sentence [Lin et al.2015], and the document where the sentence belongs [Huang et al.2012]. Topics of the words in the sentence induced by a topic model were also used as contexts [Zhao et al.2017]. In this paper, we introduce yet another type of additional context, sentence translations, which to the best of our knowledge have not been used previously.

Sentence encoders trained from neural machine translation (NMT) systems were also used for transfer learning [Hill et al.2016]. [Hill et al.2017] demonstrated that fixed-length sentence vectors from NMT encoders outperform sentence vectors from monolingual encoders on semantic similarity tasks. Recent work used the representation of each word in the sentence to create a sentence representation suitable for multiple NLP tasks [McCann et al.2017]. Our work shares the commonality of using NMT for another task, but instead of using NMT to encode our sentences, we use it to translate the sentences into new contexts.

Increasing the number of data instances of the training set has also been explored to improve the performance of a classifier. Recent methods include the use of a thesaurus [Zhang et al.2015] and paraphrases [Fu et al.2014], among others. These simple variation techniques are preferred because they are found to be very effective despite their simplicity. Our work similarly augments training data, not by adding data instances (vertical augmentation), but rather by adding more context (horizontal augmentation). Though a paraphrase of the original sentence can alternatively be used as an augmented context, this could not leverage the added semantics coming from another language, as discussed in Section 1.

7 Conclusion

This paper investigates the use of translations as better additional contexts for sentence classification. To address the problem of mistranslations, we propose multiple context fixing attachment (MCFA) to fix the context vectors using other context vectors. We show that our method improves the classification performance and achieves state-of-the-art performance on multiple data sets. In our future work, we plan to use and extend our model to other complex NLP tasks.


This work was supported by Microsoft Research Asia and the ICT R&D program of MSIT/IITP [2017-0-01778, Development of Explainable Human-level Deep Machine Learning Inference Framework].