
DeepSubQE: Quality estimation for subtitle translations

by   Prabhakar Gupta, et al.

Quality estimation (QE) for tasks involving language data is hard owing to numerous aspects of natural language like variations in paraphrasing, style, grammar, etc. There can be multiple answers with varying levels of acceptability depending on the application at hand. In this work, we look at estimating the quality of translations for video subtitles. We show how existing QE methods are inadequate and propose our method, DeepSubQE, as a system to estimate the quality of translation given subtitle data for a pair of languages. We rely on various data augmentation strategies for automated labelling and synthesis for training. We create a hybrid network which learns semantic and syntactic features of bilingual data and compare it with only-LSTM and only-CNN networks. Our proposed network outperforms them by a significant margin.



1 Introduction

The digital entertainment industry is growing rapidly, driven by easy internet access and numerous on-demand streaming platforms such as Amazon Prime Video, Netflix and Hulu. These providers increase their viewership by offering content in local languages. Translating subtitles across languages is the preferred, cost-effective industry practice for maximizing content reach. Subtitles are translated by bilingual (and sometimes multilingual) translators, who watch the content and use a source subtitle in one language to translate it into another. This approach has two problems: translation quality can be low, and manpower cost is high and grows significantly as target-language resources become scarce. Low translation quality can increase usage drop-off and hurt viewership among the target-language audience. Hence, translation quality estimation (QE) is a crucial step in the process. Currently, a second translator evaluates the quality, which makes evaluation as expensive as generating the translation itself.

Automated QE has been studied either as binary classification between acceptable and unacceptable translations or as scoring (rating), which assigns a translation-acceptability score within a given range. However, binary classification ignores “loosely” translated samples that often arise from human judgment, such as paraphrasing, under-translation or over-translation. Translators often rephrase sentences using contextual information from the video that is not available in the source sentence. For the scoring approach, gathering a large enough sample of reliable human-validated data to train a supervised system is very expensive and time-consuming, and does not scale to new languages.

In this work, we propose an automated QE system, DeepSubQE, that reduces both cost and time in subtitle translation while assuring quality. To overcome the problem with conventional binary approaches, we introduce a third category of translation called Loose translations. Our system takes a pair of sentences as input, one in the source language and one in the target language, and classifies it into one of three categories: Good translation, Loose translation or Bad translation. This paper makes the following contributions:

  • We develop a novel system that can estimate quality of translations generated by either humans or MT systems with application to video subtitling.

  • We demonstrate achieving good generalization for subtitling QE by augmenting data with various strategies including signals from learners that themselves fail to generalize as well for the task.

  • We present a formulation that can handle paraphrasing and other contextually acceptable non-literal translations through appropriate synthesis of Loose Translations.

The paper is organized as follows. Section 2 describes the existing methods used for QE. Section 3 discusses our approach to the problem, followed by Section 4, which explains the details of the dataset that we generated for training. Section 5 describes the model architecture and training environment. Section 6 presents the experiments, results and observations. Section 7 concludes our findings and presents future endeavours.

2 Related work

QE is important for evaluating machine translation (MT) systems, and automatic evaluation of MT systems is a well-studied topic. Metrics like BLEU Papineni et al. (2001) and METEOR Banerjee and Lavie (2005) are industry-wide accepted metrics for evaluating a translation when a reference text is available. Specifically, for each candidate translation (which is machine generated for MT systems), a reference produced by a human translator is necessary for computing the metric. However, we are interested in the setting where no reference is available.

Alternatively, HTER Snover et al. (2006) is a metric used to estimate the post-edits required to improve an MT generated candidate translation to the level of human translation. Numerous models have been proposed to predict HTER for a given MT output Blatz et al. (2004); Specia et al. (2009). State of the art for predicting HTER is given by a two-level neural Predictor-Estimator (PE) model Kim et al. (2017). PE finds the closest matching target token for each token in source and uses such matches to find the aggregate quality of translation for sentence pairs.

Figure 1: HTER score distribution of PE Model on our dataset computed using OpenKiwi toolkit Kepler et al. (2019), see in color.

However, HTER predictors are trained on data generated from MT output and hence are biased toward the error patterns of such systems. Errors in subtitle files have a very different distribution, and their patterns are not aligned with those that HTER captures. For example, subtitles sometimes contain complete mistranslations due to alignment errors, which lie outside the space of MT outputs that HTER models are trained on. Figure 1 shows the lack of separation among HTER predictions from the PE model on our three-class data of Good, Loose and Bad. Hence, HTER is unsuitable for our task of evaluating subtitle quality.

Our requirement is a method that evaluates the quality of subtitle data that could be either human or MT generated. The set of translations spans complete mistranslations (due to issues like alignment errors), loose translations (from additional contextual information and paraphrasing) and good translations (literal translations with complete overlap of meaning). Apart from mistranslations, errors could also arise from drift in translations, captioning of non-spoken content, etc. (see Gupta et al. (2019a) for a survey). None of the existing methods directly applies to our problem, so we tailor one to our use case. We define a three-way classification to account for the three classes of translation output. Using signals from multiple diverse methods, we gather data that represents our notion of the classes. A neural network is then trained on this data to classify a given pair of subtitle blocks into one of the three classes.

3 Approach

We define the QE problem given a subtitle file in the source language and its translation in the target language. Quality is measured on individual pairs of translation text blocks that are matched by timestamps. Each text block could contain more than one sentence spoken by multiple speakers, as well as captions for non-spoken content (such as whispers, laughs, loudly, etc.). The problem is to assign the translation to one of three categories: Good, Loose or Bad. A translation is Good if it is a perfect or near-perfect translation that retains all meaning from the source and reads fluently. It is Loose if it is paraphrased or contains some contextual information not available in the source text. Translations of colloquial phrases and idioms also lie in this category. Bad translations are those in which the sentence pair has no overlap of meaning and the target is disconnected from the context in the video.

Language    Statistical Classification  NMT    Added Captions  Scrambled Text  Drifted Aligned  Randomly Aligned
French      18.83                       33.76  6.58            6.58            17.13            17.13
German      17.26                       32.14  7.07            7.07            18.23            18.23
Italian     16.44                       32.95  6.57            6.57            18.74            18.74
Portuguese  16.47                       33.09  6.59            6.59            18.63            18.63
Spanish     17.82                       29.15  7.01            7.01            19.50            19.50
Table 1: Data distribution from different sources (in %)

Gathering sufficient human labelled data to train a supervised system is both expensive and time-consuming. Lack of suitable publicly available subtitle data for this task motivated us to reuse large volumes of unlabelled subtitles. We use signals like timestamp alignment and overlap statistics between source and target along with MT output for short sentences with common phrases for synthesizing samples from the three classes to learn the QE classifier. Our augmentation methods also use statistical classifiers of lower capacity trained on external data. The diversity and quality of data generated is critical for the QE classifier to learn good classification boundaries and generalize sufficiently well to unseen data. We show from experiments on subtitles and other parallel corpora that by fitting a QE classifier of sufficiently high capacity to data generated as described here we get good generalization for our task. Further details of our data augmentation methods are described in Section 4.

We experiment with multiple neural network architectures for the classifier, beginning with simpler ones that use only CNNs or only LSTMs. A hybrid architecture of a Bidirectional LSTM (BiLSTM) Graves and Schmidhuber (2005) followed by a CNN outperformed them. The input to each model is a pair of sentences in two languages, and the output is the probability of the pair belonging to each of the three classes. We limit the length of each sentence to 25 tokens, lowercase each word and include punctuation and numbers. Section 5 describes the model in detail.
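The input preparation described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the regex tokenizer and the choice to truncate (rather than discard) sentences longer than 25 tokens are assumptions, and `preprocess` is a hypothetical helper name.

```python
import re
from typing import List

MAX_TOKENS = 25  # sentence length limit stated in the text

# Assumed tokenizer: runs of word characters (letters, digits) or single
# punctuation marks, so punctuation and numbers are both kept.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(sentence: str, max_tokens: int = MAX_TOKENS) -> List[str]:
    """Lowercase, tokenize and truncate one subtitle sentence."""
    tokens = _TOKEN_RE.findall(sentence.lower())
    return tokens[:max_tokens]


if __name__ == "__main__":
    print(preprocess("Meine Güte. It's 42!"))
```

The same function would be applied independently to the source and the target sentence of each pair before embedding lookup.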

Figure 2: Visualization of DeepSubQE model’s architecture.

4 Data augmentation

We begin with 30k video subtitle files in English with timestamp-aligned translations into five languages (French, German, Italian, Portuguese, Spanish). These translation text blocks are unlabeled, and there is no information on which of the three QE classes they belong to. We use various features measuring statistics of word and semantic overlap between the source and the target as signals to label them. These features are used with two different methods to assign two distinct scores to each sample, indicating the likelihood of it being a Good or a Bad sample. Samples where both scores are in strong agreement are given the corresponding class label of Good or Bad. Samples where both scores are aligned but only one of them is a strong indicator with high magnitude are marked as Loose. The remaining samples, with no agreement among the scores, are discarded.
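The agreement rule described above can be sketched as a small function. The sketch assumes both scores lie between 0 and 1 with higher values meaning "more likely Good", and it assumes that Loose requires both scores to lean positive with exactly one of them strong. The three threshold constants and the name `label_pair` are hypothetical and are not the values used in the paper.

```python
from typing import Optional

# Hypothetical thresholds chosen only for illustration.
STRONG_BAD = 0.2    # at or below: strong indicator of Bad
NEUTRAL = 0.5       # at or above: score leans towards a positive label
STRONG_GOOD = 0.8   # at or above: strong indicator of Good


def label_pair(score_a: float, score_b: float) -> Optional[str]:
    """Combine two scorer outputs into Good / Loose / Bad, or None to discard."""
    scores = (score_a, score_b)
    if all(s >= STRONG_GOOD for s in scores):
        return "Good"    # both scorers strongly agree on Good
    if all(s <= STRONG_BAD for s in scores):
        return "Bad"     # both scorers strongly agree on Bad
    if all(s >= NEUTRAL for s in scores) and any(s >= STRONG_GOOD for s in scores):
        return "Loose"   # aligned, but only one scorer is a strong indicator
    return None          # no agreement: the sample is discarded
```

Returning `None` for disagreeing scores mirrors the filtering step: such samples never reach the training set.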

Bag-of-words model (BOW). The first is a two-parameter model that uses aligned pretrained embeddings Conneau et al. (2017) to score the sentence pair. The embeddings are used to create a cosine similarity matrix between the words of the source and target sentences, with one row per source word and one column per target word. A score is computed for the source (target) by thresholding the matrix element-wise, taking a max over the columns (rows) and averaging over the rows (columns). The intuition is to aggregate similarity scores from the most relevant words of the target for each word in the source and vice versa. The model assigns a score to each sentence pair. The parameters are chosen by tuning on validation data of positives from NMT output (implemented using Hieber et al. (2017)) and negatives from misaligned subtitles, with no learning. All samples labeled through this method are from subtitles, which helps the final QE model learn common patterns in video subtitles.
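The two directional BOW scores can be sketched in plain Python as below. Several details are assumptions made for illustration: thresholding is taken to mean setting similarities below the threshold to zero, the function names are hypothetical, and the caller is expected to supply one aligned embedding vector per word. How the paper combines the two directional scores into one pair score is not shown here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bow_scores(src_vecs: List[Vector], tgt_vecs: List[Vector],
               threshold: float) -> Tuple[float, float]:
    """Return (source score, target score) for one non-empty sentence pair."""
    # Rows index source words, columns index target words.
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # Assumed thresholding: similarities below the threshold become zero.
    sim = [[x if x >= threshold else 0.0 for x in row] for row in sim]
    # Source score: best match per source word, averaged over source words.
    src_score = sum(max(row) for row in sim) / len(sim)
    # Target score: best match per target word, averaged over target words.
    tgt_score = sum(max(col) for col in zip(*sim)) / len(sim[0])
    return src_score, tgt_score
```

With toy 2-D vectors, a target word that only partially matches any source word lowers the target score once its similarity falls under the threshold, while the source score is unaffected.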

Random Forest Classifier (RFC). This model uses features similar to those of the BOW model, with MUSE embeddings, to train a random forest classifier on the EuroParl dataset Tiedemann (2012). Translations in EuroParl are augmented with errors such as incorrect word substitution and random sentence alignment to generate incorrect translations. The model benefits from EuroParl’s paraphrasings, which help it learn beyond literal translations. The training set sizes for each language are reported in Table 8, and parameters were tuned by cross-validation on a validation set. The RFC model assigns a probability score to each input sentence pair.

The two scores, one from BOW and one from RFC, are then used to assign a label to each sample, with manually set thresholds defining the score ranges for each class. This assigns part of the data to the three classes. All other samples, whose scores do not fall in the specified ranges, are filtered out due to disagreement among the models. We refer to this set as Statistical Classification. Performance and training details of BOW and RFC are reported in Appendix A. We further augment this data with NMT output on short sentences with frequent phrases like greetings, which receive the Good label. More samples for the Loose category are generated either by adding captions to the source (like whispers, sighs, etc.), referred to as Added Captions, or by changing word order in the target to degrade fluency, called Scrambled Text. This constitutes all the positives (Good and Loose). We synthesize negatives (Bad) in two ways: we randomly choose a target for each source for easy negatives (called Random Aligned), and we choose a target from a temporally close block for hard negatives (called Drifted Aligned). The label distribution of the final data is reported in Table 2.

Language    Bad    Good   Loose
French      39.50  34.17  26.33
German      38.65  33.06  28.29
Italian     39.40  34.33  26.27
Portuguese  39.54  34.10  26.36
Spanish     42.05  29.92  28.03
Table 2: Dataset label distribution (in %)
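The four synthetic sample types from this section (Added Captions, Scrambled Text, Random Aligned, Drifted Aligned) can be sketched as follows. This is an illustrative sketch only: the caption strings, the one-block drift offset and all function names are hypothetical, and the list of pairs is assumed to be in timestamp order with at least two entries.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]         # (source block, target block)
Sample = Tuple[str, str, str]  # (source, target, label)

CAPTIONS = ["(whispers)", "(sighs)"]  # hypothetical caption strings


def added_caption(pair: Pair, rng: random.Random) -> Sample:
    """Loose: prepend a non-spoken caption to the source only."""
    src, tgt = pair
    return (f"{rng.choice(CAPTIONS)} {src}", tgt, "Loose")


def scrambled_text(pair: Pair, rng: random.Random) -> Sample:
    """Loose: shuffle target word order to degrade fluency."""
    src, tgt = pair
    words = tgt.split()
    rng.shuffle(words)  # may leave very short targets unchanged
    return (src, " ".join(words), "Loose")


def random_aligned(pairs: List[Pair], i: int, rng: random.Random) -> Sample:
    """Bad (easy negative): pair source i with a randomly chosen other target."""
    j = rng.choice([k for k in range(len(pairs)) if k != i])
    return (pairs[i][0], pairs[j][1], "Bad")


def drifted_aligned(pairs: List[Pair], i: int, offset: int = 1) -> Sample:
    """Bad (hard negative): pair source i with a temporally close target."""
    j = i + offset if i + offset < len(pairs) else i - offset
    return (pairs[i][0], pairs[j][1], "Bad")
```

Drifted negatives are harder than random ones because neighbouring blocks usually share speakers and topic, so the mismatched target still looks plausible.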

5 Model architecture

State-of-the-art monolingual information retrieval systems Huang et al. (2013); Shen et al. (2014) use a hybrid architecture of an RNN followed by a convolutional network to extract semantic and syntactic features of text, respectively. We extended this idea to build a network with two monolingual encoders, one each for source and target, to extract semantic features, followed by a CNN for syntactic features. Figure 2 shows the network’s architecture. The input to the model is a 300-dimensional embedding from pretrained FastText Bojanowski et al. (2017) for each token. We used two BiLSTMs for each encoder, with the outputs of both LSTMs concatenated. These were then fed sequentially to two convolution modules. The CNN output was passed through a fully connected layer before making a three-class prediction. We used ReLU activation with dropout and Batch Normalization. We used the Adam optimizer Kingma and Ba (2014) in all our experiments with a batch size of 8192. The learning rate was scheduled to drop by a constant factor, up to two times, whenever the rate of decrease of the training loss fell below a threshold, after which training was stopped. Table 3 shows the size of the dataset used for training and testing.
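The learning-rate schedule and stopping criterion described above can be sketched as a small framework-independent class. The initial rate, drop factor and improvement threshold used by the authors are not given here, so the values in the usage below are placeholders; measuring the "rate of loss drop" as relative improvement between consecutive epochs is also an assumption, and `PlateauSchedule` is a hypothetical name.

```python
class PlateauSchedule:
    """Drop the learning rate on a loss plateau, then stop after max_drops drops."""

    def __init__(self, lr: float, factor: float,
                 min_improvement: float, max_drops: int = 2) -> None:
        self.lr = lr
        self.factor = factor                    # multiplicative drop factor
        self.min_improvement = min_improvement  # relative improvement threshold
        self.max_drops = max_drops              # two drops, as in the text
        self.drops = 0
        self.stopped = False
        self._prev_loss = None

    def step(self, loss: float) -> float:
        """Record one epoch's training loss; return the rate for the next epoch."""
        if self._prev_loss is not None and not self.stopped:
            prev = self._prev_loss
            improvement = (prev - loss) / abs(prev) if prev != 0 else 0.0
            if improvement < self.min_improvement:
                if self.drops < self.max_drops:
                    self.lr *= self.factor      # plateau: lower the rate
                    self.drops += 1
                else:
                    self.stopped = True         # plateau after all drops: stop
        self._prev_loss = loss
        return self.lr


if __name__ == "__main__":
    # Placeholder hyperparameters, not the paper's values.
    schedule = PlateauSchedule(lr=0.1, factor=0.5, min_improvement=0.01)
    for epoch_loss in [1.0, 0.5, 0.499, 0.4985, 0.4984]:
        rate = schedule.step(epoch_loss)
        print(epoch_loss, rate, schedule.stopped)
```

A training loop would call `step` once per epoch and exit when `stopped` becomes true.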

6 Experiments

Language    Train # Samples  Train Accuracy  Test # Samples  Test Accuracy  Test Precision  Test Recall  Test F-Score
French      4.23M            93.91           0.83M           91.49          91.04           90.42        90.63
German      12.92M           95.18           2.53M           93.90          93.68           93.29        93.42
Italian     3.74M            94.41           0.73M           92.12          91.64           91.00        91.24
Portuguese  15.43M           94.20           3.03M           92.73          92.50           91.59        91.89
Spanish     18.24M           93.14           3.58M           91.45          90.90           90.39        90.42
Table 3: Model accuracy on train and test sets.

Table 3 reports model performance across various measures. The model performs similarly across languages, with test accuracy above 91% for all five. The model also performs similarly across sentences of various lengths, as shown in Figure 4, with longer sentences doing slightly better than shorter ones. This is possibly because paraphrases of shorter sentences are harder to detect than those of longer ones.

Figure 3: Model accuracy for each label, plot in color.
Figure 4: Model performance across target sentences of different lengths.

6.1 Miss rate on parallel corpora

We took a set of high-quality subtitles that were translated and validated independently by two distinct sets of human translators. A list of parallel sentences was extracted from them using their timestamp alignment information. These translations, which went through two rounds of human audit, can safely be assumed to contain no Bad translations. Since the data consists only of positives (Loose and Good), we use the miss rate, or false negative rate (FNR), as the performance metric for this experiment. We ran a similar test on public EuroParl data for reference, which also has only positives. The corresponding FNR numbers are listed in Table 4 along with the number of parallel sentences in each test set.

Language    High-quality subs # sentences  High-quality subs FNR (%)  EuroParl # sentences  EuroParl FNR (%)
French      2.4k                           13.69                      888k                  2.69
German      9.9k                           13.94                      919k                  2.55
Italian     2.4k                           10.72                      830k                  3.40
Portuguese  21.3k                          12.30                      891k                  4.01
Spanish     25.2k                          12.31                      888k                  2.87
Table 4: Miss rate on parallel corpora.
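Because every pair in these test sets is a positive (Good or Loose), the miss rate reduces to the share of pairs that the model labels Bad. A minimal sketch, with `miss_rate` as a hypothetical helper name:

```python
from typing import Sequence


def miss_rate(predicted_labels: Sequence[str]) -> float:
    """False negative rate (in %) on a test set that contains only positives.

    Every ground-truth label is Good or Loose, so each 'Bad' prediction
    counts as a false negative.
    """
    if not predicted_labels:
        raise ValueError("need at least one prediction")
    misses = sum(1 for label in predicted_labels if label == "Bad")
    return 100.0 * misses / len(predicted_labels)
```

Confusions between Good and Loose do not affect this metric, which matches the focus on the Bad-vs-rest boundary.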

The FNR is low for all language pairs, and most false negatives we identified were contextual translations that should have been marked by the model as Loose but were mistaken as Bad. For example, in English-German, the phrase “Jesus” was rewritten as “Meine Güte.”, which literally translates to “my goodness”. Such cases were under 14%, showing good performance on the Bad-vs-rest boundary, which is more critical than the Loose-vs-Good boundary. Further, our subtitle data seems to contain a higher share of such contextually correct translations that are not a literal match, which is consistent with its higher FNR compared with EuroParl. Such paraphrasing is one area that we could improve upon in future work.

6.2 Classification vs scoring

As briefly discussed in Section 1, QE can also be formulated as a scoring problem. We chose the classification route, using cross-entropy loss for our model. We compare it with an alternative scoring-based formulation that extends the ordinal regression objective from Liu et al. (2018b). This objective is defined in terms of the predicted score, the label, and lower-bound and upper-bound thresholds associated with the label, with separate threshold values set for the Bad, Loose and Good classes.

Language    Classification  Scoring
French      91.49           87.46
German      93.90           90.81
Italian     92.12           88.61
Portuguese  92.73           88.46
Spanish     91.45           87.25
Table 5: Accuracy comparison of classification and scoring losses.

Under this formulation, the model should assign Good samples a higher score than Loose samples, which in turn should score higher than Bad samples. We changed the final fully connected layer of the model to give only one output, followed by a sigmoid to bound the score between 0 and 1. Table 5 shows the test accuracy of both losses. The classification loss outperformed the scoring loss by approximately 3 to 4 percentage points for each language.
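To make the scoring formulation concrete, the sketch below shows one simple interval-based penalty of the kind described: zero when the sigmoid-bounded score falls inside the interval for its label, and growing linearly outside it. This is an illustrative stand-in and not the objective of Liu et al. (2018b); the interval bounds and the names `interval_loss` and `score_to_label` are hypothetical.

```python
import math

# Hypothetical (lower bound, upper bound) score intervals per label.
BOUNDS = {
    "Bad": (0.0, 1.0 / 3.0),
    "Loose": (1.0 / 3.0, 2.0 / 3.0),
    "Good": (2.0 / 3.0, 1.0),
}


def sigmoid(x: float) -> float:
    """Numerically stable logistic function, bounding the score in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def interval_loss(score: float, label: str) -> float:
    """Penalty for a score lying outside the interval assigned to its label."""
    lower, upper = BOUNDS[label]
    return max(0.0, lower - score) + max(0.0, score - upper)


def score_to_label(score: float) -> str:
    """Map a bounded score back to a class, ordered Bad < Loose < Good."""
    if score < BOUNDS["Loose"][0]:
        return "Bad"
    if score < BOUNDS["Good"][0]:
        return "Loose"
    return "Good"
```

Mapping scores back to labels in this way is what allows a scoring model to be compared with a classifier on accuracy, as in Table 5.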

French 68.20 67.25 62.90 88.75 88.72 91.49
German 70.46 68.63 61.66 90.88 90.29 93.90
Italian 66.96 66.80 60.70 89.63 89.36 92.12
Portuguese 68.40 70.89 62.19 89.93 88.24 92.73
Spanish 70.73 68.86 61.81 88.50 87.22 91.45
Table 6: Test Accuracy for various model architectures.

6.3 Comparison of architectures

We found that existing convolutional networks Liu et al. (2018a) that classify bilingual datasets are not able to learn the linguistic nuances of text and fail in many cases. However, recurrent networks, which are good at parsing variable-length inputs with temporal dependencies, complement them. We compared our hybrid network with an LSTM network, a CNN model and models using LASER sentence embeddings Artetxe and Schwenk (2019). For the LSTM network, we concatenated the outputs of both BiLSTMs and fed them to a fully connected layer. For the CNN model, we used three convolution modules followed by a fully connected layer. For the LASER models, we use the 1024-dimensional language-agnostic sentence embeddings from LASER. Table 6 compares the accuracy of the various models on test data. For the baseline, we used equation 3 to generate labels for the test data. An only-LSTM network works just about as well as the baseline, but an only-CNN network brings significant gains. The convolutional layer is possibly evaluating semantic retention better Kim et al. (2017). LASER FC is a classifier trained with LASER embeddings fed into a fully connected layer; it performs worse than the baseline. LASER CNN is a CNN on top of LASER embeddings and performs about as well as the only-CNN network. The proposed hybrid model, however, outperformed all other networks, including the CNN. The LSTM, when used in combination with the CNN, consistently improves prediction accuracy across languages. One notable observation was that the LASER models took about 10 epochs on average to meet our stopping criterion, while the DeepSubQE model took around 34 epochs on average.

Figure 5: Visualization from t-SNE of last layer of DeepSubQE.

Figure 5 presents the t-SNE visualization van der Maaten and Hinton (2008) of the last layer of our hybrid network for the English-German pair. The plot shows one cluster for Good translations and one for Bad, while there are two for Loose that are spread out spatially. This is because Loose translations can be close to either Good or Bad.

7 Conclusion

We studied the problem of translation quality estimation in video subtitles without any reference texts. We showed empirically how training data can be synthesized for a three-way classification into Good, Loose and Bad translations. The model’s decision can then be integrated into the subtitle quality improvement process, with Good being acceptable translations, Loose possibly requiring human post-edits and Bad needing a complete rewrite.

The current work uses only subtitle-block-level translations to make a decision and ignores the temporal aspect. Temporal structure can bring significant information for making a better judgment, particularly on Loose translations. Also, training one model per language pair imposes a considerable operational load. A multilingual model can reduce this load while helping resource-starved languages. Exploiting temporal information and learning a common space for multiple languages are future directions we are considering for this work.


References

  • M. Artetxe and H. Schwenk (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguistics 7, pp. 597–610.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL.
  • J. Blatz, E. Fitzgerald, G. F. Foster, S. Gandrabur, C. Goutte, A. Kulesza, A. Sanchís, and N. Ueffing (2004) Confidence estimation for machine translation. In COLING 2004, 20th International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2004, Geneva, Switzerland.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. TACL 5, pp. 135–146.
  • P. F. Brown, S. D. Pietra, V. J. D. Pietra, and R. L. Mercer (1993) The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (2), pp. 263–311.
  • A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2017) Word translation without parallel data. CoRR abs/1710.04087.
  • A. Graves and J. Schmidhuber (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks: the official journal of the International Neural Network Society 18 (5-6), pp. 602–610.
  • P. Gupta, M. Sharma, K. Pitale, and K. Kumar (2019a) Problems with automating translation of movie/TV show subtitles. CoRR abs/1909.05362.
  • P. Gupta, S. Shekhawat, and K. Kumar (2019b) Unsupervised quality estimation without reference corpus for subtitle machine translation using word embeddings. 2019 IEEE 13th International Conference on Semantic Computing (ICSC), pp. 32–38.
  • F. Hieber, T. Domhan, M. Denkowski, D. Vilar, A. Sokolov, A. Clifton, and M. Post (2017) Sockeye: a toolkit for neural machine translation. CoRR abs/1712.05690.
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. P. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. In CIKM.
  • F. Kepler, J. Trénous, M. Treviso, M. Vera, and A. F. T. Martins (2019) OpenKiwi: an open source framework for quality estimation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 3: System Demonstrations, M. R. Costa-jussà and E. Alfonseca (Eds.), pp. 117–122.
  • H. Kim, J. Lee, and S. Na (2017) Predictor-estimator using multilevel task learning with stack propagation for neural quality estimation. In Proceedings of the Second Conference on Machine Translation, WMT 2017, Copenhagen, Denmark, September 7-8, 2017, O. Bojar, C. Buck, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, and J. Kreutzer (Eds.), pp. 562–568.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980.
  • J. Liu, R. Cui, and Y. Zhao (2018a) Multilingual short text classification via convolutional neural network. In WISA.
  • Y. Liu, A. W. Kong, and C. K. Goh (2018b) A constrained deep neural network for ordinal regression. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 831–839.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2001) BLEU: a method for automatic evaluation of machine translation. In ACL.
  • Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil (2014) Learning semantic representations using convolutional neural networks for web search. In WWW.
  • M. Snover, B. J. Dorr, R. Schwartz, and L. Micciulla (2006) A study of translation edit rate with targeted human annotation.
  • L. Specia, N. Cancedda, M. Dymetman, M. Turchi, and N. Cristianini (2009) Estimating the sentence-level quality of machine translation systems.
  • J. Tiedemann (2012) Parallel data, tools and interfaces in OPUS. In LREC.
  • L. van der Maaten and G. E. Hinton (2008) Visualizing data using t-SNE.

Appendix A Binary Classifiers

In this section, we give details of the binary classifiers used for filtering data, briefly explained in Section 4.

A.1 Bag-of-Words Model

We improve upon the ideas presented in the IBM alignment models Brown et al. (1993) by using aligned word embeddings Conneau et al. (2017). To evaluate the performance of the model, we defined another threshold and predicted a binary label for any given translation pair by comparing the model’s score against it. Table 7 reports the optimum parameter values, the dataset size and the performance of the BOW model.

Language    Optimum parameter values  Samples  Accuracy
French      0.6, 0.30                 200k     90.72
German      0.6, 0.35                 200k     90.51
Italian     0.5, 0.40                 200k     88.71
Portuguese  0.6, 0.30                 200k     91.89
Spanish     0.6, 0.30                 200k     90.22
Table 7: Dataset size and Performance of BOW

A.2 Random Forest Classifier

We trained a binary Random Forest Classifier (RFC) on the EuroParl dataset, available in OPUS format Tiedemann (2012). We assumed the translations from EuroParl to be correct and introduced the following errors in the source text to generate incorrect translations.

  • Randomly Substitute Words: We calculate the frequency of each word in EuroParl’s English corpus. Then, we remove the two least frequent words of the sentence and introduce two random words at random locations in the sentence. Removing the least frequent words tends to remove the more important words, so that the sentence loses its meaning.

  • Random Selected Sentence: We match every source sentence with a random target sentence from the parallel corpus.

  • Word Trigram Substitution: We compute word trigram occurrence probabilities from EuroParl’s English corpus to generate a list of possible words for any given sequence of two words. We select a trigram in the sentence and replace its last word with one of the possible words for the first two words of the trigram.
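Two of the error types listed above, rare-word substitution and trigram substitution, can be sketched as follows. The sketch simplifies in ways that should be read as assumptions: replacement words are drawn uniformly rather than by trigram probability, the trigram replacement is forced to differ from the original word, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Set, Tuple


def substitute_rare_words(tokens: Sequence[str], word_freq: Counter,
                          vocabulary: Sequence[str], rng: random.Random,
                          n: int = 2) -> List[str]:
    """Remove the n least frequent words, then insert n random words."""
    by_rarity = sorted(range(len(tokens)), key=lambda i: word_freq[tokens[i]])
    rare = set(by_rarity[:n])
    kept = [t for i, t in enumerate(tokens) if i not in rare]
    for _ in range(n):
        # Insert a random vocabulary word at a random position.
        kept.insert(rng.randrange(len(kept) + 1), rng.choice(vocabulary))
    return kept


def build_trigram_table(corpus: Sequence[Sequence[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each word pair to the set of words seen right after it."""
    table: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for sent in corpus:
        for a, b, c in zip(sent, sent[1:], sent[2:]):
            table[(a, b)].add(c)
    return table


def substitute_trigram(tokens: Sequence[str],
                       table: Dict[Tuple[str, str], Set[str]],
                       rng: random.Random) -> List[str]:
    """Replace the last word of one trigram with another plausible word."""
    out = list(tokens)
    # Positions whose leading word pair has at least one alternative third word.
    positions = [i for i in range(len(out) - 2)
                 if table.get((out[i], out[i + 1]), set()) - {out[i + 2]}]
    if not positions:
        return out  # nothing to substitute
    i = rng.choice(positions)
    options = sorted(table[(out[i], out[i + 1])] - {out[i + 2]})
    out[i + 2] = rng.choice(options)
    return out
```

Trigram substitution yields fluent but wrong sentences, which are harder negatives than sentences with arbitrary words inserted.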

Language    Training Samples  Test Samples  Train Accuracy  Test Accuracy
French      647.5k            161.9k        99.95           92.86
German      717.1k            179.3k        99.97           92.16
Italian     605.2k            151.3k        99.93           92.31
Portuguese  686.7k            171.7k        99.93           92.88
Spanish     703.8k            176.0k        99.96           92.90
Table 8: Dataset size and Performance of RFC

We create datasets for each language, maintaining a fixed ratio of correct to incorrect translations. The sizes of the train and test datasets are given in Table 8. We then created a list of features, explained below, to represent a translation as a feature vector of length 273.

  • Average Vector Similarity: Cosine similarity of average of each word’s vector for each sentence.

  • Similarity Features: We create a cosine similarity matrix using aligned bilingual word embeddings Conneau et al. (2017). We borrow ideas from Gupta et al. (2019b), but instead of selecting a threshold and calculating the percentage of word matches in the cosine similarity matrix, we take the maximum values for each column and row. We repeat the process for n-grams by taking the average of the word vectors in each n-gram.

  • n-gram Frequencies: Vector of source and target unigram, bigram and trigram probabilities for each ngram in source and target sentence.

  • Structural Features: Number of words in source sentence and target sentence

Figure 6: Feature Importance for RFC (Target-Language: German)

Table 8 shows the results for the classifier, and Figure 6 shows the importance of each feature type for the RFC.