The digital entertainment industry is growing rapidly with the ease of internet access and the numerous on-demand streaming platforms such as Amazon Prime Video, Netflix and Hulu. These providers increase their viewership by enabling content in local languages. Translating subtitles across languages is the preferred cost-effective industry practice for maximizing content reach. Subtitles are translated by bilingual (and sometimes multilingual) translators, who watch the content and use a source subtitle in one language to translate it into another. Bilingual translation suffers from low translation quality and high manpower cost, which grows significantly when target-language resources are scarce. Low translation quality can increase usage drop-off and hurt content viewership among the target-language audience. Hence, translation quality estimation (QE) is a crucial step in the process. Currently, a second translator evaluates the quality, making evaluation as expensive as generating the translation itself.
Automated QE has been studied through the lens of binary classification between acceptable and unacceptable translations, or of scoring (rating), which assigns a translation-acceptability score within a given range. However, binary classification ignores “loosely” translated samples that often arise from human judgment, such as paraphrasing, under-translation or over-translation. Translators often rephrase sentences using contextual information from the video that is not available in the source sentence. For the scoring approach, gathering a large enough sample of reliable human-validated data to train a supervised system is very expensive, time-consuming and does not scale to new languages.
In this work, we propose DeepSubQE, an automated QE system that reduces both cost and time in subtitle translation while assuring quality. To overcome the limitations of conventional binary approaches, we introduce a third category of translation called Loose translation. Our system takes a pair of sentences as input, one in the source language and one in the target language, and classifies it into one of three categories: Good translation, Loose translation or Bad translation. This paper makes the following contributions:
- We develop a novel system that can estimate the quality of translations generated by either humans or MT systems, with application to video subtitling.
- We demonstrate good generalization for subtitling QE by augmenting data with various strategies, including signals from learners that themselves fail to generalize as well on the task.
- We present a formulation that handles paraphrasing and other contextually acceptable non-literal translations through appropriate synthesis of Loose translations.
The paper is organized as follows. Section 2 describes existing methods for QE. Section 3 discusses our approach to the problem, followed by Section 4, which explains the details of the dataset we generated for training. Section 5 describes the model architecture and training environment. Section 6 presents the experiments, results and observations. Section 7 concludes our findings and presents future directions.
2 Related work
QE is important for the evaluation of machine translation (MT) systems, and automatic evaluation of MT systems is a well-studied topic. Metrics like BLEU Papineni et al. (2001) and METEOR Banerjee and Lavie (2005) are industry-wide accepted metrics for evaluating a translation when a reference text is available. Specifically, for each candidate translation (machine generated for MT systems), a reference generated by a human translator is necessary for computing the metric. However, we are interested in the setting where no reference is available.
Alternatively, HTER Snover et al. (2006) is a metric that estimates the post-edits required to improve an MT-generated candidate translation to the level of human translation. Numerous models have been proposed to predict HTER for a given MT output Blatz et al. (2004); Specia et al. (2009). The state of the art for predicting HTER is a two-level neural Predictor-Estimator (PE) model Kim et al. (2017). PE finds the closest matching target token for each source token and uses such matches to compute the aggregate translation quality of a sentence pair.
However, models predicting HTER are trained on data generated from MT output and hence are biased toward error patterns produced by such systems. Errors in subtitle files have a very different distribution, whose patterns are not aligned with those that HTER captures. For example, subtitles sometimes contain complete mistranslations due to alignment errors, which lie outside the space of MT outputs that HTER models are trained on. Figure 1 shows the lack of separation of HTER predictions from the PE model on our three-class data of Good, Loose and Bad. Hence, HTER is unsuitable for our task of evaluating subtitle quality.
Our requirement is a method that evaluates the quality of subtitle data that could be either human or MT generated. The set of translations spans complete mistranslations (due to issues like alignment errors), loose translations (from additional contextual information and paraphrasing) and good translations (literal translations with complete overlap of meaning). Apart from mistranslations, errors can also arise from drift in translations, captioning of non-spoken content, etc. (see Gupta et al. (2019a) for a survey). None of the existing methods directly apply to our problem, so we tailor one to our use case. We define a three-way classification to account for the three classes of translation output. Using signals from multiple diverse methods, we gather data that represents our notion of the classes. A neural network trained on this data classifies a given pair of subtitle blocks into one of the three classes.
3 Approach
We define the QE problem given a subtitle file in a source language and its translation in a target language. Quality is measured on individual pairs of translation text blocks that are matched by timestamps. Each text block can contain more than one sentence, spoken by multiple speakers, as well as captions for non-spoken content (such as whispers, laughs, loudly, etc.). The problem is to assign the translation to one of three categories: Good, Loose or Bad. A translation is Good if it is a perfect or near-perfect translation that retains all meaning from the source and reads fluently. It is Loose if it is paraphrased or contains contextual information not available in the source text; translations of colloquial phrases and idioms also lie in this category. Bad translations are those in which the sentence pair has no overlap of meaning and the target is disconnected from the context in the video.
Gathering sufficient human-labelled data to train a supervised system is both expensive and time-consuming. The lack of suitable publicly available subtitle data for this task motivated us to reuse large volumes of unlabelled subtitles. To synthesize samples from the three classes, we use signals like timestamp alignment and overlap statistics between source and target, along with MT output for short sentences with common phrases. Our augmentation methods also use statistical classifiers of lower capacity trained on external data. The diversity and quality of the generated data are critical for the QE classifier to learn good classification boundaries and generalize sufficiently well to unseen data. We show, through experiments on subtitles and other parallel corpora, that fitting a QE classifier of sufficiently high capacity to data generated this way yields good generalization for our task. Further details of our data augmentation methods are described in Section 4.
We experiment with multiple neural network architectures for the classifier, beginning with simpler ones using only CNNs and LSTMs. A hybrid architecture of a Bidirectional LSTM (BiLSTM) Graves and Schmidhuber (2005) followed by a CNN outperformed them. The input to each model is a pair of sentences in two languages, and the output is the probability of the pair belonging to each of the three classes. We limit the length of each sentence to 25 tokens, lowercase each word and include punctuation and numbers. Section 5 describes the model in detail.
4 Data augmentation
We begin with 30k video subtitle files in English with timestamp-aligned translations into five languages (French, German, Italian, Portuguese, Spanish). These translation text blocks are unlabeled, and there is no information on which of the three QE classes they belong to. We use various features measuring statistics of word and semantic overlap between the source and the target as signals to label them. These features feed two different methods that assign two distinct scores to each sample, indicating the likelihood of it being a Good or a Bad sample. Samples where both scores are in strong agreement are given the corresponding class label of Good or Bad. Samples where both scores are aligned but only one of them is a strong indicator with high magnitude are marked as Loose. The rest of the samples, with no agreement among the scores, are discarded.
Bag-of-words model (BOW). The first is a two-parameter model that uses aligned pretrained embeddings Conneau et al. (2017) to score the sentence pair. The embeddings are used to create a cosine similarity matrix $M$ of size $m \times n$ for source and target sentences with $m$ and $n$ words respectively. A score $s_{\text{src}}$ ($s_{\text{tgt}}$) is computed for the source (target) by thresholding at $t$, taking a max over the columns (rows) and averaging over the rows (columns). Correspondingly,

$$s_{\text{src}} = \frac{1}{m} \sum_{i=1}^{m} \max_{j} \, \big([M]_t\big)_{ij}, \qquad s_{\text{tgt}} = \frac{1}{n} \sum_{j=1}^{n} \max_{i} \, \big([M]_t\big)_{ij},$$

where $[\cdot]_t$ denotes element-wise thresholding at $t$. The intuition is to aggregate similarity scores from the most relevant target words for each word in the source and vice versa. The model combines these quantities into a score $s_{\text{bow}}$. The parameters are chosen, with no learning, by tuning on validation data with positives from NMT output (implemented using Hieber et al. (2017)) and negatives from misaligned subtitles. All samples labeled through this method come from subtitles, which helps the final QE model learn common patterns in video subtitles.
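The scoring above can be sketched in a few lines of numpy. This is a minimal illustration, not the authors' implementation; the function name, the min/max details and the default threshold are assumptions.

```python
import numpy as np

def bow_scores(src_emb, tgt_emb, t=0.5):
    """Score a sentence pair from aligned word embeddings.

    src_emb: (m, d) array of source word vectors; tgt_emb: (n, d).
    The threshold t is illustrative, not the paper's tuned value.
    """
    # Normalize rows so dot products become cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    M = src @ tgt.T                      # (m, n) cosine similarity matrix
    M_t = np.where(M >= t, M, 0.0)       # element-wise thresholding at t
    s_src = M_t.max(axis=1).mean()       # best target match per source word
    s_tgt = M_t.max(axis=0).mean()       # best source match per target word
    return s_src, s_tgt
```

For identical embedding sets both scores come out as 1.0, while unrelated sentence pairs with all similarities below t score 0.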
Random Forest Classifier (RFC). This model uses features similar to those of the BOW model, with MUSE embeddings, to train a random forest classifier on the EuroParl dataset Tiedemann (2012). Translations in EuroParl are augmented with errors such as incorrect word substitution and random sentence alignment to generate incorrect translations. The model benefits from EuroParl's paraphrasings, which help it learn beyond literal translations. We trained the classifier for each language and tuned its parameters by cross-validation on a validation set. The RFC model assigns a probability score $s_{\text{rfc}}$ to each input sentence pair.
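A minimal scikit-learn sketch of such a classifier follows. The feature matrix here is a random synthetic stand-in; the actual pair features are summarized in Appendix A.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature matrix: one row of overlap/similarity features
# per sentence pair (stand-in for the features of Appendix A).
rng = np.random.default_rng(0)
X = rng.random((1000, 8))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # stand-in labels: 1 = correct pair

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X, y)

# s_rfc: probability that each pair is a correct translation.
s_rfc = rfc.predict_proba(X[:5])[:, 1]
```

The probability output, rather than the hard class, is what feeds the labeling rule of the next step.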
The two scores, $s_{\text{bow}}$ and $s_{\text{rfc}}$, both of which lie in the range $[0, 1]$, are now used to assign a label $y$ following

$$y = \begin{cases} \text{Good}, & \text{both scores are high}, \\ \text{Bad}, & \text{both scores are low}, \\ \text{Loose}, & \text{scores agree in direction but only one is a strong indicator}, \\ \text{discard}, & \text{otherwise}, \end{cases}$$

where the thresholds separating high and low scores are set manually. This labels a fraction of all samples into the three classes; all other samples, whose scores do not fall in the specified ranges, are filtered out due to disagreement among the models. We refer to this set as Statistical Classification. Performance and training details of BOW and RFC are reported in Appendix A. We further augment this data with NMT output on short sentences containing frequent phrases like greetings, assigned the Good label. More samples for the Loose category are generated by adding captions to the source (like whispers, sighs, etc.), referred to as Added Captions, or by changing word order in the target to degrade fluency, called Scrambled Text samples. These constitute all the positives (Good and Loose), while we synthesize negatives (Bad) in two ways: we randomly choose a target for each source for easy negatives (called Random Aligned) and choose a target from a temporally close block for hard negatives (called Drifted Alignment). The label distribution of the final data is reported in Table 2.
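The agreement logic described above can be sketched as follows. The threshold values and the exact form of the Loose condition are illustrative assumptions, not the paper's tuned settings.

```python
def assign_label(s_bow, s_rfc, hi=0.8, lo=0.2):
    """Combine the BOW and RFC scores into a class label.

    Both scores strongly agree -> Good/Bad; scores on the same side but
    with only one strong indicator -> Loose; otherwise discard (None).
    Thresholds hi/lo are illustrative.
    """
    strong_pos = [s >= hi for s in (s_bow, s_rfc)]
    strong_neg = [s <= lo for s in (s_bow, s_rfc)]
    same_side = (s_bow >= 0.5) == (s_rfc >= 0.5)
    if all(strong_pos):
        return "Good"
    if all(strong_neg):
        return "Bad"
    if same_side and (any(strong_pos) or any(strong_neg)):
        return "Loose"
    return None  # disagreement: sample filtered out
```

Samples returning None are the ones the text describes as filtered out by model disagreement.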
5 Model architecture
State-of-the-art monolingual information retrieval systems Huang et al. (2013); Shen et al. (2014) use a hybrid architecture of an RNN followed by a convolutional network to extract semantic and syntactic features of text respectively. We extend this idea to build a network with two monolingual encoders, one each for source and target, to extract semantic features, followed by a CNN for syntactic features. See Figure 2 for a visualization of the network architecture. The input to the model is a 300-dimensional pretrained FastText embedding Bojanowski et al. (2017) for each token. We used two BiLSTMs, one per encoder, with the outputs of both LSTMs concatenated and then sequentially fed to two convolution modules. The CNN output was passed through a fully connected layer before making a three-class prediction. We used ReLU activations with dropout and batch normalization. We used the Adam optimizer Kingma and Ba (2014) in all our experiments with a batch size of 8192. The learning rate was scheduled to drop by a fixed factor, twice, whenever the rate of decrease of the training loss fell below a threshold, after which training was stopped. Table 3 shows the size of the dataset used for training and testing.
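A minimal PyTorch sketch of this hybrid encoder is shown below. The hidden sizes, channel counts and pooling choice are illustrative; the paper's exact layer widths are not reproduced here.

```python
import torch
import torch.nn as nn

class DeepSubQE(nn.Module):
    """Sketch of the hybrid BiLSTM + CNN architecture (sizes illustrative)."""

    def __init__(self, emb_dim=300, hidden=128, n_classes=3):
        super().__init__()
        # One BiLSTM encoder per language (semantic features).
        self.src_enc = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.tgt_enc = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        # Two convolution modules over the concatenated encodings (syntactic features).
        self.conv = nn.Sequential(
            nn.Conv1d(4 * hidden, 256, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Conv1d(256, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm1d(128),
        )
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, src, tgt):
        # src, tgt: (batch, 25, emb_dim) pretrained FastText embeddings.
        h_src, _ = self.src_enc(src)           # (batch, 25, 2*hidden)
        h_tgt, _ = self.tgt_enc(tgt)           # (batch, 25, 2*hidden)
        h = torch.cat([h_src, h_tgt], dim=2)   # (batch, 25, 4*hidden)
        h = self.conv(h.transpose(1, 2))       # (batch, 128, 25)
        h = h.max(dim=2).values                # pool over the time axis
        return self.fc(self.dropout(h))        # (batch, 3) class logits
```

A forward pass on a batch of two padded 25-token sentence pairs yields a (2, 3) tensor of class logits.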
[Table 3: per-language number of samples and accuracy on training data, and number of samples, accuracy, precision, recall and F-score on test data.]
6 Experiments
Table 3 shows the model comparison across various measures. We observe that the model achieves similar accuracy across all five languages. The model also performs similarly across sentences of various lengths, as shown in Figure 4, with longer sentences doing slightly better than shorter ones. This is likely because shorter sentences, when paraphrased, are harder to detect than longer ones.
6.1 Miss rate on parallel corpora
We took a set of high-quality subtitles that were translated and validated independently by two distinct sets of human translators. A list of parallel sentences was extracted from them using their timestamp alignment information. These translations, which went through two rounds of human audit, can safely be assumed to contain no Bad translations. Since the data consists of only positives, comprising Loose and Good, we use the miss rate, or false negative rate (FNR), as the performance metric for this experiment. We ran a similar test on the public EuroParl data for reference, which also contains only positives. The corresponding FNR numbers are listed in Table 4 along with the number of parallel sentences in the test set.
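On a positives-only test set, the FNR reduces to the fraction of pairs the model predicts as Bad; a minimal helper (names illustrative):

```python
def false_negative_rate(predictions):
    """FNR on a positives-only test set.

    predictions: predicted labels for sentence pairs that are all known
    to be positive (Good or Loose); any Bad prediction is a miss.
    """
    misses = sum(1 for p in predictions if p == "Bad")
    return misses / len(predictions)
```

For example, one Bad prediction out of four positive pairs gives an FNR of 0.25.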
[Table 4: number of parallel sentences and FNR for each language pair, on our subtitle data and on EuroParl.]
The FNR is low for all language pairs, and most false negatives we identified were contextual translations that should have been marked by the model as Loose but were mistaken for Bad. For example, in English-German, the phrase “Jesus” was rewritten as “Meine Güte.”, which literally translates to my goodness. Such cases were under 14%, showing good performance on the Bad-vs-rest boundary, which is more critical than the Loose-vs-Good boundary. Further, our subtitle data seems to have a higher proportion of such contextually correct translations that are not literal matches. Handling such paraphrasing better is one area for future work.
6.2 Classification vs scoring
As briefly discussed in Section 1, QE can also be formulated as a scoring problem. We chose the classification route, using a cross-entropy loss for our model. We compare against an alternative scoring-based formulation extending the ordinal regression objective from Liu et al. (2018b), which penalizes a predicted score falling outside its label's interval,

$$\mathcal{L}(\hat{s}, y) = \max(0,\, l_y - \hat{s}) + \max(0,\, \hat{s} - u_y),$$

where $\hat{s}$ is the predicted score, $y$ is the label, and $l_y$ and $u_y$ are the lower-bound and upper-bound thresholds respectively for $y$. The interval $[l_y, u_y]$ is set lowest for class Bad, intermediate for Loose and highest for Good.
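The interval objective above can be sketched directly; both the hinge form and the example interval values below are assumptions for illustration, not the paper's exact settings.

```python
def interval_hinge_loss(s_hat, label, bounds):
    """Penalize a score falling outside its label's [lower, upper] interval.

    bounds: dict mapping label -> (l_y, u_y). The loss is zero inside the
    interval and grows linearly with the distance outside it.
    """
    l_y, u_y = bounds[label]
    return max(0.0, l_y - s_hat) + max(0.0, s_hat - u_y)

# Example intervals (assumed): Bad lowest, Loose middle, Good highest,
# matching a sigmoid-bounded score in [0, 1].
bounds = {"Bad": (0.0, 1 / 3), "Loose": (1 / 3, 2 / 3), "Good": (2 / 3, 1.0)}
```

A Good sample scored 0.9 incurs no loss, while the same score under a Bad label is penalized in proportion to how far it exceeds the Bad upper bound.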
The model should assign Good samples a higher score than Loose samples, which in turn should score higher than Bad samples. We changed the final fully connected layer of the model to give a single output, followed by a sigmoid to bound the score in the range $[0, 1]$. Table 5 shows the test accuracy under both losses. The classification loss outperformed the scoring loss for each language.
6.3 Comparison of architectures
We found that existing convolutional networks Liu et al. (2018a) that classify bilingual data are unable to learn the linguistic nuances of text and fail in many cases. However, recurrent networks, which are good at parsing variable-length inputs with temporal dependencies, complement them. We compared our hybrid network with an LSTM network, a CNN model and models using LASER sentence embeddings Artetxe and Schwenk (2019). For the LSTM network, we concatenated the output of both BiLSTMs and fed it to a fully connected layer. The CNN model had three convolution modules followed by a fully connected layer. We use the 1024-dimensional language-agnostic sentence embeddings from LASER. Table 6 compares the accuracy of the various models on test data. For the baseline, we used Equation 3 to generate labels for the test data. An LSTM-only network works just about as well as the baseline, but a CNN-only network brings significant gains; the convolutional layers are possibly evaluating semantic retention better Kim et al. (2017). LASER FC, a classifier with LASER embeddings fed into a fully connected layer, performs worse than the baseline, while LASER CNN, a CNN on top of LASER embeddings, performs about as well as the CNN-only model. The proposed hybrid model, however, outperformed all other networks, including the CNN-only model: the LSTM, when used in combination with the CNN, consistently improves prediction accuracy across languages. One notable observation is that models trained using LASER embeddings took about 10 epochs on average to meet our stopping criterion, while the DeepSubQE model took around 34 epochs on average.
In Figure 5, we present the t-SNE visualization van der Maaten and Hinton (2008) of the last layer of our hybrid network for the English-German pair. The plot shows one cluster for Good translations and one for Bad, while there are two spatially spread-out clusters for Loose. This is because Loose translations can be close to either Good or Bad.
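Such a projection can be produced with scikit-learn's t-SNE. The features below are random stand-ins for the network's last-layer activations; the sample and feature sizes are illustrative.

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for last-layer activations of the classifier:
# 60 samples of 128-dim features, 20 per class, with shifted means.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(loc=c, size=(20, 128)) for c in (0, 2, 4)])

# Project to 2-D for plotting; perplexity must be smaller than the
# number of samples.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(feats)
```

The resulting (60, 2) coordinates are what gets scattered and colored by class in a plot like Figure 5.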
7 Conclusion

We studied the problem of translation quality estimation for video subtitles without any reference texts. We showed, empirically, how training data can be synthesized for a three-way classification into Good, Loose and Bad translations. The model's decision can then be integrated into the subtitle quality improvement process, with Good translations being acceptable, Loose translations possibly requiring human post-edits and Bad translations needing a complete rewrite.
The current work uses only subtitle-block-level translations to make a decision and ignores the temporal aspect. Temporal structure can bring significant information for making a better judgment, particularly on Loose translations. Also, training one model per language pair imposes considerable operational load; a multilingual model could reduce this load while helping resource-starved languages. Exploiting temporal information and learning a common space for multiple languages are future directions we are considering for this work.
References

- Artetxe, M. and Schwenk, H. (2019). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics 7, pp. 597–610.
- Banerjee, S. and Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL.
- Blatz, J. et al. (2004). Confidence estimation for machine translation. In COLING 2004, Geneva, Switzerland.
- Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
- Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L. (1993). The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(2), pp. 263–311.
- Conneau, A., Lample, G., Ranzato, M., Denoyer, L. and Jégou, H. (2017). Word translation without parallel data. CoRR abs/1710.04087.
- Graves, A. and Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5–6), pp. 602–610.
- Gupta, P. et al. (2019a). Problems with automating translation of movie/TV show subtitles. CoRR abs/1909.05362.
- Gupta, P. et al. (2019b). Unsupervised quality estimation without reference corpus for subtitle machine translation using word embeddings. In 2019 IEEE 13th International Conference on Semantic Computing (ICSC), pp. 32–38.
- Hieber, F., Domhan, T., Denkowski, M., Vilar, D., Sokolov, A., Clifton, A. and Post, M. (2017). Sockeye: A toolkit for neural machine translation. CoRR abs/1712.05690.
- Huang, P.-S., He, X., Gao, J., Deng, L., Acero, A. and Heck, L. (2013). Learning deep structured semantic models for web search using clickthrough data. In CIKM.
- Kepler, F., Trénous, J., Treviso, M., Vera, M. and Martins, A. F. T. (2019). OpenKiwi: An open source framework for quality estimation. In ACL 2019, System Demonstrations, Florence, Italy, pp. 117–122.
- Kim, H., Lee, J.-H. and Na, S.-H. (2017). Predictor-estimator using multilevel task learning with stack propagation for neural quality estimation. In WMT 2017, Copenhagen, Denmark, pp. 562–568.
- Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. CoRR abs/1412.6980.
- Liu et al. (2018a). Multilingual short text classification via convolutional neural network. In WISA.
- Liu et al. (2018b). A constrained deep neural network for ordinal regression. pp. 831–839.
- Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2001). BLEU: A method for automatic evaluation of machine translation. In ACL.
- Shen, Y., He, X., Gao, J., Deng, L. and Mesnil, G. (2014). Learning semantic representations using convolutional neural networks for web search. In WWW.
- Snover, M., Dorr, B., Schwartz, R., Micciulla, L. and Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In AMTA.
- Specia, L. et al. (2009). Estimating the sentence-level quality of machine translation systems. In EAMT.
- Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. In LREC.
- van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605.
Appendix A Binary Classifiers
In this section, we give details of the binary classifiers used for filtering data, briefly explained in Section 4.
A.1 Bag-of-Words Model
We improve upon the ideas presented in the IBM alignment models Brown et al. (1993) by using aligned word embeddings Conneau et al. (2017). To evaluate the performance of the model, we defined another threshold $\tau$ and predicted the binary label $\hat{y}$ for any given translation pair following

$$\hat{y} = \begin{cases} 1, & s_{\text{bow}} \ge \tau, \\ 0, & \text{otherwise}. \end{cases}$$

In Table 7, we report the optimum values of $t$ and $\tau$, the dataset size and the performance of the BOW model.
A.2 Random Forest Classifier
We trained a binary Random Forest Classifier (RFC) on the EuroParl dataset, available in OPUS format Tiedemann (2012). We assumed the translations from EuroParl to be correct and introduced the following errors into the source text to generate incorrect translations.
Randomly Substituted Words: We calculate the frequency of each word in EuroParl's English corpus. We then remove the two words with the lowest frequency from a sentence and insert two random words at random locations. Since the least frequent words tend to be the more informative words of a sentence, removing them makes the sentence lose its meaning.
Randomly Selected Sentence: For every source sentence, we match it with a random target sentence from the parallel corpus.
Word Trigram Substitution: We compute word-trigram occurrence probabilities from EuroParl's English corpus to generate a list of possible next words for any given sequence of two words. We select a trigram in the sentence and replace its last word with one of the possible continuations of the trigram's first two words.
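The three corruption strategies can be sketched as follows. Helper names are illustrative, and the "two last tokens" stand in for the least frequent words, which would normally come from corpus frequency counts.

```python
import random

def random_word_substitution(tokens, vocab, rng):
    """Drop the two rarest tokens and insert two random vocabulary words
    at random positions. Here the two final tokens stand in for the least
    frequent ones; a real run would use corpus word frequencies."""
    out = tokens[:-2] if len(tokens) > 2 else tokens[:]
    for _ in range(2):
        out.insert(rng.randrange(len(out) + 1), rng.choice(vocab))
    return out

def random_sentence(parallel_targets, rng):
    """Randomly Selected Sentence: pair the source with a random target."""
    return rng.choice(parallel_targets)

def trigram_substitution(tokens, trigram_table, rng):
    """Replace the last word of a trigram with an alternative continuation
    of its first two words, drawn from corpus trigram statistics."""
    if len(tokens) < 3:
        return tokens[:]
    i = rng.randrange(len(tokens) - 2)
    key = (tokens[i], tokens[i + 1])
    candidates = trigram_table.get(key, [tokens[i + 2]])
    out = tokens[:]
    out[i + 2] = rng.choice(candidates)
    return out
```

Each helper returns a corrupted copy of the same length, so the synthetic negatives stay superficially sentence-like.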
We create datasets for each language maintaining a fixed ratio of correct to incorrect translations. The sizes of the train and test datasets are given in Table 8. We then created a list of features, explained below, to represent a translation as a 273-dimensional feature vector.
Average Vector Similarity: Cosine similarity between the averages of the word vectors of the two sentences.
Similarity Features: We create a cosine similarity matrix using aligned bilingual word embeddings Conneau et al. (2017). We borrow ideas from Gupta et al. (2019b), but instead of selecting a threshold and calculating the percentage of word matches in the cosine similarity matrix, we take the maximum values of each column and row. We repeat the process for n-grams, taking the vector average over each n-gram.
n-gram Frequencies: A vector of unigram, bigram and trigram probabilities for each n-gram in the source and target sentences.
Structural Features: The number of words in the source sentence and in the target sentence.
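An illustrative, deliberately reduced version of this feature extraction follows; it computes only a small subset of the 273 dimensions, with a hypothetical padding length for the per-row/per-column maxima.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_features(src_emb, tgt_emb, max_words=5):
    """Illustrative subset of the pair features (not the full 273 dims).

    src_emb, tgt_emb: (m, d) and (n, d) word-embedding matrices.
    """
    feats = []
    # Average-vector similarity of the two sentences.
    feats.append(cosine(src_emb.mean(axis=0), tgt_emb.mean(axis=0)))
    # Similarity features: per-row and per-column maxima of the cosine
    # similarity matrix, padded/truncated to a fixed length.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    M = src @ tgt.T
    for vals in (M.max(axis=1), M.max(axis=0)):
        padded = np.zeros(max_words)
        padded[: min(len(vals), max_words)] = vals[:max_words]
        feats.extend(padded.tolist())
    # Structural features: sentence lengths.
    feats.extend([src_emb.shape[0], tgt_emb.shape[0]])
    return np.array(feats)
```

With `max_words=5` this yields a 13-dimensional vector (1 average similarity, 5 + 5 padded maxima, 2 lengths); the full feature set additionally includes the n-gram variants and frequency features.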