A Deep Neural Network Approach To Parallel Sentence Extraction

09/28/2017 ∙ by Francis Grégoire, et al. ∙ Université de Montréal

Parallel sentence extraction is a task addressing the data sparsity problem found in multilingual natural language processing applications. We propose an end-to-end deep neural network approach to detect translational equivalence between sentences in two different languages. In contrast to previous approaches, which typically rely on multiple models and various word alignment features, we leverage continuous vector representations of sentences and thereby remove the need for any domain-specific feature engineering. Using a siamese bidirectional recurrent neural network, we compare against a strong baseline based on a state-of-the-art parallel sentence extraction system and show a significant improvement in both the quality of the extracted parallel sentences and the translation performance of statistical machine translation systems. We believe this study is the first to investigate deep learning for the parallel sentence extraction task.




1 Introduction

Parallel corpora are a prerequisite for many multilingual natural language processing applications. Although they are an invaluable resource, parallel data is available only in limited amounts, for a relatively small number of language pairs and in very few specific domains, which is problematic for scaling natural language processing applications. For example, parallel corpora play a critical role in machine translation, since only the words appearing in the vocabulary of the training set can be translated. Thus, there is growing interest in collecting more parallel data, especially for low-resource languages. With the increasing amount of content-related multilingual articles on the World Wide Web, a potential solution to alleviate the parallel data sparsity issue is to identify and extract parallel sentences from this abundant source of information. Consequently, the objective of parallel sentence extraction is to build parallel corpora by extracting parallel sentence pairs from such multilingual articles, which are widely available on the Web for several language pairs and cover various application domains. Among the different multilingual resources, Wikipedia, an online collaborative encyclopedia, is likely the largest repository of comparable corpora in many languages. Comparable corpora can be defined as collections of topic-aligned but non-sentence-aligned multilingual documents. Several recent works have used Wikipedia as a source of data to create high-quality comparable corpora Otero and López (2010); Patry and Langlais (2011); Barrón-Cedeño et al. (2015), and various parallel sentence extraction systems have been developed over the years to generate new parallel corpora Fung and Cheung (2004); Munteanu and Marcu (2005); Adafre and de Rijke (2006); Abdul-Rauf and Schwenk (2009); Smith et al. (2010); Uszkoreit et al. (2010).

Recent advances in deep learning architectures with recurrent neural networks (RNN) have shown that they can successfully learn complex mappings from variable-length sequences to continuous vector representations. While those models have been successfully applied to numerous natural language processing tasks, ranging from handwriting generation Graves (2013) to image caption generation Vinyals et al. (2014) and machine comprehension Hermann et al. (2015), most of the multilingual efforts have been devoted to machine translation Sutskever et al. (2014); Cho et al. (2014), although more research interest has recently been devoted to multilingual semantic textual similarity [1].

[1] http://alt.qcri.org/semeval2017/task2/

Previous approaches have empirically demonstrated that the inclusion of extracted parallel sentence pairs improves the performance of statistical machine translation (SMT) systems; however, such methods rely on a significant amount of feature engineering and are difficult to adapt to out-of-domain contexts. In this paper, we propose a deep neural network approach to parallel sentence extraction that takes as input a pair of documents and outputs sentence pairs classified as translations of each other. Compared to previous approaches, which require specialized metadata from document structure or the training of multiple different models, our model is learned end-to-end and uses only raw sentence pairs. We show empirically that our proposed approach outperforms a competitive baseline based on the works of Munteanu and Marcu (2005) and Smith et al. (2010). To justify the effectiveness of the proposed approach, we add the sentence pairs extracted from Wikipedia articles to a parallel corpus to train SMT systems and show improvements in BLEU scores. Our experiments show that we can achieve promising results while removing the need for any specific feature engineering or external resources. To the best of our knowledge, this is the first time deep learning is applied to extract parallel sentence pairs.

2 Related work

A variety of approaches have been developed to extract parallel sentences from comparable corpora. In particular, Munteanu and Marcu (2005) present a system which relies on a multi-step procedure to extract sentence pairs from comparable corpora of newspaper articles. The procedure first aligns pairs of similar documents using publication dates and an information retrieval system. From each such pair, all possible sentence pairs from the Cartesian product of the two documents are passed through a word-overlap and sentence-length ratio filter to obtain a set of candidate sentence pairs. These candidate sentence pairs are sent to a classifier which determines whether two sentences are translations of each other. By using the extracted sentence pairs as additional training data for SMT systems, they demonstrate that this improves the translation performance. Smith et al. (2010) extend this approach by exploiting the structure and metadata of interlanguage linked Wikipedia article pairs and introducing several new features, such as distortion features and others that take into account the position of the current and previously aligned sentences. They use their augmented set of features in a conditional random field and obtain state-of-the-art results. Abdul-Rauf and Schwenk (2009) propose a simpler approach, in which they use an SMT system built from a small parallel corpus. Instead of using a classifier, they translate the source language side of a comparable corpus to find candidate sentences on the target language side. They determine if a translated source sentence and a candidate target sentence are parallel by measuring the word error rate and the translation error rate. Although Barrón-Cedeño et al. (2015) focus on aligning domain-specific parallel documents from Wikipedia, they compute similarities between sentence pairs using cosine and length factor measures popular in cross-language information retrieval. Even though they obtain relatively low precision and recall scores with their extraction method, they observe that the extracted domain-specific sentence pairs significantly improve the translation quality of SMT systems on in-domain data.

3 Approach

3.1 Negative Sampling

For training purposes, we use a parallel corpus consisting of $n$ parallel sentence pairs $(s_i^S, s_i^T)$, for $i = 1, \dots, n$, where $S$ and $T$ denote the sets of source sentences and target sentences. These parallel sentence pairs are the positive examples of our training set. Since we want a model that learns differentiable vector representations to distinguish parallel from non-parallel sentences, we need to generate negative examples. Therefore, at the beginning of each training epoch, for every pair of parallel sentences we randomly sample $m$ negative sentence pairs $(s_i^S, s_j^T)$, for $j \neq i$ [2]. Hence, for each epoch our training data consists of $n(1+m)$ triples $(s_i^S, s_i^T, y_i)$, where $s_i^S = (w_{i,1}^S, \dots, w_{i,N}^S)$ is a source sentence of $N$ tokens, $s_i^T = (w_{i,1}^T, \dots, w_{i,M}^T)$ is a target sentence of $M$ tokens, and $y_i$ is the label representing the translation relationship between $s_i^S$ and $s_i^T$, so that $y_i = 1$ if $s_i^S$ and $s_i^T$ are translations of each other and $y_i = 0$ otherwise.

[2] In any parallel corpus there might be many redundant and similar sentence pairs. Thus, relying only on randomness to select negative sentence pairs does not guarantee that a sampled sentence pair is truly negative and might occasionally generate false negatives.
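The per-epoch negative sampling described above can be illustrated with a minimal Python sketch. The function name `sample_epoch` and its arguments are hypothetical, and the choice to keep the source sentence fixed while drawing a random target from another pair is an assumption about how the negatives are formed, not a description of the authors' code.

```python
import random


def sample_epoch(pairs, m, rng):
    """Build one epoch of (source, target, label) triples.

    pairs: list of (source, target) parallel sentence pairs (the positives).
    m:     number of negative pairs drawn for each positive pair.
    rng:   a random.Random instance, so that runs are reproducible.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two pairs to sample negatives")
    triples = []
    for i, (src, tgt) in enumerate(pairs):
        triples.append((src, tgt, 1))  # positive example
        for _ in range(m):
            # Draw an index j != i uniformly at random.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Assumption: the negative keeps the source and swaps the target.
            triples.append((src, pairs[j][1], 0))
    rng.shuffle(triples)
    return triples
```

Calling this function again at the start of every epoch yields fresh negatives each time. As in the text, nothing here checks for duplicated sentences in the corpus, so an occasional false negative remains possible.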

3.2 Model

Our idea is to use deep neural networks to learn cross-language semantics between sentence pairs in order to estimate the probability that they are translations of each other, $p(y_i = 1 \mid s_i^S, s_i^T)$. The proposed model architecture is a siamese network Bromley et al. (1994) consisting of a bidirectional RNN (BiRNN) Schuster and Paliwal (1997) sentence encoder with recurrent activation functions such as long short-term memory units (LSTM) Hochreiter and Schmidhuber (1997) or gated recurrent units (GRU) Cho et al. (2014). Since we want vector representations in a shared vector space, we use a siamese network with tied weights. As illustrated in Figure 1, our architecture uses a shared BiRNN sentence encoder that outputs a vector representation for the source and target sentences.

Figure 1: Architecture of the siamese bidirectional recurrent neural network. The final recurrent states of the forward and backward networks are concatenated and then fed into fully connected layers culminating in a sigmoid layer.

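To avoid repetition and for clarity, we only define the equations of the BiRNN encoding the source sentence. For the target sentence, simply substitute $T$ for $S$. At each time step $t$, the token in the $i$-th sentence, $w_{i,t}^S$, defined by its integer index $k$ in the vocabulary $V^S$, is represented as a one-hot vector $\mathbf{w}_k \in \{0, 1\}^{|V^S|}$ whose $k$-th element is 1 and all other elements are 0. The one-hot vector is multiplied with a learned embedding matrix $E^S$ to get a continuous vector representation (word embedding) $\mathbf{w}_{i,t}^S$, which serves as input for the forward and backward recurrent states in the BiRNN encoder, $\overrightarrow{h}_{i,t}^S$ and $\overleftarrow{h}_{i,t}^S$. The forward RNN reads the variable-length sentence and updates its recurrent state from the first token until the last one to create a fixed-size continuous vector representation of the sentence, $\overrightarrow{h}_{i,N}^S$. The backward RNN processes the sentence in reverse. In our experiments, we use the concatenation of the last recurrent state in both directions as a final representation $h_i^S$ (see Figure 1) [3]:

$$\mathbf{w}_{i,t}^S = {E^S}^\top \mathbf{w}_k$$

$$\overrightarrow{h}_{i,t}^S = \phi\left(\overrightarrow{h}_{i,t-1}^S, \mathbf{w}_{i,t}^S\right)$$

$$\overleftarrow{h}_{i,t}^S = \phi\left(\overleftarrow{h}_{i,t+1}^S, \mathbf{w}_{i,t}^S\right)$$

$$h_i^S = \left[\overrightarrow{h}_{i,N}^S ; \overleftarrow{h}_{i,1}^S\right]$$

where $\phi$ can be any recurrent activation function, such as LSTM or GRU.

[3] We considered combining the recurrent states with average pooling and max pooling to obtain a fixed-size vector representation, but obtained inferior performance.

After both source and target sentences have been encoded, we capture their matching information by using their element-wise product and absolute element-wise difference. We estimate the probability that the sentences are translations of each other by feeding the matching vectors into fully connected layers:

$$h_i^{(1)} = h_i^S \odot h_i^T$$

$$h_i^{(2)} = \left| h_i^S - h_i^T \right|$$

$$h_i = \tanh\left(W^{(1)} h_i^{(1)} + W^{(2)} h_i^{(2)} + \mathbf{b}\right)$$

$$p(y_i = 1 \mid h_i) = \sigma\left(\mathbf{v}^\top h_i + b\right)$$

where $\sigma$ is the sigmoid function, and $W^{(1)}$, $W^{(2)}$, $\mathbf{b}$, $\mathbf{v}$, and $b$ are model parameters. The model is trained by minimizing the cross entropy of our labeled sentence pairs:

$$\mathcal{L} = -\sum_{i} \left[ y_i \log \sigma\left(\mathbf{v}^\top h_i + b\right) + (1 - y_i) \log\left(1 - \sigma\left(\mathbf{v}^\top h_i + b\right)\right) \right]$$

For prediction, a sentence pair is classified as parallel if the probability score is greater than or equal to a decision threshold $\rho$ that we need to fix:

$$\hat{y}_i = \begin{cases} 1 & \text{if } p(y_i = 1 \mid h_i) \geq \rho \\ 0 & \text{otherwise} \end{cases}$$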
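To make the matching and classification steps concrete, here is a minimal pure-Python sketch that starts from two already-encoded sentence vectors. It assumes a single hidden layer with a tanh non-linearity followed by a sigmoid output; the tanh choice, the function names, and the parameter names (`W1`, `W2`, `b_hidden`, `v`, `b_out`) are illustrative assumptions. The BiRNN encoder itself is not reproduced.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matching_features(h_src, h_tgt):
    """Element-wise product and absolute element-wise difference."""
    prod = [a * b for a, b in zip(h_src, h_tgt)]
    diff = [abs(a - b) for a, b in zip(h_src, h_tgt)]
    return prod, diff


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def parallel_probability(h_src, h_tgt, W1, W2, b_hidden, v, b_out):
    """Probability that the two encoded sentences are translations."""
    prod, diff = matching_features(h_src, h_tgt)
    pre = [p + d + b for p, d, b in
           zip(matvec(W1, prod), matvec(W2, diff), b_hidden)]
    hidden = [math.tanh(z) for z in pre]  # assumed hidden non-linearity
    return sigmoid(sum(vi * hi for vi, hi in zip(v, hidden)) + b_out)


def cross_entropy(probs, labels, eps=1e-12):
    """Summed binary cross entropy over labeled sentence pairs."""
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1.0 - p, eps))
        for p, y in zip(probs, labels)
    )


def classify(prob, threshold):
    """Label a pair as parallel (1) when its score reaches the threshold."""
    return 1 if prob >= threshold else 0
```

With all weights set to zero the hidden layer is zero and the output probability is exactly 0.5, which is a quick sanity check on the wiring before plugging in trained parameters.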


4 Experiments

To assess the effectiveness of our approach, we compare it in different settings against the baseline model described in Section 4.3. First, in Section 5.1, we measure the precision, recall and F1 scores by extracting parallel sentences from a standard parallel corpus. To compare the approaches on pseudo-comparable corpora with different degrees of comparability, we insert noisy non-parallel sentences into the parallel corpus. Then, in Section 5.2, we extract sentence pairs from real comparable corpora and validate their utility by measuring their impact on SMT systems.

4.1 Evaluation metrics

For the evaluation of the performance of our models, a sentence pair predicted as parallel is correct if it is present in the parallel sentence pairs of the dataset. Precision is the proportion of truly parallel sentence pairs among all extracted sentence pairs. Recall is the proportion of truly parallel extracted sentence pairs among all parallel sentence pairs in the dataset. The F1 score is the harmonic mean of precision and recall.
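The three metrics defined above can be computed from sets of sentence-pair identifiers, as in the following short sketch. The function name and the representation of a pair as a `(source_index, target_index)` tuple are illustrative assumptions.

```python
def precision_recall_f1(extracted, gold):
    """Score extracted sentence pairs against the truly parallel pairs.

    extracted: iterable of (source_index, target_index) predicted as parallel.
    gold:      iterable of (source_index, target_index) that are truly parallel.
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1
```

For example, extracting three pairs of which two are among four gold pairs gives a precision of 2/3, a recall of 1/2 and an F1 score of 4/7.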

For the statistical machine translation evaluation we use the BLEU score Papineni et al. (2002) as an evaluation metric, computed with the multi-bleu script from Moses Koehn et al. (2007) [4].

[4] https://github.com/moses-smt/mosesdecoder

4.2 Datasets

The most reliable way to compare the precision, recall and F1 scores would be to have professional translators manually annotate parallel sentences from comparable corpora. However, this option is expensive and impractical. Therefore, for this task it is common practice to compare different approaches using aligned texts from known parallel corpora. Thus, to compute our evaluation metrics we use the WMT’15 English to French datasets [5]. Our training set consists of 500k parallel sentence pairs randomly selected from the Europarl v7 corpus Koehn (2005). The vocabulary size is 69k for English and 84k for French. We argue that parallel sentence extraction in practice requires domain adaptation, i.e. data during prediction will most probably cover other domains than the ones found in the training set, so we focus on out-of-domain test sets. Therefore, we use the first 1,000 parallel sentence pairs of newstest2012 for the model evaluation experiment. For the SMT evaluation experiment, the comparable corpora we use to extract parallel sentences are English-French Wikipedia article pairs from the Wikipedia dumps [6], and the test set is newstest2013. Data processing is performed to clean and segment the Wikipedia XML documents into sentences. We normalize and tokenize all datasets with the scripts from Moses. The maximum sentence length is set to 80 tokens.

[5] http://www.statmt.org/wmt15/translation-task.html
[6] https://dumps.wikimedia.org/

Figure 2: Precision-Recall curve of the models evaluated on the Cartesian product of the 1,000 first sentence pairs of newstest2012 without noise (left) and with a noise ratio of 90% (right).

4.3 Baseline

For comparison, we use a parallel sentence extraction system developed in-house by Bérard (2014) based on the work of Munteanu and Marcu (2005) and Smith et al. (2010). The system consists of a candidate sentence pair filtering process and three models: two word alignment models and a maximum entropy classifier. The word alignment models are trained on both language directions using our training set of 500k parallel sentence pairs. For the classifier, we select another 10k parallel sentence pairs from the held-out Europarl dataset and choose a negative to positive ratio (non-parallel to parallel sentence pairs) to select the number of negative sentence pairs [7]. For example, with a negative to positive ratio of 5 we select 50k negative sentence pairs, so for each epoch the classifier is trained on 60k examples.

[7] Munteanu and Marcu (2005) use 5k parallel sentence pairs with a negative to positive ratio not greater than 5. By using 10k parallel sentence pairs instead of 5k we obtained small performance gains, but we did not observe any significant gain by using more than 10k parallel sentence pairs.

Word alignment models  The translation and alignment tables are estimated using the HMM alignment model of Vogel et al. (1996). These probability tables are required to compute the many alignment features used in the classifier. To perform word alignment we use IBM Model 2. The translations with a probability score above 10% in the estimated translation tables are used to infer bilingual dictionaries, which are used in the word-overlap filter for candidate sentence pair selection. For our experiments, we use the GIZA++ implementation Och and Ney (2003) [8] to train our word alignment models.

[8] http://www.statmt.org/moses/giza/GIZA++.html

Maximum entropy classifier  The classifier uses 31 features which are based on the work of Munteanu and Marcu (2005) and Smith et al. (2010). They rely on word-level alignment features between two sentences, such as the number and percentage of connected (unconnected) words, the top three largest fertilities, percentage of source words with fertility 1, 2, 3 or more, length of the longest connected (unconnected) substring, log probability of the alignment, and also general features, such as the lengths of the sentences, length difference, length ratio and the percentage of words on each side that have a translation on the other side. A sentence pair is classified as parallel if the classifier outputs a probability score greater than or equal to a decision threshold which needs to be fixed.

Candidate sentence pair selection  During training, a sentence pair filtering process is used to select a fixed number of negative sentence pairs to train the maximum entropy classifier. It is also used during prediction to filter out the unlikely sentence pairs of the Cartesian product. First, it verifies that the ratio of the lengths of the two sentences is not greater than two. It then uses a word-overlap filter to check that, for both sentences, at least 50% of their words have a translation in the other sentence, according to the bilingual dictionaries inferred from the word alignment models. Every pair that does not fulfill these two conditions is discarded.

During our experiments, we observed that the filtering process eliminates more than 99% of the candidate sentence pairs from the Cartesian product of the test set and that the classifier alone is not able to classify truly parallel sentences. The sentence pair filtering process is only applied to the baseline model.
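The two conditions of the baseline's candidate filter (length ratio at most two, and at least 50% of the words on each side having a translation on the other side) can be sketched as follows. This is an illustrative reading of the description, not the baseline's actual code: the function names are hypothetical, and the bilingual dictionaries are assumed to be plain Python dicts mapping a word to a set of translations.

```python
def length_ratio_ok(src_tokens, tgt_tokens, max_ratio=2.0):
    """True when the longer sentence is at most max_ratio times the shorter."""
    longer = max(len(src_tokens), len(tgt_tokens))
    shorter = min(len(src_tokens), len(tgt_tokens))
    return shorter > 0 and longer / shorter <= max_ratio


def coverage(tokens, other_tokens, dictionary):
    """Fraction of `tokens` with a dictionary translation in `other_tokens`."""
    if not tokens:
        return 0.0
    other = set(other_tokens)
    covered = sum(1 for w in tokens if dictionary.get(w, set()) & other)
    return covered / len(tokens)


def is_candidate(src_tokens, tgt_tokens, src2tgt, tgt2src, min_overlap=0.5):
    """Apply the length-ratio filter, then the word-overlap filter both ways."""
    if not length_ratio_ok(src_tokens, tgt_tokens):
        return False
    return (coverage(src_tokens, tgt_tokens, src2tgt) >= min_overlap
            and coverage(tgt_tokens, src_tokens, tgt2src) >= min_overlap)
```

Because both checks are cheap set and length operations, they can be applied to every pair of the Cartesian product before the more expensive classifier is run.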

4.4 Training settings

Our neural network models are implemented using TensorFlow Abadi et al. (2016). We use a siamese BiRNN with a single layer in each direction, with 512-dimensional word embeddings and 512-dimensional recurrent states. We use GRU as the recurrent activation function, since it consistently outperformed LSTM by a small margin in our experiments. The hidden layer of the fully connected layers has 256 hidden units. We initialize all parameters uniformly using TensorFlow’s default uniform unit scaling initialization, except for the biases, which are initialized to zero. To train our models, we use the Adam optimizer Kingma and Ba (2014) with a learning rate of 0.0002 and a minibatch of 128 examples. Models are trained for a total of 15 epochs. To avoid exploding gradients, we apply gradient clipping such that the norm of all gradients is no larger than 5 Pascanu et al. (2013). To prevent overfitting, we apply dropout with a probability of 0.2 and 0.3 for the non-recurrent input and output connections, respectively. Training is performed on a single GPU.
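As an illustration of the gradient clipping step, the sketch below rescales a set of gradient vectors so that their joint (global) L2 norm does not exceed 5. Whether the clipping is applied to the global norm or per parameter tensor is not stated here, so the global-norm variant is an assumption, and the function name is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale gradients so that their joint L2 norm is at most max_norm.

    grads: list of gradient vectors (lists of floats), one per parameter.
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if global_norm <= max_norm:
        return [list(vec) for vec in grads], global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in vec] for vec in grads], global_norm
```

Rescaling all gradients by the same factor preserves the direction of the update and only shortens the step, which is the usual motivation for clipping by norm.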

5 Results

5.1 Model Evaluation

In this section we measure the precision, recall and F1 scores to compare both methods. We use a siamese BiRNN model trained with 7 negative samples and a baseline model trained on a balanced training set with a negative to positive ratio of 1 [9]. In order to compare the feasibility of our approach on different degrees of comparability of comparable corpora, we insert noisy non-parallel sentences into the test set by substituting a defined number of target sentences with external target sentences from the held-out sentence pairs of the newstest2012 corpus. For example, with a noise ratio of 60%, 600 out of the 1,000 sentence pairs are not parallel, such that only 0.04% of the sentence pairs in the Cartesian product are truly parallel. Figure 2 shows the precision-recall curve of the models evaluated on the test set with a noise ratio of 0% and 90%. In Table 1, we report the scores at the decision threshold value fixed at the optimal F1 value. We see that we are able to consistently outperform by a significant margin the results obtained with the baseline model. Our approach has an F1 improvement over the baseline model of 9.61% and 19.61% on the test set with a noise ratio of 0% and 90%, respectively.

[9] We experimented with different numbers of negative samples, and beyond a value of 7 we observed that the marginal benefit of adding more negative samples is not significant. As for the baseline model, we observed better performance and more stability when it is trained on a balanced training set.
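The noise insertion procedure and the resulting class imbalance can be reproduced with a small sketch. With 1,000 pairs and a noise ratio of 60%, 400 pairs remain parallel among 1,000 × 1,000 = 1,000,000 candidates, i.e. 0.04%. The function names and the uniform choice of which pairs to corrupt are assumptions made for illustration.

```python
import random


def inject_noise(pairs, external_targets, noise_ratio, rng):
    """Replace the target of a fraction of pairs with an external sentence.

    Returns the noisy pairs and the set of indices that remain parallel.
    """
    num_noisy = round(len(pairs) * noise_ratio)
    if num_noisy > len(external_targets):
        raise ValueError("not enough external target sentences")
    noisy_idx = set(rng.sample(range(len(pairs)), num_noisy))
    replacements = iter(rng.sample(external_targets, num_noisy))
    noisy_pairs = [
        (src, next(replacements)) if i in noisy_idx else (src, tgt)
        for i, (src, tgt) in enumerate(pairs)
    ]
    still_parallel = {i for i in range(len(pairs)) if i not in noisy_idx}
    return noisy_pairs, still_parallel


def parallel_fraction(num_pairs, noise_ratio):
    """Fraction of truly parallel pairs in the full Cartesian product."""
    num_parallel = num_pairs - round(num_pairs * noise_ratio)
    return num_parallel / (num_pairs * num_pairs)
```

At a noise ratio of 90% the same computation gives 100 parallel pairs out of one million candidates, which is why a classifier evaluated on the Cartesian product faces a highly imbalanced problem.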

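| Model | P (%), noise 0% | R (%), noise 0% | F1 (%), noise 0% | Threshold, noise 0% | P (%), noise 90% | R (%), noise 90% | F1 (%), noise 90% | Threshold, noise 90% |
|---|---|---|---|---|---|---|---|---|
| BiRNN | 83.0 | 69.6 | 75.7 | 0.99 | 70.6 | 59.0 | 66.7 | 0.99 |
| Baseline | 73.1 | 60.3 | 66.1 | 0.85 | 46.2 | 48.0 | 47.1 | 0.97 |

Table 1: Precision, recall and F1 scores at the decision threshold value maximizing the F1 score on our test set.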

Figure 3: Precision score as the number of noisy non-parallel sentences in the test set increases.
Figure 4: F1 score as the number of noisy non-parallel sentences in the test set increases.
Figure 5: Recall score as the number of noisy non-parallel sentences in the test set increases.

In Figures 3, 4 and 5, we compare the precision, F1 and recall scores as the noise ratio in our test set increases. We observe that it becomes harder to identify parallel sentences as the number of non-parallel sentences increases in the test set. However, we see that our neural network-based approach outperforms the baseline model across the board. In contrast to the baseline, the performance of our method stays relatively stable and only starts to degrade at very high noise ratios. We believe that this range of very high noise ratios is more representative of the document pairs found in real comparable corpora. While we present the scores for the sentence pairs extracted with a decision threshold value fixed at the optimal F1 score, some may argue that the precision of the extracted pairs is more important than the recall and that choosing the approach with the best F1 score is not optimal. In this regard, Goutte et al. (2012) find that SMT systems are robust to noise in training data and that recall can be even more important than precision [10]. In any case, we see that our approach gives a better precision at a larger recall value, meaning that setting the decision threshold in order to obtain a desired precision will lead to a larger number of high-quality parallel sentences. The value of the decision threshold has a direct impact on the quality and the amount of extracted sentence pairs. In our case, because we are in the presence of datasets with highly imbalanced classes, we recommend using a very high value to reduce the number of false positives.

[10] However, this might not be the case for neural machine translation systems based on distributional semantic representations, where precision could be the score to prioritize. We leave the investigation of this impact for future work.
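Fixing the decision threshold at the value that maximizes the F1 score can be done with a simple sweep over candidate thresholds, as sketched below. The function name, the candidate grid and the tie-breaking rule (keep the first best threshold) are illustrative assumptions.

```python
def best_f1_threshold(scores, labels, thresholds):
    """Return (threshold, f1) for the candidate threshold with the best F1.

    scores:     probability scores, one per candidate sentence pair.
    labels:     1 for truly parallel pairs, 0 otherwise.
    thresholds: candidate decision thresholds to try.
    """
    best_threshold, best_f1 = None, -1.0
    for threshold in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:  # strict comparison keeps the first best threshold
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1
```

The expression `2 * tp / (2 * tp + fp + fn)` is algebraically the same as the harmonic mean of precision and recall, and avoids a division by zero when nothing is extracted. A precision-oriented variant would instead pick the lowest threshold reaching a target precision.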

5.2 Statistical Machine Translation Evaluation

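| Language | Model | Sentence pairs | Tokens | Length (average) | Length (standard deviation) |
|---|---|---|---|---|---|
| English | BiRNN | 1,487,769 | 29,740,242 | 20 | 11 |
| English | Baseline | 792,514 | 14,310,191 | 18 | 9 |
| French | BiRNN | 1,487,769 | 32,613,325 | 22 | 12 |
| French | Baseline | 792,514 | 15,245,228 | 19 | 10 |

Table 2: Statistics of the size of the parallel corpora extracted from the English-French Wikipedia article pairs. Length is the average and standard deviation of the number of tokens in the sentences.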

The objective of parallel sentence extraction is to increase the size of existing parallel corpora and to broaden the covered domains in order to improve the generalization of machine translation systems. To justify the utility of our proposed approach, we extract parallel sentences from interlanguage linked English-French Wikipedia articles and evaluate their quality by measuring the BLEU scores on SMT systems. We want a good balance between the quality and the number of extracted sentence pairs, so for each approach we set the decision threshold value equal to the value maximizing the F score with a noise ratio of 90% (see Table 1). We use these values as a rough estimate to represent the degree of comparability present in Wikipedia article pairs.

For both methods, the classifier independently classifies each $i$-th sentence pair as parallel if its probability score is greater than or equal to the decision threshold $\rho$. This can lead to a situation where a source sentence is paired with several target sentences, or vice versa. To guarantee that sentences in both languages appear in at most a single pair (i.e. one-to-one alignment), as a post-processing step we employ a greedy strategy that sorts the extracted sentence pairs by decreasing probability score and iterates over this sequence, eliminating pairs whose source or target sentence has already been paired. Information on the size of the extracted parallel corpora, keeping only the sentence pairs in which both sentences contain at least 3 tokens, is presented in Table 2. As we expected, given the superior precision and recall values of our approach (see Section 5.1), the BiRNN extracts more sentence pairs than the baseline. In fact, there is a quality-size trade-off, and it is possible to set the value of $\rho$ in order to maximize the quality (size) of the extracted parallel corpus, to the detriment of its size (quality). We calculated the coverage ratio and found that 77% of the sentence pairs extracted by the baseline model were also extracted by our BiRNN approach.
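The greedy one-to-one post-processing step described above can be sketched in a few lines. The function name and the `(score, source_id, target_id)` tuple layout are illustrative assumptions.

```python
def greedy_one_to_one(scored_pairs):
    """Keep each source and target sentence in at most one extracted pair.

    scored_pairs: iterable of (score, source_id, target_id) for pairs that
                  were classified as parallel.
    Returns the kept pairs, sorted by decreasing score.
    """
    used_sources, used_targets, kept = set(), set(), []
    ranked = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)
    for score, src, tgt in ranked:
        if src in used_sources or tgt in used_targets:
            continue  # one side is already paired with a better-scoring match
        used_sources.add(src)
        used_targets.add(tgt)
        kept.append((score, src, tgt))
    return kept
```

This strategy is not guaranteed to maximize the total score of the kept pairs, but it needs only a sort and a single pass, which matters when over a million pairs are extracted.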

To train the phrase-based translation systems Koehn et al. (2003), we use the Moses toolkit. As the baseline SMT system, we train a system on our training set consisting of 500k parallel sentence pairs selected from the Europarl corpus described in Section 4.2. We train two additional SMT systems by augmenting the training set with the sentence pairs extracted by the BiRNN and baseline models. Each system uses newstest2012 as tuning set and is evaluated on English-French translation quality on newstest2013. Since the two extracted parallel corpora are not of the same order of magnitude, we also sorted the sentence pairs by similarity score in descending order and used the top 500k sentence pairs to train new SMT systems with training sets of equal size. Table 3 shows the BLEU scores for the different SMT systems. When using the full extracted parallel corpus, our approach improves the BLEU score by 4.7 over the baseline SMT system trained solely on the 500k sentence pairs of the Europarl corpus, and by about 0.8 over the system trained with the extra sentence pairs extracted with the baseline model. When we only use the top 500k sentence pairs with the highest similarity scores, our approach is on par with the baseline extraction system (25.0 versus 24.9 BLEU). This confirms the quality of the top 500k ranked sentence pairs, and suggests that we could lower the decision threshold value to extract more high-quality parallel sentences. Given the out-of-domain nature of our Wikipedia articles with respect to our training set, those results are encouraging because they suggest that our approach should adapt well to comparable corpora with a lower degree of comparability.

| Training Data | Model | BLEU | Sentences |
|---|---|---|---|
| Europarl | | 21.5 | 500,000 |
| +Full | BiRNN | 26.2 (+4.7) | 1,987,769 |
| +Full | Baseline | 25.4 (+3.9) | 1,292,514 |
| +Top500k | BiRNN | 25.0 (+3.5) | 1,000,000 |
| +Top500k | Baseline | 24.9 (+3.4) | 1,000,000 |

Table 3: BLEU scores obtained on the newstest2013 test set. Sentences is the number of sentences used to train the SMT systems. The Europarl row is the baseline SMT system trained on 500k sentence pairs from the Europarl corpus.

6 Discussion

In this work, we presented a deep neural network approach to extract parallel sentences. We showed that our approach outperforms by a significant margin a strong baseline based on a state-of-the-art parallel sentence extraction system. Traditional systems need to train multiple models and to apply a two-step classification procedure. In contrast, we propose a simpler approach that only requires a parallel corpus to encode sentence pairs in a siamese BiRNN encoder using LSTM or GRU activation functions.

Our work opens the way for researchers who want to apply more advanced deep learning architectures to the parallel sentence extraction task. We believe our approach is scalable and adaptable to different languages or domains. That being said, it would be natural to extend the approach to multiple language pairs. Currently, we do not handle out-of-vocabulary words, which might be an issue. Although we have evaluated our approach in an out-of-domain setting (training on the Europarl corpus, extracting parallel sentences from Wikipedia articles and testing on newstest2013) with promising results, we need to further investigate the impact that out-of-vocabulary words might have.

We saw that the degree of comparability in a pair of documents impacts the performance of our approach: the lower the comparability, the lower the scores. A more thorough analysis of the hyperparameter settings could be carried out to improve the generalization of the model. Instead of only selecting random negative samples, a promising next step could be to use a mix of random and hard negative samples in our training set (i.e. similar non-parallel sentence pairs). However, to achieve that we would need either an extra feedforward pass over the whole training set at the beginning of each training epoch to obtain the similar non-parallel sentences, or external resources such as bilingual dictionaries or pre-trained word embeddings. The data and our code are available on GitHub.