
Explicit Sentence Compression for Neural Machine Translation

State-of-the-art Transformer-based neural machine translation (NMT) systems still follow a standard encoder-decoder framework, in which the source sentence representation is produced by an encoder with a self-attention mechanism. Though a Transformer-based encoder may effectively capture general information in its resulting source sentence representation, the backbone information, which stands for the gist of a sentence, is not specifically focused on. In this paper, we propose an explicit sentence compression method to enhance the source sentence representation for NMT. In practice, an explicit sentence compression goal is used to learn the backbone information in a sentence. We propose three ways, including backbone source-side fusion, target-side fusion, and both-side fusion, to integrate the compressed sentence into NMT. Our empirical tests on the WMT English-to-French and English-to-German translation tasks show that the proposed sentence compression method significantly improves translation performance over strong baselines.





1 Introduction

Neural machine translation (NMT) is popularly implemented as an encoder-decoder framework [1], in which the encoder is in charge of source sentence representation. Typically, the input sentence is implicitly represented as a contextualized source representation through deep learning networks. Fed to the decoder, the source representation is then used to learn time-step-dependent context vectors for predicting the target translation.


In the state-of-the-art Transformer-based encoder, self-attention mechanisms are good at capturing the general information in a sentence [3, 4, 5]. However, it is difficult to distinguish which kind of information lying deep under the language is really salient for learning the source representation. Intuitively, when a person reads a source sentence, they often first focus selectively on the basic sentence meaning, and then re-read the entire sentence to understand its meaning completely. Take the English sentence in Table I as an example. We manually annotate its basic meaning as a shorter sequence of words than the original sentence, called backbone information. These words carrying the basic meaning contain more important information for human understanding than the remaining words in the sentence. We argue that such backbone information is also helpful for learning the source representation, yet it is not explicitly considered by existing NMT systems to enrich the source sentence representation.

In this paper, we propose a novel explicit sentence compression approach to enhance the source representation for NMT. To this end, we first design three sentence compression models (supervised, unsupervised, and semi-supervised) to accommodate the needs of various languages and scenarios. Each learns a sequence of backbone information words (as shown in Table I) from the source sentence. We then propose three translation models, including backbone source-side fusion based NMT (BSFNMT), backbone target-side fusion based NMT (BTFNMT), and backbone both-side fusion based NMT (BBFNMT), to introduce this backbone knowledge into the existing Transformer NMT system for improving translation predictions. Empirical results on the WMT14 English-to-German and English-to-French translation tasks show that the proposed approach significantly improves translation performance over strong, even state-of-the-art, NMT baselines (footnote 1: our code is available at …).

Sentence: Both the US authorities and the Mexican security forces are engaged in an ongoing battle against the drug cartels.
Basic meaning: US authorities and Mexican forces battle against drug cartels
Backbone (supervised ESC): US and Mexican fight drug cartels
Backbone (unsupervised ESC): US authorities and Mexican security forces battle drug cartels
Backbone (semi-supervised ESC): US authorities and Mexican security forces battle against drug cartels
TABLE I: An example of sentence compression.

2 Explicit Sentence Compression

Generally, sentence compression (footnote 2: there are many types of sentence compression; in this paper, we focus on abstractive sentence summarization) is a typical sequence generation task which aims to maximize the absorption and long-term retention of large amounts of data over a relatively short sequence for text understanding [6, 7]. To distinguish the importance of words in the sentence and, more importantly, to dig out the most salient part of the sentence representation, we use sentence compression to explicitly distill the key knowledge that retains the key meaning of the sentence. We term this explicit sentence compression (ESC). Depending on whether the sentence compression model is trained using human-annotated data, the proposed method can be implemented in three ways: supervised ESC, unsupervised ESC, and semi-supervised ESC.

2.1 Supervised ESC

Sentence compression usually relies on large-scale raw data together with human-labeled data, which can be viewed as supervision, to train a sentence compression model [8, 9, 10, 11, 12, 13]. For example, [12] proposed an attentive encoder-decoder recurrent neural network (RNN) to model abstractive text summarization. [14] further proposed MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder sentence compression framework, which reported state-of-the-art performance on both the Gigaword Corpus and the DUC Corpus.

Sentence compression can be conducted by a typical sequence-to-sequence model. The encoder represents the input sentence as a sequence of annotation vectors, and the decoder depends on the attention mechanism to learn the context vector for generating a compressed version with the key meaning of the input sentence. Recently, the new Transformer architecture proposed by [1], which fully relies on self-attention networks, has exhibited state-of-the-art translation performance for several language pairs. We follow this practice and attempt to apply the Transformer architecture to such a compression task.

2.2 Unsupervised ESC

A major challenge in supervised sentence compression is the scarcity of high-quality human-annotated parallel data. In practice, when parallel annotated data are lacking, a supervised sentence compression model cannot be trained at all; when the annotated data come from a different domain, a model trained in-domain performs poorly out-of-domain.

Supervised sentence compression models have achieved impressive performance based on large corpora containing pairs of verbose and compressed sentences with human annotation [12, 14]. However, their effectiveness relies heavily on the availability of large amounts of parallel original and human-annotated compressed sentences. This hinders further improvement of the sentence compression approach in many low-resource scenarios. Recently, motivated by progress in unsupervised cross-lingual embeddings, unsupervised NMT [15, 16, 17] opened the door to sequence-to-sequence learning without any parallel sentence pairs. It takes advantage of the (ideally) lossless nature of machine translation between languages; i.e., a sentence can be translated from one language to another and then back-translated to the original language. Sentence compression does not have this property. It is lossy from the original sentence to the compressed sentence, which makes it difficult to restore the original sentence from its compressed version.

[18] added noise to extend the original sentences and trained a denoising auto-encoder to recover the original, constructing an end-to-end training network within the sequence-to-sequence framework without any examples of compressed sentences. In doing so, the model has to exclude and reorder words of the noisy input sentence, and hence learns to output shorter, semantically more important, yet grammatically correct sentences. Two types of noise are used in the model: Additive Sampling Noise and Shuffle Noise.

Additive Sampling Noise: To extend the original sentence, we randomly sample additional sentences from the training dataset, and then sub-sample a subset of words from each without replacement. The newly sampled words are appended to the original sentence.

Shuffle Noise: In order for the model to learn to rephrase the input sentence to make the output shorter, we shuffle the resultant additive noisy sentence.
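The two noise types above can be illustrated with a minimal sketch. This is not the implementation of [18] or of this paper; the function name `add_noise` and the parameters `num_extra` (how many extra sentences to sample) and `extra_ratio` (fraction of words taken from each) are hypothetical choices made only for illustration.

```python
import random


def add_noise(sentence, corpus, num_extra=1, extra_ratio=0.5, rng=None):
    """Return a longer, shuffled version of `sentence` (a list of tokens).

    `num_extra` and `extra_ratio` are illustrative assumptions, not values
    taken from the paper.
    """
    rng = rng or random.Random()
    noisy = list(sentence)

    # Additive sampling noise: pick extra sentences at random, then sub-sample
    # words from each one without replacement and append them.
    for other in rng.sample(corpus, num_extra):
        k = min(len(other), max(1, int(len(other) * extra_ratio)))
        noisy.extend(rng.sample(other, k))

    # Shuffle noise: destroy the word order of the extended sentence so the
    # model must both drop the extra words and restore a fluent order.
    rng.shuffle(noisy)
    return noisy


if __name__ == "__main__":
    corpus = [["the", "cat", "sat"], ["dogs", "bark", "loudly", "today"]]
    source = ["us", "and", "mexico", "fight", "cartels"]
    print(add_noise(source, corpus, rng=random.Random(0)))
```

During training, the noisy sequence would serve as the encoder input and the original sentence as the reconstruction target, so the model is pushed to produce an output shorter than its input.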

To gain a better quality for the compressed sentences, we transfer the method of [18] into the Transformer architecture instead of their suggested RNN architecture, which makes it conducive to deeper network training and a larger corpus.

2.3 Semi-supervised ESC

As pointed out in [14], the sequence-to-sequence framework has attracted much attention recently due to advances in deep learning with large-scale data. Many language generation tasks have only a small amount of paired data, which is not enough to train a deep model with good generalization ability. In comparison, there is a lot of unpaired data, which is easier to obtain.

We observe a performance degradation caused by domain mismatch in the supervised ESC. According to the experimental results of [18], the accuracy of the unsupervised ESC is currently lower than that of the supervised one. Therefore, we further adopt a semi-supervised explicit sentence compression model to alleviate this problem. Specifically, unsupervised training (often referred to as pre-training) is first performed on the unpaired data, and the model is then fine-tuned on the small-scale paired data (supervised training) to obtain an ESC model with good performance and generalization ability.

2.4 Compression Rate Control

Explicit compression rate (length) control is a common method in previous sentence compression work. [19] examined several methods of introducing target output length information, and found that they were effective without negatively impacting summarization quality. [20] introduced a length marker token that induces the model to target an output of a desired length, coarsely divided into discrete bins. [18] augmented the decoder with an additional length countdown input, a single scalar that counts down until the generation reaches the desired length.

Different from the length marker or length countdown input, to induce our model to output a compressed sequence of the desired length, we use beam search during generation to find the sequence $Y$ that maximizes a score function $s(X, Y)$ given a trained ESC model and an input sentence $X$. Length normalization is introduced to account for the fact that we have to compare hypotheses of different lengths. Without some form of length normalization $lp$, beam search will favor shorter sequences over longer ones on average, since a negative log-probability is added at each step, yielding lower (more negative) scores for longer sentences. Moreover, a coverage penalty $cp$ is added to favor sequences that cover the source sentence meaning as much as possible according to the attention weights [21]:

\[ s(X, Y) = \frac{\log P(Y \mid X)}{lp(Y)} + cp(X; Y) \tag{1} \]

\[ lp(Y) = \frac{(5 + |Y|)^{\alpha}}{(5 + 1)^{\alpha}} \tag{2} \]

\[ cp(X; Y) = \beta \sum_{i=1}^{|X|} \log\Big(\min\Big(\sum_{j=1}^{|Y|} p_{i,j},\ 1.0\Big)\Big) \tag{3} \]

where $p_{i,j}$ is the attention probability of the $j$-th target word on the $i$-th source word. Parameters $\alpha$ and $\beta$ control the strength of the length normalization and the coverage penalty. Although $\alpha$ can be used to control the compression ratio softly, we use a compression ratio $\gamma$ to impose a hard limit on the maximum decoding length: when the decoding length $|Y|$ is greater than $\gamma |X|$, decoding stops.
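To make the scoring concrete, the following is a minimal sketch of a beam-search score with length normalization and a coverage penalty in the style of [21]. It is an illustration under stated assumptions rather than the authors' code: the function names, the attention layout (`attention[j][i]` is the weight of target word `j` on source word `i`), and the small clamp `eps` that avoids `log(0)` are all choices made here.

```python
import math


def length_penalty(target_len, alpha):
    """Length normalization term; equals 1.0 for a one-word hypothesis."""
    return ((5 + target_len) ** alpha) / ((5 + 1) ** alpha)


def coverage_penalty(attention, beta, eps=1e-9):
    """Penalize source words that received little total attention.

    `attention[j][i]` is the attention of target word j on source word i.
    `eps` is an assumption added here to keep log() defined.
    """
    num_source = len(attention[0])
    total = 0.0
    for i in range(num_source):
        covered = sum(row[i] for row in attention)
        total += math.log(max(min(covered, 1.0), eps))
    return beta * total


def hypothesis_score(log_prob, attention, alpha=0.5, beta=0.2):
    """Score of one hypothesis; higher is better."""
    return (log_prob / length_penalty(len(attention), alpha)
            + coverage_penalty(attention, beta))


def max_decode_length(source_len, ratio):
    """Hard cap on output length implied by the compression ratio."""
    return math.floor(ratio * source_len)


if __name__ == "__main__":
    full = [[1.0, 0.0], [0.0, 1.0]]      # both source words covered
    partial = [[1.0, 0.0], [1.0, 0.0]]   # second source word ignored
    print(hypothesis_score(-2.0, full), hypothesis_score(-2.0, partial))
    print(max_decode_length(10, 0.5))
```

The default values 0.5 and 0.2 mirror the length normalization and coverage penalty settings reported in the experimental setup; the hard cap is applied separately from the score, as a stopping condition on decoding.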

(a) The architecture of proposed BSFNMT model.
(b) The architecture of proposed BTFNMT model.

3 NMT with ESC

In this section, we first introduce the Transformer networks for machine translation. Then, based on the fusion position of the backbone knowledge sequence, we propose three novel translation models: the backbone source-side fusion based NMT model (as shown in Figure 1(a)), the backbone target-side fusion based NMT model (as shown in Figure 1(b)), and the backbone both-side fusion based NMT model. All of these models can make use of the source backbone knowledge generated by our sentence compression models.

3.1 Transformer Networks

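A Transformer NMT model consists of an encoder and a decoder, which fully rely on self-attention networks (SANs), to translate a sentence in one language into another language with equivalent meaning. Formally, an input sentence $x = \{x_1, \dots, x_J\}$ of length $J$ is first mapped into a sequence of word vectors. The sequence and its position embeddings are then added to form the input representation $v_x = \{v_1, \dots, v_J\}$. The sequence $v_x$ is packed into a query matrix $Q$, a key matrix $K$, and a value matrix $V$. For the SAN-based encoder, the self-attention sub-layer is first performed over $Q$, $K$, and $V$ to produce the output matrix:

\[ \mathrm{SelfAtt}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{model}}}\right) V \tag{4} \]

where $d_{model}$ represents the dimensions of the model. Similarly, the translated target words are used to generate the decoder hidden state $s_i$ at the current time-step $i$. Generally, the self-attention function is further refined as multi-head self-attention to jointly consider information from different representation subspaces at different positions:

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_n = \mathrm{SelfAtt}(Q W_n^Q, K W_n^K, V W_n^V) \tag{5} \]

where the projections are parameter matrices $W_n^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_n^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_n^V \in \mathbb{R}^{d_{model} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{model}}$. For example, there are $h = 8$ heads, $d_{model}$ is 512, and $d_k = d_v = 512/8 = 64$. A position-wise feed-forward network (FFN) layer is applied over the output of multi-head self-attention, and the result is added to the matrix $V$ to generate the final source representation $H_x = \{H^x_1, \dots, H^x_J\}$:

\[ H_x = \mathrm{FFN}(\mathrm{MultiHead}(Q, K, V)) + V \tag{6} \]

The SAN of the decoder then uses both $H_x$ and the target context hidden state $s_i$ to learn the context vector $c_i$ by “encoder-decoder attention”:

\[ c_i = \mathrm{FFN}(\mathrm{MultiHead}(s_i, H_x, H_x)) \tag{7} \]

Finally, the context vector $c_i$ is used to compute translation probabilities of the next target word $y_i$ by a linear, potentially multi-layered function:

\[ P(y_i \mid y_{<i}, x) \propto \exp\big(L_o\, \phi(L_w c_i)\big) \tag{8} \]

where $L_o$ and $L_w$ are projection matrices and $\phi$ denotes the function applied between them.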
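As a small illustration of the self-attention sub-layer described in this section, here is a single-head sketch in plain Python. It is a simplification under stated assumptions: matrices are lists of rows, there are no learned projections or multiple heads, and the scaling uses the model dimension as the text describes. The helper names `softmax` and `self_attention` are chosen here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V, d_model):
    """Compute softmax(Q K^T / sqrt(d_model)) V for list-of-rows matrices."""
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_model).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_model)
                  for k in K]
        weights = softmax(scores)
        # Each output row is a weighted average of the value rows.
        outputs.append([sum(w * v[col] for w, v in zip(weights, V))
                        for col in range(len(V[0]))])
    return outputs


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    for row in self_attention(X, X, X, d_model=2):
        print(row)
```

Multi-head attention would run this routine once per head on linearly projected copies of the inputs and concatenate the results before the output projection.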

3.2 Backbone Source-side Fusion based NMT

In the backbone source-side fusion based NMT (BSFNMT) model, given an input sentence $x = \{x_1, \dots, x_J\}$, there is an additional compressed sequence $x_c = \{x^c_1, \dots, x^c_K\}$ of length $K$ generated by the proposed sentence compression model. This compressed sequence is also input to the SAN shared with the original encoder, with word vectors in a shared vocabulary, to learn its final representation $H_c = \{H^c_1, \dots, H^c_K\}$. In the proposed BSFNMT model, we introduce an additional multi-head attention layer to fuse the compressed sentence and the original input sentence for learning a more effective source representation.

Specifically, for the multi-head attention-fusion layer, a compressed sentence-specific context representation $\hat{H}_x$ is computed by multi-head attention on the original sentence representation $H_x$ and the compressed sentence representation $H_c$:

\[ \hat{H}_x = \mathrm{MultiHead}(H_x, H_c, H_c) \tag{9} \]

$H_x$ and $\hat{H}_x$ are added to form a fusion source representation $H'_x$:

\[ H'_x = H_x + \hat{H}_x \tag{10} \]

Finally, $H'_x$ instead of $H_x$ is input to Eq. (7) for predicting the target translation word by word.
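The fusion step can be sketched as follows, assuming a single attention head without learned projections (a simplification of the multi-head layer described above). The original representation provides the queries and the compressed representation provides the keys and values, so the attended context has one row per original source position and can be added to it. The names `attend` and `fuse_source` are hypothetical.

```python
import math


def attend(queries, keys, values):
    """Single-head dot-product attention over list-of-rows matrices."""
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        outputs.append([sum(e / total * v[col] for e, v in zip(exps, values))
                        for col in range(len(values[0]))])
    return outputs


def fuse_source(original, compressed):
    """Add a compressed-sentence context to every original source position."""
    # Queries come from the original sentence; keys and values come from the
    # compressed sentence, so the context has len(original) rows.
    context = attend(original, compressed, compressed)
    return [[h + c for h, c in zip(h_row, c_row)]
            for h_row, c_row in zip(original, context)]


if __name__ == "__main__":
    H_x = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]   # three source positions
    H_c = [[1.0, 2.0]]                            # one compressed position
    print(fuse_source(H_x, H_c))
```

Because the fused representation keeps the length of the original sentence, the decoder can consume it exactly as it would the unfused encoder output.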

3.3 Backbone Target-side Fusion based NMT

In the backbone target-side fusion based NMT (BTFNMT) model, the original sentence and its compressed version are again represented as $H_x$ and $H_c$, respectively, by the shared SANs. We then use the tuple $(H_x, H_c)$ instead of the source-side fusion representation $H'_x$ as the input to the decoder. Specifically, we introduce an additional “encoder-decoder attention” module into the decoder to learn the compressed sequence context $\hat{c}_i$ at the current time-step $i$:

\[ \hat{c}_i = \mathrm{FFN}(\mathrm{MultiHead}(s_i, H_c, H_c)) \tag{11} \]

Since the original sentence and the compressed sentence are treated as two independent source contexts when encoding at the source side, we use a context gate $g_i$ to integrate the two: the original context $c_i$ and the compressed context $\hat{c}_i$. The gate is calculated by:

\[ g_i = \sigma\big(W_g [c_i; \hat{c}_i] + b_g\big) \tag{12} \]

Therefore, the final target fusion context $\tilde{c}_i$ is:

\[ \tilde{c}_i = g_i \odot c_i + (1 - g_i) \odot \hat{c}_i \tag{13} \]

where $\sigma$ is the logistic sigmoid function, $\odot$ is point-wise multiplication, $[\cdot\,;\cdot]$ represents the concatenation operation, and $W_g$ and $b_g$ are the gate parameters.

The context $\tilde{c}_i$ replaces $c_i$ as the input to Eq. (8) to compute the probabilities of the next target word.
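A minimal sketch of such a context gate is given below. It assumes the common formulation in which a sigmoid gate is computed from the concatenation of the two context vectors and then interpolates between them element-wise; the weight matrix `W_g`, the bias `b_g`, and the function names are hypothetical and stand in for learned parameters.

```python
import math


def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def context_gate(original_ctx, compressed_ctx, W_g, b_g):
    """Fuse two context vectors with an element-wise gate.

    W_g has one row per output dimension and 2 * dim columns, because the
    gate is computed from the concatenation of both contexts.
    """
    concat = list(original_ctx) + list(compressed_ctx)
    gate = [sigmoid(sum(w * x for w, x in zip(row, concat)) + bias)
            for row, bias in zip(W_g, b_g)]
    # Point-wise interpolation: gate near 1 keeps the original context,
    # gate near 0 keeps the compressed context.
    return [g * a + (1.0 - g) * b
            for g, a, b in zip(gate, original_ctx, compressed_ctx)]


if __name__ == "__main__":
    c = [1.0, 3.0]
    c_hat = [3.0, 5.0]
    zero_weights = [[0.0] * 4, [0.0] * 4]
    print(context_gate(c, c_hat, zero_weights, [0.0, 0.0]))
```

With all-zero parameters the gate is 0.5 everywhere and the fused context is the plain average of the two inputs; training would move the gate toward whichever context is more useful at each time-step and dimension.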

3.4 Backbone Both-side Fusion based NMT

In the backbone both-side fusion based NMT (BBFNMT) model, we combine BSFNMT and BTFNMT. Both the original representation $H_x$ and its compression-enhanced representation $H'_x$ are used as input to the decoder. Similarly, we introduce an additional “encoder-decoder attention” module into the decoder to learn the compression-enhanced context $\bar{c}_i$ at the current time-step $i$:

\[ \bar{c}_i = \mathrm{FFN}(\mathrm{MultiHead}(s_i, H'_x, H'_x)) \tag{14} \]

Then, the same context gate as in BTFNMT is applied to combine the two contexts $c_i$ and $\bar{c}_i$.

4 Experiments

4.1 Setup

Sentence Compression

To evaluate the quality of our sentence compression model, we used the Annotated Gigaword corpus [22] as the benchmark [23]. The data includes approximately 3.8 M training samples, 400 K validation samples, and 2 K test samples. The byte pair encoding (BPE) algorithm [24] was adopted for subword segmentation, and the vocabulary size was set at 40 K for our supervised, unsupervised and semi-supervised settings [25].

Baseline systems include AllText and F8W [23, 26]. F8W is simply the first 8 words of the input, and AllText uses the whole text as the compression output. ROUGE-1 (R-1), ROUGE-2 (R-2), and ROUGE-L (R-L) scores were used to evaluate this task [27]. We use beam search with a beam size of 5, a length normalization of 0.5, and a coverage penalty of 0.2.
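For illustration, the two baselines and a simplified unigram-overlap score can be sketched as below. The score is an F1 over unigram counts, a stand-in for ROUGE-1 and not the official ROUGE toolkit used for the reported numbers; the function names are chosen here for illustration, and the example sentences come from Table I.

```python
from collections import Counter


def f8w(tokens):
    """F8W baseline: keep the first 8 words of the input."""
    return tokens[:8]


def all_text(tokens):
    """AllText baseline: return the whole input unchanged."""
    return list(tokens)


def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1 based on clipped unigram overlap."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = ("Both the US authorities and the Mexican security forces are "
                "engaged in an ongoing battle against the drug cartels").split()
    reference = ("US authorities and Mexican forces battle against "
                 "drug cartels").split()
    print(unigram_f1(f8w(sentence), reference))
    print(unigram_f1(all_text(sentence), reference))
```

AllText tends to score high on recall but low on precision, while F8W discards everything after the eighth word, which is why both serve only as reference points for the learned compression models.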

For the semi-supervised setting, in order to make the results comparable to [14], we used the same 190M English monolingual unpaired sentences from the WMT News Crawl datasets for pre-training (unsupervised training). We also included other pre-training methods, namely masked language modeling (MLM, BERT) [28], denoising auto-encoder (DAE) [29], and masked sequence to sequence (MASS) [14], to compare with our unsupervised pre-training method in the semi-supervised setting.

Machine Translation

The proposed NMT model was evaluated on the WMT14 English-to-German (EN-DE) and English-to-French (EN-FR) tasks, which are both standard large-scale corpora for NMT evaluation. For the EN-DE translation task, 4.43 M bilingual sentence pairs from the WMT14 dataset were used as training data, including Common Crawl, News Commentary, and Europarl v7. The newstest2013 and newstest2014 datasets were used as the dev set and test set, respectively. For the EN-FR translation task, 36 M bilingual sentence pairs from the WMT14 dataset were used as training data. Newstest12 and newstest13 were combined for validation and newstest14 was the test set, following the setting of [30]. The BPE algorithm [24] was also adopted, and the joint vocabulary size was set at 40 K. For the hyper-parameters of our Transformer (base/large) models, we followed the settings used in [1].

In addition, we also report state-of-the-art results from the recent literature, including modelling local dependencies (Localness [31]), fusing multiple-layer representations in SANs (Context-Aware [32]), and fusing all global context representations in SANs (global-deep context [33]). MultiBLEU was used to evaluate the translation task.

4.2 Main Results

Sentence Compression

Model R-1 R-2 R-L
Baselines
AllText 28.91 10.22 25.08
F8W 26.90 9.65 25.19
Unsupervised
[18] 28.42 7.82 24.95
ESC (This work) 31.37 8.25 28.01
Supervised
RNN-based Seq2seq 35.50 15.54 32.45
[12] 34.97 17.17 32.70
ESC (This work) 37.53 18.48 34.79
Semi-supervised
MLM Pretraining 37.75 18.45 34.85
DAE Pretraining 35.97 17.17 33.14
[14] 38.73 19.71 35.96
ESC (This work) 39.54 20.35 36.79
TABLE II: Performance on the sentence compression task

To evaluate the quality of our sentence compression model, we conducted a horizontal comparison between the proposed sentence compression model and other sentence compression models in different settings. Table II shows the comparison results. We observed that the proposed unsupervised ESC model performed substantially better than the unsupervised method of [18] on all three metrics. The proposed supervised ESC model also substantially outperformed the RNN-based Seq2seq and the baseline method of [12]. That is, our supervised model gave improvements of at least +2.0 on the R-1, R-2, and R-L scores over the RNN-based Seq2seq. This means that the proposed Transformer-based approaches can generate compressed sentences of high quality.

We further compared our semi-supervised model with the semi-supervised pre-training methods of MLM [28], DAE [29], and MASS [14]. Our unsupervised pre-training method consistently outperformed the other unsupervised pre-training methods on the sentence compression task.

System EN-DE #Speed #Params EN-FR #Speed #Params
Existing NMT systems
Transformer (base) [1] 27.3 N/A 65.0M 38.1 N/A N/A
   +Localness [31] 28.11 N/A 88.8M N/A N/A N/A
   +Context-Aware SANs [32] 28.26 N/A 194.9M N/A N/A N/A
   +global-deep context [33] 28.58 N/A 111M N/A N/A N/A
Transformer (big) [1] 28.4 N/A 213.0M 41.0 N/A N/A
   +Localness [31] 28.89 N/A 267.4M N/A N/A N/A
   +Context-Aware SANs [32] 28.89 N/A 339.6M N/A N/A N/A
   +global-deep context [33] 29.21 N/A 396M N/A N/A N/A
Our NMT systems
Transformer (base) 27.24 131k 66.5M 38.21 130k 85.7M
BSFNMT 27.75++ 121k 72.1M 39.09++ 120k 89.0M
BTFNMT 28.14+ 120k 72.7M 39.22++ 119k 89.8M
BBFNMT 28.35++ 119k 78.6M 39.40++ 116k 91.4M
Transformer (big) 28.23 11k 221.0M 41.15 11k 222.3M
BSFNMT 28.52+ 10k 225.2M 41.92+ 9k 227.1M
BTFNMT 29.16++ 9k 225.7M 42.22++ 8k 227.5M
BBFNMT 29.37++ 8k 228.9M 42.52++ 8k 230.3M
TABLE III: Comparison with existing NMT systems on WMT14 EN-DE and EN-FR Translation Tasks. “++/+” after a BLEU score indicates that the proposed method was significantly better than the corresponding baseline Transformer (base or big) at significance level p < 0.01/0.05. “#Speed” denotes the decoding speed measured in target tokens per second.

Machine Translation

According to the results in Table II, we chose the semi-supervised ESC model (which performed the best) to generate compressed sentences for the machine translation task. The main results on the WMT14 EN-DE and EN-FR translation tasks are shown in Table III. In the EN-DE task, we made the following observations:

1) The baseline Transformer (base) in this work achieved a performance comparable to the original Transformer (base) [1]. This indicates that it is a strong baseline NMT system.

2) BSFNMT, BTFNMT, and BBFNMT all significantly outperformed the baseline Transformer (base/big) while introducing only a small number of extra parameters. This indicates that the learned compressed backbone information was beneficial for the Transformer translation system.

3) Among the proposed three methods, BTFNMT performed better than BSFNMT. This indicates that the backbone fusion at the target-side is better than at the source-side. In addition, BBFNMT (base/big) outperformed the comparison systems +Localness and +Context-Aware SANs. This indicates that the compression knowledge as an additional context can enhance NMT better.

4) BBFNMT (base) is comparable to +global-deep context, the best comparison system, while BBFNMT (big) slightly outperformed +global-deep context in BLEU score. In particular, the BBFNMT (base/big) model, which added only 12.1M/7.9M parameters over our Transformer (base/big) according to Table III, had only about 70% of the parameters of the +global-deep context model. This suggests that the BBFNMT model is more parameter-efficient than the +global-deep context model. In addition, the speed of the proposed models decreased slightly compared to the corresponding baselines.

5) The proposed BBFNMT (base) slightly outperformed our Transformer (big), which contains many more parameters than BBFNMT (base). This indicates that our improvement is unlikely to be due to the increased number of parameters.

For the EN-FR translation task, the proposed models gave similar improvements over the baseline systems and the comparison methods (except that the Transformer (big) performed much better than the Transformer (base)). These results show that our method is robust for improving the translation of other language pairs.

4.3 Ablation Study

Evaluating Sentence Compression

To demonstrate the effectiveness of sentence compression, we compared the Transformer translation system (BBFNMT) using compressed sentences generated under different settings: AllText, F8W, RandSample (random sampling), supervised ESC, unsupervised ESC, and semi-supervised ESC. Table IV shows the results on newstest2014 for the EN-DE translation task.

Model BLEU on EN-DE
Baseline 27.24
   +AllText 27.24
   +F8W 27.40
   +RandSample 26.53
   +Supervised ESC 27.80
   +Unsupervised ESC 27.97
   +Semi-supervised ESC 28.35
TABLE IV: The effect of our ESC methods.

We made the following observations: 1) Simply introducing AllText or F8W yielded little or no improvement, and RandSample scored lower than the baseline. In comparison, the +supervised ESC, +unsupervised ESC, and +semi-supervised ESC models all substantially improved performance over the baseline Transformer (base). This means that our ESC method provides richer source information for machine translation tasks.

2) +Unsupervised ESC gained a larger improvement than +supervised ESC, although the supervised ESC model achieves higher quality than the unsupervised ESC model on the benchmark test dataset. This may be because the annotated sentence compression training data is in a different domain from the WMT EN-DE training data. Meanwhile, +Semi-supervised ESC with annotated data fine-tuning outperformed both +Unsupervised and +supervised ESC.

Effect of Encoder Parameters

In our model, representations of the original sentence and its compressed version were learned by a shared encoder. To explore the effect of the encoder parameters, we also designed a BBFNMT with two independent encoders to learn representations of the original sentence and its compressed version, respectively. Table V shows results on the newstest2014 test set for the WMT14 EN-DE translation task.

Model BLEU #Params
Transformer (base) 27.24 66.4M
BBFNMT w/ Shared encoder 28.35 78.6M
BBFNMT w/ Independent encoders 28.50 91.6M
TABLE V: The effect of encoder parameters.

BBFNMT with independent encoders slightly outperformed the proposed shared-encoder model by a BLEU score of 0.15, but its parameters increased by approximately 30%. In contrast, the parameters in our model are comparable to the baseline Transformer (base). Considering the parameter scale, we adopted a shared encoder to learn the source representation, which makes it easy to verify the effectiveness of additional translation knowledge, such as our backbone knowledge.

Evaluating Compression Ratio

To examine the impact of different compression ratios on translation quality, we conducted experiments on the EN-DE translation task with semi-supervised sentence compression in the BBFNMT model.

Fig. 1: Performances on EN-DE newstest2014 with different sentence compression ratios (x-axis: compression ratio; y-axis: BLEU score).

We varied the compression ratio from 0 to 1.0. Consider the two boundary conditions. When the compression ratio is 0, no compressed sequence is generated, which is the same as the vanilla Transformer. When the compression ratio is 1.0, it is equivalent to re-paraphrasing the source sentence using the sentence compression model (maintaining the same length) as the additional input for BBFNMT.

The experimental results are shown in Fig. 1. In our experiments, sentence compression (re-paraphrasing) brings a performance improvement. Even when the compression ratio is 1.0 and the sentence length is not shortened, re-paraphrasing still brings a slight improvement in translation quality. On the WMT14 EN-DE translation task, the compression ratio was set to 0.6 to get the best results.

5 Related Work

To let the translation focus more on the source sentence information, efforts have been made to exploit sentence segmentation, sentence simplification, and sentence compression for machine translation. [36] presented an approach to integrating sentence skeleton information into a phrase-based statistical machine translation system. [37] proposed an approach to modeling the syntactically motivated skeletal structure of the source sentence for statistical machine translation. [34] describe an early approach to skeleton-based translation, which decomposes input sentences into syntactically meaningful chunks. The central part of the sentence is identified and remains unaltered while other parts of the sentence are simplified. This process produces a set of partial, potentially overlapping translations which are recombined to form the final translation. [35] describe a “divide and translate” approach to dealing with complex input sentences. They parse the input sentences, replace subclauses with placeholders, and later substitute them with separately translated clauses. Their method requires training translation models on clause-level aligned parallel data with placeholders in order for the translation model to deal with the placeholders correctly. [38] experimented with automatically segmenting the source sentence to overcome problems with overly long sentences. [39] showed that the spaces of original and simplified translations can be effectively combined using translation lattices and compared two decoding approaches to process both inputs at different levels of integration.

Different from these works, our proposed sentence compression model does not rely on any linguistically motivated (e.g., syntactic) skeleton simplification, but directly trains a computationally motivated sentence compression model that learns to compress and re-paraphrase sentences in a seq2seq model. Though purely computational in origin, our sentence compression model can, surprisingly, generate more grammatically correct and refined sentences, and the words in the compressed sentence do not have to be the same as those in the original sentence. Moreover, our sentence compression model can stably provide a source backbone representation, free from the unstable performance of a syntactic parser, which is essential for syntactic skeleton simplification. Our sentence compression model can be trained without supervision on large-scale datasets and then fine-tuned on supervised data, which the results show to be more promising.

6 Conclusion and Future work

To give a more focused source representation, this paper makes the first attempt to propose an explicit sentence compression method to enhance state-of-the-art Transformer-based NMT. To demonstrate that the proposed sentence compression enhancement is indeed helpful for neural machine translation, we evaluated the impact of the proposed model on the large-scale WMT14 English-to-German and English-to-French translation tasks. The experimental results on both tasks show that our proposed NMT model yields significantly improved results over strong baseline translation systems. In future work, we will release a pre-trained language model that uses unsupervised sentence compression as the pre-training objective to demonstrate the performance of unsupervised sentence compression in representation learning.


The corresponding authors are Rui Wang and Hai Zhao. Zuchao Li and Zhuosheng Zhang were internship research fellows at NICT when conducting this work. Hai Zhao was partially supported by National Key Research and Development Program of China (No. 2017YFB0304100) and Key Projects of National Natural Science Foundation of China (No. U1836222 and No. 61733011). Rui Wang was partially supported by JSPS grant-in-aid for early-career scientists (19K20354): “Unsupervised Neural Machine Translation in Universal Scenarios” and NICT tenure-track researcher startup fund “Toward Intelligent Machine Translation”.


References

  • [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017.
  • [2] O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow, P. Koehn, and C. Monz, “Findings of the 2018 conference on machine translation (WMT18),” in WMT, 2018.
  • [3] Z.-Y. Dou, Z. Tu, X. Wang, S. Shi, and T. Zhang, “Exploiting deep representations for neural machine translation,” in EMNLP, 2018.
  • [4] X. Wang, Z. Tu, L. Wang, and S. Shi, “Exploiting sentential context for neural machine translation,” in ACL, 2019.
  • [5] B. Yang, J. Li, D. F. Wong, L. S. Chao, X. Wang, and Z. Tu, “Context-aware self-attention networks,” 2019.
  • [6] K. Knight and D. Marcu, “Summarization beyond sentence extraction: A probabilistic approach to sentence compression,” AI, 2002.
  • [7] W. Che, Y. Zhao, H. Guo, Z. Su, and T. Liu, “Sentence compression for aspect-based sentiment analysis,” TASLP, 2015.
  • [8] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” in EMNLP, 2015.
  • [9] B. Hu, Q. Chen, and F. Zhu, “LCSTS: A large scale Chinese short text summarization dataset,” in EMNLP, 2015.
  • [10] S. Chopra, M. Auli, and A. M. Rush, “Abstractive sentence summarization with attentive recurrent neural networks,” in NAACL:HLT, 2016.
  • [11] J. Cheng and M. Lapata, “Neural summarization by extracting sentences and words,” in ACL, 2016.
  • [12] R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang, “Abstractive text summarization using sequence-to-sequence RNNs and beyond,” in CoNLL, 2016.
  • [13] X. Duan, M. Yin, M. Zhang, B. Chen, and W. Luo, “Zero-shot cross-lingual abstractive sentence summarization through teaching generation and attention,” in ACL, 2019.
  • [14] K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, “MASS: Masked sequence to sequence pre-training for language generation,” in ICML, 2019.
  • [15] M. Artetxe, G. Labaka, E. Agirre, and K. Cho, “Unsupervised neural machine translation,” in ICLR, 2018.
  • [16] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato, “Unsupervised machine translation using monolingual corpora only,” 2018.
  • [17] G. Lample, M. Ott, A. Conneau, L. Denoyer, and M. Ranzato, “Phrase-based & neural unsupervised machine translation,” in EMNLP, 2018.
  • [18] T. Fevry and J. Phang, “Unsupervised sentence compression using denoising auto-encoders,” in CoNLL, 2018.
  • [19] Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura, “Controlling output length in neural encoder-decoders,” in EMNLP, 2016.
  • [20] A. Fan, D. Grangier, and M. Auli, “Controllable abstractive summarization,” in WNMT, 2018.
  • [21] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
  • [22] C. Napoles, M. Gormley, and B. Van Durme, “Annotated Gigaword,” in AKBC-WEKEX, 2012.
  • [23] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” in EMNLP, 2015.
  • [24] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in ACL, 2016.
  • [25] Z. Zhang, H. Zhao, K. Ling, J. Li, Z. Li, S. He, and G. Fu, “Effective subword segmentation for text comprehension,” in TASLP, 2019.
  • [26] Y. Wang and H.-y. Lee, “Learning to encode text as human-readable summaries using generative adversarial networks,” in EMNLP, 2018.
  • [27] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in ACL, 2004.
  • [28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [29] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in ICML, 2008.
  • [30] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, “Convolutional sequence to sequence learning,” in ICML, 2017.
  • [31] B. Yang, Z. Tu, D. F. Wong, F. Meng, L. S. Chao, and T. Zhang, “Modeling localness for self-attention networks,” in EMNLP, 2018.
  • [32] B. Yang, J. Li, D. F. Wong, L. S. Chao, X. Wang, and Z. Tu, “Context-aware self-attention networks,” CoRR, vol. abs/1902.05766, 2019.
  • [33] Z.-Y. Dou, Z. Tu, X. Wang, S. Shi, and T. Zhang, “Exploiting deep representations for neural machine translation,” in EMNLP, 2018.
  • [34] B. Mellebeek, K. Owczarzak, D. Groves, J. Van Genabith, and A. Way, “A syntactic skeleton for statistical machine translation,” 2006.
  • [35] K. Sudoh, K. Duh, H. Tsukada, T. Hirao, and M. Nagata, “Divide and translate: improving long distance reordering in statistical machine translation,” in WMT, 2010.
  • [36] T. Xiao, J. Zhu, and C. Zhang, “A hybrid approach to skeleton-based translation,” in ACL, 2014.
  • [37] T. Xiao, J. Zhu, C. Zhang, and T. Liu, “Syntactic skeleton-based translation,” in AAAI, 2016.
  • [38] J. Pouget-Abadie, D. Bahdanau, B. van Merrienboer, K. Cho, and Y. Bengio, “Overcoming the curse of sentence length for neural machine translation using automatic segmentation,” in SSST-8, 2014.
  • [39] E. Hasler, A. de Gispert, F. Stahlberg, A. Waite, and B. Byrne, “Source sentence simplification for statistical machine translation,” CSL, 2017.