The Neural Noisy Channel

11/08/2016 ∙ by Lei Yu, et al. ∙ Google ∙ University of Oxford

We formulate sequence to sequence transduction as a noisy channel decoding problem and use recurrent neural networks to parameterise the source and channel models. Unlike direct models which can suffer from explaining-away effects during training, noisy channel models must produce outputs that explain their inputs, and their component models can be trained with not only paired training samples but also unpaired samples from the marginal output distribution. Using a latent variable to control how much of the conditioning sequence the channel model needs to read in order to generate a subsequent symbol, we obtain a tractable and effective beam search decoder. Experimental results on abstractive sentence summarisation, morphological inflection, and machine translation show that noisy channel models outperform direct models, and that they significantly benefit from increased amounts of unpaired output data that direct models cannot easily use.


1 Introduction

Recurrent neural network sequence to sequence models (kalchbrenner:2013; sutskever:2014; bahdanau:2015) are excellent models of $p(y \mid x)$, provided sufficient input–output pairs are available for estimating their parameters. However, in many domains, vastly more unpaired output examples are available than input–output pairs (e.g., transcribed speech is relatively rare although non-spoken texts are abundant; Swahili–English translations are rare although English texts are abundant; etc.). A classic strategy for exploiting both kinds of data is to use Bayes' rule to rewrite $p(y \mid x)$ as $p(x \mid y)\,p(y)/p(x)$, a factorisation which is called a noisy channel model (shannon:1948). A noisy channel model thus consists of two component models: the conditional channel model, $p(x \mid y)$, which characterizes the reverse transduction problem and whose parameters are estimated from the paired samples, and the unconditional source model, $p(y)$, whose parameters are estimated from both the paired and (usually much more numerous) unpaired samples.1

1We do not model $p(x)$ since, in general, we will be interested in finding $\arg\max_y p(y \mid x)$, and $\arg\max_y p(y \mid x) = \arg\max_y p(x \mid y)\,p(y)$.
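To make the factorisation concrete, the following minimal Python sketch scores a candidate output under the noisy channel objective; the two callables stand in for the channel and source models and are assumptions for illustration, not part of the paper.

```python
def noisy_channel_score(x, y, channel_logprob, source_logprob):
    """Score a candidate output y for input x as log p(x | y) + log p(y).

    `channel_logprob(x, y)` and `source_logprob(y)` are assumed wrappers
    around the two component models (e.g. a channel model and an RNN
    language model); they are illustrative, not the authors' API.
    """
    return channel_logprob(x, y) + source_logprob(y)

# Decoding then prefers outputs that both explain the input (channel term)
# and are a priori fluent (source term):
# y_star = max(candidates, key=lambda y: noisy_channel_score(x, y, cm, lm))
```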

Beyond their data omnivorousness, noisy channel models have other benefits. First, the two component models mean that two different aspects of the transduction problem can be addressed independently. For example, in many applications, source models are language models and innovations in these can be leveraged to obtain improvements in any system that uses them as a component. Second, the component models can have complementary strengths, since inference is carried out in the product space; this simplifies design because a single model does not have to get everything perfectly right. Third, the noisy channel operates by selecting outputs that both are a priori likely and that explain the input well. This addresses a failure mode that can occur in conditional models in which inputs are “explained away” by highly predictive output prefixes, resulting in poor training (klein:2001). Since the noisy channel formulation requires its outputs to explain the observed input, this problem is avoided.

In principle, the noisy channel decomposition is straightforward; however, in practice, decoding (i.e., computing $\arg\max_y p(x \mid y)\,p(y)$) is a significant computational challenge, and tractability concerns impose restrictions on the form the component models can take. To illustrate, an appealing parameterization would be to use an attentional seq2seq network (bahdanau:2015) to model the channel probability $p(x \mid y)$. However, seq2seq models are designed under the assumption that the complete conditioning sequence is available before any prefix probabilities of the output sequence can be computed. This assumption is problematic for channel models since it means that a complete output sequence must be constructed before the channel model can be evaluated (since the channel model conditions on the output). Therefore, to be practical, the channel probability must decompose in terms of prefixes of the conditioning variable, i.e., it must be possible to score the input given only a partial output prefix $y_1^j$. While the chain rule justifies decomposing output variable probabilities in terms of successive extensions of a partial prefix, no such convenience exists for conditioning variables, and approximations must be introduced.

In this work, we use a variant of the newly proposed online seq2seq model of yu:2016 which uses a latent alignment variable to enable its probabilities to factorize in terms of prefixes of both the input and output, making it an appropriate channel model (§2). Using this channel model, the decoding problem then becomes similar to the problem faced when decoding with direct models (§3). Experiments on abstractive summarization, machine translation, and morphological inflection show that the noisy channel can significantly improve performance and exploit unpaired output training samples and that models that combine the direct model and a noisy channel model offer further improvements still (§4).

2 Background: Segment to Segment Neural Transduction

Our model is based on the Segment to Segment Neural Transduction model (SSNT) of Yu et al. (2016). At a high level, the model alternates between encoding more of the input sequence and decoding output tokens from the encoded representation. This presentation deviates from Yu et al.'s so as to emphasize the incremental construction of the conditioning context that is enabled by the latent variable.

2.1 Model description

Similar to other neural sequence to sequence models, SSNT models the conditional probability $p(y \mid x)$ of an output sequence $y = y_1, \ldots, y_J$ given an input sequence $x = x_1, \ldots, x_I$.

To avoid having to observe the complete input sequence before making a prediction of the beginning of the output sequence, we introduce a latent alignment variable $z$ which indicates when each token of the output sequence is to be generated as the input sequence is being read. Since we assume that the input is read just once from left to right, we restrict $z$ to be a monotonically increasing alignment (i.e., $z_{j+1} \ge z_j$ holds with probability 1), where $z_j = i$ denotes that the output token at position $j$ (i.e., $y_j$) is generated when the input sequence up through position $i$ (i.e., $x_1^i$) has been read. The SSNT model is:

$$p(y \mid x) = \sum_{z} p(y, z \mid x) \approx \sum_{z} \prod_{j=1}^{J} \underbrace{p(z_j \mid z_{j-1}, x_1^{z_j}, y_1^{j-1})}_{\text{alignment probability}} \; \underbrace{p(y_j \mid x_1^{z_j}, y_1^{j-1})}_{\text{word probability}} \qquad (1)$$

We explain the model in terms of its two components, starting with the word generation term. In the SSNT, the input and output sequences $x$, $y$ are encoded with two separate LSTMs (hochreiter1997long), resulting in sequences of hidden states representing prefixes of these sequences. In Yu et al.'s formulation, the input sequence encoder (i.e., the conditioning context encoder) can either be a unidirectional or bidirectional LSTM, but here we assume that it is a unidirectional LSTM, which ensures that it will function well as a channel model that can compute probabilities with incomplete conditioning contexts (this is necessary since, at decoding time, we will be constructing the conditioning context incrementally). Let $h_i$ represent the input sequence encoding for the prefix $x_1^i$. Since the final action at timestep $j$ will be to predict $y_j$, it is convenient to let $s_j$ denote the output encoding that excludes $y_j$, i.e., the encoding of the prefix $y_1^{j-1}$.

The probability of the next token $y_j$ is calculated by concatenating the aligned hidden state vectors $h_{z_j}$ and $s_j$, followed by a softmax layer,

$$p(y_j \mid x_1^{z_j}, y_1^{j-1}) \propto \exp\left(\mathbf{W}_w [h_{z_j}; s_j] + \mathbf{b}_w\right).$$

The model thus depends on the current alignment position $z_j$, which determines how far into $x$ it has read.
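As a concrete illustration of the word generation term, the short sketch below computes the softmax over the concatenated states; the parameter names `W_w` and `b_w` mirror the notation above but are assumptions for illustration, not code from the paper.

```python
import numpy as np

def word_distribution(h_z, s_j, W_w, b_w):
    """Next-token distribution p(y_j | x_1^{z_j}, y_1^{j-1}) from the aligned
    encoder state h_{z_j} and decoder state s_j (an illustrative sketch)."""
    logits = W_w @ np.concatenate([h_z, s_j]) + b_w
    logits -= logits.max()                 # subtract max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()   # softmax over the output vocabulary
```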

We now discuss how the sequence of $z_j$'s is generated. First, we remark that modelling this distribution requires some care so as to avoid conditioning on the entire input sequence. To illustrate why one might induce a dependency on the entire input sequence in this model, it is useful to compare to a standard attention model. Attention models operate by computing a score for each alignment candidate (in our case, the candidates would be every unread token remaining in the input). If we followed this strategy, it would be necessary to observe the full input sequence when making the first alignment decision.

We instead model the alignment transition from timestep $j$ to $j+1$ by decomposing it into a sequence of conditionally independent shift and emit operations that progressively decide whether to read another token or stop reading. That is, at input position $i$, the model decides to emit, i.e., to set $z_j = i$ and predict the next output token $y_j$ from the word model, or it decides to shift, i.e., to read one more input token and increment the input position $i \leftarrow i+1$. The probability $p(a_{i,j} = \text{emit} \mid x_1^i, y_1^{j-1})$ is calculated using the encoder and decoder states defined above as:

$$p(a_{i,j} = \text{emit} \mid x_1^i, y_1^{j-1}) = \sigma\left(\mathrm{MLP}(\mathbf{W}_t [h_i; s_j] + \mathbf{b}_t)\right).$$

The probability of shift is simply $1 - p(a_{i,j} = \text{emit} \mid x_1^i, y_1^{j-1})$. In this formulation, the probability of aligning to each alignment candidate $i$ can be computed by reading just $x_1^i$ (rather than the entire input sequence). The probabilities are also independent of the contents of the suffix $x_{i+1}^I$.

Using the probabilities of the auxiliary variables, the alignment probabilities needed in Eq. 1 are computed as:

$$p(z_j = i \mid z_{j-1}, x_1^i, y_1^{j-1}) = \begin{cases} 0 & \text{if } i < z_{j-1} \\ \left(\displaystyle\prod_{i'=z_{j-1}}^{i-1} p(a_{i',j} = \text{shift} \mid x_1^{i'}, y_1^{j-1})\right) p(a_{i,j} = \text{emit} \mid x_1^i, y_1^{j-1}) & \text{if } i \ge z_{j-1} \end{cases}$$
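For clarity, here is a minimal sketch of this computation, assuming the per-position emit/shift log-probabilities for the current output step have already been computed; the function and argument names are illustrative only.

```python
def alignment_log_prob(i, prev_i, emit_logp, shift_logp):
    """Log p(z_j = i | z_{j-1} = prev_i) under the shift/emit decomposition.

    `emit_logp[k]` and `shift_logp[k]` are assumed to hold the model's
    log-probabilities of emitting / shifting at input position k given the
    current output prefix (how they are produced, via the MLP over
    [h_k; s_j], is outside this sketch).
    """
    if i < prev_i:                     # alignments are monotone
        return float("-inf")
    # shift past positions prev_i, ..., i-1, then emit at position i
    return sum(shift_logp[k] for k in range(prev_i, i)) + emit_logp[i]
```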

2.2 Inference algorithms

In SSNT, the probability of generating each output token $y_j$ depends only on the current output position's alignment ($z_j$), the current output prefix ($y_1^{j-1}$), and the input prefix up to the current alignment ($x_1^{z_j}$). It does not depend on the history of the alignment decisions. Likewise, the alignment decisions at each position are also conditionally independent of the history of alignment decisions. Because of these independence assumptions, $z$ can be marginalised using a dynamic programming algorithm that fills in a chart of the following marginal probabilities:

$$\alpha(i, j) = p(z_j = i, \, y_1^j \mid x_1^i) = p(y_j \mid x_1^i, y_1^{j-1}) \sum_{k=1}^{i} \alpha(k, j-1)\, p(z_j = i \mid z_{j-1} = k, x_1^i, y_1^{j-1}).$$
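The forward recursion can be sketched in a few lines of Python; `word_logp` and `align_logp` are assumed callables wrapping the model's word and alignment probabilities (an illustrative sketch, not the authors' code).

```python
from math import exp, inf, log

def logsumexp(vals):
    m = max(vals)
    return -inf if m == -inf else m + log(sum(exp(v - m) for v in vals))

def marginal_log_prob(word_logp, align_logp, I, J):
    """Forward DP marginalising the alignment z (a sketch under assumed
    interfaces).  word_logp(i, j) ~ log p(y_j | x_1^i, y_1^{j-1});
    align_logp(i, k, j) ~ log p(z_j = i | z_{j-1} = k).  Returns log p(y | x)
    under the constraint that the full input has been read when the final
    output symbol is generated."""
    # alpha[i][j] = log p(z_j = i, y_1^j | x_1^i); positions are 1-based
    alpha = [[-inf] * (J + 1) for _ in range(I + 1)]
    for i in range(1, I + 1):
        # align_logp(i, 1, 1) stands in for the first-step alignment prior
        alpha[i][1] = align_logp(i, 1, 1) + word_logp(i, 1)
    for j in range(2, J + 1):
        for i in range(1, I + 1):
            alpha[i][j] = word_logp(i, j) + logsumexp(
                [alpha[k][j - 1] + align_logp(i, k, j) for k in range(1, i + 1)])
    return alpha[I][J]
```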

The model is trained to minimize the negative log likelihood of the parallel corpus $\mathcal{S}$:

$$\mathcal{L}(\theta) = -\sum_{(x, y) \in \mathcal{S}} \log p(y \mid x; \theta) = -\sum_{(x, y) \in \mathcal{S}} \log \sum_{z} p(y, z \mid x; \theta). \qquad (2)$$

The gradients of this objective with respect to the component probability models can be computed using automatic differentiation or using a secondary dynamic program that computes ‘backward’ probabilities. We refer the reader to Section 3.1 of yu:2016 for details.

In this paper, we use a slightly different objective from the one described in yu:2016. Rather than marginalizing over the paths that end in any possible input position $i \le I$, we require that the full input be consumed when the final output symbol is generated. This constraint biases away from predicting outputs without explaining them using the input sequence.

3 Decoding

We now turn to the problem of decoding, that is, of computing

$$\hat{y} = \arg\max_{y} p(x \mid y)\, p(y),$$

where we are using the SSNT model described in the previous section as the channel model $p(x \mid y)$ and a language model that delivers prior probabilities of the output sequence in left-to-right order, i.e., $p(y) = \prod_j p(y_j \mid y_1^{j-1})$.

Marginalizing the latent variable during search is computationally hard (simaan:1996), and so we approximate the search problem as

$$\hat{y} = \arg\max_{y} \max_{z} p(x, z \mid y)\, p(y).$$

However, even with this simplification, the search problem remains nontrivial. On one hand, we must search over the space of all possible outputs with a model that makes no Markovian assumptions. This is similar to the decoding problem faced in standard seq2seq transducers. On the other hand, our model computes the probability of the given input conditional on the predicted output hypothesis. Therefore, instead of just relying on a single softmax to provide a probability for every output word type (as we conveniently can in the direct model), we must loop over each output word type, and run a softmax over the input vocabulary—a computational expense that is quadratic in the size of the vocabulary!

To reduce this computational effort, we make use of an auxiliary direct model to explore probable extensions of partial hypotheses, rather than trying to perform an exhaustive search over the vocabulary each time we extend an item on the beam.

Algorithm 1, in Appendix A, describes the decoding algorithm based on a formulation by tillmann1997dp. The idea is to create a matrix of partial hypotheses. Each hypothesis in cell $(i, j)$ covers the first $i$ words of the input ($x_1^i$) and corresponds to an output hypothesis prefix of length $j$ ($y_1^j$). The hypothesis is associated with a model score. For each cell $(i, j)$, the direct proposal model first calculates the scores of possible extensions of previous cells that could then reach $(i, j)$ by considering every token in the output vocabulary, from all previous candidate cells $(i', j-1)$ with $i' \le i$. That gives the top $K_1$ partial output sequences. These partial output sequences are subsequently rescored by the noisy channel model, and the best $K_2$ candidates are kept in the beam and used for further extension. The beam sizes $K_1$ and $K_2$ are hyperparameters to be tuned in the experiments.
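As an illustration of this proposal-and-rescore procedure, the following minimal Python sketch mirrors the structure just described; the `vocab_scores` and `rescore` callables are assumed interfaces standing in for the direct model and the noisy channel objective, not the authors' implementation.

```python
import heapq

def noisy_channel_decode(x_len, vocab_scores, rescore, max_out_len, K1, K2):
    """Sketch of proposal-and-rescore beam decoding over cells (i, j).

    vocab_scores(hyp, i) -> iterable of (token, direct_logp) extensions of the
                            partial output `hyp` after reading the first i
                            input words (the auxiliary direct model).
    rescore(hyp, i)      -> combined objective for (x_1^i, hyp), e.g. the
                            channel score log p(x_1^i | hyp) plus log p(hyp).
    """
    beams = {(0, 0): [()]}                 # cell (i, j) -> partial outputs
    for j in range(1, max_out_len + 1):
        for i in range(1, x_len + 1):
            proposals = []
            for i_prev in range(i + 1):    # cells (i', j-1) that reach (i, j)
                for hyp in beams.get((i_prev, j - 1), []):
                    proposals.extend((d, hyp + (tok,))
                                     for tok, d in vocab_scores(hyp, i))
            # the direct model proposes the K1 most promising extensions ...
            top_direct = heapq.nlargest(K1, proposals, key=lambda p: p[0])
            # ... which the channel + language model rescore; keep the best K2
            cell = heapq.nlargest(K2, {h for _, h in top_direct},
                                  key=lambda h: rescore(h, i))
            if cell:
                beams[(i, j)] = cell
    # final hypotheses must have consumed the whole input (cells (I, j))
    finals = [h for j in range(1, max_out_len + 1)
              for h in beams.get((x_len, j), [])]
    return max(finals, key=lambda h: rescore(h, x_len)) if finals else ()
```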

3.1 Model combination

The decoder we have just described makes use of an auxiliary decoding model. This means that, as a generalisation, it is capable of decoding under an objective that is a linear combination of the direct model, channel model, language model and a bias for the output length,2

$$O_{x_1^i, y_1^j} = \lambda_1 \log p(y_1^j \mid x_1^i) + \lambda_2 \log p(x_1^i \mid y_1^j) + \lambda_3 \log p(y_1^j) + \lambda_4 |y_1^j|. \qquad (3)$$

The bias is used to penalize the noisy channel model for generating sequences that are too short (or too long). The $\lambda$'s are hyperparameters to be tuned on a small amount of held-out development data.

2In the experiments, we did not marginalize the probability of the direct model when calculating the general search objective. We found that marginalizing the probability does not give better performance and makes decoding extremely slow.
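In code, the combined search objective is just a weighted sum of the component log-probabilities plus the length bias; the sketch below assumes the component scores have already been computed, and the lambda values shown are placeholders, not values from the paper.

```python
def combined_objective(direct_lp, channel_lp, lm_lp, out_len,
                       lam=(1.0, 1.0, 0.3, 0.1)):
    """Eq. 3 as a weighted sum; in practice the lambdas are tuned by grid
    search on held-out development data."""
    l1, l2, l3, l4 = lam
    return l1 * direct_lp + l2 * channel_lp + l3 * lm_lp + l4 * out_len
```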

4 Experiments

We evaluate our model on three natural language processing tasks: abstractive sentence summarisation, machine translation, and morphological inflection generation. For each task, we compare the performance of the direct model, the noisy channel model, and the interpolation of the two models.

4.1 Abstractive Sentence Summarisation

Sentence summarisation is the problem of constructing a shortened version of a sentence while preserving the majority of its meaning. In contrast to extractive summarisation, which can only copy words from the original sentence, abstractive summarisation permits arbitrary rewording of the sentence. The dataset (DBLP:conf/emnlp/RushCW15) that we use is constructed by pairing the first sentence and the headline of each article from the annotated Gigaword corpus (graff2003english; napoles2012annotated). There are 3.8m, 190k and 381k sentence pairs in the training, validation and test sets, respectively. yu:2016 filtered the dataset by restricting the lengths of the input and output sentences to be no greater than 50 and 25 tokens, respectively. From the filtered data, they further sampled 1 million sentence pairs for training. We experimented with training the direct model and the channel model on both the sampled 1 million and the full 3.8 million parallel sentence pairs. The language model is trained on the target side of the parallel data, i.e. the headlines. We evaluated the generated summaries of 2000 randomly sampled sentence pairs using full-length ROUGE F1. This setup is in line with previous work on this task (DBLP:conf/emnlp/RushCW15; chopra; gulcehre2016pointing; yu:2016).

The same configuration is used to train the direct model and the channel model. The loss (Equation 2) is optimized by Adam (DBLP:journals/corr/KingmaB14), with an initial learning rate of 0.001. We use 1-layer LSTMs for both the encoder and the decoders, with 256 hidden units. The mini-batch size is 32, and dropout of 0.2 is applied to the input and output of the LSTMs. For the language model, we use a 2-layer LSTM with 1024 hidden units and 0.5 dropout. The learning rate is 0.0001. All the hyperparameters are optimised via grid search on the perplexity of the validation set. During decoding, beam search is employed with $K_1$ proposals generated by the direct model and $K_2$ best candidates selected by the noisy channel model.

Model # Parallel data # Data for LM RG-1 RG-2 RG-L
direct (uni) 1.0m - 30.94 14.20 28.72
direct (bi) 1.0m - 31.25 14.52 29.03
direct (bi) 3.8m - 33.82 16.66 31.50
channel + LM + bias (uni) 1.0m 1.0m 31.92 14.75 29.58
channel + LM + bias (bi) 1.0m 1.0m 31.96 14.89 29.51
direct + channel + LM + bias (uni) 1.0m 1.0m 33.07 15.21 30.29
direct + channel + LM + bias (bi) 1.0m 1.0m 33.18 15.65 30.45
channel + LM + bias (uni) 1.0m 3.8m 32.59 15.05 30.06
channel + LM + bias (bi) 1.0m 3.8m 32.65 14.95 30.23
direct + LM + bias (bi) 1.0m 3.8m 31.25 14.52 29.03
direct + channel + LM + bias (uni) 1.0m 3.8m 33.16 15.63 30.53
direct + channel + LM + bias (bi) 1.0m 3.8m 33.21 15.65 30.60
channel + LM + bias (bi) 3.8m 3.8m 34.12 16.41 31.38
direct + LM + bias (bi) 3.8m 3.8m 33.82 16.66 31.50
direct + channel + LM + bias (bi) 3.8m 3.8m 34.41 16.86 31.83
Table 1: ROUGE F1 scores on the sentence summarisation test set. The 'uni' and 'bi' in parentheses denote whether the encoder of the model proposing candidates is a unidirectional or a bidirectional LSTM. Rows marked with an asterisk denote models that process their input online.
Model # Parallel data # Unpaired data RG-1 RG-2 RG-L
ABS+ 3.8m - 29.55 11.32 26.42
RAS-LSTM 3.8m - 32.55 14.70 30.03
RAS-Elman 3.8m - 33.78 15.97 31.15
Pointing unknown words 3.8m - 35.19 16.66 32.51
ASC + FSC 1.0m 3.8m 31.09 12.79 28.97
ASC + FSC 3.8m 3.8m 34.17 15.94 31.92
direct + channel + LM + bias (bi) 1.0m 3.8m 33.21 15.65 30.60
direct + channel + LM + bias (bi) 3.8m 3.8m 34.41 16.86 31.83
Table 2: Overview of results on the abstractive sentence summarisation task. ABS+ (DBLP:conf/emnlp/RushCW15) is the attentive model with a bag-of-words encoder. RAS-LSTM and RAS-Elman (chopra) are sequence to sequence models with attention, with the RNN cell implemented as an LSTM and an Elman architecture (elman1990finding), respectively. Pointing the unknown words (gulcehre2016pointing) uses pointer networks (vinyals2015pointer) to select the output token from the input sequence in order to avoid generating unknown tokens. ASC + FSC (miao2016) is a semi-supervised model based on a variational autoencoder.

Table 1 presents the ROUGE F1 scores on the test set for the direct model, the noisy channel model (channel + LM + bias), the interpolation of the direct model and the noisy channel model (direct + channel + LM + bias), and the interpolation of the direct model and the language model (direct + LM + bias), trained on different amounts of data. The noisy channel model with the language model trained on the target side of the 1 million parallel sentences outperforms the direct model by approximately 1 point. This improvement indicates that the language model helps improve the quality of the output sequence even when no extra unlabelled data is available. Training the language model on all the headlines in the dataset, i.e. 3.8 million sentences, gives a further boost to the ROUGE score, in line with our expectation that the model benefits from adding large amounts of unlabelled data. The interpolation of the direct model, channel model, language model and output-length bias achieves the best results; its ROUGE score is close to that of the direct model trained on all the parallel data. The gap between the direct model and the noisy channel model shrinks when the direct model is trained with more data, although the combination still improves over the direct model. No gain is observed when the language model is combined with the direct model alone; we find that as we increase the weight of the language model, the results get worse.

Table 2 surveys published results on this task and places our best models in the context of the current state-of-the-art results. ABS+ (DBLP:conf/emnlp/RushCW15), RAS-LSTM and RAS-Elman (chopra) are different variations of attentive models. Pointing the unknown words uses pointer networks (vinyals2015pointer) to select the output token from the input sequence in order to avoid generating unknown tokens. ASC + FSC (miao2016) is a semi-supervised model based on a variational autoencoder. Trained on 1m paired samples and 3.8m unpaired samples, the noisy channel model achieves comparable or better results than (direct) models trained with 3.8m paired samples. Compared to miao2016, whose ASC + FSC model is an alternative strategy for using unpaired data, the noisy channel is significantly more effective: 33.21 versus 31.09 in ROUGE-1.

Finally, motivated by the qualitative observation that noisy channel model outputs were quite fluent and often used reformulations of the input rather than a strict compression (which would be poorly scored by ROUGE), we carried out a human preference evaluation whose results are summarised in Table 3. This confirms that noisy channel summaries are strongly preferred over those of the direct model.

Model count
both bad 188
both good 106
direct preferred over noisy channel 135
noisy channel preferred over direct 212
Table 3: Preference ratings for 641 segments from the test set (each segment had ratings from at least 2 raters with 50% agreement on the label and where one label had a plurality of the votes).

4.2 Machine Translation

We next evaluate our models on a Chinese–English machine translation task. We used parallel data with 184k sentence pairs (from the FBIS corpus, LDC2003E14) and monolingual data comprising 4.3 million English sentences (selected from the English Gigaword). The training data is preprocessed by lowercasing the English sentences, replacing digits with the '#' token, and replacing tokens appearing fewer than 5 times with an UNK token. This results in vocabulary sizes of 30k and 20k for the Chinese and English sentences, respectively.

The models are trained using Adam (DBLP:journals/corr/KingmaB14) with an initial learning rate of 0.001 for the direct model and the channel model, and 0.0001 for the language model. The LSTMs for the direct and channel models have 1 layer with 512 hidden units, and the language model has 2 layers with 1024 hidden units per layer. Dropout of 0.5 on the input and output of the LSTMs is used for all model training. The noisy channel decoding uses beam sizes of $K_1$ = 20 and $K_2$ = 10.

Table 4 lists the translation performance of the different models in BLEU scores. To set benchmarks, we train the vanilla and attentional sequence to sequence models (sutskever:2014; bahdanau:2015) using the same parallel data. For the direct models, we use bidirectional LSTMs as the encoder for this task. We can see that the vanilla sequence to sequence model behaves poorly due to the small amount of parallel data. By contrast, the direct model (SSNT) and the attentional model work relatively well, with the attentional model outperforming the SSNT direct model. Although these models both directly model $p(y \mid x)$, this result is unsurprising because the SSNT direct model is most effective when the alignment between sequences is largely monotonic, and Chinese–English word orders diverge considerably. However, despite this limitation, the noisy channel model is approximately 3 BLEU points higher than the direct model, and the combination of the noisy channel and direct models gives a further boost. Confirming the empirical findings of prior work (and in line with theoretical predictions), the interpolation of the direct model and the language model is not effective.

Model BLEU
seq2seq w/o attention 11.19
seq2seq w/ attention 25.27
direct (bi) 23.33
direct + LM + bias (bi) 23.33
channel + LM + bias (bi) 26.28
direct + channel + LM + bias (bi) 26.44
Table 4: BLEU scores from different models for the Chinese to English machine translation task.

4.3 Morphological Inflection Generation

Morphological inflection is the task of generating a target (inflected form) word from a source word (base form), given a morphological attribute, e.g. number, tense, and person. It is useful for reducing data sparsity issues in translating morphologically rich languages. The transformation from the base form to the inflected form usually involves adding a prefix or suffix, or performing character replacement. The dataset (DBLP:conf/naacl/DurrettD13) that we use in the experiments is created from Wiktionary, including inflections for German nouns, German verbs, Spanish verbs, Finnish nouns and adjectives, and Finnish verbs. We experimented only on German nouns and German verbs, as German nouns are the most difficult task3 and the direct model does not perform as well as other state-of-the-art systems on German verbs. The train/dev/test split for German nouns is 2364/200/200, and for German verbs is 1617/200/200. There are 8 and 27 inflection types in German nouns and German verbs, respectively. Following previous work, we learn a separate model for each type of inflection, independent of the other inflections. We report the average accuracy across different inflections. Our language models were trained on word types extracted by running a morphological analysis tool on the WMT 2016 monolingual data and extracting examples of appropriately inflected word forms.4 After annotation, the number of instances for training the language model ranged from 300k to 3.8m for different inflection types in German nouns, and from 200 to 54k for German verbs.

3While state-of-the-art systems can achieve 99% accuracy on Spanish verbs and Finnish verbs, they only reach 89% accuracy on German nouns.

4http://www.statmt.org/wmt16/translation-task.html

The experimental setup that we use on this task is $K_1$ = 60, $K_2$ = 30, and:

  • direct and channel model: 1-layer LSTM with 128 hidden units, dropout = 0.5.

  • language model: 2-layer LSTM with 512 hidden units, dropout = 0.5.

Figure 1 summarises the results from our models. On both datasets, the noisy channel model (channel + LM + bias) does not perform as well as the direct model, but the interpolation of the direct model and the noisy channel model (direct + channel + LM + bias) significantly outperforms the direct model. The interpolation of the direct model and the language model (direct + LM + bias) achieves better results than the direct model and the noisy channel model on German nouns, but not on German verbs. For further comparison, we also include state-of-the-art results as benchmarks. NCK15 (DBLP:conf/naacl/NicolaiCK15) tackles the task with a three-stage approach: (1) align the source and target words, (2) extract inflection rules, and (3) apply the rules to new examples. FTND16 (DBLP:conf/naacl/FaruquiTND16) is based on neural sequence to sequence models. Both models (NCK15+ and FTND16+) rerank the candidate outputs using scores predicted by n-gram language models, together with other features.

Model Acc.
NCK15 88.60
FTND16 88.12
NCK15+ 89.90
FTND16+ 89.31
direct (uni) 82.25
direct (bi) 87.68
channel + LM + bias (uni) 78.38
channel + LM + bias (bi) 78.13
direct + LM + bias (bi) 90.31
direct + channel + LM + bias (uni) 88.44
direct + channel + LM + bias (bi) 90.94
(a)
Model Acc.
NCK15 97.50
FTND16 97.92
NCK15+ 97.90
FTND16+ 97.11
direct (uni) 87.85
direct (bi) 94.83
channel + LM + bias (uni) 84.42
channel + LM + bias (bi) 92.13
direct + LM + bias (bi) 94.83
direct + channel + LM + bias (uni) 92.20
direct + channel + LM + bias (bi) 97.15
(b)
Figure 1: Accuracy on morphological inflection of German nouns (a), and German verbs (b). NCK15 (DBLP:conf/naacl/NicolaiCK15) and FTND16 (DBLP:conf/naacl/FaruquiTND16) are previous state-of-the-art on this task, with NCK15 based on feature engineering, and FTND16 based on neural networks. NCK15+ and FTND16+ are the semi-supervised setups of these models.

5 Analysis

By observing the output generated by the direct model and the noisy channel model, we find (in line with theoretical critiques of conditional models) that the direct model may leave out key information, whereas the noisy channel model does seem to avoid this issue. To illustrate, in Example 1 in Table 5 (see Appendix B), the direct model ignores the key phrase 'coping with', resulting in incomplete meaning, but the noisy channel model covers it. Similarly, in Example 6, the direct model does not translate the Chinese word corresponding to 'investigation'. We also observe that while the direct model mostly copies words from the source sentence, the noisy channel model prefers to generate paraphrases. For instance, in Example 2, the direct model copies the word 'accelerate' into the generated output, whereas the noisy channel model generates 'speed up' instead. While one might argue that copying is preferable to paraphrasing as a compression technique (as long as it produces grammatical outputs), the paraphrases do show the power of these models.

6 Related work

Noisy channel decompositions have been successfully used in a variety of problems, including speech recognition (jelinek:1998), machine translation (brown:1993), spelling correction (brill:2000), and question answering (echihabi:2003). The idea of adding language models and monolingual data in machine translation has been explored in earlier work. gulcehre:2015 propose two strategies of combining a language model with a neural sequence to sequence model. In shallow fusion, during decoding the sequence to sequence model (direct model) proposes candidate outputs and these candidates are reranked based on the scores calculated by a weighted sum of the probability of the translation model and that of the language model. In deep fusion, the language model is integrated into the decoder of the sequence to sequence model by concatenating their hidden state at each time step. sennrich:2016 incorporate target language unpaired training data by doing back-translation to create synthetic parallel training data. While this technique is quite effective, its practicality seems limited to problems where the inputs and outputs contain roughly the same information (such as translation). cheng:2016 leverages the abundant monolingual data by doing multitask learning with an autoencoding objective.

A number of papers have remarked on the tendency for content to get dropped (or repeated) in translation. liu:2016 propose translating in both a left-to-right and a right-to-left direction and seeking a consensus. tu:2016 propose augmenting a direct model's decoding objective with a reverse translation model (similar to our channel model, except it conditions on the direct model's output RNN's hidden states rather than on the words); however, that work only reranks complete translation hypotheses rather than developing a model that permits an incremental search.

Another related line of work investigates online prediction for machine translation (gu:2016; grissom:2014; sankaran:2010) and speech recognition (hwang:2016; jaitly2015neural).

Our direct model (and channel model) shares with several previous papers the idea of introducing stochastic latent variables into neural networks and marginalising them during training. Examples include connectionist temporal classification (CTC) (graves2006connectionist) and the more recent segmental recurrent neural networks (SRNN) (kong2015segmental). Compared to these models, our direct model has the advantage of capturing unbounded dependencies between output words. The direct model is closely related to the sequence transduction model (graves2012sequence) in the way it models the probability of predicting output tokens and marginalises latent variables using dynamic programming. However, rather than modelling the joint distribution over outputs and alignments by inserting null symbols into the output sequence, our direct model defines a separate latent alignment variable, with the alignment distribution defined by neural networks. Similar to our work, the model of alkhoulialignment is decomposed into an alignment model and a model of word predictions. The two models are trained separately and combined during decoding, with subsequent refinements using a Viterbi-EM approximation. By contrast, in our direct and channel models, the latent and observed components of the models are trained jointly using a dynamic program to exactly marginalise the unobserved variables.

7 Conclusion

We have presented and empirically validated a noisy channel transduction model that uses component models based on recurrent neural networks. This formulation lets us use unpaired outputs to estimate the parameters of the source model and input-output pairs to train the channel model. Despite the channel model’s ability to condition on long sequences, we are able to maintain tractable decoding by using a latent segmentation variable that breaks the conditioning context up into a series of monotonically growing segments. Our experiments show that this model makes excellent use of unpaired training data.

References

Appendix A Algorithm

Notation: Q is the matrix of partial-hypothesis scores (the Viterbi matrix), bp is the backpointer matrix, W stores the predicted tokens, V refers to the output vocabulary, I = |x|, and Jmax denotes the maximum number of output tokens that can be predicted.
Input: source sequence x
Output: best output sequence y*
Initialisation: Q ← −∞, bp ← 0, W ← ∅
for i ∈ [1, I] do
      Candidates generated by the direct proposal model: the top K1 first tokens y1 ∈ V scored on the input prefix x_1^i.
      Rerank the candidates by the objective in Eq. 3; store the best K2 scores in Q[i, 1] and record bp[i, 1] and W[i, 1].
end for
for j ∈ [2, Jmax] do
     for i ∈ [1, I] do
          Candidates generated by the direct proposal model: the top K1 extensions y_j ∈ V of hypotheses in cells (i′, j−1), i′ ≤ i.
          Get each partial candidate y_1^j by following backpointers from the extended cell.
          Rerank the candidates by the objective in Eq. 3; store the best K2 scores in Q[i, j] and record bp[i, j] and W[i, j].
     end for
end for
return a sequence of words stored in W by following backpointers starting from argmax_j Q[I, j].
Algorithm 1 Noisy Channel Decoding

Appendix B Example outputs

Summarisation
Example 1:
source: the european commission on health and consumers protection lrb _unk_ rrb has offered cooperation to indonesia in coping with the spread of avian influenza in the country , official news agency antara said wednesday .
reference: eu offers indonesia cooperation in avian flu eradication
direct: eu offers cooperation to indonesia in avian flu
nc: eu offers cooperation to indonesia in coping with bird flu
Example 2:
source: vietnam will accelerate the export of industrial goods mainly by developing auxiliary industries , and helping enterprises sharpen competitive edges , according to the ministry of industry on thursday .
reference: vietnam to boost industrial goods export
direct: vietnam to accelerate export of industrial goods
nc: vietnam to speed up export of industrial goods
Example 3:
source: japan ’s toyota team europe were banned from the world rally championship for one year here on friday in a crushing ruling by the world council of the international automobile federation -lrb- fia -
reference: toyota are banned for a year
direct: toyota banned from world rally championship
nc: toyota europe banned from world rally championship for one year
Example 4:
source: oil prices roared higher towards ## dollars on monday as equity markets surged on government action aimed at tackling a severe economic downturn .
reference: oil prices soar towards ## dollars
direct: oil prices jump towards ## dollars
nc: oil prices climb towards ## dollars
Translation
Example 5:
source: 欧盟 和 美国 都 表示 可以 接受 这 一 妥协 方案 。
reference: both the eu and the us indicated that they can accept this plan for a compromise .
direct: the eu and the united states indicated that it can accept this compromise .
nc: the european union and the united states have said that they can accept such a compromise plan .
Example 6:
source: 那么 这些 这个 方面 呢 是 现在 警方 调查 重点 。
reference: well , this is the current focus of police investigation .
direct: these are present at the current police .
nc: then these are the key to the current police investigation .
Example 7:
source: 双方 有可能 就此 问题 在 下周 进行 磋商 。
reference: the two sides may conduct negotiations on this issue next week .
direct: the two sides may hold consultations on next week .
nc: the two sides are likely to hold consultations on this issue next week .
Example 8:
source: 那么 在 这个 问题 上 , 伊朗 现在 态度 比较 强硬 , 而 美国 的 态度 更为 强硬 。
reference: well , iran ’s attitude is now quite firm on this issue , while the us takes an even firmer attitude .
direct: on this issue , iran ’s attitude is quite hard and the attitude of the united states is still tougher .
nc: then , on this issue , iran has now taken a tougher attitude toward it . however , the attitude of the united states is even harder .
Table 5: Example outputs on the test set from the direct model and noisy channel model for the summarisation task and machine translation.