Learning Neural Sequence-to-Sequence Models from Weak Feedback with Bipolar Ramp Loss

Laura Jehl et al., University of Heidelberg · 07/06/2019

In many machine learning scenarios, supervision by gold labels is not available and consequently neural models cannot be trained directly by maximum likelihood estimation (MLE). In a weak supervision scenario, metric-augmented objectives can be employed to assign feedback to model outputs, which can be used to extract a supervision signal for training. We present several objectives for two separate weakly supervised tasks, machine translation and semantic parsing. We show that objectives should actively discourage negative outputs in addition to promoting a surrogate gold structure. This notion of bipolarity is naturally present in ramp loss objectives, which we adapt to neural models. We show that bipolar ramp loss objectives outperform non-bipolar ramp loss objectives and minimum risk training (MRT) on both weakly supervised tasks, as well as on a supervised machine translation task. Additionally, we introduce a novel token-level ramp loss objective, which is able to outperform even the best sequence-level ramp loss on both weakly supervised tasks.


1 Introduction

Sequence-to-sequence neural models are standardly trained using a maximum likelihood estimation (MLE) objective. However, MLE training requires full supervision by gold target structures, which in many scenarios are too difficult or expensive to obtain. For example, in semantic parsing for question answering it is often easier to collect gold answers than gold parses (Clarke et al., 2010; Berant et al., 2013; Pasupat and Liang, 2015; Rajpurkar et al., 2016, inter alia). In machine translation, there are many domains for which no gold references exist; however, cross-lingual document-level links are available for many multilingual data collections.

In this paper we investigate methods where a supervision signal for output structures can be extracted from weak feedback. In the following, we use learning from weak feedback, or weakly supervised learning, to refer to a scenario where output structures generated by the model are judged according to an external metric, and this feedback is used to extract a supervision signal that guides the learning process. Metric-augmented sequence-level objectives from reinforcement learning (Williams, 1992; Ranzato et al., 2016), minimum risk training (MRT) (Smith and Eisner, 2006; Shen et al., 2016), or margin-based structured prediction objectives (Taskar et al., 2005; Edunov et al., 2018) can be seen as instances of such algorithms.

In natural language processing applications, such algorithms have mostly been used in combination with full supervision tasks, where a feedback score can be computed from metrics such as BLEU or F-score that measure the similarity of output structures against gold structures. Our main interest is in weak supervision tasks where the calculation of a feedback score cannot fall back onto gold structures. For example, matching proposed answers to a gold answer can guide a semantic parser towards correct parses, and matching proposed translations against linked documents can guide learning in machine translation.

In such scenarios the judgments by the external metric may be unreliable and thus unable to select a good update direction. It is our intuition that a more reliable signal can be produced by not just encouraging outputs that are good according to weak positive feedback, but also by actively discouraging bad structures. In this way, a system can more effectively learn what distinguishes good outputs from bad ones. We call an objective that incorporates this idea a bipolar objective. The bipolar idea is naturally captured by the structured ramp loss objective (Chapelle et al., 2009), especially in the formulation by Gimpel and Smith (2012) and Chiang (2012), who use ramp loss to separate a hope from a fear output in a linear structured prediction model. We employ several ramp loss objectives for two weak supervision tasks and adapt them to neural models.

First, we turn to the task of semantic parsing in a setup where only question-answer pairs, but no gold semantic parses are given. We assume a baseline system has been trained using a small supervised data set of question-parse pairs under the MLE objective. The goal is to improve this system by leveraging a larger data set of question-answer pairs. During learning, the semantic parser suggests parses for which corresponding answers are retrieved. These answers are then compared to the gold answer and the resulting weak supervision signal guides the semantic parser towards finding correct parses. We show that a bipolar ramp loss objective improves upon the baseline by over 12 percentage points in F1 score.

Second, we employ ramp losses on a machine translation task where only weak supervision in the form of cross-lingual document-level links is available. We assume a translation system has been trained using MLE on out-of-domain data. We then investigate whether document-level links can be used as a weak supervision signal to adapt the translation system to the target domain. We formulate ramp loss objectives which incorporate bipolar supervision from relevant and irrelevant documents. We also present a metric which allows us to include bipolar supervision in an MRT objective. Experiments show that bipolar supervision is crucial for obtaining gains over the baseline. Even with this very weak supervision, we are able to achieve an improvement of over 0.4% BLEU over the baseline using a bipolar ramp loss.

Finally, we turn to a fully supervised machine translation task. MLE training in a fully supervised scenario has also been associated with two issues. First, it can cause exposure bias (Ranzato et al., 2016), because during training the model receives its context from the gold structures of the training data, while at test time the context is drawn from the model distribution instead. Second, the MLE objective is agnostic to the final evaluation metric, causing a loss-evaluation mismatch (Wiseman and Rush, 2016). Our experiments use a similar setup as Edunov et al. (2018), who apply structured prediction losses to two fully supervised sequence-to-sequence tasks, but do not consider structured ramp loss objectives. Like our predecessors, we want to understand whether further training a pre-trained machine translation model with a metric-informed sequence-level objective improves translation performance by alleviating the above-mentioned issues. Gauging the potential of bipolar ramp loss in this full supervision scenario, we again achieve the best results with it, improving over the baseline by more than 0.4% BLEU.

In sum, we show that bipolar ramp loss is superior to other sequence-level objectives for all investigated tasks, supporting our intuition that a bipolar approach is crucial where strong positive supervision is not available. In addition to adapting the ramp loss objective to weak supervision, we also modify it to operate at the token level, which makes it particularly suitable for neural models as they produce their outputs token by token. A token-level objective also better emulates the behavior of the ramp loss for linear models, which only update the weights of features that differ between hope and fear. Finally, the token-level objective allows us to capture token-level errors in a setup where MLE training is not available. Using this objective, we obtain additional gains on top of the sequence-level ramp loss for weakly supervised tasks.

2 Related Work

Training neural models with metric-augmented objectives has been explored for various NLP tasks in supervised and weakly supervised scenarios. MRT for neural models has previously been employed for machine translation (Shen et al., 2016) and semantic parsing (Liang et al., 2017; Guu et al., 2017). (Note that Liang et al. (2017) refer to their objective as an instantiation of REINFORCE; however, they build an average over several outputs for one input, and the objective thus more accurately falls under the heading of MRT.) Other objectives based on classical structured prediction losses have been used for both machine translation and summarization (Edunov et al., 2018), as well as semantic parsing (Iyyer et al., 2017; Misra et al., 2018). Objectives inspired by REINFORCE have, for example, been applied to machine translation (Ranzato et al., 2016; Norouzi et al., 2016), semantic parsing (Liang et al., 2017; Mou et al., 2017; Guu et al., 2017) and reading comprehension (Choi et al., 2017; Yang et al., 2017).

We do not use REINFORCE because its updates are based on only one sampled model output, which can lead to high variance. Since it is possible for us to obtain feedback for more than one model output, we employ the more robust MRT that calculates an average over several outputs.

Misra et al. (2018) are the first to compare several objectives for neural semantic parsing, and find that objectives employing structured prediction losses perform best. Edunov et al. (2018) compare different classical structured prediction objectives, including MRT, on a fully supervised machine translation task. They find MRT to perform best, but only obtain larger gains by interpolating MRT with the MLE loss. Neither Misra et al. (2018) nor Edunov et al. (2018) investigate objectives that correspond to the bipolar ramp loss that is central to our work.

The ramp loss objective (Chapelle et al., 2009) has been applied to supervised phrase-based machine translation (Gimpel and Smith, 2012; Chiang, 2012). We adapt these objectives to neural models, extend them to incorporate bipolar weak supervision, and introduce a novel token-level ramp loss objective.

3 Neural Sequence-to-Sequence Learning

Our neural sequence-to-sequence models employ an encoder-decoder setup (Cho et al., 2014; Sutskever et al., 2014) with an attention mechanism (Bahdanau et al., 2015). Specifically, we employ the framework Nematus (Sennrich et al., 2017). Given an input sequence x, the probability that a model with parameters θ assigns to an output sequence y = y_1, ..., y_T is given by

p_θ(y | x) = ∏_{t=1}^{T} p_θ(y_t | x, y_{<t}).

Using beam search, we can obtain a sorted K-best list K of most likely to least likely outputs, and we define the most likely output as ŷ = argmax_{y ∈ K} p_θ(y | x).
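
As a toy illustration of this chain-rule factorization, the following sketch scores a candidate sequence from per-step token distributions; the distributions are made-up numbers standing in for the decoder's softmax outputs, not values from Nematus:

```python
import math

def sequence_log_prob(step_distributions, output_tokens):
    """Chain-rule score of an output: the sum of per-token log-probabilities,
    where step_distributions[t] maps candidate tokens to p(y_t | x, y_<t)."""
    return sum(math.log(step_distributions[t][tok])
               for t, tok in enumerate(output_tokens))

# toy three-token output; probabilities are invented for illustration
steps = [{"the": 0.6, "a": 0.4},
         {"small": 0.7, "big": 0.3},
         {"house": 0.9, "houses": 0.1}]
print(sequence_log_prob(steps, ["a", "small", "house"]))  # log 0.4 + log 0.7 + log 0.9
```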

Maximum Likelihood Estimation (MLE).

Prior to employing metric-augmented objectives, we assume that a model has been pre-trained with a maximum likelihood estimation (MLE) objective. Given inputs x^(i) and gold structures y^(i), the parameters θ of the neural network are updated using Stochastic Gradient Descent (SGD) with minibatches of size m, leading to the following objective:

L_MLE = − Σ_{i=1}^{m} log p_θ(y^(i) | x^(i))    (1)

Minimum Risk Training (MRT).

We compare our ramp loss objectives to MRT (Shen et al., 2016), which employs an external metric to assign rewards to model outputs. Given an input x, a set S of outputs ỹ is sampled from the model distribution and updates are performed based on the following MRT objective:

L_MRT = − Σ_{ỹ ∈ S} (Δ(ỹ) − b) · q_θ(ỹ | x)    (2)

where Δ(ỹ) is the reward returned for ỹ by the external metric, and q_θ is a distribution over outputs that is normalized over the samples in S and can be controlled for sharpness by a temperature parameter. (We follow the implementation of MRT in Nematus with its default settings, including de-duplication of samples and the default temperature parameter. In the case of fully supervised MT, where the question arises whether to include the reference in the sample, we choose not to include it, in order to be comparable with Edunov et al. (2018), who also do not include it.) Following Shen et al. (2016), we use a baseline term b that acts as a control variate for variance reduction of the stochastic gradient (Williams, 1992; Greensmith et al., 2004) and allows negative updates for rewards smaller than the baseline. We compute this term as the average reward of outputs sampled from the model distribution.
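
For one input, the value of this objective can be sketched as follows. The temperature-controlled softmax renormalization over the samples and the average-reward baseline reflect our reading of the description above, not necessarily the exact Nematus implementation:

```python
import math

def mrt_loss(sample_log_probs, rewards, temperature=1.0):
    """Baseline-corrected expected negative reward for one input.
    sample_log_probs and rewards are parallel lists over de-duplicated samples."""
    scaled = [lp / temperature for lp in sample_log_probs]
    z = sum(math.exp(s) for s in scaled)
    q = [math.exp(s) / z for s in scaled]        # q(y~|x), renormalized over the samples
    b = sum(rewards) / len(rewards)              # average-reward baseline (control variate)
    return -sum(qi * (r - b) for qi, r in zip(q, rewards))

# three sampled outputs: one correct (reward 1), two wrong (reward 0)
print(mrt_loss([-2.3, -1.1, -4.0], [0.0, 1.0, 0.0]))
```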

Ramp Loss Objectives.

Our ramp loss objectives can be formulated as follows:

L_RAMP = Σ_{i=1}^{m} ( p_θ(y− | x^(i)) − p_θ(y+ | x^(i)) )    (3)

where y− is a fear output that is to be discouraged and y+ is a hope output that is to be encouraged. Intuitively, y− should be an output which has high probability, but receives a bad reward from the external metric. Analogously, y+ should be an output which has high probability and receives a high reward from the external metric. The concrete instantiations of y+ and y− depend on the underlying task and are thus deferred to the respective sections below (see Tables 1, 4 and 7). The RAMP loss defined in equation (3) has been introduced as equation (8) in Gimpel and Smith (2012). This loss naturally incorporates a bipolarity principle by including both hope and fear in one objective. An alternative formulation of ramp loss can be given by favoring the current model prediction, i.e., setting y+ = ŷ, and searching for a fear output. This has been called "cost-augmented decoding" and been formalized in equation (6) of Gimpel and Smith (2012). This loss dates back to the "margin-rescaled hinge loss" of Taskar et al. (2004) and will be called RAMP1 in the following. The converse approach has been called "cost-diminished decoding" and been formalized in equation (7) of Gimpel and Smith (2012). Here the model prediction is penalized by setting y− = ŷ and searching for a hope output. This objective has been called "direct loss" in Hazan et al. (2010), and will be called RAMP2 in the following.

Finally, we introduce a ramp loss objective which can operate on the token level. To be able to adjust individual tokens, we move to log probabilities, so that the sequence log probability decomposes into a sum over individual tokens and it is possible to ignore some tokens while encouraging or discouraging others. This leads to the RAMP-T objective:

L_RAMP-T = − Σ_{i=1}^{m} ( Σ_{t=1}^{|y+|} δ+_t · log p_θ(y+_t | x^(i), y+_{<t}) + Σ_{t=1}^{|y−|} δ−_t · log p_θ(y−_t | x^(i), y−_{<t}) )    (4)

where δ+_t and δ−_t are set to 0, 1 or −1 depending on whether the corresponding token should be left untouched, encouraged or discouraged. Concretely, we define:

δ+_t = 1 if y+_t ∉ y−, and 0 otherwise    (5)

δ−_t = −1 if y−_t ∉ y+, and 0 otherwise    (6)

Figure 1: Settings for token-level rewards δ+_t and δ−_t for hope output y+ = "a small house" and fear output y− = "the house".

With this definition, tokens that appear in both y+ and y− are left untouched, whereas tokens that appear only in the hope output are encouraged, and tokens that appear only in the fear output are discouraged (see Figure 1 for an example). This more fine-grained contrast allows the model to learn more effectively what distinguishes a good output from a bad one. (An implementation of the RAMP objectives can be found at https://github.com/carhaas/nematus.)
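
A small sketch reproducing the Figure 1 example under our set-membership reading of equations (5) and (6):

```python
def token_rewards(hope, fear):
    """delta+_t is 1 for hope tokens absent from the fear output, else 0;
    delta-_t is -1 for fear tokens absent from the hope output, else 0."""
    hope_set, fear_set = set(hope), set(fear)
    delta_plus = [1 if tok not in fear_set else 0 for tok in hope]
    delta_minus = [-1 if tok not in hope_set else 0 for tok in fear]
    return delta_plus, delta_minus

hope = "a small house".split()
fear = "the house".split()
print(token_rewards(hope, fear))
# ([1, 1, 0], [-1, 0]): "a" and "small" are encouraged, "the" is discouraged,
# and "house" appears in both outputs and is left untouched.
```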

4 Semantic Parsing

Ramp Loss Objectives.

In semantic parsing for question answering, natural language questions are mapped to machine readable parses. Such a parse y can be executed against a database, which returns an answer a_y. This answer can be compared to the available gold answer a and the following metric can be defined:

Δ(y) = 1 if a_y = a, and 0 otherwise    (7)

Name     y+     y−
RAMP     ŷ+     ŷ−
RAMP1    ŷ      ŷ−
RAMP2    ŷ+     ŷ
Table 1: Configurations of y+ and y− for semantic parsing. We abbreviate as ŷ+ the most likely output in the K-best list that leads to the correct answer, and as ŷ− the most likely output in the K-best list that leads to the wrong answer.

For RAMP, y+ is defined as the most probable output in the K-best list K that leads to the correct answer, i.e. ŷ+ = argmax_{y ∈ K: Δ(y)=1} p_θ(y | x). In contrast, y− is defined as the most probable output in K that does not lead to the correct answer, i.e. ŷ− = argmax_{y ∈ K: Δ(y)=0} p_θ(y | x). The definitions of y+ and y− for this objective and the related ramp loss objectives can be found in Table 1. If ŷ+ or ŷ− are found, the parse is cached as a hope or fear output, respectively, for the corresponding input x. If at a later point ŷ+ or ŷ− cannot be found in the current K-best list, then previously cached outputs are accessed instead. Should no cached output exist, the corresponding sample is skipped.
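
The selection and caching scheme can be sketched as follows; execute_parse and the gold answers are placeholders for the database interface, and the cache layout is our own illustration, not the implementation used in the experiments:

```python
hope_cache, fear_cache = {}, {}

def select_hope_fear(x, kbest, gold_answer, execute_parse):
    """Walk the K-best list (sorted by model probability) and take the first parse
    with a correct answer as hope and the first with a wrong answer as fear;
    fall back to previously cached outputs for this input if one is missing."""
    hope, fear = None, None
    for parse in kbest:
        correct = execute_parse(parse) == gold_answer
        if correct and hope is None:
            hope = parse
        elif not correct and fear is None:
            fear = parse
        if hope is not None and fear is not None:
            break                     # no further database queries needed
    hope = hope if hope is not None else hope_cache.get(x)
    fear = fear if fear is not None else fear_cache.get(x)
    if hope is not None:
        hope_cache[x] = hope
    if fear is not None:
        fear_cache[x] = fear
    return hope, fear                 # skip the sample if either is still None
```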

Experimental Setup.

Our experiments are conducted on the NLmaps v2 corpus (Lawrence and Riezler, 2018), a publicly available corpus (https://www.cl.uni-heidelberg.de/statnlpgroup/nlmaps/) of geographical questions that can be answered with the OpenStreetMap database (https://www.openstreetmap.org). The corpus is a recent extension of its predecessor (Haas and Riezler, 2016), which has been used by Kočiský et al. (2016) and Duong et al. (2018).

For each question, the corpus provides both gold parses and gold answers that can be obtained by executing the parses against the database. We take a random subset of 2,000 question-parse pairs to train an initial model with the MLE objective. Following Lawrence and Riezler (2018), we take a pre-order traversal of the tree-structured parses to obtain individual tokens. 1,843 and 2,000 further instances of the corpus are retained for development and test set, respectively. For the remaining 22,766 questions, we assume that no gold parses exist and only gold answers are available. With the gold answers as a guide, the initial model is further improved using the metric-augmented objectives of Section 3 and the metric defined in equation (7).

The model has 1,024 hidden units (GRUs) and word embeddings of size 1,000. The learning rate was chosen in preliminary experiments on the development set. Gradients are clipped to 1.0 if they exceed a value of 1.0 and the sentence length is capped at 200. For the MRT objectives, the corresponding hyperparameters were chosen in the same way. For the RAMP objectives, the size of the K-best list is 10. For objectives with minibatches, the minibatch size is 80 and validation on the development set is performed after every 100 updates. For objectives where updates are performed after each seen input, validation is run after every 8,000 updates, leading to the same number of seen inputs as for the objectives with minibatches.

For validation and at test time, the most likely parse is obtained with a beam search of size 12. The obtained parse is executed against the database to retrieve its corresponding answer, which is compared to the available gold answer. We define recall as the percentage of correct answers over the entire set and precision as the percentage of correct answers among non-empty answers. The harmonic mean of recall and precision constitutes the F1 score. The stopping point is determined by the highest F1 score on the development set after 30 validations or 30 days of run time (the 30-day mark was only hit by RAMP2), and corresponding results are reported on the test set. To measure statistical significance between models we employ an approximate randomization test (Noreen, 1989).
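
The evaluation metric can be stated compactly; this is a sketch with placeholder answer strings, where an empty string stands for an empty answer set:

```python
def answer_f1(system_answers, gold_answers):
    """Recall: correct answers over all questions. Precision: correct answers
    over questions with a non-empty system answer. F1: their harmonic mean."""
    correct = sum(1 for s, g in zip(system_answers, gold_answers) if s and s == g)
    non_empty = sum(1 for s in system_answers if s)
    recall = correct / len(gold_answers)
    precision = correct / non_empty if non_empty else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(answer_f1(["berlin", "", "3"], ["berlin", "paris", "5"]))  # 0.4
```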

Experimental Results.

Results using the various ramp loss objectives as well as MRT are shown in Table 2. MRT outperforms the MLE baseline by about 6 percentage points in F1 score. RAMP1 performs worse than MRT, but can still significantly outperform the baseline by 3.05 points in F1 score. RAMP2 performs better than RAMP1, but outperforms MRT only nominally.

In contrast to this, by carefully selecting both a hope and fear parse, RAMP achieves a significant further 5.43 points in F1 score over MRT. By incorporating token-level feedback, our novel objective RAMP-T outperforms all other models significantly and beats the baseline by over 12 points in F1 score. Compared to RAMP, RAMP-T can take advantage of the token-level feedback which allows a model to determine which tokens in the hope output are instrumental to obtain a positive reward but are missing in the fear output. Analogously it is possible to identify which tokens in the fear output lead to an incorrect parse, rather than also punishing the tokens in the fear output which are actually correct.

#   Objective   m    % F1     Δ F1
1   MLE         —    57.45
2   MRT         1    63.60    +6.15
3   RAMP1       80   60.50    +3.05
4   RAMP2       80   64.22    +6.77
5   RAMP        80   69.03    +11.58
6   RAMP-T      80   69.87    +12.42
Table 2: Answer F1 scores on the NLmaps v2 test set for various objectives, averaged over two independent runs. m is the minibatch size. All models are statistically significantly different from each other, except the pair (2, 4).

MRT is not naturally a bipolar objective: it can only discourage wrong parses if the baseline is larger than 0. Investigating the value of the baseline for 10,000 instances shows that in 37% of the cases the baseline is 0, i.e. none of the sampled parses leads to the correct answer. As a result, 37% of the time, wrong parses are ignored rather than discouraged. To explore the importance of always discouraging wrong parses, we introduce the objective MRT neg: it modifies the feedback for parses with a wrong answer to be −1 rather than 0, which resembles the fear output that is discouraged in the RAMP objective. With this change, the MRT objective always behaves in a bipolar manner, irrespective of the baseline's value. As a consequence, MRT neg can significantly outperform MRT by 2.33 points in F1 score (see Table 3). This showcases the importance of employing bipolar supervision and constitutes an important finding compared to previous approaches (Liang et al., 2017; Misra et al., 2018), where the feedback is defined to lie in the range [0, 1].

However, MRT neg still falls short of RAMP by 3.1 points in F1 score. This could be because of the different batch sizes, as MRT uses a batch size of 1, whereas RAMP employs a batch size of 80. To ensure that the difference between the objectives does not stem from this difference, we run an experiment with RAMP where the batch size is also set to 1, i.e. RAMP m=1. Crucially, it still significantly outperforms MRT. At the same time, it does have a lower F1 score than RAMP (see Table 3). This showcases the importance of using a larger minibatch size, so that an average over several inputs is computed before updating. In fact, its F1 score is on par with the MRT neg objective, which uses the same minibatch size and incorporates bipolar supervision just as RAMP does. However, RAMP m=1 should still be preferred because the RAMP objectives are more efficient than MRT objectives: in the case of MRT, for every training instance all sampled parses need to be executed as queries against the database to obtain answers and corresponding rewards, whereas RAMP has to execute at most the queries of the K-best list K, but often fewer if both a correct and an incorrect query are found earlier.

#   Objective   m    % F1     Δ F1
1   MLE         —    57.45
2   MRT         1    63.60    +6.15
3   MRT neg     1    65.93    +8.48
4   RAMP m=1    1    66.78    +9.33
5   RAMP        80   69.03    +11.58
Table 3: Answer F1 scores on the NLmaps v2 test set for RAMP and MRT, as well as two further objectives that help crystallize the difference between the two, averaged over two independent runs. m is the minibatch size. All models are statistically significantly different from each other, except the pair (3, 4).

To summarize, RAMP can attribute its success to two factors: First, it discourages parses that receive a wrong answer rather than ignoring them as MRT often does. Second, a larger minibatch size leads to improvements because updates are based on an average over several inputs. Further performance gains can be obtained by employing the token-level objective RAMP-T. Finally, RAMP objectives are more efficient because fewer outputs have to be judged.

5 Weakly Supervised Machine Translation

Loss        y+                                          y−
RAMP        argmax_{y∈K} [p_θ(y|x) + α·δ_{D+}(y)]        argmax_{y∈K} [p_θ(y|x) − α·δ_{D+}(y)]
RAMP_{D−}   argmax_{y∈K} [p_θ(y|x) + α·δ_{D+}(y)]        argmax_{y∈K} [p_θ(y|x) + α·δ_{D−}(y)]
RAMP1       ŷ                                            argmax_{y∈K} [p_θ(y|x) + α·δ_{D−}(y)]
RAMP2       argmax_{y∈K} [p_θ(y|x) + α·δ_{D+}(y)]        ŷ
RAMP_{D±}   argmax_{y∈K} [p_θ(y|x) + α·δ_{D+,D−}(y)]     argmax_{y∈K} [p_θ(y|x) − α·δ_{D+,D−}(y)]
Table 4: Configurations of y+ and y− for weakly supervised MT adaptation. ŷ is the highest-probability model output and p_θ(y|x) is the probability of y under the model. The argmax is taken over the K-best list K. α is a scaling factor regulating the influence of the metric compared to the model probability. δ_{D+}, δ_{D−} and δ_{D+,D−} are metrics defined with respect to the relevant and irrelevant documents D+ and D− (see Eq. 8 and 9).

Ramp Loss Objectives.

We consider machine translation (MT) in a weakly supervised domain adaptation setting, where in-domain references are unavailable. In this setting, we obtain weak feedback by matching translation model outputs against cross-lingually linked documents. For each input sentence x, we can obtain a set of relevant documents R(x) ⊆ D, where D is a collection of target language documents. Cross-lingual link structures can be found in many multilingual document collections, such as cross-lingual citations in patent documents or product categories in e-commerce data. Our example is links between Wikipedia documents. Instead of a reference translation, we use a relevant document D+ sampled from R(x) to guide our search for y+ and y−. As a relevant document provides much weaker supervision than a reference translation, we construct a more informative supervision signal by integrating negative supervision from an irrelevant document D− sampled from a collection of irrelevant contrast documents. For each input x, the bipolar supervision signal then consists of a pair of sampled documents (D+, D−).

Unlike semantic parsing for question answering, our task uses a continuous reward. In fully supervised MT a sentence-level approximation of the BLEU score can serve as the reward, but computing the BLEU score between a translation and a document does not make sense. We therefore propose two alternative metrics. The first, δ_{D+}, computes how well a translation matches a relevant document. The second, δ_{D+,D−}, computes how well a translation differentiates between a relevant and an irrelevant document. δ_{D+} is defined as the average n-gram precision between a hypothesis y and a document D+, multiplied by a brevity penalty. As we do not have a reference length, we include a brevity penalty term which compares the output length to the input length. This ratio can be modified by a factor r that represents the average length difference between source and target language and which can be computed over the training data:

δ_{D+}(y) = BP · (1/N) Σ_{n=1}^{N} ( Σ_{g ∈ G_n(y)} min(1, count_{D+}(g)) / |G_n(y)| )    (8)

where G_n(y) are the n-grams present in y, count_{D+}(g) counts the occurrences of an n-gram g in D+, and N is the maximum order of n-grams used. The brevity penalty term is BP = min(1, exp(1 − r·|x| / |y|)).

δ_{D+,D−} is defined as the difference between δ_{D+} and δ_{D−}, the same metric computed with respect to the irrelevant document D−, subject to a linear transformation to allow values to lie between 0 and 1:

δ_{D+,D−}(y) = ( δ_{D+}(y) − δ_{D−}(y) + 1 ) / 2    (9)

Our intuition behind this metric is that it should measure how well a translation differentiates between the relevant and irrelevant document, leading to domain-specific translations being weighted higher than domain-agnostic ones.
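
The following sketch illustrates the two metrics as we have reconstructed them above; the exact counting and clipping details in the original formulas may differ, and the length factor r and maximum order max_n are hyperparameters of the sketch:

```python
import math
from collections import Counter

def doc_match(hyp, doc, src_len, r=1.0, max_n=4):
    """delta_{D+}-style score: average fraction of the hypothesis' n-grams that
    occur in the document, times a brevity penalty that relates the output
    length to the (factor-adjusted) source length."""
    doc_ngrams = {n: Counter(tuple(doc[i:i + n]) for i in range(len(doc) - n + 1))
                  for n in range(1, max_n + 1)}
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = [tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)]
        if not hyp_ngrams:
            continue
        matched = sum(1 for g in hyp_ngrams if doc_ngrams[n][g] > 0)
        precisions.append(matched / len(hyp_ngrams))
    prec = sum(precisions) / len(precisions) if precisions else 0.0
    bp = min(1.0, math.exp(1.0 - (r * src_len) / max(len(hyp), 1)))
    return prec * bp

def doc_pair_match(hyp, rel_doc, irrel_doc, src_len, r=1.0):
    """delta_{D+,D-}-style score: difference of the two document scores,
    rescaled linearly into [0, 1]."""
    diff = doc_match(hyp, rel_doc, src_len, r) - doc_match(hyp, irrel_doc, src_len, r)
    return (diff + 1.0) / 2.0
```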

Table 4 shows our loss functions for the weakly supervised case. RAMP and RAMP2 define y+ and y− in the same way as in the semantic parsing task, except that the metric δ_{D+} is employed to match outputs against documents. Like Gimpel and Smith (2012), we include a scaling factor α to trade off the importance of the reward against the model score in determining y+ and y−. Note that these objectives do not include negative supervision from D−. Using the metrics defined above, we formulate two objectives that include D−: RAMP_{D−} defines y+ in the same way as RAMP, but uses a different definition of y−: instead of using a fear output with respect to D+, i.e. a translation with high probability and low reward δ_{D+}, we use a hope output with respect to D−, i.e. a translation with high probability and high reward δ_{D−}. As this translation matches an irrelevant document well, it can be used as a negative output. The same definition of y− is also used in RAMP1; note that this objective does not include positive supervision from D+. Finally, RAMP_{D±} incorporates D+ and D− in a different way: it defines y+ as a hope and y− as a fear, but uses the joint metric δ_{D+,D−} with respect to the document pair (D+, D−).
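
A minimal sketch of how hope and fear translations could be selected from the K-best list under these definitions; the helper doc_match is the illustrative scorer from the previous sketch, the model score may be a probability or log-probability, and the exact interpolation is our assumption:

```python
def pick(kbest, metric, alpha, sign=+1):
    """kbest: list of (translation_tokens, model_score) pairs.
    Return the entry maximizing model_score + sign * alpha * metric(translation):
    sign=+1 yields a hope, sign=-1 a fear with respect to the given metric."""
    best, _ = max(kbest, key=lambda item: item[1] + sign * alpha * metric(item[0]))
    return best

# bipolar selection against a relevant and an irrelevant document (cf. RAMP_{D-}):
# y_plus  = pick(kbest, lambda y: doc_match(y, d_plus,  src_len), alpha, +1)
# y_minus = pick(kbest, lambda y: doc_match(y, d_minus, src_len), alpha, +1)  # hope w.r.t. D-
```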

Experimental Setup.

We test our objectives on a weakly supervised English-German Wikipedia translation task first proposed by Jehl and Riezler (2016). In-domain training data are 10,000 English sentences with relevant German documents sampled from the WikiCLIR corpus (Schamoni et al., 2014). WikiCLIR annotates a stronger mate relation when there is a direct cross-lingual link between documents and a weaker link relation when there is a bidirectional link between a German mate document and another German document; the experiments reported here use the mate relation. The task includes a small in-domain development and test set (dev: 1,712 sentences, test: 1,526 sentences), each consisting of four Wikipedia articles with diverse subjects. Irrelevant documents are sampled from the German side of the News Commentary data set (http://casmacat.eu/corpus/news-commentary.html), which contains document boundary information.

Byte-pair encoding (Sennrich et al., 2016) with 30,000 merge operations is applied to all source and target data. Sentences longer than 80 words are removed from the training set. Our neural MT model uses 500-dimensional word embeddings and a hidden layer dimension of 1,024. Encoder and decoder use GRU units. An out-of-domain model is trained on 2.1 million sentence pairs from Europarl v7 (Koehn, 2005), News Commentary v10 and the MultiUN v1 corpus (Eisele and Chen, 2010). The baseline (MLE) is trained using the MLE objective and ADADELTA (Zeiler, 2012) for 20 epochs. We train on batches of 64 and use dropout for regularization, with a dropout rate of 0.2 for embedding and hidden layers and 0.1 for source and target layers. Gradients are clipped if their norm exceeds 1.0.

The metric-augmented objectives are trained using SGD. All hyperparameters are chosen on the development set. For the ramp loss objectives, we use a learning rate of 0.005 and a K-best size of 16. We compare ramp loss to MRT using both δ_{D+} and δ_{D+,D−} as the external metric, denoted as MRT_{D+} and MRT_{D±}, respectively. MRT is trained using a learning rate of 0.05. For testing and validation, translations are obtained using beam search with a beam size of 16. Results are validated every 200 updates and training is run for 25 validations. The stopping point is determined by the BLEU score (Papineni et al., 2001) on the development set. We report scores computed with Moses' multi-bleu.perl (https://github.com/moses-smt/mosesdecoder) on tokenized, truecased output. Results are averaged over 2 runs.

#    Objective      m    % BLEU   Δ BLEU
1    MLE            64   15.59
2    RAMP           40   15.03    −0.56
3    RAMP1          40   15.12    −0.47
4    RAMP2          40   15.19    −0.40
5    MRT_{D+}       1    15.37    −0.22
6    MRT_{D±}       1    15.70    +0.11
7    RAMP_{D−}      40   15.85    +0.26
8    RAMP_{D±}      40   15.86    +0.27
9    RAMP-T_{D−}    40   16.03    +0.44
10   RAMP-T_{D±}    40   15.84    +0.25
Table 5: BLEU scores for weakly supervised MT experiments. m is the minibatch size. Boldfaced results are significantly better than the baseline according to multeval (Clark et al., 2011).

Experimental Results.

Results for the different objectives can be found in Table 5. The ramp losses RAMP, RAMP1 and RAMP2, which do not incorporate bipolar supervision from both D+ and D− (lines 2, 3 and 4), actually deteriorate in performance. This shows that supervision from only D+ or only D− is insufficient. The deteriorating effect is strongest for RAMP, which uses δ_{D+} to select both y+ and y−. We explain this by the fact that D+ is an imperfect label: trying to push the model to perfectly reproduce D+ will not lead to a good translation. The same observation holds true for MRT_{D+}, which only includes the reward δ_{D+}. Compared to the RAMP objectives, the decrease for MRT_{D+} is smaller.

On the other hand, MRT_{D±}, which incorporates bipolar supervision, produces a nominal improvement over the MLE baseline. This objective is outperformed by RAMP_{D−} and RAMP_{D±}. Both objectives produce a small, but significant, improvement of about 0.3% BLEU over the MLE baseline. This result shows that bipolar supervision is crucial for success in this weak supervision scenario. It also shows that, unlike for MRT, it does not matter for the bipolar ramp loss whether δ_{D−} or the joint metric δ_{D+,D−} is used, as both capture the same idea. The superiority of these objectives over MRT_{D±} again shows the value of intelligently selecting positive and negative outputs. Another small, but significant improvement is produced by the token-level variant RAMP-T, leading to the best overall result.

To summarize, we find that for this task, which uses very weak supervision from document-level links, small improvements can be obtained. To achieve these improvements, it is imperative to employ objectives which include bipolar supervision from D+ and D−. This finding holds for both ramp loss and MRT. The best overall result is obtained using ramp loss in its token-level variant.

Analysis of Translation Results.

Figure 2: BLEU scores by sentence length for the MLE Baseline and the RAMP-T runs.

As the improvements in the translation experiments are very small, we conduct a small-scale analysis to better determine the nature of the gains. Our analysis is inspired by Bentivogli et al. (2016). We compare the weakly supervised MLE baseline to the best experiment in this setting, which uses the bipolar token-level ramp loss RAMP-T.

We first analyze the performance by sentence length. We separate the translations into source-length brackets and score each bracket separately. The brackets represent quartiles of the source length distribution, ensuring an approximately equal number of sentences in each bracket. Results are shown in Figure 2. For all systems, we observe a drop in performance up to an input length of 33. Surprisingly, BLEU scores increase again for the top bracket (source length above 33). For this bracket, we also see the biggest gap between MLE and RAMP-T, at 0.52 and 0.67% BLEU for the two runs. These gains are contrasted by much weaker improvements in the bottom brackets. A possible explanation for the weaker performance of MLE in the top bracket is the observation that hypotheses produced by the MLE system are longer than for RAMP-T. For the top bracket, hypothesis lengths exceed reference lengths for all systems. However, for MLE this over-generation is more severe, at 106% of the reference length compared to 102% for RAMP-T, potentially causing a higher loss in precision.
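
The bracketing can be reproduced along the following lines, using source-length quartiles as cut points and NLTK's corpus-level BLEU per bucket as a stand-in for the scoring script used in the paper:

```python
import numpy as np
from nltk.translate.bleu_score import corpus_bleu

def bleu_by_length_bracket(sources, hypotheses, references):
    """Split the test set into source-length quartiles and score each bracket."""
    lengths = np.array([len(s.split()) for s in sources])
    cuts = np.quantile(lengths, [0.25, 0.5, 0.75])
    brackets = np.digitize(lengths, cuts)          # bracket index 0..3 per sentence
    scores = {}
    for b in range(4):
        idx = [i for i, v in enumerate(brackets) if v == b]
        if not idx:
            continue
        refs = [[references[i].split()] for i in idx]
        hyps = [hypotheses[i].split() for i in idx]
        scores[b] = corpus_bleu(refs, hyps)
    return scores
```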

Figure 3: BLEU scores by Wikipedia article for the MLE Baseline and the RAMP-T runs.
Figure 4: Improvements in BLEU scores by Wikipedia article for the RAMP-T runs.

As our test set consists of parallel sentences extracted from four Wikipedia articles, we can examine the performance for each article separately. Figure 3 shows the results. We observe large differences in performance according to article ID. These are probably caused by some articles being more similar to the out-of-domain training data than others. Comparing RAMP-T and MLE, we see that RAMP-T outperforms MLE for each article by a small margin. Figure 4 shows the size of the improvements by article. We observe that margins are bigger on articles with better baseline performance. This suggests that there are challenges arising from domain mismatch which are not addressed by our method.

Source Towards the end of the 19th century , a strong textile industry was developing itself in Schüttorf with several large local businesses ( Schlikker & Söhne , Gathmann & Gerdemann , G. Schümer & Co. and ten Wolde , later Carl Remy ; today ’s RoFa is not one of the original textile companies , but was founded by H. Lammering and later taken over by Gerhard Schlikker jun. , Levert Rost and Wilhelm Edel ;
MLE Ende des 19. Jahrhunderts , eine starke Textilindustrie , die sich in Ettorf mit mehreren großen lokalen Unternehmen ( Schlikker & Söhne , Gathmann & Geréann , G. Schal & Co. und zehn Wolde , später Carl Remy ) entwickelt hat ; die heutige RoFa ist nicht einer der ursprünglichen Textilunternehmen , sondern wurde von H. Lammering [gegründet] und später von Gerhard Schaloker Junge , Levert Rost und Wilhelm Edel übernommen .
RAMP-T Ende des 19. Jahrhunderts entwickelte sich [in Schüttorf] eine starke Textilindustrie mit mehreren großen lokalen Unternehmen ( Schlikker & Söhne , Gathmann & Gerdemann , G. Schal & Co. und zehn Wolde , später Carl Remy ; die heutige RoFa ist nicht eines der ursprünglichen Textilunternehmen , sondern wurde von H. Lammering [gegründet] und später von Gerhard Schaloker Junge , Levert Rost und Wilhelm Edel übernommen .
Reference gegen Ende des 19. Jahrhunderts entwickelte sich in Schüttorf eine starke Textilindustrie mit mehreren großen lokalen Unternehmen ( Schlikker & Söhne , Gathmann & Gerdemann , G. Schümer & Co. und ten Wolde , später Carl Remy , die heutige RoFa ist keine ursprüngliche Textilfirma , sondern wurde von H. Lammering gegründet und später von Gerhard Schlikker jun. , Levert Rost und Wilhelm Edel übernommen .)
Table 6: MT example from Article 2 in the test set. All translation errors are underlined. Incorrect proper names are also set in italics. Omissions are inserted in brackets and set in italics [like this]. Improvements by RAMP-T over MLE are marked in boldface.

Lastly, we present an examination of example outputs. Table 6 shows an example of a long sentence from Article 2, which describes the German town of Schüttorf. This article is originally in German, meaning that our model is back-translating from English into German. The reference contains some awkward or even ungrammatical phrases such as “was developing itself”, a literal translation from German. The example also illustrates that translating Wikipedia involves handling frequent proper names (there are 11 proper names in the example). Both models struggle with translating proper names, but RAMP-T produces the correct phrase “Gathmann & Gerdemann”, while MLE fails to do so. The RAMP-T translation is also fully grammatical, while MLE incorrectly translates the main verb phrase “was developing itself” into a relative clause, and contains an agreement error in the translation of the noun phrase “one of the original textile companies”. While making fewer errors in grammar and proper name translation, RAMP-T contains two deletion errors while MLE only contains one. This could be caused by the active optimization of sentence length in the ramp loss model.

6 Fully Supervised Machine Translation

While our work focuses on weakly supervised tasks, we also conduct experiments on a fully supervised MT task. These experiments are motivated on the one hand by adapting the findings of Gimpel and Smith (2012) to the neural MT paradigm, and on the other hand by expanding the work of Edunov et al. (2018) on applying classical structured prediction losses to neural MT.

Loss     y+                                     y−
RAMP     argmax_{y∈K} [p_θ(y|x) + α·Δ(y)]       argmax_{y∈K} [p_θ(y|x) − α·Δ(y)]
RAMP1    ŷ                                      argmax_{y∈K} [p_θ(y|x) − α·Δ(y)]
RAMP2    argmax_{y∈K} [p_θ(y|x) + α·Δ(y)]       ŷ
PERC1    y*                                     ŷ
PERC2    argmax_{y∈K} Δ(y)                      ŷ
Table 7: Configurations of y+ and y− for fully supervised MT. ŷ is the highest-probability model output, y* is a gold standard reference, and p_θ(y|x) is the probability of y according to the model. The argmax is taken over the K-best list K. Δ is smoothed per-sentence BLEU and α is a scaling factor.

Ramp Loss Objectives.

For fully supervised MT we assume access to one or more reference translations y* for each input x. The reward Δ is a per-sentence approximation of the BLEU score; we use BLEU with add-1 smoothing as proposed by Chen and Cherry (2014). Table 7 shows the different definitions of y+ and y−, which give rise to different ramp losses. RAMP, RAMP1, and RAMP2 are defined analogously to the other tasks. We again include a hyperparameter α interpolating cost function and model score when searching for y+ and y−. Gimpel and Smith (2012) also include the perceptron loss in their analysis. PERC1 is a re-formulation of the Collins perceptron (Collins, 2002) where the reference y* is used as y+ and ŷ is used as y−. A comparison with PERC1 is not possible for the weakly supervised tasks in the previous sections, as gold structures are not available for these tasks. With neural MT and subword methods we are able to compute this loss for any reference without running into the problem of reachability that was faced by phrase-based MT (Liang et al., 2006). However, using sequence-level training towards a reference can lead to degenerate solutions where the model gives low probability to all its predictions (Shen et al., 2016). PERC2 addresses this problem by replacing y* with a surrogate translation that achieves the highest BLEU score in K. This approach is also used by Edunov et al. (2018) for the loss functions which require an oracle. PERC1 corresponds to equation (9), and PERC2 to equation (10), of Gimpel and Smith (2012).
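
A sketch of the per-sentence reward under our reading of add-1 smoothing (Chen and Cherry, 2014), where matched and total counts for higher-order n-grams are incremented by one; the official smoothing variant may differ in detail:

```python
import math
from collections import Counter

def sentence_bleu_add1(hyp, ref, max_n=4):
    """Per-sentence BLEU with add-1 smoothing on n-gram precisions for n > 1."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum(min(c, r[g]) for g, c in h.items())
        total = sum(h.values())
        if n > 1:                       # add-1 smoothing for higher-order n-grams
            matched, total = matched + 1, total + 1
        if matched == 0 or total == 0:
            return 0.0
        log_prec += math.log(matched / total) / max_n
    bp = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(log_prec)

print(sentence_bleu_add1("a small house".split(), "the small house".split()))
```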

Experimental Setup.

We conduct experiments on the IWSLT 2014 German-English task, which is based on Cettolo et al. (2012) in the same way as Edunov et al. (2018). The training set contains 160K sentence pairs. We set the maximum sentence length to 50 and use BPE with 14,000 merge operations. Edunov et al. (2018) sample 7K sentences from the training set as heldout data. We do the same, but only use 1/10th of the data as heldout set to be able to validate often.

Our baseline system (MLE) is a BiLSTM encoder-decoder with attention, which is trained using the MLE objective. Word embedding and hidden layer dimensions are set to 256. We use batches of 64 sentences for baseline training and batches of 40 inputs for training the RAMP and PERC variants. MRT makes an update after each input using all sampled outputs, resulting in a batch size of 1. All experiments use dropout for regularization, with dropout probability set to 0.2 for embedding and hidden layers and to 0.1 for source and target layers. During MLE training, the model is validated every 2,500 updates and training is stopped if the MLE loss on the heldout set worsens for 10 consecutive validations.

For metric-augmented training, we use SGD for optimization, with learning rates and the remaining hyperparameters chosen on the development set. Ramp losses and PERC2 use a K-best list of size 16. RAMP and PERC variants both use a learning rate of 0.001. A new K-best list is generated for each input using the current model parameters. We compare ramp loss to MRT as described above; for MRT, we use SGD with a learning rate of 0.01. As Edunov et al. (2018) observe beam search to work better than sampling for MRT, we also run an experiment in this configuration, but find no difference between results. As beam search runs significantly slower, we only report sampling experiments.

The model is validated on the development set after every 200 updates for experiments with batch size 40 and after 8,000 updates for MRT experiments with batch size 1. The stopping point is determined by the BLEU score on the heldout set after 25 validations. As we are training on the same data as the MLE baseline, we also apply dropout during ramp loss training to prevent overfitting. BLEU scores are computed with Moses’ multi-bleu.perl on tokenized, truecased output. Each experiment is run 3 times and results are averaged over the runs.

Experimental Results.

As shown in Table 8, all experiments except for PERC1 yield improvements over MLE, confirming that sequence-level losses which update towards the reference can lead to degenerate solutions. For MRT, our findings show similar performance to the initial experiments reported by Edunov et al. (2018), who gain 0.24 BLEU points on the same test set (see their Table 2); using interpolation with the MLE objective, they achieve 0.7 BLEU points. As we are only interested in the effect of sequence-level objectives, we do not add MLE interpolation. The best model by Edunov et al. (2018) achieved a BLEU score of 32.91%. It is possible that these scores are not directly comparable to ours due to different pre- and post-processing. They also use a multi-layer CNN architecture (Gehring et al., 2017), which has been shown to outperform a simple RNN architecture such as ours. PERC2 and RAMP2 improve over the MLE baseline and PERC1, but perform on a par with MRT and with each other. Both RAMP and RAMP1 are able to outperform MRT, PERC2 and RAMP2, with the bipolar objective RAMP also outperforming RAMP1 by a narrow margin. The main difference between RAMP and RAMP1, compared to PERC2 and RAMP2, is the fact that the latter objectives use ŷ as y−, while the former use a fear translation with high probability and low BLEU. We surmise that for this fully supervised task, selecting a y− with some known negative characteristics is more important for success than finding a good y+. RAMP, which fulfills both criteria, still outperforms RAMP2. This result re-confirms the superiority of bipolar objectives compared to non-bipolar ones. While still improving over MLE, the token-level ramp loss RAMP-T is outperformed by RAMP by a small margin. This result suggests that when employing a metric-augmented objective on top of an MLE-trained model in a full supervision scenario without domain shift, there is little room for improvement from token-level supervision, while gains can still be obtained from additional sequence-level information captured by the external metric, such as information about the sequence length.

To summarize, our findings on a fully supervised task show the same small margin for improvement as Edunov et al. (2018), without any further tuning of performance, e.g. by interpolation with the MLE objective. Bipolar RAMP is found to outperform the other losses. This observation is also consistent with the results by Gimpel and Smith (2012) for phrase-based MT. We conclude that for fully supervised MT, deliberately selecting a hope and fear translation is beneficial.

#   Objective   m    % BLEU   p      Δ BLEU
1   MLE         64   31.99
2   MRT         1    32.17    0.02   +0.18
3   PERC1       40   31.91    0.02   −0.08
4   PERC2       40   32.22    0.03   +0.23
5   RAMP1       40   32.36    0.05   +0.37
6   RAMP2       40   32.19    0.01   +0.20
7   RAMP        40   32.44    0.00   +0.45
8   RAMP-T      40   32.33    0.00   +0.34
Table 8: BLEU scores for fully supervised MT experiments, averaged over three runs. m is the minibatch size and p is the significance level of the difference to the MLE baseline according to multeval (Clark et al., 2011). Boldfaced results are significantly better than MLE.

7 Conclusion

We presented a study of weakly supervised learning objectives for three neural sequence-to-sequence learning tasks. In our first task of semantic parsing, question-answer pairs provide a weak supervision signal to find parses that execute to the correct answer. We show that ramp loss can outperform MRT if it incorporates bipolar supervision where parses that receive negative feedback are actively discouraged. The best overall objective is the token-level ramp loss. Next, we turn to weak supervision for machine translation in the form of cross-lingual document-level links. We present two ramp loss objectives which combine bipolar weak supervision from a linked document D+ and an irrelevant document D−. Again, the bipolar ramp loss objectives outperform MRT, and the best overall result is obtained using token-level ramp loss. Finally, to tie our work to previous work on supervised machine translation, we conduct experiments in a fully supervised scenario where gold references are available and a metric-augmented loss is desired to reduce the exposure bias and the loss-evaluation mismatch. Again, the bipolar ramp loss objective performs best, but we find that the overall margin for improvement is small without any additional engineering.

We conclude that ramp loss objectives show promise for neural sequence-to-sequence learning, especially for weakly supervised tasks where the MLE objective cannot be applied. In contrast to ramp losses that operate only in the undesirable region of the search space ("cost-augmented decoding" as in RAMP1) or only in the desirable region of the search space ("cost-diminished decoding" as in RAMP2), bipolar RAMP operates in both regions when extracting supervision signals from weak feedback. We showed that MRT can be turned into a bipolar objective by defining a metric that assigns negative values to bad outputs, which improves the performance of MRT objectives. However, the ramp loss objective is still superior as it is easy to implement and efficient to compute. Furthermore, on weakly supervised tasks our novel token-level ramp loss objective RAMP-T can obtain further improvements over its sequence-level counterpart because it can more directly assess which tokens in a sequence are crucial to its success or failure.

Acknowledgments

The research reported in this paper was supported in part by DFG grant RI-2221/4-1. We would like to thank the reviewers for their helpful comments.

References