Curriculum learning aims to improve the training of machine learning models by choosing which examples to present to the learning algorithm and in which order. It was originally inspired by the learning experience of humans: people tend to learn better and faster when they are first introduced to simpler concepts and can exploit previously learned concepts and skills to ease the learning of new abstractions. This phenomenon is widely observed in, e.g., music and sports training, academic training, and pet shaping. Unsurprisingly, curriculum learning has been found most helpful in end-to-end neural network architectures, since the performance that an artificial neural network can achieve depends critically on the quality of the training data presented to it.
Table 1: Two data samples from the dataset used in this paper.
|Example|Field|Text|
|1|zh|xianzai ta zheng kaolv zhe huijia.|
|1|zh-gloss|Now he is thinking about going home.|
|1|en|He is thinking about going home now.|
|2|zh|wo yao chi niupai.|
|2|zh-gloss|I want eat steak|
|2|en|I want a steak. Get me a coke.|
Neural Machine Translation (NMT) translates text from a source language to a target language in an end-to-end fashion with a single neural network. It has not only achieved state-of-the-art machine translation results, but also eliminated the hand-crafted features and rules that are otherwise required by statistical machine translation. The performance of NMT has improved significantly in recent years, as NMT architectures evolved from the initial RNN-based models to convolutional seq2seq models and further to Transformer models.
However, since obtaining accurately labeled training samples in machine translation is often time-consuming and requires expert knowledge, an important question in NMT is how to best utilize a limited number of available training samples, which may differ in length, quality, and noise level. Curriculum learning has recently been studied for NMT as well. One approach feeds data to an NMT model in an easy-to-difficult order and characterizes the “difficulty” of a training example by its sentence length and the rarity of the words that appear in it. Rather than using difficulty or complexity as the criterion for curriculum design, another approach calculates the noise level of a training example with the help of an additional trusted clean dataset and trains an NMT model with a noise-annealing curriculum.
A limitation of the existing curriculum learning methods for NMT is that they only address the batch selection issue in a “learn-from-scratch” scenario. Unfortunately, training an NMT model is time-consuming and can take up to several weeks, depending on the amount of data available. In most practical and commercial cases, a pre-trained model already exists, and re-training it from scratch with a new ordering of batches is a waste of time and resources. In this paper, we study curriculum learning for NMT from a new perspective: given a pre-trained model and the dataset used to train it, we re-select a subset of useful samples from the existing dataset to further improve the model. Unlike the easy-to-difficult insights in traditional curriculum learning, our idea is analogous to classroom training, where a student first attends classes to learn general subjects with equal weights and then carefully reviews a subset of selected subjects to strengthen weak areas or to build ability in a field of interest.
Furthermore, while all the samples participate in batch training for the same number of epochs, it is unlikely that all data contribute equally to a best-performing model. Table 1 shows two data samples from the dataset used in this paper: Example 1 is accurately translated and is more likely to improve the model, while Example 2 is poorly translated (with unexpected words in the target sentence) and may even degrade performance when fed to the model. The objective of our curriculum design is to identify examples from the existing dataset that may further contribute to model improvement and present them again to the NMT learning system. An overview of our proposed task is given in Figure 1, where useful data can be selected and fed to the system repeatedly to strengthen the model iteratively.
We formulate the data re-selection task as a reinforcement learning problem in which the state consists of the features of randomly sampled training examples, the action is choosing one of them, and the reward is the perplexity difference on a validation set after the pre-trained model is updated with the selected sample. The primary goal of the learning problem thus becomes searching for a data selection policy that maximizes the reward. Reinforcement learning is known to be unstable, or even to diverge, when the action-value function is represented by a nonlinear function such as a neural network. To alleviate this instability, our proposed RL framework is built on the Deterministic Actor-Critic algorithm. It consists of two networks: an actor network, which learns a data selection policy, and a critic network, which evaluates the action value of choosing each training sample while providing information to guide the training of the actor network. Besides introducing the framework, we also carefully design the state space by choosing a wide range of features to characterize each sample in multiple dimensions, including sentence length, sentence-level log-likelihood, $n$-gram rarity, and POS and NER tagging.
Experiments on multiple translation datasets demonstrate that our method can achieve a significant performance boost, when normal batch learning cannot improve the model further, by only re-selecting influential samples from the original dataset. Furthermore, the proposed scheme outperforms a number of other curriculum learning baseline methods, including the denoising scheme based on the use of additional trusted data.
2 Problem Definition
In this section, we provide a brief background of NMT and formulate the curriculum learning task on pre-trained NMT models as a reinforcement learning problem.
Machine translation can be considered a one-to-one mapping from a source sentence $x$ to a target sentence $y$. In neural machine translation, a model parameterized by $\theta$ is searched for to maximize the conditional probability $p(y \mid x; \theta)$ over the training samples. Modern NMT models adopt an encoder-decoder architecture, where an encoder encodes the source sentence $x$ into a hidden representation $h$ and a decoder predicts the next target word $y_t$ given the hidden vector $h$ and all previously predicted words $y_{<t} = \{y_1, \dots, y_{t-1}\}$. Thus the conditional probability is decomposed as
$$p(y \mid x; \theta) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, h; \theta),$$
where $T$ is the number of tokens in each target sentence. Given a training corpus $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, the training objective of an NMT model is to minimize
$$L(\theta) = -\sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}; \theta).$$
We consider curriculum learning on a pre-trained NMT model, where the goal is to improve an existing model $\theta_0$ by selecting a subset from the training dataset $D$ that led to $\theta_0$. Compared with training from scratch, we take advantage of both the versatility of normal batch learning in the initial pre-training stage and a carefully selected curriculum for targeted model improvements. Specifically, our objective is to find an optimal policy $\pi^*$ to select $D'$ from $D$ and update $\theta_0$ with $D'$ such that the performance of the updated model is maximized, i.e.,
$$\pi^* = \arg\max_{\pi} \, \mathrm{perf}\big(\theta'(D'_{\pi})\big),$$
where $D'_{\pi}$ is a subset selected from $D$ using policy $\pi$, $\theta'(D'_{\pi})$ is the updated model after training $\theta_0$ with $D'_{\pi}$, and perf indicates the performance of a model, e.g., measured by BLEU or perplexity.
The main challenge is to identify and select the most beneficial data samples from $D$. A naive way is to evaluate the BLEU improvement on a validation set brought by every single data sample in the training set and select the ones that improve BLEU the most. However, this method is extremely costly and does not scale to large datasets.
To obtain a generalizable data selection policy, we formulate the task as a reinforcement learning problem in which the environment is composed of both the dataset $D$ and the model $\theta_0$. The RL agent aims to learn a policy $\pi$ that decides which sample to select when presented with a random batch of samples. In our framework, a state $s$ corresponds to the representation of both a data batch to select from and the NMT model, an action $a$ refers to selecting the best data sample from the batch, and a reward $r$ is the performance improvement of the NMT model on the validation set after being updated with the selected sample.
The RL agent is trained through interacting with the environment by repeatedly performing the following: 1) receiving a state containing a random batch of samples, 2) providing an action back to the environment according to its trained policy, and 3) updating the policy using a feedback reward given by the environment. Once the policy is trained, it can be used to select data from an arbitrarily large dataset and is scalable.
In this section, we describe our Deterministic Actor-Critic framework for curriculum learning, as well as the design of the state, action, and reward in detail.
3.1 Model Overview
For the model design of the RL agent, we choose the Deterministic Actor-Critic algorithm.
Actor-critic is a widely used method in reinforcement learning that combines an actor network, which outputs an action given a state to maximize the reward, and a critic network, which predicts the action value of a state-action pair $(s, a)$. The actor learns a near-optimal policy via policy gradient, while the critic estimates the action value that guides the update direction of the actor. Compared with actor-only methods like REINFORCE, the critic reduces the update variance and accelerates convergence.
As our reward calculation involves evaluating the updated NMT model on a validation set and is thus expensive, we use a memory replay buffer to increase sample efficiency. Furthermore, we choose a deterministic policy over a stochastic one because the deterministic policy gradient can be calculated much more efficiently.
The update of the framework is illustrated in Figure 2. The critic network takes a state-action pair $(s, a)$, evaluates the action value, outputs a predicted reward $Q(s, a; \theta^Q)$, and updates its parameters $\theta^Q$ under the supervision of the actual reward $r$ from the environment. Note that the critic network provides an immediate reward per iteration. As a result, we do not need to employ additional techniques, e.g., Temporal-Difference (TD) learning, to approximate the long-term reward. The objective of the critic network is thus to minimize the squared error between the predicted reward and the actual reward, given by
$$L(\theta^Q) = \big(Q(s, a; \theta^Q) - r\big)^2.$$
The update of the parameters $\theta^Q$ is achieved through gradient descent as follows:
$$\theta^Q \leftarrow \theta^Q - \alpha_Q \nabla_{\theta^Q} L(\theta^Q),$$
where $\alpha_Q$ corresponds to the learning rate of the parameters $\theta^Q$.
The actor network takes in a state $s$ from the environment, applies the learned policy, and outputs a corresponding action $a = \mu(s; \theta^\mu)$. The objective of the actor network is to learn an optimal policy that generates the proper action to maximize the predicted reward $Q(s, a; \theta^Q)$. Policy gradient is used to update the parameters $\theta^\mu$, i.e.,
$$\theta^\mu \leftarrow \theta^\mu + \alpha_\mu \, \nabla_a Q(s, a; \theta^Q)\big|_{a = \mu(s; \theta^\mu)} \, \nabla_{\theta^\mu} \mu(s; \theta^\mu),$$
where $\alpha_\mu$ is the learning rate of the parameters $\theta^\mu$.
Algorithm 1 summarizes the overall learning process of the proposed framework. In each round of data selection (with $K$ rounds in total), we first train the RL agent for an adequate number of iterations. In each iteration, we derive a state, an action, and a reward in lines 4–7 and update the actor network and the critic network in line 8 and line 9, respectively. After the RL agent is fully trained, we apply the learned policy to select a subset $D'$ and use $D'$ to update the NMT model before moving to the next round. Usually one or two rounds are sufficient.
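The critic and actor updates described above can be illustrated with a small numeric sketch. It assumes, purely for illustration, a scalar state and action, a linear critic `Q(s, a) = w_s * s + w_a * a`, and a linear actor `mu(s) = theta * s`; these forms and all names are hypothetical and much simpler than the MLPs used in the paper.

```python
def critic_step(w_s, w_a, s, a, r, lr):
    """One gradient-descent step on the squared error (Q(s, a) - r)^2."""
    q = w_s * s + w_a * a
    err = q - r
    # d/dw_s (q - r)^2 = 2 * err * s ; d/dw_a (q - r)^2 = 2 * err * a
    return w_s - lr * 2 * err * s, w_a - lr * 2 * err * a


def actor_step(theta, w_a, s, lr):
    """One gradient-ascent step on Q(s, mu(s)) with mu(s) = theta * s."""
    dq_da = w_a       # dQ/da for the linear critic
    dmu_dtheta = s    # d mu / d theta for the linear actor
    return theta + lr * dq_da * dmu_dtheta


# Fit the critic to a single observed (state, action, reward) triple.
w_s, w_a = 0.0, 0.0
s, a, r = 1.0, 2.0, 1.0
for _ in range(200):
    w_s, w_a = critic_step(w_s, w_a, s, a, r, lr=0.05)
predicted = w_s * s + w_a * a

# Move the actor in the direction the critic says increases the reward.
theta_next = actor_step(theta=0.5, w_a=w_a, s=s, lr=0.1)
```

The critic is regressed onto the immediate reward only, which mirrors the absence of a TD target in the text; the actor then follows the chain rule through the critic.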
Figure 3 shows the network structure of the RL agent. A feature network $F$ is shared between the Actor Network $\mu$ and the Critic Network $Q$. It takes in the raw features of the sampled batch of $B$ examples, where each feature of each sample goes through an independent single-layer MLP. The concatenation of the outputs constitutes the state representation $s$. Note that different examples in the sampled batch share the same network weights.
The Actor Network $\mu$ is composed of a two-layer MLP. It computes a score for each example in the sampled batch of $B$ examples based on the input state $s$, and outputs the action $a$ as a probability vector representing the probability of each example being selected, obtained by applying a softmax over the computed scores of the $B$ examples.
The Critic Network $Q$ also has two layers and calculates the action value of a given state-action pair $(s, a)$, where the action is the output of the Actor Network, i.e., $a = \mu(s)$, and is concatenated to the second layer of the critic network.
Note that although the feature network $F$ is shared between the actor network $\mu$ and the critic network $Q$, we only update it with the critic network to reduce training instability.
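As a minimal sketch of the final step of the actor, the snippet below turns per-example scores into a selection probability vector with a softmax. The scores are made-up numbers standing in for the MLP output, and the function name is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of per-example scores."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.2, 1.5, -0.3, 0.9]  # hypothetical actor scores for a batch of 4
action = softmax(scores)        # probability of each example being selected
best = max(range(len(action)), key=lambda i: action[i])
```

Subtracting the maximum score before exponentiating leaves the result unchanged but avoids overflow for large scores.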
The state is meant to be a full summarization of the environment, including a data batch of examples to select from and information about the pre-trained model. However, the number of parameters in the pre-trained model is too large to be included in the state directly at each time step. Thus, we need a representation that captures both the samples to be selected and the existing model using a limited number of parameters. In our state design, we focus on three different dimensions, namely informativeness, uncertainty, and diversity. We use the sentence length as a measure of informativeness, the sentence-level log-likelihood for uncertainty, and the $n$-gram rarity together with NER and POS taggings for diversity.
The most intuitive feature of a parallel sentence is the sentence length, i.e., the number of tokens in the sentence. This simple scalar roughly measures the amount of information in a sentence and has also been used in prior work as a “difficulty” estimate.
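Following the intuition that examples yielding large uncertainty can benefit model performance, another feature we use is the sentence-level log-likelihood calculated by the pre-trained model, obtained by summing the log-probability of each word of the target sentence:
$$\log p(y \mid x; \theta) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, h; \theta),$$
where $h$ is the encoder's hidden representation of the source sentence $x$ and $T$ is the number of target tokens.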
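A minimal sketch of this feature, assuming the per-token probabilities assigned by the pre-trained model are already available as a list (the function name and the numbers are illustrative only):

```python
import math


def sentence_log_likelihood(token_probs):
    """Sum of the log-probabilities of the target tokens."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical per-token probabilities for two target sentences.
confident = sentence_log_likelihood([0.9, 0.8, 0.95])
uncertain = sentence_log_likelihood([0.3, 0.2, 0.5])
```

A lower (more negative) value indicates that the model is less certain about the target sentence.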
Prior work on active learning suggests that selecting samples that are farthest from the existing dataset can benefit model training. Building on this idea, we further use two other feature vectors, $n$-gram rarity and taggings, to represent the similarity between a given sample and the entire training set.
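For $n$-gram rarity, given all the sentences in $D$, we define the relative frequency of a unique $n$-gram $g$ in $D$ as
$$f(g) = \frac{\sum_{z \in D} \sum_{g' \in G_n(z)} \mathbb{1}[g' = g]}{N_n},$$
where $z$ is a sentence from $D$, $G_n(z)$ is the collection of all $n$-grams in $z$, and $N_n$ is the total number of $n$-grams in $D$. For the $n$-gram representation of a sentence $z$, we form all of its $n$-gram frequencies into a vector:
$$v_n(z) = \big[f(g_1), f(g_2), \dots, f(g_m)\big], \quad g_i \in G_n(z).$$
In this paper, we calculate the $n$-gram vectors for both the source and target sentences.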
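The relative-frequency computation can be sketched as follows. This is an illustrative reading of the definition above on a toy two-sentence corpus; the function names are hypothetical, and padding or truncating the vector to a fixed size for the feature network is not shown.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequencies(corpus, n):
    """Count of each n-gram divided by the total number of n-grams in the corpus."""
    counts = Counter(g for sent in corpus for g in ngrams(sent, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def ngram_vector(tokens, freqs, n):
    """Relative frequencies of a sentence's n-grams, in sentence order."""
    return [freqs.get(g, 0.0) for g in ngrams(tokens, n)]


corpus = [["i", "want", "a", "steak"], ["i", "want", "a", "coke"]]
freqs = relative_frequencies(corpus, n=2)
vec = ngram_vector(["i", "want", "a", "steak"], freqs, n=2)
```

In this toy corpus the bigram ("a", "steak") gets a lower frequency than ("i", "want"), which is the signal a rarity feature is meant to expose.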
For taggings, we use Named Entity Recognition (NER) and Part-of-Speech (POS) tags and apply ideas similar to our $n$-gram design. As the tag of a word depends on the sentence in which the word appears, the same word may be tagged differently in different sentences. For example, in POS tagging, the word “Book” can be tagged as either “NOUN” or “VERB” depending on its meaning in the sentence. If most occurrences of the word “Book” are tagged as “NOUN” in the training set, the model may not be able to correctly learn its second meaning. Therefore, the model should be fed more samples in which “Book” is tagged as “VERB”. To reflect this phenomenon, we define a tagging value for a word $w$ and its current tag $t$ as
$$v(w, t) = \frac{c(w, t)}{c(w)},$$
where $c(w)$ is the number of times the word $w$ has appeared in the training set and $c(w, t)$ is the number of times the word $w$ is tagged with tag $t$ among all its occurrences. Similarly, for both NER and POS taggings, the tagging values of the words form a vector representation of a sentence with words $w_1, \dots, w_m$ and tags $t_1, \dots, t_m$:
$$\big[v(w_1, t_1), v(w_2, t_2), \dots, v(w_m, t_m)\big].$$
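As a rough illustration of the tagging value, the sketch below assumes it is the ratio between the count of a (word, tag) pair and the count of the word, which is one natural reading of the two quantities defined in the text. The tagged corpus is a toy example and the function name is hypothetical.

```python
from collections import Counter


def tagging_values(tagged_corpus):
    """Map each (word, tag) pair to count(word, tag) / count(word)."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in tagged_corpus:
        for word, tag in sentence:
            word_counts[word] += 1
            pair_counts[(word, tag)] += 1
    # Assumed form of the tagging value: share of a word's occurrences with this tag.
    return {pair: c / word_counts[pair[0]] for pair, c in pair_counts.items()}


tagged_corpus = [
    [("book", "NOUN"), ("a", "DET")],
    [("book", "NOUN")],
    [("book", "NOUN")],
    [("book", "VERB"), ("a", "DET")],
]
values = tagging_values(tagged_corpus)
```

Under this assumed form, the rarely seen "VERB" reading of "book" receives a small value, so sentences containing it stand out in the state representation.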
3.3 Action and Reward
In the task of curriculum learning, an action represents the process of data selection. Prior work on active learning assumes a stream-based setting, where data examples arrive one by one, and designs the action as a decision on whether a single incoming example should be selected. We argue that our problem setting is pool-based rather than stream-based: a pool of data exists for selection, and deciding on each individual sample per step would be inefficient. In another RL-based curriculum method, the dataset is split into several bins according to a noise measure of the data samples, and the action determines from which bin the next batch of training data should be selected. However, this method depends heavily on an effective heuristic criterion for bin formation and is thus hard to generalize.
Therefore, we propose an action design that samples a batch of $B$ data examples from $D$, computes a score for each example according to the trained policy, and chooses the one with the highest score. Correspondingly, we can easily control the size of $D'$ by varying the batch size $B$, i.e., $|D'| = |D| / B$, since we choose one out of $B$ samples for each batch.
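The pool-based selection can be sketched as below. The sketch assumes the pool is shuffled once and split into disjoint batches, which is one possible way to draw random batches; `score_fn` is a hypothetical stand-in for the trained policy.

```python
import random


def select_subset(dataset, score_fn, batch_size, seed=0):
    """Pick the highest-scoring sample from each random batch of the pool."""
    rng = random.Random(seed)
    indices = list(range(len(dataset)))
    rng.shuffle(indices)
    selected = []
    # Walk over disjoint batches; a trailing incomplete batch is skipped.
    for start in range(0, len(indices) - batch_size + 1, batch_size):
        batch = indices[start:start + batch_size]
        best = max(batch, key=lambda i: score_fn(dataset[i]))
        selected.append(dataset[best])
    return selected


dataset = list(range(64))  # toy "samples"; the score is the value itself
subset = select_subset(dataset, score_fn=lambda x: x, batch_size=16)
```

With 64 samples and a batch size of 16, four samples are kept, so a larger batch size yields a smaller and stricter selection.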
For the choice of reward signal, we use the performance improvement of the NMT model evaluated on the validation set after it is updated with the selected sample. Perplexity is used as the performance metric instead of BLEU as it is shown to be more consistent and less noisy . We assign a reward of to unselected samples.
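A minimal sketch of a perplexity-based reward is given below. It assumes the reward is the validation perplexity before the update minus the perplexity after it, so that an improvement is positive; the exact sign and scaling used in the paper are not stated in the text, and the token log-probabilities are made-up values.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean token log-probability."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def reward(log_probs_before, log_probs_after):
    """Positive when the update lowered validation perplexity (assumed sign)."""
    return perplexity(log_probs_before) - perplexity(log_probs_after)


# Hypothetical validation token log-probabilities before and after one update.
before = [math.log(0.25)] * 4
after = [math.log(0.5)] * 4
r = reward(before, after)
```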
In this section, we will first describe the datasets used in our evaluation and provide the implementation details along with the performance results.
4.1 Datasets & Metrics
To compare our proposed method with other curriculum learning methods on the NMT task, we conduct a comprehensive empirical analysis on several zh-en translation datasets:
MTD is an internal news translation dataset with 1M samples in the training set and 1,892 samples in both the validation set and the test set. The average length of source sentences is and the average length of target sentences is .
CASICTB, CASIA2015, and NEU are three independent datasets with 1M, 2M, and 2M examples, respectively, drawn from different data sources in WMT18, a public news translation dataset with more than 20M samples. We use only part of the WMT18 data to evaluate our method. All three datasets share the same validation set, newsdev2017, and the same test set, newstest2017, each composed of 2k samples.
Table 2 summarizes the details of the experimental datasets. The columns titled “Train”, “Val”, and “Test” correspond to the number of examples in training, validation, and test sets. “Src-Len” and “Tgt-Len” correspond to the average sentence length of source language and target language respectively.
4.2 Experimental Settings
We implement our models in PyTorch 1.1.0 and train them on a single Tesla P40 GPU. We use NLTK to perform POS and NER tagging.
For the NMT model, we use a Transformer. It consists of a 6-layer encoder and decoder, with 8 attention heads and 2,048 units in the feed-forward layers. The multi-head attention model dimension and the word embedding size are both set to 512. During training, we use the Adam optimizer with a learning rate of 2.0, decayed with a Noam scheduler with 8,000 warm-up steps. Each training batch contains 4,096 tokens and is selected with bucketing. During inference, we employ beam search with a beam size of 5.
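For readers unfamiliar with the Noam scheduler, the sketch below uses the standard schedule from the Transformer paper scaled by the base learning rate, with the values quoted above as defaults. Whether the toolkit used in the paper applies exactly this formula is an assumption.

```python
def noam_lr(step, base_lr=2.0, d_model=512, warmup=8000):
    """Noam schedule: linear warm-up, then inverse-square-root decay.

    `step` starts at 1; step 0 is undefined for this formula.
    """
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)


peak = noam_lr(8000)        # the rate peaks at the end of warm-up
halfway = noam_lr(4000)     # half of the peak during linear warm-up
later = noam_lr(32000)      # half of the peak again after 4x the warm-up steps
```

Under this formula the nominal rate of 2.0 is only a scale factor; the effective peak rate is roughly 1e-3.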
For the RL agent, we use a Deterministic Actor-Critic architecture and build our system on an existing modularized PyTorch implementation of deep RL algorithms. In our framework, we use several tricks proven effective for RL training, including a memory replay buffer of size 2,500, a warm-up phase of 500 steps, and a target network that is updated by mixing its weights with those of the online network. To calculate rewards, we train the NMT model on the single selected sample using SGD with a learning rate of 1e-4.
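Two of these tricks, the replay buffer and the target-network update, can be sketched in a few lines. The class and function names are hypothetical, and the mixing factor `tau` below is an arbitrary illustrative value, not the one used in the paper.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity buffer that drops the oldest transitions first."""

    def __init__(self, capacity=2500):
        self.items = deque(maxlen=capacity)

    def add(self, state, action, reward):
        self.items.append((state, action, reward))

    def sample(self, k, rng=random):
        """Draw up to k stored transitions uniformly without replacement."""
        return rng.sample(list(self.items), min(k, len(self.items)))


def soft_update(target, online, tau):
    """Mix online weights into target weights: tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t for t, o in zip(target, online)]


buffer = ReplayBuffer(capacity=3)   # tiny capacity so the eviction is visible
for step in range(5):
    buffer.add(state=step, action=0, reward=float(step))
target = soft_update(target=[0.0, 0.0], online=[1.0, 2.0], tau=0.1)
```

Reusing stored transitions matters here because each reward requires an NMT update and a validation pass, which is expensive.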
The feature network maps the data features of sentence length, sentence-level log-likelihood, taggings, and $n$-gram rarity to vectors of size 1, 8, 16, and 32, respectively, with an FC layer, and concatenates them into a shared state representation. The actor network is composed of two FC layers with hidden sizes of 300 and 400. The critic network has the same design as the actor network, except that the output action from the actor network is concatenated to the second layer. ReLU is used as the activation function in each FC layer.
In our experiments, we conduct two rounds of RL agent training and data selection. In each round, the RL agent is trained for 20k steps, and the model with the highest sum of rewards during the last 1,000 steps is used for data selection. The RL agent keeps selecting data from randomly sampled batches of the training set and feeds the selected data to the NMT model until no performance improvement is observed. For the training process on selected data, we keep the NMT model’s optimizer and learning rate settings unchanged.
We use different sizes for the sampled batch in the two rounds of training and selection, with $B_1 = 16$ and $B_2 = 128$: in the first round we select 1 sample out of 16, and in the second round we select the best sample out of 128. We do so because achieving further improvement on top of the first round requires a stricter data selection criterion.
To make comparisons with other existing curriculum methods, we have conducted several baseline experiments.
We take the core ideas of existing curriculum learning methods, namely training on data samples with gradually increasing difficulty and gradually decreasing noise, and apply them to our setting with pre-trained models. We evaluate the following three baseline methods along with our proposed method.
Denoising is a curriculum learning method that trains an NMT model in a noise-annealing fashion. It measures NMT data noise with the help of a trusted dataset that contains generally cleaner data than the training dataset. A reinforcement-learning-based curriculum method also utilizes data noise in its curriculum design and achieves similar performance. For the trusted dataset, we choose a subset of 500 sentences from the validation set newsdev2017 of CASICTB, CASIA2015, and NEU.
Sentence Length is an intuitive difficulty measure used in prior work, since longer sentences tend to contain more information and more complicated sentence structure.
Word Rarity is another metric for measuring sample difficulty, as rare words appear less frequently in the training process and should be presented to the learning system more often. The formula for calculating the word rarity of a sentence follows prior work on competence-based curriculum learning.
For baseline experiments, the pre-trained NMT model is further trained on 20% of the original data, which are selected by the above criteria, i.e., the least noisy, the longest, and the highest word rarity, respectively.
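The baseline selection rule amounts to ranking the data by a criterion and keeping the top fifth. A small sketch for the sentence-length criterion follows; the function name and the toy sentences are illustrative, and whitespace tokenization is an assumption.

```python
def top_fraction(samples, key, fraction=0.2):
    """Keep the highest-ranked `fraction` of samples under the given criterion."""
    keep = int(len(samples) * fraction)
    return sorted(samples, key=key, reverse=True)[:keep]


sentences = [
    "a b", "a b c d e", "a", "a b c", "a b c d",
    "a b c d e f", "a b c d e f g", "x y", "x", "x y z",
]
# Sentence-length baseline: rank by the number of whitespace-separated tokens.
longest = top_fraction(sentences, key=lambda s: len(s.split()))
```

The noise and word-rarity baselines would reuse the same function with a different `key`.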
4.4 Main Results
Table 3 compares the performance of our method with the baseline methods on different datasets, evaluated using BLEU. The results show that our proposed method outperforms the baseline methods by a large margin. We conduct two rounds of training and update in our experiments. The first round already surpasses almost all the baseline methods, and the second round further improves the BLEU score over the pre-trained model on MTD, CASICTB, CASIA2015, and NEU.
This success stems from using an RL framework to proactively select data samples that are potentially beneficial to the training of the NMT model. First, we formulate the task of curriculum learning on pre-trained NMT models as a reinforcement learning problem. Second, we construct an effective design of state, action, and reward; our state representation includes features along the dimensions of informativeness, uncertainty, and diversity. Third, we propose a Deterministic Actor-Critic framework that learns a policy to select the best samples from the training set to improve the pre-trained model. By combining these designs, our proposed framework achieves a significant performance enhancement on the pre-trained NMT model.
We evaluate the impact of different modules and methods through an ablation test on the MTD dataset. Table 4 lists the performance of our model variants with different features included.
We incrementally add different features of examples to the state, starting with sentence length and sentence-level log-likelihood, as both are scalars. The performance increases slightly in BLEU compared with the pre-trained base model. We then add $n$-gram rarity and POS and NER taggings to the state vectors and observe larger performance increases. Finally, we incorporate the second round of RL agent training and data selection on top of the first-round result and achieve the best performance. Note that a stricter selection policy is applied in the second round (128 choose 1) than in the first round (16 choose 1).
Table 4: Ablation results on MTD (BLEU).
|Variant|BLEU|
|Senlen + Logp||
|+ 2nd Round|19.11|
5 Related Work
A high-quality machine translation corpus is costly and difficult to collect, so it is necessary to make the best use of the corpus at hand. A straightforward way to achieve this goal is to remove the noisy samples from the training data and train a new model on the clean ones. Unfortunately, it is hard to estimate the quality of a parallel sentence in the absence of a golden reference. Moreover, prior work on denoising finds that some of the noisy samples may benefit model performance; it defines a new method of computing the noise level of a data example with the help of an extra trusted dataset and proposes to train an NMT model with a noise-annealing curriculum.
Curriculum learning aims to organize the training process in a meaningful way by feeding certain samples to the model at certain training stages so that the model can learn better and faster. The original proposal is a simple strategy that organizes all training samples into bins of similar complexity and starts training from the easiest bin, gradually including more complex bins until all bins are covered. Subsequent work applies this idea to NMT by binning according to simple features such as sentence length and word frequency, and improves the strategy by restricting each sample to be trained on only once during an epoch. Other work conducts empirical studies on several hand-crafted curricula and adopts a probabilistic view of curriculum learning. Competence-based curriculum learning further proposes a competence function with respect to the training time step as an indicator of learning progress and selects samples based on both difficulty and competence. However, these heuristic-based approaches depend heavily on hand-crafted curricula and are hard to generalize.
Compared with heuristic-based approaches, RL-based policy learning models are trained end-to-end and do not rely on hand-crafted strategies. Related approaches use Bayesian optimization to learn a linear model for ranking examples in a word-embedding task, explore bandit optimization for scheduling tasks in a multi-task problem, and select examples in a co-trained classifier using RL. Another line of work organizes the dataset into bins based on data noise and utilizes a DQN to learn a data selection policy that decides from which bin to select the next batch.
Different from existing curriculum learning methods, our work focuses on learning a training curriculum with reinforcement learning for an existing pre-trained NMT model. We argue that existing curriculum learning methods are only applicable to train-from-scratch scenarios, and that learning from an existing model can save training time.
Active learning is another related area, which focuses on selectively obtaining labels for unlabeled data in order to improve the model at the least labeling cost. Earlier studies apply active learning to statistical machine translation using hard-coded heuristics. More recent work designs an active learning algorithm based on a deep Q-network, in which the action corresponds to binary annotation decisions applied to a stream of data, or makes use of imitation learning to train a data selection policy.
In this paper, we study curriculum learning for NMT from a new perspective: re-selecting a subset of useful samples from the existing dataset to further improve a pre-trained model. We formulate this task as a reinforcement learning problem. Compared with existing curriculum methods, which are only applicable to train-from-scratch scenarios, our setting saves training time by better utilizing existing pre-trained models. Our proposed framework is built on the deterministic actor-critic algorithm and learns a policy to select the examples that can improve the model the most. We conduct experiments on several zh-en translation datasets and compare our method with baseline methods, including the easy-to-difficult curriculum and the denoising scheme. Through rounds of training and data selection, our method achieves a significant performance boost over the pre-trained model and outperforms all baseline methods by a large margin.
-  (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §1, §1, §5.
-  (2009) Natural language processing with python: analyzing text with the natural language toolkit. O’Reilly Media, Inc. Cited by: §4.2.
-  (2010) Bucking the trend: large-scale cost-focused active learning for statistical machine translation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 854–864. Cited by: §5.
- “Bilingual expert” can find translation errors. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6367–6374. Cited by: §5.
-  (2017) Learning how to active learn: a deep reinforcement learning approach. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 595–605. Cited by: §3.3, §5.
-  (2017) A convolutional encoder model for neural machine translation. In Proceedings of the 55th ACL, pp. 123–135. Cited by: §1.
-  (2017) Automated curriculum learning for neural networks. In Proceedings of the 34th ICML-Volume 70, pp. 1311–1320. Cited by: §5.
-  (2009) Active learning for statistical phrase-based machine translation. In Proceedings of Human Language Technologies: NAACL 2009, pp. 415–423. Cited by: §5.
-  (2013) Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700–1709. Cited by: §1.
-  (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §4.2.
- OpenNMT: open-source toolkit for neural machine translation. In Proc. ACL. Cited by: §4.2.
-  (2017) Curriculum learning and minibatch bucketing in neural machine translation. In Proceedings of RANLP 2017, pp. 379–386. Cited by: §4.2, §5.
-  (2000) Actor-critic algorithms. In Advances in neural information processing systems, pp. 1008–1014. Cited by: §3.1.
-  (2009) Flexible shaping: how learning in small steps helps. Cognition 110 (3), pp. 380–394. Cited by: §1.
-  (2019) Reinforcement learning based curriculum optimization for neural machine translation. In Proceedings of the 2019 NAACL, pp. 2054–2061. Cited by: §3.3, 1st item, §5.
-  (2006) Confidence-based active learning. IEEE transactions on pattern analysis and machine intelligence 28 (8), pp. 1251–1261. Cited by: §3.2.
-  (2018) Learning to actively learn neural machine translation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 334–344. Cited by: §5.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th ACL, pp. 311–318. Cited by: §4.1.
- PyTorch: tensors and dynamic neural networks in Python with strong GPU acceleration. Cited by: §4.2.
-  (2004) A day of great illumination: bf skinner’s discovery of shaping. Journal of the experimental analysis of behavior 82 (3), pp. 317–328. Cited by: §1.
-  (2019) Competence-based curriculum learning for neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1162–1172. Cited by: §1, §1, §3.2, 2nd item, 3rd item, §4.3, §5.
- Active learning for convolutional neural networks: a core-set approach. In International Conference on Learning Representations. Cited by: §3.2.
-  (2009) Active learning literature survey. Technical report University of Wisconsin-Madison Department of Computer Sciences. Cited by: §5.
-  (2018) Modularized implementation of deep rl algorithms in pytorch. GitHub. Note: https://github.com/ShangtongZhang/DeepRL Cited by: §4.2.
-  (2014) Deterministic policy gradient algorithms. In International Conference on Machine Learning, pp. 387–395. Cited by: §1, §3.1.
-  (1958) Reinforcement today.. American Psychologist 13 (3), pp. 94. Cited by: §1.
-  (2019) The evolved transformer. In International Conference on Machine Learning, pp. 5877–5886. Cited by: §3.3.
-  (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
-  (1992) Practical issues in temporal difference learning. In Advances in neural information processing systems, pp. 259–266. Cited by: §3.1.
-  (2016) Learning the curriculum with bayesian optimization for task-specific word representation learning. In Proceedings of the 54th ACL (Volume 1: Long Papers), pp. 130–139. Cited by: §5.
-  (2017) Dynamic data selection for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 1400–1410. Cited by: §1.
-  (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §4.2.
-  (2018) Denoising neural machine translation training with trusted data and online data selection. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 133–143. Cited by: §1, §1, 1st item, §4.3, §5, §5.
-  (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8 (3-4), pp. 229–256. Cited by: §3.1.
-  (2018) Reinforced co-training. arXiv preprint arXiv:1804.06035. Cited by: §5.
- Multi-class active learning by uncertainty sampling with diversity maximization. International Journal of Computer Vision 113 (2), pp. 113–127. Cited by: §3.2.
-  (2018) An empirical exploration of curriculum learning for neural machine translation. arXiv preprint arXiv:1811.00739. Cited by: §5.