Many machine learning (ML) problems involve iterative alternate optimization of different objectives w.r.t. different sets of parameters until a global consensus is reached. For instance, in training generative adversarial networks (GANs) (Goodfellow et al., 2014), parameters of the generator and the discriminator are alternately updated to an equilibrium; in many multi-task learning problems (Argyriou et al., 2007), one usually has to alternate the optimization of different task-specific objectives on corresponding data until the target task performance is maximized. In these processes, one needs to determine which objective and which set of parameters to choose at each step, and subsequently, how many iteration steps to perform for the subproblem. We refer to this as determining an optimization schedule (or update schedule).
While extensive research has focused on developing better optimization algorithms or update rules (Kingma and Ba, 2014; Bello et al., 2017; Duchi et al., 2011; Sutskever et al., 2013), how to select optimization schedules has remained less studied. When the objective is complex (e.g. non-convex or combinatorial) and the parameters to be optimized are high-dimensional, the optimization schedule can directly impact the quality of convergence. We hypothesize that the schedule is learnable in a data-driven way, based on the following empirical evidence: (1) The optimization of many ML models is sensitive to the update schedule. For example, the updates of the generator and the discriminator in GANs are carefully reconciled to avoid mode collapse or vanishing gradients (Goodfellow et al., 2014; Radford et al., 2015). In many multi-task learning or regularizer-augmented problems, the optimization target is a combination of multiple task-specific objectives. It is desirable to weight each objective differently, while different values of the weights result in different (local) optima. This indicates that different loss terms should not be treated equally, and that achieving the best downstream task performance requires optimizing each loss term to a different extent. (2) Previous research and practice suggest that there exist optimization schedules that are more likely to produce better convergence than random ones, e.g. Arjovsky et al. (2017) and Salimans et al. (2016) suggest that keeping the steps of updating the generator and discriminator of GANs at a fixed ratio leads to faster and more stable training of GANs.
Based on this hypothesis, we develop AutoLoss, a generic meta-learning framework to automatically determine the optimization schedule in iterative and alternate optimization processes. AutoLoss introduces a parametric controller attached to an alternate optimization task. The controller is trained to capture the relations between the past history and the current state of the optimization process on the one hand, and the next decision on the update schedule on the other. It takes as input a set of status features, and decides which objective to optimize and which set of parameters to update. The controller is trained via policy gradient to maximize the eventual outcome of the optimization (e.g. downstream task performance). Once trained, it can guide the optimization of task models to achieve higher quality of convergence faster, by predicting better schedules.
To evaluate the effectiveness of AutoLoss, we instantiate it on four typical ML tasks: $d$-ary quadratic regression, classification using a multi-layer perceptron (MLP), image generation using GANs, and neural machine translation (NMT) based on multi-task learning. We propose an effective set of features and reward functions that are suitable for the controllers' learning and decisions. We show that, on all four tasks, the AutoLoss controller is able to capture the distribution of better optimization schedules that result in higher quality of convergence on the corresponding task than strong baselines. For example, on quadratic regression with L1 regularization, it learns to detect the potential risk of overfitting and incorporates L1 regularization when necessary, helping the task model converge to better results that can hardly be achieved by optimizing linear combinations of objective terms. On GANs, the AutoLoss controller learns to balance the training of the generator and discriminator dynamically, and reports both faster per-epoch convergence and better quality of generators after convergence, compared to fixed heuristic-driven schedules. On machine translation, it automatically learns to resemble human-tuned update schedules while being more flexible, and reports better perplexity results.
In summary, we make the following contributions in this paper: (1) We present a unified formulation for iterative and alternate optimization processes, based on which we develop AutoLoss, a generic framework to learn the discrete optimization schedule of such processes using reinforcement learning (RL). To our knowledge, this is the first framework that tries to learn the optimization schedule in a data-driven way. (2) We instantiate AutoLoss on four ML tasks: $d$-ary regression, MLP classification, GANs, and NMT. We propose a novel set of features and reward functions to facilitate the training of AutoLoss controllers. (3) We empirically demonstrate AutoLoss' efficacy: it delivers higher quality of convergence than strong baselines for all four tasks on synthetic and real datasets. Training an AutoLoss controller has an acceptable overhead, lower than that of most hyperparameter search methods; the trained AutoLoss controller is generalizable: it can guide and improve the training of a new task model with different specifications, or on a different dataset.
2 Related Work
Alternate Optimization. Many ML models are trained using algorithms with iterative and alternate workflows, such as EM (Moon, 1996), stochastic gradient descent (SGD) (Bottou, 2010), coordinate descent (Wright, 2015), multi-task learning (Zhang and Yang, 2017), etc. AutoLoss can improve these processes by learning a controller in a data-driven way, and figuring out better update schedules using this controller, as long as the schedule does affect the optimization goal. In this paper, we focus mostly on optimization problems, but note that AutoLoss is applicable to alternate processes that involve non-optimization subtasks, such as sampling methods (Griffiths and Steyvers, 2004; Ma et al., 2015).
Meta learning. Meta learning (Andrychowicz et al., 2016; Maclaurin et al., 2015; Wang et al., 2016; Finn et al., 2017; Chen et al., 2016) has drawn considerable interest from the community, and has recently been applied to improve the optimization of ML models (Ravi and Larochelle, 2016; Li and Malik, 2016; Bello et al., 2017; Fan et al., 2018). Among these works, the closest to ours are Li and Malik (2016), Bello et al. (2017), and Fan et al. (2018). Li and Malik (2016) propose learning to optimize, which directly predicts the gradient values at each step of SGD. Since the gradients are continuous and usually high-dimensional, directly regressing their values might be difficult, and the learned gradient regressor is not transferable to new models or tasks. In contrast, Bello et al. (2017) propose to learn better gradient update rules based on a domain-specific language. The learned rules outperform manually designed ones and are generalizable. AutoLoss differs from this line of work: instead of learning to generate values of updates (gradients), AutoLoss focuses on producing better scheduling of updates. Therefore, AutoLoss can model other classes of problems, such as scheduling the generator and discriminator training in GANs, or even go beyond optimization problems. Fan et al. (2018) propose a learning to teach framework in which a teacher model, trained on optimization metadata, can guide the learning of student models. AutoLoss instantiates this framework in the sense that the teacher model (controller) produces better schedules for the optimization of the task model (student).
AutoML. Also of note is another line of work that applies RL to enable automatic machine learning (AutoML), such as device placement optimization (Mirhoseini et al., 2017), neural architecture search (Baker et al., 2016; Zoph and Le, 2016), etc. While addressing different problems, AutoLoss' controller is trained in a similar way (Peters and Schaal, 2008) for sequential and discrete predictions.
Background. In most ML tasks, given observed data $\mathcal{D}$, we aim to minimize an objective function $L(\mathcal{D}; \Theta)$ with respect to the parameters $\Theta$ of the model that we use to characterize the data. Solving this minimization problem involves finding the optima of $\Theta$ (denoted as $\Theta^*$), for which we usually resort to a variety of de facto optimization methods (Boyd and Vandenberghe, 2004) if closed-form solutions are unavailable. In the rest of the paper, we focus on two typical classes of optimization workflows into which many modern ML model solvers fall: iterative and alternate optimization.
Iterative optimization methods look for the optimal parameters $\Theta^*$ in an iterative-convergent way, by repeatedly updating the parameters $\Theta$ until certain stopping criteria are reached. Specifically, at iteration $t$, the parameters are updated from $\Theta^{(t)}$ to $\Theta^{(t+1)}$ following the update equation $\Theta^{(t+1)} = \Theta^{(t)} + \epsilon \cdot \Delta_L(\mathcal{D}^{(t)}; \Theta^{(t)})$, where we denote $\Delta_L$ as the function that calculates update values of $\Theta$ depending on the objective $L$, $\mathcal{D}^{(t)}$ as the subset of the data $\mathcal{D}$ used at iteration $t$, and $\epsilon$ as a scaling factor. Many widely adopted algorithms (Bottou, 2010; Boyd et al., 2003) fall into this family, e.g. in the case of SGD, $\Delta_L$ reduces to deriving the gradient updates $-\nabla_\Theta L$ (we skip optional steps such as momentum or projection for clarity), $\mathcal{D}^{(t)}$ is a stochastic batch, and $\epsilon$ is the learning rate.
To describe alternate optimization, we notice that the objective $L$ is usually composed of multiple different optimization targets, $L = \{\ell_m\}_{m=1}^{M}$, and we want to minimize a certain combination of them. For example, when fitting a regression model with mean square error (MSE), appending an L1 loss helps obtain sparsity; in this case, $L$ is written as a linear combination of MSE and L1 terms. Similarly, the parameters $\Theta$ in many cases are also composable, e.g. when the model has multiple components with independent sets of parameters. If we decompose $\Theta = \{\theta_n\}_{n=1}^{N}$, an alternate optimization (in our definition) contains multiple steps, where each step involves choosing a loss $\ell_m \in L$ and a set of parameters $\theta_n \in \Theta$, which we call determining an optimization action (notated as $a$), and updating $\theta_n$ w.r.t. $\ell_m$.
Further, we note that many ML optimization tasks in practice are both iterative and alternate, such as the training process of GANs, where the updates of generator and discriminator parameters are alternated, each with a few iterations of stochastic updates, until equilibrium.
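We therefore present iterative and alternate optimization with the following unified formulation:
$$\text{for } t = 1, \dots, T: \quad \text{choose } a_t = (\ell_m, \theta_n) \in \mathcal{A}, \quad \theta_n^{(t+1)} = \theta_n^{(t)} + \epsilon \cdot \Delta_{\ell_m, \theta_n}(\mathcal{D}^{(t)}; \Theta^{(t)}), \tag{1}$$
where $\mathcal{A}$ denotes the task-specific action space that defines all legitimate pairs of loss and parameter to choose from; $\Delta_{\ell_m, \theta_n}$ are update values of the parameters $\theta_n$ w.r.t. the loss $\ell_m$. Eq. 1 reduces to the vanilla form of iterative optimization when there is a single loss and a single set of parameters ($M = N = 1$).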
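As a rough illustration of the unified formulation, the following minimal sketch runs an iterative and alternate optimization loop on a toy problem. Everything concrete here is an assumption made for illustration only: the one-parameter linear model, the two losses (`mse` and `l1`), the hand-written `schedule` callables, and all hyperparameter values are hypothetical and are not taken from the paper.

```python
import random


def mse_grad(theta, batch):
    """Gradient of the mean squared error for the toy model y = theta * x."""
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)


def l1_grad(theta, batch):
    """Subgradient of |theta| (taken as 0 at theta == 0); the batch is unused."""
    return (theta > 0) - (theta < 0)


def alternate_optimize(theta, data, schedule, steps, lr=0.05, batch_size=8, seed=0):
    """At each step, `schedule(t)` picks which loss to descend on (the action)."""
    rng = random.Random(seed)
    # Toy action space: a single parameter set paired with one of two losses.
    grads = {"mse": mse_grad, "l1": l1_grad}
    for t in range(steps):
        action = schedule(t)                      # choose the action for step t
        batch = rng.sample(data, batch_size)      # stochastic batch for step t
        theta = theta - lr * grads[action](theta, batch)  # SGD-style update
    return theta


# Noise-free toy data with true slope 3.0 (hypothetical).
data = [(x / 10.0, 3.0 * x / 10.0) for x in range(-20, 21)]

# Schedule A: always optimize the MSE term.
theta_mse = alternate_optimize(0.0, data, lambda t: "mse", steps=500)
# Schedule B: take an L1 step every fifth iteration.
theta_mixed = alternate_optimize(
    0.0, data, lambda t: "l1" if t % 5 == 4 else "mse", steps=500
)
```

In this toy setting the two hand-written schedules converge to different parameter values, which is the kind of schedule sensitivity that motivates learning the schedule instead of fixing it by hand.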
AutoLoss. Given the formulation in Eq. 1, our goal is to determine the sequence of actions $\{a_t\}_{t=1}^{T}$, i.e. which losses to optimize and what parameters to update at each step $t$, in order to maximize the downstream task performance. We introduce a meta model, which we call the controller, to distinguish it from the task model used in the downstream task. The controller is expected to learn during its exploration of task model optimization processes, and to be able to decide how to update once sufficient knowledge has been accumulated.
Specifically, we let the controller make sequential decisions at each step $t$; it scans through the past history and the current state of the process (described as a feature vector $X^{(t)}$), and predicts a one-hot vector $Y^{(t)}$ over the action space $\mathcal{A}$, i.e. the $k$th action in $\mathcal{A}$ will be selected if the $k$th entry of $Y^{(t)}$ is 1. We model our controller as a conditional distribution $p(y \mid x; \phi)$ parameterized by $\phi$, where we denote $y$ and $x$ as the decision variable and the feature variable, respectively. (An alternative is to condition the decision at step $t$ on the decision made at step $t-1$; we choose the simpler form to highlight the generic idea behind AutoLoss.) At each step $t$, we sample $Y^{(t)} \sim p(y \mid x = X^{(t)}; \phi)$, and perform updates following Eq. 1 and $Y^{(t)}$.
Parameter Learning. The parameters $\phi$ of the controller are trained to maximize the performance of the optimization task given sampled sequences of decisions within $T$ steps, notated as $\mathcal{Y} = \{Y^{(t)}\}_{t=1}^{T}$. Accordingly, we introduce the training objective of the controller as $J(\phi) = \mathbb{E}_{\mathcal{Y} \sim p(\cdot \mid x; \phi)}[R(\mathcal{Y})]$, where $R(\mathcal{Y})$ is the reward function that evaluates the final task performance after applying the schedule $\mathcal{Y}$ for its optimization. We discuss the form of $R$ in §4. Since the decision process involves non-differentiable sampling, we learn the parameters using REINFORCE (Williams, 1992) and its later variants (Schulman et al., 2017b; see §4 for details), where the unbiased policy gradients at each updating step of the controller are estimated by sampling $S$ sequences of decisions $\{\mathcal{Y}_s\}_{s=1}^{S}$ (with $S$ set to the same value in all experiments) and computing
$$\nabla_\phi J(\phi) \approx \frac{1}{S} \sum_{s=1}^{S} \sum_{t=1}^{T} \nabla_\phi \log p(Y_s^{(t)} \mid X_s^{(t)}; \phi) \, \big(R(\mathcal{Y}_s) - B\big), \tag{2}$$
where $Y_s^{(t)}$ is the $t$th decision in $\mathcal{Y}_s$. To reduce the variance, we introduce a baseline term $B$ in Eq. 2 to stabilize the training (similar to Pham et al. (2018)), where $B$ is defined as a moving average of the received reward: $B \leftarrow \gamma B + (1 - \gamma) R$, with $\gamma$ as a decay factor. Whenever applicable, the final reward is clipped to a given range to avoid exploding or vanishing gradients. We present the detailed training algorithm in Appendix A.1.
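To make the policy-gradient training with a moving-average baseline more concrete, here is a minimal sketch under strong simplifying assumptions: the controller is a logistic (Bernoulli) policy over a single bias feature, the reward is a made-up function of the sampled decisions, and the names `sample_schedule` and `reinforce_update` as well as all hyperparameter values are hypothetical. It is not the training algorithm of Appendix A.1.

```python
import math
import random


def sample_schedule(phi, features, rng):
    """Sample one Bernoulli decision per step from a logistic controller.

    Also returns the summed score function, i.e. the gradient of the
    log-probability of the sampled decisions with respect to phi.
    """
    decisions, score = [], [0.0] * len(phi)
    for x in features:
        p = 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(phi, x))))
        y = 1 if rng.random() < p else 0
        decisions.append(y)
        for i, xi in enumerate(x):
            score[i] += (y - p) * xi  # d/dphi log p(y | x; phi) for a logistic policy
    return decisions, score


def reinforce_update(phi, features, reward_fn, baseline, rng,
                     num_samples=8, lr=0.1, decay=0.9):
    """One REINFORCE step with a moving-average baseline."""
    grad = [0.0] * len(phi)
    mean_reward = 0.0
    for _ in range(num_samples):
        decisions, score = sample_schedule(phi, features, rng)
        reward = reward_fn(decisions)
        mean_reward += reward / num_samples
        for i, s in enumerate(score):
            grad[i] += s * (reward - baseline) / num_samples
    new_phi = [w + lr * g for w, g in zip(phi, grad)]   # gradient ascent on the reward
    new_baseline = decay * baseline + (1.0 - decay) * mean_reward
    return new_phi, new_baseline


# Toy setup (hypothetical): the reward is the fraction of steps on which
# decision 1 was taken, so the controller should learn to prefer decision 1.
rng = random.Random(0)
features = [[1.0]] * 10          # ten steps, one constant feature per step
phi, baseline = [0.0], 0.0
for _ in range(200):
    phi, baseline = reinforce_update(
        phi, features, lambda d: sum(d) / len(d), baseline, rng
    )
```

Subtracting the baseline leaves the gradient estimate unbiased while shrinking its variance; in this toy problem the single controller weight drifts upward as the policy learns to favor the rewarded decision.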
We next apply AutoLoss to four specific ML tasks: $d$-ary quadratic regression and MLP classification with L1 regularization, image generation using GANs, and neural machine translation based on multi-task learning. We instantiate the action space, the features, and the reward function for each of these tasks.
4.1 Quadratic Regression and MLP Classification with L1 Regularization
Given training data generated by a linear model with Gaussian noise, we try to fit them using a $d$-ary quadratic model, whose parameters $\Theta$ are optimized by minimizing the mean square error (MSE) $\ell_1$. Since fitting the data using a higher-order model is prone to overfitting, we add an L1 term $\ell_2 = \|\Theta\|_1$. A traditional way to find $\Theta^*$ is to minimize $\ell_1 + \lambda \ell_2$, where $\lambda$ is a hyperparameter yet to be determined by hyperparameter search. This problem can be solved using many iterative optimization methods, e.g. SGD. To model this problem using AutoLoss, we define $L = \{\ell_1, \ell_2\}$ ($M = 2$), with $\Theta$ as a single set of parameters ($N = 1$), and $\mathcal{A} = \{(\ell_1, \Theta), (\ell_2, \Theta)\}$, i.e. the controller has Bernoulli outputs from which we sample decisions. Similarly, we apply AutoLoss in training a binary MLP classifier with ReLU nonlinearity, which is non-convex and highly prone to overfitting. We materialize $L = \{\ell_1, \ell_2\}$ and $\Theta$, where $\Theta$ are all MLP parameters, with $\ell_1$ as the binary cross entropy (BCE) and $\ell_2 = \|\Theta\|_1$, and $\mathcal{A} = \{(\ell_1, \Theta), (\ell_2, \Theta)\}$.
For both tasks, we design the feature vector $X^{(t)}$ as a concatenation of the following features in order to capture the current optimization state and the past history: (1) training progress: the percentile progress of training. (2) normalized gradient magnitude: an $M$-dim vector where the $m$th entry is the normalized gradient magnitude of the $m$th loss term. (3) loss values: an $M$-dim vector that contains the value of each loss term at step $t$. Extracting features (2) and (3) requires computing the gradient and the value of every loss term repeatedly at each step $t$, which might be inefficient. We alternatively maintain and use their latest history values: we compute the gradient and the value of a loss term only when the controller has decided to optimize that term at the current step, and update their values in the history accordingly. (4) validation metrics: the value of the task loss (MSE for regression or BCE for classification) evaluated on a validation set, together with the exponential moving averages of it and of its higher-order differences. Similarly, we evaluate the validation error only when needed and use the most recent values stored in the history.
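The "latest history values" idea for features (1)-(3) can be sketched as a small cache that is refreshed only for the loss term that was just optimized. This is an illustrative sketch, not the authors' implementation: the class name `FeatureHistory`, the loss names, and the use of an unnormalized Euclidean gradient norm are assumptions, and the validation features (4) are omitted for brevity.

```python
class FeatureHistory:
    """Keeps the most recent loss value and gradient norm of each loss term."""

    def __init__(self, loss_names):
        self.names = list(loss_names)
        self.loss = {name: 0.0 for name in self.names}
        self.grad_norm = {name: 0.0 for name in self.names}

    def record(self, name, loss_value, grad):
        """Refresh only the entries of the loss term that was just optimized."""
        self.loss[name] = loss_value
        # Assumption: plain Euclidean norm, without any normalization.
        self.grad_norm[name] = sum(g * g for g in grad) ** 0.5

    def features(self, step, total_steps):
        """Concatenate training progress, cached gradient norms and cached losses."""
        progress = step / total_steps
        return ([progress]
                + [self.grad_norm[n] for n in self.names]
                + [self.loss[n] for n in self.names])


history = FeatureHistory(["mse", "l1"])
# The controller chose the MSE term at this step, so only its entries change.
history.record("mse", loss_value=0.5, grad=[3.0, 4.0])
x = history.features(step=10, total_steps=100)
# x is [progress, grad norm (mse), grad norm (l1), loss (mse), loss (l1)]
```

Entries of loss terms that were not chosen keep their previous (possibly stale) values, which trades some freshness of the features for not having to evaluate every loss and gradient at every step.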
For the reward function, we instantiate $R$ as a simple decreasing function of err, with one form for regression and one for classification, each involving a constant, where err is MSE for regression or classification error for classification, evaluated using the converged parameters on the validation dataset. Hence, the controller obtains a larger reward if the task model achieves a lower MSE or classification error. It is worth noting that we intentionally choose these two models as a proof of concept that AutoLoss works in both convex and non-convex cases. See §5 for more experimental results.
A vanilla GAN has two sets of parameters: the parameters of the generator $G$, denoted as $\theta_G$, and those of the discriminator $D$, denoted as $\theta_D$, alternately trained via a minimax game as follows (where $z$ denotes a noise variable):
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))].$$
This is a typical alternate process that cannot be expressed by any linear combination of loss terms (and hence can hardly benefit from hyperparameter search as in the previous two cases). How to appropriately balance the optimization of $G$ and $D$ is a key factor that affects the success of GAN training. Beyond fixed schedules, automatically adjusting the training of $G$ and $D$ remains untackled. AutoLoss offers an opportunity to learn the optimization schedules of GANs.
In particular, we instantiate $L = \{\ell_G, \ell_D\}$, the losses of the generator and the discriminator, with $\Theta = \{\theta_G, \theta_D\}$. To match the possible actions in GAN training, we set $\mathcal{A}$ as $\{(\ell_G, \theta_G), (\ell_D, \theta_D)\}$, i.e. the controller chooses at each step to optimize one of $G$ and $D$. To track the training status of both $G$ and $D$, we reuse the same four aspects of features (1)-(4) as in the previous applications, with the following variations: (2) We use a 3D vector, where the first two entries are the gradient norms of $G$ and $D$, respectively, while the third is their log ratio to reflect how balanced the updates are; (3) A vector of the training losses of $G$ and $D$ and their ratio; (4) As there is no clear validation metric to evaluate a GAN, for $G$, we generate a few samples given its current parameters $\theta_G$, and compute their inception score (notated as $\mathcal{IS}$) as a feature to indicate how good $G$ is. For $D$, we draw an equal number of samples from both $G$ and the training set and use $D$'s classification error (classified as real or fake) on them as a feature. For (2)-(4), we similarly use their most recent history values for improved efficiency. In the same way, we instantiate the reward $R$ as a function of $\mathcal{IS}$ to encourage the controller to predict schedules that lead to better generators.
4.3 Multi-task Neural Machine Translation
Most multi-task learning problems require optimizing several domain-specific objectives jointly for improved performance (Argyriou et al., 2007). However, without careful weighting or scheduling of the optimization of each objective, the results may turn out worse than those obtained by optimizing a single objective (Zhang and Yang, 2017; Teh et al., 2017). As the third application, we apply AutoLoss to find better optimization schedules for multi-task learning based neural machine translation (NMT). Following Niehues and Cho (2017), we build an attention-based encoder-decoder model with three task objectives: the target task translates German into English, while the secondary tasks are German named entity recognition (NER) and German POS tagging. We use a shared encoder and three separate decoders, one for each of the aforementioned tasks. To fit within the AutoLoss framework, we set $L$ to the three task objectives, $\Theta$ to the encoder and decoder parameters, and the action space $\mathcal{A}$ to contain one action per task, i.e. the controller decides one task to optimize at a time. We again reuse the same set of features as in the previous tasks with small revisions, and define the reward function in terms of PPL, the validation perplexity. More details about the NMT task are provided in Appendix A.3.
When the task model is complex and requires numerous iterations to converge (i.e. when the number of steps $T$ in Eq. 2 is large), the controller receives sparse and delayed rewards. To facilitate the training, we adapt $T$ depending on the task: for simpler tasks that converge within fewer iterations (e.g. regression and MLP classification), $T$ equals the number of steps to convergence. For GANs and NMT, which need longer exploration, we set $T$ to a fixed constant (instead of the maximum number of steps to convergence) and train the controller online using the proximal policy optimization (PPO) algorithm in actor-critic style. We accordingly adjust the reward function to be based on the improvement of the performance metric ($\mathcal{IS}$ for GANs and PPL for NMT), scaled by a hyperparameter; i.e. we generate a reward every $T$ steps based on the improvement of performance and use it as the reward for each step in this segment of $T$ steps. Since the improvement will be tiny around optima, we normalize the reward in case it is too small to provide enough training signal.
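The segment-wise reward described above can be sketched as follows. The exact reward formula and its normalization are not recoverable from the text, so this sketch makes its own assumptions: the reward of a segment is simply the raw improvement of the metric over that segment times a hypothetical `scale` factor, no normalization is applied, and the function name `segment_rewards` is illustrative.

```python
def segment_rewards(metric_at_segment_end, segment_len, scale=1.0,
                    higher_is_better=True):
    """Turn a metric measured at segment boundaries into per-step rewards.

    The improvement over each segment is shared by every step in that segment.
    """
    rewards = []
    pairs = zip(metric_at_segment_end, metric_at_segment_end[1:])
    for prev, curr in pairs:
        improvement = (curr - prev) if higher_is_better else (prev - curr)
        rewards.extend([scale * improvement] * segment_len)
    return rewards


# Hypothetical validation perplexities measured every 2 steps (lower is better).
ppl = [10.0, 8.0, 7.5]
r = segment_rewards(ppl, segment_len=2, higher_is_better=False)
```

For a metric where higher is better, such as an inception score, the same function would be called with `higher_is_better=True`.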
In this section, we evaluate AutoLoss empirically on the four tasks using synthetic and real data. We report the following major findings: (1) Overall, AutoLoss can help achieve better quality of convergence faster on all four tasks compared to strong baselines (§5.1), with acceptable overheads in controller training. (2) A controller trained on one task model is transferable to guide the training of another task model with different configurations (e.g. neural architectures), or on totally different data distributions, while still converging faster and better (§5.3).
5.1 Quality of Convergence
We first verify the feasibility of the AutoLoss idea. We empirically show that under the formulation of Eq. 1, there do exist learnable update schedules, and that AutoLoss is able to capture their distribution and guide the task model to achieve better quality of convergence across multiple tasks and models.
5.1.1 Regression and Classification with L1 Regularization
We first apply AutoLoss on two relatively simple tasks with synthetic data, and see whether it can outperform its alternatives (e.g. minimizing linear combinations of loss terms) in combating overfitting. Specifically, for regression, we synthesize a dataset using a linear model with Gaussian noise. In this case, a quadratic regressor is over-expressive and highly likely to overfit the data without proper regularization. Similarly, for MLP, we synthesize a classification dataset with risks of overfitting by letting only some of the input dimensions be informative, whereas the rest are either linear combinations of them or random noise. Details of how the data are synthesized are provided in Appendix A.2. We split our dataset into 5 parts following Fan et al. (2018): two partitions are used for controller training; once trained, the controller is used to guide the training of a new task model on another two partitions. Hence, the controller cannot succeed by merely memorizing good schedules on the partitions it was trained on. We reserve the fifth partition to assess the task model after guided training. For both regression and classification, our controller is simply a two-layer MLP with ReLU activation.
We compare MSE or classification error (err) evaluated on the reserved fifth partition in Table 1 against the following methods: (1) w/o L1: minimizing only the MSE or BCE term on the training data. (2) Three manually designed flexible schedules (S1-S3) that optimize the L1 term at an iteration only if a threshold condition on two monitored quantities is met, where the quantities are (S1) the task loss values (MSE or BCE) evaluated on two different data partitions, (S2) the L1 loss and the task loss evaluated on one partition, (S3) the gradient norms of the L1 and MSE losses. We grid search the threshold on training data and only report the best achieved results. (3) DGS: we minimize the linear combination of the task loss and the L1 term, with the weight $\lambda$ of the L1 term determined by dense grid search (DGS); in particular, we densely grid search the best $\lambda$ from a pre-selected interval using 50 experiments, and report the best MSE. (Note that the DGS presented here is a very strong baseline and might even be unrealistic in practice due to unacceptable cost or lack of prior knowledge on hyperparameters.)
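For intuition about the DGS baseline, the sketch below performs a dense grid search over the weight of an L1 term on a deliberately tiny problem. It is an assumption-laden toy, not the experimental setup: the model has a single scalar parameter so that the L1-regularized fit has a closed form (soft-thresholding), and the data, the search interval, and the helper names `fit_l1`, `mse` and `dense_grid_search` are all hypothetical.

```python
def fit_l1(train, lam):
    """Closed-form minimizer of mean((theta*x - y)^2) + lam*|theta| for scalar theta."""
    a = sum(x * x for x, _ in train) / len(train)
    b = sum(x * y for x, y in train) / len(train)
    shrunk = max(abs(b) - lam / 2.0, 0.0)        # soft-thresholding of b
    return (shrunk if b > 0 else -shrunk) / a


def mse(theta, data):
    """Mean squared error of the scalar model y = theta * x."""
    return sum((theta * x - y) ** 2 for x, y in data) / len(data)


def dense_grid_search(train, val, lo, hi, num=50):
    """Fit once per grid point and keep the weight with the lowest validation MSE."""
    grid = [lo + (hi - lo) * i / (num - 1) for i in range(num)]
    return min(grid, key=lambda lam: mse(fit_l1(train, lam), val))


# Hypothetical data: the training targets overshoot the true slope of 2.0.
train = [(1.0, 2.2), (2.0, 4.4), (3.0, 6.6)]
val = [(1.0, 2.0), (2.0, 4.0)]
best_lam = dense_grid_search(train, val, lo=0.0, hi=4.0)
best_theta = fit_l1(train, best_lam)
```

Each grid point costs one full fit, so a 50-point search multiplies the training cost by 50; this is the overhead that a learned schedule aims to avoid.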
Without regularization, the performance deteriorates: we observe a large gap between w/o L1 and the methods with L1 on both tasks (convex and non-convex). AutoLoss manages to detect and combat the potential risk of overfitting with the designed features, and automatically optimizes the provided L1 term when appropriate. In terms of task performance, AutoLoss outperforms the three manually designed schedules as well as DGS, a practically very strong method. This is not unexpected, as AutoLoss' parametric controller offers more flexibility than heuristic-driven schedules or any fixed-form objective searched over a dense grid of weights (i.e. DGS). To understand this, consider the $d$-ary quadratic regression, which is convex and whose global optimum is determined only by the weight $\lambda$ of the L1 term. AutoLoss frees the loss surface from being strictly characterized by a linear combination, and thus allows for finding better solutions that not only enjoy the regularization effects (i.e. sparsity), but also more closely characterize the observed data. As a side benefit, AutoLoss liberates us from searching for $\lambda$, which might be difficult or expensive, and is not transferable from one model or dataset to another. We perform an additional experiment in Figure 1 where we set different values of $\lambda$ for the L1 term, and note that AutoLoss always reaches the same quality of convergence regardless of $\lambda$. Similar results are observed on MLP classification, a highly non-convex model. The results suggest that AutoLoss might be a better way to incorporate regularization than fixed-form combinations of loss terms. We further provide an ablation study on the importance of each designed feature in Appendix A.4.
We next use AutoLoss to help train GANs to generate images. We first build a DCGAN with the architecture of $G$ and $D$ following Radford et al. (2015), and train it on MNIST. As the task model itself is hard to train, in this experiment we set the controller as a linear model with Bernoulli outputs. The GAN minimax loss goes beyond the form of linear combinations, and there is no rigorous evidence showing how the training of $G$ and $D$ should be scheduled. Following common practice, we compare AutoLoss to the following baselines: (1) GAN: the vanilla GAN where $D$ and $G$ are alternately updated once at a time; (2) GAN 1:K: as suggested by some literature, we build a series of baselines that update $D$ and $G$ at the ratio 1:K (K = 3, 5, 7, 9, 11) in case $D$ is over-trained to reject all samples by $G$; (3) GAN K:1: baselines in which we, conversely, bias toward more updates for $D$. To evaluate $G$, we use the inception score ($\mathcal{IS}$) (Salimans et al., 2016) as a quantitative metric, and also visually inspect generated results. To calculate the $\mathcal{IS}$ of digit images, we follow Deng et al. (2017) and use a CNN classifier trained on the MNIST train split as the "inception network". In Figure 2, we plot $\mathcal{IS}$ w.r.t. the number of training epochs, comparing AutoLoss to the four best-performing baselines out of all GAN 1:K and GAN K:1, each with three trials of experiments. We also report the converged $\mathcal{IS}$ for all methods here: 8.6307, 9.0026, 9.0232, 9.0145, 9.0549 for GAN, GAN (1:5), GAN (1:7), GAN (1:9), AutoLoss, respectively.
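The fixed-ratio baselines amount to a periodic update schedule, which can be sketched in a few lines. This is an illustrative sketch only: it assumes that the first count refers to discriminator updates and the second to generator updates, that the discriminator updates come first within each period, and the function name `fixed_ratio_schedule` is hypothetical.

```python
from itertools import cycle, islice


def fixed_ratio_schedule(d_steps, g_steps):
    """Endlessly repeat d_steps discriminator updates, then g_steps generator updates."""
    return cycle(["D"] * d_steps + ["G"] * g_steps)


# First eight actions of a 1:3 schedule: one "D" update per three "G" updates.
first_eight = list(islice(fixed_ratio_schedule(1, 3), 8))
```

Such a schedule is fixed before training starts and ignores the training state, whereas a learned controller can change the balance between the two players as training progresses.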
In general, GANs trained with AutoLoss present two improvements over the baselines: higher quality of final convergence in terms of $\mathcal{IS}$, and faster per-epoch convergence. For example, compared to GAN 1:1, AutoLoss improves the converged $\mathcal{IS}$ by about 0.4 (9.05 vs. 8.63), and is almost 3x faster to reach the score at which GAN 1:1 converges, on average. We observe that GAN 1:7 performs closest to AutoLoss: it achieves $\mathcal{IS} = 9.02$, compared to 9.05 for AutoLoss, though it is almost 5 epochs slower to converge and exhibits higher variance across multiple experiments. It is worth noting that all GAN K:1 baselines perform worse than the rest and are omitted from Figure 2, echoing the statements (Arjovsky et al., 2017; Gulrajani et al., 2017; Deng et al., 2017) that more updates of $G$ than $D$ might be preferable in GAN training. We visualize some digit images generated by AutoLoss-guided GANs in Appendix A.6, and find that the visual quality correlates directly with $\mathcal{IS}$ and that no mode collapse is observed.
5.1.3 Multi-task Neural Machine Translation
Lastly, we evaluate AutoLoss on multi-task neural machine translation. Our NN architecture exactly follows the one in Niehues and Cho (2017). More information about the dataset and experiment settings is provided in Appendix A.3 and Niehues and Cho (2017). We use an MLP controller with a 3-way softmax output, train it along with the NMT model, and compare it to the following approaches: (1) MT: a single-task NMT baseline trained with parallel data; (2) FixedRatio: a manually designed schedule that selects which task objective to optimize next based on a ratio proportional to the size of the training data for each task; (3) FineTuned MT: train with FixedRatio first and then fine-tune carefully on the MT task. Note that baselines (2) and (3) are searched and heavily tuned by the authors of Niehues and Cho (2017). We evaluate the perplexity (PPL) on the validation set w.r.t. training epochs in Figure 3(L), and report the final converged PPL as well: 3.77, 3.68, 3.64, 3.54 for MT, FixedRatio, FineTuned MT and AutoLoss, respectively.
We observe that all methods progress similarly, but AutoLoss and FineTuned MT surpass the other two after several epochs. AutoLoss performs similarly to FineTuned MT in terms of training progress before epoch 10, though AutoLoss learns the schedule fully automatically while FineTuned MT requires heavy manual crafting. AutoLoss is about 5x faster than FixedRatio to reach the PPL at which the latter converges, and reports the lowest PPL among all approaches after convergence, owing to its ability to parameterize and learn the update schedules. We visualize the controller's softmax output after convergence in Fig 3(M). It is interesting to notice that the controller meta-learns to up-weight the target NMT objective at a later phase of the training. This, in some sense, resembles the "fine-tuning the target task" strategy that appears in much of the multi-task learning literature, but is much more flexible thanks to the parametric controller.
AutoLoss introduces three possible sources of overhead: controller feature extraction, controller inference and training, and the potential cost of additional task model training. Since we build features merely based on existing metadata or histories (see §4), which have to be computed anyway even without AutoLoss, the feature extraction has negligible overhead. Moreover, as a simple 2-layer MLP controller suffices for many applications in our experiments, training or inference with the controller adds minimal computational overhead, especially on modern hardware such as GPUs.
Besides, for tasks that converge within a few iterations (e.g. $d$-ary regression and MLP classification), AutoLoss, similar to grid search, requires repeating multiple experiments in order to accumulate sufficient supervision. To assess the resulting overhead, we perform a fixed-budget experiment: given a fixed number of data batches allowed to be scanned, we compare in Fig 3(R) the convergence reached by AutoLoss and DGS on the regression task. We observe that AutoLoss is much more sample-efficient: it achieves better convergence with fewer training runs. On the other hand, for computationally heavy tasks that need many steps to converge (GANs, NMT), the controller training can, in most cases, finish simultaneously with the task model training, and does not repeat experiments as many times as other hyperparameter search methods would.
We next investigate the transferability of a trained controller to different models or datasets.
5.3.1 Transfer to Different Models
To see whether a differently configured task model can benefit from a trained controller, we design the following experiment: we let a DCGAN controller trained on MNIST guide the training of new GANs (from scratch) whose $G$ and $D$ have randomly sampled neural architectures. We describe the sampling strategies in Appendix A.5. We compare the (averaged) converged $\mathcal{IS}$ with and without the AutoLoss controller in Fig 4(L), while we skip cases in which both AutoLoss and the baseline fail because improper neural architectures are sampled. AutoLoss manages to generalize to unseen architectures, and outperforms DCGAN in 16 out of 20 architectures. This indicates that the trained controller is not simply memorizing the optimization behavior of the specific task model it is trained with; instead, the knowledge learned on one neural network is generalizable to novel model architectures.
5.3.2 Transfer to Different Data Distributions
Our second set of experiments investigates whether an AutoLoss controller can generalize to different data distributions. Accordingly, we let a controller trained on one dataset guide the training of the same task model from scratch, but on a different dataset with a totally different distribution. We compare the AutoLoss-trained model to other methods, and report the results in Table 2 and Figure 4(R) on two tasks respectively: MLP classification, for which we synthesize 4 datasets following a generative process with 4 different specifications (therefore different distributions), with one of them used for controller training; and GANs, where we first train a controller for digit generation on MNIST, and use the controller to guide the training of the same GAN architecture on CIFAR-10. In both cases, we observe that AutoLoss manages to guide the model training on unseen data. On MLP classification, it delivers trained models comparable to or better than models searched via DGS, while being 50x more economical: DGS has to repeat 50 or more experiments to achieve the reported results in Table 2 on unseen data (or a new model), whereas AutoLoss, once trained, incurs no additional search cost at inference. On image generation, when transferred from digit images to natural images, a controller-guided GAN achieves both higher quality of convergence and faster per-epoch convergence than a normal GAN trained with various fixed schedules, among which we observe that GAN 1:1 performs best on CIFAR-10, while most GAN K:1 schedules fail. We visually inspect the images generated by DCGANs guided by the MNIST-trained controller and find the image quality satisfying, with no mode collapse occurring; we also compare the converged $\mathcal{IS}$ against the best score reported by DCGANs in previous literature. Visualization of the generated CIFAR-10 images can be found in Appendix A.7.
Finally, we are also interested in knowing whether a trained controller is transferable when both data and models change. We transfer a DCGAN controller trained on MNIST to a new DCGAN with different architectures on CIFAR-10, and observe comparable quality and speed of convergence to the best fixed schedule on CIFAR-10, though AutoLoss bypasses the schedule search and is more readily available.
We propose a unified formulation for iterative alternate optimization and develop AutoLoss, a framework that automatically learns and generates optimization schedules. Comprehensive experiments on synthetic and real data demonstrate that the optimization schedule produced by the AutoLoss controller can guide the task model to a better quality of convergence, and that the trained AutoLoss controller is transferable from one dataset to another, or from one model to another.
- Andrychowicz et al. (2016) M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pages 3981–3989, 2016.
- Argyriou et al. (2007) A. Argyriou, T. Evgeniou, and M. Pontil. Multi-task feature learning. In Advances in neural information processing systems, pages 41–48, 2007.
- Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein gan. arXiv preprint arXiv:1701.07875, 2017.
- Baker et al. (2016) B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.
- Bello et al. (2017) I. Bello, B. Zoph, V. Vasudevan, and Q. V. Le. Neural optimizer search with reinforcement learning. arXiv preprint arXiv:1709.07417, 2017.
- Benikova et al. (2014) D. Benikova, C. Biemann, M. Kisselew, and S. Pado. Germeval 2014 named entity recognition shared task: companion paper. 2014.
- Bottou (2010) L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
- Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
- Boyd et al. (2003) S. Boyd, L. Xiao, and A. Mutapcic. Subgradient methods. 2003.
- Brants et al. (2004) S. Brants, S. Dipper, P. Eisenberg, S. Hansen-Schirra, E. König, W. Lezius, C. Rohrer, G. Smith, and H. Uszkoreit. Tiger: Linguistic interpretation of a german corpus. Research on Language and Computation, 2(4):597–620, Dec 2004. ISSN 1572-8706. doi: 10.1007/s11168-004-7431-3. URL https://doi.org/10.1007/s11168-004-7431-3.
- Cettolo et al. (2012) M. Cettolo, C. Girardi, and M. Federico. Wit: Web inventory of transcribed and translated talks. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT), pages 261–268, Trento, Italy, May 2012.
- Chen et al. (2016) Y. Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. arXiv preprint arXiv:1611.03824, 2016.
- Deng et al. (2017) Z. Deng, H. Zhang, X. Liang, L. Yang, S. Xu, J. Zhu, and E. P. Xing. Structured generative adversarial networks. In Advances in Neural Information Processing Systems, pages 3902–3912, 2017.
- Duchi et al. (2011) J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- Fan et al. (2018) Y. Fan, F. Tian, T. Qin, X.-Y. Li, and T.-Y. Liu. Learning to teach. arXiv preprint arXiv:1805.03643, 2018.
- Finn et al. (2017) C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
- Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- Griffiths and Steyvers (2004) T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences, 101(suppl 1):5228–5235, 2004.
- Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. arXiv preprint arXiv:1704.00028, 2017.
- Kingma and Ba (2014) D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Li and Malik (2016) K. Li and J. Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
- Luong et al. (2015) M. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. CoRR, abs/1508.04025, 2015. URL http://arxiv.org/abs/1508.04025.
- Ma et al. (2015) Y.-A. Ma, T. Chen, and E. Fox. A complete recipe for stochastic gradient mcmc. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
- Maclaurin et al. (2015) D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.
- Mirhoseini et al. (2017) A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean. Device placement optimization with reinforcement learning. arXiv preprint arXiv:1706.04972, 2017.
- Moon (1996) T. K. Moon. The expectation-maximization algorithm. IEEE Signal processing magazine, 13(6):47–60, 1996.
- Niehues and Cho (2017) J. Niehues and E. Cho. Exploiting linguistic resources for neural machine translation using multi-task learning. arXiv preprint arXiv:1708.00993, 2017.
- Peters and Schaal (2008) J. Peters and S. Schaal. Reinforcement learning of motor skills with policy gradients. Neural networks, 21(4):682–697, 2008.
- Pham et al. (2018) H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.
- Radford et al. (2015) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Ravi and Larochelle (2016) S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. 2016.
- Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2226–2234, 2016.
- Schulman et al. (2017a) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017a. URL http://arxiv.org/abs/1707.06347.
- Schulman et al. (2017b) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
- Sutskever et al. (2013) I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pages 1139–1147, 2013.
- Teh et al. (2017) Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu. Distral: Robust multitask reinforcement learning. In Advances in Neural Information Processing Systems, pages 4496–4506, 2017.
- Wang et al. (2016) J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, R. Munos, C. Blundell, D. Kumaran, and M. Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
- Williams (1992) R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.
- Wright (2015) S. J. Wright. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.
- Zhang and Yang (2017) Y. Zhang and Q. Yang. A survey on multi-task learning. arXiv preprint arXiv:1707.08114, 2017.
- Zoph and Le (2016) B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.
Appendix A
A.1 Training Algorithm
In addition to the descriptions in §3, we present the detailed training procedure in Algorithm 1. For all our experiments, we set . For simple tasks such as d-ary regression and MLP classification, which converge within a few steps (and are therefore less costly), we set as the number of iterations taken for a training instance to converge, i.e., a reward is generated upon the completion of a training instance, and we repeat multiple training instances until the controller has converged. For computationally heavy tasks such as GANs, which require many iterations to converge, we set as a fixed constant, meaning that we evaluate to generate an intermediate reward every steps (before convergence) and perform a policy gradient update step, in case the exploration takes too long and the reward is too sparse.
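The periodic-reward scheme described above can be illustrated with a minimal sketch. It is not Algorithm 1 itself: the controller here is a state-free softmax policy, the moving-average baseline and the toy task are assumptions, and the names `train_controller`, `reward_interval`, `task_step`, and `evaluate` are hypothetical. It only shows how actions are sampled at every step while a reward and a policy gradient update are produced once per fixed interval.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_controller(task_step, evaluate, num_actions, total_steps,
                     reward_interval, lr=0.1, seed=0):
    """REINFORCE-style loop with one reward per `reward_interval` steps."""
    rng = random.Random(seed)
    logits = [0.0] * num_actions  # state-free policy (simplifying assumption)
    baseline = 0.0                # moving-average baseline (assumption)
    segment = []                  # (action, probs) pairs since the last reward
    for step in range(1, total_steps + 1):
        probs = softmax(logits)
        action = rng.choices(range(num_actions), weights=probs)[0]
        task_step(action)         # one update of the chosen objective
        segment.append((action, probs))
        if step % reward_interval == 0:
            reward = evaluate()
            advantage = reward - baseline
            baseline = 0.9 * baseline + 0.1 * reward
            # Gradient of log pi(a) w.r.t. logit k is 1[k == a] - pi(k).
            for taken, p in segment:
                for k in range(num_actions):
                    indicator = 1.0 if k == taken else 0.0
                    logits[k] += lr * advantage * (indicator - p[k])
            segment.clear()
    return logits


# Toy task (hypothetical): action 0 shrinks a scalar loss, action 1 does nothing.
state = {"x": 10.0, "last": 10.0}


def toy_task_step(action):
    if action == 0:
        state["x"] *= 0.9


def toy_evaluate():
    """Reward is the loss improvement since the previous evaluation."""
    reward = state["last"] - state["x"]
    state["last"] = state["x"]
    return reward


final_logits = train_controller(toy_task_step, toy_evaluate, num_actions=2,
                                total_steps=200, reward_interval=5)
final_probs = softmax(final_logits)
```

Setting `reward_interval` equal to the length of a full training instance recovers the one-reward-per-instance regime used for the simple tasks, whereas a small fixed value corresponds to the intermediate-reward regime used for GANs.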
A.2 Data Synthesis for d-ary Quadratic Regression and MLP Classification
For the experiments in §5.1.1, we generate the dataset for the d-ary quadratic regression task as follows:
Sample the weight vector .
Sample the feature vector .
Sample a Gaussian noise .
In our experiments, the synthesized dataset has a ground-truth MSE of 3.94 (because of the Gaussian noise introduced), which we have subtracted from our results.
For the MLP classification task, we synthesize the data as follows.
Create four cluster centers by sampling from the vertices of a hypercube.
Assign two centers as positive () and the other two as negative ().
Sample the label .
Sample from if otherwise from .
Sample and generate a vector as the first 5% dimensions of .
Generate as another 5% of the dimensions of , by randomly linearly combining the dimensions in .
Generate as the remaining dimensions of , by sampling from .
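The classification data recipe above can be sketched as follows. The distributions are not recoverable from the text, so every one used here is an assumption: labels are drawn uniformly, cluster samples are unit-variance Gaussians around the chosen center, mixing weights are uniform in [-1, 1], and the remaining dimensions are standard Gaussian noise. The function name and its parameters are hypothetical.

```python
import random


def synthesize_classification(n_samples, dim, frac_informative=0.05,
                              frac_redundant=0.05, seed=0):
    """Return (X, y): informative, redundant, then noise dimensions per row."""
    rng = random.Random(seed)
    n_inf = max(2, round(dim * frac_informative))  # >= 2 so 4 vertices exist
    n_red = max(1, round(dim * frac_redundant))
    n_noise = dim - n_inf - n_red
    if n_noise < 0:
        raise ValueError("dim is too small for the requested feature split")

    # Four distinct cluster centers drawn from hypercube vertices {-1, +1}^n_inf.
    centers = []
    while len(centers) < 4:
        vertex = tuple(rng.choice((-1.0, 1.0)) for _ in range(n_inf))
        if vertex not in centers:
            centers.append(vertex)
    positive, negative = centers[:2], centers[2:]

    # Fixed random linear map from informative to redundant dimensions.
    mixing = [[rng.uniform(-1.0, 1.0) for _ in range(n_inf)]
              for _ in range(n_red)]

    X, y = [], []
    for _ in range(n_samples):
        label = rng.randint(0, 1)                       # assumed uniform labels
        center = rng.choice(positive if label == 1 else negative)
        informative = [c + rng.gauss(0.0, 1.0) for c in center]
        redundant = [sum(w * v for w, v in zip(row, informative))
                     for row in mixing]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(informative + redundant + noise)
        y.append(label)
    return X, y


X, y = synthesize_classification(n_samples=100, dim=40)
```

Changing the seed, the cluster spread, or the feature fractions yields datasets with different distributions from the same generative process.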
A.3 Details for Multi-task Neural Machine Translation
For the machine translation task, we use the WIT corpus (Cettolo et al., 2012) for German to English translation. To accelerate training, we only use one fourth of all data, which has 1M tokens. For the POS tagging task, we use the Tiger Corpus (Brants et al., 2004). The POS tag set consists of 54 tags. The German named-entity tagger is trained on the GermEval 2014 NER Shared Task data (Benikova et al., 2014). The corpus is extracted from Wikipedia, with a tag set consisting of 24 tags.
We preprocess the data by tokenizing, true-casing, and replacing all Arabic numerals with zero. In addition, we apply byte-pair encoding with 10K subwords on the source and target sides of the WIT corpus separately. We then apply the subwords to all German and English corpora.
For the task model, we use an attentional encoder-decoder architecture. The three tasks share one encoder but have their own decoders. The encoder is a two-layer bidirectional LSTM with 256 hidden units. All decoders are also two-layer bidirectional LSTMs with Luong attention (Luong et al., 2015) on the top layer. All hidden sizes in the decoders are 256. The word embeddings have a size of 128.
For the controller model, instead of REINFORCE, we apply the Proximal Policy Optimization (PPO) algorithm (Schulman et al., 2017a) to train the controller. Both the actor net and the critic net are two-layer MLPs with hidden size 32. The discount rate is set to 0.95.
For the task model, we use the Adam optimizer with learning rate 0.0005. The dropout rate is 0.3 at each layer. All gradients are clipped to 1. The batch size is 128. For the controller model, we use the Adam optimizer with learning rate 0.001. The buffer size is 2000 and the batch size is 64. The target policy net and the behavior policy net are synchronized every 10 update steps.
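For reference, the hyperparameters listed above can be collected into a single configuration. This is only one possible layout: the class and field names are hypothetical, and the text does not say whether gradients are clipped by norm or by value, so `grad_clip` records the threshold alone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskModelConfig:
    """Task-model settings as stated in the text; names are illustrative."""
    lstm_layers: int = 2          # encoder and decoders are two-layer LSTMs
    hidden_size: int = 256        # encoder and decoder hidden units
    embedding_size: int = 128
    bpe_subwords: int = 10_000
    optimizer: str = "adam"
    learning_rate: float = 5e-4
    dropout: float = 0.3
    grad_clip: float = 1.0        # clipping type (norm or value) not specified
    batch_size: int = 128


@dataclass(frozen=True)
class ControllerConfig:
    """PPO controller settings as stated in the text; names are illustrative."""
    algorithm: str = "ppo"
    mlp_layers: int = 2           # actor and critic are two-layer MLPs
    mlp_hidden_size: int = 32
    discount: float = 0.95
    optimizer: str = "adam"
    learning_rate: float = 1e-3
    buffer_size: int = 2000
    batch_size: int = 64
    sync_every: int = 10          # target/behavior policy synchronization period


task_cfg = TaskModelConfig()
ctrl_cfg = ControllerConfig()
```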
A.4 Feature Ablation Study
We investigate the importance of the designed controller features presented in §4. In particular, we report in Table 3 the performance on the regression task after dropping one of the features. We find that all features are useful, while feature (3), which captures the most recent values of all loss terms, brings the biggest improvement. We also tried current and historical states of parameters, gradients, and momentums, and found that the set of features presented in §4 achieves the best trade-off between performance and efficiency.
| Feature to drop | MSE |
|---|---|
| (2) normalized gradient magnitude | .086 |
| (3) loss values | .101 |
| (4) validation metrics | .085 |
A.5 Sampling Strategies to Generate Random DCGAN Architectures
Sample the number of filters in the base layer of and from .
Sample from .
Decide whether to use batchnorm or not.
Sample the activation functions from.
This results in possible DCGAN architectures, some of which fail to converge during training according to our experiments.
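The sampling procedure above amounts to drawing one option per design choice from a small discrete search space. The sketch below assumes placeholder option sets, since the actual sets are not recoverable from the text; the key names and values in `SEARCH_SPACE` are hypothetical, and the quantity sampled in the second step is omitted.

```python
import random

# Placeholder option sets (assumptions, not the sets used in the experiments).
SEARCH_SPACE = {
    "generator_base_filters": [32, 64, 128],
    "discriminator_base_filters": [32, 64, 128],
    "use_batchnorm": [True, False],
    "activation": ["relu", "leaky_relu", "tanh"],
}


def sample_architecture(space, rng):
    """Draw one option uniformly at random for every design choice."""
    return {name: rng.choice(options) for name, options in space.items()}


def count_architectures(space):
    """Number of distinct architectures is the product of the option counts."""
    total = 1
    for options in space.values():
        total *= len(options)
    return total


rng = random.Random(0)
arch = sample_architecture(SEARCH_SPACE, rng)
num_architectures = count_architectures(SEARCH_SPACE)  # 54 for these placeholders
```

The count of 54 applies only to the placeholder sets here and says nothing about the number of architectures in the actual experiments.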
A.6 Images Generated by AutoLoss-guided GANs on MNIST
A.7 CIFAR-10 Images Generated by GANs Guided by an AutoLoss Controller Trained on MNIST
In Figure 6, we show some images generated by DCGANs whose training is guided by an AutoLoss controller trained on MNIST (with ). We observe that the visual quality of the generated images is reasonably good.