Hippo: Taming Hyper-parameter Optimization of Deep Learning with Stage Trees

06/22/2020
by   Ahnjae Shin, et al.

Hyper-parameter optimization is crucial for pushing the accuracy of a deep learning model to its limits. A hyper-parameter optimization job, referred to as a study, involves numerous trials of training a model using different training knobs, and therefore is very computation-heavy, typically taking hours or even days to finish. We observe that trials issued from hyper-parameter optimization algorithms often share common hyper-parameter sequence prefixes. Based on this observation, we propose Hippo, a hyper-parameter optimization system that removes redundancy in the training process to reduce the overall amount of computation significantly. Instead of executing each trial independently as in existing hyper-parameter optimization systems, Hippo breaks down the hyper-parameter sequences into stages and merges common stages to form a tree of stages (called a stage tree), then executes a stage once per tree on a distributed GPU server environment. Hippo is applicable not only to single studies but also to multi-study scenarios, where multiple studies of the same model and search space can be formulated as trees of stages. Evaluations show that Hippo's stage-based execution strategy outperforms trial-based methods such as Ray Tune for several models and hyper-parameter optimization algorithms, reducing GPU-hours and end-to-end training time significantly.


1 Introduction

Deep learning (DL) models have made great leaps in various areas including image classification [resnet, cifar, imagenet], object detection [yolo], and speech recognition [deepspeech, deepspeech2]. However, such benefits come at a cost; training DL models requires large datasets and long computations that may take up to a week [gnmt] even on a hundred GPUs [gnmt]. This cost becomes more significant when we take hyper-parameter optimization into account. Since hyper-parameters can have a great impact on the quality of trained models, investigating the hyper-parameter search space often requires hundreds to thousands of trainings with different hyper-parameter settings [massively]. Consequently, naively running hyper-parameter optimization requires an exceedingly large number of GPUs, and it is crucial to explore the hyper-parameter search space as efficiently as possible.

In this paper, we aim to build a system optimized for running one or more hyper-parameter optimization jobs, each of which trains and evaluates the target DL model multiple times with different configurations. Since each training sub-procedure is identified by its unique configuration, i.e., hyper-parameters sampled from a given search space, it is natural to develop a system that can run and manage multiple trainings of the target DL model, especially on a GPU cluster. Prior works on systems for hyper-parameter optimization attempt to a) efficiently schedule training sub-procedures by considering resource utilization or fairness [gandiva, themis], b) provide new abstractions and programming interfaces for developer productivity [raytune, optuna, vizier], c) optimize resource allocation of sub-procedures according to model performance [asha, hyperdrive, hypersched], or d) design easy-to-use tuning systems that require minimal coding [chopt, nni]. Unlike these works, we explore untapped opportunities to optimize the resource usage of hyper-parameter optimization jobs in terms of the amount of computation.

Our key observation is that training modern DL models often requires changing hyper-parameter values in the midst of training to reach state-of-the-art accuracy, as they target minimizing high-dimensional, non-convex loss functions. Hence, a hyper-parameter configuration can be regarded as a sequence of values. Examples include learning rate [resnet, batchnorm, cycliclr, super, 1hour, adam, adadelta, rmsprop, hypergradient], drop-out ratio [elu], optimization algorithm [gnmt], momentum [yellofin], batch size [dont-decay-lr], image augmentation parameters [pba], training image input size [progan], and input sequence length [bert].

Figure 1: A hyper-parameter optimization study consisting of four trials. A single hyper-parameter, the learning rate (LR), is being explored within the given search space.

We find that existing approaches for hyper-parameter optimization systems [raytune, vizier, mltuner, chopt] have overlooked this important characteristic of sequential hyper-parameters, always treating hyper-parameters as single values. These approaches simply execute multiple training sub-procedures separately, without exploiting the fact that there exist redundant computations between the sub-procedures. Figure 1 shows a hyper-parameter optimization job with four sub-procedures, each with a different learning rate sequence configuration. A step indicates training with one batch of data. The first 100 training steps of two sub-procedures can be shared when they operate on the same initial learning rate value. Similarly, two other sub-procedures share a common prefix for their learning rate sequences. Instead of executing such common prefixes independently, it is possible to execute them only once and share the results across sub-procedures to avoid redundant computation and reduce the amount of resources (GPU-hours) used.

To this end, we present Hippo, a hyper-parameter optimization system that finds redundant computations in hyper-parameter optimization jobs and reuses the results of duplicate workloads. Hippo merges hyper-parameter sequence configurations into the shape of a tree, called a stage tree, so that all non-leaf nodes represent redundant computations that can be shared. Stage trees also provide the benefit of simplifying the scheduling of hyper-parameter optimization jobs, as each node in the stage tree serves as a scheduling unit. Internally, Hippo uses an additional data format, a search plan, to handle the dynamics of hyper-parameter optimization jobs and manage various states.

We evaluated Hippo with popular DL models (ResNet56, MobileNetV2, and BERT-Base) and well-known hyper-parameter optimization algorithms (SHA, ASHA, grid search) on a 40-GPU AWS EC2 cluster. Our evaluations show that Hippo outperforms Ray Tune, a state-of-the-art hyper-parameter optimization system, reducing the end-to-end training time and GPU-hours of a single job up to 2.76x and 4.81x, respectively. For multi-job scenarios, Hippo can share redundant computations across jobs and reduce the end-to-end training time and GPU-hours by up to 3.53x and 6.77x, respectively.

The rest of the paper is organized as follows. Section 2 introduces hyper-parameter optimization and motivates our work. Section 3 proposes core representations, stage tree and search plan, for identifying and reusing redundant computations. Section 4 describes the Hippo design, and Section 5 elucidates implementation details. Section 6 presents evaluation results, Section 7 explains related work, and Section 8 concludes.

2 Background and Motivation

2.1 Hyper-Parameter Optimization

Hyper-parameter optimization refers to the act of training multiple instances of a machine learning model with slightly differing training knobs, such as learning rate and batch size. We use the term study to refer to a single optimization run of a model over a certain search space of parameters. Each sub-procedure of a study that is associated with a set of parameters sampled from the given search space is called a trial.

Hyper-parameter optimization is crucial in training deep learning models for high model quality. The model quality of trials with different hyper-parameter values may differ significantly, even if settings other than the hyper-parameters such as the model architecture and input data are kept the same across all trials.

There are many types of hyper-parameters as well as many possible values for each hyper-parameter. The search space is often very large, and the number of trials is usually in the hundreds or even thousands [asha, themis, survey]. As each trial takes a considerable amount of time, blindly running the trials one after another is impractical for moderately sized cluster environments. Many hyper-parameter optimization algorithms are used throughout the community for quickly finding the trials that yield the best models (in terms of model quality) without executing all trials to completion [sha, asha, hyperdrive, pbt]. Meanwhile, various hyper-parameter optimization systems have been proposed to efficiently schedule such trials with average job completion time and inter-user fairness in mind [gandiva, tiresias, gandiva-fair].

Hyper-parameter sequences. Many researchers have recently expanded their hyper-parameter search spaces so that a hyper-parameter can change its value after a certain number of steps according to some sequence, rather than being kept constant throughout the whole trial. While the learning rate hyper-parameter [resnet, batchnorm, cycliclr, super, 1hour, adam, adadelta, rmsprop, hypergradient] has long been tuned as a hyper-parameter sequence, recent works have applied this scheme to other hyper-parameters as well, such as batch size [dont-decay-lr], drop-out ratio [elu], optimizer [gnmt], momentum [yellofin], image augmentation parameters [pba], training image input size [progan], input sequence length [bert], and network architecture parameters [progan]. As training modern DL models involves minimizing high-dimensional, non-convex loss functions, we expect this trend of hyper-parameter sequences to become even more popular throughout the community.

Figure 2: Validation accuracy curves for training ResNet56 on CIFAR-10 with various learning rate and batch size settings. All other hyper-parameters are kept the same as reported in the original ResNet paper [resnet]. Trial A (green) keeps a constant learning rate (0.1) and batch size (128) for the whole trial, resulting in the lowest accuracy. Trial B (blue) decays the learning rate by a factor of 0.1 at the 100th and 150th epochs.

We conducted a small study consisting of two trials to illustrate the benefits of tuning hyper-parameters as a sequence. Figure 2 shows the model validation accuracy while training a ResNet56 model on the CIFAR-10 dataset, using different learning rate configurations. Trial A (green) operates on a constant learning rate, while Trial B (blue) follows a hyper-parameter sequence. Simply by decaying the learning rate twice, Trial B reached validation accuracies higher than Trial A by more than 5 percent. This simple example demonstrates that defining hyper-parameters as sequences instead of constant values can greatly affect the quality of the trained model. Clearly, the whole hyper-parameter sequence over the course of training, not just the initial value, matters for model quality.
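Trial B's configuration corresponds to a standard step-decay schedule. As a point of reference, a minimal PyTorch sketch of such a sequence (the model and optimizer below are placeholders, not the actual experiment code) looks like this:

  import torch

  # Placeholder model and optimizer; Trial B multiplies the learning rate by 0.1
  # at epochs 100 and 150, while Trial A keeps it constant at 0.1.
  model = torch.nn.Linear(32, 10)
  optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
  scheduler = torch.optim.lr_scheduler.MultiStepLR(
      optimizer, milestones=[100, 150], gamma=0.1)

  for epoch in range(200):
      # ... one epoch of training at batch size 128 ...
      scheduler.step()  # lr: 0.1 -> 0.01 after epoch 100 -> 0.001 after epoch 150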

2.2 Optimization Opportunities

When generating a new hyper-parameter configuration for a trial, different algorithms each have their own logic for generating specific hyper-parameter values. While naïve algorithms simply select all configurations in the search space or select a random subset, algorithms such as Bayesian optimization [bayesian] deliberately sample the next most promising hyper-parameters based on the history of trials, aiming to discover well-performing trials faster with fewer resources. Likewise, when manually tuning hyper-parameters, a common heuristic to discover a well-performing trial is to slightly modify a previously attempted hyper-parameter sequence that showed good results. As a result, promising trials often share common prefixes in their hyper-parameter sequences.

Sharing computations within a study. Figure 1 depicts a hypothetical hyper-parameter optimization study of trials that partially share prefixes in their hyper-parameter sequences. Existing hyper-parameter optimization systems treat trials as black boxes and do not exploit the fact that trials are actually performing the exact same computation for these prefixes. By performing such computations only once and reusing the resulting DNN checkpoint multiple times for downstream trials, we can reduce the amount of GPU-hours required to serve this study. As the number of trials is typically much greater than the number of GPUs available in the training environment, reducing GPU-hours can also lead to reduced end-to-end training time.

Sharing computations across studies. Reusing computations for common hyper-parameter sequence prefixes can be done across multiple studies as well. If a hyper-parameter optimization study is submitted for a model and a dataset that have already been explored by other studies, then we can identify common hyper-parameter sequence prefixes among the studies and reuse past DNN checkpoints to skip redundant computations for the newly submitted study.

3 Stage Tree & Search Plan

Based on our observations in the previous section, we propose two representations for identifying and sharing common computations across trials and studies. We first show a stage tree, a rearrangement of trials in the form of a tree that puts common computations at the root and intermediate nodes. Next, we present a search plan, a data layout we use in our system to back stage trees.

3.1 Stage Tree

The trend of employing hyper-parameter sequences in hyper-parameter optimization motivates us to divide a trial into several stages, based on the sequences themselves. Consider the study in Figure 3, consisting of four trials. The target hyper-parameter being adjusted is the learning rate, sampled from a given search space. Trial 1 uses one learning rate value for 200 steps, then reduces it to a smaller value for the next 100 steps. In other words, trial 1 starts with a 200-step stage at the initial learning rate, followed by a 100-step stage at the reduced learning rate. Similarly, trial 2 is made up of three stages of 100 steps each. From now on, we use the term stage to denote a certain interval of a trial. Note that a stage does not necessarily have to have a constant hyper-parameter value; a stage may itself follow a hyper-parameter sequence, such as a linear or exponential learning rate schedule. For such sequences, we set stage boundaries by following the conventional ways of dividing hyper-parameter sequences, e.g., at the segment boundaries of piecewise linear functions or between sequentially combined functions.

Dividing trials into stages reveals that trials 2, 3, and 4 actually share the same initial stage (the same learning rate for the first 100 steps). Moreover, the first stage of trial 1 can be split into two smaller stages so that trial 1 shares the initial stage as well. By merging common stages across trials, we obtain the tree-shaped arrangement of stages in Figure 4. In this form, it is evident that the initial stages can be shared by multiple trials. We refer to this form as a stage tree.
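To make the merging step concrete, the following sketch (the data layout and learning-rate values are illustrative, not taken from the figure) merges trials, each given as a list of boundary-aligned stages, into a prefix tree of stages:

  def merge_into_stage_tree(trials):
      """Merge trials into a tree of stages.

      Each trial is a list of (hp_value, num_steps) stages whose boundaries are
      assumed to be pre-aligned (Section 3.1 splits stages so that boundaries
      match). The result is a nested dict: each key is a stage and its value is
      the subtree of stages that follow it, so shared prefixes appear only once.
      """
      tree = {}
      for trial in trials:
          level = tree
          for stage in trial:
              level = level.setdefault(stage, {})
      return tree

  # Four trials shaped like those in Figure 3 (learning-rate values are made up):
  trials = [
      [(0.1, 100), (0.1, 100), (0.01, 100)],    # trial 1, first stage pre-split
      [(0.1, 100), (0.01, 100), (0.001, 100)],  # trial 2
      [(0.1, 100), (0.01, 100), (0.01, 100)],   # trial 3
      [(0.1, 100), (0.001, 100), (0.001, 100)], # trial 4
  ]
  stage_tree = merge_into_stage_tree(trials)
  assert len(stage_tree) == 1  # a single initial stage is shared by all four trials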

The stage tree is one of the core representations of a hyper-parameter study in our work, and is mainly used to identify schedulable units when executing a hyper-parameter optimization study. Conveniently, a stage can be considered a schedulable unit, while edges between stages express scheduling dependencies. We show in detail how stage trees and stages are handled to schedule trials in Section 4.

Figure 3: A study of trials that share common computations. Each stage is labeled with an id and its hyper-parameter value (learning rate). A stage can be split into shorter stages to match the length of a stage from another trial that shares the same hyper-parameter value.
Figure 4: A stage tree formed from the trials of Figure 3. The initial stage can be executed once to serve all four trials, while another stage can be shared by three trials.
Figure 5: An illustration of a stage tree transformation when a new trial is added to the stage tree in Figure 4. Both the first stage of trial 5 and the corresponding stage in Figure 4’s stage tree must be split into smaller stages in order to merge trial 5 into the stage tree. As a result, trial 5 shares its initial stages with trial 1.

3.2 Search Plan

As new trials arrive, new stages may be added to a stage tree, while existing stages may be split into shorter stages covering smaller step ranges. Stages can even be deleted if the given hyper-parameter optimization algorithm decides to kill certain trials. For instance, assume that a new trial has been submitted to the previous stage tree example, as shown in Figure 5. The first stage of the new trial (trial 5) cannot be merged into either of the existing stages along its path in Figure 4, because neither of them has a matching step range. Instead, the overlapping stages need to be divided at the new step boundary into shorter stages, and the new trial’s last stage is then appended after the shared prefix. All stages that came after the divided stage in the original stage tree are modified to follow the later of its two halves in the new stage tree.

Although stage tree transformations allow us to map out the current state of a hyper-parameter optimization study, such dynamics make the implementation of a stage-tree-based system rather complicated for various reasons. First, from a scheduling perspective, managing the execution states of stages is difficult because stages can be split even during execution. For example, if the stage split from Figure 4 to Figure 5 occurred while the affected stage was still in execution, then it would be unclear how to handle the currently running process. Second, the ever-changing structure of a stage tree makes it difficult to pinpoint the by-products of executing a stage, namely DNN parameter checkpoints and validation accuracy values, which must be associated with specific hyper-parameter sequences.

To resolve these issues, we introduce another internal representation for a hyper-parameter optimization study, the search plan, which is similar to a stage tree but does not involve any node removals when a new trial is submitted. Nodes contain information about past trials that passed through them, while each edge is annotated with an integer indicating the stage boundary in number of steps.

Figure 6: A search plan example of hyper-parameter configurations. Each node stores various fields, including hyper-parameter value functions for each hyper-parameter (hp_config), checkpoints and intermediate values for later reuse (ckpt, metrics), and a list of integers marking the stages waiting to be executed under this configuration (requests). Edges across nodes indicate sequential dependencies; e.g., one configuration takes effect after training a model for 20 steps under its parent configuration, while the next takes effect after training for 20 more steps (a total of 40 preceding steps).
Figure 7: A stage tree generated from the search plan in Figure 6. The numbers below each stage indicate the steps at which training starts and stops. Shaded stages are those with checkpoints from which training can be resumed.

An example of a search plan is drawn in Figure 6. Each node represents a hyper-parameter configuration starting from a certain training step. The root node of this search plan indicates a configuration that trains a freshly initialized model (no parent node) with an exponential learning rate and a constant batch size. Likewise, another node indicates a configuration with a linear learning rate and a constant batch size, starting from a model checkpoint that has been trained under the root configuration for 10 steps (note the directed edge between the two nodes).

Unlike stage trees, a search plan node is not a scheduling unit. The existence of a node does not necessarily imply that a trial, configured by that node, is currently running in the system. Rather, a node holds various statistics gathered by the system regarding the corresponding hyper-parameter configurations, specified by the following fields:

  • hp_config: Hyper-parameter configurations for each target hyper-parameter, given as functions. Values for coefficients and constants are also given, if required (not shown in the figure). Widely used functions for hyper-parameter values, such as CONSTANT, EXPONENTIAL, COSINE, and STEP, are allowed.

  • ckpt: A dictionary of file paths for checkpoints that were trained under this configuration. Dictionary keys are used to indicate the number of training steps.

  • metrics: Intermediate values for evaluating the quality of the model, like test/validation accuracy and loss.

  • requests: A list of integers representing requests to train and return the metrics of submitted stages. Each integer indicates the total number of training steps required for that request. A request consists of one or more stages with sequential dependencies, which are expressed by an integer in a node’s requests field and the edges connecting from preceding nodes. Note that one request may map to one or more trials if they are merged into the same stage(s). For example, in Figure 6, a value of 15 in the root node’s requests field indicates that one or more trials require training with that node’s hyper-parameter configuration for 15 steps. To interpret a value of 35 in a child node’s requests field in a similar manner, we must first follow the edge from the preceding node, which indicates that the request requires training for 20 steps under the parent’s hyper-parameter configuration. Then, the request requires training for an additional 15 steps under the child’s hyper-parameter configuration, for a total of 35 steps. Note that since there already is a checkpoint for the parent at 20 steps, training can be resumed from that checkpoint.

We also have a few additional fields for implementation reasons, such as a reference count value and other runtime metadata. We will explain these further in later sections.

When a new trial arrives, it is first compared with the search plan to see if there exists a path from the root node to a leaf node that exactly matches the trial’s hyper-parameter sequence. If not, then a new node is created and added to the search plan. Otherwise, we next check the ckpt and metrics fields of the leaf node and immediately return the appropriate results in case no training is needed (e.g., there already is a checkpoint that matches the trial’s hyper-parameter sequence). For the more common case in which we need to train the model, then an entry is added to the requests field and a stage tree is generated from the search plan, to be handed over to a separate scheduler component.

Going back to the example illustrated in Figure 5, where a new trial submission requires splitting an existing stage (A2), Hippo handles this case by simply adding another item to the existing node’s requests field to indicate a new stage (A3) with the same hyper-parameter configuration but a smaller number of steps. Note that if A2 was executed before this change and no checkpoint was saved at A3’s boundary, the computation for A3 may be repeated later, when a new stage tree is generated from the updated search plan.
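A minimal sketch of how such a node and the submission flow described above could be represented is shown below (field and function names are our own, not Hippo's actual classes):

  from dataclasses import dataclass, field
  from typing import Dict, List, Optional, Tuple

  @dataclass(eq=False)  # identity-based hashing so nodes can be used as dict keys
  class SearchPlanNode:
      """One hyper-parameter configuration of a search plan (Section 3.2)."""
      hp_config: str                     # e.g. "lr=EXPONENTIAL(0.1, 0.95); bs=CONSTANT(128)"
      start_step: int = 0                # boundary annotated on the incoming edge
      parent: Optional["SearchPlanNode"] = None
      children: Dict[Tuple[str, int], "SearchPlanNode"] = field(default_factory=dict)
      ckpt: Dict[int, str] = field(default_factory=dict)      # step -> checkpoint path
      metrics: Dict[int, dict] = field(default_factory=dict)  # step -> evaluation results
      requests: List[int] = field(default_factory=list)       # total training steps requested

  def submit_trial(root: SearchPlanNode, stages: List[Tuple[str, int]]):
      """Match a trial's stage sequence [(hp_config, end_step), ...] against the
      plan, creating nodes where needed; reuse cached results if possible,
      otherwise record a training request on the final node."""
      node, prev_end = root, 0
      for hp_config, end_step in stages:
          key = (hp_config, prev_end)
          if key not in node.children:
              node.children[key] = SearchPlanNode(hp_config, start_step=prev_end, parent=node)
          node, prev_end = node.children[key], end_step
      if prev_end in node.ckpt and prev_end in node.metrics:
          return node.ckpt[prev_end], node.metrics[prev_end]   # already computed: reuse
      if prev_end not in node.requests:
          node.requests.append(prev_end)                       # new work for the scheduler
      return None

In this sketch, the stage split of Figure 5 requires no structural change: the new boundary appears as a child node's start_step, and a trial that simply stops earlier under the same configuration appends a smaller integer to requests, in line with the description above.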

1:function BuildStageTree(plan)
2:     Initialize lookup table T
3:     for each not-yet-scheduled request r do
4:         FindLatestCheckpoint(r, T) // r is (hp_config, step)
5:     end for
6:     Initialize a stage tree tree
7:     for each key, value in T do
8:         // each key and value is a request object
9:         // create a stage that loads the checkpoint at value and trains until key
10:     end for
11:     connect consecutive stages
12:     return tree
13:end function
14:function FindLatestCheckpoint(r, T)
15:     if r.hp_config is running then
16:         return T.put(r, null)
17:     end if
18:     if r.hp_config has no parent configuration || r ∈ T then
19:         return T.get(r)
20:     end if
21:     for s ∈ {r.step, r.step − 1, …, r.hp_config.start} do
22:         if Checkpoint exists at (r.hp_config, s) then
23:              return T.put(r, (r.hp_config, s))
24:         end if
25:     end for
26:     r′ = (r.hp_config.parent, r.hp_config.start)
27:     ckpt = FindLatestCheckpoint(r′, T)
28:     T.put(r, ckpt)
29:     return ckpt
30:end function
Algorithm 1 Build Stage Tree

Going from search plans to stage trees. While search plans are effective for managing the current status and history of a hyper-parameter study, stages are a more straightforward scheduling unit for the system's scheduler component to interact with. Thus, we use search plans as the basic format for holding internal data, but generate stage trees whenever a scheduling decision needs to be made. The generated stage trees are transient representations, used solely for creating scheduling units (stages), and are not kept in the system like search plans.

Algorithm 1 describes the process of generating a stage tree from a search plan. We first describe the helper function FindLatestCheckpoint, which takes a request object r and a lookup table T as input. A request object is a tuple of a hyper-parameter configuration and a step count. FindLatestCheckpoint finds the checkpoint closest to the request and stores it in T. If the closest checkpoint belongs to one of the configuration's ancestors, the function calls itself recursively with the parent configuration as input (line 27) and records the result in the lookup table (line 28). After the lookup table is completed (line 5), BuildStageTree goes over each entry in the lookup table and builds the stage tree. Note that the lookup table also serves as a memoization mechanism (line 18).
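One way Algorithm 1 could be realized in Python, reusing the SearchPlanNode sketch from Section 3.2, is shown below; the interpretation of the listing is ours, so details may differ from Hippo's actual code:

  def find_resume_point(node, step, table):
      """Return the point from which (node, step) can start training: either an
      existing checkpoint (node, ckpt_step) or an ancestor request that will be
      executed as an earlier stage of the tree. Intermediate ancestor requests
      are memoized in `table`, so each of them also becomes a stage."""
      # 1) the latest checkpoint saved under this very configuration
      for s in range(step, node.start_step - 1, -1):
          if s in node.ckpt:
              return (node, s)
      # 2) otherwise fall back to the parent configuration at this node's boundary
      if node.parent is None:
          return None                               # no checkpoint: train from scratch
      parent_req = (node.parent, node.start_step)
      if parent_req not in table:                   # memoization, as in line 18
          table[parent_req] = find_resume_point(node.parent, node.start_step, table)
      return parent_req

  def build_stage_tree(plan_nodes):
      """Turn the pending requests of a search plan into stages. Each table entry
      (node, stop_step) -> resume becomes one stage; a stage whose resume point
      is another entry depends on that stage, which yields the tree's edges."""
      table = {}
      for node in plan_nodes:
          for stop_step in node.requests:
              table[(node, stop_step)] = find_resume_point(node, stop_step, table)
      return table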

Figure 7 illustrates the stage tree generated from the search plan in Figure 6. A stage is executed by resuming from the nearest available checkpoint, where available checkpoints are marked as shaded areas in the figure. For example, the stage covering steps 20 to 25 in the figure is trained by resuming from its configuration's checkpoint at 20 steps, which can be seen in the corresponding ckpt field in Figure 6.

4 Hippo System Design

In this section, we introduce Hippo, a hyper-parameter optimization system that incorporates stage trees and search plans to run multiple studies while automatically reusing computation for sharable stages within a study and across studies.

Figure 8: Hippo system architecture.

4.1 Overview

Executing a trial in Hippo is initiated by the user submitting the trial to Hippo via the client library, a thin interface for constructing trial requests in the appropriate format. A trial request is defined as a pair of a hyper-parameter sequence configuration and the number of training steps (Figure 8, step 1). Once a request arrives at the system, the hyper-parameter sequence configuration is immediately compared with the corresponding search plan in the search plan database, and the search plan is adjusted accordingly (step 2). In case metrics and checkpoints that satisfy the request's criteria are already present, a response is returned immediately to the user. Otherwise, the scheduler is notified to run new stages.

The scheduler decides which stages to run by examining the stage tree generated from the current search plan (step 3). Stages are given to GPU workers for execution (step 4), and the workers start computation by loading checkpoints from the distributed filesystem (step 5). Workers periodically report evaluation metrics to the aggregator (step 6). Each server has a node manager that gathers metrics locally before passing them to the aggregator, reducing inter-server data traffic. The aggregator, upon receiving a set of metrics, updates the search plan (step 7) and also pings the scheduler if a new checkpoint has been added (step 8). After repeating the scheduler-aggregator cycle multiple times, the final stage for a trial request eventually terminates, and the metrics are sent back to the client (step 9).

4.2 Search Plan Database

Hippo stores all search plans that are currently being served in the search plan database. When a new trial is added, Hippo updates the search plan as described in Section 3.2. The various field entries in any node of the search plan, including checkpoints, metrics, and runtime profile data, can also be updated by the aggregator component.

4.3 Scheduler

Hippo schedules computation on GPUs with stages as the basic scheduling unit. Since the number of stages that can run concurrently at a given moment usually exceeds the number of available GPUs in the cluster, Hippo utilizes a scheduler component to determine the stages to be run.

The scheduler takes a stage tree generated from the current search plan as input, and schedules stages on GPU workers. A simple scheduling method would be to do a breadth-first traversal through the stage tree and schedule each stage one by one, until all GPU workers have been assigned a stage. However, we have found that this method incurs significant transition overhead for workers because the scheduling granularity (stage) is too small. Instead, the scheduler computes the critical path of a given tree and schedules the whole sequence of stages in the path on a worker. With multiple workers, the scheduler repeats finding the next critical path among unscheduled stages in the stage tree and scheduling the sequence of stages on an idle worker. The larger scheduling granularity (batch of stages) not only improves locality by avoiding overheads such as checkpoint save/loading, but also allows us to prioritize minimizing the end-to-end training time of a hyper-parameter optimization study. The critical path of a stage tree is the path (from root to leaf) that has the longest estimated execution time; the execution time of an individual stage is estimated by multiplying the number of steps of that stage by the execution time per step (profiled beforehand and stored in the search plan database).
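A sketch of this critical-path policy is given below; the stage attributes (num_steps, sec_per_step, children, scheduled) and the worker.assign call are assumptions for illustration, not Hippo's actual interfaces:

  def critical_path(stage):
      """Return (estimated time, path) of the longest root-to-leaf path of
      unscheduled work under `stage`, using profiled per-step times."""
      own = 0.0 if stage.scheduled else stage.num_steps * stage.sec_per_step
      best_time, best_path = 0.0, []
      for child in stage.children:
          t, p = critical_path(child)
          if t > best_time:
              best_time, best_path = t, p
      return own + best_time, [stage] + best_path

  def assign_idle_workers(stage_tree_roots, idle_workers):
      """Give each idle worker the current critical path of unscheduled stages,
      so a whole batch of dependent stages runs back-to-back on one worker."""
      for worker in idle_workers:
          _, path = max((critical_path(root) for root in stage_tree_roots),
                        key=lambda pair: pair[0])
          to_run = [s for s in path if not s.scheduled]
          if not to_run:
              break                      # no unscheduled work left
          for s in to_run:
              s.scheduled = True
          worker.assign(to_run)          # hypothetical worker API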

Note that the scheduler does not store any information regarding the execution states of stages. Since stages can be split and even removed during execution (as mentioned in Section 3), setting stage states to handle the execution of a study involves complex state management measures. Instead, the scheduler operates in a stateless manner, relying entirely on the search plan to identify the stages that need to be run and the stages that have already run. In other words, after scheduling stages from a stage tree, the scheduler simply releases the stage tree. When the scheduler is triggered later to schedule more stages, the scheduler takes a new stage tree freshly generated from the latest search plan, and repeats finding and scheduling the next critical path of unscheduled stages in the stage tree on an idle worker.

5 Implementation

We have implemented Hippo in Python, using various libraries. The system utilizes Python’s concurrent programming library, asyncio, to manage coroutines. Communication between the main Hippo process and node managers is done via the pub/sub interface provided by Apache Kafka 2.4.1, together with Apache ZooKeeper 3.4.13. MySQL 8.0 is used to store system states in the search plan database. Kafka, ZooKeeper, and MySQL are all run in Docker containers. Additionally, we use GlusterFS 6.9 as the distributed file system for saving and sharing checkpoints between nodes. Our current implementation of Hippo utilizes the deep learning framework PyTorch 1.5.0 to train DNN models, though Hippo’s design is not tied to any specific framework.

5.1 Data Pipeline

We implemented a custom data pipeline for PyTorch that is compatible with stages. We made two major changes. First, we modified the checkpoint mechanism of PyTorch’s default data pipeline to include the current permutation of the dataset as part of the checkpoint. This way, the data pipeline is able to save its current position in the dataset when a stage terminates, and later resume from the same position for the next stage. Second, we added a feature to change the batch size of the data pipeline. When the batch size is changed, the data pipeline flushes every preprocessed batch from its queue and relaunches the background threads so that they produce batches of the correct size.
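To illustrate the first change, a resumable sampler can record the current permutation and its position so that the next stage picks up exactly where the previous one stopped. The sketch below is in the spirit of our pipeline but not its actual code; state_dict and load_state_dict are our own helpers, saved alongside the model checkpoint:

  import torch
  from torch.utils.data import Sampler

  class ResumableRandomSampler(Sampler):
      """Random sampler whose permutation and position are checkpointable."""

      def __init__(self, data_source, seed=0):
          self.data_source = data_source
          self.generator = torch.Generator().manual_seed(seed)
          self.perm = torch.randperm(len(data_source), generator=self.generator)
          self.position = 0

      def __iter__(self):
          # Yield the remainder of the current pass, then prepare the next pass.
          while self.position < len(self.perm):
              self.position += 1
              yield int(self.perm[self.position - 1])
          self.perm = torch.randperm(len(self.data_source), generator=self.generator)
          self.position = 0

      def __len__(self):
          return len(self.perm) - self.position

      def state_dict(self):
          return {"perm": self.perm, "position": self.position,
                  "rng": self.generator.get_state()}

      def load_state_dict(self, state):
          self.perm, self.position = state["perm"], state["position"]
          self.generator.set_state(state["rng"])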

5.2 Client Library

  class MyTrainer(Trainer):
    ...
    def setup(self, hp):
      # hp is a dictionary of updated values
      if "lr" in hp:
        for group in self.optimizer.param_groups:
          group["lr"] = hp["lr"]
      if "bs" in hp:
        if self.train_loader:
          del self.train_loader
        self.train_loader = DataLoader(
          self.train_dataset,
          batch_size=hp["bs"]
        )
    ...
Figure 9: An example that updates the learning rate (lr) and batch size (bs) in the custom Trainer that the user should override. Hippo passes into setup the values of sequential hyper-parameters that should be updated.
  def get_search_space():
    hp = {
      "lr": [
        Constant(0.1),
        Exponential(0.1, 0.95)
      ],
      "bs": [
        Constant(128),
        MultiStep(128, [40], 2)
      ]
    }
    return GridSearchSpace(hp)
Figure 10: Defining a search space consisting of learning rate (lr) and batch size (bs) sequences in Python using the function definitions provided by Hippo. Two different sequences were defined for each hyper-parameter, resulting in four trials.
  hp_set = ["lr", "bs"]
  study = Study(remote_url).create(
    dataset, command, ckpt_path, hp_set
  )
  schedule = Schedule.from_milestones(
    (5, 8), (10, 4)
  )
  tuner = EarlyStopTuner(
    schedule, search_space,
    metric.ExtractSingleNumber(
      "test_acc"
    )
  )
  tuner(study)
  # Users can tune a study multiple times on different tuners
  tuner2(study)
  # Users can directly evaluate a certain trial on a specified step
  study.eval(hp_config, step)
Figure 11: Running a study with an example tuner that trains 8 trials for 5 logical training iterations, early-stops 4 trials, and trains the remaining 4 trials up to 10 logical iterations. The killing decision is made based on the test accuracy as specified in the last argument to EarlyStopTuner.

To run a study in Hippo, users must first decide the model and dataset they want to use in the study, the types and values of hyper-parameters to tune, and the tuning algorithm to use. The training logic, which describes all things needed for training a model such as setting the values of each hyper-parameter, is defined by overriding the base Trainer class Hippo provides. The values of each hyper-parameter used in the Trainer will be drawn from the search space defined in Python by the user. The tuning algorithm specifies how to spawn, pause, or terminate trials that compose the study. Users may implement their own strategies, or simply choose from the tuners we provide. We will now take a closer look at each step a user must take to run a study in Hippo.

First, users should implement the training logic by overriding the base Trainer class Hippo provides. Users should write functions for initializing training (e.g., defining the model or loading the dataset), training for one logical iteration (which may consist of multiple steps), evaluating the model trained so far and returning the metrics, and saving and loading checkpoints. One logical training iteration, executed by one call to the Trainer’s train function, should be long enough to avoid overheads, but short enough to regularly report progress. Often, a logical training iteration is set as one pass through the dataset. Whenever a hyper-parameter value is initialized or updated within a stage, Hippo calls the Trainer’s setup function with a dictionary containing the updated values. Then, using these values in setup, the user should make the corresponding changes to the appropriate attributes of the Trainer. Figure 9 illustrates a setup example.

Then, the user should define the search space they wish to explore using Hippo’s implementation of well-known functions. Figure 10 displays a simple example that creates a search space over two types of hyper-parameters to use with the MyTrainer class defined previously in Figure 9. Unlike in existing frameworks, users can directly express hyper-parameters in the search space as sequences, without having to embed the sequences as part of the training logic. Notice the matching keys between the search space and the hp dictionary passed into MyTrainer’s setup. Trials are sampled from this search space as a grid here, resulting in a total of four trials, but users who wish to implement conditional hyper-parameter spaces can optionally pass in a function to GridSearchSpace to filter out certain trials.

The last step is to create a study and a tuner. A study is defined by specifying the dataset, the command to run a trial, the checkpoint path, and the hyper-parameter set. The hyper-parameter set contains the types of hyper-parameters that are tuned in the study. For tuners, we provide several hyper-parameter optimization algorithms such as Successive Halving (SHA) [sha], Hyperband [hyperband], Asynchronous Successive Halving (ASHA) [asha], median-stopping [vizier], and PBT [pbt] in the client library. Figure 11 illustrates how to create a study with a search space containing two types of hyper-parameters, and tune the study with a tuner that early-stops trials on milestones based on a certain metric.

Hippo’s client library heavily utilizes Python’s asyncio library. Instead of creating a new thread for each request, the library creates coroutines which are handled by the default Python event-loop. The tuning algorithms provided by Hippo take advantage of asyncio primitives, such as wait_all (block until all coroutines have finished) and wait_any (block until at least one coroutine has finished), to implement their logic.
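As an example of this style, a rung of trials can be launched concurrently and handled as results arrive using standard asyncio primitives; study.train below is a hypothetical stand-in for a client-library request coroutine:

  import asyncio

  async def run_rung(study, hp_configs, steps):
      """Launch all trials of one rung concurrently and react as soon as the
      first one reports back, without blocking on the rest."""
      tasks = [asyncio.ensure_future(study.train(hp, steps)) for hp in hp_configs]
      done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
      first_results = [t.result() for t in done]
      # ... an early-stop decision could be made here before awaiting the rest ...
      remaining = await asyncio.gather(*pending)
      return first_results + list(remaining)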

Typically, hyper-parameter optimization algorithms submit several requests in parallel. In such situations, the client library batches the requests to reduce processing overhead at the search plan database.

6 Evaluation

We compare Hippo to Ray Tune [raytune], a hyper-parameter optimization framework built on top of Ray [ray], and present experimental results that show how a study can be executed both quickly and efficiently on Hippo. We conducted four single-study experiments comparing Ray Tune and Hippo, and two multi-study experiments, each with a varying number of studies running in parallel.

Model | Dataset | Tuning Algorithm | Algorithm Policy | # of trials | Merge rate (p)
ResNet56 | CIFAR-10 | SHA | reduction=4, min=15, max=120 | 448 | 2.447
ResNet56 | CIFAR-10 | ASHA | reduction=4, min=15, max=120 | 448 | 2.447
MobileNetV2 | CIFAR-10 | Grid search | max=120 | 240 | 3.144
BERT-Base | SQuAD 2.0 | Grid search | max=27000 | 40 | 2.045
Table 1: Specification of the four studies. Each study is specified by a model, dataset, tuning algorithm, tuning algorithm policy, and a search space. Min and max are numbers of steps for BERT-Base, and numbers of epochs otherwise. The number of trials and the merge rate of each search space are provided.
Environment

All experiments were conducted on Amazon Web Services. Each experiment uses a homogeneous GPU cluster of five p2.8xlarge instances, each with 8 NVIDIA Tesla K80 GPUs. A distributed file system using GlusterFS [gluster] is set up on Amazon EBS volumes, with one volume per instance. For trials that do not fit in one GPU, we apply synchronous data-parallel training. All experiment scripts are implemented in PyTorch 1.5.0 [pytorch]. In all of our experiments, we measure the end-to-end time, i.e., the elapsed time from the start of the experiment to its end, and the GPU-hours, i.e., the sum of the time each GPU was held for training.

In an effort to ensure a fair comparison, we made the following changes. First, we re-implemented the ASHA [asha] algorithm on Ray Tune, as the implementation provided by Ray Tune differed from the original paper. Second, we altered Ray Tune's Trainer implementation so that it performs evaluation multiple times in each epoch. In addition, we match the number of evaluations between Ray Tune and Hippo.

Merge rate

As our evaluation results vary with the configuration of the search space, we provide a coefficient that summarizes the merging capability of a search space:

p = (Total training iterations) / (Unique training iterations)

Total training iterations is defined as the number of training iterations needed to train the entire search space without reusing any computation. Unique training iterations is defined as the number of training iterations with zero redundant computation. For example, if there are N identical trials, the merge rate is N. Note that to bound the iteration counts, the number of iterations for each trial is set to the maximum number of iterations the trial can be trained.
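In code, the merge rate can be computed by expanding each trial into its per-step hyper-parameter sequence and counting distinct prefixes (our own sketch of the definition above):

  def merge_rate(trials):
      """p = total training iterations / unique training iterations.

      Each trial is its per-step hyper-parameter sequence, e.g.
      [0.1] * 100 + [0.01] * 100 for a 200-step trial; a step is redundant
      between two trials only if their sequences agree on every step up to
      and including it.
      """
      total = sum(len(trial) for trial in trials)
      unique = {tuple(trial[:i + 1]) for trial in trials for i in range(len(trial))}
      return total / len(unique)

  # N identical trials give a merge rate of N, as noted above:
  assert merge_rate([[0.1] * 10] * 4) == 4.0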

Similarly, we define a k-wise merge rate q over the search spaces of k studies:

q = (Total training iterations of the k studies) / (Unique training iterations across the k studies)

The merge rate is not the only coefficient that determines the GPU-hour reduction in Hippo. The total GPU-hours are also affected by the differences in training duration across trials. Moreover, when applying a hyper-parameter optimization algorithm that early-stops under-performing trials, the actual GPU-hours depend on which trials are early-stopped: if a trial that shares common computation with many other trials is early-stopped, the overall GPU-hour reduction will decrease.

Search space

Over the four studies, we use a total of seven types of hyper-parameters. Five hyper-parameters (learning rate, batch size, momentum, image cutout [cutout] size, and input sequence length [bert]) were tuned as sequences, and two hyper-parameters (optimizer, weight decay) were tuned as single values throughout the trials. We do not optimize the hyper-parameter sequences directly. Instead, we follow the convention of parameterizing a hyper-parameter sequence as a well-known function and tuning its parameters instead. For example, the learning rate search space is composed of functions such as step decay, linear, exponential decay after linear warm-up, and others. The parameters of each function were varied to build the search space; for instance, a few learning rate sequences were generated from the linear function by varying the slope and the initial value. For the learning rate, we sample many different sequences from the commonly used functions implemented in most DL frameworks. For the other hyper-parameters, a constant value or piecewise constant sequences were used. Each search space varies in the types of hyper-parameters, the number of trials, and the sequence values of each hyper-parameter. We provide an overview of the search space for each study, as well as statistics such as the number of trials and the merge rate (p or q).

6.1 Single Study

Figure 12: Single-study experiment results. (a) End-to-end time. (b) GPU-hours.

This section compares three hyper-parameter optimization systems: Ray Tune, Hippo, and Hippo-trial. Along with Hippo, we provide the same evaluation results for Hippo-trial, an implementation of Hippo without merging, so that no computation is reused.

Studies

We compare four different studies across the three hyper-parameter optimization systems. The design of each study is described in Table 1. Three different models, two different datasets, and three different hyper-parameter optimization algorithms are used across the studies. For the ResNet56 and MobileNetV2 models, no trial trains for more than 120 epochs; only the trial with the highest accuracy is trained for 100 additional epochs. The extra training of the best-performing trial is included in the reported GPU-hours and end-to-end time.

Hyper-parameter | Function family
learning rate | Initial=0.1, StepLR(gamma=0.1, milestones=[90,135]); Warmup(5,0.1), StepLR(gamma=0.1, milestones=[90,135]); Warmup(5,0.1), Exponential(gamma=0.95); Warmup(10,0.1), CosineAnnealingWarmRestarts(T_0=20); CyclicLR(base_lr=0.001, max_lr=0.1, step_size_up=20)
batch size | 128; values=(128,256), milestones=[70]
momentum | 0.9; values=(0.7,0.8,0.9), milestones=[40,80]
weight decay | 1e-4, 1e-3
optimizer | Adam, Vanilla SGD, SGD with nonzero momentum
Table 2: Examples from the search space defined for ResNet56. Five types of hyper-parameters were tuned. Warmup indicates the duration in epochs and the target value.
Hyper-parameter | Function family
learning rate | Initial=0.1, StepLR(gamma=0.1, milestones=[100,150]); Warmup(10,0.1), StepLR(gamma=0.1, milestones=[100,150]); Warmup(10,0.1), Exponential(gamma=0.95); Warmup(10,0.1), CosineAnnealingWarmRestarts(T_0=20); CyclicLR(base_lr=0.001, max_lr=0.1, step_size_up=20)
batch size | 128; values=(128,256), milestones=[100]
cutout size (augmentation) | 16; values=(16,18,20), milestones=[80,100]
optimizer | SGD(weight_decay=4e-5)
Table 3: Examples from the search space defined for MobileNetV2. Four types of hyper-parameters were tuned.
Hyper-parameter | Function family
learning rate | Initial=5e-5, Linear(total_t=30000); Warmup(3000,5e-5), Linear(total_t=30000)
input sequence length (preprocessing) | 384; values=(384,512), milestones=[21000]
Table 4: Examples from the search space defined for the BERT-Base model. Two types of hyper-parameters were tuned. The warmup and linear decay durations are given in numbers of steps.
Model | Target acc. [%] | Accuracy [%] (Ray Tune / trial / stage) | GPU-hours (Ray Tune / trial / stage) | End-to-end time [hours] (Ray Tune / trial / stage)
ResNet56 (SHA) | 93.03 | 93.08 / 92.89 / 93.27 | 402.66 / 404.95 / 83.7 | 13.92 / 12.89 / 5.76
ResNet56 (ASHA) | 93.03 | 93.58 / 92.89 / 93.72 | 544.36 / 374.82 / 139.03 | 17.6 / 13.58 / 7.4
MobileNetV2 | 94.43 | 95.03 / 95.04 / 95.04 | 917.11 / 944.88 / 291.48 | 28.815 / 30.29 / 10.43
BERT-Base | 76.236 | 78.42 / 78.57 / 78.18 | 835.03 / 808.21 / 404.21 | 25.18 / 24.1 / 11.93
Table 5: Summary of all four single-study experiments, including the best accuracy reached, elapsed GPU-hours, and end-to-end time for Ray Tune, Hippo-trial (trial), and Hippo (stage). For ResNet56, the target accuracy is the value reached in the original paper. As MobileNetV2 and BERT-Base do not have such official records, their targets were set from values reported in a popular GitHub repository and the SQuAD leaderboard, respectively.

Note that for ResNet56 and MobileNetV2, the reported metric is the top-1 accuracy, and for BERT-Base, the reported metric is the f1 score. The target top-1 accuracy of ResNet56 is 93.03, which is the value reported in the original paper [resnet]. MobileNetV2 has no official record for accuracy on CIFAR-10, but an accuracy of 94.43 is reported in this GitHub repository [mobilenet-repo]. The reported f1 score for BERT-Base can be seen on the SQuAD 2.0 official leaderboard, with 76.236 being the highest record for BERT-Base. The top-1 accuracies and f1 scores reached in our experiments can be found in Table 5. In all four studies, Hippo successfully achieved top-1 accuracies and f1 scores higher than the reported target values.

The search spaces for ResNet56, MobileNetV2, and BERT-Base are defined in Tables 2, 3, and 4, respectively. The end-to-end time and GPU-hours are shown in Figure 12. Ray Tune and Hippo-trial show comparable end-to-end times and GPU-hours except for the study involving ASHA. In this case, Hippo-trial finishes earlier and uses fewer GPU-hours than Ray Tune because fewer trials were promoted due to the asynchronous nature of the algorithm.

Compared to Ray Tune, Hippo reduces end-to-end time and GPU-hours by up to 2.76x and 4.81x, respectively. For the two grid search experiments, the GPU-hour savings (3.15x and 2.07x) quite accurately match the merge rates. This is because the merge rates were calculated assuming that each trial was trained for the maximum possible number of iterations, in other words, assuming that a grid search was performed over the search space. For the SHA and ASHA experiments, however, this assumption does not hold due to early-stopping, and the GPU-hour savings (4.81x and 3.92x) are much greater than the merge rates. SHA and ASHA perform better than anticipated by the search spaces' merge rates because early-stopping led to exploring only a subset of the whole search space we defined, which happened to have a higher merge rate than the whole search space. After analyzing the training logs from SHA, for example, we found that the merge rate of the search space actually explored was 4.23.

6.2 Multiple Studies

We next evaluate several studies at once, to see the effect of inter-study merging. We evaluate the GPU-hour and the end-to-end time difference between Ray Tune and Hippo with a varying number of studies: 1, 2, 4, and 8. We will refer to each case as S1, S2, S4, and S8.

All studies use the ResNet20 model, the CIFAR-10 dataset, and 144 trials. Depending on the hyper-parameter optimization algorithm policy, the merge rate differs across studies. Two types of multi-study experiments are conducted, each with a different search space. The learning rate and batch size were tuned as sequences in each study.

The first search space has comparatively high intra-study and inter-study merge rates. The merge rate for each study ranges from 1.5 to 2.73. The k-wise merge rates for S2, S4, and S8 are 2.26, 2.77, and 2.47, respectively. Figure 13 depicts the results for this search space. With a relatively large merge rate between the studies, the GPU-hours and the end-to-end time shrink by up to 6.77x and 3.53x, respectively.

The second search space was designed to have lower intra-study and inter-study merge rates than the first. The merge rate for each study ranges from 1.2 to 2.1, and pairwise merge rates have a similar range of 1.2 to 2.4. The k-wise merge rates for S2, S4, and S8 are 1.40, 1.19, and 1.66, respectively. Figure 14 depicts the results for this search space. Though the gains are smaller than for the first search space due to the lower merge rates, Hippo still reduces the GPU-hours and end-to-end time by up to 2.32x and 1.99x, respectively.

Figure 13: Multi-Study results with k-wise merge rates S2: 2.26, S4: 2.77, and S8: 2.47.
Figure 14: Multi-Study results with k-wise merge rates S2: 1.40, S4: 1.19, and S8: 1.66.

7 Related Work

In our previous workshop paper [workshop], we explored the potential of stage-based execution by implementing a prototype system and evaluating it with small-scale single-study experiments. Building on these early ideas, in this paper we present the complete design and implementation of Hippo, including the challenges and considerations in utilizing stage-based execution, and add support for multiple studies. We also provide a richer set of evaluations, consisting of single studies at a much larger scale (in terms of the total number of trials, types of hyper-parameters tuned, and the variety of models and datasets) as well as multi-study experiments, which better demonstrate the impact of our work.

Trial-based systems

There have been several recent systems [raytune, vizier, nni, optuna, Katib, chopt, hyperdrive] for hyper-parameter optimization, helping users manage their hyper-parameter optimization jobs in distributed environments. However, these trial-based systems miss the opportunity to reduce resource usage by reusing common computation results.

Tune [raytune], for example, is a hyper-parameter optimization system built on top of Ray [ray], providing two levels of interfaces: a user API to train models with hyper-parameters, and a scheduling API for implementing hyper-parameter optimization algorithms. Since Tune does not understand the internals of a trial, a single trial cannot be split into multiple stages to merge the common computation between trials, resulting in sub-optimal performance compared to Hippo. In addition, since Tune's scheduler is always initiated by the underlying resource manager, e.g., when a trial is completed or a resource becomes available, it may be difficult for a user to control the details of a hyper-parameter optimization algorithm unless the user is familiar with the behavior of the resource manager. Other popular trial-based hyper-parameter optimization systems such as Google Vizier [vizier], NNI [nni], Optuna [optuna], Kubeflow Katib [Katib], CHOPT [chopt], and HyperDrive [hyperdrive] provide similar trial-level user APIs and schedule hyper-parameter optimization jobs on a per-trial basis, so they likewise overlook the opportunity to identify and eliminate identical computations.

Reusing computation

Hippo minimizes resource usage by identifying identical computations and reusing their results among multiple hyper-parameter optimization trials. There exist a number of systems that reuse computation results to some extent, but none of them focuses on finding identical computations between hyper-parameter optimization trials as Hippo does.

Nectar [nectar] enables reusing common DryadLINQ computations within a datacenter, but does not focus on hyper-parameter optimization jobs. Pretzel [pretzel] and Clipper [clipper] reuse computed results for machine learning inference workloads. Another recent work [pipeline-aware-reuse] attempts to reduce the resource usage of hyper-parameter optimization jobs by caching results of intermediate steps of machine learning pipelines such as data preprocessing and feature extraction. We expect that Hippo can be further improved by incorporating such techniques that optimize other aspects of machine learning systems.

Systems focusing on a specific algorithm

As hyper-parameter optimization algorithms such as ASHA [asha] and PBT [pbt] have been devised to optimize resource usage in distributed environments, systems to efficiently run those algorithms have been introduced alongside the algorithms themselves. However, these systems are not generic, since each is specifically designed to execute only a particular algorithm. HyperSched [hypersched], for instance, extends ASHA [asha] and supports algorithms similar to ASHA. In contrast, Hippo aims to support various hyper-parameter optimization algorithms including ASHA [asha], SHA [sha], PBT [pbt], and the median-stopping rule [vizier], as other existing hyper-parameter optimization systems do.

8 Conclusion

Hippo is a hyper-parameter optimization system that removes redundant computation in the training process by breaking down the hyper-parameter sequences into stages, merging common stages to form a tree of stages, and executing a stage once per tree. Hippo is applicable to not only single-study scenarios but also multi-study scenarios. Our evaluations show that Hippo saves GPU-hours and reduces end-to-end training time significantly compared to Ray Tune on multiple models and hyperparameter optimization algorithms.

Acknowledgments

We thank our colleagues Taebum Kim, Eunji Jeong, Gyeongin Yu, and Won Wook Song for their feedback on this work. This work was supported by the AWS Machine Learning Research Awards (MLRA), the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No.2015-0-00221, Development of a Unified High-Performance Stack for Diverse Big Data Analytics), the ICT R&D program of MSIT/IITP (No.2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test), and Samsung Advanced Institute of Technology.

References