A Scalable and Cloud-Native Hyperparameter Tuning System

06/03/2020 ∙ by Johnu George, et al. ∙ Cisco

In this paper, we introduce Katib: a scalable, cloud-native, and production-ready hyperparameter tuning system that is agnostic of the underlying machine learning framework. Though there are multiple hyperparameter tuning systems available, this is the first one that caters to the needs of both users and administrators of the system. We present the motivation and design of the system and contrast it with existing hyperparameter tuning systems, especially in terms of multi-tenancy, scalability, fault-tolerance, and extensibility. It can be deployed on local machines, or hosted as a service in on-premise data centers, or in private/public clouds. We demonstrate the advantage of our system using experimental results as well as real-world, production use cases. Katib has active contributors from multiple companies and is open-sourced at https://github.com/kubeflow/katib under the Apache 2.0 license.




1. Introduction

In machine learning, a hyperparameter is a parameter whose value must be fixed before the actual training process. Consequently, hyperparameters (e.g., number of clusters in k-means clustering, learning rate, batch size, and number of hidden nodes in neural networks) cannot be learnt during the training process, unlike the value of model parameters (e.g., weights of the edges in the neural network). Hyperparameters can impact both the quality of the model generated by the training process as well as the time and memory requirements of the algorithm (Goodfellow et al., 2016). Thus, hyperparameters have to be tuned to get the optimal setting for a given problem. This tuning can be done either manually or automatically. Manual tuning might suffice if the problem setup is not expected to change, at least not frequently, and if the number of hyperparameters is not large. Since these assumptions are frequently violated for realistic problems, an automatic hyperparameter tuning mechanism is required to make the overall machine learning approach practical.

Several hyperparameter tuning systems already exist. Notable among them are Optuna (Akiba et al., 2019), Ray Tune (Liaw et al., 2018), Vizier (Golovin et al., 2017), HyperOpt (Bergstra et al., 2015), and NNI (4). Even though these frameworks exhibit certain similarities (e.g., almost all frameworks support running parallel trials and customizable search algorithms), Katib has important advantages over these systems. Namely, Katib is the only open-source framework that can realistically be run as a hosted service in a production environment.

There are several reasons for this distinction:

  1. Multi-tenancy: The only other framework that supports multi-tenancy is Vizier, which is closed-source. Other frameworks support only single-tenant usage, which makes cross-team collaboration more difficult.

  2. Distributed Training: Frameworks like Optuna and HyperOpt lack support for distributed training (either parameter servers (Li et al., 2014) or collective communications such as RingAllReduce (Sergeev and Balso, 2018)). Both distributed patterns are supported in Katib.

  3. Cloud Native: Katib is a Kubernetes-native (19) framework, thus making it a natural fit for cloud-native deployments. Other frameworks like Ray Tune and NNI support Kubernetes but require additional effort to configure.

  4. Extensibility: Katib is an open-source framework with pluggable interfaces for search algorithms and data storage. Some frameworks (like HyperOpt) do not support customizing search algorithms, and most do not support customizing the underlying metric storage.

Finally, Katib is one of only two frameworks that currently support neural architecture search (Zhou et al., 2019) (the other one being NNI). Neural architecture search (NAS) is an advanced automated machine learning technique that constructs entire neural networks using various search strategies. A more detailed comparison of various frameworks is shown in Table 1.

In addition to these differences, Katib is the first system that was designed from the ground up with a focus on multiple user personas. Thus, the system does not simply cater to a data scientist persona (called a user here), as do all the prior frameworks, but specifically targets the operations persona (called an admin here) as well. Consequently, among all the competing open-source systems, Katib has the broadest support, with contributions from more than 20 companies, many of which are already using it in production. Katib is open-sourced at https://github.com/kubeflow/katib under the Apache 2.0 license.

Our contributions in this paper are as follows:

  1. Katib is the first hyperparameter tuning system that is completely open-source and supports multi-tenancy, scalability, and extensibility, thus making it the only candidate for a hosted hyperparameter tuning service.

  2. Katib is the only hyperparameter tuning system that was designed from the ground up with a focus on usability for different user personas, thus catering to both the data scientist and the administrator of the system.

The rest of this paper is arranged as follows. Section 2 goes into how accounting for multiple personas leads to a unique set of requirements. Section 3 goes into the detailed design and system workflow, followed by a list of supported features in Section 4. We present evaluation in Section 5 and finally conclude in Section 6.

2. Motivation

In order for a system to be deployed in a production setup, it needs to cater to multiple personas of users. From the point of view of a hyperparameter tuning system, we can broadly classify users into two personas, user and admin, each having a different set of expectations from the underlying system.

The user, often called the data scientist or machine learning engineer, focuses on building, testing, and maintaining production ready machine learning models. The user is interested in developing the best performing machine learning models using all available frameworks and tools in the market. The primary requirements for the user are as follows:

  (U.1) In the preliminary phase, the user wants to do HP tuning in a limited resource environment, e.g., a laptop.

  (U.2) Once promising results are obtained from the initial experiments, the user wants to try a similar experiment in a much larger compute environment, possibly with support for accelerators like GPUs and TPUs.

  (U.3) The user wants to compare and visualize the results after HP tuning.

  (U.4) The user wants to track, version, and share the experiment details and results.

The admin, also known as the operations engineer, has a vastly different set of requirements. The admin mainly focuses on maintaining hardware and software platforms which can be on premise or in the cloud. This persona is responsible for managing the underlying infrastructure, maintaining its health, and ensuring that the software on the infrastructure is highly available and up to date. The admin needs the ability to:

  (A.1) Perform resource efficient deployments.

  (A.2) Support multiple users in the same cluster with dynamic resource allocation.

  (A.3) Share the cluster with non hyperparameter tuning jobs.

  (A.4) Perform capacity planning based on the user workloads with ease of cluster management.

  (A.5) Upgrade a live system without affecting other users or running workloads.

  (A.6) Identify issues in the system through logging and monitoring.

The design of Katib, from the very beginning, was heavily influenced by these two different sets of requirements of the two personas. To the best of our knowledge, existing hyperparameter tuning systems have primarily catered to the user persona, while largely neglecting the admin.

3. Design

Some of the requirements of the admin persona, specified in the previous section, are not unique to a hyperparameter tuning system but are indeed the requirements for operating any scalable, cloud-native service. Increasingly, such large scale systems are built using containers, coupled with an orchestration system for deploying, scaling, and managing containers. The de-facto standard container orchestration system used today is Kubernetes (19), which is open-source software for “production-grade container orchestration”. Consequently, from the very beginning, the design of Katib was tightly coupled with Kubernetes, thus reusing a large set of existing tools and ensuring compatibility with a rich cloud-native ecosystem. Next, we present some key definitions that are critical for explaining the design of the Katib system.

3.1. Definitions

The following are standard Kubernetes concepts and are included here for the sake of completeness:

  • Node – A single machine which can be a physical or virtual entity defined with a set of resources like memory, CPU, etc.

  • Cluster – A collection of nodes which represents the distributed deployment setup. It can be deployed either on a local machine, on-premise data centers, or private/public clouds.

  • Resource – A Kubernetes persistent object. The configuration for a resource is described in YAML format and contains two nested fields – “Spec” and “Status”. “Spec” refers to the desired state of the object while “Status” refers to the current state.

In addition to the above, some fundamental Katib constructs are as follows:

3.1.1. Experiment

An external (user-facing) resource referring to one complete user run or an optimization loop for a specific machine learning model. The experiment specifies the user's training task definition, the objective to be optimized, the parameter search space, and the search algorithm to be used.

objective:
  type: maximize
  goal: 0.99
  objectiveMetricName: Validation-accuracy
  additionalMetricNames:
    - accuracy
parameters:
  - name: --lr
    parameterType: double
    feasibleSpace:
      min: "0.01"
      max: "0.03"
  - name: --num-layers
    parameterType: int
    feasibleSpace:
      min: "2"
      max: "5"
  - name: --optimizer
    parameterType: categorical
    feasibleSpace:
      list:
        - sgd
        - adam
algorithm:
  algorithmName: bayesianoptimization
  algorithmSettings:
    - name: "random_state"
      value: "10"
parallelTrialCount: 3
maxTrialCount: 12
maxFailedTrialCount: 3

trialTemplate: |-
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: {{.Trial}}
    namespace: {{.NameSpace}}
  spec:
    template:
      spec:
        containers:
          - name: {{.Trial}}
            image: katib-mnist-example
            command:
              - "python"
              - "/classification/train_mnist.py"
              - "--batch-size=64"
              {{- with .HyperParameters}}
              {{- range .}}
              - "{{.Name}}={{.Value}}"
              {{- end}}
              {{- end}}
Listing 1: Experiment Specification

A sample experiment specification is given in Listing 1. The first section describes the ‘objective’ to be optimized. In the example, the target is to maximize the objective metric, Validation-accuracy, until it reaches a goal value of 0.99. An additional metric, accuracy, is also tracked in the experiment. Though multi variable optimization is not supported at this point, multiple metrics can be tracked simultaneously, thus giving a better understanding of the tuning progress. The progress of these metrics over time can be viewed in the Katib dashboard (shown later in Section 5). The second section describes the ‘parameters’ to be optimized with their type and search space. Supported types are integer, double, discrete, and categorical. The example specifies three parameters: a double type lr, an integer type num-layers, and a categorical type optimizer. The third section describes the search ‘algorithm’ to be used for searching hyperparameter values and its corresponding settings. In the example spec, the search algorithm used is Bayesian optimization with the setting random_state set to 10. The fourth section describes the experiment-wide settings that control the whole run. ‘ParallelTrialCount’ specifies the number of trials to be executed in parallel; ‘MaxTrialCount’ specifies the experiment budget, i.e., the maximum number of completed trials after which an experiment is marked successful. ‘MaxFailedTrialCount’ specifies the error budget, i.e., the maximum number of failed trials before an experiment is marked as failed.

The fifth section describes the ‘trialTemplate’ which defines the user training task to be optimized and follows the Go Template format. The example spec specifies an MNIST example container image with hyperparameters passed as command line arguments.
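The search space in Listing 1 can be read as a set of constraints that every generated hyperparameter value has to satisfy. The following is a minimal Python sketch of that idea, not Katib's implementation: the `Parameter` class and the `is_feasible` helper are hypothetical names introduced for illustration, and only the double, int, and categorical types from the example are handled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Parameter:
    """Hypothetical Python mirror of one entry in the 'parameters' section."""
    name: str
    parameter_type: str              # "double", "int", or "categorical"
    min: Optional[str] = None        # bounds are strings, as in the spec
    max: Optional[str] = None
    choices: Optional[List[str]] = None


# Search space copied from Listing 1.
SEARCH_SPACE = [
    Parameter("--lr", "double", min="0.01", max="0.03"),
    Parameter("--num-layers", "int", min="2", max="5"),
    Parameter("--optimizer", "categorical", choices=["sgd", "adam"]),
]


def is_feasible(param: Parameter, value: str) -> bool:
    """Return True if `value` lies inside the parameter's feasible space."""
    if param.parameter_type == "categorical":
        return value in (param.choices or [])
    cast = {"double": float, "int": int}.get(param.parameter_type)
    if cast is None:
        return False  # types outside this sketch (e.g. discrete) are rejected
    try:
        return cast(param.min) <= cast(value) <= cast(param.max)
    except (TypeError, ValueError):
        return False  # missing bound or value that does not parse


lr, num_layers, optimizer = SEARCH_SPACE
print(is_feasible(lr, "0.013884981186857928"))
print(is_feasible(num_layers, "6"))
print(is_feasible(optimizer, "adam"))
```

Keeping the bounds as strings mirrors the quoted values in the specification; the cast happens only at comparison time.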

3.1.2. Suggestion

A suggestion is an internal resource (not exposed to the user), referring to one proposed solution to the optimization problem or a set of generated hyperparameter values.

3.1.3. Trial

A Trial is an internal resource referring to one iteration of the optimization loop or an instance of the training job with generated suggestion values.

3.1.4. TrialJob

A TrialJob is the training instance provided by the user. A TrialJob can be a non-distributed training job with a single worker, or a distributed job consisting of several workers. Katib is designed to work with Kubeflow (18), an open-source machine learning toolkit for Kubernetes, and natively supports TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), and XGBoost (Chen and Guestrin, 2016) distributed training jobs.

runSpec: |-
  apiVersion: batch/v1
  kind: Job
  metadata:
    name: bayesian-run-1
    namespace: kubeflow
  spec:
    template:
      spec:
        containers:
          - name: bayesian-run-1
            image: katib-mnist-example
            command:
              - "python"
              - "/classification/train_mnist.py"
              - "--batch-size=64"
              - "--lr=0.013884981186857928"
              - "--num-layers=3"
              - "--optimizer=adam"
Listing 2: Trial Specification
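The step from the trial template in Listing 1 to the command in Listing 2 amounts to appending one "name=value" argument per hyperparameter. The Python sketch below illustrates only that substitution; Katib itself executes a Go template, and `render_command` is a hypothetical helper introduced for this example.

```python
from typing import Dict, List


def render_command(base_command: List[str], assignments: Dict[str, str]) -> List[str]:
    """Append one "name=value" argument per hyperparameter, mirroring the
    `- "{{.Name}}={{.Value}}"` loop of the trial template."""
    return base_command + [f"{name}={value}" for name, value in assignments.items()]


# Fixed part of the command from Listing 1.
base = ["python", "/classification/train_mnist.py", "--batch-size=64"]

# Parameter assignments of the trial shown in Listing 2.
suggestion = {
    "--lr": "0.013884981186857928",
    "--num-layers": "3",
    "--optimizer": "adam",
}

command = render_command(base, suggestion)
print(command)
```

Because the hyperparameters arrive as ordinary command line arguments, the training script needs no Katib-specific code to receive them.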

3.1.5. Controller

The controller is a non-terminating process that watches the state of a resource and makes the changes required to move the current state (Status) of the resource closer to the desired state (defined in Spec). For example, the Experiment controller provides life cycle management of the Experiment resource. Similarly, the Trial and Suggestion controllers provide life cycle management of the Trial and Suggestion resources respectively.
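The controller pattern can be illustrated with a small reconciliation loop. This is a simplified sketch under stated assumptions: the `Resource` class, the trial naming, and the bounded `run_until_converged` driver are hypothetical, whereas a real Kubernetes controller reacts to watch events and never terminates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    desired_trials: int                                        # stands in for Spec
    running_trials: List[str] = field(default_factory=list)    # stands in for Status


def reconcile(resource: Resource) -> bool:
    """Take one step towards the desired state; return True if something changed."""
    current = len(resource.running_trials)
    if current < resource.desired_trials:
        resource.running_trials.append(f"trial-{current + 1}")
        return True
    if current > resource.desired_trials:
        resource.running_trials.pop()
        return True
    return False  # Status already matches Spec


def run_until_converged(resource: Resource, max_steps: int = 100) -> int:
    """Bounded stand-in for the controller's endless loop; returns steps taken."""
    steps = 0
    while steps < max_steps and reconcile(resource):
        steps += 1
    return steps


experiment = Resource(desired_trials=3)
print(run_until_converged(experiment), experiment.running_trials)
```

Each call to `reconcile` makes only one small change, so the loop can be interrupted at any point and resumed from the recorded status.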

3.2. System Workflow

Figure 1. Katib System Workflow.

In this section, we present the typical workflow that a user would follow to interact with the Katib system. The schematic of the workflow is presented in Figure 1 and the major steps are as follows:

  1. The user creates an Experiment Spec in YAML and submits it to the Katib system using client tools. Since the OpenAPI (26) specification is also supported, the user can alternatively generate clients in their language of choice and construct the config. For better clarity of the workflow, we will use the experiment spec defined in Listing 1 as an example.

  2. The Experiment Controller reads the Experiment Spec and creates the corresponding Suggestion Spec as given in Listing 3. The Suggestion Spec specifies the search algorithm and the number of requested suggestions. The number of requested suggestions is equal to the maximum number of trials to be executed in parallel. This is determined by the parameter ‘ParallelTrialCount’ in the Experiment Spec. Hence, 2 suggestions are requested using the Bayesian optimization algorithm.

    spec:
      algorithmName: bayesianoptimization
      requests: 2
    status:
      suggestionCount: 2
      suggestions:
        - name: bayesianoptimization-run-1
          parameterAssignments:
            - name: --lr
              value: "0.013884981186857928"
            - name: --num-layers
              value: "3"
            - name: --optimizer
              value: adam
        - name: bayesianoptimization-run-2
          parameterAssignments:
            - name: --lr
              value: "0.024941501303260026"
            - name: --num-layers
              value: "4"
            - name: --optimizer
              value: sgd
    Listing 3: Suggestion Resource

  3. Based on the search algorithm parameter defined in the Experiment Spec, an algorithm service is deployed per experiment.

  4. The Suggestion Controller reads the Suggestion Spec and updates the Suggestion Status with the requested number of suggestions from the algorithm service. As seen in Listing 3, the Suggestion Status has two sets of parameter assignments, which are returned by the deployed Bayesian optimization algorithm service.

  5. The Experiment Controller reads the Suggestion Status and spawns multiple Trials, with each Trial Spec corresponding to one generated suggestion. The trial template of the experiment specification shown in Listing 1 is converted to a run time trial specification, shown in Listing 2, for each generated suggestion. The ‘runSpec’ in the Trial Spec is obtained by executing the trial template and replacing the template variables name, namespace, and HyperParameters with the parameter assignments from the generated suggestion in Listing 3.

  6. The Trial Controller reads the Trial Spec of each Trial and creates the corresponding TrialJobs. The hyperparameters are passed to the training code through command line arguments.

  7. The Trial Controller continuously watches the status of all spawned TrialJobs and updates the Trial Status. When the underlying jobs are completed, the Trial’s status is marked as completed. Once completed, the TrialJob metrics are also reported to the underlying metric storage and the best objective metric value is recorded in the Trial Status.

  8. The Experiment Controller reads the status of all Trials and verifies if the experiment ‘objective’ is met. If the experiment objective is met, the experiment is marked as complete.

    If the experiment objective is not met, steps 2-8 are repeated till the configured experiment budget is reached.
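The workflow above can be condensed into a single optimization loop. The sketch below is a sequential simplification under stated assumptions: `run_experiment`, its callback signatures, and the returned dictionary are hypothetical, trials in a batch run one after another instead of in parallel, and the choice to mark the experiment failed once the failed trial count exceeds (rather than equals) the error budget is an assumption made for illustration.

```python
from typing import Callable, Dict, List, Optional

Assignment = Dict[str, float]


def run_experiment(
    suggest: Callable[[int], List[Assignment]],
    run_trial: Callable[[Assignment], Optional[float]],
    goal: float,
    parallel_trial_count: int,
    max_trial_count: int,
    max_failed_trial_count: int,
) -> Dict[str, object]:
    """Repeat suggest -> run trials -> check objective (maximization)."""
    succeeded, failed, best = 0, 0, None
    while True:
        remaining = max_trial_count - succeeded
        for assignment in suggest(min(parallel_trial_count, remaining)):
            metric = run_trial(assignment)      # None models a failed trial
            if metric is None:
                failed += 1
            else:
                succeeded += 1
                best = metric if best is None else max(best, metric)
        result = {"succeeded": succeeded, "failed": failed, "best": best}
        if best is not None and best >= goal:
            return {"state": "Succeeded", "reason": "goal reached", **result}
        if failed > max_failed_trial_count:     # assumption: strict comparison
            return {"state": "Failed", "reason": "error budget exhausted", **result}
        if succeeded >= max_trial_count:
            return {"state": "Succeeded", "reason": "trial budget reached", **result}


# Usage with stub callbacks and the budgets from Listing 1.
outcome = run_experiment(
    suggest=lambda n: [{"lr": 0.01}] * n,
    run_trial=lambda assignment: 0.5,
    goal=0.99,
    parallel_trial_count=3,
    max_trial_count=12,
    max_failed_trial_count=3,
)
print(outcome["state"], outcome["reason"], outcome["succeeded"])
```

In the usage example the stub metric never reaches the goal, so the loop ends only when the trial budget of 12 completed trials is used up.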

| Framework | Optuna (Akiba et al., 2019) | Ray Tune (Liaw et al., 2018) | Vizier (Golovin et al., 2017) | HyperOpt (Bergstra et al., 2015) | NNI (4) | Katib (katib, 2020) |
|---|---|---|---|---|---|---|
| Open Sourced | MIT | Apache 2.0 | No | Custom | MIT | Apache 2.0 |
| Language Agnostic | Python | Python | Any | Python | Python | Any |
| Cloud Native | Partial | Partial | No | No | Partial | Yes |
| Platform | None | Ray, Kubernetes | Google Borg | None | Kubernetes, PAI (29) | Kubernetes |
| Multi-Tenancy | No | No | Yes | No | No | Yes |
| Autoscalability | No | Yes | Yes | Partial | Partial | Yes |
| Distributed Execution | No | Yes | Yes | No | Yes | Yes |
| User Code Invasiveness | High | High | Low | High | Low | Low |
| Metric Storage | Partial | No | Partial | No | Partial | Yes |
| Metrics Collection | Push | Push | Push | Push | Push | Pull/Push |
| Search Algorithm | Yes | Yes | Yes | Partial | Yes | Yes |
| Fault tolerance: Trial Failure | No | Yes | Yes | No | No | Yes |
| Fault tolerance: Error Budgets | No | No | No | No | No | Yes |
| NAS | No | No | No | No | Yes | Yes |
| Gang Scheduling (Feitelson and Rudolph, 1992) | No | No | No | No | No | Yes |

Table 1. Comparison of different hyperparameter tuning frameworks. Vizier has a closed-source implementation, which makes it difficult to compare against.

In addition to this workflow, the admin carries out a host of activities, as specified in Section 2 (requirements A.1 – A.6). These tend to have separate workflows that are usually not specific to the Katib system but are well known workflows more tied to the administration of the underlying Kubernetes infrastructure.

4. Features

Having presented the design of Katib, in this section, we elaborate on the main features from Table 1; the table also contrasts Katib with comparable systems. This set of features spans the requirements of both the user and the admin personas, as described in Section 2.

4.1. Generic

Katib is agnostic to the machine learning framework. It can tune hyperparameters of applications written in any language of the users’ choice and natively supports many machine learning frameworks, such as TensorFlow, PyTorch, MPI, and XGBoost.

4.2. Multi Tenancy

Katib provides multi tenancy using namespaces and access control rules. A namespace is defined to be a logical separation of cluster resources. Each user is assigned a unique namespace. The user can create, modify, view, and delete experiments in the assigned namespace. By default, access rules that prevent unauthorized access to other namespaces are set automatically for each user. However, the admin can grant users extra permissions that allow shared access to multiple namespaces. This is particularly useful for users who want to collaborate across teams.

Figure 2. Distributed TensorFlow job with 2 workers and ParallelTrialCount=3 runs 3 parallel trials with 2 workers each.

Namespaces also provide the ability to set limits on resources like memory, CPU, and GPU. For better capacity planning of the deployment cluster, the admin defines the limit for each user in advance to ensure that cluster resources are not oversubscribed. Resource limits can also be set at the individual experiment level, thus restricting the resource usage per experiment.

4.3. Scalability

Katib allows distributed execution at multiple levels to provide cloud scale. The experiment execution can be distributed at the Trial level as well as at the TrialJob level. The ‘ParallelTrialCount’ parameter determines the amount of parallelism at the trial level. For example, if ‘ParallelTrialCount’ is set to 10, 10 trials can run in parallel, subject to resource limits. TrialJob execution can be non-distributed or distributed depending on the framework that the trial uses. For example, the TensorFlow, PyTorch, XGBoost, and MPI frameworks support distributed training jobs. Trial and TrialJob parallelism can be tuned simultaneously for maximum resource utilization in the deployment environment; an example is shown in Figure 2.

Katib also supports Auto Scalability, which allows the cluster size to be automatically adjusted so that there is no under- or over-utilization of resources. In the autoscaler configuration, the minimum and maximum number of nodes can be set. The cluster is automatically scaled up when there are jobs that cannot be scheduled in the cluster due to insufficient resources. Similarly, the cluster is automatically scaled down when there are nodes left underutilized for a configurable period of time. This helps in controlling the total cost without exceeding the target budget for the experiment runs. Since Katib is horizontally scalable, nodes can be added to or removed from the cluster during run time.
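The scale-up and scale-down rules described above can be sketched as a single sizing function. This is an illustration only, not the Kubernetes cluster autoscaler: `desired_node_count` and its parameters are hypothetical, CPU is the only resource considered, and `idle_nodes` is assumed to count only nodes that have already been underutilized for longer than the configured period.

```python
import math


def desired_node_count(
    current_nodes: int,
    pending_vcpus: float,
    idle_nodes: int,
    vcpus_per_node: int,
    min_nodes: int,
    max_nodes: int,
) -> int:
    """Return the node count a simple autoscaler would aim for."""
    if pending_vcpus > 0:
        # Scale up: add enough nodes to fit the work that cannot be scheduled.
        target = current_nodes + math.ceil(pending_vcpus / vcpus_per_node)
    else:
        # Scale down: release nodes that stayed underutilized.
        target = current_nodes - idle_nodes
    # Never leave the configured [min_nodes, max_nodes] range.
    return max(min_nodes, min(max_nodes, target))


# 3 nodes of 4 vCPUs, 8 vCPUs of pending trials, limits 3..50.
print(desired_node_count(3, 8, 0, 4, 3, 50))
```

Clamping the result to the configured range is what keeps the total cost bounded even when the pending work would justify more nodes.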

4.4. Extensibility

Katib exposes a pluggable database interface for different types of metric storage, as shown in Figure 3. Any database can be supported in Katib by implementing the following functions of the Katib Database API:

  • RegisterObservationLog() – Save trial metrics into database.

  • GetObservationLog() – Retrieve trial metrics from database based on filters like start-timestamp and end-timestamp.

  • DeleteObservationLog() – Delete trial metrics from database.

The database can be either a local deployment or a remotely hosted service like Amazon Relational Database Service (RDS). Currently, MySQL (25) is the default deployed database. There are ongoing efforts to support PostgreSQL (28), ModelDB (Vartak et al., 2016), Kubeflow Metadata DB (18), and MLFlow (Zaharia et al., 2018).
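To make the shape of such a storage backend concrete, the following Python sketch implements the three operations against an in-memory dictionary. The function names follow the API listed above, but everything else is an assumption made for illustration: the `MetricLog` record, the argument lists, and the `InMemoryStore` class are hypothetical and do not reproduce Katib's actual interface definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MetricLog:
    timestamp: float
    name: str
    value: float


class ObservationLogStore(ABC):
    """Hypothetical Python rendering of the three database operations."""

    @abstractmethod
    def register_observation_log(self, trial: str, logs: List[MetricLog]) -> None: ...

    @abstractmethod
    def get_observation_log(self, trial: str, start: Optional[float] = None,
                            end: Optional[float] = None) -> List[MetricLog]: ...

    @abstractmethod
    def delete_observation_log(self, trial: str) -> None: ...


class InMemoryStore(ObservationLogStore):
    def __init__(self) -> None:
        self._logs: Dict[str, List[MetricLog]] = {}

    def register_observation_log(self, trial, logs):
        self._logs.setdefault(trial, []).extend(logs)

    def get_observation_log(self, trial, start=None, end=None):
        # Apply the optional start/end timestamp filters.
        return [
            log for log in self._logs.get(trial, [])
            if (start is None or log.timestamp >= start)
            and (end is None or log.timestamp <= end)
        ]

    def delete_observation_log(self, trial):
        self._logs.pop(trial, None)


store = InMemoryStore()
store.register_observation_log("run-1", [MetricLog(1.0, "accuracy", 0.90),
                                          MetricLog(2.0, "accuracy", 0.95)])
print(store.get_observation_log("run-1", start=1.5))
```

A different backend would keep the same three methods and replace only the dictionary with calls to the chosen database.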

Figure 3. Push and pull based metric collection in Katib

As shown in Figure 3, Katib supports two kinds of metric collection – push based and pull based, which can be configured using the metricCollectorKind parameter in the experiment specification. In push based metric collection, the metrics are pushed directly from the training container to the underlying metric storage using database APIs described above. The user training code has to be modified accordingly to enable metric tracking. In contrast, the pull based metric collection works with unmodified training code but it doesn’t provide synchronous control over the metrics written to the database. In this approach, there is a sidecar container (a sidecar is a container that does not exist by itself but is always paired with a main container) which pulls logs from the training container, applies custom parsing and then pushes the metrics to the underlying metric storage.
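The parsing step of pull based collection can be sketched as follows. The sketch assumes that the training container prints metrics as `name=value` pairs on its standard output; that log format, the regular expression, and the `parse_metrics` helper are assumptions made for this example and not Katib's actual parser.

```python
import re
from typing import Dict, Iterable, List

# Matches "name=value" pairs such as "Validation-accuracy=0.9712".
_METRIC_PATTERN = re.compile(
    r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)


def parse_metrics(log_lines: Iterable[str], metric_names: List[str]) -> Dict[str, List[float]]:
    """Collect the values of the requested metrics from raw training logs."""
    collected: Dict[str, List[float]] = {name: [] for name in metric_names}
    for line in log_lines:
        for name, value in _METRIC_PATTERN.findall(line):
            if name in collected:          # ignore anything that is not tracked
                collected[name].append(float(value))
    return collected


logs = [
    "starting training with lr=0.01",
    "epoch 1: Validation-accuracy=0.9712 accuracy=0.9830",
    "epoch 2: Validation-accuracy=0.9745 accuracy=0.9861",
]
metrics = parse_metrics(logs, ["Validation-accuracy", "accuracy"])
print(metrics)
```

Since the parser only reads log output, the training code stays unmodified, which is the main appeal of the pull based approach.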

Katib also exposes a pluggable algorithm interface to support new hyperparameter suggestion algorithms. This allows the user to plug in any custom algorithm that is suited to the needs of their environment. Any new suggestion algorithm can be integrated by implementing the GetSuggestions() API, which generates a new suggestion of hyperparameters to be evaluated. Currently, the supported algorithms are Random (Bergstra and Bengio, 2012), Grid (Bergstra and Bengio, 2012), Bayesian Optimization (Snoek et al., 2012), Hyperband (Li et al., 2017), and TPE (Bergstra et al., 2011).
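As an illustration of what a suggestion algorithm has to provide, the sketch below implements random search behind a `get_suggestions` function. It is a simplified stand-in: the real GetSuggestions() API is a service interface whose request and response types are not reproduced here, and the search space encoding and the `seed` argument are assumptions made for this example.

```python
import random
from typing import Dict, List, Optional

# Hypothetical encoding of the search space from Listing 1.
SPACE = {
    "--lr": ("double", 0.01, 0.03),
    "--num-layers": ("int", 2, 5),
    "--optimizer": ("categorical", ["sgd", "adam"]),
}


def get_suggestions(search_space: Dict[str, tuple], request_number: int,
                    seed: Optional[int] = None) -> List[Dict[str, str]]:
    """Random search: draw each hyperparameter independently from its space."""
    rng = random.Random(seed)
    suggestions = []
    for _ in range(request_number):
        assignment = {}
        for name, (kind, *spec) in search_space.items():
            if kind == "double":
                low, high = spec
                assignment[name] = str(rng.uniform(low, high))
            elif kind == "int":
                low, high = spec
                assignment[name] = str(rng.randint(low, high))  # inclusive bounds
            elif kind == "categorical":
                (choices,) = spec
                assignment[name] = rng.choice(choices)
            else:
                raise ValueError(f"unsupported parameter type: {kind}")
        suggestions.append(assignment)
    return suggestions


for suggestion in get_suggestions(SPACE, request_number=2, seed=10):
    print(suggestion)
```

Values are returned as strings so that they can be passed directly as command line arguments, as in the suggestion resource shown in Listing 3.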

4.5. Fault Tolerance

Since Katib leverages Kubernetes, it is a highly available system without any single point of failure. The latest state of any resource is recorded in its status field. Upon node restart or failure, jobs are redeployed automatically, resuming them from the last recorded checkpoint.

4.6. Portability

The Katib system offers high portability, allowing the user to execute workloads across different environments with minimal effort. Because Katib is designed to be cloud-native, its resource management and scheduling mechanisms are decoupled from the underlying infrastructure. Consequently, users can deploy and reproduce Katib experiments in a variety of environments, such as on a laptop, in a private cluster, or on a public cloud. The user only needs to define the total budget and run-time requirements for each trial, and the underlying system takes care of the actual resource allocation.

4.7. Upgradeability

Since Katib is designed to be cloud native, all individual components listed in Figure 1 are loosely coupled, allowing the live upgrade of components without any downtime. Since all components are versioned, they can be independently upgraded without affecting other components. Existing experiments can be smoothly upgraded to a newer version in a live cluster. Similarly, newer algorithms can be added during runtime without affecting any running experiments. If existing algorithms are updated, the changes are reflected only from the next experiment.

4.8. NAS

The Katib system is designed to be a general automated machine learning platform supporting features such as Neural Architecture Search (NAS). Since Katib follows an extensible architecture as shown in Figure 1, several classes of NAS algorithms can be easily supported. Currently, it supports NAS based on a reinforcement learning strategy. Since it is already presented in a previous paper (Zhou et al., 2019), we do not go into much detail.

5. Evaluation

In this section, we compare Katib with some other existing hyperparameter tuning systems and demonstrate features of Katib through some experiments. Then we introduce some real-world applications in industry.

5.1. Feature Evaluation

5.1.1. Portability

(a) Laptop: 15 trials on a Minikube cluster. Optimal accuracy (97.7%) is obtained with SGD optimizer, learning rate 0.212, 3 layers, and batch size 800.
(b) Cloud: 50 trials on 16 vCPU cores on a GKE cluster. Optimal accuracy (98.3%) is obtained with SGD optimizer, learning rate 0.287, 4 layers, and batch size 985.
Figure 4. Portability in Katib. Preliminary search is run on a laptop followed by extensive tuning in the cloud.

In this experiment, we demonstrate that Katib is a highly portable system that can run on various platforms with minimal configuration changes. We simulate what a typical user would do: run some preliminary trials locally on a laptop, and then run a larger experiment with more trials on a cloud cluster.

We conducted the experiment with a simple MNIST model defined via Apache MXNet (Chen et al., 2015). The first part of the experiment was run on a Minikube cluster (a single node Kubernetes cluster on a VM) deployed on a laptop. We ran 15 trials with random search over the following hyperparameter ranges:

  1. Learning rate: a float value ranging from 0 to 1.0

  2. Batch size: an integer value ranging from 10 to 1000

  3. Number of layers: an integer value ranging from 1 to 5

  4. Optimizer: choice of SGD, Adam, and FTRL.

The results are shown in Figure 4(a). The leftmost axis (“validation-accuracy”) shows the objective metric used to assess the trials, while the other axes plot the hyperparameter values used. Since we were using random search over fairly large ranges of hyperparameter values, the validation accuracy varied greatly, as expected. But we can see that the more promising results (validation-accuracy > 95%) were obtained when using the SGD optimizer and a learning rate lower than 0.3.

In the second part of the experiment, the same experiment was ported to a GKE cluster with 16 vCPU (virtual CPU) cores. This time we made a few modifications: we changed the search algorithm to Bayesian optimization and increased the number of trials to 50. Also, based on the previous experiment, we narrowed down the search space for the hyperparameters to:

  1. Learning rate: Between 0 and 0.3

  2. Batch size: Between 600 and 1000

  3. Number of layers: Between 2 and 4

  4. Optimizer: Only use SGD

The results can be seen in Figure 4(b). Almost all of the trials had a resulting validation accuracy above 97%, with the best trial producing an accuracy of 98.3%.

Most hyperparameter tuning systems support the ability to run on multiple platforms, but in Katib exporting and reproducing experiments is very lightweight. This is because Katib is built to be Kubernetes-native, so the underlying infrastructure is completely abstracted away from the user persona. A Katib experiment can be exported as just a YAML file.

5.1.2. Multi Tenancy

Figure 5. Multi Tenancy using Katib.

To demonstrate the multi tenancy feature, we deployed Katib in a multi-user environment. Two users are configured in a 24 vCPU cluster, each with a separate namespace. A resource quota is set for each namespace, limiting the maximum CPU resources to 18 vCPUs for ‘user1’ and 6 vCPUs for ‘user2’. Each user runs the same experiment config with MaxTrials set to 12, ParallelTrials set to 12, and each Trial requiring 2 vCPUs. The graph in Figure 5 indicates that at most 8 trials run in parallel for user1 and at most 2 for user2, even though ParallelTrials is set to 12 for both users. Since the aggregate resources required by a user's parallel trials cannot exceed the maximum resources allocated to the user in the assigned namespace, the number of trials executed in parallel is restricted to 8 and 2 respectively, even though the cluster as a whole can handle up to 12 parallel trials. Since a suggestion algorithm service is deployed for every experiment, it takes 0.5 vCPU by default from the available user resources. Once the trials are scheduled, only 1.5 vCPUs remain, which cannot be used to execute another trial since each trial requires exactly 2 vCPUs. To the best of our knowledge, Katib is the only open-source hyperparameter tuning framework that natively supports multi-tenancy.
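The limits of 8 and 2 parallel trials follow from simple arithmetic on the namespace quotas, which the sketch below reproduces. The `max_parallel_trials` helper is a hypothetical name, and the calculation considers CPU only.

```python
import math


def max_parallel_trials(
    quota_vcpus: float,
    vcpus_per_trial: float,
    parallel_trial_count: int,
    algorithm_vcpus: float = 0.5,
) -> int:
    """Trials that fit in a namespace quota after the algorithm service's share."""
    usable = quota_vcpus - algorithm_vcpus
    fits = math.floor(usable / vcpus_per_trial)
    return max(0, min(parallel_trial_count, fits))


for user, quota in (("user1", 18), ("user2", 6)):
    trials = max_parallel_trials(quota, vcpus_per_trial=2, parallel_trial_count=12)
    leftover = quota - 0.5 - 2 * trials
    print(user, trials, leftover)
```

For user1, 18 - 0.5 = 17.5 vCPUs fit 8 trials of 2 vCPUs; for user2, 6 - 0.5 = 5.5 vCPUs fit 2 trials. In both cases 1.5 vCPUs are left over, which is less than one trial needs.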

5.1.3. Scalability

Figure 6. Autoscaling in Katib
(a) Trial failures vs Number of trials
(b) Cross-entropy vs Number of trials
(c) Trial failures vs Number of trials
Figure 7. Fault Tolerance in Katib

In order to evaluate scalability, Katib is deployed in an autoscaler-enabled cluster with 3 nodes of 4 vCPUs each. The autoscaler is configured with a minimum of 3 and a maximum of 50 nodes. The experiment is run with MaxTrials and ParallelTrials set to 250, with each Trial using 2 vCPUs. Figure 6 shows how the autoscaler automatically resizes the number of nodes in the cluster based on the workload. The cluster autoscaler adds extra nodes automatically when there are pending Trials due to a lack of CPUs. The figure also shows that the cluster autoscaler removes nodes automatically after a termination grace period when nodes are underutilized. This ensures that the total resource cost is controlled while ensuring the availability of the user workload. As indicated in the figure, the cluster autoscaler ensures that the number of CPUs required by Trials at any instant is within the limits of cluster capacity.

5.1.4. Fault Tolerance

Failures occur for a variety of reasons, and not all failures are the same. To inject failures, we applied chaos engineering (Basiri et al., 2016) to Katib, Optuna, and NNI with the help of Chaos Mesh (1), a cloud-native chaos engineering platform. We designed two experiments to show how users can manage failures with Katib and how it compares with other frameworks in this respect.

The Katib experiment spec is configured to minimize the objective metric, cross-entropy, for a distributed TensorFlow job with 2 workers. MaxFailedTrialCount is set to 100, MaxTrials to 150, and ParallelTrials to 10 for both experiments.

In the first experiment, the chaos engineering platform fails a fixed proportion of Katib trials at a fixed time interval. Specifically, the platform is configured to fail 0%, 5%, 50%, and 100% of trials every 20 minutes, and the objective metric values (cross-entropy) are collected from succeeded trials. Failures are simulated by changing the trial container image name in the experiment spec to an invalid value, which causes the trials to fail (these trials cannot be recovered since the image name is invalid). Figure 7(a) shows the cumulative trial failures for the different failure ratios. There are about 40 failed trials in total when we fail all (i.e., 100% of) trials every 20 minutes, but the hyperparameter exploration is not affected by these failures, as indicated in Figure 7(b), which shows that the objective metric values improve over time for all failure rates. This indicates that failures do not have a large impact on the performance of the hyperparameter tuning experiments, which makes Katib fault tolerant. Figure 7(a) shows one failed trial even in the 0% failure case because a Kubernetes process was killed when its memory usage exceeded the limit.
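The two quantities plotted in this experiment, cumulative failures and the objective value reached so far, can be derived from a stream of trial results as in the sketch below. The trial statuses and cross-entropy values are made-up illustrative data, not measurements from the paper, and the function name is hypothetical.

```python
def summarize_trials(trials):
    """Return (best objective so far, cumulative failures) after each finished trial.

    `trials` is a list of (status, cross_entropy) pairs in completion order;
    failed trials carry no metric and are skipped when tracking the best value.
    """
    best = None
    failures = 0
    best_so_far, cumulative_failures = [], []
    for status, value in trials:
        if status == "Succeeded":
            # Minimization objective: keep the smallest cross-entropy seen.
            best = value if best is None else min(best, value)
        else:
            failures += 1
        best_so_far.append(best)
        cumulative_failures.append(failures)
    return best_so_far, cumulative_failures


# Hypothetical results, for illustration only.
trials = [
    ("Succeeded", 0.9),
    ("Failed", None),
    ("Succeeded", 0.7),
    ("Failed", None),
    ("Succeeded", 0.8),
]
best, failed = summarize_trials(trials)
print(best)    # [0.9, 0.9, 0.7, 0.7, 0.7]
print(failed)  # [0, 1, 1, 2, 2]
```

Because failed trials only increment the failure counter, the best objective value can keep improving as long as some trials succeed.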

In the second experiment, we kill 5% of trials every 20 minutes instead of failing them. This is simulated by terminating one of the workers of the TensorFlow job. We run the same experiment for Katib, Optuna, and NNI. The comparison results are shown in Figure 7(c). In Katib, no trials are marked as failed because distributed TensorFlow training jobs in Katib support restarting or resuming the training job if the exit code of any worker indicates a temporary failure. In contrast, NNI and Optuna have failed trials since the trials created by them cannot be restarted. We could not run the same experiment for Ray Tune because no tool was available to apply chaos engineering to Ray.
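The exit-code-based restart decision can be sketched as follows. The paper only states that a worker is restarted when its exit code indicates a temporary failure; the specific split used here (codes 128 to 255, which typically mean the process was terminated by a signal, treated as retryable, and codes 1 to 127 treated as permanent) is an assumption for illustration, and the function names are hypothetical.

```python
# Assumption: exit codes 128-255 (process terminated by a signal) are retryable.
RETRYABLE_EXIT_CODES = range(128, 256)


def worker_action(exit_code):
    """Decide what to do with a worker that exited with `exit_code`."""
    if exit_code == 0:
        return "succeeded"
    if exit_code in RETRYABLE_EXIT_CODES:
        return "restart"  # temporary failure: restart the worker
    return "fail"  # permanent failure: do not restart


def trial_failed(worker_exit_codes):
    """A trial is marked failed only if some worker failed permanently."""
    return any(worker_action(code) == "fail" for code in worker_exit_codes)


print(worker_action(137))     # killed by SIGKILL (128 + 9)
print(trial_failed([0, 137]))
print(trial_failed([0, 1]))
```

Under this assumption, a worker killed by the chaos platform exits with a signal-derived code and is restarted, so the trial itself is never marked as failed.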

5.2. Real World Applications

Katib has been adopted in Ant Financial, Caicloud, Cisco, and many other enterprises. In this section, we present the applications in Ant Financial and Caicloud as examples.

5.2.1. Hyperparameter tuning system at Ant Financial

At Ant Financial, we manage Kubernetes clusters with tens of thousands of nodes (Financial, 2020) and have deployed Katib along with other Kubeflow operators. One popular combination is to use Katib in conjunction with the MPI Operator (Ou et al., 2020). The MPI Operator leverages the network structure and collective communication algorithms so that users don’t have to worry about the right ratio between the number of workers and parameter servers to obtain the best performance. When it is used with Katib, users can focus on finding a reasonable hyperparameter search space for their chosen model architecture without spending time on tuning the hyperparameters and the downstream infrastructure for distributed training.

The models produced have been widely deployed in production and battle-tested in many different real-life scenarios. One notable use case is Dingsunbao (Zhang et al., 2020), a video-based mobile app that allows drivers to provide detailed vehicle damage information to insurers and claim vehicle insurance in real time. Car owners can capture video streams of their cars in the Dingsunbao app by following the on-screen guidelines. The system then uploads the captured video streams, recognizes vehicle damage information in the cloud asynchronously, and finally presents the damaged components to users automatically, with recommendations on where and how to repair the vehicle and how much the car owner can claim from insurers. This makes filing claims easier without expensive laboratory costs and increases the transparency of what is likely to be covered. Experiments have shown that the average damage assessment accuracy is 29.1% higher and the ratio of high-quality shooting data under predefined criteria is 20% higher compared with traditional approaches.

5.2.2. Hyperparameter tuning as a service at Caicloud

As a company focused on cloud-native machine learning infrastructure, we at Caicloud provide hyperparameter tuning services for customers in Caicloud Clever (10), an artificial intelligence cloud platform. We implement a trial kind and a new metricCollectorKind to integrate the metrics into Caicloud Clever.

Users can create hyperparameter tuning jobs on the Caicloud Clever platform. The necessary source code, datasets, or pretrained models are pulled before the actual Katib experiment runs. Once the Katib experiment is finished, the best-performing model and hyperparameter details are pushed to the internal model registry. The model can then be served, and the experiment can be reproduced with the saved hyperparameters.

We also integrate the tuning service into the model marketplace on our platform. Some classical deep learning models are available in the model marketplace. Users can import their datasets and tune the classical models with Katib without the need to build the models from scratch. In addition, there is ongoing research on how to run advanced neural architecture search algorithms such as DARTS (Liu et al., 2018) with Katib in order to automate the process for our customers.

6. Conclusion

In this paper, we presented the motivation and design of Katib, a scalable and cloud-native hyperparameter tuning system that caters to both the user and admin personas. We contrasted Katib with existing hyperparameter tuning systems and evaluated it along several aspects that are critical to production-ready systems, such as portability, multi-tenancy, autoscaling, and fault tolerance. We also presented case studies of large-scale, real-world applications that use Katib in production. Katib is an active open-source project under the Apache 2.0 license and has contributors from more than 20 companies.


References

  • [1] (2020) A Chaos Engineering Platform for Kubernetes.
  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
  • T. Akiba, S. Sano, T. Yanase, T. Ohta, and M. Koyama (2019) Optuna: a next-generation hyperparameter optimization framework. New York, NY, USA, pp. 2623–2631. ISBN 9781450362016.
  • [4] (2020) An open source AutoML toolkit for neural architecture search, model compression and hyper-parameter tuning.
  • A. Financial (2020) Ant Financial’s Hypergrowth Strategy Using Kubernetes.
  • A. Basiri, N. Behnam, R. De Rooij, L. Hochstein, L. Kosewski, J. Reynolds, and C. Rosenthal (2016) Chaos engineering. IEEE Software 33 (3), pp. 35–41.
  • J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl (2011) Algorithms for hyper-parameter optimization. Red Hook, NY, USA, pp. 2546–2554. ISBN 9781618395993.
  • J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (1), pp. 281–305. ISSN 1532-4435.
  • J. Bergstra, B. Komer, C. Eliasmith, D. Yamins, and D. D. Cox (2015) Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery 8 (1), pp. 014008.
  • [10] (2020) Caicloud Clever: artificial intelligence cloud platform.
  • T. Chen and C. Guestrin (2016) XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
  • T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2015) MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274.
  • W. Zhang, Y. Cheng, X. Guo, et al. (2020) Automatic Car Damage Assessment System: Reading and Understanding Videos as Professional Insurance Inspectors.
  • D. G. Feitelson and L. Rudolph (1992) Gang scheduling performance benefits for fine-grain synchronization. Journal of Parallel and Distributed Computing 16 (4), pp. 306–318. ISSN 0743-7315.
  • D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley (2017) Google Vizier: a service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, New York, NY, USA, pp. 1487–1495. ISBN 978-1-4503-4887-4.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep Learning. MIT Press. Note: http://www.deeplearningbook.org
  • katib (2020) Katib in Kubeflow.
  • [18] (2020) Kubeflow: The Machine Learning Toolkit for Kubernetes.
  • [19] (2020) Kubernetes: Production-Grade Container Orchestration.
  • L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2017) Hyperband: a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18 (1), pp. 6765–6816. ISSN 1532-4435.
  • M. Li, D. G. Andersen, A. J. Smola, and K. Yu (2014) Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 19–27.
  • R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, and I. Stoica (2018) Tune: a research platform for distributed model selection and training. CoRR abs/1807.05118.
  • H. Liu, K. Simonyan, and Y. Yang (2018) DARTS: differentiable architecture search. CoRR abs/1806.09055.
  • R. Ou, Y. Tang, et al. (2020) MPI Operator in Kubeflow.
  • [25] (2020) MySQL: The world’s most popular open source database.
  • [26] (2020) OpenAPI: an API description format for REST APIs.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
  • [28] (2020) PostgreSQL: The World’s Most Advanced Open Source Relational Database.
  • [29] (2020) Resource scheduling and cluster management for AI.
  • A. Sergeev and M. D. Balso (2018) Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. Red Hook, NY, USA, pp. 2951–2959.
  • M. Vartak, H. Subramanyam, W. Lee, S. Viswanathan, S. Husnoo, S. Madden, and M. Zaharia (2016) ModelDB: a system for machine learning model management. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pp. 1–3.
  • M. Zaharia, A. Chen, A. Davidson, A. Ghodsi, S. A. Hong, A. Konwinski, S. Murching, T. Nykodym, P. Ogilvie, M. Parkhe, F. Xie, and C. Zumar (2018) Accelerating the machine learning lifecycle with MLflow. IEEE Data Eng. Bull. 41, pp. 39–45.
  • J. Zhou, A. Velichkevich, K. Prosvirov, A. Garg, Y. Oshima, and D. Dutta (2019) Katib: a distributed general AutoML platform on Kubernetes. Santa Clara, CA, pp. 55–57. ISBN 978-1-939133-00-7.

Appendix A Reproducibility

In this section we provide some details regarding the reproducibility of the experiments presented in the evaluation section of this paper.

A.1. Environment

All experiments were conducted on Google Kubernetes Engine (GKE) version 1.14. The local cluster experiment was run using Minikube with KVM driver on a Linux server.

A.2. Installation

The Katib installation guide is provided at https://github.com/kubeflow/katib#installation. Once Katib is installed, the configuration files can be submitted to the cluster using the Kubernetes command-line tool kubectl.

A.3. Experiments

The configuration and test files used in the evaluation section are available in the GitHub repository at https://github.com/katib-examples/evaluation. The files for each evaluation section are organized into corresponding folders in the repository: