In machine learning, a hyperparameter is a parameter whose value must be fixed before the actual training process. Consequently, hyperparameters (e.g., the number of clusters in k-means clustering, the learning rate, the batch size, and the number of hidden nodes in neural networks) cannot be learnt during the training process, unlike the values of model parameters (e.g., the weights of the edges in a neural network). Hyperparameters can impact both the quality of the model generated by the training process and the time and memory requirements of the algorithm (Goodfellow et al., 2016). Thus, hyperparameters have to be tuned to find the best setting for a given problem. This tuning can be done either manually or automatically. Manual tuning might suffice if the problem setup is not expected to change, at least not frequently, and if the number of hyperparameters is not large. Since these assumptions are frequently violated for realistic problems, an automatic hyperparameter tuning mechanism is required to make the overall machine learning approach practical.
Several hyperparameter tuning systems already exist. Notable among them are Optuna (Akiba et al., 2019), Ray Tune (Liaw et al., 2018), Vizier (Golovin et al., 2017), HyperOpt (Bergstra et al., 2015), and NNI (4). Even though these frameworks exhibit certain similarities (e.g., almost all frameworks support running parallel trials and customizable search algorithms), Katib has important advantages over these systems. Namely, Katib is the only open-source framework that can realistically be run as a hosted service in a production environment.
There are several reasons for this distinction:
Multi-tenancy: The only other framework that supports multi-tenancy is Vizier, which is closed-source. Other frameworks support only single-tenant usage, which makes cross-team collaboration more difficult.
Cloud Native: Katib is a Kubernetes-native (19) framework, thus making it a natural fit for cloud-native deployments. Other frameworks like Ray Tune and NNI support Kubernetes but require additional effort to configure.
Extensibility: Katib is an open-source framework with pluggable interfaces for search algorithms and data storage. Some frameworks (like HyperOpt) do not support customizing search algorithms, and most do not support customizing the underlying metric storage.
Finally, Katib is one of only two frameworks that currently support neural architecture search (Zhou et al., 2019) (the other one being NNI). Neural architecture search (NAS) is an advanced automated machine learning technique that constructs entire neural networks using various search strategies. A more detailed comparison of various frameworks is shown in Table 1.
In addition to these differences, Katib is the first system that was designed from the ground up with a focus on multiple user personas. Thus, the system does not simply cater to a data scientist persona (called a user here), as do all the prior frameworks, but specifically targets the operations persona (called an admin here) as well. Consequently, among all the competing open-source systems, Katib has the most support and contribution, from more than 20 companies, many of which are already using it in production. Katib is open-sourced at https://github.com/kubeflow/katib under the Apache 2.0 license.
Our contributions in this paper are as follows:
Katib is the first hyperparameter tuning system that is completely open-source and supports multi-tenancy, scalability, and extensibility, thus making it the only candidate for a hosted hyperparameter tuning service.
Katib is the only hyperparameter tuning system that was designed from the ground up with a focus on usability for different user personas, thus catering to both the data scientist and the administrator of the system.
The rest of this paper is arranged as follows. Section 2 goes into how accounting for multiple personas leads to a unique set of requirements. Section 3 goes into the detailed design and system workflow, followed by a list of supported features in Section 4. We present evaluation in Section 5 and finally conclude in Section 6.
In order for a system to be deployed in a production setup, it needs to cater to multiple user personas. From the point of view of a hyperparameter tuning system, we can broadly classify users into two personas, user and admin, each having a different set of expectations from the underlying system.
The user, often called the data scientist or machine learning engineer, focuses on building, testing, and maintaining production ready machine learning models. The user is interested in developing the best performing machine learning models using all available frameworks and tools in the market. The primary requirements for the user are as follows:
In the preliminary phase, the user wants to do HP tuning in a limited resource environment, e.g., laptop.
Once promising results are obtained from the initial experiments, the user wants to try a similar experiment in a much larger compute environment, possibly with support for accelerators like GPUs and TPUs.
The user wants to compare and visualize the results after HP tuning.
The user wants to track, version, and share the experiment details and results.
The admin, also known as the operations engineer, has a vastly different set of requirements. The admin mainly focuses on maintaining hardware and software platforms which can be on premise or in the cloud. This persona is responsible for managing the underlying infrastructure, maintaining its health, and ensuring that the software on the infrastructure is highly available and up to date. The admin needs the ability to:
Perform resource efficient deployments.
Support multiple users in the same cluster with dynamic resource allocation.
Share the cluster with jobs other than hyperparameter tuning.
Perform capacity planning based on the user workloads with ease of cluster management.
Upgrade a live system without affecting other users or running workloads.
Identify issues in the system through logging and monitoring.
The design of Katib, from the very beginning, was heavily influenced by the two different sets of requirements of these two personas. To the best of our knowledge, existing hyperparameter tuning systems have primarily catered to the user persona, while largely neglecting the admin.
Some of the requirements of the admin persona, specified in the previous section, are not unique to a hyperparameter tuning system but are the requirements for operating any scalable, cloud-native service. Increasingly, such large-scale systems are built using containers, coupled with an orchestration system for deploying, scaling, and managing containers. The de facto standard container orchestration system used today is Kubernetes (19), which is open-source software for “production-grade container orchestration”. Consequently, from the very beginning, the design of Katib was tightly coupled with Kubernetes, thus reusing a large set of existing tools and ensuring compatibility with a rich cloud-native ecosystem. Next, we present some key definitions that are critical for explaining the design of the Katib system.
The following are standard Kubernetes concepts and are included here for the sake of completeness:
Node – A single machine which can be a physical or virtual entity defined with a set of resources like memory, CPU, etc.
Cluster – A collection of nodes which represents the distributed deployment setup. It can be deployed either on a local machine, on-premise data centers, or private/public clouds.
Resource – A Kubernetes persistent object. The configuration for a resource is described in YAML format and contains two nested fields – “Spec” and “Status”. “Spec” refers to the desired state of the object while “Status” refers to the current state.
In addition to the above, some fundamental Katib constructs are as follows:
An Experiment is an external (user-facing) resource referring to one complete user run or an optimization loop for a specific machine learning model. The experiment specifies the user training task definition, the objective to be optimized, the parameter search space, and the search algorithm to be used.
A sample experiment specification is given in Listing 1. The first section describes the ‘objective’ to be optimized. In the example, the target is to maximize the objective metric, Validation-accuracy, until it reaches a goal value of 0.99. An additional metric, accuracy, is also tracked in the experiment. Though multi-variable optimization is not supported at this point, multiple metrics can be tracked simultaneously, thus giving a better understanding of the tuning progress. The progress of these metrics over time can be viewed in the Katib dashboard (shown later in Section 5). The second section describes the ‘parameters’ to be optimized with their type and search space. Supported types are integer, double, discrete, and categorical. The example specifies three parameters: a double type lr, an integer type num-layers, and a categorical type optimizer. The third section describes the search ‘algorithm’ to be used for searching hyperparameter values and its corresponding settings. In the example spec, the search algorithm used is Bayesian optimization with the setting random_state set to 10. The fourth section describes the experiment-wide settings that control the whole run. ‘ParallelTrialCount’ specifies the number of trials to be executed in parallel; ‘MaxTrialCount’ specifies the experiment budget or maximum number of completed trials for an experiment to be marked successful; and ‘MaxFailedTrialCount’ specifies the error budget or maximum number of failed trials before an experiment is marked as failed.
The fifth section describes the ‘trialTemplate’ which defines the user training task to be optimized and follows the Go Template format. The example spec specifies an MNIST example container image with hyperparameters passed as command line arguments.
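Because Listing 1 is not reproduced in this text, the following Python sketch only illustrates the five sections described above. The key names mirror the terms used in the prose; the parameter ranges, the optimizer choices, the trial budgets, and the helpers `validate_experiment` and `render_trial_args` are assumptions made for illustration and should not be read as the actual Katib schema or as the contents of Listing 1. In particular, `render_trial_args` is a simplified stand-in for executing the Go template.

```python
SUPPORTED_TYPES = {"integer", "double", "discrete", "categorical"}

# Illustrative experiment description. Values marked "assumed" are placeholders.
experiment = {
    "objective": {
        "type": "maximize",
        "goal": 0.99,
        "objectiveMetricName": "Validation-accuracy",
        "additionalMetricNames": ["accuracy"],
    },
    "parameters": [
        {"name": "lr", "type": "double", "min": 0.01, "max": 0.3},       # range assumed
        {"name": "num-layers", "type": "integer", "min": 1, "max": 5},   # range assumed
        {"name": "optimizer", "type": "categorical",
         "list": ["sgd", "adam", "ftrl"]},                               # choices assumed
    ],
    "algorithm": {"name": "bayesianoptimization", "settings": {"random_state": 10}},
    "ParallelTrialCount": 2,
    "MaxTrialCount": 12,       # assumed
    "MaxFailedTrialCount": 3,  # assumed
}


def validate_experiment(exp):
    """Return a list of problems found in the sketch spec (empty if none)."""
    problems = []
    if exp["objective"]["type"] not in {"maximize", "minimize"}:
        problems.append("objective type must be maximize or minimize")
    for p in exp["parameters"]:
        if p["type"] not in SUPPORTED_TYPES:
            problems.append(f"unsupported type for {p['name']}")
        elif p["type"] in {"integer", "double"} and p["min"] > p["max"]:
            problems.append(f"empty range for {p['name']}")
        elif p["type"] in {"discrete", "categorical"} and not p["list"]:
            problems.append(f"no values listed for {p['name']}")
    if exp["ParallelTrialCount"] > exp["MaxTrialCount"]:
        problems.append("ParallelTrialCount exceeds MaxTrialCount")
    return problems


def render_trial_args(exp, assignment):
    """Build one command line flag per hyperparameter from a suggestion."""
    return [f"--{p['name']}={assignment[p['name']]}" for p in exp["parameters"]]
```

For example, `render_trial_args(experiment, {"lr": 0.02, "num-layers": 3, "optimizer": "sgd"})` produces the flags that a training container would receive for one suggestion.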
A suggestion is an internal resource (not exposed to the user), referring to one proposed solution to the optimization problem or a set of generated hyperparameter values.
A Trial is an internal resource referring to one iteration of the optimization loop or an instance of the training job with generated suggestion values.
A TrialJob is the training instance provided by the user. A TrialJob can be a non-distributed training job with a single worker, or a distributed job consisting of several workers. Katib is designed to work with Kubeflow (18), an open-source machine learning toolkit for Kubernetes, and natively supports TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), and XGBoost (Chen and Guestrin, 2016) distributed training jobs.
The controller is a non-terminating process that watches the state of a resource and makes the required changes to move the current state (Status) of the resource closer to the desired state (defined in Spec). For example, the Experiment controller provides life cycle management of the Experiment resource. Similarly, the Trial and Suggestion controllers provide life cycle management of the Trial and Suggestion resources respectively.
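The controller pattern can be illustrated with a toy reconciliation loop. This is a minimal sketch and not Katib's controller code: the `Resource` class, the single `trials` counter, and the bounded `run_controller` loop are hypothetical simplifications, and a real controller reacts to watch events and never terminates.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    spec: dict                                   # desired state
    status: dict = field(default_factory=dict)   # current (observed) state


def reconcile(resource):
    """One pass: move Status one step closer to Spec.

    Returns True when the current state already matches the desired state.
    """
    desired = resource.spec["trials"]
    current = resource.status.get("trials", 0)
    if current == desired:
        return True
    resource.status["trials"] = current + (1 if current < desired else -1)
    return False


def run_controller(resource, max_passes=100):
    """Repeat reconcile() until converged; the bound keeps the demo finite."""
    passes = 0
    while passes < max_passes and not reconcile(resource):
        passes += 1
    return passes
```

Calling `run_controller(Resource(spec={"trials": 3}))` takes three passes, after which the resource's status reports three trials.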
3.2. System Workflow
In this section, we present the typical workflow that a user would follow to interact with the Katib system. The schematic of the workflow is presented in Figure 1 and the major steps are as follows:
The user creates an Experiment Spec in YAML and submits it to the Katib system using client tools. Since the OpenAPI (26) specification is also supported, the user can alternatively generate clients in their language of choice and construct the config. For clarity, we will use the experiment spec defined in Listing 1 as an example.
The Experiment Controller reads the Experiment Spec and creates the corresponding Suggestion Spec, as given in Listing 3. The Suggestion Spec specifies the search algorithm and the number of requested suggestions. The number of requested suggestions is equal to the maximum number of trials to be executed in parallel, which is determined by the parameter ‘ParallelTrialCount’ in the Experiment Spec. Hence, 2 suggestions are requested using the Bayesian optimization algorithm.
Based on the search algorithm parameter defined in the Experiment Spec, an algorithm service is deployed per experiment.
The Suggestion Controller reads the Suggestion Spec and updates the Suggestion Status with the requested number of suggestions from the algorithm service. As seen in Listing 3, the Suggestion Status has two sets of parameter assignments, which are returned from the deployed Bayesian optimization algorithm service.
The Experiment Controller reads the Suggestion Status and spawns multiple Trials with each Trial Spec corresponding to one generated suggestion. The trial template of the experiment specification shown in Listing 1 gets converted to a run time trial specification shown in Listing 2 for each generated suggestion. The ‘runSpec’ in the Trial Spec is obtained by executing the Trial template and replacing the template variables name, namespace and HyperParameters with parameter assignments from the generated suggestion based on Listing 3.
The Trial Controller reads the Trial Spec of each Trial and creates corresponding TrialJobs. The hyperparameters are passed to the training code through command line arguments.
The Trial Controller continuously watches the status of all spawned TrialJobs and updates the Trial Status. When underlying jobs are completed, the Trial’s status is marked as completed. Once completed, the TrialJob metrics are also reported to the underlying metric storage and the best objective metric value is recorded in the Trial Status.
The Experiment Controller reads the status of all Trials and verifies if the experiment ‘objective’ is met. If the experiment objective is met, the experiment is marked as complete.
If the experiment objective is not met, steps 2-8 are repeated till the configured experiment budget is reached.
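The steps above can be summarized as a single loop. The sketch below is a simplified, sequential simulation under stated assumptions: `suggest` and `train` are hypothetical callables standing in for the algorithm service and the TrialJob, the objective is assumed to be maximized, and failed trials and the error budget are omitted.

```python
def run_experiment(parallel_trial_count, max_trial_count, goal, suggest, train):
    """Simulate the loop: request suggestions, run trials, check the objective."""
    completed = []  # (assignment, objective metric) pairs
    while len(completed) < max_trial_count:
        # Steps 2-4: request as many suggestions as trials that may run in parallel.
        batch = min(parallel_trial_count, max_trial_count - len(completed))
        suggestions = [suggest(completed) for _ in range(batch)]
        # Steps 5-7: one trial per suggestion; record its objective metric.
        for assignment in suggestions:
            completed.append((assignment, train(assignment)))
        # Step 8: the experiment is complete once the goal is met.
        best = max(completed, key=lambda pair: pair[1])
        if best[1] >= goal:
            return "goal reached", best, len(completed)
    # Step 9: the loop repeated until the experiment budget was used up.
    best = max(completed, key=lambda pair: pair[1], default=None)
    return "budget exhausted", best, len(completed)
```

In the real system the trials of a batch run concurrently and the controllers communicate through resource Status fields rather than function calls; the sketch only captures the order of the steps and the two stopping conditions.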
| Framework | Optuna (Akiba et al., 2019) | Ray Tune (Liaw et al., 2018) | Vizier (Golovin et al., 2017) | HyperOpt (Bergstra et al., 2015) | NNI (4) | Katib (katib, 2020) |
| --- | --- | --- | --- | --- | --- | --- |
| Open Sourced | MIT | Apache 2.0 | No | Custom | MIT | Apache 2.0 |
| Platform | None | Ray, Kubernetes | Google Borg | None | Kubernetes, PAI (29) | Kubernetes |
| User Code Invasiveness | High | High | Low | High | Low | Low |
| Gang Scheduling (Feitelson and Rudolph, 1992) | No | No | No | No | No | Yes |
In addition to this workflow, the admin carries out a host of activities, as specified in Section 2 (requirements A.1 – A.6). These tend to have separate workflows that are usually not specific to the Katib system but are well known workflows more tied to the administration of the underlying Kubernetes infrastructure.
Having presented the design of Katib, in this section we elaborate on the main features from Table 1; the table also contrasts Katib with comparable systems. This set of features spans the requirements of both the user and the admin personas, as described in Section 2.
Katib is agnostic to machine learning frameworks. It can tune hyperparameters of applications written in any language of the user's choice and natively supports many machine learning frameworks, such as TensorFlow, PyTorch, MPI, and XGBoost.
4.2. Multi Tenancy
Katib provides multi-tenancy using namespaces and access control rules. A namespace is a logical separation of cluster resources. Each user is assigned a unique namespace. The user can create, modify, view, and delete experiments in the assigned namespace. By default, access rules that prevent unauthorized access to other namespaces are set automatically for each user. However, the admin can grant extra permissions that allow shared access to multiple namespaces. This is particularly useful for users who want to collaborate across teams.
Namespaces also provide the ability to set limits on resources like memory, CPU, and GPU. For better capacity planning of the deployment cluster, the admin defines in advance the limit for each user to ensure that cluster resources are not oversubscribed. Resource limits can also be set at the individual experiment level, thus restricting the resource usage per experiment.
Katib allows distributed execution at multiple levels to provide cloud scale. The experiment execution can be distributed at the Trial level as well as at the TrialJob level. The ‘ParallelTrialCount’ parameter determines the amount of parallelism at the trial level. For example, if ‘ParallelTrialCount’ is set to 10, 10 trials can run in parallel, subject to resource limits. TrialJob execution can be non-distributed or distributed depending on the framework that the trial uses. For example, the TensorFlow, PyTorch, XGBoost, and MPI frameworks support distributed training jobs. Trial and TrialJob parallelism can be tuned simultaneously for maximum resource utilization in the deployment environment; an example is shown in Figure 2.
Katib also supports autoscaling, which allows the cluster size to be adjusted automatically so that resources are neither under- nor over-utilized. In the autoscaler configuration, the minimum and maximum number of nodes can be set. The cluster is automatically scaled up when there are jobs that cannot be scheduled in the cluster due to insufficient resources. Similarly, the cluster is automatically scaled down when there are nodes left underutilized for a configurable period of time. This helps in controlling the total cost without exceeding the target budget for the experiment runs. Since Katib is horizontally scalable, nodes can be added to or removed from the cluster during run time.
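The scale-up and scale-down rule described above can be sketched as a single decision function. This is a deliberately simplified model, not the logic of the Kubernetes cluster autoscaler: the function name and its parameters are hypothetical, nodes are assumed to be identical, and pending work is summarized as a single CPU total rather than scheduled pod by pod.

```python
import math


def desired_node_count(current_nodes, pending_cpu, idle_nodes,
                       cpu_per_node, min_nodes, max_nodes):
    """Return the node count a simple autoscaler would target.

    pending_cpu: CPU requested by jobs that could not be scheduled.
    idle_nodes:  nodes left underutilized beyond the configured period.
    """
    if pending_cpu > 0:
        # Scale up: add enough nodes to hold the unschedulable work.
        target = current_nodes + math.ceil(pending_cpu / cpu_per_node)
    else:
        # Scale down: release nodes that have stayed underutilized.
        target = current_nodes - idle_nodes
    # Never leave the configured [min_nodes, max_nodes] range.
    return max(min_nodes, min(max_nodes, target))
```

For instance, with 3 nodes of 4 CPUs each and 6 CPUs of pending work, the function targets 5 nodes; a very large backlog is capped at `max_nodes`.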
Katib exposes a pluggable database interface for different types of metric storage, as shown in Figure 3. Any database can be supported in Katib by implementing the following functions of the Katib Database API:
RegisterObservationLog() – Save trial metrics into database.
GetObservationLog() – Retrieve trial metrics from database based on filters like start-timestamp and end-timestamp.
DeleteObservationLog() – Delete trial metrics from database.
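To make the interface concrete, the following is a minimal in-memory stand-in for such a backend. The three methods are Python-style renderings of the functions listed above; their argument lists, the `(timestamp, metric name, value)` tuple layout, and the class name are assumptions for illustration, since the actual signatures are not given in this text.

```python
class InMemoryObservationDB:
    """Toy metric store keyed by trial name (illustration only)."""

    def __init__(self):
        self._logs = {}  # trial name -> list of (timestamp, metric name, value)

    def register_observation_log(self, trial, entries):
        """Save trial metrics."""
        self._logs.setdefault(trial, []).extend(entries)

    def get_observation_log(self, trial, start=None, end=None):
        """Retrieve trial metrics, optionally filtered by a timestamp window."""
        return [
            entry for entry in self._logs.get(trial, [])
            if (start is None or entry[0] >= start)
            and (end is None or entry[0] <= end)
        ]

    def delete_observation_log(self, trial):
        """Delete all metrics recorded for a trial."""
        self._logs.pop(trial, None)
```

A production backend would implement the same three operations on top of a database such as MySQL; only the storage layer changes, not the callers.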
The database can be either a local deployment or a remotely hosted service like Amazon Relational Database Service (RDS). Currently, MySQL (25) is the default deployed database. There are ongoing efforts to support PostgreSQL (28), ModelDB (Vartak et al., 2016), Kubeflow Metadata DB (18), and MLFlow (Zaharia et al., 2018).
As shown in Figure 3, Katib supports two kinds of metric collection – push based and pull based, which can be configured using the metricCollectorKind parameter in the experiment specification. In push based metric collection, the metrics are pushed directly from the training container to the underlying metric storage using database APIs described above. The user training code has to be modified accordingly to enable metric tracking. In contrast, the pull based metric collection works with unmodified training code but it doesn’t provide synchronous control over the metrics written to the database. In this approach, there is a sidecar container (a sidecar is a container that does not exist by itself but is always paired with a main container) which pulls logs from the training container, applies custom parsing and then pushes the metrics to the underlying metric storage.
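The parsing step of a pull based collector can be sketched as follows. The log format is an assumption: we suppose the training code prints metrics as `name=value` pairs, which is not specified in this text, and `parse_metrics` is a hypothetical helper rather than the sidecar's actual parser.

```python
import re

# Matches "name=value" where value is an integer, decimal, or exponent number.
METRIC_PATTERN = re.compile(r"([\w-]+)\s*=\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def parse_metrics(log_text, wanted):
    """Return (line number, metric name, value) for each wanted metric found."""
    found = []
    for line_no, line in enumerate(log_text.splitlines(), start=1):
        for name, value in METRIC_PATTERN.findall(line):
            if name in wanted:
                found.append((line_no, name, float(value)))
    return found
```

Because the training code is only read through its logs, it needs no changes; the cost is that metrics reach the store only after the sidecar has seen and parsed the corresponding log lines.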
Katib also exposes a pluggable algorithm interface to support new hyperparameter suggestion algorithms. This allows the user to plug in any custom algorithm that is suited to the needs of their environment. Any new suggestion algorithm can be integrated by implementing the GetSuggestions() API, which generates a new suggestion of hyperparameters to be evaluated. Currently, the supported algorithms are Random (Bergstra and Bengio, 2012), Grid (Bergstra and Bengio, 2012), Bayesian Optimization (Snoek et al., 2012), Hyperband (Li et al., 2017), and TPE (Bergstra et al., 2011).
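As an illustration of what a plugged-in algorithm has to provide, the sketch below implements random search behind a `get_suggestions` function. The function signature and the dictionary layout of the search space are assumptions; only the idea that the algorithm returns new hyperparameter assignments on request comes from the description above.

```python
import random


def get_suggestions(parameters, count, seed=None):
    """Random-search stand-in for GetSuggestions(): sample `count` assignments."""
    rng = random.Random(seed)
    suggestions = []
    for _ in range(count):
        assignment = {}
        for p in parameters:
            if p["type"] == "double":
                assignment[p["name"]] = rng.uniform(p["min"], p["max"])
            elif p["type"] == "integer":
                assignment[p["name"]] = rng.randint(p["min"], p["max"])  # inclusive
            else:  # discrete or categorical: pick one of the listed values
                assignment[p["name"]] = rng.choice(p["list"])
        suggestions.append(assignment)
    return suggestions


# Hypothetical search space used only to exercise the function.
search_space = [
    {"name": "lr", "type": "double", "min": 0.0, "max": 1.0},
    {"name": "batch-size", "type": "integer", "min": 10, "max": 1000},
    {"name": "optimizer", "type": "categorical", "list": ["sgd", "adam", "ftrl"]},
]
```

A model-based algorithm such as Bayesian optimization would additionally take the metrics of completed trials as input; random search ignores them, which is why it fits in so few lines.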
4.5. Fault Tolerance
Since Katib leverages Kubernetes, it is a highly available system without any single point of failure. The latest state of any resource is recorded in its status field. Upon node restart or failure, jobs are redeployed automatically, resuming them from the last recorded checkpoint.
The Katib system offers high portability, allowing the user to execute workloads across different environments with minimal effort. Because Katib is designed to be cloud-native, its resource management and scheduling mechanisms are decoupled from the underlying infrastructure. Consequently, users can deploy and reproduce Katib experiments in a variety of environments, such as on a laptop, in a private cluster, or on a public cloud. The user only needs to define the total budget and run-time requirements for each trial, and the underlying system takes care of the actual resource allocation.
Since Katib is designed to be cloud native, all individual components listed in Figure 1 are loosely coupled, allowing the live upgrade of components without any downtime. Since all components are versioned, they can be independently upgraded without affecting other components. Existing experiments can be smoothly upgraded to a newer version in a live cluster. Similarly, newer algorithms can be added during runtime without affecting any running experiments. If existing algorithms are updated, the changes are reflected only from the next experiment.
The Katib system is designed to be a general automated machine learning platform supporting features such as Neural Architecture Search (NAS). Since Katib follows an extensible architecture, as shown in Figure 1, several classes of NAS algorithms can be easily supported. Currently, it supports NAS based on a reinforcement learning strategy. Since this was already presented in a previous paper (Zhou et al., 2019), we do not go into much detail.
In this section, we compare Katib with some other existing hyperparameter tuning systems and demonstrate features of Katib through some experiments. Then we introduce some real world applications in industries.
5.1. Feature Evaluation
In this experiment, we demonstrate that Katib is a highly portable system that can run on various platforms with minimal configuration changes. We simulate what a typical user would do: run some preliminary trials locally on a laptop, and then run a larger experiment with more trials on a cloud cluster.
We conducted the experiment with a simple MNIST model defined via Apache MXNet (Chen et al., 2015). The first part of the experiment was run on a Minikube cluster (a single node Kubernetes cluster on a VM) deployed on a laptop. We ran 15 trials with random search over the following hyperparameter ranges:
Learning rate: a float value ranging from 0 to 1.0
Batch size: an integer value ranging from 10 to 1000
Number of layers: an integer value ranging from 1 to 5
Optimizer: choice of SGD, Adam, and FTRL.
The results are shown in Figure 3(a). The leftmost axis (“validation-accuracy”) shows the objective metric used to assess the trials, while the other axes plot the hyperparameter values used. Since we were using random search over fairly large ranges of hyperparameter values, the validation accuracy varied greatly, as expected. But we can see that the more promising results (validation-accuracy > 95%) were obtained when using the SGD optimizer and a learning rate lower than 0.3.
In the second part of the experiment, the same experiment was ported to a GKE cluster with 16 vCPU (virtual CPU) cores. This time we made a few modifications: we changed the search algorithm to Bayesian optimization and increased the number of trials to 50. Also, based on the previous experiment, we narrowed down the search space for the hyperparameters to:
Learning rate: Between 0 and 0.3
Batch size: Between 600 and 1000
Number of layers: Between 2 and 4
Optimizer: Only use SGD
The results can be seen in Figure 3(b). Almost all of the trials had a validation accuracy above 97%, with the best trial producing an accuracy of 98.3%.
Most hyperparameter tuning systems support the ability to run on multiple platforms, but in Katib exporting and reproducing experiments are very lightweight. This is because Katib is built to be Kubernetes-native, so the underlying infrastructure is completely abstracted away from the user persona. A Katib experiment can be exported as just a YAML file.
5.1.2. Multi Tenancy
To demonstrate the multi-tenancy feature, we deployed Katib in a multi-user environment. Two users are configured in a 24 vCPU cluster, each with a separate namespace. A resource quota is set for each namespace, limiting the maximum CPU resources to 18 vCPUs for ‘user1’ and 6 vCPUs for ‘user2’. Each user runs the same experiment config with MaxTrials set to 12 and ParallelTrials set to 12, and each Trial requires 2 vCPUs. The graph in Figure 5 indicates that at most 8 trials run in parallel for user1 and at most 2 for user2, even though ParallelTrials is configured to 12 for both users. Since the aggregate resources required by a user's parallel trials cannot exceed the maximum resources allocated to that user's namespace, the number of trials executed in parallel is restricted to 8 and 2 respectively, although the cluster can handle up to 12 parallel trials. Moreover, since a suggestion algorithm service is deployed for every experiment, it takes 0.5 CPU by default from the available user resources, leaving 1.5 vCPUs that cannot be used to execute another trial because each trial requires exactly 2 vCPUs. To the best of our knowledge, Katib is the only open-source hyperparameter tuning framework that natively supports multi-tenancy.
In order to evaluate scalability, Katib is deployed in an autoscaler-enabled cluster with 3 nodes of 4 vCPUs each. The autoscaler is configured with a minimum of 3 and a maximum of 50 nodes. The experiment is run with MaxTrials and ParallelTrials set to 250, with each Trial using 2 vCPUs. Figure 6 shows how the autoscaler automatically resizes the number of nodes in the cluster based on the workload. The cluster autoscaler adds extra nodes automatically when there are pending Trials due to a lack of CPUs. The figure also shows that the cluster autoscaler removes nodes automatically after a termination grace period when nodes are underutilized. This keeps the total resource cost under control while ensuring the availability of the user workload. As indicated in the figure, the cluster autoscaler ensures that the number of CPUs required by Trials at any instant is within the limits of cluster capacity.
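The parallelism observed for each user follows directly from the quota arithmetic, as the short calculation below shows. The helper `max_parallel_trials` is hypothetical and models only CPU; it uses the quota, per-trial, and suggestion-service figures stated above.

```python
import math


def max_parallel_trials(quota_cpu, trial_cpu, suggestion_cpu, parallel_trial_count):
    """Return (trials that fit at once, CPU left unused) for one namespace."""
    usable = quota_cpu - suggestion_cpu          # the suggestion service runs first
    fit = math.floor(usable / trial_cpu)         # whole trials that fit in the quota
    running = max(0, min(parallel_trial_count, fit))
    leftover = usable - running * trial_cpu
    return running, leftover


# user1: 18 vCPU quota, user2: 6 vCPU quota; 2 vCPUs per trial, 0.5 CPU for suggestions.
user1 = max_parallel_trials(18, 2, 0.5, 12)   # (8, 1.5)
user2 = max_parallel_trials(6, 2, 0.5, 12)    # (2, 1.5)
```

In both namespaces 1.5 vCPUs remain, which is less than the 2 vCPUs a further trial would need, so the configured parallelism of 12 is never reached.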
5.1.4. Fault Tolerance
Failures occur for a variety of reasons and not all failures are the same. To inject failures, we applied chaos engineering (Basiri et al., 2016) to Katib, Optuna, and NNI with the help of Chaos Mesh (1), which is a cloud-native chaos engineering platform. We designed a couple of experiments to show how users can manage failures with Katib and how it compares with other frameworks in this respect.
The Katib experiment spec is configured to minimize the objective metric, cross-entropy, for a distributed TensorFlow job with 2 workers. MaxFailedTrialCount is set to 100, MaxTrials to 150, and ParallelTrials to 10 for both experiments.
In the first experiment, a fixed proportion of Katib trials is failed at a fixed interval of time by the chaos engineering platform. Specifically, the platform is configured to fail 0%, 5%, 50%, and 100% of trials every 20 minutes, and the objective metric values of cross-entropy are collected from the trials that succeed. Failures are simulated by altering the trial container image names in the experiment spec to an invalid value, which causes the trials to fail (these trials cannot be recovered since the image names are invalid). Figure 6(a) shows the cumulative trial failures for the different failed trial ratios. There are about 40 failed trials in total when we fail all (i.e., 100%) of the trials every 20 minutes, but the hyperparameter exploration is not affected by these failures, as indicated in Figure 6(b). Figure 6(b) shows that the objective metric values improve over time for all failure rates. This indicates that failures do not have a large impact on the performance of the hyperparameter tuning experiments, which makes Katib fault tolerant. Figure 6(a) shows a failed trial even in the 0% trial failure case because a Kubernetes process was killed when its memory usage exceeded the limit.
In the second experiment, we kill 5% of trials every 20 minutes instead of failing them. This is simulated by terminating one of the workers of the TensorFlow job. We run the same experiment for Katib, Optuna, and NNI. The comparison results are shown in Figure 6(c). In Katib, no trials are marked as failed because distributed TensorFlow training jobs in Katib support restarting or resuming the training job if the exit code of any worker indicates a temporary failure. In contrast, NNI and Optuna have failed trials since the trials created by them cannot be restarted. We could not run the same experiment for Ray Tune because no tool was available to apply chaos engineering to Ray.
5.2. Real World Applications
Katib has been adopted in Ant Financial, Caicloud, Cisco, and many other enterprises. In this section, we present the applications in Ant Financial and Caicloud as examples.
5.2.1. Hyperparameter tuning system at Ant Financial
At Ant Financial, we manage Kubernetes clusters with tens of thousands of nodes (Financial, 2020) and have deployed Katib along with other Kubeflow operators. One popular combination is to use Katib in conjunction with MPI Operator (Ou et al., 2020). The MPI Operator leverages the network structure and collective communication algorithms so that users don’t have to worry about the right ratio between number of workers and parameter servers to obtain the best performance. When used with Katib, users can focus on finding reasonable hyperparameter search space of their chosen model architecture without spending time on tuning the hyperparameters and the downstream infrastructure for distributed training.
The models produced have been widely deployed in production and battle-tested in many different real life scenarios. One notable use case is Dingsunbao (Zhang et al., 2020) – a video-based mobile app that allows drivers to provide detailed vehicle damage information to insurers and claim vehicle insurance in real time. Car owners can capture video streams of their cars on Dingsunbao app by following the on-screen guidelines. The system then uploads those captured video streams, recognizes vehicle damage information on the cloud asynchronously, and finally presents the damaged components to users automatically, with recommendations on where and how to repair the vehicle and how much the car owner can claim from insurers. This makes filing claims easier without expensive laboratory costs and increases the transparency in what’s likely to be covered. Experiments have shown that the average damage assessment accuracy is 29.1% higher and the ratio of high quality shooting data on predefined criterion is also 20% higher compared with traditional approaches.
5.2.2. Hyperparameter tuning as a service at Caicloud
As a company focused on cloud-native machine learning infrastructure, we at Caicloud provide hyperparameter tuning services for customers in Caicloud Clever (10), an artificial intelligence cloud platform. We implement a trial kind and a new metricCollectorKind to integrate the metrics into Caicloud Clever.
Users can create hyperparameter tuning jobs in the Caicloud Clever platform. The necessary source code, datasets, or pretrained models are pulled before the actual Katib experiment runs. Once the Katib experiment is finished, the best-performing model and hyperparameter details are pushed to the internal model registry. The model can be served and the experiment can be reproduced with the saved hyperparameters.
We also integrate the tuning service into the model marketplace in our platform. Some classical deep learning models are available in the model marketplace. Users can import their datasets and tune the classical models with Katib without the need to build the models from scratch. In addition, there is ongoing research on how to run advanced neural architecture search algorithms such as DARTS (Liu et al., 2018) with Katib in order to automate the process for our customers.
In this paper, we presented the motivation and design of Katib, a scalable and cloud-native hyperparameter tuning system that caters to both the user and admin personas. We contrasted Katib with existing hyperparameter tuning systems and evaluated it along several aspects that are critical to production-ready systems, such as portability, multi-tenancy, autoscaling, and fault tolerance. We also presented case studies of large-scale, real-world applications that use Katib in production. Katib is an active open-source project under the Apache 2.0 license and has contributors from more than 20 companies.
References
- (2020) A Chaos Engineering Platform for Kubernetes. Cited by: §5.1.4.
- TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §3.1.4.
- Optuna: a next-generation hyperparameter optimization framework. New York, NY, USA, pp. 2623–2631. Cited by: §1, Table 1.
- (2020) An open source AutoML toolkit for neural architecture search, model compression and hyper-parameter tuning. Cited by: §1, Table 1.
- Ant Financial’s Hypergrowth Strategy Using Kubernetes. Cited by: §5.2.1.
- Chaos engineering. IEEE Software 33 (3), pp. 35–41. Cited by: §5.1.4.
- Algorithms for hyper-parameter optimization. Red Hook, NY, USA, pp. 2546–2554. Cited by: §4.4.
- Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13 (1), pp. 281–305. Cited by: §4.4.
- Hyperopt: a Python library for model selection and hyperparameter optimization. Computational Science & Discovery 8 (1), pp. 014008. Cited by: §1, Table 1.
- (2020) Caicloud Clever: artificial intelligence cloud platform. Cited by: §5.2.2.
- XGBoost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. Cited by: §3.1.4.
- MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §5.1.1.
- Automatic Car Damage Assessment System: Reading and Understanding Videos as Professional Insurance Inspectors. Cited by: §5.2.1.
- Gang scheduling performance benefits for fine-grain synchronization. Journal of Parallel and Distributed Computing 16 (4), pp. 306–318. Cited by: Table 1.
- Google Vizier: a service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, New York, NY, USA, pp. 1487–1495. Cited by: §1, Table 1.
- Deep Learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §1.
- Katib in Kubeflow. Cited by: Table 1.
- (2020) Kubeflow: The Machine Learning Toolkit for Kubernetes. Cited by: §3.1.4, §4.4.
- (2020) Kubernetes: Production-Grade Container Orchestration. Cited by: item 3, §3.
- Hyperband: a novel bandit-based approach to hyperparameter optimization. J. Mach. Learn. Res. 18 (1), pp. 6765–6816. Cited by: §4.4.
- Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 19–27. Cited by: item 2.
- Tune: A research platform for distributed model selection and training. CoRR abs/1807.05118. Cited by: §1, Table 1.
- DARTS: differentiable architecture search. CoRR abs/1806.09055. Cited by: §5.2.2.
- MPI Operator in Kubeflow. Cited by: §5.2.1.
- (2020) MySQL: The world’s most popular open source database. Cited by: §4.4.
- (2020) OpenAPI: an API description format for REST APIs. Cited by: item 1.
- PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §3.1.4.
- (2020) PostgreSQL: The World’s Most Advanced Open Source Relational Database. Cited by: §4.4.
- (2020) Resource scheduling and cluster management for AI. Cited by: Table 1.
- Horovod: fast and easy distributed deep learning in TensorFlow. CoRR abs/1802.05799. Cited by: item 2.
- Practical Bayesian optimization of machine learning algorithms. Red Hook, NY, USA, pp. 2951–2959. Cited by: §4.4.
- ModelDB: a system for machine learning model management. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pp. 1–3. Cited by: §4.4.
- Accelerating the machine learning lifecycle with MLflow. IEEE Data Eng. Bull. 41, pp. 39–45. Cited by: §4.4.
- Katib: a distributed general AutoML platform on Kubernetes. Santa Clara, CA, pp. 55–57. Cited by: §1, §4.8.
Appendix A Reproducibility
In this section, we provide details needed to reproduce the experiments presented in the evaluation section of this paper.
All experiments were conducted on Google Kubernetes Engine (GKE) version 1.14. The local cluster experiment was run using Minikube with the KVM driver on a Linux server.
The Katib installation guide is provided at https://github.com/kubeflow/katib#installation. Once Katib is installed, the configuration files can be submitted to the cluster using the Kubernetes command-line tool kubectl.
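As a minimal sketch of the submission step, the snippet below builds the `kubectl apply -f` command for one configuration file. The file name `experiment.yaml` and the `kubeflow` namespace are placeholders chosen for illustration, not values taken from the evaluation setup, and the helper `kubectl_apply_command` is a hypothetical name.

```python
import shlex
from typing import List, Optional


def kubectl_apply_command(manifest_path: str,
                          namespace: Optional[str] = None) -> List[str]:
    """Build the argument list for `kubectl apply -f <manifest>`.

    The namespace is optional; when omitted, kubectl falls back to the
    namespace in the manifest or in the current context.
    """
    cmd = ["kubectl", "apply", "-f", manifest_path]
    if namespace:
        cmd += ["--namespace", namespace]
    return cmd


# Placeholder file name and namespace (assumptions, see the text above).
cmd = kubectl_apply_command("experiment.yaml", namespace="kubeflow")
print(shlex.join(cmd))

# To actually submit the file to a cluster, one could run:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.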
The configuration and test files used in the evaluation section are available in the GitHub repository at https://github.com/katib-examples/evaluation. The files for each evaluation section are placed in the corresponding folder of the repository:
Failure Tolerance: https://github.com/katib-examples/evaluation/tree/master/fault-tolerance