
Open Source Vizier: Distributed Infrastructure and API for Reliable and Flexible Blackbox Optimization

by   Xingyou Song, et al.

Vizier is the de-facto blackbox and hyperparameter optimization service across Google, having optimized some of Google's largest products and research efforts. To operate at the scale of tuning thousands of users' critical systems, Google Vizier solved key design challenges in providing multiple different features, while remaining fully fault-tolerant. In this paper, we introduce Open Source (OSS) Vizier, a standalone Python-based interface for blackbox optimization and research, based on the Google-internal Vizier infrastructure and framework. OSS Vizier provides an API capable of defining and solving a wide variety of optimization problems, including multi-metric, early stopping, transfer learning, and conditional search. Furthermore, it is designed to be a distributed system that assures reliability, and allows multiple parallel evaluations of the user's objective function. The flexible RPC-based infrastructure allows users to access OSS Vizier from binaries written in any language. OSS Vizier also provides a back-end ("Pythia") API that gives algorithm authors a way to interface new algorithms with the core OSS Vizier system. OSS Vizier is available at



1 Introduction

Blackbox optimization is the task of optimizing an objective function where the output is the only available information about the objective. Due to its generality, blackbox optimization has been applied to an extremely broad range of applications, including but not limited to hyperparameter optimization (he2021automl), drug discovery (bayesopt_chemisty), reinforcement learning (autorl_survey), and industrial engineering (materials_design).

Figure 1: Vizier: An advisor.

Google Vizier (vizier_v1) is the first hyperparameter tuning service designed to scale, and has thousands of monthly users on both the research and production sides at Google (a list of research works that have used Google Vizier can be found in Appendix C). Since its inception, Google Vizier has run millions of blackbox optimization tasks and saved a significant amount of computing and human resources for Google and its customers.

This paper describes Open Source (OSS) Vizier, a standalone Python implementation of Google Vizier’s APIs. It consists of a user API, which allows users to configure and optimize their objective function, and a developer API, which defines abstractions and utilities for implementing new optimization algorithms. Both APIs consist of Remote Procedure Call (RPC) protocols (Section 3) to allow the setup of a scalable, fault-tolerant and customizable blackbox optimization system, and Python libraries (Sections 4.3 and 6) to abstract away the corresponding RPC protocols.

Compared to (vizier_v1), OSS Vizier features an evolved backend design for algorithm implementations, as well as new functionalities such as conditional search and multi-objective optimization. OSS Vizier’s RPC API is based on Vertex Vizier, making OSS Vizier compatible with any framework which integrates with Vertex Vizier, such as XManager.

Due to the existence of 3 different versions (Google, Vertex/Cloud, OSS) of Vizier, to prevent confusion, we explicitly refer to the version (e.g. "Google Vizier") whenever Vizier is mentioned. We summarize the distinct functionalities of each version of Vizier below:

  • Google Vizier: C++ based service hosted on Google’s internal servers and integrated deeply with Google’s internal infrastructure. The service is available only for Google software engineers and researchers to tune their own objectives with a default algorithm.

  • Vertex/Cloud Vizier: C++ based service hosted on Google Cloud servers, available for external customers and businesses to tune their own objectives with a default algorithm.

  • OSS Vizier: Fully standalone and customizable code that allows researchers to host a Python-based service on their own servers, for any downstream users to tune their own objectives.

2 Problem and Our Contributions

Blackbox optimization has a broad range of applications. Inside Google, these applications include: optimizing existing systems written in a wide variety of programming languages; tuning the hyperparameters of a large ML model using distributed parallel processes (survey_on_distributed_ml); optimizing a non-computational objective, which can be e.g. physical, chemical, biological, mechanical, or even human-evaluated (vizier_cookie). Generally, the objectives we are interested in optimizing possess a moderate number (e.g. several hundred) of parameters for the input $x$, may produce noisy evaluation measurements, and may not be smooth or continuous.

Furthermore, the blackbox optimization workflow varies greatly depending on the application. The evaluation latency can be anywhere between seconds and weeks, while the budget for the number of evaluations, or Trials, varies from tens to millions. Evaluations can be done asynchronously (e.g. ML model tuning) or in synchronous batches (e.g. wet lab experiments). Furthermore, evaluations may fail due to transient errors and should be retried, or may fail due to persistent errors (e.g. the objective cannot be measured) and should not be retried. One may also wish to stop the evaluation process early after observing intermediate measurements (e.g. from an ML model’s learning curve) in order to save resources.

To handle all of these scenarios, OSS Vizier is developed as a service. The service architecture does not make assumptions about how Trials are evaluated, but rather simply specifies a stable API for obtaining suggestions to evaluate and reporting results as Trials. Users have the freedom to determine when to request trials, how to evaluate trials, and when to report back results.

Another advantage of the service architecture is that it can collect data and metrics over time. Google Vizier runs as a central service, and we track usage patterns to inform our research agenda, and our extensive database of runs serves as a valuable dataset for research into meta-learning and multitask transfer learning. This allows users to transparently benefit from the resulting improvements we make to the system.

2.1 Comparisons to Related Work

Table 1 contains a non-comprehensive list of open-source packages for blackbox optimization, focusing on hyperparameter tuning. Overall, the OSS Vizier API is compatible with many of the features present in other hyperparameter tuning open-source packages. We did not include commercial services for hyperparameter tuning such as Microsoft Azure, Amazon SageMaker, SigOpt and Vertex Vizier. For a comprehensive review of hyperparameter tuning tools, see (he2021automl). There are many other blackbox optimization tools not mentioned in Table 1, including iterated racing (irace; irace_thesis_again), as well as heuristics and automation of algorithm designs (automatic_component_design; local_search_holgar); see more comparisons and usages in (lindauer2022smac3; feurer2015).

We divide the open-source packages into three categories:

  • Services host algorithms in a server. OSS Vizier, Advisor (advisor) and OpenBox (openbox), which are modeled after Google Vizier (vizier_v1), belong to this category. Services are more flexible and scalable than frameworks, at the cost of engineering complexities.

  • Frameworks execute the entire optimization, including both the suggestion algorithm and user evaluation code. Ax (ax) and HpBandSter (hpbandster) belong to this category. While frameworks are convenient, they often require knowledge of the system being optimized, such as how to manage resources and perform proper initialization and shutdown.

  • Libraries implement blackbox optimization algorithms. HyperOpt (bergstra2013making), Emukit (emukit2019), and BoTorch (botorch) belong to this category. Libraries offer the most freedom but lack scalability features such as error recovery and distributed/asynchronous trial evaluations. Instead, libraries are often used as algorithm implementations for frameworks or services (e.g. BoTorch in Ax).

| Name | Type | Client Languages | Parallel Trials | Features* |
|---|---|---|---|---|
| OSS Vizier | Service | Any | Yes | Multi-Objective, Early Stopping, Transfer Learning, Conditional Search |
| SMAC | Framework | Python | Yes | Multi-Objective, Multi-fidelity, Early Stopping, Conditional Search, Parameter Constraints |
| Advisor | Service | Any | Yes | Early Stopping |
| OpenBox | Service | Any | Yes | Multi-Objective, Early Stopping, Transfer Learning, Parameter Constraints |
| HpBandSter | Framework | Python | Yes | Early Stopping, Conditional Search, Parameter Constraints |
| Ax + BoTorch | Framework | Python | Yes | Multi-Objective, Multi-fidelity, Early Stopping, Transfer Learning, Parameter and Outcome Constraints |
| HyperOpt | Library | Python | No | Conditional Search |
| Emukit | Library | Python | No | Multi-Objective, Multi-fidelity, Outcome Constraints |

Table 1: Open Source Optimization Packages. *OSS Vizier supports the API only.

One major architectural difference between OSS Vizier and other services is that OSS Vizier’s algorithms may run in a separate service and communicate via RPCs with the API server, which performs database operations. With a distributed backend setup, OSS Vizier can serve algorithms written in different languages, scale up to thousands of concurrent users, and continuously process user requests without interruption during server maintenance or updates.

Furthermore, there are other minor differences between the services. While OSS Vizier and OpenBox support distinguishing workers via the workers’ logical IDs (Section 5), Advisor does not. In addition, OSS Vizier’s Python clients possess more sophisticated functionalities than Advisor’s, while OpenBox lacks a client implementation and requires users to implement client code using framework-provided worker wrappers. OSS Vizier also emphasizes algorithm development, by providing a developer API called Pythia (Section 6) and utility libraries for state recovery. Other features of OSS Vizier include:

  • OSS Vizier is one of the first open-source AutoML systems simultaneously compatible with a large-scale industry production service, Vertex Vizier, via our PyVizier library (Section 4.3).

  • The backend of OSS Vizier is based on the standard Google Protocol Buffer library, one of the most widely used serialization formats for RPC systems, which allows extensive customizability. In particular, the client (i.e. blackbox function to be tuned) can be written in any language and is not restricted to machine learning models in Python.

  • OSS Vizier is extensively integrated with numerous other Google packages, such as DeepMind XManager for experiment management (Section 7).

3 Infrastructure

We briefly define, at a conceptual level, a Study as all relevant data pertaining to an entire optimization loop, a Suggestion as a suggested input $x$, and a Trial as a container for both $x$ and the objective value $f(x)$. Note that in the code, we use Trial as a container to store both $x$ and $f(x)$, and thus a Trial without $f(x)$ is also considered a suggestion. We define these core primitives more programmatically in Section 4.

3.1 Protocol Buffers

OSS Vizier’s APIs are RPC interfaces that carry protocol buffers, or protobufs/protos, to allow simple and efficient inter-machine communication. The protos are language- and platform-independent objects for serializing structured data, which make building external software layers and wrappers onto the system straightforward. In particular, the user can provide their own:

  • Visualization Tools: Since OSS Vizier securely stores all study data in its database, the data can then be loaded and visualized, with e.g. standard Python tools (Colab, Numpy, Scipy, Matplotlib) and other statistical packages such as R via RProtoBuf (r_protobuf). Front-end languages such as Angular/Javascript may also be used for visualizing studies.

  • Persistent Datastore: The database in OSS Vizier can be changed based on the user’s needs. For instance, a SQL-based datastore with full query functionality may be used to store study data.

  • Clients: Protobufs allow binaries written in Python, C++, and other languages to be tuned and/or used for evaluating the objective function. This allows OSS Vizier to easily tune existing systems.

We explain the interactions between these components in a distributed backend below.

3.2 Distributed Backend

In order to serve multiple users while remaining fault-tolerant, OSS Vizier runs in a distributed fashion, with a server performing the algorithmic proposal work, while users or clients communicate with the server via RPCs using the Client API, built upon gRPC. A packet of RPC communication is formatted in terms of standard Google protobufs.

Figure 2: Pictorial representation of the distributed pipeline. The OSS Vizier server services multiple clients, each with their own types of requests. Such requests can involve running Pythia Policies, saving measurement data, or retrieving previous studies. Note that Pythia may run as a separate service from the API service.

To start an optimization loop, a client will send a CreateStudy RPC request to the server, and the server will create a new Study in its datastore and return the ID to the client. The main tuning workflow in OSS Vizier will then involve the following repeated cycle of events:

  1. The client sends a SuggestTrials RPC request to the server.

  2. The server creates an Operation in its datastore, and starts a thread to launch a Pythia policy (i.e. blackbox optimization algorithm) to compute the next suggested Trials. The server returns an Operation protobuf to the client to denote the computation taking place.

  3. The client will repeatedly poll the server via GetOperation RPCs to check the status of the Operation until the Operation is done.

  4. When the Pythia policy produces its suggestions, the server will store these suggestions into the Operation and mark the Operation done, which will be collected by the client’s GetOperation ping.

  5. The client retrieves the suggestions stored inside the Operation, and returns objective function measurements to the server via calls to the CompleteTrial RPC.
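The five-step cycle above can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: `ToyServer`, its method names, and the dictionary-based Operation records are hypothetical stand-ins for the real service, datastore, and protobufs, and the "policy" is plain random search.

```python
import random
import threading
import time
import uuid


class ToyServer:
    """In-memory stand-in for the API server (hypothetical, not the real service)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.operations = {}  # operation id -> {"done": bool, "trials": list}
        self.completed = {}   # trial id -> reported measurement

    def suggest_trials(self, count):
        """Step 2: record an Operation, start the policy thread, return immediately."""
        op_id = str(uuid.uuid4())
        with self._lock:
            self.operations[op_id] = {"done": False, "trials": []}
        threading.Thread(target=self._run_policy, args=(op_id, count)).start()
        return op_id

    def _run_policy(self, op_id, count):
        """Step 4: store the suggestions in the Operation and mark it done."""
        time.sleep(0.05)  # Pretend the algorithm needs some time.
        trials = [{"id": str(uuid.uuid4()), "x": random.uniform(-1.0, 1.0)}
                  for _ in range(count)]
        with self._lock:
            self.operations[op_id] = {"done": True, "trials": trials}

    def get_operation(self, op_id):
        with self._lock:
            return dict(self.operations[op_id])

    def complete_trial(self, trial_id, measurement):
        with self._lock:
            self.completed[trial_id] = measurement


def tune(server, num_rounds, poll_interval=0.01):
    """Client loop: request, poll until done, evaluate, report."""
    for _ in range(num_rounds):
        op_id = server.suggest_trials(count=2)                   # Step 1.
        while not (op := server.get_operation(op_id))["done"]:   # Step 3.
            time.sleep(poll_interval)
        for trial in op["trials"]:                               # Step 5.
            server.complete_trial(trial["id"], measurement=trial["x"] ** 2)
    return server.completed


server = ToyServer()
results = tune(server, num_rounds=3)
```

Because the Operation record, rather than the open connection, carries the state of the computation, the client only ever makes short calls, which is the property the fault-tolerance discussion below relies on.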

Note that the server may be launched in the same local process as the client, in cases where distributed computing is not needed and function evaluation is cheap (e.g. benchmarking algorithms on synthetic functions). However, if the user wishes to use the distributed setting, the following are core advantages of OSS Vizier’s system:

Server-side Fault Tolerance

The Operations are stored in the database and contain sufficient information to restart the computation after a server crash, reboot, or update.

Automated/Early Stopping

A similar sequence of events takes place when the client sends a CheckTrialEarlyStoppingStateRequest RPC, in which the policy determines if a trial’s evaluation should be stopped, and returns this signal as a boolean via the EarlyStoppingOperation RPC.

Batched/Parallel Evaluations

Note that multiple clients may work on the same study, and the same Trial. This is important for compute-heavy experiments (e.g. neural architecture search) which need to parallelize the workload across multiple machines, with each machine evaluating the objective after being given a suggestion from the server.

Client-side Fault Tolerance

When one of the parallel workers fails and then reboots, the service will assign the worker the same suggestion as before. The worker can choose to load a model from the checkpoint to warm-start the evaluation.

4 Core Primitives

In Figure 3, we provide a pictorial example representation of how OSS Vizier’s primitives are structured; below we provide definitions.

Figure 3: Example of a study that tunes a deep learning task, featuring relevant data types.

4.1 Definitions

A Study is a single optimization run over a feasible space. Each study contains a name, its description, its state (e.g. ACTIVE, INACTIVE, or COMPLETED), a StudySpec, and a list of suggestions and evaluations (Trials).

A StudySpec contains the configuration details for the Study, namely the search space (constructed by ParameterSpecs; see §4.2), the algorithm to be used, automated stopping type (see Appendix B.1), the type of ObservationNoise (see Appendix B.2), and at least one MetricSpec, containing information about the metric to optimize, including the metric name and the goal (i.e. whether to minimize or maximize the objective $f(x)$). Multiple MetricSpecs will be used for cases involving multiobjective optimization, where the goal is to find Pareto frontiers over multiple objectives.
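To make the notion of a Pareto frontier concrete, the following minimal sketch filters a list of metric tuples down to the non-dominated ones. It assumes every metric has a MAXIMIZE goal; the function name `pareto_front` and the example values are illustrative only and are not part of OSS Vizier.

```python
def pareto_front(points):
    """Return the points that no other point dominates (all metrics maximized)."""

    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (accuracy, throughput) pairs from five completed Trials.
measurements = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.1, 0.9), (0.9, 0.05)]
front = pareto_front(measurements)
```

Here `(0.4, 0.4)` is dominated by `(0.5, 0.5)` and `(0.9, 0.05)` by `(0.9, 0.1)`, so three Trials remain on the frontier.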

A Trial is a container for the input $x$, as well as potentially the scalar value $f(x)$ or multiobjective values. Each Trial possesses a State, which indicates what stage of the optimization process the Trial is in, with the two primary states being ACTIVE (meaning that $x$ has been suggested but not yet evaluated) and COMPLETED (meaning that evaluation is finished, and typically that the objectives have been calculated).

Both the StudySpec and the Trials can contain Metadata. Metadata is not interpreted by OSS Vizier, but is rather a convenient way for developers to store algorithm state, for users to store small amounts of arbitrary data, or for user code and algorithms to use as an extra communication medium.

4.2 Search Space

Search spaces can be built by combining the following primitives, or ParameterSpecs:

  • Double: Specifies a continuous range of possible values in the closed interval $[a, b]$ for some real values $a \le b$.

  • Integer: Specifies an integer range of possible values in $[a, b] \cap \mathbb{Z}$ for some integers $a \le b$.

  • Discrete: Specifies a finite, ordered set of values from $\mathbb{R}$.

  • Categorical: Specifies an unordered list of strings.

Furthermore, each of the numerical parameters {Double, Integer, Discrete} has a scaling type, which toggles whether the underlying algorithm is performing optimization in a transformed space. The scale type allows the user to conveniently inform the optimizer about the shape of the function, and can sometimes drastically accelerate the optimization. For instance, a user may use logarithmic scaling, which expresses the intent that, for a parameter whose range spans several orders of magnitude, each order-of-magnitude subrange (e.g. $[0.001, 0.01]$ and $[1, 10]$ within $[0.001, 10]$) should receive roughly the same amount of attention, which would otherwise not be the case when using uniform scaling.
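The effect of the scale type can be seen with a small sampling experiment. The sketch below assumes a random-search style optimizer that draws uniformly in the transformed space; `sample_double`, `fraction_in`, and the scale name "LINEAR" are hypothetical names chosen for illustration ("LOG" matches the string used in Code Block 1), and the range values are an example rather than anything prescribed by OSS Vizier.

```python
import math
import random


def sample_double(low, high, scale, rng):
    """Draw one value from [low, high], uniformly in the (possibly) transformed space."""
    u = rng.random()  # Uniform in [0, 1).
    if scale == "LINEAR":
        return low + u * (high - low)
    if scale == "LOG":
        if low <= 0:
            raise ValueError("LOG scaling needs a strictly positive range.")
        return math.exp(math.log(low) + u * (math.log(high) - math.log(low)))
    raise ValueError(f"Unknown scale type: {scale}")


def fraction_in(samples, low, high):
    """Fraction of samples that fall inside [low, high]."""
    return sum(low <= s <= high for s in samples) / len(samples)


rng = random.Random(0)
linear_samples = [sample_double(1e-3, 10.0, "LINEAR", rng) for _ in range(10_000)]
log_samples = [sample_double(1e-3, 10.0, "LOG", rng) for _ in range(10_000)]

linear_low = fraction_in(linear_samples, 1e-3, 1e-2)  # Tiny share of the samples.
log_low = fraction_in(log_samples, 1e-3, 1e-2)        # Roughly one quarter.
log_high = fraction_in(log_samples, 1.0, 10.0)        # Roughly one quarter.
```

The range $[0.001, 10]$ covers four decades, so with logarithmic scaling each decade receives about a quarter of the samples, whereas with linear scaling the lowest decade is almost never visited.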

Each parameter can also contain a list of child parameters, each of which will be active only if the parent’s value matches the correct value(s). This allows the notion of conditional search, which is helpful when dealing with search spaces involving incompatible parameters or parameters which only exist in specific scenarios. For example, this can be useful when competitively tuning several machine learning algorithms along with each algorithm’s parameters. E.g. one could tune the following for the model parameter: {"linear", "DNN", "random_forest"}, each with its own set of parameters. Conditional parameters help keep the user’s code organized, and also describe certain invariances to OSS Vizier, namely that when model="DNN", the objective $f(x)$ will be independent of the "random_forest" and "linear" model parameters.

These parameter primitives can be used flexibly to build highly complex search spaces, of which we provide examples in Appendix A.
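As a rough illustration of conditional search, the sketch below encodes the model example as a plain nested dictionary and samples only the children that are active for the chosen parent value. The dictionary layout, the function `sample_conditional`, and the child parameter names and ranges are all hypothetical; the actual search space classes are described in Appendix A and Section 4.3.

```python
import random

# Hypothetical encoding: a categorical parent whose value selects the active children.
SEARCH_SPACE = {
    "model": {
        "values": ["linear", "DNN", "random_forest"],
        "children": {
            "linear": {"l2_penalty": (0.0, 1.0)},
            "DNN": {"learning_rate": (1e-4, 1e-2), "dropout": (0.0, 0.5)},
            "random_forest": {"max_features": (0.1, 1.0)},
        },
    }
}


def sample_conditional(space, rng):
    """Sample each parent, then only the child parameters matching its value."""
    params = {}
    for name, spec in space.items():
        value = rng.choice(spec["values"])
        params[name] = value
        for child, (low, high) in spec["children"].get(value, {}).items():
            params[child] = rng.uniform(low, high)
    return params


rng = random.Random(0)
suggestions = [sample_conditional(SEARCH_SPACE, rng) for _ in range(100)]
```

A suggestion with `model == "DNN"` therefore never carries `max_features` or `l2_penalty`, which mirrors the invariance described above.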

4.3 PyVizier

All the above objects are implemented as protos to allow RPC exchanges through the service, as mentioned in Section 3. However, for ease-of-access, each object is also represented by an equivalent PyVizier class to provide a more Pythonic interface, validation, and convenient construction (further details and examples are found in Appendix D.3). Translations to and from protos are provided by the to_proto() and from_proto() methods in PyVizier classes. PyVizier provides a common interface across all Vizier variants (i.e. Google Vizier, Vertex Vizier, and OSS Vizier). For compatibility reasons, protos have slightly different names than their PyVizier equivalents; e.g. StudySpec protos are equivalent to StudyConfig PyVizier objects. We describe conversions further in Appendix D.3. The two intended primary use cases for PyVizier are:

  • Tuning user binaries. For such cases, the core PyVizier primitive is the VizierClient class that allows communication with the service.

  • Developing algorithms for researchers. In this case, the core PyVizier primitives are the Pythia Policy and PolicySupporter classes.

Both cases typically use the StudyConfig and SearchSpace classes to define the optimization, and the Trial and Measurement classes to support the evaluation. We describe the two cases in detail below.

5 User API: Parallel Distributed Tuning with OSS Vizier Client

import sys

from vizier import StudyConfig, VizierClient

config = StudyConfig()  # Search space, metrics, and algorithm.
root = config.search_space.select_root()  # "Root" params must exist in every trial.
root.add_float('learning_rate', min=1e-4, max=1e-2, scale='LOG')
root.add_int('num_layers', min=1, max=5)
config.metrics.add('accuracy', goal='MAXIMIZE', min=0.0, max=1.0)
config.algorithm = 'RANDOM_SEARCH'

# The first replica creates the study; later replicas load it by name.
client = VizierClient.load_or_create_study(
    'cifar10', config, client_id=sys.argv[1])  # Each client should use a unique id.
while suggestions := client.get_suggestions(count=1):
  # Evaluate the suggestion(s) and report the results to Vizier.
  for trial in suggestions:
    metrics = _evaluate_trial(trial.parameters)  # User-defined evaluation function.
    # The source listing is truncated after `metrics,`; passing the trial id
    # as the second argument is an assumption.
    client.complete_trial(metrics, trial_id=trial.id)
Code Block 1: Pseudocode for tuning a blackbox function using the included Python client. To save space, we did not use longer official argument names from the actual code.

The OSS Vizier service must be set up first (see pseudocode in Appendix D.2), preferably on a multithreaded machine capable of processing multiple RPCs concurrently. Then, replicas of Code Block 1 can be launched in parallel, each with a unique command-line argument to be used as the client id (the client_id argument of load_or_create_study). The first replica to be launched creates a new Study from the StudyConfig, which defines the search space, relevant metrics to be evaluated, and the algorithm for providing suggestions. The other replicas then load the same study to be worked on. There are a few important aspects worth noting in this setting:

  • The service does not make any assumptions about how Trials are evaluated. Users may complete Trials at any latency, and may do so with a custom client written in any language. Algorithms may, however, set a time limit and reassign Trials to other clients to prevent stalling (e.g. due to a slow client).

  • Each Trial is assigned a client_id and only suggested to clients created with the same client_id. This design makes it easy for users to recover from failures during Trial evaluations; if one of the tuning binaries is accidentally shut down, users can simply restart the binary with the same client id. The tuning binary creates a new client attached to the same study and OSS Vizier suggests the same Trial.

  • Multiple binaries can share the same client_id and collaborate on evaluating the same Trial. This feature is useful in tuning a large distributed model with multiple workers and evaluators.

  • The client may optionally turn on automated stopping for objectives that can provide intermediate measurements (e.g. learning curves in deep learning applications). Further details can be found in Appendix B.1, and an example code snippet is given in the appendix.
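The client_id semantics described in the list above can be sketched with a toy in-memory study. This is an illustrative model only: `ToyStudy` and its methods are hypothetical, and the real service additionally handles persistence, time limits, and reassignment.

```python
import itertools


class ToyStudy:
    """Hypothetical model of how Trials are tied to a client_id."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.trials = {}  # trial id -> {"client_id": str, "state": str}

    def get_suggestions(self, client_id):
        """Return this client's ACTIVE Trials, creating a new one only if none exist."""
        active = [tid for tid, trial in self.trials.items()
                  if trial["client_id"] == client_id and trial["state"] == "ACTIVE"]
        if active:
            return active
        tid = next(self._ids)
        self.trials[tid] = {"client_id": client_id, "state": "ACTIVE"}
        return [tid]

    def complete_trial(self, trial_id):
        self.trials[trial_id]["state"] = "COMPLETED"


study = ToyStudy()
first = study.get_suggestions("worker-0")
after_restart = study.get_suggestions("worker-0")  # Same Trial after a crash/restart.
other_worker = study.get_suggestions("worker-1")   # A different client gets its own Trial.
study.complete_trial(first[0])
next_trial = study.get_suggestions("worker-0")     # A fresh Trial once the old one is done.
```

Under this model a restarted binary that reuses its client id simply picks up where it left off, and several binaries sharing one id all see the same Trial.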

6 Developer API: Implementing a New Algorithm Using Pythia Policy

6.1 Overview

As we have explained in Section 3, OSS Vizier runs its algorithms in a binary called the Pythia service (which can be the same binary as the API service). When the client asks for suggestions or early stopping decisions, the API service creates operations and sends requests to the Pythia service. This section describes the default Python implementation of the Pythia service included in the open-source package.

The Pythia service creates a Policy object that executes the algorithm and returns the response. Policy is designed to be a minimal and general-purpose interface built on top of PyVizier, to allow researchers to quickly incorporate their own blackbox optimization algorithms. Policy is usually given a PolicySupporter, which is a mini-client specialized in reading and filtering Trials. As shown in Code Block 2, a typical Policy loads Trials via PolicySupporter and processes the request at hand.

from vizier.pythia import Policy, PolicySupporter, SuggestRequest, SuggestDecisions

class MyPolicy(Policy):
  def __init__(self, policy_supporter: PolicySupporter):
    self.policy_supporter = policy_supporter  # Used to obtain old trials.

  def suggest(self, request: SuggestRequest) -> SuggestDecisions:
    """Suggests trials to be evaluated."""
    Xs, y = _trials_to_np_arrays(self.policy_supporter.GetTrials(
        status='COMPLETED'))  # Use COMPLETED trials only.
    model = _train_gp(Xs, y)
    return _optimize_ei(model, request.study_config.search_space)
Code Block 2: Pseudocode for implementing a Gaussian Process Bandit.

6.2 PolicySupporter

The PolicySupporter allows the Policy to actively decide what Trials from what Studies are needed to generate the next batch of Suggestions. Policies can meta-learn from potentially any Study in the database by calling the GetStudyConfig and GetTrials methods. Beyond that, the Policy can request only the Trials it needs; e.g. for algorithms that only need to look at newly evaluated Trials, this can reduce the database work by orders of magnitude relative to loading all the Trials.

6.3 State Saving via Metadata

The primary application of Google Vizier (vizier_v1) was optimizing a blackbox function that is expensive to evaluate. Over time, as Google Vizier became widely adopted, there was an increasing number of applications where users wished to evaluate cheap functions over a very large number of Trials. Popular methods for these applications include evolutionary methods and local search methods, such as NSGA-II (nsga2), Firefly (firefly2010), and Harmony Search (harmony_search) to name a few (For a survey on meta-heuristics, see metaheuristic_survey).

A typical algorithm in this category iteratively updates its population pool and generates mutations to be suggested, both of which take constant time with respect to the number of previous trials, as opposed to e.g. cubic time when using Gaussian Processes in a Bayesian Optimization loop. Since the lifespan of a Policy object is equivalent to that of one suggestion or early stopping operation, the algorithm would need to fetch all Trials in the Study and reconstruct its state, taking time that grows at least linearly with the number of Trials. This leads to slow and difficult-to-maintain implementations.

PolicySupporter provides an easy-to-use API for developers to send algorithm states into the database as Metadata. Metadata is a key-value mapping with namespaces that help prevent key collisions. There are two tables for metadata in the database: one attached to the StudySpec and another to each Trial. A Policy can restore its last saved state from metadata, reflect the recently added Trials, and process the operation at hand. We provide example code for this functionality in Appendix D.4.
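The restore, update, and save pattern can be sketched as follows. This is not the Appendix D.4 code: the metadata table is modelled as a plain dictionary keyed by a hypothetical `(namespace, key)` pair, the state is serialized with JSON, and `suggest` is a toy evolutionary policy for a one-dimensional minimization problem.

```python
import json
import random

NAMESPACE = "toy_evolution"  # Hypothetical namespace to avoid key collisions.
EMPTY_STATE = '{"population": [], "seen": 0}'


def suggest(study_metadata, new_completed, rng):
    """One short-lived policy invocation: restore state, update it, save it back."""
    state = json.loads(study_metadata.get((NAMESPACE, "state"), EMPTY_STATE))

    # Reflect only the newly completed Trials instead of re-reading all of them.
    merged = state["population"] + new_completed
    state["population"] = sorted(merged, key=lambda t: t["y"])[:4]  # Keep 4 best.
    state["seen"] += len(new_completed)

    if state["population"]:
        parent = rng.choice(state["population"])
        x = parent["x"] + rng.gauss(0.0, 0.1)  # Mutate a surviving member.
    else:
        x = rng.uniform(-1.0, 1.0)             # Bootstrap with a random point.

    study_metadata[(NAMESPACE, "state")] = json.dumps(state)
    return x


rng = random.Random(0)
metadata = {}  # Stands in for the study-level metadata table.
new_trials = []
for _ in range(20):
    x = suggest(metadata, new_trials, rng)
    new_trials = [{"x": x, "y": x * x}]  # Evaluate and report the single new Trial.

final_state = json.loads(metadata[(NAMESPACE, "state")])
```

Each call touches only the saved state and the Trials completed since the previous call, so the per-suggestion cost stays constant as the Study grows.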

7 Integrations

OSS Vizier is also compatible with multiple other interfaces developed at Google. These include:

  • Vertex Vizier, whose Protocol Buffer definitions are exactly the same as OSS Vizier’s. This consistency also allows a wide variety of other packages (discussed below) pre-integrated with Vertex Vizier to be used with minimal changes.

  • DeepMind XManager experiments can currently be tuned by Vertex Vizier through VizierWorker. This worker can also be directly connected to an OSS Vizier server to allow custom policies to manage experiments.

  • OSS Vizier will also be the core backend for PyGlove (pyglove), which is a symbolic programming language for AutoML, in particular facilitating combinatorial and evolutionary optimization which are common in neural architecture search applications.

8 Conclusion, Limitations and Broader Impact Statement


We discussed the motivations and benefits behind providing OSS Vizier as a service in comparison to other blackbox optimization libraries, and described how our gRPC-based distributed back-end infrastructure may be deployed as a fault-tolerant yet flexible system that is capable of supporting multiple clients and diverse use cases. We further outlined our client-server API for tuning, our algorithm development Pythia API, and integrations with other Google libraries.


Due to proprietary and legal concerns, we are unable to open-source the default algorithms used in Google Vizier and Cloud Vizier. Furthermore, this paper intentionally does not discuss algorithms or benchmarks, as the emphasis is on the systems aspect of AutoML. Algorithms may easily be added as policies to OSS Vizier’s collection over time from contributors.

OSS Vizier also may not be suitable for all problems within the very broad scope of blackbox optimization. For instance, if evaluating $f(x)$ is very cheap and fast (e.g. milliseconds), then the OSS Vizier service itself may dominate the overall cost and speed. Furthermore, for problems requiring very large numbers of parameters (e.g. 100K+) and evaluations (e.g. 1M+), such as training a large neural network with gradientless methods (ars; es_million_params), OSS Vizier can also be inappropriate, as such cases can overload the datastore memory with redundant trials which do not need to be kept track of.

Broader Impact

While there is a rich collection of sophisticated and effective AutoML algorithms published every year, broad adoption in practical use cases still remains low, as only 7% of the ICLR 2020 and NeurIPS 2019 papers used a tuning method other than random or grid search (people_use_random_search). In comparison, Google Vizier is widely used among researchers at Google, including for conference submissions. We hope that the release of OSS Vizier and its similar benefits may significantly improve the reach of AutoML techniques to users.

In terms of potential negative impacts, optimization as a service encourages central storage of data with the attendant risks and benefits. For example, currently through the Client API, a user may request all studies associated with another user, which may cause security and privacy concerns. This may be fixed by limiting user access to only their own studies in the service logic. Furthermore, the host of the service currently has full access to all client data, which is another potential privacy concern. However, from our experience with Google Vizier, the most impactful applications for clients typically occur when parameters and measurements correspond to aggregate data (e.g. the learning rate of an ML algorithm, or the number of threads in a server) rather than data that describes individuals. Furthermore, data received by OSS Vizier can be obscured to a degree to reduce unwanted exposure to the host. Most notably, names (e.g. study name, parameter and metric names) can be encrypted, and (within limits) differential privacy (differential_privacy_survey) approaches, especially for databases (sql_differential_privacy), can be applied to the parameter values and measurements.


The Vizier team consists of: Xingyou Song, Sagi Perel, Chansoo Lee, Greg Kochanski, Richard Zhang, Tzu-Kuo Huang, Setareh Ariafar, Lior Belenki, Daniel Golovin, and Adrian Reyes.

We further thank Emily Fertig, Srinivas Vasudevan, Jacob Burnim, Brian Patton, Ben Lee, and Christopher Suter for TensorFlow Probability integrations, Daiyi Peng for PyGlove integrations, Yingjie Miao for AutoRL integrations, Tom Hennigan, Pavel Sountsov, Richard Belleville, Bu Su Kim, Hao Li, and Yutian Chen for open source and infrastructure help, and George Dahl, Aleksandra Faust, and Zoubin Ghahramani for discussions.

9 Reproducibility Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? We discussed the motivations for why OSS Vizier is designed as a service, and outlined in detail its distributed infrastructure. We further demonstrated (with pseudocode) the two main usages of OSS Vizier, which are to tune users’ objects via client-side API, and develop algorithms via Pythia.

    2. Did you describe the limitations of your work? See Section 8.

    3. Did you discuss any potential negative societal impacts of your work? See Section 8.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them? Our paper follows all of the ethics review guidelines.

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results? This is a systems paper.

    2. Did you include complete proofs of all theoretical results? This is a systems paper.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results, including all requirements (e.g., requirements.txt with explicit version), an instructive README with installation, and execution commands (either in the supplemental material or as a url)? We have provided a README, installation instructions with a requirements.txt, numerous integration and unit tests along with PyTypes which demonstrate each code snippet’s function.

    2. Did you include the raw results of running the given instructions on the given code and data? Our unit-tests demonstrate the expected results of running all components of our code.

    3. Did you include scripts and commands that can be used to generate the figures and tables in your paper based on the raw results of the code, data, and instructions given? This is a systems paper.

    4. Did you ensure sufficient code quality such that your code can be safely executed and the code is properly documented? Our code follows all standard industry-wide coding practices at Google, which include extensive unit tests with continuous integration, PyType and PyLint enforcement for code cleanliness, and peer review during code submission.

    5. Did you specify all the training details (e.g., data splits, pre-processing, search spaces, fixed hyperparameter settings, and how they were chosen)? This is a systems paper.

    6. Did you ensure that you compared different methods (including your own) exactly on the same benchmarks, including the same datasets, search space, code for training and hyperparameters for that code? This is a systems paper.

    7. Did you run ablation studies to assess the impact of different components of your approach? This is a systems paper.

    8. Did you use the same evaluation protocol for the methods being compared? This is a systems paper.

    9. Did you compare performance over time? This is a systems paper.

    10. Did you perform multiple runs of your experiments and report random seeds? This is a systems paper.

    11. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? This is a systems paper.

    12. Did you use tabular or surrogate benchmarks for in-depth evaluations? This is a systems paper.

    13. Did you include the total amount of compute and the type of resources used (e.g., type of gpus, internal cluster, or cloud provider)? This is a systems paper.

    14. Did you report how you tuned hyperparameters, and what time and resources this required (if they were not automatically tuned by your AutoML method, e.g. in a nas approach; and also hyperparameters of your own method)? This is a systems paper.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? Our work wraps around other Google libraries such as the Cloud Vizier SDK and DeepMind XManager, for which we provided URL links.

    2. Did you mention the license of the assets? Both the Cloud Vizier SDK and DeepMind XManager use the Apache 2.0 License.

    3. Did you include any new assets either in the supplemental material or as a url? No new assets were used.

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating? No human data was used.

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? This is a systems paper without data use.

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable? Not applicable.

    2. Did you describe any potential participant risks, with links to Institutional Review Board (irb) approvals, if applicable? Not applicable.

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? Not applicable.



Appendix A Search Space Flexibility

In this section, we describe the ways in which more complex search spaces may be created in OSS Vizier, showcasing its flexibility and applicability to a wide variety of problems.

a.1 Combinatorial Optimization

One of the most common uses for blackbox optimization in research involves combinatorial optimization. In this setting, the search space is usually defined via common manipulations over a finite set, such as permutations or subset selections. Below, we provide example methods to deal with such cases, ordered from most practical to least practical. We note that many of these methods are better suited to evolutionary algorithms, which only need to apply mutations and cross-overs between trials, than to regression-based methods (e.g. Bayesian Optimization).

a.1.1 Reparameterization

Reparameterization of the search space via conceptual means should be considered first, as it is one of the most practical and easiest ways to reduce the complexity of representing the search space in OSS Vizier. Mathematically speaking, the high level idea is to construct a more practical search space which can easily be represented in OSS Vizier, and then create a surjective mapping from it onto the original search space.

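For basic combinatorial objects such as permutations, if we consider the standard permutation space over a fixed number of elements, then we may define the practical search space as the set of Lehmer codes of the same length and let the mapping be the decoding operator for the Lehmer code. If the search space instead involves subset selection, then we may define an analogous practical search space and apply a similar mapping.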
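The Lehmer-code reparameterization can be sketched as follows. This is a minimal illustration rather than code from OSS Vizier: the function name `decode_lehmer` is hypothetical, and we assume each coordinate `code[i]` would be exposed to the service as an integer parameter with range `[0, n - i)`.

```python
from typing import List, Sequence


def decode_lehmer(code: Sequence[int]) -> List[int]:
    """Maps a Lehmer code to a permutation of 0..n-1.

    Coordinate i must satisfy 0 <= code[i] < n - i, so every point of the
    box-shaped code space decodes to exactly one valid permutation.
    """
    n = len(code)
    remaining = list(range(n))  # Elements not yet placed in the permutation.
    permutation = []
    for i, c in enumerate(code):
        if not 0 <= c < n - i:
            raise ValueError(f"code[{i}]={c} is out of range [0, {n - i})")
        # Pick the c-th smallest element that is still unused.
        permutation.append(remaining.pop(c))
    return permutation


# Hypothetical suggestion received for a 4-element permutation problem.
example_permutation = decode_lehmer([2, 0, 1, 0])
```

Because the decoding is a bijection, no suggested code is ever wasted on an invalid permutation, which is the practical advantage over sampling raw index vectors.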

Another common case involves searching over the space of graphs. In such scenarios, there are a multitude of methods for parameterizing the graph, including adjacency matrices. An illustrative example can be seen across neural architecture search (NAS) benchmarks. Even though such search spaces correspond to graph objects, ironically, many NAS benchmarks, termed “NASBENCH”s, actually do not use nested or conditional search spaces. For instance, NASBENCH-101 (nasbench_101) uses only a flat adjacency matrix and flat operation list. NASBENCH-201 (nasbench_201) is even simpler, as it takes the graph dual of the node-op representation, allowing the search space to be a full feasible set represented by only 5 categorical parameters.
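As an illustration of a flat graph parameterization, the sketch below decodes a flat list of binary parameters into the upper-triangular adjacency matrix of a directed acyclic graph. This is an assumed encoding chosen for clarity, not the exact format of any NASBENCH benchmark, and `bits_to_dag` is a hypothetical helper name.

```python
from typing import List, Sequence


def bits_to_dag(bits: Sequence[int], num_nodes: int) -> List[List[int]]:
    """Builds an upper-triangular adjacency matrix from flat binary parameters.

    Edges only go from a lower-indexed node to a higher-indexed node, so the
    resulting graph is acyclic by construction.
    """
    expected = num_nodes * (num_nodes - 1) // 2
    if len(bits) != expected:
        raise ValueError(f"expected {expected} bits, got {len(bits)}")
    adjacency = [[0] * num_nodes for _ in range(num_nodes)]
    k = 0
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            adjacency[i][j] = int(bits[k])
            k += 1
    return adjacency


# Three binary parameters describe every DAG on three ordered nodes.
example_dag = bits_to_dag([1, 0, 1], num_nodes=3)
```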

a.1.2 Infeasibility

In some scenarios, we may not be able to find a mapping as in the reparameterization case above, but may instead lift the search space into a larger search space that contains it, and perform search on the larger space instead. For trials that land in the larger space but outside the original search space, OSS Vizier supports reporting these trials as infeasible. As a basic example, if the original search space is a disk, then the larger space may be a rectangle enclosing it. Another example can be seen with the same NASBENCH-101 (nasbench_101) benchmark described earlier, where some pairs of adjacency matrices and operation lists do not correspond to an actual valid graph, and are thus infeasible.

The main limitation is that if the original search space is much smaller than the larger one, the vast bulk of trials may be infeasible, and the search will then converge slowly. Furthermore, for the disk case, this can lead to problems during optimization, as it creates a sharp border and a flat infeasible region outside the disk. This leads to a lack of information about which infeasible points are better or worse than others, and can also make it difficult to find a small feasible region. Modelling techniques such as Gaussian Processes also inherently assume the objective function is continuous everywhere, which is incompatible with the discontinuity at the border.
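The disk example and its main limitation can be sketched numerically. The code below is a standalone illustration under stated assumptions: the objective is hypothetical, returning `None` stands in for reporting a trial as infeasible, and the disk is generalized to a unit ball inside the cube with side length 2 to show how quickly the infeasible fraction grows with dimension.

```python
import random
from typing import Optional


def objective_on_disk(x: float, y: float) -> Optional[float]:
    """Returns the objective on the unit disk, or None if (x, y) is infeasible."""
    if x * x + y * y > 1.0:
        return None  # Would be reported to the service as an infeasible trial.
    return -((x - 0.5) ** 2 + y ** 2)  # Hypothetical objective to maximize.


def infeasible_fraction(dim: int, samples: int = 20000, seed: int = 0) -> float:
    """Estimates the fraction of the cube [-1, 1]^dim lying outside the unit ball."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(samples):
        point = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        if sum(v * v for v in point) > 1.0:
            outside += 1
    return outside / samples
```

In two dimensions roughly a fifth of uniformly sampled points are infeasible, whereas in ten dimensions nearly all of them are, which is why lifting into a much larger space can slow convergence.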

a.1.3 Serialization

If all else fails, we may avoid the use of the ParameterSpec API and simply serialize the object into a string format, which can then be inserted into a Trial’s metadata field. In cooperation with a custom Pythia policy, this can be very effective.
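A minimal sketch of the serialization approach is shown below. It uses JSON as the string format and a plain dictionary as a stand-in for a Trial's metadata field; both choices, along with the helper names, are assumptions made for illustration and are not taken from the OSS Vizier API.

```python
import json
from typing import Dict, List


def serialize_graph(edges: List[List[int]]) -> str:
    """Encodes an edge list as a canonical JSON string."""
    return json.dumps({"edges": sorted(edges)})


def deserialize_graph(payload: str) -> List[List[int]]:
    """Recovers the edge list from its JSON string form."""
    return json.loads(payload)["edges"]


# A plain dict stands in for the Trial's metadata field (assumption).
trial_metadata: Dict[str, str] = {}
trial_metadata["graph"] = serialize_graph([[1, 2], [0, 1]])

# A custom policy would read the string back and operate on the object.
recovered = deserialize_graph(trial_metadata["graph"])
```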

Appendix B Additional OSS Vizier Settings

b.1 Automated Stopping

Automated/early stopping is commonly used when trials can be stopped early to save resources, and is determined by the trial’s intermediate measurements. Currently there are two modes of automated stopping which the client can specify in their StudyConfig:

  • Decay Curve Automated Stopping, in which a Gaussian Process Regressor is built to predict the final objective value of a Trial based on the already completed Trials and the intermediate measurements of the current Trial. Early stopping is requested for the current Trial if it has a very low probability of exceeding the best value found so far.

  • Median Automated Stopping, in which a pending Trial is stopped if the Trial’s best objective value is strictly below the median “performance” of all completed Trials reported up to the Trial’s last measurement. Currently, “performance” refers to the running average of the objective values reported by the Trial in each measurement.
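The median rule described above can be sketched in a few lines. This is an illustrative reading of the rule and not the OSS Vizier implementation: it assumes a maximization objective, one objective value per measurement, and hypothetical function names.

```python
from statistics import median
from typing import Sequence


def running_average(values: Sequence[float], num_steps: int) -> float:
    """Average of the objective values reported in the first num_steps measurements."""
    window = values[:num_steps]
    return sum(window) / len(window)


def should_stop_median(
    pending: Sequence[float], completed: Sequence[Sequence[float]]
) -> bool:
    """Returns True if the pending trial should be stopped under the median rule."""
    if not pending or not completed:
        return False  # Not enough information to make a decision.
    step = len(pending)  # Index of the pending trial's last measurement.
    performances = [running_average(c, step) for c in completed if c]
    if not performances:
        return False
    # Stop only if the best value so far is strictly below the median performance.
    return max(pending) < median(performances)


completed_curves = [[0.5, 0.6, 0.7], [0.4, 0.5, 0.6], [0.2, 0.3, 0.4]]
stop_weak_trial = should_stop_median([0.1, 0.2], completed_curves)
stop_strong_trial = should_stop_median([0.5, 0.6], completed_curves)
```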

b.2 Observation Noise

We have found it useful to let the user give Vizier a hint about the amount of noise in their evaluations via the StudyConfig. Because the noise/irreproducibility of evaluations is often not well known in advance by users, we give users a broad choice that the noise is either Low or High:

  • Low: This implies that the objective function is (nearly) perfectly reproducible, and an algorithm should never repeat the same Trial parameters.

  • High: This assumes there is enough noise in the evaluations that it is worthwhile for OSS Vizier to sometimes re-evaluate the same (or nearly the same) parameter values.

This hint is passed to the Pythia policy, and the policy is free to also use this hint to e.g. adjust priors on the hyperparameters of a Gaussian Process regressor.
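One way a policy might act on this hint is sketched below. The `NoiseLevel` enum and `filter_candidates` helper are hypothetical names introduced for illustration and are not part of the Pythia API; the sketch only shows the Low/High semantics described above, under the assumption that parameters are plain name-to-value dictionaries.

```python
import enum
from typing import Dict, List, Sequence, Tuple

Params = Dict[str, float]


class NoiseLevel(enum.Enum):
    LOW = "low"
    HIGH = "high"


def _key(params: Params) -> Tuple[Tuple[str, float], ...]:
    """Hashable, order-independent representation of a parameter assignment."""
    return tuple(sorted(params.items()))


def filter_candidates(
    candidates: Sequence[Params], evaluated: Sequence[Params], noise: NoiseLevel
) -> List[Params]:
    """Drops repeated parameter assignments unless the noise hint is High."""
    if noise is NoiseLevel.HIGH:
        return list(candidates)  # Re-evaluating the same point can be worthwhile.
    seen = {_key(p) for p in evaluated}
    fresh = []
    for candidate in candidates:
        k = _key(candidate)
        if k not in seen:
            seen.add(k)
            fresh.append(candidate)
    return fresh


evaluated_params = [{"lr": 0.1}]
candidate_params = [{"lr": 0.1}, {"lr": 0.2}]
low_noise_result = filter_candidates(candidate_params, evaluated_params, NoiseLevel.LOW)
high_noise_result = filter_candidates(candidate_params, evaluated_params, NoiseLevel.HIGH)
```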

Appendix C Google Vizier Users and Citations

Besides Google Vizier’s extensive internal production usage, below is a selected list (rather than the full list of Google Vizier’s citations) of publicly available research works which have used Google Vizier, demonstrating its rich research user-base, which may directly translate to OSS Vizier’s future user-base as well.

Neural Architecture Search

Google Vizier has acted as a core backend for many of the neural architecture search (NAS) efforts at Google, beginning with its use in tuning the hyperparameters of the RNN controller in the original NAS paper (original_nas). Over the course of NAS research, Google Vizier has also been used to reliably handle the training of thousands of models (barett_cvpr_nas; multiscale), as well as comparisons against different NAS optimization algorithms in NASBENCH-101 (nasbench_101). Furthermore, it serves as the primary distributed backend for PyGlove (pyglove), a core evolutionary algorithm API for NAS research across Google.

Hardware and Systems

Google Vizier’s tuning led to crucial gains for hardware benchmarking, such as improving JAX’s MLPerf scores over TPUs. Google Vizier’s multiobjective optimization capabilities were a key component in producing better computer architecture designs in APOLLO (apollo). Furthermore, Google Vizier was a key component of the Full-stack Accelerator Search Technique (FAST) (accelerator_search), an automated framework for jointly optimizing hardware datapath, software schedule, and compiler passes.

Reinforcement Learning

“AutoRL” (autorl_survey) has recently shown a great deal of promise in automating reinforcement learning systems. Google Vizier was extensively used as the core component in tuning hyperparameters and rewards in navigation (sandra_evolving_rewards_autorl; sandra_indoor_autorl; sandra_navigation_autorl). Google Vizier’s backend was also used to host the Regularized Evolution optimizer (regularized_evolution), used for evolving RL algorithms (evolving_rl_algorithms), where the search space involved combinatorial directed acyclic graphs (DAGs). On the infrastructure side, Google Vizier was used to improve the performance of Reverb (reverb), one of the core replay buffer APIs used for most RL projects at Google. (contrastive)


Google Vizier’s algorithms were used for comparison in several papers related to protein optimization (blundell), and Google Vizier was also used to tune RNNs for peptide identification in (peptide). For healthcare, Google Vizier was used to tune models for classifying diseases such as diabetic retinopathy.


General Deep Learning

For fundamental research, Google Vizier was used to tune Neural Additive Models (neural_additive), and has also been the backbone of core research into infinite-width deep neural networks, having tuned (roman_ntk_conv; roman_ntk_finite; roman_infinite_attention; roman_ntk_bnn). For NLP-based tasks, Google Vizier regularly tunes language model training, and has also been used to search feature weights in (curriculum_NMT), as well as to improve performance for work on theorem proving (theorem_proving). Computer vision models such as ones used for the Pixel-3 have been tuned by Google Vizier.


As an example of tuning for human-based judgement on objectives unrelated to technology, Google Vizier was used to tune the recipe for cookie-baking (vizier_cookie).

Appendix D Extended Code Samples

d.1 Automated stopping

Code Block 3 demonstrates the use of automated stopping when training a standard machine learning model.

import sys

from vizier import StudyConfig, VizierClient

config = StudyConfig()
...  # Configure the search space and metrics.
# Each client should use a unique id.
client = VizierClient.load_or_create_study(
    'cifar10', config, client_id=sys.argv[1])
# EPOCHS and model are placeholders supplied by the user's training code.
while suggestions := client.get_suggestions(count=1):
  # Evaluate the suggestion(s) and report the results to OSS Vizier.
  for trial in suggestions:
    for epoch in range(EPOCHS):
      # Assumption: the trial is identified by trial.id in both calls below.
      if client.should_trial_stop(trial.id):
        break
      metrics = model.train_and_evaluate(trial.parameters['learning_rate'])
      client.report_metrics(epoch, metrics)
    metrics = model.evaluate()
    client.complete_trial(metrics, trial_id=trial.id)
Code Block 3: Pseudocode for tuning a model using the included Python client, with early stopping enabled.

d.2 Service Setup

Code Block 4 shows a simple way to set up the service on a multithreaded server.

from concurrent import futures

import grpc

from vizier.service import vizier_server
from vizier.service import vizier_service_pb2_grpc

hostname = 'localhost'  # Example; usually user-specified.
port = 6006  # Example; usually user-specified.
address = f'{hostname}:{port}'
servicer = vizier_server.VizierService()

# Serve requests from a pool of worker threads.
server = grpc.server(futures.ThreadPoolExecutor(max_workers=100))
vizier_service_pb2_grpc.add_VizierServiceServicer_to_server(servicer, server)
server.add_secure_port(address, grpc.local_server_credentials())
server.start()  # Begin serving requests.
Code Block 4: Pseudocode for setting up the service on a server.

d.3 Proto vs Python API

We provide an example of equivalent methods between PyVizier and corresponding Protocol Buffer objects. Note that clients and algorithm developers should not normally need to modify protos. Such cases are more common if one wishes to add extra layers on top of the service, as mentioned in Subsection 3.1.

from vizier.service import study_pb2
from google.protobuf import struct_pb2

# Parameters are nested under Trial; metrics are nested under Measurement.
param_1 = study_pb2.Trial.Parameter(
    parameter_id='learning_rate', value=struct_pb2.Value(number_value=0.4))
param_2 = study_pb2.Trial.Parameter(
    parameter_id='model_type', value=struct_pb2.Value(string_value='vgg'))
metric_1 = study_pb2.Measurement.Metric(metric_id='accuracy', value=0.7)
metric_2 = study_pb2.Measurement.Metric(metric_id='num_params', value=20423)
final_measurement = study_pb2.Measurement(metrics=[metric_1, metric_2])
trial = study_pb2.Trial(
    parameters=[param_1, param_2], final_measurement=final_measurement)
Code Block 5: Original Protocol Buffer method of creating a Trial.
from vizier.pyvizier import ParameterDict, ParameterValue, Measurement, Metric, Trial

# Parameters are stored in a dictionary-like container keyed by name.
params = ParameterDict()
params['learning_rate'] = ParameterValue(0.4)
params['model_type'] = ParameterValue('vgg')
final_measurement = Measurement()
final_measurement.metrics['accuracy'] = Metric(0.7)
final_measurement.metrics['num_params'] = Metric(20423)
trial = Trial(parameters=params, final_measurement=final_measurement)
Code Block 6: Equivalent method of writing the PyVizier version of the Trial from Code Block 5. Note the significantly more "Pythonic" way of writing code, with a significant reduction in code complexity.

Table 2 lists OSS Vizier’s Protocol Buffer names alongside their corresponding PyVizier names and converter objects.

| Protocol Buffer Name | PyVizier Name | Converter |
|---|---|---|
| Study | Study | N/A |
| StudySpec | SearchSpace + StudyConfig | SearchSpace (self) + StudyConfig (self) |
| ParameterSpec | ParameterConfig | ParameterConfigConverter |
| Trial | Trial | TrialConverter |
| Parameter | ParameterValue | ParameterValueConverter |
| MetricSpec | MetricInformation | MetricInformation (self) |
| Measurement | Measurement | MeasurementConverter |
Table 2: Corresponding names and conversion objects between Protocol Buffer and PyVizier objects. (self) denotes that the PyVizier object has its own immediate to_proto() and from_proto() functions.

d.4 Implementing an Evolutionary Algorithm

OSS Vizier provides an abstraction, SerializableDesigner, defined purely in terms of PyVizier without any Pythia dependencies. This interface wraps around most commonly used algorithms which sequentially update their internal states as new observations arrive. The interface is easy to understand and can be wrapped into a Pythia policy using the SerializableDesignerPolicy class, which handles state management. See Code Block 7 for an example.

import json
from typing import Optional, Sequence, Type

from vizier import pyvizier as vz
# SerializableDesigner, SerializableDesignerPolicy, CompletedTrials and
# HarmlessDecodeError are provided by OSS Vizier; their imports are omitted.

class RegEvo(SerializableDesigner):

  # override
  def suggest(self, count: Optional[int]) -> Sequence[vz.TrialSuggestion]:
    """Generate `count` number of mutations and return them."""

  # override
  def update(self, delta: CompletedTrials):
    """Apply selection step and update the population pool."""

  # override
  def dump(self) -> vz.Metadata:
    """Dumps the population pool."""
    md = vz.Metadata()
    md['population'] = json.dumps(...)
    return md

  # override
  @classmethod
  def recover(cls: Type['_S'], metadata: vz.Metadata) -> '_S':
    """Restores the population pool."""
    if 'population' not in metadata:
      raise HarmlessDecodeError('Cannot find key: "population"')
    population = json.loads(metadata['population'])
    ...  # Rebuild and return the designer from the restored population.

# policy_supporter is supplied by the Pythia runtime.
policy = SerializableDesignerPolicy(
    policy_supporter,
    designer_factory=RegEvo.__init__,
    designer_cls=RegEvo)
Code Block 7: Example Pseudocode of implementing an evolutionary algorithm as a Pythia policy using SerializableDesigner interface.