
Fourier Representations for Black-Box Optimization over Categorical Variables

02/08/2022
by Hamid Dadkhahi, et al. (Amazon, IBM)

Optimization of real-world black-box functions defined over purely categorical variables is an active area of research. In particular, the optimization and design of biological sequences with specific functional or structural properties has a profound impact on medicine, materials science, and biotechnology. Standalone search algorithms, such as simulated annealing (SA) and Monte Carlo tree search (MCTS), are typically used for such optimization problems. In order to improve the performance and sample efficiency of such algorithms, we propose to use existing methods in conjunction with a surrogate model for the black-box evaluations over purely categorical variables. To this end, we present two different representations: a group-theoretic Fourier expansion and an abridged one-hot encoded Boolean Fourier expansion. To learn such representations, we consider two different settings for updating our surrogate model. First, we utilize an adversarial online regression setting where Fourier characters of each representation are considered as experts and their respective coefficients are updated via an exponential weight update rule each time the black box is evaluated. Second, we consider a Bayesian setting where queries are selected via Thompson sampling and the posterior is updated via a sparse Bayesian regression model (over our proposed representation) with a regularized horseshoe prior. Numerical experiments over synthetic benchmarks as well as real-world RNA sequence optimization and design problems demonstrate the representational power of the proposed methods, which achieve competitive or superior performance compared to state-of-the-art counterparts, while substantially improving computation cost and/or sample efficiency.



1 Introduction

A plethora of practical optimization problems involve black-box functions with no simple analytical closed forms that can be evaluated at any arbitrary point in the domain. Optimization of such black-box functions poses a unique challenge due to restrictions on the number of possible function evaluations, as evaluating functions of real-world complex processes is often expensive and time-consuming. Efficient algorithms for global optimization of expensive black-box functions take past queries into account in order to select the next query to the black-box function more intelligently. Black-box optimization of real-world functions defined over purely categorical input variables has not been studied in the literature as extensively as its integer, continuous, or mixed-variable counterparts.

Categorical variables are particularly challenging when compared to integer or continuous variables, as they do not have a natural ordering. However, many real-world functions are defined over categorical variables. One such problem, which is of wide interest, is the design of optimal chemical or biological (protein, RNA, and DNA) molecule sequences, which are constructed using a vocabulary of fixed size, e.g. 4 for DNA/RNA. Designing optimal molecular sequences with novel structures or improved functionalities is of paramount importance in materials science, drug and vaccine design, synthetic biology, and many other applications, see Dixon et al. (2010); Ng et al. (2019); Hoshika et al. (2019); Yamagami et al. (2019). Design of optimal sequences is a difficult black-box optimization problem over a combinatorially large search space Stephens et al. (2015), in which function evaluations often rely on wet-lab experiments, physics-inspired simulators, or knowledge-based computational algorithms, which are slow and expensive in practice. Another problem of interest is the constrained design problem, e.g. finding a sequence given a specific structure (or property), which is the inverse of the well-known folding problem discussed in Dill and MacCallum (2012). This problem is complex due to the strict structural constraints imposed on the sequence. In fact, one of the ways to represent such a complex structural constraint is to constrain the next choice sequentially based on the sequence elements that have been chosen previously. Therefore, we divide the black-box optimization problem into two settings, depending on the constraint set: the Generic Black-Box Optimization (BBO) Problem, referring to the unconstrained case, and the Design Problem, referring to the case with complex sequential constraints.

Let $x^{(t)}$ be the $t$-th sequence evaluated by the black-box function $f$. The key question in both settings is the following: given prior queries $x^{(1)}, \ldots, x^{(t)}$ and their evaluations $f(x^{(1)}), \ldots, f(x^{(t)})$, how do we choose the next query $x^{(t+1)}$? This selection must be devised so that, over a finite budget of black-box evaluations, one gets closest to the minimizer in an expected sense over the acquisition randomness.

In the literature, for design problems with sequential constraints, Monte Carlo tree search (MCTS)-based search algorithms are often used with black-box function evaluations. For generic BBO problems in the unconstrained scenario, simulated annealing (SA)-based techniques are typically used as search algorithms. A key missing ingredient in the categorical domain is a surrogate model for the black-box evaluations that can interpolate between such evaluations and provide cost-free approximate evaluations internally, reducing the need for frequent access to real black-box evaluations and thereby improving sample efficiency in both generic and design black-box optimization problems. Due to the lack of efficient interpolators in categorical domains, existing search algorithms suffer under constrained computational budgets, owing to their reliance on real black-box evaluations alone.

Contributions: We address the above problem in our work. Our main contributions are as follows:

  1. We present two representations for modeling real-valued combinatorial functions over categorical variables, which we then use in order to learn a surrogate model for the generic BBO problem and the design problem, with a finite budget on the number of queries. The abridged one-hot encoded Boolean Fourier representation is novel to this work. The use of the group-theoretic Fourier representation for modeling functions over categorical variables, and in particular their optimization, is novel to this work.

  2. Numerical experiments over synthetic benchmarks as well as real-world biological (RNA) sequence optimization demonstrate the competitive or superior performance of the proposed methods incorporating our representations over state-of-the-art counterparts, while substantially reducing the computation time. We further evaluate the performance of our algorithms in design problems (inverse folding problem) and demonstrate the superior performance of our methods in terms of sample efficiency over the state-of-the-art counterparts.

2 Related Work

Black-Box Optimization: Hutter, Hoos, and Leyton-Brown (2011) suggest a surrogate model based on random forests to address optimization problems over categorical variables. The proposed SMAC algorithm uses a randomized local search under the expected improvement acquisition criterion to obtain candidate points for black-box evaluations. Bergstra et al. (2011) suggest a tree-structured Parzen estimator (TPE) for approximating the surrogate model, and maximize the expected improvement criterion to find candidate points for evaluation. For optimization problems over Boolean variables, multilinear polynomials Baptista and Poloczek (2018); Dadkhahi et al. (2020) and Walsh functions Leprêtre et al. (2019); Deshwal, Belakaria, and Doppa (2021) have been used in the literature.

Bayesian Optimization (BO) is a commonly used approach for optimization of black-box functions Shahriari et al. (2015). However, limited work has addressed the incorporation of categorical variables in BO. Early attempts were based on converting the black-box optimization problem over categorical variables to one over continuous variables Gómez-Bombarelli et al. (2018); Golovin et al. (2017); Garrido-Merchán and Hernández-Lobato (2020). A few BO algorithms have been specifically designed for black-box functions over combinatorial domains. In particular, the BOCS algorithm Baptista and Poloczek (2018), primarily devised for Boolean functions, employs a sparse monomial representation to model the interactions among different variables, and uses a sparse Bayesian linear regression method to learn the model coefficients. The COMBO algorithm of Oh et al. (2019) uses the Graph Fourier Transform over a combinatorial graph, constructed via the graph Cartesian product of variable subgraphs, to gauge the smoothness of the black-box function. However, both BOCS and COMBO are hindered by high computation costs, which grow polynomially with both the number of variables and the number of function evaluations. More recently, a computationally efficient black-box optimization algorithm (COMEX) Dadkhahi et al. (2020) was introduced to address the computational impediments of its Bayesian counterparts. COMEX adopts a Boolean Fourier representation as its surrogate model, which is updated via an exponential weight update rule. Nevertheless, COMEX is limited to functions over the Boolean hypercube.

We generalize COMEX to handle functions over categorical variables by proposing two representations for modeling functions over categorical variables: an abridged one-hot encoded Boolean Fourier representation and a Fourier representation on finite Abelian groups. The utilization of the latter representation as a surrogate model in combinatorial optimization algorithms is novel to this work. Factorizations based on (vanilla) one-hot encoding have been previously (albeit briefly) suggested in Baptista and Poloczek (2018) to enable black-box optimization algorithms designed for Boolean variables to address problems over categorical variables. Different from Baptista and Poloczek (2018), we show that we can significantly reduce the number of additional variables introduced upon one-hot encoding, and that such a reduced representation is in fact complete and unique. We incorporate the latter representation to efficiently extend a modified version of BOCS to problems over categorical variables.

Sequence Design: For design problems, we focus on the RNA sequence design problem (RNA inverse folding). The goal is to find an RNA sequence consistent with a given secondary structure, as the functional state of the RNA molecule is determined by the latter structure Hofacker et al. (1994). Earlier RNA design methods explore the search space by trial and error and use classic cost function minimization approaches such as adaptive random walks Hofacker (2003), probabilistic sampling Zadeh et al. (2011), and genetic algorithms Taneda (2015). Recent efforts employ more advanced machine learning methods such as different Monte Carlo tree search (MCTS) algorithms, e.g. MCTS-RNA Yang et al. (2017) or Nested MCTS Portela (2018), and reinforcement learning that either performs a local search as in Eastman et al. (2018) or learns complete candidate solutions from scratch Runge et al. (2019). In all these approaches, the assumption is that the algorithm has access to a large number of function evaluations, whereas we are interested in the sample efficiency of each algorithm.

Learning to Search: As an alternative to parameter-free search methods (such as SA), Swersky et al. (2020) suggest using a parameterized policy to generate candidates that maximize the acquisition function in Bayesian optimization over discrete search spaces. Our MCTS-based algorithm is similar in concept to Swersky et al. (2020) in the sense that tabular value functions are constructed and maintained over different time steps. However, we maintain value functions rather than a policy network. Finally, while Swersky et al. (2020) train a policy network, Deshwal et al. (2020) use a rank learning algorithm to search.

Meta-heuristics: A variety of discrete search algorithms and meta-heuristics have been studied in the literature for combinatorial optimization over categorical variables. Such algorithms, including genetic algorithms Holland and Reitman (1978), simulated annealing Spears (1993), and particle swarms Kennedy and Eberhart (1995), are generally inefficient in finding the global minima. In the context of biological sequence optimization, the most popular method is directed evolution Arnold (1998), which explores the space by only making small mutations to existing sequences. In the context of sequence optimization, a recent promising approach consists of fitting a neural network model to predict the black-box function and then applying gradient ascent on the latter model Killoran et al. (2017); Bogard et al. (2019); Liu et al. (2020). This approach allows for a continuous relaxation of the discrete search space, making possible step-wise local improvements to the whole sequence at once based on a gradient direction. However, these methods have been shown to suffer from vanishing gradients Linder and Seelig (2020). Further, the projected sequences in the continuous relaxation space may not be recognized by the predictors, leading to poor convergence. Generative model-based optimization approaches aim to learn distributions whose expectation coincides with evaluations of the black box and try to maximize that expectation Gupta and Zou (2019); Brookes, Park, and Listgarten (2019). However, such approaches require a pre-trained generative model for optimization.

Latent Space Optimization: The key idea is to employ an encoder-decoder architecture to learn a continuous representation from the data and perform BO in a latent space Gómez-Bombarelli et al. (2018); Tripp, Daxberger, and Hernández-Lobato (2020). This approach has two main drawbacks: it can generate a large fraction of invalid structures, and it requires a large database of relevant structures for training an auto-encoder, which may not be available in many applications with scarce data.

Variable-Length Sequence Optimization: Moss et al. (2020) propose an algorithm called BOSS for optimization over strings of variable lengths. They use a Gaussian process surrogate model based on string kernels and perform acquisition function maximization for spaces with syntactic constraints. In contrast, our main focus is on fixed-length generic BBO and design problems.

Ensemble (Population-based) Methods: Angermüller et al. (2020) propose a population-based approach which generates batches of sequences by sampling from an ensemble of methods, rather than from a single method. This sampling is carried out proportionally to the quality of sequences that each method has found in the past. A similar ensemble concept has been used in Sinai et al. (2021). In our work, we focus on individual methods; however, since we propose two distinct representations employed via different algorithms (online and Bayesian), and the performance of each varies across different tasks, it would be interesting to consider an ensemble method over different algorithms as a future direction.

Batch (Offline) Optimization: Batch optimization techniques Angermueller et al. (2020); Brookes, Park, and Listgarten (2019); Fannjiang and Listgarten (2020) are a related but fundamentally distinct line of work, where black-box evaluations are carried out in sample batches rather than in an active/sequential manner. To highlight the distinction between the two settings, we reiterate our problem setting: given query access to a combinatorial black box (i.e. it can be queried on a single point at a time), what is the best solution we can obtain in a given number of such queries? One wishes to minimize the number of queries and hence favors a highly adaptive sequential algorithm.

3 Black-Box Optimization over Categorical Variables

Problem Setting: Given the combinatorial categorical domain $\mathcal{X} = [k]^n$ and a constraint set $\mathcal{C} \subseteq \mathcal{X}$, with $n$ variables each of cardinality $k$, the objective is to find

$$x^* = \operatorname*{arg\,min}_{x \in \mathcal{C}} f(x) \qquad (1)$$

where $f : \mathcal{X} \to \mathbb{R}$ is a real-valued combinatorial function. We assume that $f$ is a black-box function, which is computationally expensive to evaluate. As such, we are interested in finding $x^*$ in as few evaluations as possible. We consider two variations of the problem depending on how the constraint set $\mathcal{C}$ is specified.

Generic BBO Problem: In this case, the constraint set is $\mathcal{C} = \mathcal{X}$, i.e. the problem is unconstrained. For example, the RNA sequence optimization problem, which searches for an RNA sequence optimizing a specific property, lies within this category. A score for every RNA sequence, reflecting the property we wish to optimize, is evaluated by a black-box function.

Design Problem: The constraint set $\mathcal{C}$ is complex and is only sequentially specified. For every partial sequence consisting of characters from the alphabet $[k]$, the choice of the next character is specified by a constraint-set function. The RNA inverse folding problem in Runge et al. (2019) falls into this category, where the constraints on the RNA sequence are determined by the sequential choices one makes during the sequence design. The goal is to find the sequence that is optimal with respect to a pre-specified structure while also obeying the complex sequential constraints.

Proposed Techniques: To address this problem, we adopt a surrogate model-based learning framework as follows. The surrogate model (representing the black-box function) is updated sequentially via the black-box function evaluations observed up to time step $t$. An acquisition function is then obtained from the surrogate model. The selection of candidate points for black-box function evaluation is carried out via an acquisition function optimizer (AFO), which uses the acquisition function as an inexpensive proxy (to make many internal calls) for the black-box function and produces the next candidate point to be evaluated. We consider this problem in two different settings: online and Bayesian. The difference between the two settings is that in the former the acquisition function is exactly our surrogate, whereas in the latter the acquisition function is obtained by sampling from the posterior distribution of the surrogate.
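The following is a minimal Python sketch of this loop; the names (black_box, a surrogate object with update and acquisition methods, optimize_acquisition) are our placeholders for illustration, not the authors' implementation:

def bbo_loop(black_box, surrogate, optimize_acquisition, x0, budget):
    # Generic surrogate-based black-box minimization over categorical inputs.
    x, history = x0, []
    for _ in range(budget):
        y = black_box(x)                       # one expensive evaluation
        history.append((x, y))
        surrogate.update(x, y)                 # online or Bayesian update
        acquisition = surrogate.acquisition()  # the surrogate itself (online)
                                               # or one posterior draw (Thompson)
        x = optimize_acquisition(acquisition)  # many cheap internal calls (SA/MCTS)
    return min(history, key=lambda pair: pair[1])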

In the sequel, we propose two representations that can be used in surrogate models for black-box combinatorial functions over categorical variables. These representations serve as direct generalizations of the Boolean surrogate model based on the Fourier expansion proposed in Dadkhahi et al. (2020); Baptista and Poloczek (2018), in the sense that they reduce to the Fourier representation for real-valued Boolean functions when the cardinality of the categorical variables is two. In addition, both approaches can be modified to address the more general case where different variables are of different cardinalities. However, for ease of exposition, we assume that all the variables are of the same cardinality. Finally, we introduce two popular search algorithms to be used as acquisition function optimizers in conjunction with the proposed surrogate models in order to select new queries for subsequent black-box function evaluations.

3.1 Representations for the Surrogate Model

We present two representations for combinatorial functions, along with algorithms to update the resulting surrogate models from black-box evaluations. The representations use the Fourier basis in two different ways.

Abridged One-Hot Encoded Boolean Fourier Representation: The one-hot encoding of each variable $x_i$ can be expressed as a $(k-1)$-tuple $(z_{i,1}, \ldots, z_{i,k-1})$, where the $z_{i,j}$ are Boolean variables with the constraint that at most one such variable can be equal to $1$ for any given $i$.

We consider the following representation for the combinatorial function $f$:

$$f(z) = \sum_{d=0}^{n} \; \sum_{\mathcal{S} \in \binom{[n]}{d}} \; \sum_{\mathbf{j} \in [k-1]^d} \alpha_{\mathcal{S}, \mathbf{j}} \, \psi_{\mathcal{S}, \mathbf{j}}(z) \qquad (2)$$

where $[k-1]^d$ denotes the $d$-fold cartesian product of the set $[k-1]$, $\binom{[n]}{d}$ designates the set of $d$-subsets of the set $[n]$, and the monomials can be written as

$$\psi_{\mathcal{S}, \mathbf{j}}(z) = \prod_{\ell=1}^{d} z_{s_\ell, j_\ell}, \qquad \mathcal{S} = \{s_1, \ldots, s_d\}. \qquad (3)$$

A second-order approximation (i.e. at $d \le 2$) of the representation in (2) can be expanded in the following way:

$$f(z) \approx \alpha_0 + \sum_{i=1}^{n} \sum_{j=1}^{k-1} \alpha_{i,j} \, z_{i,j} + \sum_{i < i'} \sum_{j=1}^{k-1} \sum_{j'=1}^{k-1} \alpha_{(i,j),(i',j')} \, z_{i,j} \, z_{i',j'}. \qquad (4)$$
Example.

For variables $x_1$ and $x_2$, each of cardinality $k = 3$, we have the one-hot encodings $(z_{1,1}, z_{1,2})$ and $(z_{2,1}, z_{2,2})$, respectively. From Equation (4), the one-hot encoding factorization for this example can be written as

$$f(z) = \alpha_0 + \alpha_{1,1} z_{1,1} + \alpha_{1,2} z_{1,2} + \alpha_{2,1} z_{2,1} + \alpha_{2,2} z_{2,2} + \alpha_{(1,1),(2,1)} z_{1,1} z_{2,1} + \alpha_{(1,1),(2,2)} z_{1,1} z_{2,2} + \alpha_{(1,2),(2,1)} z_{1,2} z_{2,1} + \alpha_{(1,2),(2,2)} z_{1,2} z_{2,2}.$$

Note that the representation in Equation (2) has far fewer terms than a vanilla one-hot encoding with all the combinations of one-hot variables included (as suggested in Baptista and Poloczek (2018)). The reason for this reduction is two-fold: $k-1$ Boolean variables model each categorical variable of cardinality $k$, and more importantly each monomial term has at most one Boolean variable from its corresponding parent categorical variable (see the Appendix for the exact quantification of this reduction). The following theorem states that this reduced representation is in fact unique and complete.

Theorem 3.1.

The representation in Equation (2) is complete and unique for any real-valued combinatorial function.

Proof.

See Appendix. ∎
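As an illustration, the following Python sketch enumerates the order-at-most-two abridged features of Equation (4) for a point $x \in [k]^n$. The $\{-1, 1\}$ indicator convention (with the $k$-th category encoded by all indicators equal to $-1$) is one concrete choice on our part; the paper's exact convention may differ.

import itertools
import numpy as np

def abridged_features(x, k):
    # Order-<=2 abridged one-hot features for a categorical point x in [k]^n,
    # following the structure of Eq. (4): a constant, one indicator per
    # (variable, category) pair, and pairwise products with at most one
    # indicator per parent categorical variable.
    n = len(x)
    z = np.where(np.arange(k - 1)[None, :] == np.asarray(x)[:, None], 1.0, -1.0)
    feats = [1.0]
    feats += [z[i, j] for i in range(n) for j in range(k - 1)]
    for i, ip in itertools.combinations(range(n), 2):
        for j in range(k - 1):
            for jp in range(k - 1):
                feats.append(z[i, j] * z[ip, jp])
    return np.array(feats)

# Example: n = 3 variables of cardinality k = 4 yield
# 1 + 3*3 + C(3,2)*3*3 = 37 features.
phi = abridged_features([0, 2, 3], k=4)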

Fourier Representation on Finite Abelian Groups: We define a cyclic group structure $\mathbb{Z}_k$ over the elements of each categorical variable $x_i$ ($i \in [n]$), where $k$ is the cardinality of the latter variable. From the fundamental theorem of abelian groups Terras (1999), there exists an abelian group $G$ which is isomorphic to the direct sum (a.k.a. direct product) of the cyclic groups corresponding to the categorical variables:

$$G \cong \mathbb{Z}_k \oplus \cdots \oplus \mathbb{Z}_k = \mathbb{Z}_k^n \qquad (5)$$

where the latter group consists of all vectors $x = (x_1, \ldots, x_n)$ such that $x_i \in \mathbb{Z}_k$, and $\cong$ denotes group isomorphism. We assume that all variables have the same cardinality $k$ for simplicity, but the following representation could be easily generalized to the case of arbitrary cardinalities for different variables.

The Fourier representation of any complex-valued function $f$ on the finite abelian group $\mathbb{Z}_k^n$ is given by Terras (1999)

$$f(x) = \sum_{\lambda \in \mathbb{Z}_k^n} \hat{f}_{\lambda} \, \chi_{\lambda}(x) \qquad (6)$$

where the $\hat{f}_{\lambda}$ are (in general complex) Fourier coefficients, $\mathbb{Z}_k^n$ is the $n$-fold cartesian product of the set $\mathbb{Z}_k = \{0, 1, \ldots, k-1\}$, and the $\chi_{\lambda}$ are complex exponentials ($k$-th roots of unity)¹ given by

$$\chi_{\lambda}(x) = \exp\!\Big(\frac{2\pi \mathrm{i}}{k} \langle \lambda, x \rangle\Big) = \prod_{i=1}^{n} \exp\!\Big(\frac{2\pi \mathrm{i}}{k} \lambda_i x_i\Big).$$

¹Note that in the general case of different cardinalities $k_i$ for different variables, $\lambda \in \mathbb{Z}_{k_1} \times \cdots \times \mathbb{Z}_{k_n}$, where $\times$ denotes the cartesian product, and the denominator $k$ in the exponent of each complex exponential character is replaced by $k_i$.

Note that the latter complex exponentials are the characters of the representation, and reduce to the monomials (i.e. products of variables in $\{-1, 1\}$) when the cardinality of each variable is two. A second-order approximation of the representation in (6) can be written as:

$$f(x) \approx \hat{f}_0 + \sum_{i=1}^{n} \sum_{\lambda_i=1}^{k-1} \hat{f}_{\lambda_i} \, \omega^{\lambda_i x_i} + \sum_{i < i'} \sum_{\lambda_i=1}^{k-1} \sum_{\lambda_{i'}=1}^{k-1} \hat{f}_{\lambda_i, \lambda_{i'}} \, \omega^{\lambda_i x_i + \lambda_{i'} x_{i'}}, \qquad \omega = e^{2\pi \mathrm{i}/k}. \qquad (7)$$

For a real-valued function (which is of interest here), the conjugate symmetry $\hat{f}_{-\lambda} = \overline{\hat{f}_{\lambda}}$ holds, and the representation in (6) reduces to

$$f(x) = \hat{f}_0 + \sum_{\lambda \in \Lambda^+} \Big[ a_{\lambda} \cos\!\Big(\frac{2\pi}{k} \langle \lambda, x \rangle\Big) + b_{\lambda} \sin\!\Big(\frac{2\pi}{k} \langle \lambda, x \rangle\Big) \Big] \qquad (8)$$

where $\Lambda^+$ contains one frequency from each conjugate pair $\{\lambda, -\lambda\}$ and

$$a_{\lambda} = 2\,\mathrm{Re}(\hat{f}_{\lambda}), \qquad b_{\lambda} = -2\,\mathrm{Im}(\hat{f}_{\lambda}). \qquad (9)$$

See the Appendix for the proof of this representation. We note that the number of characters utilized in this representation is almost twice as many as the number of monomials used in the previous representation.
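A small Python sketch of the corresponding real-valued feature map (our own illustration; the cos/sin pairs follow Equations (8)-(9), ignoring the redundancy between conjugate frequencies for simplicity):

import itertools
import numpy as np

def character_features(x, k):
    # Real-valued order-<=2 Fourier features on Z_k^n: cos/sin pairs for each
    # first-order frequency lambda_i in 1..k-1 and each pairwise frequency.
    x = np.asarray(x)
    n = len(x)
    feats = [1.0]
    for i in range(n):
        for lam in range(1, k):
            theta = 2 * np.pi * lam * x[i] / k
            feats += [np.cos(theta), np.sin(theta)]
    for i, j in itertools.combinations(range(n), 2):
        for lam_i in range(1, k):
            for lam_j in range(1, k):
                theta = 2 * np.pi * (lam_i * x[i] + lam_j * x[j]) / k
                feats += [np.cos(theta), np.sin(theta)]
    return np.array(feats)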

3.2 Surrogate Model Learning

Sparse Online Regression: We adopt the learning algorithm of combinatorial optimization with expert advice Dadkhahi et al. (2020) in the following way. We consider the monomials in (3) and the characters in (9) as experts. For each surrogate model, we maintain a pool of such experts, the coefficients of which are refreshed sequentially via an exponential weight update rule. We refer to the proposed algorithm as Expert-Based Categorical Optimization (ECO); the two versions of the algorithm with the two proposed surrogate models are called ECO-F (based on the abridged one-hot encoded Boolean Fourier representation) and ECO-G (based on the Fourier representation on finite Abelian groups), respectively. See Appendix for more details.
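A simplified sketch of such an update in Python (an EG± style exponentiated-gradient step on the squared error; the anytime learning rate and the sparsity mechanism of Dadkhahi et al. (2020) are omitted, and the class below is our own illustration):

import numpy as np

class ExponentiatedGradientSurrogate:
    # Simplified sketch of an ECO-style surrogate: each Fourier character is
    # an "expert"; its signed coefficient is represented by a pair of
    # positive weights (EG+/-) updated multiplicatively after each black-box
    # evaluation.

    def __init__(self, num_experts, radius=1.0, eta=0.01):
        self.w_pos = np.full(num_experts, 0.5 / num_experts)
        self.w_neg = np.full(num_experts, 0.5 / num_experts)
        self.radius, self.eta = radius, eta

    def coefficients(self):
        # Signed coefficients live in an l1-ball of the given radius.
        return self.radius * (self.w_pos - self.w_neg)

    def predict(self, phi):
        return float(self.coefficients() @ phi)

    def update(self, phi, y):
        # Gradient of the squared error w.r.t. the coefficient vector.
        grad = 2.0 * (self.predict(phi) - y) * phi
        self.w_pos *= np.exp(-self.eta * self.radius * grad)
        self.w_neg *= np.exp(+self.eta * self.radius * grad)
        total = self.w_pos.sum() + self.w_neg.sum()
        self.w_pos /= total   # keep the combined weights on the simplex
        self.w_neg /= total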

Sparse Bayesian Regression: In order to incorporate the uncertainty of the surrogate model into the acquisition function for the exploration-exploitation trade-off, we adopt a Bayesian setting based on Thompson sampling (TS) Thompson (1933, 1935). In particular, we model the surrogate via a sparse Bayesian regression model with a regularized horseshoe prior Piironen and Vehtari (2017). At each time step, we learn the posterior using the data observed so far, and draw a sample for the coefficients from the posterior. We use the No-U-Turn Sampler (NUTS) Hoffman and Gelman (2014) in order to sample the coefficients efficiently. See Appendix B for more details. Our algorithm pursues the framework of BOCS Baptista and Poloczek (2018), with the following differences for improved performance and computational efficiency: BOCS (i) employs a horseshoe prior Carvalho, Polson, and Scott (2010); Makalic and Schmidt (2016) and (ii) uses a Gibbs sampler. Experimentally, the regularized horseshoe prior outperforms the horseshoe prior, and the NUTS sampler leads to a significant speed-up in computations (e.g. for the RNA optimization problem), which is critical due to the higher dimensionality of the categorical domains. We use the proposed abridged one-hot encoded Fourier representation in conjunction with this model, due to its lower number of coefficients leading to a more manageable computation cost. We refer to this algorithm as TS-based Categorical Optimization (TCO-F).
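As a concrete (and deliberately simplified) sketch of this surrogate, the following PyMC model places a regularized horseshoe prior (Piironen and Vehtari 2017) on the coefficients of the Fourier features and samples the posterior with NUTS; all hyper-parameter values here are illustrative choices of ours, not the paper's.

import pymc as pm

def fit_regularized_horseshoe(Phi, y, sigma=1.0):
    # Phi: (num_observations, num_features) feature matrix; y: evaluations.
    _, m = Phi.shape
    with pm.Model():
        tau = pm.HalfCauchy("tau", beta=1.0)             # global shrinkage
        lam = pm.HalfCauchy("lam", beta=1.0, shape=m)    # local shrinkage
        c2 = pm.InverseGamma("c2", alpha=2.0, beta=2.0)  # slab width
        lam_tilde = pm.Deterministic(
            "lam_tilde", lam * pm.math.sqrt(c2 / (c2 + tau**2 * lam**2)))
        beta = pm.Normal("beta", mu=0.0, sigma=tau * lam_tilde, shape=m)
        pm.Normal("y_obs", mu=pm.math.dot(Phi, beta), sigma=sigma, observed=y)
        trace = pm.sample(draws=500, tune=500, chains=2)  # NUTS by default
    return trace  # Thompson sampling uses a single posterior draw of beta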

3.3 Acquisition Function Optimizers

In this subsection, we discuss how two popular search algorithms, namely simulated annealing (SA) and Monte Carlo tree search (MCTS), can be utilized as acquisition function optimizers (AFO) in conjunction with a surrogate model, using cost-free evaluations of the surrogate model to select the next query for black-box evaluation. In the literature, SA has been used for generic BBO problems, whereas MCTS has been used for design problems.

SA as AFO: Our AFO is devised so as to minimize $\bar{f}$, the current estimate of the surrogate model. A simple strategy to minimize $\bar{f}$ is to iteratively switch each variable $x_i$ to the value that minimizes $\bar{f}$ given the values of all the remaining variables $x_{-i}$, until no more changes occur after a sweep through all the variables. A strategy to escape local minima in this context Pincus (1970) is to allow for occasional increases in $\bar{f}$ by generating a Markov chain (MC) sample sequence whose stationary distribution is proportional to $\exp(-\bar{f}(x)/T)$, where the temperature $T$ is gradually reduced to zero. This optimization strategy was first applied by Kirkpatrick, Gelatt, and Vecchi (1983) in their simulated annealing algorithm to solve combinatorial optimization problems. We use the Gibbs sampler Geman and Geman (1984) to generate such an MC by sampling from the full conditional distribution of the stationary distribution, which in our case is given by the softmax distribution over $-\bar{f}(x_{-i}, \cdot)/T$ for each variable $x_i$ conditional on the values of the remaining variables $x_{-i}$. By decreasing $T$ from a high value to a low one, we allow the MC to first search at a coarse level, avoiding being trapped in local minima.

Algorithm 1 presents our simulated annealing (SA) version for categorical domains, where $T(t)$ is an annealing schedule, which is a non-increasing function of $t$. We use the annealing schedule suggested in Spears (1993), which follows an exponential decay with a fixed decay parameter. In each step of the algorithm, we pick a variable $x_i$ ($i \in [n]$) uniformly at random and evaluate the surrogate model (possibly in parallel) $k$ times, once for each categorical value of the chosen variable given the current values of the remaining variables. We then update $x_i$ with the value in $[k]$ sampled from the corresponding softmax distribution.

1:  Inputs: surrogate model $\bar{f}$, annealing schedule $T(t)$, categorical domain $[k]^n$
2:  Initialize $x$ uniformly at random over $[k]^n$
3:  $t \leftarrow 1$
4:  repeat
5:     pick $i \in [n]$ uniformly at random
6:     sample $x_i$ from the softmax distribution over $-\bar{f}(x_{-i}, \cdot)/T(t)$
7:     $t \leftarrow t + 1$
8:  until Stopping Criteria
9:  return  $x$
Algorithm 1 SA for Categorical Variables with Surrogate Model
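A runnable Python sketch of Algorithm 1 (our own rendering of the pseudocode above; the schedule parameters T0 and rho and the iteration count are illustrative defaults):

import numpy as np

def sa_afo(surrogate, n, k, num_iters=500, T0=1.0, rho=0.995, rng=None):
    # Simulated annealing over [k]^n using the surrogate as a cheap objective.
    # `surrogate` maps a length-n integer array to a real value.
    rng = rng or np.random.default_rng()
    x = rng.integers(k, size=n)
    best, best_val = x.copy(), surrogate(x)
    T = T0
    for _ in range(num_iters):
        i = rng.integers(n)                   # pick one variable at random
        vals = np.empty(k)
        for v in range(k):                    # k cheap surrogate evaluations
            x[i] = v
            vals[v] = surrogate(x)
        p = np.exp(-(vals - vals.min()) / T)  # softmax (Gibbs) over values
        p /= p.sum()
        x[i] = rng.choice(k, p=p)
        if vals[x[i]] < best_val:
            best, best_val = x.copy(), vals[x[i]]
        T *= rho                              # exponential annealing decay
    return best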
1:  Inputs: surrogate reward model $\bar{r}$, search tree $\mathcal{T}$
2:  Initialize $s_0$ as the empty sequence, $r^* \leftarrow -\infty$
3:  repeat
4:     follow the tree policy from $s_0$ until a leaf node $s_L$ of $\mathcal{T}$ is reached
5:     add $s_L$ to $\mathcal{T}$ and initialize value estimates for its children
6:     follow the default policy $\pi_0$ from $s_L$ until a terminal state $s_T$ is reached
7:     $r \leftarrow \bar{r}(s_T)$
8:     back up $r$ along the visited path, updating the value estimates via Monte Carlo
9:     if  $r > r^*$  then
10:        $r^* \leftarrow r$, $x^* \leftarrow s_T$
11:     end if
12:  until Stopping Criteria
13:  return  $x^*$
Algorithm 2 MCTS with Surrogate Reward

MCTS as AFO: We formulate the design problem as an undiscounted Markov decision process. Each state $s$ corresponds to a partial or full sequence of categorical variables of length in $\{0, 1, \ldots, n\}$. The process in each episode starts with an empty sequence $s_0$, the initial state. Actions are selected from the set $\mathcal{A}(s_t)$ of permissible additions to the current state (sequence) at each time step $t$. The transition function is deterministic, and defines the sequence obtained from the juxtaposition of the current state $s_t$ with the action $a_t$, i.e. $s_{t+1} = s_t \circ a_t$. The transitions leading to incomplete sequences yield a reward of zero. Complete sequences are considered as terminal states, from which no further transitions (juxtapositions) can be made. Once the sequence is complete (i.e. at a terminal state $s$), the reward is obtained from the current surrogate reward model $\bar{r}$. Thus, the reward function is defined as $R(s) = \bar{r}(s)$ if $s$ is terminal, and zero otherwise.

MCTS is a popular search algorithm used for design problems. MCTS is a rollout algorithm which keeps track of the value estimates obtained via Monte Carlo simulations in order to progressively make better selections. The UCT selection criterion, see Kocsis and Szepesvári (2006), is typically used as the tree policy, where the action at state $s$ in the search tree $\mathcal{T}$ is selected via

$$a^* = \operatorname*{arg\,max}_{a} \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right],$$

where $c$ is the exploration parameter, $Q(s, a)$ is the state-action value estimate, and $N(s)$ and $N(s, a)$ are the visit counts for the parent state node and the candidate state-action edge, respectively. For the selection of actions in states outside the search tree, a random default policy $\pi_0$ is used.

A summary of the proposed algorithm is given in Algorithm 2. Starting with an empty sequence at the root of the tree, we follow the tree policy until a leaf node of the search tree is reached (selection step). At this point, we append the state corresponding to the leaf node to the tree and initialize a value function estimate for its children (extension step). From the reached leaf node we follow the default policy until a terminal state is reached. At this point, we plug the sequence corresponding to this terminal state into the surrogate reward model and observe the reward $r$. This reward is backed up from the leaf node to the root of the tree in order to update the value estimates via Monte Carlo (i.e. using the average reward) for all visited state-action pairs along the path. This process is repeated until a stopping criterion (typically a maximum number of playouts) is met, at which point the sequence with maximum reward is returned as the output of the algorithm.
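The following compact Python sketch implements this playout loop (UCT selection, expansion, random rollout, Monte Carlo backup) against a surrogate reward model; the names and data structures are our own:

import math
import random

def mcts_afo(reward_model, action_sets, num_playouts=1000, c=1.0):
    # UCT search over sequences scored by the surrogate reward model.
    # action_sets[d] lists permissible characters at depth d;
    # reward_model scores a complete sequence (tuple).
    N, Q = {}, {}                  # visit counts and value estimates
    tree = {()}                    # states (tuples) in the search tree
    best, best_r = None, -math.inf
    for _ in range(num_playouts):
        s, path = (), []
        while s in tree and len(s) < len(action_sets):      # selection
            def uct(a):
                sa = (s, a)
                if N.get(sa, 0) == 0:
                    return math.inf
                return Q[sa] + c * math.sqrt(math.log(N.get(s, 1)) / N[sa])
            a = max(action_sets[len(s)], key=uct)
            path.append((s, a))
            s = s + (a,)
        tree.add(s)                                         # expansion
        while len(s) < len(action_sets):                    # default-policy rollout
            s = s + (random.choice(action_sets[len(s)]),)
        r = reward_model(s)                                 # surrogate reward
        if r > best_r:
            best, best_r = s, r
        for s_par, a in path:                               # Monte Carlo backup
            N[s_par] = N.get(s_par, 0) + 1
            sa = (s_par, a)
            N[sa] = N.get(sa, 0) + 1
            Q[sa] = Q.get(sa, 0.0) + (r - Q.get(sa, 0.0)) / N[sa]
    return best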

Computational Complexity: The computational complexity per time step associated with learning the surrogate model via the Hedge algorithm, for both representations introduced in Section 3.1, is linear in the number of experts $m$, i.e. in $O(m)$. Moreover, each SA iteration in Algorithm 1 requires $k$ surrogate evaluations of cost $O(m)$ each, so the complexity of Algorithm 1 with $N$ iterations per run is in $O(N k m)$. Finally, the computational complexity of each playout in Algorithm 2 is dominated by one surrogate evaluation at a terminal state, leading to an overall complexity linear in $m$ and the number of playouts per time step. On the other hand, when Bayesian regression is used to learn the surrogate model, NUTS dominates the computation time per independent sample.

4 Experiments and Results

In this section, we measure the performance of the proposed representations when used as a surrogate/reward model in conjunction with search algorithms (SA and MCTS) in BBO and design problems. The learning rate used in the exponential weight updates is selected via the anytime learning rate schedule suggested in Dadkhahi et al. (2020) and Gerchinovitz and Yu (2011) (see Appendix B). The maximum degree of interactions used in our surrogate models is set to two for all the problems; increasing the maximum order improved the results only marginally (see Appendix C). The sparsity parameter in the exponential weight updates is set to the same value in all the experiments, following the choice made in Dadkhahi et al. (2020). Experimentally, the learning algorithm is fairly insensitive to variations in the latter parameter. In each experiment, we report the results averaged over multiple runs, together with one standard error of the mean. The experiments are run on machines with CPU cores from the Intel Xeon E5-2600 v3 family.

BBO Experiments: We compare the performance of our ECO/TCO algorithms in conjunction with SA against two baselines, random search (RS) and simulated annealing (SA), as well as a state-of-the-art Bayesian combinatorial optimization algorithm (COMBO) Oh et al. (2019). In particular, we consider two synthetic benchmarks (the Latin square problem and the pest control problem) and a real-world problem in biology: RNA sequence optimization. In addition to the performance of the algorithms in terms of the best value of $f$ observed until a given time step $t$, we measure the average computation time per time step of our algorithm versus that of COMBO. The decay parameter used in the annealing schedule of SA is fixed in all the experiments, and the number of SA iterations is set separately for ECO and TCO. Intuitively, for the ECO algorithm, each of these parameters creates an exploration-exploitation trade-off: the smaller (larger) the value of the decay parameter or the number of SA iterations, the more exploratory (exploitative) the behavior of SA. The selected values seem to create a reasonable balance; tuning these parameters may improve the performance of the algorithm. On the other hand, the TCO algorithm has an exploration mechanism built into it via Thompson sampling; hence, we use a higher number of SA iterations to maximize exploitation in the AFO.

Synthetic Benchmarks: We consider two synthetic problems: the Latin square problem Colbourn and Dinitz (2006), a commonly used combinatorial optimization benchmark, and the pest control problem considered in Oh et al. (2019) (see Appendix for the latter results). In both problems, we have $n = 25$ categorical variables, each of cardinality $k = 5$. A Latin square of order $k$ is a $k \times k$ matrix of elements in $[k]$, such that each number appears in each row and column exactly once. When $k = 5$, the problem of finding a Latin square has $161{,}280$ solutions in a space of dimensionality $5^{25}$. We formulate the problem of finding a Latin square of order $5$ as a black-box optimization by imposing an additive penalty for any repetition of numbers in any row or column; a function evaluation of zero thus corresponds to a Latin square of order $5$. We consider a noisy version of this problem, where additive Gaussian noise with zero mean and fixed standard deviation is added to the function evaluations observed by each algorithm.

Figure 1 demonstrates the performance of different algorithms, in terms of the best function value found until time $t$, over the evaluation horizon. Both ECO-F and ECO-G outperform the baselines by a considerable margin. In addition, ECO-G outperforms COMBO over the early samples. At larger time steps, COMBO outperforms the other algorithms; however, this performance comes at the price of a far larger computation time. As demonstrated in Table 1, ECO-F and ECO-G offer speed-ups of roughly 114x and 47x, resp., over COMBO. We note that TCO-F (not included in the plot) performs poorly (similar to RS) on this problem, which can be attributed to the strong promotion of sparsity by the regularized horseshoe prior and the fact that the Latin square problem has a dense representation (we observed a similar behavior with the horseshoe prior of Baptista and Poloczek (2018)).

Figure 1: Latin square problem ($n = 25$, $k = 5$).
Method   Latin Square   Pest Cont.   RNA Opt.
COMBO    170.4          151.0        253.8
ECO-F    1.5            1.4          2.0
ECO-G    3.6            3.3          5.7
TCO-F    55.7           53.2         67.0
Table 1: Average computation time per step (in seconds).

RNA Sequence Optimization Problem: Consider an RNA sequence as a string of letters (nucleotides) over the alphabet {A, U, G, C}. A pair of complementary nucleotides can interact with each other and form a base pair, A-U, C-G, and G-U being the energetically stable pairs. Thus, the secondary structure, i.e. the minimum free energy (MFE) structure, of an RNA can be represented by an ensemble of pairing bases. A number of RNA folding algorithms Lorenz et al. (2011); Markham and Zuker (2008) use a thermodynamic model (e.g. Zuker and Stiegler (1981)) and dynamic programming to estimate the MFE of a sequence. However, the time complexity of these algorithms prohibits their use for evaluating substantial numbers of RNA sequences Gould, Hendy, and Papamichail (2014), and exhaustively searching the space to identify the global free energy minimum is infeasible, as the number of sequences grows exponentially as $4^n$ for sequences of length $n$.

We formulate the RNA sequence optimization problem as follows: for a sequence of length $n$, find the RNA sequence which folds into a secondary structure with the lowest MFE. Here the cardinality is $k = 4$. We use the popular RNAfold package Lorenz et al. (2011) to evaluate the MFE for a given sequence. The goal is to find the lowest-MFE sequence by calling the MFE evaluator a minimum number of times. As shown in Figure 2, both ECO-F and particularly ECO-G outperform the baselines as well as COMBO by a large margin. At higher numbers of evaluations, TCO-F beats the rest of the algorithms, which can be attributed to its exploration-exploitation trade-off. See Appendix J for a comparison between TCO-F and BOCS with vanilla one-hot encoding.

Figure 2: RNA BBO problem.
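For reference, a minimal sketch of this black-box objective, assuming the ViennaRNA Python bindings (the RNA module that ships with the ViennaRNA package); RNA.fold returns the MFE structure and its free energy in kcal/mol:

import RNA

ALPHABET = "AUGC"

def mfe_objective(x):
    # Map a categorical point x in [4]^n to the MFE of the RNA sequence.
    seq = "".join(ALPHABET[v] for v in x)
    structure, mfe = RNA.fold(seq)
    return mfe   # lower (more negative) is better, so we minimize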

RNA Design Experiments: The problem is to find a primary RNA sequence which folds into a given target structure $\omega$ under a folding algorithm. Such target structures can be represented as a sequence of dots (for unpaired bases) and brackets (for paired bases). In our algorithm, the action sets are defined as follows: for unpaired sites, the actions are the four nucleotides, and for paired sites, the actions are the energetically stable base pairs. At the beginning of each run of our algorithm (ECO-F/G, TCO-F in conjunction with MCTS), we draw a random permutation for the order of locations to be selected in each level of the tree. The reward value offered by the environment (i.e. the black-box function) at any time step corresponds to the normalized Hamming distance between the target structure $\omega$ and the structure $\hat{\omega}$ of the sequence found by each algorithm, i.e. $r = 1 - d_H(\omega, \hat{\omega})/|\omega|$.
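A sketch of this reward (again assuming the ViennaRNA bindings; the $1 - d_H/|\omega|$ form is our reading of the normalized Hamming reward):

import RNA

def design_reward(seq, target_structure):
    # Fraction of positions where the folded structure matches the target.
    folded, _ = RNA.fold(seq)
    mismatches = sum(a != b for a, b in zip(folded, target_structure))
    return 1.0 - mismatches / len(target_structure)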

We compare the performance of our algorithms against RS as a baseline, where random search is carried out over the given structure (i.e. via the default policy $\pi_0$) rather than over unstructured random sequences. We also include two state-of-the-art algorithms in our experiments: MCTS-RNA Yang et al. (2017) and LEARNA Runge et al. (2019). MCTS-RNA has an exploration parameter, which we tune in advance (per sequence). LEARNA has a set of hyper-parameters tuned a priori using training data, provided in Runge et al. (2019). Note that the latter training phase (for LEARNA) as well as the former exploration parameter tuning (for MCTS-RNA) are offered to the respective algorithms as an advantage, whereas for our algorithms we use a global set of heuristic choices for the two hyper-parameters rather than attempting to tune them. In particular, we set the exploration parameter to fixed values for ECO and TCO, and we set the number of MCTS playouts at each time step proportional to the tree height (i.e. the number of dots and bracket pairs). The latter heuristic choice is made since the bigger the tree, the more playouts are needed to explore the space.

We point out that the entire design pipeline in state-of-the-art algorithms typically also includes a local improvement step (as a post-processing step), which is either a rule-based search (e.g. in Yang et al. (2017)) or an exhaustive search (e.g. in Runge et al. (2019)) over the mismatched sites. We do not include the local improvement step in our experiments, since we are interested in measuring the sample efficiency of the different algorithms. In other words, the question is the following: given a fixed and finite evaluation budget, which algorithm is able to get closer to the target structure?

In our experiments, we focus on three puzzles from the Eterna-100 dataset Anderson-Lee et al. (2016). Two of the selected puzzles, despite their fairly small lengths, are challenging for many algorithms (see Anderson-Lee et al. (2016)). In both puzzles, ECO-F, ECO-G, and TCO-F (with MCTS as AFO) are able to significantly improve the performance of MCTS when a limited number of black-box evaluations is available. All algorithms outperform RS, as expected. Within the given evaluation budget, TCO-F, ECO-G, and especially ECO-F are superior to LEARNA by a substantial margin (see Figure 3). In the third puzzle (Figure 4), again both ECO-G and ECO-F significantly outperform LEARNA over the given number of evaluations. ECO-F is able to outperform LEARNA throughout the evaluation process, and on average finds a far better final solution than LEARNA. With MCTS as AFO, the ECO algorithms outperform TCO, which can be attributed to the exploratory behavior of the latter in both the AF (via TS) and the AFO (via MCTS). See the Appendix for the remaining puzzle.

Figure 3: Design puzzle.
Figure 4: Design puzzle.

5 Conclusions and Future Work

We propose two novel Fourier representations as surrogate models for black-box optimization over categorical variables and show performance improvements over existing baselines when they are combined with search algorithms. We utilize two algorithms to learn such surrogate models. Our ECO algorithm incorporates a computationally efficient online estimator with strong adversarial guarantees (see Dadkhahi et al. (2020) for theoretical results in the Boolean case), which can be shown to carry over to our setting as well. Our TCO algorithm uses a Bayesian regression model with a regularized horseshoe prior and selects queries via Thompson sampling to balance the exploration-exploitation trade-off in surrogate model learning, at the price of a higher computation cost.

Considering the performance variability with respect to different algorithms and representations across different problems, an important research avenue would be to derive an ensemble of such models rather than a single one and investigate potential performance gains. Such an ensemble model would update and explore all models simultaneously and draw samples from individual models or a combination of them at any given time step.

References

  • Anderson-Lee et al. (2016) Anderson-Lee, J.; Fisker, E.; Kosaraju, V.; Wu, M.; Kong, J.; Lee, J.; Lee, M.; Zada, M.; Treuille, A.; and Das, R. 2016. Principles for Predicting RNA Secondary Structure Design Difficulty. Journal of Molecular Biology, 428(5, Part A): 748 – 757. Challenges in RNA Structural Modeling and Design.
  • Angermueller et al. (2020) Angermueller, C.; Dohan, D.; Belanger, D.; Deshpande, R.; Murphy, K.; and Colwell, L. 2020. Model-based reinforcement learning for biological sequence design. In ICLR.
  • Angermüller et al. (2020) Angermüller, C.; Belanger, D.; Gane, A.; Mariet, Z.; Dohan, D.; Murphy, K.; Colwell, L. J.; and Sculley, D. 2020. Population-Based Black-Box Optimization for Biological Sequence Design. In ICML.
  • Arnold (1998) Arnold, F. H. 1998. Design by directed evolution. Accounts of chemical research, 31(3): 125–131.
  • Bergstra et al. (2011) Bergstra, J.; Bardenet, R.; Bengio, Y.; and Kégl, B. 2011. Algorithms for Hyper-Parameter Optimization. In NeurIPS.
  • Bogard et al. (2019) Bogard, N.; Linder, J.; Rosenberg, A. B.; and Seelig, G. 2019. A deep neural network for predicting and engineering alternative polyadenylation. Cell, 178(1): 91–106.
  • Brookes, Park, and Listgarten (2019) Brookes, D.; Park, H.; and Listgarten, J. 2019. Conditioning by adaptive sampling for robust design. In ICML.
  • Buchan and Stansfield (2007) Buchan, J. R.; and Stansfield, I. 2007. Halting a cellular production line: responses to ribosomal pausing during translation. Biology of the Cell, 99(9): 475–487.
  • Carvalho, Polson, and Scott (2010) Carvalho, C. M.; Polson, N. G.; and Scott, J. G. 2010. The horseshoe estimator for sparse signals. Biometrika, 97(2): 465–480.
  • Colbourn and Dinitz (2006) Colbourn, C. J.; and Dinitz, J. H. 2006. Handbook of Combinatorial Designs, Second Edition (Discrete Mathematics and Its Applications). Chapman & Hall/CRC.
  • Dadkhahi et al. (2020) Dadkhahi, H.; Shanmugam, K.; Rios, J.; Das, P.; Hoffman, S.; Loeffler, T. D.; and Sankaranarayanan, S. 2020. Combinatorial Black-Box Optimization with Expert Advice. In KDD.
  • Davis et al. (2008) Davis, M.; Sagan, S. M.; Pezacki, J. P.; Evans, D. J.; and Simmonds, P. 2008. Bioinformatic and physical characterizations of genome-scale ordered RNA structure in mammalian RNA viruses. Journal of virology, 82(23): 11824–11836.
  • Deshwal, Belakaria, and Doppa (2021) Deshwal, A.; Belakaria, S.; and Doppa, J. R. 2021. Mercer Features for Efficient Combinatorial Bayesian Optimization. AAAI, 35(8): 7210–7218.
  • Deshwal et al. (2020) Deshwal, A.; Belakaria, S.; Doppa, J. R.; and Fern, A. 2020. Optimizing Discrete Spaces via Expensive Evaluations: A Learning to Search Framework. In AAAI.
  • Dill and MacCallum (2012) Dill, K. A.; and MacCallum, J. L. 2012. The protein-folding problem, 50 years on. Science, 338(6110): 1042–1046.
  • Dixon et al. (2010) Dixon, N.; Duncan, J. N.; Geerlings, T.; Dunstan, M. S.; McCarthy, J. E.; Leys, D.; and Micklefield, J. 2010. Reengineering orthogonally selective riboswitches. Proceedings of the National Academy of Sciences, 107(7): 2830–2835.
  • Eastman et al. (2018) Eastman, P.; Shi, J.; Ramsundar, B.; and Pande, V. S. 2018. Solving the RNA design problem with reinforcement learning. PLoS computational biology, 14(6): e1006176.
  • Fannjiang and Listgarten (2020) Fannjiang, C.; and Listgarten, J. 2020. Autofocused oracles for model-based design. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M. F.; and Lin, H., eds., NeurIPS.
  • Garrido-Merchán and Hernández-Lobato (2020) Garrido-Merchán, E. C.; and Hernández-Lobato, D. 2020. Dealing with categorical and integer-valued variables in bayesian optimization with gaussian processes. Neurocomputing, 380: 20–35.
  • Gelman et al. (2013) Gelman, A.; Carlin, J. B.; Stern, H. S.; Dunson, D. B.; Vehtari, A.; and Rubin, D. B. 2013. Bayesian Data Analysis. Chapman and Hall/CRC, 3rd ed. edition.
  • Geman and Geman (1984) Geman, S.; and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on pattern analysis and machine intelligence, 6: 721–741.
  • Gerchinovitz and Yu (2011) Gerchinovitz, S.; and Yu, J. Y. 2011. Adaptive and Optimal Online Linear Regression on ℓ¹-Balls. In Algorithmic Learning Theory.
  • Golovin et al. (2017) Golovin, D.; Solnik, B.; Moitra, S.; Kochanski, G.; Karro, J.; and Sculley, D. 2017. Google vizier: A service for black-box optimization. In KDD, 1487–1495.
  • Gómez-Bombarelli et al. (2018) Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and Aspuru-Guzik, A. 2018. Automatic chemical design using a data-driven continuous representation of molecules. ACS central science, 4(2): 268–276.
  • Gould, Hendy, and Papamichail (2014) Gould, N.; Hendy, O.; and Papamichail, D. 2014. Computational tools and algorithms for designing customized synthetic genes. Frontiers in bioengineering and biotechnology, 2: 41.
  • Gupta and Zou (2019) Gupta, A.; and Zou, J. 2019. Feedback GAN for DNA optimizes protein functions. Nature Machine Intelligence, 1(2): 105–111.
  • Hofacker (2003) Hofacker, I. L. 2003. Vienna RNA secondary structure server. Nucleic acids research, 31(13): 3429–3431.
  • Hofacker et al. (1994) Hofacker, I. L.; Fontana, W.; Stadler, P. F.; Bonhoeffer, L. S.; Tacker, M.; and Schuster, P. 1994. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly, 125(2): 167–188.
  • Hoffman and Gelman (2014) Hoffman, M. D.; and Gelman, A. 2014. The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(47): 1593–1623.
  • Holland and Reitman (1978) Holland, J. H.; and Reitman, J. S. 1978. Cognitive systems based on adaptive algorithms. In Pattern-directed inference systems, 313–329. Elsevier.
  • Hoshika et al. (2019) Hoshika, S.; Leal, N. A.; Kim, M.-J.; Kim, M.-S.; Karalkar, N. B.; Kim, H.-J.; Bates, A. M.; Watkins, N. E.; SantaLucia, H. A.; Meyer, A. J.; et al. 2019. Hachimoji DNA and RNA: A genetic system with eight building blocks. Science, 363(6429): 884–887.
  • Hutter, Hoos, and Leyton-Brown (2011) Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2011. Sequential Model-Based Optimization for General Algorithm Configuration. In International Conference on Learning and Intelligent Optimization, 507–523.
  • Kennedy and Eberhart (1995) Kennedy, J.; and Eberhart, R. 1995. Particle swarm optimization. In International Conference on Neural Networks (ICNN), volume 4, 1942–1948 vol.4.
  • Killoran et al. (2017) Killoran, N.; Lee, L. J.; Delong, A.; Duvenaud, D.; and Frey, B. J. 2017. Generating and designing DNA with deep generative models. arXiv:1712.06148.
  • Kirkpatrick, Gelatt, and Vecchi (1983) Kirkpatrick, S.; Gelatt, C. D.; and Vecchi, M. P. 1983. Optimization by Simulated Annealing. Science, 220(4598): 671–680.
  • Kocsis and Szepesvári (2006) Kocsis, L.; and Szepesvári, C. 2006. Bandit Based Monte-Carlo Planning. In European Conference on Machine Learning (ECML), 282–293.
  • Leprêtre et al. (2019) Leprêtre, F.; Verel, S.; Fonlupt, C.; and Marion, V. 2019. Walsh functions as surrogate model for pseudo-boolean optimization problems. In The Genetic and Evolutionary Computation Conference (GECCO 2019), 303–311. ACM.
  • Li et al. (2015) Li, H.; Lee, T.; Dziubla, T.; Pi, F.; Guo, S.; Xu, J.; Li, C.; Haque, F.; Liang, X.-J.; and Guo, P. 2015. RNA as a stable polymer to build controllable and defined nanostructures for material and biomedical applications. Nano today, 10(5): 631–655.
  • Linder and Seelig (2020) Linder, J.; and Seelig, G. 2020. Fast differentiable DNA and protein sequence optimization for molecular design. Arxiv:2005.11275.
  • Liu et al. (2020) Liu, G.; Zeng, H.; Mueller, J.; Carter, B.; Wang, Z.; Schilz, J.; Horny, G.; Birnbaum, M. E.; Ewert, S.; and Gifford, D. K. 2020. Antibody complementarity determining region design using high-capacity machine learning. Bioinformatics, 36(7): 2126–2133.
  • Lorenz et al. (2011) Lorenz, R.; Bernhart, S. H.; Höner zu Siederdissen, C.; Tafer, H.; Flamm, C.; Stadler, P. F.; and Hofacker, I. L. 2011. ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1): 26.
  • Makalic and Schmidt (2016) Makalic, E.; and Schmidt, D. F. 2016. A Simple Sampler for the Horseshoe Estimator. IEEE Signal Processing Letters, 23(1): 179–182.
  • Markham and Zuker (2008) Markham, N. R.; and Zuker, M. 2008. UNAFold. In Bioinformatics, 3–31. Springer.
  • Moss et al. (2020) Moss, H.; Leslie, D.; Beck, D.; González, J.; and Rayson, P. 2020. BOSS: Bayesian Optimization over String Spaces. In NeurIPS.
  • Ng et al. (2019) Ng, A. H.; Nguyen, T. H.; Gómez-Schiavon, M.; Dods, G.; Langan, R. A.; Boyken, S. E.; Samson, J. A.; Waldburger, L. M.; Dueber, J. E.; Baker, D.; et al. 2019. Modular and tunable biological feedback control using a de novo protein switch. Nature, 572(7768): 265–269.
  • Oh et al. (2019) Oh, C.; Tomczak, J.; Gavves, E.; and Welling, M. 2019. Combinatorial Bayesian Optimization using the Graph Cartesian Product. In NeurIPS.
  • Piironen and Vehtari (2017) Piironen, J.; and Vehtari, A. 2017. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2): 5018–5051.
  • Pincus (1970) Pincus, M. 1970. Letter to the editor—a Monte Carlo method for the approximate solution of certain types of constrained optimization problems. Operations Research, 18(6): 1225–1228.
  • Portela (2018) Portela, F. 2018. An unexpectedly effective Monte Carlo technique for the RNA inverse folding problem. BioRxiv, 345587.
  • Baptista and Poloczek (2018) Baptista, R.; and Poloczek, M. 2018. Bayesian Optimization of Combinatorial Structures. In ICML.
  • Runge et al. (2019) Runge, F.; Stoll, D.; Falkner, S.; and Hutter, F. 2019. Learning to Design RNA. In ICLR.
  • Shahriari et al. (2015) Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and De Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1): 148–175.
  • Sinai et al. (2021) Sinai, S.; Wang, R.; Whatley, A.; Slocum, S.; Locane, E.; and Kelsic, E. 2021. AdaLead: A simple and robust adaptive greedy search algorithm for sequence design.
  • Spears (1993) Spears, W. M. 1993. Simulated Annealing for Hard Satisfiability Problems. In DIMACS Workshop: Cliques, Coloring, and Satisfiability, 533–557.
  • Stephens et al. (2015) Stephens, Z. D.; Lee, S. Y.; Faghri, F.; Campbell, R. H.; Zhai, C.; Efron, M. J.; Iyer, R.; Schatz, M. C.; Sinha, S.; and Robinson, G. E. 2015. Big data: astronomical or genomical? PLoS biology, 13(7): e1002195.
  • Swersky et al. (2020) Swersky, K.; Rubanova, Y.; Dohan, D.; and Murphy, K. 2020. Amortized Bayesian Optimization over Discrete Spaces. In UAI.
  • Taneda (2015) Taneda, A. 2015. Multi-objective optimization for RNA design with multiple target secondary structures. BMC bioinformatics, 16(1): 280.
  • Terras (1999) Terras, A. 1999. Fourier Analysis on Finite Groups and Applications. London Mathematical Society Student Texts. Cambridge University Press.
  • Thompson (1933) Thompson, W. R. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4): 285–294.
  • Thompson (1935) Thompson, W. R. 1935. On the Theory of Apportionment. American Journal of Mathematics, 57: 450.
  • Tripp, Daxberger, and Hernández-Lobato (2020) Tripp, A.; Daxberger, E.; and Hernández-Lobato, J. M. 2020. Sample-Efficient Optimization in the Latent Space of Deep Generative Models via Weighted Retraining. In Advances in Neural Information Processing Systems.
  • Trotta (2014) Trotta, E. 2014. On the normalization of the minimum free energy of RNAs by sequence length. PloS one, 9(11).
  • Yamagami et al. (2019) Yamagami, R.; Kayedkhordeh, M.; Mathews, D. H.; and Bevilacqua, P. C. 2019. Design of highly active double-pseudoknotted ribozymes: a combined computational and experimental study. Nucleic acids research, 47(1): 29–42.
  • Yang et al. (2017) Yang, X.; Yoshizoe, K.; Taneda, A.; and Tsuda, K. 2017. RNA inverse folding using Monte Carlo tree search. BMC Bioinformatics, 8: 468.
  • Zadeh et al. (2011) Zadeh, J. N.; Steenberg, C. D.; Bois, J. S.; Wolfe, B. R.; Pierce, M. B.; Khan, A. R.; Dirks, R. M.; and Pierce, N. A. 2011. NUPACK: Analysis and design of nucleic acid systems. Journal of computational chemistry, 32(1): 170–173.
  • Zuker and Stiegler (1981) Zuker, M.; and Stiegler, P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic acids research, 9(1): 133–148.

Appendix A Proof of Theorem 3.1

We first assume that the one-hot variables take values in $\{0, 1\}$. Plugging different choices of $x$ into Equation (2) leads to a system of linear equations with the coefficients $\alpha$ as unknowns and one equation per choice of $x$. We can express this system in matrix form as the product of the matrix of monomials (where each column corresponds to a monomial $\psi$, and each element corresponds to the evaluation of that monomial at a given choice for $x$) and the vector of unknown coefficients, which is set equal to the function values at the corresponding choices for $x$. We claim that there exists a permutation of the rows and columns of the matrix of monomials such that the latter becomes unit lower triangular, and is thereby full rank. As a result, the representation in (2) for $f$ is complete and unique.

To formally show that this permutation of the monomials' matrix exists, we use a construction by induction over the number of variables included in the representation. We denote the monomials' matrix over the first $i$ variables by $M_i$, and define the one-hot variables $z_{i,1}, \ldots, z_{i,k-1}$ as the descendants of the parent categorical variable $x_i$. It is easy to see that such a construction for the base case of only one variable exists, as the monomials' matrix is a $k \times k$ matrix. In this case, we use the following permutation of the monomials (columns): $1, z_{1,1}, \ldots, z_{1,k-1}$. We also use the following permutation of the values in rows: $x_1 = k, 1, \ldots, k-1$. As a result, in this matrix only the elements of the first column and the main diagonal are non-zero and equal to one, and thus the matrix is unit lower triangular.
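For concreteness, the base case for $k = 3$ (an instantiation of the construction just described, with $\{0,1\}$-valued indicators, worked out by us) reads:

% columns: monomials 1, z_{1,1}, z_{1,2}; rows: x_1 = 3, 1, 2.
% The row x_1 = 3 is the all-zero encoding, so only the constant column is 1.
\[
M_1 =
\begin{pmatrix}
1 & 0 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1
\end{pmatrix}
\]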

Assuming that the induction hypothesis holds for $n$ variables, we show that it also holds for $n+1$ variables. Starting from a unit lower triangular matrix $M_n$, we can construct the matrix $M_{n+1}$ as follows. Note that all the columns of $M_n$ correspond to the monomials composed of the descendants of the first $n$ variables, whereas each of the additional columns introduced in $M_{n+1}$ involves exactly one term from the descendants of $x_{n+1}$ (possibly also containing factors from the previous variables). We can express the number of additional columns using the following binomial expansion:

$$k^{n+1} - k^{n} \;=\; (k-1)\,k^{n} \;=\; \sum_{j=1}^{n+1} \binom{n}{j-1}\,(k-1)^{j} \qquad (10)$$

In words, the additional columns can be considered as a collection of $j$-th order monomials ($j \in \{1, \ldots, n+1\}$), each of which includes one out of the $k-1$ descendant variables of $x_{n+1}$ together with $j-1$ variables from the (descendants of the) previous $n$ variables. Each of the latter variables can take values in $\{1, \ldots, k-1\}$, whereas the remaining variables are set to $k$ (i.e., all their descendants are zero).

Starting with $j = 1$, we have $k-1$ first order terms, which we assign to columns $k^n + 1, \ldots, k^n + k - 1$. The values associated with the corresponding rows are constructed by assuming that $x_{n+1}$ takes values from $1$ to $k-1$ (in order), while all the remaining variables are set to $k$. As a consequence of the former, all the elements on the main diagonal are ones; as a consequence of the latter, all the higher degree monomials involving $x_{n+1}$ (which occupy the elements after the diagonal ones) are equal to zero. Thus the augmented matrix, up to this point, remains unit lower triangular.

We then consider the second degree terms ($j = 2$), where we have $n(k-1)$ terms (a choice of one out of the $n$ other variables, each taking values in $\{1, \ldots, k-1\}$) for each of the $k-1$ choices for the descendant of $x_{n+1}$. Starting with second order monomials involving $x_{n+1}$ and $x_1$, we again assume that all the remaining variables are equal to $k$ (this assumption is necessary in order to ensure that monomials in future columns for the same row evaluate to zero; choices of $x$ for which this assumption does not hold are addressed in subsequent rows). For any such choice of descendants, we add a new column corresponding to the resulting second order monomial, as well as a new row in which $x_{n+1}$ and $x_1$ take the corresponding values, whereas the remaining variables are set to $k$. As a result, the diagonal element in the new row/column is equal to one, and all the elements in the future columns are zeros (the elements in previous columns of the same row corresponding to monomials involving the selected variables may be non-zero as well), since any combination of one of the descendants of $x_{n+1}$ with the remaining variables (any variable except $x_{n+1}$ and $x_1$) is zero. We continue this construction strategy for the remaining variables until all the second degree terms involving $x_{n+1}$ and one out of the remaining variables are exhausted. We then repeat the same idea for terms with orders up to $n+1$, as defined by the binomial expansion in (10). As a result of this construction strategy, the monomial matrix $M_{n+1}$ is unit lower triangular.

Now, we use this result to show the completeness and uniqueness of the representation with one-hot variables in $\{-1, 1\}$ in the following way.

Completeness: We showed that the representation (2) over $\{0, 1\}$ is complete, i.e. we can express any function using the representation in (2), where we have at most one descendant term from the same parent in each monomial. Now, we replace each variable $x_{ij}$ (from $\{0, 1\}$) in the latter representation with $\frac{1 - z_{ij}}{2}$, where $z_{ij} \in \{-1, 1\}$. The new representation can also be expressed via the expansion (2), since no two descendants from the same parent variable are multiplied with one another.

Uniqueness: Assume that the uniqueness condition is not satisfied. Then we have two distinct polynomial representations $g$ and $h$ that have the same value for every point in the domain. However, since $g$ and $h$ are distinct polynomials, $g - h$ has a non-zero coefficient vector and, by the full-rank property established above, is non-zero at some input $x_0$. This implies that $g(x_0) - h(x_0)$ is also non-zero, which is a contradiction, and the proof is complete.

Remark 1.

The group-theoretic Fourier representation defined in Equation (6) is unique and complete.

Proof.

Let $\mathcal{X} = \mathbb{Z}_k^n$ be the categorical domain. Let the true function be $f: \mathcal{X} \rightarrow \mathbb{R}$. For generality, let us consider a complex-valued function $f: \mathcal{X} \rightarrow \mathbb{C}$, where $\mathbb{C}$ is the field of complex numbers. The basis functions are the characters defined in Equation (6). Now, one can view a function as a $k^n$-length vector, with one entry for the evaluation of the function at each point in the domain $\mathcal{X}$. We denote the vector for function $f$, thus obtained, by $v_f$. Similarly, denote the vector of evaluations of the basis function $\psi_z$ by $v_{\psi_z}$. Let $\Psi$ be a $k^n \times k^n$ matrix created by stacking all vectors corresponding to the basis functions as columns. Then, the Fourier representation can be written as $v_f = \Psi \alpha$, where $\alpha$ is the vector of Fourier coefficients in our group-theoretic representation. Now, due to the use of complex exponentials, one can show that $\langle v_{\psi_z}, v_{\psi_{z'}} \rangle = 0$ if $z \neq z'$. Therefore, the columns of the matrix $\Psi$ are orthogonal. Hence, $\Psi$ is a full rank matrix. Therefore, our representation is merely representing a vector in another full-rank orthogonal basis. Hence, it is unique and complete. ∎
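As a quick numerical sanity check of this argument, the following snippet (a sketch, assuming the standard characters over $\mathbb{Z}_k^n$, $\psi_z(x) = \exp(2\pi i \langle z, x \rangle / k)$, for Equation (6)) builds the character matrix for a toy domain with $n = 2$ and $k = 3$, and verifies that its columns are orthogonal and hence full rank:

import itertools
import numpy as np

n, k = 2, 3                        # toy domain Z_3^2
points = list(itertools.product(range(k), repeat=n))
omega = np.exp(2j * np.pi / k)     # primitive k-th root of unity

# Psi[x, z] = prod_j omega^(x_j * z_j): one column per character index z
Psi = np.array([[omega ** np.dot(x, z) for z in points] for x in points])

gram = Psi.conj().T @ Psi          # Gram matrix of the columns
assert np.allclose(gram, k ** n * np.eye(k ** n))  # orthogonal columns
print(np.linalg.matrix_rank(Psi))  # -> 9, i.e. full rank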

Appendix B Description of Algorithms

Surrogate Model Learning Algorithm via Online Regression: Let $n$ and $k$ denote the number of variables and the cardinality of each variable, respectively. The surrogate models used in ECO-F and ECO-G correspond to approximations of the representations given in (2) and (8), respectively, where each approximation is obtained by restricting the maximum order of interactions among variables to $m$. We consider each term in the latter surrogate models, i.e. monomials from (3) in ECO-F and characters from (9) in ECO-G, as an expert, denoted by $\psi_i$ ($i \in \{1, \ldots, M\}$). The number of such experts in ECO-F coincides with the dimensionality $k^n$ of the space when $m = n$, whereas the number of experts in ECO-G is that of the order-$m$ group-theoretic expansion. The coefficient of each expert is designated by $\alpha_i$. Since the exponential weights, utilized to update the coefficients $\alpha_i$, are non-negative, we maintain two non-negative coefficients $\alpha_i^+$ and $\alpha_i^-$, which yield $\alpha_i = \alpha_i^+ - \alpha_i^-$.

We initialize all the coefficients with a uniform prior, i.e. $\alpha_i^{\pm} = \frac{\lambda}{2M}$ ($i \in \{1, \ldots, M\}$). In each time step $t$, we draw a sample $x_t$ via Algorithm 1 with respect to our current estimate of the surrogate model $\hat{f}_t$. The latter sample is then plugged into the black-box function to obtain the evaluation $f(x_t)$. This leads to a mixture loss, defined as the difference between the evaluations obtained by our surrogate model and by the black-box function for the query $x_t$. Using this mixture loss, we compute the individual loss incurred by each expert $\psi_i$. Finally, we update each coefficient in the model via an exponential weight obtained according to its incurred individual loss. We repeat this process until the stopping criteria are met; a minimal Python sketch of this update loop follows Algorithm 3 below.

1:  Inputs: sparsity $\lambda$, max model order $m$
2:  $\alpha_i^+ \leftarrow \frac{\lambda}{2M}$, $\alpha_i^- \leftarrow \frac{\lambda}{2M}$ for all $i \in \{1, \ldots, M\}$; $t \leftarrow 1$
3:  repeat
4:     $x_t \leftarrow$ minimizer of $\hat{f}_t$ via Algorithm 1 or Algorithm 2
5:     Observe $y_t = f(x_t)$
6:     $\hat{y}_t \leftarrow \hat{f}_t(x_t)$
7:     Update the learning rate $\eta_t$ via Equation (12)
8:     for $i \in \{1, \ldots, M\}$ do
9:        $\ell_{t,i} \leftarrow (\hat{y}_t - y_t)\, \psi_i(x_t)$
10:        $\alpha_i^+ \leftarrow \alpha_i^+ \exp(-\eta_t\, \ell_{t,i})$
11:        $\alpha_i^- \leftarrow \alpha_i^- \exp(+\eta_t\, \ell_{t,i})$
12:     end for
13:     Normalize so that $\sum_i (\alpha_i^+ + \alpha_i^-) = \lambda$; $t \leftarrow t + 1$
14:  until Stopping Criteria
15:  return best queried point $x^*$
Algorithm 3 Expert Categorical Optimization
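For concreteness, the following Python sketch implements one pass of this update over monomial experts. The feature construction, the anytime rate $\eta_t = \sqrt{\ln(2M)/t}$, and the normalization onto an $\ell_1$-ball of radius $\lambda$ are illustrative assumptions rather than our exact choices:

import itertools
import math
import numpy as np

def monomial_features(x_onehot, max_order=2):
    """Abridged one-hot Fourier features: products of at most `max_order`
    one-hot variables, with at most one descendant per parent variable.
    `x_onehot` is a list of per-variable one-hot vectors with entries in {-1, +1}."""
    feats = [1.0]  # the constant character
    n = len(x_onehot)
    for order in range(1, max_order + 1):
        for parents in itertools.combinations(range(n), order):
            for choice in itertools.product(*(x_onehot[p] for p in parents)):
                feats.append(float(np.prod(choice)))
    return np.array(feats)

def ew_update(alpha_pos, alpha_neg, feats, y_true, t, lam=1.0):
    """One exponential-weights step over the experts; alpha_i = alpha_i^+ - alpha_i^-."""
    M = len(feats)
    y_hat = float(np.dot(alpha_pos - alpha_neg, feats))  # surrogate prediction
    eta = math.sqrt(math.log(2 * M) / t)                 # assumed anytime rate
    loss = (y_hat - y_true) * feats                      # per-expert individual losses
    alpha_pos = alpha_pos * np.exp(-eta * loss)
    alpha_neg = alpha_neg * np.exp(+eta * loss)
    z = (alpha_pos.sum() + alpha_neg.sum()) / lam        # renormalize to the l1-ball
    return alpha_pos / z, alpha_neg / z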

Number of Characters in Representations: The number of terms in the vanilla one-hot encoded Fourier representation is $2^{nk}$, whereas our abridged representation reduces this number to $k^n$, matching the space dimensionality and thereby making the algorithms computationally tractable and efficient. When a max degree of $m$ is used in the approximate representation, the number of terms in the abridged representation is equal to $\sum_{i=0}^{m} \binom{n}{i} (k-1)^i$. The corresponding number in a vanilla one-hot encoded representation is equal to $\sum_{i=0}^{m} \binom{nk}{i}$. Finally, the numbers of terms in the full and order-$m$ group-theoretic Fourier expansions are equal to $k^n$ and $\sum_{i=0}^{m} \binom{n}{i} (k-1)^i$, respectively.
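The snippet below evaluates these counts for the problem sizes used in our experiments (Table 2); the order-$m$ expression mirrors the binomial truncation stated above:

from math import comb

def vanilla_full(n, k):
    return 2 ** (n * k)  # Boolean Fourier basis over n*k one-hot bits

def abridged(n, k, m):
    # at most one of the (k - 1) retained descendants per parent, order <= m
    return sum(comb(n, i) * (k - 1) ** i for i in range(m + 1))

for n, k in [(25, 5), (30, 4)]:
    print(f"n={n}, k={k}: vanilla={vanilla_full(n, k):.2e}, "
          f"order-2 abridged={abridged(n, k, 2)}, domain size={float(k ** n):.2e}")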

Surrogate Model Learning Algorithm via Sparse Bayesian Regression: We use the surrogate model $\hat{f}(x) = \sum_{i=1}^{M} \alpha_i \psi_i(x)$ to approximate the black-box function, where $\psi_i$ are the characters of the representation, $M$ is the number of characters in the representation, and $\alpha_i$ are the coefficients. We utilize the abridged one-hot encoded Fourier representation for this model due to its smaller number of characters, which makes it much more scalable. We assume that the one-hot encoded variables take values in $\{-1, 1\}$. Note that, from the proof of the abridged one-hot encoded Fourier representation in Appendix A, both assignments $\{0, 1\}$ and $\{-1, 1\}$ for one-hot encoded variables are permissible in the latter representation.

We assume that the approximation errors between the black-box evaluations at points $x^{(i)}$ and our surrogate function values are independent and identically distributed as follows:

$$f\big(x^{(i)}\big) - \hat{f}\big(x^{(i)}\big) = \epsilon_i, \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2). \qquad (11)$$

We estimate the coefficients of our surrogate approximation using Bayesian linear regression over the black-box evaluation data seen so far. To avoid high-variance estimates of the coefficients due to data scarcity, sparsity-inducing priors are recommended in Baptista and Marzouk (2018). We use the regularized horseshoe model Piironen and Vehtari (2017), defined as follows:

$$\alpha_i \mid \lambda_i, \tau, c \;\sim\; \mathcal{N}\big(0,\; \tau^2 \tilde{\lambda}_i^2\big), \qquad \tilde{\lambda}_i^2 = \frac{c^2 \lambda_i^2}{c^2 + \tau^2 \lambda_i^2}, \qquad \lambda_i \sim C^{+}(0, 1), \qquad \tau \sim C^{+}\!\Big(0,\; \frac{p_0}{M - p_0} \frac{\sigma}{\sqrt{N}}\Big),$$

where $C^{+}(0, 1)$ is the standard half-Cauchy distribution, and $N$ is the number of observations seen so far. In this model, the global ($\tau$) and the local ($\lambda_i$) hyper-parameters individually shrink the magnitudes of the regression coefficients $\alpha_i$. The hyper-parameter $p_0$ designates the expected number of relevant coefficients, but since the sparsity induced by the corresponding threshold is only weakly sensitive to $p_0$, a rough estimate is typically sufficient. In our experiments, we fix $p_0$ and the remaining hyper-parameters to constant values across all tasks.
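A minimal sketch of this model in PyMC follows (not our exact implementation); the values of p0, sigma, and the slab hyper-parameters below are assumed placeholders:

import numpy as np
import pymc as pm

def fit_regularized_horseshoe(X, y, p0=5, sigma=1.0, slab_scale=2.0, slab_df=4):
    """Posterior over Fourier coefficients under a regularized horseshoe prior."""
    N, M = X.shape
    tau0 = p0 / (M - p0) * sigma / np.sqrt(N)  # scale of the global shrinkage
    with pm.Model():
        tau = pm.HalfCauchy("tau", beta=tau0)            # global shrinkage
        lam = pm.HalfCauchy("lam", beta=1.0, shape=M)    # local shrinkage
        c2 = pm.InverseGamma("c2", alpha=slab_df / 2,
                             beta=slab_df * slab_scale ** 2 / 2)
        lam_t = pm.math.sqrt(c2 * lam ** 2 / (c2 + tau ** 2 * lam ** 2))
        alpha = pm.Normal("alpha", mu=0.0, sigma=tau * lam_t, shape=M)
        pm.Normal("y", mu=pm.math.dot(X, alpha), sigma=sigma, observed=y)
        # pm.sample uses NUTS by default for continuous parameters
        return pm.sample(draws=500, tune=500, target_accept=0.95)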

Given the above Bayesian model, the TCO-F algorithm works as follows. At each time step $t$, we draw a sample for the coefficients $\alpha_i$ ($i \in \{1, \ldots, M\}$) from the posterior. We use the No-U-Turn Sampler (NUTS) Hoffman and Gelman (2014) in order to sample the coefficients efficiently. Given the latter sample, we then construct the approximation function instance $\tilde{f}_t(x) = \sum_i \alpha_i \psi_i(x)$. Following Baptista and Marzouk (2018), we then select our next point for black-box evaluation by minimizing a regularized version of this approximation function, i.e. $x_t \in \arg\min_x \tilde{f}_t(x) + \lambda_r \|x\|_1$, where $\|x\|_1$ designates the $\ell_1$-norm of $x$ and $\lambda_r$ is a regularization hyperparameter, held fixed throughout our experiments. The AFO in either Algorithm 1 or Algorithm 2 is then used to carry out this minimization for the generic BBO or the design problem, respectively.
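One TCO-F iteration can then be sketched as follows, where posterior_sample stands in for a single NUTS draw and afo for the acquisition function optimizer (both hypothetical helpers, with an assumed placeholder for the regularization weight):

import numpy as np

def tco_f_step(posterior_sample, features, afo, lam_reg=1e-2):
    """One Thompson-sampling step: draw coefficients, then minimize the
    regularized sampled surrogate to pick the next query point."""
    alpha = posterior_sample()  # a single draw of the coefficient vector

    def acquisition(x):
        # sampled surrogate value plus an l1 penalty on the encoded point
        penalty = lam_reg * np.abs(np.asarray(x, dtype=float)).sum()
        return float(np.dot(alpha, features(x))) + penalty

    return afo(acquisition)  # minimize via SA (Algorithm 1) or MCTS (Algorithm 2)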

MCTS Algorithm: For a complete version of Algorithm 2, see Algorithm 4; a brief Python sketch of one episode follows the listing.

1:  Inputs: surrogate reward model $\hat{f}$, exploration parameter $c$, search tree $\mathcal{T}$
2:  $t \leftarrow 0$, $x^* \leftarrow \emptyset$
3:  repeat
4:     Initialize episode: $s \leftarrow$ root of $\mathcal{T}$, path $P \leftarrow (s)$
5:     while $s$ is fully expanded do
6:        $a \leftarrow \arg\max_{a'} \; Q(s, a') + c \sqrt{\ln N(s) / N(s, a')}$
7:        $s \leftarrow$ child of $s$ reached via action $a$
8:        $P \leftarrow P \cup \{s\}$
9:     end while
10:     Expand $s$ by adding its children to $\mathcal{T}$
11:     $s \leftarrow$ a newly added child of $s$, selected at random
12:     $P \leftarrow P \cup \{s\}$,  $x \leftarrow$ partial assignment at $s$
13:     repeat
14:        $a \leftarrow$ a uniformly random action
15:        $x \leftarrow x$ extended by $a$
16:        $s \leftarrow$ child of $s$ reached via action $a$
17:     until $x$ is terminal
18:     $r \leftarrow -\hat{f}(x)$
19:     $t \leftarrow t + 1$
20:     repeat
21:        $N(s) \leftarrow N(s) + 1$
22:        $Q(s) \leftarrow Q(s) + (r - Q(s)) / N(s)$
23:        $s \leftarrow$ parent of $s$; $a \leftarrow$ visited action on $P$
24:     until $s$ is the root node
25:     if $\hat{f}(x) < \hat{f}(x^*)$ then
26:        $x^* \leftarrow x$
27:     end if
28:  until Stopping Criteria
29:  return $x^*$
Algorithm 4 MCTS with Surrogate Reward Model
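A compact Python sketch of one episode of Algorithm 4 (UCB selection, expansion, random rollout, surrogate-scored backup) is given below; the tree-policy constant, rollout policy, and reward sign convention are assumptions made for illustration:

import math
import random

class Node:
    def __init__(self, prefix):
        self.prefix = prefix   # partial sequence decided so far
        self.children = {}     # action -> Node
        self.visits = 0
        self.value = 0.0       # running mean of backed-up rewards

def mcts_step(root, surrogate, k, length, c=1.4, rng=random.Random(0)):
    """One MCTS episode using the surrogate as the (cheap) reward model."""
    # Selection: descend while the node is fully expanded
    node, path = root, [root]
    while node.children and len(node.children) == k:
        node = max(node.children.values(),
                   key=lambda ch: ch.value + c * math.sqrt(
                       math.log(node.visits + 1) / (ch.visits + 1)))
        path.append(node)
    # Expansion: add one untried child (unless already at full depth)
    if len(node.prefix) < length:
        action = rng.choice([a for a in range(k) if a not in node.children])
        child = Node(node.prefix + [action])
        node.children[action] = child
        node, path = child, path + [child]
    # Simulation: random rollout to a terminal sequence
    x = list(node.prefix)
    while len(x) < length:
        x.append(rng.randrange(k))
    reward = -surrogate(x)   # lower surrogate value = higher reward
    # Backpropagation: update statistics along the visited path
    for v in path:
        v.visits += 1
        v.value += (reward - v.value) / v.visits
    return x, reward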

Learning Rate in ECO: The anytime learning rate $\eta_t$ (at time step $t$) used in Algorithm 3 is given by Gerchinovitz and Yu (2011); Dadkhahi et al. (2020):

(12)

where the constituent quantities are defined as in the cited works.

Appendix C Continued Related Work

Meta-heuristics: A variety of discrete search algorithms and meta-heuristics have been studied in the literature for combinatorial optimization over categorical variables. Such algorithms, including Genetic Algorithms Holland and Reitman (1978), Simulated Annealing Spears (1993), and Particle Swarms Kennedy and Eberhart (1995), are generally inefficient at finding global minima. In the context of biological sequence optimization, the most popular method is directed evolution Arnold (1998), which explores the space by making only small mutations to existing sequences. Another promising recent approach consists of fitting a neural network model to predict the black-box function and then applying gradient ascent to the latter model Killoran et al. (2017); Bogard et al. (2019); Liu et al. (2020). This approach allows for a continuous relaxation of the discrete search space, enabling step-wise local improvements to the whole sequence at once along a gradient direction. However, these methods have been shown to suffer from vanishing gradients Linder and Seelig (2020). Further, the projected sequences in the continuous relaxation space may not be recognized by the predictors, leading to poor convergence. Generative model-based optimization approaches aim to learn a distribution whose expectation coincides with the black-box evaluations, and to maximize this expectation Gupta and Zou (2019); Brookes, Park, and Listgarten (2019). However, such approaches require a pre-trained generative model for optimization.

Latent Space Optimization: The key idea is to employ an encoder-decoder architecture to learn a continuous representation from the data and perform BO in the latent space Gómez-Bombarelli et al. (2018); Tripp, Daxberger, and Hernández-Lobato (2020). This approach has two main drawbacks: it can generate a large fraction of invalid structures, and it requires a large database of relevant structures for training an auto-encoder, which may not be available in applications with scarce data.

Variable-Length Sequence Optimization: Moss et al. (2020) propose an algorithm called BOSS for optimization over strings of variable lengths. They use a Gaussian process surrogate model based on string kernels and perform acquisition function maximization for spaces with syntactic constraints. In contrast, our main focus is on fixed-length generic BBO and design problems.

Ensemble (Population-based) Methods: Angermüller et al. (2020) propose a population-based approach which generates batches of sequences by sampling from an ensemble of methods, rather than from a single method. This sampling is carried out in proportion to the quality of the sequences that each method has found in the past. A similar ensemble concept has been used in Sinai et al. (2021). In our work, we focus on individual methods; however, since we propose two distinct representations employed via different algorithms (online and Bayesian), and the performance of each varies across tasks, an ensemble method over the different algorithms would be an interesting future direction.

Batch (Offline) Optimization: Batch optimization techniques Angermueller et al. (2020); Brookes, Park, and Listgarten (2019); Fannjiang and Listgarten (2020) are a related but fundamentally distinct line of work, where black-box evaluations are carried out in batches of samples rather than in an active/sequential manner. To highlight the distinction between the two settings, we reiterate our problem setting: given query access to a combinatorial black box (i.e., it can be queried at a single point at a time), what is the best solution one can obtain within a given number of such queries? Since one wishes to minimize the number of queries, a highly adaptive sequential algorithm is preferred.

Appendix D BBO Experiments

Latin Square Problem: A Latin square of order $k$ is a $k \times k$ matrix of elements in $\{1, \ldots, k\}$, such that each number appears in each row and column exactly once. When $k = 5$, the problem of finding a Latin square has $161{,}280$ solutions in a space of dimensionality $5^{25}$. We formulate the problem of finding a Latin square of order $5$ as a black-box function by imposing an additive penalty of one for any repetition of numbers in any row or column. As a result, a function evaluation of zero corresponds to a valid Latin square of order $5$. We consider a noisy version of this problem, where additive Gaussian noise with zero mean and a fixed standard deviation is added to the function evaluations observed by each algorithm.
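A minimal sketch of this black-box objective (with an assumed placeholder for the noise level) is:

import numpy as np

RNG = np.random.default_rng(0)
K = 5  # order of the Latin square

def latin_square_penalty(x, noise_std=0.1):
    """Black-box objective: +1 penalty per repeated number in any row or column.

    x: length-25 array with entries in {1, ..., 5}, reshaped to a 5x5 square.
    A return value of 0 (before noise) certifies a valid Latin square.
    """
    square = np.asarray(x).reshape(K, K)
    penalty = 0
    for i in range(K):
        penalty += K - len(np.unique(square[i, :]))  # repeats in row i
        penalty += K - len(np.unique(square[:, i]))  # repeats in column i
    return penalty + RNG.normal(0.0, noise_std)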

Figure 5 demonstrates the performance of the different algorithms, in terms of the best function value found until time $t$, over the evaluation budget. Both ECO-F and ECO-G outperform the baselines by a considerable margin. In addition, ECO-G outperforms COMBO over the initial samples. At larger time steps, COMBO outperforms the other algorithms; however, this performance comes at the price of a far larger computation time. As demonstrated in Table 2, ECO-F and ECO-G offer a speed-up over COMBO by factors of roughly $113$ and $47$, respectively. We note that TCO-F (not included in the plot) performs poorly (similar to RS) on this problem, which can be attributed to the strong promotion of sparsity by the regularized horseshoe prior and the fact that the Latin square problem has a dense representation (we observed similar behavior from the horseshoe prior of Baptista and Marzouk (2018)).

Figure 5: Best function evaluation seen so far for the Latin Square problem. Each time step corresponds to a black-box evaluation.

Pest Control Problem: In the pest control problem, with $n = 25$ stations and $k = 5$ choices per station, the goal is to contain the spread of pests (with minimum cost) propagating through the stations in an interactive and probabilistic fashion. The $k$-th category for each variable corresponds to the choice of no pesticide at all. Controlling the spread of the pest is carried out via the choice of the right type of pesticide, subject to a penalty proportional to its associated cost. A closed-form definition of this problem is given in Oh et al. (2019).

The results for the different algorithms are shown in Figure 6. Despite the fact that COMBO is able to find the minimum in fewer time steps than ECO-F on average, ECO-F outperforms COMBO during the initial time steps. TCO-F is able to find the minimum faster than ECO-F. SA performs competitively, but is ultimately unable to find the optimal solution to this problem within the designated number of steps. The poor performance of ECO-G can be explained by the interactive nature of the problem, where early mistakes are punished inordinately. The early mistakes made by ECO-G can also be attributed to the large number of experts (with noisy coefficients) in its model, which in turn promotes early exploratory behavior.

Figure 6: Best function evaluation seen so far for the pest control problem.

RNA Sequence Optimization Problem: Structured RNA molecules play a critical role in many biological applications, ranging from control of gene expression to protein translation. The native secondary structure of an RNA molecule is usually the minimum free energy (MFE) structure. Consider an RNA sequence as a string of $n$ letters (nucleotides) over the alphabet {A, U, G, C}. A pair of complementary nucleotides can interact with each other and form a base pair, with A-U, C-G, and G-U being the energetically stable pairs. Thus, the secondary structure of an RNA can be represented by an ensemble of pairing bases.

Finding the most stable RNA sequences has immediate uses in materials science and biomedical applications Li et al. (2015). Studies show that by controlling the structure and free energy of an RNA molecule, one may modulate its translation rate and half-life in a cell Buchan and Stansfield (2007); Davis et al. (2008), which is important in the context of viral RNA. A number of RNA folding algorithms Lorenz et al. (2011); Markham and Zuker (2008) use a thermodynamic model (e.g., Zuker and Stiegler (1981)) and dynamic programming to estimate the MFE of a sequence. However, the time complexity of these algorithms prohibits their use for evaluating substantial numbers of RNA sequences Gould, Hendy, and Papamichail (2014), let alone exhaustively searching the space to identify the global free energy minimum, as the number of sequences grows exponentially as $4^n$.

Here, we formulate the RNA sequence optimization problem as follows: for a sequence of length $n$, find the RNA sequence that will fold into the secondary structure with the lowest minimum free energy. In our experiments, we initially set $n = 30$, with $k = 4$ categories per position. We then use the popular RNAfold package Lorenz et al. (2011) to evaluate the MFE of a given sequence. The goal is to find the lowest-MFE sequence with as few calls to the MFE evaluator as possible. The performance of the different algorithms is depicted in Figure 7, where both ECO-F and particularly ECO-G outperform the baselines as well as COMBO by a considerable margin. At higher numbers of evaluations, TCO-F beats the rest of the algorithms, which can be attributed to its exploration-exploitation trade-off.
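For reference, the black-box evaluation here can be reproduced with the Python bindings shipped with the ViennaRNA (RNAfold) package; the example sequence below is arbitrary:

import RNA  # Python bindings shipped with the ViennaRNA / RNAfold package

def mfe(sequence: str) -> float:
    """Black-box objective: free energy (kcal/mol) of the sequence's MFE structure."""
    structure, energy = RNA.fold(sequence)
    return energy

print(mfe("GCGCGCGAAAGCGCGC"))  # lower (more negative) is more stable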

Figure 7: Best function evaluation seen so far for the RNA sequence optimization problem with n = 30.
                   Latin Square   Pest Cont.   RNA Opt.
  n (variables)        25             25          30
  k (cardinality)       5              5           4
  COMBO              170.4          151.0       253.8
  ECO-F                1.5            1.4         2.0
  ECO-G                3.6            3.3         5.7
  TCO-F               55.7           53.2        67.0
Table 2: Average computation time per step (in seconds) over different problems and algorithms.

Energy-optimized RNA Structures: Sample RNA sequences obtained via ECO-G after the full budget of time steps, for two different sequence lengths, are shown in Figure 17. The resulting energy-optimized sequences (as obtained using the RNAfold service) have high GC content, which makes the strongest positive contribution to lowering the MFE Trotta (2014), as pairings between G and C have three hydrogen bonds and are more stable than pairings between A and U, which have only two. The final single-strand RNA sequence folds into a GC-paired double helix and a four-nucleotide hairpin loop in the middle, which is a tetraloop reported in the literature (http://www.rna.icmb.utexas.edu/SIM/4C/energetics_new/). Figure 21 shows two sample structures of the ECO-G optimized sequences for a longer setting of $n$, again exhibiting the same trend. For odd values of $n$, there is a loop with an odd number of residues or a single unpaired base at the end, but there is still a GC-rich double helix. In contrast, the structures generated by the under-performing algorithms do show unpaired bases and lower GC content, leading to higher-energy structures (e.g., the structures in Figure 19 were obtained via SA for the same two sequence lengths).

Appendix E Design Experiments

For our experiments, we focus on three puzzles from the Eterna-100 dataset Anderson-Lee et al. (2016). Two of the selected puzzles, despite their fairly small lengths, are challenging for many algorithms (see Anderson-Lee et al. (2016)). In both puzzles, our MCTS variants (ECO-F and ECO-G) are able to significantly improve the performance of MCTS when only a limited number of true rewards (i.e., black-box evaluations) is available. All algorithms outperform RS, as expected. Within the given evaluation budget, TCO-F, ECO-G, and especially ECO-F are superior to LEARNA by a substantial margin (see Figure 8). In the second puzzle (Figure 9), again TCO-F, ECO-G, and ECO-F significantly outperform LEARNA over the given number of evaluations. Interestingly, ECO-F outperforms LEARNA throughout the evaluation process, and on average finds a far better final solution than LEARNA.

The final puzzle is the longest of the three. The results of the different algorithms over this puzzle are shown in Figure 10. As we can see from this figure, MCTS-RNA and LEARNA perform very similarly over the given evaluation budget. ECO-F is able to outperform the remaining algorithms throughout the evaluation steps. Initially, ECO-G performs similarly to MCTS-RNA and LEARNA, but it offers improved performance over the latter two shortly thereafter. Due to the higher dimensionality of this puzzle, we drop TCO-F on account of its high computation cost (expected time: days).

Figure 8: Best function evaluation for RNA design of the first selected puzzle.
Figure 9: Best function evaluation for RNA design of the second selected puzzle.
Figure 10: Best function evaluation for RNA design of the third (longest) selected puzzle.

Appendix F Choice of AFO

Throughout the experiments, we designated SA and MCTS as the AFO for generic BBO and design problems, respectively. This choice was made in accordance with the literature, where SA is typically used as a baseline and method of choice for the generic BBO problem, whereas MCTS has been commonly used for the design problem. For instance, SA has been considered as a baseline and/or AFO in Dadkhahi et al. (2020), Baptista and Marzouk (2018), and Oh et al. (2019) (albeit with a different algorithm than ours). On the other hand, MCTS (i.e., MCTS-RNA as well as its variations) is perhaps the most popular RNA design technique in the literature. Here, we point out that both SA and MCTS can be used for both generic BBO and design problems. In this section, we compare the performance of the different AFO methods on each problem; a minimal sketch of SA used as an AFO follows.
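The sketch below minimizes a given surrogate over $\{0, \ldots, k-1\}^n$; the single-flip neighborhood, cooling schedule, and budget are assumptions rather than our exact settings:

import math
import random

def sa_afo(surrogate, n, k, budget=1000, t0=1.0, seed=0):
    """Minimize a surrogate model over {0, ..., k-1}^n via simulated annealing."""
    rng = random.Random(seed)
    x = [rng.randrange(k) for _ in range(n)]
    best, best_val = list(x), surrogate(x)
    cur_val = best_val
    for step in range(1, budget + 1):
        temp = t0 / step  # assumed cooling schedule
        y = list(x)
        y[rng.randrange(n)] = rng.randrange(k)  # flip one categorical variable
        y_val = surrogate(y)
        # accept downhill moves always, uphill moves with Boltzmann probability
        if y_val <= cur_val or rng.random() < math.exp(-(y_val - cur_val) / temp):
            x, cur_val = y, y_val
            if cur_val < best_val:
                best, best_val = list(x), cur_val
    return best, best_val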

First, we consider the generic BBO problem of RNA sequence optimization with n = 30 (considered in Section 4). Figure 11 demonstrates the performance of ECO-F and ECO-G when either SA or MCTS is used as the AFO. As we can see from this figure, the SA-as-AFO variants perform slightly better than their MCTS-as-AFO counterparts; although the performance gap is moderately large initially, it shrinks over time.

Next, we consider two of the design problems from Section 4. Note that, when using SA as the AFO, we apply the softmax operator (in Algorithm 1) over the set of feasible complementary assignments if the corresponding variable is part of a paired base. As we can see from Figure 12, for the first puzzle, ECO-F with MCTS outperforms the remaining algorithms. The rest of the algorithms have very similar performance, with ECO-F (MCTS) marginally surpassing the SA variants. As we can see from Figure 13, for the second puzzle, the MCTS variant of ECO-F slightly outperforms its SA variant, whereas the MCTS variant of ECO-G maintains a larger gap over its SA variant.

Figure 11: Comparison of different AFO methods for the generic BBO problem of RNA sequence optimization with n = 30.