1 Introduction
A plethora of practical optimization problems involve blackbox functions, with no simple analytical closed forms, that can be evaluated at any arbitrary point in the domain. Optimization of such blackbox functions poses a unique challenge due to restrictions on the number of possible function evaluations, as evaluating functions of realworld complex processes is often expensive and time consuming. Efficient algorithms for global optimization of expensive blackbox functions take past queries into account in order to select the next query to the blackbox function more intelligently. Blackbox optimization of realworld functions defined over purely categorical type input variables has not been studied in the literature as extensively as its integer, continuous, or mixed variables counterparts.
Categorical type variables are particularly challenging when compared to integer or continuous variables, as they do not have a natural ordering. However, many realworld functions are defined over categorical variables. One such problem, which is of wide interest, is the design of optimal chemical or biological (protein, RNA, and DNA) molecule sequences, which are constructed using a vocabulary of fixed size, e.g. 4 for DNA/RNA. Designing optimal molecular sequences with novel structures or improved functionalities is of paramount importance in material science, drug and vaccine design, synthetic biology and many other applications, see Dixon et al. (2010); Ng et al. (2019); Hoshika et al. (2019); Yamagami et al. (2019). Design of optimal sequences is a difficult blackbox optimization problem over a combinatorially large search space Stephens et al. (2015), in which function evaluations often rely on either wetlab experiments, physicsinspired simulators, or knowledgebased computational algorithms, which are slow and expensive in practice. Another problem of interest is the constrained design problem, e.g. find a sequence given a specific structure (or property), which is inverse of the wellknown folding problem discussed in Dill and MacCallum (2012). This problem is complex due to the strict structural constraints imposed on the sequence. In fact, one of the ways to represent such a complex structural constraint is to constrain the next choice sequentially based on the sequence elements that have been chosen a priori. Therefore, we divide the black box optimization problem into two settings, depending on the constraint set: Generic Black Box Optimization (BBO) problem referring to the unconstrained case and Design Problem that refers to the case with complex sequential constraints.
Let be the th sequence evaluated by the black box function . The key question in both settings is the following: Given prior queries and their evaluations , how to choose the next query ? This selection must be devised so that over a finite budget of blackbox evaluations, one is closest to the minimizer in an expected sense over the acquisition randomness.
In the literature, for design problems with sequential constraints, MCTS (Monte Carlo Tree Search) based search algorithms are often used with black box function evaluations. In the generic BBO problems in the unconstrained scenario, Simulated Annealing (SA) based techniques are typically used as search algorithms. A key missing
ingredient in the categorical domain is a surrogate model for the blackbox evaluations that can interpolate between such evaluations and facilitate
costfree approximate evaluations from the surrogate model internally in order to reduce the need for frequently accessing real blackbox evaluations, thus leading to improved sample efficiency in both generic and design blackbox optimization problems. Due to the lack of efficient interpolators in the categorical domains, existing search algorithms suffer under constrained computational budgets, due to reliance on only real blackbox evaluations.Contributions: We address the above problem in our work. Our main contributions are as follows:

[leftmargin=5mm, itemsep=1pt]

We present two representations for modeling realvalued combinatorial functions over categorical variables, which we then use in order to learn a surrogate model for the generic BBO problem and the design problem, with a finite budget on the number of queries. The abridged onehot encoded Boolean Fourier representation is novel to this work. The use of grouptheoretic Fourier representation for modeling functions over categorical variables, and in particular their optimization, is novel to this work.

Numerical experiments, over synthetic benchmarks as well as realworld biological (RNA) sequence optimization demonstrate the competitive or superior performance of the proposed methods incorporating our representations over stateoftheart counterparts, while substantially reducing the computation time. We further evaluate the performance of our algorithms in design problems (inverse folding problem) and demonstrate the superior performance of our methods in terms of sample efficiency over the stateoftheart counterparts.
2 Related Work
BlackBox Optimization: Hutter et al. Hutter, Hoos, and LeytonBrown (2011)
suggest a surrogate model based on random forests to address optimization problems over categorical variables. The proposed SMAC algorithm uses a randomized local search under the expected improvement acquisition criterion to obtain candidate points for blackbox evaluations. Bergstra et al.
Bergstra et al. (2011)suggest a treestructured Parzen estimator (TPE) for approximating the surrogate model, and maximizes the expected improvement criterion to find candidate points for evaluation. For optimization problems over Boolean variables, multilinear polynomials
Ricardo Baptista (2018); Dadkhahi et al. (2020) and Walsh functions Leprêtre et al. (2019); Deshwal, Belakaria, and Doppa (2021) have been used in the literature.Bayesian Optimization (BO) is a commonly used approach for optimization of blackbox functions Shahriari et al. (2015). However, limited work has addressed incorporation of categorical variables in BO. Early attempts were based on converting the blackbox optimization problem over categorical variables to that of continuous variables GómezBombarelli et al. (2018); Golovin et al. (2017); GarridoMerchán and HernándezLobato (2020). A few BO algorithms have been specifically designed for blackbox functions over combinatorial domains. In particular, the BOCS algorithm Ricardo Baptista (2018)
, primarily devised for Boolean functions, employs a sparse monomial representation to model the interactions among different variables, and uses a sparse Bayesian linear regression method to learn the model coefficients. The COMBO algorithm of
Oh et al. (2019)uses Graph Fourier Transform over a combinatorial graph, constructed via graph Cartesian product of variable subgraphs, to gauge the smoothness of the blackbox function. However, both BOCS and COMBO are hindered by associated high computation costs, which grow polynomially with both the number of variables and the number of function evaluations. More recently, a computationally efficient blackbox optimization algorithm (COMEX)
Dadkhahi et al. (2020) was introduced to address the computational impediments of its Bayesian counterparts. COMEX adopts a Boolean Fourier representation as its surrogate model, which is updated via an exponential weight update rule. Nevertheless, COMEX is limited to functions over the Boolean hypercube.We generalize COMEX to handle functions over categorical variables by proposing two representations for modeling functions over categorical variables: an abridged onehot encoded Boolean Fourier representation and Fourier representation on finite Abelian groups. The utilization of the latter representation as a surrogate model in combinatorial optimization algorithms is novel to this work. Factorizations based on (vanilla) onehot encoding has been previously (albeit briefly) suggested in
Ricardo Baptista (2018) to enable blackbox optimization algorithms designed for Boolean variables to address problems over categorical variables. Different from Ricardo Baptista (2018), we show that we can significantly reduce the number of additional variables introduced upon onehot encoding, and that such a reduced representation is in fact complete and unique. We incorporate the latter representation to extend a modified version of BOCS to problems over categorical variables, efficiently.Sequence Design: For design problems, we focus on the RNA sequence design problem (RNA inverse folding). The goal is to find an RNA sequence consistent with a given secondary structure, as the functional state of the RNA molecule is determined by the latter structure Hofacker et al. (1994). Earlier RNA design methods explore the search space by trial and error and use classic cost function minimization approaches such as adaptive random walk Hofacker (2003), probabilistic sampling Zadeh et al. (2011)
, and genetic algorithms
Taneda (2015). Recent efforts employ more advanced machine learning methods such as different Monte Carlo Tree Search (MCTS) algorithms, e.g. MCTSRNA
Yang et al. (2017) or Nested MCTS Portela (2018), and reinforcement learning that either performs a local search as in
Eastman et al. (2018) or learns complete candidate solutions from scratch Runge et al. (2019). In all these approaches, the assumption is that the algorithm has access to a large number of function evaluations, whereas we are interested in sample efficiency of each algorithm.Learning to Search: As an alternative to parameterfree search methods (such as SA), Swersky et al. (2020) suggests to use a parameterized policy to generate candidates that maximize the acquisition function in Bayesian optimization over discrete search spaces. Our MCTSbased algorithm is similar in concept to Swersky et al. (2020) in the sense that the tabular value functions are constructed and maintained over different time steps. However, we are maintaining value functions rather than a policy network. Finally, while Swersky et al. (2020) trains a policy network, Deshwal et al. (2020) uses a rank learning algorithm to search.
Metaheuristics
: A variety of discrete search algorithms and metaheuristics have been studied in the literature for combinatorial optimization over categorical variables. Such algorithms, including Genetic Algorithms
Holland and Reitman (1978), Simulated Annealing Spears (1993), and Particle Swarms Kennedy and Eberhart (1995), are generally inefficient in finding the global minima. In the context of biological sequence optimization, the most popular method is directed evolution Arnold (1998), which explores the space by only making small mutations to existing sequences. In the context of sequence optimization, a recent promising approach consists of fitting a neural network model to predict the black box function and then applying gradient ascent on the latter model
Killoran et al. (2017); Bogard et al. (2019); Liu et al. (2020). This approach allows for a continuous relaxation of the discrete search space making possible stepwise local improvements to the whole sequence at once based on a gradient direction. However, these methods have been shown to suffer from vanishing gradients Linder and Seelig (2020). Further, the projected sequences in the continuous relaxation space may not be recognized by the predictors, leading to poor convergence. Generative modelbased optimization approaches aim to learn distributions whose expectation coincides with evaluations of the black box and try to maximize such expectation Gupta and Zou (2019); Brookes, Park, and Listgarten (2019). However, such approaches require a pretrained generative model for optimization.Latent Space Optimization: The key idea is to employ an encoderdecoder architecture to learn a continuous representation from the data and perform BO in a latent space GómezBombarelli et al. (2018); Tripp, Daxberger, and HernándezLobato (2020). This approach has two main drawbacks: it could generate a large fraction of invalid structures, and it requires a large database of relevant structures, for training an autoencoder, which may not be available in many applications with scarce data availability.
VariableLength Sequence Optimization: Moss et al. Moss et al. (2020) propose an algorithm called BOSS for optimization over strings of variable lengths. They use a Gaussian process surrogate model based on string kernels and perform acquisition function maximization for spaces with syntactic constraints. On the contrary, our main focus is on fixed length generic BBO and design problems.
Ensemble (Populationbased) Methods: Angermueller et al. Angermüller et al. (2020) propose a populationbased approach which generates batches of sequences by sampling from an ensemble of methods, rather than from a single method. This sampling is carried out proportional to the quality of sequences that each method has found in the past. A similar ensemble concept has been used in Sinai et al. (2021). In our work, we focus on individual methods; however, we add that since we are proposing two distinct representations employed via different algorithms (online and Bayesian), and the performance of each varies across different tasks, it would be interesting to consider an ensemble method over different algorithms as a future direction.
Batch (Offline) Optimization: Batch optimization techniques Angermueller et al. (2020); Brookes, Park, and Listgarten (2019); Fannjiang and Listgarten (2020) are a related but fundamentally distinct line of work, where blackbox evaluations are carried out in sample batches rather than in an active/sequential manner. To highlight the distinction between the two settings we iterate our problem setting: Given query access to a combinatorial black box (i.e. it can be queried on a single point), what is the best solution we can obtain in a given number of such queries. One wishes to minimize the number of queries and hence goes for a highly adaptive sequential algorithm.
3 BlackBox Optimization over Categorical Variables
Problem Setting: Given the combinatorial categorical domain and a constraint set , with variables each of cardinality , the objective is to find
(1) 
where is a realvalued combinatorial function. We assume that is a blackbox function, which is computationally expensive to evaluate. As such, we are interested in finding in as few evaluations as possible. We consider two variations of the problem depending on how the constraint set is specified.
Generic BBO Problem: In this case, the constraint set . For example, RNA sequence optimization problem that searches for an RNA sequence with a specific property optimized lies within this category. A score for every RNA sequence, reflecting the property we wish to optimize, is evaluated by a black box function.
Design Problem: The constraint set is complex and is only sequentially specified. For every sequence of consisting of characters from the alphabet , the choice of the next character is specified by a constraint set function . The RNA inverse folding problem in Runge et al. (2019) falls into this category, where the constraints on the RNA sequence are determined by the sequential choices one makes during the sequence design. The goal is to find the sequence that is optimal with respect to a prespecified structure that also obeys complex sequential constraints.
Proposed Techniques: To address this problem, we adopt a surrogate modelbased learning framework as follows. The surrogate model (to represent the blackbox function) is updated sequentially via blackbox function evaluations observed until time step . An acquisition function is then obtained from the surrogate model. The selection of candidate points for blackbox function evaluation is carried out via an acquisition function optimizer (AFO), which uses the acquisition function as an inexpensive proxy (to make many internal calls) for the blackbox function and produces the next candidate point to be evaluated. We consider this problem in two different settings: online and Bayesian. The difference between the two settings is that in the former the acquisition function is exactly our surrogate, whereas in the latter the acquisition function is obtained by sampling from the posterior distribution of the surrogate.
In the sequel, we propose two representations that can be used in surrogate models for blackbox combinatorial functions over categorical variables. These representations serve as direct generalizations of the Boolean surrogate model based on Fourier expansion proposed in Dadkhahi et al. (2020); Ricardo Baptista (2018) in the sense that they reduce to the Fourier representation for realvalued Boolean functions when the cardinality of the categorical variables is two. In addition, both approaches can be modified to address the more general case where different variables are of different cardinalities. However, for ease of exposition, we assume that all the variables are of the same cardinality. Finally, we introduce two popular search algorithms to be used as acquisition function optimizers in conjunction with the proposed surrogate models in order to select new queries for subsequent blackbox function evaluations.
3.1 Representations for the Surrogate Model
We present two representations for combinatorial functions and an algorithm to update from the blackbox evaluations. The representations use the Fourier basis in two different ways.
Abridged OneHot Encoded Boolean Fourier Representation: The onehot encoding of each variable can be expressed as a tuple , where are Boolean variables with the constraint that at most one such variable can be equal to for any given .
We consider the following representation for the combinatorial function :
(2) 
where denotes fold cartesian product of the set , designates the set of subsets of the set , and the monomials can be written as
(3) 
A second order approximation (i.e. at ) of the representation in (2) can be expanded in the following way:
(4) 
Example.
For variables and , each of which with cardinality , we have the onehot encoding of and respectively. From Equation (4), the onehot encoding factorization for this example can be written as
Note that the representation in Equation (2) has far less terms than a vanilla onehot encoding with all the combinations of onehot variables included (as suggested in Ricardo Baptista (2018)). The reason for this reduction is twofold: Boolean variables model each categorical variable of cardinality , and more importantly each monomial term has at most one Boolean variable from its corresponding parent categorical variable (See the Appendix for the exact quantification of this reduction). The following theorem states that this reduced representation is in fact unique and complete.
Theorem 3.1.
The representation in Equation (2) is complete and unique for any realvalued combinatorial function.
Proof.
See Appendix. ∎
Fourier Representation on Finite Abelian Groups: We define a cyclic group structure over the elements of each categorical variable (), where is the cardinality of the latter variable. From the fundamental theorem of abelian groups Terras (1999), there exists an abelian group which is isomorphic to the direct sum (a.k.a direct product) of the cyclic groups corresponding to the categorical variables:
(5) 
where the latter group consists of all vectors
such that and denotes group isomorphism. We assume that () for simplicity, but the following representation could be easily generalized to the case of arbitrary cardinalities for different variables.The Fourier representation of any complexvalued function on the finite abelian group is given by Terras (1999)
(6) 
where are (in general complex) Fourier coefficients, is the fold cartesian product of the set and are complex exponentials ^{1}^{1}1Note that in the general case of different cardinalities for different variables, where denotes the cartesian product and the exponent denominator in the complex exponential character is replaced by . (th roots of unity) given by
Note that the latter complex exponentials are the characters of the representation, and reduce to the monomials (i.e. in ) when the cardinality of each variable is two. A second order approximation of the representation in (6) can be written as:
(7)  
For a realvalued function (which is of interest here), the representation in (6) reduces to
(8) 
where
(9) 
See the Appendix for the proof of this representation. We note that the number of characters utilized in this representation is almost twice as many as that of monomials used in the previous representation.
3.2 Surrogate Model Learning
Sparse Online Regression: We adopt the learning algorithm of combinatorial optimization with expert advice Dadkhahi et al. (2020) in the following way. We consider the monomials in (3) and the characters in (9) as experts. For each surrogate model, we maintain a pool of such experts, the coefficients of which are refreshed sequentially via an exponential weight update rule. We refer to the proposed algorithm as ExpertBased Categorical Optimization (ECO) and the two versions of the algorithm with the two proposed surrogate models are called ECOF (based on the OneHot Encoded Boolean Fourier Representation) and ECOG (based on Fourier Representation on Finite Abelian Groups), respectively. See Appendix for more details.
Sparse Bayesian Regression: In order to incorporate the uncertainty of the surrogate model in acquisition function for explorationexploitation tradeoff, we adopt a Bayesian setting based on Thompson sampling (TS) Thompson (1933, 1935). In particular, we model the surrogate via a sparse Bayesian regression model with a regularized horseshoe prior Piironen and Vehtari (2017). At each time step, we learn the posterior using the data observed so far, and draw a sample for the coefficients from the posterior. We use the NoUTurn Sampler (NUTS) Hoffman and Gelman (2014) in order to sample the coefficients efficiently. See Appendix B for more details. Our algorithm pursues the framework of BOCS Ricardo Baptista (2018), with the following differences for improved performance and computational efficiency: BOCS () employs a horseshoe prior Carvalho, Polson, and Scott (2010); Makalic and Schmidt (2016) and () uses a Gibbs sampler. Experimentally, the regularized horseshoe prior outperforms the horseshoe prior, and the NUTS sampler leads to a significant speedup in computations (e.g. a speedup for the RNA optimization problem), which is critical due to the higher dimensionality of the categorical domains. We use the proposed abridged onehot encoded Fourier representation in conjunction with this model, due to its lower number of coefficients leading to a more manageable computation cost. We refer to this algorithm as TSbased Categorical Optimization (TCOF).
3.3 Acquisition Function Optimizers
In this subsection, we discuss how two popular search algorithms, namely Simulated Annealing (SA) and Monte Carlo Tree Search (MCTS), can be utilized as acquisition function optimizers (AFO) in conjunction with a surrogate model and use costfree evaluations of the surrogate model to select the next query for the black box evaluation. In the literature, SA has been used for the generic BBO problems, whereas MCTS has been used for the design problems.
SA as AFO: Our AFO is devised so as to minimize , the current estimate for the surrogate model. A simple strategy to minimize is to iteratively switch each variable into the value that minimizes given the values of all the remaining variables, until no more changes occur after a sweep through all the variables (). A strategy to escape local minima in this context Pincus (1970) is to allow for occasional increases in
by generating a Markov Chain (MC) sample sequence (for
), whose stationary distribution is proportional to , where is gradually reduced to . This optimization strategy was first applied by Kirkpatrick, Gelatt, and Vecchi (1983) in their Simulated Annealing algorithm to solve combinatorial optimization problems. We use the Gibbs sampler Geman and Geman (1984) to generate such an MC by sampling from the full conditional distribution of the stationary distribution, which in our case is given by the softmax distribution over , for each variable conditional on the values of the remaining variables . By decreasing from a high value to a low one, we allow the MC to first search at a coarse level avoiding being trapped in local minima.Algorithm 1 presents our simulated annealing (SA) version for categorical domains, where is an annealing schedule, which is a nonincreasing function of . We use the annealing schedule suggested in Spears (1993), which follows an exponential decay with parameter given by . In each step of the algorithm, we pick a variable () uniformly at random, evaluate the surrogate model (possibly in parallel) times, once for each categorical value for the chosen variable given the current values for the remaining variables. We then update with the sampled value in from the corresponding softmax distribution.
MCTS as AFO:
We formulate the design problem as an undiscounted Markov decision process
. Each state corresponds to a partial or full sequence of categorical variables of lengths in . The process in each episode starts with an empty sequence , the initial state. Actions are selected from the set of permissible additions to the current state (sequence) at each time step , . The transition function is deterministic, and defines the sequence obtained from the juxtaposition of the current state with the action , i.e. . The transitions leading to incomplete sequences yield a reward of zero. Complete sequences are considered as terminal states, from which no further transitions (juxtapositions) can be made. Once the sequence is complete (i.e. at a terminal state), the reward is obtained from the current surrogate reward model . Thus, the reward function is defined as if is terminal, and zero otherwise.MCTS is a popular search algorithm used for design problems. MCTS is a rollout algorithm which keeps track of the value estimates obtained via Monte Carlo simulations in order to progressively make better selections. The UCT selection criteria, see Kocsis and Szepesvári (2006), is typically used as tree policy, where action at state in the search tree is selected via: , where is the search tree, is the exploration parameter, is the stateaction value estimate, and and are the visit counts for the parent state node and the candidate stateaction edge, respectively. For the selection of actions in states outside the search tree, a random default policy is used: .
A summary of the proposed algorithm is given in Algorithm 2. Starting with an empty sequence at the root of the tree, we follow the tree policy until a leaf node of the search tree is reached (selection step). At this point, we append the state corresponding to the leaf node to the tree and initialize a value function estimate for its children (extension step). From the reached leaf node we follow the default policy until a terminal state is reached. At this point, we plug the sequence corresponding to this terminal state into the surrogate reward model and observe the reward . This reward is backed up from the leaf node to the root of the tree in order to update the value estimates via Monte Carlo (i.e. using the average reward) for all visited pairs along the path. This process is repeated until a stopping criterion (typically a max number of playouts) is met, at which point the sequence with maximum reward is returned as the output of the algorithm.
Computational Complexity: The computational complexity per time step associated with learning the surrogate model via the Hedge algorithm, for both representations introduced in 3.1, is in , and is thus linear in the number of experts . Moreover, the complexity of Algorithm 1 is in , assuming that the number of iterations in each SA run is in . Hence, the overall complexity of the algorithm is in . Finally, the computational complexity of each playout in Algorithm 2 is in , leading to an overall complexity of , assuming playouts per time step. On the other hand, when Bayesian regression is used to learn the surrogate model, NUTS dominates the computation time with a complexity of per independent sample.
4 Experiments and Results
In this section, we measure the performance of the proposed representations, when used as surrogate/reward model in conjunction with search algorithms (SA and MCTS) in BBO and design problems. The learning rate used in exponential weight updates is selected via the anytime learning rate schedule suggested in Dadkhahi et al. (2020) and Gerchinovitz and Yu (2011) (see Appendix B). The maximum degree of interactions used in our surrogate models is set to two for all the problems; increasing the max order improved the results only marginally (see Appendix C). The sparsity parameter in exponential weight updates is set to in all the experiments following the same choice made in Dadkhahi et al. (2020). Experimentally, the learning algorithm is fairly insensitive to the variations in the latter parameter. In each experiment, we report the results averaged over multiple runs ( runs in BBO experiments; runs in design experiments)
one standard error of the mean. The experiments are run on machines with CPU cores from the Intel Xeon E52600 v3 family.
BBO Experiments: We compare the performance of our ECO/TCO algorithms in conjunction with SA with two baselines, random search (RS) and simulated annealing (SA), as well as a stateoftheart Bayesian combinatorial optimization algorithm (COMBO) Oh et al. (2019). In particular, we consider two synthetic benchmarks (Latin square problem and pest control problem) and a realword problem in biology: RNA sequence optimization. In addition to the performance of the algorithms in terms of the best value of observed until a given time step , we measure the average computation time per time step of our algorithm versus that of COMBO. The decay parameter used in the annealing schedule of SA is set to in all the experiments. In addition, the number of SA iterations is set to and for ECO and TCO, respectively. Intuitively, for the ECO algorithm, each of these parameters creates an explorationexploitation tradeoff. The smaller (larger) the value of or , the more exploratory (exploitative) is the behavior of SA. The selected values seem to create a reasonable balance; tuning these parameters may improve the performance of the algorithm. On the other hand, the TCO algorithm has an exploration mechanism built into it via Thompson sampling; hence, we use a higher number of SA iterations to maximize exploitation in AFO.
Synthetic Benchmarks: We consider two synthetic problems: Latin square problem Colbourn and Dinitz (2006), a commonly used combinatorial optimization benchmark, and the pest control problem considered in Oh et al. (2019) (see Appendix for the latter results). In both problems, we have categorical variables, each of cardinality . A Latin square of order is a matrix of elements , such that each number appears in each row and column exactly once. When , the problem of finding a Latin square has solutions in a space of dimensionality . We formulate the problem of finding a Latin square of order as a blackbox optimization by imposing an additive penalty of for any repetition of numbers in any row or column. Hence, function evaluations are in the range , and a function evaluation of corresponds to a Latin square of order . We consider a noisy version of this problem, where an additive Gaussian noise with
mean and standard deviation of
is added to function evaluations observed by each algorithm.Figure 1 demonstrates the performance of different algorithms, in terms of the best function value found until time , over time steps. Both ECOF and ECOG outperform the baselines with a considerable margin. In addition, ECOG outperforms COMBO at samples. At larger time steps, COMBO outperforms the other algorithms; however, this performance comes at the price of a far larger computation time. As demonstrated in Table 1, ECOF and ECOG offer a speedup of roughly and , resp., over COMBO. We note that TCOF (not included in the plot) performs poorly (similar to RS) on this problem, which can be attributed to the strong promotion of sparsity by the regularized horseshoe prior and the fact that the Latin Square problem has a dense representation (we observed a similar behavior from the horseshoe prior of Ricardo Baptista (2018)).
Data  Latin Square  Pest Cont.  RNA Opt. 

COMBO  170.4  151.0  253.8 
ECOF  1.5  1.4  2.0 
ECOG  3.6  3.3  5.7 
TCOF  55.7  53.2  67.0 
RNA Sequence Optimization Problem: Consider an RNA sequence as a string of letters (nucleotides) over the alphabet . A pair of complementary nucleotides and , where , can interact with each other and form a base pair (denoted by ), AU, CG and GU being the energetically stable pairs. Thus, the secondary structure, i.e. the minimum free energy structure, of an RNA can be represented by an ensemble of pairing bases. A number of RNA folding algorithms Lorenz et al. (2011); Markham and Zuker (2008) use a thermodynamic model (e.g. Zuker and Stiegler (1981)) and dynamic programming to estimate MFE of a sequence. However, the time complexity of these algorithms prohibits their use for evaluating substantial numbers of RNA sequences Gould, Hendy, and Papamichail (2014) and exhaustively searching the space to identify the global free energy minimum, as the number of sequences grows exponentially as .
We formulate the RNA sequence optimization problem as follows: For a sequence of length , find the RNA sequence which folds into a secondary structure with the lowest MFE. In our experiments, we set and . We then use the popular RNAfold package Lorenz et al. (2011) to evaluate the MFE for a given sequence. The goal is to find the lowest MFE sequence by calling the MFE evaluator minimum number of times. As shown in Figure 2, both ECOF and particularly ECOG outperform the baselines as well as COMBO by a large margin. At higher number of evaluations, TCOF beats the rest of the algorithms, which can be attributed to its explorationexploitation tradeoff. See App. J for comparison between TCOF and BOCS+vanilla onehot encoding.
RNA Design Experiments: The problem is to find a primary RNA sequence which folds into a target structure , given a folding algorithm . Such target structures can be represented as a sequence of dots (for unpaired bases) and brackets (for paired bases). In our algorithm, the action sets are defined as follows. For unpaired sites and for paired sites . At the beginning of each run of our algorithm (ECOF/G, TCOF in conjunction with MCTS), we draw a random permutation for the order of locations to be selected in each level of the tree. The reward value offered by the environment (i.e. the blackbox function) at any time step corresponds to the normalized Hamming distance between the target structure and the structure of the sequence found by each algorithm, i.e. .
We compare the performance of our algorithms against RS as a baseline, where random search is carried out over the given structure (i.e. default policy ) rather than over unstructured random sequences. We also include two stateoftheart algorithms in our experiments: MCTSRNA Yang et al. (2017) and LEARNA Runge et al. (2019). MCTSRNA has an exploration parameter, which we tune in advance (per sequence). LEARNA has a set of hyperparameters tuned a priori using training data and is provided in Runge et al. (2019). Note that the latter training phase (for LEARNA) as well as the former exploration parameter tuning (for MCTSRNA) are offered to the respective algorithms as an advantage, whereas for our algorithm we use a global set of heuristic choices for the two hyperparameters, rather than attempting to tune the two hyperparameters. In particular, we set the exploration parameter to for ECO and for TCO; and the number of MCTS playouts at each time step to , where is the tree height (i.e. number of dots and bracket pairs). The latter heuristic choice is made since the bigger the tree, the more playouts are needed to explore the space.
We point out that the entire design pipeline in stateoftheart algorithms typically also includes a local improvement step (as a postprocessing step), which is either a rulebased search (e.g. in Yang et al. (2017)) or an exhaustive search (e.g. in Runge et al. (2019)) over the mismatched sites. We do not include the local improvement step in our experiments, since we are interested in measuring sample efficiency of different algorithms. In other words, the question is the following: given a fixed and finite evaluation budget, which algorithm is able to get closer to the target structure.
In our experiments, we focus on three puzzles from the Eterna100 dataset AndersonLee et al. (2016). Two of the selected puzzles ( and of lengths and , resp.), despite their fairly small lengths, are challenging for many algorithms (see AndersonLee et al. (2016)). In both puzzles, ECOF, ECOG, and TCOF (with MCTS as AFO) are able to significantly improve the performance of MCTS when limited number of blackbox evaluations is available. All algorithms outperform RS as expected. Within the given evaluation budget, TCOF, ECOG, and esp. ECOF, are superior to LEARNA by a substantial margin (see Figure 3). In puzzle (Figure 4), again both ECOG and ECOF significantly outperform LEARNA over the given number of evaluations. ECOF is able to outperform LEARNA throughout the evaluation process, and on average finds a far better final solution than LEARNA. With MCTS as AFO, ECO algorithms outperform TCO, which can be attributed to the exploratory behavior of the latter in both AF (via TS) and AFO (via MCTS). See Appendix for puzzle .
5 Conclusions and Future Work
We propose two novel Fourier representations as surrogate models for black box optimization over categorical variables and show performance improvements over existing baselines when combined with search algorithms. We utilize two algorithms to learn such surrogate models. Our ECO algorithm incorporates a computationallyefficient online estimator with strong adversarial guarantees (see Dadkhahi et al. (2020) for theoretical results in the Boolean case), which can be shown to carry over to our setting as well. Our TCO algorithm uses a Bayesian regression model with a regularized horseshoe prior and selects queries via Thompson sampling to balance explorationexploitation tradeoff in surrogate model learning at the price of a higher computation cost.
Considering the performance variability with respect to different algorithms and representations across different problems, an important research avenue would be to derive an ensemble of such models rather than a single one and investigate potential performance gains. Such an ensemble model would then update and explore all models simultaneously and draw samples from either individual or a combination of models at any given time step.
References
 AndersonLee et al. (2016) AndersonLee, J.; Fisker, E.; Kosaraju, V.; Wu, M.; Kong, J.; Lee, J.; Lee, M.; Zada, M.; Treuille, A.; and Das, R. 2016. Principles for Predicting RNA Secondary Structure Design Difficulty. Journal of Molecular Biology, 428(5, Part A): 748 – 757. Challenges in RNA Structural Modeling and Design.
 Angermueller et al. (2020) Angermueller, C.; Dohan, D.; Belanger, D.; Deshpande, R.; Murphy, K.; and Colwell, L. 2020. Modelbased reinforcement learning for biological sequence design. In ICLR.
 Angermüller et al. (2020) Angermüller, C.; Belanger, D.; Gane, A.; Mariet, Z.; Dohan, D.; Murphy, K.; Colwell, L. J.; and Sculley, D. 2020. PopulationBased BlackBox Optimization for Biological Sequence Design. In ICML.
 Arnold (1998) Arnold, F. H. 1998. Design by directed evolution. Accounts of chemical research, 31(3): 125–131.
 Bergstra et al. (2011) Bergstra, J.; Bardenet, R.; Bengio, Y.; and Kégl, B. 2011. Algorithms for HyperParameter Optimization. In NeurIPS.
 Bogard et al. (2019) Bogard, N.; Linder, J.; Rosenberg, A. B.; and Seelig, G. 2019. A deep neural network for predicting and engineering alternative polyadenylation. Cell, 178(1): 91–106.
 Brookes, Park, and Listgarten (2019) Brookes, D.; Park, H.; and Listgarten, J. 2019. Conditioning by adaptive sampling for robust design. In ICML.
 Buchan and Stansfield (2007) Buchan, J. R.; and Stansfield, I. 2007. Halting a cellular production line: responses to ribosomal pausing during translation. Biology of the Cell, 99(9): 475–487.
 Carvalho, Polson, and Scott (2010) Carvalho, C. M.; Polson, N. G.; and Scott, J. G. 2010. The horseshoe estimator for sparse signals. Biometrika, 97(2): 465–480.
 Colbourn and Dinitz (2006) Colbourn, C. J.; and Dinitz, J. H. 2006. Handbook of Combinatorial Designs, Second Edition (Discrete Mathematics and Its Applications). Chapman & Hall/CRC.
 Dadkhahi et al. (2020) Dadkhahi, H.; Shanmugam, K.; Rios, J.; Das, P.; Hoffman, S.; Loeffler, T. D.; and Sankaranarayanan, S. 2020. Combinatorial BlackBox Optimization with Expert Advice. In KDD.
 Davis et al. (2008) Davis, M.; Sagan, S. M.; Pezacki, J. P.; Evans, D. J.; and Simmonds, P. 2008. Bioinformatic and physical characterizations of genomescale ordered RNA structure in mammalian RNA viruses. Journal of virology, 82(23): 11824–11836.
 Deshwal, Belakaria, and Doppa (2021) Deshwal, A.; Belakaria, S.; and Doppa, J. R. 2021. Mercer Features for Efficient Combinatorial Bayesian Optimization. AAAI, 35(8): 7210–7218.
 Deshwal et al. (2020) Deshwal, A.; Belakaria, S.; Doppa, J. R.; and Fern, A. 2020. Optimizing Discrete Spaces via Expensive Evaluations: A Learning to Search Framework. In AAAI.
 Dill and MacCallum (2012) Dill, K. A.; and MacCallum, J. L. 2012. The proteinfolding problem, 50 years on. Science, 338(6110): 1042–1046.
 Dixon et al. (2010) Dixon, N.; Duncan, J. N.; Geerlings, T.; Dunstan, M. S.; McCarthy, J. E.; Leys, D.; and Micklefield, J. 2010. Reengineering orthogonally selective riboswitches. Proceedings of the National Academy of Sciences, 107(7): 2830–2835.
 Eastman et al. (2018) Eastman, P.; Shi, J.; Ramsundar, B.; and Pande, V. S. 2018. Solving the RNA design problem with reinforcement learning. PLoS computational biology, 14(6): e1006176.
 Fannjiang and Listgarten (2020) Fannjiang, C.; and Listgarten, J. 2020. Autofocused oracles for modelbased design. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M. F.; and Lin, H., eds., NeurIPS.
 GarridoMerchán and HernándezLobato (2020) GarridoMerchán, E. C.; and HernándezLobato, D. 2020. Dealing with categorical and integervalued variables in bayesian optimization with gaussian processes. Neurocomputing, 380: 20–35.
 Gelman et al. (2013) Gelman, A.; Carlin, J. B.; Stern, H. S.; Dunson, D. B.; Vehtari, A.; and Rubin, D. B. 2013. Bayesian Data Analysis. Chapman and Hall/CRC, 3rd ed. edition.
 Geman and Geman (1984) Geman, S.; and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on pattern analysis and machine intelligence, 6: 721–741.
 Gerchinovitz and Yu (2011) Gerchinovitz, S.; and Yu, J. Y. 2011. Adaptive and Optimal Online Linear Regression on Balls. In Algorithmic Learning Theory.
 Golovin et al. (2017) Golovin, D.; Solnik, B.; Moitra, S.; Kochanski, G.; Karro, J.; and Sculley, D. 2017. Google vizier: A service for blackbox optimization. In KDD, 1487–1495.
 GómezBombarelli et al. (2018) GómezBombarelli, R.; Wei, J. N.; Duvenaud, D.; HernándezLobato, J. M.; SánchezLengeling, B.; Sheberla, D.; AguileraIparraguirre, J.; Hirzel, T. D.; Adams, R. P.; and AspuruGuzik, A. 2018. Automatic chemical design using a datadriven continuous representation of molecules. ACS central science, 4(2): 268–276.
 Gould, Hendy, and Papamichail (2014) Gould, N.; Hendy, O.; and Papamichail, D. 2014. Computational tools and algorithms for designing customized synthetic genes. Frontiers in bioengineering and biotechnology, 2: 41.
 Gupta and Zou (2019) Gupta, A.; and Zou, J. 2019. Feedback GAN for DNA optimizes protein functions. Nature Machine Intelligence, 1(2): 105–111.
 Hofacker (2003) Hofacker, I. L. 2003. Vienna RNA secondary structure server. Nucleic acids research, 31(13): 3429–3431.
 Hofacker et al. (1994) Hofacker, I. L.; Fontana, W.; Stadler, P. F.; Bonhoeffer, L. S.; Tacker, M.; and Schuster, P. 1994. Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie/Chemical Monthly, 125(2): 167–188.
 Hoffman and Gelman (2014) Hoffman, M. D.; and Gelman, A. 2014. The NoUTurn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(47): 1593–1623.
 Holland and Reitman (1978) Holland, J. H.; and Reitman, J. S. 1978. Cognitive systems based on adaptive algorithms. In Patterndirected inference systems, 313–329. Elsevier.
 Hoshika et al. (2019) Hoshika, S.; Leal, N. A.; Kim, M.J.; Kim, M.S.; Karalkar, N. B.; Kim, H.J.; Bates, A. M.; Watkins, N. E.; SantaLucia, H. A.; Meyer, A. J.; et al. 2019. Hachimoji DNA and RNA: A genetic system with eight building blocks. Science, 363(6429): 884–887.
 Hutter, Hoos, and LeytonBrown (2011) Hutter, F.; Hoos, H. H.; and LeytonBrown, K. 2011. Sequential ModelBased Optimization for General Algorithm Configuration. In International Conference on Learning and Intelligent Optimization, 507–523.
 Kennedy and Eberhart (1995) Kennedy, J.; and Eberhart, R. 1995. Particle swarm optimization. In International Conference on Neural Networks (ICNN), volume 4, 1942–1948 vol.4.
 Killoran et al. (2017) Killoran, N.; Lee, L. J.; Delong, A.; Duvenaud, D.; and Frey, B. J. 2017. Generating and designing DNA with deep generative models. arXiv:1712.06148.
 Kirkpatrick, Gelatt, and Vecchi (1983) Kirkpatrick, S.; Gelatt, C. D.; and Vecchi, M. P. 1983. Optimization by Simulated Annealing. Science, 220(4598): 671–680.
 Kocsis and Szepesvári (2006) Kocsis, L.; and Szepesvári, C. 2006. Bandit Based MonteCarlo Planning. In European Conference on Machine Learning (ECML), 282–293.

Leprêtre et al. (2019)
Leprêtre, F.; Verel, S.; Fonlupt, C.; and Marion, V. 2019.
Walsh functions as surrogate model for pseudoboolean optimization
problems.
In
The Genetic and Evolutionary Computation Conference (GECCO 2019)
, 303–311. ACM.  Li et al. (2015) Li, H.; Lee, T.; Dziubla, T.; Pi, F.; Guo, S.; Xu, J.; Li, C.; Haque, F.; Liang, X.J.; and Guo, P. 2015. RNA as a stable polymer to build controllable and defined nanostructures for material and biomedical applications. Nano today, 10(5): 631–655.
 Linder and Seelig (2020) Linder, J.; and Seelig, G. 2020. Fast differentiable DNA and protein sequence optimization for molecular design. Arxiv:2005.11275.
 Liu et al. (2020) Liu, G.; Zeng, H.; Mueller, J.; Carter, B.; Wang, Z.; Schilz, J.; Horny, G.; Birnbaum, M. E.; Ewert, S.; and Gifford, D. K. 2020. Antibody complementarity determining region design using highcapacity machine learning. Bioinformatics, 36(7): 2126–2133.
 Lorenz et al. (2011) Lorenz, R.; Bernhart, S. H.; Höner zu Siederdissen, C.; Tafer, H.; Flamm, C.; Stadler, P. F.; and Hofacker, I. L. 2011. ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1): 26.
 Makalic and Schmidt (2016) Makalic, E.; and Schmidt, D. F. 2016. A Simple Sampler for the Horseshoe Estimator. IEEE Signal Processing Letters, 23(1): 179–182.
 Markham and Zuker (2008) Markham, N. R.; and Zuker, M. 2008. UNAFold. In Bioinformatics, 3–31. Springer.
 Moss et al. (2020) Moss, H.; Leslie, D.; Beck, D.; González, J.; and Rayson, P. 2020. BOSS: Bayesian Optimization over String Spaces. In NeurIPS.
 Ng et al. (2019) Ng, A. H.; Nguyen, T. H.; GómezSchiavon, M.; Dods, G.; Langan, R. A.; Boyken, S. E.; Samson, J. A.; Waldburger, L. M.; Dueber, J. E.; Baker, D.; et al. 2019. Modular and tunable biological feedback control using a de novo protein switch. Nature, 572(7768): 265–269.
 Oh et al. (2019) Oh, C.; Tomczak, J.; Gavves, E.; and Welling, M. 2019. Combinatorial Bayesian Optimization using the Graph Cartesian Product. In NeurIPS.
 Piironen and Vehtari (2017) Piironen, J.; and Vehtari, A. 2017. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2): 5018–5051.
 Pincus (1970) Pincus, M. 1970. Letter to the editor—a Monte Carlo method for the approximate solution of certain types of constrained optimization problems. Operations Research, 18(6): 1225–1228.
 Portela (2018) Portela, F. 2018. An unexpectedly effective Monte Carlo technique for the RNA inverse folding problem. BioRxiv, 345587.
 Ricardo Baptista (2018) Ricardo Baptista, M. P. 2018. Bayesian Optimization of Combinatorial Structures. In ICML.
 Runge et al. (2019) Runge, F.; Stoll, D.; Falkner, S.; and Hutter, F. 2019. Learning to Design RNA. In ICLR.
 Shahriari et al. (2015) Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and De Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1): 148–175.
 Sinai et al. (2021) Sinai, S.; Wang, R.; Whatley, A.; Slocum, S.; Locane, E.; and Kelsic, E. 2021. AdaLead: A simple and robust adaptive greedy search algorithm for sequence design.
 Spears (1993) Spears, W. M. 1993. Simulated Annealing for Hard Satisfiability Problems. In DIMACS Workshop: Cliques, Coloring, and Satisfiability, 533–557.
 Stephens et al. (2015) Stephens, Z. D.; Lee, S. Y.; Faghri, F.; Campbell, R. H.; Zhai, C.; Efron, M. J.; Iyer, R.; Schatz, M. C.; Sinha, S.; and Robinson, G. E. 2015. Big data: astronomical or genomical? PLoS biology, 13(7): e1002195.
 Swersky et al. (2020) Swersky, K.; Rubanova, Y.; Dohan, D.; and Murphy, K. 2020. Amortized Bayesian Optimization over Discrete Spaces. In UAI.
 Taneda (2015) Taneda, A. 2015. Multiobjective optimization for RNA design with multiple target secondary structures. BMC bioinformatics, 16(1): 280.
 Terras (1999) Terras, A. 1999. Fourier Analysis on Finite Groups and Applications. London Mathematical Society Student Texts. Cambridge University Press.

Thompson (1933)
Thompson, W. R. 1933.
On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.
Biometrika, 25(34): 285–294.  Thompson (1935) Thompson, W. R. 1935. On the Theory of Apportionment. American Journal of Mathematics, 57: 450.
 Tripp, Daxberger, and HernándezLobato (2020) Tripp, A.; Daxberger, E.; and HernándezLobato, J. M. 2020. SampleEfficient Optimization in the Latent Space of Deep Generative Models via Weighted Retraining. In Advances in Neural Information Processing Systems.
 Trotta (2014) Trotta, E. 2014. On the normalization of the minimum free energy of RNAs by sequence length. PloS one, 9(11).
 Yamagami et al. (2019) Yamagami, R.; Kayedkhordeh, M.; Mathews, D. H.; and Bevilacqua, P. C. 2019. Design of highly active doublepseudoknotted ribozymes: a combined computational and experimental study. Nucleic acids research, 47(1): 29–42.
 Yang et al. (2017) Yang, X.; Yoshizoe, K.; Taneda, A.; and Tsuda, K. 2017. RNA inverse folding using Monte Carlo tree search. BMC Bioinformatics, 8: 468.
 Zadeh et al. (2011) Zadeh, J. N.; Steenberg, C. D.; Bois, J. S.; Wolfe, B. R.; Pierce, M. B.; Khan, A. R.; Dirks, R. M.; and Pierce, N. A. 2011. NUPACK: Analysis and design of nucleic acid systems. Journal of computational chemistry, 32(1): 170–173.
 Zuker and Stiegler (1981) Zuker, M.; and Stiegler, P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic acids research, 9(1): 133–148.
Appendix A Proof of Theorem 3.1
We first assume that the onehot variables . Plugging different choices of into Equation (2) leads to a system of linear equations with unknowns (coefficients ) and equations. We can express this system in matrix form as the product of the matrix of monomials (where each column corresponds to a monomial , and each element corresponds to the evaluation of the monomial at the th choice for ) and the vector of unknown coefficients, which is set equal to the function values at the corresponding choices for . We claim that there exists a permutation of the rows and columns of the matrix of monomials such that the latter becomes unit lower triangular, and is thereby full rank. As a result, the representation in (2) for is complete and unique.
To formally show that this permutation for the monomials’ matrix exists, we use a construction by induction over the number of variables included in the representation. We denote the monomials’ matrix over variables with , and define the onehot variables ) as the descendants of the parent categorical variable (). It is easy to see that such a construction for the base case of only one variable exits, as the monomials’ matrix is a matrix. In this case, we use the following permutation of the monomials (columns): . We also use the following permutation of the values in rows: . As a result, in this matrix only the elements of the first column and the main diagonal are nonzero and equal to one, and thus the matrix is unit lower triangular.
Assuming that the induction hypothesis holds for variables, we show that it also holds for variables. Starting from a unit lower triangular matrix , we can construct the matrix as follows. Note that all the columns of correspond to the monomials composed of the descendants of the first variables , whereas each of the additional columns introduced in involves exactly one term from (possibly also containing factors from the previous variables). We can express using the following binomial expansion:
(10) 
In words, the additional columns can be considered as a collection of th order monomials (), each of which includes one out of the descendant variables () together with variables from the (descendants) of the previous variables. Each of the latter variables can take values in , whereas the remaining variables are set to .
Starting with , we have first order terms which we assign to columns . The values associated with rows are constructed by assuming that takes values from to (in order), while all the remaining () variables are set to . As a consequence of , all the elements on the main diagonal are ones; also, as a consequence of , all the higher degree monomials involving (which occupy the elements after the diagonal ones) are equal to zeros. Thus the augmented matrix, until this point, remains unit lower triangular.
We then consider the second degree terms , where we have terms (choice of one out of the other variables, each taking values from ) for each of the choices for the variable . Starting with second order monomials involving and , we again assume that all the remaining variables () are equal^{2}^{2}2Note that this assumption is necessary in order to ensure that monomials in future columns for the same row are evaluated to zero; choices of where this assumption is not valid is addressed in next rows. to . For any choice of , we add a new column corresponding to the monomial as well as a new row in which and , whereas the remaining variables are set to . As a result, we have that: the diagonal element in the new row/column is equal to one, and all the elements in the future^{3}^{3}3Note that the elements in the previous columns in the same row corresponding to monomials involving the selected variables are nonzeros as well. columns are zeros, since any combination of one of the descendants of with the remaining variables (any variable except and ) is zero. We continue this construction strategy for the remaining variables until all the second degree terms involving and one out of the remaining variables is exhausted. We then repeat the same idea for terms with orders up to , as defined by the binomial expansion in . As a result of this construction strategy, the monomial matrix is unit lower triangular.
Now, we use this result to show the completeness and uniqueness of the representation with onehot variables in in the following way.
Completeness: We showed that the representation (2) over is complete, i.e. we can express any function using the representation in (2), where we have at most one descendant term from the same parent in each monomial. Now, we replace each (from ) in the latter representation with , where . The new representation can also be expressed via the expansion (2) since no two descendants from the same parent variable are being multiplied with each another.
Uniqueness: Assume that the uniqueness condition is not satisfied. Then we have two distinct polynomial representations and that have the same value for every . However, since and are distinct polynomials, is a polynomial which is nonzero in at least one input . This implies that is also nonzero, which is a contradiction, and the proof is complete.
Remark 1.
The grouptheoretic Fourier representation defined in Equation (6) is unique and complete.
Proof.
Let be the categorical domain. Let the true function be . For generality, let us consider a complex valued function where is the field of complex numbers. The basis functions are . Now, one can view a function as a length vector, one entry each for evaluating the function at every point in the domain . We denote the vector for function , thus obtained, by . Similarly, denote the vector for evaluations of the basis function by . Let be a matrix created by stacking all vectors corresponding to basis vectors in the columns. Then, the Fourier representation can be written as where is the vector of Fourier coefficients in our grouptheoretic representation. Now, due to the use of complex exponentials, one can show that if . Therefore, the columns of the matrix are orthogonal. Hence, is a full rank matrix. Therefore, our representation is merely representing a vector in another full rank orthogonal basis. Hence, it is unique and complete. ∎
Appendix B Description of Algorithms
Surrogate Model Learning Algorithm via Online Regression: Let and denote the number of variables and the cardinality of each variable, respectively. The surrogate models used in ECOF and ECOG correspond to approximations of the representations given in (2) and (8), respectively, where each approximation is obtained by restricting the maximum order of interactions among variables to . We consider each term in the latter surrogate models, i.e. monomials from (3) in ECOF and characters from (9) in ECOG, as an expert, denoted by (). The number of such experts in ECOF is which coincides with the dimensionality of the space when , whereas the number of experts in ECOG is equal to . The coefficient of each expert is designated by . Since the exponential weights, utilized to update the coefficients , are nonnegative, we maintain two nonnegative coefficients and , which yield .
We initialize all the coefficients with a uniform prior, i.e. (). In each time step , we draw a sample via Algorithm 1 with respect to our current estimate for the surrogate model . The latter sample is then plugged into the blackbox function to obtain the evaluation . This leads to a mixture loss as the difference between the evaluations obtained by our surrogate model and the blackbox function for query . Using this mixture loss, we compute the individual loss for each expert . Finally, we update each coefficient in the model via an exponential weight obtained according to its incurred individual loss. We repeat this process until stopping criteria are met.
Number of Characters in Representations: The number of terms in vanilla onehot encoded Fourier representation is , whereas our abridged representation reduces this number to matching the space dimensionality, thereby making the algorithms computationally tractable and efficient. When a max degree of is used in the approximate representation, the number of terms in the abridged representation is equal to . The corresponding number in a vanilla onehot encoded representation is equal to . Finally, the numbers of terms in the full and order grouptheoretic Fourier expansions are equal to and , respectively.
Surrogate Model Learning Algorithm via Sparse Bayesian Regression: We use our surrogate model to approximate the blackbox function, where are the characters of the representation, is the number of characters in the representation, and are the coefficients. We utilize the abridged onehot encoded Fourier representation for this model due to its smaller number of characters which makes it much more scalable. We assume that onehot encoded variables take values in . Note that from the proof of the abridged onehot encoded Fourier representation in Appendix A, both assignments and for onehot encoded variables are permissible in the latter representation.
We assume that the approximation errors between the blackbox evaluations at points and our surrogate function values are independent and identically distributed as follows:
(11) 
We estimate the coefficients of our surrogate approximation using Bayesian linear regression over the blackbox evaluation data seen so far. To avoid highvariance estimators for the coefficients due to data scarcity, sparsityinducing priors are recommended in
Ricardo Baptista (2018). We use the regularized horseshoe model Piironen and Vehtari (2017) defined as follows:where
is the standard halfCauchy distribution, and
is the number of observations seen so far. In this model, the global and the local hyperparameters individually shrink the magnitude of regression coefficients . The hyperparameter designates the expected number of relevant coefficients, but since the sparsity induced by the corresponding threshold is only weakly sensitive to , a rough estimate is typically sufficient. In our experiments, we set , and .Given the above Bayesian model, the TCOF algorithm works as follows. At each time step , we draw a sample for the coefficients () from the posterior. We use the NoUTurn Sampler (NUTS) Hoffman and Gelman (2014) in order to sample the coefficients efficiently. Given the latter sample, we then construct the approximation function instance . Following Ricardo Baptista (2018), we then select our next point for blackbox evaluation by minimizing a regularized version of this approximation function, i.e. , where designates the norm of , and
is a regularization hyperparameter. Throughout our experiments, we set
. The AFO in either algorithm 1 or algorithm 2 is then used to carry out this minimization for the generic BBO or the design problem, respectively.Appendix C Continued Related Work
Metaheuristics: A variety of discrete search algorithms and metaheuristics have been studied in the literature for combinatorial optimization over categorical variables. Such algorithms, including Genetic Algorithms Holland and Reitman (1978), Simulated Annealing Spears (1993), and Particle Swarms Kennedy and Eberhart (1995), are generally inefficient in finding the global minima. In the context of biological sequence optimization, the most popular method is directed evolution Arnold (1998), which explores the space by only making small mutations to existing sequences. In the context of sequence optimization, a recent promising approach consists of fitting a neural network model to predict the black box function and then applying gradient ascent on the latter model Killoran et al. (2017); Bogard et al. (2019); Liu et al. (2020). This approach allows for a continuous relaxation of the discrete search space making possible stepwise local improvements to the whole sequence at once based on a gradient direction. However, these methods have been shown to suffer from vanishing gradients Linder and Seelig (2020). Further, the projected sequences in the continuous relaxation space may not be recognized by the predictors, leading to poor convergence. Generative modelbased optimization approaches aim to learn distributions whose expectation coincides with evaluations of the black box and try to maximize such expectation Gupta and Zou (2019); Brookes, Park, and Listgarten (2019). However, such approaches require a pretrained generative model for optimization.
Latent Space Optimization: The key idea is to employ an encoderdecoder architecture to learn a continuous representation from the data and perform BO in a latent space GómezBombarelli et al. (2018); Tripp, Daxberger, and HernándezLobato (2020). This approach has two main drawbacks: it could generate a large fraction of invalid structures, and it requires a large database of relevant structures, for training an autoencoder, which may not be available in many applications with scarce data availability.
VariableLength Sequence Optimization: Moss et al. Moss et al. (2020) propose an algorithm called BOSS for optimization over strings of variable lengths. They use a Gaussian process surrogate model based on string kernels and perform acquisition function maximization for spaces with syntactic constraints. On the contrary, our main focus is on fixed length generic BBO and design problems.
Ensemble (Populationbased) Methods: Angermueller et al. Angermüller et al. (2020) propose a populationbased approach which generates batches of sequences by sampling from an ensemble of methods, rather than from a single method. This sampling is carried out proportional to the quality of sequences that each method has found in the past. A similar ensemble concept has been used in Sinai et al. (2021). In our work, we focus on individual methods; however, we add that since we are proposing two distinct representations employed via different algorithms (online and Bayesian), and the performance of each varies across different tasks, it would be interesting to consider an ensemble method over different algorithms as a future direction.
Batch (Offline) Optimization: Batch optimization techniques Angermueller et al. (2020); Brookes, Park, and Listgarten (2019); Fannjiang and Listgarten (2020) are a related but fundamentally distinct line of work, where blackbox evaluations are carried out in sample batches rather than in an active/sequential manner. To highlight the distinction between the two settings we iterate our problem setting: Given query access to a combinatorial black box (i.e. it can be queried on a single point), what is the best solution we can obtain in a given number of such queries. One wishes to minimize the number of queries and hence goes for a highly adaptive sequential algorithm.
Appendix D BBO Experiments
Latin Square Problem: A latin square of order is a matrix of elements , such that each number appears in each row and column exactly once. When , the problem of finding a latin square has solutions in a space of dimensionality . We formulate the problem of finding a latin square of order as a blackbox function by imposing an additive penalty of one for any repetition of numbers in any row or column. As a result, function evaluations are in the range , and a function evaluation of zero corresponds to a latin square of order . We consider a noisy version of this problem, where an additive Gaussian noise with zero mean and standard deviation of is added to function evaluations observed by each algorithm.
Figure 5 demonstrates the performance of different algorithms, in terms of the best function value found until time , over time steps. Both ECOF and ECOG outperform the baselines with a considerable margin. In addition, ECOG outperforms COMBO at samples. At larger time steps, COMBO outperforms the other algorithms, however, this performance comes at the price of a far larger computation time. As demonstrated in Table 2, ECOF and ECOG offer a speedup over COMBO by a factor of approximately and , respectively. We note that TCOF (not included in the plot) performs poorly (similar to RS) on this problem, which can be attributed to the strong promotion of sparsity by the regularized horseshoe prior and the fact that the Latin Square problem has a dense representation (we observed a similar behavior from the horseshoe prior of Ricardo Baptista (2018)).
Pest Control Problem: In the pest control problem, given stations and pesticide types, the idea is to maintain the spread of pest (with minimum cost), which is propagating throughout the stations in an interactive and probabilistic fashion. The th category for each variable corresponds to the choice of no pesticide at all. Controlling the spread of the pest is carried out via the choice of the right type of pesticide subject to a penalty proportional to its associated cost. A closed form definition of this problem is given in Oh et al. (2019).
The results for different algorithms are shown in Figure 6. Despite the fact that COMBO is able to find the minimum in fewer time steps (in steps) than ECOF (in steps) on average, ECOF outperforms COMBO during initial time steps (until ). TCOF is able to find the minimum faster than ECOF and within steps. SA performs competitively, but eventually is unable to find the optimal solution to this problem over the designated steps. The poor performance of ECOG can be explained by the interactive nature of the problem, where early mistakes are punished inordinately. Early mistakes made by ECOG can also be attributed to the large number of experts (with noisy coefficients) in its model, which in turn promotes an early exploratory behavior.
RNA Sequence Optimization Problem: Structured RNA molecules play a critical role in many biological applications, ranging from control of gene expression to protein translation. The native secondary structure of a RNA molecule is usually the minimum free energy (MFE) structure. Consider an RNA sequence as a string of letters (nucleotides) over the alphabet . A pair of complementary nucleotides and , where , can interact with each other and form a base pair (denoted by ), AU, CG and GU being the energetically stable pairs. Thus, the secondary structure of an RNA can be represented by an ensemble of pairing bases.
Finding the most stable RNA sequences has immediate applications in material and biomedical applications Li et al. (2015). Studies show that by controlling the structure and free energy of a RNA molecule, one may modulate its translation rate and halflife in a cell Buchan and Stansfield (2007); Davis et al. (2008), which is important in the context of viral RNA. A number of RNA folding algorithms Lorenz et al. (2011); Markham and Zuker (2008) use a thermodynamic model (e.g. Zuker and Stiegler (1981)) and dynamic programming to estimate MFE of a sequence. However, the time complexity of these algorithms prohibits their use for evaluating substantial numbers of RNA sequences Gould, Hendy, and Papamichail (2014) and exhaustively searching the space to identify the global free energy minimum, as the number of sequences grows exponentially as .
Here, we formulate the RNA sequence optimization problem as follows: For a sequence of length , find the RNA sequence that will fold into the secondary structure with the lowest minimum free energy. In our experiments, we initially set and . We then use the popular RNAfold package Lorenz et al. (2011) to evaluate the MFE for a given sequence. The goal is to find the lowest MFE sequence by calling the MFE evaluator minimum number of times. The performance of different algorithms is depicted in Figure 2, where both ECOF and particularly ECOG outperform the baselines as well as COMBO by a considerable margin. At higher number of evaluations, TCOF beats the rest of the algorithms, which can be attributed to its explorationexploitation tradeoff.
Data  Latin Square  Pest Cont.  RNA Opt. 

25  25  30  
5  5  4  
COMBO  170.4  151.0  253.8 
ECOF  1.5  1.4  2.0 
ECOG  3.6  3.3  5.7 
TCOF  55.7  53.2  67.0 
Energyoptimized RNA Structures: Sample RNA sequences obtained via ECOG after time steps for and are shown in Figures 17 and 17, respectively. The resulting energyoptimized sequences (as obtained using RNAfold service) have high () GC content that makes the strongest positive contribution to lowering MFE Trotta (2014), as pairings between G and C have three hydrogen bonds and are more stable compared to A and U pairings, which have only two. The final singlestrand RNA sequence folds into a GCpaired double helix and a nucleotide long hairpin loop in the middle, which is a tetraloop reported in literature (http://www.rna.icmb.utexas.edu/SIM/4C/energetics˙new/). Figures 21 and 21 show two sample structures of the ECOG optimized sequences for
, again showing the same trend. For odd values of
, there is presence of a loop with an odd number of residues or a single unpaired base at the end, but there is still a GCrich double helix. In contrast, the structures generated by the underperforming algorithms do show presence of unpaired bases and are less in GC content, leading to high energy structures (e.g. Figures 19 and 19 are obtained via SA after steps for and , respectively).Appendix E Design Experiments
For our experiments, we focus on three puzzles from the Eterna100 dataset AndersonLee et al. (2016). Two of the selected sequences (puzzles and of lengths and , resp.), despite their fairly small lengths, are challenging for many algorithms (see AndersonLee et al. (2016)). In both puzzles, our MCTS variants (ECOF and ECOG) are able to significantly improve the performance of MCTS when limited number of true rewards (i.e. blackbox evaluations) are available. All algorithms outperformed RS as expected. Within the given evaluation budget, TCOF, ECOG, and especially ECOF, are superior to LEARNA by a substantial margin (see Figure 8). In puzzle number (Figure 9), again TCOF, ECOG and ECOF significantly outperform LEARNA, over the given number of evaluations. Interestingly, ECOF is able to outperform LEARNA throughout the evaluation process, and in average finds a far better final solution than LEARNA.
The final sequence is puzzle of length . The results of different algorithms over the latter puzzle is shown in Figure 10. As we can see from this figure, MCTSRNA and LEARNA perform very similarly over the given evaluation budget. ECOF is able to outperform the remaining algorithm throughout the evaluation steps. Initially, ECOG has a similar performance to those of MCTSRNA and LEARNA, but offers an improved performance over the latter two just after steps. Due to the higher dimensionality of this puzzle, we drop TCOF due to its high computation cost (expected time: days).
Appendix F Choice of AFO
Throughout the experiments, we designated SA and MCTS as AFO for generic BBO and design problems, respectively. The latter choice was made in accordance with the literature, where SA is typically used as a baseline and method of choice for the generic BBO problem, whereas MCTS has been commonly used for the design problem. For instance, SA has been considered as a baseline and/or AFO in Dadkhahi et al. (2020), Ricardo Baptista (2018), and Oh et al. (2019) (albeit with a different algorithm than ours). On the other hand, MCTS (i.e. RNAMCTS as well as its variations) is perhaps the most popular RNA design technique in the literature. Here, we point out that both SA and MCTS can be used for both generic BBO and design problems. In this section, we compare the performance of different AFO methods in each problem.
First, we consider the generic BBO problem of RNA sequence optimization with (considered in Section 4). Figure 11 demonstrates the performance of ECOF and ECOG when SA or MCTS are used as AFO. As we can see from this figure, the SAasAFO variants perform slightly better than MCTSasAFO counterparts over steps. In particular, although the performance gap is initially moderately large, over time this performance gap becomes smaller.
Next, we consider two design problems considered in Section 4: puzzles and . Note that, when using SA as AFO, we apply the softmax operator (in Algorithm 1) over the set of if the corresponding variable is part of a paired base. As we can see from Figure 12, for puzzle , ECOF with MCTS outperforms the remaining algorithms. The rest of the algorithms have very similar performances, with ECOF (MCTS) marginally surpassing the SA variants. As we can see from Figure 13, for puzzle , the MCTS variant of ECOF slightly outperforms its SA variant, whereas the MCTS variant of ECOG maintains a bigger gap with its SA variant.