# A discrete version of CMA-ES

Modern machine learning increasingly relies on advanced optimization techniques to find optimal hyperparameters. Whenever the objective function is non-convex, non-continuous and has potentially multiple local minima, standard gradient-descent methods fail. A very different method of last resort is to assume that the optimum (or optima, as it is not necessarily unique) is distributed according to some distribution, and to iteratively adapt that distribution according to the points tested. These strategies, which originated in the early 1960s under the name Evolution Strategies (ES), culminated in CMA-ES (Covariance Matrix Adaptation Evolution Strategy). CMA-ES relies on a multivariate normal distribution and is regarded as the state of the art for general-purpose optimization. However, it is far from optimal for discrete variables. In this paper, we extend the method to multivariate correlated binomial distributions. We show that such a distribution shares key features with the multivariate normal: independence and zero correlation are equivalent, and correlation is efficiently modeled through interaction terms between variables. We discuss this distribution in the framework of the exponential family. We prove that the model can estimate not only pairwise interactions between variables but also higher-order interactions. This allows creating a version of CMA-ES that accommodates discrete variables efficiently. We provide the corresponding algorithm and conclude.


## 1 Introduction

When facing an optimization problem where the objective function's gradient is unavailable or not smooth, the state-of-the-art techniques rely on stochastic, derivative-free algorithms that radically change the point of view on the optimization problem. Instead of a deterministic gradient descent, we take a Bayesian point of view: we assume that the optimum is distributed according to a prior statistical distribution, and use particles, or random draws, to gradually update this distribution. Among these methods, the covariance matrix adaptation evolution strategy (CMA-ES; e.g., [14, 13]) has emerged as the leading stochastic and derivative-free algorithm for solving continuous optimization problems, i.e., for finding the optimum, denoted by x*, of a real-valued objective function f defined on a subset of an n-dimensional space. The method generates candidate points x_1, …, x_λ from a multivariate Gaussian distribution and evaluates their objective function (also called fitness) values f(x_1), …, f(x_λ). As the distribution is characterized by its first two moments, it updates the mean vector and the covariance matrix using the sampled points and their fitness values.

The algorithm keeps repeating this sampling-evaluation-update procedure (which can be seen as an exploration-exploitation method) until the distribution contracts to a single point or the maximum number of iterations is reached; convergence is detected when the covariance matrix becomes very small. The various refinements of the original method investigate different heuristics for updating the distribution parameters, which strongly determine the behavior and efficiency of the whole algorithm. The theoretical foundations of CMA-ES are twofold: for continuous variables with given first two moments, the maximum entropy distribution is the normal distribution, and the update is based on maximum-likelihood estimation, grounding the method in statistical principles.

A natural question is how to adapt this method to discrete variables. Surprisingly, this has not been done before, even though the Gaussian is a continuous distribution inappropriate for discrete variables. One needs to change the underlying distribution and also to find a way to correlate the marginal distributions, which can be tricky. We show in this paper that multivariate binomials are the natural discrete counterpart of Gaussian distributions; hence we are able to modify CMA-ES to accommodate discrete variables. This is the subject of this paper. In section 2, we introduce the multivariate binomial distribution. Presenting it in the general setting of the exponential family, we easily derive various properties and connect this distribution to maximum entropy. We also prove that, for the assumed correlation structure, independence and zero correlation are equivalent, which is also a feature of Gaussian distributions. In section 3, we present the algorithm.

## 2 Primer on Multivariate Binomials

### 2.1 Intuition

To start building some intuition about multivariate binomials, we begin with the simplest case, the two-dimensional Bernoulli, which extends the univariate Bernoulli distribution to two dimensions. A Bernoulli random variable X is a discrete variable that takes the value 1 with probability p and 0 otherwise. The usual notation for the probability mass function is

 P(X=x) = p^x (1−p)^{1−x},  x ∈ {0,1}

A natural extension is to consider the random vector X = (X1, X2). It takes values in the Cartesian product space {0,1} × {0,1}. If we denote by p_{ij} = P(X1=i, X2=j) the joint probabilities for i, j ∈ {0,1}, then the probability mass function of the bivariate Bernoulli writes:

 P(X=x) = P(X1=x1, X2=x2) = p_{11}^{x1 x2} p_{10}^{x1(1−x2)} p_{01}^{(1−x1)x2} p_{00}^{(1−x1)(1−x2)} (1)

with the side conditions that the joint probabilities lie between 0 and 1, 0 ≤ p_{ij} ≤ 1 for i, j ∈ {0,1}, and that they sum to one: Σ_{i,j} p_{ij} = 1.

It is however better to write the joint distribution in terms of the canonical parameters of the related exponential family. Hence, if we define

 θ1 = log(p10 / p00), (2)
 θ2 = log(p01 / p00), (3)
 θ12 = log(p11 p00 / (p10 p01)), (4)

and the vector of sufficient statistics denoted by

 T(X) = (X1, X2, X1 X2)^T (5)

we can rewrite the distribution as an exponential family distribution as follows:

 P(X=x) = exp(⟨T(x), θ⟩ − A(θ)) (6)

where the log partition function A(θ) is defined such that the probability normalizes to one. It is very easy to check that A(θ) = −log(p00). We can also relate the initial moment parameters p_{ij}, for i, j ∈ {0,1}, to the canonical parameters as follows

###### Proposition 1.

The moment parameters can be expressed in terms of the canonical parameters as follows:

 p00 = 1 / (1 + exp(θ1) + exp(θ2) + exp(θ1+θ2+θ12)), (7)
 p10 = exp(θ1) / (1 + exp(θ1) + exp(θ2) + exp(θ1+θ2+θ12)), (8)
 p01 = exp(θ2) / (1 + exp(θ1) + exp(θ2) + exp(θ1+θ2+θ12)), (9)
 p11 = exp(θ1+θ2+θ12) / (1 + exp(θ1) + exp(θ2) + exp(θ1+θ2+θ12)). (10)
###### Proof.

See A.1
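As a quick numerical sanity check (our own sketch, not part of the paper; the function names are ours), the conversion between moment and canonical parameters of Proposition 1 can be verified in a few lines:

```python
import math

def canonical_from_moments(p00, p10, p01, p11):
    """Canonical parameters (theta1, theta2, theta12), equations (2)-(4)."""
    return (math.log(p10 / p00), math.log(p01 / p00),
            math.log(p11 * p00 / (p10 * p01)))

def moments_from_canonical(t1, t2, t12):
    """Inverse map, equations (7)-(10)."""
    z = 1 + math.exp(t1) + math.exp(t2) + math.exp(t1 + t2 + t12)
    return (1 / z, math.exp(t1) / z, math.exp(t2) / z,
            math.exp(t1 + t2 + t12) / z)

p = (0.3, 0.2, 0.4, 0.1)          # an arbitrary bivariate Bernoulli
theta = canonical_from_moments(*p)
back = moments_from_canonical(*theta)
# the round trip recovers the probabilities, which sum to one
```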

The expression of the distribution in terms of the canonical parameters is particularly useful, as it indicates immediately that independence and zero correlation are equivalent, as in a Gaussian distribution. We will see in subsection 2.3 that this result generalizes to the multivariate binomial.

###### Proposition 2.

The components of the bivariate Bernoulli random vector are independent if and only if θ12 is zero. As for a normal distribution, independence and zero correlation are equivalent.

###### Proof.

See A.2
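Proposition 2 is easy to check numerically (our own illustration; the marginal probabilities `a` and `b` are arbitrary): if the components are independent, the joint probabilities factor and the cross term θ12 vanishes.

```python
import math

# Independent marginals with P(X1=1) = a and P(X2=1) = b
a, b = 0.35, 0.6
p00, p10, p01, p11 = (1 - a) * (1 - b), a * (1 - b), (1 - a) * b, a * b

# Cross term theta12 from equation (4); it is zero under independence
theta12 = math.log(p11 * p00 / (p10 * p01))
```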

The equivalence between zero correlation and independence was already presented in [23], where it appears as Proposition 2.4.1. The importance of θ12 (referred to as the cross term or u-term) is discussed there, and θ12 is called the cross-product ratio between X1 and X2. In [19] and [18], this cross-product ratio is also identified but called the log odds.

Intuitively, there are similarities between the Bernoulli (and its sum version, the binomial) and the Gaussian. And as for the multivariate Gaussian, we can prove that the marginal and conditional distributions of a bivariate Bernoulli are still Bernoulli, as shown by proposition 3, making the analogy between the Bernoulli (and soon its independent-sum version, the binomial) and the Gaussian even more striking.

###### Proposition 3.

For the bivariate Bernoulli vector whose probability mass function is given by (1),

• the marginal distribution of X1 is a univariate Bernoulli whose probability mass function is

 P(X1=x1) = (p10+p11)^{x1} (p00+p01)^{1−x1}, (11)

• the conditional distribution of X1 given X2 is also a univariate Bernoulli whose probability mass function is

 P(X1=x1 | X2=x2) = (p_{1,x2} / (p_{1,x2}+p_{0,x2}))^{x1} (p_{0,x2} / (p_{1,x2}+p_{0,x2}))^{1−x1} (12)
###### Proof.

See A.3

Before we move on to the generalization, we need to mention a few important facts. First, recall that the sum of n independent Bernoulli trials with parameter p is a binomial with parameters n and p. And an independent sum should ring a bell: it should immediately make you think of the central limit theorem. This intuition is correct and reveals the connection between the binomial and Gaussian distributions. This is the de Moivre-Laplace theorem stated below.

###### Theorem 1.

If the variable X follows a binomial distribution with parameters n and p in (0,1), then the variable (X − np)/√(np(1−p)) converges in law to a standard normal N(0,1). Another presentation of this result is to say that, for fixed p, as n grows large, for k in the neighborhood of np we can approximate the binomial distribution by a normal as follows:

 (n choose k) p^k (1−p)^{n−k} ≃ e^{−(k−np)² / (2np(1−p))} / √(2πnp(1−p)) (13)
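The quality of the approximation (13) is easy to observe numerically (our own sketch; the values of n, p and k are arbitrary choices near the mode):

```python
import math

def binom_pmf(n, k, p):
    """Exact binomial probability mass function."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def normal_approx(n, k, p):
    """Right-hand side of the de Moivre-Laplace approximation (13)."""
    v = n * p * (1 - p)
    return math.exp(-(k - n * p) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

n, p = 1000, 0.3
k = 300                      # in the neighborhood of np
exact = binom_pmf(n, k, p)
approx = normal_approx(n, k, p)
# for this n the two values agree to well under one percent
```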

The proof of this theorem is traditionally done via a Taylor expansion of the characteristic function. An alternative proof uses the Stirling formula together with a Taylor expansion to relate the binomial distribution to the normal one. Historically, de Moivre was the first to establish this theorem, in 1733, in the particular case p = 1/2. Laplace generalized it in 1812 for any value of p between 0 and 1 and laid the groundwork for the central limit theorem, which extended this result far beyond. Later on, many more mathematicians generalized and extended this result: Cauchy, Bessel and Poisson, but also von Mises, Pólya, Lindeberg, Lévy and Cramér, as well as Chebyshev, Markov and Lyapunov.

Second, the asymptotic limit of the binomial distribution for a large number of Bernoulli trials with a vanishing success probability is another discrete distribution belonging to the exponential family: the Poisson distribution.

###### Proposition 4.

For a large number n of independent Bernoulli trials with probability p_n such that np_n → λ, the corresponding binomial distribution with parameters n and p_n converges in distribution to the Poisson distribution with parameter λ.

###### Proof.

See A.4
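The Poisson limit of Proposition 4 can be illustrated numerically (our own sketch; λ = 2 and the range of n are arbitrary choices):

```python
import math

def binom_pmf(n, k, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 2.0
errs = []
for n in (10, 100, 1000):
    p = lam / n                  # keep np = lambda fixed as n grows
    errs.append(max(abs(binom_pmf(n, k, p) - poisson_pmf(k, lam))
                    for k in range(9)))
# errs decreases towards zero, illustrating convergence in distribution
```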

The two previous results show that the binomial, Poisson and Gaussian distributions, which all belong to the exponential family, are closely connected and represent discrete and continuous versions of very similar concepts, namely independent and identically distributed increments.

### 2.2 Maximum entropy

It is also interesting to relate these distributions to maximum Shannon entropy. Let Φ : X → R^d be a function, where X is the space of the random variable, and let α be a vector in R^d. It is well known that the maximum entropy distribution under the constraint E[Φ(X)] = α belongs to the exponential family, as stated in the following theorem.

###### Theorem 2.

The distribution that maximizes the Shannon entropy H(p) = −∫ p log p dμ subject to the constraint E_p[Φ(X)] = α and the obvious probability constraints p ≥ 0, ∫ p dμ = 1, is the unique distribution of the exponential family given by

 p_θ = exp(⟨θ, Φ(x)⟩ − A(θ)), (14)

with

 A(θ) = log ∫ exp(⟨θ, Φ(x)⟩) dμ(x)

See A.5

###### Remark 2.1.

Theorem 2 also holds for discrete distributions. It says that the discrete distribution that maximizes the Shannon entropy subject to the constraint E_p[Φ(X)] = α and the obvious probability constraints p ≥ 0, Σ_x p(x) = 1, is the unique distribution of the exponential family given by

 p_θ = exp(⟨θ, Φ(x)⟩ − A(θ)), (15)

with

 A(θ) = log Σ_x exp(⟨θ, Φ(x)⟩)

Theorem 2 implies in particular that the continuous distribution maximizing entropy with given mean and variance (or equivalently given first and second moments) is an exponential family of the form

 exp(θ1 x + θ2 x² − A(θ))

where the log partition function A(θ) is defined to ensure the probability distribution sums to one. This is indeed a normal distribution, as it is the exponential of a quadratic form; and it is precisely the continuous distribution used in the CMA-ES algorithm. Taking a distribution that maximizes entropy means taking the distribution with the least informative prior or, said differently, the distribution with the minimal prior structural constraint. If nothing is known, it should therefore be preferred.

Ideally, for our discrete CMA-ES adaptation, we would like to find the discrete equivalent of the normal distribution. If we want a discrete distribution built from independent increments, we should turn to binomial distributions. Binomials have the further advantage of converging to the normal distribution in the continuous limit, and they also belong to the exponential family. But we face various problems. To keep the discussion simple, let us first look at a single parameter that can take as values all the integers from 0 to n.

First of all, we face the issue of controlling the first two moments of our distribution, or equivalently the mean, denoted by μ, and the variance, denoted by σ². Binomial distributions do not have two parameters like normals, so they cannot adapt to first- and second-moment constraints as easily. Indeed, for our given n, the number of discrete states of the parameter to optimize in the discrete CMA-ES, we are only left with the single parameter p of our binomial distribution to accommodate the constraints. The expectation is given by np while the variance is given by np(1−p). If we would like a discrete distribution that progressively peaks at the minimum, we must be able to force the variance to converge to 0. Fixing the variance to σ² gives np(1−p) = σ². We can easily solve this quadratic equation and use the minimal solution given by

 p = (1 − √(1 − 4σ²/n)) / 2

provided that σ² ≤ n/4. As σ² tends to zero, the parameter p will tend to zero. In order to accommodate the mean constraint, we need a work-around: if we do nothing, the mean np of the binomial will converge to 0 as p converges to 0. A simple solution is to assume that our discrete parameter is distributed according to

 (μ + (B(n,p) − np)) mod n

where a mod n is the remainder of the Euclidean division of a by n. This method ensures that we can sample all possible values, with a mean equal to μ and a variance controlled by the parameter p.
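The sampling scheme above can be sketched in a few lines (our own illustration; rounding np to the nearest integer so that the samples stay integral is our implementation choice, not fixed by the text):

```python
import math
import random

def sample_discrete(mu, sigma2, n, rng=random.Random(0)):
    """Sample an integer in {0,...,n-1} with mean ~mu and variance ~sigma2
    via the shifted, wrapped binomial described above."""
    # smallest root of n p (1 - p) = sigma2
    p = (1 - math.sqrt(1 - 4 * sigma2 / n)) / 2
    b = sum(rng.random() < p for _ in range(n))   # draw B(n, p)
    return (mu + b - round(n * p)) % n

samples = [sample_discrete(mu=7, sigma2=1.0, n=20) for _ in range(2000)]
mean = sum(samples) / len(samples)
# the empirical mean stays close to mu while sigma2 controls the spread
```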

Secondly, we would like a discrete distribution that maximizes entropy, as the normal does for the continuous version of CMA-ES. For discrete distributions, however, this maximum entropy condition is not as easy. It is well known that the maximum entropy discrete distribution with a given mean μ is not the binomial distribution but rather the distribution given by

 P(X=i) = c ρ^i

where c is a normalizing constant and ρ is determined so that the mean equals μ, which leads to the implicit equation for ρ:

 (1+μ)ρ + (μ−(n+1))ρ^{n+1} + (n−μ)ρ^{n+2} = μ,

using the well-known geometric series identities. This distribution is sometimes referred to as the truncated geometric distribution. It is not our desired binomial distribution, and we can rule it out, as its probability mass function does not make sense for our parameter: it is monotonically decreasing, which is not a desirable feature. Rather, we would like a bell shape, which is what the binomial distribution provides. The tricky question is how to relate our binomial distribution to a maximum entropy principle, as is the case for the normal.

We can first remark that the binomial distribution is not too far from a geometric distribution when the number of trials tends to infinity, at least for some terms. Indeed, the probability mass function is given by P(X=k) = (n choose k) p^k (1−p)^{n−k}, and using the Stirling formula to approximate the factorials for large n, we obtain an asymptotic term similar to the geometric distribution. This gives some hope that there is a way to relate our binomial distribution to a maximum entropy principle. The trick is to reduce the space of possible distributions: if, instead of looking at the entire space of distributions, we restrict ourselves to Poisson binomial distributions (also referred to in the statistics literature as generalized binomial distributions), we can find a solution. The latter distribution, named after the famous French mathematician Siméon Denis Poisson, is the discrete probability distribution of a sum of independent Bernoulli trials that are not necessarily identically distributed. And nicely, restricting the space of possible distributions to Poisson binomial distributions, theorem 3 proves that the binomial distribution is the one that maximizes the entropy for a given mean.

###### Theorem 3.

Among all Poisson binomial distributions with n trials, the distribution that maximizes the Shannon entropy, subject to the constraint of a given mean μ and the obvious probability constraints, is the binomial distribution with parameter p = μ/n.

See A.6

### 2.3 Multivariate and Correlated Binomials

Equipped with the intuition of the previous section, we can see the profound connection between the multivariate normal and the multivariate binomial. We define our multivariate binomial as the sum over n independent trials of multivariate Bernoulli variables defined as before.

Let X = (X1, …, XK) be a K-dimensional random vector of possibly correlated binomial random variables that may have different parameters, and let x = (x1, …, xK) be a realization of X. The joint probability is naturally given by

 P(X1=x1, X2=x2, …, XK=xK) = p(0,0,…,0)^{∏_{j=1}^K (1−xj)} × p(1,0,…,0)^{x1 ∏_{j=2}^K (1−xj)} × p(0,1,…,0)^{(1−x1) x2 ∏_{j=3}^K (1−xj)} × … × p(1,1,…,1)^{∏_{j=1}^K xj}, (16)

As in the simple case of section 2.1, we can rewrite this joint probability in exponential form. Let us first introduce some notation.

Let T(X) be the vector of size 2^K − 1 whose elements represent all the possible selections of 1 to K components of X, that is, all the possible monomials of X of degree 1 to K. By monomial, we mean a product of distinct components of X, each with an exponent equal to 0 or 1. We also denote by {i1,…,il} an ordered set of l elements of the integers from 1 to K, and by Υ_l the set of all ordered sets with l elements. Υ is the set of all possible non-empty ordered sets of integers in {1,…,K}. Similarly, Υ{i1,…,il} is the set of all possible non-empty subsets of {i1,…,il}. Finally, Υ_even{i1,…,il} (respectively Υ_odd{i1,…,il}) is the subset of Υ{i1,…,il} consisting of sets whose cardinality is even (respectively odd).

We are now able to provide the following proposition, which gives the exponential form of the multivariate binomial probability mass function:

###### Proposition 5 (Exponential form).

The multivariate Bernoulli model has a probability mass function of the exponential form given by

 P(X=x) = exp(⟨θ, T(x)⟩ − A(θ)) (17)

where the sufficient statistic is T(X), the log partition function is A(θ) = log D (with D defined below), and the coefficients are given by:

 θ_{i1,…,il} = log(Even / Odd), (18)
 with Even = ∏_{{j1,…,jm} ∈ Υ_even{i1,…,il}} p(1 for all of i1,…,il except j1,…,jm, 0 elsewhere) (19)
 and Odd = ∏_{{j1,…,jm} ∈ Υ_odd{i1,…,il}} p(1 for all of i1,…,il except j1,…,jm, 0 elsewhere) (20)

Similarly, we can compute the regular probabilities from the canonical parameters as follows:

 p(1 for i1,…,il, 0 elsewhere) = exp(S_{i1,…,il}) / D (21)

where S_{i1,…,il} is the sum of all the theta parameters indexed by any non-empty selection within {i1,…,il}:

 S_{i1,…,il} = Σ_{{j1,…,jm} ∈ Υ{i1,…,il}} θ_{j1,…,jm} (22)

with the convention S_∅ = 0 for the empty set, and D is the normalizing constant such that all the probabilities sum to 1:

 D = Σ_{l=0,…,K} Σ_{1≤i1<…<il≤K} exp(S_{i1,…,il}) (23)

with the convention that the l = 0 term equals exp(S_∅) = 1.

###### Proof.

See A.7

Last but not least, we can extend the result already found for the two-dimensional Bernoulli variable to the general multi-dimensional Bernoulli concerning independence and correlation. Recall that one of the important statistical properties of the multivariate Gaussian distribution is the equivalence of independence and zero correlation. This is a remarkable property of the Gaussian (although more can be said about independence and the Gaussian, as explained for instance in [8]).

The independence of a random vector is determined by the separability of coordinates in its probability mass function. With the natural (or moment) parameter form of the probability mass function, this is not obvious. However, using the exponential form, the result is almost trivial and is given by the following proposition.

###### Proposition 6 (Independence of Bernoulli outcomes).

The multivariate Bernoulli variable is independent element-wise if and only if

 θ_{i1,…,il} = 0  ∀ 1 ≤ i1 < … < il ≤ K, l ≥ 2. (24)

See A.8

###### Remark 2.2.

The condition for equivalence between independence and zero correlation can also be rewritten as

 S_{i1,…,il} = Σ_{k=1}^l θ_{ik}  ∀ l ≥ 2. (25)
###### Remark 2.3.

A general multivariate binomial model implies 2^K − 1 parameters, which is far too many when K is large. A simpler model is to impose that only the probabilities involving one or two states are non-zero. This is in fact the Ising model.

## 3 Algorithm

#### 3.0.1 CMA-ES estimation

Another radically different approach is to minimize some cost function depending on the Kalman filter parameters. As opposed to the maximum likelihood approach, which tries to find the distribution that best fits the data, this approach can factor in some noise and directly target the cost function that constitutes our final objective. Because our model is an approximation of reality, this noise introduction may lead to a better overall cost function but a worse distribution in terms of fit to the data.

Let us first introduce the CMA-ES algorithm. Its name stands for covariance matrix adaptation evolution strategy. As the name points out, it is an evolution strategy optimization method, meaning a derivative-free method that can accommodate non-convex optimization problems. The terminology covariance matrix alludes to the fact that the exploration of new points is based on a multivariate distribution whose covariance matrix is progressively adapted at each iteration. Hence the covariance matrix adapts, in a sense, to the sampling space: it contracts in directions that are useless and expands in directions where the natural gradient is steep. This algorithm has led to a large number of papers and articles, and we refer to [22], [20], [2], [1], [12], [5], [11], [4], [17], [6] to cite a few of the numerous articles around CMA-ES. We also refer the reader to the excellent Wikipedia page [24].

CMA-ES relies on two main principles in the exploration of admissible solutions for our optimization problem. First, it relies on a multivariate binomial distribution, as this is the maximum entropy distribution given the first two moments. The mean of the distribution is updated at each step so as to maximize the likelihood of finding a successful candidate. The second moment, the covariance matrix of the distribution, is also updated at each step to increase the likelihood of successful search steps. These updates can be interpreted as a natural gradient descent. Intuitively, the CMA-ES algorithm conducts an iterated principal components analysis of successful search steps while retaining all principal axes.

Second, we retain two paths of the successive distribution means, called search or evolution paths. The underlying idea is to keep significant information about the correlation between consecutive steps: if consecutive steps are taken in similar directions, the evolution paths become long. The evolution paths are exploited in two ways. The first path is used in the covariance matrix update to increase variance in favorable directions and hence increase convergence speed. The second path is used to control the step size, making consecutive movements of the distribution mean orthogonal in expectation. The goal of this step-size control is to prevent premature convergence while obtaining fast convergence to a local optimum.

In order to make this practical, we assume that we have a general cost function that depends on our Bayesian graphical model, whose parameters are those of our Kalman filter. Our cost function is for instance the Sharpe ratio of a generic trend detection strategy whose signal is generated by our Bayesian graphical model, which is underneath a Kalman filter. This approach is developed further in a companion paper [7], but we give here the general idea. Instead of computing the parameters of our Bayesian graphical model using the EM approach, we would like to find the parameters that maximize our cost function. Because our strategy is to enter a long trade, with a predetermined target level and a given stop loss, whenever our Bayesian graphical model anticipates a price rise, and similarly to enter a short trade whenever it predicts a downside movement, our trading strategy is neither convex nor smooth. It is a fully binary function and generates spikes whenever there is a trade. Moreover, our final criterion for comparing the efficiency of parameter sets is the Sharpe ratio of the resulting trading strategy. This is far too complicated for traditional optimization methods, and we need to rely on black-box optimization techniques like CMA-ES. Before presenting results, we will also discuss in the following section computational issues and the choice of the state space model dynamics.

We describe below the most commonly used CMA-ES algorithm, referred to as the (μ/μ_w, λ) version. At each iteration step, it takes a weighted combination of the μ best out of λ new candidate solutions to update the distribution parameters. The algorithm has three main loops: first, it samples new solutions; second, it reorders the sampled solutions based on their fitness; and third, it updates the state variables. Pseudocode of the algorithm is provided in the appendix, section 1.

In this algorithm, for an n-dimensional optimization problem, the five state variables at iteration step k are:

1. m_k, the distribution mean and current best approximation of the solution,

2. σ_k, the variance step-size,

3. C_k, a symmetric positive-definite covariance matrix, initialized to the identity,

4. p_σ, the isotropic evolution path, initially set to zero,

5. p_c, the anisotropic evolution path, initially set to zero.

It is worth noting that the order of the five updates matters. The iteration starts by sampling λ candidate solutions x_i, for i = 1, …, λ, from a multivariate binomial distribution with mean m_k and covariance controlled by σ_k² C_k.

We then evaluate the candidate solutions on the objective function f of our optimization.

We then sort the candidate solutions according to their objective function values: f(x_{1:λ}) ≤ … ≤ f(x_{λ:λ}). It is worth noticing that we do not even need to know the actual values; only the ranking matters for this algorithm.

The new mean value is updated as a weighted average of the μ best candidate solutions:

 m_{k+1} = Σ_{i=1}^μ w_i x_{i:λ} = m_k + Σ_{i=1}^μ w_i (x_{i:λ} − m_k) (26)

where the positive (recombination) weights w_1 ≥ … ≥ w_μ > 0 sum to one. Typically, μ ≤ λ/2 and the weights are chosen such that μ_w := 1/Σ_{i=1}^μ w_i² ≈ λ/4.
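A common default weighting scheme from the CMA-ES literature (an assumption here, since the text does not fix the weights) gives more weight to better-ranked candidates on a logarithmic scale:

```python
import math

lam = 12                     # population size lambda
mu = lam // 2                # number of selected parents

# log-rank weights, a usual default in the CMA-ES literature
raw = [math.log(mu + 0.5) - math.log(i) for i in range(1, mu + 1)]
total = sum(raw)
w = [r / total for r in raw]                 # positive, decreasing, sum to one
mu_w = 1.0 / sum(wi ** 2 for wi in w)        # variance effective selection mass
# m_{k+1} would then be the w-weighted average of the mu best candidates
```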

The step-size σ_k is updated using cumulative step-size adaptation (CSA), sometimes also called path length control. The evolution path p_σ is updated first, as it is used in the update of the step-size:

 p_σ ← (1−c_σ) p_σ + √(1−(1−c_σ)²) √μ_w C_k^{−1/2} (m_{k+1} − m_k)/σ_k (27)
 σ_{k+1} = σ_k × exp((c_σ/d_σ) (‖p_σ‖ / E‖N(0,I)‖ − 1)) (28)

In equation (27), (1−c_σ) is a discount factor and √(1−(1−c_σ)²) complements the discounted variance; under neutral selection, √μ_w C_k^{−1/2} (m_{k+1} − m_k)/σ_k is distributed as N(0,I), so that the argument of the exponential in (28) is unbiased about 0.

where

• c_σ^{−1} is the backward time horizon for the evolution path p_σ and is larger than one ((1−c_σ)^k is reminiscent of an exponential decay exp(−c_σ k), where c_σ^{−1} is the associated lifetime and c_σ^{−1} ln 2 the half-life),

• μ_w = (Σ_{i=1}^μ w_i²)^{−1} is the variance effective selection mass, and 1 ≤ μ_w ≤ μ by definition of the w_i,

• C_k^{−1/2} is the unique symmetric square root of the inverse of C_k, and

• d_σ is the damping parameter, usually close to one. For d_σ = ∞ or c_σ = 0, the step-size remains unchanged.

The step-size σ_{k+1} is increased if and only if ‖p_σ‖ is larger than its expected value E‖N(0,I)‖ under neutral selection, and decreased if it is smaller (see [10] for more details).

Finally, the covariance matrix is updated, where again the respective evolution path is updated first:

 p_c ← (1−c_c) p_c + 1_{[0,α√n]}(‖p_σ‖) √(1−(1−c_c)²) √μ_w (m_{k+1} − m_k)/σ_k (29)
 C_{k+1} = (1−c_1−c_μ+c_s) C_k + c_1 p_c p_c^T + c_μ Σ_{i=1}^μ w_i ((x_{i:λ}−m_k)/σ_k) ((x_{i:λ}−m_k)/σ_k)^T (30)

where ^T denotes the transpose, c_c^{−1} is the backward time horizon for the evolution path p_c and is larger than one, and the indicator function 1_{[0,α√n]}(‖p_σ‖) evaluates to one if and only if ‖p_σ‖ ∈ [0, α√n], which is usually the case; the term c_s partly makes up for the small variance loss in case the indicator is zero. c_1 is the learning rate for the rank-one update of the covariance matrix, c_μ is the learning rate for the rank-μ update, and c_μ must not exceed 1 − c_1. In (30), c_1 p_c p_c^T is a rank-one matrix and the weighted sum is a rank-min(μ,n) matrix.

The covariance matrix update tends to increase the likelihood for p_c and for the selected steps (x_{i:λ}−m_k)/σ_k to be sampled from the updated distribution. This completes the iteration step. The stop condition is quite standard: we stop if the best solution no longer moves or if the maximum number of iterations is reached.
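For the continuous case, the five updates above fit in a short routine. The sketch below is our own illustration, not the reference implementation: it uses standard default parameters from the CMA-ES literature, a Cholesky factor in place of the symmetric C^{−1/2}, and omits the stall indicator on p_c (all assumptions on our part). It minimizes a simple sphere function.

```python
import numpy as np

def cma_es_sketch(f, m, sigma, iters=300, seed=0):
    """Minimal (mu/mu_w, lambda)-CMA-ES following updates (26)-(30),
    simplified: Cholesky whitening, no stall indicator, default settings."""
    rng = np.random.default_rng(seed)
    n = len(m)
    lam = 4 + int(3 * np.log(n)); mu = lam // 2
    raw = np.log(mu + 0.5) - np.log(np.arange(1, mu + 1))
    w = raw / raw.sum(); mu_w = 1.0 / (w ** 2).sum()
    c_sigma = (mu_w + 2) / (n + mu_w + 5); d_sigma = 1 + c_sigma
    c_c = (4 + mu_w / n) / (n + 4 + 2 * mu_w / n)
    c1 = 2 / ((n + 1.3) ** 2 + mu_w)
    c_mu = min(1 - c1, 2 * (mu_w - 2 + 1 / mu_w) / ((n + 2) ** 2 + mu_w))
    chi_n = np.sqrt(n) * (1 - 1 / (4 * n) + 1 / (21 * n ** 2))  # E||N(0,I)||
    C = np.eye(n); p_sigma = np.zeros(n); p_c = np.zeros(n)
    for _ in range(iters):
        C = (C + C.T) / 2                       # keep C symmetric
        A = np.linalg.cholesky(C)               # C = A A^T
        x = m + sigma * rng.standard_normal((lam, n)) @ A.T
        order = np.argsort([f(xi) for xi in x])  # only the ranking matters
        y = (x[order[:mu]] - m) / sigma          # best mu steps, normalized
        step = w @ y                             # weighted recombination
        m = m + sigma * step                     # mean update, eq. (26)
        # step-size path and CSA update, eqs. (27)-(28)
        p_sigma = ((1 - c_sigma) * p_sigma
                   + np.sqrt(c_sigma * (2 - c_sigma) * mu_w)
                   * np.linalg.solve(A, step))   # A^{-1} whitens the step
        sigma *= np.exp(c_sigma / d_sigma
                        * (np.linalg.norm(p_sigma) / chi_n - 1))
        # covariance path, rank-one and rank-mu updates, eqs. (29)-(30)
        p_c = (1 - c_c) * p_c + np.sqrt(c_c * (2 - c_c) * mu_w) * step
        C = ((1 - c1 - c_mu) * C + c1 * np.outer(p_c, p_c)
             + c_mu * (y.T * w) @ y)
    return m

best = cma_es_sketch(lambda v: float(v @ v), m=np.ones(5) * 3.0, sigma=1.0)
# best ends up very close to the optimum at the origin
```

Note that √(1−(1−c)²) = √(c(2−c)), which is the form used in the code.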

###### Remark 3.1.

The number of candidate samples per iteration, λ, is an important and influential parameter. It depends highly on the objective function and can be tricky to determine. Smaller values, like 10, tend to do more local search, while larger values, like 10 times the dimension, make the search more global. A potential solution to determining the number of candidate samples per iteration is to repeatedly restart the algorithm, increasing λ by a factor of two at each restart (see [3]).

Besides setting λ (or possibly μ instead, if for example λ is predetermined by the number of available processors), the parameters introduced above are not specific to the given objective function. This makes the algorithm quite powerful, as the end user has relatively few parameters to set for an efficient optimization search.

## 4 Conclusion

In this paper, we showed that using a multivariate correlated binomial distribution, we can derive an efficient adaptation of CMA-ES for discrete-variable optimization problems. We proved that correlated binomials share similarities with the normal distribution in terms of the equivalence of independence and zero correlation, as well as a rich correlation structure. In order to avoid too many parameters, we impose that only single-state and bi-state probabilities are non-null. In the future, we hope to develop additional variations of this CMA-ES version for combinations of discrete and continuous variables, potentially mixing multivariate binomial and normal distributions.

## References

• [1] Youhei Akimoto, Anne Auger, and Nikolaus Hansen. Continuous optimization and CMA-ES. In Genetic and Evolutionary Computation Conference, GECCO 2015, Madrid, Spain, July 11-15, 2015, Companion Material Proceedings, pages 313–344, 2015.
• [2] Youhei Akimoto, Anne Auger, and Nikolaus Hansen. CMA-ES and advanced adaptation mechanisms. In Genetic and Evolutionary Computation Conference, GECCO 2016, Denver, CO, USA, July 20-24, 2016, Companion Material Proceedings, pages 533–562, 2016.
• [3] A. Auger. Convergence results for the (1,λ)-SA-ES using the theory of φ-irreducible Markov chains. Theoretical Computer Science, 334:35–69, Elsevier, 2005.
• [4] Anne Auger and Nikolaus Hansen. Benchmarking the (1+1)-CMA-ES on the BBOB-2009 noisy testbed. In Genetic and Evolutionary Computation Conference, GECCO 2009, Proceedings, Montreal, Québec, Canada, July 8-12, 2009, Companion Material, pages 2467–2472, 2009.
• [5] Anne Auger and Nikolaus Hansen. Tutorial CMA-ES: evolution strategies and covariance matrix adaptation. In Genetic and Evolutionary Computation Conference, GECCO ’12, Philadelphia, PA, USA, July 7-11, 2012, Companion Material Proceedings, pages 827–848, 2012.
• [6] Anne Auger, Marc Schoenauer, and Nicolas Vanhaecke. LS-CMA-ES: A second-order algorithm for covariance matrix adaptation. In Parallel Problem Solving from Nature - PPSN VIII, 8th International Conference, Birmingham, UK, September 18-22, 2004, Proceedings, pages 182–191, 2004.
• [7] Eric Benhamou. Optimal parameter inference for bayesian graphical models. ArXiv, November 2018.
• [8] Eric Benhamou, Beatrice Guez, and Nicolas Paris. Three remarkable properties of the Normal distribution. arXiv e-prints, October 2018.
• [9] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, New York, NY, USA, 2006.
• [10] N. Hansen. The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation. Advances on estimation of distribution algorithms, 2006.
• [11] Nikolaus Hansen and Anne Auger. CMA-ES: evolution strategies and covariance matrix adaptation. In 13th Annual Genetic and Evolutionary Computation Conference, GECCO 2011, Companion Material Proceedings, Dublin, Ireland, July 12-16, 2011, pages 991–1010, 2011.
• [12] Nikolaus Hansen and Anne Auger. Evolution strategies and CMA-ES (covariance matrix adaptation). In Genetic and Evolutionary Computation Conference, GECCO ’14, Vancouver, BC, Canada, July 12-16, 2014, Companion Material Proceedings, pages 513–534, 2014.
• [13] Nikolaus Hansen, Sibylle D. Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evol. Comput., 11(1):1–18, March 2003.
• [14] Nikolaus Hansen and Andreas Ostermeier. Completely derandomized self-adaptation in evolution strategies. Evol. Comput., 9(2):159–195, June 2001.
• [15] G.H. Hardy, J.E. Littlewood, and G. Pólya. Inequalities. Cambridge Mathematical Library. Cambridge University Press, 1988.
• [16] P. Harremoes. Binomial and poisson distributions as maximum entropy distributions. IEEE Transactions on Information Theory, 47(5):2039–2041, 2001.
• [17] Christian Igel, Nikolaus Hansen, and Stefan Roth. Covariance matrix adaptation for multi-objective optimization. Evol. Comput., 15(1):1–28, March 2007.
• [18] X. Ma. Penalized regression in reproducing kernel Hilbert spaces with randomized covariate data. Technical report no. 1159, Dept. Statistics, Univ, 5370, 2010.
• [19] P. McCullagh and J. Nelder. Generalized Linear Models. Chapman Hall, New York, 1989.
• [20] Yann Ollivier, Ludovic Arnold, Anne Auger, and Nikolaus Hansen. Information-geometric optimization algorithms: A unifying picture via invariance principles. J. Mach. Learn. Res., 18(1):564–628, January 2017.
• [21] J. E. Pecaric, F. Proschan, and Y. L. Tong. Convex functions, partial orderings, and statistical applications. Boston: Academic Press., 1992.
• [22] Konstantinos Varelas, Anne Auger, Dimo Brockhoff, Nikolaus Hansen, Ouassim Ait ElHara, Yann Semet, Rami Kassab, and Frédéric Barbaresco. A comparative study of large-scale variants of CMA-ES. In Parallel Problem Solving from Nature - PPSN XV - 15th International Conference, Coimbra, Portugal, September 8-12, 2018, Proceedings, Part I, pages 3–15, 2018.
• [23] J. Whittaker. Graphical Models in Applied Mathematical Multivariate Statistics. Wiley, New York, 1990.
• [24] Wikipedia. CMA-ES, 2018.

## Appendix A Proofs

### a.1 Proof of proposition 1

###### Proof.

We can trivially infer all the moment parameters from equations 2, 3 and 4. ∎

### a.2 Proof of proposition 2

###### Proof.

The exponential family formulation of the bivariate Bernoulli distribution shows that a necessary and sufficient condition for the distribution to be separable into two components, each depending only on X1 and X2 respectively, is that θ12 = 0. This proves the first assertion of proposition 2.

Proving equivalence between correlation and independence is the same as proving equivalence between covariance and independence. The covariance between X1 and X2 is easy to calculate and given by

\begin{align}
\mathrm{Cov}(X_1,X_2) &= E[X_1X_2]-E[X_1]E[X_2] \tag{31}\\
&= Ke^{\theta_0+\theta_1+\theta_{12}} - K\left(e^{\theta_0}+e^{\theta_0+\theta_1+\theta_{12}}\right)K\left(e^{\theta_1}+e^{\theta_0+\theta_1+\theta_{12}}\right) \tag{32}\\
&= Ke^{\theta_0+\theta_1+\theta_{12}}\left(1-Ke^{\theta_0}-Ke^{\theta_1}-Ke^{\theta_0+\theta_1+\theta_{12}}\right)-K^2e^{\theta_0+\theta_1} \tag{33}\\
&= K^2e^{\theta_0+\theta_1}\left(e^{\theta_{12}}-1\right) \tag{34}
\end{align}

where in equation (34) we have used that the four probabilities sum to one. Hence, the correlation or the covariance is null for non trivial probabilities if and only if θ12 = 0, which is equivalent to independence. ∎
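The closed form of equation (34) can be checked numerically for arbitrary natural parameters; the values and variable names below are ours, chosen only for illustration.

```python
import math
import itertools

# Arbitrary natural parameters of the bivariate Bernoulli distribution.
t0, t1, t12 = 0.3, -0.7, 0.5
# Normalizing constant K so that the four probabilities sum to one.
K = 1.0 / (1 + math.exp(t0) + math.exp(t1) + math.exp(t0 + t1 + t12))

# Joint probabilities p(x1, x2) = K * exp(t0*x1 + t1*x2 + t12*x1*x2).
p = {(x1, x2): K * math.exp(t0 * x1 + t1 * x2 + t12 * x1 * x2)
     for x1, x2 in itertools.product((0, 1), repeat=2)}

e1 = sum(x1 * q for (x1, x2), q in p.items())   # E[X1]
e2 = sum(x2 * q for (x1, x2), q in p.items())   # E[X2]
e12 = p[(1, 1)]                                 # E[X1 X2]
cov = e12 - e1 * e2

# Closed form (34): K^2 * e^(t0+t1) * (e^(t12) - 1)
closed_form = K**2 * math.exp(t0 + t1) * (math.exp(t12) - 1.0)
assert abs(cov - closed_form) < 1e-12
```

Setting t12 = 0 in this check makes the covariance vanish, consistent with the independence equivalence.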

### a.3 Proof of proposition 3

###### Proof.

For the coordinate X1, it is trivial to see that

\begin{align*}
P(X_1=1) &= P(X_1=1,X_2=0)+P(X_1=1,X_2=1)=p_{10}+p_{11},\\
P(X_1=0) &= p_{00}+p_{01},\\
P(X_1=1) &+ P(X_1=0)=1,
\end{align*}

which shows that X1 follows the univariate Bernoulli distribution with density given by equation (11). Likewise, it is trivial to see that

\begin{align*}
P(X_1=0\mid X_2=0) &= \frac{P(X_1=0,X_2=0)}{P(X_2=0)}=\frac{p_{00}}{p_{00}+p_{10}},\\
P(X_1=1\mid X_2=0) &= \frac{p_{10}}{p_{00}+p_{10}},\\
P(X_1=1\mid X_2=0) &+ P(X_1=0\mid X_2=0)=1.
\end{align*}

Similar results apply when conditioning on X2 = 1, which shows the second result and concludes the proof. ∎

### a.4 Proof of proposition 4

###### Proof.

Let us write the limit of the binomial distribution when the number of trials n → ∞ and the probability of success in each trial p → 0, while np = λ remains finite. We have, for a given k,

\begin{align}
&\lim_{n\to\infty}\binom{n}{k}\Big(\frac{\lambda}{n}\Big)^{k}\Big(1-\frac{\lambda}{n}\Big)^{n-k} \tag{35}\\
&\quad=\lim_{n\to\infty}\frac{n(n-1)\cdots(n-k+1)}{k!}\,\frac{\lambda^{k}}{n^{k}}\,\Big(1-\frac{\lambda}{n}\Big)^{n}\Big(1-\frac{\lambda}{n}\Big)^{-k} \tag{36}\\
&\quad=\frac{\lambda^{k}}{k!}\,\lim_{n\to\infty}\underbrace{\frac{n(n-1)\cdots(n-k+1)}{n^{k}}}_{\to\,1}\,\underbrace{\Big(1-\frac{\lambda}{n}\Big)^{n}}_{\to\,e^{-\lambda}}\,\underbrace{\Big(1-\frac{\lambda}{n}\Big)^{-k}}_{\to\,1} \tag{37}\\
&\quad=e^{-\lambda}\frac{\lambda^{k}}{k!}, \tag{38}
\end{align}

which proves that the binomial converges to the Poisson distribution. ∎
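The convergence in equation (38) can also be illustrated numerically: the gap between the Binomial(n, λ/n) and Poisson(λ) mass functions at a fixed k shrinks as n grows. This snippet is a sanity check, not part of the proof; the values chosen are arbitrary.

```python
import math

# Fixed rate lam and evaluation point k.
lam, k = 3.0, 2

def binom_pmf(n, k, p):
    """Binomial probability mass function."""
    return math.comb(n, k) * p**k * (1.0 - p)**(n - k)

# Poisson(lam) mass at k, the limit in equation (38).
poisson = math.exp(-lam) * lam**k / math.factorial(k)

# Approximation error for increasing n; it shrinks toward zero.
errors = [abs(binom_pmf(n, k, lam / n) - poisson) for n in (10, 100, 10_000)]
```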

### a.5 Proof of theorem 2

###### Proof.

We follow the proof of Theorem 11.1.1 of [9]. If we write the Lagrangian for the problem

\begin{align*}
\text{maximize}\quad & -\int p(x)\log p(x)\,d\mu(x)\\
\text{subject to}\quad & E_P[\phi(X)]=\alpha,\qquad \int p(x)\,d\mu(x)=1,\qquad p(x)\ge 0
\end{align*}

where the Lagrange multipliers θ = (θ1, …, θd), θ0 and λ(x) ≥ 0 are associated with the three constraints, we have

\begin{align*}
L(p,\theta,\theta_0,\lambda)=\ &\int p(x)\log p(x)\,d\mu(x)+\sum_{i=1}^{d}\theta_i\Big(\alpha_i-\int p(x)\phi_i(x)\,d\mu(x)\Big)\\
&+\theta_0\Big(\int p(x)\,d\mu(x)-1\Big)-\int\lambda(x)p(x)\,d\mu(x)
\end{align*}

and noticing that the function to optimize is convex and satisfies Slater's condition, we can use Lagrangian duality to characterize the solution by the critical point equation

\begin{equation}
1+\log p(x)-\langle\theta,\phi(x)\rangle+\theta_0-\lambda(x)=0 \tag{39}
\end{equation}

or equivalently,

\begin{equation}
p(x)=\exp\big(\langle\theta,\phi(x)\rangle-1-\theta_0+\lambda(x)\big) \tag{40}
\end{equation}

As this solution is always strictly positive, the constraint p(x) ≥ 0 is never active and, by complementary slackness, the associated Lagrange multiplier must be null: λ(x) = 0. The solution should be a probability distribution, which implies that

 ∫p(x)dμ(x)=1

which imposes that

\begin{equation*}
e^{1+\theta_0}=\int\exp\big(\langle\theta,\phi(x)\rangle\big)\,d\mu(x)
\end{equation*}

or equivalently, writing A(θ) = 1 + θ0 = log ∫ exp(⟨θ, φ(x)⟩) dμ(x) for the log-partition function, the solution pθ satisfies

\begin{equation}
p_\theta(x)=\exp\big(\langle\theta,\phi(x)\rangle-A(\theta)\big) \tag{41}
\end{equation}

which shows that the distribution is part of the exponential family.

To prove its uniqueness, we use the fact that, for any distribution P satisfying the moment constraints, the Shannon entropy is related to the Kullback-Leibler divergence from the exponential family solution Pθ of equation (41) as follows:

\begin{equation}
H(P)=-\int p(x)\log p(x)\,d\mu(x)=H(P_\theta)-D_{KL}(P\,\|\,P_\theta) \tag{42}
\end{equation}

which concludes the proof, as the Kullback-Leibler divergence is strictly positive unless P = Pθ. ∎

### a.6 Proof of theorem 3

###### Proof.

We will prove a stronger result: the entropy of a generalized (Poisson) binomial distribution with parameters p1, …, pn is Schur concave in (p1, …, pn) (see [21] for a definition and some properties). A direct consequence of Schur concavity is that the entropy is maximal for the constant vector:

\begin{equation}
H(p_1,\dots,p_n)\le H(\bar p,\dots,\bar p) \tag{43}
\end{equation}

with \(\bar p = \frac{1}{n}\sum_{i=1}^{n} p_i\). This will prove that the regular binomial distribution satisfies the maximum entropy principle.
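Inequality (43) can be checked numerically by brute-force enumeration of the Poisson binomial law; this is a sanity check under arbitrary parameters p_i, and the function name is ours.

```python
import itertools
import math

def poisson_binomial_entropy(ps):
    """Entropy of S_n = X_1 + ... + X_n with independent X_i ~ Bernoulli(p_i),
    computing the pmf of S_n by brute force over all 2^n outcomes."""
    n = len(ps)
    pmf = [0.0] * (n + 1)
    for bits in itertools.product((0, 1), repeat=n):
        q = 1.0
        for b, p in zip(bits, ps):
            q *= p if b else (1.0 - p)
        pmf[sum(bits)] += q
    return -sum(q * math.log(q) for q in pmf if q > 0)

ps = [0.2, 0.5, 0.8, 0.3]          # arbitrary success probabilities
p_bar = sum(ps) / len(ps)          # their mean
h_mixed = poisson_binomial_entropy(ps)
h_equal = poisson_binomial_entropy([p_bar] * len(ps))  # regular binomial
assert h_mixed <= h_equal          # inequality (43)
```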

Our proof of the Schur concavity uses the same trick as in [16], namely elementary symmetric functions. Let us denote by X1, …, Xn the independent Bernoulli variables with parameters p1, …, pn and by Sn their sum, the canonical Poisson binomial variable. Its probability mass function writes as

\begin{equation}
\pi^{n}_{k}\triangleq\Pr(S_n=k)=\sum_{A\in\mathcal{F}_k}\ \prod_{i\in A}p_i\prod_{j\in A^{c}}(1-p_j) \tag{44}
\end{equation}

where \(\mathcal{F}_k\) is the set of all subsets of size k of the integers \(\{1,\dots,n\}\). The entropy is permutation symmetric; hence, by the Schur-Ostrowski criterion, to prove Schur concavity it suffices to show that the cross term \((p_1-p_2)\big(\frac{\partial H}{\partial p_1}-\frac{\partial H}{\partial p_2}\big)\) is negative. Let us compute

\begin{equation}
\frac{\partial H}{\partial p_1}-\frac{\partial H}{\partial p_2}=-\sum_{k=0}^{n}\big(1+\log\pi^{n}_{k}\big)\Big(\frac{\partial\pi^{n}_{k}}{\partial p_1}-\frac{\partial\pi^{n}_{k}}{\partial p_2}\Big) \tag{45}
\end{equation}

We can notice that for n ≥ 2 and 2 ≤ k ≤ n − 2,

\begin{equation}
\pi^{n}_{k}=p_1p_2\,\pi^{n-2}_{k-2}+\big(p_1(1-p_2)+(1-p_1)p_2\big)\,\pi^{n-2}_{k-1}+(1-p_1)(1-p_2)\,\pi^{n-2}_{k} \tag{46}
\end{equation}

where \(\pi^{n-2}_{j}\) denotes the probability mass function of the sum of the remaining variables X3, …, Xn. Equation (46) can be extended to k < 2 or k > n − 2 with the convention that \(\pi^{n-2}_{j}=0\) for j < 0 and j > n − 2.