Deep Neural Networks (DNN) are widely used for classification tasks, with impressive performance in contrast with other methods and even against human experts . Every neural model contains a number of parameters (traditionally known as weights and biases) that must be adjusted so that the model performs the desired task correctly. The process of adjusting the network parameters is known as training. The training of a DNN is modelled as an optimization problem, that is to say, the minimization of an error function computed over the training data instances. Hence, the performance of a DNN completely depends on the training stage.
As a rule, the solution of the optimization problem has been approximated via gradient-based (GB) and quasi-Newton optimizers 
, despite the fact that the problem presents multiple minima where such optimizers can be trapped. In addition, each kind of DNN requires of an specific training method because the GB optimizers explicitly compute derivatives of the parameters, including the specific DNN architecture and the transfer and activation functions. Even more, theFundamental Deep Learning Problem of Gradient Descent 
establishes that some DNN architectures cannot be efficiently trained by gradient-based optimizers via backpropagation, due to thegradient vanishing problem. In consequence, further research in alternative training methods is crucial for continuing expanding the current knowledge.
In contrast with the standard GB approaches, Evolutionary Algorithms (EAs) and other population-based stochastic methods exhibit two advantages: first, that these can escape from local minima, and second that these do not require of problem-specific information (such as first or second derivatives), hence, one EA method could be used for a variety of DNNs by virtue of a complete decoupling from the DNN architecture and transfer- and activation-functions used throughout the network.
On the other hand, a great disadvantage of EAs is their computational cost, in memory as well as operations. The adequate population size for most EAs is dependent on the number of optimization variables , often in a superlinear proportion. Some researchers have proposed EAs which have been successfully used for solving high-dimensional problems  using a population size which linearly depends on the number of variables . Nevertheless, EAs in general have not been adopted for training or pretraining of DNNs because of their computational cost, which makes these unpractical against the current approaches based on other paradigms  despite their competitive results .
There exist evolutionary algorithms applied to High Dimensional Problems (HDPs) , but they have not been applied to DL, or to highly non-linear optimization problems, such as the one attempted here. Additionally, other authors intended to use descent directions in population based algorithms , although, these proposals have not reached the deep learning field, mainly because they are designed and tested for low-dimension problems.
2 From the CMA-ES to the design of an algorithm
Evolutionary Algorithms have been widely used for addressing non-derivable problems with multiple optimal solutions . In this context, the most effective algorithm reported in the literature for high-dimensional problems in continuous domains is the Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) .
The CMA-ES uses the covariance matrix for biasing the search towards promising directions. Although it delivers the most competitive approximations to the optima (even compared with quasi-Newton algorithms for convex problems such as Levenberg-Maquard and BFGS ), the CMA-ES requires of memory and operations because it computes a covariance matrix oriented towards promising directions computed from the historical trajectory of the elite solution; this requires of memory locations formore memory locations and operations.
Translating these observations to the context of training DNNs, where the number of variables to be optimized is in the range , the covariance matrix storage and factorization would require at least memory locations. Assuming a double-precision floating-point format (64 bits or 8 bytes) for storage, then the training of a small DNN with CMA-ES would occupy about 160 GB of RAM.
Even though the CMA-ES is unpractical for high dimensional optimization problems such as those encountered in DNNs training, it possesses relevant features which could be included in the design of a algorithm.
2.1 Previous work
It is known that the CMA-ES main steps are quadratic, while some other, for instance the sampling method, is cubic ; this elevated computational complexity is hinders the application of CMA-ES to high-dimensional problems(HDPs). In consequence, the problem of designing a linear algorithm that includes the core ideas of the CMA-ES has been attempted before, prominently by Loshchilov , Ros and Hansen , and Arnold and Hansen .
In one of Loshchilov’s previous proposals , the algorithm computes the same trajectories as the CMA-ES and then updates a Cholesky factorization of the covariance matrix containing the most relevant dimensions. Nevertheless, the author intends to preserve as most as possible the CMA-ES structure and work-flow.
A similar scheme which reduces the cubic steps to quadratic complexity was presented by Arnold and Hansen ; they employ a strategy for the adaptation of the covariance matrix, reducing the computational cost. These previous proposals are well founded and have reported adequate results, but they are far from a competitive approach to be used in Deep Learning (DL), considering that the largest number of dimensions that have been discussed in those works is , while the optimization problems in DL are at least two orders of magnitude greater.
In Ros and Hansen’s proposal  a diagonal matrix is used to achieve linear complexity. This approach is intended for separable objective functions, and in our opinion loses one of the main advantages of CMA-ES, which is the orientation of the covariance matrix of the population of solutions towards a promising direction.
Contrary to these previous approaches, in this work we develop a naturally linear algorithm without the aim of preserving the CMA-ES work-flow but still allowing it to serve as an inspiration for our proposal. Namely, we introduce a novel Evolutionary Algorithm based on two proposals: a search based on a diagonal covariance matrix (that requires operations and memory), of which the main variance direction is aligned with a single promising direction that in turn is computed as a combination of two directions derived from the evolution of the best individual through previous iterations. The storage and computation requirements for all of these operations are .
3 The Linear Evolutionary Algorithm based on the Main Variance Directions (LEA-MVD)
The main idea of our proposed algorithm is as follows. After proper initialization and evaluation of a population of solutions (discussed below), the algorithm is expected to produce new candidate solutions based on a guided search. Three main components are used for guiding this search: the first component is a Normal distribution, which models the most promising region within the solution space; the second component is a direction based on the evolution of the best individual trough generations; the third component is a direction based on the descent from the worst individuals to the best individual in the current population. The three components are used for sampling new solutions within a promising region and to translate that sample in a potentially descent direction. Our proposal is thus termed Linear Evolutionary Algorithm based on the Main Variance Directions (LEA-MVD) and is shown in Algorithm 1.
The LEA-MVD requires of two sensitive parameters: the population size, , and the number of elite individuals, . Even they are considered sensitive for getting the best performance, a sufficient competitive performance can be achieved with fixed values, in particular the number of elite individuals can be fixed to . For the preliminary results presented here the population size is fixed to , although a more deeper analysis must be conducted for determining a better rule for setting these parameters.
3.1 Implementation Details
the input parameters are the population size (fixed to 24 for these preliminary results); the number of elite individuals (this fixed value is suggested for any problem); the number of optimization variables ; and the inferior limit and superior limit of the solution space, which in general are defined by the problem. For RBM pretraining, we suggest a symmetric range around 0, inside for all variables, in the experiments we employed .
the algorithm is stopped if the norm of the standard deviations is less thanor if the number of generations .
we test two initialization procedures to obtain the initial population , which is a matrix: (A) by uniformly sampling between and ; (B) by sampling from a Normal distribution , where seed is an initial solution, this case is discussed in the Results section.
the inputs for the evaluation are the current population and the objective function . This procedure returns the objective value as an one-dimensional array for iteration . In our experiments, each individual is a set of parameters of an RBM, hence the evaluation is carried out by computing the reconstruction error of the input-output data as our objective function.
the selection returns two index subsets, and , the second is the subset of indexes of the decreasing order of the population , while are the first indexes of .
3.2 First Component: Sampling from the most promising region
A probabilistic model is used to identify the most promising region, so that the algorithm can sample from it with a high probability. This is achieved by means of the Empirical Selection Distribution (ESD) which consists of assigning a set of discrete weights to each solution , according to Eq. (1). These weights can be seen as probabilities or relative frequencies associated to each individual (the better an individual, the larger its weight).
According to Valdez , sampling from the ESD is equivalent to applying a selection operator. For instance, if the exponent in Eq. (1) were , this distribution would be equivalent to a binary tournament. If the exponent were , the distribution would be equivalent to a double binary tournament (compute a selected set by a binary tournament, then apply another binary tournament over the first selected set). In our case, we choose to use a fractional tournament with exponent equal to , which according to prior experimentation, adequately modulates the selection pressure.
The Normal model is defined by a set of means and standard deviations, hence it could be seen as a multivariate Normal distribution with a diagonal covariance matrix. In other words, the random variables are considered independent. Thus, the mean and standard deviations are computed for the-th variable according to Eqs. (2) and (3), respectively.
3.3 Second and Third Components: Estimators of the Main Variance Direction
One of the main contributions of this proposal is to use a covariance matrix instead of a . The latter can be decomposed in eigenvectors associated with eigenvalues, the greatest of which corresponds to the largest variance direction. In this context, our proposal consists in setting the Main Variance Direction (MVD) as the direction of greatest improvement, and using this single direction to translate the population of candidate solutions. For this purpose we compute two descent directions, named as the second and third components which, as explained above, are combined into a single direction. The first descent direction is given by the improvement of the elite individual, while the second is given by the descent vector from the worst solutions in the population to the best. Possibly, both directions are similar, hence it is not desirable nor necessary to perform two translations in the same direction. In consequence, we propose to compute the first direction and then to compute from the second direction only the part that is orthogonal to the first. That is to say, the descent vector from the worst solutions to the best that is orthogonal to the improvement of the elite individual.
The first direction is related to the improving path of the best solution. It is computed as , notice that it is a combination of the vector from the best solution at prior iteration to the best solution at current iteration , and the direction , from the prior iteration.
The second descent direction is obtained by computing the vector of maximum projection to the matrix , subject to it being orthogonal to . Notice that are the population indexes decreasingly sorted, thus is the elite individual and represents a set of indexes randomly taken from . In other words, we select indexes from the worse individuals in the population. For the experiments in this paper .
The vector of the maximum projection, called anisotropic descent direction , can be computed using the power method over , removing the projection to at each iteration. Finally, the second direction, is . After computing the direction, we compute the mean and standard deviation , of the projections of the vectors over . The quantities and , are used for translating the sampled points in this direction.
3.4 Regenerating the Population
Our algorithm is equipped with elitism, so that at each iteration the best individuals are preserved and individuals are regenerated. The sampling of the new population is as follows:
samples for are generated by using univariate Normal distributions with mean= and the standard deviations computed according to the first component, described in Section 3.2.
Then, a direction vector which combines the second and third components (first and second directions) is computed using Eq. (4). This direction is used for shifting in a promising direction.
Notice that regulates which direction is the dominant.
For each dimension , that is a column of the sample , Equation (5) is used for generating the new sample. Notice that , regulates the shift inserted by direction .
Finally, with a small probability of 0.02, a dimension could be enlarged or shortened as follows:
so that, . This last step is a perturbation in 2% of the dimensions.
The union of and the individuals preserved from the prior population form the new population.
The step sizes and .–
the parameter regulates the total descent; while the best solution is improved (signaling that the descent direction is adequate) it increases, otherwise it is reduced. Meanwhile, regulates the dominant direction among and ; if there is not a descent (meaning that the elite solution is stagnated), the weight of is increased, and vice versa.
In this section we contrast the results obtained for the pretraining of a Deep Belief Network (DBN) using three algorithms: the LEA-MVD (our proposal); the CMA-ES, which is considered the most competitive in a set of test poblems , ; and Contrastive Divergence (CD), which is the predominant algorithm for pretraining DBNs.
4.1 Experimental setup
A DBN is formed by stacking together several RBMs as illustrated in Fig. 1 (we use a similar architecture to that introduced by Hinton and Salakhutdinov ), where represents the weight matrix (the matrix of parameters) of the -th RBM. The DBN used in our experiments is intended to perform handwritten-digit classification. To evaluate the performance, the MNIST (Modified National Institute of Standards and Technology) database of handwritten digits was employed. This database consists of binary images: for training and for testing; each image is of size pixels.
Notice that the pretraining of, for instance, the parameters in , involves the optimization of approximately variables. Since the CMA-ES requires of
memory allocations, this implies the use of several hundred GB of storage for a single matrix, which is prohibitive for a workstation. Therefore, in order to compare our approach against the CMA-ES, the images in the MNIST database were scaled down to 25% of their original size (i.e. topixels).
The training of a DBN implies pretraining each of the RBMs in its architecture sequentially but independently. The number of trainable variables in the -th RBM is cmputed as , where and represent the number of nodes in the visible (input) and hidden layers of the RBM, correspondingly. Since our DBN is formed by three RBMs, then we have three optimization problems. For the training based on scaled down images (Fig. 1-a), these problems involve: variables, variables and variables.
In a second experiment, the LEA-MVD is compared against CD. Since the CMA-ES is not involved in this experiment, the related limiting memory requirements do not apply, so that the training images were used in their original resolution. For the training using the images with their original size (Fig. 1-b), the number of optimization variables are: variables, variables, and variables.
The CD algorithm typically used to perform the pre-training of each RBM converges within 50 iterations (thus the complete pre-training of the DBN takes 150 iterations). Because of this, the algorithms that were compared against CD were allocated the same number of iterations.
4.2 Experimental Results of LEA-MVD vs CMA-ES and CD
the reconstruction error for the three algorithms compared under the training scheme using the scaled-down version of the training data is shown in Fig. 2. The plots illustrate a typical execution of the three algorithms. The CMA-ES and the LEA-MVD were initialized using as seed solution that delivered by CD at the first iteration. Even though this seed solution could not be improved for the first RBM, for the second and third RBMs the CMA-ES and LEA-MVD achieved significantly smaller reconstruction errors than the widely-used CD, with LEA-MVD being the best of the three. Hence, our proposal not only requires much lesser memory and operations but it also delivers better solutions than the CMA-ES and CD.
after showing in the above comparison that LEA-MVD can overcome the performance of CMA-ES in the limit imposed by its memory requirements, a second experiment was conducted to compare LEA-MVD against CD using the images in their original size of pixels. This second experiment was executed 30 times in order to provide statistical robustness. The average result of these 30 executions, per iteration, is shown in Fig. 3.
As can be appreciated, the reconstruction error in this second experiment is generally larger than that in the first experiment. This is due to the larger size of the images used for the training of the DBN. Notice as well that after an initial improvement of the error (barely noticeable in the far right of the plots), neither of the algorithms managed to find a significantly better solution for the first RBM. For the second RBM, CD converged to a solution corresponding to a reconstruction error of about 50,000 (notice that the error is shown in logarithmic scale), while LEA-MVD produced a solution that is about one order of magnitude better, with an error of approximately 5,000. For the third RBM, CD produced only small improvements that cannot be appreciated at the scale shown. In contrast, LEA-MVD further improved the solution from the previous RBM and manage to reduce the reconstruction error to approximately 900.
These results unequivocally illustrate that the LEA-MVD algorithm is significantly superior than CD for the task of pre-training of a DBN.
In this paper we introduced a novel evolutionary algorithm based on the paradigm of Estimation of Distribution Algorithms which hascomplexity in memory and operations. Our proposal uses two probabilistic components and one deterministic component for sampling candidate solutions. The first probabilistic component estimates the position of a promising region, while the second models promising directions: the direction taken from the improvement of the elite individual is a single value, and the direction taken from the direction from the worse solutions to the best solution in the current population is used for computing the second probabilistic component, which is a univariate Normal distribution over the direction .
In all cases the updating rules consider fixed learning rates, that is to say, the current directions depend on the current information as well as on the historical information; this provides robustness to the estimation of the directions in the same way that stochastic gradient does .
The algorithm introduced herein is competitive with two state of the art algorithms, the CMA-ES, which is regarded in the literature as one of the most competitive evolutionary algorithms for continuous optimization, and Contrastive Divergence, which is the default algorithm for pretraining of RBMs / DBNs. The main contribution of this work is the combination of different paradigms, namely the Estimation of Distribution Algorithms with improving paths (akin to the CMA-ES), and robust estimators similar to Stochastic Gradient methods. All of this, results in a competitive optimization algorithm, as has been illustrated for the problem of pre-training a deep neural network.
Furthermore, the design of LEA-MVD includes self-adaptive parameters, this is an attractive feature that enables practitioners to avoid the problem of hyper-parameter tuning. Future work contemplates the analysis of the performance of this proposal for larger deep learning problems, in order to improve their operators and to reduce, even more, the computational cost.
This work was supported by the National Council of Science and Technology of Mexico (CONACYT), through Research Grants: CÁTEDRAS-7795 (S.I. Valdez) and CÁTEDRAS-2598 (A. Rojas).
Dirk V Arnold and Nikolaus Hansen.
Active covariance matrix adaptation for the (1+ 1)-cma-es.
Proceedings of the 12th annual conference on Genetic and evolutionary computation, pages 385–392. ACM, 2010.
-  Multiple authors. The cma evolution strategy. urlhttp://cma.gforge.inria.fr/, 2016.
L. Bottou, F. Curtis, and J. Nocedal.
Optimization methods for large-scale machine learning.SIAM Review, 60(2):223–311, 2018.
Shi Cheng, Bin Liu, T. O. Ting, Quande Qin, Yuhui Shi, and Kaizhu Huang.
Survey on data science with population-based algorithms.Big Data Analytics, 1(1):3, Jul 2016.
-  Kalyanmoy Deb and Christie Myburgh. Breaking the billion-variable barrier in real-world optimization using a customized evolutionary algorithm. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, pages 653–660. ACM, 2016.
-  N. Hansen, A. Niederberger, L. Guzzella, and P. Koumoutsakos. A method for handling uncertainty in evolutionary optimization with an application to feedback control of combustion. IEEE Transactions on Evolutionary Computation, 13(1):180–197, 2009.
-  Nikolaus Hansen, Sibylle D Müller, and Petros Koumoutsakos. Reducing the time complexity of the derandomized evolution strategy with covariance matrix adaptation (cma-es). Evolutionary computation, 11(1):1–18, 2003.
-  Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Ilya Loshchilov. CMA-ES with restarts for solving cec 2013 benchmark problems. In 2013 IEEE Congress on Evolutionary Computation, pages 369–376. Ieee, 2013.
-  Ilya Loshchilov. A computationally efficient limited memory cma-es for large scale optimization. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14, pages 397–404, New York, NY, USA, 2014. ACM.
-  Daniel Mora-Melià, F Javier Martínez-Solano, Pedro L Iglesias-Rey, and Jimmy H Gutiérrez-Bahamondes. Population size influence on the efficiency of evolutionary algorithms to design water networks. Procedia Engineering, 186:341–348, 2017.
-  Raymond Ros and Nikolaus Hansen. A simple modification in cma-es achieving linear time and space complexity. In International Conference on Parallel Problem Solving from Nature, pages 296–305. Springer, 2008.
-  Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85 – 117, 2015.
-  Alejandro Sierra Urrecho and Iván Santibáñez Koref. Evolution of Descent Directions, pages 221–237. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
-  Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O. Stanley, and Jeff Clune. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. CoRR, abs/1712.06567, 2017.
-  S Ivvan Valdez-Peña, Arturo Hernández-Aguirre, and Salvador Botello-Rionda. Approximating the search distribution to the selection distribution in edas. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 461–468. ACM, 2009.
-  Yoshua Bengio Yann LeCun and Geoffrey Hinton. Deep learning. nature, 521(7553):436–442, 2015.
-  Le Zhang and P.N. Suganthan. A survey of randomized algorithms for training neural networks. Information Sciences, 364-365:146 – 155, 2016.