Modern optimization techniques are usually either heuristic or metaheuristic. These techniques have been used to solve optimization problems in science, engineering, and industry. However, implementation strategies of metaheuristics for improving the accuracy of convolution neural networks (CNN), a well-known deep learning method, have rarely been investigated. Deep learning is a type of machine learning technique that aims to move closer to the goal of artificial intelligence: creating a machine that can successfully perform any intellectual task that a human can carry out. In this paper, we propose an implementation strategy for three popular metaheuristic approaches, i.e. simulated annealing, differential evolution, and harmony search, to optimize CNN. The performance of these metaheuristic methods in optimizing CNN for classifying the MNIST and CIFAR datasets was evaluated and compared. Furthermore, the proposed methods are also compared with the original CNN. Although the proposed methods increase the computation time, they also improve the accuracy (by up to 5.73 percent).
Keywords— metaheuristic, convolution neural network, deep learning, simulated annealing, differential evolution, harmony search
Deep learning (DL) is mainly motivated by research in artificial intelligence, in which the general goal is to imitate the ability of the human brain to observe, analyze, learn, and make decisions, especially for complex problems. This technique lies at the intersection of the research areas of signal processing, neural networks, graphical modeling, optimization, and pattern recognition. The current reputation of DL is due to the drastic improvement in chip processing abilities, the significant decrease in the cost of computing hardware, and advances in machine learning and signal processing research.
In general, DL models can be classified into discriminative models, generative models, and hybrid models. Discriminative models include, for instance, CNN, deep neural networks, and recurrent neural networks. Some examples of generative models are deep belief networks (DBN), restricted Boltzmann machines, regularized autoencoders, and deep Boltzmann machines. A hybrid model, on the other hand, refers to a deep architecture that combines a discriminative and a generative model. An example is the use of a DBN to pre-train a deep CNN, which can improve the performance of the deep CNN over random initialization. Among all of these DL techniques, this paper focuses on metaheuristic optimization for training a CNN.
Some examples of successful methods for training DL are stochastic gradient descent, conjugate gradient, Hessian-free optimization, and Krylov subspace descent.
Stochastic gradient descent is easy to implement and fast when there are many training samples. However, this method needs considerable manual tuning to make its parameters optimal, and its process is principally sequential; as a result, it is hard to parallelize with GPUs. Conjugate gradient (CG), on the other hand, is easier to check for convergence and more stable to train. Nevertheless, CG is slow, so it needs multicore CPUs and a large amount of RAM.
Hessian-free optimization (HFO) has been applied to train deep auto-encoders; it is proficient in handling the underfitting problem and is more efficient than the pre-training and fine-tuning approach proposed by Hinton and Salakhutdinov. Krylov subspace descent (KSD), on the other hand, is simpler and more robust than HFO, and it appears to work better in terms of classification performance and optimization speed. However, KSD needs more memory than HFO.
In fact, modern optimization techniques are heuristic or metaheuristic. These optimization techniques have been applied to solve many optimization problems in science, engineering, and even industry. However, research on metaheuristics for optimizing deep learning methods is rarely conducted. One such work combines a genetic algorithm (GA) with CNN, as proposed by You Zhining and Pu Yunming. Their model selects the CNN characteristics through the recombination and mutation processes of GA, in which each CNN model exists as an individual in the GA. In the recombination process, only the layer weights and threshold values of C1 (the convolution in the first layer) and C3 (the convolution in the third layer) are changed in the CNN model.
In this paper, we compare the performance of three metaheuristic algorithms, i.e. simulated annealing (SA), differential evolution (DE), and harmony search (HS), for optimizing CNN. The strategy is to look for the best value of the fitness function on the last layer using a metaheuristic algorithm; the results are then used to calculate the weights and biases in the previous layers. To test the performance of the proposed methods, we use the MNIST dataset. This dataset consists of images of handwritten digits and contains 60,000 training samples and 10,000 testing samples. All of the images have been centered and standardized to a size of 28 x 28 pixels. Each pixel of an image is represented by 0 for black, 255 for white, and values in between for different shades of gray.
This paper is organized as follows: Section 1 is an introduction, Section 2 explains the metaheuristic algorithms used, Section 3 describes the convolution neural networks, Section 4 gives a description of the proposed methods, Section 5 presents the simulation results, and Section 6 is the conclusion.
1 Metaheuristic algorithms
Metaheuristics are well known as efficient methods for hard optimization problems, i.e. problems that cannot be solved optimally using a deterministic approach within a reasonable time limit. Metaheuristic methods serve three main purposes: solving problems faster, solving large problems, and making algorithms more robust. These methods are also simple to design as well as flexible and easy to implement.
In general, metaheuristic algorithms use a combination of rules and randomization to imitate natural phenomena. Metaheuristic algorithms that imitate biological systems include, for instance, evolution strategy, GA, and DE. Those based on ethological phenomena include particle swarm optimization (PSO), bee colony optimization (BCO), bacterial foraging optimization algorithms (BFOA), and ant colony optimization (ACO). Those based on physical phenomena include SA, microcanonical annealing, and the threshold accepting method. Another form of metaheuristic is inspired by musical phenomena, such as the HS algorithm.
Metaheuristic algorithms can also be classified into single-solution based and population-based. Examples of single-solution based metaheuristics are the noising method, tabu search, SA, threshold accepting (TA), and guided local search. Population-based metaheuristics can be further classified into swarm intelligence and evolutionary computation. Swarm intelligence is inspired by the collective behavior of social insect colonies or animal societies. Examples of these algorithms are PSO, BCO, ACO, and BFOA. Evolutionary computation, on the other hand, takes its inspiration from the Darwinian principles by which species adapt to their environment. Examples of these algorithms are genetic programming (GP), GA, evolution strategy (ES), and DE. Among all of these metaheuristic algorithms, SA, DE, and HS are used in this paper.
1.1 Simulated Annealing algorithm
SA is a random search technique for global optimization problems. It mimics the annealing process in material processing. This technique was first proposed in 1983 by Kirkpatrick, Gelatt, and Vecchi.
The principal idea of SA is to use a random search that not only accepts changes that improve the fitness function but also keeps some changes that are not ideal. For example, in a minimization problem, any change that decreases the value of the fitness function $f$ will be accepted, but some changes that increase $f$ will also be accepted with a transition probability $p$ as follows:
$$p = \exp\left(-\frac{\Delta E}{k_B T}\right)$$
where $\Delta E$ is the change in the energy level, $k_B$ is Boltzmann's constant, and $T$ is the temperature that controls the annealing process. This equation is based on the Boltzmann distribution in physics. The following is the standard procedure of SA for optimization problems:
1. Generate the solution vector: The initial solution vector is randomly selected, and then the fitness function is calculated.
2. Initialize the temperature: If the temperature value is too high, it will take a long time to reach convergence, whereas a value that is too small can cause the system to miss the global optimum.
3. Select a new solution: A new solution is randomly selected from the neighborhood of the current solution.
4. Evaluate the new solution: The new solution is accepted as the new current solution depending on its fitness function.
5. Decrease the temperature: During the search process, the temperature is periodically decreased.
6. Stop or repeat: The computation is stopped when the termination criterion is satisfied. Otherwise, steps 2 to 6 are repeated.
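The SA procedure can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function name `simulated_annealing`, the geometric cooling schedule, the uniform neighborhood of width `step`, and the toy `sphere` objective are all assumptions made for illustration, and Boltzmann's constant is folded into the temperature.

```python
import math
import random


def simulated_annealing(fitness, x0, step=0.5, t0=1.0, cooling=0.95,
                        max_iter=2000, seed=0):
    """Minimize `fitness` starting from the list `x0` (illustrative sketch)."""
    rng = random.Random(seed)
    current, f_current = list(x0), fitness(x0)
    best, f_best = list(current), f_current
    temperature = t0
    for _ in range(max_iter):
        # Select a new solution from the neighborhood of the current one.
        candidate = [xi + rng.uniform(-step, step) for xi in current]
        f_candidate = fitness(candidate)
        delta = f_candidate - f_current
        # Improvements are always accepted; worse moves are accepted with
        # probability exp(-delta / T), the Boltzmann-type transition rule.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, f_current = candidate, f_candidate
            if f_current < f_best:
                best, f_best = list(current), f_current
        # Decrease the temperature (geometric schedule, an assumption).
        temperature = max(temperature * cooling, 1e-12)
    return best, f_best


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(xi * xi for xi in x)


sa_best, sa_value = simulated_annealing(sphere, [3.0, -2.0])
```

Because the best solution seen so far is tracked separately from the current one, the returned value can never be worse than the fitness of the starting point, even though individual moves may go uphill.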
1.2 Differential Evolution algorithm
Differential evolution was first proposed by Price and Storn in 1995 to solve the Chebyshev polynomial problem. This algorithm is based on the differences between individuals: it exploits random search in the solution space and then applies the operations of mutation, crossover, and selection to obtain a suitable individual in the system.
There are several variants of DE; the classical form is DE/rand/1/bin, which indicates that in the mutation process the base vector is randomly selected and only a single difference vector is applied. The acronym bin indicates that the crossover process is governed by a binomial decision rule. The procedure of the DE algorithm is shown by the following steps:
1. Determining parameter settings: The population size is the number of individuals. The mutation factor (F) controls the magnification of the difference between two individuals to avoid search stagnation. The crossover rate (CR) decides how many genes of the mutated vector are copied to the offspring.
2. Initialization of population: The population is produced by randomly generating vectors in a suitable search range.
3. Evaluation of individuals: Each individual is evaluated by calculating its objective function.
4. Mutation operation: Mutation perturbs one or more vector parameters. In this operation, three auxiliary parents are selected randomly, and they take part in the mutation operation to create a mutated individual as follows:
$$v_i = x_{r_1} + F\,(x_{r_2} - x_{r_3})$$
where $r_1, r_2, r_3 \in \{1, 2, \ldots, N_p\}$ are the indices of the three auxiliary parents, $N_p$ is the population size, and $r_1 \neq r_2 \neq r_3 \neq i$.
5. Recombination operation: Recombination (crossover) is applied after the mutation operation.
6. Selection operation: This operation determines whether the offspring should become a member of the population in the next generation.
7. Stopping criterion: The current generation is replaced by the new generation until the termination criterion is satisfied.
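As an illustration of the DE/rand/1/bin procedure, the following Python sketch runs mutation, binomial crossover, and selection on a toy objective. The function name `differential_evolution`, the default parameter values, the forced crossover gene, and the `sphere` objective are assumptions made for this sketch and are not taken from the experiments.

```python
import random


def differential_evolution(objective, bounds, pop_size=10, f=0.8, cr=0.3,
                           max_gen=100, seed=0):
    """Minimize `objective` with DE/rand/1/bin (illustrative sketch)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Initialization: random vectors inside the search range.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(max_gen):
        new_pop, new_scores = [], []
        for i in range(pop_size):
            # Mutation: three distinct auxiliary parents, all different from i.
            others = [j for j in range(pop_size) if j != i]
            r1, r2, r3 = rng.sample(others, 3)
            mutant = [pop[r1][d] + f * (pop[r2][d] - pop[r3][d])
                      for d in range(dim)]
            # Binomial crossover: each gene comes from the mutant with
            # probability cr; one gene is always taken from the mutant.
            forced = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < cr or d == forced)
                     else pop[i][d] for d in range(dim)]
            # Selection: the trial replaces the parent only if it is not worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                new_pop.append(trial)
                new_scores.append(trial_score)
            else:
                new_pop.append(pop[i])
                new_scores.append(scores[i])
        pop, scores = new_pop, new_scores
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(xi * xi for xi in x)


de_best, de_value = differential_evolution(sphere, [(-5.0, 5.0)] * 3)
```

The selection step is greedy per individual, so the score stored for each population slot never increases from one generation to the next.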
1.3 Harmony Search algorithm
The harmony search algorithm was proposed by Geem et al. in 2001. This algorithm is inspired by the musical process of searching for a perfect state of harmony. As with harmony in music, the solution vector of the optimization and the improvisation of the musician are analogous to the local and global search structures in optimization techniques.
In musical improvisation, the players sound any pitch within the possible range together, which creates one harmony vector. If the pitches produce a good harmony, this experience is stored in the memory of each player, and they have the opportunity to create a better harmony next time. There are three possible alternatives when a musician improvises one pitch: playing any one pitch from her/his memory, playing a pitch adjacent to one from her/his memory, or playing an entirely random pitch within the range of possible sounds. When these options are used for optimization, they have three equivalent components: the use of harmony memory, pitch adjusting, and randomization. In the HS algorithm, these rules are associated with two relevant parameters, i.e. the harmony memory considering rate (HMCR) and the pitch adjusting rate (PAR). The procedure of the HS algorithm can be summarized in five steps as follows:
1. Initialize the problem and parameters: In this algorithm, the problem can be a maximization or a minimization, and the relevant parameters are HMCR, PAR, the size of the harmony memory, and the termination criterion.
2. Initialize harmony memory: The harmony memory (HM) is usually initialized as a matrix of randomly generated solution vectors, arranged according to the objective function.
3. Improvise a new harmony: A new harmony vector is produced from HM based on HMCR, PAR, and randomization. The selection of a new value is based on the HMCR parameter, which lies between 0 and 1. The new harmony vector is then examined to decide whether it should be pitch-adjusted using the PAR parameter. Pitch adjusting is executed only after a value has been selected from HM.
4. Update harmony memory: The new harmony replaces the worst harmony in HM, in terms of the value of the fitness function, if its fitness is better than that of the worst harmony.
5. Repeat (3) and (4) until the termination criterion is satisfied: When the termination criterion is met, the computation ends. Otherwise, steps (3) and (4) are repeated. In the end, the best vector in HM is selected and regarded as the best solution to the problem.
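The five HS steps can be sketched in Python as follows. This is only an illustration for a minimization problem: the function name `harmony_search`, the pitch-adjustment `bandwidth`, the clamping to the search range, and the `sphere` objective are assumptions that do not come from the text.

```python
import random


def harmony_search(objective, bounds, hms=10, hmcr=0.8, par=0.3,
                   bandwidth=0.1, max_iter=500, seed=0):
    """Minimize `objective` with a basic harmony search (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: random harmony memory, arranged by objective value (best first).
    memory = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(hms)]
    memory.sort(key=objective)
    for _ in range(max_iter):
        # Step 3: improvise a new harmony, one decision variable at a time.
        new = []
        for d, (lo, hi) in enumerate(bounds):
            if rng.random() < hmcr:
                # Memory consideration: reuse a stored pitch.
                value = rng.choice(memory)[d]
                if rng.random() < par:
                    # Pitch adjusting: move to a nearby pitch within range.
                    value += rng.uniform(-bandwidth, bandwidth)
                    value = min(max(value, lo), hi)
            else:
                # Randomization: an entirely random pitch.
                value = rng.uniform(lo, hi)
            new.append(value)
        # Step 4: replace the worst harmony if the new one is better.
        if objective(new) < objective(memory[-1]):
            memory[-1] = new
            memory.sort(key=objective)
    return memory[0], objective(memory[0])


def sphere(x):
    """Toy objective with its minimum at the origin."""
    return sum(xi * xi for xi in x)


hs_best, hs_value = harmony_search(sphere, [(-5.0, 5.0)] * 3)
```

Pitch adjusting is applied only inside the memory-consideration branch, which mirrors the rule that a value must first be selected from HM before it can be adjusted.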
2 Convolution Neural Network
Convolution neural network is a variant of the standard multilayer perceptron (MLP). A substantial advantage of this method over conventional approaches, especially for pattern recognition, is its capability to reduce the dimension of the data, extract features sequentially, and classify within a single network structure. The basic architecture of CNN was inspired by the model of the visual cortex proposed by Hubel and Wiesel in 1962.
In 1980, Fukushima's Neocognitron provided the first computational model of this idea, and then in 1989, following the idea of Fukushima, LeCun et al. achieved state-of-the-art performance on a number of pattern recognition tasks using the error gradient method.
The classical CNN by LeCun et al. is an extension of the traditional MLP based on three ideas: local receptive fields, weight sharing, and spatial/temporal sub-sampling. These ideas can be organized into two types of layers, namely convolution layers and sub-sampling layers. As shown in Fig. 1, the processing layers contain three convolution layers C1, C3, and C5, interleaved with two sub-sampling layers S2 and S4, and an output layer F6. These convolution and sub-sampling layers are structured into planes called feature maps.
In a convolution layer, each neuron is linked locally to a small input region (local receptive field) in the preceding layer. All neurons in the same feature map obtain data from different input regions until the whole input plane is scanned, but they share the same weights (weight sharing).
In a sub-sampling layer, the feature maps are spatially down-sampled, so that the size of the map is reduced by a factor of 2. As an example, a feature map of size 10x10 in layer C3 is sub-sampled to a corresponding feature map of size 5x5 in the subsequent layer S4. The last layer is F6, which performs the classification.
In principle, a convolution layer is characterized by its number of feature maps, the size of the kernel, and its connections to the previous layer. Each feature map is the result of a sum of convolutions of the maps of the previous layer with their corresponding kernels (linear filters), to which a bias term is added and a non-linear function is applied. The k-th feature map $M^k$ with weights $W^k$ and bias $b_k$ is obtained using a non-linear function $f$ as follows:
$$M^k_{ij} = f\left((W^k * x)_{ij} + b_k\right)$$
where $x$ is the input from the previous layer and $*$ denotes convolution.
The purpose of a sub-sampling layer is to achieve spatial invariance by reducing the resolution of the feature maps, where each pooled feature map corresponds to one feature map of the preceding layer. The sub-sampling function, where $a_i$ are the inputs in an $n \times n$ region, $\beta$ is a trainable scalar, $b$ is a trainable bias, and $f$ is a non-linear activation function, is given by the following equation:
$$a_j = f\left(\beta \sum_{n \times n} a_i + b\right)$$
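To make the two layer types concrete, the following Python sketch computes one convolution feature map and its sub-sampled version on plain nested lists. Several choices here are assumptions made for illustration only: `tanh` is used as the non-linear function, the convolution is implemented as a valid cross-correlation (no kernel flip, no padding), and the helper names `convolve_valid`, `feature_map`, and `subsample` are hypothetical.

```python
import math


def convolve_valid(image, kernel):
    """Valid 2-D cross-correlation of a list-of-lists image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(out_w)] for i in range(out_h)]


def feature_map(inputs, kernels, bias, activation=math.tanh):
    """Sum the convolutions of all input maps, add a bias, apply activation."""
    maps = [convolve_valid(x, k) for x, k in zip(inputs, kernels)]
    h, w = len(maps[0]), len(maps[0][0])
    return [[activation(sum(m[i][j] for m in maps) + bias)
             for j in range(w)] for i in range(h)]


def subsample(fmap, scale=2, beta=1.0, bias=0.0, activation=math.tanh):
    """Sum each scale x scale block, scale by beta, add bias, apply activation."""
    h, w = len(fmap) // scale, len(fmap[0]) // scale
    return [[activation(beta * sum(fmap[i * scale + u][j * scale + v]
                                   for u in range(scale)
                                   for v in range(scale)) + bias)
             for j in range(w)] for i in range(h)]


# A 6x6 input of ones with a 3x3 kernel gives a 4x4 map, pooled to 2x2.
image = [[1.0] * 6 for _ in range(6)]
kernel = [[0.1] * 3 for _ in range(3)]
fmap = feature_map([image], [kernel], bias=0.0)
pooled = subsample(fmap, scale=2)
```

Every neuron of `fmap` uses the same `kernel`, which is the weight sharing described above, and each pooled value depends on exactly one 2x2 block of a single feature map.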
After several convolution and sub-sampling layers, the last structure is the classification layer. It takes the resulting feature maps as input to a series of fully connected layers that execute the classification task. The classification layer has one output neuron per class label; in the case of the MNIST dataset, it contains ten neurons corresponding to the ten classes.
3 Design of proposed methods
The architecture of the proposed method follows a simple CNN structure (LeNet-5), not a complex structure like AlexNet. We use two variations of the design. The first is i-6c-2s-12c-2s, where the number of feature maps in C1 is 6 and in C2 is 12. The second is i-8c-2s-16c-2s, where the number of feature maps in C1 is 8 and in C2 is 16. The kernel size of all convolution layers is 5x5, and the scale of sub-sampling is 2. These architectures are designed for recognizing handwritten digits from the MNIST dataset.
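The feature map sizes implied by these two designs can be checked with a short Python sketch. It assumes valid convolutions without padding, so that a 5x5 kernel shrinks each side by 4, and non-overlapping sub-sampling; the function name `feature_map_sizes` is hypothetical.

```python
def feature_map_sizes(input_size=28, kernel=5, scale=2, maps=(6, 12)):
    """Trace the side length of the feature maps through c-s-c-s layers."""
    size = input_size
    trace = []
    for n_maps in maps:
        size = size - kernel + 1        # valid convolution, no padding
        trace.append(("c", n_maps, size))
        size = size // scale            # sub-sampling by the given scale
        trace.append(("s", n_maps, size))
    # Number of values passed on to the classification layer.
    features = maps[-1] * size * size
    return trace, features


trace_a, features_a = feature_map_sizes(maps=(6, 12))   # i-6c-2s-12c-2s
trace_b, features_b = feature_map_sizes(maps=(8, 16))   # i-8c-2s-16c-2s
```

Under these assumptions a 28x28 MNIST image shrinks to 24x24, 12x12, 8x8, and finally 4x4, so the first design feeds 12 x 4 x 4 = 192 values and the second 16 x 4 x 4 = 256 values to the classification layer.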
In the proposed method, the SA, DE, and HS algorithms are used to train CNN (CNNSA, CNNDE, and CNNHS) to find the condition of best accuracy and to minimize the estimated error and the indicator of network complexity. This objective can be realized by computing the loss function of the solution vector or the standard error on the training set. The following is the loss function used in this paper:
$$f = \frac{1}{2}\left(\frac{\sum_{i=1}^{N}(o_i - y_i)^2}{N}\right)^{0.5}$$
where $o$ is the expected output, $y$ is the real output, and $N$ is the number of training samples. For the termination criterion, two conditions are used in this method. The first is when the maximum number of iterations has been reached, and the second is when the loss function is less than a certain constant. Both conditions mean that the most optimal state has been achieved.
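The loss evaluation and the two termination conditions can be sketched in Python as follows. The exact form of the loss is an assumption here (half the root of the mean squared error between expected and real outputs); the experiments may use a different normalization. The names `loss` and `should_stop` and the threshold `tolerance` are hypothetical.

```python
import math


def loss(expected, actual):
    """Half the root of the mean squared error (assumed form of the loss)."""
    n = len(expected)
    squared = sum((o - y) ** 2 for o, y in zip(expected, actual))
    return 0.5 * math.sqrt(squared / n)


def should_stop(iteration, current_loss, max_iter=10, tolerance=1e-3):
    """Stop at the maximum iteration or once the loss is below a constant."""
    return iteration >= max_iter or current_loss < tolerance


perfect = loss([1.0, 0.0], [1.0, 0.0])
one_miss = loss([1.0, 0.0], [0.0, 0.0])
```

With these definitions a perfect prediction gives a loss of zero, and either condition alone is enough to end the search.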
3.1 Design of CNNSA method
In principle, the CNN algorithm computes the values of the weights and biases, and those of the last layer are used to calculate the loss function. These weight and bias values of the last layer are used as the solution vector to be optimized by the SA algorithm, by adding a random perturbation to them.
The size of this perturbation is the essential aspect of the proposed method. Choosing a proper value significantly increases the accuracy. For example, for CNNSA with one epoch, one setting of the rand-based perturbation gives an accuracy of 88.12, which is 5.73 higher than the original CNN (82.39), whereas another setting gives an accuracy of 85.79, which is only 3.40 higher than the original CNN.
Furthermore, this solution vector is updated based on the SA algorithm. When the termination criterion is satisfied, all of the weights and biases are updated for all layers in the system. The following is the CNNSA algorithm of the proposed method.
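Separately from the authors' algorithm listing, the idea of refining the last layer with SA can be sketched in Python. This is not the MATLAB implementation used in the experiments, and every detail is an assumption: the last layer is modeled as a small sigmoid output layer on fixed toy features, the loss is taken as half the root of the mean squared error, the perturbation is `scale` times a uniform random number with random sign, and the parameter c is interpreted as a geometric cooling factor. The names `last_layer_loss` and `sa_refine` are hypothetical.

```python
import math
import random


def last_layer_loss(params, features, targets):
    """Loss of a sigmoid output layer; params = flattened weights, then biases."""
    n_in, n_out = len(features[0]), len(targets[0])
    squared = 0.0
    for x, target in zip(features, targets):
        for k in range(n_out):
            z = params[n_in * n_out + k]  # bias of output neuron k
            z += sum(params[k * n_in + d] * x[d] for d in range(n_in))
            y = 1.0 / (1.0 + math.exp(-z))
            squared += (target[k] - y) ** 2
    return 0.5 * math.sqrt(squared / len(features))


def sa_refine(params, features, targets, scale=0.05, neighbors=10,
              max_iter=10, t0=1.0, cooling=0.5, seed=0):
    """Refine the last-layer solution vector with SA (illustrative sketch)."""
    rng = random.Random(seed)
    current = list(params)
    f_current = last_layer_loss(current, features, targets)
    best, f_best = list(current), f_current
    temperature = t0
    for _ in range(max_iter):
        for _ in range(neighbors):
            # Perturb every weight and bias by a small random amount.
            candidate = [p + scale * rng.uniform(-1.0, 1.0) for p in current]
            f_candidate = last_layer_loss(candidate, features, targets)
            delta = f_candidate - f_current
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, f_current = candidate, f_candidate
                if f_current < f_best:
                    best, f_best = list(current), f_current
        temperature *= cooling
    return best, f_best


# Toy data: two input features, two output classes.
features = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
targets = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
start = [0.0] * 6
refined, refined_loss = sa_refine(start, features, targets)
```

In a full CNN the refined vector would be written back to the last layer and the remaining layers updated from it; that step depends on the training code and is not shown here.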
3.2 Design of CNNDE method
First, this method computes all the values of the weights and biases. The weight and bias values of the last layer are used to calculate the loss function, and then, after a random perturbation is added, these new values are used to initialize the individuals in the population.
As in the CNNSA method, choosing a proper size for this perturbation significantly increases the accuracy. For CNNDE with one epoch, as an example, one setting of the rand-based perturbation gives an accuracy of 86.30, which is 3.91 higher than the original CNN (82.39), whereas another setting gives an accuracy of 85.51.
Furthermore, the individuals in the population are updated based on the DE algorithm. When the termination criterion is satisfied, all of the weights and biases are updated for all layers in the system. The following is the CNNDE algorithm of the proposed method.
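Separately from the authors' algorithm listing, the initialization of a population from the last-layer values can be sketched in Python. The function name `init_population`, the symmetric uniform perturbation, the `scale` value, and the choice to keep the unperturbed vector as the first individual are all assumptions made for illustration; the same construction could serve to fill a harmony memory.

```python
import random


def init_population(last_layer, pop_size=10, scale=0.01, seed=0):
    """Build a population by randomly perturbing the last-layer vector."""
    rng = random.Random(seed)
    # Keep the trained vector itself as one individual (an assumption).
    population = [list(last_layer)]
    while len(population) < pop_size:
        population.append([w + scale * rng.uniform(-1.0, 1.0)
                           for w in last_layer])
    return population


trained = [0.2, -0.1, 0.05, 0.3]   # toy stand-in for last-layer weights/biases
population = init_population(trained)
```

Each individual stays within `scale` of the trained values, so the search starts in the neighborhood of the solution found by the original CNN training rather than from arbitrary random vectors.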
3.3 Design of CNNHS method
First, as in CNNSA and CNNDE, this method computes all the values of the weights and biases. The weight and bias values of the last layer are used to calculate the loss function, and then, after a random perturbation is added, these new values are used to initialize the harmony memory.
In this method, the size of the perturbation is also an important aspect, and choosing a proper value significantly increases the accuracy. For example, for CNNHS (i-8c-2s-16c-2s) with one epoch, one setting of the rand-based perturbation gives an accuracy of 87.23, which is 7.14 higher than the original CNN (80.09), whereas another setting gives an accuracy of 80.23, which is only 0.14 higher than CNN.
Furthermore, the harmony memory is updated based on the HS algorithm. When the termination criterion is satisfied, all of the weights and biases are updated for all layers in the system. The following is the CNNHS algorithm of the proposed method.
4 Simulation and results
In this paper, the primary goal is to improve the accuracy of the original CNN by using the SA, DE, and HS algorithms. This is done by minimizing the classification error on the MNIST dataset. Some example images from the MNIST dataset are shown in Fig. 2.
In the CNNSA experiment, the neighborhood size was set to 10 and the maximum number of iterations (maxit) to 10. In CNNDE, the population size was 10 and maxit = 10. In CNNHS, the harmony memory size was 10 and maxit = 10. Since it is difficult to determine the best control parameters, in all of the experiments we set c = 0.5 for SA, F = 0.8 and CR = 0.3 for DE, and HMCR = 0.8 and PAR = 0.3 for HS. We also set the parameters of CNN, i.e., the learning rate and the batch size (100).
As for the epoch parameter, the number of epochs ranges from 1 to 10 for every experiment. All of the experiments were implemented in MATLAB R2011a on a personal computer with an Intel Core i7-4500U processor and 8 GB of RAM, running Windows 10, with five separate runs. The original program for this simulation is the DeepLearn Toolbox from Palm.
Accuracy (Acc.) and its standard deviation (Std. Dev.) for the design i-6c-2s-12c-2s
All of the experimental results of the proposed methods are compared with the experimental results of the original CNN. The results for the design i-6c-2s-12c-2s are summarized in Table 1 for accuracy, Table 2 for the computational time, Fig. 3 for error and its standard deviation, and Fig. 4 for computational time and its standard deviation. The results for the design i-8c-2s-16c-2s are summarized in Table 3 for accuracy, Table 4 for the computational time, Fig. 4 for error and its standard deviation, and Fig. 5 for computational time and its standard deviation.
The experiments with the original CNN were conducted only once for each epoch, because its accuracy does not change if the experiment is repeated under the same conditions. In general, the tests showed that the higher the number of epochs, the better the accuracy. For example, for one epoch, compared to CNN (82.39), the accuracy increased by 5.73 for CNNSA (88.12), 3.91 for CNNDE (86.30), and 4.84 for CNNHS (87.23). For 5 epochs, compared to CNN (93.11), the increase in accuracy is 3.18 for CNNSA (96.29), 2.04 for CNNDE (94.15), and 1.78 for CNNHS (94.89). For 100 epochs, as shown in Fig. 6, the increase in accuracy compared to CNN (98.65) is only 0.16 for CNNSA (98.81), 0.13 for CNNDE (98.78), and 0.09 for CNNHS (98.74).
The experimental results show that CNNSA achieves the best accuracy for all epochs. The accuracy improvement of CNNSA over the original CNN varies for each epoch, ranging from 1.74 (9 epochs) up to 5.73 (1 epoch). The computation time of the proposed methods, compared to the original CNN, ranges from 1.01 times (CNNSA, two epochs: 246/244) up to 1.70 times (CNNHS, nine epochs: 1246/856).
In addition, we also tested our proposed method on the CIFAR10 (Canadian Institute for Advanced Research) dataset. This dataset consists of 60,000 color images of size 32x32. There are five training batches comprising 50,000 images, and one test batch of 10,000 images. The CIFAR10 dataset is divided into ten classes, each with 6,000 images. Some example images from this dataset are shown in Fig. 8.
The CIFAR10 experiment was conducted in MATLAB R2014a, with the number of epochs ranging from 1 to 15. The original program is MatConvNet from Vedaldi and Lenc. In this paper, the program was modified with the SA algorithm. The results can be seen in Fig. 9 for the objective, Fig. 10 for the top-1 error, and Fig. 11 for the top-5 error. In general, these results show that CNNSA works better than the original CNN on the CIFAR10 dataset.
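For readers unfamiliar with the two error measures, the following Python sketch shows how a top-k error can be computed from class scores. It uses the usual definition (a sample counts as an error when its true label is not among the k highest-scoring classes); the function name `top_k_error` and the toy scores are assumptions made for illustration and are not taken from MatConvNet.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k best scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Three samples, three classes (toy values).
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
```

In this toy case only the first sample is classified correctly at top-1, while the second sample is recovered once the two best classes are considered, so the error can only decrease as k grows.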
This paper shows that the SA, DE, and HS algorithms improve the accuracy of CNN. Although the computation time increases, the error of the proposed methods is smaller than that of the original CNN for all numbers of epochs.
It is possible to validate the performance of this proposed method on other benchmark datasets such as ORL, INRIA, Hollywood II, and ImageNet. This strategy can also be developed for other metaheuristic algorithms such as ACO, PSO, and BCO to optimize CNN.
For future study, metaheuristic algorithms applied to other DL methods need to be explored, such as recurrent neural networks, deep belief networks, and AlexNet (a newer variant of CNN).
- 1. I. Boussaid, J. Lepagnot, and P. Siarry. A survey on optimization metaheuristics. Information Sciences, 237:82–117, 2013.
- 2. El-Ghazali Talbi. Metaheuristics From Design to Implementation. John Wiley & Sons, Hoboken, New Jersey, 2009.
- 3. P. O. Glauner. Comparison of training methods for deep neural networks. Master Thesis, 2015.
- 4. G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.
- 5. S. Kirkpatrick, C. Gelatt, and M. Vecchi. Optimization by simulated annealing. Science, New Series, 220(4598):671–680, 1983.
- 6. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. in Proc. Advances in Neural Information Processing Systems 25, Lake Tahoe, Nevada, 2012.
- 7. Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. in Proc. IEEE International Symposium on Circuits and Systems, pages 253–256, 2010.
- 8. K. S. Lee and Z. W. Geem. A new meta-heuristic algorithm for continuous engineering optimization: harmony search theory and practice. Comput. Methods Appl. Mech. Engrg, 194:3902–3933, 2005.
- 9. Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Ng. On optimization methods for deep learning. in Proc. The 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.
- 10. Li Deng and Dong Yu. Deep Learning: Methods and Applications. Foundations and Trends in Signal Processing, Redmond, WA 98052; USA, 2013.
- 11. LISA.lab. Deep learning tutorial release 0.1. University of Montreal, Canada, 2014.
- 12. J. Martens. Deep learning via Hessian-free optimization. in Proc. The 27th International Conference on Machine Learning, Haifa, Israel, 2010.
- 13. M. M. Najafabadi et al. Deep learning applications and challenges in big data analytics. Journal of Big Data, pages 1–21, 2015.
- 14. N. Noman, D. Bollgala, and H. Iba. An adaptive differential evolution algorithm. IEEE Evolutionary Computation, pages 2229–2236, 2011.
- 15. O. Vinyals and D. Povey. Krylov subspace descent for deep learning. in Proc. The 15th International Conference on Artificial Intelligence and Statistics (AISTATS), La Palma, Canary Islands, 2012.
- 16. R. Palm. Master thesis: Prediction as a candidate for learning deep hierarchical model of data, 2012.
- 17. L. M. R. Rere, M. I. Fanany, and A. M. Arymurthy. Simulated annealing algorithm for deep learning. Procedia Computer Science, 72:137–144, 2015.
- 18. J. L. Sweeney. Deep learning using genetic algorithms. Master Thesis, 2012.
- 19. A. Vedaldi and K. Lenc. Matconvnet – convolutional neural networks for matlab.
- 20. Xin-She Yang. Engineering Optimization: an introduction with metaheuristic application. John Wiley & Sons, Hoboken, New Jersey, 2010.
- 21. Yoshua Bengio. Learning Deep Architectures for AI, volume 2:No.1. Foundations and Trends in Machine Learning, 2009.
- 22. Y. Zhining and P. Yunming. The genetic convolutional neural network model based on random sample. IJUNESST, 8(11):317–326, 2015.