I. Introduction
In the last decade, Machine Learning (ML) has gained significant interest, especially through Deep Learning (DL). Particularly, DL focuses on learning features from data using multiple layers of abstraction, i.e., Deep Neural Networks (DNNs) [1], and thanks to this characteristic, DL has dramatically improved the state-of-the-art in several pattern recognition and prediction problems
[2, 3]. There are several types of DNNs, each one suited to a specific class of problems. Among these network types, Recurrent Neural Networks (RNNs) are especially good at sequence modeling and prediction, e.g., natural language, image, and speech recognition and modeling [1]. Basically, RNNs are feedforward networks extended with feedback connections between layers and neurons, and this recurrence allows them to capture long-term dependencies in the input. In spite of their great performance, RNNs have a drawback: they are hard to train because of the
vanishing and exploding gradient problems [4, 5]. An alternative that mitigates the problems related to DNN training is to optimize the hyperparameters of the network. By selecting an appropriate configuration of the parameters of the network (e.g., the activation functions, the number of hidden layers, the kernel size of a layer, etc.), the network is tailored to the problem, and by this means its performance is improved [6, 7, 8]. DNN hyperparameter optimization methods can be grouped into two main families: manual exploration-based approaches, usually led by expert knowledge, and automatic search-based methods (e.g., grid, evolutionary, or random search) [9]. The number of alternatives for configuring a DNN is huge, so hyperparameter optimization has to deal with a high-dimensional search space. In spite of this size, most methods (manual and automatic) are based on trial-and-error, meaning that each hyperparameter configuration is trained and tested to evaluate its numerical accuracy. Thus, the high-dimensional search space and the high cost of the evaluation limit the results of this methodology.
Some authors have explored different approaches to speed up the evaluation of DNN architectures to improve the efficiency of automatic hyperparameter optimization algorithms [10, 11].
A promising approach to evaluating stacked RNN architectures is the MAE (mean absolute error) random sampling [10, 12]. The main idea behind this method, inspired by linear time-invariant theory, is to infer the numerical accuracy of a given network without actually training it. Given an input, several sets of random weights are generated and analyzed by measuring the MAE. Then, the probability of finding a set of weights whose MAE is below a predefined threshold is estimated.
In this work, we study the extension of the MAE random sampling to multiple-hidden-layer networks [12], and we put forward the use of this approach as a heuristic to optimize the architecture. Particularly, we propose to use a metaheuristic to navigate the architecture space and to guide this search using the MAE random sampling. Specifically, we propose an evolutionary strategy (ES)-based algorithm. Finally, once a "final" solution is found, we propose to train it using a gradient descent-based method.
The remainder of this paper is organized as follows: the next section outlines the related work. Section III introduces the multiple-hidden-layer extension of the MAE random sampling [12]. Section IV presents the ES-based optimization algorithm. Sections V and VI present the experimental results. Finally, Section VII discusses the conclusions drawn from this study and proposes future work.
II. Related Work
This section outlines the most outstanding studies related to our present work. First, we introduce the related work regarding RNN hyperparameter optimization and evaluation. Then, we present research studies that apply metaheuristics to address DL optimization.
II-A RNN Hyperparameter Optimization
An RNN is a network that incorporates recurrent (or feedback) edges that may form cycles and self-connections. This approach introduces the notion of time to the model. Thus, at time t, a node connected to a recurrent edge receives input from the current data point x(t) and also from the hidden node values h(t-1) (the previous state of the network). The output y(t) at each time t is computed according to the hidden node values h(t) at time t. An input x(t-1) at time t-1 can determine the output at time t and later by way of the recurrent connections [13].
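The recurrent update described above can be sketched in a few lines. This is an illustrative toy (plain tanh units with per-unit, i.e., diagonal, recurrent weights), not the LSTM cells used later in this paper:

```python
import math

def rnn_step(x_t, h_prev, w_in, w_rec, b):
    # The hidden state at time t depends on the current input x_t
    # and on the previous hidden state h_prev (the recurrence).
    return [
        math.tanh(w_in[i] * x_t + w_rec[i] * h_prev[i] + b[i])
        for i in range(len(h_prev))
    ]

# Unroll a single-unit RNN over a short input sequence: the first
# input can still influence the last output through the recurrence.
h = [0.0]
for x in [1.0, 0.5, -0.5]:
    h = rnn_step(x, h, w_in=[0.8], w_rec=[0.3], b=[0.0])
```

Unrolling the loop makes the depth-in-time issue visible: each extra time step multiplies in another recurrent weight, which is what aggravates the gradient problems discussed next.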
The majority of DL approaches to train a network are based on gradient-based optimization procedures, e.g., using a local numerical optimization such as stochastic gradient descent or second-order methods. However, these methods are not well suited for RNNs. The main issue with gradient-based approaches is that they keep a vector of activations for each time step, which makes RNNs extremely deep and aggravates the exploding and vanishing gradient problems [4, 14, 5]. More recently, Long Short-Term Memory (LSTM) has emerged as a specific type of RNN architecture, which contains special units called memory blocks in the recurrent hidden layer
[15]. LSTM mitigates the gradient problems, and therefore, LSTMs are easier to train than standard RNNs. However, not only the network architecture affects the learning process but also the weight initialization [16], as well as the specific parameters of the optimization algorithm [17]. Therefore, to cope with the learning process as a whole, some authors [6, 7, 8] have proposed to perform a hyperparameter optimization. Specifically, they propose to look for a specific architecture (the number of layers, the number of hidden units per layer, the activation function, etc.) and a set of parameters to train the network that improve the performance of the optimized network on a given dataset. In other words, instead of using a general configuration, the idea is to tailor it to the problem.
When dealing with a hyperparameter configuration, an expert can discard a configuration based on their expertise, i.e., without needing to evaluate it. However, intelligent automatic hyperparameter optimization procedures search more efficiently through a high-dimensional search space.
Even though intelligent methods are more competitive than experts, they are not generally adopted because they are computationally intensive. They require fitting a model and evaluating its performance on validation data (i.e., they are data-driven), which can be an expensive process [18, 6, 19].
Hence, a few methods have been proposed to address this issue by speeding up the evaluation of the proposed hyperparameterization. For example, Domhan et al. [11] analyzed an approach that detects and terminates the evaluation of neural networks that underperform previously computed ones. This solution was able to reduce the hyperparameter search time by up to 50%. More recently, Camero et al. [10] presented the MAE random sampling, a novel low-cost method to compare one-hidden-layer RNN architectures without training them. MAE random sampling evaluates an RNN architecture by generating sets of random weights and evaluating their performance.
In line with the latter approach, we propose to extend the MAE random sampling to evaluate RNNs with multiple hidden layers, making it suitable for evaluating deeper RNNs. Then, we propose to use this method to guide a metaheuristic algorithm that searches for the most suitable hyperparameters. Nonetheless, nothing prevents the MAE random sampling from being used by any other type of hyperparameter optimization method.
II-B Deep Learning and Metaheuristics
Metaheuristics are well-known optimization algorithms for addressing complex, non-linear, and non-differentiable problems [20, 9]. They efficiently combine exploration and exploitation strategies to provide good solutions while requiring bounded computational resources. They have been successfully used to solve real-world problems in different fields, e.g., network design [21], smart mobility [22, 23], and facility allocation [24].
Optimization in DL may be viewed from different perspectives: training as optimization of the DNN weights, hyperparameter selection, network topologies, the learning environment, etc. These different points of view are adopted to improve the generalization capabilities of DNNs.
Gradient descent-based methods, such as backpropagation, are widely used to train DNNs. However, these methods need several manual tuning schemes to set their parameters, and it is difficult to parallelize them to take advantage of graphics processing units (GPUs). Thus, several authors have explored training DNNs using metaheuristics, an idea explored long before the rise of DNNs [25, 26]. Different authors combined convolutional neural networks with metaheuristics to improve their accuracy and performance by optimizing the layer weights and thresholds. Following this idea, You and Pu used a genetic algorithm (GA) [27]; Rosa et al. applied harmony search (HS) [28]; Rere et al. analyzed simulated annealing (SA) [29], and later the same authors evaluated SA, differential evolution (DE), and HS [30]. GAs have been applied to evolve increasingly complex neural network topologies and the connection weights simultaneously, in the NeuroEvolution of Augmenting Topologies (NEAT) method [31, 32]. However, NEAT has some limitations when it comes to evolving DNNs and RNNs [33].
Focusing on RNNs, NEAT-LSTM [34] and CoDeepNeat [35] extend NEAT to mitigate its limitations when evolving the topologies and weights of the network. Besides, particle swarm optimization (PSO) has been analyzed as an alternative to SGD for training RNNs [36], providing comparable results. El Said et al. [37] proposed the use of ant colony optimization (ACO) to improve LSTM RNNs by refining their cellular structure. Considering the optimization of the RNN hyperparameters and architecture, as is done in this study, Camero et al. [7] applied a GA to search for the most efficient configurations, improving the accuracy and the performance with respect to the most commonly used RNN configurations. In that case, the authors train each network using SGD to evaluate the performance of the configurations. Therefore, the main difference with our approach is that we propose to use the MAE random sampling instead of training each network/configuration. Thus, we expect to reduce the computational cost of the evaluation process, allowing the optimization algorithm to perform a larger number of iterations.
III. MAE Random Sampling
Inspired by the simple fact that changing the weights of a neural network affects its output [17], Camero et al. proposed a novel approach to characterize and compare RNN architectures: the MAE random sampling [10]. First, they showed its usefulness for comparing the expected performance (i.e., the probability of a good result after training) of RNNs with a single hidden layer. Later, they extended their technique to multiple hidden layers [12] and showed that there is a strong negative correlation between the estimated probability and the MAE measured after training, and that this negative correlation increases when adding more hidden layers.
MAE random sampling consists in taking a user-defined number of samples of the output (on a given input) of a specific RNN architecture, where every time a sample is taken, the weights are independently initialized with a normal distribution. Then, a truncated normal distribution is fitted to the sampled MAE values, and the probability p of finding a set of weights whose error is below a user-defined threshold is estimated. This probability is then used as a predictor of the performance (error) of the analyzed architecture. Figure 1 depicts the MAE random sampling originally introduced by Camero et al. [10], extended to multiple-hidden-layer RNNs. The distribution of the sampled errors is used to estimate the probability of finding a good solution.
Algorithm 1 presents the adaptation of the MAE random sampling [10] to multiple hidden layers [12]. Given an architecture (ARCH), encoded as a vector whose terms represent the number of LSTM cells in the corresponding hidden layer, the number of time steps or look back (LB), and a user-defined time series (data), the algorithm takes MAX_SAMPLES samples of the MAE by initializing the weights with a normal distribution. After the sampling is done, a truncated normal distribution is fitted to the sampled MAE values, and finally, the probability p is estimated for the inputted THRESHOLD.
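The sampling procedure can be sketched as follows. To stay dependency-free, this sketch uses a stand-in one-parameter predictor instead of an LSTM and fits a plain normal distribution (via `statistics.NormalDist`) rather than the truncated normal used in the paper; `mae_random_sampling`, `predict`, and `init_weights` are names chosen for this illustration, not the paper's implementation:

```python
import random
import statistics

def mae(y_true, y_pred):
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def mae_random_sampling(predict, init_weights, data, targets,
                        max_samples=100, threshold=0.01):
    # Take MAX_SAMPLES samples of the MAE, each with freshly
    # (normally) initialized weights, fit a distribution to the
    # sampled errors, and estimate p = P(MAE < threshold).
    maes = [mae(targets, predict(init_weights(), data))
            for _ in range(max_samples)]
    dist = statistics.NormalDist(statistics.mean(maes),
                                 statistics.stdev(maes))
    return dist.cdf(threshold), maes

# Stand-in "architecture": a single scaling weight (not an RNN).
random.seed(42)
data = [0.1 * i for i in range(10)]
targets = data  # a weight of exactly 1.0 would give MAE = 0
p_good, samples = mae_random_sampling(
    lambda w, xs: [w * x for x in xs],   # toy predictor
    lambda: random.gauss(1.0, 0.5),      # normal weight initialization
    data, targets, max_samples=100, threshold=0.05)
```

A higher estimated probability suggests that a good set of weights is easier to find for the sampled architecture, which is exactly how Algorithm 1 ranks candidates.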
IV. MAE Random Sampling-Based Optimization
In this section, we introduce our proposal for RNN architecture optimization based on the MAE random sampling. First, (i) we state the architecture optimization problem, and then, (ii) we present an evolutionary strategy-based algorithm to perform the optimization.
IV-A Architecture Optimization
The optimization of an artificial neural network consists of searching for an appropriate network structure (i.e., the architecture) and a set of weights [17]. However, in spite of this definition, it is rather common to arbitrarily define the architecture and then apply a learning rule (e.g., stochastic gradient descent) to optimize the set of weights [9]. Thus, we might say that the network is only partially optimized or, in other words, that we are not fully leveraging the computational model. Therefore, in this study, we are interested in optimizing the architecture, aiming to improve the overall performance.
Usually, the RNN architecture optimization is stated as a minimization problem [9]. For example, Equation (1) defines this problem as looking for an RNN architecture that minimizes the mean absolute error (MAE) of the predicted output (ŷ) against the real one (y), subject to a minimum/maximum number of hidden layers (HL), neurons per layer (NPL), and look back (LB) [7]. Normally, the training of the candidate solution is implied in this definition. Therefore, due to the intensive computations of the training, this optimization tends to be time demanding, and thus we propose to restate the optimization problem using the MAE random sampling.
minimize    MAE(ŷ, y)                                   (1)
subject to  min_HL ≤ HL ≤ max_HL,
            min_NPL ≤ NPL ≤ max_NPL,
            min_LB ≤ LB ≤ max_LB
We propose to optimize the architecture of an RNN by maximizing the probability p, i.e., given an input X and an output Y, we propose to look for an RNN architecture that maximizes the estimated probability of finding a set of weights whose error is below a user-defined threshold (Algorithm 1). Equation (2) presents the referred problem.
maximize    p                                           (2)
subject to  min_HL ≤ HL ≤ max_HL,
            min_NPL ≤ NPL ≤ max_NPL,
            min_LB ≤ LB ≤ max_LB
IV-B Evolutionary Approach
To solve the RNN architecture optimization problem stated in Equation (2), we designed a deep neuroevolutionary algorithm based on the (μ + λ) Evolutionary Strategy (ES) [20]. Algorithm 2 presents a high-level view of our proposal.
In our proposal, a solution represents an RNN architecture, and it is encoded as a variable-length integer vector, where the first term is the LB and each remaining term corresponds to the number of LSTM cells in one hidden layer. Thus, the first term is bounded by min_LB and max_LB, and the remaining terms by min_NPL and max_NPL. Given this definition, the number of hidden layers is implicitly derived from the length of the solution. Then, the population is defined as a set of population_size solutions.
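Under this encoding, creating a random solution within the constraints is straightforward. The following sketch assumes the bounds are passed explicitly (the function and parameter names are ours, not the paper's):

```python
import random

def random_solution(min_lb, max_lb, min_npl, max_npl, min_hl, max_hl):
    # A solution is an integer vector: the first term is the look
    # back, each remaining term is the number of LSTM cells in one
    # hidden layer, so the layer count is implicit in the length.
    hidden_layers = random.randint(min_hl, max_hl)
    return ([random.randint(min_lb, max_lb)] +
            [random.randint(min_npl, max_npl)
             for _ in range(hidden_layers)])

random.seed(1)
sol = random_solution(min_lb=2, max_lb=30, min_npl=1, max_npl=100,
                      min_hl=1, max_hl=3)
```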
First, the Initialize function randomly creates a set of solutions. Next, the Evaluate function computes p (Algorithm 1) for each solution. Then, the population is evolved until the termination criterion is met (i.e., the number of evaluations is greater than max_evaluations).
The evolutionary process is divided into selection, mutation, evaluation, replacement, and self-adjustment (Algorithm 2). First (line 5), an offspring (of offspring_size solutions) is selected using binary tournament from the current population. Then, each solution in the offspring is mutated by a two-step process. In the first step of the mutation (line 6, CellMutation), for every cell-count term of the solution, with probability cell_mut_p, a value in the range [-max_step, max_step] (excluding zero) is added. Then, in the second step of the mutation (line 7, LayerMutation), independently, with probability layer_mut_p, a layer is cloned or removed (with the same probability), i.e., one layer is added to or subtracted from the solution.
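The two mutation steps can be sketched as follows. The text does not specify which layer is cloned or removed, so a uniformly random layer is assumed here, and cell counts are clamped to the search-space bounds (another assumption of this sketch):

```python
import random

def cell_mutation(sol, cell_mut_p, max_step, min_npl, max_npl):
    # Step 1: with probability cell_mut_p, add a nonzero value in
    # [-max_step, max_step] to each layer's cell count.
    out = list(sol)
    steps = [s for s in range(-max_step, max_step + 1) if s != 0]
    for i in range(1, len(out)):
        if random.random() < cell_mut_p:
            out[i] = min(max(out[i] + random.choice(steps),
                             min_npl), max_npl)
    return out

def layer_mutation(sol, layer_mut_p, min_hl, max_hl):
    # Step 2: with probability layer_mut_p, clone or remove one
    # hidden layer (each with the same probability), respecting the
    # bounds on the number of hidden layers.
    out = list(sol)
    if random.random() < layer_mut_p:
        i = random.randint(1, len(out) - 1)
        if random.random() < 0.5 and len(out) - 1 < max_hl:
            out.insert(i, out[i])   # clone the i-th hidden layer
        elif len(out) - 1 > min_hl:
            del out[i]              # remove the i-th hidden layer
    return out

random.seed(7)
child = layer_mutation(
    cell_mutation([5, 10, 20], cell_mut_p=1.0, max_step=3,
                  min_npl=1, max_npl=100),
    layer_mut_p=1.0, min_hl=1, max_hl=3)
```

Note that the look back (the first term) is never touched by LayerMutation; only the hidden-layer terms are cloned or removed.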
Once the mutation is done, the offspring is evaluated (Evaluate), and afterwards, the best solutions from the population and the offspring are selected by the Best function, i.e., the population and the offspring are gathered together and sorted, and finally, the solutions that have a higher probability p give place to the new population.
The number of evaluations is increased accordingly, and finally, a SelfAdapting process takes place. In the latter process, if the new population is improving on average (i.e., the average of the new population is greater than the former average), then the cell_mut_p, max_step, and layer_mut_p parameters are multiplied by 1.5. Otherwise, these parameters are divided by 4 [38].
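The self-adjustment rule is simple enough to state directly; the dictionary form below is our own packaging of the three parameters, not the paper's implementation:

```python
def self_adapt(params, improved):
    # Amplify mutation when the population improves on average
    # (multiply by 1.5), damp it otherwise (divide by 4) [38].
    factor = 1.5 if improved else 0.25
    return {name: value * factor for name, value in params.items()}

params = {"cell_mut_p": 0.2, "max_step": 5, "layer_mut_p": 0.2}
params = self_adapt(params, improved=True)
```

In a real implementation max_step would likely be rounded back to an integer before use; the sketch leaves it as a float.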
After the evolutionary process ends, the best solution (i.e., the solution with the greatest probability p) of the population is selected (line 13) and trained using a user-defined method. Without loss of generality, we chose the Adam optimizer [39] to train the final solution for a predefined number of epochs.
Finally, the algorithm returns an RNN that is optimized (structure and weights) for the given problem.
It is quite interesting to notice that the Evaluate function may be seamlessly replaced by any other fitness function, e.g., the MAE after training the network for a user-defined number of epochs. Accordingly, the Best function has to be modified to maximize or minimize the new objective function (fitness).
V. MAE Random Sampling Results
We have implemented our proposal in Python (code available at https://github.com/acamero/dlopt), using DLOPT [40], Keras [41], and TensorFlow [42], and we have tested it on a standard problem: the sine wave. The selection of this problem is based on two reasons: first, Camero et al. [10, 12] studied it, so we can compare our results to theirs; second, any periodic waveform can be approximated by adding sine waves [43]. Equation (3) expresses a sine wave as a function of time t, where A is the peak amplitude, f is the frequency, and φ is the phase. Particularly, we used a sine wave with a fixed amplitude, frequency, and phase, over a fixed range of seconds (s), sampled at ten samples per second.
y(t) = A sin(2πft + φ)                                  (3)
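Generating the training series from Equation (3) takes one line per sample. The concrete amplitude, frequency, and phase used in the experiments are not reproduced here, so unit values stand in for them purely for illustration:

```python
import math

def sine_wave(amplitude, frequency, phase, t):
    # y(t) = A * sin(2*pi*f*t + phi), Equation (3).
    return amplitude * math.sin(2 * math.pi * frequency * t + phase)

# Ten samples per second, as described in the text; A = f = 1 and
# phi = 0 are placeholders, not the paper's actual values.
samples = [sine_wave(1.0, 1.0, 0.0, i / 10.0) for i in range(100)]
```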
V-A MAE Random Sampling as a Predictor
To study the ability of the MAE random sampling to predict how likely it is to find a good set of weights, i.e., a set of weights that has a good error performance, we sampled one- up to three-hidden-layer stacked RNNs.
The method of this study is as follows [12]. First, we defined the minimum (MIN_NPL) and maximum (MAX_NPL) number of hidden LSTM cells of each hidden layer, and we defined the look back values to be studied. Then, we took 100 samples (MAX_SAMPLES) and estimated the probability p (i.e., THRESHOLD=0.01). At last, we selected 100 architectures (i.e., a number of LSTM cells and a look back), trained them using the Adam optimizer [39], and analyzed the relation between the estimated probability and the observed MAE.
Table I presents the parameters used in this experiment. The two- and three-hidden-layer parameters were selected upon observation of the one-hidden-layer results. Note that the greatest variation of p occurs in the region described by the parameters shown in the table (Figure 2).
Architecture        MIN_NPL  MAX_NPL  Look back
One-hidden-layer    1        100      [1, 30]
Two-hidden-layer    7        31       {1, 10, 20, 30}
Three-hidden-layer  7        31       {1, 10, 20, 30}
Summarizing, we estimated p for the RNNs defined by the constraints presented in Table I, trained 100 architectures (selected uniformly from the referred sample), and studied the correlation between the predicted probability and the observed error. Table II presents the correlation between the MAE random sampling results (Mean MAE, Sd MAE, and log p) and the MAE observed after training the RNNs for a predefined number of epochs.
Architecture        Epochs  Mean MAE  Sd MAE  log p
One-hidden-layer    1       0.447     0.317   0.211
                    10      0.726     0.431   0.321
                    100     0.790     0.641   0.650
                    1000    0.668     0.458   0.515
Two-hidden-layer    1       0.086     0.135   0.171
                    10      0.450     0.632   0.635
                    100     0.709     0.827   0.905
                    1000    0.695     0.843   0.922
Three-hidden-layer  1       0.334     0.447   0.475
                    10      0.546     0.724   0.745
                    100     0.720     0.869   0.906
                    1000    0.130     0.873   0.911
Figure 2 shows the relation between the number of hidden cells and p; each color represents a different look back. The probability increases rapidly from 1 to 25 cells (by more than six orders of magnitude). Then, i.e., when the number of cells is greater than 25, the probability tends to converge or stabilize for each look back.
In the one-hidden-layer case, there is a moderate-to-strong negative correlation between p and the observed MAE. This insight tells us that given two RNNs, we should select the one with the higher probability. Nonetheless, it is important to notice that we are not predicting the error (i.e., the loss value). Instead, we are predicting how likely it is to find a set of weights that has a good error performance.
The results also show a strong negative correlation between the estimated probability and the MAE observed after training in the two- and three-hidden-layer setups. Moreover, the results indicate that when the problem gets more complex (i.e., when we add more hidden layers), the probability gets even more correlated to the actual performance.
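The correlation analysis can be reproduced with a plain Pearson coefficient (the paper does not state which coefficient it uses; Pearson is assumed here). The numbers below are made-up toy values, not the paper's data; they only illustrate the kind of relationship being measured:

```python
import math

def pearson(xs, ys):
    # Pearson correlation between two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy illustration: a higher estimated log-probability paired with a
# lower post-training MAE yields a strong negative correlation.
log_p = [-8.0, -6.0, -4.0, -2.0, -1.0]
trained_mae = [0.30, 0.22, 0.15, 0.11, 0.08]
r = pearson(log_p, trained_mae)
```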
V-B Memory and Time Comparison
So far, we have shown that the MAE random sampling is useful for comparing stacked RNN architectures. But is it fast and light enough (in terms of computational resources)? To answer this question, we analyzed the execution logs. Note that the experiments were executed under similar hardware and software conditions.
Table III summarizes the time and memory usage. In spite of the simplicity of the problem (sine wave), the difference between the time needed to train an RNN and the time needed to perform an MAE random sampling is quite notable. On the other hand, there is only a small difference in the memory required. Therefore, considering the low cost of sampling an RNN and the usefulness of the approach for comparing architectures, we believe that using the MAE random sampling is worthwhile.
                  Time [s]         Memory [MB]
                  Mean    Sd       Mean    Sd
Adam 1000 epochs  996     0.006    127     6.338
MAE rand samp     6       0.001    150     98.264
VI. Optimization Results
Once we have shown that the MAE random sampling is a good predictor of the RNN training performance, we studied its actual usefulness for optimizing an RNN architecture using the proposed metaheuristic algorithm (Section IV). First, (i) we optimized an RNN to predict the sine wave defined in Equation (3) using the MAE random sampling as the heuristic of the algorithm. Then, (ii) we repeated the optimization but using early training results as the heuristic. Finally, (iii) we analyzed a real-world problem. Particularly, we optimized an RNN that predicts the waste generation of a city [44, 45], and we compared our results to the state-of-the-art urban waste container filling level prediction methods [46].
VI-A MAE Random Sampling Heuristic
First, we optimized an RNN that predicts the sine wave (Equation 3) using the proposed evolutionary-based method (Section IV) and the MAE random sampling [10] as the heuristic of the optimization algorithm.
Table IV presents the parameter values of the algorithm used to optimize the RNN. It is very important to notice that during the optimization the cell_mut_p, max_step, and layer_mut_p values are self-adapted (Algorithm 2, line 11). Thus, their initial values are not critical.
Parameter        Value
cell_mut_p       0.2
epochs           100
max_step         5
layer_mut_p      0.2
population_size  10
offspring_size   10
max_eval         100
We used the MAE random sampling results as the heuristic of the algorithm, i.e., we estimated the training performance of the solutions and used the estimated probability p to sort them. Specifically, we computed the sampling using the parameters presented in Table V.
Parameter        Value
num_samples      100
threshold        0.01
truncated_range  [0, 2]
On the other hand, according to the definition of the optimization problem presented in Equation (2), we set the constraints of the problem. Table VI presents the search space.
Parameter  Value
min_LB     2
max_LB     30
min_NPL    1
max_NPL    100
min_HL     1
max_HL     3
We executed the optimization process 30 independent times using the parameter values already mentioned, and we computed the statistics of the error (MAE) over the final solutions. Table VII summarizes these results, where MRS stands for the optimization guided by the MAE random sampling heuristic and GDET for the gradient-descent early training heuristic (presented below in this section).
        MRS    GDET
Mean    0.105  0.142
Median  0.100  0.149
Max     0.247  0.270
Min     0.063  0.054
Sd      0.035  0.051
Overall, the results of MRS exceed those of GDET, i.e., the optimized RNN obtained by MRS has on average a lower error than the ones optimized by GDET. Moreover, the Wilcoxon rank-sum test p-value is 0.001. Therefore, we can conclude that MRS is significantly better than GDET.
VI-B Gradient-Descent Early Training Heuristic
We repeated the RNN optimization using our evolutionary approach but with a different heuristic. Specifically, we used early training results to predict the performance, i.e., we trained the solutions using Adam for a short time and used the loss results as the heuristic of the optimization algorithm.
Table VIII presents the early training heuristic parameters. We trained the candidates for a number of epochs (heuristic_epochs) that is much smaller than the number of epochs used to train the final solution.
Parameter        Value
heuristic_epochs 1
dropout          0.5
                 Short training                           MRS
        MAE    No. LSTM  LB  No. HL  Time [min]   MAE    No. LSTM  LB  No. HL  Time [min]
Mean    0.073  451       6   5       97           0.079  793       17  3       51
Median  0.073  420       5   5       70           0.073  513       16  2       45
Max     0.076  1252      16  8       405          0.138  2038      30  8       103
Min     0.071  127       2   1       33           0.069  444       2   1       40
Sd      0.001  228       2   2       75           0.017  493       11  3       13
We executed the optimization process 30 independent times using the gradient-descent early training parameter values and the optimization parameters set before (Tables IV and VI), and we computed the statistics over the final solutions. Table VII summarizes these results, where MRS stands for the optimization guided by the MAE random sampling heuristic and GDET for the gradient-descent early training heuristic. The results show that MRS outperforms GDET in regard to the numerical accuracy.
VI-C Waste Generation
To continue the validation of our proposal, we tested it on a real-world problem: waste generation prediction. We studied the problem presented by Ferrer and Alba [44, 45], where the filling level of 217 paper containers spread throughout a city in Spain is predicted.
Originally, Ferrer and Alba [44] proposed to predict the filling level of each container individually using Gaussian processes, linear regression, and SMReg. Later, Camero et al. [46] outperformed those results by predicting all containers at once using an RNN.
Particularly, in the referred study [46], the authors proposed to optimize an RNN for the problem using an ES-based algorithm. More specifically, they trained each candidate solution using a gradient descent-based method for a short time (ten epochs), and once the termination criterion was met, they trained the final solution for 1000 epochs. Table IX summarizes the results presented by Camero et al. [46] under the column Short training.
In accordance with the constraints and parameters presented in [46], we defined a new framework to run our optimization process. Table X presents these parameters. Note that the search space matches the one explored in [46], as does the number of training epochs of the final solution. The rest of the parameters were taken from the previous experimentation presented in this study.
Parameter        Value
min_LB           2
max_LB           30
min_NPL          10
max_NPL          300
min_HL           1
max_HL           8
cell_mut_p       0.2
max_step         15
layer_mut_p      0.2
population_size  10
offspring_size   10
max_eval         100
epochs           1000
We executed our RNN optimizer 30 independent times using the parameters defined in Table X and the dataset introduced in [44]. Table IX summarizes the results presented by Camero et al. [46] (columns under Short training) and the results of the optimization using the MAE random sampling (columns under MRS) as the heuristic of the algorithm (Algorithm 2). In the table, MAE stands for the MAE of the final solution, No. LSTM is the number of LSTM cells in the network, LB corresponds to the look back, No. HL represents the number of hidden layers, and Time is the total time (i.e., the optimization process plus the training of the final solution) in minutes.
In terms of the MAE, the results of both approaches are quite similar. Therefore, we performed a Wilcoxon rank-sum test to validate whether there is a significant difference between them. Note that both approaches are stochastic and were executed 30 independent times for statistical soundness. The p-value of the test (comparing the MAE) is equal to 0.665; therefore, there is no evidence that one algorithm outperforms the other. Furthermore, the median is the same in both cases.
On the other hand, the use of the MAE random sampling as the heuristic for optimizing the network (Algorithm 2) dramatically reduces the time needed to optimize the RNN configuration. On average, the time has been cut in half (nearly one hour difference). Again, notice that Table IX presents the time in minutes.
VII. Conclusions and Future Work
In this work, we studied the extension of the MAE random sampling technique to multiple-hidden-layer architectures and presented an ES-based algorithm that optimizes an RNN using the MAE random sampling as a search heuristic.
We studied the correlation between the MAE random sampling results (i.e., the probability of finding a set of weights whose error is below a user-defined threshold) and the error after training a network using the Adam optimizer, on stacked RNN architectures with up to three hidden layers, using a sine wave.
The results show that there is a strong negative correlation, i.e., a high estimated probability is strongly related to a low error value after training the network. Moreover, as we add hidden layers to the RNN, this negative correlation increases. We think that this might be partly explained by the growing complexity of the training process; however, further analysis is required to explain this observation.
Also, we tested our ES-based RNN optimization algorithm using as heuristics the MAE random sampling and the results of a short training process. The results show that using the MAE random sampling to guide the search outperforms its competitor on the sine wave problem.
Moreover, we compared our proposal against state-of-the-art RNN architecture optimization in a real-world problem, waste generation prediction. The results show that our approach is as good as its competitors in terms of the error, i.e., the prediction error of our solutions is similar to that of the architectures optimized using state-of-the-art methods. However, our approach reduces by half the time needed to optimize the network. Therefore, we conclude that our proposal offers a competitive alternative for RNN optimization.
Overall, the results suggest that the MAE random sampling is a "low-cost, training-free, rule-of-thumb" method to compare deep RNN architectures and that it is a very useful heuristic for architecture optimization.
As future work, we propose to extend the MAE random sampling technique to other error functions, so this technique can be used to tackle other types of problems (classification, clustering, among others).
References
 [1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
 [2] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
 [3] S. Min, B. Lee, and S. Yoon, “Deep learning in bioinformatics,” Briefings in bioinformatics, vol. 18, no. 5, pp. 851–869, 2017.
 [4] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.
 [5] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proceedings of the 30th International Conference on Machine Learning - Volume 28, ser. ICML’13. JMLR.org, 2013, pp. III–1310–III–1318.
 [6] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyperparameter optimization,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2011, pp. 2546–2554.
 [7] A. Camero, J. Toutouh, D. H. Stolfi, and E. Alba, “Evolutionary deep learning for car park occupancy prediction in smart cities,” in Intl. Conf. on Learning and Intelligent Optimization. Springer, 2018, pp. 386–401.
 [8] R. Jozefowicz, W. Zaremba, and I. Sutskever, “An empirical exploration of recurrent network architectures,” in Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ser. ICML’15. JMLR.org, 2015, pp. 2342–2350.
 [9] V. K. Ojha, A. Abraham, and V. Snášel, “Metaheuristic design of feedforward neural networks: A review of two decades of research,” Engineering Applications of Artificial Intelligence, vol. 60, pp. 97–116, 2017.
 [10] A. Camero, J. Toutouh, and E. Alba, “Low-cost recurrent neural network expected performance evaluation,” Preprint arXiv:1805.07159, May 2018.
 [11] T. Domhan, J. T. Springenberg, and F. Hutter, “Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves,” in Proceedings of the 24th International Conference on Artificial Intelligence, ser. IJCAI’15. AAAI Press, 2015, pp. 3460–3468.
 [12] A. Camero, J. Toutouh, and E. Alba, “Comparing deep recurrent networks based on the MAE random sampling, a first approach,” in Conference of the Spanish Association for Artificial Intelligence. Springer, 2018, pp. 24–33.
 [13] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015.
 [14] J. F. Kolen and S. C. Kremer, Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies. Wiley-IEEE Press, 2001, pp. 464–479.
 [15] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [16] E. Z. Ramos, M. Nakakuni, and E. Yfantis, “Quantitative measures to evaluate neural network weight initialization strategies,” in 2017 IEEE 7th Annual Computing and Communication Workshop and Conference (CCWC), 2017, pp. 1–7.
 [17] S. Haykin, Neural networks and learning machines. Pearson, Upper Saddle River, NJ, USA, 2009, vol. 3.
 [18] S. Albelwi and A. Mahmood, “A framework for designing the architectures of deep convolutional neural networks,” Entropy, vol. 19, no. 6, p. 242, 2017.
 [19] S. C. Smithson, G. Yang, W. J. Gross, and B. H. Meyer, “Neural networks designing neural networks: multi-objective hyperparameter optimization,” in Computer-Aided Design (ICCAD), 2016 IEEE/ACM International Conference on. IEEE, 2016, pp. 1–8.
 [20] T. Back, Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford university press, 1996.
 [21] J. Toutouh and E. Alba, “Parallel multiobjective metaheuristics for smart communications in vehicular networks,” Soft Computing, vol. 21, no. 8, pp. 1949–1961, 2017. [Online]. Available: http://dx.doi.org/10.1007/s00500-015-1891-2
 [22] A. Camero, J. ArellanoVerdejo, and E. Alba, “Road map partitioning for routing by using a micro steady state evolutionary algorithm,” Engineering Applications of Artificial Intelligence, vol. 71, pp. 155–165, 2018.
 [23] E. Fabbiani, S. Nesmachnow, J. Toutouh, A. Tchernykh, A. Avetisyan, and G. Radchenko, “Analysis of mobility patterns for public transportation and bus stops relocation,” Programming and Computer Software, vol. 45, no. 1, pp. 34–51, 2019.
 [24] R. Massobrio, J. Toutouh, S. Nesmachnow, and E. Alba, “Infrastructure deployment in vehicular communication networks using a parallel multiobjective evolutionary algorithm,” International Journal of Intelligent Systems, vol. 32, no. 8, pp. 801–829, 2017. [Online]. Available: http://dx.doi.org/10.1002/int.21890
 [25] E. Alba, J. Aldana, and J. M. Troya, “Full automatic ANN design: A genetic approach,” in International Workshop on Artificial Neural Networks. Springer, 1993, pp. 399–404.
 [26] E. Alba and R. Martí, Metaheuristic procedures for training neural networks. Springer Science & Business Media, 2006, vol. 35.
 [27] Y. Zhining and P. Yunming, “The genetic convolutional neural network model based on random sample,” International Journal of u- and e-Service, Science and Technology, vol. 8, no. 11, pp. 317–326, 2015.

 [28] G. Rosa, J. Papa, A. Marana, W. Scheirer, and D. Cox, “Fine-tuning convolutional neural networks using harmony search,” in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, A. Pardo and J. Kittler, Eds. Cham: Springer International Publishing, 2015, pp. 683–690.
 [29] L. R. Rere, M. I. Fanany, and A. M. Arymurthy, “Simulated annealing algorithm for deep learning,” Procedia Computer Science, vol. 72, pp. 137–144, 2015, the Third Information Systems International Conference 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1877050915035759
 [30] L. Rere, M. I. Fanany, and A. M. Arymurthy, “Metaheuristic algorithms for convolution neural network,” Computational intelligence and neuroscience, vol. 2016, 2016.
 [31] K. O. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting topologies,” Evolutionary computation, vol. 10, no. 2, pp. 99–127, 2002.
 [32] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin, “Exploring strategies for training deep neural networks,” Journal of machine learning research, vol. 10, no. Jan, pp. 1–40, 2009.
 [33] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy et al., “Evolving deep neural networks,” in Artificial Intelligence in the Age of Neural Networks and Brain Computing. Elsevier, 2019, pp. 293–312.
 [34] A. Rawal and R. Miikkulainen, “Evolving deep LSTM-based memory networks using an information maximization objective,” in Proceedings of the Genetic and Evolutionary Computation Conference 2016. ACM, 2016, pp. 501–508.
 [35] J. Liang, E. Meyerson, and R. Miikkulainen, “Evolutionary architecture search for deep multitask networks,” in Proceedings of the Genetic and Evolutionary Computation Conference, ser. GECCO ’18. New York, NY, USA: ACM, 2018, pp. 466–473. [Online]. Available: http://doi.acm.org/10.1145/3205455.3205489
 [36] A. M. Ibrahim and N. H. El-Amary, “Particle swarm optimization trained recurrent neural network for voltage instability prediction,” Journal of Electrical Systems and Information Technology, vol. 5, no. 2, pp. 216–228, 2018.
 [37] A. ElSaid, F. E. Jamiy, J. Higgins, B. Wild, and T. Desell, “Using ant colony optimization to optimize long shortterm memory recurrent neural networks,” in Proceedings of the Genetic and Evolutionary Computation Conference, ser. GECCO ’18. New York, NY, USA: ACM, 2018, pp. 13–20. [Online]. Available: http://doi.acm.org/10.1145/3205455.3205637
 [38] C. Doerr, “Non-static parameter choices in evolutionary computation,” in Genetic and Evolutionary Computation Conference, GECCO 2017, Berlin, Germany, July 15–19, 2017, Companion Material Proceedings. ACM, 2017.
 [39] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [40] A. Camero, J. Toutouh, and E. Alba, “Dlopt: deep learning optimization library,” arXiv preprint arXiv:1807.03523, 2018.
 [41] F. Chollet et al., “Keras,” https://keras.io, 2015.
 [42] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for largescale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.

 [43] R. N. Bracewell, The Fourier transform and its applications. McGraw-Hill, New York, 1986, vol. 31999.
 [44] J. Ferrer and E. Alba, “BIN-CT: sistema inteligente para la gestión de la recogida de residuos urbanos,” in International Greencities Congress, 2018, pp. 117–128.
 [45] ——, “BIN-CT: Urban waste collection based in predicting the container fill level,” arXiv preprint arXiv:1807.01603, 2018.
 [46] A. Camero, J. Toutouh, J. Ferrer, and E. Alba, “Waste generation prediction in smart cities through deep neuroevolution,” in IberoAmerican Congress on Information Management and Big Data. Springer, 2018, pp. 192–204.