1 Introduction
Optimization plays an important role in deep learning and is key to achieving good accuracy from a neural network. The problem can be roughly divided into architecture optimization and hyperparameter optimization.
Architecture optimization refers to finding the number of layers, and the number of nodes in each layer, that the network needs to capture features from the given data. There is no fixed solution or magic number of layers (or nodes) that works for every use case. To find good values for these two parameters, one has to try different values and check which works best for the given problem; it is an iterative trial-and-error process that can take substantial time. Many algorithms have been proposed to solve this problem, but this work does not address it.
Hyperparameter optimization refers to choosing values of hyperparameters such as the learning rate, optimization algorithm, dropout rate, and batch size. Finding these values is again an iterative trial-and-error procedure; no single set of values works for every scenario. The proposed architecture, Deep Genetic Network, aims to solve this second problem (hyperparameter optimization).
Neural networks have been found to work well in a variety of tasks and have been shown to outperform humans in some of them, as they can process huge amounts of information efficiently, more precisely, and in much less time. They can capture minute details of a problem that humans sometimes fail to notice. So why not let the neural network itself choose the best hyperparameters for its training process, rather than finding values by trial and error? With this in mind we propose Deep Genetic Network, a neural network architecture that chooses the best hyperparameters from a given population of hyperparameters, eliminating the need to set hyperparameter values explicitly.
Deep Genetic Network combines genetic algorithms with deep neural networks to find the fittest weights and biases. Training starts with more than one neural network, and the best values from pairs of networks are selected after a set number of iterations (epochs). Training several neural networks in parallel takes more time per iteration than training a single network, but conventional hyperparameter optimization, which retrains the network repeatedly to find the best-fitting hyperparameters, still takes more time overall.
Deep Genetic Network uses mating between pairs of neural networks to find the best possible parameters at each step, letting the more dominant parent pass more features to the child network than the less dominant parent. This mating process is derived from genetic algorithms. It selects the best available parameter values at each generation and prevents the less effective parameters from expressing themselves in subsequent generations. In this way we let the fittest parameters pass on through the generations, following the “Survival of the Fittest” principle associated with Charles Darwin's theory of natural selection.
Our experiments indicate that Deep Genetic Networks optimize hyperparameters better and help build models faster, without spending time retraining models with different combinations of hyperparameters.
We experimented with two mating schemes, extreme mating and adjacent mating, which apply when the initial population has more than two parents. The experiments motivate the choice between the two schemes.
2 Related Work
Many algorithms have been devised for the hyperparameter optimization problem. The following are some popular techniques:
2.1 Grid Search
Grid search [1] is an approach to hyperparameter tuning that methodically builds and evaluates a model for each combination of values specified in a grid. A set of candidate values is defined for each hyperparameter to be optimized, and the search algorithm then evaluates every combination of values from these sets.
The method keeps track of the best combination found so far: after each combination is evaluated, it replaces the stored best if it gives a better result.
Because the neural network is trained once for every possible combination, this approach is time-consuming.
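The procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in [1]: the `grid_search` function, the `evaluate` callback, and the toy scoring function are all hypothetical names introduced here, and the toy score merely stands in for training and validating a real network.

```python
import itertools


def grid_search(grid, evaluate):
    """Evaluate every combination in `grid` and return the best one.

    `grid` maps a hyperparameter name to a list of candidate values.
    `evaluate` is a hypothetical callback that would train a model with the
    given settings and return a validation score (higher is better).
    """
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:  # overwrite the stored best when improved
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Stand-in for "train and validate"; peaks at lr=0.0075, dropout=0.8.
    return -abs(params["learning_rate"] - 0.0075) - abs(params["dropout"] - 0.8)


grid = {"learning_rate": [0.003, 0.005, 0.0075], "dropout": [0.3, 0.5, 0.8]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

With three candidate values for each of two hyperparameters the loop runs nine times; the count multiplies with every hyperparameter added, which is the cost referred to above.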
2.2 Random Search
Random Search [2] is a variation on grid search: instead of evaluating every combination in the pool of hyperparameter values, it samples values from a random distribution. This reduces the time required to find good hyperparameter values and is typically more efficient than grid search.
A sampling distribution is defined for each hyperparameter, together with the number of iterations, i.e. the number of random samples to draw and evaluate. One of the primary motivations for using this technique instead of grid search is that in most cases hyperparameters are not all equally important, so random sampling can find good hyperparameters without searching the entire set of combinations.
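As a rough sketch of this idea, the following Python snippet draws a fixed number of random samples and keeps the best one. The function names, the sampling ranges (log-uniform learning rate, uniform dropout rate), and the toy scoring function are assumptions made for illustration; they are not taken from [2].

```python
import random


def random_search(sample, evaluate, n_iter, seed=0):
    """Draw `n_iter` random hyperparameter settings and keep the best one.

    `sample` and `evaluate` are hypothetical callbacks: the first draws one
    setting from the sampling distribution, the second returns its score.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample(rng)
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def sample_params(rng):
    # Assumed distributions: log-uniform learning rate, uniform dropout rate.
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "dropout": rng.uniform(0.1, 0.9),
    }


def toy_score(params):
    # Stand-in for "train and validate"; peaks at lr=0.0075, dropout=0.8.
    return -abs(params["learning_rate"] - 0.0075) - abs(params["dropout"] - 0.8)


best_params, best_score = random_search(sample_params, toy_score, n_iter=50)
print(best_params, best_score)
```

The budget `n_iter` is chosen independently of the number of hyperparameters, whereas the cost of a grid grows multiplicatively with each hyperparameter added.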
2.3 Meta Learning
Meta-learning [3] systems are trained by being exposed to a large number of tasks and are then tested on their ability to learn new tasks; an example of a task might be classifying a new image within 5 possible classes, or learning to efficiently navigate a new maze with only one traversal through the maze. This differs from many standard machine learning techniques, which involve training on a single task and testing on held-out examples from that task.
During meta-learning, the model is trained to learn tasks in the meta-training set. There are two optimizations at play: the learner, which learns new tasks, and the meta-learner, which trains the learner. Methods for meta-learning have typically fallen into one of three categories: recurrent models, metric learning, and learning optimizers.
3 Genetic Algorithms
A genetic algorithm [4] is an optimization technique based on Charles Darwin's theory of natural selection, which is the basis for evolution in living beings.
The algorithm begins with a population of individuals (in our case neural networks), each of which is unique in some way. The members of the population then produce children, which are improved versions of the previous generation. The two key concepts in genetic algorithms are mating and mutation.
3.1 Mating
Mating is the process in which two parents produce one or more children that have characteristics of both parents. The parents pass on genes, which express their characteristics, and a child receives a combination of the genes of both parents. One of the two parents may have more dominant genes than the other, resulting in a child that has more characteristics of that parent.
In our use case, the genes are the weights and biases of a neural network, which are passed on to the child network as a combination of the two parent networks. The child receives more genes from the dominant parent, since those genes are more fit to survive in the succeeding generations, and is therefore an improved version of the previous generation.
3.2 Mutation
Mutations are changes in the children that were not present in previous generations. Mating produces children that generally have the best possible combination of the characteristics of both parents, but what makes a child really different from its parents is mutation. Mutations may be slight or major; this process is what led to the variety of species that evolved from single-celled organisms to complex organisms like humans. Mutation thus plays an important role in evolution: these changes make the children stronger and less susceptible to the dangers that previous generations faced.
In our case the mutations happen during the iterations (epochs) that follow the production of the child networks by mating. During this phase of training, which lasts until the children produce children of their own, the parameters of the networks are updated, so that they produce better results than the previous generation and become better options for the problem at hand. These mutated children can give much better accuracy and lower loss than their parents.
Together these two processes play an important role in genetic algorithms, and they help Deep Genetic Network provide better-optimized neural networks.
4 Understanding Deep Genetic Net
Deep Genetic Network begins training with a population of neural networks that are initialized with different sets of hyperparameters. As training proceeds, the networks keep updating their weights using backpropagation. After a certain number of training iterations (epochs), pairs of networks mate to produce children that carry forward the best possible parameters from the parent networks. These children mutate for some iterations and then go on to produce their own children. Both techniques (mating and mutation) are derived from genetic algorithms.
4.1 Mating
Mating is the process of combining two parent neural networks to produce child neural networks. During this process the parents pass on genes that carry their characteristics to the children. In our case the genes are the weights and biases of the parent networks.
The mating process takes place after every 'n' iterations. After the first generation of parents has been trained for 'n' iterations, the dominance of each parent is calculated from its ability to reduce the loss. Mating two neural networks produces two children, in 80:20 and 60:40 ratios. The 80:20 child contains 80% of its genes (parameters) from the more dominant (MD) parent and 20% from the less dominant (LD) parent; similarly, the 60:40 child takes 60% of its genes from the MD parent and 40% from the LD parent.
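The text does not specify how the 80% (or 60%) of genes taken from the MD parent are selected. The sketch below assumes one plausible reading: each individual weight or bias is copied from the MD parent with probability equal to the MD fraction, and from the LD parent otherwise. The `mate` and `make_children` helpers and the dictionary-of-arrays parameter layout are hypothetical, introduced only for illustration; a weighted average of the two parents would be another possible interpretation.

```python
import numpy as np


def mate(md_params, ld_params, md_fraction, rng):
    """Build a child that takes roughly `md_fraction` of its entries from the
    more dominant (MD) parent and the rest from the less dominant (LD) one.

    Assumption: selection is an element-wise random mask per parameter array.
    """
    child = {}
    for name, md_value in md_params.items():
        mask = rng.random(md_value.shape) < md_fraction
        child[name] = np.where(mask, md_value, ld_params[name])
    return child


def make_children(parent_a, loss_a, parent_b, loss_b, rng):
    """Return the 80:20 and 60:40 children; lower loss means more dominant."""
    if loss_a <= loss_b:
        md, ld = parent_a, parent_b
    else:
        md, ld = parent_b, parent_a
    return mate(md, ld, 0.8, rng), mate(md, ld, 0.6, rng)


# Toy parents: all-ones versus all-zeros, so that the mean of a child array
# shows the fraction of genes inherited from parent A.
rng = np.random.default_rng(0)
parent_a = {"W": np.ones((100, 100)), "b": np.ones(100)}
parent_b = {"W": np.zeros((100, 100)), "b": np.zeros(100)}
child_80, child_60 = make_children(parent_a, 0.2, parent_b, 0.5, rng)
print(child_80["W"].mean(), child_60["W"].mean())
```

Because the mask is random, the inherited fractions are only approximately 80% and 60%; an implementation that needs exact proportions could instead shuffle a fixed-size index set.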
According to the number of parents in the first generation, a Deep Genetic Network can be classified as one of the following:

Two Parent Network: In this architecture the first generation contains two parents, and each subsequent mating, after every 'n' iterations, produces two children. For the last 'n' iterations the mating procedure produces a single neural network, which is responsible for producing the final output.

Two-N Parent Network: The first generation contains '2N' neural networks, where N > 1. In subsequent generations (after every 'n' iterations) each mating pair produces two children, which in turn go on to produce two children of their own. In the final generations the number of children at each step is halved until a single neural network remains in the last generation, which produces the final output.
There are two mating schemes (or procedures) that we propose for Two-N Parent Networks:

Extreme Mating: In this scheme the first-generation neural networks are sorted in decreasing order of dominance. Mating then occurs between extreme pairs: the first parent mates with the last parent in the ordering, the second parent with the second-to-last, and so on.

Adjacent Mating: This scheme also orders the parent networks by dominance, but mating takes place between pairs that are not polar opposites of one another. With a pool of 10 parent networks in the first generation, for example, the first mates with the sixth, the second with the seventh, and so on. This scheme was found to be more effective than extreme mating in our experiments.
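The two pairing rules can be written down directly. The following Python sketch assumes an even number of parents and uses hypothetical helper names (`rank_by_dominance`, `extreme_pairs`, `adjacent_pairs`); the adjacent rule generalizes the 10-parent example by pairing the i-th network with the (i + half)-th, which is our reading of the description.

```python
def rank_by_dominance(losses):
    """Return network ids ordered from most to least dominant (lowest loss first)."""
    return sorted(losses, key=losses.get)


def extreme_pairs(ranked):
    """Pair the first with the last, the second with the second-to-last, and so on."""
    n = len(ranked)
    return [(ranked[i], ranked[n - 1 - i]) for i in range(n // 2)]


def adjacent_pairs(ranked):
    """Pair the i-th network with the (i + half)-th, e.g. 1st with 6th of 10."""
    half = len(ranked) // 2
    return [(ranked[i], ranked[i + half]) for i in range(half)]


# Hypothetical losses for four parents after their first 'n' iterations.
losses = {"P1": 0.40, "P2": 0.35, "P3": 0.50, "P4": 0.45}
ranked = rank_by_dominance(losses)
print(ranked)                  # ['P2', 'P1', 'P4', 'P3']
print(extreme_pairs(ranked))   # [('P2', 'P3'), ('P1', 'P4')]
print(adjacent_pairs(ranked))  # [('P2', 'P4'), ('P1', 'P3')]
```

Under extreme mating the strongest network is always paired with the weakest, whereas adjacent mating keeps the dominance gap within each pair roughly constant.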
Based on the immediate parents, we categorize mating in a Two-N parent network as cousin-cousin or sibling-sibling.
4.1.1 Cousin-Cousin Mating
Cousin-cousin mating refers to the mating of two child neural networks that have different immediate parents but may have a parent in common in past generations. This scheme selects parameters not only from the immediate parents but also from other parents in the pool of networks, giving the child a better chance to explore new parameters rather than continuing with only its immediate parents' parameters.
4.1.2 Sibling-Sibling Mating
Sibling-sibling mating happens between children that have the same immediate parents. It selects the best parameters from both parents and continues with only those parameters that are best fit to survive among the population of parameter values.
4.2 Mutation
Mutation is the process by which a child becomes stronger than its parents through slight modifications to the genes it receives from the two parents that took part in mating. These changes help the children overcome shortcomings that their parents might have had.
From a biological perspective, the idea is well illustrated by the peppered moth during the industrial revolution. Before the industrial revolution, peppered moths (black and white in color) could camouflage themselves against the bark of trees and were safe from predators. Some members of the species had mutated to be completely black, which made them more visible to predators, so these mutated moths were less fit. After the industrial revolution, trees near factories were covered with a layer of smoke and their bark turned black. The dark bark now camouflaged only the mutated moths, which continued to survive, while the black-and-white moths slowly perished.
In our case, once parents mate to produce children, the children are trained for 'n' iterations starting from the weights and biases passed on by the parents. The changes that training makes to these parameters to better address the problem are the mutation of the genes passed on by the parents. Mutation thus helps the children learn facts that were not known to the previous generation.
Deep Genetic Network thus finds the best possible parameters from a given pool of neural networks with some initial hyperparameters. This eases the model-building process by assigning the task of optimizing hyperparameters to the neural network itself during training, which eliminates the need to set these values explicitly. It saves considerable time, since the model does not have to be retrained repeatedly to find the best hyperparameter values.
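To make the interplay of mating and mutation concrete, the following is a toy sketch of a two parent network on a small linear regression problem, not the experimental setup used in this paper. Several details are assumptions made for illustration and are marked as such in the comments: the element-wise random gene mask, the rule that the 80:20 child trains with the MD parent's learning rate and the 60:40 child with the LD parent's, and all function names.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5])  # toy regression targets


def loss(w):
    return float(np.mean((X @ w - y) ** 2))


def train(w, lr, epochs):
    """'Mutation' phase: plain gradient descent on the mean squared error."""
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w


def mate(md_w, ld_w, md_fraction):
    # Assumption: each gene comes from the MD parent with prob. md_fraction.
    mask = rng.random(md_w.shape) < md_fraction
    return np.where(mask, md_w, ld_w)


def two_parent_dgn(lrs, rounds, n):
    """Train two networks, mating after every `n` epochs (requires rounds >= 1)."""
    population = [(np.zeros(3), lr) for lr in lrs]  # (weights, learning rate)
    for r in range(rounds):
        population = [(train(w, lr, n), lr) for w, lr in population]
        population.sort(key=lambda member: loss(member[0]))  # most dominant first
        (md_w, md_lr), (ld_w, ld_lr) = population
        if r < rounds - 1:
            # Assumption: the 80:20 child keeps the MD parent's learning rate
            # and the 60:40 child keeps the LD parent's learning rate.
            population = [(mate(md_w, ld_w, 0.8), md_lr),
                          (mate(md_w, ld_w, 0.6), ld_lr)]
    # Final mating yields a single 80:20 child, trained for the last n epochs.
    return train(mate(md_w, ld_w, 0.8), md_lr, n)


final_w = two_parent_dgn(lrs=(0.05, 0.02), rounds=5, n=20)
initial_loss, final_loss = loss(np.zeros(3)), loss(final_w)
print(initial_loss, final_loss)
```

In a real network the `train` step would be backpropagation over the full set of weights and biases, and the loss used to rank dominance would be computed on the training or validation data.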
5 Experimentation
The following experiments were conducted to verify the proposed neural network architecture:
5.1 Image classifier model
Convolutional neural networks perform well in image classification tasks. In this experiment we used a convolutional neural network to determine whether or not an image shows a cat. The classifier is a five-layer network with two convolutional layers and three dense layers (the last being the softmax output layer).
There were two hyperparameters involved: the learning rate used in the optimizer and the dropout rate in the dense layers. The best values found by trial-and-error tuning were a learning rate of 0.0075 and a dropout rate of 0.8 in the first dense layer. The model was trained for 2500 epochs and reached a final test accuracy of 0.84 (84%).
Finding these values required retraining the neural network repeatedly until we found values that reduced the loss and increased the accuracy efficiently. This is where Deep Genetic Network came to the rescue.
5.1.1 Model 1: Two Parent Deep Genetic Network
In this model we used two parents in the first generation. The hyperparameters used for the parents were:

Parent 1: Learning Rate = 0.0075, Dropout Rate = 0.8

Parent 2: Learning Rate = 0.005, Dropout Rate = 0.5

The above model produced two children (an 80:20 child and a 60:40 child) after every 100 epochs. The final result was a test set accuracy of 0.86 (86%).
5.1.2 Model 2: Four Parent Deep Genetic Network with Extreme Mating Scheme
The training began with four parents having the following hyperparameter values:

Parent 1: Learning Rate = 0.0075, Dropout Rate = 0.8

Parent 2: Learning Rate = 0.006, Dropout Rate = 0.6

Parent 3: Learning Rate = 0.0045, Dropout Rate = 0.5

Parent 4: Learning Rate = 0.003, Dropout Rate = 0.3

The parent networks were trained for 100 epochs before they were allowed to mate. Before mating they were sorted by dominance (lowest loss = most dominant), and mating then proceeded from the extreme ends, with parent 1 mating with parent 4 and parent 2 with parent 3, to produce four children. The children were combined into a single child during the last 200 epochs: the 4 children were reduced to 2, and the 2 were reduced to 1 in the subsequent 100 epochs. This model achieved a test set accuracy of 0.87 (87%).
5.1.3 Model 3: Four Parent Deep Genetic Network with Adjacent Mating Scheme
This was the final model trained on the image classifier dataset. It began with four parents with the following hyperparameter configurations:

Parent 1: Learning Rate = 0.0075, Dropout Rate = 0.8

Parent 2: Learning Rate = 0.006, Dropout Rate = 0.6

Parent 3: Learning Rate = 0.0045, Dropout Rate = 0.5

Parent 4: Learning Rate = 0.003, Dropout Rate = 0.3

This model was trained with the adjacent mating scheme and achieved a test set accuracy of 0.89 (89%).
5.2 Regression Model
Artificial neural networks, or multilayer perceptrons, work well with text data for both classification and regression tasks. The regression model we used consisted of three layers (two hidden layers and one output layer). The model was trained using the RMSprop optimizer.
There were three hyperparameters involved: the learning rate, the dropout rate for the two hidden layers, and the momentum for RMSprop. With trial-and-error hyperparameter optimization, the model reached a test set accuracy of 0.95 (95%). The model that achieved this best test set accuracy was trained for 4600 epochs with a learning rate of 0.07, a dropout rate of 0.5, and a momentum of 0.99.
A two parent deep genetic network worked well for this problem. It helped increase the test set accuracy to 0.98 (or 98%).
5.2.1 Model 1: Two Parent Deep Genetic Network
In this model two parents were used in the first generation. They were allowed to mate after every 200 epochs until the 4400th epoch, and then a single 80:20 child was produced for the final result. The parents had the following hyperparameters:

Parent 1: Learning Rate = 0.07, Dropout Rate = 0.5, Momentum = 0.99

Parent 2: Learning Rate = 0.05, Dropout Rate = 0.8, Momentum = 0.9

The above experiments indicate that Deep Genetic Network performs well in optimizing hyperparameters and, in these experiments, is a better option than trial-and-error tuning.
6 Acknowledgements
We would like to thank Andrew Ng for his deep learning courses on Coursera, where we started our deep learning journey.
7 Conclusion
As the experiments show, a Deep Genetic Network can train and optimize its own hyperparameters in parallel by mating pairs of networks to generate children. The children carry mostly the fittest genes (weights and biases) from the previous generation and largely discard the less dominant (less fit) ones, which helps the network make the best possible decisions. This eases the tedious task of hyperparameter optimization and lets the network decide which values are best suited, rather than having them set explicitly.
From the experiments we also found that adjacent mating works better than extreme mating. Another important observation is that cousin-cousin mating helps the final neural network converge better.
References

[1] Taijia Xiao, Dong Ren, Shuanghui Lei, Junqiao Zhang, Xiaobo Liu. Based on grid-search and PSO parameter optimization for Support Vector Machine. Proceeding of the 11th World Congress on Intelligent Control and Automation.

[2] James Bergstra, Yoshua Bengio. Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13 (2012) 281–305.

[3] Nicolas Schweighofer and Kenji Doya. Meta-learning in reinforcement learning. Neural Networks, 16(1):5–9, 2003.

[4] K.F. Man, K.S. Tang, S. Kwong. Genetic algorithms: concepts and applications. IEEE Transactions on Industrial Electronics.