I Introduction
Most accurate neural architectures that we see today are handpicked and carefully designed by experts [1] [2] to achieve the desired performance. Typically, these experts design network architectures in such a way that each hyperparameter selection carefully supplements the problem at hand. Due to the steep learning curve of hyperparameter selection, most novices in neural network architecture design perform a grid or brute-force search over possible architectures in order to locate the best-performing one. This, for obvious reasons, can be slow and tedious.
However, due to the work of [3] and [4] in recent years, neural architecture search has seen a surge, and many new traversal and search methods have been proposed [5]. The Neural Architecture Search field has three focus points where researchers spend most of their time, characterized by Elsken et al. (2019) in their literature survey [6] as the following:

Search Space: Together, all potential neural architecture candidates constitute the search space. Depending on the problem, the search space can be quite large, hence the importance of an efficient search algorithm.

Search Algorithm: Even though the search space is usually quite large, many candidates are quite similar and can therefore be skipped, leaving fewer candidates to evaluate and yielding faster run times.

Evaluation Strategy: Usually accuracy is given the most attention as an evaluation metric, but depending on the problem at hand, memory consumption, individual training duration, and/or energy consumption may be more important evaluation metrics.
In this paper, we look at binary classification problems where the output layer has only one node, corresponding to an output classification of zero or one. We vary the number of hidden layers and the number of nodes in each layer. Such variation creates two degrees of freedom and an N × M search space, where N is the maximum number of nodes in each layer and M is the number of hidden layers. We propose three algorithms, two of which are offline traversal algorithms and one of which is an online traversal algorithm. Online traversal algorithms are those that can process input from a search space one candidate at a time, while offline traversal algorithms are those that require all (or nearly all) of the search space to be known prior to traversing it. It becomes clear, then, that online algorithms are far superior to offline algorithms because they do not require the entire search space to be evaluated prior to locating the optimum. The most basic offline traversal algorithm is the brute force method, where every single model candidate is evaluated and the optimum is located by comparing results over all candidates.
We mainly use the offline algorithms as a baseline for our online algorithm because they are typically guaranteed to locate the optimal neural network architecture size.
II Related Work
In 2021, Alparslan et al. [7] looked at using binary search to determine the architecture size that would give the best accuracy. They achieved nearly a 100-fold improvement over the naive search approach by using a modified binary search algorithm to determine the model with the highest accuracy. However, the assumptions made were too strong and worked only on certain datasets that met their criteria, such as a monotonic increase from both sides toward a global maximum across all the accuracies observed. Their work also investigated binary search only on very compact neural networks with a single hidden layer. In this paper, we relax two of their assumptions:

Monotonic Increase from Both Sides Constraint: In our paper, we no longer require that the search space have one global maximum with values increasing from both sides until the global maximum is reached. [7] assumed that the search space is sorted in ascending order from the beginning to the global maximum and in descending order from the global maximum to the end. Assuming partial sortedness of the parameter space let them apply binary search and achieve massive speed improvements. However, when the underlying dataset was highly non-convex in accuracy with respect to the architecture search space, the solution returned by the search method was a local optimum. In this paper, we no longer require such an assumption.

One Hidden Layer Constraint: In our paper, we are no longer constrained to just one hidden layer. We vary the number of hidden layers as well as the number of nodes in each layer. Such variation creates two degrees of freedom and lets us test our search in a much larger, two-dimensional search space.
III Methodology
III-A Terminology and Definitions
In order to traverse networks with different levels of completeness, we need a well-defined search space and a way of representing an architecture candidate as a node in that space. In this paper, we define a two-dimensional (2D) search space with one dimension that measures width and another that measures depth. This is achieved by defining two terms: the Initial Hidden Layer Size, IHLS (Equation 1), and the Division Factor, DF (Equation 2). While the former initializes the width of the first hidden layer, the latter determines the number and sizes of subsequent hidden layers. For example, if the initial hidden layer size IHLS is 24 and the division factor DF is 3, the architecture has [24,8,2] nodes across three layers. Defining these two terms makes it easy to represent each architecture candidate and well-defines the search space.
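To make this representation concrete, the following minimal Python sketch (ours, not the paper's reference implementation) expands an (IHLS, DF) pair into the corresponding list of hidden-layer sizes by repeated integer division; the stopping rule, ending once the quotient reaches zero, is an assumption inferred from the examples in this paper.

```python
def hidden_layers(ihls: int, df: int) -> list[int]:
    """Expand an (IHLS, DF) pair into a list of hidden-layer sizes.

    Each subsequent layer size is the integer quotient of the previous
    layer size and the division factor; we stop once that quotient
    reaches zero (a stopping rule assumed from the paper's examples).
    """
    layers = [ihls]
    while layers[-1] // df > 0:
        layers.append(layers[-1] // df)
    return layers

print(hidden_layers(24, 3))  # [24, 8, 2], the example above
print(hidden_layers(10, 2))  # [10, 5, 2, 1]
print(hidden_layers(10, 4))  # [10, 2]
```

The last two calls reproduce nodes B and C of the search-space grid described in Section III-B.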
Completeness, in short, is a measure of the depth and width of a given network architecture. If the hidden layer dimensions are greater than that of the input layer, we say the model incurs an overcomplete representation, and if they are less, an undercomplete representation [13]. In the search space, some architecture candidates are overcomplete and some are undercomplete. In order to capture the variation among them, we define a k-completeness score in Equation 4. The intuition behind it is to assign the size of each subsequent hidden layer as the quotient of the previous hidden layer size and the division factor. This score aims to distinguish overcomplete architectures from undercomplete architectures and to illustrate the trend between k-completeness and training time for accurate models.
In order to formally define a k-completeness score that ranges from zero to infinity, we provide the following generalized definitions of the jumping factor (Equation 3) and the division factor (Equation 2).
Initial Hidden Layer Size:
$h_1 = \mathrm{IHLS}$ (1)

Division Factor:
$h_{i+1} = \lfloor h_i / \mathrm{DF} \rfloor, \quad i \geq 1$ (2)

Jumping Factor:
(3)

k-Completeness Score:
(4)
The weighting coefficient in Equation 4 is chosen to be 0.5 for our experiment because we wanted to weight the division factor and the jumping factor equally.
The interplay between these two factors and the width and depth of a network is not immediately obvious. While the size of the first hidden layer is proportional to the k-completeness of the network, the division factor is inversely proportional to it. In other words, a large division factor yields a small number of subsequent hidden layers, since each hidden layer size is divided by a larger value to yield the next. For instance, an IHLS of 24 with a DF of 2 yields the five-layer stack [24,12,6,3,1], while the same IHLS with a DF of 8 yields only [24,3]. This space, along with classifications of low and high k-completeness score, is shown in Figure 1.
III-B Traversal Algorithms
In Figures 2, 3, and 4, the y-axis represents the division factor DF and takes powers of 2 as values (2, 4, 8, 16, etc.). The x-axis represents the initial hidden layer size IHLS and takes values that range from 1 to the maximum IHLS. For example, in Figure 2 there are 10 rows, so the y-axis runs from 2, 4, 8, 16, ..., to 1024 from bottom to top, and there are 10 columns, so the x-axis runs from 1 to 10 from left to right. Such a setup well-defines the search space and simplifies implementation, since each node represents an architecture candidate. For example, the node labeled B has a DF of two and an IHLS of 10, which designates the architecture [10,5,2,1]. The node labeled A has a DF of two and an IHLS of nine, which designates the architecture [9,4,2,1]. The node labeled C has a DF of four and an IHLS of ten, which designates the architecture [10,2], as defined in Equation 1 and Equation 2.
III-B1 Offline Algorithm #1: Brute Force Search
This first offline algorithm is analogous to a linear search for a maximum value in a list: each value in the search space (or list) must be evaluated before a result can be determined. For this reason, in the case of the architecture search space, the complexity of this algorithm in the best, worst, and average case is O(NM). For hyperparameter selection, this complexity is prohibitively large because each time the model architecture changes even slightly, the model has to be completely retrained and reevaluated. Regardless, this exhaustive algorithm is displayed in Figure 2.
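As a minimal sketch of this exhaustive baseline (ours, under the assumption that a candidate is identified by its grid coordinates and that a hypothetical `evaluate` callback trains the corresponding architecture and returns its test accuracy):

```python
def brute_force_search(max_ihls: int, division_factors: list[int], evaluate):
    """Evaluate every (IHLS, DF) cell of the grid and keep the best one."""
    best_cell, best_accuracy = None, float("-inf")
    for ihls in range(1, max_ihls + 1):        # x-axis: initial hidden layer size
        for df in division_factors:            # y-axis: powers of two (2, 4, 8, ...)
            accuracy = evaluate(ihls, df)      # full train-and-test run per candidate
            if accuracy > best_accuracy:
                best_cell, best_accuracy = (ihls, df), accuracy
    return best_cell, best_accuracy
```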
III-B2 Offline Algorithm #2: Diagonal Search
The diagonal traversal algorithm visits alternating primary diagonals. In this way, this offline algorithm does not evaluate all of the network architectures; instead, it assumes that the difference between the global optimum and the best optimum it encounters is negligible. Regardless, similar to the brute force search, the diagonal search has a complexity of O(NM) in the best, worst, and average case.
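A sketch of one way to realize this traversal follows; indexing cells by (row, column) with row + column constant along a primary diagonal, and skipping every other diagonal, is our reading of Figure 3 rather than code from the paper. The `evaluate` callback is the same hypothetical stand-in as above.

```python
def diagonal_search(n_rows: int, n_cols: int, evaluate):
    """Visit every other primary diagonal of the grid (cells where
    row + column is constant), evaluating roughly half the candidates."""
    best_cell, best_accuracy = None, float("-inf")
    for d in range(0, n_rows + n_cols - 1, 2):   # alternating diagonals
        for i in range(n_rows):
            j = d - i
            if 0 <= j < n_cols:                  # stay inside the grid
                accuracy = evaluate(i, j)
                if accuracy > best_accuracy:
                    best_cell, best_accuracy = (i, j), accuracy
    return best_cell, best_accuracy
```

Although only about half the cells are visited, the visit count still grows with the grid area, which is why the asymptotic complexity matches brute force.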
III-B3 Online Algorithm #1: Zigzag Search
The zigzag traversal algorithm is the final algorithm that we investigate in this paper. It is an online traversal algorithm, meaning that it finds the next candidate by processing only the current candidate. By skipping over candidates with similar architectures, this algorithm sees significant running-time improvements, even though it is harder to implement. Figure 4 explains the steps visually and Algorithm 2 formalizes the algorithm. The intuition for switching to the opposite diagonal after finding the best architecture along the current diagonal comes from gradient descent: the opposite direction intuitively represents the orthogonal direction of the gradients along the surface when approaching a local minimum. The success of this accuracy traversal algorithm (see Table II for results) can be attributed to the fact that every time we switch to the opposite diagonal, we prevent oscillation [10] [11] and vanishing gradients [12] and provide a random jump factor to avoid getting stuck in local solutions.
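Since Algorithm 2 is not reproduced here, the following is only a schematic sketch of the zigzag idea under our own assumptions (a fixed starting cell, a walk that stops at the first non-improving candidate, and termination once neither diagonal direction improves); it is not the authors' implementation.

```python
def zigzag_search(n_rows: int, n_cols: int, evaluate, start=(0, 0)):
    """Schematic zigzag: walk one diagonal while accuracy improves,
    then pivot to the opposite diagonal from the best cell found."""
    directions = [(1, 1), (1, -1)]            # primary, then secondary diagonal
    cell, best_accuracy = start, evaluate(*start)
    d, stalls = 0, 0
    while stalls < 2:                         # stop when neither direction improves
        improved = False
        i, j = cell[0] + directions[d][0], cell[1] + directions[d][1]
        while 0 <= i < n_rows and 0 <= j < n_cols:
            accuracy = evaluate(i, j)         # online: one candidate at a time
            if accuracy <= best_accuracy:
                break                         # first non-improving step ends this walk
            cell, best_accuracy, improved = (i, j), accuracy, True
            i, j = i + directions[d][0], j + directions[d][1]
        stalls = 0 if improved else stalls + 1
        d = 1 - d                             # pivot to the opposite diagonal
    return cell, best_accuracy
```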
Traversal Algorithm   Best Case   Worst Case   Average Case
Brute Force Search    O(NM)       O(NM)        O(NM)
Diagonal Search       O(NM)       O(NM)        O(NM)
Zigzag Search         O(1)        -            O(N) amortized
III-C Datasets
In this study, we use two datasets: the Titanic dataset and a customer churn dataset.
III-C1 Titanic Dataset
This dataset is made public by Kaggle [8]. The dataset has 14 columns describing each passenger on the Titanic; the columns include sex, name, fare, destination, etc. The dataset has about 1,000 rows. The label for each row is 1 or 0 to indicate whether the passenger survived, hence the binary classification. The architecture candidates that we search over for this binary classification problem all have 11 dimensions in the input layer and 1 in the output layer.
III-C2 Churn Dataset
This dataset is made public by the Drexel Society of Artificial Intelligence [9]. The dataset contains information about customers holding an imaginary contract, with labels indicating whether each customer has left the contract. The model that we build for this dataset predicts whether a customer is about to leave the contract, hence the binary classification. There are 14 columns and 10,000 rows in the dataset. The columns include features indicating customer revenue, contract cost, the type of product the contract covers, the customer's region, whether the customer renewed the contract in the previous 90 days, etc. The dataset has features represented as categories as well as floating-point numbers; therefore, we scale the numeric features and apply a categorical transformation when training a model on this dataset. All the architectures that we consider in the search space have 11 dimensions in the input layer and 1 in the output layer.
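As a rough illustration of this preprocessing step (the column names and file path below are hypothetical, since the dataset schema is not reproduced here), one might combine scaling and one-hot encoding with scikit-learn:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names standing in for the churn dataset's schema.
numeric_cols = ["customer_revenue", "contract_cost"]
categorical_cols = ["product_type", "region", "renewed_last_90_days"]

preprocessor = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),       # scale floating-point features
    ("encode", OneHotEncoder(), categorical_cols),   # one-hot encode categories
])

df = pd.read_csv("ChurnModel.csv")                   # illustrative path
X = preprocessor.fit_transform(df[numeric_cols + categorical_cols])
```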
IV Results and Observations
All three traversal algorithms were run against the two models. The zigzag traversal algorithm was the fastest at finding a model architecture candidate among all the traversal algorithms; the online nature of the algorithm and the low amortized cost of finding a new candidate yield significant improvements in running time. However, zigzag traversal might get stuck in a local optimum and miss the global optimum due to its online nature. We see that the training accuracy and the testing accuracy of the candidate found by zigzag traversal are on average 1.49% and 1.52% lower, respectively, for the Titanic model compared to brute force and diagonal traversal. For the Titanic model, the brute force traversal found an architecture candidate about 7.4% more sparse than the zigzag traversal algorithm did, which was surprising since brute force does not take k-completeness into consideration, while the zigzag traversal algorithm favors sparse architectures. The resulting architecture of the zigzag traversal was about 5% more sparse than that of the diagonal traversal.
Overall, for the Titanic model, the zigzag traversal algorithm found an architecture with a training accuracy of 80.89% and a testing accuracy of 77.48%, which are only 1.49% lower than the average training accuracy and 1.52% lower than the average testing accuracy of the other traversal algorithms, while the completion time for the zigzag traversal is about 2 times faster than brute force and 1.18 times faster than diagonal search. Due to the relatively small size of the Titanic dataset (1,000 rows) compared to the churn dataset (10,000 rows), the running time improvements of zigzag traversal are not fully realized here.
We see slightly better performance in the training and testing accuracy for zigzag traversal when applied to the churn model. The training accuracy of the candidate found by the zigzag traversal is 3.95% lower than that of the brute force traversal, but 3.34% better than that of the diagonal traversal. The testing accuracy of the candidate found by the zigzag traversal is very similar to that of the brute force and diagonal traversals, being only 0.84% and 0.69% lower, respectively. The candidate found by the zigzag traversal for the churn model is the most sparse architecture compared to those of the brute force and diagonal traversals. In fact, the zigzag traversal found a candidate with a k-completeness score more than twice (2.18x) that of the brute force. The high k-completeness score of the zigzag result stems from the nature of the algorithm, where each traversal along the primary diagonal is followed by a traversal along the secondary diagonal, skipping over similar, dense architectures and favoring more sparse ones.
Overall, for the churn model, the zigzag traversal algorithm found an architecture with a training accuracy of 79.98% and a testing accuracy of 81.00%, which are only 0.305% lower than the average training accuracy and 0.765% lower than the average testing accuracy of the other traversal algorithms, while the completion time for the zigzag traversal is about 5.7 times faster than brute force and about 3 times faster than diagonal search. One more thing to note is that the architecture found for the churn model is not only more sparse than the results of the other traversal algorithms, but also about 30% more sparse than the second most sparse architecture that the zigzag traversal encountered during its search. In this section, we only report the plots for the zigzag accuracy traversal algorithm; the reader can see the Appendix for the plots of the other traversal algorithms.
Model     Metric             Brute Force       Diagonal         Zigzag
Titanic   Completion Time    139.098           78.695           66.6
          Train Acc.         82.43%            82.33%           80.89%
          Test Acc.          78.62%            79.38%           77.48%
          k-Completeness     3.1591            2.7955           2.9403
          Best Architecture  [64,32,16,8,4,2]  [56,28,14,7,3]   [64,4]
Churn     Completion Time    3854.482          1989.97          672.0
          Train Acc.         83.93%            76.64%           79.98%
          Test Acc.          81.84%            81.69%           81.00%
          k-Completeness     1.2273            1.8807           2.6818
          Best Architecture  [16]              [40, 5]          [48]
V Conclusion
In this paper, we proposed an online traversal algorithm that finds the best architecture candidate in a search space in O(1) time in the best case and O(N) amortized time in the average case for any compact binary classification problem, using k-completeness as a heuristic in our search. We compared our new algorithm against brute force and diagonal search as baselines and report search running time improvements of 52.1% over brute force and 15.4% over diagonal search in finding the most accurate neural network architecture on the same dataset. Our online traversal algorithm finds accurate architectures that are on par with, if not better than, those of the other algorithms discussed in this paper. We hope that our findings will give insights to researchers in the Neural Architecture Search field when performing exhaustive grid search to find the most accurate architectures in the shortest amount of time possible.
VI Future Work
We hope to develop more online algorithms, since a different online algorithm might prove to be more efficient for certain search spaces or architecture sizes compared to the three algorithms explored in this paper. Additionally, the models that we study in this paper are compact models with no convolutional layers. Moving from binary classification problems to image recognition and observing the effect of our online algorithm would be an interesting future study.
VII Acknowledgment
We would like to acknowledge Drexel Society of Artificial Intelligence for its contributions and support for this research.
References
[1] K. Simonyan and A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv e-prints, 2014.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[3] B. Zoph and Q. V. Le, "Neural Architecture Search with Reinforcement Learning," arXiv e-prints, 2016.
[4] B. Baker, O. Gupta, N. Naik, and R. Raskar, "Designing Neural Network Architectures using Reinforcement Learning," arXiv e-prints, 2016.
[5] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal Policy Optimization Algorithms," arXiv e-prints, 2017.
[6] T. Elsken, J. Hendrik Metzen, and F. Hutter, "Neural Architecture Search: A Survey," arXiv e-prints, 2018.
[7] Y. Alparslan, E. J. Moyer, I. M. Isozaki, D. Schwartz, A. Dunlop, S. Dave, and E. Kim, "Towards Searching Efficient and Accurate Neural Network Architectures in Binary Classification Problems," arXiv e-prints, 2021.
[8] Titanic dataset, Kaggle, 2018, https://www.kaggle.com/hesh97/titanicdatasettraincsv
[9] Customer Churn Dataset, made public by the Drexel Society of Artificial Intelligence, 2020, https://github.com/drexelai/kcompletenessinbinaryneuralnets/blob/main/datasets/ChurnModel.csv
[10] S. Townley et al., "Existence and learning of oscillations in recurrent neural networks," IEEE Transactions on Neural Networks, vol. 11, no. 1, pp. 205-214, Jan. 2000, doi: 10.1109/72.822523.
[11] K. Ochiai, N. Toda, and S. Usui, "Kick-out learning algorithm to reduce the oscillation of weights," Neural Networks, vol. 7, no. 5, 1994, pp. 797-807, doi: 10.1016/0893-6080(94)90101-5.
[12] Z. Dai and R. Heckel, "Channel Normalization in Convolutional Neural Network avoids Vanishing Gradients," arXiv e-prints, 2019.
[13] M. S. Lewicki and T. J. Sejnowski, "Learning overcomplete representations," Neural Computation, 2000.