Evaluating Online and Offline Accuracy Traversal Algorithms for k-Complete Neural Network Architectures

01/16/2021 ∙ by Yigit Alparslan, et al. ∙ Drexel University

Architecture sizes for neural networks have been studied widely, and several search methods have been proposed to find the best architecture size in the shortest amount of time possible. In this paper, we study compact neural network architectures for binary classification and investigate improvements in speed and accuracy when favoring overcomplete architecture candidates that have a very high-dimensional representation of the input. We hypothesize that an overcomplete model architecture that creates a relatively high-dimensional representation of the input will not only be more accurate but will also be easier and faster to find. In an NxM search space, we propose an online traversal algorithm that finds the best architecture candidate in O(1) time in the best case and O(N) amortized time in the average case for any compact binary classification problem by using k-completeness as a heuristic in our search. The two other offline search algorithms we implement are brute force traversal and diagonal traversal, which both find the best architecture candidate in O(NxM) time. We compare our new algorithm to brute force and diagonal searching as baselines and report search time improvements of 52.1% over brute force and 15.4% over diagonal search to find the most accurate neural network architecture when given the same dataset. In all cases discussed in the paper, our online traversal algorithm can find an accurate, if not better, architecture in a significantly shorter amount of time.


I Introduction

Most accurate neural architectures that we see today are handpicked and carefully designed by experts [1] [2] to achieve the desired performance. Typically, these experts design network architectures in such a way that each hyper-parameter selection carefully supplements the problem at hand. Due to the steep learning curve of hyper-parameter selection, most novices in neural network architecture design perform a grid or brute-force search over possible architectures in order to locate the best performing one. This, for obvious reasons, can be slow and tedious.

However, due to the work of [3] and [4] in recent years, neural architecture search has seen a surge, and many new traversal and search methods have been proposed [5]. The Neural Architecture Search field has three focus points where researchers spend most of their time. These focus points are characterized by Elsken et al., 2019 in their literature survey [6] as the following:

  1. Search Space: All potential neural architecture candidates form the search space. Depending on the problem, the search space can be quite large, hence the importance of an efficient search algorithm.

  2. Search Algorithm: Even though the search space is usually quite large, many candidates are quite similar and can therefore be skipped, leaving fewer candidates to examine and yielding faster run times.

  3. Evaluation Strategy: Usually accuracy is given the most attention as an evaluation strategy, but depending on the problem at hand, memory consumption, individual training duration, and/or energy consumption may be a more important evaluation metric.

In this paper, we look at binary classification problems where the output layer has only one node, corresponding to an output classification of zero or one. We vary the number of hidden layers and the number of nodes in each layer. Such variation creates two degrees of freedom and an NxM search space, where N is the maximum number of nodes in each layer and M is the number of hidden layers. We propose three algorithms, two of which are offline traversal algorithms and one of which is an online traversal algorithm.

Online traversal algorithms are those that can process input from a search space one candidate at a time, while offline traversal algorithms are those that require all (or nearly all) of the search space to be known prior to traversing it. It becomes clear, then, that online algorithms are far superior to offline algorithms because they do not require the entire search space to be evaluated prior to locating the optimum. The most basic offline traversal algorithm is the brute force method, where every single model candidate is evaluated and the optimum is located by comparing results over all candidates.

We mainly use the offline algorithms as baselines for our online algorithm because, in most cases, they are guaranteed to locate the optimal neural network architecture size.

II Related Work

In 2021, Alparslan et al. [7] looked at using binary search to determine the architecture size that would give the best accuracy. They reported nearly a 100-fold improvement over the naive search approach by using a modified binary search algorithm to determine the model with the highest accuracy. However, the assumptions made were too strong and worked only on certain datasets that met their criteria, such as a monotonic increase in accuracy from both sides toward the global maximum. Their work also restricted the binary search to very compact neural networks with only one hidden layer. In this paper, we relax two of their assumptions:

  1. Monotonic Increase from Both Sides Constraint: In our paper, we no longer require the search space to have one global maximum with values increasing from both sides until the global maximum is reached. [7] assumed that the search space is sorted in ascending order from the beginning to the global maximum and in descending order from the global maximum to the end. Assuming such partial sortedness of the parameter space helped them apply binary search and achieve massive speed improvements in their search. However, when the underlying dataset was highly non-convex in accuracy with respect to the architecture search space, the solution returned by the search method was a local optimum. In this paper, we no longer require such an assumption.

  2. One Hidden Layer Constraint: In our paper, we are no longer constrained to just one hidden layer. We vary the number of hidden layers as well as the number of nodes in each layer. Such variation creates two degrees of freedom and helps us test our search in a much larger (2D) search space.

III Methodology

III-A Terminology and Definitions

In order to traverse networks with different levels of completeness, we need a well-defined search space and a way of representing an architecture candidate as a node in that search space. In this paper, we define a two-dimensional (2D) search space with one dimension that measures width and another that measures depth. This is achieved by defining two terms: the Initial Hidden Layer Size IHLS (Equation 1) and the Division Factor DF (Equation 2). While the former initializes the width of the first hidden layer, the latter determines the number and sizes of subsequent hidden layers. For example, if the initial hidden layer size IHLS is 24 and the division factor DF is 3, the architecture has [24,8,2] nodes across three hidden layers. Defining these two terms makes each architecture candidate easy to represent and well-defines the search space.
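
As a concrete illustration, the following minimal Python sketch (our own, not the authors' code; the function name is ours) generates the hidden layer widths implied by a given IHLS and DF through repeated integer division, matching the [24,8,2] example above.

    def hidden_layer_sizes(ihls, df):
        # Repeatedly integer-divide by the division factor until the width
        # falls below one. Assumes df >= 2, consistent with the examples
        # in this paper; df = 1 would never terminate.
        if df < 2:
            raise ValueError("division factor must be at least 2")
        sizes, width = [], ihls
        while width >= 1:
            sizes.append(width)
            width //= df
        return sizes

    print(hidden_layer_sizes(24, 3))  # [24, 8, 2]
    print(hidden_layer_sizes(10, 2))  # [10, 5, 2, 1]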

Completeness, in short, is a measure of the depth and width of a given network architecture. If the hidden layer dimensions are greater than that of the input layer, we say the model incurs an overcomplete representation, and if they are smaller, an undercomplete representation [13]. In the search space, some architecture candidates are overcomplete and some are undercomplete. In order to capture the variation among them, we define a k-completeness score in Equation 4. The intuition behind its construction is that each subsequent hidden layer's size is the quotient of the previous hidden layer's size and the division factor. The score aims to distinguish overcomplete architectures from undercomplete architectures and to illustrate the trend between k-completeness and training time for accurate models.

In order to formally define a k-completeness score that ranges from zero to infinity, we provide the following generalized definitions for jumping factor (Equation 3) and the division factor (Equation 2).

Initial Hidden Layer Size: (1)

Division Factor: (2)

Jumping Factor: (3)

k-Completeness Score: (4)

The weighting coefficient in Equation 4 is chosen to be 0.5 for our experiments because we want to weight the division factor and the jumping factor equally.

The interplay between these two factors and the width and depth of a network is not immediately obvious. While the size of the first hidden layer is proportional to the k-completeness of the network, the division factor is inversely proportional to it. In other words, a large division factor yields a small number of subsequent hidden layers because there is a larger value by which each hidden layer is divided in order to yield the next. This space, along with classifications of low and high k-completeness scores, is shown in Figure 1.
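
To make these proportionality relationships concrete, the toy score below is an illustrative stand-in only; it is not the paper's Equation 4 (whose exact form is not reproduced here) and its name is ours. It simply grows with the initial hidden layer size and shrinks with the division factor, mirroring the relationships described above.

    def toy_completeness(ihls, df, input_dim):
        # Illustrative stand-in, NOT Equation 4: larger first hidden layers
        # push the score up (more overcomplete), larger division factors
        # pull it down (fewer, narrower subsequent layers).
        return ihls / (input_dim * df)

    print(toy_completeness(64, 2, 11))   # relatively overcomplete region
    print(toy_completeness(4, 64, 11))   # relatively undercomplete region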

Fig. 1: Search space for neural network architecture k-completeness as it varies with the size of the first hidden layer (x-axis) and the division factor (y-axis). In the top left corner, a very large division factor DF (see Equation 2) and a very small initial hidden layer size IHLS (see Equation 1) result in very undercomplete models. In the bottom right, a very small division factor and a very large initial hidden layer size result in very overcomplete models.

III-B Traversal Algorithms

In Figures 2, 3, and 4, the y-axis represents the division factor DF and is assigned powers of 2 as values (2, 4, 8, 16, etc.). The x-axis represents the initial hidden layer size IHLS and is assigned values that range from 1 to the maximum IHLS. For example, in Figure 2, there are 10 rows, so the y-axis goes from 2, 4, 8, 16, …, to 1024 from bottom to top, and there are 10 columns, so the x-axis goes from 1 to 10 from left to right. Such a setup well-defines the search space and facilitates easier implementation since each node represents an architecture candidate. For example, the node labeled B has a DF of two and an IHLS of 10, which designates the architecture [10,5,2,1]. The node labeled A has a DF of two and an IHLS of nine, which designates the architecture [9,4,2,1]. The node labeled C has a DF of four and an IHLS of ten, which designates the architecture [10,2], as defined in Equation 1 and Equation 2.
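
Under the same assumptions as the earlier layer-size sketch, the grid described above can be enumerated as follows; the node labels match the A, B, and C examples in the text.

    # Assumes the hidden_layer_sizes helper defined earlier.
    division_factors = [2 ** k for k in range(1, 11)]  # y-axis: 2, 4, ..., 1024
    max_ihls = 10                                      # x-axis: 1 .. 10

    search_space = {
        (df, ihls): hidden_layer_sizes(ihls, df)
        for df in division_factors
        for ihls in range(1, max_ihls + 1)
    }

    print(search_space[(2, 10)])  # node B -> [10, 5, 2, 1]
    print(search_space[(2, 9)])   # node A -> [9, 4, 2, 1]
    print(search_space[(4, 10)])  # node C -> [10, 2]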

III-B1 Offline Algorithm #1: Brute Force Search

This first offline algorithm is analogous to a linear search for a maximum value in a list: each value in the search space (or list) needs to be evaluated before a result can be determined. For this reason, in the case of the architecture search space, the complexity of this algorithm in the best case, worst case, and average case is O(NxM). For hyper-parameter selection, this complexity is very large because each time the model architecture is changed slightly, the model has to be completely reevaluated. Regardless, this exhaustive algorithm is displayed in Figure 2.
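
A minimal Python sketch of the brute force traversal follows, assuming a hypothetical train_and_evaluate(architecture) helper that trains a model with the given hidden layer sizes and returns its accuracy, and the search_space mapping from the earlier sketch.

    def brute_force_search(search_space, train_and_evaluate):
        # Evaluate every candidate in the grid and keep the best one.
        best_node, best_acc = None, float("-inf")
        for node, architecture in search_space.items():
            acc = train_and_evaluate(architecture)
            if acc > best_acc:
                best_node, best_acc = node, acc
        return best_node, best_acc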

Fig. 2: Brute force traversal algorithm. This accuracy traversal is considered offline because the traversal order does not depend on the results of previously evaluated architectures.

III-B2 Offline Algorithm #2: Diagonal Search

The diagonal traversal algorithm visits the candidates on alternating primary diagonals. In this way, this offline algorithm does not evaluate all of the network architectures. Instead, it assumes that the difference between the global optimum and the best candidate found on the traversed diagonals is negligible. Regardless, similar to the brute force search, this diagonal search has a complexity of O(NxM) in the best case, worst case, and average case.

Fig. 3: Diagonal traversal algorithm. This accuracy traversal algorithm skips nodes that are closer to each other in the same row and completes faster than the naive brute force approach.
procedure DiagonalSearch(space, N, M)
    D := empty list
    for i := 1 to N do
        for j := 1 to M do
            if (i + j) mod 2 = 0 then
                D.append(space[i][j])
    return D
Algorithm 1 Returns the search space candidates that lie on alternating diagonals.

III-B3 Online Algorithm #1: Zigzag Search

The zigzag traversal algorithm is the final algorithm that we investigate in this paper. It is an online traversal algorithm, meaning that the algorithm finds the next candidate by processing only the candidates seen so far. By skipping over candidates that are similar in architecture, this algorithm sees significant running time improvements, even though it is harder to implement. Figure 4 explains the steps visually and Algorithm 2 formalizes the algorithm. The intuition for moving along the opposite diagonal after finding the best architecture on the current diagonal comes from gradient descent: the opposite direction intuitively represents the orthogonal direction of the gradients along the surface when approaching a local minimum. The success of this accuracy traversal algorithm (see Table II for results) can be attributed to the fact that every time we move along the opposite diagonal, we prevent oscillation [10] [11] and vanishing gradients [12] and provide a random jump factor to avoid getting stuck on local solutions.

Fig. 4: Zigzag traversal algorithm. This algorithm relies on traversing the search space with alternating primary and secondary diagonals. Blue circles represent unseen network architectures, yellow circles represent seen network architectures, and green circles represent optimal network architectures along a diagonal. The first pass forms a primary diagonal stretching along an indeterminate change of completeness from the lower left corner to the upper right corner of the search space; it locates an optimal architecture four nodes into the search. The second pass forms a smaller secondary diagonal, which is used to locate the second optimal architecture along that diagonal. From there, a third pass forms the second primary diagonal, which is much smaller than the first. The search ends once the fourth diagonal finds an optimal architecture that was already previously recorded, in this case on the third diagonal line.
procedure ZigZagSearch(space, N, M)
    visited := empty list
    (bestX, bestY) := (1, 1)                          ▷ start at the lower left corner
    bestAcc := evaluate(space[bestX][bestY])
    visited.append((bestX, bestY))
    isPrimary := true
    while true do
        if isPrimary then
            diag := cells on the primary diagonal through (bestX, bestY)
        else
            diag := cells on the secondary diagonal through (bestX, bestY)
        (localX, localY, localAcc) := (bestX, bestY, bestAcc)
        for (x, y) in diag do
            if (x, y) not in visited then
                acc := evaluate(space[x][y])
                visited.append((x, y))
                if acc > localAcc then
                    (localX, localY, localAcc) := (x, y, acc)
        if (localX, localY) = (bestX, bestY) then     ▷ best on this diagonal already recorded
            break
        (bestX, bestY, bestAcc) := (localX, localY, localAcc)
        isPrimary := not isPrimary
    return (bestX, bestY)
Algorithm 2 Performs an online traversal of the search space using alternating primary and secondary diagonals to find the optimal network architecture.
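
For readers who prefer running code, the Python sketch below expresses the same idea. It assumes the grid indexing from the earlier sketches, a hypothetical evaluate(i, j) function that trains the architecture at grid cell (i, j) and returns its accuracy, and the convergence rule of stopping once a diagonal's best candidate was already the current best.

    def zigzag_search(evaluate, n_rows, n_cols):
        def diagonal(i0, j0, primary):
            # Cells on the primary (i - j constant) or secondary (i + j
            # constant) diagonal through (i0, j0), clipped to the grid.
            step = 1 if primary else -1
            span = max(n_rows, n_cols)
            return [(i0 + d, j0 + d * step)
                    for d in range(-span, span + 1)
                    if 0 <= i0 + d < n_rows and 0 <= j0 + d * step < n_cols]

        cache = {}                      # (i, j) -> accuracy, so nothing is retrained
        best, primary = (0, 0), True
        cache[best] = evaluate(*best)
        while True:
            local_best = best
            for cell in diagonal(best[0], best[1], primary):
                if cell not in cache:
                    cache[cell] = evaluate(*cell)
                if cache[cell] > cache[local_best]:
                    local_best = cell
            if local_best == best:      # best on this diagonal already recorded
                return best, cache[best]
            best, primary = local_best, not primary

Because each pivot strictly improves on the previous one and evaluations are cached, the loop terminates on any finite grid.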
Traversal Algorithms | Best Case | Worst Case | Average Case
Brute Force Search   | O(NxM)    | O(NxM)     | O(NxM)
Diagonal Search      | O(NxM)    | O(NxM)     | O(NxM)
Zigzag Search        | O(1)      | O(NxM)     | O(N) amortized
TABLE I: Running time complexity comparison among all traversal algorithms. For all cases, the search space is considered to be an NxM matrix. Zigzag search has the same worst case as the other, naive approaches; however, zigzag search has a cheaper amortized cost at each traversal step.

Fig. 5: Titanic Model Train Accuracies, Test Accuracies, and k-Completeness Scores. Train and test accuracies differ only slightly for the Titanic model. A very small division factor and a very large initial hidden layer size result in overcomplete architectures, which have very large k-completeness scores.

Fig. 6: Churn Model Train Accuracies, Test Accuracies, and k-Completeness Scores. For the Churn model, overcomplete architectures, which have very large k-completeness scores, achieve better training and testing accuracies.

III-C Datasets

In this study, we use two datasets: the Titanic dataset and the customer churn dataset.

III-C1 Titanic Dataset

This dataset is made public by Kaggle [8]. The dataset has 14 columns describing each passenger on the Titanic. The columns include sex, name, destination, fare, etc. The dataset has about 1000 rows. The label for each row is a 1 or 0 to indicate whether the passenger survived or not, hence the binary classification. The architecture candidates that we search over to predict this binary classification problem all have 11 dimensions in the input layer and 1 node in the output layer.

III-C2 Churn Dataset

This dataset is made public by the Drexel Society of Artificial Intelligence [9]. The dataset has information about customers holding an imaginary contract and has labels indicating whether the customer has left the contract or not. The model that we build for this dataset tries to predict whether a customer is about to leave this contract, hence the binary classification. There are 14 columns and 10000 rows in the dataset. The columns include features describing the customer revenue, the customer contract cost, the type of product the contract is for, the region of the customer, whether the customer renewed the contract in the previous 90 days, etc. The dataset has features represented as categories as well as floating point numbers; therefore, we have to scale the numeric features and apply a categorical transformation when we train a model on this dataset. All the architectures that we consider in the search space have 11 dimensions in the input layer and 1 node in the output layer.
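
As a hedged sketch of this preprocessing step (the column names below are illustrative placeholders, not necessarily the actual columns in ChurnModel.csv), scaling the numeric features and one-hot encoding the categorical ones could look like this with pandas and scikit-learn:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.read_csv("ChurnModel.csv")

    # Hypothetical column split; the real dataset's columns may differ and
    # should be chosen so that the transformed input has 11 dimensions.
    numeric_cols = ["Revenue", "ContractCost"]
    categorical_cols = ["Region", "ProductType"]

    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

    X = preprocess.fit_transform(df[numeric_cols + categorical_cols])
    y = df["LeftContract"].values  # hypothetical label column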

IV Results and Observations

All three traversal algorithms have been run against the two models. The zigzag traversal algorithm was found to be the fastest at finding a model architecture candidate among all the traversal algorithms. The online nature of the algorithm and the low amortized cost of finding a new candidate yield significant improvements in running time. The zigzag traversal might get stuck on a local optimum and miss the global optimum due to its online nature. We see that the training accuracy and the testing accuracy of the candidate found by the zigzag traversal are on average 1.49% and 1.52% lower, respectively, for the Titanic model compared to the brute force and diagonal traversals. For the Titanic model, the brute force traversal was able to find an architecture candidate about 7.4% more sparse than the zigzag traversal algorithm, which was surprising since the brute force does not take k-completeness into consideration, whereas the zigzag traversal algorithm favors sparse architectures. The resulting architecture of the zigzag traversal was about 5% more sparse than that of the diagonal traversal.

Overall, for the Titanic model, the zigzag traversal algorithm found an architecture with a training accuracy of 80.89% and a testing accuracy of 77.48%, which are only 1.49% lower than the average training accuracy and 1.52% lower than the average testing accuracy of the other traversal algorithms, while the completion time for the zigzag traversal is about 2 times faster than the brute force search and 1.18 times faster than the diagonal search. Due to the relatively small size of the Titanic dataset (1000 rows) compared to the churn dataset (10000 rows), the running time improvements of the zigzag traversal are not fully observed here.

We see slightly better performance in the training and testing accuracy for the zigzag traversal when applied to the churn model. The training accuracy of the candidate found by the zigzag traversal is 3.95% lower than that of the brute force traversal, but 3.34% better than that of the diagonal traversal. The testing accuracy of the candidate found by the zigzag traversal is very similar to that of the brute force and diagonal traversals, being only 0.84% and 0.69% lower, respectively. The candidate found by the zigzag traversal for the churn model is the most sparse architecture compared to those of the brute force and diagonal traversals. In fact, the zigzag traversal was able to find a candidate with a k-completeness score more than twice (2.18x) that of the brute force. The high k-completeness score for the result of the zigzag traversal is a consequence of the nature of the algorithm, where each traversal along the primary diagonal is followed by a traversal along the secondary diagonal to skip over similar, dense architectures and favor the more sparse ones.

Overall, for the churn model, the zigzag traversal algorithm found an architecture with a training accuracy of 79.98% and a testing accuracy of 81.00%, which are only 0.305% lower than the average training accuracy and 0.765% lower than the average testing accuracy of the other traversal algorithms, while the completion time for the zigzag traversal is about 5 times faster than the brute force search and 3 times faster than the diagonal search. One more thing to note is that the architecture found for the churn model is not only more sparse than the results of the other traversal algorithms, but also about 30% more sparse than the second most sparse architecture that the zigzag traversal discovered during its search. In this section, we only report the plots for the zigzag accuracy traversal algorithm. The reader can see the Appendix for the plots of the other traversal algorithms.

Fig. 7: Loss Plots of Models found by Zigzag Traversal for Churn Model

Fig. 8: Accuracy Plots of Models found by Zigzag Traversal for Churn Model.

Fig. 9: Loss Plots of Models found by Zigzag Traversal for Titanic Model

Fig. 10: Accuracy Plots of Models found by Zigzag Traversal for Titanic Model
Model   | Metrics            | Brute Force      | Diagonal       | Zigzag
Titanic | Completion Time    | 139.098          | 78.695         | 66.6
Titanic | Train Acc.         | 82.43%           | 82.33%         | 80.89%
Titanic | Test Acc.          | 78.62%           | 79.38%         | 77.48%
Titanic | k-Completeness     | 3.1591           | 2.7955         | 2.9403
Titanic | Best Architecture  | [64,32,16,8,4,2] | [56,28,14,7,3] | [64,4]
Churn   | Completion Time    | 3854.482         | 1989.97        | 672.0
Churn   | Train Acc.         | 83.93%           | 76.64%         | 79.98%
Churn   | Test Acc.          | 81.84%           | 81.69%         | 81.00%
Churn   | k-Completeness     | 1.2273           | 1.8807         | 2.6818
Churn   | Best Architecture  | [16]             | [40, 5]        | [48]
TABLE II: Results for All Traversal Algorithms. For all cases, the initial hidden layer size is 64 and the division factor is 64. All comparison programs are run on an Intel Core i5 4-core 1.7 GHz processor with 16 GB of memory and are written in Python 3.7. Completion times are reported in seconds.

V Conclusion

In this paper, we proposed an online traversal algorithm to find the best architecture candidate in a search space in O(1) time in the best case and O(N) amortized time in the average case for any compact binary classification problem by using k-completeness as a heuristic in our search. We compared our new algorithm to brute force and diagonal searching as baselines and reported search running time improvements of 52.1% over brute force and 15.4% over diagonal search to find the most accurate neural network architecture when given the same dataset. Our online traversal algorithm could find accurate architectures that were on par with, if not better than, those found by the other algorithms discussed in this paper. We hope that our findings will give insights to researchers in the Neural Architecture Search field when performing exhaustive grid search to find the most accurate architectures in the shortest amount of time possible.

VI Future Work

We hope to develop more online algorithms, because a different online algorithm might prove to be more efficient on certain search spaces or architecture sizes compared to the three algorithms explored in this paper. Additionally, the models that we study in this paper are compact models with no convolutional layers. Moving from binary classification problems to image recognition and studying the effect of our online algorithm there might be an interesting direction for future work.

VII Acknowledgment

We would like to acknowledge Drexel Society of Artificial Intelligence for its contributions and support for this research.

References

  • [1] Simonyan, K. and Zisserman, A., “Very Deep Convolutional Networks for Large-Scale Image Recognition”, arXiv e-prints, 2014.
  • [2] K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
  • [3] Zoph, B. and Le, Q. V., “Neural Architecture Search with Reinforcement Learning”, arXiv e-prints, 2016.
  • [4] Baker, B., Gupta, O., Naik, N., and Raskar, R., “Designing Neural Network Architectures using Reinforcement Learning”, arXiv e-prints, 2016.
  • [5] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O., “Proximal Policy Optimization Algorithms”, arXiv e-prints, 2017.
  • [6] Elsken, T., Hendrik Metzen, J., and Hutter, F., “Neural Architecture Search: A Survey”, arXiv e-prints, 2018.
  • [7] Y. Alparslan, E. J. Moyer, I. M. Isozaki, D. Schwartz, A. Dunlop, S. Dave, E. Kim, “Towards Searching Efficient and Accurate Neural Network Architectures in Binary Classification Problems”, arXiv e-prints, 2021.
  • [8] Titanic dataset, Kaggle, 2018, https://www.kaggle.com/hesh97/titanicdataset-traincsv
  • [9] Customer Churn Dataset, made public by the Drexel Society of Artificial Intelligence, 2020, https://github.com/drexelai/kcompleteness-in-binary-neural-nets/blob/main/datasets/ChurnModel.csv
  • [10] S. Townley et al., “Existence and learning of oscillations in recurrent neural networks,” IEEE Transactions on Neural Networks, vol. 11, no. 1, pp. 205-214, Jan. 2000, doi: 10.1109/72.822523.
  • [11] Keihiro Ochiai, Naohiro Toda, Shiro Usui, “Kick-out learning algorithm to reduce the oscillation of weights”, Neural Networks, Volume 7, Issue 5, 1994, Pages 797-807, ISSN 0893-6080, https://doi.org/10.1016/0893-6080(94)90101-5.
  • [12] Dai, Z. and Heckel, R., “Channel Normalization in Convolutional Neural Network avoids Vanishing Gradients”, arXiv e-prints, 2019.
  • [13] M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural Computation, 2000.

Appendix

Fig. 11: Accuracy Plots of Models found by Brute Force Traversal for Titanic Model

Fig. 12: Loss Plots of Models found by Brute Force Traversal for Titanic Model

Fig. 13: Accuracy Plots of Models found by Diagonal Traversal for Titanic Model

Fig. 14: Loss Plots of Models found by Diagonal Traversal for Titanic Model

Fig. 15: Accuracy Plots of Models found by Brute Force Traversal for Churn Model

Fig. 16: Loss Plots of Models found by Brute Force Traversal for Churn Model

Fig. 17: Accuracy Plots of Models found by Diagonal Traversal for Churn Model

Fig. 18: Loss Plots of Models found by Diagonal Traversal for Churn Model