Codes and data related to paper
Convolutional Neural Networks (CNNs) have shown strong promise for analyzing scientific data from many domains, including particle imaging detectors. However, the challenge of choosing the appropriate network architecture (depth, kernel shapes, activation functions, etc.) for specific applications and different data sets is still poorly understood. In this paper, we study the relationship between a CNN's architecture and its performance by proposing a systematic language that is useful for comparing different CNN architectures before training time. We characterize a CNN's architecture by different attributes, and demonstrate that the attributes can be predictive of the networks' performance in two specific computer vision-based physics problems: event vertex finding and hadron multiplicity classification in the MINERvA experiment at Fermi National Accelerator Laboratory. In doing so, we extract several architectural attributes from optimized network architectures for the physics problems, which are outputs of a model selection algorithm called Multi-node Evolutionary Neural Networks for Deep Learning (MENNDL). We use machine learning models to predict whether a network can perform better than a certain threshold accuracy before training. The models perform 16-20 percentage points better than random guessing. Additionally, we found a coefficient of determination of 0.966 for an Ordinary Least Squares model in a regression on accuracy over a large population of networks.
Deep Learning (DL) is a sub-field of machine learning that focuses on learning features from data through multiple layers of abstraction. Deep Convolutional Neural Networks (CNNs) have become the state-of-the-art DL technique in the fields of computer vision, natural language processing, and other scientific research domains such as High Energy Physics [YLeCun-DL-Nature-2015, LSong-DL-2019]. That said, because no single CNN generalizes to all datasets, a necessary step before applying CNNs to new data is selecting an appropriate set of architecture hyper-parameters. In the literature, the problem of choosing a CNN architecture that is well suited to a given problem domain is still poorly understood. While there have been many studies of automated architecture search [DBLP:journals/corr/abs-1905-01392], very little has been done to develop a standardized language for describing neural network architectures in a way that is useful for comparing multiple networks, or for predicting network performance metrics on the basis of architectural parameters. Many model selection algorithms have been proposed to ease the hyper-parameter optimization process, yet they mainly rely on human intuition or random search to decide which region of the parameter search space should be explored [Bergstra:2012:RSH:2503308.2188395]. In this study, we therefore propose a systematic language to characterize CNN architecture for simple, modular networks, and focus on demonstrating that different characterizations of the network architecture can be predictive of its performance in two computer vision problems in a particle physics context: vertex finding [ijcnn7966131, Perdue_2018, LSong-DL-2019] and hadron multiplicity counting in MINERvA. MINERvA [Aliaga2014130] is a neutrino-nucleus scattering experiment at Fermi National Accelerator Laboratory with fine-grained, stereoscopic imaging capabilities and few-nanosecond timing resolution.
We conclude that our set of architectural attributes can be used to give partial insight into a network’s performance prior to training. We also present specific architectural attributes that are highly relevant to CNNs’ performance on those problems, for further study and development of the models.
In this work, network architecture refers to the structural qualities of the network which are specified prior to training: e.g. the layer types, the ordering of layer types, the layer non-linearities, and layer-specific parameters like the widths of fully-connected layers and the kernel shapes and strides of convolutional layers. Network architecture can be considered a subset of the network hyper-parameters, though in this work we do not consider hyper-parameters such as learning rate, momentum, optimization method, or anything else related to the learning process. We also do not include learned weights and biases as attributes characterizing network architecture, although it would be interesting to see how different architectures, trained on the same dataset and subject to the same loss functions, learn different feature maps for each of their layers. All of the networks in each population presented here were trained with identical learning rate schedules and for an identical number of iterations. Only the architecture was varied, so we use the term attribute to denote architectural properties of neural networks.
The networks analyzed here are convolutional networks trained by an evolutionary algorithm called MENNDL (Multi-node Evolutionary Neural Networks for Deep Learning) [Young:2015:ODL:2834892.2834896]. The networks were trained for the tasks of vertex finding [LSong-DL-2019] and hadron multiplicity counting in images collected from Fermilab’s MINERvA detector (minerva.fnal.gov). For the vertex finding task, the desired output for each input image is the location of the point of interaction between the incoming neutrino and the target, expressed as a plane in the detector. For the hadron multiplicity problem, we count the number of outgoing charged hadron tracks with sufficient energy to traverse about two planes of the detector from the interaction. A sample input image is given in Fig. 1. The networks were trained using data simulated by state-of-the-art physical models. To make the networks insensitive to differences between simulated and real images, some of the network populations were trained with a domain adversarial component (DANN) [JMLR:v17:15-239, Perdue_2018]. For this work, we studied two separate output populations of vertex-finding networks and one population of hadron-multiplicity networks, each of which is based on 4,999 repetitions of the evolutionary algorithm. During its run, the algorithm was also allowed to alter layer types, layer order, and number of layers, as well as intra-layer features like kernel shapes, stride lengths, number of features, and type of non-linearity. The data set of networks analyzed was thus built on a total of 299,050 unique network architectures.
All studies in this paper are reproducible using our analysis and extraction codes and our attributes data set, which are publicly available (https://github.com/Duchstf/CNN-Architectural-Analysis). The sample raw Caffe prototxt and output files from MENNDL are also documented with our codes. Feature extraction codes were run using a Singularity software container (https://github.com/Duchstf/CNN-Architectural-Analysis-SingularityImg) [Singularity] for software deployment.
The rest of this paper is organized as follows. In Section II, we describe several architectural attributes which were extracted from our set of networks. In Section III, we present results of machine learning models built to predict a CNN’s performance based on the extracted architectural attributes. Section IV details a possible way to gain insight into a CNN’s behaviour relative to its architecture by analyzing the machine learning models’ features. Finally, in Section V, we conclude by summarizing the study and discussing the next steps required for further development of the research.
Here we describe various network attributes which may be extracted and represented in a uniform way using a minimal amount of computation. Several such attributes are the result of averaging over some group of attributes. This is because the size of a group of attributes may depend on the specific network architecture, and its members may not always serve the same function or be at the same scale in different networks, which can make them ambiguous to interpret. For example, it is tempting to use network depth as an attribute, but different networks might have several input layers or several output layers. In particular, some networks developed for analysis of MINERvA data expect three input layers (corresponding to different angles of the same input image), and each produces two output layers (one for the domain classifier, and the other for the target classifier). Thus, there is potential ambiguity in the notion of depth, since there are multiple input-to-output paths. To remedy this issue, we can ask for the average depth. Below is a list of all attributes extracted here. The list is not exhaustive; there are many other possibilities. Some attributes refer to “horizontal” and “vertical” directions. Here we mean along the “plane” axis for horizontal and along the “strip” axis for vertical (Fig. 1). The abbreviations used in the analysis are given in [square brackets].
Average depth [net_depth_avg]
Number of convolutional layers [num_conv_layers]
Number of pooling layers [num_pooling_layers]
Average number of elements in outputs of fully-connected layers [avg_IP_neurons]
Average number of connection parameters of fully-connected layers to previous layer [avg_IP_weights]
Average number of output feature maps in convolutional layers [num_conv_features]
Proportion of convolutional layers followed by a pooling layer [prop_conv_into_pool]
Proportion of pooling layers followed by a pooling layer [prop_pool_into_pool]
Proportion of convolutional layers with 1×1 kernels [prop_1x1_kernels]
Proportion of convolutional layers with square kernel-shapes [prop_square_kernels]
Proportion of convolutional layers with horizontally-oriented kernels [prop_horiz_kernels]
Proportion of convolutional layers with vertically-oriented kernels [prop_vert_kernels]
Number of sigmoid-activated convolutional layers [num_sigmoid]
Average percent reduction in activation grid area/ height/ width between consecutive convolutional layers
Average percent reduction in activation grid area/ height/ width between input layers and final convolutional layers
Proportion of convolutional layers using non-overlapping stride [prop_nonoverlapping]
Average convolutional stride height/width
Average ratio of convolutional layer’s output feature maps to its depth [avg_ratio_features_to_depth]
Average ratio of layer’s output feature maps to kernel area/height/width of convolutional layers
Average ratio of kernel area/height/width to depth of convolutional layers
Unravelling the attributes which can measure area, height, or width, this list comprises 32 architectural attributes.
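As an illustration of how a few of the attributes listed above could be computed, the following is a minimal sketch and not the extraction code from the repository. The layer-graph dictionary, the layer names, and the exact definitions used here (depth counted in layers including input and output, "horizontal" meaning kernel width greater than height, "non-overlapping" meaning stride at least as large as the kernel in both directions) are assumptions made for the example.

```python
from statistics import mean


def average_depth(graph, inputs, outputs):
    """Mean layer count over every input-to-output path of an acyclic layer graph."""
    def path_lengths(node):
        # Number of layers on each path from `node` to an output layer.
        if node in outputs:
            return [1]
        return [1 + n for nxt in graph.get(node, []) for n in path_lengths(nxt)]

    return mean([n for src in inputs for n in path_lengths(src)])


def kernel_attributes(conv_layers):
    """conv_layers: list of dicts with 'kernel' and 'stride' given as (height, width)."""
    total = len(conv_layers)

    def prop(pred):
        # Fraction of convolutional layers satisfying the predicate.
        return sum(1 for layer in conv_layers if pred(layer)) / total

    return {
        "num_conv_layers": total,
        "prop_1x1_kernels": prop(lambda l: l["kernel"] == (1, 1)),
        "prop_square_kernels": prop(lambda l: l["kernel"][0] == l["kernel"][1]),
        # Assumed convention: wider than tall counts as horizontally oriented.
        "prop_horiz_kernels": prop(lambda l: l["kernel"][1] > l["kernel"][0]),
        "prop_vert_kernels": prop(lambda l: l["kernel"][0] > l["kernel"][1]),
        # Assumed convention: stride >= kernel size in both directions.
        "prop_nonoverlapping": prop(
            lambda l: l["stride"][0] >= l["kernel"][0]
            and l["stride"][1] >= l["kernel"][1]
        ),
    }


# Hypothetical toy network: three input views, a shared fully-connected layer,
# and two outputs (target classifier and domain classifier).
toy_graph = {
    "in_x": ["conv_x1"], "conv_x1": ["pool_x1"], "pool_x1": ["fc"],
    "in_u": ["conv_u1"], "conv_u1": ["fc"],
    "in_v": ["conv_v1"], "conv_v1": ["fc"],
    "fc": ["target_out", "domain_fc"],
    "domain_fc": ["domain_out"],
}
depth = average_depth(toy_graph, ["in_x", "in_u", "in_v"], {"target_out", "domain_out"})

toy_convs = [
    {"kernel": (8, 3), "stride": (1, 1)},
    {"kernel": (3, 3), "stride": (1, 1)},
    {"kernel": (1, 1), "stride": (1, 1)},
    {"kernel": (2, 5), "stride": (2, 5)},
]
attrs = kernel_attributes(toy_convs)
```

Averaging over all six input-to-output paths gives a single depth value even though the individual paths differ in length, which is the ambiguity the average depth attribute is meant to resolve.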
We analyzed two populations of output networks designed for vertex finding in MINERvA using MENNDL. For convenience, we refer to them as the First and Second Populations. The first networks in each genealogy in the two populations were initialized from the same set of networks. However, they were optimized in separate MENNDL runs and trained with different Caffe solver parameters. The first population was trained with a DANN component [JMLR:v17:15-239, Perdue_2018], whereas the second population was not. The networks have either 173 or 174 output classes corresponding to planes and targets in the detector. Therefore, the benchmark accuracy for random guessing is around 1/173, or roughly 0.6%. As shown in Fig. 2, both populations share very similar network accuracy distributions. Both are heavily skewed, with many networks’ accuracies clustering around a very low value. For each population, we split the data into broken and healthy networks using an accuracy threshold that is much higher than random guessing. The threshold was set so that the high peaks of very low performance networks in the distributions are included in the broken class, and so that the two classes are balanced. The overall percentage of each category in each population is summarized in Table I.
|Population||Broken||Healthy||Total number of networks|
In this task, we choose to not combine the two populations together for fear that the mentioned inherent difference in the networks’ attributes can interfere with our classification task and cause difficulties in interpreting the results. For regression, we chose to combine the two populations together on the basis that we are only looking at the correlations of the network’s attributes to predict the accuracy.
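The broken/healthy labelling and the random-guessing benchmark described above can be sketched as follows. This is an illustrative example only: the threshold value, the toy accuracies, and the choice of treating accuracies equal to the threshold as healthy are placeholders, not values taken from the study.

```python
def random_guess_accuracy(num_classes):
    """Expected accuracy when guessing uniformly among `num_classes` classes."""
    return 1.0 / num_classes


def label_networks(accuracies, threshold):
    """Label each network 'healthy' if its accuracy reaches the threshold, else 'broken'."""
    return ["healthy" if acc >= threshold else "broken" for acc in accuracies]


def class_fractions(labels):
    """Fraction of networks in each class, as would be reported per population."""
    total = len(labels)
    return {name: labels.count(name) / total for name in ("broken", "healthy")}


baseline = random_guess_accuracy(173)  # 173 output classes, as stated in the text

# Hypothetical accuracies and threshold, for illustration only.
toy_accuracies = [0.006, 0.01, 0.02, 0.45, 0.62, 0.71]
HYPOTHETICAL_THRESHOLD = 0.10
labels = label_networks(toy_accuracies, HYPOTHETICAL_THRESHOLD)
fractions = class_fractions(labels)
```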
Each population dataset was randomly split into training and testing sets with an 80/20 ratio, respectively. Several algorithms were used to classify between the two categories. While more complex models, such as neural networks, were able to provide marginal improvements, we choose to display only the results from Random Forest (RF) [Breiman:2001:RF:570181.570182] and Extremely Randomized Trees (ERT) [Geurts2006-ERT] because of their performance and interpretability. The algorithms and feature analysis are implemented using the scikit-learn [scikit-learn] library. For this task, we propose a base accuracy of 50%, since there is no class imbalance in either population used for classification. The primary purpose of building machine learning models was to demonstrate the predictive nature of the architectural attributes, not to perform further analysis based on the outputs of the models. As can be seen in Table II, the scores are significantly better than random guessing (50%), which shows that the models were able to detect a separation between the attribute sets of broken and healthy networks. Furthermore, the cross-validation scores and the accuracy on the test set are very close together, so we expect the models to have similar accuracy on unseen data.
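The evaluation protocol described above (an 80/20 split followed by cross-validation on the training portion) can be sketched without any library as shown below. The study itself used scikit-learn; this stand-alone version only illustrates the index bookkeeping, and the number of folds (5) and the seed are assumptions, since the text does not state them.

```python
import random


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and return (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_fraction))
    return idx[n_test:], idx[:n_test]


def k_fold(train_idx, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, validation


train_idx, test_idx = split_indices(100)
fold_sizes = [(len(fit), len(val)) for fit, val in k_fold(train_idx, k=5)]
```

The held-out test indices are never seen during cross-validation, which is what makes the comparison between the two scores in Table II meaningful.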
|Models||Population||Average accuracy scores|
|Cross-validation||On test set|
After performing healthy/broken classification, we performed regression on the healthy networks in order to relate network features to the accuracy on the hold-out test set. To prevent heteroscedasticity—where the sub-populations have different variabilities—in the data, the accuracies are transformed using the Box-Cox transformation[Box-Cox64]. The correlation between the independent variables and dependent variable remains the same after the transformation. Before fitting, interaction terms between the original attribute set are also added.
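The two preprocessing steps described above can be sketched as follows. This is a minimal illustration under stated assumptions: the Box-Cox parameter (called `lam` here) is passed in by hand, whereas in practice it is usually fitted to the data, and the interaction terms are assumed to be pairwise products, which the text does not specify. The attribute values in the example row are hypothetical.

```python
import math
from itertools import combinations


def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def add_pairwise_interactions(row):
    """Return a copy of `row` (attribute -> value) with 'a:b' product terms added."""
    out = dict(row)
    for a, b in combinations(sorted(row), 2):
        out[f"{a}:{b}"] = row[a] * row[b]
    return out


# The transform is monotonic, so the ordering of accuracies is preserved.
accuracies = [0.25, 0.5, 0.9]
transformed = [box_cox(acc, 0.5) for acc in accuracies]

# Hypothetical attribute values for a single network.
row = {"net_depth_avg": 6.0, "num_conv_layers": 4.0, "avg_IP_neurons": 128.0}
expanded = add_pairwise_interactions(row)
```

With three base attributes the expanded row has six columns; with the full attribute set the number of pairwise terms grows quadratically, which is worth keeping in mind when reading the fit results.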
|Population||Adjusted $R^2$||Number of healthy networks|
Using an Ordinary Least Squares (OLS) model that is non-linear in the attributes (through the interaction terms) but linear in its parameters, we performed regression separately on each population and then on the two populations combined. The OLS model is implemented using StatsModels [statsmodels]. As the two populations are distinct and we are looking merely at the relationship between a network’s architecture and its accuracy, combining them will not affect the regression process. The results from the fit are summarized in Table III. In this case, while we cannot achieve a good $R^2$ for each individual population, the model is able to fit the combination of both populations. A general trend is that as the number of networks increases, the $R^2$ value gets better. This suggests that while we do not have enough networks in the sub-populations to get a good fit, they overlap enough in the right regions of phase space to allow a good fit altogether. However, it is worth noting that, as depicted in Fig. 3, while the majority of residuals are distributed around 0, there seems to be a linear relationship between the residuals and the fitted values, which suggests that more regressors are needed to account for this behaviour. Furthermore, the Quantile-Quantile (Q-Q) plot in Fig. 3, with its high right tail, indicates that there is a gap in the distribution of the residuals. This is due to the fact that the accuracy distribution is heavily skewed, with very few networks reaching high accuracy. We also tried several regression algorithms that can account for a high level of non-linearity in the data: Decision Tree Regressors, Random Forest Regressors, Multi-layer Perceptrons, Theil-Sen, and Huber regressors. Almost all of them fail to generalize to the validation data set and do not provide a significantly better $R^2$ than a simple OLS model.
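For reference, the standard textbook definitions of the coefficient of determination and its adjusted form are given below. The symbols are our notation rather than the paper's: $n$ is the number of networks in the fit, $p$ the number of regressors (attributes plus interaction terms), $y_i$ the transformed accuracy, $\hat{y}_i$ the fitted value, and $\bar{y}$ the mean of the $y_i$.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\bar{R}^2 = 1 - \left(1 - R^2\right) \frac{n - 1}{n - p - 1}
```

Because the factor $(n-1)/(n-p-1)$ approaches 1 as $n$ grows for fixed $p$, the adjustment penalizes small populations more heavily when many interaction terms are included. This may contribute to the trend noted above, although it is not by itself an explanation of the fit quality.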
Fig. 4 depicts the accuracy distribution of the hadron multiplicity networks, together with the threshold used to divide them into two classes for classification. To prevent class imbalance in the training data, we set the threshold to 0.38 and randomly sampled the broken networks so that the two classes are split 50/50, with 34,614 networks in total.
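The class balancing described above amounts to random undersampling of the larger class. A minimal sketch follows; the 0.38 threshold is taken from the text, while the toy accuracies, the record layout, and the seed are hypothetical.

```python
import random


def balance_by_undersampling(networks, threshold, seed=0):
    """Split networks at `threshold` and sample both classes down to the smaller size."""
    broken = [n for n in networks if n["accuracy"] < threshold]
    healthy = [n for n in networks if n["accuracy"] >= threshold]
    size = min(len(broken), len(healthy))
    rng = random.Random(seed)
    # Sampling `size` items from the smaller class keeps all of its members.
    return rng.sample(broken, size), rng.sample(healthy, size)


# Hypothetical toy population in which broken networks are the majority.
toy_networks = [
    {"id": i, "accuracy": acc}
    for i, acc in enumerate([0.05, 0.10, 0.12, 0.20, 0.30, 0.35, 0.40, 0.55])
]
broken_sample, healthy_sample = balance_by_undersampling(toy_networks, threshold=0.38)
```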
For this task, we again used RF and ERT models to classify between broken and healthy networks. The classification results are reported in Table IV. Both models consistently achieve an accuracy of more than 70% in both cross-validation on the training set and on the testing set, which is 20 percentage points better than random guessing (50%, since there is no class imbalance).
|Model||Average cross-validation score||Accuracy on test set|
Note that we do not present regression results for the hadron multiplicity networks, since there are too few networks for the regression results to be significant.
Here we give some examples of how the attribute set can potentially be used to analyze the behaviour of a network’s architecture. While the OLS model for vertex finding networks cannot predict the accuracy of every network, the model has a p-value under 0.05 and a significant $R^2$ value. Many attributes have p-values under 0.05, and we observe that, when plotting the attributes’ p-values that fall below the cutoff (Fig. 5: Left), there are 45 attributes and interactions that are much more important than the rest. Since the data was normalized, the coefficients of the variables can be compared directly. We plot the coefficients of different features in Fig. 5: Right. While many features have coefficients within 0.4 of 0, there are 8 features with significantly larger coefficients than the rest, which are reported in Table V.
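The remark that normalization makes coefficients comparable can be illustrated with a short sketch. The text does not say which normalization was applied; z-score standardization is assumed here, and the attribute columns are hypothetical.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a list of values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical attribute columns on very different scales.
avg_ip_neurons = [64.0, 128.0, 512.0, 1024.0]
num_conv_layers = [2.0, 3.0, 5.0, 8.0]

z_neurons = standardize(avg_ip_neurons)
z_layers = standardize(num_conv_layers)
```

After this rescaling, a coefficient describes the change in the (transformed) accuracy per standard deviation of the attribute, so magnitudes can be compared across attributes regardless of their original units.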
By looking at the values of the coefficients, we can make some observations about the relationship between a CNN’s architecture and its accuracy in the vertex finding task. net_depth_avg, avg_IP_neurons and their interactions are strongly correlated with performance. This suggests that increasing the capacity (number of parameters) of the fully connected layers in the CNN can improve the overall performance of the model. Additionally, num_pooling_layers and num_conv_layers are negatively correlated with performance. This implies that, as we add more convolutional and pooling layers to the model, its performance will generally decrease. While the rest of the interactions are harder to interpret, the interaction term between avg_grid_reduction_height_total and avg_stride_h seems to point to an interesting property. In computer vision problems, usually only square kernels are considered. MINERvA physicists studied asymmetric kernel shapes for the vertex finding problem as a way of keeping the convolutions from reducing the image size along the planes axis [Perdue_2018, ijcnn7966131]. Overall, for the vertex finding and hadron multiplicity problems, our analysis of classification and regression models indicates that it is possible to study a CNN model’s accuracy prior to training by looking only at its architectural attributes. Moreover, analyzing the important features of the machine learning models can give us insight into how to potentially improve a CNN model’s performance. That being said, our set of attributes is not extensive enough to fully characterize the complex relationship between a CNN’s architecture and its accuracy. Further study of this is warranted.
In this paper, we proposed a systematic method that can be useful for uniform comparison of different architectural attributes of CNNs. We demonstrated the predictive nature of those attributes in two specific problems (vertex finding and hadron multiplicity counting in MINERvA) by building machine learning models that predict a CNN’s performance before training time. The classification models perform significantly better (66% - 70%) than random guessing (50%). We were also able to achieve a significant OLS model with an $R^2$ of 0.966 on a very large sample of vertex-finding networks. Additionally, we detailed a potential method to study a CNN’s behaviour relative to its architecture by analysing the predictive models’ features. For future work, we plan to extend the set of architectural attributes and take into account other hyper-parameters related to input domains and the training process. As mentioned in the introduction, we did not look at anything related to the training process, such as learning rate, momentum, or optimization method. We also did not study the learned weights and biases of the networks. Additionally, statistics of the image dataset should be important (feature sizes and shapes, intensity distributions, etc.), although these can be challenging to quantify. Considering some or all of the above might provide a more comprehensive study of network performance. We also want to have architectural attributes that account for more recent types of neural layers in the literature, e.g. [2015arXiv151203385H, 2016arXiv160806993H, 2019arXiv190406952E]. Furthermore, it would be interesting to perform the same kind of analysis on state-of-the-art network architectures and see to what extent our current set of architectural attributes correctly characterizes the networks’ performance.
It is also promising to incorporate machine learning models such as the ones we built in this paper into model selection algorithms to evaluate a network’s accuracy before training time, thereby boosting the efficiency of the algorithms.