1 Introduction
Deep neural networks have been quite successful across various machine learning tasks. However, this advancement has been mostly limited to certain domains. For example, in image and voice data, one can leverage domain properties such as location invariance, scale invariance, and coherence by using convolutional layers (Goodfellow et al., 2016). Alternatively, for graph data, graph convolutional networks were suggested to leverage adjacency patterns present in datasets structured as a graph (Kipf & Welling, 2016; Xu et al., 2019).
However, there has been little progress in learning deep representations for datasets that do not follow a particular known structure in the feature domain. Take for instance the case of a simple tabular dataset for disease diagnosis. Such a dataset may consist of features from different categories such as demographics (e.g., age, gender, income, etc.), examinations (e.g., blood pressure, lab results, etc.), and other clinical conditions. In this scenario, the lack of any known structure between features to be used as a prior would lead to the use of a fully-connected multilayer perceptron network (MLP). Nonetheless, it has been known in the literature that MLP architectures, due to their huge complexity, do not usually admit efficient training and generalization for networks of more than a few layers.
In this paper, we propose Group-Connected Multilayer Perceptron (GMLP) networks. The main idea behind GMLP is to learn and leverage expressive feature subsets, henceforth referred to as feature groups. A feature group is defined as a subset of features that provides a meaningful representation or high-level concept that would help the downstream task. For instance, in the disease diagnosis example, the combination of a certain blood factor and age might be the indicator of a higher-level clinical condition which would help the final classification task. Furthermore, GMLP leverages feature groups by limiting network connections to local group-wise connections and builds a feature hierarchy by merging groups as the network grows in depth. GMLP can be seen as an architecture that learns expressive feature combinations and leverages them via group-wise operations.
The main contributions of this paper are as follows: (i) proposing a method for end-to-end learning of expressive feature combinations, (ii) suggesting a network architecture to utilize feature groups and local connections to build deep representations, and (iii) conducting extensive experiments demonstrating the effectiveness of GMLP as well as visualizations and ablation studies for better understanding of the suggested architecture.
We evaluated the proposed method on five different real-world datasets in various application domains and demonstrated the effectiveness of GMLP compared to state-of-the-art methods in the literature. Furthermore, we conducted ablation studies and comparisons to study different architectural and training factors as well as visualizations on MNIST and synthesized data. To help reproduce the results and encourage future studies on group-connected architectures, we made the source code related to this paper available online (we plan to include a link to the source code and GitHub page related to this paper in the camera-ready version).
2 Related Work
Fully-connected MLPs are the most widely-used neural models for datasets in which no prior assumption is made on the relationship between features. However, due to the huge complexity of fully-connected layers, MLPs are prone to overfitting, resulting in shallow architectures limited to a few layers in depth (Goodfellow et al., 2016). Various techniques have been suggested to improve training these models, which include regularization techniques such as L1/L2 regularization and dropout, and normalization techniques such as layer normalization, weight normalization, and batch normalization (Srivastava et al., 2014; Ba et al., 2016; Salimans & Kingma, 2016; Ioffe & Szegedy, 2015). For instance, self-normalizing neural networks (SNNs) have been recently suggested as state-of-the-art normalization methods that prevent vanishing or exploding gradients, which helps training feed-forward networks with higher depths (Klambauer et al., 2017).
From the architectural perspective, there has been great attention toward networks consisting of sparse connections between layers rather than having dense fully-connected layers (Dey et al., 2018). Sparsely connected neural networks are usually trained based on either a sparse prior structure over the network architecture (Richter & Wattenhofer, 2018) or based on pruning a fully-connected network to a sparse network (Yun et al., 2019; Tartaglione et al., 2018; Mocanu et al., 2018). However, it should be noted that most of the sparse neural network literature has focused on improving the memory and compute requirements while maintaining competitive accuracies compared to MLPs.
As a parallel line of research, the idea of using expressive feature combinations or groups has been suggested as a prior over the feature domain. Perhaps the most successful and widespread use of this idea is in creating random forest models in which different trees are trained based on different feature subsets in order to deal with high-dimensional and high-variance data (Breiman, 2001). More recently, feature grouping was suggested by Aydore et al. (2019) as a statistical regularization technique to learn from datasets of large feature size and a small number of training samples. They do the forward network computation by projecting input features using samples taken from a bank of feature grouping matrices, reducing the input layer complexity and regularizing the model. In another recent study, Ke et al. (2018) used expressive feature combinations to learn from tabular datasets using a recursive encoder with a shared embedding network. They suggest a recursive architecture in which more important feature groups have a more direct impact on the final prediction.
While promising results have been reported using these methods, feature grouping has been mostly considered as a preprocessing step. For instance, Aydore et al. (2019) use the recursive nearest agglomeration (ReNA) (Hoyos-Idrobo et al., 2018) clustering to determine feature groups prior to the analysis. Alternatively, Ke et al. (2018) defined feature groups based on a pre-trained gradient boosting decision tree (GBDT) (Friedman, 2001). Feature grouping as a preprocessing step not only increases the complexity and raises practical considerations, but also limits the optimality of the selected features in subsequent analysis. In this study, we propose an end-to-end solution to learn expressive feature groups. Moreover, we introduce a network architecture to exploit interrelations within the feature groups to reduce the network complexity and to train deeper representations.
3 Proposed Method
3.1 Architecture Overview
In this paper, we propose GMLP, which intuitively can be broken down into three stages: selecting expressive feature groups, learning dynamics within each group individually, and merging information between groups as the network grows in depth (see Figure 1). In this architecture, expressive groups are jointly selected during the training phase. Furthermore, GMLP leverages feature groups and uses local group-wise weight layers to significantly reduce the number of parameters. While the suggested idea can be materialized as different architectures, in the current study, we suggest organizing the network as architectures resembling a binary tree spanning from leaves (i.e., features) to a certain abstraction depth closer to the root (please note that, in this paper, tree structures are considered to grow from leaves to the root; in other words, in this context, limiting the depth is synonymous with considering the tree portion spanning from a certain depth to the leaf nodes). As the network grows deeper, after each local group-wise weight layer, half of the groups are merged using pooling operations, effectively reducing the width of the network while increasing the receptive field. At the last layer, all features within all groups are concatenated into a dense feature vector fed to the output layer.
3.2 Notation
We consider the generic problem of supervised classification based on a dataset of feature and target pairs, $\mathcal{D}: (x_{1:N}, y_{1:N})$, where $x_i \in \mathbb{R}^{d}$, $y_i \in \{1, \dots, C\}$, and $N$ is the number of dataset samples. Furthermore, we define group size, $m$, as the number of neurons or elements within each group, and group count, $k$, as the number of selected groups which are essentially subsets of input features. Also, $L$ is used to refer to the total depth of a network. We use $z^{l}_{i} \in \mathbb{R}^{m}$ to refer to activation values of group $i$ in layer $l$. In this paper, we define all vectors as column vectors.
3.3 Network Layers
In this section, we present the formal definition of different GMLP network layers. The very first layer of the network, GroupSelect, is responsible for organizing features into $k$ groups of size $m$ each. A routing matrix, $\Psi$, is used for connecting each neuron within each group to exactly one feature in the feature set:
$$z^{0}_{1:k} = \Psi x, \qquad (1)$$
where $\Psi \in \{0,1\}^{km \times d}$ is a sparse matrix determining features that are present in each group. As we are interested in jointly learning $\Psi$ during the training phase, we use the following continuous relaxation:
$$\Psi_{i,j} \approx \frac{\exp(\psi_{i,j}/\tau)}{\sum_{j'=1}^{d} \exp(\psi_{i,j'}/\tau)}. \qquad (2)$$
In this equation, $\psi$ is a real-valued matrix reparameterizing the routing matrix through a softmax operation with temperature, $\tau$. The lower the temperature, the more (2) converges to the desired discrete and sparse binary routing matrix. Note that, in the continuous relaxation, the matrix $\psi$ can be optimized via the backpropagation of gradients from classification loss terms. In the next section, we provide further detail on temperature annealing schedules as well as other techniques to enhance the $\Psi$ approximation.
Based on selected groups, we suggest local fully-connected weight layers for each group: GroupFC. The goal of GroupFC is to extract higher-level representations using the selected expressive feature subsets. This operation is usually followed by nonlinearity functions (e.g., ReLU), normalization operations (e.g., Batch Norm), and dropout. Formally, GroupFC can be defined as:
$$z^{l+1}_{i} = f(W^{l}_{i} z^{l}_{i} + b^{l}_{i}), \qquad (3)$$
where $W^{l}_{i} \in \mathbb{R}^{m \times m}$ and $b^{l}_{i} \in \mathbb{R}^{m}$ are the weight matrix and bias vector, applied on group $i$ at layer $l$. Here, $f$ represents other subsequent operations such as nonlinearity, normalization, and dropout.
Lastly, GroupPool is defined as an operation which merges representations of two groups into a single group, reducing network width by half while increasing the effective receptive field:
$$z^{l+1}_{i} = \mathrm{pool}(z^{l}_{i}, z^{l}_{i + k_l/2}), \qquad (4)$$
where $k_l$ is the number of groups at layer $l$; $z^{l}_{i}$ and $z^{l}_{i + k_l/2}$ are the $i$th group from the first and second halves, respectively; and $\mathrm{pool}$ is a pooling function from $\mathbb{R}^{2m}$ to $\mathbb{R}^{m}$. In this study, we explore different variants of pooling functions such as max pooling, average pooling, or using linear weight layers as transformations from $\mathbb{R}^{2m}$ to $\mathbb{R}^{m}$. Please note that while we use a similar terminology as pooling in convolutional networks, the pooling operation explained here is not applied location-wise, but instead it is applied feature-wise, between different group pairs.
The values of $k$ and $m$ are closely related to the number and order of feature interactions for a certain task. Using proper $k$ and $m$ values enables us to reduce the parameter space while maintaining the model complexity required to solve the task. However, finding the ideal $k$ and $m$ directly from a given dataset is a very challenging problem. In this work, we treat $k$ and $m$ as hyperparameters to be found by a hyperparameter search.
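As a rough illustration of how the three layer types fit together, the following pure-Python sketch runs one forward pass on toy data. It is not the implementation used for the experiments: the function names, the toy sizes, the random weights, and the choice of ReLU and max pooling are all assumptions made for this example, and normalization and dropout are omitted.

```python
import math
import random

def softmax(row, tau):
    """Temperature softmax over one row of routing logits."""
    peak = max(row)
    exps = [math.exp((v - peak) / tau) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def group_select(x, psi, k, m, tau):
    """Soft routing: each of the k*m neurons is a softmax-weighted mix of features."""
    flat = [sum(w * xi for w, xi in zip(softmax(row, tau), x)) for row in psi]
    return [flat[g * m:(g + 1) * m] for g in range(k)]

def group_fc(groups, weights, biases):
    """Apply a separate m-by-m affine map followed by ReLU to every group."""
    out = []
    for z, W, b in zip(groups, weights, biases):
        out.append([max(0.0, sum(wij * zj for wij, zj in zip(Wi, z)) + bi)
                    for Wi, bi in zip(W, b)])
    return out

def group_pool(groups):
    """Merge group i of the first half with group i of the second half (max)."""
    half = len(groups) // 2
    return [[max(a, b) for a, b in zip(groups[i], groups[i + half])]
            for i in range(half)]

def forward(x, psi, layers, k, m, tau):
    groups = group_select(x, psi, k, m, tau)
    for weights, biases in layers:
        groups = group_fc(groups, weights, biases)
        groups = group_pool(groups)
    # Concatenate the remaining groups into one vector for the output layer.
    return [v for g in groups for v in g]

# Toy configuration (assumed values, chosen only to keep the example small).
random.seed(0)
d, k, m, tau = 10, 4, 3, 0.5
x = [random.gauss(0, 1) for _ in range(d)]
psi = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k * m)]
layers = []
groups_at_layer = k
for _ in range(2):
    W = [[[random.gauss(0, 0.5) for _ in range(m)] for _ in range(m)]
         for _ in range(groups_at_layer)]
    b = [[0.0] * m for _ in range(groups_at_layer)]
    layers.append((W, b))
    groups_at_layer //= 2
features = forward(x, psi, layers, k, m, tau)
```

With four groups and two layers, the network width shrinks from four groups to two and then to one, so `features` has `m` entries. Lowering `tau` pushes each routing row toward a one-hot selection of a single feature.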
3.4 Training
We define the objective function to be used for end-to-end training of weights as well as the routing matrix as:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log F_{\theta}(x_i)_{y_i} + \lambda \mathcal{L}_{H} + \alpha \lVert \theta \rVert_{2}^{2}. \qquad (5)$$
In this objective function, the first term is the standard cross-entropy classification loss where $F_{\theta}$ denotes the GMLP network as a function with parameters $\theta$, and $N$ is the number of training samples used. The second term is an entropy loss over the distribution of the routing matrix that is weighted by the hyperparameter $\lambda$:
$$\mathcal{L}_{H} = -\sum_{i=1}^{km} \sum_{j=1}^{d} \mathrm{softmax}(\psi_{i,1:d})_{j} \log \mathrm{softmax}(\psi_{i,1:d})_{j}. \qquad (6)$$
$\mathcal{L}_{H}$ is minimizing the entropy corresponding to the distribution of $\psi$ regardless of the temperature used for approximation. Accordingly, $\lambda$ can be viewed as a hyperparameter and as an additional method for encouraging sparse $\Psi$ matrices. The last term in (5) is an L2 regularization term with the hyperparameter $\alpha$ to control the magnitude of parameters in layer weights and in $\psi$. Note that without the L2 regularization term, $\psi$ elements may keep increasing during the optimization loop, since $\psi$ only appears in normalized form in the objective function of (5).
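To make the role of the entropy term concrete, the sketch below computes the entropy of each routing row after a temperature-free softmax and sums over rows. This is a minimal illustration under stated assumptions: the name `routing_entropy`, the unnormalized sum over rows, and the toy logit rows are ours and are not taken from the released code.

```python
import math

def softmax(row):
    """Softmax at temperature 1, shifted by the row maximum for stability."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def routing_entropy(psi):
    """Sum of row entropies of softmax(psi); lower means sparser routing."""
    total = 0.0
    for row in psi:
        for p in softmax(row):
            if p > 0.0:  # skip exact zeros to avoid log(0)
                total -= p * math.log(p)
    return total

# A diffuse row spreads weight over all features; a peaked row picks one.
diffuse = [[0.0, 0.0, 0.0, 0.0]]
peaked = [[10.0, 0.0, 0.0, 0.0]]
entropy_diffuse = routing_entropy(diffuse)
entropy_peaked = routing_entropy(peaked)
```

A uniform row over four features attains the maximum entropy, log 4, while the peaked row is close to zero, which is the behavior the entropy penalty rewards.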
We use the Adam (Kingma & Ba, 2014) optimization algorithm, starting from the default 0.001 learning rate and reducing the learning rate by a factor of 5 as the validation accuracy stops improving. Regarding the temperature annealing, during the training, the temperature is exponentially decayed from 1.0 to 0.01. In order to initialize the GroupFC weights, we used Xavier initialization (Glorot & Bengio, 2010) with $m$ for both fan-in and fan-out values. Similarly, the $\psi$ matrix is initialized by setting the fan-in equal to $d$ and fan-out to $km$.
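One simple way to realize an exponential decay of the temperature from 1.0 to 0.01 is geometric interpolation over the training steps, as sketched below. The per-step granularity, the function name `temperature`, and its arguments are assumptions for illustration; the text above only fixes the start and end values.

```python
def temperature(step, total_steps, tau_start=1.0, tau_end=0.01):
    """Exponential interpolation from tau_start (step 0) to tau_end (last step)."""
    if total_steps <= 1:
        return tau_end
    frac = step / (total_steps - 1)
    return tau_start * (tau_end / tau_start) ** frac

# Temperatures at five evenly spaced points of a 101-step run.
schedule = [temperature(s, 101) for s in (0, 25, 50, 75, 100)]
```

Under this schedule the temperature is multiplied by the same factor at every step, so half-way through training it has reached the geometric mean of the two endpoints.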
Further detail on architectures and hyperparameters used for each specific experiment as well as details on the software implementation are provided as appendices to this paper.
3.5 Analysis
The computational complexity of GMLP at the prediction time can be written as (for simplicity, ignoring bias and pooling terms):
$$km + km^{2} + \frac{km^{2}}{2} + \frac{km^{2}}{4} + \dots + \frac{km^{2}}{2^{L-1}} + \frac{kmC}{2^{L-1}}. \qquad (7)$$
In this series, the first term, $km$, is the work required to organize features to groups. The subsequent terms, except the last term, represent the computational cost of local fully-connected operations at each layer. The last term is the complexity of the output layer transformation from the concatenated features to the number of classes. Therefore, the computational complexity of GMLP at the prediction time can be written as $O(km^{2})$. In comparison, the computational complexity of an MLP with a similar network width would be:
$$dkm + (km)^{2} + \dots + (km)^{2} + kmC, \qquad (8)$$
where the first term is the work required for the first network layer from $d$ to $km$ neurons, the second term corresponds to a hidden layer of size $km$, and so forth. The last term is the complexity of the output layer, similar to the case of GMLP. The overall work required from this equation is of $O(dkm + Lk^{2}m^{2})$ complexity. This is substantially higher than GMLP, for typical $k$, $m$, and $d$ values.
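The comparison can be checked numerically with a small operation counter. The sketch below is only an approximation under explicit assumptions: hard routing at prediction time, a GroupPool after every GroupFC layer, bias and pooling costs ignored, and an MLP whose hidden width equals the initial GMLP width. The function names `gmlp_ops` and `mlp_ops` are hypothetical.

```python
def gmlp_ops(k, m, num_layers, num_classes):
    """Approximate multiply count for GMLP with pooling after every layer."""
    ops = k * m  # routing: one selected feature per group neuron
    groups = k
    for _ in range(num_layers):
        ops += groups * m * m  # one m-by-m GroupFC per remaining group
        groups = max(1, groups // 2)  # GroupPool halves the number of groups
    ops += groups * m * num_classes  # output layer on the concatenated groups
    return ops

def mlp_ops(d, width, num_layers, num_classes):
    """Approximate multiply count for a dense MLP of constant hidden width."""
    ops = d * width  # input layer
    ops += (num_layers - 1) * width * width  # hidden-to-hidden layers
    ops += width * num_classes  # output layer
    return ops

# Toy sizes (assumed): 4 groups of 2 neurons, 2 layers, 3 classes, 10 features.
gmlp_cost = gmlp_ops(k=4, m=2, num_layers=2, num_classes=3)
mlp_cost = mlp_ops(d=10, width=4 * 2, num_layers=2, num_classes=3)
```

Even at these toy sizes the dense network needs several times more multiplications, and the gap widens as the number of groups grows because the GroupFC cost is linear in the group count while the dense cost is quadratic in the width.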
Additionally, the density of the GroupFC layer connections can be calculated as $km^{2}/(km)^{2} = 1/k$, which is very small for the reasonably large $k$ values used in our experiments. Also, assuming pooling operations in every other layer, the receptive field size, or the maximum number of features impacting a neuron at layer $l$, can be written as $2^{l-1}m$. For instance, a neuron in the first layer of the network is only connected to $m$ features, a neuron in the second layer is connected to two groups or $2m$ features, and so forth.
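Both quantities are easy to tabulate. The following sketch assumes a 1-indexed layer count in which the receptive field doubles from one layer to the next, matching the first-layer and second-layer example above; the helper names are illustrative only.

```python
def group_fc_density(k, m):
    """Fraction of nonzero weights relative to a dense (k*m)-by-(k*m) layer."""
    return (k * m * m) / ((k * m) ** 2)

def receptive_field(layer, m):
    """Maximum number of input features reaching a neuron at a 1-indexed layer."""
    return (2 ** (layer - 1)) * m

# Example with 4 groups of 16 neurons (assumed toy values).
density = group_fc_density(k=4, m=16)
fields = [receptive_field(layer, m=16) for layer in (1, 2, 3)]
```

The density depends only on the number of groups, so doubling the group count halves the fraction of connections kept relative to a dense layer of the same width.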
4 Experiments
4.1 Experimental Setup
The proposed method is evaluated on five different real-world datasets, covering various domains and applications: permutation-invariant CIFAR-10 (Krizhevsky et al., 2009), human activity recognition (HAPT) (Anguita et al., 2013), diabetes classification (Kachuee et al., 2019), UCI Landsat (Dua & Graff, 2017), and MIT-BIH arrhythmia classification (Moody & Mark, 2001) datasets. Additionally, we use MNIST (LeCun et al., 2010) and a synthesized dataset to provide further insight into the operation of GMLP (see Section 4.4). Table 1 presents a summary of datasets used in this study. Regarding the CIFAR-10 dataset, we permute the image pixels to discard pixel coordinates in our experiments. Note that the permutation does not change across samples; it is merely a fixed random ordering used to remove pixel coordinates for each experiment. For all datasets, basic statistical normalization to zero mean and unit standard deviation is used to normalize features as a preprocessing step. The only exception is CIFAR-10, for which we used the standard channel-wise normalization and standard data augmentation (i.e., random crops and random horizontal flips). The standard test and train data splits were used as dictated by dataset publishers. In cases where separate sets are not provided, test and train subsets are created by randomly splitting the samples into a held-out test portion and the remainder for training/validation.
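The two generic preprocessing steps, a fixed feature permutation and per-feature standardization, can be sketched as follows. This is an illustrative pure-Python version for small tabular data, not the data loader used in the experiments: the helper names, the seed, and the rule of replacing a zero standard deviation by 1.0 are assumptions, and the channel-wise CIFAR-10 normalization is not covered.

```python
import random
import statistics

def make_permutation(num_features, seed=0):
    """Draw one random feature ordering that is reused for every sample."""
    order = list(range(num_features))
    random.Random(seed).shuffle(order)
    return order

def permute(sample, order):
    """Reorder the features of one sample with the shared permutation."""
    return [sample[j] for j in order]

def fit_standardizer(rows):
    """Per-feature mean and population standard deviation from training rows."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant features
    return means, stds

def standardize(row, means, stds):
    """Shift and scale one row to zero mean and unit standard deviation."""
    return [(v - mu) / sd for v, mu, sd in zip(row, means, stds)]

order = make_permutation(5, seed=1)
train_rows = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_standardizer(train_rows)
normalized = standardize([3.0, 10.0], means, stds)
```

Because the ordering is drawn once and then applied to every sample, the spatial layout is hidden from the model while the correspondence between a feature index and its meaning stays fixed across the dataset.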
We compare the performance of the proposed method with recent related work including Self-Normalizing Neural Networks (SNN) (Klambauer et al., 2017), Sparse Evolutionary Training (SET) (Mocanu et al., 2018) (https://github.com/dcmocanu/sparseevolutionaryartificialneuralnetworks), and Feature Grouping as a Stochastic Regularizer (in this paper, denoted as FGR) (Aydore et al., 2019) (https://github.com/sergulaydore/FeatureGroupingRegularizer), as well as the basic dropout-regularized and batch-normalized MLPs. In order to ensure a fair comparison, we adapted source codes provided by other work to be compatible with our data loader and preprocessing modules.
Furthermore, for each method, we conducted an extensive hyperparameter search using the Microsoft Neural Network Intelligence (NNI) toolkit (https://github.com/microsoft/nni) and the Tree-structured Parzen Estimator (TPE) tuner (Bergstra et al., 2011), covering different architectural and learning hyperparameters for each case. More detail on hyperparameter search spaces and specific architectures used in this paper is provided in Appendix A and Appendix B. We run each case using the best hyperparameter configuration eight times and report mean and standard deviation values.
4.2 Results
Table 2 presents a comparison between the proposed method (GMLP) and four other baselines: MLP, SNN (Klambauer et al., 2017), SET (Mocanu et al., 2018), and FGR (Aydore et al., 2019). As can be seen from this comparison, GMLP outperforms other work, achieving state-of-the-art classification accuracies. Concerning the CIFAR-10 results, to the best of our knowledge, GMLP achieves a new state-of-the-art performance on permutation-invariant CIFAR-10 augmented using the standard data augmentation. We believe that leveraging expressive feature groups enables GMLP to consistently perform better across different datasets.
To compare model complexity and performance, we conduct an experiment by changing the number of model parameters and reporting the resulting test accuracies. Here, we reduce the number of parameters by reducing the width of each network, i.e., reducing the number of groups and hidden neurons for GMLP and MLP, respectively. Figure 2 shows accuracy versus the number of parameters for the GMLP and MLP baseline on the CIFAR-10 and MIT-BIH datasets. Based on this figure, GMLP is able to achieve higher accuracies using significantly fewer parameters. This is consistent with the complexity analysis provided in Section 3.5. Note that in this comparison, we consider the number of parameters involved at the prediction time.
4.3 Ablation Study
Figure 3 presents an ablation study comparing the performance of GMLP on the CIFAR-10 dataset for networks trained: (a) using both the temperature annealing and the entropy loss objective, (b) using only temperature annealing without the entropy loss objective, (c) using no temperature annealing but using the entropy loss objective, and (d) using neither the temperature annealing nor the entropy loss objective. From this figure, it can be seen that excluding both techniques leads to a significantly lower performance. However, using either of the two techniques leads to relatively similar high accuracies. This is consistent with the intuition that the function of these techniques is to encourage learning sparse routing matrices, using either softmax temperatures or entropy regularization to achieve this. In this paper, in order to ensure sparse and low-complexity routing matrices, we use both techniques simultaneously as in case (a).
Figure 4 shows a comparison between GMLP models trained on CIFAR-10 using different pooling types: linear transformation, max pooling, and average pooling. As can be seen from this comparison, while there are slight differences in the convergence speed of using different pooling types, all of them achieve relatively similar accuracies. In our experiments, we decided to use max pooling and average pooling as they provide reasonable results without the need to introduce additional parameters required for the linear pooling method.
Figure 5 shows learning curves for training CIFAR-10 GMLP models using different group sizes. As can be seen from this figure, using very small group sizes would cause a reduction in the final accuracy. At the other extreme, the improvement achieved using larger $m$ values is negligible for values more than 16. Finally, Figure 6 shows a comparison between learning curves for using a different number of groups. Using very small $k$ values results in a significant reduction in performance. However, the rate of performance gains for using more groups is very small for $k$ of more than 1536. Note that the number of model parameters and compute scales linearly with $k$ and quadratically with $m$ (see Section 3.5).
4.4 Experiments on MNIST and Synthesized Data
The MNIST dataset (LeCun et al., 2010) is used to visually inspect the performance of the GroupSelect layer. Figure 7 shows a heatmap of how frequently each pixel is selected across all feature groups for: (a) original MNIST samples, and (b) MNIST samples where the lower half is replaced by Gaussian noise. From Figure 7(a), it can be seen that most groups are selecting pixels within the center of the frame, effectively discarding margin pixels. This is consistent with other work which shows the importance of different locations for MNIST images (Kachuee et al., 2018). Apart from this, in Figure 7(b), a version of the MNIST dataset is used in which half of the frame does not provide any useful information for the downstream classification task. From this figure, it can be seen that GMLP does not select any features from the lower region.
In order to show the effectiveness of GMLP, we synthesized a dataset which has intrinsic and known expressive feature groups. Specifically, we used a simple Bayesian network as depicted in Figure 8. This network consists of six binary features, A to F, interacting with each other as specified by the graph edges, which determine the distribution of the target node, J. The graph and conditionals are designed such that each of the nodes in the second level takes the XOR value of its parents with a probability. The target node, J, is essentially one with a high probability if at least two of the second-level nodes are one. We synthesized the dataset by sampling 6,400 samples from the network (1,280 samples for test and the rest for training/validation). On this dataset, we trained a very simple GMLP consisting of four groups of size two, one group-wise fully-connected layer, and an output layer. Figure 9 shows the features selected for each group after the training phase (i.e., the routing matrix). From this figure, it can be seen that the GroupSelect layer successfully learns to detect the feature pairs that are interacting, enabling the GroupFC layers to decode the nonlinear XOR relations.
5 Discussion
Intuitively, training a GMLP model with certain groups can be viewed as a prior assumption over the number and order of interactions between the features. It is a reasonable prior assumption because in many natural datasets, a conceptual hierarchy exists where only a limited number of features interact with each other. Furthermore, it is consistent with findings on the human decision-making process, which indicate that we are only able to consider at most nine factors at the same time during a decision-making process (Cowan, 2001; Baddeley, 1994).
Additionally, GMLP can be considered as a more general neural counterpart of random forests. Both models use different subsets of features (i.e., groups) and learn interactions within each group. One major difference between the two methods is the fact that GMLP combines information between different groups using pooling operations, while random forest uses the selected features to train an ensemble of independent trees on each group. From another perspective, the idea of studying feature groups is closely related to causal models such as Bayesian networks and factor graphs (Darwiche, 2009; Neapolitan et al., 2004; Clifford, 1990). These methods are often impractical for large-scale problems, because without a prior over the causal graph, they require an architecture search of NP-complete complexity or more.
6 Conclusion
In this paper, we proposed GMLP as a solution for deep learning in domains where the feature interactions are not known a priori and do not admit the use of convolutional or other techniques leveraging domain priors. GMLP jointly learns expressive feature combinations and employs group-wise operations to reduce the network complexity. We conducted extensive experiments demonstrating the effectiveness of the proposed idea and compared the achieved performances with state-of-the-art methods in the literature.
References
 Anguita et al. (2013) Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In ESANN, 2013.
 Aydore et al. (2019) Sergul Aydore, Bertrand Thirion, and Gael Varoquaux. Feature grouping as a stochastic regularizer for high-dimensional structured data. In International Conference on Machine Learning, pp. 385–394, 2019.
 Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 Baddeley (1994) Alan Baddeley. The magical number seven: Still magic after all these years? 1994.
 Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Advances in neural information processing systems, pp. 2546–2554, 2011.
 Breiman (2001) Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
 Clifford (1990) Peter Clifford. Markov random fields in statistics. Disorder in physical systems: A volume in honour of John M. Hammersley, 19, 1990.
 Cowan (2001) Nelson Cowan. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and brain sciences, 24(1):87–114, 2001.
 Darwiche (2009) Adnan Darwiche. Modeling and reasoning with Bayesian networks. Cambridge university press, 2009.
 Dey et al. (2018) Sourya Dey, Kuan-Wen Huang, Peter A Beerel, and Keith M Chugg. Characterizing sparse connectivity patterns in neural networks. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–9. IEEE, 2018.
 Dua & Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
 Friedman (2001) Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232, 2001.

 Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256, 2010.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
 Hoyos-Idrobo et al. (2018) Andrés Hoyos-Idrobo, Gaël Varoquaux, Jonas Kahn, and Bertrand Thirion. Recursive nearest agglomeration (ReNA): fast clustering for approximation of structured signals. IEEE transactions on pattern analysis and machine intelligence, 41(3):669–681, 2018.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

 Kachuee et al. (2018) Mohammad Kachuee, Sajad Darabi, Babak Moatamed, and Majid Sarrafzadeh. Dynamic feature acquisition using denoising autoencoders. IEEE transactions on neural networks and learning systems, 2018.
 Kachuee et al. (2019) Mohammad Kachuee, Kimmo Karkkainen, Orpaz Goldstein, Davina Zamanzadeh, and Majid Sarrafzadeh. Nutrition and health data for cost-sensitive learning. arXiv preprint arXiv:1902.07102, 2019.
 Ke et al. (2018) Guolin Ke, Jia Zhang, Zhenhui Xu, Jiang Bian, and Tie-Yan Liu. TabNN: A universal neural network solution for tabular data. 2018.
 Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Klambauer et al. (2017) Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In Advances in neural information processing systems, pp. 971–980, 2017.
 Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2:18, 2010.
 Mocanu et al. (2018) Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):2383, 2018.
 Moody & Mark (2001) George B Moody and Roger G Mark. The impact of the MIT-BIH arrhythmia database. IEEE Engineering in Medicine and Biology Magazine, 20(3):45–50, 2001.
 Neapolitan et al. (2004) Richard E Neapolitan et al. Learning bayesian networks, volume 38. Pearson Prentice Hall Upper Saddle River, NJ, 2004.
 Richter & Wattenhofer (2018) Oliver Richter and Roger Wattenhofer. Treeconnect: A sparse alternative to fully connected layers. In 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 924–931. IEEE, 2018.
 Salimans & Kingma (2016) Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
 Tartaglione et al. (2018) Enzo Tartaglione, Skjalg Lepsøy, Attilio Fiandrotti, and Gianluca Francini. Learning sparse neural networks via sensitivity-driven regularization. In Advances in Neural Information Processing Systems, pp. 3878–3888, 2018.
 Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ryGs6iA5Km.
 Yun et al. (2019) Jihun Yun, Peng Zheng, Eunho Yang, Aurelie Lozano, and Aleksandr Aravkin. Trimming the regularizer: Statistical analysis, optimization, and applications to deep learning. In International Conference on Machine Learning, pp. 7242–7251, 2019.
Appendix A Hyperparameter Search Space
Tables 3-5 present the hyperparameter search space considered for experiments on GMLP, MLP, SNN, and FGR, respectively. For the GMLP search space, the number of groups is adjusted based on the number of features and samples in each specific task. Also, the number of layers is adjusted to be compatible with the number of groups being used. Regarding the FGR experiments, due to scalability issues of the published source provided by the original authors, we were only able to train networks with at most two hidden layers. For SET, as their architecture is evolutionary (i.e., it prunes certain weights and adds new ones), we only explored using a different number of hidden neurons in the range of 500 to 4000.
Regarding the number of epochs, we used 2000 epochs for CIFAR-10, 1000 epochs for HAPT, 100 epochs for Diabetes, 300 epochs for Landsat, and 300 epochs for MIT-BIH experiments. The only exception is the SNN experiments, where we had to reduce the learning rate to increase the stability of the training, resulting in more epochs required to converge.
Appendix B Architectures
Tables 6, 7, 8, 9, and 10 show the selected architectures for GMLP, MLP, SNN, SET, and FGR, respectively. We used the following notation to indicate different layer types and parameters: GSelkm represents a GroupSelect layer selecting k groups of m features each. GFC indicates GroupFC layers, and FCx represents a fully-connected layer with x hidden neurons. GPoolx is a GroupPool layer of type x (max, mean, linear, etc.). Concat is the concatenation of groups used prior to the output layer in GMLP architectures. SCx refers to a SET sparse evolutionary layer of size x.
Appendix C Software Implementation
Table 11 presents the list of software dependencies and versions used in our implementation. To produce results related to this paper, we used a workstation with 4 NVIDIA GeForce RTX 2080 Ti GPUs, a 12-core Intel Core i9-7920X processor, and 128 GB memory. Each experiment took between about 30 minutes and 72 hours, based on the task and method being tested.
Appendix D Visual Analysis
In Figure 10, we present a visualization of the selected features for 25 randomly selected groups in our final CIFAR-10 architecture. Red, green, and blue colors indicate which channel is selected for each location. Compared to visualizations that are frequently used for convolutional networks, as GMLP has the flexibility to select pixels at different locations and different color channels, it is not easy to find explicit patterns in this visualization. However, one noticeable pattern is that features selected from a certain color channel usually appear in clusters resembling irregularly shaped patches.
Figure 11 shows the frequency with which each CIFAR-10 location is selected by the GMLP network. From this visualization, it can be seen that GMLP is mostly ignoring the border areas, which can be a result of the data augmentation process used to train the network (i.e., randomly cropping the center area and padding the margins).