The success of machine learning using deep neural networks (DNNs) has led to the widespread adoption of machine learning in engineering. Rapid development of DNNs is therefore highly anticipated, and model-reuse techniques have been well studied. Many existing approaches to model reuse, such as transfer learning, involve retraining, which is cost-intensive. Although smaller models are preferable in terms of interpretability and verifiability, models that reuse well are typically large and hard to handle. We propose a modularization method that decomposes a DNN into small modules from a functionality perspective and then uses them to recompose a new model for some other task. The proposed method requires no retraining and has wide applicability, as it can easily be combined with existing functional modules. For example, for problems where only a subset of all classes needs to be predicted, the classification task should be solvable with a smaller network than the model trained on all classes. Concretely, for a 10-class dataset and a 5-class subset of it, the parameter size of the DNN required for a 5-class classifier can be smaller than that required for a 10-class classifier. The goal here is to decompose the 10-class classifier into subnetworks as DNN modules and then recompose those modules to quickly build a 5-class classifier without retraining.
In this study, we consider the use of neural networks in environments where the subset of classes to be classified changes frequently. In such settings, the model must be retrained every time the task changes, which is a major obstacle in practice. To solve this problem, we propose a novel method that extracts subnetworks specialized for single-class classification from a (large) trained model and combines these subnetworks per task to classify an arbitrary subset of classes. That is, we propose a methodology to immediately obtain a small subnetwork model capable of classifying an arbitrary set of subclasses from a neural network model (the trained model) trained on the original set of classes. The method involves the following two procedures: (1) we decompose the DNN classifier model trained on an N-class dataset and extract subnetworks that can solve the binary classification of each class c (c = 1, ..., N); (2) by combining the subnetworks, a composed modular network is constructed. In our approach, we modularize the neural network by pruning edges using supermasks Zhou et al. (2019). A supermask is a binary score matrix that indicates whether each edge of the network should be pruned or not.
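The core operation of procedure (1), superimposing a binary supermask on frozen trained weights, can be sketched as follows. This is a minimal NumPy illustration under our own naming, not the paper's implementation; `masked_forward` and the toy shapes are assumptions for the example.

```python
import numpy as np

def masked_forward(x, weights, masks):
    """Forward pass through linear layers whose weights are multiplied
    element-wise by binary supermasks; the weights themselves stay frozen."""
    h = x
    for W, M in zip(weights, masks):
        h = np.maximum((W * M) @ h, 0.0)  # ReLU((W masked by M) applied to h)
    return h

# Toy example: a 2-layer network with roughly half of its edges pruned.
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
Ms = [(rng.random(W.shape) > 0.5).astype(float) for W in Ws]  # binary masks
out = masked_forward(np.ones(3), Ws, Ms)
```

Because the mask only zeroes entries of the weight matrices, a module is fully specified by the trained weights plus its binary mask, which is what makes recomposition without retraining possible.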
It has been reported that even DNNs with random weights can be trained by only pruning edges with supermasks, without changing the weights of the network Zhou et al. (2019); Ramanujan et al. (2020). In this study, we apply supermasks to the trained model to extract the subnetworks required for the classification of the subclass set (Figure 1). We also propose a new algorithm called module stemming that constructs a partial network with a minimal number of parameters by sharing the learning of the supermask among modules. We demonstrate the effectiveness of our proposed method by comparing it with previous work on open datasets: MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100.
The contributions of this study are summarized below:
We applied supermasks to a trained DNN model and constructed module networks, which are subnetworks of the trained model, by pruning edges. We showed that a trained multiclass model can be decomposed into subnetworks that each classify a single class, and that the classification task for an arbitrary subclass set can be solved by a linear combination of these modules.
We proposed a new method for learning supermasks that prune similar edges across modules. By adding a consistency term on the layer-wise supermask scores to the loss function, we show that the supermasks learn to remove dissimilar edges among modules, enabling classification with a smaller parameter size after recomposing.
We demonstrated the effectiveness of our proposed method for decomposing and recomposing modular networks using open datasets.
Modular Neural Network
Why decompose a trained neural network into modules?
The modularization of neural networks has been reported in many studies Auda and Kamel (1999); Amer and Maul (2019); Hupkes et al. (2020). In these studies, networks are trained with architectures designed to take on a modular structure. A modular structure is said to be advantageous in the following ways:
The size of the network to be trained can be reduced, and the inference can be performed faster.
The analysis of networks becomes easier.
The effects of catastrophic forgetting McCloskey and Cohen (1989) in continual learning and incremental learning can be reduced.
Our approach is to extract and modularize subnetworks, each of which can infer only a single class, from a model trained as a general DNN. These subnetwork modules can then be combined to address new tasks. The purpose of this study is not to use a special network architecture with a modular structure but to extract, from a large-scale trained model, partial networks (subnetworks) at the functional level that satisfy the above conditions as modules. We treat these partial networks as module networks. Each module network is constructed as a binary classifier for a single class, so that multiple module networks can be combined to immediately obtain a model that can classify any subtask.
In this study, we expect modular networks to have the following properties.
Decomposability. Modular networks can be decomposed from a large network and have only a single classification capability. Each modular network should be smaller than the original network. Furthermore, the classification accuracy should be high.
Recomposability. Modules with a single classification capability can be combined with each other, and multiclass classification is possible by combining modules. Multiclass classification after recomposing should be performed with a high accuracy.
Reusability / Capability in small parameters. Modules can be combined with only a simple calibration depending on the task. They can be reused without requiring relearning from scratch. Each subnetwork should be represented with as few parameters as possible compared to the trained model.
An approach that extracts modules from a large trained model is expected to be faster than training a small model from scratch. Methods such as transfer learning, which trains large models on large data and fine-tunes them on a task, and self-supervised learning, which fits a problem dataset by pretraining on a large amount of unlabeled data, have been reported to perform promisingly. Within this framework of extracting well-performing modular networks from such large models, we propose a method for extracting subnetworks from existing models and validate it through experiments.
Neural Network Decomposition
Studies on the functional decomposition of neural networks (NNs) have been reported. One major approach to NN decomposition is to consider the NN as a graph. Clustering of network connections has been considered as a way to ensure the explainability and transparency of the model Watanabe et al. (2018); Filan et al. (2020). Existing reports show that relationships between edges can be identified through low-rank approximation Tai et al. (2016), weight masks Csordás et al. (2021), and ReLU activations Craighero et al. (2020). The goal of our study is modular decomposition for reuse, i.e., a decomposition that allows recomposition. Our approach is not to decompose the trained model into network layers (layer-wise) but to decompose it by the classes to be predicted (class-wise). For example, in a class-wise decomposition, a modified backpropagation was developed by modularizing and decomposing the N-class problem into N two-class problems Anand et al. (1995). Anand et al. trained each module to distinguish a class from its complement. Using a similar approach, Pan and Rajan Pan and Rajan (2020) proposed a decomposition and recomposition method for NNs that removes edges from the network. However, they did not address convolutional layers, and their experiments were limited to simple datasets such as MNIST; thus, the method is not applicable to more complex deep learning models. Studies on network decomposition are still at the preliminary-experiment stage.
Lottery Ticket Hypothesis and Supermask
Frankle and Carbin Frankle and Carbin (2019) proposed the lottery ticket hypothesis: an NN contains sparse subnetworks such that, if these subnetworks are initialized from scratch and trained, they can match the accuracy of the original network. According to their experimental results, the initial weights of the neural network contain subnetworks (called winning tickets) that are important for classification. They found the winning tickets by iteratively shrinking the size of the NN, masking out the weights with the lowest magnitudes. Zhou et al. Zhou et al. (2019) demonstrated that winning tickets achieve better-than-random performance without training. They proposed an algorithm to identify a supermask, a mask selecting a subnetwork of a randomly initialized neural network that achieves high accuracy without training. To improve the efficiency of learning and searching for supermasks, Ramanujan et al. Ramanujan et al. (2020) proposed the edge-popup algorithm, which learns a score for each edge of the supermask. Applying a supermask to an NN with random weights shows that a subnetwork capable of solving a specific task can be superimposed without changing the weights of the underlying model. Because the supermask does not change the weights of the original (randomly initialized) model, it is expected to prevent catastrophic forgetting, a known problem in continual learning. Accordingly, there has been widespread research on applying supermasks, especially to continual learning; examples include Piggyback Mallya et al. (2018) and Supsup Wortsman et al. (2020). These studies superimpose different tasks on a randomly initialized NN using supermasks. Following this approach, we can assume that a trained DNN for an N-class problem contains subnetworks that can solve the binary classification of each class.
By using this interesting property of NNs, we show experimentally that it is possible to extract a small subnetwork that can classify a single class.
This study aims to decompose the trained model for an N-class problem into subnetworks for single-class prediction tasks and to recompose NNs without retraining, as shown in Figure 2. In this section, we describe the details of modularization: decomposing the model using supermasks over the trained model, the learning process of the supermask, and recomposing the decomposed subnetworks into a new NN for prediction.
Training Mask
The module decomposition is achieved by applying supermasks, which indicate the edge-pruning information, to the trained network. We follow the edge-popup algorithm Ramanujan et al. (2020) to compute supermasks. For clarity, we briefly describe the edge-popup algorithm for training supermasks. Let x be the input and y = (W ⊙ M)x be the output of a layer, where W denotes the trained model weights, M is the supermask, and ⊙ indicates element-wise multiplication. With W fixed, the edge-popup algorithm learns a score matrix S and computes the mask via M = h(S), where the function h sets the top k% of entries in S to 1 and the remaining entries to 0. The edge-popup algorithm updates S on the backward pass, optimized via stochastic gradient descent (SGD). The score S is initialized through Kaiming normal initialization He et al. (2015) with rectified linear unit activation. However, if the edge-popup algorithm is applied in a straightforward manner, the edges that best predict each single class will remain, and the other edges will be pruned away. As a result, different modules use different edges for their predictions, which makes the model size very large when the modules are combined. Therefore, we propose a novel algorithm called module stemming that uses common edges among modules and can be recomposed with few parameters.
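The mask computation h(S) described above can be sketched as follows. This is an illustrative NumPy version under our own naming (`supermask_from_scores` is not from the paper); note that ties at the threshold may keep slightly more than k% of edges.

```python
import numpy as np

def supermask_from_scores(scores, k_percent):
    """h(S): set the top k% of score entries to 1 (kept edges), the rest to 0."""
    flat = scores.flatten()
    n_keep = max(1, int(round(len(flat) * k_percent / 100)))
    threshold = np.sort(flat)[-n_keep]          # k-th largest score
    return (scores >= threshold).astype(np.float32)

# Toy example: keep the 2 highest-scoring of 4 edges (k = 50%).
S = np.array([[0.9, 0.1],
              [0.5, 0.3]])
M = supermask_from_scores(S, 50)  # keeps the edges scored 0.9 and 0.5
```

In edge-popup the scores S are updated by SGD on the backward pass while the weights stay frozen; the hard thresholding here corresponds to the forward-pass mask.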
An additional fully connected layer called the grafting layer is placed after the last layer of the trained model (see Figure 1). This layer converts the N-class prediction output of the trained model into the single-class prediction output of a module. Single-class prediction estimates whether an instance belongs to a certain class or not. The weights in the grafting layer are trained by SGD with a low learning rate in the decomposition stage, but they are not retrained in the recomposition stage. Although it would be possible to reuse only the feature extractor of the trained model, as in transfer learning, we did not take this approach because the fully connected layers hold the largest number of trainable parameters, and retraining this part of the model is not in line with the reusability objective of this study. The grafting layer performs fine-tuning: it uses all N logits of the N-class trained model (if not masked) to predict the single-class classification, thus allowing it to outperform the original N-class trained model on single-class classification.
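The role of the grafting layer can be illustrated with a toy sketch. The weights below are hand-set for illustration only (in the method they are learned by SGD at a low learning rate); the function names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grafting_layer(logits, w, b):
    """Map the trained model's N class logits to a single-class (binary)
    prediction: one fully connected output unit with a sigmoid."""
    return sigmoid(logits @ w + b)

# Toy example: N = 10 logits reduced to one "is it class 3?" probability.
logits = np.zeros(10)
logits[3] = 5.0                        # trained model is confident in class 3
w = np.zeros(10); w[3] = 2.0; b = -1.0  # illustrative grafting weights
p = grafting_layer(logits, w, b)        # high probability for class-3 inputs
```

Because the layer sees all N logits rather than only the target class's logit, it can in principle exploit correlations between classes, which is the stated reason it can outperform the original classifier on the binary task.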
To obtain a subnetwork that can classify only the subtask, the parameter size of the module should be as small as possible. If the parameter size of the model obtained by recomposing the modules is not significantly different from that of the trained model, then the trained model could simply be used as is, which is contrary to the purpose of this study. To obtain a small recomposed model (subnetwork), we can reduce the model size not only by reducing the network size of individual modules but also by making the supermasks of the modules as similar as possible, constraining forward inference so that the same edges are used. In this way, we can extract and build small subnetworks from the trained model. In this section, we present initial score sharing and stemming loss as ideas for extracting small network modules.
Initial Score Sharing
Pruning of a network depends on its initial parameter values. To construct modules consisting of similar edges, we take the simple approach of sharing the initial values of the supermask scores between modules.
Stemming Loss
To train supermask scores for building a module, we propose a loss function that regularizes the layer-wise scores among the modules. Considering linear layers with the same number of neurons for simplicity, we define the following loss function L to learn the score matrix S_l of each layer l for a module, parameterized by λ, at each epoch t while training on batch data:

L = L_CE + λ Σ_l || S_l^t − S̄_l^{t−1} ||_p,    (1)

where L_CE denotes the cross-entropy loss, ||·||_p denotes the p-norm, S̄_l^{t−1} denotes the base scores of layer l in the previous epoch, and λ denotes a hyperparameter.
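As a sketch, the stemming loss can be computed as follows. This is illustrative code (names are ours) shown with p = 1; it maps directly onto the cross-entropy term plus the layer-wise score-regularization term.

```python
import numpy as np

def stemming_loss(ce_loss, scores, base_scores_prev, lam):
    """Total loss = cross-entropy + lam * sum over layers of
    || S_l - S_bar_l(prev epoch) ||_1, pulling each module's scores
    toward the shared base scores so modules keep similar edges."""
    reg = sum(np.abs(S - S_bar).sum()
              for S, S_bar in zip(scores, base_scores_prev))
    return ce_loss + lam * reg
```

When a module's scores already match the base scores, the regularizer vanishes and the loss reduces to plain cross-entropy; larger λ trades per-module accuracy for edge sharing.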
To construct a base module for computing the stemming loss, we tried two methods: (A) choosing a single class arbitrarily, constructing that module without the stemming loss, and then computing the others with the stemming loss; and (B) using the average of the scores of all modules as the base. Because method (A) provides no principled way to select which class to base the module on, it was verified by brute force. However, as a result of trial and error, the accuracy of method (A) was not equal to or better than that of method (B). In addition, with method (A) it is difficult to choose a good base class, and a good choice depends heavily on the dataset; therefore, we adopted method (B). We confirmed that efficient module stemming can be performed using method (B), which we call module co-training. The algorithm is shown in Alg. 1.
Given a subset of the N-class problem, the deployed model is a subnetwork of the N-class classifier derived by taking the union of the supermasks for the classes in the subset. Therefore, if we can compute similar supermasks via module stemming and solve the problem using common edges, the size of the deployed model becomes smaller. The deployed model makes predictions by sequentially applying the supermasks and the grafting layer and then voting by a confidence score (e.g., max softmax score) without retraining, as shown in Figure 2.
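The recomposition step, taking the union of the selected modules' supermasks and voting by confidence, can be sketched as follows (illustrative names, not the paper's code; confidences here are hand-set placeholders for the modules' grafting-layer outputs).

```python
import numpy as np

def union_mask(masks):
    """Union of per-class supermasks: an edge is kept if any module uses it."""
    u = masks[0].copy()
    for M in masks[1:]:
        u = np.maximum(u, M)
    return u

def vote(module_confidences):
    """Predict the class whose module reports the highest confidence score."""
    return max(module_confidences, key=module_confidences.get)

# Toy example: two single-class modules recomposed into a 2-class classifier.
u = union_mask([np.array([1., 0., 0., 1.]),   # mask of the "cat" module
                np.array([0., 0., 1., 1.])])  # mask of the "dog" module
pred = vote({"cat": 0.91, "dog": 0.40})
```

The more the per-class masks overlap, the closer the union is to a single module's mask, which is exactly what module stemming optimizes for.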
MNIST The MNIST dataset LeCun et al. (1998) is a well-known dataset used in many studies and consists of handwritten digit (0-9) images. There are 60,000 training examples and 10,000 test examples, with an equal number of examples for each class label.
Fashion-MNIST (F-MNIST) The F-MNIST dataset Xiao et al. (2017) consists of grayscale images of 10 different classes of clothing items. As with MNIST, there are 60,000 training examples and 10,000 test examples, with an equal number of examples for each class label.
CIFAR-10 / CIFAR-100 The CIFAR-10 and CIFAR-100 datasets Krizhevsky and Hinton (2009) each consist of 50,000 training images and 10,000 test images, drawn from 10 and 100 classes, respectively. All images have a 32 × 32 resolution.
SVHN The Street View House Numbers (SVHN) dataset Netzer et al. (2011) has 73,257 images in the training set and 26,032 images in the test set. All images are 32 × 32 colored digit images.
For evaluation, we use four fully connected models (FC1, FC2, FC3, FC4) that have 49 neurons per hidden layer and 1, 2, 3, and 4 hidden layers, respectively, following Pan and Rajan (2020). In addition, to verify the effect of convolutional layers with batch normalization and dropout, we use VGG16 Simonyan and Zisserman (2015) for CIFAR-10, ResNet50 He et al. (2016) for SVHN, and WideResNet28-10 Zagoruyko and Komodakis (2016) for CIFAR-100. These models were trained on the corresponding datasets for a fixed number of epochs using SGD with momentum. WideResNet28-10 was trained with RandAugment Cubuk et al. (2020), one of the leading augmentation methods. To evaluate the proposed algorithm, we used the results reported in Pan and Rajan (2020) as a baseline. They proposed six methods, of which we selected TI-I, TI-SNE, and CM-RIE, the methods that showed better performance in retaining similar edges between modules. Note that the methods proposed in Pan and Rajan (2020) are not algorithms for increasing the sharing of parameters between modules. Because their method is not applicable to models with convolutional layers, we compared against it using the FC models.
To evaluate the performance of the proposed method, we compared it with previous studies according to the evaluation criteria presented in this section to verify whether the constructed modules have the required properties.
We use the Jaccard Index (JI) to measure the similarity between decomposed modules. If the JI is 0, there are no shared edges between two modules; if it is 1, the two modules are identical. A higher JI indicates that modules can utilize similar NN paths for inference and that the recomposed model size is smaller.
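A minimal sketch of this similarity measure over two flattened binary masks, treating the kept (mask = 1) edges of each module as index sets (the function name is ours):

```python
def jaccard_index(mask_a, mask_b):
    """JI = |A intersect B| / |A union B| over the sets of kept edges."""
    a = {i for i, v in enumerate(mask_a) if v == 1}
    b = {i for i, v in enumerate(mask_b) if v == 1}
    if not a | b:
        return 1.0  # two empty masks are trivially identical
    return len(a & b) / len(a | b)

# Toy example: the two modules share 1 of the 3 edges kept by either.
ji = jaccard_index([1, 1, 0, 0], [1, 0, 1, 0])
```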
Accuracy over the test dataset for modules
To measure the performance of the DNN model, we use accuracy metrics. When we utilize the decomposed modules in the recomposing stage, the prediction is based on the subnetwork, as shown in Figure 2. After superimposing the supermask, we use the output of the grafting layer to predict labels via voting. Superimposing the supermask is very fast because it is a matrix computation. As for voting, when we input data and run each module, the positive output label with the highest confidence score among the modules is taken as the predicted label. Based on this, we calculate the accuracy on the test dataset.
Accuracy for the recomposed networks
We evaluated the accuracy with which a model that combines multiple modules can predict a subtask.
Number of remaining parameters
We evaluate the number of remaining parameters (RPs) required when the modules are recomposed using module stemming. The smaller the number of parameters, the better. If instances can be classified while sharing edges between modules, the number of parameters needed to solve the subtasks can be reduced, thus reducing the model size. This is evaluated as the percentage of the trained model's parameters needed to predict the subtasks.
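As a sketch, the RPs ratio reduces to the following (illustrative code; `union_mask` here stands for the element-wise union of the recomposed modules' binary masks, flattened into one list):

```python
def remaining_parameter_ratio(union_mask, total_params):
    """Fraction of the trained model's parameters that survive recomposition:
    RPs = |union of module masks| / |all parameters of the trained model|."""
    return sum(union_mask) / total_params

# Toy example: 3 of the trained model's 8 edges remain after recomposition.
ratio = remaining_parameter_ratio([1, 0, 1, 1, 0, 0, 0, 0], 8)
```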
As a preliminary experiment, we evaluated the choice of the p-norm in Eq. 1. Comparing candidate values of p, there was no significant change in the best accuracy or in the number of parameters required, but one choice caused the scores to change more slowly, which facilitates a more detailed evaluation of the accuracy and the number of parameters; we therefore adopted that norm in our experiments.
Evaluation of Decomposability
Table 1 (excerpt), JI of Module Stemming (ours): 0.60, 0.71, 0.67, 0.67, 0.72, 0.67, 0.66, 0.75.
Table 1 shows the average accuracy of the modules and of the recomposed module obtained by decomposing the trained model. On the MNIST and F-MNIST datasets, our method outperforms the conventional method in terms of the JI of the recomposed module. Moreover, the accuracy is higher than the baseline in most cases.
Evaluation of Composability and Reusability
The results in Figure 3 and Figure 4 show that a much higher accuracy can be achieved, and a small subnetwork utilized, by recomposing two modules (two-class problems). Each experiment is tested over five trials. The hyperparameter λ is the coefficient for module stemming, and with λ > 0 the number of parameters required for recomposition is significantly reduced compared with the case where λ = 0. For example, for model FC1 in Figure 3, when two modules are composed, the worst-case RPs is the sum of the two modules' ratios, which occurs if the two modules use entirely different edges for inference; without module stemming (λ = 0), the observed RPs is in fact close to this worst case. With module stemming, we can construct supermasks in which the modules are largely similar to each other, with the RPs converging over the epochs toward that of a single module. Figure 4 shows the results for CIFAR-10, CIFAR-100, and SVHN: the RPs is reduced by module stemming, whereas the accuracy for λ > 0 is not maintained as well as for λ = 0. This finding indicates that the number of parameters required for CIFAR-10 / CIFAR-100 / SVHN classification is large relative to the model size, and that the appropriate λ for a module depends on the model size. In this experiment, although the value of k% was determined in accordance with the baseline, λ and k% should in practice be determined depending on the model and dataset in terms of performance. As with model-size compression methods, we observed that the accuracy of the model decreased as more parameters were pruned. The proposed method is not suitable for environments with a very large number of classes because of the computation time required to build modules with module co-training. For actual use, the required recomposed model size and accuracy should be verified in advance.
Table 2 shows a comparison of the effects of initial score sharing and the stemming loss in the proposed method. The results show that the stemming loss has a significant effect, and a low dependence on the initial score values was observed. Although initial score sharing did not produce a significant effect on performance, it is expected to make the learning progress of each module more manageable.
Table 2 columns: Initial Score Sharing | Stemming Loss | Avg. Ratio of RPs | Best | Worst.
Through experimental evaluation, we confirmed that the proposed method can decompose NNs into modules and merge them to predict classification subtasks with high accuracy, using only a small number of parameters in the merged module. The conventional method is limited to fully connected layers and is not applicable to networks that include CNNs, skip connections He et al. (2016), or average pooling Lin et al. (2014). The proposed algorithm, which performs pruning while retaining similar edges among modules, can be applied to widely used NNs via the proposed loss function based on differentiable masking. As no training is required for recomposition, modules for predicting subtasks can be used immediately by applying masks. As a limitation, module decomposition itself does not require a large amount of training time because it prunes edges of a trained model, but it consumes computation time proportional to the number of classes. To decompose models with a very large number of classes, techniques such as distributed learning are necessary.
In this study, we addressed extracting the subnetworks required for classifying a subclass set from a trained model. We proposed an approach to decompose an existing trained model and modularize it. The proposed method employs weight masks to extract modules, is applicable to arbitrary DNNs, makes no assumptions about the network architecture, and can be applied immediately without retraining. The method showed promising results, extracting similar edges across modules on several datasets, which allows us to reduce the model size when recomposing modules. Future work includes a more detailed analysis of the edges commonly used by modules after stemming and further investigation of the conditions and model sizes under which stemming works more efficiently.
TS was partially supported by JSPS KAKENHI (18H03201), Fujitsu Laboratories Ltd., and JST CREST.
- Amer and Maul (2019). A review of modularization techniques in artificial neural networks. Artificial Intelligence Review 52(1), pp. 527–561.
- Anand et al. (1995). Efficient classification for multiclass problems using modular neural networks. IEEE Transactions on Neural Networks 6(1), pp. 117–124.
- Auda and Kamel (1999). Modular neural networks: a survey. International Journal of Neural Systems 9(2), pp. 129–151.
- Craighero et al. (2020). Investigating the compositional structure of deep neural networks. In International Conference on Machine Learning, Optimization, and Data Science, pp. 322–334.
- Csordás et al. (2021). Are neural nets modular? Inspecting functional modularity through differentiable weight masks. In ICLR 2021.
- Cubuk et al. (2020). RandAugment: practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, Vol. 33, pp. 18613–18624.
- Filan et al. (2020). Neural networks are surprisingly modular. arXiv.
- Frankle and Carbin (2019). The lottery ticket hypothesis: finding sparse, trainable neural networks. In ICLR 2019.
- He et al. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
- He et al. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- Hupkes et al. (2020). Compositionality decomposed: how do neural networks generalise? (extended abstract). In IJCAI-20, pp. 5065–5069.
- Krizhevsky and Hinton (2009). Learning multiple layers of features from tiny images.
- LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, pp. 2278–2324.
- Lin et al. (2014). Network in network. In ICLR 2014.
- Mallya et al. (2018). Piggyback: adapting a single network to multiple tasks by learning to mask weights. In ECCV 2018, Lecture Notes in Computer Science, Vol. 11208, pp. 72–88.
- McCloskey and Cohen (1989). Catastrophic interference in connectionist networks: the sequential learning problem. Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
- Netzer et al. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011.
- Pan and Rajan (2020). On decomposing a deep neural network into modules. In ESEC/FSE 2020, pp. 889–900.
- Ramanujan et al. (2020). What's hidden in a randomly weighted neural network? In CVPR 2020, pp. 11890–11899.
- Simonyan and Zisserman (2015). Very deep convolutional networks for large-scale image recognition. In ICLR 2015.
- Tai et al. (2016). Convolutional neural networks with low-rank regularization. In ICLR 2016.
- Watanabe et al. (2018). Modular representation of layered neural networks. Neural Networks 97, pp. 62–73.
- Wortsman et al. (2020). Supermasks in superposition. In NeurIPS 2020.
- Xiao et al. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv.
- Zagoruyko and Komodakis (2016). Wide residual networks. arXiv abs/1605.07146.
- Zhou et al. (2019). Deconstructing lottery tickets: zeros, signs, and the supermask. In NeurIPS 2019, pp. 3592–3602.
Appendix A Additional Experiments’ Details
Training Parameter Details
Table 3 shows the training parameters used with SGD for building modules. The details of our implementation code and training parameters essentially follow Zagoruyko and Komodakis (2016); Wortsman et al. (2020); Pan and Rajan (2020). We confirmed that the accuracy of each model is almost equal to the accuracy presented in each reference.
Table 3 (excerpt), CIFAR-100: learning rate 0.1 (step LR); multi-step LR decay 0.02 (at epochs 60, 120, 160).
As shown in a previous study Ramanujan et al. (2020), data augmentation can also improve accuracy when using supermasks. For CIFAR-10 and SVHN, we applied standard data augmentation (flipping and random crop). We did not apply any data augmentation for MNIST and F-MNIST, to compare with the previous study Pan and Rajan (2020) under the same conditions. For the WideResNet28-10/CIFAR-100 models, we used RandAugment Cubuk et al. (2020) with fixed values of its N and M parameters.
Training Module Details
To build models via module co-training, we used the parameters shown in Table 4. Each model decomposition was performed for five trials, and the shaded error bars shown in Figure 3 and Figure 4 indicate one standard deviation. All models and modules were trained on eight NVIDIA Tesla V100 GPUs.
Table 4 (excerpt): learning rate 0.01 (WideResNet) / 0.1 (others); batch size 512 (WideResNet) / 32 (others).
Appendix B Additional Analysis
Analysis of Hyperparameter λ
Table 5 shows the maximum test-set accuracy of each module on MNIST when the hyperparameter λ is varied. Under the condition of sufficient epochs, there is an upper bound on accuracy because the weights do not change. In this experiment, under the condition of 50 epochs, there was no change in the best accuracy over 5 trials. As shown in Table 5, even with λ > 0, which enables module stemming, the best accuracy is sometimes higher than with λ = 0. This can be attributed to the regularization effect of module stemming and the addition of the grafting layer. Just as accuracy does not necessarily increase with the number of layers or parameters, this indicates that the search for λ as a hyperparameter is important. Note that the results shown in Table 5 are the average accuracy of a single module (two-class problems), which differs from the accuracy after recomposition of 10 modules shown in Table 1 (10-class problems).
Analysis of Modules Accuracy
Related to Figure 3, Table 6 shows how accurately each module classifies on a dataset where its class is balanced with the other classes in a 1:1 ratio. The results show that the accuracy is high only when a module matches its subtask; for other classes, the binary classification fails. This indicates that the modules are successfully separated as subnetworks with single-class classification functions.