Neural Network Module Decomposition and Recomposition

by Hiroaki Kingetsu, et al.
The University of Tokyo

We propose a modularization method that decomposes a deep neural network (DNN) into small modules from a functionality perspective and recomposes them into a new model for some other task. Decomposed modules are expected to have the advantages of interpretability and verifiability due to their small size. In contrast to existing studies based on reusing models that involve retraining, such as a transfer learning model, the proposed method does not require retraining and has wide applicability as it can be easily combined with existing functional modules. The proposed method extracts modules using weight masks and can be applied to arbitrary DNNs. Unlike existing studies, it requires no assumption about the network architecture. To extract modules, we designed a learning method and a loss function to maximize shared weights among modules. As a result, the extracted modules can be recomposed without a large increase in the size. We demonstrate that the proposed method can decompose and recompose DNNs with high compression ratio and high accuracy and is superior to the existing method through sharing weights between modules.





The success of machine learning using deep neural networks (DNNs) has led to its widespread adoption in engineering. Rapid development of DNNs is therefore in high demand, and techniques for reusing models have been well studied. Many existing approaches to model reuse, such as transfer learning, involve retraining, which is costly. Although smaller models are preferable in terms of interpretability and verifiability, widely reused models are typically large and hard to handle. We propose a modularization method that decomposes a DNN into small modules from a functionality perspective and then uses them to recompose a new model for some other task. The proposed method does not require retraining and has wide applicability as it can be easily combined with existing functional modules. For problems where only a subset of all classes needs to be predicted, the classification task can be expected to be solvable with a smaller network than the model trained on all classes. For example, given a 10-class dataset and a 5-class subset of it, the number of parameters a DNN needs for the 5-class classifier can be smaller than that needed for the 10-class classifier. The goal here is to decompose the 10-class classifier into subnetworks as DNN modules and then recompose those modules to quickly build a 5-class classifier without retraining.

In this study, we consider the use of neural networks in environments where the subclasses to be classified change frequently. In such environments, the model must be retrained every time the task changes, which is a major obstacle in practice. To solve this problem, we propose a novel method that extracts subnetworks specialized for single-class classification from a (large) trained model and combines these subnetworks for each task to classify an arbitrary subset of classes. In other words, we propose a methodology to immediately obtain a small subnetwork model capable of classifying an arbitrary set of subclasses from a neural network model trained on the original set of classes (hereafter, the trained model). This method involves the following two procedures: (1) We decompose the DNN classifier model trained on an $N$-class dataset and extract subnetworks, each of which solves the binary classification of one class. (2) By combining the subnetworks, a composed modular network is constructed. In our approach, we modularize the neural network by pruning edges using supermasks Zhou et al. (2019). The supermask is a binary matrix that indicates whether each edge of the network should be pruned or not.

Figure 1: Overview of trained model decomposition and recomposition for the subtask.

It has been reported that even DNNs with random weights can be trained merely by pruning edges with supermasks, without changing the weights of the network Zhou et al. (2019); Ramanujan et al. (2020). In this study, we apply supermasks to the trained model to extract the subnetworks required for the classification of the subclass set (Figure 1). We also propose a new algorithm called module stemming that constructs a partial network with as few parameters as possible by sharing the learning of the supermasks among modules. We demonstrate the effectiveness of our proposed method by comparing it with previous work on open datasets such as MNIST, Fashion-MNIST, SVHN, CIFAR-10, and CIFAR-100.

The contributions of this study are summarized below:

  • We applied supermasks to a trained DNN model and constructed module networks, which are subnetworks of the trained model, by pruning edges. We showed that a trained multiclass model can be decomposed into subnetworks that each classify a single class. We also showed that the classification task for an arbitrary subclass set can be solved by a linear combination of these modules.

  • We proposed a new method for learning supermasks that are trained to retain similar edges across modules. By adding the consistency of the supermask scores of each layer to the loss function, we show that the supermasks learn to remove edges that differ among modules, so that the recomposed model classifies with fewer parameters.

  • We demonstrated the effectiveness of our proposed method for decomposing and recomposing modular networks using open datasets.

Modular Neural Network

Why decompose a trained neural network into modules?

The modularization of neural networks has been reported in many studies Auda and Kamel (1999); Amer and Maul (2019); Hupkes et al. (2020). In these existing studies, training is performed at the architectural level to take on a modular structure. The modular structure is said to be advantageous in the following ways:

  • The size of the network to be trained can be reduced, and the inference can be performed faster.

  • The analysis of networks becomes easier.

  • The effects of catastrophic forgetting McCloskey and Cohen (1989) in continual learning and incremental learning can be reduced.

Our approach is to extract and modularize subnetworks that can each infer only a single class from a general trained DNN model. These modular subnetworks can then be combined to address new tasks at the desired level. The purpose of this study is not to use a special network architecture with a modular structure but to extract, from a large-scale trained model, partial networks (subnetworks) that act as functional modules and offer the above advantages. We refer to these partial networks as module networks. Each module network is constructed as a binary classifier for a single class, so that multiple module networks can be combined to immediately obtain a usable model for any subtask.

Required Properties

In this study, we expect modular networks to have the following properties.

  • Decomposability. Modular networks can be decomposed from a large network, and each has the capability to classify only a single class. Each modular network should be smaller than the original network. Furthermore, its classification accuracy should be high.

  • Recomposability. Modules with a single-class classification capability can be combined with each other to perform multiclass classification. Multiclass classification after recomposition should achieve high accuracy.

  • Reusability / Capability in small parameters. Modules can be combined with only a simple calibration depending on the task. They can be reused without requiring relearning from scratch. Each subnetwork should be represented with as few parameters as possible compared to the trained model.

An approach that extracts modules from a large trained model is expected to be faster than learning a small model from scratch. Methods such as transfer learning, which trains large models on large data and fine-tunes them on a task, and self-supervised learning, which pretrains on a large amount of unlabeled data before fitting a target dataset, have been reported to perform well. Within this framework of extracting well-performing modular networks from such large models, we propose a method for extracting subnetworks from existing models and evaluate it experimentally.

Related Research

Neural Network Decomposition

Functional decomposition of neural networks (NNs) has been studied previously. One of the major approaches to NN decomposition is to consider the NN as a graph. Clustering of network connections is regarded as a way to ensure the explainability and transparency of the model Watanabe et al. (2018); Filan et al. (2020). According to existing reports, relationships between edges can be identified through low-rank approximation Tai et al. (2016), weight masks Csordás et al. (2021), and ReLU activation Craighero et al. (2020). The goal of our study is modular decomposition for reuse, that is, decomposition that permits recomposition. Our approach is not to decompose the trained model into network layers (layer-wise) but to decompose it by the classes to be predicted (class-wise). As an example of class-wise decomposition, a modified backpropagation was developed by modularizing and decomposing the $N$-class problem into two-class problems Anand et al. (1995). Anand et al. applied the proposed method to modules that were each trained to distinguish one class from its complement. Using a similar approach, Pan and Rajan (2020) proposed a decomposition and recomposition method for NNs that removes edges from the network. However, they did not address convolutional layers, and their experiments were limited to simple datasets such as MNIST. Thus, the method is not applicable to more complex deep learning models. Studies on network decomposition are still at a preliminary experimental stage.

Lottery Ticket Hypothesis and Supermask

Frankle and Carbin (2019) proposed the lottery ticket hypothesis: an NN contains sparse subnetworks that, when reset to their original initialization and trained in isolation, reach the same accuracy as the original network. According to their experimental results, the initial weights of the neural network contain subnetworks (called winning tickets) that are important for classification. They found the winning tickets by iteratively shrinking the NN, masking out the weights with the lowest magnitude. Zhou et al. (2019) demonstrated that winning tickets achieve better-than-random performance without training. They proposed an algorithm to identify a supermask, which selects a subnetwork of a randomly initialized neural network that achieves high accuracy without training. To improve the efficiency of learning and searching for supermasks, Ramanujan et al. (2020) proposed the edge-popup algorithm, which uses scores assigned to the supermask. Applying supermasks to NNs with random weights shows that subnetworks capable of solving specific tasks can be superimposed on a network without changing its weights. Because the supermask does not change the weights of the original (randomly initialized) model, it is expected to prevent catastrophic forgetting, which is a problem in continual learning. Accordingly, there has been extensive research on applications of supermasks, especially in continual learning. Examples include Piggyback Mallya et al. (2018) and Supsup Wortsman et al. (2020). These studies superimpose different tasks on a fixed backbone NN using supermasks. Following this line of work, we assume that a trained DNN for an $N$-class problem contains subnetworks that can solve the binary classification of each class. By using this interesting property of NNs, we show experimentally that it is possible to extract a small subnetwork that can classify a single class.


Figure 2: Overview of the model decomposition and recomposition flow for predicting the subtask.

This study aims to decompose the trained model for an $N$-class problem into subnetworks that each perform a single-class prediction task and to recompose NNs from them without retraining, as shown in Figure 2. In this section, we describe the details of the modularization: decomposing the trained model using supermasks, the learning process of the supermasks, and recomposing the decomposed subnetworks into a new NN for prediction.


Training Mask  The module decomposition is achieved by applying supermasks, which indicate which edges are pruned, to the trained network. We follow the edge-popup algorithm Ramanujan et al. (2020) to calculate supermasks. For clarity, we briefly describe the edge-popup algorithm for training supermasks. Let $x$ and $\hat{y} = (W \odot M)\,x$ be the input and the output, where $W$ are the trained model weights, $M$ is the supermask, and $\odot$ indicates element-wise multiplication. With $W$ fixed, the edge-popup algorithm learns a score matrix $S$ and computes the mask via $M = h(S)$. The function $h$ sets the top $k\%$ of entries in $S$ to 1 and the remaining entries to 0. The edge-popup algorithm updates $S$ on the backward pass, optimized via stochastic gradient descent (SGD). The score $S$ is initialized through Kaiming normal initialization He et al. (2015) with rectified linear unit activation. However, if the edge-popup algorithm is applied in a straightforward manner, each module keeps the edges that perform best for predicting its own single class, and the other edges are pruned away. As a result, different modules use different edges to make predictions, which makes the model very large when the modules are combined. Therefore, we propose a novel algorithm called module stemming that uses common edges among modules, so that they can be recomposed with few parameters.
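The forward pass of a supermasked layer can be illustrated with a minimal NumPy sketch. The function names `topk_mask` and `masked_forward`, the layer shape, and the random stand-in weights are assumptions made for illustration and are not taken from the original implementation. Only the forward pass is shown; the score update on the backward pass is omitted.

```python
import numpy as np

def topk_mask(scores: np.ndarray, k: float) -> np.ndarray:
    """Binary mask that keeps the top fraction k of the entries of `scores`."""
    n_keep = int(round(k * scores.size))
    mask = np.zeros(scores.size, dtype=np.float64)
    if n_keep > 0:
        # Indices of the n_keep largest scores (their internal order is irrelevant).
        keep = np.argpartition(scores.ravel(), -n_keep)[-n_keep:]
        mask[keep] = 1.0
    return mask.reshape(scores.shape)

def masked_forward(x, weights, scores, k):
    """Linear layer with frozen weights and a supermask derived from the scores."""
    mask = topk_mask(scores, k)
    return (weights * mask) @ x

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))   # frozen weights (random stand-in for a trained layer)
S = rng.normal(size=(4, 6))   # one learnable score per edge
x = rng.normal(size=6)

M = topk_mask(S, k=0.25)      # keeps 6 of the 24 edges
y = masked_forward(x, W, S, k=0.25)
```

Because only `S` would be trained, the same frozen `W` can serve every module, and each module differs only in its mask.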

Grafting Layer

An additional fully connected layer called the grafting layer is placed after the last layer of the trained model (see Figure 1). This layer converts the $N$-class prediction output of the trained model into the single-class prediction output of a module. Single-class prediction estimates whether or not an input belongs to a given class. The weights of the grafting layer are trained by SGD with a low learning rate in the decomposition stage, but they are not retrained in the recomposition stage. Although it would be possible to reuse only the feature extractor of the trained model, as in transfer learning, we did not take this approach because the fully connected layers have the largest number of trainable parameters, and retraining this part of the model is not in line with the reusability objective of this study. The grafting layer acts as a fine-tuning step. It uses all $N$ logits of the $N$-class trained model (if not masked) to make the single-class prediction, which allows a module to outperform the original $N$-class trained model on single-class classification.
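As a rough illustration of the grafting layer, the sketch below maps the $N$ logits of a trained model to a two-way output (target class versus not). The two-unit softmax output, the function names, and the hand-set weights are assumptions for illustration only; in the method described above the grafting weights are learned by SGD during decomposition.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def grafting_layer(logits, G, b):
    """Map N-class logits to probabilities [not target class, target class]."""
    return softmax(logits @ G + b)

N, target = 10, 3
# Hypothetical hand-set weights: the positive unit reads the target logit,
# the negative unit reads the mean of the remaining logits.
G = np.zeros((N, 2))
G[:, 0] = 1.0 / (N - 1)
G[target, 0] = 0.0
G[target, 1] = 1.0
b = np.zeros(2)

logits_target = np.zeros(N)
logits_target[target] = 5.0   # trained model favors the target class
logits_other = np.zeros(N)
logits_other[7] = 5.0         # trained model favors a different class

p_target = grafting_layer(logits_target, G, b)
p_other = grafting_layer(logits_other, G, b)
```

The second component of each output is the module's confidence that the input belongs to its class, which is the quantity a voting step would compare across modules.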

Module Stemming

To obtain a subnetwork that can classify only the subtask, the parameter size of each module should be as small as possible. If the recomposed model is not significantly smaller than the trained model, then the trained model could simply be used as is, which defeats the purpose of this study. To obtain a small recomposed model (subnetwork), we can reduce the model size not only by reducing the network size of individual modules but also by making the supermasks of the modules as similar as possible, so that the same edges are used in the forward pass. In this way, we can extract and build small subnetworks from the trained model. In this section, we discuss initial score sharing and stemming loss as ideas for extracting small network modules.

  1. Initial Score Sharing  
    Pruning of a network depends on its initial parameter values. To construct modules consisting of similar edges, we take the simple approach of sharing the initial values of the supermask scores between modules.

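  2. Stemming Loss  

    For training supermask scores to build a module, we propose a loss function that regularizes the layer-wise scores among the modules. Considering, for simplicity, linear layers that have the same number of neurons, we define the following loss function $\mathcal{L}_n$ to learn a score matrix $S_{l,n}$ in layer $l$ for module $n$, parameterized by $\alpha$, for each epoch while training with training batch data $x$:

    $$\mathcal{L}_n = \mathcal{L}_{\mathrm{CE}}(\hat{y}, y) + \alpha \sum_{l} \lVert S_{l,n} - S'_{l} \rVert_p \tag{1}$$

    where $\mathcal{L}_{\mathrm{CE}}(\hat{y}, y)$ denotes the cross-entropy between the prediction $\hat{y}$ and the label $y$, $\lVert \cdot \rVert_p$ denotes the $L_p$ norm, $S'_{l}$ denotes the scores in the previous epoch, and $\alpha$ denotes a hyperparameter.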
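The stemming loss described above (cross-entropy plus a hyperparameter-weighted norm of the difference between a module's layer-wise scores and the previous-epoch scores) can be sketched as follows. The function names, the argument layout, and the toy numbers are assumptions for illustration; this is not the original implementation.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def stemming_loss(probs, labels, scores, base_scores, alpha, p=2):
    """Cross-entropy plus alpha times the summed layer-wise p-norm distance
    between a module's scores and the base (previous-epoch) scores."""
    penalty = sum(
        np.linalg.norm((s - s_base).ravel(), ord=p)
        for s, s_base in zip(scores, base_scores)
    )
    return cross_entropy(probs, labels) + alpha * float(penalty)

# Toy example with one layer: the score difference has entries 3 and 4.
probs = np.array([[0.5, 0.5]])
labels = np.array([0])
scores = [np.array([[3.0, 0.0], [0.0, 4.0]])]
base = [np.zeros((2, 2))]

loss_l2 = stemming_loss(probs, labels, scores, base, alpha=0.1, p=2)
loss_l1 = stemming_loss(probs, labels, scores, base, alpha=0.1, p=1)
loss_off = stemming_loss(probs, labels, scores, base, alpha=0.0)
```

Setting `alpha` to zero recovers plain cross-entropy training of the mask, so the hyperparameter directly controls how strongly a module is pulled toward the shared scores.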

Algorithm 1 Module Co-training

Input: Training batch data, trained model
Parameter: Ratio of remaining parameters $k\%$, hyperparameter $\alpha$
Output: Supermask scores for each class module

for each epoch do
   if first epoch then
      Initialize the scores
      Train the supermasks using the training data and the trained model without module stemming
   else
      Compute the average supermask scores of all modules in the previous epoch
      for each module do
         Compute the prediction output with the training data and the module's current supermask
         Perform backpropagation with the stemming loss (Eq. 1) using the averaged scores
      end for
   end if
end for

Module Co-training

To construct the base scores used for computing the stemming loss, we tried two methods: (A) We chose a single class arbitrarily and constructed its module without stemming loss, and then trained the other modules with stemming loss against it. (B) We used the average of the scores of all modules as the base. Because method (A) provides no principled way to select the base class, we evaluated it by brute force over the classes. In these trials, method (A) did not match or exceed the accuracy of method (B). In addition, with method (A) it is difficult to choose a good base class, and a good choice depends heavily on the dataset. We therefore adopted method (B), which we call module co-training, and confirmed that it performs module stemming efficiently. The algorithm is shown in Algorithm 1.
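The effect of co-training against the average scores can be seen in a toy simulation. Everything here is a simplified assumption: each module's task loss is replaced by a hypothetical quadratic pull toward a random target vector, the penalty is a squared L2 distance (chosen for its simple gradient), and the names `co_train`, `topk_mask`, and `jaccard` are illustrative.

```python
import numpy as np

def topk_mask(scores, k):
    """Boolean mask keeping the top fraction k of the scores."""
    n_keep = int(round(k * scores.size))
    mask = np.zeros(scores.size, dtype=bool)
    mask[np.argsort(scores.ravel())[scores.size - n_keep:]] = True
    return mask.reshape(scores.shape)

def jaccard(a, b):
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def co_train(targets, alpha, k=0.1, epochs=200, lr=0.05, seed=0):
    """Toy module co-training on one score vector per module.

    targets[n] is a hypothetical stand-in for what the task loss of module n
    prefers, so the task gradient is simply (scores[n] - targets[n]).
    """
    rng = np.random.default_rng(seed)
    shared_init = rng.normal(size=targets.shape[1])
    scores = np.tile(shared_init, (targets.shape[0], 1))  # initial score sharing
    for epoch in range(epochs):
        grad = scores - targets                    # stand-in task gradient
        if epoch > 0:                              # first epoch: no stemming
            base = scores.mean(axis=0)             # average of previous-epoch scores
            grad = grad + alpha * (scores - base)  # gradient of the squared-L2 penalty
        scores = scores - lr * grad
    return [topk_mask(s, k) for s in scores]

rng = np.random.default_rng(1)
targets = rng.normal(size=(3, 1000))               # three modules, 1000 edges each
ji_plain = jaccard(*co_train(targets, alpha=0.0)[:2])
ji_stem = jaccard(*co_train(targets, alpha=9.0)[:2])
```

In this toy setting the masks trained with a positive `alpha` overlap far more than the independently trained ones, which is the behavior module stemming aims for; real accuracy trade-offs are not modeled here.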


Given a subset of the classes of the $N$-class problem, the deployed model is a subnetwork of the $N$-class classifier obtained by taking the union of the supermasks for the classes in the subset. Therefore, if module stemming yields similar supermasks, so that the problem is solved using common edges, the deployed model becomes smaller. The deployed model makes predictions without retraining by sequentially applying each supermask and grafting layer and then voting by confidence score (e.g., max softmax score), as shown in Figure 2.
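A minimal sketch of recomposition, assuming boolean masks and per-module confidence scores that have already been computed. The helper names `union_mask` and `vote`, the class labels, and the numbers are hypothetical.

```python
import numpy as np

def union_mask(masks):
    """Edges kept in the deployed model: the element-wise OR of the module masks."""
    return np.any(np.stack(masks), axis=0)

def vote(confidences):
    """Return the class whose module reports the highest positive confidence.

    `confidences` maps a class id to that module's confidence that the input
    belongs to the class (for example, its positive softmax output).
    """
    return max(confidences, key=confidences.get)

# Two hypothetical modules over a layer with 8 edges; each keeps 3 edges.
mask_cat = np.array([[1, 1, 0, 0], [0, 1, 0, 0]], dtype=bool)
mask_dog = np.array([[1, 0, 0, 0], [0, 1, 1, 0]], dtype=bool)

deployed = union_mask([mask_cat, mask_dog])
kept_ratio = deployed.mean()                     # fraction of edges in the deployed model
predicted = vote({"cat": 0.35, "dog": 0.80})
```

Because the two masks share two of their three edges, the deployed model needs four edges rather than six, which shows why similar supermasks lead to a smaller recomposed network.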

Experimental Setup


MNIST   The MNIST dataset LeCun et al. (1998) is a well-known dataset used in many studies and consists of images of handwritten digits (0-9). There are 60,000 training examples and 10,000 test examples, with a roughly equal number of examples for each class label.

Fashion-MNIST (F-MNIST)   The F-MNIST dataset Xiao et al. (2017) consists of two-dimensional grayscale images of 10 categories of clothing. As with the MNIST dataset, there are 60,000 training examples and 10,000 test examples, with an equal number of examples for each class label.

CIFAR-10 / CIFAR-100   The CIFAR-10 and CIFAR-100 datasets Krizhevsky and Hinton (2009) each consist of a training set of 50,000 images and a test set of 10,000 images, drawn from 10 classes and 100 classes, respectively. All images have a 32 × 32 resolution.

SVHN   The Street View House Numbers (SVHN) dataset Netzer et al. (2011) has 73,257 images in the training set and 26,032 images in the test set. All images are 32 × 32 color images of digits.


For evaluation, we use four fully connected models (FC1, FC2, FC3, FC4) that have 49 neurons in each hidden layer and 1, 2, 3, and 4 hidden layers, respectively, following Pan and Rajan (2020). In addition, to verify the effect of convolutional layers with batch normalization and dropout, we use VGG16 Simonyan and Zisserman (2015) for CIFAR-10, ResNet50 He et al. (2016) for SVHN, and WideResNet28-10 Zagoruyko and Komodakis (2016) for CIFAR-100. These models were trained on the corresponding datasets for a fixed number of epochs using SGD with momentum. WideResNet28-10 was trained with RandAugment Cubuk et al. (2020), one of the leading augmentation methods. As a baseline, we compared our results against the methods of Pan and Rajan (2020). They proposed six methods, of which we selected TI-I, TI-SNE, and CM-RIE, which performed best at retaining similar edges between modules. Note that the methods of Pan and Rajan (2020) are not designed to increase the sharing of parameters between modules. Because their methods are not applicable to models with convolutional layers, we made the comparison using the FC models.


To evaluate the performance of the proposed method, we compared it with previous studies according to the evaluation criteria presented in this section to verify whether the constructed modules have the required properties.


  • Jaccard Index

We use the Jaccard Index (JI) to measure the similarity between the decomposed modules. If the JI is 0, the two modules share no edges. If it is 1, the two modules are identical. A higher JI indicates that the modules use similar NN paths for inference, so the recomposed model is smaller.

  • Accuracy over the test dataset for modules

To measure the performance of the DNN model, we use accuracy. When we use the decomposed modules in the recomposition stage, the prediction is based on the subnetwork, as shown in Figure 2. After applying the supermask, we use the output of the grafting layer to predict labels via a voting method. Applying the supermask is very fast because it is a matrix computation. As for voting, we feed the input to each module, and the positive output label with the highest confidence score among the modules is taken as the predicted label. Based on this, we calculate the accuracy on the test dataset.


  • Accuracy for the recomposed networks

We evaluated the accuracy with which a model that combines multiple modules can predict a subtask.


  • Number of remaining parameters

We evaluate the number of remaining parameters (RPs) required when the modules are recomposed using module stemming. The smaller the number of parameters, the better. If the class of an instance can be inferred while sharing edges between modules, the number of parameters needed to solve the subtasks is reduced, thus reducing the model size. We report this as the percentage of the trained model's parameters needed to predict the subtasks.
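The two mask-based criteria above, the Jaccard Index between modules and the ratio of remaining parameters after recomposition, can be computed directly from binary masks. The sketch below uses hypothetical helper names and a toy pair of masks; treating two empty masks as identical is a convention chosen here.

```python
import numpy as np

def jaccard_index(mask_a, mask_b):
    """Size of the intersection over the size of the union of two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # convention chosen here: two empty masks count as identical
    return float(np.logical_and(a, b).sum() / union)

def remaining_param_ratio(masks):
    """Fraction of the trained model's edges kept by the union of the masks."""
    union = np.any(np.stack([np.asarray(m, dtype=bool) for m in masks]), axis=0)
    return float(union.mean())

# Two toy modules over 10 edges; each keeps 4 edges and they share 2.
a = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
b = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 0])

ji = jaccard_index(a, b)
single = remaining_param_ratio([a])
pair = remaining_param_ratio([a, b])
```

For two masks that each keep a fraction k of the edges, the union keeps 2k / (1 + JI) of them, so the ratio of RPs moves from 2k at JI = 0 down to k at JI = 1. This is why a higher JI translates into a smaller recomposed model.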



As a preliminary experiment, we evaluated the choice of norm in Eq. 1. The candidates we compared showed no significant difference in the best accuracy or the number of parameters required. However, with one of them the results tended to change more slowly, which facilitates a more detailed evaluation of the accuracy and the number of parameters, and we therefore used it in our experiments.

Evaluation of Decomposability

                                MNIST                           F-MNIST
Method                  Metric  FC1     FC2     FC3     FC4     FC1     FC2     FC3     FC4
TI-I                    JI      0.47    0.64    0.44    0.53    0.78    0.64    0.52    0.56
                        Acc     94.91%  96.83%  69.19%  96.44%  85.82%  87.58%  77.55%  87.51%
TI-SNE                  JI      0.47    0.65    0.45    0.55    0.78    0.65    0.53    0.57
                        Acc     94.91%  96.83%  96.30%  96.79%  85.82%  87.58%  87.09%  87.79%
CM-RIE                  JI      0.43    0.63    0.43    0.53    0.75    0.63    0.51    0.55
                        Acc     94.90%  96.82%  96.33%  96.75%  85.85%  87.56%  87.10%  87.95%
Module Stemming (ours)  JI      0.60    0.71    0.67    0.67    0.72    0.67    0.66    0.75
                        Acc     96.47%  95.92%  96.61%  96.75%  84.63%  85.44%  86.30%  86.42%
Table 1: Average Jaccard Index (JI) between modules and average accuracy (Acc) of the model recomposed from all modules, on each test dataset and for each method. A higher JI is better. Because the existing methods cannot be applied to convolutional layers, we tested only fully connected models. The proposed method achieves the highest JI in most conditions. Although module stemming does not pursue accuracy, as it aims to constrain the edges to be selected, its accuracy is competitive with the existing methods.

Table 1 shows the average JI of the modules and the average accuracy of the recomposed model obtained by decomposing the trained model. On the MNIST and F-MNIST datasets, our method achieves a higher JI than the conventional methods for every model except FC1 on F-MNIST. Its accuracy is higher than the baselines for FC1 and FC3 on MNIST and within about 2.2 percentage points of the best baseline elsewhere.

Evaluation of Composability and Reusability

Figure 3: (MNIST) Top: Ratio of the average remaining parameters in the two-class module composition for each epoch when the hyperparameters are changed. Bottom: Average accuracy of the module. The dotted line (k) indicates the size of a single module.
Figure 4: (SVHN with ResNet50 / CIFAR-10 with VGG16 / CIFAR-100 with WideResNet28-10) Top: Ratio of the average remaining parameters in the two-class module composition for each epoch when the hyperparameters are changed. Bottom: Average accuracy of the module. The dotted line (k) indicates the size of a single module.

The results in Figure 3 and Figure 4 show that recomposing two modules (two-class problems) yields a small subnetwork that achieves high accuracy. Each experiment was run for five trials. The hyperparameter $\alpha$ is the coefficient for module stemming, and with $\alpha > 0$ the number of parameters required for recomposition is significantly reduced compared with the case where $\alpha = 0$. For example, for model FC1 in Figure 3, the ratio of RPs required to represent one module is $k$ of the original model. When two modules are composed, the ratio of RPs is $2k$ at worst, which occurs when the two modules use entirely different edges for inference. In fact, it is close to $2k$ when $\alpha = 0$, that is, without module stemming. With module stemming, in contrast, we can construct supermasks in which the modules are largely similar to each other, and after sufficient epochs the ratio of RPs converges to a value close to $k$. Figure 4 shows the results for CIFAR-10, CIFAR-100, and SVHN: module stemming again reduces the RPs, whereas accuracy is not maintained for every setting of $k$. This finding indicates that the number of parameters required for CIFAR-10 / CIFAR-100 / SVHN classification is large in relation to the model size, and that the appropriate $k$ for a module depends on the model size. In this experiment, the value of $k$ was chosen in accordance with the baseline, but in practice $k$ and $\alpha$ need to be determined depending on the model and dataset. As with model compression methods, we observed that the accuracy of the model decreased as more parameters were pruned. The proposed method is not suitable for settings with a very large number of classes because of the computational time required to build modules with module co-training. For actual use, it is necessary to verify the required recomposed model size and accuracy in advance.

Ablation study

Table 2 compares the effects of initial score sharing and stemming loss in the proposed method. The results show that stemming loss has a significant effect and that the scores depend only weakly on their initial values. Although initial score sharing did not have a significant effect on performance, it is expected to make the learning progress of each module more manageable.

Initial Score Sharing Stemming Loss Avg. Ratio of RPs Best Worst
0.175 0.159 0.185
0.173 0.159 0.183
0.122 0.117 0.126
0.122 0.116 0.126
Table 2: Number of parameters required by the recomposition when building the modules required to classify two subclasses at epoch 25 on MNIST with the FC1 model. Ratio of RPs indicates the ratio of remaining parameters needed to solve a two-class classification task. A lower Avg. Ratio of RPs is better. The results show that the proposed method is effective in keeping the number of parameters low.


Through experimental evaluation, we confirmed that the proposed method can decompose NNs into modules and merge them to solve classification subtasks, with few parameters in the merged model and high accuracy. The conventional method is limited to fully connected layers and is not applicable to networks that include convolutional layers, skip connections He et al. (2016), or average pooling Lin et al. (2014). The proposed algorithm, which performs pruning while retaining similar edges among modules, can be applied to widely used NNs through the proposed loss function based on differentiable masking. As no training is required for recomposition, modules for predicting subtasks can be used immediately by applying masks. Module decomposition itself does not require a large amount of training time because it only prunes the edges of an already trained model. As a limitation, however, its computation time is proportional to the number of classes. To decompose a model with a very large number of classes, it is necessary to use techniques such as distributed learning.


In this study, we addressed the extraction of the subnetworks required for the classification of a subclass set from a trained model. We proposed an approach that decomposes an existing trained model into modules. The proposed method employs weight masks to extract modules and is applicable to arbitrary DNNs. Moreover, it does not require any assumptions about the architecture of the network, and the extracted modules can be recomposed and applied immediately without retraining. The proposed method has shown promising results: on several datasets it extracts modules that share similar edges, which allows us to reduce the model size when recomposing modules. Future work includes a more detailed analysis of the edges commonly used across modules in the decomposed model, and further investigation of the conditions and model sizes under which stemming can work more efficiently.


TS was partially supported by JSPS KAKENHI (18H03201), Fujitsu Laboratories Ltd., and JST CREST.


  • M. Amer and T. Maul (2019) A review of modularization techniques in artificial neural networks. Artificial Intelligence Review 52 (1), pp. 527–561. Cited by: Why decompose a trained neural network into modules?.
  • R. Anand, K. Mehrotra, C. K. Mohan, and S. Ranka (1995) Efficient classification for multiclass problems using modular neural networks. IEEE Trans. Neural Networks 6 (1), pp. 117–124. Cited by: Neural Network Decomposition.
  • G. Auda and M. Kamel (1999) Modular neural networks: a survey. International Journal of Neural Systems 9 (02), pp. 129–151. Cited by: Why decompose a trained neural network into modules?.
  • F. Craighero, F. Angaroni, A. Graudenzi, F. Stella, and M. Antoniotti (2020) Investigating the compositional structure of deep neural networks. In International Conference on Machine Learning, Optimization, and Data Science, pp. 322–334. Cited by: Neural Network Decomposition.
  • R. Csordás, S. van Steenkiste, and J. Schmidhuber (2021) Are neural nets modular? inspecting functional modularity through differentiable weight masks. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Link Cited by: Neural Network Decomposition.
  • E. D. Cubuk, B. Zoph, J. Shlens, and Q. Le (2020) RandAugment: practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 18613–18624. External Links: Link Cited by: Appendix A, Models.
  • D. Filan, S. Hod, C. Wild, A. Critch, and S. Russell (2020) Neural networks are surprisingly modular. arXiv. Cited by: Neural Network Decomposition.
  • J. Frankle and M. Carbin (2019) The lottery ticket hypothesis: finding sparse, trainable neural networks. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: Lottery Ticket Hypothesis and Supermask.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: Decomposing.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. External Links: Document Cited by: Models, Discussion.
  • D. Hupkes, V. Dankers, M. Mul, and E. Bruni (2020) Compositionality decomposed: how do neural networks generalise? (extended abstract). In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere (Ed.), pp. 5065–5069. Note: Journal track Cited by: Why decompose a trained neural network into modules?.
  • A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Cited by: item.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, Vol. 86, pp. 2278–2324. External Links: Link Cited by: item.
  • M. Lin, Q. Chen, and S. Yan (2014) Network in network. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: Discussion.
  • A. Mallya, D. Davis, and S. Lazebnik (2018) Piggyback: adapting a single network to multiple tasks by learning to mask weights. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV, Lecture Notes in Computer Science, Vol. 11208, pp. 72–88. Cited by: Lottery Ticket Hypothesis and Supermask.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. G. H. Bower (Ed.), Psychology of Learning and Motivation, Vol. 24, pp. 109–165. External Links: ISSN 0079-7421, Document, Link Cited by: 3rd item.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, Cited by: item.
  • R. Pan and H. Rajan (2020) On decomposing a deep neural network into modules. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, pp. 889–900. Cited by: Appendix A, Appendix A, Neural Network Decomposition, Models.
  • V. Ramanujan, M. Wortsman, A. Kembhavi, A. Farhadi, and M. Rastegari (2020) What’s hidden in a randomly weighted neural network?. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pp. 11890–11899. Cited by: Appendix A, Introduction, Lottery Ticket Hypothesis and Supermask, Decomposing.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: Models.
  • C. Tai, T. Xiao, X. Wang, and W. E (2016) Convolutional neural networks with low-rank regularization. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), Cited by: Neural Network Decomposition.
  • C. Watanabe, K. Hiramatsu, and K. Kashino (2018) Modular representation of layered neural networks. Neural Networks 97, pp. 62–73. Cited by: Neural Network Decomposition.
  • M. Wortsman, V. Ramanujan, R. Liu, A. Kembhavi, M. Rastegari, J. Yosinski, and A. Farhadi (2020) Supermasks in superposition. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Cited by: Appendix A, Lottery Ticket Hypothesis and Supermask.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. External Links: cs.LG/1708.07747 Cited by: item.
  • S. Zagoruyko and N. Komodakis (2016) Wide residual networks. CoRR abs/1605.07146. External Links: Link, 1605.07146 Cited by: Appendix A, Models.
  • H. Zhou, J. Lan, R. Liu, and J. Yosinski (2019) Deconstructing lottery tickets: zeros, signs, and the supermask. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 3592–3602. Cited by: Introduction, Introduction, Lottery Ticket Hypothesis and Supermask.

Appendix A Additional Experiments’ Details

Training Parameter Details

Table 3 shows the SGD training parameters used in the experiments for building modules. Our implementation and training parameters essentially follow Zagoruyko and Komodakis (2016); Wortsman et al. (2020); Pan and Rajan (2020). We confirmed that the accuracy of each model is almost equal to the accuracy reported in the corresponding reference.

| Model | Parameter | Value |
|---|---|---|
| MNIST/F-MNIST/CIFAR10 | learning rate | 0.01 |
| | weight decay | |
| | momentum (Nesterov) | 0.9 |
| | batch size | 64 |
| | epochs | 100 |
| CIFAR100 | learning rate | 0.1 (stepLR) |
| | multi step LR decay | 0.02 (epoch at 60, 120, 160) |
| | weight decay | |
| | momentum (Nesterov) | 0.9 |
| | batch size | 256 |
| | epochs | 200 |
Table 3: Training Parameters
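The CIFAR100 schedule in Table 3 can be sketched as a small function. This is a minimal illustration that assumes the "multi step LR decay" value is a multiplicative factor applied at each listed epoch, in the manner of the `gamma` and `milestones` arguments of PyTorch's `MultiStepLR`. The function name and that interpretation are assumptions, since the table does not spell out how the decay is applied.

```python
def multistep_lr(epoch, base_lr=0.1, gamma=0.02, milestones=(60, 120, 160)):
    """Learning rate at `epoch`, assuming one multiplication by `gamma` per milestone reached."""
    reached = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** reached


# Learning rate at the start of training and right after each assumed milestone.
schedule = {epoch: multistep_lr(epoch) for epoch in (0, 60, 120, 160)}
```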


As shown by Ramanujan et al. (2020), data augmentation can also improve accuracy when using supermasks. For CIFAR-10 and SVHN, we applied standard data augmentation (flipping and random cropping). We did not apply any data augmentation for MNIST and F-MNIST, so that the comparison with Pan and Rajan (2020) is made under the same conditions. For the WideResNet28-10/CIFAR-100 models, we used RandAugment Cubuk et al. (2020). For the RandAugment parameters, we used in and .
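The standard augmentation mentioned above (flipping and random cropping) can be sketched as follows. The padding of 4 pixels, the zero fill value, and the flip probability of 0.5 are common defaults assumed here for illustration; the text does not state the values actually used.

```python
import random


def random_crop_flip(image, pad=4, rng=random):
    """Zero-pad an H x W image (list of rows), crop a random H x W window, flip it half the time."""
    h, w = len(image), len(image[0])
    blank = [0] * (w + 2 * pad)
    padded = [list(blank) for _ in range(pad)]
    padded += [[0] * pad + list(row) + [0] * pad for row in image]
    padded += [list(blank) for _ in range(pad)]
    top = rng.randint(0, 2 * pad)   # randint includes both endpoints
    left = rng.randint(0, 2 * pad)
    crop = [row[left:left + w] for row in padded[top:top + h]]
    if rng.random() < 0.5:          # assumed flip probability
        crop = [row[::-1] for row in crop]
    return crop


# Hypothetical 3x3 single-channel image, augmented with a seeded generator.
toy_image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
augmented = random_crop_flip(toy_image, pad=1, rng=random.Random(0))
```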

Training Module Details

For building modules via co-module training, we used the parameters shown in Table 4. Each model decomposition was performed for five trials, and the shaded error bars in Figure 3 and Figure 4 indicate one standard deviation. All models and modules were trained on eight NVIDIA Tesla V100 GPUs.
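A one-standard-deviation band over five trials can be computed as in the sketch below. The accuracies are hypothetical placeholders, not results from the paper, and the use of the sample standard deviation (dividing by n - 1) is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def error_band(trial_accuracies):
    """Return (lower, mean, upper) for a band of one sample standard deviation."""
    m = mean(trial_accuracies)
    s = stdev(trial_accuracies)  # sample standard deviation, assumed estimator
    return m - s, m, m + s


# Hypothetical accuracies of one decomposition repeated for five trials.
trials = [98.0, 98.5, 99.0, 98.5, 98.5]
lower, center, upper = error_band(trials)
```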

| Parameter | Value |
|---|---|
| learning rate | 0.01 (WideResNet) / 0.1 (others) |
| weight decay | |
| momentum (Nesterov) | 0.9 |
| batch size | 512 (WideResNet) / 32 (others) |
Table 4: Decomposition Parameters

Appendix B Additional Analysis

Analysis of Hyperparameter

Table 5 shows the maximum test set accuracy of each module on MNIST as the hyperparameter is varied. Given enough epochs, the accuracy has an upper bound because the weights do not change. In this experiment, with 50 epochs, the best accuracy did not change across the 5 trials. As shown in Table 5, even with a nonzero hyperparameter value, which has the effect of module stemming, the accuracy is sometimes higher than with a value of 0. This can be attributed to the regularization effect of module stemming and to the addition of a grafting layer. Just as accuracy does not necessarily increase with the number of layers or parameters, this indicates that tuning this hyperparameter is important. Note that Table 5 reports the average accuracy of a single module (two-class problems), which differs from the accuracy after recomposition of 10 modules shown in Table 1 (10-class problems).

| | =0 | =0.05 | =0.1 | =0 | =0.05 | =0.1 | =0 | =0.05 | =0.1 | =0 | =0.05 | =0.1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| module 0 | 99.5 | 99.63 | 99.58 | 99.66 | 99.68 | 99.68 | 99.67 | 99.67 | 99.67 | 99.43 | 99.68 | 99.67 |
| module 1 | 99.7 | 99.73 | 99.68 | 99.67 | 99.62 | 99.7 | 99.75 | 99.71 | 99.74 | 99.61 | 99.74 | 99.72 |
| module 2 | 99.06 | 98.97 | 98.9 | 99.19 | 99.18 | 99.14 | 99.2 | 99.21 | 99.23 | 98.87 | 99.12 | 99.14 |
| module 3 | 98.45 | 98.63 | 98.58 | 98.93 | 99.05 | 98.79 | 98.82 | 98.73 | 98.91 | 97.94 | 98.75 | 98.9 |
| module 4 | 99.24 | 99.22 | 99.2 | 99.22 | 99.30 | 99.28 | 99.36 | 99.26 | 99.16 | 98.47 | 99.06 | 99.16 |
| module 5 | 98.73 | 98.83 | 98.78 | 99.02 | 98.98 | 98.73 | 98.86 | 98.74 | 98.75 | 98.22 | 99.00 | 98.88 |
| module 6 | 99.44 | 99.45 | 99.45 | 99.48 | 99.49 | 99.41 | 99.47 | 99.37 | 99.38 | 99.19 | 99.42 | 99.44 |
| module 7 | 98.9 | 98.74 | 98.91 | 98.95 | 99.02 | 98.96 | 99.01 | 99.02 | 99.12 | 98.51 | 98.97 | 98.97 |
| module 8 | 98.06 | 98.18 | 98.54 | 98.39 | 98.31 | 98.51 | 98.42 | 98.59 | 98.5 | 97.67 | 98.66 | 98.62 |
| module 9 | 98.77 | 98.76 | 98.83 | 98.77 | 98.74 | 98.75 | 98.87 | 98.93 | 98.99 | 98.1 | 99.04 | 99.00 |
| avg | 98.99 | 99.01 | 99.05 | 99.13 | 99.14 | 99.10 | 99.14 | 99.12 | 99.15 | 98.60 | 99.14 | 99.15 |
Table 5: The results of the maximum test set accuracy of each single module for the comparison when the hyperparameter  is varied (MNIST ())
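The "avg" row of Table 5 appears to be the plain mean of the ten per-module accuracies in each column. The sketch below checks this reading on the first and last columns, with values copied from the table. The variable and function names are illustrative.

```python
# Per-module accuracies copied from the first and last columns of Table 5.
first_column = [99.5, 99.7, 99.06, 98.45, 99.24, 98.73, 99.44, 98.9, 98.06, 98.77]
last_column = [99.67, 99.72, 99.14, 98.9, 99.16, 98.88, 99.44, 98.97, 98.62, 99.00]


def column_average(values):
    """Unweighted mean over modules, the assumed definition of the avg row."""
    return sum(values) / len(values)


first_avg = column_average(first_column)
last_avg = column_average(last_column)
```

Both means agree with the reported 98.99 and 99.15 to within rounding. Each is an average over two-class problems and so is not directly comparable with a 10-class accuracy after recomposition.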

Analysis of Modules Accuracy

Related to Figure 3, Table 6 shows how accurately each module classifies each class on a dataset in which the target class and the other classes are balanced in a 1:1 ratio. The accuracy is high only when the module and the subtask match; for other classes, binary classification fails. This indicates that the modules were successfully separated as subnetworks, each acting as a single-class classification function.

| | class 0 | class 1 | class 2 | class 3 | class 4 | class 5 | class 6 | class 7 | class 8 | class 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| module 0 | 98.48 | 44.39 | 44.64 | 44.70 | 44.46 | 44.91 | 45.12 | 44.81 | 44.83 | 44.76 |
| module 1 | 43.81 | 98.38 | 43.99 | 43.84 | 43.80 | 43.88 | 43.93 | 44.07 | 44.13 | 43.96 |
| module 2 | 44.37 | 44.22 | 96.71 | 44.71 | 44.50 | 44.52 | 44.76 | 45.16 | 44.64 | 44.39 |
| module 3 | 44.74 | 44.92 | 46.04 | 92.43 | 44.79 | 45.52 | 44.72 | 44.96 | 46.16 | 45.26 |
| module 4 | 44.61 | 44.51 | 44.91 | 44.59 | 96.28 | 44.84 | 45.02 | 45.33 | 44.75 | 46.12 |
| module 5 | 45.64 | 45.54 | 45.58 | 46.19 | 45.72 | 92.42 | 45.95 | 45.59 | 46.03 | 45.70 |
| module 6 | 44.88 | 44.67 | 44.96 | 44.74 | 45.28 | 45.13 | 97.99 | 44.79 | 44.86 | 44.79 |
| module 7 | 44.73 | 44.71 | 45.17 | 44.99 | 44.73 | 44.86 | 44.79 | 94.21 | 44.94 | 45.11 |
| module 8 | 45.31 | 45.45 | 46.10 | 45.68 | 45.33 | 45.82 | 45.45 | 45.43 | 91.39 | 45.54 |
| module 9 | 44.57 | 44.45 | 44.79 | 44.77 | 45.51 | 44.89 | 44.58 | 45.31 | 45.18 | 96.11 |
(a) MNIST, FC1, k=
| | class 0 | class 1 | class 2 | class 3 | class 4 | class 5 | class 6 | class 7 | class 8 | class 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| module 0 | 98.76 | 44.46 | 44.75 | 44.53 | 44.53 | 45.07 | 44.75 | 44.92 | 44.55 | 44.83 |
| module 1 | 43.69 | 99.60 | 43.79 | 43.72 | 43.68 | 43.84 | 43.80 | 43.83 | 43.73 | 43.98 |
| module 2 | 44.49 | 44.37 | 97.31 | 44.79 | 44.67 | 44.59 | 44.51 | 44.98 | 44.75 | 44.43 |
| module 3 | 44.64 | 44.79 | 45.79 | 95.22 | 44.63 | 44.93 | 44.64 | 44.72 | 45.35 | 44.79 |
| module 4 | 44.49 | 44.36 | 44.48 | 44.54 | 98.31 | 44.54 | 44.97 | 44.76 | 44.80 | 45.37 |
| module 5 | 45.34 | 44.89 | 45.02 | 45.79 | 45.01 | 96.52 | 46.56 | 44.93 | 45.19 | 45.28 |
| module 6 | 44.98 | 44.71 | 45.04 | 44.72 | 45.06 | 45.41 | 98.48 | 44.71 | 44.82 | 44.71 |
| module 7 | 44.68 | 44.71 | 44.89 | 44.84 | 44.77 | 44.77 | 44.69 | 95.61 | 44.73 | 44.90 |
| module 8 | 45.50 | 45.66 | 45.59 | 45.53 | 45.55 | 45.87 | 45.52 | 45.49 | 91.89 | 45.54 |
| module 9 | 44.76 | 44.33 | 44.39 | 44.45 | 45.98 | 44.81 | 44.46 | 45.54 | 44.64 | 96.67 |
(b) MNIST, FC1, k=
| | class 0 | class 1 | class 2 | class 3 | class 4 | class 5 | class 6 | class 7 | class 8 | class 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| module 0 | 98.72 | 44.48 | 44.77 | 44.54 | 44.54 | 44.68 | 44.77 | 45.12 | 44.69 | 44.88 |
| module 1 | 43.73 | 99.24 | 43.98 | 43.83 | 43.72 | 43.79 | 44.02 | 43.74 | 43.86 | 43.80 |
| module 2 | 44.37 | 44.41 | 96.97 | 44.83 | 44.44 | 44.62 | 44.51 | 44.81 | 44.52 | 44.31 |
| module 3 | 44.82 | 44.80 | 45.25 | 94.46 | 44.92 | 45.49 | 44.82 | 45.06 | 45.16 | 44.87 |
| module 4 | 44.56 | 44.41 | 44.51 | 44.58 | 97.66 | 44.88 | 44.70 | 44.75 | 44.84 | 45.36 |
| module 5 | 45.41 | 45.28 | 45.32 | 45.47 | 45.39 | 94.76 | 46.47 | 45.33 | 45.74 | 45.35 |
| module 6 | 45.11 | 44.69 | 44.93 | 44.73 | 44.94 | 45.10 | 98.08 | 44.83 | 44.76 | 44.72 |
| module 7 | 44.34 | 44.23 | 44.74 | 44.71 | 44.52 | 44.71 | 44.26 | 97.53 | 44.31 | 45.09 |
| module 8 | 45.38 | 45.33 | 45.58 | 45.39 | 45.23 | 45.97 | 45.45 | 45.24 | 93.18 | 45.21 |
| module 9 | 44.59 | 44.39 | 44.46 | 44.95 | 45.36 | 44.68 | 44.55 | 45.26 | 44.79 | 96.49 |
(c) MNIST, FC1, k=
Table 6: Classification accuracy of each module for each class. The diagonal entries give the accuracy when the module and the subtask to be classified match. Each module holds only the classification function for its own class and cannot classify other classes.
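One possible reading of the off-diagonal values of roughly 44 to 46% is sketched below. It is an interpretation, not a statement from the authors. It assumes that the evaluation set for class j consists of half class-j samples and half samples drawn uniformly from the nine other classes, and that module i (with i different from j) behaves as an ideal detector of class i only.

```python
from fractions import Fraction


def expected_mismatch_accuracy(num_classes=10):
    """Expected accuracy of an ideal class-i detector on the balanced task for class j != i."""
    # Positive half (class j): the class-i detector rejects every sample, so all are wrong.
    positive_half = Fraction(1, 2) * 0
    # Negative half (the other classes, uniform): only the class-i share is wrongly accepted.
    negative_half = Fraction(1, 2) * Fraction(num_classes - 2, num_classes - 1)
    return positive_half + negative_half


expected = expected_mismatch_accuracy()  # exact fraction for 10 classes
expected_percent = float(expected) * 100
```

Under these assumptions the expected accuracy is 4/9, or about 44.4%, close to the off-diagonal entries. This would be consistent with each module responding only to its own class rather than guessing at random.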