New pointwise convolution in Deep Neural Networks through Extremely Fast and Non Parametric Transforms

06/25/2019 ∙ by Joonhyun Jeong, et al. ∙ Kyung Hee University

Conventional transforms such as the Discrete Walsh-Hadamard Transform (DWHT) and the Discrete Cosine Transform (DCT) have been widely used as feature extractors in image processing but rarely applied in neural networks. However, we found that these conventional transforms can capture cross-channel correlations without any learnable parameters in DNNs. This paper is the first to propose applying conventional transforms to pointwise convolution, showing that such transforms significantly reduce the computational complexity of neural networks without degrading accuracy. DWHT in particular requires no floating point multiplications, only additions and subtractions, which considerably reduces computation overhead. In addition, its fast algorithm further reduces the complexity of floating point additions from O(n^2) to O(n log n). These properties yield networks that are extremely efficient in the number of parameters and operations while also gaining accuracy. Our proposed DWHT-based model gained 1.49% in accuracy with 79.1% fewer parameters and 48.4% fewer FLOPs compared with its baseline model (MobileNet-V1) on the CIFAR 100 dataset.




1 Introduction

Large Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016; Szegedy et al., 2016b, a) and automatic Neural Architecture Search (NAS) based networks (Zoph et al., 2018; Liu et al., 2018; Real et al., 2018) have evolved to show remarkable accuracy on various tasks such as image classification (Deng et al., 2009; Krizhevsky & Hinton, 2009) and object detection (Lin et al., 2014), benefiting from huge numbers of learnable parameters and computations. However, this large number of weights and high computation cost allow only limited application on mobile devices with tight memory constraints and on devices that require real-time computation (Canziani et al., 2016).

To address these problems, Howard et al. (2017), Sandler et al. (2018), Zhang et al. (2017) and Ma et al. (2018) proposed parameter- and computation-efficient blocks that maintain almost the same accuracy as other heavy CNN models. All of these blocks use depthwise separable convolution, which deconstructs the standard convolution with a ($3 \times 3 \times C$) kernel into a depthwise convolution ($3 \times 3 \times 1$) specific to spatial information and a pointwise ($1 \times 1 \times C$) convolution specific to channel information. The depthwise separable convolution achieves accuracy comparable to standard convolution with far fewer parameters and FLOPs. These properties have made the depthwise separable convolution, and with it the pointwise convolution (PC), widely used in modern CNN architectures.

We point out that the existing PC layer is computationally expensive and accounts for a large proportion of the number of parameters (Howard et al., 2017). Although the demand for PC layers has been and will be growing rapidly in modern neural network architectures, there has been little research on improving their naive structure.

Therefore, this paper proposes a new PC layer formulated by non-parametric and extremely fast conventional transforms. Conventional transforms that we applied on CNN models are Discrete Walsh-Hadamard Transform (DWHT) and Discrete Cosine Transform (DCT), which have widely been used in image processing but rarely been applied in CNNs (Ghosh & Chellappa, 2016).

We empirically found that, although neither of these transforms requires any learnable parameters, both show sufficient ability to capture cross-channel correlations. DWHT in particular is a good replacement for the conventional PC layer, as it requires no floating point multiplications, only additions and subtractions, which significantly reduces the computation overhead of PC layers. Furthermore, DWHT can take strong advantage of its fast version, in which the computational complexity of the floating point operations is reduced from $O(N^2)$ to $O(N \log N)$. These properties yield neural networks that are extremely efficient in parameters and computation while also gaining accuracy.

Our contributions are summarized as follows:

  • We propose a new PC layer formulated with conventional transforms, which requires no learnable parameters and significantly reduces the number of floating point operations compared to the existing PC layer.

  • The great benefits of using the bases of existing transforms come from their fast versions, which drastically decrease computation complexity in neural networks without degrading accuracy performance.

  • We found that applying ReLU after conventional transforms discards important information extracted, leading to significant drop in accuracy. Based on this finding, we propose a new computation block.

  • We also found that the conventional transforms can effectively be used especially for extracting high-level features in neural networks. Based on this, we propose a new transform-based neural network architecture. Specifically, using DWHT, our proposed method yields 1.49% accuracy gain as well as 79.1% and 48.4% reduced parameters and FLOPs, respectively, compared with its baseline model (MobileNet-V1) on CIFAR 100 dataset.

2 Related Work

2.1 Deconstruction and Decomposition of Convolutions

To reduce the computation complexity of existing convolution methods, several approaches that rethink and deconstruct the naive convolution structure have been presented. Simonyan & Zisserman (2014) factorized a large kernel (e.g. $5 \times 5$) in a convolution layer into several small kernels (e.g. $3 \times 3$) spread over several convolution layers. Jeon & Kim (2017) pointed out that existing convolution is limited by its fixed receptive field. Consequently, they introduced learnable spatial displacement parameters, making convolution more flexible. Building on Jeon & Kim (2017), Jeon & Kim (2018) proved that the standard convolution can be deconstructed into a single PC layer with spatially shifted channels. Based on that, they proposed a very efficient convolution layer, namely the active shift layer, by replacing spatial convolutions with shift operations.

It is worth noting that the PC layer usually takes much more computation and many more parameters than the spatial convolution in a depthwise separable convolution layer (Howard et al., 2017). Therefore, there have been attempts to reduce the computation complexity of the PC layer. Zhang et al. (2017) proposed ShuffleNet-V1, where the features are decomposed into several groups over channels and the PC operation is conducted for each group, thus reducing the number of parameters and FLOPs by a factor of the number of groups $g$. However, Ma et al. (2018) showed that the memory access cost increases as $g$ increases, leading to slower inference speed. Like the aforementioned methods, our work aims to reduce the computation complexity and the number of parameters in a convolution layer. However, our objective is oriented more toward finding mathematically efficient algorithms that make the weights in convolution kernels more effective in feature representation as well as efficient in terms of computation.

2.2 Quantization

Quantization in neural networks reduces the number of bits used to represent the weights and/or activations. Vanhoucke et al. (2011) applied 8-bit quantization to weight parameters, which enabled considerable speed-up with a small drop in accuracy. Gupta et al. (2015) applied 16-bit fixed point representation with stochastic rounding. Han et al. (2015b) pruned unimportant weight connections by thresholding the weight values. Based on Han et al. (2015b), Han et al. (2015a) successfully combined pruning with quantization to 8 bits or less and Huffman encoding. The extreme case of quantized networks evolved from Courbariaux et al. (2015), which approximated weights with binary ($\pm 1$) values. Following the milestone of Courbariaux et al. (2015), Courbariaux & Bengio (2016) and Hubara et al. (2016) constructed Binarized Neural Networks, which either stochastically or deterministically binarize the real-valued weights and activations. These binarized weights and activations significantly reduce run-time by replacing floating point multiplications with 1-bit XNOR operations.

Based on Binarized Neural Networks (Courbariaux & Bengio, 2016; Hubara et al., 2016), Local Binary CNN (Juefei-Xu et al., 2016) proposed a convolution module that uses binarized fixed weights in spatial convolution based on Local Binary Patterns, thus replacing multiplications with a few addition/subtraction operations in spatial convolution. However, they did not consider reducing the computation complexity of the PC layer and left its weights as learnable floating point variables. Our work is similar to Local Binary CNN (Juefei-Xu et al., 2016) in using fixed binary weight values. However, Local Binary Patterns have limitations when applied in CNNs: they can only be used in spatial convolution, and no approach exists that enables their fast computation.

2.3 Conventional Transforms

In general, several transform techniques have been applied for image processing. The Discrete Cosine Transform (DCT) has been used as a powerful feature extractor (Dabbaghchian et al., 2010). For an $N$-point input sequence, the basis kernel of DCT is defined as a list of cosine values as below:

$$\mathbf{c}_m = \left[ \cos\left( \frac{(2x+1)\, m\, \pi}{2N} \right) \right], \quad 0 \le x \le N-1 \qquad (1)$$

where $m$ is the index of a basis, and $\mathbf{c}_m$ captures higher frequency information in the input signal as $m$ increases. This property led DCT to be widely applied in image/video compression techniques that emphasize the power of image signals in low frequency regions (Rao & Yip, 2014).
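The frequency ordering of the DCT bases can be checked numerically. The sketch below assumes the unnormalized DCT-II form cos((2x+1)mπ/(2N)); the function names `dct_basis` and `sign_changes` are illustrative and do not come from the paper. Counting sign flips along each basis is used here as a crude proxy for frequency.

```python
import math


def dct_basis(m, n_points):
    """Unnormalized DCT-II basis kernel of index m for an n_points-long input."""
    return [math.cos((2 * x + 1) * m * math.pi / (2 * n_points))
            for x in range(n_points)]


def sign_changes(values):
    """Count sign flips between neighbouring samples (a crude frequency measure)."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)


if __name__ == "__main__":
    N = 8
    for m in range(N):
        # The number of sign flips grows with the basis index m.
        print(m, sign_changes(dct_basis(m, N)))
```

For an 8-point input the basis of index m flips sign m times, which is the sense in which larger m corresponds to higher frequency.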

The Discrete Walsh-Hadamard Transform (DWHT) is a very fast and efficient transform that uses only $+1$ and $-1$ elements in its kernels. These binary elements allow DWHT to be computed without any multiplications, using only additions and subtractions. Therefore, DWHT has been widely used for fast feature extraction in many practical applications, such as texture image segmentation (Vard et al., 2011), face recognition (Hassan et al., 2007), and video shot boundary detection (G. & S., 2014).

Further, DWHT can take advantage of a structured-wiring-based fast algorithm (Algorithm 1), allowing very high efficiency in encoding the spatial information (Pratt et al., 1969). The basis kernel matrix $H^D$ of DWHT is defined using the previous kernel matrix $H^{D-1}$ as below:

$$H^D = \begin{bmatrix} H^{D-1} & H^{D-1} \\ H^{D-1} & -H^{D-1} \end{bmatrix} \qquad (2)$$

where $H^0 = 1$ and $D \ge 1$. In this paper we denote $\mathbf{h}_m$ as the $m$-th row vector of $H^D$ in Eq. 2. Additionally, we adopt the fast DWHT algorithm to reduce the computation complexity of the PC layer in neural networks, resulting in an extremely fast and efficient neural network.
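The recursive definition of the DWHT kernel matrix can be sketched in a few lines. This assumes the standard Sylvester construction (each step places three copies and one negated copy of the previous matrix in a 2×2 block layout, starting from the scalar 1); `hadamard` and `dwht_naive` are illustrative names, not functions from the paper.

```python
def hadamard(d):
    """Return the 2^d x 2^d kernel matrix as a list of rows (Sylvester recursion)."""
    h = [[1]]  # base case: the 1 x 1 matrix [1]
    for _ in range(d):
        top = [row + row for row in h]                    # [H, H]
        bottom = [row + [-v for v in row] for row in h]   # [H, -H]
        h = top + bottom
    return h


def dwht_naive(x):
    """Direct O(N^2) transform: one inner product per basis row."""
    h = hadamard(len(x).bit_length() - 1)  # len(x) is assumed to be a power of two
    return [sum(w * v for w, v in zip(row, x)) for row in h]


if __name__ == "__main__":
    for row in hadamard(2):
        print(row)
    print(dwht_naive([1, 2, 3, 4]))
```

Every entry is +1 or -1, so the inner products in `dwht_naive` reduce to signed sums; the multiplications appear only because the sketch reuses a generic dot product.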

3 Method

We propose a new PC layer which is computed with conventional transforms. The conventional PC layer can be formulated as follows:

$$Z_{ijm} = \mathbf{w}_m^\top \cdot X_{ij}, \quad 1 \le m \le M \qquad (3)$$

where ($i$, $j$) is a spatial index, and $m$ is the output channel index. In Eq. 3, $N$ and $M$ are the number of input and output channels, respectively. $X_{ij} \in \mathbb{R}^N$ is the input vector at the spatial index ($i$, $j$), and $\mathbf{w}_m \in \mathbb{R}^N$ is the $m$-th weight vector in Eq. 3. For simplicity, the stride is set as 1 and the bias is omitted in Eq. 3.

Our proposed method is to replace the learnable parameters $\mathbf{w}_m$ with the bases of the conventional transforms. For example, replacing $\mathbf{w}_m$ with the DWHT row vector $\mathbf{h}_m$ of Eq. 2 in Eq. 3, we can formulate a new multiplication-free PC layer using DWHT. Similarly, the DCT basis kernels $\mathbf{c}_m$ in Eq. 1 can substitute for $\mathbf{w}_m$ in Eq. 3, formulating another new PC layer using DCT. Note that the normalization factors of the conventional transforms are not applied in the proposed PC layer, because Batch Normalization (Ioffe & Szegedy, 2015) performs a normalization and a linear transform, which can be viewed as the normalization in the existing transforms.
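As a rough illustration of the substitution described above, the sketch below writes a pointwise convolution as one inner product per spatial position and output channel, and then passes fixed Walsh-Hadamard rows in place of learned weights. The feature-map layout `[channel][row][col]` and the names `pointwise_conv` and `hadamard` are assumptions made for the example; stride 1 and no bias are assumed, as in the text.

```python
def hadamard(d):
    """2^d x 2^d Walsh-Hadamard kernel matrix built by the Sylvester recursion."""
    h = [[1]]
    for _ in range(d):
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def pointwise_conv(x, kernels):
    """x is indexed [channel][row][col]; kernels holds M rows of N weights.

    Returns z with z[m][i][j] = sum over n of kernels[m][n] * x[n][i][j].
    """
    n_in, height, width = len(x), len(x[0]), len(x[0][0])
    return [[[sum(kernels[m][n] * x[n][i][j] for n in range(n_in))
              for j in range(width)]
             for i in range(height)]
            for m in range(len(kernels))]


if __name__ == "__main__":
    # 4 input channels, a 1 x 2 spatial grid.
    x = [[[1, 0]], [[2, 0]], [[3, 1]], [[4, 0]]]
    # Fixed, non-learnable kernels: the rows of the 4 x 4 Hadamard matrix.
    print(pointwise_conv(x, hadamard(2)))
```

Because the kernels are constants, the layer contributes no learnable parameters; only the surrounding layers are trained.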

The most important benefit of the proposed method is that the fast algorithms of the existing transforms can be applied in the PC layer for a further reduction in computation. Directly applying the new PC layer above gives a computational complexity of $O(N^2)$. By adopting the fast algorithms, we can significantly reduce the computation complexity of the PC layer from $O(N^2)$ to $O(N \log N)$ without any change in the computation results.
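To make the gap concrete, here is a worked count per spatial position. The channel width N = M = 512 is an illustrative choice (it matches the widest stage mentioned later in the text), and the quadratic versus N log N operation counts are the standard ones for a dense product and a fast Walsh-Hadamard transform.

```latex
% Per spatial position, with N = M = 512 input and output channels:
\begin{align*}
\text{direct PC layer:}\quad & N^2 = 512^2 = 262{,}144 \ \text{multiplications} \\
\text{fast DWHT:}\quad & N \log_2 N = 512 \cdot 9 = 4{,}608 \ \text{additions/subtractions} \\
\text{ratio:}\quad & \frac{N^2}{N \log_2 N} = \frac{512}{9} \approx 56.9
\end{align*}
```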

We give the pseudo-code of our proposed fast PC layer using DWHT in Algorithm 1, based on the fast DWHT structure shown in Figure 1(a). In Algorithm 1, for $\log_2 N$ iterations, the even-indexed channels and odd-indexed channels are added and subtracted element-wise, respectively. The resulting sums and differences are placed in the first $N/2$ elements and the last $N/2$ elements of the input of the next iteration, respectively. In this computation process, each iteration requires only $N$ additions or subtractions. Consequently, Algorithm 1 has a complexity of $O(N \log_2 N)$ in additions or subtractions. Compared to the existing PC layer, which requires $O(N^2)$ multiplications, our method is far cheaper in terms of computation cost, as seen in Figure 1(b), and in the power consumption of computing devices (Horowitz, 2014). Note that, similarly to fast DWHT, DCT can also be computed in a fast manner using a butterfly structure (Kok, 1997).
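The even/odd add-and-subtract iteration described above can be sketched for a single channel vector as follows. This is a minimal illustration under the assumption that the channel count is a power of two; `fast_dwht` and the reference `dwht_naive` are hypothetical names, and the reference is included only to check that both give the same result.

```python
def fast_dwht(x):
    """Fast Walsh-Hadamard transform using only additions and subtractions."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(x)
    for _ in range(n.bit_length() - 1):            # log2(N) iterations
        even, odd = out[0::2], out[1::2]
        sums = [e + o for e, o in zip(even, odd)]   # goes to the first N/2 slots
        diffs = [e - o for e, o in zip(even, odd)]  # goes to the last N/2 slots
        out = sums + diffs
    return out


def dwht_naive(x):
    """Reference O(N^2) version: inner products with the Sylvester Hadamard rows."""
    h = [[1]]
    for _ in range(len(x).bit_length() - 1):
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return [sum(w * v for w, v in zip(row, x)) for row in h]


if __name__ == "__main__":
    x = [3, 1, 4, 1, 5, 9, 2, 6]
    print(fast_dwht(x))
    print(fast_dwht(x) == dwht_naive(x))
```

Each pass moves the lowest index bit to the top while transforming it, so after log2(N) passes the outputs land in the same order as the rows of the recursively defined kernel matrix, with no lookup of kernel values.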

Compared to DWHT, DCT has the advantage of using the more natural shapes of cosine basis kernels, which tend to provide better feature extraction by capturing frequency information. However, DCT inevitably needs multiplications for the inner product between the DCT basis vector $\mathbf{c}_m$ and the input vector $X_{ij}$, and a look-up table (LUT) for the cosine kernel bases, which can increase processing time and memory access. On the other hand, as mentioned, the kernels of DWHT consist only of $\pm 1$, which allows for building a multiplication-free module. Furthermore, no memory access to the kernel bases is needed if our structured-wiring-based fast DWHT algorithm (Algorithm 1) is applied. Our comprehensive experiments in Sections 3.1 and 3.2 show that DWHT is more efficient than DCT when applied in the PC layer, in terms of the trade-off between computation cost and accuracy.

Note that, to obtain a more general formulation of our newly defined PC layer, we pad zeros along the channel axis if the number of input channels is less than the number of output channels, and truncate the output channels if the number of output channels is smaller than the number of input channels, as shown in Algorithm 1.
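The padding and truncation rule can be sketched for the channel vector at one spatial position; a pointwise layer treats every position independently, so the same function would be applied at each one. The sketch assumes the larger of the two channel counts is a power of two, and `dwht_pointwise` and `fast_dwht` are illustrative names rather than the authors' code.

```python
def fast_dwht(x):
    """Fast Walsh-Hadamard transform (additions and subtractions only)."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(x)
    for _ in range(n.bit_length() - 1):
        even, odd = out[0::2], out[1::2]
        out = [e + o for e, o in zip(even, odd)] + [e - o for e, o in zip(even, odd)]
    return out


def dwht_pointwise(channels, n_out):
    """Transform one spatial position's channel vector into n_out output channels."""
    n_in = len(channels)
    # Zero-pad along the channel axis when there are fewer inputs than outputs.
    padded = list(channels) + [0] * max(0, n_out - n_in)
    transformed = fast_dwht(padded)
    # Truncate when fewer output channels than input channels are requested.
    return transformed[:n_out]


if __name__ == "__main__":
    print(dwht_pointwise([1, 2], 4))        # expansion: 2 -> 4 channels
    print(dwht_pointwise([1, 2, 3, 4], 2))  # reduction: 4 -> 2 channels
```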


Figure 1(a) shows the architecture of the fast DWHT algorithm described in Algorithm 1. This structured-wiring-based architecture ensures that the receptive field of each output channel is $N$, which means that each output channel reflects all input channels after $\log_2 N$ iterations. This property helps capture the input channel correlations even though the choice of which channel elements are added and subtracted is deterministically structured.

To fuse our new PC layer into neural networks successfully, we explored two themes: i) an optimal block search for the proposed PC layer; ii) an optimal strategy for inserting the block found in i) into the hierarchy of blocks of a network. We assumed that there exist an optimal block unit structure and optimal hierarchy-level (high-, mid-, low-level) blocks favored by these non-learnable transforms, and conducted experiments on the two themes accordingly. Through these experiments, we evaluated each network by how its accuracy changes as the number of parameters or FLOPs changes. Note that, in general, one floating point multiplication costs a varying number of addition/subtraction operations depending on both the bit width of the operands and the hardware architecture [Ref]. For comparison, we counted FLOPs as the number of multiplications, additions and subtractions performed during inference. Unless mentioned otherwise, we used the following default experimental settings: batch size = 128, training epochs = 200, initial learning rate = 0.1 (multiplied by 0.94 every 2 epochs), and momentum = 0.9 with weight decay = 5e-4. In all experiments, the model accuracy was obtained by averaging three training runs.
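The default settings can be collected into a small configuration with the learning-rate schedule written out. The dictionary keys and the function name are illustrative, and the schedule is read here as a staircase (the factor 0.94 applied once every 2 epochs), which is one plausible interpretation of the text rather than a confirmed detail.

```python
# Hyper-parameters as stated in the text; the key names are assumptions.
TRAIN_CONFIG = {
    "batch_size": 128,
    "epochs": 200,
    "initial_lr": 0.1,
    "lr_decay": 0.94,       # factor multiplied into the learning rate ...
    "lr_decay_every": 2,    # ... once every 2 epochs
    "momentum": 0.9,
    "weight_decay": 5e-4,
}


def learning_rate(epoch, cfg=TRAIN_CONFIG):
    """Staircase schedule: initial_lr * lr_decay ** (epoch // lr_decay_every)."""
    steps = epoch // cfg["lr_decay_every"]
    return cfg["initial_lr"] * cfg["lr_decay"] ** steps


if __name__ == "__main__":
    for epoch in (0, 1, 2, 50, 199):
        print(epoch, learning_rate(epoch))
```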

(a) A black circle indicates a channel element, and black and red lines are additions and subtractions, respectively. Best viewed in color.
(b) The $x$ axis denotes the logarithm of the number of input channels. For simplicity, the number of output channels is set to be the same as the number of input channels for all PC layers. Best viewed in color.
Figure 1: Left: architecture of our PC layer based on the fast DWHT algorithm in Algorithm 1. Right: comparison of the number of multiplications between our new PC layers and the conventional PC layer.
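Algorithm 1 A new pointwise convolution using fast DWHT algorithm
Input: 4D input features X (B × N × H × W), output channel M
n ← log2(max(N, M))
if N < M then
     X ← ZeroPad1D(X, axis=1)    ▷ pad zeros along the channel axis
end if
for i ← 1 to n do
     e ← X[:, ::2, :, :]
     o ← X[:, 1::2, :, :]
     X[:, :2^(n-1), :, :] ← e + o
     X[:, 2^(n-1):, :, :] ← e − o
end for
if M < N then
     X ← X[:, :M, :, :]    ▷ truncate the output channels
end if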

3.1 Optimal Block structure for conventional transforms

From a microscopic perspective, the block unit is the basic foundation of neural networks, and it determines the efficiency of the weight parameter space and computation costs in terms of accuracy. Accordingly, to find the optimal block structure for our PC layer, we replaced the existing PC layer blocks with our new PC layer blocks in ShuffleNet-V2 (Ma et al., 2018). The proposed block and its variations are shown in Figure 2. Comparing the results of (c) and (d) in Table 1 reveals that the ReLU (Nair & Hinton, 2010) activation function significantly harms the accuracy of networks equipped with the conventional transforms. We analyze this phenomenon empirically in Section 4.1. Additionally, the results of (b) and (d) in Table 1 show that the proposed PC layers are superior to a PC layer whose weights are randomly initialized and then fixed. This implies that DWHT and DCT kernels can extract meaningful cross-channel correlation information. Compared to the baseline model in Table 1, (d)-DCT and (d)-DWHT show an accuracy drop of approximately 2.5% while reducing parameters by 41% and FLOPs by 50%. This implies that the proposed blocks (c) and (d) are still inefficient in the trade-off between accuracy and computation cost, which led us to explore further for an optimal network architecture. In the next subsection, we address this problem by applying conventional transforms to features at the optimal hierarchy level (see Section 3.2). Based on our comprehensive experiments, we set block structure (d) as our default proposed block, which is used in all the following experiments.

Figure 2: Our blocks using conventional transform pointwise convolution (CTPC), random constant pointwise convolution (RCPC) block and baseline block of ShuffleNet-V2. Block (b) initialized the weights of PC layer with the distribution of , where is number of input channel and fixed these weights during training.
Top-1 Acc (%) # of Weights (M) # of FLOPs (M)
Baseline (a) 1.57 105
(b) 1.57 105
(c)-DWHT 0.92 52
(c)-DCT 0.92 54
(d)-DWHT 0.92 52
(d)-DCT 0.92 54
Table 1: Performance results of the block units in Figure 2 on the CIFAR100 dataset. All the models are based on ShuffleNet-V2 with width hyper-parameter 1.1x, which we customized so that the numbers of output channels in stages 2, 3 and 4 are 128, 256 and 512, respectively, for fair comparison with DWHT, which requires $2^n$ input channels. We replaced all 13 stride 1 blocks in the baseline model with (b), (c) and (d) blocks. (c)-DWHT denotes that the CTPC in the (c) block is based on DWHT. We applied Batch Normalization to all the blocks for stable training.

3.2 Optimal hierarchy level blocks for conventional transforms

In this section, we search for the optimal hierarchy level at which our optimal block, based on the proposed PC layer, is effectively applied in a whole network architecture. The optimal hierarchy level allows the proposed network to have the minimal number of weight parameters and FLOPs without an accuracy drop, which is made possible by the non-parametric and extremely fast conventional transforms. Note that applying our proposed block to the high-level blocks of the network reduces the number of parameters and FLOPs much more than applying it to the low-level blocks, because the channel depth increases exponentially as the layers go deeper in the network.
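A quick count shows why replacing pointwise layers pays off more at greater channel depth. The sketch assumes a generic depthwise separable block with a 3×3 depthwise kernel, equal input and output channels, and no bias; it is not the exact ShuffleNet-V2 or MobileNet-V1 unit, and the function names are illustrative. The channel widths 128, 256 and 512 are the stage widths mentioned in the Table 1 caption.

```python
def separable_block_params(channels, kernel_size=3):
    """Weights in a generic depthwise separable block (no bias, equal in/out width)."""
    depthwise = kernel_size * kernel_size * channels  # one k x k filter per channel
    pointwise = channels * channels                   # a full 1 x 1 channel mixing
    return depthwise, pointwise


def pointwise_share(channels, kernel_size=3):
    """Fraction of the block's weights that sit in the pointwise layer."""
    depthwise, pointwise = separable_block_params(channels, kernel_size)
    return pointwise / (depthwise + pointwise)


if __name__ == "__main__":
    for c in (128, 256, 512):
        dw, pw = separable_block_params(c)
        print(c, dw, pw, round(pointwise_share(c), 3))
```

The pointwise weights grow quadratically with width while the depthwise weights grow linearly, so a parameter-free pointwise layer removes a larger share of the block in the deeper, wider stages.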

Figure 3: Performance curve of hierarchically applying our optimal block on CIFAR100, Top: in the viewpoint of the number of weight parameters, Bottom: in the viewpoint of the number of FLOPs. The performance of baseline models was evaluated by ShuffleNet-V2 architecture with width hyper-parameter 0.5x, 1x, 1.1x, 1.5x. Our models were all experimented with 1.1x setting, and each dot in the figures represents mean accuracy of 3 network instances. Note that the blue line denotes the indicator of the efficiency of weight parameters or FLOPs in terms of accuracy. The upper left part from the blue line is the superior region while lower right part from blue line is the inferior region compared to the baseline models.

In Figure 3, we applied our optimal block (i.e., (d) block in Figure 2) on high- , middle- and low-level blocks, respectively. In our experiments, we evaluate the performance of the networks depending on the number of blocks where the proposed optimal block is applied. The model that we have tested is denoted as (transform type)-(# of the proposed blocks)-(hierarchy level in Low (L), Middle (M), and High (H) where the proposed optimal is applied). For example, DWHT-3-L indicates the neural network model where the first three blocks in ShuffleNet-V2 consist of the proposed blocks, while the other blocks are the original blocks of ShuffleNet-V2. It is noted that, in this experiment, we fix all the blocks with stride = 2 in the baseline to be original blocks.

Figure 3 shows the performance of the proposed methods depending on the transform types {DCT, DWHT}, hierarchy levels {L, M, H} and the number of the proposed blocks that replace the original ones in the baseline {3, 6, 10} in terms of Top-1 accuracy and the number of parameters (or FLOPs). It is noted that, since the baseline model has only 7 blocks in the middle-level stage (Stage 3), we performed the middle-level experiments only for DCT/DWHT-3-M and -7-M models where the proposed blocks are applied from the beginning of Stage 3 in the baseline. In Figure 3, the performance of our 10-H (or 10-L), 6-H (or 6-L), 3-H (or 3-L) models (7-M and 3-M only for middle-level experiments) is listed in ascending order of the number of parameters and FLOPs.

As can be seen in the first column of Figure 3, applying our optimal block on the high-level blocks achieved much better trade-off between the number of weight parameters (FLOPs) and accuracy. Meanwhile, applying on middle- and low-level features suffered, respectively, slightly and severely from the inefficiency of the number of weight parameters (FLOPs) with regard to accuracy. This tendency is shown similarly for both DWHT-based models and DCT-based models, which implies that there can be an optimal hierarchy level of blocks favored by conventional transforms. Also note that our DWHT-based models showed slightly higher or same accuracy with less FLOPs in all the hierarchy level cases compared to our DCT-based models. This is because the fast version of DWHT does not require any multiplication but needs much less amount of addition or subtraction operations, while it also has the sufficient ability to extract cross-channel information with the exquisite wiring-based structure, compared to the fast version of DCT.

For confirming the generality of the proposed method, we also implemented our methods into MobileNet-V1 (Howard et al., 2017) and performed experiments. Inspired by the above results showing that optimal hierarchy blocks for conventional transforms can be found in the high-level blocks, we replaced high-level blocks of baseline model (MobileNet-V1) and changed the number of proposed blocks that are replaced to verify the effectiveness of the proposed method. The experimental results are described in Table 2. Remarkably, as shown in Table 2, our DWHT-6-H model yielded the 1.49% increase in Top-1 accuracy even under the condition that the 79.1% of parameters and 48.4% of FLOPs are reduced compared with the baseline 1x model. This outstanding performance improvement comes from the depthwise separable convolutions used in MobileNet-V1, where PC layers play dominant roles in computation costs and memory space, i.e., they consume 94.86% in FLOPs and 74% in the total number of parameters in the whole network (Howard et al., 2017). The full performance results for all the hierarchy levels {L, M, H} and the number of blocks {3, 6, 10} (exceptionally, {3, 7} blocks for the middle level experiments) are described in Appendix A.

In Appendix A, note that the 3-H and 6-H models showed great efficiency in the number of parameters and FLOPs with respect to Top-1 accuracy compared to the baseline model, while the 10-H model showed a worse trade-off between the number of weight parameters (or FLOPs) and accuracy. In summary, based on the comprehensive experiments, it can be concluded that i) the proposed PC block always favors high-level blocks over low-level ones in the network hierarchy; ii) the performance gain starts to decrease when the number of transform-based PC blocks exceeds a certain capacity of the network.

Top-1 Acc (%) # of Weights (M) # of FLOPs (M)
Baseline 3.31 94
DWHT-3-H 1.47 73
DCT-3-H 1.47 74
DWHT-6-H 0.68 48
DCT-6-H 0.68 49
Table 2: Performance result of hierarchically applying our optimal block on CIFAR100 dataset. All the models are based on MobileNet-V1 with width hyper-parameter 1x. We replaced both stride 1, 2 blocks in the baseline model with the optimal block that consist of depthwise convolution - Batch Normalization - CTPC - Batch Normalization in series.

4 Experiments and Analysis

4.1 Hindrance of ReLU in cross-channel representability

As seen in Table 1, applying ReLU after conventional transforms significantly harmed the accuracy. This is due to a property of the conventional transform basis kernels: both the DWHT kernels $\mathbf{h}_m$ in Eq. 2 and the DCT kernels $\mathbf{c}_m$ in Eq. 1 have the same number of positive and negative elements, except for the first basis kernel, whose elements are all positive, and the distributions of the absolute values of the positive and negative elements in the kernels are almost identical.

This property tells us that output channel elements with values below zero should also be considered during the forward pass. When forwarding the input $X_{ij}$ in Eq. 3 through the conventional transforms, if some important channel elements in $X_{ij}$ that have larger values than others are combined with negative values of $\mathbf{h}_m$ or $\mathbf{c}_m$, the important cross-channel information in the output $Z_{ijm}$ in Eq. 3 can reside in the value range below zero. Thus, applying ReLU after conventional transforms discards information that must be forwarded. Based on this analysis, we do not use a non-linear activation function after the proposed PC layer, since it would restrict the information flow of the conventional transforms.

Figure 4 shows that the activations at all hierarchy levels, from both DCT and DWHT, contain not only positive values but also negative values in almost the same proportion. These negative values possibly include important cross-channel correlation information. Thus, applying ReLU to the activations of PC layers based on conventional transforms discards crucial cross-channel information contained in the negative values, leading to the significant accuracy drop shown in Table 1.
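A small numeric sketch of this effect: two different channel vectors can produce Walsh-Hadamard outputs that differ only in their negative entries, so a ReLU placed after the transform maps them to the same activation. The input vectors are made up for illustration, and `hadamard`, `dwht` and `relu` are hypothetical helper names.

```python
def hadamard(d):
    """2^d x 2^d Walsh-Hadamard kernel matrix built by the Sylvester recursion."""
    h = [[1]]
    for _ in range(d):
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def dwht(x):
    """Transform a channel vector whose length is a power of two."""
    return [sum(w * v for w, v in zip(row, x))
            for row in hadamard(len(x).bit_length() - 1)]


def relu(values):
    """Zero out every negative entry."""
    return [max(0, v) for v in values]


if __name__ == "__main__":
    # Every kernel except the first has as many -1 entries as +1 entries.
    print([sum(row) for row in hadamard(3)])
    a, b = [1, 2, 3, 4], [0, 1, 4, 5]
    print(dwht(a), dwht(b))              # the outputs differ in a negative entry
    print(relu(dwht(a)), relu(dwht(b)))  # after ReLU they can no longer be told apart
```

Without the ReLU the transform is invertible, so the two inputs stay distinguishable for the following layers.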

Figure 4: Histograms of hierarchy level (low-level, middle-level, high-level) activations after the proposed PC layer based on conventional transforms, Left: DWHT, Right: DCT. Both DWHT and DCT models are based on ShuffleNet V2 1.1x model where we replaced all of stride 1 blocks with (d)-DWHT and (d)-DCT blocks, respectively in Figure 2.

4.2 Active depthwise convolution weights

In Figure 5 and Appendix B, it is observed that the depthwise convolution weights of the last 3 blocks in DWHT-3-H and DCT-3-H have far fewer near-zero values than those of the baseline model. That is, the number of values far from zero is much larger in the DCT-3-H and DWHT-3-H models than in the baseline model. We conjecture that these learnable weights were actively fitted to the optimal domain favored by conventional transforms. Consequently, these weights are actively and sufficiently utilized to increase accuracy, compensating for the conventional transforms being non-learnable.

To confirm the impact of activeness of these depthwise convolution weights in the last 3 blocks, we experimented with regularizing these weights varying the weight decay values. Higher weight decay values strongly regularize the scale of depthwise convolution weight values in the last 3 blocks. Thus, strong constraint on the scale of these weight values hinders active utilization of these weights, which results in accuracy drop as can be seen in Figure 6.

Figure 5: Histogram of depthwise convolution weights in the third block, out of last 3 blocks. DCT-3-H and DWHT-3-H models are based on ShuffleNet V2 1.1x model with (d) block. Baseline model is ShuffleNet V2 1.1x model.
Figure 6: Ablation study of weight decay values (5e-4, 2e-3, 1e-2, 1e-1). We applied these weight decay values only on depthwise convolution weights of last 3 blocks in DCT-based model and DWHT-based model, while all the other learnable weights were regularized with weight decay of 5e-4.

5 Conclusion

We propose new PC layers based on conventional transforms. Our new PC layers allow neural networks to be efficient in computational complexity and in the number of weight parameters. The DWHT-based PC layer in particular requires no multiplications, only additions or subtractions, which makes it extremely efficient in computation overhead. To fuse our PC layers into neural networks successfully, we empirically found the optimal block unit structure and the hierarchy-level blocks in neural networks that suit conventional transforms, showing an accuracy increase and strong representability of cross-channel correlations. We further revealed how ReLU hinders the capture of cross-channel information, and how actively the depthwise convolution weights of the last blocks are used in our proposed neural network.



Appendix A Generality of proposed PC layers in other Neural Network

Figure 7: Performance curve of hierarchically applying our optimal block (See Table 2 for detail settings) on CIFAR100, Top: in the viewpoint of the number of weight parameters, Bottom: in the viewpoint of the number of FLOPs. The performance of baseline models was evaluated by MobileNet-V1 architecture with width hyper-parameter 0.5x, 1x, 2x. Our models were all experimented with 1x setting, and each dot in the figures represents mean accuracy of 3 network instances. Our models experimented are 10-H, 6-H, 3-H models (first column) , 7-M, 3-M-Rear, 3-M-Front models (second column), 10-L, 6-L, 3-L models (final column) in ascending order of the number of weight parameters and FLOPs.

In Figure 7, to identify more precisely the hierarchy level of blocks favored by our proposed PC layers, we subdivided our middle-level experiment scheme: the DCT/DWHT-3-M-Front model applies the proposed blocks from the beginning of Stage 3 in the baseline, while the DCT/DWHT-3-M-Rear model applies them from the end of Stage 3. The performance curves of all our proposed models in Figure 7 show that if we apply the proposed optimal block within the first 6 blocks of the network, the Top-1 accuracy deteriorates significantly. This tells us that there are definite hierarchy-level blocks in the network that are favored or not favored by our proposed PC layers.

Further, the 3-M-Rear models gave slightly superior efficiency, while the 7-M, 3-M-Front, and low-level (3-L, 6-L, 10-L) models showed inferior efficiency. From these results, we conclude that applying the proposed blocks within the first 6 blocks of the network causes significant deterioration in accuracy.

Appendix B Histogram of depthwise convolution weights in high-level blocks

Figure 8: Histograms of depthwise convolution weights, Top: histogram of first block out of last 3 blocks, Bottom: histogram of second block out of last 3 blocks. DWHT-3-H and DCT-3-H models are based on ShuffleNet-V2 1.1x model with (d)-DWHT and (d)-DCT block, respectively. Baseline model is ShuffleNet-V2 1.1x model.