1 Introduction
Large Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; He et al., 2016; Szegedy et al., 2016b, a) and networks found by automatic Neural Architecture Search (NAS) (Zoph et al., 2018; Liu et al., 2018; Real et al., 2018) have achieved remarkable accuracy on tasks such as image classification (Deng et al., 2009; Krizhevsky & Hinton, 2009) and object detection (Lin et al., 2014), benefiting from a huge number of learnable parameters and computations. However, the large number of weights and the high computation cost limit their use on mobile devices with tight memory constraints and on devices that require real-time computation (Canziani et al., 2016). To address these problems, Howard et al. (2017), Sandler et al. (2018), Zhang et al. (2017), and Ma et al. (2018) proposed parameter- and computation-efficient blocks that maintain almost the same accuracy as heavier CNN models. All of these blocks use depthwise separable convolution, which deconstructs the standard convolution into a depthwise convolution that is specific to spatial information and a pointwise ($1 \times 1$) convolution that is specific to channel information. Depthwise separable convolution achieves accuracy comparable to standard convolution with far fewer parameters and FLOPs. These properties have made depthwise separable convolution, and with it pointwise convolution (PC), widely used in modern CNN architectures.
We point out that the existing PC layer is computationally expensive and accounts for a large proportion of the parameters (Howard et al., 2017). Although the demand for PC layers has been growing and will keep growing in modern neural network architectures, there has been little research on improving the naive structure of the layer itself.
Therefore, this paper proposes a new PC layer formulated by non-parametric and extremely fast conventional transforms. The conventional transforms that we apply to CNN models are the Discrete Walsh-Hadamard Transform (DWHT) and the Discrete Cosine Transform (DCT), which have been widely used in image processing but rarely applied in CNNs (Ghosh & Chellappa, 2016).
We empirically found that, although neither of these transforms requires any learnable parameters, both show sufficient ability to capture cross-channel correlations. In particular, DWHT is a good replacement for the conventional PC layer, as it requires no floating point multiplications but only additions and subtractions, which significantly reduces the computation overhead of PC layers. Furthermore, DWHT can take advantage of its fast version, in which the complexity of the floating point operations is reduced from $O(N^2)$ to $O(N \log N)$ for $N$ channels. These properties yield neural networks that are extremely efficient in parameters and computation while also gaining accuracy.
Our contributions are summarized as follows:

We propose a new PC layer formulated with conventional transforms, which requires no learnable parameters and significantly reduces the number of floating point operations compared to the existing PC layer.

The main benefit of using the bases of existing transforms comes from their fast versions, which drastically decrease computation complexity in neural networks without degrading accuracy.

We found that applying ReLU after conventional transforms discards important extracted information, leading to a significant drop in accuracy. Based on this finding, we propose a new computation block.

We also found that the conventional transforms are especially effective for extracting high-level features in neural networks. Based on this, we propose a new transform-based neural network architecture. Specifically, using DWHT, our proposed method yields a 1.49% accuracy gain together with 79.1% fewer parameters and 48.4% fewer FLOPs compared with its baseline model (MobileNetV1) on the CIFAR-100 dataset.
2 Related Work
2.1 Deconstruction and Decomposition of Convolutions
To reduce the computation complexity of existing convolution methods, several approaches that rethink and deconstruct the naive convolution structure have been presented. Simonyan & Zisserman (2014) factorized a large kernel in a convolution layer into several small kernels spread over several convolution layers. Jeon & Kim (2017) pointed out that existing convolution is limited by its fixed receptive field. Consequently, they introduced learnable spatial displacement parameters, which make convolution more flexible. Building on Jeon & Kim (2017), Jeon & Kim (2018) showed that the standard convolution can be deconstructed into a single PC layer with spatially shifted channels. Based on that, they proposed a very efficient convolution layer, the active shift layer, which replaces spatial convolutions with shift operations.
It is worth noting that, in a depthwise separable convolution layer, the PC layer usually takes much more computation and many more parameters than the spatial convolution (Howard et al., 2017). Therefore, there have been attempts to reduce the computation complexity of the PC layer. Zhang et al. (2017) proposed ShuffleNetV1, where the features are decomposed into several groups over channels and the PC operation is conducted for each group, thus reducing the number of parameters and FLOPs by a factor equal to the number of groups. However, Ma et al. (2018) showed that the memory access cost increases as the number of groups increases, leading to slower inference. Like the aforementioned methods, our work aims to reduce the computation complexity and the number of parameters in a convolution layer. However, our objective is oriented more toward finding mathematically efficient algorithms that make the weights in convolution kernels both more effective in feature representation and more efficient in computation.
2.2 Quantization
Quantization in neural networks reduces the number of bits used to represent the weights and/or activations. Vanhoucke et al. (2011) applied 8-bit quantization to weight parameters, which enabled considerable speed-up with a small drop in accuracy. Gupta et al. (2015) applied a 16-bit fixed point representation with stochastic rounding. Han et al. (2015b) pruned unimportant weight connections by thresholding the weight values. Based on Han et al. (2015b), Han et al. (2015a) successfully combined pruning with quantization to 8 bits or fewer and Huffman encoding. The extreme case of quantized networks evolved from Courbariaux et al. (2015), which approximated weights with binary ($+1$, $-1$) values. Building on Courbariaux et al. (2015), Courbariaux & Bengio (2016) and Hubara et al. (2016) constructed Binarized Neural Networks, which either stochastically or deterministically binarize the real-valued weights and activations. These binarized weights and activations significantly reduce runtime by replacing floating point multiplications with 1-bit XNOR operations.
Based on Binarized Neural Networks (Courbariaux & Bengio, 2016; Hubara et al., 2016), Local Binary CNN (Juefei-Xu et al., 2016) proposed a convolution module that uses fixed binarized weights in spatial convolution based on Local Binary Patterns, thus replacing multiplications with a few addition/subtraction operations in spatial convolution. However, they did not consider reducing the computation complexity of the PC layer, whose weights remained learnable floating point variables. Our work is similar to Local Binary CNN (Juefei-Xu et al., 2016) in using fixed binary weight values. However, Local Binary Patterns have limitations when applied in CNNs: they can only be used in spatial convolution, and there are no approaches that enable fast computation of them.
2.3 Conventional Transforms
In general, several transform techniques have been applied to image processing. The Discrete Cosine Transform (DCT) has been used as a powerful feature extractor (Dabbaghchian et al., 2010). For an $N$-point input sequence, the basis kernel of DCT is defined as a list of cosine values as below:
$$C_m = \left[\cos\left(\frac{(2x+1)\,m\,\pi}{2N}\right)\right], \quad 0 \le x \le N-1 \qquad (1)$$
where $m$ is the index of a basis, which captures higher-frequency information in the input signal as $m$ increases. This property led DCT to be widely applied in image/video compression techniques that emphasize the power of image signals in low-frequency regions (Rao & Yip, 2014).
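As a small illustration of the statement that higher basis indices capture higher frequencies, the sketch below builds unnormalized DCT basis kernels and counts their sign changes. The function names `dct_basis` and `count_sign_changes` are hypothetical helpers introduced here, and the kernel follows the standard DCT-II cosine form, which is assumed to match Eq. 1.

```python
import math


def dct_basis(m, n):
    """Return the m-th unnormalized DCT-II basis kernel for an n-point input.

    Assumption: this standard cosine form is the one meant by Eq. 1.
    """
    return [math.cos((2 * x + 1) * m * math.pi / (2 * n)) for x in range(n)]


def count_sign_changes(values):
    """Count how often consecutive entries switch sign (a rough frequency proxy)."""
    return sum(1 for a, b in zip(values, values[1:]) if a * b < 0)


# For an 8-point input, the number of sign changes grows with the basis index m.
sign_changes = [count_sign_changes(dct_basis(m, 8)) for m in range(8)]
```

In this sketch the kernel with index 0 is constant and each following kernel oscillates once more than the previous one.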
The Discrete Walsh-Hadamard Transform (DWHT) is a very fast and efficient transform that uses only $+1$ and $-1$ elements in its kernels. These binary elements allow DWHT to be computed without any multiplications, using only additions and subtractions. Therefore, DWHT has been widely used for fast feature extraction in many practical applications, such as texture image segmentation (Vard et al., 2011), face recognition (Hassan et al., 2007), and video shot boundary detection (G. & S., 2014).
3 Method
We propose a new PC layer which is computed with conventional transforms. The conventional PC layer can be formulated as follows:
$$Z_{ijm} = W_m^{\top} X_{ij}, \quad 0 \le m \le M-1 \qquad (3)$$
where $(i, j)$ is a spatial index and $m$ is the output channel index. In Eq. 3, $N$ and $M$ are the number of input and output channels, respectively. $X_{ij} \in \mathbb{R}^{N}$ is the input vector at the spatial index $(i, j)$, and $W_m \in \mathbb{R}^{N}$ is the $m$-th weight vector. For simplicity, the stride is set to 1 and the bias is omitted in Eq. 3.
Our proposed method replaces the learnable parameters $W_m$ with the bases of the conventional transforms. For example, replacing $W_m$ in Eq. 3 with the $m$-th DWHT basis kernel gives a new multiplication-free PC layer. Similarly, the DCT basis kernels in Eq. 1 can substitute for $W_m$ in Eq. 3, formulating another new PC layer using DCT. Note that the normalization factors of the conventional transforms are not applied in the proposed PC layer, because Batch Normalization (Ioffe & Szegedy, 2015) performs a normalization and a linear transform, which can play the role of the normalization in the existing transforms.
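The replacement described above can be sketched for a single spatial location as follows. This is a minimal illustration, not the authors' implementation: `hadamard` and `pointwise_transform` are hypothetical names, and the Sylvester (natural) ordering of the Hadamard rows is an assumption, since the ordering of the DWHT kernels is not recoverable from the text.

```python
def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n must be a power of two).

    Assumption: rows are taken in natural (Sylvester) order.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def pointwise_transform(x, basis):
    """Direct PC layer at one spatial location: one inner product per output channel.

    The learnable weight vectors are replaced by fixed basis kernels. With +1/-1
    entries each product only selects the sign, so the inner product amounts to
    additions and subtractions.
    """
    return [sum(w * v for w, v in zip(kernel, x)) for kernel in basis]


x = [1.0, 2.0, 3.0, 4.0]                  # hypothetical 4-channel input vector
z = pointwise_transform(x, hadamard(4))   # 4 output channels, no learnable weights
```

Passing a list of DCT kernels instead of `hadamard(4)` as `basis` would give the DCT variant of the same layer.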
The most important benefit of the proposed method is that the fast algorithms of the existing transforms can be applied in the PC layer for a further reduction of computation. Directly applying the new PC layer above gives a computational complexity of $O(N^2)$ for $N$ channels. By adopting the fast algorithms, we can reduce the computation complexity of the PC layer from $O(N^2)$ to $O(N \log N)$ without any change in the computed results.
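To give a feel for the size of this gap, the following sketch compares operation counts of a direct transform and a fast transform at one spatial location. The counts are order-of-magnitude estimates under the assumption of equal input and output channel counts that are powers of two; `direct_ops` and `fast_ops` are hypothetical helper names.

```python
def direct_ops(n):
    """Approximate operation count of a direct n x n transform: n outputs, n terms each."""
    return n * n


def fast_ops(n):
    """Approximate operation count of a fast transform: n operations in each of log2(n) stages."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    return n * (n.bit_length() - 1)


# Channel widths chosen only for illustration.
comparison = {n: (direct_ops(n), fast_ops(n)) for n in (64, 256, 1024)}
```

For 1024 channels this gives roughly a million operations for the direct form against about ten thousand for the fast form, a ratio that keeps growing with the channel width.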
We present the pseudocode of our proposed fast PC layer using DWHT in Algorithm 1, based on the fast DWHT structure shown in Figure 1(a). In Algorithm 1, for $\log_2 N$ iterations, the even-indexed channels and odd-indexed channels are added and subtracted element-wise. The sums and the differences are placed in the first $N/2$ elements and the last $N/2$ elements of the input of the next iteration, respectively. Each iteration therefore requires only $N$ additions or subtractions, so Algorithm 1 has a complexity of $O(N \log N)$ in additions or subtractions. Compared to the existing PC layer, which requires $O(N^2)$ multiplications, our method is far cheaper in terms of computation cost, as seen in Figure 1(b), and in power consumption of computing devices (Horowitz, 2014). Note that, similarly to fast DWHT, DCT can also be computed in a fast manner using a butterfly structure (Kok, 1997).
Compared to DWHT, DCT has the advantage of using the more natural shapes of cosine basis kernels, which tend to provide better feature extraction by capturing frequency information. However, DCT inevitably needs multiplications for the inner product between the basis kernel and the input vector, and a look-up table (LUT) for the cosine kernel bases, which can increase the processing time and memory access. On the other hand, as mentioned, the kernels of DWHT consist only of $+1$ and $-1$, which allows for building a multiplication-free module. Furthermore, no memory access to kernel bases is needed if our structured-wiring-based fast DWHT algorithm (Algorithm 1) is applied. Our comprehensive experiments in Sections 3.1 and 3.2 show that DWHT is more efficient than DCT when applied in the PC layer, in terms of the trade-off between computation cost and accuracy.
Note that, to obtain a more general formulation of our newly defined PC layer, we pad zeros along the channel axis if the number of input channels is less than the number of output channels, and truncate the output channels when the number of output channels is smaller than the number of input channels, as shown in Algorithm 1.
Figure 1(a) shows the architecture of the fast DWHT algorithm described in Algorithm 1. This structured-wiring-based architecture ensures that the receptive field of each output channel is $N$, which means that each output channel reflects all input channels after $\log_2 N$ iterations. This property helps capture the input channel correlations even though the pattern of which channel elements are added and subtracted is deterministically structured.
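The iteration described above can be sketched in a few lines for one spatial location. This is an illustrative reading of the prose description, not a reproduction of Algorithm 1: the names `fast_dwht` and `dwht_pc_layer` are hypothetical, and padding the channel axis up to the next power of two is an assumption added here so that the butterfly is well defined for any channel count.

```python
def fast_dwht(x):
    """Fast DWHT of one channel vector using only additions and subtractions.

    In each of the log2(n) iterations, even- and odd-indexed elements are added
    and subtracted; sums go to the first half, differences to the second half.
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    for _ in range(n.bit_length() - 1):      # log2(n) iterations
        even, odd = y[0::2], y[1::2]
        sums = [a + b for a, b in zip(even, odd)]
        diffs = [a - b for a, b in zip(even, odd)]
        y = sums + diffs                     # n additions/subtractions per iteration
    return y


def dwht_pc_layer(x, out_channels):
    """Zero-pad the channel axis when needed, transform, then truncate the output.

    Assumption: the padded length is the next power of two that covers both the
    input and the requested output channel count.
    """
    size = 1
    while size < max(len(x), out_channels):
        size *= 2
    padded = list(x) + [0.0] * (size - len(x))
    return fast_dwht(padded)[:out_channels]


expanded = dwht_pc_layer([1.0, 2.0], 4)      # 2 input channels -> 4 output channels
reduced = dwht_pc_layer([1, 2, 3, 4], 2)     # 4 input channels -> 2 output channels
```

With this wiring the result equals the product of the input with a Hadamard matrix in natural (Sylvester) order, so every output element depends on all input channels after the last iteration.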
To successfully fuse our new PC layer into neural networks, we explored two themes: i) a search for the optimal block for the proposed PC layer; ii) an optimal strategy for inserting the block found in i) into the hierarchy of blocks in a network. We assumed that there exist an optimal block unit structure and an optimal hierarchy level (high-, mid-, or low-level blocks) in neural networks that are favored by these non-learnable transforms, and we conducted experiments on the two themes accordingly. In these experiments, we evaluated each of our networks by how its accuracy changes as the number of parameters or FLOPs changes. Note that, in general, one floating point multiplication requires a number of addition/subtraction operations that depends on both the bit width of the operands and the hardware architecture [Ref]. For comparison, we counted FLOPs as the number of multiplications, additions, and subtractions performed during inference. Unless mentioned otherwise, we used the default experimental setting: batch size = 128, training epochs = 200, initial learning rate = 0.1 multiplied by 0.94 every 2 epochs, momentum = 0.9, and weight decay = 5e-4. In all experiments, the model accuracy is the average of three training runs.
3.1 Optimal block structure for conventional transforms
From a microscopic perspective, the block unit is the basic building element of a neural network, and it determines how efficiently the weight parameters and computation are used with respect to accuracy. Accordingly, to find the optimal block structure for our PC layer, we replaced the existing PC layers in the blocks of ShuffleNetV2 (Ma et al., 2018) with our new PC layers. The proposed block and its variations are shown in Figure 2. Comparing the results of (c) and (d) in Table 1 shows that the ReLU (Nair & Hinton, 2010) activation function significantly harms the accuracy of networks equipped with the conventional transforms. We analyze this phenomenon empirically in Section 4.1. Additionally, the results of (b) and (d) in Table 1 show that the proposed PC layers are superior to a PC layer whose weights are randomly initialized and then fixed. This implies that DWHT and DCT kernels can extract meaningful cross-channel correlation information. Compared to the baseline model in Table 1, (d)DCT and (d)DWHT show an accuracy drop of approximately 2.5% while reducing parameters and FLOPs by 41% and 50%, respectively. Hence the proposed blocks (c) and (d) are still inefficient in the trade-off between accuracy and computation cost, which led us to explore further for an optimal network architecture. In the next subsection, we address this problem by applying conventional transforms to features at the optimal hierarchy level (see Section 3.2). Based on our comprehensive experiments, we set block structure (d) as our default proposed block, which is used in all the following experiments.
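The learning-rate schedule stated in the default setting can be written as a one-line function. This is a sketch under the assumption that epochs are counted from zero and that the decay is applied as a step function; the name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.94, step=2):
    """Step schedule: start at base_lr and multiply by `decay` every `step` epochs.

    Assumption: `epoch` is zero-based, so epochs 0 and 1 use the initial rate.
    """
    return base_lr * decay ** (epoch // step)


# Learning rate at a few points of a 200-epoch run.
schedule = [learning_rate(e) for e in (0, 1, 2, 100, 199)]
```

Under these assumptions the rate after 200 epochs has been decayed 99 times, which leaves it far below its initial value.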
Model | Top-1 Acc (%) | # of Weights (M) | # of FLOPs (M)
Baseline (a) |  | 1.57 | 105
(b) |  | 1.57 | 105
(c)DWHT |  | 0.92 | 52
(c)DCT |  | 0.92 | 54
(d)DWHT |  | 0.92 | 52
(d)DCT |  | 0.92 | 54
3.2 Optimal hierarchy level blocks for conventional transforms
In this section, we search for the optimal hierarchy level at which our optimal block, which is based on the proposed PC layer, is most effectively applied within a whole network architecture. The optimal hierarchy level allows the proposed network to have the minimal number of weight parameters and FLOPs without an accuracy drop, which is made possible by the non-parametric and extremely fast conventional transforms. Note that applying our proposed block to the high-level blocks of the network reduces the number of parameters and FLOPs much more than applying it to low-level blocks, because the channel depth increases exponentially as the layers go deeper in the network.
In Figure 3, we applied our optimal block (i.e., block (d) in Figure 2) to high-, middle-, and low-level blocks, respectively. In our experiments, we evaluate the performance of the networks depending on the number of blocks where the proposed optimal block is applied. Each tested model is denoted as (transform type)(# of the proposed blocks)(hierarchy level, Low (L), Middle (M), or High (H), where the proposed optimal block is applied). For example, DWHT3L indicates the model where the first three blocks in ShuffleNetV2 are the proposed blocks, while the other blocks are the original blocks of ShuffleNetV2. Note that, in this experiment, all blocks with stride = 2 in the baseline are kept as original blocks.
Figure 3 shows the performance of the proposed methods depending on the transform type {DCT, DWHT}, the hierarchy level {L, M, H}, and the number of proposed blocks that replace the original ones in the baseline {3, 6, 10}, in terms of Top-1 accuracy and the number of parameters (or FLOPs). Since the baseline model has only 7 blocks in the middle-level stage (Stage 3), we performed the middle-level experiments only for the DCT/DWHT3M and 7M models, where the proposed blocks are applied from the beginning of Stage 3 in the baseline. In Figure 3, the performance of our 10H (or 10L), 6H (or 6L), and 3H (or 3L) models (7M and 3M for the middle-level experiments) is listed in ascending order of the number of parameters and FLOPs.
As can be seen in the first column of Figure 3, applying our optimal block to the high-level blocks achieved a much better trade-off between the number of weight parameters (FLOPs) and accuracy. Meanwhile, applying it to middle- and low-level features was slightly and severely inefficient, respectively, in the number of weight parameters (FLOPs) with regard to accuracy. This tendency is similar for both DWHT-based and DCT-based models, which implies that there can be an optimal hierarchy level of blocks favored by conventional transforms. Also note that our DWHT-based models showed slightly higher or equal accuracy with fewer FLOPs than our DCT-based models at all hierarchy levels. This is because, compared to the fast version of DCT, the fast version of DWHT requires no multiplications and far fewer addition or subtraction operations, while its wiring-based structure still gives it sufficient ability to extract cross-channel information.
To confirm the generality of the proposed method, we also implemented our methods in MobileNetV1 (Howard et al., 2017) and performed experiments. Inspired by the above results, which show that the optimal hierarchy blocks for conventional transforms are found among the high-level blocks, we replaced high-level blocks of the baseline model (MobileNetV1) and varied the number of replaced blocks to verify the effectiveness of the proposed method. The experimental results are given in Table 2. Remarkably, as shown in Table 2, our DWHT6H model yielded a 1.49% increase in Top-1 accuracy even though the parameters and FLOPs are reduced by 79.1% and 48.4%, respectively, compared with the baseline 1x model. This improvement comes from the depthwise separable convolutions used in MobileNetV1, where PC layers dominate the computation cost and memory space, i.e., they consume 94.86% of the FLOPs and 74% of the total number of parameters in the whole network (Howard et al., 2017). The full performance results for all hierarchy levels {L, M, H} and numbers of blocks {3, 6, 10} (exceptionally, {3, 7} blocks for the middle-level experiments) are given in Appendix A.
In Appendix A, note that the 3H and 6H models showed great efficiency in the number of parameters and FLOPs with respect to Top-1 accuracy compared to the baseline model, while the 10H model showed a worse trade-off between the number of weight parameters (or FLOPs) and accuracy. In summary, based on the comprehensive experiments, we conclude that i) the proposed PC block always favors high-level blocks over low-level ones in the network hierarchy; ii) the performance gain starts to decrease when the number of transform-based PC blocks exceeds a certain capacity of the network.
Model | Top-1 Acc (%) | # of Weights (M) | # of FLOPs (M)
Baseline |  | 3.31 | 94
DWHT3H |  | 1.47 | 73
DCT3H |  | 1.47 | 74
DWHT6H |  | 0.68 | 48
DCT6H |  | 0.68 | 49
4 Experiments and Analysis
4.1 Hindrance of ReLU in cross-channel representability
As seen in Table 1, applying ReLU after conventional transforms significantly harmed the accuracy. This is due to two properties of the conventional transform basis kernels: the DWHT kernels in Eq. 2 and the DCT kernels in Eq. 1 both have the same number of positive and negative elements in every kernel except the one with index 0, and the distributions of the absolute values of the positive and negative elements are almost identical. Therefore, a PC layer followed by ReLU can severely damage the information carried by its output.
These properties imply that output channel elements with values below zero should also be considered during the forward pass. When the input vector in Eq. 3 is forwarded through the conventional transforms, important channel elements of the input that have larger values than others may be combined with negative elements of the DWHT or DCT kernels, so the important cross-channel information in the output of Eq. 3 can reside in the value range below zero. Thus, applying ReLU after conventional transforms discards information that must be forwarded. Based on this analysis, we do not use a non-linear activation function after the proposed PC layer, since it can restrict the information flow of the conventional transforms.
Figure 4 shows that the activations from both DCT and DWHT at all hierarchy levels have positive and negative values in almost the same proportion. These negative values possibly include important cross-channel correlation information. Thus, applying ReLU to the activations of PC layers based on conventional transforms discards crucial cross-channel information contained in negative values, leading to the significant accuracy drop shown in Table 1.
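The sign-balance argument can be checked on a small example. The sketch below assumes Sylvester-ordered Hadamard kernels and a made-up non-negative input vector; `hadamard` and `relu` are hypothetical helper names, and the numbers are illustrative only.

```python
def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n must be a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def relu(values):
    """Element-wise ReLU: negative entries are replaced by zero."""
    return [max(0.0, v) for v in values]


kernels = hadamard(8)

# Every kernel except the first has as many +1 as -1 entries, so its entries sum to zero.
balance = [sum(kernel) for kernel in kernels]

# Hypothetical non-negative input activations (e.g. outputs of an earlier ReLU).
x = [0.5, 3.0, 0.1, 0.2, 2.5, 0.3, 0.1, 0.4]
z = [sum(w * v for w, v in zip(kernel, x)) for kernel in kernels]

# Outputs below zero are erased by ReLU, although they encode channel contrasts.
erased = [i for i, (before, after) in enumerate(zip(z, relu(z))) if before != after]
```

Even with a purely non-negative input, some transform outputs are negative, and ReLU maps all of them to zero regardless of their magnitude.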
4.2 Active depthwise convolution weights
In Figure 8 and Appendix B, it is observed that the depthwise convolution weights of the last 3 blocks in DWHT3H and DCT3H have far fewer near-zero values than those of the baseline model. That is, the number of values that lie away from zero is much larger in the DCT3H and DWHT3H models than in the baseline model. We conjecture that these learnable weights were actively fitted to the optimal domain favored by conventional transforms. Consequently, these weights are actively and sufficiently utilized to increase accuracy, compensating for the conventional transforms being non-learnable.
To confirm the impact of the activeness of these depthwise convolution weights in the last 3 blocks, we experimented with regularizing these weights under varying weight decay values. Higher weight decay values strongly regularize the scale of the depthwise convolution weights in the last 3 blocks. Such a strong constraint on the scale of these weights hinders their active utilization, which results in the accuracy drop seen in Figure 6.
5 Conclusion
We propose new PC layers based on conventional transforms. Our new PC layers allow neural networks to be efficient in both computation complexity and the number of weight parameters. In particular, the DWHT-based PC layer requires no multiplications but only additions and subtractions, which makes it extremely efficient in computation overhead. To successfully fuse our PC layers into neural networks, we empirically found the optimal block unit structure and hierarchy-level blocks for conventional transforms, showing an accuracy increase and strong representability of cross-channel correlations. We further revealed the hindrance of ReLU to capturing cross-channel representability and the activeness of the depthwise convolution weights in the last blocks of our proposed neural network.
References
 Canziani et al. (2016) Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural network models for practical applications. CoRR, abs/1605.07678, 2016. URL http://arxiv.org/abs/1605.07678.
 Courbariaux & Bengio (2016) Matthieu Courbariaux and Yoshua Bengio. Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016. URL http://arxiv.org/abs/1602.02830.
 Courbariaux et al. (2015) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. CoRR, abs/1511.00363, 2015. URL http://arxiv.org/abs/1511.00363.
 Dabbaghchian et al. (2010) Saeed Dabbaghchian, Masoumeh P Ghaemmaghami, and Ali Aghagolzadeh. Feature extraction using discrete cosine transform and discrimination power analysis with a face recognition technology. Pattern Recognition, 43(4):1431–1440, 2010.

 Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. IEEE, 2009.
 G. & S. (2014) L. P. G. G. and D. S. Walsh–hadamard transform kernel-based feature vector for shot boundary detection. IEEE Transactions on Image Processing, 23(12):5187–5197, Dec 2014. ISSN 1057-7149. doi: 10.1109/TIP.2014.2362652.
 Ghosh & Chellappa (2016) Arthita Ghosh and Rama Chellappa. Deep feature extraction in the dct domain. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 3536–3541. IEEE, 2016.

 Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746, 2015.
 Han et al. (2015a) Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015a.
 Han et al. (2015b) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015b.
 Hassan et al. (2007) M Hassan, I Osman, and M Yahia. Walsh-hadamard transform for facial feature extraction in face recognition. World Academy of Science, Engineering and Technology, 29:194–198, 2007.
 He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016. doi: 10.1109/CVPR.2016.90.
 Horowitz (2014) M. Horowitz. 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14, Feb 2014. doi: 10.1109/ISSCC.2014.6757323.
 Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017. URL http://arxiv.org/abs/1704.04861.
 Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. CoRR, abs/1609.07061, 2016. URL http://arxiv.org/abs/1609.07061.
 Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
 Jeon & Kim (2017) Yunho Jeon and Junmo Kim. Active convolution: Learning the shape of convolution for image classification. CoRR, abs/1703.09076, 2017. URL http://arxiv.org/abs/1703.09076.
 Jeon & Kim (2018) Yunho Jeon and Junmo Kim. Constructing fast network through deconstruction of convolution. CoRR, abs/1806.07370, 2018. URL http://arxiv.org/abs/1806.07370.
 Juefei-Xu et al. (2016) Felix Juefei-Xu, Vishnu Naresh Boddeti, and Marios Savvides. Local binary convolutional neural networks. CoRR, abs/1608.06049, 2016. URL http://arxiv.org/abs/1608.06049.
 Kok (1997) Chi-Wah Kok. Fast algorithm for computing discrete cosine transform. IEEE Transactions on Signal Processing, 45(3):757–760, 1997.
 Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.
 Liu et al. (2018) Chenxi Liu, Barret Zoph, Maxim Neumann, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34, 2018.
 Ma et al. (2018) Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet V2: practical guidelines for efficient CNN architecture design. CoRR, abs/1807.11164, 2018. URL http://arxiv.org/abs/1807.11164.
 Nair & Hinton (2010) Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814, 2010.
 Pratt et al. (1969) W. K. Pratt, J. Kane, and H. C. Andrews. Hadamard transform image coding. Proceedings of the IEEE, 57(1):58–68, Jan 1969. ISSN 0018-9219. doi: 10.1109/PROC.1969.6869.
 Rao & Yip (2014) K Ramamohan Rao and Ping Yip. Discrete cosine transform: algorithms, advantages, applications. Academic press, 2014.
 Real et al. (2018) Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V Le. Regularized evolution for image classifier architecture search. arXiv preprint arXiv:1802.01548, 2018.
 Sandler et al. (2018) Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018. URL http://arxiv.org/abs/1801.04381.
 Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Szegedy et al. (2016a) Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016a. URL http://arxiv.org/abs/1602.07261.
 Szegedy et al. (2016b) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826, 2016b.
 Vanhoucke et al. (2011) Vincent Vanhoucke, Andrew Senior, and Mark Z Mao. Improving the speed of neural networks on cpus. 2011.
 Vard et al. (2011) Ali-Reza Vard, Amir-Hassan Monadjemi, Kamal Jamshidi, and Naser Movahhedinia. Fast texture energy based image segmentation using directional walsh–hadamard transform and parametric active contour models. Expert Systems with Applications, 38(9):11722–11729, 2011.
 Zhang et al. (2017) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017. URL http://arxiv.org/abs/1707.01083.
 Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710, 2018.
Appendix A Generality of proposed PC layers in other Neural Networks
In Figure 7, to find more precisely the hierarchy level of blocks favored by our proposed PC layers, we subdivided our middle-level experiment scheme: the DCT/DWHT3MFront model applies the proposed blocks from the beginning of Stage 3 in the baseline, while the DCT/DWHT3MRear model applies them from the end of Stage 3. The performance curves of all our proposed models in Figure 7 show that if we apply the proposed optimal block within the first 6 blocks of the network, the Top-1 accuracy deteriorates significantly, which indicates that there are definite hierarchy-level blocks in the network that are favored or not favored by our proposed PC layers.
Appendix B Histogram of depthwise convolution weights in high-level blocks
Further, the 3MRear models gave slightly superior efficiency, while the 7M, 3MFront, and low-level (3L, 6L, 10L) models showed inferior efficiency. From these results, we conclude that applying the proposed blocks within the first 6 blocks of the network causes significant deterioration in accuracy.