1 Introduction
Over the past few years, with the availability of large-scale datasets and computational power, deep learning has achieved impressive results on a wide range of applications in the field of computer vision. In general, the trend is to achieve higher performance by developing deep and complex models using large computational resources [1, 2, 3, 4, 5]. However, this progress does not necessarily make the networks more efficient with respect to memory and speed. In many real-world and resource-constrained applications such as robotics, satellites, and self-driving cars, the recognition tasks need to be carried out in a fast and computationally efficient manner. Therefore, there is a need to develop space-time efficient models for such applications.

In a standard convolutional layer, the convolution filters learn the spatial and channel correlations simultaneously. Depthwise separable convolutions factorize this process into two layers. In the first layer, a depthwise (channel-wise) convolution is applied in order to learn the spatial correlations. In the second layer, pointwise convolutions (1 × 1 convolutions) learn channel correlations by combining the outputs of the first layer. Fig. 1 compares the various architectures. An important advantage of this factorization is that it significantly reduces the computation and model size. For example, for a filter of size 3 × 3, the depthwise separable convolution uses 8 to 9 times fewer parameters than the standard convolutional layer.
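The parameter savings of the factorization can be checked with a quick calculation. This is a minimal sketch (function names are hypothetical, bias terms are ignored): a standard convolution needs k²·M·N weights, while the depthwise and pointwise sublayers together need k²·M + M·N.

```python
# Hypothetical helper functions illustrating the parameter counts discussed
# above (bias terms ignored); not part of the paper's implementation.
def standard_conv_params(k, M, N):
    # N filters, each of spatial size k x k spanning all M input channels
    return k * k * M * N

def depthwise_separable_params(k, M, N):
    # depthwise: M filters of size k x k (one per channel);
    # pointwise: N filters of size 1 x 1 x M
    return k * k * M + M * N

k, M, N = 3, 256, 256
ratio = standard_conv_params(k, M, N) / depthwise_separable_params(k, M, N)
print(round(ratio, 2))  # prints 8.69: between 8 and 9 for 3 x 3 filters
```

As N grows, the ratio approaches k², which is where the "8 to 9 times" figure for 3 × 3 filters comes from.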
In this paper, we propose a new 2D convolutional layer named the DepthwiseSTFT separable layer. Similar to the standard depthwise separable convolution layer, our DepthwiseSTFT separable layer has two sublayers. The first layer, named DepthwiseSTFT, captures the spatial correlations. For each channel in the input feature map, it computes the Fourier coefficients (at low frequency points) in a 2D local neighborhood at each position of the channel to obtain new feature maps. The Fourier coefficients are computed using the 2D Short Term Fourier Transform (STFT) at multiple fixed low frequency points in the 2D local neighborhood at each position of the channel. The second layer, named pointwise convolution, uses 1 × 1 convolutions to learn channel correlations by combining the feature maps obtained from the DepthwiseSTFT layer. Note that, unlike the standard depthwise separable layer, here only the second layer (pointwise convolutions) is trainable. Thus, the DepthwiseSTFT separable layer has a lower space-time complexity than depthwise separable convolutions. Furthermore, we show experimentally that the proposed layer achieves better performance than many state-of-the-art depthwise separable convolution-based models such as MobileNet [6, 7] and ShuffleNet [8, 9].
2 Related Works
Recently, there has been a growing interest in developing space-time efficient neural networks for real-time and resource-restricted applications [10, 6, 7, 9, 8, 11, 12].
Depthwise Separable Convolutions. As discussed in Section 1, depthwise separable convolutions significantly reduce the space-time complexity of convolutional neural networks (CNNs) compared to standard convolutions by partitioning the learning of spatial and channel correlations into separate steps. Recently, many depthwise separable convolution-based networks have been introduced, such as MobileNet [6, 7], ShuffleNet [8, 9], and Xception [13]. Note that, with the reduced complexity of their architectures, these networks achieve a trade-off between accuracy and space-time complexity.
2D STFT based Convolutional Layers. Recently, in [14], the authors introduced a 2D STFT based convolutional layer named ReLPU for fundus image segmentation. The ReLPU layer, when used in place of the first convolutional layer (following the input layer) of the U-Net architecture [15], improved the performance of the baseline U-Net. However, the ReLPU layer could not be used to replace all the convolutional layers in the network due to the extreme bottleneck used in it, which reduced its learning capabilities. This work aims to solve this issue by introducing the 2D STFT in depthwise separable convolutions.
3 Method
We denote the feature map output by a layer in a 2D CNN with the tensor $X \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the height, width, and number of channels of the feature map, respectively. Fig. 1 presents the high-level architecture of our DepthwiseSTFT based separable layer. Note that, similar to the standard depthwise separable convolution layer, the DepthwiseSTFT based separable layer has two sublayers. In the first layer (named DepthwiseSTFT), for each channel in the input feature map, we compute the Fourier coefficients in a 2D local neighborhood at each position of the channel to obtain new feature maps. The Fourier coefficients are computed using the 2D Short Term Fourier Transform (STFT) at multiple fixed low frequency points in the 2D local neighborhood at each position of the channel. The second layer (named pointwise convolutions) uses 1 × 1 convolutions to learn linear combinations of the feature maps obtained from the DepthwiseSTFT layer. The detailed description of each layer is as follows.

DepthwiseSTFT. This layer takes a feature map $X$ as input from the previous layer. For simplicity, let us take $C = 1$. Hence, we can drop the channel dimension and rewrite the size of $X$ as $H \times W$. Here, $\mathbf{x} \in \mathbb{Z}^2$ denotes the 2D coordinates of the elements in $X$.
Next, each position $\mathbf{x}$ in $X$ has a 2D neighborhood (denoted by $\mathcal{N}_{\mathbf{x}}$) which can be defined as shown in Equation 1.
$$\mathcal{N}_{\mathbf{x}} = \{\, \mathbf{x} + \boldsymbol{\delta} \;:\; \boldsymbol{\delta} \in \mathbb{Z}^2,\; \|\boldsymbol{\delta}\|_\infty \le r \,\} \quad (1)$$

Here, $r$ determines the neighborhood size $k \times k$ with $k = 2r + 1$.
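As a concrete illustration of the neighborhood definition, the following sketch (the helper name `neighborhood` is hypothetical) enumerates the k × k positions around a given coordinate:

```python
# Hypothetical sketch of Equation 1: the k x k neighborhood of a position x is
# x plus every integer offset whose infinity-norm is at most r = (k - 1) // 2.
def neighborhood(x, k):
    r = (k - 1) // 2
    offsets = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    return [(x[0] + dy, x[1] + dx) for dy, dx in offsets]

print(len(neighborhood((5, 5), 3)))  # prints 9: a 3 x 3 neighborhood
```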
Now, for all positions $\mathbf{x}$ of the feature map $X$, we use the local 2D neighborhoods $\mathcal{N}_{\mathbf{x}}$ to derive the local frequency domain representation. For this, we use the Short Term Fourier Transform (STFT), which is defined in Equation 2.

$$F(\mathbf{v}, \mathbf{x}) = \sum_{\mathbf{y} \in \mathcal{N}_{\mathbf{x}}} X(\mathbf{y})\, e^{-j 2\pi \mathbf{v}^{\mathsf{T}} (\mathbf{y} - \mathbf{x})} \quad (2)$$
Here, $\mathbf{v} \in \mathbb{R}^2$ is a 2D frequency variable and $j = \sqrt{-1}$. Note that, due to the separability of the basis functions, the 2D STFT can be efficiently computed using simple 1D convolutions along the rows and the columns, successively. Using vector notation, we can rewrite Equation 2 as shown in Equation 3.

$$F(\mathbf{v}, \mathbf{x}) = \mathbf{w}_{\mathbf{v}}^{\mathsf{T}} \mathbf{f}_{\mathbf{x}} \quad (3)$$
Here, $\mathbf{w}_{\mathbf{v}}$ is a complex valued basis function (at frequency variable $\mathbf{v}$) of a linear transformation and is defined as shown in Equation 4.

$$\mathbf{w}_{\mathbf{v}} = \left[ e^{-j 2\pi \mathbf{v}^{\mathsf{T}} \boldsymbol{\delta}_1},\; e^{-j 2\pi \mathbf{v}^{\mathsf{T}} \boldsymbol{\delta}_2},\; \ldots,\; e^{-j 2\pi \mathbf{v}^{\mathsf{T}} \boldsymbol{\delta}_{k^2}} \right]^{\mathsf{T}} \quad (4)$$

where $\boldsymbol{\delta}_1, \ldots, \boldsymbol{\delta}_{k^2}$ are the offsets $\mathbf{y} - \mathbf{x}$ of the positions $\mathbf{y} \in \mathcal{N}_{\mathbf{x}}$.
$\mathbf{f}_{\mathbf{x}}$ is a vector containing all the elements from the neighborhood $\mathcal{N}_{\mathbf{x}}$ and is defined as shown in Equation 5.

$$\mathbf{f}_{\mathbf{x}} = \left[ X(\mathbf{x} + \boldsymbol{\delta}_1),\; X(\mathbf{x} + \boldsymbol{\delta}_2),\; \ldots,\; X(\mathbf{x} + \boldsymbol{\delta}_{k^2}) \right]^{\mathsf{T}} \quad (5)$$
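The equivalence between the direct STFT sum of Equation 2 and the inner-product form of Equation 3 can be verified numerically. This is a sketch under the assumption that one of the low frequency points is $[1/k, 0]^{\mathsf{T}}$ (a common choice in local phase methods); the variable names mirror the notation above.

```python
import numpy as np

# Hypothetical numpy check of Equations 2 and 3: the STFT coefficient computed
# as a direct sum over the neighborhood equals the inner product of the basis
# vector w_v with the vectorized neighborhood f_x.
k = 3
r = (k - 1) // 2
offsets = np.array([(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)])

v = np.array([1.0 / k, 0.0])             # assumed low frequency point
w_v = np.exp(-2j * np.pi * offsets @ v)  # Eq. 4: basis vector over k*k offsets

rng = np.random.default_rng(0)
f_x = rng.standard_normal(k * k)         # Eq. 5: vectorized neighborhood values

# Eq. 2: direct summation over the neighborhood offsets
direct = sum(f * np.exp(-2j * np.pi * v @ d) for f, d in zip(f_x, offsets))
print(np.allclose(direct, w_v @ f_x))    # prints True: Eq. 3 matches Eq. 2
```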
In this work, we use the four lowest non-zero frequency variables $\mathbf{v}_1 = [a, 0]^{\mathsf{T}}$, $\mathbf{v}_2 = [0, a]^{\mathsf{T}}$, $\mathbf{v}_3 = [a, a]^{\mathsf{T}}$, and $\mathbf{v}_4 = [a, -a]^{\mathsf{T}}$, where $a = 1/k$. Thus, from Equation 3, we can define the local frequency domain representation for the above four frequency variables as shown in Equation 6.

$$\mathbf{F}_{\mathbf{x}} = \left[ F(\mathbf{v}_1, \mathbf{x}),\; F(\mathbf{v}_2, \mathbf{x}),\; F(\mathbf{v}_3, \mathbf{x}),\; F(\mathbf{v}_4, \mathbf{x}) \right] \quad (6)$$
At each position $\mathbf{x}$, after separating the real and the imaginary parts of each component, we obtain a vector as shown in Equation 7.

$$\mathbf{G}_{\mathbf{x}} = \left[ \operatorname{Re}\{F(\mathbf{v}_1, \mathbf{x})\},\; \operatorname{Im}\{F(\mathbf{v}_1, \mathbf{x})\},\; \ldots,\; \operatorname{Re}\{F(\mathbf{v}_4, \mathbf{x})\},\; \operatorname{Im}\{F(\mathbf{v}_4, \mathbf{x})\} \right]^{\mathsf{T}} \in \mathbb{R}^{8} \quad (7)$$
Here, $\operatorname{Re}\{\cdot\}$ and $\operatorname{Im}\{\cdot\}$ return the real and imaginary parts of a complex number, respectively. The corresponding $8 \times k^2$ transformation matrix can be written as shown in Equation 8.

$$\mathbf{W} = \left[ \operatorname{Re}\{\mathbf{w}_{\mathbf{v}_1}\},\; \operatorname{Im}\{\mathbf{w}_{\mathbf{v}_1}\},\; \ldots,\; \operatorname{Re}\{\mathbf{w}_{\mathbf{v}_4}\},\; \operatorname{Im}\{\mathbf{w}_{\mathbf{v}_4}\} \right]^{\mathsf{T}} \quad (8)$$
Hence, from Equations 3 and 8, the vector form of the STFT for all four frequency points $\mathbf{v}_1$, $\mathbf{v}_2$, $\mathbf{v}_3$, and $\mathbf{v}_4$ can be written as shown in Equation 9.

$$\mathbf{G}_{\mathbf{x}} = \mathbf{W} \mathbf{f}_{\mathbf{x}} \quad (9)$$
Since $\mathbf{G}_{\mathbf{x}}$ is computed for all positions $\mathbf{x}$ of the input $X$, it results in an output feature map of size $H \times W \times 8$ corresponding to the four frequency variables. Remember that we took $C = 1$. Thus, for one channel, the DepthwiseSTFT layer outputs a feature map of size $H \times W \times 8$. Therefore, for $C$ channels, the DepthwiseSTFT layer outputs a feature map of size $H \times W \times 8C$.
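The non-trainable DepthwiseSTFT pass described above can be sketched in numpy. This is an illustrative implementation, not the paper's code: it assumes the four low frequency points with $a = 1/k$, zero padding at the border so spatial dimensions are preserved, and the function name `depthwise_stft` is hypothetical.

```python
import numpy as np

# Hypothetical sketch of the non-trainable DepthwiseSTFT layer: Equation 9,
# G_x = W f_x, applied at every position of every channel (zero padding at the
# border), producing 8 real feature maps per input channel.
def depthwise_stft(X, k=3):
    H, W_, C = X.shape
    r = (k - 1) // 2
    offs = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    a = 1.0 / k
    freqs = [(a, 0.0), (0.0, a), (a, a), (a, -a)]  # assumed low frequency points
    # complex basis vectors w_v for the four frequencies, shape (4, k*k)
    Wc = np.array([[np.exp(-2j * np.pi * (u * dy + v * dx)) for dy, dx in offs]
                   for u, v in freqs])
    # 8 x k^2 real transformation matrix: interleaved Re/Im rows per frequency
    Wm = np.empty((8, k * k))
    Wm[0::2] = Wc.real
    Wm[1::2] = Wc.imag
    Xp = np.pad(X, ((r, r), (r, r), (0, 0)))       # zero padding, 'same' output
    # gather the vectorized neighborhoods f_x for every position and channel
    F = np.stack([Xp[r + dy:r + dy + H, r + dx:r + dx + W_, :] for dy, dx in offs])
    return np.einsum('fk,khwc->hwcf', Wm, F).reshape(H, W_, 8 * C)

X = np.random.default_rng(0).standard_normal((32, 32, 16))
print(depthwise_stft(X).shape)  # prints (32, 32, 128), i.e. H x W x 8C
```

Note that no entry of this computation is learned; the matrix is fixed once $k$ is chosen.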
Pointwise convolutions. This layer is a standard trainable $1 \times 1$ convolutional layer containing $N$ filters, each of depth $8C$, which takes as input a tensor of size $H \times W \times 8C$ and outputs a tensor of size $H \times W \times N$. Note that it is only this layer that gets learned during the training phase of the CNN.
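Since a 1 × 1 convolution is just a per-position linear map over channels, the trainable sublayer reduces to a matrix product. A minimal sketch (the function name `pointwise_conv` is hypothetical; bias omitted):

```python
import numpy as np

# Hypothetical sketch of the pointwise (1 x 1) convolution: the only trainable
# part of the layer, a per-position linear map from 8C input channels to N
# output channels.
def pointwise_conv(X, W):
    # X: (H, W, 8C) feature map; W: (8C, N) filter bank (bias omitted)
    return np.einsum('hwc,cn->hwn', X, W)

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 32, 128))   # 8C channels with C = 16
W = rng.standard_normal((128, 64))       # N = 64 output channels
print(pointwise_conv(X, W).shape)        # prints (32, 32, 64)
```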
In Fig. 2, we present a visualization of the DepthwiseSTFT based separable layer. First, the Fourier coefficients of the input feature map are extracted in a local neighborhood of each position at the four frequency points to output eight feature maps (after separating the real and imaginary parts). Then, the output feature maps are linearly combined by the pointwise convolutions to output the final feature map.
Parameter analysis. Consider a standard 2D convolutional layer with $M$ input and $N$ output channels. Assume that spatial padding is applied such that the spatial dimensions of the channels remain the same. Let $k \times k$ be the size of the filters. In Table 1, we compare the number of trainable parameters in the various convolutional layers. The number of trainable parameters in the DepthwiseSTFT separable layer is independent of the filter size $k$.

Layer  # parameters
Standard Convolution  $k^2 M N$
Depthwise Separable  $k^2 M + M N$
DepthwiseSTFT Separable  $8 M N$
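The independence from the filter size follows directly from the construction: only the pointwise convolution over the $8M$ STFT channels is trainable. A tiny sketch (the function name is hypothetical):

```python
# Hypothetical sketch of the DepthwiseSTFT row of Table 1: only the pointwise
# convolution over 8M channels is trainable, so the count 8MN ignores k.
def depthwise_stft_separable_params(k, M, N):
    return 8 * M * N  # k is unused: the STFT sublayer has no trainable weights

M, N = 64, 128
same = depthwise_stft_separable_params(3, M, N) == depthwise_stft_separable_params(5, M, N)
print(same)  # prints True: the count is identical for 3 x 3 and 5 x 5 filters
```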
4 Experiments
Datasets. We evaluate the DepthwiseSTFT separable layer on the two popular CIFAR datasets, CIFAR10 and CIFAR100 [16]. The CIFAR10 and CIFAR100 datasets consist of 10 and 100 classes, respectively. Each dataset consists of natural RGB images of resolution 32 × 32 pixels, with 50,000 images in the training set and 10,000 images in the testing set. We use the standard data augmentation scheme (horizontal flip, shifting, and rotation) that is widely used for these two datasets. For preprocessing, we normalize the data using the channel means and standard deviations.
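The preprocessing step amounts to standardizing each RGB channel with statistics computed over the training set. A minimal numpy sketch (the function name `normalize` is hypothetical):

```python
import numpy as np

# Hypothetical sketch of the preprocessing: normalize each RGB channel by the
# per-channel mean and standard deviation computed over the training images.
def normalize(images):
    # images: (num_images, height, width, 3), e.g. CIFAR batches of 32 x 32 RGB
    mean = images.mean(axis=(0, 1, 2))
    std = images.std(axis=(0, 1, 2))
    return (images - mean) / std

batch = np.random.default_rng(0).random((100, 32, 32, 3))
out = normalize(batch)
print(np.allclose(out.mean(axis=(0, 1, 2)), 0.0))  # prints True: zero mean per channel
```

In practice the mean and standard deviation would be computed once on the training set and reused for the test set.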
Network Architecture. We adopt a simple Inception-ResNet style bottleneck architecture [5]. Fig. 3 presents the building blocks of our network, Block 1 (Fig. 2(a)) and Block 2 (Fig. 2(b)). Block 1 takes as input a feature map with $C$ channels and applies a $1 \times 1$ bottleneck convolution (trainable) to output a feature map with $b$ channels such that $b < C$. This is followed by inception style non-trainable DepthwiseSTFT layers with different filter sizes. The intuition behind this design is to have a maximum number of possible pathways for information to flow through the network and to let the network selectively choose the best frequency points/neighborhood sizes for computing the local Fourier transform, giving more weight to them during training. The outputs from the DepthwiseSTFT layers are concatenated channel-wise and finally passed through a $1 \times 1$ expansion convolution (trainable) to output a feature map with $f$ channels such that $f > b$. We use bottleneck architectures as they have been shown to generalize better [7]. The architecture of Block 2 is a slightly augmented version of Block 1 with skip connections (Fig. 2(b)). Each block is followed by a batch normalization layer, which is followed by a LeakyReLU activation [17]. For downsampling, we use max pooling with pool size 2 and stride 2. Our final network, as shown in Fig. 2(c), consists of 16 non-downsampling blocks (Block 1 and Block 2), followed by a global average pooling layer connected to a final classification layer with softmax activation. We implemented the proposed network using the Keras deep learning library [18] and performed all the experiments on a system with an Intel i7-8700 processor, 32 GB RAM, and a single Nvidia Titan Xp GPU.

Network  # params  # feat.  CIFAR10  CIFAR100
ShuffleNet [9]  1.05M  800  90.80  70.06 
ShuffleNetV2 [8]  1.35M  1024  91.42  69.51 
MobileNet [6]  3.31M  1024  91.87  65.98 
MobileNetV2 [7]  2.36M  1280  93.14  68.08 
ReLPU based [14]  3.08M  256  92.20  70.20 
Ours (b=8, f=128)  0.90M  128  93.16  70.19 
Ours (b=16, f=128)  1.30M  128  93.59  70.66 
Ours (b=32, f=128)  2.21M  128  93.72  71.08 
Ours (b=64, f=128)  3.69M  128  94.07  71.42 
Ours (b=64, f=256)  8.21M  256  94.25  73.01 
Ours (b=64, f=384)  14.06M  384  94.51  74.39 
Training. For training our networks, we use the Adam optimizer [19], categorical cross-entropy as the loss function, and a batch size of 64. All the trainable weights are initialized with the orthogonal initializer [20]. We train our networks with a learning rate of 0.01 for 300 epochs. After that, we increase the batch size to 128 and train the networks for a further 100 epochs. Note that the above training method is inspired by [21], which proposes that it is preferable to increase the batch size instead of decreasing the learning rate during training.

Results and Analysis. Table 2 reports the classification results of different resource efficient architectures on the CIFAR10 and CIFAR100 datasets. For a fair comparison, we compare our networks only with depthwise separable convolution-based architectures such as MobileNet [6, 7] and ShuffleNet [8, 9]. As discussed in Section 2, we also compare with the ReLPU layer [14] based network, which is formed by replacing all the DepthwiseSTFT layers of the network of Fig. 2(c) with ReLPU layers. Our results show that the proposed layer outperforms the depthwise separable layer based and the ReLPU layer based architectures. Note that our target here is not to achieve state-of-the-art accuracy on the two datasets but to showcase the efficiency of the proposed layer when compared to the depthwise separable convolution-based architectures. For analysis, we ran two variants of the proposed network (Fig. 2(c)). In the first variant, the bottleneck parameter $b$ is increased while the expansion parameter $f$ is kept constant. In the second variant, we keep the bottleneck parameter $b$ constant while increasing the expansion parameter $f$. In both settings, we observe an improvement in the performance of the networks.
5 Conclusion
This paper proposes the DepthwiseSTFT separable layer, which can serve as an alternative to the standard depthwise separable layer. The proposed layer captures spatial correlations (channel-wise) in the feature maps using the STFT, followed by pointwise convolutions to capture channel correlations. Our proposed layer outperforms standard depthwise separable layer based models on the CIFAR10 and CIFAR100 datasets. Furthermore, it has a lower space-time complexity when compared to the standard depthwise separable layer.
References
 [1] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
 [3] Dongyoon Han, Jiwhan Kim, and Junmo Kim, “Deep pyramidal residual networks,” in CVPR, 2017.
 [4] Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks,” BMVC, 2016.

 [5] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in AAAI, 2017.
 [6] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
 [7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in CVPR, 2018.
 [8] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in ECCV, 2018.
 [9] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in CVPR, 2018.
 [10] Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi, “Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation,” in ECCV, 2018.
 [11] Sachin Mehta, Mohammad Rastegari, Linda Shapiro, and Hannaneh Hajishirzi, “Espnetv2: A lightweight, power efficient, and general purpose convolutional neural network,” in CVPR, 2019.
 [12] Sudhakar Kumawat and Shanmuganathan Raman, “LP-3DCNN: Unveiling local phase in 3D convolutional neural networks,” in CVPR, 2019.
 [13] François Chollet, “Xception: Deep learning with depthwise separable convolutions,” in CVPR, 2017.
 [14] Sudhakar Kumawat and Shanmuganathan Raman, “Local phase U-Net for fundus image segmentation,” in ICASSP, 2019.
 [15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
 [16] A Krizhevsky, “Learning multiple layers of features from tiny images,” Master’s thesis, University of Toronto, 2009.
 [17] Andrew L Maas, Awni Y Hannun, and Andrew Y Ng, “Rectifier nonlinearities improve neural network acoustic models,” in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
 [18] Antonio Gulli and Sujit Pal, Deep Learning with Keras, Packt Publishing Ltd, 2017.
 [19] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” ICLR, 2014.

 [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV, 2015.
 [21] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le, “Don’t decay the learning rate, increase the batch size,” in ICLR, 2017.