Convolutional neural networks (CNNs) are a category of neural networks that have achieved very good performance in a multitude of tasks, for example image classification [15, 5] and segmentation [11, 19]. CNNs learn filters that allow the network to capture patterns in a feature space. In recent years, different alternatives to standard CNNs have been investigated, such as MobileNets and dilated convolutions. MobileNets use depth-wise and point-wise convolutions to compress neural networks and reduce the number of calculations needed. Dilated convolutions expand the receptive field without loss of resolution or coverage.
On the other hand, Discrete Cosine Transforms (DCTs) have been widely used in image processing applications, such as the JPEG encoding standard, to decompose an image into its spatial frequency spectrum. DCTs have also been applied in deep learning frameworks. For example, DCTs have been used to speed up the training of neural networks. Furthermore, Ghosh and Chellappa demonstrate that applying DCTs to the feature maps generated by CNN layers can speed up training convergence and, in some cases, increase accuracy.
In this paper, motivated by the advantages that DCTs (and frequency analysis in general) have provided in areas such as image processing and deep learning, we propose a new variant of the standard convolution filter named the Cosine Based Convolution (CBC) filter. In this variant, instead of learning the filter weights directly, the network learns the parameters of a cosine basis from which the convolution weights are generated. Figure 1 shows a graphical overview of how CBC filters generate convolutional filter weights. We then extend this filter to the entire layer using a hybrid approach in which the proposed CBC filters are used together with standard convolutional filters.
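As a rough illustration of the two alternatives mentioned above, the following sketch counts the weights of a standard convolution and of a depth-wise plus point-wise pair, and computes the window covered by a dilated kernel. The function names are hypothetical and biases are ignored; the formulas are the usual textbook ones and are not taken from this paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depth-wise k x k convolution followed by a 1 x 1 point-wise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


def dilated_extent(k, dilation):
    """Side length of the window covered by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


print(standard_conv_params(3, 64, 128))        # 73728
print(depthwise_separable_params(3, 64, 128))  # 8768
print(dilated_extent(3, 2))                    # 5
```

In this example the dilated kernel covers a 5 x 5 window while still using only 3 x 3 weights.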
The main contributions of this paper are:
A new variant of the standard convolutional layer, the hybrid CBC layer, where filter weights are obtained from a cosine basis.
Hybrid CBC layers require fewer parameters, as only amplitudes, frequencies and phases are needed to represent filter weights.
CBC parameters are independent of the convolution filter size, so receptive fields can be increased without adding parameters.
As fewer parameters need to be learned, hybrid CBC layers converge faster during training.
Experimental results show that, even though fewer parameters are used, CBC layers obtain similar (and in some cases better) accuracy in classification tasks on different datasets.
The paper is organized as follows. Section 2 reviews related work. Section 3 describes the Cosine Based Convolution filter, and Section 4 explains its hybrid extension. Section 5 presents ablation studies of each variant of the proposed method. Section 6 shows the results of our proposal on different datasets and architectures. Finally, conclusions are drawn in Section 7.
2 Related work
In recent years, frequency analysis has been applied to improve several deep learning architectures. For example, Jaderberg et al. showed a methodology that reduces memory usage and speeds up the inference stage. It approximates the trained rank-N convolutional filters by separable rank-1 filters. In an extension to this, the standard convolution in the spatial domain was replaced by multiplications of the transformed filters in the frequency domain.
Hashed Networks exploit the inherent redundancy in neural networks by using a low-cost hash function to randomly group connection weights into hash buckets, where all the weights in a bucket share the same parameter value. An extension by Chen et al. groups the weights based on their DCT representation. Another methodology, by Han et al., prunes the network by learning only the important connections. The weights are then quantized to enforce weight sharing. Finally, Huffman coding is applied.
More closely related to the work proposed in this paper, Wang et al. present a method to compress neural networks in the frequency domain. Using a DCT, filter weights are clustered. These weights are then quantized and encoded using Huffman coding. However, this compression is only applied for storage purposes.
In a similar manner, Uliný et al. present harmonic blocks that replace standard convolutional layers. Harmonic blocks consist of a convolution with a filter bank that isolates the coefficients of the DCT basis functions into their own feature maps, creating a new feature map for each channel and each hand-defined frequency. However, this spectral decomposition increases the number of intermediate features between layers, notably increasing the memory requirements. In our case, DCT frequencies are learned, so only the most relevant decompositions are used in the network.
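To make the benefit of rank-1 filters concrete, the sketch below checks that filtering with a rank-1 3 x 3 kernel gives the same result as filtering with a 3 x 1 kernel followed by a 1 x 3 kernel. It only illustrates the single rank-1 case on a toy integer image; the helper names are hypothetical and it is not the approximation scheme of the cited work.

```python
def outer(col, row):
    """Rank-1 filter: entry (i, j) is col[i] * row[j]."""
    return [[c * r for r in row] for c in col]


def correlate2d(image, kernel):
    """Valid-mode sliding-window filtering, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


col = [1, 2, 1]
row = [-1, 0, 1]
image = [[(3 * i + 5 * j) % 7 for j in range(6)] for i in range(6)]

full = correlate2d(image, outer(col, row))          # one 3 x 3 filter, 9 weights
vertical = correlate2d(image, [[c] for c in col])   # 3 x 1 filter, 3 weights
separable = correlate2d(vertical, [row])            # 1 x 3 filter, 3 weights

print(full == separable)  # True
```

The separable version stores 6 weights instead of 9, and the gap grows with the kernel size.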
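The growth in intermediate features can be seen with a small sketch of a fixed DCT filter bank. It uses the standard unnormalized DCT-II basis; the function names and the choice of 64 input channels are assumptions made for illustration, and the details of harmonic blocks in the cited work may differ.

```python
import math


def dct_basis(k, u, v):
    """Unnormalized 2D DCT-II basis function (u, v) sampled on a k x k window."""
    return [[math.cos(math.pi * (i + 0.5) * u / k)
             * math.cos(math.pi * (j + 0.5) * v / k)
             for j in range(k)]
            for i in range(k)]


def dct_filter_bank(k):
    """All k * k fixed basis filters for a k x k window."""
    return [dct_basis(k, u, v) for u in range(k) for v in range(k)]


bank = dct_filter_bank(3)
channels_in = 64  # illustrative value

print(len(bank))                # 9
print(channels_in * len(bank))  # 576
```

With one feature map per channel and per frequency, 64 input channels become 576 intermediate maps for a 3 x 3 window, which is the memory overhead discussed above.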
3 Cosine Based Convolution filter (CBC)
This section describes the proposed CBC filter. This filter modifies the standard convolution by using a cosine basis to generate the filter weights that are applied in the convolution operator. Therefore, CBC filters do not learn filter weights directly; they learn the frequency parameters (amplitudes, frequencies and phases) from which the filter weights are generated in the spatial domain, where the convolution operates. In this work, the cosine basis of a CBC filter can be interpreted as a frequency decomposition of the convolutional filter weights (not of the input images or features). The main advantages of CBC filters are that the frequency parameters represent convolutional filter weights more compactly and that they are independent of the window size of the convolution filter.
Figure 1 represents the pipeline that generates convolution filter weights from the cosine basis defined by the CBC layer. As can be seen, the same cosine basis can represent convolutional filter weights for different filter sizes without extra parameters.
In order to define CBC filters, we define different options to generate convolutional filter weights in the spatial dimensions, , and in the channel/feature dimension, , as frequency harmonics are more common in spatial dimensions in natural images. Therefore, we will separate the cosine basis to represent convolutional filter weights in the spatial dimension and in the feature dimension. Note that these spatial and feature dimensions refer to the convolutional filter weights and not the image or signal it is convolved with. This section assumes 2D images and 2D convolutional operations but higher dimensions can easily be extrapolated.
3.1 Spatial dimensions
Figure 2: Comparison of the degrees of freedom between spatial-product and spatial-direction.
We propose two different cosine bases to generate filter weights on the spatial dimensions of the filter. The first basis, which we call spatial-product or , is built with the product of two cosines where each of them represents a unique frequency, and , in the vertical and horizontal dimensions respectively:
where and represent the spatial coordinate of convolutional filter weights.
Even though this cosine basis allows the generation of a large spectrum of different signals (filters), it only allows the composition of two unique spatial harmonics, and directional filters are not represented. For this reason we propose an alternative basis, which we call spatial-direction or , where only one harmonic is represented but different 2-dimensional directions can be achieved:
Figure 2 represents the different degrees of freedom of the two cosine bases proposed for the spatial dimensions. It can be seen that spatial-direction is able to generate filters in different directions using a single harmonic, while spatial-product represents two harmonics in the vertical and horizontal dimensions.
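The difference between the two spatial bases can be sketched in a few lines. The parametrization below is an assumption consistent with the description in this section, not a reproduction of Eqs. (1) and (2): spatial-product multiplies one cosine per axis, and spatial-direction evaluates a single cosine along a direction given by an angle `theta`. The integer coordinates centred on the kernel, and all function and parameter names, are also assumptions.

```python
import math


def spatial_product(x, y, freq_x, freq_y, phase_x, phase_y):
    """Product of one horizontal and one vertical cosine (assumed form)."""
    return math.cos(freq_x * x + phase_x) * math.cos(freq_y * y + phase_y)


def spatial_direction(x, y, freq, theta, phase):
    """Single cosine evaluated along the direction theta (assumed form)."""
    along = x * math.cos(theta) + y * math.sin(theta)
    return math.cos(freq * along + phase)


def sample(basis, size, **params):
    """Evaluate a basis on a size x size grid centred on the kernel centre."""
    half = (size - 1) / 2
    return [[basis(col - half, row - half, **params) for col in range(size)]
            for row in range(size)]


diag = sample(spatial_direction, 3, freq=1.0, theta=math.pi / 4, phase=0.0)
prod = sample(spatial_product, 3, freq_x=1.0, freq_y=1.0,
              phase_x=0.0, phase_y=0.0)

# The directional basis separates the two diagonals of the kernel...
print(round(diag[0][0], 4), round(diag[0][2], 4))
# ...while this product basis gives both top corners the same value.
print(round(prod[0][0], 4), round(prod[0][2], 4))
```

Under these assumptions the directional basis yields an oriented (diagonal) filter from a single harmonic, which the product of an even horizontal and an even vertical cosine cannot do.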
3.2 Feature dimension
In the case of the feature dimension, we also propose two alternatives. The first basis, called feature-direct or , is built as a cosine with a new frequency and phase along the feature dimension of the filter:
where denotes the amplitude of the cosine basis and represents the feature coordinate of convolutional filter weights.
In order to relax the previous restriction, we introduce a second alternative, called feature-weight or , without any cosine basis but directly using a different amplitude for each input channel (as opposed to in Eq. (3) that uses the same amplitude for all channels):
Using feature-direct offers higher compression in the number of parameters of the CBC layer, as only 3 parameters (amplitude , frequency and phase ) can represent channels in the feature dimension (in fact, they are completely independent of the convolutional filter size). On the other hand, it forces an order on the feature dimension, so the optimizer must find not only the best parameters but also the order of the features that gives the best performance. Using feature-weight, a different weight is used for each feature coordinate (channel); the number of parameters is therefore not reduced in the feature dimension, but the optimization problem becomes simpler.
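The trade-off between the two feature options can be illustrated with a minimal sketch. The cosine form with an amplitude, a frequency and a phase follows the description above; using the integer channel index as the feature coordinate, and all names, are assumptions.

```python
import math


def feature_direct(num_channels, amplitude, freq, phase):
    """One cosine along the channel axis: 3 parameters for any channel count."""
    return [amplitude * math.cos(freq * c + phase) for c in range(num_channels)]


def feature_weight(amplitudes):
    """One free amplitude per input channel: as many parameters as channels."""
    return list(amplitudes)


def feature_param_count(option, num_channels):
    """Learnable parameters used along the feature dimension."""
    if option == "feature-direct":
        return 3
    if option == "feature-weight":
        return num_channels
    raise ValueError(option)


weights = feature_direct(512, amplitude=1.0, freq=0.05, phase=0.0)
print(len(weights))                                 # 512
print(feature_param_count("feature-direct", 512))   # 3
print(feature_param_count("feature-weight", 512))   # 512
```

Because feature-direct ties neighbouring channel indices through a smooth cosine, the result depends on how the channels are ordered, which is the ordering constraint mentioned above.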
3.3 CBC filter
Finally, the spatial and feature bases can be combined to create the CBC filter that represents the convolutional filter weights. We combine them using the same spatial frequencies for all input channels:
where represents either of the two spatial-domain cosine bases, spatial-product ( in Eq. (1)) or spatial-direction ( in Eq. (2)), and represents either of the two feature-domain bases, feature-direct ( in Eq. (3)) or feature-weight ( in Eq. (4)). For example, using spatial-product for the spatial dimensions and feature-direct for the feature dimension, the following cosine basis is used to represent the convolutional filter weights:
In this case, only 7 parameters are needed to represent the cosine basis (the amplitude , frequencies and phases ), while it can represent up to weights of a standard convolution filter (where is the kernel height, the kernel width and the number of input channels).
Combining all the proposed cosine bases, we obtain up to 4 variations of the CBC filter, depending on the options chosen for the spatial and feature dimensions. Table 1 summarizes all possible CBC filter combinations (and their respective equations) together with the number of parameters needed to represent the convolution filters.
Using the same spatial frequencies for all channels provides the most efficient parameter compression. Using different spatial frequencies for each channel could give CBC layers a higher representation capability. However, our initial search showed that sharing the spatial frequencies is enough to capture a feature representation similar to that of standard convolutional layers, so this is the only configuration considered in this paper.
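The independence between the 7 parameters and the kernel size can be checked with a short sketch. It assumes that the spatial and feature bases are multiplied, that coordinates are integer offsets from the kernel centre, and that channels are indexed from 0; `cbc_filter` and its argument names are hypothetical.

```python
import math


def cbc_filter(kh, kw, channels, amplitude, freq_x, freq_y, freq_c,
               phase_x, phase_y, phase_c):
    """Generate a channels x kh x kw weight tensor from 7 cosine parameters."""
    cy, cx = (kh - 1) / 2, (kw - 1) / 2
    weights = []
    for c in range(channels):
        feature = amplitude * math.cos(freq_c * c + phase_c)
        plane = [[feature
                  * math.cos(freq_x * (x - cx) + phase_x)
                  * math.cos(freq_y * (y - cy) + phase_y)
                  for x in range(kw)]
                 for y in range(kh)]
        weights.append(plane)
    return weights


params = dict(amplitude=0.5, freq_x=0.8, freq_y=1.3, freq_c=0.2,
              phase_x=0.1, phase_y=-0.4, phase_c=0.0)
small = cbc_filter(3, 3, 16, **params)
large = cbc_filter(7, 7, 16, **params)

print(len(params))               # 7
print(16 * 3 * 3, 16 * 7 * 7)    # 144 784
```

Under these assumptions the 3 x 3 filter is exactly the central crop of the 7 x 7 filter: enlarging the receptive field samples more of the same cosine surface without adding parameters.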
|Name||Spatial dimensions||Feature dimension||Parameters||Equation|
4 Hybrid CBC Layer
Even though all convolution filters of a CNN layer could be replaced with CBC filters, we propose a hybrid CBC (HCBC) layer that combines the best of both worlds: conventional convolution filters and CBC filters. In this way, the hybrid layer can use CBC filters (with fewer parameters) to efficiently represent harmonics in the input signal, while it can also use conventional convolution filters (with more parameters and more degrees of freedom) to represent more complex features of the input signal. Figure 3 shows a representation of an HCBC layer with total filters. The parameter controls the number of CBC filters with respect to the conventional convolutions. If is close to then the HCBC layer will be composed mostly of CBC filters, while close to will create hybrid CBC layers that behave as standard convolutional layers. Section 5 studies the effect of using different values in HCBC layers.
Note that in the specific case of 1×1 convolution layers, using a cosine basis in the spatial domain does not provide any gain, as only one parameter is used per channel. Therefore, we set in Eq. (5) and only the feature dimension is used.
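The effect of the mixing parameter on the layer size can be sketched as a parameter count. The sketch assumes a ratio `cbc_ratio` in [0, 1] where 1 means that all filters are CBC filters, spatial-product as the spatial basis (4 parameters: two frequencies and two phases), no biases, and no spatial basis for 1×1 kernels; the function name and the layer sizes are illustrative only.

```python
def hybrid_layer_params(num_filters, channels_in, kernel, cbc_ratio,
                        feature_option):
    """Weights of a hybrid layer mixing standard and CBC filters (assumed layout)."""
    num_cbc = int(round(cbc_ratio * num_filters))
    num_std = num_filters - num_cbc
    std_params = num_std * channels_in * kernel * kernel
    # feature-direct: amplitude, frequency, phase; feature-weight: one per channel
    feature_params = 3 if feature_option == "feature-direct" else channels_in
    # spatial-product: 2 frequencies + 2 phases; unused for 1 x 1 kernels
    spatial_params = 0 if kernel == 1 else 4
    return std_params + num_cbc * (spatial_params + feature_params)


baseline = hybrid_layer_params(128, 64, 3, 0.0, "feature-weight")
half_weight = hybrid_layer_params(128, 64, 3, 0.5, "feature-weight")
half_direct = hybrid_layer_params(128, 64, 3, 0.5, "feature-direct")
all_direct = hybrid_layer_params(128, 64, 3, 1.0, "feature-direct")

print(baseline, half_weight, half_direct, all_direct)  # 73728 41216 37312 896
```

For this illustrative layer, replacing half of the filters removes a little under half of the weights, because the remaining standard filters dominate the count.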
5 Ablation studies
In this section we analyze the different options proposed in this paper, which are summarized in Table 1. We use two well-known neural network architectures, VGG and ResNet, where all convolutional layers are replaced by hybrid CBC layers with different options.
We have selected VGG16bn and ResNet50 as the specific VGG and ResNet architectures for our experiments because: a) they are widely used in industry and academia, b) they are robust architectures with very good results in many applications and c) their internal architectures are very different.
For the ablation studies, we use the CIFAR10 dataset and show results for both the proposed hybrid CBC layers and the unmodified network architectures, which we consider the baseline. All studies show the loss curves for the training and validation sets for epochs (graphs in the first row of the figures). To facilitate the comparison, we also show the same loss relative to the baseline (graphs in the second row of the figures). Finally, we include a histogram that indicates the number of parameters used by each architecture in log scale (bottom row of the figures).
5.1 Spatial criteria
In this study, we analyze the behaviour of the spatial dimension options, spatial-product and spatial-direction, during training. In order to focus only on the spatial dimension options, we fix the feature dimension to feature-weight. Figure 4 shows the learning curves of the different spatial options.
As can be seen, in the case of VGG, both spatial alternatives behave similarly. Both achieve performance similar to the baseline, with a small penalty in convergence speed in the case of spatial-direction. However, both achieve a great reduction of network parameters, with a compression factor around with respect to the total parameters of the convolutional layers in VGG. As will be stated in the experimental section, we measure the compression factor as the ratio between the total number of parameters in the convolutional layers of the baseline architectures (VGG or ResNet) and the total number of parameters in the CBC layers.
In the case of the ResNet architecture, both spatial alternatives offer similar performance. Neither spatial-direction nor spatial-product is able to achieve the same performance as the baseline. Nevertheless, both are very close to the baseline, with the advantage of reducing the number of parameters of the network by a compression factor around with respect to ResNet.
5.2 Feature criteria
In this study, we analyze the behavior of feature dimension options, feature-direct and feature-weight, during training. For this, we will fix the spatial dimension as spatial-product. Figure 5 shows the learning curve comparing both feature options.
It can be seen that, for both architectures, feature-weight behaves much better than feature-direct. Even though feature-direct has a higher compression ratio ( for VGG and for ResNet), it cannot achieve performance similar to the baseline. This is because combining channels using a cosine basis does not provide the flexibility the network needs to learn the classification task. In contrast, feature-weight shows behavior close to the baseline with a compression ratio of for VGG and for ResNet.
5.3 Hybrid CBC layer
In this study, we analyze the performance of the hybrid CBC layer for different parameters. We will test when (only CBC filters) and (half the filters of the layer are standard convolutional filters and the other half are CBC filters). For this analysis we fix the spatial dimension to spatial-product and compare the feature-direct and feature-weight options.
Figure 5(a) shows the behaviour for both values when using the feature-weight option for the feature dimension. Both VGG and ResNet architectures obtain very similar results for the two values of . When the generalization of the networks is better than when . Moreover, when , the results are very close to baseline. This experiment results in compression factors of and for VGG and ResNet respectively.
Figure 5(b) shows the behaviour for the two values of when the feature-direct option is set for the feature dimension. In this case we can observe a huge difference when the value of changes. The network is not able to reach the performance of the baseline when for both architectures. However, when , the network achieves performance similar to the baseline, with compression factors of and for VGG and ResNet respectively. In this case, letting half the filters of the layer remain standard convolution filters helps to compensate for the lower representation capacity of CBC filters with the feature-direct option.
To summarize, when , feature-direct is not able to reach the same performance as baseline. However, if , both feature-direct and feature-weight configurations achieve similar performance to baseline with the advantage of the compression ratios mentioned above.
6 Experiments
This section compares the most representative options of hybrid CBC layers against the VGG and ResNet architectures on different challenging datasets. We use datasets with different resolutions and numbers of classes to assess the performance of the proposed network. We use validation accuracy on classification tasks to measure the performance of the different options.
The datasets used in these experiments are:
CIFAR100 contains color images of resolution . It contains classes uniformly distributed. This dataset presents a more challenging classification problem than CIFAR10, as it includes a greater number of classes and a smaller number of samples per class ( per class). We have selected this dataset for its complexity as an image classification task.
Monkeys contains approximately color images of resolution . It contains a total of classes representing a fine-grained classification of different monkey species. The dataset is uniformly distributed, with approximately images per class. We have selected this dataset to assess the performance of the hybrid CBC layers on higher resolution images and with a lower number of examples per class.
6.2 Dataset preparation
In the case of the CIFAR10/100 datasets, we have preserved the original image size of (Section 6.3 explains the modifications of the network needed to accept this resolution). In the case of Monkeys, input images of resolution are used. We also applied online data augmentation. In the case of CIFAR10/100, random horizontal flips with probability are used. In the case of Monkeys, as there are few images per class, we decided to apply: a) random rotations between and degrees, b) random crops of and c) horizontal flips with probability .
6.3 Model Surgery
We have adapted the VGG architecture to work with images of resolution as input for the CIFAR10/100 datasets. In the case of ResNet, changes are not necessary since a large part of its architecture is composed of convolutional filters and it ends with a global average pooling that unifies the size. However, for VGG, we need to replace the last fully connected layers with a new fully connected layer of weights, as is usually done in the literature. This change modifies the total number of parameters of the VGG network. However, our aim is to analyze the effect of replacing standard convolutional layers with hybrid CBC layers. Therefore, we focus only on the number of parameters in the convolutional layers and not in the fully connected layers.
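For reference, the number of convolutional parameters of a baseline can be counted directly from its layer configuration. The sketch below uses the standard VGG16 configuration of 13 convolutional layers with 3 x 3 kernels; whether the counts reported in this paper include biases or batch normalization parameters is not stated here, so weights and biases are shown separately, and the helper names are hypothetical.

```python
# (input channels, output channels) of the 13 convolutional layers of VGG16
VGG16_CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_weight_count(layers, kernel=3):
    """Total filter weights over all convolutional layers."""
    return sum(c_in * c_out * kernel * kernel for c_in, c_out in layers)


def conv_bias_count(layers):
    """Total bias terms over all convolutional layers (one per output channel)."""
    return sum(c_out for _, c_out in layers)


print(conv_weight_count(VGG16_CONV_CHANNELS))  # 14710464
print(conv_bias_count(VGG16_CONV_CHANNELS))    # 4224
```

These counts depend only on the channel configuration and kernel size, not on the input resolution, which is why the change of fully connected layers does not affect them.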
6.4 Training parameters
We have set a batch size of for CIFAR10/100 and for Monkeys. We have used the Adam optimizer with an initial learning rate of , and . This optimizer was selected to simplify the gradient descent setup and let the optimizer adapt the learning rate during training. The learning window has been set to epochs for CIFAR10/100 and for Monkeys to limit the time of the experiments. Please note that in these experiments we do not aim for state-of-the-art results in the classification tasks, as we want to compare the behaviour of the proposed hybrid CBC layers with respect to the original architectures. All networks are trained from scratch from random initial parameters.
We compare against the unmodified VGG and ResNet architectures (which we consider baselines). For our proposed options, we have replaced all convolutional layers in VGG and ResNet with hybrid CBC layers (note that for 1×1 convolution layers only the feature dimension is replaced in CBC layers). For simplicity, and as the ablation studies suggested, we fix spatial-product for the cosine basis on the spatial dimensions and compare both feature-direct and feature-weight for the feature dimension. In both cases the hybrid CBC layers are used with . To better evaluate the proposed method, we report the following for all cases:
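The way Adam adapts the step size per parameter can be seen in a single scalar update. This is a minimal sketch of the standard Adam rule; the learning rate and the two decay rates shown are the commonly used defaults and are an assumption, not necessarily the values used in these experiments.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; state holds t, m and v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias-corrected mean
    v_hat = state["v"] / (1 - beta2 ** state["t"])   # bias-corrected variance
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


# Two parameters whose gradients differ by four orders of magnitude
big = adam_step(1.0, 100.0, {"t": 0, "m": 0.0, "v": 0.0})
small = adam_step(1.0, 0.01, {"t": 0, "m": 0.0, "v": 0.0})
print(1.0 - big, 1.0 - small)
```

Both parameters move by roughly the learning rate on the first step, regardless of the gradient scale, which removes the need to tune a separate step size for each layer.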
Best Acc.: Indicates the best accuracy obtained in the whole learning window ( epochs for CIFAR10/100 and epochs for Monkeys) for the validation set.
Epoch exceeds x% Acc.: Indicates the epoch in which the validation accuracy first exceeds x%. This shows the learning speed of the different architectures.
Parameters of conv. layers: Total number of parameters in the convolutional layers.
Compression factor: Ratio of compression between the total number of parameters in the convolutional layers of baseline with respect to the total number of parameters in hybrid CBC layers.
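The metrics listed above can be computed as in the following sketch. The function names are hypothetical, epochs are assumed to be counted from 1, and the accuracy history and parameter counts are made-up values used only to exercise the code.

```python
def best_accuracy(val_acc):
    """Best validation accuracy over the whole learning window."""
    return max(val_acc)


def first_epoch_exceeding(val_acc, threshold):
    """First (1-based) epoch whose validation accuracy exceeds the threshold."""
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > threshold:
            return epoch
    return None  # the threshold was never exceeded


def compression_factor(baseline_conv_params, cbc_conv_params):
    """Ratio of baseline convolutional parameters to hybrid CBC parameters."""
    return baseline_conv_params / cbc_conv_params


history = [0.41, 0.63, 0.74, 0.78, 0.77]  # made-up validation accuracies

print(best_accuracy(history))                # 0.78
print(first_epoch_exceeding(history, 0.75))  # 4
print(compression_factor(1000, 250))         # 4.0
```

A compression factor above 1 means that the hybrid CBC layers use fewer convolutional parameters than the baseline.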
Table 2 shows the results obtained for the CIFAR10 dataset. It can be seen that in both VGG and ResNet architectures, hybrid CBC layers with feature-weight are able to slightly exceed the accuracy of the baseline using nearly half of the convolutional parameters. Additionally, they reach the proposed accuracy limit in fewer epochs, as the number of parameters to learn is reduced.
|Architecture||Best Acc.||Epoch exceeds 75% Acc.||Parameters of conv. layers||Compression factor|
Table 3 shows the results obtained for the CIFAR100 dataset. This dataset presents a greater challenge, and validation accuracy is much lower than with the other datasets. In this case, hybrid CBC layers with feature-weight slightly outperform the baseline only in the ResNet architecture. Hybrid CBC layers in the VGG architecture are slightly below the baseline in validation accuracy. As before, all of the options reach the proposed accuracy in fewer epochs. Note that, as the network architecture does not change between datasets, compression factors remain the same across the different experiments.
|Architecture||Best Acc.||Epoch exceeds 45% Acc.||Parameters of conv. layers||Compression factor|
Table 4 shows the results obtained for the Monkeys dataset. This dataset uses higher resolution images as we wanted to assess the adaptability of our proposed cosine basis to higher spatial dimensions in the input images. It can be seen that in both VGG and ResNet architectures, hybrid CBC layers with feature-weight are able to improve the accuracy of baselines even though compression ratios of and are used. Similar to previous experiments, training convergence also improves with respect to baseline.
|Architecture||Best Acc.||Epoch exceeds 75% Acc.||Parameters of conv. layers||Compression factor|
7 Conclusions
In this work we have proposed a new set of Convolutional Neural Network compression techniques, called Hybrid Cosine Based Convolutional Neural Networks, which use a cosine basis to represent the weights of the convolutional filters. In our classification experiments, architectures based on hybrid CBC layers are able to obtain better performance than the VGG and ResNet architectures even though they use fewer parameters in the convolutional layers. Compression factors between and can be used with similar or even better performance. Hybrid CBC layers provide faster convergence in training, and receptive fields can be increased at no cost as the number of parameters does not change. Furthermore, hybrid CBC layers are a straightforward replacement for current convolutional layers and no further adaptation is needed. In future work, we would like to study the applicability of hybrid CBC layers as feature detectors for other tasks, such as object detection. We will also generalize the cosine basis used in our method to allow different basis functions adapted to specific scenarios or frameworks.
This research was supported by Secretary of Universities and Research of the Generalitat de Catalunya and the European Social Fund via a PhD grant (FI2018), and developed in the framework of project TEC2016-75976-R, financed by the Ministerio de Economía, Industria y Competitividad and the European Regional Development Fund (ERDF).
[1] W. Chen, J. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing convolutional neural networks in the frequency domain. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1475–1484, New York, NY, USA, 2016. ACM.
[2] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pages 2285–2294. JMLR.org, 2015.
[3] A. Ghosh and R. Chellappa. Deep feature extraction in the DCT domain. In International Conference on Pattern Recognition, 2016.
[4] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In International Conference on Learning Representations, 2016.
[5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
[7] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. In Proceedings of the British Machine Vision Conference, 2014.
[8] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[9] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
[10] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-100 (Canadian Institute for Advanced Research).
[11] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[12] M. Mathieu, M. Henaff, and Y. LeCun. Fast Training of Convolutional Networks through FFTs. In International Conference on Learning Representations, 2014.
[13] G. Montoya, J. Zhang, and S. Loaiciga. Monkeys dataset. https://www.kaggle.com/slothkong/10-monkey-species. Accessed: 2019-03-23.
[14] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[16] M. Uliný, V. A. Krylov, and R. Dahyot. Harmonic networks: Integrating spectral information into CNNs. CoRR, abs/1812.03205, 2018.
[17] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu. CNNpack: Packing Convolutional Neural Networks in the Frequency Domain. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 253–261. Curran Associates, Inc., 2016.
[18] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[19] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[20] X. Zou, X. Xu, C. Qing, and X. Xing. High speed deep networks based on Discrete Cosine Transformation. In 2014 IEEE International Conference on Image Processing (ICIP), 2014.