racnn
radius adaptive convolutional neural net
Convolutional neural networks (CNNs) are widely used in computer vision applications. In networks that deal with images, convolutional layers are the most time-consuming layers. The usual way to address the computation cost is to decrease the number of trainable parameters, which typically comes at the cost of reduced accuracy. Another problem with this technique is that the cost of memory access is usually not taken into account, which results in an insignificant speedup. The number of operations and memory accesses in a standard convolution layer is independent of the input content, which limits certain accelerations. We propose a simple modification to a standard convolution to bridge this gap: an adaptive convolution that adopts different kernel sizes (or radii) based on the content. The network can learn and select the proper radius based on the input content in a soft decision manner. Our proposed radius-adaptive convolutional neural network (RACNN) has a similar number of weights to a standard one, yet results show it can reach higher speeds. The code is available at: https://github.com/meisamrf/racnn.
Convolution is a fundamental building block of many deep neural networks. In a convolutional layer, a set of pre-trained weights extracts spatial features from the input. For image processing, the input is 3D data in the form of pixels, each of which has a certain number of channels (or depth). Usually, the total number of channels over all pixels is large, so a convolution requires many operations and memory accesses. Although computers have become more capable of handling this amount of data, the computation cost still justifies further speed optimization. Many CNN acceleration methods have been introduced so far, but with few exceptions they are not widely used. This is partly because they are mostly application-specific and partly because their implementation is not straightforward. Some of these approaches require analyzing the weights and the network after training in order to compress the network or decompose the weights. For these reasons, the tendency to use more straightforward techniques such as MobileNet [13] is much stronger.
However, those straightforward techniques usually come with two drawbacks. First, they reduce the computation cost by reducing the number of parameters (or neurons). This is usually undesirable because it diminishes the accuracy, especially for convolution layers, where the number of parameters or weights strongly affects the results. Second, they usually do not consider memory access as a slowdown factor, even though memory access can be costly in terms of speed. As an example, in the separable convolution technique used in MobileNet, a $3\times3$ convolution is divided into two convolutions. In total, both convolutions have fewer operations than a standard one. However, each convolution requires a read from and a write to memory, which doubles the memory access compared to a standard convolution.
A convolutional layer has a fixed number of operations and memory accesses, which makes its speed content-independent. One of the techniques normally used in traditional speed optimization is branching in code, i.e., if some conditions are met, the program can skip the main processing and branch to less complex processing. As an example, consider a text recognition network that searches for characters in a document. A window of pixels is the input and is classified as one of the characters or as a non-character. There is a high chance that the window contains no text (or character), in which case all pixels in the input have similar values (such as white pixels). In such a scenario, by checking a simple condition, the program can skip the complex classification when the condition is met and speed up the process. In a CNN, implementing such a concept is not as simple as in this example because the building blocks of CNNs are extremely optimized matrix multiplications. Nevertheless, such scenarios motivated us to seek a solution.
In this work, we propose a new content-adaptive convolution that addresses some of the drawbacks of existing CNN acceleration ideas. Figure 1 shows a high-level scheme of our radius-adaptive convolution for a $3\times3$ convolution. Unlike a standard convolution, the radius of the kernel (or kernel size) can be adjusted according to the input. This adjustment is based on a soft decision where the parameter $\alpha$ defines how much of the neighboring pixels is taken into account. This design is simple to implement, and the number of weights in RACNN is similar to that of a standard convolution (excluding the weights that select $\alpha$). When $\alpha = 1$, RACNN acts as a standard $3\times3$ convolution, and when $\alpha = 0$ it acts as a standard $1\times1$ convolution, which is the cause of the speedup. The remainder of this paper is organized as follows: section 2 discusses related work; section 3 describes our radius-adaptive convolution scheme; section 4 presents the results; and section 5 concludes the paper.
Work on reducing the convolution overhead can be roughly divided into two categories: weight thinning and network thinning. In weight thinning, the idea is to decrease the number of weights or the bits per weight. Many schemes fit into this category. In weight quantization, the idea is to make better use of computing resources by quantizing the weights into low-bit numbers. In [29], 8-bit parameters were used, and the results show a speedup with a small loss of accuracy. The authors of [8] used a 16-bit fixed-point representation, which results in a significant reduction in memory usage and floating-point operations with comparable classification accuracy. In the extreme case of 1 bit per weight, i.e., binary-weight neural networks, the main idea is to train binary weights directly [4, 5, 20].
The more commonly used technique in weight thinning is to decrease the number of parameters or neurons in each layer. In addition to reducing network complexity, it can also address over-fitting. One effective approach to reducing weights is to analyze the pre-trained weights of a CNN model and remove the non-informative ones. For example, [9] proposed a method to reduce the total number of parameters and operations for the entire network. The authors of [25] searched for redundancy among neurons and proposed a data-free pruning method to remove redundant neurons. The HashedNets model proposed in [3] used a low-cost hash function to group weights into hash buckets for parameter sharing. [30] utilized a structured sparsity regularizer in each layer to reduce trivial filters, channels or even layers. [28] proposed a regularization method based on soft weight-sharing, which included both quantization and pruning in one training procedure. [13] introduced an efficient family of CNNs called MobileNet, which uses depth-wise separable convolutions to drastically reduce the number of computations required and the model size. Extensions to this work have tried to improve the speed and accuracy [23, 14]. Low-rank factorization uses matrix decomposition to estimate the informative parameters of deep CNNs, so that non-informative weights can be removed to reduce the number of parameters and save computation. In [27], the authors proposed a new method for training low-rank constrained CNNs based on the decomposition, and used batch normalization to transform the activations of the internal hidden units. Using the dictionary learning idea, learning separable filters was introduced by [21]. Results in [7], which used low-rank approximation and clustering schemes for the convolutional kernels, show a considerable speedup for a single convolutional layer. The authors of [15] aimed for a speedup in text recognition; they used tensor decomposition schemes, and their results show a small drop in accuracy.
In network thinning, the idea is to compress deep and wide networks into shallower ones. Recent methods have adopted knowledge distillation to reach this goal [1]. [12] introduced a knowledge distillation compression framework, which simplifies the training of deep networks by following a student-teacher paradigm. Despite its simplicity, it achieves promising results in image classification tasks. There are other approaches that use different speedup techniques, such as FFT-based convolutions [19] or fast convolution using the Winograd algorithm [17]. Stochastic spatial sampling pooling is another speedup idea, used in [33], based on the idea of inverse bilateral filters [22]. Because of the complexity of their implementation, these approaches are also not commonly used in practical solutions, except for specific applications.
In both weight and network thinning the total number of weights is reduced, while in the proposed RACNN, conversely, the main focus is avoiding any decrease in the number of weights. Another advantage of RACNN is the simplicity of the implementation. It can be implemented easily as a form of two convolutions.
Recently, content-adaptive filtering has become an active research topic, and content-adaptive filters have been used in several task-specific use cases, such as Monte Carlo rendering denoising [2], motion prediction [32] and semantic segmentation [10]. Dynamic filter networks are an example of content-adaptive filtering techniques [16]: filter weights are themselves directly predicted by a separate network branch and provide custom filters specific to different input data. An extension of this work in [31] also learns from multiple neighboring regions using position-specific kernels. The authors of [6] proposed deformable convolution, which produces position-specific modifications to the filters. Pixel-adaptive convolution, proposed by [26], is another example of content-adaptive convolution, in which filter weights are multiplied with a spatially varying kernel that depends on learnable, local pixel features. In these adaptive approaches, the main goal is usually not to address complexity, whereas this is the main goal of RACNN.
In this section, we first describe the general criteria in our speedup method. Then, we explain our radius-adaptive convolution structure and finally, we discuss other similar adaptive ideas.
In this work, we sought a speed optimization solution that meets four criteria. First, the method, regardless of its speed, should be implementable with standard deep learning libraries such as TensorFlow, PyTorch, and Keras, so that the unoptimized code can run on a standard library. This facilitates training and testing, which will not require any low-level programming. Second, the speed-optimized method should be implementable with general matrix multiplication (GeMM). Deep learning libraries mostly use GeMMs since they are fast and highly optimized. Third, the number of trainable parameters should not be decreased; our goal is to keep the number of neurons at the same level so that the accuracy is not affected. Finally, memory access should also be considered as a speedup factor in the solution. Otherwise, for a typical computer, the speedup may become insignificant.
The convolutional layer is usually the most time-consuming layer of the network. Therefore, our idea is to optimize the convolutional layer algorithmically. Considering the criteria in 3.1, a content-adaptive solution seems reasonable. In a convolutional layer, a set of filters is convolved with the input. Assuming $k \times k$ is the kernel size, $k = 2r + 1$, where $r$ is the radius of the kernel. The input is in the form of a 3D matrix with rows, columns, and depth of $M$, $N$, and $D$. This input has $MN$ pixels, and each pixel has $D$ channels. An image-to-column process can reshape the 3D input data into a 2D form $X$, so that each row of the matrix contains the $k^2$ pixels under the kernel and there are $MN$ rows (see Figure 2).
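The image-to-column reshaping described above, and the single matrix product it enables, can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names `im2col` and `conv_gemm`, the `(rows, columns, depth)` array layout, zero padding, and stride 1 are all assumptions made for the example.

```python
import numpy as np

def im2col(x, k):
    """Reshape an (M, N, D) input into an (M*N, k*k*D) matrix.

    Assumes zero padding of radius r = (k - 1) // 2 and stride 1, so every
    input pixel yields exactly one row.
    """
    r = (k - 1) // 2
    m, n, d = x.shape
    padded = np.pad(x, ((r, r), (r, r), (0, 0)))  # zero padding on rows/columns
    # One shifted view per kernel tap; each view covers all M*N pixels.
    taps = [padded[i:i + m, j:j + n, :] for i in range(k) for j in range(k)]
    # (M, N, k*k, D) -> (M*N, k*k*D): one row per pixel, taps in row-major order.
    return np.stack(taps, axis=2).reshape(m * n, k * k * d)

def conv_gemm(x, w):
    """Apply a (k, k, D, F) kernel to an (M, N, D) input with one GEMM.

    Like most deep learning libraries, this computes a cross-correlation
    (the kernel is not flipped).
    """
    k, _, d, f = w.shape
    m, n, _ = x.shape
    y = im2col(x, k) @ w.reshape(k * k * d, f)
    return y.reshape(m, n, f)
```

With this layout, the middle `D` columns of the $3\times3$ matrix are exactly the rows of the $1\times1$ matrix, which is convenient when both radii are applied to the same input.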
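Therefore, the convolution can be computed as
$$Y = XW \tag{1}$$
where $W$ has $k^2 D$ rows ($D$ being the input depth) and $F$ columns, and $F$ is both the number of filters and the depth of the output $Y$. The kernel radius $r$, and therefore the kernel size $k$, is fixed for all pixels in the convolution. Our idea is to have an adaptive radius. For simplicity, let’s consider the two options $r = 0$ and $r = 1$. Based on the value of each pixel, either $r = 0$ or $r = 1$ will be selected. In that case, the output will be computed as
$$Y[i] = \begin{cases} X_1[i]\,W_1, & \alpha[i] = 0 \\ X_3[i]\,W_3, & \alpha[i] = 1 \end{cases} \tag{2}$$
where $W_1$ and $W_3$ are the $1\times1$ and $3\times3$ convolution kernels with $r = 0$ and $r = 1$, and $X_1$ and $X_3$ are the corresponding image-to-column matrices. Assuming $[\cdot]$ is an indexing operator, $X[i]$ denotes the $i$-th row of the matrix $X$, with $i$ ranging over the pixels. $\alpha[i]$ has a value of either zero or one depending on the radius of the kernel. In (2), there is a matrix multiplication for each row $i$. If we split the input into two matrices $X'_1$ and $X'_3$ based on the radius, we can compute each convolution separately with two matrix multiplications as
$$Y'_1 = X'_1 W_1, \qquad Y'_3 = X'_3 W_3 \tag{3}$$
and the result can be obtained by merging the two outputs as
$$Y[T[j]] = Y'[j], \qquad Y' = \begin{bmatrix} Y'_1 \\ Y'_3 \end{bmatrix} \tag{4}$$
where $T$ is an index mapping table that maps pixels in $Y'_1$ and $Y'_3$ to $Y$. Algorithm 1 shows how to split the input into $X'_1$ and $X'_3$ and obtain $T$ at the same time.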
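The split-and-merge idea of (3) and (4) can be sketched as below. This is only an illustration under stated assumptions, not a transcription of Algorithm 1: the helper names (`im2col`, `split_rows`, `hard_radius_conv`), the array layout, and the use of a single concatenated index table are hypothetical choices. For brevity the full $3\times3$ matrix is built before rows are selected; an optimized version would gather only the rows it needs.

```python
import numpy as np

def im2col(x, k):
    """(M, N, D) -> (M*N, k*k*D), zero padding and stride 1 assumed."""
    r = (k - 1) // 2
    m, n, d = x.shape
    padded = np.pad(x, ((r, r), (r, r), (0, 0)))
    taps = [padded[i:i + m, j:j + n, :] for i in range(k) for j in range(k)]
    return np.stack(taps, axis=2).reshape(m * n, k * k * d)

def split_rows(x1, x3, alpha):
    """Split rows by a binary alpha and build the index table in one pass."""
    idx1 = np.flatnonzero(alpha == 0)      # pixels handled by the 1x1 kernel
    idx3 = np.flatnonzero(alpha != 0)      # pixels handled by the 3x3 kernel
    table = np.concatenate([idx1, idx3])   # row j of the merged output -> pixel
    return x1[idx1], x3[idx3], table

def hard_radius_conv(x, w1, w3, alpha):
    """Hard-decision adaptive convolution: w1 is (D, F), w3 is (3, 3, D, F)."""
    m, n, d = x.shape
    f = w1.shape[-1]
    x1 = x.reshape(m * n, d)
    x3 = im2col(x, 3)
    x1_sub, x3_sub, table = split_rows(x1, x3, np.asarray(alpha).reshape(-1))
    y1_sub = x1_sub @ w1                       # first product in (3)
    y3_sub = x3_sub @ w3.reshape(9 * d, f)     # second product in (3)
    y = np.empty((m * n, f))
    y[table] = np.concatenate([y1_sub, y3_sub])  # merge step, as in (4)
    return y.reshape(m, n, f)
```

Both matrix products operate on fewer rows than the full input, which is where the reduction in operations and memory access comes from.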
For the pixels with $r = 0$, the idea in (3) and (4) results in a considerable reduction in both operations and memory access, since both $X_1$ and $W_1$ are smaller than $X_3$ and $W_3$. Consequently, a significant speedup can be achieved. However, its disadvantage is the hard decision based on the value of $\alpha$. This causes a discontinuity in the output and makes it untrainable for standard gradient-based optimization algorithms. To solve this problem, we propose a soft decision based on a linear combination of both matrices. This modifies (4) to
$$Y[i] = (1 - \alpha[i])\,Y_1[i] + \alpha[i]\,Y_3[i], \qquad 0 \le \alpha[i] \le 1 \tag{5}$$
where $Y_1 = X_1 W_1$ and $Y_3 = X_3 W_3$ are the outputs of the $1\times1$ and $3\times3$ convolutions. For each pixel (or row) $i$, $\alpha[i]$ determines how much each convolution contributes to the output $Y[i]$. In the cases $\alpha = 0$ and $\alpha = 1$, either $Y_1$ or $Y_3$ is selected. The problem, however, arises when $0 < \alpha < 1$. This means both $Y_1$ and $Y_3$ should be computed for that pixel. This not only contradicts the speedup idea but also adds some redundancy, since two convolutions with different radii will be computed for the same pixel. To solve this problem, we propose to make the weights at the center of $W_3$ equal to $W_1$ (see Figure 3).
Let’s consider $W_3$ as a 4D convolution kernel with height, width, depth, and number of filters equal to 3, 3, $D$, and $F$, respectively; thus,
$$W_1 = W_3[1, 1] \tag{6}$$
where $W_3[1, 1]$ points to $W_3$ at row 1 and column 1 (the center, counting from zero), which has a dimension of $D \times F$. By sharing the weights between the two convolutions, we can address the redundancy problem. If we substitute $Y_1 = X_1 W_1$ and $Y_3 = X_3 W_3$ in (5) and use $X_3[i]\,W_3 = X_1[i]\,W_1 + X_3[i]\,W_h$, we get
$$Y[i] = X_1[i]\,W_1 + \alpha[i]\,X_3[i]\,W_h, \qquad W_h = W_3 - W_1 \tag{7}$$
where $W_h$ is a hollow kernel, in other words, a kernel whose weights at the center are zero. Note that $W_1$ should be padded with zeros to have the same size as $W_3$. In (7), the number of operations and memory accesses in the worst-case scenario is equal to that of a normal convolution. The worst-case scenario happens when $\alpha > 0$ for all pixels in the input, so the $3\times3$ convolution should be computed for all pixels. Otherwise, we save some computation time by skipping the $3\times3$ convolution for some of the pixels. In (7), $\alpha$ should be calculated for each pixel depending on the content of the pixel. Consequently, the network should learn and calculate $\alpha$. One way to do this is to use a $1\times1$ convolution, in which a kernel $W_\alpha$ is convolved with the input as
$$\alpha' = X_1 W_\alpha \tag{8}$$
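The effect of sharing the center weights can be checked numerically. The sketch below compares the soft combination in (5) against the shared-weight form in (7) on random data, assuming the reading of the symbols used here (a hollow kernel equal to the $3\times3$ kernel with its center zeroed, and a per-pixel weight in $[0, 1]$ obtained from a clipped $1\times1$ convolution). The function name `weight_sharing_gap` and all shapes are hypothetical.

```python
import numpy as np

def weight_sharing_gap(pixels=6, depth=4, filters=5, seed=0):
    """Largest absolute difference between forms (5) and (7) on random data."""
    rng = np.random.default_rng(seed)
    x3 = rng.standard_normal((pixels, 9 * depth))   # rows of the 3x3 im2col matrix
    x1 = x3[:, 4 * depth:5 * depth]                 # center tap = rows of the 1x1 matrix
    w3 = rng.standard_normal((3, 3, depth, filters))
    w1 = w3[1, 1]                                   # shared center weights, as in (6)
    w_alpha = rng.standard_normal((depth, 1))       # single-filter 1x1 kernel, as in (8)
    alpha = np.clip(x1 @ w_alpha, 0.0, 1.0)         # clipped to [0, 1], shape (pixels, 1)

    w_h = w3.copy()
    w_h[1, 1] = 0.0                                 # hollow kernel: center set to zero

    w3_flat = w3.reshape(9 * depth, filters)
    wh_flat = w_h.reshape(9 * depth, filters)
    soft = (1 - alpha) * (x1 @ w1) + alpha * (x3 @ w3_flat)   # form (5)
    shared = x1 @ w1 + alpha * (x3 @ wh_flat)                 # form (7)
    return float(np.max(np.abs(soft - shared)))
```

The two forms agree up to floating-point rounding, while the second needs the $3\times3$ product only for pixels whose weight is non-zero.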
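The output of the $1\times1$ convolution should be clipped to ensure that the values are between zero and one. Since the output has only one value per pixel, the number of filters in $W_\alpha$ is one. For maximum speedup, our goal is to have the minimum number of matrix multiplications. By observing (8) and the first term in (7), we realize that we can merge the two matrix multiplications into one, since both have the same input, as
$$Y_{1\alpha} = X_1 W_{1\alpha} \tag{9}$$
where $W_{1\alpha} = [W_1, W_\alpha]$ is the column-wise concatenation of $W_1$ and $W_\alpha$, so that $Y_{1\alpha} = [Y_1, \alpha']$. Once we have $\alpha$, which is the clipped $\alpha'$, we find the output as
$$Y[T[j]] = Y_1[T[j]] + \alpha[T[j]]\,X'_3[j]\,W_h \tag{10}$$
where $X'_3$ is the subset of $X_3$ containing only the rows with $\alpha > 0$, $T$ maps these rows back to pixels (rows with $\alpha = 0$ keep $Y = Y_1$), and $W_h$ is a hollow kernel (see Figure 1). The speedup factor depends on the number of pixels with $\alpha = 0$. Let’s consider an example where $\alpha = 0$ for 50% of the pixels. Theoretically, (9) then costs about as much as a $1\times1$ convolution over all pixels, while (10) is evaluated for only half of the pixels, so the total time becomes a fraction of that of a standard convolution. When $\alpha = 1$ for all pixels, radius-adaptive convolution is equivalent to a standard convolution.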
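Putting (9) and (10) together, an unoptimized forward pass might look like the following NumPy sketch. It is an illustration under assumptions rather than the released code: `racnn_forward` and `im2col` are hypothetical names, the input layout is `(rows, columns, depth)`, and the full $3\times3$ matrix is built before selecting the active rows, which a speed-optimized version would avoid.

```python
import numpy as np

def im2col(x, k):
    """(M, N, D) -> (M*N, k*k*D), zero padding and stride 1 assumed."""
    r = (k - 1) // 2
    m, n, d = x.shape
    padded = np.pad(x, ((r, r), (r, r), (0, 0)))
    taps = [padded[i:i + m, j:j + n, :] for i in range(k) for j in range(k)]
    return np.stack(taps, axis=2).reshape(m * n, k * k * d)

def racnn_forward(x, w3, w_alpha):
    """Radius-adaptive convolution for x (M, N, D), w3 (3, 3, D, F), w_alpha (D, 1)."""
    m, n, d = x.shape
    f = w3.shape[-1]
    x1 = x.reshape(m * n, d)

    # (9): one GEMM gives both the 1x1 output and the unclipped alpha.
    w_1a = np.concatenate([w3[1, 1], w_alpha], axis=1)   # shape (D, F + 1)
    y_1a = x1 @ w_1a
    y = y_1a[:, :f].copy()
    alpha = np.clip(y_1a[:, f], 0.0, 1.0)

    # (10): only rows with alpha > 0 go through the hollow 3x3 kernel.
    active = np.flatnonzero(alpha > 0)
    w_h = w3.copy()
    w_h[1, 1] = 0.0
    x3_active = im2col(x, 3)[active]
    y[active] += alpha[active, None] * (x3_active @ w_h.reshape(9 * d, f))
    return y.reshape(m, n, f), alpha.reshape(m, n)
```

When the predicted weight is 1 everywhere, the result coincides with a standard $3\times3$ convolution using `w3`; when it is 0 everywhere, only the $1\times1$ product is evaluated.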
We also analyzed similar content-adaptive ideas, such as a convolution with an adaptive number of filters. Similar to RACNN, we can split the input into two matrices, each of which is multiplied by different weights with different filter sizes. One major problem is the hard decision; however, before solving that, we realized that even in the best-case scenario we could not get a satisfactory speedup. The main reason is that increasing or decreasing the size of filters in $W$ does not affect the speed of the convolution significantly. Due to the number of memory accesses, the bottleneck is usually the big input matrix $X$. Although this feature is a drawback for that idea, it is an advantage for RACNN, because it makes the overhead cost of computing $\alpha$ in (9) minor.
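A rough operation count makes the size of this overhead concrete. The sketch below is a simplified cost model and not a measurement from the paper: it counts multiply-accumulate operations only, ignores memory access (which the text identifies as the real bottleneck), and assumes the hollow kernel is evaluated with 8 of its 9 taps. The function name `racnn_mac_ratio` and its parameters are hypothetical.

```python
def racnn_mac_ratio(depth, filters, active_fraction, hollow_taps=8):
    """Per-pixel MACs of the merged 1x1 product plus the hollow 3x3 product,
    relative to a standard 3x3 convolution.

    active_fraction is the share of pixels with a non-zero alpha.
    hollow_taps=8 assumes the zero center of the hollow kernel is skipped;
    use 9 if the implementation multiplies the zero block as well.
    """
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be in [0, 1]")
    standard = 9 * depth * filters
    merged_1x1 = depth * (filters + 1)   # one extra output column for alpha
    hollow = active_fraction * hollow_taps * depth * filters
    return (merged_1x1 + hollow) / standard
```

Under this model the extra column for the weight adds `depth` operations per pixel, i.e. a share of `1 / filters` on top of the $1\times1$ product, which is small for typical filter counts.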
We considered an object classification network to test the accuracy and speed of RACNN. We selected two well-known image recognition graphs VGG16 [24] and ResNet50 [11] as a base to analyze the accuracy and speed. For these tests we used COCO-2017 [18] training and validation dataset with approximately 850000 and 36000 object images and 80 classes. In the following, we first present the accuracy results then we discuss the speed.
To measure the accuracy of RACNN, our idea is to replace the $3\times3$ convolutions with radius-adaptive ones and examine the results. These modifications led to two new graphs, VGG16-RACNN and ResNet50-RACNN. For VGG16, the computation time of the deep layers is negligible compared to the first layers, due to the small resolution of the images at deep layers. Therefore, accelerating those layers will not affect the total computation time of the network. Thus, for VGG16-RACNN, we set the first 7 convolutions (or the first 3 stages) as radius-adaptive convolutions and kept the others as standard convolutions. For ResNet, however, we replaced all 16 $3\times3$ convolutions, because even though the resolution in deep layers is low, the number of filters is relatively high. Then, we used the COCO-2017 training dataset to train all 4 graphs (i.e., VGG16, ResNet50, VGG16-RACNN and ResNet50-RACNN). We used the Adam optimizer with a learning rate of 1e-4 and 120 iterations. The batch size was set to the maximum possible for our hardware, which was 30 for VGG16 and VGG16-RACNN and 40 for ResNet50 and ResNet50-RACNN. After each iteration, we calculated the classification accuracy for both graphs using both the training and validation datasets. The advantage of RACNN is that the unoptimized code can be implemented easily with standard libraries. For training and accuracy calculation, we used the Keras library in Python. Figures 4 and 5 compare the accuracy of both graphs with and without RACNN. We expect to get similar accuracies since both networks (i.e., standard and adaptive) have a similar architecture and number of parameters. Besides, in the case of $\alpha = 1$, radius-adaptive convolution is equivalent to a standard convolution (see Figure 1). The results also confirm this. Although the networks may not follow the same learning path, they converge to the same accuracy.
Another interesting test is examining the effect of RACNN for only one convolution in the network. For this test, we replaced only the first adaptive layer of VGG16-RACNN with a standard convolution layer and measured the accuracy. Figure 6 compares the results for 120 iterations which also confirms a similarity in the accuracy.
Now that we have confirmed the similarity of accuracies between the proposed and standard models, we need to compare the speed results. Unlike accuracy, speed cannot be tested with high-level programming using standard libraries. In order to implement our custom RACNN design optimally, we needed to use lower-level programming. To have a fair comparison, we also had to implement the standard graphs in the same way. We used Python to implement our code. For matrix multiplication and other general operations, we used NumPy on the CPU and cuBLAS on the GPU. For other customized tasks, such as splitting and merging in (10), we used C++ on the CPU and CUDA on the GPU, and we designed high-level interfaces so that they can be used from Python. Then, we measured the computation time for all graphs using two different CPUs: i7-6700@2.6GHz and i7-8750H@2.2GHz. We ran the object classification network on the first 1000 object images in the COCO-2017 validation dataset and computed the average. Figure 7 compares the average computation time between the graphs with and without RACNN and shows a 23% speed improvement on average for all tests.
We also tested the speed on a GPU, an Nvidia GTX 1050. We observed an insignificant speedup (4%). We believe that with more optimized code for the kernels that handle splitting and merging in (10), we can reach speedups closer to the CPU results.
To analyze the effectiveness of RACNN more deeply, we measured its contribution in each convolution layer for two sample images. The percentage of input pixels with $\alpha = 0$ defines the amount of speedup by RACNN. Figures 8 and 9 show the percentage of $\alpha = 0$ for the sample images, and we highlighted the values above 10%, which include the majority of the layers. By analyzing these numbers, we can possibly adjust the RACNN graph. If there is a consistently low percentage for a specific layer, that layer can be replaced with a standard convolution to avoid the overhead cost of calculating $\alpha$.
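The per-layer analysis described above can be sketched as a small helper. The function names, the dictionary input format, and the default 10% threshold (borrowed from the highlighting in Figures 8 and 9) are assumptions made for illustration; in practice the percentages would be averaged over many images before a layer is reverted.

```python
def zero_alpha_percentages(alpha_by_layer):
    """Map each layer name to the percentage of pixels whose alpha is zero.

    alpha_by_layer: dict of layer name -> iterable of per-pixel alpha values
    (already clipped to [0, 1]).
    """
    result = {}
    for name, alphas in alpha_by_layer.items():
        values = list(alphas)
        zeros = sum(1 for a in values if a == 0)
        result[name] = 100.0 * zeros / len(values)
    return result

def layers_to_revert(alpha_by_layer, threshold=10.0):
    """Layers whose zero-alpha percentage falls below the threshold.

    These are candidates for replacement with a standard convolution,
    since the adaptive path rarely skips the 3x3 computation there.
    """
    percentages = zero_alpha_percentages(alpha_by_layer)
    return [name for name, pct in percentages.items() if pct < threshold]
```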
In this paper, we presented a content-adaptive convolution that can reduce the number of operations and memory accesses in a convolution layer without decreasing the number of trainable parameters. Our proposed radius-adaptive convolution uses different radii based on the content. We implemented RACNN and tested the results on both CPU and GPU. As we expected, the accuracy of RACNN is similar to that of a standard convolution. The results also show a significant speedup for the CPU implementation. The speedup gain on the GPU, however, is lower than on the CPU, and more work needs to be done to make the GPU code as efficient as the CPU code.
International Conference on Machine Learning, pp. 2285–2294.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.