1 Introduction
Convolutional Neural Networks (CNNs) have achieved remarkable success in many computer vision tasks
[16, 15, 45], and their efficiency keeps increasing with recent efforts to reduce the inherent redundancy in dense model parameters [14, 31, 43] and in the channel dimension of feature maps [48, 18, 6, 8]. However, substantial redundancy also exists in the spatial dimension of the feature maps produced by CNNs, where each location stores its own feature descriptor independently, ignoring common information between adjacent locations that could be stored and processed together.
Figure 1: (a) A natural image can be decomposed into a low and a high spatial frequency part. (b) The output maps of a convolution layer can also be factorized and grouped by their spatial frequency. (c) The proposed multi-frequency feature representation stores the smoothly changing, low-frequency maps in a low-resolution tensor to reduce spatial redundancy. (d) The proposed Octave Convolution operates directly on this representation. It updates the information for each group and further enables information exchange between groups.
As shown in Figure 1(a), a natural image can be decomposed into a low spatial frequency component that describes the smoothly changing structure and a high spatial frequency component that describes the rapidly changing fine details [1, 10, 37, 39]. Similarly, we argue that the output feature maps of a convolution layer can also be decomposed into features of different spatial frequencies and propose a novel multi-frequency feature representation which stores high- and low-frequency feature maps into different groups, as shown in Figure 1(b). Thus, the spatial resolution of the low-frequency group can be safely reduced by sharing information between neighboring locations to reduce spatial redundancy, as shown in Figure 1(c). To accommodate the novel feature representation, we generalize the vanilla convolution and propose Octave Convolution (OctConv), which takes in feature maps containing tensors of two frequencies one octave apart, and extracts information directly from the low-frequency maps without the need of decoding them back to high frequency, as shown in Figure 1(d). As a replacement for vanilla convolution, OctConv consumes substantially less memory and computational resources. In addition, OctConv processes low-frequency information with corresponding (low-frequency) convolutions, which effectively enlarges the receptive field in the original pixel space and can thus improve recognition performance.
We design the OctConv in a generic way, making it a plug-and-play replacement for the vanilla convolution. Since OctConv mainly focuses on processing feature maps at multiple spatial frequencies and reducing their spatial redundancy, it is orthogonal and complementary to existing methods that focus on building better CNN topologies [22, 41, 35, 33, 29], reducing channel-wise redundancy in convolutional feature maps [48, 8, 34, 32, 21], and reducing redundancy in dense model parameters [43, 14, 31]. Moreover, different from methods that exploit multi-scale information [4, 44, 12], OctConv can be easily deployed as a plug-and-play unit to replace convolution, without the need of changing network architectures or requiring hyper-parameter tuning. Compared to the closely related Multigrid convolution [25], OctConv achieves both better accuracy and efficiency due to the careful design of inter-frequency information exchange. We further integrate OctConv into a wide variety of backbone architectures (including ones featuring group, depthwise, and 3D convolutions) and demonstrate the universality of OctConv.
Our experiments demonstrate that by simply replacing the vanilla convolution with OctConv, we can consistently improve the performance of popular 2D CNN backbones including ResNet [16, 17], ResNeXt [48], DenseNet [22], MobileNet [18, 34] and SENet [19] on 2D image recognition on ImageNet [11], as well as 3D CNN backbones C2D [45] and I3D [45] on video action recognition on Kinetics [24, 3, 2]. The OctConv-equipped OctResNet152 can match or outperform state-of-the-art manually designed networks [32, 19] at lower memory and computational cost.
Our contributions can be summarized as follows:


We propose to factorize convolutional feature maps into two groups at different spatial frequencies and process them with different convolutions at their corresponding frequency, one octave apart. Since the resolution of the low-frequency maps can be reduced, this saves both storage and computation. It also helps each layer gain a larger receptive field to capture more contextual information.

We design a plug-and-play operation named OctConv to replace the vanilla convolution, operating on the new feature representation directly and reducing spatial redundancy. Importantly, OctConv is fast in practice and achieves a speed-up close to the theoretical limit.

We extensively study the properties of the proposed OctConv on a variety of backbone CNNs for image and video tasks and achieve significant performance gains, even comparable to the best AutoML networks.
2 Related Work
Improving the efficiency of CNNs. Ever since the pioneering work on AlexNet [26] and VGG [35], which achieved astonishing results by stacking a set of convolution layers, researchers have made substantial efforts to improve the efficiency of CNNs. ResNet [16, 17] and DenseNet [22] improve the network topology by adding shortcut connections to early layers, enhancing feature reuse and alleviating optimization difficulties. ResNeXt [48] and ShuffleNet [50] use sparsely connected group convolutions to reduce the redundancy in inter-channel connectivity, making it feasible to adopt deeper or wider networks under the same computational budget. Xception [8] and MobileNet [18, 34] adopt depthwise convolutions that further reduce the connection density. Besides these manually designed networks, researchers have also tried to automatically find the best network topology for a given task. NAS [52], PNAS [29] and AmoebaNet [33] successfully discovered topologies that perform better than manually designed networks. Another stream of work focuses on reducing the redundancy in model parameters: DSD [14] reduces the redundancy in model connections by pruning connections with low weights; ThiNet [31] prunes convolutional filters based on statistics computed from the next layer; HetConv [36] replaces vanilla convolution filters with heterogeneous convolution filters of different sizes. However, all of these methods ignore the redundancy in the spatial dimension of feature maps, which is addressed by the proposed OctConv, making OctConv orthogonal and complementary to the previous methods. Notably, OctConv does not change the connectivity between feature maps, making it also different from Inception-like multi-path designs [41, 40, 48].
Multi-scale Representation Learning.
Prior to the success of deep learning, multi-scale representations had long been applied for local feature extraction, e.g. the SIFT features [30]. In the deep learning era, multi-scale representations also play an important role due to their strong robustness and generalization ability. FPN [27] and PSP [51] merge convolutional features from different depths at the end of the network for object detection and segmentation tasks. MSDNet [20] and HRNets [38] propose carefully designed network architectures that contain multiple branches, where each branch has its own spatial resolution. The bLNet [4] and ELASTICNet [44] adopt a similar idea, but are designed as a replacement of the residual block for ResNet [16, 17] and are thus more flexible and easier to use. However, bLNet [4] and ELASTICNet [44] still require extra expertise and hyper-parameter tuning when adopted in architectures beyond ResNet [16, 17], such as MobileNetV1 [18] or DenseNet [22]. Multigrid CNNs [25] propose a multi-grid pyramid feature representation and define the MGConv operator that can be integrated throughout a network; this is conceptually similar to our method but is motivated by exploiting multi-scale features. Compared with OctConv, MGConv adopts a less efficient design for exchanging inter-frequency information and uses max-pooling for downsampling, which can cause problems such as higher memory cost. We give further discussion in Sec. 3.3.
In a nutshell, OctConv focuses on reducing the spatial redundancy in CNNs and is designed to replace vanilla convolution operations. It can be applied to most existing CNN models directly, without adjusting their architecture. We extensively compare OctConv to closely related methods in the method and experiment sections, and show that OctConv CNNs give top results on a number of challenging benchmarks.
3 Method
In this section, we first introduce the octave feature representation for reducing the spatial redundancy in feature maps and then describe the Octave Convolution that operates directly on it. We also discuss implementation details and show how to integrate OctConv into group and depthwise convolution architectures.
3.1 Octave Feature Representation
For the vanilla convolution, all input and output feature maps have the same spatial resolution. However, the spatial-frequency model [1, 10] argues that a natural image can be factorized into a low-frequency signal that captures the global layout and coarse structure, and a high-frequency part that captures fine details, as shown in Figure 1(a). In an analogous way, we argue that there is a subset of the feature maps that capture spatially low-frequency changes and contain spatially redundant information.
To reduce such spatial redundancy, we introduce the octave feature representation that explicitly factorizes the feature map tensors into groups corresponding to low and high frequencies. The scale-space theory [28] provides us with a principled way of creating scale-spaces of spatial resolutions, and defines an octave as a division of the spatial dimensions by a power of 2 (we only explore $2^1$ in this work). We define the low- and high-frequency spaces in this fashion, by reducing the spatial resolution of the low-frequency feature maps by an octave.
Formally, let $X \in \mathbb{R}^{c \times h \times w}$ denote the input feature tensor of a convolutional layer, where $h$ and $w$ denote the spatial dimensions and $c$ the number of feature maps or channels. We explicitly factorize $X$ along the channel dimension into $X = \{X^H, X^L\}$, where the high-frequency feature maps $X^H \in \mathbb{R}^{(1-\alpha) c \times h \times w}$ capture fine details and the low-frequency maps $X^L \in \mathbb{R}^{\alpha c \times \frac{h}{2} \times \frac{w}{2}}$ vary slower in the spatial dimensions (w.r.t. the image locations). Here $\alpha \in [0, 1]$ denotes the ratio of channels allocated to the low-frequency part, and the low-frequency feature maps are defined an octave lower than the high-frequency ones, at half of the spatial resolution, as shown in Figure 1(c).
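To make the factorization concrete, the following is a minimal numpy sketch (ours, not the paper's released code): it splits the channels into the two frequency groups and halves the resolution of the low-frequency one. The function name `octave_split` and the choice of 2x2 average pooling for the initial downsampling are our illustrative assumptions.

```python
import numpy as np

def octave_split(x, alpha):
    """Factorize a feature tensor x of shape (c, h, w) into a
    high-frequency part (full resolution) and a low-frequency part
    (half resolution), with a fraction `alpha` of the channels
    assigned to the low-frequency group."""
    c, h, w = x.shape
    c_low = int(alpha * c)
    x_high = x[c_low:]                      # (c - c_low, h, w)
    # Reduce spatial redundancy: 2x2 average pooling on the low group.
    x_low = x[:c_low].reshape(c_low, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return x_high, x_low

x = np.random.rand(16, 8, 8)
xh, xl = octave_split(x, alpha=0.25)
print(xh.shape, xl.shape)  # (12, 8, 8) (4, 4, 4)
```

With $\alpha = 0.25$ and 16 channels, 4 channels are stored at half resolution, so they occupy only a quarter of the memory of their full-resolution counterparts.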
In the next subsection, we introduce a convolution operator that operates directly on this multi-frequency feature representation and name it Octave Convolution, or OctConv for short.
3.2 Octave Convolution
The octave feature representation presented in Section 3.1 reduces the spatial redundancy and is more compact than the original representation. However, the vanilla convolution cannot directly operate on such a representation, due to differences in spatial resolution in the input features. A naive way of circumventing this is to upsample the low-frequency part $X^L$ to the original spatial resolution, concatenate it with $X^H$, and then convolve, which would lead to extra costs in computation and memory and diminish all the savings from the compression. In order to fully exploit our compact multi-frequency feature representation, we introduce the Octave Convolution, which can directly operate on factorized tensors $X = \{X^H, X^L\}$ without requiring any extra computational or memory overhead.
Vanilla Convolution. Let $W \in \mathbb{R}^{c \times c \times k \times k}$ denote a $k \times k$ convolution kernel and $X, Y \in \mathbb{R}^{c \times h \times w}$ denote the input and output tensors, respectively. Each feature map $Y_{p,q} \in \mathbb{R}^{c}$ in $Y$ can be computed by

$$Y_{p,q} = \sum_{i,j \in \mathcal{N}_k} W_{i+\lfloor k/2 \rfloor,\, j+\lfloor k/2 \rfloor}^{\top} X_{p+i,\, q+j} \qquad (1)$$

where $(p, q)$ denotes the location coordinate and $\mathcal{N}_k = \{(i, j) : i \in \{-\lfloor k/2 \rfloor, \dots, \lfloor k/2 \rfloor\}, j \in \{-\lfloor k/2 \rfloor, \dots, \lfloor k/2 \rfloor\}\}$ defines a local neighborhood. For simplicity, in all equations we omit the padding, we assume $k$ is an odd number, and that the input and output data have the same dimensionality, i.e. $c_{in} = c_{out} = c$.

Octave Convolution. The goal of our design is to effectively process the low and high frequencies in their corresponding frequency tensor, but also to enable efficient communication between the high- and low-frequency components of our octave feature representation. Let $X, Y$ be the factorized input and output tensors. Then the high- and low-frequency feature maps of the output $Y = \{Y^H, Y^L\}$ will be given by $Y^H = Y^{H \to H} + Y^{L \to H}$ and $Y^L = Y^{L \to L} + Y^{H \to L}$, respectively, where $Y^{A \to B}$ denotes the convolutional update from feature map group $A$ to group $B$. Specifically, $Y^{H \to H}$ and $Y^{L \to L}$ denote intra-frequency information updates, while $Y^{H \to L}$ and $Y^{L \to H}$ denote inter-frequency communication.
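For reference, the vanilla convolution of Eq. (1) can be transcribed into a short, deliberately unoptimized numpy sketch (our illustration; `vanilla_conv` is a hypothetical helper, not an API from the paper, and zero padding stands in for the omitted padding term):

```python
import numpy as np

def vanilla_conv(x, w):
    """Direct transcription of Eq. (1): x has shape (c_in, h, w_),
    the kernel w has shape (c_out, c_in, k, k) with k odd, and
    zero padding keeps the output at the same spatial size."""
    c_out, c_in, k, _ = w.shape
    c, h, w_ = x.shape
    r = k // 2                                   # r = floor(k/2)
    xp = np.pad(x, ((0, 0), (r, r), (r, r)))
    y = np.zeros((c_out, h, w_))
    for p in range(h):
        for q in range(w_):
            # Local neighborhood N_k around (p, q); indices are
            # shifted by r into the padded tensor.
            patch = xp[:, p:p + k, q:q + k]      # (c_in, k, k)
            y[:, p, q] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return y

x = np.random.rand(3, 5, 5)
w = np.random.rand(4, 3, 3, 3)
print(vanilla_conv(x, w).shape)  # (4, 5, 5)
```

A quick sanity check: a kernel that is 1 at its center for matching in/out channels and 0 elsewhere reproduces the input exactly.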
To compute these terms, we split the convolutional kernel $W$ into two components $W = [W^H, W^L]$ responsible for convolving with $X^H$ and $X^L$ respectively. Each component can be further divided into intra- and inter-frequency parts: $W^H = [W^{H \to H}, W^{L \to H}]$ and $W^L = [W^{L \to L}, W^{H \to L}]$, with the parameter tensor shapes shown in Figure 2(b). Specifically, for the high-frequency feature map, we compute it at location $(p, q)$ by using a regular convolution for the intra-frequency update; for the inter-frequency communication we can fold the upsampling over the feature tensor $X^L$ into the convolution, removing the need of explicitly computing and storing the upsampled feature maps, as follows:
$$Y^H_{p,q} = Y^{H \to H}_{p,q} + Y^{L \to H}_{p,q} = \sum_{i,j \in \mathcal{N}_k} (W^{H \to H}_{i+\lfloor k/2 \rfloor,\, j+\lfloor k/2 \rfloor})^{\top} X^H_{p+i,\, q+j} + \sum_{i,j \in \mathcal{N}_k} (W^{L \to H}_{i+\lfloor k/2 \rfloor,\, j+\lfloor k/2 \rfloor})^{\top} X^L_{\lfloor p/2 \rfloor + i,\, \lfloor q/2 \rfloor + j} \qquad (2)$$
where $\lfloor \cdot \rfloor$ denotes the floor operation. Similarly, for the low-frequency feature map, we compute the intra-frequency update using a regular convolution. Note that, as the low-frequency map $X^L$ is one octave lower, the convolution is also low-frequency w.r.t. the high-frequency coordinate space. For the inter-frequency communication we can again fold the downsampling of the feature tensor $X^H$ into the convolution, as follows:
$$Y^L_{p,q} = Y^{L \to L}_{p,q} + Y^{H \to L}_{p,q} = \sum_{i,j \in \mathcal{N}_k} (W^{L \to L}_{i+\lfloor k/2 \rfloor,\, j+\lfloor k/2 \rfloor})^{\top} X^L_{p+i,\, q+j} + \sum_{i,j \in \mathcal{N}_k} (W^{H \to L}_{i+\lfloor k/2 \rfloor,\, j+\lfloor k/2 \rfloor})^{\top} X^H_{2p+0.5+i,\, 2q+0.5+j} \qquad (3)$$
where multiplying a factor of 2 to the locations $(p, q)$ performs downsampling, and further shifting the location by half a step ensures the downsampled maps are well aligned with the input. However, since the index of $X^H$ can only be an integer, we could either round the index to $(2p+i, 2q+j)$ or approximate the value at $(2p+0.5+i, 2q+0.5+j)$ by averaging all 4 adjacent locations. The first option is also known as strided convolution and the second as average pooling. As we discuss in Section 3.3 and Fig. 3, strided convolution leads to misalignment; we therefore use average pooling to approximate this value for the rest of the paper.

An interesting and useful property of the Octave Convolution is the larger receptive field for the low-frequency feature maps. Convolving the low-frequency part $X^L$ with $k \times k$ convolution kernels results in an effective enlargement of the receptive field by a factor of 2 compared to vanilla convolutions. This further helps each OctConv layer capture more contextual information from distant locations and can potentially improve recognition performance.
3.3 Implementation Details
As discussed in the previous subsection, the index of $X^H$ has to be an integer for Eq. (3). Instead of rounding it to $(2p+i, 2q+j)$, i.e. conducting the convolution with stride 2 for downsampling, we adopt average pooling to get a more accurate approximation. This helps alleviate misalignments that appear when aggregating information from different scales [9], as shown in Figure 3 and validated in Table 3. We can now rewrite the output of the Octave Convolution, using average pooling for downsampling, as:
$$Y^H = f(X^H; W^{H \to H}) + \mathrm{upsample}\big(f(X^L; W^{L \to H}),\, 2\big)$$
$$Y^L = f(X^L; W^{L \to L}) + f\big(\mathrm{pool}(X^H, 2);\, W^{H \to L}\big) \qquad (4)$$

where $f(X; W)$ denotes a convolution with parameters $W$, $\mathrm{pool}(X, k)$ is an average pooling operation with kernel size $k \times k$ and stride $k$, and $\mathrm{upsample}(X, k)$ is an upsampling operation by a factor of $k$ via nearest interpolation.
The details of the OctConv operator implementation are shown in Figure 2. It consists of four computation paths that correspond to the four terms in Eq. (4): two green paths correspond to information updating for the high- and low-frequency feature maps, and two red paths facilitate information exchange between the two octaves.
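The four paths of Eq. (4) can be sketched in a few lines of numpy (our simplification: 1x1 convolutions stand in for f(X; W) to keep the sketch short, and all function names are ours, not from any released implementation):

```python
import numpy as np

def conv1x1(x, w):
    """f(X; W) with a 1x1 kernel: w has shape (c_out, c_in)."""
    return np.einsum('oc,chw->ohw', w, x)

def avg_pool2(x):
    """pool(X, 2): 2x2 average pooling with stride 2."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """upsample(X, 2) via nearest interpolation."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def octconv(xh, xl, w_hh, w_lh, w_ll, w_hl):
    """Eq. (4): two intra-frequency updates (green paths) plus
    two inter-frequency exchanges (red paths)."""
    yh = conv1x1(xh, w_hh) + upsample2(conv1x1(xl, w_lh))
    yl = conv1x1(xl, w_ll) + conv1x1(avg_pool2(xh), w_hl)
    return yh, yl

xh, xl = np.random.rand(12, 8, 8), np.random.rand(4, 4, 4)
yh, yl = octconv(xh, xl, np.random.rand(12, 12), np.random.rand(12, 4),
                 np.random.rand(4, 4), np.random.rand(4, 12))
print(yh.shape, yl.shape)  # (12, 8, 8) (4, 4, 4)
```

Note that the low-to-high path convolves at low resolution and only then upsamples, which is what keeps the savings of the compact representation.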
Group and Depthwise convolutions. The Octave Convolution can also be adapted to other popular variants of the vanilla convolution, such as group [48] or depthwise [18] convolutions. For the group convolution case, we simply set all four convolution operations inside the OctConv design to group convolutions. Similarly, for the depthwise convolution case, all convolution operations are depthwise, and therefore the information exchange paths are eliminated, leaving only two depthwise convolution operations. We note that both group OctConv and depthwise OctConv reduce to their respective vanilla versions if we do not compress the low-frequency part.
ratio (α)  .0  .125  .25  .50  .75  .875  1.0
#FLOPs Cost  100%  82%  67%  44%  30%  26%  25%
Memory Cost  100%  91%  81%  63%  44%  34%  25%
Efficiency analysis. Table 1 shows the theoretical computational cost and memory consumption of OctConv over the vanilla convolution and vanilla feature map representation. More information on deriving the theoretical gains presented in Table 1 can be found in the supplementary material. We note the theoretical gains are calculated per convolutional layer. In Section 4 we present the corresponding practical gains on real scenarios and show that our OctConv implementation can sufficiently approximate the theoretical numbers.
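As a sketch of how such per-layer numbers can be derived (our summary of the cost model, assuming each of the four paths costs in proportion to its output resolution and its input/output channel fractions; not verbatim from the supplementary material):

```python
def flops_ratio(alpha):
    """Relative FLOPs of one OctConv layer vs. vanilla convolution.
    H->H costs (1-a)^2 at full resolution; L->L, H->L and L->H all
    run at the low resolution, i.e. at 1/4 of the spatial cost."""
    return (1 - alpha) ** 2 + (alpha ** 2 + 2 * alpha * (1 - alpha)) / 4

def memory_ratio(alpha):
    """Relative feature-map storage: the low-frequency group keeps
    a fraction `alpha` of the channels at 1/4 of the pixels."""
    return (1 - alpha) + alpha / 4

for a in (0, 0.125, 0.25, 0.5, 0.75, 0.875, 1.0):
    print(f"alpha={a}: flops={flops_ratio(a):.0%}, memory={memory_ratio(a):.0%}")
```

Under this model the cost falls monotonically with $\alpha$ and bottoms out at 25% when all channels are low-frequency.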
Integrating OctConv into backbone networks. OctConv is backwards compatible with vanilla convolution and can be inserted into regular convolutional networks without special adjustment. To convert a vanilla feature representation to a multi-frequency feature representation, at the first OctConv layer we set $\alpha_{in} = 0$ and $\alpha_{out} = \alpha$. In this case, the OctConv paths related to the low-frequency input are disabled, resulting in a simplified version with only two paths. To convert the multi-frequency feature representation back to the vanilla feature representation, at the last OctConv layer we set $\alpha_{out} = 0$. In this case, the OctConv paths related to the low-frequency output are disabled, resulting in a single full-resolution output.
Comparison to Multigrid Convolution [25]. The multigrid convolution (MGConv) [25] is a bidirectional, cross-scale convolution operator that can be integrated throughout a CNN, conceptually similar to our OctConv. The core difference is the design of the operator, stemming from the different motivation behind each design, which leads to a significant performance difference. MGConv aims to exploit multi-scale information, while OctConv is optimized for reducing spatial redundancy. Specifically, MGConv and our OctConv rely on different down- and upsampling strategies for the information exchange between features at different scales, which are critical for efficiency and performance (see Table 1 and Sec. 4). In detail, first, MGConv relies on max-pooling to extract low-frequency features from the high-frequency ones, which requires extra memory to store the index of the maximum value during training. In contrast, OctConv adopts average pooling for distilling low-frequency features from the high-frequency ones, which may be better suited for downsampling the feature maps and does not require extra memory. Average pooling also avoids potential misalignment problems and takes all features from the high-frequency group into account. Second, for upsampling, i.e. the lateral path from low resolution to high resolution, MGConv first upsamples and then convolves the feature map. In contrast, OctConv performs upsampling after the convolution, which is more efficient than MGConv. This meticulous design of the lateral paths is essential for OctConv to consume substantially less memory and computation than MGConv and to improve accuracy without increasing network complexity. We compare OctConv with MGConv in Table 6.
4 Experimental Evaluation
In this section, we validate the effectiveness and efficiency of the proposed Octave Convolution for both 2D and 3D networks. We first present ablation studies for image classification on ImageNet [11] and compare with the state-of-the-art. Then, we show that the proposed OctConv also works in 3D CNNs, using the Kinetics400 [24, 3] and Kinetics600 [2] datasets. The best results per category/block are highlighted in bold font throughout the paper.
4.1 Experimental Setups
Image classification. We examine OctConv on a set of the most popular CNNs [18, 34, 16, 17, 22, 48, 19] by replacing the regular convolutions with OctConv (except the first convolutional layer before the max pooling). The resulting networks only have one global hyper-parameter $\alpha$, which denotes the ratio of the low-frequency part. We make apples-to-apples comparisons and reproduce all baseline methods ourselves under the same training/testing settings for the internal ablation studies. All networks are trained with the naïve softmax cross-entropy loss, except that MobileNetV2 also adopts label smoothing [40], and the best ResNet152 adopts both label smoothing and mixup [49] to prevent overfitting. As in [4], all networks are trained from scratch and optimized by SGD with a cosine learning rate schedule [13]. Standard single-central-crop accuracy [16, 17, 48, 4, 44] on the validation set is reported.
Video action recognition. We use both Kinetics400 [24, 3] and Kinetics600 [2] for human action recognition. We choose standard baseline backbones from the Inflated 3D ConvNet work [45] and compare them with their OctConv counterparts. We follow the setting from [46], using a frame length of 8 as the standard input size and training for 300k iterations in total. To make a fair comparison, we report the performance of the baseline and OctConv under precisely the same settings. For inference, we average the predictions over 30 crops (3 spatial crops (left, center, right) × 10 crops along the temporal dimension), again following prior work [45].
4.2 Ablation Study on ImageNet
We conduct a series of ablation studies aiming to answer the following questions: 1) Does OctConv have a better FLOPs-accuracy trade-off than vanilla convolution? 2) In which situations does OctConv work best?
Results on ResNet50. We begin with the popular ResNet50 [17] as the baseline CNN, replacing its regular convolutions with the proposed OctConv to examine the FLOPs-accuracy trade-off. In particular, we vary the global ratio $\alpha$ to compare image classification accuracy versus computational cost (i.e. FLOPs) [16, 17, 48, 7] with the baseline. The results are shown in Figure 4 in pink.
We make the following observations. 1) The FLOPs-accuracy trade-off curve is a concave curve, where the accuracy first rises up and then slowly goes down. 2) We can see two sweet spots: the first at $\alpha = 0.5$, where the network gets similar or better results even when the FLOPs are reduced by about half; the second at $\alpha = 0.125$, where the network reaches its best accuracy, 1.2% higher than the baseline (black circle). We attribute the increase in accuracy to OctConv's effective design of multi-frequency processing and the corresponding enlarged receptive field, which provides more contextual information to the network. After reaching the accuracy peak at $\alpha = 0.125$, the accuracy does not suddenly drop but decreases slowly for higher ratios, indicating that reducing the resolution of the low-frequency part does not lead to significant information loss. Interestingly, 75% of the feature maps can be compressed to half the resolution with only a 0.4% accuracy drop, which demonstrates the effectiveness of grouping and compressing the smoothly changing feature maps for reducing the spatial redundancy in CNNs. In Table 2 we demonstrate that the theoretical FLOPs saving of OctConv is also reflected in actual CPU inference time in practice. For ResNet50, we come close to the theoretical FLOPs speed-up. These results indicate that OctConv is able to deliver important practical benefits, rather than only saving FLOPs in theory.
ratio (α)  Top1 (%)  #FLOPs (G)  Inference Time (ms)  Backend 
N/A  77.0  4.1  119  MKLDNN 
N/A  77.0  4.1  115  TVM 
.125  78.2  3.6  116  TVM 
.25  78.0  3.1  99  TVM 
.5  77.3  2.4  74  TVM 
.75  76.6  1.9  61  TVM 
Results on more CNNs. To further examine whether the proposed OctConv works for other networks with different depth/width/topology, we select the currently most popular networks as baselines and repeat the same ablation study. These networks are ResNet(26; 50; 101; 200) [17], ResNeXt(50, 32×4d; 101, 32×4d) [48], DenseNet121 [22] and SEResNet50 [19]. ResNeXt is chosen for assessing OctConv on group convolution, while SENet [19] is used to check if the gain of the SE block found on vanilla-convolution-based networks can also be seen with OctConv. As shown in Figure 4, OctConv-equipped networks of different architectures behave similarly to OctResNet50: the FLOPs-accuracy trade-off is a concave curve and the performance peak also appears at a small ratio $\alpha$. The consistent performance gain on a variety of backbone CNNs confirms that OctConv is a good replacement for vanilla convolution.
Method  Downsampling  Low→High  High→Low  Top1 (%)  

strided conv.  ✓  ✓  76.3  
avg. pooling  76.0  
avg. pooling  ✓  76.4  
avg. pooling  ✓  76.4  
avg. pooling  ✓  ✓  77.3 
ratio (α)  Testing Scale (small → large)  
N/A  77.2  78.6  78.7  78.7  78.3  77.6  76.7  75.8 
.5  +0.7  +0.7  +0.9  +0.9  +0.8  +1.0  +1.1  +1.2 
Besides, we also have some intriguing findings: 1) OctConv can help CNNs improve accuracy while decreasing FLOPs, deviating from other methods that reduce FLOPs at the cost of lower accuracy. 2) At test time, the gain of OctConv over the baseline models increases as the test image resolution grows, because OctConv can detect large objects better due to its larger receptive field, as shown in Table 4. 3) Both information exchange paths are important, since removing either of them leads to an accuracy drop, as shown in Table 3. 4) Shallow networks, e.g. ResNet26, have a rather limited receptive field and benefit especially from OctConv, which greatly enlarges their receptive field.
4.3 Comparing with SOTAs on ImageNet
Method  ratio (α)  #Params (M)  #FLOPs (M)  CPU (ms)  Top1 (%) 
CondenseNet () [21]    2.9  274    71.0 
1.5 ShuffleNet (v1) [50]    3.4  292    71.5 
1.5 ShuffleNet (v2) [32]    3.5  299    72.6 
0.75 MobileNet (v1) [18]    2.6  325  13.4  70.3 
0.75 OctMobileNet (v1) (ours)  .375  2.6  213  11.9  70.6 
1.0 OctMobileNet (v1) (ours)  .5  4.2  321  18.4  72.4 
1.0 MobileNet (v2) [34]    3.5  300  24.5  72.0 
1.0 OctMobileNet (v2) (ours)  .375  3.5  256  17.1  72.0 
1.125 OctMobileNet (v2) (ours)  .5  4.2  295  26.3  73.0 
Method  ratio (α)  Depth  #Params (M)  #FLOPs (G)  Top1 (%) 
RMG34 [25]    34  32.9  5.8  75.5 
OctResNet26 (ours)  .25  26  16.0  1.9  75.9 
OctResNet50 (ours)  .5  50  25.6  2.4  77.3 
ResNeXt50 + Elastic [44]    50  25.2  4.2  78.4 
OctResNeXt50 (32×4d) (ours)  .25  50  25.0  3.2  78.7 
ResNeXt101 + Elastic [44]    101  44.3  7.9  79.2 
OctResNeXt101 (32×4d) (ours)  .25  101  44.2  5.7  79.5 
bLResNet50 () [4]    50 (+3)  26.2  2.5  76.9 
OctResNet50 (ours)  .5  50 (+3)  25.6  2.5  77.7 
OctResNet50 (ours)  .5  50  25.6  2.4  77.3 
bLResNeXt50 (32×4d) [4]    50 (+3)  26.2  3.0  78.4 
OctResNeXt50 (32×4d) (ours)  .5  50 (+3)  25.1  2.7  78.6 
OctResNeXt50 (32×4d) (ours)  .5  50  25.0  2.4  78.3 
bLResNeXt101 (32×4d) [4]    101 (+1)  43.4  4.1  78.9 
OctResNeXt101 (32×4d) (ours)  .5  101 (+1)  40.1  4.2  79.3 
OctResNeXt101 (32×4d) (ours)  .5  101 (+1)  44.2  4.2  79.1 
OctResNeXt101 (32×4d) (ours)  .5  101  44.2  4.0  78.9 
Method  #Params (M)  Training  Testing ()  Testing ( / )  
Input Size  Memory Cost (MB)  Speed (im/s)  #FLOPs (G)  Top1 (%)  Top5 (%)  #FLOPs (G)  Top1 (%)  Top5 (%)  
NASNetA (N=6, F=168) [52]  88.9  /  43        23.8  82.7  96.2  
AmoebaNetA (N=6, F=190) [33]  86.7  47        23.1  82.8  96.1  
PNASNet5 (N=4, F=216) [29]  86.1  38        25.0  82.9  96.2  
SqueezeExciteNet [19]  115.1  43        42.3  83.1  96.4  
AmoebaNetA (N=6, F=448) [33]  469  15        104  83.9  96.6  
DualPathNet131 [7]  79.5  31,844  83  16.0  80.1  94.9  32.0  81.5  95.8  
SEShuffleNet v2164 [32]  69.9  70  12.7  81.4          
SqueezeExciteNet [19]  115.1  28,696  78  21  81.3  95.5  42.3  82.7  96.2  
OctResNet152, (ours)  60.2  15,566  162  10.9  81.4  95.4  22.2  82.3  96.0  
OctResNet152 + SE, (ours)  66.8  21,885  95  10.9  81.6  95.7  22.2  82.9  96.3  
Small models. We adopt the most popular lightweight networks as baselines and examine whether OctConv works well on these compact networks with depthwise convolutions. In particular, we use the "0.75 MobileNet (v1)" [18] and "1.0 MobileNet (v2)" [34] as baselines and replace the regular convolution with our proposed OctConv. The results are shown in Table 5. We find that OctConv can reduce the FLOPs of MobileNetV1 by about 34% (325M to 213M), providing better accuracy and faster speed in practice; it reduces the FLOPs of MobileNetV2 by 15%, achieving the same accuracy with faster speed. When the computation budget is fixed, one can adopt wider models to increase the learning capacity, because OctConv compensates for the extra computational cost. In particular, our OctConv-equipped networks achieve a 2.1% improvement on MobileNetV1 under the same FLOPs and a 1% improvement on MobileNetV2.
Medium models. In the above experiments, we have shown that OctConv is complementary to a set of state-of-the-art CNNs [16, 17, 48, 22, 18, 34, 19]. In this part, we compare OctConv with MGConv [25], Elastic [44] and bLNet [4], which share a similar idea with our method. Six groups of results are shown in Table 6. In group 1, our OctResNet26 shows better accuracy than RMG34 while costing only one third of the FLOPs and half of the parameters. Our OctResNet50, which costs less than half the FLOPs, also achieves higher accuracy than RMG34. In group 2, our OctResNeXt50 achieves better accuracy than the Elastic-based method (78.7% vs. 78.4%) while reducing the computational cost by 31%. In group 3, the OctResNeXt101 also achieves higher accuracy than the Elastic-based method (79.5% vs. 79.2%) while costing 38% less computation. When compared to the bLNet [4], OctConv-equipped methods achieve a better FLOPs-accuracy trade-off without bells and whistles. When adopting the tricks used in the baseline bLNet [4], our OctResNet50 achieves 0.8% higher accuracy than bLResNet50 under the same computational budget (group 4), and OctResNeXt50 (group 5) and OctResNeXt101 (group 6) get better accuracy under comparable or even lower computational budgets. This is because MGConv [25], ElasticNet [44] and bLNet [4] are designed following the principle of introducing multi-scale features, without considering reducing the spatial redundancy. In contrast, OctConv is designed to address the high spatial redundancy in CNNs, uses more efficient strategies to store and process information throughout the network, and can thus achieve better efficiency and performance.
Large models. Table 7 shows the results of OctConv in large models. Here, we choose ResNet152 as the backbone CNN, replacing the first convolution by three convolution layers and replacing the max pooling by a lightweight residual block [4]. We report results for OctResNet152 with and without the SE block [19]. As can be seen, our OctResNet152 achieves accuracy comparable to the best manually designed networks at lower FLOPs (10.9G vs. 12.7G). Since our model does not use group or depthwise convolutions, it also requires significantly less GPU memory and runs faster in practice compared to SEShuffleNet v2164 and AmoebaNetA (N=6, F=190), which have low FLOPs in theory but run slowly in practice due to the use of group and depthwise convolutions. Our proposed method is also complementary to Squeeze-and-Excitation [19]: the accuracy is further boosted when the SE block is added (last row).
4.4 Experiments of Video Recognition on Kinetics
In this subsection, we evaluate the effectiveness of OctConv for action recognition in videos and demonstrate that our spatial OctConv is sufficiently generic to be integrated into 3D convolutions, decreasing FLOPs and increasing accuracy at the same time. As shown in Table 8, OctConv consistently decreases FLOPs while improving accuracy when added to C2D and I3D [45, 46], and is also complementary to the Nonlocal building block [45]. This is observed for models pretrained on ImageNet [11] as well as models trained from scratch on Kinetics.
Specifically, we first investigate the behavior of training OctConv-equipped I3D models from scratch on Kinetics. We use a learning rate 10× larger than the standard one and train 16× longer than the fine-tuning schedule for better convergence. Compared to the vanilla I3D model, Oct-I3D achieves 1.0% higher accuracy with 91% of the FLOPs.
We then explore the behavior of fine-tuning an ImageNet-pretrained OctConv model with a step-wise learning rate schedule. For this, we train an OctConv ResNet-50 model [16] on ImageNet [11] and then inflate it into a network with 3D convolutions [42] (over space and time) using the I3D technique [3]. After inflation, we fine-tune the inflated OctConv model on Kinetics-400 following the schedule described in [46]. Compared to the 71.9% Top-1 accuracy of the C2D baseline on the Kinetics-400 validation set, the OctConv counterpart achieves 73.8% accuracy using 90% of the FLOPs. For I3D, adding OctConv improves accuracy from 73.3% to 74.6% while using only 91% of the FLOPs. The gain remains consistent when adding Non-local blocks [45]. Finally, we repeat the I3D experiment on the Kinetics-600 [2] dataset with consistent findings, which further confirms the effectiveness of our method.
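The inflation step described above can be sketched as follows: a minimal NumPy version of the I3D bootstrapping trick, where the helper name `inflate_2d_to_3d` is our own (the released implementation may differ in details).

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate 2D conv weights of shape (out, in, kh, kw) into 3D weights
    of shape (out, in, t, kh, kw): replicate along the new temporal axis
    and rescale by 1/t, so that on a temporally constant input the 3D
    network reproduces the activations of the original 2D network."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

# Sanity check: summing the t temporal taps recovers the 2D kernel.
w2d = np.random.randn(64, 3, 7, 7)
w3d = inflate_2d_to_3d(w2d, t=5)
assert w3d.shape == (64, 3, 5, 7, 7)
assert np.allclose(w3d.sum(axis=2), w2d)
```

Because each temporal tap carries weight 1/t, a video that repeats the same frame t times yields the same response as the 2D network on that frame, which is what makes the ImageNet-pretrained weights a sensible initialization.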
Method  ImageNet Pretrain  #FLOPs (G)  Top-1 (%)
(a) Kinetics-400 [3]
I3D    28.1  72.6
Oct-I3D, α=0.1 (ours)    25.6  73.6 (+1.0)
Oct-I3D, α=0.2 (ours)    22.1  73.1 (+0.5)
Oct-I3D, α=0.5 (ours)    15.3  72.1 (−0.5)
C2D  ✓  19.3  71.9
Oct-C2D, α=0.1 (ours)  ✓  17.4  73.8 (+1.9)
I3D  ✓  28.1  73.3
Oct-I3D, α=0.1 (ours)  ✓  25.6  74.6 (+1.3)
I3D + Non-local  ✓  33.3  74.7
Oct-I3D + Non-local, α=0.1 (ours)  ✓  28.9  75.7 (+1.0)
(b) Kinetics-600 [2]
I3D  ✓  28.1  74.3
Oct-I3D, α=0.1 (ours)  ✓  25.6  76.0 (+1.7)
5 Conclusion
In this work, we address the problem of reducing the spatial redundancy that widely exists in vanilla CNN models and propose a novel Octave Convolution operation that stores and processes low- and high-frequency features separately to improve model efficiency. Octave Convolution is sufficiently generic to replace the regular convolution operation in place and can be used in most 2D and 3D CNNs without architectural adjustment. Beyond saving a substantial amount of computation and memory, Octave Convolution can also improve recognition performance through effective communication between the low- and high-frequency groups and by enlarging the receptive field, which helps capture more global information. Our extensive experiments on image classification and video action recognition confirm the superiority of our method in striking a much better trade-off between recognition performance and model efficiency, not only in FLOPs but also in practice.
Acknowledgement
We would like to thank Min Lin and Xin Zhao for helpful discussions during the code development.
References
 [1] F. W. Campbell and J. Robson. Application of Fourier analysis to the visibility of gratings. The Journal of Physiology, 197(3):551–566, 1968.
 [2] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.

 [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
 [4] C.-F. Chen, Q. Fan, N. Mallinar, T. Sercu, and R. Feris. Big-little net: An efficient multi-scale feature representation for visual and speech recognition. In Proceedings of the Seventh International Conference on Learning Representations, 2019.
 [5] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, 2018.
 [6] Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. Multi-fiber networks for video recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 352–367, 2018.
 [7] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In Advances in Neural Information Processing Systems, pages 4467–4475, 2017.
 [8] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
 [9] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
 [10] R. L. De Valois and K. K. De Valois. Spatial vision. Oxford psychology series, No. 14., 1988.
 [11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
 [12] C. Feichtenhofer, H. Fan, J. Malik, and K. He. Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982, 2018.
 [13] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [14] S. Han, J. Pool, S. Narang, H. Mao, E. Gong, S. Tang, E. Elsen, P. Vajda, M. Paluri, J. Tran, et al. DSD: Dense-sparse-dense training for deep neural networks. arXiv preprint arXiv:1607.04381, 2016.
 [15] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask rcnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [17] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
 [18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [19] J. Hu, L. Shen, and G. Sun. Squeezeandexcitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
 [20] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multiscale dense networks for resource efficient image classification. ICLR, 2018.
 [21] G. Huang, S. Liu, L. Van der Maaten, and K. Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2752–2761, 2018.
 [22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
 [23] Intel. Math kernel library for deep neural networks (mkldnn). https://github.com/intel/mkldnn/tree/7de7e5d02bf687f971e7668963649728356e0c20, 2018.
 [24] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
 [25] T.W. Ke, M. Maire, and S. X. Yu. Multigrid neural architectures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6665–6673, 2017.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [27] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
 [28] T. Lindeberg. Scale-space theory in computer vision, volume 256. Springer Science & Business Media, 2013.
 [29] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV), pages 19–34, 2018.
 [30] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
 [31] J.-H. Luo, H. Zhang, H.-Y. Zhou, C.-W. Xie, J. Wu, and W. Lin. Thinet: Pruning CNN filters for a thinner net. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [32] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.

 [33] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
 [34] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
 [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [36] P. Singh, V. K. Verma, P. Rai, and V. P. Namboodiri. Hetconv: Heterogeneous kernel-based convolutions for deep CNNs. arXiv preprint arXiv:1903.04120, 2019.
 [37] S. Mallat. A wavelet tour of signal processing. Academic Press, 1999.

 [38] K. Sun, B. Xiao, D. Liu, and J. Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
 [39] W. Sweldens. The lifting scheme: A construction of second generation wavelets. SIAM Journal on Mathematical Analysis, 29(2):511–546, 1998.

 [40] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [41] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [42] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
 [43] F. Tung and G. Mori. Clipq: Deep network compression learning by inparallel pruningquantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7873–7882, 2018.
 [44] H. Wang, A. Kembhavi, A. Farhadi, A. Yuille, and M. Rastegari. Elastic: Improving CNNs with instance-specific scaling policies. arXiv preprint arXiv:1812.05262, 2018.
 [45] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 [46] X. Wang, R. Girshick, A. Gupta, and K. He. https://github.com/facebookresearch/videononlocalnet, 2018.
 [47] S. W. Williams. Autotuning performance on multicore computers. University of California, Berkeley, 2008.
 [48] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1492–1500, 2017.
 [49] H. Zhang, M. Cisse, Y. N. Dauphin, and D. LopezPaz. mixup: Beyond empirical risk minimization. Proceedings of the Sixth International Conference on Learning Representations, 2018.
 [50] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6848–6856, 2018.
 [51] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
 [52] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8697–8710, 2018.
1 Relative Theoretical Gains of OctConv
In Table 1 of the main paper, we reported the relative theoretical gains of the proposed multi-frequency feature representation over the regular feature representation with respect to memory footprint and computational cost, measured in FLOPS (multiplications and additions). In this section, we show how these gains are estimated in theory.
Memory cost. The proposed OctConv stores the feature representation in the multi-frequency form shown in Figure 5, where the low-frequency tensor is kept at half the spatial resolution in each dimension and thus costs only a quarter of the space of a full-resolution map. The relative memory cost depends on the ratio α of low-frequency channels and is calculated by

((1−α)·h·w·c + α·(h/2)·(w/2)·c) / (h·w·c) = 1 − (3/4)·α.   (5)
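The relative memory cost 1 − (3/4)·α can be evaluated with a one-line helper (illustrative code, not part of the released implementation):

```python
def oct_memory_ratio(alpha):
    """Relative memory cost of the multi-frequency representation:
    a fraction alpha of the channels is stored at half resolution in
    each spatial dimension, i.e. at 1/4 of the area."""
    return (1 - alpha) + alpha / 4  # equals 1 - 0.75 * alpha

# e.g. alpha = 0.5 keeps 62.5% of the memory of the vanilla representation
assert abs(oct_memory_ratio(0.5) - 0.625) < 1e-12
```

At the extremes, α = 0 recovers the vanilla representation (ratio 1) and α = 1 stores everything at low resolution (ratio 0.25).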
Computational cost. The computational cost of OctConv is proportional to the number of locations and channels on which convolutions are computed. Following the design shown in Figure 2 of the main paper, we need to compute four paths, namely Y^{H→H}, Y^{H→L}, Y^{L→H}, and Y^{L→L}.
We assume the convolution kernel size is k × k, the spatial resolution of the high-frequency feature is h × w, and there are (1−α)c channels in the high-frequency part and αc channels in the low-frequency part. The FLOPS for computing each path are then calculated as below.
FLOPS(Y^{H→H}) = h·w·k²·(1−α)c·(1−α)c
FLOPS(Y^{H→L}) = (h/2)·(w/2)·k²·(1−α)c·αc
FLOPS(Y^{L→H}) = (h/2)·(w/2)·k²·αc·(1−α)c
FLOPS(Y^{L→L}) = (h/2)·(w/2)·k²·αc·αc   (6)
We omit the FLOPS for adding Y^{H→H} and Y^{L→H} together, as well as for adding Y^{H→L} and Y^{L→L} together, since each such addition costs fewer than h·w·c FLOPS and is negligible compared with the convolutions. The computational cost of the pooling operation is likewise negligible, and the nearest-neighbor upsampling simply duplicates values and involves no floating-point operations. Therefore, by adding up all FLOPS in Eqn 6, we obtain the overall FLOPS for computing Y^H and Y^L in Eqn 7.
FLOPS(Y^H, Y^L) = (1 − (3/4)·α·(2−α)) · h·w·k²·c²   (7)
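The per-path costs and their sum can be checked numerically against the closed form 1 − (3/4)·α·(2−α); this is a sketch with illustrative function names, not the paper's code:

```python
def oct_conv_flops(h, w, k, c, alpha):
    """Sum of the FLOPS of the four OctConv paths; only the H->H path
    runs at full resolution, the other three at half resolution (1/4 area)."""
    hw, k2 = h * w, k * k
    f_hh = hw * k2 * ((1 - alpha) * c) ** 2              # Y^{H->H}
    f_hl = (hw / 4) * k2 * (1 - alpha) * c * alpha * c   # Y^{H->L}
    f_lh = (hw / 4) * k2 * alpha * c * (1 - alpha) * c   # Y^{L->H}
    f_ll = (hw / 4) * k2 * (alpha * c) ** 2              # Y^{L->L}
    return f_hh + f_hl + f_lh + f_ll

def vanilla_conv_flops(h, w, k, c):
    # Cost of a regular k x k convolution on an h x w map with c in/out channels.
    return h * w * k * k * c * c

alpha = 0.5
ratio = oct_conv_flops(56, 56, 3, 256, alpha) / vanilla_conv_flops(56, 56, 3, 256)
assert abs(ratio - (1 - 0.75 * alpha * (2 - alpha))) < 1e-9  # 0.4375 for alpha = 0.5
```

That is, at α = 0.5 the multi-frequency design needs under 44% of the FLOPS of the vanilla convolution it replaces.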
For a vanilla convolution, the FLOPS for computing an output feature map of size h × w × c with kernel size k × k from an input feature map of size h × w × c can be estimated as below.

FLOPS(vanilla) = h·w·k²·c²   (8)
Three out of the four internal convolutions are conducted on the lower-resolution tensors; only the first path, i.e. Y^{H→H}, operates at full resolution. Thus, the relative computational cost of OctConv compared with a vanilla convolution using the same kernel size and number of input/output channels is

FLOPS(Y^H, Y^L) / FLOPS(vanilla) = 1 − (3/4)·α·(2−α).   (9)
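To make the four-path structure concrete, here is a minimal NumPy sketch of an OctConv forward pass, restricted to 1×1 kernels for brevity (the actual operator uses k×k convolutions; all function names here are illustrative, not from the released code):

```python
import numpy as np

def avg_pool2(x):
    """2x2 average pooling on a (c, h, w) tensor; h and w assumed even."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    """Nearest-neighbor 2x upsampling on a (c, h, w) tensor."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def oct_conv_1x1(x_h, x_l, w_hh, w_hl, w_lh, w_ll):
    """Four-path OctConv with 1x1 kernels.
    x_h: (c_h, h, w) high-frequency input; x_l: (c_l, h/2, w/2) low-frequency.
    w_xy maps input group x to output group y, shape (out_channels, in_channels)."""
    # High-frequency output: intra-frequency H->H plus upsampled L->H update.
    y_h = np.einsum('oc,chw->ohw', w_hh, x_h) \
        + upsample2(np.einsum('oc,chw->ohw', w_lh, x_l))
    # Low-frequency output: pooled H->L update plus intra-frequency L->L.
    y_l = np.einsum('oc,chw->ohw', w_hl, avg_pool2(x_h)) \
        + np.einsum('oc,chw->ohw', w_ll, x_l)
    return y_h, y_l

x_h, x_l = np.ones((3, 4, 4)), np.ones((2, 2, 2))
y_h, y_l = oct_conv_1x1(x_h, x_l,
                        np.ones((4, 3)), np.ones((2, 3)),
                        np.ones((4, 2)), np.ones((2, 2)))
assert y_h.shape == (4, 4, 4) and y_l.shape == (2, 2, 2)
```

Note that the L→H convolution runs at low resolution before upsampling, and the H→L path pools before convolving; this ordering is what keeps three of the four paths at a quarter of the spatial cost.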
2 ImageNet ablation study results
For clarity of presentation and to allow future work to compare to the precise numbers, we further report in Table 9 the values that are plotted in Figure 4 of the main text.
Backbone        baseline  α=0.125  α=0.25  α=0.5  α=0.75
ResNet-26  GFLOPs  2.353  2.102  1.871  1.491  1.216
  Top-1 acc.  73.2  75.8  75.9  75.5  74.6
DenseNet-121  GFLOPs  2.852  2.428  2.044  –  –
  Top-1 acc.  75.4  76.0  75.9  –  –
ResNet-50  GFLOPs  4.105  3.587  3.123  2.383  1.891
  Top-1 acc.  77.0  78.2  78.0  77.3  76.6
SE-ResNet-50  GFLOPs  4.113  3.594  3.130  2.389  1.896
  Top-1 acc.  77.6  78.6  78.4  77.9  77.2
ResNeXt-50  GFLOPs  4.250  –  3.196  2.406  1.891
  Top-1 acc.  78.4  –  78.7  78.3  77.4
ResNet-101  GFLOPs  7.822  6.656  5.625  4.012  –
  Top-1 acc.  78.5  79.1  79.1  78.6  –
ResNeXt-101  GFLOPs  7.993  –  5.719  4.050  –
  Top-1 acc.  79.4  –  79.6  78.9  –
ResNet-200  GFLOPs  15.044  12.623  10.497  7.183  –
  Top-1 acc.  79.6  80.0  79.8  79.4  –
3 Updates
3.0.1 Changes over v1 (10 Apr 2019)
Add missing related works
In this updated version we added some missing related works, HRNet [38], HetConv [36] and MG-Conv [25], added sections that thoroughly discuss the connections to the latter, as well as comparisons that feature OctConv's design choices, which enable it to outperform MG-Conv with less computation. We thank the authors of [25] for bringing their paper to our attention and for the subsequent discussion.
Minor changes
1) Fix typo in author list; 2) Add link to GitHub.