Dynamic Sampling Convolutional Neural Networks

We present Dynamic Sampling Convolutional Neural Networks (DSCNN), where the position-specific kernels learn not only from the current position but also from multiple sampled neighbour regions. During sampling, residual learning is introduced to ease training and an attention mechanism is applied to fuse features from different samples. The kernels are further factorized to reduce parameters. The multiple sampling strategy enlarges the effective receptive fields significantly without requiring more parameters. DSCNNs inherit the advantages of Dynamic Filter Networks (DFN), namely avoiding feature map blurring through position-specific kernels while keeping translation invariance, and they also efficiently alleviate the overfitting issue caused by having many more parameters than normal CNNs. Our model is efficient and can be trained end-to-end via standard back-propagation. We demonstrate the merits of our DSCNNs on both sparse and dense prediction tasks involving object detection and flow estimation. Our results show that DSCNNs enjoy stronger recognition abilities, achieving 81.7% mAP on the VOC2012 detection dataset, and produce sharper responses in flow estimation on the FlyingChairs dataset compared to multiple FlowNet baseline models.



1 Introduction

Convolutional Neural Networks have recently made significant progress in both sparse prediction tasks including image classification [4, 5, 6], object detection [7, 8, 9] and dense prediction tasks such as semantic segmentation [10, 11, 12], optical flow estimation [13, 14, 15], etc. Generally, deeper [16, 17, 5] architectures can provide richer features due to more trainable parameters and larger receptive fields. For instance, ResNet [5] introduces short-cut connection and residual learning to enable the stack of over 100 convolutional layers.

Most neural network architectures mainly adopt spatially shared kernels, which work well in general cases. However, during the training phase, the gradients at each spatial position may not share a common descent direction that decreases the loss at every position. This phenomenon is quite ubiquitous when multiple objects appear in a single image in object detection, or when multiple objects move with different directions and speeds in flow estimation, which makes the spatially shared kernels more likely to produce blurred feature maps (please see the examples and detailed analysis in the Supplementary Material). The primary reason is that even though the kernels are far from optimal for every position, the global gradients, which are the spatial summation of the gradients over the entire feature maps, can be close to zero. Because these global gradients are used in the update process, back-propagation can quite often stall or make very slow progress.
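As a toy illustration of this cancellation (a minimal sketch, not taken from the Supplementary Material), consider one scalar weight shared by two positions whose targets pull in opposite directions. The squared-error loss, the inputs and the targets below are all assumptions chosen for simplicity.

```python
# Toy example: a single weight w shared by two spatial positions.
# The per-position loss is (w * x - t)^2; inputs and targets are made up.
def local_gradients(w, xs, ts):
    """Gradient of each position's squared error with respect to w."""
    return [2.0 * (w * x - t) * x for x, t in zip(xs, ts)]

xs = [1.0, 1.0]    # same input at both positions
ts = [1.0, -1.0]   # targets that demand opposite updates
w = 0.0

grads = local_gradients(w, xs, ts)    # one local gradient per position
global_grad = sum(grads)              # a shared kernel only sees this sum
local_losses = [(w * x - t) ** 2 for x, t in zip(xs, ts)]

print(grads, global_grad, local_losses)  # [-2.0, 2.0] 0.0 [1.0, 1.0]
```

The summed gradient is zero, so the shared weight does not move even though both positions have nonzero loss. A separate weight per position would instead receive -2 and +2 and could reduce both losses.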

Adopting position-specific kernels can alleviate the unshareable descent direction issue and take advantage of the gradients at each position (i.e. local gradients), since kernel parameters are not spatially shared. In order to maintain translation invariance, Brabandere et al. [1] propose a general paradigm called Dynamic Filter Networks (DFN) and verify it on the moving MNIST dataset [18]. However, DFNs [1] only generate the dynamic position-specific kernels for their own specific positions. As a result, the kernels can only receive gradients from their own position (i.e. a region of the square of the kernel size), which is usually more unstable, noisy and harder to converge than in normal CNNs.

Figure 1: Visualization of the effective receptive field (ERF). Yellow circles denote the positions on the object and the red region denotes the corresponding ERF. Best view in color.

Having a properly enlarged receptive field is another important consideration when designing CNN architectures. Adopting stacked convolutional layers with small kernels [16] is preferable to adopting larger kernels [4], because the former obtains the same receptive field with fewer parameters. However, the effective receptive field (ERF) [19] only occupies a fraction of the full theoretical receptive field due to some weak connections and inactivated ReLUs. In practice, it has been shown that adopting dilation strategies [20] can further improve performance [7, 12], which means that enlarging the receptive field in a single layer is still beneficial. However, despite these enhancements, the effective receptive fields of these CNNs are still not large enough for some applications.
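The parameter argument can be checked with a little arithmetic. The sketch below assumes stride-1 layers, square kernels, and equal input and output channel counts; the 3x3 and 7x7 sizes, the channel count of 64 and the helper names are illustrative choices, not values taken from this text. It computes the theoretical receptive field only, not the ERF.

```python
def theoretical_rf(kernel_sizes, dilations=None):
    """Theoretical receptive field of stacked stride-1 conv layers."""
    dilations = dilations or [1] * len(kernel_sizes)
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d   # each layer widens the field by (k - 1) * dilation
    return rf

def conv_weights(kernel_sizes, channels):
    """Weight count (no biases) with `channels` inputs and outputs per layer."""
    return sum(k * k * channels * channels for k in kernel_sizes)

stacked, single = [3, 3, 3], [7]
print(theoretical_rf(stacked), theoretical_rf(single))      # 7 7
print(conv_weights(stacked, 64), conv_weights(single, 64))  # 110592 200704
print(theoretical_rf([3], dilations=[2]))                   # 5
```

Three stacked 3x3 layers cover the same 7x7 window as a single 7x7 layer with roughly half the weights, and a dilation of 2 widens a single 3x3 layer to a 5x5 window at no extra cost.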

Therefore, we present DSCNNs to solve the limited ERF and unshareable descent direction issues by utilizing dynamically generated position-specific kernels. In particular, DSCNNs achieve large ERFs via a sampling strategy where each kernel convolves with features from both their own specific position and multiple sampled neighbouring regions. As illustrated in Fig. 1, with ResNet-50 as pretrained model, adding a single DSCNN layer can significantly enlarge the ERF, which further yields significant improvements on the representation abilities. Moreover, since our kernels at each position are dynamically generated, DSCNNs also benefit from the local gradients.

We verify the performance of our DSCNNs on object detection with the VOC benchmark [2] and on flow estimation with the FlyingChairs dataset [3]. Extensive experimental results demonstrate the effectiveness of our approach. We achieve 81.7% mAP with the CoupleNets detection head (vs. 80.4% for CoupleNets) in object detection on VOC2012, and 2.06 aEPE (vs. 2.19 for FlowNetC) in flow estimation on FlyingChairs. These results indicate that our DSCNNs are general and beneficial for both sparse and dense prediction tasks, with demonstrable improvements over strong baseline models. Our code will be made publicly available.

2 Related Work

Dynamic Filter Networks. Dynamic Filter Networks [1] were first proposed by Brabandere et al. to provide custom parameters for different input data. This architecture is powerful and more flexible since the kernels are dynamically conditioned on the inputs. Recently, several task-oriented objectives and extensions have been developed. Deformable ConvNets [21] can be seen as an extension of DFNs that discovers geometric-invariant features. Segmentation-aware convolution [22] explicitly takes advantage of prior segmentation information to refine feature boundaries via attention masks. Cross convolution [23] learns the translation from one frame to another via motion information. Different from the models mentioned above, our DSCNNs aim at constructing large receptive fields and receiving local gradients to produce sharper and more semantic feature maps.

Receptive Field. Properly enlarging the receptive field is one of the most important considerations in modern CNN architecture design. Luo et al. [19] propose the concept of the effective receptive field (ERF) and a mathematical measure of it using partial derivatives. Their experimental results verify that the ERF usually occupies only a small fraction of the theoretical receptive field, which is the input region that an output unit depends on. This has attracted a lot of research, especially in deep learning based computer vision. For instance, pooling strategies, which are ubiquitous in CNNs, scale down the feature maps so that output units can observe more input features with the same kernel size. Chen et al. [20] propose dilated convolution with the hole algorithm and achieve better results on semantic segmentation. Dai et al. [21] propose to dynamically learn the spatial offsets of the kernels at each position so that those kernels can observe wider regions in the bottom layer with irregular shapes. However, some applications such as large motion estimation and large object detection require even larger ERFs, which is one of the motivations of our DSCNNs.

Residual Learning. Generally, residual learning reduces the difficulty of directly learning an objective by learning its residual discrepancy from the identity function. ResNets [5] are proposed to learn residual features of the identity mapping via short-cut connections, which helps deepen CNNs to over 100 layers easily. There have been plenty of works adopting residual learning to alleviate the problem of divergence and generate richer features. Kim et al. [24] adopt residual learning to model multimodal data in visual QA. Long et al. [25] learn residual transfer networks for domain adaptation. Besides, Wang et al. [6] apply residual learning to alleviate the problem of repeated features in attention models. We apply the residual learning strategy to learn the residual discrepancy for identical convolutional kernels. By doing so, we ensure valid back-propagation of gradients so that the DSCNNs can easily converge on real-world datasets.

Attention Mechanism. For the purpose of recognizing important features in deep learning in an unsupervised manner, attention mechanisms have been applied to many vision tasks including image classification [6], semantic segmentation [22] and action recognition [26, 27]. Current visual attention mechanisms mainly consist of hard attention and soft attention versions. In hard attention [28, 29], most methods adopt sequential processing strategies, which extract features from one or several specific image areas and then decide which areas to focus on next. Those methods often require sequential resampling, which is computationally costly and cannot be accelerated via GPUs. In soft attention mechanisms [26, 30, 6], weights are generated to identify the important parts from different features using prior information. Sharma et al. [26] use previous states in LSTMs as prior information to have the network focus on more meaningful contents in the next frame and get better results for action recognition. Wang et al. [6] benefit from lower-level features and learn attention for higher-level feature maps in a residual manner. In contrast, our attention mechanism aims at combining features from multiple samples via learning weights for each position's kernels at each sample.

3 Dynamic Sampling Convolution

We first present the overall structure of our DSCNN in Sec. 3.1, then introduce the dynamic sampling strategy in Sec. 3.2. This design allows kernels at each position to take advantage of larger receptive fields and the local gradients. Moreover, attention mechanisms are utilized to enhance the performance of DSCNNs, as described in Sec. 3.3. Finally, Sec. 3.4 explains implementation details of our DSCNNs, especially the parameter reduction and residual learning techniques.

3.1 Network Overview

In this subsection, we introduce the DSCNNs’ overall architecture. As illustrated in Fig. 2, our DSCNNs consist of three branches, namely kernel branch, feature branch and attention branch via conventional convolutional layers with , , output channels respectively. More complicated architectures in each branch may yield better results, but it is not the focus of this work. Our DSCNNs are compatible modules in modern CNNs. With channels’ input feature maps, the feature branch firstly produces channels intermediate features. Secondly, the kernel branch generates position-specific kernels at each position to sample multiple neighbour regions in the feature branch’s channels’ feature maps via convolution. We efficiently implement the kernel branch by conventional CNNs, which requires channels by default, and further introduce a parameter reduction method that reduces the required numbers of channels in the kernel branch to . Thirdly, the attention branch outputs the corresponding attention weights for each position’s kernels during sampling. And the DSCNNs output feature maps with channels preserving the original spatial dimensions and in the whole process.

Figure 2: Overview of the Dynamic Sampling Convolutional Neural Networks (DSCNN). Our model consists of three branches: (1) the kernel branch generates position-specific kernels; (2) the feature branch generates features to be position-specifically convolved; (3) the attention branch generates attention weights to fuse features from each sampled neighbour region. Same color indicates features correlated to the same spatial sampled regions in features branch and after dynamically sampled.

3.2 Dynamic Sampling Convolution

This subsection demonstrates the dynamic sampling convolution, which enjoys both large receptive fields and the local gradients. In particular, the DSCNNs firstly generate position-specific kernels from the kernel branch and then convolve these kernels with features from multiple sampled neighbour regions in the feature branch, resulting in very large receptive fields.

Denoting as the feature maps from layer (or intermediate features from the feature branch) with shape , conventional convolutional layer with spatially shared kernels W can be formulated as


where denote the indices of the input and output channels, denote the spatial coordinates and indicates the kernel size.
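For reference, a spatially shared convolution of this kind can be written as below. The symbol names are assumptions made here for illustration: $F^{l}$ and $F^{l+1}$ for the input and output feature maps, $C^{l}$ for the number of input channels, $m$ and $n$ for the input and output channel indices, $(y, x)$ for the spatial coordinates, $(u, v)$ for the offsets inside a $k \times k$ kernel with odd $k$, and $\Delta = \lfloor k/2 \rfloor$.

```latex
F^{l+1}(n, y, x) \;=\; \sum_{m=1}^{C^{l}} \; \sum_{u=-\Delta}^{\Delta} \; \sum_{v=-\Delta}^{\Delta}
  W(n, m, u, v)\, F^{l}(m,\; y+u,\; x+v)
```

In this form the kernel $W$ carries no dependence on $(y, x)$, which is what spatial sharing means.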

In contrast, the DSCNNs treat the generated features in the kernel branch, which are spatially dependent, as convolutional kernels. Thus, this scheme requires the kernel branch to generate kernels from to map the -channel features in the feature branch to -channel ones ( denotes that the kernels are generated from , and we omit the when there is no ambiguity). Detailed kernel generation methods will be described in Sec. 3.4 and the supplementary material.

As we aim at very large receptive fields and more stable gradients, we not only convolve the generated position-specific kernels with features at their own positions in the feature branch, but also sample their neighbour regions as additional features shown in Eq. 2. Therefore, we have more learning samples for each position-specific kernel than DFN [1] and thus more stable gradients. Also, since we obtain more diverse kernels (i.e. position-specific) than conventional CNNs, we can robustly enrich the feature space.

Figure 3: Illustration of our multiple sampling strategy. Each position obtains its own kernels from the kernel branch and samples with the neighbour regions in the feature maps in the feature branch with a sample stride . The red square denotes the sampling point and same color indicates features correlated to the same spatial sampled regions in features branch and after dynamically sampled.

As shown in Fig. 3, each position (e.g. the red square) outputs its own kernels in the kernel branch and uses the generated kernels to sample the corresponding multiple neighbour regions (i.e. the cubes in different colors) in the feature branch. Assuming we have sampled regions for each position with sample stride , kernel size , the sampling strategy outputs feature maps with shape which obtain approximately times larger receptive fields. And applying stride and dilation on kernels are straightforward extensions which can further enlarge the receptive fields.

Formally, the dynamic sampling convolution thus can be formulated as


where and denote the coordinates of the center in sampled neighbour regions. denotes the position-specific kernels generated by the kernel branch. And are the indices of sampled region with sampling stride . It is worth noting that the original DFN models are the special case of our DSCNNs when .
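A naive NumPy sketch of this operation may help fix ideas. It is written from the verbal description above rather than from the paper's equation, so the argument names (`s` for the number of samples per axis, `gamma` for the sampling stride), the tensor layouts, the zero padding and the assumption of odd `s` and `k` are all choices made here. Each position applies its own kernel to the patch centred on each of its `s * s` shifted neighbour centres.

```python
import numpy as np

def dynamic_sampling_conv(feat, kernels, s=3, gamma=2):
    """Apply position-specific kernels to s*s sampled neighbour regions.

    feat:    (C_in, H, W) features from the feature branch.
    kernels: (H, W, C_out, C_in, k, k), one kernel per spatial position.
    Returns: (s*s, C_out, H, W), one output map per sampled region.
    """
    _, h, w = feat.shape
    c_out, k = kernels.shape[2], kernels.shape[-1]
    r = k // 2
    # Offsets of the sampled region centres relative to the current position.
    offsets = [(i - s // 2) * gamma for i in range(s)]
    pad = r + max(abs(o) for o in offsets)
    padded = np.pad(feat, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    out = np.zeros((s * s, c_out, h, w), dtype=feat.dtype)
    for a, dy in enumerate(offsets):
        for b, dx in enumerate(offsets):
            for y in range(h):
                for x in range(w):
                    cy, cx = y + dy + pad, x + dx + pad  # sampled centre
                    patch = padded[:, cy - r:cy + r + 1, cx - r:cx + r + 1]
                    # Sum over input channels m and kernel offsets (u, v).
                    out[a * s + b, :, y, x] = np.einsum(
                        "nmuv,muv->n", kernels[y, x], patch)
    return out

# With s=1 only the current position is sampled (the DFN-style special case).
feat = np.random.rand(2, 4, 5)
kernels = np.random.rand(4, 5, 3, 2, 3, 3)
print(dynamic_sampling_conv(feat, kernels, s=1).shape)  # (1, 3, 4, 5)
print(dynamic_sampling_conv(feat, kernels, s=3).shape)  # (9, 3, 4, 5)
```

The explicit loops are for readability only; an efficient version would gather the shifted patches in bulk and run on a GPU.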

3.3 Attention Mechanism

In this subsection, we present our methods to fuse dynamic features from multiple sampled regions at each position. A direct solution is to stack sampled features to form a tensor or perform a pooling operation on the sample dimension (i.e. first dimension of ) as outputs. However, the first choice violates translation invariance and the second choice is not aware of which samples are more important.

To address this issue, we present our attention mechanism which learns attention weights for each position’s kernel at each sample. Since the weights for each kernel parameter are not shared, the resolution of output feature maps can be potentially preserved. In case of sampled regions and kernel size at each position, we should have attention weights for each position’s kernels so that the weighted dynamic features can be formulated as


However, Eq. 3 requires attention weights, which is computationally costly and easily leads to overfitting. We thus split this task into learning position attention weights for kernels at each position and learning sampling attention weights for each sampled region. Therefore, Eq. 3 reduces to Eq. 4


where share the same representations in Eq.2.

Specifically, we use two CNN sub-branches to generate the attention weights for samples and positions respectively. The sampling attention sub-branch has output channels. For each position, the sample attention weights are generated from the current position, denoted by the red box with a cross in Fig. 4, to coarsely predict the importance according to that position. On the other hand, the position attention sub-branch has output channels. The position attention weights are generated from each sampled region's center, denoted by black boxes with crosses, to model fine-grained local importance based on the sampled local features.

Figure 4: At each position, we separately learn attention weights for each kernel and for each sample. Then, we combine features from multiple samples via these learned attention weights. Boxes with crosses denote the position to generate attention weights and red one denotes sampling position and black ones denote sampled positions.

Therefore, the number of attention weights will be reduced to as shown in Eq. 4. Further, we also manually add to each attention weight to take advantage of residual learning.  Obtaining Eq. 4, we finally combine different samples via attention mechanism as


As feature maps from previous conventional convolutional layers might still be noisy, the position attention weights help filter such noise when convolving with the dynamic kernels. The sample attention weights indicate how much contribution each sampled neighbour region is expected to make.
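The role of the sample attention weights can be sketched as follows. This is a simplified illustration, not the paper's exact fusion: it ignores the position attention weights, fuses by a weighted sum over samples, and uses a residual offset of 1 so that zero attention outputs leave every sample with unit weight. The function names, the tensor layouts and that offset value are assumptions. The count helper reads the split described above as going from one weight per (sample, kernel position) pair to one weight per sample plus one per kernel position.

```python
import numpy as np

def fuse_samples(sampled, sample_attn, residual_offset=1.0):
    """Fuse per-sample feature maps with per-position sample attention.

    sampled:     (S, C, H, W) features from S sampled neighbour regions.
    sample_attn: (S, H, W) raw attention outputs, one weight per sample
                 and spatial position.
    Returns:     (C, H, W) fused features.
    """
    weights = sample_attn + residual_offset       # residual-style weights
    return (weights[:, None, :, :] * sampled).sum(axis=0)

def attention_weight_counts(num_samples, kernel_area):
    """Weights per position: joint (product) versus split (sum) attention."""
    return num_samples * kernel_area, num_samples + kernel_area

sampled = np.ones((9, 4, 6, 6))
print(fuse_samples(sampled, np.zeros((9, 6, 6)))[0, 0, 0])  # 9.0
print(attention_weight_counts(9, 9))                        # (81, 18)
```

With all raw attention outputs at zero, every sample contributes equally; learned weights then raise or lower individual samples relative to that starting point.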

3.4 Dynamic Kernels Implementation Details

Reducing Parameter. Directly generating the position-specific kernels in the fashion of conventional convolutional layers will require parameters as shown in Eq. 2. Since and can be relatively large (e.g. up to 256 or 512), the required output channels in the kernel branch can easily get up to hundreds of thousands, which is computationally costly and intractable even on modern GPUs. Recently, several works have focused on reducing kernel parameters (e.g. MobileNet [31]) by factorizing kernels into different parts to make CNNs efficient on modern mobile devices. Inspired by them and based on our DSCNNs' case, we describe our proposed parameter reduction method. We provide the evaluation and comparison with state-of-the-art counterparts in the supplementary material.

Observing that the activated output feature maps in a layer usually share similar geometric characteristics across channels, we propose a novel kernel structure that splits the original kernels into two separate parts for the purpose of parameter reduction, as illustrated in Fig. 5. On the one hand, the part will be placed into the spatial center of each kernel with size to model the difference across channels. On the other hand, the part will be duplicated times to model the shared geometric characteristics within each channel.

Figure 5: Illustration of the original kernel structure and our new kernel structure. We show the case of one output channel for simplicity. In the first part, weights are placed in the center of the corresponding kernel and in the second part weights are duplicated times.
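One way to read this two-part structure is sketched below. The shapes are assumptions: a centre part `U` with one weight per (output, input) channel pair, and a spatial part `V` with one `k x k` pattern per output channel that is copied across all input channels. The resulting parameter count follows from those assumed shapes and is not a figure quoted from the text.

```python
import numpy as np

def assemble_kernel(U, V):
    """Build full kernels from a centre part and a shared spatial part.

    U: (C_out, C_in) weights placed at the spatial centre of each kernel.
    V: (C_out, k, k) spatial weights duplicated across input channels.
    Returns W: (C_out, C_in, k, k).
    """
    c_in = U.shape[1]
    k = V.shape[-1]
    W = np.repeat(V[:, None, :, :], c_in, axis=1)  # copy V for every input channel
    W[:, :, k // 2, k // 2] += U                   # add U at the spatial centre
    return W

def param_counts(c_in, c_out, k):
    """Per-position parameters: direct generation versus the two-part form."""
    return c_in * c_out * k * k, c_out * (c_in + k * k)

print(assemble_kernel(np.ones((2, 3)), np.zeros((2, 3, 3))).shape)  # (2, 3, 3, 3)
print(param_counts(256, 256, 3))                                    # (589824, 67840)
```

Under these assumptions, 256 input and output channels with a 3x3 kernel need several hundred thousand generated values per position when produced directly, and about a ninth of that in the two-part form.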

Combining the above two parts together, our method generates kernels that map channels’ feature maps to channel ones with kernel size by only parameters at each position instead of . Formally, the convolutional kernels used in Eq. 2 are formulated as


Residual Learning. Directly using the outputs of the kernel branch as the kernels in Eq. 6 can easily lead to divergence in noisy real-world datasets. The reason is that only if the convolutional layers in kernel branch are well trained can we have good gradients back to feature branch and vice versa. Therefore, it’s hard to train both of them from scratch simultaneously. Further, since kernels are not shared spatially, gradients at each position are more likely to be noisy, which makes kernel branch even harder to train and further hinders the training process of feature branch.

We adopt residual learning to address this issue, which learns the residual discrepancies of identical convolutional kernels. In particular, we add to each central position of the kernels as


Initially, since the outputs of the kernel branch are close to zero, DSCNN approximately averages features from feature branch. It guarantees gradients are sufficient and reliable for back propagation to the feature branch, which inversely benefits the training process of the kernel branch.
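The effect of the residual term at initialization can be illustrated with a small sketch. The constant added at the kernel centre is a hypothetical parameter here (`offset`); choosing `1 / C_in` is an assumption that reproduces the channel-averaging behaviour described above when the kernel branch outputs zeros.

```python
import numpy as np

def residual_kernel(delta, offset):
    """Add a constant to the spatial centre of every generated kernel.

    delta: (C_out, C_in, k, k) raw output of the kernel branch.
    """
    kernel = delta.copy()
    c = delta.shape[-1] // 2
    kernel[:, :, c, c] += offset
    return kernel

c_out, c_in, k = 2, 4, 3
kernel = residual_kernel(np.zeros((c_out, c_in, k, k)), offset=1.0 / c_in)

patch = np.random.rand(c_in, k, k)                   # features around one position
response = np.einsum("nmuv,muv->n", kernel, patch)   # apply the kernel to the patch
print(np.allclose(response, patch[:, 1, 1].mean()))  # True
```

Because the response is a plain average of the centre features, gradients reach the feature branch even before the kernel branch has learned anything useful.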

4 Experiments

We evaluate our DSCNNs on object detection and optical flow estimation tasks. Our experimental results show that, firstly, with the much larger ERF illustrated in Fig. 6, DSCNNs achieve significant improvements in recognition ability. Secondly, with position-specific dynamic kernels and local gradients, DSCNNs produce much sharper optical flow.

In the following subsections, we use denotes with, denotes without, denotes attention mechanism and denotes residual learning, denotes the number of dynamic features. Since in our DSCNN is relatively small ( 24) compared with conventional CNNs’ settings, we optionally apply a post-conv layer to increase dimension to channels to match the conventional CNNs.

4.1 Object Detection

We use PASCAL VOC datasets [2] for object detection tasks. Following the protocol in [9], we train our DSCNNs on the union of VOC 2007 trainval and VOC 2012 trainval and test on VOC 2007 and 2012 test sets. For evaluation, we use the standard mean average precision (mAP) scores with IoU thresholds 0.5.
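For readers unfamiliar with the criterion, a detection is matched to a ground-truth box of the same class by comparing their intersection-over-union (IoU) against the threshold. A minimal sketch, assuming boxes are given as `(x1, y1, x2, y2)` corner coordinates with continuous (not pixel-inclusive) extents; whether the comparison is strict or not differs between evaluation scripts, so the `>=` below is an assumption:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))         # 1/7, roughly 0.143
print(iou((0, 0, 2, 2), (0, 0, 2, 3)) >= 0.5)  # True (IoU is 4/6)
```

Average precision is then computed per class from the ranked detections, and mAP is the mean over classes.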

When applying our DSCNN, we insert it right between the feature extractor and the detection heads. We treat these dynamic features as complementary features, which are concatenated with the original features before being fed into the detection head. In particular, we adopt ResNets as the feature extractor and bin R-FCN [7] or CoupleNets [32] with OHEM [33] as the detection head. During training, following [21], we resize images to have a shorter side of 600 pixels and adopt the SGD optimizer. We use pre-trained and fixed RPN proposals. Concretely, the RPN network is trained separately as in the first stage of the procedure in [8]. We train for 110k iterations on a single GPU with learning rate in the first 80k and in the next 30k.

| Model | mAP(%) on VOC12 | mAP(%) on VOC07 | GPU | Time(ms) |
|---|---|---|---|---|
| R-FCN [7] | 77.6 | 79.5 | TITAN | 121 |
| R-FCN+DSCNN | 79.2 | 81.2 | TITAN | 141 |
| Deform. Conv. [21] | - | 80.6 | K40 | 193 |
| CoupleNet [32] | 80.4 | 81.7 | TITAN | 157 |
| CoupleNet+DSCNN | 81.7 | 82.3 | TITAN | 179 |

Table 1: Evaluation of the DSCNN models in VOC 2007 and 2012 detection dataset. We use , , , = 256 with ResNet-101 as pre-trained networks in experiments when adding DSCNN layers. http://host.robots.ox.ac.uk:8080/anonymous/BBHLEL.html.

As shown in Table 1, DSCNN improves the R-FCN baseline model's mAP by over 1.5% with only dynamic features. This implies that the position-specific dynamic features are a good supplement to the original feature space. Even though CoupleNets [32] have already explicitly considered global information with large receptive fields, experimental results demonstrate that adding our DSCNN model is still beneficial.

Evaluation on Effective Receptive Field. We evaluate the effective receptive fields (ERF) in this subsection. As illustrated in Fig. 6, with ResNet-50 as the backbone network, a single additional DSCNN layer provides a much larger ERF than the vanilla model thanks to the multiple sampling strategy. With larger ERFs, the networks can effectively observe a larger region at each position and thus can gather information and recognize objects more easily. Further, Table 1 experimentally verifies the improvements in recognition ability provided by our proposed DSCNNs.

Figure 6: Visualization on the effective receptive fields. The yellow circles denote the position on the objects. The first row presents input images. The second row contains the ERF figure from vanilla ResNet-50 model. The third row contains figures of the ERF with DSCNNs. Best view in color.

Ablation Study on Numbers of Sampled Regions. We perform experiments to verify the advantages of applying more sampled regions in DSCNN.

, 72.1 78.2 78.1 , 72.1 77.4 77.7
, 72.5 78.6 78.6 , 72.9 78.6 78.5
Table 2: Evaluation of numbers of samples . The listed results are trained with residual learning and the post-conv layer is not applied. The experiments use R-FCN baseline and adopt ResNet-50 as pretrained networks.

Table 2 evaluates the effect of sampling in the neighbour regions. In the simple DFN model [1], where , though attention and residual learning strategies are adopted, the accuracy is still lower than the R-FCN baseline (77.0%). We argue the reason is that the simple DFN model has a limited receptive field. Besides, kernels at each position only receive gradients from that same position, which easily leads to overfitting. With more sampled regions, we not only enlarge the receptive field in the feed-forward step, but also stabilize the gradients in the back-propagation process. As shown in Table 2, when we take samples, the mAP score surpasses the original R-FCN [7] by 1.6% and saturates with respect to when the attention mechanism is applied.

Ablation Study on Attention Mechanism. We verify the effectiveness of the attention mechanism in Table 3 with different sample strides and numbers of dynamic features . In the experiments without the attention mechanism, max pooling in the channel dimension is adopted. We observe that, in most cases, the attention mechanism helps improve mAP by more than 0.5% on VOC2007 detection tasks. Especially as the number of dynamic features increases ( 32), the attention mechanism provides more benefits, increasing the mAP by 1%, which indicates that it can further strengthen our DSCNNs.

77.8 78.2 77.4 77.4
78.1 78.6 77.4 77.3
78.6 78.0 77.6 77.3
Table 3: Evaluation of attention mechanism with different sample strides and numbers of dynamic features. The post-conv layer is not applied. The experiments use R-FCN baseline and adopt ResNet-50 as pretrained networks.

Ablation Study on Residual Learning. We perform experiments to verify that, with different numbers of dynamic features, residual learning contributes substantially to the convergence of our DSCNNs. As shown in Table 4, without residual learning, DSCNNs can hardly converge on real-world datasets. Even when they do converge, the mAP is lower than expected. When our DSCNNs learn in a residual fashion, however, the mAP increases by about 10% on average.

78.6 77.4 78.6 77.6
68.1 68.7
Table 4: Evaluation of the residual learning strategy in DSCNN. indicates that the model fails to converge and the post-conv layer is not applied. The experiments use R-FCN baseline and adopt ResNet-50 as pretrained networks.

Runtime Analysis. Since our model can be implemented on GPUs efficiently and the computation at each position and sampled region can be done in parallel, the running time of the DSCNN models has the potential to be only slightly slower than several convolutional layers with kernel size . Table 1 shows the efficiency of the DSCNN models.

Figure 7: Examples of Flow estimation on FlyingChairs dataset. With DSCNNs, much sharper and more detailed optical flow can be estimated compared to various FlowNet models.

4.2 Optical Flow Estimation

We perform experiments on optical flow estimation using the FlyingChairs dataset [3]. This dataset is a synthetic one with optical flow ground truth and widely used in training deep learning based flow estimation methods. It consists of 22872 image pairs and corresponding flow fields. In experiments we use FlowNets(S) and FlowNetC [14] as our baseline models, though other complicated models are also applicable. All of the baseline models are fully-convolutional networks which firstly downsample input image pairs to learn semantic features then upsample the features to estimate optical flow.

In experiments, our DSCNN layers are inserted in a relatively shallow layer (i.e. the third conv layer) to produce sharper optical flow images. In order to capture large displacement, we apply or samples with a sample stride . We adopt dynamic features and an conv layer with channels as post-conv layer. After that, we use a skip-connection to connect the DSCNN outputs to the corresponding upsampling layer. We follow a training process similar to that in [13] for fair comparison (we use 300k iterations with double batch size). As shown in Fig. 7, our DSCNNs output sharper and more accurate optical flow thanks to the large receptive fields and dynamic position-specific kernels. Since each position estimates optical flow with its own kernels, our DSCNN can better identify the contours of the moving objects.

Figure 8: Training loss of flow estimation. We use moving average with window size of 2k iterations when plotting the loss curve.

| Model | aEPE | Time |
|---|---|---|
| Spynet [34] | 2.63 | - |
| EpicFlow [35] | 2.94 | - |
| DeepFlow [36] | 3.53 | - |
| PWC-Net [15] | 2.26 | - |
| FlowNets [14] | 3.67 | 6ms |
| FlowNets+DSCNN, | 2.88 | 23ms |
| FlowNetS [14] | 2.78 | 16ms |
| FlowNetS+SegAware [22] | 2.36 | - |
| FlowNetS+DSCNN, | 2.34 | 34ms |
| FlowNetC [14] | 2.19 | 25ms |
| FlowNetC+DSCNN, | 2.11 | 43ms |
| FlowNetC+DSCNN, | 2.06 | 51ms |

Table 5: aEPE and running time evaluation of optical flow estimation.

As illustrated in Fig. 8, DSCNN models successfully relax the constraint of sharing kernels spatially and converge to a lower training loss in both FlowNets and FlowNetC models. This further indicates the advantages of local gradients in dense prediction tasks.

We use average End-Point-Error (aEPE) to quantitatively measure the performance of the optical flow estimation. Table 5 shows that the aEPEs decrease in all baseline models by a large margin with a single DSCNN layer added. In FlowNets, aEPE decreases by 0.79, which demonstrates the increased learning capacity and robustness of our DSCNN models. Even though the SegAware attention model [22] explicitly takes advantage of boundary information as additional training data, our DSCNN can still slightly outperform it using FlowNetS as the baseline model. With and , we have approximately times larger receptive fields which allow the FlowNet models to easily capture large displacements in flow estimation task on FlyingChairs dataset.
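The metric itself is simple to state in code. A minimal sketch, assuming dense flow fields stored as `(H, W, 2)` arrays of horizontal and vertical displacements; the function name and layout are illustrative:

```python
import numpy as np

def average_epe(flow_pred, flow_gt):
    """Mean Euclidean distance between predicted and true flow vectors.

    flow_pred, flow_gt: (H, W, 2) arrays of per-pixel displacements.
    """
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

gt = np.zeros((4, 4, 2))
pred = gt.copy()
pred[..., 0] = 3.0   # every pixel is off by 3 horizontally
pred[..., 1] = 4.0   # and by 4 vertically
print(average_epe(pred, gt))  # 5.0
```

Lower values are better, so the drop from 3.67 to 2.88 in FlowNets corresponds to predictions that are on average 0.79 pixels closer to the ground truth.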

5 Conclusion

This work introduces Dynamic Sampling Convolutional Neural Networks (DSCNN), which learn dynamic position-specific kernels and take advantage of very large ERFs and local gradients, leading to better performance on the tasks we evaluate. With the ERF robustly enlarged via the multiple sampling strategy, the recognition abilities of DSCNNs are significantly improved. With local gradients and dynamic kernels, DSCNNs produce much sharper output features, which is especially beneficial in dense prediction tasks such as optical flow estimation.