Convolutional Neural Networks have recently made significant progress in both sparse prediction tasks, including image classification [4, 5, 6] and object detection [7, 8, 9], and dense prediction tasks such as semantic segmentation [10, 11, 12] and optical flow estimation [13, 14, 15]. Generally, deeper architectures [16, 17, 5] provide richer features thanks to more trainable parameters and larger receptive fields. For instance, ResNet introduces shortcut connections and residual learning to enable stacks of over 100 convolutional layers.
Most neural network architectures adopt spatially shared kernels, which work well in general cases. However, during the training phase, the gradients at different spatial positions may not share a common descent direction that would decrease the loss at every position. Such phenomena are quite common when multiple objects appear in a single image in object detection, or when multiple objects move in different directions and at different speeds in flow estimation, making spatially shared kernels more likely to produce blurred feature maps (please see the examples and detailed analysis in the supplementary material). The primary reason is that, even though the kernels are far from optimal for every position, the global gradients, i.e., the spatial summation of the gradients over the entire feature maps, can be close to zero. Because these global gradients are used in the update process, back-propagation can quite often stall or make very slow progress.
Adopting position-specific kernels can alleviate this unshareable-descent-direction issue and take advantage of the gradients at each position (i.e., local gradients), since kernel parameters are no longer spatially shared. In order to maintain translation invariance, Brabandere et al. propose a general paradigm called Dynamic Filter Networks (DFN) and verify it on the moving MNIST dataset. However, DFNs only generate dynamic position-specific kernels for their own specific positions. As a result, each kernel can only receive gradients from its own position (a single window of the kernel size), which is usually more unstable, noisy, and harder to converge than a normal CNN.
Having a properly enlarged receptive field is another important consideration when designing CNN architectures. Stacking convolutional layers with small kernels (e.g., 3x3) is preferable to using larger kernels (e.g., 7x7), because the former obtains the same receptive field with fewer parameters. However, the effective receptive field (ERF) occupies only a fraction of the full theoretical receptive field, due to weak connections and inactivated ReLUs. In practice, dilation strategies have been shown to further improve performance [7, 12], which means that enlarging receptive fields within a single layer is still beneficial. Despite these enhancements, however, the effective receptive fields of these CNNs are still not large enough for some applications.
Therefore, we present DSCNNs to solve the limited-ERF and unshareable-descent-direction issues by utilizing dynamically generated position-specific kernels. In particular, DSCNNs achieve large ERFs via a sampling strategy in which each kernel convolves with features from both its own position and multiple sampled neighbouring regions. As illustrated in Fig. 1, with ResNet-50 as the pretrained model, adding a single DSCNN layer significantly enlarges the ERF, which in turn yields significant improvements in representation ability. Moreover, since the kernels at each position are dynamically generated, DSCNNs also benefit from the local gradients.
We verify the performance of our DSCNNs on object detection on the VOC benchmark and on flow estimation on the FlyingChairs dataset. Extensive experimental results demonstrate the effectiveness of our new approach: we achieve 81.7% mAP with the CoupleNets detection head (vs. 80.4% for CoupleNets) on VOC2012 object detection, and 2.06 aEPE (vs. 2.19 for FlowNetC) on FlyingChairs flow estimation. These results indicate that our DSCNNs are general and beneficial for both sparse and dense prediction tasks, with demonstrable improvements over strong baseline models. Our code will be made publicly available.
2 Related Work
Dynamic Filter Networks.
Dynamic Filter Networks were first proposed by Brabandere et al. to provide custom parameters for different input data. This architecture is powerful and flexible since the kernels are dynamically conditioned on the input. Recently, several task-oriented objectives and extensions have been developed. Deformable ConvNets can be seen as an extension of DFNs that discovers geometric-invariant features. Segmentation-aware convolution explicitly takes advantage of prior segmentation information to refine feature boundaries via attention masks. Cross convolution learns the translation from one frame to the next via motion information. Different from the models above, our DSCNNs aim at constructing large receptive fields and receiving local gradients to produce sharper and more semantic feature maps.
Receptive Field. Properly enlarging the receptive field is one of the most important considerations in modern CNN architecture design. Luo et al. propose the concept of the effective receptive field (ERF) together with a mathematical measure based on partial derivatives, and verify experimentally that the ERF usually occupies only a small fraction of the theoretical receptive field, i.e., the input region that an output unit depends on. This observation has attracted a great deal of research, especially in deep-learning-based computer vision. For instance, pooling strategies, which scale down the feature maps so that output units observe more of the input with the same kernel size, are ubiquitous in CNNs. Chen et al. propose dilated convolution with the hole algorithm and achieve better results on semantic segmentation. Dai et al. propose to dynamically learn the spatial offsets of the kernels at each position so that those kernels can observe wider, irregularly shaped regions in the bottom layer. However, some applications, such as large-motion estimation and large-object detection, require even larger ERFs, which is one of the motivations for our DSCNNs.
Residual Learning. Generally, residual learning reduces the difficulty of directly learning an objective by instead learning its residual discrepancy from the identity function. ResNets learn residual features of the identity mapping via shortcut connections, which helps deepen CNNs to over 100 layers easily. Plenty of works have adopted residual learning to alleviate divergence and generate richer features. Kim et al. adopt residual learning to model multimodal data in visual QA. Long et al. learn residual transfer networks for domain adaptation. Besides, Wang et al. apply residual learning to alleviate the problem of repeated features in attention models. We apply the residual learning strategy to learn residual discrepancies from identical convolutional kernels. By doing so, we ensure valid gradient back-propagation so that DSCNNs can easily converge on real-world datasets.
Attention Mechanism. For the purpose of recognizing important features in an unsupervised manner, attention mechanisms have been applied to many vision tasks, including image classification, semantic segmentation, and action recognition [26, 27]. Current visual attention mechanisms mainly divide into hard and soft attention. In hard attention [28, 29], most methods adopt sequential processing strategies that extract features from one or several specific image areas and then decide which areas to focus on next. Those methods often require sequential resampling, which is computationally costly and cannot be accelerated on GPUs. In soft attention mechanisms [26, 30, 6], weights are generated to identify the important parts of different features using prior information. Sharma et al. use previous LSTM states as prior information to make the network focus on more meaningful content in the next frame, obtaining better action recognition results. Wang et al. benefit from lower-level features and learn attention for higher-level feature maps in a residual manner. In contrast, our attention mechanism aims at combining features from multiple samples by learning weights for each position's kernels at each sample.
3 Dynamic Sampling Convolution
We first present the overall structure of our DSCNN in Sec. 3.1 and then introduce the dynamic sampling strategy in Sec. 3.2. This design allows the kernel at each position to take advantage of larger receptive fields and the local gradients. Moreover, attention mechanisms are utilized to enhance the performance of DSCNNs, as described in Sec. 3.3. Finally, Sec. 3.4 explains the implementation details of our DSCNNs, especially the parameter reduction and residual learning techniques.
3.1 Network Overview
In this subsection, we introduce the DSCNNs' overall architecture. As illustrated in Fig. 2, our DSCNNs consist of three branches, namely a kernel branch, a feature branch, and an attention branch, each built from conventional convolutional layers with its own number of output channels. More complicated architectures in each branch might yield better results, but that is not the focus of this work. Our DSCNNs are compatible modules for modern CNNs. Given the input feature maps, the feature branch first produces intermediate features. Second, the kernel branch generates position-specific kernels at each position, which sample multiple neighbour regions of the feature branch's feature maps via convolution. We implement the kernel branch efficiently with conventional CNNs and further introduce a parameter reduction method that reduces the number of channels the kernel branch requires. Third, the attention branch outputs the corresponding attention weights for each position's kernels during sampling. The DSCNNs output feature maps that preserve the original spatial dimensions throughout the whole process.
3.2 Dynamic Sampling Convolution
This subsection demonstrates the dynamic sampling convolution, which enjoys both large receptive fields and the local gradients. In particular, the DSCNNs firstly generate position-specific kernels from the kernel branch and then convolve these kernels with features from multiple sampled neighbour regions in the feature branch, resulting in very large receptive fields.
Denoting by X the feature maps from layer l (or the intermediate features from the feature branch) with shape (C, H, W), a conventional convolutional layer with spatially shared kernels W can be formulated as

Y_{c'}(y, x) = \sum_{c=1}^{C} \sum_{i,j=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} W_{c',c}(i, j) \, X_c(y+i, x+j),

where c and c' denote the indices of the input and output channels, (y, x) denotes the spatial coordinates, and k indicates the kernel size.
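As a concrete reference, the shared-kernel convolution above can be sketched with a naive NumPy loop; the channels-first array layout and function name are our own choices for illustration, not an optimized implementation:

```python
import numpy as np

def conv2d_shared(X, W):
    """Conventional convolution: one kernel set W shared at every position.

    X: input feature maps, shape (C, H, W_in)
    W: shared kernels, shape (C_out, C, k, k)
    Returns Y of shape (C_out, H, W_in) with 'same' zero padding.
    """
    C, H, Wd = X.shape
    C_out, _, k, _ = W.shape
    r = k // 2
    Xp = np.pad(X, ((0, 0), (r, r), (r, r)))
    Y = np.zeros((C_out, H, Wd))
    for y in range(H):
        for x in range(Wd):
            patch = Xp[:, y:y + k, x:x + k]              # (C, k, k)
            Y[:, y, x] = np.tensordot(W, patch, axes=3)  # sum over c, i, j
    return Y
```

With a delta kernel at the spatial center on matching channels, the layer reduces to the identity, which is a quick sanity check of the indexing.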
In contrast, DSCNNs treat features generated by the kernel branch, which are spatially dependent, as convolutional kernels. This scheme requires the kernel branch to generate, at each position, kernels that map the C-channel features in the feature branch to C'-channel outputs (we omit the layer superscript when there is no ambiguity). Detailed kernel generation methods are described in Sec. 3.4 and the supplementary material.
As we aim at very large receptive fields and more stable gradients, we not only convolve the generated position-specific kernels with features at their own positions in the feature branch, but also sample their neighbour regions as additional features, as shown in Eq. 2. We therefore have more learning samples for each position-specific kernel than DFN, and thus more stable gradients. Also, since we obtain more diverse kernels (i.e., position-specific ones) than conventional CNNs, we can robustly enrich the feature space.
As shown in Fig. 3, each position (e.g., the red square) outputs its own kernels in the kernel branch and uses the generated kernels to sample the corresponding multiple neighbour regions (i.e., the cubes in different colors) in the feature branch. Assuming we sample S regions for each position with sample stride γ and kernel size k, the sampling strategy outputs feature maps of shape (S, C', H, W), obtaining receptive fields roughly S times larger. Applying stride and dilation to the kernels is a straightforward extension that can further enlarge the receptive fields.
Formally, the dynamic sampling convolution can thus be formulated as

Y^{(u,v)}_{c'}(y, x) = \sum_{c=1}^{C} \sum_{i,j=-\lfloor k/2 \rfloor}^{\lfloor k/2 \rfloor} W^{(y,x)}_{c',c}(i, j) \, X_c(\hat{y}+i, \hat{x}+j), \quad \hat{y} = y + u\gamma, \; \hat{x} = x + v\gamma,

where \hat{y} and \hat{x} denote the coordinates of the center of a sampled neighbour region, W^{(y,x)} denotes the position-specific kernels generated by the kernel branch, and (u, v) are the indices of the sampled region with sampling stride γ. It is worth noting that the original DFN model is a special case of our DSCNNs in which only each position's own region is sampled (S = 1).
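To make the sampling scheme concrete, here is a toy NumPy sketch of the formulation above; the tensor layout (one kernel set per position, a fixed 3x3 grid of sampled centers) is our assumption for illustration, not the paper's exact implementation:

```python
import numpy as np

def dynamic_sampling_conv(X, K, gamma=2):
    """Sketch of dynamic sampling convolution.

    X: feature-branch maps, shape (C, H, W)
    K: position-specific kernels, shape (H, W, C_out, C, k, k)
       -- a different kernel set generated for every position
    gamma: sampling stride between neighbour-region centers
    Returns Y of shape (S, C_out, H, W) with S = 9 samples (3x3 grid).
    """
    C, H, Wd = X.shape
    _, _, C_out, _, k, _ = K.shape
    r = k // 2
    pad = r + gamma
    Xp = np.pad(X, ((0, 0), (pad, pad), (pad, pad)))
    offsets = [(u, v) for u in (-1, 0, 1) for v in (-1, 0, 1)]
    Y = np.zeros((len(offsets), C_out, H, Wd))
    for s, (u, v) in enumerate(offsets):
        for y in range(H):
            for x in range(Wd):
                # center of the sampled neighbour region
                cy, cx = y + u * gamma + pad, x + v * gamma + pad
                patch = Xp[:, cy - r:cy + r + 1, cx - r:cx + r + 1]
                Y[s, :, y, x] = np.tensordot(K[y, x], patch, axes=3)
    return Y
```

The central sample (u, v) = (0, 0) with identity kernels reproduces the input, confirming that the (0, 0) offset corresponds to the DFN special case.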
3.3 Attention Mechanism
In this subsection, we present our method for fusing the dynamic features from multiple sampled regions at each position. A direct solution is to stack the sampled features into a tensor, or to perform a pooling operation over the sample dimension. However, the first choice violates translation invariance and the second is not aware of which samples are more important.
To address this issue, we present an attention mechanism that learns attention weights for each position's kernel at each sample. Since the weights for each kernel parameter are not shared, the resolution of the output feature maps can be preserved. With S sampled regions and kernel size k at each position, we would need S \times k \times k attention weights for each position's kernels, so that the weighted dynamic features can be formulated as

\tilde{Y}^{(u,v)}_{c'}(y, x) = \sum_{c} \sum_{i,j} A^{(u,v)}_{i,j}(y, x) \, W^{(y,x)}_{c',c}(i, j) \, X_c(\hat{y}+i, \hat{x}+j),

where A denotes the attention weights and (\hat{y}, \hat{x}) the center of sampled region (u, v).
However, Eq. 3 requires S \times k \times k attention weights per position, which is computationally costly and easily leads to overfitting. We therefore split this task into learning k \times k position attention weights P for the kernels at each position and S sampling attention weights Q for the sampled regions. Eq. 3 thus reduces to Eq. 4,

A^{(u,v)}_{i,j}(y, x) = P_{i,j}(\hat{y}, \hat{x}) \, Q_{u,v}(y, x),

where the remaining symbols share the same meanings as in Eq. 2.
Specifically, we use two CNN sub-branches to generate the attention weights for samples and positions respectively. The sampling attention sub-branch has S output channels: for each position, the sample attention weights are generated from the current position (denoted by the red box with a cross in Fig. 4) to coarsely predict the importance of each sample for that position. The position attention sub-branch, on the other hand, has k \times k output channels: the position attention weights are generated from each sampled region's center (denoted by the black boxes with crosses) to model fine-grained local importance based on the sampled local features.
Therefore, the number of attention weights per position is reduced to S + k \times k, as shown in Eq. 4. Further, we manually add a constant 1 to each attention weight to take advantage of residual learning. With Eq. 4 in hand, we finally combine the different samples via the attention mechanism as

Y_{c'}(y, x) = \sum_{u,v} \tilde{Y}^{(u,v)}_{c'}(y, x),

where \tilde{Y}^{(u,v)} denotes the attention-weighted features from sample (u, v).
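The factorization and the combination over samples can be sketched in NumPy as follows; the shapes, the residual +1, and the division by S (so that untrained weights yield a plain average) are illustrative assumptions:

```python
import numpy as np

def factorized_attention(q, p):
    """Factor S*k*k attention weights per position into S + k*k weights.

    q: sample attention weights, shape (S,)
    p: position attention weights, shape (k*k,)
    A residual 1 is added to each weight; the full (S, k*k) table is
    recovered as an outer product.
    """
    return np.outer(1.0 + q, 1.0 + p)

def combine_samples(feats, q):
    """Fuse dynamic features from S sampled regions with sample attention.

    feats: sampled dynamic features, shape (S, C_out, H, W)
    q:     sample attention weights, shape (S, H, W)
    """
    S = feats.shape[0]
    w = (1.0 + q)[:, None, :, :]       # broadcast over channels
    return (w * feats).sum(axis=0) / S
```

With all raw weights at zero, the attention table is all ones and the fusion reduces to an average over samples, which mirrors the residual-learning initialization described above.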
As the feature maps from the preceding conventional convolutional layers might still be noisy, the position attention weights help filter such noise when convolving with the dynamic kernels, while the sample attention weights indicate how much contribution each sampled neighbour region is expected to make.
3.4 Dynamic Kernels Implementation Details
Reducing Parameters. Directly generating the position-specific kernels in the fashion of conventional convolutional layers would require C \times C' \times k \times k parameters per position, as shown in Eq. 2. Since C and C' can be relatively large (e.g., up to 256 or 512), the required number of output channels in the kernel branch can easily reach hundreds of thousands, which is computationally costly and intractable even on modern GPUs. Recently, several works have focused on reducing kernel parameters (e.g., MobileNet) by factorizing kernels into different parts to make CNNs efficient on mobile devices. Inspired by them, we describe a parameter reduction method tailored to our DSCNNs, and provide evaluations and comparisons with state-of-the-art counterparts in the supplementary material.
Observing that the activated output feature maps of a layer usually share similar geometric characteristics across channels, we propose a novel kernel structure that splits the original kernels into two separate parts for parameter reduction. Concretely, as illustrated in Fig. 5, a cross-channel part of size C \times C' is placed at the spatial center of each kernel to model the differences across channels, while a spatial part of size k \times k is duplicated across channels to model the geometric characteristics shared within each channel. Combining the two parts, our method generates kernels that map C-channel feature maps to C'-channel ones with kernel size k using only C \times C' + k \times k parameters at each position instead of C \times C' \times k \times k. Formally, the convolutional kernels used in Eq. 2 are formulated as

W_{c',c}(i, j) = U_{c',c} \, \mathbb{1}[(i, j) = (0, 0)] + V(i, j),

where U is the cross-channel part and V the shared spatial part.
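A minimal NumPy sketch of this kernel split; the names U and V and the additive composition are our illustrative reading of the description above:

```python
import numpy as np

def build_kernels(U, V, k):
    """Assemble a reduced-parameter kernel from two parts.

    U: cross-channel part, shape (C_out, C) -- placed at the spatial center
    V: shared spatial part, shape (k, k) -- duplicated across all channels
    Returns W of shape (C_out, C, k, k), built from C_out*C + k*k
    parameters instead of C_out*C*k*k.
    """
    C_out, C = U.shape
    W = np.broadcast_to(V, (C_out, C, k, k)).copy()  # duplicate spatial part
    W[:, :, k // 2, k // 2] += U                     # add center part
    return W
```

The parameter count is U.size + V.size, i.e., C_out*C + k*k per position, matching the reduction claimed in the text.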
Residual Learning. Directly using the outputs of the kernel branch as the kernels in Eq. 6 easily leads to divergence on noisy real-world datasets. The reason is that good gradients flow back to the feature branch only if the convolutional layers in the kernel branch are well trained, and vice versa, so it is hard to train both from scratch simultaneously. Further, since the kernels are not shared spatially, the gradients at each position are more likely to be noisy, which makes the kernel branch even harder to train and further hinders the training of the feature branch.
We adopt residual learning to address this issue, learning the residual discrepancies from fixed base kernels. In particular, we add a constant 1/C to the central position of each kernel:

\tilde{W}_{c',c}(i, j) = W_{c',c}(i, j) + \frac{1}{C} \, \mathbb{1}[(i, j) = (0, 0)].
Initially, since the outputs of the kernel branch are close to zero, the DSCNN approximately averages the features from the feature branch. This guarantees sufficient and reliable gradients for back-propagation to the feature branch, which in turn benefits the training of the kernel branch.
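A toy check of this residual trick in NumPy (for brevity we use a single per-channel kernel without the output-channel dimension, an assumption for illustration): when the kernel branch outputs zeros, the assembled kernel reduces to an average over input channels.

```python
import numpy as np

def residual_kernels(K_raw, C):
    """Residual learning for dynamic kernels.

    Adds 1/C at each kernel's spatial center so that a zero-output kernel
    branch yields an average over the C input channels.
    K_raw: raw kernel-branch output, shape (..., C, k, k)
    """
    k = K_raw.shape[-1]
    K = K_raw.copy()
    K[..., k // 2, k // 2] += 1.0 / C
    return K
```

Applying such a kernel to an input patch at initialization returns the mean of the patch centers across channels, which is the "approximately averages" behaviour described above.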
4 Experiments
We evaluate our DSCNNs on object detection and optical flow estimation tasks. Our experimental results show that, first, with the much larger ERF illustrated in Fig. 6, DSCNNs achieve significant improvements in recognition ability; second, with position-specific dynamic kernels and local gradients, DSCNNs produce much sharper optical flow.
In the following subsections, we use w/ to denote with, w/o to denote without, A to denote the attention mechanism, R to denote residual learning, and N to denote the number of dynamic features. Since N in our DSCNNs is relatively small (e.g., 24) compared with conventional CNN settings, we optionally apply a post-conv layer to increase the dimension to match the conventional CNNs.
4.1 Object Detection
We use the PASCAL VOC datasets for object detection. Following the standard protocol, we train our DSCNNs on the union of VOC 2007 trainval and VOC 2012 trainval and test on the VOC 2007 and 2012 test sets. For evaluation, we use the standard mean average precision (mAP) with an IoU threshold of 0.5.
When applying our DSCNN, we insert it right between the feature extractor and the detection head, treating the dynamic features as complementary features that are concatenated with the original features before being fed into the detection head. In particular, we adopt ResNets as the feature extractor and combine R-FCN or CoupleNets with OHEM as the detection head. During training, we resize images to a shorter side of 600 pixels and adopt the SGD optimizer. We use pre-trained and fixed RPN proposals; concretely, the RPN network is trained separately, as in the first stage of the alternating training procedure. We train for 110k iterations on a single GPU, decaying the learning rate once after the first 80k iterations.
Table 1 (fragment):
Method           mAP(%) on VOC12   mAP(%) on VOC07   GPU   Time(ms)
Deform. Conv.    -                 80.6              K40   193
As shown in Table 1, DSCNN improves the R-FCN baseline's mAP by over 1.5% with only a small number of dynamic features. This implies that the position-specific dynamic features are a good supplement to the original feature space. Even though CoupleNets already explicitly consider global information with large receptive fields, the experimental results demonstrate that adding our DSCNN model is still beneficial.
Evaluation on Effective Receptive Fields. We evaluate the effective receptive fields (ERF) in this subsection. As illustrated in Fig. 6, with ResNet-50 as the backbone network, a single additional DSCNN layer provides a much larger ERF than the vanilla model, thanks to the multiple-sampling strategy. With larger ERFs, the network can effectively observe a larger region at each position and can thus gather information and recognize objects more easily. Further, Table 1 experimentally verifies the improvements in recognition ability provided by our DSCNNs.
Ablation Study on Numbers of Sampled Regions. We perform experiments to verify the advantages of applying more sampled regions in DSCNN.
Table 2 evaluates the effect of sampling in the neighbour regions. In the simple DFN model, where only each position's own region is sampled, the accuracy is still lower than the R-FCN baseline (77.0%), even though the attention and residual learning strategies are adopted. We argue that the reason is twofold: the simple DFN model has a limited receptive field, and the kernels at each position receive gradients only at that identical position, which easily leads to overfitting. With more sampled regions, we not only enlarge the receptive field in the feed-forward step but also stabilize the gradients in back-propagation.
As shown in Table 2, with enough sampled regions the mAP score surpasses the original R-FCN by 1.6%, and then saturates as more samples are added when the attention mechanism is applied.
Ablation Study on the Attention Mechanism. We verify the effectiveness of the attention mechanism in Table 3 with different sample strides and numbers of dynamic features. In the experiments without the attention mechanism, max pooling along the channel dimension is adopted. We observe that, in most cases, the attention mechanism improves mAP by more than 0.5% on VOC2007 detection. Especially as the number of dynamic features increases (e.g., to 32), the attention mechanism provides more benefit, increasing mAP by 1%, which indicates that it can further strengthen our DSCNNs.
Ablation Study on Residual Learning.
We perform experiments to verify that, across different numbers of dynamic features, residual learning contributes substantially to the convergence of our DSCNNs. As shown in Table 4, without residual learning DSCNNs can hardly converge on real-world datasets, and even when they do converge, the mAP is lower than expected. When our DSCNNs learn in a residual fashion, however, the mAP increases by about 10% on average.
Runtime Analysis. Since our model can be implemented efficiently on GPUs, with the computation at each position and sampled region done in parallel, DSCNN models can potentially be only slightly slower than several conventional convolutional layers of the same kernel size. Table 1 shows the efficiency of the DSCNN models.
4.2 Optical Flow Estimation
We perform experiments on optical flow estimation using the FlyingChairs dataset, a synthetic dataset with optical flow ground truth that is widely used for training deep-learning-based flow estimation methods. It consists of 22,872 image pairs and the corresponding flow fields. In our experiments we use FlowNetS and FlowNetC as baseline models, though other, more complicated models are also applicable. All of the baselines are fully convolutional networks that first downsample the input image pairs to learn semantic features and then upsample the features to estimate optical flow.
In our experiments, the DSCNN layer is inserted at a relatively shallow layer (i.e., the third conv layer) to produce sharper optical flow. In order to capture large displacements, we sample multiple regions with a large sample stride. We adopt dynamic features and a conv layer as the post-conv layer, after which a skip connection feeds the DSCNN outputs to the corresponding upsampling layer. We follow a similar training process to the baselines for fair comparison (we use 300k iterations with doubled batch size). As shown in Fig. 7, our DSCNNs output sharper and more accurate optical flow, thanks to the large receptive fields and the dynamic position-specific kernels. Since each position estimates optical flow with its own kernels, our DSCNN can better identify the contours of moving objects.
As illustrated in Fig. 4.2, DSCNN models successfully relax the constraint of spatially shared kernels and converge to a lower training loss with both FlowNetS and FlowNetC. This further indicates the advantage of local gradients in dense prediction tasks.
We use the average end-point error (aEPE) to quantitatively measure optical flow performance. Table 4.2 shows that the aEPE decreases by a large margin for all baseline models when a single DSCNN layer is added. For FlowNetS, the aEPE decreases by 0.79, which demonstrates the increased learning capacity and robustness of our DSCNN models. Even though the SegAware attention model explicitly exploits boundary information as additional training data, our DSCNN still slightly outperforms it with FlowNetS as the baseline. With multiple sampled regions and a large sample stride, we obtain much larger receptive fields, which allow the FlowNet models to easily capture large displacements on the FlyingChairs dataset.
5 Conclusion
This work introduces Dynamic Sampling Convolutional Neural Networks (DSCNNs), which learn dynamic position-specific kernels and take advantage of very large ERFs and local gradients, ensuring better performance on general tasks. With ERFs robustly enlarged via the multiple-sampling strategy, the recognition abilities of DSCNNs are significantly promoted. With local gradients and dynamic kernels, DSCNNs produce much sharper output features, which is especially beneficial in dense prediction tasks such as optical flow estimation.
-  De Brabandere, B., Jia, X., Tuytelaars, T., Van Gool, L.: Dynamic filter networks. In: Neural Information Processing Systems (NIPS). (2016)
-  Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88(2) (June 2010) 303–338
-  Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., v.d. Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: IEEE International Conference on Computer Vision (ICCV). (2015)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., eds.: Advances in Neural Information Processing Systems 25. Curran Associates, Inc. (2012) 1097–1105
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2016)
-  Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., Tang, X.: Residual attention network for image classification. arXiv preprint arXiv:1704.06904 (2017)
-  Dai, J., Li, Y., He, K., Sun, J.: R-fcn: Object detection via region-based fully convolutional networks. In: Advances in neural information processing systems. (2016) 379–387
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. (2015) 91–99
-  Girshick, R.: Fast r-cnn. In: The IEEE International Conference on Computer Vision (ICCV). (December 2015)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015)
-  Dai, J., He, K., Li, Y., Ren, S., Sun, J.: Instance-sensitive fully convolutional networks. In: European Conference on Computer Vision, Springer (2016) 534–549
-  Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y.: Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709 (2016)
-  Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. (2015) 2758–2766
-  Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925 (2016)
-  Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. arXiv preprint arXiv:1709.02371 (2017)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 1–9
-  Srivastava, N., Mansimov, E., Salakhudinov, R.: Unsupervised learning of video representations using lstms. In: International Conference on Machine Learning. (2015) 843–852
-  Luo, W., Li, Y., Urtasun, R., Zemel, R.S.: Understanding the effective receptive field in deep convolutional neural networks. In: NIPS. (2016)
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
-  Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. arXiv preprint arXiv:1703.06211 (2017)
-  Harley, A.W., Derpanis, K.G., Kokkinos, I.: Segmentation-aware convolutional networks using local attention masks. arXiv preprint arXiv:1708.04607 (2017)
-  Xue, T., Wu, J., Bouman, K., Freeman, B.: Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In: Advances in Neural Information Processing Systems. (2016) 91–99
-  Kim, J.H., Lee, S.W., Kwak, D., Heo, M.O., Kim, J., Ha, J.W., Zhang, B.T.: Multimodal residual learning for visual qa. In: Advances in Neural Information Processing Systems. (2016) 361–369
-  Long, M., Zhu, H., Wang, J., Jordan, M.I.: Unsupervised domain adaptation with residual transfer networks. In: Advances in Neural Information Processing Systems. (2016) 136–144
-  Sharma, S., Kiros, R., Salakhutdinov, R.: Action recognition using visual attention. arXiv preprint arXiv:1511.04119 (2015)
-  Wu, J., Wang, G., Yang, W., Ji, X.: Action recognition with joint attention on multi-level deep features. arXiv preprint arXiv:1607.02556 (2016)
-  Mnih, V., Heess, N., Graves, A., et al.: Recurrent models of visual attention. In: Advances in neural information processing systems. (2014) 2204–2212
-  Ba, J., Mnih, V., Kavukcuoglu, K.: Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755 (2014)
-  Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. (2015) 2048–2057
-  Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861 (2017)
-  Zhu, Y., Zhao, C., Wang, J., Zhao, X., Wu, Y., Lu, H.: Couplenet: Coupling global structure with local parts for object detection
-  Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 761–769
-  Ranjan, A., Black, M.J.: Optical flow estimation using a spatial pyramid network. arXiv preprint arXiv:1611.00850 (2016)
-  Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1164–1172
-  Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: Deepflow: Large displacement optical flow with deep matching. In: Proceedings of the IEEE International Conference on Computer Vision. (2013) 1385–1392