1 Introduction
Capturing long-range dependencies is of central importance in deep neural networks. For sequential data (e.g., in speech, language), recurrent operations [38, 23] are the dominant solution to long-range dependency modeling. For image data, long-distance dependencies are modeled by the large receptive fields formed by deep stacks of convolutional operations [14, 30].
Convolutional and recurrent operations both process a local neighborhood, either in space or time; thus long-range dependencies can only be captured when these operations are applied repeatedly, propagating signals progressively through the data. Repeating local operations has several limitations. First, it is computationally inefficient. Second, it causes optimization difficulties that need to be carefully addressed [23, 21]. Finally, these challenges make multi-hop dependency modeling, e.g., when messages need to be delivered back and forth between distant positions, difficult.
In this paper, we present non-local operations as an efficient, simple, and generic component for capturing long-range dependencies with deep neural networks. Our proposed non-local operation is a generalization of the classical non-local mean operation [4] in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps (Figure 1). The set of positions can be in space, time, or spacetime, implying that our operations are applicable for image, sequence, and video problems.
There are several advantages of using non-local operations: (a) In contrast to the progressive behavior of recurrent and convolutional operations, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance; (b) As we show in experiments, non-local operations are efficient and achieve their best results even with only a few layers (e.g., 5); (c) Finally, our non-local operations maintain the variable input sizes and can be easily combined with other operations (e.g., convolutions as we will use).
We showcase the effectiveness of non-local operations in the application of video classification. In videos, long-range interactions occur between distant pixels in space as well as time. A single non-local block, which is our basic unit, can directly capture these spacetime dependencies in a feedforward fashion. With a few non-local blocks, our architectures, called non-local neural networks, are more accurate for video classification than 2D and 3D convolutional networks [48] (including the inflated variant [7]). In addition, non-local neural networks are more computationally economical than their 3D convolutional counterparts. Comprehensive ablation studies are presented on the Kinetics [27] and Charades [44] datasets. Using RGB only and without any bells and whistles (e.g., optical flow, multi-scale testing), our method achieves results on par with or better than the latest competition winners on both datasets.
To demonstrate the generality of non-local operations, we further present object detection/segmentation and pose estimation experiments on the COCO dataset [33]. On top of the strong Mask R-CNN baseline [19], our non-local blocks can increase accuracy on all three tasks at a small extra computational cost. Together with the evidence on videos, these image experiments show that non-local operations are generally useful and can become a basic building block in designing deep neural networks.
2 Related Work
Non-local image processing.
Non-local means [4] is a classical filtering algorithm that computes a weighted mean of all pixels in an image. It allows distant pixels to contribute to the filtered response at a location based on patch appearance similarity. This non-local filtering idea was later developed into BM3D (block-matching 3D) [10], which performs filtering on a group of similar, but non-local, patches. BM3D is a solid image denoising baseline even compared with deep neural networks [5]. Block matching was used with neural networks for image denoising [6, 31]. Non-local matching is also the essence of successful texture synthesis [12], super-resolution [16], and inpainting [1] algorithms.
Graphical models.
Long-range dependencies can be modeled by graphical models such as conditional random fields (CRF) [29, 28]. In the context of deep neural networks, a CRF can be exploited to post-process semantic segmentation predictions of a network [9]. The iterative mean-field inference of CRF can be turned into a recurrent network and trained [56, 42, 8, 18, 34]. In contrast, our method is a simpler feedforward block for computing non-local filtering. Unlike these methods that were developed for segmentation, our general-purpose component is applied for classification and detection. These methods and ours are also related to a more abstract model called graph neural networks [41].
Feedforward modeling for sequences.
Recently there emerged a trend of using feedforward (i.e., non-recurrent) networks for modeling sequences in speech and language [36, 54, 15]. In these methods, long-term dependencies are captured by the large receptive fields contributed by very deep 1D convolutions. These feedforward models are amenable to parallelized implementations and can be more efficient than widely used recurrent models.
Self-attention.
Our work is related to the recent self-attention [49] method for machine translation. A self-attention module computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space. As we will discuss next, self-attention can be viewed as a form of the non-local mean [4], and in this sense our work bridges self-attention for machine translation to the more general class of non-local filtering operations that are applicable to image and video problems in computer vision.
Interaction networks.
Interaction Networks (IN) [2, 52] were proposed recently for modeling physical systems. They operate on graphs of objects involved in pairwise interactions. Hoshen [24] presented the more efficient Vertex Attention IN (VAIN) in the context of multi-agent predictive modeling. Another variant, named Relation Networks [40], computes a function on the feature embeddings at all pairs of positions in its input. Our method also processes all pairs, as we will explain (f in Eq.(1)). While our non-local networks are connected to these approaches, our experiments indicate that the non-locality of the model, which is orthogonal to the ideas of attention/interaction/relation (e.g., a network can attend to a local region), is the key to their empirical success. Non-local modeling, a long-time crucial element of image processing (e.g., [12, 4]), has been largely overlooked in recent neural networks for computer vision.
Video classification architectures.
A natural solution to video classification is to combine the success of CNNs for images and RNNs for sequences [55, 11]. In contrast, feedforward models are achieved by 3D convolutions (C3D) [26, 48] in spacetime, and the 3D filters can be formed by “inflating” [13, 7] pretrained 2D filters. In addition to end-to-end modeling on raw video inputs, it has been found that optical flow [45] and trajectories [50, 51] can be helpful. Both flow and trajectories are off-the-shelf modules that may find long-range, non-local dependency. A systematic comparison of video architectures can be found in [7].
3 Non-local Neural Networks
We first give a general definition of non-local operations and then provide several specific instantiations of them.
3.1 Formulation
Following the non-local mean operation [4], we define a generic non-local operation in deep neural networks as:
y_i = (1/C(x)) Σ_∀j f(x_i, x_j) g(x_j)    (1)
Here i is the index of an output position (in space, time, or spacetime) whose response is to be computed and j is the index that enumerates all possible positions. x is the input signal (image, sequence, video; often their features) and y is the output signal of the same size as x. A pairwise function f computes a scalar (representing relationship such as affinity) between i and all j. The unary function g computes a representation of the input signal at the position j. The response is normalized by a factor C(x).
The non-local behavior in Eq.(1) is due to the fact that all positions (∀j) are considered in the operation. As a comparison, a convolutional operation sums up the weighted input in a local neighborhood (e.g., i−1 ≤ j ≤ i+1 in a 1D case with kernel size 3), and a recurrent operation at time i is often based only on the current and the latest time steps (e.g., j = i or i−1).
The non-local operation is also different from a fully-connected (fc) layer. Eq.(1) computes responses based on relationships between different locations, whereas fc uses learned weights. In other words, the relationship between x_j and x_i is not a function of the input data in fc, unlike in non-local layers. Furthermore, our formulation in Eq.(1) supports inputs of variable sizes, and maintains the corresponding size in the output. On the contrary, an fc layer requires a fixed-size input/output and loses positional correspondence (e.g., that from x_i to y_i at the position i).
A non-local operation is a flexible building block and can be easily used together with convolutional/recurrent layers. It can be added into the earlier part of deep neural networks, unlike fc layers that are often used in the end. This allows us to build a richer hierarchy that combines both non-local and local information.
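As a concrete illustration of Eq.(1), the following is a minimal NumPy sketch of the generic non-local operation. The function name `nonlocal_op` and the dense double loop are for exposition only; this is not the paper's (batched, convolutional) implementation.

```python
import numpy as np

def nonlocal_op(x, f, g):
    """Generic non-local operation of Eq.(1), written densely for clarity.

    x : (N, C) array -- N positions (flattened space, time, or spacetime),
        C feature channels per position.
    f : pairwise function f(x_i, x_j) -> scalar affinity.
    g : unary function g(x_j) -> representation of the input at position j.
    """
    N = x.shape[0]
    gx = np.stack([g(x[j]) for j in range(N)])           # (N, C')
    y = np.empty_like(gx)
    for i in range(N):
        # Affinities of position i to ALL positions j, near or far.
        w = np.array([f(x[i], x[j]) for j in range(N)])
        # Weighted sum over all j, normalized by C(x) = sum_j f(x_i, x_j).
        y[i] = (w[:, None] * gx).sum(axis=0) / w.sum()
    return y

# Gaussian instantiation: f(x_i, x_j) = exp(x_i . x_j), g = identity.
rng = np.random.default_rng(0)
x = rng.standard_normal((6, 4))
y = nonlocal_op(x, f=lambda a, b: np.exp(a @ b), g=lambda v: v)
assert y.shape == x.shape  # output keeps the input size
```

Note that the output has the same number of positions as the input, which is what lets the operation handle variable input sizes.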
3.2 Instantiations
Next we describe several versions of f and g. Interestingly, we will show by experiments (Table 2a) that our non-local models are not sensitive to these choices, indicating that the generic non-local behavior is the main reason for the observed improvements.
For simplicity, we only consider g in the form of a linear embedding: g(x_j) = W_g x_j, where W_g is a weight matrix to be learned. This is implemented as, e.g., 1×1 convolution in space or 1×1×1 convolution in spacetime.
Next we discuss choices for the pairwise function f.
Gaussian.
Following the non-local mean [4] and bilateral filters [47], a natural choice of f is the Gaussian function. In this paper we consider:
f(x_i, x_j) = e^{x_i^T x_j}    (2)
Here x_i^T x_j is dot-product similarity. Euclidean distance as used in [4, 47] is also applicable, but dot product is more implementation-friendly in modern deep learning platforms. The normalization factor is set as C(x) = Σ_∀j f(x_i, x_j).
Embedded Gaussian.
A simple extension of the Gaussian function is to compute similarity in an embedding space. In this paper we consider:
f(x_i, x_j) = e^{θ(x_i)^T φ(x_j)}    (3)
Here θ(x_i) = W_θ x_i and φ(x_j) = W_φ x_j are two embeddings. As above, we set C(x) = Σ_∀j f(x_i, x_j).
We note that the self-attention module [49] recently presented for machine translation is a special case of non-local operations in the embedded Gaussian version. This can be seen from the fact that for a given i, (1/C(x)) f(x_i, x_j) becomes the softmax computation along the dimension j. So we have y = softmax(x^T W_θ^T W_φ x) g(x), which is the self-attention form in [49]. As such, our work provides insight by relating this recent self-attention model to the classic computer vision method of non-local means [4], and extends the sequential self-attention network in [49] to a generic space/spacetime non-local network for image/video recognition in computer vision.
Despite the relation to [49], we show that the attentional behavior (due to softmax) is not essential in the applications we study. To show this, we describe two alternative versions of non-local operations next.
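To make this equivalence concrete, the following NumPy sketch checks numerically that the embedded Gaussian non-local mean coincides with a softmax along the dimension j, as in self-attention. The weights `W_theta`, `W_phi`, `W_g` here are random stand-ins for the learned embeddings, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, C = 5, 8
x = rng.standard_normal((N, C))
W_theta, W_phi, W_g = (rng.standard_normal((C, C // 2)) for _ in range(3))

logits = (x @ W_theta) @ (x @ W_phi).T        # theta(x_i)^T phi(x_j), (N, N)

# Route 1: explicit non-local mean with f = exp(logits), C(x) = sum_j f.
f = np.exp(logits)
y1 = (f / f.sum(axis=1, keepdims=True)) @ (x @ W_g)

# Route 2: softmax along dimension j, the self-attention form of [49].
expl = np.exp(logits - logits.max(axis=1, keepdims=True))
y2 = (expl / expl.sum(axis=1, keepdims=True)) @ (x @ W_g)

assert np.allclose(y1, y2)  # the two views coincide
```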
Dot product.
f can be defined as a dot-product similarity:
f(x_i, x_j) = θ(x_i)^T φ(x_j)    (4)
Here we adopt the embedded version. In this case, we set the normalization factor as C(x) = N, where N is the number of positions in x, rather than the sum of f, because it simplifies gradient computation. A normalization like this is necessary because the input can have variable size.
The main difference between the dot product and embedded Gaussian versions is the presence of softmax, which plays the role of an activation function.
Concatenation.
Concatenation is used by the pairwise function in Relation Networks [40] for visual reasoning. We also evaluate a concatenation form of f:
f(x_i, x_j) = ReLU(w_f^T [θ(x_i), φ(x_j)])    (5)
Here [·, ·] denotes concatenation and w_f is a weight vector that projects the concatenated vector to a scalar. As above, we set C(x) = N. In this case, we adopt ReLU [35] in f.
The above several variants demonstrate the flexibility of our generic non-local operation. We believe alternative versions are possible and may improve results.
3.3 Non-local Block
We wrap the non-local operation in Eq.(1) into a non-local block that can be incorporated into many existing architectures. We define a non-local block as:
z_i = W_z y_i + x_i    (6)
where y_i is given in Eq.(1) and “+x_i” denotes a residual connection [21]. The residual connection allows us to insert a new non-local block into any pretrained model, without breaking its initial behavior (e.g., if W_z is initialized as zero). An example non-local block is illustrated in Figure 2. The pairwise computation in Eq.(2), (3), or (4) can be simply done by matrix multiplication as shown in Figure 2; the concatenation version in (5) is straightforward.
The pairwise computation of a non-local block is lightweight when it is used in high-level, subsampled feature maps. For example, typical values in Figure 2 are T = 4 and H = W = 14 or 7. The pairwise computation as done by matrix multiplication is comparable to a typical convolutional layer in standard networks. We further adopt the following implementations that make it more efficient.
Implementation of Non-local Blocks.
We set the number of channels represented by W_g, W_θ, and W_φ to be half of the number of channels in x. This follows the bottleneck design of [21] and reduces the computation of a block by about a half. The weight matrix W_z in Eq.(6) computes a position-wise embedding on y_i, matching the number of channels to that of x. See Figure 2.
A subsampling trick can be used to further reduce computation. We modify Eq.(1) as: y_i = (1/C(x̂)) Σ_∀j f(x_i, x̂_j) g(x̂_j), where x̂ is a subsampled version of x (e.g., by pooling). We perform this in the spatial domain, which can reduce the amount of pairwise computation by 1/4. This trick does not alter the non-local behavior, but only makes the computation sparser. This can be done by adding a max pooling layer after φ and g in Figure 2.
We use these efficient modifications for all non-local blocks studied in this paper.
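Putting the pieces together, here is a NumPy sketch of an embedded-Gaussian non-local block (Eq.(6)) with the bottleneck and subsampling tricks. The weights are random placeholders for learned parameters, and simple strided subsampling of positions stands in for the max pooling applied after φ and g; this is an illustration, not the paper's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def nonlocal_block(x, W_theta, W_phi, W_g, W_z, stride=2):
    """Embedded-Gaussian non-local block on x: (N, C).

    The embeddings map C -> C/2 channels (bottleneck) and W_z maps back
    to C; `stride` mimics the pooling after phi and g by subsampling the
    positions that are attended to (but not the output positions)."""
    x_hat = x[::stride]                                 # subsampled positions
    attn = softmax((x @ W_theta) @ (x_hat @ W_phi).T)   # (N, N/stride)
    y = attn @ (x_hat @ W_g)                            # (N, C/2)
    return y @ W_z + x                                  # z_i = W_z y_i + x_i

rng = np.random.default_rng(2)
C = 8
x = rng.standard_normal((10, C))
Wt, Wp, Wg = (rng.standard_normal((C, C // 2)) for _ in range(3))
Wz = np.zeros((C // 2, C))   # zero-init: the block starts as an identity
assert np.allclose(nonlocal_block(x, Wt, Wp, Wg, Wz), x)
```

The final assertion illustrates why a zero-initialized output projection lets the block be inserted into a pretrained network without changing its initial behavior.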
Table 1. Our baseline ResNet-50 C2D model for video. The input is a 32×224×224 clip; output sizes are T×H×W. Residual blocks are shown in brackets.

layer | output size
conv1: 7×7, 64, stride 2, 2, 2 | 16×112×112
pool1: 3×3×3 max, stride 2, 2, 2 | 8×56×56
res2: [1×1, 64; 3×3, 64; 1×1, 256] ×3 | 8×56×56
pool2: 3×1×1 max, stride 2, 1, 1 | 4×56×56
res3: [1×1, 128; 3×3, 128; 1×1, 512] ×4 | 4×28×28
res4: [1×1, 256; 3×3, 256; 1×1, 1024] ×6 | 4×14×14
res5: [1×1, 512; 3×3, 512; 1×1, 2048] ×3 | 4×7×7
global average pool, fc | 1×1×1
4 Video Classification Models
To understand the behavior of non-local networks, we conduct comprehensive ablation experiments on video classification tasks. First we describe our baseline network architectures for this task, and then extend them into 3D ConvNets [48, 7] and our proposed non-local nets.
2D ConvNet baseline (C2D).
To isolate the temporal effects of our non-local nets vs. 3D ConvNets, we construct a simple 2D baseline architecture in which the temporal dimension is trivially addressed (i.e., only by pooling).
Table 1 shows our C2D baseline under a ResNet-50 backbone. The input video clip has 32 frames, each with 224×224 pixels. All convolutions in Table 1 are in essence 2D kernels that process the input frame-by-frame (implemented as 1×k×k kernels). This model can be directly initialized from the ResNet weights pretrained on ImageNet. A ResNet-101 counterpart is built in the same way.
The only operations involving the temporal domain are the pooling layers. In other words, this baseline simply aggregates temporal information.
Inflated 3D ConvNet (I3D).
As done in [13, 7], one can turn the C2D model in Table 1 into a 3D convolutional counterpart by “inflating” the kernels. For example, a 2D k×k kernel can be inflated as a 3D t×k×k kernel that spans t frames. This kernel can be initialized from 2D models (pretrained on ImageNet): each of the t planes in the t×k×k kernel is initialized by the pretrained k×k weights, rescaled by 1/t. If a video consists of a single static frame repeated in time, this initialization produces the same results as the 2D pretrained model run on a static frame.
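The inflation rule is simple enough to sketch directly. The helper name `inflate_kernel` is illustrative; the test property below (summing the inflated kernel over time recovers the 2D kernel) is exactly why a temporally constant input reproduces the 2D model's response.

```python
import numpy as np

def inflate_kernel(w2d, t):
    """Inflate a pretrained 2D k x k kernel into a 3D t x k x k kernel:
    replicate the 2D weights along the time axis and rescale by 1/t."""
    return np.repeat(w2d[None, ...], t, axis=0) / t

w2d = np.arange(9, dtype=float).reshape(3, 3)
w3d = inflate_kernel(w2d, t=3)
assert w3d.shape == (3, 3, 3)
# A static video repeats the same frame, so convolving with w3d sums the
# t identical per-frame responses, each scaled by 1/t -- i.e., the 2D output.
assert np.allclose(w3d.sum(axis=0), w2d)
```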
We study two cases of inflations: we either inflate the 3×3 kernel in a residual block to 3×3×3 (similar to [7]), or the first 1×1 kernel in a residual block to 3×1×1 (similar to [13]). We denote these as I3D_3×3×3 and I3D_3×1×1. As 3D convolutions are computationally intensive, we only inflate one kernel for every 2 residual blocks; inflating more layers shows diminishing return. We inflate conv1 to 5×7×7.
The authors of [7] have shown that I3D models are more accurate than their CNN+LSTM counterparts.
Non-local network.
We insert non-local blocks into C2D or I3D to turn them into non-local nets. We investigate adding 1, 5, or 10 non-local blocks; the implementation details are described in the next section in context.
4.1 Implementation Details
Training.
Our models are pretrained on ImageNet [39]. Unless specified, we fine-tune our models using 32-frame input clips. These clips are formed by randomly cropping out 64 consecutive frames from the original full-length video and then dropping every other frame. The spatial size is 224×224 pixels, randomly cropped from a scaled video whose shorter side is randomly sampled in [256, 320] pixels, following [46]. We train on an 8-GPU machine and each GPU has 8 clips in a mini-batch (so in total a mini-batch size of 64 clips). We train our models for 400k iterations in total, starting with a learning rate of 0.01 and reducing it by a factor of 10 at every 150k iterations (see also Figure 4). We use a momentum of 0.9 and a weight decay of 0.0001. We adopt dropout [22] after the global pooling layer, with a dropout ratio of 0.5. We fine-tune our models with BatchNorm (BN) [25] enabled when it is applied. This is in contrast to common practice [21] of fine-tuning ResNets, where BN was frozen. We have found that enabling BN in our application reduces overfitting.
We adopt the method in [20] to initialize the weight layers introduced in the non-local blocks. We add a BN layer right after the last 1×1×1 layer that represents W_z; we do not add BN to other layers in a non-local block. The scale parameter of this BN layer is initialized as zero, following [17]. This ensures that the initial state of the entire non-local block is an identity mapping, so it can be inserted into any pretrained networks while maintaining their initial behavior.
Inference.
Following [46] we perform spatially fully-convolutional inference on videos whose shorter side is rescaled to 256. For the temporal domain, in our practice we sample 10 clips evenly from a full-length video and compute the softmax scores on them individually. The final prediction is the averaged softmax score of all clips.
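The clip-averaging step of this inference protocol can be sketched in a few lines. The function name `video_prediction` is illustrative; `clip_logits` stands for the per-clip network outputs, shape (num_clips, num_classes).

```python
import numpy as np

def video_prediction(clip_logits):
    """Average the per-clip softmax scores into one video-level score."""
    e = np.exp(clip_logits - clip_logits.max(axis=1, keepdims=True))
    scores = e / e.sum(axis=1, keepdims=True)   # softmax per clip
    return scores.mean(axis=0)                  # average over the clips

logits = np.array([[2.0, 0.5, 0.1],   # e.g., 2 clips, 3 classes
                   [1.5, 1.0, 0.2]])
p = video_prediction(logits)
assert np.isclose(p.sum(), 1.0) and p.argmax() == 0
```

Averaging probabilities (rather than logits) keeps the final prediction a valid distribution over classes.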
5 Experiments on Video Classification
We perform comprehensive studies on the challenging Kinetics dataset [27]. We also report results on the Charades dataset [44] to show the generality of our models.
5.1 Experiments on Kinetics
Kinetics [27] contains 246k training videos and 20k validation videos. It is a classification task involving 400 human action categories. We train all models on the training set and test on the validation set.
Figure 4 shows the curves of the training procedure of a ResNet-50 C2D baseline vs. a non-local C2D with 5 blocks (more details in the following). Our non-local C2D model is consistently better than the C2D baseline throughout the training procedure, in both training and validation error.
Figure 1 and Figure 3 visualize several examples of the behavior of a non-local block computed by our models. Our network can learn to find meaningful relational clues regardless of the distance in space and time.
Table 2 shows the ablation results, analyzed as follows:
Instantiations.
Table 2a compares different types of a single non-local block added to the C2D baseline (right before the last residual block of res4). Even adding one non-local block can lead to about 1% improvement over the baseline.
Interestingly, the embedded Gaussian, dot-product, and concatenation versions perform similarly, up to some random variations (72.7 to 72.9). As discussed in Sec. 3.2, the non-local operation with the Gaussian kernel becomes similar to the self-attention module [49]. However, our experiments show that the attentional (softmax) behavior of this module is not the key to the improvement in our applications; instead, it is more likely that the non-local behavior is important, and it is insensitive to the instantiations.
In the rest of this paper, we use the embedded Gaussian version by default. This version is easier to visualize as its softmax scores are in the range of [0, 1].
Which stage to add non-local blocks?
Table 2b compares a single non-local block added to different stages of ResNet. The block is added right before the last residual block of a stage. The improvement of a non-local block on res2, res3, or res4 is similar, and on res5 it is slightly smaller. One possible explanation is that res5 has a small spatial size (7×7) and it is insufficient to provide precise spatial information. More evidence of a non-local block exploiting spatial information will be investigated in Table 2d.
Table 3. Comparisons with state-of-the-art results in Kinetics, reported on the val and test sets.

model | backbone | modality | top-1 val | top-5 val | top-1 test | top-5 test | avg test
I3D in [7] | Inception | RGB | 72.1 | 90.3 | 71.1 | 89.3 | 80.2
2-Stream I3D in [7] | Inception | RGB + flow | 75.7 | 92.0 | 74.2 | 91.3 | 82.8
RGB baseline in [3] | Inception-ResNet-v2 | RGB | 73.0 | 90.9 | - | - | -
3-stream late fusion [3] | Inception-ResNet-v2 | RGB + flow + audio | 74.9 | 91.6 | - | - | -
3-stream LSTM [3] | Inception-ResNet-v2 | RGB + flow + audio | 77.1 | 93.2 | - | - | -
3-stream SATT [3] | Inception-ResNet-v2 | RGB + flow + audio | 77.7 | 93.2 | - | - | -
NL I3D [ours] | ResNet-50 | RGB | 76.5 | 92.6 | - | - | -
NL I3D [ours] | ResNet-101 | RGB | 77.7 | 93.3 | - | - | 83.8
Going deeper with non-local blocks.
Table 2c shows the results of more non-local blocks. We add 1 block (to res4), 5 blocks (3 to res4 and 2 to res3, to every other residual block), or 10 blocks (to every residual block in res3 and res4) in ResNet-50; in ResNet-101 we add them to the corresponding residual blocks. Table 2c shows that more non-local blocks in general lead to better results. We argue that multiple non-local blocks can perform long-range multi-hop communication. Messages can be delivered back and forth between distant positions in spacetime, which is hard to do via local models.
It is noteworthy that the improvement of non-local blocks is not just because they add depth to the baseline model. To see this, we note that in Table 2c the non-local 5-block ResNet-50 model has 73.8 accuracy, higher than the deeper ResNet-101 baseline's 73.1. However, the 5-block ResNet-50 has only about 70% of the parameters and 80% of the FLOPs of the ResNet-101 baseline, and is also shallower. This comparison shows that the improvement due to non-local blocks is complementary to going deeper in standard ways.
We have also tried to add standard residual blocks, instead of non-local blocks, to the baseline models. The accuracy is not increased. This again shows that the improvement of non-local blocks is not just because they add depth.
Non-local in spacetime.
Our method can naturally handle spacetime signals. This is a nice property: related objects in a video can appear at distant locations in space and across long time intervals, and their dependency can be captured by our model.
In Table 2d we study the effect of non-local blocks applied along space, time, or spacetime. For example, in the space-only version, the non-local dependency only happens within the same frame: i.e., in Eq.(1) it only sums over the index j in the same frame as the index i. The time-only version can be set up similarly. Table 2d shows that both the space-only and time-only versions improve over the C2D baseline, but are inferior to the spacetime version.
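One simple way to realize the space-only restriction is to mask the affinity matrix so that cross-frame terms are zeroed before normalization. The following NumPy sketch (with illustrative sizes and random affinities) shows the idea; the time-only version would compare spatial indices instead of frame indices.

```python
import numpy as np

# T frames of H*W positions each; build a (THW, THW) same-frame mask.
T, HW = 3, 4
frame_id = np.repeat(np.arange(T), HW)               # frame index per position
same_frame = frame_id[:, None] == frame_id[None, :]  # True iff i, j share a frame

rng = np.random.default_rng(3)
f = np.exp(rng.standard_normal((T * HW, T * HW)))    # full spacetime affinities
f_space = np.where(same_frame, f, 0.0)               # drop cross-frame terms
w = f_space / f_space.sum(axis=1, keepdims=True)     # normalize within each frame

assert np.allclose(w.sum(axis=1), 1.0)   # each row is still a valid weighting
assert (w[0, HW:] == 0).all()            # position 0 ignores the other frames
```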
Non-local net vs. 3D ConvNet.
Table 2e compares our non-local C2D version with the inflated 3D ConvNets. Non-local operations and 3D convolutions can be seen as two ways of extending C2D to the temporal dimensions.
Table 2e also compares the number of parameters and FLOPs, relative to the baseline. Our non-local C2D model is more accurate than the I3D counterpart (e.g., 75.1 vs. 74.4), while having a smaller number of FLOPs (1.2× vs. 1.5×). This comparison shows that our method can be more effective than 3D convolutions when used alone.
Non-local 3D ConvNet.
Despite the above comparison, non-local operations and 3D convolutions can model different aspects of the problem: 3D convolutions can capture local dependency. Table 2f shows the results of inserting 5 non-local blocks into the I3D models. These non-local I3D (NL I3D) models improve over their I3D counterparts (+1.6 point accuracy), showing that non-local operations and 3D convolutions are complementary.
Longer sequences.
Finally we investigate the generality of our models on longer input videos. We use input clips consisting of 128 consecutive frames without subsampling. The sequences throughout all layers in the networks are thus 4× longer compared to the 32-frame counterparts. To fit this model into memory, we reduce the mini-batch size to 2 clips per GPU. As a result of using small mini-batches, we freeze all BN layers in this case. We initialize this model from the corresponding models trained with 32-frame inputs. We fine-tune on 128-frame inputs using the same number of iterations as the 32-frame case (though the mini-batch size is now smaller), starting with a learning rate of 0.0025. Other implementation details are the same as before.
Comparisons with state-of-the-art results.
Table 3 shows the results from the I3D authors [7] and from the Kinetics 2017 competition winner [3]. We note that these are comparisons of systems which can differ in many aspects. Nevertheless, our method surpasses all the existing RGB or RGB + flow based methods by a good margin. Without using optical flow and without any bells and whistles, our method is on par with the heavily engineered results of the 2017 competition winner.
Table 4. Results on the Charades dataset [44]. “train/val” models are trained on the train set and tested on val; “trainval/test” models are trained on train+val and tested on the test set.

model | modality | train/val | trainval/test
2-Stream [43] | RGB + flow | 18.6 | -
2-Stream + LSTM [43] | RGB + flow | 17.8 | -
Asyn-TF [43] | RGB + flow | 22.4 | -
I3D [7] | RGB | 32.9 | 34.4
I3D [ours] | RGB | 35.5 | 37.2
NL I3D [ours] | RGB | 37.5 | 39.5
5.2 Experiments on Charades
Charades [44] is a video dataset with 8k training, 1.8k validation, and 2k testing videos. It is a multi-label classification task with 157 action categories. We use a per-category sigmoid output to handle the multi-label property.
We initialize our models pretrained on Kinetics (128-frame). The mini-batch size is set to 1 clip per GPU. We train our models for 200k iterations, starting from a learning rate of 0.00125 and reducing it by 10 every 75k iterations. We use a jittering strategy similar to that in Kinetics to determine the location of the 224×224 cropping window, but we rescale the video such that this cropping window outputs 288×288 pixels, on which we fine-tune our network. We test on a single scale of 320 pixels.
Table 4 shows the comparisons with the previous results on Charades. The result of [7] is the 2017 competition winner in Charades, which was also fine-tuned from models pretrained on Kinetics. Our I3D baseline is higher than previous results. As a controlled comparison, our non-local net improves over our I3D baseline by 2.3% on the test set.
6 Extension: Experiments on COCO
We also investigate our models on static image recognition. We experiment on the Mask R-CNN baseline [19] for COCO [33] object detection/segmentation and human pose estimation (keypoint detection). The models are trained on COCO train2017 (i.e., trainval35k in 2014) and tested on val2017 (i.e., minival in 2014).
Object detection and instance segmentation.
We modify the Mask R-CNN backbone by adding one non-local block (right before the last residual block of res4). All models are fine-tuned from ImageNet pretraining. We evaluate on a standard baseline of ResNet-50/101 and a high baseline of ResNeXt-152 (X152) [53]. Unlike the original paper [19] that adopted stage-wise training regarding RPN, we use an improved implementation with end-to-end joint training similar to [37], which leads to higher baselines than [19].
Table 5 shows the box and mask AP on COCO. We see that a single non-local block improves all R50/101 and X152 baselines, on all metrics involving detection and segmentation. AP is increased by about 1 point in all cases (e.g., +1.3 box AP points in R101). Our non-local block is complementary to increasing the model capacity, even when the model is upgraded from R50/101 to X152. This comparison suggests that non-local dependency has not been sufficiently captured by existing models despite increased depth/capacity.
In addition, the above gain comes at a very small cost. The single non-local block only adds 5% computation to the baseline model. We also tried adding more non-local blocks to the backbone, but found diminishing return.
Table 5. Adding 1 non-local block to the Mask R-CNN baseline on COCO object detection and instance segmentation.

method | AP^box | AP^box_50 | AP^box_75 | AP^mask | AP^mask_50 | AP^mask_75
R50 baseline | 38.0 | 59.6 | 41.0 | 34.6 | 56.4 | 36.5
R50 +1 NL | 39.0 | 61.1 | 41.9 | 35.5 | 58.0 | 37.4
R101 baseline | 39.5 | 61.4 | 42.9 | 36.0 | 58.1 | 38.3
R101 +1 NL | 40.8 | 63.1 | 44.5 | 37.1 | 59.9 | 39.2
X152 baseline | 44.1 | 66.4 | 48.4 | 39.7 | 63.2 | 42.2
X152 +1 NL | 45.0 | 67.8 | 48.9 | 40.3 | 64.4 | 42.8
Keypoint detection.
Next we evaluate non-local blocks in Mask R-CNN for keypoint detection. In [19], Mask R-CNN used a stack of 8 convolutional layers for predicting the keypoints as 1-hot masks. These layers are local operations and may overlook the dependency among keypoints across long distances. Motivated by this, we insert 4 non-local blocks into the keypoint head (after every 2 convolutional layers).
Table 6 shows the results on COCO. On a strong baseline of R101, adding 4 non-local blocks to the keypoint head leads to a 1 point increase of keypoint AP. If we add one extra non-local block to the backbone as done for object detection, we observe in total a 1.4 point increase of keypoint AP over the baseline. In particular, we see that the stricter criterion of AP_75 is boosted by 2.4 points, suggesting stronger localization performance.
Table 6. Adding non-local blocks to Mask R-CNN for COCO keypoint detection.

model | AP | AP_50 | AP_75
R101 baseline | 65.1 | 86.8 | 70.4
NL, +4 in head | 66.0 | 87.1 | 71.7
NL, +4 in head, +1 in backbone | 66.5 | 87.3 | 72.8
7 Conclusion
We presented a new class of neural networks which capture long-range dependencies via non-local operations. Our non-local blocks can be combined with any existing architectures. We show the significance of non-local modeling for the tasks of video classification, object detection and segmentation, and pose estimation. On all tasks, a simple addition of non-local blocks provides solid improvement over baselines. We hope non-local layers will become an important component of future network architectures.
Acknowledgement: This work was partially supported by ONR MURI N000141612007, Sloan, Okawa Fellowship to AG and NVIDIA Fellowship to XW. We would also like to thank Haoqi Fan, Du Tran, Heng Wang, Georgia Gkioxari and Piotr Dollar for many helpful discussions.
References
 [1] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. PatchMatch: A randomized correspondence algorithm for structural image editing. In Proceedings of SIGGRAPH, ACM Transactions on Graphics, 2009.
 [2] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. Interaction networks for learning about objects, relations and physics. In Neural Information Processing Systems (NIPS), 2016.
 [3] Y. Bian, C. Gan, X. Liu, F. Li, X. Long, Y. Li, H. Qi, J. Zhou, S. Wen, and Y. Lin. Revisiting the effectiveness of off-the-shelf temporal modeling approaches for large-scale video classification. arXiv:1708.03805, 2017.
 [4] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Computer Vision and Pattern Recognition (CVPR), 2005.
 [5] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Computer Vision and Pattern Recognition (CVPR), 2012.
 [6] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising with multi-layer perceptrons, part 2: training trade-offs and analysis of their mechanisms. arXiv:1211.1552, 2012.
 [7] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [8] S. Chandra, N. Usunier, and I. Kokkinos. Dense and low-rank Gaussian CRFs using deep embeddings. In International Conference on Computer Vision (ICCV), 2017.
 [9] L.C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. arXiv:1412.7062, 2014.
 [10] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3D transform-domain collaborative filtering. Transactions on Image Processing (TIP), 2007.
 [11] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Computer Vision and Pattern Recognition (CVPR), 2015.
 [12] A. A. Efros and T. K. Leung. Texture synthesis by nonparametric sampling. In International Conference on Computer Vision (ICCV), 1999.
 [13] C. Feichtenhofer, A. Pinz, and R. Wildes. Spatiotemporal residual networks for video action recognition. In Neural Information Processing Systems (NIPS), 2016.
 [14] K. Fukushima and S. Miyake. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets. Springer, 1982.
 [15] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin. Convolutional sequence to sequence learning. In International Conference on Machine Learning (ICML), 2017.
 [16] D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In Computer Vision and Pattern Recognition (CVPR), 2009.
 [17] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
 [18] A. Harley, K. Derpanis, and I. Kokkinos. Segmentation-aware convolutional networks using local attention masks. In International Conference on Computer Vision (ICCV), 2017.
 [19] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In International Conference on Computer Vision (ICCV), 2017.
 [20] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In International Conference on Computer Vision (ICCV), 2015.
 [21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.
 [22] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580, 2012.
 [23] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
 [24] Y. Hoshen. Multi-agent predictive modeling with attentional CommNets. In Neural Information Processing Systems (NIPS), 2017.
 [25] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.
 [26] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. In International Conference on Machine Learning (ICML), 2010.
 [27] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The Kinetics human action video dataset. arXiv:1705.06950, 2017.
 [28] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Neural Information Processing Systems (NIPS), 2011.
 [29] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.
 [30] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1989.
 [31] S. Lefkimmiatis. Nonlocal color image denoising with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [32] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [33] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV). 2014.
 [34] S. Liu, S. De Mello, J. Gu, G. Zhong, M.H. Yang, and J. Kautz. Learning affinity via spatial propagation networks. In Neural Information Processing Systems (NIPS), 2017.
 [35] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.
 [36] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv:1609.03499, 2016.
 [37] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
 [38] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 1986.
 [39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015.
 [40] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Neural Information Processing Systems (NIPS), 2017.
 [41] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.
 [42] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351, 2015.
 [43] G. A. Sigurdsson, S. Divvala, A. Farhadi, and A. Gupta. Asynchronous temporal fields for action recognition. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [44] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision (ECCV), 2016.
 [45] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In Neural Information Processing Systems (NIPS), 2014.
 [46] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
 [47] C. Tomasi and R. Manduchi. Bilateral filtering for gray and color images. In International Conference on Computer Vision (ICCV), 1998.
 [48] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In International Conference on Computer Vision (ICCV), 2015.
 [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems (NIPS), 2017.
 [50] H. Wang and C. Schmid. Action recognition with improved trajectories. In International Conference on Computer Vision (ICCV), 2013.
 [51] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectorypooled deepconvolutional descriptors. In Computer Vision and Pattern Recognition (CVPR), 2015.
 [52] N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran. Visual interaction networks. In Neural Information Processing Systems (NIPS), 2017.
 [53] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017.
 [54] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. The Microsoft 2016 Conversational Speech Recognition System. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017.
 [55] J. YueHei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Computer Vision and Pattern Recognition (CVPR), 2015.
 [56] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In International Conference on Computer Vision (ICCV), 2015.