Relational reasoning between distant regions of arbitrary shape is crucial for many computer vision tasks such as image classification, segmentation [35, 36] and action recognition. Humans can easily understand the relations among different regions of an image or video, as shown in Figure 1(a). However, deep CNNs cannot capture such relations without stacking multiple convolution layers, since an individual layer captures information only locally. This is very inefficient, since relations between distant regions of arbitrary shape on the feature map can only be captured by a near-top layer with a receptive field large enough to cover all the regions of interest. For instance, in ResNet-50 with 16 residual units, the receptive field is gradually increased to cover the entire image of size  at the 11th unit (near the end of Res4). To solve this problem, we propose a unit that directly performs global relation reasoning by projecting features from regions of interest to an interaction space and then distributing them back to the original coordinate space. In this way, relation reasoning can be performed in the early stages of a CNN model.
Specifically, rather than relying solely on convolutions in the coordinate space to implicitly model and communicate information among different regions, we propose to construct a latent interaction space where global reasoning can be performed directly, as shown in Figure 1(c). Within this interaction space, a set of regions that share similar semantics is represented by a single feature, instead of a set of scattered coordinate-specific features from the input. Reasoning about the relations between multiple different regions is thus simplified to modeling those between the corresponding features in the interaction space, as shown at the top of Figure 1(c). We thus build a graph connecting these features within the interaction space and perform relation reasoning over the graph. After the reasoning, the updated information is projected back to the original coordinate space for downstream tasks. Accordingly, we devise a Global Reasoning unit (GloRe) that efficiently implements the coordinate-interaction space mapping by weighted global pooling and weighted broadcasting, and the relation reasoning by graph convolution, which is differentiable and end-to-end trainable.
Different from the recently proposed Non-local Neural Networks (NL-Nets) and Double Attention Networks, which only focus on delivering information and rely on convolution layers for reasoning, our proposed model is able to reason directly on relations over regions. Similarly, Squeeze-and-Excitation Networks (SE-Nets) only focus on incorporating image-level features via global average pooling, leading to an interaction graph containing only one node. Unlike our proposed method, they are not designed for regional reasoning. Extensive experiments show that inserting our GloRe unit consistently boosts the performance of state-of-the-art CNN architectures on diverse tasks including image classification, semantic segmentation and video action recognition.
Our contributions are summarized below:
- We propose a new approach for reasoning globally by projecting a set of features that are globally aggregated over the coordinate space into an interaction space where relational reasoning can be efficiently computed. After reasoning, relation-aware features are distributed back to the coordinate space for downstream tasks.
- We present the Global Reasoning unit (GloRe unit), a highly efficient instantiation of the proposed approach that implements the coordinate-interaction space mapping by weighted global pooling and weighted broadcasting, and the relation reasoning via graph convolution in the interaction space.
- We conduct extensive experiments on a number of datasets and show that the Global Reasoning unit brings a consistent performance boost for a wide range of backbones including ResNet, ResNeXt, SE-Net and DPN, for both 2D and 3D CNNs, on image classification, semantic segmentation and video action recognition tasks.
2 Related Work
Deep Architecture Design. Research on deep architecture design focuses on building more efficient convolution layer topologies, aiming at alleviating optimization difficulties or increasing the efficiency of backbone architectures.
Residual Networks (ResNet) [15, 16] and DenseNet are proposed to alleviate the optimization difficulties of deep neural networks. DPN combines the benefits of these two networks with further improved performance. Xception, MobileNet [17, 27], and ResNeXt use grouped or depth-wise convolutions to reduce the computational cost. Meanwhile, reinforcement learning based methods try to automatically find the network topology in a predefined search space. All these methods, though effective, are built by stacking convolution layers and thus suffer from the low efficiency of convolution operations when reasoning between disjoint or distant regions. In this work we propose an auxiliary unit that can overcome this shortcoming and bring significant performance gains for these networks.
Global Context Modeling. Many efforts try to overcome the limitation of local convolution operators by introducing global contexts. PSP-Net and DenseASPP combine multi-scale features to effectively enlarge the receptive field of the convolution layers for segmentation tasks. Deformable CNNs achieve a similar outcome by further learning offsets for the convolution sampling locations. Squeeze-and-Excitation Networks (SE-Net) use global average pooling to incorporate an image-level descriptor at every stage. Nonlocal Networks, the self-attention mechanism and Double Attention Networks (A²-Net) try to deliver long-range information from one location to another. Meanwhile, bilinear pooling extracts image-level second-order statistics to complement the convolution features. Although we also incorporate global information, in the proposed approach we go one step further and perform higher-level reasoning on a graph of the relations between disjoint or distant regions, as shown in Figure 1(b).
Graph-based Reasoning. Graph-based methods have been very popular in recent years and have been shown to be an efficient way of performing relation reasoning. CRFs and random walk networks are proposed based on the graph model for effective image segmentation. Recently, Graph Convolution Networks (GCN) were proposed for semi-supervised classification, and Wang et al. propose to use GCN to capture relations between objects in video recognition tasks, where objects are detected by an object detector pre-trained on extra training data. In contrast to Wang et al., we adopt the reasoning power of graph convolutions to build a generic, end-to-end trainable module for reasoning between disjoint and distant regions, regardless of their shape and without the need for object detectors or extra annotations.
3 Graph-based Global Reasoning
In this section, we first provide an overview of the proposed Global Reasoning unit, the core unit of our graph-based global reasoning network, and introduce the motivation and rationale for its design. We then describe its architecture in detail. Finally, we elaborate on how to apply it to several different computer vision tasks.
Throughout this section, for simplicity, all figures are plotted based on 2D (image) input tensors. A graph  is typically defined by its nodes , edges  and adjacency matrix  describing the edge weights. In the following, we interchangeably use  or  to refer to a graph defined by .
Our proposed GloRe unit is motivated by overcoming the intrinsic limitation of convolution operations for modeling global relations. For an input feature tensor , with  being the feature dimension and  locations, standard convolutional layers process inputs w.r.t. the regular grid coordinates  to extract features. Concretely, the convolution is performed over a regular nearest-neighbor graph defined by an adjacency matrix  where  if regions  and  are spatially adjacent, and otherwise . The edges of the graph encode spatial proximity and each node stores the feature for that location, as shown at the bottom of Figure 1(c). Then the output features of such a convolution layer are computed as  where  denotes the parameters of the convolution kernels. A single convolution layer can capture local relations covered by the convolution kernel (i.e., locations connected over the graph ). But capturing relations among disjoint and distant regions of arbitrary shape requires stacking multiple such convolution layers, which is highly inefficient. This drawback increases the difficulty and cost of global reasoning for CNNs.
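To make the cost of stacking concrete, the following sketch computes how the receptive field grows layer by layer, using the standard recurrence in which a layer with kernel size k adds (k - 1) times the accumulated stride. The layer configuration is a hypothetical example chosen for illustration, not the configuration of any network discussed here.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # receptive field and accumulated stride so far
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) * jump
        jump *= stride             # strides compound multiplicatively
        sizes.append(rf)
    return sizes


# Five stacked 3x3 convolutions with stride 1 (hypothetical example).
print(receptive_field([(3, 1)] * 5))  # [3, 5, 7, 9, 11]
```

With unit strides the receptive field grows only linearly in depth, which is why relating two distant regions with plain convolutions requires many layers.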
To solve this problem, we propose to first project the features from the coordinate space  to the features in a latent interaction space , where each set of disjoint regions can be represented by a single feature instead of a collection of features at different locations. Within the interaction space , we can build a new fully-connected graph , where each node stores the new feature as its state. In this way, the relation reasoning is simplified to modeling the interaction between pairs of nodes over a smaller graph , as shown at the top of Figure 1(c).
Once we obtain the feature for each node of graph , we apply a general graph convolution to model and reason about the contextual relations between each pair of nodes. After that, we perform a reverse projection to transform the resulting features (augmented with relation information) back to the original coordinate space, providing complementary features for the following layers to learn better task-specific representations. Such a three-step process is conceptually depicted in Figure 1(c). To implement this process, we propose a highly efficient unit, termed GloRe unit, with its architecture outlined in Figure 2.
In the following subsections, we describe each step of the proposed GloRe unit in detail.
3.2 From Coordinate Space to Interaction Space
The first step is to find the projection function  that maps original features to the interaction space . Given a set of input features , we aim to learn the projection function such that the new features  in the interaction space are better suited to global reasoning over disjoint and distant regions. Here  is the number of features (nodes) in the interaction space. Since we expect to reason directly over a set of regions, as shown in Figure 1(b), we formulate the projection function as a linear combination (a.k.a. weighted global pooling) of the original features, such that the new features can aggregate information from multiple regions. In particular, each new feature is generated by
with learnable projection weights , , .
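As a sketch under assumed notation, the weighted global pooling described above can be written as follows. The symbol names are illustrative choices rather than ones recovered from the text: $\mathbf{x}_j \in \mathbb{R}^{C}$ is the input feature at location $j$ of $L$ locations, $b_{ij}$ is the learnable weight with which location $j$ contributes to node $i$, and $\mathbf{v}_i$ is the resulting feature of node $i$ of $N$ nodes.

```latex
\mathbf{v}_i \;=\; \sum_{j=1}^{L} b_{ij}\,\mathbf{x}_j ,
\qquad i = 1, \dots, N,
\qquad\text{or, in matrix form,}\qquad
\mathbf{V} = \mathbf{B}\,\mathbf{X},
\quad \mathbf{B} \in \mathbb{R}^{N \times L},\;
\mathbf{X} \in \mathbb{R}^{L \times C},\;
\mathbf{V} \in \mathbb{R}^{N \times C}.
```

Global average pooling is the special case with a single node and all weights equal to $1/L$.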
We note that the above equation gives a more generic formulation than an existing method , where an object detector pre-trained on an extra dataset is adopted to determine , i.e. if is inside the object box, and if it is outside the box. Instead of using extra annotation and introducing a time-consuming object detector to form a binary combination, we propose to use convolution layers to directly generate (we use one convolution layer in this work).
In practice, to reduce the input dimension and enhance the capacity of the projection function, we implement the function  as  and . We model  and  by two convolution layers as shown in Figure 2.  and  are the learnable convolutional kernels of each layer. The benefits of directly using the output of a convolution layer to form the  include the following aspects. 1) The convolution layer is end-to-end trainable. 2) Its training does not require any object bounding box as in . 3) It is simple to implement and fast. 4) It is more generic, since the convolution output can be both positive and negative, which linearly fuses the information in the coordinate space.
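A minimal NumPy sketch of this projection is given below. It assumes that both convolution layers use 1×1 kernels, so that each acts as a per-location linear map on the flattened feature map. The names `w_phi` and `w_theta`, the shapes, and the random values are hypothetical stand-ins for learned parameters.

```python
import numpy as np


def project_to_interaction_space(x, w_phi, w_theta):
    """Weighted global pooling from coordinate space to interaction space.

    x:       (L, C) input features, one row per location (flattened H * W).
    w_phi:   (C, C_r) weights of a 1x1 conv that reduces the feature dimension.
    w_theta: (C, N) weights of a 1x1 conv that produces N projection maps.
    Returns (v, b): node features (N, C_r) and projection weights (N, L).
    """
    reduced = x @ w_phi   # (L, C_r): dimension-reduced feature per location
    b = (x @ w_theta).T   # (N, L): one weight map over locations per node
    v = b @ reduced       # (N, C_r): each node is a weighted sum over locations
    return v, b


rng = np.random.default_rng(0)
L, C, C_r, N = 16, 8, 4, 3  # illustrative sizes only
x = rng.standard_normal((L, C))
w_phi = rng.standard_normal((C, C_r))
w_theta = rng.standard_normal((C, N))
v, b = project_to_interaction_space(x, w_phi, w_theta)
print(v.shape, b.shape)  # (3, 4) (3, 16)
```

Because the projection weights are themselves produced from the input, different inputs yield different pooling regions without any bounding-box supervision.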
3.3 Reasoning with Graph Convolution
After projecting the features from the coordinate space into the interaction space, we have a graph where each node contains a feature descriptor. Capturing relations between arbitrary regions in the input is now simplified to capturing interactions between the features of the corresponding nodes.
There are several possible ways of capturing the relations between features in the new space. The most straightforward one would be to concatenate the features as input and use a small neural network to capture inter-dependencies, like the one proposed in . However, even a simple relation network is computationally expensive, and concatenation destroys the pair-wise correspondence along the feature dimension. Instead, we propose treating the features as nodes of a fully connected graph and reasoning on this graph by learning edge weights that correspond to interactions of the underlying globally-pooled features of each node. To that end, we adopt the recently proposed graph convolution, a highly efficient, effective and differentiable module.
In particular, let  and  denote the  node adjacency matrix for diffusing information across nodes, and let  denote the state update function. A single-layer graph convolution network is defined by Eqn. (2), where the adjacency matrix is randomly initialized and learned by gradient descent during training, together with the weights. The identity matrix serves as a shortcut connection that alleviates the optimization difficulties. The graph convolution [21, 23] is formulated as
The first step of the graph convolution performs Laplacian smoothing, propagating the node features over the graph. During training, the adjacency matrix learns edge weights that reflect the relations between the underlying globally-pooled features of each node. If, for example, two nodes contain features that focus on the eyes and the nose, learning a strong connection between the two would strengthen the features for a possible downstream “face” classifier. After information diffusion, each node has received all necessary information and its state is updated through a linear transformation. This two-step process is conceptually visualized in Figure 3(a). In Figure 3(b), we show the implementation of this two-step process and the graph convolution via two 1D convolution layers along different directions, i.e. channel-wise and node-wise.
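The two-step process can be sketched in NumPy as below. The sketch assumes the graph convolution takes the form Z = ((I - A) V) W, with diffusion over nodes followed by a linear state update. The symbol names and the sign convention in front of A are assumptions made for illustration, and `adj` and `w` are random stand-ins for learned parameters.

```python
import numpy as np


def graph_convolution(v, adj, w):
    """Single-layer graph convolution over N nodes (assumed form).

    v:   (N, C) node features in the interaction space.
    adj: (N, N) learnable adjacency matrix holding the edge weights.
    w:   (C, C) learnable weights of the state update.
    """
    n = v.shape[0]
    # Step 1 (node-wise): diffuse information across nodes; the identity
    # matrix acts as the shortcut connection mentioned in the text.
    diffused = (np.eye(n) - adj) @ v
    # Step 2 (channel-wise): update each node state with a linear map.
    return diffused @ w


rng = np.random.default_rng(0)
N, C = 4, 6  # illustrative sizes only
v = rng.standard_normal((N, C))
adj = 0.1 * rng.standard_normal((N, N))
w = rng.standard_normal((C, C))
z = graph_convolution(v, adj, w)
print(z.shape)  # (4, 6)
```

The two matrix products play the role of the node-wise and channel-wise 1D convolutions: the first mixes rows (nodes), the second mixes columns (channels).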
3.4 From Interaction Space to Coordinate Space
To make the above building block compatible with existing CNN architectures, the last step is to project the output features back to the original space after the relation reasoning. In this way, the updated features from reasoning can be utilized by the following convolution layers to make better decisions. This reverse projection is very similar to the projection in the first step.
Given the node-feature matrix , we aim to learn a mapping function that can transform the features to as follows:
Similar to the first step, we adopt linear projection to formulate :
The above projection is actually performing feature diffusion. The feature of node  is assigned to  weighted by a scalar . These weights form the dense connections from the semantic graph to the grid map. Again, one can force the weighted connections to be binary masks or can simply use a shallow network to generate these connections. In our work, we use a single convolution layer to predict these weights. In practice, we find that we can reuse the projection generated in the first step to reduce the computational cost without any negative effect on the final accuracy. In other words, we set .
The rightmost side of Figure 2 shows the detailed implementation. In particular, the information from the graph convolution layer is projected back to the original space through the weighted broadcasting in Eqn. (4), where we reuse the output from the top convolution layer as the weight. Another convolution layer is attached after migrating the information back to the original space for dimension expansion, so that the output dimension matches the input dimension, forming a residual path.
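The reverse projection and residual path can be sketched as follows, assuming the projection weights of the first step are reused (transposed) for broadcasting and that the final convolution is a 1×1 layer acting as a per-location linear map. The names `w_out` and `project_back`, the shapes, and the random values are hypothetical.

```python
import numpy as np


def project_back(z, b, w_out, x):
    """Weighted broadcasting from interaction space back to coordinate space.

    z:     (N, C_r) node features after graph reasoning.
    b:     (N, L) projection weights reused from the first step.
    w_out: (C_r, C) weights of the 1x1 conv that restores the input dimension.
    x:     (L, C) original input features (the residual path).
    """
    y = b.T @ z           # (L, C_r): each location gathers weighted node features
    return x + y @ w_out  # (L, C): expand the dimension and add the residual


rng = np.random.default_rng(0)
L, C, C_r, N = 16, 8, 4, 3  # illustrative sizes only
x = rng.standard_normal((L, C))
b = rng.standard_normal((N, L))
z = rng.standard_normal((N, C_r))
w_out = rng.standard_normal((C_r, C))
out = project_back(z, b, w_out, x)
print(out.shape)  # (16, 8)
```

Since the output has the same shape as the input, the unit can be dropped between existing layers without changing the rest of the network.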
3.5 Deploying the Global Reasoning Unit
The core processing of the proposed Global Reasoning unit happens after flattening all dimensions referring to locations. It therefore straightforwardly applies to 3D (e.g. spatio-temporal) or 1D (e.g. temporal or any one-dimensional) features by adapting the dimensions of the three convolutions that operate in the coordinate space and then flattening the corresponding dimensions. For example, in the 3D input case, the input is a set of frames and , where are the spatial dimensions and is the temporal dimension, i.e. the number of frames in the clip. In this case, the three convolutional layers shown in Figure 2 will be replaced by convolutions.
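The flattening step can be illustrated with a short NumPy sketch. It assumes a channels-first layout, and the function names are hypothetical. The same two helpers handle 1D, 2D and 3D inputs, because everything after flattening only sees a list of locations.

```python
import numpy as np


def flatten_locations(features):
    """Reshape (C, d1, ..., dk) to (L, C), where L = d1 * ... * dk."""
    channels = features.shape[0]
    return features.reshape(channels, -1).T


def unflatten_locations(flat, location_shape):
    """Inverse of flatten_locations: (L, C) back to (C, d1, ..., dk)."""
    channels = flat.shape[1]
    return flat.T.reshape((channels,) + tuple(location_shape))


# A toy clip with C=2 channels, T=4 frames and 3x3 spatial resolution.
clip = np.arange(2 * 4 * 3 * 3, dtype=float).reshape(2, 4, 3, 3)
flat = flatten_locations(clip)
print(flat.shape)  # (36, 2)
restored = unflatten_locations(flat, (4, 3, 3))
```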
In practice, due to its residual nature, the proposed Global Reasoning unit can be easily incorporated into a large variety of existing backbone CNN architectures. It is light-weight and can therefore be inserted one or multiple times throughout the network, reasoning over global information at different stages and complementing both shallow and deeper networks. Although the latter can in theory capture such relations via multiple stacked convolutions, we show that adding one or more of the proposed Global Reasoning units increases performance on downstream tasks even for very deep networks. In the following section, we present results from different instantiations of Graph-Based Global Reasoning Networks with one or multiple Global Reasoning units at different stages, describing the details and trade-offs in each case. We refer to networks with at least one Global Reasoning unit as Graph-Based Global Reasoning Networks.
We begin with the image classification task on the large-scale ImageNet dataset, which serves as the main benchmark, to study key properties of the proposed method. Next, we use the Cityscapes dataset for the image segmentation task, examining whether the proposed method also works well for dense prediction on small-scale datasets. Finally, we use the Kinetics dataset to demonstrate that the proposed method generalizes well not only to 2D images but also to 3D videos with spatio-temporal dimensions for the action recognition task. Code and trained models will be released on GitHub.
4.1 Implementation Details
We first use ResNet-50 as a shallow CNN to conduct ablation studies and then use deeper CNNs to further examine the effectiveness of the proposed method. A variety of networks are tested as the backbone CNN, including ResNet, ResNeXt, Dual Path Network (DPN), and SE-Net. All networks are trained with the same strategy using MXNet with  GPUs. The learning rate is decreased by a factor of  starting from  (for SE-Nets, we adopt  as the initial learning rate, since training diverged when using  as the initial learning rate); the weight decay is set to ; the networks are updated using SGD with a total batch size of . We report the Top-1 classification accuracies on the validation set with a single center crop [16, 33, 9].
Semantic Image Segmentation
We employ the simple yet effective Fully Convolutional Networks (FCNs) as the backbone. Specifically, we adopt an ImageNet pre-trained ResNet, remove the last two down-sampling operations and adopt the multi-grid dilated convolutions. Our proposed block(s) are randomly initialized and appended at the end of the FCN just before the final classifier, between two adaptive convolution layers. As in [25, 5, 4], we employ a “poly” learning rate policy where  and the initial learning rate is  with a batch size of .
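For reference, a commonly used form of the “poly” policy scales the initial learning rate by (1 - iter / max_iter) raised to a fixed power. The sketch below implements that form; the default power of 0.9 and the example values are assumptions for illustration and are not taken from the settings above.

```python
def poly_learning_rate(base_lr, iteration, max_iterations, power=0.9):
    """Return base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


# Hypothetical schedule: 100 iterations starting from a rate of 0.01.
schedule = [poly_learning_rate(0.01, it, 100) for it in range(0, 101, 25)]
print(schedule)
```

The rate decays smoothly from the initial value to zero at the last iteration, more slowly at first than a linear schedule when the power is below one.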
Video Action Recognition
We run the baseline methods and our proposed method with the code released by  using PyTorch. We follow  to build the backbone 3D ResNet-50/101, which is pre-trained on the ImageNet classification task. However, instead of using a  convolution kernel for the first layer, we use a  convolution kernel for faster speed, as suggested by . The learning rate starts from  and is decreased by a factor of . Newly added blocks are randomly initialized and trained from scratch. We select the center clip with center crop for the single clip prediction, and evenly sample 10 clips per video for the video level prediction, which is similar to .
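The evaluation protocol can be sketched as follows. Evenly spacing the clip start frames follows the description above, while averaging the per-clip class scores to obtain the video-level prediction is an assumption about the aggregation step; the function names are hypothetical.

```python
import numpy as np


def evenly_spaced_clip_starts(num_frames, clip_len, num_clips=10):
    """Start indices of `num_clips` clips spread evenly over a video."""
    last_start = max(num_frames - clip_len, 0)
    return np.linspace(0, last_start, num_clips).round().astype(int).tolist()


def video_level_prediction(clip_scores):
    """Average (num_clips, num_classes) scores and return the top class index."""
    return int(np.mean(clip_scores, axis=0).argmax())


starts = evenly_spaced_clip_starts(num_frames=98, clip_len=8)
print(starts)  # [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

scores = np.array([[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]])
print(video_level_prediction(scores))  # 1
```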
4.2 Results on ImageNet
We first conduct ablation studies using ResNet-50  as the backbone architecture and considering two scenarios: 1) when only one extra block is added; 2) when multiple extra blocks are added. We then conduct further experiments with more recent and deeper CNNs to further examine the effectiveness of the proposed unit.
Figure 4 shows the ablation study results, where the y-axis is the Top-1 accuracy and the x-axis shows the computational cost measured in FLOPs, i.e. floating-point multiplication-adds. We use “R”, “NL”, “Our” to represent Residual Networks, the Nonlocal Block, and our proposed method respectively, and use “(n, m)” to indicate the insert location. For example, “R50+Our(1,3)” means one extra GloRe unit is inserted into ResNet-50 on Res3, and three GloRe units are inserted evenly on Res4. We first study the case where only one extra block is added, shown in the gray area. The proposed method improves the accuracy of ResNet-50 (pink circle) by  when only one extra block is added. Compared with the Nonlocal method, the proposed method shows higher accuracy under the same computation budget and model size. We also find that inserting the block on Res4, i.e. “R50+Ours(0,1)”, gives a better accuracy gain than inserting it on Res3, i.e. “R50+Ours(1,0)”, which is probably because Res4 contains more high-level features with semantics. Next, we insert more blocks on Res4; the results are shown in the green area. We find that the GloRe unit consistently lifts the accuracy when more blocks are added. Surprisingly, just adding three GloRe units enhances ResNet-50 by up to  in Top-1 accuracy, which is even better than the deepest ResNet-200, yet with only about % GFLOPS and % model parameters. This shows that our newly added block can provide complementary features which cannot be easily captured by stacking convolution layers. A similar improvement has also been observed on SE-ResNet-50. We also insert multiple blocks on different stages, as shown in the purple area, and find that adding all blocks at Res4 gives the best results. It is also interesting to see that the Nonlocal method starts to diverge during optimization when more blocks are added, while we did not observe such optimization difficulties for the proposed method. (To better compare the optimization difficulty, we do not adopt the zero initialization trick for either method.) Table 1 shows the effect of using different numbers of graph convolution layers in each GloRe unit. Since stacking more graph convolution layers does not give a significant gain, we only use one graph convolution layer per unit unless explicitly stated.
|ResNeXt101  ()||Baseline||8.0||44.3M||78.8%|
Going Deeper with Our Block
We further examine whether the proposed method can improve the performance of deeper CNNs. In particular, we examine four different deep CNNs: ResNet-200, ResNeXt-101, DPN-98 and DPN-131. The results are summarized in Table 2, where all baseline results are reproduced by ourselves using the same training setting for fair comparison. We observe a consistent performance gain from inserting the GloRe unit even for these very deep models, where accuracies are already quite high. It is also interesting to see that adding GloRe units on both “Res3” and “Res4” can further improve the accuracy for deeper networks, which differs from the observations on ResNet-50, probably because deeper CNNs contain more informative features in “Res3” than the shallow ResNet-50.
4.3 Results on Cityscapes
The Cityscapes dataset contains 5,000 images captured by a dash camera in  resolution. We use it to evaluate the dense prediction ability of the proposed method for semantic segmentation. Compared with ImageNet, it has far fewer images at a higher resolution. Note that we do not use the extra coarse data during training, which is orthogonal to the study of our approach.
|FCN||multi-grid||+1 GloRe unit||+2 GloRe unit||mIoU||mIoU|
The performance gain of each component is shown in Table 3. As can be seen, adopting the multi-grid trick helps improve the performance, but the most significant gain comes from our proposed GloRe unit. In particular, by inserting one GloRe unit, the mIoU is improved by  compared with the “FCN + multi-grid” baseline. We also find that adding two GloRe units sequentially does not give extra gain, as shown in the last row of the table.
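For readers unfamiliar with the metric, the sketch below computes a mean intersection-over-union (mIoU) from a predicted and a ground-truth label map. It is a simplified per-image illustration on a toy example; benchmark evaluation typically accumulates intersections and unions over the whole dataset, and the function name and inputs here are hypothetical.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in `pred` or `target`."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps; leave it out of the mean
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))


pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
print(mean_iou(pred, target, num_classes=2))
```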
| Method | Backbone | IoU cla. | iIoU cla. | IoU cat. | iIoU cat. |
|---|---|---|---|---|---|
| FCN + 1 GloRe unit | ResNet50 | 79.5% | 60.3% | 91.3% | 81.5% |
| FCN + 1 GloRe unit | ResNet101 | 80.9% | 62.2% | 91.5% | 82.1% |
We further run our method on the testing set and upload its predictions to the testing server for evaluation, with results shown in Table 4 along with other state-of-the-art methods. Interestingly, without bells and whistles (i.e. without using extra coarse annotations, incorporated low-level features or ASPP), our proposed method that only uses ResNet-50 as the backbone already achieves better accuracy than some of the popular baselines, and the deeper ResNet-101 based model achieves performance competitive with the state of the art.
Figure 5 visualizes the prediction results on the validation set. As highlighted by the yellow boxes, GloRe unit enhances the generalization ability of the backbone CNN, and is able to alleviate ambiguity and capture more details.
4.4 Results on Kinetics
The experiments presented in the previous section demonstrate the effectiveness of the proposed method on 2D image related tasks. We now evaluate the performance of our GloRe unit on 3D inputs and the flagship video understanding task of action recognition. We choose the large-scale Kinetics-400 dataset for testing, which contains approximately 300k videos. We employ ResNet-50 (3D) and ResNet-101 (3D) as the backbones and insert 5 extra GloRe units in total, on Res3 and Res4. The backbone networks are pre-trained on ImageNet, while the newly added blocks are randomly initialized and trained from scratch.
We first compare with Nonlocal Networks (NL-Net), the top performing method. We reproduce the NL-Net for fair comparison, since we use distributed training with a much larger batch size and fewer input frames for faster speed. We note that the reproduced models achieve performance comparable to that reported by the authors at much lower cost. The results, shown in Figure 6, indicate that the proposed method consistently improves recognition accuracy over both the ResNet-50 and ResNet-101 baselines, and provides further improvement over the NL-Nets.
| Method | Backbone | Frames | FLOPs | Clip Top-1 | Video Top-1 |
|---|---|---|---|---|---|
| I3D-RGB | Inception-v1 | 64 | 107.9 G | – | 71.1% |
| R(2+1)D-RGB | ResNet-xx | 32 | 152.4 G | – | 72.0% |
| MF-Net | MF-Net | 16 | 11.1 G | – | 72.8% |
| S3D-G | Inception-v1 | 64 | 71.4 G | – | 74.7% |
| NL-Nets | ResNet-50 | 8 | 30.5 G | 67.12% | 74.57% |
| GloRe (Ours) | ResNet-50 | 8 | 28.9 G | 68.02% | 75.12% |
| NL-Nets | ResNet-101 | 8 | 56.1 G | 68.48% | 75.69% |
| GloRe (Ours) | ResNet-101 | 8 | 54.5 G | 68.78% | 76.09% |
All results, including comparisons with other recently proposed methods, are shown in Table 5. The results show that by simply adding the GloRe unit to basic architectures we are able to outperform other recent state-of-the-art methods, demonstrating its effectiveness on a different, diverse task.
5 Visualizing the GloRe Unit
Experiments in the previous section show that the proposed method can consistently boost the accuracy of various backbone CNNs on a number of datasets for both 2D and 3D tasks. We here analyze what makes it work by visualizing the learned feature representations.
To generate higher resolution internal features for better visualization, we trained a shallower ResNet-18 with one GloRe unit inserted in the middle of Res4. We trained the model on ImageNet with  input crops, so that the intermediate feature maps are enlarged by  containing more details. Figure 7 shows the weights for four projection maps (i.e.  in Eqn. 1) for two images. The depicted weights are the coefficients for the corresponding features at each location in a weighted average pooling over the whole image, giving a single feature descriptor in the interaction space. For this visualization we used  and therefore 128 such feature descriptors are extracted for pooled regions, forming a graph with 128 nodes in the interaction space. As expected, different projection weight maps learn to focus on different global or local discriminative patterns. For example, the left-most weight map seems to focus on cat whiskers, the second weight map seems to focus on edges, the third one seems to focus on eyes, and the last one focuses on the entire space equally, acting more like a global average pooling. As discussed in Sec 1, it is really hard for convolution operations to directly reason between such patterns, which might be spatially distant or irregularly shaped.
In this paper, we present a highly efficient approach for global reasoning that can be effectively implemented by projecting information from the coordinate space to nodes in an interaction space graph where we can directly reason over globally-aware discriminative features. The proposed GloRe unit is an efficient instantiation of the proposed approach, where projection and reverse projection are implemented by weighted pooling and weighted broadcasting, respectively, and interactions over the graph are modeled via graph convolution. It is lightweight, easy to implement and optimize, while extensive experiments show that the proposed unit can effectively learn features complementary to various popular CNNs and consistently boost their performance on both 2D and 3D tasks over a number of datasets.
-  G. Bertasius, L. Torresani, X. Y. Stella, and J. Shi. Convolutional random walk networks for semantic image segmentation. In CVPR, 2017.
-  J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
-  S. Chandra, N. Usunier, and I. Kokkinos. Dense and low-rank gaussian crfs using deep embeddings. In ICCV, 2017.
-  L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 40(4):834–848, 2018.
-  L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
-  T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
-  Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. A²-nets: Double attention networks. In NIPS, 2018.
-  Y. Chen, Y. Kalantidis, J. Li, S. Yan, and J. Feng. Multi-fiber networks for video recognition. ECCV, 2018.
-  Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. In NIPS.
-  Y. Chen and J. Z. Wang. Image categorization by learning and reasoning with regions. Journal of Machine Learning Research, 5(Aug):913–939, 2004.
-  F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
-  P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, 2018.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
-  T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI, 2018.
-  T.-Y. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. In CVPR, 2015.
-  W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
-  A. Paszke, S. Gross, S. Chintala, and G. Chanan. Pytorch, 2017.
-  M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. arXiv preprint arXiv:1801.04381, 2018.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
-  D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. 2018.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 6000–6010, 2017.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, 2018.
-  X. Wang and A. Gupta. Videos as space-time region graphs. Proceedings of the IEEE European Conference on Computer Vision, 2018.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995. IEEE, 2017.
-  S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851, 2017.
-  M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. Denseaspp for semantic segmentation in street scenes. In Computer Vision and Pattern Recognition, 2018.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 6230–6239. IEEE, 2017.
-  H. Zhao, Y. Zhang, S. Liu, J. Shi, C. C. Loy, D. Lin, and J. Jia. PSANet: Point-wise spatial attention network for scene parsing. In ECCV, 2018.
-  B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.