Efficient Coarse-to-Fine Non-Local Module for the Detection of Small Objects

11/29/2018 ∙ by Hila Levi, et al. ∙ Weizmann Institute of Science

An image is not just a collection of objects, but rather a graph where each object is related to other objects through spatial and semantic relations. Using relational reasoning modules, which allow message passing between objects, can therefore improve object detection. Current schemes apply such dedicated modules either on a specific layer of the bottom-up stream, or between already-detected objects. We show that the relational process can be better modeled in a coarse-to-fine manner and present a novel framework, applying a non-local module sequentially to increasing-resolution feature-maps along the top-down stream. In this way, the inner relational process can naturally pass information from larger objects to smaller related ones. Applying the modules to fine feature-maps also allows message passing between the small objects themselves, exploiting repetitions of instances of the same class. In practice, due to the expensive memory utilization of the non-local module, it is infeasible to apply the module as currently used to high-resolution feature-maps. We efficiently redesigned the non-local module, improving it in terms of memory and number of operations, allowing it to be placed anywhere along the network. We also incorporated relative spatial information into the module, in a manner compatible with our efficient implementation. We show the effectiveness of our scheme by improving the results of detecting small objects on COCO by 1.5 AP over Faster RCNN and by 1 AP over using a non-local module on the bottom-up stream.




1 Introduction

Figure 1: (a) Can you see the swimming people? (b) Where is the ball? Using a relational non-local module directly on the feature maps in a coarse-to-fine manner enables the detection of small objects, based on (i) the existence of larger related objects and (ii) repeating objects, allowing us to: (a) pay attention to the tiny swimmers in the sea and (b) locate the ball. Cyan - NL, Red - ours, ENL. Same threshold.

Scene understanding has shown impressive improvement in the last few years. Since the revival of deep neural networks, there has been a significant increase in performance on a range of relevant tasks, including classification, object detection, segmentation and part localization.

Early works relied heavily on the hierarchical structure of bottom-up classification networks to perform additional tasks such as detection [14, 13, 41, 20], by using the last network layer to predict object locations. A next significant step, partly motivated by the human vision system, incorporated context into the detection scheme by using a bottom-up top-down architecture [30, 19, 10, 42]. This architecture combines high-level contextual data from the last layers with highly localized fine-grained information expressed in lower layers. The next challenge, which is an active research area, is to incorporate relational reasoning into detection systems [2, 7, 40]. With relational reasoning, an image is not just a collection of unrelated objects, but rather resembles a "scene graph" of entities (nodes, objects) connected by edges (relations, predicates).

In this line of development, the detection of small objects remains a difficult task. This task was shown to benefit from the use of context [9, 11], and the current work applies the use of relations to the detection of small objects.

Consider for example the images in figure 1. The repetition of instances from the same class in the image, as well as the existence of larger instances from related classes, serves as a semantic clue. It enables the detection of the tiny people in the sea (figure 1(a)), partly based on the existence of the larger people on the shore. It similarly localizes the small sports ball, partly based on the throwing man and the waiting glove (figure 1(b)).

Exploiting this information, specifically for small object detection, requires propagating information over large distances in high resolution feature-maps according to the data in a specific image. This is difficult to achieve by convolutional layers, since they transmit information in the same manner for all images, based on learning, and over short distances rather than the entire image.

Recently, a Non-Local module [51] has been formulated and integrated into CNNs for various tasks [50, 45]. Its formalism is simple: for each output pixel $i$ in the feature-map, the scheme aggregates information from all of the input pixels ($j$), based on their similarity to the specific input pixel $i$. This building block is capable of passing information between distant pixels according to their appearance, and is applicable to our current task. Using it sequentially, in a coarse-to-fine manner, enables passing semantic information from larger, easy-to-detect objects to smaller ones. The presence of the non-local (NL) module in shallower layers allows information to jump between the small objects themselves. Evidence for the above can be seen in Figures 6 and 7.

For the current needs, there are two disadvantages in the original design of the NL module. The first is its expensive computational and memory budget: the similarity function $f$ is evaluated for every pair of pixels, so its cost grows quadratically with the number of spatial locations. In preceding works this block was integrated into layers of the bottom-up stream, but in our task it is integrated into lower-level layers, where its memory demands become prohibitive. Furthermore, in detection networks, it is common practice to enlarge the input image, making the problem even worse.
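To make the memory argument concrete, the following back-of-the-envelope sketch counts the bytes needed to store an explicit pairwise similarity matrix. The 800 x 800 image, the two strides and float32 storage are illustrative assumptions rather than values reported in the paper, and the function name is hypothetical.

```python
def affinity_matrix_bytes(height, width, stride, bytes_per_value=4):
    """Bytes needed to store the full N x N pairwise similarity matrix.

    N is the number of spatial locations of the feature map obtained by
    downsampling a (height x width) image by `stride`.
    """
    n = (height // stride) * (width // stride)
    return n * n * bytes_per_value


# Illustrative assumption: an 800 x 800 input image, float32 values.
coarse = affinity_matrix_bytes(800, 800, stride=32)  # coarse map, N = 625
fine = affinity_matrix_bytes(800, 800, stride=4)     # fine map, N = 40000

print(f"stride 32: {coarse / 2**20:.1f} MiB")
print(f"stride 4:  {fine / 2**30:.1f} GiB")
```

Under these assumptions the matrix grows from roughly 1.5 MiB on the coarse map to roughly 6 GiB on the stride-4 map for a single image, which illustrates why the explicit form is hard to use on fine feature-maps.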

The second disadvantage is the lack of relative position encoding in $f$. Objects in the image are positioned relative to one another on a 2D grid. Discarding this source of information, especially in high-resolution feature-maps, is wasteful.

We modified the NL module to deal with the above difficulties. A simple modification, based on the associative law of matrix multiplication and exploiting the existing factorization of $f$, enables us to create a comparable building block whose cost grows only linearly with the number of spatial locations. Relative position encoding was added to the similarity information, giving the network the opportunity to use relative spatial information in an efficient manner. The resulting scheme still aggregates information across the entire image, but not uniformly. We named this module ENL: Efficient Non Local module.

In this paper, we use the ENL module as a reasoning module that passes information between related pixels, applying it sequentially along the top-down stream. Since it is also applied to high-resolution feature maps, an efficient re-implementation of the module is essential.

Unlike other approaches, which place a relational module on the BU stream or establish relations between already-detected objects, our framework can apply pairwise reasoning in a coarse-to-fine manner, guiding the detection of small objects. Applying the relational module to finer layers also enables the small objects themselves to exchange information with each other.

To summarize our contributions:

1. We efficiently redesigned the NL module (ENL), improving it in terms of memory and number of operations, allowing it to be placed anywhere along the network.

2. We incorporated relative spatial information into the NL module reasoning process, in a novel approach that keeps the efficient design of the ENL.

3. We applied the new module sequentially to increasing resolution feature-maps along the top-down stream, obtaining relational reasoning in a coarse-to-fine manner.

4. We show the effectiveness of our scheme, incorporating it into the Faster-RCNN [43] pipeline and improving state-of-the-art detection of small objects over the COCO dataset [32] by 1.5 AP.

The improvements presented in this work go beyond the specific detection application: tasks such as semantic segmentation, fine-grained localization, image restoration and image generation, or other tasks in the image domain that use an encoder-decoder framework and depend on fine image details, are natural candidates for the proposed framework.

2 Related Work

The current work combines two approaches used in the field of object detection: (a) modelling context through top down modulation and (b) using non local interactions in a deep learning framework. We briefly review related work in these domains.

Bottom Up Top Down Networks

In detection tasks, one of the major challenges is to simultaneously detect both large and small objects and parts. Early works used a pure bottom-up (BU) architecture for the task, and predictions were made only from the coarsest (topmost) feature map [14, 13, 41, 20]. Later works tried to exploit the inherent hierarchical structure of neural networks to create a multi-scale detection architecture. Some of these works performed detection using combined features from multiple layers [3, 18, 25], while others performed detection in parallel from individual layers [35, 6, 34, 47].

Recent methods incorporate context (from the last BU layer) with low level layers by adding skip connections in a bottom-up top-down (BUTD) architecture. Some schemes [49, 44, 36] used only the last layer of the top down (TD) network for prediction, while others [30, 19, 10, 42] performed prediction from several layers along the TD stream.

The last described architecture yields enhanced results, especially for small object detection, and was adopted in various detection schemes (e.g. one-stage or two-stage detection pipelines). It is assumed to successfully combine multi-scale BU data with semantic context from higher layers, and thus serves as an elegant built-in context module.

In the current work we further enhance the representation created in the layers along the TD stream, using the pairwise information supplied by the NL module, which has already been shown to be complementary to the CNN information [51]. We show that sequentially applying this complementary source of information, in a coarse-to-fine manner, helps detection, especially of small objects.

Context modelling

It is well known that context modelling plays an important role in detection both in humans [1, 4] and AI systems [9, 11]. There is also evidence that the use of context for detection and recognition is guided by a TD process [38, 37, 4].

Contextual information includes both local context (the immediate local environment of a given object) and global context (the full scene category, or relationships between objects in the scene).

Examples of using local context in recent detection networks include [12, 53, 54, 27, 56]. These works model the local context by extracting local data about the proposed RoI (the output of the RPN), by adding larger surroundings windows, or by using nearby, automatically located, contextual regions [26].

Global scene categorization can also help detection [3, 39, 48] and segmentation [23]. Along this line, [48] stressed the role of TD guidance in modelling and using context. Context and detection in fact interact in both directions, since object detection can also help global scene categorization [29, 24], and these two-way interactions have been modeled by an iterative feedback scheme [48, 28].

Modern Relational Reasoning

Relational reasoning and message passing between explicitly detected or implicitly represented objects in the image is an active and growing research area. Recent work in this area has been applied to scene understanding tasks (e.g. recognition [7], detection [51, 22, 40], segmentation [52]) and to image generation tasks (GANs [55], restoration [33]).

For scene understanding tasks, two general approaches exist. The first approach can be called ’object-centric’, as it models the relations between existing objects, previously detected by a detection framework [22]. In this case, a natural structure for formalizing relation is via Graph Neural Network [16, 46]; [2, 5] summarize and generalize many aspects in the growing field.

The second approach applies relational interactions directly to CNN feature-maps (in which objects are implicitly represented). In this case, a dedicated block (sometimes named non-local module [51], relational module [45] or self-attention module [50]) is integrated into the network without additional supervision, in an end-to-end learnable manner. In detection and segmentation frameworks, this block can be integrated into the network's backbone, preceding the RPN, to supply additional information for both recognition and localization tasks (example in figure 2).

Figure 2: Using a reasoning module directly on the feature maps before the RPN enables more RoIs to pass to the next detection stage. Examples, although rare, exist: (a) More than 10 birds were added to the flock. (b) A man looking out at the sea. Yellow - Faster RCNN. Red (printed behind) - Ours. The first 1000 RoIs are presented; for visualization purposes, only boxes smaller than 32x32 are shown.

3 Approach

We first briefly review the implementation details of the Non Local (NL) module as described in [51], and then present our proposed efficient ENL module, specifying our modifications in detail.

3.1 Preliminaries: the Non Local module

The formulation of the NL module as described in [51] is:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \qquad (1)$$

$x$ is the input tensor and $y$ is the output tensor. $x$ and $y$ share the same dimensions, $H \times W \times D$, where $D$ is the channel dimension, and are reshaped to the form $N \times D$ with $N = HW$. $i$ is the current pixel under consideration, and $j$ runs over all spatial locations of the input tensor. $f(x_i, x_j)$ summarizes the similarity between every two pixels in the input tensor ($f(x_i, x_j)$ is a scalar) and $g(x_j)$ is the representation of the $j$'th spatial pixel (a vector holding the channels of that pixel's representation). The module sums information from all the pixels in the input tensor, weighted by their similarity to the $i$'th pixel. The similarity function $f$ can be chosen in different ways; one of the popular design choices is:

$$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)} \qquad (2)$$

In this case the normalization factor $C(x)$ takes the form of the softmax operation. A block scheme of this straightforward implementation is illustrated in figure 3(a).

The output of the NL module goes through a $1 \times 1$ convolution $W_z$ and is combined with a residual connection to take the form of:

$$z_i = W_z y_i + x_i \qquad (3)$$

For simplicity we omit the description of these operators from the rest of the section as well as from the illustrations (figures 3, 5).

Two drawbacks of this basic implementation are its extremely expensive memory utilization and the lack of position encoding. Both of these issues are addressed next.
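As an illustration of the basic NL formulation of this section, here is a minimal NumPy sketch on a flattened feature map, using the exponential similarity with softmax normalization followed by the output projection and residual connection. The weight names (`w_theta`, `w_phi`, `w_g`, `w_z`), the inner channel count and the toy sizes are assumptions made for the example; the 1x1 convolutions are written as per-pixel matrix products.

```python
import numpy as np


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract the max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def nl_module(x, w_theta, w_phi, w_g, w_z):
    """Non-local block on a flattened feature map x of shape (N, D)."""
    theta = x @ w_theta          # (N, D_inner) embedding of the query pixel i
    phi = x @ w_phi              # (N, D_inner) embedding of the pixels j
    g = x @ w_g                  # (N, D_inner) per-pixel representation
    f = softmax(theta @ phi.T)   # (N, N) similarities, each row sums to 1
    y = f @ g                    # similarity-weighted sum over all pixels
    return y @ w_z + x           # output projection plus residual connection


rng = np.random.default_rng(0)
n, d, d_inner = 6 * 8, 16, 8     # toy 6x8 feature map with 16 channels
x = rng.normal(size=(n, d))
w_theta, w_phi, w_g = (rng.normal(scale=0.1, size=(d, d_inner)) for _ in range(3))
w_z = rng.normal(scale=0.1, size=(d_inner, d))
z = nl_module(x, w_theta, w_phi, w_g, w_z)
```

Note that the intermediate `f` has N x N entries, which is the term that becomes costly on high-resolution feature maps.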

3.2 ENL: Memory Effective Implementation

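Let us consider another design choice for $f$:

$$f(x_i, x_j) = \theta(x_i)^T \phi(x_j) \qquad (4)$$

In this case $f$ is an $N \times N$ matrix created by a multiplication of two matrices, $\theta(x)$ and $\phi(x)^T$. Since this matrix multiplication is immediately followed by another matrix multiplication with $g(x)$, one can simply use the associative rule to change the order of the calculation (the normalization is discussed separately in section 5.2):

$$y = \left[\theta(x)\,\phi(x)^T\right] g(x) = \theta(x) \left[\phi(x)^T g(x)\right] \qquad (5)$$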

NL module ENL module
Table 1: Performance comparison of the original NL module and our memory effective implementation (ENL)

This re-ordering results in a large saving in terms of memory and operations. Consider a detection framework with typical image sizes. At the second stage (stride 4) the number of spatial locations $N$ is far larger than the number of channels $D$. While the inner result of multiplying the matrices sequentially (original NL, figure 3(a)) is of size $N \times N$, the reordered multiplication (ENL module) gives an inner result of size $D \times D$ (a reduction of at least 4 orders of magnitude in memory utilization inside the block). The reduction in the number of operations is determined in a similar manner, see table 1. The memory effective implementation is illustrated in figure 3(b).
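The reordering can be checked numerically. The sketch below compares the two multiplication orders for the dot-product similarity, without any normalization; the sizes are toy values chosen for the example, and the operation counts are simple multiply-accumulate estimates rather than measured figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1200, 32                      # N spatial locations, D channels (toy sizes)
theta = rng.normal(size=(n, d))
phi = rng.normal(size=(n, d))
g = rng.normal(size=(n, d))

# Original NL order: build the (N, N) similarity matrix first.
f = theta @ phi.T                    # N * N intermediate values
y_nl = f @ g

# ENL order: contract phi with g first, so the intermediate is only (D, D).
inner = phi.T @ g                    # D * D intermediate values
y_enl = theta @ inner

same = np.allclose(y_nl, y_enl)      # both orders give the same output
memory_ratio = f.size / inner.size   # equals (N / D) ** 2
ops_nl = 2 * n * n * d               # multiply-accumulates of the two products
ops_enl = 2 * n * d * d
```

With these toy sizes the intermediate shrinks by a factor of (N / D)^2 and the operation count by a factor of N / D; both factors grow as the feature map gets finer.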

Figure 3: (a) Original NL module vs. (b) ENL module. The main difference is the order of matrix multiplication.

3.3 Adding Relative Position Encoding

We next consider two versions of adding position encoding, $f_{spat}$, to our scheme. The first is based on the norm of the relative position of pixels $i$ and $j$ followed by an exponent (similarly to [7]) and is applicable only for the case where we use a full version of $f$ as an inner variable. In this case:

$$y = \left[f \circ f_{spat}\right] g(x) \qquad (6)$$

where $\circ$ denotes elementwise multiplication or addition. This version is illustrated in figure 5(a). The second version of position encoding is applicable also to the efficient implementation. In this case, the formulation is given by:

$$y = \left[\theta(x)\,\phi(x)^T + f_{spat}\right] g(x) \qquad (7)$$

For $f_{spat}$ we use:

$$f_{spat} = \Psi \Psi^T \qquad (8)$$

with $\Psi \in \mathbb{R}^{N \times K}$, $K \ll N$. In this case we can again change the order of matrix multiplication, and equation 7 takes the form:

$$y = \theta(x) \left[\phi(x)^T g(x)\right] + \Psi \left[\Psi^T g(x)\right] \qquad (9)$$

Note that adding a spatial filter in general is straightforward, but here we want to add a spatial filter in a manner that keeps the low-rank properties of the ENL module.

To construct $\Psi$, denote by $b_i$ the coefficients of the 2D cos transform of a one-hot image input (one in the $i$'th location and zero everywhere else), arranged as a row vector. Denote by $B$ the matrix consisting of the $b_i$s as its rows. It follows from the orthogonality of the cos transform that $B B^T$ is a diagonal matrix (a nonzero weight for $i = j$ and zero everywhere else). Truncating the coefficient vectors (taking a subset of the columns of $B$, denoted as $\Psi$, corresponding to the lower frequencies of the cos transform) meets two goals:

a. $\Psi \in \mathbb{R}^{N \times K}$ with $K \ll N$, so it can be elegantly integrated into the ENL design (see equation 9), and

b. $\Psi \Psi^T$ serves as a low pass filter.

A block scheme of the implementation is detailed in figure 5(b).

We used the first columns, corresponding to the lowest frequencies of the 2D cos transform. Optimizing the choice of the columns can be added but is out of the scope of this paper. The resulting (sinc-like) filter is almost invariant to the spatial position; its general structure is kept although fluctuations in its height exist. An example is shown in figure 4.
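The construction of a low-rank spatial filter from truncated cos-transform coefficients can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses an orthonormal DCT-II basis, keeps a rectangular block of the lowest frequencies (`kh` x `kw`, hypothetical parameters), combines the spatial filter with the similarity additively, and omits normalization and learned embeddings.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal 1D DCT-II matrix; row k holds the k-th cosine basis vector."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c


def spatial_factor(h, w, kh, kw):
    """Matrix of shape (h*w, kh*kw): row i holds the kh*kw lowest-frequency
    2D cos-transform coefficients of a one-hot image at location i."""
    return np.kron(dct_matrix(h)[:kh], dct_matrix(w)[:kw]).T


h, w, d = 12, 16, 8
full = spatial_factor(h, w, h, w)        # all N coefficients per location
psi = spatial_factor(h, w, 3, 4)         # truncated: K = 12 columns, N = 192

rng = np.random.default_rng(0)
theta, phi, g = (rng.normal(size=(h * w, d)) for _ in range(3))

# Explicit form: add the (N, N) spatial filter to the (N, N) similarities.
y_full = (theta @ phi.T + psi @ psi.T) @ g
# Efficient form: both terms are reordered, so no (N, N) matrix is built.
y_eff = theta @ (phi.T @ g) + psi @ (psi.T @ g)

# The spatial filter seen by one pixel, reshaped back to an image.
filter_image = (psi @ psi.T)[h * w // 2].reshape(h, w)
```

Here `full @ full.T` is the identity, reflecting the orthogonality of the transform, while the truncated product `psi @ psi.T` is a rank-K low-pass filter that never needs to be formed in the efficient path.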

Figure 4: An example of the resulting filter.
Figure 5: Relative position encoding added to: (a) the NL module, (b) the ENL module.
model      spat  eff  AP     AP50   AP75   APs    APm    APl
baseline              36.71  58.45  39.62  21.11  39.85  48.14
+1NL, BU              37.72  59.81  40.71  21.65  40.88  49.07
+1NL, BU   ✓          37.62  59.59  40.87  21.48  40.73  49.08
+3NL, TD   ✓     ✓    37.75  60.19  40.97  22.54  41.41  48.73
Table 2: Object detection results on the COCO benchmark. Results for Faster RCNN were directly obtained from the publicly available model [15]; other results were produced by us. The bottom line shows our improvements.

4 Implementation

We performed our experiments on the Faster R-CNN [43] detection framework, using FPN [30] with ResNet-50 [21], pretrained on ImageNet [8], as its backbone. We implemented our models in Caffe2 using the Detectron framework [15]. We used the standard running protocols of Faster R-CNN and adjusted our learning rates as suggested by [17]. The images were normalized to 800 pixels on their shorter axis. All the models were trained on COCO train2017 ([32], ~118K images) and were evaluated on COCO val2017 (5K images).


We trained our models for 360000 iterations using a base learning rate of 0.005, reducing it by a factor of 10 after 240000 and 320000 iterations. We used SGD optimization with momentum 0.9 and a weight decay of 0.0001. We froze the BN layers in the backbone and replaced them with an affine operation, as is the common strategy when fine-tuning with a small number of images per GPU.
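The step schedule described above can be written as a small function. This is a sketch, not the Detectron configuration: the function name is hypothetical, the boundaries are treated as inclusive (an assumption), and any warm-up phase is not modelled.

```python
def learning_rate(iteration, base_lr=0.005, steps=(240000, 320000), gamma=0.1):
    """Step schedule: multiply the rate by `gamma` at each boundary in `steps`."""
    passed = sum(iteration >= s for s in steps)  # number of boundaries reached
    return base_lr * gamma ** passed


# Rates at the start, after the first drop and after the second drop.
schedule = [learning_rate(it) for it in (0, 240000, 320000)]
```

Under these assumptions the rate is 0.005 until iteration 240000, then 0.0005 until iteration 320000, and 0.00005 for the remaining iterations up to 360000.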


During inference we followed the common practice of [43, 19]. We report our results based on the standard metrics of COCO, using AP (mean average precision) and APsmall (mean average precision for small objects, area < $32^2$ pixels) as our main criteria for comparison. Further explanations of the metrics can be found in [31].
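For reference, the COCO size categories behind APs, APm and APl can be sketched as a simple bucketing function. The thresholds (32^2 and 96^2 pixels of object area) follow the COCO evaluation definition; the function name is hypothetical and the exact treatment of objects lying precisely on a boundary is simplified here.

```python
def coco_size_bucket(area):
    """Assign an object to a COCO size category by its area in pixels."""
    if area < 32 ** 2:
        return "small"    # counted in APs
    if area < 96 ** 2:
        return "medium"   # counted in APm
    return "large"        # counted in APl


buckets = [coco_size_bucket(side * side) for side in (20, 50, 200)]
```

A 20 x 20 pixel object therefore contributes to APs, which is the metric this work focuses on.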

Non Local Block

We placed the NL modules along the top-down stream. We used three instances of the NL module in total, and located them in each stage, just before the spatial interpolation (in parallel to the res5, res4 and res3 layers).

We initialized the blocks' weights with random Gaussian weights. We did not use additional BN layers inside the NL module (due to the relatively small minibatch), or affine layers (since no initialization is available).
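The placement of the three modules along the top-down stream can be sketched as below. This is a schematic illustration only: `enl` is a toy, parameter-free stand-in for the module (identity embeddings, division by N, residual connection), the lateral features are random arrays, nearest-neighbour upsampling is assumed for the spatial interpolation, and which tensors feed the prediction heads is an assumption of the sketch.

```python
import numpy as np


def enl(x):
    """Toy ENL step on a (H, W, D) map using the reordered product."""
    h, w, d = x.shape
    flat = x.reshape(h * w, d)
    y = flat @ (flat.T @ flat) / (h * w)   # only a (D, D) intermediate is built
    return (flat + y).reshape(h, w, d)     # residual connection


def upsample2x(x):
    """Nearest-neighbour spatial interpolation by a factor of two."""
    return x.repeat(2, axis=0).repeat(2, axis=1)


def top_down(laterals):
    """laterals: features ordered coarse to fine (res5, res4, res3, res2)."""
    p = laterals[0]
    outputs = []
    for finer in laterals[1:]:
        p = enl(p)                          # relational module before interpolation
        outputs.append(p)
        p = upsample2x(p) + finer           # top-down merge with the skip connection
    outputs.append(p)
    return outputs


rng = np.random.default_rng(0)
laterals = [rng.normal(size=(4 * 2 ** i, 5 * 2 ** i, 8)) for i in range(4)]
pyramid = top_down(laterals)
```

With four pyramid levels the loop applies the module three times, at the res5, res4 and res3 resolutions, so information aggregated at a coarse level is carried into every finer level.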

5 Experiments & Results

We evaluated our framework on the task of object detection, comparing to Faster RCNN [43] as a baseline, demonstrating an improvement in performance.

5.1 Comparison with state of the art results

Table 2 compares the detection results of the proposed scheme (+3NL, TD, using Faster RCNN with three additional ENL modules sequentially located along the TD stream) to the baseline (Faster RCNN) and to the variant suggested in [51] (+1NL, BU). Adding three non-local modules along the TD stream in a coarse-to-fine manner leads to an improvement of almost 1.5 APsmall over the baseline and almost 1 APsmall over adding a non-local module in the BU stream.
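The small-object gains can be read directly off the APs column of Table 2; the short computation below only restates those table values (the dictionary keys are labels chosen for the example, and the BU entry is the row without position encoding).

```python
# APs (small objects) column of Table 2.
ap_small = {"baseline": 21.11, "+1NL, BU": 21.65, "+3NL, TD": 22.54}

gain_over_baseline = round(ap_small["+3NL, TD"] - ap_small["baseline"], 2)
gain_over_bu = round(ap_small["+3NL, TD"] - ap_small["+1NL, BU"], 2)
```

This gives 1.43 APs over the baseline and 0.89 APs over the single NL module on the BU stream.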

The improvement over the baseline emphasizes the potential of adding non-local modules in general: they exploit the data in the network in a way that is complementary to the convolutional layers. The improvement over +1NL, BU (a non-local module in the BU stream) can be explained by the presence of non-local modules in the shallower layers of the network, by the coarse-to-fine guidance through the TD stream and by the relative position encoding added to the scheme. We saw in the ablation studies that adding position encoding to the non-local modules on the TD stream improves the detection results. Interestingly, adding relative position encoding to the non-local module in the BU stream (third line in the table) did not improve the results.

Qualitative results

Examples of interest are shown in Figure 6. The examples illustrate small objects that cannot be detected on their own, and are detected either through the presence of other instances of the same class (a,b) or through larger instances of related classes: (c) the man is holding two remotes in his hands, (d) the woman is holding and watching her cellphone and (e) the driver is inside the truck. These objects, marked in red, were not detected by Faster RCNN or by Faster RCNN with a non-local module on the BU stream (using the same threshold). In figure 7 we present the attention maps extracted from the non-local modules in response to the two images presented in figure 1. The left column shows the attention maps of the NL module on the BU stream. Here the attention is spread across the spatial grid, especially around large objects (the people on the shore and the glove). The right column shows the attention maps of the ENL module in the finest feature-map. The small objects are clearly seen, along with the local response of the low pass filter.

5.2 Ablation studies

We performed control and ablation experiments to study some aspects of the design of the ENL modules. Unless specified otherwise, all experiments were carried out using Faster RCNN with three additional non-local modules (original modules or their efficient version, with or without position encoding). Due to the relatively long training time of Faster RCNN we used smaller images, normalized to 600 pixels on their shorter axis. The rest of the training details follow section 4.

position encoding  AP     APs
no spat            36.9   20.41
norm, mul          36.40  19.94
norm, add          36.71  20.64
cos, mul           36.31  19.32
cos, add           36.71  21.12
Table 3: Adding different variants of position encoding. Additive attention with relative encoding based on the cos transform works best.
model        spat  eff  AP     AP50   AP75   APs    APm    APl
baseline                35.84  57.01  38.69  18.60  38.03  49.10
+1NL, BU                36.63  58.22  39.75  20.06  39.37  49.87
+3conv5, TD             35.87  57.48  38.54  18.90  38.39  48.81
+3NL, TD                36.90  59.00  40.03  20.41  39.71  49.95
+3NL, TD           ✓    36.59  58.61  39.49  20.35  39.56  49.36
+3NL, TD     ✓     ✓    36.82  58.98  39.57  20.96  39.55  49.76
Table 4: Comparing different variants of NL modules added along the TD stream. The efficient NL module is an attractive alternative to the original NL module. Adding position encoding further improves the results.
NL softmax 36.9 20.41
36.48 20.41
ENL none none n.a. n.a.
softmax none 36.59 20.35
softmax tanh 36.65 20.13
none 36.71 19.80
tanh 36.05 19.23
Table 5: Efficient Non Local Block normalization strategies. We compared several normalization strategies of $\theta$ and $\phi$ in the ENL module with the original NL module normalized with softmax (first row) or by $N$ (second row). Normalizing with softmax gives the best results.

Efficient implementation normalization strategies

Referring to the formulation of the original NL module in equation 1, the normalization factor is:

$$C(x) = \sum_{\forall j} f(x_i, x_j) \qquad (10)$$

This straightforward normalization strategy is possible only if $f$ was calculated explicitly. In the ENL design, special care must be taken in performing normalization.

Table 5 compares different normalization strategies of $\theta$ and $\phi$ in the ENL module (Section 3.2). Performing no normalization at all results in a quick divergence of the training loss. For comparison, the results of the original NL module, normalized with softmax or by $N$, are presented in the first two lines of the table, respectively.

The test summarized in table 5 suggests that using softmax to normalize was sufficient, achieving results on par with the original NL module when normalized by $N$ (using the design choice given by equation 4). In the rest of the paper we used this normalization strategy by default. Normalizing the original NL module with softmax yields slightly better results, in agreement with the findings in [51].

Adding relative position encoding

Table 3 compares four different ways to encode the relative position information in $f$ (using the norm or the cos transform, with addition or multiplication, see section 3.3). The results of the same network without position encoding are presented in the first row of the table.

Table 3 shows that additive spatial attention, at least in our case, gives better results than multiplicative spatial attention. A possible explanation is that multiplicative spatial attention completely suppresses the influence of distant pixels, which goes against the purpose of the NL module.

Table 3 also demonstrates that a spatial filter based on the separable cos transform performs slightly better than its counterpart based on the norm. Conveniently, this is also the spatial filter that can be integrated into the more compact implementation of the ENL module.

Adding the NL block on the TD stream

Table 4 compares several variants of the proposed network (+3NL, TD, last 3 lines) to the baselines.

The table shows that the efficient module (lines 5, 6) is an attractive alternative to the original non-local module (line 4), and that position encoding (line 6) further improves it.

The improvement over the baseline is even higher than the improvement shown in table 2, possibly because the object area statistics in the dataset are different for smaller images. The improvement is not just a matter of more parameters: Table 4 shows that adding 5x5 convolution layers along the TD stream in the same positions does not yield a distinct improvement in results.

Figure 6: Qualitative examples from COCO val2017. (a-b) Repetitions of the same class in the image are exploited for detection. (c-e) Highly semantic clues in a top-down architecture: (c) The man is holding remotes in his hands, (d) The woman is watching her cellphone and (e) There is a driver in the truck. Cyan - NL module, Red - ours, ENL. Same thresholds. Best if zoomed-in.
Figure 7: (a) NL, BU (b) ENL, TD3. The attention maps for a certain pixel (denoted with a white point), reordered to the form of an image, for the two images presented in Fig. 1. (a) Detecting with the NL module on the BU part: the people in the sea and the sports ball are weak. (b) Detecting with ENL modules sequentially ordered on increasing-resolution feature maps. Only the finest feature-map is presented here: the tiny people in the sea and the sports ball are recognized and isolated from the background. Best if zoomed-in.

6 Conclusions

We examined the possible use of several non-local modules, arranged hierarchically along the top-down stream, to exploit the effects of context and relations among objects. We compared our method with the previous use of a non-local module placed on the bottom-up network, and showed a 1 AP improvement in small object detection. We suggest that this improvement is enabled by the coarse-to-fine use of pairwise location information and show visual evidence in support of this possibility.

In practice, applying the non-local module to large feature maps is a memory-demanding operation. We dealt with this difficulty and introduced ENL - an attractive alternative to the Non Local block, which is efficient in terms of memory and operations, and which integrates the use of relative spatial information. The ENL allows the use of a non-local module in a general encoder-decoder framework and consequently might contribute, in future work, to a wide range of applications (segmentation, image generation etc.).