Spatial-Temporal Memory Networks for Video Object Detection

by Fanyi Xiao, et al.
University of California-Davis

We introduce Spatial-Temporal Memory Networks (STMN) for video object detection. At its core, we propose a novel Spatial-Temporal Memory module (STMM) as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables the integration of ImageNet pre-trained backbone CNN weights for both the feature stack as well as the prediction head, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. We compare our method to state-of-the-art detectors on ImageNet VID, and conduct ablative studies to dissect the contribution of our different design choices. We obtain state-of-the-art results with the VGG backbone, and competitive results with the ResNet backbone. To our knowledge, this is the first video object detector that is equipped with an explicit memory mechanism to model long-term temporal dynamics.






1 Introduction

Object detection is a fundamental problem in computer vision, and has broad applicability in numerous fields. While there has been a long history of research in detecting objects in static images, there has been relatively little research in detecting objects in videos. However, cameras on robots, surveillance systems, vehicles, wearable devices, etc., receive videos instead of static images. Thus, for these systems to recognize the key objects and their interactions, it is critical that they be equipped with accurate video object detectors.

The simplest way to detect objects in video is to run a static image-based detector independently on each frame. However, due to the different biases and challenges of video (e.g., motion blur, low-resolution, compression artifacts), a static image detector may not generalize well. Furthermore, videos provide rich temporal and motion information that should be utilized by the detector during both training and testing for improved performance. For example, in Fig. 1, since the profile view (frames 1-2) of the hamster is much easier to detect than the challenging viewpoint/pose presented in later frames, the static Fast-RCNN [15] image detector only succeeds in detecting the leading frame of the sequence. On the other hand, by learning to aggregate useful information over time, our video object detector can robustly detect the object under extreme viewpoint/pose.

The recent introduction of the ImageNet video object detection challenge [1] has sparked interest in the relatively unexplored problem of video object detection. However, most existing methods only post-process static object detections returned by an image detector like Fast/Faster R-CNN [15, 42], either by linking detection boxes across frames [24, 18, 29] or by performing video segmentation to refine the detection results [17]. Although these methods show improvement over a static image object detector, exploiting temporal information through post-processing is sub-optimal since temporal and motion information are ignored during detector training. As such, these methods have difficulty overcoming consecutive failures of the static detector, e.g., when the object-of-interest has large occlusion or unusual appearance for a long time.

Figure 1: Static image detectors, such as Fast-RCNN [15], tend to fail under occlusion or extreme pose (false detections shown in yellow). By learning to aggregate information across time, our STMN video object detector can produce correct detections in frames with challenging pose/viewpoints. In this example, it aggregates information from the easier profile views of the hamster (first two frames) to aid detection in occluded or extreme views of the hamster (third-fifth frames).

To address these limitations, we introduce the Spatial-Temporal Memory Network (STMN), which jointly learns an object’s appearance and long-term motion dynamics in an end-to-end fashion for video object detection. At its core is the Spatial-Temporal Memory Module (STMM), which is a convolutional recurrent computation unit that integrates pre-trained weights learned from static images (e.g., ImageNet [10]). This design choice is critical in addressing the practical challenge of learning from contemporary video datasets, which largely lack intra-category object diversity; i.e., since video frames are highly redundant, a video dataset of e.g., 1 million frames has much lower diversity than an image dataset with 1 million images. By designing our memory unit to be compatible with pre-trained weights from both its preceding and succeeding layers, we show that it outperforms the standard ConvGRU [4] recurrent module for video object detection.

Furthermore, in order to account for the 2D spatial nature of visual data, the STMM explicitly preserves spatial information in its memory. More specifically, to achieve accurate pixel-level spatial alignment over time, the STMM uses a novel MatchTrans module to explicitly model the displacement introduced by motion across frames. This design choice enables our network to provide an interpretable and expressive model of video content by modeling and tracking the object at the (coarse) pixel-level (specifically at the feature map cell level).

In summary, our main contribution is a novel memory-based deep network for video object detection. To our knowledge, our STMN is the first video object detector equipped with an explicit memory mechanism to model long-term temporal and motion dynamics. We evaluate our STMN on the ImageNet video object detection dataset (VID) [1], and demonstrate its effectiveness over existing state-of-the-art static image and video object detectors. Furthermore, our ablative studies show the benefits provided by the STMM and MatchTrans modules—integrating pre-trained static image weights and providing spatial alignment across time—for video object detection.

2 Related work

Static image object detection.

Recent work using deep networks has significantly advanced the state-of-the-art in static image object detection [16, 44, 15, 42, 41, 32]. Our work also builds on the success of deep networks to learn the features, classifier, and bounding box localizer in an end-to-end optimization framework. However, in contrast to most existing work that focuses on detecting objects in static images, this paper aims to detect objects in videos.


Figure 2: Our STMN architecture. Consecutive frames are forwarded through the convolutional stacks to obtain spatial-preserving convolutional feature maps, which are then fed into the spatial-temporal memory module (STMM). In this example, in order to detect an object on the center frame, information flows into the center STMM from all five frames. The STMM output from the center frame is then fed into a classification and box regression sub-network. In our experiments, the “FCs” are fc6 and fc7 in VGG-16 and C5 in ResNet-101.

Video object detection.

Compared to static image-based object detection, there has been far less research in detecting objects in videos. Early work processed videos captured from a static camera or made strong assumptions about the type of scene (e.g., highway traffic camera for detecting cars or an indoor room for detecting persons) [54, 7]. Later work used hand-designed features by aggregating simple motion cues (based on optical flow, temporal differences, or tracking), and focused mostly on pedestrian detection [51, 9, 23, 2, 52, 39].

With the recent introduction of ImageNet VID [1], researchers have focused on more generic categories and realistic videos. However, most existing approaches combine per-frame detections from a static image detector via tracking in a two-step pipeline [24, 18, 29, 49]. Since the motion and temporal cues are used as a post-processing step only during testing, many heuristic choices are required, which can lead to sub-optimal results. In contrast, our approach does not post-process the outputs of a static image detector, but instead learns to integrate the motion and temporal dependencies during training. Our end-to-end architecture also leads to a clean and fast runtime.

Recent work [59] temporally interpolates feature maps with optical flow to accelerate video object detection during testing. Unlike [59], where the goal is only to accelerate the evaluation of the image detector, the very recent work of Zhu et al. [58] learns to combine features of different frames with a feed-forward network. Although we share the same goal of making use of temporal information during training, our method differs in that it produces a spatial-temporal memory that can carry information across a long and variable number of frames, whereas the method in [58] can only aggregate information over a small and fixed number of frames. In Sec. 4, we demonstrate the benefits gained from this flexibility.

Another very recent work by Feichtenhofer et al. [13] proposed an approach aiming to unify detection and tracking, in which they use the correlation between two consecutive frames to predict the movement of the bounding boxes. Unlike [13], our approach computes a spatial-temporal memory that aggregates information across a long and variable number of frames. Furthermore, although our approach also computes the correlation between neighboring frames with the proposed MatchTrans module, instead of feeding it as input to predict motion, we explicitly use it to warp the feature map for alignment.

Learning with videos.

Apart from video object detection, other recent work uses convolutional and/or recurrent networks for video classification [25, 45, 37, 48, 4]. These methods tend to model entire video frames instead of pixels, which means the fine-grained details required for localizing objects are often lost. Object tracking [20, 30, 53, 36, 34, 43, 28], which requires accurate localization, is also closely related. The key difference is that in tracking, the bounding box of the first frame is given, and the tracker does not necessarily need to know the semantic category of the object being tracked.

Modeling sequence data with RNNs.

In computer vision, RNNs have been used for image captioning [26, 33, 50, 11], visual attention [3, 35, 55], action/object recognition [31, 11, 4], human pose estimation [14, 6], as well as semantic segmentation [57]. Recently, Tripathi et al. [49] adopted RNNs for video object detection. However, in their pipeline, the CNN-based detector is first trained, and then an RNN is trained to refine the detection outputs of the CNN.

To take spatial locality into account, Ballas et al. [4] proposed convolutional gated recurrent units (ConvGRU) and applied them to the task of action recognition. Building upon [4], Tokmakov et al. [47] used ConvGRUs for the task of video object segmentation. Our work differs in two ways: (1) we classify bounding boxes rather than frames or pixels, and (2) we propose a new recurrent computation unit called STMM that is able to make use of static image detector weights pre-trained on a large-scale image dataset like ImageNet. We show that this property leads to better results than ConvGRU for video object detection.

3 Approach

We propose a novel RNN architecture called the Spatial-Temporal Memory Network (STMN) to model an object’s changing appearance over time for video object detection.

Figure 3: Left: The transformation coefficients Γ_{x,y} for position (x, y) are computed by matching x_t(x, y) to x_{t−1}(x + i, y + j), where (i, j) indexes a spatial neighborhood surrounding (x, y). The transformation coefficients are then used to synthesize M′_{t−1}(x, y) by interpolating the corresponding M_{t−1} feature vectors. Right: BN* squashes its input into the range [0, 1], with a linear scaling function thresholded at K standard deviations from the batch mean; we use the same K in all of our experiments.


The overall architecture is shown in Fig. 2. Assuming a video sequence of length T, each frame is first forwarded through a convnet (e.g., the layers up through conv5 for VGG-16, or the C1-C4 residual blocks in ResNet-101) to obtain convolutional feature maps as appearance features. To aggregate information along the temporal axis, the appearance feature x_t of each frame is fed into the Spatial-Temporal Memory Module (STMM). The STMM at time step t receives the appearance feature x_t for the current frame, as well as a spatial-temporal memory M_{t−1}, which carries the information of all previous frames up through timestep t−1. The STMM then updates the spatial-temporal memory M_t for the current time step, conditioned on both x_t and M_{t−1}. In order to capture information from both previous and later frames, we use two STMMs, one for each direction, to obtain both a forward memory M→ and a backward memory M←. These are then concatenated to produce the temporally modulated memory M for each frame.
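The bidirectional aggregation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `step` stands in for an arbitrary recurrent update (the STMM in the paper), and feature maps are plain arrays:

```python
import numpy as np

def bidirectional_memory(features, step):
    # Run a recurrent update `step(x, m) -> m` over the sequence in both
    # directions, then concatenate the two memories channel-wise per frame.
    # `features` is a list of H x W x C arrays; memory starts at zero.
    def run(seq):
        mems, m = [], np.zeros_like(seq[0])
        for x in seq:
            m = step(x, m)
            mems.append(m)
        return mems
    fwd = run(features)              # forward-in-time memory
    bwd = run(features[::-1])[::-1]  # backward-in-time memory
    return [np.concatenate([f, b], axis=-1) for f, b in zip(fwd, bwd)]
```

Any toy update, e.g. an exponential moving average `lambda x, m: 0.5 * (x + m)`, exercises the same wiring; each output map then has twice the channels of the input, matching the concatenation of the two directional memories.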

The concatenated memory M, which also preserves spatial information, is then fed into subsequent fully-connected layers for both category classification and bounding box regression, following Fast-RCNN [15]. This way, our approach combines information from both the current frame as well as temporally-neighboring frames when making its detections. This helps, for instance, in the case of detecting a frontal-view bicycle in the center frame of Fig. 2 (which is hard), if we have seen its side-view (which is easier) from nearby frames. In contrast, a static image detector would only see the frontal-view bicycle when making its detection.

Finally, to train the detector, we use the loss function used in Fast-RCNN [15]. Specifically, for each frame in a training sequence, we enforce a cross-entropy loss between the predicted class label and the ground-truth label, and a smooth L1 loss on the predicted bounding box regression coefficients. During testing, we slide the testing window and detect only on the center frame within each sliding window; this allows the model to aggregate an equal amount of information in both temporal directions when making the detection. Ideally, we would also train using only the center frame, but to accelerate training, we compute the loss on all frames in a sequence. In practice, this accelerates training by a factor equal to the training sequence length.
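The center-frame testing scheme can be illustrated with a small sketch (a simplification that ignores the boundaries of the video, and assumes an odd window length):

```python
def sliding_windows(num_frames, window=11):
    # Enumerate (start, end, detect) triples: detection is run only on the
    # center frame of each length-`window` span, so every detection sees an
    # equal number of frames on both temporal sides.
    half = window // 2
    return [(s, s + window - 1, s + half)
            for s in range(num_frames - window + 1)]
```

For a 15-frame sequence with an 11-frame window, this yields five windows whose detected frames are 5 through 9; in practice the boundary frames would be handled by shorter or padded windows.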

Figure 4: Effect of alignment on spatial-temporal memory. In the first and second rows, we show the detection and the visualization of the spatial-temporal memory (by computing the norm across feature channels at each spatial location to get a saliency map), respectively, with MatchTrans alignment. The detection and memory without alignment are shown in rows 3 and 4, respectively. Without proper alignment, the memory has a hard time forgetting an object after it has moved to a different spatial position (third row), which is manifested by a trail of saliency on the memory map due to overlaying multiple unaligned maps (fourth row). Alignment with MatchTrans helps generate a much cleaner memory (second row), which also results in better detections (first row). Best viewed in pdf.

Spatial-temporal memory module.

We next explain how the STMM models the temporal correlation of an object across frames. At each time step t, the STMM takes as input x_t and M_{t−1} and computes the following:

    z_t = BN*(W_z ∗ x_t + U_z ∗ M_{t−1})              (1)
    r_t = BN*(W_r ∗ x_t + U_r ∗ M_{t−1})              (2)
    M̃_t = ReLU(W ∗ x_t + U ∗ (r_t ⊙ M_{t−1}))         (3)
    M_t = (1 − z_t) ⊙ M_{t−1} + z_t ⊙ M̃_t             (4)

where ⊙ is element-wise multiplication, ∗ is convolution, and W_z, W_r, W, U_z, U_r, U are the 2D convolutional kernels, whose parameters are optimized end-to-end. Gate r_t masks elements of M_{t−1} (i.e., it allows the previous state to be forgotten) to generate the candidate memory M̃_t. And gate z_t determines how to weight and combine the memory from the previous step M_{t−1} with the candidate memory M̃_t, to generate the new memory M_t.
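A minimal sketch of one such update, under two simplifying assumptions not in the paper: the 2D convolutions are reduced to 1x1 kernels (plain C x C matrix products per feature-map cell), and a plain clip to [0, 1] stands in for the paper's modified BatchNorm squashing of the gates:

```python
import numpy as np

def stmm_step(x_t, m_prev, params):
    # One simplified STMM update on H x W x C feature maps. `params` holds
    # six C x C matrices playing the role of the 2D convolution kernels.
    Wz, Uz, Wr, Ur, W, U = params
    squash = lambda a: np.clip(a, 0.0, 1.0)  # crude stand-in for BN*
    z = squash(x_t @ Wz + m_prev @ Uz)                    # update gate
    r = squash(x_t @ Wr + m_prev @ Ur)                    # reset gate
    m_cand = np.maximum(x_t @ W + (r * m_prev) @ U, 0.0)  # ReLU candidate
    return (1.0 - z) * m_prev + z * m_cand                # new memory M_t
```

Because the candidate memory passes through ReLU and the gates lie in [0, 1], the memory stays non-negative when initialized at zero, which is the range compatibility with the downstream pre-trained layers discussed below.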

To generate r_t and z_t, the STMM first computes an affine transformation of x_t and M_{t−1}, and then ReLU [27] is applied to the output values. Since r_t and z_t are gating variables, their values need to be in the range [0, 1]. Therefore, we make two changes to the standard BatchNorm [22] (and denote it as BN*) such that it normalizes its input to [0, 1], instead of zero mean and unit standard deviation.

First, our variant of BatchNorm computes the mean μ and standard deviation σ for an input batch, and then normalizes the values with the linear squashing function shown in Fig. 3 (right). Second, we compute the mean and standard deviation for each batch independently instead of keeping running averages across training batches. In this way, we do not need to store different statistics for different time-steps, which allows us to generate test results for sequence lengths not seen during training (e.g., we can compute detections on longer sequences than those seen during training, as demonstrated in Sec. 4). Although simple, we find BN* works well for our purpose.
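The squashing described above might be sketched as follows; the threshold `k` is an assumed illustrative value (the paper's K is not stated here), and statistics are computed over the whole input rather than per channel:

```python
import numpy as np

def bn_star(x, k=3.0):
    # Map values within k standard deviations of the batch mean linearly
    # onto [0, 1], clipping anything beyond the threshold. Statistics are
    # computed per batch (no running averages), so the layer behaves the
    # same at any time step or sequence length.
    mu, sigma = x.mean(), x.std()
    if sigma == 0.0:               # constant input: map to the midpoint
        return np.full(x.shape, 0.5)
    return np.clip((x - (mu - k * sigma)) / (2.0 * k * sigma), 0.0, 1.0)
```

By construction the batch mean maps to 0.5, and values at ±k standard deviations map to 0 and 1.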

Differences with ConvGRU [4].

A key practical challenge of learning video object detectors is the lack of intra-category object diversity in contemporary video datasets; i.e., since video frames are highly redundant, a video dataset of e.g., 1 million frames has much lower diversity than an image dataset with 1 million images. The cost of annotation is much higher in video, which makes it difficult to have the same level of diversity as an image dataset. Therefore, transferring useful information from large-scale image datasets like ImageNet [10] would benefit video object detection by providing additional object diversity. Our STMM accomplishes this by making two key changes to the ConvGRU [4].

First, we use the ReLU non-linearity instead of Sigmoid and Tanh, as shown in Eqs. 1-4. This is because of the way the ImageNet pre-trained backbone convnets are utilized in the Fast/Faster-RCNN framework: an ROI pooling module splits the backbone convnet into a feature stack (CONV in Fig. 2; conv1 through conv5 in VGG-16 and C1-C4 in ResNet-101) and a prediction head (FCs in Fig. 2; fc6-7 in VGG-16 and C5 in ResNet-101). In order to make use of both parts of the pre-trained weights, we need to make sure the output of the recurrent computation unit we insert in between is compatible with the pre-trained weights in the following prediction head layers. As an illustrative example, since the output of the standard ConvGRU lies in [−1, 1] (due to the Tanh non-linearity), there would be a mismatch with the input range expected by, e.g., fc6 in VGG-16 (whose expected inputs are all non-negative due to ReLU). To solve this incompatibility issue, we change the non-linearities in the standard ConvGRU from Sigmoid and Tanh to ReLU. Second, we initialize W_z, W_r, and W in Eqs. 1-3 with the weights of the last convolution layer in "CONV" (for VGG, we directly take the weights of conv5_3; for ResNet, since its convolutions always come with skip connections, we simply add one more convolution layer with 3x3 filters after C4 and pre-train it on ImageNet DET), rather than initializing them with random weights. Conceptually, this can be thought of as a way to initialize the memory with the pre-trained static convolutional feature maps. In Sec. 4, we show that these modifications allow us to make better use of pre-trained weights and achieve better performance.

Spatial-temporal memory alignment.

Finally, we explain how to align the memory across frames. Since objects move in videos, their spatial features can be mis-aligned across frames. For example, the position of a bicycle in frame t−1 might not be aligned to its position in frame t (as in Fig. 2). In our case, this means that the spatial-temporal memory M_{t−1} may not be spatially aligned to the feature map x_t for the current frame t. This can be problematic, for example in the case of Fig. 4; without proper alignment, the spatial-temporal memory can have a hard time forgetting an object after it has moved to a different spatial position. This is manifested by a trail of saliency, in the fourth row of Fig. 4, due to the effect of overlaying multiple unaligned feature maps. Such hallucinated features can lead to false positive detections and inaccurate localizations, as shown in the third row of Fig. 4.

To alleviate this problem, we propose the MatchTrans module to align the spatial-temporal memory across frames. For a feature cell x_t(x, y) at location (x, y) in x_t, MatchTrans computes the affinity between x_t(x, y) and the feature cells in a small vicinity around location (x, y) in x_{t−1}, in order to transform the spatial-temporal memory M_{t−1} to align with frame t. More formally, the transformation coefficients Γ are computed as:

    Γ_{x,y}(i, j) = x_t(x, y) · x_{t−1}(x + i, y + j) / Σ_{i′,j′} x_t(x, y) · x_{t−1}(x + i′, y + j′)

where both i and j are in the range [−k, k], which controls the size of the matching vicinity. With Γ, we transform the unaligned memory M_{t−1} to the aligned M′_{t−1} as follows:

    M′_{t−1}(x, y) = Σ_{i,j} Γ_{x,y}(i, j) · M_{t−1}(x + i, y + j)

The intuition here is that, given the transformation Γ, we reconstruct the spatial memory M′_{t−1}(x, y) as a weighted average of the spatial memory cells of M_{t−1} that are within the vicinity around (x, y); see Fig. 3 (left). At this point, we can thus simply replace all occurrences of M_{t−1} with the spatially aligned memory M′_{t−1} in Eqs. 1-4. With proper alignment, our generated memory is much cleaner (second row of Fig. 4) and leads to more accurate detections (first row of Fig. 4).
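A direct, unvectorized sketch of this warping under the definitions above; the vicinity radius `k` is an assumed value, and degenerate locations whose affinities sum to zero are simply left as zeros:

```python
import numpy as np

def match_trans(f_t, f_prev, m_prev, k=2):
    # For each cell (x, y), normalize the dot-product affinities between
    # f_t(x, y) and the f_prev cells in its (2k+1) x (2k+1) vicinity into
    # coefficients, then warp m_prev as the coefficient-weighted average
    # of its cells in that vicinity. All maps are H x W x C arrays.
    h, w, _ = f_t.shape
    aligned = np.zeros_like(m_prev)
    for x in range(h):
        for y in range(w):
            coeffs, cells = [], []
            for i in range(-k, k + 1):
                for j in range(-k, k + 1):
                    xi, yj = x + i, y + j
                    if 0 <= xi < h and 0 <= yj < w:
                        coeffs.append(f_t[x, y] @ f_prev[xi, yj])
                        cells.append(m_prev[xi, yj])
            coeffs = np.asarray(coeffs)
            total = coeffs.sum()
            if total != 0.0:       # degenerate affinities: leave zeros
                aligned[x, y] = (coeffs / total) @ np.stack(cells)
    return aligned
```

Note that when the two feature maps are identical and uniform, the coefficients become uniform as well and the warp reduces to a local average of the memory, which matches the weighted-average intuition above.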

airplane antelope bear bicycle bird bus car cattle dog dom. cat elephant fox giant panda hamster horse lion
STMN (Ours) 68.7 23.5 26.3 42.1 41.3 68.2 42.4 29.3 57.9 39.3 94.0 75.5 80.8 58.3 70.0 35.3
Fast-RCNN [15] 73.6 25.6 30.6 41.5 31.2 63.1 39.7 39.5 54.4 32.3 84.0 70.9 78.5 54.6 57.8 36.7
STMN-No-MatchTrans 65.3 30.7 29.2 42.9 36.9 61.5 40.0 47.2 56.5 45.1 95.4 71.3 84.1 39.0 67.2 26.1
ConvGRU-Pretrain 71.6 27.8 26.4 41.9 45.5 64.5 40.2 33.3 57.0 36.9 90.2 79.3 80.1 46.2 64.1 29.4
ConvGRU-FreshFC 72.3 26.1 29.4 41.2 36.8 63.9 41.3 16.4 56.4 33.9 88.1 70.1 76.3 38.1 59.2 26.5
lizard monkey motorcycle rabbit red panda sheep snake squirrel tiger train turtle watercraft whale zebra Test mAP Val mAP
STMN (Ours) 44.1 31.8 32.0 52.7 80.3 49.0 14.3 31.8 35.7 66.8 56.4 41.7 49.6 80.9 50.7 55.6
Fast-RCNN [15] 48.9 30.2 36.9 42.0 79.7 47.9 9.9 16.9 36.0 63.4 56.0 50.3 44.9 76.2 48.4 53.0
STMN-No-MatchTrans 42.0 26.9 36.9 53.8 83.6 53.0 10.5 19.4 27.4 64.8 48.1 40.7 47.4 78.3 49.0 55.0
ConvGRU-Pretrain 46.4 30.1 40.4 40.0 79.6 52.4 11.0 18.1 20.0 60.4 46.8 37.4 47.2 75.9 48.0 51.9
ConvGRU-FreshFC 38.3 30.5 28.8 8.3 79.6 41.0 9.5 18.5 30.1 68.7 45.3 44.0 50.0 75.9 44.8 50.3
Table 1: Quantitative detection results on ImageNet VID for all 30 categories. Our improvements over the baselines show the importance of aggregating temporal information with a memory (vs. static Fast-RCNN detection [15]), memory alignment across frames with MatchTrans (vs. STMN-No-MatchTrans), and the effectiveness of using pre-trained weights with STMM over standard ConvGRU (vs. ConvGRU-Pretrain and ConvGRU-FreshFC).

Another advantage of MatchTrans is that it is very lightweight. Unlike optical flow, which needs to be computed either externally, e.g., using [5], or in-network through another large CNN, e.g., FlowNet [12], MatchTrans is much more efficient (i.e., saving computation time and/or space for storing optical flow). For example, it is nearly an order of magnitude faster to compute (on average, 2.9 ms vs. 24.3 ms for a 337x600 frame) than FlowNet [12], which is one of the fastest optical flow methods.

Approach summary.

Together with the specially designed STMM and MatchTrans module, our STMN detector is able to leverage a well-aligned spatial-temporal memory that aggregates useful information from nearby frames for video object detection.

4 Results

We show quantitative and qualitative results of our STMN video object detector, and compare to both state-of-the-art static image and video detectors. We also conduct ablation studies to analyze the different components in our model.


We use ImageNet VID [1], which has 3862/555/937 videos for training/validation/testing for 30 object categories. Each video is captured at 30 fps and ranges from only a few frames to thousands. Bounding box annotation is provided for all frames. We choose ImageNet VID due to its relatively large size as well as for ease of comparison to existing state-of-the-art methods [1, 8, 58, 24, 13].

Implementation details.

For object proposals, we use DeepMask [40]; our own baselines use the same proposals to ensure fair comparison. We use two different CNN architectures—VGG-16 [46] and ResNet-101 [19]—as the backbone for Fast-RCNN. We set the sequence length to 7 during training. For testing, we observe better performance when using a longer sequence length; specifically, 11 frames provides a good balance between performance and GPU memory/computation (we later show the relationship between performance and test sequence length). We set the number of channels of the spatial-temporal memory to 512 for VGG and 1024 for ResNet, and set the local region size for MatchTrans. To reduce redundancy in sequences, we form a sequence by sampling 1 in every 10 video frames with uniform stride. For training, we start with a learning rate of 1e-3 with SGD and lower it to 1e-4 when the training loss plateaus. We set weight decay to 0.0005 for VGG, whereas for ResNet we do not use weight decay.


Baselines.

In addition to comparing with state-of-the-art video object detectors, we also compare to a number of baselines to better analyze the contribution of the different components of our STMN detector. The first baseline is the static image Fast-RCNN detector [15], which uses the same base architecture as our model but lacks a memory module and does not aggregate information over time. The second baseline is our model without the MatchTrans module, which therefore does not align the memory from frame to frame (STMN-No-MatchTrans). The third baseline computes the memory using ConvGRU [4] instead of our proposed STMM. Like ours, this baseline (ConvGRU-Pretrain) also uses pre-trained ImageNet weights for both the feature stack and prediction layers. Our final baseline is ConvGRU without pre-trained weights for the ensuing prediction FCs (ConvGRU-FreshFC).

Ablation studies.

We first present ablation studies to measure the impact of each component in our STMN video object detector. For this, we use VGG-16 as the backbone since it is faster to train compared with ResNet-101.

Table 1 shows the results. First, our full model (STMN) outperforms the static image Fast-RCNN detector by 2.3/2.6% mAP on ImageNet VID test/val set. This demonstrates the effectiveness of the spatial-temporal memory in aggregating information across time. Furthermore, comparing STMN-No-MatchTrans to STMN, we observe a 1.7/0.6% test/val mAP improvement brought by the proper spatial alignment across frames. To compare our STMM and ConvGRU, we first naively replace STMM with ConvGRU and randomly initialize the weights for the FC layers after the ConvGRU. With this setting (ConvGRU-FreshFC), we obtain a relatively low test mAP of 44.8%, due to the lack of data to train the large amount of weights in the FCs. If we instead initialize the weights of the FCs after the ConvGRU with pre-trained weights (ConvGRU-Pretrain), we improve the test mAP from 44.8% to 48.0%. By replacing Sigmoid and Tanh with ReLU (STMN), we boost the performance even further to 50.7% (with a similar trend on val). This shows the importance of utilizing pre-trained weights in both the feature stacks and prediction head, and the necessity of an appropriate form of recurrent computation that best matches its output to the input expected by the pre-trained weights.

Base network Base detector Test Val
STMN+SeqNMS (Ours) VGG-16 Fast-RCNN 56.5 61.7
STMN (Ours) VGG-16 Fast-RCNN 53.1 58.7
Fast-RCNN VGG-16 Fast-RCNN 48.4 53.0
ITLab VID - Inha [1] VGG-16 Fast-RCNN 51.5 -
Faster-RCNN+SeqNMS [18] VGG-16 Faster-RCNN 48.2 52.2
Faster-RCNN [1] VGG-16 Faster-RCNN 43.4 44.9
STMN+SeqNMS (Ours) ResNet-101 Fast-RCNN - 72.2
STMN (Ours) ResNet-101 Fast-RCNN - 71.4
Fast-RCNN ResNet-101 Fast-RCNN - 68.1
R-FCN [8] ResNet-101 R-FCN - 73.4
FGFA [58] ResNet-101 R-FCN - 76.3
D&T [13] ResNet-101 R-FCN - 79.8
T-CNN [24] DeepID+Craft [38, 56] RCNN 67.8 73.8
Table 2: mAP comparison to the state-of-the-art on ImageNet VID. We outperform all methods with a large margin using VGG-16 as the backbone (by 5.0+% on test set), and obtain competitive results with ResNet-101 as the backbone. See text for details.

Length of test window.

We next analyze the relationship between detection performance and the length of the test window. Specifically, we test our model with window sizes of 3, 7, 11, and 15 on the ImageNet VID validation set (the training window size is always 7). The corresponding mAPs are 53.0%, 54.9%, 55.6%, and 55.9%, respectively; as we increase the window size, the performance keeps increasing. This suggests the effectiveness of our memory: the longer the sequence, the more (longer-range) useful information is stored in the memory, which leads to better detection performance. However, increasing the test window size also increases computation cost and GPU memory consumption. We therefore find that a test window size of 11 provides a good balance.

Comparison to state-of-the-art.

For a fair comparison, following [24, 29, 58, 13], here we use the subset of ImageNet DET dataset that has overlapping object categories with ImageNet VID to pre-train both the baseline Fast-RCNN detector and our STMN detector. We employ standard data augmentation including left-right flipping, random scale jittering, and photometric distortions [21].

Table 2 shows the comparison to existing state-of-the-art image and video detectors. First, our STMN detector outperforms the Fast-RCNN detector by a large margin regardless of the base network (i.e., VGG-16 or ResNet-101). This demonstrates the effectiveness of our proposed spatial-temporal memory. With the VGG-16 base network, our STMN detector achieves state-of-the-art performance at 53.1% test mAP, without any post-processing, compared with 51.5% produced by "ITLab VID - Inha", which employs a sophisticated tracking method on top of the results of Fast-RCNN+VGG-16. With very simple temporal smoothing [18], we can further boost test mAP to 56.5% (STMN+SeqNMS). This helps because even though we enforce temporal smoothness in the spatial-temporal memory, we do not have an explicit smoothness constraint in the output space (i.e., the space of bounding box coordinates). Thus, by performing temporal smoothing directly in the output space, we get a further boost in performance.

For T-CNN [24], even though it outperforms our STMN by 1.6% in val mAP, it is a highly complex multi-stage system with an ensemble of multiple CNN base networks, whereas we do not perform any ensemble modeling. As for FGFA [58] and D&T [13], instead of the Fast-RCNN framework adopted by our STMN, the detection framework they use is R-FCN [8], which itself starts from a much higher static object detection performance (the R-FCN base detector obtains 73.4% val mAP, compared to 68.1% achieved by Fast-RCNN).

Figure 5: Example detections produced by our STMN detector vs. Fast-RCNN. We compare to the standard Fast-RCNN in the first four sequences and to Fast-RCNN with Seq-NMS post-processing in the last sequence. The red boxes indicate correct detections, yellow boxes indicate false positives, and green boxes indicate missed detections. The ground-truth object in each sequence are: “hamster”, “bear”, “airplane”, “squirrel” and “car”.

Qualitative results.

Fig. 5 shows qualitative comparisons between our STMN detections and the static image Fast-RCNN detections. Our STMN detections are more robust to motion blur; e.g., in the last frame of the “hamster” sequence, Fast-RCNN gets confused about the class label of the object due to large motion blur, whereas our STMN detector correctly detects the object with high confidence. In the case of difficult viewpoint and occlusion (“bear” and “squirrel”, respectively), our STMN produces robust detections by leveraging the information from neighboring easier frames (i.e., first frame in both sequences). Also, our model outputs detections that are more consistent across frames, compared with the static image detector, as can be seen in the case of “airplane”. Finally, in the last frame of the “car” sequence, Fast-RCNN produces an extremely low score for the car due to the challenging viewpoint and large occlusion. Thus, post-processing with Seq-NMS [18] is unable to correct this error. Since our STMN has learned to temporally-aggregate the car’s features in its memory, it is able to produce a correct detection.


Finally, STMN adds only marginal computation cost: for an image of size 337x600, the forward pass for Fast-RCNN and STMN (using VGG-16 as the backbone) takes 0.176s and 0.204s, respectively. The added 0.03s is spent in the STMM computation, including MatchTrans.

5 Conclusion

We proposed a novel spatial-temporal memory network (STMN) for video object detection. Our main contributions are a carefully-designed recurrent computation unit that integrates pre-trained image classification weights into the memory and an in-network alignment module that spatially-aligns the memory across time. Together, these lead to state-of-the-art results on ImageNet VID with the VGG-16 base network and competitive results with ResNet-101 when compared to existing approaches.


This work was supported in part by the ARO YIP under Grant Number W911NF-17-1-0410, the AWS Cloud Credits for Research Program, and GPUs donated by NVIDIA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of ARO or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.