Impression Network for Video Object Detection

by Congrui Hetang, et al.
SenseTime Corporation

Video object detection is more challenging than image object detection. Previous works have shown that applying an object detector frame by frame is not only slow but also inaccurate: visual clues get weakened by defocus and motion blur, causing failures on the corresponding frames. Multi-frame feature fusion methods improve accuracy but dramatically sacrifice speed, while feature propagation based methods improve speed but sacrifice accuracy. So is it possible to improve speed and accuracy simultaneously? Inspired by how humans use impressions to recognize objects in blurry frames, we propose Impression Network, which embodies a natural and efficient feature aggregation mechanism. In our framework, an impression feature is established by iteratively absorbing sparsely extracted frame features. The impression feature is propagated all the way down the video, helping enhance the features of low-quality frames. This impression mechanism makes it possible to perform long-range multi-frame feature fusion among sparse keyframes with minimal overhead. It significantly improves on the per-frame detection baseline on ImageNet VID while being 3 times faster (20 fps). We hope Impression Network can provide a new perspective on video feature enhancement. Code will be made available.



1 Introduction

Fast and accurate video object detection methods are highly valuable in a vast number of scenarios. Single-image object detectors like Faster R-CNN [22] and R-FCN [3] have achieved excellent accuracy on still images, so it is natural to apply them to video tasks. One intuitive way is to apply them frame by frame on videos, but this is far from optimal. First, image detectors typically involve a heavy feature network like ResNet-101 [11], which runs rather slowly (5 fps) even on GPUs. This hampers their potential in real-time applications like autonomous driving and video surveillance. Second, single-image detectors are vulnerable to the common image degeneration problem in videos [33]. As shown in Figure 2, frames may suffer from defocus, motion blur, strange object positions and all sorts of deterioration, leaving visual clues that are too weak for successful detection. These two problems make object detection in videos challenging.

Feature-level methods [6, 34, 33, 26] have addressed one or the other of the two problems. These methods treat the single-image recognition pipeline as two stages: 1. the image is passed through a general feature network; 2. the result is then generated by a task-specific sub-network. When transferring image detectors to videos, feature-level methods seek ways to improve the feature stage, while the task network remains unchanged. This task-independence makes feature-level methods versatile and conceptually simple. To improve speed, feature-level methods reuse sparsely sampled deep features in the first stage [34, 26], because nearby video frames provide redundant information. This saves the expensive feature network inference and boosts speed to a real-time level, but sacrifices accuracy. On the other hand, accuracy can be improved by multi-frame feature aggregation [33, 21]. This enables successful detection on low-quality frames, but the aggregation cost can be huge and thus further slows down the framework. In this work, we combine the advantages of both tracks. We present a new feature-level framework that runs at real-time speed and outperforms the per-frame detection baseline.

Our method, called Impression Network, is inspired by the way humans understand videos. When a new frame arrives, humans do not forget previous frames. Instead, an impression is accumulated along the video, which helps us understand degenerated frames with limited visual clues. This mechanism is embodied in our method to enhance frame features and improve accuracy. Moreover, we combine it with sparse keyframe feature extraction to obtain real-time inference speed. The pipeline of our method is shown in Figure 1.


To address the redundancy and improve speed, we split a video into segments of equal length. For each segment, only one keyframe is selected for deep feature extraction. With flow-guided feature propagation [34, 5], the key feature is reused by non-key frames to generate detection results. On top of this, we adopt our Impression mechanism to perform multi-frame feature fusion. When a key feature is extracted, it not only goes to the task network, but is also absorbed by an impression feature. The impression feature is then propagated down to the next keyframe. The task feature for the next keyframe is a weighted combination of its own feature and the impression feature, and the impression feature is updated by absorbing the feature of that frame. This process continues along the whole video. In this framework, the impression feature accumulates high-quality video object information and is propagated all the way down, helping enhance incoming key features when the frames are deteriorated. It improves the overall quality of task features, and thus increases detection accuracy.

The Impression mechanism also contributes to speed. With its iterative aggregation policy, it minimizes the cost of feature fusion. Previous work [33] has shown that video frame features should be spatially aligned with flow-guided warping before aggregation, and that the cost of flow computation is not negligible. A straightforward approach requires one flow estimation for each frame being aggregated, while Impression Network only needs one extra flow estimation between adjacent segments, which is much more efficient.

Without bells and whistles, Impression Network surpasses state-of-the-art image detectors on ImageNet VID [24] dataset. It’s three times faster (20 fps) and significantly more accurate. We hope Impression Network can provide a new perspective on feature aggregation in video tasks.

Code will be released to facilitate future research.

2 Related Work

Feature Reuse in Video Recognition: As shown by earlier analysis [30, 35, 15, 32, 27], consecutive video frames are highly similar, as are their high-level convolutional features. This suggests that video sequences have an inherent redundancy, which can be exploited to reduce time cost. In single-image detectors [8, 10, 22, 7, 3], the heavy feature network (encoder) is much more costly than the task sub-network (decoder). Hence, when transplanting image detectors to videos, speed can be greatly improved by reusing the deep features of frames. Clockwork Convnets [26] exploit the different rates at which features at different levels evolve. By updating low-level and high-level convolutional features at different frequencies, they partially avoid redundant feature computation. This makes the network faster, but sacrifices accuracy due to the lack of end-to-end training. Deep Feature Flow [34] is another successful feature-level acceleration method. It cheaply propagates the top feature of sparse keyframes to other frames, achieving a significant speed-up (from 5 fps to 20 fps). Deep Feature Flow requires motion estimation such as optical flow [12, 2, 29, 23, 5, 13] to propagate features; this introduces error and therefore brings a minor accuracy drop. Impression Network inherits the idea of Deep Feature Flow, but also utilizes temporal information to enhance the shared features. It is not only faster than the per-frame baseline, but also more accurate.

Exploiting Temporal Information in Video Tasks: Applying state-of-the-art still-image detectors frame by frame on videos does not provide optimal results [33]. This is mainly due to the low-quality images in videos. Single-image detectors are vulnerable to deteriorated images because they are restricted to the frame they are looking at, ignoring the ample temporal information from other frames in the video. Temporal feature aggregation [17, 20, 25, 28, 18, 1, 31] provides a way to utilize such information. Flow-Guided Feature Aggregation (FGFA) [33] aims at enhancing frame features by aggregating all frame features in a consecutive neighborhood. The aggregation weight is learned through end-to-end training. FGFA boosts video detection accuracy to a new level, yet it is three times slower than the per-frame solution. This is caused by the aggregation cost. For each frame in the fusion range, FGFA requires one optical flow computation to spatially align it with the target frame, which costs even more time than the feature network. Additionally, since neighboring frames are highly similar, the exhaustive dense aggregation leads to extra redundancy. Impression Network fuses features in an iterative manner, where only one flow estimation is needed for every new keyframe. Moreover, the sparse feature sampling reduces the amount of replicated information.

Figure 2: Examples of deteriorated frames in videos.

3 Impression Network

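input: video frames, segment length
for each segment do
    extract keyframe feature
    if first segment then (first keyframe)
        initialize impression feature with the keyframe feature
        use the keyframe feature as task feature
    else
        warp previous impression feature to the current keyframe
        compute aggregation weights (adaptive weighting)
        fuse into task feature and update impression feature
    end if
    for each frame in the segment do (feature propagation)
        warp task feature to the frame (flow-guided warp)
        run detection sub-network (detection result)
    end for
end for
output: detection results
Algorithm 1 Inference algorithm of Impression Network for video object detection.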

3.1 Impression Network Inference

Given a video, our task is to generate detection results for all of its frames. To avoid redundant feature computation, we split the frame sequence into segments of equal length. In each segment, only one frame (by default the central frame) is selected as keyframe for feature extraction via the feature network. The key feature is propagated to the remaining frames with flow-guided warping, where the flow field is computed by a light-weight flow network, following the practice of Deep Feature Flow [34]. Features of all frames are then fed into the task network to generate detection results.

In such a framework, we use the impression mechanism to exploit long-range, cross-segment temporal information. The inference phase of Impression Network is illustrated in Figure 1. Each segment generates three features: the keyframe feature, calculated by passing the keyframe through the feature network; the task feature, shared by all frames in the segment for the detection sub-network; and the impression feature, containing long-term temporal information. For the first segment, the task feature and the impression feature are identical to the keyframe feature. For the next segment, the task feature is a weighted combination of the previous impression feature and the new keyframe feature. The aggregation unit uses a tiny FCN to generate position-wise weight maps. Generally, larger weights are assigned to the feature with better quality.


Notice that such quality is not a handcrafted metric; instead it is learned by end-to-end training to minimize the task loss. We observe that when a keyframe is deteriorated by motion blur or defocus, its feature gets a lower quality score, as shown in Figure 4. Also notice that the aggregation of cross-segment features is not simply adding them up. Former practice [33] shows that, due to spatial misalignment in video frames, a naive weighted mean yields worse results. Here we use flow-guided aggregation. Specifically, we first calculate the flow field between the current keyframe and the previous keyframe, then perform spatial warping accordingly on the previous impression feature to align it with the current keyframe feature; the fusion is then done with the warped impression feature and the current keyframe feature to generate the task feature (Eq. 3). The impression feature is then updated by absorbing the current keyframe feature (Eq. 4).

In Eq. 4, a constant factor controls how much previously accumulated information is carried into the updated impression feature. It serves as a gate to control the memory of the framework (detailed in Figure 6). If it is set to 0, the updated impression feature will only contain information of the current keyframe feature. The procedure continues until all frames in a video are processed.

By iteratively absorbing every keyframe feature, the impression feature contains visual information over a large time span. The weighted aggregation of the impression feature and the new keyframe feature can be seen as a balance between memory and new information, depending on the quality of the incoming keyframe. When the new keyframe is deteriorated, the impression feature compensates for its weak feature, helping infer bounding box and class information through low-level visual clues such as color distribution. On the other hand, the impression feature also keeps getting updated. Since sharp and clear frames get higher scores, they contribute more to an effective impression. Compared to exhaustively aggregating all nearby features in a fixed range for every frame, our framework is more natural and elegant. The whole process is summarized in Algorithm 1.
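The loop in Algorithm 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released implementation: the callables `feat_net`, `warp`, `weigh` and `detect` are hypothetical stand-ins for the feature network, flow-guided warping, the quality-weighting FCN and the detection sub-network, and the gated update `imp = gate * task + (1 - gate) * feat` is one assumed form that matches the described behaviour (a gate of 0 keeps only the current keyframe feature).

```python
def impression_inference(frames, seg_len, feat_net, warp, weigh, detect, gate=1.0):
    """Sparse-keyframe inference with an impression feature (illustrative sketch).

    warp(feature, src_frame, dst_frame) aligns a feature from src to dst.
    weigh(imp_feature, key_feature) returns the weight of the new key feature.
    """
    results = [None] * len(frames)
    imp = None        # impression feature, aligned to the previous keyframe
    prev_key = None   # index of the previous keyframe
    for start in range(0, len(frames), seg_len):
        end = min(start + seg_len, len(frames))
        key = min(start + seg_len // 2, end - 1)   # central frame of the segment
        feat = feat_net(frames[key])               # the only heavy call per segment
        if imp is None:
            # First segment: task and impression features equal the key feature.
            task = feat
            imp = feat
        else:
            # One extra flow-guided warp per segment aligns the impression.
            imp_warped = warp(imp, frames[prev_key], frames[key])
            w = weigh(imp_warped, feat)
            task = (1.0 - w) * imp_warped + w * feat
            # ASSUMED gated update; gate = 0 keeps only the current key feature.
            imp = gate * task + (1.0 - gate) * feat
        for i in range(start, end):
            f_i = task if i == key else warp(task, frames[key], frames[i])
            results[i] = detect(f_i)
        prev_key = key
    return results
```

With scalar stand-ins for features (identity warping, a constant weight of 0.5), the sketch makes it easy to see that the feature network runs once per segment while every frame still receives a detection result.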

Figure 3: Training framework of Impression Network. The data flow is marked by solid lines. Components linked with dashed lines share weights. The working condition of inference stage is simulated with video frame triplets. All components are optimized end-to-end.

3.2 Impression Network Training

The training procedure of Impression Network is rather simple. With video data provided, a standard single-image object detection pipeline can be transferred to video tasks with slight modifications. The end-to-end training framework is illustrated in Figure 3.

During training, each data batch contains three images from the same video sequence: one frame plus two frames at random offsets from it, where the ranges of the offsets are controlled by the segment length. This setting is coherent with the inference phase: the first image represents an arbitrary frame from the current segment, the second stands for the keyframe of the current segment, and the third stands for the previous keyframe. For simplicity, we refer to the three images as the current frame, the keyframe and the old keyframe. The ground truth at the current frame is provided as label.

In each iteration, the feature network is first applied to the keyframe and the old keyframe to get their deep features. Then, the image pair of keyframe and old keyframe and the image pair of current frame and keyframe are fed into the flow network, yielding two optical flow fields. The flow-guided warping unit then uses the first flow field to propagate the old keyframe feature and align it with the keyframe feature. We call the result the warped old keyframe feature. The aggregation unit weights and fuses the keyframe feature and the warped old keyframe feature, generating the fused feature. The fused feature in training corresponds to the impression feature in inference. This is an approximation, since it only contains information of one previous keyframe. Finally, the fused feature is warped to the current frame according to the second flow field to get the task feature for a standard detection sub-network. Since all the components are differentiable, the detection loss propagates all the way back to jointly fine-tune the feature network, the detection network, the flow network and the feature aggregation unit, optimizing task performance. Notice that single-image datasets can be fully exploited in this framework, in which case the three images are all the same.

3.3 Module Design

Feature Network: We use ResNet-101 pretrained for ImageNet classification. The fully connected layers are removed. For a denser feature map, the feature stride is reduced from 32 to 16. Specifically, the stride of the last block is modified from 2 to 1. To maintain the receptive field, a dilation of 2 is applied to convolution layers with kernel size greater than 1. A 1024-channel convolution layer (randomly initialized) is appended to reduce the feature dimension.

Flow-Guided Feature Propagation: Before aggregation, we spatially align frame features by flow-guided warping. The optical flow field is calculated first to obtain pixel-level motion paths, then the reference feature is warped to the target frame with bilinear sampling. In this procedure, the flow estimation function is applied to the target frame and the reference frame, the bilinear sampler W warps the deep feature of the reference frame to the target frame according to the resulting flow field, and the warped feature is refined by a predicted position-wise scale map. We adopt the state-of-the-art CNN-based FlowNet [5, 13] for optical flow computation. Specifically, we use FlowNet-S [5]. The flow network is pretrained on the FlyingChairs dataset. The scale map has the same channel dimension as the task features, and is predicted in parallel with the flow field through an additional convolution layer attached to the top of FlowNet-S. The new layer is initialized with weights of all zeros and fixed biases of all ones. The implementation of the bilinear sampling unit has been well described in [14, 4, 34]. It is fully differentiable.
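Flow-guided warping with bilinear sampling can be illustrated with a small NumPy sketch. The function name `warp_feature`, the flow convention (for each target position, `flow` gives the displacement to the location sampled in the reference feature), multiplying by the scale map, and clamping at the borders are all assumptions made for this example; the paper defers implementation details to [14, 4, 34].

```python
import numpy as np

def warp_feature(ref_feat, flow, scale=None):
    """Warp a reference feature map to the target frame by bilinear sampling.

    ref_feat: (C, H, W) feature of the reference frame.
    flow:     (2, H, W); flow[0] is dx and flow[1] is dy, so target position
              (y, x) samples the reference at (y + dy, x + dx). ASSUMED convention.
    scale:    optional (C, H, W) position-wise scale map applied after warping.
    """
    c, h, w = ref_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling locations, clamped to the feature map (border handling is assumed).
    x = np.clip(xs + flow[0], 0, w - 1)
    y = np.clip(ys + flow[1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax = x - x0   # horizontal interpolation weight
    ay = y - y0   # vertical interpolation weight
    # Interpolate along x on the two neighbouring rows, then along y.
    top = ref_feat[:, y0, x0] * (1 - ax) + ref_feat[:, y0, x1] * ax
    bottom = ref_feat[:, y1, x0] * (1 - ax) + ref_feat[:, y1, x1] * ax
    warped = top * (1 - ay) + bottom * ay
    return warped if scale is None else warped * scale
```

A zero flow field returns the reference feature unchanged, and a scale map of all ones (the initialization described above) leaves the warped feature untouched.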

Aggregation Unit: The aggregation weights of features are generated by a quality estimation network. It has three randomly initialized convolution layers. The output is a position-wise raw score map, which is applied to each channel of the task feature. Raw score maps of different features are normalized by a softmax function so that they sum to one at every position. We then multiply the score maps with the features and sum them up to obtain the fused feature as in Eq. 3.
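The normalization and fusion step can be sketched as follows. The raw score maps are taken as given inputs here (in the paper they come from the quality estimation network), and the function name `fuse_features` is a hypothetical one chosen for this example.

```python
import numpy as np

def fuse_features(features, raw_scores):
    """Fuse aligned features with softmax-normalized position-wise weights.

    features:   list of N arrays of shape (C, H, W), already spatially aligned.
    raw_scores: list of N arrays of shape (H, W), one raw score map per feature.
    Returns the fused (C, H, W) feature and the (N, H, W) weight maps.
    """
    feats = np.stack(features)      # (N, C, H, W)
    scores = np.stack(raw_scores)   # (N, H, W)
    # Softmax across the N features at every position (shifted for stability).
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)
    # The same weight map is applied to every channel of its feature.
    fused = (weights[:, None] * feats).sum(axis=0)
    return fused, weights
```

Equal raw scores reduce the fusion to a plain average, while a much higher score at some position lets the corresponding feature dominate there.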

Detection Network: We use the state-of-the-art R-FCN as detection sub-network. RPN and R-FCN are attached to the 1024-channel convolution of the feature network, using the first and second 512 channels respectively. RPN uses 9 anchors and generates 300 proposals for each image. R-FCN uses groups of position-sensitive score maps.

3.4 Runtime Complexity Analysis

The ratio of inference time of our method to that of per-frame evaluation can be estimated by counting operations.

In each segment, Impression Network requires: 1. one flow-guided warp per frame in total, that is, one for impression feature propagation and one for each non-key frame detection; 2. one feature fusion operation; 3. one feature network inference for the keyframe feature; 4. one detection sub-network inference per frame. In comparison, the per-frame solution takes one feature network inference and one detection sub-network inference for every frame. Notice that compared to the feature network (ResNet-101 in our practice) and FlowNet, the complexity of the warping, fusion and detection components is negligible. So the ratio can be approximated as the cost of one flow estimation relative to one feature network inference, plus the reciprocal of the segment length.

In practice, the flow network is much smaller than ResNet-101, while the segment length is large to reduce redundancy. This suggests that, unlike existing feature aggregation methods like FGFA, Impression Network can perform multi-frame feature fusion while maintaining a noticeable speedup over the per-frame solution.
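The approximation can be made concrete with a short calculation. This sketch assumes one flow estimation per frame of a segment and one feature network inference per segment, and ignores everything else; the function name and the cost values in the usage line are illustrative placeholders, not measurements from the paper.

```python
def runtime_ratio(seg_len, flow_cost, feat_cost):
    """Approximate cost of sparse-keyframe inference relative to per-frame inference.

    seg_len:   number of frames per segment.
    flow_cost: cost of one flow network inference (any unit).
    feat_cost: cost of one feature network inference (same unit).
    Warping, fusion and detection costs are treated as negligible.
    """
    ours = seg_len * flow_cost + feat_cost   # flows for the segment + one key feature
    per_frame = seg_len * feat_cost          # feature network on every frame
    return ours / per_frame                  # equals flow_cost / feat_cost + 1 / seg_len

# Illustrative numbers only: a flow network 10x cheaper than the feature network.
ratio = runtime_ratio(seg_len=10, flow_cost=1.0, feat_cost=10.0)
```

The ratio decreases as the segment grows and as the flow network gets cheaper, and it is bounded below by the relative cost of the flow network.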

Figure 4: Examples of frames assigned with different aggregation weights. The white number is the spatially averaged pixel-wise weight in algorithm 1. Consistent with intuition, the scoring FCN assigns larger weights to sharp and clear frames.
Figure 5: Examples where Impression Network outperforms per-frame baseline (standard ResNet-101 R-FCN). Green boxes are true positives while red ones are false positives.

4 Experiments

4.1 Experiment Setup

ImageNet VID dataset [24]: It is a large-scale video object detection dataset. There are 3862, 555 and 937 snippets with frame rates of 25 and 30 in the training, validation and test sets, respectively. All the video snippets are fully annotated. The ImageNet VID dataset has 30 object categories, which are a subset of those in the ImageNet DET dataset. In our experiments, following the practice in [16, 19, 34, 33], models are trained on the training set, while evaluations are done on the validation set with the standard mean average precision (mAP) metric.

Implementation Details: Our training set consists of the full ImageNet VID train set, together with images from the ImageNet DET train set. Only the same 30 categories are used. As mentioned before, each training batch contains three images. If sampled from DET, all three images are the same. In both training and testing, images are resized to have a shorter side of 600 and 300 pixels for the feature network and the flow network, respectively. The whole framework is trained end to end with SGD, where 120K iterations are performed on 8 GPUs. The learning rate is held at its initial value for the first 70K iterations, then reduced for the remaining 50K iterations. For clear comparison, no bells and whistles like multi-scale training and box-level post-processing are used. Inference time is measured on an NVIDIA GTX 1060 GPU.

methods (a) (b) (c) (d) (e)
sparse feature?
mAP (%)
runtime (ms) 156 48 50 50
Table 1: Accuracy and runtime of different approaches.

4.2 Ablation Study

Architecture Design: Table 1 summarizes the main experimental results. It shows a comparison of the single-frame baseline, Impression Network and its variants.

Method (a) is the standard ResNet-101 R-FCN applied frame by frame to videos. The accuracy is close to the mAP reported in [34], which shows its validity as a strong baseline for our evaluations. The runtime is a little faster, probably due to differences in implementation environment. The 6 fps inference speed is insufficient for real-time applications, where typically a speed of 15 fps is required.

Method (b) is a variant of Method (a) with sparse feature extraction. In this approach, videos are divided into segments of equal length. Only one keyframe in each segment is passed through the feature network for feature extraction. That feature is then propagated to other frames with optical flow. Finally, the detection sub-network generates results for every frame. The structure is identical to a Deep Feature Flow framework for video object detection [34]. Specifically, the segment length is set to 10 for all experiments in this table. We select the 5th frame as keyframe, because this minimizes the average feature propagation distance, which reduces the error introduced and improves accuracy (explained later). Compared to per-frame evaluation, there is a minor accuracy drop, mainly because of lessened information, as well as errors in flow-guided feature warping. However, the inference speed remarkably increases to 21 fps, showing that sparse feature extraction is an efficient way to trade accuracy for speed.

Method (c) is a degenerated version of Impression Network. Keyframe features are iteratively fused to generate the impression feature, but without the quality-aware weighting unit. The weights in Eq. 3 are naively fixed to a constant. For all experiments here, the memory gate in Eq. 4 is set to a fixed value. With information of previous frames fused into the current task feature, mAP increases over the per-frame baseline. Notice that sparse feature extraction is still enabled here, which shows that 1. the computational redundancy of per-frame evaluation is huge; 2. such redundancy is not necessary for higher accuracy. Due to the one additional flow estimation for each segment, the framework slows down a little, yet still runs at a real-time-level 20 fps.

Method (d) is the proposed Impression Network. Here the aggregation unit uses the tiny FCN to generate position-wise weights. Through end-to-end training, the sub-network learns to assign smaller weights to features of deteriorated frames, as shown in Figure 4. Experiments on the ImageNet VID validation set show that the aggregation weight in Algorithm 1 approximately follows a normal distribution. Quality-aware weighting brings another mAP improvement, mainly because of the increase in valid information. Overall, Impression Network raises mAP to 75.5% (see Table 3), comparable to the exhaustive feature aggregation method [33], while being significantly faster, running at 20 fps. Impression Network shows that, if redundancy and temporal information are properly handled, the speed and accuracy of video object detection can actually be improved simultaneously. Examples are shown in Figure 5.

Method (e) is Impression Network without end-to-end training. The feature network is trained in a single-image detection framework, the same as in Method (a). The flow network is the off-the-shelf FlowNet-S [5] pretrained on FlyingChairs. Only the weighting and detection sub-networks learn during training. This clearly worsens the mAP, showing the importance of end-to-end optimization.

Figure 6: Averaged contribution of previous keyframes to current detection at different memory gate values. When the gate is at its largest value, the contribution smoothly decreases as the offset grows. As the gate decreases, the impression gets increasingly occupied by the nearest keyframe, while the contribution of earlier ones rapidly shrinks to zero.
Figure 7: mAP at different memory gate values. Although it is not exactly how the network is trained, enabling long-range aggregation does bring noticeable improvement.

The Influence of Memory Gate: As shown in Eq. 4, the memory gate controls the composition of impression features. Here we study its influence on mAP. Experiment settings are the same as Method (d) in Table 1, except that the memory gate value is varied. Figure 6 shows the average contribution of previous keyframes to the current detection at different gate values. It can be seen that the gate controls the available range of temporal information. When it is set to 0, the impression feature consists solely of the previous key feature, just like how the framework is trained, while setting it to 1.0 brings in more temporal information. Figure 7 shows the mAP at different gate settings. A larger gate value benefits accuracy. The involvement of long-range feature aggregation may help detection in longer series of low-quality frames.
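The qualitative behaviour in Figure 6 can be reproduced with a toy calculation. The sketch below assumes a constant fusion weight of 0.5 and an assumed gated update of the form impression = gate * task + (1 - gate) * keyframe feature; both are simplifications chosen for illustration, and the function name is hypothetical. It tracks how much each past keyframe contributes to the task feature at the latest keyframe.

```python
def keyframe_contributions(num_keyframes, gate, w=0.5):
    """Return the share of each keyframe in the task feature of the last keyframe.

    Entry j of the result is the coefficient of keyframe j's own feature.
    ASSUMED model: task = (1 - w) * impression + w * new feature, and
    impression = gate * task + (1 - gate) * new feature, with no warping error.
    """
    imp = [1.0]    # after the first keyframe the impression is its feature
    task = [1.0]
    for _ in range(1, num_keyframes):
        # Task feature: old impression scaled by (1 - w), new keyframe gets w.
        task = [(1 - w) * c for c in imp] + [w]
        # Gated impression update; the new keyframe absorbs the remaining mass.
        imp = [gate * c for c in task[:-1]] + [gate * task[-1] + (1 - gate)]
    return task

long_memory = keyframe_contributions(4, gate=1.0)   # [0.125, 0.125, 0.25, 0.5]
short_memory = keyframe_contributions(4, gate=0.0)  # [0.0, 0.0, 0.5, 0.5]
```

Under these assumptions, a gate of 1.0 gives contributions that decay smoothly with the keyframe offset, while a gate of 0 leaves only the nearest previous keyframe, in line with the description of Figure 6.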

keyframe id   average propagation distance (frames)   mAP (%)
0 5.5 73.9
1 4.7 74.4
2 4.1 74.9
3 3.7 75.2
4 3.5 75.5
5 3.5 75.5
Table 2: Average propagation distance and mAP at different keyframe selections. Other settings are same as Method (d) in Table 1. Because of the symmetry, only id 0-5 is shown.

Different Keyframe Selection: In the aforementioned experiments, we select the central frame of each segment as keyframe to reduce error. Here we explain this choice and compare different keyframe schedules. Flow-guided feature warping introduces error and, as shown in [34], the error is positively correlated with propagation distance, because larger displacement increases the difficulty of pixel-level matching. Hence, we take the average feature propagation distance $d$ as a metric for flow error, and seek to minimize it. $d$ is calculated as:

$$d = \frac{1}{l}\left(\sum_{i=0}^{l-1} |i - k| + l\right)$$

where $d$ is the propagation distance, $k$ is the id of the keyframe, and $l$ is the segment length. The key feature needs to be propagated to the non-key frames, and there is also an impression feature propagation of distance $l$. There is an optimal $k$ that minimizes $d$:

$$k = \frac{l - 1}{2}$$

which shows that the central frame is the best (for even $l$, the two frames nearest the centre tie). Table 2 shows mAPs at different keyframe selections, consistent with this analysis. Notice that selecting the first frame enables strict real-time inference, while selecting the central frame brings a slight latency, since the frames preceding the keyframe must wait for it. This can be traded off according to application needs.
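The distances in Table 2 can be checked numerically. The sketch assumes that the average counts the within-segment distance from the keyframe to every frame of the segment plus one impression propagation spanning a full segment length, divided by the segment length; under that assumption it reproduces the distance column of Table 2 for a segment length of 10.

```python
def avg_propagation_distance(key_id, seg_len):
    """Average feature propagation distance for a keyframe at index key_id."""
    # Distance from the keyframe to every frame in the segment (0 for itself).
    within_segment = sum(abs(i - key_id) for i in range(seg_len))
    # One impression propagation between adjacent keyframes, seg_len frames apart.
    return (within_segment + seg_len) / seg_len

distances = [avg_propagation_distance(k, 10) for k in range(6)]
# distances == [5.5, 4.7, 4.1, 3.7, 3.5, 3.5], matching Table 2
```

Ids 4 and 5 tie because a segment of even length has no single central frame.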

Figure 8: Comparing speed-accuracy curves of Deep Feature Flow (DFF) and Impression Network (Impression). Both using ResNet-101 as feature network and FlowNet-S as flow network.
method mAP (%) runtime (ms)
FGFA 76.3 733
FGFA-fast 75.3 356
Impression Network 75.5 50
Table 3: Comparison with aggregation-based method FGFA and its faster variant. Settings are same as Method (d) in Table 1.

4.3 Compare with Other Feature-Level Methods

We compare Impression Network with other feature-level video object detection methods. In Figure 8, we compare the speed-accuracy curves of Impression Network and Deep Feature Flow [34]. The per-frame baseline is also marked. The segment length varies from 1 to 20. Impression Network is more accurate than the per-frame solution even in the high-speed zone. Similar to Deep Feature Flow, it also offers a smooth accuracy-speed trade-off as the segment length varies. The accuracy drops a little when the segment length gets close to 1, which is reasonable because Impression Network is trained for aggregating sparse frame features. Dense sampling limits the aggregation range and results in a less useful impression.

Table 3 compares Impression Network with Flow-Guided Feature Aggregation and its faster variant. Both are described in [33]. FGFA is the standard version with a fusion radius of 10, and FGFA-fast is the accelerated version, which only calculates flow fields for adjacent frames and composes them for non-adjacent pairs. This comparison shows that the accuracy of Impression Network is on par with the best aggregation-based method, while being much more efficient.
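To put the runtimes of Table 3 in terms of frame rate, the per-frame times can be converted directly. The snippet only restates the table's numbers; the helper name `fps` is chosen for this example.

```python
# Per-frame runtimes in milliseconds, copied from Table 3.
runtime_ms = {"FGFA": 733, "FGFA-fast": 356, "Impression Network": 50}

def fps(ms_per_frame):
    """Convert a per-frame runtime in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame

frame_rates = {name: fps(ms) for name, ms in runtime_ms.items()}
# Speed of Impression Network relative to each method, as a runtime ratio.
speedup = {name: ms / runtime_ms["Impression Network"] for name, ms in runtime_ms.items()}
```

At 50 ms per frame Impression Network runs at 20 fps, while both FGFA variants stay below 3 fps, for a difference in mAP of at most 0.8 points.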

5 Conclusion and Future Work

This work presents a fast and accurate feature-level method for video object detection. The proposed Impression mechanism explores a novel scheme for feature aggregation in videos. Since Impression Network works at the feature stage, it is complementary to existing box-level post-processing methods like Seq-NMS [9]. For now we use FlowNet-S [5] to guide feature propagation for clear comparison, while more efficient flow algorithms [13] exist and can surely benefit our method. We use a fixed segment length for simplicity, while an adaptively varying length may schedule computation more reasonably. Moreover, as a feature-level method, Impression Network inherits task-independence, and has the potential to tackle the image degeneration problem in other video tasks.