Geometry-Aware Video Object Detection for Static Cameras

by Dan Xu, et al.

In this paper we propose a geometry-aware model for video object detection. Specifically, we consider the setting where cameras can be well approximated as static, e.g. in video surveillance scenarios, and scene pseudo depth maps can therefore be inferred easily from the object scale on the image plane. We make the following contributions: First, we extend the recent anchor-free detector (CornerNet [17]) to video object detection. In order to exploit the spatio-temporal information while maintaining high efficiency, the proposed model accepts video clips as input, and only makes predictions for the starting and the ending frames, i.e. heatmaps of object bounding box corners and the corresponding embeddings for grouping. Second, to tackle the challenge of scale variations in object detection, scene geometry information, e.g. derived depth maps, is explicitly incorporated into deep networks for multi-scale feature selection and for the network prediction. Third, we validate the proposed architectures on an autonomous driving dataset generated from the Carla simulator [5], and on a real dataset for human detection (DukeMTMC dataset [28]). When compared with existing competitive single-stage and two-stage detectors, the proposed geometry-aware spatio-temporal network achieves significantly better results.





1 Introduction

(a) A False Positive Detection Case
(b) Height in Pixels of Objects
(c) Pseudo Depth Map of Humans
Figure 1: An illustration of the motivation of using scene geometry for detection: (a) a false positive detection of vehicle with a wrong scale; (b) the height of objects in different classes; (c) the geometry priors derived from (b), i.e. 2.5D pseudo depth maps, able to provide useful geometric constraints on the object scales for learning a geometry-aware detector.

In the deep learning era, we often expect deep networks to learn all the required world knowledge given sufficient training data. However, as images are essentially a 2D projection of the 3D world, and the depth information is lost during image formation [Hartley and Zisserman(2003)], the scene geometry, e.g. the depth, plays an essential role in resolving the ambiguities from scale variations and object occlusion in images. Despite the great success achieved in video object detection [Han et al.(2016)Han, Khorrami, Paine, Ramachandran, Babaeizadeh, Shi, Li, Yan, and Huang, Tripathi et al.(2016)Tripathi, Lipton, Belongie, and Nguyen, Kang et al.(2016)Kang, Ouyang, Li, and Wang, Lu et al.(2017)Lu, Lu, and Tang], detecting objects under different scales or occlusions has only partially been tackled, either by learning multi-scale features in a brute-force fashion [Chen et al.(2016)Chen, Yang, Wang, Xu, and Yuille] or by utilizing aggressive data augmentation [Pepikj et al.(2013)Pepikj, Stark, Gehler, and Schiele]. This may indicate that these detectors have not actually learnt the scene geometry well.

These existing approaches mostly work on videos collected from dynamic cameras or internet video streams, such as the ImageNet VID dataset [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Li], and thus the varying scene geometry is difficult to incorporate explicitly. However, there are many real-world applications, for instance video surveillance, where the cameras can be well approximated as static, i.e. the camera sits at a fixed position, and the relative depth of an object in the world coordinate system can be determined from its height on the image plane [Hoiem et al.(2006)Hoiem, Efros, and Hebert]. In this paper, we propose a deep model for video object detection under static cameras, where the relative scene depth information can be estimated effectively (as shown in Figure 1), and is further used as a strong geometric constraint for learning a scale-aware object detector.

To investigate the effectiveness of using scene geometry in video object detection under the static camera setting, we first design a compact video object detector. While the two-stage anchor-based object detectors (e.g. Faster-RCNN [Ren et al.(2016)Ren, He, Girshick, and Sun]) have achieved impressive accuracy on image-wise object detection, the efficiency of the model is usually limited by the region proposal process. The generation of proposals would also significantly increase the complexity of the model, especially when we deal with object detection in videos. In this paper, we extend the more efficient single-stage, anchor-free, single-frame object detection model CornerNet [Law and Deng(2018)] by incorporating spatio-temporal information. The proposed geometry-aware spatio-temporal network is termed GAST-Net. Our main contribution is three-fold: (i) We design a spatio-temporal corner network structure, which accepts video clips (image sequences) as input. As far as we know, this is the first use of a corner-based scheme for object detection in videos. The network utilizes a spatio-temporal convolutional backbone to encode appearance and motion representations, which are further used to predict and group corners only for the first frame and the last frame of the sequences. By doing so, we are able to capture the long temporal dependencies in videos. (ii) We explore how the scene geometry derived from static cameras can be employed as priors for multi-scale feature selection and for the network prediction, therefore helping to tackle the challenges from scale variations in video object detection. (iii) Extensive experiments have been conducted on a synthetic autonomous driving dataset generated with Carla [Dosovitskiy et al.(2017)Dosovitskiy, Ros, Codevilla, Lopez, and Koltun], and on pedestrian detection on the DukeMTMC dataset [Ristani et al.(2016)Ristani, Solera, Zou, Cucchiara, and Tomasi]. On both datasets, we demonstrate clear benefits of incorporating the scene geometry, and show that the proposed GAST-Net significantly outperforms existing competitive single-stage and two-stage detectors.

2 Related Work

Object Detection in Static Images. Two families of detectors are currently popular. The first is two-stage detectors, e.g. R-CNN [Girshick et al.(2013)Girshick, Donahue, Darrell, and Malik], Fast R-CNN [Girshick(2015)], Faster R-CNN [Ren et al.(2016)Ren, He, Girshick, and Sun] and R-FCN [Dai et al.(2016)Dai, Li, He, and Sun]. The main idea of these detectors is to train a small sub-network for generating proposals that potentially contain objects, and then learn a classification network to predict the existence and categories of the objects. The second is one-stage detectors that predict object bounding boxes in one step, such as YOLO [Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi], SSD [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] and CornerNet [Law and Deng(2018)]. Our model is also a one-stage detector which, similar to CornerNet, first predicts corner heatmaps and embeddings and then groups the corners into individual objects. We extend it by building a spatio-temporal corner network to capture temporal information for video object detection.

Object Detection in Videos. As an important research topic, video object detection has drawn significant attention [Vu et al.(2018)Vu, Osokin, and Laptev, Tripathi et al.(2016)Tripathi, Lipton, Belongie, and Nguyen, Prest et al.(2012)Prest, Leistner, Civera, Schmid, and Ferrari, Kwak et al.(2015)Kwak, Cho, Laptev, Ponce, and Schmid, Joulin et al.(2014)Joulin, Tang, and Li, Tang et al.(2019)Tang, Wang, Wang, Liu, Zeng, and Wang, Xiao(2018), Bertasius et al.(2018)Bertasius, Torresani, and Shi]. To take advantage of existing image-based detectors, several works focus on post-processing class scores from image-based detectors, and enforce temporal consistency on the scores. For instance, tubelet proposals are generated in [Kang et al.(2016)Kang, Ouyang, Li, and Wang] by applying a tracker to frame-based bounding box proposals. The class scores along the tubelet are then re-scored by a 1D CNN model. Unlike the detection problem in static images, videos naturally contain temporal coherence, with objects changing smoothly in time. Zhu et al. [Zhu et al.(2017)Zhu, Xiong, Dai, Yuan, and Y.] thus consider a motion-based model that applies a detection net only on key frames, while an optical flow net is used for propagating deep features to the rest of the frames. To further simplify dense prediction in optical flow, a recent work [Feichtenhofer et al.(2017)Feichtenhofer, Pinz, and Zisserman] proposes to simultaneously learn detection and tracking. In this paper, we consider a common case of video object detection, where the camera is static. In this situation, the derived geometry information can be applied to help the CNN with geometry-aware learning, in order to eliminate the scale ambiguities in 2D images.

Object Detection from RGB-D data. Another line of research uses RGB-D data, where the scene depth information has been demonstrated to be beneficial for various computer vision tasks [Schwarz et al.(2018)Schwarz, Milan, Periyasamy, and Behnke, Lin et al.(2013)Lin, Fidler, and Urtasun], and is also widely used for object detection [Gupta et al.(2014)Gupta, Girshick, Arbeláez, and Malik, Eitel et al.(2015)Eitel, Springenberg, Spinello, Riedmiller, and Burgard, Spinello and Arras(2011)]. Among the existing works, Gupta et al. [Gupta et al.(2014)Gupta, Girshick, Arbeláez, and Malik] proposed a joint framework for object detection and semantic segmentation, in which the depth maps are encoded with a geocentric encoding approach to provide complementary features to the RGB representations. Spinello et al. [Spinello and Arras(2011)] also explored using RGB-D data as input for people detection. There also exist some works exploring the use of depth data for 3D object detection [Song and Xiao(2016), Qi et al.(2018)Qi, Liu, Wu, Su, and Guibas]. Qi et al. [Qi et al.(2018)Qi, Liu, Wu, Su, and Guibas] developed a 3D point cloud deep representation model for effective 3D object bounding box prediction. However, these works require explicit depth data captured from depth sensors, which are not available in many application scenarios. Our work targets deriving the scene depth information from the RGB data, and thus does not require additional depth sensors other than standard RGB cameras.

Geometry-Aware Deep Learning. Scene geometry is considered important prior information for computer vision tasks [Wang et al.(2018)Wang, Zhao, Li, Wang, and Tao, Liu et al.(2018)Liu, Lis, Salzmann, and Fua]. Leibe et al. [Leibe et al.(2008)Leibe, Schindler, Cornelis, and Luc] explored joint object tracking and detection using geometry assumptions within a traditional non-deep-learning framework. In crowd counting [Liu et al.(2018)Liu, Lis, Salzmann, and Fua], as the camera usually sits at a fixed position and the variance between people's heights is small, it is easy to obtain the homography between the image and the head plane. By incorporating this information in the model, it becomes possible to directly predict the crowd density in the physical world. Moreover, previous work [Hoiem et al.(2006)Hoiem, Efros, and Hebert] also considered placing local object detection in the context of the overall 3D scene, by directly modelling the interdependence of objects, surface orientations, and camera viewpoint. In our work, we aim to explore using scene geometry for the task of video object detection with CNNs. Instead of estimating accurate 3D geometry, we consider deriving and utilizing scene-specific coarse depth as well as image-plane coordinates, and enforce the convolutional operations to be conditioned on the object scales and positions, leading to geometry-aware deep learning.

3 Geometry-Aware Spatio-Temporal Corner Network

Figure 2: Framework of the proposed geometry-aware spatio-temporal corner network for video object detection from static cameras. It accepts multiple frames as input, and predicts heatmaps and embeddings of the first and the last frame for detection. The geometry input contains 2D image-plane coordinates and 2.5D pseudo depth maps, which can be directly derived from training data. P denotes a prediction module. The three operator symbols in the diagram denote convolution, element-wise multiplication and element-wise addition, respectively.

Figure 2 depicts an overview of the proposed GAST-Net. It consists of two main components. The first is the proposed spatio-temporal network that accepts video clips as input, and outputs multi-head feature representations with both appearance and motion information at different scales. The second component is a geometry-aware module that first encodes the inferred relative depth maps (i.e. the pseudo depth maps), and then uses the encoding to select features dynamically based on the geometry information. Intuitively, given the depth for all pixels on the image plane, features from a large receptive field should be used to detect objects that are close to the camera. In our case, 2D image-plane coordinates and the pseudo depth maps are used to represent the image and scene geometry, and both can be derived from the training data. Finally, the corner heatmaps and embeddings are predicted from the fused feature representation, and bounding boxes are obtained by grouping the corners. We introduce the details of the proposed GAST-Net in the following.

3.1 Spatio-Temporal Corner Network

Single-Frame Corner Network. In contrast to the traditional anchor-based detectors, such as SSD [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] and Faster-RCNN [Ren et al.(2016)Ren, He, Girshick, and Sun], CornerNet [Law and Deng(2018)] is a one-stage, anchor-free detector that works in a bottom-up fashion. The main idea is to regress heatmaps for the top-left and the bottom-right corners of the objects and predict their embeddings. The bounding boxes that outline the objects are then generated by grouping corners according to their embeddings.
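The grouping step can be illustrated with a small sketch. It is a simplified greedy variant written for illustration, not the exact CornerNet procedure: the `Corner` class, the scalar embedding, and the `max_dist` threshold are assumptions made here. A top-left and a bottom-right corner are paired when their embeddings are close and the bottom-right corner actually lies below and to the right of the top-left one.

```python
from dataclasses import dataclass


@dataclass
class Corner:
    x: float
    y: float
    score: float      # heatmap confidence of the corner
    embedding: float  # scalar grouping embedding (assumed 1-D here)


def group_corners(top_lefts, bottom_rights, max_dist=0.5):
    """Greedily pair corners whose embeddings are closest.

    Returns boxes as (x1, y1, x2, y2, score), where the score is the
    mean of the two corner scores. max_dist is a hypothetical threshold.
    """
    candidates = []
    for i, tl in enumerate(top_lefts):
        for j, br in enumerate(bottom_rights):
            # Geometric sanity check: a valid box has positive width/height.
            if br.x <= tl.x or br.y <= tl.y:
                continue
            dist = abs(tl.embedding - br.embedding)
            if dist <= max_dist:
                candidates.append((dist, i, j))

    candidates.sort()  # closest embedding pairs first
    used_tl, used_br, boxes = set(), set(), []
    for _, i, j in candidates:
        if i in used_tl or j in used_br:
            continue  # each corner is used at most once
        used_tl.add(i)
        used_br.add(j)
        tl, br = top_lefts[i], bottom_rights[j]
        boxes.append((tl.x, tl.y, br.x, br.y, (tl.score + br.score) / 2))
    return boxes
```

In this sketch, a missed corner or two objects with nearly identical embeddings directly produces a missing or wrongly sized box, which is why the quality of the embeddings matters for the final detections.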

Proposed Multi-Frame Spatio-Temporal Corner Network. For efficient object detection in videos, we extend the CornerNet architecture by exploiting the spatio-temporal information. To capture the temporal relationship between adjacent frames, the input to our model is a video clip with multiple frames, and a 3D convolution-based backbone is used as an encoder (variants of VGG and ResNet50 in our experiments). The video clip representation (Figure 2) is further projected to two separate feature maps, corresponding to the first and the last frame respectively, which are later decoded as heatmaps and embeddings of the top-left and bottom-right corners of the object bounding boxes. In order to improve the efficiency and reduce the computational overhead, supervision is only added on the first and the last frame. Multi-scale context is enabled by taking feature maps from different layers of the encoder, and the spatio-temporal representation is calculated by fusing the multi-scale features with the proposed geometry-aware network module.
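To make the clip-based prediction scheme concrete, the sketch below enumerates clips over a video and counts how many times each frame receives a prediction. The clip length of 4 matches the experimental setting reported later, while the sliding stride of 1 and the function names are assumptions for illustration; the source does not specify how clips are sampled.

```python
def clip_windows(num_frames, clip_len=4, stride=1):
    """Return (first, last) frame indices (inclusive) of every clip."""
    return [(start, start + clip_len - 1)
            for start in range(0, num_frames - clip_len + 1, stride)]


def predictions_per_frame(num_frames, clip_len=4, stride=1):
    """Count predictions per frame when only the first and the last
    frame of each clip are predicted."""
    counts = [0] * num_frames
    for first, last in clip_windows(num_frames, clip_len, stride):
        counts[first] += 1
        counts[last] += 1
    return counts
```

Under these assumptions, frames away from the video boundaries are covered twice, once as the first frame of a clip and once as the last frame of an earlier clip, while frames near the start or end of the video are covered only once.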

3.2 Geometry-Aware Network Module

Scene Geometry from Static Cameras. Static cameras are used in a wide range of real-world applications, where an important task is to detect and track all the cars and pedestrians. In these scenarios, as the variations of the objects' physical height tend to be small, the depth information is directly related to their sizes on the image plane. For instance, objects that are further away have smaller scales according to the perspective projection of the camera [Hoiem et al.(2006)Hoiem, Efros, and Hebert]. Therefore, the geometry information from static cameras, i.e. the relative depth, can be directly estimated from the training data, as shown by the pseudo depth map in Figure 1.
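The relation between image height and depth follows from the standard pinhole projection. As a brief sketch, the symbols here are introduced only for illustration and are not the paper's notation: $f$ is the focal length in pixels, $H$ the physical height of an upright object, $Z$ its depth along the optical axis, and $h$ its height in pixels.

```latex
h = \frac{f\,H}{Z}
\quad\Longrightarrow\quad
Z = \frac{f\,H}{h},
\qquad
\frac{Z_1}{Z_2} = \frac{h_2}{h_1}
\quad \text{for two objects with the same physical height } H .
```

For example, if one pedestrian appears 120 pixels tall and another of similar physical height appears 40 pixels tall, the second is roughly three times farther from the camera. Only this relative depth is needed here, so $f$ and $H$ do not have to be known.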

In this work, we mainly consider two types of geometry information, namely image geometry and scene geometry. The image geometry considers the image-plane 2D coordinates as auxiliary information, which can be treated as a means to enable position-dependent convolutions. We generate a set of two coordinate maps $\{C_x, C_y\}$ for the $x$ and the $y$ dimension respectively, with $C_x, C_y \in \mathbb{R}^{H \times W}$, where $H$ and $W$ are the height and the width of the images. In $C_x$, every entry holds the $x$ coordinate of its column, while in $C_y$, every entry holds the $y$ coordinate of its row. $C_x$ and $C_y$ are normalized to the range $[0, 1]$. The other type of geometry information is the scene geometry, i.e. relative scene depth maps. For a fixed camera viewpoint $v$, the object height $h$ and its depth $d$ are inversely proportional [Hoiem et al.(2006)Hoiem, Efros, and Hebert]. Given the bounding boxes of all the objects from the training data, we are able to estimate a coarse relative depth map by calculating the mean of the maximal and the minimal height of the bounding boxes for each row on the map. The rows without any objects are bilinearly interpolated using the values of adjacent rows. For each object class $c$ and each camera viewpoint $v$, we estimate such a pseudo depth map $D_{c,v}$, and for the whole training data, we have a set of class- and scene-specific pseudo depth maps $\{D_{c,v}\}$ with $c = 1, \dots, N_c$ and $v = 1, \dots, N_v$, where $N_c$ and $N_v$ are the number of object classes and camera viewpoints, respectively.
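The construction of the two geometry inputs can be sketched in plain Python. Several details are assumptions made for this illustration, since the text does not fix them: each box is assigned to the image row of its bottom edge, empty rows between populated rows are filled by linear interpolation along the row axis, and empty rows outside the populated range copy the nearest populated row. The stored value is the mean of the smallest and largest box height per row, which serves as a proxy that is inversely related to depth.

```python
def coordinate_maps(height, width):
    """Normalised x and y coordinate maps in [0, 1], each height x width."""
    xs = [c / (width - 1) if width > 1 else 0.0 for c in range(width)]
    ys = [r / (height - 1) if height > 1 else 0.0 for r in range(height)]
    coord_x = [list(xs) for _ in range(height)]  # constant along each column
    coord_y = [[y] * width for y in ys]          # constant along each row
    return coord_x, coord_y


def row_height_profile(boxes, height):
    """Per-row object scale from training boxes (x1, y1, x2, y2) in pixels."""
    lo = [None] * height
    hi = [None] * height
    for _, y1, _, y2 in boxes:
        row = min(max(int(y2), 0), height - 1)  # bottom-edge row (assumption)
        h = y2 - y1
        lo[row] = h if lo[row] is None else min(lo[row], h)
        hi[row] = h if hi[row] is None else max(hi[row], h)

    # Mean of the minimal and the maximal box height on each populated row.
    profile = [None if lo[r] is None else 0.5 * (lo[r] + hi[r])
               for r in range(height)]
    known = [r for r in range(height) if profile[r] is not None]
    if not known:
        raise ValueError("at least one bounding box is required")

    filled = list(profile)
    for r in range(height):
        if profile[r] is not None:
            continue
        below = [k for k in known if k < r]
        above = [k for k in known if k > r]
        if below and above:
            a, b = below[-1], above[0]
            t = (r - a) / (b - a)
            filled[r] = (1 - t) * profile[a] + t * profile[b]
        else:
            # Outside the populated range: copy the nearest populated row.
            filled[r] = profile[below[-1] if below else above[0]]
    return filled


def pseudo_depth_map(profile, width):
    """Broadcast the per-row profile across the image width."""
    return [[value] * width for value in profile]
```

One such map would be built per object class and per camera viewpoint from the training annotations, and then reused unchanged at test time for the same camera.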

Non-Linear Transformation of Geometry Information. Given a camera view $v$, we concatenate the coordinate maps $C_x$ and $C_y$ with the pseudo depth maps $D_{1,v}, \dots, D_{N_c,v}$ of all $N_c$ object classes, and perform non-linear transformations with two Convolution-Batch Norm-ReLU blocks denoted as $\mathcal{T}$. Then the transformation operation is formulated as follows:

$$G = \mathcal{T}\big(\mathrm{Concat}(C_x, C_y, D_{1,v}, \dots, D_{N_c,v})\big),$$

where $\mathrm{Concat}(\cdot)$ is a concatenation operation. After that, we obtain a fine-grained geometry distribution $G$. In our framework, $G$ is used to guide the multi-scale feature selection with an attention mechanism, and is also used to guide the prediction of heatmaps and embeddings for later grouping corners.

Geometry-Aware Multi-Scale Feature Fusion. In order to detect objects of different scales in the image plane, the geometry information is used to modulate the multi-scale features with an attentional process. Given a set of multi-scale features $\{F_i\}_{i=1}^{N}$, we correspondingly learn a set of geometry-aware attention maps $\{A_i\}_{i=1}^{N}$ from the geometry distribution $G$. Our intuition for using $G$ for attention generation is that the geometry information, e.g. the pseudo depth map, places strong constraints on the object scales on the image plane. Formally, we generate the attention maps as follows:

$$\{A_i\}_{i=1}^{N} = \mathrm{Softmax}\big(W_a \ast G\big),$$

where $\mathrm{Softmax}(\cdot)$ is computed along the channels, and $W_a$ are the convolution parameters. Then the set of attention maps is used to select and fuse features in different scales as follows:

$$F = \sum_{i=1}^{N} A_i \odot F_i,$$

where the symbol $\odot$ denotes an element-wise multiplication operation.
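The attention-based fusion can be illustrated with a minimal sketch in plain Python. It makes several simplifying assumptions that are not from the source: each scale is reduced to a single-channel map of the same size, and the `logits` argument stands in for the output of the convolution applied to the geometry features, which is not implemented here. The softmax is taken across scales at every pixel, so the attention weights at each location sum to one.

```python
import math


def softmax_over_scales(logits):
    """logits[i][r][c] -> attention[i][r][c], normalised over the scale index i."""
    n, h, w = len(logits), len(logits[0]), len(logits[0][0])
    attention = [[[0.0] * w for _ in range(h)] for _ in range(n)]
    for r in range(h):
        for c in range(w):
            peak = max(logits[i][r][c] for i in range(n))  # numerical stability
            exps = [math.exp(logits[i][r][c] - peak) for i in range(n)]
            total = sum(exps)
            for i in range(n):
                attention[i][r][c] = exps[i] / total
    return attention


def fuse(features, attention):
    """Attention-weighted sum of the per-scale feature maps."""
    n, h, w = len(features), len(features[0]), len(features[0][0])
    return [[sum(attention[i][r][c] * features[i][r][c] for i in range(n))
             for c in range(w)]
            for r in range(h)]
```

With equal logits the fused value is the plain average of the scales, while a strongly dominant logit makes the fused value follow a single scale, which is the selection behaviour the geometry prior is meant to drive.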

Geometry-Aware Prediction. The geometry distribution $G$ is also used to guide the prediction of heatmaps and embeddings in the spatio-temporal corner network. The prediction part has four independent convolutional layers, corresponding to the top-left and bottom-right corners of the first frame and the last frame, respectively. Each prediction layer accepts features from both the image sub-network and the geometry sub-network, i.e. the fused image features and the geometry distribution $G$. A separate convolution is applied on $G$ to adjust the feature dimensions. In our setting, the number of geometry feature channels is set to a fixed fraction of the number of image feature channels. These two sets of features are then concatenated and fed into the prediction convolutional layer.

3.3 Network Optimization and Inference

The overall network architecture uses a combination of two types of losses. One is a regression loss on the corner heatmaps, and the other is an embedding loss, both applied to the first frame and the last frame. Similar to [Law and Deng(2018)], the heatmap regression uses a focal loss, since the number of corner pixels is much smaller than the number of background pixels. The embedding loss employs a pull-push loss, which trains the network to group the corners of the same object with a pull loss, and to separate the corners of different objects with a push loss. During inference, we supply the network with testing video clips and the geometry input derived in the training phase. We assume that the training and the testing data are collected under the same camera scenes, which is a common setting in applications with static cameras. Each frame of a video can be the first frame of one input clip and the last frame of another, and thus each frame is predicted twice. We collect the grouped bounding boxes from the two predictions and apply an NMS operation to obtain the final bounding box output for that frame.
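The merging of the two predictions of a frame can be sketched as follows. This is a generic greedy NMS written for illustration: the function names and the IoU threshold of 0.5 are assumptions, as the source does not state the threshold used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2, ...)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thresh=0.5):
    """Greedy NMS over detections given as (x1, y1, x2, y2, score)."""
    kept = []
    for det in sorted(detections, key=lambda d: d[4], reverse=True):
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(det, k) <= iou_thresh for k in kept):
            kept.append(det)
    return kept


def merge_frame_predictions(as_first_frame, as_last_frame, iou_thresh=0.5):
    """Pool the boxes predicted for one frame by two clips, then apply NMS."""
    return nms(list(as_first_frame) + list(as_last_frame), iou_thresh)
```

In this sketch an object found by both clips yields two strongly overlapping boxes, of which only the higher-scoring one survives, while an object found by only one of the clips is still kept.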

4 Experiments

Figure 3: Qualitative detection results of humans and vehicles under three different scene views on Carla-Vehicle-Pedestrian dataset. The detected corners are visualized as grey blobs.

4.1 Experimental Setup


We conduct the experiments on two different datasets: (i) a synthetic dataset generated from the open-source self-driving simulator Carla [Dosovitskiy et al.(2017)Dosovitskiy, Ros, Codevilla, Lopez, and Koltun], termed the Carla-Vehicle-Pedestrian (CVP) dataset. We collected around 48 scenes with 60K images in total, all at the same image resolution. Among them, 40 scenes with around 50K images are used for training and the rest for testing. The dataset contains two classes, pedestrian and vehicle. This dataset is very challenging as it has been generated with many small-scale pedestrian and vehicle objects. The frame rate of this dataset is around 9 fps. (ii) the DukeMTMC dataset [Ristani et al.(2016)Ristani, Solera, Zou, Cucchiara, and Tomasi], which was originally created for object tracking and person identification. The dataset contains long videos from 8 different static cameras. We use the video data from cameras 1 to 5, and create a dataset of around 720K images, all at the same image resolution. The frame rate of DukeMTMC is 60 fps. Among these images, 70% are used for training and the rest for testing. In training, we sample every 6th frame, which corresponds to an effective rate of 10 fps. Several qualitative detection samples on the two datasets are shown in Figure 3 and Figure 5.

Parameter Setting and Evaluation Metrics. In training, the images are resized to a fixed, dataset-specific resolution for Carla-Vehicle-Pedestrian and for DukeMTMC. The number of input frames is set to 4 for both datasets. The batch size is set to 8 and 16, and the network is trained for 30 and 20 epochs, for the two datasets respectively. We use Adam [Diederik and Ba(2014)] for optimization, and the weights for the regression focal loss, the push loss, and the pull loss are set to 1, 0.1 and 0.1 respectively. The same initial learning rate is used for both datasets. The detection performance is evaluated with the average precision at IoU 0.5 (AP$_{0.5}$) and at IoU 0.75 (AP$_{0.75}$), and also with mAP, which is calculated by taking the average over the two APs.
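The weighted objective can be sketched as below. The loss weights 1, 0.1 and 0.1 come from the text, whereas the exact forms of the terms are assumptions: they follow the CornerNet formulation (a penalty-reduced focal loss with exponents 2 and 4, and a pull-push loss with margin 1 on scalar embeddings), which the text refers to only as "similar". All function and parameter names are illustrative.

```python
import math


def corner_focal_loss(pred, target, alpha=2, beta=4, eps=1e-12):
    """Focal loss on flattened heatmaps; target == 1 marks a corner pixel,
    other target values in [0, 1) softly reduce the negative penalty."""
    num_pos = sum(1 for y in target if y == 1)
    loss = 0.0
    for p, y in zip(pred, target):
        if y == 1:
            loss -= (1 - p) ** alpha * math.log(p + eps)
        else:
            loss -= (1 - y) ** beta * p ** alpha * math.log(1 - p + eps)
    return loss / max(num_pos, 1)


def pull_push_loss(tl_emb, br_emb, margin=1.0):
    """tl_emb[k] and br_emb[k] are the embeddings of the two corners of object k."""
    n = len(tl_emb)
    means = [(t + b) / 2 for t, b in zip(tl_emb, br_emb)]
    # Pull: both corners of an object move towards their mean embedding.
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl_emb, br_emb, means)) / n
    # Push: mean embeddings of different objects are kept at least `margin` apart.
    push = 0.0
    if n > 1:
        push = sum(max(0.0, margin - abs(means[k] - means[j]))
                   for k in range(n) for j in range(n) if j != k) / (n * (n - 1))
    return pull, push


def total_loss(focal, pull, push, w_focal=1.0, w_push=0.1, w_pull=0.1):
    """Weighted sum using the weights reported in the text."""
    return w_focal * focal + w_push * push + w_pull * pull
```

In the full model this objective would be evaluated for both the first and the last frame of each clip and summed.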

4.2 Experimental Results

Baseline models.

To demonstrate the effectiveness of different components in the proposed GAST-Net, we conduct experiments on several different models: (i) Single-Frame CornerNet, for which we follow [Law and Deng(2018)] while replacing their hourglass backbone with a VGG-11 structure for a fair comparison; (ii) GAST-Net (multi-frame), which is our base spatio-temporal corner network. We use a 3D convolutional network structure (e.g. C3D [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri]) as the spatio-temporal convolutional backbone. The representations from the backbone are decoded separately for the first and the last frame. This model does not employ any geometry information; (iii) Single-Frame CornerNet w/ 2D coordinates or 2.5D pseudo depth map in prediction, which feeds the 2D coordinates or the 2.5D pseudo depth map through the geometry network branch, and uses the encoded geometry distribution to help the network prediction as described in Sec. 3.2. This baseline model is built upon the Single-Frame CornerNet, i.e. model (i); (iv) GAST-Net (multi-frame) w/ 2D coordinates or 2.5D pseudo depth map in prediction, which uses the 2D coordinates or the 2.5D pseudo depth map in the prediction module in the same way as (iii), by utilizing the encoded geometry distribution for the network prediction. This model is built directly upon GAST-Net (multi-frame), i.e. model (ii); (v) GAST-Net (multi-frame) w/ geometry-guided feature fusion (our full model), which additionally uses the geometry information to guide the multi-scale feature fusion on top of model (iv), which uses the geometry only for the network prediction. All the models are trained in the same setting as described in Sec. 4.1 for a fair performance comparison.

(a) PR-Curve at IoU 0.5 on Vehicle Class
(b) PR-Curve at IoU 0.5 on Pedestrian Class
Figure 4: Comparison of Precision-Recall Curves of different variants of the proposed approach on the Carla-Vehicle-Pedestrian dataset.
| Method | Vehicle AP$_{0.5}$ | Vehicle AP$_{0.75}$ | Pedestrian AP$_{0.5}$ | Pedestrian AP$_{0.75}$ |
|---|---|---|---|---|
| Single-Frame CornerNet [Law and Deng(2018)] | 64.27% | 52.20% | 54.08% | 28.63% |
| GAST-Net (multi-frame) | 72.85% | 62.37% | 61.31% | 54.82% |
| Single-Frame CornerNet w/ 2D coordinates in prediction | 67.69% | 55.83% | 56.47% | 31.56% |
| Single-Frame CornerNet w/ 2.5D pseudo depth map in prediction | 67.91% | 54.75% | 56.67% | 30.83% |
| GAST-Net (multi-frame) w/ 2D coordinates in prediction | 76.08% | 69.02% | 65.25% | 57.54% |
| GAST-Net (multi-frame) w/ 2.5D pseudo depth map in prediction | 76.17% | 66.52% | 63.41% | 56.11% |
| GAST-Net (multi-frame) w/ geometry-guided feature fusion | 79.39% | 71.95% | 66.51% | 59.06% |

Table 1: Quantitative comparison of different variants of the proposed approach on the Carla-Vehicle-Pedestrian dataset. We use a backbone structure of C3D [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri], which utilizes a VGG-11 structure while replacing all 2D convolution/pooling with 3D convolution/pooling.

Effectiveness of multi-frame spatio-temporal corner prediction. We conduct an ablation study on the Carla-Vehicle-Pedestrian dataset. A quantitative comparison of the different baseline models is shown in Table 1, and PR-curves of the different approaches are shown in Figure 4. It can be observed that GAST-Net (multi-frame) outperforms Single-Frame CornerNet on all the metrics by a large margin. In terms of AP$_{0.5}$, GAST-Net (multi-frame) is around 8.6 and 7.2 points better than Single-Frame CornerNet on the vehicle and the pedestrian class, respectively. The performance gain is even higher on the stricter metric AP$_{0.75}$ (around 10.2 and 26.2 points, respectively), demonstrating the effectiveness of incorporating temporal relationships in the proposed video object detection architecture.

Effectiveness of geometry guided prediction. When comparing the performance of GAST-Net (multi-frame) w/ 2D coordinates or 2.5D pseudo depth map with GAST-Net (multi-frame), it is clear that the geometry priors, i.e. both the 2D image-plane coordinates and the 2.5D pseudo depth maps, are beneficial for improving the detection performance. On the vehicle class, GAST-Net with 2D coordinates improves AP$_{0.75}$ by around 6.7 points, suggesting that the coordinates are especially beneficial for the network to learn better localization of corners. We also use the geometry information for Single-Frame CornerNet, and consistent performance gains can be observed.

Effectiveness of geometry guided multi-scale feature fusion. In this section, we use the learned geometry distribution to guide multi-scale feature fusion. As shown in Table 1, GAST-Net w/ geometry-guided feature fusion achieves better performance than model (iv) on all the metrics and on all the classes, supporting our initial motivation of encoding the geometry information into the deep network for geometry-aware scale perception and learning.

Figure 5: Qualitative detection results of humans under three different camera view points on the DukeMTMC dataset. The detected corners are visualized in grey blobs.
| Method | Backbone | mAP | AP$_{0.5}$ | AP$_{0.75}$ |
|---|---|---|---|---|
| Faster RCNN [Ren et al.(2016)Ren, He, Girshick, and Sun] | VGG | 63.85% | 80.56% | 47.15% |
| Single-Shot Detector (SSD) [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] | VGG | 59.06% | 73.87% | 44.26% |
| Single-Frame CornerNet [Law and Deng(2018)] | VGG | 61.49% | 72.65% | 50.34% |
| GAST-Net (full model) | VGG | 68.26% | 78.65% | 57.87% |
| Faster RCNN [Ren et al.(2016)Ren, He, Girshick, and Sun] | ResNet-50 | 70.68% | 81.73% | 59.64% |
| Single-Frame CornerNet [Law and Deng(2018)] | ResNet-50 | 68.71% | 75.18% | 62.25% |
| GAST-Net (full model) | ResNet-50 | 74.42% | 80.64% | 68.21% |

Table 2: Quantitative comparison with competitive one-stage and two-stage detectors on the DukeMTMC dataset. Among the comparison methods, Faster RCNN [Ren et al.(2016)Ren, He, Girshick, and Sun] is a two-stage anchor-based detector, while the rest are all one-stage detectors.

Comparison with existing one-stage and two-stage detectors. We compare the proposed architecture with representative one-stage and two-stage object detectors, including Single-Shot MultiBox Detector (SSD) [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg], Faster-RCNN [Ren et al.(2016)Ren, He, Girshick, and Sun], and Single-Frame CornerNet [Law and Deng(2018)] on the DukeMTMC dataset. The comparison experiments are performed with two different backbone network structures, i.e. VGG-11 and ResNet50. Quantitative comparisons are shown in Table 2. GAST-Net achieves the best mAP and AP$_{0.75}$ among these competitors, while Faster-RCNN remains slightly ahead on AP$_{0.5}$. Specifically, compared with the one-stage detectors, ours is around 6.8 points better than Single-Frame CornerNet, and around 9.2 points better than SSD on the mAP metric with the VGG backbone. Ours is also around 4.4 points better than the two-stage Faster-RCNN approach. It can also be noted that our corner-based framework performs much better than the anchor-based SSD and Faster-RCNN on AP$_{0.75}$, which is probably because the dense prediction of corners is more powerful for accurate bounding box localization than sparse anchor-based proposal generation.
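The mAP column and the margins between methods can be recomputed directly from the AP entries of Table 2. The short sketch below does this for the VGG rows; the dictionary layout and function names are illustrative, and the numbers are copied from the table.

```python
# (AP at IoU 0.5, AP at IoU 0.75) in percent, VGG rows of Table 2.
TABLE2_VGG = {
    "Faster RCNN": (80.56, 47.15),
    "SSD": (73.87, 44.26),
    "Single-Frame CornerNet": (72.65, 50.34),
    "GAST-Net (full model)": (78.65, 57.87),
}


def mean_ap(ap50, ap75):
    """mAP as defined in the text: the average of the two APs."""
    return (ap50 + ap75) / 2


def map_margins(table, ours="GAST-Net (full model)"):
    """mAP difference (in points, one decimal) between `ours` and each other method."""
    ours_map = mean_ap(*table[ours])
    return {name: round(ours_map - mean_ap(*aps), 1)
            for name, aps in table.items() if name != ours}


if __name__ == "__main__":
    for name, aps in TABLE2_VGG.items():
        print(f"{name}: mAP = {mean_ap(*aps):.2f}")
    print(map_margins(TABLE2_VGG))
```

Recomputing the margins this way is a quick consistency check between the table and the numbers quoted in the running text.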

Figure 6: Failure examples on the Carla-Vehicle-Pedestrian (CVP) and the DukeMTMC datasets. The object detections that fail are marked with red circles. Missing corners (e.g. the first two examples) or inaccurate grouping (e.g. the last example) of the detected top-left and bottom-right corners are important factors affecting the final detection performance.

Discussion. The proposed GAST-Net is a one-stage approach, which detects the top-left and the bottom-right corners, and learns to group the corresponding corners into bounding boxes. The final detection performance is thus affected by the grouping capability. In our experiments, we observed that the detector is able to produce very good detection and localization of the object corners w.r.t. the Percentage of Correct Keypoints (PCK) recall metric. However, the grouping fails in some cases, for instance for objects of extreme scale, or for crowded scenes with dense occlusion, as shown in Figure 6, leading to lower recall on the object bounding boxes. A possible way to tackle the grouping issues is to investigate a scale-aware network structure with long-term tracking.

5 Conclusion

We have presented a geometry-aware spatio-temporal network (GAST-Net) for video object detection from static cameras. GAST-Net consists of two main parts. The first is a spatio-temporal corner network that performs object detection through corner estimation and grouping, with video clips as input. The second is a geometry-aware network module that uses the scene geometry derived from static cameras for multi-scale feature selection and fusion. Extensive experiments on two challenging datasets demonstrate the superior performance of the proposed approach and show the advantage of using geometry in deep networks for the video object detection task. The geometry-aware network module is also potentially beneficial for other computer vision tasks affected by scale issues, such as object tracking and semantic segmentation.

Acknowledgement. This research work was supported by the EPSRC Programme Grant Seebibyte EP/M013774/1.


  • [Bertasius et al.(2018)Bertasius, Torresani, and Shi] G. Bertasius, L. Torresani, and J. Shi. Object detection in video with spatiotemporal sampling networks. In ECCV, 2018.
  • [Chen et al.(2016)Chen, Yang, Wang, Xu, and Yuille] L. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
  • [Dai et al.(2016)Dai, Li, He, and Sun] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • [Diederik and Ba(2014)] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • [Dosovitskiy et al.(2017)Dosovitskiy, Ros, Codevilla, Lopez, and Koltun] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In CoRL, 2017.
  • [Eitel et al.(2015)Eitel, Springenberg, Spinello, Riedmiller, and Burgard] A. Eitel, J. T. Springenberg, L. Spinello, M. Riedmiller, and W. Burgard. Multimodal deep learning for robust RGB-D object recognition. In IROS, 2015.
  • [Feichtenhofer et al.(2017)Feichtenhofer, Pinz, and Zisserman] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. In ICCV, 2017.
  • [Girshick(2015)] R. B. Girshick. Fast R-CNN. In ICCV, 2015.
  • [Girshick et al.(2013)Girshick, Donahue, Darrell, and Malik] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
  • [Gupta et al.(2014)Gupta, Girshick, Arbeláez, and Malik] S. Gupta, R. Girshick, P. Arbeláez, and J. Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, 2014.
  • [Han et al.(2016)Han, Khorrami, Paine, Ramachandran, Babaeizadeh, Shi, Li, Yan, and Huang] W. Han, P. Khorrami, T. L. Paine, P. Ramachandran, M. Babaeizadeh, H. Shi, J. Li, S. Yan, and T. S. Huang. Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465, 2016.
  • [Hartley and Zisserman(2003)] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
  • [Hoiem et al.(2006)Hoiem, Efros, and Hebert] D. Hoiem, A. Efros, and M. Hebert. Putting objects in perspective. In CVPR, 2006.
  • [Joulin et al.(2014)Joulin, Tang, and Li] A. Joulin, K. Tang, and F. Li. Efficient image and video co-localization with Frank-Wolfe algorithm. In ECCV, 2014.
  • [Kang et al.(2016)Kang, Ouyang, Li, and Wang] K. Kang, W. Ouyang, H. Li, and X. Wang. Object detection from video tubelets with convolutional neural networks. In CVPR, 2016.
  • [Kwak et al.(2015)Kwak, Cho, Laptev, Ponce, and Schmid] S. Kwak, M. Cho, I. Laptev, J. Ponce, and C. Schmid. Unsupervised object discovery and tracking in video collections. In ICCV, 2015.
  • [Law and Deng(2018)] H. Law and J. Deng. CornerNet: Detecting objects as paired keypoints. In ECCV, 2018.
  • [Leibe et al.(2008)Leibe, Schindler, Cornelis, and Luc] B. Leibe, K. Schindler, N. Cornelis, and L. Van Gool. Coupled object detection and tracking from static cameras and moving vehicles. TPAMI, 30(10):1683–1698, 2008.
  • [Lin et al.(2013)Lin, Fidler, and Urtasun] D. Lin, S. Fidler, and R. Urtasun. Holistic scene understanding for 3D object detection with RGBD cameras. In ICCV, 2013.
  • [Liu et al.(2016)Liu, Anguelov, Erhan, Szegedy, Reed, Fu, and Berg] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, and A.C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
  • [Liu et al.(2018)Liu, Lis, Salzmann, and Fua] W. Liu, K. Lis, M. Salzmann, and P. Fua. Geometric and physical constraints for head plane crowd density estimation in videos. arXiv preprint arXiv:1803.08805, 2018.
  • [Lu et al.(2017)Lu, Lu, and Tang] Y. Lu, C. Lu, and C. K. Tang. Online video object detection using association LSTM. In ICCV, 2017.
  • [Pepikj et al.(2013)Pepikj, Stark, Gehler, and Schiele] B. Pepikj, M. Stark, P. Gehler, and B. Schiele. Occlusion patterns for object class detection. In CVPR, 2013.
  • [Prest et al.(2012)Prest, Leistner, Civera, Schmid, and Ferrari] A. Prest, C. Leistner, J. Civera, C. Schmid, and V. Ferrari. Learning object class detectors from weakly annotated video. In CVPR, 2012.
  • [Qi et al.(2018)Qi, Liu, Wu, Su, and Guibas] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D object detection from RGB-D data. In CVPR, 2018.
  • [Redmon et al.(2016)Redmon, Divvala, Girshick, and Farhadi] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [Ren et al.(2016)Ren, He, Girshick, and Sun] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2016.
  • [Ristani et al.(2016)Ristani, Solera, Zou, Cucchiara, and Tomasi] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, 2016.
  • [Russakovsky et al.(2015)Russakovsky, Deng, Su, Krause, Satheesh, Ma, Huang, Karpathy, Khosla, Bernstein, Berg, and Li] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. Li. ImageNet Large Scale Visual Recognition Challenge. IJCV, 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • [Schwarz et al.(2018)Schwarz, Milan, Periyasamy, and Behnke] M. Schwarz, A. Milan, A. S. Periyasamy, and S. Behnke. RGB-D object detection and semantic segmentation for autonomous manipulation in clutter. IJRR, 37(4-5):437–451, 2018.
  • [Song and Xiao(2016)] S. Song and J. Xiao. Deep sliding shapes for amodal 3D object detection in RGB-D images. In CVPR, 2016.
  • [Spinello and Arras(2011)] L. Spinello and K. O. Arras. People detection in RGB-D data. In IROS, 2011.
  • [Tang et al.(2019)Tang, Wang, Wang, Liu, Zeng, and Wang] P. Tang, C. Wang, X. Wang, W. Liu, W. Zeng, and J. Wang. Object detection in videos by high quality object linking. TPAMI, 2019. doi: 10.1109/TPAMI.2019.2910529.
  • [Tran et al.(2015)Tran, Bourdev, Fergus, Torresani, and Paluri] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [Tripathi et al.(2016)Tripathi, Lipton, Belongie, and Nguyen] S. Tripathi, Z. C. Lipton, S. Belongie, and T. Nguyen. Context matters: Refining object detection in video with recurrent neural networks. arXiv preprint arXiv:1607.04648, 2016.
  • [Vu et al.(2018)Vu, Osokin, and Laptev] T. Vu, A. Osokin, and I. Laptev. Tube-CNN: Modeling temporal evolution of appearance for object detection in video. arXiv preprint arXiv:1812.02619, 2018.
  • [Wang et al.(2018)Wang, Zhao, Li, Wang, and Tao] F. Wang, L. Zhao, X. Li, X. Wang, and D. Tao. Geometry-aware scene text detection with instance transformation network. In CVPR, 2018.
  • [Xiao(2018)] F. Xiao and Y. J. Lee. Video object detection with an aligned spatial-temporal memory. In ECCV, 2018.
  • [Zhu et al.(2017)Zhu, Xiong, Dai, Yuan, and Y.] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei. Deep feature flow for video recognition. In CVPR, 2017.