Occlusion poses a significant challenge to reliably tracking or detecting objects. Current object detection tasks [4, 5, 7, 21, 27] mostly ignore full occlusion. However, maintaining an object's position through occlusion is useful in several settings and helps us better understand video scenes. For instance, to encode video demonstrations, it is useful to reliably detect or track small tools that are often occluded from the camera viewpoint. Similarly, in self-driving car applications, predicting the trajectories of pedestrians benefits from tracking them reliably even under occlusion.
While some older methods model explicit occluder-occludee relationships in visual object trackers to track objects through occlusion, in this paper we develop our approach following the more recent tracking by detection paradigm. Here, instead of using visual object trackers to track occluded objects, we first use video object detectors to detect them. The frame level detections obtained from the detector are then linked temporally using association methods like [2, 31] to generate tracks. This way we prevent the catastrophic failures encountered by visual object trackers, where the tracker completely loses the object and is unable to track it further even after it has reappeared. Tracking by detection also lets us easily extend our method to multi-object scenarios. In this paper, however, we concentrate only on the detection subproblem, since a considerable amount of work has already gone into developing association methods for multi-object tracking.
The detection subproblem is typically addressed by frame level and video level object detectors. Frame level object detectors like Faster RCNN and RFCN can hardly distinguish between two situations: one where an object is occluded and one where the object is not present. Video object detectors, on the other hand, look into a temporal context surrounding the query frame to accumulate features, and then reason on top of the accumulated features. This helps the detector collect features of the object of interest from surrounding frames in the video even if the object is occluded in the current frame.
Existing video object detectors [1, 8, 11, 17, 18, 19, 32, 33, 34, 35] are mostly designed to handle partial occlusion, motion blur and unseen viewpoints, among other issues that frame level detectors cannot deal with. Although we treat full occlusion the same way we treat any of the above issues, producing bounding boxes when an object is occluded is more challenging, mainly for the following reasons: a) When an object is occluded in a frame, information about the class of the object does not come from that particular frame, and hence the architecture needs to rely heavily on the temporal connections to obtain this information. b) The temporal connections usually have to reason on features belonging to completely different classes of objects to understand occlusion. For example, if a coffee mug is occluded by the hand of the person using the mug, the temporal connections need to combine features of the mug and the human hand and determine that the mug is occluded by the hand.
In this research, we aim to solve the occlusion modelling problem in a data driven end-to-end fashion by adding a recurrent computational unit inside region based object detectors, following [32], to enable propagation of features of the occluded object from both ends of the video. In doing so, we are able to maintain an approximate position of the object through occlusion. We use spatio-temporal memory networks from [32] as our baseline model and show that our model achieves a substantial improvement in terms of raw detection score (mAP) under such severe cases of occlusion. Additionally, we show that our method achieves results competitive with state-of-the-art methods on video object detection datasets like ImageNet VID, which does not deal with complete occlusion of objects.
2 Related work
2.1 Tracking through occlusion
Prior work on tracking objects through occlusion with visual object trackers mostly models explicit inter-occlusion relationships between objects in the scene, formulates motion models, or uses external knowledge. Although such methods are useful when we have no prior information about the objects we want to track, visual object trackers suffer from a fundamental issue: once they lose track of the object of interest, they can hardly recover. Also, methods that model occlusion explicitly make several fundamental assumptions about the data (such as the motion of objects, or whether objects belong to the foreground or the background), which makes them less than ideal for real-world scenarios. For example, one such approach assumes that both occluder and occludee objects belong to foreground regions and learns to model occlusion relationships explicitly by classifying different occlusion events. Specifically, when an object is becoming occluded, it identifies a foreground region merging event, whereas when an occluded object is becoming visible it identifies a foreground region splitting event. Thus, by identifying such region merging, splitting and continuation events, the approach is able to track an object through occlusion. Unless the occluder objects are known beforehand, it is hard to make such methods work in a more general setting.
2.2 Tracking by detection paradigm
The drift encountered by traditional visual object trackers can in certain cases lead to irrecoverable failures. Even when the object reappears and is visible, if the appearance model of the tracker has changed considerably, it will not be able to track the object of interest further. The situation worsens when multiple objects of interest appear and disappear throughout a video sequence. These issues can be tackled by the recently popular paradigm of tracking by detection. The core idea is that instead of tracking the objects directly across frames with visual object trackers, we use object detectors to produce bounding boxes at every frame. Once we have the bounding boxes from the detector, we use a data association method like [2, 31] to link the boxes at different time steps to form tracks. Since a considerable amount of work has gone into developing association methods, in this research we concentrate only on the subproblem of robust detection of objects under occlusion. It is worth noting that a frame level object detector, i.e. an object detector that runs on individual frames of a video, is unable to understand occlusion. Hence, in order to detect occluded objects, it is necessary to use video object detectors, which aggregate features from a temporal context surrounding a query frame to produce detections.
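To make the linking step concrete, the following is a minimal sketch of tracking by detection under simplifying assumptions. It uses a greedy IoU-based matcher, which is a deliberately simpler stand-in for the association methods of [2, 31]; the names `iou` and `link_detections` and the threshold `min_iou` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_detections(
    frames: List[List[Box]], min_iou: float = 0.3
) -> Dict[int, List[Tuple[int, Box]]]:
    """Greedily link per-frame boxes into tracks (assumed, simplified matcher)."""
    tracks: Dict[int, List[Tuple[int, Box]]] = {}
    next_id = 0
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        # Only tracks that received a box in the previous frame stay active.
        active = [tid for tid, tr in tracks.items() if tr[-1][0] == t - 1]
        for tid in active:
            if not unmatched:
                break
            last_box = tracks[tid][-1][1]
            best = max(unmatched, key=lambda b: iou(last_box, b))
            if iou(last_box, best) >= min_iou:
                tracks[tid].append((t, best))
                unmatched.remove(best)
        # Every box that could not be linked starts a new track.
        for box in unmatched:
            tracks[next_id] = [(t, box)]
            next_id += 1
    return tracks
```

In this sketch a track ends as soon as one frame has no matching box, which is why a detector that keeps producing boxes through occlusion matters for the paradigm.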
2.3 Frame level object detection
Over the last few years, object detection [3, 10, 9, 20, 22, 24, 25, 26] in static images has received a lot of attention. Its success can mostly be attributed to very deep convolutional backbones [12, 30]. The earliest of these detectors is the two-stage object detector R-CNN, in which region proposals are first computed from the image and each proposal is then classified. Later, computationally lighter versions of the original R-CNN were made possible by leveraging ROI pooling layers and by sharing the convolutional backbone with the region proposal network. RFCN introduces the position-sensitive ROI pooling layer and achieves a significant speed-up compared to Faster RCNN while achieving competitive detection accuracy. For our application, we mostly build upon Faster RCNN and RFCN, the two most popular region based object detectors.
2.4 Video level object detection
A more recent task in the domain of object detection, video object detection [1, 8, 11, 17, 18, 19, 32, 33, 34, 35], has also received a lot of attention in the past few years. Even though static object detectors can easily be applied to individual frames of a video, there are difficult cases where they fail to perform reasonably well. Such cases can mostly be attributed to occlusion, motion blur and unseen poses of the objects, all of which make the detection task challenging for a per-frame detector. This is where video object detectors come in. Instead of looking solely at the query frame, video object detectors accumulate features from a temporal context surrounding the query frame, in the hope that rich features of the object can be obtained from at least some of the frames in that context. Most recent approaches have concentrated on how to propagate such features efficiently in time.
Although some of the older methods like [11, 17, 18, 19] relied heavily on exhaustive post-processing of detections produced by frame level detectors, more recent approaches [1, 8, 32, 33, 34, 35] focus on building connections across the convolutional feature maps of the network backbones at different time steps to aggregate features, and then reason on the aggregated features. Such methods significantly outperform methods that rely on post-processing of frame level detections. A good number of these methods [1, 8, 35] use a short window of frames centered around the frame of interest to accumulate features, while methods like [16, 32, 34] use recurrent computational units such as the LSTM cell, the spatio-temporal memory module (STMM) and the ConvGRU cell to propagate features in time.
3 Datasets
Most existing datasets for video object detection do not take into consideration objects that undergo full occlusion. Since visual cues of objects are not present when they are fully occluded, existing datasets treat them as if they were not present and hence have no ground-truth annotations for such object instances. This creates the need for datasets with ground-truth annotations for occluded objects. In our datasets, when an object gets fully occluded, we annotate it based on an approximate guess, which may not be precisely accurate. We also restrict our domain to indoor situations where we try to detect common handheld objects that undergo occlusion (mostly by the human hand). We do this to avoid certain ambiguous situations in outdoor environments; for example, when a pedestrian gets occluded by a large building, it might be hard to predict the exact location of the pedestrian.
3.1 Staged occlusion dataset
The first type of dataset we collect consists of videos in which common hand-held desktop objects like mugs, calculators and notepads are manipulated by an individual in front of a simple white background, with scenes captured by a static camera. Objects are manipulated in such a way that, while in motion, they remain occluded by the hand from the camera viewpoint for sufficiently long periods of time. Figure 2(a) shows an example scene from this dataset.
By reducing the noise from external factors like camera movement, scale variation and background clutter, we are able to focus on the effect of long term occlusion. We primarily use this staged occlusion dataset to analyze and qualitatively evaluate the impact of occlusion on different methods.
3.2 Furniture assembly dataset
The second type of dataset consists of different furniture assembly tasks collected from the internet, as shown in figure 2(b). This dataset is more representative of occlusions that happen in the natural world. In this case, we try to detect only one class of objects (all small tools like screws, nuts and bolts are grouped into one class). Objects vary widely in scale and are often occluded by hands and by tools such as hammers and screwdrivers. We use this assembly dataset for quantitative evaluation by reporting the mean average precision (mAP) of different detection methods.
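For reference, average precision for a single class can be computed as in the sketch below. This is one common variant (all-point interpolation of the precision-recall curve); the exact protocol used for our numbers is not restated here, and the function name `average_precision` and its input format are hypothetical. Matching detections to ground-truth boxes (for example by an IoU threshold) is assumed to have been done beforehand.

```python
from typing import List, Tuple


def average_precision(scored_hits: List[Tuple[float, bool]], num_gt: int) -> float:
    """All-point interpolated AP for one class.

    scored_hits: one (confidence, is_true_positive) pair per detection.
    num_gt: number of ground-truth boxes of this class.
    """
    if num_gt == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions: List[float] = []
    recalls: List[float] = []
    tp = fp = 0
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Replace each precision by the best precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the area under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```

mAP is the mean of this quantity over classes; with a single object class, as in the assembly dataset, mAP and AP coincide.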
3.3 ImageNet VID
Finally, we use the ImageNet VID dataset to see how our method performs against existing video object detection methods on a dataset that does not target full occlusion, although that is not the main goal of this research. We include it because our data driven approach does not model occlusion explicitly; instead, we hope the model learns to handle occlusion through the temporal connections of the video object detector. Hence, ideally, our model should be able to adapt fairly easily to both cases (where occlusion is ignored and where occlusion is annotated).
4 Method
4.1 Building video level architecture
As mentioned earlier, in order to understand occlusion, it is important to utilize the temporal context by building a detector at the video level instead of the frame level. This is better explained in figure 3.
Video object detection methods that use recurrent computational units are not limited to a fixed temporal window and thus, in theory, can propagate information from the very ends of the video sequence. When an object is occluded, information about it cannot come from the frames in which it remains occluded. Window based approaches like [1, 8, 35] can be sub-optimal in this case because the length of the frame window becomes a bottleneck when the occlusion outlasts it. Instead, we build recurrent connections on top of region based object detectors that enable propagation of salient features from both ends of a video sequence. We next explain the architecture of the video level region based object detector.
In frame level region based object detectors like Faster RCNN or RFCN, the input frame is passed through a convolutional backbone (typically a ResNet or VGG backbone) to obtain a full-image backbone feature map for the frame at each time step. A region proposal network (RPN) runs on this feature map and produces several ROI crops. Each ROI crop is then processed further for classification and class-specific offset regression. The final offset regression produces slightly tighter or more relaxed boxes for better localisation. To extend this family of architectures, we first detach the RPN and the ROI-specific layers (ROI pooling, classification and regression layers) and then add the recurrent connections on top of the backbone feature maps. Following standard convention, we call the output of our RNN cell the memory, since it accumulates features from all past frames of the sequence. We then attach the RPN and the ROI-specific layers on top of this memory. Since we reason on the accumulated features in the memory and not on the backbone features of a particular time step, the detector can use features of the object of interest that were propagated to the current memory from when the object was last visible. This way we are able to propagate features of the object of interest through occlusion. Our video level object detector is shown next.
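The order of operations described above can be summarised by the following minimal sketch. It is only a schematic under stated assumptions: `detect_video`, `backbone`, `rnn_cell` and `detection_head` are hypothetical names, the detection head stands for the RPN together with the ROI-specific layers, and the toy usage replaces feature maps with a single scalar and the RNN cell with a decaying maximum purely for illustration.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Frame = TypeVar("Frame")
Feature = TypeVar("Feature")
Detections = TypeVar("Detections")


def detect_video(
    frames: Iterable[Frame],
    backbone: Callable[[Frame], Feature],
    rnn_cell: Callable[[Feature, Optional[Feature]], Feature],
    detection_head: Callable[[Feature], Detections],
) -> List[Detections]:
    """Run the detection head on the recurrent memory, not on per-frame features."""
    memory: Optional[Feature] = None
    outputs: List[Detections] = []
    for frame in frames:
        feature = backbone(frame)               # per-frame backbone feature
        memory = rnn_cell(feature, memory)      # accumulate over all past frames
        outputs.append(detection_head(memory))  # RPN + ROI layers read the memory
    return outputs


# Toy usage: one scalar of "object evidence" per frame (assumed, for illustration).
evidence = [1.0, 0.0, 0.0, 1.0]  # visible, occluded, occluded, visible
toy_cell = lambda feat, mem: feat if mem is None else max(feat, 0.8 * mem)
frame_level = [e > 0.5 for e in evidence]
video_level = detect_video(evidence, lambda f: f, toy_cell, lambda m: m > 0.5)
```

In the toy run, the per-frame decision loses the object in the two occluded frames, while the decision taken on the memory keeps it, which is the behaviour the recurrent connections are meant to provide.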
4.2 Choice of recurrent computational unit
ConvGRU and ConvLSTM cells are common choices for the recurrent computational units of ConvRNNs. These cells are inspired by the original GRU and LSTM cells with gating mechanisms, where the dot product layers are replaced by convolutional layers. Recently, another ConvRNN cell called the spatio-temporal memory module (abbreviated as STMM) [32] was developed, which enables easy transfer of pretrained weights from the frame level detector to the video level detector. In our approach, we build on top of STMM because of its strong performance on the ImageNet VID dataset and its ability to easily transfer backbone weights pretrained on static image datasets. We find pretraining the weights of the baseline frame level detector to be particularly useful in our case because of the small volume of training videos. We use Cut, paste and learn [6] to generate synthetic static image datasets to pretrain the weights of the frame level detector.
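For concreteness, one common formulation of the ConvGRU update is written out below. The notation is an assumption introduced here for illustration: $x_t$ is the input feature map at time $t$, $h_t$ is the hidden state (the memory), $*$ denotes convolution, $\odot$ denotes element-wise multiplication and $\sigma$ is the sigmoid function. Conventions differ between papers on whether $z_t$ or $1 - z_t$ weights the previous state.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * x_t + U_z * h_{t-1}\right) \\
r_t &= \sigma\left(W_r * x_t + U_r * h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W_h * x_t + U_h * \left(r_t \odot h_{t-1}\right)\right) \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Every convolution in this form reads its inputs at the same spatial location, up to the kernel extent, which is why memory alignment becomes a concern when objects move between frames.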
4.3 Memory alignment
In practice, a vanilla STMM is unable to align the memory properly. Successive misalignments end up forming a trail of salient features in the memory, which often leads to false positive detections and inaccurate localisation. Xiao and Lee [32] address this issue by introducing the MatchTrans module. They use the correlation between the backbone features of adjacent frames to determine affinity coefficients, which are then used to warp the spatio-temporal memory for alignment.
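The idea of correlation based warping can be illustrated with the following minimal sketch. It is a simplification under several assumptions and not the MatchTrans implementation of [32]: the feature maps have a single spatial dimension, correlations are clamped to be non-negative, and the names `align_memory`, `radius` and `eps` are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def align_memory(
    feat_prev: List[Vector],
    feat_curr: List[Vector],
    memory_prev: List[Vector],
    radius: int = 1,
    eps: float = 1e-8,
) -> List[Vector]:
    """Warp memory_prev towards the current frame using feature correlation.

    All inputs are 1-D grids holding one feature vector per location.
    """
    n = len(feat_curr)
    channels = len(memory_prev[0])
    aligned: List[Vector] = []
    for x in range(n):
        # Neighbourhood of x in the previous frame, clipped at the borders.
        neighbours = range(max(0, x - radius), min(n, x + radius + 1))
        # Non-negative correlation of the current feature with each neighbour.
        corr = {j: max(0.0, dot(feat_curr[x], feat_prev[j])) for j in neighbours}
        total = sum(corr.values())
        warped = [0.0] * channels
        for j, c in corr.items():
            weight = c / (total + eps)  # affinity coefficient
            for ch in range(channels):
                warped[ch] += weight * memory_prev[j][ch]
        aligned.append(warped)
    return aligned
```

For example, if an object moves one cell to the right between frames, the memory stored at its old location is carried to the new one. In this simplified sketch, a location whose current feature is all zeros receives zero total affinity, so whatever the memory held there is discarded.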
4.3.1 Effect of explicit memory alignment
While this method of explicit alignment works well for objects that are not completely occluded, we observe that under severe long-term occlusion, correlation based alignment can erase salient features in the memory of the RNN cell. This happens because, when an object is occluded, backbone feature activations do not always fire for the object (since occluders can often belong to the background class). This in turn results in lower affinity coefficients for the spatial locations where the occluded object exists at a given time. Applying MatchTrans over successive time steps causes the feature activations in memory to die out and can thus result in false negatives, as shown in figure 6.
This introduces a trade-off between long-term propagation of features under occlusion and better alignment for accurate localisation. We propose to address this via an alignment learning module that can act as an alternative to explicit correlation based alignment.
4.3.2 Learning to align the memory
Standard implementations of ConvRNN cells, including the STMM cell, use features from the same spatial location of their two inputs (the current backbone feature and the previous memory) to update the corresponding cell of the output feature. However, unless the objects in a scene are static or moving very slowly, such operations can be problematic, especially since it is common to skip frames of a video to deal with the redundancy of adjacent frames. We believe that, in order to align the memory with standard convolution layers, we should at least ensure that the layers of the RNN cell have large enough receptive fields with respect to its input features. A naive implementation of this can be achieved by increasing the kernel size or by adding successive 3 by 3 convolution layers. Although simple, such architectures are not memory efficient, since each added convolution layer increases the receptive field only by a fixed amount. Hence, the number of parameters that must be added scales linearly with the increase in the effective receptive field. Instead, we propose the following method.
First, we build feature pyramids of the two input features of the RNN cell (the backbone feature of the current frame and the memory from the previous time step). For this step, we use a standard 2 x 2 max pooling operation for downsampling and set the number of pyramid levels to 3, which gives a good balance between memory usage and performance. Each pyramid starts from the full-resolution feature map, and every subsequent level is obtained by max pooling the level below it. We next propagate information using only the topmost level of each pyramid instead of the corresponding full-resolution feature maps. This way, we increase the effective receptive field of the layers in the RNN cell without adding more parameters to it. The output of the STMM cell thus needs to be upsampled before it is passed on to the other subnetworks of the object detector, such as the region proposal network and the ROI pooling layers. To upsample the newly updated memory, we use skip connections from the backbone feature pyramid to combat the information loss due to downsampling and to help the network align the memory. Every level of upsampling has three steps. First, we apply bilinear upsampling to scale the feature maps by 2x. Second, we optionally zero-pad along the width, the height or both axes to match the spatial resolution of the corresponding feature map from the backbone feature pyramid. This zero padding causes an additional misalignment of 1 pixel in feature space along the corresponding axis. Third, to deal with that, we apply a 3 by 3 convolution on top of the feature maps together with the skip connections from the backbone feature pyramid. The entire architecture of our modified recurrent computational unit is shown in figure 8. This way we also add far fewer parameters to the network. The only parameters we add are those for the skip connections, and their number increases linearly with the number of levels in the pyramid. On the other hand, the effective receptive field increases exponentially with the number of levels. Thus, in our model, the number of added parameters grows only logarithmically with the effective receptive field.
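The scaling argument can be checked with a small calculation. The sketch below computes theoretical receptive fields only, assuming stride-1 convolutions for the naive stack and a single 3 by 3 convolution applied at the top of a pyramid built with 2 x 2, stride-2 max pooling; the function names are hypothetical and the actual cell may contain more layers than this simplified count.

```python
def stacked_conv_receptive_field(num_layers: int, kernel: int = 3) -> int:
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def pyramid_receptive_field(num_levels: int, kernel: int = 3) -> int:
    """Receptive field, in full-resolution cells, of one convolution applied at
    the top of a pyramid built with (num_levels - 1) rounds of 2x2 max pooling."""
    field, jump = 1, 1
    for _ in range(num_levels - 1):
        field += (2 - 1) * jump  # a 2x2 pooling window widens the field
        jump *= 2                # stride 2 doubles the step between cells
    return field + (kernel - 1) * jump


def layers_needed(target_field: int, kernel: int = 3) -> int:
    """Smallest number of stacked stride-1 convolutions reaching `target_field`."""
    layers = 0
    while stacked_conv_receptive_field(layers, kernel) < target_field:
        layers += 1
    return layers
```

Under these assumptions, a three-level pyramid gives one 3 by 3 convolution a receptive field of 12 full-resolution cells, whereas a plain stack needs six 3 by 3 convolutions to cover the same extent. Each additional pyramid level doubles the field, while each additional stacked layer adds only two cells.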
5 Experiments
In this section, we evaluate our model both quantitatively and qualitatively on the respective datasets and show its effectiveness in learning the alignment. We use frame level detectors and STMN as baseline models to compare against our method. Unlike [32], we do not take the ensemble of the frame level detector and STMN; throughout our experiments we evaluate only single-model performance. Note that all the modules discussed earlier can be plugged into any existing region based object detector with any backbone. For each dataset we use a different combination of convolutional backbone and frame level base network, and thereby show that our method does not depend on the type of backbone or frame level detector.
5.1 Experiments on furniture assembly dataset
|Base detector|Faster RCNN|
|Type of RNN|unidirectional|
|Type of NMS|standard|
For both occlusion datasets, we use Faster RCNN with VGG16 and ResNet-50 backbones as the frame level baseline detector. We train this detector on synthetic datasets generated with the method of [6]. Once the Faster RCNN baseline is trained, we add the recurrent connections to the model and fine-tune the entire network end-to-end. Since we are interested in building online methods, the RNN in our case is unidirectional, with information flowing only from the beginning to the end of the video. We use stochastic gradient descent with an initial learning rate of 1e-3 and lower it to 1e-4 as the training loss plateaus. During training, we employ standard left-right flipping for data augmentation, and at test time we use standard non-max suppression with an IoU threshold of 0.3. While there are additional techniques to boost the mAP, like OHEM [28] for better ROI sampling or seq-NMS [11] for better post-processing of raw detections, we do not use them here. Under these settings, we obtain the detection scores shown in table 2.
|Video level|STMM|learned (ours)|0.26|
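The test-time non-max suppression step mentioned in the settings above follows the usual greedy procedure, sketched below with the IoU threshold of 0.3 as the default. The function names and the box format `(x1, y1, x2, y2)` are assumptions made for this illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: List[Box], scores: List[float], iou_threshold: float = 0.3
) -> List[int]:
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

A threshold as low as 0.3 suppresses aggressively, so two heavily overlapping boxes of the same class are reduced to the higher-scoring one.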
Unsurprisingly, we observe that our method significantly outperforms the baseline frame level object detector. The detection scores in table 2 also confirm that our method of aligning the features is more suitable under such strong cases of occlusion. We also show qualitative results of our method on this dataset in figure 10.
5.2 Experiments on ImageNet VID dataset
In our approach, we try to build a data driven end-to-end method for detecting occluded objects in videos and do not plan to model occlusion explicitly. Hence, it is worthwhile to see how our way of learning the alignment compares against explicit alignment with MatchTrans when objects are visible throughout the scene. To do so, we consider the ImageNet VID dataset, a common dataset for benchmarking video object detectors. From figures 11 and 12 we observe that learned alignment gives a relatively better aligned memory when compared to that of MatchTrans.
Furthermore, we quantitatively evaluate our method on the ImageNet VID dataset to show how it compares with current state-of-the-art approaches to video object detection. To make a fair comparison, we make some changes to our method to match the experimental settings of [32]. The details are available in table 3. Our settings differ from those of [32] in only two aspects: i) we evaluate the single-model performance of the video object detector and not the performance of the ensemble model with RFCN, and ii) during training we unroll the RNN for 4 time steps instead of 7, because we were unable to fit the latter on a 12 GB Nvidia Titan X GPU. Under these settings, STMN with MatchTrans achieves an mAP of 0.789 and our STMN with learned alignment achieves an mAP of 0.796. Although we acknowledge that, in the case with no occlusion, the improvement is not necessarily statistically significant, we are able to show that our method learns the alignment well enough to act as an alternative to state-of-the-art methods for video object detection tasks.
|Type of RNN|bidirectional|
|Type of NMS|seq-NMS|
6 Conclusion and future work
In this paper, we present a data driven approach to detecting occluded objects in videos. To the best of our knowledge, prior work in this domain has avoided data driven occlusion reasoning, primarily due to the lack of available data to train on. Although the advantage of such data driven methods is that we do not need to make any fundamental assumptions about the data, we observe that our method learns biases towards the commonly occluding objects that it has seen during training. As a result, it is unable to generalize to unseen occluder objects at test time. Future work will concentrate on generalisation across different occluding objects.
Also, without a significant volume of training data it is very difficult to make purely data driven occlusion modelling methods work well, and building such datasets with varying levels of occlusion can be laborious. Future work will also target creating synthetic videos and using domain adaptation techniques to address this problem.
References
[1] Object detection in video with spatiotemporal sampling networks. The European Conference on Computer Vision (ECCV).
[2] (2016) Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468.
[3] (2016) R-FCN: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387.
[4] (2018) Scaling egocentric vision: the EPIC-KITCHENS dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 720–736.
[5] ImageNet: a large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
[6] (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1301–1310.
[7] (2010) The PASCAL visual object classes (VOC) challenge. International journal of computer vision 88 (2), pp. 303–338.
[8] (2017) Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046.
[9] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587.
[10] (2015) Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448.
[11] (2016) Seq-NMS for video object detection. arXiv preprint arXiv:1602.08465.
[12] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.
[13] Http://image-net.org/challenges/lsvrc/2017/results#vid.
[14] (2005) Tracking multiple objects through occlusions. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, pp. 1051–1058.
[15] (2007) Synthetic aperture tracking: tracking through occlusions. In 2007 IEEE 11th International Conference on Computer Vision, pp. 1–8.
[16] (2017) Object detection in videos with tubelet proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 727–735.
[17] T-CNN: tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology 28 (10), pp. 2896–2907.
[18] Object detection from video tubelets with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 817–825.
[19] (2016) Multi-class multi-object tracking using changing point detection. In European Conference on Computer Vision, pp. 68–83.
[20] (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.
[21] Microsoft COCO: common objects in context. In European conference on computer vision, pp. 740–755.
[22] (2016) SSD: single shot multibox detector. In European conference on computer vision, pp. 21–37.
[23] (2007) Robust occlusion handling in object tracking. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
[24] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788.
[25] (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271.
[26] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.
[27] (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
[28] (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769.
[29] (2017) Convolutional gated recurrent networks for video segmentation. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3090–3094.
[30] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[31] (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649.
[32] (2018) Video object detection with an aligned spatial-temporal memory. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 485–501.
[33] (2018) Towards high performance video object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218.
[34] (2018) Towards high performance video object detection for mobiles. arXiv preprint arXiv:1804.05830.
[35] (2017) Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417.