Long Term Temporal Context for Per-Camera Object Detection

by Sara Beery, et al.

In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. However, due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes irregular due to the use of a motion trigger. In order to perform well in this setting, models must be robust to irregular sampling rates. In this paper we propose an attention-based approach that allows our model to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame. We apply our models to two settings: (1) species detection using camera trap data, which is sampled at a low, variable frame rate based on a motion trigger and used to study biodiversity, and (2) vehicle detection in traffic cameras, which have a similarly low frame rate. We show that our model leads to performance gains over strong baselines in all settings. Moreover, we show that increasing the time horizon for our memory bank leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, our best model, which leverages context from up to a month of images, outperforms the single-frame baseline by 17.9% mAP, and outperforms S3D (a 3d convolution based baseline) by 11.2% mAP.




1 Introduction

Figure 1: Visual similarity over long time horizons. In static cameras, there exists significantly more long term temporal consistency than in data from moving cameras. In each case above, the images were taken on separate days, yet look strikingly similar.

Depending on the triggering mechanism and the camera placement, large numbers of the photos at any given camera location can be empty of any objects of interest (up to % for some camera trap datasets) [31]. Further, as the images in static passive-monitoring cameras are taken automatically (without a human photographer), there is no guarantee that the objects of interest will be centered, focused, well-lit, or at an appropriate scale. We break these challenges into three categories, each of which can cause failures in single-frame detection networks:

  • Objects of interest partially observed. Objects can be very close to the camera and occluded by the edges of the frame, partially hidden in the environment due to camouflage, or very far from the camera.

  • Poor image quality. Objects are poorly lit, blurry, or obscured by weather conditions like snow or fog.

  • Background distractors. When moving to a new camera location, there can exist salient background objects that cause repeated false positives.

Figure 2: Static Monitoring Camera Challenges. Images taken without a human photographer carry no quality guarantees; we highlight challenges that cause mistakes in single-frame systems (left) and are fixed by our model (right). False single-frame detections are in red; detections missed by the single-frame model and corrected by our method are in green. Note that in camera traps, the intra-image context is very powerful due to the group behavior of animal species.

These cases are often difficult for humans to correctly classify unless they are provided with temporal context in the form of looking at other images from the same camera. This forms the intuitive basis for our design: a model that can learn how to find and use other, potentially easier examples from the same camera to help improve detection performance (see Figure 2). Further, like most real-world data [42], both the traffic camera and camera trap data have long-tailed class distributions. By providing context for rare classes from other examples, we improve performance on rare classes in the long tail as well as common classes.

We propose a detection architecture that is able to learn to leverage long term memory to improve detection metrics within static cameras, even in low, variable frame rate scenarios. We focus on two static-camera domains: camera traps and traffic cameras. Camera traps are remote static monitoring cameras used by biologists to study animal species occurrence, populations, and behavior. Monitoring biodiversity quantitatively can help us understand the connections between species decline and pollution, exploitation, urbanization, global warming, and conservation policy. Traffic cameras are static monitoring cameras used to monitor roadways and intersections in order to analyze traffic patterns and ensure city safety. In both domains, the contextual signal within a single camera location is strong, and we allow the network to determine which previous images are relevant to the current frame, regardless of their distance in the temporal sequence. This is important within a static camera, as objects exhibit periodic, habitual behavior that causes them to appear days or even weeks apart. For example, an animal might follow the same trail to and from a watering hole every morning and evening, or a bus following its route will return periodically throughout the day. At a high level, this approach could be framed as a non-parametric estimation method (like nearest neighbors) sitting on top of a high-powered parametric function (Faster R-CNN). When train and test locations are quite different, one might not expect a parametric method to generalize well [5]. Considering that the data includes contextual examples across time for a test location, this provides a ‘neighborhood’ of test examples that we can leverage.

To summarize our main contributions:

  • We propose a model capable of leveraging temporal context for improving object detection regardless of frame rate or sampling irregularity.

  • We use our model to achieve major improvements over strong single-frame baselines on two camera trap datasets as well as a traffic camera dataset, showing the generality of our approach.

  • We show that our model is able to leverage up to a month of temporal context, which is significantly more than prior approaches.

2 Related Work

Single frame object detection. Driven by popular benchmarks such as COCO [25] and Open Images [21], there have been a number of advances in single frame object detection in recent years. These detection architectures include anchor-based models, both single stage (e.g., SSD [27], RetinaNet [24], Yolo [32, 33]) and two-stage (e.g., Fast/Faster R-CNN [13, 34, 17], R-FCN [9]), as well as more recent anchor-free models (e.g., CornerNet [22], CenterNet [56], FCOS [41]). Object detection methods have shown great improvements on COCO- or Imagenet-style images, but these gains do not always generalize to challenging real-world data (see Figure 2).
Video object detection. Single frame architectures then form the basis for video detection and spatio-temporal action localization architectures, which build upon single frame models by incorporating contextual cues from other frames in order to deal with more specific challenges that arise in video data including motion blur, occlusion, and rare poses. Leading methods have used pixel level flow (or flow-like concepts) to aggregate features [59, 58, 57, 6] or used correlation [12] to densely relate features at the current timestep to an adjacent timestep. Other papers have explored the use of 3d convolutions (e.g., I3D, S3D) [7, 29, 48] or recurrent networks [26, 19] to extract better temporal features. Finally, many works apply video specific postprocessing to “smooth” predictions along time, including tubelet smoothing [14] or SeqNMS [15].

Object-level attention-based temporal aggregation methods.

The majority of these video detection approaches are not well suited to our target setting of sparse, irregular frame rates. For example, flow-based methods, 3d convolutions, and LSTMs typically all assume dense, regular temporal sampling. And while models like LSTMs can theoretically depend on all past frames in a video, their effective temporal receptive field is typically much smaller. To address this limitation of recurrent neural networks, the NLP community has introduced attention-based architectures as a way to take advantage of long range dependencies in sentences [2, 43, 11]. The vision community has followed suit with attention-based architectures [39, 28, 38] that leverage longer term temporal context.

Along the same lines, and most relevant to this work, a few recent works [46, 37, 47, 10] rely on non-local attention mechanisms to aggregate information at the object level across time. For example, Wu et al. [46] applied non-local attention [45] to person detections to accumulate context from pre-computed feature banks (with frozen pre-trained feature extractors). These feature banks extend the time horizon of their network up to 60s in each direction, achieving strong results on spatio-temporal action localization. We similarly use a frozen feature extractor, which allows us to create extremely long term memory banks that leverage the spatial consistency of static cameras and the habitual behavior of the subjects across long time horizons (up to a month). However, Wu et al. use a 3d convnet (I3D) for short term features, which is not well-suited to our setting's low, irregular frame rate. Instead we use a single frame model for the current frame, which is more similar to [37, 47, 10], who proposed variations of this idea for video object detection, achieving strong results on the Imagenet Vid dataset. In contrast to these three papers, we augment our model with an additional dedicated short term attention mechanism, which we show to be effective in experiments. Uniquely, our approach also allows negative examples into memory, which lets the model learn to ignore salient false positives in empty frames due to their immobility; we find that our network is able to learn background classes (e.g., rocks, bushes) without supervision.

More generally, our paper adds to the growing evidence that this attention-based approach of temporally aggregating information at the object level is highly effective for incorporating more context in video understanding. We argue in fact that it is especially useful in our setting of sparse irregular frame samples from static cameras. Whereas a number of competing baselines like 3d convolutions and flow based techniques perform nearly as well as these attention-based models on Imagenet Vid, the same baselines are not well-suited to our setting. Thus, we see a larger performance boost from prior, non-attention-based methods to our attention-based approach.

Camera traps and other visual monitoring systems. Image classification and object detection have been increasingly explored as a tool for reducing the arduous task of classifying and counting animal species in camera trap data [5, 31, 50, 30, 44, 54, 51, 35, 3, 4]. Object detection has been shown to greatly improve the generalization of these models to new camera locations [5]. It has been clearly shown in [5, 31, 50] that temporal information is useful. However, previous methods cannot report per-image species identifications (instead identifying a class at the burst level), cannot handle image bursts containing multiple species, and cannot provide per-image localizations and thus counts of each species, all of which are important to the biologists analyzing the data.

In addition to camera traps, traffic cameras, security cameras, and weather cameras on mountain passes are all frequently stationary and used to monitor places over long time scales. For traffic cameras, prior work focuses on crowd counting (e.g., counting the total number of vehicles or humans in each image) [53, 55, 1, 8, 36]. Some recent works have investigated using temporal information in traffic camera datasets [52, 49], but these methods only focus on short term time horizons, and do not take advantage of long term context.

3 Method

Figure 3: Architecture. (a) The high-level architecture of the model, with short term and long term attention used sequentially. Short term and long term attention are modular, and the system can operate with either or both. (b) The details of our implementation of an attention block, where n is the number of boxes proposed by the RPN for the keyframe, and m is the number of comparison features. For short term attention, m is the total number of proposed boxes across all frames in the window, shown in (a). For long term attention, m is the number of features in the long term memory bank associated with the current clip. See Section 3.1 for details on how this memory bank is constructed.

Our proposed approach builds a memory bank based on contextual frames and modifies a detection model to make predictions conditioned on this memory bank. In this section we discuss (1) the rationale behind our choice of detection architecture, (2) how to represent contextual frames, and (3) how to incorporate these contextual frame features into the model to improve current frame predictions.

Due to our sparse, irregular input frame rates, typical temporal architectures such as 3d convnets and recurrent neural networks are not well-suited, due to a lack of inter-frame temporal consistency (too much changes between frames). Instead, we build our model on top of single frame detection models. Additionally, building on our intuitions that moving objects exhibit periodic behavior and tend to appear in similar locations, we hope to inform our predictions by conditioning on instance-level features from contextual frames. Because of this last requirement, we choose the Faster R-CNN architecture [34] as our base detection model, as it remains a highly competitive meta-architecture and provides clear choices for how to extract instance-level features. Our method is easily applicable to any two-stage detection framework. We note that while Faster R-CNN is often viewed as slower than its single shot competitors (e.g., RetinaNet [24]), our effectively lower frame rates obviate the need for real-time inference. Moreover, passive monitoring cameras such as camera traps are traditionally not cloud connected and are only analyzed periodically in batch.

Faster R-CNN is a two-stage detection model. An image is first passed through a first-stage region proposal network (RPN) which, after running non-max suppression, returns a collection of class agnostic bounding box proposals. These box proposals are then passed into the second stage, which extracts instance-level features via the ROIAlign or crop-and-resize operation [16, 18] which then undergo classification and box refinement.
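The instance-level feature extraction step described above can be illustrated in a few lines. This is a toy nearest-neighbor version of crop-and-resize, written as a sketch under our own assumptions: real ROIAlign implementations use bilinear sampling and sub-pixel alignment, and the function name and box format here are our own choices, not from the paper's code.

```python
import numpy as np

# Toy stand-in for the crop-and-resize / ROIAlign operation: extract a
# fixed-size instance-level feature from a feature map for one box given
# in normalized [ymin, xmin, ymax, xmax] coordinates. Nearest-neighbor
# sampling is used for simplicity; production code uses bilinear sampling.
def crop_and_resize(fmap, box, out_size=7):
    H, W, _ = fmap.shape
    ys = np.linspace(box[0], box[2], out_size) * (H - 1)
    xs = np.linspace(box[1], box[3], out_size) * (W - 1)
    yi = np.clip(np.round(ys).astype(int), 0, H - 1)
    xi = np.clip(np.round(xs).astype(int), 0, W - 1)
    return fmap[yi][:, xi]  # [out_size, out_size, channels]
```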

In our model, the first-stage box proposals are instead routed through two attention-based modules that incorporate features from contextual frames (seen by the same camera) in order to provide local and global temporal context for the prediction. These attention-based modules return a contextually-informed feature vector which is then passed through the second stage of Faster R-CNN in the ordinary way. In the following section (3.1), we discuss how to represent features from context frames using a memory bank and detail our design of the attention modules. See Figure 3 for a diagram of our pipeline.
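This routing can be sketched end to end. The following is a minimal, illustrative rendering of the modified second stage: pooled proposal features pass through a short term and then a long term attention step before classification. The attention step here folds away the learned projections (Section 3.2 gives the full block), so the memory banks must share the proposals' feature dimension; all names are ours, not the paper's.

```python
import numpy as np

# Simplified attention step: context is aggregated with dot-product
# weights and added back to the input features as a bias. The learned
# key/query/value/projection layers of the full model are omitted.
def attention_bias(a_pool, bank, temperature=1.0):
    logits = a_pool @ bank.T / (temperature * np.sqrt(a_pool.shape[1]))
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)          # softmax over memory entries
    return a_pool + w @ bank                   # context added as a bias

# Second-stage feature path: spatial pooling, short term attention,
# then long term attention; the result feeds classification/refinement.
def second_stage_features(instance_feats, short_bank, long_bank):
    a = instance_feats.mean(axis=(1, 2))       # [n_boxes, channels]
    a = attention_bias(a, short_bank)
    a = attention_bias(a, long_bank)
    return a
```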

3.1 Building a memory bank from context features

Long Term Memory Bank (M_long).

Given a keyframe for which we want to detect objects, we iterate over all frames from the same camera within a pre-defined time horizon, running a frozen, pre-trained detector on each frame. We build our long term memory bank (M_long) from feature vectors corresponding to the resulting detections. Given the limitations of hardware memory, deciding what to store in a memory bank is a critical design choice. We use three strategies to ensure that our memory bank can feasibly be stored.

  • We take the instance level feature tensors after cropping proposals from the RPN and save only a spatially pooled representation of each such tensor concatenated with a spatiotemporal encoding of the datetime (normalized year, month, day, hour) and box position (yielding an embedded feature vector per box).

  • We curate M_long by limiting the number of proposals for which we store features — we consider multiple strategies for deciding which and how many features to save to our memory banks; see Section 5.3 for more details.

  • We rely on a pre-trained single frame Faster R-CNN with Resnet-101 backbone as a frozen feature extractor (which therefore need not be considered during backpropagation). In experiments we consider an extractor pretrained on COCO alone, or fine-tuned on the training set for each dataset. We find that COCO features can be used effectively, but that the best performance comes from a fine-tuned extractor; see Table 1 (c) for a comparison.

Together with our sparse frame rates, these strategies allow us to easily construct memory banks holding up to a month of context in memory, with the number of contextual features depending on the number of images taken during the month by the camera.
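A single memory bank entry, as described by the first strategy above, might be assembled as follows. This is a sketch under our own assumptions about the encoding layout: the paper specifies a pooled instance feature concatenated with a normalized datetime and the box position, but the exact normalization constants here are illustrative, not the paper's.

```python
import numpy as np
from datetime import datetime

# Build one long term memory bank entry: spatially pool the cropped
# instance feature, then concatenate a normalized datetime encoding and
# the normalized box position. Normalization constants are illustrative.
def encode_entry(instance_feat, ts, box):
    """instance_feat: [h, w, c]; ts: datetime; box: [ymin, xmin, ymax, xmax] in [0, 1]."""
    pooled = instance_feat.mean(axis=(0, 1))             # [c]
    dt = np.array([ts.year / 2020.0, ts.month / 12.0,
                   ts.day / 31.0, ts.hour / 24.0])       # normalized datetime
    return np.concatenate([pooled, dt, np.asarray(box, dtype=float)])
```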

Short Term Memory (M_short). In our experiments we find that it is helpful to include an additional, separate mechanism for incorporating more thorough short term context features from nearby frames, using the same trained first-stage feature extractor as for the keyframe. This is different from our long term memory described above, which we build over longer time horizons using a frozen feature extractor. In contrast to our long term memory bank, we do not curate the short term features: for small window sizes it is feasible to hold features for all box proposals in memory. We take the stacked tensor of cropped instance-level features across all frames within a small window around the current frame (typically three frames) and globally pool across the spatial dimensions (width and height). This results in a matrix containing a single embedding vector per box proposal (which we call our Short Term Memory, M_short), that is then passed into the short term attention block.

3.2 Attention module architecture

We define an attention block [43] which aggregates from context features keyed by input features as follows (see Figure 3): Let A be the tensor of input features from the current frame, with shape [n × w × h × c], where n is the number of proposals emitted by the first stage of Faster R-CNN. We first spatially pool across the feature width and height dimensions, yielding A_pool with shape [n × c]. Let B be the matrix of context features, which has shape [m × d]. We set B = M_long for long term attention or B = M_short for short term attention. We define k(·) as the key function, q(·) as the query function, v(·) as the value function, and f(·) as the final projection that returns us to the correct output feature length to add back into the input features. We use a distinct k, q, v, and f for long term and short term attention respectively. In our experiments, k, q, v, and f are all fully-connected layers. We calculate attention weights using standard dot-product attention:

    w = Softmax( q(A_pool) k(B)^T / (T · sqrt(d)) )

where T is the softmax temperature. Note the attention weights tensor w has shape [n × m], and d is the feature dimension.

We next construct a context feature for each box by taking a projected, weighted sum of context feature vectors:

    F_context = f( w · v(B) )

where F_context has shape [n × c] in our setting. Finally, we add F_context as a per-feature-channel bias back into our original input features A.
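The attention block can be written out directly. Below is a minimal NumPy sketch, with the fully-connected k, q, v, and f represented as plain weight matrices; the dimensions and the absence of biases and initialization details are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Attention block: A is an [n, w, h, c] tensor of input features, B is an
# [m, d_b] matrix of context features (a long or short term memory bank).
# Wq, Wk, Wv, Wf stand in for the fully-connected q/k/v/f layers.
def attention_block(A, B, Wq, Wk, Wv, Wf, T=1.0):
    A_pool = A.mean(axis=(1, 2))                       # [n, c]
    q, k, v = A_pool @ Wq, B @ Wk, B @ Wv              # [n, d], [m, d], [m, d]
    w = softmax(q @ k.T / (T * np.sqrt(q.shape[1])))   # [n, m] attention weights
    F_context = (w @ v) @ Wf                           # [n, c] projected weighted sum
    return A_pool + F_context                          # context as per-channel bias
```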

Figure 4: Visualizing attention. In each example, the keyframe is shown at a larger scale, with our model’s detection, class, and score shown in red. We consider a time horizon of one month, and show the images and boxes with highest attention weights (shown in green). The model pays attention to objects of the same class, and the distribution of attention across time can be seen in the timelines below each example. A warthog’s habitual use of a trail causes useful context to be spread out across the month, whereas a stationary gazelle means the most useful context comes from the same day. The long term attention module is adaptive, choosing to aggregate information from whichever frames in the time horizon are most useful.
(a) Main results:

                    SS           CCT          CC
  Model           mAP   AR     mAP   AR     mAP   AR
  Single Frame    37.9  46.5   56.8  53.8   38.1  28.2
  Ours            55.9  58.3   76.3  62.3   42.6  30.2

(b) Time horizon (SS):

  Horizon       mAP   AR
  One minute    50.3  51.4
  One hour      52.1  52.5
  One day       52.5  52.9
  One week      54.1  53.2
  One month     55.6  57.5

(c) Memory bank construction (SS):

  Strategy             mAP   AR
  One box per frame    55.6  57.5
  COCO features        50.3  55.8
  Only positive boxes  53.9  56.2
  Subsample half       52.5  56.1
  Subsample quarter    50.8  55.0

(d) Short term and long term baselines (SS):

  Model          mAP   AR
  Single Frame   37.9  46.5
  Maj. Vote      37.8  46.4
  ST Spatial     39.6  36.0
  S3D            44.7  46.0
  SF Attn        44.9  50.2
  ST Attn        46.4  55.3
  LT Attn        55.6  57.5
  ST+LT Attn     55.9  58.3

(e) Boxes per frame in memory (CC):

  Model          mAP   AR
  Single Frame   38.1  28.2
  Top 1 Box      40.5  29.3
  Top 8 Boxes    42.6  30.2
Table 1: Results. All results shown are based on Faster R-CNN with a Resnet 101 backbone. The datasets considered are Snapshot Serengeti (SS), Caltech Camera Traps (CCT), and CityCam (CC). All mAP values are reported with an IoU threshold of 0.5, and AR is reported for the top prediction (AR@1).

4 Data

Our network is built for variable, low-frame-rate real-world systems of static cameras, and we test our methods on two such domains: camera traps and traffic cameras. Because the cameras are static, we split each dataset into separate camera locations for train and test, to ensure our model does not overfit to the validation set [5].

Camera Traps. Camera traps are usually programmed to capture a low-frame-rate image burst at one fps after each motion trigger, which results in data with a variable, low frame rate. In this paper, we test our systems on the Snapshot Serengeti (SS) [40] and Caltech Camera Traps (CCT) [5] datasets, each of which has human-labeled ground truth bounding boxes for a subset of the data. We increase the number of bounding-box-labeled images for training by pairing class-agnostic detected boxes from the Microsoft AI for Earth MegaDetector [4] with image-level species labels on our training locations. SS has publicly available seasons of data. We use seasons , which contain cameras, M images, and classes. CCT contains 140 cameras, 243k images, and 18 classes. Both datasets have large numbers of false motion triggers, % for SS and % for CCT, which means that many of the images contain no animals. We split the data using the location splits proposed in [23], and evaluate on the images with human-labeled bounding boxes from the validation locations for each dataset.

Traffic Cameras. The CityCam dataset [53] contains types of vehicle classes, around k frames and k annotated objects. It covers cameras monitoring downtown intersections and parkways in a high-traffic city, and “clips” of data are sampled multiple times per day, across months or even years. The data is diverse, including day and nighttime images, images in rain and snow, and images with high and low traffic density. We use camera locations for training and cameras for testing, with both parkway and downtown locations in both sets.

5 Experiments

We evaluate all models on held-out camera locations, using established object detection metrics: mean average precision (mAP) and average recall (AR). We compare our results to a single-frame baseline for all three datasets. We focus the majority of our experiments on a single dataset, Snapshot Serengeti, and investigate the effects of both short term and long term attention, the feature extractor, the long term time horizon, and the frame-wise sampling strategy for M_long. We further explore the addition of multiple features per frame in CityCam.

5.1 Implementation Details

We implemented our attention modules within the Tensorflow Object Detection API open-source Faster R-CNN architecture with a Resnet 101 backbone. Faster R-CNN optimization and model parameters are not changed between the single-frame baseline and our experiments, and we ensure robust single-frame baselines via hyperparameter sweeps. We train on Google TPUs (v3) [20] using MomentumSGD with weight decay and momentum. We construct each batch using clips, drawing four frames for each clip. Batches are placed on 8 TPU cores, colocating frames from the same clip. We augment with random flipping, ensuring that the memory banks are flipped to match the current frames to preserve spatial consistency. All our experiments use a fixed softmax temperature, which we found in early experiments to outperform the alternatives we tried.
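The flip-consistency point deserves care in implementation: when a keyframe is horizontally flipped during augmentation, any box positions stored in the memory bank must be mirrored as well. A minimal sketch, with the box format assumed (by us) to be normalized [ymin, xmin, ymax, xmax]:

```python
import numpy as np

# Mirror normalized boxes horizontally so memory bank positions stay
# spatially consistent with a horizontally flipped keyframe.
def flip_boxes_horizontal(boxes):
    """boxes: [m, 4] as [ymin, xmin, ymax, xmax] in [0, 1]."""
    ymin, xmin, ymax, xmax = boxes.T
    return np.stack([ymin, 1.0 - xmax, ymax, 1.0 - xmin], axis=1)
```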

5.2 Main Results

Our model strongly outperforms the single-frame Faster R-CNN with Resnet-101 baseline on both the Snapshot Serengeti (SS) and Caltech Camera Traps (CCT) datasets, and shows promising improvements on CityCam (CC) traffic camera data as well (see Table 1 (a)). For all experiments, unless otherwise noted, we use a fine-tuned, dataset-specific feature extractor. We show mAP improvements on CC, SS, and CCT, and recall improves as well, with AR@1 improving on all three datasets.

For SS, we show results on several baselines that have access to short term temporal information (see Table 1 (d)). All short term experiments have an input window of 3 frames. We find that as we increase the contextual information provided with the keyframe, performance improves.

  • We consider a simple majority vote (Maj. Vote) across the high-confidence single-frame detections within the window, and find that it does not improve over the single-frame baseline.

  • We attempt to leverage the static-ness of the camera by taking a temporal-distance-weighted average of the RPN box classifier features from the keyframe with the cropped RPN features from the same box locations in the surrounding frames (ST Spatial), and find it outperforms single-frame by 1.7 mAP.

  • S3D [48], a popular video object detection model, outperforms single-frame by 6.8 mAP despite being designed for consistently sampled, high frame rate video.

  • Since animals in camera traps occur in groups, cross-object intra-image context is valuable. An intuitive baseline is to restrict the short term attention context window to the current frame (SF Attn). This removes temporal context, showing how much improvement we gain from explicitly sharing information across the box proposals in a non-local way. We see that we can gain 7.0 mAP over a vanilla single-frame model by adding this non-local attention module.

  • When we increase the short term context window to three frames, keyframe plus two adjacent frames (ST Attn), we see an additional improvement of 1.5 mAP.

  • If we consider only long term attention into M_long with a time horizon of one month (LT Attn), we see a 9.2 mAP improvement over short term attention.

  • By combining the short term and long term attention modules into a single model (ST+LT Attn), we see our highest performance at 55.9 mAP, and show in Figure 5 that we improve for every class in the highly imbalanced dataset.

Figure 5: Performance per class. Our performance improvement is consistent across classes: we visualize SS per-species mAP from the single-frame model to our best long term and short term memory model.

5.3 Changing the Time Horizon (Table 1(b))

We ablate our long term only attention experiments by reducing the time horizon of M_long, and find that performance decreases as the time horizon decreases. We see a large performance improvement over the single-frame model even when storing only a minute's worth of representations in memory. This is due to the sampling strategy: highly-relevant bursts of images are captured for each motion trigger. The long term attention block can adaptively determine how to aggregate this information, and there is a large amount of useful context across images within a single burst. However, some cameras take only a single image per trigger; in these cases the long term context becomes even more important. The adaptability of our model, which can be trained on and improve performance across data with not only variable frame rates but also different sampling strategies (time lapse, motion trigger, heat trigger, and multi-image bursts per trigger), is a valuable attribute of our system. In Figure 6, we explore the time differential between the top scoring box for each image and the features it closely attended to, using a threshold on the attention weight. We can see day/night periodicity in the week- and month-long plots, showing that attention is focused on objects captured at the same time of day. As the time horizon increases, the temporal diversity of the attention module increases and we see that our model attends to what is available across the time horizon, with a tendency to focus more strongly on images nearby in time (see Figure 4).
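This time-differential analysis amounts to thresholding each detection's attention weights and collecting the signed offsets to the attended memory entries. A sketch, where the threshold value and function names are our own illustrative choices (the paper's exact threshold is not reproduced here):

```python
import numpy as np

# For one detection, keep memory entries whose attention weight exceeds
# a threshold and return their signed time offsets from the keyframe.
def attended_time_deltas(weights, entry_times, key_time, thresh=0.01):
    """weights: [m] attention weights; entry_times: [m] timestamps (e.g., hours)."""
    keep = weights > thresh
    return entry_times[keep] - key_time   # signed offsets, same unit as inputs

# Two of the three entries pass the threshold; their offsets from the
# keyframe (at t = 24h) feed the Figure 6-style histogram.
deltas = attended_time_deltas(np.array([0.5, 0.005, 0.3]),
                              np.array([0.0, 24.0, 48.0]), 24.0)
```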

Figure 6: Attention over time. We threshold attention weights and plot a histogram of time differentials from the highest-scoring object in the keyframe to the attended frames, for varied long term time horizons. Note that the y-axis is in log scale. The central peak of each histogram shows the value of nearby frames, but attention covers the breadth of what is provided: namely, if given a month's worth of context, our model will use it. Also note a strong day/night periodicity when using a week-long or month-long memory bank.

5.4 Contextual features for constructing M_long.

Feature extractor (Table 1(c)). For Snapshot Serengeti, we consider both a feature extractor trained on COCO, and one trained on COCO and then fine-tuned on the SS training set. We find that while a month of context from a feature extractor tuned for SS achieves % higher mAP than one trained only on COCO, we are able to outperform the single-frame model by % using memory features that have never before seen a camera trap image.

Subsampling memory (Table 1(c)).

We further ablate our long term memory by increasing the stride at which we store representations in the memory bank, while maintaining a time horizon of one month. Subsampling the memory bank by half with a larger stride causes a drop in mAP, and increasing the stride further causes an additional drop. If, instead of increasing the stride, we subsample by keeping only positive examples (using an oracle to determine which images contain animals for the sake of the experiment), we find that performance still drops (explored below).

Keeping representations from empty images. In our static camera scenario, we add features to the long term memory bank from all frames, both empty and non-empty. The intuition behind this decision is that salient background objects in a static camera's field of view do not move over time, yet can be repeatedly and erroneously detected by single-frame architectures. We assume that the features from the frozen extractor are visually representative, and thus sufficient for representing both foreground and background. By saving representations of highly salient background objects, we allow the model to learn per-camera salient background classes and positions without supervision, and to suppress these objects in the detection output. In Figure 7, we see that adding empty representations reduces the number of false positives across all confidence thresholds compared to the same model with only positive representations.

We investigated the highest-confidence “false positives” from our context model, and found that in almost all of them our model correctly found and classified animals that were missed by the human annotators. The Snapshot Serengeti dataset reports noise in its labels [40], and inspecting the high-confidence predictions of our model on images labeled “empty” is intuitively a good way to catch these missing labels. Some of these are truly challenging, where the animal is difficult to spot and the annotator mistake is unfortunate but reasonable. Most are simply label noise, where the existence of an animal is extremely obvious. This suggests our performance improvement estimates are likely conservative.
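The stride and horizon ablations above can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the function name and the frame-indexed horizon are assumptions (the actual system stores frozen detector features per camera over a time horizon of up to one month).

```python
import numpy as np

def subsample_memory_bank(features, stride=1, horizon=None):
    """Sketch of the memory-bank ablations: `features` is a chronologically
    ordered array of per-frame feature vectors stored for one camera.
    `horizon` caps how many recent frames are kept; `stride` subsamples the
    bank, so stride=2 keeps half the representations, stride=4 a quarter."""
    bank = np.asarray(features)
    if horizon is not None:
        bank = bank[-horizon:]   # keep only the most recent frames
    return bank[::stride]        # larger stride -> sparser memory bank

# Example: 100 stored frames of 256-d features; stride=2 halves the bank.
bank = subsample_memory_bank(np.zeros((100, 256)), stride=2)
```

Under this framing, the ablations trade memory-bank density (and hence storage cost) against the contextual coverage available to the attention module.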

Figure 7: False positives on empty images. When adding features from empty images to the memory bank, we reduce false positives across all confidence thresholds compared to the same model without negative representations. Note that the y-axis is in log scale. The single frame model has fewer high-confidence false positives than either context model, but when given positive and negative context our model is able to suppress low-confidence detections. By analyzing our models’ high-confidence detections on images labeled “empty” we found a large number of images where the annotators missed animals.

Keeping multiple representations per image (Table 1(e)). In Snapshot Serengeti, non-empty images contain on average only a small number of objects and classes, and a large fraction of the images are empty. The majority of the images have just a single object, while a few have large herds of a single species. Given this, choosing only the top-scoring detection to add to memory makes sense, as that object is likely to be representative of the other objects in the image (e.g. keeping only one zebra example from an image with a herd of zebras). In CityCam, the traffic camera dataset, frames contain more objects and classes on average, and only a small fraction of frames are empty. In this scenario, storing additional objects in memory is intuitively useful to ensure that the memory bank is representative of the camera location. We investigate adding features from varying numbers of top-scoring detections, and find that storing multiple objects per frame yields the best performance (see Table 1(e)). A logical extension of our approach would be to select objects to store based not only on confidence, but also on diversity.

Failure modes. One potential failure case of this similarity-based attention approach is hallucination. If one image at a test location contains something that is strongly misclassified, that single mistake may negatively influence other detections at that camera. For example, when exploring the confident “false positives” on the Snapshot Serengeti dataset (which proved to be almost universally true detections missed by human annotators), the images where our model erroneously detected an animal were all of the same tree, confidently predicted to be a giraffe.
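The top-k storage policy and the similarity-based attention it feeds can be sketched as follows. Both functions are illustrative stand-ins (the names, the scaled dot-product similarity, and the shapes are assumptions, not the model's exact attention block). The sketch also makes the failure mode concrete: a stored feature that is highly similar to the query collects most of the attention weight, so one strongly misclassified representation can bias later detections at that camera.

```python
import numpy as np

def select_topk(features, scores, k=1):
    """Keep the k top-scoring detection features from one frame for the
    memory bank (k=1 suits mostly single-object camera trap images;
    larger k suits multi-object traffic frames)."""
    order = np.argsort(scores)[::-1][:k]   # indices of highest scores first
    return features[order]

def attend(query, memory):
    """Aggregate a context feature for the current detection by softmax
    attention over the per-camera memory bank (scaled dot-product
    similarity; a sketch, not the paper's exact architecture)."""
    scores = memory @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # attention weights sum to 1
    return weights @ memory                # weighted sum of stored features
```

A single memory entry nearly identical to the query dominates `weights`, which is exactly the "confident giraffe tree" scenario described above.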

6 Conclusions and Future Work

In this work, we contribute a model capable of leveraging per-camera temporal context of up to a month, far beyond the time horizon of previous approaches, and show that attention-based temporal context is particularly beneficial in the static camera setting. We show that our method generalizes across static camera domains, improving detection performance over single-frame baselines on both camera trap and traffic camera data. Additionally, our model is adaptive and robust to passive-monitoring sampling strategies that provide data streams with low, irregular frame rates.

It is apparent from our results that what and how much information is stored in memory is both important and domain specific. We plan to explore this in detail in the future, and hope to develop methods for curating diverse memory banks that are optimized for accuracy and size, reducing the computational and storage overhead at training and inference time while maintaining performance gains.

7 Acknowledgements

We would like to thank Pietro Perona, David Ross, Zhichao Lu, Ting Yu, Tanya Birch and the Wildlife Insights Team, Joe Marino, and Oisin MacAodha for their valuable insight. This work was supported by NSF GRFP Grant No. 1745301; the views are those of the authors and do not necessarily reflect the views of the NSF.


  • [1] C. Arteta, V. Lempitsky, and A. Zisserman (2016) Counting in the wild. pp. 483–498. Cited by: §2.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.
  • [3] S. Beery, Y. Liu, D. Morris, J. Piavis, A. Kapoor, M. Meister, and P. Perona (2019) Synthetic examples improve generalization for rare classes. arXiv preprint arXiv:1904.05916. Cited by: §2.
  • [4] S. Beery and D. Morris (2019) Efficient pipeline for automating species id in new camera trap projects. Biodiversity Information Science and Standards 3, pp. e37222. Cited by: §2, §4.
  • [5] S. Beery, G. Van Horn, and P. Perona (2018) Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473. Cited by: §1, §2, §4, §4.
  • [6] G. Bertasius, L. Torresani, and J. Shi (2018) Object detection in video with spatiotemporal sampling networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 331–346. Cited by: §2.
  • [7] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2.
  • [8] A. B. Chan, Z. J. Liang, and N. Vasconcelos (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. pp. 1–7. Cited by: §2.
  • [9] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §2.
  • [10] H. Deng, Y. Hua, T. Song, Z. Zhang, Z. Xue, R. Ma, N. Robertson, and H. Guan (2019) Object guided external memory network for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6678–6687. Cited by: §2.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.
  • [12] C. Feichtenhofer, A. Pinz, and A. Zisserman (2017) Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046. Cited by: §2.
  • [13] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §2.
  • [14] G. Gkioxari and J. Malik (2015) Finding action tubes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 759–768. Cited by: §2.
  • [15] W. Han, P. Khorrami, T. L. Paine, P. Ramachandran, M. Babaeizadeh, H. Shi, J. Li, S. Yan, and T. S. Huang (2016) Seq-nms for video object detection. arXiv preprint arXiv:1602.08465. Cited by: §2.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §3.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
  • [18] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et al. (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7310–7311. Cited by: §3, §5.1.
  • [19] K. Kang, H. Li, T. Xiao, W. Ouyang, J. Yan, X. Liu, and X. Wang (2017) Object detection in videos with tubelet proposal networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 727–735. Cited by: §2.
  • [20] S. Kumar, V. Bitorff, D. Chen, C. Chou, B. Hechtman, H. Lee, N. Kumar, P. Mattson, S. Wang, T. Wang, et al. (2019) Scale mlperf-0.6 models on google tpu-v3 pods. arXiv preprint arXiv:1909.09756. Cited by: §5.1.
  • [21] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §2.
  • [22] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §2.
  • [23] LILA.science. http://lila.science/. Accessed: 2019-10-22. Cited by: §4.
  • [24] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §2, §3.
  • [25] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2.
  • [26] M. Liu and M. Zhu (2018) Mobile video object detection with temporally-aware feature maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5686–5695. Cited by: §2.
  • [27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2.
  • [28] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. arXiv preprint arXiv:1908.02265. Cited by: §2.
  • [29] W. Luo, B. Yang, and R. Urtasun (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 3569–3577. Cited by: §2.
  • [30] A. Miguel, S. Beery, E. Flores, L. Klemesrud, and R. Bayrakcismith (2016) Finding areas of motion in camera trap images. In Image Processing (ICIP), 2016 IEEE International Conference on, pp. 1334–1338. Cited by: §2.
  • [31] M. S. Norouzzadeh, A. Nguyen, M. Kosmala, A. Swanson, M. S. Palmer, C. Packer, and J. Clune (2018) Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning. Proceedings of the National Academy of Sciences 115 (25), pp. E5716–E5725. Cited by: §1, §2.
  • [32] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §2.
  • [33] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §2.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2, §3.
  • [35] S. Schneider, G. W. Taylor, and S. Kremer (2018) Deep learning object detection methods for ecological camera trap data. In 2018 15th Conference on Computer and Robot Vision (CRV), pp. 321–328. Cited by: §2.
  • [36] A. P. Shah, J. Lamare, T. Nguyen-Anh, and A. Hauptmann (2018) CADP: a novel dataset for cctv traffic camera based accident analysis. In 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–9. Cited by: §2.
  • [37] M. Shvets, W. Liu, and A. C. Berg (2019) Leveraging long-range temporal relationships between proposals for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9756–9764. Cited by: §2.
  • [38] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §2.
  • [39] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. arXiv preprint arXiv:1904.01766. Cited by: §2.
  • [40] A. Swanson, M. Kosmala, C. Lintott, R. Simpson, A. Smith, and C. Packer (2015) Snapshot serengeti, high-frequency annotated camera trap images of 40 mammalian species in an african savanna. Scientific data 2, pp. 150026. Cited by: §4, §5.4.
  • [41] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355. Cited by: §2.
  • [42] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie (2018) The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8769–8778. Cited by: §1.
  • [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2, §3.2.
  • [44] A. G. Villa, A. Salazar, and F. Vargas (2017) Towards automatic wild animal monitoring: identification of animal species in camera-trap images using very deep convolutional neural networks. Ecological Informatics 41, pp. 24–32. Cited by: §2.
  • [45] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §2.
  • [46] C. Wu, C. Feichtenhofer, H. Fan, K. He, P. Krahenbuhl, and R. Girshick (2019) Long-term feature banks for detailed video understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 284–293. Cited by: §2.
  • [47] H. Wu, Y. Chen, N. Wang, and Z. Zhang (2019) Sequence level semantics aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9217–9225. Cited by: §2.
  • [48] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 305–321. Cited by: §2, 3rd item.
  • [49] F. Xiong, X. Shi, and D. Yeung (2017) Spatiotemporal modeling for crowd counting in videos. pp. 5151–5159. Cited by: §2.
  • [50] H. Yousif, J. Yuan, R. Kays, and Z. He (2017) Fast human-animal detection from highly cluttered camera-trap images using joint background modeling and deep learning classification. In Circuits and Systems (ISCAS), 2017 IEEE International Symposium on, pp. 1–4. Cited by: §2.
  • [51] X. Yu, J. Wang, R. Kays, P. A. Jansen, T. Wang, and T. Huang (2013) Automated identification of animal species in camera trap images. EURASIP Journal on Image and Video Processing 2013 (1), pp. 52. Cited by: §2.
  • [52] S. Zhang, G. Wu, J. P. Costeira, and J. M. Moura (2017) FCN-rlstm: deep spatio-temporal neural networks for vehicle counting in city cameras. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3667–3676. Cited by: §2.
  • [53] S. Zhang, G. Wu, J. P. Costeira, and J. M. Moura (2017) Understanding traffic density from large-scale web camera data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5898–5907. Cited by: §2, §4.
  • [54] Z. Zhang, Z. He, G. Cao, and W. Cao (2016) Animal detection from highly cluttered natural scenes using spatiotemporal object region proposals and patch verification. IEEE Transactions on Multimedia 18 (10), pp. 2079–2092. Cited by: §2.
  • [55] H. Zhao, S. Zhang, G. Wu, J. M. Moura, J. P. Costeira, and G. J. Gordon (2018) Adversarial multiple source domain adaptation. In Advances in Neural Information Processing Systems, pp. 8559–8570. Cited by: §2.
  • [56] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2.
  • [57] X. Zhu, J. Dai, L. Yuan, and Y. Wei (2018) Towards high performance video object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7210–7218. Cited by: §2.
  • [58] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei (2017) Flow-guided feature aggregation for video object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 408–417. Cited by: §2.
  • [59] X. Zhu, Y. Xiong, J. Dai, L. Yuan, and Y. Wei (2017) Deep feature flow for video recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2349–2358. Cited by: §2.