Detecting and Tracking Small Moving Objects in Wide Area Motion Imagery (WAMI) Using Convolutional Neural Networks (CNNs)

by Yifan Zhou, et al.

This paper proposes an approach to detect moving objects in Wide Area Motion Imagery (WAMI), in which the objects are both small and well separated. Identifying the objects using only foreground appearance is difficult since a 100-pixel vehicle is hard to distinguish from objects comprising the background. Our approach is based on background subtraction as an efficient and unsupervised method that is able to output the shape of objects. In order to reliably detect low contrast and small objects, we configure the background subtraction to extract foreground regions that might be objects of interest. While this dramatically increases the number of false alarms, a Convolutional Neural Network (CNN) considering both spatial and temporal information is then trained to reject the false alarms. In areas with heavy traffic, the background subtraction yields merged detections. To reduce the complexity of the multi-target tracker needed, we train another CNN to predict the positions of multiple moving objects in an area. Our approach shows competitive detection performance on smaller objects relative to the state-of-the-art. We adopt a GM-PHD filter to associate detections over time and analyse the resulting performance.




I Introduction

(a) The input image.
(b) The result of background subtraction after image opening.
(c) The centres of the windows to be predicted by the classification CNN.
(d) The accepted regions of detections by the classification CNN.
(e) The red detections are accepted directly. The green detections will go through the regression CNN.
(f) The final detection result. The detections are cropped out by green polygons. The red points are from the ground-truth.
Fig. 1: An example of the processing chain of the proposed moving object detector. This example is a sub-set of AOI 34.

High-altitude airborne camera systems are now able to provide long-term wide-area surveillance videos. To avoid excessive manual effort, it is important that automated systems exist to detect and track objects (mostly vehicles) in the videos. Processing such data differs from traditional object detection problems and has drawn recent attention in both the computer vision and image processing communities. The challenges posed are as follows: lack of appearance information (since objects are always small and usually only grey-level frames are obtained, it is not possible to use appearance information alone to associate objects between frames); well-separated objects (vehicles are almost always on roads, which, in certain parts of the wide-area images, are very distant from one another, making it challenging for any detector that scans an entire image to be computationally efficient while not generating large numbers of false alarms); pixel noise (the camera systems are often composed of multiple individual sensors, giving rise to pronounced artifacts, for example related to brightness discontinuities); image registration errors (small errors in image registration can lead to a large number of false alarms).

Mainstream object detection methods broadly sub-divide based on what information is used: spatial information, temporal information or both. For reasons described above, only using spatial information (e.g., the appearance of objects) has proved to be difficult for WAMI detection. There are a number of approaches that consider temporal information [1, 2, 3, 4, 5]. The simplest approach to using temporal information would be to perform frame differencing between two consecutive frames. This is straightforward and efficient but will generate two detections for each object. One way to improve this has been to look for the minimum differences among three consecutive frames [4]. Background subtraction goes beyond this idea to build an explicit background model. The current input image is then subtracted from the background to identify foreground objects. [6] reviews background subtraction methods comprehensively. In WAMI videos, the resolution is high and the frame rate low: this can give rise to artifacts caused by parallax, for example. Furthermore, background subtraction methods tend to use lightweight and fast-converging approaches to model the background. For example, [7] and [8] use modified Gaussian Mixture Models (GMMs) to model each pixel comprising the background while [9] and [10] compute the mean or median of multiple frames. To reduce the impact of parallax, previous research has considered blocks of pixels in a neighbourhood rather than individual pixels [4] or ignoring regions with large magnitude gradients [9].

The benefits of considering both spatial and temporal information have drawn attention in neighbouring contexts: for example, the approach is used in action recognition (e.g., [11, 12]). There is relatively little research into considering both in the context of WAMI videos. An exception is [2], which uses a deep learning based two-stage CNN approach to generate point detections. The basic idea was to use a lightweight CNN to predict regions that could include moving objects and a deep regression CNN (‘FoveaNet’) to locate the centres of moving objects (the idea of using a regression network to predict positions directly/indirectly from a single image or multiple frames has been shown to be effective in other applications [13, 14, 15]). Moreover, [2] also considered other combinations (e.g., background subtraction and a single frame foreground object detector [13]), but the performance of these other combinations was not competitive.

In this paper, we focus on developing a processing chain that can detect small moving objects. We consider the second largest images ( pixels) in the WPAFB 2009 dataset [16]. In this dataset, individual vehicles often occupy less than pixels. We note that objects of this size are usually assumed to be false alarms in existing papers. We adopt an approach that involves first proposing candidate positions for moving objects and then predicting the objects’ positions. We choose to use background subtraction to propose the candidate positions. A very low threshold is used (such that the probability of detection is high), and a small morphological kernel is applied to remove tiny false alarms. We treat the output from the background subtraction as regional proposals and create detection candidates within these areas. An efficient CNN that considers both spatial and temporal information is then trained to reject the false alarms. For the areas with heavy traffic, a lightweight regression network is trained to predict the centre of multiple moving objects from individual detections. This approach of combining the CNNs’ and background subtraction’s outputs makes it possible for the shape of moving objects to be obtained: this can be useful information that can be exploited by appearance-based tracking systems (e.g., [17, 18]). Finally, in addition to comparing the detections with the ground-truth, we applied a Gaussian Mixture Probabilistic Hypothesis Density (GM-PHD) filter to the detections to directly utilise the product of the proposed algorithm.

This paper is organised as follows. Section II briefly reviews the image registration methods that are used in our implementation. The stages comprising the moving object detection algorithm are then described in section III. The configuration of the GM-PHD filter tracker is introduced in section IV. The experiment setup and evaluation results are then reported in section V. Finally, section VI concludes this paper and describes opportunities for future research.

II Image Registration

Fig. 2: The architecture of the classification CNN described in section III-B.

To build a background for the current time from a number of previous frames that were captured by a moving camera system, it is necessary to compensate for the camera motion by aligning all the previous frames to the current frame. This process is also called image registration and the key is to estimate a transformation matrix, which describes the transformation from one frame to another, based on a chosen transformation function. In this paper, we use a projective transformation (or homography), which is widely discussed in multi-perspective geometry, i.e., a closely related area to that in which the WAMI camera system is functioning.

There are two categories of approach to work out the transformation matrix. Feature-based approaches extract feature points from both images. The feature points can be generated by, for example, Harris corner or SIFT-like [19] detectors. Feature descriptors (e.g., SURF, ORB [20]) are computed at all the detected feature points. By matching the feature descriptors, pairs of corresponding feature points between two images can be identified and the transformation matrix can then be estimated using RANSAC [21] such that outliers are removed automatically. In our experience, feature-based image registration works well in the context of WAMI video. That said, as it is heavily based on the extracted features, it can sometimes malfunction (e.g., when few features are identified or the detected features are all concentrated in one area of an image).

The other approach is known as the direct approach and considers the image as a whole. This approach involves directly minimising the difference between the reference and registered images using Lucas-Kanade’s algorithm: the implementation is described in [22]. Although this approach often has a larger computational cost than feature-based approaches and can struggle if the displacement between two images is large, it typically generates more accurate alignment results and works more effectively when the number of identified features is insufficient.

III Moving Object Detection

Fig. 3: The architecture of the regression CNN described in section III-C.

III-A Background Subtraction

We generate the background at each time step by computing the median image of the previous aligned frames. Note that it is unnecessary to re-register every previous frame each time a new frame is received. Provided that the transformation matrices obtained while processing previous frames have been retained, we only need to perform image registration once, between the newest pair of frames. The transformations from the older frames to the current frame then follow straightforwardly from the stored matrices.
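The reuse of stored transformation matrices can be sketched as follows. This is a minimal illustration, assuming that each stored matrix is a 3x3 homography mapping an older frame onto the previous frame, and that chaining is done by matrix multiplication; the function name `update_to_current` and the helper `shift` are hypothetical and not taken from the paper.

```python
import numpy as np

def update_to_current(stored, newest):
    """Re-target stored homographies to the current frame.

    stored: list of 3x3 arrays, each mapping an older frame to the previous frame
            (assumed to have been estimated when earlier frames were processed).
    newest: 3x3 array mapping the previous frame to the current frame
            (the single registration performed for the new frame).
    Returns homographies mapping every retained frame to the current frame.
    """
    updated = []
    for h in stored:
        m = newest @ h                 # older -> previous, then previous -> current
        updated.append(m / m[2, 2])    # keep the usual scale normalisation
    updated.append(newest / newest[2, 2])
    return updated

def shift(dx, dy):
    """Pure translation, used only for the toy example."""
    return np.array([[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]])

chained = update_to_current([shift(2, 0)], shift(0, 3))
print(chained[0] @ np.array([1.0, 1.0, 1.0]))
```

In the toy example a point is shifted by 2 pixels in x by the stored matrix and by 3 pixels in y by the newest one, so only one new registration is needed per incoming frame.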

The background subtraction result is obtained by thresholding the difference between the current frame and the background using a background subtraction threshold. Image morphological operations are applied to remove the tiny blobs that are too small to be an object of interest. As we wish to detect very small (and unclear) objects, the background threshold and minimum blob size are both chosen to be small (the details of these parameters are given in table II). An example of the output from background subtraction is shown in figure 1(b).
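A minimal sketch of this stage is given below, assuming an absolute-difference test against the median background and a 3x3 square opening kernel. Both are illustrative assumptions: the exact thresholding rule and kernel size are those of the paper's table II, not the ones hard-coded here.

```python
import numpy as np

def median_background(aligned_frames):
    """Pixel-wise median of previous frames already aligned to the current frame."""
    return np.median(np.stack(aligned_frames, axis=0), axis=0)

def foreground_mask(frame, background, threshold):
    """Mark pixels whose absolute difference from the background exceeds the threshold."""
    return np.abs(frame.astype(np.float64) - background) > threshold

def erode3(mask):
    """Binary erosion with a 3x3 square kernel (pixels outside the image count as background)."""
    h, w = mask.shape
    p = np.pad(mask, 1, constant_values=False)
    out = np.ones_like(mask, dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out &= p[dy:dy + h, dx:dx + w]
    return out

def dilate3(mask):
    """Binary dilation with a 3x3 square kernel."""
    h, w = mask.shape
    p = np.pad(mask, 1, constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    for dy in range(3):
        for dx in range(3):
            out |= p[dy:dy + h, dx:dx + w]
    return out

def opening3(mask):
    """Image opening: erosion followed by dilation, which removes tiny blobs."""
    return dilate3(erode3(mask))

frame = np.zeros((6, 6))
frame[1:4, 1:4] = 50.0   # a small object
frame[5, 5] = 50.0       # an isolated noisy pixel
background = median_background([np.zeros((6, 6))] * 3)
mask = foreground_mask(frame, background, threshold=5)
opened = opening3(mask)
```

The opening keeps the 3x3 object and discards the isolated pixel, which is the intended effect of the small morphological kernel.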

In the WPAFB 2009 dataset, it can be observed that there are multiple artifacts at the boundary between different optical sensors (due to the different configurations of the constituent cameras comprising the sensor system). It can also be observed that the brightness of one optical sensor can suddenly change. Both situations influence the result of background subtraction. As suggested in [23], we use a box filter to compensate for these brightness fluctuations. We perceive that this removes false alarms without reducing the probability of detection significantly.

III-B Refining Background Subtraction Output Using a Classification CNN

The major causes of false alarms generated by the background subtraction are: poor image registration, light changes and the apparent displacement of high objects (e.g., buildings and trees) caused by parallax. We emphasise that the objects of interest (e.g., vehicles) mostly appear on roads. More generally, we perceive that a moving object generates a temporal pattern (e.g., road-object-object) that could be exploited to discern whether or not a detection is an object of interest. Thus, in addition to the shape of the vehicle in the current frame, we assert that the historical context of the same place can help to distinguish the objects of interest and false alarms. We therefore create a binary classification CNN to predict if a pixels window contains a moving object given aligned image patches from the previous frames. We suggest in this paper, because moving vehicles usually take less than four frames (including the current frame) to cross a image patch. The input to the CNN is a matrix and the convolutional layers are identical to the traditional 2D CNNs except that the three colour channels are substituted with grey-level frames. The architecture of the proposed CNN is shown in figure 2. A batch normalisation layer [24] is added after each convolutional layer and fully connected layer. A softmax layer is applied to the output to obtain probability-like scores.

Background subtraction is used as a regional proposal method; however, the connected components are sometimes too small or too large with respect to the ground-truth (moving) objects. To extract the regions to be used by the CNN, we subdivide the images into cells. If a cell contains any pixels that are classified as a candidate detection by the background subtraction, we pick the window centred on this cell as the input to the CNN (see figure 1(c)). If the CNN classifies the window as positive, this cell will be flagged as containing a moving object (see figure 1(d)).
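The cell-based proposal step can be sketched as follows. The cell size (4), window size (6) and edge padding used here are purely illustrative assumptions, and `candidate_cells` and `extract_window` are hypothetical helper names.

```python
import numpy as np

def candidate_cells(mask, cell):
    """Return (row, col) centres of the cells that contain any foreground pixel."""
    h, w = mask.shape
    centres = []
    for y0 in range(0, h, cell):
        for x0 in range(0, w, cell):
            if mask[y0:y0 + cell, x0:x0 + cell].any():
                centres.append((y0 + cell // 2, x0 + cell // 2))
    return centres

def extract_window(stack, centre, window):
    """Cut a (frames, window, window) block centred on a cell.

    stack holds the current frame and the aligned previous frames, shape (frames, H, W).
    Edge padding is an assumption made here so that border cells still yield full windows.
    """
    half = window // 2
    padded = np.pad(stack, ((0, 0), (half, half), (half, half)), mode="edge")
    cy, cx = centre
    return padded[:, cy:cy + window, cx:cx + window]

mask = np.zeros((8, 8), dtype=bool)
mask[5, 1] = True                           # one candidate pixel from background subtraction
stack = np.arange(4 * 8 * 8).reshape(4, 8, 8)
centres = candidate_cells(mask, cell=4)
win = extract_window(stack, centres[0], window=6)
```

Each extracted block would then be passed to the classification CNN, with the temporal frames taking the place of colour channels.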

The proposed CNN is trained as follows. For each frame in the training set, the background subtraction is used to propose cells. All the cells that have their centres within the range of pixels from any ground-truth points are used as positive samples. All the cells that are at least pixels away from all the ground-truth points are used as negative samples. We note that the positive samples are dependent on the output of background subtraction, such that the obscured and subtle detections (i.e., potential objects whose appearances are too similar to the background) are not considered in training. We do this because we want to couple the CNN to the regional proposal method. This means that the trained model becomes specific to the configuration of the background subtraction process. It transpires that there are many more negative samples than positive ones. Indeed, there are sufficiently many negative samples that we cannot train the model with all the extracted negative samples. We therefore use a negative set which is four times larger than the positive set to train the CNN. Since the training negative set is then much smaller than the complete negative set, the threshold for accepting detections (see (1)) was adjusted using a validation set to maximise the score.


where and are the outputs of the binary CNN and due to the softmax layer. In traditional CNNs, .
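Tuning an acceptance threshold on a validation set can be sketched as below. This is an illustration under stated assumptions: the score being maximised is taken to be the F1 score, a window is accepted when its positive-class softmax output exceeds the threshold, and the candidate thresholds and sample scores are made up for the example.

```python
def f1_score(tp, fp, fn):
    """Standard F1 score from true positive, false positive and false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold that maximises F1 on the validation samples.

    scores: positive-class outputs of the classifier, one per validation window.
    labels: 1 for windows containing a moving object, 0 otherwise.
    """
    best_t, best_f = None, -1.0
    for t in candidates:
        tp = sum(1 for s, l in zip(scores, labels) if s > t and l)
        fp = sum(1 for s, l in zip(scores, labels) if s > t and not l)
        fn = sum(1 for s, l in zip(scores, labels) if s <= t and l)
        f = f1_score(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

scores = [0.9, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0]
threshold, score = best_threshold(scores, labels, candidates=[0.3, 0.5, 0.8])
```

Because the training negatives are sub-sampled, the class balance seen in training differs from that at test time, which is why a threshold other than the default may score better.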

III-C Predicting Object Positions Using a Regression CNN

As illustrated in figure 1(b), two major problems that are present in the background subtraction output are the existence of multiple detections for one object and of one merged detection for multiple objects. While the former problem can be addressed by considering morphological operations, this makes the latter problem worse. Thus, it is necessary to separate the detections into two sets: detections which contain a single object that can be output directly and detections that may contain multiple objects. First, we assign the candidate detections output by the background subtraction to the accepted detection regions (figure 1(d)) and remove the candidate detections without any assignment. We refer to the accepted blobs from background subtraction as , and the blobs from classification CNN as . The candidate detection which can be directly output must satisfy the following two conditions, otherwise we consider there might be multiple objects giving rise to the single candidate detection:

  • is assigned to and no other with is assigned to .

  • The size of is smaller than pixels.

The candidate detections can be accepted when the common area between the assigned pair, and , is larger than . This is such that we take full advantage of the background subtraction algorithm. An illustration of the two sets is shown in figure 1(e).

As proposed in [2], a regression CNN can predict the positions of objects given spatial and temporal information. Such a second CNN can be used to deal with the set of detections that may contain multiple detections. We therefore train a regression CNN, whose architecture is shown in figure 3, to predict the positions of moving objects. The input to this CNN is similar to the classification CNN described in section III-B but the size of the patches changes to (as shown in figure 4(a)). The response of the CNN is a dimensional vector, equivalent to a down-sampled image ( ) for reducing computational cost. For simplification, we always up-sample the response from the CNN to a image in the subsequent discussions. An example of the CNN response is shown as figure 4(b).

(a) The input cubic to the regression CNN. From left to right: , , ,
(b) The response (resized to ) of the regression CNN given the above input.
Fig. 4: An example of how the regression CNN works.

While preparing the training set, for each frame, we obtain a number of data-cubes. We assume that denotes the set of targets in the ground-truth for current frame, where is the number of targets. If we only use these ground-truth points as centres of windows to extract the data-cubes, a moving object could always be located in the centre of every window (which would bias the regression model). So, some drifts are added to each of the ground-truth points: where and are each drawn from a uniform distribution on . In order to get a more generalised training set, we iterate this process five times creating . Thus, for each ground-truth detection, we create five data-cubes and each data-cube contains the patches of the current frame and the three (aligned) previous frames centred on where . The training set of the CNN response is thus straightforward. We down-sample the current frame patch to . If there is a ground-truth target within a pixel, the corresponding pixel in the CNN response is set to otherwise . Therefore, the responses for training are either or .
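The construction of jittered training windows and their response targets can be sketched as follows. The patch size (20), down-sampling factor (2), maximum drift (5 pixels) and the target values 1 and 0 are assumptions chosen for illustration; `jittered_centres` and `response_target` are hypothetical names.

```python
import numpy as np

def jittered_centres(gt_points, max_shift, repeats, rng):
    """Add a uniformly drawn integer drift to each ground-truth point, several times."""
    centres = []
    for (x, y) in gt_points:
        for _ in range(repeats):
            dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
            centres.append((x + int(dx), y + int(dy)))
    return centres

def response_target(centre, gt_points, patch, factor):
    """Build the down-sampled training response for a window centred on `centre`.

    A response pixel is set to 1 if a ground-truth target falls inside it, else 0
    (the values 1 and 0 are assumed here).
    """
    size = patch // factor
    target = np.zeros((size, size), dtype=np.float32)
    x0 = centre[0] - patch // 2
    y0 = centre[1] - patch // 2
    for (x, y) in gt_points:
        u, v = x - x0, y - y0            # target position inside the window
        if 0 <= u < patch and 0 <= v < patch:
            target[v // factor, u // factor] = 1.0
    return target

rng = np.random.default_rng(0)
gt = [(50, 50), (200, 200)]
centres = jittered_centres(gt[:1], max_shift=5, repeats=5, rng=rng)
target = response_target((50, 50), gt, patch=20, factor=2)
```

Because every window centre is shifted randomly, a target no longer always sits in the middle of its window, and a window may also contain neighbouring targets.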

Although the training response is or , the actual response from the regression is between and ; see, for example, figure 4(b). We cannot extract the positions by directly picking out the positions wherever the CNN response value is , because a weak detection can often give outputs that are less than . So, after setting a minimum value () for the possible pixels, we consider the peaks in the response image as the centres of detections. A default bounding box will be given if the detection is made by the regression CNN. The detection results of both CNNs are illustrated in figure 1(f).
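Peak extraction from the response image can be sketched as below, assuming a peak is a pixel that reaches the minimum value and is not exceeded by any of its eight neighbours. The neighbourhood size is an assumption, and flat plateaus would yield several peaks in this simple version.

```python
import numpy as np

def response_peaks(response, min_value):
    """Return (x, y) positions of local maxima whose value is at least min_value."""
    h, w = response.shape
    padded = np.pad(response.astype(np.float64), 1, mode="constant",
                    constant_values=-np.inf)
    peaks = []
    for y in range(h):
        for x in range(w):
            v = response[y, x]
            if v < min_value:
                continue                        # too weak to be a detection
            if v >= padded[y:y + 3, x:x + 3].max():
                peaks.append((x, y))            # no neighbour is larger
    return peaks

resp = np.zeros((5, 5))
resp[1, 1] = 0.9    # strong detection
resp[1, 2] = 0.5    # shoulder of the same detection, not a peak
resp[3, 4] = 0.3    # weak but acceptable detection
resp[4, 0] = 0.1    # below the minimum response
peaks = response_peaks(resp, min_value=0.25)
```

The weak response of 0.3 is retained, which illustrates why the responses are not simply compared against the training value.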

IV Tracking

In this section, we focus on applying a multi-target tracker to generate multiple tracks using the output from the proposed detector. In order to reflect the performance of the detector directly, we only use the centres of detections (i.e., we do not use objects’ appearance at this stage). The Gaussian Mixture Probabilistic Hypothesis Density (GM-PHD) filter [25] is used to process the detections and an implementation can be found in [26]. We consider the near-constant velocity model as the dynamic model for the objects. The transition function is as standard and the process noise is defined in (2). Since the positions of the detections are always with respect to the current frame, the states are compensated for the camera motion as well.


and is defined as follow:


where is the magnitude of the process noise which should be adjusted based on the video content and is the time interval.
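A near-constant velocity model can be sketched as follows. This uses one common discretisation (white-noise acceleration with state ordering [x, y, vx, vy]), which is an assumption and may differ in detail from the process noise defined in the paper; the function name `cv_model` is hypothetical.

```python
import numpy as np

def cv_model(dt, sigma):
    """Transition matrix and process noise for a 2D near-constant velocity model.

    State ordering is assumed to be [x, y, vx, vy].
    sigma is the process noise magnitude and dt is the time interval.
    """
    F = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    # Maps a random acceleration in x and y onto position and velocity.
    G = np.array([[dt ** 2 / 2, 0.0],
                  [0.0, dt ** 2 / 2],
                  [dt, 0.0],
                  [0.0, dt]])
    Q = sigma ** 2 * G @ G.T
    return F, Q

F, Q = cv_model(dt=1.0, sigma=3.0)
predicted = F @ np.array([10.0, 20.0, 2.0, -1.0])
```

With a time interval of 1 and a noise magnitude of 3, the predicted position simply advances by the velocity, and the velocity variance in each axis is 9.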

A track is initialised if a pair of points, and can be found such that , where is a detection position at frame , is a detection position which is observed in frame and transformed to frame . The state of the initialised track is defined as and the initial weight for each Gaussian is . We note that there can be a number of tracks initialised that duplicate existing tracks. These are typically merged as part of the pruning process within the GM-PHD. A track is confirmed if its weight is larger than . A track will be removed if its weight is lower than or moves out of the image.
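Track initialisation from pairs of detections can be sketched as below. The gating rule used here (implied speed no larger than a maximum velocity) and the initial state layout [x, y, vx, vy] are assumptions, and `warp_point` and `init_tracks` are hypothetical names.

```python
import numpy as np

def warp_point(h, point):
    """Apply a 3x3 homography to a 2D point."""
    v = h @ np.array([point[0], point[1], 1.0])
    return v[:2] / v[2]

def init_tracks(prev_dets, curr_dets, h_prev_to_curr, v_max, dt, birth_weight):
    """Create one birth component per plausible (previous, current) detection pair.

    Previous detections are first transformed into the current frame so that
    camera motion does not appear as object motion.
    """
    births = []
    for p in prev_dets:
        q = warp_point(h_prev_to_curr, p)
        for z in curr_dets:
            z = np.asarray(z, dtype=float)
            velocity = (z - q) / dt
            if np.linalg.norm(velocity) <= v_max:      # assumed gating rule
                births.append((np.concatenate([z, velocity]), birth_weight))
    return births

births = init_tracks(prev_dets=[(0.0, 0.0)],
                     curr_dets=[(10.0, 0.0), (100.0, 0.0)],
                     h_prev_to_curr=np.eye(3),
                     v_max=35.0, dt=1.0, birth_weight=0.25)
```

Only the nearby pair produces a birth component; the distant detection would imply an implausible speed. Duplicate births of existing tracks are left to the merging and pruning steps of the filter.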

A list of recommended parameters for the GM-PHD filter for the WPAFB 2009 dataset is shown in table I.

| Parameter description | Value |
| --- | --- |
| Time interval | 1 |
| Process noise | 3 |
| Measurement noise | 3 |
| Maximum velocity | 35 |
| Birth weight | 0.25 |
| Extraction weight | 0.5 |
| Remove weight | 0.05 |
| Probability of detection | 0.8 |
| Probability of survival | 0.95 |

TABLE I: The recommended parameters for the GM-PHD filter for WPAFB 2009 dataset.

V Experimental Results

We used the WPAFB 2009 [16] dataset to train and evaluate the proposed approach. The images were taken by a camera system with six optical sensors and had already been stitched to cover a wide area of around . This dataset includes 1025 frames and is divided into training video ( frames) and test video ( frames). All the vehicles and their trajectories are manually annotated. There are multiple resolutions of videos in the dataset. Unlike most previous papers, which consider the largest ones ( pixels), we chose to use the smaller ones ( pixels) where the size of vehicles is often smaller than pixels.

Because the proposed algorithm is focusing on detecting moving objects, we remove all the objects with insufficient movement from the ground-truth for evaluation (this was also the approach to evaluation considered in [3, 2]). An object whose displacement is smaller than (as calculated from the latitudes and longitudes provided in the ground-truth) between two consecutive frames is removed. This scenario reduces the size of ground-truth from 1,176,447 to 455,427 true detections in the training video and from 1,145,779 to 460,386 in the test video.

We used the first 300 frames in the training video as the training set for both CNNs. We randomly picked 500,000 positive samples and 2,000,000 negative samples to train the classification CNN. We then validated this CNN with the same 300 frames to obtain (as described in section III-B). The regression CNN was trained using 500,000 samples that were extracted via the process described in section III-C.

We tested the proposed detector on the full stitched images in both the training and test videos. Feature-based image registration (see section II) was used to align consecutive frames. To compare both detection and tracking performance with existing papers, we additionally create six sub-videos covering different Areas Of Interest (AOI 01, 02, 03, 34, 40, 41 in [3, 2, 27, 23]). As none of these papers fully specified the details of the areas of interest, we use our pre-processing architecture to generate sub-videos which attempt to cover similar regions to those specified in these papers. The areas of interest in our experiment are shown in figure 5. By using the image registration results for the full images, we can produce a video that is centred on a particular point. Thus there will be no vertical or horizontal translations over time, even though rotations and scalings are apparent in the sub-videos. To process the sub-videos, we use the direct approach (see section II) to perform image registration, since the processing time is acceptable given the lower resolution of the AOIs. (We noticed that for some sub-videos, the feature-based image registration did not work well as the extracted features were not well distributed across the image. In the other sub-videos, the influence of the choice of registration on detector performance was negligible.)

Fig. 5: The areas of interest that are used in our experiments.

We used different parameters in background subtraction on the full stitched images and the AOIs since pronounced noise was present in some areas. The parameters are shown in table II.

| Parameter | Full | AOI |
| --- | --- | --- |
| Number of previous frames | 3 | 5 |
| Background subtraction threshold | 8 | 5 |
| Image opening kernel | | |
| Threshold for accepting detections in classification CNN | 0.8 | 0.8 |
| Min. response in regression CNN | 0.25 | 0.25 |

TABLE II: The parameters used while processing the full stitched images and the areas of interest (AOI).

V-A Evaluation Method

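| Video | Metric | 01 | 02 | 03 | 34 | 40 | 41 | Full image |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training | Precision | 0.953 | 0.965 | 0.937 | 0.970 | 0.955 | 0.960 | 0.952 |
| Training | Recall | 0.932 | 0.951 | 0.964 | 0.961 | 0.931 | 0.928 | 0.899 |
| Test | Precision | 0.939 | 0.956 | 0.934 | 0.957 | 0.942 | 0.943 | 0.941 |
| Test | Recall | 0.931 | 0.936 | 0.958 | 0.949 | 0.928 | 0.925 | 0.891 |

TABLE III: The precisions and recalls on different areas of interest using the proposed detector.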

The performance of the detector and that of the tracker were considered separately. For measuring the detector, we define the metrics as follows (in the same way as was considered elsewhere [2, 3]): A detection is considered a true positive when there is at least one ground-truth point within pixels range, and each ground-truth point can only be assigned to one detection. All the other detections are false positives. The false negative set includes the ground-truth points for which there is no detection within the range of pixels. All the unassigned detections and ground-truth points were marked as false alarms and mis-detections respectively. The precision, recall and score are computed using the standard equations, which due to space limitations are not explicitly defined here.
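The detection metrics can be sketched as follows. The one-to-one assignment is done here by a greedy nearest-first matching, which is an assumption (the assignment procedure is not specified above), the score is assumed to be F1, and the matching radius and points are made up for the example.

```python
import math

def evaluate(detections, ground_truth, radius):
    """Match detections to ground-truth points one-to-one and compute the metrics."""
    # All candidate pairs within the matching radius, closest first.
    pairs = sorted(
        (math.dist(d, g), i, j)
        for i, d in enumerate(detections)
        for j, g in enumerate(ground_truth)
        if math.dist(d, g) <= radius
    )
    used_d, used_g = set(), set()
    for _, i, j in pairs:
        if i not in used_d and j not in used_g:
            used_d.add(i)
            used_g.add(j)
    tp = len(used_d)
    fp = len(detections) - tp        # unassigned detections: false alarms
    fn = len(ground_truth) - tp      # unassigned ground-truth points: mis-detections
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

dets = [(0, 0), (10, 10), (50, 50)]
gt = [(1, 0), (11, 10), (90, 90)]
precision, recall, f1 = evaluate(dets, gt, radius=5)
```

In this toy case two of the three detections are matched, giving a precision, recall and F1 of two thirds each.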

In order to measure the tracker, we used the definitions and metrics in [23, 27]. The estimated track will be referred to as ‘track’ and the ground-truth target trajectory will be referred to as ‘target trajectory’ in the subsequent discussions. For each target trajectory, we remove the way-points that occur when the object is almost stationary, but we do not separate it into tracklets. We also ignore all the very short estimated tracks and target trajectories (less than frames) as they may not be initialised and tracked well in practice. To calculate Target purity, we assign the most similar tracks to each target trajectory and compute the percentage of the predominant track. Target continuity is defined as the number of tracks that are assigned to one target trajectory (for example, in a case where the tracker generates three tracks which are 5, 26 and 64 frames long, and they can be assigned to a target trajectory which moves for 40 frames, stops for 25 frames and then moves for 60 frames, the purity is and the continuity is ). We report the average values over all the target trajectories. These metrics evaluate the fragmentation of the tracks. Track purity and Track continuity are instead centred on the tracks: for each track we assign target trajectories to it and calculate the metrics in the same way as mentioned above. ‘Track purity & continuity’ indicate the extent to which one track includes multiple targets. Again, the average over all the tracks is reported. The range of purity is with being the ideal. The continuity is within , with being ideal.
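Purity and continuity for a single target trajectory can be sketched as below. The per-frame label representation, and the choice to divide by the full number of retained way-points (including those with no assigned track), are assumptions made for this illustration rather than the exact definitions of [23, 27].

```python
from collections import Counter

def purity_and_continuity(labels):
    """Compute purity and continuity for one target trajectory.

    labels: one entry per retained way-point of the trajectory, giving the id of
            the track assigned at that frame, or None if no track was assigned.
    Purity is the fraction of way-points covered by the predominant track;
    continuity is the number of distinct tracks assigned to the trajectory.
    """
    assigned = [label for label in labels if label is not None]
    if not assigned:
        return 0.0, 0
    counts = Counter(assigned)
    dominant = counts.most_common(1)[0][1]
    return dominant / len(labels), len(counts)

# A 10-frame trajectory followed by track "a", lost for two frames, then track "b".
labels = ["a"] * 6 + [None] * 2 + ["b"] * 2
purity, continuity = purity_and_continuity(labels)
```

Swapping the roles of tracks and target trajectories gives the track-centred variants in the same way.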

V-B Detection Results

The precision and recall of the proposed detector when applied to different AOIs and the full image are presented in table III. We can see that the performance on the training video is (as expected) better than on the test video. We believe this is due to the relatively small number of negative samples and because the threshold for accepting detections, , is estimated on the training set. Most of the AOIs yield similar precisions, except for AOI 03, which is lower because there are some non-vehicle moving objects that are detected by the algorithm but were not included in the ground-truth. In terms of missed detections, the main cause we have identified is that some moving objects cannot be detected by background subtraction: for example, when they look similar to the background and move slowly. Regarding the results on the full images, the lower recall is caused by a larger background subtraction threshold , which leads to a smaller number of more reliable detections.

Table IV presents the score of the proposed algorithm compared to the best detectors we are aware of in the literature. Given that we are considering smaller objects, it is evident that the proposed detector is competitive with the others.

| Methods | 01 | 02 | 03 | 34 | 40 | 41 |
| --- | --- | --- | --- | --- | --- | --- |
| [2] | 0.947 | 0.951 | 0.942 | 0.933 | 0.983 | 0.928 |
| [3] | 0.866 | 0.890 | 0.900 | - | - | - |
| [28] | - | - | - | 0.874 | 0.847 | 0.854 |
| Ours (train) | 0.942 | 0.958 | 0.950 | 0.965 | 0.943 | 0.944 |
| Ours (test) | 0.935 | 0.947 | 0.945 | 0.953 | 0.935 | 0.934 |

TABLE IV: The scores of existing algorithms (as assessed on larger targets only) and of our approach on the AOIs.

V-C Tracking Results

The performance of applying a GM-PHD filter to the detections is reported in table V. We observe that AOI 01 and 40 are most challenging because in the middle of the area there is a crowded crossroads where vehicles start and stop frequently. AOI 34 includes heavy traffic and vehicles are very close to each other. The traffic is light in AOI 03, but there are a couple of traffic lights which cause the tracks to fragment. The tracker performs similarly in AOI 34 and AOI 03. The tracker achieves good results in AOI 02 and 41 where traffic is moderate and no traffic lights are present.

In comparison with [23] and [27], the advantage of our approach is obvious in heavy traffic. In AOI 34 and 41, the target purities from [27] were below and even reduced to in AOI 40. In AOI 02, our approach yields much higher target purity compared to [23] which is below . However, in areas with light traffic such as AOI 03, our approach yields similar results to [23]. In terms of track purity and continuity, the proposed approach outperforms the others as well: our estimated tracks each contain fewer targets on average.

| Metric | 01 | 02 | 03 | 34 | 40 | 41 |
| --- | --- | --- | --- | --- | --- | --- |
| Target purity (%) | 53.1 | 71.7 | 64.7 | 64.9 | 52.2 | 70.9 |
| Target continuity | 2.23 | 1.62 | 1.92 | 1.98 | 2.46 | 1.46 |
| Track purity (%) | 89.4 | 93.4 | 90.64 | 92.0 | 89.5 | 93.7 |
| Track continuity | 1.10 | 1.07 | 1.07 | 1.06 | 1.11 | 1.03 |

TABLE V: The performance of GM-PHD filter using the proposed detector on AOIs of the WPAFB 2009 (test video).

VI Conclusions

In this paper, we propose a moving object detector for wide-area motion imagery (WAMI) videos. The detector is based on using background subtraction with a low threshold to ensure large numbers of potential detections are identified. False alarms are removed and merged detections disentangled using two CNNs that both consider spatial-temporal information. Our experimental results show competitive performance compared to the state-of-the-art while dealing with smaller objects. We also demonstrate the feasibility of applying a multi-target tracker to the detections we generate.

Future research includes: assessing the extent to which the trained CNNs can be applied to other WAMI videos; developing techniques to detect stationary vehicles in areas identified from an inability to detect a tracked object; tracking that can persist across long baselines by exploiting the appearance information derived from the background subtraction process.


This work was funded through “Track Analytics For Effective Triage Of Wide Area Surveillance Data” by the Defence Science & Technology Laboratory (DSTL).


  • [1] I. Saleemi and M. Shah, “Multiframe many–many point correspondence for vehicle tracking in high density wide area aerial videos,” International journal of computer vision, vol. 104, no. 2, pp. 198–219, 2013.
  • [2] R. LaLonde, D. Zhang, and M. Shah, “Clusternet: Detecting small objects in large scenes by exploiting spatio-temporal information,” in 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp. 4003–4012, 2018.
  • [3] L. W. Sommer, M. Teutsch, T. Schuchert, and J. Beyerer, “A survey on moving object detection for wide area motion imagery,” in 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, March 7-10, 2016, pp. 1–9, 2016.
  • [4] J. Xiao, H. Cheng, H. Sawhney, and F. Han, “Vehicle detection and tracking in wide field-of-view aerial video,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 679–684, IEEE, 2010.
  • [5] J. Prokaj, X. Zhao, and G. Medioni, “Tracking many vehicles in wide area aerial surveillance,” in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 37–43, IEEE, 2012.
  • [6] A. Sobral and A. Vacavant, “A comprehensive review of background subtraction algorithms evaluated with synthetic and real videos,” Computer Vision and Image Understanding, vol. 122, pp. 4–21, 2014.
  • [7] T. Pollard and M. Antone, “Detecting and tracking all moving objects in wide-area aerial video,” in 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 15–22, IEEE, 2012.
  • [8] P. Kent, S. Maskell, O. Payne, S. Richardson, and L. Scarff, “Robust background subtraction for automated detection and tracking of targets in wide area motion imagery,” in Optics and Photonics for Counterterrorism, Crime Fighting, and Defence VIII, vol. 8546, p. 85460Q, International Society for Optics and Photonics, 2012.
  • [9] P. Liang, H. Ling, E. Blasch, G. Seetharaman, D. Shen, and G. Chen, “Vehicle detection in wide area aerial surveillance using temporal context,” in Proceedings of the 16th International Conference on Information Fusion, pp. 181–188, IEEE, 2013.
  • [10] V. Reilly, H. Idrees, and M. Shah, “Detection and tracking of large number of targets in wide area surveillance,” in European conference on computer vision, pp. 186–199, Springer, 2010.
  • [11] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, pp. 568–576, 2014.
  • [12] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, “Temporal segment networks: Towards good practices for deep action recognition,” in European conference on computer vision, pp. 20–36, Springer, 2016.
  • [13] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271, 2017.
  • [14] A. Rozantsev, V. Lepetit, and P. Fua, “Detecting flying objects using a single moving camera,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 5, pp. 879–892, 2017.
  • [15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision, pp. 21–37, Springer, 2016.
  • [16] AFRL, “Wright-patterson air force base (wpafb) dataset,” 2009.
  • [17] J. Prokaj and G. Medioni, “Persistent tracking for wide area aerial surveillance,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1186–1193, 2014.
  • [18] B.-J. Chen and G. Medioni, “Exploring local context for multi-target tracking in wide area aerial surveillance,” in 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 787–796, IEEE, 2017.
  • [19] H. Bay, T. Tuytelaars, and L. Van Gool, “SURF: speeded up robust features,” in 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, pp. 404–417, Springer, 2006.
  • [20] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “ORB: an efficient alternative to SIFT or SURF,” in IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, 2011.
  • [21] M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
  • [22] J.-Y. Bouguet, “Pyramidal implementation of the affine lucas kanade feature tracker description of the algorithm,” Intel Corporation, vol. 5, no. 1-10, p. 4, 2001.
  • [23] M. Keck, L. Galup, and C. Stauffer, “Real-time tracking of low-resolution vehicles for wide-area persistent surveillance,” in 2013 IEEE Workshop on Applications of Computer Vision, WACV 2013, Clearwater Beach, FL, USA, January 15-17, 2013, pp. 441–448, 2013.
  • [24] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 448–456, 2015.
  • [25] B.-N. Vo and W.-K. Ma, “The gaussian mixture probability hypothesis density filter,” IEEE Transactions on signal processing, vol. 54, no. 11, pp. 4091–4104, 2006.
  • [26] B. Clarke, “The implementation of GM-PHD filter.”
  • [27] A. Basharat, M. W. Turek, Y. Xu, C. Atkins, D. Stoup, K. Fieldhouse, P. Tunison, and A. Hoogs, “Real-time multi-target tracking at 210 megapixels/second in wide area motion imagery,” in IEEE Winter Conference on Applications of Computer Vision, Steamboat Springs, CO, USA, March 24-26, 2014, pp. 839–846, 2014.
  • [28] M. Teutsch and M. Grinberg, “Robust detection of moving vehicles in wide area motion imagery,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2016, Las Vegas, NV, USA, June 26 - July 1, 2016, pp. 1434–1442, 2016.