A New Urban Objects Detection Framework Using Weakly Annotated Sets

06/28/2017 ∙ by Eric Keiji, et al.

Urban informatics explores data science methods to address urban issues using data-intensive approaches. The large variety and quantity of available data should be exploited, but this brings important challenges. For instance, although powerful computer vision methods are available, they may require large annotated datasets. In this work we propose a novel approach to automatically create an object recognition system with minimal manual annotation. The basic idea is to build large input datasets from online cameras available in large cities. An off-the-shelf weak classifier is used to detect an initial set of urban elements of interest (e.g. cars, pedestrians, bikes). This initial dataset undergoes a quality control procedure and is subsequently used to fine-tune a strong classifier. Quality control and comparative performance assessment are part of the pipeline. We evaluate the method for detecting cars in images from monitoring cameras. Experimental results using real data show that, despite losing generality, the final detector provides better detection rates tailored to the selected cameras. The programmed robot gathered 770 video hours from 24 online city cameras (~300 GB), which were fed to the proposed system. The method nearly doubled the recall (a 93% relative increase) with respect to state-of-the-art off-the-shelf methods.




I Introduction

Modern cities collect a vast amount of data daily [11], including information about mobility, energy, violence, pollution and cultural life, to name but a few. Processing this huge amount of data is not an easy task. In particular, much information can be deduced when multiple sources are considered together, and image and video sources are especially rich. Useful information is not always readily available from the data. In some cases, great manual effort is necessary to process and combine different data sources to obtain the desired information.

In this paper, we describe an ongoing project based on a framework to automatically combine different sources of information and to create a city model to address urban issues (Figure 1). The data can be broadly categorized as visual and non-visual. Visual data, in turn, comprises at least three categories relevant to our work: city maps, remote sensing images and street-level images. In particular, city images contain relevant information about urban elements including cars, people and buildings, and it is useful to have all of them identified in the data. By combining these sources, one would be able to obtain geographic coordinates based on the recognition of a building whose position is known (Figure 2). In this stage of our framework, we focus on the recognition of cars in city images, but the extension to other urban elements is direct. Extracting semantic information from visual data is challenging and may require great manual effort. One option is to use crowd-sourcing services like Amazon Mechanical Turk [42]. This works reasonably well, but the quality may fall short depending on the kind of task [39]. Our contribution is a new method for the automatic generation of object detectors with no manual annotation. We evaluate the approach on the task of creating a car detector for monitoring camera images.

We propose an approach to automatically generate object detectors for urban informatics. In our method, data is harvested, pre-annotated with a weak classifier and then used to fine-tune a model. We validated it through the creation of a car detector for video sequences captured in the wild from monitoring cameras.

Figure 1: Urban informatics framework under development. Data is acquired from different sources, processed and integrated to compose a city model and address relevant urban issues.
(a) City map
(b) Remote sensing images
(c) City images
Figure 2: Three different visual data sources: images from cartography, from remote sensing and from monitoring cameras. The combination of them constitutes an important feature in the creation of a city model. The red selection exhibits the same urban element linked throughout the three different classes of images.

This paper is organized as follows. The remainder of this section presents works related to this approach. Section II explains the proposed method in detail, while Section III presents the experimental results and the validation of the method. The paper is concluded in Section IV with final remarks on ongoing work.

I-A Related Work

Object recognition may involve the detection of objects in a scene [41]. Due to the emergence of rich datasets and the development of higher computing capabilities, object recognition has been one prominent task in computer vision. Meaningful advances have been achieved since the theme started to be explored ([46, 29, 14]) and methods such as [34] allow fast and high accuracy recognition rates. When we restrict the recognition to one known class, then we have a problem of object detection [41].

The Deformable Parts Model (DPM) is a notable object detection method proposed by [14]. It identifies an object through its constituent parts and the corresponding spatial dispositions. The model characterizes each part of the object through a Histogram of Oriented Gradients (HOG) [10] in pyramid levels and through the possibly deformed positions relative to the root object. The overlapping candidates are computed and ranked according to a score.

Among other approaches, Artificial Neural Networks (ANNs) have been widely used for object recognition. An ANN can be thought of as a parallel processor composed of simple processing units, the neurons. Each neuron processes its inputs through an activation function and sends the results to the following neurons [36]. Neurons are organized in layers, and architectures with more than one hidden layer are referred to as deep networks [36]. In a conventional multilayer ANN, each layer is fully connected to the preceding layer. A variant, known as the Convolutional Neural Network (CNN), comprises one or more partially connected layers. This type of ANN has many applications in image classification [25, 37, 36]. In object recognition, one direct approach is to scan every possible window of the target image and classify it independently. This solution is compute-intensive, so instead of performing an exhaustive search for objects, many mechanisms for a preliminary step that proposes regions of interest have been created [19]. They are commonly categorized into superpixel grouping ([43, 7]) and variants of sliding windows ([8, 50]). Region-proposal methods are analogous to interest point detectors [19] because they allow attention to be focused on specific regions for subsequent tasks.

The work of Regions with Convolutional Neural Networks (RCNN) [34] introduces a unified network that performs region proposal and classification, which means that during the training step it accepts annotations of multiple sized objects and, during the testing stage, it performs classification of those objects in images of arbitrary sizes.

The conventional creation of an image recognition method involves training and evaluating the proposed method on a particular dataset. There are several datasets available, some covering multiple classes [17, 13, 35, 28] while others focus on one particular class [2, 12, 38]. In particular, there are several car datasets available [6, 48, 27, 32, 24], which include a great number and diversity of samples. However, they lack situations that are common in monitoring cameras, such as high occlusion, varying quality, naturally moving objects and diverse image acquisition conditions such as weather. Monitoring camera datasets, such as Virat [31], Kitti [15], Visor [45], i-Lids [20], CamVid [4] and the MIT Traffic Dataset [47], provide only limited amounts of this type of video. Another possibility is to acquire data from public streams. There are multiple platforms that aggregate monitoring cameras from many places around the world, like EarthCam [21], InseCam [33] and Camerite [5]. Due to their nature, such image sources present scenes under varying conditions, including scarce illumination, low resolution and bad weather.

In the context of automatic learning of concepts, the work of [30] introduces a framework addressing the semi-supervised learning problem of discovering multiple objects in sparsely labeled videos. Focus is given to automatic quality assurance, due to the quick degradation of the classifier when false positives are included in the database. The method starts with a few sparsely labeled videos, and exemplar detectors are trained on this starting data. The video is consistently sampled and annotated by the classifier. Since the annotations are sparse, the authors argue that negative examples cannot be obtained from the neighborhood of a detection, so they use random images from external sources. An initial filtering is performed using temporal consistency, assuming smoothness in the motion of the objects. Then outliers are removed in a feature space different from that of the classifier, using unconstrained Least Squares Importance Fitting. The filtered detections serve as the starting point for short-term tracking based on sparse optical flow and HOG features. To filter the potential redundancy in the resulting set, each object is associated with an exemplar detector [18], and high correspondences, which indicate redundant detections, are removed. Finally, the resulting data is used to update the detectors. The present work has many similarities to that work. Next we explain in detail the steps of our approach.

II Proposed Framework

Figure 3 describes the proposed method. We propose to create a detector with minimal user intervention, given a dataset and two different classifiers, referred to as the Weak Classifier (WC) and the Original Strong Classifier Model (OSCM). The resulting fine-tuned classifier is referred to as the Strong Classifier (SC). The steps of the method shown in Figure 3 are described below.

Figure 3: Proposed methodology. Frames are sampled and split in training and test sets. A weak detection is performed on the training set and used to retrain the strong classifier. Quality control is performed both on the detections of the new dataset on test set and also on the detections of the weak classifier on the same set.

II-A Data acquisition

The first stage consists in the acquisition of a large set of images and video of interest. Besides using available datasets, the proposed implementation retrieves data from monitoring cameras available. This solution allows an excellent variability and amount of data that may be retrieved. We used [5] to filter and select the monitoring cameras. See Figure 5 for some sample frames obtained.

II-B Data sampling

An optional step in the method is sampling the input video. The motivation is the redundancy of the objects that might be detected in consecutive frames if no temporal coherency method such as [23] is taken into account. This step is also important to cope with the large datasets possibly obtained in the previous step.
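The sampling step can be illustrated with a minimal sketch, assuming frames are addressed by their index in the stream. The function name `systematic_sample` and the `offset` parameter are hypothetical; the default step of 20 mirrors the 1/20 rate used in the experiments.

```python
from typing import List


def systematic_sample(num_frames: int, step: int = 20, offset: int = 0) -> List[int]:
    """Return the indices of the frames kept when one frame in every `step` is retained.

    Hypothetical helper: `offset` selects which frame of each group is kept.
    """
    if step < 1:
        raise ValueError("step must be at least 1")
    if not 0 <= offset < step:
        raise ValueError("offset must satisfy 0 <= offset < step")
    return list(range(offset, num_frames, step))


# Example: a 100-frame clip sampled at a 1/20 rate keeps 5 frames.
kept = systematic_sample(100, step=20)
```

A fixed step keeps the sample evenly spread over time, which suits the stated goal of reducing redundancy between consecutive frames.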

II-C Training and test sets division

To properly evaluate the proposed method, the data should be split into training and test sets. Commonly accepted rules of thumb for the split proportions are given in [1]. This allows performance assessment using cross-validation-like strategies.
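A minimal sketch of such a split is given below. It assumes a random shuffle with a fixed seed, which is an assumption of this example and not necessarily how the frames were assigned in the experiments; the function name `split_frames` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_frames(frames: Sequence[T], train_size: int, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the frames reproducibly and split them into training and test sets."""
    if not 0 <= train_size <= len(frames):
        raise ValueError("train_size must be between 0 and the number of frames")
    shuffled = list(frames)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    return shuffled[:train_size], shuffled[train_size:]


# Example with 100 frame identifiers and an 84/16 split.
train, test = split_frames(range(100), train_size=84)
```

With video data, frames that are close in time can be nearly identical, so a split by camera or by time interval may be preferable when leakage between the two sets is a concern.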

II-D Weak classifier application and fine-tuning training set generation

The DPM method [14] was adopted as our WC. DPM is a remarkable work in object recognition based on low-level features. The model assumes that an object is composed of components, each of which is composed of a root and a set of parts. The score of each object hypothesis is computed as in Equation II-D: the filter term minus the deformation term, plus a bias. The filter term is given by the model filters convolved with the HOG features extracted at their respective locations. The deformation term combines the model deformation parameters with the displacement of each part. An object hypothesis is considered a detection if it results in the best placement of parts, as expressed by Equation 2. The training set is processed by the WC and a set of object detections is obtained (Figure 3).
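For reference, the hypothesis score and the best-placement rule described above can be written in the notation of [14]. The symbols below follow that paper and are an assumption with respect to the notation originally used in this section.

```latex
% Score of a hypothesis with root position p_0 and part positions p_1, ..., p_n
\mathrm{score}(p_0, \dots, p_n)
  = \sum_{i=0}^{n} F_i \cdot \phi(H, p_i)
  - \sum_{i=1}^{n} d_i \cdot \phi_d(dx_i, dy_i)
  + b

% Score of a root location: best placement of the parts
\mathrm{score}(p_0) = \max_{p_1, \dots, p_n} \mathrm{score}(p_0, \dots, p_n)
```

Here $F_i$ are the root and part filters, $\phi(H, p_i)$ are the HOG features of the feature pyramid $H$ at position $p_i$, $d_i$ are the deformation parameters, $\phi_d(dx_i, dy_i) = (dx_i, dy_i, dx_i^2, dy_i^2)$ are the deformation features of the displacement of part $i$, and $b$ is the bias.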


II-E Strong classifier generation

The detected objects are used to fine-tune an RCNN pre-trained on ImageNet. RCNN in its original form performs an ordinary classification in each region of the image. The feature extraction is performed using a CNN and the classification is performed using Support Vector Machines [9]. An additional region-proposal step can be added to reduce the search space of object detection in the image. A relevant drawback of this original version is that it is very compute-intensive. A faster version of the RCNN has been proposed [34]. It starts by applying the convolutional layers, followed by the extraction of region proposals. Classification is then performed as the last step. The resulting method is faster and presents similar results [34]. The fine-tuned RCNN is the final detector, the Strong Classifier, tailored to the input online city cameras.

II-F Quality control

A critical aspect of the proposed approach is to assure the quality of the intermediate representations obtained in a semi-automated way. In the approach described in the previous sections, the weak classifier is used to generate the samples used to fine-tune the strong classifier. A quality control step is performed to evaluate the performance of the WC in this task.

Figure 4: Quality control methodology. In the quality control stage, a manual inspection is performed on the detections and the number of true positives and false positives are computed.

In the quality control stage (Figure 4), we want to estimate the real performance of an object detector given the detected objects in a sample.

In order to evaluate the accuracy of each detector over the dataset, it is desirable to know the proportion of true positives each detector has produced. However, to avoid excessive manual work in annotating the images, we estimate the proportion $\hat{p}$ over a randomly selected sample from the dataset. The confidence interval for a population proportion $p$ based on a sample of size $n$ is given by [3]:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

where $z_{\alpha/2}$ is the corresponding quantile of the standard normal distribution.

We have to use a reasonable value of $\hat{p}$, based on our experience and on the outcome we expect from this experiment. The worst case (when we need the largest $n$) happens if $\hat{p} = 0.5$, since $\hat{p}(1-\hat{p})$ reaches its maximum value. A good practice is to collect a small random pilot sample and to calculate the proportion over this sample. We chose the value of $\hat{p}$ based on our experience and on a pilot sample of 50 images. The minimum sample size $n$ then follows from the chosen confidence level (95%, i.e. $\alpha = 0.05$) and margin of error.
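The minimum sample size can be sketched as follows, assuming the usual normal-approximation bound $n \geq z_{\alpha/2}^2 \, \hat{p}(1-\hat{p}) / \Delta^2$ for a margin of error $\Delta$. The function name and the example values (worst-case proportion of 0.5, margin of 0.05) are illustrative assumptions and are not the values used in the experiments.

```python
import math
from statistics import NormalDist


def min_sample_size(p_hat: float, margin: float, confidence: float = 0.95) -> int:
    """Smallest n such that the half-width of the normal-approximation
    confidence interval for a proportion does not exceed `margin`."""
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be in [0, 1]")
    if margin <= 0.0:
        raise ValueError("margin must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    return math.ceil(z * z * p_hat * (1.0 - p_hat) / (margin * margin))


# Illustrative values only: worst case p_hat = 0.5 with a 5% margin of error.
n_worst = min_sample_size(0.5, 0.05)
```

A pilot estimate of $\hat{p}$ far from 0.5 lowers the required sample size, which is why a small pilot sample can save annotation effort.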


The precision on the sample is computed according to Equation 3. The recall of the population is in principle unknown. To be calculated, it requires the number of false negatives (see Equation 4). Once again, instead of annotating all the dataset for recall, the same approach for estimating the precision can be done, by annotating a randomly selected sample.
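For reference, the standard definitions of precision and recall in terms of true positives (TP), false positives (FP) and false negatives (FN) are given below; they are assumed to correspond to Equations 3 and 4.

```latex
\mathrm{precision} = \frac{TP}{TP + FP}
\qquad
\mathrm{recall} = \frac{TP}{TP + FN}
```

Precision needs only the inspected detections, whereas recall needs FN, that is, the objects that the detector missed and that must be found by annotating the images.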

The performance evaluation of the SC is carried out based on the results of the two quality control stages. Although we want to avoid computing the recall of each classifier (since it requires manual annotation of a very large video set), the number of images tested for the WC and the SC is the same, so we can compute the relative change in recall with respect to the WC, as in Equation II-F. This is particularly interesting in the big data scenario, where we want to minimize the effort to label the data.
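The reasoning can be made explicit with a short derivation. It is a sketch under the assumption that both detectors are evaluated on the same images, so that the total number of objects $P = TP + FN$ is the same for the WC and the SC; the symbols are introduced here for illustration.

```latex
\Delta_{\mathrm{recall}}
  = \frac{\mathrm{recall}_{SC} - \mathrm{recall}_{WC}}{\mathrm{recall}_{WC}}
  = \frac{TP_{SC}/P - TP_{WC}/P}{TP_{WC}/P}
  = \frac{TP_{SC} - TP_{WC}}{TP_{WC}}
```

The unknown $P$ cancels, so the relative change in recall depends only on the true positive counts and no false negatives have to be annotated. With the Validation 1 counts reported in Table III, $(2638 - 1366)/1366 \approx 0.93$.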


The quality control stage can be summarized in the following sequence of steps:

  1. Compute a sample size, according to the equation of minimum sample size.

  2. Label a small sample.

  3. Compute TP, TN and FP.

  4. Apply Equation II-F.

  5. Compute FN and Equation 4.

The last step is optional in case one just needs the relative improvement of the method, which can be obtained through Equation II-F. In case this step is performed, the false negative samples can be used in the retraining of the SC, as proposed by [14].
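The steps above can be sketched as follows. The `Counts` class and the function names are hypothetical; the counts in the usage example are the Validation 1 values reported in Table III, and the relative change is computed from the true positives only, which assumes that both detectors were evaluated on the same images.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Counts:
    """Manually verified counts for one detector on the inspected sample."""
    tp: int
    fp: int
    fn: Optional[int] = None  # only available if missed objects were annotated


def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp)


def recall(c: Counts) -> float:
    if c.fn is None:
        raise ValueError("recall requires the number of false negatives")
    return c.tp / (c.tp + c.fn)


def relative_recall_change(weak: Counts, strong: Counts) -> float:
    """Relative change in recall when both detectors ran on the same images,
    so that the unknown number of objects cancels out."""
    return (strong.tp - weak.tp) / weak.tp


# Validation 1 counts from Table III.
wc = Counts(tp=1366, fp=127)
sc = Counts(tp=2638, fp=345, fn=1060)

wc_precision = precision(wc)
sc_recall = recall(sc)
gain = relative_recall_change(wc, sc)
```

Keeping `fn` optional reflects the optional last step: precision and the relative recall change are available without it, while the absolute recall is not.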

II-G Comparative performance assessment

The SC detector is evaluated on the test set and a quality control stage is also performed. The test set is also processed by the WC and quality control is performed in this stage as well. This allows assessing the gain obtained by the strong classifier w.r.t. the weak classifier. The results of the two quality control stages are combined to infer the accuracy of the SC.

III Experimental Results and Validation

The proposed methodology has been implemented in order to be validated. We used a relational database management system, PostgreSQL [44], to store all the metadata of the acquisition and of the detections. Different programming languages were used in the framework, including Python, MATLAB and Linux bash scripting. In the acquisition stage, we continuously decoded HTTP live streaming [40] to MPEG [26] files. This stage included a failure-proof mechanism to handle issues on the client side, on the server side and on the network. In the object detection stages, DPM (https://github.com/rbgirshick/voc-dpm) and the Python implementation (https://github.com/rbgirshick/py-faster-rcnn) of the Faster-RCNN [34] method were used.
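The failure-proof mechanism is not detailed here. One common way to tolerate transient client, server and network failures is to retry the recording task with an increasing delay, as in the minimal sketch below. The function `run_with_retries`, its parameters and the exponential backoff policy are assumptions of this example and not the mechanism actually implemented.

```python
import time
from typing import Callable


def run_with_retries(
    task: Callable[[], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `task` until it succeeds and return the number of attempts used.

    After each failure the wait doubles (base_delay, 2 * base_delay, ...).
    The last failure is re-raised once `max_attempts` is reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            task()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In a recorder, `task` would wrap the call that downloads and decodes one segment of the stream; injecting `sleep` keeps the function easy to test without real waiting.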

We validated the proposed method in two experiments. In the first case, we created a car detector for images from the monitoring cameras. In the second case we were motivated by the task of efficiently finding cars in rainy weather, a more difficult computer vision task. There are multiple works dealing with the problem of rain removal [49], but none tackling the problem of finding cars in the rain. We restricted our dataset to images of rainy weather and developed our pipeline according to this data source.

Figure 5: Data source. Sample of the video frames used.
Dataset | Number of cameras | Total hours | Number of sampled frames | Size (GB)
Camerite-all | 24 | 768.0 | 358,036 | 262.5
Camerite-rainy | 14 | 63.7 | 7,011 | 5.8
Table I: Comparison of the two datasets, Camerite-all and Camerite-rainy.
Figure 6: Locations of some of the cameras used in São Paulo city, Brazil. Map from [22].
Table II: Sample of the detections performed by the SC.

III-A Validation 1: Uncontrolled weather

In the first experiment, we created a car detector based on images from monitoring cameras. 24 cameras were continuously monitored for 6 days, resulting in 768 hours of video and 265.2 GB. The videos were systematically sampled from the stream at a 1/20 rate to reduce the redundancy of the detections, since time-varying information is not exploited in the current implementation (e.g. tracking, which can be easily incorporated into the proposed methodology). The 358,036 frames obtained were split into two groups: a training set of 300,000 frames (84%) and a test set of 58,036 frames (16%). We applied the DPM [14] to the training set and a quality control step was performed at this stage. The 557,036 cars detected were used to fine-tune an RCNN with the 16-layer VGG architecture [37], pre-trained on ImageNet. Thus a final detector, the SC, was obtained. The second quality control stage was then performed. The quality control results are shown in Table III.

Detector | TP | FP | FN | Precision | Recall
Validation 1
WC | 1366 | 127 | 2205 | 91.5% | 36.9%
SC | 2638 | 345 | 1060 | 88.4% | 71.3%
Relative change | | | | -3.2% | +93.2%
Validation 2
WC | 914 | 115 | 2703 | 88.8% | 25.2%
SC | 1512 | 449 | 2105 | 77.1% | 41.8%
Relative change | | | | -12.5% | +65.8%
Table III: Quality control results for Validation 1 and Validation 2.

In the performance assessment of the SC, a relative change in precision of -3.2% and a relative change in recall of +93.2% with respect to the WC were obtained (Table III). The generated detector thus presents a significant increase in recall at the cost of a small loss of precision.

III-B Validation 2: Rainy days

In the second set of experiments, we created a car detector based on frames from monitoring cameras recorded on rainy days.

We used 14 cameras during 2 days. Here as well, the raw videos were sampled at a 1/20 rate. The 7,011 frames obtained were split into a training set of 6,000 frames (85%) and a test set of 1,011 frames (15%). We applied the DPM [14] to the training set and a quality control step was performed at this stage. The 17,325 cars detected were used to fine-tune the detector previously obtained with images from Camerite-all, and a final detector was then obtained. The second quality control stage was then performed. Manual inspection was performed over a sample of 500 images from the test set, according to the quality control step proposed in Section II-F. The results of the quality control stages are shown in Table III. In the performance assessment of this detector, relative changes of -12.5% in precision and +65.8% in recall were obtained.

Figure 7: Comparison of the car detections performed by the different detectors, from left to right.

IV Discussion and concluding remarks

This work presents the current state of an urban informatics framework that involves three data levels: Source data, City model and Knowledge. Our ongoing work focuses on developing this framework with an implementation using real cases. The framework is based on:

  • Source data: city information should be collected as automatically as possible. We consider two types of data: visual and non-visual:

    • Visual data: city maps, remote sensing, urban images/videos;

    • Non-visual urban data: All types of socio-economic statistics available like education levels and information, violence, news information, traffic, etc.

  • City model: representation of different layers of city features. Different data structures have to be defined (e.g. images, networks, textual and numerical records, etc).

  • Knowledge: The framework should help to address questions like How do cities evolve?, How can they be compared?. Such questions should be answered with the support of analytical methods suitable for each case provided by the framework.

Annotating large amounts of data is challenging. One option is to perform it manually, which is labor-intensive for big data. Alternative options include hiring services like Amazon Mechanical Turk or Citizen Science [42]. However, new approaches to minimize human operation are desirable. In this scenario, the contribution of this paper is a methodology for generating object detectors with minimal manual annotation and quality control. The source data is initially processed by a weak classifier and the resulting detections are used to fine-tune a new detector. In both steps, the user inspects a small sample of detections, looking only at the ratio of true and false positives. We validated the methodology by creating a car detector for monitoring cameras, which produced relative changes of -3.2% in precision and +93.2% in recall with respect to the weak classifier. Motivated by the urban problem, we ran the same pipeline on rainy images and obtained -12.5% and +65.8%, respectively. These results show that the strong classifier presents a substantial improvement in recall with a small loss of precision.

For this paper we implemented one step of the collection and annotation of visual data in an urban environment. Next steps include the extension of our method to other urban elements such as pedestrians, buildings and roads. Then, other visual data sources besides city images will be explored, including city maps and radar images. After that, non-visual data such as demographic, traffic and violence information will be incorporated and, finally, we are going to create a city model to address the aforementioned urban issues.


The authors would like to thank FAPESP grants #2015/22308-2, #15/03475-5, #16/12077-6, #14/24918-0, CNPq, CAPES and NAP eScience - PRP - USP.


  • [1] Yaser S Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from data, volume 4. AMLBook New York, NY, USA:, 2012.
  • [2] Saad Ali and Mubarak Shah. A lagrangian particle dynamics approach for crowd flow segmentation and stability analysis. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–6. IEEE, 2007.
  • [3] Michael Baron. Probability and statistics for computer scientists. pages 256–257. CRC Press, 2 edition, 2014.
  • [4] Gabriel J Brostow, Julien Fauqueur, and Roberto Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30(2):88–97, 2009.
  • [5] Camerite. (http://www.camerite.com). Camerite Publicidade e Monitoramente Ltda., Santa Catarina, Brazil, Last accessed March 2017.
  • [6] Peter Carbonetto, Gyuri Dorkó, Cordelia Schmid, Hendrik Kück, and Nando De Freitas. Learning to recognize objects with little supervision. International Journal of Computer Vision, 77(1-3):219–237, 2008.
  • [7] Joao Carreira and Cristian Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3241–3248. IEEE, 2010.
  • [8] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, and Philip Torr. Bing: Binarized normed gradients for objectness estimation at 300fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3286–3293, 2014.
  • [9] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • [10] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
  • [11] United States Environment Protection Agency Air Data. (https://www3.epa.gov/airdata/ad_data_daily.html). United States Environment Protection Agency, Last accessed March 2017.
  • [12] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 34(4):743–761, 2012.
  • [13] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • [14] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9):1627–1645, 2010.
  • [15] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013.
  • [16] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [17] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
  • [18] Bharath Hariharan, Jitendra Malik, and Deva Ramanan. Discriminative decorrelation for clustering and classification. In Computer Vision–ECCV 2012, pages 459–472. Springer, 2012.
  • [19] Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What makes for effective detection proposals? Pattern Analysis and Machine Intelligence, IEEE Transactions on, 2015.
  • [20] Paul Hosmer. i-lids dataset for avss 2007. In Advanced Video and Signal based Surveillance. IEEE, 2007.
  • [21] Earthcam Inc. (https://www.earthcam.com). EarthCam, Last accessed March 2017.
  • [22] Google Inc. (https://maps.google.com). Google Maps, Last accessed March 2017.
  • [23] Zdenek Kalal, Krystian Mikolajczyk, and Jiri Matas. Tracking-learning-detection. IEEE transactions on pattern analysis and machine intelligence, 34(7):1409–1422, 2012.
  • [24] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
  • [25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [26] Didier Le Gall. Mpeg: A video compression standard for multimedia applications. Communications of the ACM, 34(4):46–58, 1991.
  • [27] Moshe Lichman. UCI machine learning repository, 2013.
  • [28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
  • [29] David G Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. Ieee, 1999.
  • [30] Ishan Misra, Abhinav Shrivastava, and Martial Hebert. Watch and learn: Semi-supervised learning for object detectors from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3593–3602, 2015.
  • [31] Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, JK Aggarwal, Hyungtae Lee, Larry Davis, et al. A large-scale benchmark dataset for event recognition in surveillance video. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3153–3160. IEEE, 2011.
  • [32] Mustafa Ozuysal, Vincent Lepetit, and Pascal Fua. Pose estimation for category specific multiview object localization. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 778–785. IEEE, 2009.
  • [33] Insecam Project. (https://www.insecam.org). Insecam, Last accessed March 2017.
  • [34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
  • [35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [36] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
  • [37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [38] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [39] Alexander Sorokin and David Forsyth. Utility data annotation with amazon mechanical turk. 2008.
  • [40] Thomas Stockhammer. Dynamic adaptive streaming over http–: standards and design principles. In Proceedings of the second annual ACM conference on Multimedia systems, pages 133–144. ACM, 2011.
  • [41] Richard Szeliski. Computer vision: algorithms and applications. Springer Science & Business Media, 2010.
  • [42] Mechanical Turk. (http://www.mturk.com). Amazon Web Services, Amazon Inc., 2013.
  • [43] Koen EA Van de Sande, Jasper RR Uijlings, Theo Gevers, and Arnold WM Smeulders. Segmentation as selective search for object recognition. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 1879–1886. IEEE, 2011.
  • [44] PostgreSQL version 9.4.6. (http://www.postgresql.org). PostgreSQL Global Development Group, California, United States, 2010.
  • [45] Roberto Vezzani and Rita Cucchiara. Visor: Video surveillance on-line repository for annotation retrieval. In Multimedia and Expo, 2008 IEEE International Conference on, pages 1281–1284. IEEE, 2008.
  • [46] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–511. IEEE, 2001.
  • [47] Xiaogang Wang, Xiaoxu Ma, and Eric Grimson. Unsupervised activity perception by hierarchical bayesian models. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, pages 1–8. IEEE, 2007.
  • [48] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.
  • [49] Xiaopeng Zhang, Hao Li, Yingyi Qi, Wee Kheng Leow, and Teck Khim Ng. Rain removal in video by combining temporal and chromatic properties. In Multimedia and Expo, 2006 IEEE International Conference on, pages 461–464. IEEE, 2006.
  • [50] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In Computer Vision–ECCV 2014, pages 391–405. Springer, 2014.