
Supporting Video Queries on Zero-Streaming Cameras

by Mengwei Xu, et al.

As low-cost surveillance cameras grow rapidly, we advocate for these cameras to be zero streaming: ingesting videos directly to their local storage and only communicating with the cloud in response to queries. To support queries over videos stored on zero-streaming cameras, we describe a system that spans the cloud and cameras. The system builds on two unconventional ideas. When ingesting video frames, a camera learns accurate knowledge on a sparse sample of frames, rather than learning inaccurate knowledge on all frames; in executing one query, a camera processes frames in multiple passes with multiple operators trained and picked by the cloud during the query, rather than one-pass processing with operator(s) decided ahead of the query. On diverse queries over 750-hour videos and with typical wireless network bandwidth and low-cost camera hardware, our system runs at more than 100x video realtime. It outperforms competitive alternative designs by at least 4x and up to two orders of magnitude.



1. Introduction

Surveillance cameras are cheap (<$50) and growing rapidly (140M annual shipments (ihs-report-2018, )). They continuously produce videos with potential business/social value, e.g., for crime investigation, smart retailing, and road traffic analysis. Yet the volume of videos is colossal (20GB from a single camera per day), and analyzing videos is expensive: it takes a $4,000 GPU to run deep neural networks (NNs) with state-of-the-art accuracy over the video stream from a $25 camera (noscope, ; wyzecam, ). Given the fast data growth and expensive analytics, an increasingly smaller fraction of videos will eventually be analyzed (wdblog, ).

This necessitates retrospective analytics (focus, ; noscope, ; vstore, ): rather than continuously querying live videos, a system stores videos for a user-defined timespan (e.g., one week) and only analyzes the stored videos when a user queries, e.g., “return all frames containing yellow buses yesterday”. Retrospective analytics is efficient for harnessing colossal surveillance videos. It is flexible, in that it supports ad-hoc queries where queried object classes, query accuracies, and query operators only become known at query time (eureka, ; vstore, ).

To archive surveillance videos and support retrospective analytics, a common approach is “all streaming” (focus, ; vstore, ): at ingestion time, cameras ship live videos to the cloud or the edge (we do not differentiate the two in this paper). This approach heavily stresses the network bandwidth between cameras and the cloud, a limited resource scaling much more slowly than video data (see Section 2 for evidence). To mitigate the pressure, recent work advocates for cameras’ ingestion to run “early discarders” (e.g., motion detection) continuously; the cameras then stream only the surviving videos to the cloud (filterforward, ). This, however, is effective only when the possible queries, including their semantics and parameters, are limited and known beforehand. To support diverse ad-hoc queries, the required early discarders bloat and become unaffordable for low-cost cameras.

Figure 1. Overview of DIVA.

Zero-streaming cameras

While the design space is large and many intermediate points are possible, we explore an extreme shown in Figure 1: cameras capture videos to local flash storage (cheap and large, see Section 2) without uploading any; only in response to user queries, they communicate and cooperate with the cloud to analyze the stored videos. Optionally, cameras may periodically present proof of video possession to the cloud.

Shifting much responsibility from ingestion to query execution, zero-streaming cameras enjoy unique benefits. i) They consume network bandwidth only for queries but never for ingestion, making large camera deployment possible under limited network bandwidth. They further suit deployment in remote areas, as they only require network connectivity provisioned at query time, e.g., long-range wireless connection (lora, ) from drones. ii) They consume cloud resources only for queries, resonating well with the cloud’s on-demand spirit. The resultant billing model is clean: charging the cloud usage to whoever queries.

Goal & Challenges

Can we build a system to query such cameras with interactive performance (to keep users in the query loop) and uncompromised accuracy? Multiple factors are against this goal. As low-cost cameras are capable of storing large videos, one query may cover videos spanning days and in GBs. A camera’s uplink bandwidth is typically no more than a few MB/s (mpdash, ), limiting the rate of uploading videos to the cloud for processing. Low-cost cameras often see constrained hardware, e.g., wimpy CPUs and lack of NN accelerators (cheapcam, ; wyzecam, ).

These challenges make prior video analytics systems inadequate. Most systems target live analytics (vigil, ; lavea, ; filterforward, ), processing frames in a streaming fashion. They would miss key opportunities in retrospective processing, e.g., prioritizing frames or processing them in multiple iterations. Recent systems propose to index all ingested video frames and therefore execute queries on the indexes at low cost (focus, ). This is suboptimal for low-cost cameras: they can only afford to build low-accuracy indexes, resulting in poor query performance. We will demonstrate the inadequacies experimentally.


In response, we build DIVA, an analytics engine for querying videos stored on low-cost surveillance cameras, as shown in Figure 1. Similar to prior systems (noscope, ), DIVA executes a query by running light, specialized NNs on cameras and a full-fledged object detector in the cloud in conjunction; the cloud validates results uploaded from the cameras or processes frames undecidable to cameras. Unlike prior systems, DIVA builds around two unconventional designs.


  • Ingestion: sparse but sure landmarks (Figure 1(a))   On a small fraction of frames, dubbed landmarks, the camera’s ingestion runs a generic, slow, yet high-accuracy object detector, e.g., YOLOv3 (yolov3, ). This is the opposite of the common “all and rough” ingestion design: running low-accuracy object detectors on all or most frames (focus, ; filterforward, ). The landmark frames, uploaded to the cloud at the beginning of a query, are crucial to DIVA’s performance: they provide long-term video knowledge for query optimization; they serve as initial training samples for bootstrapping operators at query time; and they provide initial results in response to statistical queries.

  • Query: iterative execution with operator upgrade (Figure 1(b))   Recognizing that full query execution is slow, DIVA embraces online query processing (ola, ): it presents partial or inexact results to users as soon as a query starts and continuously refines them. During one query, DIVA’s cloud runtime trains a library of operators and proactively sends the camera updated operators, picked based on the query’s progress, network/camera resources, and the outcome of online operator training. With upgraded operators, the camera processes video frames iteratively in multiple passes: early passes quickly explore the frames for inexact answers, while later passes slowly exploit for more exact ones. This contrasts with many prior systems that process queried frames in one pass, using operators trained ahead of the query (noscope, ).
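The progressive fast-to-slow upgrade schedule described above can be sketched as follows; the operator speeds, the decay factor, and the dict representation are all illustrative assumptions, not DIVA's actual interface.

```python
# Sketch: order candidate operators fastest-first, then plan an upgrade
# sequence where each next operator is at least `decay`x slower (and,
# implicitly, more accurate). Speeds are illustrative frames/second.
def upgrade_schedule(ops, decay=2.0):
    ordered = sorted(ops, key=lambda o: -o["speed"])
    plan, cur = [], None
    for o in ordered:
        if cur is None or o["speed"] <= cur / decay:
            plan.append(o)
            cur = o["speed"]
    return plan

plan = upgrade_schedule([{"speed": 800}, {"speed": 500}, {"speed": 300},
                         {"speed": 100}, {"speed": 60}])
# plan keeps operators at speeds 800, 300, 100: each pass at least 2x slower
```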

Beyond the two designs, DIVA employs important optimizations, including operator bootstrapping at query time, scheduling frames for processing, and tracking optical flows.

Through evaluation with 720-hour videos from 15 different scenes, we show that DIVA runs queries at more than 100× video realtime on average under typical wireless network and low-cost camera hardware. Compared to competitive alternatives, DIVA speeds up queries by at least 4×.


We have made the following contributions.


  • We introduce zero streaming, a new paradigm for operating low-cost surveillance cameras under limited network bandwidth.

  • Towards querying low-cost, zero-streaming cameras, we present two unconventional designs: landmarks for capturing accurate knowledge on sparse frames rather than inaccurate knowledge on all; operator upgrade for processing frames in multiple passes with operators continuously trained and picked during a query, rather than one pass with operators decided ahead of the query.

  • We report DIVA, an online video query processing system that instantiates the designs. Evaluated on large video datasets, DIVA runs queries at more than 100× video realtime and significantly outperforms competitive alternatives.

To our knowledge, DIVA is the first system designed for querying large videos stored on low-cost remote cameras.

2. Background & Motivations

2.1. Hardware trends

Numerous low-cost cameras

Low-cost surveillance cameras (typically under $50) have seen explosive growth in recent years. At such low price points, they are equipped with decent image sensors but constrained compute resources: embedded CPU cores (e.g., Armv7-a or MIPS32) and a few hundred MBs of DRAM (wyzecam, ; cheapcam, ). They can run simple video processing, e.g., motion detection, but cannot run high-accuracy neural networks (NNs) on ingested videos in real time (nnbenchmarks, ; yolov2-rpi, ). We envision such low-cost cameras constituting the majority of future surveillance cameras, offering extensive coverage of the physical world. As complements, a smaller number of high-end cameras (jetson-camera, ) (costing a few hundred dollars), equipped with NPUs or GPUs (jetson-tx2-perf, ), are deployed strategically in critical locations, e.g., busy intersections.

We design for low-cost cameras. We will examine the applicability to more expensive cameras in evaluation.

Network bandwidth remains precious

To stream videos in real time, a surveillance camera consumes network bandwidth as high as 1–2 Mbps (fortinet-white-paper, ). As the number of cameras grows, such high consumption of shared network bandwidth draws user complaints (complain-camera-1, ; complain-camera-2, ) and researcher attention (vigil, ; filterforward, ). While emerging network technologies such as 5G promise severalfold bandwidth increases, consumer demand for network bandwidth is projected to grow even faster (e.g., 20× for VR/AR and 10× for Internet gaming) (cisco-white-paper, ). Edge resources are not a panacea either, as numerous wireless cameras are limited by wireless bandwidth in reaching the edge.

On-camera storage is cheap

Flash storage is increasingly denser and cheaper. Since 2017, the price of a 32GB MicroSD card dropped from $25 to $7 (camel-sdcard-32GB, ). Such a card is capable of holding 10 days of video footage (720p at 1fps, encoded).
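As a rough sanity check of the 10-day figure, the arithmetic works out under an assumed, illustrative encoded bitrate of about 300 kbps for 720p at 1 fps (the actual bitrate depends on codec and scene).

```python
# Back-of-envelope storage check. The 300 kbps bitrate is an assumed,
# illustrative figure for encoded 720p at 1 fps, not a measured one.
CARD_BYTES = 32e9          # 32 GB MicroSD card
SECONDS_PER_DAY = 86400
bitrate_bps = 300e3
bytes_per_day = bitrate_bps / 8 * SECONDS_PER_DAY   # ~3.24 GB/day
days = CARD_BYTES / bytes_per_day
print(round(days, 1))      # → 9.9
```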

2.2. System model: zero-streaming cameras

Motivated by pervasive low-cost cameras, precious network bandwidth, and cheap on-camera storage, we advocate for such cameras to capture videos to local storage without streaming anything; only in response to user queries, they collaborate with the cloud and process the queried videos. Zero-streaming cameras are reactive and highly efficient: they only consume network and cloud resources for analyzing queried video footage, which typically constitutes a small fraction of all captured videos (wdblog, ).


Our goal of querying zero-streaming cameras opens a new design space. In this paper, we take the initial step and address the first-class concerns with minimal assumptions. Specifically, we focus on optimizing individual queries: the system performs a “cold start” for each query, i.e., assumes no prior knowledge about the queried videos or cameras; and we focus on querying individual cameras. We intend the resultant system to serve as a basis for future enhancements, e.g., exploiting past queries (queryrefinement, ) or cross-camera topology (rexcam, ).

As discussed in Section 1, we address challenges from large videos, wimpy cameras (cheapcam, ), and limited network bandwidth (mpdash, ). We do not consider the cloud as a limiting factor, assuming it runs fast enough to process frames uploaded through the bandwidth-limited camera/cloud connection.

2.3. Query Model

Supported queries

We target generic, ad-hoc queries that cannot be enumerated ahead of time. Concerning specific video footage on a camera, a query covers a timespan, typically lasting hours or days; an object class recognizable by modern NNs, e.g., any of the 80 classes of YOLOv3 (yolov3, ); and one of three possible query types. A retrieval query requests video frames that contain a given object class, e.g., “retrieve all images that contain yellow buses”. A tagging query requests the time ranges in which the video contains any object of a given class, e.g., “return all time ranges when any deer shows up”. The time ranges can be represented in a compact form, in contrast to retrieval’s results: much larger video frames. A counting query requests statistics (average, median, or max) of the object count of a given class across video frames, e.g., “return the maximum number of cars that ever appear in any frame”. These queries cover video analytics commonly studied in prior work (focus, ; noscope, ; blazeit, ; vigil, ).
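The query model above can be captured in a small sketch; the class and field names here are our own illustration, not DIVA's API.

```python
from dataclasses import dataclass
from typing import Literal, Optional

# Hypothetical representation of a query: a timespan, an object class,
# a query type, and (for counting queries) an aggregate.
@dataclass
class Query:
    start_ts: float                     # start of queried timespan (seconds)
    end_ts: float                       # end of queried timespan
    obj_class: str                      # e.g., one of YOLOv3's 80 classes
    qtype: Literal["retrieve", "tag", "count"]
    agg: Optional[str] = None           # "avg" | "median" | "max" for counting

# e.g., "retrieve all frames containing buses in the last 24 hours"
q = Query(start_ts=0.0, end_ts=24 * 3600.0, obj_class="bus", qtype="retrieve")
```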

A case for online query processing

Given constrained resources and large videos, executing a query in full is likely to take long. We therefore design the system to present users with early results that are continuously refined, as well as a report on query progress.

In executing a retrieval query, the user keeps seeing new positive frames (i.e., containing requested objects) uploaded from cameras, along with an estimated percentage of the positives already retrieved. In executing a tagging query, the user sees the result as a video timeline with positive/negative tags; she sees the tagging resolution, i.e., the spacing between tagged frames, continuously being refined. In executing a counting query, the user sees running statistics that converge to the ground truth, the percentage of frames processed, and the estimated time to complete the query. Unlike many aggregation systems (blinkdb, ; summarystore, ), we eschew providing error estimates alongside a query’s early results: doing so would require analytical models of the distribution of queried objects across video frames, which are in general difficult to obtain, to the best of our knowledge.

Online processing suits our goal: it amortizes a query’s cost over many installments of results; it pipelines query execution with human thinking (eureka, ); and a user may interactively explore a video, e.g., aborting a query, revising parameters, and issuing it anew (ola, ; eureka, ). While online query processing is known for taming large datasets (olamr, ; online-mapreduce, ), to our knowledge we are the first to apply it to videos.

Performance metrics

In the spirit of online processing, we characterize query performance as the rate of query result converging to the exact answer: (for retrieval) the rate of a user receiving positive frames; (for tagging) the refining rate of temporal resolution as observed by the user; (for counting) the rate of running statistics converging to the ground truth.

3. The DIVA Design

System overview

With DIVA, a camera ingests all video frames to its storage. From these frames, the camera samples a small fraction, called landmarks (§4), and runs generic object detection on them, e.g., with the YOLOv3 NN (yolov3, ).

Receiving a query from users, the cloud retrieves landmark frames from the camera as its initial knowledge for answering the query. Specialized to the query, the cloud trains a library of operators for the camera. Executing the operators, the camera processes frames iteratively. As a result of the processing, it selects and uploads frames. The cloud processes the uploaded frames with a generic, high-accuracy object detector (cloud oracle), and presents the query results to users. Throughout the query, the cloud keeps training operators with newly received frames. The camera processing, frame upload, and cloud processing all run in parallel.

Figure 2. According to a query’s type, DIVA executes (a) rankers or (b) filters on cameras.

Query execution

Given a query’s type, DIVA executes one of two possible operators on camera.

Retrieval and max count: rankers on camera. As shown in Figure 2(a), a ranker scans and scores all frames in the queried video; a frame’s score suggests the likelihood of the frame containing any queried object (for retrieval) or a large count of objects (for max count). The camera sorts the frames by score and uploads them in descending order. Notably, scoring, sorting, and uploading happen concurrently, as required by online query processing. Therefore, although ranking is traditionally known as a “blocking” algorithm (rank-aware-opt, ; similarity-join, ; koudas2000high, ), DIVA’s ranker is non-blocking.
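A minimal sketch of such a non-blocking ranker, interleaving scoring with uploading via a heap of ranked-but-unsent frames; the scoring function, batch size, and upload budget are stand-ins for the on-camera NN and the uplink, not DIVA's implementation.

```python
import heapq

# Non-blocking ranking sketch: scoring and uploading are interleaved, so
# the uploader always takes the best frame ranked *so far* rather than
# waiting for a full sort over all frames.
def rank_and_upload(frames, score, upload_budget):
    heap = []                 # max-heap of (-score, frame), unsent frames
    uploaded = []
    it = iter(frames)
    while len(uploaded) < upload_budget:
        # Rank a small batch, then yield the uplink the current best frame.
        for f in (next(it, None) for _ in range(4)):
            if f is not None:
                heapq.heappush(heap, (-score(f), f))
        if not heap:
            break
        uploaded.append(heapq.heappop(heap)[1])
    return uploaded

frames = list(range(20))
out = rank_and_upload(frames, score=lambda f: f % 7, upload_budget=5)
```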

Tagging: filters on camera. As shown in Figure 2(b), a filter scores frames; a frame’s score suggests the likelihood of it being positive, i.e., containing any object of the queried class. A filter is trained with a pair of numerical thresholds; for any frame scored below/above the thresholds, the filter emits a negative/positive tag; frames scored between the thresholds are deemed undecidable. The camera uploads a tag (conceptually one bit) for each decided frame; it uploads undecided frames for the cloud to decide. This differs from rankers, which upload all frames.
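The two-threshold filter described above can be sketched as follows; the threshold values and the scoring function are illustrative assumptions.

```python
# A filter with a pair of thresholds: scores below `lo` yield a negative
# tag, above `hi` a positive tag; frames in between are undecidable and
# would be uploaded whole for the cloud oracle to decide.
def apply_filter(frames, score, lo=0.2, hi=0.8):
    tags, undecided = {}, []
    for f in frames:
        s = score(f)
        if s < lo:
            tags[f] = 0          # negative tag (conceptually one bit)
        elif s > hi:
            tags[f] = 1          # positive tag
        else:
            undecided.append(f)  # whole frame goes to the cloud
    return tags, undecided

tags, undecided = apply_filter(range(10), score=lambda f: f / 10)
```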

Average/median count: random sampling on camera. The camera uploads all landmarks and then samples the remaining frames for upload. Accordingly, the cloud computes the initial statistics and keeps refining them. In particular, we avoid running lightweight object counters on camera (blazeit, ), as we observe their poor accuracy on constrained hardware.
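The counting path can be sketched as seeding a running estimate with landmark counts and refining it with randomly sampled frames; the per-frame counts below are synthetic stand-ins for cloud-oracle detections.

```python
import random
import statistics

# Sketch of the average-count path: landmarks give the initial estimate,
# random samples of the remaining frames refine it. Ground-truth counts
# are synthetic; 1-in-30 landmark sampling follows the paper's example.
random.seed(0)
truth = [random.randint(0, 5) for _ in range(3000)]   # per-frame object counts
landmarks = truth[::30]                               # regular sparse sample
seen = list(landmarks)
estimate = statistics.mean(seen)                      # initial answer to user
for idx in random.sample(range(len(truth)), 500):     # refinement uploads
    seen.append(truth[idx])
    estimate = statistics.mean(seen)                  # running statistic
```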

With simple semantics, on-camera operators require low training cost and run fast. Note that the operator implementation (§6), i.e., NN architectures, is not our contribution.


DIVA is not designed for several cases and may perform poorly on them: querying for rare objects (e.g., one occurrence in days), for which landmarks cannot provide enough training samples and other discovery techniques are needed (eureka, ); querying short video ranges, e.g., minutes, for which uploading all frames may be faster; and querying videos from non-stationary cameras, e.g., dash cams, for which long-term knowledge may be difficult to build from landmarks.

3.1. Operator upgrade

Rationale: tradeoffs in query execution

A query’s performance is determined by the interplay among three factors: (i) the choice of on-camera operators; (ii) the camera’s available upload bandwidth; and (iii) the difficulty of frames (elaborated later). Rich tradeoffs exist among the three, unlike in most prior systems, which face operator tradeoffs only (noscope, ; focus, ). At run time, DIVA observes (ii) and (iii) while controlling (i). We next show that the optimal operator choice depends not only on videos and queries (vstore, ; noscope, ; blazeit, ; filterforward, ) but also critically on query progress and network/camera resources.

Figure 3. A toy example of a retrieval query, showing that multipass ranking (bottom) outperforms running individual rankers alone (top two). The figure shows the frames in upload queues as the query proceeds. In the queues, rankers rank the unsent frames from left (front) to right (tail).

Figure 4. Two on-camera operators excel at different stages of same queries. Each bar shows a query’s progress (annotated on top) with time (x-axis). Queries for buses on 6 hours of Banff (see Table 1). Note: numbers not corresponding to the toy example in Figure 3.

Retrieval and max count queries   With a ranker that ranks frames much faster than they can be uploaded, the cloud receives the most promising frames hastily selected from many; with a slower ranker, it receives frames deliberately selected from fewer. A ranker should not run slower than uploading, which would be as bad as uploading without ranking.

Figure 3 shows a toy example of retrieval, comparing a fast ranker (OFAST) to a slow one (OSLOW). Shortly after the query starts, OFAST swiftly explores more frames by scanning the upload queue from its front (from the left in the figure; unexplored frames shown in diagonal stripes); OFAST discovers and uploads more positives. Proceeding to frames with median/low scores, OFAST makes more ranking mistakes, placing a number of positives behind negatives in the queue and delaying the query’s progress. Making fewer mistakes on such frames, OSLOW finishes returning all positives earlier than OFAST.

On an actual video, the microbenchmark in Figure 4(a) offers further quantitative evidence. Op1, compared to Op2, takes less time (0.7×) to return the first 90% of the total positives; it, however, takes more time (1.7×) to return 99% of the positives. This comparison varies with frame upload bandwidth (not shown): with 2× lower bandwidth, Op1 takes more time (2.3×) than Op2 to return the first 90% (and even 80%) of the total positives. This is because Op2 better uses the limited upload bandwidth with more positives, through deliberate ranking.

Tagging queries   On less difficult frames, e.g., ones with better lighting or clearer objects, fast filters are effective: at a lower cost per frame, they can tag a large fraction of frames. On more difficult frames, they are increasingly indecisive, wasting resources on attempts without much success.

The microbenchmark in Figure 4(b) exemplifies the tradeoff. Filter Op1 resolves less difficult frames quickly; compared to Op2, it takes 4× less time to tag 50% of all frames. It, however, cannot decide the remaining frames and must upload them for tagging. Op2, despite being slower, can filter as many as 82% of all frames (not shown); compared to Op1, it takes less time (1.3×) to tag 90% and beyond.

Key idea: operator upgrade

During one query, a camera processes frames iteratively with multiple operators. The cloud progressively updates the operators on the camera, from faster ones to slower ones; in picking operators, the cloud dynamically adapts operator speed to the camera’s upload speed. The cloud trains operators online, using frames uploaded by earlier operators to train later ones, which are more accurate and hence require more training samples. A camera thus ranks and filters frames as follows.

Multipass ranking   This is exemplified in Figure 3. Op1, the initial fast ranker, moves many positives towards the front of the upload queue. Op2, the subsequent slower ranker, continuously reorders unsent frames within a smaller range. Throughout the query, the camera first quickly uploads less difficult frames as discovered by the faster ranker, and then slows down to vet more difficult frames with the more accurate ranker. Notably, the earlier ranker (roughly) prioritizes the frames for the subsequent one to consume, ensuring the efficacy of the latter, which incurs a higher delay per frame.

Multipass filtering   The camera sifts undecided, unsent frames in multiple passes, each with an increasingly slower filter over a sample of the remaining frames. Throughout one query, early filters quickly resolve easier frames, leaving increasingly difficult ones for subsequent filters to decide.
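Multipass filtering can be sketched as repeated application of two-threshold filters over the frames earlier passes could not decide. For simplicity, this sketch models the "increasingly slower filter" of each later pass by a tighter undecided band on one shared score; in DIVA these would instead be slower, more accurate NNs.

```python
# Multipass filtering sketch: each pass tags what it can and hands the
# undecided remainder to the next, more discerning pass. Thresholds and
# the scoring function are illustrative stand-ins.
def multipass_filter(frames, score, passes):
    tags, remaining = {}, list(frames)
    for lo, hi in passes:                 # later passes: tighter undecided band
        nxt = []
        for f in remaining:
            s = score(f)
            if s < lo:
                tags[f] = 0               # negative tag
            elif s > hi:
                tags[f] = 1               # positive tag
            else:
                nxt.append(f)             # still undecided
        remaining = nxt
    return tags, remaining                # leftovers go to the cloud oracle

tags, leftover = multipass_filter(
    range(100), score=lambda f: f / 100,
    passes=[(0.1, 0.9), (0.3, 0.7), (0.45, 0.55)])
```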

Why not operator cascade?

Our operator upgrade fundamentally differs from an operator cascade, commonly used for processing a stream of frames (noscope, ; objectdetect-cascade, ; pedestrian-cascade, ): a cascade resolves all frames in one pass, attempting increasingly slower operators on each frame until one operator asserts sufficient confidence. It misses a key opportunity in online query processing: a fast operator alone can quickly produce inexact results over all frames in one pass. It also mismatches our need for online training (clickbait, ; clickbaitv2, ): slower (and more complex) operators can only be trained with samples collected by earlier operators.

3.2. Query optimization with long-term knowledge

Figure 5. Class spatial skews in videos (48 hours each).

Figure 6. Class spatial distribution converges in tens of video hours with sparse frame samples (middle figure). Video: Tucson (see Table 1 for details).

Surveillance cameras have a unique opportunity: learning object class distribution from days or weeks of videos. The opportunity is untapped in prior computer vision work, which typically studies minute-long video clips (videoedge, ; rexcam, ; saligrama12cvpr, ; zhu18cvpr, ).

The knowledge is twofold. Long-term spatial skews: within a video frame, objects of a given class are likely to concentrate in certain small regions. Long-term temporal skews: the occurrences of a given object class may exhibit temporal patterns. Note this differs from short-term temporal locality, e.g., that a minute-long video clip often contains a small set of object classes (shen17cvpr, ).

Figure 5 shows evidence. In Banff (a), over a 48-hour video, 80% and 100% of cars appear in regions that are only 19% and 57% of the whole frame, respectively. We find such skews pervasive across 15 different videos (see Table 1).

Such distributions must be learned, as object classes may exhibit different skews across videos (Figure 5(a)-(c)) or within the same video (Figure 5(c)). Fortunately, the distribution knowledge can be learned from sparse frame samples; it typically converges over tens of video hours, as exemplified by Figure 6. To establish such knowledge, DIVA builds landmarks, as described in Section 4.
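A sketch of how a spatial heatmap could be accumulated from sparse landmark detections; the grid size and the normalized (x, y, w, h) bounding-box format are our assumptions, not DIVA's data format.

```python
# Accumulate bounding-box centers of one object class into a coarse grid,
# approximating the spatial skew learned from landmarks. Boxes are
# (x, y, w, h) in normalized [0, 1] frame coordinates (an assumption).
def heatmap(boxes, grid=8):
    h = [[0] * grid for _ in range(grid)]
    for (x, y, w, bh) in boxes:
        cx, cy = x + w / 2, y + bh / 2
        h[min(int(cy * grid), grid - 1)][min(int(cx * grid), grid - 1)] += 1
    return h

# Toy landmarks: buses concentrated near the lower-left of the frame.
boxes = [(0.1, 0.7, 0.1, 0.1)] * 8 + [(0.8, 0.1, 0.05, 0.05)] * 2
hm = heatmap(boxes)
```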

Why such long-term skews? We deem the causes fundamental. Being cost-effective, surveillance cameras often cover a long timespan and a wide field of view with high-resolution images, in which individual objects appear small (filterforward, ). The objects’ occurrences are subject to social constraints (e.g., buses typically stop at traffic lights, Figure 5(a)) or physical ones (e.g., humans only appear on the floor, Figure 5(b)).

Figure 7. On-camera operators benefit from long-term video knowledge substantially. Each marker: an operator. Operators for querying buses on video Banff (see Table 1).

Key idea: exploiting long-term skews for performance

To exploit spatial skews, the cloud trains on-camera operators on a variety of input regions with different locations and sizes. The resultant operators exhibit diverse speed/accuracy tradeoffs, offering rich options for the upgrade (§3.1); as the operators omit regions that contain no or few queried objects, they also require much fewer samples to train, run faster, and deliver higher accuracy. This is shown in Figure 7. To exploit temporal skews, the cloud divides a queried video into large-grained spans; it prioritizes the spans with higher density of the queried object class.
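The temporal side of this policy can be sketched as ordering coarse video spans by the landmark-estimated density of the queried class; the span granularity and per-span counts below are illustrative.

```python
# Sketch: split the queried timespan into coarse spans and process the
# densest ones first. landmark_hits[i] is the queried-object count seen
# on landmarks within span i (synthetic values here).
def prioritize_spans(landmark_hits, n_spans):
    return sorted(range(n_spans), key=lambda i: -landmark_hits[i])

order = prioritize_spans([0, 5, 2, 9, 1, 0], n_spans=6)
# densest span (index 3, with 9 landmark hits) is processed first
```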

Alternative: joint distribution - dismissed

Why not also exploit the joint spatial/temporal distribution? For instance, one may imagine a street video where most pedestrians appear on one sidewalk in the morning and on the opposite sidewalk in the afternoon. Yet we tested this idea on all our videos and were unable to find supporting evidence.

4. Landmark frames

From ingested frames, a camera samples a small fraction called landmarks and runs object detection on them. Upon a query, the cloud retrieves all the landmarks, as low-resolution thumbnails (e.g., 100×100) accompanied by labels and bounding boxes of the detected objects. The cloud uses the landmarks for three essential purposes: (1) As estimation of object distribution: The cloud generates a heatmap for spatial distribution (see Figure 5) and a density function for temporal distribution; the cloud uses them for optimizing on-camera operators and prioritizing video timespans, as described above. Note the learned distributions are for optimization; DIVA’s correctness does not rely on them. (2) As initial training samples: The cloud bootstraps on-camera operators with landmark frames. (3) As initial estimation of statistics: The cloud provides initial results for a counting query, e.g., average, by counting the object labels on the collected landmarks.

In the large parameter space for landmark production, we make the following choices.

High accuracy   We make the ingestion’s object detector a deep NN with the highest accuracy affordable on the camera hardware. This is because object detection results on landmarks strongly impact the accuracy of trained operators and of initial statistics estimation. We will present evidence in Section 7.

Sparse sampling   To accommodate slow object detection, the camera creates landmarks at long intervals, e.g., 1 in every 30 ingested frames in our test (§7). Sparse sampling is a known approach to low-error statistics on low-frequency signals (sparsesampling, ), e.g., object counts in our case. It also provides adequate training samples for DIVA: over 5K samples in 2 days in our test. As an optimization, DIVA further tracks optical flows (opflow, ) from landmarks to adjacent frames (§6). We will show the quantitative impact of landmark intervals in Section 7.

Sampling at regular intervals   This ensures the initial answers to counting queries, computed from landmarks, are easy for users to reason about: without a prior assumption of data distribution, statistics of time series are typically estimated through regular sampling (regularsampling, ). We will demonstrate the efficacy of the above choices in Section 7.

Alternative: sampling frame regions - dismissed

One may be tempted to create landmarks by running object detectors on parts of a frame. This technique often truncates objects and degrades landmark accuracy, and it breaks object detectors’ heuristics for searching for objects in whole frames (rcnn, ).

5. Query execution planning

Figure 8. The workflow of a query’s execution.

We next present DIVA’s query execution in detail. All query types (except the much simpler average/median count; see Section 5.3) follow the same workflow, shown in Figure 8. Upon receiving a query, the cloud retrieves landmark frames from the queried camera(s). Using the landmarks, the cloud bootstraps a library of candidate operators and picks the initial one for the camera. As the camera uploads frames, the cloud processes the uploaded frames and uses them to further train the candidate operators for higher accuracy. Based on the observed quality of uploaded frames, the cloud triggers an upgrade of the on-camera operator. With the new operator, the camera continues to process the remaining frames. These steps repeat until the query aborts or completes. Throughout the query, the cloud keeps presenting refined results to the user.

DIVA’s planning of a concrete query centers on two key policies: the camera’s policy for selecting frames to process, and the cloud’s policy for upgrading on-camera operators. We next discuss them for different query types.

5.1. Executing retrieval queries

Policy for selecting frames

To discover positive frames early, DIVA exploits long-term temporal skews (§3). To execute the initial operator, the camera prioritizes fixed-length video spans (e.g., 1 hour in our prototype) that are likely rich in positive frames, as estimated from landmark frames. In executing subsequent operators, the camera processes frames in the existing ranking decided by earlier operators, as described in Section 3.1. The camera also gives opportunities to frames never ranked by prior operators, interleaving their processing with that of ranked frames with mediocre scores (around 0.5).
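The selection policy above can be sketched as follows. The data layout (`est_positives`, `score` fields) and the 0.7 confidence cutoff are illustrative assumptions, not DIVA's actual implementation.

```python
# Sketch of the retrieval frame-selection policy: prioritize spans rich
# in positives, then interleave never-ranked frames with mediocre-scored
# ones. Field names and the 0.7 cutoff are illustrative.

def order_spans_by_landmark_density(spans):
    """Prioritize fixed-length video spans estimated (from landmarks)
    to be rich in positive frames."""
    return sorted(spans, key=lambda s: s["est_positives"], reverse=True)

def interleave_unranked(ranked, unranked):
    """Process confidently-ranked frames first, then interleave frames
    never ranked by prior operators with mediocre-scored ones."""
    confident = [f for f in ranked if f["score"] > 0.7]
    mediocre = [f for f in ranked if f["score"] <= 0.7]
    mixed = []
    for i, f in enumerate(mediocre):
        mixed.append(f)
        if i < len(unranked):
            mixed.append(unranked[i])
    mixed += unranked[len(mediocre):]  # leftover never-ranked frames
    return confident + mixed
```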

Policy for operator upgrade

As discussed in Section 3, DIVA switches from faster operators to slower ones, matching operator speed to the frame upload rate. To capture an operator’s speed relative to upload, it uses one simple metric: the ratio between the operator’s processing speed s_op and the upload speed s_up, i.e., r = s_op / s_up. Operators with higher r tend to rapidly explore frames, while others tend to slowly exploit. For each candidate, the cloud measures s_op through profiling; it obtains s_up online, at the time of upgrade, as reported by the camera.

(1) Selecting the initial operator   In general, DIVA should fully utilize the upload bandwidth with positive frames. As positive frames are initially scattered throughout the queried video, the camera should explore all frames sufficiently fast; otherwise, it would either starve the uplink or knowingly upload negative frames. Based on this idea, the cloud picks the most accurate operator among the ones that are fast enough, i.e., r ≥ 1/p, where p is the ratio of positives in the queried video, estimated from landmarks.

(2) When to upgrade: current operator losing its vigor   The cloud upgrades operators either when the current operator finishes processing all frames, or when the cloud observes a continuous quality decline in recently uploaded frames, an indicator of the current operator’s incapability. To decide the latter, DIVA employs a simple rule: the positive ratio over the last N₁ uploaded frames is lower than that over the previous N₂ frames; the current prototype sets N₁ and N₂ empirically.

(3) Selecting the next operator: slow down exponentially   Since the initial operator promotes many positives toward the front of the upload queue, subsequent operators, scanning from the queue front, likely operate on a larger fraction of positives. Accordingly, the cloud picks the most accurate operator among much slower ones, i.e., those with r ≤ r_cur / γ, where γ > 1 controls the speed decay across subsequent operators. A larger γ leads to more aggressive upgrades: losing more speed for higher accuracy. The current prototype chooses γ empirically. Since r is defined relative to the upload speed measured at every upgrade, the upgrade adapts to network bandwidth changes even during a query.
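The upgrade policy for retrieval can be sketched as below. The symbol names and all constants (the N₁/N₂ window sizes, γ) are illustrative placeholders, not the prototype's empirical values.

```python
# Sketch of the retrieval upgrade policy. n1/n2 and gamma are
# illustrative placeholders, not the prototype's empirical settings.

def relative_speed(op_speed, upload_speed):
    """r: how fast an operator processes frames relative to upload."""
    return op_speed / upload_speed

def pick_initial(candidates, positive_ratio, upload_speed):
    """Most accurate operator still fast enough to keep the uplink
    busy with positives: r >= 1/p."""
    fast_enough = [op for op in candidates
                   if relative_speed(op["speed"], upload_speed)
                   >= 1.0 / positive_ratio]
    return max(fast_enough, key=lambda op: op["accuracy"])

def should_upgrade(recent_labels, n1=20, n2=100):
    """Trigger when the positive ratio over the last n1 uploads drops
    below that over the previous n2 uploads."""
    if len(recent_labels) < n1 + n2:
        return False
    last, prev = recent_labels[-n1:], recent_labels[-(n1 + n2):-n1]
    return sum(last) / n1 < sum(prev) / n2

def pick_next(candidates, current_r, upload_speed, gamma=2.0):
    """Slow down exponentially: most accurate operator with
    r <= current_r / gamma."""
    slower = [op for op in candidates
              if relative_speed(op["speed"], upload_speed)
              <= current_r / gamma]
    return max(slower, key=lambda op: op["accuracy"])
```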

5.2. Executing tagging queries

Recall that for tagging, a camera runs multipass filtering; the objective of each pass is to tag, as positive or negative, at least one frame from every K adjacent frames. We call K the group size; DIVA pre-defines a sequence of group sizes as refinement levels, e.g., K = 30, 10, …, 1. As in prior work (noscope, ; focus, ; blazeit, ), the user specifies a tolerable error as part of her tagging query, e.g., 1% false negatives and 1% false positives; DIVA trains filters with thresholds that meet this accuracy.

Policy for selecting frames

The goal is to quickly discover easy frames (in individual groups) to tag while dynamically balancing the workload between the camera and the frame upload (for the cloud to tag). Algorithm 1 shows how one operator F works in each pass, which consists of two stages. i) Rapid attempting. F scans all the groups, attempting one frame per group; on success it moves to the next group; otherwise, it adds the undecidable frame to the upload queue Q (lines 3–10). ii) Work stealing. F seeks to steal work from the end of the upload queue. For an undecidable frame f belonging to a group g, F attempts other untagged frames in g; once it succeeds, it removes f from the upload queue, as f no longer needs to be tagged in the current pass (lines 11–16). After completing one pass, the camera switches to the next refinement level (e.g., K = 10 → 5). It keeps all the tagging results (positive, negative, and undecidable tags) while canceling all pending uploads to release bandwidth for the next refinement, and then executes the frame scheduling algorithm again.

input :  F – operator;  K ∈ {30, …, 1} – group size;
         P/N/U – positive/negative/undecidable tags
output:  P, N, U (updated)
1  initialize upload queue Q
2  divide the query time range into groups of K adjacent frames
3  for each group g do                        // rapid attempting
4      if any frame in g is tagged P/N then  continue ;
5      else if any frame in g is tagged U then  add that frame to Q ;
6      else
7          pick a random frame f from g
8          process and tag f with F
9          if f is tagged U then  add f to Q ;
10 end for
11 while Q is not empty do                    // work stealing
12     f ← Q.tail();  g ← enclosing group of f
13     f′ ← an untagged frame within g
14     process and tag f′ with F
15     if f′ is tagged P/N then  remove f from Q ;
16 end while
Algorithm 1: frame scheduling for each tagging pass
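For concreteness, here is a runnable Python rendering of the tagging pass. `operator_fn` stands in for an on-camera NN filter (returning a positive, negative, or undecidable tag), and the random frame choice is simplified to picking the first untried frame; queue draining by actual uploads is omitted.

```python
from collections import deque

# Runnable sketch of the per-pass frame scheduling. `operator_fn`
# returns "P", "N", or "U" (undecidable); a real operator is an NN.

def tagging_pass(frames, k, operator_fn, tags):
    """One pass: tag at least one frame per group of k adjacent frames.
    `tags` maps frame -> "P"/"N"/"U" and is updated in place."""
    queue = deque()
    groups = [frames[i:i + k] for i in range(0, len(frames), k)]
    for group in groups:                      # rapid attempting
        if any(tags.get(f) in ("P", "N") for f in group):
            continue
        undecided = [f for f in group if tags.get(f) == "U"]
        if undecided:
            queue.append(undecided[0])
            continue
        f = group[0]                          # random in the paper
        tags[f] = operator_fn(f)
        if tags[f] == "U":
            queue.append(f)
    while queue:                              # work stealing
        f = queue[-1]                         # peek the queue tail
        group = next(g for g in groups if f in g)
        untagged = [x for x in group if x not in tags]
        if not untagged:
            queue.pop()                       # here we drop f; the real
            continue                          # system uploads it instead
        x = untagged[0]
        tags[x] = operator_fn(x)
        if tags[x] in ("P", "N"):
            queue.pop()                       # f no longer needs upload
    return tags
```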

Policy for operator upgrade

Given an operator F and q, the ratio of frames that F can successfully tag, DIVA measures F’s efficiency by its effective tagging rate E(F) = q · s_F + s_up: the sum of F’s successful tagging rate (its processing speed s_F scaled by q) and the upload rate s_up. As part of operator training, the cloud estimates q for all candidate operators by testing them on all landmarks (early in a query) and on uploaded frames (later in the query).

To select every operator, initial or subsequent, the cloud picks the candidate with the highest effective tagging rate. The cloud upgrades operators either when the current operator has attempted all remaining frames, or when the cloud finds another candidate whose effective tagging rate is at least β× that of the current operator; our prototype sets β empirically. To decide the latter, the cloud periodically trains and tests all candidates on recently uploaded frames.
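A minimal sketch of this selection policy follows. The exact form of the effective tagging rate and the β value are our illustrative reconstructions, not the prototype's measured settings.

```python
# Sketch of operator selection by effective tagging rate. The formula
# E = q * op_speed + upload_speed and beta are illustrative.

def effective_tagging_rate(tag_ratio, op_speed, upload_speed):
    """Frames resolved per second: those the operator tags itself plus
    those shipped to the cloud oracle over the uplink."""
    return tag_ratio * op_speed + upload_speed

def select_operator(candidates, upload_speed):
    """Pick the candidate with the highest effective tagging rate."""
    return max(candidates,
               key=lambda op: effective_tagging_rate(
                   op["tag_ratio"], op["speed"], upload_speed))

def should_upgrade(current_rate, best_rate, beta=1.2):
    """Upgrade when another candidate is better by at least beta."""
    return best_rate >= beta * current_rate
```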

5.3. Executing counting queries

Max Count

DIVA seeks to quickly discover frames with higher scores, i.e., containing more objects of the given class.

Policy for selecting frames

As a query proceeds, the camera randomly selects frames to upload, avoiding the worst case in which the max resides at the end of the queried range.

Policy for operator upgrade

As the camera runs rankers, the policy is similar to that for retrieval, with a subtle yet essential difference. To determine whether the current operator shall be replaced, the cloud must assess the quality of recently uploaded frames. While for retrieval DIVA conveniently measures quality as the ratio of positive frames, this metric does not apply to max count, which seeks to discover higher-scored frames. Hence, DIVA adopts the Manhattan distance between two permutations as a quality metric: the ranking of the uploaded frames as produced by the on-camera operator, and the ranking re-computed by the cloud oracle. A higher distance indicates worse quality and hence more urgency to upgrade.
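This quality metric can be sketched as below (the Spearman-footrule form of the Manhattan distance over rankings); the frame identifiers are illustrative.

```python
# Sketch of the ranking-quality metric: Manhattan (Spearman footrule)
# distance between the on-camera ranking and the cloud oracle's
# re-computed ranking of the same uploaded frames.

def ranking_distance(camera_order, cloud_order):
    """Sum over frames of |position under camera - position under
    cloud|; larger distance => worse ranking => more urgent upgrade."""
    cloud_pos = {frame: i for i, frame in enumerate(cloud_order)}
    return sum(abs(i - cloud_pos[f]) for i, f in enumerate(camera_order))

assert ranking_distance(["a", "b", "c"], ["a", "b", "c"]) == 0
assert ranking_distance(["a", "b", "c"], ["c", "b", "a"]) == 4
```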

Average/Median Count - no on-camera operators

After the initial upload of landmarks, the camera randomly samples frames in the queried videos, uploading them for the cloud to refine the average and median statistics. To avoid introducing sampling bias, the camera does not run any computation to prioritize frames; it instead relies on the Law of Large Numbers (LLN) (encyclopaediaofmathematics, ) to approach the average/median ground truth through continuous sampling. The cloud reports the runtime error as frames are continuously received. Our evaluation in Section 7 shows that the statistics typically converge in as little as a few seconds.
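A minimal sketch of this sampling loop on synthetic per-frame counts; sample sizes and data are illustrative.

```python
import random
import statistics

# Sketch of the avg/median counting loop: uniformly sample frames and
# refine running statistics; by the LLN the estimates approach the
# ground truth. Counts here are synthetic.

def refine_statistics(per_frame_counts, n_samples, seed=0):
    """Uniformly sample n_samples frame counts and return the running
    (mean, median) estimates."""
    rng = random.Random(seed)
    samples = [rng.choice(per_frame_counts) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.median(samples)

counts = [0, 1, 2, 3, 4] * 2000          # true mean 2, median 2
avg, med = refine_statistics(counts, n_samples=5000)
assert abs(avg - 2.0) < 0.2 and abs(med - 2) <= 1
```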

6. Implementation

DIVA consists of runtimes for the cloud and cameras. We build the cloud runtime atop Tensorflow 1.13 (tensorflow, ) and Keras 2.2.4 (keras, ) for NN operators. We build the camera runtime targeting Arm-based devices (see Table 2); it executes operators accelerated by the Arm NN SDK (armnn, ) and runs its ingestion object detection (YOLO NNs, see Table 3) with NNPACK-accelerated darknet (nnpack-darknet, ). Both runtimes use OpenCV 3.3 (opencv, ) for image processing, e.g., resizing and cropping. We borrow techniques from VStore (vstore, ) to ensure that on-camera operators are not bottlenecked by camera storage or video decoding.


To architect a library of on-camera operators, we take a known technique (noscope, ): creating multiple variants of AlexNet (alexnet, ), a classic NN with 5 convolutional layers. We vary the number of convolutional layers (2–5), the convolution kernel sizes (8, 16, 32), and the last dense layer’s size (16, 32, 64). We further vary the input image size (25×25, 50×50, 100×100). As described in §3.2, DIVA’s cloud runtime exploits the spatial skews of object classes: based on heatmaps, it carves out image regions for operators to consume, identifying the smallest region that covers a given fraction (e.g., 95%) of object occurrences. This is an instance of the general problem of finding the smallest rectangle enclosing a given number of points, for which we employ a k-enclosing algorithm (kenclosing, ). Together, these techniques produce a library of operator architectures; to keep training tractable, we empirically select 40 of them for DIVA to train online; we attempted more but saw diminishing returns.
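For concreteness, the operator search space above can be enumerated as follows. This sketch lists architecture configurations only; actual model construction and training (Keras in our prototype) are omitted.

```python
import itertools

# Sketch: enumerating the on-camera operator search space (AlexNet
# variants). Configuration only; model building/training is omitted.

def operator_library():
    conv_layers = [2, 3, 4, 5]       # number of convolutional layers
    kernel_sizes = [8, 16, 32]       # convolution kernel sizes
    dense_sizes = [16, 32, 64]       # last dense layer's size
    input_sizes = [25, 50, 100]      # square input image sizes
    return [
        {"conv_layers": c, "kernel": k, "dense": d, "input": (s, s)}
        for c, k, d, s in itertools.product(
            conv_layers, kernel_sizes, dense_sizes, input_sizes)
    ]

lib = operator_library()
assert len(lib) == 4 * 3 * 3 * 3  # 108 variants; DIVA trains ~40 online
```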

Optimization with optical flow

The camera runtime tracks optical flow (opflow, ) at query time from landmarks to adjacent frames, i.e., warping detected objects on landmarks into adjacent frames until the objects disappear from the camera view. Since tracking optical flow is cheap, the camera acquires additional labeled objects at low amortized cost.

Background subtraction - dismissed

We tested background subtraction (bng-sub, ), a low-cost technique for extracting frame regions that contain only moving objects. We find that it produces too many different input regions, making it difficult to train a library of operators.

7. Evaluation

We seek to answer the following questions:


  • §7.2: Does DIVA achieve good performance?

  • §7.3 & §7.4: How much do DIVA’s key designs matter?

  • §7.5: Does DIVA adapt to different hardware resources?

7.1. Methodology

Type  Name                      Object   Description
T     JacksonH (JacksonH, )     car      A busy intersection in Jackson Hole, WY
T     JacksonT (JacksonT, )     car      A night street in Jackson Hole, WY
T     Banff (Banff, )           bus      A cross-road in Banff, Alberta, Canada
T     Mierlo (Mierlo, )         truck    A rail crossing in the Netherlands
T     Miami (Miami, )           car      A cross-road in Miami Beach, FL
T     Ashland (Ashland, )       train    A level crossing in Ashland, VA
T     Shibuya (Shibuya, )       bus      An intersection in Shibuya (渋谷), Japan
O     Chaweng (Chaweng, )       bicycle  Absolut Ice Bar (outside) in Thailand
O     Lausanne (Lausanne, )     car      A pedestrian plaza in Lausanne, Switzerland
O     Venice (Venice, )         person   A waterfront walkway in Venice, Italy
O     Oxford (Oxford, )         bus      A street beside the Oxford Martin School, UK
O     Whitebay (Whitebay, )     person   A beach in the Virgin Islands
I     CoralReef (CoralReef, )   person   An aquarium video from CA
I     BoatHouse (BoatHouse, )   person   A retail store in Jackson Hole, WY
W     Eagle (Eagle, )           eagle    A tree with an eagle nest in FL
Table 1. The 15 videos used in our test. Column 1: video type. T – traffic; O/I – outdoor/indoor surveillance; W – wildlife. We set the ingestion format of all videos to 720p at 1 FPS (focus, ).

Videos & Queries

We test on 15 videos captured from 15 live camera feeds, as summarized in Table 1. Each video lasts 48 continuous hours; together they constitute 720 hours and 350 GB. We test retrieval/tagging/counting on 6/6/3 videos, respectively. We intentionally choose videos with disparate characteristics and hence different degrees of difficulty. For instance, Whitebay is captured by a close-up camera and contains clear, large persons; Venice is captured from a high camera view and hence contains blurry, small persons. For each video, we pick a representative object class to query; across videos, these classes are diverse. Every query under test covers a whole video (48 hours), except that each counting query covers 6 hours, a length used for counting in prior work (blazeit, ).

Table 2. Hardware platforms in test.

Test platform & parameters

As summarized in Table 2, we test DIVA’s camera execution on two popular embedded boards with hardware similar to that of low-cost cameras (cheapcam, ; wyzecam, ). We use RPi3 as the default camera hardware and report its measurements unless stated otherwise. We test DIVA’s cloud execution on a commodity x86 server with a modern NVIDIA GPU. We control the network bandwidth between the two, setting the default to 1 MB/s based on prior work (mpdash, ).

We run YOLOv3, a deep NN, as both the camera’s ingestion object detector and the cloud oracle, as listed in Table 3. For RPi3, we partition the model into three stages so that each stage fits into RPi3’s memory separately; running YOLOv3, RPi3 produces one landmark frame every 30 seconds. We will study the impact of alternative models, landmark intervals, and camera/network resources (§7.4 and §7.5).

             Cam: Ingest           Cam: Query              Cloud: Query
CloudOnly    –                     Only upload frames      Yv3 on all uploaded frames
OptOp        Yv3 every 30 frames   Run one optimal op      Yv3 on all uploaded frames
PreIndexAll  YTiny on all frames   Parse YTiny results     Yv3 on all uploaded frames
DIVA         Yv3 every 30 frames   Multiple passes & ops   Yv3 on all uploaded frames
Table 3. DIVA and the alternatives under test. The table summarizes their ingestion-time and query-time executions. NN models: Yv3 – YOLOv3, a high-accuracy NN (mAP = 57.9 on the COCO dataset); YTiny – YOLOv3-tiny, a fast but low-accuracy NN (mAP = 33.1). Both are state of the art.


We experimentally compare DIVA to the following alternatives, as summarized in Table 3.


  • CloudOnly, a naive design that uploads all queried frames at query time for the cloud to process.

  • OptOp: in the spirit of NoScope (noscope, ), the camera runs only one operator (a ranker or filter) specialized for a given query. The operator is selected based on a cost model for minimizing full-query delay (noscope, ). To make OptOp competitive, we augment it with landmark frames (our technique) which reduce the operator training cost. Compared to DIVA, OptOp’s key differences are the lack of operator upgrade and the lack of operator optimization by long-term video knowledge.

  • PreIndexAll: in the spirit of Focus (focus, ), the camera runs a cheap yet generic object detector on each frame at ingestion time. We pick the detector to be YOLOv3-tiny, a popular NN model that can run at 1 FPS (the frame rate of our test videos) on RPi3’s wimpy hardware. At query time, the camera ranks or filters frames by processing the existing object labels (i.e., indexes) without processing actual images. Compared to DIVA, PreIndexAll’s key difference is that it answers queries solely based on the cheap indexes built at ingestion time; in return, it requires no operator training at query time.

7.2. End-to-End Performance

Figures 9 and 10 show the performance of the three query types.

Figure 9. For retrieval (top) and tagging (bottom) queries, DIVA shows good performance and outperforms the alternatives. In all plots, x-axis: query delay (secs); y-axis for the top: % of retrieved instances; y-axis for the bottom: refinement level (1/N frames). Each query covers a 48-hour video. Red triangles: DIVA upgrading operators.
Figure 10. For counting queries, DIVA shows good performance and outperforms the alternatives. See Figure 9 for the legend. All x-axis: query delay (secs); all left y-axis: count; top two right y-axis: ground truth statistics for avg/median queries; bottom right y-axis: % of max value. Each query covers a 6-hour video. Red triangles represent when DIVA upgrades operators.

Full query delay

We examine the following metrics: for retrieval queries, the time for the user to receive 99% of positive instances (distinct target objects), as in prior work (focus, ); for tagging, the time taken to tag every frame; for counting, the time for the reported statistics to reach the ground truth (max) or converge to within 1% error of the ground truth (avg/median). Overall, DIVA delivers good performance and outperforms the alternatives significantly.


  • Retrieval (Figure 9(a)). On videos each lasting 48 hours, DIVA spends 1,700 seconds on average, i.e., 103× video realtime. On average, DIVA’s delay is 11.2×, 9×, and 4.2× shorter than CloudOnly, PreIndexAll, and OptOp, respectively.

  • Tagging (Figure 9(b)). DIVA spends 1,080 seconds on average, i.e., 160× realtime. This delay is 27.1×, 5.1×, and 4.8× shorter than CloudOnly, PreIndexAll, and OptOp, respectively.

  • Counting (Figure 10). On videos each lasting 6 hours, DIVA’s average/median values take only several seconds to converge. For average count, DIVA’s delay is 16.7× and up to three orders of magnitude shorter than CloudOnly and PreIndexAll, respectively. For median count, DIVA’s delay is two orders of magnitude shorter than the other two alternatives. For max count, DIVA spends 55 seconds on average, running at 393× realtime. On average, the delay is 9.9×, 5.5×, and 2.1× shorter than CloudOnly, PreIndexAll, and OptOp, respectively.

Query progress

For most of a query’s duration, DIVA makes much faster progress than the alternatives. For instance, it always outperforms CloudOnly and OptOp during retrieval and tagging queries (Figure 9), and always outperforms all alternatives in executing median/average count (Figure 10).

Why does DIVA outperform the alternatives?

The alternatives suffer from lacking DIVA’s key designs. Inaccurate indexes hurt. Seeking to index all frames at ingestion, PreIndexAll resorts to inaccurate indexes (2× lower mAP than YOLOv3). Misled by them, retrieval and tagging upload too many useless frames, and counting includes significant errors in the initial estimation, slowing down convergence. Lack of long-term video knowledge hurts. OptOp’s on-camera operators are much slower and less accurate than DIVA’s, as illustrated in Figure 7. One operator does not fit an entire query. Despite being optimal at some point (e.g., 99% retrieval), a single operator spends too much time processing and resolving easy frames, without taking advantage of faster operators.

Why does DIVA (occasionally) underperform?

On a few videos and for some short periods, DIVA is overtaken by small margins. Specifically, DIVA may underperform PreIndexAll at early stages of a query, e.g., BoatHouse in Figure 9, because i) PreIndexAll’s indexes, despite being inaccurate, have a good chance of being correct on easy frames, e.g., when filtering or ranking them; and ii) PreIndexAll does not pay for operator bootstrapping as DIVA does. Nevertheless, PreIndexAll’s advantage is transient: as the query proceeds and easy frames are exhausted, the inaccurate indexes on the remaining frames show more mistakes and hence slow down the query.

Training & shipping operators

For each query, DIVA trains 40 operators, of which 10 are on the Pareto frontier; the camera switches among 4–8 operators (annotated in Figures 9 and 10), which run at diverse speeds (27–1,000× realtime) and accuracies. We observe that DIVA chooses very different operators for different queries. While NN training is known to be compute-intensive and may take days, DIVA’s online training is inexpensive. Thanks to the on-camera operators’ simple structures (§6), training one typically takes 5–45 seconds on our test platform and requires 5K frames (for bootstrapping) to 15K frames (for stable accuracy). The size of an operator ranges from 0.2–15 MB; shipping one to a camera takes less than ten seconds. DIVA hides much of these delays by overlapping them with query execution; the unhidden bootstrapping delay (40 seconds) is amortized over the entire query execution.

7.3. Validation of query execution design

Figure 11. Both of DIVA’s key techniques – optimization with long-term video knowledge (opt) and operator upgrade (upgrade) – contribute significantly to performance.

The experiments above show DIVA’s substantial advantage over OptOp, coming from a combination of two techniques – optimizing queries with long-term video knowledge (“Long-term opt”, §3.2) and operator upgrade (“Upgrade”, §3.1). We next break down the advantage by incrementally disabling the two techniques in DIVA. Figure 11 shows the results.

Both techniques contribute significantly to performance. For instance, disabling Upgrade increases the delay of retrieving 90% of instances by 2× and that of tagging 1/1 frames by 2–3×. Further disabling Long-term Opt increases the delay of retrieval by 1.3–2.1× and that of tagging by 1.6–3.1×. With both techniques disabled, DIVA, running a single non-optimized operator, still outperforms CloudOnly.

Upgrade’s benefit is universal; Long-term Opt’s benefit depends more on the query, i.e., on the skews of the queried object class in the video. For instance, DIVA’s benefit is more pronounced on Chaweng, where small bicycles only appear in a region 1/8 the size of the entire frame, than on Ashland, where large trains occupy 4/5 of the frame. With the stronger skews in the former video, DIVA trains operators that run faster and are more accurate. This also accounts for DIVA’s varying (yet substantial) advantages over the alternatives in the end-to-end results (Figure 9).

7.4. Validation of landmark design

(a) DIVA’s performance degrades significantly with less accurate landmarks (produced by Yv2 and YTiny), which can be even worse than building no landmarks at all (“w/o LM”).
(b) Landmark intervals longer than the default (>30) degrade DIVA’s performance slowly. With shorter intervals (<30), DIVA’s performance advantage diminishes; this, however, only happens on expensive cameras. The y-axis is logarithmic.
(c) On given camera hardware (RPi3/Odroid), sparser yet more accurate landmarks (from slower object detection NNs) always improve DIVA’s performance. NN compute is on the x-axis, with names annotated above; landmark intervals are annotated along the curves.
Figure 12. Validation of DIVA’s landmark design. In subfigures: Q1 (left): retrieval on Chaweng; Q2 (right): tagging on JacksonH.

Next, we deviate from the default landmark parameters (Table 3) to validate our design of sparse-but-sure landmarks.

DIVA hinges on accurate landmarks.

As shown in Figure 12(a), modestly inaccurate landmarks (produced by YOLOv2; 48.1 mAP) increase delays for Q1/Q2 by 45% and 17%. Even less accurate landmarks (by YOLOv3-tiny; 33.1 mAP) increase the delays significantly, by 5.3× and 4.3×. Perhaps surprisingly, such inaccurate landmarks can be worse than no landmarks at all (“w/o LM” in Figure 12), where a camera, when a query starts, randomly uploads unlabeled frames for the cloud to bootstrap operators. For counting average/median (not shown in the figure), less accurate landmarks produced by YOLOv2 or YOLOv3-tiny slow down the queries by two orders of magnitude; both are worse than “w/o LM”, which starts queries with no initial estimation.

Why do inaccurate landmarks hurt so much? They i) provide wrong training samples for on-camera operators; ii) lead to incorrect observations of spatial skews, which further mislead frame cropping; and iii) introduce large errors into initial statistics, making convergence harder.

DIVA tolerates longer landmark intervals.

As shown in Figure 12(b), as the interval grows, DIVA’s retrieval and tagging performance degrades slowly. Even with an infinite interval, i.e., “w/o LM” in Figure 12(a), DIVA sees a slowdown of no more than 3×. On counting, the degradation is more pronounced: 5× longer intervals lead to around 15× slowdown. Yet, such degradation is still much smaller than that from inaccurate landmarks (two orders of magnitude).

Why do longer landmark intervals hurt much less? Their major impact is that DIVA has to upload additional frames in full resolution (10× larger than landmarks) when a query starts, e.g., for bootstrapping operators; such a one-time cost is amortized over a query that may last tens of minutes.

Create the most accurate landmarks possible

For a given camera, should DIVA build denser yet less accurate landmarks, or sparser yet more accurate ones? The results in Figure 12(c) suggest the latter is always preferred, because of DIVA’s high sensitivity to landmark accuracy and low sensitivity to long landmark intervals, as described above.

DIVA’s applicability to wimpy/brawny cameras

DIVA is friendly to wimpy cameras that can only generate sparse landmarks, e.g., due to slow CPUs or tight energy budgets. Extremely resource-frugal cameras may have DRAM smaller than the memory footprint of a high-accuracy NN (e.g., 237 MB for YOLOv3); fortunately, recent orthogonal efforts reduce NN memory footprints (split-NN, ). On these cameras, the alternatives will be further disadvantaged, e.g., PreIndexAll will see even less accurate indexes. On higher-end cameras that can afford more computation at ingestion, DIVA’s advantages diminish. i) These cameras may run PreIndexAll with improved index accuracy: as Figure 12(a) shows, against YOLOv2 run on all ingested frames (PreIndexAll+Yv2), DIVA’s performance gain is only 1.9× (left) or even 0.6× (right). ii) These cameras may simply generate denser landmarks and upload all other frames upon a query: as shown in Figure 12(b), when a camera generates one landmark every 5 frames, DIVA’s advantage shrinks to 1.5×. However, such high-end cameras are far more expensive, e.g., running YOLOv2 at 1 FPS requires hardware like an NVIDIA Jetson ($600) (jetsoncam, ).

7.5. Adaptation to hardware resources

Figure 13. DIVA’s adaptation brings a noticeable performance benefit. (a) Average performance on retrieving 50%/80%/99% of instances on Chaweng/Oxford. (b) Average performance on tagging one in every 10/3/1 frames on JacksonT/BoatHouse. DIVA, even without adaptation, outperforms the alternatives significantly.

DIVA benefits from its adaptive selection of on-camera operators. To demonstrate this, we test DIVA’s performance under several camera/network resource configurations and compare it to fixing the operators to the ones picked by DIVA under the default resources. As shown in Figure 13, the adaptation speeds up retrieval and tagging queries by up to 37% and 30%, respectively. We observe that under higher network bandwidth (2 MB/s), DIVA automatically picks a series of faster operators, starting from one with one fewer convolutional layer and 2× smaller kernels; on a faster camera (Odroid), DIVA picks a series of slower operators, starting from one with one extra convolutional layer and 2× more parameters in the dense layers.

8. Related Work

Optimizing video analytics

The CV community has studied video analytics for decades, and we borrow many of its building blocks, e.g., online training (clickbait, ; clickbaitv2, ) and active learning (activelearning, ). However, these techniques alone cannot address the challenges in video query systems, e.g., overcoming network bandwidth limits or scheduling frames for processing. They mostly focus on improving analytics accuracy for short videos (saligrama12cvpr, ; liu17cvpr, ; zhu18cvpr, ; kang16cvpr, ; secs, ; shen17cvpr, ) while missing opportunities to exploit long-term knowledge (§3).

Recent work builds video analytics systems for various scenarios. We discussed NoScope (noscope, ) and Focus (focus, ) in Section 1 and compared against them experimentally in Section 7. A common theme is to trade accuracy for lower cost: VStore (vstore, ) does so for video storage; Pakha et al. (pakha18hotcloud, ) do so for network transport; Chameleon (chameleon, ) and VideoStorm (videostorm, ; videoedge, ) do so with video formats. Like them, DIVA exploits the accuracy/cost tradeoffs in operators while contributing new mechanisms. Video analytics can also be optimized by exploiting sharing opportunities: ReXCam (rexcam, ) exploits spatial/temporal locality among co-located cameras for performance; Mainstream (mainstream, ) exploits common DNN computations shared among concurrent queries. Orthogonal to them, DIVA focuses on querying individual cameras (Section 2.2). Multiple systems target archival video analytics on servers (scanner, ; deeplens, ; vstore, ; blazeit, ). Compared to them, DIVA analyzes archival videos stored on remote cameras and therefore embraces new techniques, e.g., operator upgrade.

Edge video analytics

To save cloud/edge network bandwidth, partitioning analytics is popular, including partitioning between cloud/edge (lavea, ; deepdecision, ; filterforward, ), edge/drone (wang18sec, ), and edge/camera (vigil, ). Most of this work targets live analytics, therefore processing frames in a streaming fashion and/or training operators ahead of time. DIVA also spreads computation between the cloud and cameras, but takes a disparate design point (zero-streaming cameras), for which the prior systems are inadequate, as discussed in Section 1.

Online Query Processing

Dating back to the 1990s, online query processing lets users see early results and interactively control query execution (ola, ; control, ). It has proven effective for large-scale data analytics such as MapReduce (olamr, ). DIVA retrofits this idea for a variety of video queries and accordingly contributes new execution techniques, e.g., operator upgrade, to support the online fashion. DIVA could further borrow UI designs from existing online query engines.

WAN Analytics

To support queries over geo-distributed data, recent work has proposed optimizations ranging from query placement to data placement (wanalytics, ; vulimiri15nsdi, ; clarinet, ; iridium, ; lube, ). Notably, JetStream (jetstream, ) adjusts data quality to meet network bandwidth; AWStream (awstream, ) helps apps systematically trade analytics accuracy for network bandwidth. Like them, DIVA adapts to network bandwidth; unlike them, DIVA does so by changing its operator upgrade plan, a unique aspect of video analytics. Furthermore, DIVA targets resource-constrained cameras, which prior WAN analytics systems do not address.

9. Conclusions

DIVA is an analytics engine for large videos distributed on low-cost cameras. We exploit the zero-streaming paradigm and minimize the ingestion cost, shifting as much work as possible to query execution. At ingestion time, DIVA runs generic NNs to obtain sparse but sure landmarks; at query time, it continuously refines the query result by proactively updating the camera NNs. Our evaluation on three types of queries shows that DIVA runs at more than 100× video realtime under typical wireless networks and low-cost camera hardware.


  • (1) Wireless cameras slowing router too much, 2015.
  • (2) Understanding IP surveillance camera bandwidth, 2017.
  • (3) The zettabyte era: Trends and analysis, 2017.
  • (4) 3.4 MP NVIDIA Jetson TX2/TX1 camera board, 2018.
  • (5) Hands-on NVIDIA Jetson TX2: Fast processing for embedded devices, 2018.
  • (6) Price history of 32GB Samsung SD card, 2018.
  • (7) Running YOLO detection on Raspberry Pi, 2018.
  • (8) WiFi cameras, 2018.
  • (9) Arm NN ML software, 2019.
  • (10) Background subtraction, 2019.
  • (11) HiSilicon IP camera specifications, 2019.
  • (12) NNPACK-accelerated Darknet, 2019.
  • (13) NVIDIA Jetson cameras, 2019.
  • (14) OpenCV 3.3, 2019.
  • (15) Wyze camera specifications, 2019.
  • (16) YouTube live streaming: Ashland, 2019.
  • (17) YouTube live streaming: Banff, 2019.
  • (18) YouTube live streaming: Boathouse, 2019.
  • (19) YouTube live streaming: Chaweng, 2019.
  • (20) YouTube live streaming: Coralreef, 2019.
  • (21) YouTube live streaming: Eagle, 2019.
  • (22) YouTube live streaming: Jackson Hole, 2019.
  • (23) YouTube live streaming: Jackson Town, 2019.
  • (24) YouTube live streaming: Lausanne, 2019.
  • (25) YouTube live streaming: Miami, 2019.
  • (26) YouTube live streaming: Mierlo, 2019.
  • (27) YouTube live streaming: Oxford, 2019.
  • (28) YouTube live streaming: Shibuya, 2019.
  • (29) YouTube live streaming: Venice, 2019.
  • (30) YouTube live streaming: Whitebay, 2019.
  • (31) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (Savannah, GA, 2016), USENIX Association, pp. 265–283.
  • (32) Agarwal, S., Mozafari, B., Panda, A., Milner, H., Madden, S., and Stoica, I. Blinkdb: Queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems (New York, NY, USA, 2013), EuroSys ’13, ACM, pp. 29–42.
  • (33) Agrawal, N., and Vulimiri, A. Low-latency analytics on colossal data streams with summarystore. In Proceedings of the 26th Symposium on Operating Systems Principles (New York, NY, USA, 2017), SOSP ’17, ACM, pp. 647–664.
  • (34) Augustin, A., Yi, J., Clausen, T., and Townsley, W. A study of lora: Long range & low power networks for the internet of things. Sensors 16, 9 (2016), 1466.
  • (35) Blu, T., Dragotti, P., Vetterli, M., Marziliano, P., and Coulot, L. Sparse sampling of signal innovations. IEEE Signal Processing Magazine 25, 2 (March 2008), 31–40.
  • (36) Böhm, C., Braunmüller, B., Krebs, F., and Kriegel, H.-P. Epsilon grid order: An algorithm for the similarity join on massive high-dimensional data. In ACM SIGMOD Record (2001), vol. 30, ACM, pp. 379–388.
  • (37) Cai, Z., Saberian, M., and Vasconcelos, N. Learning complexity-aware cascades for deep pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 3361–3369.
  • (38) Canel, C., Kim, T., Zhou, G., Li, C., Lim, H., Andersen, D. G., Kaminsky, M., and Dulloor, S. R. Scaling video analytics on constrained edge nodes. In Proceedings of the 2nd SysML Conference (2019).
  • (39) Chakrabarti, K., Porkaew, K., and Mehrotra, S. Efficient query refinement in multimedia databases. In ICDE Conference (January 2000). Poster paper.
  • (40) Chollet, F. Keras, 2015.
  • (41) Condie, T., Conway, N., Alvaro, P., Hellerstein, J. M., Elmeleegy, K., and Sears, R. Mapreduce online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2010), NSDI’10, USENIX Association, pp. 21–21.
  • (42) Feng, B., Wan, K., Yang, S., and Ding, Y. SECS: efficient deep stream processing via class skew dichotomy. CoRR abs/1809.06691 (2018).
  • (43) Feng, Z., George, S., Harkes, J., Pillai, P., Klatzky, R., and Satyanarayanan, M. Eureka: Edge-based discovery of training data for machine learning. IEEE Internet Computing PP (01 2019), 1–1.
  • (44) Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (Washington, DC, USA, 2014), CVPR ’14, IEEE Computer Society, pp. 580–587.
  • (45) Han, B., Qian, F., Ji, L., and Gopalakrishnan, V. Mp-dash: Adaptive video streaming over preference-aware multipath. In Proceedings of the 12th International on Conference on Emerging Networking EXperiments and Technologies (New York, NY, USA, 2016), CoNEXT ’16, ACM, pp. 129–143.
  • (46) Hazewinkel, M. Encyclopaedia of Mathematics. Springer Netherlands, 1988.
  • (47) Hellerstein, J. M., Avnur, R., Chou, A., Hidber, C., Olston, C., Raman, V., Roth, T., and Haas, P. J. Interactive data analysis: the control project. Computer 32, 8 (Aug 1999), 51–59.
  • (48) Hellerstein, J. M., Haas, P. J., and Wang, H. J. Online aggregation. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 1997), SIGMOD ’97, ACM, pp. 171–182.
  • (49) Hsieh, K., Ananthanarayanan, G., Bodik, P., Venkataraman, S., Bahl, P., Philipose, M., Gibbons, P. B., and Mutlu, O. Focus: Querying large video datasets with low latency and low cost. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (Carlsbad, CA, 2018), USENIX Association.
  • (50) Hung, C.-C., Ananthanarayanan, G., Bodík, P., Golubchik, L., Yu, M., Bahl, V., and Philipose, M. Videoedge: Processing camera streams using hierarchical clusters.
  • (51) IHS. Top video surveillance trends for 2018, 2018.
  • (52) Ilyas, I. F., Shah, R., Aref, W. G., Vitter, J. S., and Elmagarmid, A. K. Rank-aware query optimization. In Proceedings of the 2004 ACM SIGMOD international conference on Management of data (2004), ACM, pp. 203–214.
  • (53) Jain, S., Jiang, J., Shu, Y., Ananthanarayanan, G., and Gonzalez, J. Rexcam: Resource-efficient, cross-camera video analytics at enterprise scale. CoRR abs/1811.01268 (2018).
  • (54) Jiang, A. H., Wong, D. L.-K., Canel, C., Tang, L., Misra, I., Kaminsky, M., Kozuch, M. A., Pillai, P., Andersen, D. G., and Ganger, G. R. Mainstream: Dynamic stem-sharing for multi-tenant video processing. In 2018 USENIX Annual Technical Conference (USENIX ATC 18) (Boston, MA, 2018), USENIX Association, pp. 29–42.
  • (55) Jiang, J., Ananthanarayanan, G., Bodik, P., Sen, S., and Stoica, I. Chameleon: Scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (New York, NY, USA, 2018), SIGCOMM ’18, ACM, pp. 253–266.
  • (56) Jin, T., and Hong, S. Split-cnn: Splitting window-based operations in convolutional neural networks for memory system optimization. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (2019), ACM, pp. 835–847.
  • (57) Käding, C., Rodner, E., Freytag, A., and Denzler, J. Fine-tuning deep neural networks in continuous learning scenarios. In Computer Vision – ACCV 2016 Workshops (Cham, 2017), C.-S. Chen, J. Lu, and K.-K. Ma, Eds., Springer International Publishing, pp. 588–605.
  • (58) Kang, D., Bailis, P., and Zaharia, M. Blazeit: Fast exploratory video queries using neural networks. arXiv preprint arXiv:1805.01046 (2018).
  • (59) Kang, D., Emmons, J., Abuzaid, F., Bailis, P., and Zaharia, M. Noscope: Optimizing neural network queries over video at scale. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1586–1597.
  • (60) Kang, K., Ouyang, W., Li, H., and Wang, X. Object detection from video tubelets with convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2016), pp. 817–825.
  • (61) Koudas, N., and Sevcik, K. C. High dimensional similarity joins: Algorithms and performance evaluation. IEEE Transactions on Knowledge and Data Engineering 12, 1 (2000), 3–18.
  • (62) Krishnan, S., Dziedzic, A., and Elmore, A. J. Deeplens: Towards a visual data management system. In CIDR 2019, 9th Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 13-16, 2019, Online Proceedings (2019).
  • (63) Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
  • (64) Liao, M. Benchmarking hardware for cnn inference in 2018., 2018.
  • (65) Liu, M., and Zhu, M. Mobile video object detection with temporally-aware feature maps. CVPR (2018).
  • (66) Mahapatra, P. R. S., Karmakar, A., Das, S., and Goswami, P. P. k-enclosing axis-parallel square. In International Conference on Computational Science and Its Applications (2011), Springer, pp. 84–93.
  • (67) OpenCV. Optical flow, 2018.
  • (68) Pakha, C., Chowdhery, A., and Jiang, J. Reinventing video streaming for distributed vision analytics. In 10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 18) (Boston, MA, 2018), USENIX Association.
  • (69) Pansare, N., Borkar, V. R., Jermaine, C., and Condie, T. Online aggregation for large mapreduce jobs. Proc. VLDB Endow 4, 11 (2011), 1135–1145.
  • (70) Paz, Z. Innovation in surveillance: What’s changing at the edge, core and cloud?, 2018.
  • (71) Poms, A., Crichton, W., Hanrahan, P., and Fatahalian, K. Scanner: Efficient video analysis at scale. ACM Trans. Graph. 37, 4 (July 2018), 138:1–138:13.
  • (72) Pu, Q., Ananthanarayanan, G., Bodik, P., Kandula, S., Akella, A., Bahl, P., and Stoica, I. Low latency geo-distributed data analytics. SIGCOMM Comput. Commun. Rev. 45, 4 (Aug. 2015), 421–434.
  • (73) Rabkin, A., Arye, M., Sen, S., Pai, V. S., and Freedman, M. J. Aggregation and degradation in jetstream: Streaming analytics in the wide area. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (Seattle, WA, 2014), USENIX Association, pp. 275–288.
  • (74) Ran, X., Chen, H., Zhu, X., Liu, Z., and Chen, J. Deepdecision: A mobile deep learning framework for edge video analytics. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications (April 2018), pp. 1421–1429.
  • (75) Redmon, J., and Farhadi, A. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018).
  • (76) Saligrama, V., and Chen, Z. Video anomaly detection based on local statistical aggregates. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (2012), pp. 2112–2119.
  • (77) Shalizi, C. R. Advanced data analysis from an elementary point of view, 2019.
  • (78) Shen, H., Han, S., Philipose, M., and Krishnamurthy, A. Fast video classification via adaptive cascading of deep models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017).
  • (79) Teng, E., Falcão, J. D., and Iannucci, B. Clickbait: Click-based accelerated incremental training of convolutional neural networks. CoRR abs/1709.05021 (2017).
  • (80) Teng, E., Huang, R., and Iannucci, B. Clickbait-v2: Training an object detector in real-time. CoRR abs/1803.10358 (2018).
  • (81) Viola, P., Jones, M., et al. Rapid object detection using a boosted cascade of simple features. CVPR (1) 1 (2001), 511–518.
  • (82) Viswanathan, R., Ananthanarayanan, G., and Akella, A. CLARINET: Wan-aware optimization for analytics queries. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (Savannah, GA, 2016), USENIX Association, pp. 435–450.
  • (83) Vulimiri, A., Curino, C., Godfrey, P. B., Jungblut, T., Karanasos, K., Padhye, J., and Varghese, G. Wanalytics: Geo-distributed analytics for a data intensive world. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2015), SIGMOD ’15, ACM, pp. 1087–1092.
  • (84) Vulimiri, A., Curino, C., Godfrey, P. B., Jungblut, T., Padhye, J., and Varghese, G. Global analytics in the face of bandwidth and regulatory constraints. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15) (Oakland, CA, 2015), USENIX Association, pp. 323–336.
  • (85) Wang, H., and Li, B. Lube: Mitigating bottlenecks in wide area data analytics. In 9th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 17) (Santa Clara, CA, 2017), USENIX Association.
  • (86) Wang, J., Feng, Z., Chen, Z., George, S., Bala, M., Pillai, P., Yang, S., and Satyanarayanan, M. Bandwidth-efficient live video analytics for drones via edge computing. In 2018 IEEE/ACM Symposium on Edge Computing, SEC 2018, Seattle, WA, USA, October 25-27, 2018 (2018), pp. 159–173.
  • (87) Xu, T., Botelho, L. M., and Lin, F. X. Vstore: A data store for analytics on large videos. In Proceedings of the Fourteenth EuroSys Conference 2019 (New York, NY, USA, 2019), EuroSys ’19, ACM, pp. 16:1–16:17.
  • (88) Yi, S., Hao, Z., Zhang, Q., Zhang, Q., Shi, W., and Li, Q. Lavea: Latency-aware video analytics on edge computing platform. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS) (June 2017), pp. 2573–2574.
  • (89) Zhang, B., Jin, X., Ratnasamy, S., Wawrzynek, J., and Lee, E. A. Awstream: Adaptive wide-area streaming analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (New York, NY, USA, 2018), SIGCOMM ’18, ACM, pp. 236–252.
  • (90) Zhang, H., Ananthanarayanan, G., Bodik, P., Philipose, M., Bahl, P., and Freedman, M. J. Live video analytics at scale with approximation and delay-tolerance. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17) (Boston, MA, 2017), USENIX Association, pp. 377–392.
  • (91) Zhang, T., Chowdhery, A., Bahl, P. V., Jamieson, K., and Banerjee, S. The design and implementation of a wireless video surveillance system. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking (New York, NY, USA, 2015), MobiCom ’15, ACM, pp. 426–438.
  • (92) Zhu, X., Dai, J., Yuan, L., and Wei, Y. Towards high performance video object detection. In CVPR (2018), IEEE Computer Society, pp. 7210–7218.