1 Introduction
Cameras are ubiquitous. Billions of them are being deployed in public (e.g., at road junctions) and private (e.g., in retail stores) spaces all over the world [cctv]. Recent advances in deep convolutional neural networks (CNNs) have led to incredible leaps in accuracy on many machine learning tasks, notably image and video analysis. Such developments have created great demand for practical analytics over video data [sigmod18Occlusion2, sigmod18Occlusion3, sigmod18Occlusion4, sigmod18Occlusion5, sigmod18Occlusion6]. For instance, civil engineers are using traffic cameras in intelligent transportation systems for road planning, traffic incident detection, and vehicle rerouting in public transit. An example in which urban planners had to go through a year's worth of traffic video to retrieve all video frames that contain more than five buses is reported in [blazeit]. Tools for fast and accurate video analytics are thus essential in smart city applications.
Object recognition in images and videos using CNNs is accurate [resnet, prelu, fasterrcnn]. However, CNNs are computationally expensive to train and to use. This poses a significant challenge to using CNN-based object recognition techniques to process massive amounts of streaming video. For example, a state-of-the-art object detector, Mask R-CNN [mrcnn], runs at 5 fps (frames per second) using GPUs. This frame-processing rate is 6 times slower than the frame rate of a typical 30-fps video. For the example where a year's worth of video is to be analyzed, a "scan-and-test" approach that invokes the object detector on every frame would take 6 years to complete. To bring video analytics into widespread use, the research community has started to build systems with innovative solutions to support fast analytics over video data [noscope, tahoma, focus, blazeit, pakha18reinvent, chameleon].
Video analytics is an emerging research area at the intersection of databases and computer vision. It is, however, still in its infancy – even the latest system BlazeIt [blazeit] supports only simple filter and aggregate queries. In data analytics, Top-K is a very important analytical operation that enables analysts to focus on the most important entities in the data [sayingenough, surveytopk, bruno, topkguarantee]. In this paper, we present the first set of solutions for efficient Top-K video analytics. Top-K queries help rank and identify the most interesting frames/moments in videos. Example use cases include:
Property Valuation. The valuation/rent of a shop is strongly related to its peak foot traffic [foottraffic]. Instead of counting manually, one can use a camera to capture pedestrian flow and use a machine to identify, say, the top-10 frames (times of the day) with the highest pedestrian counts.
Transport Planning. Severe traffic congestion can occur when multiple bus routes pass through the same narrow expressway. This is because a bus stopping at a stop for passengers blocks the traffic (other buses) behind it. The busiest bus creates a convoy effect, in which a long line of other buses follows its lead. A Top-K query on the moments with the biggest convoys provides valuable information on the congestion caused and supports better bus scheduling.
Data-Driven Agriculture. The global food demand is expected to increase by 70% from 2010 to 2050 [fooddemand]. To improve farm productivity, the FarmBeats project at Microsoft sends drones to collect and analyze farm videos [farmbeats]. Recent news also reports that oil-palm farmers in Malaysia send drones to monitor the growth of oil palm [raghu_2019]. With oil-palm plantations spread across 86,100 square miles in Malaysia, farmers with limited resources can only inspect a small number of fields on-site each day. Finding the Top-K fields (e.g., by the number of well-grown palm trees) over drone videos can drastically help farmers prioritize field trips.
In video analytics, CNN specialization and Approximate Query Processing (AQP) are two popular techniques for overcoming the inefficiency of the scan-and-test method [noscope, tahoma, focus, blazeit]. With CNN specialization, a video analytics system trains a lightweight model specifically designed for certain query type(s) (e.g., "select all frames with dogs") using frames sampled from the video-of-interest as training data. Since the lightweight model is highly specific to the query instance and the video-of-interest, it is both efficient to train and efficient to use (but generally less accurate). The lightweight model is then applied to all video frames to obtain a set of candidate frames that are likely to satisfy the given query. Finally, the candidate frames are passed to an accurate but less efficient object detector (e.g., YOLOv3 [yolo3], Mask R-CNN [he2017mask]) to verify their validity. Approximate Query Processing [blazeit, BlinkDB] is an orthogonal technique for fast analytics when users are willing to tolerate some degree of error. The complex interplay between deep model inference and Top-K query processing, however, creates novel challenges for CNN specialization and AQP.
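The specialize-then-verify flow described above can be sketched as follows; `proxy_score` and `detect_objects` are hypothetical stand-ins for a specialized lightweight CNN and an expensive ground-truth detector such as Mask R-CNN:

```python
# Sketch of the specialize-then-verify pipeline. `proxy_score` and
# `detect_objects` are hypothetical stand-ins for a specialized lightweight
# CNN and an expensive ground-truth object detector, respectively.

def specialize_and_verify(frames, proxy_score, detect_objects, threshold=0.5):
    # Step 1: cheap screening pass over every frame with the proxy model.
    candidates = [f for f in frames if proxy_score(f) >= threshold]
    # Step 2: expensive verification only on the surviving candidates.
    return [f for f in candidates if detect_objects(f)]

# Toy usage: frames are integers, and "frames with objects" are even numbers.
result = specialize_and_verify(
    list(range(10)),
    proxy_score=lambda f: 1.0 if f % 2 == 0 else 0.0,  # cheap, imperfect proxy
    detect_objects=lambda f: f % 2 == 0,               # accurate "ground truth"
)
```

The savings come from invoking the expensive detector only on frames that survive the cheap screening pass.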
First, for CNN specialization, new lightweight models need to be designed for handling Top-K queries, which differ from traditional ones such as selection. Second, traditional Top-K algorithms process objects with definite values. They are generally inapplicable to video analytics when lightweight models are used, because these models' outputs are imprecise and probabilistic. Third, classic AQP techniques aim at determining confidence intervals for statistical aggregate queries (such as point estimates). The answers of Top-K queries, however, are set-based (e.g., sets of Top-K frames). Therefore, existing AQP algorithms are inapplicable. Fourth, video data has temporal locality; Top-K queries can be frame-based or time-window-based. For example, one could be interested in finding the Top-K 5-second clips with the highest number of vehicles. This adds to the complexity of answering Top-K queries in video analytics.
To address all these challenges, we present Everest, a system that empowers users to answer Top-K queries over video based on any given scoring (ranking) function. Everest presents results with probabilistic guarantees. Everest's probabilistic outputs play in unison with video analytics, where data uncertainty is prevalent. This is in sharp contrast to existing systems, where valuable information in uncertain data is not fully leveraged. To illustrate, Table 1(a) shows an example output of the lightweight model used in BlazeIt [blazeit]. The car count in each traffic video frame is best given by a probability distribution, which captures the uncertainty information of the object detector. Existing systems, however, process queries based on a simplified view of the table in which less likely cases are dropped and much of the uncertainty information is discarded (see Table 1(b)). Everest, in contrast, is designed to treat uncertain data as a first-class citizen. Query processing in Everest is probabilistic in nature. Furthermore, Everest is extensible to support windowing, so that users can ask Top-K-windows queries as well.
Supporting Top-K analytics over videos enables rich analyses over the visual world, just as traditional Top-K query processing has been doing over relational data. Everest draws heavily on uncertain data management techniques (e.g., [uncertain_db_1, uncertain_db_2, uncertain_db_3]) but with its own novelty and uniqueness because of the availability of the ground-truth at runtime. Specifically, uncertain query processing has been assuming no access to the ground-truth at runtime. By contrast, video analytics can access an accurate object detector to get the ground-truth and reduce the uncertainties at runtime. That essentially opens a new class of uncertain query processing problems in which the ground-truth is in the loop. We evaluate Everest on six real video streams and the latest Visual Road benchmark [visualroad]. We show that Everest achieves 10.8× to 17.9× speedups over the baseline approach. In summary, we make the following contributions:

We design and present a novel notion of Top-K analytics over videos. Our design is the first to treat uncertain data as a first-class citizen in video analytics.

We present Everest, the first video analytics system that supports Top-K queries with probabilistic guarantees. The system further includes extensions to find Top-K windows instead of Top-K frames.

We introduce efficient algorithms and implementations for each system module. Our algorithms overcome the combinatorial explosion in the number of possible worlds commonly found in uncertain query processing. Our implementation is highly optimized, taking into account various practical efficiency issues such as numerical stability and GPU utilization.
The remainder of this paper is organized as follows. Section 2 provides the background of our study. Section 3 defines the Top-K queries supported by Everest. Sections 4 and 5 present Everest's algorithms and implementation. Section 6 gives evaluation results. Section 7 discusses related work. Finally, Section 8 concludes the paper.


2 Background
In this section, we provide brief background on object detection by CNNs, video analytics systems, Top-K query processing, and uncertain query processing. Readers who are familiar with these topics may skip this section. We further discuss their recent developments in Section 7.
CNN Object Detection Object detection is an important problem in computer vision (CV): identifying the occurrences of objects in images. Large volumes of training data are available nowadays, enabling modern techniques, especially convolutional neural networks (CNNs), to achieve near-human or even better-than-human accuracy [prelu]. However, these deep models are rather complex, with millions or even billions of parameters. Answering queries (i.e., performing inference) with these models is thus computationally very expensive. For example, Mask R-CNN and YOLOv3, two state-of-the-art models, process video at respective rates of 5 fps and 30 fps on machines equipped with GPUs [mrcnn].
Timestamp (ts) | Class | Polygon | ObjectID | Content | Features
01-01-2019 23:05 | Human | (10, 50), (30, 40), | 16 | |
01-01-2019 23:05 | Bus | (45, 58), (66, 99), | 58 | |
01-01-2019 23:06 | Human | (20, 80), (7, 55), | 16 | |
01-01-2019 23:06 | Car | (6, 91), (10, 55), | 59 | |
01-01-2019 23:06 | Car | (78, 91), (40, 55), | 60 | |
Video Analytics The huge volume of video data poses great challenges to video analytics research [noscope, focus, scanner, tahoma, videostorm, blazeit]. For analytical processing, video data is often modeled as relations, each of which captures the object occurrences in a specific video clip [blazeit]. Specifically, each tuple in a video relation corresponds to a single object in a video frame. Since a frame may contain zero or more objects (of interest) and an object may appear in multiple frames, a frame can be associated with zero or more tuples in a relation and an object can be associated with multiple tuples. Typical attributes of a tuple include a frame timestamp (ts); a unique id of an identified object (objectID); and the object's class label (class), bounding polygon (polygon), raw pixel content (content), and feature vector (features). A video relation can be materialized by invoking an object detector on each frame to compute tuple values. Table 2 shows an example of a materialized video relation. The objectID attribute can be populated by invoking an object tracker (e.g., [zhang2011video]), which takes as input two polygons from two consecutive frames and returns the same objectID if the two polygons enclose the same object. In video analytics, a video relation that is materialized by an accurate object detector such as YOLOv3 is regarded as the system's ground-truth [focus]. However, fully materializing a ground-truth relation is computationally expensive. Therefore, the key challenge is how to answer analytical queries without a fully materialized ground-truth relation [noscope, focus, scanner, tahoma, videostorm, blazeit]. This constraint distinguishes our work from previous video database studies [metadata, regionbased], which assume that video relations are already given, presumably via some external means (such as human annotation).
CNN specialization is a key technique for speeding up video analytics [shen2017fast, noscope, blazeit, focus]. Inspired by the concept of "cascades" [cascade] in computer vision, the idea is to use frames sampled from the video-of-interest to train a lightweight CNN (e.g., with fewer neurons and layers) as a proxy for an expensive ground-truth object detector (e.g., Mask R-CNN). Training specialized CNNs is significantly faster because there are fewer neural layers and a specific context. For example, training a CNN with only frames from a traffic video converges much faster than in a general setting because there are far fewer object classes to consider (e.g., cars, bicycles, pedestrians), as opposed to a general object detector that has to distinguish thousands of classes. Since specialized CNNs are lightweight, their models are also much faster to execute. For example, it has been reported that specialized lightweight CNNs can infer at 10,000 fps with various degrees of accuracy loss [blazeit]. However, a specialized CNN is data-specific and thus not generalizable to other videos. Also, lightweight CNNs may produce false positives or false negatives. As a remedy, systems like BlazeIt [blazeit] use statistical techniques to bound result errors.
Top-K Query Processing In many application domains (e.g., information retrieval [bruno], multimedia databases [faginthreshold], and spatial databases [topkspatial]), users are more interested in the K most important tuples ordered by a scoring (ranking) function. The essence of efficient Top-K query processing is to wisely schedule data accesses such that most of the irrelevant (non-Top-K) items are skipped. This minimizes expensive data accesses (e.g., from disk or via the web) and provides early stopping criteria to avoid unnecessary computation.
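As generic background (not Everest's algorithm), a minimal exact Top-K evaluation over known scores can be done with a size-K min-heap, so most non-Top-K items are discarded after one comparison with the current K-th best:

```python
# Background sketch: exact Top-K over known scores with a size-K min-heap.
# Items that cannot beat the current K-th best are skipped after a single
# comparison, which is the essence of pruning in Top-K processing.
import heapq

def topk(items, k, score):
    heap = []  # min-heap of (score, item); the root is the current K-th best
    for it in items:
        s = score(it)
        if len(heap) < k:
            heapq.heappush(heap, (s, it))
        elif s > heap[0][0]:              # beats the current K-th best
            heapq.heapreplace(heap, (s, it))
    return sorted(heap, reverse=True)     # highest score first

# Toy usage: the three integers in 0..99 closest to 42.
best = topk(range(100), k=3, score=lambda x: -abs(x - 42))
```

The same skip-most-items principle underlies the scheduling of expensive data accesses described above.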
Uncertain Query Processing One common uncertain data representation is "x-tuples" [aggarwal2009trio]. An uncertain relation is a collection of x-tuples, each consisting of a number of alternative outcomes associated with their corresponding probabilities. Together, the alternatives form a discrete probability distribution of the true outcome. Different x-tuples are assumed to be independent of each other. We discuss later how Everest uses a difference detector to approximate that independence, so that the outputs of a lightweight CNN can be represented by the x-tuple model (i.e., the inference results of a video frame are captured by one x-tuple). Hence, in the following discussion, we use the terms x-tuple, frame, and timestamp interchangeably.
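A minimal sketch of the x-tuple model (the names below are assumed, not Everest's API): each frame carries a discrete distribution over its possible scores, and an x-tuple with a single alternative is certain:

```python
# A minimal x-tuple representation (names are assumed, not Everest's API):
# each video frame maps to a discrete distribution over its possible scores,
# and different x-tuples are treated as independent.
from dataclasses import dataclass

@dataclass
class XTuple:
    timestamp: int
    dist: dict          # score -> probability; the probabilities sum to 1

    def is_certain(self):
        # A single alternative means the frame has been verified (no uncertainty).
        return len(self.dist) == 1

t1 = XTuple(timestamp=0, dist={0: 0.78, 1: 0.21, 2: 0.01})  # uncertain frame
t2 = XTuple(timestamp=1, dist={0: 1.0})                     # verified by the GTOD
```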
3 Top-K in Everest
Everest allows users to set a per-query threshold, $thres$, to ensure that the returned Top-K result has a probability of at least $thres$ of being the exact answer. Given an uncertain relation $D$ (Section 4.1 discusses how to obtain it) and a scoring function $s$ (e.g., the number of cars in a frame $f$), Everest returns a Top-K result $\hat{R}$ with confidence $\Pr(\hat{R} = R) \ge thres$, where $R$ is the exact result and $thres$ is the probability threshold specified by the user. The probability is defined over the uncertain relation (e.g., Table 1(a)) using the possible world semantics (PWS) [possible_world]. The possible world semantics is widely used in uncertain databases: an uncertain relation is instantiated into multiple possible worlds, each of which is associated with a probability. Table 3 shows two of the possible worlds of Table 1(a). Given the probability of each possible world, the confidence of a Top-K answer $\hat{R}$ is simply the sum of the probabilities of all possible worlds in which $\hat{R}$ is the Top-K:
$\Pr(\hat{R} = R) = \sum_{w \in \mathcal{W}(D)\,:\,\hat{R}\text{ is the Top-}K\text{ of }w} \Pr(w)$ (1)
Here, $\mathcal{W}(D)$ denotes the set of all possible worlds of the uncertain relation $D$. In addition, the answer $\hat{R}$ has to satisfy the following condition:
Definition:
The Certain Result (CR) Condition. The Top-K result $\hat{R}$ has to be chosen from $D_c$, where frames in $D_c$ are all certain, i.e., they have already been verified by a ground-truth object detector (GTOD) and carry no uncertainty.
The certain-result condition is important in video analytics. For instance, the Top-1 result of the uncertain relation in Table 1(a) has a confidence of 0.85 (based on Equation 1). Assuming $thres$ is below 0.85, the confidence of being the correct answer is above the user threshold. However, this result is computed from the uncertain frames in $D$. As in other cases of uncertain query processing, the ground-truth is unknown during the computation. In video analytics, this is undesirable because a user learns the ground-truth by visually inspecting the frames returned in an answer to the query. In the example above, if the user sees no cars in the returned frame, the returned Top-1 answer is unacceptable. The certain result condition avoids such awkward answers and is essential in ensuring that all the tuples in $\hat{R}$ have been confirmed by a GTOD before an answer is returned. With that, the frame would not be returned as the Top-1 result because its probability of being Top-1 is only 0.38 based on the updated uncertain relation in Table 4 and Equation 1, where the frame is confirmed using a GTOD (we denote that operation as GTOD($f$)). Finally, we remark that $\Pr(\hat{R} = R) \ge thres$ guarantees not only that the whole result set has at least probability $thres$ of being the true answer, but also that every frame $\hat{f}$ in $\hat{R}$ has at least probability $thres$ of being in the exact result set, because: $\Pr(\hat{f} \in R) \ge \Pr(\hat{R} = R) \ge thres$.
$\Pr(\hat{f} \in R)$ reflects the precision of the Top-K answer, i.e., the fraction of results in $\hat{R}$ that belong to $R$. Therefore, Everest effectively provides guarantees on the precision of its query answers.
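As a sanity check of the possible-world semantics, the confidence of Equation 1 can be computed by brute-force enumeration on tiny inputs (exponential in the number of frames, so for illustration only; ties are broken by iteration order in this sketch):

```python
# Brute-force check of Equation 1: enumerate every possible world of a tiny
# uncertain relation and sum the probabilities of the worlds in which the
# candidate set is the Top-K. Exponential cost; illustration only. Ties are
# broken by iteration order in this sketch.
from itertools import product

def topk_confidence(relation, candidate, k=1):
    # relation: {frame_id: {score: prob}}; candidate: set of frame ids.
    frames = list(relation)
    conf = 0.0
    for world in product(*(relation[f].items() for f in frames)):
        prob = 1.0
        scores = {}
        for f, (score, p) in zip(frames, world):
            prob *= p
            scores[f] = score
        topk = set(sorted(frames, key=lambda g: -scores[g])[:k])
        if topk == candidate:
            conf += prob
    return conf

rel = {'a': {2: 0.7, 0: 0.3}, 'b': {1: 1.0}}
```

For `rel`, frame `a` is Top-1 exactly in the worlds where its score is 2, so `topk_confidence(rel, {'a'})` is 0.7.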


timestamp | num. of cars | conf.
 | 0 | 0.78
 | 1 | 0.21
 | 2 | 0.01
 | 0 | 0.49
 | 1 | 0.42
 | 2 | 0.09
 | 0 | 1.0
4 Everest System
Following recent works [tahoma, focus, noscope], Everest also focuses on a batch setting, in which large quantities of video are collected for post-analysis. Online analytics on live video streams is a different setting and is beyond the scope of this paper. In addition, we focus on a bounded set of labels because state-of-the-art object detectors do so. For example, while the pre-trained YOLOv3 can detect cars, it cannot distinguish between a sedan and a hatchback. However, users can supply UDFs to perform further classification if necessary. Supporting an unbounded vocabulary is not impossible if our lightweight network also outputs embeddings as in [panorama], but we leave that as our next step and focus on Top-K query processing in this paper. To our knowledge, tracking model drift in visual data has not been well addressed in the literature. We will tackle that once robust techniques for it are developed.
Figure 1 shows the system overview. Everest leverages CNN specialization and uncertain query processing to accelerate Top-K analytics with probabilistic guarantees. Processing a query involves two phases. The first phase trains a lightweight deep convolutional mixture density network (DCMDN) [d2017dcmdn] that outputs a distribution over the scoring function (e.g., the number of cars in a frame), and then samples the DCMDN to build an initial uncertain relation. The first phase can be done offline (i.e., at ingestion time) for standard scoring functions (e.g., counting). The second phase takes as input the resulting uncertain relation from Phase 1 and finds a Top-K result that has a confidence of at least $thres$. Initially, it is unlikely that the Top-K result from the initial relation has confidence over the threshold. Furthermore, we saw in the previous section that, given a potential Top-K result $\hat{R}$, we have to confirm its frames using a GTOD for the CR condition; but that may conversely give the same $\hat{R}$ a lower confidence based on the updated uncertain relation (e.g., the confidence drops from 0.85 to 0.38). Of course, if the uncertain relation contains no more uncertainty (i.e., all tuples are certain), the Top-K result from it has a confidence of 1. Consequently, Phase 2 can be viewed as Top-K processing via uncertain data cleaning, in which we selectively "clean" the uncertain tuples in the uncertain relation using a GTOD until the Top-K result from the latest uncertain relation satisfies the probabilistic guarantee. For high efficiency, Phase 2 aims to clean as few uncertain tuples as possible because each cleaning operation invokes the computationally expensive GTOD. Furthermore, the algorithms in Phase 2 have to be carefully designed because uncertain query processing often introduces an exponential number of possible worlds.
4.1 Phase 1: Building the initial uncertain relation
The crux of CNN specialization is the design of a lightweight model suitable for the specific query type. Prior works [noscope, blazeit] that focus on selection and aggregate queries build classifiers, where each distinct integer count is treated as a standalone class label in the softmax layer. The softmax layer effectively generates a discrete distribution over the number of cars in a frame, as shown in Table 1(a). One drawback of building a classifier is that the maximum number of classes is restricted to the maximum value found in the sample frames. This is undesirable because it fails when the Top-1 frame is missed in the training frames. We therefore discard that approach. Instead, we train a lightweight deep convolutional mixture density network (DCMDN). Figure 2 shows the architecture of our lightweight DCMDN. It uses YOLOv3-tiny [yolo3] (excluding the output layer) as the backbone to extract features from the input frame and uses a mixture density network (MDN) to output the parameters of the Gaussians (means and variances) and their weights in the mixture. Compared with typical regression models, the MDN layer provides valuable uncertainty information (e.g., variance) in the output. In order to generate a high-quality uncertain relation (i.e., one with high log-likelihood with respect to the ground-truth), we follow a recent work [occlusion] to build the MDN layer, where the neural network first generates a set of hypotheses as possible predictions. By regarding the hypotheses as samples drawn from a mixed Gaussian distribution, the parameters of the distribution are then estimated. This method is reported to overcome the limitations of traditional MDNs, including degenerate predictions and overfitting.
The training data of the DCMDN are frames randomly sampled from the video-of-interest, with real scores (e.g., number of cars) obtained from a ground-truth object detector (GTOD). Since both the network and the training data are small, training converges within several minutes. After the DCMDN is trained, Everest uses a difference detector to discard similar frames from the video. This step serves two purposes. First, frames with little difference are not informative (e.g., see Figure 3) in a Top-K context. Second, it approximates independence among frames so as to enable the use of "x-tuples" to model the data. After that, Everest feeds the unique frames to the DCMDN to obtain the score distribution of each frame. Finally, Everest draws samples from the DCMDN to build the uncertain relation. An x-tuple captures a discrete distribution, but the Gaussian mixture is continuous with infinitely long tails on both ends. In order to get a finite uncertain relation, we follow [truncated_gaussian] and replace the Gaussians with truncated Gaussians, so that the probabilities beyond the truncation points are set to zero and evenly redistributed to the rest. After that, we populate the uncertain relation by discretizing the truncated Gaussian mixture. For counting-based scoring functions, the score distribution is discretized to non-negative integers. For others, users have to provide the discretization step size when defining the scoring function. For frames that were already labeled by the GTOD when training the DCMDN, their scores are known and certain. They are inserted into the uncertain relation directly with no uncertainty.¹
¹Nevertheless, we still call the result an "uncertain relation".
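The last step of Phase 1, discretizing the (truncated) Gaussian mixture into an x-tuple, might look as follows. The three-standard-deviation truncation range and the proportional (rather than even) redistribution of the truncated mass are simplifying assumptions of this sketch:

```python
# Sketch: discretize a truncated Gaussian mixture emitted by the DCMDN into an
# x-tuple over non-negative integer scores. The 3-standard-deviation range and
# the proportional renormalization (instead of evenly redistributing the
# truncated mass) are simplifying assumptions of this sketch.
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def mixture_to_xtuple(weights, mus, sigmas):
    max_score = int(max(m + 3 * s for m, s in zip(mus, sigmas))) + 1
    dist = {}
    for k in range(max_score + 1):
        # Probability mass of the integer bucket [k - 0.5, k + 0.5).
        p = sum(w * (normal_cdf(k + 0.5, m, s) - normal_cdf(k - 0.5, m, s))
                for w, m, s in zip(weights, mus, sigmas))
        if p > 0:
            dist[k] = p
    total = sum(dist.values())   # mass lost to truncation is renormalized away
    return {k: p / total for k, p in dist.items()}

dist = mixture_to_xtuple(weights=[0.6, 0.4], mus=[1.0, 3.0], sigmas=[0.5, 0.8])
```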
4.2 Phase 2: Top-K processing via uncertain data cleaning
Figure 1 (right) outlines the flow of Phase 2. Starting from the uncertain relation given by Phase 1, it iteratively selects the best frame (by the Selectcandidate function) to clean using the GTOD, until a Top-K result based on the latest table has a confidence (computed by the Topkprob function) exceeding the user threshold $thres$.
Finding a Top-K result from the latest updated table $D_i$ (by the TopK() function) is straightforward: it simply extracts all the certain tuples from $D_i$ (because of the CR condition) and applies an existing Top-K algorithm (e.g., [sayingenough]) to find the Top-K as $\hat{R}$. The function Topkprob, which computes the probability of $\hat{R}$ being the exact result, and the function Selectcandidate, which selects the most promising frame, are more challenging because they involve an exponential number of possible worlds. Note that the techniques used in traditional uncertain data cleaning [cleantopk, ericee] are not applicable here because of differences in the problem setting. Those cleaning problems have a constraint on the "budget", i.e., the number of x-tuples that can be cleaned, and their objective is to minimize the entropy of all possible Top-K results from the updated table. Since their setting is to return a batch of x-tuples for a person to manually carry out a one-off cleaning operation, using entropy allows them to measure answer quality solely based on the uncertain data, as the results of the cleaning operation are not available at runtime. By contrast, our constraint is to pass the probability threshold and our objective is to minimize the cost of using the GTOD (and the algorithm overhead). For us, the GTOD provides online feedback, allowing us to put that into the equation and come up with a probabilistic guarantee (between 0 and 1), which is far more intuitive than using entropy (which is unbounded above).
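The overall Phase 2 loop can be sketched as below; `topk_prob`, `select_candidate`, and `gtod` are placeholders for the components discussed in this section, and a relation is modeled as a mapping from each frame to its discrete score distribution:

```python
# High-level sketch of Phase 2 with assumed component names: iteratively clean
# the most promising uncertain frame with the ground-truth object detector
# (gtod) until the Top-K result drawn from certain tuples is confident enough.
# A relation is modeled as {frame: {score: prob}}.

def phase2(relation, k, thres, gtod, topk_prob, select_candidate):
    while True:
        certain = {f: d for f, d in relation.items() if len(d) == 1}
        # Top-K over certain tuples only (the CR condition).
        topk = sorted(certain, key=lambda f: -next(iter(certain[f])))[:k]
        if len(topk) == k and topk_prob(relation, topk) >= thres:
            return topk
        f = select_candidate(relation)   # most promising uncertain frame
        relation[f] = {gtod(f): 1.0}     # clean it: the score becomes certain
```

Each iteration pays one GTOD invocation, so the loop's goal is to terminate after as few iterations as possible.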
$D$ | Uncertain relation
$D_i$ | Uncertain relation at iteration $i$
$D_c$ | Subset of $D$ whose x-tuples are all certain
$D_u$ | Subset of $D$ whose x-tuples are all uncertain
$\hat{R}$ | Approximate Top-K result
$p$ | Confidence/probability of $\hat{R}$
$p_i$ | $p$, obtained in the $i$-th iteration
$R$ | Actual result
$f$ | Frame / x-tuple
GTOD | Ground-truth object detector
$s(f)$ | Score of $f$
$f_K$ | The frame ranked $K$-th in $\hat{R}$
$f_{K-1}$ | The frame ranked penultimately in $\hat{R}$
In the following, we discuss efficient algorithms implementing the Topkprob and Selectcandidate functions. Table 5 summarizes the major notations used.
4.2.1 Implementing Topkprob($D_i$, $\hat{R}$)
Given a potential result $\hat{R}$ extracted from $D_c$, computing its confidence via Equation 1 has to expand all $m^n$ possible worlds, where $n$ is the number of frames in $D$ and we assume each of them has $m$ possible scores in the uncertain relation. However, given the Certain Result (CR) condition, we can simplify the calculation of $\Pr(\hat{R} = R)$ as:
$\Pr(\hat{R} = R) = \prod_{f \in D_u} \Pr(s(f) \le s(f_K))$ (2)
where (a) $s(f)$ stands for the score of a frame $f$, (b) $f_K$ is the "threshold" frame that ranks $K$-th in $\hat{R}$, whose score $s(f_K)$ is known and certain (because $\hat{R}$ is from $D_c$), and (c) $D_u = D \setminus D_c$ are the frames in $D$ with uncertainty.
Computing Equation 2 requires only time linear in $|D_u|$. Equations 1 and 2 are equivalent because the probability of $\hat{R}$ being the Top-K is equal to the probability that no frame in $D_u$ has a score larger than the frames in $\hat{R}$.²
²We allow frames in $D_u$ to have scores tied with the threshold frame $f_K$.
A further optimization is to compute two functions before Phase 2 begins: (a) the CDF of the score of each frame $f$, i.e., $F_f(x) = \Pr(s(f) \le x)$, and (b) a function $G(x) = \prod_{f \in D_u} F_f(x)$, which is the joint CDF of all initially uncertain frames in $D_u$. With them, in the $i$-th iteration, we can compute $p_i = \Pr(\hat{R} = R)$ as follows:
$p_i = \dfrac{G(s(f_K))}{\prod_{f \in C_i} F_f(s(f_K))}$ (3)
where $C_i$ denotes the set of frames already cleaned by the GTOD in the first $i$ iterations.
Equations 2 and 3 are equivalent because, by definition, $G(x) = \prod_{f \in D_u} F_f(x)$, and a cleaned frame no longer contributes a factor to Equation 2, so its old factor is divided out. $F_f(x)$ and $G(x)$ for all $f$ and $x$ can be easily precomputed one-off. With Equation 3, Topkprob($D_i$, $\hat{R}$) in the $i$-th iteration can compute $p_i$ in time linear in the number of cleaned frames instead, which is typically far smaller than $|D_u|$.
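A direct implementation of Equation 2 using per-frame CDFs (without the precomputation trick of Equation 3) might look like this; the relation layout is an assumed toy encoding, not Everest's actual storage format:

```python
# A direct implementation of Equation 2: the confidence of a certain Top-K
# result is the probability that no uncertain frame scores above the threshold
# frame f_K. The relation layout {frame: {score: prob}} is an assumed toy
# encoding.

def cdf(dist, x):
    # F_f(x) = Pr(s(f) <= x) for one frame's discrete score distribution.
    return sum(p for score, p in dist.items() if score <= x)

def topk_prob(relation, topk_result):
    # Threshold score s(f_K): the lowest (K-th) score in the certain result.
    s_fk = min(next(iter(relation[f])) for f in topk_result)
    conf = 1.0
    for f, dist in relation.items():
        if len(dist) > 1:                # product over uncertain frames only
            conf *= cdf(dist, s_fk)
    return conf

rel = {'a': {5: 1.0}, 'b': {0: 0.7, 6: 0.3}, 'c': {0: 0.8, 4: 0.2}}
```

In `rel`, only frame `b` can exceed the threshold score 5 (with probability 0.3), so the confidence of `['a']` as the Top-1 is 0.7.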
As a remark, $p_i$ actually improves exponentially with the number of frames cleaned, according to Equation 2. Therefore, we expect Phase 2 to spend more iterations reaching a small probability threshold, say 0.5, but fewer iterations to reach any probability threshold beyond that.
4.2.2 Implementing Selectcandidate($D_i$)
Selectcandidate($D_i$) is the function that selects a frame from the set of uncertain frames in the $i$-th iteration on which to apply the GTOD, such that cleaning it maximizes the confidence $p_{i+1}$ of the next iteration; hopefully $p_{i+1} \ge thres$, so that Phase 2 can stop early.
Of course, $p_{i+1}$ is unknown before we apply GTOD($f$). Therefore, we use a random variable $P^f_{i+1}$ to denote the value of $p_{i+1}$ after a frame $f$ is cleaned. To maximize $p_{i+1}$, we aim to find $f^* = \operatorname{argmax}_{f \in D_u} \mathbb{E}[P^f_{i+1}]$. Using $p^{f,x}_{i+1}$ to denote the value of $p_{i+1}$ when $s(f) = x$, where $x$ is a particular score, $\mathbb{E}[P^f_{i+1}]$ is thus:
$\mathbb{E}[P^f_{i+1}] = \sum_{x} \Pr(s(f) = x) \cdot p^{f,x}_{i+1}$ (4)
[Efficient computation of $p^{f,x}_{i+1}$] $p^{f,x}_{i+1}$ is the probability of the result $\hat{R}$ being the Top-K based on $D_{i+1}$, where the x-tuple representing $f$ in $D_i$ is assumed to be cleaned and its score is $x$ and certain. Therefore, $p^{f,x}_{i+1}$ can be calculated on top of $p_i$ (Equation 3) by removing the uncertainty of $f$, based on how the actual score of $f$ influences the Top-K result:
$p^{f,x}_{i+1} = \begin{cases} p_i / F_f(s(f_K)) & \text{if } x \le s(f_K) \\[4pt] \dfrac{G(x)}{F_f(x)\prod_{f' \in C_i} F_{f'}(x)} & \text{if } s(f_K) < x \le s(f_{K-1}) \\[4pt] \dfrac{G(s(f_{K-1}))}{F_f(s(f_{K-1}))\prod_{f' \in C_i} F_{f'}(s(f_{K-1}))} & \text{if } x > s(f_{K-1}) \end{cases}$ (5)
(where $C_i$ denotes the frames already cleaned by the GTOD)
The idea of Equation 5 is as follows:

when $x \le s(f_K)$, frame $f$ is not qualified to be in the Top-K; the Top-K result does not change and the threshold score is still $s(f_K)$, so discounting the uncertainty of $f$ suffices;

when $s(f_K) < x \le s(f_{K-1})$, frame $f$ enters the Top-K, but its score is lower than that of the penultimate frame in the Top-K result, i.e., the one ranked $(K-1)$-st, so frame $f$ takes the $K$-th rank; the new "threshold" score is changed to $x$;

when $x > s(f_{K-1})$, frame $f$ enters the Top-K with a score greater than that of the original penultimate frame; the new threshold frame is $f_{K-1}$, and the new threshold score is changed to $s(f_{K-1})$.
Substituting Equation 5 into Equation 4 (the first case sums to exactly $p_i$) gives:
$\mathbb{E}[P^f_{i+1}] = p_i + \sum_{s(f_K) < x \le s(f_{K-1})} \Pr(s(f)=x)\,\dfrac{G(x)}{F_f(x)\prod_{f' \in C_i} F_{f'}(x)} + \bigl(1 - F_f(s(f_{K-1}))\bigr)\,\dfrac{G(s(f_{K-1}))}{F_f(s(f_{K-1}))\prod_{f' \in C_i} F_{f'}(s(f_{K-1}))}$ (6)
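A naive version of Selectcandidate can be sketched as follows. It re-evaluates Equation 2 from scratch for every hypothetical score of each frame instead of using the incremental case analysis of Equation 5, and it keeps the candidate result fixed even when a score would displace the current threshold frame, so it is a toy approximation only:

```python
# Naive sketch of Equations 4-5: the expected confidence after cleaning frame
# f is the score-weighted average of the confidences obtained by fixing f's
# score to each possible value x. For simplicity, the confidence is recomputed
# from scratch via Equation 2 (no incremental update), and the candidate
# result is kept fixed even when x would displace the threshold frame.

def confidence(relation, topk_result):
    # Equation 2: product over uncertain frames of Pr(s(f) <= s(f_K)).
    s_fk = min(next(iter(relation[f])) for f in topk_result)
    conf = 1.0
    for f, dist in relation.items():
        if len(dist) > 1:
            conf *= sum(p for score, p in dist.items() if score <= s_fk)
    return conf

def select_candidate(relation, topk_result):
    def expected_conf(f):
        # Equation 4: sum over x of Pr(s(f) = x) * confidence after cleaning.
        return sum(p * confidence({**relation, f: {x: 1.0}}, topk_result)
                   for x, p in relation[f].items())
    uncertain = [f for f, d in relation.items() if len(d) > 1]
    return max(uncertain, key=expected_conf)

rel = {'a': {5: 1.0}, 'b': {0: 0.5, 6: 0.5}, 'c': {4: 0.9, 6: 0.1}}
```

With `rel` and candidate `['a']`, cleaning `b` has a higher expected confidence than cleaning `c`, so `b` is selected.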