Cameras are ubiquitous. Billions of them are deployed in public (e.g., at road junctions) and private (e.g., in retail stores) spaces all over the world [cctv]. Recent advances in deep convolutional neural networks (CNNs) have led to incredible leaps in accuracy on many machine learning tasks, notably image and video analysis. Such developments have created great demand for practical analytics over video data [sigmod18Occlusion2, sigmod18Occlusion3, sigmod18Occlusion4, sigmod18Occlusion5, sigmod18Occlusion6]. For instance, civil engineers use traffic cameras in intelligent transportation systems for road planning, traffic incident detection, and vehicle re-routing in public transit. An example in which urban planners had to go through a year's worth of traffic video to retrieve all video frames containing more than five buses is reported in [blazeit]. Tools for fast and accurate video analytics are essential in smart city applications.
Object recognition in images and videos using CNNs is accurate [resnet, prelu, faster-rcnn]. However, CNNs are computationally expensive to train and to use. This poses a significant challenge to using CNN-based object recognition techniques to process massive amounts of streaming video. For example, a state-of-the-art object detector, Mask R-CNN [mrcnn], runs at 5 fps (frames per second) on GPUs. This frame-processing rate is 6 times slower than the frame rate of a typical 30-fps video. For the example above, in which a year's worth of video is to be analyzed, a "scan-and-test" approach that invokes the object detector on every frame would take 6 years to complete. To bring video analytics into widespread use, the research community has started to build systems with innovative solutions to support fast analytics over video data [noscope, tahoma, focus, blazeit, pakha18reinvent, chameleon].
Video analytics is an emerging research area at the intersection of databases and computer vision. It is, however, still in its infancy: even the latest system, BlazeIt [blazeit], supports only simple filter and aggregate queries. In data analytics, Top-K is a very important analytical operation that enables analysts to focus on the most important entities in the data [saying-enough, survey-topk, bruno, top-k-guarantee]. In this paper we present the first set of solutions for efficient Top-K video analytics. Top-K can help rank and identify the most interesting frames/moments in videos. Example use cases include:
Property Valuation. The valuation/rent of a shop is strongly related to its peak foot traffic [foot-traffic]. Instead of manual counting, one can use a camera to capture pedestrian flow and use a machine to identify, say, the top-10 frames (times of the day) with the highest pedestrian counts.
Transport Planning. Severe traffic congestion can occur when multiple bus routes pass through the same narrow expressway. This is because a bus stopping at a stop for passengers blocks the traffic (other buses) behind it. The busiest bus creates a convoy effect, in which a long line of other buses follows its lead. A Top-K query on the moments with the biggest convoys provides valuable information on the congestion caused and supports better bus scheduling.
Data-Driven Agriculture. Global food demand is expected to increase by 70% from 2010 to 2050 [food-demand]. To improve farm productivity, the FarmBeats project at Microsoft sends drones to collect and analyze farm videos [farmbeats]. Recent news also reports that oil-palm farmers in Malaysia send drones to monitor the growth of oil palms [raghu_2019]. With oil-palm plantations spread across 86,100 square miles in Malaysia, farmers with limited resources can inspect only a small number of fields onsite each day. Finding the Top-K fields (e.g., by the number of well-grown palm trees) over drone videos can drastically help farmers prioritize field trips.
In video analytics, CNN specialization and Approximate Query Processing (AQP) are two popular techniques for overcoming the inefficiency of the scan-and-test method [noscope, tahoma, focus, blazeit]. With CNN specialization, a video analytics system trains a lightweight model specifically designed for certain query type(s) (e.g., "select all frames with dogs"), using frames sampled from the video-of-interest as training data. Since the lightweight model is highly specific to the query instance and the video-of-interest, it is efficient both to train and to use (though generally less accurate). The lightweight model is then applied to all video frames to obtain a set of candidate frames that are likely to satisfy the given query. Finally, the candidate frames are passed to an accurate but less efficient object detector (e.g., YOLOv3 [yolo3], Mask R-CNN [he2017mask]) to verify their validity. Approximate Query Processing (AQP) [blazeit, BlinkDB] is an orthogonal technique for fast analytics when users are willing to tolerate some degree of error. The complex interplay between deep model inference and Top-K query processing, however, creates novel challenges for CNN specialization and AQP.
First, for CNN specialization, new lightweight models need to be designed for handling Top-K queries, which differ from traditional query types such as selection. Second, traditional Top-K algorithms process objects with definite values; they are generally inapplicable to video analytics when lightweight models are used, because those models' outputs are imprecise and probabilistic. Third, classic AQP techniques aim at determining confidence intervals for statistical aggregate queries (i.e., point estimates). The answers to Top-K queries, however, are set-based (e.g., sets of Top-K frames), so existing AQP algorithms are inapplicable. Fourth, video data has temporal locality; Top-K queries can be frame-based or time-window-based. For example, one could be interested in finding the Top-K 5-second clips with the highest number of vehicles. This adds to the complexity of answering Top-K queries in video analytics.
To address these challenges, we present Everest, a system that empowers users to answer Top-K queries over video based on any given scoring (ranking) function. Everest presents results with probabilistic guarantees. Everest's probabilistic outputs play in unison with video analytics, where data uncertainty is prevalent. This is in sharp contrast to existing systems, in which valuable information in uncertain data is not fully leveraged. To illustrate, Table 0(a) shows an example output of the lightweight model used in BlazeIt [blazeit]. The car count in each traffic video frame is best given by a probability distribution, which captures the uncertainty information of the object detector. Existing systems, however, process queries based on a simplified view of the table in which less likely cases are dropped and much of the uncertainty information is discarded (see Table 0(b)). Everest, in contrast, is designed to treat uncertain data as a first-class citizen: query processing in Everest is probabilistic in nature. Furthermore, Everest is extensible to support windowing, so users can ask Top-K-window queries as well.
Supporting Top-K analytics over videos enables rich analyses over the visual world, just as traditional Top-K query processing has done over relational data. Everest draws heavily on uncertain data management techniques (e.g., [uncertain_db_1, uncertain_db_2, uncertain_db_3]), but with its own novelty and uniqueness because of the availability of ground-truth at run-time. Specifically, uncertain query processing has assumed no access to the ground-truth at run-time. By contrast, video analytics can access an accurate object detector to obtain the ground-truth and reduce the uncertainty at run-time. That essentially opens a new class of uncertain query processing problems in which the ground-truth is in the loop. We evaluate Everest on six real video streams and the latest Visual Road benchmark [visualroad]. We show that Everest achieves 10.8x to 17.9x speedups over the baseline approach. In summary, we make the following contributions:
We design and present a novel notion of Top-K analytics over videos. Our design is the first to treat uncertain data as a first-class citizen in video analytics.
We present Everest, the first video analytics system that supports Top-K queries with probabilistic guarantees. The system further includes extensions to find Top-K windows instead of Top-K frames.
We introduce efficient algorithms and implementations for each system module. Our algorithms overcome the combinatorial explosion in the number of possible worlds commonly found in uncertain query processing. Our implementation is highly optimized, taking into account various practical efficiency issues such as numerical stability and GPU utilization.
The remainder of this paper is organized as follows. Section 2 provides background of our study. Section 3 defines Top-K queries supported by Everest. Sections 4 and LABEL:sec:implementation present Everest’s algorithms and implementation. Section LABEL:sec:exp gives evaluation results. Section LABEL:sec:related discusses related work. Finally, Section LABEL:sec:conclusion concludes the paper.
In this section we provide brief background on object detection by CNN, video analytics systems, Top-K query processing, and uncertain query processing. Readers who are familiar with those topics may skip this section. We will further discuss their recent developments in Section LABEL:sec:related.
CNN. Object detection is an important problem in computer vision (CV): the problem is to identify object occurrences in images. Large volumes of training data have become available nowadays, enabling modern techniques, especially convolutional neural networks (CNNs), to achieve near-human or even better-than-human accuracy [prelu]. However, these deep models are rather complex, with millions or even billions of parameters. Answering queries (i.e., inference) with these models is thus computationally very expensive. For example, Mask R-CNN and YOLOv3, two state-of-the-art models, process video at 5 fps and 30 fps, respectively, on machines equipped with GPUs [mrcnn].
|ts||class||polygon||objectID|
|01-01-2019:23:05||Human||(10, 50), (30, 40), …||16|
|01-01-2019:23:05||Bus||(45, 58), (66, 99), …||58|
|01-01-2019:23:06||Human||(20, 80), (7, 55), …||16|
|01-01-2019:23:06||Car||(6, 91), (10, 55), …||59|
|01-01-2019:23:06||Car||(78, 91), (40, 55), …||60|
Video Analytics. The huge volume of video data poses great challenges to video analytics research [noscope, focus, scanner, tahoma, videostorm, blazeit]. For analytical processing, video data is often modeled as relations, each of which captures the object occurrences in a specific video clip [blazeit]. Specifically, each tuple in a video relation corresponds to a single object in a video frame. Since a frame may contain zero or more objects (of interest) and an object may appear in multiple frames, a frame can be associated with zero or more tuples in a relation and an object can be associated with multiple tuples. Typical attributes of a tuple include a frame timestamp (ts); a unique id of an identified object (objectID); and the object's class label (class), bounding polygon (polygon), raw pixel content (content), and feature vector (features). A video relation can be materialized by invoking an object detector on each frame to compute tuple values. Table 2 shows an example of a materialized video relation. The objectID attribute can be populated by invoking an object tracker (e.g., [zhang2011video]), which takes as input two polygons from two consecutive frames and returns the same objectID if the two polygons enclose the same object. In video analytics, a video relation materialized by an accurate object detector such as YOLOv3 is regarded as the system's ground-truth [focus]. However, fully materializing a ground-truth relation is computationally expensive. Therefore, the key challenge is to answer analytical queries without a fully materialized ground-truth relation [noscope, focus, scanner, tahoma, videostorm, blazeit]. This constraint distinguishes our work from previous video database studies [metadata, region-based], which assume that video relations are already given, presumably via some external means (such as human annotations).
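To make the schema concrete, the following sketch models one tuple of a video relation as a Python dataclass. The attribute names follow the schema described above; the class name and sample values are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical sketch of one tuple in a video relation. Attribute names
# (ts, objectID, class, polygon) mirror the schema described in the text;
# content (raw pixels) and features (embedding) are omitted for brevity.
@dataclass
class VideoTuple:
    ts: str                              # frame timestamp
    object_id: int                       # unique id assigned by a tracker
    cls: str                             # class label, e.g. "Car"
    polygon: List[Tuple[float, float]]   # bounding polygon vertices

# One row of the example relation in Table 2 (polygon truncated as shown).
row = VideoTuple("01-01-2019:23:06", 59, "Car", [(6, 91), (10, 55)])
```

A full relation is then simply a list of such tuples, one per detected object per frame.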
CNN specialization is a key technique for speeding up video analytics [shen2017fast, noscope, blazeit, focus]. Inspired by the concept of "cascade" [cascade] in computer vision, the idea is to use frames sampled from the video-of-interest to train a lightweight CNN (e.g., with fewer neurons and layers) as a proxy for an expensive ground-truth object detector (e.g., Mask R-CNN). Training specialized CNNs is significantly faster because there are fewer neural layers and a specific context. For example, training a CNN with only frames from a traffic video converges much faster than in a general setting because there are far fewer object classes to consider (e.g., cars, bicycles, pedestrians), as opposed to a general object detector that has to distinguish thousands of classes. Since specialized CNNs are lightweight, they are also much faster to execute: it has been reported that specialized lightweight CNNs can infer at 10,000 fps with varying degrees of accuracy loss [blazeit]. However, a specialized CNN is data-specific and thus not generalizable to other videos. Also, lightweight CNNs may produce false positives or false negatives. As a remedy, systems like BlazeIt [blazeit] use statistical techniques to bound result errors.
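The cascade idea above can be sketched in a few lines. This is a minimal illustration, not the API of any real system: `proxy` and `gtod` are placeholder callables standing in for the lightweight CNN and the ground-truth detector.

```python
# A minimal sketch of the "cascade" behind CNN specialization: a cheap proxy
# model screens every frame, and only frames the proxy flags are sent to the
# expensive ground-truth detector. `proxy` and `gtod` are hypothetical
# callables, not real model interfaces.
def cascade_filter(frames, proxy, gtod, threshold=0.5):
    candidates = [f for f in frames if proxy(f) >= threshold]  # fast pass on all frames
    return [f for f in candidates if gtod(f)]                  # slow verification pass

# Toy usage: frames are integers, the proxy "score" is the value scaled down,
# and the ground-truth detector accepts even numbers only.
frames = list(range(10))
result = cascade_filter(frames, proxy=lambda f: f / 10, gtod=lambda f: f % 2 == 0)
```

The key efficiency property is that `gtod` runs only on the (hopefully small) candidate set rather than on every frame.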
Top-K Query Processing. In many application domains (e.g., information retrieval [bruno], multimedia databases [fagin-threshold], and spatial databases [topk-spatial]), users are most interested in the K most important tuples ordered by a scoring (ranking) function. The essence of efficient Top-K query processing is to wisely schedule data accesses so that most of the irrelevant (non-Top-K) items are skipped. This minimizes expensive data accesses (e.g., from disk or via the web) and provides early-stopping criteria to avoid unnecessary computation.
Uncertain Query Processing. One common uncertain data representation is the "x-tuple" [aggarwal2009trio]. An uncertain relation is a collection of x-tuples, each of which consists of a number of alternative outcomes associated with their corresponding probabilities. Together, the alternatives form a discrete probability distribution of the true outcome. Different x-tuples are assumed to be independent of each other. We discuss later how Everest uses a difference detector to approximate that independence, so that the outputs of a lightweight CNN can be represented by the x-tuple model (i.e., the inference result of a video frame is captured by one x-tuple). Hence, in the following discussion, we use the terms x-tuple, frame, and timestamp interchangeably.
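The x-tuple model can be sketched as follows: each frame's score is a small discrete distribution, and distinct x-tuples are treated as independent. The frame ids and probabilities below are invented for illustration.

```python
# A hedged sketch of the x-tuple representation: one x-tuple per frame,
# mapping each alternative score to its probability. Frame ids and
# probability values here are made up for illustration.
xtuples = {
    "frame_001": {2: 0.7, 3: 0.3},          # P(score=2)=0.7, P(score=3)=0.3
    "frame_002": {0: 0.1, 1: 0.5, 2: 0.4},
}

def is_valid_xtuple(dist, tol=1e-9):
    """An x-tuple's alternatives must form a probability distribution."""
    return abs(sum(dist.values()) - 1.0) < tol and all(p >= 0 for p in dist.values())
```

Independence across x-tuples is what lets the probability of a possible world factor into a product over frames, which the algorithms in Section 4 exploit.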
3 Top-K in Everest
Everest allows users to set a per-query probability threshold, $thres$, to ensure that the returned Top-K result has at least $thres$ probability of being the exact answer. Given an uncertain relation $\mathcal{D}$ (Section 4.1 discusses how to obtain it) and a scoring function $s$ (e.g., the number of cars in a frame $f$), Everest returns a Top-K result $\hat{R}$ with confidence $\Pr(\hat{R} = R) \ge thres$, where $R$ is the exact result and $thres$ is the probability threshold specified by the user. The probability is defined over an uncertain relation (e.g., Table 0(a)) using the possible world semantics (PWS) [possible_world]. Possible world semantics are widely used in uncertain databases: an uncertain relation is instantiated into multiple possible worlds, each of which is associated with a probability. Table 3 shows 2 possible worlds (out of exponentially many) of Table 0(a). Given the probability of each possible world, the confidence of a Top-K answer $\hat{R}$ is simply the sum of the probabilities of all possible worlds in which $\hat{R}$ is the Top-K:

$$\Pr(\hat{R} = R) = \sum_{w \in PW(\mathcal{D}) \,:\, \hat{R} \text{ is the Top-K of } w} \Pr(w) \qquad (1)$$
Here, $PW(\mathcal{D})$ denotes the set of all possible worlds of an uncertain relation $\mathcal{D}$. In addition, the answer $\hat{R}$ has to satisfy the following condition:
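For intuition, Equation 1 can be evaluated brute-force on a toy uncertain relation: enumerate every combination of alternative scores, weight each world by the product of its alternatives' probabilities (x-tuples are independent), and sum the worlds in which the candidate is Top-1. The frame names and distributions below are toy data, not from the paper; this exponential enumeration is exactly what Everest's algorithms avoid.

```python
from itertools import product

# Brute-force possible-worlds evaluation of Equation 1 for K = 1.
# `xtuples` maps a frame id to a discrete score distribution {score: prob}.
def top1_confidence(xtuples, candidate):
    frames = list(xtuples)
    total = 0.0
    # Each element of the Cartesian product is one possible world.
    for combo in product(*(xtuples[f].items() for f in frames)):
        world = dict(zip(frames, (score for score, _ in combo)))
        prob = 1.0
        for _, p in combo:          # world probability = product of alternatives
            prob *= p
        if all(world[candidate] >= world[f] for f in frames):  # candidate is Top-1
            total += prob
    return total

toy = {"a": {5: 0.9, 1: 0.1}, "b": {4: 0.5, 6: 0.5}}
conf_a = top1_confidence(toy, "a")   # only the world (a=5, b=4) qualifies
```

With $n$ frames of $m$ alternatives each, this enumerates $m^n$ worlds, which motivates the linear-time simplification derived in Section 4.2.1.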
The Certain Result (CR) Condition. The Top-K result $\hat{R}$ has to be chosen from $\mathcal{D}^c$, the subset of frames that are all certain, i.e., frames that have already been verified by a ground-truth object detector (GTOD) and carry no uncertainty.
The certain-result condition is important in video analytics. For instance, the Top-1 result of the uncertain relation in Table 0(a) has a confidence of 0.85 (based on Equation 1). Assuming $thres = 0.8$, the confidence of that result being correct is above the user threshold. However, this result is computed from the uncertain frames in $\mathcal{D}$. As in other cases of uncertain query processing, the ground-truth is unknown during the computation. In video analytics, however, this is undesirable, because a user learns the ground-truth by visually inspecting the frames returned in answer to the query. In the example above, if the user sees no cars in the returned frame, the returned Top-1 answer is unacceptable. The certain-result condition avoids such awkward answers and is essential in ensuring that all tuples in $\hat{R}$ have been confirmed by a GTOD before an answer is returned. With that, the frame in question would not be returned as the Top-1 result, because its probability of being Top-1 is only 0.38 based on the updated uncertain relation in Table 4 and Equation 1, where its score is confirmed using a GTOD (we denote that operation as GTOD($f$)). Finally, we remark that $\Pr(\hat{R} = R) \ge thres$ guarantees not only that the whole result set has at least $thres$ probability of being the true answer, but also that every frame $\hat{f}$ in $\hat{R}$ has at least $thres$ probability of being in the exact result set, because: $\Pr(\hat{f} \in R) \ge \Pr(\hat{R} = R) \ge thres$.
$\Pr(\hat{f} \in R)$ reflects the precision of the Top-K answer, i.e., the fraction of results in $\hat{R}$ that belong to $R$. Therefore, Everest effectively provides guarantees on the precision of its query answers.
(Table 4 columns: timestamp, num of cars, conf.)
4 Everest System
Following recent works [tahoma, focus, noscope], Everest focuses on a batch setting, in which large quantities of video are collected for post-analysis. Online analytics over live video streams is a different setting and is beyond the scope of this paper. In addition, we focus on a bounded set of labels because state-of-the-art object detectors do so. For example, while the pre-trained YOLOv3 can detect cars, it cannot distinguish between a sedan and a hatchback. Users can, however, supply UDFs for further classification if necessary. Supporting an unbounded vocabulary is possible if our lightweight network also outputs embeddings, as in [panorama], but we leave that as our next step and focus on Top-K query processing in this paper. To our knowledge, tracking model drift in visual data has not been well addressed in the literature; we will tackle it once robust techniques for it are developed.
Figure 1 shows the system overview. Everest leverages CNN specialization and uncertain query processing to accelerate Top-K analytics with probabilistic guarantees. Processing a query involves two phases. The first phase trains a lightweight deep convolutional mixture density network (DCMDN) [d2017dcmdn] that outputs a distribution over the scoring function (e.g., the number of cars in a frame), and then samples the DCMDN to build an initial uncertain relation. The first phase can be done offline (i.e., at ingestion time) for standard scoring functions (e.g., counting). The second phase takes as input the uncertain relation from Phase 1 and finds a Top-K result whose confidence is at least $thres$. Initially, it is unlikely that the Top-K result extracted from the initial uncertain relation has confidence over the threshold. Furthermore, we saw in the previous section that given a potential Top-K result $\hat{R}$, we have to confirm its frames using a GTOD for the CR condition; that, however, may conversely give the same $\hat{R}$ a lower confidence based on the updated uncertain relation (e.g., the confidence drops from 0.85 to 0.38). Of course, if the uncertain relation contains no more uncertainty (i.e., all tuples are certain), the Top-K result extracted from it has a confidence of 1. Consequently, Phase 2 can be viewed as Top-K processing via uncertain data cleaning: we selectively "clean" uncertain tuples in the uncertain relation using a GTOD until the Top-K result from the latest uncertain relation satisfies the probabilistic guarantee. For high efficiency, Phase 2 aims to clean as few uncertain tuples as possible, because each cleaning operation invokes the computationally expensive GTOD. Furthermore, the algorithms in Phase 2 have to be carefully designed, because uncertain query processing often introduces an exponential number of possible worlds.
4.1 Phase 1: Building the initial uncertain relation
The crux of CNN specialization is the design of a lightweight model suitable for the specific query type. Prior works [noscope, blazeit] that focus on selection and aggregate queries build classifiers, where each distinct score (e.g., car count) is treated as a standalone class label in the softmax layer. The softmax layer effectively generates a discrete distribution of the number of cars in a frame, as shown in Table 0(a). One drawback of building a classifier is that the number of classes is restricted by the maximum score found in the sampled frames. This is undesirable because it fails when the Top-1 frame is missed in the training frames. We therefore discard that approach. Instead, we train a lightweight deep convolutional mixture density network (DCMDN). Figure 2 shows the architecture of our lightweight DCMDN. It uses YOLOv3-tiny [yolo3] (excluding the output layer) as the backbone to extract features from the input frame, and uses a mixture density network (MDN) to output the parameters of the Gaussians (means and variances) and their weights in the mixture. Compared with typical regression models, the MDN layer provides valuable uncertainty information (e.g., variance) in its output. In order to generate a high-quality uncertain relation (i.e., one with high log-likelihood with respect to the ground-truth), we follow a recent work [occlusion] to build the MDN layer, where the neural network first generates a set of hypotheses as possible predictions. By regarding the hypotheses as samples drawn from a Gaussian mixture distribution, the parameters of the distribution are then estimated. This method is reported to overcome the limitations of traditional MDNs, including degenerate predictions and overfitting.
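To illustrate what the MDN head produces, the sketch below draws score samples from a Gaussian mixture given its weights, means, and standard deviations. The parameter values are invented for illustration, not real MDN outputs.

```python
import random

# A minimal sketch of interpreting an MDN head's output: the network emits
# mixture weights, means, and variances, and we can draw score samples from
# the implied Gaussian mixture. All parameter values below are hypothetical.
def sample_mixture(weights, means, stds, rng):
    u, acc = rng.random(), 0.0
    for w, m, s in zip(weights, means, stds):   # pick a component by weight
        acc += w
        if u <= acc:
            return rng.gauss(m, s)              # then sample that Gaussian
    return rng.gauss(means[-1], stds[-1])       # guard against rounding

rng = random.Random(42)
# Two-component mixture: 60% mass around score 3, 40% around score 8.
samples = [sample_mixture([0.6, 0.4], [3.0, 8.0], [1.0, 0.5], rng) for _ in range(1000)]
```

The sample mean should hover near the mixture mean 0.6*3 + 0.4*8 = 5.0; Everest uses such samples (after truncation and discretization, described next) to populate the uncertain relation.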
The training data of the DCMDN are frames randomly sampled from the video-of-interest, with real scores (e.g., the number of cars) obtained from a ground-truth object detector (GTOD). Since both the network and the training data are small, training converges within several minutes. After the DCMDN is trained, Everest uses a difference detector to discard near-duplicate frames from the video. This step serves two purposes. First, frames with little difference are not informative in the Top-K context (e.g., see Figure 3). Second, it approximates independence among frames, which enables the use of x-tuples to model the data. After that, Everest feeds the remaining unique frames to the DCMDN to obtain the score distribution of each frame. Finally, Everest draws samples from the DCMDN outputs to build the uncertain relation. An x-tuple captures a discrete distribution, but a Gaussian mixture is continuous, with infinitely long tails on both ends. To obtain a finite uncertain relation, we follow [truncated_gaussian] and replace the Gaussians with truncated Gaussians, so that the probability mass beyond the truncation range is set to zero and evenly redistributed over the rest. After that, we populate the uncertain relation by discretizing the truncated Gaussian mixture. For counting-based scoring functions, the score distribution is discretized to non-negative integers; for others, users have to provide the discretization step size when defining the scoring function. For frames that were already labeled by the GTOD when training the DCMDN, their scores are known and certain, and they are inserted into the uncertain relation directly, with no uncertainty. (Nevertheless, we still call the result an "uncertain relation".)
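The discretization step can be sketched as follows for the counting case: evaluate the mixture density at non-negative integers up to a cutoff and renormalize so the result is a valid x-tuple. The cutoff and all parameter values are illustrative choices, and the renormalization here is a simplification of the truncated-Gaussian redistribution described above.

```python
import math

# A hedged sketch of Phase 1's discretization: evaluate a Gaussian mixture
# at non-negative integers (the counting case), truncate at `max_score`,
# and renormalize so the masses sum to 1. Parameters are illustrative.
def discretize_mixture(weights, means, stds, max_score):
    def pdf(x, m, s):
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    mass = {k: sum(w * pdf(k, m, s) for w, m, s in zip(weights, means, stds))
            for k in range(max_score + 1)}
    total = sum(mass.values())
    return {k: v / total for k, v in mass.items()}   # renormalize to a pmf

# Toy mixture: most mass around a count of 2, some around 5.
dist = discretize_mixture([0.7, 0.3], [2.0, 5.0], [0.8, 1.0], max_score=10)
```

The resulting dictionary is exactly the discrete distribution stored in one x-tuple of the uncertain relation.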
4.2 Phase 2: Top-K processing via uncertain data cleaning
Figure 1 (right) outlines the flow of Phase 2. Starting from the uncertain relation given by Phase 1, it iteratively selects the best frame (via the Select-candidate function) to clean using the GTOD, until a Top-K result $\hat{R}_i$ based on the latest relation $\mathcal{D}_i$ has a confidence (computed by the Topk-prob function) exceeding the user threshold $thres$.
Finding a Top-K result from the latest updated relation $\mathcal{D}_i$ (by the Top-K() function) is straightforward: it simply extracts all certain tuples from $\mathcal{D}_i$ (because of the CR condition) and applies an existing Top-K algorithm (e.g., [saying-enough]) to find the Top-K as $\hat{R}_i$. The function Topk-prob, which computes the probability of $\hat{R}_i$ being the exact result, and the function Select-candidate, which selects the most promising frame, are more challenging, because they involve an exponential number of possible worlds. Note that techniques from traditional uncertain data cleaning [clean-topk, eric-ee] are not applicable here because of differences in the problem setting. Those cleaning problems have a constraint on the "budget", i.e., the number of x-tuples that can be cleaned, and their objective is to minimize the entropy of all possible Top-K results from the updated table. Since their setting is to return a batch of x-tuples for a person to carry out a one-off manual cleaning operation, using entropy allows them to measure answer quality solely from the uncertain data, as the results of the cleaning operations are not available at run-time. By contrast, our constraint is to pass the probability threshold, and our objective is to minimize the cost of invoking the GTOD (and the algorithm overhead). For us, the GTOD provides online feedback, which we can put into the equation to produce a probabilistic guarantee (between 0 and 1) that is far more intuitive than entropy (which ranges from 0 to $\infty$).
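The overall Phase 2 loop can be sketched as follows. This is a high-level illustration only: `select_candidate`, `topk_prob`, and `gtod` are placeholders for the functions developed in the rest of this section, `uncertain` maps frame ids to score distributions, and `certain` maps frame ids to verified scores.

```python
# A high-level sketch of Phase 2: Top-K processing via uncertain data
# cleaning. All callables are hypothetical placeholders.
def phase2(uncertain, certain, gtod, k, thres, select_candidate, topk_prob):
    while True:
        # Top-K() draws only from certain tuples, per the CR condition.
        topk = sorted(certain, key=certain.get, reverse=True)[:k]
        if len(topk) == k and topk_prob(topk, certain, uncertain) >= thres:
            return topk                     # probabilistic guarantee met
        f = select_candidate(uncertain)     # most promising frame to clean
        certain[f] = gtod(f)                # expensive ground-truth call
        del uncertain[f]                    # frame f is now certain
```

Each iteration pays one GTOD invocation, so the quality of `select_candidate` directly determines how quickly the loop terminates.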
|$\mathcal{D}_i$||Uncertain relation at iteration $i$|
|$\mathcal{D}_i^c$||Subset of $\mathcal{D}_i$ whose x-tuples are all certain|
|$\mathcal{D}_i^u$||Subset of $\mathcal{D}_i$ whose x-tuples are all uncertain|
|$\hat{R}$||Approximate Top-K result|
|$\hat{R}_i$||$\hat{R}$ obtained in the $i$-th iteration|
|$f$||Frame / x-tuple|
|GTOD||Ground-truth object detector|
|$f_K$||The frame ranked $K$-th in $\hat{R}_i$|
|$f_{K-1}$||The frame ranked penultimately in $\hat{R}_i$|
In the following, we discuss efficient algorithms for implementing the Topk-prob and Select-candidate functions. Table 5 summarizes the major notation used.
4.2.1 Implementing Topk-prob($\mathcal{D}_i$, $\hat{R}_i$)
Given a potential result $\hat{R}_i$ extracted from $\mathcal{D}_i^c$, computing its confidence via Equation 1 has to expand all $O(m^n)$ possible worlds, where $n$ is the number of frames in $\mathcal{D}_i$ and each frame is assumed to have $m$ possible scores in the uncertain relation. However, given the Certain Result (CR) condition, we can simplify the calculation of $\Pr(\hat{R}_i = R)$ as:

$$\Pr(\hat{R}_i = R) = \prod_{f \in \mathcal{D}_i^u} \Pr\big(s(f) \le s(f_K)\big) \qquad (2)$$

where (a) $s(f)$ stands for the score of a frame $f$; (b) $f_K$ is the "threshold" frame that ranks $K$-th in $\hat{R}_i$, whose score is known and certain (because $\hat{R}_i$ is drawn from $\mathcal{D}_i^c$); and (c) $\mathcal{D}_i^u$ is the set of frames in $\mathcal{D}_i$ that are still uncertain, i.e., $\mathcal{D}_i^u = \mathcal{D}_i \setminus \mathcal{D}_i^c$.
Computing Equation 2 requires time only linear in $|\mathcal{D}_i^u|$. Equations 1 and 2 are equivalent because the probability of $\hat{R}_i$ being the Top-K equals the probability that no frame in $\mathcal{D}_i^u$ has a score larger than the frames in $\hat{R}_i$. (We allow frames in $\mathcal{D}_i^u$ to have scores that tie with the threshold frame $f_K$.)
A further optimization is to compute two functions before Phase 2 begins: (a) the CDF of the score of each frame $f$, i.e., $F_f(x) = \Pr(s(f) \le x)$; and (b) a function $G(x) = \prod_{f \in \mathcal{D}_0^u} F_f(x)$, which is the joint CDF of all uncertain frames in $\mathcal{D}_0$. With them, in the $i$-th iteration, we can compute $\Pr(\hat{R}_i = R)$ as follows:

$$\Pr(\hat{R}_i = R) = \frac{G\big(s(f_K)\big)}{\prod_{f \in C_i} F_f\big(s(f_K)\big)} \qquad (3)$$

where $C_i \subseteq \mathcal{D}_0^u$ is the set of frames cleaned by the GTOD so far.
Equations 2 and 3 are equivalent because, by definition, $\mathcal{D}_i^u = \mathcal{D}_0^u \setminus C_i$ and $G(x) = \prod_{f \in \mathcal{D}_0^u} F_f(x)$. $F_f(x)$ and $G(x)$ for all $f$ and $x$ can easily be pre-computed one-off at a cost of $O(nm)$. With Equation 3, Topk-prob($\mathcal{D}_i$, $\hat{R}_i$) in the $i$-th iteration computes $\Pr(\hat{R}_i = R)$ in $O(|C_i|)$ time instead, where $C_i$ is the set of frames cleaned so far and $|C_i| \ll |\mathcal{D}_i^u|$.
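Equation 2 can be illustrated directly on toy x-tuples: the confidence is the product, over all still-uncertain frames, of each frame's CDF evaluated at the threshold score. The data below is invented, and for clarity this sketch computes the product directly rather than via the precomputed $G$ of Equation 3.

```python
from functools import reduce

# An illustrative implementation of Equation 2. `uncertain` maps frame ids
# to discrete score pmfs {score: prob}; `threshold_score` is s(f_K), the
# (certain) score of the K-th ranked frame in the candidate result.
def topk_prob(uncertain, threshold_score):
    def cdf(dist, x):                       # F_f(x) = Pr(s(f) <= x)
        return sum(p for score, p in dist.items() if score <= x)
    probs = [cdf(dist, threshold_score) for dist in uncertain.values()]
    return reduce(lambda a, b: a * b, probs, 1.0)

# Toy data: f1 beats the threshold of 5 with probability 0.1, f2 never does.
toy = {"f1": {3: 0.9, 7: 0.1}, "f2": {4: 0.5, 5: 0.5}}
conf = topk_prob(toy, threshold_score=5)   # 0.9 * 1.0
```

Note that ties with the threshold score count in the result's favor (the `<=` in the CDF), matching the tie-handling convention stated above.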
As a remark, $\Pr(\hat{R}_i = R)$ actually improves exponentially with the number of frames cleaned, according to Equation 2. Therefore, we expect Phase 2 to spend more iterations reaching a small probability threshold, say 0.5, but fewer iterations to reach any probability threshold beyond that.
4.2.2 Implementing Select-candidate($\mathcal{D}_i$)
Select-candidate($\mathcal{D}_i$) is the function that selects a frame from the set of uncertain frames $\mathcal{D}_i^u$ in the $i$-th iteration on which to apply the GTOD, such that the cleaning maximizes $\Pr(\hat{R}_{i+1} = R)$ in the next iteration; hopefully, $\Pr(\hat{R}_{i+1} = R) \ge thres$ afterwards, so that Phase 2 can stop early.
Of course, $\Pr(\hat{R}_{i+1} = R)$ is unknown before we apply GTOD($f$). Therefore, we use a random variable $P_{i+1}^f$ to denote the value of $\Pr(\hat{R}_{i+1} = R)$ after a frame $f$ is cleaned. To maximize it, we aim to find $f^* = \arg\max_{f \in \mathcal{D}_i^u} E[P_{i+1}^f]$. Using $P_{i+1}^{f=v}$ to denote the value of $P_{i+1}^f$ when $s(f) = v$, where $v$ is a particular score, $E[P_{i+1}^f]$ is thus:

$$E[P_{i+1}^f] = \sum_{v} \Pr\big(s(f) = v\big) \cdot P_{i+1}^{f=v} \qquad (4)$$
[Efficient computation of $P_{i+1}^{f=v}$] $P_{i+1}^{f=v}$ is the probability of the result being the Top-K based on $\mathcal{D}_i$, where the x-tuple representing $f$ in $\mathcal{D}_i$ is assumed to be cleaned and its score is $v$ and certain. Therefore, $P_{i+1}^{f=v}$ can be calculated on top of $\Pr(\hat{R}_i = R)$ (Equation 3) by removing the uncertainty of $f$, based on how the actual score $v$ of $f$ influences the Top-K result:

$$P_{i+1}^{f=v} = \begin{cases} \prod_{f' \in \mathcal{D}_i^u \setminus \{f\}} F_{f'}\big(s(f_K)\big) & \text{if } v \le s(f_K) \\ \prod_{f' \in \mathcal{D}_i^u \setminus \{f\}} F_{f'}(v) & \text{if } s(f_K) < v \le s(f_{K-1}) \\ \prod_{f' \in \mathcal{D}_i^u \setminus \{f\}} F_{f'}\big(s(f_{K-1})\big) & \text{if } v > s(f_{K-1}) \end{cases} \qquad (5)$$
The idea of Equation 5 is as follows:
When $v \le s(f_K)$, frame $f$ does not qualify for the Top-K; the Top-K result does not change and the threshold score is still $s(f_K)$, so discounting the uncertainty of $f$ suffices.
When $s(f_K) < v \le s(f_{K-1})$, frame $f$ enters the Top-K, but its score is lower than that of the penultimate frame in the Top-K result, i.e., the one ranked $(K-1)$-st; frame $f$ thus takes the $K$-th rank, and the new "threshold" score becomes $v$.
When $v > s(f_{K-1})$, frame $f$ enters the Top-K with a score greater than that of the original penultimate frame; the new threshold frame is $f_{K-1}$, and the new threshold score becomes $s(f_{K-1})$.
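The three-case analysis above can be sketched directly: for each uncertain frame $f$, enumerate its possible scores $v$, pick the appropriate new threshold score per case, and take the probability-weighted average (Equation 4). The helper names, the scores `s_fk` and `s_fk1` (standing for $s(f_K)$ and $s(f_{K-1})$), and all toy data are illustrative assumptions, not the paper's actual implementation.

```python
# An illustrative sketch of Equations 4 and 5. `uncertain` maps frame ids to
# discrete score pmfs; `s_fk` and `s_fk1` are the (certain) scores of the
# threshold frame f_K and the penultimate frame f_{K-1}.
def expected_confidence(uncertain, f, s_fk, s_fk1):
    def cdf(dist, x):                     # F_f(x) = Pr(s(f) <= x)
        return sum(p for score, p in dist.items() if score <= x)
    others = {g: d for g, d in uncertain.items() if g != f}
    def prod_cdf(x):                      # joint CDF over the other uncertain frames
        out = 1.0
        for d in others.values():
            out *= cdf(d, x)
        return out
    exp = 0.0
    for v, p in uncertain[f].items():     # Equation 4: average over scores v
        if v <= s_fk:                     # case 1: f stays out of the Top-K
            exp += p * prod_cdf(s_fk)
        elif v <= s_fk1:                  # case 2: f becomes the new K-th frame
            exp += p * prod_cdf(v)
        else:                             # case 3: f_{K-1} becomes the K-th frame
            exp += p * prod_cdf(s_fk1)
    return exp

def select_candidate(uncertain, s_fk, s_fk1):
    # Pick the frame whose cleaning maximizes the expected confidence.
    return max(uncertain, key=lambda f: expected_confidence(uncertain, f, s_fk, s_fk1))
```

In this sketch, cleaning the frame with the widest, highest-reaching distribution tends to win, since resolving it removes the most uncertainty about the threshold.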