ExSample: Efficient Searches on Video Repositories through Adaptive Sampling

05/19/2020 ∙ by Oscar Moll, et al. ∙ MIT

Capturing and processing video is increasingly common as cameras and networks improve and become cheaper. At the same time, algorithms for rich scene understanding and object detection have progressed greatly in the last decade. As a result, many organizations now have massive repositories of video data, with applications in mapping, navigation, autonomous driving, and other areas. Because state-of-the-art vision algorithms to interpret scenes and recognize objects are slow and expensive, our ability to process even simple ad-hoc selection queries ('find 100 example traffic lights in dashboard camera video') over this accumulated data lags far behind our ability to collect it. Sampling image frames from the videos is a reasonable default strategy for these types of queries; however, the ideal sampling rate is both data and query dependent. We introduce ExSample, a low-cost framework for ad-hoc, unindexed video search which quickly processes selection queries by adapting the amount and location of sampled frames to the data and the query being processed. ExSample prioritizes which frames within a video repository are processed in order to quickly identify portions of the video that contain objects of interest. ExSample continually re-prioritizes which regions of video to sample from based on feedback from previous samples. On large, real-world video datasets ExSample reduces processing time by up to 4x over an efficient random sampling baseline and by several orders of magnitude versus state-of-the-art methods which train specialized models for each query. ExSample is a key component in building cost-efficient video data management systems.




1 Introduction

Video cameras have become incredibly affordable over the last decade, and are ubiquitously deployed in static and mobile settings, such as smartphones, vehicles, surveillance cameras, and drones. These video datasets are enabling a new generation of applications. For example, video data from vehicle dashboard-mounted cameras (dashcams) is used to train object detection and tracking models for autonomous driving systems [bdd], to annotate map datasets like OpenStreetMap with locations of traffic lights, stop signs, and other infrastructure [osmblog], and to analyze the scene of collisions to automate insurance claims processing [nexar].

However, these applications must process large amounts of video to extract useful information. Consider the basic task of finding examples of traffic lights (for example, to annotate a map) within a large collection of dashcam video collected from many vehicles. The most basic approach to evaluate this query is to run an object detector frame by frame over the dataset. Because state-of-the-art object detectors run at about 10 frames per second (fps) on state-of-the-art GPUs, one third of the typical video recording rate of 30 fps, scanning through a collection of 1000 hours of video with a detector on a GPU would take 3x that time: 3000 GPU hours. In the offline query case, which is the case we focus on in this paper, we can parallelize our scan over the video across many GPUs, but, as the rental price of a GPU is around $3 per hour [aws], our bill for this one ad-hoc query would be on the order of $10K regardless of parallelism. Hence, this workload presents challenges in both time and monetary cost. Note that accumulating 1000 hours of video represents just 10 cameras recording for less than a week.
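The arithmetic in the paragraph above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name and its parameters are hypothetical, and the default rates are the round numbers quoted in the text (30 fps video, a 10 fps detector, $3 per GPU hour).

```python
def full_scan_cost(video_hours, video_fps=30.0, detector_fps=10.0, usd_per_gpu_hour=3.0):
    """Return (gpu_hours, usd) for running the detector on every frame."""
    frames = video_hours * 3600 * video_fps          # total frames in the collection
    gpu_hours = frames / detector_fps / 3600         # detector time, in hours
    return gpu_hours, gpu_hours * usd_per_gpu_hour


gpu_hours, usd = full_scan_cost(1000)
print(f"{gpu_hours:.0f} GPU hours, about ${usd:,.0f}")
```

With these inputs the scan takes 3000 GPU hours and costs several thousand dollars, the same order of magnitude as the rough figure quoted in the text.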

A practical means for coping with this issue is to skip frames: for example, only run object detection on one frame for every second of video. After all, we might think it reasonable to assume all traffic lights are visible for longer than that, and the savings are large compared to inspecting every frame: processing only one frame every second decreases costs by 30x for a video recorded at 30 fps. Unfortunately, this strategy has limitations. First, the 1 frame out of 30 that we look at may not show the light clearly, causing the detector to miss it completely, while the neighboring frames may show it more clearly. Second, lights that remain visible in the video for a long time, like 30 seconds, would be seen multiple times unnecessarily. Worse, for other types of objects that remain visible for shorter times the appropriate sampling rate is unknown, and will vary across datasets depending on factors such as whether the camera is moving or static, or the angle and distance to the object.

In this paper we introduce ExSample, a video sampling technique designed to reduce the number of frames that need to be processed by an expensive object detector for search queries on video datasets. ExSample frames this problem as one of deciding which frame from the dataset to look at next based on what it has seen in the past. ExSample starts by conceptually splitting the dataset into temporal chunks (e.g., half-hour chunks), and reduces the problem to deciding which chunk to sample from next. As it does this, ExSample keeps a per-chunk estimate of the probability of finding a new result if the next frame we process through the object detector were sampled randomly from that chunk. As it samples more frames, ExSample's estimates become more accurate.

Recent related work, such as probabilistic predicates [msr], NoScope [noscope], and BlazeIt [blazeit], partially overlaps with ExSample in its aim of reducing the cost of processing a variety of queries over video. At a high level, these systems approach the problem by training cheaper surrogate models for each query, which they use to approximate the behavior of the object detector. They then prioritize which frames to actually inspect based on how the surrogate scores them. This general approach can yield large savings in specific scenarios.

However, approaches relying on training cheap surrogates have two important shortcomings in the context of ad-hoc object queries, especially when the number of desired results is limited. The first has to do with the extra work needed: for highly selective queries, seeking objects that appear rarely in video requires building and labelling a training set ahead of time, which can be as hard as solving the search problem in the first place. Conversely, for common objects that appear frequently throughout the dataset, the surrogate models introduce an additional inference cost that outweighs the limited savings they provide. Finally, when users only need results up to a fixed number, such as in a limit query or when building a training set, surrogate based approaches still require an upfront dataset scan in order to score the video frames in the dataset, which can be more expensive than simply sampling frames randomly. Unlike existing work, ExSample imposes no preprocessing overhead.

The second shortcoming is that surrogate-based approaches do not address the user's need to avoid near-duplicate results in search queries over video. ExSample is designed to give higher weight to areas of video likely to have results which are both new and different, rather than areas where the object detector would score high. A key challenge here is that, to be general, ExSample makes no assumptions about how long objects remain visible on screen and how often they appear. Instead, ExSample is guided by feedback from the outputs of the object detector on previous samples.

Our contributions are 1) the adaptive sampling algorithm ExSample, which facilitates ad-hoc searches over video repositories, 2) a formal analysis justifying ExSample’s design, and 3) an empirical evaluation showing ExSample is effective on real datasets and under real system constraints, and outperforms existing approaches to the search problem.

We evaluate ExSample on a variety of search queries spanning different objects, different kinds of video, and different numbers of desired results. We show savings in the number of frames processed ranging from 1.1x to 4x, with a geometric average of 2x across all settings, in comparison to an efficient random sampling baseline. Additionally, in comparison to a surrogate-model-based approach inspired by BlazeIt [blazeit], our method processes fewer frames to find the same number of distinct results in many cases, and in the remaining cases ExSample still requires one to two orders of magnitude less wall-clock time, because ExSample does not require an upfront preprocessing phase and so avoids the preprocessing costs of surrogate-based approaches.

2 Background

In this section we review object detection, introduce distinct object queries as opposed to plain object queries, explain our main cost evaluation metric: frames processed by the object detector, and justify our main baseline: random sampling.

2.1 Object detection

An object detector is an algorithm that operates on still images, taking an image frame as input and outputting a set of boxes within the image containing the objects of interest. The number of objects found can range from zero to arbitrarily many. Well-known examples of object detectors include YOLO [yolov3] and Mask R-CNN [maskrcnn].

In Figure 1, we show two example frames that have been processed by an object detector, with the traffic light detections surrounded by boxes.

Object detectors with state of the art accuracy in benchmarks such as COCO[coco] typically process around 10 frames per second on modern hardware, though it is possible to achieve real time rates by sacrificing accuracy [tfmodels, yolov3].

In this paper we do not seek to improve on state-of-the-art object detection approaches. Instead, we treat object detectors as a black box with a costly runtime, and aim to substantially reduce the number of video frames processed by the detector.

2.2 Distinct object queries

In this paper we are interested in processing higher-level queries on video enabled by the availability of object detectors. In particular, we are concerned with object queries such as “find 20 traffic lights in my dataset” over collections of multiple videos from multiple cameras. In natural video, any object of interest lingers within view over some length of time. For example, the frames in Figure 1 contain the same traffic light a few seconds apart. While either frame would be an acceptable result for our traffic light query, an application such as OpenStreetMap would not benefit more from having both frames. We are therefore interested in returning distinct object results, and we refer to this specific variation as a distinct object query. Similarly, for an application such as constructing a training set for a classifier or detector, richer sets of diverse examples are preferable to near duplicates. Note that an application could always find near duplicates if desired by traversing the video backwards and forwards starting from a given result; the more difficult part is finding initial results in the first place.

Figure 1: Two video frames showing the same traffic light instance several seconds apart. A distinct object query is defined by having these two boxes only count as one result.

The goal of this paper is to reduce the cost of processing such queries. Moreover, we want to do this on ad-hoc distinct object queries on ad-hoc datasets, where there are diverse videos in our dataset and where object detections have not been computed ahead of time for the type of objects we are looking for. This distinction affects multiple decisions in the paper, including not only the results we return but also the main design of ExSample and how we measure result recall. Related work [blazeit, noscope] in video processing from the database community aims to reduce cost as well, but as far as we know no existing work targets this kind of query.

2.3 Processing a distinct object query

A straightforward method to process a distinct object query is to scan all frames. In the traffic light query, for example, we can process every video in the dataset, sequentially evaluating an object detector on each frame of each video. If one or more lights are detected, they are matched with detections from the previous frame, and only truly new, unmatched ones contribute to the result. If there is a limit clause in our query, such as the limit of 20 in our example query ‘find 20 traffic lights in our dataset’, we can stop scanning as soon as we accumulate 20 results.

Off the shelf object detectors act on a frame by frame basis, and do not have a notion of time, memory or deduplication. So, we assume that in addition to specifying an object detector, the application specifies a discriminator function that decides which detections are new instances and which correspond to previous instances of an already processed detection. This is how we define the full set of unique instances in the video and how we generate the ground truth.

For example, the discriminator function may apply an object tracking algorithm like Median Flow [medianflow] or SORT [SORT] that computes the position of an object over a sequence of frames; then, by tracking new instances backwards and forwards through video around the sampled frame in which they were first detected, we can determine whether future instances correspond to prior ones by comparing them against previously computed tracks. We expand on this in our evaluation section. Alternatively, the application could use simple heuristics, such as assuming that two detections of traffic lights occurring within 30 seconds of each other must be the same light.

As explained in the introduction, an alternative to simple scanning is to process frames sequentially but skip forward 1 second at every step, effectively only processing one frame per second of video. A better strategy still is picking frames from the dataset uniformly at random. By better, we mean that the time to find the number of results requested in the query is smaller when inspecting frames at random than it is for a strategy that skips 1 second forward in time. The main reason for this is that random sampling explores more areas of the data more quickly, whereas moving sequentially means that our future samples stay close to our past samples. In the case where we find a result, frames one second apart are less likely to have novel results. Similarly, in the case where there are no results, frames one second apart are less likely to have any results. Moreover, skipping forward by a fixed one second may miss some results completely, while random sampling will eventually come back to nearby frames.
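As an illustration of the random sampling baseline combined with the simple time-window heuristic mentioned above, here is a minimal Python sketch. Everything in it is an assumption made for the example: `detect` stands in for the object detector and returns a single boolean per frame, `is_new` plays the role of the discriminator using a 30-second window, and the toy "video" contains two lights at made-up times.

```python
import random


def is_new(detection_time, accepted_times, window_s=30.0):
    """Heuristic discriminator: a detection counts as new only if no
    already accepted result lies within window_s seconds of it."""
    return all(abs(detection_time - t) > window_s for t in accepted_times)


def random_search(num_frames, fps, detect, limit, seed=0):
    """Visit frames in uniformly random order (without replacement) until
    `limit` distinct results are found. Returns (result_times, frames_processed)."""
    rng = random.Random(seed)
    order = list(range(num_frames))
    rng.shuffle(order)
    results, processed = [], 0
    for frame in order:
        processed += 1                      # every sampled frame goes through the detector
        if detect(frame):
            t = frame / fps
            if is_new(t, results):
                results.append(t)
                if len(results) >= limit:
                    break
    return results, processed


def toy_detector(frame, fps=30):
    """Toy 10-minute video: one light visible in seconds 100-110, another in 400-405."""
    t = frame / fps
    return 100 <= t < 110 or 400 <= t < 405


results, processed = random_search(18000, 30, toy_detector, limit=2)
print(sorted(results), processed)
```

The number of frames processed is the cost metric used throughout the paper; a fixed-skip scan could be compared by replacing the shuffled order with `range(0, num_frames, fps)`.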

3 ExSample

In this section we explain how our approach, ExSample, minimizes the time needed to find results. Due to the high compute demands of object detection models, even when run on GPUs, runtime and compute cost when using ExSample are a function of the number of frames sampled and then processed by the object detector. When discussing ExSample we use the terms frames sampled and frames processed interchangeably because every sampled frame is also a processed frame.

To minimize the number of frames processed, ExSample works, at a high level, by estimating which files and temporal segments within a video are more likely to yield new results, and sampling frames more often from those regions. Importantly, because its goal is to find distinct results, ExSample accounts for both result abundance and variability, rather than purely for raw hits.

This is implemented by conceptually splitting each file into chunks of, for example, half an hour or shorter, and scoring each chunk separately based on what the object detector returned on frames previously sampled from that chunk. ExSample scores chunks based not just on past hits but also on past repetitions: the more repeated results there are, the lower the score will be, regardless of past hits. This scoring system allows ExSample to allocate resources to more promising areas initially, while also allowing it to diversify, over time, where it looks next. In our evaluation, we show this technique helps ExSample outperform purely greedy strategies even when they are equipped with ad-hoc heuristics to avoid duplicates.

To make it practical, ExSample is composed of two core parts: an estimate of future results per chunk, explained in subsection 3.1, and a robust mechanism to translate these estimates into a decision while accounting for estimation error, explained in subsection 3.3. In those two sections we focus on quantifying the types of error. Later, in Algorithm 1, we explain the algorithm step by step.

3.1 Scoring a single chunk

In this section we derive our estimate for the future value of sampling a chunk. To facilitate keeping up with the notation we introduce in the following sections, we restate all definitions in Appendix A.

In order to make the optimal decision of which chunk to sample next, ExSample estimates $R(n+1)$ for each chunk: the number of new results we expect to find on the next, $(n+1)$-th, sample from that chunk. By new, we mean that $R(n+1)$ does not count results already found in the previous samples, even if they also appear in the frame. The chunk with the largest $R(n+1)$ is a good location to sample next.

In our traffic light example, the chance of finding a traffic light by random sampling intuitively depends on both the number of traffic lights in the data, which we will call $N$, and the number of video frames each light $i$ is visible for, which we will call $p_i$ and express as a fraction of all frames, so that $p_i$ is also the probability that a randomly sampled frame shows light $i$. The chance of finding a new traffic light depends additionally on how many samples we have drawn already, which we will call $n$. Note that $p_i$ varies from light to light: for example, in video collected from a moving vehicle that stops at a red light, red lights will tend to have large $p_i$, lasting on the order of minutes, while green and yellow lights are likely to have much smaller $p_i$, perhaps on the order of a few seconds.

Technically, $R(n+1) = \sum_{i=1}^{N} p_i \, [i \notin \mathrm{seen}(n)]$, where $\mathrm{seen}(n)$ is the set of results we have already seen in the previous $n$ frames and $[\cdot]$ is 1 when the condition holds and 0 otherwise. $R(n+1)$ is a single number fully determined by $\mathrm{seen}(n)$, but it is random if we only know $n$. Both the total number of instances $N$ and their durations $p_i$ are unknown to us unless we have scanned and processed the whole dataset, so our ability to estimate $R(n+1)$ may seem hopeless. Fortunately, there are tools to estimate $R(n+1)$ which do not require us to first estimate either $N$ or the individual $p_i$. The estimate instead relies on counting the number of results seen exactly once so far, which we will represent with $N_1(n)$. $N_1(n)$ is a quantity we can observe and track as we sample frames. The estimate is:

$$R(n+1) \approx \frac{N_1(n)}{n} \qquad (1)$$

The formula appears in other contexts as the Good-Turing estimator [good-biometrika], but has a different meaning in those contexts. In our video search application we sample frames, not symbols, at random, and a single frame sample can indirectly sample arbitrarily many objects, or no object at all. This means that in our application $N_1(n)$ could range from being $0$ to being far larger than $n$ itself, for example in a crowded scene. In the typical setting (such as the one explained in the original paper [good-biometrika]) there is exactly one instance per sample and the only question is whether it is new. In that situation $N_1(n)$ can never exceed $n$.
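The estimate described above, the number of results seen exactly once divided by the number of frames sampled, is simple to compute from a log of detections. The following Python sketch assumes each sampled frame has already been reduced to a set of instance identifiers (for example by a discriminator function); the function name and the toy identifiers are hypothetical.

```python
from collections import Counter


def new_result_estimate(frames):
    """frames: list of sets of instance ids, one set per sampled frame.
    Returns N1 / n, where N1 is the number of instances seen in exactly one frame."""
    n = len(frames)
    if n == 0:
        raise ValueError("need at least one sampled frame")
    counts = Counter(inst for frame in frames for inst in frame)
    n1 = sum(1 for c in counts.values() if c == 1)
    return n1 / n


# Five sampled frames: "a" shows up twice, "b" and "c" once each, two frames are empty.
samples = [{"a"}, set(), {"a", "b"}, {"c"}, set()]
estimate = new_result_estimate(samples)
print(estimate)
```

Note that a single frame can contribute several instances at once, which is why the count of singletons is not bounded by the number of frames in this setting.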

In the remainder of this section we show that the estimate applies in our problem setting as well, and in particular that we can bound its relative bias using high-level properties of the data and the query: the number of result instances $N$, the average duration of a result $\mu_p$, and the standard deviation of the durations $\sigma_p$. Note that the error we discuss in this section is the bias of our estimate $N_1(n)/n$, an error that will occur even in the absence of randomness. In a later section we deal with the problem of errors that arise from randomness in the sample.

In particular we will focus on the relative error:

$$\text{rel. err} = \frac{E[N_1(n)]/n - E[R(n+1)]}{E[N_1(n)]/n}$$

The following inequalities bound the relative bias error of our estimate:

Theorem (Bias).
$$0 \le \text{rel. err} \qquad (2)$$
$$\text{rel. err} \le \max_i p_i \qquad (3)$$
$$\text{rel. err} \le \sqrt{N(\mu_p^2 + \sigma_p^2)} \qquad (4)$$

Intuitively, Equation 2 tells us that $N_1(n)/n$ tends to overestimate, Equation 3 that the size of the overestimate is guaranteed to be less than the largest probability $p_i$, which is likely small, and Equation 4 says that even if a few outlier $p_i$ were large, as long as the durations within one $\sigma_p$ of the average $\mu_p$ are small the error will remain small. A large $N$ or a large $\mu_p$ or $\sigma_p$ may seem problematic for Equation 4, but we note that a large number of results or a long average duration implies many results will be found after only a few samples, so the end goal of the search problem is easy in the first place and having guarantees of accurate estimates is less important.

In a later experiment with skewed data and many instances we show the estimate works well, and real data in our evaluation has natural skew and we obtain consistently good experimental results.


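Now we prove Equation 2, Equation 3, and Equation 4.

Let $p_i$ be the probability that a randomly sampled frame shows object $i$. The chance that object $i$ is seen on the $(n+1)$-th try after being missed on the first $n$ tries is $p_i(1-p_i)^n$. For the rest of the paper, we will name this quantity $\pi_i(n+1)$. By linearity of expectation

$$E[R(n+1)] = \sum_{i=1}^{N} \pi_i(n+1)$$

For the rest of this proof, we will avoid explicitly writing the summation index $i$, since our summations are always over the result instances, from $1$ to $N$.

$E[N_1(n)]$ can also be expressed directly. The chance of having seen instance $i$ exactly once after $n$ samples is $n\,p_i(1-p_i)^{n-1}$, with the extra factor $n$ coming from the possibility of the instance having shown up at any of the first $n$ samples. We can rewrite this as $n\,\pi_i(n)$. So $E[N_1(n)] = n \sum \pi_i(n)$, giving:

$$\text{rel. err} = \frac{\sum \pi_i(n) - \sum \pi_i(n+1)}{\sum \pi_i(n)}$$

Now we will focus on the numerator:

$$\sum \left[\pi_i(n) - \pi_i(n+1)\right] = \sum p_i(1-p_i)^{n-1}\left[1 - (1-p_i)\right] = \sum p_i\,\pi_i(n)$$

We can see here that each term in the error is non-negative, hence we never underestimate in expectation, which proves Equation 2.

Now we want to bound this overestimate. Intuitively we know the overestimate is small because each term is a scaled-down version of the terms in $E[N_1(n)]/n = \sum \pi_i(n)$; more precisely:

$$\sum p_i\,\pi_i(n) \le \left(\max_i p_i\right) \sum \pi_i(n) = \left(\max_i p_i\right) \frac{E[N_1(n)]}{n}$$

For example, if the $p_i$ are all less than 1%, then the overestimate is also less than 1% of $E[N_1(n)]/n$. However, if we know there may be a few outliers with large $p_i$ which are not representative of the rest, unavoidable in real data, then we would like to know our estimate will still be useful. We can show this is still true as follows, using the Cauchy-Schwarz inequality and the fact that the $\pi_i(n)$ are non-negative:

$$\sum p_i\,\pi_i(n) \le \sqrt{\sum p_i^2}\,\sqrt{\sum \pi_i(n)^2} \le \sqrt{\sum p_i^2}\,\sum \pi_i(n) \qquad (5)$$

The term $\sum p_i^2$ is $N$ times the second moment of the $p_i$, and can be rewritten as $N(\mu_p^2 + \sigma_p^2)$, where $\mu_p$ and $\sigma_p$ are the mean and standard deviation of the underlying result durations:

$$\sum p_i^2 = N(\mu_p^2 + \sigma_p^2) \qquad (6)$$

Putting together Equation 5 and Equation 6 we get:

$$\sum p_i\,\pi_i(n) \le \sqrt{N(\mu_p^2 + \sigma_p^2)}\;\frac{E[N_1(n)]}{n}$$

And dividing both sides of each bound by $E[N_1(n)]/n$, we get

$$\text{rel. err} \le \max_i p_i \quad\text{and}\quad \text{rel. err} \le \sqrt{N(\mu_p^2 + \sigma_p^2)}$$

which justifies the remaining two bounds: Equation 3 and Equation 4. ∎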
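The bias bounds can be checked numerically on a small made-up example. The Python sketch below is one reading of the proof above, under stated assumptions: `p` is a hypothetical list of per-instance visibility probabilities, the expected value of the estimate is taken to be the sum of $p_i(1-p_i)^{n-1}$, and the expected number of new results on the next sample is taken to be the sum of $p_i(1-p_i)^{n}$. The function name and the example probabilities are illustrative only.

```python
import math


def bias_report(p, n):
    """Exact relative bias of the singleton estimate after n samples, together
    with the max-probability bound and the moment-based bound."""
    pi_n = [x * (1 - x) ** (n - 1) for x in p]      # p_i (1 - p_i)^(n-1)
    pi_next = [x * (1 - x) ** n for x in p]         # p_i (1 - p_i)^n
    expected_estimate = sum(pi_n)                   # E[N1(n)] / n
    expected_new = sum(pi_next)                     # E[R(n+1)]
    rel_err = (expected_estimate - expected_new) / expected_estimate
    count = len(p)
    mean = sum(p) / count
    var = sum((x - mean) ** 2 for x in p) / count   # population variance
    return rel_err, max(p), math.sqrt(count * (mean ** 2 + var))


# Mostly short-lived instances plus one long-lived outlier.
p = [0.001] * 50 + [0.02] * 5 + [0.2]
rel_err, bound_max, bound_moment = bias_report(p, n=100)
print(rel_err, bound_max, bound_moment)
```

In this example the relative error is non-negative and sits below both bounds; the quantity is a weighted average of the $p_i$, so it can never exceed the largest of them.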

3.2 Extending scoring to multiple chunks

In principle, if we knew $R_j(n_j+1)$ for every chunk $j$, where $n_j$ is the number of frames sampled so far from chunk $j$, the algorithm could simply take the next sample from the chunk with the largest value. However, if we simply use the raw estimate

$$R_j(n_j+1) \approx \frac{N_1^j(n_j)}{n_j} \qquad (7)$$

where $N_1^j(n_j)$ is the number of results seen exactly once in chunk $j$, and pick the chunk with the largest estimate, then we run into two potential problems. The first is that each estimate may be off due to the randomness of the sample, especially for those chunks with smaller $n_j$. We will address this problem in the next section. The second potential problem is the issue of instances spanning multiple chunks, which we address afterwards.

3.3 Dealing with uncertainty

In reality we assign scores to multiple chunks as we sample them, and each chunk $j$ will have been sampled a different number of times and will have its own $n_j$. In fact, we really want different chunks to be sampled very different numbers of times, because that is what ExSample must do to outperform random sampling.

We have shown how to estimate which chunk is promising, but for this to be practical we still need to handle the problem that our observed $N_1(n)$ will fluctuate randomly due to randomness in our sampling. This is especially true early in the sampling process, where only a few samples have been collected but we need to make a sampling decision. Because the quality of the estimates themselves is tied to the number of samples we have taken, and we do not want to stop sampling a chunk due to a small amount of bad luck early on, it is important that we estimate how noisy our estimate is. The usual way to do this is by estimating the variance of our estimator, $\mathrm{Var}[N_1(n)/n]$. Once we have a good idea of how this variance depends on the number of samples taken, we can make informed decisions about which chunk to pick next, balancing both the raw score and the number of samples it is based on.

Theorem (Variance).

If instances occur independently of each other, then

$$\mathrm{Var}\left[\frac{N_1(n)}{n}\right] \le \frac{E[N_1(n)]}{n^2} \qquad (8)$$

Note that this bound also implies that the variance is at least $n$ times smaller than the expected value of the raw estimate, because we can rewrite it as $(E[N_1(n)]/n)/n$. Note, however, that the independence assumption is necessary for proving this bound. While in reality different results may not occur and co-occur truly independently, our experimental results in the evaluation show our estimate works well enough in practice.


We will estimate the variance assuming independence of the different instances. We can express $N_1(n)$ as a sum of binary indicator variables $X_i$, which are 1 if instance $i$ has shown up exactly once, which happens with probability $n\,\pi_i(n)$, where $\pi_i(n) = p_i(1-p_i)^{n-1}$. Therefore $N_1(n) = \sum X_i$ and, because of our independence assumption, $\mathrm{Var}[N_1(n)] = \sum \mathrm{Var}[X_i]$. Because $X_i$ is a Bernoulli random variable, its variance is $n\,\pi_i(n)\,(1 - n\,\pi_i(n))$, which is bounded by $n\,\pi_i(n)$ itself. Therefore, $\mathrm{Var}[N_1(n)] \le \sum n\,\pi_i(n)$. This latter sum we know from before is $E[N_1(n)]$. Therefore $\mathrm{Var}[N_1(n)/n] \le E[N_1(n)]/n^2$. ∎
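Under the independence assumption, both sides of the variance bound can be computed exactly for a given set of probabilities. The short Python sketch below does so for a made-up list `p` of per-instance visibility probabilities; it treats the count of instances seen exactly once as a sum of independent Bernoulli variables, which is my reading of the proof above, and the function name is hypothetical.

```python
def n1_variance_and_bound(p, n):
    """Exact variance of (number of instances seen exactly once) / n under
    independence, and the bound given by its expected value divided by n^2."""
    # Probability that instance i shows up exactly once in n samples.
    q = [n * x * (1 - x) ** (n - 1) for x in p]
    var_n1 = sum(qi * (1 - qi) for qi in q)      # sum of Bernoulli variances
    mean_n1 = sum(q)                              # expected number of singletons
    return var_n1 / n ** 2, mean_n1 / n ** 2


var, bound = n1_variance_and_bound([0.001] * 50 + [0.02] * 5 + [0.2], n=100)
print(var, bound)
```

The gap between the two numbers comes entirely from the factors $(1 - q_i)$, so the bound is tight when every instance is individually unlikely to be a singleton.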

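In fact, we can go further and fully characterize the distribution of values $N_1(n)$ takes.

Theorem (Sampling distribution of $N_1(n)$).

Assuming the $p_i$ are small or $n$ is large, and assuming independent occurrence of instances, $N_1(n)$ follows a Poisson distribution with parameter $\lambda = \sum n\,\pi_i(n) = E[N_1(n)]$.

We prove this by showing that the moment generating function (MGF) of $N_1(n)$ matches that of a Poisson distribution with $\lambda = \sum n\,\pi_i(n)$.

As in the proof of Equation 8, we think of $N_1(n)$ as a sum of independent binary random variables $X_i$, one per instance. Each of these variables has a moment generating function $M_{X_i}(t) = 1 + n\,\pi_i(n)(e^t - 1)$. Because $1 + x \approx e^x$ for small $x$, and $n\,\pi_i(n)$ will be small, then $M_{X_i}(t) \approx \exp\left(n\,\pi_i(n)(e^t - 1)\right)$. Note $n\,\pi_i(n)$ is always eventually small for some $n$ because $n\,p_i(1-p_i)^{n-1} \to 0$ as $n$ grows.

Because the MGF of a sum of independent random variables is the product of the terms’ MGFs, we arrive at:

$$M_{N_1(n)}(t) = \prod M_{X_i}(t) \approx \exp\left((e^t - 1)\sum n\,\pi_i(n)\right)$$

which is the MGF of a Poisson distribution with parameter $\lambda = \sum n\,\pi_i(n)$. ∎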

3.3.1 Thompson Sampling

Now we can use this information to design a decision making strategy. The goal is to meaningfully pick between $(N_1^1(n_1), n_1)$ and $(N_1^2(n_2), n_2)$ instead of only between $N_1^1(n_1)/n_1$ and $N_1^2(n_2)/n_2$, where the only reasonable answer would be to pick the largest. One way to implement this comparison is to randomize it, which is what Thompson sampling [thompson_sampling] does.

Thompson sampling works by modeling unknown parameters such as $R_j(n_j+1)$ not with point estimates such as $N_1^j(n_j)/n_j$ but with a wider distribution over their possible values. The width should depend on our uncertainty. Then, whenever we would have used the point estimate to make a decision, we instead draw a sample from its distribution and use that number instead, effectively adding noise to our estimate in proportion to how uncertain it is. In our implementation, we choose to model the uncertainty around $R_j(n_j+1)$ as following a Gamma distribution:

$$R_j(n_j+1) \sim \Gamma\left(\alpha = N_1^j(n_j),\ \beta = n_j\right) \qquad (9)$$

Although less common than the Normal distribution, the Gamma distribution is shaped much like the Normal when $N_1^j$ is large, but behaves more like a single-tailed distribution when $N_1^j$ is near 0, which is desirable because $N_1^j(n_j)/n_j$ will become very small over time, but we know our hidden $R_j$ is always non-negative. The Gamma distribution is a common way to model the uncertainty around an unknown (but positive) parameter for a Poisson distribution whose samples we observe. This choice is especially suitable for our use case, as we have shown that $N_1$ does in fact follow a Poisson distribution. The mean value of the Gamma distribution of Equation 9 is $\alpha/\beta = N_1^j/n_j$, which is by design consistent with Equation 7, and its variance is $\alpha/\beta^2 = N_1^j/n_j^2$, which is by design consistent with the variance bound of Equation 8.

Finally, the Gamma distribution is not defined when $\alpha$ or $\beta$ are 0, so we need a way to deal both with the scenario where $N_1^j(n_j) = 0$, which could happen often due to objects being rare or due to having exhausted all results, and with initialization, when both $N_1^j$ and $n_j$ are 0. We do this by adding small quantities $\alpha_0$ and $\beta_0$ to the two terms, obtaining:

$$R_j(n_j+1) \sim \Gamma\left(\alpha = N_1^j(n_j) + \alpha_0,\ \beta = n_j + \beta_0\right) \qquad (10)$$
We used fixed small values for the two added quantities, $\alpha_0$ and $\beta_0$, in practice, though we did not observe a strong dependence on their exact values. The next question is whether these techniques work when applied on skewed data.
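To make the decision rule concrete, here is a minimal Python sketch of one Thompson sampling step over chunks. It assumes, as my reading of Equation 10, that each chunk's belief is a Gamma distribution whose shape is the chunk's singleton count plus a small constant and whose rate is the chunk's sample count plus a small constant. The constants `ALPHA0` and `BETA0`, the function name, and the per-chunk counts are placeholders chosen for illustration, not values taken from the text.

```python
import random

ALPHA0, BETA0 = 0.1, 1.0   # assumed placeholder smoothing constants


def pick_chunk(n1, n, rng, alpha0=ALPHA0, beta0=BETA0):
    """One Thompson sampling step: draw a score per chunk from
    Gamma(shape = n1[j] + alpha0, rate = n[j] + beta0) and return the argmax."""
    draws = [
        # random.gammavariate takes (shape, scale), so the rate is inverted.
        rng.gammavariate(a + alpha0, 1.0 / (b + beta0))
        for a, b in zip(n1, n)
    ]
    return max(range(len(draws)), key=draws.__getitem__)


rng = random.Random(0)
n1 = [0, 3, 40]      # results seen exactly once, per chunk
n = [0, 10, 200]     # frames sampled so far, per chunk
picks = [pick_chunk(n1, n, rng) for _ in range(1000)]
counts = [picks.count(j) for j in range(3)]
print(counts)
```

Because the scores are random draws rather than point estimates, the never-sampled first chunk still gets picked occasionally, while the well-sampled third chunk competes with a narrow distribution around its ratio.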

3.3.2 Empirical validation

In this section we provide an empirical validation of the estimates from the previous sections, including Equation 7 and Equation 10. The question we are interested in is: given an observed $N_1$ and $n$, what is the true expected $R(n+1)$, and how does it compare to the belief distribution $\Gamma(N_1 + \alpha_0, n + \beta_0)$ which we propose using?

We ran a series of simulation experiments. To do this, we first generate 1000 at random to represent 1000 results with different durations. To ensure there is duration skew we observe in real data, we use a lognormal distribution to generate the . To illustrate the skew in the values, the smallest is , while the . The median is . The parameters computed from the are and respectively. For a dataset with million frames (about 10 hours of video), these durations correspond to objects spanning from of a second up to about hours, more skew than what normally occurs.

Then, we model the sampling of a random frame as follows: each instance $i$ is in the frame independently at random with probability $p_i$. To decide which of the instances will show up in our frame we simulate tossing 1000 coins independently, each with its own $p_i$, and the positive draws give us the subset of instances visible in that frame. We then proceed drawing these samples sequentially, tracking the number of frames we have sampled, $n$, and how many instances we have seen exactly once, $N_1$. We also record $R(n+1)$, the number of new instances we can expect in the next sampled frame. This is possible because in the simulation we know which instances remain unseen and know their hidden probabilities $p_i$, so we compute $R(n+1)$ directly as the sum of $p_i$ over the unseen instances.

We sequentially sample frames up to a fixed maximum $n$, and repeat the experiment 10K times, obtaining hundreds of millions of tuples of the form $(n, N_1, R(n+1))$ for our fixed set of $p_i$. Using this data, we can answer our original question (given an observed $N_1$ and $n$, what is the true expected $R(n+1)$?) by selecting different observed $n$ and $N_1$ at random, conditioning (filtering) on them, and plotting a histogram of the actual $R(n+1)$ that occurred in all our simulations. We show these histograms for 10 pairs of $n$ and $N_1$ in Figure 2, alongside our belief distribution.
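A single run of the simulation described above can be sketched in Python as follows. The lognormal parameters, the number of frames drawn, and the names are assumptions chosen only to produce a skewed toy distribution; they are not the values used in the paper.

```python
import math
import random


def simulate(p, max_n, rng):
    """Draw max_n frames; after each one yield (n, n1, r), where n1 is the number
    of instances seen exactly once and r is the true expected number of new
    instances in the next frame (sum of p_i over unseen instances)."""
    counts = [0] * len(p)
    n1 = 0
    remaining = sum(p)
    for n in range(1, max_n + 1):
        for i, prob in enumerate(p):
            if rng.random() < prob:          # instance i is visible in this frame
                counts[i] += 1
                if counts[i] == 1:           # first sighting: a new singleton
                    n1 += 1
                    remaining -= prob
                elif counts[i] == 2:         # second sighting: no longer a singleton
                    n1 -= 1
        yield n, n1, remaining


rng = random.Random(0)
# Assumed lognormal parameters, capped at 1 so each value is a valid probability.
p = [min(1.0, rng.lognormvariate(math.log(1e-3), 1.5)) for _ in range(1000)]
for n, n1, r in simulate(p, 200, rng):
    pass
print(n, n1 / n, r)
```

Repeating this run many times and grouping the recorded tuples by their observed $n$ and $N_1$ gives the kind of histograms the text compares against the belief distribution.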

Figure 2: Comparing our Gamma heuristic of Equation 10 with a histogram of the true values of $R(n+1)$ from a simulation with heavily skewed $p_i$. The details of the simulation are discussed in subsubsection 3.3.2. We picked 10 $(n, N_1)$ pairs from the data to include multiple important edge scenarios: where $n$ is still small, as well as where $n$ is very large (in this case up to 20% of the total frames). We also show $N_1$ close to 0 due to bad luck in the last subplot. Note we are using the noisy observed $N_1$ and not the idealized $E[N_1]$, which would be a lot more accurate but is not directly observable. The histograms show the range of values seen for $R(n+1)$ when we have the observed $N_1$ and $n$. The point estimate $N_1/n$ (Equation 7) is shown as a vertical line. The belief distribution density is plotted as a thicker orange line, and 5 samples drawn from it using Thompson sampling are shown with dashed lines.

Figure 2 shows a mix of 3 important scenarios. The first 3 subplots have small $n$, representative of the early stages of sampling. Here we see that the Gamma model has substantially more variance than the underlying true distribution of $R(n+1)$. This is intuitively expected, because early on both the bias and the variance of our estimate are bottlenecked by the number of samples, and not by the inherent uncertainty of $R(n+1)$. As $n$ grows to mid-range values (next 4 plots), we see that the curve fits the histograms very well, and also that the curve keeps shifting left to lower and lower orders of magnitude on the x axis. Here we see that the one-sided nature of the Gamma distribution fits the data better than a bell-shaped curve. The final 3 subplots show scenarios where $n$ has grown large and $N_1$ is potentially very small, including a case where $N_1 = 0$. In that last subplot, we see the effect of having the extra $\alpha_0$ in Equation 10, which means Thompson sampling will continue producing non-zero values at random and we will eventually correct our estimate when we find a new instance. In 3 of the subplots there is a clear bias to overestimate, though not that large despite the large skew.

This empirical validation was based on simulated data. In our evaluation we show these modeling choices work well in practice on real datasets as well, where our assumption of independence is not guaranteed and where durations may not necessarily follow the same distribution law.

3.4 Instances spanning multiple chunks

If instances can span multiple chunks, for example a traffic light that spans across the boundaries of two neighboring chunks, Equation 7 is still accurate with the caveat that is interpreted as the number of instances seen exactly once globally and which were found in chunk . The same object found once in two chunks and does not contribute to either or to , even though each chunk has only seen it once. The derivation of this rule is similar to that in the previous section, and is given in Appendix B. In practice, if only a few rare instances span multiple chunks then results are almost the same and this adjustment does not need to be implemented.

At runtime, the numerator will only increase in value the first time we find a new result globally, decrease back as soon as we find it again either in the same chunk or elsewhere, and finding it a third time would not change it anymore. Meanwhile increases upon sampling a frame from that chunk. This natural relative increase and decrease of with respect to each other allows ExSample to seamlessly shift where it allocates samples over time.
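The update rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `SeenOnceTracker`, its fields, and the assumption that the matcher hands back stable instance ids are all hypothetical. On a second sighting, the sketch decrements the count of the chunk where the instance was first found, which is one reading of "seen exactly once globally and found in that chunk".

```python
from collections import defaultdict


class SeenOnceTracker:
    """Per-chunk counts of instances seen exactly once globally (hypothetical helper)."""

    def __init__(self, num_chunks):
        self.n1 = [0] * num_chunks          # instances seen exactly once globally, per chunk
        self.n = [0] * num_chunks           # frames sampled, per chunk
        self.times_seen = defaultdict(int)  # instance id -> global number of sightings
        self.first_chunk = {}               # instance id -> chunk of its first sighting

    def record_frame(self, chunk, instance_ids):
        """Update the statistics after processing one frame sampled from `chunk`."""
        self.n[chunk] += 1
        for inst in instance_ids:
            self.times_seen[inst] += 1
            count = self.times_seen[inst]
            if count == 1:
                # First global sighting: the numerator of this chunk increases.
                self.first_chunk[inst] = chunk
                self.n1[chunk] += 1
            elif count == 2:
                # Second sighting (same chunk or elsewhere): undo the increase.
                self.n1[self.first_chunk[inst]] -= 1
            # A third or later sighting changes nothing.


tracker = SeenOnceTracker(num_chunks=2)
tracker.record_frame(0, ["light_a"])             # new result found in chunk 0
tracker.record_frame(1, ["light_a"])             # same light seen again from chunk 1
tracker.record_frame(1, ["light_a", "light_b"])  # third sighting, plus a new result
print(tracker.n1, tracker.n)
```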

3.5 Effective chunking strategies

In practice, the chunk and sample approach of ExSample will work well when different chunks have different scores and these differences persist after more than a few samples. This is the case if different files are very different in content (for example, a car driving in one city versus another city, or on a highway), or, within a single file, if the camera moves or if there is a strong temporal pattern. For example, if after a few samples we find that 50% of the chunks likely have no results, we can expect ExSample to focus sampling on the rest of the dataset, with savings bounded by 2x compared to random sampling, which would keep spreading samples across all chunks. In contrast, if all chunks have essentially the same score then random sampling should be just as good or as bad at finding results.

Chunking based on time is likely to work well because there is some amount of locality to a lot of types of results. For example, traffic lights appear in cities and are likely to appear one block after the next. Making intervals too long (for example, making them the whole video) means we have fewer opportunities for scores to be different across intervals. On the other hand, making intervals very short (for example, one second long) means a lot of sampling is spent estimating which chunks are better, and the payoff of this information is smaller because we run out of frames.

For our evaluation, simply using chunks based on files and up to 30 minute length video intervals worked well across our benchmarks.
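A minimal sketch of this chunking scheme, assuming each file is described only by its duration in seconds. The function name `make_chunks` and the `(file, start, end)` tuple layout are illustrative choices, not part of the paper.

```python
def make_chunks(file_durations_s, max_chunk_s=30 * 60):
    """Split each file into consecutive intervals of at most `max_chunk_s` seconds.

    Returns a list of (file_name, start_s, end_s) tuples; files shorter than
    the limit become a single chunk.
    """
    chunks = []
    for name, duration in file_durations_s.items():
        start = 0
        while start < duration:
            end = min(start + max_chunk_s, duration)
            chunks.append((name, start, end))
            start = end
    return chunks


# A 4000-second drive is split into three chunks; a 40-second clip stays whole.
print(make_chunks({"drive_a.mp4": 4000, "clip_b.mp4": 40}))
```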

3.6 Algorithm

In this section we lay out how the intuition of the previous sections translates into pseudocode, which we show in Algorithm 1.

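input : video, chunks, detector, matcher, result_limit
output : ans
ans ← []
// arrays for stats of each chunk
N1 ← [0,0,…,0]
n ← [0,0,…,0]
while len(ans) < result_limit do
       // 1) choice of chunk and frame
       for j ← 1 to M do
              R[j] ← sample from the belief distribution of Equation 10, given N1[j] and n[j]
       end for
       j* ← argmax over j of R[j]
       frame_id ← chunks[j*].sample()
       // 2) io, decode, detection and matching
       rgb_frame ← video.read_and_decode(frame_id)
       dets ← detector(rgb_frame)
       // d0 are the unmatched dets
       // d1 are dets with only one match
       d0, d1 ← matcher.get_matches(frame_id, dets)
       // 3) update state
       N1[j*] ← N1[j*] + len(d0) − len(d1)
       n[j*] ← n[j*] + 1
       matcher.add(frame_id, dets)
       ans.add(d0)
end while
Algorithm 1 ExSample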

The inputs to the algorithm are:

  • video The video dataset itself, which may be a single video or a collection of files.

  • chunks The collection of logical chunks that we have split our dataset into. One natural way is to do it by time, so we can think of them as splitting each file in our dataset into 30 minute units. There are chunks total.

  • detector An object detector provided by the user, for detecting objects of interest to the application.

  • matcher An algorithm that matches detections to suggest which are new and which may be duplicates. The notion of new is application specific, but matcher can be implemented based on feature vector appearance, for example. We note that the matcher does not need to be accurate. A dummy matcher could say any two instances more than 1 second apart are distinct, which is effectively what much current work does. Its function is simply to signal that we are repeating results so we can better discount chunks.

  • result_limit An indication of when to stop.

After initializing arrays to hold per-chunk statistics, the code can be understood in three parts: choosing a frame, processing of the frame, and state update. The frame choice part is where ExSample makes a decision about which frame to process next. It starts with the Thompson sampling step in Algorithm 1, where we draw a separate sample from the belief distribution Equation 10 for each of the chunks, which is then used in Algorithm 1 to pick the highest scoring chunk. The in the code is used as the index variable for any loop over the chunks. During the first execution of the while loop all the belief distributions are identical, but their samples will not be, breaking the ties at random. Once we have decided on a best chunk index , we sample a frame index at random from it in Algorithm 1.

The second part includes all the heavy work involved in video processing: reading and decoding the frame we chose, applying the object detector to it (Algorithm 1). After that is done we pass the detections on to a matcher algorithm, which compares the detections we pass with those we have returned before in other frames and decides if they are distinct enough to be considered separate results. For it to be useful to our task, the matcher algorithm needs to give us the subset of detections which did not match any previous results, and the ones that matched exactly once with any detection from a previous iteration. The length of each of those lists is all we need to update our statistics in part 3. It is important to note that this part of the algorithm is the main bottleneck, with the detector call in Algorithm 1 being most of the work, followed in second place by the random read and decode of Algorithm 1. In comparison, the overhead of the first part is negligible and fully parallelizable. It only grows with the number of chunks.

The third part updates the state of our algorithm, updating and for the chunk we sampled from. Additionally, we store detections in the matcher and append the truly new detections to the answer. We note that the amount of state we need to track only grows with the amount of results we have so far, and not with the size of the dataset.
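The three parts described above can be sketched in Python as follows. This is a toy sketch under stated assumptions, not the authors' code: detections are plain instance ids, `CountingMatcher` is a hypothetical stand-in for a real matcher, the belief distribution is taken to be a Gamma with shape equal to the seen-once count plus `alpha0` and rate equal to the frame count plus `beta0`, and the default values of `alpha0` and `beta0` are placeholders. The clamp to zero before sampling is an added safeguard so the Gamma shape stays positive.

```python
import random


class CountingMatcher:
    """Toy matcher (an assumption): detections are hashable instance ids."""

    def __init__(self):
        self.sightings = {}  # instance id -> number of frames it appeared in

    def get_matches(self, dets):
        d0 = [d for d in dets if d not in self.sightings]     # never seen before
        d1 = [d for d in dets if self.sightings.get(d) == 1]  # seen exactly once before
        return d0, d1

    def add(self, dets):
        for d in dets:
            self.sightings[d] = self.sightings.get(d, 0) + 1


def exsample(chunks, detector, matcher, result_limit,
             alpha0=0.1, beta0=1.0, rng=None):
    """chunks: list of lists of frame ids. Returns (ans, frames sampled per chunk)."""
    rng = rng or random.Random()
    n1 = [0] * len(chunks)  # results seen exactly once, per chunk
    n = [0] * len(chunks)   # frames sampled, per chunk
    remaining = [list(frames) for frames in chunks]
    ans = []
    while len(ans) < result_limit and any(remaining):
        # 1) choice of chunk and frame: one Thompson draw per non-empty chunk
        best, best_score = None, -1.0
        for j, frames in enumerate(remaining):
            if not frames:
                continue
            shape = max(n1[j], 0) + alpha0
            # random.gammavariate takes a scale, so pass the inverse of the rate
            score = rng.gammavariate(shape, 1.0 / (n[j] + beta0))
            if score > best_score:
                best, best_score = j, score
        frame_id = remaining[best].pop(rng.randrange(len(remaining[best])))
        # 2) detection and matching (decoding is folded into the toy detector)
        dets = detector(frame_id)
        d0, d1 = matcher.get_matches(dets)
        # 3) update state
        n1[best] += len(d0) - len(d1)
        n[best] += 1
        matcher.add(dets)
        ans.extend(d0)
    return ans, n


def toy_detector(frame_id):
    """Chunk 0 is empty; in chunk 1 every instance lasts 10 consecutive frames."""
    chunk, idx = frame_id
    return [] if chunk == 0 else [("light", idx // 10)]


toy_chunks = [[(0, i) for i in range(200)], [(1, i) for i in range(200)]]
ans, n = exsample(toy_chunks, toy_detector, CountingMatcher(),
                  result_limit=10, rng=random.Random(0))
print(len(ans), n)
```

In this toy setup the only state kept is the two per-chunk arrays plus the matcher's record of past detections, which mirrors the remark that state grows with the number of results rather than with the dataset size.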

3.7 Other optimizations

Finally, we present several other optimizations to Algorithm 1.

3.7.1 Batched execution

Algorithm 1 processes one frame at a time, but to make good use of modern GPUs we may want to run inference in batches. The code for a batched version is similar to that in Algorithm 1, but on Algorithm 1 we draw samples per , instead of one sample from each belief distribution, creating cohorts of samples. In Figure 2 we show 5 different values from Thompson sampling of the same distribution in dashed lines. Because each sample for the same chunk will be different, the chunk with the maximum value will also vary and we will get chunk indices, biased toward the more promising chunks. The logic for state update only requires small modifications. In principle, we may fear that picking the next frames at random instead of only 1 frame could lead to suboptimal decision making within that batch, but at least for small values of up to 50, which is what we use in our evaluation, we saw no significant drop. This is likely enough to fully utilize a machine with multiple GPUs.

We do not implement or evaluate asynchronous, distributed execution in this paper, but the same reasoning suggests ExSample could be made to scale to an asynchronous setting, with workers processing a batch of frames at a time without waiting for other workers. Ultimately all the updates to and are commutative because they are additive.
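A minimal sketch of the batched chunk choice and of the additive state update, under the same assumptions as the single-frame case: a Gamma belief per chunk, with hypothetical prior parameters `alpha0` and `beta0`. The function names are illustrative.

```python
import random


def pick_batch_chunks(n1, n, batch_size, alpha0=0.1, beta0=1.0, rng=None):
    """Return `batch_size` chunk indices, one per cohort of Thompson samples."""
    rng = rng or random.Random()
    picks = []
    for _ in range(batch_size):
        # One fresh draw per chunk; the winner varies from cohort to cohort.
        scores = [rng.gammavariate(max(a, 0) + alpha0, 1.0 / (b + beta0))
                  for a, b in zip(n1, n)]
        picks.append(max(range(len(scores)), key=scores.__getitem__))
    return picks


def apply_batch_updates(n1, n, updates):
    """updates: iterable of (chunk, num_new, num_seen_once_before) per frame.

    Every update is an addition, so the order in which they arrive is irrelevant.
    """
    for chunk, num_new, num_rematched in updates:
        n1[chunk] += num_new - num_rematched
        n[chunk] += 1


picks = pick_batch_chunks(n1=[5, 0], n=[10, 10], batch_size=50,
                          rng=random.Random(0))
print(picks.count(0), picks.count(1))
```

Because `apply_batch_updates` only adds to the two arrays, applying the same set of per-frame updates in any order yields the same state, which is the property the asynchronous setting would rely on.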

3.7.2 Avoiding near duplicates within a single chunk

While random sampling is a good baseline, it allows samples to happen very close to each other in quick succession: for example, in a 1000 hour video, random sampling is likely to start sampling frames within the same one hour block after having sampled only about 30 different hours, instead of after having sampled most of the hours once. For this reason, we introduce a variation of random sampling, which we call random+, to deliberately avoid sampling temporally near previous samples when possible: by sampling one random frame out of every hour, then sampling one frame out of every not-yet sampled half an hour at random, and so on, until eventually sampling the full dataset. We evaluate the separate effect of this change in our evaluation, and we also use random+ to sample within each chunk in our dataset when evaluating ExSample. This is implemented by modifying the internal implementation of the chunk.sample() method, in Algorithm 1, but does not change the top level algorithm.
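One way to realize the coarse-to-fine order described above is sketched below. It is an assumption-laden illustration: the function name `random_plus`, the use of frame counts instead of hours, and the choice to halve the segment length at every level are mine, and the paper may implement the schedule differently.

```python
import random


def random_plus(num_frames, initial_segment, rng=None):
    """Return all frame indices in [0, num_frames) in a coarse-to-fine random order.

    Level 0 draws one random frame from every segment of `initial_segment`
    frames. Each later level halves the segment length and draws one frame
    from every segment that contains no earlier sample.
    """
    rng = rng or random.Random()
    order, sampled = [], set()
    seg = max(1, initial_segment)
    while len(sampled) < num_frames:
        covered = {f // seg for f in sampled}  # segments that already hold a sample
        num_segments = (num_frames + seg - 1) // seg
        todo = [s for s in range(num_segments) if s not in covered]
        rng.shuffle(todo)
        for s in todo:
            start = s * seg
            end = min(start + seg, num_frames)
            frame = rng.randrange(start, end)
            sampled.add(frame)
            order.append(frame)
        seg = max(1, seg // 2)
    return order


order = random_plus(num_frames=100, initial_segment=25, rng=random.Random(1))
print(order[:4])  # one frame from each of the four 25-frame segments
```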

3.8 Generalizations

Generalized instance durations Throughout the paper we have used as proxy for duration, assuming we select frames with uniform random sampling. However, we could weight frames non-uniformly at random in the first place, for example by using some precomputed per-frame score. If we use a non-uniform weight to sample the frames, we effectively induce a different set of for each result, ideally one with . The estimates for the relative value of different chunks will still be correct since ExSample is designed to work with any underlying .

Generalized chunks We have introduced the idea of a chunk as that of a partitioning of the data. However, it would be possible for chunks to overlap, and this is equivalent to having instances that span multiple chunks. This choice is only meaningful if the different chunks are different in some way, for example one chunk can be a half an hour of video sampled uniformly at random, while a different chunk can be the same half hour of video sampled under a different set of weights, using the idea of generalized .

4 Evaluation

Our goals for this evaluation are to demonstrate the benefits of ExSample on real data, comparing it to alternatives including random sampling as well as existing work in video processing. We show that on these challenging datasets ExSample achieves savings of up to 4x with respect to random sampling, and orders of magnitude higher with respect to approaches based on related work.

Existing work such as BlazeIt and NoScope does not explicitly optimize for distinct object queries like ExSample does, but it does optimize searching for frames that satisfy expensive predicates. These systems also recognize the need to avoid redundant results and implement a basic form of duplicate avoidance by skipping a fixed amount of time. It is therefore reasonable to ask whether existing work already inadvertently processes distinct object queries efficiently. In this evaluation we show that this is not the case, and that the two kinds of query require different approaches.

The main line of existing work for video processing uses lightweight convolutional networks to assign a preliminary score to each frame and then processes frames from highest to lowest score. BlazeIt is the state of the art representative of this surrogate model approach.

4.1 Implementation

Our implementations of both BlazeIt and ExSample are at their core a sampling loop where the choice of which frame to process next is based on an algorithm-specific score. Based on this score, the system will fetch a batch of frames from the video and run that batch through an object detector. We implement this sampling loop in Python, using PyTorch to run inference on a GPU. The object detector model, which we refer to as the full model, is the same Faster-RCNN with a ResNet-101 backbone that we use for ground truth. To reach reasonable random access frame decoding rates we use the Hwang library from the Scanner project [scanner], and re-encode our datasets to insert keyframes every 20 frames. BlazeIt requires extra upfront costs prior to sampling to train the surrogate model, and we describe our implementation of that part of BlazeIt, as well as its associated fixed costs, in subsection 4.6.

We implemented the subset of BlazeIt for limit queries with simple predicates, based on the description in the paper [blazeit] as well as their published code. We opted for our own implementation to make sure the I/O and decoding components of both ExSample and BlazeIt were equally optimized, and also because extending BlazeIt to handle our own datasets, ground truth, and metrics is more involved. For the cheaper surrogate (aka specialized model) in the BlazeIt paper we use an ImageNet pre-trained ResNet-18. This model is more heavyweight than the ones used in that paper, but also more accurate. We note that our timed results do not depend strongly on the runtime cost of ResNet-18.

4.2 Datasets

For this evaluation we use two datasets, which we refer to as Dashcam and BDD. The Dashcam dataset consists of 10 hours of video, or over 1.1 million video frames, collected from a vehicle-mounted dashboard camera over several drives in cities and highways. Each drive can range from around 20 minutes to several hours.

The BDD dataset used for this evaluation consists of a subset of the Berkeley Deep Drive Dataset [bdd]. The BDD dataset consists of 40-second video clips, also from dashboard cameras. Our subset is made out of 1000 randomly chosen clips.

In both datasets, the camera will move at variable speeds depending on the drive. The BDD dataset in particular includes clips from many cities.

4.3 Queries

Both the Dashcam and BDD datasets include similar types of objects we expect to see in cities and highways. These include stationary objects such as traffic lights and signs, and moving objects such as bicycles and trucks. Each type of object is a different search query, making our evaluation consist of 8 queries per dataset.

In addition to searching for different object classes, we also vary the limit parameter to achieve .1, .5 and .9 of recall, where recall is measured as the fraction of distinct results found. These recall rates are meant to represent different kinds of applications: .1 (10%) represents a scenario where an autonomous vehicle data scientist is looking for a few test examples, whereas a higher recall like .9 would be more useful in an urban planning or mapping scenario where finding many instances is desired.

4.4 Ground truth

Neither the dashcam nor the BDD datasets have human-generated object instance labels that both identify and track multiple objects over time. Therefore we approximate ground truth by sequentially scanning every video in the dataset and running each frame through an object detector. If any objects are detected, we match the bounding boxes with those from previous frames and resolve which correspond to the same instance. For object detection we use a reference implementation of FasterRCNN[frcnn] from Facebook’s Detectron2 library[detectron2], one of the higher accuracy object detection models, pre-trained on the COCO[coco] dataset. In particular we use the version with a ResNet-101[resnet] backbone. To match object instances across neighboring frames, we employ an IoU matching approach similar to SORT[SORT]. IoU matching is a simple baseline for multi-object tracking that leverages the output of an object detector and matches detection boxes based on overlap across adjacent frames.
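The IoU matching step can be illustrated with the short sketch below. It is a simplified stand-in, not the authors' tracker: boxes are assumed to be `(x1, y1, x2, y2)` tuples, the 0.5 threshold is an arbitrary placeholder, and pairs are assigned greedily by descending overlap, whereas SORT-style trackers typically solve an assignment problem instead.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_boxes(prev_boxes, curr_boxes, threshold=0.5):
    """Greedy one-to-one matching between adjacent frames.

    Returns {index in curr_boxes: index in prev_boxes}; unmatched current
    boxes are treated as new instances by the caller.
    """
    pairs = sorted(
        ((iou(p, c), i, j)
         for i, p in enumerate(prev_boxes)
         for j, c in enumerate(curr_boxes)),
        reverse=True,
    )
    used_prev, matches = set(), {}
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap even less
        if i in used_prev or j in matches:
            continue
        used_prev.add(i)
        matches[j] = i
    return matches


prev_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
curr_boxes = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
print(match_boxes(prev_boxes, curr_boxes))  # the third current box stays unmatched
```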

4.5 Results

Here we evaluate the cost of processing, in both time and frames, distinct object queries using ExSample, random+ and BlazeIt. Because some classes like parking meter are extremely rare, and some such as truck are much more common, the absolute times and frame counts needed to process a query vary by many orders of magnitude across different object classes. It is easier to get a global view of cost savings by normalizing query processing times of ExSample and BlazeIt against that of random+. That is, if random+ takes 1 hour to find 20% of traffic lights, and ExSample takes 0.5 hours to do the same, time savings would be 2x. We also apply this normalization when comparing frames processed. Results for Dashcam are shown in Figure 3 and for BDD in Figure 4.

Overall, ExSample saves more than 2x in frames processed vs random+, averaged across all classes and different recall levels. Savings do vary by class, reaching above 4x for the person class. Savings by time are much larger when comparing with BlazeIt: although BlazeIt does reduce the number of frames that the expensive detector is run on, especially at low recalls, it performs very poorly overall because its high overhead costs prior to processing the query cancel the early wins from better prediction. This is especially evident in Figure 5, top row. Note that random+ is a better baseline than BlazeIt for this query, demonstrating that current techniques developed for other types of queries do not necessarily transfer to this query type.

Figure 3 shows two main trends. The first is that models trained with BlazeIt do succeed in finding objects faster, when we measure it in terms of number of frames processed. However, the total time it takes to sample using ExSample is orders of magnitude shorter because ExSample does not require a prior training phase, and only runs on a subset of frames, whereas the surrogate model needs to be run on every frame (we evaluate the relative costs of these two overheads in BlazeIt in the next Section). The impact of these overheads reduces in magnitude as more samples are taken. However, in none of our experiments is the gain enough to amortize the costs of training.

Figure 4 shows similar trends. This is a challenging scenario for ExSample because each clip is only 40 seconds long, meaning we get fewer samples per chunk (and more chunks overall). This decreases our performance compared to the larger chunks in the Dashcam dataset. However, we still achieve a 2x improvement over random. Surrogate models have a harder time learning with high accuracy on this dataset, as shown in Figure 5. Counterintuitively, this helps in Figure 4 because random scoring is a better baseline for unique object queries than very precise scoring.

Note that BlazeIt-trained models are not able to beat random search for this task. Counterintuitively, this result is not due to random scores being as accurate as the surrogate models: BlazeIt’s error rates are much better than random in the validation set during training. However, BlazeIt is usually better only when used as a classifier but not as a way to sample promising frames. The main issue is that while BlazeIt models do predict which frames are likely to have an object of interest, they pick high scoring frames regardless of whether they are new or not. Because of the greedy nature of this approach, the extra accuracy from BlazeIt models is only helpful early on when every positive result is likely new, at low numbers of samples. The experiments in Figure 3 show BlazeIt is able to find 10% of all bicycles after only about 10 seconds into sampling, whereas ExSample reaches this level after 100 seconds, and random+ after 300 seconds. However, the training and surrogate scoring phases add an overhead of 10000 seconds to BlazeIt, hence ExSample can do 100x better, and random+ 30x better early on since they have no fixed overhead at the start.
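The arithmetic behind the bicycle example can be checked with a few lines of Python. The numbers are the approximate ones quoted in the paragraph above; the variable names are only for illustration.

```python
# Approximate timings quoted for reaching 10% recall on the bicycle query.
blazeit_sampling_s = 10       # time spent in the sampling phase itself
blazeit_overhead_s = 10_000   # training plus surrogate scoring, paid up front
exsample_s = 100
random_plus_s = 300

blazeit_total_s = blazeit_overhead_s + blazeit_sampling_s

# How much faster each overhead-free method is than BlazeIt end to end.
exsample_vs_blazeit = blazeit_total_s / exsample_s
random_plus_vs_blazeit = blazeit_total_s / random_plus_s

# Savings normalized against random+, as used in the result figures.
exsample_savings = random_plus_s / exsample_s
blazeit_savings = random_plus_s / blazeit_total_s

print(round(exsample_vs_blazeit), round(random_plus_vs_blazeit))
print(exsample_savings, round(blazeit_savings, 3))
```

The two ratios come out at roughly 100 and 33, matching the rounded 100x and 30x figures in the text, while BlazeIt's savings relative to random+ fall far below 1x once the fixed overhead is included.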

Figure 3: Results on the Dashcam dataset. Comparing time and frames processed for different recall levels (each row) on the Dashcam dataset. Different methods (color) are compared relative to the costs of using random+ to process the same query and reach the same level of instance recall. The left column shows savings computed in terms of time, which includes initial surrogate training overheads for BlazeIt. The right column shows savings when comparing frames processed, excluding any frames processed for training purposes.
Figure 4: Results on the BDD dataset. The surrogate model has a harder time learning on this dataset, with lower accuracy scores. Counterintuitively, this makes its savings comparable to those of random sampling, improving its results compared to the Dashcam dataset.
Figure 5: The surrogate models trained for each query have different effectiveness. Here we compare their Average Precision to that of a randomly assigned score. For the BDD dataset, the surrogate model does only slightly better than random in precision, which causes it to perform similarly to random when measured in savings. For the Dashcam dataset, the surrogate models are more accurate than random. Counterintuitively, higher accuracy in scores hurts savings in Figure 3 due to the greediness of the algorithm.

4.6 Overhead breakdown

This section aims to break down in more detail the costs underlying the results in the previous section. In short, after BlazeIt surrogate models are trained, they must be run over the full dataset and the scores are used to identify the highest scoring frames for sampling. While BlazeIt surrogate models are indeed much faster than the full models, scanning the full dataset is expensive even if the cheap model were free because loading and decoding video is not free.

Figure 6: Throughput of each processing phase in our implementation of BlazeIt. The yellow boxes show the overall throughput reached by those processing phases in our implementation. Additionally, to distinguish the bottlenecks from the object detector from those of video decoding, we show the maximum throughput achievable by I/O and video decoding in purple, and inference in cyan. Inference with surrogate models is marked to distinguish it from inference with the expensive model. For the scoring phase, which dominates the bulk of the overhead in BlazeIt, scanning through the dataset bounds the throughput to 100fps in our dataset. Although labelling throughput is low, labelling only happens on a fraction of the dataset, so represents a small fraction of the overall runtime.

BlazeIt[blazeit] prioritizes sampling the highest scoring frames, where the score is computed with a cheaper surrogate model. Answering queries with such systems involves four stages, whose throughputs on our data are shown in Figure 6.

  1. labelling phase: requires labelling a fraction of the dataset with the expensive object detector. Its runtime grows linearly with training set size, and because the object detector is involved, the throughput is as low as that of the sampling phase.

  2. training phase: once the labels are generated, a cheaper surrogate model is fit to the dataset. This phase can be relatively cheap if the surrogate is itself cheap and the training set fits in memory, avoiding any need for I/O or decoding. Figure 6 shows the throughput of ResNet-18 is indeed much higher and unlikely to be the bottleneck.

  3. scoring phase: the surrogate model runs over the dataset, producing a score for each frame. Even if the surrogate model is virtually free, Figure 6 shows that the IO and decode for the remainder of the dataset dominate the runtime.

  4. sampling phase: we fetch and process frames in descending order of surrogate score. This phase ends when we have found enough results for the user. This is the only phase for ExSample and for baselines such as random+. Regardless of access pattern, this phase is dominated by the cost of inference with the full model.

The first three phases can be seen as a fixed cost paid prior to finding results for surrogate-based methods. The promise of these surrogate based techniques is that by paying the upfront cost we can greatly save on the rest of the processing. But our results in the previous section show that ExSample is often more effective than the surrogate, without these high up-front costs. Furthermore, these up-front costs have to be paid multiple times, if, for example, the user wants to look for a new class of object or process a new data set. It’s unlikely that users will want to look for the same class of object on the same frames of video again, which is all pre-training a surrogate makes more efficient.
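As a rough consistency check on the fixed cost, the scoring phase alone can be estimated from two figures given elsewhere in the paper: the Dashcam dataset has over 1.1 million frames, and scanning is bounded at about 100 fps. The sketch below assumes the whole dataset is scanned at that bound, which slightly overstates the work if already-labelled frames are skipped; the helper name is illustrative.

```python
def scan_time_s(num_frames, throughput_fps):
    """Time to push `num_frames` through a phase bounded at `throughput_fps`."""
    return num_frames / throughput_fps


dashcam_frames = 1_100_000  # "over 1.1 million video frames" (Section 4.2)
scoring_fps = 100           # decode-bound throughput of the scoring phase (Figure 6)

scoring_s = scan_time_s(dashcam_frames, scoring_fps)
print(scoring_s, scoring_s / 3600)  # seconds, hours
```

This gives about 11,000 seconds, or roughly three hours, which is of the same order as the 10000 second overhead quoted in the results, before any labelling or training time is added.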

5 Related Work

Several approaches have recently proposed optimizations to address the cost of mining video data. A common idea in these approaches is to use cascaded classifiers, such as the Viola-Jones cascade detector [viola_jones], which enables real-time object tracking in video streams with a cascade framework that considers additional Haar-like features in successive cascade layers. Lu et al. propose applying cascaded classifiers to efficiently process video queries containing probabilistic predicates that specify desired precision and recall levels. They employ SVM, KDE, and deep neural network classifiers that input features from dimension reduction approaches such as principal component analysis and feature hashing to efficiently skip processing of video frames that the classifiers are confident do not contain objects relevant to the query. NoScope [noscope] trains specialized approximations to expensive CNNs while maintaining accuracy levels within a user-defined window. However, these approaches do not generalize well to diverse types of video data. For example, they require a costly training process to evaluate classifier accuracy (and, in some cases, to construct the classifiers), which may differ from video to video. Similarly, NoScope uses a difference detector specifically designed for video from static cameras, which is ineffective on video captured in mobile settings, e.g. by dashboard cameras. Moreover, for datasets where cascaded classifiers perform well, our approach is complementary, as the classifiers can be applied over sampled frames to obtain additional speedup.

Unlike cascaded classifier methods, BlazeIt [blazeit] proposes training specialized models to evaluate specific query clauses, and applies random sampling when specialized models are not available. BlazeIt also proposes a declarative, SQL-like language for querying objects with several constraint types. As with other methods, our approach complements BlazeIt: replacing its random sampling with our sampling algorithm may yield substantial performance gains. It is possible that the methods proposed here could be integrated into systems such as BlazeIt.

Other techniques focus on improving neural network inference speed. Deep Compression [deepcompression] prunes connections in the network that have a negligible impact on the inference result. ShrinkNets [shrinknets] proposes a dynamic network resizing scheme that extends the pruning process to neural network training, reducing not only inference time but training time as well. Again, these techniques can be combined with ExSample to further improve query processing speed.

6 Conclusion

Over the next decade, workloads that process video to extract useful information may become a standard data mining task for analysts in application areas such as government, real estate, and autonomous vehicles. Such pipelines present a new systems challenge due to the cost of applying state of the art machine vision techniques.

In this paper we introduced ExSample, an approach for processing instance finding queries on large video repositories through chunk-based adaptive sampling. Specifically, the aim of the approach is to find frames of video that contain instances of objects of interest, without running an object detection algorithm on every frame, which could be prohibitively expensive. Instead, in ExSample, we sample frames and run the detector on just the sampled frames, tuning the sampling process based on whether a new instance of an object of interest is found in the sampled frames. To do this tuning, ExSample partitions the data into chunks and dynamically adjusts the frequency with which it samples from different chunks based on the rate at which new instances are sampled from each chunk. We formulate this sampling process as an instance of Thompson sampling, using a Good-Turing estimator to compute the likelihood of finding a new object instance in each video chunk. In this way, as new instances in a particular chunk are exhausted, ExSample naturally refocuses its sampling on other less frequently sampled chunks.

Our evaluation of ExSample on a real-world dataset of dashcam data shows that it is able to substantially improve on both the number of frames it samples and the total runtime versus both random sampling and methods based on lightweight “surrogate” models, such as BlazeIt [blazeit], that are designed to estimate frames likely to contain objects of interest with lower overhead. In particular, these surrogate-based methods are much slower because they require running the surrogate model on all frames.

Appendix A Notation

Number of frames sampled so far. Frames sampled and frames processed mean the same thing.
Number of distinct results in the data. We treat the terms result and instance as synonyms.
Index variable over results. .
Number of results seen exactly once up until the sampled frame. We omit when is clear from context
set of seen after frames have been processed.
Probability of seeing result in a randomly drawn frame. It is proportional to duration in video. We treat duration and probability as synonyms.
Number of new results we expect to find in the next sampled frame
mean duration
stddev of durations
: the chance result appears first at the sampled frame. We may leave the implicit.
Number of chunks
Index variable over chunks.

Appendix B Objects spanning multiple chunks

Here we prove Equation 7 is also valid when different chunks may share instances. Assume we have sampled frames from chunk 1, from chunk 2, etc. Assume instance can appear in multiple chunks: with probability of being seen after sampling chunk 1, of being seen after sampling chunk 2 and so on. We will assume we are working with chunk 1, without loss of generality. The expected number of new instances if we sample once more from chunk is:


Similarly, the expected number of instances seen exactly once in chunk , and in no other chunk up to this point is


In both equations, the expression factors in the need for instance to not have been seen while sampling chunks 2 to . We will abbreviate this factor as . When instances only show up in one chunk, , and everything is the same as in Equation 1.

The expected error is:


Which again is term-by-term smaller than by a factor of
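The quantities described in this appendix can be written out as follows. This is a reconstruction sketch under assumed notation, not a copy of the original equations: $n_j$ is the number of frames sampled from chunk $j$, $p_i^{(j)}$ is the probability that a randomly drawn frame of chunk $j$ shows instance $i$, $M$ is the number of chunks, $q_i$ is the abbreviated factor for not having seen instance $i$ in chunks 2 to $M$, and frames are assumed to be drawn independently.

```latex
% Assumed notation (see the paragraph above): n_j, p_i^{(j)}, M, q_i.
\begin{align*}
q_i &= \prod_{j=2}^{M} \bigl(1 - p_i^{(j)}\bigr)^{n_j} \\[4pt]
% Expected number of new instances from one more sample of chunk 1:
\mathbb{E}\bigl[\text{new instances}\bigr]
  &= \sum_i p_i^{(1)} \bigl(1 - p_i^{(1)}\bigr)^{n_1}\, q_i \\[4pt]
% Expected number of instances seen exactly once in chunk 1 and nowhere else:
\mathbb{E}\bigl[N_1\bigr]
  &= \sum_i n_1\, p_i^{(1)} \bigl(1 - p_i^{(1)}\bigr)^{n_1 - 1}\, q_i \\[4pt]
% Expected error of the estimate N_1 / n_1:
\frac{\mathbb{E}\bigl[N_1\bigr]}{n_1} - \mathbb{E}\bigl[\text{new instances}\bigr]
  &= \sum_i \bigl(p_i^{(1)}\bigr)^{2} \bigl(1 - p_i^{(1)}\bigr)^{n_1 - 1}\, q_i
\end{align*}
```

Under these assumptions, each term of the error equals the matching term of $\mathbb{E}[N_1]/n_1$ multiplied by $p_i^{(1)}$, and when an instance appears in only one chunk, $p_i^{(j)} = 0$ for $j \ge 2$, so $q_i = 1$ and the single-chunk expressions are recovered.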