Reinventing Data Stores for Video Analytics

10/03/2018, by Tiantu Xu, et al.

We present a data store managing large videos for retrospective analytics. Our data store orchestrates video ingestion, storage, retrieval, and consumption. Towards resource efficiency, it takes the key opportunity of controlling the video formats along the video data path. We are challenged by i) the huge combinatorial space of video format knobs; ii) the complex impacts of these knobs and their high profiling cost; iii) optimizing for multiple resource types. To this end, our data store builds on a key idea called backward derivation of configuration: in the opposite direction along the video data path, the data store passes the video quantity and quality desired by analytics backward to retrieval, to storage, and to ingestion. In this process, our data store derives an optimal set of video formats, optimizing for different system resources in a progressive manner. Our data store automatically derives large, complex configurations consisting of hundreds of knobs. It streams video data from disks through decoder to operators and runs queries as fast as 362x of video realtime.







1. Introduction

Pervasive cameras produce videos at an unprecedented rate. Over the past 10 years, annual shipments of surveillance cameras grew by 10x, to 130M per year (IHS, 2018). Many campuses are reported to run more than 200 cameras 24x7 (Seagate, 2017). In such deployments, a single camera produces as much as 24 GB of encoded video footage per day (720p30).

Retrospective video analytics

To generate insights from enormous video data, retrospective analytics is vital: video streams are captured and stored on disks for a user-defined lifespan; users run queries over the stored videos on demand. Retrospective analytics offers several key advantages that live analytics lacks. i) Analyzing many video streams in real time is expensive, e.g. running neural networks over live videos from a $200 camera requires a $4000 GPU (Kang et al., 2017). ii) Query types may only become known after video capture, e.g. for crime investigation (Kang et al., 2018). iii) At query time, users may interactively revise their query types or parameters (Kang et al., 2018; Feng et al., 2018), which may not be foreseen at ingestion time. iv) In many applications, only a small fraction of video will eventually be queried (IHS, 2016), making live analytics an overkill.

A video query (e.g. “what are the license plate numbers of all blue cars in the last week?”) is typically executed as a cascade of operators (Shen et al., 2017; Jiang et al., 2018; Kang et al., 2018; Kang et al., 2017; Zhang et al., 2017), e.g. an object recognizer. Given a query, a query engine assembles a cascade and runs the operators. Query engines typically expose to users tradeoffs between operator accuracy and resource costs, allowing users to obtain less accurate results in less time and thus explore large videos interactively (Kang et al., 2018; Kang et al., 2017). Recent query engines show promise of high speed, e.g. consuming one day of video in several minutes (Kang et al., 2017).

While recent query engines assume that all input data is present in memory as raw frames, there is no video store that manages large videos for retrospective queries. Such a store should orchestrate four key stages on the video data path: ingestion, storage, retrieval, and consumption, as shown in Figure LABEL:fig:concept. The four stages demand multiple hardware resources, including encoder/decoder bandwidth, disk space, and CPU/GPU cycles for query execution. The resource demands are high, owing to high-volume, high-velocity video data, and demands for different resource types may conflict. For optimizing these stages for resource efficiency, classic video databases are inadequate (Kang et al., 2009): they were designed for human consumers watching videos at 1x–2x the speed of video realtime; they are incapable of serving algorithmic consumers processing videos at more than 1000x the speed of video realtime. Shifting part of the analytics workload to ingestion (Hsieh et al., 2018) has important limitations and does not obviate the need for such a video store, as we will show in the paper.

Towards designing a video store, we advocate for taking a key opportunity: as video flows through the data path, the store should actively control video formats, namely fidelity and coding, through extensive video parameters called knobs. These knobs have significant impacts on resource costs and analytics accuracy, opening a rich space of tradeoffs.

We present VStore, a system managing large videos for retrospective analytics. The top feature of VStore is its automatic configuration of video formats. As video streams arrive, VStore saves multiple video versions and judiciously sets their storage formats; in response to queries, VStore retrieves stored video versions and converts them into consumption formats catering to the executed operators. Through configuring video formats, VStore ensures that operators meet their desired accuracies at high speed; it prevents video retrieval from bottlenecking consumption; it elastically trades off different resources in order to operate under system resource budgets.

In determining this set of video formats, VStore is challenged by i) an enormous combinatorial space of video knobs; ii) complex impacts of these knobs and high costs in profiling them; iii) optimizing for multiple resource types. These challenges were unaddressed by prior systems. While classic video databases may save video contents in multiple formats, their format choices are oblivious to analytics and often ad hoc (Kang et al., 2009). While existing query engines recognize the significance of video formats (Kang et al., 2018; Zhang et al., 2017; Jiang et al., 2018) and optimize them for query execution, they omit the impact of video coding and the components crucial to retrospective analytics, e.g. storage and retrieval.

To address these challenges, our key idea behind VStore is backward derivation, shown in Figure LABEL:fig:concept. In the opposite direction of the video data path, VStore passes the desired data quantity and quality from algorithmic consumers backward to retrieval, to storage, and to ingestion. In this process, VStore optimizes for different resources in a progressive manner; it trades off among them to respect system resource budgets. More specifically, i) from operators and their desired accuracies, VStore derives video formats for the fastest data consumption, for which it effectively searches high-dimensional parameter spaces with video-specific heuristics; ii) from the consumption formats, VStore derives video formats for storage, for which it systematically coalesces video formats to optimize ingestion and storage costs; iii) from the storage formats, VStore derives a data erosion plan, which gradually deletes aging video data, trading off analytics speed for lower storage cost.

Through evaluation with two real-world queries over six video datasets, we demonstrate that VStore is capable of deriving large, complex configurations with hundreds of knobs that are infeasible for humans to tune. With the configuration, VStore automatically creates multiple video versions for storage. To serve queries, it streams video data (encoded or raw) from disks through the decoder to operators, running queries as fast as 362x of video realtime. As users lower the target query accuracy, VStore elastically scales down the costs by switching operators to different formats of videos, accelerating the query by two orders of magnitude. This query speed is 150x higher compared to systems that lack automatic configuration of video formats. VStore judiciously adapts its configuration to respect resource budgets. VStore reduces the total configuration overhead by 5x.


We have made the following contributions.


  • We make a case for a new video store for serving retrospective analytics over large videos. We formulate the design problem and experimentally explore the design space.

  • To design such a video store, we identify configuration of video formats as the central concern. We present a novel approach called backward derivation. With this approach, we contribute new techniques for searching large spaces of video knobs, for coalescing stored video formats, and for eroding aging video data.

  • We report VStore, a concrete implementation of our design. Our evaluation shows promising results. To our knowledge, VStore is the first holistic system that manages the full video lifecycle optimized for retrospective analytics.

2. Motivations

2.1. Retrospective Video Analytics

Query & operators

A video query is typically executed as a cascade of operators. As shown in Figure LABEL:fig:pipeline, early operators scan most of the queried video timespan at low cost. They activate late operators over a small fraction of the video for deeper analysis. Operators consume raw video frames. Within a cascade, operators’ execution costs can differ by three orders of magnitude (Kang et al., 2017); operators also prefer different input video formats, catering to their internal algorithms.

Accuracy/cost tradeoffs in operators

An operator’s output quality is characterized by accuracy, i.e., how close the output is to the ground truth. We use a popular accuracy metric called the F1 score: the harmonic mean of precision and recall (Jiang et al., 2018). At runtime, an operator’s target accuracy is set in queries (Zhang et al., 2017; Kang et al., 2018). The system seeks to provision minimum resources for the operator to achieve the target accuracy.
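The F1 score above is straightforward to compute; a minimal sketch (the precision/recall values below are illustrative, not from the paper):

```python
# F1 score: harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. a hypothetical operator with precision 0.9 and recall 0.8:
print(round(f1_score(0.9, 0.8), 3))  # -> 0.847
```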

2.2. System model

We consider a video store running on one or a few commodity servers. Incoming video data flows through the following major system components.

Ingestion: Video streams continuously arrive. In this work we consider the input rate of incoming video as given. The ingestion optionally converts the video formats, e.g., by resizing frames. It saves the ingested videos either as encoded videos (through transcoding) or as raw frames. The ingestion throughput is bound by transcoding bandwidth, which is one order of magnitude lower than disk bandwidth. This paper will present more experimental results on ingestion.

Storage: Like other time-series data stores (Agrawal and Vulimiri, 2017), videos have age-based values. A store typically holds video footage for a user-defined lifespan (Oracle, 2015). In queries, users often show higher interest in more recent videos.

Retrieval: In response to operator execution, the store retrieves video data from disks, optionally converts the data format for the operators, and supplies the resultant frames. If the on-disk videos are encoded, the store must decode them before supplying. Data retrieval may be bound by decoding or disk read speed. Since decoding throughput (often tens of MB/sec) is far below disk throughput (at least hundreds of MB/sec), the disk only becomes the bottleneck when loading raw frames.

Consumption: The store supplies video data to consumers, i.e. operators spending GPU or CPU cycles to consume the data.

Each of the above system components incurs resource costs. Figure LABEL:fig:concept summarizes the resource types. The retrieval and consumption costs are reciprocal to the data retrieval and consumption speeds, respectively; an operator effectively runs at the lower of the two speeds. To quantify speed, we adopt as the metric the ratio between video duration and video processing delay. For instance, if a 1-second video is processed in 1 ms, the speed is 1000x of realtime.
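The speed metric can be sketched directly; a toy illustration (the 500x/2000x speeds below are hypothetical, not measurements from the paper):

```python
# Speed metric: ratio of video duration to processing delay,
# expressed as a multiple of video realtime.
def speed_of_realtime(video_seconds: float, processing_seconds: float) -> float:
    return video_seconds / processing_seconds

# A 1-second video processed in 1 ms runs at 1000x of realtime.
assert speed_of_realtime(1.0, 0.001) == 1000.0

# An operator's effective speed is bounded by the slower of retrieval
# and consumption (hypothetical 500x retrieval, 2000x consumption):
effective = min(speed_of_realtime(1, 1 / 500), speed_of_realtime(1, 1 / 2000))
assert effective == 500.0
```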

Key opportunity: controlling video formats

As video data flows through, a video store is at liberty to control the video formats. This is shown in Figure LABEL:fig:concept. At the ingestion, the system decides fidelity and coding for each stored video version; at the data retrieval, the system decides the fidelity for each raw frame sequence supplied to consumers.

Running operators at ingestion is not a panacea

Recent work runs early-stage operators at ingestion to save executions of expensive operators at query time (Hsieh et al., 2018). Yet, this approach has important limitations. i) It bakes query types in the ingestion. As video queries are increasingly richer (Modiri Assari et al., 2016; Chen et al., 2018; Wu et al., 2013; Girshick, 2015; Ren et al., 2015), running all early operators at ingestion is expensive. ii) It bakes specific accuracy/cost tradeoffs in the ingestion. Yet, users at query time often know better tradeoffs, based on domain knowledge and interactive exploration (Feng et al., 2018; Kang et al., 2018). iii) It prepays computation cost for all ingested videos. In many applications, only a small fraction of ingested video is eventually queried (Seagate, 2017); most operator execution at ingestion would result in no return. In comparison, by preparing data for queries, a video store supports richer query types, incurs lower ingestion cost, and allows flexible query-time tradeoffs.

2.3. Video Format Knobs

Table 1. Knobs and their values considered in this work. Total: 7 knobs and 15K possible combinations of values. Note: no video quality or coding knobs for RAW

Video format is controlled by a set of parameters, or knobs.

Fidelity knobs

For video data, encoded or raw, fidelity knobs dictate i) the quantity of visual information, e.g. frame sampling, which determines the frame rate; ii) the quality of visual information, which is subject to loss due to video compression. Table 1 summarizes the fidelity knobs considered in this work, chosen for their high resource impacts. Each fidelity knob has a finite set of possible values. A combination of knob values constitutes a fidelity option f; all possible fidelity options constitute the fidelity space.
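As a toy illustration of how knob values combine into a fidelity space (the knob names follow Table 1, but the value sets below are hypothetical and smaller than the paper's actual ones):

```python
from itertools import product

# Hypothetical fidelity knob value sets; the paper's Table 1 lists the
# actual knobs and values (7 knobs, ~15K combinations in total).
fidelity_knobs = {
    "crop":       ["none", "0.5", "0.25"],
    "resolution": ["720p", "540p", "360p", "180p"],
    "sampling":   ["1", "1/2", "1/8", "1/30"],
    "quality":    ["good", "bad"],
}

# Every combination of knob values is one fidelity option; the whole
# Cartesian product is the fidelity space.
fidelity_space = list(product(*fidelity_knobs.values()))
print(len(fidelity_space))  # 3 * 4 * 4 * 2 = 96 options in this toy space
```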

“Richer-than” order

Among all possible values of one fidelity knob, one may establish a richer-than order (e.g. 720p is richer than 180p). Among fidelity options, one may establish a partial order of richer-than: option X is richer than option Y if and only if X has the same or richer values on all knobs and richer values on at least one knob. The richer-than order does not exist between all pairs of fidelity options, e.g. between good-50%-720p-1/2 and bad-100%-540p-1. One can degrade fidelity X to obtain fidelity Y only if X is richer than Y.
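The partial order can be sketched as follows (the knob value orderings here are hypothetical stand-ins for Table 1's, and only a subset of knobs is shown):

```python
# Each knob's values, listed from poorest to richest (hypothetical scales).
RANK = {
    "quality":    ["bad", "good"],
    "sampling":   ["1/30", "1/8", "1/2", "1"],
    "resolution": ["180p", "360p", "540p", "720p"],
}

def richer_than(x: dict, y: dict) -> bool:
    """True iff x is richer than y: same-or-richer on every knob,
    strictly richer on at least one knob."""
    ge = all(RANK[k].index(x[k]) >= RANK[k].index(y[k]) for k in RANK)
    gt = any(RANK[k].index(x[k]) > RANK[k].index(y[k]) for k in RANK)
    return ge and gt

a = {"quality": "good", "sampling": "1/2", "resolution": "720p"}
b = {"quality": "bad",  "sampling": "1",   "resolution": "540p"}
# Neither option is richer than the other: the order is only partial,
# mirroring the good-50%-720p-1/2 vs. bad-100%-540p-1 example.
assert not richer_than(a, b) and not richer_than(b, a)
```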

Figure 1. Impacts of coding knobs. Video: 100 seconds from tucson. See §6 for dataset and test hardware

Coding Knobs

Coding reduces raw video size by up to two orders of magnitude (Wu et al., 2018). Coding knobs control encoding/decoding speed and the encoded video size. Orthogonal to fidelity knobs, coding knobs provide valuable tradeoffs among the costs of ingestion, storage, and retrieval. These tradeoffs do not affect the consumer behaviors – operator accuracy and consumption cost. While a modern encoder may expose tens of coding knobs (e.g. around 50 for x264), we pick three for their high impacts and ease of interpretation. Table 1 summarizes these knobs and Figure 1 shows their impacts.

Speed step accelerates encoding/decoding at the expense of increased video size. As shown in Figure 1(a), it can lead to up to a 40x difference in encoding speed and up to a 2.5x difference in storage space.

Keyframe interval: An encoded video stream is a sequence of chunks (also called “groups of pictures” (Fouladi et al., 2017)): beginning with a keyframe, a chunk is the smallest unit of video data that can be decoded independently. The keyframe interval offers an opportunity to accelerate decoding when consumers only consume a fraction of frames through sampling. If the frame sampling interval is larger than the keyframe interval, the decoder can skip the chunks between two adjacent sampled frames without decoding them. In the example in Figure 1(b), this increases decoding speed by up to 6x, but incurs higher storage cost.

Coding bypass: The ingestion may save incoming videos as raw frames on disks. The resultant extremely low retrieval cost is desirable to some fast consumers (see Section 3).
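The chunk-skipping arithmetic behind the keyframe interval can be sketched as follows (frame counts and intervals are illustrative, not the paper's measurements):

```python
# With keyframe interval K (frames per independently decodable chunk) and
# frame sampling interval S, the decoder need only decode the chunks that
# actually contain sampled frames; when S > K, whole chunks are skipped.
def chunks_to_decode(total_frames: int, keyframe_interval: int,
                     sampling_interval: int) -> set:
    sampled = range(0, total_frames, sampling_interval)
    return {f // keyframe_interval for f in sampled}

# 300 frames, 10-frame chunks, sampling every 30th frame:
needed = chunks_to_decode(300, 10, 30)
print(len(needed), "of", 300 // 10, "chunks")  # 10 of 30 chunks
```

When the sampling interval is smaller than the keyframe interval, every chunk contains a sampled frame and nothing can be skipped, which is why the knob only helps sparse consumers.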

A combination of coding knob values is a coding option c; all coding options constitute the coding space.

2.4. Knob impacts

Figure 2. Fidelity knobs have high, complex impacts on costs of multiple components (normalized on each axis) and operator accuracy (annotated in legends). Each plot: one knob changing; all others fixed. See §6 for methodology

As illustrated in Figure LABEL:fig:concept, for videos stored on disks, fidelity and coding knobs jointly impact the costs of ingestion, storage, and retrieval; for videos consumed by operators, fidelity knobs impact the consumption cost and the consuming operator’s accuracy. We make a few observations.

Fidelity knobs enable rich cost/accuracy tradeoffs. As shown in Figure 2, one may reduce resource costs by up to 50% with minor (5%) accuracy loss. The knobs also enable rich tradeoffs among resource types. This is exemplified in Figure 3: although the three fidelity options all lead to similar operator accuracy (0.8), no single option is the most resource-efficient, e.g. fidelity B incurs the lowest consumption cost but a high storage cost due to its high image quality.

Each knob has significant impacts. Take Figure 2(b) as an example: a one-step change to image quality reduces accuracy from 0.95 to 0.85, the storage cost by 5x, and the ingestion cost by 40%.

Omitting knobs misses valuable tradeoffs. For instance, to achieve a high accuracy of 0.9, the license detector would incur 60% more consumption cost when the image quality of its input video changes from “good” to “bad”. This is because the operator must consume a higher quantity of data to compensate for the lower quality. Yet, storing all videos with “good” quality requires 5x the storage space. Fixing image quality at either value wastes resources. Unfortunately, most prior video analytics systems omitted the image quality knob and used default values.

Figure 3. Disparate costs of fidelity options A–C, despite all leading to operator accuracy 0.8. Operator: license detector. Cost normalized on each axis. See §6 for methodology

The quantitative impacts are complex.

i) The knob/cost relations are difficult to capture in analytical models (Zhang et al., 2017). ii) The quantitative relations vary across operators and across video contents (Jiang et al., 2018). This is exemplified by Figure 2(c) and (d), which show the same knob’s different impacts on two operators. iii) One knob’s impact depends on the values of other knobs. Take the license detector as an example: as image quality worsens, the operator’s accuracy becomes more sensitive to resolution changes. With “good” image quality, lowering the resolution from 720p to 540p slightly reduces accuracy, from 0.83 to 0.81; with “bad” image quality, the same resolution adjustment significantly reduces accuracy, from 0.76 to 0.52. While prior work assumes that certain knobs have independent impacts on accuracy (Jiang et al., 2018), our observations show that dependencies exist among a larger set of knobs.

Summary & discussion

Controlling video formats is central to a video store design. The store should actively manage fidelity and coding throughout the video data path. To characterize knob impacts, the store needs regular profiling.

Some video analytics systems recognize the significance of video formats (Kang et al., 2018; Zhang et al., 2017; Jiang et al., 2018). However, they focus on optimizing query execution while omitting other resources, such as storage, that are critical to retrospective analytics. They are mostly limited to two fidelity knobs (resolution and frame sampling) while omitting others, especially coding. As we will show, a synergy between fidelity and coding knobs is vital.

3. A case for a new video store

We set out to design a video store that automatically creates and manages video formats in order to satisfy algorithmic video consumers with high resource efficiency.

3.1. The Configuration Problem

The store must determine a global set of video formats as follows.

Storage format: the system may save one ingested stream in multiple versions, each characterized by a fidelity option f′ and a coding option c. We refer to SF = <f′, c> as a storage format.

Consumption format: the system supplies raw frame sequences to different operators running at a variety of accuracy levels, i.e. consumers. The format of each raw frame sequence is characterized by a fidelity option f. We refer to CF = <f> as a consumption format.

We refer to the global set of video formats as the store’s configuration of video formats.

Configuration requirements

These formats should jointly meet the following requirements:

R1. Satisfiable fidelity

To supply frames in a consumption format CF, the system must retrieve video in a storage format SF whose fidelity f′ is richer than or the same as the consumption fidelity f.

Figure 4. Video retrieval could bottleneck consumption. A comparison between decoding and consuming speeds (y-axis; logarithmic) with different video fidelity (x-axis). Operator accuracy annotated on the top. See §6 for test hardware

R2. Adequate retrieval speed

Video retrieval should not slow down frame consumption. Figure 4 shows two cases where such a slowdown happens. a) For fast operators sparsely sampling video data, decoding may not be fast enough if the on-disk video is kept in the original format in which it was ingested (e.g. 720p@30fps as from a surveillance camera). These consumers benefit from storage formats that are cheaper to decode, e.g. with reduced fidelity. b) For some operators quickly scanning frames looking for simple visual features, even the storage format that is cheapest to decode (i.e. f′ the same as f, with the cheapest coding option) is too slow. These consumers benefit from retrieving raw frames from disks.

R3. Consolidating storage formats

Each stored video version incurs ingestion and storage costs. The system should exploit a key opportunity: creating one storage format that supplies data to multiple consumers, as long as satisfiable fidelity and adequate retrieval speed are ensured.

R4. Operating under resource budgets

The store should keep the disk space consumed by all videos under the available disk space. It should keep the ingestion cost of creating all video versions under the system’s transcoding bandwidth.

3.2. Inadequacy of existing video stores

Computer vision research typically assumes that all input data is present in memory as raw frames, which does not hold for retrospective analytics over large videos: a server with 100 GB of DRAM holds no more than two hours of raw frames even in low fidelity (e.g. 360p30). Most video stores choose video formats in ad hoc manners without optimizing for analytics (rol, 2018). On one extreme, many save videos in one unified format (e.g. the richest fidelity expected by all operators). This minimizes storage and ingestion costs while incurring high retrieval cost; as a result, data retrieval may bottleneck operators. On the other extreme, one may materialize storage formats whose fidelity exactly matches every consumer expectation. This misses the opportunity to consolidate storage formats and leads to excessive storage/ingestion costs. We will evaluate these two alternatives in Section 6.

Layered encoding cannot simplify the problem

Layered encoding promises space efficiency: it stores one video’s multiple fidelity options as complementary layers (Seufert et al., 2015). However, layered encoding has important caveats. i) Each additional layer has non-trivial storage overhead (sometimes 40%–100%) (Kreuzberger et al., 2015), which may waste storage space compared to consolidated storage formats. ii) Decoding is complex and slow, due to the combination of layers and random disk accesses in reading the layers. iii) Although invented two decades ago, its wide adoption and coding performance are yet to be seen. Even if it is eventually adopted and proven desirable, it would make the configuration problem more complex.

4. The VStore Design

4.1. Overview

VStore runs over one or a few commodity servers. It works with an existing query executor, e.g. OpenALPR, which provides a library of operators, predefined or user-defined. From the executor, VStore expects an interface for executing individual operators for profiling, and a manifest specifying a set of desirable accuracies for each operator. VStore tracks the whole set of <operator, accuracy> tuples as consumers.


During operation, VStore periodically updates its video format configuration. It profiles operators and encoding/decoding on samples of ingested videos. VStore splits and saves video footage in segments, which are 10-second video clips in our implementation. VStore retrieves or deletes each segment independently.


VStore is challenged by the configuration cost. i) Exhaustive search is infeasible. A configuration consists of a set of consumption formats from the 4D fidelity space and a set of storage formats from the 7D fidelity/coding space; in our prototype, the number of possible global configurations is astronomical. ii) Exhaustive profiling is expensive, as will be discussed in Section 4.2. iii) Optimizing for multiple resource types further complicates the problem.

These challenges were unaddressed. Some video query engines seek to ease configuration and profiling (challenges i and ii), but are limited to a few knobs (Zhang et al., 2017; Jiang et al., 2018). For the extensive set of knobs we consider, some of their assumptions, e.g. knob independence, do not hold. They optimize for one resource type – GPU cycles for queries – without accounting for other critical resources, e.g. storage (challenge iii).

Figure 5. VStore derives the configuration of video formats. Example consumption/retrieval speed is shown

Mechanism overview – backward derivation

VStore derives the configuration backwards, in the direction opposite to the video data flow – from sinks, to retrieval, and to ingestion/storage. This is shown in Figure 5. In this backward derivation, VStore optimizes for different resources in a progressive manner.


Section 4.2: From all given consumers, VStore derives video consumption formats. Each consumer consumes, i.e., subscribes to, a specific consumption format. In this step, VStore optimizes data consumption speed.


Section 4.3: From the consumption formats, VStore derives storage formats. Each consumption format subscribes to one storage format (along the reversed directions of the dashed arrows in Figure 5). The chosen storage formats ensure i) satisfiable fidelity: a storage format SF has fidelity richer than or the same as any of its downstream consumption formats (CFs); ii) adequate retrieval speed: the retrieval speed of SF exceeds the consumption speed of any downstream consumer (following the dashed arrows in Figure 5). In this step, VStore optimizes for storage cost and keeps the ingestion cost under budget.


Section 4.4: From all the derived storage formats, VStore derives a data erosion plan, gradually deleting aging video. In this step, VStore reduces storage cost to be under budget.


VStore treats individual consumers as independent, without considering their dependencies in query cascades. If consumer A always precedes B in all possible cascades, the speeds of A and B should be considered in conjunction. This would require VStore to model all possible cascades, which we consider as future work. VStore does not manage algorithmic knobs internal to operators (Jiang et al., 2018; Zhang et al., 2017); doing so would allow new, useful tradeoffs for consumption but not for ingestion, storage, or retrieval.

4.2. Configuring consumption formats


For each consumer, the system decides a consumption format for the frames supplied to it. By consuming these frames, the consumer should achieve its target accuracy while consuming data at the highest speed, i.e. with minimum consumption cost.

The primary overhead comes from operator profiling. Recall that the relation between fidelity and accuracy/cost has to be profiled per operator, and regularly. For each profiling run, the store prepares sample frames in fidelity f, runs an operator over them, and measures the accuracy and consumption speed. If the store profiled all the operators over all the fidelity options, the total number of required profiling runs, even for our small library of 6 operators, would be 600. The total profiling time would be long, as we will show in the evaluation.

Key ideas

VStore explores the fidelity space efficiently, profiling only a small subset of fidelity options. It builds on two key observations. O1. Monotonic impacts: an increase in any fidelity knob leads to a non-decreasing change in consumption cost and operator accuracy – richer fidelity will reduce neither cost nor accuracy. This is exemplified in Figure 2, which shows the impact of changes to individual knobs. O2. Image quality does not impact consumption cost: unlike other fidelity knobs, which control data quantity, image quality often does not affect operator workload and thus the consumption cost, as shown in Figure 2(b).

We next sketch the algorithm that decides the consumption format for a consumer with target accuracy accuracy-t: the algorithm aims to find the fidelity that achieves accuracy no lower than accuracy-t (i.e. adequate accuracy) at the lowest consumption cost.

    Partitioning the 4D space

i) Given that image quality does not impact consumption cost (O2), VStore starts by temporarily fixing the image quality knob at its highest value. ii) In the remaining 3D space (crop × resolution × sampling), VStore searches for the fidelity that leads to adequate accuracy at the lowest consumption cost. iii) As shown in Figure 6, VStore partitions the 3D space into a set of 2D spaces for search. To minimize the number of 2D spaces under search, VStore partitions along the shortest dimension, chosen as the crop factor, which often has few possible values (3 in our implementation). iv) The fidelity found in the 3D space already yields adequate accuracy at the lowest consumption cost. While lowering its image quality does not reduce the consumption cost, VStore still does so until the resultant accuracy becomes the minimum adequate; it then selects the resulting knob values as the consumption format. This opportunistically reduces other costs (e.g. storage).

Figure 6. Search in a set of 2D spaces for a fidelity option with accuracy >= 0.8 and maximum consumption speed (i.e. minimum consumption cost).

Efficient exploration of a 2D space

The kernel of the above algorithm is searching each 2D space (resolution × sampling), as illustrated in Figure 6. In each 2D space, VStore looks for an accuracy boundary. Shown as shaded cells in the figure, the accuracy boundary splits the space into two regions: all points to its left have inadequate accuracies, while all points to its right have adequate accuracies. To identify the boundary, VStore leverages the fact that accuracy is monotonic along each dimension (O1). As shown in Figure 6, it starts from the top-right point and explores toward the bottom and the left. VStore only profiles the fidelity options on the boundary. It dismisses points in the left region due to their inadequate accuracies. It dismisses any point X in the right region because X has fidelity richer than some boundary point Y, and therefore X incurs no less consumption cost than Y.

This exploration is inspired by a well-known algorithm for searching a monotone 2D array (Cheng et al., 2008). However, our problem is different: the chosen fidelity has to offer both adequate accuracy and the lowest consumption cost. Therefore, VStore has to explore the entire accuracy boundary: it cannot stop at the point where the minimum adequate accuracy is found, which may not yield the lowest consumption cost.
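The boundary walk can be sketched as follows. This is a minimal illustration under assumption O1 (accuracy monotonically non-decreasing along both axes, with richer fidelity at larger indices); the profile function below is a synthetic stand-in for actual operator profiling, not the paper's implementation:

```python
# Staircase walk along the accuracy boundary of one 2D (resolution x
# sampling) plane. Profiles only O(rows + cols) cells instead of all
# rows * cols, and keeps the cheapest adequate cell seen on the boundary.
def boundary_search(rows, cols, profile, target):
    """profile(i, j) -> (accuracy, cost). Returns the cheapest cell with
    adequate accuracy (accuracy >= target) on the boundary."""
    best_cost, best_point = float("inf"), None
    i, j = 0, cols - 1            # start at the "top-right" corner
    while i < rows and j >= 0:
        acc, cost = profile(i, j)
        if acc >= target:
            if cost < best_cost:  # a boundary point; keep the cheapest
                best_cost, best_point = cost, (i, j)
            j -= 1                # explore to the left (poorer column)
        else:
            i += 1                # explore downward (next row)
    return best_point, best_cost

# Synthetic monotone profile: accuracy grows with i + j; cost is arbitrary.
calls = []
def profile(i, j):
    calls.append((i, j))
    return (i + j) / 7, j * 4 + i

point, cost = boundary_search(4, 5, profile, target=0.5)
assert point == (3, 1) and cost == 7
assert len(calls) <= 4 + 5        # far fewer than the 4 * 5 exhaustive runs
```

Note that the walk visits every boundary cell rather than stopping at the first adequate one, matching the requirement above that the whole boundary be explored.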

Cost & further optimization

    Each consumer requires profiling runs on the order of the sum of the knobs' value counts (∑k nk, where nk is the number of values for knob k). This is much lower than exhaustive search, which requires their product (∏k nk) runs. Furthermore, in profiling for the same operator's different accuracies, VStore memoizes profiling results. Our evaluation (§6) will show that profiling all accuracies of one operator is still cheaper than exhaustively profiling the operator over the entire fidelity space.
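The sum-versus-product gap is easy to see with illustrative knob cardinalities (the counts below are examples, not VStore's exact knob ranges):

```python
from math import prod

# Hypothetical value counts for the four fidelity knobs.
knob_values = {"resolution": 10, "sampling": 8, "crop": 3, "quality": 6}

exhaustive = prod(knob_values.values())  # profile every combination
boundary = sum(knob_values.values())     # roughly one boundary pass per knob

assert exhaustive == 1440
assert boundary == 27
```

Even at this modest scale, boundary-guided profiling needs two orders of magnitude fewer runs than exhaustive search, and the gap widens as knobs gain values.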

    What about a higher-dimensional fidelity space?

    The above algorithm searches in the 4D space of the four fidelity knobs we consider. One may consider additional fidelity knobs (e.g. color channel). To search such a space, we expect partitioning along shorter dimensions to still be helpful; furthermore, the exploration of a 2D space can be generalized to higher-dimensional spaces by retrofitting selection in a high-dimensional monotone array (Cheng et al., 2008; Linial and Saks, 1985).

    4.3. Configuring storage formats


    For the chosen consumption formats and their downstream consumers, VStore determines the storage formats with satisfiable fidelity and adequate retrieval speed.

    Enumeration is unaffordable

    One may consider enumerating all possible ways to partition the set of consumption formats (CFs) and determining a common storage format for each subset of CFs. This enumeration is prohibitively expensive: the number of ways to partition a set grows as the Bell number, about 4×10^6 for 12 CFs and about 4×10^17 for the 24 CFs in our implementation (Wikipedia, 2018a).
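The blow-up can be checked with the Bell triangle, which computes the number of set partitions. This is a standalone illustration of the combinatorial growth, not part of VStore:

```python
def bell_numbers(n):
    """Bell numbers B_0..B_n via the Bell triangle: B_k is the number
    of ways to partition a set of k items."""
    row = [1]
    bells = [1]
    for _ in range(n):
        nxt = [row[-1]]              # new row starts with last of previous
        for x in row:
            nxt.append(nxt[-1] + x)  # each entry adds the entry above-left
        row = nxt
        bells.append(row[0])
    return bells

b = bell_numbers(24)
# b[12] is about 4.2 * 10**6; b[24] is about 4.5 * 10**17
assert b[12] == 4213597
```

So even for a dozen CFs, exhaustive partition enumeration requires millions of candidate groupings, each needing a profiled common storage format.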

    Figure 7. Iterative coalescing of storage formats.

    Algorithm sketch

    VStore takes a greedy approach, iteratively coalescing the set of storage formats. As shown on the right side of Figure 7, VStore starts from a full set of storage formats (SFs), each catering to a CF with identical fidelity. In addition, VStore creates a golden storage format SFg = ⟨fg, cg⟩: fg is the knob-wise maximum of the fidelities of all CFs; cg is the slowest coding option, incurring the lowest storage cost. The golden SF is vital to data erosion, to be discussed in Section 4.4. All these SFs participate in coalescing.

    VStore runs multiple rounds of pairwise coalescing. To coalesce SF0⟨f0, c0⟩ and SF1⟨f1, c1⟩ into SF2⟨f2, c2⟩, VStore picks f2 as the knob-wise maximum of f0 and f1, for satisfiable fidelity. Such coalescing impacts resource costs in three ways. i) It reduces the ingestion cost, as the video versions become fewer. ii) It may increase the retrieval cost, as SF2, with richer fidelity, tends to be slower to decode than SF0/SF1. VStore therefore picks a cheaper coding option (c2) for SF2, so that decoding SF2 is fast enough for all previous consumers of SF0/SF1. If even the cheapest coding option is not fast enough, VStore bypasses coding and stores raw frames for SF2. iii) The cheaper coding, in turn, may increase the storage cost.
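The mechanics of one coalescing step can be sketched as follows. The coding table, its numbers, and the function names are illustrative assumptions, not VStore's profiled data; the sketch shows the knob-wise fidelity maximum and the fallback from slower coding to faster coding to raw frames.

```python
# Toy coding options: (name, decode_speed, relative_size), ordered from
# cheapest storage / slowest decode to raw frames (no coding at all).
CODINGS = [
    ("slow", 100.0, 1.0),    # most compact, slowest to decode
    ("fast", 400.0, 1.6),    # cheaper coding: bigger files, faster decode
    ("raw", 9000.0, 40.0),   # coding bypassed: raw frames
]

def coalesce_fidelity(f_a, f_b):
    """Knob-wise maximum keeps the merged SF satisfiable for both
    sets of downstream CFs."""
    return tuple(max(x, y) for x, y in zip(f_a, f_b))

def pick_coding(required_retrieval_speed):
    """Slowest (most compact) coding whose decode speed still covers
    the fastest downstream consumer; fall back to raw frames."""
    for name, decode_speed, _size in CODINGS:
        if decode_speed >= required_retrieval_speed:
            return name
    return "raw"

merged = coalesce_fidelity((540, 10), (360, 30))  # (resolution, fps)
assert merged == (540, 30)
assert pick_coding(350.0) == "fast"
assert pick_coding(5000.0) == "raw"
```

This captures the tradeoff chain in the text: fewer SFs lower ingestion cost, richer merged fidelity threatens retrieval speed, and the compensating cheaper coding raises storage cost.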

    Figure 7 illustrates a series of coalescing rounds. From right to left, VStore first picks the pairs that can be coalesced to reduce ingestion cost without increasing storage cost. Once VStore finds that coalescing any remaining pair would increase storage cost, it checks whether the current total ingestion cost is under budget. If not, VStore attempts cheaper coding options and continues to coalesce at the expense of increased storage cost, until the ingestion cost drops below the budget.

    Overhead analysis

    The primary overhead comes from profiling. To pick a pair for coalescing, VStore considers all possible pairs among the remaining SFs; for each pair, VStore profiles a video sample in the would-be coalesced SF, testing its decoding speed and size. VStore coalesces in at most N rounds, N being the number of CFs, so the total profiling runs are bounded by the candidate pairs examined across all rounds, and never exceed the total number of possible SFs. In our implementation, N is 24 and the SF space contains 15K options. Fortunately, by memoizing the previously profiled SFs in the same configuration process, VStore can significantly reduce the profiling runs, as we will evaluate in Section 6.

    4.4. Planning Age-based Data Erosion


    In previous steps, VStore plans multiple storage formats of the same content in order to cater to a wide range of consumers. In the last step, VStore reduces the total space cost to be below the system budget.

    Our insight is as follows. As video content ages, the system may slowly give up some of the formats, freeing space by relaxing the requirement for adequate retrieval speed on aged data (Section 3, R2). We made the following choices. i) Gracefully degrading consumption speed. VStore controls the rate of decay in speed instead of in storage space, as operator speed is what users directly perceive. ii) Low aging cost. VStore avoids transcoding aged videos, which would compete with ingestion for the encoder; it hence creates no new storage formats for aging. iii) Never breaking fidelity satisfiability. VStore identifies some video versions as fallback data sources for others, ensuring that all consumers achieve their desired accuracies as long as the videos are within their lifespan.

    Figure 8. Data erosion decays operator speed and keeps storage cost under budget. Small cells: video segments.

    Data erosion plan

    VStore plans erosion at the granularity of video ages. Recall that VStore saves video as segments on disks (each segment contains 8-second video in our implementation). As shown in Figure 8, for each age (e.g. per day) and for each storage format, the plan dictates the percentage of deleted segments, which accumulate over ages.

    How to identify fallback video formats?

    VStore organizes all the storage formats of one configuration in a tree, where edges capture richer-than relations between the storage formats, as shown in Figure 8. Consumers, in an attempt to access any deleted segments of a child format, fall back to the parent format (or even higher-level ancestors). Since the parent format offers richer fidelity, the consumers are guaranteed to meet their accuracies; yet the parent's retrieval speed may be inadequate for the consumers (e.g. due to costlier decoding), thus decaying the consumers' effective speed. If a consumer has to consume a fraction p of segments from the parent format, on which its effective speed is only a fraction k of its original speed on non-eroded data, the consumer's relative speed is defined as the ratio of its decayed speed to its original speed, i.e. 1/((1 − p) + p/k). VStore never erodes the golden format at the root node; with fidelity richer than any other format, the golden format serves as the ultimate fallback for all consumers.
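The relative-speed expression (a reconstruction from the surrounding prose, since the paper's inline formula was lost in extraction) follows from weighting per-segment processing time:

```python
def relative_speed(p, k):
    """Consumer's decayed speed relative to its original speed, when a
    fraction p of segments must be fetched from the parent format, on
    which the consumer runs at only a fraction k of its original speed.
    Time per unit of video is (1 - p) * 1 + p / k (in units of the
    original per-unit time), so the relative speed is its inverse."""
    return 1.0 / ((1.0 - p) + p / k)

assert relative_speed(0.0, 0.5) == 1.0   # nothing eroded: no decay
assert abs(relative_speed(1.0, 0.5) - 0.5) < 1e-9  # all from parent
```

The expression reduces sensibly at both extremes: no erosion (p = 0) gives full speed, and full fallback (p = 1) gives exactly the parent-format speed fraction k.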

    How to quantify the overall speed decay?

    Eroding one storage format may decay the speeds of multiple consumers to various degrees, necessitating a global metric that captures the overall consumer speed. Our rationale is for all consumers to fairly experience the speed decay. Following the principle of max-min fairness (Wikipedia, 2018b), we therefore define the overall speed as the minimum relative speed across all the consumers. By this definition, the overall speed is also relative, in the range (0, 1]: it is 1 when the video content is youngest and all the versions are intact, and it reaches its minimum when all but the golden format are deleted.

    How to set overall speed target for each age?

    We follow a power-law function, which gives a gentle decay rate and has been used on time-series data (Agrawal and Vulimiri, 2017). In the function v(x) = (x + 1)^−α, x is the video age. When x = 0 (youngest video), v is 1 (the maximum overall speed); as x grows, v approaches 0. Given a decay factor α (we will discuss how to find its value below), VStore uses the function to set the target overall speed for each age in the video lifespan.
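The decay target is straightforward to compute. The formula below is a reconstruction matching the stated properties (value 1 at age 0, approaching 0 as age grows), since the paper's inline formula was lost in extraction:

```python
def speed_target(age, alpha):
    """Target overall relative speed at a given video age, following a
    power-law decay with decay factor alpha. A factor of 0 means no
    decay; larger factors decay faster."""
    return (age + 1) ** (-alpha)

assert speed_target(0, 1.0) == 1.0   # youngest video: full speed
assert speed_target(9, 1.0) == 0.1   # day 9 at alpha = 1: 10% speed
assert speed_target(9, 0.0) == 1.0   # alpha = 0: no decay at any age
```

Note the alpha = 0 case corresponds to the "no decay" plan VStore emits when the storage budget is already satisfied.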

    How to plan data erosion for each age?

    For gentle speed decay, VStore always deletes from the storage format that would result in the minimum overall speed reduction. In the spirit of max-min fairness, VStore essentially spreads the speed decay evenly among consumers.

    VStore therefore plans erosion by resembling a fair scheduler (Molnár, 2007). For each video age, i) VStore identifies the consumer that currently has the lowest relative speed; ii) VStore examines all the storage formats in the "richer-than" tree, finding the one whose erosion has the least impact on that consumer's speed; iii) VStore plans to delete a fraction of segments from the found format, until another consumer's relative speed drops below that of the identified consumer. VStore repeats this process until the overall speed drops below the target of this age.
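One age's planning loop can be sketched as below. This is a simplified, hypothetical model: consumers, formats, and per-deletion speed impacts are toy stand-ins (the real planner derives impacts from the fallback tree and profiling), and erosion proceeds in unit slices rather than the fraction-based step iii) above.

```python
def plan_age(relative_speeds, erodible, impact, target):
    """relative_speeds: consumer -> current relative speed.
    erodible: formats still eligible for erosion (golden SF excluded).
    impact(fmt, consumer): speed lost by that consumer when one more
    slice of fmt is deleted. Erodes, max-min fairly, until the overall
    (minimum) relative speed drops to the target for this age."""
    speeds = dict(relative_speeds)
    deleted = {f: 0 for f in erodible}
    while min(speeds.values()) > target:
        slowest = min(speeds, key=speeds.get)
        # the format whose erosion hurts the slowest consumer the least
        fmt = min(erodible, key=lambda f: impact(f, slowest))
        deleted[fmt] += 1
        for c in speeds:
            speeds[c] = max(speeds[c] - impact(fmt, c), 1e-6)
    return deleted, min(speeds.values())

# Toy scenario: two consumers, two erodible formats; each format
# mostly hurts one consumer, so erosion alternates between formats.
costs = {("SF1", "opA"): 0.02, ("SF1", "opB"): 0.1,
         ("SF2", "opA"): 0.1, ("SF2", "opB"): 0.02}
deleted, overall = plan_age({"opA": 1.0, "opB": 1.0}, ["SF1", "SF2"],
                            lambda f, c: costs[(f, c)], target=0.7)
```

Because each step erodes the format that least hurts the currently slowest consumer, the decay spreads evenly across consumers, in the max-min spirit of the text.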

    Putting it together

    VStore generates an erosion plan by testing different values for the decay factor. It finds the lowest factor (the most gentle decay) that brings the total storage cost, accumulated over all video ages, under the budget. For each tested factor, VStore generates a tentative plan: it sets speed targets for each video age based on the power law, plans data erosion for each age, sums up the storage cost across ages, and checks whether the storage cost falls below the budget. As a higher decay factor always leads to lower total storage cost, VStore uses binary search to quickly find a suitable value.
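The binary search over the decay factor can be sketched as follows; the cost model and the search bounds are illustrative assumptions, relying only on the monotonicity stated above (higher decay factor, lower total storage cost):

```python
def find_decay_factor(total_cost, budget, lo=0.0, hi=8.0, iters=40):
    """Binary-search the smallest decay factor whose eroded storage
    cost, summed over all ages, fits the budget. total_cost(alpha) is
    assumed monotone non-increasing in alpha (the tentative-plan
    evaluation in the text)."""
    if total_cost(lo) <= budget:
        return lo                    # budget already met: no decay
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if total_cost(mid) <= budget:
            hi = mid                 # feasible: try a gentler decay
        else:
            lo = mid
    return hi

# Toy monotone cost model: cost shrinks as the decay factor grows.
alpha = find_decay_factor(lambda a: 10.0 / (1.0 + a), budget=4.0)
```

Each probe of `total_cost` corresponds to generating and costing one tentative erosion plan, so logarithmically few plans are evaluated.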

    5. Implementation

    We built VStore in C++ and Python in 10K SLoC. Driven by its configuration engine, VStore orchestrates several major components. Coding and storage backend: VStore invokes FFmpeg, a popular software suite, for coding tasks. VStore's ingestion uses the libx264 software encoder, creating one FFmpeg instance to transcode each ingested stream; its retrieval invokes NVIDIA's NVDEC decoder for efficiency. VStore uses LMDB, a key-value store (iMatix Corporation, 2018), as its storage backend, storing 8-second video segments as values; we choose LMDB because it supports MB-sized values well. Ported query engines: We ported two query engines to VStore, modifying both so that they retrieve data from VStore and provide interfaces for VStore's profiling. OpenALPR (OpenALPR Technology, Inc., [n. d.]) recognizes vehicle license plates; its operators build on OpenCV and run on the CPU. To scale it up, we create a scheduler that manages multiple OpenALPR contexts and dispatches video segments to them. NoScope (Kang et al., 2017) is a recent research engine that combines operators executing at various speeds, some invoking deep NNs; it uses TensorFlow as the NN framework, running on the GPU.

    Operator lib: The two query engines provide 6 operators, as shown in Figure LABEL:fig:pipeline. In particular, S-NN uses a very shallow AlexNet (Krizhevsky et al., 2012) produced by NoScope's model search, and NN uses YOLOv2 (Redmon et al., 2016).

    6. Evaluation

    We answer the following questions in evaluation:



    Does VStore provide good end-to-end results?


    Does VStore adapt configurations to resource budgets?


    Does VStore incur low overhead in configuration?

    6.1. Methodology

    Video Datasets

    We carried out our evaluation on six videos, extensively used as benchmarks in prior work (Kang et al., 2017; Hsieh et al., 2018; Kang et al., 2018; Jiang et al., 2018). We include videos from both dash cameras (which contain high motion) and surveillance cameras that capture traffic ranging from heavy to light. The videos are: jackson, from a surveillance camera at Jackson Town Square; miami, from a surveillance camera at a Miami Beach crosswalk; tucson, from a surveillance camera at Tucson 4th Avenue; dashcam, from a dash camera driving in a parking lot; park, from a stationary surveillance camera in a parking lot; airport, from a surveillance camera at a JAC parking lot. The ingestion format of all videos is 720p at 30 fps, encoded in H.264.

    VStore setup

    We, as the system admin, declare a set of accuracy levels {0.95, 0.9, 0.8, 0.7} for each operator. These accuracies are typical in prior work (Jiang et al., 2018). In determining F1 scores for accuracy, we treat the operator's output on videos in the ingestion format (i.e. the highest fidelity) as the ground truth. In our evaluation, we run the two queries illustrated in Figure LABEL:fig:pipeline: query A (Diff + S-NN + NN) and query B (Motion + License + OCR). In running the queries, we, as the users, select specific accuracy levels for each query's constituent operators.

    We run query A on the first three videos and query B on the remaining three, following how these queries are benchmarked in prior work (OpenALPR Technology, Inc., [n. d.]; Kang et al., 2017). To derive consumption formats, VStore profiles the two sets of operators on jackson and dashcam, respectively. Each profiled sample is a 10-second clip, a typical length used in prior work (Jiang et al., 2018). VStore derives a unified set of storage formats for all operators and videos.

    Hardware environment

    We test VStore on a 56-core Xeon E7-4830v4 machine with 260 GB of DRAM, 41 TB of 10K-RPM SAS 12 Gbps HDDs in RAID 5, and an NVIDIA Quadro P6000 GPU. As implemented, the operators from OpenALPR run on the CPU; we limit them to at most 40 cores to keep query speed comparable to commodity multicore servers. The operators from NoScope run on the GPU.

    6.2. End-to-end Results

    Table 2. A sample configuration of video formats automatically derived by VStore

    Configuration by VStore

    VStore automatically configures video formats based on its profiling. Table 2 shows a snapshot of a configuration, including the whole set of consumption formats (CFs) and storage formats (SFs). For all 24 consumers (6 operators at 4 accuracy levels), VStore generates 21 unique CFs. The configuration has 124 knobs, each with up to 10 possible values; manually finding the optimal combination would be infeasible, which warrants VStore's automatic configuration. In each column (a specific operator), although the knob values tend to decrease as accuracy drops, the decrease is complex and can be non-monotone. For instance, from Diff F1=0.9 to Diff F1=0.8, VStore advises decreasing the sampling rate while increasing the resolution and crop factor. This reflects the complex impacts of knobs, as stated in Section 2.4. We also note that VStore chooses the lowest (cheapest) fidelity for Motion at all accuracies at or below 0.9, because it cannot find a cheaper fidelity to further exploit the lower accuracies. This suggests that Motion could benefit from an even larger fidelity space with even cheaper fidelity options.

    From the CFs, VStore derives 4 SFs, including one golden format (SFg), as listed on the right of Table 2. Note that we, as system admin, have not yet imposed any budget on ingestion cost; therefore, VStore by design chooses the set of SFs that minimizes the total storage cost (Section 4.3). The CF table on the left tags each CF with the SF that the CF subscribes to. As shown, the CFs and SFs jointly meet design requirements R1–R3 in Section 3: each SF has fidelity richer than or equal to what its downstream CFs demand, and the SF's retrieval speed is always faster than the downstream consumption speed. Looking closer at the SFs: SFg mostly caters to consumers demanding high accuracies but low consumption speeds; SF3, stored as low-fidelity raw frames, caters to high-speed consumers demanding low image resolutions; between SFg and SF3, SF1 and SF2 fill the wide gaps in fidelity and cost. Again, it would be difficult to manually determine such a complementary set of SFs without VStore's configuration.

    Alternative configurations

    We next quantitatively contrast VStore with the following alternative configurations:


    • 1→1 stores videos in the golden format (SFg in Table 2). All consumers consume videos in this golden format. This resembles a video database oblivious to algorithmic consumers.

    • 1→N stores videos in the golden format SFg, while all consumers consume videos in the CFs determined by VStore. This is equivalent to VStore configuring video formats for consumption but not for storage: the system has to decode the golden format and convert it to the various CFs.

    • N→N stores videos in 21 SFs, one for each unique CF. This is equivalent to VStore giving up its coalescing of SFs.

    (a) Query speeds (y-axis; logarithmic scale) as functions of target operator accuracy (x-axis). Query A on the left 3 videos; query B on the right 3 videos. By avoiding the video retrieval bottleneck, VStore significantly outperforms the others.
    (b) Storage cost per video stream, measured as the growth rate of newly stored video size as ingestion goes on. VStore's coalescing of SFs substantially reduces storage cost.
    (c) Ingestion cost per video stream, as the CPU usage required for transcoding the stream into storage formats. VStore's SF coalescing substantially reduces ingestion cost. Note: this shows VStore's worst-case ingestion cost with no ingestion budget specified; see Table 3 for more.
    Figure 9. End-to-end result

    Query speed

    As shown in Figure 9a, VStore achieves good query speed overall, up to 362× video realtime. VStore's speed is not directly comparable with the performance reported for retrospective analytics engines (Kang et al., 2017; Kang et al., 2018): while VStore streams video data (raw or encoded) from disks through the decoder to operators, the latter were tested with all input data preloaded as raw frames in memory. VStore offers flexible accuracy/cost tradeoffs: for queries with lower target accuracy, VStore accelerates query speed by up to 150×. This is because VStore elastically scales down the costs: according to the lower accuracy, it switches the operators to CFs that incur lower consumption cost, and those CFs subscribe to SFs that incur lower retrieval cost.

    Figure 9a also shows the query speed under the alternative configurations. 1→1 achieves the best accuracy (treated as the ground truth), as it consumes video in the full fidelity as ingested. However, it cannot exploit accuracy/cost tradeoffs, offering only a fixed operating point. By contrast, VStore offers a wide range of tradeoffs, speeding up queries by up to two orders of magnitude.

    1→N customizes consumption formats for consumers while only storing the golden format. Although it minimizes the consumption costs for consumers, it essentially caps the effective speed of all consumers at the speed of decoding the golden format, about 23× realtime. The bottleneck is more serious at lower accuracies (e.g. 0.8), where many consumers are capable of consuming data tens of thousands of times faster than realtime, as shown in Table 2. As a result, VStore outperforms 1→N by 3×–16×, demonstrating the necessity of maintaining a set of SFs.

    Storage cost

    Figure 9b compares the storage costs. Among all, N→N incurs the highest costs, because it stores 21 video versions in total. For dashcam, a stream with intensive motion that makes video coding less effective, the storage cost reaches as high as 2.6 TB/day, filling a 10 TB hard drive in four days. In comparison, VStore consolidates the storage formats effectively (Section 4.3), reducing the storage cost by 2×–5×. 1→1 and 1→N require the lowest storage space, as they save only one video version per ingestion stream; yet they suffer from high retrieval cost and low query speed, as shown above.

    Ingestion cost

    Figure 9c demonstrates that VStore substantially reduces ingestion cost through consolidation of storage formats. Note that it shows VStore's worst-case ingestion cost: as stated earlier, the end-to-end experiment imposes no ingestion budget, so VStore reduces the ingestion cost without any increase in storage cost. As we will show next, once an ingestion budget is given, VStore can keep the ingestion cost much lower than this worst case, with only a minor increase in storage cost.

    Overall, on most videos VStore requires around 10 cores to ingest one video stream, transcoding it into the four storage formats in real time (30 fps). Ingesting dashcam is much more expensive, as the video contains intensive motion. VStore's cost is 30%–50% lower than N→N's, which must transcode each stream into 21 storage formats. 1→1 and 1→N incur the lowest ingestion cost, as they transcode the ingestion stream only to the golden format, yet at the expense of costly retrieval and slow queries, as shown above.

    6.3. Adapting to Resource Budgets

    Table 3. In response to an ingestion budget drop, VStore tunes coding and coalesces formats to stay under the budget, at the cost of increased storage. Changed knobs shown in red.

    Ingestion budget

    VStore elastically adapts its configuration to respect the ingestion budget. To impose a budget, we, as system admin, cap the number of CPU cores available to the one FFmpeg instance that transcodes each ingested stream. In response to the reduced budget, VStore gently trades off storage for ingestion. Table 3 shows that, as the ingestion budget drops, VStore incrementally tunes up the coding speed (i.e. cheaper coding) for individual storage formats. As a tradeoff, the storage cost slowly increases, by 9%. During this process, the increasingly cheaper coding overprovisions retrieval speed for consumers and therefore never violates their requirements. When the ingestion budget drops below 2 cores, VStore finds that no cheaper coding for any storage format can keep it under the budget. It then coalesces SF1 and SF2, significantly increasing the total storage cost, by 2×. Note that at this point, the total ingestion output throughput is still less than 7 MB/s; even if the system ingested 56 streams concurrently with its 56 cores, the required disk throughput (370 MB/s) would be far below that of a commodity HDD array (1 GB/s on our platform).

    Storage budget

    Figure 10. Age-based decay in operator speed (a), reducing storage cost (b) to respect storage budget

    VStore's data erosion effectively respects the storage budget with gentle speed decay. To test VStore's erosion planning, we, as admin, set the video lifespan to 10 days and then specify different storage budgets. With all four storage formats listed in Table 2, 10 days of the video stream take 5 TB of disk space. If we specify a budget above 5 TB, VStore determines that no decay is needed (a decay factor of 0), shown as the flat line in Figure 10a. Further reduction in the storage budget prompts data erosion. With a budget of 4 TB, VStore decays the overall operator speed (defined in Section 4.4) following a power-law function (a decay factor of 1). As we further reduce the budget, VStore plans more aggressive decay to respect it. Figure 10b shows how VStore erodes individual storage formats under a specific budget. On day 1 (youngest), all four storage formats of the video are intact. As the video ages, VStore first deletes segments from SF1 and SF2, which have lower impact on the overall speed. For video content older than 5 days, VStore deletes all data in SF1–SF3 while keeping the golden format intact (not shown).

    6.4. Configuration Overhead

    VStore incurs moderate configuration overhead, thanks to our techniques in Sections 4.2 and 4.3. Overall, one complete configuration (including all required profiling) takes around 500 seconds, suggesting the store can afford to run the configuration process roughly once per hour online.

    Figure 11. Time spent on deriving consumption formats. Numbers of profiling runs annotated above columns. Each run profiles a 10-second video segment. VStore reduces overhead by 5×.

    Configuring consumption formats

    Figure 11 shows the overhead of determining consumption formats. Compared to exhaustive profiling of all fidelity options, VStore reduces the number of profiling runs by 9×–15× and the total profiling delay by 5×, from 2,000 seconds to 400 seconds. We notice that the License operator is slow, contributing more than 75% of the total delay, likely due to its CPU-based implementation.

    Configuring storage formats

    We have validated that VStore finds storage formats as resource-efficient as exhaustive enumeration does. To do so, we compare VStore to exhaustive enumeration on deriving SFs from the 12 CFs used in query B; we cannot afford more CFs, which would make exhaustive enumeration prohibitively slow. Both methods result in identical storage formats, validating VStore's rationale behind coalescing. Yet VStore's overhead (37 seconds) is two orders of magnitude lower than enumeration's (5,548 seconds).

    To derive the storage formats from all 21 unique consumption formats in our evaluation, VStore also incurs moderate absolute overhead (less than 1 minute). Throughout the 17 rounds of coalescing, it profiles only 475 (3%) of all 15K possible storage formats. We observed that its memoization is effective: although 5,525 storage formats are examined as possible coalescing outcomes, 92% of them had been memoized before and thus required no new profiling.

    7. Related Work

    Video analytics optimizations

    Much prior work optimizes video analytics. NoScope (Kang et al., 2017), MCDNN (Han et al., 2016), Focus (Hsieh et al., 2018), and BlazeIt (Kang et al., 2018) provide various filters or NNs for object detection at query time or at ingestion. VideoStorm (Zhang et al., 2017) and VideoEdge (Hung et al., 2018) trade off resource usage for accuracy. LAVEA (Yi et al., 2017) investigates the placement of video analytics tasks in an edge-cloud environment. Jain et al. (Jain et al., 2018) exploit cross-camera correlations, incurring sublinear cost growth when scaling to more camera streams. However, these systems lack support for jointly optimizing ingestion, storage, retrieval, and consumption. The NVIDIA DeepStream SDK (dee, 2018) allows video frames to flow from a GPU's built-in decoders to its stream processors without leaving the GPU; it can mitigate the memory-move overhead of retrieval but does not change the fundamental tradeoffs between retrieval and consumption. NVIDIA Video Loader (Casper et al., 2018) wraps FFmpeg and NVDEC to retrieve frames from encoded videos for NN training. Neither automatically configures video formats for analytics.

    Time-series data store

    Recent time-series data stores co-design storage formats with queries (Agrawal and Vulimiri, 2017; Andersen and Culler, 2016). However, their data format/schema (timestamped sensor readings), operators (e.g. aggregation), and analytics structure (no cascades) differ from those of video analytics. While some streaming databases (Srinivasan et al., 2016; InfluxData, 2018) provide data aging or optimize for frequent queries, they cannot make storage decisions based on video queries, as they are oblivious to the analytics.

    Video systems for human consumers

    Many multimedia server systems of the '90s stored videos on disk arrays in multiple resolutions or in complementary layers to better serve human clients (Keeton and Katz, 1995; Chiueh and Katz, 1993). Since then, Kang et al. (Kang et al., 2009) optimize the placement of on-disk video layers to reduce disk seeks; Oh et al. (Oh and Hua, 2000) segment videos into shots, which are easier for humans to browse and search. More recently, SVE (Huang et al., 2017) is a distributed service for fast transcoding of uploaded videos in datacenters, and ExCamera (Fouladi et al., 2017) uses AWS Lambda functions for parallel video transcoding. These systems were not designed for, and are therefore oblivious to, algorithmic consumers; they cannot automatically control video formats for video analytics.

    8. Conclusions

    VStore automatically configures video format knobs for retrospective video analytics. It addresses the challenges of a huge combinatorial knob space, complex knob impacts, and high profiling cost. VStore explores a key idea called backward derivation of configuration: the video store passes the video quantity and quality desired by analytics backward to retrieval, to storage, and to ingestion. VStore automatically derives complex configurations and runs queries as fast as 362× video realtime.


    • dee (2018) 2018. NVIDIA DeepStream SDK.
    • rol (2018) 2018. RollingDB Storage Library.
    • Agrawal and Vulimiri (2017) Nitin Agrawal and Ashish Vulimiri. 2017. Low-Latency Analytics on Colossal Data Streams with SummaryStore. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP ’17). ACM, New York, NY, USA, 647–664.
    • Andersen and Culler (2016) Michael P Andersen and David E. Culler. 2016. BTrDB: Optimizing Storage System Design for Timeseries Processing. In 14th USENIX Conference on File and Storage Technologies (FAST 16). USENIX Association, Santa Clara, CA, 39–52.
    • Casper et al. (2018) Jared Casper, Jon Barker, and Bryan Catanzaro. 2018. NVVL: NVIDIA Video Loader.
    • Chen et al. (2018) Y. Chen, X. Zhu, W. Zheng, and J. Lai. 2018. Person Re-Identification by Camera Correlation Aware Feature Augmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 40, 2 (Feb 2018), 392–408.
    • Cheng et al. (2008) Yongxi Cheng, Xiaoming Sun, and Yiqun Lisa Yin. 2008. Searching monotone multi-dimensional arrays. Discrete Mathematics 308, 11 (2008), 2213–2221.
    • Chiueh and Katz (1993) Tzi-cker Chiueh and Randy H. Katz. 1993. Multi-resolution Video Representation for Parallel Disk Arrays. In Proceedings of the First ACM International Conference on Multimedia (MULTIMEDIA ’93). ACM, New York, NY, USA, 401–409.
    • Feng et al. (2018) Ziqiang Feng, Junjue Wang, Jan Harkes, Padmanabhan Pillai, and Mahadev Satyanarayanan. 2018. EVA: An Efficient System for Exploratory Video Analysis. SysML (2018).
    • Fouladi et al. (2017) Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Vasuki Balasubramaniam, William Zeng, Rahul Bhalerao, Anirudh Sivaraman, George Porter, and Keith Winstein. 2017. Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 363–376.
    • Girshick (2015) Ross Girshick. 2015. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV) (ICCV ’15). IEEE Computer Society, Washington, DC, USA, 1440–1448.
    • Han et al. (2016) Seungyeop Han, Haichen Shen, Matthai Philipose, Sharad Agarwal, Alec Wolman, and Arvind Krishnamurthy. 2016. MCDNN: An Approximation-Based Execution Framework for Deep Stream Processing Under Resource Constraints. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys ’16). ACM, New York, NY, USA, 123–136.
    • Hsieh et al. (2018) Kevin Hsieh, Ganesh Ananthanarayanan, Peter Bodik, Shivaram Venkataraman, Paramvir Bahl, Matthai Philipose, Phillip B. Gibbons, and Onur Mutlu. 2018. Focus: Querying Large Video Datasets with Low Latency and Low Cost. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). USENIX Association, Carlsbad, CA.
    • Huang et al. (2017) Qi Huang, Petchean Ang, Peter Knowles, Tomasz Nykiel, Iaroslav Tverdokhlib, Amit Yajurvedi, Paul Dapolito, IV, Xifan Yan, Maxim Bykov, Chuen Liang, Mohit Talwar, Abhishek Mathur, Sachin Kulkarni, Matthew Burke, and Wyatt Lloyd. 2017. SVE: Distributed Video Processing at Facebook Scale. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP ’17). ACM, New York, NY, USA, 87–103.
    • Hung et al. (2018) Chien-Chun Hung, Ganesh Ananthanarayanan, Peter Bodík, Leana Golubchik, Minlan Yu, Victor Bahl, and Matthai Philipose. 2018.

      VideoEdge: Processing Camera Streams using Hierarchical Clusters.
    • IHS (2016) IHS. 2016. Top Video Surveillance Trends for 2016.
    • IHS (2018) IHS. 2018. Top Video Surveillance Trends for 2018.
    • iMatix Corporation (2018) iMatix Corporation. 2018. Lightning Memory-mapped Database.
    • InfluxData (2018) InfluxData. 2018. InfluxDB.
    • Jain et al. (2018) Samvit Jain, Ganesh Ananthanarayanan, Junchen Jiang, Yuanchao Shu, and Joseph E Gonzalez. 2018. Scaling Video Analytics Systems to Large Camera Deployments. arXiv preprint arXiv:1809.02318 (2018).
    • Jiang et al. (2018) Junchen Jiang, Ganesh Ananthanarayanan, Peter Bodik, Siddhartha Sen, and Ion Stoica. 2018. Chameleon: Scalable Adaptation of Video Analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’18). ACM, New York, NY, USA, 253–266.
    • Kang et al. (2018) Daniel Kang, Peter Bailis, and Matei Zaharia. 2018. BlazeIt: Fast Exploratory Video Queries using Neural Networks. arXiv preprint arXiv:1805.01046 (2018).
    • Kang et al. (2017) Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, and Matei Zaharia. 2017. NoScope: Optimizing Neural Network Queries over Video at Scale. Proc. VLDB Endow. 10, 11 (Aug. 2017), 1586–1597.
    • Kang et al. (2009) Sooyong Kang, Sungwoo Hong, and Youjip Won. 2009. Storage technique for real-time streaming of layered video. Multimedia Systems 15, 2 (01 Apr 2009), 63–81.
    • Keeton and Katz (1995) Kimberly Keeton and Randy H. Katz. 1995. Evaluating video layout strategies for a high-performance storage server. Multimedia Systems 3, 2 (01 May 1995), 43–52.
    • Kreuzberger et al. (2015) Christian Kreuzberger, Daniel Posch, and Hermann Hellwagner. 2015. A Scalable Video Coding Dataset and Toolchain for Dynamic Adaptive Streaming over HTTP. In Proceedings of the 6th ACM Multimedia Systems Conference (MMSys ’15). ACM, New York, NY, USA, 213–218.
    • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105.
    • Linial and Saks (1985) Nathan Linial and Michael Saks. 1985. Searching ordered structures. Journal of Algorithms 6, 1 (1985), 86–103.
    • Modiri Assari et al. (2016) Shayan Modiri Assari, Haroon Idrees, and Mubarak Shah. 2016. Human Re-identification in Crowd Videos Using Personal, Social and Environmental Constraints. Springer International Publishing, Cham, 119–136.
    • Molnár (2007) Ingo Molnár. 2007. [patch] Modular Scheduler Core and Completely Fair Scheduler.
    • Oh and Hua (2000) JungHwan Oh and Kien A. Hua. 2000. Efficient and Cost-effective Techniques for Browsing and Indexing Large Video Databases. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD ’00). ACM, New York, NY, USA, 415–426.
    • OpenALPR Technology, Inc. ([n. d.]) OpenALPR Technology, Inc. [n. d.]. OpenALPR.
    • Oracle (2015) Oracle. 2015. Dramatically Reduce the Cost and Complexity of Video Surveillance Storage.
    • Redmon et al. (2016) J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. 2016. You Only Look Once: Unified, Real-Time Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 779–788.
    • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS’15). MIT Press, Cambridge, MA, USA, 91–99.
    • Seagate (2017) Seagate. 2017. Video Surveillance Trends Report.
    • Seufert et al. (2015) Michael Seufert, Sebastian Egger, Martin Slanina, Thomas Zinner, Tobias Hossfeld, and Phuoc Tran-Gia. 2015. A survey on quality of experience of HTTP adaptive streaming. IEEE Communications Surveys & Tutorials 17, 1 (2015), 469–492.
    • Shen et al. (2017) Haichen Shen, Seungyeop Han, Matthai Philipose, and Arvind Krishnamurthy. 2017. Fast Video Classification via Adaptive Cascading of Deep Models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
    • Srinivasan et al. (2016) V. Srinivasan, Brian Bulkowski, Wei-Ling Chu, Sunil Sayyaparaju, Andrew Gooding, Rajkumar Iyer, Ashish Shinde, and Thomas Lopatic. 2016. Aerospike: Architecture of a Real-time Operational DBMS. Proc. VLDB Endow. 9, 13 (Sept. 2016), 1389–1400.
    • Wikipedia (2018a) Wikipedia. 2018a. Bell number.
    • Wikipedia (2018b) Wikipedia. 2018b. Max-min fairness.
    • Wu et al. (2018) Chao-Yuan Wu, Manzil Zaheer, Hexiang Hu, R. Manmatha, Alexander J. Smola, and Philipp Krähenbühl. 2018. Compressed Video Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
    • Wu et al. (2013) Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. 2013. Online Object Tracking: A Benchmark. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’13). IEEE Computer Society, Washington, DC, USA, 2411–2418.
    • Yi et al. (2017) S. Yi, Z. Hao, Q. Zhang, Q. Zhang, W. Shi, and Q. Li. 2017. LAVEA: Latency-Aware Video Analytics on Edge Computing Platform. In 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). 2573–2574.
    • Zhang et al. (2017) Haoyu Zhang, Ganesh Ananthanarayanan, Peter Bodik, Matthai Philipose, Paramvir Bahl, and Michael J. Freedman. 2017. Live Video Analytics at Scale with Approximation and Delay-Tolerance. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). USENIX Association, Boston, MA, 377–392.