ApproxNet: Content and Contention Aware Video Analytics System for the Edge

by   Ran Xu, et al.
Purdue University

Videos take a lot of time to transport over the network, hence running analytics on live video at the edge devices, right where it was captured, has become an important system driver. However, these edge devices, e.g., IoT devices, surveillance cameras, and AR/VR gadgets, are resource constrained. This makes it impossible to run state-of-the-art heavy Deep Neural Networks (DNNs) on them and yet provide low and stable latency under various circumstances, such as changes in the resource availability on the device, the content characteristics, or requirements from the user. In this paper we introduce ApproxNet, a video analytics system for the edge. It enables novel dynamic approximation techniques to achieve the desired inference latency and accuracy trade-off under different system conditions and resource contentions, variations in the complexity of the video contents, and user requirements. It achieves this by enabling two approximation knobs within a single DNN model, rather than creating and maintaining an ensemble of models (such as in MCDNN [Mobisys-16]). Ensemble models run into memory issues on lightweight devices and incur large switching penalties among the models in response to runtime changes. We show that ApproxNet can adapt seamlessly at runtime to video content changes and changes in system dynamics to provide low and stable latency for object detection on a video stream. We compare the accuracy and the latency to ResNet [2015], MCDNN, and MobileNets [Google-2017].




1 Introduction

There is an increasing number of scenarios where various kinds of analytics must be run on live video streams on resource-constrained mobile and embedded devices. For example, in a smart city traffic system, vehicles are redirected by detecting congestion from the live video feeds of traffic cameras, while in Augmented Reality (AR)/Virtual Reality (VR) systems, scenes are rendered based on the recognition of objects, faces, or actions in the video. These applications require low latency for event classification or identification based on the content of the video frames. Most of these videos are captured at edge devices such as IoT devices, surveillance cameras, and head-mounted AR/VR systems. Video transportation over wireless networks is slow, and these applications often must operate with no or intermittent network connectivity. Hence such systems must be able to run video analytics in place, on these resource-constrained edge devices¹, to meet the low latency requirements of the applications. This problem becomes more challenging as state-of-the-art computer vision techniques are predominantly based on huge deep neural networks (DNNs) that are designed to run on the server side and cannot provide low inference latency on smaller edge devices. In addition, the edge devices typically lack advanced resource isolation mechanisms, load-balancing, and scaling capabilities. Hence the video analytics system must also be able to react intelligently to changes in the system dynamics to maintain low latencies.

¹ As a notational shorthand, we will often use the term “edge devices” to include end client devices plus static edge devices. The common characteristic is that they are computationally constrained.

State-of-the-art is too heavy for edge-platforms: Most video analytics queries involve performing inference over DNNs (mostly convolutional neural networks, a.k.a. CNNs) with a variety of functional architectures for the intended tasks, such as classification [62, 64, 19, 23], object detection [46, 58, 59], face recognition [53, 60, 71, 65], or action recognition [33, 54, 44, 61]. Typically these inferences are performed on GPUs. With advances in deep learning and the emergence of complex architectures, DNN-based models have become deeper and wider. Correspondingly, their memory footprints and inference times have become significant. For example, DeepMon [27] runs the VGG-VeryDeep-16 deep learning model at around 1 to 2 fps on a Samsung Galaxy S7. We also found empirically that the state-of-the-art CNN called Residual Network, or ResNet [19], in its 101-layer version has a memory footprint of 2.8 GB and takes 101 ms to perform inference on a single image on the NVIDIA Jetson TX2 board (one of the most resource-rich mobile/embedded platforms today). This corresponds to less than 10 frames per second (fps) and is therefore not suitable for analyzing live video streams, which typically run at 30 fps.

Maintain performance under variable system scenarios: In many scenarios, these devices support multiple applications executing concurrently. For example, while an AR application is running on a mobile device, a voice assistant might kick in if it detects background conversation, or a spam filter might become active if a set of emails is received. All of these applications share common resources on the edge device such as the CPU, memory, and the GPU itself. These concurrent processes or background activities create variable workloads that lead to resource contention [39, 2, 4], as the smaller edge devices do not have advanced resource isolation mechanisms. Heavy state-of-the-art DNN-based video analytics models running fully or partially on the cloud [15, 76, 34, 35, 21, 45, 38, 26, 11] do not suffer from these problems due to better resource isolation (such as through virtual machines) or scale-out (moving computation among a larger number of machines). It is currently an unsolved problem how a video analytics system running on these devices can maintain low inference latency under variable resource availability and so deliver a pleasing user experience.

Mobile models cannot adapt to variable system scenarios: State-of-the-art computer vision research [30, 7, 64, 62, 20, 22], especially as focused on client devices, has made significant progress in making models smaller so that they can run on mobile devices and deliver low inference latencies. Examples of this line of work are MCDNN [15], VideoStorm [76], DeepMon [27], Mainstream [34], Chameleon [35], Focus [21], Liu’s work [45], MobileNets [20], and MSDNet [22]. However, none of them address how to optimize their usage under variable workload situations.

When a model becomes smaller (fewer layers, fewer neurons in each) [20], or is quantized (less precision), it invariably loses some accuracy. But under low contention, such a loss of accuracy is not required to maintain low inference latency. On the other hand, smaller models might be effective on simpler images, but heavy DNN machinery might be needed to handle complex content (see details in Fig. 1). None of the prior works is agile enough to handle these variable content and contention scenarios, which is crucial for effective live video analytics on smaller edge devices. Also, a straightforward solution of switching among a pre-installed ensemble of models either incurs a huge switching overhead or cannot fit inside the device’s RAM. For example, MCDNN [15] relies on storing a catalog of tens of local models. According to our measurement, a single model, ResNet with only 34 layers, i.e., at the low end of the depth, takes 2.4 GB of memory (Table 3), while an NVIDIA Jetson TX2 board has a total of 8 GB of memory; thus an ensemble of up to 68 models, as proposed in MCDNN, is not feasible. Thus, resilient, on-device, real-time streaming video analytics has still not been achieved.
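The back-of-the-envelope numbers quoted in this section (101 ms per frame for ResNet-101, 2.4 GB for ResNet-34, 8 GB of device memory) can be checked with a short sketch. The helper names are illustrative, and the ensemble estimate assumes, purely for illustration, that every one of the 68 models is as large as ResNet-34.

```python
import math


def fps_from_latency(latency_ms: float) -> float:
    """Frame rate sustained when each frame takes latency_ms to process."""
    return 1000.0 / latency_ms


def models_that_fit(device_mem_gb: float, model_mem_gb: float) -> int:
    """Number of equally sized models that fit in memory (OS overhead ignored)."""
    return math.floor(device_mem_gb / model_mem_gb)


resnet101_fps = fps_from_latency(101)       # just under 10 fps, far from 30 fps
resnet34_copies = models_that_fit(8, 2.4)   # only a handful of models fit
ensemble_gb = 68 * 2.4                      # assumed size of a 68-model catalog
```

Under these assumptions, only three ResNet-34-sized models fit in the 8 GB of the Jetson TX2, which is the gap the single-model design is meant to close.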

Our solution: ApproxNet. In this paper, we present ApproxNet, our end-to-end streaming video analytics system geared toward GPU-enabled edge devices. The fundamental idea behind ApproxNet is to perform approximate computing with tuning knobs that are changed automatically and seamlessly within the same video stream. These knobs trade off inference accuracy for lower inference latency and thus match the rate of the video stream. The optimal configuration of the knobs is set based on resource contention or in response to the complexity of the video frames. Uniquely to our solution, the approximation is done in the core of the neural network processing, as opposed to something external such as the frame rate (as done in Mainstream [34]). We adapt ResNet in combination with Spatial Pyramid Pooling (SPP) [18], which enables a CNN to accommodate inputs of various shapes, and we allow the output of the inferencing process to be taken at various depths of the CNN.

We make the following contributions in this paper:

  1. We develop an end-to-end, approximate video analytics pipeline, ApproxNet, that can handle dynamically changing workload contention or video content characteristics on resource-constrained edge devices. It achieves this through performing system context-aware and content-aware approximations to adapt to different accuracy and latency requirements.

  2. We design a novel DNN architecture that allows runtime accuracy and latency tuning using a single model. Our design is in contrast to ensemble systems like MCDNN that are composed of multiple independent model variants capable of satisfying different requirements. Our single-model-based design avoids high switching latencies on edge-devices when conditions change, as we show empirically (Fig. 1115).

  3. Our technique provides a better accuracy-latency tradeoff than prior (server or hybrid mobile-server) solutions. We show empirically that ApproxNet achieves a better accuracy-latency Pareto curve than two state-of-the-art image processing solutions on mobile devices, MobileNets and MCDNN, on the ImageNet datasets (Fig. 9 and Fig. 9, respectively).

2 Background and Motivation

2.1 DNNs for Streaming Video Analytics

DNNs have become a core element of various video processing tasks such as frame classification, human action recognition, object detection, face recognition, and so on. Though accurate, DNNs are computationally expensive, requiring significant CPU and memory resources. As a result, these DNNs are often too slow when running on mobile devices and become the latency bottleneck in video analytics systems. Huynh et al. [27] experimented with VGG [62] of 16 layers on the Samsung Galaxy S7 and noted that classification on a single image takes as long as 644 ms, leading to less than 2 fps for continuous classification. Motivated by the observation, we explore in ApproxNet how we can make DNN-based video analytics pipelines more efficient through content-aware approximate computation within the neural network.


ResNet: Deep DNNs are typically hard to train due to the vanishing gradient problem [19]. ResNet solved this problem by introducing a shortcut identity connection between layers, which helps achieve at least the same accuracy as the number of network layers increases. The unit of such connected layers is called a ResNet block.

We leverage the key idea that a deeper model produces no higher error than its shallower counterpart to construct an architecture that provides different approximation branches with intermediate output ports.

Spatial Pyramid Pooling (SPP) [18]: Popular DNN models, including ResNet, consist of convolutional and max-pooling (CONV) layers and fully-connected (FC) layers, and the shape of an input image is fixed. Changing the input shape in a CNN typically requires re-designing the architecture of the CNN. SPP is a special layer that eliminates this constraint. The SPP layer is added at the end of the CONV layers, providing the following FC layers with a fixed-dimensioned feature representation by pooling the CONV layer output with bins whose spatial shapes are proportional to the input shape. We use SPP layers to allow us to change the input shape as an approximation knob.
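To make the fixed-dimension property concrete, the following is a minimal pure-Python sketch of spatial pyramid max-pooling on a single square feature map. It uses the window and stride rule from the SPP paper [18] (window ceil(a/n) and stride floor(a/n) for n x n bins on an a x a map); the function name and the choice of a 1, 2, 3 bin pyramid are illustrative assumptions, and a real implementation would operate on multi-channel tensors in a deep learning framework.

```python
import math


def spp_single_channel(fmap, levels=(1, 2, 3)):
    """Max-pool a square 2-D feature map into a fixed-length vector.

    For a pyramid level with n x n bins on an a x a map, the pooling window
    is ceil(a / n) and the stride is floor(a / n).
    """
    a = len(fmap)
    pooled = []
    for n in levels:
        win, stride = math.ceil(a / n), a // n
        for i in range(n):          # bin row
            for j in range(n):      # bin column
                r0, c0 = i * stride, j * stride
                pooled.append(
                    max(max(row[c0:c0 + win]) for row in fmap[r0:r0 + win])
                )
    return pooled


# Two feature maps of different sizes yield vectors of the same length.
small = [[r * 7 + c for c in range(7)] for r in range(7)]
large = [[r * 28 + c for c in range(28)] for r in range(28)]
vec_small = spp_single_channel(small)
vec_large = spp_single_channel(large)
```

Both vectors have 1 + 4 + 9 = 14 entries regardless of the input size, which is what lets a single FC layer follow feature maps of different shapes.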

Trading-off accuracy for inference latency: DNNs can have several variants due to different configurations, and these variants yield different accuracies and latencies. But these variants are trained and run independently and cannot be switched efficiently at inference time to meet differing accuracy or latency requirements. For example, MCDNN [15] sets up an ensemble of (up to 68) model variants to satisfy different latency/accuracy/cost requirements. MSDNet [22] enables five early exits in its ImageNet model but does not evaluate on streaming video with any variable content or contention situations. Hence, we set out to design a single-model DNN that is capable of handling the accuracy-latency trade-off at inference time and guarantees our video analytics system’s performance under variable content and workload situations.

(a) Simple image (b) Complex image
Figure 1: Examples of using a heavy DNN (on the left) and a light DNN (on the right) for simple and complex images in a video frame classification task. The light DNN downsamples an input image to half the default input shape and gets prediction labels at an earlier layer. The classification is correct for the simple image (red label denotes the correct answer) but not for the complex image.

2.2 Content-aware Approximate Computing

IRA [42] and VideoChef [73] first introduced the notion of content-aware approximation and applied the idea to image and video processing pipelines, respectively. These works showed for the first time how to tune approximation knobs as content characteristics change, e.g., when the video scene becomes more complex. In particular, IRA performs approximation targeting individual images, while VideoChef exploits temporal similarity among frames in a video to further optimize the accuracy-latency trade-off. In contrast, we apply approximation to the DNN model itself, with the intuition that, depending on the complexity of the image in a frame, we want to feed input of a different shape and take the output at a different depth of layers to achieve the target accuracy. For example, as shown in Fig. 1, if the image is very simple, we can downsample it to half of the original dimensions, make a decision after only 12 layers of a DNN, and still make the correct prediction. If the image is complex, however, the same operation results in wrong predictions. This motivates our design of content-aware approximation on the DNN model itself, in addition to approximation across frames.

3 Overview

Here we give a high-level overview of ApproxNet. In Sec. 4, we provide details of each component.

3.1 Requirements

We set ourselves a few requirements for streaming video analytics on mobile devices. First, the application may have changing input characteristics, such as the complexity of the video frames (as can be measured through well-understood complexity metrics in the image processing domain [6, 75, 48]), which will necessitate changes in the approximation level. The changes happen within the video stream frequently and without any pre-determined pattern. Second, the application may suffer from resource contention because it shares the CPU, GPU, memory, or memory bandwidth with other concurrently running applications. Such contention can happen frequently under co-location due to limited resources, and also without a pre-determined pattern. Third, the application may have different target accuracy and latency requirements. Therefore it is important to build an easily configurable framework that can be optimized for various operating points. For example, the application may require low latency when the video analytics is time-critical, as in AR/VR cases. The application may have a higher latency budget later on when it wants higher accuracy because the scene has changed and it needs to precisely classify objects in the new scene. Thus, the aggregate model must be able to make efficient transitions in the tradeoff space of accuracy, latency, and throughput, optionally using edge or cloud servers. A non-requirement in our work is that multiple concurrent applications consuming the same video stream be jointly optimized. We assume that the device is constrained enough that a single application for video streaming analytics will run on it. MCDNN [15] and Mainstream [34] bring significant design sophistication to handle the concurrency aspect.

3.2 Design Intuition and Workflow

To enable dynamic accuracy-latency trade-off at inference time, we design a DNN (the core unit of a video analytic system) that dynamically changes two factors: (1) input shape (by downsampling) and (2) the number of layers to make a prediction decision (by designing an output to be produced and extracted from multiple locations). These two factors taken together provide various points in the accuracy-latency tradeoff space and a specific setting of these two parameters defines an approximation branch. We design ApproxNet such that these approximation branches are achieved via a single DNN without the need to store and load multiple model variants and switch among them as in prior works for mobile CNN with the notable exception of MSDNet [22].

Figure 2: Workflow of ApproxNet. The input is a video frame and an optional user requirement, and the outputs are prediction labels of the video frame.
Figure 3: Outport of the approximation-enabled DNN.
Figure 4: A Pareto frontier for trading-off accuracy and latency in a particular image complexity category and at a particular contention level.

We show the overall structure of our approximate video analytics system for mobile devices in Fig. 2. ApproxNet currently focuses on the video object classification task, taking optional user inputs for accuracy or latency, and producing top-5 predictions as an output. We compose three major functional units: profiler, scheduler, and executor. First, the set of feasible approximation branches is determined through the estimated accuracy-latency trade-off on the Pareto optimal boundary. Then, an offline profiler collects the accuracy and inference latency of each approximation branch in each image complexity category under variable resource contention.

Second, the scheduler makes the decision on where to execute (selecting an executor) and how to execute (selecting an approximation branch). The answer to the “where” question can be a local executor (an approximation-enabled DNN running on a mobile device) or, if available, a remote executor (a DNN without approximation running on edge or cloud servers). The scheduler invokes an Image Complexity Estimator (ICE) and a Resource Contention Estimator (RCE). The scheduler estimates the expected inference latency of each approximation branch under the current contention. This is done by first estimating the current contention level using the RCE. For this, the RCE tracks the latency of the DNN in the last few executions on the video stream and estimates the resource contention level at runtime by comparing it with the offline latency profile of the currently deployed approximation branch under no contention. To reduce the frequency with which the Image Complexity Categorizer (ICC) inside the ICE needs to be invoked, the input image is first analyzed by a Scene Change Detector (SCD), which detects whether any significant difference exists between consecutive frames. The ICC is triggered only if such a change is detected; otherwise we use the image complexity category of the last input frame. Among the feasible approximation branches, the scheduler picks the one that meets the user requirement with the best value of the other metric (for example, one that meets the required accuracy while keeping latency as low as possible). Finally, the executor invokes the DNN with the chosen approximation branch and produces an inference on the video frame.

4 Design and Implementation

4.1 Dynamic Approximation in Local Executor

ApproxNet’s local executor, an approximation-enabled DNN, is designed to support multiple accuracy and latency requirements at runtime using a single DNN model. To enable this, we design a DNN that can be approximated using two approximation knobs. First, the DNN can take an input image in different shapes, which we call input shapes; this is our first approximation knob. Second, it can produce a classification output at multiple positions in the intervening layers, which we call outports; this is our second approximation knob. Combining these two approximation knobs, ApproxNet creates various approximation branches, which trade off between accuracy and latency and thus can be used to meet a particular requirement. We describe our design using ResNet as the base DNN, though our design is applicable to any other mainstream CNN consisting of convolutional (CONV) layers and fully-connected (FC) layers, such as VGG [62], DenseNet [23], and so on.

Figure 5: The architecture of the approximation-enabled DNN in ApproxNet.
Input Shape   Outport 1   Outport 2   Outport 3   Outport 4   Outport 5   Outport 6
224x224x3     28x28x64    28x28x64    14x14x64    14x14x64    14x14x64    7x7x64
192x192x3     24x24x64    24x24x64    12x12x64    12x12x64    12x12x64    –
160x160x3     20x20x64    20x20x64    10x10x64    10x10x64    10x10x64    –
128x128x3     16x16x64    16x16x64    8x8x64      8x8x64      8x8x64      –
112x112x3     14x14x64    14x14x64    7x7x64      7x7x64      7x7x64      –
96x96x3       12x12x64    12x12x64    –           –           –           –
80x80x3       10x10x64    10x10x64    –           –           –           –
Table 1: The list of the total 30 approximation branches supported for a baseline DNN of ResNet-34, given by the combination of the input shape and the outport from which the result is taken. “–” denotes the undefined settings.

Fig. 5 shows the design of our DNN using ResNet-34 as the base model. This enables 7 input shapes ($s \times s \times 3$ for $s \in \{224, 192, 160, 128, 112, 96, 80\}$) and 6 outports (after 11, 15, 19, 23, 27, and 33 layers). We adapt the design of ResNet in terms of the stride, shape, number of channels, use of convolutional layer or maxpool, and connection of the layers. In addition, we create 7 stacks, numbered 0 through 6, with each stack having 4, 6, or 8 ResNet layers and a variable number of blocks from the original ResNet design ([19], Table 1). We then design an outport (Fig. 3) and connect one to each of stacks 1 to 6, whereby we can get prediction labels by executing only the stacks (i.e., the constituent layers) up to that stack. The use of 6 outports is a pragmatic system choice: too small a number does not provide enough granularity to approximate in a content and contention-aware manner, and too many leads to a high training burden. Further, to allow the approximation knob of a down-sampled input video frame to the DNN, we use the SPP layer at each outport to pool the feature maps of different shapes (due to different input shapes) into one unified shape and then connect with an FC layer. The SPP layer performs max-pooling on its input at three different levels $l \in \{1, 2, 3\}$ with window size $\lceil a/l \rceil$ and stride $\lfloor a/l \rfloor$, where $a$ is the shape of the input to the SPP layer, and $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$ denote the ceiling and floor operators, respectively. Note that our choice of the 3-level pyramid pooling is a typical practice for using the SPP layer [18]. In general, a higher value of $l$ requires a larger value of $a$ on the input of each outport, thereby reducing the number of possible approximation branches. On the other hand, a smaller value of $l$ results in coarser representations of spatial features and thus reduces accuracy. To support the case $l = 3$ in the SPP, we require that the input shape of an outport be no less than 7 pixels in width and height, i.e., $a \geq 7$. This rules out some input shapes, as shown in Table 1. Our model therefore has 30 configuration settings in total, instead of $7 \times 6$ (number of input shapes $\times$ number of outports) settings. We name these configuration settings approximation branches.
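The count of 30 approximation branches can be reproduced with a short sketch. The per-outport downsampling factors below are not stated in the text; they are inferred from the feature-map sizes listed in Table 1 and should be read as an assumption, as should all of the names used.

```python
INPUT_SIZES = [224, 192, 160, 128, 112, 96, 80]
# Spatial downsampling factor at outports 1..6, inferred from Table 1 (assumption).
OUTPORT_DOWNSAMPLE = [8, 8, 16, 16, 16, 32]
MIN_SPP_INPUT = 7  # smallest feature-map side allowed at an outport


def approximation_branches():
    """Return (input_size, outport, feature_map_side) for every valid branch."""
    branches = []
    for size in INPUT_SIZES:
        for outport, factor in enumerate(OUTPORT_DOWNSAMPLE, start=1):
            side = size // factor
            if side >= MIN_SPP_INPUT:
                branches.append((size, outport, side))
    return branches


branches = approximation_branches()
```

With these assumed factors, the 7-pixel rule removes 12 of the 42 combinations, leaving the 30 branches of Table 1.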

To train ApproxNet towards finding the optimal parameter set $W$, we consider the softmax loss $L_{s,i}(W)$ defined for the input shape $s \times s \times 3$ and the outport $i$. The total loss function $L(W)$ that we minimize to train ApproxNet is a weighted average of $L_{s,i}(W)$ for all $s$ and $i$, defined as $L(W) = \sum_{i} \frac{1}{n_i} \sum_{s} L_{s,i}(W)$, where the value of $n_i$ is the factor that normalizes the loss at an outport by dividing by the number of shapes that are supported at that port $i$. This makes each outport equally important in the total loss function. For each mini-batch, we use 64 images for each of the 7 different shapes.
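The per-outport normalization described above can be sketched as follows. The function and argument names are hypothetical, and the per-branch loss values would come from the training framework; only the weighting scheme is illustrated.

```python
def total_loss(branch_losses):
    """Combine per-branch softmax losses into one training loss.

    branch_losses maps (input_size, outport) -> loss value for that branch.
    Each outport's losses are averaged over the shapes it supports, so every
    outport contributes equally no matter how many shapes reach it.
    """
    per_outport = {}
    for (_size, outport), loss in branch_losses.items():
        per_outport.setdefault(outport, []).append(loss)
    return sum(sum(losses) / len(losses) for losses in per_outport.values())


example = {(224, 1): 1.0, (192, 1): 3.0, (224, 6): 4.0}
example_total = total_loss(example)
```

In this toy example, outport 1 contributes the mean of its two losses and outport 6 contributes its single loss, so a deep outport that supports only one shape is not drowned out by the shallow ones.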

4.2 Image Complexity Estimator (ICE)

The design goal of the Image Complexity Estimator (ICE) is to estimate the expected accuracy of each approximation branch in a content-aware manner using the relevant features from the video frame. Online, it is composed of an Image Complexity Categorizer (ICC) and a Scene Change Detector (SCD), and it uses the offline accuracy-latency profiles from the profiler.

Image Complexity Categorizer (ICC). ICC determines how hard it is for ApproxNet to classify a frame of the video. Various methods have been used in the literature to calculate image complexity such as edge information-based methods [75, 48], compression information-based methods [75] and entropy-based methods [6]. In this paper, we use mean edge value (also known as mean spatial information) as the image complexity metric, since it can be calculated with very low computation overhead (3.9 ms per frame on average in our implementation). To expand, we extract an edge map by converting a color image to a gray-scale image, applying Scharr operator [31] in both horizontal and vertical directions, and then calculating the L2 norm in both directions. We then compute the mean edge value of the edge map and use a pre-trained set of boundaries to quantize it into several image complexity categories. The selection of the number of categories is discussed in Sec. 4.4.
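The mean edge value computation can be sketched in pure Python as below. This is an illustration under stated assumptions rather than the implementation measured in the paper: the grayscale weights are the common BT.601 luma coefficients, border pixels are skipped, no normalization is applied, and the category boundaries passed to `complexity_category` are made-up placeholders. An optimized image library would be used in practice.

```python
import bisect
import math

# Standard 3x3 Scharr kernels for horizontal and vertical gradients.
SCHARR_X = ((-3, 0, 3), (-10, 0, 10), (-3, 0, 3))
SCHARR_Y = ((-3, -10, -3), (0, 0, 0), (3, 10, 3))


def to_gray(rgb):
    """rgb: rows of (r, g, b) tuples -> rows of luma values (BT.601 weights)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row] for row in rgb]


def mean_edge_value(gray):
    """Mean L2 norm of the Scharr gradients over interior pixels."""
    h, w = len(gray), len(gray[0])
    total = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = gray[y + dy - 1][x + dx - 1]
                    gx += SCHARR_X[dy][dx] * p
                    gy += SCHARR_Y[dy][dx] * p
            total += math.hypot(gx, gy)
    return total / ((h - 2) * (w - 2))


def complexity_category(edge_value, boundaries):
    """Quantize an edge value into categories 1..len(boundaries) + 1."""
    return bisect.bisect_right(boundaries, edge_value) + 1


flat = [[10] * 5 for _ in range(5)]             # no edges at all
step = [[0, 0, 255, 255, 255] for _ in range(5)]  # one vertical edge
```

A flat image has a mean edge value of zero and lands in the simplest category, while the step image produces a positive value.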

Figure 6: Sample images (first row) and edge maps (second row), going from left to right as simple to complex. Normalized mean edge values of images from left to right: 0.03, 0.24, 0.50 and 0.99 with corresponding image complexity categories: 1, 3, 6, and 7

Scene Change Detector (SCD). The Scene Change Detector in ApproxNet is designed to further reduce the runtime overhead of ICC by determining whether the content in a frame is significantly different from that in a prior frame, in which case the ICC will be invoked. SCD has to be very efficient as it gets invoked frequently in the video stream. In practice, we do not expect scenes to change very frequently, and hence we execute SCD once every 30 frames (roughly a 1-second interval). If ICC is not triggered for the current frame, we use the last computed image complexity category from the video stream. SCD tracks a histogram of the R-channel values of the pixels and declares a scene change when the mean of the absolute difference across all bins of the histograms of two consecutive frames is greater than a certain threshold (45% of the total pixels in our design). To bound the execution time of SCD, we use only the R-channel and downsample the image to a smaller shape. We empirically find that such optimizations do not reduce the accuracy of detecting new scenes but do reduce the SCD overhead to only 1.3 ms per frame.
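A histogram-based scene change check along these lines might look like the sketch below. The text does not fully specify how the bin differences are normalized against the 45% threshold, so this sketch makes an explicit assumption: it sums the absolute bin differences and compares that sum to 45% of the pixel count. The bin count, function names, and frame representation are also illustrative, and downsampling is omitted.

```python
def r_histogram(frame, bins=256):
    """frame: rows of (r, g, b) tuples with 8-bit channels; counts R values."""
    hist = [0] * bins
    for row in frame:
        for r, _g, _b in row:
            hist[r * bins // 256] += 1
    return hist


def scene_changed(prev_frame, cur_frame, threshold=0.45):
    """Assumed rule: summed |bin difference| above threshold * pixel count."""
    h_prev, h_cur = r_histogram(prev_frame), r_histogram(cur_frame)
    total_pixels = sum(h_cur)
    diff = sum(abs(a - b) for a, b in zip(h_prev, h_cur))
    return diff / total_pixels > threshold


dark = [[(0, 0, 0)] * 4 for _ in range(4)]
bright = [[(255, 0, 0)] * 4 for _ in range(4)]
```

Identical frames never trigger the detector, while a frame whose red channel shifts completely does, so the complexity categorizer only reruns when the histogram moves substantially.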

4.3 Resource Contention Estimator (RCE)

To estimate the resource contention at runtime, ideally we could use a sample classification task to probe the system and observe its latency under the current contention level. The use of such micro-benchmarks to probe for the current contention level is common in data center environments [47, 74]. However, we have an advantage: since we are dealing with streaming videos, the inference latencies of the latest frames form a natural observation of the contention level of the system. Thus we use the averaged inference latency $\bar{L}_B$ of the current approximation branch $B$ across the latest $N$ frames. We then check the latency sensitivity $L_{B,C}$ of branch $B$ (a profile created offline, as discussed in Sec. 4.4) and estimate the contention level $\hat{C}$ by the nearest neighbor principle, i.e., we pick the level $C$ whose offline profiled latency $L_{B,C}$ is closest to $\bar{L}_B$. Specifically in this work, we consider memory contention among tasks executing on the device (our SoC board shares the memory between the CPU and the GPU), but our design is agnostic to what causes the contention.
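The nearest-neighbor estimate described above can be sketched as a small class. The class name, the window length of 5 frames, and the profile layout are assumptions made for illustration; the offline profile values in the example are invented.

```python
from collections import deque


class ContentionEstimator:
    """Estimate the contention level from recent inference latencies."""

    def __init__(self, profile, window=5):
        # profile[branch][level] -> latency in ms, measured offline (assumed layout)
        self.profile = profile
        self.recent = deque(maxlen=window)
        self.branch = None

    def record(self, branch, latency_ms):
        """Store the latency of the latest frame; reset if the branch changed."""
        if branch != self.branch:
            self.recent.clear()
            self.branch = branch
        self.recent.append(latency_ms)

    def estimate(self):
        """Return the level whose profiled latency is nearest to the average."""
        if not self.recent:
            return None
        avg = sum(self.recent) / len(self.recent)
        levels = self.profile[self.branch]
        return min(levels, key=lambda lvl: abs(levels[lvl] - avg))


rce = ContentionEstimator({"b1": {0: 10.0, 1: 15.0, 2: 22.0}})
for observed in (14.0, 16.0, 17.0):
    rce.record("b1", observed)
level = rce.estimate()
```

Clearing the window when the branch changes is a design choice of this sketch: latencies from different branches are not comparable against a single profile row.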

4.4 Offline Profiler

Content-Specific Pareto Frontiers. For each image complexity category, we perform profiling to get the accuracy and the latency of each approximation branch. We perform this with no resource contention. We then determine the accuracy-vs-latency Pareto frontier, with an example in Fig. 4. Only the branches on the Pareto frontier are candidate execution choices, because each of the others is inferior in both accuracy and latency to at least one point on the frontier.
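Extracting the Pareto frontier from profiled (accuracy, latency) pairs can be sketched with a single sorted pass. The tuple layout and the sample numbers are hypothetical; only the dominance rule follows the text.

```python
def pareto_frontier(points):
    """Return the non-dominated points, sorted by increasing latency.

    points: list of (branch, accuracy, latency_ms). Higher accuracy and lower
    latency are better. A point is kept only if it is more accurate than every
    point that is at least as fast.
    """
    frontier = []
    best_accuracy = float("-inf")
    # Fastest first; among equal latencies, the most accurate first.
    for point in sorted(points, key=lambda p: (p[2], -p[1])):
        if point[1] > best_accuracy:
            frontier.append(point)
            best_accuracy = point[1]
    return frontier


profiled = [
    ("a", 0.60, 10.0),
    ("b", 0.55, 12.0),  # slower and less accurate than "a"
    ("c", 0.70, 20.0),
    ("d", 0.70, 25.0),  # same accuracy as "c" but slower
    ("e", 0.80, 30.0),
]
frontier = pareto_frontier(profiled)
```

Here branches "b" and "d" are dropped, leaving "a", "c", and "e" as the candidates the scheduler chooses among.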

Determining Image Complexity Categories. The image complexity category is determined based on the criterion that all images within a category should have an identical Pareto frontier curve and images in different categories should have distinct curves. This enables us to come up with a range of mean edge values for each category. We start by considering the whole set of images as belonging to a single category and iteratively split the range of mean edge values into two. The binary splitting stops (on that branch of the tree) if the Pareto frontiers of the two halves are exactly the same. In our video datasets, we derive 7 image complexity categories, with 1 being the simplest and 7 the most complex. Fig. 6 shows examples of images and their edge maps.

Latency Sensitivity Profile under Resource Contention. We want ApproxNet to be able to select approximation branches at runtime in the face of contention. Therefore, we perform offline profiling of the inference latency of each approximation branch under different quantized levels of contention. Note that contention increases the latency of the DNN but does not affect its accuracy. Contention can be due to other co-resident applications and can affect shared CPU, GPU, memory, or memory bandwidth resources. We find empirically that different approximation branches have different sensitivities to contention, i.e., their latencies increase by different amounts. We quantize the contention to 20 levels and then create the offline latency profile $L_{B,C}$ for each approximation branch $B$ under each contention level $C$.

4.5 Remote Executor: Edge/Cloud Servers

As an optional feature of ApproxNet, we integrate a remote executor for scenarios where high accuracy as well as high throughput are required. To support such scenarios, ApproxNet offloads the inference task to a high-end GPU-powered edge (or cloud) server. Such a server uses the most accurate model that is available and serves the classification requests fast and in parallel among multiple servers to achieve the desired throughput. However, this mode requires ApproxNet to have network connectivity, and the network latency is often significant. For example, the network round trip time to offload a frame to the edge server in the same wired LAN in our lab is 41 ms on average, and it can be up to 100 ms via our campus wireless LAN. In practice, this use case only applies where latency is not that important, such as classifying a series of frames in a burst mode.

4.6 Scheduler

The main job of the scheduler in ApproxNet is to decide whether to execute the task locally on the mobile device or remotely on the edge server, and which approximation branch to use for local execution. The scheduler accepts a user requirement in one of three forms: a minimum accuracy, a maximum latency per frame, or an explicit request to offload to the edge. The scheduler gets the estimated accuracy and latency of all approximation branches from ICE and RCE. Pareto frontiers are generated on-the-fly for the current contention level, and an approximation branch on the frontier is selected based on the user requirement. If no Pareto frontier point satisfies the user requirement, ApproxNet picks the approximation branch whose metric value is closest to the user requirement. If the user does not set any requirement, ApproxNet sets the latency requirement to the frame interval of the incoming video stream.
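The local branch selection can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `Branch` type, the function names, and the tie-breaking choices (most accurate branch within a latency bound, fastest branch above an accuracy bound) are assumptions, and the offload path is omitted. The sample branches reuse four rows of Table 2(a).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Branch:
    name: str
    latency_ms: float
    accuracy: float  # percent


def pareto_frontier(branches: List[Branch]) -> List[Branch]:
    """Keep branches that no faster (or equally fast) branch beats on accuracy."""
    frontier: List[Branch] = []
    best_accuracy = float("-inf")
    for b in sorted(branches, key=lambda b: (b.latency_ms, -b.accuracy)):
        if b.accuracy > best_accuracy:
            frontier.append(b)
            best_accuracy = b.accuracy
    return frontier


def select_branch(branches: List[Branch],
                  max_latency_ms: Optional[float] = None,
                  min_accuracy: Optional[float] = None,
                  frame_interval_ms: float = 1000.0 / 30.0) -> Branch:
    """Pick a frontier branch for one requirement; fall back to the closest one."""
    frontier = pareto_frontier(branches)
    if min_accuracy is not None:
        feasible = [b for b in frontier if b.accuracy >= min_accuracy]
        if feasible:
            return min(feasible, key=lambda b: b.latency_ms)
        return min(frontier, key=lambda b: abs(b.accuracy - min_accuracy))
    # With no requirement at all, default to the frame interval of the stream.
    limit = frame_interval_ms if max_latency_ms is None else max_latency_ms
    feasible = [b for b in frontier if b.latency_ms <= limit]
    if feasible:
        return max(feasible, key=lambda b: b.accuracy)
    return min(frontier, key=lambda b: abs(b.latency_ms - limit))


BRANCHES = [
    Branch("80x80x3/12", 16.14, 66.39),
    Branch("128x128x3/12", 17.97, 70.23),
    Branch("128x128x3/20", 27.95, 79.35),
    Branch("128x128x3/24", 31.42, 82.12),
]

print(select_branch(BRANCHES).name)                       # no requirement given
print(select_branch(BRANCHES, max_latency_ms=20.0).name)
print(select_branch(BRANCHES, min_accuracy=75.0).name)
```

In the full system, the latency values fed to such a selector would come from the profile for the current contention level, and the accuracy values from the table for the current image complexity category.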

5 Evaluation

5.1 Evaluation Platforms

We evaluate ApproxNet by running its local executor on an NVIDIA Jetson TX2 embedded board [32], which includes 256 NVIDIA Pascal CUDA cores, a dual-core Denver CPU, and a quad-core ARM CPU, with 8 GB of memory unified [69] between CPU and GPU. The specification of this board is a little above what is available in today’s high-end smartphones such as the Samsung Galaxy S9 and Apple iPhone XS [55, 1]. We run the remote executor on an edge server with an NVIDIA Tesla K40c GPU with 12 GB of dedicated memory and an octa-core Intel i7-2600 CPU with 24 GB of RAM. For both the local embedded device and the edge server, we install Ubuntu 16.04 and TensorFlow v1.6.0 (device) and v1.4.0 (edge). The offline training of the DNN is done on the edge server.

5.2 Datasets, Task, and Metrics

ImageNet VID dataset: We evaluate ApproxNet on the video object classification task using ILSVRC 2015 VID dataset [29]. For the purpose of training, ILSVRC 2015 VID training set contains too many redundant video frames, leading to an over-fitting issue. To alleviate this problem, we follow the best practice in [37] such that the VID training dataset is sub-sampled every 180 frames and the resulting subset is mixed with ILSVRC 2014 detection (DET) training dataset to construct a new dataset with DET:VID=2:1. We use 90% of VID training dataset (mixed with DET dataset) to train ApproxNet’s DNN model and keep aside another 10% as validation set to fine-tune ApproxNet (offline profiling). To evaluate ApproxNet’s system performance, we use ILSVRC 2015 VID validation set – we refer to this as the “test set” throughout the paper.

ImageNet IMG dataset: We also use the ILSVRC 2012 image classification dataset [9] to evaluate the accuracy-latency trade-off of our single DNN. We use 10% of the ILSVRC training set as our training set, the first 50% of the validation set as our validation set to fine-tune ApproxNet, and the remaining 50% of the validation set as our test set. The training-validation-test choices for both datasets follow common practice and ensure that there is no overlap between the three sets.

Metrics: We use the latency and the top-5 accuracy as the two metrics. The latency includes the overheads of the respective solutions (ApproxNet, MCDNN, MobileNets). When user requirements, content characteristics, or contention characteristics change, the switching overhead is also included in the latency.

5.3 Baselines

ResNet [19]: We use ResNet of 18 layers (ResNet-18) and of 34 layers (ResNet-34) as base models. We modify the last FC layer to classify into 30 labels in VID dataset and re-train the whole model. ResNet-34 plays a role as the reference providing the upper bound of the target accuracy. As our target is a real-time video analytics system, the ResNet architectures with more than 34 layers ([19] has considered up to 152 layers) become impractical as they are too slow to run on the resource-constrained local system and their memory consumption in ensemble mode (with other smaller ResNet models) exceeds the total memory on the board.

MCDNN [15]: We change the base model in MCDNN from VGG to the more recent ResNet for a fairer comparison. This system chooses between MCDNN-18 and MCDNN-34 depending on the accuracy requirement. MCDNN-18 uses two models: a specialized ResNet-18 followed by the generic ResNet-18. The specialized ResNet-18 is the same as the ResNet-18 except for the last layer, which is modified to classify only the most frequent classes. MCDNN’s key novelty rests on the observation that most inputs belong to these top classes, which can be handled by a reduced-complexity DNN. If the top-1 prediction label of the specialized model in MCDNN is not among the most frequent classes, then the generic model processes the input again and outputs its final predictions. Otherwise, MCDNN uses the top-5 prediction labels of the specialized model as its final predictions. We set the number of frequent classes so that they cover 80% of the training video frames in the VID dataset. MCDNN-34 is defined similarly, replacing ResNet-18 with ResNet-34.
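The cascade logic of this baseline can be sketched in a few lines. The function names, the callable model interface (a model maps a frame to a ranked list of labels), and the greedy way of picking frequent classes by cumulative coverage are assumptions made for illustration, not MCDNN's actual code.

```python
from collections import Counter
from typing import Callable, List, Sequence, Set

Model = Callable[[object], List[str]]  # frame -> labels ranked by confidence


def frequent_classes(train_labels: Sequence[str], coverage: float = 0.8) -> Set[str]:
    """Take the most frequent labels until they cover `coverage` of the frames."""
    counts = Counter(train_labels)
    chosen: Set[str] = set()
    covered = 0
    for label, count in counts.most_common():
        if covered >= coverage * len(train_labels):
            break
        chosen.add(label)
        covered += count
    return chosen


def cascade_predict(frame: object, specialized: Model, generic: Model,
                    frequent: Set[str]) -> List[str]:
    """Trust the specialized model unless its top-1 label is an infrequent class."""
    top5 = specialized(frame)[:5]
    if top5 and top5[0] in frequent:
        return top5
    # Second pass over the same frame; this is the source of the extra latency.
    return generic(frame)[:5]


# Toy data: "car" and "dog" together cover 90% of the (made-up) training frames.
labels = ["car"] * 6 + ["dog"] * 3 + ["cat"] * 1
top_classes = frequent_classes(labels)
print(sorted(top_classes))
```

The second call to `generic` is what produces the smaller latency spikes whenever the specialized model predicts an infrequent class.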

MobileNets [20]: This refers to 20 model variants (trained by the original authors) specifically designed for mobile devices. These models can trade off accuracy against latency, though the original paper does not discuss this. We enhance MobileNets to be an ensemble of these 20 models with the option of switching.

5.4 Typical Usage Scenarios

We use a few discrete usage scenarios to compare the protocols, although ApproxNet can support much finer-grained user requirements in latency or accuracy. High accuracy, High latency (HH) refers to the scenario where ApproxNet has less than 10% (relative) accuracy loss from ResNet-34, our most accurate single model baseline. Accordingly, the runtime latency is also high to achieve such accuracy. Medium accuracy, Medium latency (MM) has an accuracy loss of less than 20% from our base model ResNet-34. Low accuracy, Low latency (LL) can tolerate an accuracy loss of up to 30% in exchange for a speed-up in inference. None of these three requires network connectivity, and all can execute solely on the local device. In High accuracy, High latency with edge server support (HH-Offload), the user wants the highest possible accuracy, comparable to state-of-the-art models, but operates with guaranteed network connectivity. ApproxNet offloads the task and processes it in a remote executer running on an edge server. If no requirement is specified, the default is Real time (RT), which means the processing pipeline should keep up with a 30 fps stream, i.e., a maximum latency of 33.33 ms.
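As a worked example of these definitions, the sketch below turns the scenario names into numeric thresholds. It assumes that all three loss bounds are relative to the ResNet-34 validation accuracy of 85.86% quoted in the caption of Fig. 8. The constant and function names are illustrative only.

```python
RESNET34_ACCURACY = 85.86  # validation accuracy (%), from the Fig. 8 caption

# Maximum relative accuracy loss tolerated in each local scenario.
MAX_RELATIVE_LOSS = {"HH": 0.10, "MM": 0.20, "LL": 0.30}


def accuracy_floor(scenario: str) -> float:
    """Lowest acceptable accuracy (%) for a scenario, relative to ResNet-34."""
    return RESNET34_ACCURACY * (1.0 - MAX_RELATIVE_LOSS[scenario])


def frame_budget_ms(fps: float = 30.0) -> float:
    """Per-frame latency budget for real-time processing at the given rate."""
    return 1000.0 / fps


# Accuracy of the branch labeled with each scenario in Table 2(a).
TABLE_2A_ACCURACY = {"HH": 82.12, "MM": 70.23, "LL": 66.39}

for scenario, accuracy in TABLE_2A_ACCURACY.items():
    floor = accuracy_floor(scenario)
    print(f"{scenario}: floor {floor:.2f}%, branch {accuracy:.2f}%")
print(f"RT budget: {frame_budget_ms():.2f} ms")
```

Under this reading, each branch labeled in Table 2(a) clears the floor for its scenario.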

Figure 7: Pareto frontier for test accuracy and inference latency on ImageNet dataset for ApproxNet compared to ResNet and MobileNets, the latter being specialized for mobile devices.
Figure 8: Pareto frontier trading-off validation accuracy with inference latency on video in ApproxNet. Baseline (ResNet-34) validation accuracy is 85.86%.
Figure 9: Comparison of system performance in typical usage scenarios. ApproxNet is able to meet the accuracy requirement for all three scenarios. User requirements are shown in dashed lines.
(a) Averaged accuracy and latency performance in ApproxNet.
Shape      Layers  Latency   Accuracy  Scenario (rate)
80x80x3    12      16.14 ms  66.39%    LL (62 fps)
96x96x3    12      16.78 ms  67.98%
112x112x3  12      17.70 ms  68.53%
128x128x3  12      17.97 ms  70.23%    MM (56 fps)
112x112x3  20      26.84 ms  78.28%
128x128x3  20      27.95 ms  79.35%
160x160x3  20      31.33 ms  80.81%
128x128x3  24      31.42 ms  82.12%    HH (32 fps)
224x224x3  34      20.57 ms  85.86%    HH-Offload

(b) Lookup table in MCDNN’s scheduler.
Shape      Layers  Latency   Accuracy  Scenario (rate)
224x224x3  18      57.83 ms  71.40%    MM/LL (17 fps)
224x224x3  34      88.11 ms  77.71%    HH (11 fps)
224x224x3  34      20.57 ms  85.86%    HH-Offload

(c) Reference performance of single model variants.
Shape      Layers  Latency   Accuracy  Model (rate)
224x224x3  18      45.22 ms  84.59%    ResNet-18 (22 fps)
224x224x3  34      64.44 ms  85.86%    ResNet-34 (16 fps)
Table 2: Averaged accuracy and latency performance of approximation branches on the Pareto frontier in ApproxNet and those of the baselines on validation set of the VID dataset.
Figure 10: Content-specific validation accuracy of Pareto frontier branches. Branches that fulfill real-time processing (30 fps) requirement are labeled in green. Note that both ResNet-18 and ResNet-34 models, though with the higher accuracy, cannot meet the 30 fps latency requirement.
Figure 11: Latency performance comparison with changing user requirements throughout video stream.

5.5 ApproxNet’s adaptability to changing user requirements

We first evaluate ApproxNet on the ILSVRC IMG dataset, examining the accuracy-latency trade-off of each approximation branch in our single DNN, as shown in Fig. 7. We compare with our re-trained ResNet and pre-trained MobileNets models. To favor these baseline models, we charge them no switching overhead, as there is no change in requirements or characteristics. Our approximation branch for the “HH” scenario has accuracy close to ResNet-18 and ResNet-34 but much lower latency than ResNet-34. Meanwhile, our approximation branch for the “MM” scenario has accuracy close to that of MobileNets at lower latency. Finally, our approximation branch for the “LL” scenario meets the accuracy requirement but has significantly lower latency than ResNet and MobileNets. This result indicates that ResNet cannot keep up with 30 fps even without switching overhead. MobileNets can, but it cannot be configured to reach latencies as low as those of ApproxNet.

We then show how ApproxNet can meet different user requirements for accuracy and latency. Fig. 8 shows the averaged performance (over all complexity categories) of each approximation branch on the validation set, together with the Pareto frontier. We list the averaged accuracy and latency of the Pareto frontier branches in Table 2(a), which can serve as a lookup table in the simplest scenario, where image complexity categories and resource contention are not considered. Note that ApproxNet, being aware of content characteristics, keeps a lookup table for each image complexity category and, being responsive to resource contention, updates the lookup table according to the runtime latency.

Next, we perform our evaluation on the entire test set, but again without the baseline protocols incurring any switching penalty. Fig. 9 compares the accuracy and latency performance between ApproxNet and MCDNN in the three typical usage scenarios “HH”, “MM”, and “LL” (AN denotes ApproxNet). In this experiment, ApproxNet uses the content-aware lookup table for each image complexity category and chooses the best approximation branch at runtime to meet the user accuracy requirement. MCDNN uses a similar lookup table (Table 2(b)), switching between MCDNN-18 and MCDNN-34 to satisfy the user requirement. We can observe that “AN-HH” achieves an accuracy of 67.7% at a latency of 35.0 ms, compared to “MCDNN-HH”, which has an accuracy of 68.5% at a latency of 87.4 ms. Thus, MCDNN-HH is 2.5X slower while achieving a 1.1% accuracy gain over ApproxNet. This win comes because ApproxNet has a fine-grained choice of approximation knobs and chooses the appropriate one for the specific test point. In the “LL” and “MM” usage scenarios, MCDNN-LL/MM is 2.8-3.3X slower than ApproxNet, while gaining 3% or less in accuracy. ResNet-34, with its highest accuracy of 71.5% and latency of 64.44 ms, is better than MCDNN in latency in the “HH” scenario but worse in the “LL” and “MM” scenarios. Thus, compared to these baseline models, ApproxNet wins by providing lower latency and flexibility in achieving various points in the (accuracy, latency) space. Neither ResNet model can keep up with the 30 fps requirement.

5.6 ApproxNet’s adaptability to changing user requirements & content characteristics

We now show how ApproxNet can adapt to changing user requirements and content characteristics within the same video stream. The video stream, typically at 30 fps, may contain content of various complexities, and this can change quickly and arbitrarily. We first show in Fig. 10 that ApproxNet, with its various approximation branches, can satisfy different (accuracy, latency) requirements for each image complexity category. According to the user’s accuracy or latency requirement, ApproxNet’s scheduler picks the appropriate approximation branch. The majority of the branches satisfy the real-time processing requirement of 30 fps and can also support accuracy quite close to that of ResNet-34.

In Fig. 11, we show how ApproxNet adapts for a particular video. Here, we assume the user requirement changes every 100 frames between “HH”, “MM”, “LL”, and “HH-Offload”. We assume a uniformly distributed model selection among 20 model variants for MCDNN’s scheduler (in MCDNN the model catalog has up to 68 model variants) while the local executer can only cache two models in the RAM (more detailed memory results in Sec. 5.9). In this case, MCDNN has a high probability of loading a new model variant into RAM from Flash whenever the user requirement or content characteristics change. We can observe that ApproxNet incurs little overhead in switching between any two approximation branches, while a huge latency spike, typically from 5 to 20 seconds, occurs in MCDNN. Notably, there are also small spikes in MCDNN following the larger spikes, because the generic model is invoked when the specialized model predicts an “infrequent” class. Thus, this second spike is not aligned with the change in the user requirement. So, even though one may pre-load multiple model variants (in a mobile device with ample RAM, not practically available today), overhead will still occur when the generic model is invoked due to video content changes or inaccurate prediction in the specialized model (the case here).

(a) Inference latency (b) Accuracy
Figure 12: Comparison of ApproxNet vs MCDNN under resource contention on test dataset.
Figure 13: System overhead in ApproxNet and MCDNN.
Figure 14: Transition latency overhead across approximation branches in ApproxNet. “from” branch on Y-axis and “to” branch on X-axis. Inside brackets: (input shape, outport depth). Latency unit is millisecond.
Figure 15: Case study: performance comparison of ApproxNet vs MCDNN under resource contention for a Youtube video.

This benefit of ApproxNet comes from the fact that it can accommodate multiple (accuracy, latency) points within one model through its two approximation knobs, while MCDNN has to switch between model variants. To see the behavior of ApproxNet in further detail, we profile the mean transition time between all Pareto frontier branches under no contention, as shown in Fig. 14. Most of the transition overheads are extremely low, and only a few transitions are above 30 ms. We can filter out such expensive transitions if they happen too frequently. In comparison, the latency spike in MCDNN when switching models is 5–20 seconds.

5.7 ApproxNet’s adaptability to resource contention

We evaluate in Fig. 12 the ability of ApproxNet to adapt to resource contention on the device. We create contention by running a bubble application [49, 74] on the CPU that creates stress of different magnitudes on the (shared) memory subsystem while the DNN application analyzing the video stream is running on the GPU. We generate five bubbles, each of memory size 10 KB (low contention) or 300 MB (high contention). The bubbles can be “unpinned”, meaning they can run on any of the cores, or “pinned”, in which case they run on a total of 5 CPU cores, leaving the 6th one for dedicated use by the video analytics application. Naturally, the unpinned configuration causes higher contention. We introduce contention in four phases: low pinned, low unpinned, high pinned, and high unpinned.

As shown in Fig. 12(a), MCDNN with its fastest model variant, MCDNN-18, runs between 40 ms and 100 ms depending on the contention level and does not adapt. For ApproxNet, on the other hand, the mean latency under low contention (10 KB, pinned) is 25.66 ms, and it increases only a little, to 34.23 ms, when the contention becomes high (300 MB, unpinned). We also show the accuracy comparison in Fig. 12(b), where ApproxNet is slightly better than MCDNN under low contention and high contention (by 2% to 4%) but slightly worse (within 4%) for intermediate contention (300 MB, pinned). This experiment bears out the claim that ApproxNet can respond to contention gracefully by recreating the Pareto curve for the current contention level and picking the appropriate approximation branch.

5.8 System Overhead

With the same experiment as in Sec. 5.6, we compare the overheads of ApproxNet and MCDNN in Fig. 13 (note that the figure has two different Y-axes, left for ApproxNet and right for MCDNN). For ApproxNet, we measure the overhead of all the steps outside of the core DNN, i.e., SCD, ICC, scheduler, and image resizing. For MCDNN, the dominant overhead is the model switching and loading. The model switching overhead of MCDNN is measured at each switching point and averaged across all frames in each scenario. We see that ApproxNet, including overheads, is faster than MCDNN. Further, we can observe that in the “MM” and “LL” scenarios, ApproxNet’s averaged latency is less than 30 ms, and thus ApproxNet can achieve real-time processing on 30 fps videos. As mentioned before, MCDNN may be forced to reload the appropriate models whenever the user requirement changes. In the best case for MCDNN, the requirement never changes or all of its models are cached in RAM. Even then, ApproxNet is still faster.

5.9 Memory and Storage Consumption

Table 3 compares the peak memory consumption of ApproxNet and MCDNN in typical usage scenarios. ApproxNet-mixed and MCDNN-mixed are the cases where the experiment cycles through the three usage scenarios. We test MCDNN-mixed with two model caching strategies: (1) the model variants are loaded from Flash when they get triggered (named “re-load”), simulating the minimum RAM usage scenario, and (2) the model variants are all loaded into RAM at the very beginning (named “load-all”), assuming the RAM is large enough.

Model LL MM HH Mixed
ApproxNet 1.6 1.62 1.7 2.06
MCDNN 1.85 (LL and MM share the same model variant) 2.35 2.36 (re-load), 2.72 (load-all)
Table 3: Memory consumption of ApproxNet and MCDNN in different usage scenarios (unit: GB).

We see that ApproxNet, in going from the “LL” to the “HH” requirement, consumes 1.6 GB to 1.7 GB of memory, which is lower than MCDNN (1.9 GB and 2.4 GB). MCDNN’s cascade DNN design (specialized model followed by generic model) is the root cause: it consumes about 15% more memory than our model even when it keeps only one model variant in RAM, and 32% more if it loads two. The memory usage of ApproxNet in the “mixed” usage scenario increases to 2.1 GB, while MCDNN increases to 2.4 GB with the “re-load” strategy and 2.7 GB with the “load-all” strategy. We can set an upper bound on ApproxNet’s memory consumption: it never exceeds 2.1 GB no matter how we switch among approximation branches at runtime, an important property for proving operational correctness in mobile or embedded environments. Further, ApproxNet, with tens of approximation branches available, offers more choices than MCDNN. With the same number of model variants, MCDNN’s RAM requirement would be significantly higher under the “load-all” strategy, and it would incur huge switching overhead under the “re-load” strategy, as we have seen empirically (Fig. 11). Storage is a lesser concern, but it does affect the pushing out of updated models from the server to the edge, a use case commonly considered in works in this space. ApproxNet’s storage cost is only 88.8 MB, while MCDNN with 2 models takes 260 MB. A primary reason is the duplication in MCDNN of the specialized and the generic models, which have identical architecture except for the last FC layer. Thus, ApproxNet is well suited to the mobile usage scenario because its maximum memory usage is bounded and its switching overhead is negligible.

6 Case Study with YouTube Video

As a case study, we evaluate ApproxNet on a randomly picked YouTube video [63] to see how it adapts to different levels of resource contention at runtime (Fig. 15). The video is a car racing match with changing scenes and objects, and thus we want to evaluate the object classification performance. The interested reader may see a demo of ApproxNet and MCDNN on this and other videos on the demo site. Similar to the control setup in Sec. 5.7, we test ApproxNet and MCDNN under four different contention levels and repeat them a second time. Each phase is around 300 or 400 frames, and the latency requirement is 33 ms to keep up with the 30 fps video. We see that ApproxNet adapts to the resource contention well: it switches to an approximation branch that has a low compute load (and correspondingly lower inference latency than the target latency) while still keeping high accuracy, comparable to MCDNN (not shown here, seen on the demo site). Further, ApproxNet is always faster than MCDNN, even without MCDNN’s switching overhead, while MCDNN, with latency ranging from 40 ms to 80 ms, has degraded performance under resource contention and has to drop approximately two out of every three frames.

7 Discussion

Training the approximation-enabled DNN of ApproxNet may take longer than training conventional DNNs, since at each iteration of training, different outports and input shapes try to minimize their own softmax loss, and thus they may adjust internal weights of the DNN in conflicting ways. In our experiments with the VID dataset, we observe that our training time is around 3 days on our evaluation edge server described in Sec. 5.1, compared to 1 day to train a baseline ResNet-34 model. However, since training is an offline process, the time is of less concern, and it can be sped up by using one of various actively researched techniques for optimizing training, such as [43]. Models smaller than ResNet-18 are not adopted for baseline comparison with ApproxNet because both our model (Fig. 1) and the ResNet model (Fig. 11) show that a smaller model has lower accuracy in classifying complex images.

8 Related Work

System-wise optimization: There have been many attempts to improve the efficiency of the video analytics pipeline by building low-power hardware and software accelerators for DNNs [41, 57, 8, 52, 16, 51, 13, 77]. These are orthogonal to our work, and ApproxNet can also benefit from these optimizations. VideoStorm [76], Chameleon [35], and Focus [21] exploited various configurations and DNN models to handle video analytics queries in a situation-tailored manner. ExCamera [12] enabled low-latency video processing on the cloud using a serverless architecture (AWS Lambda [3]). Mainstream [34] proposed to share weights of DNNs across applications. These are all server-side solutions that require loading multiple models at the same time, which is challenging on resource-constrained mobile devices. NoScope [36] aimed to reduce the computation cost of video analytics queries on servers by leveraging a specialized model. However, the specialized model can be used only for the small subset of videos on which it was trained, and hence its applicability is limited. VideoChef [73] attempted to reduce the processing cost of the video pipeline by dynamically changing approximation knobs of preprocessing filters in a content-aware manner. In contrast, ApproxNet does the approximation in the core DNN, which has a much larger time overhead and a different program structure than the filters.

DNN optimizations: Many solutions have been proposed to reduce the computation cost of a DNN by controlling the precision of its internal weights [25, 14, 24, 78, 56] and by restructuring or compressing a DNN model [10, 20, 5, 28, 7, 40, 17, 70]. These are orthogonal to our work, and ApproxNet’s local executer can be further optimized by adopting such methods. Several works also present similar approximation knobs (input shape, outport depth). BranchyNet, CDL, and MSDNet [66, 50, 22] propose early exit branches in deep neural networks. However, BranchyNet and CDL are validated only on small datasets like MNIST [68] and CIFAR-10 [67] and do not show practical techniques to selectively choose the early exit branches in an end-to-end system that is aware of resource contention, content characteristics, and users’ requirements. MSDNet targets a very simple image classification task and does not show a strong use case or how the early exits should be used. It has no evaluation of actual latency on either a server or mobile and embedded devices. BlockDrop [72] trains a policy network to determine whether to skip the execution of several residual blocks at inference time, but its speed-up is marginal and it cannot be applied directly to mobile devices to achieve real-time classification.

9 Conclusion

Streaming video analytics systems often use multiple DNN models to meet various accuracy and latency requirements. However, managing multiple DNNs is not feasible in a resource-constrained mobile or embedded device due to memory pressure and high switching overhead, which throws off real-time requirements. For this reason, we propose a novel DNN architecture, called ApproxNet, with which we can support a variety of accuracy/latency configurations in a single DNN model on resource-constrained devices. Our DNN architecture provides two approximation knobs: how much we down-sample frames and how deep the inputs must traverse in the DNN during inference to produce an accurate output. ApproxNet optimizes the accuracy vs. latency trade-off in multiple usage scenarios, including changing user requirements, dynamic system resource availability, and video stream characteristics, by scheduling the best approximation branch given the runtime conditions. By dynamically changing the two approximation knobs in a content-aware manner, ApproxNet significantly reduces inference latency while achieving video frame classification accuracy comparable to the state-of-the-art DNN model ResNet and to the mobile-specific models MCDNN and MobileNets.


  • [1] (2018) Apple a12 soc. Note: Cited by: §5.1.
  • [2] R. Ausavarungnirun, V. Miller, J. Landgraf, S. Ghose, J. Gandhi, A. Jog, C. J. Rossbach, and O. Mutlu (2018) Mask: Redesigning the GPU memory hierarchy to support multi-application concurrency. In ACM SIGPLAN Notices, Vol. 53, pp. 503–518. Cited by: §1.
  • [3] (2018) AWS lambda. Note: Cited by: §8.
  • [4] M. G. Bechtel, E. McEllhiney, M. Kim, and H. Yun (2018) Deeppicar: a low-cost deep neural network-based autonomous car. In 2018 IEEE 24th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), pp. 11–21. Cited by: §1.
  • [5] S. Bhattacharya and N. D. Lane (2016) Sparsification and separation of deep learning layers for constrained resource inference on wearables. In Proceedings of the 14th ACM Conference on Embedded Network Sensor Systems CD-ROM, pp. 176–189. Cited by: §8.
  • [6] M. Cardaci, V. D. Gesù, M. Petrou, and M. E. Tabacchi (2009) A fuzzy approach to the evaluation of image complexity. Fuzzy Sets and Systems 160 (10), pp. 1474 – 1484. Note: Special Issue: Fuzzy Sets in Interdisciplinary Perception and Intelligence External Links: ISSN 0165-0114, Document, Link Cited by: §3.1, §4.2.
  • [7] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen (2015) Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pp. 2285–2294. Cited by: §1, §8.
  • [8] Y. Chen, T. Krishna, J. S. Emer, and V. Sze (2017) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52 (1), pp. 127–138. Cited by: §8.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255. Cited by: §5.2.
  • [10] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277. Cited by: §8.
  • [11] T. Elgamal, A. Sandur, P. Nguyen, K. Nahrstedt, and G. Agha (2018) DROPLET: distributed operator placement for iot applications spanning edge and cloud resources. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pp. 1–8. Cited by: §1.
  • [12] S. Fouladi, R. S. Wahby, B. Shacklett, K. Balasubramaniam, W. Zeng, R. Bhalerao, A. Sivaraman, G. Porter, and K. Winstein (2017) Encoding, fast and slow: low-latency video processing using thousands of tiny threads.. In NSDI, pp. 363–376. Cited by: §8.
  • [13] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis (2017) Tetris: scalable and efficient neural network acceleration with 3d memory. ACM SIGOPS Operating Systems Review 51 (2), pp. 751–764. Cited by: §8.
  • [14] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746. Cited by: §8.
  • [15] S. Han, H. Shen, M. Philipose, S. Agarwal, A. Wolman, and A. Krishnamurthy (2016) Mcdnn: an approximation-based execution framework for deep stream processing under resource constraints. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, pp. 123–136. Cited by: §1, §1, §1, §2.1, §3.1, §5.3.
  • [16] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally (2016) EIE: efficient inference engine on compressed deep neural network. In Computer Architecture (ISCA), 2016 ACM/IEEE 43rd Annual International Symposium on, pp. 243–254. Cited by: §8.
  • [17] S. Han, J. Pool, J. Tran, and W. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143. Cited by: §8.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In european conference on computer vision, pp. 346–361. Cited by: §1, §2.1, §4.1.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2.1, §4.1, §5.3.
  • [20] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §1, §5.3, §8.
  • [21] K. Hsieh, G. Ananthanarayanan, P. Bodik, P. Bahl, M. Philipose, P. B. Gibbons, and O. Mutlu (2018) Focus: querying large video datasets with low latency and low cost. arXiv preprint arXiv:1801.03493. Cited by: §1, §1, §8.
  • [22] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844. Cited by: §1, §2.1, §3.2, §8.
  • [23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks.. In CVPR, Vol. 1, pp. 3. Cited by: §1, §4.1.
  • [24] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115. Cited by: §8.
  • [25] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2017) Quantized neural networks: training neural networks with low precision weights and activations.. Journal of Machine Learning Research 18, pp. 187–1. Cited by: §8.
  • [26] C. Hung, G. Ananthanarayanan, P. Bodik, L. Golubchik, M. Yu, P. Bahl, and M. Philipose (2018) Videoedge: processing camera streams using hierarchical clusters. In 2018 IEEE/ACM Symposium on Edge Computing (SEC), pp. 115–131. Cited by: §1.
  • [27] L. N. Huynh, Y. Lee, and R. K. Balan (2017) Deepmon: mobile gpu-based deep learning framework for continuous vision applications. In Proceedings of the 15th Annual International Conference on Mobile Systems, Applications, and Services, pp. 82–95. Cited by: §1, §1, §2.1.
  • [28] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) Squeezenet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. ICLR. Cited by: §8.
  • [29] (2018) Imagenet large scale visual recognition challenge 2015 (ilsvrc2015). Cited by: §5.2.
  • [30] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2017) Quantization and training of neural networks for efficient integer-arithmetic-only inference. arXiv preprint arXiv:1712.05877. Cited by: §1.
  • [31] B. Jähne, H. Haussecker, and P. Geissler (1999) Handbook of computer vision and applications. Vol. 2, Citeseer. Cited by: §4.2.
  • [32] (2018) Jetson tx2 module. Cited by: §5.1.
  • [33] S. Ji, W. Xu, M. Yang, and K. Yu (2013) 3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 221–231. Cited by: §1.
  • [34] A. H. Jiang, D. L. Wong, C. Canel, L. Tang, I. Misra, M. Kaminsky, M. A. Kozuch, P. Pillai, D. G. Andersen, and G. R. Ganger (2018) Mainstream: dynamic stem-sharing for multi-tenant video processing. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). Cited by: §1, §1, §1, §3.1, §8.
  • [35] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica (2018) Chameleon: scalable adaptation of video analytics. In Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 253–266. Cited by: §1, §1, §8.
  • [36] D. Kang, J. Emmons, F. Abuzaid, P. Bailis, and M. Zaharia (2017) NoScope: optimizing neural network queries over video at scale. Proceedings of the VLDB Endowment 10 (11), pp. 1586–1597. Cited by: §8.
  • [37] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, et al. (2017) T-cnn: tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §5.2.
  • [38] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang (2017) Neurosurgeon: collaborative intelligence between the cloud and mobile edge. In ACM SIGARCH Computer Architecture News, Vol. 45, pp. 615–629. Cited by: §1.
  • [39] O. Kayiran, N. C. Nachiappan, A. Jog, R. Ausavarungnirun, M. T. Kandemir, G. H. Loh, O. Mutlu, and C. R. Das (2014) Managing gpu concurrency in heterogeneous architectures. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 114–126. Cited by: §1.
  • [40] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2015) Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530. Cited by: §8.
  • [41] N. D. Lane, S. Bhattacharya, P. Georgiev, C. Forlivesi, L. Jiao, L. Qendro, and F. Kawsar (2016) Deepx: a software accelerator for low-power deep learning inference on mobile devices. In Proceedings of the 15th International Conference on Information Processing in Sensor Networks, pp. 23. Cited by: §8.
  • [42] M. A. Laurenzano, P. Hill, M. Samadi, S. Mahlke, J. Mars, and L. Tang (2016) Input responsiveness: using canary inputs to dynamically steer approximation. ACM SIGPLAN Notices 51 (6), pp. 161–176. Cited by: §2.2.
  • [43] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng (2011) On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, pp. 265–272. Cited by: §7.
  • [44] J. Liu, A. Shahroudy, D. Xu, and G. Wang (2016) Spatio-temporal lstm with trust gates for 3d human action recognition. In European Conference on Computer Vision, pp. 816–833. Cited by: §1.
  • [45] L. Liu, H. Li, and M. Gruteser (2019) Edge assisted real-time object detection for mobile augmented reality. Cited by: §1, §1.
  • [46] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
  • [47] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis (2015) Heracles: improving resource efficiency at scale. In International Symposium on Computer Architecture (ISCA), Vol. 43, pp. 450–462. Cited by: §4.3.
  • [48] I. Mario, M. Chacon, D. Alma, and S. Corral (2005-06) Image complexity measure: a human criterion free approach. In NAFIPS 2005 - 2005 Annual Meeting of the North American Fuzzy Information Processing Society, pp. 241–246. Cited by: §3.1, §4.2.
  • [49] J. Mars, L. Tang, R. Hundt, K. Skadron, and M. L. Soffa (2011) Bubble-up: increasing utilization in modern warehouse scale computers via sensible co-locations. In Proceedings of the 44th annual IEEE/ACM International Symposium on Microarchitecture, pp. 248–259. Cited by: §5.7.
  • [50] P. Panda, A. Sengupta, and K. Roy (2016) Conditional deep learning for energy-efficient and enhanced pattern recognition. In 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 475–480. Cited by: §8.
  • [51] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally (2017) Scnn: an accelerator for compressed-sparse convolutional neural networks. In ACM SIGARCH Computer Architecture News, Vol. 45, pp. 27–40. Cited by: §8.
  • [52] E. Park, D. Kim, S. Kim, Y. Kim, G. Kim, S. Yoon, and S. Yoo (2015) Big/little deep neural network for ultra low power inference. In Proceedings of the 10th International Conference on Hardware/Software Codesign and System Synthesis, pp. 124–132. Cited by: §8.
  • [53] O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. (2015) Deep face recognition. In BMVC, Vol. 1, pp. 6. Cited by: §1.
  • [54] R. Poppe (2010) A survey on vision-based human action recognition. Image and vision computing 28 (6), pp. 976–990. Cited by: §1.
  • [55] (2018) Qualcomm adreno gpus. Cited by: §5.1.
  • [56] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §8.
  • [57] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee, S. K. Lee, J. M. Hernández-Lobato, G. Wei, and D. Brooks (2016) Minerva: enabling low-power, highly-accurate deep neural network accelerators. In ACM SIGARCH Computer Architecture News, Vol. 44, pp. 267–278. Cited by: §8.
  • [58] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • [59] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
  • [60] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1.
  • [61] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pp. 568–576. Cited by: §1.
  • [62] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §1, §2.1, §4.1.
  • [63] (2016) Sport cars drag race video. Cited by: §6.
  • [64] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1, §1.
  • [65] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf (2014) Deepface: closing the gap to human-level performance in face verification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1701–1708. Cited by: §1.
  • [66] S. Teerapittayanon, B. McDanel, and H. Kung (2016) Branchynet: fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469. Cited by: §8.
  • [67] The cifar-10 dataset. Cited by: §8.
  • [68] The mnist database of handwritten digits. Cited by: §8.
  • [69] (2017) Unified memory for cuda beginners. Cited by: §5.1.
  • [70] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082. Cited by: §8.
  • [71] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In European Conference on Computer Vision, pp. 499–515. Cited by: §1.
  • [72] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris (2018) Blockdrop: dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8817–8826. Cited by: §8.
  • [73] R. Xu, J. Koo, R. Kumar, P. Bai, S. Mitra, S. Misailovic, and S. Bagchi (2018) Videochef: efficient approximation for streaming video processing pipelines. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pp. 43–56. Cited by: §2.2, §8.
  • [74] R. Xu, S. Mitra, J. Rahman, P. Bai, B. Zhou, G. Bronevetsky, and S. Bagchi (2018) Pythia: improving datacenter utilization via precise contention prediction for multiple co-located workloads. In Proceedings of the 19th International Middleware Conference, pp. 146–160. Cited by: §4.3, §5.7.
  • [75] H. Yu and S. Winkler (2013-07) Image complexity and spatial information. In 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX), pp. 12–17. Cited by: §3.1, §4.2.
  • [76] H. Zhang, G. Ananthanarayanan, P. Bodik, M. Philipose, P. Bahl, and M. J. Freedman (2017) Live video analytics at scale with approximation and delay-tolerance. In NSDI, Vol. 9, pp. 1. Cited by: §1, §1, §8.
  • [77] S. Zhang, Z. Du, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen (2016) Cambricon-x: an accelerator for sparse neural networks. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 20. Cited by: §8.
  • [78] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §8.