Computer vision (CV) technology has become a key ingredient for automated data analysis over a broad range of real-world applications: smart cameras for video surveillance, robotics, industrial quality assurance, medical diagnostics, and advanced driver assistance systems have recently become popular due to the rising accuracy and robustness of CV algorithms. This industry interest has fostered a wealth of research projects, yielding fierce competition on many benchmark datasets such as ImageNet/ILSVRC [Russakovsky2014], MS COCO [Lin2014], and Cityscapes [Cordts2016], on which scientists from academia and big industry players evaluate their latest algorithms.
In recent years, the most competitive approaches to address many CV challenges have relied on machine learning with complex, multi-layered, trained feature extractors commonly referred to as deep learning [Goodfellow-et-al-2016]. The most frequently used flavor of deep learning techniques for CV are convolutional neural networks (ConvNets, CNNs). Since their landslide success at the 2012 ILSVRC competition over hand-crafted features, their accuracy has further improved year-over-year, even exceeding human performance on this complex dataset [HePReLU2015, Russakovsky2014]. CNNs keep on expanding to more areas of computer vision and data analytics in general [Abu-El-Haija2016, Kaluarachchi2015, Zhang2016a, Park2016] and are moving towards analyzing video data for action recognition [Wang2015], tracking [Chen2017a], and improved object detection [Kang2017, Jie2018].
Unfortunately, the high accuracy of CNNs comes with a high computational cost, requiring powerful GPU servers to train these networks for several weeks using hundreds of gigabytes of labeled data. While this is a massive effort, it is a one-time endeavor and can be done offline for many applications. However, the inference of state-of-the-art CNNs also requires several billions of multiplications and additions to classify even images that are low-resolution by today’s standards [Cavigelli2015]. While in some cases offloading to centralized compute centers with powerful GPU servers is also possible for inference after deployment, it is extremely costly in terms of compute infrastructure and energy. Furthermore, collecting large amounts of data at a central site raises privacy concerns, and the required high-bandwidth communication channel causes additional reliability problems and potentially prohibitive costs during deployment and operation [Ananthanarayanan2017]. For many applications, the introduced latency is prohibitive.
The alternative, on-site near sensor embedded processing, largely solves the aforementioned issues by transmitting only the less sensitive, condensed information—potentially only security alerts in case of a smart surveillance camera—but imposes restrictions on available computation resources and power. These push the evaluation of such networks for real-time semantic segmentation or object detection out of reach of even the most powerful embedded platforms available today for high-resolution video data [Cavigelli2015]. However, exactly such systems are required for a wide range of applications limited in cost (CCTV/urban surveillance, perimeter surveillance, consumer behavior and highway monitoring) and latency (aerospace and UAV monitoring and defense, visual authentication) [Ananthanarayanan2017, Nguyen-Meidine2017].
Large efforts have thus already been taken to develop optimized software implementations for heterogeneous platforms [Vasilache2018, Chetlur2014, Cavigelli2015, Lavin2015, Lavin2015a], to design specialized hardware architectures [Cavigelli2016, Andri2016, Cavigelli2015a, Chen2016, Park2016, Farabet2011], and to adapt the networks to avoid expensive arithmetic operations by reducing arithmetic precision, exploiting sparsity, and developing more compact DNNs [Rastegari2016, Zhang2016a]. However, they either 1) do not provide a strong enough performance boost, 2) are already at the theoretical limit of what can be achieved on a given platform, 3) are inflexible and not commercially available, or 4) incur a considerable accuracy loss. It is thus essential to extend the available options to efficiently perform inference on CNNs.
In this paper, we propose a novel method performing change-based inference (hence named CBinfer) for convolutional neural networks on video data from a static camera with limited frame-to-frame changes. We extend our preliminary work in [Cavigelli2017]:
Enhancements to the algorithm for improved compute time and ensuring a consistent input-output relation for each convolution layer.
In-depth analysis of how changes propagate through the CBinfer DNN.
Analysis of accuracy, compute time and energy efficiency for long frame sequences.
Additional evaluations for pose and object detection applications with much deeper networks and datasets without annotations.
Optimizations and evaluations on the more recent Nvidia Tegra X2 platform. (Footnote: The Nvidia Tegra X2 is a system-on-chip available on an embedded board with an affordable power budget ( W) for a stationary smart camera.)
Discussion and evaluation of the processing steps and how the chosen configuration provides the highest performance gain.
Overall, the proposed method provides an average speed-up by a factor of 9.1–7.0 over an implementation relying on cuDNN while introducing only negligible accuracy loss. It thus significantly outperforms previous approaches exploiting frame-to-frame locality, all of which have measured performance gains in the range of a few tens of percent while introducing accuracy losses of several percent (cf. Section II-C). Our method can be combined with most single-frame optimizations such as exploiting weight sparsity or the development of more compact DNN models. The code is available online at https://github.com/lukasc-ch/CBinfer.
II Related Work
In this section, we first describe existing optimized implementations for CNN inference and existing approximations trading accuracy for throughput. We then specifically survey related approaches exploiting the limited changes in video data to reduce the computational effort required to perform CNN inference. Finally, we discuss available datasets and CNNs with which we can evaluate our proposed algorithm.
Most per-frame optimization techniques can be combined with the method we propose herein. Existing approaches targeting video data have very limited gains and have not been specifically optimized for static camera frame sequences.
II-A Optimized Embedded System Implementations
The latest wave of interest in neural networks can be attributed to their sudden success driven by the availability of large datasets and the increasingly powerful computing platforms. One of the most economical and practicable solutions for training medium-sized CNNs is to use a workstation with GPUs. The available software frameworks to implement and train CNNs provide strong support for this kind of platform.
The massive amount of compute time spent training CNNs has spurred the development of highly optimized GPU implementations. Initially, most widely used frameworks relied on their own custom implementations, which have all converged to methods relying on matrix multiplications, leveraging the availability of highly optimized code in BLAS libraries and the fact that GPUs are capable of achieving a throughput within a few percent of their peak performance with this type of workload. Specialized libraries such as Nvidia’s cuDNN and Nervana Systems’ Neon provide some additional performance gains through assembly-level implementations [Lavin2015] and additional algorithmic improvements such as Winograd and FFT-based convolution [Lavin2015a]. A specific implementation for non-batched inference on an embedded platform building on a matrix multiplication is documented in [Cavigelli2015], also showing that more than 90% of the time is spent computing convolutions.
II-B Approximations Trading Accuracy for Throughput
DNNs commonly require a high computation effort in the order of 20 GOp/frame for classification of a pixel image (1 multiply-add is counted as 2 operations) [Canziani2017]. Extracting features when working with high-resolution images (e.g. for object detection or semantic segmentation) scales up the effort proportionally to the number of pixels, quickly reaching a few 100 GOp/frame.
Admitting limited accuracy losses in order to gain a higher throughput by approximating existing networks, inference algorithms, and arithmetic operations can help overcome the computational obstacles preventing widespread adoption of CNN-based algorithms on embedded and mobile platforms. Several such approaches are surveyed and compared in [Sandler2018, Iandola2017]. In this section, we will provide an overview of different options that can be exploited.
One such option is the reduction of the arithmetic precision required to evaluate NNs. Various methods exist, from normal fixed-point analysis to retraining networks to adapt to quantized weights and activations. On some off-the-shelf software-programmable platforms, 16-bit or 8-bit arithmetic operations can be vectorized to obtain a performance boost [Gysel2016a]. Extreme methods go as far as to enforce binary weights [AojunZhou2016, Andri2018], and in some cases also binary activations [Rastegari2016]. This means that multiplications can be dropped entirely, and in case of binary activations some of the add/subtract operations even collapse into XNOR and bit count operations. Many networks can be quantized to 8 bit without an increase in error rate; only below that is there a trade-off between precision and accuracy [Cavigelli2016, Hashemi2016]. Some methods try reducing the computational effort by pruning many very small weights to zero, making it possible to skip some operations [Li2016], or even dynamically skip operations when the activations are zero [Aimar2017]. More sophisticated quantization schemes such as vector quantization exist and can further compress a trained CNN model, but they require specialized hardware to bring an improvement in energy efficiency [Han2016a, Aimar2017].
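The collapse of multiply-add operations into XNOR and bit count operations can be illustrated with a short sketch. It assumes that weights and activations take values in {-1, +1} and are packed into the bits of an integer (bit set for +1); the helper names `pack_signs` and `binary_dot` are hypothetical and not taken from any of the cited works.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an integer, bit i set when values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, w_bits, n):
    """Dot product of two length-n {-1, +1} vectors given in packed form."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ w_bits) & mask   # XNOR: 1 where the signs match
    matches = bin(agree).count("1")     # bit count (popcount)
    return 2 * matches - n              # matches minus mismatches


a = [1, -1, 1, 1]
w = [1, 1, -1, 1]
print(binary_dot(pack_signs(a), pack_signs(w), len(a)))
```

The result equals the ordinary dot product of the two sign vectors, since every matching sign contributes +1 and every mismatch contributes -1.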
Further research has focused on optimizing semantic segmentation and object detection algorithms to better reuse already computed features by eliminating any non-convolutional elements from the network [Redmon2016, Long2015] or introducing structured sparsity [Zhang2017]. Simplifying the operations in a network, for example through low-rank approximations of 2D convolutions or by simply designing smaller networks with state-of-the-art methods, has been evaluated in [Iandola2016, Canziani2017].
The method we propose in this paper does not supersede these methods, but can be combined with the aforementioned approximation methods to further improve throughput.
II-C Video-based Computation Reduction
Obtaining per-frame features naturally seems like an easier task when these frames belong to a video sequence rather than a random collection of images. Limited movement of objects in a frame can be exploited in object tracking by working with a limited search window within the frame [Held2016], not only reducing the problem size, but also simplifying the regression task—up until the tracked target is occluded by a large object.
Clockwork CNNs [Shelhamer2016] specifically target CNNs for semantic segmentation with a structure similar to [Long2015]. They have extended this work on fully convolutional networks, which presents a CNN with skip connections and deconvolution layers to refine the lower-resolution feature maps obtained deep within the network using the features extracted early in the network. They exploit that lower-resolution feature maps within the network are more stable over time than the full-resolution input. They thus propose to reevaluate the first few layers and the last layers affected through the skip connections more frequently than the coarser-grained feature maps. This is a strong limitation on the set of CNNs this method can be applied to. They present evaluations based on a static as well as a dynamic, content-adaptive reevaluation schedule, showing that they can reduce the number of full-frame convolutions by about 40% before the accuracy starts to drop on the Youtube-Objects dataset. However, this approach is limited to updating entire frames, whereas we exploit that often only small parts of the scene change and need to be reevaluated, which leads to larger savings.
CNNCache [Xu2017] describes a general approach pursuing a similar direction of work. They describe their method as a caching mechanism, where blocks of the image are matched to blocks in the previous frame, thereby fetching the results of similar blocks from the cache instead of recomputing them. Similarly to our work, this requires the selection of a threshold, and on top of that a block size and a cache depth in the form of an expiration time. The block matching makes it possible to handle video data where the camera is not fully static, but it does not allow perspective changes. They have shown that their method achieves an average speed-up in the order of 20% at a top-1 accuracy loss of 3.5% when performing image classification, relative to the ncnn framework’s default implementation. The capability to recall convolution results even when the specific image tile has moved introduces a significant overhead for comparing image tiles, thereby limiting the potential speed-up significantly. Further, this method requires a relatively high tolerance when comparing image tiles to be able to find matches, thereby introducing significant accuracy losses.
DeepMon [Huynh2017] proposes another method combining convolution layer decomposition, half-precision computation, and convolutional layer caching. Similarly to CNNCache, they divide the input to each convolutional layer into blocks and reuse the result when a block matches the one in the previous frame. To reduce overhead, they do not directly compare the blocks, but instead extract histogram-based features. They apply their technique only to the first few layers, because in later layers the caching overhead exceeds the compute latency savings. They show a speed-up attributable to caching of 18% for object detection and 36% for image classification at an accuracy loss in the order of 3.8% to 6.2%. While their histogram-based comparison method for the image tiles reduces overhead, the overhead remains significant and the introduced accuracy loss increases further.
Sigma-Delta Quantized Networks [Connor2017] is the most similar method to ours. They combine quantizing the network with decomposing the input to each convolution layer into the difference of the current frame’s values to the previous frame’s values, and accumulate the result over time. They show a reduction in the total number of operations, part of which can be attributed to the temporal-difference aspect of their method, at the cost of an accuracy drop. However, whether this reduction in the number of multiply-add operations can be translated into performance gains after all the introduced overhead remains an open question.
II-D Suitable Datasets and Neural Networks
[Table: Type | Outp. Res. | Feat. Maps | CT [ms] | rel. CT]
We show the applicability of the concept to various applications, namely by evaluating the proposed method for semantic segmentation and pose detection. These are both often applied to high-resolution images and video streams with high frame rates above 10 frame/s for meaningful applications.
We are specifically interested in video sequences obtained from a static camera. While some such datasets exist (e.g. person or vehicle detection or re-identification), most of them are limited to extremely few (1-3) classes and rarely target semantic segmentation. However, for our first application scenario, semantic segmentation, the dataset used in [Cavigelli2016a] (available online at https://doi.org/10.3929/ethz-b-000276417) provides ground truth labels for 10-class semantic segmentation from an urban street surveillance perspective, and while they work with individual images, several surrounding unlabeled frames and a trained convolutional network are available. An example image labeled with the provided CNN is shown in Figure 1, and a sample sequence of 3 images is visualized in Figure 2.
For the second application, pose detection, several datasets to detect joints and limbs exist in the form of annotated images or moving-camera frame sequences, but none with a static camera. To overcome this and to show the feasibility of applying CBinfer without annotated data, we use unlabeled frame sequences from the CAVIAR dataset (available at http://homepages.inf.ed.ac.uk/rbf/CAVIAR/, collected through the EC Funded CAVIAR project/IST 2001 37540) and take the pretrained network to generate the reference output. The dataset contains scenes recorded using surveillance cameras with wide-angle lenses and captures the interaction of a few people. It has a resolution of pixel and a frame rate of 25 frame/s. A few sample frames are shown in Figure 3.
For object detection—our third application scenario—we use video sequences of traffic surveillance cameras. Object detection is performed using YOLOv3 [Redmon2018] trained on the MS COCO dataset [Lin2014]. Since there is no ground truth available for the sequences, we generate our reference output by applying the original YOLOv3 network to each frame.
The most straightforward pixel-level approach is to detect changing pixels on the input frame based on a threshold on the difference to the previous frame, and then update all the pixels affected by them. This increases the number of pixels to be updated layer after layer due to the convolution operations. Thus, for e.g. a $7\times 7$ convolution, a one-pixel change triggers an update of 49 pixels in the next layer and 169 pixels after another $7\times 7$ convolution. Strided operations (often used with pooling layers) reduce this effect, but do not prevent it. This issue might seem prohibitive for multi-layer CNNs, particularly when considering that individual pixels might keep exceeding the threshold due to noise.
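The growth of the affected region can be checked with a few lines. This is a minimal sketch assuming stride-1 convolutions with square kernels; `affected_area` is a hypothetical helper name used only for illustration.

```python
def affected_area(kernel_sizes):
    """Return the number of pixels touched after each stride-1 convolution
    when a single input pixel changes (square k x k kernels assumed)."""
    side = 1
    areas = []
    for k in kernel_sizes:
        side += k - 1          # each k x k convolution widens the region by k - 1
        areas.append(side * side)
    return areas


print(affected_area([7, 7]))   # [49, 169]
```

The side length of the affected square grows by k - 1 per layer, so the area grows quadratically with network depth unless changes are re-thresholded along the way.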
However, the change is not only spatially local at the input, but also at the output. Furthermore, noise-like changes will likely not have strong impacts on feature maps deeper within the network. We thus propose to perform the change-detection not only at the input, but before each convolution layer—relative to its previous input—and to compute an updated value only for the affected output pixels. This can be done without modifications to the training of the CNN, can be applied to existing pre-trained networks, and is not specific to the CNN on which we evaluate the proposed algorithm.
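We propose to replace all spatial convolution layers (conv layers) with change-based spatial convolution layers (CBconv layers). This means adapting the widely used, simple and well-performing matrix-generation and matrix-multiplication sequence of operations [Jia2013, Cavigelli2015]. The convolution layer computes
$$y_o(j,i) = \sum_{c \in \mathcal{C}_{in}} \sum_{(\Delta j, \Delta i) \in \mathcal{S}_k} k_{o,c}(\Delta j, \Delta i)\, x_c(j - \Delta j, i - \Delta i), \qquad (1)$$
where $o$ indexes the output channels $\mathcal{C}_{out}$ and $c$ indexes the input channels $\mathcal{C}_{in}$. The pixel is identified by the tuple $(j,i)$ and $\mathcal{S}_k$ denotes the support of the filter kernels $k_{o,c}$. This can be computed by performing a matrix multiplication
$$Y = K X, \qquad Y \in \mathbb{R}^{|\mathcal{C}_{out}| \times h_o w_o},\; K \in \mathbb{R}^{|\mathcal{C}_{out}| \times |\mathcal{C}_{in}| h_k w_k},\; X \in \mathbb{R}^{|\mathcal{C}_{in}| h_k w_k \times h_o w_o}, \qquad (2)$$
for filters of size $h_k \times w_k$ and an output of size $h_o \times w_o$.
The image matrix $X$ is constructed as $X(r, p) = x_c(j - \Delta j, i - \Delta i)$ with $r = (c\, h_k + \Delta j + \lfloor h_k/2 \rfloor)\, w_k + \Delta i + \lfloor w_k/2 \rfloor$, $p = j\, w_o + i$, and $j = 0, \dots, h_o - 1$, $i = 0, \dots, w_o - 1$. The filter matrix $K$ is given by $K(o, r) = k_{o,c}(\Delta j, \Delta i)$ for $o \in \mathcal{C}_{out}$, $c \in \mathcal{C}_{in}$, and $(\Delta j, \Delta i) \in \mathcal{S}_k$. The result matrix $Y$ is stored as $Y(o, p) = y_o(j, i)$. Zero-padding can be applied during the construction of the image matrix $X$, and an efficient strided convolution can be computed by dropping the unused columns.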
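The matrix-generation and matrix-multiplication formulation can be sketched in plain Python. This is an illustration under several assumptions that are not taken from the paper's code: stride 1, odd filter sizes with a centered support, zero-padding so that the output has the input's size, no bias term, and nested lists `x[c][j][i]` and `k[o][c][dj][di]` as data layout. The function names are hypothetical.

```python
def image_matrix(x, kh, kw):
    """Build the image matrix for x[c][j][i]: one column per output pixel,
    one row per (channel, filter offset) pair, zero-padded at the borders."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    rj, ri = kh // 2, kw // 2
    columns = []
    for j in range(H):
        for i in range(W):
            col = []
            for c in range(C):
                for dj in range(-rj, rj + 1):
                    for di in range(-ri, ri + 1):
                        jj, ii = j - dj, i - di
                        inside = 0 <= jj < H and 0 <= ii < W
                        col.append(x[c][jj][ii] if inside else 0.0)
            columns.append(col)
    return columns


def filter_matrix(k):
    """Flatten k[o][c][dj][di] into one row per output channel."""
    return [[v for per_c in per_o for row in per_c for v in row] for per_o in k]


def conv_as_matmul(x, k):
    """Stride-1, same-size convolution computed as filter matrix times image matrix."""
    H, W = len(x[0]), len(x[0][0])
    kh, kw = len(k[0][0]), len(k[0][0][0])
    X = image_matrix(x, kh, kw)
    K = filter_matrix(k)
    y = []
    for k_row in K:
        flat = [sum(kv * xv for kv, xv in zip(k_row, col)) for col in X]
        y.append([flat[j * W:(j + 1) * W] for j in range(H)])
    return y
```

Each column of the image matrix depends only on the neighborhood of one output pixel, which is what makes it possible to later assemble only the columns of changed pixels.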
We replace this matrix multiplication by the following sequence of processing steps, thereby drastically reducing the size of the matrix used in the main computation step.
III-A Processing Steps
We modify the standard approach and use a sequence of processing steps (cf. Figure 5, top/feed-forward): change detection, change indexes extraction, matrix generation, matrix multiplication, and output update. In the following, we will explain the individual steps.
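Change Detection
In this step, changed pixels are detected. We define a changed pixel as one where the absolute difference of the current input $x_c^{(t)}$ to the previous input $x_c^{(t-1)}$ of any feature map/channel $c$ exceeds some threshold $\tau$, i.e.
$$m(j,i) = \bigvee_{c \in \mathcal{C}_{in}} \left[\, \left| x_c^{(t)}(j,i) - x_c^{(t-1)}(j,i) \right| > \tau \,\right]. \qquad (3)$$
The computation effort of this step is crucial, since it is executed independently of whether any pixel changed. Each of these changes affects a region equal to the filter size, and these output pixels are marked for updating:
$$\tilde{m}(j,i) = \bigvee_{(\Delta j, \Delta i) \in \mathcal{S}_k} m(j - \Delta j, i - \Delta i), \qquad (4)$$
where $\mathcal{S}_k$ is the filter kernel support, e.g. $\mathcal{S}_k = \{-3, \dots, 3\}^2$ for a $7\times 7$ filter. All of this is implemented on GPU by clearing the change map to all-zero and having one thread per pixel which, if a change is detected, sets the pixels of the filter support neighborhood in the resulting change map.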
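A sequential sketch of this step is given below. It mirrors the described one-thread-per-pixel scheme with two nested loops; it assumes a stride-1 layer with an odd, centered filter support and an input stored as nested lists `x[c][j][i]`. The name `detect_changes` is hypothetical.

```python
def detect_changes(x_cur, x_prev, tau, kh, kw):
    """Return a boolean change map of the output pixels that need an update."""
    C, H, W = len(x_cur), len(x_cur[0]), len(x_cur[0][0])
    rj, ri = kh // 2, kw // 2
    change_map = [[False] * W for _ in range(H)]      # cleared to all-zero
    for j in range(H):                                # one "thread" per input pixel
        for i in range(W):
            changed = any(abs(x_cur[c][j][i] - x_prev[c][j][i]) > tau
                          for c in range(C))
            if not changed:
                continue
            # Mark every output pixel within the filter support of this input pixel.
            for dj in range(-rj, rj + 1):
                for di in range(-ri, ri + 1):
                    jj, ii = j + dj, i + di
                    if 0 <= jj < H and 0 <= ii < W:
                        change_map[jj][ii] = True
    return change_map
```

Differences at or below the threshold are ignored entirely, so noise-like fluctuations do not mark any output pixel.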
Change Indexes Extraction
In this step, we condense the change map to 1) a list of pixel indexes where changes occurred and 2) a count of the changed pixels. This has been implemented using the copy_if function of Thrust (https://thrust.github.io). The computed index list is needed later to access only the required pixels when assembling the matrix for the convolution.
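The effect of this stream-compaction step can be sketched sequentially. The snippet below is a plain-Python stand-in for the result, not a rendering of the Thrust API; it assumes a row-major flat index `j * width + i`, and `extract_change_indexes` is a hypothetical name.

```python
def extract_change_indexes(change_map):
    """Condense a 2D boolean change map into a list of flat pixel indexes
    and the number of changed pixels."""
    width = len(change_map[0])
    indexes = [j * width + i
               for j, row in enumerate(change_map)
               for i, changed in enumerate(row) if changed]
    return indexes, len(indexes)


change_map = [[False, True, False],
              [False, False, True]]
print(extract_change_indexes(change_map))   # ([1, 5], 2)
```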
Matrix Generation & Matrix Multiplication
Matrix multiplications are used in many applications, and highly optimized implementations such as the GEMM (general matrix multiplication) function provided by the Nvidia cuBLAS library come within a few percent of the peak FLOPS a GPU is capable of providing. Matrix multiplication-based implementations of the convolution layer relying on GEMM are widely available and are highly efficient [Jin2014, Cavigelli2015] as described above. The image matrix in (2) is not generated full-sized, but instead only those columns corresponding to the relevant output pixels are assembled, resulting in a reduced width equal to the number of output pixels affected by the changes in the input image. The columns to be generated are selected using the change indexes (cf. Figure 5) and are constructed following the procedure described in the previous section. This is implemented with independent threads for each pixel and spatial filter position, where each of them copies all the feature map values at that position.
The filter matrix is made up of the filters trained using normal convolution layers and keeps the same dimensions, so the computation effort in this step is proportional to the number of changed pixels and the matrix multiplication is in the worst case only as time-consuming as the full-frame convolution.
Output Update
We use the previously stored results and the newly computed output values along with the change indexes list to provide the updated output feature maps. To maximize throughput, we also include the ReLU activation of the affected pixels in this step, reducing the compute time by 1) not writing the values to memory and immediately reading them again (an independent ReLU layer is strongly memory-bandwidth limited), and 2) only applying the ReLU operation to changed pixels.
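The last three steps (reduced matrix generation, matrix multiplication, and output update with fused ReLU) can be sketched together as follows. The sketch assumes stride 1, odd centered filters, zero-padding, no bias, nested-list tensors `x[c][j][i]` and `k[o][c][dj][di]`, and a flat index list as produced by the previous step; `cbconv_update` is a hypothetical name, not the function used in the released code.

```python
def cbconv_update(x, k, y_prev, change_idx):
    """Recompute only the output pixels listed in change_idx (flat j * W + i)
    and merge them, after ReLU, into a copy of the previous output."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    kh, kw = len(k[0][0]), len(k[0][0][0])
    rj, ri = kh // 2, kw // 2
    K = [[v for per_c in per_o for row in per_c for v in row] for per_o in k]

    # Matrix generation: one column per changed output pixel only.
    columns = []
    for p in change_idx:
        j, i = divmod(p, W)
        col = []
        for c in range(C):
            for dj in range(-rj, rj + 1):
                for di in range(-ri, ri + 1):
                    jj, ii = j - dj, i - di
                    inside = 0 <= jj < H and 0 <= ii < W
                    col.append(x[c][jj][ii] if inside else 0.0)
        columns.append(col)

    # Matrix multiplication and output update with fused ReLU.
    y = [[row[:] for row in channel] for channel in y_prev]
    for o, k_row in enumerate(K):
        for p, col in zip(change_idx, columns):
            j, i = divmod(p, W)
            value = sum(kv * xv for kv, xv in zip(k_row, col))
            y[o][j][i] = max(value, 0.0)
    return y
```

The work in both loops scales with `len(change_idx)` rather than with the frame size, which is the source of the savings for mostly static scenes.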
III-B Memory Requirements
The memory requirements of DNN frameworks are known to be very high, up to the point where it becomes a limiting factor for increasing the mini-batch size during learning and thus reducing the throughput when parallelizing across multiple GPUs. These requirements are very different when looking at embedded inference-only systems:
Inference is typically done on single frames. Creating mini-batches would introduce often unacceptable latency while only providing a few percent of additional performance [Cavigelli2015].
During training, the input of each layer has to be stored in order to be able to compute the gradients. This is not required during inference.
Batch normalization layers, dropout layers, etc. (if present) are considered independent layers during training. They can be absorbed into the convolution layer for inference.
To obtain a baseline memory requirement, we compute the required memory of common DNN frameworks performing convolutions using matrix multiplication with a batch size of 1. We assume an optimized network minimizing the number of layers, e.g. by absorbing batch normalization layers into the convolution layers or using in-place activation layers. This way 30M values need to be stored for the intermediate results, 264M values for the image matrix, and 873k values for the parameters. This can further be optimized by sharing the image matrix among all convolution layers and by allocating memory for the output of only two layers and switching back and forth between them, layer by layer. This reduces the memory footprint to 9M, 93M, and 872k values, and a total of 103M values for our baseline.
Applying our algorithm requires a little more memory, because we need to store additional intermediate results (cf. Figure 5) such as the change matrix, the changed indexes list, and the matrix, which can all again be shared between the layers. We also need to store the previous output to use it as a basis for the updated output and to use it as the previous input of the subsequent layer. For our sample network, this required another 60M values, for a total of 163M values (+58%, total size 650 MB). This is an acceptable increase and not a limitation, considering that modern graphics cards typically come with around 12 GB of memory and even GPU-accelerated embedded platforms such as the Nvidia Jetson TX2 module provide 8 GB of memory.
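The stated overhead can be re-derived from the value counts given in the text. The sketch below assumes 4 bytes per stored value (32-bit floats), which is an assumption and not stated in the text; `memory_overhead` is a hypothetical helper.

```python
def memory_overhead(baseline_values, extra_values, bytes_per_value=4):
    """Return (total values, relative increase in percent, total size in MB)."""
    total = baseline_values + extra_values
    increase_pct = round(100 * extra_values / baseline_values)
    size_mb = total * bytes_per_value / 1e6
    return total, increase_pct, size_mb


total, pct, size_mb = memory_overhead(103e6, 60e6)
print(total, pct, size_mb)
```

With these inputs the increase rounds to 58% and the footprint comes out at roughly 650 MB, in line with the figures quoted above.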
III-C Closed-Loop Formulation
In Figure 5a and Section III-A we describe the processing steps for a feed-forward implementation of CBinfer. However, note that this structure allows gradually changing inputs (e.g. two images are morphed over several frames with increments below the change detection threshold) to never trigger any update within the network and thus keep a stale result. In an outdoor surveillance setting, the effects could be even worse: consider a static scenery with a sunset and thus gradually changing brightness without triggering any update operation. Now a moving object passes, leaving a dark trace behind which has been updated under the changed lighting conditions.
To overcome such issues, we propose a closed-loop version of CBinfer as shown in Figure 5, bottom/closed-loop. Rather than storing the previous input, we now have an input state, which is updated only for those pixels which have triggered a change. This can be done directly in the change detection phase. This way, the previous output is consistently the convolution result of the input state and is ensured not to drift far away from the ideal result.
Since the previous input had to be stored before as well, this does not introduce any memory overhead. Moreover, in many cases it can even decrease compute time since only the few values where changes occurred have to be copied over from the input to the input state. For the feed-forward CBinfer the entire input tensor would have to be copied. (Footnote: Note that one such tensor always has to be copied when applying CBinfer. Consider two CBinfer layers after each other. During the update output step of the first CBinfer layer, we copy the newly computed values into the previous output tensor and feed it to the next CBinfer layer as the input. If we did not copy the data from input to previous input here and instead just kept the memory address of the previous frame’s input, it would be at the same location where the first CBinfer layer’s result will be stored when processing the next frame, thus directly modifying the previous input variable and thereby introducing incorrect behavior, i.e. there would never be any changes, since ultimately the input and previous input would point to the same memory location.)
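The difference between the two formulations can be illustrated on a single pixel whose brightness rises slowly. This is a toy sketch with scalar integer inputs and a hypothetical function `track`; it models only the choice of the reference value used for change detection, not a full layer.

```python
def track(frames, tau, closed_loop):
    """Return the input value the cached output is based on after the last frame."""
    reference = frames[0]   # what the change detector compares against
    cached = frames[0]      # input value the stored output was computed from
    for x in frames[1:]:
        changed = abs(x - reference) > tau
        if changed:
            cached = x      # the output is recomputed from the current input
        if closed_loop:
            if changed:
                reference = x   # input state: updated only where a change fired
        else:
            reference = x       # feed-forward: always keep the previous frame
    return cached


ramp = list(range(101))     # brightness rising by 1 per frame, from 0 to 100
print(track(ramp, tau=5, closed_loop=False))   # 0
print(track(ramp, tau=5, closed_loop=True))    # 96
```

In the feed-forward case the per-frame increment never exceeds the threshold and the result stays at its initial value, whereas in the closed-loop case the cached value never deviates from the current input by more than the threshold.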
III-D Fine-Grained Change-based Inference
In the proposed scheme, every output value affected by any change at the input is recomputed. As the convolution operation is linear, updates based on the difference to the previous frame can be computed to reduce the number of multiplications and additions in two ways:
1) Fine-grained across feature maps (FG-FM): Only some of the input feature maps affecting a given output value might have changed. An incremental update of the affected feature maps based on the difference of the input values relative to the previous frame would be sufficient. This results in a 3D-tensor change map and a correspondingly long change indexes list and, crucially, forces us to decompose the large matrix multiplication into several smaller ones, of which the results have to be added individually during the update output step (cf. Figure 5c).
2) Spatially fine-grained (FG-SP): Just because an output pixel is affected by an input pixel does not mean that it has to be completely recomputed. With a $3\times 3$ filter, a single pixel marked as changed would trigger the re-computation of 9 pixels. Here, too, an incremental update based on differences is possible (cf. Figure 5d).
However, there are some drawbacks and limitations:
For both approaches the structure of the core computation effort is less regular and cannot be written as a dense matrix multiplication.
The compute effort of the change indexes extraction scales linearly with the number of values that have to be checked for changes. In case of (1), the effort in this step is thus scaled up by a factor equal to the number of input feature maps.
The potential gains in case of (2) are limited. Changing pixels are typically clustered together and all that is being saved is a small halo on the change map around the changes. This can in most cases be expected to be in the range of a few percent.
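The incremental update that both fine-grained variants rely on follows directly from linearity. The sketch below shows it for a single-channel, one-dimensional, zero-padded convolution, which is a simplification chosen for brevity; it applies to the pre-activation values only, and the names `conv1d` and `delta_update` are hypothetical.

```python
def conv1d(x, k):
    """Zero-padded, stride-1 convolution with an odd-length, centered kernel."""
    r, n = len(k) // 2, len(x)
    return [sum(k[d + r] * x[p - d] for d in range(-r, r + 1) if 0 <= p - d < n)
            for p in range(n)]


def delta_update(y_old, x_old, x_new, k):
    """Update y_old in a copy by adding the contribution of each input difference."""
    r = len(k) // 2
    y = list(y_old)
    for i, (old, new) in enumerate(zip(x_old, x_new)):
        diff = new - old
        if diff == 0:
            continue
        for d in range(-r, r + 1):      # outputs within the support of input i
            p = i + d
            if 0 <= p < len(y):
                y[p] += k[d + r] * diff
    return y


x_old = [1, 2, 3, 4, 5]
x_new = [1, 2, 7, 4, 5]
k = [1, 2, 3]
assert delta_update(conv1d(x_old, k), x_old, x_new, k) == conv1d(x_new, k)
```

In this example, the incremental path needs 3 multiplications for the single changed input, whereas recomputing the 3 affected outputs from scratch needs 9; the irregular access pattern is what makes it hard to express as a dense matrix multiplication.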
III-E Propagating Changes & Pooling
Change detection and change indexes extraction can contribute up to half of the compute time (cf. Section IV-F). In some cases, it is thus worth considering skipping these steps:
1) If the previous layer was a CBconv layer as well, we can skip the change detection step and instead start from the previous layer’s change map and apply change propagation to it (cf. Eq. 4 and Figure 6). This change propagation can be computed much faster, because no iteration across all the feature maps is necessary.
2) In the special case that the current layer has a $1\times 1$ filter size, the changes do not propagate. This implies that the change map is identical to the one of the previous layer, which also allows skipping the change indexes extraction and re-using the change indexes of the previous layer.
Avoiding change detection also implies saving the memory to store the previous input for that layer. Besides the aforementioned advantages, there are some potential drawbacks:
In case of (1), only the change detection step can be avoided and replaced with a change propagation step, and the change indexes have to be extracted again. The changes spread out at every layer where this is done, although the change detection threshold might not have been exceeded everywhere and some of the changes could have been discarded.
For (2), there is no propagation of changes and both change detection and change indexes extraction can be skipped. So, the only drawback is that a few changes might be updated although they would be discarded if the input were checked against the current layer’s threshold.
Besides accelerating convolution layers, the above is also interesting for pooling layers, which can also be implemented using a change-based approach. Since they typically follow a convolution layer, case (2) can be applied and the change-based update introduces no significant overhead but saves compute time, mostly by reducing memory bandwidth as pooling layers are memory-bound operations.
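Change propagation operates on the 2D change map alone, without touching the feature maps. A minimal sketch, assuming a stride-1 layer with an odd, centered filter support and a map that keeps its size; `propagate_change_map` is a hypothetical name.

```python
def propagate_change_map(change_map, kh, kw):
    """Spread every marked pixel of the previous layer's change map over the
    filter support of the current layer."""
    H, W = len(change_map), len(change_map[0])
    rj, ri = kh // 2, kw // 2
    out = [[False] * W for _ in range(H)]
    for j in range(H):
        for i in range(W):
            if not change_map[j][i]:
                continue
            for dj in range(-rj, rj + 1):
                for di in range(-ri, ri + 1):
                    jj, ii = j + dj, i + di
                    if 0 <= jj < H and 0 <= ii < W:
                        out[jj][ii] = True
    return out


prev_map = [[False] * 5 for _ in range(5)]
prev_map[2][2] = True
assert propagate_change_map(prev_map, 1, 1) == prev_map   # 1x1: nothing spreads
```

For a 1x1 filter the returned map equals the input map, which is why the change indexes of the previous layer can be reused unchanged in that case.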
III-F Threshold Selection
The proposed algorithm adds one parameter to each convolution layer, the change detection threshold. It is fixed offline based on sample video sequences which are passed through the trained network. Other than through the selected values for the thresholds, this selection process does not affect the performance of the system. A threshold of zero yields identical results to the non-change-based implementation, which has been used for functional verification.
For our evaluations, we perform an automated threshold selection process. First, all convolution layers are converted to change-based convolutions, and batch normalization and ReLU layers are absorbed into the CBinfer layers wherever possible. We define and choose:
a performance metric such as pixel-wise classification accuracy, intersection-over-union (IoU), or mean average precision (mAP), or possibly the loss function of the network,
a set of frame sequences to evaluate the network, where the last frame is ideally annotated. If no frame sequences with an annotated last frame are available, an obvious alternative is to compare the change-based network model’s output to the output of the original model using an appropriate metric, and
an initial threshold, a factor determining the rate with which we adjust the threshold, and a maximum acceptable increment in quality loss per layer.
We then set all thresholds to zero and start to iteratively step from the first to the last layer of the network. For each layer, we set an initial threshold value and evaluate the model with the aforementioned metric and dataset. We increment the threshold by a fixed factor (e.g. 1.1), re-evaluate, and repeat until the quality loss introduced by the current layer (with respect to a zero threshold) exceeds the maximum acceptable limit and then take the previous threshold value.
In case of a DNN with (re-)convergent paths, we perform the threshold selection on these paths independently while setting the thresholds for the other paths to zero.
The maximum acceptable quality loss can be set equally for all layers of the network. We focus on low accuracy loss configurations, and thus we are trying to select the threshold values such that they are right at the point where implementation losses are starting to occur. Nevertheless, we have observed best results by splitting the overall acceptable loss unevenly, allowing the first layer to introduce most of the loss.
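The selection procedure described above can be summarized in a short Python sketch. It is an illustration under stated assumptions rather than the implementation used for the evaluations: `quality_loss` is a hypothetical callable that evaluates the network on the chosen sequences for a given list of per-layer thresholds and returns the quality loss, the default values are placeholders, and `max_steps` is a safety guard added here.

```python
from typing import Callable, List

def select_thresholds(
    num_layers: int,
    quality_loss: Callable[[List[float]], float],  # hypothetical evaluation hook
    initial_threshold: float = 1e-3,               # placeholder value
    factor: float = 1.1,
    max_loss_increment: float = 0.0004,            # e.g. 0.04% per layer
    max_steps: int = 200,                          # guard against endless sweeps
) -> List[float]:
    thresholds = [0.0] * num_layers  # start from all-zero thresholds
    for layer in range(num_layers):  # step from the first to the last layer
        # Loss with the current layer's threshold still at zero.
        reference = quality_loss(thresholds)
        candidate = initial_threshold
        accepted = 0.0
        for _ in range(max_steps):
            thresholds[layer] = candidate
            if quality_loss(thresholds) - reference > max_loss_increment:
                break  # limit exceeded: keep the previous value
            accepted = candidate
            candidate *= factor
        thresholds[layer] = accepted
    return thresholds
```

An uneven split of the overall acceptable loss, as mentioned above, could be modeled by passing a per-layer list instead of the single `max_loss_increment`; this variant is omitted to keep the sketch short.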
IV Results & Discussion
In this section, we will first present the evaluation environment and analyze the baseline compute time breakdown. We then analyze the threshold selection, the effect on accuracy and achievable throughput. We then perform a more in-depth analysis of the throughput to verify the quality of the GPU implementation and investigate how the changes propagate in the network. We then establish why more fine-grained change detection does not pay off and how implementation loss and performance gains behave on longer sequences.
IV-A Evaluation Environment
We evaluate our method for two application scenarios: semantic segmentation and pose detection. For the first, we perform our evaluations on the urban surveillance dataset described in Section II-D and [Cavigelli2016a] and using the corresponding scene labeling CNN, not using the multispectral imaging data. The dataset provides 51 training images and 6 validation images with pixel with the corresponding ground-truth scene labeling, classifying each pixel into one of the following 8 classes: building, road, tree, sky, tram, car/truck, water, distant background. For the validation set, the labeled images are part of short video sequences with 5 additional frames available before the frame for which the ground truth labeling is available. A trained network on this data is described in [Cavigelli2016a] and its parameters are reused unaltered for our evaluations. The procedure with which we perform our evaluations is visualized in Figure 7.
For the pose detection application, we use frame sequences from the CAVIAR dataset without ground truth annotations and the trained body estimation network of OpenPose [Cao2017] with stages. The frames are re-sampled to pixel as in the original OpenPose implementation to enable a meaningful comparison. The frame sequences are subsampled in time by a factor of 6 to arrive at a frame rate of around 4 frame/s. In this setting, we measure the accuracy loss in terms of mean-squared error (MSE) relative to the output of the non-change-based network. We have found an MSE of on the network’s output to be sufficient for the pose detection to work reliably. With this dataset, we run change-based inference for 9 frames before measuring accuracy and throughput on Frame 10 to avoid any start-up transients. As we will show later in Figure LABEL:fig:timeTrace, these transients are very short and the error does not accumulate over time.
For our experiment on object detection, we use the YOLOv3 network trained on the MS COCO dataset with 80 classes of everyday objects. The input image is rescaled such that its smaller dimension corresponds to 416 pixels. The input sequences for our evaluations are described in Section II-D. As for pose detection, we do not have ground-truth data; instead, we generate our target output using the non-change-based YOLOv3. For measuring the quality of the output feature maps, the MSE over all outputs is not a suitable measure, given that, e.g., the outputs for the classification of the recognized object are scaled differently than the objectness score or the bounding box size. We have experimentally identified the objectness score to be the output most sensitive to potential artifacts of applying CBinfer and thus measure the accuracy loss due to change-based inference using the MSE on the objectness score.
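As an illustration of this measure, the following NumPy sketch computes the MSE restricted to the objectness channels. It is not the evaluation code used here. It assumes a raw detection-head output of shape (anchors × (5 + classes), H, W) with the objectness score at index 4 of each anchor block, as in common YOLOv3 implementations; the layout and the function name `objectness_mse` are assumptions.

```python
import numpy as np

def objectness_mse(cb_output: np.ndarray, ref_output: np.ndarray,
                   num_anchors: int = 3, num_classes: int = 80) -> float:
    """MSE between change-based and frame-by-frame objectness scores.

    Assumed layout per anchor: 4 box values, 1 objectness score, then the
    class scores (an assumption about the head, not taken from the text).
    """
    per_anchor = 5 + num_classes
    channels, h, w = ref_output.shape
    assert cb_output.shape == ref_output.shape
    assert channels == num_anchors * per_anchor
    cb = cb_output.reshape(num_anchors, per_anchor, h, w)[:, 4]
    ref = ref_output.reshape(num_anchors, per_anchor, h, w)[:, 4]
    return float(np.mean((cb - ref) ** 2))
```

Deviations in the box or class channels do not enter this value, which matches the observation that these outputs are less sensitive and scaled differently.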
We have implemented the proposed algorithm in the PyTorch framework using custom CUDA kernels, including functions to aid in converting DNNs to CBinfer (automatic conversion and threshold selection). We have evaluated the performance on a Jetson TX2 board. Our performance baseline is the PyTorch implementation using Nvidia’s cuDNN backend. It includes optimizations such as the Winograd algorithm and FFT-based convolutions mentioned in Section II-A. Our evaluations were conducted using half-precision floating-point numbers, which has no negative impact on accuracy for both DNNs.
IV-B Baseline Throughput and Computation Breakdown
Before we discuss the performance of the proposed algorithm, we analyze the baseline throughput and compute time breakdown of the segmentation DNN in Table II. Clearly, the convolution operations are dominant, taking up 94.5% () of the overall computation time (). This reaffirms the focus on the convolution layers. We will show later that, once the convolution operation has been accelerated significantly, optimizations for activation and pooling become relevant.
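The 94.5% share also bounds the gain achievable by accelerating the convolutions alone. The sketch below applies Amdahl's law, a framing not named in the text, and treats the convolution share as the only accelerated part; the function name is hypothetical.

```python
def max_overall_speedup(accelerated_share: float,
                        component_speedup: float = float("inf")) -> float:
    """Amdahl's law: overall speed-up when only one share of the time shrinks."""
    remaining = 1.0 - accelerated_share
    return 1.0 / (remaining + accelerated_share / component_speedup)

# Upper bound if the convolutions (94.5% of the time) became free.
bound = max_overall_speedup(0.945)           # about 18.2
# Example: convolutions ten times faster, everything else unchanged.
example = max_overall_speedup(0.945, 10.0)   # about 6.7
```

Under these assumptions, the remaining 5.5% of non-convolution time limits the overall speed-up to roughly 18×, which is why activation and pooling start to matter once the convolutions are fast.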
IV-C Threshold Selection
Our algorithm introduces a threshold parameter for each layer, for which we outline the selection process in Section III-F. In Figure 8 we visualize the relation between accuracy and each layer’s change detection threshold. We proceed similarly to our selection process, allowing an accuracy drop of 0.04% per layer for the semantic segmentation network. Starting from all-zero thresholds (), we sweep and select the optimal threshold parameter for each layer iteratively. The main purpose is to align the tipping points of the threshold-accuracy curves, so that no single layer’s threshold limits the overall accuracy.
After the selection of the thresholds, we can scale them jointly to analyze the trade-off against the classification accuracy more concisely, as can be observed in Figure 9 (left). The accuracy curves of the individual test sequences (different traces) clearly show similar behavior, with a plateau up to a distinct point beyond which the error rate increases steeply. We repeated this analysis for the much deeper pose detection network (cf. Figure 10), which shows similar behavior for the MSE with respect to the baseline DNN.
IV-D Throughput Evaluations
The motivation for the proposed algorithm was to increase throughput by focusing only on the frame-to-frame changes. We show the performance gain in Figure 9 (right), with the indicated baseline analyzing the entire frame with the same network using cuDNN. In the extreme case of setting all thresholds to zero, the entire frame is updated, which results in a clear performance loss because of the change detection overhead as well as fewer optimization options, such as less cache-friendly access patterns when generating the matrix. Nevertheless, a few operations are still skipped where the pixels did not change at all.
When increasing the threshold factor, the average throughput increases rapidly to about 20 frame/s, where it starts saturating because the change detection step as well as other non-varying components like the pooling and pixel classification layers are becoming dominant and the number of detected changed pixels does not further decrease. We almost reach this plateau already for a threshold factor of 1, where we have by construction almost no accuracy loss. The average frame rate over the different sequences is near 18 frame/s at this point—an improvement of over the cuDNN baseline of 1.96 frame/s.
One sequence (Figure 9, 9) has—while still being close to faster than the baseline—a significantly lower throughput than the other sequences. While most of them show typical scenarios such as shown in Figure 2, this sequence shows a very busy situation where the entire road is full of vehicles and all of them are moving. The effective number of operations (add or multiply operations) to compute the convolution updates is visualized in Figure 9 (center). For most frame sequences the savings are above while the aforementioned exceptional cases have a significantly higher share with savings of around .
Running the same analysis for the pose detection network yields very similar results. For the cuDNN baseline, we get a frame rate of 0.72 frame/s and CBinfer achieves a rate of 3–8 frame/s for a threshold factor of 1 or a speed-up of to . A noticeable difference is the performance gain for the zero-threshold configuration. Here, the overhead of CBinfer is outweighed by the savings due to many input pixels not changing at all and therefore not triggering an update even for a zero threshold, yielding a performance gain even in a completely lossless configuration.
In Figure 11, we show the evaluation results for object detection using YOLOv3 trained on MS COCO and applied to various video sequences. We have observed that the most critical output of the network is the objectness score and that the classification and bounding box dimensions are much more resilient to the change detection threshold. As we do not have ground truth data available for the video sequences used in the experiment, we measure the loss based on the MSE of the objectness score relative to frame-by-frame inference. Again, a clear reduction by around in the number of operations can be observed, albeit not as large as for the other two application scenarios. We attribute this to the network’s structure, which uses leaky ReLU activations and hence does not naturally eliminate all changes of feature map values that are below zero.
We have repeated the performance measurements for the segmentation application with fp32 precision on a workstation with an Nvidia GTX 1080 Ti GPU to compare them to the Tegra X2 platform, obtaining an almost identical throughput-threshold trade-off and compute time breakdown up to a scaling factor of , as can be expected for a highly parallelizable workload and a more powerful device with a similar architecture (Tegra X2: 437-750 GFLOPS (fp32), 874-1500 GFLOPS (fp16), and 58.4 GB/s DRAM bandwidth; GTX 1080 Ti: 10609 GFLOPS (fp32) and 484 GB/s).
IV-E Accuracy-Throughput Trade-Off
While for some scenarios any drop in accuracy is unacceptable, many applications allow for some trade-off between accuracy and throughput. After all, choosing a specific CNN already implies selecting a network with an associated accuracy and computational cost.
We analyze the trade-off directly in Figure 12. The most extreme case is updating the entire frame every time resulting in the lowest throughput at the same accuracy as full-frame inference. Increasing the threshold factor in steps of 0.25 immediately results in a significant throughput gain and for most sequences the trade-off only starts at frame rates close to saturation above 20 frame/s. The same frame sequence that already deviated from the norm before behaves differently here as well. However, an adaptive selection of the threshold factor with a control loop getting feedback about the number of changed pixels could allow for a guaranteed throughput by reducing the accuracy in such cases and is left to be explored in future work.