Dynamic Network Quantization for Efficient Video Inference

08/23/2021 · by Ximeng Sun, et al.

Deep convolutional networks have recently achieved great success in video recognition, yet their practical realization remains a challenge due to the large amount of computational resources required for robust recognition. Motivated by the effectiveness of quantization for boosting efficiency, in this paper we propose a dynamic network quantization framework that selects the optimal precision for each frame, conditioned on the input, for efficient video recognition. Specifically, given a video clip, we train a very lightweight network in parallel with the recognition network to produce a dynamic policy indicating which numerical precision to use per frame in recognizing videos. We train both networks effectively using standard backpropagation with a loss designed to achieve both the competitive performance and the resource efficiency required for video recognition. Extensive experiments on four challenging and diverse benchmark datasets demonstrate that our proposed approach provides significant savings in computation and memory usage while outperforming existing state-of-the-art methods.

1 Introduction

With the availability of large-scale video datasets [5, 36], deep learning models based on 2D/3D convolutional neural networks (CNNs) [6, 52, 48, 28, 17] have dominated the field of video recognition. However, despite impressive performance on standard benchmarks, efficiency remains a great challenge for many resource-constrained applications due to the heavy computational burden of deep CNN models.

Motivated by the need for efficiency, existing research efforts mainly focus on either designing compact models [41, 49, 11] or sampling salient frames for efficient recognition [61, 58, 34]. While these methods have shown promising results, they all use 32-bit precision for processing all the frames in a given video, limiting their achievable efficiency. Specifically, orthogonal to the network design, the computational cost of a CNN is directly affected by the bit-width of its weights and activations [16, 69, 8], which, surprisingly, is almost overlooked in previous works as another degree of freedom for efficient video inference. To illustrate this, let us consider the video in Figure 1, represented by five uniformly sampled frames. A quick glance at the video clearly shows that only the third frame needs to be processed at 32-bit precision, as it is the most informative frame for recognizing the action "Long Jump", while the rest can be processed at very low precision or even skipped (i.e., precision set to zero) without sacrificing accuracy (Bottom), resulting in large computational savings compared to processing all frames at the same 32-bit precision, as is generally done in mainstream video recognition methods (Top).

Figure 1: A conceptual overview of our approach. Instead of processing all the video frames with the same 32-bit precision, VideoIQ learns to dynamically select optimal quantization precision conditioned on input clips for efficient video recognition. It is computationally very efficient to process more informative frames with high precision and less informative ones with lower precision, without sacrificing accuracy. Best viewed in color.

Inspired by this observation, we introduce Video Instance-aware Quantization (VideoIQ), which for the first time advocates a novel input-dependent dynamic network quantization strategy for efficient video recognition. While dynamic network quantization looks trivial and handy at first glance, we need to address two challenges: (1) how to efficiently determine what quantization precision to use per target instance; and (2) given instance-specific precisions, how to flexibly quantize the weights and activations of a single deep recognition network to various precision levels, without additional storage or computation cost.

To address the aforementioned challenges, we propose a simple end-to-end differentiable approach to learn a decision policy that selects the optimal precision conditioned on the input, while taking both accuracy and efficiency into account in recognizing complex actions. We achieve this by sampling the policy from a discrete distribution parameterized by the output of a lightweight policy network, which decides on-the-fly what precision should be used on a per-frame basis. Since these decision functions are discrete and non-differentiable, we train the policy network using standard back-propagation through Gumbel-Softmax sampling [24], without resorting to complex reinforcement learning as in [61, 9, 64]. Moreover, instead of storing separate precision-specific models, we train a single deep neural network for action recognition using joint training, which enables us to directly adjust the numerical precision by simply truncating the least significant bits, without performance degradation. Our proposed approach provides not only high computational efficiency but also significant savings in memory, a practical requirement of many real-world applications which has been largely ignored by prior works [34, 60, 35, 61].

We conduct extensive experiments on four standard video recognition datasets (ActivityNet-v1.3 [3], FCVID [25], Mini-Sports1M [28] and Mini-Kinetics [5]) to demonstrate the superiority of our proposed approach over state-of-the-art methods. Our results show that VideoIQ yields significant savings in computation and memory (requiring fewer GFLOPs and less memory on average) while achieving better recognition performance than the most competitive SOTA baseline [34]. We also discover that the decision policies learned using our method are transferable to unseen classes and videos across different datasets. Furthermore, qualitative results suggest that our learned policies correlate with the distinct visual patterns in video frames, i.e., our method utilizes 32-bit full precision only for relevant video frames and processes non-informative frames at low precision or skips them for computational efficiency.

2 Related Work

Video Recognition. Much progress has been made in developing a variety of ways to recognize videos, by applying either 2D-CNNs [28, 52, 45, 46] or 3D-CNNs [48, 5, 17]. Despite promising results, there is significant interest in developing more efficient models with reasonable performance [41, 49]. The SlowFast network [12] employs two pathways for recognizing actions by processing a video at both slow and fast frame rates. Many works utilize 2D-CNNs for efficient recognition by modeling temporal causality using different aggregation modules [52, 68, 10, 32]. Expansion of 2D architectures across frame rate, spatial resolution, and network width is proposed in [11]. While these approaches bring reasonable efficiency improvements, all of them process the video frames at the same 32-bit precision, regardless of the information content of each input frame, which varies in most real-world long videos. In contrast, our approach dynamically selects the bit-width per input to strategically allocate computation at test time for efficient recognition.

Figure 2: Illustration of our proposed approach. VideoIQ consists of a very lightweight policy network and a single backbone network for recognition which can be simply quantized to lower precisions by truncating the least significant bits. The policy network decides what quantization precision to use on a per frame basis, in pursuit of a reduced overall computational cost without sacrificing recognition accuracy. We train both networks using back-propagation with a combined loss of standard cross-entropy and efficiency for video recognition. We additionally distill knowledge from a pre-trained full-precision model to guide the training of lower precisions. During inference, each frame is sequentially fed into the policy network to select optimal precision for processing the current frame through the recognition network and then the network averages all the frame-level predictions to obtain the video-level prediction. Best viewed in color.

Dynamic Computation. Dynamic computation to improve efficiency has been studied from multiple perspectives [1, 2, 50, 54, 15, 37, 13, 33]. Representative methods for image classification dynamically adjust network depth [13, 33, 59, 21, 63] or width [66, 7, 20], perform routing [26, 33], or switch resolutions [62]. Similar in spirit, dynamic methods for efficient video recognition adaptively select salient frames/clips [64, 61, 30, 9, 58, 23], utilize audio [14], reduce feature redundancy [38], or select frame resolutions [60, 34]. Recently, AdaFuse [35] proposed adaptive fusion of channels from current and past feature maps on a per-instance basis for recognizing video actions. Our approach is closely related yet orthogonal to these approaches, as it focuses on network quantization to dynamically select the optimal bit-width conditioned on inputs, in pursuit of computational efficiency without sacrificing accuracy. Moreover, unlike existing works, our framework requires neither complex RL policy gradients [61, 58, 64] nor additional modalities such as audio [14, 30] to learn dynamic policies.

Network Quantization. Low-precision networks [16, 69, 8] have attracted intense attention in recent years. Early works such as [16, 31, 69] mainly focus on quantizing weights while using 32-bit activations. Recent approaches quantize both weights and activations, using either uniform quantization, which applies an identical bit-width to all layers [67, 8, 39], or mixed-precision quantization, which uses different bit-widths for different layers or even channels [51, 4, 57]. Binary networks [22, 42] constrain both weights and activations to binary values, which brings great benefits on specialized hardware devices. Designing efficient strategies for training low-precision [71, 29, 70] or any-precision networks [27, 65] that can flexibly adjust the precision during inference is another recent trend in quantization. Despite recent progress, the problem of quantization for video recognition models is rarely explored. Moreover, existing methods perform quantization in a static manner with a fixed computational cost, leaving adaptive quantization conditioned on inputs an open problem.

3 Proposed Method

Given T sampled frames {x_1, ..., x_T} from a video V with action label y and a set of candidate bit-widths (precisions) B (assumed to be sorted in decreasing order), our goal is to seek (1) a policy function G that automatically decides the optimal bit-width for each frame x_t for processing in the recognition network, and (2) a single recognition network F which can be quantized to any of the precisions in B without additional storage or computation cost. With the desired policy network G and recognition network F, our main objective is to improve accuracy while taking resource efficiency into account for video action recognition. Note that given the optimal bit-width b_t for frame x_t, we quantize all the network weights and activations to the same bit-width b_t, which is well supported by existing hardware.

3.1 Preliminaries

We denote the full-precision network weights by w and the activations by x. Given a certain precision with bit-width b and a quantization function Q_b(·), we denote the quantized weights and activations by Q_b(w) and Q_b(x), respectively. In this paper, we use DoReFa [69] for weight quantization and PACT [8] for activation quantization.

Weight Quantization. DoReFa [69] first normalizes the weights into [0, 1] and then rounds them to the nearest of the 2^b uniform quantization levels:

\tilde{w} = \tanh(w) / (2 \max(|\tanh(w)|)) + 1/2,    (1)

Q_b(w) = 2 \lfloor (2^b - 1)\,\tilde{w} \rceil / (2^b - 1) - 1,    (2)

where \lfloor\cdot\rceil is the rounding operation.

Activation Quantization. PACT [8] introduces a learnable clipping value α for the activations in each layer. More specifically, an activation x is first clipped into [0, α] and then rounded to the nearest quantization level:

Q_b(x) = \lfloor (2^b - 1)\,\mathrm{clip}(x, 0, \alpha) / \alpha \rceil \cdot \alpha / (2^b - 1).

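To make the two quantizers concrete, the following is a minimal PyTorch sketch of the standard DoReFa and PACT formulations given above (the straight-through gradient estimator and the treatment of the clipping value alpha as a plain scalar are simplifications for illustration, not the authors' implementation):

```python
import torch

def quantize_uniform(x, bits):
    # Round a tensor lying in [0, 1] to the nearest of 2^bits uniform levels,
    # with a straight-through estimator so gradients pass through the rounding.
    levels = 2 ** bits - 1
    x_q = torch.round(x * levels) / levels
    return x + (x_q - x).detach()

def dorefa_weight_quant(w, bits):
    # Eq. (1)-(2): normalize weights into [0, 1], quantize, then map back to [-1, 1].
    w_tilde = torch.tanh(w) / (2 * torch.max(torch.abs(torch.tanh(w)))) + 0.5
    return 2 * quantize_uniform(w_tilde, bits) - 1

def pact_activation_quant(x, alpha, bits):
    # PACT: clip activations into [0, alpha], then quantize to 2^bits levels.
    x_clipped = torch.clamp(x, min=0.0, max=float(alpha))
    return quantize_uniform(x_clipped / alpha, bits) * alpha
```
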
3.2 Approach Overview

Figure 2 shows an overview of our approach. In general, we learn an instance-specific policy that decides on-the-fly which precision to use (or whether to skip) for processing the current frame, and a video classifier which can be flexibly quantized to the desired precision of the current frame by simply truncating the least significant bits, without any extra computation or memory cost. To this end, VideoIQ consists of a lightweight policy network G and a video recognition network F. The policy network contains a feature extractor and an LSTM module to learn the discrete decision of which precision to use per input frame (see Section 3.3). Moreover, it is often unnecessary and inefficient to process every frame in a video, due to the large redundancy resulting from static scenes or low frame quality. Thus, in addition to the dynamic selection of precisions, we allow frames to be skipped (i.e., precision set to zero) within the same unified framework to improve efficiency in video recognition. To further enable flexible and scalable quantization, we learn the video classifier as an any-precision network and design a simple yet effective optimization scheme to ensure that a single set of network weights can be executed at multiple precisions without additional storage or computation cost (see Section 3.4).

During training, we first learn the any-precision recognition network and then optimize the policy network with Gumbel-Softmax sampling [24] through standard back-propagation. We design the loss to achieve both the competitive performance and the computational efficiency (measured by FLOPs [55]) required for video recognition. We additionally distill knowledge from a pre-trained full-precision model to guide the training of the lower precisions. During inference, each video frame is sequentially fed into the policy network, whose output decides the right precision to use for the given frame; the frame is then processed through the recognition network at the predicted precision to generate a frame-level prediction. Finally, the network averages the predictions of all the frames as the final video-level prediction. It is worth noting that the policy network is designed to be very lightweight (MobileNetV2 [43] in our work) so that its computational overhead is negligible.

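The inference procedure can be summarized by the schematic loop below; the module interfaces (a `policy_net` returning an action index along with its recurrent state, and a `recognition_net.set_precision` switch) are illustrative placeholders rather than the actual implementation:

```python
import torch

@torch.no_grad()
def predict_video(frames, policy_net, recognition_net, actions=(32, 4, 2, 0)):
    # frames: list of (1, 3, H, W) tensors, one per uniformly sampled frame.
    state = None            # recurrent state of the LSTM inside the policy network
    frame_logits = []
    for frame in frames:
        action, state = policy_net(frame, state)     # index into the action space
        bits = actions[action]
        if bits == 0:                                # 0-bit action means skip the frame
            continue
        recognition_net.set_precision(bits)          # truncate weights / switch BN for this bit-width
        frame_logits.append(recognition_net(frame))  # frame-level prediction at the chosen precision
    return torch.stack(frame_logits).mean(dim=0)     # average to get the video-level prediction
```
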
3.3 Learning Dynamic Quantization Policy

VideoIQ learns a frame-wise policy that decides at which precision to process each frame, or whether to skip it entirely, where skipping can be viewed as processing the frame with 0-bit precision. Our entire action space is therefore Ω = B ∪ {0}. We generate the decision for each frame from the policy network sequentially. We compose the policy network with a feature extractor Φ followed by an LSTM module:

h_t, o_t = \mathrm{LSTM}(\Phi(x_t), h_{t-1}),    (3)

where h_t and o_t are the hidden state and output of the LSTM at time step t. We further compute the distribution π_t over our action space Ω from o_t:

\pi_t = \mathrm{Softmax}(\mathrm{FC}(o_t)).    (4)

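A compact PyTorch sketch of such a policy network is shown below; the MobileNetV2 feature extractor and the 512-unit single-layer LSTM follow the implementation details in Section 4.1, while the module structure and layer names are otherwise illustrative:

```python
import torch
import torch.nn as nn
import torchvision

class PolicyNet(nn.Module):
    # Lightweight feature extractor + LSTM that outputs a distribution over the action space.
    def __init__(self, num_actions=4, hidden_size=512):
        super().__init__()
        backbone = torchvision.models.mobilenet_v2(weights=None)
        self.features = backbone.features                  # convolutional trunk (1280-d output)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.lstm = nn.LSTM(input_size=1280, hidden_size=hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_actions)

    def forward(self, frame, state=None):
        # frame: (B, 3, H, W) -> per-frame feature -> one LSTM step -> action log-probabilities
        feat = self.pool(self.features(frame)).flatten(1)                 # (B, 1280), input to Eq. (3)
        out, state = self.lstm(feat.unsqueeze(1), state)                  # (B, 1, hidden_size)
        log_probs = torch.log_softmax(self.fc(out.squeeze(1)), dim=-1)    # Eq. (4), in log space
        return log_probs, state
```
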
However, sampling the action from the discrete distribution π_t is non-differentiable, which makes direct optimization difficult. One way to solve this is to model the optimization problem as a reinforcement learning problem and then derive the optimal parameters of the policy network using policy gradient methods [56]. However, policy gradient is often complex and unwieldy to train, and requires techniques to reduce variance during training as well as carefully selected reward functions. In contrast, we use Gumbel-Softmax sampling [24] to circumvent this non-differentiability and make our framework fully differentiable, as in [60, 47].

Gumbel-Softmax Sampling. The Gumbel Softmax trick [24] substitutes the original non-differentiable sample from a discrete distribution with a differentiable sample from a corresponding Gumbel-Softmax distribution.

Specifically, instead of directly sampling the action a_t from its distribution π_t, we generate it as

a_t = \arg\max_{i \in \Omega} (\log \pi_{t,i} + G_{t,i}),    (5)

where G_{t,i} = -\log(-\log U_{t,i}) is standard Gumbel noise, with U_{t,i} sampled from a uniform distribution \mathrm{Unif}(0, 1). To remove the non-differentiable argmax operation in Eq. 5, the Gumbel-Softmax trick relaxes \mathbf{a}_t (the one-hot encoding of a_t) to a soft vector \hat{\mathbf{a}}_t with the reparameterization trick [24]:

\hat{a}_{t,i} = \frac{\exp((\log \pi_{t,i} + G_{t,i}) / \tau)}{\sum_{j \in \Omega} \exp((\log \pi_{t,j} + G_{t,j}) / \tau)},    (6)

where i ∈ Ω and τ is the temperature of the softmax. Clearly, when τ > 0, the Gumbel-Softmax distribution is smooth, so \hat{\mathbf{a}}_t can be directly optimized by gradient descent, and as τ approaches 0, the soft decision \hat{\mathbf{a}}_t becomes the same as \mathbf{a}_t. Following [15, 47], we set an initial value for τ and gradually anneal it down to 0 during training.

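A minimal sketch of this straight-through sampling step is given below; PyTorch also provides it directly as torch.nn.functional.gumbel_softmax with hard=True, and the temperature schedule is left to the caller:

```python
import torch
import torch.nn.functional as F

def sample_policy(log_probs, tau):
    # Straight-through Gumbel-Softmax: hard one-hot action in the forward pass,
    # soft Gumbel-Softmax gradients (Eq. 6) in the backward pass.
    gumbel = -torch.log(-torch.log(torch.rand_like(log_probs) + 1e-20) + 1e-20)
    y_soft = F.softmax((log_probs + gumbel) / tau, dim=-1)          # relaxed sample, Eq. (6)
    index = y_soft.argmax(dim=-1, keepdim=True)                     # hard decision, Eq. (5)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return y_hard + (y_soft - y_soft.detach())

# Equivalent one-liner with the built-in helper:
#   F.gumbel_softmax(log_probs, tau=tau, hard=True)
```
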
3.4 Any-Precision Video Recognition

Given frame-specific precisions, quantizing the weights and activations of a single network while recognizing videos is a major challenge. A naive strategy is to manually train different models tailored to the different precisions and then route frames to the corresponding models to generate predictions. However, such a strategy requires time-consuming training for each of the models and also increases the memory storage cost, making it inefficient for many real-time applications. To tackle this problem, we adopt any-precision recognition [27, 65], which makes a single model flexible to any numerical precision during inference. Specifically, we first modify the weight quantizer so that the network parameters can be quantized to lower precisions with low computational cost after training. Then, we propose a simple and effective learning scheme for training the any-precision video recognition network.

With the original DoReFa quantization [69] (Eq. 1 and 2), every numerical precision needs to be quantized down from the full-precision value. Thus, the repeated weight quantizations cause redundant computation when the recognition network frequently switches across different precisions. To reduce the computational cost of the switching operation, we quantize the full-precision weights to the largest bit-width in B once, and then truncate the least significant bits to obtain the quantized weights at any lower bit-width. We save the quantized network weights at the largest bit-width after training. Benefiting from this modified quantization, we only need to discard the extra bits to switch to lower precisions during inference. Furthermore, we align the truncated weights with the highest-precision quantized weights to minimize the mean discrepancy caused by the discarded bits.

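The idea can be pictured as storing a single integer code per weight at the largest bit-width and deriving every lower precision from it by dropping bits. A hedged sketch under that assumption (the storage format, the 8-bit maximum and the function names are illustrative, and the mean-alignment step is omitted):

```python
import torch

def store_codes(w, max_bits=8):
    # Quantize full-precision weights once to the largest bit-width and keep the integer codes.
    levels = 2 ** max_bits - 1
    w_tilde = torch.tanh(w) / (2 * torch.max(torch.abs(torch.tanh(w)))) + 0.5   # DoReFa normalization, Eq. (1)
    return torch.round(w_tilde * levels).to(torch.int32)

def weights_at_precision(codes, max_bits, bits):
    # Switch to a lower precision at inference time by truncating the least significant bits.
    truncated = codes >> (max_bits - bits)            # drop (max_bits - bits) LSBs
    w_tilde = truncated.float() / (2 ** bits - 1)     # back to [0, 1]
    return 2 * w_tilde - 1                            # map to [-1, 1] as in Eq. (2)
```
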
Inspired by [66, 27], we jointly train a single network under different bit-widths with shared weights for any-precision video recognition. Specifically, we gather the losses of all precisions on the same input batch and then update the network. To obtain the loss for a precision with bit-width b, we feed the input video and quantize the network weights and activations to b-bit for every frame. To resolve the mismatch in the statistics of activations at different precisions, we use a separate set of Batch Normalization layers and clipping-level parameters for each precision [66]. Moreover, following the success of knowledge distillation [19], we transfer knowledge from a pretrained full-precision recognition network to guide the training of the lower precisions: the full-precision model is expected to give confident predictions and provide valuable knowledge in its soft logits, while the low-precision student gains this knowledge by mimicking the teacher.

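A schematic training step under these choices might look as follows; the `set_precision` switch (which would also select the precision-specific BN layers and clipping values), the frozen `teacher`, the loss weighting, and the decision to distill only the lower precisions are assumed placeholders, not the authors' exact recipe:

```python
import torch
import torch.nn.functional as F

def any_precision_step(model, teacher, frames, labels, bit_widths=(32, 4, 2), kd_weight=1.0):
    # One update: accumulate losses from every precision on the same batch, then back-propagate once
    # so the shared weights receive gradients from all bit-widths.
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(frames), dim=-1)        # frozen full-precision teacher
    total_loss = 0.0
    for bits in bit_widths:
        model.set_precision(bits)                                 # quantize weights/activations, pick BN set
        logits = model(frames)
        loss = F.cross_entropy(logits, labels)                    # Eq. (7)
        if bits != 32:                                            # distill the teacher into lower precisions
            loss = loss + kd_weight * F.kl_div(
                F.log_softmax(logits, dim=-1), teacher_probs, reduction='batchmean')   # Eq. (8)
        total_loss = total_loss + loss
    total_loss.backward()
    return total_loss
```
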
3.5 Losses

For video action recognition, we minimize the standard cross-entropy loss between the predicted label and the ground-truth action:

\mathcal{L}_{acc} = \mathbb{E}_{(V, y)} \big[ -y \log F(V; \mathbf{a}) \big],    (7)

where \mathbf{a} = (a_1, \ldots, a_T) represents the precisions used for the T sampled frames, which can be either predicted by the lightweight policy network (\mathbf{a} = G(V)) or set manually.

To better guide the optimization of the model with lower capacity, i.e., the recognition network at lower precision, we utilize a distillation loss to transfer knowledge from a pretrained full-precision video recognition network (the teacher) by taking the Kullback-Leibler (KL) divergence between the soft logits of our model and those of the teacher network:

\mathcal{L}_{kd} = \sum_{j=1}^{C} \sigma(z^{\mathcal{T}})_j \, \log \frac{\sigma(z^{\mathcal{T}})_j}{\sigma(z)_j},    (8)

where C is the number of video categories, z and z^{\mathcal{T}} are the logits of our model and of the teacher, σ(·) is the softmax function, and (·)_j denotes the j-th element of the vector. Thus, given the input video V, the overall loss to optimize the any-precision video recognition network is defined as

\mathcal{L}_{F} = \sum_{b \in B} \mathcal{L}_{acc}(b) + \sum_{b \in B,\, b < 32} \mathcal{L}_{kd}(b).    (9)

To address computational efficiency, we pre-compute the FLOPs needed for one frame to be processed by the recognition network at each of the candidate precisions in Ω (with a skipped frame costing zero). We directly minimize the FLOPs usage per video under the generated policy, to reduce the computational cost:

\mathcal{L}_{eff} = \mathbb{E}_{V} \Big[ \frac{1}{T} \sum_{t=1}^{T} \mathrm{FLOPs}(a_t) \Big].    (10)

Furthermore, we introduce two additional regularizers to better optimize the policy network. First, we enforce a balanced policy usage over the entire action space to prevent the policy network from converging to sub-optimal solutions in which some actions are totally ignored. More formally, we define the balanced policy usage loss as

\mathcal{L}_{bal} = \sum_{i \in \Omega} \Big( \mathbb{E}_{V} \Big[ \frac{1}{T} \sum_{t=1}^{T} \hat{a}_{t,i} \Big] - \frac{1}{|\Omega|} \Big)^2.    (11)

Second, we minimize the entropy of the learned probability distribution over the action space Ω for each frame. This forces the policy network to avoid randomness during inference by generating a deterministic prediction of the precision to use for each video frame:

\mathcal{L}_{ent} = \mathbb{E}_{V} \Big[ \frac{1}{T} \sum_{t=1}^{T} \mathcal{H}(\pi_t) \Big],    (12)

where \mathcal{H}(\cdot) is the entropy function. Finally, the overall loss to optimize the policy network is defined as

\mathcal{L}_{G} = \mathcal{L}_{acc} + \lambda \mathcal{L}_{eff} + \beta \mathcal{L}_{bal} + \gamma \mathcal{L}_{ent},    (13)

where λ, β, and γ are hyperparameters to balance the loss terms. In summary, we first jointly train the any-precision recognition network F with all precisions in B (using Eq. 9), and then train the policy network G (using Eq. 13) to generate the policy over the action space Ω per input frame.

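The three policy-side terms above can be written compactly as follows; the sketch assumes a soft per-frame action distribution `probs` of shape (batch, frames, actions) and a pre-computed `flops_table` holding the per-frame GFLOPs of each action (zero for skipping), and the exact normalizations are illustrative:

```python
import torch

def policy_losses(probs, flops_table):
    # probs: (B, T, A) soft action distribution per frame; flops_table: (A,) GFLOPs per action.
    num_actions = probs.size(-1)
    # Efficiency (Eq. 10): expected per-frame computation under the current policy.
    loss_eff = (probs * flops_table).sum(dim=-1).mean()
    # Balanced usage (Eq. 11): keep the average frequency of each action close to uniform.
    usage = probs.mean(dim=(0, 1))                                  # (A,)
    loss_bal = ((usage - 1.0 / num_actions) ** 2).sum()
    # Entropy (Eq. 12): push each per-frame distribution towards a deterministic choice.
    loss_ent = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).mean()
    return loss_eff, loss_bal, loss_ent
```
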
4 Experiments

4.1 Experimental Setup

Datasets. We evaluate our approach using four datasets, namely ActivityNet-v1.3 [3], FCVID [25], Mini-Sports1M [28] and Mini-Kinetics [5]. ActivityNet contains 10,024 videos for training and 4,926 videos for validation across 200 categories. FCVID consists of 45,611 videos for training and 45,612 videos for testing across 239 classes. Mini-Sports1M [14] is a subset of the full Sports1M dataset [28], containing 30 videos per class for training and 10 videos per class for testing over 487 classes. Mini-Kinetics [6] is a subset of the full Kinetics400 [5] dataset, containing 121K videos for training and 10K videos for testing.

Implementation Details. We adopt the temporal segment network (TSN) [52] approach to aggregate the predictions over uniformly sampled frames from each video. We use ResNet-18 and ResNet-50 [18] for the recognition network, while MobileNetV2 [43] combined with a single-layer LSTM (with 512 hidden units) serves as the policy network in all our experiments. To save computation, we use lower-resolution images in the policy network. We set the action space to Ω = {32, 4, 2, 0} in all experiments, i.e., the policy network can either choose one of the three precisions or skip the frame for efficient recognition. We first train the any-precision recognition network (initialized from ImageNet-pretrained weights) for 100 epochs to provide a good starting point for policy learning, and then train the policy network for 50 epochs on all datasets. We use separate sets of learning parameters (learning rate, weight decay) for the clipping values of each precision. Following [69, 8], we do not quantize the input, the first layer, or the last layer of the network. More implementation details are included in Appendix B.

Baselines. We compare our approach with the following baselines and existing approaches. First, we consider a 2D-CNN based "Uniform" baseline that uses 32-bit precision to process all the sampled frames and then averages the frame-level results as the video-level prediction. We also compare with two more variants of the uniform baseline that use lower precisions, namely 4-bit and 2-bit, to process the video frames. Second, we compare with an "Ensemble" baseline that gathers all the frame-level predictions by processing the frames at every precision (instead of selecting an optimal precision per frame). This serves as a very strong baseline for classification, at the cost of heavy computation. Finally, we compare our method with existing efficient video recognition approaches, including LiteEval [60] (NeurIPS'19), SCSampler [30] (ICCV'19), AR-Net [34] (ECCV'20), and AdaFuse [35] (ICLR'21). We directly quote the numbers reported in the published papers when possible, or use the authors' source code [60, 35] with the same backbone and experimental settings for a fair comparison.

Metrics. We compute either mAP (mean average precision) or Top-1 accuracy, depending on the dataset, to measure the performance of different methods. We follow [55, 40, 44] and measure computational cost with giga floating-point operations (GFLOPs), which is a hardware-independent metric. Specifically, the cost of a quantized layer scales with the product of its weight and activation bit-widths: under the normalization used here, a layer with b_w-bit weights and b_a-bit activations costs (b_w × b_a)/64 of the full-precision layer's FLOPs (e.g., the 4-bit uniform models in Table 1 cost one quarter of the 32-bit GFLOPs). We also measure memory usage (MB), represented by the storage for the parameters of the network, as in [55].

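As a small illustrative helper (not the authors' code), the scaling can be applied directly:

```python
def quantized_gflops(full_precision_gflops, weight_bits, act_bits):
    # Cost of a layer with b_w-bit weights and b_a-bit activations under the scaling used in the tables.
    return full_precision_gflops * (weight_bits * act_bits) / 64.0

# e.g. the 4-bit uniform ResNet-18 baseline: quantized_gflops(29.1, 4, 4) ≈ 7.3 (cf. Table 1)
```
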
 

Model              ActivityNet        FCVID              Mem. (MB)
                   mAP (%)  GFLOPs    mAP (%)  GFLOPs

ResNet-18
Uniform (32-bit)   69.7     29.1      77.6     29.1      43.1
Uniform (4-bit)    68.0     7.3       76.5     7.3       5.4
Uniform (2-bit)    65.2     1.8       74.3     1.8       2.7
Ensemble           70.7     38.2      78.8     38.2      51.2
VideoIQ            70.9     9.5       79.1     9.4       50.2

ResNet-50
Uniform (32-bit)   72.5     65.8      81.0     65.8      91.4
Uniform (4-bit)    71.7     16.5      79.3     16.5      11.4
Uniform (2-bit)    69.3     4.1       78.5     4.1       5.7
Ensemble           74.7     86.4      83.0     86.4      108.5
VideoIQ            74.8     28.1      82.7     27.0      98.6

Table 1: Video recognition results on ActivityNet and FCVID. Our approach VideoIQ outperforms all the simple baselines.

4.2 Results and Analysis

Comparison with Traditional Uniform Baselines. We first compare VideoIQ using different backbones (ResNet-18 and ResNet-50) against simple 2D-CNN based baselines on both the ActivityNet and FCVID datasets. As shown in Table 1, our approach consistently outperforms the full-precision uniform baseline (32-bit) in both mAP and GFLOPs, with a minimal increase in memory, on both datasets. Using ResNet-18 as the backbone, VideoIQ obtains an mAP of 70.9% and 79.1%, requiring 9.5 and 9.4 GFLOPs on ActivityNet and FCVID respectively. Uniform quantization with low bit-widths leads to a significant reduction in computation and memory, but suffers from a noticeable degradation in recognition performance, e.g., the 2-bit performance is 4.5% and 3.3% lower than the 32-bit counterpart on ActivityNet and FCVID respectively.

Similarly, with ResNet-50, VideoIQ offers substantial savings in GFLOPs (28.1 vs 65.8 on ActivityNet and 27.0 vs 65.8 on FCVID) while outperforming the Uniform (32-bit) baseline by 2.3% and 1.7% in mAP on ActivityNet and FCVID, respectively. We further compare with an 8-bit Uniform baseline that uses the same percentage of random frame skipping as VideoIQ on ActivityNet. With ResNet-50, our approach outperforms this baseline, showing the effectiveness of the learned policy in selecting the optimal quantization precision per frame while recognizing videos.

As shown in Table 1, Ensemble achieves comparable recognition performance, as it is a very strong baseline that gathers all the predictions by processing frames through multiple backbones. However, VideoIQ provides large computational savings over the Ensemble baseline (28.1 vs 86.4 GFLOPs on ActivityNet and 27.0 vs 86.4 GFLOPs on FCVID), along with savings in memory (98.6 vs 108.5 MB), showing the importance of instance-aware dynamic quantization for efficient video recognition. Moreover, we also compare with a Weighted Ensemble baseline, where weights are assigned based on the entropy of the softmax scores to reflect the confidence of different predictions. We observe that it achieves only marginally higher mAP while requiring far more computation than our method on ActivityNet. Note that VideoIQ requires less computation on average on FCVID than on ActivityNet, as FCVID contains more static videos with high redundancy compared to ActivityNet, which consists of action-centric videos with rich temporal information.

 

Model        ActivityNet        FCVID              Mem. (MB)
             mAP (%)  GFLOPs    mAP (%)  GFLOPs

LiteEval     72.7     95.1      80.0     94.3      177.2
SCSampler    72.9     42.0      81.0     42.0      98.6
AR-Net       73.8     33.5      81.3     35.1      223.4
AdaFuse      73.1     61.4      81.6     45.0      151.2
VideoIQ      74.8     28.1      82.7     27.0      98.6

Table 2: Comparison with state-of-the-art methods on ActivityNet and FCVID. VideoIQ achieves the best mAP while offering significant savings in both GFLOPs and memory (MB).

 

Model        Mini-Sports1M      Mini-Kinetics        Mem. (MB)
             mAP (%)  GFLOPs    Top-1 (%)  GFLOPs

LiteEval     44.7     66.2      61.0       99.0      177.2
SCSampler    44.3     42.0      70.8       42.0      98.6
AR-Net       45.0     37.6      71.7       32.0      223.4
AdaFuse      44.1     60.3      72.3       23.0      151.2
VideoIQ      46.4     26.8      72.3       20.4      98.6

Table 3: Comparison with state-of-the-art methods on Mini-Sports1M and Mini-Kinetics. Our approach VideoIQ (w/ ResNet-50) obtains the best performance with great savings in computation (GFLOPs) and memory (MB).
Figure 3: Computational cost (GFLOPS) vs mean Average Precision (%) on ActivityNet dataset. VideoIQ (red points) achieves the best trade-off when compared to existing methods.

Comparison with State-of-the-Art Methods. Tables 2-3 summarize the results and comparisons with existing dynamic inference methods on all four datasets. Our approach is clearly better than all the compared methods in terms of both accuracy and resource efficiency (computation and memory), making it suitable for efficient video recognition. VideoIQ obtains an mAP (Top-1 accuracy for Mini-Kinetics) of 74.8%, 82.7%, 46.4% and 72.3%, while requiring 28.1, 27.0, 26.8 and 20.4 GFLOPs on ActivityNet, FCVID, Mini-Sports1M and Mini-Kinetics, respectively. Note that while most of the compared methods reduce computation at the cost of a significant increase in memory, our approach improves computational efficiency using a model whose memory size is only slightly larger than the 32-bit model.

Among the compared methods, AR-Net is the most competitive in terms of computational efficiency. However, VideoIQ consistently outperforms AR-Net in recognition performance while providing savings in computation on every dataset and a large reduction in memory (98.6 vs 223.4 MB). This is because of our two introduced components working in concert: dynamic quantization for computational efficiency, and the use of a single any-precision recognition network instead of separate models for memory efficiency. Likewise, when compared with the recent method AdaFuse, our approach offers reductions in both computation and storage memory (98.6 vs 151.2 MB) while improving the recognition performance (by up to 2.3% on Mini-Sports1M) across all the datasets. AdaFuse obtains the best performance among the other existing methods on Mini-Kinetics but fails to achieve similar performance on untrimmed video datasets. We suspect that, being a method that relies on the efficient reuse of historical feature maps, it fails to aggregate the information of all time stamps when the video gets very long, as in untrimmed datasets. In summary, VideoIQ establishes a new state-of-the-art for the task of efficient video recognition on four datasets, improving over the previous best results in terms of accuracy, computational efficiency, and memory efficiency.

Figure 3 compares our approach to the existing methods by varying computational budgets on ActivityNet. Our method consistently outperforms all the compared methods and achieves the best trade-off between computational cost and accuracy, which once again shows that VideoIQ is an effective and efficient design for video recognition.

 

Train \ Test    ActivityNet   FCVID   Mini-Sports1M   Mini-Kinetics

 

ActivityNet 74.8 82.7 46.3 71.6
FCVID 74.4 82.8 45.8 72.1
Mini-Sports1M 74.6 82.6 46.4 72.2
Mini-Kinetics 74.7 82.7 46.3 72.3

 

Table 4: Transferring learned policies. Diagonal numbers refer to training and testing the quantization policy on the same dataset while non-diagonal numbers refer to learning the policy on one dataset (rows) and testing on others (columns).

Transferring Learned Policies. We analyze the transferability of our learned policies by performing cross-dataset experiments, i.e., learning the policy on one dataset while testing on another. Specifically, we take the policy network trained on one dataset and utilize it directly for testing, along with a trained any-precision recognition network, on another dataset. Table 4 summarizes the results. As expected, training and testing on the same dataset provides the best performance in all cases (marked in blue). However, the negligible difference among the values within each column clearly shows that policies learned using our method are transferable to unseen classes and videos across different datasets.

Figure 4: Qualitative examples from the ActivityNet dataset. Our approach VideoIQ processes more informative frames with high precision and less informative ones with lower precision, or skips them when irrelevant, for efficient video recognition. Best viewed in color.
Figure 5: Dataset-specific policy distribution.

Qualitative Analysis. To better understand the learned policy, we visualize the selected precision per input frame in Figure 4. Videos are uniformly sampled into 8 frames. Overall, our approach VideoIQ focuses on the right quantization precision to use per frame for correctly classifying videos while taking efficiency into account. VideoIQ processes the most indicative frames at 32-bit precision, while it uses lower precision (or skips) for frames that are irrelevant to the action (e.g., "Playing saxophone" and "Snow Tubing"). Similarly, in the case of "Playing violin" and "Mixing drinks", after becoming confident about the prediction, it interestingly avoids using 32-bit precision even when informative content appears later in the video. More qualitative examples are included in Appendix D.

Figure 5 shows the overall policy distribution on different datasets. Our approach leads to distinctive policy patterns reflecting the different characteristics of the datasets. For example, while only a few frames on ActivityNet use 2-bit precision, a substantial fraction of the frames on the other datasets can be processed at 2-bit precision, leading to different amounts of computational savings across datasets. VideoIQ skips very few frames on Mini-Kinetics, which is because the Mini-Kinetics dataset contains short trimmed videos (6-10 seconds) while the remaining datasets consist of long untrimmed videos, lasting up to several minutes.

4.3 Ablation Studies

We present the following ablation experiments using ResNet-50 on the ActivityNet dataset to show the effectiveness of the different components of our proposed method.

Effect of Different Losses. Table 5 summarizes the effect of the different losses on ActivityNet. Training without knowledge transfer from the 32-bit model (top row: distillation loss turned off) only obtains an mAP of 73.5% with similar GFLOPs as ours, which shows that it is important to utilize the soft targets of the full-precision model as the teacher to guide the learning of the lower precisions. As expected, training with the efficiency loss turned off achieves the highest mAP of 75.1%, but requires roughly twice the GFLOPs compared to training with the efficiency loss (56.4 vs 28.1). Finally, adding both regularizers (balanced usage and entropy) during policy learning leads to the best performance with the least computation, showing the effectiveness of the different losses in our framework.

Effect of Decision Space. We investigate the effect of the decision space by using different combinations of precisions and skipping. As shown in Table 6, only skipping frames (i.e., {32, 0}) leads to an mAP of 72.9%, while restricting the decision space to precisions only (i.e., {32, 4, 2}) leads to an mAP of 74.5% on ActivityNet. Compared to all the alternatives, the best strategy is to combine the set of precisions with skipping, i.e., {32, 4, 2, 0}, achieving the top performance of 74.8% mAP with 28.1 GFLOPs on the ActivityNet dataset.

 

mAP (%) GFLOPs

 

73.5 29.0
75.1 56.4
74.5 34.6
74.3 32.0
74.8 28.1

 

Table 5: Effect of different losses on ActivityNet.

 

Decision Space mAP (%) GFLOPs

 

{32, 0} 72.9 31.6
{32, 4, 2} 74.5 31.4
{32, 4, 0} 74.7 32.8
{32, 2, 0} 74.0 31.2
{32, 4, 2, 0} 74.8 28.1

 

Table 6: Effect of different decision spaces on ActivityNet. Note that 0 indicates skipping the frame so that it is not processed by the classifier.

Comparison with Random Policy. We compare with a random policy that uses the same backbone framework but randomly samples policy actions from a uniform distribution, and observe that our approach outperforms it by a clear margin in mAP on ActivityNet, which demonstrates the effectiveness of the learned policy in selecting the optimal quantization precision per frame while recognizing videos. We also observe similar improvements on the other datasets.

Effectiveness of Any-Precision Recognition Network. As an alternative, we use three separate precision-specific quantized models as the classifier and route frames to the corresponding model based on the policy to generate predictions. Using separate models on ActivityNet (with ResNet-50) improves the mAP only marginally while requiring more GFLOPs and substantially more memory, in contrast to 28.1 GFLOPs and 98.6 MB of memory with a single any-precision network. Similarly, the use of separate models on Mini-Sports1M yields only a marginal improvement in mAP with more computation and additional memory, compared to an any-precision network. This clearly shows the effectiveness of our any-precision network over individual quantized models in obtaining very competitive performance with less computation and memory.

5 Conclusion

In this paper, we introduce video instance-aware quantization that decides what precision should be used on a per frame basis for efficient video recognition. Specifically, we utilize a lightweight policy network to predict these decisions and train it in parallel with an any-precision recognition network with the goal of achieving both competitive accuracy and resource efficiency. Comprehensive experiments on four challenging and diverse datasets demonstrate the superiority of our approach over existing state-of-the-art methods.

Acknowledgements. This work is also supported by the Intelligence Advanced Research Projects Activity (IARPA) via DOI/IBC contract number D17PC00341. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Disclaimer. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government.

References

Appendix A Dataset Details

We evaluate our approach using four standard video recognition benchmark datasets, namely ActivityNet-v1.3 [3], FCVID [25], Mini-Sports1M [28] and Mini-Kinetics [5]. Below we provide more details on each of the dataset.

ActivityNet. We use the v1.3 split of the ActivityNet dataset, which consists of more than 648 hours of untrimmed videos from a total of 20K videos. Specifically, this dataset has 10,024 videos for training, 4,926 videos for validation and 5,044 videos for testing, with an average duration of 117 seconds. It contains 200 different daily activities such as walking the dog, long jump, and vacuuming floor. We use the training videos to train our network, and the validation set for testing, as the labels of the testing set are withheld by the authors. The dataset is publicly available to download at http://activity-net.org/download.html.

FCVID. Fudan-Columbia Video Dataset (FCVID) contains total 91,223 Web videos annotated manually according to 239 categories (45,611 videos for training and 45,612 videos for testing). The categories cover a wide range of topics like social events, procedural events, objects, scenes, etc. that form in a hierarchy of 11 high-level groups (183 classes are related to events and 56 are objects, scenes, etc.). The total duration of FCVID is 4,232 hours with an average video duration of 167 seconds. The dataset is available to download at http://bigvid.fudan.edu.cn/FCVID/.

Mini-Sports1M. Mini-Sports1M is a subset of the Sports-1M dataset [28], which contains 1.1M videos of 487 different fine-grained sports. It is assembled by [14] using videos of length 2-5 minutes, randomly sampling 30 videos per class for training and 10 videos per class for testing. The classes are arranged in a manually-curated taxonomy that contains internal nodes such as Aquatic Sports, Team Sports, Winter Sports and Ball Sports, and generally becomes fine-grained at the leaf level. We obtain the training and testing splits from the authors of [14] to perform our experiments. Both the training and testing videos in this dataset are untrimmed. This dataset is available to download at https://github.com/gtoderici/sports-1m-dataset.

Mini-Kinetics. Kinetics-400 is a large-scale dataset containing 400 action classes and 240K training videos collected from YouTube. Since the full Kinetics dataset is quite large and the original version is no longer available from the official site (about 15% of the videos are missing), we use the Mini-Kinetics dataset, which contains 121K videos for training and 10K videos for testing, with each video lasting 6-10 seconds. We use the official training/validation splits of Mini-Kinetics released by the authors of [34] in our experiments.

Appendix B Implementation Details

 

Arch. 32-bit 4-bit 2-bit

 

ResNet-18 4 0.01 5e-4 0.01 5e-4 0.01 5e-3
ResNet-50 2 0.1 5e-4 0.1 5e-4 0.01 6e-2

 

Table 7: Hyperparameters for training the any-precision recognition network. We use separate sets of learning parameters (learning rate, weight decay) for clipping values of each precision.

 

Dataset

 

ActivityNet 0.21 0.5 0.1
FCVID 0.11 1.0 0.1
Mini-Sports1M 0.21 0.5 0.1
Mini-Kinetics 0.21 0.3 0.1

 

Table 8: Hyperparameters to train the policy network.

In this section, we provide more details regarding the implementation. We train the any-precision recognition network, initialized from the full-precision recognition network pretrained on the same dataset, for 100 epochs. Then we optimize the policy network, together with the well-trained (frozen) any-precision recognition network, for 50 epochs; the policy network is likewise initialized with weights pretrained on the same dataset. For our experiments, we use 12 NVIDIA Tesla V100 GPUs for training the any-precision recognition network and 6 GPUs for training the policy network. All our models are implemented and trained with PyTorch. In Tables 7 and 8, we provide the initial clipping value, learning rate and weight decay for each precision used to train the any-precision recognition network, as well as the hyperparameters (in Eq. (13) of the main paper) used to train the policy network. The data augmentations in our approach follow the practices in [53]. We first randomly resize the shorter side of an image to a range of [256, 320) while keeping the aspect ratio, and then randomly crop a region and normalize it with ImageNet's mean and standard deviation to form the input. The training time depends on the size of the dataset and the task. We will make our code publicly available after acceptance.

Appendix C Additional Ablation Studies

Effectiveness of LSTM. We investigate the effectiveness of the LSTM for modeling video causality in the policy network by comparing with a variant of VideoIQ without the LSTM (see Table 9). On the ActivityNet and Mini-Sports1M datasets, the variant without the LSTM yields 0.7% and 0.3% lower mAP, respectively, with similar GFLOPs to VideoIQ. This demonstrates that the LSTM is critical for good performance, as it makes the policy network aware of all the useful information seen so far by aggregating the sequence history.

 

Model mAP (%) GFLOPs

 

ActivityNet
No LSTM 74.1 28.8
LSTM 74.8 28.1
Mini-Sports1M
No LSTM 46.1 26.4
LSTM 46.4 26.8

 

Table 9: Effect of LSTM on ActivityNet and Mini-Sports1M.

Effect of Different Losses. Similar to Table 5 of the main paper, we further ablate the different losses on Mini-Sports1M (see Table 10) and observe that without knowledge transfer from a pretrained full-precision model, our method only achieves 44.6% mAP with a similar amount of GFLOPs. This once again demonstrates the importance of using the full-precision model as the teacher for the effective training of lower precisions. When training without the efficiency loss, it achieves 46.6% mAP (a 0.2% improvement) but requires more than twice the FLOPs (58.5 vs 26.8 GFLOPs). Furthermore, the balanced-usage and entropy regularizers both improve the performance at similar computational cost.

 

mAP (%) GFLOPs

 

44.6 26.5
46.6 58.5
46.3 28.5
46.2 26.9
46.4 26.8

 

Table 10: Effect of different losses on Mini-Sports1M.

Effect of Decision Space. Similar to Table 6 in the main paper, we show the effect of the decision space on Mini-Sports1M (see Table 11). We adjust the training loss to keep the GFLOPs at the same level and only compare the differences in recognition performance. Only skipping frames ({32, 0}) yields 43.9% mAP, 2.5% lower than the full decision space. Among all the alternatives, the best strategy is the full decision space {32, 4, 2, 0}, achieving the top performance of 46.4% mAP with 26.8 GFLOPs.

 

Decision Space mAP (%) GFLOPs

 

{32, 0} 43.9 28.7
{32, 4, 2} 46.1 29.3
{32, 4, 0} 43.9 33.5
{32, 2, 0} 46.0 32.9
{32, 4, 2, 0} 46.4 26.8

 

Table 11: Effect of different decision space on Mini-Sports1M.

Appendix D Qualitative Results

Figure 6: Qualitative examples. Our proposed approach VideoIQ processes more informative frames with high precision and less informative ones with lower precision, or skips them when irrelevant, for efficient video recognition. Best viewed in color.

In this section, we provide additional qualitative examples to visualize the learned policy (see Figure 6). Videos are uniformly sampled into 8 frames. VideoIQ processes the most informative frames at 32-bit precision, while it skips or uses lower precision for the less informative frames without sacrificing accuracy (see the top 4 examples in Figure 6: "Swimming", "Tractor Pulling", "Bujinkan" and "Using Segway"). Moreover, it uses 2-bit precision instead of 32-bit precision after becoming confident about the action (see the bottom 2 examples in Figure 6: "Riding Camel" and "Freestyle Football").