An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal

01/09/2020 ∙ by Sifeng Xia, et al. ∙ 14

In this paper, we study a new problem arising from the emerging MPEG standardization effort Video Coding for Machine (VCM), which aims to bridge the gap between visual feature compression and classical video coding. VCM is committed to addressing the requirement of compact signal representation for both machine and human vision in a more or less scalable way. To this end, we leverage the strengths of predictive and generative models to support advanced compression techniques for machine and human vision tasks simultaneously, in which visual features serve as a bridge connecting signal-level and task-level compact representations in a scalable manner. Specifically, we employ a conditional deep generation network to reconstruct video frames with the guidance of a learned motion pattern. By learning to extract a sparse motion pattern via a predictive model, the network leverages the feature representation to generate the appearance of to-be-coded frames via a generative model, relying on the appearance of the coded key frames. Meanwhile, the sparse motion pattern is compact and highly effective for high-level vision tasks, e.g. action recognition. Experimental results demonstrate that our method yields much better reconstruction quality compared with traditional video codecs (0.0063 gain in SSIM), as well as state-of-the-art action recognition performance over highly compressed videos (9.4% gain in recognition accuracy), which showcases a promising paradigm of coding signal for both human and machine vision.




1 Introduction

Video coding aims to compress videos into a compact form for efficient computing, transmission, and storage. Much effort has been put into this domain, and over the last three decades a series of coding standards has been built to significantly improve coding efficiency. The latest video codecs, e.g. MPEG-4 AVC/H.264 [Overviewavc] and High Efficiency Video Coding (HEVC) [Overviewhevc], seek to improve video coding performance by squeezing out the spatial, temporal, and coding redundancies of video frames. In the past few years, data-driven methods have become popular and have brought tremendous progress in the compression task. The latest data-driven methods have largely surpassed the performance of state-of-the-art codecs, e.g. HEVC, by further improving various modules such as intra-prediction [huintra], inter-prediction [fiicip, oneforall], and the loop filter [jiaTIP19, park2016cnn]. These techniques significantly improve video quality from the perspective of signal fidelity and human vision.

Figure 1: The visual results of the reconstructed videos by HEVC (left panel) and our method (right panel). Embedded videos are best viewed in Acrobat Reader.

Existing coding techniques run into problems when they meet big data and video analytics. The massive data streams generated every day by smart cities need to be compressed, transmitted, and analyzed to provide highly valuable information, such as the results of action recognition, event detection, etc. In this scenario, it is expensive to perform the analysis on compressed videos, as the video coding bit-stream is redundant and the existing coding mechanism is not flexible enough to discard information that is unrelated to the analytical tasks [rdo_feat]. Therefore, in the context of big data, scalable video coding remains an open problem: the requirement of machine vision should be met first, and additional bitrate can then be used to improve the visual quality of the reconstructed video progressively and incrementally. There is an urgent need for a scalable feature representation that connects the information of low-level and high-level vision and switches freely between the two purposes.

The success of deep learning models has opened a new door. Deep analytic models can extract compact and highly valuable representations, converting redundant pixel-domain information into a sparse feature domain. In contrast, deep generative models can produce whole images and videos with only the guidance of highly abstracted and compact features. Supported by these tools, we can realize the scalable compression of videos and features jointly, which both meets practical application demands in the big data context and accords with the mechanism of human brain circuits. The most compact and valuable abstracted features are first extracted via deep analytic models [Zhu_lstm, Song_attention, Liu_pku_mmd] to support the analytics applications. With these features, we can locate the place and time where key events happen, namely rethinking rough situations. Then, guided by the features, the remaining information is partly generated by deep generative models [gan, PSGAN, chan2019dance, Siarohin_2019_CVPR], and partly compressed and decoded to support video reconstruction, namely rethinking scene details. This solution has the potential to address the difficulty of combining video analytics and reconstruction in big data streaming, which is the main target of video coding for machine (VCM). The first stage provides timely analytical results with a small portion of the bitrate to fulfill the need of machine vision, and the second stage further provides the reconstructed videos, with regard to the analytical results, using more bitrate to meet the need of human vision.

Figure 2: The coding pipeline of our proposed joint feature and video compression that serves for both human and machine vision.

Specifically, in this paper, we propose a scalable joint compression method for both features and videos in surveillance scenes, where a learnable motion pattern bridges the gap between machine and human vision. The sparse motion pattern is first extracted automatically via a deep predictive model. After that, the appearance of the currently coded frame is transferred from the coded key frame with the guidance of the motion pattern via a deep generative model. The sparse motion pattern is highly efficient for high-level vision tasks, e.g. action recognition, and it can also meet the requirement of human vision. In this way, the total coding cost of features and videos can be largely reduced.

In summary, the contributions of our paper are as follows:

  • To the best of our knowledge, we make the first attempt towards VCM to compress features and videos jointly, serving both machine and human vision. A novel scalable compression framework is designed with the aid of predictive and generative models.

  • In our framework, the learned sparse motion pattern is used as a bridge, which is flexible and largely reduces the total coding cost of the two kinds of vision. To improve the performance of human action recognition, we additionally constrain the learned points with the guidance of human skeletons.

  • Compared with traditional video codecs, our method not only achieves much better video quality but also offers significantly better action recognition performance at very low bitrates, which showcases a promising paradigm of coding signal for both human and machine vision.

The rest of the article is organized as follows. Sec. 2 illustrates the pipeline of our proposed joint feature and video compression. The detailed network architecture for key point prediction and motion guided target video generation is also elaborated. Experimental results are shown in Sec. 3 and concluding remarks are given in Sec. 4.

2 Joint Compression of Features and Videos

Given a video sequence with a known number of frames, it is necessary to compress the sequence for transmission and storage. In this section, we first analyze the limitations of traditional video coding methods. Then, we develop our new framework to compress features and videos jointly in a scalable way.

2.1 Sequential Compression and Analytics

The traditional video codec aims to optimize the visual quality of the compressed video from the perspective of signal fidelity. In this process, all frames are coded. For each frame, spatial and temporal predictions are used to predict the target frame from the already coded signal, removing spatial and temporal redundancy. Then, the prediction residue and a large amount of syntax information are coded for reconstruction at the decoder side. Although the data can be compressed efficiently with the latest codecs, its scale is still massive, as a huge amount of data is captured around the clock. Therefore, it is intractable to compress and save the data at high quality and analyze it later.

A reasonable trade-off is to compress the data into a low-quality format. However, existing compression methods, which are optimized for human vision, are not well suited to high-level analytics tasks. If we lower the quality of the compressed videos, the performance of action recognition degrades substantially. As demonstrated in Sec. 3.2, our method uses only about 1/3 of the bitrate of the traditional compression method while achieving better performance in the action recognition task. Another path to effective video analytics is to extract and compress features. However, in this case, the reconstructed videos cannot be obtained. This also sets barriers to real applications, where the results usually need to be confirmed by human examiners. Therefore, we seek to develop a flexible and scalable framework that first compresses the features for machine vision and later reconstructs the video for human vision at the cost of more bits.

2.2 An Overview of Joint Feature and Video Compression

Fig. 2 illustrates the overall pipeline of the proposed joint feature and video compression method. The motivation lies in the fact that, in surveillance scenes, videos can be represented as a background layer (static or slow moving) and moving objects, such as human bodies. The network is therefore capable of learning to represent a video sequence with a learned sparse motion pattern, which indicates the object motion among frames. In our work, we focus on indoor surveillance videos with a static background and moving humans.

At the encoder side, a set of key frames is first selected from the captured video frames and compressed with a traditional video codec to form the key-frame bit-stream. The coded key frames convey the appearance information, which includes the background and human appearances, and are transmitted to the decoder side to synthesize the non-key frames. Moreover, the learned Sparse Point Prediction Network (SPPN) extracts sparse key points from the video frames to form a point sequence. The sparse point sequence marks the motion areas in the frames and conveys the motion trajectories of objects along the temporal dimension, and is therefore viewed as a sparse motion pattern of the video. The point sequence is also coded into a bit-stream for transmission.

At the decoder side, the key frames are first reconstructed from the key-frame bit-stream. To reconstruct the remaining non-key frames, the key points are decompressed, and a learned Motion Guided Generation Network (MGGN) first estimates the motion flow among frames based on the decompressed sparse motion pattern. Then, MGGN transfers the appearance of the reconstructed key frames to the remaining non-key frames with the guidance of the estimated motion flow. Specifically, each frame to be reconstructed is synthesized by MGGN from its previous reconstructed key frame and the decompressed key points. Finally, the reconstructed key points and the video can be used for machine analysis and human vision, respectively.
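The two decoding paths described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `JointStream` container and the callables `decode_points`, `decode_key_frames`, and `generate` are hypothetical stand-ins for the point decoder, the conventional video decoder, and MGGN, and the sketch assumes that the first frame of a clip is a key frame.

```python
from dataclasses import dataclass


@dataclass
class JointStream:
    """Hypothetical container for the two bit-streams of one clip."""
    point_stream: bytes      # coded sparse motion pattern (one point set per frame)
    key_frame_stream: bytes  # key frames coded by a conventional codec


def decode_for_machine(stream, decode_points):
    """Machine-vision path: only the compact point stream is decoded."""
    return decode_points(stream.point_stream)


def decode_for_human(stream, decode_points, decode_key_frames, generate):
    """Human-vision path: key frames plus points are turned into a full video."""
    points = decode_points(stream.point_stream)              # list, one entry per frame
    key_frames = decode_key_frames(stream.key_frame_stream)  # dict: frame index -> frame
    video, current_key = [], None
    for t, target_points in enumerate(points):
        if t in key_frames:
            current_key = t
            video.append(key_frames[t])
        elif current_key is None:
            raise ValueError("the clip must start with a key frame")
        else:
            # Synthesize the frame from the previous key frame and both point sets.
            video.append(generate(key_frames[current_key],
                                  points[current_key], target_points))
    return video
```

The design point is that the machine-vision path never touches the key-frame stream, so analysis can start after receiving only the small point stream.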

2.3 Detailed Network Architecture Illustration

The critical feature of our joint feature and video compression framework is its capability to capture the motion between video frames for both machine analytics and video reconstruction. There are several ways to model video motion, such as dense optical flow [PWCNet], sparse motion representations based on human poses [chan2019dance], or key points learned without supervision [Siarohin_2019_CVPR]. In our work, we want the motion representation to be sparse enough for efficient machine analytics. Therefore, we follow [Siarohin_2019_CVPR] and predict key points of frames as the sparse motion pattern, which is compact enough to cost only a few bits for transmission and storage. For human vision, the motion flow among video frames is later derived from the sparse motion pattern to guide the generation of the target frame.

The framework of the network is shown in Fig. 3. For a key frame and a target frame that is to be generated at the decoder side, their key points are first predicted by SPPN, and this sparse motion pattern is then combined with the key frame to estimate the flow map between the two frames. Then, the generated flow map guides the transfer of the appearance of the key frame to the target frame. Details of the different parts of the network are described as follows.

Sparse Point Prediction. For an input frame, a sub-network of the U-Net architecture followed by softmax activations is used to extract heatmaps for key point prediction. Each heatmap $H_k$ corresponds to one key point position $h_k$, which is estimated as follows:

$$h_k = \sum_{p \in U} H_k[p] \, p,$$

where $U$ is the set of positions of all pixels. Besides the key point position, the corresponding covariance matrix $\Sigma_k$ is defined as:

$$\Sigma_k = \sum_{p \in U} H_k[p] \, (p - h_k)(p - h_k)^{\top}.$$

The covariance matrix is generated here because it additionally captures the correlations between the key point and its neighboring pixels. Consequently, each key point is described by 6 float numbers in total: two numbers indicating the position and 4 numbers in the covariance matrix.
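As a rough illustration of how a softmax heatmap yields the 6 numbers of a key point, the sketch below computes the heatmap-weighted mean position and the heatmap-weighted 2x2 covariance over all pixel positions. It is a minimal pure-Python sketch under the assumption that the moments are taken as in [Siarohin_2019_CVPR]; the function names and the (x, y) coordinate convention are hypothetical.

```python
import math


def softmax2d(scores):
    """Turn a 2-D grid of raw scores into a heatmap whose entries sum to 1."""
    peak = max(s for row in scores for s in row)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def keypoint_moments(heatmap):
    """Return the weighted mean position (x, y) and the 2x2 weighted covariance."""
    mx = my = 0.0
    for y, row in enumerate(heatmap):
        for x, w in enumerate(row):
            mx += w * x
            my += w * y
    sxx = sxy = syy = 0.0
    for y, row in enumerate(heatmap):
        for x, w in enumerate(row):
            dx, dy = x - mx, y - my
            sxx += w * dx * dx
            sxy += w * dx * dy
            syy += w * dy * dy
    return (mx, my), ((sxx, sxy), (sxy, syy))


# A flat 3x3 score grid gives a uniform heatmap centred on the middle pixel.
position, covariance = keypoint_moments(softmax2d([[0.0] * 3 for _ in range(3)]))
```

The two position values and the four covariance entries together form the 6 float numbers per key point mentioned in the text.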

For subsequent use, the key point description is used to generate new heatmaps with a Gaussian-like function. This is done because the new heatmaps are more compatible with convolutional operations. Specifically, for a key point with position $h_k$ and covariance matrix $\Sigma_k$, the new heatmap value at pixel position $p$ is generated as follows:

$$\hat{H}_k(p) = \frac{1}{\alpha} \exp\left( -(p - h_k)^{\top} \Sigma_k^{-1} (p - h_k) \right),$$

where $\alpha$ is a normalization constant. After this process, two sets of newly generated heatmaps are obtained, one from the key frame and one from the target frame.
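The regeneration of a heatmap from a key point description can be sketched as below. This is an illustrative sketch, not the authors' code: it assumes a Gaussian-like function built from the quadratic form with the inverse covariance (without a factor of 1/2), and the default `alpha=1.0` is a placeholder because the value used in the paper is not recoverable from the text.

```python
import math


def gaussian_heatmap(mean, cov, height, width, alpha=1.0):
    """Render a Gaussian-like heatmap from a position (x, y) and a 2x2 covariance."""
    mx, my = mean
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix must be invertible")
    # Closed-form inverse of a 2x2 matrix.
    ia, ib, ic, id_ = d / det, -b / det, -c / det, a / det
    heatmap = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = x - mx, y - my
            quad = dx * (ia * dx + ib * dy) + dy * (ic * dx + id_ * dy)
            row.append(math.exp(-quad) / alpha)
        heatmap.append(row)
    return heatmap


def difference_heatmap(heat_a, heat_b):
    """Element-wise difference of two heatmaps of equal size."""
    return [[va - vb for va, vb in zip(ra, rb)] for ra, rb in zip(heat_a, heat_b)]


identity = ((1.0, 0.0), (0.0, 1.0))
key_heat = gaussian_heatmap((2.0, 1.0), identity, height=3, width=5)
target_heat = gaussian_heatmap((3.0, 1.0), identity, height=3, width=5)
motion_hint = difference_heatmap(target_heat, key_heat)
```

The difference of the two rendered heatmaps is one simple way to expose where a key point has moved between the key frame and the target frame.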

Motion Flow Estimation. With the estimated key points and the newly generated heatmaps, a sub-network in MGGN is first used to estimate the motion flow between the key frame and the target frame. The source (key) frame forms part of the input because it conveys the appearance information. Meanwhile, the difference heatmaps between the two frames form the other part of the input to provide sparse motion information. The flow estimator finally outputs a flow map.

Figure 3: Framework of our proposed joint feature and video compression, including a sparse point prediction network and motion guided generation network to extract the sparse motion pattern and generate the target frame.

Motion Guided Target Frame Generation. The target frame is generated with a sub-network of the U-Net architecture. Feature maps of different sizes are extracted by the appearance encoder and passed through skip connections to the appearance decoder for feature fusion. In order to align the features with the target frame, the features are first warped with the estimated flow map before fusion. In addition, the difference heatmaps are used as side information fed to the appearance decoder. Then, the target frame is generated by the appearance decoder.
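The warping of a feature map by a flow map can be illustrated with a small backward-warping routine. This is a simplified sketch under stated assumptions: it uses nearest-neighbor lookup on a single-channel map, whereas learned generators typically use differentiable bilinear sampling, and the convention that `flow[y][x] = (dx, dy)` points from the target location to the source location is an assumption, not something the text specifies.

```python
def warp_nearest(feature, flow, fill=0.0):
    """Backward-warp a 2-D map: output[y][x] = feature[y + dy][x + dx]."""
    height, width = len(feature), len(feature[0])
    output = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            src_x, src_y = int(round(x + dx)), int(round(y + dy))
            # Samples that fall outside the source map keep the fill value.
            if 0 <= src_x < width and 0 <= src_y < height:
                output[y][x] = feature[src_y][src_x]
    return output


feature = [[1.0, 2.0, 3.0],
           [4.0, 5.0, 6.0]]
# Every target pixel reads from the pixel one step to its right.
flow = [[(1.0, 0.0)] * 3 for _ in range(2)]
warped = warp_nearest(feature, flow)
```

In this example the content shifts one pixel to the left, and the rightmost column, which has no source pixel, is filled with zeros.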

Skeleton Guided Point Prediction Loss Function. In [Siarohin_2019_CVPR], the key point prediction is learned without supervision. In our work, we additionally use human skeleton information to guide the key point prediction. Skeleton information is used because it models human actions efficiently: the skeleton points cover many human joints, which are highly correlated with human actions. Consequently, the PKU-MMD dataset [pkummd], a large-scale dataset containing many human action videos, is used in our work for training and testing. More importantly, human skeletons are available in this dataset for each human body in the videos.

We sample 16 skeleton points for each human body and employ a distance-based loss function for supervision. For every training sample, the key point detection loss penalizes the distance between each predicted key point and the corresponding skeleton point of the human in that sample.
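A skeleton-guided point loss of this kind can be sketched as follows. The order of the norm is not recoverable from the text, so the sketch exposes it as a parameter `p` (1 for absolute differences, 2 for squared differences); averaging over all points and samples is likewise an assumption, and the function name is hypothetical.

```python
def point_loss(predicted, skeleton, p=1):
    """Average distance between predicted key points and skeleton points.

    predicted, skeleton: one list per training sample, each a list of (x, y).
    p: 1 for absolute differences, 2 for squared differences (assumed choices).
    """
    if p not in (1, 2):
        raise ValueError("p must be 1 or 2")
    total, count = 0.0, 0
    for sample_pred, sample_skel in zip(predicted, skeleton):
        for (px, py), (sx, sy) in zip(sample_pred, sample_skel):
            if p == 1:
                total += abs(px - sx) + abs(py - sy)
            else:
                total += (px - sx) ** 2 + (py - sy) ** 2
            count += 1
    return total / count


pred = [[(0.0, 0.0), (1.0, 1.0)]]
skel = [[(1.0, 0.0), (1.0, 3.0)]]
loss_l1 = point_loss(pred, skel, p=1)  # (1 + 2) / 2 = 1.5
loss_l2 = point_loss(pred, skel, p=2)  # (1 + 4) / 2 = 2.5
```

With 16 skeleton points per body, each inner list would hold 16 coordinate pairs; two points are used here only to keep the example short.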

Overall Loss Function. Besides the point prediction loss, a combination of the adversarial loss and the feature matching loss proposed in [ganloss] is used for training. The discriminator takes the key point heatmaps concatenated with either the real image or the generated image as its input. The discriminator and generator losses are calculated following [ganloss].

For better reconstruction quality, a reconstruction loss function is built to keep the generated frame and the real target frame similar in their feature representations. It is implemented by calculating the distance between the features extracted from the two frames by the discriminator. The features output by all layers of the discriminator are used in the calculation.

The final loss function is a weighted combination of the above losses, balanced by two weighting coefficients.

3 Experiments

3.1 Experimental Details

The PKU-MMD dataset [pkummd] is used to generate the training and testing samples. Clips are sampled from it for training and for testing, and all frames are cropped and resized to a common size during sampling. The skeleton information is also used during the training process: 16 skeleton points are chosen for each frame and mapped to the corresponding two-dimensional space to generate the labels for key point prediction. The network is implemented in PyTorch, and the Adam optimizer [adam] is used for training. We randomly select two frames from a clip to form a training sample.

In the testing process, we consistently use the first frame in each clip as the key frame. At the encoder side, the key frame is coded with the HEVC codec in the constant rate factor mode with a fixed constant rate factor. Besides the key frame, the key points of all frames in the clip are predicted by SPPN and compressed for transmission. As mentioned in Sec. 2.3, each key point contains 6 float numbers. The two position numbers are compressed by quantization with a fixed step. For the other 4 float numbers, which belong to the covariance matrix, we calculate the inverse of the matrix in advance and then quantize the 4 resulting values, again with a fixed step. Then, the quantized key point values are further losslessly compressed with the Lempel-Ziv-Markov chain algorithm (LZMA) [lzma]. At the decoder side, the compressed key frame and points are decompressed and used to generate the remaining frames.
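The quantize-then-LZMA step for the key point values can be sketched with the Python standard library. This is an illustrative sketch: the quantization step `0.05`, the 16-bit integer packing, and the function names are assumptions, since the step sizes and the byte layout used in the paper are not given in the text.

```python
import lzma
import struct


def quantize(values, step):
    """Uniform quantization: map each float to the index of its nearest level."""
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Map quantization indices back to reconstructed float values."""
    return [i * step for i in indices]


def pack_points(indices):
    """Serialize indices as little-endian 16-bit integers, then LZMA-compress."""
    raw = struct.pack(f"<{len(indices)}h", *indices)
    return lzma.compress(raw)


def unpack_points(blob):
    """Inverse of pack_points: LZMA-decompress and read back the indices."""
    raw = lzma.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


step = 0.05                          # hypothetical step size
positions = [0.10, 0.52, -0.31]      # hypothetical key point coordinates
indices = quantize(positions, step)
blob = pack_points(indices)
restored = dequantize(unpack_points(blob), step)
```

Only the quantization is lossy: the LZMA stage returns the indices exactly, so each restored value differs from the original by at most half a step. For inputs this small the LZMA container overhead dominates, so the benefit only appears on longer point sequences.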

To verify the efficiency of our coding paradigm, we use HEVC as the anchor for comparison by additionally compressing all frames with the HEVC codec. The constant rate factor is first set to 51, which gives the highest compression ratio. Then, the recognition accuracies obtained with the learned sparse motion pattern and with the compressed videos are compared. To verify the reconstruction quality, we set the constant rate factor to 44 and compare the reconstruction results of HEVC and our method at a similar coding cost. The reconstruction quality is compared both quantitatively and qualitatively.
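For readers who want to reproduce an HEVC anchor of this kind, the sketch below builds an encoder command line for the two constant rate factor settings mentioned above. The text states only that HEVC is used in constant rate factor mode; the use of ffmpeg with the libx265 encoder is an assumption, and the file names are hypothetical.

```python
def hevc_crf_command(src, dst, crf):
    """Build an ffmpeg argument list that encodes src to HEVC at a given CRF."""
    if not 0 <= crf <= 51:
        raise ValueError("x265 accepts CRF values from 0 to 51")
    return ["ffmpeg", "-y", "-i", src, "-c:v", "libx265", "-crf", str(crf), dst]


# Anchor for the recognition comparison (strongest compression).
recognition_cmd = hevc_crf_command("clip.mp4", "clip_crf51.mp4", 51)
# Anchor for the reconstruction-quality comparison.
quality_cmd = hevc_crf_command("clip.mp4", "clip_crf44.mp4", 44)
```

The lists can be passed to `subprocess.run` on a machine where ffmpeg with libx265 is installed; the sketch itself does not launch the encoder.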

3.2 Action Recognition Accuracy

We verify the effectiveness of the learned key points for high-level analytics on the action recognition task. Although there are 6 numbers for each key point, we use only the two quantized position numbers for action recognition. Consequently, only the bits of the compressed position numbers are counted when calculating the bitrate cost of feature-based action recognition. To bring the video bitrate close to that of the features, we first resize all clips to a fixed size and then compress the testing clips with HEVC at constant rate factor 51.

Figure 4: Video reconstruction results of different methods: (a) ground truth, (b) HEVC, (c) proposed. The left and right three panels correspond to two video clips in the testing set, respectively.

Input                  Bitrate (Kbps)   Accuracy (%)
Compressed Video       16.2             65.2
Compressed Key Point   5.2              74.6
Table 1: Action recognition accuracy of different methods and corresponding bitrate costs.

Table 1 shows the action recognition accuracy and the corresponding bitrate costs of the different kinds of data. Our method obtains considerable action recognition accuracy (74.6%) at a bitrate cost of only 5.2 Kbps. Although we have chosen the worst coding quality for HEVC, it still needs 16.2 Kbps to transmit and store the compressed videos, and reaches only 65.2% accuracy. Spending more bits on the compressed videos does not bring much improvement in action recognition; in fact, the recognition accuracy even drops.
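The comparison in Table 1 reduces to simple arithmetic, sketched below. The table values are taken from the text; the `kbps` helper and its example arguments are hypothetical, since the clip length and frame rate are not stated.

```python
def kbps(num_bytes, num_frames, fps):
    """Average bitrate in kilobits per second for a coded clip."""
    duration_s = num_frames / fps
    return num_bytes * 8 / duration_s / 1000.0


# Values reported in Table 1.
video_kbps, video_acc = 16.2, 65.2
point_kbps, point_acc = 5.2, 74.6

bitrate_ratio = point_kbps / video_kbps  # about 0.32, i.e. roughly 1/3
accuracy_gain = point_acc - video_acc    # about 9.4 percentage points
```

These two numbers correspond to the "about 1/3 bitrate" statement in Sec. 2.1 and to the accuracy gain of the key points over the compressed video.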

Codec   Bitrate (Kbps)   SSIM
HEVC    33.0             0.9008
Ours    32.1             0.9071
Table 2: SSIM comparison between different methods and corresponding bitrate costs.

3.3 Video Reconstruction Quality

The video reconstruction quality of the proposed method is also compared with that of HEVC. During the testing phase, we compress the key frames with a constant rate factor chosen to maintain high appearance quality. The bitrate is calculated by jointly considering the compressed key frames and key points. As for HEVC, we compress all frames with constant rate factor 44 to reach a comparable bitrate cost.

Table 2 shows the quantitative reconstruction quality of the different methods, measured by SSIM. It can be observed that our method achieves better reconstruction quality than HEVC (0.9071 versus 0.9008) at a lower bitrate cost (32.1 versus 33.0 Kbps). Subjective results of the different methods are shown in Fig. 4. There are obvious compression artifacts in the HEVC reconstructions, which heavily degrade the visual quality. Compared with HEVC, our method provides far more visually pleasing results.
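To make the SSIM figures more concrete, the sketch below computes a simplified SSIM from global image statistics and reproduces the gain reported in Table 2. It is only an illustration: standard SSIM is averaged over local windows, whereas this version uses one window covering the whole image, and the tiny pixel lists are hypothetical data. The constants follow the usual choice of K1 = 0.01 and K2 = 0.03.

```python
def global_ssim(a, b, dynamic_range=255.0):
    """Single-window SSIM between two equally sized lists of pixel values."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


reference = [0.0, 50.0, 100.0, 150.0]   # hypothetical pixel values
distorted = [10.0, 40.0, 120.0, 140.0]
same = global_ssim(reference, reference)
lower = global_ssim(reference, distorted)

# Gain reported in Table 2 (Ours minus HEVC).
ssim_gain = 0.9071 - 0.9008
```

An image compared with itself scores 1, any distortion lowers the score, and the difference between the two table entries is the 0.0063 SSIM gain quoted in the abstract.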

4 Conclusion

In our work, we propose a novel framework to bridge the gap between the compression of features and the compression of videos. A conditional deep generation network is designed to reconstruct video frames with the guidance of a learned sparse motion pattern. This representation is highly compact and also effective for high-level vision tasks, e.g. action recognition. Therefore, it scales to meet the requirements of both machine and human vision, which reduces the total coding cost. Experimental results demonstrate that our method obtains superior reconstruction quality and action recognition accuracy at a lower bitrate cost compared with traditional video codecs.