1 Introduction

There has been explosive growth of image/video data and related applications in recent years. With the development of smart cities and intelligent security, surveillance data are of great importance, as all kinds of activities can be recorded and analyzed in real time. Generally speaking, the surveillance camera data from the front-end are compressed and transmitted to the back-end for further analysis/understanding tasks, which is referred to as the compress-then-analyze (CTA) paradigm. To achieve a higher compression ratio under bandwidth limitations, low bit-rate compression is usually employed. However, the performance of analysis tasks may be significantly degraded by low bit-rate coding. Another feasible solution is composed of feature extraction, compression, and transmission, such that compact features extracted at the front-end can be transmitted to the server side; this is usually referred to as the analyze-then-compress (ATC) paradigm. Since the features are much more compact than the texture, the performance of the analysis task is quite promising at low bit-rates. However, it is difficult to recover the texture information, as the features extracted from the original data cannot faithfully support signal-level image/video reconstruction.
Feature coding algorithms play an important role in the ATC paradigm. In the literature, many feature coding schemes have been proposed to further improve the compression ratio and transmission efficiency, based on both handcrafted features (e.g., SIFT, SURF [4, 5]) and deep learning features. Baroffio et al. proposed a coding architecture for local features extracted from video sequences. Regarding deep learning features, a trade-off between feature compression ratio and analysis performance was studied. The MPEG standards CDVS and CDVA, which standardize handcrafted and deep learning features respectively, have also found many applications in practice. Regarding surveillance video applications, besides automatic visual analysis, human-involved monitoring may also be required for further verification. As such, texture reconstruction is also an important component which should not be ignored. From this perspective, Zhang et al. proposed a framework for joint feature and texture compression, providing the feasibility of compressing handcrafted features and video textures jointly. Joint rate-distortion optimization for simultaneous compression of texture and deep learning features was further studied. Moreover, the joint texture-feature coding framework has been overviewed and analyzed, illustrating the advantages and disadvantages of both the CTA and ATC schemes.
One aspect that has been largely ignored in feature and texture compression is that the information conveyed in the feature is very helpful for image/video reconstruction. As such, to make better use of deep learning features beyond analysis tasks, we propose a scalable compression scheme for facial images, as faces play a very important role in video surveillance. More specifically, the base layer conveys the deep learning feature, and a deep de-convolutional network is adopted to reconstruct the face from the deep learning feature. The enhancement layer accounts for the residuals between the input face image and the image reconstructed from the feature, and these residuals are further compressed. Extensive experiments show that the proposed scheme inherits the advantages of both the ATC and CTA schemes.
2 Scalable Compression Framework
In the traditional CTA paradigm, low bit-rate compression often causes severe distortion of the decoded texture and degradation of feature quality, leading to poor analysis performance. By contrast, in the ATC paradigm it is difficult for human beings to view and monitor the image/video texture, which may limit its real-world applications. To address these issues, we propose a scalable framework in which the base layer carries the feature and the enhancement layer carries the texture. This framework has three main advantages. First, when texture reconstruction is not necessary, the base layer can be transmitted directly to support the ATC paradigm. As such, the fidelity of the deep feature is guaranteed, since the features are extracted from the original image. Second, the redundancy between the feature and the texture is exploited, which may significantly improve the texture compression performance. Third, frequent analysis can be performed without image decoding and feature extraction, which is more economical in video surveillance applications.
The whole framework is shown in Fig. 1. The facial images are fed into a deep neural network for deep learning feature extraction, and the extracted features are compressed with direct quantization and entropy coding. Given the features of the facial images, the proposed scheme reconstructs the texture information with deep feature reconstruction, and the residuals between the input and the reconstructed texture are further compressed. In particular, the deep learning feature is a vector in a hyperspace with 128 dimensions. Based on this consideration, we utilize a de-convolutional neural network to recover the facial image, as illustrated in Fig. 1. It is worth mentioning that the DconvOP shown in Fig. 1 is composed of a series of procedures, including a de-convolution layer, a batch-normalization layer, and ReLU activation (the activation function of the last layer is Tanh). After the deep reconstruction from the deep learning feature, we acquire the residual information as the enhancement layer, which is further lossily compressed. At the decoder side, the reconstructed image is obtained by combining the deep-feature-reconstructed texture with the decoded residual from the enhancement layer.
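To make the DconvOP concrete, the following NumPy sketch implements a naive single-channel de-convolution (transposed convolution), a batch-normalization-style rescaling, and the ReLU/Tanh activation; the kernel size, stride, and per-map normalization are illustrative assumptions rather than the actual network configuration.

```python
import numpy as np

def deconv2d(x, k, stride=2):
    # Naive single-channel transposed convolution: each input pixel
    # "stamps" a scaled copy of the kernel onto the upsampled output.
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros(((H - 1) * stride + kh, (W - 1) * stride + kw))
    for i in range(H):
        for j in range(W):
            out[i*stride:i*stride+kh, j*stride:j*stride+kw] += x[i, j] * k
    return out

def dconv_op(x, k, last=False):
    # One DconvOP: de-convolution, batch-norm-style scaling, then
    # ReLU (or Tanh for the final layer), as described in the text.
    y = deconv2d(x, k)
    y = (y - y.mean()) / (y.std() + 1e-5)
    return np.tanh(y) if last else np.maximum(y, 0.0)

x = np.random.randn(4, 4)              # toy feature map
k = np.ones((2, 2))                    # toy 2x2 kernel
y = dconv_op(x, k)                     # 4x4 -> 8x8
z = dconv_op(y, k, last=True)          # 8x8 -> 16x16, bounded by Tanh
```

Stacking such blocks progressively upsamples the 128-dimensional feature toward the facial image resolution; in the real network the kernels are learned and operate over many channels.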
3 Base and Enhancement Layer Compression
3.1 Deep Feature Compression
The deep feature extracted by FaceNet is a vector in a 128-dimensional hyperspace, and every dimension is represented by a floating-point number. In order to compress and transmit the deep learning feature while guaranteeing the analysis performance, the deep feature undergoes quantization and entropy coding for compression.
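The base-layer coding step can be sketched as uniform scalar quantization of the 128-dimensional feature followed by an entropy estimate of the symbol stream; the step size of 0.05 and the zeroth-order entropy measure below are illustrative assumptions, not the actual codec settings.

```python
import numpy as np
from collections import Counter

STEP = 0.05  # assumed quantization step; trades bit-rate vs. accuracy

def quantize(feat, step=STEP):
    # Uniform scalar quantization of each feature dimension.
    return np.round(feat / step).astype(np.int32)

def entropy_bits(symbols):
    # Empirical zeroth-order entropy: a lower bound on the coded size.
    counts = Counter(symbols.tolist())
    n = len(symbols)
    return -sum(c / n * np.log2(c / n) for c in counts.values()) * n

feat = np.random.randn(128).astype(np.float64)
feat /= np.linalg.norm(feat)       # FaceNet embeddings lie on the unit sphere
q = quantize(feat)                 # integer symbols for entropy coding
dequant = q * STEP                 # decoder-side reconstruction
```

The reconstruction error per dimension is bounded by half the step size, so the verification accuracy degrades gracefully as the step grows.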
3.2 Deep Feature Reconstruction
In previous work, a deep feature reconstruction framework was proposed with the aim of attacking face recognition systems, and we adopt this reconstruction strategy here. The deep learning features are extracted by FaceNet, which embeds each facial image as a point in the hyperspace. The distance between these points reflects the similarity of the corresponding images, and we assume that every dimension represents certain characteristics of the human face. It is straightforward to adopt the Mean Squared Error (MSE) as the loss, but the information contained in the deep learning feature mainly concerns structure rather than detailed texture. Therefore, we adopt a linear combination of the Mean Absolute Error (MAE) and a perceptual measurement as the loss function of the deep feature reconstruction network. The MAE can be formulated as
$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{N}\sum_{i=1}^{N} \left| I_i - \hat{I}_i \right|^{p},$$

where $I$ and $\hat{I}$ represent the original facial image and the image reconstructed from the deep learning feature, respectively, and $N$ is the number of pixels. Here, the parameter $p$ is set to 1.
We adopt the feature maps of the VGG-19 model to compute the perceptual loss. The output of one intermediate layer is utilized empirically to measure the structural information loss of the reconstructed image. Denoting this feature map by $\phi(\cdot)$ with $M$ elements, the perceptual loss can be expressed as

$$\mathcal{L}_{\mathrm{per}} = \frac{1}{M}\sum_{j=1}^{M} \left( \phi(I)_j - \phi(\hat{I})_j \right)^{2}.$$
As such, the loss function of the deep feature reconstruction network is given by

$$\mathcal{L} = \mathcal{L}_{\mathrm{MAE}} + \lambda \, \mathcal{L}_{\mathrm{per}},$$

where $\lambda$ is the balancing factor between the two loss terms.
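The combined loss can be sketched as follows, with a toy average-pooling stand-in for the VGG-19 feature map and an illustrative lambda = 0.1 (the actual balancing factor is not assumed here):

```python
import numpy as np

def mae_loss(img, rec):
    # Pixel-domain MAE term (p = 1).
    return np.mean(np.abs(img - rec))

def perceptual_loss(img, rec, phi):
    # MSE between feature maps phi(.) of a fixed network (VGG-19 in
    # the text); phi here is a placeholder for the chosen layer.
    return np.mean((phi(img) - phi(rec)) ** 2)

def reconstruction_loss(img, rec, phi, lam=0.1):
    # L = L_MAE + lambda * L_per.
    return mae_loss(img, rec) + lam * perceptual_loss(img, rec, phi)

# Toy stand-in for a VGG feature map: 8x8 local average pooling.
phi = lambda x: x.reshape(8, 8, 8, 8).mean(axis=(1, 3))
img = np.random.rand(64, 64)
rec = img + 0.05 * np.random.randn(64, 64)
loss = reconstruction_loss(img, rec, phi)
```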
3.3 Enhancement Layer Compression
For the enhancement layer, the compression of the residual between the original facial image and the image reconstructed from the deep learning feature could be realized by the traditional JPEG and JPEG2000 codecs. Moreover, the distribution of the residual patches can differ from that of common natural images, so a deep learning based model can be utilized for this specific compression task. As such, we also adopt an end-to-end compression framework based on generalized divisive normalization (GDN). The number of features in every convolutional layer for RGB images has been reduced from 192 to 128 here, since the enhancement layer contains less structural information than original images.
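The GDN nonlinearity at the heart of this codec divides each channel response by a learned function of all channels' energies; a minimal NumPy sketch (with stand-in parameters, not learned ones) is:

```python
import numpy as np

def gdn(x, beta, gamma):
    # Generalized divisive normalization across channels:
    #   y_i = x_i / sqrt(beta_i + sum_j gamma_ij * x_j^2)
    # x: (C, H, W); beta: (C,); gamma: (C, C).
    C = x.shape[0]
    sq = x.reshape(C, -1) ** 2                   # per-channel energies
    denom = np.sqrt(beta[:, None] + gamma @ sq)  # (C, H*W)
    return (x.reshape(C, -1) / denom).reshape(x.shape)

x = np.random.randn(128, 16, 16)   # 128 filters per layer, as in the text
beta = np.ones(128)                # stand-in parameters (learned in practice)
gamma = np.full((128, 128), 1e-3)
y = gdn(x, beta, gamma)
```

With beta = 1 and nonnegative gamma the output magnitude never exceeds the input, which is the local gain-control behavior that makes GDN effective for transform coding.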
The min-max normalization is applied to the residual $R$ to reveal the texture information for compression, and the minimum and maximum are also encoded and transmitted to the decoder side to recover the texture scale:

$$\tilde{R} = \frac{R - R_{\min}}{R_{\max} - R_{\min}}.$$
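The normalization and its decoder-side inverse can be sketched as:

```python
import numpy as np

def normalize_residual(res):
    # Min-max normalize the residual to [0, 1]; the min/max pair is
    # side information transmitted to the decoder.
    rmin, rmax = float(res.min()), float(res.max())
    return (res - rmin) / (rmax - rmin), (rmin, rmax)

def denormalize_residual(norm, side_info):
    # Recover the original texture scale from the side information.
    rmin, rmax = side_info
    return norm * (rmax - rmin) + rmin

res = np.random.randn(64, 64)          # residual of input vs. reconstruction
norm, side = normalize_residual(res)   # fed to the enhancement-layer codec
rec = denormalize_residual(norm, side)
```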
The loss function is the MSE between the normalized texture $\tilde{R}$ and the decoded one $\hat{R}$:

$$\mathcal{L}_{\mathrm{EL}} = \frac{1}{N}\sum_{i=1}^{N} \left( \tilde{R}_i - \hat{R}_i \right)^{2}.$$
As such, the enhancement layer is transmitted when necessary to reconstruct the texture with high fidelity.
We implement the deep feature reconstruction model using TensorFlow. We initialize the network following the method of He et al., and the batch size is set to 64. The learning rate decays exponentially from 0.01 to 0.0001 over 50 epochs, and the balancing factor $\lambda$ is set empirically to ensure faithful reconstruction of faces.
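The stated schedule, decaying exponentially from 0.01 to 0.0001 over 50 epochs, corresponds to the following rule (this closed form is our reading of the text, not a quoted implementation):

```python
def lr_schedule(epoch, lr0=0.01, lr_end=0.0001, total=50):
    # Exponential decay: lr(0) = lr0 and lr(total) = lr_end, with a
    # constant per-epoch decay factor in between.
    return lr0 * (lr_end / lr0) ** (epoch / total)
```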
We build our own residual dataset on the basis of the deep feature reconstruction network to train the end-to-end enhancement layer compression model. The optimization algorithm is the Adaptive Moment Estimation (Adam) optimizer, the same as for the feature reconstruction model. The batch size is set to 16 and the learning rate is 0.0002 for 10 epochs.
4 Experimental Results
In this section, we conduct experiments to validate both the rate-accuracy and the rate-distortion performance of the proposed framework. The unconstrained large-scale face dataset VGG-Face2 is used for training. This dataset comprises over 3.3 million face images from 9131 subjects, with over 360 images per subject on average. The proposed framework is tested on a popular face verification dataset, Labeled Faces in the Wild (LFW).
First, we evaluate the rate-accuracy performance of the proposed scheme. We compare the proposed framework with CTA schemes that encode the image with JPEG and JPEG2000. In particular, we adopt the deep learning based approach, JPEG, and JPEG2000 to compress the enhancement layer. The experimental results are shown in Fig. 2. It is obvious that the proposed scheme achieves the best performance. This is not surprising, as it inherits the property of the ATC approach that the features are extracted from the original texture. By contrast, as the texture quality degrades, the performance of the CTA schemes is significantly affected, leading to lower analysis accuracy.
To gain more insight into the texture compression performance of the proposed scheme, we conduct experiments to evaluate the rate-distortion performance. The images reconstructed from deep learning features are shown in Fig. 3, which shows that the main structures of the facial images have been preserved. Furthermore, we compare the compression performance and the subjective quality in Fig. 4. For a fair comparison, when JPEG2000 is used as the anchor, we show the results when the enhancement layer is compressed by both JPEG2000 and the deep learning framework; likewise, for JPEG based comparisons, the enhancement layer is compressed by JPEG instead of JPEG2000. It is obvious that the proposed model improves the coding performance. Moreover, the rate-distortion curves shown in Fig. 5 provide further evidence of the performance improvement in terms of PSNR.
In this work, we propose a scalable scheme for facial image compression. The proposed scheme is composed of a base layer for feature compression and an enhancement layer for texture reconstruction, aiming to assimilate the advantages of both CTA and ATC. Interestingly, we find that the proposed scheme inherits the rate-accuracy performance of ATC while significantly improving the coding performance compared with traditional coding schemes. In the future, we will extend this framework to more domains in surveillance applications (e.g., vehicles and pedestrians).
-  Wen Gao, Yonghong Tian, Tiejun Huang, Siwei Ma, and Xianguo Zhang, “The ieee 1857 standard: Empowering smart video surveillance systems.,” IEEE Intelligent Systems, vol. 29, no. 5, pp. 30–39, 2014.
-  Alessandro Redondi, Luca Baroffio, Matteo Cesana, and Marco Tagliasacchi, “Compress-then-analyze vs. analyze-then-compress: Two paradigms for image analysis in visual sensor networks,” in MMSP. IEEE, 2013, pp. 278–282.
-  David G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
-  Herbert Bay, Tinne Tuytelaars, and Luc Van Gool, “Surf: Speeded up robust features,” in European conference on computer vision. Springer, 2006, pp. 404–417.
-  Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool, “Speeded-up robust features (surf),” Computer vision and image understanding, vol. 110, no. 3, pp. 346–359, 2008.
-  Luca Baroffio, Matteo Cesana, Alessandro Redondi, Marco Tagliasacchi, and Stefano Tubaro, “Coding visual features extracted from video sequences,” IEEE transactions on Image Processing, vol. 23, no. 5, pp. 2262–2276, 2014.
-  Lin Ding, Yonghong Tian, Hongfei Fan, Yaowei Wang, and Tiejun Huang, “Rate-performance-loss optimization for inter-frame deep feature coding from videos,” IEEE Transactions on Image Processing, vol. 26, no. 12, pp. 5743–5757, 2017.
-  Ling-Yu Duan, Vijay Chandrasekhar, Jie Chen, Jie Lin, Zhe Wang, Tiejun Huang, Bernd Girod, and Wen Gao, “Overview of the mpeg-cdvs standard,” IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 179–194, 2016.
-  Ling-Yu Duan, Vijay Chandrasekhar, Shiqi Wang, Yihang Lou, Jie Lin, Yan Bai, Tiejun Huang, Alex Chichung Kot, and Wen Gao, “Compact descriptors for video analysis: The emerging mpeg standard,” IEEE MultiMedia, 2018.
-  Xiang Zhang, Siwei Ma, Shiqi Wang, Xinfeng Zhang, Huifang Sun, and Wen Gao, “A joint compression scheme of video feature descriptors and visual content,” IEEE Transactions on Image Processing, vol. 26, no. 2, pp. 633–647, 2017.
-  Yang Li, Chuanmin Jia, Shiqi Wang, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao, “Joint rate-distortion optimization for simultaneous texture and deep feature compression of facial images,” in BigMM. IEEE, 2018, pp. 1–5.
-  Siwei Ma, Xiang Zhang, Shiqi Wang, Xinfeng Zhang, Chuanmin Jia, and Shanshe Wang, “Joint feature and texture coding: Towards smart video representation via front-end intelligence,” IEEE Transactions on Circuits and Systems for Video Technology, 2018.
-  Guangcan Mai, Kai Cao, Pong C. Yuen, and Anil K. Jain, “On the reconstruction of face images from deep face templates,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
-  Florian Schroff, Dmitry Kalenichenko, and James Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
-  Matt Mahoney, “Data compression programs, http://mattmahoney.net/dc/paq.html,” 2009.
-  Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  Johannes Ballé, Valero Laparra, and Eero P Simoncelli, “End-to-end optimized image compression,” arXiv preprint arXiv:1611.01704, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.
-  Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Qiong Cao, Li Shen, Weidi Xie, Omkar Parkhi, and Andrew Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in International Conference on Automatic Face and Gesture Recognition, 2018.
-  Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller, “Labeled faces in the wild: A database for studying face recognition in unconstrained environments,” Tech. Rep. 07-49, University of Massachusetts, Amherst, October 2007.