Deep Optimized Multiple Description Image Coding via Scalar Quantization Learning

01/12/2020 ∙ by Lijun Zhao, et al. ∙ BEIJING JIAOTONG UNIVERSITY

In this paper, we introduce a deep multiple description coding (MDC) framework optimized by minimizing multiple description (MD) compressive loss. First, an MD multi-scale-dilated encoder network generates multiple description tensors, which are discretized by scalar quantizers, and these quantized tensors are decompressed by MD cascaded-ResBlock decoder networks. To greatly reduce the total number of artificial neural network parameters, the autoencoder network composed of these two types of network is designed as a symmetrical parameter sharing structure. Second, this autoencoder network and a pair of scalar quantizers are simultaneously learned in an end-to-end self-supervised way. Third, considering the variation in the image spatial distribution, each scalar quantizer is accompanied by an importance-indicator map to generate MD tensors, rather than using direct quantization. Fourth, we introduce the multiple description structural similarity distance loss, which implicitly regularizes the diversified multiple description generations, to explicitly supervise multiple description diversified decoding in addition to MD reconstruction loss. Finally, we demonstrate that our MDC framework performs better than several state-of-the-art MDC approaches regarding image coding efficiency when tested on several commonly available datasets.




I Introduction

Packet loss and bit errors are often unavoidable when Internet data are transmitted over unreliable channels [1]. As more and more people access the Internet in daily life with mobile devices such as hand-held tablets and cell phones, whose data packets are conveyed over wireless communication channels, incomplete decoding or complete loss of data is very likely to occur. Multiple description coding (MDC) [1, 2, 3] is one of the most promising error-resilient coding techniques for making real-time transmission systems simpler and more robust to a lossy channel in a challenging environment. Compared with layered coding, this mechanism does not need to assign a priority to each packet or to design a data re-transmission mechanism, and it does not require any signal feedback. The MDC technique generates several independent yet correlated bit streams, which facilitates the independent decoding of each packet, while joint decoding of two or more bit streams is also supported.

Traditional MDC approaches have been widely studied in the last decades, among which the derivation of multiple description (MD) theoretical rate-distortion regions is a fundamental and significant topic [4]. Meanwhile, in practice, the achievable rate-distortion regions have gradually approached the boundaries of the theoretical MDC rate-distortion regions [4, 2, 5, 6, 7, 8, 9, 3, 10, 11]. However, many traditional MDC methods face difficult problems. For example, MD quantizers often must find the optimal index assignment for multiple description generation [5, 6, 12, 13, 14, 15, 16], which becomes an extremely complicated problem when quantizing more than two descriptions. For the correlating transform-based MDC framework, good performance can be achieved when a small amount of redundancy is added [7, 8, 9, 17]. However, this framework does not always perform well if more redundancy is introduced into the multiple descriptions. Different from multiple description quantizers and the correlating transform-based MDC framework, sampling-based multiple description coding is more flexible and compatible with the standard coders. However, most existing sampling-based MDC methods are built on specifically designed sampling methods or extend an existing sampling operator for multiple description generation [3, 10, 11, 18], and their coding efficiency is limited. Consequently, sampling-based MDC methods should be further developed. Recently, a convolutional neural network (CNN)-based JPEG-compliant MDC framework [2] has been used to sample an input image to adaptively create multiple description images, but its coding efficiency is not very high because the use of standard JPEG limits the performance of this framework. In summary, multiple description coding should be comprehensively studied for error resilience against bit errors and packet loss over an unpredictable channel.

In this paper, a deep multiple description image coding framework is proposed for robust transmission. Our contributions are listed below:

  1. We design a general deep optimized MD coding framework based on artificial neural networks. Our framework has several main parts: an MD multi-scale-dilated encoder network, a pair of multiple description scalar quantizers, and MD cascaded-ResBlock decoder networks, as well as conditional probability models.

  2. A pair of scalar quantization operators is automatically learned in an end-to-end self-supervised way for generation of diversified multiple descriptions. Meanwhile, each scalar quantizer is accompanied by an importance-indicator map to quantize feature tensors according to the change in the image spatial content.

  3. We propose the use of the multiple description structural similarity distance loss to supervise the decoded multiple description images, which implicitly regularizes diversified multiple description generations and scalar quantization learning. Please note that our multiple description structural similarity distance loss is different from the pixel-wise mean-absolute-deviation distance loss from [2]. These two types of distance loss work on different spaces, among which the proposed distance is imposed on the decoded images, whereas the distance loss from [2] regularizes feature tensors produced by the MD generation neural network.

  4. A symmetrical parameter sharing structure is designed for our autoencoder network to greatly reduce the total amount of neural network parameters. Specifically, the parameters of the preceding convolutional layers in the encoder network are shared for multiple description generation, while the decoder networks have a symmetrical structure by sharing the parameters of the back layers.

  5. A pair of conditional probability models is learned to estimate the informative amount of the quantized tensors, which can supervise the learning of the multiple description multi-scale-dilated encoder network to compactly represent the input image.

The rest of this paper is organized as follows. First, four kinds of MDC methods are reviewed in Section II and the problem formulation of deep image compression is presented in Section III. Second, the proposed multiple description coding framework is introduced in Section IV. Third, experimental results and analysis are presented in Section V. Finally, we conclude our paper in Section VI.

II Related works

We mainly review four kinds of MDC methods: MD quantizers, correlating transform-based MDC methods, sampling-based MDC and standard-compliant MDC.

II-A Multiple description quantizers

For quantization-based MDC methods, there are three primary classes: scalar quantizers, trellis-coded quantizers, and lattice vector quantizers. In early work, the design of MD scalar quantizers constrained by the symmetric entropy was formulated as an optimization problem [5]. Following this work, linear joint decoders were developed to resolve the problem of the sharp increase in computation [12] when generalizing this method from two to L descriptions during encoder optimization. In [6], MD distortion-rate performance is derived for certain randomly generated quantizers. By generalizing randomly generated quantizers, the theoretical performance of MDC with randomly offset quantizers is given in closed-form expressions [13]. To increase the robustness to bit errors, linear permutation pairs are developed for index assignment for two-description scalar quantizers [14]. In [15], a trellis is formed by the tensor product of trellises for multiple description coding.

Scalar quantizers and trellis-coded quantizers usually require complicated index assignment. Compared with these quantizers, lattice vector quantizers have been widely studied because they have many advantages, such as a symmetrical structure, avoidance of complex nearest-neighbor search, and no need for codebook design. In [16], the design problem of the lattice vector quantizer is cast as a labeling problem, and a systematic construction method is presented for general lattices. To further improve the performance of MD vector quantization at the cost of a slight complexity increase [19], the fine lattice codebook is replaced by a non-lattice codebook. In [20], a structured bit-error resilient mapping is leveraged to make MD lattice vector quantizers resilient to transmission bit errors by exploiting the intrinsic structural characteristics of the lattice. In [21], a heuristic index assignment algorithm is given to control coding distortions in order to balance the reconstruction quality of different descriptions. In summary, MD quantizers always involve complicated index assignments, whose complexity is very high, especially when more than two descriptions are to be generated.

II-B Correlating transform-based MDC methods

The transform-based MDC framework employs a pairwise correlating transform to correlate pairs of transform coefficients [7], in which the best strategy for image compression redundancy allocation is given. This framework considers only the case of coding two descriptions. In [8], this transform-based approach is generalized to create L descriptions to enhance the transmission system robustness with a small quantity of redundancy. To satisfy the channel bandwidth ratio, a ratio-configurable MD correlating transform coding [9] is introduced to adjust the description data size ratio. In [17], a gradient search is provided to determine the correlating transform, while statistical dependencies between different descriptions help to estimate the lost transform coefficients when transmitting over a lossy network. Although these correlating transform-based MDC methods provide an effective MD generation method, this kind of MDC method tends to be inefficient when a great deal of information redundancy is expected to be introduced into multiple descriptions.

II-C Sampling-based multiple description coding

Multiple descriptions can be created from different domains for input image/video coding in various ways. For instance, diverse descriptions can be obtained from the spatial domain, frequency domain or temporal domain. In [22], two MD video coders employ a poly-phase downsampling technique to create multiple descriptions. In [23], adaptive temporal-spatial error concealment is applied for MD video coding, in which multiple descriptions are obtained by spatial subsampling. For MD video coding, motion information from the temporal domain is often estimated in the encoder as a redundancy. To quickly recover from errors on the decoder side without complicated motion searching for MD video coding, motion vectors are sampled according to each description’s redundancy, which could further strengthen error resilience when video streams are transmitted over error-prone wireless networks [24]. By selecting the temporal level key pictures, the hierarchical-B-pictures-based MD video coding framework adopts the simplest odd/even frame splitting method of generating redundancy descriptions as a special case of the framework. Among these approaches, some of the sampling-based MDC methods can be compatible with standard image/video compression. However, their sampling operators are manually designed, which limits multiple description coding efficiency, so adaptive sampling methods for generating multiple descriptions should be investigated in depth, especially those based on deep learning.
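As a rough illustration of the hand-designed sampling operators discussed in this subsection, the following minimal sketch splits an image into two descriptions by even and odd columns. It is not the method of any cited work; the function names and the pixel-repetition concealment are assumptions chosen only for illustration.

```python
def split_columns(image):
    """Split an image (list of rows) into even- and odd-column descriptions."""
    even = [row[0::2] for row in image]
    odd = [row[1::2] for row in image]
    return even, odd

def merge_columns(even, odd):
    """Central decoding: interleave both descriptions into the full image."""
    merged = []
    for e_row, o_row in zip(even, odd):
        row = []
        for i in range(len(e_row) + len(o_row)):
            row.append(e_row[i // 2] if i % 2 == 0 else o_row[i // 2])
        merged.append(row)
    return merged

def conceal_from_even(even, width):
    """Side decoding (hypothetical): repeat each received pixel to fill gaps."""
    return [[row[min(i // 2, len(row) - 1)] for i in range(width)]
            for row in even]

image = [[1, 2, 3, 4], [5, 6, 7, 8]]
even, odd = split_columns(image)
print(merge_columns(even, odd))      # both descriptions received
print(conceal_from_even(even, 4))    # only the even description received
```

The fixed split is the kind of manually designed operator that a learned encoder would replace with an adaptive one.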

II-D Standard-compliant multiple description coding

Research on the standard-compliant MDC framework has become increasingly significant due to the wide usage of standard image/video codecs such as JPEG and H.264, which have become an essential part of daily life. Over the past decades, there have been numerous works regarding standard-compliant MD coding [3, 10, 11]. For instance, H.264-compliant MD video coding is achieved by interlacing primary and redundant slices [3], whose rate distortion is optimized in an end-to-end manner. In [18], a hybrid layered MD video coding algorithm uses the H.264/AVC encoder to compress a video at low bit rates as the base layer, and then four bit streams are divided into two groups as two description enhancement layers after 3D dual-tree discrete wavelet transform and 3D-SPIHT encoding. For HEVC-compatible MD video coding, the redundancy allocation is assigned based on a visual saliency model [10]. Most recently, image representation-based compression with CNNs [11] has been extended for multiple description image generation to form a standard-compliant convolutional neural network-based MDC framework [2], which is trainable in an end-to-end fashion. Although great progress has been achieved for the class of standard-compliant multiple description coding [18, 3, 10], their coding efficiency can still be improved considerably by artificial neural networks in various ways such as pre-filtering and post-processing [2, 11, 26].

III Problem formulation for deep image compression

In this paper, deep image compression is defined as a new class of image compression approaches that uses artificial neural networks to learn image nonlinear transform and inverse transform from big-data. This class differs from standard image compression with a classical linear transform such as discrete cosine transform or discrete wavelet transform. Since the proposed framework involves extensive machine learning knowledge, we first introduce some definitions such as concepts and terminology for deep image compression. Then, we introduce and review the deep image compression framework. Finally, the extension of deep image coding to multiple description coding is discussed.

III-A Definitions

Generally, the discrete representation of continuous data can be called data quantization, and this quantization is a form of hard quantization. The derivative of the hard type of quantization function is almost zero everywhere, except at the quantized integer point. If a quantization function is differentiable in the definition domain, then it can be called a soft quantization function. In [27], a soft quantization function is defined based on soft allocation with softmax function. Given a -dimension vector , the softmax function for can be defined as:


in which . The elements of the center variable vector are not limited to integers and can also be floating-point numbers. A soft assignment on a variable can be defined as: , in which denotes the -norm. The corresponding hard assignment can be written as: =, which finally converges to the nearest center of a vector in one-hot encoding, that is to say:


in which is the absolute value of . Here, the one-hot encoding of refers to a zero-vector along the new dimension with length of , except for its th element to be 1. Using the above soft assignment, the soft quantization and hard quantization can be defined respectively as:


in which is element-wise multiplication, is the nearest center to of the center vector and controls the smoothness strength of the soft-quantization function.
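The relation between soft and hard quantization can be sketched in a few lines. This is only a minimal illustration in the spirit of the soft assignment of [27]: the squared distance, the variable names (`centers`, `sigma`) and the example values are assumptions, not the exact definitions of this paper.

```python
import math

def soft_assignment(z, centers, sigma):
    """Softmax over negative scaled squared distances to each center (assumed form)."""
    logits = [-sigma * (z - c) ** 2 for c in centers]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_quantize(z, centers, sigma):
    """Differentiable surrogate: assignment-weighted sum of the centers."""
    weights = soft_assignment(z, centers, sigma)
    return sum(w * c for w, c in zip(weights, centers))

def hard_quantize(z, centers):
    """Nearest center by absolute distance; returns (index, center value)."""
    idx = min(range(len(centers)), key=lambda j: abs(z - centers[j]))
    return idx, centers[idx]

centers = [-1.0, 0.0, 0.5, 2.0]   # hypothetical centers; floats are allowed
z = 0.4
print(hard_quantize(z, centers))               # index and value of the nearest center
print(soft_quantize(z, centers, sigma=1.0))    # smooth blend of all centers
print(soft_quantize(z, centers, sigma=1e4))    # approaches the hard value
```

As the smoothness parameter grows, the soft assignment concentrates on the nearest center, so the soft output approaches the hard output.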

Fig. 1: The diagram for deep image compression framework.

III-B Deep image compression framework

As with the standard coders, general deep image compression usually consists of the following main parts: a nonlinear analysis transform (also known as the encoding network), hard quantization function (as well as soft quantization function for training), nonlinear synthesis transform (also known as the decoding network), a conditional probability model to estimate the entropy, and distortion function to measure how much image information is lost after compression, as shown in Fig. 1. Given an input image , the feature tensor can be obtained through nonlinear analysis transform , i.e., . Then, the hard quantization function is used to discretize the feature tensor to reduce image compression bits, i.e., . Finally, the conditional probability model is used to calculate the rate of the quantized tensor [28]. On the decoder side, the quantization vector is reconstructed as by the nonlinear synthesis transform ; that is, . In addition, the distortion function is used to measure the distortion of the compressed image. According to the rate-distortion theory of image coding [27], the objective rate-distortion optimization function of deep image compression can be defined as:


where the first term is the image compression bit rate constraint, while the second term includes compressed image distortion and the network-parameter regularization term . This equation includes the hard (nondifferentiable) quantization function , so it cannot be directly optimized by general optimization methods, e.g., stochastic gradient descent or Adam. To resolve this problem, a general solution [29, 27, 30, 11, 28] replaces the hard quantization with soft quantization during back-propagation for training (see the bold dashed orange-pink lines in Fig. 1), while forward-propagation uses the hard quantization (see the bold black lines in Fig. 1).
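The hard-forward, soft-backward strategy can be illustrated for a single scalar. The sketch below assumes a softmax-weighted soft quantizer over squared distances (an assumption, as are all names); the forward function returns the nearest center, while the backward function returns the analytic derivative of the soft quantizer, which would be used in place of the zero gradient of the hard one.

```python
import math

def soft_weights(z, centers, sigma):
    logits = [-sigma * (z - c) ** 2 for c in centers]
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def soft_quantize(z, centers, sigma):
    return sum(p * c for p, c in zip(soft_weights(z, centers, sigma), centers))

def quantize_forward(z, centers):
    """Forward pass: hard (nearest-center) quantization."""
    return min(centers, key=lambda c: abs(z - c))

def quantize_backward(z, centers, sigma):
    """Backward pass: d/dz of the soft quantizer sum_j p_j(z) * c_j."""
    p = soft_weights(z, centers, sigma)
    dlogit = [-2.0 * sigma * (z - c) for c in centers]        # d(logit_j)/dz
    mean_dlogit = sum(pj * dj for pj, dj in zip(p, dlogit))
    dp = [pj * (dj - mean_dlogit) for pj, dj in zip(p, dlogit)]  # softmax chain rule
    return sum(dpj * c for dpj, c in zip(dp, centers))

centers = [-1.0, 0.0, 0.5, 2.0]   # hypothetical centers
print(quantize_forward(0.4, centers))          # value passed to the decoder
print(quantize_backward(0.4, centers, 1.0))    # surrogate gradient w.r.t. z
```

In an automatic-differentiation framework the same effect is usually obtained by adding the detached difference between the hard and soft outputs to the soft output.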

The earliest works on image compression based on artificial neural networks primarily study the nondifferentiability problem of the quantization function and the artificial neural network structure design problem [27, 30, 11, 28], as well as how to make compressed images more realistic using perceptual loss functions [31]. For example, the compressive autoencoder with an encoder network and a decoder network is often chosen as the image compression network [29, 27, 30, 11, 28], in which the encoder network condenses the input image into bit streams that are as small as possible, and the decoder network maps them back to a lossy image. In [29], the derivative of the stochastic rounding operation is replaced with the derivative of the expectation for back-propagation, but nothing in the forward direction is changed. Like [29], soft relaxation of quantization is used to resolve the nondifferentiability problem of the quantization function [27]. Different from [29, 27], thumbnail images are compressed by a recurrent neural network architecture in [30], in which a stochastic rounding operation binarizes the feature maps. Recently, a virtual codec network has been learned to imitate the projection from the represented vectors to the decoded images to make the image compression framework trainable in an end-to-end way [11]. Apart from the quantization problem, the autoencoder should be constrained to satisfy a given bit-per-pixel (bpp) budget when learning a compressive autoencoder for image compression. In [28], a conditional probability model for the latent distribution of the autoencoder is learned to supervise the autoencoder network.

III-C Extension to deep multiple description image coding

Until now, there has been little work on deep multiple description coding, although deep image compression methods have been increasingly explored. To the best of our knowledge, our work is the first deep multiple description coding framework. There are two possible methods of deep multiple description coding. In the first, multiple descriptions can be directly generated by an artificial neural network and are discretized by a general quantization to reduce coding bits. If the multiple descriptions refer to multiple description images, which are compressed by the standard codec, then it becomes the multiple description coding framework described in [2]. In this framework, JPEG compression can be replaced by a deep image compression approach such as [28]. Different from [2], our deep optimized multiple description image coding belongs to the class of methods utilizing the second approach: when the artificial neural network represents an input image as feature tensors, a pair of scalar quantizers is learned to quantize the feature tensors for diversified multiple description generation. Recently, a convolutional autoencoder-based multiple description coding method [32] extracts features by learning, which improves image coding efficiency. However, this method suffers from severe coding artifacts, similar to conventional MDC approaches.

IV The proposed deep multiple description coding

Fig. 2: The diagram for the proposed deep multiple description coding framework.

In this paper, we introduce a deep optimized multiple description coding framework that is entirely built upon artificial neural networks, as displayed in Fig. 2. Our framework is primarily composed of a multiple description encoder, a multiple description decoder, and a pair of conditional probability models. The multiple descriptions are generated by the multiple description encoder, which is supervised by the conditional probability models to estimate the entropy of each description, while the received multiple descriptions are decompressed by the multiple description decoder. Specifically, given an input image with a size of , the multiple description multi-scale-dilated encoder network in the multiple description encoder decomposes the input image into a feature tensor , as well as two importance-indicator maps and . Here, is the spatial size of the feature tensor , while is the number of feature maps for this feature tensor. The importance map indicates which regions should be emphasized during image coding, which is a kind of region-of-interest (ROI) based coding [33], considering that the spatial distributions of various images are different from each other. M. Li and M. Fabian as well as their co-author, reported that a content-weighted importance map can be used for guidance to allocate image content-aware bit rate for deep image coding based on region-of-interest [34, 28]. In consideration of the image spatial variation, each importance-indicator map can be directly multiplied by each feature map of the feature tensor in an element-wise manner; that is, the -feature maps for the feature tensor share the same importance-indicator map. However, vital yet different roles are played by these -feature maps for multiple description image compression. According to the work of M. Fabian and co-workers [28], the expansion operation for is written as:


in which the function is a repetition function used to repeat a given matrix with times along the axes of , to form a new matrix. Furthermore, , , , the function truncates the value between and , and the operation inserts a dimension of 1 at the last dimension index axis of input shape. We can expand in the same way. Then, the expansion of each importance-indicator map is multiplied by the feature tensor in an element-wise manner to obtain two new feature tensors, and , before scalar quantization. From Eq. (7), it can be known that only the elements of will have valid values for the -th map of along the first dimension, according to the Euclidean distance between and . As a result, the -th map of will influence the multiple description generation of . The significance of importance-indicator maps for deep multiple description image coding will be discussed later.
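A minimal sketch of expanding a two-dimensional importance map into a per-channel mask, in the general spirit of [28], is given here. It assumes that the map takes values in the range from 0 to the number of channels and that channel k receives the value of the map minus k, truncated to [0, 1]; the exact form of Eq. (7) may differ, and all names are hypothetical.

```python
def expand_importance_map(imp, num_channels):
    """Expand an H x W map with values in [0, num_channels] into a
    num_channels x H x W mask with values in [0, 1] (assumed rule)."""
    def clip01(v):
        return max(0.0, min(1.0, v))
    return [[[clip01(v - k) for v in row] for row in imp]
            for k in range(num_channels)]

def apply_mask(features, mask):
    """Element-wise product of a C x H x W feature tensor and its mask."""
    return [[[f * m for f, m in zip(f_row, m_row)]
             for f_row, m_row in zip(f_ch, m_ch)]
            for f_ch, m_ch in zip(features, mask)]

imp = [[0.0, 1.5], [3.0, 2.0]]          # hypothetical 2 x 2 importance map
mask = expand_importance_map(imp, 3)    # 3 x 2 x 2 mask
features = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]
print(mask)
print(apply_mask(features, mask))
```

Under this rule, a position with a small importance value keeps only its first few channels, so fewer symbols carry information there.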

To reduce multiple description coding bits, the two feature tensors and are quantized by scalar quantizer-I and scalar quantizer-II, respectively. Due to the nondifferentiability of hard quantization, we follow the work of E. Agustsson [27] to utilize soft quantization to make our framework trainable in an end-to-end way. Scalar quantizer-I with Eq. (2) is represented as:


where is a center variable vector and represents the -th element of tensor . In a similar way, scalar quantizer-II with a new center variable vector can be defined. The vectors of the quantized tensors and can be represented as and . This pair of scalar quantizers, accompanied by two importance-indicator maps, can be simultaneously learned in an end-to-end self-supervised way. In other words, there is no label for multiple description generation, the ultimate goal of which is to independently decode each side image with an acceptable image quality or jointly decode a better-quality central image, so the side images are used as the opposing labels for each description redundancy measurement with the distance loss for self-supervised learning. Both tensors and can be converted into one-hot tensors and by adding a new dimension, thus producing two descriptions. The generated descriptions are losslessly encoded by arithmetic coding for transmission. During forward-propagation, hard quantization is leveraged to obtain discrete feature tensors according to Eq. (2), but we use the derivative of the soft quantization function [27] to back-propagate gradients from the multiple description decoder to the multiple description encoder, in a manner similar to that shown in Fig. 1.

Fig. 3: The structure diagram of the convolutional neural network for the deep multiple description coding framework. (Note that 3x3x64 means that the spatial kernel size is 3x3 and the number of output feature maps is 64, with a default stride of 1 in this convolutional layer. 3x3x64/2 has the same meaning except that the stride is 2. Other convolutional layers are denoted similarly. X5 in Resblock means that the module of three cascaded Resconv layers with a skip connection is repeated five times.)

At the receiver, the decoded one-hot tensors and can be reversibly converted into the tensors and , as displayed in Fig. 1. Then, these tensors are treated by corresponding scalar de-quantizers. The scalar de-quantizer-I can be written as , which returns the -th of the center variable . Similarly, scalar de-quantizer-II can be written as . Side decoder network-A (or side decoder network-B) is used to decompress the quantized tensor (or ) as the lossy image (or ), as shown in Fig. 3, if only one description is received at the decoder, when each description is transmitted over an unpredictable channel. If both of the descriptions are received, then the central decoder network is leveraged to jointly decompress quantized tensors and as the central decoded image .
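The receiver-side logic can be sketched as follows: one-hot symbols are mapped back to indices, indices are mapped to center values by the de-quantizer, and the decoder network is chosen according to which descriptions arrived. The decoder networks themselves are not modeled; the string labels and function names are placeholders introduced for illustration.

```python
def dequantize(one_hot_rows, centers):
    """Map each one-hot row to its index, then to the corresponding center."""
    return [centers[row.index(1)] for row in one_hot_rows]

def select_decoder(got_first, got_second):
    """Choose the decoder network for the set of received descriptions."""
    if got_first and got_second:
        return "central"
    if got_first:
        return "side_A"
    if got_second:
        return "side_B"
    return None  # nothing arrived, so no image can be decoded

centers_1 = [-1.0, 0.0, 0.5, 2.0]   # hypothetical centers of quantizer-I
received = [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 1]]
print(dequantize(received, centers_1))
print(select_decoder(True, False))
```

Each quantizer keeps its own center vector, so the two descriptions are de-quantized with different tables before decoding.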


IV-A The objective function of multiple description coding

Similar to traditional single description image compression, the objective function of multiple description coding requires balancing two fundamental parts: the coding bit rate and multiple description image decoding distortion. The mean square error (MSE) is often employed to measure image compression distortion. However, the human visual perceptual quality of a compressed image with high MSE may be higher than that for images with low MSE [35]. There are a multitude of reasons for objective-subjective quality mismatch, such as one-pixel position shifting of the whole-image and new-pixels covered over details-lacking textural regions generated by generative adversarial networks to obtain a sense of reality.

Compared to MSE loss, mean absolute error (MAE) loss can better regularize image compression to advance the compressed images towards the ground-truth images during training [36, 37]. Thus, our framework uses MAE loss for both side decoded images and central decoded image as the first part of our multiple description reconstruction loss, which can be written as follows:


in which denotes the -norm and is a trade-off parameter to control the redundancy between average side reconstructions and the corresponding central reconstruction. Meanwhile, to better measure image distortion for structural preservation, we introduce the multi-resolution (MR) structural similarity index (MR-SSIM) as an evaluation factor of image quality between and according to [38], which is written in Eq. (10). Here, is the downsampled image, whose size is times less than that of , and is the sum operator of . Meanwhile, and are two constants, while and are, respectively, the mean map and the variance map of calculated from each pixel’s neighborhood window, while is the covariance map calculated from each pixel’s co-located neighborhood windows in and . Additionally, the weight vector = for different scales is , which is linearly proportional to each-scale image size; that is, the weight of a large-size image is greater than that of a small-size image. However, image distortions at different scales are of very different importance with respect to perceived quality. In contrast to the MR-SSIM, the MS-SSIM from the literature [38] uses the weight vector = . These weights indicate that the images at intermediate scales are more significant than the largest- and smallest-scale images. More discussions about MR-SSIM and MS-SSIM will be provided later. The total structural dissimilarity loss as the second part of our multiple description reconstruction loss can be written as:


Unlike single description image compression with only one bit stream produced by image coder, multiple description coding should generate diversified multiple descriptions, between which some redundancy is shared, but each description has its unique information. The redundancy of these descriptions makes the receiver capable of decoding an acceptable quality image, even though one description is missing when multiple description bit streams are transmitted over an unstable channel. However, when different descriptions contain excessive shared information, the central image quality will not exhibit great improvements, even though all of the descriptions are obtained by the client. In [2], multiple description distance loss is directly employed to supervise multiple description generation in feature space. Although the feature tensor of each description can be regarded as the opposite label to regularize each other, the learning problem of multiple description coding in feature space is often challenging, because the same multiple description reconstruction may arise from the composition of different features. To allow our framework to automatically generate multiple diverse descriptions, we propose to use the multiple description structural similarity distance loss. This distance loss not only explicitly regularizes decoded multiple description images to be different but also implicitly supervises scalar quantization learning and diversified multiple description generation, which is written as:


To compactly represent multiple description feature tensors, there should be an entropy regularization term for our deep MDC framework training. However, our framework is built upon an artificial neural network, so the conditional probability model for the entropy regularization term should be differentiable, i.e., the conditional probability model should also use the artificial neural network. To precisely predict each description coding cost, we use two entropy estimation networks without parameter sharing as the conditional probability model. Following the work of [28], we use context-based entropy estimation neural networks in our framework to efficiently estimate multiple description coding costs during training, as shown in Fig. 2.
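The coding cost that such a probability model supervises can be illustrated with the standard ideal code length: the sum of the negative base-2 logarithms of the probabilities assigned to the symbols actually produced. In the sketch below, the per-position probability tables stand in for the output of a context model and are hypothetical inputs.

```python
import math

def estimated_bits(symbols, probs):
    """Ideal code length in bits: sum of -log2 p(symbol) over all positions.
    probs[i] is the predicted distribution over center indices at position i."""
    return sum(-math.log2(p[s]) for s, p in zip(symbols, probs))

symbols = [0, 0]                           # quantization indices of one description
probs = [[0.5, 0.5], [0.25, 0.75]]         # hypothetical model predictions
print(estimated_bits(symbols, probs))      # 1 bit + 2 bits
```

A better probability model assigns higher probability to the observed symbols, which lowers this estimate and thus the expected arithmetic-coding cost.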

In our framework, the regularization loss from these context-based entropy estimation neural networks supervises the learning of the multiple description encoder, which leads to compact multiple description generation. The estimated coding costs for each description are denoted as and . Because and come from the hard quantizers, the is nondifferentiable, although the context-based entropy estimation neural networks are differentiable. If our scalar quantization uses the soft quantizers to obtain and for each description coding cost prediction, which are the approximations of and , then and can be differentiable. As a result, the loss from the context-based entropy estimation neural networks can be back-propagated to the multiple description encoder. Finally, the multiple description compressive loss for our trainable framework in Fig. 2 can be written in Eq. (13), in which is the parameter regularization term for our artificial neural networks. , , and are three hyperparameters. The , , and in the final multiple description compressive loss in Eq. (13) include and from the hard quantizers. To make the framework trainable, training follows the scheme of Fig. 1: forward propagation uses hard quantization, whereas back-propagation employs soft quantization.
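
The hard-forward, soft-backward idea can be illustrated with a minimal sketch. It assumes a softmax-weighted soft quantizer in the spirit of the soft-to-hard scheme of [27]; the centers, the temperature `sigma`, and all function names are hypothetical and are not taken from our implementation. The gradient of the soft quantizer is estimated by finite differences only to keep the example free of dependencies.

```python
import math

def hard_quantize(x, centers):
    """Map x to its nearest quantization center (piecewise constant in x)."""
    return min(centers, key=lambda c: abs(x - c))

def soft_quantize(x, centers, sigma=1.0):
    """Softmax-weighted average of the centers; smooth in x."""
    logits = [-sigma * (x - c) ** 2 for c in centers]
    top = max(logits)  # subtract the largest logit for numerical stability
    weights = [math.exp(l - top) for l in logits]
    return sum(w * c for w, c in zip(weights, centers)) / sum(weights)

def soft_grad(x, centers, sigma=1.0, eps=1e-5):
    """Central finite-difference estimate of d(soft_quantize)/dx."""
    upper = soft_quantize(x + eps, centers, sigma)
    lower = soft_quantize(x - eps, centers, sigma)
    return (upper - lower) / (2 * eps)

centers = [-1.0, 0.0, 1.0]  # hypothetical quantization centers
x = 0.3

forward_value = hard_quantize(x, centers)         # value used in the forward pass
backward_slope = soft_grad(x, centers, sigma=2.0)  # slope used in the backward pass
```

In an automatic-differentiation framework, the same behaviour is commonly obtained by evaluating `soft + stop_gradient(hard - soft)`, which returns the hard value while differentiating only the soft term. As `sigma` grows, the soft quantizer approaches the hard one.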

Fig. 4: The structure diagram of multiple description cascaded-ResBlock decoder networks with the parameter sharing.

IV-B Network

To fully explore image context information, we propose a multiple description multi-scale-dilated encoder network to create a feature tensor , as well as two importance-indicator maps, and , for multiple description generation, as shown in Fig. 2. As discussed in [39], the dilated convolution can significantly enlarge the image receptive field, but it may introduce the grid effect. In [40], several dilated-convolutional layers are defined as a hybrid dilated convolution to resolve this problem for semantic segmentation. Inspired by these works, we use three cascaded dilated convolutions to extract multi-scale features, since this is vital to leverage image context information for diversified multiple description generation. Each cascaded dilated convolution is composed of three layers of dilated convolutions, as displayed in Fig. 3. First, a convolutional layer transforms the input image into 64-channel feature maps before the three cascaded dilated convolutions are applied. Second, each cascaded dilated convolution is followed by a downsampling convolutional layer with a stride of 2 to shrink the spatial size of the feature maps. In addition, for multi-scale aggregation, the output features of the first cascaded dilated convolution are down-sampled by a convolutional layer with a stride of 4, and those of the second cascaded dilated convolution are down-sampled by a convolutional layer with a stride of 2. Finally, the features from different scales are concatenated and then aggregated by a convolutional layer to leverage image multi-scale context information, as depicted in Fig. 3. Although the feature tensor and the importance-indicator maps could be generated separately by three structurally similar networks in the MD multi-scale-dilated encoder network, the input is shared by these networks to produce these tensors.
Without parameter sharing, the number of parameters increases enormously, which raises both memory consumption and computational complexity. Accordingly, almost all of the convolutional layers are shared in the MD multi-scale-dilated encoder network, except for the last layers, which create the different kinds of tensors (see the MD multi-scale-dilated encoder network in Fig. 3).
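
The receptive-field growth and the grid effect mentioned above can be checked with a small one-dimensional calculation. This is only a sketch: the 3-tap kernels and the dilation rates `(2, 2, 2)` and `(1, 2, 5)` are assumptions chosen for illustration, since the rates of our cascaded dilated convolutions are not restated here.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field along one axis of stacked stride-1 convolutions."""
    field = 1
    for k, d in zip(kernel_sizes, dilations):
        field += (k - 1) * d
    return field

def touched_offsets(kernel_sizes, dilations):
    """Input offsets along one axis that can influence one output sample."""
    offsets = {0}
    for k, d in zip(kernel_sizes, dilations):
        taps = [(i - (k - 1) // 2) * d for i in range(k)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets

kernels = [3, 3, 3]
same_rate = [2, 2, 2]   # assumed rates: the same dilation repeated
mixed_rate = [1, 2, 5]  # assumed rates: increasing, hybrid-style dilation

for name, rates in (("same", same_rate), ("mixed", mixed_rate)):
    field = receptive_field(kernels, rates)
    used = len(touched_offsets(kernels, rates))
    print(f"{name}: receptive field {field}, input positions used {used}")
```

With the repeated rate, only every second input position inside the receptive field contributes, which is the grid effect. With the mixed rates, every position inside a larger receptive field contributes.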

After receiving the de-quantized tensors and , we propose to use MD cascaded-ResBlock decoder networks to decompress these tensors since the learning of ResConv (denoted as Res) with a shortcut connection is a type of residual learning, whose gradients can be easily back-propagated. The use of ResConv can also avoid the gradient vanishing problem. From Fig. 3, it can be seen that the side decoder networks and central decoder network share a similar network structure, except for different inputs. When all multiple descriptions are received, both de-quantized tensors and are concatenated and fed into the central decoder network for image decoding. If one description is missing but the other description is received, side decoder network-A or side decoder network-B is chosen to decode the de-quantized tensor or . These networks are composed of three deconvolutional layers and two ResBlocks. Each of the first two deconvolutional layers is followed by one ResBlock. As depicted in Fig. 3, 16 ResConv blocks are employed with skip-connection in each ResBlock.
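
The decoder selection described above can be sketched as follows. The function name, the string labels, and the representation of a tensor as a list of channels are hypothetical; a lost description is represented by `None`.

```python
def select_decoder(tensor_a, tensor_b):
    """Choose the decoder and its input from the descriptions that arrived.

    Each argument is a list of channels, or None if that description was lost.
    """
    if tensor_a is not None and tensor_b is not None:
        # Both descriptions received: concatenate along the channel axis.
        return "central", tensor_a + tensor_b
    if tensor_a is not None:
        return "side_A", tensor_a
    if tensor_b is not None:
        return "side_B", tensor_b
    # Nothing was received, so nothing can be decoded.
    return None, None

# Toy de-quantized tensors with two channels each.
description_a = ["a0", "a1"]
description_b = ["b0", "b1"]

both = select_decoder(description_a, description_b)
only_b = select_decoder(None, description_b)
```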

To substantially reduce the total number of parameters of the MD cascaded-ResBlock decoder networks, a symmetrical structure is designed. This structure includes two parts that share parameters: the back layers in the decoder networks and the preceding convolutional layers in the encoder network. The benefits of the parameter sharing include a decrease in the number of network parameters, the acceleration of training, and the avoidance of over-fitting. However, a new problem must be considered: the parameter sharing strongly affects the image coding efficiency, so we add a three-layer cascaded dilated convolution before the parameter sharing, as shown in Fig. 4, which is ablated in the experimental section. As described above, the compact multiple description generation should be restricted by the entropy regularization term. To efficiently estimate the entropy of the feature tensor, Mentzer et al. presented context-based entropy estimation neural networks [28]. Following [28], we use two entropy estimation networks with a shortcut connection structure, each of which has six 3D-convolutional layers, as shown in Fig. 3. In this structure, the middle four convolutional layers are cascaded with two skip connections, which form two 3D-ResConv blocks.

Method/Conditions | Distance Loss | Using the Importance-Indicator Maps | Parameter Sharing for Decoder Networks
Ours-mr           | MR-SSIM       | Y                                   | N
Ours-ms           | MS-SSIM       | Y                                   | N
Ours-mr-w/o       | MR-SSIM       | N                                   | N
Ours-mr(share)    | MR-SSIM       | Y                                   | Y
TABLE I: The setting of our framework and its variants. (Y=yes, N=no)

V Experimental results and analysis

We evaluate the proposed method with respect to the structural similarity of the compressed images on several commonly available datasets, including Set4 (used in [2]), McMaster, the Kodak PhotoCD dataset (denoted as Kodak), and SunHays. By default, two importance-indicator maps are used for all of the models in the proposed framework, which also employs a parameter sharing structure for the multiple description multi-scale-dilated encoder network; the one model without the importance-indicator maps is explicitly marked. When our framework uses MR-SSIM in the reconstruction loss and the multiple description distance loss, the proposed method is marked as “Ours-mr”. In contrast, the proposed method is denoted as “Ours-ms” if MS-SSIM is used for both losses. The “Ours-mr” model is labeled as “Ours-mr-w/o” when the importance-indicator maps are not used. When a model has the same settings as “Ours-mr” but additionally uses the symmetrical parameter sharing structure for the multiple description cascaded-ResBlock decoder networks, it is the final full model of our framework and is denoted as “Ours-mr(share)”. To clearly display the differences between these models, they are listed in Table I.

As described in [38], SSIM is a good approximation for assessing image quality from the perspective of human visual perception, but it only considers single-scale image information. Compared with SSIM, MS-SSIM is an image synthesis approach for image quality assessment that considers the relative importance of distorted images across different scales. Consequently, both MS-SSIM and MR-SSIM are chosen as objective measurements to assess distorted image quality, in addition to SSIM. Note that each per-scale SSIM weight factor of MR-SSIM is proportional to the image size, whereas the weights of MS-SSIM are obtained from visual testing. To demonstrate the coding efficiency of our MDC framework, our method is compared on several datasets with several state-of-the-art MDC approaches, including the multiple description coding approach with randomly offset quantizers [13] and the newest convolutional neural network-based standard-compatible method [2]. Finally, visual comparisons of different MDC methods are provided, because human eyes are the ultimate recipients of the compressed images.
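
The difference between the two weighting schemes can be illustrated numerically. The MS-SSIM values below are the per-scale exponents reported in [38]. The size-proportional weights are only an assumption about how "proportional to the image size" could be instantiated (pixel count at each dyadic scale, normalized to sum to one, with five scales); they are not the definition of MR-SSIM used in this paper.

```python
# MS-SSIM per-scale exponents reported in [38], finest scale first.
MS_SSIM_WEIGHTS = [0.0448, 0.2856, 0.3001, 0.2363, 0.1333]

def size_proportional_weights(num_scales):
    """Weights proportional to the pixel count at each dyadic scale.

    Assumption for illustration: scale s has 4**(-s) times the pixels of the
    finest scale, and the weights are normalized to sum to one.
    """
    sizes = [4.0 ** (-s) for s in range(num_scales)]
    total = sum(sizes)
    return [v / total for v in sizes]

size_weights = size_proportional_weights(len(MS_SSIM_WEIGHTS))

for scale, (ms, sz) in enumerate(zip(MS_SSIM_WEIGHTS, size_weights)):
    print(f"scale {scale}: MS-SSIM weight {ms:.4f}, size-proportional weight {sz:.4f}")
```

Under this assumption the finest scale dominates the size-proportional weighting, whereas the perceptually calibrated MS-SSIM weights peak at a middle scale.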

Fig. 5: The average objective quality comparisons of different multiple description coding methods. (a1-d1) are the MS-SSIM measurements for the decoded central images testing on Set4, Kodak, McMaster, and SunHays respectively. (a2-d2) are the MS-SSIM measurements for the decoded side images testing on Set4, Kodak, McMaster, and SunHays respectively.
Fig. 6: The average objective quality comparisons of different multiple description coding methods. (a1-d1) are the MR-SSIM measurements for the decoded central images testing on Set4, Kodak, McMaster, and SunHays respectively. (a2-d2) are the MR-SSIM measurements for the decoded side images testing on Set4, Kodak, McMaster, and SunHays respectively.

V-A Training details

We train our framework on the ImageNet training dataset from ILSVRC2012. During training, each image patch with the size of is obtained by randomly cropping the training images from this dataset. If the training image size is smaller than , we first resize the input image to be at least 160 in each dimension before cropping. Moreover, ImageNet’s validation dataset is used as our validation dataset. The commonly available datasets mentioned above are chosen as our testing datasets for the comparison of different MDC methods. It should be noted that all of the testing images are cropped and/or resized so that their dimensions are integer multiples of 16, which is required by one of the comparison methods. Among them, all image sizes of Set4 are , except for the “Boat” image with a size of . The sizes of the datasets of McMaster, Kodak, and SunHays are, respectively, , and . During training, we choose Adam optimization to minimize the objective loss of our MDC framework with an initial learning rate of 4e-3 for our autoencoder network. The training batch size is set to 8, while the hyperparameters , , and are 0.1, 2e-4, and 0.1, respectively.
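
The two size constraints above (at least 160 pixels in each dimension before cropping, and test dimensions that are integer multiples of 16) can be sketched as simple size computations. The helper names are hypothetical, and preserving the aspect ratio during resizing as well as cropping rather than resizing the test images are assumptions made for illustration.

```python
def min_side_resize(height, width, min_side=160):
    """Target size so that both sides are at least min_side.

    Assumption: the aspect ratio is preserved when the image is enlarged.
    """
    shortest = min(height, width)
    if shortest >= min_side:
        return height, width
    scale = min_side / shortest
    # max() guards against rounding a side to just below min_side.
    return (max(min_side, round(height * scale)),
            max(min_side, round(width * scale)))

def crop_to_multiple(height, width, multiple=16):
    """Largest size not exceeding (height, width) with both sides divisible by multiple."""
    return height - height % multiple, width - width % multiple

train_size = min_side_resize(100, 200)  # a training image that is too small
test_size = crop_to_multiple(500, 375)  # a test image with arbitrary dimensions
```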

Fig. 7: The average objective quality comparisons between Ours-mr and Ours-mr-w/o testing on several datasets.
Fig. 8: The visual quality comparisons (b1-d1) between Ours-mr (0.39bpp) and Ours-mr-w/o (0.38bpp) and the colorized image comparisons (b2-d2) for their decoded central and side images. (a1) is an image from McMaster, and (a2) is the colorized image of (a1).

V-B Ablation studies

As described above, the weights of MR-SSIM and MS-SSIM exert different effects on the MD coding efficiency, so the comparison between “Ours-mr” and “Ours-ms” is given and discussed first. Fig. 5 and Fig. 6 show the objective performance comparison between “Ours-mr” and “Ours-ms” regarding the MS-SSIM and MR-SSIM measurements. Although the “Ours-mr” model is trained with MR-SSIM for the reconstruction loss and the multiple description distance loss, it consistently performs better than the “Ours-ms” model in terms of both MR-SSIM and MS-SSIM when tested on the Set4, McMaster, Kodak, and SunHays datasets. Accordingly, the other models, such as “Ours-mr-w/o” and “Ours-mr(share)”, are trained by using MR-SSIM in the proposed multiple description compressive loss.

Fig. 9: The average objective quality comparisons of different multiple description coding methods. (a) shows the SSIM measurements for the decoded central images testing on Set4, (b) shows the SSIM measurements for the decoded side images testing on Set4.
Fig. 10: The visual quality comparisons (a-e) of different multiple description coding methods: MDCNN (0.25bpp), MDROQ (0.29bpp), Ours-ms (0.29bpp), Ours-mr (0.25bpp), and Ours-mr(share) (0.25bpp). (f) is an image from Kodak and its close-ups.

Method                | MDROQ [13] | MDCNN [2] | Ours-mr | Ours-ms | Ours-mr(share)
Encoding Time         | 0.8        | 0.1       | 0.6     | 0.5     | 0.5
Center Decoding       | 8.4        | 0.05      | 3.9     | 4.3     | 0.2
Average Side Decoding | 0.6        | 0.03      | 1.1     | 1.2     | 0.07
Coding Time           | 9.8        | 0.18      | 5.6     | 6       | 0.77
TABLE II: The comparison of running time (s) for different MDC methods when testing on an image with a size of 512x512.

When we encode an image with one importance-indicator map as in [28, 34], a pair of quantizers can be learned to quantize the same tensor and generate different descriptions. In this case, the diversity of multiple description generation depends only on the quantizers, because the single importance-indicator map only controls the ROI coding and contributes almost nothing to the diversity of the multiple description tensors, which degrades the image coding efficiency (the corresponding experimental results are illustrated in the supplementary material).

In our deep MDC framework, each quantizer is accompanied by an importance-indicator map to generate the diversified multiple descriptions. To see the significance of the importance-indicator maps in the proposed framework, we compare the proposed models “Ours-mr” and “Ours-mr-w/o” in Fig. 7 (a1-d1), from which it can be found that the objective performances of these models are very similar. However, the decoded side images and central images compressed by “Ours-mr” and “Ours-mr-w/o” exhibit some differences in structural preservation in terms of image spatial changes, which can be seen in Fig. 8. To better observe the performance of these models, their compressed images are colorized by “Let there be Color!” [41], as shown in Fig. 8 (a1-d1) and (a2-d2). This figure shows that the “Ours-mr” model can retain more image spatial structures than “Ours-mr-w/o” after multiple description coding (see the areas to which the green arrows point in Fig. 8 (b1)).

Since the size of a trained model not only affects its computational complexity but also restricts the applications of the proposed framework, the network parameter sharing in the proposed framework is studied. First, we investigate how much the performance of the proposed framework is influenced by the network parameter sharing. Fig. 5 and Fig. 6 display the objective comparison of “Ours-mr” and “Ours-mr(share)”. From these figures, we can conclude that the performance of “Ours-mr(share)” is very close to that of “Ours-mr” in most cases. However, the objective measurement of the “Ours-mr(share)” model is slightly lower than that of the “Ours-mr” model at very high bit rates. Meanwhile, “Ours-mr(share)”, with a symmetrical structure for the parameter sharing, reduces the number of parameters to approximately 0.436 times that of the “Ours-mr” model. The number of network parameters of “Ours-mr(share)” is 0.406 times that of the model for the proposed framework without any parameter sharing. From these results, it can be deduced that the parameter sharing with the symmetrical structure can greatly reduce the number of model parameters in the proposed framework.
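
The two ratios quoted above can be combined in a short calculation. It rests on an assumption: “Ours-mr” differs from the model without any parameter sharing only by the sharing inside the encoder network, so the quotient of the two ratios isolates the effect of encoder sharing. The variable names are hypothetical.

```python
share_vs_mr = 0.436        # Ours-mr(share) / Ours-mr, as quoted in the text
share_vs_unshared = 0.406  # Ours-mr(share) / model without any sharing

# Ours-mr relative to the fully unshared model (assumption stated above).
mr_vs_unshared = share_vs_unshared / share_vs_mr

encoder_sharing_saving = 1.0 - mr_vs_unshared   # saved by encoder sharing alone
symmetric_sharing_saving = 1.0 - share_vs_mr    # saved on top of that by the symmetrical structure

print(f"Ours-mr keeps {mr_vs_unshared:.3f} of the unshared parameter count")
print(f"encoder sharing saves about {encoder_sharing_saving:.1%}")
print(f"symmetrical sharing saves a further {symmetric_sharing_saving:.1%} of Ours-mr")
```

Under this assumption, most of the reduction comes from the symmetrical encoder-decoder sharing rather than from the sharing inside the encoder network.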

V-C Objective and visual quality comparisons of different methods

To validate the efficiency of the proposed framework, we compare our method with the latest standard-compatible CNN-based MDC method [2], with convolutional auto-encoder-based multiple description coding [32], and with a multiple description coding approach with randomly offset quantizers [13], which are denoted as “MDCNN” [2], “CAE” [32], and “MDROQ” [13], respectively. As shown in Fig. 9, our method consistently has higher coding efficiency than “MDCNN” [2], “CAE” [32], and “MDROQ” [13] in terms of SSIM when tested on the Set4 dataset.

Although “Ours-mr” is trained with MR-SSIM instead of MS-SSIM, both the MR-SSIM and MS-SSIM results of the compared MDC approaches on several datasets are provided in Fig. 5 and Fig. 6. From these figures, we can observe that “Ours-mr” performs better than “Ours-ms” regarding both MR-SSIM and MS-SSIM. Moreover, the MS-SSIM measurements of the side decoded images of “MDCNN” and “MDROQ” are very similar when the bit rate is higher than approximately 0.3 bpp on the Set4, Kodak, McMaster, and SunHays datasets. The MR-SSIM measurements of the central decoded images compressed by “MDCNN” can compete with or even exceed those of “MDROQ”. However, “MDROQ” exhibits better performance than “MDCNN” in terms of MS-SSIM and MR-SSIM at very low bit rates. Compared with “MDCNN” and “MDROQ”, “Ours-mr”, “Ours-mr(share)”, and “Ours-ms” have the best coding efficiencies on all testing datasets for both side and central decoded images in terms of MR-SSIM and MS-SSIM, as depicted in Fig. 5 and Fig. 6. These figures also show that the coding efficiencies of “Ours-mr”, “Ours-mr(share)”, and “Ours-ms” are far higher than those of the comparative methods at low bit rates. Meanwhile, the MR-SSIM and MS-SSIM of “Ours-mr” and “Ours-mr(share)” are superior to those of “Ours-ms”.

“Ours-mr”, “Ours-mr(share)”, and “Ours-ms” can retain more structures of each object than “MDCNN” [2] and “MDROQ” [13] for both side and central decoded images, as displayed in Fig. 10 (a-e). Although “MDCNN” [2] can preserve some small structures, this method makes many significant objects disappear. At the same time, some objects are heavily distorted when testing images are compressed by “MDCNN” [2] and “MDROQ” [13], which can be seen in Fig. 10 (a-b). Moreover, unlike “MDROQ” [13], “Ours-mr”, “Ours-mr(share)”, and “Ours-ms” do not produce obvious visual noise, such as coding artifacts, which can be seen in Fig. 10 (c-e), and our side and central decoded images appear more natural.

V-D Comparisons of coding time

TABLE II provides the encoding time and decoding time of several MDC methods for comparison when testing on an image with a size of 512x512. In this table, all the operations of convolutional neural networks for these MDC methods are run on an NVIDIA GTX 1080 GPU device. Among the MDC methods compared in TABLE II, “MDCNN” requires the least coding time. From this table, it can be observed that “Ours-mr” and “Ours-ms” have very similar running times for image encoding, center decoding, and average side decoding. The decoding times of these two models are much greater than those of “Ours-mr(share)”, whereas their encoding times (0.6 s and 0.5 s) are close to that of “Ours-mr(share)” (0.5 s). Compared to “Ours-mr”, “Ours-ms”, “Ours-mr(share)”, and “MDCNN”, “MDROQ” requires the most time for image coding.
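
The "Coding Time" row of TABLE II is the sum of the encoding, center decoding, and average side decoding times, which the following sketch recomputes from the table entries. The dictionary layout and the function name are hypothetical.

```python
# (encoding, center decoding, average side decoding) in seconds, from TABLE II.
TIMES = {
    "MDROQ": (0.8, 8.4, 0.6),
    "MDCNN": (0.1, 0.05, 0.03),
    "Ours-mr": (0.6, 3.9, 1.1),
    "Ours-ms": (0.5, 4.3, 1.2),
    "Ours-mr(share)": (0.5, 0.2, 0.07),
}

def coding_time(method):
    """Total coding time of one method, rounded to two decimals."""
    return round(sum(TIMES[method]), 2)

totals = {method: coding_time(method) for method in TIMES}
fastest = min(totals, key=totals.get)
slowest = max(totals, key=totals.get)

# Center-decoding speed-up obtained by the symmetrical parameter sharing.
center_speedup = TIMES["Ours-mr"][1] / TIMES["Ours-mr(share)"][1]

for method, total in totals.items():
    print(f"{method}: {total} s")
```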

VI Conclusion

In this paper, we propose a deep multiple description coding framework in which MD quantizers are automatically learned in an end-to-end manner during training. Meanwhile, a symmetrical parameter sharing structure is designed for our autoencoder networks to substantially reduce the total number of neural network parameters. Ablation studies on whether two importance-indicator maps are essential, how to control the redundancy between different descriptions, and different kinds of structural dissimilarity loss functions, as well as parameter sharing, are provided in the section of experimental results and analysis. Finally, we demonstrate that our method offers better coding efficiency than several advanced MD image compression methods when tested on commonly available datasets, especially at low bit rates.


  • [1] M. Kazemi, R. Iqbal, and S. Shirmohammadi, “Joint intra and multiple description coding for packet loss resilient video transmission,” IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 781–795, 2018.
  • [2] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Multiple description convolutional neural networks for image compression,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2494–2508, 2019.
  • [3] Y. Xu and C. Zhu, “End-to-end rate-distortion optimized description generation for H.264 multiple description video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 9, pp. 1523–1536, 2013.
  • [4] F. Shirani and S. S. Pradhan, “An achievable rate-distortion region for multiple descriptions source coding based on coset codes,” IEEE Transactions on Information Theory, vol. 64, no. 5, pp. 3781–3809, 2018.
  • [5] V. A. Vaishampayan, “Design of multiple description scalar quantizers,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 821–834, 1993.
  • [6] V. K. Goyal, “Scalar quantization with random thresholds,” IEEE Signal Processing Letters, vol. 18, no. 9, pp. 525–528, 2011.
  • [7] Y. Wang, M. T. Orchard, V. Vaishampayan, and A. R. Reibman, “Multiple description coding using pairwise correlating transforms,” IEEE Transactions on Image Processing, vol. 10, no. 3, pp. 351–366, 2001.
  • [8] V. K. Goyal and J. Kovacevic, “Generalized multiple description coding with correlating transforms,” IEEE Transactions on Information Theory, vol. 47, no. 6, pp. 2199–2224, 2001.
  • [9] D. Saitoh and T. Yakoh, “Ratio configurable multiple description correlating transforms coding,” in IEEE International Conference on Industrial Technology, Auburn, Mar. 2011.
  • [10] M. Majid, M. Owais, and S. M. Anwar, “Visual saliency based redundancy allocation in HEVC compatible multiple description video coding,” Multimedia Tools and Applications, vol. 77, no. 16, pp. 20955–20977, 2018.
  • [11] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Learning a virtual codec based on deep convolutional neural network to compress image,” Journal of Visual Communication and Image Representation, vol. 63, no. 1, pp. 102589–102599, 2019.
  • [12] H. Wu, T. Zheng, and S. Dumitrescu, “On the design of symmetric entropy-constrained multiple description scalar quantizer with linear joint decoders,” IEEE Transactions on Communications, vol. 65, no. 8, pp. 3453–3466, 2017.
  • [13] L. Meng, J. Liang, U. Samarawickrama, Y. Zhao, H. Bai, and A. Kaup, “Multiple description coding with randomly and uniformly offset quantizers,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 582–595, 2014.
  • [14] S. Dumitrescu and Y. Wan, “Bit-error resilient index assignment for multiple description scalar quantizers,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2748–2763, 2015.
  • [15] H. Jafarkhani and V. Tarokh, “Multiple description trellis-coded quantization,” IEEE Transactions on Communications, vol. 47, no. 6, pp. 799–803, 1999.
  • [16] V. Vaishampayan, N. Sloane, and S. Servetto, “Multiple-description vector quantization with lattice codebooks: design and analysis,” IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1718–1734, 2001.
  • [17] G. Romano, P. S. Rossi, and F. Palmieri, “Multiple description image coder using correlating transforms,” in European Signal Processing Conference, Vienna, 2015.
  • [18] J. Chen, C. Cai, L. Li, and C. Li, “Layered multiple description video coding using dual-tree discrete wavelet transform and H.264/AVC,” Multimedia Tools and Applications, vol. 75, no. 5, pp. 2801–2814, 2016.
  • [19] V. Goyal, J. Kelner, and J. Kovacevic, “Multiple description vector quantization with a coarse lattice,” IEEE Transactions on Information Theory, vol. 48, no. 3, pp. 781–788, 2002.
  • [20] S. Dumitrescu, Y. Chen, and J. Chen, “Index mapping for bit-error resilient multiple description lattice vector quantizer,” IEEE Transactions on Communications, vol. PP, no. 99, pp. 1–1, 2018.
  • [21] Z. Gao and S. Dumitrescu, “Flexible multiple description lattice vector quantizer with descriptions,” IEEE Transactions on Communications, vol. 62, no. 12, pp. 4281–4292, 2014.
  • [22] N. Franchi, M. Fumagalli, R. Lancini, and S. Tubaro, “Multiple description video coding for scalable and robust transmission over IP,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 3, pp. 321–334, 2005.
  • [23] N. Gadgil, H. Li, and E. J. Delp, “Spatial subsampling-based multiple description video coding with adaptive temporal-spatial error concealment,” in Picture Coding Symposium, Cairns, May 2015.
  • [24] G. Zhang and R. L. Stevenson, “Efficient error recovery for multiple description video coding,” in Picture Coding Symposium, Paris, Oct. 2004.
  • [25] C. Zhu and M. Liu, “Multiple description video coding based on hierarchical B pictures,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 4, pp. 511–521, 2009.
  • [26] X. Zhang, W. Yang, Y. Hu, and J. Liu, “DMCNN: Dual-domain Multi-scale Convolutional Neural Network for Compression Artifacts Removal,” in IEEE International Conference on Image Processing, Athens, Oct. 2018.
  • [27] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learned compression of images and neural networks,” in Advances in Neural Information Processing Systems, California, Dec. 2017.
  • [28] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun. 2018.
  • [29] L. Theis, W. Shi, A. Cunningham, and F. Huszar, “Lossy image compression with compressive autoencoders,” in International Conference on Learning Representations, Toulon, Apr. 2017.
  • [30] G. Toderici, S. Malley, S. Hwang, D. Vincent, D. Minnen, S. Baluja, et al., “Variable rate image compression with recurrent neural networks,” in International Conference on Learning Representations (ICLR), Puerto Rico, May 2016.
  • [31] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in International Conference on Machine Learning, Sydney, Aug. 2017.
  • [32] H. Li, L. Meng, J. Zhang, Y. Tan, Y. Ren, and H. Zhang, “Multiple description coding based on convolutional auto-encoder,” IEEE Access, vol. 7, no. 1, pp. 26013–26021, 2019.
  • [33] A. Jerbi, W. Jian, and S. Shirani, “Error-resilient region-of-interest video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 9, pp. 1175–1181, 2005.
  • [34] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for content-weighted image compression,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun. 2018.
  • [35] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun. 2018.
  • [36] L. Zhao, H. Bai, J. Liang, B. Zeng, A. Wang, and Y. Zhao, “Simultaneous color-depth super-resolution with conditional generative adversarial networks,” Pattern Recognition, vol. 88, no. 1, pp. 356–369, 2019.
  • [37] P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul. 2017.
  • [38] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Pacific Grove, Nov. 2003.
  • [39] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in arXiv:1511.07122, 2015.
  • [40] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar. 2018.
  • [41] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” ACM Transactions on Graphics, vol. 35, no. 4, pp. 1–11, 2016.