I Introduction
PACKET LOSS and bit errors often unavoidably occur when Internet data are transmitted over unreliable channels [1]. As more and more people surf the Internet in daily life with mobile devices, such as handheld pads and cellphones, whose data packets are conveyed over wireless communication channels, incomplete decoding or complete loss of data is very likely to occur. Multiple description coding (MDC)
[1, 2, 3] is one of the representative and promising mechanisms among different error-resilient coding techniques for making real-time transmission systems simpler and more robust to a lossy channel in a challenging environment. Compared with layer-based coding, this mechanism needs neither packet prioritization nor a data retransmission mechanism, and it does not require any signal feedback. The several independent yet correlated bit streams generated by the MDC technique facilitate independent decoding of each packet, while joint decoding of two or more bitstreams is also supported.

Traditional MDC approaches have been widely studied over the past decades, among which the derivation of multiple description (MD) theoretical rate-distortion regions is a fundamental and significant topic [4]. Meanwhile, in practice, the achievable rate-distortion regions gradually approach the boundaries of the theoretical MD rate-distortion regions [4, 2, 5, 6, 7, 8, 9, 3, 10, 11]. However, many traditional MDC methods face difficult problems. For example, MD quantizers often must assign the optimal index for multiple description generation [5, 6, 12, 13, 14, 15, 16], which is an extremely complicated problem, especially when quantizing more than two descriptions. For the correlating transform-based MDC framework, good multiple description coding performance can be achieved when adding a small amount of redundancy [7, 8, 9, 17]. However, this framework does not always perform well if more redundancy is introduced into the descriptions. Different from MD quantizers and the correlating transform-based MDC framework, the class of sampling-based multiple description coding is more flexible and compatible with standard coders. However, most existing sampling-based MDC methods are built on specifically designed sampling methods or extend an existing sampling operator for multiple description generation [3, 10, 11, 18], whose coding efficiency is limited. Consequently, research on sampling-based MDC methods should be developed further. Recently, a convolutional neural network (CNN)-based JPEG-compliant MDC framework [2] has been used to sample an input image to adaptively create multiple description images, but its coding efficiency is not very high, because the usage of standard JPEG limits the performance of this framework. In summary, the topic of multiple description coding for error resilience against bit errors and packet loss over an unpredictable channel should be comprehensively studied.

In this paper, a deep multiple description image coding framework is proposed for robust transmission. Our contributions are listed below:

We design a general deep optimized MD coding framework based on artificial neural networks. Our framework has several main parts: an MD multi-scale dilated encoder network, a pair of multiple description scalar quantizers, MD cascaded-ResBlock decoder networks, and conditional probability models.

A pair of scalar quantization operators is automatically learned in an end-to-end self-supervised way to generate diversified multiple descriptions. Meanwhile, each scalar quantizer is accompanied by an importance-indicator map to quantize feature tensors according to changes in the image's spatial content.

We propose the use of a multiple description structural similarity distance loss to supervise the decoded multiple description images, which implicitly regularizes diversified multiple description generation and scalar quantization learning. Note that our multiple description structural similarity distance loss differs from the pixel-wise mean-absolute-deviation distance loss of [2]. These two types of distance loss work on different spaces: the proposed distance is imposed on the decoded images, whereas the distance loss of [2] regularizes the feature tensors produced by the MD generation neural network.

A symmetrical parameter-sharing structure is designed for our auto-encoder network to greatly reduce the total number of neural network parameters. Specifically, the parameters of the preceding convolutional layers in the encoder network are shared for multiple description generation, while the decoder networks have a symmetrical structure that shares the parameters of the back layers.

A pair of conditional probability models is learned to estimate the amount of information in the quantized tensors, which supervises the learning of the multiple description multi-scale dilated encoder network to compactly represent the input image.
The rest of this paper is organized as follows. First, four kinds of MDC methods are reviewed in Section II and the problem formulation of deep image compression is presented in Section III. Second, the proposed multiple description coding framework is introduced in Section IV. Third, experimental results and analysis are presented in Section V. Finally, we conclude our paper in Section VI.
II Related works
We mainly review four kinds of MDC methods: MD quantizers, correlating transform-based MDC methods, sampling-based MDC, and standard-compliant MDC.
II-A Multiple description quantizers
For quantization-based MDC methods, there are three primary classes: scalar quantizers, trellis-coded quantizers, and lattice vector quantizers. In early development, MD scalar quantizers constrained by symmetric entropy were formulated as an optimization problem [5]. Following this work, linear joint decoders were developed to resolve the drastic increase in computation [12] when generalizing this method from two to L descriptions during encoder optimization. In [6], MD distortion-rate performance is derived for certain randomly generated quantizers. By generalizing randomly generated quantizers, the theoretical performance of MDC with randomly offset quantizers is given in closed-form expressions [13]. To increase robustness to bit errors, linear permutation pairs were developed for index assignment in two-description scalar quantizers [14]. In [15], a trellis is formed by the tensor product of trellises for multiple description coding.

Scalar quantizers and trellis-coded quantizers usually require complicated index assignment. Compared with these quantizers, lattice vector quantizers have been widely studied, since they have many advantages, such as a symmetrical structure, avoidance of a complex nearest-neighbor search, and no need for codebook design. In [16], the design problem of the lattice vector quantizer is cast as a labeling problem, and a systematic construction method is presented for general lattices. To further improve the performance of MD vector quantization at the cost of a slight complexity increase [19], the fine lattice codebook is replaced by a non-lattice codebook. In [20], a structured bit-error-resilient mapping is leveraged to make MD lattice vector quantizers resilient to transmission bit errors by exploiting intrinsic structural characteristics of the lattice. In [21], a heuristic index assignment algorithm is given to control coding distortions in order to balance the reconstruction quality of different descriptions. In summary, MD quantizers always involve complicated index assignments whose complexity is very high, especially when users expect more than two descriptions to be generated.
II-B Correlating transform-based MDC methods
The transform-based MDC framework employs a pairwise correlating transform to correlate pairs of transform coefficients [7], in which the best strategy for image compression redundancy allocation is given. This framework considers only the case of coding two descriptions. In [8], this transform-based approach is generalized to create L descriptions to enhance transmission system robustness with a small quantity of redundancy. To satisfy the channel bandwidth ratio, a ratio-configurable MD correlating transform coding [9] is introduced to adjust the ratio of description data sizes. In [17], a gradient search is provided to determine the correlating transform, while statistical dependencies between different descriptions benefit the estimation of transform coefficients lost over a lossy network during transmission. Although these correlating transform-based MDC methods provide an effective MD generation method, this kind of MDC method tends to be inefficient when a great deal of information redundancy is expected to be introduced into the descriptions.
II-C Sampling-based multiple description coding
Multiple descriptions can be created from different domains for input image/video coding in various ways. For instance, diverse descriptions can be obtained from the spatial, frequency, or temporal domain. In [22], two MD video coders employ a polyphase downsampling technique to create multiple descriptions. In [23], adaptive temporal-spatial error concealment is applied for MD video coding, in which multiple descriptions are obtained by spatial subsampling. For MD video coding, motion information from the temporal domain is often estimated in the encoder as redundancy. To quickly recover errors on the decoder side without complicated motion searching, motion vectors are sampled according to each description's redundancy, which can further strengthen error resilience when video streams are transmitted over error-prone wireless networks [24]. By selecting the temporal-level key pictures, the hierarchical-B-pictures-based MD video coding framework adopts the simplest odd/even frame-splitting method of generating redundant descriptions as a special case of the framework [25]. Among these approaches, some of the sampling-based MDC methods are compatible with standard image/video compression. However, their sampling operators are manually designed, which limits multiple description coding efficiency, so adaptive sampling methods for generating multiple descriptions should be investigated in depth, especially using deep learning methods.
II-D Standard-compliant multiple description coding
Research on the standard-compliant MDC framework has become increasingly significant due to the wide usage of standard image/video codecs such as JPEG and H.264, which have become essential parts of life in practice. Over the past decades, there have been numerous works on standard-compliant MD coding [3, 10, 11]. For instance, H.264-compliant MD video coding is achieved by interlacing primary and redundant slices [3], whose rate-distortion is optimized in an end-to-end manner. In [18], a hybrid layered MD video coding algorithm uses the H.264/AVC encoder to compress a video at low bit rates as the base layer; four bit streams are then divided into two groups as two-description enhancement layers after a 3D dual-tree discrete wavelet transform and 3D-SPIHT encoding. For HEVC-compatible MD video coding, the redundancy allocation is assigned based on a visual saliency model [10]. Most recently, image representation-based compression with CNNs [11] has been extended for multiple description image generation to form a standard-compliant convolutional neural network-based MDC framework [2], which is trainable in an end-to-end fashion. Although great progress has been achieved for the class of standard-compliant multiple description coding [18, 3, 10], its coding efficiency can still be enormously improved by artificial neural networks in various ways, such as pre-filtering and post-processing [2, 11, 26].
III Problem formulation for deep image compression
In this paper, deep image compression is defined as a new class of image compression approaches that uses artificial neural networks to learn an image's nonlinear transform and inverse transform from big data. This class differs from standard image compression with a classical linear transform such as the discrete cosine transform or the discrete wavelet transform. Since the proposed framework involves extensive machine learning knowledge, we first introduce some definitions, concepts, and terminology for deep image compression. Then, we introduce and review the deep image compression framework. Finally, the extension of deep image coding to multiple description coding is discussed.
III-A Definitions
Generally, the discrete representation of continuous data can be called data quantization, and rounding to the nearest of a fixed set of values is a form of hard quantization. The derivative of the hard type of quantization function is zero almost everywhere, except at the quantization boundaries. If a quantization function is differentiable over its domain, then it can be called a soft quantization function. In [27], a soft quantization function is defined based on soft assignment with the softmax function. Given an $L$-dimensional vector $z=[z_1,\ldots,z_L]$, the softmax function for $z$ can be defined as:

$$\mathrm{softmax}(z)=[\varphi_1,\varphi_2,\ldots,\varphi_L], \qquad (1)$$

$$\varphi_j=\frac{e^{z_j}}{\sum_{l=1}^{L}e^{z_l}}, \qquad (2)$$

in which $j\in\{1,2,\ldots,L\}$. The elements of the center variable vector $c=[c_1,\ldots,c_L]$ are not limited to integers and can also be defined as floating-point numbers. A soft assignment on a variable $v$ can be defined as $\phi(v)=\mathrm{softmax}\big(-\sigma[\|v-c_1\|^2,\ldots,\|v-c_L\|^2]\big)$, in which $\|\cdot\|$ denotes the $\ell_2$ norm. The corresponding hard assignment can be written as $\bar{\phi}(v)=\lim_{\sigma\to\infty}\phi(v)$, which finally converges to the nearest center of the vector $c$ in one-hot encoding, that is to say:

$$\bar{\phi}(v)=\mathrm{onehot}\Big(\arg\min_{j}|v-c_j|\Big), \qquad (3)$$

in which $|v-c_j|$ is the absolute value of $v-c_j$. Here, the one-hot encoding of $j$ refers to a zero vector along the new dimension with length $L$, except for its $j$-th element, which is 1. Using the above soft assignment, soft quantization and hard quantization can be defined respectively as:

$$\tilde{Q}(v)=\mathrm{sum}\big(c\odot\phi(v)\big), \qquad (4)$$

$$\hat{Q}(v)=c_{j^{*}}, \quad j^{*}=\arg\min_{j}|v-c_j|, \qquad (5)$$

in which $\odot$ is element-wise multiplication, $c_{j^{*}}$ is the nearest center to $v$ in the center vector $c$, and $\sigma$ controls the smoothness strength of the soft quantization function.
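The soft-to-hard scalar quantization above can be sketched numerically; the center values and smoothness settings below are illustrative, not values learned by any framework:

```python
import numpy as np

def soft_assignment(v, centers, sigma):
    """Soft assignment: softmax over negative scaled squared distances
    from the scalar v to the quantization centers."""
    d = -sigma * (v - centers) ** 2          # shape (L,)
    e = np.exp(d - d.max())                  # numerically stable softmax
    return e / e.sum()

def soft_quantize(v, centers, sigma):
    """Soft quantization: sum of centers weighted by the soft assignment."""
    return float(np.sum(centers * soft_assignment(v, centers, sigma)))

def hard_quantize(v, centers):
    """Hard quantization: snap to the nearest center."""
    return float(centers[np.argmin(np.abs(v - centers))])

centers = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
v = 0.7
# As sigma grows, soft quantization converges to the hard quantization value.
for sigma in (1.0, 10.0, 100.0):
    print(sigma, soft_quantize(v, centers, sigma))
print(hard_quantize(v, centers))  # 1.0
```

The convergence with growing sigma is exactly the limit behind the hard assignment above: the softmax collapses onto the nearest center's one-hot vector.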
III-B Deep image compression framework
As with standard coders, general deep image compression usually consists of the following main parts: a nonlinear analysis transform (also known as the encoding network), a hard quantization function (as well as a soft quantization function for training), a nonlinear synthesis transform (also known as the decoding network), a conditional probability model to estimate the entropy, and a distortion function to measure how much image information is lost after compression, as shown in Fig. 1. Given an input image $x$, the feature tensor $y$ can be obtained through the nonlinear analysis transform $E$, i.e., $y=E(x)$. Then, the hard quantization function $\hat{Q}$ is used to discretize the feature tensor to reduce image compression bits, i.e., $\hat{y}=\hat{Q}(y)$. Finally, the conditional probability model is used to calculate the rate $R(\hat{y})$ of the quantized tensor [28]. On the decoder side, the quantized tensor $\hat{y}$ is reconstructed as $\hat{x}$ by the nonlinear synthesis transform $D$; that is, $\hat{x}=D(\hat{y})$. In addition, the distortion function $d(x,\hat{x})$ is used to measure the distortion of the compressed image. According to the rate-distortion theory of image coding [27], the objective rate-distortion optimization function of deep image compression can be defined as:

$$\min_{E,D}\;\alpha R(\hat{y})+\big(d(x,\hat{x})+\mathcal{R}(\theta)\big), \qquad (6)$$

where the first term is the image compression bit-rate constraint, while the second term includes the compressed image distortion and the network-parameter regularization term $\mathcal{R}(\theta)$. This equation includes the hard (non-differentiable) quantization function $\hat{Q}$
, so it cannot be directly optimized by general optimization methods, e.g., stochastic gradient descent or Adam. To resolve this problem, a general solution [29, 27, 30, 11, 28] replaces the hard quantization with soft quantization during backpropagation for training (see the bold dashed orange-pink lines in Fig. 1), while forward propagation uses the hard quantization (see the bold black lines in Fig. 1).

The earliest works on image compression based on artificial neural networks primarily study the non-differentiability problem of the quantization function and the design of the artificial neural network structure [27, 30, 11, 28], as well as how to make compressed images more realistic using perceptual loss functions [31]. For example, a compressive auto-encoder with an encoder network and a decoder network is often chosen as the image compression network [29, 27, 30, 11, 28], in which the encoder network condenses the input image into a bit stream that is as small as possible and is mapped back to a lossy image by the decoder network. In [29], the derivative of the stochastic rounding operation is replaced with the derivative of the expectation for backpropagation, but no part of the forward pass is changed. Like [29], soft relaxation of quantization is used to resolve the non-differentiability problem of the quantization function in [27]. Different from [29, 27], thumbnail images are compressed by a recurrent neural network architecture, in which a stochastic rounding operation binarizes the feature maps [30]. Recently, a virtual codec network has been learned to imitate the projection from the represented vectors to the decoded images so as to make the image compression framework trainable in an end-to-end way [11]. Besides the quantization problem, the auto-encoder should be restricted to satisfy a given bit-per-pixel (bpp) budget when learning a compressive auto-encoder for image compression. In [28], a conditional probability model for the latent distribution of the auto-encoder is learned to supervise the auto-encoder network.

III-C Extension to deep multiple description image coding
Until now, there has been no work on deep multiple description coding, although deep image compression methods have been increasingly explored. To the best of our knowledge, our work is the first deep multiple description coding framework. There are two possible approaches to deep multiple description coding. In the first, multiple descriptions are directly generated by an artificial neural network and are discretized by a general quantization to reduce coding bits. If the multiple descriptions refer to multiple description images that are compressed by a standard codec, then this becomes the multiple description coding framework described in [2]. In this framework, JPEG compression can be replaced by a deep image compression approach such as [28]. Different from [2], our deep optimized multiple description image coding follows the second approach: the artificial neural network represents an input image as feature tensors, and a pair of scalar quantizers is learned to quantize the feature tensors for diversified multiple description generation. Recently, a convolutional auto-encoder-based multiple description coding method [32] extracts features by learning, which improves image coding efficiency. However, this method suffers from severe coding artifacts, similar to conventional MDC approaches.
IV The proposed deep multiple description coding
In this paper, we introduce a deep optimized multiple description coding framework that is entirely built upon artificial neural networks, as displayed in Fig. 2. Our framework is primarily composed of a multiple description encoder, a multiple description decoder, and a pair of conditional probability models. The multiple descriptions are generated by the multiple description encoder, which is supervised by the conditional probability models that estimate the entropy of each description, while the received multiple descriptions are decompressed by the multiple description decoder. Specifically, given an input image $x$ with a size of $H\times W$, the multiple description multi-scale dilated encoder network in the multiple description encoder decomposes the input image into a feature tensor $F$, as well as two importance-indicator maps $p$ and $p'$. Here, $h\times w$ is the spatial size of the feature tensor $F$, while $n$ is the number of feature maps of this tensor. The importance map indicates which regions should be emphasized during image coding, which is a kind of region-of-interest (ROI) based coding [33], considering that the spatial distributions of various images differ from each other. M. Li and F. Mentzer, with their coauthors, reported that a content-weighted importance map can be used as guidance to allocate an image content-aware bit rate for deep image coding based on regions of interest [34, 28]. In consideration of the image spatial variation, each importance-indicator map could be directly multiplied by each feature map of the feature tensor in an element-wise manner; that is, the feature maps of the feature tensor would share the same importance-indicator map. However, these feature maps play vital yet different roles in multiple description image compression. Following [28], the expansion operation for $p$ is written as:
$$P=\mathrm{clamp}\Big(\mathrm{repeat}\big(\mathrm{expand\_dims}(n\cdot p,-1),\,n\big)-t,\;0,\;1\Big),\quad t=[0,1,\ldots,n-1], \qquad (7)$$

in which the function $\mathrm{repeat}(\cdot,n)$ is a repetition function used to repeat a given matrix $n$ times along the last axis to form a new tensor of size $h\times w\times n$. Furthermore, $p\in[0,1]^{h\times w}$, $t$ indexes the feature maps, the function $\mathrm{clamp}(\cdot,0,1)$ truncates values between 0 and 1, and the operation $\mathrm{expand\_dims}(\cdot,-1)$ inserts a dimension of 1 at the last dimension index of the input shape. We can expand $p'$ in the same way to obtain $P'$. Then, the expansion of each importance-indicator map is multiplied by the feature tensor $F$ in an element-wise manner to obtain two new feature tensors, $F^{a}=F\odot P$ and $F^{b}=F\odot P'$, before scalar quantization. From Eq. (7), it can be seen that, at each spatial position $(i,j)$, only the first $\lceil n\cdot p_{i,j}\rceil$ maps of $P$ along the channel dimension have non-zero values. As a result, the importance-indicator maps influence the multiple description generation of $F^{a}$ and $F^{b}$. The significance of the importance-indicator maps for deep multiple description image coding will be discussed later.
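One plausible reading of the expansion in Eq. (7), in the spirit of the content-weighted masking of [28], can be sketched as follows; the exact ramp and normalization used in the paper may differ, and `expand_importance_map` is a hypothetical helper:

```python
import numpy as np

def expand_importance_map(p, n):
    """Expand an h x w importance map p (values in [0, 1]) into an
    h x w x n mask: at position (i, j), channel k is fully kept when
    k < n * p[i, j], with a linear ramp on the boundary channel so
    the mask stays differentiable with respect to p."""
    k = np.arange(n)                       # channel indices t = [0, ..., n-1]
    expanded = np.expand_dims(n * p, -1)   # (h, w, 1), broadcast against (n,)
    return np.clip(expanded - k, 0.0, 1.0) # (h, w, n)

p = np.array([[0.25, 1.0]])                # toy 1 x 2 importance map
mask = expand_importance_map(p, n=4)
# At p = 0.25, only the first of 4 channels is active;
# at p = 1.0, all 4 channels are active.
print(mask[0, 0])  # [1. 0. 0. 0.]
print(mask[0, 1])  # [1. 1. 1. 1.]
```

Multiplying this mask element-wise into the feature tensor zeroes out the channels beyond the importance threshold, so smoother image regions can spend fewer quantized symbols.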
To reduce multiple description coding bits, the two importance-weighted feature tensors $F^{a}$ and $F^{b}$ are quantized by scalar quantizer-I and scalar quantizer-II, respectively. Due to the non-differentiability of hard quantization, we follow the work of E. Agustsson [27] and utilize soft quantization to make our framework trainable in an end-to-end way. Scalar quantizer-I, built on the hard assignment of Eq. (3), is represented as:

$$\hat{F}^{a}_{i}=\arg\min_{j}\big|F^{a}_{i}-c_{j}\big|, \qquad (8)$$

where $c=[c_1,\ldots,c_L]$ is a center variable vector and $F^{a}_{i}$ represents the $i$-th element of tensor $F^{a}$. In a similar way, scalar quantizer-II can be defined with a new center variable vector $c'$, yielding the quantized tensors $\hat{F}^{a}$ and $\hat{F}^{b}$. This pair of scalar quantizers, accompanied by two importance-indicator maps, can be simultaneously learned in an end-to-end self-supervised way. In other words, there is no label for multiple description generation, the ultimate goal of which is to independently decode each side image with acceptable quality or to jointly decode a better-quality central image, so the side images are used as opposing labels for measuring each description's redundancy with the distance loss for self-supervised learning. Both tensors $\hat{F}^{a}$ and $\hat{F}^{b}$ can be converted into one-hot tensors by adding a new dimension, thus producing two descriptions. The generated descriptions are losslessly encoded by arithmetic coding for transmission. During forward propagation, hard quantization is leveraged to obtain discrete feature tensors according to Eq. (8), but we use the derivative of the soft quantization function [27] to backpropagate gradients from the multiple description decoder to the multiple description encoder, in a manner similar to Fig. 1.

At the receiver, the decoded one-hot tensors can be reversibly converted back into $\hat{F}^{a}$ and $\hat{F}^{b}$, as displayed in Fig. 1. Then, these tensors are processed by the corresponding scalar dequantizers. Scalar dequantizer-I can be written as $c_{\hat{F}^{a}_{i}}$, which returns the $\hat{F}^{a}_{i}$-th element of the center variable vector $c$. Similarly, scalar dequantizer-II can be written as $c'_{\hat{F}^{b}_{i}}$. When each description is transmitted over an unpredictable channel and only one description is received at the decoder, side decoder network-A (or side decoder network-B) is used to decompress the dequantized tensor as the lossy side image $\hat{x}_{a}$ (or $\hat{x}_{b}$), as shown in Fig. 3. If both descriptions are received, then the central decoder network is leveraged to jointly decompress the dequantized tensors as the central decoded image $\hat{x}_{c}$.
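The forward-hard / backward-soft substitution can be illustrated with finite differences standing in for backpropagation; the centers and smoothness value are illustrative:

```python
import numpy as np

def soft_q(v, centers, sigma):
    """Soft quantization: centers weighted by a distance-based softmax."""
    d = -sigma * (v - centers) ** 2
    e = np.exp(d - d.max())
    phi = e / e.sum()
    return float(np.sum(centers * phi))

def hard_q(v, centers):
    """Hard quantization: nearest center."""
    return float(centers[np.argmin(np.abs(v - centers))])

centers = np.array([0.0, 1.0, 2.0])
v, eps = 0.6, 1e-4

# Hard quantization: zero gradient almost everywhere -> nothing to learn from.
g_hard = (hard_q(v + eps, centers) - hard_q(v - eps, centers)) / (2 * eps)
# Soft quantization: smooth, non-zero gradient -> usable for backpropagation.
g_soft = (soft_q(v + eps, centers, 2.0) - soft_q(v - eps, centers, 2.0)) / (2 * eps)

print(g_hard)  # 0.0
print(g_soft)  # positive, non-zero
```

This is why training uses the soft derivative on the backward pass while the forward pass keeps the true discrete symbols that arithmetic coding needs.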
IV-A The objective function of multiple description coding
Similar to traditional single description image compression, the objective function of multiple description coding must balance two fundamental parts: the coding bit rate and the multiple description image decoding distortion. The mean square error (MSE) is often employed to measure image compression distortion. However, the human visual perceptual quality of a compressed image with high MSE may be higher than that of an image with low MSE [35]. There are many reasons for this objective-subjective quality mismatch, such as a one-pixel position shift of the whole image, or new pixels generated by generative adversarial networks to cover detail-lacking textural regions so as to obtain a sense of reality.
Compared to the MSE loss, the mean absolute error (MAE) loss can better regularize image compression to move the compressed images toward the ground-truth images during training [36, 37]. Thus, our framework uses the MAE loss for both the side decoded images $\hat{x}_{a}$, $\hat{x}_{b}$ and the central decoded image $\hat{x}_{c}$ as the first part of our multiple description reconstruction loss, which can be written as follows:

$$\mathcal{L}_{mae}=\frac{\tau}{2}\big(\|x-\hat{x}_{a}\|_{1}+\|x-\hat{x}_{b}\|_{1}\big)+\|x-\hat{x}_{c}\|_{1}, \qquad (9)$$
in which $\|\cdot\|_{1}$ denotes the $\ell_1$ norm and $\tau$ is a trade-off parameter to control the redundancy between the average side reconstructions and the corresponding central reconstruction. Meanwhile, to measure image distortion well for structural preservation, we introduce the multi-resolution (MR) structural similarity index (MR-SSIM) as an evaluation factor of image quality between $u$ and $v$ according to [38]:

$$\mathrm{MR\text{-}SSIM}(u,v)=\sum_{s=1}^{5}\frac{w_{s}}{N_{s}}\,\mathrm{sum}\!\left(\frac{(2\mu_{u_{s}}\mu_{v_{s}}+C_{1})\odot(2\sigma_{u_{s}v_{s}}+C_{2})}{(\mu_{u_{s}}^{2}+\mu_{v_{s}}^{2}+C_{1})\odot(\sigma_{u_{s}}^{2}+\sigma_{v_{s}}^{2}+C_{2})}\right), \qquad (10)$$

Here, $u_{s}$ is the downsampled image, whose width and height are $2^{s-1}$ times less than those of $u$, $N_{s}$ is the number of pixels at scale $s$, and $\mathrm{sum}(\cdot)$ is the sum operator over all pixels (multiplication and division inside the sum are element-wise). Meanwhile, $C_{1}$ and $C_{2}$ are two constants, while $\mu_{u_{s}}$ and $\sigma_{u_{s}}^{2}$ are, respectively, the mean map and the variance map of $u_{s}$ calculated from each pixel's neighborhood window, and $\sigma_{u_{s}v_{s}}$ is the covariance map calculated from each pixel's co-located neighborhood windows in $u_{s}$ and $v_{s}$. Additionally, the weight vector $w=[w_{1},\ldots,w_{5}]$ for the different scales is linearly proportional to each scale's image size; that is, the weight of a large-size image is greater than that of a small-size image. However, image distortions at different scales are of very different importance with respect to perceived quality. In contrast to the MR-SSIM, the MS-SSIM from the literature [38] uses the weight vector $w=[0.0448, 0.2856, 0.3001, 0.2363, 0.1333]$. These weights indicate that the images at intermediate scales are more significant than the largest- and smallest-scale images. More discussion of MR-SSIM and MS-SSIM will be provided later. The total structural dissimilarity loss, as the second part of our multiple description reconstruction loss, can be written as:

$$\mathcal{L}_{ssim}=\frac{\tau}{2}\Big[\big(1-\mathrm{MR\text{-}SSIM}(x,\hat{x}_{a})\big)+\big(1-\mathrm{MR\text{-}SSIM}(x,\hat{x}_{b})\big)\Big]+\big(1-\mathrm{MR\text{-}SSIM}(x,\hat{x}_{c})\big). \qquad (11)$$
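Under the stated reading that each downsampling halves both image dimensions, so the image area shrinks by 4x per scale, size-proportional MR-SSIM weights can be compared with the MS-SSIM weights of [38]; the exact MR-SSIM normalization here is an assumption:

```python
import numpy as np

# Hypothetical area-proportional weights: area at scale s is 4**-(s-1)
# of the original, normalized to sum to 1.
sizes = np.array([4.0 ** -s for s in range(5)])
w_mr = sizes / sizes.sum()

# MS-SSIM weights from [38], which favor intermediate scales instead.
w_ms = np.array([0.0448, 0.2856, 0.3001, 0.2363, 0.1333])

print(np.round(w_mr, 4))   # strictly decreasing: the largest scale dominates
print(int(w_ms.argmax()))  # an intermediate scale gets the top MS-SSIM weight
```

The contrast makes the design choice concrete: size-proportional weighting emphasizes full-resolution structure, whereas MS-SSIM's perceptually tuned weights de-emphasize both the largest and smallest scales.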
Unlike single description image compression, in which the image coder produces only one bit stream, multiple description coding should generate diversified multiple descriptions that share some redundancy while each description retains its unique information. The redundancy of these descriptions makes the receiver capable of decoding an image of acceptable quality even when one description is missing as the multiple description bit streams are transmitted over an unstable channel. However, when different descriptions contain excessive shared information, the central image quality will not exhibit great improvement, even when the client obtains all of the descriptions. In [2], a multiple description distance loss is directly employed to supervise multiple description generation in the feature space. Although the feature tensor of each description can be regarded as the opposite label to regularize the other, the learning problem of multiple description coding in the feature space is often challenging, because the same multiple description reconstruction may arise from compositions of different features. To allow our framework to automatically generate multiple diverse descriptions, we propose the multiple description structural similarity distance loss. This distance loss not only explicitly regularizes the decoded multiple description images to be different but also implicitly supervises scalar quantization learning and diversified multiple description generation. It is written as the structural similarity between the two side reconstructions, whose minimization drives them apart:

$$\mathcal{L}_{dis}=\mathrm{MR\text{-}SSIM}\big(\hat{x}_{a},\hat{x}_{b}\big). \qquad (12)$$
To compactly represent the multiple description feature tensors, there should be an entropy regularization term for training our deep MDC framework. However, our framework is built upon artificial neural networks, so the conditional probability model for the entropy regularization term should be differentiable; i.e., the conditional probability model should also use an artificial neural network. To precisely predict each description's coding cost, we use two entropy estimation networks without parameter sharing as the conditional probability models. Following the work of [28], we use context-based entropy estimation neural networks in our framework to efficiently estimate the multiple description coding costs during training, as shown in Fig. 2.
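The role of a conditional probability model in rate estimation can be sketched as follows: given per-symbol probabilities for the quantized symbols (here hand-picked, hypothetical numbers rather than network outputs), the estimated coding cost is the summed negative log-likelihood in bits:

```python
import numpy as np

def estimated_rate(probs):
    """Estimated coding cost of one description, in bits: the negative
    base-2 log-likelihood of each quantized symbol under the model,
    summed over the whole tensor."""
    return float(-np.sum(np.log2(probs)))

# Hypothetical probabilities a context model might assign to the four
# symbols actually chosen by the quantizer.
probs = np.array([0.5, 0.25, 0.125, 0.125])
print(estimated_rate(probs))  # 1 + 2 + 3 + 3 = 9.0 bits
```

Because this quantity is differentiable in the model's probabilities, minimizing it during training pushes the encoder toward descriptions the context model can predict well, i.e., toward compact descriptions.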
In our framework, the regularization loss from these context-based entropy estimation neural networks supervises the learning of the multiple description encoder, which leads to compact multiple description generation. The estimated coding costs of the two descriptions are denoted as $R_{a}$ and $R_{b}$. Because $\hat{F}^{a}$ and $\hat{F}^{b}$ come from the hard quantizers, the cost $R_{a}+R_{b}$ is non-differentiable, even though the context-based entropy estimation neural networks are differentiable. If our scalar quantization instead uses the soft quantizers to obtain $\tilde{F}^{a}$ and $\tilde{F}^{b}$, which are the approximations of $\hat{F}^{a}$ and $\hat{F}^{b}$, for each description's coding cost prediction, then $R_{a}$ and $R_{b}$ become differentiable. As a result, the loss from the context-based entropy estimation neural networks can be backpropagated to the multiple description encoder. Finally, the multiple description compressive loss for our trainable framework in Fig. 2 can be written as:

$$\mathcal{L}=\mathcal{L}_{mae}+\alpha\mathcal{L}_{ssim}+\beta\mathcal{L}_{dis}+\gamma\big(R_{a}+R_{b}\big)+\mathcal{R}(\theta), \qquad (13)$$

in which $\mathcal{R}(\theta)$ is the parameter regularization term for our artificial neural networks, and $\alpha$, $\beta$, and $\gamma$ are three hyperparameters. The terms $\mathcal{L}_{mae}$, $\mathcal{L}_{ssim}$, and $\mathcal{L}_{dis}$ in the final multiple description compressive loss in Eq. (13) include $\hat{F}^{a}$ and $\hat{F}^{b}$ from the hard quantizers. To make the framework trainable, the training of our deep multiple description coding framework operates as in Fig. 1: forward propagation uses hard quantization, but backpropagation employs soft quantization.
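How the terms of the total compressive loss combine can be sketched as below; the weights and toy inputs are illustrative rather than the paper's values, and the SSIM, diversity, and rate terms are passed in as precomputed numbers:

```python
import numpy as np

def mae(x, y):
    return float(np.mean(np.abs(x - y)))

def md_compressive_loss(x, xa, xb, xc, l_ssim, l_dis, ra, rb,
                        tau=0.5, alpha=1.0, beta=0.1, gamma=0.01):
    """Sketch of a multiple description compressive loss: side + central
    reconstruction (MAE part), a structural term, a diversity term, and
    the two estimated rates; tau/alpha/beta/gamma are hypothetical."""
    l_mae = tau / 2.0 * (mae(x, xa) + mae(x, xb)) + mae(x, xc)
    return l_mae + alpha * l_ssim + beta * l_dis + gamma * (ra + rb)

x  = np.zeros((2, 2))
xa = np.full((2, 2), 0.2)   # side reconstruction A
xb = np.full((2, 2), 0.4)   # side reconstruction B
xc = np.full((2, 2), 0.1)   # central reconstruction
loss = md_compressive_loss(x, xa, xb, xc, l_ssim=0.3, l_dis=0.5, ra=9.0, rb=9.0)
print(round(loss, 4))  # 0.78
```

Raising the rate weight trades reconstruction quality for fewer bits, while the diversity weight controls how much the two descriptions are pushed apart.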
IV-B Network
To fully explore image context information, we propose a multiple description multiscaledilated encoder network to create a feature tensor , as well as two importanceindicator maps, and , for multiple description generation, as shown in Fig. 2. As discussed in [39], the dilated convolution can significantly enlarge the image receptive field, but it may introduce the grid effect. In [40], several dilatedconvolutional layers are defined as a hybrid dilated convolution to resolve this problem for semantic segmentation. Inspired by these works, we use three cascaded dilated convolutions to extract multiscale features since this is vital to leverage image context information for diversified multiple description generation. Each cascaded dilated convolution is composed of three layers of dilated convolutions, which is displayed in Fig. 3. First, a convolutional layer is used to transform the input image into 64channel feature maps before operating three cascaded dilated convolutions. Second, each cascaded dilated convolution is followed by a downsampling convolutional layer with a stride of 2 to shrink spatial size of the feature maps. Meanwhile, the first cascaded dilated convolution is followed by a downsampling convolutional layer with a stride of 4. Additionally, a convolutional layer with a stride of 2 is used to downsample the output features of the second cascaded dilated convolution in the spatial domain. Finally, various features from different scales are concatenated and then aggregated together by a convolutional layer to leverage image multiscale context information, as depicted in Fig. 3. Although the feature tensor, as well as the importanceindicator maps can be separately generated by three structuresimilar networks in the MD multiscaledilated encoder network, the input is shared by these networks to produce these tensors. 
Without parameter sharing, the number of parameters would increase enormously, resulting in higher memory consumption and additional computational complexity. Accordingly, almost all convolutional layers in the MD multi-scale dilated encoder network are shared, except for the last layers, which create the different kinds of tensors (see the MD multi-scale dilated encoder network in Fig. 3).
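The benefit of cascading dilated layers can be checked with the standard receptive-field recurrence. In this sketch the 3x3 kernels and the hybrid dilation rates 1, 2, 3 are assumptions for illustration; the paper specifies the exact layer configuration only in Fig. 3:

```python
def receptive_field(kernel_sizes, dilations, strides=None):
    """Receptive field of stacked conv layers: r grows by (k-1)*d*jump per layer."""
    strides = strides or [1] * len(kernel_sizes)
    r, jump = 1, 1
    for k, d, s in zip(kernel_sizes, dilations, strides):
        r += (k - 1) * d * jump
        jump *= s
    return r

# one cascaded dilated convolution: three 3x3 layers with hybrid rates 1, 2, 3
print(receptive_field([3, 3, 3], [1, 2, 3]))   # -> 13
# a fixed rate of 2 reaches the same size but samples only even offsets,
# which is the gridding effect the hybrid rates avoid
print(receptive_field([3, 3, 3], [2, 2, 2]))   # -> 13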
After receiving the dequantized tensors and , we propose to use MD cascadedResBlock decoder networks to decompress these tensors since the learning of ResConv (denoted as Res) with a shortcut connection is a type of residual learning, whose gradients can be easily backpropagated. The use of ResConv can also avoid the gradient vanishing problem. From Fig. 3, it can be seen that the side decoder networks and central decoder network share a similar network structure, except for different inputs. When all multiple descriptions are received, both dequantized tensors and are concatenated and fed into the central decoder network for image decoding. If one description is missing but the other description is received, side decoder networkA or side decoder networkB is chosen to decode the dequantized tensor or . These networks are composed of three deconvolutional layers and two ResBlocks. Each of the first two deconvolutional layers is followed by one ResBlock. As depicted in Fig. 3, 16 ResConv blocks are employed with skipconnection in each ResBlock.
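The claim that ResConv shortcuts ease back-propagation follows from the residual form y = x + F(x), whose input gradient always contains an identity term. A scalar stand-in (illustrative only, not the actual convolutional block):

```python
import numpy as np

def res_block(x, w1, w2):
    """Minimal 1-D stand-in for a ResConv block: two linear maps + a shortcut."""
    return x + w2 * np.maximum(w1 * x, 0.0)   # y = x + F(x)

def grad_wrt_input(x, w1, w2):
    """dy/dx = 1 + F'(x): the shortcut contributes the constant 1, so the
    gradient never vanishes even when F'(x) is near zero."""
    inner = 1.0 if w1 * x > 0 else 0.0
    return 1.0 + w2 * w1 * inner
```

Even with all residual weights at zero, `grad_wrt_input` returns 1.0, which is why gradients propagate easily through a deep stack of such blocks.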
To substantially reduce the total number of parameters of the MD cascaded-ResBlock decoder networks, a symmetrical structure is designed: the back layers of the decoder networks share their parameters with the preceding convolutional layers of the encoder network. The benefits of this parameter sharing include fewer network parameters, faster training, and less over-fitting. However, a new problem must be considered: the parameter sharing strongly affects image coding efficiency, so we add a three-layer cascaded dilated convolution before the shared layers, as shown in Fig. 4; this choice is ablated in the experimental section. As described above, compact multiple description generation should be restricted by the entropy regularization term. To efficiently estimate the entropy of the feature tensor, Mentzer et al. presented context-based entropy estimation neural networks [28]. Following [28], we use two entropy estimation networks with a shortcut-connection structure [28], each of which has six 3D-convolutional layers, as shown in Fig. 3. In this structure, the middle four convolutional layers are cascaded with two skip-connections, which form two 3D-ResConv blocks.
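The entropy regularization term can be viewed as the expected code length of the quantized symbols under the probabilities assigned by the context model. This is a sketch of the rate estimate only; the probability values below are made up, and the actual model of [28] predicts them with 3D convolutions:

```python
import numpy as np

def rate_estimate(probs):
    """Cross-entropy rate estimate: average code length in bits per symbol,
    given the probability the context model assigned to each symbol that the
    quantizer actually produced."""
    return float(-np.mean(np.log2(probs)))

# hypothetical per-symbol probabilities from a context model
p = np.array([0.5, 0.25, 0.25, 0.125])
print(rate_estimate(p))   # -> 2.0 bits/symbol
```

Minimizing this quantity during training pushes the quantized descriptions toward low-entropy (compact) representations.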
Method / Conditions | Distance Loss | Using the Importance-Indicator Maps | Parameter Sharing for Decoder Networks
------------------- | ------------- | ----------------------------------- | --------------------------------------
Ours-mr             | MR-SSIM       | Y                                   | N
Ours-ms             | MS-SSIM       | Y                                   | N
Ours-mr-w/o         | MR-SSIM       | N                                   | N
Ours-mr(share)      | MR-SSIM       | Y                                   | Y
V Experimental results and analysis
We evaluate the proposed method with respect to the structural similarity of the compressed images on several commonly available datasets: Set4 (https://github.com/mdcnn/MDCNN_test40/tree/master/SET4), used in [2]; McMaster (http://www4.comp.polyu.edu.hk/~cslzhang/CDM_Dataset.htm); the Kodak PhotoCD dataset, denoted as Kodak (http://www.r0k.us/graphics/kodak/); and SunHays (https://github.com/jbhuang0604/SelfExSR/blob/master/README.md). By default, two importance-indicator maps are used for all models in the proposed framework, which also employs the parameter-sharing structure for the multiple description multi-scale dilated encoder network; the one model without the importance-indicator maps is clearly specified below. When our framework uses MR-SSIM in the reconstruction loss and the multiple description distance loss, the proposed method is marked as "Ours-mr"; it is denoted "Ours-ms" if MS-SSIM is used for these losses instead. The "Ours-mr" model is labeled "Ours-mr-w/o" when the importance-indicator maps are not used. A model with the same settings as "Ours-mr" but with a symmetrical parameter-sharing structure for the MD cascaded-ResBlock decoder networks, the final full model of our framework, is denoted "Ours-mr(share)". To clearly display the differences between these models, they are listed in Table I.
As described in [38], SSIM is a good approximation for assessing image quality from the perspective of human visual perception, but it only considers single-scale image information. Compared with SSIM, MS-SSIM is an image quality assessment approach that considers the relative importance of distorted images across different scales. Consequently, both MS-SSIM and MR-SSIM are chosen as objective measurements to assess distorted image quality, in addition to SSIM. Note that each scale's SSIM weight factor in MR-SSIM is proportional to the image size at that scale, whereas MS-SSIM's weights were obtained from visual testing. To demonstrate the coding efficiency of our MDC framework, our method is compared with several state-of-the-art MDC approaches, including the multiple description coding approach with randomly offset quantizers [13] and the newest convolutional neural network-based standard-compatible method [2], in terms of image coding efficiency on several datasets. Finally, visual comparisons of different MDC methods are provided to observe image quality, because human eyes are the ultimate recipients of the compressed images.
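The statement that MR-SSIM's per-scale weights are proportional to image size admits a simple reading, sketched below. This is our hedged interpretation; the number of scales and the exact normalization are assumptions, not values from the paper:

```python
import numpy as np

def mr_ssim_weights(num_scales, h, w):
    """Per-scale weights proportional to the pixel count at each scale
    (scale i is down-sampled by 2**i), normalized to sum to 1."""
    sizes = np.array([(h >> i) * (w >> i) for i in range(num_scales)], dtype=float)
    return sizes / sizes.sum()

print(mr_ssim_weights(3, 256, 256))   # the largest scale receives the largest weight
```

Under this reading, full-resolution structure dominates the score, in contrast to MS-SSIM's perceptually calibrated weights.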
V-A Training details
We train our framework on the ImageNet training dataset from ILSVRC2012 (http://www.imagenet.org/challenges/LSVRC/2012/). During training, each image patch is obtained by randomly cropping the training images; if a training image is smaller than the crop size, we first resize it to be at least 160 pixels in each dimension before cropping. Moreover, ImageNet's validation dataset is used as our validation dataset. The commonly available datasets mentioned above are chosen as our testing datasets (https://github.com/mdcnn/DeepMultipleDescriptionCoding) for the comparison of different MDC methods. Note that all testing images are cropped and/or resized so that each dimension is an integer multiple of 16, as required by one of the comparison methods; all Set4 images share the same size, except for the "Boat" image. During training, we use Adam optimization to minimize the objective loss of our MDC framework with an initial learning rate of 4e-3 for the auto-encoder network. The training batch size is set to 8, while the three loss hyper-parameters are set to 0.1, 2e-4, and 0.1, respectively.
V-B Ablation studies
As described above, the weights of MR-SSIM and MS-SSIM exert different effects on the MD coding efficiency, so the comparison between "Ours-mr" and "Ours-ms" is given first. Fig. 5 and Fig. 6 show the objective performance comparison between "Ours-mr" and "Ours-ms" under the MR-SSIM and MS-SSIM measurements. Although the "Ours-mr" model is trained with MR-SSIM for the reconstruction loss and the multiple description distance loss, it consistently performs better than the "Ours-ms" model in terms of both MR-SSIM and MS-SSIM when testing on the Set4, McMaster, Kodak, and SunHays datasets. Accordingly, the other models, such as "Ours-mr-w/o" and "Ours-mr(share)", are trained using MR-SSIM in the proposed multiple description compressive loss.
Method                | MDROQ [13] | MDCNN [2] | Ours-mr | Ours-ms | Ours-mr(share)
--------------------- | ---------- | --------- | ------- | ------- | --------------
Encoding Time         | 0.8        | 0.1       | 0.6     | 0.5     | 0.5
Center Decoding       | 8.4        | 0.05      | 3.9     | 4.3     | 0.2
Average Side Decoding | 0.6        | 0.03      | 1.1     | 1.2     | 0.07
Total Coding Time     | 9.8        | 0.18      | 5.6     | 6.0     | 0.77
When an image is encoded with only one importance-indicator map, as in [28, 34], a pair of quantizers can still be learned to quantize the same tensor to generate different descriptions. In this case, however, the diversity of multiple description generation depends only on the quantizers: the single importance-indicator map only controls the ROI coding and contributes almost nothing to the diversity of the multiple description tensors, which degrades coding performance (the corresponding experimental results are illustrated in the supplementary material, https://github.com/mdcnn/DeepMultipleDescriptionCoding).
In our deep MDC framework, each quantizer is accompanied by an importance-indicator map to generate diversified multiple descriptions. To assess the significance of the importance-indicator maps, we compare the models "Ours-mr" and "Ours-mr-w/o" in Fig. 7 (a1-d1), from which it can be found that their objective performances are very similar. However, the side and central images decoded by "Ours-mr" and "Ours-mr-w/o" exhibit some differences in structural preservation with respect to image spatial changes, as seen in Fig. 8. To better observe the performance of these models, their compressed images are colorized by "Let there be Color!" [41], as shown in Fig. 8 (a1-d1) and (a2-d2). This figure shows that the "Ours-mr" model retains more image spatial structure than "Ours-mr-w/o" after multiple description coding (see the areas indicated by the green arrows in Fig. 8 (b1)).
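The role of an importance-indicator map can be sketched with the content-weighted masking idea of [34]: each spatial location keeps a number of quantized channels proportional to its importance. This is a hypothetical sketch; the function names and the ceil-based channel budget are our assumptions:

```python
import numpy as np

def apply_importance_map(q, imp, n_channels):
    """Keep, at each spatial location, only the first ceil(imp * C) channels of
    the quantized tensor q with shape (C, H, W); imp in [0, 1] is the
    importance-indicator map with shape (H, W)."""
    c = np.arange(n_channels)[:, None, None]        # channel index grid (C, 1, 1)
    mask = c < np.ceil(imp[None] * n_channels)      # per-location channel budget
    return q * mask

q = np.ones((4, 2, 2))                              # toy quantized tensor
imp = np.array([[1.0, 0.5], [0.25, 0.0]])           # toy importance map
out = apply_importance_map(q, imp, 4)
print(out.sum(axis=0))                              # kept channels per location
```

With one map per quantizer, as in our framework, the two masks can differ spatially, which is what makes the two descriptions diverse rather than merely differently quantized.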
Since the size of the trained model not only affects its computational complexity but also restricts the applications of the proposed framework, network parameter sharing is studied next. First, we investigate how much the performance of the proposed framework is influenced by parameter sharing. Fig. 5 and Fig. 6 display the objective comparison of "Ours-mr" and "Ours-mr(share)": the performance of "Ours-mr(share)" is very close to that of "Ours-mr" in most cases, although its objective measurements are slightly lower at very high bit rates. Meanwhile, "Ours-mr(share)", with its symmetrical parameter-sharing structure, reduces the number of parameters to approximately 0.436 times that of the "Ours-mr" model, and to 0.406 times that of the proposed framework without any parameter sharing. From these results, it can be deduced that parameter sharing with the symmetrical structure greatly reduces the number of model parameters in the proposed framework.
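The mechanism behind such reduction ratios can be sanity-checked with a back-of-the-envelope parameter count, counting tied layers once. The layer shapes below are purely hypothetical; only the counting mechanism is the point:

```python
def n_params(layer_shapes):
    """Count weights of a stack of square-kernel conv layers given (in, out, k)."""
    return sum(i * o * k * k for i, o, k in layer_shapes)

shared = [(64, 64, 3)] * 4          # hypothetical layers tied between the two side decoders
private = [(64, 3, 3)]              # hypothetical last layer kept per-decoder

without_sharing = 2 * n_params(shared + private)
with_sharing = n_params(shared) + 2 * n_params(private)
print(with_sharing / without_sharing)   # sharing roughly halves the decoder cost
```

The exact ratio depends on how many layers are tied and how large they are, which is why the measured values (0.436 and 0.406) differ from this toy estimate.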
V-C Objective and visual quality comparisons of different methods
To validate the efficiency of the proposed framework, we compare our method with the latest standard-compatible CNN-based MDC method [2], with convolutional autoencoder-based multiple description coding [32], and with a multiple description coding approach with randomly offset quantizers [13], denoted as "MDCNN" [2], "CAE" [32], and "MDROQ" [13], respectively. As shown in Fig. 9, our method consistently achieves higher coding efficiency than "MDCNN" [2], "CAE" [32], and "MDROQ" [13] in terms of SSIM when testing on the Set4 dataset.
Although "Ours-mr" is trained with MR-SSIM instead of MS-SSIM, both the MR-SSIM and MS-SSIM results of the comparative MDC approaches are provided in Fig. 5 and Fig. 6 for several datasets. From these figures, we observe that "Ours-mr" performs better than "Ours-ms" with regard to both MR-SSIM and MS-SSIM. Moreover, the objective MS-SSIM measurements of the side decoded images of "MDCNN" and "MDROQ" are very similar when the bit rate is higher than approximately 0.3 bpp on the Set4, Kodak, McMaster, and SunHays datasets. The MR-SSIM measurements of the central decoded images compressed by "MDCNN" can compete with or even exceed those of "MDROQ"; however, "MDROQ" exhibits better performance than "MDCNN" in terms of MS-SSIM and MR-SSIM at very low bit rates. "Ours-mr", "Ours-mr(share)", and "Ours-ms" achieve the best coding efficiencies on all testing datasets for both side and central decoded images compared with "MDCNN" and "MDROQ" in terms of MR-SSIM and MS-SSIM, as depicted in Fig. 5 and Fig. 6; their coding efficiencies are far higher than those of the comparative methods at low bit rates. Meanwhile, the MR-SSIM and MS-SSIM of "Ours-mr" and "Ours-mr(share)" are superior to those of "Ours-ms".
"Ours-mr", "Ours-mr(share)", and "Ours-ms" retain more structure of each object than "MDCNN" [2] and "MDROQ" [13] for both side and central decoded images, as displayed in Fig. 10 (a-e). Although "MDCNN" [2] can preserve some small structures, it makes many significant objects disappear, and some objects are enormously distorted when images are compressed by "MDCNN" [2] and "MDROQ" [13], as seen in Fig. 10 (a-b). Moreover, "Ours-mr", "Ours-mr(share)", and "Ours-ms" do not produce obvious visual noise such as coding artifacts, in contrast to "MDROQ" [13] (see Fig. 10 (c-e)), and our side and central decoded images appear more natural.
V-D Comparisons of coding time
Table II lists the encoding and decoding times of several MDC methods when testing on an image of size 512x512. All convolutional neural network operations of these MDC methods are run on an NVIDIA GTX 1080 GPU. Among the comparative methods in Table II, "MDCNN" requires the least coding time. It can also be observed that "Ours-mr" and "Ours-ms" have very similar times for encoding, central decoding, and average side decoding; the running times of these two models for both encoding and decoding are greater than those of "Ours-mr(share)". Compared with "Ours-mr", "Ours-ms", "Ours-mr(share)", and "MDCNN", "MDROQ" requires the most time for image coding.
VI Conclusion
In this paper, we propose a deep multiple description coding framework in which MD quantizers are automatically learned in an end-to-end manner during training. Meanwhile, a symmetrical parameter-sharing structure is designed for our auto-encoder networks to substantially reduce the total number of neural network parameters. Ablation studies on whether the two importance-indicator maps are essential, on how to control the redundancy between different descriptions with different structural dissimilarity loss functions, and on parameter sharing are provided in the experimental results and analysis section. Finally, we demonstrate that our method offers better coding efficiency than several advanced MD image compression methods when tested on commonly available datasets, especially at low bit rates.
References
 [1] M. Kazemi, R. Iqbal, and S. Shirmohammadi, “Joint intra and multiple description coding for packet loss resilient video transmission,” IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 781–795, 2018.
 [2] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Multiple description convolutional neural networks for image compression,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2494–2508, 2019.
 [3] Y. Xu and C. Zhu, “End-to-end rate-distortion optimized description generation for H.264 multiple description video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 9, pp. 1523–1536, 2013.
 [4] F. Shirani and S. S. Pradhan, “An achievable ratedistortion region for multiple descriptions source coding based on coset codes,” IEEE Transactions on Information Theory, vol. 64, no. 5, pp. 3781–3809, 2018.
 [5] V. A. Vaishampayan, “Design of multiple description scalar quantizers,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 821–834, 1993.
 [6] V. K. Goyal, “Scalar quantization with random thresholds,” IEEE Signal Processing Letters, vol. 18, no. 9, pp. 525–528, 2011.
 [7] Y. Wang, M. T. Orchard, V. Vaishampayan, and A. R. Reibman, “Multiple description coding using pairwise correlating transforms,” IEEE Transactions on Image Processing, vol. 10, no. 3, pp. 351–366, 2001.
 [8] V. K. Goyal and J. Kovacevic, “Generalized multiple description coding with correlating transforms,” IEEE Transactions on Information Theory, vol. 47, no. 6, pp. 2199–2224, 2001.
 [9] D. Saitoh and T. Yakoh, “Ratio configurable multiple description correlating transforms coding,” in IEEE International Conference on Industrial Technology, Auburn, Mar. 2011.
 [10] M. Majid, M. Owais, and S. M. Anwar, “Visual saliency based redundancy allocation in HEVC compatible multiple description video coding,” Multimedia Tools and Applications, vol. 77, no. 16, pp. 20 955–20 977, 2018.
 [11] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Learning a virtual codec based on deep convolutional neural network to compress image,” Journal of Visual Communication and Image Representation, vol. 63, no. 1, pp. 102 589–102 599, 2019.
 [12] H. Wu, T. Zheng, and S. Dumitrescu, “On the design of symmetric entropyconstrained multiple description scalar quantizer with linear joint decoders,” IEEE Transactions on Communications, vol. 65, no. 8, pp. 3453–3466, 2017.
 [13] L. Meng, J. Liang, U. Samarawickrama, Y. Zhao, H. Bai, and A. Kaup, “Multiple description coding with randomly and uniformly offset quantizers,” IEEE Transactions on Image Processing, vol. 23, no. 2, pp. 582–595, 2014.
 [14] S. Dumitrescu and Y. Wan, “Bit-error resilient index assignment for multiple description scalar quantizers,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2748–2763, 2015.
 [15] H. Jafarkhani and V. Tarokh, “Multiple description trelliscoded quantization,” IEEE Transactions on Communications, vol. 47, no. 6, pp. 799–803, 1999.
 [16] V. Vaishampayan, N. Sloane, and S. Servetto, “Multipledescription vector quantization with lattice codebooks: design and analysis,” IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1718–1734, 2001.
 [17] G. Romano, P. S. Rossi, and F. Palmieri, “Multiple description image coder using correlating transforms,” in European Signal Processing Conference, Vienna, 2015.
 [18] J. Chen, C. Cai, L. Li, and C. Li, “Layered multiple description video coding using dual-tree discrete wavelet transform and H.264/AVC,” Multimedia Tools and Applications, vol. 75, no. 5, pp. 2801–2814, 2016.
 [19] V. Goyal, J. Kelner, and J. Kovacevic, “Multiple description vector quantization with a coarse lattice,” IEEE Transactions on Information Theory, vol. 48, no. 3, pp. 781–788, 2002.
 [20] S. Dumitrescu, Y. Chen, and J. Chen, “Index mapping for biterror resilient multiple description lattice vector quantizer,” IEEE Transactions on Communications, vol. PP, no. 99, pp. 1–1, 2018.
 [21] Z. Gao and S. Dumitrescu, “Flexible multiple description lattice vector quantizer with descriptions,” IEEE Transactions on Communications, vol. 62, no. 12, pp. 4281–4292, 2014.
 [22] N. Franchi, M. Fumagalli, R. Lancini, and S. Tubaro, “Multiple description video coding for scalable and robust transmission over IP,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 3, pp. 321–334, 2005.
 [23] N. Gadgil, H. Li, and E. J. Delp, “Spatial subsampling-based multiple description video coding with adaptive temporal-spatial error concealment,” in Picture Coding Symposium, Cairns, May 2015.
 [24] G. Zhang and R. L. Stevenson, “Efficient error recovery for multiple description video coding,” in Picture Coding Symposium, Paris, Oct. 2004.
 [25] C. Zhu and M. Liu, “Multiple description video coding based on hierarchical B pictures,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 4, pp. 511–521, 2009.
 [26] X. Zhang, W. Yang, Y. Hu, and J. Liu, “DMCNN: Dual-domain multi-scale convolutional neural network for compression artifacts removal,” in IEEE International Conference on Image Processing, Athens, Oct. 2018.
 [27] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, “Soft-to-hard vector quantization for end-to-end learned compression of images and neural networks,” in Advances in Neural Information Processing Systems, California, Dec. 2017.

 [28] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, “Conditional probability models for deep image compression,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun. 2018.
 [29] L. Theis, W. Shi, A. Cunningham, and F. Huszar, “Lossy image compression with compressive autoencoders,” in International Conference on Learning Representations, Palais, Apr. 2017.
 [30] G. Toderici, S. Malley, S. Hwang, D. Vincent, D. Minnen, S. Baluja, and et al., “Variable rate image compression with recurrent neural networks,” in International Conference on Learning Representations (ICLR), Puerto Rico, May 2016.
 [31] O. Rippel and L. Bourdev, “Realtime adaptive image compression,” in International Conference on Machine Learning, Sydney, Aug. 2017.
 [32] H. Li, L. Meng, J. Zhang, Y. Tan, Y. Ren, and H. Zhang, “Multiple description coding based on convolutional autoencoder,” IEEE Access, vol. 7, no. 1, pp. 26 013–26 021, 2019.
 [33] A. Jerbi, W. Jian, and S. Shirani, “Errorresilient regionofinterest video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 9, pp. 1175–1181, 2005.
 [34] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, “Learning convolutional networks for contentweighted image compression,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun. 2018.
 [35] B. Yochai and M. Tomer, “The perceptiondistortion tradeoff,” in IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Jun. 2018.

 [36] L. Zhao, H. Bai, J. Liang, B. Zeng, A. Wang, and Y. Zhao, “Simultaneous color-depth super-resolution with conditional generative adversarial networks,” Pattern Recognition, vol. 88, no. 1, pp. 356–369, 2019.
 [37] P. Isola, J. Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Jul. 2017.
 [38] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, Pacific Grove, Nov. 2003.
 [39] F. Yu and V. Koltun, “Multiscale context aggregation by dilated convolutions,” in arXiv:1511.07122, 2015.
 [40] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in IEEE Winter Conference on Applications of Computer Vision, Lake Tahoe, Mar. 2018.

 [41] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” ACM Transactions on Graphics, vol. 35, no. 4, pp. 1–11, 2016.