I Introduction
Considerable attention has been paid to Internet services and multimedia signal transmission techniques for many years; they not only provide a convenient means of communication but also offer many choices for our lifestyles. Meanwhile, Internet bandwidth has grown, and these developments guarantee more stable transmission service. However, transmission failures remain a risk when Internet congestion occurs under overload, or when signal packets are conveyed over unpredictable and unreliable channels [1, 2]. Multiple description coding has been studied as a promising source coding technique to relieve these problems: it decomposes the signal into multiple redundant subsets, which are transmitted over different channels. Thus, a degraded but acceptable signal reconstruction can be produced after decoding, even if only one description is received at the client. If more descriptions are available, better reconstruction quality can be achieved. Multiple description coding has been widely explored in the field of image and video coding [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21].
As one of the main techniques in multiple description image coding, multiple description scalar quantization can overcome impairments of the transmission channel [6]. For example, in [7], multiple description scalar quantizers are combined with efficient wavelet coders to generate independent multiple packets for error resilience. In [5], two-stage multiple description scalar quantization is presented to create central and side decoders whose distortions are closer to the rate-distortion bound of multiple description coding under the high-resolution assumption. To cope with the L-description problem [9], two novel coding schemes are proposed under symmetric rate and symmetric distortion constraints. In [3], a new achievable rate-distortion region with combinatorial message sharing is presented by introducing shared codebooks and a refinement codebook to generate L-channel multiple descriptions.
Compared with multiple description scalar quantization, lattice vector quantization is characterized by the good symmetric structure of lattices and avoids complex nearest-neighbor searching. In [10], the main problem of designing a lattice vector quantizer is formulated as a labeling problem for two-channel multiple description. In [11], a non-lattice codebook with the symmetries of the coarse lattice is used to obtain objective quality gains for multiple description coding without a great increase in complexity. In [12], multiple description lattice vector quantization is optimized in terms of appropriate construction of wavelet coefficient vectors, the choice of sublattice index values, and different quantization steps per subband in the wavelet domain. In [8], the index assignment of multiple description lattice vector quantization is translated into a transportation problem, and a greedy algorithm as well as general algorithms is developed to pursue optimality of the index assignment. Besides multiple descriptions directly produced by quantization, there are many alternative strategies for multiple description coding. To generate two descriptions in a transform-based coding framework, correlation between pairs of transform coefficients is introduced by a pairwise correlating transform [13]. This correlation helps reduce the distortion when only a single description is received. Later, both domain-based multiple description coding and forward error correction are used for concatenated multiple description coding of frame-rate scalable video [14]. Meanwhile, prioritized discrete cosine transform coefficients in video compression and multiple description codes based on forward error correction are combined to provide a video transmission scheme for wireless channels [15].
From the literature [14, 15], it can be observed that multiple description video coding using forward error correction has been widely explored. There are several other kinds of multiple description video coding. In [16], a video is coded into multiple independent streams so that each stream has its own prediction and dependent state to withstand bit errors or packet loss. In the multiple description motion coding algorithm, the motion vector is encoded into two descriptions, which are transmitted over distinct channels to the decoder so that the motion vector field is robust against transmission errors [17]. In the scalable wavelet video codec of [18], after breaking the wavelet transform into several spatial-temporal tree blocks, each packet is encoded with a separate channel code, so that the integrity of the packets is protected and packet-decoding failures can be detected. In [19], two architectures of multiple description video coding are built on a motion compensation prediction loop, and a polyphase downsampling technique is chosen to generate multiple descriptions and introduce cross redundancy among them.
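The polyphase downsampling idea of [19] reappears later in this paper as the basis of our baselines, so a minimal sketch may help. Assuming the two descriptions are formed by taking the main-diagonal and anti-diagonal pixels of each non-overlapped 2x2 window (one plausible reading of the scheme; the exact pixel grouping in [19] may differ):

```python
import numpy as np

def polyphase_split(img):
    """Split an image into two half-resolution descriptions: the two
    main-diagonal pixels of every non-overlapped 2x2 window go to one
    description, the anti-diagonal pixels to the other (illustrative)."""
    h, w = img.shape
    assert h % 2 == 0 and w % 2 == 0
    p00 = img[0::2, 0::2]  # top-left samples of each window
    p11 = img[1::2, 1::2]  # bottom-right (main-diagonal partner)
    p01 = img[0::2, 1::2]  # top-right samples
    p10 = img[1::2, 0::2]  # bottom-left (anti-diagonal partner)
    d1 = np.stack([p00, p11], axis=-1)  # description 1
    d2 = np.stack([p01, p10], axis=-1)  # description 2
    return d1, d2

def polyphase_merge(d1, d2):
    """Reassemble the full-resolution image when both descriptions arrive."""
    h, w = d1.shape[0] * 2, d1.shape[1] * 2
    img = np.empty((h, w), dtype=d1.dtype)
    img[0::2, 0::2] = d1[..., 0]
    img[1::2, 1::2] = d1[..., 1]
    img[0::2, 1::2] = d2[..., 0]
    img[1::2, 0::2] = d2[..., 1]
    return img
```

When both descriptions arrive, the merge is lossless (up to codec distortion); when one is lost, the missing polyphase samples must be interpolated, which is exactly what side reconstruction performs.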
Although the aforementioned approaches can alleviate Internet congestion and satisfy the demands of real-time applications, they are not compatible with standard codecs such as JPEG and JPEG2000. To resolve this problem, some previous works have provided feasible solutions, such as [20, 21, 4, 8]. In [21], the codeblocks are grouped into two balanced sets, which are compressed by JPEG2000 with two different quantization parameters to get four subsets; these are interleaved and merged to create two descriptions. In [20], a rate-allocation strategy embedded in the JPEG2000 encoder is introduced for the rate-distortion optimization of multiple descriptions of images, in which single description decoding remains compatible with a JPEG2000 Part 1 decoder. Since human eyes are sensitive only to changes above the just-noticeable-difference (JND) threshold, only the significant visual information contributing to the JND tolerance is encoded as redundant information in H.264/AVC-based multiple description video coding [4]. In [8], a frame-level rate-distortion-optimized description generation scheme takes temporal coding dependency into account to minimize the end-to-end distortion, built on standard H.264/AVC.
Because the proposed approach is closely related to the issue of compression artifact removal [22, 23, 24, 25, 26, 27, 28, 29, 30], we next review several state-of-the-art works on this topic. In [22], the pointwise shape-adaptive discrete cosine transform is leveraged for both denoising and deblocking after image compression. In [23], dictionary learning is introduced to reduce JPEG compression artifacts in view of the sparse and redundant representations of images. In [24], collaborative filtering is designed to uncover the finest details and maintain each individual block's unique features in the sparse 3-D transform domain; this approach is not restricted to the denoising of compressed images, so it is a general denoising method. Lately, the deblocking problem has been formulated as an optimization problem in which a non-convex low-rank model constraint is used to reduce blocking artifacts [25]. Meanwhile, the popular techniques of convolutional neural networks and generative adversarial networks have been applied to artifact removal [27, 28, 29].
Following the work of [19], we form multiple description coding baselines that use a polyphase downsampling technique to generate multiple descriptions, combining state-of-the-art artifact removal techniques with super-resolution based on very deep convolutional neural networks. Specifically, the input image is downsampled along the main diagonal of each non-overlapped window to form two descriptions, which are coded with a standard codec. After decoding, several state-of-the-art artifact removal techniques, such as [22, 23, 25, 24], are used to enhance image quality, followed by super-resolution with very deep convolutional neural networks, such as the methods of [31] and [32], to restore the image from low resolution to high resolution. The combinations of artifact removal [22, 23, 25, 24] with the super-resolution of [31] are referred to as multiple description coding baselines 1-4, namely "MDB1a", "MDB2a", "MDB3a", "MDB4a". Similarly, when the artifact removal methods of [22, 23, 25, 24] are combined with [32], they are denoted "MDB1b", "MDB2b", "MDB3b", "MDB4b".

In this paper, we introduce a novel standard-compatible multiple description coding framework, in which multiple descriptions are produced by a deep convolutional neural network. Our contributions are listed as follows:

A multiple description generator network (MDGN) is introduced to adaptively generate multiple descriptions according to the image's content; these descriptions are compressed by a standard codec to reduce transmission bits.

We present a multiple description reconstruction network (MDRN), which consists of side reconstruction networks (SRN) and a central reconstruction network (CRN). When only one of the two compressed descriptions is received at the decoder, side reconstruction network A (SRNA) or side reconstruction network B (SRNB) is used to reconstruct and enlarge the lossy description simultaneously, by removing compression artifacts and upsampling. Meanwhile, if both description images are available, we utilize the CRN network, with the two received descriptions as inputs, to achieve high-quality image reconstruction.

We train the aforementioned two networks, MDGN and MDRN, together by learning a multiple description virtual codec network (MDVCN); the learned MDVCN network is leveraged to further supervise the training of the MDGN network. Besides, we provide two kinds of learning algorithms for training our convolutional neural networks.

A distance loss for the MDGN network is introduced alongside a structural similarity loss to guarantee that the generated description images are structurally similar yet finely different.
The rest of this paper is organized as follows. Section II introduces the proposed methodology. Section III presents a series of experimental results to validate its efficiency. Section IV concludes the paper.
II The methodology
In this paper, a multiple description coding framework based on deep convolutional neural networks is introduced to efficiently compress images facing an unpredictable and non-prioritized channel. Our main work concerns how to generate multiple descriptions in terms of the redundancy between descriptions and the diversity among them, for better central reconstruction. Meanwhile, we design the neural networks for description generation and reconstruction, and introduce how to jointly train the convolutional neural networks used in the proposed method. To the best of our knowledge, this is the first work using convolutional neural networks for multiple description coding.
II-A Framework
Our multiple description coding framework has three components: the MDGN network, a standard codec such as JPEG, and the MDRN network, as depicted in Fig. 1. The MDGN network is responsible for generating two diverse descriptions from the ground-truth image. Owing to the wide usage of standard codecs such as JPEG, a standard-compatible coding framework is significant for practical applications. Thus, we use the JPEG codec to compress these descriptions, so that image redundancy can be further reduced, yielding the lossy descriptions. The compressed description streams are transmitted separately over different channels. However, image compression with a standard codec often incurs coding artifacts. Thus, the MDRN network is leveraged to remove these artifacts for image enhancement and to enlarge the lossy descriptions, so that the final reconstructed image is guaranteed to have the same size as the ground-truth image. Finally, even if either description is missing, the receiver can still decode the received packet to obtain a reconstruction of acceptable quality with the SRNA or SRNB network, as displayed in Fig. 1. If both descriptions are received, a high-quality reconstruction can be built by the CRN network.
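The decoding paths described above reduce to a simple routing rule: one surviving description goes to the matching side network, two go to the central network. A schematic sketch, with the trained networks replaced by placeholder callables (the names `srn_a`, `srn_b`, `crn` stand in for the actual models):

```python
def reconstruct(desc_a=None, desc_b=None, srn_a=None, srn_b=None, crn=None):
    """Route received descriptions to the proper reconstruction network.

    desc_a / desc_b: decoded lossy descriptions (None if the packet is lost).
    srn_a / srn_b / crn: callables standing in for the trained SRNA, SRNB,
    and CRN networks (placeholders, not the real models).
    """
    if desc_a is not None and desc_b is not None:
        return crn(desc_a, desc_b)   # central (high-quality) reconstruction
    if desc_a is not None:
        return srn_a(desc_a)         # side reconstruction from channel A
    if desc_b is not None:
        return srn_b(desc_b)         # side reconstruction from channel B
    raise ValueError("no description received")
```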
As we all know, it is not easy to jointly train the MDGN and MDRN networks, because the quantization function in a lossy compression codec is non-differentiable. Thus, the reconstruction error from the MDRN network cannot be directly backpropagated to the MDGN network. Following our previous work [33], we learn the MDVCN network to imitate the two consecutive procedures of codec compression and description reconstruction with the MDRN network. As a result, we can train the whole framework in an end-to-end fashion.
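The role of the MDVCN can be illustrated with the simpler straight-through idea it generalizes: run the non-differentiable codec in the forward pass, but take gradients through a differentiable stand-in. A toy numpy sketch, with uniform quantization standing in for the codec and an identity proxy (our MDVCN is a learned network, not an identity, but the gradient routing is analogous):

```python
import numpy as np

def codec_forward(x, step=0.1):
    """Non-differentiable stand-in for the codec: uniform quantization."""
    return step * np.round(x / step)

def proxy_backward(grad_out):
    """Gradient of the differentiable proxy. An identity proxy passes the
    reconstruction gradient straight through; a learned proxy (the MDVCN)
    would backpropagate through its own layers instead."""
    return grad_out

# Forward: the generator output goes through the real codec.
x = np.array([0.23, 0.57, 0.91])
y = codec_forward(x)
# Backward: the reconstruction error d(loss)/dy reaches x via the proxy,
# bypassing the codec's zero-almost-everywhere true gradient.
grad_y = 2.0 * (y - x)          # e.g. gradient of a squared error
grad_x = proxy_backward(grad_y)
```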
II-B Objective function
The objective function for our multiple description coding framework is written as follows:
(1) 
(2) 
where the three training losses are, respectively, the loss of the MDGN network, the loss of the MDRN network, and the loss of the MDVCN network.
(3) 
The loss in Eq. (3) is used to supervise the learning of the parameters of the MDGN network, where a linear upsampling function is applied and a weighting factor balances the contributions of the descriptor's SSIM loss [34] and the distance loss, which are to some extent contradictory. In addition, a quality factor is specified for JPEG compression, and a clip function restricts values to a valid range. Hence, the weighting factor plays a significant role in generating valid multiple descriptions. Note that the larger the quality factor set for JPEG, the better the encoded quality.
On the one hand, we hope that the two produced descriptions are structurally similar to the input image, so that the decoded descriptions can be viewed directly by the receiver even without the processing of the MDRN network. Consequently, an SSIM loss function is used to supervise each description's learning. For example, the SSIM for one description is defined as follows:
(4) 
(5) 
where the mean values and variances are computed over a neighborhood window centered at each pixel of the two images, the covariance is computed between the corresponding neighborhood windows of the two images, and two small constants stabilize the division. As a matter of fact, the calculation of the mean value is a special kind of convolution, also known as average pooling, while the variance actually involves two average pooling operations. It is obvious that the SSIM function in Eqs. (4)-(5) is differentiable, so the SSIM error can be efficiently backpropagated during optimization.

On the other hand, according to the El Gamal and Cover theorem [35, 36], the MDGN network should guarantee mutual information between the two generated descriptions, so that an acceptable reconstruction can be obtained even when only one description is received at the client. It is obvious that the SSIM loss function keeps the two descriptions yielded by the MDGN network structurally similar. In the meantime, the two descriptions produced by the neural networks are used as opposing labels to regularize the training of the MDGN network. Consequently, a high-quality central reconstruction from two diverse descriptions can be guaranteed. Contrary to the SSIM loss, a distance loss function is utilized to maintain the detail difference between the two descriptions, which is written as:
(6) 
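Since Eqs. (4)-(5) only involve local means, variances, and a covariance, SSIM can indeed be computed entirely from average pooling. A numpy sketch of mean SSIM (the window size and stabilizing constants here are illustrative, since the paper's exact settings are elided in the text):

```python
import numpy as np

def avg_pool(x, win):
    """Valid-window box filter (average pooling) via 2-D cumulative sums."""
    c = np.pad(np.cumsum(np.cumsum(x, 0), 1), ((1, 0), (1, 0)))
    s = c[win:, win:] - c[:-win, win:] - c[win:, :-win] + c[:-win, :-win]
    return s / float(win * win)

def ssim(x, y, win=7, c1=1e-4, c2=9e-4):
    """Mean SSIM built only from average pooling, as in Eqs. (4)-(5).
    Inputs are assumed to lie in [0, 1]; win, c1, c2 are assumptions."""
    mu_x, mu_y = avg_pool(x, win), avg_pool(y, win)
    var_x = avg_pool(x * x, win) - mu_x ** 2      # E[x^2] - E[x]^2
    var_y = avg_pool(y * y, win) - mu_y ** 2
    cov = avg_pool(x * y, win) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(np.mean(num / den))
```

Because every operation here is a convolution or an elementwise function, the same computation expressed in a deep learning framework is differentiable end to end, which is the property the SSIM loss relies on.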
For brevity in what follows, the content loss function and the gradient difference loss function between two images are defined as:
(7) 
(8) 
where the i-th gradient is taken between each pixel and the i-th of its 8-neighbourhood pixels. Here, the L1 norm is chosen to produce sharper results than the L2 norm, as has been reported in [37, 38].
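A minimal numpy sketch of an 8-neighbourhood L1 gradient difference loss in the spirit of Eq. (8) (boundary handling via circular shifts is an assumption for compactness):

```python
import numpy as np

def gradient_difference_loss(x, y):
    """L1 gradient difference over the 8-neighbourhood: for each of the
    eight shift directions, compare the pixel-to-neighbour differences of
    the two images and average the absolute deviations."""
    shifts = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
              if (di, dj) != (0, 0)]
    loss = 0.0
    for di, dj in shifts:
        gx = x - np.roll(x, (di, dj), axis=(0, 1))  # gradient of x toward this neighbour
        gy = y - np.roll(y, (di, dj), axis=(0, 1))
        loss += np.abs(gx - gy).mean()              # L1 norm, as in [37, 38]
    return loss / len(shifts)
```

Note that adding a constant to one image leaves the loss (essentially) zero: the gradient difference loss penalizes mismatched edges, not mismatched brightness, which is why it is paired with a content loss.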
In the MDRN network, both the content loss and the gradient difference loss supervise the learning of the side reconstructions and the central reconstruction, as presented below:
(9) 
In order to backpropagate the error from the MDRN network to the MDGN network, we learn the MDVCN network to approximate the procedure from the lossless descriptions to the lossy description reconstructions. Both the content loss and the gradient difference loss are used to regularize the training of the MDVCN network, as given below:
(10) 
In addition to the aforementioned losses, we use the MDVCN network to explicitly supervise the learning of the MDGN network, or directly use the gradient from the MDVCN network as a gradient approximation for the standard codec. It is worth noting that the MDVCN network is no longer used once training is finished; that is to say, during testing only the MDGN network and the MDRN network are leveraged, respectively, to create multiple descriptions for compression and to reconstruct these descriptions.
II-C Network architecture
MDGN Network  

Layer  k  s  cin  cout  input 
conv1f  9  1  1  128  
conv2f  3  2  128  128  conv1f 
conv3f  3  1  128  128  conv2f 
conv4f  3  1  128  128  conv3f 
conv5A  3  1  128  128  conv4f 
conv6A  3  1  128  128  conv5A 
conv7A  3  1  128  128  conv6A 
conv8A  9  1  128  1  conv7A 
conv5B  3  1  128  128  conv4f 
conv6B  3  1  128  128  conv5B 
conv7B  3  1  128  128  conv6B 
conv8B  9  1  128  1  conv7B 
The MDGN network is composed of eight convolutional layers, with one input stream but two output streams; that is to say, the feature maps extracted by the feature extraction network (FEN) in layers 1-4 are shared by generator network A (GNA) and generator network B (GNB). The FEN network has four convolutional layers, whose first layer has a 9x9 spatial kernel while the other layers have 3x3 kernels. In the GNA and GNB networks, there are four convolutional layers with 3x3 spatial kernels, except for the last layer, which uses a 9x9 kernel. The large spatial kernels of the first and last convolutional layers further enlarge the receptive field of the network beyond what the small kernels provide. Hence, the image's context information is well considered during the generation of descriptions. The details of each layer in the MDGN network are listed in Table I, where "k" represents the kernel size, "s" the stride, "cin" the number of input channels, and "cout" the number of output feature maps of the corresponding layer. Meanwhile, "conv" represents a convolutional layer and "deconv" indicates a deconvolutional layer. From this table, it can be seen that all layers use a stride of 1 except the second convolutional layer, which uses a stride of 2. All convolutional layers are activated by the ReLU activation function, apart from the last layer of the MDGN network.
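The effect of the 9x9 outer kernels on the receptive field can be checked with the standard recurrence r += (k - 1) * (product of earlier strides). Plugging in the MDGN layer configuration from Table I (FEN layers 1-4 followed by one generator branch) gives the sketch below; the specific numbers are derived from the table, not stated in the text:

```python
def receptive_field(layers):
    """Receptive field of a conv stack.
    layers: list of (kernel, stride) pairs, ordered input to output."""
    r, jump = 1, 1              # current receptive field and cumulative stride
    for k, s in layers:
        r += (k - 1) * jump     # each layer widens the field by (k-1)*jump
        jump *= s
    return r

# MDGN: FEN (conv1f-conv4f) followed by one generator branch (Table I).
mdgn = [(9, 1), (3, 2), (3, 1), (3, 1),   # FEN
        (3, 1), (3, 1), (3, 1), (9, 1)]   # GNA (or GNB)
rf = receptive_field(mdgn)
```

Under this recurrence, the configuration above yields a 47-pixel receptive field, versus 29 if the first and last kernels were 3x3 as well, which quantifies the claim that the large outer kernels enlarge the context seen per output pixel.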
SRNA Network  
Layer  k  s  cin  cout  input 
conv1a  9  1  1  128  
conv2a  3  1  128  128  conv1a 
conv3a  3  1  128  128  conv2a 
conv4a  3  1  128  128  conv3a 
conv5a  3  1  128  128  conv4a 
conv6a  3  1  128  128  conv5a 
conv7a  3  1  128  128  conv6a 
deconv8a  9  2  128  1  conv7a 
SRNB Network  
Layer  k  s  cin  cout  input 
conv1b  9  1  1  128  
conv2b  3  1  128  128  conv1b 
conv3b  3  1  128  128  conv2b 
conv4b  3  1  128  128  conv3b 
conv5b  3  1  128  128  conv4b 
conv6b  3  1  128  128  conv5b 
conv7b  3  1  128  128  conv6b 
deconv8b  9  2  128  1  conv7b 
CRN Network  
Layer  k  s  cin  cout  input 
conv1c  9  1  2  128  two descriptions 
conv2c  3  1  128  128  conv1c 
conv3c  3  1  128  128  conv2c 
conv4c  3  1  128  128  conv3c 
conv5c  3  1  128  128  conv4c 
conv6c  3  1  128  128  conv5c 
conv7c  3  1  128  128  conv6c 
deconv8c  9  2  128  1  conv7c 
VSRNA Network  
Layer  k  s  cin  cout  input 
conv1a  9  1  1  128  
conv2a  3  1  128  128  conv1a 
conv3a  3  1  128  128  conv2a 
conv4a  3  1  128  128  conv3a 
conv5a  3  1  128  128  conv4a 
conv6a  3  1  128  128  conv5a 
conv7a  3  1  128  128  conv6a 
deconv8a  9  2  128  1  conv7a 
VSRNB Network  
Layer  k  s  cin  cout  input 
conv1b  9  1  1  128  
conv2b  3  1  128  128  conv1b 
conv3b  3  1  128  128  conv2b 
conv4b  3  1  128  128  conv3b 
conv5b  3  1  128  128  conv4b 
conv6b  3  1  128  128  conv5b 
conv7b  3  1  128  128  conv6b 
deconv8b  9  2  128  1  conv7b 
VCRN Network  
Layer  k  s  cin  cout  input 
conv1c  9  1  2  128  two descriptions 
conv2c  3  1  128  128  conv1c 
conv3c  3  1  128  128  conv2c 
conv4c  3  1  128  128  conv3c 
conv5c  3  1  128  128  conv4c 
conv6c  3  1  128  128  conv5c 
conv7c  3  1  128  128  conv6c 
deconv8c  9  2  128  1  conv7c 
The MDRN network consists of the SRNA, SRNB, and CRN networks. In fact, we could let the SRNA and SRNB networks share the same parameter set, and the CRN network could use the outputs of the SRNA and SRNB networks to reconstruct the central image. However, in order to better backpropagate errors from the MDRN network to the preceding networks and to avoid an overly deep network for central reconstruction, we use three separate networks, without cross connections or weight sharing, to reconstruct the side images and the central image respectively. Each uses eight layers: seven convolutional layers and one deconvolutional layer, so as to remove coding artifacts and upscale the feature maps to full resolution at the same time. The obvious difference between them is that the CRN network takes two lossy descriptions as input, while the other two networks take only one. All the details are specified in Table II, from which we can observe that the first and last layers use 9x9 spatial kernels to ensure a sufficiently large receptive field, so that more spatial features are captured to better reconstruct the degraded descriptions. In addition, all convolutional layers are activated by ReLU, except the last layers of the SRNA, SRNB, and CRN networks, which have no activation.
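The single stride-2 deconvolution ending each reconstruction network must undo the halving of resolution that the descriptions carry. A quick check of the shape arithmetic, using the usual 'same'-padding conventions (the exact padding of the implementation is not specified in the text):

```python
def conv_out(n, stride):
    """'Same'-padded convolution output length: ceil(n / stride)."""
    return -(-n // stride)

def deconv_out(n, stride):
    """'Same'-padded transposed-convolution output length."""
    return n * stride

# A 160x160 training patch is halved by the stride-2 layer in the MDGN,
# then the stride-2 deconv8 in SRNA/SRNB/CRN restores full resolution.
h = 160
h_low = conv_out(h, 2)        # half-resolution description height
h_rec = deconv_out(h_low, 2)  # reconstruction height
```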
As described above, the MDVCN network bridges the gap between the MDGN and MDRN networks, so that reconstruction errors can be properly backpropagated from the MDRN network to the MDGN network. The MDVCN and MDRN networks are designed with the same structure, because they can be seen as the same class of low-level image processing problems solved by learning. Thus, the MDVCN network comprises three virtual networks: virtual side reconstruction network A (VSRNA), virtual side reconstruction network B (VSRNB), and a virtual central reconstruction network (VCRN), whose structures in Table III mirror those of the MDRN network in Table II. However, the inputs of the two networks differ: the MDRN network takes the decoded lossy descriptions as inputs, while the MDVCN network is fed with the lossless multiple descriptions.
II-D Network learning
Obviously, it is challenging to learn the whole framework directly, but the problem of learning the multiple description neural networks can be separated into several sub-problems. To resolve them, we provide two ways of backpropagating the error, presented in the following and referred to as learning algorithm 1 and learning algorithm 2, respectively. Learning algorithm 1 treats the MDVCN network as a feature function that builds the reconstruction with the parameters of the MDVCN network fixed, so that reconstruction errors from the MDVCN network can be backpropagated to supervise the MDGN network ahead of the standard codec. This means that the MDGN and MDRN networks are trained separately. On the contrary, learning algorithm 2 uses the error backpropagated from the MDVCN network to the MDGN network to approximately estimate the error through the codec without fixing any network's parameters, explicitly training the MDGN and MDRN networks simultaneously. The details of these two learning algorithms are described next.
II-D1 Learning algorithm 1
To backpropagate the error from the MDRN network to the MDGN network, we decompose the joint learning problem of the MDGN, MDRN, and MDVCN networks in Eq. (1) into three separate sub-problems, which nevertheless depend closely on each other. Specifically, we first initialize all the parameter sets mentioned previously, generate a multiple description dataset by downsampling for the training of the MDRN network, and compress this dataset. Secondly, the parameter set of the MDRN network is updated by minimizing Eq. (9). Then, we generate the multiple description reconstruction dataset with the trained MDRN network; this reconstruction dataset is used to train the MDVCN network by updating its parameter set according to the minimization of Eq. (10). Next, we update the parameter set of the MDGN network, with the MDVCN parameters fixed, according to the minimization of Eq. (3) and Eq. (9). After training the MDGN network, the multiple description images are regenerated with its parameter set, and the next iteration starts. The details of learning algorithm 1 are summarized in Algorithm 1.
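Algorithm 1 itself is not reproduced in this text, so the following skeleton only restates the iteration order described above, with every training stage reduced to a placeholder callable (none of these arguments are the real training routines):

```python
def learning_algorithm_1(train_mdrn, train_mdvcn, train_mdgn,
                         generate_descriptions, compress, images, n_iters=3):
    """Alternating training loop sketched from the description above.
    All arguments are placeholder callables standing in for real stages."""
    # Initialization: e.g. polyphase downsampling of the training images.
    descriptions = generate_descriptions(images)
    for _ in range(n_iters):
        lossy = compress(descriptions)                 # run the standard codec
        mdrn = train_mdrn(lossy)                       # step 1: minimize Eq. (9)
        reconstructions = [mdrn(d) for d in lossy]     # build MDVCN training set
        mdvcn = train_mdvcn(descriptions, reconstructions)  # step 2: Eq. (10)
        mdgn = train_mdgn(mdvcn)                       # step 3: Eq. (3)+(9), MDVCN fixed
        descriptions = [mdgn(x) for x in images]       # regenerate, then iterate
    return mdgn, mdrn
```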
II-D2 Learning algorithm 2
Different from learning algorithm 1, we separate the learning of the whole framework into two sub-problems: simultaneously learning the MDGN and MDRN networks, and learning the MDVCN network. Concretely, the parameter sets of the MDGN and MDRN networks are optimized with a gradient descent method at the same time. After feeding the input data into the MDGN network to produce two descriptions and compressing them with the standard codec, the MDRN network is used to reconstruct the compressed descriptions. Meanwhile, the lossless descriptions are fed into the MDVCN network. This constitutes the feedforward propagation of our deep convolutional neural networks, but the error from the MDRN network is blocked by the codec. Here, we explicitly use the error from the MDVCN network as an approximation of the error through the codec. Thus, we can update the MDGN and MDRN networks simultaneously. The whole process is detailed in Algorithm 2.
Comparing learning algorithm 1 with learning algorithm 2, we can see that the training stability of the second depends on whether the pre-trained MDVCN network is well trained. This network also has a great impact on the learning of the MDGN network, because poor accuracy of the approximated error propagation from the MDVCN network will result in insufficient multiple description generation. On the contrary, the first algorithm is more easily implemented on any neural network platform, because it requires no changes to the optimization process. Meanwhile, the performance of learning algorithm 1 tends to be more stable than that of the second, owing to the reliable dependency among the three neural networks: good training of the MDRN network directly leads to good training of the MDVCN network, which in turn supervises the MDGN network. In conclusion, both algorithms can resolve the learning problem of the multiple description neural networks, but learning algorithm 1 is more practical, so we use it to illustrate the efficiency of the whole framework in the experimental section.
III Experimental results
We evaluate the proposed method against eight baselines built from state-of-the-art artifact removal techniques [22, 23, 25, 24] and advanced super-resolution based on very deep convolutional neural networks [31, 32]. Note that 20 convolutional layers are used for super-resolution in [31, 32]. Four baselines, "MDB1a-MDB4a", are formed from the artifact removal techniques [22, 23, 25, 24] combined with the super-resolution of [31]. Meanwhile, the super-resolution of [32] is combined with the artifact removal methods [22, 23, 25, 24] to build four other baselines, "MDB1b-MDB4b". Furthermore, in order to fully demonstrate the efficiency of the proposed method, we form a baseline model, denoted "Our-base", by replacing the MDGN network with the polyphase downsampling technique of [19] for generating multiple descriptions. For simplicity, the proposed method is marked as "Ours". The training of the proposed framework is described in detail next.
III-A Training data and implementation details
Our whole framework is implemented on the TensorFlow platform [39] with Algorithm 1. The 400 images of size 180x180 from [40] serve as our training dataset, augmented by cropping, flipping, and rotation; in total, 3200 image patches of size 160x160 are used for training. The four images in Fig. 2 are used for testing to evaluate the efficiency of the proposed method. Our framework is trained with the Adam optimization method [41]. The learning rate is initially set to 0.0001; it decays to half of the initial value when training reaches 3/5 of the total steps, and to 1/4 of the initial value once training reaches 4/5 of the total steps. The multiple descriptions are compressed by the standard JPEG codec with quality factors 2, 6, 10, 20, and 40 for the proposed framework during training and testing. The multiple descriptions for "MDB1a-MDB4a", "MDB1b-MDB4b", and "Our-base" are compressed with quality factors from the set 2, 3, 4, 10, and 50.

III-B Comparisons with several baselines
To validate the efficiency of the proposed framework, we employ the Peak Signal-to-Noise Ratio (PSNR) and SSIM to measure objective quality. The multiple description artifact removal results with Foi's SA-DCT [22], DicTV [23], BM3D [24], and CONCOLOR [25] are obtained by strictly using the authors' published code with the parameter settings given in their papers. Meanwhile, for image super-resolution with [31, 32], we use the officially provided models to enlarge the multiple descriptions after artifact removal, so as to guarantee the strength of the eight baselines when comparing with the proposed method.
From the comparisons in Fig. 3 and Fig. 4, it can be seen that Our-base outperforms the eight baselines MDB1a-MDB4a and MDB1b-MDB4b on SSIM for both side and central reconstruction over the full bitrate range. In most cases, the PSNR of Our-base is also better than that of the eight baselines. Only at very low bitrates is the PSNR of Our-base slightly lower than that of MDB4a and MDB4b, but Our-base, with its higher SSIM, is still preferable to MDB4a and MDB4b. This is because preserving image structure matters more than preserving detail at very low bitrates, as shown in Fig. 3 and Fig. 4.
Compared to Our-base, the proposed method achieves larger PSNR and SSIM gains in most cases, especially at high bitrates. Because the proposed method focuses on appearance similarity but detail difference during multiple description generation, without an explicit structural distance loss to regularize training at very low bitrates, it yields slightly lower PSNR gains than Our-base in some cases. To improve the proposed method at low bitrates, one way is to replace the direct description distance loss with a structural distance loss during training. Another feasible way is to employ a 4x resolution reduction when generating the descriptions with the MDGN network and compressing the descriptions at very low bitrates, while using a larger quality factor for the proposed method, as in our previous work [33].
Among the baselines, MDB4a and MDB4b outperform MDB1a-MDB3a and MDB1b-MDB3b on PSNR and SSIM when comparing side description reconstruction quality. But for central reconstruction, MDB4a and MDB4b cannot compete with MDB1a-MDB3a and MDB1b-MDB3b. MDB3a and MDB3b have the best central reconstruction PSNR among the eight baselines, while MDB1a-MDB3a perform very similarly on central reconstruction. Although [32] reports greater PSNR gains than [31] for general image super-resolution, the performance of [32] is only slightly better than that of [31] when these super-resolution approaches are used to enhance description resolution after artifact removal, as can be seen by comparing MDB1a-MDB4a with MDB1b-MDB4b in Fig. 3 and Fig. 4.
We have compared the visual quality of the proposed method with that of other multiple description coding methods based on deep convolutional neural networks, as displayed in Fig. 5 and Fig. 6. In these figures, MDB1a (24.574/0.714/0.292 (s) and 27.849/0.778/0.583 (c)) denotes the PSNR/SSIM/bpp measurements for the side (s) and central (c) reconstructions of MDB1a; the other methods are denoted in the same way. The descriptions produced by our MDGN network, as displayed in Fig. 5(a3) and Fig. 6(a3), retain more important details than those generated with the polyphase down-sampling technique [19], even after image compression. The differences between these pairs of descriptions are exhibited in Fig. 5(a4) and Fig. 6(a4), from which it can be observed that the proposed method tends to place the description distance on the details while preserving little structural difference. Furthermore, the descriptions from our MDGN network tend to highlight salient feature pixels in all the descriptions in order to protect the key features. Therefore, the protected features of lossy descriptions can always be retained, even though the descriptions may be badly smoothed and contaminated by compression, as shown in Fig. 5(a5)-(a6) and Fig. 6(a5)-(a6).
The side reconstruction images and central reconstruction images are displayed in Fig. 5(b-f) and Fig. 6(b-f). From these figures, it can be clearly seen that the side and central reconstructions of the proposed method look more natural and preserve more details than those of the eight baselines MDB1a-MDB4a and MDB1b-MDB4b and of ours-base, while ours-base in turn performs better than the eight baselines. Among the baselines, both MDB4a and MDB4b keep more details than MDB1a-MDB3a and MDB1b-MDB3b, as can be seen in Fig. 5(b-e) and Fig. 6(b-e). From the above objective and visual comparisons, it can be concluded that, compared with the polyphase down-sampling technique, it is very important to emphasize significant context features when automatically generating appearance-similar but detail-different descriptions with convolutional neural networks. Meanwhile, better descriptions always lead to better side and central reconstructions.
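For contrast with the learned MDGN descriptions, the polyphase down-sampling baseline [19] splits an image into sub-sampled descriptions deterministically. A minimal NumPy sketch of a 2x2 polyphase spatial split (the exact arrangement in [19] may differ in detail):

```python
import numpy as np

def polyphase_descriptions(image):
    """Split an image into four descriptions, each keeping one pixel
    of every 2x2 block (a 2x2 polyphase spatial split)."""
    return [image[r::2, c::2] for r in (0, 1) for c in (0, 1)]

img = np.arange(16).reshape(4, 4)
descs = polyphase_descriptions(img)
print(descs[0])  # [[ 0  2]
                 #  [ 8 10]]
```

Because these sub-images are fixed samplings of the input, they cannot emphasize salient features the way a learned generator can, which is one reason key details are more easily lost after compression.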
IV Conclusion
In this paper, we have introduced multiple description image coding based on deep convolutional neural networks. First, a multiple description generation network is employed to automatically yield valid multiple descriptions. Second, these descriptions are compressed by a standard codec, so that the whole framework remains compatible with standard codecs. Third, a multiple description reconstruction network is used to enhance the compressed descriptions and restore them to full resolution. Besides, two learning algorithms are provided to train the whole framework. Moreover, a distance loss and an SSIM loss are combined to train the multiple description generator network, so as to ensure that the generated descriptions are diverse yet share structural information.
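The exact form of the combined objective is not restated here, but a minimal NumPy sketch of one plausible combination illustrates the idea of rewarding shared structure while penalizing overly similar details. The weighting `alpha`, the sign conventions, and the single-window SSIM simplification are all our assumptions, not the paper's definitive loss:

```python
import numpy as np

C1, C2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2  # standard SSIM constants

def global_ssim(x, y):
    """Single-window SSIM over the whole image (a simplification; the
    full index of Wang et al. [34] averages over local windows)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

def generator_loss(desc_a, desc_b, target, alpha=0.5):
    """Hypothetical combination: SSIM terms pull both descriptions toward
    the target's structure, while a negated distance term pushes their
    details apart so the descriptions stay diverse."""
    structure = (1 - global_ssim(desc_a, target)) + (1 - global_ssim(desc_b, target))
    diversity = -np.mean(np.abs(desc_a - desc_b))
    return structure + alpha * diversity
```

In an actual training loop the same two terms would be expressed with differentiable framework ops (e.g. in TensorFlow [39]) rather than NumPy.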
References
 [1] Y. Wang, A. Reibman, and S. Lin, “Multiple description coding for video delivery,” Proceedings of the IEEE, vol. 93, no. 1, pp. 57–70, 2005.
 [2] V. Goyal, “Multiple description coding: compression meets the network,” IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 74–93, 2001.
 [3] K. Viswanatha, E. Akyol, and K. Rose, “Combinatorial message sharing and a new achievable region for multiple descriptions,” IEEE Transactions on Information Theory, vol. 62, no. 2, pp. 769–792, 2016.
 [4] H. Bai, W. Lin, M. Zhang, A. Wang, and Y. Zhao, “Multiple description video coding based on human visual system characteristics,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 8, pp. 1390–1394, 2014.
 [5] M. Liu and C. Zhu, “Enhancing two-stage multiple description scalar quantization,” IEEE Signal Processing Letters, vol. 16, no. 4, pp. 253–256, 2009.
 [6] V. Vaishampayan, “Design of multiple description scalar quantizers,” IEEE Transactions on Information Theory, vol. 39, no. 3, pp. 821–834, 1993.
 [7] S. Servetto, K. Ramchandran, V. Vaishampayan, and K. Nahrstedt, “Multiple description wavelet based image coding,” IEEE Transactions on Image Processing, vol. 9, no. 5, pp. 813–826, 2000.
 [8] Y. Xu and C. Zhu, “End-to-end rate-distortion optimized description generation for H.264 multiple description video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 23, no. 9, pp. 1523–1536, 2013.
 [9] C. Tian and J. Chen, “New coding schemes for the symmetric k-description problem,” IEEE Transactions on Information Theory, vol. 56, no. 10, pp. 5344–5365, 2010.
 [10] V. Vaishampayan, N. Sloane, and S. Servetto, “Multiple-description vector quantization with lattice codebooks: Design and analysis,” IEEE Transactions on Information Theory, vol. 47, no. 5, pp. 1718–1734, 2001.
 [11] V. Goyal, J. Kelner, and J. Kovacevic, “Multiple description vector quantization with a coarse lattice,” IEEE Transactions on Information Theory, vol. 48, no. 3, pp. 781–788, 2002.
 [12] H. Bai, C. Zhu, and Y. Zhao, “Optimized multiple description lattice vector quantization for wavelet image coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 7, pp. 912–917, 2007.
 [13] Y. Wang, M. Orchard, V. Vaishampayan, and A. Reibman, “Multiple description coding using pairwise correlating transforms,” IEEE Transactions on Image Processing, vol. 10, no. 3, pp. 351–366, 2001.
 [14] I. Bajic and J. Woods, “Concatenated multiple description coding of frame-rate scalable video,” in International Conference on Image Processing, New York, 2002.
 [15] Y. Zhang, M. Motani, and H. Garg, “Wireless video transmission using multiple description codes combined with prioritized DCT compression,” in International Conference on Multimedia and Expo, Lausanne, Aug. 2002.
 [16] J. Apostolopoulos, “Error-resilient video compression through the use of multiple states,” in IEEE International Conference on Image Processing, Vancouver, Sep. 2000.
 [17] C. Kim and S. Lee, “Multiple description coding of motion fields for robust video transmission,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 9, pp. 999–1010, 2001.
 [18] S. Cho and W. Pearlman, “A full-featured, error-resilient, scalable wavelet video codec based on the set partitioning in hierarchical trees (SPIHT) algorithm,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 3, pp. 157–171, 2002.
 [19] N. Franchi, M. Fumagalli, R. Lancini, and S. Tubaro, “Multiple description video coding for scalable and robust transmission over IP,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 3, pp. 321–334, 2005.
 [20] T. Tillo, M. Grangetto, and G. Olmo, “Multiple description image coding based on Lagrangian rate allocation,” IEEE Transactions on Image Processing, vol. 16, no. 3, pp. 673–683, 2007.
 [21] T. Tillo and G. Olmo, “A novel multiple description coding scheme compatible with the JPEG2000 decoder,” IEEE Signal Processing Letters, vol. 11, no. 11, pp. 908–911, 2004.
 [22] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images,” IEEE Transactions on Image Processing, vol. 16, no. 5, pp. 1395–1411, 2007.
 [23] H. Chang, M. Ng, and T. Zeng, “Reducing artifacts in JPEG decompression via a learned dictionary,” IEEE Transactions on Signal Processing, vol. 62, no. 3, pp. 718–728, 2014.
 [24] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” IEEE Transactions on Image Processing, vol. 16, no. 8, pp. 2080–2095, 2007.
 [25] J. Zhang, R. Xiong, C. Zhao, Y. Zhang, S. Ma, and W. Gao, “CONCOLOR: Constrained non-convex low-rank model for image deblocking,” IEEE Transactions on Image Processing, vol. 25, no. 3, pp. 1246–1259, 2016.
 [26] L. Zhao, H. Bai, A. Wang, Y. Zhao, and B. Zeng, “Two-stage filtering of compressed depth images with Markov random field,” Signal Processing: Image Communication, vol. 51, pp. 11–22, 2017.

 [27] C. Dong, Y. Deng, C. C. Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in IEEE Conference on Computer Vision and Pattern Recognition, Boston, Jun. 2015.
 [28] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Deep generative adversarial compression artifact removal,” in arXiv:1704.02518, 2017.
 [29] L. Cavigelli, P. Hager, and L. Benini, “CAS-CNN: A deep convolutional neural network for image compression artifact suppression,” in IEEE International Joint Conference on Neural Networks, Anchorage, May 2017.
 [30] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Iterative range-domain weighted filter for structural preserving image smoothing and denoising,” Multimedia Tools and Applications, pp. 1–28, 2017.
 [31] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, Jun. 2016.
 [32] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
 [33] L. Zhao, H. Bai, A. Wang, and Y. Zhao, “Learning a virtual codec based on deep convolutional neural network to compress image,” in arXiv:1712.05969, 2017.
 [34] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
 [35] A. El Gamal and T. Cover, “Achievable rates for multiple descriptions,” IEEE Transactions on Information Theory, vol. 28, no. 6, pp. 851–857, 1982.
 [36] L. Lastras and V. Castelli, “Near sufficiency of random coding for two descriptions,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 681–695, 2006.
 [37] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” in arXiv:1511.05440, 2015.
 [38] L. Zhao, J. Liang, H. Bai, A. Wang, and Y. Zhao, “Simultaneously color-depth super-resolution with conditional generative adversarial network,” in arXiv:1708.09105, 2017.
 [39] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, et al., “TensorFlow: Large-scale machine learning on heterogeneous distributed systems,” in arXiv:1603.04467, 2016.
 [40] Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1256–1272, 2017.
 [41] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in arXiv:1412.6980, 2014.