Codes for "Deep Joint Source-Channel Coding for Wireless Image Transmission with Adaptive Rate Control"
We present a novel adaptive deep joint source-channel coding (JSCC) scheme for wireless image transmission. The proposed scheme supports multiple rates using a single deep neural network (DNN) model and learns to dynamically control the rate based on the channel condition and the image content. Specifically, a policy network is introduced to exploit the tradeoff space between rate and signal quality. To train the policy network, the Gumbel-Softmax trick is adopted to make it differentiable so that the whole JSCC scheme can be trained end-to-end. To the best of our knowledge, this is the first deep JSCC scheme that can automatically adjust its rate using a single network model. Experiments show that our scheme successfully learns a reasonable policy that decreases channel bandwidth utilization for high-SNR scenarios or simple image contents. For an arbitrary target rate, our rate-adaptive scheme using a single model achieves performance similar to an optimized model specifically trained for that fixed target rate. To reproduce our results, we make the source code publicly available at https://github.com/mingyuyng/Dynamic_JSCC.
Based on Shannon's separation theorem, most modern systems adopt separate source coding (e.g., JPEG, BPG) and channel coding (e.g., LDPC, Polar) for wireless image transmission. However, to achieve optimality, the codeword length needs to be (infinitely) long. Moreover, optimality also breaks down for non-ergodic source or channel distributions. Thus, joint source-channel coding (JSCC) can enable significant gains, as demonstrated in various schemes such as vector quantization and index assignment [16, 6, 2].
In this paper, we propose a novel JSCC scheme that adapts its rate conditioned on the channel SNR and the image content using a single network model. To train the JSCC model over a range of SNRs, we adopt SNR-adaptive modules that modulate the intermediate features based on the SNR. Furthermore, motivated by the adaptive computation idea in [4, 17, 5, 15], we introduce a policy network that learns to dynamically decide the number of active features for transmission, and we adopt the Gumbel-Softmax trick to make the decision process differentiable. Our experimental results show that the model learns a policy that assigns different rates to different SNRs and image contents. Specifically, it learns to reduce channel bandwidth utilization when the SNR is high and/or the source image contains less information, while maintaining image quality equivalent to that of a specialized model trained for that particular rate.
The overall structure of the proposed deep JSCC scheme is shown in Fig. 1. First, the source-encoder extracts features from the source image. These features are fed to the channel-encoder, which outputs feature groups of equal length. The leading groups contain selective features that can be either active or inactive as controlled by the policy network, whereas the remaining groups contain non-selective features that are always active. Instead of handcrafting the selection, the policy network outputs a binary mask that selects the active groups for each input. After selection, all active features are passed through a power normalization module to generate complex-valued transmission symbols with unit average power, using the first half of the features as the real part and the other half as the imaginary part. The symbols are transmitted over the noisy wireless channel, and the received symbols are fed to the channel-decoder and source-decoder sequentially to reconstruct the source image.
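The group selection described above can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation; the function name `select_active` and the flat-vector layout (selective groups first, non-selective groups last) are assumptions for the example.

```python
import numpy as np

def select_active(features, mask, group_len, n_selective):
    """Keep the selective groups whose mask bit is 1, plus all
    non-selective trailing groups, and concatenate them for
    transmission (illustrative sketch)."""
    groups = features.reshape(-1, group_len)
    keep = np.concatenate([mask.astype(bool),
                           np.ones(len(groups) - n_selective, dtype=bool)])
    return groups[keep].reshape(-1)
```

For example, with four groups of length 3 where the first two are selective and the mask deactivates the second, only three groups are transmitted.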
We quantify the transmission rate in terms of wireless channel usage per pixel (CPP). Suppose an input image has dimension H x W with 3 RGB channels, and k active groups of length-L features are transmitted. Then the CPP is defined as kL / (2HW), which varies with the number of active feature groups. The factor of 1/2 in the CPP arises because transmission over the wireless channel is complex-valued (quadrature). We only consider the AWGN wireless channel, so the received symbols satisfy y = x + n, where n is a complex Gaussian noise vector. The channel condition/quality is fully captured by the signal-to-noise ratio (SNR), which is assumed to be known at both the transmitter and the receiver. The SNR value is fed to the channel-encoder, the channel-decoder, and the policy network so that the model adapts to the wireless channel condition.
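The CPP accounting and the AWGN channel model can be made concrete with a small sketch. The function names `cpp` and `awgn` are illustrative, and the sketch assumes unit-average-power transmit symbols as stated in the text.

```python
import numpy as np

def cpp(k_active_groups, group_len, height, width):
    """Channel usage per pixel: k*L real features form k*L/2 complex
    symbols, divided by the H*W pixels of the image."""
    return k_active_groups * group_len / (2 * height * width)

def awgn(x, snr_db, rng=None):
    """AWGN channel y = x + n for unit-power complex symbols x, with the
    complex Gaussian noise power set by the SNR in dB."""
    rng = np.random.default_rng() if rng is None else rng
    noise_power = 10 ** (-snr_db / 10)       # signal power assumed to be 1
    n = np.sqrt(noise_power / 2) * (rng.standard_normal(x.shape)
                                    + 1j * rng.standard_normal(x.shape))
    return x + n
```

With a 32x32 image, for instance, transmitting 1024 real features (512 complex symbols) yields a CPP of 0.5.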
The structures of the channel-encoder and channel-decoder networks are shown in Fig. 2, where additional SNR-adaptive modules are inserted between layers to modulate the intermediate features. The structure of the SNR-adaptive module is shown at the bottom of Fig. 2. The input features are first average-pooled across each channel (we use 'channel' to indicate a feature channel of a neural network and 'wireless channel' to indicate the wireless transmission channel) and then concatenated with the SNR value. The result is passed through two multi-layer perceptrons (MLPs) to generate the factors for channel-wise scaling and addition. In the channel-encoder, the input features are first fed to a series of ResNet and SNR-adaptive modules and then projected to a specific output size through 2D convolution and reshaping. In the channel-decoder, since we only receive the symbols in the active groups selected by the transmitter, we simply zero-pad the deactivated groups to keep the input size the same regardless of the CPP.
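The receiver-side zero-padding can be sketched as the inverse of the transmitter-side selection. The function name `zero_pad_received` and the group layout are assumptions of this illustration, not the paper's code.

```python
import numpy as np

def zero_pad_received(symbols, mask, group_len, n_selective, n_groups):
    """Place the received groups back at their original positions and
    zero-fill the deactivated selective groups, so the channel-decoder
    always sees a fixed-size input regardless of the CPP."""
    out = np.zeros(n_groups * group_len)
    keep = np.concatenate([mask.astype(bool),
                           np.ones(n_groups - n_selective, dtype=bool)])
    out.reshape(n_groups, group_len)[keep] = symbols.reshape(-1, group_len)
    return out
```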
The proposed policy network learns to select the number of active feature groups conditioned on the SNR and the image content. The whole process can be modeled as sampling from a categorical distribution over the possible numbers of active groups. The structure of the policy network is shown in Fig. 3. The image features are first average-pooled and concatenated with the SNR, then passed through a two-layer MLP with a softmax at the end to generate the probability of each option. We sample the decision as a one-hot vector via Gumbel-Softmax (discussed later) and transform it into a thermometer-coded vector that serves as the final adaptive transmission mask. This thermometer encoding ensures that we always activate consecutive groups of features from the beginning. Thus, no extra control messages (except the end-of-transmission signaling) are needed to inform the receiver which groups of features are activated vs. deactivated.
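The one-hot-to-thermometer transformation can be sketched in a few lines. The convention assumed here (option i activates the first i+1 selective groups) and the name `thermometer_mask` are illustrative.

```python
import numpy as np

def thermometer_mask(one_hot):
    """Convert a one-hot rate decision into a thermometer code: under the
    convention assumed here, choosing option i activates the first i+1
    selective feature groups."""
    idx = int(np.argmax(one_hot))
    mask = np.zeros_like(one_hot)
    mask[:idx + 1] = 1
    return mask
```

Because the ones are always contiguous from the start, the receiver only needs to know how many groups arrived, not which ones.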
Training the policy network is not trivial because the sampling process is discrete in nature, which makes the network non-differentiable and hard to optimize with back-propagation. One common choice is a score-function estimator that avoids back-propagating through samples (e.g., REINFORCE). However, that approach often converges slowly in many applications and suffers from high variance. As an alternative, we adopt the Gumbel-Softmax scheme, which resolves the non-differentiability by sampling from the corresponding Gumbel-Softmax distribution.
Suppose the probability of category i is p_i for i = 1, ..., K. With the Gumbel-Max trick, a discrete sample z from the distribution can be drawn as

    z = argmax_i (log p_i + g_i),                                  (1)

where each g_i = -log(-log(u_i)) is standard Gumbel noise with u_i sampled from a uniform distribution on (0, 1). Since the argmax operation is not differentiable, the Gumbel-Softmax distribution is used as a continuous relaxation of the argmax. Representing z as a one-hot vector, we use the relaxed form given by the softmax function

    y_i = exp((log p_i + g_i) / t) / sum_j exp((log p_j + g_j) / t),   (2)
where t > 0 is a temperature parameter that controls the discreteness of y. The distribution converges to a uniform distribution as t goes to infinity, whereas t approaching 0 makes y close to a one-hot vector and indistinguishable from the discrete distribution. For the forward pass during network training, we sample the policy from the discrete distribution (1), whereas the continuous relaxation (2) is used in the backward pass to approximate the gradient. When the trained model is used for adaptive-rate JSCC, we transform the one-hot output into a thermometer-coded mask vector whose number of ones equals the number of active groups.
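The sampling step can be sketched as follows. This is an illustrative numpy version; the function names are assumptions, and in practice an autodiff framework (e.g., PyTorch's built-in Gumbel-Softmax) would handle the straight-through gradient automatically.

```python
import numpy as np

def gumbel_softmax(log_probs, tau, rng):
    """Relaxed categorical sample: perturb log-probabilities with Gumbel
    noise, then apply a temperature-scaled softmax."""
    u = rng.uniform(low=1e-9, high=1.0, size=log_probs.shape)
    g = -np.log(-np.log(u))              # standard Gumbel noise
    y = (log_probs + g) / tau
    y = np.exp(y - y.max())              # numerically stable softmax
    return y / y.sum()

def hard_sample(y_soft):
    """One-hot sample used in the forward pass; an autodiff framework
    would route the gradient through y_soft (straight-through)."""
    hard = np.zeros_like(y_soft)
    hard[np.argmax(y_soft)] = 1.0
    return hard
```

Low temperatures concentrate `y_soft` near a vertex of the simplex, so the soft and hard samples nearly coincide.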
During training, we minimize a loss that encourages accurate image reconstruction while penalizing the number of active feature groups to reduce bandwidth usage (i.e., lower CPP):

    L = d(x, x_hat) + lambda * k,                                  (3)

where the first term is the reconstruction loss between the source image x and the reconstruction x_hat, the second term is the channel usage measured by the number of active groups k, and lambda is a weight parameter.
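A minimal sketch of this objective, assuming mean-squared error as the reconstruction distortion (the function name `jscc_loss` is illustrative):

```python
import numpy as np

def jscc_loss(x, x_hat, k_active_groups, lam):
    """Mean-squared reconstruction error plus a channel-usage penalty
    weighted by lam: larger lam pushes the policy toward lower CPP."""
    mse = np.mean((x - x_hat) ** 2)
    return mse + lam * k_active_groups
```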
We evaluate the proposed method on the CIFAR-10 dataset, which consists of 50,000 training and 10,000 testing images with 32x32 pixels. The Adam optimizer is adopted to train the model. We first train the networks for 150 epochs with an initial learning rate and for another 150 epochs with a reduced learning rate. Then, we fix the encoder and the policy network and fine-tune the other modules for another 100 epochs. The batch size is 128. The initial temperature is 5 and it is gradually lowered with an exponential decay. During training, we sample the SNR uniformly between 0 dB and 20 dB. With the chosen numbers of feature groups and group length, the proposed method provides 5 possible rates (CPP) in total: 0.25, 0.313, 0.375, 0.438, and 0.5. To serve as baselines, we also train multiple fixed-rate models that use pre-determined sub-group-level feature masking to obtain a constant target CPP.
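The temperature annealing can be sketched as a simple exponential schedule. The initial value of 5 matches the text; the decay rate used below is an illustrative placeholder, since the exact value is not reproduced here.

```python
import math

def temperature(epoch, tau0=5.0, rate=0.02):
    """Exponentially annealed Gumbel-Softmax temperature: starts at tau0
    and decays toward 0, sharpening the relaxed samples over training."""
    return tau0 * math.exp(-rate * epoch)
```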
We first show the trade-off between the average rate and image quality of the proposed method in Fig. 4. Image quality is evaluated with the peak signal-to-noise ratio (PSNR). The PSNR of the baselines with the maximum (CPP = 0.5) and minimum (CPP = 0.25) fixed rates is plotted in black as a reference. When the SNR is low (0 dB), our method tends to select the maximum possible rate. As we gradually increase the SNR, the average rate drops. This trend shows that our method successfully learns a policy that utilizes fewer channel resources (lower CPP) when the channel condition (SNR) is good, and vice versa. As we increase the weight lambda in the loss function, the policy network pays more attention to the rate than to the image quality, and thus the rate decreases faster as the SNR increases.
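For reference, the PSNR metric used throughout the evaluation can be computed as below (a standard definition, with the peak value as a parameter; the function name is illustrative):

```python
import numpy as np

def psnr(img, ref, peak=255.0):
    """Peak signal-to-noise ratio in dB between an image and a reference:
    10 * log10(peak^2 / MSE)."""
    mse = np.mean((np.asarray(img, float) - np.asarray(ref, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```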
Next, we compare the performance of the proposed method with 1) the state-of-the-art image codec BPG combined with idealized error-free transmission at the Shannon capacity, and 2) a collection of fixed-rate baseline models, each trained for one particular rate. The results are shown in Fig. 5. With a relatively small lambda, our method chooses a relatively high CPP (compared to a larger-lambda counterpart) at a given SNR, and it outperforms BPG+Capacity in the low-SNR region. When trained with a larger lambda, the resulting CPP is relatively small and our method outperforms the BPG+Capacity baseline at all SNRs. Compared with baseline models specifically trained for a fixed rate with pre-determined activation masking, our method achieves similar performance at each comparison point, although it uses only a single trained model adapting to different SNRs.
Finally, we fix the SNR as well as lambda to observe the variation of the rate across different image classes. In Fig. 6, we plot the average rates and PSNR for all 10 classes in CIFAR-10. An uneven rate allocation emerges because the policy network tends to assign higher CPP to classes with richer information (e.g., Automobile, Truck) and lower CPP to classes with relatively simple contents (e.g., Ship, Airplane). With such a strategy, our scheme decreases the variation of reconstructed image quality (PSNR) across classes compared to a fixed-rate baseline, which assigns the same CPP to all classes. For this particular SNR and lambda, the standard deviations of PSNR across classes are 1.017, 0.991, 0.985, 0.947, and 0.923 for the five fixed-rate baseline models with CPP = 0.25, 0.313, 0.375, 0.438, and 0.5, respectively. By contrast, our method with adaptive rate control exhibits a significantly lower standard deviation of 0.613. This indicates that, at the same average CPP, our method produces more uniform image quality across classes (with an adaptive rate per class), while the fixed-rate scheme generates images of unbalanced quality by forcing the same CPP on all classes.
In this paper, we present a novel deep JSCC scheme that supports multiple rates with a single model. The policy network automatically assigns the rate conditioned on the channel SNR and image content by dynamically producing a binary mask that activates or deactivates image features. To make the policy network differentiable, the Gumbel-Softmax trick is adopted. Experiments show that our method learns a reasonable policy that allocates less bandwidth when the SNR is high or the image contains less information. With the advantage of adaptive rate control, our method experiences only negligible performance degradation compared with multiple single-rate models at each operating condition.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3), pp. 229–256.