I Introduction
Modern communication systems employ a two-step encoding process for the transmission of image/video data (see Fig. (a) for an illustration): (i) the image/video data is first compressed with a source coding algorithm in order to remove the inherent redundancy and to reduce the amount of transferred information; and (ii) the compressed bitstream is then encoded with an error correcting code, which enables resilient transmission against errors, and finally modulated. Shannon’s separation theorem proves that this two-step source and channel coding approach is theoretically optimal in the asymptotic limit of infinitely long source and channel blocks [1]. While joint source and channel coding (JSCC) is known to outperform the separate approach in practical applications [2], the separate architecture is attractive for practical communication systems thanks to the modularity it provides. Moreover, highly efficient compression algorithms (e.g., JPEG, JPEG2000, WebP [3]) and near-optimal channel codes (e.g., LDPC, Turbo codes) are employed in practice to approach the theoretical limits. However, many emerging applications, from the Internet-of-things to autonomous driving and the tactile Internet, require transmission of image/video data under extreme latency, bandwidth and/or energy constraints, which preclude computationally demanding long-blocklength source and channel coding techniques.
We propose a JSCC technique for wireless image transmission that directly maps the image pixel values to the complex-valued channel input symbols. Inspired by the success of unsupervised deep learning (DL) methods, in particular, the autoencoder architectures [4, 5], we design an end-to-end communication system, where the encoding and decoding functions are parameterized by two convolutional neural networks (CNNs) and the communication channel is incorporated in the neural network (NN) architecture as a non-trainable layer; hence, the name deep JSCC. Two channel models, the additive white Gaussian noise (AWGN) channel and the slow Rayleigh fading channel, are considered in this work due to their widespread adoption in representing realistic channel conditions. The proposed solution is readily extendable to other channel models, as long as they can be represented as a non-trainable NN layer with a differentiable transfer function.
DL-based methods, and, particularly, autoencoders, have recently shown remarkable results in image compression, achieving or even surpassing the performance of state-of-the-art lossy compression algorithms. Ballé et al. [6] propose an end-to-end optimized image compression method, consisting of a nonlinear analysis transformation, a uniform quantizer, and a nonlinear synthesis transformation. Their method exhibits better rate-distortion performance than JPEG and JPEG2000 in most images, while the visual quality, as captured by the MS-SSIM metric, improves for all test images and over all bitrate values. A compressive autoencoder is used in [7], where the authors propose to use a proxy of the quantization step only in the backward propagation, while keeping the rounding in the forward step. The authors of [8] complement the autoencoder based compression architecture with adversarial loss to achieve realistic reconstructions and improve the visual quality. Cheng et al. [9] present a convolutional autoencoder based lossy image compression architecture, which achieves on average a 13.5% rate saving versus JPEG2000 on the Kodak image dataset. The advantage of DL-based methods for lossy compression versus conventional compression algorithms lies in their ability to extract complex features from the training data thanks to their deep architecture, and the fact that their model parameters can be trained efficiently on large datasets through backpropagation. While common compression algorithms, such as JPEG, apply the same processing pipeline to all types of images (e.g., DCT transform, quantization and entropy coding in JPEG), the DL-based image compression algorithms learn the statistical characteristics from a large training dataset, and optimize the compression algorithm accordingly, without explicitly specifying a transform or a code.
At the same time, the potential of DL has also been capitalized on by researchers to design novel and efficient coding and modulation techniques in communications. In particular, the similarities between the autoencoder architecture and digital communication systems have motivated significant research efforts in the direction of modelling end-to-end communication systems using the autoencoder architecture [10, 11]. Some examples of such designs include decoder design for existing channel codes [12, 13], blind channel equalization [14], learning physical layer signal representation for SISO [11] and MIMO [15] systems, OFDM systems [16, 17], JSCC of text messages [18] and JSCC for MNIST images for analog storage [19].
In this work, we leverage the recent success of DL methods in image compression and communication system design to propose a novel JSCC algorithm for image transmission over wireless communication channels. We consider both time-invariant and fading AWGN channels, and compare the performance of our algorithm to the state-of-the-art compression algorithms (JPEG and JPEG2000, in particular) combined with capacity-achieving channel codes. We show through experiments that our solution achieves superior performance in low signal-to-noise ratio (SNR) regimes and for limited channel bandwidth, over a time-invariant AWGN channel, even though the separation scheme is assumed to be operating at the channel capacity despite the short blocklengths. While we have mainly focused on the peak signal-to-noise ratio (PSNR) as the performance measure, we show that deep JSCC can provide even better results when measured in terms of the structural similarity index (SSIM), which better captures the perceived visual quality of the reconstructed images. More interestingly, we demonstrate that our approach is resilient to variations in channel conditions, and does not suffer from abrupt quality degradations, known as the “cliff effect” in digital communication systems: the deep JSCC algorithm exhibits graceful performance degradation when the channel conditions deteriorate. This latter property is particularly attractive when broadcasting the same image to multiple receivers with different channel qualities, or when transmitting to a single receiver over an unknown fading channel. Indeed, we show that the proposed deep JSCC scheme achieves a remarkable performance over a slow Rayleigh fading channel by learning coded representations robust to channel quality fluctuations and outperforms a separation-based digital transmission scheme even at high SNR and large channel bandwidth scenarios.
This is the first time an end-to-end joint source-channel coding architecture is trained for wireless transmission of high-resolution images over AWGN and fading channels. This architecture allows training for other performance measures or other source signals (e.g., video) as well. Moreover, while the training of the deep JSCC algorithm can be fairly time consuming, once the network is trained, the encoding and decoding tasks become extremely fast, compared to applying advanced image compression/decompression algorithms followed by capacity-approaching channel coding and decoding. We believe this may be key to enabling many low-latency applications that require the transmission of high data rate content at the wireless edge, such as image/video sensor data from autonomous cars or drones, or emerging AR/VR applications. We also emphasize that the employed neural network architecture is quite efficient, consisting of fully convolutional layers. With the rapid advances in hardware accelerators specially optimized for CNNs [20, 21], we believe the deep JSCC can very soon be deployed directly on mobile wireless devices.
The rest of the paper is organized as follows. In Section II, we introduce the system model, provide some background on the conventional wireless image transmission systems and their limitations, and motivate our novel approach. We introduce the proposed deep JSCC architecture in Section III. Section IV is dedicated to the evaluation of the performance of the proposed deep JSCC scheme, and its comparison with the conventional separation-based schemes over both static and fading AWGN channels. Finally, the paper is concluded in Section V.
II Background and Problem Formulation
We consider image transmission over a point-to-point wireless communication channel. The transmitter maps the input image to a vector of complex-valued channel input symbols . Following the JSCC literature, we will refer to the image dimension as the source bandwidth, and to the channel dimension as the channel bandwidth. We typically have , which is called bandwidth compression. We will refer to the ratio as the bandwidth compression ratio. Due to practical considerations in real-world communication systems, e.g., limited energy, interference, etc., the output of the transmitter may be required to satisfy a certain power constraint, such as peak and/or average power constraints. The output signal is then transmitted over the channel, which degrades the signal quality due to noise, fading, interference or other channel impairments. The corrupted output of the communication channel is fed to the receiver, which produces an approximate reconstruction of the original input image.
In conventional image transmission systems, depicted in Fig. (a), the transmitter performs three consecutive independent steps in order to generate the signal transmitted over the channel. First, the source redundancies are removed with a source encoder , which is typically one of the commonly used compression methods (e.g., JPEG/JPEG2000, WebP). A channel code (e.g., LDPC, Turbo code) is then applied to the compressed bitstream in order to protect it against the impairments introduced by the communication channel. Finally, the coded bitstream is modulated with a modulation scheme (e.g., BPSK, 16QAM) which maps the bits to complex-valued samples. The modulated symbols are then carried by the I and Q digital signal components over the communication link (the channel coding and modulation steps are often combined into a single coded-modulation step [22]).
The decoder inverts these operations in the reverse order. It first demodulates and maps the complex-valued channel output samples to a sequence of bits (or log-likelihood ratios) with a demodulation scheme that matches the modulator . It then decodes the channel code with a channel decoding algorithm , and finally provides an approximate reconstruction of the transmitted image from the (possibly corrupted) compressed bitstream by applying the appropriate decompression algorithm, .
Though the above encoding process is highly optimized and widely adopted in image transmission systems [23], its performance may suffer severely when the channel conditions differ from those for which the system has been optimized. Although the source and channel codes can be designed separately, their rates are chosen jointly targeting a specific channel quality, i.e., assuming that a capacity-achieving channel code can be employed, the compression rate is chosen to produce exactly the amount of data that can be reliably transmitted over the channel. However, when the experienced channel condition is worse than the one for which the code rates are chosen, the error probability increases rapidly, and the receiver cannot receive the correct channel codeword with a high probability. This leads to a failure in the source decoder as well, resulting in a significant reduction in the reconstruction quality.
Similarly, the separate design cannot benefit from improved channel conditions either; that is, once the source and channel coding rates are fixed, no matter how good the channel is, the reconstruction quality remains the same as long as the channel capacity is above the target rate. These two characteristics are known as the “cliff effect”. Various joint source-channel coding schemes have been proposed in the literature to overcome the “cliff effect” [24, 25], and to obtain graceful degradation of the signal quality with channel SNR, which typically combine multi-layer digital codes with multi-layer compression for unequal error protection.
In this paper we take a radically different approach, and leverage the properties of uncoded transmission [26, 27, 28] by directly mapping the real pixel values to the complex-valued samples transmitted over the communication channel. Our goal is to design a JSCC scheme that bypasses the transformation of the pixel values to a sequence of bits, which are then mapped again to complex-valued channel inputs; and instead, directly maps the pixel values to channel inputs as in [27, 28].
III DL-based JSCC
Our design is inspired by the recent successful application of deep NNs (DNNs), and autoencoders, in particular, to the problem of source compression [6, 7, 9, 29], as well as by the first promising results in the design of end-to-end communication systems using autoencoder architectures [10, 11].
The block diagram of the proposed JSCC scheme is shown in Fig. (b). The encoder maps the dimensional input image to a length vector of complex-valued channel input samples , which satisfies the average power constraint , by means of a deterministic encoding function . The encoder function is parameterized using a CNN with parameters . The encoder CNN comprises a series of convolutional layers followed by parametric ReLU (PReLU) activation functions [30] and a normalization layer. The convolutional layers extract the image features, which are combined to form the channel input samples, while the nonlinear activation functions allow the network to learn a nonlinear mapping from the source signal space to the coded signal space. The output of the last convolutional layer of the encoder is normalized according to:
(1)
where is the conjugate transpose of , such that the channel input satisfies the average transmit power constraint .
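The power normalization described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: since Eq. (1) itself is not legible here, the sketch assumes the common form in which the encoder output is rescaled so that its average energy per complex symbol equals the power budget. The names `normalize_power`, `average_power`, `z_tilde` and `power` are hypothetical.

```python
import math


def normalize_power(z_tilde, power=1.0):
    """Rescale complex symbols so that their average power equals `power`.

    Assumed form: z = sqrt(k * power) * z_tilde / sqrt(z_tilde^* z_tilde),
    where k is the number of channel symbols.
    """
    k = len(z_tilde)
    energy = sum(abs(symbol) ** 2 for symbol in z_tilde)  # z_tilde^* z_tilde
    if energy == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    scale = math.sqrt(k * power / energy)
    return [scale * symbol for symbol in z_tilde]


def average_power(z):
    """Average energy per complex symbol."""
    return sum(abs(symbol) ** 2 for symbol in z) / len(z)


# Hypothetical un-normalized encoder output with four complex symbols.
z = normalize_power([3 + 4j, 1 - 2j, 0.5j, -2 + 0j], power=1.0)
print(average_power(z))
```

Because the scaling is a smooth function of the encoder output wherever the output is nonzero, a layer of this kind does not block gradient flow during end-to-end training.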
Following the encoding operation, the joint source-channel coded sequence is sent over the communication channel by directly transmitting the real and imaginary parts of the channel input samples over the I and Q components of the digital signal. The channel introduces random corruption to the transmitted symbols, denoted by . To be able to optimize the communication system in Fig. (b) in an end-to-end manner, the communication channel must be incorporated into the overall NN architecture. We model the communication channel as a series of non-trainable layers, which are represented by the transfer function . We consider two widely used channel models: (i) the AWGN channel, and (ii) the slow fading channel. The transfer function of the Gaussian channel is , where the vector consists of independent identically distributed (i.i.d.) samples from a circularly symmetric complex Gaussian distribution, i.e., , where is the average noise power. In the case of the slow fading channel, we adopt the commonly used Rayleigh slow fading model. The multiplicative effect of the channel gain on the transmitted signal is captured by the channel transfer function , where is a complex normal random variable. The joint effect of channel fading and Gaussian noise can be modelled by the composition of the transfer functions and : . Other channel models can be incorporated into the end-to-end system in a similar manner, with the only requirement that the channel transfer function is differentiable in order to allow gradient computation and error backpropagation.
The receiver comprises a joint source-channel decoder. The decoder maps the corrupted complex-valued signal to an estimate of the original input using a decoding function . Similarly to the encoding function, the decoding function is parameterized by the decoder CNN with parameter set . The NN decoder inverts the operations performed by the encoder by passing the received (and possibly corrupted) coded signal through a series of transpose convolutional layers (with nonlinear activation functions) in order to map the image features to an estimate of the originally transmitted image.
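The two channel models described above can be illustrated with a small, framework-free sketch. It is not the authors' TensorFlow code. The function names, the parameters `noise_power` and `gain_variance`, and the use of Python's `random` module are assumptions made for illustration. In a real training setup the same operations would be expressed with differentiable tensor operations.

```python
import math
import random


def complex_gaussian(rng, variance):
    """Draw one sample from a circularly symmetric complex Gaussian CN(0, variance)."""
    std = math.sqrt(variance / 2.0)  # half of the variance per real dimension
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def awgn(z, noise_power, rng):
    """AWGN layer: add i.i.d. complex Gaussian noise to every symbol."""
    return [symbol + complex_gaussian(rng, noise_power) for symbol in z]


def slow_fading(z, gain_variance, rng):
    """Slow fading layer: a single random complex gain scales the whole block."""
    h = complex_gaussian(rng, gain_variance)
    return [h * symbol for symbol in z], h


def fading_awgn(z, gain_variance, noise_power, rng):
    """Composition of the two layers: fading first, then additive noise."""
    faded, h = slow_fading(z, gain_variance, rng)
    return awgn(faded, noise_power, rng), h


rng = random.Random(0)
z = [1 + 0j, 0 + 1j, -1 + 0j, 0 - 1j]
y, h = fading_awgn(z, gain_variance=1.0, noise_power=0.1, rng=rng)
print(len(y), abs(h))
```

For a fixed noise and gain realization, both layers are affine in the channel input, which is what makes them usable as non-trainable layers between the encoder and the decoder.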
The encoding and decoding functions are designed jointly to minimize the average distortion between the original input image and its reconstruction produced by the decoder:
(2) 
where is a given distortion measure, and is the joint probability distribution of the original and reconstructed images. Since the true distribution of the input data is often unknown, an analytical form of the expected distortion in Eq. (2) is also unknown. We, therefore, estimate the expected distortion by sampling from an available dataset.
IV Evaluation
To demonstrate the potential of our proposed deep JSCC scheme, we use the NN architecture depicted in Fig. 2. At the encoder, the normalization layer is followed by five convolutional layers. Since the statistics of the input data are generally not known at the decoder, the input images are normalized by the maximum pixel value , producing pixel values in the range. The notation denotes a convolutional layer with filters of spatial extent (or size) and stride . The values of the hyperparameters and used in our experiments are given in Fig. 2. A PReLU activation function is applied to the output of all convolutional layers. The output of the last convolutional layer, which consists of units, is followed by another normalization layer which enforces the average power constraint specified in Eq. (1). The output of the normalization layer is combined into complex-valued channel input samples and forms the encoded signal representation, which is transmitted over the channel.
The decoder inverts the operations performed by the encoder. The real and imaginary parts of the complex-valued noisy channel output samples are combined into values which are fed into the transpose convolutional layers. The latter progressively transform the corrupted image features into an estimate of the original input image, while upsampling it to the correct resolution. The hyperparameters of the decoder layers mirror the corresponding values of the encoder layers (Fig. 2). The outputs of all transpose convolutional layers of the decoder except for the last one are passed through a PReLU activation function, while a sigmoid nonlinearity is applied to the output of the last transpose convolutional layer in order to produce values in the range. Finally, a denormalization layer multiplies the output values by in order to generate pixel values within the range.
The above architecture is implemented in TensorFlow [31]. We use the Adam optimization framework [32], which is a form of stochastic gradient descent. Our loss function is the average mean squared error (MSE) between the original input image and the reconstruction at the output of the decoder, defined as:
(3)
where is the mean squared-error distortion and is the number of samples. In order to achieve various bandwidth compression ratios , we vary the number of filters in the last convolutional layer of the encoder. Since our architecture is fully convolutional, it can be trained and deployed on input images of any resolution.
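The empirical loss described above, i.e., the per-image mean squared error averaged over a set of samples, can be sketched as follows. The function names and the flat-list image representation are assumptions made for illustration. The actual training code would operate on batched tensors.

```python
def mse(x, x_hat):
    """Mean squared error between one image and its reconstruction (flat lists)."""
    if len(x) != len(x_hat):
        raise ValueError("image and reconstruction must have the same size")
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def empirical_loss(images, reconstructions):
    """Average of the per-image MSE over all samples in the batch or dataset."""
    if len(images) != len(reconstructions):
        raise ValueError("need one reconstruction per image")
    total = sum(mse(x, x_hat) for x, x_hat in zip(images, reconstructions))
    return total / len(images)


# Two tiny hypothetical "images" with two pixels each.
images = [[0.0, 0.0], [1.0, 1.0]]
reconstructions = [[1.0, 3.0], [1.0, 1.0]]
print(empirical_loss(images, reconstructions))
```

Averaging over samples is what replaces the unknown expectation over the data distribution with a quantity that can be minimized by stochastic gradient descent.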
The performance of the deep JSCC algorithm, as well as of all benchmark schemes, is quantified in terms of PSNR. The PSNR metric measures the ratio between the maximum possible power of the signal and the power of the noise that corrupts the signal. The PSNR is defined as follows:
(4)
where is the mean squared-error between the reference image and the reconstructed image , and is the maximum possible value of the image pixels. All our experiments are conducted on 24-bit depth RGB images (8 bits per pixel per colour channel), thus .
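As a worked illustration of the PSNR metric, the sketch below assumes the standard definition, 10 log10(MAX^2 / MSE) in dB, with MAX = 255 for 8 bits per colour channel. The function name `psnr` and the flat-list image representation are hypothetical.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """PSNR in dB, assuming the standard form 10 * log10(max_value^2 / MSE)."""
    if len(reference) != len(reconstruction):
        raise ValueError("images must have the same number of samples")
    err = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if err == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


# An error of 25.5 on every sample gives MSE = 255^2 / 100, i.e., 20 dB.
print(psnr([0.0, 0.0], [25.5, 25.5]))
```

Every tenfold reduction in MSE adds 10 dB, so small PSNR differences at high quality correspond to small absolute changes in pixel error.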
The channel SNR is defined as:
(5) 
and represents the ratio of the average power of the coded signal (channel input signal) to the average noise power. Recall that is the average power of the channel input signal after applying the power normalization layer at the encoder of the proposed JSCC scheme. For benchmark schemes that use explicit signal modulation, is the average power of the symbols in the constellation. Without loss of generality, we set the average signal power to for all experiments.
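The relation between channel SNR and noise power can be made concrete with a short sketch. It assumes the usual dB definition, 10 log10 of the ratio between the average signal power and the average noise power, and a unit signal power by default. The unit value is an assumption made for illustration, since the numerical value is not legible above. Function names are hypothetical.

```python
import math


def snr_db(signal_power, noise_power):
    """Channel SNR in dB, assuming SNR = 10 * log10(signal_power / noise_power)."""
    return 10.0 * math.log10(signal_power / noise_power)


def noise_power_for_snr(target_snr_db, signal_power=1.0):
    """Noise power that yields the target SNR for a given average signal power."""
    return signal_power / (10.0 ** (target_snr_db / 10.0))


for target in (0.0, 10.0, 20.0):
    print(target, noise_power_for_snr(target))
```

With the signal power fixed, sweeping the SNR in an experiment reduces to sweeping the noise variance, which matches how the evaluation varies channel quality.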
IV-A Evaluation on the CIFAR-10 dataset
We start by evaluating our deep JSCC scheme on the CIFAR-10 image dataset. The training data consists of training images [33] combined with random realizations of the channel under consideration. The performance of the proposed JSCC scheme is tested on test images from the CIFAR-10 dataset, which are distinct from the images used for training. We initially set the learning rate to and reduce it after 500k iterations to . We use a mini-batch size of samples and train our models until the performance on the test set does not improve further. However, we would like to emphasize that we do not use the test set images to optimize the network hyperparameters. During performance evaluation we transmit each image 10 times in order to mitigate the effect of randomness introduced by the communication channel.
We first investigate the performance of our proposed deep JSCC algorithm in the AWGN setting, i.e., the channel transfer function is . We vary the SNR by varying the noise variance and compare the proposed deep JSCC algorithm with an upper bound on any digital transmission scheme, which employs JPEG or JPEG2000 for source compression. The computation of the upper bound is based on Shannon’s separation theorem, which states that the necessary and sufficient condition for reliable communication over a discrete memoryless channel with channel capacity is
(6)
The above expression defines the maximum rate
(7)
for a channel with capacity at which the source can be compressed and transmitted with arbitrarily small probability of error. Thus, to compute the upper bound, we first compute the maximum number of bits per source sample using Eq. (7), where for a complex AWGN channel. This is the maximum source compression rate for which reliable transmission over the channel can be guaranteed. Since JPEG and JPEG2000 cannot compress the image data at an arbitrarily low bitrate, we also compute the minimum bitrate value below which compression results in complete loss of information and the original image cannot be reconstructed. If, for a given set of values of , and , the minimum rate exceeds the maximum allowable rate , we assume that the image cannot be reliably transmitted and each color channel is reconstructed to the mean value of all the pixels for that channel. When , we compress the images at the largest rate that satisfies (since, again, it is not always possible to achieve an arbitrary target bitrate with JPEG or JPEG2000 compression software), and measure the distortion between the reference image and the compressed one, assuming that the compressed bitstream can be transmitted without errors.
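The rate computation behind the upper bound can be sketched numerically. The sketch rests on two assumptions that are not legible in the text above: that the capacity of the complex AWGN channel is taken as log2(1 + SNR) bits per channel symbol, with SNR in linear scale, and that the maximum source rate is the bandwidth compression ratio times this capacity, in bits per source sample. All names, including `bandwidth_ratio` and `min_rate`, are hypothetical.

```python
import math


def awgn_capacity(snr_db):
    """Capacity of a complex AWGN channel in bits per channel symbol (assumed form)."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return math.log2(1.0 + snr_linear)


def max_source_rate(bandwidth_ratio, snr_db):
    """Largest source rate (bits per source sample) supported by the channel."""
    return bandwidth_ratio * awgn_capacity(snr_db)


def can_transmit(min_rate, bandwidth_ratio, snr_db):
    """True if the codec's minimum achievable rate fits within the channel budget."""
    return min_rate <= max_source_rate(bandwidth_ratio, snr_db)


# Hypothetical operating point: one channel symbol per six source samples at 0 dB.
print(max_source_rate(1.0 / 6.0, 0.0))
print(can_transmit(0.5, 1.0 / 6.0, 0.0))
```

When `can_transmit` is false, the bound falls back to the mean-colour reconstruction described above. Otherwise the image is compressed at the largest codec rate that does not exceed the budget.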
We would like to note that we do not use any explicit practical channel coding and modulation scheme in the computation of the bound. Compressing the source at rate and assuming error-free transmission at this rate implicitly suggests that one would need to use a capacity-achieving combination of channel code and modulation scheme to achieve reliable transmission. Thus, the performance of any digital transmission scheme that employs an actual channel coding scheme and modulation along with JPEG/JPEG2000 compression will be inferior to this upper bound.
Fig. 3 illustrates the performance of the proposed deep JSCC algorithm with respect to the bandwidth compression ratio, , in different SNR regimes. This performance is compared against the upper bound on the performance of any digital scheme that employs JPEG/JPEG2000 for compression. We note that the threshold behavior of the upper bound in the figure is not due to the “cliff effect”. The initial flat part of these curves is due to the fact that JPEG and JPEG2000 completely break down in this region, i.e., the maximum transmission rate is below the minimum number of bits per pixel, , required to compress the images at the worst quality and obtain a meaningful reconstruction at the decoder.
We observe that, in very bad channel conditions (e.g., for SNR = 0 dB), the digital schemes deploying JPEG or JPEG2000 would break down, while with the proposed deep JSCC scheme transmission is possible with reasonably good performance. At medium and high SNRs and for limited channel bandwidth, i.e., for , the performance of the proposed deep JSCC scheme is considerably above the one that can be achieved by JPEG and JPEG2000 even assuming that reliable transmission at channel capacity is possible^{1}. Even when the channel bandwidth becomes less constrained, i.e., for , the performance of the deep JSCC scheme remains competitive with its JPEG/JPEG2000 counterparts. The saturation of the proposed deep JSCC scheme in the large channel bandwidth regime is possibly due to the limited capability of the particular autoencoder architecture employed, which may be improved, for example, by employing a different activation function than PReLU as in [6], or through incremental training as in [7].
^{1} While near-capacity-achieving channel codes exist for the AWGN channel, these typically require very large blocklengths. It is known that the achievable rates guaranteeing a low block error probability for the blocklengths considered here are below the capacity [34] for the entire range of compression ratio values. Therefore, the upper bounds in Fig. 3 are typically not achievable.
We next study the robustness of the proposed deep JSCC scheme to variations in channel conditions. Figs. (a) and (b) illustrate the average PSNR of the reconstructed images versus the SNR of the AWGN channel for two different values of bandwidth compression ratio, . Each curve in Figs. (a) and (b) is generated by training our end-to-end system for a specific channel SNR value, denoted as , and then evaluating the performance of the learned encoder/decoder parameters on the test images for varying SNR values, denoted as . In other words, each curve represents the performance of the proposed JSCC scheme optimized for channel SNR equal to , and deployed in different channel conditions with SNR equal to . These results provide an insight into the performance of the proposed algorithm when the channel conditions are different from those for which the end-to-end system is optimized and demonstrate the robustness of the proposed JSCC to variations in channel quality. We can observe that for , i.e., when the channel conditions are worse than those for which the encoder/decoder have been optimized, our deep JSCC algorithm does not suffer from the “cliff effect” observed in digital systems. Unlike digital systems, where the quality of the decoded signal drops sharply when drops below a critical threshold value, the deep JSCC scheme is more robust to channel quality fluctuations and exhibits a gradual performance degradation as the channel deteriorates. Such behavior is akin to the performance of an analog scheme [26, 24, 28], and is attributed to the capability of the autoencoder to map similar images/features to nearby points in the channel input signal space; thus, with decreasing the decoder can still obtain a reconstruction of the original image.
On the other hand, when increases above , we observe initially a gradual improvement in the quality of the reconstructed images before the performance finally saturates as increases beyond a certain value. The performance in the saturation region is driven solely by the amount of compression implicitly decided during the training phase for the target value . It is worth noting that performance saturation does not occur at as in digital image/video transmission systems [27], but at . This behavior indicates that the proposed JSCC scheme determines an implicit trade-off between the amount of error protection and compression, which does not necessarily target an error-free transmission when the system operates at . We also note that when the encoder/decoder are optimized for very high , and , the system boils down to an ordinary autoencoder, and its performance is solely limited by the degrees of freedom imposed by the bandwidth compression ratio , i.e., the dimension of the bottleneck layer of the autoencoder.
Next we study the performance of our deep JSCC scheme under the assumption of a slow Rayleigh fading channel with AWGN. In this case, the channel transfer function is , where and . In this experiment, we do not assume channel state information either at the receiver or the transmitter, or consider the transmission of pilot signals. As we assume slow fading, the channel gain is randomly sampled from the complex Gaussian distribution for each transmitted image, remains constant during the transmission of the entire image, and changes independently to another state for the next image. We set and vary the noise variance to emulate varying average channel SNR.
In Fig. 5, we plot the performance of the proposed deep JSCC algorithm over a slow Rayleigh fading channel as a function of the bandwidth compression ratio, , for different average SNR values. Note that, due to the lack of channel state information, the capacity of this channel in the Shannon sense is zero, since reliable transmission cannot be guaranteed at any positive rate under all channel conditions; that is, for any positive transmission rate, the channel capacity will be below the transmission rate with a nonzero probability. Therefore, we calculate an upper bound on any digital transmission scheme designed for the average SNR value, i.e., for , which uses JPEG/JPEG2000 for compression. Similarly to the case of the AWGN channel, we assume that the source image is compressed with JPEG/JPEG2000 at a rate that is equal to the capacity of the complex AWGN channel at the average SNR value. That is, we calculate the maximum number of bits that can be transmitted reliably using Eq. (7), where the channel capacity is calculated for the average SNR value. If the channel capacity is below this value due to fading, an outage occurs, and the mean pixel values are used for reconstruction, i.e., maximum distortion is reached. If the channel capacity is above the transmission rate, the transmitted codeword can be decoded reliably. We observe that deep JSCC beats the upper bound on the digital transmission schemes at all SNR and bandwidth compression values. This result emphasizes the benefits of the proposed deep JSCC technique when communicating over a time-varying channel, or multicasting to multiple receivers with varying channel states.
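The outage mechanism that penalizes the digital benchmark can be illustrated with a small Monte Carlo sketch. It assumes, for illustration only, a unit-variance complex Gaussian channel gain, so that the squared gain magnitude is exponentially distributed with unit mean, and a capacity of log2(1 + SNR) bits per channel symbol. Under these assumptions a scheme whose rate equals the capacity at the average SNR is in outage whenever the squared gain magnitude falls below one, which happens with probability 1 - exp(-1) regardless of the average SNR. The function name and parameters are hypothetical.

```python
import math
import random


def outage_probability_mc(avg_snr_db, trials, rng):
    """Estimate how often the faded channel cannot support the average-SNR rate."""
    avg_snr = 10.0 ** (avg_snr_db / 10.0)
    target_rate = math.log2(1.0 + avg_snr)  # rate chosen for the average SNR
    outages = 0
    for _ in range(trials):
        gain_power = rng.expovariate(1.0)  # |h|^2 for h ~ CN(0, 1), an assumption
        capacity = math.log2(1.0 + gain_power * avg_snr)
        if capacity < target_rate:
            outages += 1
    return outages / trials


analytic = 1.0 - math.exp(-1.0)
estimate = outage_probability_mc(10.0, 20000, random.Random(0))
print(estimate, analytic)
```

Under these assumptions the benchmark falls back to the mean-pixel reconstruction for a large fraction of the transmitted images, which is consistent with the gap to deep JSCC reported above.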
We illustrate the robustness of the proposed deep JSCC scheme to variations of the average channel SNR in a slow Rayleigh fading channel in Figs. (a) and (b). We observe that, while the performance of the deep JSCC scheme drops compared to the static AWGN channel, the quality of the reconstructed images is still reasonable, despite the lack of channel state information. This suggests that the network learns to estimate the channel state, and adapts the decoder accordingly; that is, the proposed deep JSCC scheme combines not only source coding, channel coding, and modulation, but also channel estimation, into one single component, whose parameters are learned through training.
IV-B Evaluation on the Kodak dataset
We also evaluate the proposed deep JSCC scheme on higher resolution images. To this end, we train our NN architecture on the ImageNet dataset [35] which consists of million images. The images are randomly cropped to patches of size and fed into the network in mini-batches of samples. We set the learning rate to and train the models until convergence. The evaluation is performed on the Kodak image dataset (http://r0k.us/graphics/kodak/) consisting of 24 images. During evaluation, each image is transmitted times, so that the performance can be averaged over multiple realizations of the random channel.
We first investigate the performance of the proposed deep JSCC algorithm over an AWGN channel by varying the noise power . The performance of the proposed deep JSCC algorithm is compared against digital transmission schemes that use JPEG/JPEG2000 for image compression followed by practical channel coding and modulation schemes. We use all possible combinations of , , and LDPC codes (which correspond to , and rate codes) with BPSK, 4QAM, 16QAM and 64QAM digital modulation schemes. For the sake of legibility, we only present the best performing digital transmission schemes and omit those that perform similarly, or whose performance in terms of PSNR is below 15 dB.
Figs. 7 and 8 show the performance of the proposed deep JSCC scheme and the digital transmission schemes in an AWGN channel as a function of the test SNR for bandwidth compression ratios and , respectively. The results illustrate that our deep JSCC scheme significantly outperforms the baseline digital transmission schemes that use JPEG (the most widely used image compression algorithm) for low channel bandwidth and low SNR regimes, while it performs on par with the benchmark schemes for high bandwidth and high SNR values. Most importantly, our deep JSCC scheme does not suffer from the “cliff effect” observed in the digital transmission schemes. The limitation of the latter stems from the fact that, once the channel code and modulation scheme have been selected for a target SNR value, the number of bits available for compression is fixed and, thus, the quality of the reconstructed images does not improve with SNR. At the same time, when the channel quality drops below the target SNR value, the channel code is not able to deal with the increasing error rate, which leads to significant degradation in the quality of the reconstructed images. In contrast to the digital transmission schemes, our deep JSCC scheme exhibits a graceful degradation of performance when the channel quality drops below the target SNR value, while the performance does not saturate immediately when the channel conditions improve beyond the target SNR.
When compared to schemes that use JPEG2000 for source compression, our JSCC algorithm outperforms the benchmark digital transmission schemes in AWGN channels only in very low SNR regimes and for low channel bandwidth. However, we believe that, by using a deeper neural network architecture and by employing more sophisticated activation and loss functions, the performance of the deep JSCC algorithm can be further improved.
We next evaluate the performance of our deep JSCC algorithm on the Kodak image dataset over time-varying channels. Fig. 9 depicts the performance of deep JSCC and the benchmark digital transmission schemes in a slow Rayleigh fading channel for bandwidth compression ratio . We set the average channel gain to and vary the average SNR by varying the noise power . In these simulations, we assume that, in both the proposed scheme and the baseline digital transmission schemes, the phase shift introduced by the fading channel is known at the receiver, making the model equivalent to a real fading channel with double the bandwidth as only the channel gain changes randomly for each image transmission period. For the sake of readability, we only keep the best performing digital transmission schemes among all possible combinations of , and rate LDPC codes and BPSK, 4QAM, 16QAM and 64QAM modulation schemes. We observe that, because digital transmission schemes are sensitive to the varying channel error rate caused by the varying channel SNR, the schemes that use separate source compression with JPEG/JPEG2000 followed by channel coding and modulation perform worse than the proposed deep JSCC. While the digital transmission schemes perform well only in channel conditions for which they have been optimized, our deep JSCC scheme is more robust to channel quality fluctuations. Despite being trained for a specific average channel quality, deep JSCC is able to learn robust coded representations of the images that are resilient to fluctuations in the channel quality. The latter property is highly advantageous when transmitting over time-varying channels or to multiple receivers with different channel qualities.
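The slow fading model with known phase at the receiver can be sketched as follows. This is a minimal illustration under stated assumptions rather than the simulation code used here: the function name is hypothetical, unit average transmit power is assumed, and the default `avg_gain=1.0` is a placeholder value. The channel coefficient is drawn once per image, and the receiver removes its phase so that only the real-valued gain remains.

```python
import math
import random


def slow_rayleigh_block(symbols, snr_db, rng, avg_gain=1.0):
    """Pass one image's symbols through z = h * x + n, with h fixed for the block.

    Returns the phase-compensated received symbols and the realized gain |h|.
    """
    # h ~ CN(0, avg_gain): each real dimension has variance avg_gain / 2.
    h_std = math.sqrt(avg_gain / 2.0)
    h = complex(rng.gauss(0.0, h_std), rng.gauss(0.0, h_std))
    # Noise power sigma^2 = 1 / SNR for unit average transmit power (assumption).
    n_std = math.sqrt(10.0 ** (-snr_db / 10.0) / 2.0)
    received = [h * x + complex(rng.gauss(0.0, n_std), rng.gauss(0.0, n_std))
                for x in symbols]
    gain = abs(h)
    # The receiver knows the phase of h: rotate it away so only |h| remains.
    rotation = h.conjugate() / gain if gain > 0.0 else 1.0
    return [z * rotation for z in received], gain


tx = [complex(1.0, 0.0), complex(0.0, -1.0)]
rx, gain = slow_rayleigh_block(tx, snr_db=10.0, rng=random.Random(1))
```

Because the gain is redrawn for every image, averaging a quality metric over many calls gives the performance at a given average SNR.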
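[Fig. 10: visual comparison of reconstructed images; image panels omitted. Columns: Original, Deep JSCC, JPEG, JPEG2000. Rows are ordered from the lowest channel SNR (top) to the highest (bottom); each entry is PSNR/SSIM.]
Row  Deep JSCC  JPEG  JPEG2000
1  25.07dB/0.81  20.63dB/0.61  24.11dB/0.70
2  26.86dB/0.86  24.78dB/0.79  27.5dB/0.83
3  28.45dB/0.90  27.14dB/0.86  30.15dB/0.89
4  31.46dB/0.94  29.81dB/0.91  33.03dB/0.93
5  34.3dB/0.97  31.86dB/0.94  35.52dB/0.96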
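[Fig. 11: visual comparison of reconstructed images; image panels omitted. Columns: Original, Deep JSCC, JPEG, JPEG2000. Rows are ordered from the lowest channel SNR (top) to the highest (bottom); each entry is PSNR/SSIM.]
Row  Deep JSCC  JPEG  JPEG2000
1  30.69dB/0.87  22.68dB/0.67  N/A
2  31.92dB/0.89  31.65dB/0.86  36.40dB/0.92
3  32.90dB/0.90  34.36dB/0.91  38.46dB/0.94
4  35.34dB/0.93  36.45dB/0.93  40.5dB/0.96
5  36.83dB/0.94  37.79dB/0.95  41.96dB/0.96
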
Finally, a visual comparison of the reconstructed images for the source and channel coding schemes under consideration in AWGN channels is presented in Figs. 10 and 11. For the digital transmission schemes deploying JPEG/JPEG2000, the images are transmitted using the best-performing separate source and channel coding scheme for the target SNR value. Each row corresponds to a different channel SNR value starting from low SNR at the top (1dB) and progressing to high SNR (19dB) at the bottom. For each reconstruction, we report the PSNR and the SSIM [36] values. Fig. 10 illustrates an example where deep JSCC outperforms the best performing digital scheme that deploys JPEG for source compression in terms of PSNR. More interestingly, although deep JSCC achieves a lower PSNR than the separate scheme employing JPEG2000 at all but the lowest SNR, its SSIM values are consistently higher, indicating superior perceived visual quality. Fig. 11 shows an example where for high SNR values the digital transmission schemes outperform deep JSCC in the PSNR metric, but deep JSCC can still achieve SSIM values comparable to those of the scheme using JPEG. We can see that JPEG produces visible blocking artefacts, especially in channels with low SNR, which are not present in the images transmitted with deep JSCC. The noise introduced by deep JSCC appears to be smoother than the noise of JPEG thanks to the direct mapping of source values to soft channel input values. Note that deep JSCC can also be trained with SSIM as the loss function, which can further improve its performance in terms of the SSIM metric.
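To make the SSIM values reported above more concrete, the sketch below computes a simplified, single-window version of the index. This is an assumption-laden illustration: the metric in [36] is computed over local windows and then averaged, whereas this hypothetical `global_ssim` treats the whole image as one window. The stabilizing constants use the commonly quoted values K1 = 0.01 and K2 = 0.03.

```python
def global_ssim(x, y, max_val=255.0):
    """Single-window SSIM between two equal-length pixel sequences (simplified)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * max_val) ** 2   # stabilizes the luminance term
    c2 = (0.03 * max_val) ** 2   # stabilizes the contrast/structure term
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


reference = [10.0, 50.0, 90.0, 130.0]
identical_score = global_ssim(reference, reference)
reversed_score = global_ssim(reference, list(reversed(reference)))
```

Unlike PSNR, the index depends on local means, variances and covariances rather than on the raw squared error, which is why two reconstructions can rank differently under the two metrics.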
IV-C Computational complexity
In this section, we provide a brief discussion of the computational complexity of the proposed JSCC algorithm. Let us first consider the proposed encoder/decoder network. The most computationally costly operations in the encoder/decoder are the 2D convolutions/transpose convolutions, as they involve multiplications and additions. The computational cost of a single convolutional layer is [37], where is the filter size, is the number of filters, is the number of input channels and is the size of the feature map. The computational complexity of the encoder/decoder network is, thus, where and are the input image width and height, respectively. This implies that the computational complexity of the proposed encoder/decoder is linear in the number of pixels of the input image, as only the feature map width and height depend on the image dimensions, while all other factors are constant and independent of the image size. The JPEG encoding/decoding complexity is also linear in the number of pixels [38], while LDPC codes have linear encoding/decoding times [39]. Thus, the computational complexity of a separate source and channel coding scheme, which employs JPEG for compression and LDPC codes for channel coding, is also linear in the size of the input image, i.e., .
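The linear scaling argument can be checked with a short sketch. It assumes the standard multiply-accumulate count of a convolutional layer (filter size squared, times number of filters, times number of input channels, times feature map area); the function names and the three-layer configuration below are hypothetical and are not the architecture used in this work.

```python
import math


def conv_layer_macs(filter_size, num_filters, in_channels, fmap_height, fmap_width):
    """Multiply-accumulate count of one 2D convolutional layer."""
    return (filter_size * filter_size * num_filters * in_channels
            * fmap_height * fmap_width)


def network_macs(image_height, image_width, layers, in_channels=3):
    """Total count for a stack of (filter_size, num_filters, stride) layers."""
    total = 0
    height, width, channels = image_height, image_width, in_channels
    for filter_size, num_filters, stride in layers:
        # Output feature map size, assuming "same" padding.
        height = math.ceil(height / stride)
        width = math.ceil(width / stride)
        total += conv_layer_macs(filter_size, num_filters, channels, height, width)
        channels = num_filters
    return total


# Hypothetical layer stack used only to illustrate the scaling behaviour.
example_layers = [(5, 16, 2), (5, 32, 2), (5, 32, 1)]
small = network_macs(128, 128, example_layers)
large = network_macs(256, 256, example_layers)
ratio = large / small   # four times the pixels gives four times the cost
```

Since every factor except the feature map area is fixed by the architecture, quadrupling the number of pixels quadruples the total cost.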
To complete our discussion of computational complexity, we have measured the average run time of the proposed algorithm on a Linux server with eight 2.10GHz Intel Xeon E5-2620 v4 CPUs and a Tesla K80 GPU. The measurements were performed on the Kodak color images with a resolution of pixels. The average run time refers to the average time required to encode and decode one image using the proposed deep JSCC architecture. The average run time achieved by our GPU implementation is 18ms per image, while the average run time on CPU is 387ms. As a comparison, the average time required for the JPEG encoding and decoding of the above images, as reported in the literature, varies from 30ms [8] to 390ms [9], while for the JPEG2000 algorithm the average encoding and decoding time on these images is even higher (e.g., 430–590ms [8, 9]). This time must be further augmented by the time needed to encode/decode the compressed bitstream with a channel code. These results indicate that our method is competitive with the baseline separate source and channel coding approaches not only in terms of quality, but also in terms of computational complexity.
V Conclusions and Future Work
We have proposed a novel deep JSCC architecture for image transmission over wireless channels. In this architecture, the encoder maps the input image directly to channel inputs. The encoder and the decoder functions are modeled as complementary CNNs, and trained jointly on the dataset to minimize the average MSE of the reconstructed image. We have compared the performance of this deep JSCC scheme with conventional separation-based digital transmission schemes, which employ widely used image compression algorithms followed by capacity-achieving channel codes. We have shown through extensive numerical simulations that deep JSCC outperforms separation-based schemes, especially for limited channel bandwidth and low SNR regimes. More significantly, deep JSCC is shown to provide a graceful degradation of the reconstruction quality with channel SNR. This property is particularly beneficial when communicating over a slow fading channel; deep JSCC performs reasonably well at all average SNR values, and outperforms the benchmark separation-based transmission schemes at any channel bandwidth value.
In the case of DL-based JSCC, the encoder and decoder networks learn not only to communicate reliably over the channel (as in [11, 13]), but also to compress the images efficiently. For a perfect channel with no noise, if the source bandwidth is greater than the channel bandwidth, i.e., , the encoder-decoder NN pair is equivalent to an undercomplete autoencoder [5], which effectively learns the most salient features of the training dataset. However, in the case of a noisy channel, simply learning a good low-dimensional representation of the input is not sufficient. The network should also learn to map the salient features to nearby representations so that similar images can be reconstructed despite the presence of noise. We also note that the resilience to channel noise acts as a sort of regularizer for the autoencoder. For example, when there is no channel noise, if the channel bandwidth is larger than the source bandwidth, i.e., , we obtain an overcomplete autoencoder, which can simply learn to replicate the image. However, when there is channel noise, even an overcomplete autoencoder learns a nontrivial mapping that is resilient to channel noise, similarly to denoising autoencoders.
The next step in improving the performance of the deep JSCC scheme is to exploit more advanced NN architectures in the autoencoder that have been shown to improve the compression performance [6, 40]. We will also explore the performance of the system for non-Gaussian channels as well as for channels with memory, for which we do not have capacity-approaching channel codes. We expect that the benefits of the proposed NN-based JSCC scheme will be more evident in these non-ideal settings.
References
 [1] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley-Interscience, 1991.
 [2] F. Zhai, Y. Eisenberg, and A. K. Katsaggelos, “Joint source-channel coding for video communications,” in Handbook of Image and Video Processing, 2nd ed., A. Bovik, Ed. Burlington: Academic Press, 2005.
 [3] Google, “WebP compression study.” [Online]. Available: https://developers.google.com/speed/webp/docs/webp_study
 [4] Y. Bengio, “Learning deep architectures for AI,” Found. and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, Jan. 2009.
 [5] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT Press, 2016.
 [6] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Proc. of Int. Conf. on Learning Representations (ICLR), Apr. 2017, pp. 1–27.
 [7] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with compressive autoencoders,” in Proc. of the Int. Conf. on Learning Representations (ICLR), 2017.
 [8] O. Rippel and L. Bourdev, “Real-time adaptive image compression,” in Proc. Int. Conf. on Machine Learning (ICML), vol. 70, Aug. 2017, pp. 2922–2930.
 [9] Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, “Deep convolutional autoencoder-based lossy image compression,” in Proc. of Picture Coding Symposium (PCS), San Francisco, CA, 2018, pp. 253–257.
 [10] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning to communicate: Channel autoencoders, domain specific regularizers, and attention,” in Proc. of IEEE Int. Symp. on Signal Processing and Information Technology (ISSPIT), Dec. 2016, pp. 223–228.
 [11] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,” IEEE Transactions on Cognitive Communications and Networking, vol. 3, no. 4, pp. 563–575, Dec 2017.
 [12] H. Kim et al., “Communication algorithms via deep learning,” in Proc. of Int. Conf. on Learning Representations (ICLR), 2018.
 [13] E. Nachmani et al., “Deep learning methods for improved decoding of linear codes,” IEEE Journal of Selected Topics in Signal Processing, vol. 12, no. 1, pp. 119–131, Feb 2018.
 [14] A. Caciularu and D. Burshtein, “Blind channel equalization using variational autoencoders,” in Proc. IEEE Int. Conf. on Comms. Workshops, Kansas City, MO, May 2018, pp. 1–6.
 [15] T. J. O’Shea, T. Erpek, and T. C. Clancy, “Deep learning based MIMO communications,” arXiv:1707.07980 [cs.IT], 2017.
 [16] A. Felix, S. Cammerer, S. Dorner, J. Hoydis, and S. ten Brink, “OFDM autoencoder for end-to-end learning of communications systems,” in Proc. IEEE Int. Workshop Signal Proc. Adv. Wireless Commun. (SPAWC), Jun. 2018.
 [17] H. Ye, G. Y. Li, and B. Juang, “Power of deep learning for channel estimation and signal detection in OFDM systems,” IEEE Wireless Communications Letters, vol. 7, no. 1, pp. 114–117, Feb. 2018.
 [18] N. Farsad, M. Rao, and A. Goldsmith, “Deep learning for joint source-channel coding of text,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018.
 [19] R. Zarcone et al., “Joint source-channel coding with neural networks for analog data compression and storage,” in 2018 Data Compression Conference, March 2018, pp. 147–156.
 [20] A. Ignatov et al., “AI benchmark: Running deep neural networks on Android smartphones,” in Computer Vision – ECCV 2018 Workshops, L. Leal-Taixé and S. Roth, Eds. Cham: Springer, 2019, pp. 288–314.
 [21] R. A. Solovyev, A. A. Kalinin, A. G. Kustov, D. V. Telpukhov, and V. S. Ruhlov, “FPGA implementation of convolutional neural networks with fixed-point calculations,” CoRR, vol. abs/1808.09945, 2018.
 [22] A. G. Fabregas, A. Martinez, and G. Caire, “Bit-interleaved coded modulation,” Foundations and Trends in Communications and Information Theory, vol. 5, no. 1–2, pp. 1–153, 2008.
 [23] N. Thomos, N. V. Boulgouris, and M. G. Strintzis, “Optimized transmission of JPEG2000 streams over wireless channels,” IEEE Trans. on Image Processing, vol. 15, no. 1, pp. 54–67, Jan 2006.
 [24] D. Gunduz and E. Erkip, “Joint source-channel codes for MIMO block-fading channels,” IEEE Trans. on Information Theory, vol. 54, no. 1, pp. 116–134, Jan 2008.
 [25] I. Kozintsev and K. Ramchandran, “Robust image transmission over energy-constrained time-varying channels using multiresolution joint source-channel coding,” IEEE Transactions on Signal Processing, vol. 46, no. 4, pp. 1012–1026, April 1998.
 [26] T. Goblick, “Theoretical limitations on the transmission of data from analog sources,” IEEE Transactions on Information Theory, vol. 11, no. 4, pp. 558–567, October 1965.
 [27] S. Jakubczak and D. Katabi, “SoftCast: Clean-slate scalable wireless video,” in Proc. of the 48th IEEE Annual Allerton Conf. on Communication, Control, and Computing, Illinois, USA, Sept. 2010, pp. 530–533.
 [28] T. Tung and D. Gunduz, “Sparsecast: Hybrid digital-analog wireless image transmission exploiting frequency-domain sparsity,” IEEE Communications Letters, vol. 22, no. 12, pp. 2451–2454, Dec 2018.
 [29] D. Alexandre, C.-P. Chang, W.-H. Peng, and H.-M. Hang, “An autoencoder-based learned image compressor: Description of challenge proposal by NCTU,” in IEEE Conf. Comp. Vision and Pattern Recog. Works., Jun. 2018.
 [30] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” arXiv:1502.01852v1 [cs.CV], 2015.
 [31] M. Abadi et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
 [32] D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv:1412.6980 [cs.LG], 2014.
 [33] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
 [34] Y. Polyanskiy, H. V. Poor, and S. Verdu, “Channel coding rate in the finite blocklength regime,” IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, May 2010.
 [35] J. Deng et al., “ImageNet: A Large-Scale Hierarchical Image Database,” in CVPR09, 2009.
 [36] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
 [37] A. G. Howard et al., “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv:1704.04861v1 [cs.CV], 2017.
 [38] P. T. Chiou, Y. Sun, and G. S. Young, “A complexity analysis of the JPEG image compression algorithm,” in Proc. of 9th Computer Science and Electronic Engineering (CEEC), Sep. 2017, pp. 65–70.
 [39] T. Richardson and R. Urbanke, Modern Coding Theory. New York, NY, USA: Cambridge University Press, 2008.

 [40] N. Johnston et al., “Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.