With the rapid development of mobile terminals and wireless networks, applications of multi-node communications such as multi-camera surveillance systems, IoT networks, 3D scene capture, and stereo image transmission have drawn a lot of attention. The main challenge is the asymmetric distribution of computational capability and power budget between the edge devices and the central server, which makes the naive use of conventional separated communication systems unsuitable. A promising solution is distributed source coding (DSC), which exploits the statistics of the distributed sources only at the receiver side, enabling low-complexity encoding by shifting the bulk of the computation to the receiver. Building on the information-theoretic bounds given by Slepian and Wolf [1] for distributed lossless coding, Wyner and Ziv extended the result to lossy compression with side information at the decoder [2]. Since then, practical Slepian-Wolf and Wyner-Ziv schemes have become research hot spots. However, most distributed source coding techniques are derived from proven channel coding ideas and concentrate on synthetic datasets and specific correlation structures [3, 4].
Thanks to the rapid development of deep learning (DL), recent advances achieve practical lossy compression schemes [5, 6] and joint source-channel coding (JSCC) schemes [7] for high-dimensional image sources using deep neural networks (DNNs), which show substantial improvements by exploiting decoder side information. In this paper, taking this idea one step further, we consider the design of a practical distributed JSCC scheme. Although it has been proven that the separation theorem still holds in distributed scenarios, practical distributed systems are limited by delay and complexity, and distributed JSCC can reduce signal distortion and obtain better performance through the integration of source and channel codes. Another line of relevant work is the deep JSCC (D-JSCC) schemes [8, 9, 10, 11], which achieve graceful degradation as the channel signal-to-noise ratio (SNR) varies and outperform state-of-the-art digital schemes. In the distributed scenario, however, the scheme needs to be redesigned.
In this paper, we propose a novel distributed D-JSCC scheme for wireless image transmission. The proposed scheme considers the independent channels case [12], where each source is transmitted through an independent noisy channel and all sources are recovered jointly at the common receiver. In particular, each pair of correlated images comes from two cameras with possibly overlapping fields of view. The proposed distributed encoders and the decoder are trained jointly to exploit this dependence. Meanwhile, we propose a channel state information-aware cross attention module to enable efficient information interaction between the two noisy feature maps at the receiver. It measures the correlation between the images at the patch level, where the channel quality and the spatial correlation are explicitly considered. In this manner, the decoder can better reconstruct both images without extra transmission overhead, improving the overall transmission efficiency. We compare the proposed scheme with various image codecs combined with an ideal capacity-achieving channel code or practical LDPC codes. The considered image coding algorithms include the popular image codecs BPG and JPEG2000 and recently developed DL-based DSC image compression schemes. Results demonstrate a clear performance gain over the baseline scheme without joint decoding, and the improvement depends on the difference between the SNRs of the two links. In addition, our method shows competitive PSNR and MS-SSIM performance compared with all considered schemes.
2 System Model and Proposed Architecture
Fig. 1 shows the system model. Consider two statistically dependent, identically distributed (i.i.d.) image sources and , and let and represent samples of and from possibly infinite sets and . The joint source-channel encoder (JSCE) is a deterministic encoding function denoted as , which first maps an -dimensional input image into complex symbols , and then employs a power normalization module to enforce the average power constraint, i.e., is normalized as before transmission. For the additive white Gaussian noise (AWGN) channels, the transfer function is , where is i.i.d. Gaussian noise with zero mean and variance . We assume that the SNR of every link, defined as , can be different and is known at both the transmitter and the receiver. In the considered scenario, the other input goes through the same process, and the two JSCEs work independently, i.e., one image is encoded without access to the other.
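The power normalization and AWGN steps described above can be illustrated with a minimal NumPy sketch. The function names (`power_normalize`, `awgn_channel`), the unit power budget, and the convention SNR = 10 log10(P / noise variance) are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def power_normalize(z, power=1.0):
    """Scale complex symbols so that their average power equals `power`."""
    avg_power = np.mean(np.abs(z) ** 2)
    return z * np.sqrt(power / avg_power)

def awgn_channel(x, snr_db, power=1.0, rng=None):
    """Add circularly symmetric complex Gaussian noise for a given SNR in dB.

    Assumes SNR = 10 * log10(power / noise_var), a common convention.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise_var = power / (10.0 ** (snr_db / 10.0))
    # Split the noise variance equally between real and imaginary parts.
    noise = np.sqrt(noise_var / 2.0) * (
        rng.standard_normal(x.shape) + 1j * rng.standard_normal(x.shape)
    )
    return x + noise

rng = np.random.default_rng(0)
# Stand-in for the encoder output of one source (hypothetical data).
z = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
x = power_normalize(z, power=1.0)
y = awgn_channel(x, snr_db=10.0, power=1.0, rng=rng)

# Empirical SNR of this realization, close to the requested 10 dB.
measured_snr_db = 10.0 * np.log10(
    np.mean(np.abs(x) ** 2) / np.mean(np.abs(y - x) ** 2)
)
```

Each link would run this pipeline independently with its own SNR, which matches the statement that the two encoders never see each other's input.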
The joint source-channel decoder (JSCD), denoted as , decodes the two noisy received symbol vectors and , and produces their reconstructions as
where and are the parameter sets of the two JSCEs, and is that of the shared JSCD. A seemingly viable alternative is to first recover the image from the high-quality link and then use it as side information to help decode the other. In practice, however, the channel quality of each node may vary considerably over time, and mismatched or low-quality side information degrades the reconstruction quality. Besides, serial decoding increases the transmission latency and requires training and storing multiple decoders, which is unfriendly to practical distributed applications. We therefore propose a parallel decoding framework, which uses a single group of parameters to recover all images at the same time. More specifically, the JSCD jointly takes the two noisy feature maps (received symbol vectors) as well as their SNRs as input. The recovery process is fully parallel, since the two feature maps are concatenated along the batch dimension.
In addition, the proposed architecture can also adapt to the asymmetric case , in which one of the sources is assumed to be losslessly available at the receiver and is used as side information when decoding the other source. We also train and report the performance in this case by setting the SNR of the available source to . In every case, the system is trained to minimize the empirical distortion loss defined as
where is the distortion between the input image and its reconstruction, and the hyperparameter determines the relative importance of the two images.
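As a rough illustration of a two-image distortion loss with a relative-importance weight, the sketch below assumes a convex combination controlled by a hypothetical weight `alpha` and uses MSE as the distortion. The exact form of the loss in the original work is not recoverable from the text here, so this is only one plausible reading.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))

def weighted_distortion(x1, x1_hat, x2, x2_hat, alpha=0.5, distortion=mse):
    """Assumed form: alpha * d(x1, x1_hat) + (1 - alpha) * d(x2, x2_hat)."""
    return alpha * distortion(x1, x1_hat) + (1.0 - alpha) * distortion(x2, x2_hat)

# Toy example: per-image distortions are 1.0 and 4.0.
x1, x1_hat = np.zeros((2, 2)), np.ones((2, 2))
x2, x2_hat = np.zeros((2, 2)), 2.0 * np.ones((2, 2))

equal_weight = weighted_distortion(x1, x1_hat, x2, x2_hat, alpha=0.5)   # 2.5
favor_second = weighted_distortion(x1, x1_hat, x2, x2_hat, alpha=0.25)  # 3.25
```

With `alpha=0.5` both images contribute equally, which corresponds to the equal-weight setting used in the experiments.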
In the proposed architecture, both the JSCE and the JSCD are parameterized by a series of convolutional blocks, whose detailed structure is plotted in Fig. 1. Conv- denotes a convolutional block including a convolutional layer with kernel size and filters and a ChannelNorm layer. The marks “ ” / “ ” denote the upscaling/downscaling convolutions with stride , and “*” indicates a single convolution used for compressing/decompressing the feature map into/from channels to satisfy the given bandwidth ratio. Besides, we employ the attention feature module proposed in [10] to reduce the training cost, which enables a single model to deal with different SNRs.
3 SNR-aware Cross Attention Mechanism at the Receiver
Existing DNN-based DSC methods achieve efficient lossy compression by using a correlated lossless image as the side information. The considered independent channels case is a more realistic scenario, where the mutual information changes with each pair of samples and with both channel conditions. Lossy or uncorrelated side information may degrade the quality of the recovered image. Moreover, we can no longer assume that the two images are always aligned, nor that their layouts are similar. Thus, a dynamic decision algorithm is required to measure and make use of the correlation for different images and different channel conditions. To address this challenge, we propose an SNR-aware cross attention module (SCAM) that achieves beneficial information interaction between two samples using their noisy feature maps, where the feature map of one image is dynamically adjusted according to the other. Meanwhile, the channel quality and the spatial correlation are jointly considered.
Inspired by vision transformers [14], which use a self-attention mechanism to capture connections and dependencies between global and local content, we design the SCAM shown in Fig. 2. For simplicity, we use and to denote the stereo images and omit subscripts when they need not be distinguished. The two noisy feature maps and are extracted from the previous convolutional block, where , , and denote the height, width, and number of channels of the feature maps, respectively. SCAM aims to generate cross attention maps, which are then used to recalibrate the feature maps and thereby achieve information interaction between them. We use patches to measure the correlation between the images: the feature map is reshaped to , where can be viewed as the size of the spatial dimension. Note that, since the encoder downscales three times, each vector along the spatial dimension is extracted from a patch of pixels at the same position.
It consists of the following steps in turn:
SNR Information Attachment. To adapt to a range of SNRs and ensure a considerable performance gain when the SNRs of the two links differ, it is important to let the channel state information interact with the context information. Thus, we inform the network of the current channel quality by pre-setting a group of learnable quality tokens . Each token covers a range of SNR values. If the SNR of the current transmission channel is covered by , we concatenate it with () in the spatial dimension:
where is the fused feature map including channel quality and context information.
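The token attachment step can be sketched as follows. This is a minimal illustration, assuming that the tokens partition the SNR range into equal-width bins and that the selected token is prepended as one extra spatial vector; the names `select_token_index` and `attach_quality_token` and the default range and interval are hypothetical, and random arrays stand in for learned tokens.

```python
import numpy as np

def select_token_index(snr_db, snr_min_db, interval_db, num_tokens):
    """Index of the token whose SNR bin covers `snr_db` (clipped to valid bins)."""
    idx = int(np.floor((snr_db - snr_min_db) / interval_db))
    return int(np.clip(idx, 0, num_tokens - 1))

def attach_quality_token(feature, tokens, snr_db, snr_min_db=0.0, interval_db=1.0):
    """Prepend the matching quality token along the spatial dimension.

    feature: (N, C) reshaped noisy feature map
    tokens:  (T, C) quality tokens (learned in practice, random here)
    returns: (N + 1, C) fused feature map
    """
    idx = select_token_index(snr_db, snr_min_db, interval_db, tokens.shape[0])
    return np.concatenate([tokens[idx:idx + 1], feature], axis=0)

rng = np.random.default_rng(0)
feature = rng.standard_normal((6, 4))   # N = 6 spatial vectors, C = 4 channels
tokens = rng.standard_normal((5, 4))    # T = 5 tokens covering 0 dB to 5 dB
fused = attach_quality_token(feature, tokens, snr_db=2.3)
```

Because each link selects its token from its own SNR, two links with different channel qualities carry different tokens into the attention computation.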
Cross Attention Layer. To evaluate the correlation of the two sources, we employ three linear layers to map the input to its query , key , and value , i.e., , , and , where refers to the layer normalization function. The cross attention is calculated between the spatial vectors of the two feature maps, that is, the query vectors of () are multiplied by the key vectors of () to obtain the cross attention map (). In this manner, the relevance between and is calculated, which explicitly accounts for the interaction between the channel state information and the context information.
Feature map recalibration. We use the cross attention map to recalibrate the value matrix:
where is a linear layer, and represents the SoftMax operator applied to each column of the matrix for normalization. After that, to further fuse the context and channel state information, we adopt a multilayer perceptron (MLP) including two linear layers with a skip connection:
where , are linear layers, , are biases, is the ReLU activation function, and is the hidden layer size of the MLP. At the end of the cross attention layer, we remove the attached quality tokens from and finally obtain the recalibrated feature map .
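The cross attention and recalibration steps can be sketched in NumPy as below. This is a generic cross-attention block written under several assumptions that are not confirmed by the text: the value is taken from the other feature map, the scores are scaled by the square root of the channel count, the softmax normalizes over the key positions, a residual connection precedes the MLP, and the hidden size is twice the input dimension. All parameter names are hypothetical and the weights are random stand-ins for learned layers.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each spatial vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def init_params(c, hidden, rng):
    """Random stand-ins for the learned linear layers (hypothetical names)."""
    s = 1.0 / np.sqrt(c)
    return {
        "Wq": rng.standard_normal((c, c)) * s,
        "Wk": rng.standard_normal((c, c)) * s,
        "Wv": rng.standard_normal((c, c)) * s,
        "Wo": rng.standard_normal((c, c)) * s,
        "W1": rng.standard_normal((c, hidden)) * s,
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((hidden, c)) / np.sqrt(hidden),
        "b2": np.zeros(c),
    }

def cross_attention(f_a, f_b, p):
    """Recalibrate f_a using f_b; inputs are (N + 1, C) with the token in row 0."""
    q = layer_norm(f_a) @ p["Wq"]            # queries from the map being refined
    k = layer_norm(f_b) @ p["Wk"]            # keys from the other map
    v = layer_norm(f_b) @ p["Wv"]            # values from the other map (assumed)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    h = f_a + (attn @ v) @ p["Wo"]           # recalibration with residual (assumed)
    hidden = np.maximum(layer_norm(h) @ p["W1"] + p["b1"], 0.0)  # ReLU
    out = h + hidden @ p["W2"] + p["b2"]     # MLP with skip connection
    return out[1:], attn                     # drop the quality-token row

rng = np.random.default_rng(0)
n, c = 6, 8
f1 = rng.standard_normal((n + 1, c))         # fused map of link 1 (token + N vectors)
f2 = rng.standard_normal((n + 1, c))         # fused map of link 2
params = init_params(c, hidden=2 * c, rng=rng)
f1_recal, attn_12 = cross_attention(f1, f2, params)
f2_recal, attn_21 = cross_attention(f2, f1, params)
```

Since the token rows take part in the attention scores, each spatial vector can attend to the other link's channel-quality token as well as to its image content.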
4.1 Experimental setup
Datasets. We constructed our dataset from KITTI Stereo 2012 [15] (1578 stereo image pairs) and KITTI Stereo 2015 [16] (789 scenes with 21 image pairs per scene, taken sequentially). Here, a pair of images means two images taken at the same time by a pair of calibrated stereo cameras. Following the works on distributed source coding [5, 6], 1576 image pairs are selected as the training dataset, and 790 image pairs are used for testing. All images are center-cropped and resized to pixels before training and testing.
Peak signal-to-noise ratio (PSNR) is the most common image evaluation metric, and multi-scale structural similarity (MS-SSIM) is considered to be a perceptual quality metric closer to human perception. Since the two metrics sometimes provide different results, we use both of them to test the robustness of our model.
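For reference, PSNR follows directly from the mean squared error. The sketch below assumes 8-bit images with a peak value of 255; the function name `psnr` is illustrative. MS-SSIM involves multi-scale filtering and is not shown here.

```python
import numpy as np

def psnr(x, x_hat, max_val=255.0):
    """Peak signal-to-noise ratio in dB between an image and its reconstruction."""
    # Cast to float first so that uint8 subtraction cannot wrap around.
    err = x.astype(np.float64) - x_hat.astype(np.float64)
    mse = np.mean(err ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Toy example: every pixel is off by exactly one gray level, so MSE = 1.
img = np.full((4, 4, 3), 100, dtype=np.uint8)
rec = np.full((4, 4, 3), 101, dtype=np.uint8)
value_db = psnr(img, rec)   # 10 * log10(255^2), about 48.13 dB
```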
Training details. In all experiments, for the independent channels case , the bandwidth ratio of each distributed source (the number of channel uses per source dimension) is , which corresponds to the bottleneck channel dimension . In the asymmetric case , we assume that source is losslessly available at the receiver and evaluate the recovery quality of at the fixed bandwidth ratio . The average transmission power constraint is set to , and the two stereo images have the same weight (). The hidden layer size of the MLP is set to times the input dimension. Each model is trained for 250K iterations using the Adam optimizer with a learning rate of and a batch size of 12. Mean squared error (MSE) is used as the loss function for PSNR models, and for MS-SSIM models. Each model is trained over the transmission channel with the SNR drawn uniformly from dB to dB, and the interval of the channel quality tokens is set to dB. Besides, a special token is employed to indicate a noiseless channel when training the asymmetric case.
4.2 Experimental results
We compare our scheme with existing state-of-the-art separated design schemes as well as the DL-based JSCC baseline. For the source coding of the separated systems, we consider the popular image codecs BPG and JPEG2000, as well as the recently developed DL-based DSC image compression schemes DSIN [5] and DWSIC [6]. For the channel coding, we employ practical LDPC codes and an ideal capacity-achieving code (denoted as Capacity) as a bound. We denote the combination of source coding and channel coding schemes by “+” for brevity. For each configuration of the separated systems, given the bandwidth ratio , we first calculate the maximum source compression rate (bits per pixel) using the channel capacity or the rate of the channel code and modulation, and then compress the images at the largest rate that is no more than . For LDPC codes, which cannot guarantee reliable transmission, we follow  and set a failed reconstruction to the mean value of all the pixels per color channel. DSIN and DWSIC are digital source coding schemes designed under the lossless side information assumption. To test their transmission performance, we combine them with a channel code for evaluation.
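The rate budget of a separated baseline can be sketched as follows. This is a rough illustration under stated assumptions: the bandwidth ratio counts complex channel uses per source dimension, an RGB pixel has three source dimensions, the capacity of a complex AWGN channel is log2(1 + SNR) bits per channel use, and the fallback uses the per-channel mean of the source image. The function names and the example bandwidth ratio of 1/12 are hypothetical.

```python
import numpy as np

def max_bpp_capacity(bandwidth_ratio, snr_db, channels=3):
    """Upper bound on bits per pixel with an ideal capacity-achieving code."""
    snr = 10.0 ** (snr_db / 10.0)
    return channels * bandwidth_ratio * np.log2(1.0 + snr)

def max_bpp_coded(bandwidth_ratio, code_rate, bits_per_symbol, channels=3):
    """Bits per pixel for a practical channel code plus modulation."""
    return channels * bandwidth_ratio * code_rate * bits_per_symbol

def mean_fallback(image):
    """Reconstruction used when decoding fails: per-color-channel mean."""
    mean = image.mean(axis=(0, 1), keepdims=True)
    return np.broadcast_to(mean, image.shape).copy()

# At 0 dB the capacity is 1 bit per channel use, so 3 * (1/12) * 1 = 0.25 bpp.
bpp_ideal = max_bpp_capacity(1.0 / 12.0, snr_db=0.0)
# Rate-1/2 code with a 4-bit constellation: 3 * (1/12) * 0.5 * 4 = 0.5 bpp.
bpp_coded = max_bpp_coded(1.0 / 12.0, code_rate=0.5, bits_per_symbol=4)

image = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
fallback = mean_fallback(image)
```

The source codec is then run at the largest rate not exceeding this budget, so a lower SNR or a lower-rate channel code directly shrinks the bits available for compression.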
As plotted in Fig. 3(a) and Fig. 3(b), we compare the quality of the reconstructed images over the AWGN channel under various channel SNRs in terms of the average MS-SSIM and PSNR. Here, Proposed I refers to the proposed model in the independent channels case , while Proposed II refers to the asymmetric case . The proposed model adapts to a range of SNRs as well as to the condition in which the side information is noiseless; thus, all curves of the proposed model are obtained with a single model. For Proposed I, we assume that both transmission channels have the same SNR and present the results of each metric for the two distributed sources. Besides, since the proposed JSCD treats each transmission link equally, both distributed sources have almost identical PSNR and MS-SSIM throughout the simulated SNR interval.
We first compare Proposed I with the baseline model and the classical separated systems. As a naive use of D-JSCC, the baseline transmits images and independently without extra design. Because the proposed scheme leverages the mutual information of the two links, there is a clear performance gain over the baseline model. As for the classical separated systems, our scheme outperforms the classical image codec schemes JPEG2000+Capacity and BPG+Capacity in terms of MS-SSIM, and also in terms of PSNR in the low-SNR region. Although BPG+Capacity shows better PSNR results in the high-SNR region, this scheme requires unlimited delay and complexity, which is an idealized assumption in the distributed scenario. The more practical BPG+LDPC scheme suffers from the “cliff effect” and performs worse than the proposed model in terms of PSNR.
We then present the results of Proposed II, which shows the upper-bound performance of Proposed I. Compared with the DL-based DSC schemes DSIN and DWSIC, which consider the same case, the proposed scheme shows competitive performance. Fig. 3(c) presents how the reconstruction quality of one image varies when the channel quality of the other link changes, where denotes the SNR difference between the two links. Because the proposed cross attention module achieves interaction between the SNR information and the context information of each image, its performance improves with the quality of the side information, and a single model adapts to every value of . Moreover, the performance of the proposed scheme under a Rayleigh fast fading channel with perfect channel state information is studied in Fig. 3(d), which further demonstrates the robustness and the performance gain of the proposed model. A visual example of the reconstructed image over the AWGN channel with dB is shown in Fig. 4. The proposed schemes yield better reconstructions with more details.
In this paper, the problem of D-JSCC for correlated image sources has been studied. The proposed JSCE and JSCD structures leverage the common information across two stereo images to improve the reconstruction quality without extra transmission overhead. Besides, we propose an SNR-aware cross attention module, which calculates the patch-wise relevance of the two images while also considering the SNR of each transmission link. Because the channel state information and the context information of the two images are efficiently exploited, the results demonstrate that the proposed method achieves impressive performance in distributed scenarios.
- [1] D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,” IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471–480, 1973.
- [2] A. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,” IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 1–10, 1976.
- [3] Z. Xiong, A. D. Liveris, and S. Cheng, “Distributed source coding for sensor networks,” IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 80–94, 2004.
- [4] S. S. Pradhan and K. Ramchandran, “Distributed source coding using syndromes (DISCUS): Design and construction,” IEEE Transactions on Information Theory, vol. 49, no. 3, pp. 626–643, 2003.
- [5] S. Ayzik and S. Avidan, “Deep image compression using decoder side information,” in European Conference on Computer Vision (ECCV), 2020, pp. 699–714.
- [6] N. Mital, E. Ozyilkan, A. Garjani, and D. Gündüz, “Deep stereo image compression with decoder side information using Wyner common information,” arXiv preprint arXiv:2106.11723, 2021.
- [7] Z. Xuan and K. Narayanan, “Low-delay analog distributed joint source-channel coding using SIRENs,” in Proceedings of European Signal Processing Conference (EUSIPCO). IEEE, 2021, pp. 1601–1605.
- [8] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019.
- [9] D. B. Kurka and D. Gündüz, “DeepJSCC-f: Deep joint source-channel coding of images with feedback,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 178–193, 2020.
- [10] J. Xu, B. Ai, W. Chen, A. Yang, P. Sun, and M. Rodrigues, “Wireless image transmission using deep source channel coding with attention modules,” IEEE Transactions on Circuits and Systems for Video Technology, early access, 2021.
- [11] S. Wang, J. Dai, S. Yao, K. Niu, and P. Zhang, “A novel deep learning architecture for wireless image transmission,” in Proceedings of IEEE Global Communications Conference, 2021.
- [12] J. Garcia-Frias and Z. Xiong, “Distributed source and joint source-channel coding: from theory to practice,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2005, vol. 5, pp. v/1093–v/1096.
- [13] F. Mentzer, G. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
- [14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in Proceedings of International Conference on Learning Representations (ICLR), 2020.
- [15] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 3354–3361.
- [16] M. Menze, C. Heipke, and A. Geiger, “Joint 3D estimation of vehicles and scene flow,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 2, pp. 427, 2015.