Nonlinear Transform Source-Channel Coding for Semantic Communications

by Jincheng Dai, et al.

In this paper, we propose a new class of highly efficient deep joint source-channel coding methods that closely adapt to the source distribution under a nonlinear transform; this class can be collected under the name nonlinear transform source-channel coding (NTSCC). In the considered model, the transmitter first learns a nonlinear analysis transform to map the source data into a latent space, then transmits the latent representation to the receiver via deep joint source-channel coding. Our model incorporates the nonlinear transform as a strong prior to effectively extract the source semantic features and provide side information for source-channel coding. Unlike existing conventional deep joint source-channel coding methods, the proposed NTSCC essentially learns both the source latent representation and an entropy model as the prior on the latent representation. Accordingly, novel adaptive rate transmission and hyperprior-aided codec refinement mechanisms are developed to upgrade deep joint source-channel coding. The whole system design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion performance under established perceptual quality metrics. Across simple example sources and test image sources, we find that the proposed NTSCC transmission method generally outperforms both analog transmission using standard deep joint source-channel coding and classical separation-based digital transmission. Notably, the proposed NTSCC method can potentially support future semantic communications due to its strong content-aware ability.




I Introduction

Semantic communications are recently emerging as a new paradigm driving the in-depth integration of information and communication technology (ICT) advances and artificial intelligence (AI) innovations [1]. This paradigm shift calls for revolutionary theories and methodological innovations. Unlike the traditional module-stacking system design philosophy, it is time to bridge the two main branches of Shannon theory [2], i.e., source and channel, for boosting end-to-end system capabilities towards wisdom-evolutionary and primitive-concise networks. By this means, the channel transmission process can be made aware of the source semantic features.

The paradigm aiming at the integrated design of source and channel codes is named joint source-channel coding (JSCC) [3], a classical topic in information theory and coding theory. However, classical JSCC schemes [3, 4, 5, 6] were based on statistical probabilities without considering source semantic aspects. By introducing AI, JSCC is expected to evolve into a modern version. On the one hand, source coding can intelligently extract the information most valuable for human understanding in intelligent human-type communications (IHTC) and for decision making in intelligent machine-type communications (IMTC), such that the source coding rate can be efficiently reduced. On the other hand, channel coding can precisely identify the critical parts of the source-coded sequence to realize semantic-inclined unequal protection.

As one such modern version, recent deep learning methods for realizing JSCC have stimulated significant interest in both the AI and wireless communication communities [7, 8, 9, 10, 11, 12, 13]. By using artificial neural networks (ANNs), JSCC can be carried out over analog channels, where transmitted signals are formatted as continuous-valued symbols. From Shannon's perspective, deep JSCC can be viewed as a geometric mapping of a vector in the source space onto a lower-dimensional space embedded in it, which is then transmitted over the noisy channel. Like the landmark works of Gündüz et al. [9, 10, 11], without loss of generality, we take the image source as a representative in this paper. Standard deep JSCC schemes operate on a simple principle: an image, modeled as a vector of pixel intensities, is mapped to a vector of continuous-valued channel input symbols via an ANN-based encoding function whose parameters encapsulate the neurons of the ANN encoder. The channel input dimension is typically much smaller than the source dimension, and their ratio is named the bandwidth ratio [10], denoting the coding rate. The wireless channel adds a noise vector to the transmitted symbols; in the case of the additive white Gaussian noise (AWGN) channel, each noise component obeys an i.i.d. Gaussian distribution with zero mean and fixed variance. The decoder attempts to recover the transmitted source from the received noisy vector. This deep JSCC method yields end-to-end transmission performance surpassing classical separation-based JPEG/JPEG2000/BPG source compression combined with an ideal capacity-achieving channel code family, especially for sources of small dimensions, e.g., the small CIFAR10 image data set [14].
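The pipeline just described can be sketched numerically. Below is a minimal toy example in which a fixed linear map stands in for the ANN encoder/decoder; the dimensions, SNR, and mappings are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

m, k = 64, 16                 # source dimension and channel uses (toy sizes)
bandwidth_ratio = k / m       # the coding rate discussed above

# Stand-ins for the ANN encoder/decoder pair (fixed linear maps, not CNNs)
W_enc = rng.normal(size=(k, m)) / np.sqrt(m)
W_dec = np.linalg.pinv(W_enc)

x = rng.normal(size=m)        # source vector of "pixel intensities"

s = W_enc @ x
s = np.sqrt(k) * s / np.linalg.norm(s)   # enforce average power 1 per channel use

snr_db = 10.0
sigma2 = 10 ** (-snr_db / 10)            # AWGN variance for unit signal power
s_hat = s + rng.normal(scale=np.sqrt(sigma2), size=k)   # the noisy channel

x_hat = W_dec @ s_hat                     # decoder's reconstruction attempt
mse = float(np.mean((x - x_hat) ** 2))    # the distortion to be minimized
```

In the actual method the fixed maps are replaced by learned nonlinear networks, and the encoder/decoder pair is trained end to end to minimize this distortion.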

However, one can observe that, in general, as the source dimension increases, e.g., for large-scale images, the performance of deep JSCC degrades rapidly, becoming even inferior to classical separation-based coding schemes. In addition, when the bandwidth ratio increases, existing deep JSCC cannot provide a coding gain comparable to that of classical separated coding schemes, i.e., the slope of the performance curve flattens as the bandwidth ratio or the channel signal-to-noise ratio (SNR) increases. This phenomenon stems from an inherent defect of standard deep JSCC: it cannot identify the source distribution so as to realize patch-wise variable-length transmission. For example, when the dimension of the embedding vector increases, the embeddings corresponding to simple patches saturate rapidly, leading to severe channel bandwidth waste and inferior coding gain. This saturation phenomenon is more likely to appear on large-scale images that need higher-dimensional representations.

Furthermore, current deep JSCC methods do not incorporate any hyperprior as side information, a concept widely used in modern image codecs but so far unexplored in ANN-based deep JSCC image transmission.

In this paper, we aim to break the above limits by proposing a new joint source-channel coding architecture that integrates the recent concept of nonlinear transform coding [15] with deep JSCC, i.e., nonlinear transform source-channel coding (NTSCC). As an emerging class of methods, nonlinear transform coding (NTC) has over the past few years become competitive with the best linear transform codecs for image compression, and outperforms them in rate-distortion (RD) performance under well-established perceptual quality metrics, e.g., PSNR, MS-SSIM, LPIPS, etc. In contrast to linear transform coding (LTC) schemes, NTC can more closely adapt to the source distribution, leading to better compression performance. By integrating NTC into deep JSCC, the proposed NTSCC works on the following principle: the source vector is not directly mapped to the channel input symbols; instead, an alternative (latent) representation of the source is found first, and deep JSCC encoding takes place on this latent representation. By introducing an entropy model on the latent space, NTSCC learns a prior as side information to approximate the distribution of each patch, which is intractable in practice. Accordingly, the subsequent deep JSCC codec can select an appropriate coding scheme that optimizes the transmission RD performance for each embedding. As a result, the proposed NTSCC transmission framework can closely adapt to the source distribution and provide superior coding gain. Notably, the proposed NTSCC method can well support future semantic communications due to its content-aware ability.

Specifically, the contributions of this paper can be summarized as follows.

  1. NTSCC Architecture: We propose a new end-to-end learnable model for high-dimensional source transmission, i.e., NTSCC, that combines the advantages of NTC and deep JSCC. To the best of our knowledge, this is the first work exploiting the nonlinear transform to establish a learnable entropy model for realizing deep JSCC efficiently, where the entropy model on the latent code implicitly represents the source distribution. Particularly, by using a parametric analysis transform, the source vector is transformed into a latent representation, which is then quantized so that the entropy model can describe regional semantic features and control the transmission rate of each patch. Different from NTC, the latent code can be directly fed to the deep JSCC encoder to produce the analog sequence for channel transmission. On the other side, the receiver performs the corresponding inverse operations of the deep JSCC encoder and the analysis transform. In our model, the nonlinear analysis transform condenses the source semantic information into a latent representation, thus driving the subsequent source-channel coding.

  2. Adaptive Rate Transmission: To improve the coding gain of the proposed NTSCC method, we introduce a variable-length transmission mechanism for each embedding vector in the latent code. To this end, a conditional entropy model is applied to each quantized embedding to evaluate its entropy. If the learned entropy model indicates an embedding of high entropy, its corresponding deep JSCC shall be assigned a high coding rate, and vice versa. Accordingly, we develop a Transformer ANN architecture and a rate attention mechanism to realize adaptive rate allocation for deep JSCC, which can finely tune the coding rate for each embedding and thus enables source content-aware transmission.

  3. Hyperprior-aided Codec Refinement: In the proposed NTSCC method, as the hyperprior on the latent representation, we show that the side information about the entropy model parameters can also be viewed as a prior on the deep JSCC codewords. We exploit this hyperprior to reduce the mismatch between the latent representation marginal distribution for a particular source sample and the marginal for the ensemble of source data the transmission model was designed for. This refinement mechanism only uses a small number of additional bits of information sent from encoder to decoder as signal modifications to achieve much better performance in deep JSCC decoding.

  4. Performance Validation: We verify the performance of the proposed NTSCC method on simple example sources and test image sources. The effect of adaptive rate allocation and the gap between the empirical transmission RD performance of NTSCC and that of several state-of-the-art transmission schemes are assessed with the help of the exemplary “banana-shaped” two-dimensional source [15]. Furthermore, we show that for image transmission the proposed NTSCC achieves much better coding gain and RD performance on various established perceptual metrics such as PSNR, MS-SSIM, and LPIPS [16]. Equivalently, to achieve identical end-to-end transmission performance, the proposed NTSCC method saves more than 20% of the bandwidth cost compared to both the emerging analog transmission schemes using standard deep JSCC and the classical separation-based digital transmission schemes.

The remainder of this paper is organized as follows. In Section II, we first review the variational perspective on deep JSCC and NTC, and propose the variational model for NTSCC. Then, in Section III, we present ANN architectures for realizing NTSCC, as well as key methodologies to guide the optimization of the NTSCC model. Section IV provides a direct comparison with a number of methods to quantify the performance gain of the proposed method. Finally, Section V concludes this paper and discusses future research directions on this new topic.

Notational Conventions: Throughout this paper, lowercase letters denote scalars and bold lowercase letters denote vectors; in some cases, a subscripted element of a bold vector may also represent a subvector, as described in the context. Bold uppercase letters denote matrices, and the identity matrix of a given dimension is written accordingly. We distinguish the natural logarithm from the logarithm to base 2. A probability density function (pdf) refers to a continuous-valued random variable, and a probability mass function (pmf) refers to a discrete-valued random variable. We further use standard notations for the statistical expectation operation, the real number set, a Gaussian function, and a uniform distribution centered on a given value over a given range.

II Variational Modeling

Consider the following lossy transmission scenario. Alice draws a vector from the source, whose distribution is given, and is concerned with how to map it to a channel-input vector whose dimension is referred to as the channel bandwidth cost. Alice then transmits this vector to Bob via a realistic communication channel, and Bob uses the received information to reconstruct an approximation of the source.

II-A Variational Modeling of Deep JSCC

As stated in the introduction, in deep JSCC [9], the source vector is mapped to a vector of continuous-valued channel input symbols via an ANN-based encoding function, where the encoder is usually parameterized as a convolutional neural network (CNN). The resulting analog sequence is sent directly over the communication channel. The channel introduces random corruptions to the transmitted symbols, modeled as a transition function with its own set of parameters, which yields the received sequence and its transition probability. In this paper, we consider the most widely used AWGN channel model, in which the received sequence equals the transmitted sequence plus a noise vector whose components are independently sampled from a zero-mean Gaussian distribution with the average noise power as variance. Other channel models can be incorporated similarly by changing the channel transition function. The receiver comprises a parametric function, which can also take the form of a CNN [9], to recover the source from the corrupted signal. The whole operation is shown in the right panel of Fig. 1. The encoder and decoder functions are jointly learned to minimize the average distortion between the source and its reconstruction, as measured by a distortion loss function.

Fig. 1: Left: representation of a deep JSCC encoder combined with the communication channel as an inference model, and the corresponding decoder as a generative model. Nodes denote random variables or parameters, and arrows show conditional dependence between them. Right: diagram showing the operational structure of the deep JSCC transmission model. Arrows indicate the data flow, and boxes represent the coding functions of data and channel.

However, as analyzed in [17], it is more reasonable to model deep JSCC as a variational autoencoder (VAE) [18]. As shown in the left panel of Fig. 1, the noisy received sequence can be viewed as a sample of the latent variables in the generative model. The deep JSCC decoder acts as the generative model (“generating” the reconstructed source from the latent representation) that transforms a latent variable with some predicted latent distribution into a data distribution that is unknown. The optimization objective of this generative model is given in (2), where the likelihood term depends on the type of loss function: the most widely used squared error corresponds to a Gaussian likelihood whose hyperparameter determines the power constraint of the deep JSCC system. The deep JSCC encoder combined with the channel is linked to the inference model (“inferring” the latent representation from the source data). To efficiently optimize (2), we need knowledge about the true posterior, which is intractable. An alternative in variational inference is to learn a parametric variational density as a substitute by minimizing the expectation of their Kullback-Leibler (KL) divergence over the data distribution, which forms the end-to-end optimization objective (3) covering both the generative model (decoder) and the inference model (encoder). Here, the variational density is determined by the encoder parameters and the channel, while the likelihood depends on the decoder and the type of loss function. We note that the first term in (3) equals a constant under the AWGN channel. Although this derivation assumes the AWGN channel, it holds for any additive-noise channel, which stems from the translation-invariance property of differential entropy. In this sense, we can technically drop the first term from the KL loss function; the last term can be similarly dropped. It can be concluded that the design of deep JSCC is equivalent to optimizing the parametric codec functions for transmission rate-distortion (RD) performance, rather than only optimizing the distortion function in (1).
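To see why that first KL term is constant, consider the following sketch. The symbols here are assumptions matching the usual deep JSCC setup rather than quoted from the text: f_θ denotes the encoder, n the channel noise, k the number of channel uses, and σ² the noise variance.

```latex
% The encoder plus AWGN channel induces a shifted-noise variational density:
q\bigl(\hat{s} \mid x\bigr) \;=\; p_{n}\bigl(\hat{s} - f_{\theta}(x)\bigr),
\qquad n \sim \mathcal{N}\bigl(0, \sigma^{2} I_{k}\bigr),

% so its expected log equals minus the differential entropy of the noise,
\mathbb{E}_{\hat{s}\sim q}\bigl[\log q(\hat{s}\mid x)\bigr]
 \;=\; -\,h(n) \;=\; -\tfrac{k}{2}\,\log\bigl(2\pi e\,\sigma^{2}\bigr),

% which is independent of \theta: shifting a density by f_{\theta}(x)
% leaves its differential entropy unchanged, so the term can be dropped.
```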

II-B Variational Modeling of NTC

As indicated in the introduction, when the dimension of the representation becomes large, standard deep JSCC cannot provide sufficient coding gain. As an emerging lossy compression model [19, 20, 21, 22, 23], NTC imitates the classical source coding procedure and operates on a simple principle: a source vector is not mapped to a codeword vector directly; rather, an alternative latent representation is found first, a vector in some other space, and quantization then takes place on this latent representation, yielding a discrete-valued vector. In standard NTC, due to the quantization step, the resulting discrete latent can be compressed into bit streams using entropy coding methods, e.g., arithmetic coding [24]. Since entropy coding relies on a prior probability of the quantized latent, an entropy model is established in NTC to provide this side information.

The encoder in NTC transforms the source image vector into a latent representation using a parametric nonlinear analysis transform, and this representation is then quantized. The latent representation preserves the source semantic features while its dimension is usually much smaller than the source dimension. The decoder first recovers the quantized latent from the compressed signal, and a parametric nonlinear synthesis transform is then performed on it to recover the source image. Here, ideal error-free transmission is assumed such that the decoder can losslessly recover the quantized latent by entropy decoding. In NTC, the analysis and synthesis transforms are usually parameterized as ANNs with nonlinear properties, rather than the conventional linear transforms of LTC, and their neural network parameters are learned.

Since the nonlinear transform is not strictly invertible, and the quantization step additionally introduces error, the optimization of NTC amounts to a compression RD problem [25]. Assuming efficient entropy coding is used, the rate, i.e., the expected length of the compressed sequence, equals the entropy of the quantized latent as determined by the entropy model in (5), where the quantization function is applied element-wise. In the context of this paper, without loss of generality, we employ uniform scalar quantization (rounding to integers). Distortion is the expected divergence between the source and its reconstruction. Clearly, a higher rate allows for a lower distortion, and vice versa. In order to use gradient descent methods to optimize the NTC model, Ballé et al. proposed a relaxation that addresses the zero-gradient problem caused by quantization [19]: a proxy “uniformly-noised” representation replaces the quantized representation during model training.
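This relaxation can be shown in a few lines; a minimal numpy sketch with arbitrary toy numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(scale=3.0, size=1000)   # toy latent representation

# Training: additive uniform noise U(-1/2, 1/2) as a differentiable proxy
y_tilde = y + rng.uniform(-0.5, 0.5, size=y.shape)

# Testing: hard uniform scalar quantization (rounding to integers)
y_bar = np.round(y)

# Both the proxy and the quantizer stay within half a quantization bin
assert np.all(np.abs(y_tilde - y) <= 0.5)
assert np.all(np.abs(y_bar - y) <= 0.5)
```

The proxy keeps the rate term differentiable during training, while rounding is used at test time.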

Fig. 2: Left: representation of an NTC encoder as an inference model, and the corresponding NTC decoder as a generative model. Nodes denote random variables or parameters, and arrows show conditional dependence between them. Right: diagram showing the operational structure of the NTC compression model. Arrows indicate the data flow, and boxes represent the data transforms. The quantizer boxes denote either uniform noise addition during model training, or quantization during model testing.

The optimization problem of NTC can also be formulated as a VAE model, as shown in Fig. 2: a probabilistic generative model stands for the synthesis transform, and an approximate inference model corresponds to the analysis transform. As in deep JSCC, the goal of the inference model is to create a parametric variational density that approximates the intractable true posterior by minimizing their KL divergence over the source distribution, i.e., (6).

The minimization of this KL divergence is equivalent to optimizing the NTC model for compression RD performance. As shown in [19], the first term in (6) computes the transition probability from the source to the proxy latent representation as in (7), where the proxy follows a uniform distribution centered on the latent. Since the width of this uniform distribution is constant, the first term is also constant and can be technically dropped; the last term can be similarly dropped. The third term, representing the log likelihood, can be modeled by measuring the squared error between the source and its reconstruction, weighted by a hyperparameter. The second term reflects the cross-entropy between the marginal of the proxy latent and the prior; it represents the cost of encoding the latent as constrained by the entropy model. In [19], Ballé et al. modeled the prior using a non-parametric fully-factorized density model as in (8), where convolution with a standard uniform distribution is used to better match the prior to the marginal. This model is referred to as the factorized-prior model [19].
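The “convolve the prior with a standard uniform” construction is easy to illustrate. The non-parametric density of [19] is beyond a short sketch, so the following assumes a Gaussian prior instead; the probability of each noised/quantized value is then a difference of CDFs half a bin apart:

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(x, mu=0.0, sigma=1.0):
    # Standard Gaussian CDF via the error function
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def likelihood(y_bar, mu=0.0, sigma=1.0):
    # (p * U(-1/2, 1/2))(y_bar) = CDF(y_bar + 1/2) - CDF(y_bar - 1/2)
    return gaussian_cdf(y_bar + 0.5, mu, sigma) - gaussian_cdf(y_bar - 0.5, mu, sigma)

# Rate (in bits) of one quantized symbol under this entropy model
rate_bits = -np.log2(likelihood(0.0, mu=0.0, sigma=1.0))
```

Evaluated over all integer bins, these masses sum to one, so the model is a valid pmf for entropy coding.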

However, in general, there may still exist clear spatial dependencies among the elements of the latent representation, in which case the performance of the factorized-prior model degrades. To tackle this, Ballé et al. introduced an additional set of latent variables, the hyperprior, to capture the dependencies of the latent representation in the same way that the original source is transformed into the latent representation [20]. Here, each latent element is variationally modeled as a Gaussian whose mean and standard deviation are predicted by applying a parametric hyper synthesis transform to the hyperprior, as in (9). The corresponding hyper analysis transform is stacked on top of the analysis transform, creating a joint factorized variational posterior as in (10). Since we do not have prior beliefs about the hyperprior, it can be modeled as a non-parametric fully-factorized density like (8), i.e., (11), with its own set of parameters. The optimization goal in (6) then changes accordingly, with an additional term that can be viewed as the side information widely used in traditional transform coding schemes. The right panel of Fig. 2 depicts how the model is used for data compression. Following this variational analysis, the loss function for training the NTC model is given in (13), where one random quantization offset is uniformly sampled per latent dimension.
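For reference, in Ballé et al.'s hyperprior formulation [20] the training loss just described takes the following standard form. This is a sketch with our own symbol choices (weighting conventions vary): the tildes denote the uniformly-noised proxies, and λ weights the distortion.

```latex
L \;=\; \mathbb{E}_{x \sim p_{x}}\,\mathbb{E}_{o \sim \mathcal{U}}
   \Bigl[\,
     \underbrace{-\log p_{\tilde{y}\mid\tilde{z}}\bigl(\tilde{y}\mid\tilde{z}\bigr)
                 \;-\;\log p_{\tilde{z}}\bigl(\tilde{z}\bigr)}_{\text{rate: latents + side information}}
     \;+\;\lambda\, d\bigl(x, \tilde{x}\bigr)
   \Bigr],
\qquad
\tilde{y} = y + o_{y},\quad \tilde{z} = z + o_{z}.
```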

II-C Variational Modeling of the Proposed NTSCC

We integrate the advantages of both NTC and classical deep JSCC into a scheme collected under the name nonlinear transform source-channel coding (NTSCC). In the transmitter, the analysis transform of NTC is used as a type of “precoding” before deep JSCC encoding, extracting the source semantic features as a latent representation; deep JSCC then operates on this latent space. The bottom panel of Fig. 3 illustrates how the NTSCC model is used for data transmission. The analysis transform maps the input source vector to the latent representation, whose mean values and standard deviations vary spatially. The latent code is then fed into both the hyper analysis transform and the deep JSCC encoder. On the one hand, the hyper analysis transform summarizes the distribution of the means and standard deviations of the latent code in the hyperprior, which is then quantized, compressed, and transmitted as side information. The transmitter utilizes the quantized hyperprior to estimate the mean and standard deviation vectors, and uses them to determine the bandwidth ratio for transmitting the latent representation. The receiver likewise utilizes these estimates as side information to correct the probability estimates for recovering the latent representation. On the other hand, the deep JSCC encoder encodes the latent code as the channel-input sequence, which arrives at the receiver corrupted by channel noise.

Fig. 3: Top: representation of the NTSCC analysis transform and encoding combined with the communication channel as an inference model, and the corresponding decoding and synthesis transform as a generative model. Nodes denote random variables or parameters, and arrows show conditional dependence between them. Bottom: diagram showing the operational structure of the NTSCC transmission model. Arrows indicate the data flow, and boxes represent the data transforms and coding. The quantizer boxes denote either uniform noise addition during model training, or quantization during model testing. Dashed lines and boxes denote the calculation of the proxy variable.

The optimization problem of NTSCC can also be formulated as a VAE model, as shown in Fig. 3: a probabilistic generative model stands for the deep JSCC decoder and the synthesis transform, while an approximate inference model corresponds to the analysis transform and the deep JSCC encoder. As in the above discussion, the goal of the inference model is to create a parametric variational density that approximates the intractable true posterior by minimizing their KL divergence over the source distribution, as in (14). It can be observed that the minimization of this KL divergence is equivalent to jointly optimizing the nonlinear transform model and the deep JSCC model for end-to-end transmission RD performance.


Let us take a closer look at each term in the last line of (14). First, the variational inference model computes the analysis transform and the deep JSCC encoding of the source vector and adds the channel noise, as in (15). As discussed before, the first term of the KL divergence can be technically dropped from the loss function.

The second term is the cross-entropy between the marginal and the prior of the hyperprior. It stands for the cost of encoding the side information in the inference model under the hyperprior entropy model. In practice, since we do not have prior beliefs about the hyperprior, it can also be modeled as a non-parametric fully-factorized density as in (11).

The third term corresponds to the cross-entropy that denotes the transmission rate of the source message. As shown in Fig. 3, we derive it with the help of an intermediate proxy variable, namely the latent representation of the source. Each latent element is variationally modeled as a Gaussian whose mean and standard deviation are predicted by applying a parametric hyper synthesis transform to the hyperprior, as in (16). This density can be transformed into the density of the channel input by using the deep JSCC encoder function and the formula for the distribution of a function of a random variable [26]. Under the AWGN channel, the received signal is the channel input plus Gaussian noise, so its density can be calculated as in (17), where “∗” denotes the convolution operation. Since the latent representation is fed directly into the deep JSCC encoder without quantization, (17) indeed represents a differential entropy, as opposed to the discrete entropy used for the rate constraint in NTC. Note that the channel input is originally determined by the latent representation; thus, to ensure stable model training, as in NTC we employ a proxy “uniformly-noised” representation, variationally modeled as in (9), to serve as the transmission rate constraint term in practical implementations of NTSCC, which is marked with dashed lines in the inference model of Fig. 3. During model testing, the conditional entropy model is established by taking discrete values from the learned entropy model, with the latent rounded to integers. The transmission rate is therefore constrained proportionally to the estimated entropy.
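The entropy-proportional rate constraint can be sketched as an allocation rule: each embedding's estimated entropy (in bits) is scaled by a factor into a per-embedding channel bandwidth cost. The scaling factor, rate set, and entropy values below are illustrative choices, not the paper's configuration:

```python
import numpy as np

def allocate_bandwidth(entropies_bits, eta=0.2, rate_set=(2, 4, 8, 16, 32)):
    """Map per-embedding entropy estimates to per-embedding channel uses.

    High-entropy embeddings (complex patches) get more channel symbols,
    low-entropy ones fewer -- the content-aware variable-length idea.
    """
    rate_set = np.asarray(rate_set)
    targets = eta * np.asarray(entropies_bits)           # desired channel uses
    # Snap each target to the nearest rate supported by the multi-head codec
    idx = np.abs(targets[:, None] - rate_set[None, :]).argmin(axis=1)
    return rate_set[idx]

ent = [5.0, 40.0, 130.0, 12.0]    # toy per-embedding entropies, in bits
ks = allocate_bandwidth(ent)      # channel uses assigned to each embedding
```

A discrete rate set reflects that the deep JSCC codec exposes only a finite number of coding-rate heads.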

The fourth term represents the log likelihood of recovering the reconstructed source, the output of the synthesis transform, which is driven by the output of the deep JSCC decoder. Here, the change of the fourth term from (b) to (c) in (14) follows from Jensen's inequality, which yields an upper bound on the KL divergence. The corresponding densities in (18) and (19) measure the squared differences. Different from conventional schemes, (19) indicates that the deep JSCC decoder can use both the received signal and the side information to estimate the latent representation.

From the above analysis, we can also summarize the operations at the receiver: it first recovers the hyperprior from the transmitted side information; it then exploits the hyperprior to recover the mean and standard deviation vectors, which provide a prior probability helping to recover the latent representation, computed as in (19). In practice, the decoder for recovering the latent representation can be summarized as a new parametric function taking both the received signal and the side information as inputs. During model training, the quantized hyperprior is replaced by its “uniformly-noised” proxy. The recovered latent representation is then fed into the synthesis transform to reconstruct the source. The whole transceiver operation diagram is illustrated in Fig. 3.

III Architecture and Implementations

In this section, we discuss details of the NTSCC architecture and key methodologies. Following the aforementioned VAE model, the optimization of NTSCC can also be formulated as a transmission RD problem, i.e., (20), where a scaling parameter controls the relation between the entropy of the latent representation and its analog channel bandwidth cost, and the digital channel capacity for transmitting the quantized hyperprior determines the digital channel bandwidth cost of the side information. The Lagrange multiplier on the total channel bandwidth term determines the trade-off between the transmission rate and the end-to-end distortion. Moreover, from the above variational modeling analysis, we find that both the conditional entropy model and the hyperprior model can be factorized. In order to use gradient descent methods to optimize the NTSCC model, we relax the quantized variables as “uniformly-noised” proxies as in NTC; therefore, the loss function for model training can be written as (21), where the quantization offsets are sampled from a uniform distribution, and the proxy entropy models are established by taking values from the parametric factorized model in (9) and the non-parametric factorized model in (11), respectively. Correspondingly, during model testing, the conditional entropy model is established by taking discrete values from the learned model, and the hyperprior entropy model is computed similarly.
III-A The Overall Architecture of NTSCC

The learned NTSCC model in (21) indicates that the pmf for predicting the entropy of the latent representation is factorized over each dimension without relying on preceding dimensions. Conditioning on the hyperprior vector typically requires transmitting that vector as side information. The whole structure corresponds to a learned forward adaptation (FA) of the density model [15]. To retain computational parallelism, in this paper we focus on the FA mode; a better-performing backward adaptation (BA) mode with higher processing latency will be discussed in future work [27].

Fig. 4: Illustration of an NTSCC codec architecture using learned FA.

Fig. 4 illustrates the overall architecture of NTSCC using learned FA. The transmitter observes the source vector, and the receiver outputs the recovered vector. The semantic latent representation tensor is obtained by applying the ANN-based analysis transform to the source, and it has a uniformly quantized counterpart. Likewise, the side information is the latent representation of the semantic latent, computed by an ANN-based hyper analysis transform, together with its uniformly quantized version. While the entropy model of the quantized side information is predetermined, the factorized entropy model of the quantized latent is assumed to be conditionally independent Gaussian with the mean tensor and the standard deviation tensor as in (22); both tensors are obtained by applying the ANN-based hyper synthesis transform to the quantized side information. The resulting entropy terms are employed to determine the channel bandwidth cost of each dimension so as to realize adaptive rate transmission. Bob begins with channel decoding (CD) and entropy decoding (ED) to recover the side information, and then uses it to decode the latent representation. Alice should know the entropy model on the side information to entropy encode (EE) and channel encode (CE) it; this model is a non-parametric conditional density as in (23). To ensure reliable transmission of the side information, the error-correction channel coding should adopt advanced capacity-approaching codes, e.g., low-density parity-check (LDPC) codes [28] or polar codes [29]. To decode the latent representation, Bob jointly uses the channel-received signal and the side information as inputs to the decoder function.

Next, we discuss the NTSCC model training procedure shown in Algorithm 1. Some tricks should be noted to ensure fast and stable training. First, before training the NTSCC model, the parameters of the nonlinear transforms should be initialized by pre-training the corresponding NTC model as in (13), where no transmission error is considered, so that NTC only executes the data compression task. During NTSCC model training, we modify the loss function derived in (21) by adding the NTC distortion terms; in this way, the convergence of NTSCC model training becomes more stable. Moreover, due to the introduction of rate adaptation in NTSCC, the channel bandwidth cost of each embedding differs; thus, multi-head ANN structures shall be used to implement the deep JSCC encoder and decoder. In each round of model training, only parts of the encoder and decoder will be updated, depending on the selected transmission rates. Details will be given later.

Input: Training data, the Lagrange multiplier on the rate term, the scaling factor from entropy to channel bandwidth cost, and the learning rate;
//model pre-training
1 Initialize parameters by pre-training the corresponding NTC model as that in [15];
2 Randomly initialize parameters ;
3 Encapsulate ;
4 for  do
5       Sample ;
       //add the NTC distortion to improve training convergence stability
6       Calculate the RD loss function ;
7       Calculate the gradients ;
8       Update ;
10 return Trained parameters.
Algorithm 1 Training the NTSCC model
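Structurally, Algorithm 1 is a standard stochastic-gradient loop. The sketch below mirrors its control flow with a stand-in one-parameter "codec" and a toy RD-shaped loss; the data, loss, and finite-difference gradient are all illustrative stand-ins, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def rd_loss(params, batch, lam=0.01):
    """Stand-in for the RD objective in (21): lam * rate-term + distortion.

    A quadratic toy replaces the real networks so the loop runs end to end.
    """
    recon = batch * params[0]                 # one-parameter "codec"
    distortion = np.mean((batch - recon) ** 2)
    rate = params[0] ** 2                     # toy stand-in for bandwidth cost
    return lam * rate + distortion

def train(data, steps=200, lr=0.1, eps=1e-5):
    params = np.array([0.0])                  # step 2: initialize parameters
    for _ in range(steps):                    # step 4: training loop
        batch = rng.choice(data, size=32)     # step 5: sample a batch
        # steps 6-7: loss and (finite-difference) gradient
        g = (rd_loss(params + eps, batch) - rd_loss(params - eps, batch)) / (2 * eps)
        params = params - lr * g              # step 8: gradient update
    return params                             # return trained parameters

data = rng.normal(size=1000)
theta = train(data)
```

In the real system the finite-difference step is replaced by backpropagation through the transforms, entropy models, and deep JSCC codec, with the NTC pre-training and added NTC distortion terms described above.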
Fig. 5: Network architectures of the nonlinear analysis and synthesis transforms.

III-B Modular Implementation Details

In this part, we present the implementation details of each module in NTSCC. As aforementioned, the proposed NTSCC model mainly consists of a nonlinear transform step and a rate-adaptive deep JSCC step. The key to implementing NTSCC is designing dynamic and efficient ANN structures that can learn patch-wise representations and use the side information provided by the hyperprior to flexibly determine the transmission bandwidth cost of each patch. To this end, the encoder function should incorporate external parameters, such as the transmission rate of each embedding, which is determined by the learned entropy model