I Introduction
Semantic communications have recently emerged as a new paradigm driving the in-depth integration of information and communication technology (ICT) advances and artificial intelligence (AI) innovations [1]. This paradigm shift calls for revolutionary theories and methodological innovations. Unlike the traditional module-stacking system design philosophy, it is high time to bridge the two main branches of Shannon theory [2], i.e., source and channel, together to boost end-to-end system capabilities towards wisdom-evolutionary and primitive-concise networks. By this means, the channel transmission process can be aware of the source semantic features. The paradigm aiming at the integrated design of source and channel codes was named joint source-channel coding (JSCC) [3], a classical topic in information theory and coding theory. However, classical JSCC schemes [3, 4, 5, 6] were based on statistical probabilities without considering source semantics. By introducing AI, JSCC is expected to evolve into a modern version. On the one hand, source coding can intelligently extract the information most valuable for human understanding in intelligent human-type communications (IHTC) and for decision making in intelligent machine-type communications (IMTC), such that the source coding rate can be efficiently reduced. On the other hand, channel coding can precisely identify the critical parts of the source-coded sequence to realize semantic-inclined unequal error protection.
As one such modern version, recent deep learning methods for realizing JSCC have stimulated significant interest in both the AI and wireless communication communities [7, 8, 9, 10, 11, 12, 13]. By using artificial neural networks (ANNs), JSCC can be carried out over analog channels, where transmitted signals are formatted as continuous-valued symbols. From Shannon’s perspective [2], deep JSCC can be viewed as a geometric mapping of a vector in the source space onto a lower-dimensional space embedded in the source space, which is then transmitted over the noisy channel. Like the landmark works of Gündüz et al.
[9, 10, 11], without loss of generality, we take the image source as a representative in this paper. Standard deep learning JSCC schemes operate on a simple principle: an image, modeled as a vector of pixel intensities $\boldsymbol{x} \in \mathbb{R}^n$, is mapped to a vector of continuous-valued channel input symbols $\boldsymbol{s} \in \mathbb{R}^k$ via an ANN-based encoding function $f_e(\cdot; \boldsymbol{\phi})$, where $\boldsymbol{\phi}$ encapsulates the parameters of the neurons in the ANN encoder. We typically have $k < n$, and $k/n$ is named the bandwidth ratio [10], denoting the coding rate. The wireless channel noise is denoted as $\boldsymbol{n}$ such that the received vector is $\hat{\boldsymbol{s}} = \boldsymbol{s} + \boldsymbol{n}$. In the case of the additive white Gaussian noise (AWGN) channel, each component of $\boldsymbol{n}$ obeys an i.i.d. Gaussian distribution with zero mean and variance $\sigma_n^2$. The decoder attempts to recover the transmitted source from the noisy vector $\hat{\boldsymbol{s}}$, i.e., $\hat{\boldsymbol{x}} = f_d(\hat{\boldsymbol{s}}; \boldsymbol{\theta})$. This deep JSCC method yields end-to-end transmission performance surpassing classical separation-based JPEG/JPEG2000/BPG source compression combined with an ideal capacity-achieving channel code family, especially for sources of small dimensions, e.g., the small CIFAR10 image data set [14]. However, one can observe that, in general, as the source dimension increases, e.g., for large-scale images, the performance of deep JSCC degrades rapidly, becoming even inferior to classical separation-based coding schemes. In addition, when the bandwidth ratio increases, existing deep JSCC cannot provide a coding gain comparable to that of classical separated coding schemes, i.e., the slope of the performance curve slows down as the bandwidth ratio or the channel signal-to-noise ratio (SNR) increases. This phenomenon stems from an inherent defect of standard deep JSCC: it cannot identify the source distribution so as to realize patch-wise variable-length transmission. For example, when the dimension of the embedding vector increases, the embeddings corresponding to simple patches saturate rapidly, leading to severe channel bandwidth waste and inferior coding gain. This saturation phenomenon is more likely to appear on large-scale images that need higher-dimensional representations.
Furthermore, current deep JSCC methods do not incorporate any hyperprior as side information, a concept widely used in modern image codecs but so far unexplored in ANN-based deep JSCC image transmission.
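For intuition, the standard deep JSCC pipeline sketched above can be mimicked with a few lines of linear algebra. This is only an illustrative sketch: the random matrix stands in for the learned ANN encoder, the pseudo-inverse for the learned decoder, and the transmit-signal scale is assumed known at the receiver.

```python
import numpy as np

rng = np.random.default_rng(0)

def awgn(s, snr_db):
    # AWGN channel: i.i.d. zero-mean Gaussian noise at the given SNR,
    # assuming unit average signal power.
    sigma = 10 ** (-snr_db / 20)
    return s + rng.normal(0.0, sigma, s.shape)

n, k = 64, 16                               # k/n = 0.25 is the bandwidth ratio
W = rng.normal(0, 1 / np.sqrt(n), (k, n))   # linear stand-in for the ANN encoder
x = rng.random(n)                           # toy "image": pixel intensities in [0, 1]
v = W @ x
s = np.sqrt(k) * v / np.linalg.norm(v)      # power-normalized channel input
s_hat = awgn(s, snr_db=10)                  # received noisy symbols
# Crude decoder: undo the normalization (scale assumed known) and pseudo-invert.
x_hat = np.linalg.pinv(W) @ (s_hat * np.linalg.norm(v) / np.sqrt(k))
```

An actual deep JSCC codec replaces both linear maps with trained CNNs and learns the power normalization end to end.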
In this paper, we aim to break the above limits by proposing a new joint source-channel coding architecture that integrates the new concept of nonlinear transform coding [15] with deep JSCC, i.e., nonlinear transform source-channel coding (NTSCC). In particular, as an emerging class of methods, nonlinear transform coding (NTC) has over the past few years become competitive with the best linear transform codecs for image compression, and outperforms them in terms of rate-distortion (RD) performance under well-established perceptual quality metrics, e.g., PSNR, MS-SSIM, LPIPS, etc. In contrast to linear transform coding (LTC) schemes, NTC can more closely adapt to the source distribution, leading to better compression performance. By integrating NTC into deep JSCC, the proposed NTSCC works on the following principle: the source vector is not directly mapped to the channel input symbols; instead, an alternative (latent) representation of the source is found first, a vector in the latent space, and deep JSCC encoding takes place on this latent representation. By introducing an entropy model on the latent space, NTSCC learns a prior as side information to approximate the distribution of each patch, which is assumed intractable in practice. Accordingly, the subsequent deep JSCC codec can select an appropriate coding scheme that optimizes the transmission RD performance for each embedding. As a result, the proposed NTSCC transmission framework can closely adapt to the source distribution and provide superior coding gain. Notably, the proposed NTSCC method can well support future semantic communications due to its content-aware ability. Specifically, the contributions of this paper can be summarized as follows.

NTSCC Architecture: We propose a new end-to-end learnable model for high-dimensional source transmission, i.e., NTSCC, that combines the advantages of NTC and deep JSCC. To the best of our knowledge, this is the first work exploiting the nonlinear transform to establish a learnable entropy model for realizing deep JSCC efficiently, where the entropy model on the latent code implicitly represents the source distribution. In particular, by using a parametric analysis transform $g_a$, the source vector $\boldsymbol{x}$ is transformed into a latent representation $\boldsymbol{y}$, i.e., $\boldsymbol{y} = g_a(\boldsymbol{x})$; then $\boldsymbol{y}$ is quantized as $\bar{\boldsymbol{y}}$ to enable the entropy model to describe regional semantic features and control the transmission rate of each patch. Different from NTC, the latent code $\boldsymbol{y}$ can be directly fed to the deep JSCC encoder to produce the analog sequence for channel transmission. On the other side, the receiver performs the inverse operations of those done by $g_a$ and the deep JSCC encoder. In our model, the nonlinear analysis transform condenses the source semantic information into a latent representation, thus driving the subsequent source-channel coding.

Adaptive Rate Transmission: To improve the coding gain of the proposed NTSCC method, we introduce a variable-length transmission mechanism for each embedding vector in the latent code. To this end, a conditional entropy model is applied to each quantized embedding to evaluate its entropy. If the learned entropy model indicates an embedding of high entropy, its corresponding deep JSCC shall be assigned a high coding rate, and vice versa. Accordingly, we develop a Transformer ANN architecture and a rate attention mechanism to realize adaptive rate allocation for deep JSCC, which can finely tune the coding rate for each embedding, thus enabling source content-aware transmission.

Hyperprior-aided Codec Refinement: In the proposed NTSCC method, with a hyperprior on the latent representation, we show that the side information about the entropy model parameters can also be viewed as a prior on the deep JSCC codewords. We exploit this hyperprior to reduce the mismatch between the latent marginal distribution of a particular source sample and the marginal of the ensemble of source data the transmission model was designed for. This refinement mechanism uses only a small number of additional bits of information, sent from encoder to decoder as signal modifications, to achieve much better deep JSCC decoding performance.

Performance Validation: We verify the performance of the proposed NTSCC method on simple example sources and test image sources. The effect of adaptive rate allocation and the gap between the empirical transmission RD performance of NTSCC and several state-of-the-art transmission schemes are assessed with the help of the exemplary “banana-shaped” two-dimensional source [15]. Furthermore, we show that for image transmission, the proposed NTSCC can achieve much better coding gain and RD performance on various established perceptual metrics such as PSNR, MS-SSIM, and LPIPS [16]. Equivalently, to achieve identical end-to-end transmission performance, the proposed NTSCC method can save more than 20% of the bandwidth cost compared to both emerging analog transmission schemes using standard deep JSCC and classical separation-based digital transmission schemes.
The remainder of this paper is organized as follows. In Section II, we first review the variational perspective on deep JSCC and NTC, and propose the variational model for NTSCC. Then, in Section III, we propose ANN architectures for realizing NTSCC, as well as key methodologies to guide the optimization of the NTSCC model. Section IV provides a direct comparison with a number of methods to quantify the performance gain of the proposed method. Finally, Section V concludes the paper and discusses future research directions on this new topic.
Notational Conventions: Throughout this paper, lowercase letters (e.g., $x$) denote scalars, and bold lowercase letters (e.g., $\boldsymbol{x}$) denote vectors. In some cases, $x_i$ denotes the $i$-th element of $\boldsymbol{x}$, which may also represent a subvector of $\boldsymbol{x}$ as described in the context. Bold uppercase letters (e.g., $\boldsymbol{X}$) denote matrices, and $\boldsymbol{I}_n$ denotes an $n$-dimensional identity matrix. $\log(\cdot)$ denotes the natural logarithm, and $\log_2(\cdot)$ denotes the logarithm to base 2. $p_x$ denotes a probability density function (pdf) with respect to the continuous-valued random variable $x$, and $P_x$ denotes a probability mass function (pmf) with respect to the discrete-valued random variable $x$. In addition, $\mathbb{E}[\cdot]$ denotes the statistical expectation operation, and $\mathbb{R}$ denotes the real number set. Finally, $\mathcal{N}(x; \mu, \sigma^2)$ denotes a Gaussian function, and $\mathcal{U}(a - u, a + u)$ stands for a uniform distribution centered on $a$ with range from $a - u$ to $a + u$.

II Variational Modeling
Consider the following lossy transmission scenario. Alice draws an $n$-dimensional vector $\boldsymbol{x}$ from the source, whose probability is given as $p_{\boldsymbol{x}}$. Alice is concerned with how to map $\boldsymbol{x}$ to a $k$-dimensional vector $\boldsymbol{s}$, where $k$ is referred to as the channel bandwidth cost. Then, Alice transmits $\boldsymbol{s}$ to Bob via a realistic communication channel, and Bob uses the received information to reconstruct an approximation $\hat{\boldsymbol{x}}$ of $\boldsymbol{x}$.
II-A Variational Modeling of Deep JSCC
As stated in the introduction, in deep JSCC [9], the source vector $\boldsymbol{x}$ is mapped to a vector of continuous-valued channel input symbols $\boldsymbol{s}$ via an ANN-based encoding function $\boldsymbol{s} = f_e(\boldsymbol{x}; \boldsymbol{\phi})$, where the encoder is usually parameterized as a convolutional neural network (CNN) with parameters $\boldsymbol{\phi}$. Then, the analog sequence $\boldsymbol{s}$ is directly sent over the communication channel. The channel introduces random corruptions to the transmitted symbols, denoted as a function $\eta(\cdot; \boldsymbol{\nu})$, where the channel parameters are encapsulated in $\boldsymbol{\nu}$. Accordingly, the received sequence is $\hat{\boldsymbol{s}} = \eta(\boldsymbol{s}; \boldsymbol{\nu})$, whose transition probability is $p_{\hat{\boldsymbol{s}}|\boldsymbol{s}}$. In this paper, we consider the most widely used AWGN channel model, such that the transfer function is $\hat{\boldsymbol{s}} = \boldsymbol{s} + \boldsymbol{n}$, where each component of the noise vector $\boldsymbol{n}$ is independently sampled from a Gaussian distribution, i.e., $\boldsymbol{n} \sim \mathcal{N}(\boldsymbol{0}, \sigma_n^2 \boldsymbol{I}_k)$, where $\sigma_n^2$ is the average noise power. Other channel models can be similarly incorporated by changing the channel transition function. The receiver comprises a parametric function $f_d(\cdot; \boldsymbol{\theta})$ to recover the corrupted signal as $\hat{\boldsymbol{x}} = f_d(\hat{\boldsymbol{s}}; \boldsymbol{\theta})$, where $f_d$ can also take the form of a CNN [9]. The whole operation is shown in the right panel of Fig. 1. The encoder and decoder functions are jointly learned to minimize the average distortion
$$\min_{\boldsymbol{\phi}, \boldsymbol{\theta}} \; \mathbb{E}_{\boldsymbol{x} \sim p_{\boldsymbol{x}}} \mathbb{E}_{\hat{\boldsymbol{s}} \sim p_{\hat{\boldsymbol{s}}|\boldsymbol{s}}} \big[ d\big(\boldsymbol{x}, f_d(\hat{\boldsymbol{s}}; \boldsymbol{\theta})\big) \big], \quad (1)$$
where $d(\cdot, \cdot)$ denotes the distortion loss function.
However, as analyzed in [17], it is more reasonable to model deep JSCC as a variational autoencoder (VAE) [18]. As shown in the left panel of Fig. 1, the noisy sequence $\hat{\boldsymbol{s}}$ can be viewed as a sample of the latent variables in the generative model. The deep JSCC decoder acts as the generative model (“generating” the reconstructed source from the latent representation) that transforms a latent variable with some predicted latent distribution into a data distribution that is unknown. The optimization objective of this generative model is (2), where the likelihood $p_{\boldsymbol{x}|\hat{\boldsymbol{s}}}$ depends on the type of loss function $d(\cdot, \cdot)$. The most widely-used squared error, i.e., $d(\boldsymbol{x}, \hat{\boldsymbol{x}}) = \|\boldsymbol{x} - \hat{\boldsymbol{x}}\|^2$, corresponds to a Gaussian likelihood, whose hyperparameter determines the power constraint of the deep JSCC system. The deep JSCC encoder combined with the channel is linked to the inference model (“inferring” the latent representation from the source data). To efficiently optimize (2), we need knowledge of the true posterior $p_{\hat{\boldsymbol{s}}|\boldsymbol{x}}$, which is indeed intractable. An alternative in variational inference is to learn a parametric variational density $q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}$ as a substitute by minimizing the expectation of their Kullback-Leibler (KL) divergence over the data distribution $p_{\boldsymbol{x}}$, which forms the end-to-end optimization objective including both the generative model (decoder) and the inference model (encoder), i.e.,
$$\mathbb{E}_{\boldsymbol{x} \sim p_{\boldsymbol{x}}} D_{\mathrm{KL}}\big( q_{\hat{\boldsymbol{s}}|\boldsymbol{x}} \,\|\, p_{\hat{\boldsymbol{s}}|\boldsymbol{x}} \big) = \mathbb{E}_{\boldsymbol{x} \sim p_{\boldsymbol{x}}} \mathbb{E}_{\hat{\boldsymbol{s}} \sim q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}} \big[ \log q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}(\hat{\boldsymbol{s}}|\boldsymbol{x}) - \log p_{\hat{\boldsymbol{s}}}(\hat{\boldsymbol{s}}) - \log p_{\boldsymbol{x}|\hat{\boldsymbol{s}}}(\boldsymbol{x}|\hat{\boldsymbol{s}}) + \log p_{\boldsymbol{x}}(\boldsymbol{x}) \big], \quad (3)$$
where the variational density $q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}$ is determined by the encoder parameters and the channel, and the likelihood $p_{\boldsymbol{x}|\hat{\boldsymbol{s}}}$ depends on the decoder and the type of loss function. We indicate that the first term in (3) equals a constant under the AWGN channel, in which case it can be derived that
$$\mathbb{E}_{\hat{\boldsymbol{s}} \sim q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}} \big[ \log q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}(\hat{\boldsymbol{s}}|\boldsymbol{x}) \big] = -\frac{k}{2} \log\big(2 \pi e \sigma_n^2\big). \quad (4)$$
Even though the above derivation assumes the AWGN channel, the result holds for any additive-noise channel, which stems from the translation invariance of differential entropy for any distribution. In this sense, we can technically drop the first term from the KL loss function. The last term can be similarly dropped. It can be concluded that the design of deep JSCC is equivalent to optimizing the parametric codec functions for transmission rate-distortion (RD) performance, rather than only optimizing the distortion function in (1).
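The translation-invariance argument can be made explicit with a short derivation (a sketch using $q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}$ for the encoder-plus-channel density and $p_{\boldsymbol{n}}$ for the noise density):

```latex
% For an additive-noise channel, \hat{s} = f_e(x) + n, so
% q(\hat{s}|x) = p_n(\hat{s} - f_e(x)). Substituting n = \hat{s} - f_e(x):
\begin{aligned}
-\mathbb{E}_{\hat{\boldsymbol{s}} \sim q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}}
  \left[\log q_{\hat{\boldsymbol{s}}|\boldsymbol{x}}(\hat{\boldsymbol{s}}|\boldsymbol{x})\right]
&= -\int p_{\boldsymbol{n}}\big(\hat{\boldsymbol{s}} - f_e(\boldsymbol{x})\big)
     \log p_{\boldsymbol{n}}\big(\hat{\boldsymbol{s}} - f_e(\boldsymbol{x})\big)\,
     \mathrm{d}\hat{\boldsymbol{s}} \\
&= -\int p_{\boldsymbol{n}}(\boldsymbol{n})
     \log p_{\boldsymbol{n}}(\boldsymbol{n})\,\mathrm{d}\boldsymbol{n}
 \;=\; h(\boldsymbol{n}),
\end{aligned}
```

i.e., the differential entropy of the noise alone, independent of the encoder output; for Gaussian noise this equals $\frac{k}{2}\log(2\pi e \sigma_n^2)$.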
II-B Variational Modeling of NTC
As indicated in the introduction, when the dimension of the representation becomes large, standard deep JSCC cannot provide sufficient coding gain. As an emerging lossy compression model [19, 20, 21, 22, 23], NTC imitates the classical source coding procedure and operates on a simple principle: a source vector $\boldsymbol{x}$ is not mapped to a codeword vector directly; rather, an alternative latent representation $\boldsymbol{y}$ of $\boldsymbol{x}$ is found first, a vector in some other space, and quantization then takes place on this latent representation, yielding the discrete-valued vector $\bar{\boldsymbol{y}}$. In standard NTC, due to the quantization step, the resulting $\bar{\boldsymbol{y}}$ can be compressed using entropy coding methods, e.g., arithmetic coding [24], to create bit streams. Since entropy coding relies on a prior probability of $\bar{\boldsymbol{y}}$, an entropy model is established in NTC to provide this side information. The encoder in NTC transforms the source image vector $\boldsymbol{x}$ using a parametric nonlinear analysis transform $g_a$ into a latent representation $\boldsymbol{y}$, which is quantized as $\bar{\boldsymbol{y}}$. The latent representation preserves the source semantic features while its dimension is usually much smaller than the source dimension. The decoder first recovers $\bar{\boldsymbol{y}}$ from the compressed signal; a parametric nonlinear synthesis transform $g_s$ is then performed on $\bar{\boldsymbol{y}}$ to recover the source image $\hat{\boldsymbol{x}}$. Here, an ideal error-free transmission is assumed such that the decoder can losslessly recover $\bar{\boldsymbol{y}}$ by entropy decoding. In NTC, $g_a$ and $g_s$ are usually parameterized as ANNs with nonlinear properties, rather than the conventional linear transforms in LTC, and their neural network parameters are encapsulated accordingly.
Since the nonlinear transform is not strictly invertible, and the quantization step introduces error, the optimization of NTC amounts to a compression RD problem [25]. Assuming efficient entropy coding is used, the rate, i.e., the expected length of the compressed sequence, equals the entropy of $\bar{\boldsymbol{y}}$, which is determined by the entropy model $P_{\bar{\boldsymbol{y}}}$ as
$$R = \mathbb{E}_{\boldsymbol{x} \sim p_{\boldsymbol{x}}} \big[ -\log_2 P_{\bar{\boldsymbol{y}}} \big( \lfloor g_a(\boldsymbol{x}) \rceil \big) \big], \quad (5)$$
where $\lfloor \cdot \rceil$ denotes the quantization function. In the context of this paper, without loss of generality, we employ uniform scalar quantization (rounding to integers). Distortion is the expected divergence between $\boldsymbol{x}$ and $\hat{\boldsymbol{x}}$. Clearly, a higher rate allows for a lower distortion, and vice versa. In order to use gradient descent methods to optimize the NTC model, Ballé et al. proposed a relaxation for addressing the zero-gradient problem caused by quantization [19]: a proxy “uniformly-noised” representation $\tilde{\boldsymbol{y}}$ replaces the quantized representation $\bar{\boldsymbol{y}}$ during model training.
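The relaxation can be checked numerically; the sketch below (illustrative only) compares hard rounding with the uniformly-noised proxy used during training.

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize(y):
    # Hard uniform scalar quantization (rounding): used at test time,
    # but its gradient is zero almost everywhere.
    return np.round(y)

def noisy_proxy(y):
    # Training-time relaxation (Ballé et al.): additive U(-1/2, 1/2) noise,
    # a smooth stand-in whose error has the same support as rounding error.
    return y + rng.uniform(-0.5, 0.5, y.shape)

y = rng.normal(0.0, 3.0, 10000)
err_round = quantize(y) - y
err_noise = noisy_proxy(y) - y
```

Both perturbations stay within half a quantization bin, which is why the noisy proxy is a faithful differentiable surrogate for rounding.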
The optimization problem of NTC can also be formulated as a VAE model, as shown in Fig. 2: a probabilistic generative model stands for the synthesis transform, and an approximate inference model corresponds to the analysis transform. As in deep JSCC, the goal of the inference model is to create a parametric variational density $q_{\tilde{\boldsymbol{y}}|\boldsymbol{x}}$ to approximate the true posterior $p_{\tilde{\boldsymbol{y}}|\boldsymbol{x}}$, which is assumed intractable, by minimizing their KL divergence over the source distribution $p_{\boldsymbol{x}}$, i.e.,
$$\mathbb{E}_{\boldsymbol{x} \sim p_{\boldsymbol{x}}} D_{\mathrm{KL}}\big( q_{\tilde{\boldsymbol{y}}|\boldsymbol{x}} \,\|\, p_{\tilde{\boldsymbol{y}}|\boldsymbol{x}} \big) = \mathbb{E}_{\boldsymbol{x} \sim p_{\boldsymbol{x}}} \mathbb{E}_{\tilde{\boldsymbol{y}} \sim q_{\tilde{\boldsymbol{y}}|\boldsymbol{x}}} \big[ \log q_{\tilde{\boldsymbol{y}}|\boldsymbol{x}}(\tilde{\boldsymbol{y}}|\boldsymbol{x}) - \log p_{\tilde{\boldsymbol{y}}}(\tilde{\boldsymbol{y}}) - \log p_{\boldsymbol{x}|\tilde{\boldsymbol{y}}}(\boldsymbol{x}|\tilde{\boldsymbol{y}}) + \log p_{\boldsymbol{x}}(\boldsymbol{x}) \big]. \quad (6)$$
The minimization of the KL divergence is equivalent to optimizing the NTC model for compression RD performance. As shown in [19], the first term in (6) computes the transition probability from the source to the proxy latent representation as
$$q_{\tilde{\boldsymbol{y}}|\boldsymbol{x}}(\tilde{\boldsymbol{y}}|\boldsymbol{x}) = \prod_i \mathcal{U}\big( \tilde{y}_i \,\big|\, y_i - \tfrac{1}{2},\, y_i + \tfrac{1}{2} \big), \quad (7)$$
where $\mathcal{U}(\tilde{y}_i \,|\, y_i - \tfrac{1}{2}, y_i + \tfrac{1}{2})$ denotes a uniform distribution centered on $y_i$. Since the width of the uniform distribution is constant, the first term is also constant and can be technically dropped. The last term can be similarly dropped. The third term, representing the log likelihood, can be modeled by measuring the squared error between $\boldsymbol{x}$ and $\hat{\boldsymbol{x}}$, weighted by a hyperparameter. The second term reflects the cross-entropy between the marginal and the prior $p_{\tilde{\boldsymbol{y}}}$. It represents the cost of encoding $\tilde{\boldsymbol{y}}$, which is constrained by the entropy model $p_{\tilde{\boldsymbol{y}}}$. In [19], Ballé et al. modeled the prior using a nonparametric fully-factorized density model as
$$p_{\tilde{\boldsymbol{y}}|\boldsymbol{\psi}}(\tilde{\boldsymbol{y}}|\boldsymbol{\psi}) = \prod_i \Big( p_{y_i|\boldsymbol{\psi}^{(i)}} * \mathcal{U}\big(-\tfrac{1}{2}, \tfrac{1}{2}\big) \Big)(\tilde{y}_i), \quad (8)$$
where $\boldsymbol{\psi}$ encapsulates all the parameters of the density model, and the convolution “$*$” with a standard uniform distribution is used to better match the prior to the marginal. This model is referred to as a factorized-prior model [19].
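A useful property of the convolved prior is that, evaluated at an integer, it equals the probability mass of the corresponding quantizer bin. The sketch below checks this with a Gaussian standing in for the nonparametric per-dimension density (an assumption for illustration):

```python
import math

def gauss_cdf(x, mu=0.0, sigma=1.0):
    # CDF of the stand-in Gaussian density.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def prior_mass(y_bar, mu=0.0, sigma=1.0):
    # Convolving a density with U(-1/2, 1/2) and evaluating at an integer
    # y_bar gives c(y_bar + 1/2) - c(y_bar - 1/2): the probability mass of
    # the quantizer bin, so the continuous prior matches the discrete pmf.
    return gauss_cdf(y_bar + 0.5, mu, sigma) - gauss_cdf(y_bar - 0.5, mu, sigma)

total = sum(prior_mass(i) for i in range(-20, 21))
```

Summing the bin masses over the integers gives a valid pmf, which is exactly what the arithmetic coder needs.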
However, in general cases, there may still exist clear spatial dependencies among the elements of the latent representation, in which case the performance of the factorized-prior model degrades. To tackle this, Ballé et al. introduced an additional set of latent variables $\tilde{\boldsymbol{z}}$ to capture the dependencies of $\tilde{\boldsymbol{y}}$, in the same way that the original source is transformed to the latent representation [20]. Here, each $\tilde{y}_i$ is variationally modeled as a Gaussian with mean $\mu_i$ and standard deviation $\sigma_i$, where the two parameters are predicted by applying a parametric synthesis transform $h_s$ on $\tilde{\boldsymbol{z}}$ as
$$p_{\tilde{\boldsymbol{y}}|\tilde{\boldsymbol{z}}}(\tilde{\boldsymbol{y}}|\tilde{\boldsymbol{z}}) = \prod_i \Big( \mathcal{N}\big(\mu_i, \sigma_i^2\big) * \mathcal{U}\big(-\tfrac{1}{2}, \tfrac{1}{2}\big) \Big)(\tilde{y}_i), \;\; \text{with} \;\; (\boldsymbol{\mu}, \boldsymbol{\sigma}) = h_s(\tilde{\boldsymbol{z}}). \quad (9)$$
The corresponding analysis transform $h_a$ is stacked on top of $g_a$, creating a joint factorized variational posterior as
$$q_{\tilde{\boldsymbol{y}}, \tilde{\boldsymbol{z}}|\boldsymbol{x}}(\tilde{\boldsymbol{y}}, \tilde{\boldsymbol{z}}|\boldsymbol{x}) = \prod_i \mathcal{U}\big( \tilde{y}_i \,\big|\, y_i - \tfrac{1}{2}, y_i + \tfrac{1}{2} \big) \prod_j \mathcal{U}\big( \tilde{z}_j \,\big|\, z_j - \tfrac{1}{2}, z_j + \tfrac{1}{2} \big). \quad (10)$$
Since we do not have prior beliefs about the hyperprior $\tilde{\boldsymbol{z}}$, it can be modeled as a nonparametric fully-factorized density like (8), i.e.,
$$p_{\tilde{\boldsymbol{z}}|\boldsymbol{\psi}_z}(\tilde{\boldsymbol{z}}|\boldsymbol{\psi}_z) = \prod_j \Big( p_{z_j|\boldsymbol{\psi}_z^{(j)}} * \mathcal{U}\big(-\tfrac{1}{2}, \tfrac{1}{2}\big) \Big)(\tilde{z}_j), \quad (11)$$
where $\boldsymbol{\psi}_z$ encapsulates all the parameters of the density model. The optimization goal (6) is accordingly changed to:
(12)  
where the third term can be viewed as the side information widely used in traditional transform coding schemes. The right panel of Fig. 2 depicts the procedure of how the model is used for data compression. Following the variational analysis, the loss function for training the NTC model is
(13)  
where the proxy latents are formed by sampling one random quantization offset per latent dimension from $\mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$.
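A minimal sketch of how such a training loss is assembled (illustrative only: the hyperprior rate term and the learned transforms are omitted, and the Gaussian entropy model parameters are supplied by hand):

```python
import math
import random

random.seed(0)

def gauss_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def rate_bits(t, mu, sigma):
    # -log2 likelihood of a uniformly-noised latent under the
    # Gaussian-convolved-uniform conditional entropy model.
    p = gauss_cdf(t + 0.5, mu, sigma) - gauss_cdf(t - 0.5, mu, sigma)
    return -math.log2(max(p, 1e-12))

def ntc_loss(y, mu, sigma, mse, lam):
    # RD training objective: rate of the noisy proxy latents plus
    # lambda-weighted distortion (the hyperprior rate term is omitted).
    y_tilde = [yi + random.uniform(-0.5, 0.5) for yi in y]
    rate = sum(rate_bits(t, m, s) for t, m, s in zip(y_tilde, mu, sigma))
    return rate + lam * mse

loss = ntc_loss([0.3, -1.2, 4.0], [0.0, 0.0, 0.0], [1.0, 1.0, 2.0], mse=0.01, lam=16.0)
```

Larger Lagrange multipliers push the optimizer toward lower distortion at the price of a higher rate, tracing out the RD curve.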
II-C Variational Modeling of the Proposed NTSCC
We integrate the advantages of both NTC and classical deep JSCC, collected under the name nonlinear transform source-channel coding (NTSCC). In the transmitter, the analysis transform of NTC is used as a type of “precoding” before deep JSCC encoding, which extracts the source semantic features as a latent representation. Deep JSCC then operates on this latent space. The bottom panel of Fig. 3 illustrates how the NTSCC model is used for data transmission. The analysis transform module subjects the input source vector $\boldsymbol{x}$ to $g_a$, yielding the latent representation $\boldsymbol{y}$ with spatially varying mean values and standard deviations. The latent code $\boldsymbol{y}$ is then fed into both the hyper analysis transform $h_a$ and the deep JSCC encoder $f_e$. On the one hand, $h_a$ summarizes the distribution of mean values and standard deviations of $\boldsymbol{y}$ in the hyperprior $\boldsymbol{z}$, which is then quantized, compressed, and transmitted as side information. The transmitter utilizes the quantized $\bar{\boldsymbol{z}}$ to estimate the mean vector $\boldsymbol{\mu}$ and the standard deviation vector $\boldsymbol{\sigma}$, and uses them to determine the bandwidth ratio for transmitting the latent representation. The receiver also utilizes $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ as side information to correct the probability estimates for recovering $\boldsymbol{y}$. On the other hand, $f_e$ encodes $\boldsymbol{y}$ as the channel-input sequence $\boldsymbol{s}$, and the received sequence is $\hat{\boldsymbol{s}}$. The optimization problem of NTSCC can also be formulated as a VAE model, as shown in Fig. 3: a probabilistic generative model stands for the deep JSCC decoder and the synthesis transform, and an approximate inference model corresponds to the analysis transform and the deep JSCC encoder. As in the above discussion, the goal of the inference model is to create a parametric variational density to approximate the true posterior, which is assumed intractable, by minimizing their KL divergence over the source distribution, as in (14). It can be observed that minimizing the KL divergence is equivalent to jointly optimizing the nonlinear transform model and the deep JSCC model for the end-to-end transmission RD performance.
(14)  
Let us take a closer look at each term of the last line in (14). First, the variational inference model computes the analysis transform and the deep JSCC encoding of the source vector $\boldsymbol{x}$, and adds the channel noise, thus:
(15)  
As discussed before, the first term in the KL divergence can be technically dropped from the loss function.
The second term is identical to the cross-entropy between the marginal and the prior on the hyperprior. It stands for the cost of encoding the side information in the inference model, assuming the hyperprior entropy model. In practice, since we do not have prior beliefs about the hyperprior, it can also be modeled as a nonparametric fully-factorized density as in (11).
The third term corresponds to the cross-entropy that denotes the transmission rate of the source message. As shown in Fig. 3, we derive it with the help of the intermediate proxy variable $\tilde{\boldsymbol{y}}$, i.e., the latent representation of $\boldsymbol{x}$. Each element $y_i$ is variationally modeled as a Gaussian with mean $\mu_i$ and standard deviation $\sigma_i$, where the two parameters are predicted by applying a parametric synthesis transform $h_s$ on the hyperprior as
$$p_{\boldsymbol{y}|\tilde{\boldsymbol{z}}}(\boldsymbol{y}|\tilde{\boldsymbol{z}}) = \prod_i \mathcal{N}\big( y_i ;\, \mu_i, \sigma_i^2 \big), \;\; \text{with} \;\; (\boldsymbol{\mu}, \boldsymbol{\sigma}) = h_s(\tilde{\boldsymbol{z}}). \quad (16)$$
The density $p_{\boldsymbol{y}}$ can be transformed into a new density $p_{\boldsymbol{s}}$ by using the deep JSCC encoder function $f_e$, employing the formula for the distribution of a function of a random variable [26]. Under the AWGN channel, the received signal is $\hat{\boldsymbol{s}} = \boldsymbol{s} + \boldsymbol{n}$ with $\boldsymbol{n} \sim \mathcal{N}(\boldsymbol{0}, \sigma_n^2 \boldsymbol{I}_k)$. We can thus calculate the density $p_{\hat{\boldsymbol{s}}}$ as
$$p_{\hat{\boldsymbol{s}}}(\hat{\boldsymbol{s}}) = \big( p_{\boldsymbol{s}} * \mathcal{N}(\boldsymbol{0}, \sigma_n^2 \boldsymbol{I}_k) \big)(\hat{\boldsymbol{s}}), \quad (17)$$
where “$*$” denotes the convolution operation. Since the latent representation is directly fed into the deep JSCC encoder without quantization, the rate derived from (17) indeed represents a differential entropy, as opposed to the discrete entropy used for the rate constraint in NTC. Note that $\hat{\boldsymbol{s}}$ is originally determined by $\boldsymbol{y}$; thus, to ensure stable model training, as in NTC, we employ a proxy “uniformly-noised” representation $\tilde{\boldsymbol{y}}$, variationally modeled as in (9), to serve as the transmission rate constraint term in practical implementations of NTSCC, which is marked with dashed lines in the inference model of Fig. 3. During model testing, the conditional entropy model is established by taking discrete values from the learned entropy model, i.e., evaluating it at $\bar{\boldsymbol{y}} = \lfloor \boldsymbol{y} \rceil$ and $\bar{\boldsymbol{z}} = \lfloor \boldsymbol{z} \rceil$. Therefore, the transmission rate is constrained proportionally to the resulting discrete entropy.
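The smoothing effect of the channel on the transmitted density, as in (17), can be checked numerically in a toy scalar case (a standard-normal transmitted density is assumed purely for illustration):

```python
import numpy as np

# The density of the received symbol s_hat = s + n is the channel-noise-
# smoothed density p_s * N(0, sigma_n^2); here we form it on a grid.
grid = np.linspace(-10.0, 10.0, 2001)
dx = grid[1] - grid[0]
sigma_n = 0.5
p_s = np.exp(-grid**2 / 2.0) / np.sqrt(2.0 * np.pi)                    # stand-in p_s
p_n = np.exp(-grid**2 / (2.0 * sigma_n**2)) / np.sqrt(2.0 * np.pi * sigma_n**2)
p_s_hat = np.convolve(p_s, p_n, mode="same") * dx                      # (17), numerically

mass = np.sum(p_s_hat) * dx          # should stay ~1: still a valid density
var = np.sum(grid**2 * p_s_hat) * dx # should be ~1 + sigma_n^2
```

The convolved density remains normalized and its variance grows by exactly the noise power, consistent with the convolution form of (17).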
The fourth term represents the log likelihood of recovering the source via the synthesis transform, which is fed by the output of the deep JSCC decoder. Here, the change of the fourth term from (b) to (c) in (14) follows Jensen's inequality, which yields an upper bound on the KL divergence. The densities can be assumed as
(18) 
and
(19)  
which measure the squared difference. Different from conventional schemes, (19) indicates that the deep JSCC decoder can use both the received signal and the side information to estimate the latent representation.
From the above analysis, we also summarize the operations at the receiver. It first recovers the hyperprior from the transmitted side information. It then exploits the hyperprior to recover the mean and standard deviation vectors, which provide a prior probability helping to recover the latent representation as computed in (19). In practice, the decoder for recovering the latent representation can be summarized as a new parametric function $f_d$ taking both the received signal and the side information as inputs, where $\boldsymbol{\theta}$ encapsulates the ANN parameters constituting $f_d$. During model training, the quantized latent is replaced by the “uniformly-noised” proxy to generate the decoder input. The recovered latent is then fed into the synthesis transform $g_s$ to reconstruct the source. The whole transceiver operation diagram is illustrated in Fig. 3.
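The end-to-end NTSCC flow described above can be summarized in a toy numerical sketch. All learned components (the analysis transform, the deep JSCC encoder/decoder, and the synthesis transform) are replaced by linear stand-ins, and the hyperprior coding path is omitted; this is only an illustration of the data flow, not the proposed system:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 64, 16, 8                       # source, latent, and channel dims
Ga = rng.normal(0, 1 / np.sqrt(n), (m, n))  # stand-in for analysis transform
Fe = rng.normal(0, 1 / np.sqrt(m), (k, m))  # stand-in for deep JSCC encoder

x = rng.random(n)
y = Ga @ x                                # latent representation
y_bar = np.round(y)                       # quantized copy feeds the entropy model only
s = Fe @ y                                # un-quantized latent feeds the JSCC encoder
s_hat = s + rng.normal(0, 0.1, k)         # AWGN channel
# Receiver: pseudo-inverses stand in for the JSCC decoder and synthesis transform.
y_hat = np.linalg.pinv(Fe) @ s_hat
x_hat = np.linalg.pinv(Ga) @ y_hat
```

The quantized copy of the latent feeds only the entropy model, while the un-quantized latent is what the deep JSCC encoder transmits, mirroring the split shown in Fig. 3.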
III Architecture and Implementations
In this section, we discuss details of the NTSCC architecture and key methodologies. Following the aforementioned VAE model, the optimization of NTSCC can also be cast as a transmission RD optimization problem, i.e.,
(20)  
where one parameter controls the scaling relation between the entropy of the latent representation and its analog transmission channel bandwidth cost, and the digital channel capacity for transmitting the quantized hyperprior determines the digital channel bandwidth cost of the side information. The Lagrange multiplier on the total transmission channel bandwidth term determines the trade-off between the transmission rate and the end-to-end distortion. Moreover, from the above variational modeling analysis, we find that both the conditional entropy model and the hyperprior model can be factorized. In order to use gradient descent methods to optimize the NTSCC model, we relax the quantized variables into “uniformly-noised” proxies as in NTC; therefore, the loss function for model training can be written as
(21)  
where the quantization offsets are sampled from the uniform distribution $\mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})$, and the entropy models are established by taking values from the parametric factorized model in (9) and the nonparametric factorized model in (11), respectively, as
(22)  
and
(23) 
Correspondingly, during model testing, the conditional entropy model is established by taking discrete values from the learned model as
(24) 
and the hyperprior entropy model is similarly computed as
(25) 
III-A The Overall Architecture of NTSCC
The learned NTSCC model in (21) indicates that the pmf used to predict the entropy of the latent representation is factorized over dimensions without relying on preceding dimensions. Conditioning on the hyperprior vector typically requires transmitting that vector as side information. The whole structure corresponds to a learned forward adaptation (FA) of the density model [15]. To seek computational parallelism, in this paper we focus on the FA mode; a better-performing backward adaptation (BA) mode with higher processing latency will be discussed in future work [27].
Fig. 4 illustrates the overall architecture of NTSCC using learned FA. $\boldsymbol{x}$ is the source vector at the transmitter, and $\hat{\boldsymbol{x}}$ denotes the recovered vector at the receiver. $\boldsymbol{y}$ is the semantic latent representation tensor obtained by performing the ANN-based transform function $g_a$ on $\boldsymbol{x}$, and $\bar{\boldsymbol{y}}$ is its uniformly quantized counterpart. Also, $\boldsymbol{z}$ is the latent representation of $\boldsymbol{y}$ computed using the ANN-based transform $h_a$, denoting the side information, whose uniformly quantized version is $\bar{\boldsymbol{z}}$. While the entropy model of $\bar{\boldsymbol{z}}$ is predetermined, the factorized entropy model of $\bar{\boldsymbol{y}}$ is assumed to be conditionally independent Gaussian with mean tensor $\boldsymbol{\mu}$ and standard deviation tensor $\boldsymbol{\sigma}$ as in (22). Both tensors are obtained by performing the ANN-based function $h_s$ on $\bar{\boldsymbol{z}}$. The resulting entropy terms are employed to determine the channel bandwidth cost of each dimension so as to realize adaptive rate transmission. Bob begins with channel decoding (CD) and entropy decoding (ED) to recover the side information $\bar{\boldsymbol{z}}$, and then uses it to decode the latent representation. Alice should know the entropy model on $\bar{\boldsymbol{z}}$ to entropy encode (EE) and channel encode (CE) it, which is modeled as a nonparametric density conditioned on learned parameters as in (23). To ensure reliable transmission of the side information, the error-correction channel coding should adopt advanced capacity-approaching codes, e.g., low-density parity-check (LDPC) codes [28] or polar codes [29]. To decode the latent representation, Bob jointly uses the channel-received sequence and the side information as inputs to the decoder function $f_d$. Next, we discuss the procedure of NTSCC model training, as shown in Algorithm 1. Here, some tricks should be noted to ensure fast and stable training. First, before training the NTSCC model, the parameters of $g_a$, $g_s$, $h_a$, and $h_s$ should be initialized by pre-training the corresponding NTC model as in (13), where no transmission error is considered such that the NTC model only executes the data compression task. During NTSCC model training, we modify the loss function derived in (21) by adding the NTC distortion terms; in this way, the convergence of NTSCC model training becomes more stable. Moreover, due to the introduction of rate adaptation in NTSCC, the channel bandwidth cost of each embedding differs; thus, multi-head ANN structures shall be used to implement the deep JSCC codec functions $f_e$ and $f_d$. In each round of model training, only parts of $f_e$ and $f_d$ will be updated, depending on the selected transmission rates. Details will be given later.
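The rate-adaptation rule can be sketched as a simple lookup from estimated entropy to a channel-symbol budget (the proportionality factor and the budget bounds below are illustrative choices, not values from the paper):

```python
import math

def allocate_bandwidth(entropies_bits, eta=0.2, k_min=1, k_max=32):
    # Rate-adaptive deep JSCC: give each embedding a channel-symbol budget
    # proportional to its estimated entropy, clipped to a feasible range.
    # eta, k_min, k_max are hypothetical design parameters for illustration.
    return [min(k_max, max(k_min, math.ceil(eta * h))) for h in entropies_bits]

# High-entropy (complex) patches get more channel uses than simple ones.
ks = allocate_bandwidth([40.0, 8.0, 2.0])
```

In the actual architecture, this per-embedding selection drives which heads of the multi-head deep JSCC codec are active in each training round.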
III-B Modular Implementation Details
In this part, we present the implementation details of each module in NTSCC. As aforementioned, the proposed NTSCC model mainly consists of a nonlinear transform step and a rate-adaptive deep JSCC step. The key to NTSCC implementation is designing dynamic and efficient ANN structures that can learn patch-wise representations and use the side information provided by the hyperprior to flexibly determine the transmission bandwidth cost of each patch. To this end, the encoder function should incorporate external parameters such as the transmission rate of each embedding, which is determined by the learned entropy model.