I. Introduction
Digital communication systems have been conceptualized, designed, and optimized for the main design goal of reliably transmitting bits over noisy communication channels. Shannon’s channel coding theory provides the fundamental limit on the rate of this reliable communication, with the crucial design choice that the system design remains oblivious to the underlying meaning of the messages to be conveyed or how they will be utilized at their destination. It is this disassociation from content that allows Shannon’s approach to abstract the “engineering” problem, which is to replicate a digital sequence generated at one point, asymptotically error-free, at another. While this approach has been tremendously successful in systems where communicating, e.g., voice and data, is the main objective, many emerging applications, from autonomous driving to healthcare and the Internet of Everything, will involve connecting machines that execute (sometimes human-like) tasks. In such applications, the goal is often not to reconstruct the underlying message exactly, as in pure communication applications, but to enable the destination to make the right inference or to take the right decision and action at the right time and within the right context. Similarly, human-machine interactions will be an important component, where humans will simultaneously interact with multiple devices using text, speech, or image commands, leading to the need for similar interaction capabilities on the devices. These applications motivate the development of “semantic” and “task-oriented” communication systems. Recent advances in artificial intelligence technologies and their applications have boosted the interest in and the potential of semantic and task-oriented communications, particularly within the context of future wireless communication systems, such as 6G.
Integrating semantics into system design, however, is still in its infancy, with ample open problems ranging from foundational formulations to the development of practical systems. This article provides an introduction to the tools and advancements to date in semantic and task-oriented communications, with the goal of providing a comprehensive tutorial for communication theorists and practitioners.
I-A Motivation for Semantic and Task-Oriented Communications
There has been growing interest in semantic and goal-oriented communication systems in recent years. This interest is mainly driven by new verticals that are foreseen to dominate the data traffic in future communication networks. Current communication networks are designed to serve data packets in a reliable and efficient manner, without paying attention to the contents of these packets or the impact they would have at the receiver side. However, there is a growing understanding that many of the emerging applications can benefit from going beyond the current paradigm that completely separates the design of the communication network from the source and destination of the information that flows through the network. The reason behind this trend is twofold: First, the success of many of the emerging applications, such as autonomous driving or smart city/home/factory, relies on massive datasets that enable training large models for various tasks. Hence, supporting and enabling such applications will require carrying significant amounts of traffic due to the transmission of these massive datasets and large models, which can potentially saturate the network capacity. For example, each self-driving car collects terabytes of data each day from its many sensors, including radar, LIDAR, cameras, and ultrasonic sensors. Such data is often collected by the manufacturers to test and improve their models, generating a huge amount of traffic. Hence, the communication infrastructure cannot simply be an ignorant bit pipe that enables intelligence at the higher layers, but must incorporate intelligence itself to make sure only the required traffic is transmitted at the necessary time and speed [1]. This will require a data-aware communication network design with the intelligence required to understand the relevance, urgency, and meaning of the data traffic in conjunction with the underlying task.
Second, unlike the current content-delivery type traffic, most of the emerging applications require extremely low end-to-end latency. For example, when delivering a video signal to a human, even in the case of live streaming, a certain level of delay is acceptable. However, when this video signal is intended to be used by an autonomous vehicle to detect and avoid potential obstacles or pedestrians on the street, even small delays may not be acceptable. On the other hand, carrying out such a task does not require the vehicle to receive the video sequence with the highest fidelity that is often considered when serving a human receiver. The vehicle is only interested in the content that is relevant for the task at hand. Again, it is essential to incorporate intelligence into the communication system design to extract and deliver the task-relevant information in the fastest and most reliable manner.
I-B What is Semantics? A Historical Perspective
Semantics has been studied for centuries in the context of different disciplines, such as philosophy, linguistics, and cognitive sciences, to name a few. It is a highly complex and, to some extent, controversial topic, which is very difficult, if not impossible, to define in a concise manner that would be widely acceptable. Very briefly, semantics can be defined as the study of “meaning”, but, maybe not surprisingly, it means different things in different areas. It is closely connected to semiotics, the study of signs. All communication systems are built upon signs, and a language can be broadly defined as a system of signs and rules. The rules applied to signs are divided into three categories [2]: 1) Syntax: studies signs and their relations to other signs; 2) Semantics: studies signs and their relation to the world; and 3) Pragmatics: studies signs and their relations to users.
In this classification, syntax is only concerned with the signs and their relations, and according to Cherry, “treats language as a calculus” [3]. Semantics is built upon syntax, and its main goal is to understand the relations between signs and the objects to which they apply, the designata. According to Chomsky, syntax is independent of semantics [4]. Through his now famous sentence “Colorless green ideas sleep furiously,” Chomsky argued that it is possible to construct grammatically consistent but semantically meaningless phrases, hence the separation between syntax and semantics. Pragmatics, on the other hand, is the most general of the three, and considers the context of communication; that is, it takes into account all the personal and psychological factors (in human communications) when considering the impact of a sign on the designata.
Possibly inspired by the above classification, in his accompanying article to Shannon’s Mathematical Theory of Communication in [5], Weaver identified three levels of communication problems, which correspond to the three categories above. According to Weaver, Level A deals with the technical problem, and tries to answer the question “How accurately can the symbols of communication be transmitted?”. Level B instead deals with the semantic problem, which asks “How precisely do the transmitted symbols convey the desired meaning?”. Finally, the third level, Level C, corresponds to the effectiveness problem, and asks “How effectively does the received meaning affect conduct in the desired way?”. Similarly to syntax in semiotics, the engineering problem can be considered as the syntax in communications, dealing only with the signs used in communication systems, their relations and how they are transmitted over a communication channel.
Similarly to Chomsky’s strict separation between syntax and semantics, Shannon’s information theory deals exclusively with the engineering problem, ignoring the meaning of the symbols transmitted. Indeed, in his seminal paper [6], Shannon explicitly states the following: “The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. The significant aspect is that the actual message is one selected from a set of possible messages. The system must be designed to operate for each possible selection, not just the one which will actually be chosen since this is unknown at the time of design.”
Despite Shannon’s clear indication, many researchers at the time were excited about using Shannon’s statistical theory to explain or measure semantics. For example, in [7], Wiener wrote: “The amount of meaning can be measured. It turns out that the less probable a message is, the more meaning it carries, which is entirely reasonable from the standpoint of common sense.” Weaver, on the other hand, although explicitly excluding semantics from Shannon’s theory, still argues that it has implications for the Level B and Level C problems. He writes in [5]: “[Shannon’s theory] although ostensibly applicable only to Level A problems, actually is helpful and suggestive for the level B and C problems.” And he later adds: “Thus when one moves to levels B and C, it may prove to be essential to take account of the statistical characteristics of the destination. One can imagine, as an addition to the diagram, another box labeled “Semantic Receiver” interposed between the engineering receiver (which changes signals to messages) and the destination. This semantic receiver subjects the message to a second decoding, the demand on this one being that it must match the statistical semantic characteristics of the message to the statistical semantic capacities of the totality of receivers, or of that subset of receivers which constitute the audience one wishes to affect.” Not everybody was of the same opinion as Wiener and Weaver regarding the potential application of Shannon’s information theory to semantics. In [8], Bar-Hillel and Carnap argue that the statistical theory of information conceived by Shannon cannot be applied to study semantics. They also express their dissatisfaction with such attempts: “Unfortunately, however, it often turned out that impatient scientists in various fields applied the terminology and the theorems of Communication Theory to fields in which the term “information” was used, presystematically, in a semantic sense, that is, one involving contents or designata of symbols, or even in a pragmatic sense, that is, one involving the users of these symbols.”
Based on Shannon’s explicit statement excluding semantics from his theory, and the ensuing discussion by Carnap and Bar-Hillel, many authors today argue that classical Shannon theory cannot handle the semantics-related aspects of communication systems. In Shannon’s channel coding theorem, the goal is to convey the maximum number of bits through a communication channel in a reliable manner, where “reliable” means that the transmitted bit sequence must be reconstructed at the receiver with an arbitrarily low probability of error. Here, each bit is assumed to be equally likely, and what the receiver does with these bits is not relevant to the channel capacity. But not all information sources generate sequences of equally likely bits. This is obviously not the case for text in any language. Shannon also looked at such information sources, and showed in his source coding theorem [6] that any information source can be compressed into equally likely messages at any rate above the entropy of the information source (assuming the source generates independent and identically distributed symbols). Even though Shannon explicitly studied the entropy of the English language as an example [9], like his channel coding theory, his source coding theory does not deal with the meaning of the words. From the point of view of Shannon’s theory, an information source generating messages from the set {I love you, I miss you, I can’t stand you} is the same as one generating messages from the set {1, 2, 3}, as long as the messages come from the same probability distribution.
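This distribution-only view is easy to verify numerically. The minimal sketch below computes the Shannon entropy of the two example sources; since entropy depends only on the probabilities (assumed here to be 0.5, 0.3, 0.2 purely for illustration), both sources yield the same value regardless of what the messages mean:

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two sources with the same distribution but very different message sets.
p = [0.5, 0.3, 0.2]
messages_a = ["I love you", "I miss you", "I can't stand you"]
messages_b = [1, 2, 3]

# Entropy depends only on the probabilities, not on the message contents.
print(round(entropy(p), 4))  # 1.4855 bits for both sources
```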
On the other hand, while Shannon’s theory did not deal with the meaning of these messages, it did not ignore the possibility of imperfect reconstructions. Even in his seminal work [6], which mostly focused on the reliable transmission of sources, Shannon highlighted that the exact transmission of continuous sources would require a channel of infinite capacity, but that they can be delivered within a certain fidelity criterion. He laid down the basic ideas of a rate-distortion theory in [6], although the theory was developed more rigorously only in his later work in 1959 [10]. In [6], he mentioned various fidelity measures that can be considered when transmitting continuous signals, including mean squared error, frequency-weighted mean squared error, and absolute error. Shannon then states: “The structure of the ear and brain determine implicitly an evaluation, or rather a number of evaluations, appropriate in the case of speech or music transmission. There is, for example, an ‘intelligibility’ criterion in which $\rho(x, y)$ is equal to the relative frequency of incorrectly interpreted words when message $x(t)$ is received as $y(t)$. Although we cannot give an explicit representation of $\rho(x, y)$ in these cases it could, in principle, be determined by sufficient experimentation. Some of its properties follow from well-known experimental results in hearing, e.g., the ear is relatively insensitive to phase and the sensitivity to amplitude and frequency is roughly logarithmic.” Here, Shannon uses the term “evaluation” to refer to different fidelity measures. He clearly makes a reference to reconstruction measures that go beyond recovering a sequence of bits, and allow a certain level of reconstruction error as long as it is within the intelligibility of the receiver, e.g., the ear or the brain. One can further argue that Shannon already hints towards a data-driven evaluation of the fidelity of a reconstruction, in accordance with the machine learning approaches widely used today.
In this paper, our goal is to provide a comprehensive overview of basic Shannon-theoretic concepts in relation to semantic and task-oriented communications. While we will also provide a summary of the semantic information theory of Bar-Hillel and Carnap and some of its subsequent extensions, resolving the discussion regarding the applicability of Shannon’s statistical information theory to the study of semantics, particularly in the context of linguistics, is out of the scope of this paper. Indeed, our main argument is that if we are given a certain distortion measure on the pairs of transmitted and reconstructed messages, the problem falls into the realm of information and coding theory. On the other hand, in most cases, particularly for natural languages, characterizing such a distortion measure can be extremely difficult, if not impossible. Modern machine learning techniques, particularly those in the area of natural language processing (NLP), can provide such distortion measures, which are becoming increasingly accurate and useful, at least from an engineering standpoint. Therefore, our interpretation and application of semantics to communication systems is closer to Weaver’s, where semantics is simply a fidelity measure imposed by the underlying information source, captured by the concept of “semantic noise” [5].
I-C Relevant Surveys and Our Contributions
Given the increasing interest in semantic and task-oriented communications, it is not surprising that there are already quite a few surveys and tutorial articles focusing on this general topic. We discuss them briefly and then list our main contributions. The authors in [11] introduced three kinds of semantic communications: human-to-human, human-to-machine, and machine-to-machine communications. The key performance indicators and system designs for semantic learning mechanisms over future wireless networks were further pointed out in [12]. Goal-oriented signal processing was investigated in [13], including a graph-based semantic language and the representation of semantic information. In addition, architectures of semantic communications for artificial intelligence-assisted wireless networks were investigated in [14, 15, 16]. In [17], the technical contents and application scenarios of intelligent and efficient semantic communication network design were discussed. In [18], the authors discuss the principles and challenges of semantic communications enabled by deep learning. Compared with the above works [11, 12, 13, 14, 15, 16, 17, 18], the main focus of this paper is to provide a comprehensive introduction to semantic and task-oriented communications from an information-theoretic viewpoint. In other words, it is our intention to ground everything discussed in this paper in relevant information-theoretic principles. In addition to providing a comprehensive survey of semantic and task-oriented communication systems, the main ingredients of our paper are listed below.

We point out the differences among existing definitions of semantic entropy. We further introduce how knowledge graph based semantics can be applied to and benefit a wide variety of common tasks.

We provide the basic information-theoretic concepts for semantic and task-oriented communications. For instance, we show how rate-distortion theory can capture a semantic distortion measure, and explore the connection between the information bottleneck and goal-oriented compression. To convey class information to the receiver, rate-limited remote inference theory is discussed.

Machine learning techniques for semantic and task-oriented communications are discussed separately for two phases, i.e., the training phase and the prediction phase. Various approaches to the remote inference and remote training problems are presented.

We introduce the information-theoretic concepts for semantic and task-oriented transmission over noisy channels from the viewpoint of joint source-channel coding (JSCC). Some practical designs for goal-oriented semantic communication over noisy channels are further provided for text, speech, and image sources.

We discuss the idea of “timing as semantics”, where the relevance or value of information lies in its timing. Connections of this idea with the related concept of the age of information (AoI) are explored by discussing a general real-time remote tracking or reconstruction problem from the rate-distortion viewpoint, such that the semantic information is contained in the timing of the source samples.

We present an effective communication framework, corresponding to the Level C communication problem put forth by Weaver. This results in a context-dependent communication paradigm, where the same message may have a different effect on the receiver depending on the context.
The paper is organized as follows. We first discuss the idea of a semantic information theory in Section II. Then, in Section III, we provide an overview of the information-theoretic foundations of semantic and task-oriented communications. Relevant machine learning techniques for semantic and task-oriented communications are introduced in detail in Section IV, and the joint source-channel coding approach is presented in Section V. As an instance of task-oriented communications, an important class of problems, where the metric relates to the freshness of information along with its connections to semantics, is discussed in Section VI. Finally, Section VII presents conclusions and future directions.
II. A Semantic Information Theory
To extend the engineering approach Shannon proposed, a number of researchers started to work on a theory of semantic communications soon after Shannon’s work. The concept of semantic entropy has been foreseen to play a significant role in developing a framework that considers semantics. This notion is used to quantify the semantic information of the source. To date, a number of definitions of semantic entropy have been proposed from different perspectives; a commonly agreed-upon notion, however, is yet to emerge.
II-A Semantic Entropy
In 1952, Carnap and Bar-Hillel [19] first explicated the concept of semantic entropy of a sentence $s$ within a given language system, and provided a way to measure it as follows:

$H(s, e) = -\log_2 c(s, e),$  (1)

where $c(s, e)$ is the degree of confirmation of sentence $s$ on the evidence $e$, given by

$c(s, e) = \frac{m(s \wedge e)}{m(e)}.$  (2)

Here, $m(s \wedge e)$ and $m(e)$ represent the logical probability of $s \wedge e$ and that of $e$, respectively.
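As a toy numerical sketch of the Carnap–Bar-Hillel measure, the semantic information of a sentence can be computed from the ratio of logical probabilities; the probability values below are hypothetical, chosen only for illustration:

```python
import math

def degree_of_confirmation(m_s_and_e, m_e):
    """c(s, e): the logical probability of s ∧ e divided by that of e."""
    return m_s_and_e / m_e

def semantic_information(m_s_and_e, m_e):
    """Carnap/Bar-Hillel semantic information in bits: -log2 c(s, e).
    The less a sentence is confirmed by the evidence, the more
    semantic information it carries."""
    return -math.log2(degree_of_confirmation(m_s_and_e, m_e))

# Hypothetical logical probabilities: evidence e holds in half of all
# state descriptions, and s ∧ e holds in a quarter of them.
print(semantic_information(0.25, 0.5))  # 1.0 bit
```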
Different from the logic-based definition, Venhuizen et al. [20] derived the semantic entropy based on a language comprehension model, in terms of the structure of the world instead of the probabilistic structure of the language, which can be expressed as
(3) 
where
is the set of meaning vectors that identify unique models in
, is the set of models and reflects the probabilistic structure of the world, and is the conditional probability of given . This comprehension-centric notion of semantic entropy depends on both linguistic experience and world knowledge, and quantifies the uncertainty with respect to the whole meaning space. Apart from language systems, the semantic entropy of intelligent tasks has also been studied. Melamed [21] proposed an information-theoretic method for measuring the semantic entropy in translation tasks by using the translational distributions of words in parallel text corpora. The semantic entropy of each word is given by
(4) 
where is the translational inconsistency of a source word , represents the set of target words, denotes the contribution of null links of , and is the frequency of . Additionally, for classification tasks, Liu et al. [22] defined the semantic entropy by introducing the membership degree in axiomatic fuzzy set theory. Let denote the membership degree of the data sample . The authors first obtained the matching degree, which characterizes the semantic entropy for data samples in class on semantic concept , as
(5) 
where is the set of data for class , and is the data set of all classes. According to the matching degree, the semantic entropy of class on is defined as
(6) 
Further, the semantic entropy of concept on can be obtained by
(7) 
Based on this definition, the optimal semantic descriptions of each class can be obtained and the uncertainty in designing the classifier is minimized.
In contrast to the aforementioned definitions that are specific to a single task, Chattopadhyay et al. [23] explore an information-theoretic framework to quantify semantic information for any task and any type of source. They define the semantic entropy as the minimum number of semantic queries about the data whose answers are sufficient to predict the task, which can be expressed as
(8) 
where denotes the query vector extracted from with the semantic encoder . From (8), in order to obtain the semantic entropy, one needs to find the optimal semantic encoder, which has the ability to encode the data into the minimal representation that can accurately predict the task. Currently, many methods have been utilized for measuring the semantic entropy, such as semantic information pursuit and variational inference. This direction is still in its early stages and needs to be further investigated.
In summary, there are significant differences among existing definitions of semantic entropy as each of them is based on the properties of its own concerned task. Although the last definition can be extended to different tasks, finding the optimal semantic encoder is as challenging as obtaining the semantic entropy. Hence, a unifying definition (as in the case of Shannon entropy) does not exist for semantic entropy, and most of these definitions lack the operational relevance Shannon’s entropy enjoys in a large number of engineering problems.
II-B Knowledge Graph for Semantic Communications
Fig. 1 presents an example of a knowledge graph, which represents a network of real-world entities, i.e., objects, events, situations, or concepts, and illustrates the relationships between them. This information is usually stored in a graph database and visualized as a graph structure. A knowledge graph is made up of three main components: nodes, edges, and labels. Any object, place, or person can be a node, and an edge defines the relationship between two nodes. Knowledge graph embedding embeds the components of a knowledge graph, including entities and relations, into continuous vector spaces, so as to simplify manipulation while preserving the inherent structure of the knowledge graph [24]. It can benefit a variety of downstream tasks, and has hence quickly gained massive attention. In the following, we first review semantic matching models for knowledge graphs. Then, we introduce how knowledge graph based semantics can be applied to and benefit a wide variety of downstream tasks, such as data integration, recommendation systems, and so forth. Subsequently, we present analyses and frameworks of knowledge graph based semantics and its applications in semantic communication systems.
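As a hedged sketch of these ideas (with hypothetical entity names and randomly initialized, untrained embeddings, purely for illustration), the following shows a knowledge graph stored as triples and scored with the two styles found in the embedding literature: a translational-distance score and a similarity-based bilinear (semantic matching) score:

```python
import numpy as np

# A tiny knowledge graph given as (head, relation, tail) triples.
triples = [("Paris", "capital_of", "France"),
           ("Berlin", "capital_of", "Germany")]

entities = {e: i for i, e in enumerate(sorted({h for h, _, _ in triples} |
                                              {t for _, _, t in triples}))}
relations = {"capital_of": 0}

# Randomly initialized (untrained) embeddings.
rng = np.random.default_rng(0)
dim = 8
E = rng.normal(size=(len(entities), dim))        # entity vectors
R = rng.normal(size=(len(relations), dim))       # relation vectors (translational)
M = rng.normal(size=(len(relations), dim, dim))  # relation matrices (bilinear)

def transe_score(h, r, t):
    """Translational-distance style score: plausible facts should satisfy
    head + relation ≈ tail, so a smaller distance (larger score) is better."""
    return -float(np.linalg.norm(E[entities[h]] + R[relations[r]] - E[entities[t]]))

def bilinear_score(h, r, t):
    """Semantic-matching style score: a bilinear form h^T M_r t that matches
    the latent semantics of the two entities under the relation."""
    return float(E[entities[h]] @ M[relations[r]] @ E[entities[t]])

print(transe_score("Paris", "capital_of", "France"))
print(bilinear_score("Paris", "capital_of", "France"))
```

Training would adjust E, R, and M so that observed triples score higher than corrupted ones; the sketch only illustrates the data structure and the two scoring families.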
II-B1 Semantic Matching Models for Knowledge Graphs
Knowledge graph embedding techniques can be roughly categorized into two groups: translational distance models and semantic matching models. The former use distance-based scoring functions, while the latter use similarity-based ones. In the following, we introduce the semantic matching models. In particular, they exploit similarity-based scoring functions and measure the plausibility of facts by matching the latent semantics of entities and relations embodied in their vector space representations. The authors in [25] associate each entity with a vector to capture its latent semantics, while each relation is represented as a matrix that models pairwise interactions between latent factors. The authors in [26] propose the semantic matching energy method, which conducts semantic matching using neural network architectures. It first projects entities and relations to their vector embeddings in the input layer; the relation is then combined with the entity to obtain the score of a fact. Furthermore, the neural association model has been developed in [27] to conduct semantic matching with a deep architecture.
II-B2 Applications of Knowledge Graph Based Semantics
Knowledge graph embedding is a key technology for solving problems in knowledge graphs. The authors in [28] propose a novel knowledge graph embedding method, which translates and transmits multidirectional semantics: (i) the semantics of head/tail entities and relations to tail/head entities with nonlinear functions, and (ii) the semantics from entities to relations with linear bias vectors. Knowledge graph based semantics has also been employed in data integration [Integration], recommendation systems [29], and real-time ranking [30, 31, 32]. In [Integration], the authors devise a semantic data integration approach that exploits the keyword and structured search capabilities of Web data sources. The resulting knowledge graphs model the semantics, or meaning, of merged data in terms of entities that satisfy keyword queries, and the relationships among those entities. The approach semantically describes the collected entities, and relies on semantic similarity measures to decide on the relatedness of entities that should be merged. The authors in [29] incorporate both word-oriented and entity-oriented knowledge graphs to enhance data representations, and adopt mutual information maximization to align the word-level and entity-level semantic spaces. In [30], a novel kind of knowledge representation and mining system, referred to as the semantic knowledge graph, has been proposed. It provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. In [31], an entity-duet neural ranking model has been proposed, which introduces knowledge graphs to neural search systems and represents queries and documents by their words and entity annotations. The semantics from knowledge graphs are integrated in the distributed representations of their entities, while the ranking is conducted by interaction-based neural ranking networks. Furthermore, translation-based embedding models attempt to translate semantics from head entities to tail entities through the relations, and infer richer facts outside the knowledge graph. In [32], a novel ranking technique that leverages knowledge graphs has been proposed. Analysis of the query log of an academic search engine reveals that a major error source is the engine's inability to understand the meaning of research concepts in queries. To address this challenge, the authors propose to represent queries and documents in the entity space and rank them based on their semantic connections in the knowledge graph.
II-B3 Analysis and Framework of Knowledge Graph Based Semantics
In [33], the authors introduce the semantic property graph for scalable knowledge graph analytics. To enhance the input data, the authors in [34] propose a framework of relevant knowledge graphs for recommendation and community detection, which improves both accuracy and explainability. In [35], the authors propose an iterative framework based on probabilistic reasoning and semantic embedding. The authors in [36] propose a semantic representation method for knowledge graphs, which imposes a two-level hierarchical generative process that globally extracts many aspects and then locally assigns a specific category in each aspect for every triplet. Because both the aspects and categories are semantics-relevant, the collection of categories in each aspect is treated as the semantic representation of the triplet.
II-B4 Semantic Communication Systems Driven by Knowledge Graphs
In [37], a cognitive semantic communication framework is proposed by exploiting knowledge graphs. Moreover, a simple and interpretable solution for semantic information detection is developed by exploiting triplets as semantic symbols, which also allows the receiver to correct errors occurring at the symbolic level. Furthermore, a pre-trained model is fine-tuned to recover the semantic information, which overcomes the drawback of using a fixed-length bit coding for sentences of different lengths.
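The triplet-as-semantic-symbol idea can be illustrated with a deliberately simplified sketch; the pattern-matching extractor and the template-based decoder below are hypothetical stand-ins for the learned semantic detection and recovery models used in [37]:

```python
# Transmitter: extract a (head, relation, tail) triple from the sentence
# and send only the triple as the semantic symbol.
def semantic_encode(sentence, patterns):
    """Match the sentence against known triple patterns (hypothetical
    stand-in for a learned semantic detector)."""
    for head, relation, tail in patterns:
        if head in sentence and tail in sentence:
            return (head, relation, tail)
    return None

# Receiver: render the received triple back into a sentence.
def semantic_decode(triple):
    head, relation, tail = triple
    return f"{head} is the {relation.replace('_', ' ')} {tail}"

patterns = [("Paris", "capital_of", "France")]
symbol = semantic_encode("Paris is the beautiful capital of France", patterns)
print(symbol)                  # ('Paris', 'capital_of', 'France')
print(semantic_decode(symbol))
```

Note that the reconstruction preserves the core meaning but not the exact wording, which is precisely the kind of semantic-level fidelity such systems target.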
III. Information-Theoretic Foundations of Semantic and Task-Oriented Communications
As we have mentioned, despite the many efforts to define a semantic information measure, none of the aforementioned attempts has so far resulted in a widely accepted definition, or had an impact on operational performance similar to the one Shannon’s information theory has had on communication systems. Therefore, we take a more statistical approach to semantics in this paper, and either treat it as a given distortion measure, or consider data-driven approaches to acquire it. In the current section, we provide some of the basic concepts in Shannon’s statistical information theory, and show how they can be used to study semantics in emerging communication systems. In particular, we first review rate-distortion theory, its characterization in various settings and under different distortion measures, and how it can capture semantic or goal-oriented communications. In this section, we focus on rate-limited error-free communications; that is, we mainly treat the semantic/task-oriented compression problem. Semantic and task-oriented transmission over noisy channels will be considered in Section V.
III-A Rate-Distortion Theory
In [10], Shannon expands the ideas put forth in his seminal work by formally defining the problem of lossy source transmission, which laid down the principles of rate-distortion theory. Note that, when compressing a single source sample, we have a quantization problem, which is equivalent to analog-to-digital conversion when the source samples come from a continuous alphabet. This can also be treated as a clustering problem with a specified fidelity criterion. Shannon showed that, similarly to the channel coding theorem, it is more efficient, in terms of bits per source sample, to compress many independent samples of the same source distribution together.
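The clustering view of quantization can be made concrete with Lloyd's algorithm, which is equivalent to one-dimensional k-means. The minimal sketch below designs a 4-level scalar quantizer for a standard Gaussian source under mean squared error (the source, number of levels, and iteration count are illustrative choices):

```python
import numpy as np

def lloyd_quantizer(samples, levels, iters=50):
    """Lloyd's algorithm: design a scalar quantizer (1-D k-means
    clustering) that approximately minimizes mean squared error."""
    reps = np.linspace(samples.min(), samples.max(), levels)  # initial codebook
    for _ in range(iters):
        # Nearest-representation assignment (quantization regions).
        idx = np.argmin(np.abs(samples[:, None] - reps[None, :]), axis=1)
        # Centroid update: each representation moves to the mean of its region.
        for k in range(levels):
            if np.any(idx == k):
                reps[k] = samples[idx == k].mean()
    return reps

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
codebook = lloyd_quantizer(x, levels=4)
idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
mse = float(np.mean((x - codebook[idx]) ** 2))
print(codebook.round(3), round(mse, 3))
```

For a Gaussian source the resulting distortion is close to the optimal 4-level Lloyd-Max value (about 0.12), far below what uniform quantization over the sample range achieves.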
Let $X^n = (X_1, \ldots, X_n)$ denote $n$ independent source samples distributed according to $P_X$, i.e., $X_i \sim P_X$. Let $\hat{X}^n \in \hat{\mathcal{X}}^n$ denote the reconstruction at the receiver, where $\hat{\mathcal{X}}$ is the reconstruction alphabet, not necessarily the same as the source alphabet $\mathcal{X}$. The goal is to minimize the distortion between $X^n$ and $\hat{X}^n$ under some given distortion (fidelity) measure, $d(x^n, \hat{x}^n)$, which assigns a certain distortion, or equivalently, a quality metric, to every pair of source and reconstruction sequences. Shannon considered single-letter additive distortion measures, that is, $d(x^n, \hat{x}^n) = \frac{1}{n} \sum_{i=1}^{n} d(x_i, \hat{x}_i)$ for a single-letter distortion measure $d: \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}_+$.
The goal in lossy source coding is to represent the source sequence with as few bits as possible, measured in bits per source sample (bpss), while guaranteeing a certain average distortion level. A $(2^{nR}, n)$ lossy source code consists of an encoder-decoder pair, where the encoder maps each length-$n$ source sequence $x^n$ to an index $m \in \{1, \ldots, 2^{nR}\}$, and the decoder maps each index $m$ to a reconstruction sequence $\hat{x}^n(m)$. The collection of all reconstruction sequences forms the codebook, which is assumed to be shared between the encoder and decoder in advance. The average distortion of a code is given by

$\bar{D} = \mathbb{E}\left[ d\big(X^n, \hat{X}^n\big) \right] = \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} d(X_i, \hat{X}_i) \right],$  (9)

where the expectation is taken over the source distribution.
For a given source distribution $P_X$ and distortion measure $d$, we say that a rate-distortion pair $(R, D)$ is achievable if there exists a sequence of $(2^{nR}, n)$ codes that satisfy

$\limsup_{n \to \infty} \mathbb{E}\left[ d\big(X^n, \hat{X}^n\big) \right] \le D.$  (10)

The rate-distortion function $R(D)$ of source $X$ under the single-letter distortion measure $d$ is defined as the infimum of rates $R$ for which $(R, D)$ is achievable.
For such single-letter additive distortion measures, Shannon provided a single-letter characterization of the optimal rate-distortion function.
Theorem 1.
(Shannon's Lossy Source Coding Theorem, [10]) The rate-distortion function for source $X \sim P_X$ and distortion measure $d$ is given by

$R(D) = \min_{P_{\hat{X}|X}:\, \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X}).$  (11)
We can argue that Shannon's rate-distortion function quantifies the communication rate required to convey sample-level semantic information when many source samples can be compressed jointly. Here, our assumption is that the semantic relevance of reconstructing source sample $x$ as $\hat{x}$ at the receiver is quantified by the prescribed distortion measure $d(x, \hat{x})$. In the context of text, this may refer to reconstructing a certain sentence in a manner that preserves its core meaning. In the context of music or images, it may mean, as argued by Shannon, preserving the intelligibility of the source signal, for example by conveying only the most audible or most distinguishable frequencies, which is the approach followed by modern audio and image compression standards.
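For sources with known statistics, points on the curve in (11) can be computed numerically with the standard Blahut-Arimoto algorithm. The sketch below, which we add purely for illustration (the slope parameter and source are our choices), traces a point of the rate-distortion curve of a binary uniform source under Hamming distortion, for which the closed form $R(D) = 1 - H_b(D)$ bits is known.

```python
import numpy as np

def blahut_arimoto(p_x, dist, s, n_iter=500):
    """Compute one point on the rate-distortion curve R(D) of a discrete
    source p_x under the distortion matrix dist[x, xhat], using the
    Blahut-Arimoto iterations with slope parameter s > 0."""
    nx, nxh = dist.shape
    q_xh = np.full(nxh, 1.0 / nxh)                 # output marginal, start uniform
    for _ in range(n_iter):
        # optimal test channel for the current output marginal
        w = q_xh[None, :] * np.exp(-s * dist)      # shape (nx, nxh)
        q_xh_given_x = w / w.sum(axis=1, keepdims=True)
        # output marginal induced by the test channel
        q_xh = p_x @ q_xh_given_x
    D = np.sum(p_x[:, None] * q_xh_given_x * dist)
    R = np.sum(p_x[:, None] * q_xh_given_x *
               np.log2(q_xh_given_x / q_xh[None, :]))   # mutual information, bits
    return R, D

# Binary uniform source under Hamming distortion: R(D) = 1 - H_b(D) bits.
p_x = np.array([0.5, 0.5])
dist = np.array([[0.0, 1.0], [1.0, 0.0]])
R, D = blahut_arimoto(p_x, dist, s=2.0)
```

Sweeping the slope parameter `s` traces the whole $R(D)$ curve; each fixed `s` picks out the point where the curve has slope $-s$ (in nats).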
We note, however, that in most cases we apply compression algorithms to a single sample, e.g., a single image, a single video sequence, or a single text file. In these cases, samples correspond to pixels or patches in an image, frames in a video, and letters or words in the text. Such samples are often highly correlated, and additive distortion measures may not preserve the semantics of the overall content. For example, preserving word-level similarity, e.g., replacing words with their more common synonyms in the reconstruction, or generating an image with a low pixel-level mean squared error (MSE), may not lead to a good-quality semantic compression. For text, semantic compression requires a much deeper understanding of the semantics of the underlying language. For images, alternative quality measures have been proposed that provide a better image-level semantic reconstruction. For example, the structural similarity index measure (SSIM) and multi-scale SSIM (MS-SSIM) [38] have been introduced to measure the perceived quality of images and videos by incorporating luminance masking and contrast masking terms into the distortion measure, providing perceptually more satisfactory reconstructions [39]. Although SSIM/MS-SSIM are better aligned with human perception, they still do not provide adaptivity to regions of interest in an image. In [40, 41], saliency-based attention prediction is used to detect regions of interest in image and video signals, which are then used for adaptive bit allocation.
Recently, more formal definitions of the perceptual quality of an output image have been considered, defined as 'the extent to which it is perceived as a valid (natural) sample' [42]. We note that, while the distortion of an image is defined with respect to the source image, perceptual quality is defined as a property of the reconstruction alone. The perceptual quality of an image is often defined as the divergence between the distribution of the sample and the statistics of natural images [43]. If one can make this divergence vanish, the reconstruction is indistinguishable from real data samples; however, this does not guarantee that the distortion between a particular input sample and its reconstruction at the decoder is small. It is shown in [42] that in general there is a tension between the distortion that can be achieved and the divergence to the real data distribution, which is formalized as the rate-distortion-perception tradeoff. An information-theoretic formulation of the rate-distortion-perception tradeoff is presented in [44], allowing stochastic and variable-length codes.
More recently, image reconstruction techniques employing generative adversarial networks (GANs) [45] also employ similar distance measures between the reconstructed image and the statistics of the images in the training dataset [46, 47], such as the Jensen-Shannon divergence, the Wasserstein distance [48], or an f-divergence [49]. GANs train a discriminator that tries to distinguish the reconstructed image from the images in the dataset, forcing the decoder to generate realistic-looking images. By taking the perceptual quality into account when reconstructing a compressed image, we aim not only to reproduce the original image with the highest fidelity, but also to reconstruct a more natural, realistic image, which preserves the semantics of the underlying distribution. Despite trying to preserve the overall semantics of the image, the methods above do not use the image's semantic information in the compression stage. On the other hand, semantic information can be used to provide a more efficient compression algorithm, or to achieve a better-quality reconstruction [47, 50, 51]. See, for example, the images in Fig. 2. By extracting and transmitting only the semantic information, e.g., the objects in the original image and their general layout, the output image can be reconstructed by simply including a generic representative of each of the objects in the image. Hence, it is possible to convey the image at a level that enables semantic reasoning about it, albeit not reliably at the pixel level. A selective generative compression method is proposed in [47], which generates parts of the image from a semantic label map, which can be obtained using a semantic segmentation network [52, 53]. These parts of the image are fully synthesized rather than being reconstructed based on the original image. The rest of the image is generated using a conditional generative adversarial network (cGAN) [54].
A deep semantic-segmentation-based layered image compression scheme is proposed in [50], where the semantic map of the input image is used to synthesize the image, while a compact representation and a residual are further encoded as enhancement layers. It is shown that this semantic-based compression approach outperforms BPG and other standard codecs in both PSNR and MS-SSIM metrics. We also note that including the segmentation map in the bitstream can further facilitate other downstream tasks, such as image search, or the compression and manipulation of individual image segments.
Another semantic-based image processing approach uses scene graphs to extract not only the objects within the image, but also their relationships [55]. A scene graph is a directed graph data structure consisting of the objects and their attributes as vertices, and the relations between the objects as edges. Scene graph generation typically follows three steps: i) feature extraction, which is responsible for identifying the objects in the image; ii) contextualization, which establishes contextual information between entities; and finally iii) graph construction and inference, which generates the final scene graph using the contextual information, and carries out the desired inference tasks on the graph [56]. Scene graphs are powerful tools that can encode images [55] or videos [57] using abstract semantic elements. In the context of video coding, semantic-based compression has long been considered for very low bitrate video compression [58]. This includes motion-compensated compression, where motion vectors of pixels between two consecutive images, or optical flow vectors [59], are encoded and transmitted. Alternative object-based compression methods have also been considered in the literature for a very long time [60, 61]. In object-based compression, each moving object in a video signal is separated from the stationary background and conveyed to the decoder by describing its shape, motion, and content using object-dependent parameter coding. Using this coded parameter set, each image is then synthesized at the decoder by model-based image synthesis. Although such an object-based compression approach was standardised as part of MPEG-4 in the late 90s, it has not been widely adopted due to the lack of fast and reliable object and motion segmentation techniques. This approach is regaining interest in recent years due to the rapid advances in deep learning based segmentation techniques [62, 63].
In general, quantifying the semantic distortion measure for a particular information source is a formidable task. There have been many studies to understand and model semantics, particularly in the context of text and natural language processing. We will go over these efforts in Section V in the context of semantic/goal-oriented transmission.
III-B Goal-Oriented Compression
In the conventional rate-distortion framework overviewed above, the goal is to reconstruct the source sequence within a desired fidelity constraint. This framework applies to most delivery scenarios, such as the transmission of an image, video, or audio source over a rate-limited channel, where the goal is to recover the original content with the highest fidelity. However, in many emerging applications, particularly those involving machine-type communications, the receiver may not be interested in the source sequence itself, but only in a certain feature of it. For example, rather than reconstructing an image, the receiver may be interested in certain statistical aspects of the image, or in the presence or absence of certain objects or persons in the image. This can model a goal-oriented compression scenario, where reconstructing the desired feature represents the goal.
The desired feature in this context can be modelled as a correlated random variable $S$ that follows a joint distribution $P_{SX}$ with the source sample $X$. In this scenario, we can instead impose a distortion constraint on the feature $S$ and its reconstruction $\hat{S}$ at the decoder, for a prescribed distortion measure $d(s, \hat{s})$. This problem can be interpreted as a remote source coding problem, which was originally studied by Dobrushin and Tsybakov in [64]. In addition to the noisy observation of the features at the encoder, they also considered another random transformation at the output of the decoder, as illustrated in Fig. 3. They showed that this generalization of Shannon's rate-distortion problem can be reduced to Shannon's original formulation by appropriately transforming the distortion measure. Consider any pair of encoding function $f$ and decoding function $g$. The average end-to-end distortion achieved by this encoder-decoder pair is given by

$\bar{D} = \mathbb{E}\left[ d\big(S, g(f(X))\big) \right].$  (12)
From the perspective of the encoder, it observes a source $X$ with marginal distribution $P_X$. Let us now consider the modified distortion measure

$\hat{d}(x, \hat{s}) \triangleq \mathbb{E}\left[ d(S, \hat{s}) \mid X = x \right] = \sum_{s} P_{S|X}(s|x)\, d(s, \hat{s}).$  (13)
Using this new distortion measure, we can rewrite the end-to-end distortion as

$\mathbb{E}\left[ d(S, \hat{S}) \right] = \mathbb{E}\left[ \hat{d}(X, \hat{S}) \right].$  (14)

Therefore, the problem of minimizing the average end-to-end distortion for the remote source coding problem can be reduced to the classical source coding problem for a source with marginal distribution $P_X$ under the modified distortion measure $\hat{d}$.
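The reduction in (13)-(14) is easy to check numerically. The snippet below (a toy joint distribution with numbers we chose for illustration) computes the end-to-end distortion of a decoder both directly over the joint distribution and via the modified distortion measure; the two values coincide.

```python
import numpy as np

# A toy joint distribution P_{SX} over a hidden feature S and observation X
# (values chosen arbitrarily for illustration), with Hamming distortion on S.
P_sx = np.array([[0.30, 0.10],    # rows: s in {0, 1}
                 [0.05, 0.55]])   # cols: x in {0, 1}
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # d(s, s_hat)

P_x = P_sx.sum(axis=0)
P_s_given_x = P_sx / P_x[None, :]            # P_{S|X}(s|x)

# Modified distortion of (13): d_hat(x, s_hat) = E[d(S, s_hat) | X = x]
d_hat = P_s_given_x.T @ d                    # shape (x, s_hat)

# A decoder mapping x to s_hat, here the MAP estimate of S given X:
dec = P_s_given_x.argmax(axis=0)             # s_hat(x)

# End-to-end distortion computed two ways, as in (14):
D_direct = sum(P_sx[s, x] * d[s, dec[x]]
               for s in range(2) for x in range(2))
D_modified = sum(P_x[x] * d_hat[x, dec[x]] for x in range(2))
```

The agreement holds for any decoder, not just the MAP rule, which is exactly why the remote problem reduces to a classical one.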
This equivalence can be generalized to the standard block coding setting. Assume now that we observe a sequence $X^n$ and want to reconstruct the corresponding feature vector $S^n$, where $(S_i, X_i)$ are i.i.d. samples from the joint distribution $P_{SX}$. Then, the corresponding remote rate-distortion function can be characterized in a single-letter form, as given in the next theorem.
Theorem 2.
(Remote Rate-Distortion Function, [64]) The remote rate-distortion function for source $S$ based on its observation $X$, following joint distribution $P_{SX}$ and distortion measure $d$, is given by

$R_{\mathrm{remote}}(D) = \min_{P_{\hat{S}|X}:\, \mathbb{E}[d(S, \hat{S})] \le D} I(X; \hat{S}).$  (15)
Recently, the remote rate-distortion interpretation of semantic compression was considered in [65], where the decoder wants to reconstruct both the feature vector $S^n$ and the source vector $X^n$, under different distortion measures. With the above reduction, one can see that this problem trivially reduces to the Shannon rate-distortion problem with two distortion measures. Further characteristics of this rate-distortion function for a Hamming distortion measure are studied in [66]. In [67], it is shown that the optimal transmission scheme for the general model in Fig. 3 under the squared error distortion measure can be divided into two steps: the encoder first estimates the feature variable $S$ conditioned on $X$, and then conveys the estimated value to the decoder.
A natural extension of the above remote rate-distortion function involves multiple terminals, each observing a different noisy version of the underlying latent source (see Fig. 4 for an illustration). This is known as the "CEO problem" in the literature, following [68]. In this setting, a chief executive officer (CEO) is interested in estimating an underlying source sequence, $S^n$. She sends $K$ agents to observe independently corrupted versions of the source sequence, where the observations of the $k$th agent are generated through the conditional distribution $P_{X_k|S}$ in an i.i.d. manner. The agents cannot communicate among each other, and each one only has a rate-constrained channel to the CEO. For a given sum rate constraint, what is the minimum average distortion at which the CEO can estimate $S^n$ under a given fidelity measure $d$? The special case of Gaussian source and noise statistics with squared error distortion is called the quadratic Gaussian CEO problem [69]. The problem remains open in the general case, while the rate region was characterized for the Gaussian case in [70], for logarithmic loss distortion for discrete sources in [71], and for vector Gaussian sources in [72].
III-C Context as Side Information
When the communication of an underlying source signal is considered, additional information that is correlated with the desired source variable and available to the transmitter and the receiver can be leveraged to reduce the rate requirements. Consider, for example, a surveillance camera in a house recording a video and forwarding the recording to a remote node, which aims at detecting the presence of intruders. Depending on the hour of the day, the illumination of the image will be different. The context in which the information is obtained, e.g., the time or weather conditions, could be exploited to improve the video compression quality. Images of the same scene recorded by other cameras can also serve as context information.
From an information-theoretic perspective, contextual information can be modelled as side information. The problem of lossy source coding when common correlated side information is available at both the encoder and the decoder was studied by Gray in [73]. By characterizing the rate-distortion function for this problem, it was shown that the rate required to achieve a prescribed distortion can be reduced by exploiting the available side information. Interestingly, the rate required to achieve a particular distortion is also reduced when the correlated side information is available only at the decoder, and the encoder has to compress without knowledge of the realization of the side information, as long as its distribution is known. This problem was studied in [74], and the corresponding single-letter rate-distortion function is referred to as the Wyner-Ziv rate-distortion function.
Let $Y^n$ denote the side information sequence observed by the receiver that is correlated with the source $X^n$. In particular, let $(X_i, Y_i)$ be i.i.d. samples jointly distributed according to $P_{XY}$. The Wyner-Ziv rate-distortion function for the source coding problem with side information at the decoder is characterized in the following theorem.
Theorem 3.
(Lossy Compression with Side Information at the Decoder, [74]) The rate-distortion function for source $X$ with side information $Y$ available only at the decoder, following joint distribution $P_{XY}$ and distortion measure $d$, is given by

$R_{WZ}(D) = \min I(X; U \mid Y),$  (16)

where the minimum is over conditional distributions $P_{U|X}$ such that $U - X - Y$ form a Markov chain, and functions $\hat{x}(u, y)$ such that $\mathbb{E}[d(X, \hat{x}(U, Y))] \le D$.
In Wyner-Ziv coding, typical source codewords are split into bins, and only the bin index is forwarded to the decoder. By leveraging the side information, the receiver is able to identify the corresponding compression codeword within the selected bin. Wyner-Ziv coding has been exploited in image and video compression [75, 76]. While in general having the side information available at both the encoder and the decoder is beneficial, for some source-distortion measure pairs, e.g., Gaussian sources under squared-error distortion, it is known that having the side information available only at the decoder does not result in any performance loss [74]. Side information can also be considered for the remote source coding problem, since any remote compression problem is equivalent to a standard source coding problem with a new distortion measure, as in (14). Side information can also be incorporated into multi-terminal source coding problems, e.g., [77]. For example, in the CEO setting in Fig. 4, letting one of the observers have a link to the CEO with a sufficiently large rate allows conveying this observation perfectly, which then acts as side information.
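The quadratic Gaussian case mentioned above admits simple closed-form expressions, sketched below with illustrative parameters of our choosing: decoder side information shifts the Gaussian rate-distortion curve down by the constant $\tfrac{1}{2}\log_2(\sigma_X^2/\sigma_{X|Y}^2)$ bits, the same saving as if the side information were also available at the encoder [74].

```python
import math

# Quadratic-Gaussian example (illustrative parameters): source X ~ N(0, var_x),
# decoder side information Y = X + N, with N ~ N(0, var_n) independent of X.
var_x, var_n = 1.0, 0.25
var_x_given_y = var_x * var_n / (var_x + var_n)   # conditional variance of X given Y

def rate_no_side_info(D):
    """Shannon rate-distortion function of a Gaussian source, in bits."""
    return max(0.0, 0.5 * math.log2(var_x / D))

def rate_wyner_ziv(D):
    """Gaussian Wyner-Ziv function with side information at the decoder only;
    for this source-distortion pair there is no loss versus side information
    at both ends [74]."""
    return max(0.0, 0.5 * math.log2(var_x_given_y / D))

D = 0.05
saving = rate_no_side_info(D) - rate_wyner_ziv(D)
```

With these numbers the side information saves about 1.16 bits per sample at every distortion level below $\sigma_{X|Y}^2$.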
III-D The Information Bottleneck and Goal-Oriented Compression
The information bottleneck (IB) was introduced by Tishby et al. [78] as a methodology for extracting the information that a variable $X$ provides about another one, $S$, that is of interest but not directly observable, by mapping $X$ into a representation $U$, as shown in Fig. 3. Specifically, the IB method consists in finding the mapping $P_{U|X}$ that, given $X$, outputs the representation $U$ that is maximally informative about $S$, i.e., the mutual information $I(U; S)$ is maximized, while being minimally informative about $X$, i.e., the mutual information $I(U; X)$ is minimized. Here, $I(U; S)$ is referred to as the relevance of $U$, and $I(U; X)$ is referred to as the complexity of $U$, where complexity is measured by the minimum description length (or bit-length) at which the observation is represented. For a distribution $P_{SX}$, the optimal mapping of the data with parameter $\beta$, denoted by $P^*_{U|X}$, is found by solving the IB problem, defined as
$P^*_{U|X} = \arg\max_{P_{U|X}} \; I(U; S) - \beta I(U; X),$  (17)

over all mappings $P_{U|X}$ that satisfy the Markov chain $S - X - U$, where $\beta$ is a positive Lagrange multiplier that trades off relevance and complexity. In Section VI-B, several methods are discussed to obtain solutions to the IB problem in (17) in several scenarios, e.g., when the distribution $P_{SX}$ is perfectly known, or when only samples from it are available.
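For small discrete alphabets with a known $P_{SX}$, (17) can be solved with the self-consistent iterations of [78], alternating updates of the encoder $q(u|x)$, the marginal $q(u)$, and the induced decoder $q(s|u)$. The sketch below is our own toy instance (distribution, $\beta$, and iteration count are illustrative choices, not taken from the works cited).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution P_{SX} (rows: s, cols: x); values are illustrative.
P_sx = np.array([[0.35, 0.05, 0.05],
                 [0.05, 0.30, 0.20]])
P_x = P_sx.sum(axis=0)
P_s_given_x = P_sx / P_x[None, :]

def iterative_ib(n_u=2, beta=2.0, n_iter=200):
    """Self-consistent IB iterations (Tishby et al. [78]) for the Lagrangian
    min I(U;X) - beta * I(U;S), with |U| = n_u representation values."""
    q_u_given_x = rng.dirichlet(np.ones(n_u), size=P_x.size).T   # (u, x)
    for _ in range(n_iter):
        q_u = np.maximum(q_u_given_x @ P_x, 1e-30)   # marginal of U (guarded)
        # decoder distribution q(s|u) induced by the current encoder
        q_s_given_u = (P_sx @ q_u_given_x.T) / q_u[None, :]      # (s, u)
        # KL(p(s|x) || q(s|u)) for every (u, x) pair
        kl = np.array([[np.sum(P_s_given_x[:, x] *
                               np.log(P_s_given_x[:, x] / q_s_given_u[:, u]))
                        for x in range(P_x.size)]
                       for u in range(n_u)])
        # encoder update: q(u|x) proportional to q(u) * exp(-beta * KL)
        w = q_u[:, None] * np.exp(-beta * kl)
        q_u_given_x = w / w.sum(axis=0, keepdims=True)
    return q_u_given_x

enc = iterative_ib()
```

Larger $\beta$ pushes the encoder toward harder, more relevance-preserving clusterings of $X$; smaller $\beta$ yields more compressed, softer representations.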
The IB problem is connected to multiple source-coding problems, including source coding with logarithmic loss distortion [71], information combining [79, 80], common reconstruction [81], the Wyner-Ahlswede-Körner problem [82, 83], and the efficiency of investment information [84]; to communications and cloud radio access networks (C-RAN) [85]; as well as to learning, including generalization [86], variational inference [87], representation learning and autoencoders [87], neural network compression [88], and others. See [89] and [90] for recent surveys on the IB principle and its application to learning. The connections between these problems allow extending results from one setup to another, and considering generalizations of the classical IB problem to other setups, including multi-terminal versions of the IB [91, 92, 93]. In fact, it is now well known [94] that the IB problem in (17) is essentially a remote point-to-point lossy source-coding problem [95, 96, 97], where the distortion between the desired feature $S$ and the reconstruction $\hat{S}$ is measured under the logarithmic loss (log-loss) fidelity criterion [71]. That is, given $n$ i.i.d. samples $(S_i, X_i) \sim P_{SX}$, an encoder $f$ encodes the observation $X^n$ using at most $R$ bits per sample to generate an index $m = f(X^n)$. Using a decoder $g$, the receiver generates an estimate $\hat{S}^n = g(m)$, where each $\hat{S}_i$ is a probability distribution over $\mathcal{S}$. The discrepancy between $S_i$ and the estimate $\hat{S}_i$ is measured using the per-letter distortion $d(s, \hat{s})$, where
$d(s, \hat{s}) = \log \frac{1}{\hat{s}(s)},$  (18)

where $\hat{s}(s)$ is the value of the distribution $\hat{s}$ evaluated at $s$. The decoder is interested in a reconstruction to within an average distortion $D$, such that $\mathbb{E}[d(S^n, \hat{S}^n)] \le D$. The rate-distortion function of this problem can be characterized as follows.
Theorem 4.

(IB as Remote Source Coding Under Log-Loss, [94]) The rate-distortion function of the remote source coding problem under the log-loss distortion measure is given by

$R(D) = \min_{P_{U|X}:\, H(S|U) \le D} I(X; U),$  (19)

where the minimization is over mappings $P_{U|X}$ such that $S - X - U$ form a Markov chain.
Using the substitution $\Delta \triangleq H(S) - D$, the region described by this function can be seen to be equivalent to the convex hull of all $(\Delta, R)$ pairs obtained by solving the IB problem in (17) for all $\beta \ge 0$. Note that, for every $u$ and every estimate $\hat{s}$ with $\hat{s}(s) > 0$ for all $s \in \mathcal{S}$, we have

$\mathbb{E}\left[ \log \frac{1}{\hat{s}(S)} \,\Big|\, U = u \right] = H(S \mid U = u) + D_{\mathrm{KL}}\big(P_{S|U=u} \,\|\, \hat{s}\big),$  (20)
which is minimized iff the estimate is given by the conditional posterior $P_{S|U=u}$. Thus, operationally, the IB problem is equivalent to finding an encoder that maps the observation $X$ to a representation $U$ satisfying the bit-rate constraint $R$, and such that $U$ captures enough relevance of $S$ so that the posterior probability of $S$ given $U$ minimizes the KL divergence between $P_{S|U}$ and the estimate produced by the decoder. The IB is deeply linked to goal-oriented compression, i.e., compression intended to perform a task. Consider the scenario in which a picture is taken at an edge device and a computational task, such as classifying an element in the picture or retrieving images similar to the one taken, needs to be performed at a remote unit. In classical compression, the image is compressed with the goal of preserving the maximum reconstruction quality before forwarding it over the communication channel. However, in goal-oriented compression, only the information or features that are most relevant to perform the task need to be transmitted. For classification, those features should be the most discriminative ones, and not necessarily a representation that allows reconstructing the image, while for image retrieval the task is different, and so are the most relevant features. That is, the features are specific to the task, and to the metric under which the task will be evaluated. This communication scenario is similar to that in Fig. 3, where $X$ is a random variable modeling the observation, which is jointly distributed with the relevant information $S$ for the task, which is not directly observable. The goal of the receiver is to recover an estimate of $S$ with sufficient quality to perform the task, while the goal of the transmitter is to encode $X$ into the representation $U$ that conveys only the information necessary to recover $S$ at the receiver, but not necessarily $X$. In the classical setup, in which the image needs to be reconstructed with minimum distortion, we have $S = X$, while in classification, $S$ can be the class label of the image, such that the output obtained from the IB is the probability of the observation belonging to each class. How to select the relevant features for a given task is an open problem and depends on the metric that is used. However, in practice, a careful encoding of the task into $S$, together with a conditional probability tailored to the task and estimated by the IB to maximize the relevance, can often be a good candidate. This is justified, since the log-loss and the mutual information can be used to bound the performance of certain tasks; e.g., the probability of misclassification of a classifier using a decision rule based on the posterior can be upper bounded in terms of these quantities [86]. The formulation of the IB as a remote source coding problem under the log-loss distortion measure can be extended to consider context in the form of side information at the decoder, or at both the encoder and decoder [74], as described in Section III-C and studied in [98]. The IB problem can also be extended to multi-terminal scenarios. In particular, in [99, 93], the distributed compression problem is studied from an information-theoretic perspective using the IB formulation. In this scenario, one is interested in performing a task at a remote destination, e.g., classification, represented by the latent random variable $S$, using the information relayed by $K$ encoders, each observing some information correlated with $S$.
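The decomposition in (20), which states that the expected log-loss equals the conditional entropy plus a KL term, is an algebraic identity and can be checked in a few lines (the posterior and estimate below are hypothetical numbers):

```python
import numpy as np

# Check the log-loss decomposition of (20) for an arbitrary posterior
# P_{S|U=u} and some other estimate s_hat (toy numbers):
#   E[log 1/s_hat(S) | U=u] = H(S|U=u) + KL(P_{S|U=u} || s_hat)
p_post = np.array([0.7, 0.2, 0.1])        # P_{S|U=u}, hypothetical
s_hat  = np.array([0.5, 0.3, 0.2])        # a sub-optimal estimate

exp_logloss = -np.sum(p_post * np.log(s_hat))      # expected log-loss
entropy     = -np.sum(p_post * np.log(p_post))     # H(S | U = u)
kl          =  np.sum(p_post * np.log(p_post / s_hat))
```

Since the KL term is non-negative and vanishes only for `s_hat == p_post`, the posterior is indeed the unique minimizer of the expected log-loss.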
As shown in Fig. 4, encoder $k$ encodes its observation $X_k$ into a representation $U_k$ in order to preserve the information that is most relevant for the task and complementary to that of the other encoders. This problem can be shown to be essentially a $K$-encoder CEO problem under log-loss distortion [71], in which the decoder is interested in a soft estimate of $S$ from the descriptions $U_1, \ldots, U_K$, each encoded with an average finite rate of $R_k$ bits per sample. The fundamental limits in this scenario can be characterized in terms of the optimal tradeoff between relevance and rate at each encoder, as follows.
Theorem 5.
(Distributed Information Bottleneck Problem, [93]) The relevance-complexity region for the distributed IB problem is given by the union of all non-negative tuples $(\Delta, R_1, \ldots, R_K)$ that satisfy

$\Delta \le \sum_{k \in \mathcal{A}} \big[ R_k - I(X_k; U_k \mid S) \big] + I(S; U_{\mathcal{A}^c}), \quad \text{for all } \mathcal{A} \subseteq \{1, \ldots, K\},$  (21)

for some distribution $P_S \prod_{k=1}^{K} P_{X_k|S} \prod_{k=1}^{K} P_{U_k|X_k}$.
The above IB framework can be used to design encoders and decoders that extract the most relevant and complementary information from distinct observations. In Section VI-B, we describe how to design encoders and decoders by solving (or approximating) the IB problem, both for scenarios in which the joint distribution is perfectly known, and for scenarios in which the source distribution is unknown and only data samples are available.
More generally, the approach of extracting the most relevant information for a task can also be considered for other end-to-end metrics that go beyond log-loss, by modelling the relevant information extraction as a remote source coding problem in which a metric needs to be minimized under some rate constraints, where the metric is a distortion (not necessarily additive) that represents the performance of the task, and can include, for example, the KL divergence, classification, or hypothesis testing. In other cases, the metric can also be learned implicitly by defining an alternative task, as in GANs [45], in which the generator and discriminator are trained simultaneously to extract the information relevant for generating realistic signals, by learning how to pass a hypothesis test distinguishing the generated data from real data.
III-E Rate-Limited Remote Inference
As mentioned above, the hidden feature variable $S$ can represent the class that the sample belongs to, or the value associated with it in the case of a regression problem. If the goal is to convey only the class information to the receiver, and if there are no constraints on the complexity of the encoder, then the optimal operation would be to detect the class at the encoder, and to convey only the class information to the receiver. From an information-theoretic perspective, this problem can also be formulated as a remote inference problem.
In statistical inference problems, an observer observes $n$ i.i.d. samples $X^n = (X_1, \ldots, X_n)$ from some distribution. We assume that this distribution belongs to a known family of distributions indexed by $\theta \in \Theta$; that is, $X_i$ follows the conditional distribution $P_{X|\theta}$. In general, the observer may want to estimate $\theta$ simply from the available samples. If the set $\Theta$ is discrete, we have a detection/hypothesis testing (HT) problem. If, instead, $\Theta$ is a continuous set, i.e., $\Theta \subseteq \mathbb{R}^d$, then we have a parameter estimation problem. We impose a loss/distortion function $\ell(\theta, \hat{\theta})$ to quantify the quality of the estimation, where $\hat{\theta}$ is the estimate of the observer. The expected loss/risk of a decision rule $\hat{\theta}(\cdot)$, where $\hat{\theta} = \hat{\theta}(X^n)$, is then defined as $\mathbb{E}_{X^n|\theta}\big[\ell(\theta, \hat{\theta}(X^n))\big]$.
In the Bayesian setting, we assume some known prior distribution $P_\theta$ on $\Theta$, and try to minimize the average loss (also called the risk) over the joint distribution of $\theta$ and $X^n$, i.e., $\mathbb{E}_{\theta, X^n}\big[\ell(\theta, \hat{\theta}(X^n))\big]$. Alternatively, we can aim at minimizing the worst-case loss/risk, $\sup_{\theta \in \Theta} \mathbb{E}_{X^n|\theta}\big[\ell(\theta, \hat{\theta}(X^n))\big]$; the corresponding decision rule is called the minimax rule. The classical Bayesian and minimax inference problems deal with centralized decision problems; that is, they assume that the observer and the decision maker are the same agent, who makes the decision with full access to the samples. However, in many practical problems of interest, the observer and the decision maker are connected through a constrained communication channel. If we consider a rate-limited link, we obtain a remote inference problem. Remote inference problems over a rate-limited channel were first considered by Berger in [100]. We note that the remote inference problem in the Bayesian setting is very similar to the information-theoretic remote rate-distortion formulation in Section III-B, with the exception that we only have a single realization of the latent/hidden variable $\theta$, rather than a sequence of i.i.d. realizations, each generating a separate sample.
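As a concrete toy instance of the Bayesian setting (parameters are our own illustrative choices), the snippet below computes the Bayes risk of a binary hypothesis test from Bernoulli samples under 0-1 loss: the Bayes-optimal rule picks the hypothesis with the larger posterior, so the incurred risk on each sample sequence is the probability mass of the other hypothesis.

```python
import numpy as np
from itertools import product

# Binary hypothesis testing with n Bernoulli samples under 0-1 loss
# (illustrative parameters): theta in {0, 1} with a prior, X_i ~ Bern(p_theta).
prior = np.array([0.4, 0.6])
p = np.array([0.2, 0.7])          # P(X_i = 1 | theta)
n = 3

risk = 0.0
for xs in product([0, 1], repeat=n):
    k = sum(xs)
    # joint probability of this sample sequence under each hypothesis
    joint = prior * p**k * (1 - p)**(n - k)
    # Bayes decision: argmax posterior; incurred risk: mass of the other theta
    risk += joint.sum() - joint.max()
```

Note that the risk is below the no-observation baseline of $\min(P_\theta) = 0.4$ (always guessing the more likely hypothesis), and it shrinks further as $n$ grows.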
When the dimension of the parameter to be estimated is smaller than that of the observations, for example in the case of HT, the observer can perform local inference and transmit its decision (indeed, optimal performance can be achieved asymptotically at zero rate by conveying the type information, which is a sufficient statistic [101]). Distributed parameter estimation with multiple terminals, each observing a component of a family of correlated samples, was studied in [102], under rate constraints, in bits per sample, from the observers to the decision maker. A single-letter bound, similar to Shannon's rate-distortion function, is provided on the variance of an asymptotically unbiased estimator, which was later improved in [103]. Ahlswede and Burnashev studied the remote estimation problem when the decision maker has its own side information [104]. Ahlswede and Csiszár studied HT when the decision maker also has its own observations [105]. They studied the exponent of the type-II error probability when a constraint is imposed on the type-I error probability. For the case of testing against independence, they were able to provide a single-letter expression similar to Shannon's rate-distortion function. This is one of the few cases in which a single-letter characterization is possible for a non-additive distortion measure. The more general distributed setting is considered in [106]. Han showed in this paper that a positive exponent can be achieved even with a single-bit compression scheme. This result was extended to the more general zero-rate compression in [107], and later refined by Shalaby and Papamarcou in [108], who showed that when the observers have fixed codebook sizes, the asymptotic performance does not depend on the particular codebook size. This means that no further gain can be obtained in terms of the asymptotic error exponents by allowing each observer to transmit a high-resolution soft decision instead of a binary decision. Despite significant research efforts, the optimal characterization of the type-II error exponent for the remote HT problem in the general case (beyond testing against independence) remains open to this day. Lower bounds are provided for the general problem in [105] and [106]. Distributed hypothesis testing is also studied in the context of a sensor network in [109], where multiple sensors convey their noisy observations to a fusion center over rate-limited links. There has been a recent resurgence of interest in distributed HT problems [110, 111, 112, 113, 114].

IV Machine Learning Techniques for Semantic and Task-Oriented Compression
The ultimate motivation of semantic compression is to extract, at the transmitter, the semantic information within the source data that is most relevant for the task to be executed at the receiver. By filtering out task-irrelevant data, both the bandwidth consumption and the transmission latency can be reduced significantly. However, the information-theoretic framework presented above either assumes known statistics for the data and the relevant features for the task, or is limited to the parameter estimation framework assuming i.i.d. samples from a family of distributions. On the other hand, in most practical applications, we do not have access to statistical information, and often need to make inferences based on a single data sample. An alternative approach is to consider a data-driven framework, where we have access to a large dataset, which allows us to train a model using machine learning tools to facilitate semantic information extraction without requiring a mathematical model. In particular, deep learning (DL) aided semantic extraction techniques have shown great potential for various information sources and associated tasks in recent years.
While most machine learning research can be considered within the context of semantic feature extraction, we focus here on the communication aspects. Machine learning algorithms typically follow a two-step approach: in the training phase, a model is trained on the available dataset for the desired task, e.g., classification or regression; once trained, the model is used for prediction on new data samples. Communication in both phases can be considered in the context of semantic or goal-oriented communications. Below, we overview research on these two phases separately.
IV-A Remote Model Training
In the training phase, a single node, or multiple nodes each with its own dataset, communicates with a destination node with the goal of reconstructing a model at the destination for a particular inference task. Note that this may also correspond to a storage problem, where the goal is to store the model in a limited memory to be later used for predicting future data samples. We would like to highlight that this problem is an instance of a particular remote rate-distortion problem. Consider first the point-to-point version of the problem illustrated in Fig. 5. Here, we can treat the dataset at the encoder as the information available at the encoder, and the model itself as the remote source that the decoder wants to recover. Note that, similarly to the other semantic rate-distortion problems, the fidelity measure here is not the similarity of the reconstructed neural network weights at the receiver to those trained at the encoder. In the end, what really matters is the performance of the reconstructed model at the decoder in terms of the prescribed quality measure, e.g., its accuracy on new data samples.
As in the previous remote rate-distortion problems, a natural solution approach is to first estimate the remote source at the encoder, that is, to first train a model locally, and then convey this model to the decoder over the rate-limited channel at the highest possible quality, i.e., in a way that retains the predictive power of the model on future queries. While the former step is simply the standard training process, the latter corresponds to model compression, which has been widely studied in recent years, particularly in the context of deep neural networks (DNNs) that would otherwise require significant communication or storage resources.
There are various widely used methods to reduce the size of a pretrained model, including parameter pruning, quantization, low-rank factorization [115], and knowledge distillation. It has long been known that many parameters in a neural network are redundant and do not contribute significantly to its performance. Such redundant parameters can therefore be removed to reduce the network size, which also helps address the overfitting problem [116, 117, 118]. In particular, the so-called 'optimal brain surgeon' [118] uses the second-order derivative, i.e., the Hessian, of the loss function with respect to the network weights. In [117], the authors assumed that the Hessian matrix is diagonal, which can lead to the removal of the wrong weights. In [118], the authors showed that the Hessian matrix is non-diagonal in most cases, and proposed a more effective weight removal strategy. In addition to its better performance, the optimal brain surgeon does not require retraining after the pruning process. Note that pruning the network also reduces the complexity and delay of the inference phase. Pruning remains a very active research area, and many different approaches have been studied, including weight, neuron, filter, and layer pruning; see [119] and references therein for a detailed survey of advanced pruning techniques. Another approach is to impose sparsity constraints directly during training, rather than reducing a fully trained network to a sparse one afterwards [120, 121].

We can also apply quantization, or more advanced compression techniques, e.g., vector quantization, to the network parameters. Quantization has long been employed to reduce the network size for efficient storage [122, 123]. It is well known that low-precision representations of network weights are sufficient not just for inference with trained networks, but also for training them [124, 125]. At the extreme, it is possible to train DNNs even with binary (single-bit) weights [126, 127].
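As a concrete illustration of the two techniques above, the sketch below applies magnitude-based pruning (a simpler criterion than the Hessian-based optimal brain surgeon) and uniform scalar quantization to a toy weight matrix; the function names and values are illustrative, not any particular paper's algorithm:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)

def uniform_quantize(weights, bits):
    """Uniform quantization of the weights to 2**bits levels over their range."""
    lo, hi = weights.min(), weights.max()
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((weights - lo) / step) * step

w = np.array([[0.8, -0.05, 0.3], [-0.02, 1.2, -0.4]])
w_pruned = magnitude_prune(w, 0.5)     # half the weights removed
w_quant = uniform_quantize(w, bits=4)  # 16-level representation
```

In practice these steps are usually followed by a short fine-tuning phase to recover any lost accuracy, and the surviving quantized weights are entropy coded for storage or transmission.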
DNN compression can also be treated as a standard source compression problem, and vector quantization techniques can be employed for codebook-based compression to reduce the memory requirements. Similarly to pruning, Hessian-based quantization is shown to be effective in [128]. Hash functions are employed in [129], where the connections are hashed into groups such that those in the same hash group share weights. It is argued in [130] that, for a typical network, about 90% of the storage is taken up by the fully connected layers, while more than 90% of the running time is taken by the convolutional layers. Therefore, the authors focus on compressing the fully connected layers to reduce the storage and communication resources, and employ vector quantization to reduce the communication rate. In [131], Huffman coding is used to further compress the quantized network weights.
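A minimal sketch of codebook-based weight compression via 1-D k-means, in the spirit of the vector quantization approaches above (the routine and its parameters are illustrative):

```python
import numpy as np

def codebook_quantize(weights, num_centroids=4, iters=20):
    """Replace each weight by its nearest centroid from a learned codebook.

    Only the small codebook plus a per-weight index needs to be stored
    or transmitted, instead of full-precision weights.
    """
    w = weights.ravel()
    # Initialize centroids uniformly over the weight range (1-D k-means).
    centroids = np.linspace(w.min(), w.max(), num_centroids)
    for _ in range(iters):
        idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
        for c in range(num_centroids):
            if np.any(idx == c):          # keep empty clusters unchanged
                centroids[c] = w[idx == c].mean()
    idx = np.argmin(np.abs(w[:, None] - centroids[None, :]), axis=1)
    return centroids[idx].reshape(weights.shape), centroids

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
w_q, codebook = codebook_quantize(w, num_centroids=4)
```

With a 4-entry codebook, each weight costs 2 index bits plus a negligible shared codebook, instead of 32 bits of full precision.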
In knowledge distillation (also referred to as knowledge transfer), the goal is to transfer the knowledge learned by a large, complex ensemble model into a smaller model without substantially degrading performance. It was first studied in [132]. In [133], the concept of temperature was introduced to generate the soft targets used for training the smaller model.
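The temperature-softened distillation objective can be sketched as follows; the loss form (soft cross-entropy at temperature T blended with the hard-label loss, with the usual T-squared scaling) follows common practice, and all names are illustrative:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; larger T gives softer distributions."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend soft-target cross-entropy (temperature T) with hard-label loss."""
    soft_t = softmax(teacher_logits, T)          # teacher's soft targets
    soft_s = softmax(student_logits, T)
    soft_ce = -np.sum(soft_t * np.log(soft_s + 1e-12), axis=-1).mean()
    hard_p = softmax(student_logits)
    hard_ce = -np.log(hard_p[np.arange(len(labels)), labels] + 1e-12).mean()
    # The T**2 factor keeps the soft-target gradients comparable in magnitude.
    return alpha * T ** 2 * soft_ce + (1 - alpha) * hard_ce

teacher = np.array([[3.0, 0.0, -1.0]])   # logits of the large ensemble model
student = np.array([[2.5, 0.2, -0.5]])   # logits of the small model
loss = distillation_loss(student, teacher, labels=np.array([0]))
```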
Another possible approach to solve this problem is model architecture optimization, where the goal is to adjust the size and complexity of DNN architectures to the constraints of the communication link during training without sacrificing their performance. Some of the popular recent efficient model architectures include SqueezeNet [134], MobileNets [135], ShuffleNet [136], and DenseNet [137]. We refer the reader to [138] for a more comprehensive survey of recent advances in model compression techniques for DNNs.
A more common scenario in remote model training is distributed training, in which multiple nodes, each with its own local dataset, collaborate to train a model by communicating with a remote parameter server, or with each other. The former scenario is known as federated learning, while the latter is referred to as fully distributed, or peer-to-peer, learning. Please see Fig. 6 for an illustration of the federated learning scenario. Similarly to the single-node scenario discussed above, federated training can be treated as a multiterminal rate-distortion problem, where the datasets are the observed samples at the multiple encoders, correlated with the underlying model, which is to be recovered at the parameter server. This corresponds to the CEO problem setting presented in Section III-B. Stochastic gradient descent (SGD)-based iterative algorithms are often used for federated learning. In the federated averaging (FedAvg) algorithm proposed in
[139], a global model is sent from the parameter server to the nodes, each of which computes a model update, typically by performing multiple stochastic gradient descent steps. The nodes then transmit these model updates back to the parameter server, which aggregates them to update the global model. The algorithm is iterated until convergence. At each iteration, the goal is thus to compute the average of the model updates, rather than the individual updates. Hence, this is a distributed lossy computation problem, which can be considered as yet another aspect of semantic communications. Here, the semantics relevant for the underlying task is a function of the multiple signals observed at different nodes.

Computation is often considered as a problem distinct from communication. One approach to computation over networks would be to carry out separate communication and computation steps. For example, if a node wants to compute a function of random variables that are distributed over the network, we can first deliver these random variables to the node, which then computes the function value. In the point-to-point setting, the optimality of this approach can be shown in certain cases following the arguments of the remote rate-distortion problem, where we treat the function to be computed as the latent variable of the observed source (see Fig. 3). However, in the general case, the optimal performance for a generic function is an open problem, even for lossless computation. The multiterminal function computation problem was first introduced in [140], where the authors considered the parity function of two correlated symmetric binary random variables. They identified the optimal rate region for this case, and showed that it is not equivalent to the rate region one would obtain from [141] by first compressing and sending the observed sequences to the decoder. This illustrates the difficulty of the problem for arbitrary function computation.
The problem was later studied in [142] in a point-to-point setting, where one of the two sources is available at the decoder as side information. The optimal rate required for lossless computation of any function $f(X, Y)$ (in the Shannon-theoretic sense, over long blocks with vanishing error probability) is characterized, and is shown to be given by the conditional graph entropy [143] $H_G(X|Y)$ of $X$ given $Y$, where $G$ is the characteristic graph of $X$, $Y$, and the function $f$ to be computed, as defined in [144]. While this is in general lower than first sending $X$ to the decoder and then computing the function, it is observed in [142] that the gain is marginal in most cases.
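To make the characteristic-graph idea concrete, the toy sketch below builds the graph for a simple function with decoder side information and colors it greedily; the encoder then transmits only the color (here, 1 bit instead of 2). The setup is purely illustrative, not the construction of [142] or [144]:

```python
from itertools import combinations

# Toy setting: X in {0,1,2,3}, decoder side information Y in {0,1},
# all (x, y) pairs jointly possible.
X_vals = [0, 1, 2, 3]
Y_vals = [0, 1]
support = {(x, y) for x in X_vals for y in Y_vals}

def f(x, y):
    return (x + y) % 2  # the function the decoder wants to compute

# Characteristic graph: connect x1 and x2 if, for some jointly possible y,
# f(x1, y) != f(x2, y), so the two values need distinct codewords.
edges = set()
for x1, x2 in combinations(X_vals, 2):
    for y in Y_vals:
        if (x1, y) in support and (x2, y) in support and f(x1, y) != f(x2, y):
            edges.add((x1, x2))

# Greedy coloring of the characteristic graph; the encoder sends colors only.
color = {}
for x in X_vals:
    taken = {color[u] for (u, v) in edges if v == x and u in color}
    taken |= {color[v] for (u, v) in edges if u == x and v in color}
    c = 0
    while c in taken:
        c += 1
    color[x] = c

def decode(c, y):
    """Recover f(x, y) from the received color and the side information y."""
    rep = next(x for x in X_vals if color[x] == c and (x, y) in support)
    return f(rep, y)
```

Here any two values of $x$ that receive the same color are guaranteed to yield the same function value for every jointly possible $y$, so the decoder can evaluate $f$ from any representative of the color class.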
IV-B Remote Inference
We next consider machine learning approaches for rate-limited remote inference problems. Following the arguments in Section III, various lossy compression problems can be considered in the context of semantic communication under an appropriate reconstruction metric. In recent years, DNN-aided compression algorithms have achieved significant results, often outperforming state-of-the-art standardized codecs in a variety of source domains, from image [145, 146, 147, 148, 149] to video [150, 151, 152, 153, 154, 155, 156, 157, 158], speech [159, 160], and audio [161, 162] compression. One of the main advantages of DNN-based approaches over conventional compression algorithms is that they can be trained for any desired reconstruction metric at the receiver. For example, image or video compression algorithms can be trained with the SSIM or MS-SSIM metrics as the objective, providing perceptually better reconstructions. We refer the reader to [163] for a comprehensive overview of recent developments in both lossless and lossy compression using DNNs and other machine learning methods.
Task-oriented image compression was considered in [164], where the authors proposed lossy compression of MRI images to preserve as much clinically useful information as possible depending on the diagnostic task to be performed. In [165], the authors propose a metric based on conditional class entropy for a target detection task. In video compression, one approach is to employ region-of-interest compression, where only the relevant part of the video stream is compressed [166, 167, 168]. A classification-aware distortion metric is proposed in [169] and applied to the high efficiency video coding (HEVC) standard.
The authors of [170] have shown that the latent representations produced by compressive autoencoders can be used to perform a classification task with a ResNet [171] network, achieving almost the same accuracy as training on the uncompressed images; this shows that the classification network does not need to explicitly reconstruct the image first. They also consider joint training of the compression and classification networks. Task-based quantization is studied in [172] in the context of analog-to-digital conversion of signals for a specific task, and for channel estimation in [173].
We remark that the aforementioned task-based compression problems are in essence remote rate-distortion problems, where machine learning tools are employed mainly to acquire statistical knowledge from data. In these cases, since the transmitter has access to the original source information, the desired task can often be carried out at the transmitter, and only the result is transmitted to the receiver over the channel. In particular, for classification tasks this would require only a limited amount of information to be transmitted. However, complete classification at the transmitter may not be possible due to complexity constraints, e.g., when the transmitter is a simple IoT device. In such a scenario, the transmitter may extract some features, which are then conveyed to the decoder using a finite number of bits, and the rest of the inference task is carried out at the receiver end. This is called 'split learning' in the literature. In the context of DNNs, split learning refers to dividing a DNN into two parts, the head and the tail. The head consists of the first layers of the DNN architecture, executed at the encoder, while the tail consists of the later layers, executed at the receiver. From a rate-distortion perspective, the goal is to convey the features obtained at the output of the head network to the receiver with as few bits as possible, while still achieving the desired inference accuracy. Quantization and/or compression of feature vectors is considered in [174, 175] and [176]. While the former considered using a quantized version of the network at the transmitter side to also reduce the storage and computation requirements, the latter considered split learning also for unsupervised learning with an autoencoder architecture. Ideas from knowledge distillation and neural image compression are exploited in [177] to obtain a more efficient compression scheme for the intermediate feature representations obtained by the head network.

A different type of remote inference problem, called image retrieval at the edge, is considered in [178]. In this setting, illustrated in Fig. 7, the goal is to identify a query image of a person or a vehicle recorded locally by matching it against images stored in a large database (gallery), typically available only at the edge server. We emphasize that, unlike typical classification tasks, the retrieval task cannot be performed locally, as the database is available only at the remote edge server. In [178], the authors propose a retrieval-oriented image compression scheme, which compresses the feature vectors most relevant for the retrieval task, depending on the available bit budget. To reduce the communication rate, the authors quantize and entropy code the features to be transmitted, using a learned probability model for the quantized bits for efficient compression.
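Both the split inference and the edge retrieval pipelines above hinge on quantizing intermediate features before transmission. A minimal sketch of such a split pipeline, with a head network on the device, uniform feature quantization, and a tail network at the server (random untrained weights, purely to show the structure; all dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network split into a head (on-device) and a tail (server).
W_head = rng.normal(scale=0.5, size=(16, 8))
W_tail = rng.normal(scale=0.5, size=(8, 3))

def head(x):
    """On-device layers: produce the intermediate feature vector."""
    return np.maximum(x @ W_head, 0.0)   # ReLU features

def quantize(z, bits=4):
    """Uniform quantization of the features before transmission."""
    scale = max(z.max(), 1e-9)
    levels = 2 ** bits - 1
    return np.round(z / scale * levels) * scale / levels

def tail(z):
    """Server-side layers: complete the inference from received features."""
    return np.argmax(z @ W_tail, axis=-1)

x = rng.normal(size=(5, 16))     # a batch of five input samples
z = head(x)                      # features computed on the device
z_hat = quantize(z, bits=4)      # 4 bits per feature over the channel
pred = tail(z_hat)               # class decisions at the receiver
```

In a trained system the split point and the number of bits per feature trade off device computation, communication rate, and inference accuracy.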
In [179], the classification-distortion-perception tradeoff is studied, assuming that the reconstructed image at the receiver is also fed into a prescribed classification network. It is shown that the classification error rate on the reconstructed signal evaluated by the prescribed classifier cannot be minimized simultaneously with the distortion and perception measures. A similar semantic-oriented compression approach is applied to facial image compression in [180], where regionally adaptive pooling is used to optimize the compression parameters according to gradient feedback from a hybrid distortion-perception-semantic fidelity metric. It is shown that the semantic distortion metric allows allocating more bits to the compression of the more semantically critical areas of face images. Automatic generation of semantic importance maps is considered in [181], where the output of instance segmentation (a combination of object detection and semantic segmentation) through Mask R-CNN [182] is used as the importance measure of each segment, and the bit allocation is carried out using reinforcement learning.
V Semantic and Task-Oriented Communication over Noisy Channels: A JSCC Approach
In the previous section, we mainly focused on the compression aspects, assuming an error-free, finite-rate communication channel from the encoder to the decoder. However, many communication channels suffer from noise, interference, and other imperfections. Shannon's channel coding theory deals with communication over such noisy channels. However, as mentioned earlier, channel coding theory focuses on the reliable delivery of bits, whereas in the context of semantic communication we consider the transmission of source signals, such as image, video, and audio, or their features relevant for a particular task, over a noisy channel.
In this problem, illustrated in Figure 8, the transmitter wants to transmit a sequence of $m$ independent source symbols $S^m = (S_1, \ldots, S_m)$, each sampled from the distribution $p(s)$, over a memoryless noisy communication channel characterized by the conditional probability distribution $p(y|x)$, where $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ denote the channel input and output symbols, respectively. Let $\hat{S}^m$ denote the reconstruction at the receiver based on the channel output $Y^n$. Similarly to the rate-distortion theory formulation, the goal is to minimize the distortion between $S^m$ and $\hat{S}^m$ under some given distortion (fidelity) measure $d(\cdot, \cdot)$. More formally, let $f: \mathcal{S}^m \to \mathcal{X}^n$ denote the encoding function, and $g: \mathcal{Y}^n \to \hat{\mathcal{S}}^m$ the decoding function. In the case of an average distortion criterion, the goal is to identify the encoder and decoder function pairs that minimize $\mathbb{E}[d(S^m, \hat{S}^m)]$, where the expectation is over the source and channel distributions as well as any randomness the encoding and decoding functions may introduce. One can also impose an excess distortion criterion, where the goal is to minimize $\Pr\{d(S^m, \hat{S}^m) > D\}$, for some maximum allowable distortion target $D$.

A joint source-channel code of rate $\rho = n/m$ channel uses per source symbol consists of an encoder-decoder pair, where the encoder maps each source sequence $S^m$ to a channel input sequence $X^n$, and the decoder maps the channel output $Y^n$ to an estimated source sequence $\hat{S}^m$. A rate-distortion pair $(\rho, D)$ is said to be achievable if there exists a sequence of joint source-channel codes of rate $\rho$ such that

$$\limsup_{m \to \infty} \mathbb{E}\big[d(S^m, \hat{S}^m)\big] \le D. \qquad (22)$$
Shannon proved his well-known Separation Theorem for a single-letter additive distortion measure; that is, for distortion measures of the form $d(s^m, \hat{s}^m) = \frac{1}{m} \sum_{i=1}^m d(s_i, \hat{s}_i)$. The theorem states the following.
Theorem 6.
(Shannon's Separation Theorem, [6]) Given a memoryless source with rate-distortion function $R(D)$ and a memoryless channel with capacity $C$, a rate-distortion pair $(\rho, D)$ is achievable if $R(D) < \rho C$. Conversely, if a rate-distortion pair $(\rho, D)$ is achievable, then $R(D) \le \rho C$.
The theorem states that we can separate the design of the communication system into two subproblems without loss of optimality: the first focusing on compression and the second on channel coding, each designed independently of the other. However, the optimality of separation holds only in the limit of infinite blocklength, whereas, in practice, it is possible to design joint source-channel codes that outperform the best separate designs. This was observed by Shannon in his 1959 paper [10]. Considering a binary source generating independent and equiprobable symbols and a memoryless binary symmetric channel, Shannon observed that simple uncoded transmission of the source symbols achieves the optimal distortion at rate $\rho = 1$, for the particular distortion value determined by the error probability of the channel. This observation was later extended by Goblick in [183] to Gaussian sources transmitted over Gaussian channels. Such optimality occurs when the source distribution matches the capacity-achieving input distribution of the channel, and the channel at hand matches the optimal test channel achieving the rate-distortion function of the source. Necessary and sufficient matching conditions are given in [184] for general source and channel distributions. However, these conditions are not satisfied for most practical source and channel distributions, and even when they hold, the optimality of uncoded transmission fails when the coding rate is not $\rho = 1$, i.e., in the case of bandwidth compression or expansion. On the other hand, the existence of such optimality results, that is, the fact that the asymptotically optimal performance requiring infinite-blocklength source and channel codes can be achieved by simple zero-delay uncoded transmission, suggests that other non-separate coding schemes may achieve near-optimal performance, or outperform separation-based schemes in the finite blocklength regime.
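As a concrete instance of this source-channel matching, consider a zero-mean Gaussian source with variance $\sigma^2$ transmitted over an AWGN channel with noise variance $N$ under power constraint $P$, at rate $\rho = 1$:

```latex
% Uncoded transmission: scale the source to meet the power constraint.
X = \sqrt{P/\sigma^2}\, S, \qquad Y = X + Z, \quad Z \sim \mathcal{N}(0, N).
% The MMSE estimate \hat{S} = \mathbb{E}[S \mid Y] yields distortion
D_{\mathrm{unc}} = \frac{\sigma^2 N}{P + N}.
% Separation-based optimum: set R(D) = C,
\tfrac{1}{2}\log\frac{\sigma^2}{D} = \tfrac{1}{2}\log\Bigl(1 + \frac{P}{N}\Bigr)
\;\Longrightarrow\;
D_{\mathrm{opt}} = \frac{\sigma^2}{1 + P/N} = \frac{\sigma^2 N}{P + N} = D_{\mathrm{unc}}.
```

A single scaled channel use thus attains exactly the distortion that separation achieves only in the limit of infinite blocklength, which is the essence of Goblick's observation.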
V-A Multiterminal JSCC
It is well known that the optimality of separation does not directly generalize to multiterminal scenarios, even in the infinite-blocklength regime. This observation is often attributed to the seminal work of Cover, El Gamal, and Salehi [185], who considered the transmission of correlated sources over a multiple access channel (MAC) and provided an example in which uncoded transmission of the sources allows their perfect recovery, while this cannot be achieved by a separate scheme. Interestingly, it is much less known that a similar observation was already made by Shannon in [186], where he considered the transmission of correlated sources over a two-way channel.
The authors of [185] also proposed a coding technique (an achievability result) exploiting the correlation among the sources. This result showed that, instead of removing the correlation, we can utilize the dependence among the sources to design correlated channel codes, and in certain cases transmit the sources reliably even though this would not be possible with distributed compression followed by independent channel coding. Shortly afterwards, a counterexample was given in [187], showing that the sufficiency conditions provided in [185] are not necessary. More recently, [188] gave finite-letter sufficiency conditions for the lossless delivery of correlated sources over a MAC. The similar problem of broadcasting correlated sources to multiple users is considered in [106], while [189] considers broadcasting to multiple receivers, each with a different distortion measure and side information. Many JSCC transmission strategies have also been studied extensively for Gaussian sources over multiterminal channels [190, 191, 192], as well as in non-ergodic scenarios [193].
V-B Remote Inference over Noisy Channels
In Subsection III-E, we presented various statistical inference problems under rate constraints. We highlighted that these inference problems do not satisfy the additive single-letter requirement of typical Shannon-theoretic distortion measures considered in the context of rate-distortion theory. Therefore, the separation theorem does not directly apply to these distortion measures.
The distributed hypothesis testing problem over a noisy communication channel was studied in [194], considering the type-II error exponent (under a prescribed constraint on the type-I error probability) as the performance measure. Here, the task is to make a decision on the joint distribution of the samples observed by a remote observer and those observed by the decision maker, where the observer communicates with the decision maker over a noisy channel. A separate hypothesis testing and channel coding scheme is presented, combining the Shimokawa-Han-Amari scheme [195] with a channel code that achieves the expurgated exponent with the best error exponent for a single special message [196]. A joint scheme is also proposed using hybrid coding [197]. It is shown that the separate scheme achieves the optimal type-II error exponent when testing against independence, i.e., the special case in which the alternative hypothesis is the independence of the samples observed by the observer and the decision maker. While the optimal type-II error exponent remains open in general, it is shown in [194] that joint encoding can strictly improve upon separation. This shows that communication and inference cannot be separated without loss of optimality, but how the two should interact remains vastly unexplored.
Distributed hypothesis testing over independent additive white Gaussian noise (AWGN) and fading channels is studied in [198] and [199], respectively. These papers consider multiple sensors that make noisy observations of the underlying hypothesis and communicate with a fusion center over orthogonal noisy channels. Hypothesis testing over a discrete MAC is studied in [200], where the observations are quantized before being transmitted. Distributed estimation over a MAC is studied in [201], and a type-based uncoded transmission scheme is shown to be asymptotically optimal. Distributed estimation over a MAC is also studied in [202] from a worst-case risk point of view. Analog/uncoded transmission is again shown to outperform its digital separation-based counterparts, and to achieve a worst-case risk within a logarithmic factor of an information-theoretic lower bound.
VI Practical Designs for Goal-Oriented Communication over Noisy Channels
Practical designs for JSCC of various information sources have been a long-standing research challenge. Many different designs have been proposed in the literature, mainly based on the joint optimization of the parameters of an inherently separate design [203, 204, 205, 206, 207, 208, 209, 210, 211]. Another group of JSCC schemes instead considers a truly joint design. Motivated by the theoretical optimality of uncoded transmission in certain ideal scenarios, analog transmission of discrete cosine transform (DCT) coefficients is proposed in [212] for wireless image transmission. In [213], the authors proposed linear coding of quantized wavelet coefficients. However, these efforts in JSCC design either do not provide sufficient performance gains, or they are too complex and specific to the underlying source and channel distributions to be applied in practice. Recently, JSCC schemes based on autoencoders [214], which are DNNs aimed at unsupervised data coding, have been introduced [215, 216, 217, 218, 219] and shown to provide comparable or better performance than state-of-the-art separation-based digital schemes. In addition to improving the performance for a fixed channel state, these JSCC schemes also achieve graceful degradation with channel quality. That is, unlike separation-based approaches, their performance does not collapse when the channel quality falls below a certain threshold. Similarly to DNN-aided data compression schemes, DNN-aided JSCC schemes also have the flexibility to adapt to a particular distortion measure, source domain, or channel distribution.
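The structure of such an autoencoder-based JSCC scheme can be sketched as follows, with a linear encoder/decoder standing in for the trained DNNs and explicit power-normalization and AWGN channel layers (all dimensions and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def power_normalize(x, P=1.0):
    """Scale each codeword to satisfy the average power constraint P."""
    norm = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True))
    return np.sqrt(P) * x / (norm + 1e-12)

def awgn(x, snr_db):
    """AWGN channel layer (differentiable when ported to a DL framework)."""
    sigma = np.sqrt(10 ** (-snr_db / 10))
    return x + sigma * rng.normal(size=x.shape)

# A linear encoder/decoder pair stands in for the trained DNNs; a real
# DeepJSCC system learns both ends jointly by backpropagating through awgn().
W_enc = rng.normal(scale=0.3, size=(32, 8))   # bandwidth compression 32 -> 8
W_dec = np.linalg.pinv(W_enc)

s = rng.normal(size=(4, 32))           # a batch of source vectors
x = power_normalize(s @ W_enc)         # channel input: 8 uses per source
y = awgn(x, snr_db=20.0)               # noisy channel output
s_hat = y @ W_dec                      # reconstruction at the receiver
mse = np.mean((s - s_hat) ** 2)
```

Because the channel output feeds directly into the decoder with no digital interface in between, a gradual drop in SNR degrades the reconstruction gradually, which is the source of the graceful degradation noted above.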
The general schematic of a semantic communication system is shown in Fig. 9. The input data is passed sequentially through a semantic encoder and a joint source-channel encoder to extract the semantic information relevant to the receiver's task, which can be either source signal recovery or intelligent task execution. The benefit of this semantic communication approach stems both from the intelligent semantic encoding step, which extracts task-relevant features from the input, and from the JSCC approach to delivering these features to the receiver.
Task-oriented semantic communications have attracted intensive investigation in recent years thanks to their capability to address pertinent challenges of traditional communication systems, and they are considered one of the key technologies to cater to the unprecedented demands of intelligent tasks in the future intelligent communications era. While the JSCC approach can potentially outperform separate source and channel coding, particularly in the short-blocklength regime, it loses modularity. Modularity refers to the separate design of source and channel coding schemes, where the source encoder can be designed oblivious to the channel statistics or the particular channel coding and modulation scheme employed for communication; all the source encoder needs to know is the level of compression, in terms of bits per source sample. Similarly, the channel code can be designed oblivious to the particular source signal and its statistics. In the case of JSCC, however, since the code is designed in an end-to-end fashion, the source and channel statistics must be taken into account jointly. Therefore, we will introduce task-oriented semantic communications separately for the different types of source data: text, speech, and image.
VI-1 Task-Oriented Semantic Communications for Text
For networks that allow interactions between humans as well as smart devices with unique backgrounds and behavior patterns, reliable communication can be redefined as preserving the intended meaning of messages at reception. Further, communicating parties can form social relationships and build trust, which may further affect how the received messages are interpreted. Motivated by these factors, reference [220] proposed an approach that takes into account the meanings of the communicated messages, and demonstrated the design of a point-to-point link that reliably communicates the meanings of messages over a noisy channel. To do so, the authors of [220] proposed a novel performance metric, the semantic error, to measure how accurately the meanings of messages are recovered, and then determined the optimal transmission policies that best preserve the meanings of the recovered messages. This is achieved by leveraging lexical taxonomies and contextual information to design a graph-based index assignment scheme for fixed-rate codeword assignment over a noisy channel. Words that are similar in meaning (as measured by the semantic distance between them) are assigned to codewords that are close in Hamming distance.
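The index assignment idea can be illustrated with a toy vocabulary: pick a fixed-rate codeword assignment so that a single bit flip maps a word to a semantically close one. The vocabulary, semantic distances, and exhaustive search below are illustrative only, not the graph-based scheme of [220]:

```python
import numpy as np
from itertools import permutations, product

# Hypothetical 4-word vocabulary with pairwise semantic distances
# (small = similar meaning); the numbers are purely illustrative.
words = ["car", "auto", "dog", "cat"]
sem = np.array([[0, 1, 9, 9],
                [1, 0, 9, 9],
                [9, 9, 0, 2],
                [9, 9, 2, 0]])

BITS = 2
codewords = list(product([0, 1], repeat=BITS))  # fixed-rate 2-bit codebook

def expected_semantic_error(perm):
    """Average semantic distortion when one codeword bit is flipped by noise."""
    inv = {codewords[perm[i]]: i for i in range(len(words))}
    total = 0.0
    for i in range(len(words)):
        c = codewords[perm[i]]
        for b in range(BITS):
            flipped = tuple(v ^ (1 if k == b else 0) for k, v in enumerate(c))
            total += sem[i, inv[flipped]]
    return total / (len(words) * BITS)

# Exhaustive search over index assignments (feasible for a toy vocabulary).
best = min(permutations(range(len(words))), key=expected_semantic_error)
best_cost = expected_semantic_error(best)
```

In the optimal assignment the two synonym pairs end up one bit flip apart, so the most likely channel errors incur small semantic distortion; [220] achieves an analogous effect at scale using lexical taxonomies instead of exhaustive search.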
Building on this work, [221] modeled an external agent who can influence how the destination perceives the meaning of the received information, in order to study the impact of social influence on contextual message interpretation in semantic communication. The exact nature of the agent, whether adversarial or helpful, is unknown to the communicating parties. This problem is first modeled in [221] as a Bayesian game played between the encoder/decoder pair and the influential entity. By extending the Bayesian game to a dynamic setting, the authors studied the interplay between the influential entity and the communicating parties, in which each player learns the true nature of the other by updating its beliefs as the game progresses, with information revealed through observed actions. While these works predate the recent efforts that brought semantics to center stage, they serve as early works to build on in designing semantic communication networks.
However, the semantic error of the aforementioned text-based semantic communication system is measured only at the word level. In an early effort on JSCC for text transmission using deep learning techniques, the authors of [216] considered sentence-level similarity using the edit distance. In particular, they studied the transmission of text over an erasure channel, designed a JSCC scheme using long short-term memory (LSTM) networks as the encoder and decoder, and showed that it can outperform separation-based approaches using Lempel-Ziv or Huffman coding. A transformer-powered semantic communication system for text, named DeepSC, is proposed in [222] by utilizing the meaning difference between the transmitted and received sentences. DeepSC is shown to yield better performance than traditional communication systems over AWGN channels and to be more robust to channel variations, especially in the low signal-to-noise ratio (SNR) regime.
The core idea behind task-oriented semantic communications for text is to extract the useful information, e.g., grammatical structure, word meanings, and logical relationships between words, needed to carry out intelligent tasks at the receiver, while ignoring the exact symbolic representation of the words. In [223], Xie et al. designed a multi-user semantic communication system to execute text-based tasks by transmitting text semantic features. In particular, a transformer-enabled model, named DeepSC-MT, is proposed to perform machine translation for English-to-Chinese and Chinese-to-English by minimizing the meaning difference between sentences. The objective of DeepSC-MT is to map the meaning of the source sentence to the target language, which is achieved by learning the word distribution of the target language. Therefore, the cross-entropy is utilized as the loss function, represented as
$\mathcal{L}_{\mathrm{CE}} = -\sum_{i} q(w_i) \log p(w_i),$  (23)

where $q(w_i)$ and $p(w_i)$ are the real and predicted probabilities that the $i$-th word $w_i$ appears in the real translated sentence $\mathbf{s}$ and the predicted translated sentence $\hat{\mathbf{s}}$, respectively.
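As a concrete illustration of the loss in (23), the following minimal stdlib-Python sketch computes the cross-entropy between a target word distribution and a predicted one. The names and the toy numbers are illustrative, not taken from [223]:

```python
import math

def cross_entropy_loss(true_probs, pred_probs, eps=1e-12):
    """Cross-entropy between the real and predicted word distributions.

    true_probs[i] is the probability that the i-th vocabulary word appears
    in the real translated sentence; pred_probs[i] is the model's predicted
    probability. (A sketch of (23); eps guards against log(0).)
    """
    return -sum(q * math.log(p + eps) for q, p in zip(true_probs, pred_probs))

# One-hot targets, as in machine translation, reduce the loss to the
# negative log-likelihood of the correct word at each position.
target = [0.0, 1.0, 0.0]       # the second vocabulary word is correct
prediction = [0.1, 0.8, 0.1]   # model's predicted distribution
loss = cross_entropy_loss(target, prediction)
```

With the one-hot target above, the loss reduces to $-\log 0.8$, the negative log-probability assigned to the correct word.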
The visual question answering task is also investigated in [223] based on a multimodal multi-user system. In particular, compressed text semantic features and image semantic features are extracted by a text semantic encoder and an image semantic encoder at the transmitter, respectively. In addition, a layer-wise transformer-enabled model is utilized at the receiver to perform the information query before fusing the image-text information to infer an accurate answer.
VI-2 Task-Oriented Semantic Communications for Speech
Semantic extraction from speech signals is more complicated than from text. For speech, the semantic information to be transmitted may include the text content, emotional expression, type of language, etc., which increases the difficulty of extracting semantic features. Weng et al. investigated a semantic communication system for speech signal reconstruction in [222], which aims to minimize the MSE between the input and recovered speech sequences. Moreover, in [224], a speech-recognition-oriented semantic communication system, named DeepSC-SR, has been developed to obtain the text transcription by transmitting the extracted text-related semantic features. In particular, two convolutional layers are employed to compress the input speech signals into a low-dimensional representation before passing them through multiple gated recurrent unit (GRU)-based bidirectional recurrent neural network (BRNN)
[225] modules. The text transcription is recognized at the character level by minimizing the difference between the character distributions of the source text sequence and the predicted text sequence. Following the connectionist temporal classification (CTC) approach [226], the loss function can be expressed as

$\mathcal{L}_{\mathrm{CTC}}(\boldsymbol{\theta}) = -\log \sum_{A \in \mathcal{A}_{\mathbf{s},\mathbf{m}}} p(A \mid \mathbf{m}; \boldsymbol{\theta}),$  (24)
where $\mathcal{A}_{\mathbf{s},\mathbf{m}}$ represents the set of all possible valid alignments of the text sequence $\mathbf{s}$ to the input speech $\mathbf{m}$, $p(A \mid \mathbf{m}; \boldsymbol{\theta})$ denotes the posterior probability of recovering one of the valid alignments $A$ based on $\mathbf{m}$, and $\boldsymbol{\theta}$ denotes the trainable parameters of the whole system.
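The CTC loss in (24) sums the probabilities of all valid alignments that collapse to the target transcription. The following stdlib-Python sketch (illustrative, not the DeepSC-SR implementation) computes it by brute-force enumeration on a toy alphabet; practical systems use the dynamic-programming forward algorithm instead:

```python
import itertools
import math

def collapse(alignment, blank="-"):
    """CTC collapse rule: merge adjacent repeats, then drop blanks."""
    merged = []
    for sym in alignment:
        if not merged or sym != merged[-1]:
            merged.append(sym)
    return tuple(s for s in merged if s != blank)

def ctc_loss(log_probs, target, vocab, blank="-"):
    """Negative log of the total probability of all length-T alignments
    that collapse to `target` (the sum in (24)), computed by brute force.
    log_probs[t][v] is the frame-t log-probability of symbol v.
    Exponential in T; for illustration only."""
    T = len(log_probs)
    total = 0.0
    for alignment in itertools.product(vocab, repeat=T):
        if collapse(alignment, blank) == tuple(target):
            total += math.exp(sum(log_probs[t][v] for t, v in enumerate(alignment)))
    return -math.log(total)

# Toy example: 3 frames, uniform per-frame distribution over {blank, a, b}.
uniform = {v: math.log(1 / 3) for v in ("-", "a", "b")}
loss = ctc_loss([uniform, uniform, uniform], ("a", "b"), ("-", "a", "b"))
# Exactly five alignments collapse to "ab": -ab, a-b, ab-, aab, abb.
```

Each of the five valid alignments has probability $(1/3)^3$, so the loss equals $-\log(5/27)$.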
Inspired by DeepSC-SR, a semantic communication system for speech recognition at the word level is proposed in [227], in which a visual geometry group (VGG)-enabled redundancy removal module is utilized to compress the transmitted data. The objective of this system is to convert the word distribution into a readable text transcription, which is achieved through a cross-entropy loss between the source word sequence and the predicted word sequence. Seo et al. developed a novel stochastic model of semantic-native communication (SNC) for generic tasks [228], where the speaker refers to an entity by extracting and transmitting its symbolic representation. A curriculum learning framework for goal-oriented task execution is investigated in [229], where the speaker describes the environment observations to enable the receiver to capture efficient semantics based on the defined language by using the concept of beliefs.
VI-3 Task-Oriented Semantic Communications for Image and Video
Due to their high-data-rate content, significant contribution to network traffic, and diverse applications ranging from live video streaming to augmented/virtual reality and video gaming, semantic delivery of image and video content is essential for next-generation communication systems, and has therefore been studied extensively in recent years. In [215], a neural-network-aided JSCC scheme was proposed for the first time for the efficient delivery of images over wireless channels. The authors proposed an autoencoder-based DeepJSCC scheme, where the channel is treated as an untrainable bottleneck layer. The surprising result in [215] showed that the proposed DNN-based solution could outperform the concatenation of state-of-the-art image compression techniques (e.g., BPG) with state-of-the-art channel coding (e.g., LDPC) at a prescribed channel SNR. We highlight that DNN-based image compression techniques could only very recently outperform BPG [148], and their design is quite complex, requiring not only the training of a learned transform coding approach, but also learning the distribution of the quantized features for efficient entropy coding. Similarly, DNN-based channel code designs so far cannot match the performance of state-of-the-art channel codes, such as LDPC, in the long-blocklength regime that would be used in image and video transmission. On the other hand, the DeepJSCC scheme proposed in [215], and later improved in [230], can outperform their combination, despite its rather simple architecture. This is because JSCC is a comparatively easier problem for DNNs to learn: they simply need to learn to map similar signals in the source domain to similar channel inputs, such that, after noise is added, these can be mapped to similar reconstructed signals, minimizing the error.
On the other hand, learning digital compression and channel coding schemes is a much harder problem due to the structure they need to create, and the discrete nature of the problem makes it more difficult to learn through SGD. The joint approach also provides a significant speed-up in end-to-end delivery. The encoding and decoding tasks can be carried out with significantly lower latency in DeepJSCC, thanks to its simple neural network architecture and the complete parallelization it provides, compared to conventional compression and channel coding algorithms, which are often iterative and can be highly complex.
As mentioned earlier, another significant benefit of the end-to-end DNN-based approach is the graceful degradation it provides; that is, the performance of a network trained for a specific channel SNR generalizes to other SNRs. This capability was exploited in [215] to show that DeepJSCC can outperform the separation-based alternatives by an even greater margin when used over a fading channel without channel state information (CSI). It was later shown in [231] that, when CSI is available at the transmitter and the receiver, an attention mechanism can be used to train a single network that achieves the best performance at every SNR.
In [232], it is shown that DeepJSCC can also achieve successive refinability; that is, the image can be delivered in multiple stages, using gradually more bandwidth, with minimal loss in performance. This means that receivers can keep listening to the transmission until they recover the transmitted image at the desired quality.
Another challenge in communication systems is to exploit feedback. It was shown by Shannon that feedback does not increase the channel capacity. Therefore, in the infinite-blocklength regime, it does not help from a JSCC perspective either: since separation is optimal in this regime, what matters for the end-to-end performance is the channel capacity, which feedback cannot improve. On the other hand, it is known that feedback improves the error exponent in channel coding [233, 234]. More interestingly, when transmitting a Gaussian source over a Gaussian channel, it is shown in [235] that the optimality of uncoded transmission, established in [183] only for the case of matched bandwidth between the source and the channel, extends to arbitrary bandwidth ratios. In [217], the DeepJSCC scheme is extended to channels with channel output feedback, and it is shown that feedback can significantly improve the end-to-end performance. In particular, the required bandwidth for image delivery can be halved when variable-rate transmission is allowed and channel feedback is exploited to stop the transmission once the required reconstruction quality is reached at the receiver.
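The stop-when-good-enough mechanism can be sketched in a few lines. This is a toy stdlib-Python illustration of the idea exploited in [217], with hypothetical helper names (`channel`, `quality_fn`), not the actual DeepJSCC-feedback code:

```python
def transmit_with_feedback(chunks, channel, quality_fn, target):
    """Variable-length delivery with channel output feedback: send
    refinement chunks one at a time; after each chunk the receiver feeds
    back its current reconstruction quality, and the transmitter stops as
    soon as the target quality is reached, saving the remaining bandwidth."""
    received = []
    for chunk in chunks:
        received.append(channel(chunk))       # forward transmission
        if quality_fn(received) >= target:    # feedback link: quality report
            break
    return received

# Toy run: a noiseless channel, and a "quality" that just counts chunks,
# so transmission stops after 2 of the 4 available chunks.
out = transmit_with_feedback([10, 20, 30, 40], channel=lambda c: c,
                             quality_fn=len, target=2)
```

The bandwidth saving comes precisely from the early exit: the last chunks are never sent once the receiver reports the target quality.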
The benefits of DNN-aided JSCC are extended to OFDM channels in [236, 237]. In all these works, the encoder is free to map the input signal to arbitrary points in the channel input space; that is, the channel inputs are limited only by an average power constraint, and there is no fixed input constellation. However, in some practical communication systems, the hardware is constrained to a fixed constellation diagram. In [238], a differentiable quantization approach is used to map channel inputs to points from a prescribed discrete constellation, and it is shown that the performance loss compared to DeepJSCC with unconstrained channel inputs remains limited as long as a sufficiently rich constellation is employed. In [239], JSCC of images transmitted over binary symmetric and binary erasure channels is considered using a variational autoencoder (VAE) with a Bernoulli prior. To overcome the challenges imposed by the non-differentiability of discrete latent random variables (i.e., the channel inputs), unbiased low-variance gradient estimation is used, and the model is trained using a lower bound on the mutual information between the images and their binary representations.
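To make the constellation constraint concrete, the sketch below builds a unit-power square QAM constellation and maps a continuous channel input to its nearest point. This is an illustrative stdlib-Python toy, not the scheme of [238]; the differentiable relaxation used there for training is only indicated in a comment:

```python
import math

def qam_constellation(order):
    """Square QAM constellation with unit average power.
    `order` must be a perfect square (e.g., 4, 16, 64)."""
    m = int(math.isqrt(order))
    assert m * m == order, "order must be a perfect square"
    levels = [2 * i - (m - 1) for i in range(m)]  # e.g., [-3,-1,1,3] for 16-QAM
    points = [complex(a, b) for a in levels for b in levels]
    power = sum(abs(p) ** 2 for p in points) / len(points)
    return [p / math.sqrt(power) for p in points]

def quantize(x, constellation):
    """Map a continuous channel input to the nearest constellation point.
    During training this hard decision would be relaxed (e.g., a softmax
    over distances, or a straight-through estimator) so gradients can flow."""
    return min(constellation, key=lambda p: abs(x - p))

const = qam_constellation(16)
y = quantize(0.9 + 0.2j, const)
```

A richer constellation (larger `order`) makes the quantization error, and hence the loss relative to unconstrained DeepJSCC, smaller.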
Note that, similarly to neural image compression techniques, an important advantage of DNN-based JSCC approaches over conventional compression and channel coding techniques is that these codes can be trained for any desired fidelity measure, including various inference tasks that do not require a complete reconstruction of the source signal. Indeed, it has been shown that DNN-based JSCC approaches outperform their conventional counterparts particularly in terms of the SSIM and MS-SSIM performance measures, which are known to better capture the perceived quality, or semantics, of the transmitted images. Similarly, adversarial losses can also be used to further improve the perceptual quality of the reconstructed images [236].
A semantic communication system for image retrieval over a wireless channel is considered in [178], which aims to identify the images most similar to a query image in a large image dataset. The authors proposed directly mapping the extracted image features to channel inputs through a three-step training procedure: feature encoder pretraining, followed by JSCC autoencoder pretraining, and finally end-to-end training. It is shown that this joint approach significantly outperforms the separation-based approach that combines the retrieval-oriented compression scheme mentioned in Section IV-B with a capacity-achieving channel code. This is despite the fact that a very short blocklength is considered, and hence capacity is far from achievable. This scheme has recently been extended in [223]. A semantic communication system based on semantic rate-distortion theory for multiple image tasks has been investigated in [240], in which the source image is first restored at the receiver before intelligent task execution.
The performance of DeepJSCC for image transmission has also been tested and verified in practical communication systems using a software-defined radio testbed in [230], which is also exhibited in [241]. The authors in [242] proposed a real-time semantic communication testbed based on a vision transformer.
The first work on JSCC of video signals over wireless channels employing DNNs was carried out in [243]. In this paper, video signals are divided into groups of pictures (GoPs), similarly to common video compression standards. Each GoP is directly mapped to channel input symbols of a fixed bandwidth. The first frame of each GoP is treated as a key frame, and is transmitted on its own using JSCC techniques similar to the one in [215]. The remaining frames are transmitted using an interpolation encoder, which encodes the motion information in each frame and the residual information with respect to the nearest two key frames as references. Scaled-space flow is used to estimate the motion information with the DNN architecture proposed in [244]. There are two challenges in the proposed method. First, because the key frames are delivered via JSCC, the encoder does not know their exact reconstructions at the receiver, which depend on the noise realization. The authors use a stochastic encoding method, in which the encoder emulates the channel and generates a reconstruction of the key frame using the channel statistics; the interpolation is then based on this stochastically generated version of the key frame. Second, the total bandwidth for the GoP is limited, yet one would generally expect to allocate more channel bandwidth to frames with more motion content. This allocation is learned through reinforcement learning in [243]. The authors show that the proposed learned bandwidth allocation strictly improves upon equal allocation of the available bandwidth among the frames. The results in [243] show that the proposed JSCC technique for video delivery, called DeepWiVe, not only provides graceful degradation with channel SNR, similarly to DeepJSCC, but also outperforms state-of-the-art separation-based digital transmission alternatives combining H.264 or H.265 video coding with LDPC channel coding at a specified channel SNR. These results are promising, as they show the potential advantages of DNN-based JSCC techniques for future augmented/virtual reality (AR/VR) applications on wireless headsets.

A semantic communication system for image classification has been proposed in [245], which adopts a variational information bottleneck (VIB) framework to resolve the difficulty of computing the mutual information terms of the original information bottleneck (IB) [78]. The adopted loss function can be expressed as
$\mathcal{L}(\boldsymbol{\theta}, \boldsymbol{\phi}) = \mathbb{E}\big[-\log q_{\boldsymbol{\phi}}(y \mid \hat{z})\big] + \beta\, D_{\mathrm{KL}}\big(p_{\boldsymbol{\theta}}(z \mid x) \,\|\, q(z)\big),$  (25)

where $x$ represents the input image, $y$ denotes the target label, and $\hat{z}$ is the recovered semantic information at the receiver. $q_{\boldsymbol{\phi}}(y \mid \hat{z})$ and $q(z)$ are two variational distributions that approximate the true distributions $p(y \mid \hat{z})$ and $p(z)$, respectively. $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ are the trainable parameters at the transmitter and the receiver, respectively, and $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence.
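The two terms of the VIB loss in (25) can be computed in closed form when the encoder posterior is a diagonal Gaussian and the prior is standard normal. The sketch below (stdlib Python, illustrative names, not the code of [245]) combines the distortion term and the KL regularizer:

```python
import math

def kl_gauss_std(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return sum(0.5 * (s * s + m * m - 1.0 - 2.0 * math.log(s))
               for m, s in zip(mu, sigma))

def vib_loss(log_q_y_given_z, mu, sigma, beta):
    """Per-sample VIB objective, a sketch of the loss in (25):
    the distortion term -log q(y|z) plus beta times the KL of the
    Gaussian encoder posterior from a standard-normal prior."""
    return -log_q_y_given_z + beta * kl_gauss_std(mu, sigma)

# When the posterior equals the prior (mu=0, sigma=1), the KL term vanishes
# and the loss reduces to the classification negative log-likelihood.
loss = vib_loss(log_q_y_given_z=math.log(0.7), mu=[0.0, 0.0],
                sigma=[1.0, 1.0], beta=1e-3)
```

The trade-off parameter `beta` controls how strongly the representation is compressed toward the prior.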
VI-A Distributed Training over Noisy Channels
We can also extend the model training tasks presented in Section IV-A for rate-limited channels to training over noisy channels. Since model training is often carried out over many iterations, training among wireless devices imposes strict delay constraints per iteration. Hence, the conventional approach of separate model compression and communication would not meet the desired delay and complexity requirements [246].
The problem of training and delivering a DNN to a remote terminal, called AirNet, is considered in [247], extending the model in Fig. 5 by replacing the rate-limited error-free link with a noisy wireless channel. The conventional approach would be to first train a DNN, which is then delivered reliably over the bandwidth-limited channel. One can either train a low-complexity model, such as MobileNet or ShuffleNet, or first train a larger model and then compress it to a level that can be delivered over the limited-capacity wireless link. Here, the size of the delivered model is dictated by the available channel capacity, and errors over the channel further reduce the accuracy of inference at the decoder side.
An alternative joint training and channel coding approach is considered in [247], where the trained neural network weights are delivered over the wireless channel in an analog fashion; that is, they are mapped directly to the channel inputs. However, given the large size of DNNs, this would require a very large bandwidth. Moreover, the receiver recovers a noisy version of the network weights, which varies with the noise realization. The authors propose two strategies to remedy these problems. Pruning [248] is employed to reduce the network size without significantly sacrificing its performance: the encoder first trains a large-dimensional DNN, which is then pruned to fit the available channel bandwidth. Choosing a large DNN as an initial point, rather than directly training a DNN of dimension equal to the available channel bandwidth, is motivated by the literature [249], which shows that pruning a trained large-dimensional DNN generally performs better than directly training a low-dimensional one. The noise problem is remedied by injecting a certain amount of noise into the network's weights at each training iteration, so that the trained network acquires robustness against channel noise. It is shown in [247] that analog transmission of DNN weights achieves better accuracy than digital delivery. A further unequal error protection strategy is incorporated by pruning the network to a size smaller than the available channel bandwidth, and applying bandwidth expansion to a selected subset of more important weights using Shannon-Kotelnikov mappings [250].
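The noise-injection idea can be sketched as follows; this is a stdlib-Python toy with illustrative names, not the AirNet code. The same routine models both the perturbation applied during training and the channel noise seen at delivery time:

```python
import random

def perturb_weights(weights, snr_db, rng):
    """Simulate analog transmission of DNN weights: add white Gaussian
    noise scaled to the SNR the channel is expected to provide. Training
    with weights perturbed this way at each iteration (as proposed in
    [247]) makes the network robust to the noise experienced when the
    weights are actually sent over the air."""
    power = sum(w * w for w in weights) / len(weights)
    noise_var = power / (10 ** (snr_db / 10))
    return [w + rng.gauss(0.0, noise_var ** 0.5) for w in weights]

noisy = perturb_weights([0.5, -1.2, 0.3, 0.9], snr_db=10,
                        rng=random.Random(0))
```

At very high SNR the perturbation becomes negligible and the delivered weights approach the trained ones.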
While the above framework assumes the availability of the dataset at the encoder, in many practical settings the encoder may only have access to the DNN architecture and weights but no data. Delivering such a network over a wireless channel is considered in [251], where it is shown that a Bayesian approach at the receiver when estimating the noisy DNN weights can significantly improve performance. The authors assume a Gaussian prior, and propose a population compensator and a bias compensator for the minimum mean square error (MMSE) metric.
Another common scenario is when the dataset is distributed across many wireless devices, which collaborate to train a common model in a federated manner. When the devices participating in federated learning share a common wireless medium to the parameter server, this is called federated edge learning (FEEL). FEEL was studied in [252] and [253] for AWGN and fading channels by first treating the uplink channel from the devices to the parameter server as a MAC, with the devices communicating at the boundary of the capacity region. Each device, depending on the rate available to it, then reduces the size of its model update by compressing it using the technique proposed in [254]. With this approach, there is a trade-off between the number of devices participating in the model update at each round and the accuracy of the updates they can convey to the parameter server: the more devices participate, the fewer wireless channel resources each is allocated, and the less accurate the updates it transmits to the parameter server. In [255], the authors studied the trade-off between the energy cost of model updates and the latency in FEEL over fading channels. A joint wireless resource allocation problem is formulated in [256] for FEEL over fading channels to maximize the convergence rate of the underlying learning process.
In [252, 253], an alternative 'analog' transmission approach is proposed for FEEL, treating the uplink model transmission as a distributed computation problem over a MAC. Inspired by the optimality of uncoded transmission in certain distributed computation and JSCC problems over the MAC [185, 191], these papers proposed uncoded and synchronized transmission of the local model updates, which enables the parameter server to directly recover the sum of the updates from multiple terminals. This 'over-the-air computation (OAC)' approach has received significant interest in recent years thanks to its bandwidth efficiency [252, 253, 257, 258]. Instead of allocating orthogonal channel resources to the participating devices, they all share the same bandwidth. Unlike in separate model compression and channel coding, the accuracy of the resulting computation benefits from more transmitters, since the goal is to recover the sum of their model updates. OAC can also be used in a fully distributed learning scenario [259, 260], where many computations take place in parallel. We refer the reader to [261] for a comprehensive overview of distributed learning techniques over wireless networks.
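The OAC principle can be illustrated in a few lines. This toy stdlib-Python simulation (illustrative names, not the scheme of any cited paper) shows how simultaneous analog transmissions deliver the coordinate-wise sum of the local updates at the receiver:

```python
import random

def ota_aggregate(local_updates, noise_std, rng):
    """Over-the-air computation: all devices transmit their analog model
    updates simultaneously on the same channel resources, and the
    superposition property of the wireless MAC delivers the coordinate-wise
    SUM of the updates, corrupted by receiver noise."""
    dim = len(local_updates[0])
    received = [sum(u[i] for u in local_updates) + rng.gauss(0.0, noise_std)
                for i in range(dim)]
    # The parameter server divides by the number of devices to obtain
    # the averaged model update used in the next training round.
    return [r / len(local_updates) for r in received]

updates = [[1.0, 2.0], [3.0, 4.0], [-1.0, 0.0]]
agg = ota_aggregate(updates, noise_std=0.0, rng=random.Random(0))
```

Note that the bandwidth cost does not grow with the number of devices, and with more devices the noise is averaged over a larger sum, which is the efficiency argument made above.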
VI-B Solutions to the Information Bottleneck Problem
The IB problem detailed in Section III-D provides a formulation to design mappings that extract the relevant information within the IB relevance-complexity region, i.e., the pairs of achievable relevance-complexity values, by solving the IB problem in (17) for different values of the trade-off parameter $\beta$. In general, however, this optimization is challenging, as it requires the computation of mutual information terms. In this section, we describe how, for a fixed parameter $\beta$, the optimal solution $P_{U|X}$, or an efficient approximation of it, can be obtained under: (i) known discrete memoryless distributions, or particular distributions such as Gaussian and binary symmetric sources; and (ii) unknown distributions, where only samples are available to design the encoders and decoders.
VI-B1 Known Discrete Memoryless Distribution
When the relevant feature $Y$ and the observation $X$ are discrete and the joint distribution $P_{X,Y}$ is known, the maximizing distributions in the IB problem in (17) can be efficiently found by an alternating optimization procedure similar to the expectation-maximization (EM) algorithm [262] and the standard Blahut-Arimoto (BA) method [263, 264], which is commonly used in the computation of rate-distortion functions of discrete memoryless sources. In particular, a solution to the constrained optimization problem is determined by the following self-consistent equations, for all $u$, $x$, $y$ [78]:

$p(u \mid x) = \frac{p(u)}{Z(x, \beta)} \exp\!\big(-\beta\, D_{\mathrm{KL}}\big(p(y \mid x) \,\|\, p(y \mid u)\big)\big), \qquad p(u) = \sum_{x} p(u \mid x)\, p(x), \qquad p(y \mid u) = \frac{1}{p(u)} \sum_{x} p(y \mid x)\, p(u \mid x)\, p(x),$  (26)
where $Z(x, \beta)$ is a normalization term. It is shown in [78] that alternating iterations of these equations converge to a solution of the problem for any initial $p(u \mid x)$. However, unlike the standard Blahut-Arimoto algorithm, for which convergence to the optimal solution is guaranteed, convergence here may be only to a local optimum.
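The alternating iterations of (26) are short enough to spell out. The following stdlib-Python sketch (a minimal illustration, not an optimized BA solver) iterates the three self-consistent equations for a small discrete joint distribution:

```python
import math
import random

def ib_iterate(p_xy, beta, num_u, iters=100, init=None, seed=0):
    """Alternating iterations of the IB self-consistent equations (26) for
    a known discrete joint distribution p_xy[x][y]. Returns the encoder
    p(u|x); converges to a (possibly local) optimum."""
    nx, ny = len(p_xy), len(p_xy[0])
    p_x = [sum(row) for row in p_xy]
    p_y_x = [[p_xy[x][y] / p_x[x] for y in range(ny)] for x in range(nx)]
    rng = random.Random(seed)
    if init is None:  # random initial encoder p(u|x)
        init = [[rng.random() for _ in range(num_u)] for _ in range(nx)]
    enc = [[v / sum(row) for v in row] for row in init]
    tiny = 1e-300
    for _ in range(iters):
        p_u = [sum(enc[x][u] * p_x[x] for x in range(nx)) for u in range(num_u)]
        p_y_u = [[sum(p_y_x[x][y] * enc[x][u] * p_x[x] for x in range(nx))
                  / max(p_u[u], tiny) for y in range(ny)] for u in range(num_u)]
        for x in range(nx):
            scores = []
            for u in range(num_u):
                kl = sum(p * math.log(max(p, tiny) / max(p_y_u[u][y], tiny))
                         for y, p in enumerate(p_y_x[x]) if p > 0)
                scores.append(p_u[u] * math.exp(-beta * kl))
            z = sum(scores)  # the normalization term Z(x, beta)
            enc[x] = [s / z for s in scores]
    return enc

# Toy example: X uniquely determines Y; with a large beta the encoder
# converges to a (near-)deterministic clustering of X into the two values of U.
enc = ib_iterate([[0.5, 0.0], [0.0, 0.5]], beta=10.0, num_u=2,
                 init=[[0.9, 0.1], [0.2, 0.8]])
```

Small `beta` instead drives the encoder toward maximal compression (an uninformative $U$), tracing out the other end of the relevance-complexity region.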
VI-B2 Unknown Distributions
The main drawback of the solutions presented in the previous section is the requirement of knowing the joint distribution $P_{X,Y}$, or at least a good estimate of it, and that iterating (26) is only feasible for sources with small alphabets (or for jointly Gaussian sources [265, 266, 98]). The variational IB (VIB) method was proposed in [87] as a means to obtain approximate solutions to the IB problem when the joint distribution is unknown and only a training set of samples is available, or when the alphabet is too large. The VIB consists of defining a variational (lower) bound on the cost of the IB problem in (17), parameterizing this bound with neural networks, and showing that, by leveraging the reparameterization trick [267], its optimization can be performed through stochastic gradient descent (SGD). From a task-oriented communication perspective, the VIB approach provides a principled way to generalize the evidence lower bound (ELBO) and variational autoencoders [267] (and the extension to the $\beta$-VAE cost [268]) to scenarios in which the decoder is interested in recovering the relevant information for a task that is not necessarily the observed sample, by maximizing relevance. The idea is to use the IB principle to train an encoder and decoder, parameterized by DNNs, that extract the relevant information and forward it to a decoder in charge of reconstructing it. The resulting architecture, with an encoder, a latent space, and a decoder parameterized by Gaussian distributions, is shown in Fig. 10. This approach has also been used for task-oriented communications in JSCC scenarios, as a means to extract the relevant information to transmit over noisy communication channels to perform a given task at the destination, e.g., [269], [270]. More precisely, solving the IB problem in (17) consists of optimizing the IB Lagrangian
$\mathcal{L}_{\beta}(P_{U|X}) := I(U; Y) - \beta\, I(U; X),$  (27)

over all encoders $P_{U|X}$ satisfying the Markov chain $Y - X - U$. It follows from the Gibbs inequality that, for any $Q_{Y|U}$ and $Q_U$, we have the following lower bound on the IB Lagrangian:

$\mathcal{L}_{\beta}(P_{U|X}) \geq \mathcal{L}^{\mathrm{VIB}}_{\beta}(P_{U|X}, Q_{Y|U}, Q_U),$  (28)

$\mathcal{L}^{\mathrm{VIB}}_{\beta}(P_{U|X}, Q_{Y|U}, Q_U) := \mathbb{E}_{P_{X,Y}}\Big[\mathbb{E}_{P_{U|X}}\big[\log Q_{Y|U}(Y \mid U)\big]\Big] - \beta\, \mathbb{E}_{P_X}\Big[D_{\mathrm{KL}}\big(P_{U|X}(\cdot \mid X) \,\|\, Q_U\big)\Big],$  (29)

where $Q_{Y|U}$ is a given stochastic map (also referred to as the variational approximation of $P_{Y|U}$, or the decoder), $Q_U$ is a given stochastic map (also referred to as the variational approximation of $P_U$, or the prior), and $D_{\mathrm{KL}}(P \,\|\, Q)$ is the relative entropy between $P$ and $Q$. Equality holds if and only if $Q_{Y|U} = P_{Y|U}$ and $Q_U = P_U$, i.e., when the variational approximations match the true distributions.
Therefore, optimizing (27) over $P_{U|X}$ is equivalent to optimizing the variational cost (29) over $P_{U|X}$, $Q_{Y|U}$, and $Q_U$. In the VIB method, this optimization is carried out by further parameterizing the encoding and decoding distributions $P_{U|X}$, $Q_{Y|U}$, and $Q_U$ using families of distributions $P_{U|X;\theta}$, $Q_{Y|U;\phi}$, and $Q_{U;\psi}$, whose parameters are determined by DNNs with parameters $\theta$, $\phi$, and $\psi$, respectively. A common example is the family of multivariate Gaussian distributions [267], parameterized by the mean $\mu$ and covariance matrix $\Sigma$. Given an observation $x$, the values of $(\mu(x), \Sigma(x))$ are determined by the output of the DNN with parameters $\theta$, whose input is $x$, and the corresponding family member is $P_{U|X;\theta}(\cdot \mid x) = \mathcal{N}(\mu(x), \Sigma(x))$. Another common example is the family of Gumbel-Softmax distributions [271, 272].
The bound (29), restricted to the families of distributions $P_{U|X;\theta}$, $Q_{Y|U;\phi}$, and $Q_{U;\psi}$, can be approximated using Monte Carlo sampling and the training samples $\{(x_i, y_i)\}_{i=1}^{n}$. To facilitate the computation of gradients via backpropagation [267], the reparameterization trick [267] is used to sample from $P_{U|X;\theta}$. In particular, $P_{U|X;\theta}$ is assumed to belong to a family of distributions that can be sampled by first drawing a random variable $\epsilon$ with distribution $P_{\epsilon}$, and then transforming the sample using a function $g_{\theta}$ parameterized by $\theta$, such that $u = g_{\theta}(x, \epsilon)$, e.g., a Gaussian distribution. The reparameterization trick approximates the inner expectation by drawing $m$ independent samples $u_{i,j} = g_{\theta}(x_i, \epsilon_j)$, $j = 1, \ldots, m$, for each training sample $x_i$. Then, the lower bound on the IB cost can be optimized using methods such as SGD or ADAM [273] with backpropagation over the DNN parameters as

$\max_{\theta, \phi, \psi} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}_{i}(\theta, \phi, \psi),$  (30)

where the cost for the $i$-th sample in the training dataset is

$\mathcal{L}_{i}(\theta, \phi, \psi) := \frac{1}{m} \sum_{j=1}^{m} \log Q_{Y|U;\phi}(y_i \mid u_{i,j}) - \beta\, D_{\mathrm{KL}}\big(P_{U|X;\theta}(\cdot \mid x_i) \,\|\, Q_{U;\psi}\big),$  (31)

and sampling is performed using $u_{i,j} = g_{\theta}(x_i, \epsilon_j)$ with $\epsilon_j$ drawn i.i.d. from $P_{\epsilon}$ for each pair $(x_i, y_i)$.
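The reparameterized sampling step described above can be sketched in a couple of lines for the Gaussian case; the function name is illustrative:

```python
import random

def reparameterize(mu, sigma, rng):
    """Gaussian reparameterization trick: u = g(x, eps) = mu + sigma * eps,
    with eps ~ N(0, I). The stochasticity is isolated in eps, so the mapping
    from the encoder outputs (mu, sigma) to the sample u is deterministic
    and differentiable, which is what lets gradients flow through the
    Monte Carlo estimate of the VIB cost."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

# With sigma = 0 the sample collapses to the mean, as expected.
u = reparameterize([0.5, -1.0], [0.0, 0.0], random.Random(0))
```

Drawing several such samples per training point and averaging the decoder log-likelihoods yields the first term of the per-sample cost.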
For inference, let $\theta^{*}$, $\phi^{*}$, $\psi^{*}$ be the DNN parameters obtained in training by solving (30). For a new observation $x$, the representation $u$ can be obtained by sampling from the encoder $P_{U|X;\theta^{*}}$, and a soft estimate of the remote source $y$ can be inferred by sampling from the decoder $Q_{Y|U;\phi^{*}}$. Thus, from a task-oriented communication perspective, $P_{U|X;\theta^{*}}$ is an encoder trained according to the cost (30) to extract the representation most relevant for the inference of $y$, and $Q_{Y|U;\phi^{*}}$ is a decoder trained to reconstruct the relevant information from that representation by minimizing the log-loss.
Similarly to the steps followed for the variational IB in Section VI-B2, encoders and a decoder operating on the region (21) can be computed by deriving a variational bound on the distributed IB cost, parameterizing the encoding and decoding distributions using families of distributions whose parameters are determined by DNNs, and optimizing the bound using the reparameterization trick [267], Monte Carlo sampling, and SGD-type algorithms. The $K$ encoders and the decoder, parameterized by DNN parameters $\{\theta_k\}_{k=1}^{K}$, $\phi$, and $\{\psi_k\}_{k=1}^{K}$, can be optimized according to the distributed information bottleneck principle by considering the following empirical Monte Carlo approximation:

$\max_{\{\theta_k\}, \phi, \{\psi_k\}} \; \frac{1}{n} \sum_{i=1}^{n} \Big[ \frac{1}{m} \sum_{j=1}^{m} \log Q_{\phi}\big(y_i \mid u_{1,i,j}, \ldots, u_{K,i,j}\big) - \beta \sum_{k=1}^{K} D_{\mathrm{KL}}\big(P_{U_k|X_k;\theta_k}(\cdot \mid x_{k,i}) \,\|\, Q_{U_k;\psi_k}\big) \Big].$  (32)