1 Introduction
End-to-end learning of communication systems using autoencoders has recently been shown to be a viable method for designing the next generation of wireless networks [OShea2017deep, dorner2018jstsp, aoudia2019model, OShea2019approximating, yeli2018channel, wang2017survey]. A point-to-point communication system consists of a transmitter (or encoder), a channel, and a receiver (or decoder). The key idea of end-to-end learning for a communication system is to use an autoencoder architecture to model and learn the transmitter and receiver jointly using neural networks in order to minimize an end-to-end performance metric such as the block error rate (BLER) [OShea2017deep]. The channel can be represented as a stochastic transfer function that transforms its input to an output . It can be regarded as a black box that is nonlinear and non-differentiable due to hardware imperfections (e.g., quantization and amplifiers). Since autoencoders are trained using stochastic gradient descent (SGD)-based optimization, with the gradients calculated using backpropagation [OShea2017deep], it is challenging to work with a black-box channel that is not differentiable. One approach to address this problem is to use a known mathematical model of the channel (e.g., additive Gaussian noise). Use of such models enables the computation of gradients of the loss function with respect to the autoencoder parameters via backpropagation. However, such standard channel models do not capture realistic channel effects well, as shown in [aoudia2018endtoend]. Alternatively, recent works have proposed to learn the channel using (deep) generative models that approximate the conditional probability density of the channel output given the channel input, using generative adversarial networks (GANs) [OShea2019approximating, yeli2018channel], mixture density networks (MDNs) [garcia2020mixture], and conditional variational autoencoders (VAEs) [xia2020millimeter]. The use of a differentiable generative model of the channel enables SGD-based training of the autoencoder, while also capturing realistic channel effects better than standard models.
Although this end-to-end optimization with real channels learned from data can improve the physical layer design for communication systems, in reality, channels often change, requiring collection of a large number of samples and frequent retraining of the channel model and autoencoder. For this reason, adapting the learned conditional probability density of the channel as often as possible using only a small number of samples is required for good communication performance. In this paper, we study the problem of domain adaptation (DA) of autoencoders using an MDN as the channel model. We make the following contributions: i) We propose a lightweight and sample-efficient method for adapting a generative MDN (used for modeling the wireless channel) based on the properties of Gaussian mixtures. ii) Based on the MDN adaptation, we propose two methods to compensate for changes in the class-conditional feature distributions at the decoder (receiver) that maintain or decrease the error rate of the autoencoder, without requiring any retraining.
In contrast to the conventional DA setting, where one has access to a large unlabeled dataset and none or a small labeled dataset from the target domain [jiang2008literature, bendavid2006analysis], here we consider DA when the target domain has only a small labeled dataset. This setting applies to the wireless communication channel, where the target distribution changes frequently and we only get to collect a small number of samples at a time from the target distribution. Recent approaches for DA (such as DANN [ganin2016domain]) based on adversarial learning of a shared representation between the source and target domains [ganin2015unsupervised, ganin2016domain, long2018conditional, saito2018maximum, zhao2019invariant, johansson2019support] have achieved much success on computer vision and natural language processing tasks. The high-level idea is to adversarially learn a shared feature representation for which inputs from the source and target distributions are nearly indistinguishable to a domain discriminator DNN, such that a label predictor DNN using this representation and trained using labeled data from only the source domain also generalizes well to the target domain [ganin2016domain]. Adversarial DA methods are not suitable for this problem because of the imbalance in the number of source and target domain unlabeled samples (there are not enough target domain samples to learn a good domain discriminator). Also, adversarial DA methods are not well suited for the fast and frequent adaptation that is required here. In summary, the problem setting addressed in this work has the following key differences from standard DA:

The number of target domain samples is much smaller compared to the source domain. All of them are labeled, i.e., no unlabeled samples.

We have a pretrained classifier (decoder) from the source domain and do not want to retrain the classifier frequently for two reasons: i) retraining may not be fast enough to keep up with changes in the wireless channel, ii) collecting a reasonably large number of labeled samples is hard while the channel distribution is changing.

Unlike adversarial DA methods, we do not train a single classifier that has low error rate on both the domains. We adapt a generative model of the classconditional feature distribution from the source to the target domain, and design feature transformations at the input of the source domain classifier.

Our method focuses on low-dimensional input domains such as a communication system, but may not be well suited for high-dimensional input domains such as images.
2 Background
We introduce the notations and definitions used, followed by a brief background on i) end-to-end learning in wireless communication using autoencoders, and ii) MDN-based generative modeling. A detailed background and discussion of related works on these topics is given in Appendix A.
Notations and Definitions. Vectors and matrices are denoted by boldface symbols. We use uppercase letters for random variables and lowercase letters for the specific values taken by them. ℝ and ℂ denote the sets of real and complex numbers. We use the concise notation for any integer . We define to be an indicator function that takes the value 1 (0) when predicate is true (false). We denote the one-hot-coded vector of all zeros except a 1 at position by , and we omit when it is clear from the context. The probability density function (pdf) of a multivariate Gaussian with mean vector and covariance matrix is denoted by . The categorical probability mass function is denoted by . The identity matrix is denoted by . The determinant and trace of a matrix are denoted by and respectively. The element-wise (Hadamard) product of two vectors or matrices is denoted by . We use a slightly different notation compared to the convention in machine learning. While a (feature vector, class label) pair is usually denoted by , here we denote the same by , where is the channel output and is the input message. Also, is the encoded representation of a message , and the channel input. Table 4 in the Appendix provides a quick reference for the notations used in the paper.
2.1 Autoencoder-Based End-to-End Learning
Consider a single-input, single-output (SISO) wireless communication system as shown in Fig. 1, where the transmitter encodes and transmits messages from the set to the receiver through discrete uses of the wireless channel. The receiver attempts to accurately decode the transmitted message from the distorted and noisy channel output . We discuss the end-to-end learning of such a system using the concept of autoencoders [OShea2017deep, dorner2018jstsp].
Transmitter / Encoder Neural Network. The transmitter or encoder part of the autoencoder is modeled as a multilayer, feedforward neural network (NN) that takes as input the one-hot-coded representation of a message , and produces an encoded symbol vector . Here, is the parameter vector (weights and biases) of the encoder NN and is the encoding dimension. Due to hardware constraints present at the transmitter, a normalization layer is used as the final layer of the encoder network in order to constrain the average power and/or the amplitude of the symbol vectors. The average power constraint is defined as , where the expectation is over the prior distribution of the input messages, and is typically set to . The amplitude constraint is defined as . The size of the message set is usually chosen to be a power of 2, i.e., representing bits of information. Following [OShea2017deep], the communication rate of this system is the number of bits transmitted per channel use, which in this case is . An autoencoder transmitting bits over uses of the channel is referred to as a autoencoder. For example, a autoencoder uses a message set of size and an encoding dimension of , with a communication rate bits/channel use.
Receiver / Decoder Neural Network. The receiver or decoder component is also a multilayer, feedforward NN that takes the channel output as its input and outputs a probability distribution over the messages. The input-output mapping of the decoder NN can be expressed as , where is the parameter vector of the decoder NN. The softmax activation function is used at the final layer to ensure that the outputs are valid probabilities. The message corresponding to the highest output probability is predicted as the decoded message, i.e., . The decoder NN is essentially a discriminative classifier that learns to accurately categorize the received (distorted) symbol vector into one of the message classes. This is in contrast to conventional autoencoders, where the decoder learns to accurately reconstruct a high-dimensional tensor input from its low-dimensional representation learned by the encoder. The mean-squared and median-absolute errors are commonly used end-to-end performance metrics for conventional autoencoders. In the case of communication autoencoders, the symbol or block error rate (BLER), defined as , is used as the end-to-end performance metric.
Channel Model. As discussed in § 1, the wireless channel can be represented by a conditional probability density of the channel output given its input . The channel can be equivalently characterized by a stochastic transfer function that transforms the encoded symbol vector into the channel output, where captures the stochastic components of the channel (e.g., random noise, phase offsets). For example, an additive white Gaussian noise (AWGN) channel is represented by , with and . For realistic wireless channels, the transfer function and conditional probability density are usually unknown and hard to approximate well with standard mathematical models. Recently, a number of works have applied generative models such as conditional generative adversarial networks (GANs) [OShea2019approximating, yeli2018channel], MDNs [garcia2020mixture], and conditional variational autoencoders (VAEs) [xia2020millimeter] for modeling the wireless channel. To model a wireless channel, generative methods learn a parametric model (possibly a neural network) that closely approximates the true conditional density of the channel from a dataset of channel input, output observations. Learning a generative model of the channel comes with important advantages. 1) Once the parameters of the channel model are learned from data, the model can be used to generate any number of representative samples from the channel distribution. 2) A channel model with a differentiable transfer function makes it possible to backpropagate gradients of the autoencoder loss function through the channel and train the autoencoder using stochastic gradient descent (SGD)-based optimization. 3) It allows for continuous adaptation of the generative channel model to variations in the channel conditions, thereby maintaining a low BLER of the autoencoder.
2.2 Generative Channel Model using a Mixture Density Network
In this work, we use an MDN [bishop1994mixture, bishop2007prml] with Gaussian components to model the conditional density of the channel output given its input. MDNs can model complex conditional densities by combining a (feedforward) neural network with a standard parametric mixture model (e.g., mixture of Gaussians). The MDN learns to predict the parameters of the mixture model as a function of the channel input . This can be expressed as , where is the parameter vector (weights and biases) of the neural network. The parameters of the mixture model defined by the MDN are a concatenation of the parameters from the density components, i.e., , where is the parameter vector of component . Focusing on a Gaussian mixture, the channel conditional density modeled by the MDN is given by
(1) 
where is the mean vector, is the variance vector, and is the weight (prior probability) of component . Also, is the latent random variable denoting the mixture component of origin. We have assumed that the Gaussian components have a diagonal covariance matrix, with being the diagonal elements (footnote 1: the diagonal covariance assumption does not imply conditional independence of as long as ). The weights of the mixture are parameterized using the softmax function as in order to satisfy the probability constraint. The MDN simply predicts the unnormalized weights (also known as the prior logits). For a Gaussian MDN, the parameter vector of component is defined as , and its output parameter vector has dimension . Details on the conditional log-likelihood (CLL) training objective and the transfer function of the MDN, including a differentiable approximation of the transfer function, are discussed in Appendix B.
3 Fast Adaptation of the MDN Channel Model
In this section, we propose a fast and lightweight method for adapting a Gaussian MDN when the number of target domain samples is much smaller than that used for training it. Consider the setting where the channel state (and therefore its conditional distribution) is changing over time due to, e.g., environmental factors. Let denote the (unknown) source channel distribution underlying the dataset used to train the MDN . With a sufficiently large dataset and a suitable choice of , the Gaussian mixture learned by the MDN can closely approximate . Let denote a small set of i.i.d. samples () from an unknown target channel distribution , which is potentially different from but not by a large deviation. Our goal is to adapt the MDN (and therefore the underlying mixture density) using such that it closely approximates . Note that the space of inputs to the MDN is the finite set of modulated symbols (referred to as a constellation), with each symbol corresponding to a unique message (footnote 2: the prior probability over the constellation is equal to the prior probability over the input messages; this is usually either set to be uniform, or estimated using relative frequencies from a large dataset).
Key Insight. The proposed method is based on the affine-transformation property of multivariate Gaussians, i.e., one can transform between any two multivariate Gaussians through an affine transformation. Given any two Gaussian mixtures with the same number of components and a one-to-one correspondence between the components, we can find the unique set of affine transformations (one per component) that transforms one Gaussian mixture to the other. Moreover, the affine transformations are bijective, allowing the mapping to be applied in the inverse direction. This insight allows us to formulate the MDN adaptation as an optimization over the per-component affine transformation parameters, which is a much smaller problem compared to optimizing the weights of all the MDN layers (see Table 3 for a comparison). To reduce the possibility of the adapted MDN finding bad solutions due to the small-sample regime, we include a regularization term based on the Kullback-Leibler divergence (KLD) in the adaptation objective that constrains the distribution shift produced by the affine transformations. The use of a parametric Gaussian mixture, combined with the one-to-one association of the components, allows us to derive a closed-form expression for the KLD between the source and target mixture distributions.
3.1 Transformation Between Gaussian Mixtures
Consider the Gaussian mixtures corresponding to the source and target channel conditional densities
(2)  
(3) 
where is the parameter vector of the adapted MDN. The adapted MDN predicts the parameters of the target Gaussian mixture as , where is a concatenation of the parameters of the individual components. As before, the parameters of a component are defined as . We focus on the case of diagonal covariances, but the results easily extend to the case of general covariances. We summarize the feature and parameter transformations required for mapping the component densities of one Gaussian mixture to another. As shown in Appendix C, the transformation between any two multivariate Gaussians and can be achieved by the transformation: , where the mean vector and covariance matrix of the two Gaussians are related as follows:
(4) 
Affine and Inverse-Affine Feature Transformations
Applying the above result to an MDN with components, we define the affine feature transformation for component mapping from to as
(5) 
It is straightforward to also define the inverse-affine transformation from to as
(6) 
Note that the feature transformations are conditioned on a given input to the MDN and a given component of the mixture. This idea is illustrated in Fig. 3. For the case of diagonal covariances, we constrain and to also be diagonal (footnote 3: is actually not required to be diagonal, but we constrain it to be so for simplicity). These feature transformations will be used for defining a validation metric for the MDN adaptation, and also for aligning the target class-conditional distributions of with that of the source in § 4.
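The affine and inverse-affine feature transformations can be illustrated with a small sketch. It assumes diagonal covariances and an elementwise scale `c` applied to the deviation from the source mean; the function and argument names are illustrative and do not come from the paper, and the exact form of Eqs. (5) and (6) may differ.

```python
def affine_transform(x, mu_src, mu_tgt, c):
    """Map x from a source component with mean mu_src to the matching
    target component with mean mu_tgt (assumed elementwise scale c)."""
    return [ci * (xi - ms) + mt
            for xi, ms, mt, ci in zip(x, mu_src, mu_tgt, c)]


def inverse_affine_transform(x_hat, mu_src, mu_tgt, c):
    """Undo affine_transform: map a target-domain point back to the source."""
    return [(xh - mt) / ci + ms
            for xh, ms, mt, ci in zip(x_hat, mu_src, mu_tgt, c)]


# Example: a 2-D point mapped to the target component and back.
x = [0.3, -1.2]
mu_s, mu_t, c = [0.0, 1.0], [0.5, 0.5], [2.0, 0.5]
x_hat = affine_transform(x, mu_s, mu_t, c)
x_back = inverse_affine_transform(x_hat, mu_s, mu_t, c)
```

Because the map is bijective whenever every entry of `c` is nonzero, `x_back` recovers `x` up to floating-point error, which is the property the validation metric and the adapted decoder rely on.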
Parameter Transformations
The corresponding transformations between the source and target Gaussian mixture parameters for any component are given by
(7) 
where is a diagonal scale matrix for the means; is an offset vector for the means; is a diagonal scale matrix for the variances; and are the scale and offset for the component prior logits. The vector of all adaptation parameters to be optimized is defined as , where . The number of adaptation parameters (dimension of ) is given by . This is typically much smaller than the number of MDN parameters (weights and biases from all layers), even for shallow fully connected NNs. The overall idea for adapting the MDN is summarized in Fig. 2, where the adaptation layer mapping to implements the parameter transformations in Eq. (7).
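As a rough illustration of the parameter transformations described above, the sketch below applies per-component scales and offsets to the means, variances, and prior logits. It assumes diagonal matrices stored as lists and that the variance is scaled by the square of its scale parameter; these choices, and all names, are assumptions made for the example rather than the paper's exact Eq. (7).

```python
def adapt_component(mean, var, logit, a, b, c, beta, gamma):
    """Return adapted (mean, var, logit) for one mixture component.

    a, b  : elementwise scale and offset for the mean
    c     : elementwise scale for the variance (assumed applied as c**2)
    beta, gamma : scale and offset for the prior logit
    """
    new_mean = [ai * m + bi for ai, m, bi in zip(a, mean, b)]
    new_var = [(ci ** 2) * v for ci, v in zip(c, var)]
    new_logit = beta * logit + gamma
    return new_mean, new_var, new_logit


def num_adaptation_params(k, d):
    """Count per-component parameters: d mean scales, d mean offsets,
    d variance scales, plus one logit scale and one logit offset."""
    return k * (3 * d + 2)


# Identity initialization (scales 1, offsets 0) leaves the source unchanged.
src = ([0.5, -0.5], [0.1, 0.2], 0.3)
same = adapt_component(*src, a=[1.0, 1.0], b=[0.0, 0.0],
                       c=[1.0, 1.0], beta=1.0, gamma=0.0)
```

With, say, 5 components in 2 dimensions this gives 40 adaptation parameters, which illustrates why the optimization is much smaller than retraining the MDN weights.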
3.2 Divergence Between the Source and Target Distributions
We would like to derive a distributional-divergence metric between the source and target Gaussian mixtures (Eqs. (2) and (3)) in order to regularize the adaptation loss. A number of distributional-divergence metrics such as the Kullback-Leibler, Jensen-Shannon [lin1991divergence], and Total Variation [verdu2014total] divergence are potential candidates, each possessing some unique properties. However, none of these divergences have a closed-form expression for a pair of general Gaussian mixtures. Prior works such as [hershey2007approximating] have addressed the problem of estimating the KLD between a pair of Gaussian mixtures for the general case where the number of components could be different, with no association between the components. As mentioned earlier, in our problem there exists a one-to-one association between the individual components of the source and target Gaussian mixtures (by definition). This allows us to derive a closed-form expression for the KLD as follows:
(8) 
The first term in the above expression is the KLD between the component prior probabilities, which can be simplified into a function of the adaptation parameters . The second term in the above expression involves the KLD between two multivariate Gaussians (a standard result), which can also be simplified into a function of the adaptation parameters. A detailed derivation of this result, and the final expression for the KLD as a function of are given in Appendix D. To make the dependence on explicit, the KLD is henceforth denoted by .
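The two-term structure described above (a KLD between the component priors plus a KLD between matched Gaussian components) can be sketched as follows. This is a minimal sketch that assumes diagonal covariances, a single fixed MDN input, and that the Gaussian terms are weighted by the source priors; the final expression in Appendix D may differ in these details, and all names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    s = sum(e)
    return [v / s for v in e]


def kl_categorical(p, q):
    """KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL( N(mu0, diag(var0)) || N(mu1, diag(var1)) ), a standard result."""
    return 0.5 * sum(
        math.log(v1 / v0) + (v0 + (m0 - m1) ** 2) / v1 - 1.0
        for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1)
    )


def kl_matched_mixtures(src, tgt):
    """src, tgt: lists of (logit, mean, var), one entry per component,
    with component i of src matched to component i of tgt."""
    p = softmax([comp[0] for comp in src])
    q = softmax([comp[0] for comp in tgt])
    total = kl_categorical(p, q)
    for pi, (_, m0, v0), (_, m1, v1) in zip(p, src, tgt):
        total += pi * kl_diag_gaussians(m0, v0, m1, v1)
    return total


source = [(0.0, [0.0, 0.0], [1.0, 1.0]), (1.0, [2.0, 2.0], [0.5, 0.5])]
target = [(0.2, [0.1, 0.0], [1.2, 1.0]), (0.8, [2.0, 2.3], [0.5, 0.6])]
kld = kl_matched_mixtures(source, target)
```

The one-to-one component matching is what makes this computable in closed form; without it, the KLD between general Gaussian mixtures has to be approximated.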
3.3 Loss Function for Adaptation
We consider two scenarios for adaptation: 1. Generative adaptation of the MDN in isolation and 2. Discriminative adaptation of the MDN as the channel model for an autoencoder. In the first case, the goal of adaptation is to find a good generative model for the target data distribution, while in the second case, the goal is to improve the classification performance of the autoencoder on the target data distribution. Recall that we are given a small dataset sampled from the target distribution . We formulate the MDN adaptation as a minimization problem with a regularized negative log-likelihood objective, where the regularization term penalizes solutions with a large KLD between the source and target Gaussian mixtures.
Generative Adaptation. The adaptation objective is the following regularized negative conditional log-likelihood (CLL) of the target dataset:
(9) 
where is given by Eq. (3) and and as a function of are given by Eq. (7). The parameters of the original mixture density are constant terms since they have no dependence on . The regularization constant controls the KLD between the source and target Gaussian mixtures in the optimal solution. Small values of weight the CLL term more and allow more exploration in the adaptation; larger values of impose a stronger regularization to constrain the space of target distributions.
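A minimal sketch of a regularized negative CLL objective of this kind is shown below. It assumes the data term is averaged over the target samples and that the divergence has already been computed for the current adaptation parameters; `mixture_for` is a hypothetical callable returning the adapted mixture parameters for a given channel input, and none of these names come from the paper.

```python
import math


def gmm_log_density(x, logits, means, variances):
    """log of sum_i pi_i * N(x | mean_i, diag(var_i)), with pi = softmax(logits),
    computed with log-sum-exp for numerical stability."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    terms = []
    for logit, mu, var in zip(logits, means, variances):
        log_gauss = -0.5 * sum(
            math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
            for xi, mi, v in zip(x, mu, var)
        )
        terms.append(logit - log_norm + log_gauss)
    t = max(terms)
    return t + math.log(sum(math.exp(v - t) for v in terms))


def generative_objective(samples, mixture_for, kld, lam):
    """samples: list of (z, x) pairs from the target channel.
    mixture_for(z): returns (logits, means, variances) of the adapted mixture.
    kld: divergence between source and adapted mixtures; lam: its weight."""
    nll = -sum(gmm_log_density(x, *mixture_for(z)) for z, x in samples)
    return nll / len(samples) + lam * kld


# Toy example: one component, standard normal, a single sample at the mean.
toy = generative_objective(
    samples=[("z0", [0.0])],
    mixture_for=lambda z: ([0.0], [[0.0]], [[1.0]]),
    kld=2.0,
    lam=0.5,
)
```

Increasing `lam` makes the second term dominate, which pulls the solution toward the source mixture, matching the role of the regularization constant described above.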
Discriminative Adaptation. With the goal of improving the accuracy of the decoder in recovering the transmitted symbol from , the data-dependent term in the adaptation objective (9) is replaced with the posterior log-likelihood (PLL) as follows:
(10) 
We make the following observations about the minimization problem: i) The adaptation objective (in both cases) is a smooth and nonconvex function of ; ii) Computing the objective and its gradient w.r.t. are inexpensive operations since and the dimension of are relatively small. Also, this does not require forward and backpropagation steps through the layers of the MDN. For this reason, we use the BFGS quasi-Newton method [nocedal2006numerical] for minimization, instead of SGD-based methods, which are more suitable for large-scale learning problems.
3.4 Validation Metric and Selection of
The choice of in the adaptation objective is crucial as it sets the amount of regularization most suitable for the target domain distribution. We propose a validation metric for selecting based on the CLL of the inverse-affine-transformed target dataset with respect to the source mixture density. The reasoning is that, if the adaptation finds a solution (i.e., ) that is a good fit for the target dataset, then the inverse feature transformations based on that solution should produce a transformed target dataset that has a high CLL with respect to the source mixture density. The validation metric is the negative CLL of the inverse-transformed target dataset, given by
(11) 
Here is the best component assignment for the sample , given by
(12) 
The above equation is simply the maximum-a-posteriori (MAP) rule applied to the component posterior of the target Gaussian mixture defined as
Note that the validation metric (11) is based on the source Gaussian mixture (MDN with parameters ), but the MAP component assignment for each target domain sample (Eq. (12)) is based on the target Gaussian mixture (MDN with parameters ). The adaptation objective is minimized with varied over a range of values, and in each case the adapted solution is evaluated using the validation metric. The pair of and resulting in the smallest validation metric is chosen as the final adapted solution.
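The selection procedure can be sketched as a simple search over a grid of regularization constants. The grid limits below are placeholders, and `adapt` and `validation_metric` are hypothetical callables standing in for the minimization of the adaptation objective and the metric in Eq. (11).

```python
def log_spaced(lo_exp, hi_exp, num, base=10.0):
    """Return num values equally spaced on a log scale,
    from base**lo_exp to base**hi_exp (num must be at least 2)."""
    step = (hi_exp - lo_exp) / (num - 1)
    return [base ** (lo_exp + i * step) for i in range(num)]


def select_lambda(lambdas, adapt, validation_metric):
    """Run the adaptation for every candidate and keep the one with the
    smallest validation metric. Returns (lambda, solution, metric)."""
    best = None
    for lam in lambdas:
        solution = adapt(lam)
        score = validation_metric(solution)
        if best is None or score < best[2]:
            best = (lam, solution, score)
    return best


# Toy stand-ins: the "solution" just records lambda, and the metric is
# smallest near lambda = 0.1.
grid = log_spaced(-3.0, 1.0, 5)
best = select_lambda(
    grid,
    adapt=lambda lam: {"lam": lam},
    validation_metric=lambda sol: abs(sol["lam"] - 0.1),
)
```

Since each call to `adapt` is independent of the others, the loop body can be distributed across CPU cores without changing the result.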
4 Adaptation of Autoencoder-Based Communication System
In this section, we discuss how the proposed MDN adaptation can be combined with an autoencoder-based communication system to adapt the decoder to changes in the channel conditions. Recall that the decoder is basically a classifier that predicts the most probable input message from the received channel output . When the decoder operates in a new (target) channel environment, different from the one it was trained on, its classification accuracy can degrade due to the distribution change. Specifically, any change in the channel conditions is reflected as a change in the class-conditional density of the decoder's input, i.e., changes (footnote 4: for this generative model, it is easy to see that the class-conditional density is equal to the channel-conditional density, i.e., ; hence, by adapting the MDN, we are effectively also adapting the class-conditional density of the decoder's input). We propose to address this by designing transformations to the decoder's input that can compensate for changes in the channel distribution, and effectively present transformed inputs that are close to the source distribution on which the decoder was trained. Our method does not require any change or adaptation to the decoder network itself, making it fast and suitable for the small-sample-size setting. We next discuss two such input transformation methods for the decoder.
4.1 Adapted Decoder Based on Affine Feature Transformations
Consider the same problem setup as § 3, where we observe a small dataset of samples from the target channel distribution. Suppose we have adapted the MDN channel by optimizing over the parameters . We can then use the inverse-affine feature transformations (defined in Eq. (6)) to transform the channel output from a component of the target Gaussian mixture to the same component of the source Gaussian mixture. However, this transformation requires knowledge of both the channel input and the mixture component , which are not observed (latent) at the decoder. We propose to address this by first determining the most probable pair of channel input and mixture component for a given (using the MAP rule), and applying the corresponding inverse-affine feature transformation as follows:
(13) 
The joint posterior over the channel input and mixture component , given the channel output is based on the adapted (target) Gaussian mixture, given by
The adapted decoder based on the above affine feature transformation is defined as
(14) 
and illustrated in Fig. 4. Note that the adapted decoder is a function of the parameters , even though this is not made explicit in the notation.
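The input transformation of this adapted decoder can be sketched in two steps: pick the most probable (symbol, component) pair under the adapted mixture, then map the channel output back to the matching source component. The sketch assumes diagonal covariances and takes the scale of the inverse map to be the ratio of source to target standard deviations; the data layout and all names are illustrative assumptions, and the decoder network itself is not shown.

```python
import math


def log_gauss_diag(x, mu, var):
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mu, var))


def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]


def map_symbol_component(x, priors, target):
    """target[j] = (logits, means, variances) of the adapted mixture for
    constellation symbol j. Returns the MAP pair (symbol j, component i)."""
    best, best_score = None, -math.inf
    for j, (logits, means, variances) in enumerate(target):
        log_pi = log_softmax(logits)
        for i, (mu, var) in enumerate(zip(means, variances)):
            score = math.log(priors[j]) + log_pi[i] + log_gauss_diag(x, mu, var)
            if score > best_score:
                best, best_score = (j, i), score
    return best


def adapted_decoder_input(x, priors, source, target):
    """Inverse-affine map of x from the MAP target component to the
    matching source component; the result is what the decoder would see."""
    j, i = map_symbol_component(x, priors, target)
    mu_s, var_s = source[j][1][i], source[j][2][i]
    mu_t, var_t = target[j][1][i], target[j][2][i]
    return [math.sqrt(vs / vt) * (xi - mt) + ms
            for xi, ms, vs, mt, vt in zip(x, mu_s, var_s, mu_t, var_t)]


# Toy 1-D example: two symbols, one component each; the target channel
# shifts the means by +0.5 and quadruples the variance.
priors = [0.5, 0.5]
source = [([0.0], [[-1.0]], [[1.0]]), ([0.0], [[1.0]], [[1.0]])]
target = [([0.0], [[-0.5]], [[4.0]]), ([0.0], [[1.5]], [[4.0]])]
x_src = adapted_decoder_input([2.5], priors, source, target)
```

A wrong MAP assignment maps the output toward the wrong source component, so this transformation is most reliable when the target mixture components are reasonably well separated.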
We also explored a variant of this adapted decoder which uses a soft (probabilistic) assignment of the channel output to the channel input and mixture component pair , given by
(15) 
From our empirical evaluation, we found the hard MAP assignment based adaptation to have better performance. Hence, our experimental results are based on the adapted decoder (14).
4.2 Adapted Decoder Based on MAP Symbol Estimation
In the previous method, an input transformation layer is introduced at the decoder only during adaptation, and not during training of the autoencoder. Alternatively, here we propose an input transformation layer at the decoder that takes the channel output and produces a best estimate of the encoded symbol , which is then given as input to the decoder as shown in Fig. 5. This input transformation layer is included during the autoencoder training as a fixed nonlinear transformation that does not have any trainable parameters. Since the decoder is trained to predict using instead of , it is inherently robust to changes in the channel distribution of .
Given a generative model of the channel conditional density using Gaussian mixtures, we can estimate the plug-in Bayes posterior distribution of given as
From this, we define the MAP estimate of given as
(16) 
The adapted decoder based on this input transformation, referred to as the MAP symbol estimation (SE) layer, is defined as
(17) 
and illustrated in Fig. 5. Whenever the MDN model is adapted to changes in the channel distribution, resulting in a new MDN with parameters , the MAP SE layer is also updated using . This input transformation shields the decoder from changes to the distribution of .
Since the MAP SE layer is also included in the autoencoder during training, the non-differentiable argmax function presents an obstacle to training the autoencoder using backpropagation. We address this by using a temperature-scaled softmax approximation to the argmax, which is differentiable and provides a close approximation for small temperature values. This approximation is used only during training, whereas the exact argmax is used during inference. Details on this approximation, and a modified autoencoder training algorithm with temperature annealing are discussed in Appendix E.
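The effect of the temperature can be seen in a small sketch that compares a softmax-weighted symbol estimate with the exact MAP symbol. It assumes the soft estimate is a posterior-weighted average of the constellation symbols with the log-posteriors divided by the temperature; the precise form used in Appendix E may differ, and the names are illustrative.

```python
import math


def soft_symbol_estimate(log_posteriors, symbols, temperature):
    """Differentiable surrogate: softmax(log_posteriors / T) weighted
    average of the constellation symbols."""
    scaled = [lp / temperature for lp in log_posteriors]
    m = max(scaled)
    w = [math.exp(s - m) for s in scaled]
    total = sum(w)
    dim = len(symbols[0])
    return [sum(wi * z[j] for wi, z in zip(w, symbols)) / total
            for j in range(dim)]


def map_symbol_estimate(log_posteriors, symbols):
    """Exact (non-differentiable) MAP choice used at inference time."""
    idx = max(range(len(symbols)), key=lambda j: log_posteriors[j])
    return symbols[idx]


symbols = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
log_post = [math.log(0.7), math.log(0.2), math.log(0.1)]
warm = soft_symbol_estimate(log_post, symbols, temperature=1.0)
cold = soft_symbol_estimate(log_post, symbols, temperature=0.01)
hard = map_symbol_estimate(log_post, symbols)
```

At temperature 1 the estimate is the posterior mean of the symbols, while at a small temperature it is numerically indistinguishable from the MAP symbol, which is why annealing the temperature during training is a natural choice.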
5 Experimental Evaluation
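Network | Layer | Activation
Encoder | FC, | ReLU
Encoder | FC, | Linear
Encoder | Normalization (avg. power) | None
MDN | FC, | ReLU
MDN | FC, | ReLU
MDN | Concatenation of three FC layers (means, variances, prior logits) |
Decoder | FC, | ReLU
Decoder | FC, | Softmax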
We implemented the mixture density network and communication autoencoder models using TensorFlow 2.3 [tensorflow2015whitepaper] and TensorFlow Probability [tf_proba]. We used the BFGS optimizer implementation available in TensorFlow Probability. The code base for our work can be found at https://anonymous.4open.science/r/domain_adaptation7C0D/. All the experiments were run on a MacBook Pro laptop with 16 GB memory and 8 CPU cores. Table 1 summarizes the architecture of the encoder, MDN (channel model), and decoder neural networks. Note that the output layer of the MDN is a concatenation (denoted by ) of three fully connected layers predicting the means, variances, and mixing prior logit parameters of the Gaussian mixture. We used the following setting in all our experiments. The size of the message set was fixed to , corresponding to bits. The dimension of the encoding (output of the encoder) was set to , and the number of mixture components was set to . The size of the hidden layers was set to .
The generative adaptation objective (9) is used for the experiments in § 5.1, where the MDN is adapted in isolation (not as part of the autoencoder). The discriminative adaptation objective (10) is used for the experiments in § 5.2 and § 5.3, where the MDN is adapted as part of the autoencoder. For the proposed method, the scale and shift components of the adaptation parameters are initialized to 1s and 0s respectively. This ensures that the target Gaussian mixture is always initialized with the source Gaussian mixture. The regularization constant in the adaptation objective was varied over equally spaced values on the scale (base ) with range to . The value and corresponding to the smallest validation metric are selected as the final solution (§ 3.4). We note that minimizing the adaptation objective for different values can be efficiently done in parallel over multiple CPU cores.
5.1 MDN adaptation on Simulated Channels
We evaluate the proposed adaptation method for an MDN (§ 3) on simulated channel variations based on models commonly used for wireless communication. Specifically, we use the following channel models: i) additive white Gaussian noise (AWGN), ii) Ricean fading, and iii) Uniform or flat fading [goldsmith2005wireless]. Details on these channel models and the calculation of their signal-to-noise ratio (SNR) are provided in Appendix F. In each case, the MDN is first trained on a large dataset simulated from a particular type of channel model (e.g., AWGN), referred to as the source channel. The trained MDN is then adapted using a small dataset from a different type of channel model (e.g., Ricean fading), referred to as the target channel. We used a standard constellation corresponding to quadrature amplitude modulation of 16 symbols, referred to as 16-QAM [goldsmith2005wireless], as inputs to the channel. A training set of samples from the source channel is used to train the MDN. The size of the adaptation dataset from the target channel is varied over a few different values: 5, 10, 20 and 30 samples per symbol, corresponding to target datasets of size 80, 160, 320 and 480 respectively.



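| Source to target | Target samples | Proposed: median | Proposed: 95% CI | Transfer: median | Transfer: 95% CI | Transfer-last-layer: median | Transfer-last-layer: 95% CI |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AWGN to Uniform fading | 80 | 0.98 | (0.80, 1.06) | 0.54 | (-1.87, 0.98) | 0.88 | (0.30, 0.99) |
| AWGN to Uniform fading | 160 | 0.98 | (0.90, 1.09) | 0.93 | (0.23, 1.00) | 0.97 | (0.68, 1.00) |
| AWGN to Uniform fading | 320 | 0.99 | (0.84, 1.11) | 0.99 | (0.89, 1.03) | 0.99 | (0.93, 1.07) |
| AWGN to Uniform fading | 480 | 0.98 | (0.88, 1.12) | 1.00 | (0.95, 1.09) | 1.00 | (0.96, 1.11) |
| AWGN to Ricean fading | 80 | 1.08 | (0.00, 4.97) | 0.58 | (-13.27, 1.00) | 0.93 | (-19.34, 1.08) |
| AWGN to Ricean fading | 160 | 1.20 | (0.00, 4.84) | 1.02 | (-5.54, 1.26) | 1.04 | (-1.98, 1.77) |
| AWGN to Ricean fading | 320 | 1.08 | (0.00, 5.75) | 1.08 | (0.23, 4.34) | 1.10 | (0.26, 5.00) |
| AWGN to Ricean fading | 480 | 1.08 | (0.00, 5.72) | 1.10 | (0.07, 5.15) | 1.10 | (0.05, 5.20) |
| Ricean fading to Uniform fading | 80 | 0.96 | (0.31, 1.45) | 0.10 | (-29.2, 0.87) | 0.59 | (-34.72, 0.96) |
| Ricean fading to Uniform fading | 160 | 0.98 | (0.42, 1.33) | 0.73 | (-2.49, 0.97) | 0.86 | (0.90, 1.00) |
| Ricean fading to Uniform fading | 320 | 0.97 | (0.23, 1.36) | 0.95 | (0.82, 1.15) | 0.98 | (0.94, 1.34) |
| Ricean fading to Uniform fading | 480 | 0.98 | (0.24, 1.70) | 0.99 | (0.96, 2.32) | 1.01 | (0.96, 2.54) |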
Baseline Methods. We evaluate the following two baseline methods for adapting the MDN. 1) A new MDN is initialized using the weights of the MDN trained on the source dataset, and then trained on the target dataset. 2) Same as baseline 1, but only the weights of the final layer are optimized (fine-tuned) using the target dataset. These methods are referred to as transfer and transfer-last-layer, respectively. We used the Adam optimization method [kingma2015adam] for epochs, with a batch size of or times the target dataset size, whichever is larger.
Evaluation Metric. Since the MDN is a generative model, we evaluate the conditional log-likelihood of the learned Gaussian mixture on an unseen test set with 25000 samples from the target channel. We report the relative change in log-likelihood with respect to the original (unadapted) MDN, since the log-likelihood values may not be comparable across datasets. Suppose the log-likelihood of the original MDN is and that of an adaptation method is , then we calculate as the metric. Larger values are better, and negative values indicate that adaptation leads to a worse model.
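A minimal sketch of such a metric is given below. The exact expression is not legible in this section, so the form used here, (adapted − original) / |original|, is an assumption chosen to be consistent with the stated properties (larger is better, negative means the adapted model is worse). The helper `gmm_avg_loglik` evaluates a fixed diagonal-covariance Gaussian mixture and is only a stand-in for the MDN's conditional log-likelihood; all names and example values are hypothetical.

```python
import numpy as np


def gmm_avg_loglik(x, weights, means, variances):
    """Average log-density of samples x (n, d) under a diagonal-covariance
    Gaussian mixture with k components (weights: (k,), means/variances: (k, d))."""
    diff = x[:, None, :] - means[None, :, :]                       # (n, k, d)
    log_comp = -0.5 * np.sum(diff ** 2 / variances
                             + np.log(2.0 * np.pi * variances), axis=2)
    log_joint = log_comp + np.log(weights)                         # (n, k)
    m = log_joint.max(axis=1, keepdims=True)                       # log-sum-exp shift
    return float(np.mean(m[:, 0] + np.log(np.exp(log_joint - m).sum(axis=1))))


def relative_loglik_change(loglik_original, loglik_adapted):
    """Assumed metric: (adapted - original) / |original|."""
    return (loglik_adapted - loglik_original) / abs(loglik_original)


# Hypothetical values: the adapted model fits the target data much better.
gain = relative_loglik_change(loglik_original=-10.0, loglik_adapted=-0.2)  # about 0.98
```

Under this assumed form, a value near 1 means the adapted log-likelihood is close to zero relative to the original, and values above 1 are possible because log-densities can be positive.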
Results and Inference. Table 2 summarizes the results for three (source, target) channel pairs. For each pair, the methods are run on 50 randomly generated training, adaptation, and test datasets. The training dataset is sampled from the source channel, while the adaptation and test datasets are sampled from the target channel. The SNRs of the source and target channels are independently and randomly selected from the range dB to dB for each trial. We observe that the proposed method has a higher median relative log-likelihood gain for the low sample sizes (80 and 160) and a comparable median for higher sample sizes. Also, the baseline methods often have a 95% confidence interval (CI) that is heavily skewed to the left, with a negative th percentile. The proposed adaptation is more stable even for the smallest sample size, and never has a negative lower CI bound.
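The per-trial metric values can be summarized by a median and an interval, as in Table 2. The sketch below assumes the 95% CI is the empirical interval between the 2.5th and 97.5th percentiles over trials; the exact percentile used in the paper is not legible here, and the function name is hypothetical.

```python
import numpy as np


def summarize_trials(values):
    """Median and empirical 95% interval (2.5th to 97.5th percentile) over trials."""
    values = np.asarray(values, dtype=float)
    lower, median, upper = np.percentile(values, [2.5, 50.0, 97.5])
    return median, (lower, upper)


# Hypothetical per-trial metric values, standing in for the 50 random trials.
median, (lower, upper) = summarize_trials(np.arange(101))
```

A strongly negative lower endpoint next to a median near 1 indicates that a method occasionally fails badly even when it usually helps, which is the instability the text attributes to the baselines at small sample sizes.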

| Method | # parameters |
| --- | --- |
| Transfer | 12925 |
| Transfer-last-layer | 2525 |
| Proposed | 40 |
Table 3 compares the number of parameters being optimized by the proposed and baseline MDN adaptation methods for the architecture in Table 1. The transfer method optimizes all the layer weights of the MDN, a total of 12925 parameters in this case. The transfer-last-layer method optimizes only the weights of the final layer, a total of 2525 parameters in this case. The number of parameters optimized by the proposed method (i.e., the dimension of the adaptation parameter vector) is 40, a much smaller problem than that of the baseline methods. This makes the proposed method well suited to the small-sample adaptation setting.
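The baseline counts in Table 3 can be checked with simple arithmetic. The layer sizes below are assumptions, since Table 1 is not reproduced in this section: an input dimension of 2, two hidden layers of 100 units, and 5 mixture components are chosen because they reproduce the two baseline counts exactly. The count for the proposed method depends on the affine parameterization of § 3 and is not derived here.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Assumed sizes (not taken from Table 1): encoding dimension, mixture
# components, and hidden width.
d, k, hidden = 2, 5, 100

# Output layer predicts k*d means, k*d variances, and k mixing logits.
n_out = 2 * k * d + k

layers = [(d, hidden), (hidden, hidden), (hidden, n_out)]
transfer = sum(dense_params(n_in, n_o) for n_in, n_o in layers)
transfer_last_layer = dense_params(*layers[-1])
```

Under these assumed sizes, `transfer` evaluates to 12925 and `transfer_last_layer` to 2525, so the proposed method's 40 parameters are roughly two orders of magnitude fewer than full fine-tuning.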
5.2 Autoencoder Adaptation on Simulated Channels
We evaluate the proposed decoder adaptation methods on different pairs of simulated source and target channel distributions. The setup for adapting from a source channel A to a target channel B is as follows. The autoencoder is initially trained using data from the source channel A at an SNR of dB. Details of how the SNR is related to the distribution parameters of the simulated channels are discussed in Appendix F. The MDN and the decoder are adapted using a small dataset from the target channel B for different fixed SNRs varied over dB to dB in steps of dB. For each SNR, the adaptation is repeated over randomly sampled datasets from the target channel, and the average block error rate (BLER) values are calculated on a large held-out test dataset (specific to each SNR). The sizes of the training dataset (from channel A) and test dataset (from channel B) are both set to samples per symbol, with symbols. The size of the adaptation dataset from the target channel B is varied over and samples per symbol.
The results of this experiment are summarized in Figs. 6 and 7 for three pairs of source and target channels. Figure 6 corresponds to the adaptation method of § 4.1, referred to as Affine, and Figure 7 corresponds to the adaptation method of § 4.2, referred to as MAP SE. The plots show the BLER vs. SNR curve, with the average BLER on the y-axis (log-scaled) and the SNR on the x-axis, which is commonly used to summarize the error performance of a communication system. The performance of a standard 16-QAM decoder (referred to as 16-QAM) and of an autoencoder trained on the source channel without any adaptation (referred to as no_adapt) are included as baselines. (M-QAM is short for M-ary quadrature amplitude modulation with an M-symbol constellation; it is a standard technique for modulation and demodulation (decoding) that does not adapt to the channel conditions.) For the proposed adaptation methods, the number of samples per symbol from the target channel is indicated as a suffix to the method name. For example, adapt_20 implies that the adaptation used 20 samples per symbol.
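To make the BLER vs. SNR curve concrete, the sketch below computes it for the non-adaptive 16-QAM baseline on a simulated AWGN channel, using minimum-distance decoding. It is an illustration under assumptions, not the paper's evaluation code: SNR is taken as Es/N0 with unit average symbol energy, a block is a single message, and the SNR points and sample counts are hypothetical.

```python
import numpy as np


def block_error_rate(true_labels, predicted_labels):
    """Fraction of messages that were decoded incorrectly."""
    return float(np.mean(np.asarray(true_labels) != np.asarray(predicted_labels)))


def nearest_symbol_decoder(outputs, constellation):
    """Minimum-distance decoding: pick the closest constellation point."""
    d2 = np.sum((outputs[:, None, :] - constellation[None, :, :]) ** 2, axis=2)
    return np.argmin(d2, axis=1)


rng = np.random.default_rng(0)
levels = np.array([-3.0, -1.0, 1.0, 3.0])
# Unit-average-energy 16-QAM (the unscaled average energy is 10).
constellation = np.array([[i, q] for i in levels for q in levels]) / np.sqrt(10.0)

bler_curve = {}
for snr_db in (4.0, 12.0, 20.0):
    # Assumed noise model: variance 1 / (2 * SNR) per real dimension.
    sigma = np.sqrt(1.0 / (2.0 * 10.0 ** (snr_db / 10.0)))
    labels = np.repeat(np.arange(16), 2000)
    outputs = constellation[labels] + sigma * rng.standard_normal((labels.size, 2))
    predictions = nearest_symbol_decoder(outputs, constellation)
    bler_curve[snr_db] = block_error_rate(labels, predictions)
```

Plotting `bler_curve` with a log-scaled y-axis gives the familiar downward-sloping curve; an adapted autoencoder would be evaluated the same way, with its decoder in place of `nearest_symbol_decoder`.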
Observations and Takeaways.

Both adaptation methods significantly decrease the BLER for the cases AWGN to Uniform fading and Ricean fading to Uniform fading.

For the case of AWGN to Ricean fading, the adaptation methods perform at the same level as, or slightly worse than, the baselines. We believe this is because the distributions of the two domains are not very different.

In general, the BLER decreases with increasing size of the target dataset.

Between the two adaptation methods, MAP SE performs marginally better than the Affine method.
5.3 Autoencoder Adaptation on Real FPGA Traces
We evaluate the performance of the adaptation methods on real over-the-air wireless experiments. We use a recent high-performance mmWave testbed [Lacruz_MOBISYS2021], featuring a high-end FPGA board with 2 GHz of bandwidth per channel and 60 GHz SIVERS antennas [SIVERSIMA]. This platform allows us to transmit the custom constellations generated by the encoder and to store the received data, either for training the MDN or for further performance analysis. We train the MDN on a standard 16-QAM constellation with 96000 samples. We evaluate the performance of our adaptation for 20, 35, and 50 samples per symbol. We introduce an IQ-imbalance-based distortion to the constellation and gradually increase the level of imbalance in the system. (IQ imbalance is a common issue in radio frequency communications that introduces distortions to the final constellation.) The BLER of the proposed adaptation methods and the baseline methods (16-QAM and no adaptation) is shown as a function of the IQ imbalance in Fig. 8. The proposed methods (both Affine and MAP SE) show an order of magnitude decrease in BLER compared to the baseline methods when the IQ imbalance is over 25%.
6 Conclusions
In this paper, we proposed a fast and lightweight method for adapting a Gaussian MDN to a target domain with a very limited number of adaptation samples. The method is based on finding the optimal set of component-conditional affine transformations that transform the source Gaussian mixture to the target Gaussian mixture. This is formulated as the minimization of a negative conditional (or posterior) log-likelihood, regularized by the KL-divergence between the two mixture distributions. We applied the MDN adaptation to an autoencoder-based end-to-end communication system, specifically by transforming the inputs to the decoder such that their class-conditional distributions are close to those of the source domain. This allows for fast adaptation of both the MDN channel and the autoencoder without the need for expensive data collection and retraining. We demonstrated the effectiveness of the proposed methods through extensive experiments on both simulated wireless channels and a real mmWave FPGA testbed.
Limitations & Future Work
The proposed adaptation for a Gaussian MDN is primarily targeted at low-dimensional problems such as the wireless channel. It can be challenging to apply to high-dimensional input domains with structure. Extending the proposed work to deep generative models based on normalizing flows [dinh2017realnvp, kingma2018glow, weng2018flow], which would be more suitable for high-dimensional inputs, is an interesting direction. In this work, we do not adapt the encoder network, i.e., the autoencoder constellation is not adapted to changes in the channel distribution. Adapting the encoder, decoder, and channel networks jointly would allow for more flexibility, but would likely be slower and require more data from the target distribution.
References
Appendix
Notation  Description
Input message or class label. Usually , where is the number of bits.
or simply  One-hot-coded vector of a message , with at position and zeros elsewhere.
with  Encoded representation or symbol vector corresponding to an input message.
Channel output that is the feature vector to be classified by the decoder.
Categorical random variable denoting the mixture component of origin.
Encoder NN with parameters mapping a one-hot-coded message to a symbol vector in .
Decoder NN with parameters mapping the channel output into probabilities over the message set.
MAP prediction of the input message by the decoder.
Conditional density (generative) model of the channel with parameters .
Mixture density network that predicts the parameters of a Gaussian mixture.
Transfer or sampling function corresponding to the channel conditional density.
Random vector independent of that captures the stochasticity of the channel.
Input-output mapping of the autoencoder with combined parameter vector .
Affine transformation parameters per component used to adapt the MDN.
and  Affine and inverse-affine transformations between the component densities of the Gaussian mixtures.
Kullback-Leibler divergence between the distributions and .
Multivariate Gaussian density with mean vector and covariance matrix .
Categorical distribution with and .
Indicator function mapping a predicate to if true and if false.
norm of a vector .
The appendices are organized as follows:

Appendix A provides background on end-to-end training of autoencoders and domain adaptation, and discusses related work.

Appendix B provides details on the training and sampling (transfer) function of MDNs.

Appendix C discusses the feature and parameter transformations between multivariate Gaussians.

Appendix D derives the KL divergence between the source and target Gaussian mixtures.

Appendix E provides details on training the MAP symbol estimation autoencoder.

Appendix F provides details on the simulated channel models that were used in our experiments.
Appendix A Background
A.1 Loss Function and Training of the Autoencoder
Expanding on the brief background provided in § 2.1, here we provide a formal discussion of the end-to-end training of the autoencoder. First, let us define the input-output mapping of the autoencoder as , where is the combined vector of parameters from the encoder, channel, and decoder. Given an input message , the autoencoder maps the one-hot-coded representation of into an output probability vector over the message set. Note that, while the encoder and decoder neural networks are deterministic, a forward pass through the autoencoder is stochastic due to the channel transfer function . The learning objective of the autoencoder is to accurately recover the input message at the decoder with a high probability. The cross-entropy (CE) loss, which is commonly used for training classifiers, is also suitable for end-to-end training of the autoencoder. For an input with encoded representation , channel output , and decoded output , the CE loss is given by
(18) 
which is always nonnegative and takes the minimum value when the correct message is decoded with probability . The autoencoder aims to minimize the following expected CE loss over the input message set and the channel output:
(19) 
Here is the prior probability of the input messages, which is usually assumed to be uniform in the absence of prior knowledge. In practice, the autoencoder minimizes an empirical estimate of the expected CE loss function by generating a large set of samples from the channel conditional density given each message. Let denote a set of independent and identically distributed (iid) samples from , the channel conditional density given message . Also, let denote the combined set of samples. The empirical expectation of the autoencoder CE loss (19) is then given by
(20) 
It is clear from the above equation that the channel transfer function should be differentiable in order to be able to backpropagate gradients through the channel to the encoder network. The transfer function defining sample generation for a Gaussian MDN channel is discussed in Appendix B.
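For orientation, the three quantities described in this subsection (per-sample CE loss, expected CE loss, and its empirical estimate) can be sketched in standard form as below. The notation is assumed rather than recovered from the original equations: $y \in \{1, \dots, m\}$ is the message, $\mathbf{1}_y$ its one-hot vector, $\mathbf{E}_{\theta_e}$ the encoder, $P_{\theta_c}(\mathbf{x} \mid \mathbf{z})$ the channel conditional density, $P_{\theta_d}(y' \mid \mathbf{x})$ the decoder's output probabilities, $p(y)$ the message prior, and $\mathbf{x}^{(y)}_n$, $n = 1, \dots, N$, the samples drawn for message $y$.

```latex
% Sketch in assumed notation; see the lead-in for the meaning of each symbol.
\begin{align}
\ell_{\mathrm{CE}}(y, \mathbf{x})
  &= -\sum_{y'=1}^{m} \mathbb{1}(y' = y)\, \log P_{\theta_d}(y' \mid \mathbf{x})
   = -\log P_{\theta_d}(y \mid \mathbf{x}), \\
\mathbb{E}\big[\ell_{\mathrm{CE}}\big]
  &= \sum_{y=1}^{m} p(y)\;
     \mathbb{E}_{\mathbf{x} \sim P_{\theta_c}(\cdot \mid \mathbf{E}_{\theta_e}(\mathbf{1}_y))}
     \big[ -\log P_{\theta_d}(y \mid \mathbf{x}) \big], \\
\widehat{\mathbb{E}}\big[\ell_{\mathrm{CE}}\big]
  &= \sum_{y=1}^{m} p(y)\, \frac{1}{N} \sum_{n=1}^{N}
     -\log P_{\theta_d}\big(y \mid \mathbf{x}^{(y)}_n\big).
\end{align}
```

In this form, the first line is nonnegative and equals zero exactly when the decoder assigns probability 1 to the correct message. Writing each sample as a differentiable function of the encoder output and an independent noise vector is what lets gradients of the last line reach the encoder parameters.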
The training algorithm for jointly learning the autoencoder and channel model (based on [garcia2020mixture]) is given in Algorithm 1. It is an alternating (cyclic) optimization of the channel parameters and the autoencoder (encoder and decoder) parameters. This alternating optimization is required because the empirical expectation of the CE loss in Eq. (20) is valid only when the channel conditional density (i.e., ) is fixed. The training algorithm can be summarized as follows. First, the channel model is trained for epochs using data sampled from the channel with an initial encoder constellation (e.g., M-QAM). With the channel model parameters fixed, the parameters of the encoder and decoder networks are optimized for one epoch of mini-batch SGD updates (using any adaptive learning rate algorithm, e.g., Adam [kingma2015adam]). Since the channel model is no longer optimal for the updated encoder constellation, it is retrained for epochs using data sampled from the channel with the updated constellation. This alternating training of the encoder/decoder and the channel networks is repeated for epochs or until convergence.
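The control flow of this alternating scheme can be sketched as follows. This is a structural illustration only, not Algorithm 1 itself: the three callables, their signatures, and the toy stand-ins used to exercise the loop are hypothetical, and a real implementation would train neural networks in each step.

```python
import numpy as np


def alternating_training(sample_channel, fit_channel_model, train_autoencoder_epoch,
                         constellation, n_rounds, n_channel_epochs):
    """Alternate between fitting the channel model and updating the autoencoder."""
    channel_model = None
    for _ in range(n_rounds):
        # Channel step: collect outputs of the real channel for the current
        # constellation and (re)fit the channel model with the encoder fixed.
        outputs = sample_channel(constellation)
        channel_model = fit_channel_model(channel_model, constellation, outputs,
                                          n_channel_epochs)
        # Autoencoder step: one epoch of encoder/decoder updates with the
        # channel model frozen; this changes the constellation.
        constellation = train_autoencoder_epoch(constellation, channel_model)
    return constellation, channel_model


# Toy stand-ins, used only to show that the loop runs end to end.
def sample_channel(constellation):
    return constellation + 0.5                       # "real" channel adds an offset


def fit_channel_model(previous_model, constellation, outputs, n_epochs):
    return float(np.mean(outputs - constellation))   # estimated offset


def train_autoencoder_epoch(constellation, channel_model):
    return constellation * 1.1                       # pretend update spreading the symbols


final_constellation, final_model = alternating_training(
    sample_channel, fit_channel_model, train_autoencoder_epoch,
    constellation=np.array([-1.0, 1.0]), n_rounds=3, n_channel_epochs=5)
```

The key design point is that neither step touches the other's parameters: the channel model is refit only on data generated with the current constellation, and the encoder and decoder see a frozen channel model during their epoch.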
Finally, we observe some interesting nuances of the communication autoencoder learning task that are not common to other domains such as images. 1) The size of the input space is finite, equal to the number of distinct messages . Because of the stochastic nature of the channel transfer function, the same input message results in a different autoencoder output each time. 2) There is theoretically no limit on the number of samples that can be generated for training and validating the autoencoder. These two factors make the autoencoder learning less susceptible to overfitting, which is a common pitfall in neural network training.
A.2 A Primer on Domain Adaptation
We provide a brief review of domain adaptation (DA) and discuss the key differences of our problem setting from that of standard DA. In the traditional learning setting, training and test data are assumed to be sampled independently from the same distribution , where and are the input vector and target respectively. (The notation used in this section is different from the rest of the paper, but consistent with the statistical learning literature.) In many real-world settings, it can be hard or impractical to collect a large labeled dataset for a target domain where the machine learning model (e.g., a DNN classifier) is to be deployed. On the other hand, it is common to have access to a large unlabeled dataset from the target domain, and a large labeled dataset from a different but related source domain. (One could have multiple source domains in practice; we consider the single source domain setting.) Both and are much larger than , and in most cases there is no labeled data from the target domain (referred to as unsupervised DA). For the target domain, the unlabeled dataset (and labeled dataset if any) are sampled from an unknown target distribution, i.e., and . For the source domain, the labeled dataset is sampled from an unknown source distribution, i.e., . The goal of DA is to leverage the available labeled and unlabeled datasets from the two domains to learn a predictor, denoted by the parametric function , such that the following risk function w.r.t. the target distribution is minimized:
where is a loss function that penalizes the prediction for deviating from the true value (e.g., cross-entropy or hinge loss). In a similar way, we can define the risk function w.r.t. the source distribution . A number of seminal works in DA theory [bendavid2006analysis, blitzer2007learning, bendavid2010theory] have studied this learning setting and provided bounds on in terms of and the divergence between source and target domain distributions. Motivated by this foundational theory, a number of recent works [ganin2015unsupervised, ganin2016domain, long2018conditional, saito2018maximum, zhao2019invariant, johansson2019support] have proposed using DNNs for adversarially learning a shared representation across the source and target domains, such that a predictor using this representation and trained using labeled data from only the source domain also generalizes well to the target domain. An influential work in this line of DA is the domain adversarial neural network (DANN) proposed by [ganin2015unsupervised] and later by [ganin2016domain]. The key idea behind the DANN approach is to adversarially train a label predictor NN and a domain discriminator NN in order to learn a feature representation for which i) the source and target inputs are nearly indistinguishable to the domain discriminator, and ii) the label predictor has good generalization performance on the source domain inputs.
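For reference, the target and source risks discussed in this subsection take the following standard form in the statistical learning literature. The symbols are assumptions, since the original notation is not legible here: $P_t$ and $P_s$ are the target and source distributions over input-target pairs $(\mathbf{x}, y)$, $f_\theta$ is the parametric predictor, and $\ell$ is the loss function.

```latex
% Sketch in assumed notation: expected loss of the predictor under each domain.
\begin{align}
R_t(\theta) &= \mathbb{E}_{(\mathbf{x}, y) \sim P_t}\big[\, \ell\big(f_\theta(\mathbf{x}),\, y\big) \big], \\
R_s(\theta) &= \mathbb{E}_{(\mathbf{x}, y) \sim P_s}\big[\, \ell\big(f_\theta(\mathbf{x}),\, y\big) \big].
\end{align}
```

The theoretical results cited above bound the first quantity in terms of the second plus a measure of divergence between the two distributions, which is why learning representations that make the domains hard to distinguish is a natural strategy.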
Special Cases of DA. While the general DA problem addresses the scenario where and