
Domain Adaptation for Autoencoder-Based End-to-End Communication Over Wireless Channels

by   Jayaram Raghuram, et al.
University of Wisconsin-Madison

The problem of domain adaptation conventionally considers the setting where a source domain has plenty of labeled data, and a target domain (with a different data distribution) has plenty of unlabeled data but no or very limited labeled data. In this paper, we address the setting where the target domain has only limited labeled data from a distribution that is expected to change frequently. We first propose a fast and light-weight method for adapting a Gaussian mixture density network (MDN) using only a small set of target domain samples. This method is well-suited for the setting where the distribution of target data changes rapidly (e.g., a wireless channel), making it challenging to collect a large number of samples and retrain. We then apply the proposed MDN adaptation method to the problem of end-to-end learning of a wireless communication autoencoder. A communication autoencoder models the encoder, decoder, and the channel using neural networks, and learns them jointly to minimize the overall decoding error rate. However, the error rate of an autoencoder trained on a particular (source) channel distribution can degrade when the channel distribution changes frequently, leaving too little time for data collection and retraining of the autoencoder on the target channel distribution. We propose a method for adapting the autoencoder without modifying the encoder and decoder neural networks, adapting only the MDN model of the channel. The method utilizes feature transformations at the decoder to compensate for changes in the channel distribution, and effectively presents to the decoder samples close to the source distribution. Experimental evaluation on simulated datasets and real mmWave wireless channels demonstrates that the proposed methods can quickly adapt the MDN model, and improve or maintain the error rate of the autoencoder under changing channel conditions.





1 Introduction

End-to-end learning of communication systems using autoencoders has recently been shown to be a viable method for designing the next generation of wireless networks [OShea2017deep, dorner2018jstsp, aoudia2019model, OShea2019approximating, yeli2018channel, wang2017survey]. A point-to-point communication system consists of a transmitter (or encoder), a channel, and a receiver (or decoder). The key idea of end-to-end learning for a communication system is to use an autoencoder architecture to model and learn the transmitter and receiver jointly using neural networks in order to minimize an end-to-end performance metric such as the block error rate (BLER) [OShea2017deep]. The channel can be represented as a stochastic transfer function that transforms its input $\mathbf{x}$ to an output $\mathbf{y}$. It can be regarded as a black box that is non-linear and non-differentiable due to hardware imperfections (e.g., quantization and amplifiers). Since autoencoders are trained using stochastic gradient descent (SGD)-based optimization, with the gradients calculated using backpropagation [OShea2017deep], it is challenging to work with a black-box channel that is not differentiable. One approach to address this problem is to use a known mathematical model of the channel (e.g., additive Gaussian noise). Such models enable the computation of gradients of the loss function with respect to the autoencoder parameters via backpropagation. However, such standard channel models do not capture realistic channel effects well, as shown in [aoudia2018endtoend]. Alternatively, recent works have proposed to learn the channel using (deep) generative models that approximate $P(\mathbf{y} \mid \mathbf{x})$, the conditional probability density of the channel output given the channel input, using generative adversarial networks (GANs) [OShea2019approximating, yeli2018channel], mixture density networks (MDNs) [garcia2020mixture], and conditional variational autoencoders (VAEs) [xia2020millimeter]. The use of a differentiable generative model of the channel enables SGD-based training of the autoencoder, while also capturing realistic channel effects better than standard models.

Although this end-to-end optimization with real channels learned from data can improve the physical layer design for communication systems, in reality, channels often change, requiring collection of a large number of samples and retraining the channel model and autoencoder frequently. For this reason, adapting the learned conditional probability density of the channel as often as possible using only a small number of samples is required for good communication performance. In this paper, we study the problem of domain adaptation (DA) of autoencoders using an MDN as the channel model. We make the following contributions: i) We propose a light-weight and sample-efficient method for adapting a generative MDN (used for modeling the wireless channel) based on the properties of Gaussian mixtures. ii) Based on the MDN adaptation, we propose two methods to compensate for changes in the class-conditional feature distributions at the decoder (receiver) that maintain or decrease the error rate of the autoencoder, without requiring any retraining.

In contrast to the conventional DA setting, where one has access to a large unlabeled dataset and no or only a small labeled dataset from the target domain [jiang2008literature, bendavid2006analysis], here we consider DA when the target domain has only a small labeled dataset. This setting applies to the wireless communication channel where the target distribution changes frequently, and we only get to collect a small number of samples at a time from the target distribution. Recent approaches for DA (such as DANN [ganin2016domain]) based on adversarial learning of a shared representation between the source and target domains [ganin2015unsupervised, ganin2016domain, long2018conditional, saito2018maximum, zhao2019invariant, johansson2019support] have achieved much success on computer vision and natural language processing tasks. The high-level idea is to adversarially learn a shared feature representation for which inputs from the source and target distributions are nearly indistinguishable to a domain discriminator DNN, such that a label predictor DNN using this representation and trained using labeled data from only the source domain also generalizes well to the target domain [ganin2016domain]. Adversarial DA methods are not suitable for this problem because of the imbalance in the number of source and target domain unlabeled samples (there are not enough target domain samples to learn a good domain discriminator). Also, adversarial DA methods are not well-suited for the fast and frequent adaptation that is required here. In summary, the problem setting addressed in this work has the following key differences from standard DA:


  • The number of target domain samples is much smaller compared to the source domain. All of them are labeled, i.e., no unlabeled samples.

  • We have a pre-trained classifier (decoder) from the source domain and do not want to retrain the classifier frequently for two reasons: i) retraining may not be fast enough to keep up with changes in the wireless channel, ii) collecting a reasonably-large number of labeled samples is hard while the channel distribution is changing.

  • Unlike adversarial DA methods, we do not train a single classifier that has low error rate on both the domains. We adapt a generative model of the class-conditional feature distribution from the source to the target domain, and design feature transformations at the input of the source domain classifier.

  • Our method focuses on low-dimensional input domains such as a communication system, but may not be well suited for high-dimensional input domains such as images.

2 Background

We introduce the notations and definitions used, followed by a brief background on i) end-to-end learning in wireless communication using autoencoders, and ii) MDN-based generative modeling. A detailed background and discussion of related works on these topics is given in Appendix A.

Notations and Definitions. Vectors and matrices are denoted by boldface symbols. We use uppercase letters for random variables and lowercase letters for the specific values taken by them. $\mathbb{R}$ and $\mathbb{C}$ denote the sets of real and complex numbers. We use the concise notation $[n] := \{1, \dots, n\}$ for any integer $n \geq 1$. We define $\mathbb{1}(c)$ to be an indicator function that takes the value $1$ ($0$) when the predicate $c$ is true (false). We denote the one-hot-coded vector of all zeros except a $1$ at position $i$ by $\mathbf{1}_i$, and we omit its dimension when it is clear from the context. The probability density function (pdf) of a multivariate Gaussian with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$ is denoted by $\mathcal{N}(\cdot \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$. The categorical probability mass function is denoted by $\mathrm{Cat}(p_1, \dots, p_k)$. The identity matrix is denoted by $\mathbf{I}$. The determinant and trace of a matrix $\mathbf{A}$ are denoted by $\det(\mathbf{A})$ and $\mathrm{tr}(\mathbf{A})$ respectively. The element-wise (Hadamard) product of two vectors or matrices is denoted by $\odot$. We use a slightly different notation compared to the convention in machine learning. While a (feature vector, class label) pair is usually denoted by $(\mathbf{x}, y)$, here we denote the same by $(\mathbf{y}, s)$, where $\mathbf{y}$ is the channel output and $s$ is the input message. Also, $\mathbf{x}$ is the encoded representation of a message $s$, and the channel input. Table 4 in the Appendix provides a quick reference for the notations used in the paper.

2.1 Autoencoder-Based End-to-End Learning

Figure 1: Representation of an end-to-end autoencoder-based communication system with a generative channel model.

Consider a single-input, single-output (SISO) wireless communication system as shown in Fig. 1, where the transmitter encodes and transmits messages $s$ from the set $\mathcal{S} := \{1, \dots, m\}$ to the receiver through $d$ discrete uses of the wireless channel. The receiver attempts to accurately decode the transmitted message from the distorted and noisy channel output $\mathbf{y}$. We discuss the end-to-end learning of such a system using the concept of autoencoders [OShea2017deep, dorner2018jstsp].

Transmitter / Encoder Neural Network. The transmitter or encoder part of the autoencoder is modeled as a multi-layer, feed-forward neural network (NN) that takes as input the one-hot-coded representation $\mathbf{1}_s$ of a message $s \in \mathcal{S}$, and produces an encoded symbol vector $\mathbf{x} = \mathbf{E}_{\boldsymbol{\theta}_e}(\mathbf{1}_s) \in \mathbb{R}^d$. Here, $\boldsymbol{\theta}_e$ is the parameter vector (weights and biases) of the encoder NN and $d$ is the encoding dimension. Due to hardware constraints present at the transmitter, a normalization layer is used as the final layer of the encoder network in order to constrain the average power and/or the amplitude of the symbol vectors. The average power constraint is defined as $\mathbb{E}_s\big[\|\mathbf{E}_{\boldsymbol{\theta}_e}(\mathbf{1}_s)\|_2^2\big] \leq c$, where the expectation is over the prior distribution of the input messages, and $c$ is typically set to $1$. The amplitude constraint is defined as $|[\mathbf{E}_{\boldsymbol{\theta}_e}(\mathbf{1}_s)]_i| \leq 1, \; \forall i \in [d]$. The size of the message set is usually chosen to be a power of $2$, i.e., $m = 2^b$, representing $b$ bits of information. Following [OShea2017deep], the communication rate of this system is the number of bits transmitted per channel use, which in this case is $R = b / d$. An autoencoder transmitting $b$ bits over $d$ uses of the channel is referred to as a $(d, b)$ autoencoder. For example, a $(2, 4)$ autoencoder uses a message set of size $16$ and an encoding dimension of $2$, with a communication rate $R = 2$ bits/channel use.
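The average-power normalization and the communication rate described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, random vectors stand in for un-normalized encoder outputs, and the constellation is rescaled so that the constraint holds with equality at unit power.

```python
import numpy as np

def normalize_avg_power(symbols, prior=None, power=1.0):
    """Rescale a constellation so its average symbol power equals `power`.

    symbols: array of shape (m, d), one encoded symbol per message.
    prior:   message probabilities of shape (m,); uniform if None (assumption).
    """
    symbols = np.asarray(symbols, dtype=float)
    m = symbols.shape[0]
    prior = np.full(m, 1.0 / m) if prior is None else np.asarray(prior, dtype=float)
    avg_power = np.sum(prior * np.sum(symbols ** 2, axis=1))
    return symbols * np.sqrt(power / avg_power)

def communication_rate(num_messages, encoding_dim):
    """Bits per channel use: log2(number of messages) / encoding dimension."""
    return np.log2(num_messages) / encoding_dim

rng = np.random.default_rng(0)
raw = rng.normal(size=(16, 2))            # stand-in for un-normalized encoder outputs
constellation = normalize_avg_power(raw)
avg_power = np.mean(np.sum(constellation ** 2, axis=1))
rate = communication_rate(16, 2)          # 16 messages, 2 channel uses
```

With 16 messages and an encoding dimension of 2, the rate works out to 2 bits per channel use.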

Receiver / Decoder Neural Network. The receiver or decoder component is also a multi-layer, feed-forward NN that takes the channel output $\mathbf{y} \in \mathbb{R}^d$ as its input and outputs a probability distribution over the $m$ messages. The input-output mapping of the decoder NN can be expressed as $\mathbf{D}_{\boldsymbol{\theta}_d}(\mathbf{y}) := [P_{\boldsymbol{\theta}_d}(1 \mid \mathbf{y}), \dots, P_{\boldsymbol{\theta}_d}(m \mid \mathbf{y})]$, where $\boldsymbol{\theta}_d$ is the parameter vector of the decoder NN. The softmax activation function is used at the final layer to ensure that the outputs are valid probabilities. The message corresponding to the highest output probability is predicted as the decoded message, i.e., $\hat{s}(\mathbf{y}) = \arg\max_{s \in \mathcal{S}} P_{\boldsymbol{\theta}_d}(s \mid \mathbf{y})$. The decoder NN is essentially a discriminative classifier that learns to accurately categorize the received (distorted) symbol vector into one of the $m$ message classes. This is in contrast to conventional autoencoders, where the decoder learns to accurately reconstruct a high-dimensional tensor input from its low-dimensional representation learned by the encoder. The mean-squared and median-absolute errors are commonly used end-to-end performance metrics for conventional autoencoders. In the case of communication autoencoders, the symbol or block error rate (BLER), defined as $\mathbb{E}_{(\mathbf{y}, s)}\big[\mathbb{1}(\hat{s}(\mathbf{y}) \neq s)\big]$, is used as the end-to-end performance metric.

Channel Model. As discussed in § 1, the wireless channel can be represented by a conditional probability density $P(\mathbf{y} \mid \mathbf{x})$ of the channel output $\mathbf{y}$ given its input $\mathbf{x}$. The channel can be equivalently characterized by a stochastic transfer function $\mathbf{y} = \mathbf{h}(\mathbf{x}, \mathbf{u})$ that transforms the encoded symbol vector into the channel output, where $\mathbf{u}$ captures the stochastic components of the channel (e.g., random noise, phase offsets). For example, an additive white Gaussian noise (AWGN) channel is represented by $\mathbf{y} = \mathbf{x} + \mathbf{u}$, with $\mathbf{u} \sim \mathcal{N}(\cdot \mid \mathbf{0}, \sigma_0^2 \mathbf{I})$ and $P(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{x}, \sigma_0^2 \mathbf{I})$. For realistic wireless channels, the transfer function and conditional probability density are usually unknown and hard to approximate well with standard mathematical models. Recently, a number of works have applied generative models such as conditional generative adversarial networks (GANs) [OShea2019approximating, yeli2018channel], MDNs [garcia2020mixture], and conditional variational autoencoders (VAEs) [xia2020millimeter] for modeling the wireless channel. To model a wireless channel, generative methods learn a parametric model $P_{\boldsymbol{\theta}_c}(\mathbf{y} \mid \mathbf{x})$ (possibly a neural network) that closely approximates the true conditional density of the channel from a dataset of channel input-output observations. Learning a generative model of the channel comes with important advantages. 1) Once the parameters of the channel model are learned from data, the model can be used to generate any number of representative samples from the channel distribution. 2) A channel model with a differentiable transfer function makes it possible to backpropagate gradients of the autoencoder loss function through the channel and train the autoencoder using stochastic gradient descent (SGD)-based optimization. 3) It allows for continuous adaptation of the generative channel model to variations in the channel conditions, thereby maintaining a low BLER of the autoencoder.
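As a concrete instance of a channel with a known transfer function and conditional density, the AWGN example above can be sketched as follows. The function names and the noise level are illustrative assumptions, not part of the original work.

```python
import numpy as np

def awgn_channel(x, noise_std, rng):
    """Stochastic transfer function: output = input + zero-mean Gaussian noise."""
    x = np.asarray(x, dtype=float)
    return x + noise_std * rng.standard_normal(x.shape)

def awgn_log_density(y, x, noise_std):
    """Log conditional density of the output given the input (isotropic Gaussian)."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    d = x.shape[-1]
    sq_dist = np.sum((y - x) ** 2, axis=-1)
    return -0.5 * d * np.log(2.0 * np.pi * noise_std ** 2) - sq_dist / (2.0 * noise_std ** 2)

rng = np.random.default_rng(0)
x = np.tile([1.0, -1.0], (50000, 1))      # the same symbol sent many times
y = awgn_channel(x, noise_std=0.5, rng=rng)
empirical_var = y.var(axis=0)             # should be close to 0.5 ** 2
```

Because the transfer function is a simple sum, gradients pass through it unchanged, which is what makes such a model usable inside backpropagation.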

2.2 Generative Channel Model using a Mixture Density Network

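In this work, we use an MDN [bishop1994mixture, bishop2007prml] with Gaussian components to model the conditional density of the channel output given its input. MDNs can model complex conditional densities by combining a (feed-forward) neural network with a standard parametric mixture model (e.g., mixture of Gaussians). The MDN learns to predict the parameters of the mixture model $\boldsymbol{\phi}(\mathbf{x})$ as a function of the channel input $\mathbf{x}$. This can be expressed as $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{M}_{\boldsymbol{\theta}_c}(\mathbf{x})$, where $\boldsymbol{\theta}_c$ is the parameter vector (weights and biases) of the neural network. The parameters of the mixture model defined by the MDN are a concatenation of the parameters from the $k$ density components, i.e., $\boldsymbol{\phi}(\mathbf{x})^T = [\boldsymbol{\phi}_1(\mathbf{x})^T, \dots, \boldsymbol{\phi}_k(\mathbf{x})^T]$, where $\boldsymbol{\phi}_i(\mathbf{x})$ is the parameter vector of component $i$. Focusing on a Gaussian mixture, the channel conditional density modeled by the MDN is given by

$$P_{\boldsymbol{\theta}_c}(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{i=1}^{k} P_{\boldsymbol{\theta}_c}(K = i \mid \mathbf{x}) \, P_{\boldsymbol{\theta}_c}(\mathbf{y} \mid \mathbf{x}, K = i) \;=\; \sum_{i=1}^{k} \pi_i(\mathbf{x}) \, \mathcal{N}\big(\mathbf{y} \mid \boldsymbol{\mu}_i(\mathbf{x}), \boldsymbol{\sigma}_i^2(\mathbf{x})\big), \tag{1}$$

where $\boldsymbol{\mu}_i(\mathbf{x}) \in \mathbb{R}^d$ is the mean vector, $\boldsymbol{\sigma}_i^2(\mathbf{x}) \in \mathbb{R}_+^d$ is the variance vector, and $\pi_i(\mathbf{x}) \in [0, 1]$ is the weight (prior probability) of component $i$. Also, $K$ is the latent random variable denoting the mixture component of origin. We have assumed that the Gaussian components have a diagonal covariance matrix, with $\boldsymbol{\sigma}_i^2(\mathbf{x})$ being the diagonal elements (Footnote 1: the diagonal covariance assumption does not imply conditional independence of the components of $\mathbf{y}$ as long as $k > 1$). The weights of the mixture are parameterized using the softmax function as $\pi_i(\mathbf{x}) = e^{\alpha_i(\mathbf{x})} / \sum_{j=1}^{k} e^{\alpha_j(\mathbf{x})}$ in order to satisfy the probability constraint. The MDN simply predicts the un-normalized weights $\alpha_i(\mathbf{x}) \in \mathbb{R}$ (also known as the prior logits). For a Gaussian MDN, the parameter vector of component $i$ is defined as $\boldsymbol{\phi}_i(\mathbf{x})^T = [\alpha_i(\mathbf{x}), \boldsymbol{\mu}_i(\mathbf{x})^T, \boldsymbol{\sigma}_i^2(\mathbf{x})^T]$, and its output parameter vector has dimension $k(2d + 1)$. Details on the conditional log-likelihood (CLL) training objective and the transfer function of the MDN, including a differentiable approximation of the transfer function, are discussed in Appendix B.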
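To make the mixture density concrete, the following sketch evaluates the log-density of a channel output for one fixed channel input, given the prior logits, means, and variances that an MDN would predict for that input. The parameter values are made up for illustration, and the helper names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Log of the softmax-normalized mixture weights."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

def logsumexp(a, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))

def diag_gaussian_logpdf(y, means, variances):
    """Log-density of y under each diagonal Gaussian; means/variances have shape (k, d)."""
    d = means.shape[-1]
    quad = np.sum((y - means) ** 2 / variances, axis=-1)
    return -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=-1) + quad)

def mixture_log_density(y, logits, means, variances):
    """Log of the Gaussian mixture density with softmax-parameterized weights."""
    return logsumexp(log_softmax(logits) + diag_gaussian_logpdf(y, means, variances))

# Illustrative (made-up) MDN outputs for one channel input: k = 3 components, d = 2.
logits = np.array([0.3, -1.2, 0.5])
means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
variances = np.array([[1.0, 0.5], [0.2, 0.3], [0.7, 0.7]])
y = np.array([0.2, -0.1])

log_p = mixture_log_density(y, logits, means, variances)
num_outputs = logits.size + means.size + variances.size   # k * (2 * d + 1)
```

Working in the log domain with log-sum-exp avoids underflow when a sample lies far from every component.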

3 Fast Adaptation of the MDN Channel Model

Figure 2: Proposed MDN adaptation overview.

In this section, we propose a fast and light-weight method for adapting a Gaussian MDN when the number of target domain samples is much smaller than the number used for training it. Consider the setting where the channel state (and therefore its conditional distribution) changes over time due to, e.g., environmental factors. Let $P^s(\mathbf{y} \mid \mathbf{x})$ denote the (unknown) source channel distribution underlying the dataset $\mathcal{D}^s$ (of size $N^s$) used to train the MDN $\mathbf{M}_{\boldsymbol{\theta}_c}$. With a sufficiently large dataset and a suitable choice of $k$, the Gaussian mixture learned by the MDN can closely approximate $P^s$. Let $\mathcal{D}^t = \{(\mathbf{x}_n^t, \mathbf{y}_n^t), \; n = 1, \dots, N^t\}$ denote a small set of iid samples ($N^t \ll N^s$) from an unknown target channel distribution $P^t(\mathbf{y} \mid \mathbf{x})$, which is potentially different from $P^s$ but not by a large deviation. Our goal is to adapt the MDN (and therefore the underlying mixture density) using $\mathcal{D}^t$ such that it closely approximates $P^t$. Note that the space of inputs to the MDN is the finite set of modulated symbols $\mathcal{X} := \{\mathbf{E}_{\boldsymbol{\theta}_e}(\mathbf{1}_1), \dots, \mathbf{E}_{\boldsymbol{\theta}_e}(\mathbf{1}_m)\}$ (referred to as a constellation), with each symbol corresponding to a unique message. (Footnote 2: the prior probability over the constellation is equal to the prior probability over the input messages. This is usually either set to be uniform, or estimated using relative frequencies from a large dataset.)


Key Insight. The proposed method is based on the affine-transformation property of multivariate Gaussians, i.e., one can transform between any two multivariate Gaussians through an affine transformation. Given any two Gaussian mixtures with the same number of components and a one-to-one correspondence between the components, we can find the unique set of affine transformations (one per component) that transforms one Gaussian mixture to the other. Moreover, the affine transformations are bijective, allowing the mapping to be applied in the inverse direction. This insight allows us to formulate the MDN adaptation as an optimization over the per-component affine transformation parameters, which is a much smaller problem compared to optimizing the weights of all the MDN layers (see Table 3 for a comparison). To reduce the possibility of the adapted MDN finding bad solutions due to the small-sample regime, we include a regularization term based on the Kullback-Leibler divergence (KLD) in the adaptation objective that constrains the distribution shift produced by the affine transformations. The use of a parametric Gaussian mixture, combined with the one-to-one association of the components, allows us to derive a closed-form expression for the KLD between the source and target mixture distributions.

3.1 Transformation Between Gaussian Mixtures

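Consider the Gaussian mixtures corresponding to the source and target channel conditional densities

$$P_{\boldsymbol{\theta}_c}(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{i=1}^{k} \pi_i(\mathbf{x}) \, \mathcal{N}\big(\mathbf{y} \mid \boldsymbol{\mu}_i(\mathbf{x}), \boldsymbol{\sigma}_i^2(\mathbf{x})\big), \tag{2}$$

$$P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{y} \mid \mathbf{x}) \;=\; \sum_{i=1}^{k} \hat{\pi}_i(\mathbf{x}) \, \mathcal{N}\big(\mathbf{y} \mid \hat{\boldsymbol{\mu}}_i(\mathbf{x}), \hat{\boldsymbol{\sigma}}_i^2(\mathbf{x})\big), \tag{3}$$

where $\hat{\boldsymbol{\theta}}_c$ is the parameter vector of the adapted MDN. The adapted MDN predicts the parameters of the target Gaussian mixture as $\hat{\boldsymbol{\phi}}(\mathbf{x}) = \mathbf{M}_{\hat{\boldsymbol{\theta}}_c}(\mathbf{x})$, where $\hat{\boldsymbol{\phi}}(\mathbf{x})^T = [\hat{\boldsymbol{\phi}}_1(\mathbf{x})^T, \dots, \hat{\boldsymbol{\phi}}_k(\mathbf{x})^T]$ is a concatenation of the parameters of the individual components. As before, the parameters of a component $i$ are defined as $\hat{\boldsymbol{\phi}}_i(\mathbf{x})^T = [\hat{\alpha}_i(\mathbf{x}), \hat{\boldsymbol{\mu}}_i(\mathbf{x})^T, \hat{\boldsymbol{\sigma}}_i^2(\mathbf{x})^T]$. We focus on the case of diagonal covariances, but the results easily extend to the case of general covariances. We summarize the feature and parameter transformations required for mapping the component densities of one Gaussian mixture to another. As shown in Appendix C, the transformation between any two multivariate Gaussians $\mathbf{y} \sim \mathcal{N}(\cdot \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\hat{\mathbf{y}} \sim \mathcal{N}(\cdot \mid \hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})$ can be achieved by the transformation $\hat{\mathbf{y}} = \mathbf{C}(\mathbf{y} - \boldsymbol{\mu}) + \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$, where the mean vector and covariance matrix of the two Gaussians are related as follows:

$$\hat{\boldsymbol{\mu}} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \qquad \hat{\boldsymbol{\Sigma}} = \mathbf{C}\,\boldsymbol{\Sigma}\,\mathbf{C}^T. \tag{4}$$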
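Affine and Inverse-Affine Feature Transformations

Applying the above result to an MDN with $k$ components, we define the affine feature transformation for component $i$ mapping from $\mathcal{N}(\cdot \mid \boldsymbol{\mu}_i(\mathbf{x}), \boldsymbol{\sigma}_i^2(\mathbf{x}))$ to $\mathcal{N}(\cdot \mid \hat{\boldsymbol{\mu}}_i(\mathbf{x}), \hat{\boldsymbol{\sigma}}_i^2(\mathbf{x}))$ as

$$\hat{\mathbf{y}} \;=\; \mathbf{g}_{\mathbf{x} i}(\mathbf{y}) \;:=\; \mathbf{C}_i \big(\mathbf{y} - \boldsymbol{\mu}_i(\mathbf{x})\big) + \mathbf{A}_i \boldsymbol{\mu}_i(\mathbf{x}) + \mathbf{b}_i. \tag{5}$$

It is straightforward to also define the inverse-affine transformation from $\mathcal{N}(\cdot \mid \hat{\boldsymbol{\mu}}_i(\mathbf{x}), \hat{\boldsymbol{\sigma}}_i^2(\mathbf{x}))$ to $\mathcal{N}(\cdot \mid \boldsymbol{\mu}_i(\mathbf{x}), \boldsymbol{\sigma}_i^2(\mathbf{x}))$ as

$$\mathbf{y} \;=\; \mathbf{g}_{\mathbf{x} i}^{-1}(\hat{\mathbf{y}}) \;:=\; \mathbf{C}_i^{-1} \big(\hat{\mathbf{y}} - \mathbf{A}_i \boldsymbol{\mu}_i(\mathbf{x}) - \mathbf{b}_i\big) + \boldsymbol{\mu}_i(\mathbf{x}). \tag{6}$$

Note that the feature transformations are conditioned on a given input $\mathbf{x}$ to the MDN and a given component $i$ of the mixture. This idea is illustrated in Fig. 3. For the case of diagonal covariances, we constrain $\mathbf{A}_i$ and $\mathbf{C}_i$ to also be diagonal (Footnote 3: $\mathbf{A}_i$ is actually not required to be diagonal, but we constrain it to be so for simplicity). These feature transformations will be used for defining a validation metric for the MDN adaptation, and also for aligning the target class-conditional distributions of $\mathbf{y}$ with those of the source in § 4.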
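The affine and inverse-affine feature transformations can be checked numerically. The sketch below assumes the diagonal case, so the mean scale, mean offset, and standard-deviation scale are stored as vectors (named `a`, `b`, `c` here purely for illustration) and applied element-wise for a single component.

```python
import numpy as np

def affine_transform(y, mu, a, b, c):
    """Map a sample from N(mu, diag(var)) to N(a * mu + b, diag(c**2 * var))."""
    return c * (y - mu) + a * mu + b

def inverse_affine_transform(y_hat, mu, a, b, c):
    """Inverse map, from the transformed Gaussian back to the original one."""
    return (y_hat - a * mu - b) / c + mu

rng = np.random.default_rng(0)
mu, var = np.array([1.0, -2.0]), np.array([0.5, 2.0])        # original component
a, b, c = np.array([1.2, 0.8]), np.array([0.3, -0.1]), np.array([0.5, 1.5])

y = mu + np.sqrt(var) * rng.standard_normal((100000, 2))     # samples from the original
y_hat = affine_transform(y, mu, a, b, c)                     # samples from the transformed
y_back = inverse_affine_transform(y_hat, mu, a, b, c)        # round trip
```

The sample mean and variance of `y_hat` should match `a * mu + b` and `c**2 * var`, and the round trip recovers `y` up to floating-point error.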

Parameter Transformations

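The corresponding transformations between the source and target Gaussian mixture parameters for any component $i \in [k]$ are given by

$$\hat{\boldsymbol{\mu}}_i(\mathbf{x}) = \mathbf{A}_i \boldsymbol{\mu}_i(\mathbf{x}) + \mathbf{b}_i, \qquad \hat{\boldsymbol{\sigma}}_i^2(\mathbf{x}) = \mathbf{C}_i^2 \, \boldsymbol{\sigma}_i^2(\mathbf{x}), \qquad \hat{\alpha}_i(\mathbf{x}) = \beta_i \, \alpha_i(\mathbf{x}) + \gamma_i, \tag{7}$$

where $\mathbf{A}_i$ is a diagonal scale matrix for the means; $\mathbf{b}_i$ is an offset vector for the means; $\mathbf{C}_i$ is a diagonal scale matrix for the variances; and $\beta_i, \gamma_i$ are the scale and offset for the component prior logits. The vector of all adaptation parameters to be optimized is defined as $\boldsymbol{\psi}^T = [\boldsymbol{\psi}_1^T, \dots, \boldsymbol{\psi}_k^T]$, where $\boldsymbol{\psi}_i$ collects the parameters $(\mathbf{A}_i, \mathbf{b}_i, \mathbf{C}_i, \beta_i, \gamma_i)$ of component $i$. The number of adaptation parameters (dimension of $\boldsymbol{\psi}$) is given by $k(3d + 2)$. This is typically much smaller than the number of MDN parameters (weights and biases from all layers), even for shallow fully-connected NNs. The overall idea for adapting the MDN is summarized in Fig. 2, where the adaptation layer mapping $\boldsymbol{\phi}(\mathbf{x})$ to $\hat{\boldsymbol{\phi}}(\mathbf{x})$ implements the parameter transformations in Eq. (7).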
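A minimal sketch of such an adaptation layer is given below, assuming diagonal scale matrices stored as vectors. The dictionary layout and function names are hypothetical; the point is only that the layer is a handful of element-wise operations on the source mixture parameters, with far fewer parameters than the MDN itself.

```python
import numpy as np

def identity_adaptation(k, d):
    """Adaptation parameters that leave the source mixture unchanged."""
    return {
        "A": np.ones((k, d)),      # per-component diagonal scale for the means
        "b": np.zeros((k, d)),     # per-component offset for the means
        "C": np.ones((k, d)),      # per-component diagonal scale (squared for variances)
        "beta": np.ones(k),        # scale for the prior logits
        "gamma": np.zeros(k),      # offset for the prior logits
    }

def adapt_mixture_params(logits, means, variances, psi):
    """Map source mixture parameters to target mixture parameters."""
    new_means = psi["A"] * means + psi["b"]
    new_variances = psi["C"] ** 2 * variances
    new_logits = psi["beta"] * logits + psi["gamma"]
    return new_logits, new_means, new_variances

k, d = 3, 2
logits = np.array([0.3, -1.2, 0.5])
means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
variances = np.array([[1.0, 0.5], [0.2, 0.3], [0.7, 0.7]])

psi0 = identity_adaptation(k, d)
num_params = sum(v.size for v in psi0.values())    # k * (3 * d + 2)
same = adapt_mixture_params(logits, means, variances, psi0)

psi_shift = identity_adaptation(k, d)
psi_shift["b"] = np.full((k, d), 0.5)              # shift every component mean
shifted = adapt_mixture_params(logits, means, variances, psi_shift)
```

Starting from `identity_adaptation` gives a target mixture equal to the source mixture, which is a natural initialization for the optimization.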

3.2 Divergence Between the Source and Target Distributions

Figure 3: Affine feature transformations between the source and target Gaussian mixtures.

We would like to derive a distributional-divergence metric between the source and target Gaussian mixtures (Eqs. (2) and (3)) in order to regularize the adaptation loss. A number of distributional-divergence metrics such as the Kullback-Leibler, Jensen-Shannon [lin1991divergence], and Total Variation [verdu2014total] divergence are potential candidates, each possessing some unique properties. However, none of these divergences has a closed-form expression for a pair of general Gaussian mixtures. Prior works such as [hershey2007approximating] have addressed the problem of estimating the KLD between a pair of Gaussian mixtures for the general case where the number of components could be different, with no association between the components. As mentioned earlier, in our problem there exists a one-to-one association between the individual components of the source and target Gaussian mixtures (by definition). This allows us to derive a closed-form expression for the KLD as follows:

$$D_{\mathrm{KL}}\big(P_{\boldsymbol{\theta}_c}, P_{\hat{\boldsymbol{\theta}}_c}\big) \;=\; \sum_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) \sum_{i=1}^{k} \pi_i(\mathbf{x}) \log \frac{\pi_i(\mathbf{x})}{\hat{\pi}_i(\mathbf{x})} \;+\; \sum_{\mathbf{x} \in \mathcal{X}} p(\mathbf{x}) \sum_{i=1}^{k} \pi_i(\mathbf{x}) \, D_{\mathrm{KL}}\Big(\mathcal{N}\big(\cdot \mid \boldsymbol{\mu}_i(\mathbf{x}), \boldsymbol{\sigma}_i^2(\mathbf{x})\big), \, \mathcal{N}\big(\cdot \mid \hat{\boldsymbol{\mu}}_i(\mathbf{x}), \hat{\boldsymbol{\sigma}}_i^2(\mathbf{x})\big)\Big). \tag{8}$$

The first term in the above expression is the KLD between the component prior probabilities, which can be simplified into a function of the adaptation parameters $\boldsymbol{\psi}$. The second term in the above expression involves the KLD between two multivariate Gaussians (a standard result), which can also be simplified into a function of the adaptation parameters. A detailed derivation of this result, and the final expression for the KLD as a function of $\boldsymbol{\psi}$, are given in Appendix D. To make the dependence on $\boldsymbol{\psi}$ explicit, the KLD is henceforth denoted by $D_{\boldsymbol{\psi}}(P_{\boldsymbol{\theta}_c}, P_{\hat{\boldsymbol{\theta}}_c})$.
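The two-term structure described above (a KLD between the component priors plus a prior-weighted sum of KLDs between matched Gaussian components) can be sketched for a single channel input as follows. The average over the constellation is omitted for brevity, and all names are illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def kl_categorical(p, q):
    """KLD between two categorical distributions with strictly positive entries."""
    return np.sum(p * (np.log(p) - np.log(q)))

def kl_diag_gaussians(mu, var, mu_hat, var_hat):
    """KLD from N(mu, diag(var)) to N(mu_hat, diag(var_hat)), one value per component."""
    return 0.5 * np.sum(
        np.log(var_hat / var) + (var + (mu - mu_hat) ** 2) / var_hat - 1.0, axis=-1
    )

def matched_mixture_kld(logits, means, variances, logits_hat, means_hat, variances_hat):
    """KLD between two mixtures whose components are in one-to-one correspondence."""
    pi, pi_hat = softmax(logits), softmax(logits_hat)
    component_kld = kl_diag_gaussians(means, variances, means_hat, variances_hat)
    return kl_categorical(pi, pi_hat) + np.sum(pi * component_kld)

logits = np.array([0.3, -1.2, 0.5])
means = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
variances = np.array([[1.0, 0.5], [0.2, 0.3], [0.7, 0.7]])

kld_same = matched_mixture_kld(logits, means, variances, logits, means, variances)
kld_shifted = matched_mixture_kld(logits, means, variances, logits, means + 0.5, variances)

# Single unit-variance component with its mean shifted by one unit along one axis.
kld_single = matched_mixture_kld(
    np.zeros(1), np.zeros((1, 2)), np.ones((1, 2)),
    np.zeros(1), np.array([[1.0, 0.0]]), np.ones((1, 2)),
)
```

Since every term is a closed-form function of the mixture parameters, the regularizer and its gradient are cheap to evaluate.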

3.3 Loss Function for Adaptation

We consider two scenarios for adaptation: 1) generative adaptation of the MDN in isolation, and 2) discriminative adaptation of the MDN as the channel model for an autoencoder. In the first case, the goal of adaptation is to find a good generative model for the target data distribution, while in the second case, the goal is to improve the classification performance of the autoencoder on the target data distribution. Recall that we are given a small dataset $\mathcal{D}^t$ sampled from the target distribution $P^t(\mathbf{y} \mid \mathbf{x})$. We formulate the MDN adaptation as a minimization problem with a regularized negative log-likelihood objective, where the regularization term penalizes solutions with a large KLD between the source and target Gaussian mixtures.

Generative Adaptation. The adaptation objective is the following regularized negative conditional log-likelihood (CLL) of the target dataset:

$$J_{\mathrm{CLL}}(\boldsymbol{\psi}; \lambda) \;=\; -\frac{1}{N^t} \sum_{n=1}^{N^t} \log P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{y}_n^t \mid \mathbf{x}_n^t) \;+\; \lambda \, D_{\boldsymbol{\psi}}\big(P_{\boldsymbol{\theta}_c}, P_{\hat{\boldsymbol{\theta}}_c}\big), \tag{9}$$

where $P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{y} \mid \mathbf{x})$ is given by Eq. (3), and $\hat{\boldsymbol{\mu}}_i$, $\hat{\boldsymbol{\sigma}}_i^2$, and $\hat{\alpha}_i$ as a function of $\boldsymbol{\psi}$ are given by Eq. (7). The parameters of the original mixture density are constant terms since they have no dependence on $\boldsymbol{\psi}$. The regularization constant $\lambda \geq 0$ controls the KLD between the source and target Gaussian mixtures in the optimal solution. Small values of $\lambda$ weight the CLL term more and allow more exploration in the adaptation; larger values of $\lambda$ impose a stronger regularization to constrain the space of target distributions.

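Discriminative Adaptation. With the goal of improving the accuracy of the decoder in recovering the transmitted symbol $\mathbf{x}$ from $\mathbf{y}$, the data-dependent term in the adaptation objective (9) is replaced with the posterior log-likelihood (PLL) as follows:

$$J_{\mathrm{PLL}}(\boldsymbol{\psi}; \lambda) \;=\; -\frac{1}{N^t} \sum_{n=1}^{N^t} \log P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{x}_n^t \mid \mathbf{y}_n^t) \;+\; \lambda \, D_{\boldsymbol{\psi}}\big(P_{\boldsymbol{\theta}_c}, P_{\hat{\boldsymbol{\theta}}_c}\big). \tag{10}$$

The posterior probability $P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{x} \mid \mathbf{y}) = P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{y} \mid \mathbf{x}) \, p(\mathbf{x}) \, / \sum_{\mathbf{x}' \in \mathcal{X}} P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{y} \mid \mathbf{x}') \, p(\mathbf{x}')$ can be expressed in terms of the conditional Gaussian mixtures by applying Bayes' rule. This is the only difference from the generative adaptation scenario.

We make the following observations about the minimization problem: i) The adaptation objective (in both cases) is a smooth and nonconvex function of $\boldsymbol{\psi}$; ii) Computing the objective and its gradient w.r.t. $\boldsymbol{\psi}$ is inexpensive since $N^t$ and the dimension of $\boldsymbol{\psi}$ are relatively small. Also, this does not require forward and back-propagation steps through the layers of the MDN. For this reason, we use the BFGS quasi-Newton method [nocedal2006numerical] for minimization, instead of SGD-based methods, which are more suitable for large-scale learning problems.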
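The data-dependent term of the discriminative objective relies on a symbol posterior obtained from the mixture likelihoods by Bayes' rule. The sketch below computes that posterior over a small, made-up constellation and the resulting negative posterior log-likelihood. Array shapes, names, and parameter values are assumptions for illustration; the KLD regularizer and the BFGS minimization are not shown.

```python
import numpy as np

def logsumexp(a, axis):
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))

def mixture_log_lik(y, logits, means, variances):
    """Mixture log-likelihood of y for every constellation symbol.

    y: (d,); logits: (m, k); means, variances: (m, k, d). Returns shape (m,).
    """
    log_w = logits - logsumexp(logits, axis=1)[:, None]
    d = y.shape[0]
    log_n = -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=2)
                    + np.sum((y - means) ** 2 / variances, axis=2))
    return logsumexp(log_w + log_n, axis=1)

def symbol_log_posterior(y, logits, means, variances, log_prior):
    """Log posterior over constellation symbols given y (Bayes' rule)."""
    joint = mixture_log_lik(y, logits, means, variances) + log_prior
    return joint - logsumexp(joint, axis=0)

def neg_posterior_log_lik(ys, labels, logits, means, variances, log_prior):
    """Average negative log posterior of the true symbol index for each sample."""
    terms = [symbol_log_posterior(y, logits, means, variances, log_prior)[j]
             for y, j in zip(ys, labels)]
    return -np.mean(terms)

# Made-up example: m = 4 symbols, k = 2 components per symbol, d = 2.
constellation = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
means = np.repeat(constellation[:, None, :], 2, axis=1)
variances = np.full((4, 2, 2), 0.05)
logits = np.zeros((4, 2))
log_prior = np.log(np.full(4, 0.25))

y = np.array([0.9, -1.1])                     # received near the second symbol
posterior = np.exp(symbol_log_posterior(y, logits, means, variances, log_prior))
pll_loss = neg_posterior_log_lik([y], [1], logits, means, variances, log_prior)
```

Swapping `neg_posterior_log_lik` for an average negative mixture log-likelihood gives the generative variant of the data term.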

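3.4 Validation Metric and Selection of $\lambda$

The choice of $\lambda$ in the adaptation objective is crucial as it sets the amount of regularization most suitable for the target domain distribution. We propose a validation metric for selecting $\lambda$ based on the CLL of the inverse-affine-transformed target dataset with respect to the source mixture density. The reasoning is that, if the adaptation finds a solution (i.e., $\boldsymbol{\psi}$) that is a good fit for the target dataset, then the inverse feature transformations based on that solution should produce a transformed target dataset that has a high CLL with respect to the source mixture density. The validation metric is the negative CLL of the inverse-transformed target dataset, given by

$$V(\boldsymbol{\psi}; \mathcal{D}^t) \;=\; -\frac{1}{N^t} \sum_{n=1}^{N^t} \log P_{\boldsymbol{\theta}_c}\Big(\mathbf{g}^{-1}_{\mathbf{x}_n^t \, i^\star(n)}(\mathbf{y}_n^t) \;\Big|\; \mathbf{x}_n^t\Big). \tag{11}$$

Here $i^\star(n)$ is the best component assignment for the sample $(\mathbf{x}_n^t, \mathbf{y}_n^t)$, given by

$$i^\star(n) \;=\; \arg\max_{i \in [k]} \; P_{\hat{\boldsymbol{\theta}}_c}(K = i \mid \mathbf{x}_n^t, \mathbf{y}_n^t). \tag{12}$$

The above equation is simply the maximum-a-posteriori (MAP) rule applied to the component posterior of the target Gaussian mixture defined as

$$P_{\hat{\boldsymbol{\theta}}_c}(K = i \mid \mathbf{x}, \mathbf{y}) \;=\; \frac{\hat{\pi}_i(\mathbf{x}) \, \mathcal{N}\big(\mathbf{y} \mid \hat{\boldsymbol{\mu}}_i(\mathbf{x}), \hat{\boldsymbol{\sigma}}_i^2(\mathbf{x})\big)}{\sum_{j=1}^{k} \hat{\pi}_j(\mathbf{x}) \, \mathcal{N}\big(\mathbf{y} \mid \hat{\boldsymbol{\mu}}_j(\mathbf{x}), \hat{\boldsymbol{\sigma}}_j^2(\mathbf{x})\big)}.$$

Note that the validation metric (11) is based on the source Gaussian mixture (MDN with parameters $\boldsymbol{\theta}_c$), but the MAP component assignment for each target domain sample in Eq. (12) is based on the target Gaussian mixture (MDN with parameters $\hat{\boldsymbol{\theta}}_c$). The adaptation objective is minimized with $\lambda$ varied over a range of values, and in each case the adapted solution $\boldsymbol{\psi}$ is evaluated using the validation metric. The pair of $\lambda$ and $\boldsymbol{\psi}$ resulting in the smallest validation metric is chosen as the final adapted solution.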
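The validation metric can be sketched as follows: assign each target sample to its MAP component under the adapted mixture, map it back with the inverse-affine transformation of that component, and score it under the source mixture. The data layout (dictionaries of arrays indexed by symbol) and all names are hypothetical, and the toy check uses a single symbol and a single component so the result is easy to verify by hand.

```python
import numpy as np

def logsumexp(a, axis):
    a_max = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(a_max, axis=axis) + np.log(np.sum(np.exp(a - a_max), axis=axis))

def component_log_joint(y, logits, means, variances):
    """Log of (mixture weight x Gaussian density) per component; shapes (k,), (k, d)."""
    log_w = logits - logsumexp(logits, axis=0)
    d = y.shape[0]
    log_n = -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(variances), axis=1)
                    + np.sum((y - means) ** 2 / variances, axis=1))
    return log_w + log_n

def validation_metric(ys, sym_idx, src, psi):
    """Negative CLL, under the source mixture, of inverse-transformed target samples."""
    total = 0.0
    for y, j in zip(ys, sym_idx):
        lg, mu, var = src["logits"][j], src["means"][j], src["variances"][j]
        # Adapted (target) mixture parameters for input symbol j.
        lg_t = psi["beta"] * lg + psi["gamma"]
        mu_t = psi["A"] * mu + psi["b"]
        var_t = psi["C"] ** 2 * var
        # MAP component under the target mixture.
        i = int(np.argmax(component_log_joint(y, lg_t, mu_t, var_t)))
        # Inverse-affine map of the sample back to the matching source component.
        y_src = (y - mu_t[i]) / psi["C"][i] + mu[i]
        total -= logsumexp(component_log_joint(y_src, lg, mu, var), axis=0)
    return total / len(ys)

# Toy check: one symbol, one standard-normal component, target shifted by +2.
src = {"logits": np.zeros((1, 1)), "means": np.zeros((1, 1, 2)),
       "variances": np.ones((1, 1, 2))}
psi_true = {"A": np.ones((1, 2)), "b": np.full((1, 2), 2.0), "C": np.ones((1, 2)),
            "beta": np.ones(1), "gamma": np.zeros(1)}
psi_identity = dict(psi_true, b=np.zeros((1, 2)))

rng = np.random.default_rng(0)
ys = rng.standard_normal((200, 2)) + 2.0
sym_idx = np.zeros(200, dtype=int)
v_true = validation_metric(ys, sym_idx, src, psi_true)
v_identity = validation_metric(ys, sym_idx, src, psi_identity)
```

In this toy case the adaptation that matches the true shift yields the lower metric, which is the behaviour the selection rule relies on.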

4 Adaptation of Autoencoder-Based Communication System

In this section, we discuss how the proposed MDN adaptation can be combined with an autoencoder-based communication system to adapt the decoder to changes in the channel conditions. Recall that the decoder is basically a classifier that predicts the most-probable input message from the received channel output $\mathbf{y}$. When the decoder operates in a new (target) channel environment, different from the one it was trained on, its classification accuracy can degrade due to the distribution change. Specifically, any change in the channel conditions is reflected as a change in the class-conditional density of the decoder's input, i.e., $P(\mathbf{y} \mid s)$ changes. (Footnote 4: for this generative model, it is easy to see that the class-conditional density is equal to the channel-conditional density, i.e., $P(\mathbf{y} \mid s) = P(\mathbf{y} \mid \mathbf{E}_{\boldsymbol{\theta}_e}(\mathbf{1}_s))$. Hence, by adapting the MDN, we are effectively also adapting the class-conditional density of the decoder's input.) We propose to address this by designing transformations to the decoder's input that can compensate for changes in the channel distribution, and effectively present transformed inputs that are close to the source distribution on which the decoder was trained. Our method does not require any change or adaptation to the decoder network itself, making it fast and suitable for the small-sample-size setting. We next discuss two such input transformation methods for the decoder.

4.1 Adapted Decoder Based on Affine Feature Transformations

Figure 4: Adapted decoder with affine transformations.

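Consider the same problem setup as § 3, where we observe a small dataset of samples from the target channel distribution. Suppose we have adapted the MDN channel by optimizing over the parameters $\boldsymbol{\psi}$. We can then use the inverse-affine feature transformations (defined in Eq. (6)) to transform the channel output $\mathbf{y}$ from a component of the target Gaussian mixture to the same component of the source Gaussian mixture. However, this transformation requires knowledge of both the channel input $\mathbf{x}$ and the mixture component $i$, which are not observed (latent) at the decoder. We propose to address this by first determining the most-probable pair of channel input and mixture component for a given $\mathbf{y}$ (using the MAP rule), and applying the corresponding inverse-affine feature transformation as follows:

$$\mathbf{g}^{-1}(\mathbf{y}) \;:=\; \mathbf{g}^{-1}_{\mathbf{x}^\star i^\star}(\mathbf{y}), \quad \text{where} \quad (\mathbf{x}^\star, i^\star) \;=\; \arg\max_{\mathbf{x} \in \mathcal{X}, \; i \in [k]} \; P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{x}, i \mid \mathbf{y}). \tag{13}$$

The joint posterior over the channel input $\mathbf{x}$ and mixture component $i$, given the channel output $\mathbf{y}$, is based on the adapted (target) Gaussian mixture, given by

$$P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{x}, i \mid \mathbf{y}) \;=\; \frac{p(\mathbf{x}) \, \hat{\pi}_i(\mathbf{x}) \, \mathcal{N}\big(\mathbf{y} \mid \hat{\boldsymbol{\mu}}_i(\mathbf{x}), \hat{\boldsymbol{\sigma}}_i^2(\mathbf{x})\big)}{\sum_{\mathbf{x}' \in \mathcal{X}} \sum_{j=1}^{k} p(\mathbf{x}') \, \hat{\pi}_j(\mathbf{x}') \, \mathcal{N}\big(\mathbf{y} \mid \hat{\boldsymbol{\mu}}_j(\mathbf{x}'), \hat{\boldsymbol{\sigma}}_j^2(\mathbf{x}')\big)}.$$

The adapted decoder based on the above affine feature transformation is defined as

$$\hat{\mathbf{D}}_{\boldsymbol{\theta}_d}(\mathbf{y}) \;:=\; \mathbf{D}_{\boldsymbol{\theta}_d}\big(\mathbf{g}^{-1}(\mathbf{y})\big), \tag{14}$$

and illustrated in Fig. 4. Note that the adapted decoder is a function of the parameters $\boldsymbol{\psi}$, even though this is not made explicit in the notation.

We also explored a variant of this adapted decoder which uses a soft (probabilistic) assignment of the channel output to the channel input and mixture component pair $(\mathbf{x}, i)$, given by

$$\mathbf{g}^{-1}_{\mathrm{soft}}(\mathbf{y}) \;:=\; \sum_{\mathbf{x} \in \mathcal{X}} \sum_{i=1}^{k} P_{\hat{\boldsymbol{\theta}}_c}(\mathbf{x}, i \mid \mathbf{y}) \; \mathbf{g}^{-1}_{\mathbf{x} i}(\mathbf{y}). \tag{15}$$

From our empirical evaluation, we found the adaptation based on the hard MAP assignment to have better performance. Hence, our experimental results are based on the adapted decoder (14).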
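The hard-assignment input transformation can be sketched as below: find the most probable (symbol, component) pair under the adapted mixture, then apply that pair's inverse-affine transformation before the sample is passed to the unchanged decoder. The array layout and names are assumptions for illustration, and the decoder network itself is not shown.

```python
import numpy as np

def log_softmax(a, axis):
    a = a - np.max(a, axis=axis, keepdims=True)
    return a - np.log(np.sum(np.exp(a), axis=axis, keepdims=True))

def map_inverse_affine(y, src, psi, log_prior):
    """Return the inverse-affine-transformed sample and the MAP (symbol, component) pair.

    src: 'logits' (m, k), 'means' (m, k, d), 'variances' (m, k, d) of the source mixture.
    psi: 'A', 'b', 'C' of shape (k, d) and 'beta', 'gamma' of shape (k,).
    """
    mu, var = src["means"], src["variances"]
    # Adapted (target) mixture parameters, broadcast over the m symbols.
    mu_t = psi["A"] * mu + psi["b"]
    var_t = psi["C"] ** 2 * var
    log_w_t = log_softmax(psi["beta"] * src["logits"] + psi["gamma"], axis=1)
    d = y.shape[0]
    log_n = -0.5 * (d * np.log(2.0 * np.pi) + np.log(var_t).sum(axis=2)
                    + ((y - mu_t) ** 2 / var_t).sum(axis=2))
    scores = log_prior[:, None] + log_w_t + log_n      # unnormalized joint log posterior
    j, i = np.unravel_index(np.argmax(scores), scores.shape)
    y_src = (y - mu_t[j, i]) / psi["C"][i] + mu[j, i]
    return y_src, int(j), int(i)

# Made-up example: 4 symbols, 1 component each; target shifted by 0.5 and spread by 2x.
constellation = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
src = {"logits": np.zeros((4, 1)), "means": constellation[:, None, :],
       "variances": np.full((4, 1, 2), 0.01)}
psi = {"A": np.ones((1, 2)), "b": np.full((1, 2), 0.5), "C": np.full((1, 2), 2.0),
       "beta": np.ones(1), "gamma": np.zeros(1)}
log_prior = np.log(np.full(4, 0.25))

y = np.array([-0.4, 1.46])        # target-domain sample near the shifted third symbol
y_src, j_star, i_star = map_inverse_affine(y, src, psi, log_prior)
```

Here the sample is pulled back to the neighbourhood of the original third symbol, where a decoder trained on the source channel expects it.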

4.2 Adapted Decoder Based on MAP Symbol Estimation

Figure 5: Adapted decoder with MAP SE.

In the previous method, an input transformation layer is introduced at the decoder only during adaptation, and not during training of the autoencoder. Alternatively, here we propose an input transformation layer at the decoder that takes the channel output and produces a best estimate of the encoded symbol , which is then given as input to the decoder as shown in Fig. 5

. This input transformation layer is included during the autoencoder training as a fixed non-linear transformation that does not have any trainable parameters. Since the decoder is trained to predict using

instead of , it is inherently robust to changes in the channel distribution of .

Given a generative model of the channel conditional density using Gaussian mixtures, we can estimate the plug-in Bayes posterior distribution of given as

From this, we define the MAP estimate of given as


The adapted decoder based on this input transformation, referred to as the MAP symbol estimation (SE) layer, is defined as


and illustrated in Fig. 5. Whenever the MDN model is adapted to changes in the channel distribution, resulting in a new MDN with parameters , the MAP SE layer is also updated using . This input transformation shields the decoder from changes to the distribution of .

Since the MAP SE layer is also included in the autoencoder during training, the non-differentiable argmax function presents an obstacle to training the autoencoder using backpropagation. We address this by using a temperature-scaled softmax approximation to the argmax, which is differentiable and provides a close approximation for small temperature values. This approximation is used only during training, whereas the exact argmax is used during inference. Details on this approximation, and a modified autoencoder training algorithm with temperature annealing, are discussed in Appendix E.
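To make the temperature-scaled softmax idea concrete, the following sketch replaces the hard selection of one symbol by a softmax-weighted average of all symbols. It is an assumption-laden illustration: the function `soft_symbol_estimate` is hypothetical, and whether the scores being scaled are posteriors or log-posteriors is specified in Appendix E and is not assumed here beyond this example.

```python
import numpy as np

def soft_symbol_estimate(scores, symbols, temperature):
    """Differentiable surrogate for symbols[argmax(scores)].

    scores: (m,) per-symbol scores (here, log-posteriors by assumption).
    symbols: (m, d) constellation points.
    """
    scaled = np.asarray(scores, dtype=float) / temperature
    scaled -= np.max(scaled)          # stabilize the exponentials
    soft_weights = np.exp(scaled)
    soft_weights /= np.sum(soft_weights)
    # Weighted average of the symbols; approaches the argmax symbol
    # as the temperature goes to zero.
    return soft_weights @ symbols
```

At a small temperature the output is nearly the symbol with the largest score, while at a large temperature it drifts toward the average of all symbols, which is why annealing the temperature during training is a natural design choice.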

5 Experimental Evaluation

Network Layer Activation
Encoder FC, ReLU
FC, Linear
Normalization (avg. power) None
FC, (means)
FC, (variances)
FC, (prior logits)
Decoder FC, ReLU
FC, Softmax
Table 1: Architecture of the Encoder, MDN channel, and Decoder neural networks. FC - fully connected (dense) layer; denotes layer concatenation; ELU - exponential linear unit; - number of messages; - encoding dimension; - number of mixture components; - size of a hidden layer.

We implemented the mixture density network and communication autoencoder models using TensorFlow 2.3 [tensorflow2015-whitepaper] and TensorFlow Probability [tf_proba]. We used the BFGS optimizer implementation available in TensorFlow Probability. The code base for our work is available online. All the experiments were run on a MacBook Pro laptop with 16 GB memory and 8 CPU cores. Table 1 summarizes the architecture of the encoder, MDN (channel model), and decoder neural networks. Note that the output layer of the MDN is a concatenation of three fully-connected layers predicting the means, variances, and mixing prior logit parameters of the Gaussian mixture. We used the following setting in all our experiments. The size of the message set was fixed to , corresponding to bits. The dimension of the encoding (output of the encoder) was set to , and the number of mixture components was set to . The size of the hidden layers was set to .

The generative adaptation objective (9) is used for the experiments in § 5.1, where the MDN is adapted in isolation (not as part of the autoencoder). The discriminative adaptation objective (10) is used for the experiments in § 5.2 and § 5.3, where the MDN is adapted as part of the autoencoder. For the proposed method, the scale and shift components of the adaptation parameters are initialized to 1s and 0s respectively. This ensures that the target Gaussian mixture is always initialized with the source Gaussian mixture. The regularization constant in the adaptation objective was varied over a grid of values equally spaced on a logarithmic scale. The regularization constant and the corresponding adaptation parameters that give the smallest validation metric are selected as the final solution (§ 3.3). We note that minimizing the adaptation objective for different values of the regularization constant can be done efficiently in parallel over multiple CPU cores.
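The effect of this initialization can be checked with a small sketch. It relies on the standard fact that an affine map of a Gaussian vector is again Gaussian; the specific parameterization below (a full scale matrix and a shift vector per component) and the name `transform_gaussian` are assumptions for illustration and may differ from the parameterization in § 3.

```python
import numpy as np

def transform_gaussian(mean, cov, scale, shift):
    """Mean and covariance of scale @ x + shift when x ~ N(mean, cov)."""
    new_mean = scale @ mean + shift
    new_cov = scale @ cov @ scale.T
    return new_mean, new_cov

d = 2
scale_init = np.eye(d)     # scale initialized to ones (identity)
shift_init = np.zeros(d)   # shift initialized to zeros

source_mean = np.array([0.5, -1.0])
source_cov = np.array([[0.2, 0.05], [0.05, 0.1]])

# With the identity scale and zero shift, the target component
# coincides with the source component.
target_mean, target_cov = transform_gaussian(
    source_mean, source_cov, scale_init, shift_init
)
```

Starting the optimization at the source mixture means that, with very few target samples, the adapted model can only move away from the source model when the data support it.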

5.1 MDN adaptation on Simulated Channels

We evaluate the proposed adaptation method for an MDN (§ 3) on simulated channel variations based on models commonly used for wireless communication. Specifically, we use the following channel models: i) additive white Gaussian noise (AWGN), ii) Ricean fading, and iii) Uniform or flat fading [goldsmith2005wireless]. Details on these channel models and the calculation of their signal-to-noise ratio (SNR) are provided in Appendix F. In each case, the MDN is first trained on a large dataset simulated from a particular type of channel model (e.g., AWGN), referred to as the source channel. The trained MDN is then adapted using a small dataset from a different type of channel model (e.g., Ricean fading), referred to as the target channel. We used a standard constellation corresponding to quadrature amplitude modulation of 16 symbols, referred to as 16-QAM [goldsmith2005wireless], as inputs to the channel. A training set of samples from the source channel is used to train the MDN. The size of the adaptation dataset from the target channel is varied over a few different values: 5, 10, 20 and 30 samples per symbol, corresponding to target datasets of size 80, 160, 320 and 480 respectively.
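The data layout for this experiment can be sketched as follows. The sketch builds a unit-average-power 16-QAM constellation and draws a fixed number of noisy samples per symbol from an AWGN channel. The helper names and the `noise_std` parameter are hypothetical; in particular, the mapping from SNR to noise level follows Appendix F in the paper and is not reproduced here.

```python
import numpy as np

def qam16_constellation():
    """16-QAM points on a 4 x 4 grid, scaled to unit average power."""
    levels = np.array([-3.0, -1.0, 1.0, 3.0])
    points = np.array([[i, q] for i in levels for q in levels])
    avg_power = np.mean(np.sum(points ** 2, axis=1))
    return points / np.sqrt(avg_power)

def sample_awgn(constellation, n_per_symbol, noise_std, rng):
    """Draw n_per_symbol noisy channel outputs for every symbol."""
    labels = np.repeat(np.arange(len(constellation)), n_per_symbol)
    noise = rng.normal(
        scale=noise_std, size=(labels.size, constellation.shape[1])
    )
    return labels, constellation[labels] + noise

rng = np.random.default_rng(0)
constellation = qam16_constellation()
# 5, 10, 20 and 30 samples per symbol give 80, 160, 320 and 480 samples.
adaptation_sets = {
    n: sample_awgn(constellation, n, noise_std=0.1, rng=rng)
    for n in (5, 10, 20, 30)
}
```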

| Source to target | Target size | Proposed: median | Proposed: 95% CI | Transfer: median | Transfer: 95% CI | Transfer-last-layer: median | Transfer-last-layer: 95% CI |
|---|---|---|---|---|---|---|---|
| AWGN to Uniform fading | 80 | 0.98 | (0.80, 1.06) | 0.54 | (-1.87, 0.98) | 0.88 | (-0.30, 0.99) |
| | 160 | 0.98 | (0.90, 1.09) | 0.93 | (0.23, 1.00) | 0.97 | (0.68, 1.00) |
| | 320 | 0.99 | (0.84, 1.11) | 0.99 | (0.89, 1.03) | 0.99 | (0.93, 1.07) |
| | 480 | 0.98 | (0.88, 1.12) | 1.00 | (0.95, 1.09) | 1.00 | (0.96, 1.11) |
| AWGN to Ricean fading | 80 | 1.08 | (0.00, 4.97) | 0.58 | (-13.27, 1.00) | 0.93 | (-19.34, 1.08) |
| | 160 | 1.20 | (0.00, 4.84) | 1.02 | (-5.54, 1.26) | 1.04 | (-1.98, 1.77) |
| | 320 | 1.08 | (0.00, 5.75) | 1.08 | (-0.23, 4.34) | 1.10 | (-0.26, 5.00) |
| | 480 | 1.08 | (0.00, 5.72) | 1.10 | (-0.07, 5.15) | 1.10 | (-0.05, 5.20) |
| Ricean fading to Uniform fading | 80 | 0.96 | (0.31, 1.45) | -0.10 | (-29.2, 0.87) | 0.59 | (-34.72, 0.96) |
| | 160 | 0.98 | (0.42, 1.33) | 0.73 | (-2.49, 0.97) | 0.86 | (-0.90, 1.00) |
| | 320 | 0.97 | (0.23, 1.36) | 0.95 | (0.82, 1.15) | 0.98 | (0.94, 1.34) |
| | 480 | 0.98 | (0.24, 1.70) | 0.99 | (0.96, 2.32) | 1.01 | (0.96, 2.54) |

Table 2: Relative log-likelihood gain of the MDN adaptation methods on simulated channel variations.

Baseline Methods. We evaluate the following two baseline methods for adapting the MDN. 1) A new MDN is initialized using the weights of the MDN trained on the source dataset, and trained using the target dataset. 2) Same as baseline 1, but only the weights of the final layer are optimized (fine-tuned) using the target dataset. The above methods are referred to as transfer and transfer-last-layer respectively. We used the Adam optimization method [kingma2015adam] for epochs, with a batch size of or times the target dataset size, whichever is larger.

Evaluation Metric. Since the MDN is a generative model, we evaluate the conditional log-likelihood of the learned Gaussian mixture on an unseen test set with 25000 samples from the target channel. We report the relative change in log-likelihood with respect to the original (unadapted) MDN, since the log-likelihood values may not be comparable across datasets. The metric is computed from the log-likelihood of the original MDN and that of the adapted MDN. Larger values are better, and negative values indicate that adaptation leads to a worse model.
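One plausible form of such a relative metric is sketched below. The exact expression used in the paper is not recoverable from this text, so the formula (change in log-likelihood divided by the magnitude of the original log-likelihood) and the name `relative_log_likelihood_gain` are assumptions; the sketch only matches the stated properties that larger is better and that negative values mean the adapted model is worse.

```python
def relative_log_likelihood_gain(ll_original, ll_adapted):
    """Relative change in test log-likelihood after adaptation.

    Assumed form: (ll_adapted - ll_original) / |ll_original|.
    Positive when adaptation improves the log-likelihood,
    negative when it makes the model worse.
    """
    return (ll_adapted - ll_original) / abs(ll_original)

# Adaptation raises the log-likelihood from -100 to -2.
gain_improved = relative_log_likelihood_gain(-100.0, -2.0)
# Adaptation lowers the log-likelihood from -100 to -150.
gain_worse = relative_log_likelihood_gain(-100.0, -150.0)
```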

Results and Inference. Table 2 summarizes the results for three (source, target) channel pairs. For each pair, the methods are run on 50 randomly generated training, adaptation, and test datasets. The training dataset is sampled from the source channel, while the adaptation and test datasets are sampled from the target channel. The SNRs of the source and target channels are independently and randomly selected from the range dB to dB for each trial. We observe that the proposed method has a higher median relative log-likelihood gain for the low sample sizes (80 and 160) and a comparable median for higher sample sizes. Also, the baseline methods often have a 95% confidence interval (CI) that is heavily skewed to the left, with a negative lower bound. The proposed adaptation is more stable even for the smallest sample size, and never has a negative lower CI bound.

# parameters
# parameters
Transfer-last-layer 2525
Proposed 40
Table 3: Number of parameters being optimized by the MDN adaptation methods.

Table 3 compares the number of parameters being optimized by the proposed and baseline MDN adaptation methods for the architecture in Table 1. The method transfer optimizes all the layer weights of the MDN, which in this case has size . The method transfer-last-layer optimizes only the weights of the final layer, which in this case has size . The number of parameters optimized by the proposed method (i.e., dimension of ) would be , which is a much smaller problem compared to the baseline methods. This makes the proposed method well suited for the small-sample adaptation setting.
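The count for the final layer can be reproduced with simple arithmetic. In the sketch below, each fully-connected layer contributes weights plus biases, and the output layer is the concatenation of the means, variances, and prior-logit layers, with one variance per dimension per component. The sizes passed in (hidden size 100, 5 components, dimension 2) are assumptions chosen for illustration, since the values are missing from the text above; under these assumptions the count equals the 2525 listed for transfer-last-layer in Table 3.

```python
def fc_params(n_in, n_out):
    """Parameters of a fully-connected layer: weights plus biases."""
    return n_in * n_out + n_out

def mdn_output_layer_params(n_hidden, n_components, dim):
    """Parameters of the concatenated MDN output layer."""
    means = fc_params(n_hidden, n_components * dim)
    variances = fc_params(n_hidden, n_components * dim)
    prior_logits = fc_params(n_hidden, n_components)
    return means + variances + prior_logits

# Hypothetical sizes, not stated in the text above.
last_layer_count = mdn_output_layer_params(n_hidden=100, n_components=5, dim=2)
```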

5.2 Autoencoder Adaptation on Simulated Channels

(a) AWGN to Ricean fading.
(b) AWGN to Uniform fading.
(c) Ricean fading to Uniform fading.
Figure 6: Results of affine transformation based adaptation on simulated channels.
(a) AWGN to Ricean fading.
(b) AWGN to Uniform fading.
(c) Ricean fading to Uniform fading.
Figure 7: Results of MAP SE based adaptation on simulated channels.

We evaluate the proposed decoder adaptation methods on different pairs of simulated source and target channel distributions. The setup for this experiment for adapting from a source channel A to a target channel B is as follows. The autoencoder is initially trained using data from the source channel A at an SNR of dB. Details of how the SNR is related to the distribution parameters of the simulated channels are discussed in Appendix F. The MDN and the decoder are adapted using a small dataset from the target channel B for different fixed SNRs varied over dB to dB in steps of dB. For each SNR, the adaptation is repeated over randomly-sampled datasets from the target channel, and the average block error rate (BLER) values are calculated on a large held-out test dataset (specific to each SNR). The sizes of the training dataset (from channel A) and the test dataset (from channel B) are both set to samples per symbol, with symbols. The size of the adaptation dataset from the target channel B is varied over and samples per symbol.

The results of this experiment are summarized in Figs. 6 and 7 for three pairs of source and target channels. Figure 6 corresponds to the adaptation method of § 4.1, referred to as Affine, and Figure 7 corresponds to the adaptation method of § 4.2, referred to as MAP SE. The plots show the BLER vs. SNR curve, with average BLER on the y-axis (log-scaled) and SNR on the x-axis. This is commonly used to summarize the error performance of a communication system. The performance of a standard 16-QAM decoder (referred to as 16-QAM) and an autoencoder trained on the source channel without any adaptation (referred to as no_adapt) are included as baselines. [Footnote 5: M-QAM is short for M-ary quadrature amplitude modulation with an M-symbol constellation. This is a standard technique for modulation and demodulation (decoding), which does not adapt based on the channel conditions.] For the proposed adaptation methods, the number of samples per symbol from the target channel is indicated as a suffix to the method name. For example, adapt_20 implies that the adaptation used 20 samples per symbol.
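For reference, the BLER plotted on the y-axis is the fraction of transmitted messages that are decoded incorrectly. A minimal sketch, with the hypothetical function name `block_error_rate`, is:

```python
import numpy as np

def block_error_rate(sent_messages, decoded_messages):
    """Fraction of messages whose decoded value differs from the sent one."""
    sent = np.asarray(sent_messages)
    decoded = np.asarray(decoded_messages)
    return float(np.mean(sent != decoded))

# One of four messages is decoded incorrectly.
example_bler = block_error_rate([0, 1, 2, 3], [0, 1, 3, 3])
```

Averaging this quantity over the repeated adaptation trials at each SNR gives one point on a BLER vs. SNR curve.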

Observations and Takeaways.

  1. Both the adaptation methods significantly decrease the BLER for the cases AWGN to Uniform fading and Ricean fading to Uniform fading.

  2. For the case of AWGN to Ricean fading, the adaptation methods perform at the same level or slightly worse compared to the baselines. We think this is because the distributions of the two domains are not very different.

  3. In general, the BLER decreases with increasing size of the target dataset.

  4. Between the two adaptation methods, MAP SE performs marginally better than the Affine method.

5.3 Autoencoder Adaptation on Real FPGA Traces

(a) Affine
(b) MAP SE
Figure 8: Results of the affine and MAP SE based adaptation on real FPGA traces.

We evaluate the performance of the adaptation methods on real over-the-air wireless experiments. We use a recent high-performance mmWave testbed [Lacruz_MOBISYS2021], featuring a high-end FPGA board with 2 GHz of bandwidth per channel and 60 GHz SIVERS antennas [SIVERSIMA]. This platform allows us to transmit the custom constellations generated by the encoder and to store the received data, either to train the MDN or for further performance analysis. We train the MDN with a standard 16-QAM constellation with 96000 samples. We evaluate the performance of our adaptation for 20, 35 and 50 samples per symbol. We introduce an IQ imbalance-based distortion to the constellation, and gradually increase the level of imbalance in the system. [Footnote 6: IQ imbalance is a common issue in radio frequency communications that introduces distortions to the final constellation.] The BLER of the proposed adaptation methods and the baseline methods (16-QAM and no adaptation) is shown as a function of the IQ imbalance in Fig. 8. The proposed methods (both Affine and MAP SE) show an order of magnitude decrease in BLER compared to the baseline methods when the IQ imbalance is over 25%.

6 Conclusions

In this paper, we proposed a fast and light-weight method for adapting a Gaussian MDN to a target domain with a very limited number of adaptation samples. The method is based on finding the optimal set of component-conditional affine transformations that transform the source Gaussian mixture to the target Gaussian mixture. This is formulated as the minimization of a negative conditional (or posterior) log-likelihood, regularized by the KL-divergence between the two mixture distributions. We applied the MDN adaptation to an autoencoder-based end-to-end communication system, specifically by transforming the inputs to the decoder such that their class-conditional distributions are close to those of the source domain. This allows for fast adaptation of both the MDN channel and the autoencoder without the need for expensive data collection and retraining. We demonstrated the effectiveness of the proposed methods through extensive experiments on both simulated wireless channels and a real mmWave FPGA testbed.

Limitations & Future Work

The proposed adaptation for a Gaussian MDN is primarily targeted at low-dimensional problems such as the wireless channel. It can be challenging to apply to high-dimensional input domains with structure. Extending the proposed work to deep generative models based on normalizing flows [dinh2017realnvp, kingma2018glow, weng2018flow] is an interesting direction, which would be more suitable for high-dimensional inputs. In this work, we do not adapt the encoder network, i.e., the autoencoder constellation is not adapted to changes in the channel distribution. Adapting the encoder, decoder, and channel networks jointly would allow for more flexibility, but would likely be slower and require more data from the target distribution.



Notation Description
Input message or class label. Usually , where is the number of bits.
or simply One-hot-coded vector of a message , with at position and zeros elsewhere.
with Encoded representation or symbol vector corresponding to an input message.
Channel output that is the feature vector to be classified by the decoder.
Categorical random variable denoting the mixture component of origin.
Encoder NN with parameters mapping a one-hot-coded message to a symbol vector in .
Decoder NN with parameters mapping the channel output into probabilities over the message set.
MAP prediction of the input message by the decoder.
Conditional density (generative) model of the channel with parameters .
Mixture density network that predicts the parameters of a Gaussian mixture.
Transfer or sampling function corresponding to the channel conditional density.
Random vector independent of that captures the stochasticity of the channel.
Input-output mapping of the autoencoder with combined parameter vector .
Affine transformation parameters per component used to adapt the MDN.
and Affine and inverse-affine transformations between the component densities of the Gaussian mixtures.
Kullback-Leibler divergence between the distributions and .
Multivariate Gaussian density with mean vector and covariance matrix .
Categorical distribution with and .
Indicator function mapping a predicate to if true and if false.
norm of a vector .
Table 4: Commonly used notations and definitions

The appendices are organized as follows:

  • Appendix A provides a background on end-to-end training of autoencoders and domain adaptation, and discusses related works.

  • Appendix B provides details on the training and sampling (transfer) function of MDNs.

  • Appendix C discusses the feature and parameter transformations between multivariate Gaussians.

  • Appendix D derives the KL divergence between the source and target Gaussian mixtures.

  • Appendix E provides details on training the MAP symbol estimation autoencoder.

  • Appendix F provides details on the simulated channel models that were used in our experiments.

Appendix A Background

A.1 Loss Function and Training of the Autoencoder

Expanding on the brief background provided in § 2.1, here we provide a formal discussion of the end-to-end training of the autoencoder. First, let us define the input-output mapping of the autoencoder as , where is the combined vector of parameters from the encoder, channel, and decoder. Given an input message , the autoencoder maps the one-hot-coded representation of into an output probability vector over the message set. Note that, while the encoder and decoder neural networks are deterministic, a forward pass through the autoencoder is stochastic due to the channel transfer function . The learning objective of the autoencoder is to accurately recover the input message at the decoder with a high probability. The cross-entropy (CE) loss, which is commonly used for training classifiers, is also suitable for end-to-end training of the autoencoder. For an input with encoded representation , channel output , and decoded output , the CE loss is given by


which is always non-negative and takes the minimum value of zero when the correct message is decoded with probability 1. The autoencoder aims to minimize the following expected CE loss over the input message set and the channel output:


Here is the prior probability of the input messages, which is usually assumed to be uniform in the absence of prior knowledge. In practice, the autoencoder minimizes an empirical estimate of the expected CE loss function by generating a large set of samples from the channel conditional density given each message. Let denote a set of independent and identically distributed (iid) samples from , the channel conditional density given message . Also, let denote the combined set of samples. The empirical expectation of the autoencoder CE loss (19) is then given by


It is clear from the above equation that the channel transfer function should be differentiable in order to be able to backpropagate gradients through the channel to the encoder network. The transfer function defining sample generation for a Gaussian MDN channel is discussed in Appendix B.
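The empirical CE loss described above can be sketched numerically. The sketch assumes that the per-message average of the CE loss is weighted by the message prior (uniform by default); this matches the verbal description but is an assumption about the exact form of Eq. (20), and the function name `empirical_ce_loss` is hypothetical.

```python
import numpy as np

def empirical_ce_loss(labels, probs, prior=None, eps=1e-12):
    """Prior-weighted average of the per-message cross-entropy loss.

    labels: (n,) integer messages that were sent.
    probs: (n, m) decoder output probabilities for each channel sample.
    """
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    m = probs.shape[1]
    prior = np.full(m, 1.0 / m) if prior is None else np.asarray(prior)
    loss = 0.0
    for y in range(m):
        mask = labels == y
        if np.any(mask):
            # CE for message y: minus the log-probability of the true message.
            loss += prior[y] * np.mean(-np.log(probs[mask, y] + eps))
    return loss
```

With a decoder that outputs a uniform distribution over four messages, this loss equals log 4; it approaches zero as the decoder assigns probability 1 to the correct message.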

The training algorithm for jointly learning the autoencoder and channel model (based on [garcia2020mixture]) is given in Algorithm 1. It is an alternating (cyclic) optimization of the channel parameters and the autoencoder (encoder and decoder) parameters. This alternating optimization is required because the empirical expectation of the CE loss in Eq. (20) is valid only when the channel conditional density (i.e., the channel model parameters) is fixed. The training algorithm can be summarized as follows. First, the channel model is trained for epochs using data sampled from the channel with an initial encoder constellation (e.g., M-QAM). With the channel model parameters fixed, the parameters of the encoder and decoder networks are optimized for one epoch of mini-batch SGD updates (using any adaptive learning rate algorithm, e.g., Adam [kingma2015adam]). Since the channel model is no longer optimal for the updated encoder constellation, it is retrained for epochs using data sampled from the channel with the updated constellation. This alternate training of the encoder/decoder and the channel networks is repeated for epochs or until convergence.

1:  Inputs: Message size ; Encoding dimension ; Initial constellation ; Number of optimization epochs for the autoencoder and channel .
2:  Output: Trained network parameters .
3:  Initialize the encoder, channel, and decoder network parameters.
4:  Sample training data from the channel using the initial constellation.
5:  Train the channel model for epochs to minimize .
6:  for epoch
7:     Freeze the channel model parameters .
8:     Perform a round of mini-batch SGD updates of and with respect to .
9:     Sample training data from the channel with the updated constellation .
10:     Train the channel model for epochs to minimize .
11:  Return .
Algorithm 1 End-to-end training of the autoencoder with a generative channel model
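The control flow of Algorithm 1 can be sketched as a short Python loop. This is only a schematic: the three callables `sample_channel`, `train_channel`, and `update_autoencoder` are hypothetical placeholders for the actual data collection and optimization routines, and the convergence check is omitted.

```python
def train_alternating(sample_channel, train_channel, update_autoencoder,
                      n_epochs, n_channel_epochs):
    """Alternate between fitting the channel model and the autoencoder."""
    # Steps 4 and 5: fit the channel model on the initial constellation.
    data = sample_channel()
    train_channel(data, n_channel_epochs)
    for _ in range(n_epochs):
        # Steps 7 and 8: channel model frozen, update encoder and decoder.
        update_autoencoder()
        # Steps 9 and 10: resample with the updated constellation and
        # refit the channel model.
        data = sample_channel()
        train_channel(data, n_channel_epochs)

# Usage with stand-in callables that only record the call order.
call_log = []
train_alternating(
    sample_channel=lambda: call_log.append("sample") or "data",
    train_channel=lambda data, n: call_log.append("channel"),
    update_autoencoder=lambda: call_log.append("autoencoder"),
    n_epochs=2,
    n_channel_epochs=3,
)
```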

Finally, we observe some interesting nuances of the communication autoencoder learning task that are not common to other domains such as images. 1) The size of the input space is finite, equal to the number of distinct messages . Because of the stochastic nature of the channel transfer function, the same input message results in a different autoencoder output each time. 2) There is theoretically no limit on the number of samples that can be generated for training and validating the autoencoder. These two factors make the autoencoder learning less susceptible to overfitting, which is a common pitfall with neural network training.

A.2 A Primer on Domain Adaptation

We provide a brief review of domain adaptation (DA) and discuss the key differences of our problem setting from that of standard DA. In the traditional learning setting, training and test data are assumed to be sampled independently from the same distribution , where and are the input vector and target respectively. [Footnote 7: The notation used in this section is different from the rest of the paper, but consistent with the statistical learning literature.] In many real world settings, it can be hard or impractical to collect a large labeled dataset for a target domain where the machine learning model (e.g., a DNN classifier) is to be deployed. On the other hand, it is common to have access to a large unlabeled dataset from the target domain, and a large labeled dataset from a different but related source domain. [Footnote 8: One could have multiple source domains in practice; we consider the single source domain setting.] Both and are much larger than , and in most cases there is no labeled data from the target domain (referred to as unsupervised DA). For the target domain, the unlabeled dataset (and labeled dataset if any) are sampled from an unknown target distribution, i.e. and . For the source domain, the labeled dataset is sampled from an unknown source distribution, i.e.. The goal of DA is to leverage the available labeled and unlabeled datasets from the two domains to learn a predictor, denoted by the parametric function , such that the following risk function w.r.t the target distribution is minimized:

where is a loss function that penalizes the prediction for deviating from the true value (e.g., cross-entropy or hinge loss). In a similar way, we can define the risk function w.r.t the source distribution . A number of seminal works in DA theory [bendavid2006analysis, blitzer2007learning, bendavid2010theory] have studied this learning setting and provide bounds on in terms of and the divergence between source and target domain distributions. Motivated by this foundational theory, a number of recent works [ganin2015unsupervised, ganin2016domain, long2018conditional, saito2018maximum, zhao2019invariant, johansson2019support] have proposed using DNNs for adversarially learning a shared representation across the source and target domains such that a predictor using this representation and trained using labeled data from only the source domain also generalizes well to the target domain. An influential work in this line of DA is the domain adversarial neural network (DANN) proposed by [ganin2015unsupervised] and later by [ganin2016domain]. The key idea behind the DANN approach is to adversarially train a label predictor NN and a domain discriminator NN in order to learn a feature representation for which i) the source and target inputs are nearly indistinguishable to the domain discriminator, and ii) the label predictor has good generalization performance on the source domain inputs.

Special Cases of DA. While the general DA problem addresses the scenario where and