Wideband and Entropy-Aware Deep Soft Bit Quantization

10/18/2021 ∙ by Marius Arvinte, et al. ∙ The University of Texas at Austin

Deep learning has been recently applied to physical layer processing in digital communication systems in order to improve end-to-end performance. In this work, we introduce a novel deep learning solution for soft bit quantization across wideband channels. Our method is trained end-to-end with quantization- and entropy-aware augmentations to the loss function and is used at inference in conjunction with source coding to achieve near-optimal compression gains over wideband channels. To efficiently train our method, we prove and verify that a fixed feature space quantization scheme is sufficient for efficient learning. When tested on channel distributions never seen during training, the proposed method achieves a compression gain of up to 10 % in the high SNR regime versus previous state-of-the-art methods. To encourage reproducible research, our implementation is publicly available at https://github.com/utcsilab/wideband-llr-deep.


I Introduction

Soft bit quantization [15, 16] is an important task in integrated, low-power digital communication platforms where memory is an expensive asset [3]. A critical application area for quantizing estimated soft bits is given by hybrid automatic repeat request (HARQ) schemes in, e.g., 5G networks [4], where information from a failed transmission is stored in order to boost performance via soft combining methods [7]. Given that a cellular base station may communicate with hundreds of users simultaneously, storing soft bits from failed packets requires efficient, low-distortion quantization methods to avoid memory bottlenecks on the platform. Another application area where a flexible trade-off between compression rate and reconstruction distortion is desirable is compress-and-forward relaying [21, 8], where estimated soft bits are compressed and forwarded to a receiver in order to lower relay channel resource utilization.

The recent success of deep learning applied to compression problems [2] motivates us to develop deep soft bit quantization methods. A major challenge is that any hard quantization operator has zero gradient almost everywhere, and thus cannot be used in conjunction with modern gradient-based optimization algorithms. To this end, various practical approximations and solutions have been developed [6, 2, 12] to make deep neural networks quantization-aware; in this work, we use the pass-through gradient estimation approach [6] due to its simplicity and ease of implementation.

In this paper, we introduce a data-driven approach for soft bit quantization over wideband channels. Our scheme consists of a lightweight, properly initialized deep autoencoder network, and is trained for soft bit reconstruction in random channels using a differentiable approximation to quantization, as well as a continuous approximation to the entropy of a discrete source. During inference, lossless source coding is performed over the latent representations of soft bits from a wideband channel transmission to maximize compression gains. Experimental results over simulated EPA [1] channel realizations demonstrate state-of-the-art performance of the proposed approach, as well as a controllable trade-off between compression rate and distortion.

I-A Related Work

Prior work on soft bit quantization generally falls into one of two categories: classical signal processing methods and deep learning-based methods. Classical methods [16, 15, 20] develop near-optimal scalar quantizers directly in the log-likelihood ratio or hyperbolic tangent domain. In particular, the method in [20] introduces an optimal scalar quantization scheme for soft bits that supports a data-driven formulation and learns a codebook that maximizes the mutual information between the original and reconstructed soft bits. While this is optimal for scalar (per soft bit position) quantization, it does not take advantage of redundancy between soft bits derived from the same or correlated channels.

Recently, the work in [5] introduced an architecture for deep soft bit quantization, building on the observation that the soft bits corresponding to a high-order modulation scheme transmitted over a single channel are correlated and can always be represented exactly with three values, regardless of the modulation order. This motivates a deep learning approach in which an autoencoder is trained to compress the soft bits, which are further numerically quantized at inference time. Our work builds directly upon [5], with the following important distinctions: (i) our approach is entropy- and quantization-aware during training, (ii) during inference, we apply source coding to compress the quantized soft bits, and (iii) we provide a tunable, continuous trade-off between compression rate and distortion.

I-B Contributions

In summary, our contributions are the following:

  1. We introduce a deep soft bit quantization architecture that is quantization-aware through a differentiable approximation used in the backward pass and entropy-aware through a soft entropy term that is annealed over the course of training.

  2. We derive the exact variance of the latent representation of the soft bits in the asymptotically large signal-to-noise ratio (SNR) regime, at initialization, for a deep neural network with one hidden layer and ReLU activation. This result is used for the one-time design of a fixed quantization codebook and helps stabilize learning.

  3. We experimentally demonstrate state-of-the-art quantization performance and rate-distortion trade-off in realistic wideband channel models, when compared against classical and deep learning baselines, on a channel distribution that is completely unseen during training.

II System Model

Fig. 1: Diagram of the proposed wideband quantization approach. Blocks in yellow are active during both training and inference (real-world deployment). The pass-through approximation and entropy estimation are only performed for training purposes. During inference, source coding is applied to the entire wideband quantized latent matrix, and the binary codeword is stored for future use at a potentially different location that has access to the decoder D.

We consider a communications model that transmits a number of parallel data channels to a single user, such as sub-carriers in an orthogonal frequency division multiplexing (OFDM) scenario. Assuming no cross-channel interference, the signal received on the i-th channel is given by the linear model [18, Eq. 3.1]:

y_i = h_i · x_i + n_i,   (1)

where h_i is the channel gain, x_i is the transmitted symbol, and n_i is the noise corresponding to the i-th subcarrier, with all variables being complex-valued. We assume that the noise is drawn from a complex, circular Gaussian distribution with zero mean and standard deviation σ_n. We assume that the symbols are obtained by mapping a set of M bits to a complex-valued constellation symbol, which is the common practice of digital modulation. Given ideal channel knowledge and known noise statistics, and assuming that transmitted bits are sampled i.i.d. with equal probabilities, the maximum likelihood (ML) estimate of the log-likelihood ratio for the k-th bit transmitted on the i-th channel is given by:

L_{i,k} = log Σ_{x ∈ X_k^1} exp(−|y_i − h_i · x|² / σ_n²) − log Σ_{x ∈ X_k^0} exp(−|y_i − h_i · x|² / σ_n²),   (2)

where X_k^b denotes the subset of constellation symbols whose k-th bit equals b.

The soft bits are defined as Λ_{i,k} = tanh(L_{i,k} / 2) and grouped in the wideband soft bit matrix Λ. The goal of wideband quantization is to design the triplet of functions (E, Q, D), where the encoder E maps the floating point input to a latent representation, the quantizer Q maps this representation to a finite bit string, and the decoder D recovers the soft bits with minimal distortion. Note that this scheme maps the entire soft bit matrix to a single binary codeword, and is thus a form of vector quantization. The compression rate and the distortion of the reconstruction are denoted by the functions:

r(Λ) = H(Q(E(Λ))),   d(Λ, Λ̂) = Σ_{i,k} w_{i,k} · (Λ_{i,k} − Λ̂_{i,k})²,   (3)

respectively, where H represents the entropy of a discrete source expressed in bits, Λ̂ is the recovered soft bit matrix, and the sample weights w_{i,k} include a small numerical constant ε to prevent overflow. We use a sample-weighted version of the mean squared error, which is a pseudo-metric since it is not symmetric and does not satisfy the triangle inequality. This choice is taken from [5], since minimizing this metric places more importance on uncertain soft bits and benefits the decoding of error-correcting codes [11].
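To make the system model concrete, the sketch below computes exact LLRs as in (2) and tanh-domain soft bits for a Gray-mapped square QAM constellation under the linear model (1). It is a minimal illustration under our own assumptions (constellation construction, bit labeling, array shapes), not the authors' implementation.

```python
import numpy as np

def qam_constellation(num_bits):
    """Hypothetical Gray-mapped square QAM with 2**num_bits points (unit average energy).

    Returns (symbols, labels), where labels[s, k] is the k-th bit of symbol s.
    """
    m = num_bits // 2                          # bits per I/Q axis (assumes square QAM)
    levels = np.arange(2 ** m)
    gray = levels ^ (levels >> 1)              # Gray-coded bit patterns per axis
    amps = 2.0 * levels - (2 ** m - 1)         # ..., -3, -1, +1, +3, ...
    symbols, labels = [], []
    for g_i, a_i in zip(gray, amps):
        for g_q, a_q in zip(gray, amps):
            symbols.append(a_i + 1j * a_q)
            bits = [(g_i >> b) & 1 for b in reversed(range(m))]
            bits += [(g_q >> b) & 1 for b in reversed(range(m))]
            labels.append(bits)
    symbols = np.asarray(symbols)
    symbols /= np.sqrt(np.mean(np.abs(symbols) ** 2))
    return symbols, np.asarray(labels)

def logsumexp(a):
    m = a.max(axis=1, keepdims=True)
    return np.log(np.exp(a - m).sum(axis=1)) + m[:, 0]

def wideband_soft_bits(y, h, sigma_n, num_bits=6):
    """Exact LLRs (2) and tanh-domain soft bits for N parallel channels."""
    const, labels = qam_constellation(num_bits)
    # Log-likelihood of every constellation hypothesis for every channel.
    metric = -np.abs(y[:, None] - h[:, None] * const[None, :]) ** 2 / sigma_n ** 2
    llr = np.stack(
        [logsumexp(metric[:, labels[:, k] == 1]) - logsumexp(metric[:, labels[:, k] == 0])
         for k in range(num_bits)],
        axis=1,
    )
    return np.tanh(llr / 2.0)                  # soft bit matrix, shape (N, num_bits)
```

For instance, drawing the gains h_i from CN(0, 1) and calling wideband_soft_bits(y, h, sigma_n) produces the soft bit matrix Λ that the encoder of Section III-A consumes row by row.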

III Proposed Method

Fig. 1 shows an overview of the proposed approach for wideband, entropy-aware soft bit quantization. During training, the model uses the wideband soft bit matrix and a soft entropy estimate of the quantized representation to optimize the weights of and , which are deep neural networks. During inference, lossless source coding is applied to the quantized representation to reduce storage costs to near-entropy levels. The resulting binary string is stored until decoding is required, e.g., in hybrid ARQ or relay scenarios. In the following, we give a description of each of the involved components.

III-A Encoder

The encoder E is a function that maps an input matrix Λ ∈ R^{N×M} to a latent matrix Z ∈ R^{N×3} by applying the same functional backbone f_E in a row-wise manner and stacking the representations in a matrix:

Z_i = f_E(Λ_i),   i = 1, …, N.   (4)

We design the backbone of E as a fully-connected, feed-forward network with ReLU activation in the hidden layers and tanh as the output activation. The input is a vector of size M, the hidden layers all share the same width, while the output is of fixed size equal to three. This corresponds to the universal latent dimension of a soft bit vector with arbitrary M, as introduced in [5]. That is, without quantization, such a compressive representation (from M soft bits to three latent variables) is guaranteed to exist and can be represented and learned by a deep neural network with sufficient modeling capacity.
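As a concrete (if simplified) illustration, such a row-wise backbone could be written in PyTorch as below. The hidden width, layer count, and the reuse of one class for both encoder and decoder are our own assumptions; the exact configuration is in the paper's code repository.

```python
import torch
import torch.nn as nn

class RowwiseMLP(nn.Module):
    """Row-wise backbone used for the encoder f_E and (mirrored) for the decoder f_D."""
    def __init__(self, in_dim, out_dim, hidden=64, num_hidden_layers=4):
        super().__init__()
        dims = [in_dim] + [hidden] * num_hidden_layers
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.ReLU()]
        layers += [nn.Linear(hidden, out_dim), nn.Tanh()]   # tanh keeps outputs in (-1, 1)
        self.net = nn.Sequential(*layers)
        for module in self.modules():                        # Glorot initialization [9]
            if isinstance(module, nn.Linear):
                nn.init.xavier_normal_(module.weight)

    def forward(self, rows):                                 # rows: (N, in_dim)
        return self.net(rows)

# Encoder E: M soft bits per channel -> 3 latent values; decoder D mirrors it.
M = 6                                                        # e.g., 64-QAM
encoder = RowwiseMLP(in_dim=M, out_dim=3)
decoder = RowwiseMLP(in_dim=3, out_dim=M)
```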

III-B Latent Quantization

This block applies a discrete quantization operator Q to each component of the latent representation in the forward pass of the network. During training, since the quantization operator has zero gradient almost everywhere, we use a pass-through approximation [6] to obtain a differentiable function for the backward pass. That is, the forward and backward pass signals are, respectively:

Z_q = Q(Z),   Z_q = Z + sg(Q(Z) − Z),   (5)

where the first and second expressions are the forward and backward pass latent signals, respectively, and sg is the stop-gradient operator, which prevents gradients from flowing in the backward pass. This leads to the gradient of the quantized representation with respect to its input being the identity, and allows gradients to propagate to earlier layers.
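A minimal sketch of a fixed scalar quantizer with the pass-through gradient of (5) is shown below; the quantization interval and bit depth are illustrative placeholders for the fixed design discussed in Section IV-A.

```python
import torch

def make_uniform_codebook(low=-1.0, high=1.0, bits=3):
    """Fixed scalar codebook: 2**bits uniformly spaced levels over [low, high]."""
    return torch.linspace(low, high, steps=2 ** bits)

def quantize_pass_through(z, codebook):
    """Forward: nearest-codeword quantization. Backward: identity gradient, as in (5)."""
    # Hard assignment of every latent entry to its nearest codebook level.
    dist = (z.unsqueeze(-1) - codebook) ** 2          # (..., K)
    idx = dist.argmin(dim=-1)
    z_hard = codebook[idx]
    # Pass-through trick: z + sg(Q(z) - z) equals Q(z) in the forward pass,
    # but its gradient with respect to z is the identity.
    return z + (z_hard - z).detach(), idx
```

The returned tensor equals Q(z) in value but carries an identity gradient, while the indices idx are what the source coder of Section III-D consumes at inference time.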

Importantly, the function Q is a pre-determined scalar quantization function that is held fixed throughout learning and inference. This differentiates our approach from [5] and [2] and enables efficient learning, given a careful choice of Q. In the following, we present a theoretical and empirical analysis of deep neural networks used for soft bit quantization in the high SNR regime and show that a choice of Q that avoids the issue of codebook collapse [19, 17] – where a portion of the codebook may never be used during training – can be found at initialization. We use the following two lemmas in our proof:

Lemma 1

Let W be a matrix with i.i.d. Gaussian elements and let x be a vector with i.i.d. Rademacher entries, independent of W. Then, the elements of Wx are distributed as i.i.d. Gaussian random variables.

Proof:

Immediate by the independence of W and x, together with the sign symmetry of the Gaussian distribution.

Lemma 2

Let y be a scalar random variable distributed as y = max(0, g), where g ∼ N(0, σ²). Then, y has the following properties:

E[y] = σ / √(2π),   E[y²] = σ² / 2.

Proof:

Follows immediately from [10, Page 3] and re-writing y as a mixture of two random variables (a point mass at zero and a half-Gaussian).

We now state and prove the following theorem.

Theorem 1

Let f be a one hidden-layer, fully-connected neural network with no biases, ReLU hidden activation, and linear output activation. Let the length of the input vector be n_in, the hidden size be n_h, and the output size be n_out. The weight matrices are W_1 ∈ R^{n_h × n_in} and W_2 ∈ R^{n_out × n_h}, respectively. We make the following assumptions:

  • The entries of the input x are drawn i.i.d. from a Rademacher distribution such that P(x_j = +1) = P(x_j = −1) = 1/2.

  • The entries of the hidden layer weight matrix W_1 are drawn i.i.d. from a Gaussian distribution with mean 0 and variance 2 / (n_in + n_h), respectively.

  • The entries of the output layer weights W_2 are drawn i.i.d. from a Gaussian distribution with mean 0 and variance 2 / (n_h + n_out), respectively.

Let z = f(x) = W_2 ReLU(W_1 x) be the output of the network. Then, each entry z_k satisfies the following properties:

E[z_k] = 0,   Var(z_k) = 2 · n_in · n_h / ((n_in + n_h) · (n_h + n_out)).

Proof:

Using Lemma 1 and the first two assumptions, it follows that the pre-activation values after the first layer are i.i.d. Gaussian distributed. Using Lemma 2 on these activations allows us to characterize the mean and second moment of the hidden activations a = ReLU(W_1 x). Since the entries of a are i.i.d., and also independent from the output weights W_2 by the second and third assumptions, and since W_2 is zero-mean, we obtain that:

E[z_k] = 0,   Var(z_k) = n_h · Var(W_2) · E[a_i²] = 2 · n_in · n_h / ((n_in + n_h) · (n_h + n_out)).   (6)
Fig. 2: Empirical verification of Theorem 1 across a varying number of input bits M, corresponding to different modulation orders. One set of curves shows empirical estimates of the latent standard deviation at initialization, with stars marking the theoretical values (only available for one hidden layer); a second set shows the 99.9th percentile of the absolute latent values at initialization.

The first assumption corresponds to operating in the asymptotically large SNR regime, where soft bits tend toward polarized values (±1) in the hyperbolic tangent domain. We study this regime since it allows exact analysis of the variance and serves as an upper bound on the latent space variance for lower SNR regimes, where the distribution of the soft bits is more biased toward zero and has reduced variance.

The last two assumptions concern the deep neural network at initialization and match the Glorot weight initialization scheme [9]. As motivated in [9], this initialization is carefully chosen such that the variance of the signal is reduced as the network gets deeper. To verify Theorem 1 and the empirical reduction of variance, we plot the estimated standard deviation and the estimated range of the latent variables in Fig. 2 for a varying network depth and different values of M. The architecture follows the exact assumptions of Theorem 1 and is the basis for the model we use in practice.

The exact match between the empirical estimates and the starred points verifies Theorem 1 for a network with one hidden layer, and these values are almost invariant to M due to the ratio in (6). Fig. 2 also plots the percentile values for an increasing number of hidden layers. While an exact analysis is out of scope here, we find that using the Glorot initialization leads to a decreasing latent variance as the network gets deeper, as originally pointed out in [9]. Since the encoder uses a tanh function as its output activation, this is extremely useful in preventing latent collapse – the presence of strong modes at ±1 – and allows the use of a fixed quantizer throughout the entire training process.
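As a sanity check on Theorem 1, in the spirit of Fig. 2, the short Monte Carlo sketch below compares the empirical latent standard deviation of a randomly initialized one-hidden-layer network against the closed-form value in (6). The input size, hidden width, and sample count are arbitrary illustrative choices.

```python
import numpy as np

def latent_std_check(n_in=6, n_h=64, n_out=3, num_samples=200_000, seed=0):
    """Compare empirical vs. theoretical latent standard deviation for the Theorem 1 setting."""
    rng = np.random.default_rng(seed)
    # Glorot-style Gaussian initialization for both layers.
    var1 = 2.0 / (n_in + n_h)
    var2 = 2.0 / (n_h + n_out)
    W1 = rng.normal(0.0, np.sqrt(var1), size=(n_h, n_in))
    W2 = rng.normal(0.0, np.sqrt(var2), size=(n_out, n_h))
    # Rademacher inputs model polarized soft bits in the asymptotically high SNR regime.
    x = rng.choice([-1.0, 1.0], size=(num_samples, n_in))
    z = np.maximum(x @ W1.T, 0.0) @ W2.T       # one hidden layer, ReLU, linear output, no biases
    empirical = z.std()
    theoretical = np.sqrt(n_in * n_h * var1 * var2 / 2.0)
    return empirical, theoretical

# The two values agree up to Monte Carlo and finite-width fluctuations.
print(latent_std_check())
```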

III-C Entropy Estimation

Given the quantized latent representation Z_q and a quantization codebook {c_1, …, c_K} with K entries, we estimate the soft entropy [2] as:

H_soft = −(1 / (3N)) · Σ_{i=1}^{3N} Σ_{j=1}^{K} q_{ij} · log2(p_j),   (7)

where q_{ij} represents the soft allocation of the i-th latent entry z_i to the j-th entry in the quantization codebook. That is, q_{ij} = softmax_j(−τ · |z_i − c_j|²), where the softmax is taken across all codebook entries and τ represents the inverse temperature of this approximation. The terms p_j represent the empirical probability estimates of the quantized symbols, obtained by counting over samples. Hence, no gradient flows through the p_j terms during training. An important aspect here is that, as τ → ∞ and the sample size is sufficiently large, we have that H_soft → H(Z_q), the entropy of the discrete random variable Z_q.
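In code, this estimator could look as follows; the squared-distance soft assignment and the smoothing constant are our own choices, standing in for the exact formulation behind (7).

```python
import torch
import torch.nn.functional as F

def soft_entropy(z, codebook, inv_temperature=10.0, eps=1e-9):
    """Differentiable surrogate for the entropy of the quantized latents, as in (7).

    z: tensor of latent entries, any shape; codebook: (K,) fixed scalar levels.
    """
    z_flat = z.reshape(-1, 1)                                   # (B, 1)
    dist2 = (z_flat - codebook.view(1, -1)) ** 2                # (B, K)
    # Soft allocation of each latent entry to the codebook (gradient flows here).
    q = F.softmax(-inv_temperature * dist2, dim=-1)             # (B, K)
    # Hard empirical probabilities from nearest-neighbor counts (no gradient).
    with torch.no_grad():
        idx = dist2.argmin(dim=-1)
        counts = torch.bincount(idx, minlength=codebook.numel()).float()
        p_hat = counts / counts.sum()
    # Cross-entropy between soft assignments and the empirical distribution, in bits.
    return -(q * torch.log2(p_hat + eps)).sum(dim=-1).mean()
```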

III-D Source Coding

Given a quantized representation Z_q, the discrete probabilities for all codebook symbols are estimated from feature representations of a fixed, finite set of training channels, and lossless source coding is applied during inference for storage or relaying purposes. Our method is compatible with any source coding scheme. In practice, we use arithmetic coding [13] due to its near-optimal performance and extremely efficient publicly available implementation [14]. Since the coding is lossless, there is no incurred performance loss.
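The sketch below illustrates the inference-time bookkeeping implied here: estimate a symbol distribution from a held-out set of quantized training latents, then entropy-code new latents against it. Rather than invoking a specific arithmetic coder, we report the ideal code length in bits, which arithmetic coding approaches to within a small overhead; the function names are our own.

```python
import numpy as np

def estimate_symbol_probs(train_indices, K):
    """Empirical codebook-symbol probabilities from quantized training latents."""
    counts = np.bincount(train_indices, minlength=K).astype(float)
    return (counts + 1e-6) / (counts.sum() + 1e-6 * K)   # light smoothing avoids log(0)

def ideal_code_length_bits(indices, probs):
    """Bits an ideal entropy coder (e.g., arithmetic coding) needs for these symbols."""
    return float(np.sum(-np.log2(probs[indices])))

# Example usage (hypothetical arrays of codebook indices):
# probs = estimate_symbol_probs(train_idx, K=8)
# bits = ideal_code_length_bits(test_idx, probs)   # compare against 3 * N * log2(8) without coding
```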

III-E Decoder

The decoder D is a deep neural network with an architecture that mirrors E, including the number of layers and the hidden dimension. That is, it maps an input latent matrix Z_q to the reconstructed soft bit matrix Λ̂ by applying the shared layers f_D in a row-wise manner:

Λ̂_i = f_D(Z_{q,i}),   i = 1, …, N.   (8)

Given all components, the model is trained with the end-to-end supervised loss:

L = d(Λ, Λ̂) + λ · H_soft,   (9)

where d is the sample-weighted distortion in (3). The first term corresponds to the quantization-aware reconstruction loss that ensures soft bits are recovered properly after numerical quantization of the latent representation. The second term serves as an approximation for minimizing the entropy of the quantized latent representation and enables further gains with source coding; λ is a hyper-parameter that directly controls the rate-distortion trade-off.
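Putting the pieces together, a training step could combine the pass-through quantizer and the soft entropy surrogate sketched earlier as follows. The plain (unweighted) MSE stands in for the sample-weighted distortion of [5], and the value of λ is an arbitrary placeholder.

```python
import torch

def training_loss(encoder, decoder, codebook, soft_bits, lam=0.01, inv_temperature=10.0):
    """End-to-end loss (9): quantization-aware distortion plus entropy regularizer."""
    z = encoder(soft_bits)                              # (N, 3) latent matrix
    z_q, _ = quantize_pass_through(z, codebook)         # straight-through quantization, eq. (5)
    recon = decoder(z_q)                                # (N, M) reconstructed soft bits
    # Plain MSE stands in for the sample-weighted distortion of [5].
    distortion = torch.mean((recon - soft_bits) ** 2)
    rate = soft_entropy(z, codebook, inv_temperature)   # differentiable surrogate for (7)
    return distortion + lam * rate
```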

IV Experimental Results

IV-A Architecture and Training

We use deep neural networks for E and D, each with four hidden layers, ReLU hidden activations, a tanh activation at the output (for both E and D), and a hidden size chosen as a function of the modulation order M for which we train the method – M is also the input size to the network. The latent dimension is always three and we initialize all layers with the Glorot scheme [9] to match the conditions of Theorem 1. Complete details about the architecture can be found in our code repository linked in the abstract.

The latent quantizer uniformly covers a fixed interval using 2^B codebook entries (B bits), and remains fixed throughout the entire training and inference procedures. The same codebook is used for all latent dimensions, and the choice of the interval is a direct consequence of the range of the latent representation under the tanh output activation, as shown in Fig. 2. It can be seen that the latent code is bounded to this interval, hence no codebook collapse occurs at initialization.

The data used to train all models comes from transmissions across i.i.d. Rayleigh fading channels, where h_i ∼ CN(0, 1) and the noise n_i ∼ CN(0, σ_n²), with σ_n set by the target SNR. Payloads are generated by randomly sampling bits with equal probabilities, and codewords are obtained by using a low-density parity check (LDPC) code, yielding a set of training codewords at uniformly spaced SNR values. Importantly, our method is only trained on soft bits from i.i.d. channels, and is not trained on a specific wideband channel distribution. We find that a range of λ values generally covers the entire rate-distortion curve, and we anneal the inverse temperature τ of the soft entropy over the course of training according to an epoch-indexed schedule.

A single network is trained across the entire SNR range; training takes about three hours (invariant to the modulation order) on an NVIDIA RTX 2080Ti GPU. Inference takes less than a millisecond for an entire OFDM wideband channel. Storing the network weights takes a total on the order of kilobytes in floating point precision. During inference, soft bits are quantized and reconstructed, and belief propagation decoding is performed to obtain a complete communication chain. We measure end-to-end performance through the block (codeword) error rate.

We train the baseline in [5] by using exactly the same data and backbone architecture for a fair comparison. We also compare with the optimal scalar method in [20] by learning a separate quantization codebook for each soft bit position, at each SNR value. This partially compensates for the extra learnable parameters that deep learning methods have.

IV-B End-to-End Quantization Performance

Fig. 3 shows the performance of all methods for a 64-QAM modulation scheme and an EPA wideband channel model with a fixed number of subcarriers allocated per codeword. The number in parentheses indicates the average cost required to store a single soft bit, where we average this cost over the range of SNR values of practical interest, i.e., those leading to moderate block error rates. A key takeaway here is that all methods are calibrated to produce the same end-to-end performance, with minimal deviations from un-quantized performance. The proposed approach suffers only a small SNR loss compared to floating point at a fixed target error rate and has the same end-to-end performance as [5], while achieving a higher average compression gain. Both deep learning-based methods greatly surpass the scalar quantizer in terms of average compression gain.

Fig. 4 reveals how the quantization cost scales with SNR for the different methods, as well as the near-optimality of arithmetic coding in a wideband scenario. For scalar quantization methods such as maximum MI [20], the cost per soft bit decreases with increasing SNR. Asymptotically, this behavior is optimal: as the SNR grows large, the soft bits become discrete binary random variables, as in Theorem 1, and one bit per soft bit is the optimal quantization scheme. The trend is the opposite for deep learning methods, since the same model accommodates the entire SNR range: there, the average cost per soft bit increases as the SNR increases, and this phenomenon is much more pronounced for the baseline in [5]. We find that the proposed approach helps counteract this sub-optimality, again due to the entropy objective in its loss function.

Fig. 3: Block error rate as a function of SNR for the proposed method, the baselines, and floating point (no quantization), for 64-QAM modulation in EPA channels. The values in parentheses indicate the average storage cost per soft bit.
Fig. 4: Soft bit quantization cost as a function of operating SNR for the proposed method and the baselines in EPA channels. The red and black solid lines indicate baselines without source coding, while the dashed lines are with source coding. The orange curve is the performance of our method, while the blue dots indicate the estimated entropy of the quantized latent representation (the lowest possible rate). For all methods, no EPA-specific channel simulations are used to train the models or calibrate the source coding.

Fig. 4 also plots the performance of the two baselines with and without (horizontal lines) source coding. We note that, while source coding benefits both baselines, the proposed approach still improves compression rates across a broad SNR range due to the entropy-aware nature of (9). In the high SNR regime, the proposed method achieves compression gains of up to 10% compared to [5], and the source coding is near-optimal, since it achieves the estimated entropy marked with circles in Fig. 4.

IV-C Rate-Distortion Trade-Off

Fig. 5 investigates the impact of λ when training our method, for a higher-order modulation scheme on EPA channels with fewer subcarriers per codeword (since the bit mapping is denser than 64-QAM, fewer channel uses are required to send a packet). The operating characteristics of the model are close to the ones in Fig. 3, and extended results can be found in our code repository.

To control the trade-off between rate and distortion, we vary the parameter λ in our loss function over a range of values and train a separate model at each value. Fig. 5 shows that, at lower SNR values, the absolute penalty in end-to-end error is larger if we quantize aggressively, whereas the error increases are much smaller in the high SNR regime.

Fig. 5: Rate-distortion curves obtained by tuning λ in the loss function, for a higher-order QAM modulation in EPA channels. In this figure, λ is linearly interpolated between its two extreme values (right-most and left-most points, respectively) with a uniform spacing. The y-axis represents the additional block error rate incurred relative to a floating point solution. Each curve corresponds to a specific SNR point.

V Conclusion

In this paper, we have introduced a deep learning approach for wideband soft bit quantization. Our formulation included a fixed quantizer and a quantization- and entropy-aware training objective, as well as the use of source coding at inference. Our theoretical results proved that a fixed quantizer is sufficient for efficient training, and the experiments have shown state-of-the-art quantization performance in a wide SNR range, as well as flexibility in controlling the rate-distortion trade-off.

The model is compact and inference is efficient, achieving sub-ms latency for an entire wideband channel. Our model is also not trained on a specific channel distribution, but can instead operate on arbitrary wideband channels. While this provides a degree of flexibility, a promising future research direction is to investigate whether further compression gains can be obtained by specializing a model to a specific channel distribution and by developing adaptive quantization schemes.

References

  • [1] 3GPP TS 136.116: Evolved Universal Terrestrial Radio Access (Release 12, 2012). Technical report.
  • [2] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. J. Van Gool (2017) Soft-to-hard vector quantization for end-to-end learning compressible representations. In NIPS.
  • [3] R. Akeela and B. Dezfouli (2018) Software-defined radios: architecture, state-of-the-art, and challenges. Computer Communications 128, pp. 106–125.
  • [4] A. Anand and G. de Veciana (2018) Resource allocation and HARQ optimization for URLLC traffic in 5G wireless networks. IEEE Journal on Selected Areas in Communications 36 (11), pp. 2411–2421.
  • [5] M. Arvinte, A. H. Tewfik, and S. Vishwanath (2019) Deep log-likelihood ratio quantization. In 2019 27th European Signal Processing Conference (EUSIPCO), pp. 1–5.
  • [6] Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  • [7] P. Frenger, S. Parkvall, and E. Dahlman (2001) Performance comparison of HARQ with chase combining and incremental redundancy for HSDPA. In IEEE 54th Vehicular Technology Conference (VTC Fall 2001), Vol. 3, pp. 1829–1833.
  • [8] R. Ghallab, A. Sakr, M. Shokair, and A. Abou El-Azm (2018) Compress and forward cooperative relay in device-to-device communication with and without coding techniques. In 2018 13th International Conference on Computer Engineering and Systems (ICCES), pp. 425–429.
  • [9] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
  • [10] M. Harva and A. Kabán (2007) Variational learning for rectified factor analysis. Signal Processing 87 (3), pp. 509–527.
  • [11] S. Hemati and A. H. Banihashemi (2006) Dynamics and performance analysis of analog iterative decoding for low-density parity-check (LDPC) codes. IEEE Transactions on Communications 54 (1), pp. 61–70.
  • [12] S. Jung, C. Son, S. Lee, J. Son, J. Han, Y. Kwak, S. J. Hwang, and C. Choi (2019) Learning to quantize deep networks by optimizing quantization intervals with task loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4350–4359.
  • [13] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool (2019) Practical full resolution learned lossless image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [14] F. Mentzer. Publicly available arithmetic coding implementation (website).
  • [15] C. Novak, P. Fertl, and G. Matz (2009) Quantization for soft-output demodulators in bit-interleaved coded modulation systems. In 2009 IEEE International Symposium on Information Theory, pp. 1070–1074.
  • [16] W. Rave (2009) Quantization of log-likelihood ratios to maximize mutual information. IEEE Signal Processing Letters 16 (4), pp. 283–286.
  • [17] A. Razavi, A. van den Oord, and O. Vinyals (2019) Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, pp. 14866–14876.
  • [18] D. Tse and P. Viswanath (2005) Fundamentals of Wireless Communication. Cambridge University Press.
  • [19] A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural discrete representation learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6309–6318.
  • [20] A. Winkelbauer and G. Matz (2015) On quantization of log-likelihood ratios for maximum mutual information. In 2015 IEEE 16th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 316–320.
  • [21] X. Wu and L. Xie (2013) On the optimal compressions in the compress-and-forward relay schemes. IEEE Transactions on Information Theory 59 (5), pp. 2613–2628.