I Introduction
This paper outlines a few applications of machine-learning techniques to communication systems and focuses on what can be learned from the resulting systems. First, we consider the parameterized belief-propagation (BP) decoding of parity-check codes which was introduced by Nachmani et al. in [1]. Then, we study the low-complexity channel inversion known as digital backpropagation (DBP) for optical fiber communications [2].
II Machine Learning
Before discussing the two applications in detail in Secs. III and IV, we start in this section by briefly reviewing the standard supervised learning setup for feedforward neural networks. Afterwards, we highlight a few important aspects when applying machine learning to communications problems.
II-A Supervised Learning of Neural Networks
A deep feedforward NN with $L$ layers defines a mapping $\hat{y} = f(x; \theta)$, where the input vector $x$ is mapped to the output vector $\hat{y}$ by alternating between affine transformations $z_i = W_i x_{i-1} + b_i$ and pointwise nonlinearities $x_i = \phi(z_i)$ [3]. This is illustrated in the bottom part of Fig. 3. The parameter vector $\theta$ encapsulates all elements in the weight matrices $W_1, \dots, W_L$ and all elements in the bias vectors $b_1, \dots, b_L$. Common choices for the nonlinearity $\phi$ include the sigmoid $\sigma(z) = 1/(1 + e^{-z})$, the hyperbolic tangent $\tanh(z)$, and the rectified linear unit $\max\{0, z\}$.

In a supervised learning setting, one has a training set $S$ containing a list of desired input–output pairs. Then, training proceeds by minimizing the empirical training loss $L_S(\theta)$, where the empirical loss for a finite set $A$ of input–output pairs is defined by

$$L_A(\theta) = \frac{1}{|A|} \sum_{(x, y) \in A} \ell\big(f(x; \theta), y\big),$$

and $\ell(\hat{y}, y)$ is the loss associated with returning the output $\hat{y}$ when $y$ is correct. When the training set is large, one typically chooses the parameter vector using a variant of stochastic gradient descent (SGD). In particular, minibatch SGD uses the parameter update

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L_{S_t}(\theta_t),$$

where $\alpha$ is the step size and $S_t \subseteq S$ is the minibatch used by the $t$th step. Typically, $S_t$ is chosen to be a random subset of $S$ with some fixed size that matches available computational resources (e.g., GPUs).
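The minibatch SGD update above can be illustrated on a toy problem. The following NumPy sketch fits a linear model by least squares; the data, loss function, step size, and batch size are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy noiseless regression set: learn theta for f(x) = theta^T x (illustrative).
X = rng.normal(size=(256, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true

def loss(theta, Xb, yb):
    """Empirical loss L_A(theta) over a (mini)batch: mean squared error."""
    r = Xb @ theta - yb
    return np.mean(r ** 2)

def grad(theta, Xb, yb):
    """Gradient of the empirical minibatch loss with respect to theta."""
    r = Xb @ theta - yb
    return 2.0 * Xb.T @ r / len(yb)

theta = np.zeros(3)
step = 0.1    # step size alpha
batch = 32    # minibatch size |S_t|
for t in range(500):
    idx = rng.choice(len(X), size=batch, replace=False)  # random minibatch S_t
    theta = theta - step * grad(theta, X[idx], y[idx])   # SGD update

final_loss = loss(theta, X, y)  # near zero, since the labels are noiseless
```

Since the problem is realizable (no label noise), every minibatch gradient vanishes at the true parameters, so the iterates converge to them.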
II-B Machine Learning for Communications
Machine learning for communications differs from traditional machine learning in a number of ways.
II-B1 Accurate generative modeling and infinite training data supply
Machine learning is typically applied to fixed-size data sets, which are split into training and test sets. A central problem in this case is the generalization error caused by overfitting the model parameters to peculiarities in the training set. On the other hand, communication theory traditionally assumes that one can accurately simulate and/or model the communication channel. In this case, one can generate an infinite supply of training data with which to learn.
II-B2 Exponential number of classes
For classification tasks, a different type of generalization error is caused by a lack of class diversity in the training set. For classical machine learning applications, there are typically only a few classes and the training set contains a sufficient number of training examples per class. On the other hand, for certain communications problems, e.g., decoding error-correcting codes, the number of classes increases exponentially with the problem size. Training unrestricted NNs (even deep ones) with only a subset of classes leads to poor generalization performance [4].
II-B3 Black-box computation graphs vs. domain knowledge
Another consequence of having an accurate channel model is that one can actually implement optimal or close-to-optimal solutions in many cases. In that case, learning can be motivated as a means to reduce complexity because there may exist simple approximations with much lower complexity. Moreover, existing domain knowledge can be used to simplify the learning task. Indeed, for both considered applications in this paper, one actually improves existing algorithms by extensively parameterizing their associated computation graphs, rather than optimizing conventional “black-box” NN architectures. Our focus is on examining the trained solutions and trying to understand why they work better and solve the problem more efficiently than the hand-tuned algorithms they are based on.
III Optimized BP Decoding of Codes
Recently, Nachmani, Be’ery, and Burshtein proposed a weighted BP (WBP) decoder with different weights (or scale factors) for each edge in the Tanner graph [1]. These weights are then optimized empirically using tools and software from deep learning. One of the main advantages of this approach is that the decoder automatically respects both code and channel symmetry and requires many fewer training patterns to learn. Their results show that this approach provides moderate gains over standard BP when applied to the parity-check matrices of BCH codes. A more comprehensive treatment of this idea can be found in [5]. In addition, there are other less-restrictive NN decoders that also take advantage of code and channel symmetry [6, 7].

While the performance gains of WBP decoding are worth investigating, the additional complexity of storing and applying one weight per edge is significant. In our experiments, we also consider simple scaling models that share weights to reduce the storage and computational burden. In these models, three scalar parameters are used for each iteration: the message scaling, the channel scaling, and the damping factor. They can also be shared across all iterations.
III-A Weighted Belief-Propagation Decoding
Consider an $(n, k)$ binary linear code defined by an $m \times n$ parity-check matrix $H$. Given any parity-check matrix $H$, one can construct a bipartite Tanner graph $(V, C, E)$, where $V$ and $C$ are sets of variable nodes (i.e., code symbols) and check nodes (i.e., parity constraints). The edges, $E \subseteq V \times C$, connect all parity checks to the variables involved in them. By convention, the boundary symbol $\partial$ denotes the neighborhood operator defined by $\partial v = \{c \in C \mid (v, c) \in E\}$ and $\partial c = \{v \in V \mid (v, c) \in E\}$.
The log-likelihood ratio (LLR) is the standard message representation for BP decoding of binary variables. The initial channel LLR for variable node $v \in V$ is defined by

$$\ell_v = \log \frac{p(y_v \mid c_v = 0)}{p(y_v \mid c_v = 1)}, \qquad (1)$$

where $y_v$ is the $v$th symbol in the channel output sequence, and $c_v$ is the corresponding bit in the transmitted codeword.
WBP is an iterative algorithm that passes messages along the edges of the Tanner graph. During the $t$th iteration, a pair of messages, $\mu_{v \to c}^{(t)}$ and $\mu_{c \to v}^{(t)}$, are passed in each direction along the edge $(v, c) \in E$. This occurs in two steps: the check-to-variable step updates the messages $\mu_{c \to v}^{(t)}$ and the variable-to-check step updates the messages $\mu_{v \to c}^{(t)}$. In the variable-to-check step, the pre-update rule is

$$\tilde{\mu}_{v \to c}^{(t)} = w_v^{(t)} \ell_v + \sum_{c' \in \partial v \setminus \{c\}} w_{c' v}^{(t)} \mu_{c' \to v}^{(t-1)}, \qquad (2)$$

where $w_{c' v}^{(t)}$ is the weight assigned to the edge $(v, c')$ and $w_v^{(t)}$ is the weight assigned to the channel LLR $\ell_v$. In the check-to-variable step, the pre-update rule is

$$\tilde{\mu}_{c \to v}^{(t)} = 2 \tanh^{-1}\!\left( \prod_{v' \in \partial c \setminus \{v\}} \tanh\!\left( \frac{\mu_{v' \to c}^{(t)}}{2} \right) \right). \qquad (3)$$

To avoid numerical issues, the absolute value of $\tilde{\mu}_{c \to v}^{(t)}$ is clipped if it is larger than some fixed value (e.g., 15).
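The pre-update rules (2) and (3) can be sketched directly in NumPy. This is a readability-first reference implementation over a dense parity-check matrix, not an efficient decoder; the function and variable names are illustrative.

```python
import numpy as np

def var_to_check(llr, mu_cv, w_edge, w_ch, H):
    """Weighted variable-to-check pre-update (cf. (2)):
    mu[v->c] = w_ch[v]*llr[v] + sum over c' in N(v)\\{c} of w_edge[c',v]*mu_cv[c',v]."""
    m, n = H.shape
    mu_vc = np.zeros((m, n))
    for c in range(m):
        for v in range(n):
            if H[c, v]:
                others = [cp for cp in range(m) if H[cp, v] and cp != c]
                mu_vc[c, v] = w_ch[v] * llr[v] + sum(
                    w_edge[cp, v] * mu_cv[cp, v] for cp in others)
    return mu_vc

def check_to_var(mu_vc, H, clip=15.0):
    """Check-to-variable pre-update (cf. (3)), with the absolute value of the
    outgoing message clipped to avoid numerical issues."""
    m, n = H.shape
    mu_cv = np.zeros((m, n))
    for c in range(m):
        for v in range(n):
            if H[c, v]:
                others = [vp for vp in range(n) if H[c, vp] and vp != v]
                prod = np.prod([np.tanh(mu_vc[c, vp] / 2.0) for vp in others])
                mu_cv[c, v] = np.clip(2.0 * np.arctanh(prod), -clip, clip)
    return mu_cv
```

With all weights set to one and the incoming messages initialized to zero, the first iteration reduces to standard BP on the channel LLRs.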
To mitigate oscillation and enhance convergence, we also use a damping coefficient $\gamma \in [0, 1]$ to complete the message updates [8]. Damping is referred to as “relaxed BP” in [5], where it is studied in the context of weighted min-sum decoding. This method of improving performance was not considered in [1]. In particular, the final BP messages at iteration $t$ are computed using a convex combination of the previous value and the pre-update value:

$$\mu_{v \to c}^{(t)} = (1 - \gamma)\, \tilde{\mu}_{v \to c}^{(t)} + \gamma\, \mu_{v \to c}^{(t-1)}, \qquad (4)$$

$$\mu_{c \to v}^{(t)} = (1 - \gamma)\, \tilde{\mu}_{c \to v}^{(t)} + \gamma\, \mu_{c \to v}^{(t-1)}. \qquad (5)$$
For the marginalization step, the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$ is used to map the output LLR $\hat{\ell}_v$ to an estimate of the probability that $c_v = 1$, defined by

$$\hat{p}_v = \sigma(-\hat{\ell}_v). \qquad (6)$$

Setting all weights to $1$ and the damping coefficient to $\gamma = 0$ recovers standard BP.
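A minimal sketch of the damping step (4)–(5) and the sigmoid marginalization (6); the sign convention assumes that a positive LLR favors $c_v = 0$, as in (1).

```python
import numpy as np

def damp(mu_pre, mu_prev, gamma):
    """Damped message update (cf. (4)-(5)): convex combination of the
    pre-update value and the previous iteration's message. gamma = 0
    recovers the undamped update."""
    return (1.0 - gamma) * mu_pre + gamma * mu_prev

def marginalize(llr_out):
    """Sigmoid marginalization (cf. (6)): with LLRs defined as
    log p(y|c=0)/p(y|c=1), the probability that c = 1 is sigma(-llr)."""
    return 1.0 / (1.0 + np.exp(llr_out))  # equals sigma(-llr_out)
```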
III-B From WBP to Optimized WBP
Any iterative algorithm, such as WBP decoding, can be “unrolled” to give a feedforward architecture that has the same output for some fixed number of iterations [9]. Moreover, the sections in the feedforward architecture are not required to be identical. This increases the number of “trainable” parameters that can be optimized.
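The idea of unrolling with per-section parameters can be illustrated on a toy iteration: gradient descent on a scalar objective, where each unrolled "layer" keeps its own step size. This is an illustrative stand-in for WBP, not the decoder itself.

```python
def unrolled(x0, alphas, grad_f):
    """Unroll len(alphas) iterations of x <- x - alpha_t * grad_f(x).
    With identical alphas this is the original iterative algorithm; letting
    each layer keep its own alpha_t increases the number of trainable
    parameters without changing the structure (illustrative sketch)."""
    x = x0
    for a in alphas:          # one "feedforward section" per iteration
        x = x - a * grad_f(x)
    return x

# Example: three unrolled steps minimizing (x - 3)^2, whose gradient is 2(x - 3).
result = unrolled(0.0, [0.4, 0.4, 0.4], lambda x: 2.0 * (x - 3.0))
```

In the optimized-WBP setting, the per-layer parameters are the weights and damping coefficients rather than step sizes, but the unrolling principle is the same.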
It is well-known that BP performs exact marginalization when the Tanner graph is a tree, but good codes typically have loopy Tanner graphs with short cycles. To improve the BP performance on short high-density parity-check (HDPC) codes, one can optimize the weights $w_v^{(t)}$ and $w_{c v}^{(t)}$ in all iterations [1]. The damping coefficient $\gamma$ can also be optimized.
For supervised classification problems, one typically uses the cross-entropy loss function, and this loss function has also been proposed for the optimized WBP decoding problem [1]. However, our experiments show that minimizing this loss may not actually minimize the bit error rate. Instead, we use the modified loss function

$$L(\theta) = \frac{1}{n} \sum_{v \in V} \sigma\big( -(1 - 2 c_v)\, \hat{\ell}_v^{(T)} \big), \qquad (7)$$

where $T$ is the total number of iterations and $\hat{\ell}_v^{(t)}$ is the output LLR for variable $v$ after $t$ iterations. More details about the modified loss can be found in [10]. Our experiments also show that the optimization behaves better with the multi-loss approach proposed by [1]. Thus, the results in this paper are based on optimizing the modified multi-loss function

$$L_{\mathrm{multi}}(\theta) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{n} \sum_{v \in V} \sigma\big( -(1 - 2 c_v)\, \hat{\ell}_v^{(t)} \big). \qquad (8)$$
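A hedged NumPy sketch of a soft bit-error-rate style loss and its multi-loss average over iterations; the exact form of the modified loss follows [10], and this version is only meant to convey the structure.

```python
import numpy as np

def soft_ber(llr, bits):
    """Soft bit error rate: sigma(-(1-2c)*llr) is the probability that a hard
    decision on llr differs from bit c, under the convention that a positive
    LLR favors c = 0 (hedged sketch of the modified loss; cf. [10])."""
    s = 1.0 - 2.0 * bits                    # maps {0,1} -> {+1,-1}
    return np.mean(1.0 / (1.0 + np.exp(s * llr)))

def multiloss(llr_per_iter, bits):
    """Multi-loss: average the per-iteration loss over all T decoding
    iterations, so that earlier unrolled layers receive a direct gradient."""
    return np.mean([soft_ber(llr_t, bits) for llr_t in llr_per_iter])
```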
The optimization complexity depends on the number of iterations and how the parameters are shared. For example, one can share the weights temporally (across decoding iterations) and/or spatially (across edges):
- If the weights are shared temporally, i.e., $w_{c v}^{(t)} = w_{c v}$ and $w_v^{(t)} = w_v$ for all $t$, one obtains a recurrent NN (RNN) structure.
- If the weights are shared spatially, i.e., $w_{c v}^{(t)} = w^{(t)}$ for all edges and $w_v^{(t)} = w_0^{(t)}$ for all variables, then there are only two scalar parameters per iteration: one for the channel LLR and one for the BP messages. Compared to the fully weighted (FW) decoder, we call this the simple scaled (SS) decoder.
- Sharing weights both temporally and spatially results in only two weight parameters, $w$ and $w_0$.
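The effect of these sharing options on the parameter count can be tallied in a few lines; the count below (one weight per edge plus one per channel LLR in the FW case) is an illustrative bookkeeping sketch.

```python
def num_params(T, E, n, share_time=False, share_space=False):
    """Number of trainable weights in WBP under temporal/spatial sharing
    (illustrative count). T: iterations, E: Tanner-graph edges, n: variables.
    FW keeps one weight per edge plus one per channel LLR, per iteration;
    spatial sharing collapses these to two scalars per iteration; temporal
    sharing reuses one iteration's weights for all T iterations."""
    iters = 1 if share_time else T
    per_iter = 2 if share_space else (E + n)
    return iters * per_iter
```

For example, sharing both temporally and spatially leaves exactly two parameters, matching the last bullet above.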
III-C Random Redundant Decoding (RRD)
A straightforward way to improve BP decoding for HDPC codes is to use redundant parity checks (e.g., by adding dual codewords as rows to the parity-check matrix) [11]. In general, however, the complexity of BP decoding scales linearly with the number of rows in the parity-check matrix.

Another approach is to spread these different parity checks over time, i.e., by using different parity-check matrices in each iteration [12, 13, 14]. This can be implemented efficiently by exploiting the code’s automorphism group and reordering the code bits after each iteration in a way that effectively uses many different parity-check matrices but stores only one.
In [5], optimized weighted RRD decoders are constructed by cascading several WBP blocks and reordering the code bits after each WBP block. In this work, we also consider optimized RRD decoding based on their approach. However, the input to the $i$th learned BP block is modified to be a weighted convex combination of the initial channel LLRs and the output of the $(i-1)$th learned BP block. This procedure is similar to damping, and the mixing coefficient is also learned.
For RRD decoding, choosing a good parity-check matrix is crucial because the code automorphisms permute the variable nodes without changing the structure of the Tanner graph. In general, good Tanner graphs have fewer short cycles and can be constructed with heuristic cycle-reduction algorithms [12].

III-D Experiments and Results
The various feedforward decoding architectures in this paper are implemented in the PyTorch framework and optimized using the RMSProp optimizer. The number of decoding iterations is set to . For the RRD algorithm, the code bits are permuted after every second decoding iteration and optimized (iteration-independent) mixing and damping coefficients are used. The decoder architectures are trained using transmit–receive pairs for the binary-input AWGN channel, where the SNR parameter is chosen uniformly between dB and dB for each training pair. To avoid numerical issues, the gradient clipping threshold is set to and the LLR clipping threshold is . We define each epoch to be minibatches and each minibatch to be transmit–receive pairs. All decoders are trained for epochs and optimized using the multi-loss function (8).

In Fig. 1, we show the performance curves achieved by the optimized decoders for the code. For the standard parity-check matrix without RRD, the standard BP decoder with damping (StdDBP) performs very similarly to the FW optimized decoder (StdDRNNFW). Similarly, for the cycle-reduced parity-check matrix, damping (CRDBP) achieves essentially the same gain as the fully-weighted model (CRSRNNFW). Thus, the dominant effects are fully explained by using damping and cycle-reduced parity-check matrices.
For a similar complexity, the RRD algorithm achieves better results. This is true both for standard BP (CRRRDBP) with optimized mixing and damping and for optimized weights (CRRRDRNNSS) in the simple-scaling model. However, the fully-weighted model (CRRRDRNNFW) does not provide significant gains over simple scaling. Also, RRD results are shown only for cycle-reduced matrices because they perform much better.
IV Machine Learning for Fiber-Optic Systems
In this section, we discuss the application of machine learning techniques to optical-fiber communications.
IV-A Signal Propagation and Digital Backpropagation
Fiber-optic communication links carry virtually all intercontinental data traffic and are often referred to as the Internet backbone. We consider a simple point-to-point scenario, where a signal with complex baseband representation $x(t)$ is launched into an optical fiber as illustrated in Fig. 2. The signal evolution is implicitly described by the nonlinear Schrödinger equation (NLSE), which captures dispersive and nonlinear propagation impairments [15, p. 40]. After distance $L$, the received signal is low-pass filtered and sampled to give the samples $y_k$.
In the absence of noise, the NLSE is invertible and the transmitted signal can be recovered by solving the NLSE in the reverse propagation direction. This approach is referred to as digital backpropagation (DBP) in the literature. DBP requires a numerical method to solve the NLSE, and a widely studied method is the split-step Fourier method (SSFM). The SSFM conceptually divides the fiber into $M$ segments of length $\delta$, and it is assumed that, for sufficiently small $\delta$, the dispersive and nonlinear effects act independently. A block diagram of the SSFM for DBP is shown in the top part of Fig. 3, where $\delta = L/M$. In particular, one alternates between a linear operator and the element-wise application of a nonlinear phase-shift function. Assuming a sufficiently high sampling rate, the obtained vector converges to a sampled version of the transmitted signal as $M \to \infty$. By comparing the two computation graphs in Fig. 3, one can see that the SSFM has a naturally layered or hierarchical Markov structure, similar to a deep feedforward NN.
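The alternation between the linear dispersion operator and the element-wise nonlinear phase shift can be sketched as follows; the parameter names, sign conventions, and the omission of fiber loss and noise are simplifying assumptions for illustration.

```python
import numpy as np

def ssfm(u, n_steps, dz, beta2, gamma_nl, fs):
    """Split-step Fourier method sketch for a lossless, noiseless NLSE:
    per segment, apply the linear (dispersion) operator in the frequency
    domain, then an element-wise nonlinear phase rotation. Running it again
    with negated beta2 and gamma_nl approximately inverts the propagation,
    which is the idea behind DBP."""
    n = len(u)
    w = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / fs)         # angular frequencies
    lin = np.exp(1j * (beta2 / 2.0) * w ** 2 * dz)          # dispersion operator
    for _ in range(n_steps):
        u = np.fft.ifft(lin * np.fft.fft(u))                # linear step
        u = u * np.exp(1j * gamma_nl * np.abs(u) ** 2 * dz) # nonlinear phase shift
    return u
```

A forward pass followed by a pass with negated parameters recovers the input up to the step-size-dependent splitting error, which vanishes as the segments shrink.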
IV-B Parameter-Efficient Learned Digital Backpropagation
A major issue with DBP is the large computational burden associated with a real-time implementation. Despite significant efforts to reduce complexity (see, e.g., [2, 16, 17]), DBP based on the SSFM is not used in any current optical system that we know of. Instead, only linear equalizers are employed. Their implementation already poses a significant challenge; with data rates exceeding Gbit/s, linear equalization of chromatic dispersion is typically one of the most power-hungry receiver blocks [18].
Note that the linear propagation operator in the SSFM is a dense matrix. On the other hand, deep NNs are typically designed to have very sparse weight matrices in most of the layers to achieve computational efficiency. Sparsification can be achieved by switching from a frequency-domain to a time-domain filtering approach using finite-impulse-response (FIR) filters. The main challenge in that case is to find short FIR filters in each SSFM step that approximate well the ideal chromatic-dispersion all-pass frequency response. In previous work, the general approach is to design either a single filter or a filter pair and use it repeatedly in each step [2, 19, 20, 21]. However, this typically leads to poor parameter efficiency (i.e., it requires relatively long filters) because truncation errors pile up coherently. We have shown in [22, 23] that this truncation-error problem can be controlled effectively by performing a joint optimization of all filter coefficients in the entire DBP algorithm. In particular, the computation graph of the SSFM is optimized via SGD by simply interpreting all linear-step matrices as tunable parameters corresponding to the FIR filters, similar to the weight matrices in a deep NN. The nonlinearities are left unchanged, i.e., they correspond to the nonlinear phase-shift functions in the original SSFM and not to a traditional NN activation function. The resulting method is referred to as learned DBP (LDBP).
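Structurally, LDBP replaces the dense linear steps with short, step-specific FIR filters while keeping the SSFM nonlinearity. The forward pass can be sketched as below; the filter taps would be the trainable parameters, and all names and filter lengths are illustrative.

```python
import numpy as np

def ldbp_forward(y, taps_per_step, gamma_nl, dz):
    """LDBP forward pass (sketch): each step applies a short, step-specific
    FIR filter (the trainable parameters, playing the role of the SSFM's
    linear operators) followed by the unchanged nonlinear phase shift.
    Joint optimization of all taps_per_step via SGD is what distinguishes
    LDBP from repeating one pre-designed filter in every step."""
    u = y
    for h in taps_per_step:                  # one (possibly different) filter per step
        u = np.convolve(u, h, mode="same")   # sparse linear step (time-domain FIR)
        u = u * np.exp(1j * gamma_nl * np.abs(u) ** 2 * dz)  # kept from the SSFM
    return u
```

With identity filters and zero nonlinear coefficient, the pass reduces to the identity, which is a convenient sanity check before training.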
IV-C Optimization Results
In Fig. 4, we compare the equalizer accuracy in terms of the effective SNR of LDBP to the conventional approach of designing a single FIR filter (either via least-squares fitting or frequency-domain sampling) and then using it repeatedly in the SSFM. LDBP requires significantly fewer total filter taps (indicated in brackets) to achieve similar or better peak accuracy. The obtained FIR filters are as short as or (symmetric) taps per step, leading to a very simple and efficient hardware implementation. This is confirmed by recent ASIC synthesis results, which show that the power consumption of LDBP becomes comparable to linear equalization [24]. LDBP can also be extended to subband processing to enable low-complexity DBP for multi-channel or other wideband transmission scenarios [25].
At first glance, the results in Fig. 4 may seem somewhat counterintuitive. Indeed, after examining the optimized individual (per-step) filter responses in LDBP, we found that they are generally worse approximations to the ideal chromatic-dispersion frequency response than filters obtained by least-squares fitting or other methods. However, the combined response of neighboring filters, and also the overall response, is better than with the conventional strategy of using the same filter in each dispersion-compensation stage. In fact, using the same filter many times in series magnifies any weakness. By using different filters at each stage, this problem is avoided and shorter filters can achieve the same performance.
V Conclusion
Recent progress in machine learning and off-the-shelf learning packages have made it tractable to add many parameters to existing communication algorithms and optimize them. In this paper, we have reviewed this approach with the help of two applications.
For the decoding application, our experiments support the observations in [1, 5] that optimizing parameterized BP decoders can provide meaningful gains. In addition, we observed that many fewer parameters (e.g., damping alone) may be sufficient to achieve very similar gains. Thus, for this general approach, it can be fruitful to also minimize the parameterization necessary to achieve the same gain [10].
For the digital backpropagation application, we were pleasantly surprised that, after analyzing the learned solution, we were able to understand why it worked so well. In essence, deep learning discovered a simple and effective strategy that had not been considered earlier.
References
 [1] E. Nachmani, Y. Be’ery, and D. Burshtein, “Learning to decode linear codes using deep learning,” in Proc. Annual Allerton Conference on Communication, Control, and Computing, Illinois, USA, 2016.
 [2] E. Ip and J. M. Kahn, “Compensation of dispersion and nonlinear impairments using digital backpropagation,” J. Lightw. Technol., vol. 26, no. 20, pp. 3416–3425, Oct. 2008.
 [3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [4] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel decoding,” in Proc. Annual Conf. Information Sciences and Systems (CISS), 2017.
 [5] E. Nachmani, E. Marciano, L. Lugosch, W. J. Gross, D. Burshtein, and Y. Be’ery, “Deep learning methods for improved decoding of linear codes,” IEEE J. Sel. Topics Signal Proc., vol. 12, no. 1, pp. 119–131, Feb. 2018.
 [6] L. G. Tallini and P. Cull, “Neural nets for decoding error-correcting codes,” in Proc. IEEE Technical Applications Conf. and Workshops, Portland, OR, 1995.
 [7] A. Bennatan, Y. Choukroun, and P. Kisilev, “Deep learning for decoding of linear codes – a syndrome-based approach,” in Proc. IEEE Int. Symp. Information Theory (ISIT), Vail, CO, 2018.
 [8] M. Fossorier, R. Palanki, and J. Yedidia, “Iterative decoding of multi-step majority logic decodable codes,” in Proc. Int. Symp. on Turbo Codes & Iterative Inform. Proc., 2003, pp. 125–132.
 [9] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Intl. Conf. on Mach. Learn., 2010, pp. 399–406.
 [10] M. Lian, F. Carpi, C. Häger, and H. D. Pfister, “Learned belief-propagation decoding with simple scaling and SNR adaptation,” 2019, submitted to ISIT 2019.
 [11] J. S. Yedidia, J. Chen, and M. P. Fossorier, “Generating code representations suitable for belief propagation decoding,” in Proc. Annual Allerton Conf. on Commun., Control, and Comp., vol. 40, no. 1, 2002, pp. 447–456.
 [12] T. R. Halford and K. M. Chugg, “Random redundant soft-in soft-out decoding of linear block codes,” in Proc. IEEE Int. Symp. Inform. Theory, 2006, pp. 2230–2234.
 [13] I. Dimnik and Y. Be’ery, “Improved random redundant iterative HDPC decoding,” IEEE Trans. Commun., vol. 57, no. 7, 2009.
 [14] T. Hehn, J. B. Huber, O. Milenkovic, and S. Laendner, “Multiple-bases belief-propagation decoding of high-density cyclic codes,” IEEE Trans. Commun., vol. 58, no. 1, pp. 1–8, 2010.
 [15] G. P. Agrawal, Nonlinear Fiber Optics, 4th ed. Academic Press, 2006.
 [16] L. B. Du and A. J. Lowery, “Improved single channel backpropagation for intra-channel fiber nonlinearity compensation in long-haul optical communication systems,” Opt. Express, vol. 18, no. 16, pp. 17 075–17 088, July 2010.
 [17] D. Rafique, M. Mussolin, M. Forzati, J. Mårtensson, M. N. Chugtai, and A. D. Ellis, “Compensation of intra-channel nonlinear fibre impairments using simplified digital backpropagation algorithm,” Opt. Express, vol. 19, no. 10, pp. 9453–9460, April 2011.
 [18] B. S. G. Pillai, B. Sedighi, K. Guan, N. P. Anthapadmanabhan, W. Shieh, K. J. Hinton, and R. S. Tucker, “End-to-end energy modeling and analysis of long-haul coherent transmission systems,” J. Lightw. Technol., vol. 32, no. 18, pp. 3093–3111, 2014.
 [19] L. Zhu, X. Li, E. Mateo, and G. Li, “Complementary FIR filter pair for distributed impairment compensation of WDM fiber transmission,” IEEE Photon. Technol. Lett., vol. 21, no. 5, pp. 292–294, March 2009.
 [20] G. Goldfarb and G. Li, “Efficient backward-propagation using wavelet-based filtering for fiber backward-propagation,” Opt. Express, vol. 17, no. 11, pp. 814–816, May 2009.
 [21] C. Fougstedt, M. Mazur, L. Svensson, H. Eliasson, M. Karlsson, and P. Larsson-Edefors, “Time-domain digital back propagation: Algorithm and finite-precision implementation aspects,” in Proc. Optical Fiber Communication Conf. (OFC), Los Angeles, CA, 2017.
 [22] C. Häger and H. D. Pfister, “Nonlinear interference mitigation via deep neural networks,” in Proc. Optical Fiber Communication Conf. (OFC), San Diego, CA, 2018.
 [23] ——, “Deep learning of the nonlinear Schrödinger equation in fiberoptic communications,” in Proc. IEEE Int. Symp. Information Theory (ISIT), Vail, CO, 2018.
 [24] C. Fougstedt, C. Häger, L. Svensson, H. D. Pfister, and P. Larsson-Edefors, “ASIC implementation of time-domain digital backpropagation with deep-learned chromatic dispersion filters,” in Proc. European Conf. Optical Communication (ECOC), Rome, Italy, 2018.
 [25] C. Häger and H. D. Pfister, “Wideband time-domain digital backpropagation via subband processing and deep learning,” in Proc. European Conf. Optical Communication (ECOC), Rome, Italy, 2018.