I. Introduction
The application of machine learning techniques in communication systems has attracted significant attention in recent years [1, 2]. In the field of optical fiber communications, various tasks such as performance monitoring, fiber nonlinearity mitigation, carrier recovery, and modulation format recognition have been addressed from the machine learning perspective [3, 4, 5]. In particular, since chromatic dispersion and nonlinear Kerr effects in the fiber are regarded as the major information rate-limiting factors in modern optical communication systems [6], the application of artificial neural networks (ANNs), known as universal function approximators [7], for channel equalization has been of great research interest [8, 9, 10, 11, 12]. For example, a multi-layer ANN architecture, which enables deep learning techniques [13], has recently been considered in [14] for the realization of low-complexity nonlinearity compensation by digital backpropagation (DBP) [15]. It has been shown that the proposed ANN-based DBP achieves performance similar to conventional DBP for a single-channel 16-QAM system while reducing the computational demands. Deep learning has also been considered for short-reach communications. For instance, in [16] ANNs are considered for equalization in PAM8 IM/DD systems. Bit-error rates (BERs) below the forward error correction (FEC) threshold have been experimentally demonstrated over a 4 km transmission distance. In [17], deep ANNs are used at the receiver of an IM/DD system as an advanced detection block, which accounts for channel memory as well as linear and nonlinear signal distortions. For short reaches (1.5 km), BER improvements over common feedforward linear equalization were achieved.

In all the aforementioned examples, deep learning techniques have been applied to optimize a specific function in the fiber-optic system, which itself consists of several signal processing blocks at both transmitter and receiver, each carrying out an individual task, e.g., coding, modulation, and equalization. In principle, such a modular implementation allows the system components to be analyzed, optimized, and controlled separately and thus presents a convenient way of building the communication link. Nevertheless, this approach can be sub-optimal, especially for communication systems where the optimum receivers or optimum blocks are not known or are not available due to complexity reasons. As a consequence, in some systems, a block-based receiver with one or several sub-optimum modules does not necessarily achieve the optimal end-to-end system performance. Especially if the optimum joint receiver is not known or too complex to implement, carefully chosen approximations are required.
Deep learning techniques, which can approximate any nonlinear function [13], allow us to design the communication system by carrying out the optimization in a single end-to-end process including the transmitter and receiver as well as the communication channel. Such a novel design based on full system learning avoids the conventional modular structure, because the system is implemented as a single deep neural network, and it has the potential to achieve optimal end-to-end performance. The objective of this approach is to acquire a robust representation of the input message at every layer of the network. Importantly, this enables a communication system to be adapted for information transmission over any type of channel without requiring prior mathematical modeling and analysis. The viability of such an approach has been shown for wireless communications [18] and also demonstrated experimentally with a wireless link [19]. Such an application of end-to-end deep learning presents the opportunity to fundamentally reconsider optical communication system design.
Our work introduces end-to-end deep learning for designing optical fiber communication transceivers. The focus of this paper is on IM/DD systems, which are currently the preferred choice in many data center, access, metro, and backhaul applications because of their simplicity and cost-effectiveness [20]. The IM/DD communication channel is nonlinear due to the combination of photodiode (square-law) detection and fiber dispersion. Moreover, noise is added by the amplifier and by the quantization in both the digital-to-analog converters (DACs) and analog-to-digital converters (ADCs). We model the fiber-optic system as a deep fully-connected feedforward ANN. Our work shows that such a deep learning system, including transmitter, receiver, and the nonlinear channel, achieves reliable communication below FEC thresholds. We experimentally demonstrate the feasibility of the approach and achieve information rates of 42 Gb/s beyond 40 km. We apply retraining of the receiver to account for the specific characteristics of the experimental setup not covered by the model. Moreover, we present a training method for realizing flexible and robust transceivers that work over a range of distances. Precise waveform generation is an important aspect of such an end-to-end system design. In contrast to [18], we do not generate modulation symbols, but perform a direct mapping of the input messages to a set of robust transmit waveforms.
The goal of this paper is to design, in an offline process, transceivers for low-cost optical communication systems that can be deployed without requiring the implementation of a training process in the final product. During the offline training process, we can label the data used for finding the parameters of the ANN and hence use supervised training. This is a first step towards building a deep learning-based optical communication system. Such a system will be optimized for a specific range of operating conditions. Eventually, in future work, online training may be incorporated into the transceiver, which may still work in a supervised manner using, e.g., pilot sequences, to cover a wider range of operating conditions. Building a completely unsupervised transceiver with online training will be a significantly more challenging task and first requires a thorough understanding of the possibilities of supervised training. Hence, we focus on the supervised, offline training case in this paper.
The rest of the manuscript is structured as follows: Section II introduces the main concepts behind the deep learning techniques used in this work. The IM/DD communication channel and system components are described mathematically in Sec. III. The architecture of the proposed ANN along with the training method is also presented in this section. Section IV reports the system performance results in simulation. Section V presents the experimental testbed and validation of the key simulation results. Section VI contains an extensive discussion on the properties of the transmit signal, the advantages of training the system in an endtoend manner, and the details about the experimental validation. Finally, Sec. VII concludes the work.
II. Deep Fully-Connected Feedforward Artificial Neural Networks
A fully-connected $K$-layer feedforward ANN maps an input vector $\mathbf{y}_0$ to an output vector $\mathbf{y}_K = f(\mathbf{y}_0)$ through $K$ iterative steps of the form

$\mathbf{y}_k = \alpha_k\left(\mathbf{W}_k \mathbf{y}_{k-1} + \mathbf{b}_k\right), \quad k = 1, \ldots, K,$ (1)

where $\mathbf{y}_{k-1}$ is the output of the $(k-1)$-th layer, $\mathbf{y}_k$ is the output of the $k$-th layer, $\mathbf{W}_k$ and $\mathbf{b}_k$ are respectively the weight matrix and the bias vector of the $k$-th layer, and $\alpha_k$ is its activation function. The set of parameters of the $k$-th layer is denoted by

$\theta_k = \{\mathbf{W}_k, \mathbf{b}_k\}.$ (2)
The activation function $\alpha_k$ introduces nonlinear relations between the layers and enables the network to approximate nonlinear functions. A commonly chosen activation function in state-of-the-art ANNs is the rectified linear unit (ReLU), which acts individually on each element of its input vector by keeping the positive values and setting the negative values to zero [21], i.e., $\mathbf{y} = \alpha_{\text{ReLU}}(\mathbf{z})$ with

$y_i = \max(0, z_i),$ (3)

where $y_i$ and $z_i$ denote the $i$-th elements of the vectors $\mathbf{y}$ and $\mathbf{z}$, respectively. Compared to other popular activation functions such as the hyperbolic tangent and sigmoid, the ReLU function has a constant gradient for positive inputs, which renders training computationally less expensive and avoids the effect of vanishing gradients. This effect occurs for activation functions with asymptotic behavior, since their gradients can become small and consequently decelerate the convergence of the learning algorithm [13, Sec. 8.2].
The final (decision) layer of an ANN often uses the softmax activation function, where the $i$-th element of the output $\mathbf{y} = \alpha_{\text{softmax}}(\mathbf{z})$ is given by

$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}}.$ (4)
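To make (1), (3), and (4) concrete, the following minimal NumPy sketch runs a fully-connected forward pass with ReLU hidden layers and a softmax output. The layer sizes and random parameters are purely illustrative, not those of the transceiver described later:

```python
import numpy as np

def relu(z):
    # Eq. (3): element-wise max(0, z)
    return np.maximum(0.0, z)

def softmax(z):
    # Eq. (4); subtracting the max is a standard numerical-stability trick
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(y0, params):
    # Eq. (1): iterate y_k = alpha_k(W_k y_{k-1} + b_k),
    # ReLU in the hidden layers, softmax at the output.
    y = y0
    for i, (W, b) in enumerate(params):
        z = W @ y + b
        y = softmax(z) if i == len(params) - 1 else relu(z)
    return y

rng = np.random.default_rng(0)
params = [(rng.standard_normal((16, 8)), np.zeros(16)),
          (rng.standard_normal((4, 16)), np.zeros(4))]
p = forward(rng.standard_normal(8), params)
# p is a probability vector: non-negative entries summing to 1
```

Because the last layer is a softmax, the output can be read directly as a probability distribution over candidate messages, which is how the receiver described later makes its decision.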
The training of the neural network can be performed in a supervised manner by labeling the training data. This defines a pairing of an input vector $\mathbf{x}$ and a desired output vector $\mathbf{y}$. The training objective is then to minimize, over the training set $\mathcal{S}$ and with respect to the parameter sets $\theta = (\theta_1, \ldots, \theta_K)$ of all layers, the loss

$L(\theta) = \frac{1}{|\mathcal{S}|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{S}} \ell\left(f(\mathbf{x}), \mathbf{y}\right)$ (5)

between the ANN output $f(\mathbf{x})$, corresponding to the input $\mathbf{x}$ processed by all layers of the ANN, and the desired, known output $\mathbf{y}$. In (5), $\ell(\cdot, \cdot)$ denotes the loss function and $|\mathcal{S}|$ denotes the cardinality of the training set (i.e., the size of the training set) containing 2-tuples of inputs and corresponding outputs. The loss function we consider in this work is the cross-entropy, defined as

$\ell(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_i y_i \log \hat{y}_i.$ (6)
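A small sketch of the cross-entropy loss (6) with a one-hot label, illustrating why a confident correct softmax output is rewarded with a small loss (the probability vectors here are made up for illustration):

```python
import numpy as np

def cross_entropy(y_hat, y):
    # Eq. (6): -sum_i y_i * log(y_hat_i); y is a one-hot label,
    # y_hat a probability vector produced by the softmax layer.
    return -np.sum(y * np.log(y_hat))

# One-hot label for message m = 2 out of M = 4
y = np.array([0.0, 0.0, 1.0, 0.0])
confident = np.array([0.01, 0.01, 0.97, 0.01])
uncertain = np.array([0.25, 0.25, 0.25, 0.25])

loss_good = cross_entropy(confident, y)  # small: -log(0.97)
loss_bad = cross_entropy(uncertain, y)   # larger: -log(0.25)
```

With a one-hot label only the term belonging to the true message survives the sum, so minimizing (6) pushes probability mass onto the correct message.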
A common approach for optimizing the parameter sets in (5), which reduces the computational demands, is to operate on a small batch $\mathcal{S}_t \subseteq \mathcal{S}$ (called a mini-batch) of the training data and perform the stochastic gradient descent (SGD) algorithm, initialized with random parameters $\theta_0$ [13], which are iteratively updated as

$\theta_{t+1} = \theta_t - \eta \nabla L_{\mathcal{S}_t}(\theta_t),$ (7)

where $\eta$ is the learning rate of the algorithm and $\nabla L_{\mathcal{S}_t}(\theta_t)$ is the gradient of the loss function over the mini-batch $\mathcal{S}_t$, defined by

$L_{\mathcal{S}_t}(\theta) = \frac{1}{|\mathcal{S}_t|} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{S}_t} \ell\left(f(\mathbf{x}), \mathbf{y}\right).$ (8)
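The mini-batch update (7)-(8) can be sketched on a toy one-parameter problem. To keep the example self-contained, it uses a squared-error loss rather than the cross-entropy used in this work; the data and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: fit a single parameter theta to noisy targets y = 3*x.
x = rng.standard_normal(256)
y = 3.0 * x + 0.01 * rng.standard_normal(256)

theta, eta, batch = 0.0, 0.1, 32
for t in range(200):
    idx = rng.choice(len(x), size=batch, replace=False)  # mini-batch S_t
    xb, yb = x[idx], y[idx]
    # Gradient of the mini-batch squared-error loss, cf. (8),
    # averaged over the batch, followed by the SGD update (7).
    grad = np.mean(2.0 * (theta * xb - yb) * xb)
    theta = theta - eta * grad
# theta converges close to the true slope of 3
```

Each iteration sees only a small random subset of the data, which is exactly what makes the updates cheap; the noise this introduces averages out over many iterations.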
In modern deep learning, an efficient computation of the gradient in (7) is achieved by error backpropagation [13, 22]. A state-of-the-art algorithm with enhanced convergence is the Adam optimizer, which dynamically adapts the learning rate [23]. The Adam algorithm is used for optimization during the training process in this work. All numerical results in the manuscript have been generated using the deep learning library TensorFlow [24].

III. Proposed End-to-End Communication System
We implement the complete fiber-optic communication system and transmission chain, including transmitter, receiver, and channel, as a single end-to-end ANN, as suggested in [18, 19]. To demonstrate the concept, we focus on an IM/DD system, but we emphasize that the general method is not restricted to this scheme and can easily be extended to other, possibly more complex, models. In the following, we explain all the components of the transceiver chain as well as the channel model in detail. The full end-to-end neural network chain is depicted in Fig. 1.
III-A. Transmitter Section
We use a block-based transmitter as it has multiple advantages. Firstly, it is computationally simple, making it attractive for low-cost, high-speed implementations. Secondly, it allows massively parallel processing of the individual blocks. Each block encodes an independent message $m$ from a set of $M$ total messages into a vector of $n$ transmit samples, forming a symbol. Each message represents an equivalent of $\log_2(M)$ bits.
The encoding is done in the following way: the message $m$ is encoded into a one-hot vector of size $M$, denoted $\mathbf{1}_m$, whose $m$-th element equals 1 while all other elements are 0. Such one-hot encoding is the standard way of representing categorical values in most machine learning algorithms [13] and facilitates the minimization of the symbol error rate; an integer encoding would, for instance, impose an undesired ordering of the messages. The one-hot vector is fed to the first hidden layer of the network, whose weight matrix and bias vector are $\mathbf{W}_1$ and $\mathbf{b}_1$, respectively. The second hidden layer has parameters $\mathbf{W}_2$ and $\mathbf{b}_2$. The ReLU activation function (3) is applied in both hidden layers. The following layer prepares the data for transmission, and the dimension $n$ of its output, i.e., the number of waveform samples representing the message, determines the oversampling rate of the transmitted signal. In our work, an oversampling factor of 4 is considered, and thus the message is effectively mapped onto a symbol of $n$ samples. As fiber dispersion introduces memory between several consecutive symbols, multiple transmitted blocks need to be considered to model realistic transmission. Hence, the output samples of $N$ neighboring blocks (that encode potentially different inputs) are concatenated by the serializer to form a sequence of $N \cdot n$ samples ready for transmission over the channel. All these ANN blocks have identical weight matrices and bias vectors. The system can be viewed as an autoencoder with an effective information rate of $\log_2(M)$ bits/symbol. We consider unipolar signaling, and the ANN transmitter has to limit its output values to the relatively linear operation region of the Mach-Zehnder modulator (MZM), which we denote $[z_{\min}, z_{\max}]$. This is achieved by applying, in the final layer, a clipping activation function which combines two ReLUs as follows:

$\alpha_{\text{clip}}(z) = \alpha_{\text{ReLU}}\left(z - z_{\min} - \sigma\right) - \alpha_{\text{ReLU}}\left(z - z_{\max} + \sigma\right) + z_{\min} + \sigma,$ (9)
where the term $\sigma$, which restricts the clipped signal to $[z_{\min} + \sigma, z_{\max} - \sigma]$, ensures that the signal remains within the MZM limits after quantization noise is added by the DAC. The variance $\sigma^2$ of the quantization noise is defined below.

III-B. Communication Channel
The main limiting factor in IM/DD systems is the intersymbol interference (ISI) resulting from optical fiber dispersion [25]. Moreover, in such systems, simple photodiodes (PDs) are used to detect the intensity of the received optical field and perform opto-electrical conversion, so-called square-law detection. As a consequence of the joint effects of dispersion and square-law detection, the IM/DD communication channel is nonlinear and has memory.
In our work, the communication channel model includes low-pass filtering (LPF) to account for the finite bandwidth of the transmitter and receiver hardware, the DAC, the ADC, the MZM, photo-conversion by the PD, noise due to amplification, and optical fiber transmission. The channel is considered part of the system implemented as an end-to-end deep feedforward neural network, shown in Fig. 1. The signal that enters the receiver section of the ANN after channel propagation can be expressed as (neglecting the receiver LPF for ease of exposition)

$\mathbf{r} = |\mathbf{y}|^2 + \mathbf{n}, \quad \mathbf{y} = \mathcal{C}(\mathbf{x}),$ (10)

where $\mathbf{y}$ is the waveform after fiber propagation, $\mathbf{x}$ is the transmit signal, $\mathcal{C}(\cdot)$ is an operator describing the effects of the electrical field transfer function of the modulator and the fiber dispersion, $|\cdot|^2$ models the square-law photodetection, and $\mathbf{n}$ is additive Gaussian noise arising, e.g., from the transimpedance amplifier (TIA) circuit. We select the variance of the noise to match the signal-to-noise ratios (SNRs) after photodetection obtained in our experimental setup. Further details on the SNR values at the examined distances are presented below in Sec. V. We now discuss the system components in more detail.
Chromatic dispersion in the optical fiber is mathematically expressed by the partial differential equation [25]

$\frac{\partial A(z, t)}{\partial z} = -j \frac{\beta_2}{2} \frac{\partial^2 A(z, t)}{\partial t^2},$ (11)

where $A(z, t)$ is the complex amplitude of the optical field envelope, $t$ denotes time, $z$ is the position along the fiber, and $\beta_2$ is the dispersion coefficient. Equation (11) can be solved analytically in the frequency domain by taking the Fourier transform, yielding the dispersion frequency domain transfer function

$H(z, \omega) = \exp\!\left(j \frac{\beta_2}{2} \omega^2 z\right),$ (12)

where $\omega$ is the angular frequency. In our work, fiber dispersion is applied in the frequency domain on the fivefold zero-padded version of the signal stemming from the $N$ concatenated blocks. The FFT and IFFT necessary for the conversion between time and frequency domain form part of the ANN and are provided by the TensorFlow library [24]. The MZM is modeled by its electrical field transfer function, a sine applied to inputs restricted to a limited interval [26]. This is realized in the ANN by a layer that consists only of the element-wise MZM function $\sin(\cdot)$. The DAC and ADC components introduce additional quantization noise due to their limited resolution. We model the noise of each device as additive, uniformly distributed noise with a variance determined by the effective number of bits (ENOB) of the device [27]

$\sigma^2 = \frac{P_{\text{sig}}}{3 \cdot 2^{2 \cdot \text{ENOB} - 1}},$ (13)

where $P_{\text{sig}}$ is the average power of the input signal. Low-pass filtering is applied before the DAC/ADC components to restrict the bandwidth of the signal. Note that both LPF stages and the chromatic dispersion stage can be modeled as purely linear stages of the ANN, i.e., as a multiplication with a correspondingly chosen matrix. The MZM and PD stages are modeled by purely nonlinear functions.
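The channel stages described above can be sketched as follows in NumPy: the frequency-domain dispersion operator (12), an element-wise sine for the MZM, square-law detection, and additive uniform quantization noise with the variance of (13). The numerical values (fiber length, $\beta_2$, and the stand-in input waveform) are illustrative, with $\beta_2 \approx -2.17 \times 10^{-26}$ s$^2$/m corresponding to roughly 17 ps/nm/km at 1550 nm:

```python
import numpy as np

def dispersion(field, dist_m, fs, beta2=-2.17e-26):
    # Eq. (12): multiply by exp(j*beta2/2*omega^2*z) in the frequency
    # domain (beta2 in s^2/m); an all-pass, so energy is preserved.
    omega = 2.0 * np.pi * np.fft.fftfreq(len(field), d=1.0 / fs)
    H = np.exp(1j * 0.5 * beta2 * omega**2 * dist_m)
    return np.fft.ifft(np.fft.fft(field) * H)

def quantization_noise(signal, enob, rng):
    # Eq. (13): uniform noise whose variance matches
    # sigma^2 = P_sig / (3 * 2^(2*ENOB - 1)).
    p_sig = np.mean(np.abs(signal) ** 2)
    var = p_sig / (3.0 * 2.0 ** (2 * enob - 1))
    half_width = np.sqrt(3.0 * var)  # uniform on [-a, a] has variance a^2/3
    return rng.uniform(-half_width, half_width, size=signal.shape)

rng = np.random.default_rng(2)
fs = 336e9                              # sampling rate, as in Table I
x = rng.uniform(0.0, 1.0, 528)          # stand-in for N*n transmit samples
x_mzm = np.sin(x)                       # element-wise MZM field transfer
y = dispersion(x_mzm, 40e3, fs)         # 40 km of fiber
r = np.abs(y) ** 2                      # square-law photodetection
r_noisy = r + quantization_noise(r, enob=6, rng=rng)
```

Note how the dispersion and LPF stages are linear in the field while the MZM sine and the $|\cdot|^2$ detection are element-wise nonlinearities, mirroring the linear/nonlinear split of the ANN channel layers described above.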
III-C. Receiver Section
After square-law detection, amplification, LPF, and ADC, the central block of $n$ received samples is extracted for processing in the receiver section of the neural network. The architecture of the following layers is identical to those at the transmitter side, in reverse order. The first two receiver layers use the ReLU activation function (3). The activation of the final layer of the ANN is the softmax function (4), and thus its output is a probability vector $\mathbf{p}$ with the same dimension $M$ as the one-hot vector encoding of the message. At this stage, a decision on the transmitted message is made by selecting the index $\hat{m}$ of the largest element of $\mathbf{p}$, and a block (symbol) error occurs when $\hat{m} \neq m$, where $m$ is the index of the element equal to 1 in the one-hot representation of the input message. The block error rate (BLER) can then be estimated as

$\text{BLER} \approx \frac{1}{|\mathcal{T}|} \sum_{m \in \mathcal{T}} \mathbb{1}\{\hat{m} \neq m\},$ (14)

where $|\mathcal{T}|$ is the cardinality of the set $\mathcal{T}$ of transmitted messages and $\mathbb{1}\{\cdot\}$ is the indicator function, equal to 1 when the condition in the brackets is satisfied and 0 otherwise.
In our work, the bit-error rate (BER) is examined as an indicator of the system performance. For computing the BER, we use an ad hoc bit mapping by assigning a Gray code to the input messages. Whenever a block is received in error, the number of wrong bits that have occurred is counted. Note that this approach is sub-optimal, as the deep learning algorithm only minimizes the BLER, and a symbol error does not necessarily lead to a single bit error. In our simulation results, we will hence also provide a lower bound on the achievable BER with an optimized bit mapping by assuming that at most a single bit error occurs during a symbol error.
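The BLER estimate (14) and the Gray-mapped bit-error count can be sketched as follows; the decision indices here are synthetic stand-ins rather than outputs of the trained ANN:

```python
import numpy as np

def gray(m):
    # Reflected binary Gray code of integer message m
    return m ^ (m >> 1)

def bler_and_ber(sent, decided, bits_per_msg):
    # Eq. (14): fraction of blocks with m_hat != m, plus bit errors
    # counted between the Gray-coded labels of sent/decided messages.
    sent, decided = np.asarray(sent), np.asarray(decided)
    block_errors = int(np.sum(sent != decided))
    bit_errors = sum(bin(gray(int(a)) ^ gray(int(b))).count("1")
                     for a, b in zip(sent, decided))
    bler = block_errors / len(sent)
    ber = bit_errors / (len(sent) * bits_per_msg)
    return bler, ber

sent = [0, 1, 2, 3, 4, 5, 6, 7]
decided = [0, 1, 3, 3, 4, 5, 6, 0]   # two block errors
bler, ber = bler_and_ber(sent, decided, bits_per_msg=3)
```

Since adjacent Gray codewords differ in a single bit, confusing a message with a "neighboring" one costs only one bit error, which is exactly why the Gray mapping is a reasonable ad hoc choice here.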
Note that the structure we propose is only able to compensate for chromatic dispersion within a block of received samples, as there is no connection between neighboring blocks. The effect of dispersion from neighboring blocks is treated as extra noise. The block size (determined by $n$ and $N$) will hence limit the achievable distance of the proposed system. However, we could in principle extend the receiver portion of the ANN to jointly process multiple blocks to dampen the influence of dispersion. This would improve the resilience to chromatic dispersion at the expense of higher computational complexity.
III-D. Training
The goal of the training is to obtain an efficient autoencoder [13, Ch. 14], i.e., the output of the final ANN softmax layer should ideally be identical to the one-hot input vector. Such an autoencoder will minimize the end-to-end BLER. In this work, the ANN is trained with the Adam optimizer [23] on a set of randomly chosen messages (together with the messages of the neighboring transmit blocks) and a fixed mini-batch size, corresponding to 100 000 iterations of the optimization algorithm. It is worth noting that in most cases, convergence in the loss and in the validation symbol error rate of the trained models was obtained after significantly fewer than 100 000 iterations, which we used as a fixed stopping criterion. During training, noise is injected into the channel layers of the ANN, as shown in Fig. 1. A truncated normal distribution with a fixed standard deviation is used for the initialization of the weight matrices, and the bias vectors are initialized with 0. Validation is performed every 5000 iterations of the optimization process on a separate validation set. Good convergence of the validation BLER and the corresponding BER is achieved. The trained model is saved and then loaded separately for testing, which is performed over a set of independently generated random input messages. The BER results from testing are shown in the figures throughout this manuscript. We have confirmed the convergence of the results for other mini-batch sizes as well as for an increased training set.

When designing ANNs, the choice of hyper-parameters such as the number of layers, the number of nodes in a hidden layer, the activation functions, the mini-batch size, the learning rate, etc., is important. The optimization of the hyper-parameters was beyond the scope of our investigation; in this work, they were chosen with the goal of keeping the networks relatively small and hence the training effort manageable. Better results in terms of performance and its trade-off with complexity can be obtained with well-designed sets of hyper-parameters.
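The initialization described above can be sketched as a rejection-sampled truncated normal (the standard deviation of 0.1 and the two-sigma truncation bound are illustrative assumptions of this sketch, not values restated from the setup):

```python
import numpy as np

def truncated_normal(shape, std, rng, bound=2.0):
    # Redraw any sample falling outside [-bound*std, bound*std],
    # mimicking the usual truncated-normal weight initializer.
    w = rng.normal(0.0, std, size=shape)
    limit = bound * std
    while True:
        mask = np.abs(w) > limit
        if not mask.any():
            return w
        w[mask] = rng.normal(0.0, std, size=int(mask.sum()))

rng = np.random.default_rng(3)
W = truncated_normal((64, 48), std=0.1, rng=rng)  # weights start small
b = np.zeros(64)                                  # biases start at 0
```

Starting with small, bounded weights keeps the pre-activations of the ReLU layers in a well-behaved range during the first training iterations.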
IV. System Performance
Table I lists the simulation parameters for the end-to-end deep-learning-based optical fiber system under investigation. We assume a set of $M = 64$ input messages, which are encoded by the neural network at the transmitter into a symbol of 48 samples at 336 GSa/s in the simulation. This rate corresponds to the 84 GSa/s sampling rate of the DAC used in the experiment multiplied by the oversampling factor of 4, which we assume in simulation. The bandwidth of the signal is restricted by a 32 GHz low-pass filter to account for the significantly lower bandwidth of today's hardware. The information rate of the system is thus $\log_2(64) = 6$ bits/symbol. Symbols are effectively transmitted at 7 GSym/s, and the system therefore operates at a bit rate of 42 Gb/s.
Table I: Simulation parameters

Parameter                    Value
M                            64
n                            48
N                            11
Oversampling                 4
Sampling rate                336 GSa/s
Symbol rate                  7 GSym/s
Information rate             6 bits/symbol
LPF bandwidth                32 GHz
DAC/ADC ENOB                 6
Fiber dispersion parameter   17 ps/nm/km
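The rate relations in Table I can be checked in a few lines (a generic helper for this arithmetic, not part of the system itself):

```python
import math

def system_rates(M, n, fs_hz):
    # Symbol rate = sampling rate / samples per symbol;
    # bit rate = symbol rate * log2(M) bits per message.
    sym_rate = fs_hz / n
    bits_per_sym = math.log2(M)
    return sym_rate, bits_per_sym, sym_rate * bits_per_sym

sym_rate, bits_per_sym, bit_rate = system_rates(M=64, n=48, fs_hz=336e9)
# -> 7 GSym/s, 6 bits/symbol, 42 Gb/s, matching Table I
```

The same helper also covers the rate variations examined later in this section, where $M$ and $n$ are changed at a fixed simulation sampling rate.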
Figure 2 shows the BER performance at different transmission distances. For this set of results, the ANN was trained for 7 different distances in the range of 20 to 80 km in steps of 10 km, and the distance was kept constant during training. During the testing phase, the distance was swept. BERs below the 6.7% hard-decision FEC (HD-FEC) threshold are achieved at all examined distances between 20 and 50 km, with even lower BERs obtained up to 40 km. Systems trained at distances longer than 50 km achieve BERs above the HD-FEC threshold. The figure also displays the lower bound on the achievable BER for each distance. This lower bound is obtained by assuming that a block error gives rise to a single bit error. An important observation is that the lowest BERs are obtained at the distances for which the system was trained, and that the BER increases rapidly when the distance changes. Such behavior is a direct consequence of the implemented training approach, which optimizes the system at a particular distance without any incentive for robustness to variations. As the amount of dispersion changes with distance, the optimal neural network parameters differ accordingly, and thus the BER increases as the distance changes. We therefore require a different optimization method that yields ANNs which are robust to distance variations and hence offer new levels of flexibility.
To address these limitations of the training process, we train the ANN in a process where, instead of fixing the distance, the distance for every training message is randomly drawn from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$. During optimization, this allows the deep learning to converge to more generalized ANN parameters that are robust to a certain variation of the dispersion. Figure 3 shows the test BER performance of the system trained at a mean distance of $\mu = 40$ km and different values of the standard deviation. We see that for both examined nonzero values of $\sigma$, this training method allows BER values below the HD-FEC threshold over wider ranges of transmission distances than for $\sigma = 0$. For instance, for the smaller of the two standard deviations, BERs below the threshold are achievable between 30.25 km and 49.5 km, yielding a range of operation of 19.25 km. The distance tolerance is further increased when the larger standard deviation is used for training. In this case, the obtained BERs are higher due to the compromise taken, but still below the HD-FEC threshold for a range of 27.75 km, from 24 km up to 51.75 km. A practical implementation of the proposed fiber-optic system design is expected to greatly benefit from such a training approach, as it introduces both robustness and flexibility of the system to variations in the link distance. As a consequence of generalizing the learning over a varied distance, the minimum achievable BERs are higher compared to the system optimized at a fixed distance presented in Fig. 2, and there exists a trade-off between robustness and performance.

So far we examined an end-to-end deep learning optical fiber system where an input message carrying 6 bits of information ($M = 64$) is encoded into a band-limited symbol of 48 samples ($n = 48$ with an oversampling factor of 4) at 336 GSa/s. Thus, the result is an autoencoder operating at a bit rate of 42 Gb/s. In the following, we examine different rates by varying the sizes of $M$ and $n$, and thus the size of the complete end-to-end neural network. For this investigation, we fixed the sampling rate of the simulation to 336 GSa/s.
In Fig. 4, solid lines show the BER performance of the system at different rates when the number of samples used to encode the input message is decreased; in particular, we use $n = 24$, thus yielding a symbol rate of 14 GSym/s. In this way, bit rates of 42 Gb/s, 56 Gb/s, and 84 Gb/s are achieved for $M = 8$, $M = 16$, and $M = 64$, respectively. We see that the BER at 84 Gb/s rapidly increases with distance, and error rates below the HD-FEC threshold can be achieved only up to 20 km. On the other hand, 42 Gb/s and 56 Gb/s can be transmitted reliably at 30 km. An alternative to decreasing the number of transmitted samples in a block is to increase the information rate of the system by considering input messages with a larger information content. Dashed lines in Fig. 4 show the cases of ($M = 64$, $n = 48$) and ($M = 256$, $n = 48$), corresponding to bit rates of 42 Gb/s and 56 Gb/s. In comparison to the case of $n = 24$, such systems have an extended operational reach below the BER threshold due to the larger block size and the reduced influence of chromatic dispersion. For example, the 56 Gb/s system can achieve a BER below the HD-FEC threshold at 40 km, while for 42 Gb/s this distance is 50 km. Thus, increasing the information rate by assuming a larger $M$ enables an additional reach of 10 km and 20 km at 56 Gb/s and 42 Gb/s, respectively. However, a drawback of such a solution is the larger ANN size, which increases the computational and memory demands as well as the training times. Figure 4 shows that the general approach of viewing the optical fiber communication system as a complete end-to-end neural network can be applied to design systems with different information rates and gives insight into possible implementation approaches.
V. Experimental Validation
To complement the simulation results, we built an optical transmission system to demonstrate and validate experimentally the results obtained for the end-to-end deep learning IM/DD system operating at 42 Gb/s. Moreover, we utilize the proposed training method and train our models at the examined distances of 20, 40, 60, or 80 km with a nonzero standard deviation. Figure 5 illustrates the experimental setup. The SNRs after photodetection assumed in the end-to-end training process during generation of the transmit waveforms are 19.41 dB, 6.83 dB, 5.6 dB, and 3.73 dB at 20, 40, 60, and 80 km, respectively, corresponding to the measured values for the 42 Gbaud PAM2 system, which is described in this section and used for comparison. Since the training for the experiment is performed at distances with a certain standard deviation, linear interpolation is used to find the SNR values at distances different from the above.
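The SNR interpolation for intermediate distances can be sketched with np.interp over the measured points quoted above. The dB values are taken directly from the text; interpolating linearly in the dB domain is an assumption of this sketch:

```python
import numpy as np

# Measured SNRs after photodetection for the 42 Gbaud PAM2 reference
dist_km = np.array([20.0, 40.0, 60.0, 80.0])
snr_db = np.array([19.41, 6.83, 5.6, 3.73])

def snr_at(d_km):
    # Linear interpolation between the measured distance/SNR points
    return np.interp(d_km, dist_km, snr_db)

mid = snr_at(30.0)  # halfway between the 20 km and 40 km measurements
```

Distances outside the measured range would be clamped to the endpoint values by np.interp, so the sketch is only meaningful between 20 km and 80 km.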
The transmit waveforms were obtained by feeding a random message sequence to the transmitter ANN, filtering by an LPF with 32 GHz bandwidth, downsampling, and digital-to-analog conversion (after standard linear finite impulse response (FIR) DAC pre-emphasis). In the experiment, we downsample the resulting filtered concatenated series of symbols by a factor of 4, so that each symbol contains 12 samples. Because of the LPF, there is no loss of information, since the original series of symbols, at 48 samples each and running at 336 GSa/s, can be exactly regenerated from this downsampled series of symbols with 12 samples per symbol at 84 GSa/s. The waveform is then used to modulate an MZM, where the bias point is meticulously adjusted to match the one assumed in simulation. The optical signal at 1550 nm wavelength is propagated over a fixed fiber length of 20, 40, 60, or 80 km and through a tunable dispersion module (TDM), which is deployed to allow sweeping the dispersion around a given value. The received optical waveform is directly detected by a PIN photodiode with TIA, sampled in real time, and stored for subsequent digital signal processing. There is no optical amplification in the testbed. After synchronization and proper scaling and offsetting of the digitized photocurrent, the upsampled received waveforms are fed block-by-block to the receiver ANN. After fine-tuning of the receiver ANN parameters, the BLER and BER of the system are evaluated. In the experiment, blocks are transmitted and received for each dispersion value; this is achieved by transmitting 1000 stored sequences of blocks. To compare our system with conventional IM/DD schemes operating at 42 Gb/s, we perform experiments at the examined distances for two reference systems: the first operating at 42 Gbaud with PAM2 and raised-cosine pulses (roll-off of 0.99), and the second operating at 21 Gbaud with PAM4 and raised-cosine pulses (roll-off of 0.4). Both reference systems use feedforward equalization (FFE) with 13 taps (T/2-spaced) at the receiver.
The computational complexity of this simple linear equalization scheme is clearly lower than that of a deep ANN-based receiver. Nevertheless, we use the comparison to emphasize the viability of implementing the optical fiber system as an end-to-end deep ANN; possible complexity reductions in the design are beyond the scope of this manuscript.
While carrying out the experiment, we found that the ANN trained in simulation was not fully able to compensate the distortions of the experimental setup. Hence, we decided to retrain the receiver ANN (while keeping the transmitter ANN fixed) to account for the experimental setup. Retraining has been carried out for every measured distance. For the retraining of the receiver ANN, we used 75% of all received block traces, while validation during this process was performed with a different 12.5% of the traces (from different measurements). The fine-tuned model is tested over the remaining 12.5% of the traces, which were used neither for training nor for validation. The subdivision of the experimental data into training, validation, and testing sets is in accordance with the guidelines given in [13, Sec. 5.3]. Training was carried out over 4 epochs over the experimental data, which was enough to see good convergence. In a single epoch, each of the received training blocks is used once in the optimization procedure, yielding a single pass of the training set through the algorithm. Running 4 epochs improved convergence and further ensured that enough training iterations were performed to observe convergence (see Sec. III-D). For retraining the receiver ANN, the layer parameters are initialized with the values obtained in simulation prior to the experiment. The output of the receiver ANN is optimized with respect to the labeled experimental transmit messages, following the same procedure as described in Sec. II, again with the same mini-batch size. Experimental BER results are then obtained on the testing set only and are presented in what follows.

Figure 6 shows the experimental results for fiber lengths of 20 km and 40 km. The TDM dispersion value was swept around each nominal distance, resulting in effective link distances around 20 km and 40 km, respectively. For the system around 20 km, BERs well below the HD-FEC threshold have been achieved experimentally at all distances, with the lowest BER obtained at 21.18 km. For comparison, the PAM2 system experimentally achieves a noticeably higher BER at 20 km and is therefore significantly outperformed by the end-to-end deep learning optical system. At 40 km, the proposed system outperforms both the 42 Gbaud PAM2 and the 21 Gbaud PAM4 schemes, as neither of these can achieve BERs below the HD-FEC threshold, whereas the ANN-based system achieved BERs below the threshold at all distances in the examined range, with the lowest BER obtained at 38.82 km. Furthermore, we see that both sets of experimental results, at 20 km and at 40 km, are in excellent agreement with the simulation results.
Figure 7 shows the experimental results at 60 km and 80 km fiber length, with the TDM dispersion swept to yield effective link distances around these two lengths. For both systems, BERs below the HD-FEC threshold cannot be achieved by the end-to-end deep learning approach, as predicted by the simulation. Nevertheless, at 60 km the system still outperforms the PAM2 and PAM4 links. For the 80 km link, however, the thermal noise at the receiver becomes more dominant due to the low signal power levels in the absence of optical amplification. In combination with the accumulated dispersion, whose effects at 80 km extend across multiple blocks and cannot be compensated by the block-by-block processing, this results in operation close to the sensitivity limits of the receiver, which ultimately restricts the achievable BERs.
To further investigate the impact of the received signal power on the performance of the system, we included an erbium-doped fiber amplifier (EDFA) in the deep learning-based testbed for pre-amplification at the receiver. Thereby, the received power is increased from −13 dBm and −17 dBm at 60 km and 80 km, respectively, to −7 dBm. The obtained BERs at these distances are also shown in Fig. 7. We see that by changing the link to include an extra EDFA, the end-to-end deep learning system achieves significantly improved performance. In particular, at 60 km, a BER slightly below the HD-FEC threshold can be achieved. Due to dispersion and the block-based processing, the impact at 80 km is significant as well. These results highlight the great potential for performance improvement by including different link configurations inside the end-to-end learning process.
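As an illustration of the retraining procedure used in this section, the sketch below shows a 75%/12.5%/12.5% split of a hypothetical database of received blocks into training, validation and test sets, followed by the 4-epoch pass over the training set. The array sizes, the mini-batch size and the placement of the gradient step are placeholders, not the values used in the experiment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the database of experimental traces:
# each row is one received block of 48 samples with its transmitted label.
num_blocks = 1000
blocks = rng.normal(size=(num_blocks, 48))
labels = rng.integers(0, 64, size=num_blocks)

# 75% / 12.5% / 12.5% subdivision into training, validation and test sets.
perm = rng.permutation(num_blocks)
n_train = int(0.75 * num_blocks)
n_val = int(0.125 * num_blocks)
train_idx = perm[:n_train]
val_idx = perm[n_train:n_train + n_val]
test_idx = perm[n_train + n_val:]

# 4 retraining epochs: each training block is used exactly once per epoch,
# i.e. one full pass of the training set through the optimizer per epoch.
batch_size = 50  # placeholder mini-batch size
for epoch in range(4):
    order = rng.permutation(train_idx)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # one gradient step on (blocks[batch], labels[batch]) would go here
```

Testing is then carried out exclusively on `test_idx`, which by construction has no overlap with the training or validation subsets.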
VI Discussion
VI-A Transmitted Signal Characteristics
In our end-to-end optimization of the transceiver, the transmitter learns waveform representations that are robust to the optical channel impairments. In the experiment, we apply random sequences to the transmitter ANN, followed by the 32 GHz LPF, to generate the transmit waveforms. We now exemplify the temporal and spectral representations of the transmit signal. Figure 8 (top) shows the filtered output of the neural network, trained at (40,4) km, for a representative 10-symbol message sequence, where each element denotes the input message to the ANN at a given time/block. Each symbol carries 6 bits of information, consists of 48 samples, and is transmitted at 7 GSym/s. We observe that, as an effect of the clipping layer in our transmitter ANN, the waveform amplitude is limited to the linear region of operation of the Mach-Zehnder modulator, with small departures from this range due to filtering effects. Figure 8 (bottom) also shows the unfiltered 48 samples for each symbol in the subsequence. These blocks of samples represent the direct output of the transmitter ANN. The trained transmitter can thus be viewed as a look-up table that simply maps the input message to one of 64 optimized blocks. Figure 9 illustrates the 48 amplitude levels in each of these blocks. Interestingly, we see that the extremal levels are the prevailing ones. It appears that the ANN tries to find a set of binary sequences optimized for end-to-end transmission. However, some intermediate values are also used, and it is not easy to say whether this is intended by the deep learning optimization or an artefact. To bring more clarity, we visualize the learned modulation format using a state-of-the-art dimensionality reduction technique, t-distributed stochastic neighbor embedding (t-SNE) [28]. Figure 10 shows the two-dimensional t-SNE representation of the unfiltered ANN outputs of Fig. 9. We can see that the 64 different waveforms are well-separated in the t-SNE space and can hence be discriminated reliably.
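A t-SNE visualization in the spirit of Fig. 10 can be reproduced in outline as follows. The 64 waveform blocks below are random placeholders for the actual learned transmitter outputs, and scikit-learn's `TSNE` is one possible implementation of the algorithm of [28]:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)

# Hypothetical stand-in for the 64 unfiltered transmitter-ANN output blocks
# (one block of 48 amplitude levels per input message).
waveforms = rng.uniform(0.0, 1.0, size=(64, 48))

# Embed the 64 blocks into two dimensions with t-SNE, as in Fig. 10.
# The perplexity must be smaller than the number of samples (64 here).
embedding = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(waveforms)
```

Each of the 64 rows of `embedding` is then plotted as one point; well-separated clusters indicate that the waveforms can be discriminated.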
Figure 11 shows the spectrum of the real-valued electrical signal at the transmitter. Because of the low-pass filtering, the spectral content is confined within 32 GHz. The LPFs at both transmitter and receiver ensure that the signal bandwidth does not exceed the finite bandwidth of the transmitter and receiver hardware. We can further observe that, as a result of the block-based transmission, the signal spectrum contains strong harmonics at frequencies that are multiples of the symbol rate. After digital-to-analog conversion, modulation of the optical carrier, fiber propagation and direct detection by a PIN+TIA circuit, the samples of the distorted received waveforms are applied block-by-block as inputs to the receiver ANN for equalization.
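The origin of these spectral lines can be illustrated with a short sketch: any deterministic component shared by all transmitted blocks repeats with the block (symbol) period and therefore appears as tones at multiples of the symbol rate. The codebook below is a random placeholder with an artificial common component, not the learned one:

```python
import numpy as np

rng = np.random.default_rng(2)

n_per_block = 48          # samples per transmitted block (one message)
n_blocks = 200            # number of consecutive blocks in the waveform
f_sym = 7e9               # 7 GSym/s symbol (block) rate
fs = n_per_block * f_sym  # resulting sampling rate

# Hypothetical codebook: every block shares a common deterministic shape
# plus a message-dependent perturbation. The shared shape repeats every
# block and therefore produces spectral lines at multiples of f_sym.
n = np.arange(n_per_block)
base = np.cos(2 * np.pi * n / n_per_block)
codebook = base + 0.2 * rng.normal(size=(64, n_per_block))

# Concatenate randomly chosen blocks and look at the magnitude spectrum.
signal = codebook[rng.integers(0, 64, size=n_blocks)].ravel()
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

# The strongest non-DC tone sits at the symbol rate (bin index n_blocks).
peak_bin = int(np.argmax(spectrum[1:])) + 1
```

Here `freqs[peak_bin]` equals the 7 GHz symbol rate, mirroring the harmonics visible in Fig. 11.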
VI-B Comparison with Receiver-Only and Transmitter-Only ANN Processing
In contrast to systems with transmitter-only or receiver-only ANNs, the proposed end-to-end deep learning-based system enables joint optimization of the message-to-waveform mapping and the equalization functions. To highlight the advantages of optimizing the transceiver in a single end-to-end process, we compare, in simulation, our end-to-end design with three different system variations: (i) a system that deploys PAM2/PAM4 modulation and ANN equalization at the receiver; (ii) a system with an ANN-based transmitter and a simple linear classifier at the receiver; and (iii) a system with individually trained ANNs at both transmitter and receiver. In this section, we provide a detailed discussion of the implementation of each of these benchmark systems and relate their performance to the end-to-end deep learning approach. For a fair comparison, all systems have a bit rate of 42 Gb/s, and 6 bits of information are mapped to a block of 48 samples (including oversampling by a factor of 4). All simulation parameters are as in Table I. All hyperparameters of the ANNs, such as hidden layers and activation functions, as well as the other system and training parameters, are identical to those used in the end-to-end learning system in Sec. IV.

VI-B1 PAM Transmitter & ANN-Based Receiver
The PAM2 transmitter directly maps 6 bits into 6 PAM2 symbols. The PAM4 transmitter uses the best (6,3) linear code over GF(4) [29] to map the 6 bits into 6 PAM4 symbols. The symbols are pulse-shaped by a raised-cosine (RC) filter with roll-off 0.25 at 2 samples per symbol. The waveform is further oversampled by a factor of 4 to ensure that a block of 48 samples is transmitted over the channel (as in the reference setup). The first element of the channel is the 32 GHz LPF. The received block of distorted samples is fed to the ANN for equalization. Training of the receiver ANN is performed using the same framework as in Sec. IV by labeling the transmitted PAM sequences. Figure 12 compares the symbol error rate performance of the described PAM2/PAM4 systems and the system trained in an end-to-end manner (curves "TX-PAMx & RX-ANN"). For training distances of 20 km and 40 km, the end-to-end ANN design significantly outperforms its PAM2 and PAM4 counterparts. At distances beyond 40 km, the PAM-based systems with receiver-only ANN cannot achieve acceptable symbol error rates.
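A minimal sketch of the PAM2 reference transmitter described above (6 bits mapped to 6 symbols, RC shaping with roll-off 0.25 at 2 samples per symbol, then 4x oversampling to a 48-sample block) might look as follows. The filter span and the use of simple sample repetition for the final oversampling stage are our own simplifying assumptions:

```python
import numpy as np

def raised_cosine(t, beta):
    """Raised-cosine impulse response (symbol period T = 1); the removable
    singularity at |t| = 1/(2*beta) is handled explicitly."""
    sing = np.isclose(np.abs(t), 1.0 / (2.0 * beta))
    t_safe = np.where(sing, 0.0, t)
    h = np.sinc(t_safe) * np.cos(np.pi * beta * t_safe) / (1.0 - (2.0 * beta * t_safe) ** 2)
    return np.where(sing, (np.pi / 4.0) * np.sinc(1.0 / (2.0 * beta)), h)

def pam2_block(bits, sps=2, oversample=4, beta=0.25, span=8):
    """Map 6 bits to 6 PAM2 symbols, RC-shape at 2 samples per symbol and
    oversample by 4, yielding one 48-sample block."""
    symbols = 2.0 * np.asarray(bits, dtype=float) - 1.0  # {0,1} -> {-1,+1}
    up = np.zeros(len(symbols) * sps)
    up[::sps] = symbols                                  # impulse train
    taps = raised_cosine(np.arange(-span * sps, span * sps + 1) / sps, beta)
    # full convolution, then keep the filter-delay-compensated segment
    shaped = np.convolve(up, taps)[span * sps : span * sps + up.size]
    return np.repeat(shaped, oversample)                 # naive 4x oversampling

block = pam2_block([1, 0, 1, 1, 0, 0])
```

Since the RC pulse has zero crossings at the symbol instants, the shaped waveform reproduces each symbol amplitude at its sampling point.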
VI-B2 ANN-Based Transmitter & Linear Receiver
In order to implement a system where the main ANN processing complexity resides at the transmitter, we employ the same ANN-based transmitter as in Fig. 1. At the receiver, we impose a simple linear classifier, as shown in Fig. 13. This receiver is a linear classifier over the set of messages, a so-called multi-class perceptron, which carries out the operation softmax(Wy + b) on the received block y, with weight matrix W and bias vector b. The decision is made by finding the largest element of the resulting probability vector. The receiver thus employs only a single fully connected layer with softmax activation to transform the block of received samples into a probability vector whose size equals that of the input one-hot vector (see Secs. III-A and III-C). At the transmitter, we use the exact same structure as in our deep ANN-based end-to-end design. Both the transmitter ANN parameters and the receiver parameters W and b are optimized in an end-to-end learning process. Hence, such a system exclusively benefits from the ANN-based pre-distortion of the transmitted waveform and has a low-complexity receiver. Figure 12 also shows the performance of this system trained at distances of 20 km and 40 km ("TX-ANN & RX-linear"). The system trained at 20 km achieves a symbol error rate performance close to our deep learning-based end-to-end design. Moreover, it exhibits slightly better robustness to distance variations. This may be attributed to the absence of a deep ANN at the receiver, whose parameters would otherwise be optimized specifically for the nominal distance during training, hindering the tolerance to distance changes. However, when the training is performed at 40 km, this system exhibits significantly inferior performance compared to the proposed end-to-end deep learning-based design.
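The multi-class perceptron receiver admits a direct sketch: a single affine map followed by a softmax, with the decision taken as the argmax of the resulting probability vector. The weights below are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(3)

M, N = 64, 48  # number of messages, samples per received block

# Hypothetical weights of the single fully connected softmax layer.
W = rng.normal(scale=0.1, size=(M, N))
b = np.zeros(M)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def decide(y):
    """Linear classifier: probability vector over the M messages,
    decision = index of the largest element."""
    p = softmax(W @ y + b)
    return int(np.argmax(p)), p

m_hat, p = decide(rng.normal(size=N))
```

In training, W and b would be optimized jointly with the transmitter ANN parameters, as described above.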
VI-B3 ANN-Based Transmitter & ANN-Based Receiver, Separately Trained
Our final benchmark system deploys deep ANNs at both transmitter and receiver which, in this case, are trained individually, as opposed to performing a joint end-to-end optimization. For this comparison, we fix the receiver ANN, whose parameters were previously optimized for PAM2 transmission, and aim to optimize only the transmitter ANN to match this given receiver in the best possible way. Training is carried out in the same end-to-end manner as detailed in Sec. IV; however, we keep the receiver ANN parameters fixed. Figure 12 shows the symbol error rate performance of such a system ("TX-ANN & PAM2-opt. RX-ANN"). Interestingly, for training at the nominal distance of 20 km, one can clearly observe the benefits of the ANN-based waveform pre-distortion, which significantly lowers the error rate compared to the PAM2 system with receiver-only ANN. For systems trained at 40 km, however, the individually trained transmitter and receiver ANNs cannot outperform our proposed, jointly trained, end-to-end system.
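The training setup with a frozen receiver can be illustrated with a toy numpy example: both "transmitter" and "receiver" are reduced to single linear maps, and gradient steps are applied to the transmitter weights only, while the receiver weights stay fixed. The dimensions, learning rate and squared-error loss are illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(4)

W_tx = rng.normal(scale=0.1, size=(8, 8))  # trainable "transmitter"
W_rx = rng.normal(scale=0.1, size=(8, 8))  # frozen "receiver" (pre-optimized)

x = np.eye(8)       # one-hot input messages
target = np.eye(8)  # ideal one-hot outputs

def loss_fn():
    return float(np.mean((x @ W_tx.T @ W_rx.T - target) ** 2))

loss0 = loss_fn()
lr = 1.0
for _ in range(500):
    err = x @ W_tx.T @ W_rx.T - target
    # gradient with respect to W_tx only; W_rx receives no update
    grad_tx = (err @ W_rx).T @ x / len(x)
    W_tx -= lr * grad_tx

loss = loss_fn()
```

Only `W_tx` changes during the loop, so the transmitter adapts to whatever fixed receiver it is paired with, mirroring the benchmark described above.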
VI-C Further Details on the Experimental Validation
As explained in Sec. V, after propagation of the optimized waveforms during the experiment, the receiver ANN was fine-tuned (retrained) to account for the discrepancies between the channel model used for training and the real experimental setup. Retraining can be carried out in two different ways. In the first approach, denoted "fine-tuning", we initialize the receiver ANN parameters with the values previously obtained in simulation and then carry out retraining using the labeled experimental samples. In the second approach, denoted "randomization", we initialize the receiver ANN parameters with random values sampled from a truncated normal distribution before retraining. Figure 14 shows the experimental BER curves at 20 and 40 km for the two retraining approaches and compares them with the raw experimental results, obtained by applying the initial receiver ANN acquired from the simulation "as is", without any fine-tuning. We observe that accounting for the difference between the real experimental environment and the assumed channel model by retraining improves the performance at both distances. Moreover, as expected, both retraining solutions converge to approximately the same BER values at all examined distances. Although we kept the number of training iterations equal for the two approaches, initializing the ANN parameters with pre-trained values had the advantage of requiring fewer iterations to converge for most of the presented values. It is also worth noting that the BER performance of the system without any retraining is still well below the HD-FEC threshold around 20 km, with the minimum achieved at 20.59 km. More accurate and detailed channel models used during training will likely further reduce this BER.
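The two initialization strategies can be sketched as follows; the layer shape, the standard deviation and the rejection-sampling implementation of the truncated normal are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def truncated_normal(shape, std=0.1):
    """Normal distribution truncated to two standard deviations,
    implemented by redrawing out-of-range samples."""
    out = rng.normal(scale=std, size=shape)
    while True:
        bad = np.abs(out) > 2 * std
        if not bad.any():
            return out
        out[bad] = rng.normal(scale=std, size=int(bad.sum()))

# "Fine-tuning": initialize with the weights learned in simulation.
W_sim = rng.normal(scale=0.1, size=(64, 48))  # placeholder pre-trained weights
W_finetune = W_sim.copy()

# "Randomization": initialize with fresh truncated-normal weights.
W_random = truncated_normal((64, 48))
```

Both initializations are then passed through the identical retraining loop on the labeled experimental samples, differing only in their starting point.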
It is important to point out that for the experimental evaluation of ANN-based transmission schemes, and hence within the framework of our work, the guidelines given in [12] need to be meticulously followed to avoid learning representations of a specific sequence (e.g., a PRBS) used in the experiment and hence biasing the error rates towards overly low values. In our work, during offline training, we continuously generate new random input messages using a random number generator with a long period (e.g., the Mersenne twister). For the experimental validation, we generated a long random sequence (not a PRBS, as suggested in [12]), which is processed by the transmitter ANN to generate a waveform, loaded (after filtering and resampling) into the DAC, and transmitted multiple times to capture different noise realizations. For retraining the receiver ANN, mini-batches are formed by randomly picking received blocks from a subset of the database of experimental traces (combining multiple measurements). To obtain the results presented throughout the manuscript, we use the trained and stored models to perform testing on a disjoint subset of the database of experimental traces, having no overlap with the subset used for training. This procedure ensures that the presented experimental results are obtained with independent data. Finally, note that, due to the long memory of the fiber, it is not possible to capture in the experiment the interference effects of all possible sequences of symbols preceding and following the symbol under consideration. Hence, it is possible that the results after retraining underestimate the true error rate, as the retrained ANN may learn to adapt to the interference pattern of the repeated sequence. The error rates of all such ANN-based (retrained) receivers can therefore be considered a lower bound on the true error rate. Closely studying the effects of retraining based on repeated sequences is part of our ongoing work.
VII Conclusion
For the first time, we studied and experimentally verified the end-to-end deep learning design of optical communication systems. Our work highlights the great potential of ANN-based transceivers for future implementations of IM/DD optical communication systems tailored to the nonlinear properties of such a channel. We experimentally show that by designing the IM/DD system as a complete end-to-end deep neural network, we can transmit 42 Gb/s beyond 40 km with BERs below the 6.7% HD-FEC threshold. The proposed system outperforms IM/DD solutions based on PAM2/PAM4 modulation and conventional receiver equalization for a range of transmission distances. Furthermore, we proposed and demonstrated in simulations a novel training method that yields transceivers robust to distance variations, offering a significant level of flexibility. Our study is a first attempt towards the implementation of end-to-end deep learning for optimizing neural network based optical communication systems. As a proof of concept, we concentrated on IM/DD systems; we emphasize, however, that the method is general and can be extended to other, potentially more complex, models and systems.
Acknowledgments
The authors would like to thank Dr. Jakob Hoydis and Sebastian Cammerer for many inspiring discussions on the application of deep learning for communication systems.
The work of B. Karanov was carried out under the EU Marie Skłodowska-Curie project COIN (676448/H2020-MSCA-ITN-2015). The work of F. Thouin was carried out during an internship at Nokia Bell Labs, supported by the German Academic Exchange Service (DAAD) under a DAAD-RISE Professional scholarship. The work of L. Schmalen was supported by the German Ministry of Education and Research (BMBF) in the scope of the CELTIC+ project SENDATE-TANDEM.
References
 [1] C. Jiang et al., "Machine learning paradigms for next-generation wireless networks," IEEE Wireless Commun., vol. 24, no. 2, pp. 98-105, 2017.
 [2] F. Khan, C. Lu, and A. Lau, "Machine learning methods for optical communication systems," in Advanced Photonics 2017 (IPR, NOMA, Sensors, Networks, SPPCom, PS), OSA Technical Digest (online) (Optical Society of America, 2017), paper SpW2F.3.
 [3] J. Thrane et al., "Machine learning techniques for optical performance monitoring from directly detected PDM-QAM signals," J. Lightwave Technol., vol. 35, no. 4, pp. 868-875, 2017.
 [4] D. Zibar, M. Piels, R. Jones, and C. Schäffer, "Machine learning techniques in optical communication," J. Lightwave Technol., vol. 34, no. 6, pp. 1442-1452, 2016.
 [5] F. Khan, Y. Zhou, A. Lau, and C. Lu, "Modulation format identification in heterogeneous fiber-optic networks using artificial neural networks," Opt. Express, vol. 20, no. 11, pp. 12422-12431, 2012.
 [6] R.-J. Essiambre, G. Kramer, P. Winzer, G. Foschini, and B. Goebel, "Capacity limits of optical fiber networks," J. Lightwave Technol., vol. 28, no. 4, pp. 662-701, 2010.
 [7] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359-366, 1989.
 [8] K. Burse, R. Yadav, and S. Shrivastava, "Channel equalization using neural networks: a review," IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 40, no. 3, pp. 352-357, 2010.
 [9] M. Ibnkahla, "Applications of neural networks to digital communications - a survey," Elsevier Signal Processing, vol. 80, no. 7, pp. 1185-1215, 2000.
 [10] M. Jarajreh et al., "Artificial neural network nonlinear equalizer for coherent optical OFDM," IEEE Photon. Technol. Lett., vol. 27, no. 4, pp. 387-390, 2014.
 [11] E. Giacoumidis et al., "Fiber nonlinearity-induced penalty reduction in CO-OFDM by ANN-based nonlinear equalization," Opt. Lett., vol. 40, no. 21, pp. 5113-5116, 2015.
 [12] T. Eriksson, H. Bülow, and A. Leven, "Applying neural networks in optical communication systems: possible pitfalls," IEEE Photon. Technol. Lett., vol. 29, no. 23, pp. 2091-2094, 2017.
 [13] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
 [14] C. Häger and H. Pfister, "Nonlinear interference mitigation via deep neural networks," in Proc. Optical Fiber Communication Conference (OFC), OSA Technical Digest (online) (Optical Society of America, 2018), paper W3A.4.
 [15] E. Ip and J. Kahn, "Compensation of dispersion and nonlinear impairments using digital backpropagation," J. Lightwave Technol., vol. 26, no. 20, pp. 3416-3425, 2008.
 [16] S. Gaiarin et al., "High speed PAM-8 optical interconnects with digital equalization based on neural network," in Proc. Asia Communications and Photonics Conference (ACP) (Optical Society of America, 2016), paper AS1C1.
 [17] J. Estarán et al., "Artificial neural networks for linear and nonlinear impairment mitigation in high-baudrate IM/DD systems," in Proc. 42nd European Conference on Optical Communication (ECOC) (Institute of Electrical and Electronics Engineers, 2016), pp. 106-108.
 [18] T. O'Shea and J. Hoydis, "An introduction to deep learning for the physical layer," IEEE Trans. Cogn. Commun. Netw., vol. 3, no. 4, pp. 563-575, 2017.
 [19] S. Dörner, S. Cammerer, J. Hoydis, and S. ten Brink, "Deep learning-based communication over the air," IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, 2018.
 [20] M. Eiselt, N. Eiselt, and A. Dochhan, "Direct detection solutions for 100G and beyond," in Proc. Optical Fiber Communication Conference (OFC) (Optical Society of America, 2017), paper Tu3I.3.
 [21] V. Nair and G. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proc. Int. Conf. Mach. Learn. (ICML), ACM, 2010, pp. 807-814.
 [22] D. Rumelhart, G. Hinton, and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
 [23] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
 [24] TensorFlow. Available online: https://www.tensorflow.org/.
 [25] G. Agrawal, Fiber-Optic Communication Systems, 4th ed. John Wiley & Sons, Inc., 2010.
 [26] A. Napoli et al., "Digital predistortion techniques for finite extinction ratio IQ Mach-Zehnder modulators," J. Lightwave Technol., vol. 35, no. 19, 2017.
 [27] C. Pearson, "High-speed, analog-to-digital converter basics," Texas Instruments, Dallas, Texas, App. Rep. SLAA510, Jan. 2011. Available: http://www.ti.com/lit/an/slaa510/slaa510.pdf
 [28] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579-2605, Nov. 2008.
 [29] M. Grassl, "Bounds on the minimum distance of linear codes and quantum codes," available online at http://www.codetables.de. Accessed on 2018-06-11.