I. Introduction
The next generation of wireless mobile networks, 5G, is driven by new emerging use cases such as ultra-reliable low-latency communication (URLLC) [1]. To mention a few URLLC applications, the tactile internet, industrial automation and smart grids contribute to increasing demands on the underlying communication system which have not existed as such before [2]. Depending on the actual application, either very low latency, high reliability, or a combination of both is required. In contrast to LTE, where services were provided in a best-effort manner, 5G networks have to guarantee these requirements. In particular for URLLC, the ITU proposed an end-to-end latency of 1 ms and a packet error rate of $10^{-5}$ [3].

These demanding requirements have triggered discussions in the 3GPP Rel. 16 standardization process on how to fulfill them. Self-contained subframes and grant-free access have been proposed to address these requirements on the air interface side [4]. However, the impact on well-known mechanisms in wireless mobile networks is still unclear. In particular, the HARQ procedure poses a bottleneck for achieving the aforementioned latencies. HARQ is a physical layer mechanism that employs feedback to allow transmitting at a higher target BLER, while achieving robustness of the transmission by providing retransmissions based on the feedback (ACK: acknowledgment / NACK: non-acknowledgment). However, it imposes an additional delay on the transmission, designated as the HARQ round-trip time (RTT). This led to the abandonment of HARQ for the 1 ms end-to-end latency use case of URLLC, at least for the initial URLLC specification in Rel. 15 [5]. This decision implies that the code rate is lowered such that a single-shot transmission, i.e. no retransmissions and no feedback, is possible. On the one hand, this simplifies the system design; on the other hand, it sacrifices the overall spectral efficiency of URLLC transmissions. Hence, reducing the RTT to enable HARQ for URLLC becomes a critical issue.
One possibility to achieve this is to use Early HARQ (EHARQ) schemes [6, 7] where the feedback on the decodability of the received signal is provided ahead of the end of the actual transmission process. The crucial component in this setting is the classification algorithm that provides the feedback, which we aim to optimize using Machine Learning techniques.
Earlier approaches to the feedback prediction problem, with the sole exception of [8], focused exclusively on one-dimensional input features such as BER estimates, combined with hard thresholding as the classification algorithm [6, 7]. In [9], the authors introduced the so-called VNR to exploit the substructures of LDPC codes for prediction. However, only a single feature, i.e. a single decoder iteration, in combination with hard thresholding has been used. We expect improvements in prediction accuracy from extensions in several directions in combination with more complex classification algorithms: (a) the evolution of input features through several decoder iterations, considered for the first time in [8], (b) higher-dimensional intra-message features that in the ideal case leverage knowledge about the underlying block code, and (c) history features that leverage information about the channel state from past transmissions that is available at the receiver.
Here we significantly expand the approach put forward in [8], where first EHARQ results empowered by Machine Learning techniques were discussed. We present an extended theoretical discussion, in particular including the extension to multiple retransmissions and a system model that incorporates scheduling effects, thereby allowing a much more precise evaluation of the performance of EHARQ systems in realistic environments. On the classification side, this is supplemented by extended experiments covering different input features and classification algorithms, such as a newly developed supervised autoencoder, for a larger range of SNR conditions, subcode lengths and different channel models.
The paper is organized as follows: In Sec. II we review the EHARQ feedback process and investigate the role of the classification algorithm in a simple probabilistic model and in a more realistic setting of limited system resources. In Sec. III we discuss Machine Learning approaches for the classification problem, introducing different input features and algorithms. The classification performance as well as the system performance is evaluated in Sec. IV for different signal-to-noise ratios, subcode lengths and channel conditions. We summarize and conclude in Sec. V.
II. Early HARQ Feedback
As discussed in the Introduction, EHARQ approaches aim to reduce the HARQ RTT by providing the feedback on the decodability of the received signal at an earlier stage. This enables the original transmitter to react faster to the current channel situation and to provide additional redundancy at an earlier point. In regular HARQ, the feedback generation is strongly coupled to the decoding process. In particular, the receiver applies the decoder to the whole signal representing the total codeword. An embedded CRC makes it possible to check the integrity of the decoded bit stream. The result of this check is transmitted back as HARQ feedback, either acknowledging correct reception (ACK) or asking for further redundancy (NACK). Providing early feedback (EHARQ) implies decoupling the feedback generation from the decoding process, which introduces a misprediction probability since the actual decoding outcome is not known beforehand. By taking that step, it is possible to use only a portion of the transmission and thus to reduce the time from initial reception to transmitting the feedback. In total, the retransmission is scheduled earlier, hence also reducing the HARQ RTT, see Fig. 1. The time for transmitting the feedback and receiving the retransmission is not affected by this. For LDPC codes, EHARQ can be realized by exploiting the underlying code structure and investigating the feedback prediction problem on the basis of so-called subcodes [9, 10] constructed from the parity-check matrix. We denote a subcode by the fraction of the subcode length to the full code length, with typical values ranging from 1/2 to 5/6, designated as sub-TTI in Fig. 1. Shorter subcode lengths reduce the RTT but at the same time render the prediction problem more complicated.

In this section, we first introduce a simple probabilistic system model in Sec. II-A as an easy tool to evaluate the performance of the EHARQ schemes presented here. However, this model only provides a measure in terms of the final BLER and additionally implies the assumption of infinite resources. Hence, in Sec. II-B, we provide a more realistic system model, together with an analysis of the implications of finite-size systems in Sec. II-C. This model provides a more suitable tool to evaluate the performance in practical systems, such as 5G and LTE. The finite-size system argument establishes an optimal point of operation for the EHARQ schemes that is specific to the available system resources and does not exist in a system with unlimited resources.
II-A Probabilistic Model for Single-Retransmission EHARQ
We analyze single-retransmission EHARQ in a simple probabilistic model. For notational simplicity, we focus on the case of a single retransmission; the corresponding expressions for multiple retransmissions can be found in App. A.
The structure of the probabilistic model for EHARQ is reflected in Fig. 2. After the initial transmission we end up in an error state with probability $p_1$, where we follow the common scheme in imbalanced classification problems of encoding the minority class, i.e. the block error class, as positive. In the negative case the codeword gets decoded correctly irrespective of the feedback sent, and a false positive feedback only implies an unnecessary retransmission, which has no effect on the performance under the infinite-resources assumption. In the error case we send either ACK with probability $\mathrm{FNR}$, which leads to an effective block error, or NACK with probability $1-\mathrm{FNR}$. In the latter case the message gets retransmitted, which leads to an effective block error with probability $p_2$. The value for $p_2$ crucially depends on the design of the feedback system, most notably on the code rate used for the retransmission. However, one has to keep in mind that a decreased block error rate for the retransmission due to a decreased code rate might lead to latency losses due to the necessity of accommodating longer retransmissions. For identical retransmissions using an independent channel realization we would have $p_2 = p_1$, or even $p_2 < p_1$ if the decoder makes use of information from both transmissions, for example using chase combining. For later reference we also define the joint probability $p_{12} = p_1 p_2$. This simple argument leads to an effective block error probability
$p_e^{\mathrm{eff}} = p_1 \left( \widetilde{\mathrm{FNR}} + (1 - \widetilde{\mathrm{FNR}})\, p_2 \right)$   (1)
where we introduced an effective conditional probability $\widetilde{\mathrm{FNR}}$ to incorporate the effects of an imperfect feedback channel. For simplicity we model the latter as a binary symmetric channel with bit flip probability $q$. Using Fig. 3, we then obtain
$\widetilde{\mathrm{FNR}} = (1 - q)\,\mathrm{FNR} + q\,(1 - \mathrm{FNR})$   (2)
Empirically, we can replace $p_1$ and $p_2$ by estimated block error rates and the conditional probability $\mathrm{FNR}$ by the classifier's false negative rate (FNR) as obtained from the confusion matrix. Obviously, the lowest possible effective BLER is achieved for perfect feedback, i.e. $\widetilde{\mathrm{FNR}} = 0$, for which we have $p_e^{\mathrm{eff}} = p_1 p_2$. Eq. 1 only depends on the baseline BLERs $p_1$ and $p_2$ and the classifier's (effective) false negative rate, with leading order contribution given by $p_1(\widetilde{\mathrm{FNR}} + p_2)$. In the limit $\widetilde{\mathrm{FNR}} \ll p_2$ the leading behavior is just $p_1 p_2$ and hence independent of the classification performance.

Considering the question of latency, the simplest approach is to consider the expected number of retransmissions $\mathbb{E}[n_{\mathrm{re}}]$. Therefore we evaluate the probability $p_{\mathrm{re}}$ for a single retransmission. Again using Fig. 2, we obtain
$p_{\mathrm{re}} = (1 - p_1)\,\widetilde{\mathrm{FPR}} + p_1 (1 - \widetilde{\mathrm{FNR}})$   (3)
where we defined, in analogy to Eq. 2, an effective false positive rate (FPR)
$\widetilde{\mathrm{FPR}} = (1 - q)\,\mathrm{FPR} + q\,(1 - \mathrm{FPR})$   (4)
for the conditional probability, which can be identified empirically with the classifier's FPR. The leading order contribution to Eq. 5 is given by $(1 - p_1)\,\widetilde{\mathrm{FPR}} \approx \widetilde{\mathrm{FPR}}$, and the number of expected retransmissions therefore profits from a decreased FPR. For the case of a single retransmission, the expected number of retransmissions coincides with the single-retransmission probability,
$\mathbb{E}[n_{\mathrm{re}}] = p_{\mathrm{re}} = (1 - p_1)\,\widetilde{\mathrm{FPR}} + p_1 (1 - \widetilde{\mathrm{FNR}})\,.$   (5)
These results already hint at the crucial importance of adjusting the classifier’s working point by balancing FNR versus FPR: A reduction of the FNR leads to a smaller effective block error probability, see Eq. 1, but comes along with an increased FPR as the two kinds of classification errors counterbalance each other. This in turn leads to an increase in latency, see Eq. 5. From the present discussion it might seem a reasonable strategy to target an arbitrarily small FNR such that the effective block error probability approaches the theoretical limit. However, this argument only holds for a system with unlimited resources, as will be discussed below.
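The bookkeeping behind these expressions is simple enough to sketch in a few lines of code. The snippet below is an illustrative implementation in our own notation ($p_1$, $p_2$ for the baseline BLERs, $q$ for the feedback bit flip probability); it is a sketch of the single-retransmission model, not part of the evaluated system.

```python
def effective_rate(r, q):
    # Map a raw classifier error rate r (FNR or FPR) through a binary
    # symmetric feedback channel with bit flip probability q.
    return (1.0 - q) * r + q * (1.0 - r)

def effective_bler(p1, p2, fnr, q=0.0):
    # Effective block error probability: an initial block error survives
    # if it is falsely acknowledged, or if it is correctly reported
    # but the retransmission fails as well.
    fnr_eff = effective_rate(fnr, q)
    return p1 * (fnr_eff + (1.0 - fnr_eff) * p2)

def expected_retransmissions(p1, fnr, fpr, q=0.0):
    # Expected number of retransmissions for a single allowed
    # retransmission: correctly reported block errors plus
    # false-positive NACKs for correctly decoded blocks.
    return (p1 * (1.0 - effective_rate(fnr, q))
            + (1.0 - p1) * effective_rate(fpr, q))
```

For instance, with $p_1 = p_2 = 10^{-3}$ and a perfect feedback channel, an FNR of $10^{-2}$ yields an effective BLER of about $1.1 \times 10^{-5}$, illustrating the leading-order role of the term $p_1\,\widetilde{\mathrm{FNR}}$.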
II-B System Model
In order to derive a tool for evaluating the performance of the discussed predictors, in this section we introduce a more sophisticated system model that leans on the structure of today's mobile network technologies. In cellular networks such as LTE and 5G, OFDMA has been established due to its scheduling flexibility. In particular, opportunistic scheduling allows using the best possible channel for a transmission. Here, we assume a simplified OFDMA system with equally sized resources, i.e. frequency resources and a defined duration in time, the so-called TTI, as illustrated in Fig. 4. The HARQ mechanism, regular HARQ as well as EHARQ, requests a retransmission based on the received parts of the transmission; the retransmission is scheduled at the earliest $k$ time slots later.
The main advantage of EHARQ over regular HARQ is the reduced HARQ RTT. Hence, depending on the latency constraint, more HARQ layers might be used to improve the system performance. In this work, we evaluated two different system approaches, long and short TTI lengths. The HARQ timeline is mainly composed of the processing time, which in general scales with the TTI length [11], and the transmission time for the feedback, which does not depend on the TTI length of the transmission. Thus, for long TTI lengths this time can be considered insignificant. However, for short TTI lengths this constant component has to be considered for EHARQ as well as regular HARQ systems. Hence, for long TTI, we assumed $k = 1$ for rate-1/2 EHARQ, which means that the retransmission is received in the next TTI, and $k = 2$ for regular HARQ, so that for regular HARQ one TTI has to be skipped. Analogously, for short TTI, $k = 5$ for rate-5/6 EHARQ and $k = 6$ for regular HARQ. For long and short TTI this allows, depending on the system load, up to two retransmissions in the EHARQ scheme compared to only one in the regular HARQ scheme. Due to the scalability of the TTI length, the absolute value of the TTI duration might be set to an arbitrary value, e.g. 1 ms. Thanks to the aforementioned opportunistic scheduling possibilities of OFDMA, we assume that the retransmission is independent of the previous transmission, i.e. $p_i = p_1 \equiv p$, and the total BLER $p^{n+1}$, where $n$ is the number of retransmissions. Furthermore, an i.i.d. arrival rate for each UE is assumed. Thus, a single UE can only have one new transmission per time slot. For simplicity the following argument is carried out for a perfect feedback channel, i.e. $q = 0$, which is a reasonable assumption considering the results of the previous section implying that the feedback error probability is at most of subleading importance. The system parameters are summarized in Tab. I.
TABLE I: System parameters.

UE packet arrival rate: 0.3 (medium load), 0.36 (high load)
Number of UEs: 20
Number of resources per time slot: 10
Delay constraint: 3 time slots (long TTI), 11 time slots (short TTI)
Long-TTI HARQ RTT: 1 (EHARQ 1/2), 2 (regular HARQ)
Short-TTI HARQ RTT: 5 (EHARQ 5/6), 6 (regular HARQ)
BLER of (re)transmissions: as given in Tab. III
II-C Implications of Finite System Size
In practical systems, there is a trade-off between the FNR and the FPR due to the limited amount of available resources. Whereas a lower FNR decreases the effective BLER, as shown in Sec. II-A, it increases the transmission overhead on the other hand. Depending on the available resources this leads to resource shortage, also causing additional delays since transmissions cannot be scheduled in the designated time slots. This brings us to the notion of the packet failure rate, which represents the probability that a packet is not delivered successfully within a given time constraint. Interestingly, there is an optimal operation point which captures the trade-off such that the packet failure rate is minimized.
Under the assumptions on the system model described in the previous section, the packet failure probability is given as
$P_{\mathrm{fail}} = 1 - \sum_{i=0}^{n_{\max}} P_i^{\mathrm{succ}}$   (6)

$P_i^{\mathrm{succ}} = \bigl[p\,(1 - \mathrm{FNR})\bigr]^{i}\,(1 - p)\,P(t_i \le T)$   (7)
where $n_{\max}$ is the maximum number of allowed retransmissions. The times $t_i$ correspond to the time required for scheduling $i+1$ transmissions. Thus, $P(t_0 \le T)$ is the probability to schedule the initial transmission within the time constraint $T$, $P(t_1 \le T)$ is the probability to schedule the initial transmission and the first retransmission within the time constraint, etc. For simplicity we only condition the scheduling probabilities on the previous transmission. If we set all scheduling probabilities to one, Eq. 6 reduces to Eq. 14 and can therefore be seen as a generalized version of the effective BLER. However, the effective BLER does not consider the finite resources and thus cannot capture the actual performance of the evaluated HARQ schemes in a practical implementation. We will refer to this case as the infinite resource baseline, compared to the finite resource baselines discussed below.
At first glance, Eq. 6 suggests minimizing the FNR. However, a closer examination reveals that the scheduling probabilities carry a dependence on both FNR and FPR via the underlying resource distribution function. FNR and FPR counteract each other in the sense that a decreased FNR will lead to an increased FPR. Considering the dependency on the resource distribution function, an increase of the FPR increases the load on the system and thus lowers the probability that a transmission and its retransmissions are scheduled within the time constraint. This fact is already apparent from the expected number of retransmissions as obtained in Eq. 5, which scales with the FPR at leading order. This suggests that the packet failure probability, seen as a function of the FNR, will show a minimum characterizing an optimal trade-off between FNR and FPR for the given system resources.
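The existence of such an interior optimum can be illustrated with a deliberately artificial toy model: we assume a hypothetical FNR-FPR trade-off curve and a scheduling probability that shrinks linearly with the FPR-induced load. All functional forms and constants below are our own illustrative choices, not the resource distribution of App. C.

```python
import numpy as np

def packet_failure(p, fnr, sched_probs):
    # Success requires that the (i+1)-th transmission is scheduled in time
    # (sched_probs[i]) after i failed and correctly NACK'd attempts, and
    # that it then decodes correctly (probability 1 - p).
    succ = sum((p * (1.0 - fnr)) ** i * (1.0 - p) * s
               for i, s in enumerate(sched_probs))
    return 1.0 - succ

# Hypothetical per-attempt scheduling probability: later attempts and a
# higher FPR-induced load both reduce the chance of fitting into the budget.
def sched_prob(fpr, i, base_load=0.1):
    return max(0.0, 1.0 - (i + 1) * (base_load + fpr))

fnrs = np.linspace(1e-3, 0.9, 400)
fprs = 0.5 * np.exp(-50.0 * fnrs)   # toy trade-off: lower FNR -> higher FPR
pf = [packet_failure(0.3, fnr, [sched_prob(fpr, i) for i in range(3)])
      for fnr, fpr in zip(fnrs, fprs)]
i_best = int(np.argmin(pf))         # interior minimum: optimal working point
```

With all scheduling probabilities set to one, `packet_failure` reduces to the purely probabilistic effective BLER; with the load model switched on, the failure rate exhibits a minimum at an intermediate FNR rather than at FNR = 0.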
In Eq. 6, the scheduling probability strongly depends on the load of the system, since it is mainly a scheduling problem. Based on the resource distribution, which is discussed in App. C, we can formulate the probability of scheduling the initial transmission arriving at time slot $t_a$ and $i$ retransmissions within a time constraint $T$ as follows:
$P(t_i \le T) = \sum_{s_0 = t_a}^{T} P_1(s_0 \mid t_a) \sum_{s_1 = s_0 + k}^{T} P_1(s_1 \mid s_0 + k) \cdots \sum_{s_i = s_{i-1} + k}^{T} P_1(s_i \mid s_{i-1} + k)$   (8)
where $P_1(t \mid t')$ is the probability that a packet that has arrived at $t'$ is scheduled in time slot $t$. Under the assumption that the resource distribution function is not diverging, the initial argument of $P_1$ in Eq. 8 is set to $t_a$. As mentioned before, $P_1$ is the scheduling probability for an additional transmission, assuming that this single transmission does not affect the system statistics. This means that from slot $t'$ until slot $t-1$ the system is fully loaded and the observed transmission is not scheduled (random scheduling); only in slot $t$ is the load lower or the random scheduler picks the observed transmission. Hence, this is expressed by:
$P_1(t \mid t') = \Bigl[\prod_{s = t'}^{t-1} \bigl(1 - p_{\mathrm{pick}}\bigr)\Bigr]\, p_{\mathrm{pick}}, \qquad p_{\mathrm{pick}} = \sum_{m \ge 0} f(m)\, \min\Bigl(1, \frac{R}{m+1}\Bigr)$   (9)
where $f$ is the resource distribution function, which is discussed in more detail in App. C, and $R$ denotes the number of resources per time slot. The scheduling probability is discussed in further detail in App. D.
The derived packet failure probability provides a good tool to evaluate the performance of the predictors in a practical system. Additionally, apart from comparing the different EHARQ schemes among each other, it enables a performance comparison with regular HARQ, which is crucial if EHARQ is considered for URLLC. Here, aside from the system setup presented in the previous section, for regular HARQ the FNR and FPR are assumed to be zero. This is a valid assumption since the included CRC renders false prediction events so rare that they can be neglected.
III. Machine Learning for Early HARQ
The Machine Learning task of predicting the decodability of a message based on information from at most the first few decoder iterations is an inherently imbalanced classification problem. This imbalance is a direct consequence of the base BLERs of the order of $10^{-3}$ that are required in order to be able to reach effective BLERs of the order of $10^{-5}$, see Eq. 1. Different ways of dealing with this imbalance have been explored, see [12] for a review, that can be categorized as cost-sensitive learning, rebalancing techniques and threshold moving. The discussion in this section focuses on the latter, see also [13] and references therein, in the sense of readjusting the decision boundary of any trained model that outputs probabilities for the predicted classes.
By moving the decision boundary one is able to investigate the discriminative power of a given classifier over a whole range of different working points. This is typically analyzed in terms of receiver operating characteristic (ROC) curves or precision-recall (PR) curves. In order to summarize the classifier's performance with a single number, one conventionally resorts to reporting area-under-curve (AUC) metrics. Here we focus on the PR curve and the corresponding area under the PR curve, AUC-PR, rather than the ROC curve, as the former has been shown to better reflect the classifier's performance for highly skewed datasets [14, 15]. However, when summarizing the discriminative power of a classifier using a single figure, one loses fine-grained information about the classification performance at different working points. This is particularly true since the full AUC naturally covers the whole range of values for the decision boundary, many of which are irrelevant for practical applications, where the classification performance in the small-FNR regime is most relevant. In addition, the actual implementation of the classifier requires a definite choice for the decision threshold. Therefore we supplement the global AUC-PR information with an analysis based on FNR-FPR curves. It is worth noting that the FNR-FPR curves directly relate to ROC curves, since the true positive rate (TPR) that is plotted on the ordinate of the ROC curve relates to the FNR via TPR = 1 - FNR. FNR and FPR represent the natural choice in our case since they are the key output figures from the system point of view, see Sec. II-A.

III-A Input Features
We distinguish single-transmission features derived from a single transmission and history information from past transmissions. In principle, all these features can be combined at will to form the set of input features for the classification algorithm.
The raw data for a single transmission provided by the simulation is given by (a posteriori) LLR values after different decoder iterations. EHARQ approaches to reduce the HARQ RTT were first discussed in [6] and [7]. These approaches estimate the BER based on the LLRs and utilize a hard threshold to predict the decodability of the received signal. The LLR gives information on the likelihood of a bit being either $0$ or $1$. Denoting by $\mathbf{y}$ the observed sequence at the receiver, the LLR of the bit $b_i$ is defined as:
$L(b_i) = \ln \frac{P(b_i = 0 \mid \mathbf{y})}{P(b_i = 1 \mid \mathbf{y})}$   (10)
Having the LLRs of a subcode or of the whole codeword allows calculating an estimated BER for the received signal vector:
$\widehat{\mathrm{BER}} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{1 + e^{|L(b_i)|}}$   (11)
where $n$ is the length of the LLR vector. Based on this metric the decoding outcome is predicted, where a higher $\widehat{\mathrm{BER}}$ means a lower probability of successful decoding.
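In code, this estimator is a one-liner; the following sketch (our own helper, numpy-based) assumes the standard mapping from an LLR magnitude to a bit error probability of $1/(1+e^{|L_i|})$:

```python
import numpy as np

def ber_estimate(llr):
    # Estimated BER from a vector of (a posteriori) LLRs: the error
    # probability of bit i given its LLR L_i is 1 / (1 + exp(|L_i|));
    # average this over the whole (sub)codeword.
    llr = np.asarray(llr, dtype=float)
    return float(np.mean(1.0 / (1.0 + np.exp(np.abs(llr)))))
```

Completely uninformative LLRs ($L_i = 0$) give an estimate of 0.5, while large magnitudes of either sign drive the estimate towards zero.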
A further improved approach has been presented in [9] and [10]. The authors propose to exploit the code structure to improve the prediction performance. In the case of LDPC codes, this is realized by constructing so-called subcodes from the parity-check matrix. Applying a belief-propagation based decoder to the LLRs of the subcodeword results in the a posteriori LLRs:
$L^{(j)}(b_i) = L(b_i) + \sum_{c \in \mathcal{C}_i} m^{(j)}_{c \to i}$   (12)
where $\mathcal{C}_i$ is the set of check nodes associated with the variable node of $b_i$ and $m^{(j)}_{c \to i}$ is the check-to-variable node message from check node $c$ to variable node $i$. Here we use the superscript in $L^{(j)}$ to denote the decoder iteration after which the a posteriori LLRs were extracted, with the obvious identification $L^{(0)}(b_i) = L(b_i)$. Again, the a posteriori LLRs are mapped to the same metric for each belief-propagation iteration, designated as VNR:
$\mathrm{VNR}^{(j)} = \frac{1}{n_s} \sum_{i=1}^{n_s} \frac{1}{1 + e^{|L^{(j)}(b_i)|}}$   (13)
where $n_s$ is the length of the subcodeword and $j$ denotes the belief-propagation iteration. Hence, $\mathrm{VNR}^{(0)}$ corresponds to the estimated BER of Eq. 11 evaluated on the subcode. In [9], the authors used a hard threshold applied to the VNR to predict decodability.
In the following we use the abbreviations $\mathrm{VNR}^{(j)}$ and $\mathrm{LLR}^{(j)}$ to denote the VNRs/LLRs extracted from the $j$-th decoder iteration. If $j$ is omitted, we refer to the set of all values from the zeroth to the fifth decoder iteration.
Assuming the receiver operates on the same channel across different transmissions, it might be possible to increase the prediction performance by incorporating information from previous transmissions. This includes all features used as single-transmission features and, in addition, features that only become available after the end of the decoding process. As two representative examples for history features, we investigate VNRs from past transmissions (VNR_HIST) and information about the Euclidean distance between the correct codeword and the final decoder result before the hard decision (EUCD_HIST). Here one has to keep in mind that the latter information is only available if the correct codeword is known to the receiver, for example from a previous pilot transmission; strictly speaking it cannot be reliably obtained from an ordinary previous transmission, as even a correct CRC does not imply a correctly decoded transmission. For a given set of history features we consider means of the history features under consideration extracted from different numbers of past transmissions (1, 2, 5, 9) in order to allow the classifier to extract information from past channel realizations at different time scales.
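One plausible reading of these windowed means is sketched below (helper name and interface are ours): for each window size w, the feature is the mean of the per-transmission quantity over the last w transmissions.

```python
import numpy as np

def history_means(values, windows=(1, 2, 5, 9)):
    # `values` holds one scalar feature per past transmission (e.g. a VNR
    # or a Euclidean distance), ordered oldest -> newest. Windows larger
    # than the available history are skipped.
    v = np.asarray(values, dtype=float)
    return {w: float(v[-w:].mean()) for w in windows if len(v) >= w}
```

The resulting means, one per window size, would then be concatenated to the single-transmission feature vector, giving the classifier access to the channel state at several time scales.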
III-B Classification Algorithms
As discussed in the introduction, we can view the problem either as a heavily imbalanced classification problem or as an anomaly detection problem. Here we briefly discuss suitable algorithms for both approaches. As examples for binary classification algorithms we consider hard threshold (HT) classifiers, logistic regression (LR) (with regularization and balanced class weights) and random forests (RF). HT applied to $\mathrm{VNR}^{(0)}$/$\mathrm{VNR}^{(5)}$ data (referred to as HT0 and HT5 in the following) yields the classifiers used in the literature so far [6, 9]. For anomaly detection [16] one distinguishes unsupervised, semi-supervised and supervised approaches, depending on whether only unlabeled examples, only majority-class examples or labeled examples from both classes are available for training. As anomaly detection algorithms we consider isolation forests (IF) [17] as a classical tree-based semi-supervised anomaly detection algorithm and a supervised autoencoder (SAE) as a novel neural-network based approach for supervised anomaly detection, see App. B for details. We leverage the implementations from scikit-learn [18], apart from the SAE, which was implemented in PyTorch [19].

IV. Results
IV-A Simulation Setup
TABLE II: Simulation parameters.

Transport block size: 360 bits
Channel code: rate-1/5 LDPC BG2 with Z = 36, see [20]
Modulation order and algorithm: QPSK, approximated LLR
Waveform: 3GPP OFDM, 1.4 MHz, normal cyclic prefix
Channel type: 1 Tx, 1 Rx, TDL-C 100 ns, 2.9 GHz, 3.0 km/h (pedestrian) or 100.0 km/h (vehicular)
Equalizer: frequency-domain MMSE
Decoder type: min-sum
Decoding iterations: 50
VNR iterations: 5
We compare the classification performance of different classifiers based on AUC-PR and FNR-FPR curves. As external parameters we vary the SNR between 3.0 and 4.0 dB and the subcode length between 1/2 and 5/6. The simulation setup used to produce training and test data follows the one reported in [9]. We use the raw simulation output as well as a number of derived features. Here we consider both single-transmission features and history features that incorporate information from a number of past transmissions, see Sec. III-A for a detailed discussion. We then investigate the performance of a number of classification algorithms operating on these input features, see Sec. III-B for a detailed breakdown. In all cases we use 1M transmissions with independent channel realizations for training and evaluate on a test set comprising at least 1M transmissions. The size of the test set for each SNR/subcode combination is given in the second column of Tab. III. Hyperparameter tuning is performed once for the pedestrian channel (at SNR 4.0 dB and subcode length 5/6) on an additional validation set also comprising 1M samples. We standard-scale all different sets of input features independently using training set statistics. In this way we obtain a reasonable input normalization that is required for certain classification algorithms while keeping relative differences within each input feature group intact.
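The per-group scaling can be sketched as follows (a minimal numpy version; function and group names are ours). Using a single scalar mean/std per group, fit on the training set only, is what preserves the relative differences between features within a group:

```python
import numpy as np

def group_standardize(train, test, groups):
    # `groups` maps a feature-group name to an array of column indices.
    # One scalar mean/std per group is applied to both splits, so columns
    # within a group keep their relative differences.
    train = np.asarray(train, dtype=float).copy()
    test = np.asarray(test, dtype=float).copy()
    for cols in groups.values():
        mu = train[:, cols].mean()
        sd = train[:, cols].std() + 1e-12   # guard against constant groups
        train[:, cols] = (train[:, cols] - mu) / sd
        test[:, cols] = (test[:, cols] - mu) / sd
    return train, test
```

Fitting the statistics on the training split alone avoids leaking test-set information into the normalization.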
IV-B Classification Performance
TABLE III: Classification performance (AUC-PR) for the different classifiers, together with the training/test set sizes and baseline BLERs.

SNR    | SC  | ch  | #train/#test | BLER     | HT0   | HT5   | LR    | RF    | IF    | SAE
4.0 dB | 5/6 | ped | 1M/3M        | 0.001604 | 0.811 | 0.902 | 0.905 | 0.907 | 0.890 | 0.908
4.0 dB | 1/2 | ped | 1M/4M        | 0.001626 | 0.801 | 0.799 | 0.834 | 0.832 | 0.827 | 0.834
3.5 dB | 5/6 | ped | 1M/1M        | 0.002841 | 0.844 | 0.920 | 0.921 | 0.924 | 0.912 | 0.926
3.5 dB | 1/2 | ped | 1M/4M        | 0.002777 | 0.821 | 0.814 | 0.847 | 0.846 | 0.839 | 0.847
3.0 dB | 5/6 | ped | 1M/1.5M      | 0.004742 | 0.863 | 0.927 | 0.934 | 0.934 | 0.923 | 0.934
3.0 dB | 1/2 | ped | 1M/1.5M      | 0.004742 | 0.851 | 0.840 | 0.872 | 0.871 | 0.865 | 0.874
3.5 dB | 1/2 | veh | 1M/3M        | 0.002866 | 0.824 | 0.818 | 0.851 | 0.850 | 0.846 | 0.851
We start by discussing the classification performance of the different classification algorithms based on VNR features, extending the analysis from [8]. The classification results are compiled in Tab. III. We compare the AUC-PR, which characterizes the overall discriminative power of an algorithm and tends to 1 for a perfectly discriminative classifier. The largest improvements over the simplest thresholding method, HT0, are seen for longer subcode lengths such as 5/6. In these cases more complex classification methods applied to the full VNR range show only small improvements over the HT5 threshold baseline. A different picture emerges at smaller subcode lengths. Here, using VNRs from higher decoder iterations (HT5) does not improve or even worsens the classification performance compared to HT0. In this regime the more complex classification algorithms show their true strengths, with larger improvements compared to HT0/HT5. This is a plausible result, since decreasing the subcode length renders the classification problem more complicated and more complex classifiers can profit more from this complication. If we assess the difficulty of the classification problem based on the scores achieved by the classifiers, a clear picture emerges: as discussed before, decreasing the subcode length for fixed SNR renders the classification problem more difficult, whereas decreasing the SNR for fixed subcode length has the opposite effect, most notably because of an increasing BLER. On the other hand, the BLER sets the baseline for the HARQ performance, see Eq. 1, which overcompensates the positive effects of the improved classification performance. The overall best discriminative power across different SNR values, subcode lengths and channel conditions is shown by the supervised autoencoder, closely followed by regularized logistic regression.
The fact that the AUC-PR results for LR, RF and SAE are so close just reflects a similar overall discriminative power of these algorithms despite fundamentally different underlying principles.
This does, however, not imply coinciding FNR-FPR curves; the classifiers show rather different behavior in certain FNR regions, see Fig. 5 for selected results. Random forests, for example, show in general a very good overall performance but are considerably weaker than other classifiers in the small-FNR regime. When looking at FNR-FPR curves such as the ones presented in Fig. 5, one has to keep in mind that it is very difficult in the extremely imbalanced regime to obtain reliable estimates of the FNR, as both the numerator (false negatives) and the denominator (sum of false negatives and true positives) are small numbers requiring large sample sizes for a stable evaluation. This applies in particular to the region of small FNRs below 0.001.
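An FNR-FPR curve can be obtained directly from the classifier scores by sweeping the decision threshold; a minimal numpy sketch (names are ours; ties between identical scores are not treated specially):

```python
import numpy as np

def fnr_fpr_curve(scores, labels):
    # labels: 1 = block error (positive/minority class), 0 = success.
    # Thresholding at each score in descending order: everything at or
    # above the threshold is predicted as a block error (NACK).
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores)
    labels = labels[order]
    tp = np.cumsum(labels)          # block errors caught at this threshold
    fp = np.cumsum(1 - labels)      # unnecessary NACKs
    P = labels.sum()
    N = len(labels) - P
    return 1.0 - tp / P, fp / N     # (FNR, FPR) per threshold
```

The small-FNR end of this curve corresponds to the most aggressive NACK thresholds, which is exactly the regime where the sample-size caveat above applies.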
To summarize, we clearly demonstrated that incorporating the evolution of the VNR across the first five decoder iterations into more complex classification algorithms such as logistic regression or supervised autoencoders leads to gains in the overall classification performance, in particular in comparison to hard threshold baselines. This conclusion holds for various SNR values, subcode lengths and channel conditions. Implications of these findings for the system performance will be discussed in Sec. IV-C.
We restrict the investigation of history features to the SAE classifier as the best-performing classifier from the previous section. However, we checked that the qualitative conclusions about the importance of history features hold irrespective of the classification algorithm under consideration. In Tab. IV we discuss the impact of history features on the classification performance in addition to the VNR features discussed above.
TABLE IV: AUC-PR for the SAE classifier with additional history features.

features       | 4.0 dB 1/2 ped | 3.5 dB 1/2 ped | 3.5 dB 1/2 veh
VNR            | 0.834          | 0.847          | 0.851
VNR+VNR_HIST   | 0.860          | 0.872          | 0.852
VNR+EUCD_HIST  | 0.883          | 0.892          | 0.861
Irrespective of SNR, subcode length and underlying pedestrian or vehicular channel model, we see an improvement in classification performance upon including history features, with the best results achieved by incorporating Euclidean distance features. History information seems to lead to larger improvements in the pedestrian channel compared to the vehicular channel. This is in line with the channel conditions remaining unchanged for a longer time in the pedestrian compared to the vehicular case.
There are different caveats to this result. First of all, as discussed in Sec. III-A, the Euclidean distance is only known to the receiver if the underlying codeword is known, as would be the case for a previous pilot transmission, which would however lead to latency overheads. Therefore the result including Euclidean history features most likely overestimates the improvements in classification performance that can be obtained from using history features. Secondly, the use of history features is in tension with the assumption of an independent channel realization for the retransmission as used in our system model. It is very unlikely that the improvements in classification performance can compensate the loss of approximately one order of magnitude in the error rate for the retransmission when using the same channel, compared to the baseline BLER of the order of $10^{-3}$ for an independent retransmission. Therefore the system level analysis is carried out using VNR features only. Nevertheless, the results put forward here stress the prospects of further investigations of features that explicitly characterize the channel state, such as explicit channel state information that could have been obtained by a pilot transmission preceding the transmission.
IV-C System Performance
We start by discussing system performance based on the simple probabilistic model for E-HARQ with unlimited system resources as introduced in Sec. II-A. The results are obtained straightforwardly from the FNR-FPR curves presented in Sec. IV-B using Eqs. 1 and 5, or the corresponding generalizations for multiple retransmissions, Eqs. 14 and 21. Here we adopt the same parameters as in Sec. II-B. We present results for two retransmissions, which are possible for E-HARQ in both TTI scenarios discussed in Sec. II-B. In fact, increasing the number of retransmissions beyond two does not lead to further noticeable improvements in the given FNR range. In all cases effective BLERs of the required order are attainable. Decreasing the subcode length from 5/6 to 1/2 while keeping the effective BLER fixed requires, as a definite example, an increase of 40% and 45% in retransmissions at SNR 4 dB and 3 dB respectively. Correspondingly, decreasing the SNR for fixed subcode length from 4 dB to 3 dB while again keeping the effective BLER fixed leads to an overhead of 70% and 77% in retransmissions for subcodes 5/6 and 1/2 respectively. However, as discussed in Sec. II-C, the presented effective BLERs only represent theoretical lower bounds for the packet failure rates achievable in actual systems, as they do not incorporate scheduling effects. In this infinite-system setting there is no distinguished working point for the classifier, and the only way of discriminating between different classifiers is to rank them by the number of expected transmissions at fixed effective error probability.
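To make the mapping from classifier operating point to system quantities concrete, the following sketch evaluates a plausible form of the recursions behind Eqs. 14 and 21 under the independence assumption. The symbols `p` (raw BLER), `fnr`, `fpr` and the exact functional form are our reading of the model, not taken verbatim from the paper.

```python
def effective_bler(p, fnr, n):
    """Effective error probability with at most n retransmissions: a failed
    decoding survives if it is missed (probability fnr) or if the scheduled
    retransmission chain with one round fewer fails as well (independent
    channel realization, same BLER p)."""
    e = p                               # n = 0: raw BLER, no feedback
    for _ in range(n):
        e = p * (fnr + (1.0 - fnr) * e)
    return e

def expected_transmissions(p, fnr, fpr, n):
    """Expected number of transmissions: each attempt triggers another one
    with probability r (true failure caught, or false alarm)."""
    r = p * (1.0 - fnr) + (1.0 - p) * fpr
    return sum(r ** k for k in range(n + 1))
```

Under this reading, lowering the FNR drives the effective BLER down geometrically in the number of retransmissions, while the FPR only enters through the expected-transmission overhead, mirroring the trade-off discussed in the text.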
Fig. 7 shows exemplary results of the packet failure rate over the FNR of the E-HARQ schemes under medium and high system load, together with the regular HARQ baseline and the infinite-system results from Eq. 14. The upper figures, Figs. 7(a) and 7(b), show the long TTI design, as described in Sec. II-B, at 3.5 dB. For the high load (Fig. 7(a)) as well as the medium load (Fig. 7(b)) scenarios, the E-HARQ schemes achieve a superior performance compared to regular HARQ thanks to the additional retransmission that is possible within the same latency constraint. However, a packet failure rate below the target is only achieved in the medium load scenario. Here, we note that the actual performance of the E-HARQ schemes is approximated well by the approach with infinite resources, at least for high packet failure rates. Only in the lower region is an attenuation of the decrease visible, whereas all prediction schemes achieve a comparable performance. In the high load scenario in Fig. 7(a), we see the trade-off behavior discussed in Sec. II-C. The packet failure rate decreases only up to a certain minimum at the optimal FNR-FPR trade-off and starts increasing after passing that point: lowering the FNR further increases the packet failure rate due to the resource shortage. In this region, the actual performance of the prediction schemes becomes critical. Hence, SAE and LR have the lowest optimum. HT0 and HT5 perform worse at their optimal operation points, with HT0 still performing better than HT5.
The resource shortage effect is clearly visible in Fig. 8, where the same load is applied in both scenarios but the latency constraint is relaxed in Fig. 8(b). As is evident in Fig. 8(a), the packet failure rate for all schemes is far from the targeted packet failure rate. With a relaxed latency constraint, as shown in Fig. 8(b), the performance is closer to the target. This improvement is explained by two effects. First, the E-HARQ schemes benefit from the additional retransmission that is possible under the relaxed latency constraint and thus in total still achieve a better performance than regular HARQ. However, the gap is smaller compared to the normal latency constraint. Especially in the high load scenario, the regular HARQ profits from the increased scheduling flexibility, although it can only perform the same number of HARQ retransmissions. The resource shortage effect is also observable for the regular HARQ performance when comparing the medium load and high load scenarios. Notably, the regular HARQ at least achieves a considerably lower packet failure rate in the medium load scenario, whereas it performs even worse in the high load scenario. We can see this even more clearly in the short TTI design in Figs. 7(c) and 7(d). In the medium load scenario in Fig. 7(d), the regular HARQ achieves a packet failure rate that corresponds approximately to the ideal performance of HARQ. In this system setup the regular HARQ makes use of the whole scheduling flexibility and thus, at least for the medium load scenario, the influence of scheduling probabilities can be neglected for the regular HARQ. Despite their limited scheduling flexibility, the E-HARQ schemes achieve a better performance than the regular HARQ. However, this changes in the high load scenario in Fig. 7(c). Here, we observe that the regular HARQ benefits from its scheduling gain and thus achieves the lower packet failure rate.
In the high load scenario, we see that all prediction schemes achieve a similar performance, except HT0, which performs noticeably worse than the others.
As already visible in the previous results, there is no clear winning scheme across all scenarios. However, to compare the overall performance of the schemes, we introduce a total score, where one index enumerates all SNRs and prediction rates and the other enumerates all HARQ schemes. In Tab. V we present the results for all scenarios, where the marked entries indicate that an FNR larger than the optimal FNR has been used for the evaluations. As already notable in Fig. 7, the available data does not allow arbitrarily small FNR and thus the optimal operation point cannot be reached in the medium load case. Hence, we used a fixed FNR for the medium load evaluations, since it provides a sufficiently reliable estimation. The evaluation at fixed FNR underestimates the overall performance compared to regular HARQ but allows a reliable ranking between different classifiers. Obviously, reaching the optimal point of operation would require more data in the medium load case.
Nevertheless, in the medium load regime, LR achieves by far the best overall performance. The other E-HARQ schemes achieve a similar performance, where HT0 is able to achieve a slightly better performance than the other two. Interestingly, SAE performs worse than LR here, although it was the best-performing classifier in the previous section. A closer inspection reveals that for very low FNR the SAE cannot keep up with the other classifiers. That region, which is not relevant for the performance metrics of the previous section, explains the seemingly contradictory results. However, the expected performance of SAE is recovered in the high load regime. Here, SAE and LR are the best-performing E-HARQ schemes, far ahead of HT0, HT5 and regular HARQ. As already noted in Fig. 7, in the high load regime the performance at higher FNR is key. Hence, SAE is again in a well-operating region. In this region, we also note that HT0 performs worst among the classifiers, despite having the second-best performance in the medium load regime.
In summary, E-HARQ is able to achieve large gains in terms of packet failure rate compared to regular HARQ under latency constraints. In particular, LR is a promising approach, which achieves a good overall performance in the high load as well as the medium load regime. The SAE, as the best-performing algorithm in the high load case and the more extensible approach compared to LR, might provide a viable alternative if its performance at very low FNR is improved.
scenario        | regular HARQ | HT0    | HT5    | LR     | SAE

medium load
3.0 dB 1/2 ped  |              |        |        |        |
3.5 dB 1/2 ped  |              |        |        |        |
4.0 dB 1/2 ped  |              |        |        |        |
3.0 dB 5/6 ped  |              |        |        |        |
3.5 dB 5/6 ped  |              |        |        |        |
4.0 dB 5/6 ped  |              |        |        |        |
3.5 dB 1/2 veh  |              |        |        |        |
total score     | 6.2577       | 0.0685 | 0.0936 | 0.0075 | 0.0866

high load
3.0 dB 1/2 ped  |              |        |        |        |
3.5 dB 1/2 ped  |              |        |        |        |
4.0 dB 1/2 ped  |              |        |        |        |
3.0 dB 5/6 ped  |              |        |        |        |
3.5 dB 5/6 ped  |              |        |        |        |
4.0 dB 5/6 ped  |              |        |        |        |
3.5 dB 1/2 veh  |              |        |        |        |
total score     | 3.2918       | 0.4599 | 0.3306 | 0.1713 | 0.1703
V Summary and Conclusions
In this work we investigated Machine Learning techniques for E-HARQ by means of more elaborate classification methods to predict the decoding result ahead of the final decoder iteration. We demonstrated that more complex estimators, such as logistic regression or a supervised autoencoder, that exploit the evolution of the subcodeword during the first few decoder iterations lead to quantitative improvements in prediction performance over baseline results across different SNR and channel conditions. We put forward a simple probabilistic model and a more elaborate system model incorporating scheduling effects to evaluate system performance in a realistic environment. In this way we were able to demonstrate the practical feasibility of reaching effective packet error rates of the order required for URLLC across a range of different SNRs, subcode lengths and system loads. More importantly, we showed that enabling additional HARQ retransmissions by introducing E-HARQ improves the overall reliability over regular HARQ under strict maximum latency constraints.
Further improvements of the classification performance are conceivable by extending the approach presented in this work. Our results suggest that history features incorporating channel information from previous transmissions positively influence the classification performance, but they remain to be investigated in more detail. Similarly, it seems very likely that classification algorithms could profit from intra-message features that go beyond the simple averaging features such as the VNRs considered in this work, and which ideally directly incorporate the code structure of the underlying channel code. However, such features suffer from high dimensionality and large correlations. A challenge remains to identify the most discriminative set of input features and appropriate classification algorithms to further improve the classification performance.
Ultimately, more advanced classification algorithms, which are within reach using the techniques presented in this work, might allow more fine-grained feedback instead of a binary NACK/ACK response. Incorporating this information on the level of the feedback protocol would allow the design of custom feedback schemes with potentially large latency gains.
References
 [1] R. ElHattachi and J. Erfanian, “NGMN 5G White Paper,” tech. rep., Next Generation Mobile Networks (NGMN), 02 2015.
 [2] T. Fehrenbach, R. Datta, B. Göktepe, T. Wirth, and C. Hellge, “URLLC Services in 5G Low Latency Enhancements for LTE,” in 2018 IEEE 88th Vehicular Technology Conference (VTC Spring), August 2018.
 [3] ITU, “IMT Vision – Framework and overall objectives of the future development of IMT for 2020 and beyond ,” tech. rep., ITU, 2015.
 [4] K. Takeda, L. H. Wang, and S. Nagata, “Latency reduction toward 5G,” in 2017 IEEE Wireless Communication, June 2017.
 [5] MCC Support, “Final Report of 3GPP TSG RAN WG1 #92b v1.0.0,” tech. rep., 3GPP, 04 2018.
 [6] G. Berardinelli, S. R. Khosravirad, K. I. Pedersen, F. Frederiksen, and P. Mogensen, “Enabling Early HARQ Feedback in 5G Networks,” in 2016 IEEE 83rd Vehicular Technology Conference (VTC Spring), pp. 1–5, May 2016.
 [7] G. Berardinelli, S. R. Khosravirad, K. I. Pedersen, F. Frederiksen, and P. Mogensen, “On the benefits of early HARQ feedback with nonideal prediction in 5G networks,” in 2016 International Symposium on Wireless Communication Systems (ISWCS), pp. 11–15, Sept 2016.
 [8] N. Strodthoff, B. Göktepe, T. Schierl, W. Samek, and C. Hellge, “Machine Learning for early HARQ Feedback Prediction in 5G,” in 2018 IEEE Global Communications Conference Workshops (GLOBECOM), submitted 2018.
 [9] B. Göktepe, S. Fähse, L. Thiele, T. Schierl, and C. Hellge, “Subcodebased early HARQ for 5G,” in 2018 IEEE International Conference on Communications Workshops (ICC), May 2018.
 [10] Fraunhofer HHI, “Aggressive Early Hybrid ARQ for NR,” TDoc R1-1700647, 3GPP RAN1 NR#1, Spokane (US), Jan 2017.
 [11] S. Nagata, L. H. Wang, and K. Takeda, “Industry perspectives,” IEEE Wireless Communications, vol. 24, pp. 2–4, June 2017.
 [12] P. Branco, L. Torgo, and R. P. Ribeiro, “A Survey of Predictive Modelling under Imbalanced Distributions,” CoRR, vol. abs/1505.01658, 2015.
 [13] G. Collell, D. Prelec, and K. R. Patil, “Reviving ThresholdMoving: a Simple Plugin Bagging Ensemble for Binary and Multiclass Imbalanced Data,” CoRR, vol. abs/1606.08698, 2016.
 [14] B. R. Kiran, D. M. Thomas, and R. Parakkal, “An overview of deep learning based methods for unsupervised and semisupervised anomaly detection in videos,” CoRR, vol. abs/1801.03149, 2018.
 [15] J. Davis and M. Goadrich, “The Relationship Between PrecisionRecall and ROC Curves,” in Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, (New York, NY, USA), pp. 233–240, ACM, 2006.
 [16] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly Detection: A Survey,” ACM Comput. Surv., vol. 41, pp. 15:1–15:58, July 2009.
 [17] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pp. 413–422, IEEE, 2008.
 [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikitlearn: Machine Learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
 [19] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” in NIPSW, 2017.
 [20] MCC Support, “3GPP TS 38.212 v1.0.1,” tech. rep., 3GPP, 09 2017.

 [21] B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen, “Deep Autoencoding Gaussian Mixture Model for Unsupervised Anomaly Detection,” in International Conference on Learning Representations, 2018.
 [22] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, ACM, 2008.
 [23] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
 [24] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in ICML, 2015.
 [25] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” CoRR, vol. abs/1412.6980, 2014.
Appendix A Probabilistic model for multiple-retransmission E-HARQ
In this section, we present the generalization of the results from Sec. II-A. These are obtained straightforwardly using the same formalism as above. The generalization of the effective error probability from Eq. 1 to the case of multiple retransmissions is given by the iterative relation
(14) 
where for :
(15) 
and reduces to Eq. 1 for a single retransmission. For simplicity we can work with independent retransmissions, using a corresponding shorthand notation. Explicit expressions for up to three retransmissions are in this case given by
(16)  
(17)  
(18) 
If we denote the set of binary sequences of length by , the probability for having retransmissions is given by
(19) 
which again reduces to Eq. 3 for a single retransmission. Again we may specialize to independent transmissions, in which case Eq. 19 simplifies to
(20) 
The total number of expected transmissions is then simply given by
(21) 
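As a numerical cross-check of the structure of Eqs. 19-21 in the independent case, the following sketch uses a hypothetical per-attempt retransmission probability `r` (our own parametrization, not the paper's notation); under this reading the retransmission count is a truncated geometric distribution and the expected number of transmissions reduces to a partial geometric sum.

```python
def retransmission_pmf(r, n_max):
    """P(exactly k retransmissions), k = 0..n_max, when each attempt
    independently triggers a follow-up with probability r and at most
    n_max retransmissions are allowed."""
    pmf = [(r ** k) * (1.0 - r) for k in range(n_max)]
    pmf.append(r ** n_max)          # last allowed round: no further follow-up
    return pmf

def expected_total_transmissions(r, n_max):
    # one initial transmission plus the expected number of retransmissions
    pmf = retransmission_pmf(r, n_max)
    return 1.0 + sum(k * p for k, p in enumerate(pmf))
```

The closed form, a sum of powers of r from 0 to n_max, is recovered numerically, consistent with the single-retransmission limit.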
Appendix B Supervised autoencoder for supervised anomaly detection
The supervised autoencoder is a neural-network-based supervised anomaly detection algorithm. It enjoys a number of advantages compared to, for example, shallow neural network classifiers applied directly to the input data, which arise from the fact that the classifier is not applied to the data directly but rather to the bottleneck features of an autoencoder. Therefore it is able to work in heavily imbalanced scenarios such as the one considered in this work and does not suffer from highly correlated input.
For the construction of the SAE we leverage the approach put forward in [21], albeit in a supervised anomaly detection setting. Similar to their work, we use a regular multilayer fully-connected autoencoder with reconstruction loss as a backbone. In addition, we jointly train a fully-connected classifier operating on the bottleneck features using a cross-entropy loss, see Fig. 9. The idea behind the joint training is to allow the autoencoder not only to build a reduced representation but also to build bottleneck features that contain the most discriminative information for the classification task. We also experimented with using features derived from the reconstruction error (measured using cosine distance and reduced Euclidean distance) as additional input to the classifier, as proposed in [21], but found no improvement.
There are multiple ways of preventing overfitting in this setting: early stopping, reducing the bottleneck dimension, implementing the SAE as a denoising autoencoder [22], or regularization using dropout [23]. In our case, dropout regularization both in the classifier and in the autoencoder itself proved most effective.
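As an illustration of the joint objective (a forward-pass sketch, not the training code used in the paper), the following NumPy snippet wires up a toy SAE with the layer sizes quoted below; Batch Normalization and dropout are omitted for brevity, and the input dimension `d_in`, the random weights and the loss weighting `alpha` are placeholders of our own.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

def layer(n_in, n_out):
    # random weights stand in for trained parameters
    return rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out)

d_in = 16                                    # placeholder input dimension
enc = [layer(d_in, 25), layer(25, 10), layer(10, 3)]
dec = [layer(3, 10), layer(10, 25), layer(25, d_in)]
clf = [layer(3, 10), layer(10, 5), layer(5, 2)]

def forward(x):
    z = x
    for W, b in enc:
        z = relu(z @ W + b)                  # bottleneck features
    r = z
    for W, b in dec[:-1]:
        r = relu(r @ W + b)
    W, b = dec[-1]
    r = r @ W + b                            # linear reconstruction head
    c = z
    for W, b in clf[:-1]:
        c = relu(c @ W + b)
    W, b = clf[-1]
    logits = c @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return r, p / p.sum(axis=1, keepdims=True)   # softmax over ACK/NACK

x = rng.normal(size=(8, d_in))
y = rng.integers(0, 2, size=8)
recon, probs = forward(x)
alpha = 1.0                                  # assumed loss weighting
loss = np.mean((recon - x) ** 2) + alpha * np.mean(-np.log(probs[np.arange(8), y]))
```

The key design point is that the classifier sees only the three-dimensional bottleneck, so the combined loss pushes the compressed representation toward discriminative features rather than pure reconstruction.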
The network configuration reads [FC(·,25), FC(25,10), FC(10,3), FC(3,10), FC(10,25), Lin(25,·)] for the autoencoder and [FC(3,10), FC(10,5), Lin(5,2), SM] for the classifier, with FC(x,y) = [Lin(x,y), BN, ReLU, DO] and the given input dimension. Here Lin(x,y) denotes a linear transformation layer, BN a Batch Normalization layer [24], ReLU a ReLU activation layer, DO a dropout layer at a dropout rate fixed via hyperparameter tuning (both 0.2) and SM a softmax activation layer. Optimization is performed using the Adam optimizer [25] at learning rate 0.001. To stabilize training, oversampling the minority-class samples by a factor of 100 turned out to be beneficial.
Appendix C Resource distribution function of a system with finite resources
The resource distribution function describes the probability of having a specific number of resources to be scheduled in a given time slot. With the aforementioned system setup, mainly three components contribute to resource allocations. The first are the packet arrival processes of the individual UEs, which constitute the main component. Additionally, there are the HARQ retransmissions, which depend on the error probability of the underlying channel code for a specific channel. However, to simplify the analysis, a uniform BLER has been assumed for each of the transmissions. The last component is the overload of the previous time slot due to resource shortage, which is transferred to the next time slot. Hence, the resource distribution is described as follows:
(22) 
where the number of resources decomposes into arrivals, HARQ retransmissions and transferred overload, with the three factors denoting the probability of a given number of arrival processes, the probability of a given number of HARQ retransmissions in the time slot, and the probability of a given resource overload transferred from the previous to the current time slot.
The probability of a given number of arrival processes for the UEs is described straightforwardly by a binomial distribution within the admissible range:
(23)
and zero otherwise, where the success probability is the probability of a packet arrival of one UE in one time slot. This modeling implicitly assumes that one UE can have at most one new transmission per time slot.
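A minimal sketch of this arrival model; the symbol names `n_ue` and `p_a` are our placeholders for the user count and per-UE arrival probability.

```python
from math import comb

def arrival_pmf(a, n_ue, p_a):
    """Probability of exactly a new packet arrivals in one time slot, given
    n_ue users that each generate at most one packet with probability p_a."""
    if not 0 <= a <= n_ue:
        return 0.0                       # zero outside the admissible range
    return comb(n_ue, a) * p_a ** a * (1.0 - p_a) ** (n_ue - a)
```
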
Formulating the HARQ retransmission probability is a bit more intricate, since for a limited allowed number of HARQ retransmissions initial packet transmissions have to be distinguished probability-wise from HARQ retransmissions. This would require distinguishing initial transmissions and first, second, up to the last allowed retransmission as separate dependencies, and would require specifying scheduling rules, which would considerably complicate the whole analysis. However, this limitation is overcome by allowing unlimited HARQ retransmissions. This implies that this approach cannot be used to analyze, for example, single-retransmission HARQ, since the HARQ retransmission term assuming an infinite number of retransmissions, as implemented below, would drastically overestimate the system load from HARQ retransmissions, hence punishing the FPR too much. The retransmission probability is then given as:
(24) 
and vanishes otherwise, where the relevant parameters are the number of system resources per time slot, the HARQ RTT and the single-retransmission probability as in Eq. 5. For notational reasons, we chose to use an infinite sum, which can easily be replaced by splitting the sum and expressing the tail in closed form. Still, this way of evaluating the HARQ contributions in the system overestimates the load from retransmissions and therefore underestimates the system performance.
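One plausible reading of this term (a sketch under our own assumptions, not the paper's exact Eq. 24): the transmissions served one RTT earlier each trigger a retransmission independently with the single-retransmission probability, with at most the per-slot resource budget actually served.

```python
from math import comb

def harq_pmf(h, res_pmf_rtt_ago, p_r, n_res):
    """P(h HARQ retransmissions in the current slot). res_pmf_rtt_ago is
    the pmf of the number of resource requests one RTT earlier; at most
    n_res of them were actually served, and each served one independently
    returns as a retransmission with probability p_r (assumed model)."""
    total = 0.0
    for r, pr in enumerate(res_pmf_rtt_ago):
        n = min(r, n_res)                # resource budget caps served packets
        if h <= n:
            total += pr * comb(n, h) * p_r ** h * (1.0 - p_r) ** (n - h)
    return total
```
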
The last component is simply defined by a back reference to the resource distribution function in the previous slot:
(25) 
For the sake of simplicity, we may assume a Markovian dependence. This assumption makes the resource distribution function at a time slot dependent only on the previous time slot and is a valid assumption for the evaluated early HARQ schemes.
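A minimal numerical sketch of propagating the resource distribution from slot to slot (the HARQ-retransmission term is omitted for brevity, and all parameters are illustrative): fresh binomial arrivals plus the overload carried over from the previous slot.

```python
from math import comb

def arrivals(n_ue, p_a):
    # binomial arrival pmf, as in the arrival model above
    return [comb(n_ue, a) * p_a ** a * (1 - p_a) ** (n_ue - a)
            for a in range(n_ue + 1)]

def step(res_pmf, arr_pmf, n_res):
    """One slot of the recursion: overload max(r - n_res, 0) from the
    previous slot plus fresh arrivals."""
    over = {}
    for r, pr in enumerate(res_pmf):
        o = max(r - n_res, 0)
        over[o] = over.get(o, 0.0) + pr
    out = [0.0] * (max(over) + len(arr_pmf))
    for o, po in over.items():
        for a, pa in enumerate(arr_pmf):
            out[o + a] += po * pa
    return out

# iterate from an empty system and watch the distribution settle
arr = arrivals(n_ue=8, p_a=0.3)        # mean load 2.4 < 4 resources: balanced
pmf = [1.0]                            # no pending resources at t = 0
for _ in range(50):
    prev, pmf = pmf, step(pmf, arr, n_res=4)
```

For such a balanced system the distribution converges quickly; pushing the arrival probability toward overload instead makes the mass drift to ever larger resource counts, mirroring the two regimes of Fig. 10.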
The interesting question here is whether the resource distribution converges over time. By simulating the propagation of the distribution over successive time slots, we gain insight into that question, as presented in Fig. 10. As is evident in Fig. 10(a), choosing the parameters such that the system is massively overloaded results in divergence of the resource distribution function. However, in case of a balanced system the resource distribution function shows a strong convergence behavior, as noticeable in Fig. 10(b). From Eq. 22, the conditioned resource distribution function follows as
(26) 
where and for :
(27) 
otherwise .
Appendix D Scheduling probability in a moderately loaded finite system
The scheduling probability, i.e. the probability that an arriving transmission is scheduled after a given number of TTIs, is given as
(28) 
Obviously, the scheduling probability crucially depends on the resource distribution function, i.e. the probability that a given number of resources arrive in a time slot, and on its distribution conditioned on the previous number of resource arrivals. The properties and formulation of this distribution are evaluated in more detail in App. C. However, the exact formulation poses computational problems due to the infinite sums and the exponential growth of the computational effort over time. Hence, we introduce Lemma 1 to simplify the computation of the scheduling probability.
Lemma 1.
For a moderately loaded system with and , the resource distribution function is approximated for sufficiently large time slots by
Proof.
Assuming a converging behavior of the resource distribution function, there exists a time slot together with a lower and an upper bound such that the distribution is confined between them for all subsequent time slots. Additionally, for a non-heavily loaded system, as required for URLLC traffic, we assume a moderate load. Also, the lower bound is assumed to be sufficiently large.
The resource distribution function at time slot is formulated as
(29) 
The sum can be divided into two regions, below and above the bound. Since the distribution is concentrated close to the number of resources of the system, we approximate the conditional function by conditioning on the previous time slot carrying the full number of resources. For a moderately loaded system, this is a valid assumption, since the resource probability distribution function decreases fast beyond the number of system resources. Only for small arguments does the deviation increase. However, the constraint preventing underutilization ensures that the distribution becomes very small in that region anyway. Hence, we approximate the conditional resource distribution probability by
(30) 
∎
Using Lemma 1 for , the scheduling probability is approximated by