Quantifying and Extrapolating Data Needs in Radio Frequency Machine Learning

by   William H. Clark IV, et al.

Understanding the relationship between training data and a model's performance once deployed is a fundamental component in the application of machine learning. While the model's deployed performance depends on numerous variables within the scope of machine learning beyond the training data itself, the effect of the dataset is isolated in this work to better understand the role training data plays in the problem. This work examines a modulation classification problem in the Radio Frequency domain space, attempting to answer the question of how much training data is required to achieve a desired level of performance, but the procedure readily applies to classification problems across modalities. By repurposing the metrics of transfer potential developed within transfer learning, an approach to bound data quantity needs is developed given a training approach and machine learning architecture; this approach is presented as a means to estimate data quantity requirements to achieve a target performance. While this approach requires an initial dataset that is germane to the problem space to act as a target dataset on which metrics are extracted, the goal is to allow the initial data to be orders of magnitude smaller than what is required for delivering a system that achieves the desired performance. An additional benefit of the techniques presented here is that the quality of different datasets can be numerically evaluated and tied together with the quantity of data and the performance of the system.


1 Introduction

Machine Learning (ML) is “the capacity of computers to learn and adapt without following explicit instructions, by using algorithms and statistical models to analyse and infer from patterns in data” [1]. No matter the field, ML begins and ends with the data available during training. Without relevant data to learn from, ML is effectively a “garbage in, garbage out” system [2]. The application of ML to problems within the Radio Frequency (RF) domain is no exception to this rule, yet within the scope of intentional man-made emissions, data is easier to synthesize than within more prolific domains such as image processing [3]. Due to the readily available tools for developing ML-based algorithms (TensorFlow, PyTorch, etc.) and this ease of synthesis for establishing comprehensive datasets, there has been an explosion of published work in the field, which has been further spurred by the availability of open source toolsets for synthesizing RF waveforms such as GNU Radio [6] and Liquid-DSP [7], to name a few. However, going from a purely synthetic environment to a functional application running in the real world raises a number of considerations that must be addressed. The gaps between synthetic data and functional application data are briefly discussed in Section 3.

This paper focuses on providing an applied understanding of working with ML in the RF spectrum, with particular regard to understanding the training data in applications that fall in the domain space of Radio Frequency Machine Learning (RFML). To clarify, RFML is a subset of ML that overlaps communications, radar, or any other application space that utilizes the RF spectrum in a statistically repeatable manner, where ML algorithms are applied as intelligently close to the digitized samples of the RF spectrum, or Physical Layer in the Open Systems Interconnection (OSI) model, as possible [8]. For simplicity, this work focuses on Deep Learning (DL) as the particular subset within ML, because DL systems are well suited to extracting inference from raw data [9]. To that end, two questions important to ML systems are systematically addressed in this work:

  • How much data needs to be used to achieve a given performance level?

  • What data should be used to train the system?

The two most common problems in regard to datasets with DL systems are having a large enough quantity, at a high enough quality, in order to develop a well generalized model [9]. The work of Chen et al [10] describes the concept of quality in three ways: Comprehensiveness, Correctness, and Variety. In this work the primary focus is on understanding the aspect of Variety in terms of the origin of the data (Captured, Synthetic, and Augmented), while the other two aspects are more concerned with the information being both present in the dataset and correctly labeled, which is a given for the datasets used. The origin of the data is discussed in more detail in Section 3.2, while data quantity and quality are discussed in more detail in Section 4.

Fig. 1: (a) A visualization of how the generalized problem space encompasses the application space as well as all possible data collection methods. (b) The process of sampling from a collection method in order to produce a training dataset. (c) The sampling of data from the application space to produce an evaluation dataset for estimating a trained model’s performance if used within the application space. (d) The process of training with a given architecture and training set to produce the parameters that can be used for inference. (e) The process of evaluating a trained model on the evaluation dataset and procuring the inference.

The characteristics inherent to the application of ML are discussed in Section 2, along with the RFML problem of Automatic Modulation Classification (AMC), which is discussed in more detail in Section 2.3. The commonly discussed problems and nuisances inherent to the RF spectrum are covered in Section 3. The effects of data quantity, along with how data quality can be quantified, are examined in Section 4. The problem of estimating the data needs from a minimal set is contrasted with the full availability of data in Section 5, where a combination of two metrics offers a more balanced estimate than either metric alone. Finally, conclusions about how the presented AMC approaches generalize to RFML at large are presented in Section 6.

2 Machine Learning Concepts

ML, in the most general form, is the process of creating a function that maps an observable, in the form of raw data, metadata, and/or extracted features of the data, to some more convenient form for an applied task through observations available during training [9]. Therefore, given a training dataset drawn from an observation space inherent to the generalized problem space, ML works to create such a mapping. The ultimate goal is that the learned mapping is well generalized and can be applied to the general space with the same performance as within the observation space used for training. A visualization of the relationship between the generalized problem space and the observation space from which training data is collected is shown in Fig. 1 (a), with the training dataset shown in Fig. 1 (b).

Fig. 2: The Convolutional Long Short-Term Memory (LSTM) Deep Neural Network (CLDNN) architecture used in this work as the DL approach for the 10-class waveform AMC problem space. The model consists of multiple convolutional layers, each followed by non-linear activation and regularization, whose outputs are combined along the channel dimension before passing through a recurrent LSTM layer and finally through linear layers followed by non-linear activation and regularization to produce the model’s inference.

The mapping is learned through a training procedure, which produces the parameters that define the behavior of the mapping, as visualized in Fig. 1 (d). Given the focus on DL systems, these parameters are a set of weights and biases that are used within the DL architecture. The DL architecture used in this work is shown in Fig. 2; it was shown to be well suited to the AMC problem space in the work of West and O’Shea [11], while the regularization was incorporated from the work of Flowers and Headley for the increased convergence rate [12]. The challenges within AMC are discussed in further detail in Section 3.

A brief overview of concepts utilized in this work with regard to ML are given below. In particular, the concepts of DL and Transfer Learning (TL) are fundamental to the setup and analysis of the crux of this work.

2.1 Deep Learning

Goodfellow et al [9] go into great detail in developing an understanding of the complexities surrounding DL. In brief, DL is a subset of ML that makes use of multiple layers of processing, each of which can in theory apply a simpler computation, and which together allow for solving a complex problem [9]. The DL approach used in this work uses Deep Neural Networks (DNN), which make use of multiple sequential layers consisting of convolutional layers, a recurrent Long Short-Term Memory (LSTM) layer whose hidden size is tied to the dimension of the classification space, and two successive linear layers before producing the model output. After each convolutional and linear layer that is not the final layer of the network, a Rectified Linear Unit (ReLU) non-linear activation is used, followed by a Batch Normalization regularization layer. The final layer is followed by a Softmax non-linear activation to give the output as an estimate of the probability that the current observation is one of the classes specified during training. A visualization of this architecture is shown in Fig. 2 for the 10-class AMC classification problem. For simplicity, the application of the function and weights on an observation is written in shortened form, with an output dimension equal to the number of classes in the classification problem.

2.2 Transfer Learning

TL is the practice of training a model on one dataset/domain (source), or otherwise taking a pre-trained model, and training with a new dataset/domain (target) instead of starting from a random initialization [13]. Depending on assumptions between the source and target, the TL application can be categorized as homogeneous, where differences exist in the distributions between source and target, or heterogeneous, where the differences are in the feature space of the problem [13, 14]. A valuable discussion for understanding the concepts of homogeneous and heterogeneous is provided by Wong and Michaels [15], who explain the change in distributions as a change of the dataset’s collected/generated domain, while the feature space of the problem can be associated with the intended task the source model is trained on and can be contrasted with the task of the target problem. The two most common types of TL are retraining the classification head, where early layers are frozen during training to preserve feature extraction, and fine-tuning of the whole model [16]. In this work, TL is applied in a homogeneous problem space where the underlying distributions on the data vary, but the generalized problem space is the same between datasets, otherwise called a Domain Adaptation in [15] and more specifically an Environment Platform Co-Adaptation. Additionally, wherever retraining is done in this work, the fine-tuning approach is utilized, allowing for adjustments to the feature space, which might not be observable in the source dataset. An important note here: this work does not evaluate any aspect of TL on the problem space, but rather makes use of metrics developed for the purpose of TL. Understanding how TL is best used within RFML is beyond the scope of this work.

The study of TL is complex and well explored [13, 14, 17, 16, 18, 15], and from the effort to understand how to choose an optimal pre-trained model for a desired application, the metrics Negative Conditional Entropy (NCE) [17], Log Expected Empirical Prediction (LEEP) [16], and Logarithm of Maximum Evidence (LogME) [18] are repurposed to analyze the relationship between available data quantity during training and system performance for a given evaluation set.

Symbol Source Description
Capture Consists of only capture examples
Synthetic generation using KDE Consists of simulated examples using the KDE of the capture dataset
Augmentation using KDE Consists of augmented examples from the capture dataset using the KDE
Capture Consists of only capture examples

KDE, kernel density estimate

TABLE I: Description of datasets used within this work.

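In this paper, the evaluation set has the same labels as the training set, but the distributions are not assumed to be equivalent. Therefore the evaluation set, $\mathcal{D}_E = \{(x_i, y_i)\}_{i=1}^{n}$, is drawn from an observation space that is inherent to the generalized problem space, as visualized in Fig. 1 (a) and (c). For clarity going forward, due to the shared labels between source and target in this work, the source labels are found through a forward pass of the evaluation set through the network; the observation’s source label is therefore given by $z_i = \arg\max_{z} f_\theta(x_i)_z$, with the inference given as $f_\theta(x_i)$. The process of extracting the evaluation inference of a trained model is visualized in Fig. 1 (e). By contrast, the target label, $y_i$, is given directly by the truth of the observation and can be one-hot encoded. Given the above notation, NCE is given as

$$\mathrm{NCE} = \sum_{y}\sum_{z} \hat{P}(y, z)\,\log\frac{\hat{P}(y, z)}{\hat{P}(z)},$$

where $\hat{P}(y, z)$ and $\hat{P}(z)$ are the empirical distributions found as

$$\hat{P}(y, z) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\left[y_i = y,\ z_i = z\right], \qquad \hat{P}(z) = \sum_{y}\hat{P}(y, z).$$

The source labels are iterated over with $z$, while $y$ iterates over the target labels. LEEP is given as

$$\mathrm{LEEP} = \frac{1}{n}\sum_{i=1}^{n}\log\left(\sum_{z}\hat{P}(y_i \mid z)\, f_\theta(x_i)_z\right),$$

where the empirical conditional probability, $\hat{P}(y \mid z)$, is given as

$$\hat{P}(y \mid z) = \frac{\sum_{i:\, y_i = y} f_\theta(x_i)_z}{\sum_{i=1}^{n} f_\theta(x_i)_z}.$$

The LEEP score, for the combination of the model and evaluation set, is given as the average log of all probabilities of getting the correct label in the evaluation set given the empirical probability of the labels provided by the model under test [16]. Whereas LogME is given as

$$\mathrm{LogME} = \frac{1}{nK}\sum_{k=1}^{K}\max_{\alpha,\beta}\,\log p\left(y^{(k)} \mid F, \alpha, \beta\right),$$

where $\alpha$ and $\beta$ are iteratively solved to maximize the evidence, $p(y^{(k)} \mid F, \alpha, \beta)$, for a linear transform applied to the model-derived features $F$ of the evaluation set, with $y^{(k)}$ the $k$-th column of the one-hot encoded target labels, which is then averaged over the number of classes, $K$, and normalized by the number of observations, $n$, in the evaluation set [18].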
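As an illustration of how NCE and LEEP can be computed from an evaluation set, the following is a minimal sketch that follows the published definitions in [17] and [16] rather than any code from this work. The function names, the list-based data layout, and the toy labels are assumptions made for the example; `probabilities` is assumed to hold the Softmax output of the trained network for each observation.

```python
import math
from collections import Counter


def nce(target_labels, source_labels):
    """Negative Conditional Entropy, -H(Y|Z), from hard target and source labels."""
    n = len(target_labels)
    joint = Counter(zip(target_labels, source_labels))  # counts of (y, z) pairs
    marginal = Counter(source_labels)                    # counts of z
    # sum over observed (y, z) of P(y, z) * log(P(y, z) / P(z));
    # pairs that never occur contribute nothing to the sum
    return sum((c / n) * math.log(c / marginal[z]) for (_, z), c in joint.items())


def leep(target_labels, probabilities):
    """LEEP from target labels and per-observation class probabilities."""
    n = len(target_labels)
    num_z = len(probabilities[0])
    targets = sorted(set(target_labels))
    # soft joint counts: joint[y][z] = sum of p_i[z] over observations labeled y
    joint = {y: [0.0] * num_z for y in targets}
    for y, p in zip(target_labels, probabilities):
        for z in range(num_z):
            joint[y][z] += p[z]
    column = [sum(joint[y][z] for y in targets) for z in range(num_z)]
    # empirical conditional P(y | z)
    cond = {
        y: [joint[y][z] / column[z] if column[z] > 0 else 0.0 for z in range(num_z)]
        for y in targets
    }
    total = 0.0
    for y, p in zip(target_labels, probabilities):
        total += math.log(sum(cond[y][z] * p[z] for z in range(num_z)))
    return total / n


# Toy two-class example (hypothetical values)
truth = [0, 1, 0, 1]
hard_predictions = [0, 1, 0, 1]
soft_predictions = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]
print("NCE :", nce(truth, hard_predictions))
print("LEEP:", leep(truth, soft_predictions))
```

Both scores are at most zero, with values closer to zero indicating that the network's outputs carry more information about the true labels of the evaluation set.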

In the most general sense, the importance of these metrics lies in how well correlated, either positively or negatively, each metric is with the desired performance of the network after being retrained on the target dataset. Within this work, the reasoning provided by You et al [18] for using Kendall’s τ coefficient [19] is adopted: the most significant relationship between performance and the metric of choice is a shared general monotonicity, which allows a trend in the metric to indicate a trend in performance as well.
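A rank correlation of this kind can be computed with `scipy.stats.weightedtau`, the weighted variant of Kendall's τ in SciPy. The short sketch below is illustrative only: the accuracy and metric values are hypothetical placeholders, one per trained network, and are not results from this work.

```python
from scipy.stats import weightedtau

# Hypothetical values: one entry per trained network
accuracy = [0.42, 0.55, 0.61, 0.70, 0.78, 0.83]
leep_score = [-1.90, -1.55, -1.30, -1.02, -0.80, -0.61]

# weightedtau weights exchanges between high-ranked elements more heavily
tau, _ = weightedtau(accuracy, leep_score)
print(f"weighted Kendall tau = {tau:.3f}")
```

Because only the ordering matters, a metric on a very different scale from accuracy can still score highly as long as it rises and falls with accuracy.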

2.3 Automatic Modulation Classification

The RF problem presented in this work is the classification of the modulation present in the original transmitted waveform. AMC is the problem of identifying how information (or lack thereof) is being applied to a specific time and frequency slice of the overall spectrum. When coupled with the problem of signal detection, that is, whether a waveform is present in a time and frequency slice, the problem space is often referred to as Automatic Modulation Recognition; however, in this work the trained network determines what is there, rather than additionally determining where it is, so AMC is the better category for the task. While AMC is one of the oldest disciplines within RFML, traditional approaches relied on expert analysis and feature extraction [20, 21, 22, 23], though over the last three decades ML has been relied on more heavily for feature fusion and decision making [24] as well as for direct application to raw waveforms [25, 26].

To help understand how data quantity and quality affect the performance of the system, three primary datasets are used for training, and one unique evaluation set is used for evaluating the performance of all trained models. The four datasets are described in Table I. Captured data makes up the first dataset, along with the evaluation set, such that the two sets are disjoint. The second set is a synthetic dataset that generates the waveforms in their pure form as shown in (7) and adds synthetic errors associated with detection algorithms, such as Frequency Offset (FO) and Sample Rate Mismatch (SRM), as well as varying the SNR to represent different received power levels in the dataset. The second and third datasets make use of a Joint Kernel Density Estimate (KDE) on the capture dataset to mimic its distortions of FO, SRM, and SNR, in an attempt to avoid changing the distributions in the data to any extreme. The third dataset, in contrast to the synthetic dataset, applies synthetic permutations to observations from the capture dataset, thereby creating an augmented dataset.
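To make the role of the joint KDE concrete, the sketch below draws new (FO, SRM, SNR) triples from a Gaussian-kernel density estimate built on measured triples. It is a sketch under stated assumptions, not the procedure used in this work: the diagonal bandwidth from Scott's rule, the function name `sample_joint_kde`, and the measurement values are all hypothetical.

```python
import random
import statistics


def sample_joint_kde(observations, num_samples, seed=0):
    """Draw samples from a Gaussian-kernel joint KDE over rows of `observations`.

    Each row is one (FO, SRM, SNR) measurement. Sampling from a Gaussian KDE is
    equivalent to picking a stored row at random and perturbing it with kernel
    noise. Picking whole rows keeps the dependence between the three parameters.
    Assumption: diagonal bandwidth, Scott's rule factor times per-dimension std.
    """
    rng = random.Random(seed)
    n = len(observations)
    dims = len(observations[0])
    scott = n ** (-1.0 / (dims + 4))
    bandwidth = [
        scott * statistics.stdev([row[d] for row in observations]) for d in range(dims)
    ]
    samples = []
    for _ in range(num_samples):
        center = rng.choice(observations)
        samples.append(tuple(rng.gauss(center[d], bandwidth[d]) for d in range(dims)))
    return samples


# Hypothetical measurements: (frequency offset in Hz, sample rate ratio, SNR in dB)
captured_params = [
    (-150.0, 0.998, 4.0),
    (-40.0, 1.001, 9.5),
    (10.0, 1.000, 12.0),
    (75.0, 1.002, 15.5),
    (160.0, 0.999, 18.0),
]
for draw in sample_joint_kde(captured_params, num_samples=3):
    print(draw)
```

Each drawn triple could then parameterize either a freshly simulated waveform (synthetic) or a perturbation of a captured one (augmented).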

Set Waveforms
, QAM16, QAM64
TABLE II: Three waveform sets used in the work.

Additionally, in order to observe greater diversity in the application of dataset quality and quantity, the work shows the classification performance of three classification groups given in Table II.

3 RF Characteristics

In the idealized world where the transceivers are in a physically stationary environment, RF signals can be thought of as processing signals at complex baseband (BB) with a channel between transmitter and receiver. The transmitter’s waveform is then represented as


where the channel introduces a static set of unknowns (gain, delay, and phase shift) as well as a time-varying additive noise to the receiver observation in addition to the transmitter’s modulated baseband signal; however, when the received signal is modeled after perfect synchronization, which eliminates the static unknowns, and a perfect low pass filter, the received signal is given as


The ratio of power in the signal to that of the noise, or the Signal-to-Noise Ratio (SNR), often expressed in dB ($10\log_{10}$ of the power ratio), is then the primary limiting factor explaining the performance of the system, with the additive noise most commonly assumed to be a circularly-symmetric complex Gaussian process.
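The idealized model of a baseband signal plus circularly-symmetric complex Gaussian noise can be sketched as follows. The function names and the use of unit-power QPSK symbols as a stand-in for the transmitted waveform are assumptions made for illustration.

```python
import math
import random


def add_awgn(samples, snr_db, seed=0):
    """Add circularly-symmetric complex Gaussian noise at a target SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(abs(s) ** 2 for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    # Circular symmetry: noise power is split evenly between the I and Q parts
    sigma = math.sqrt(noise_power / 2)
    return [s + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)) for s in samples]


def measure_snr_db(clean, noisy):
    """Estimate the SNR in dB when the clean signal is known."""
    signal_power = sum(abs(s) ** 2 for s in clean) / len(clean)
    noise_power = sum(abs(r - s) ** 2 for s, r in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# Hypothetical unit-power QPSK symbols standing in for the transmitted waveform
symbol_rng = random.Random(1)
qpsk = [
    complex(symbol_rng.choice((-1, 1)), symbol_rng.choice((-1, 1))) / math.sqrt(2)
    for _ in range(20000)
]
received = add_awgn(qpsk, snr_db=10.0)
print(f"measured SNR: {measure_snr_db(qpsk, received):.2f} dB")
```

The measured value fluctuates around the requested SNR because the noise power is estimated from a finite number of samples.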

3.1 Real World Degradations

In practice, the problem becomes vastly more complex as relative motion between transceivers, multiple transceivers, environmental noise, environmental motion, unintended radio emissions from man-made devices, multipath interference, and the imperfect hardware that transmits and receives the waveform are introduced. An introduction to the effects of imperfect hardware is given by Fettweis et al in [27] by looking at the individual degradations the hardware can add to a system as well as some mitigation strategies that can be applied; however, it is worth mentioning that these degradations are compounding and time varying, so while the worst of the effects can be calibrated out, the effects persist and cause a separation from the ideals assumed in (8). For an example of how these degradations affect the ideal, here the frequency independent In-phase and Quadrature Imbalance (IQI) of the transceivers results in carrier modulation functions that are ideally expressed as for the transmitter for a carrier frequency and for the receiver as


where the two imbalance parameters are the magnitude ratio imbalance and the phase difference between the quadrature mixer and the in-phase mixer [27]. The ideal carrier modulators are recovered when the magnitude ratio imbalance is unity and the phase difference is zero. This results in the ideal received signal being changed from the ideal baseband transmitted waveform in (8) into


where the is a function for the addition of IQI when both the transmitter’s and receiver’s parameters are known in the received signal as an additive interference given as


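To illustrate the effect of IQI on baseband samples, the sketch below applies a widely used receiver-only imbalance model, in which the distorted signal is a weighted sum of the signal and its complex conjugate. This is an assumed, simplified form and not necessarily the combined transmitter and receiver expression used in this work or in [27]; the function name and the parameter values are hypothetical.

```python
import cmath
import math


def apply_rx_iq_imbalance(samples, gain, phase_rad):
    """Apply receiver-side IQ imbalance to complex baseband samples.

    Assumed model: the in-phase mixer is ideal, while the quadrature mixer has
    a magnitude ratio `gain` and a phase error `phase_rad`. The output is
    k1 * r + k2 * conj(r), so the conjugate (image) term is the interference.
    """
    k1 = (1 + gain * cmath.exp(-1j * phase_rad)) / 2
    k2 = (1 - gain * cmath.exp(1j * phase_rad)) / 2
    return [k1 * s + k2 * s.conjugate() for s in samples]


# Hypothetical samples and imbalance values
samples = [complex(1, 1), complex(-1, 1), complex(0.5, -0.25)]
ideal = apply_rx_iq_imbalance(samples, gain=1.0, phase_rad=0.0)
distorted = apply_rx_iq_imbalance(samples, gain=1.1, phase_rad=math.radians(3))
print(ideal)
print(distorted)
```

With unity magnitude ratio and zero phase difference the conjugate term vanishes and the samples pass through unchanged, matching the ideal case described above.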
3.2 Understanding RF Data Origin

There are three common sources for data within ML dataset generation [8]. The first is captured or collected data, acquired by using a sensor and recording the data. Intuitively, data collected with this approach in the application space provides the highest quality data to learn from, because all unknown characteristics and sensor degradations will be present in the data [26]. However, when capturing rare events, or while searching for other infrequent and uncontrollable events, the collected data can be difficult to find, let alone properly label. These problems, along with having to procure and sustain the equipment and personnel to perform the collection, often make collection in large quantities impractical and expensive.

Synthetic datasets are therefore the most common and are typically orders of magnitude cheaper to procure, since they are not bound by real-world limitations. For example, synthetic generation can occur in parallel for vastly different conditions, with the limitation being computation resources rather than the sensors and personnel in collection events. The trade-off with synthesis is that significantly more information about the data is necessary in order to simulate it properly, and without the appropriate knowledge the resulting models can be useless in the field [26].

The process of creating an augmented dataset tries to bridge the strengths of captured and synthetic dataset creation while covering their weaknesses [8]. By taking captured data and adding synthetic permutations of SNR, FO, and SRM, the augmented dataset can smooth out missing observations from a limited collection event through a better understanding of the detection characteristics of the sensors in use, while preserving all other real-world degradations native to the application space.
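Two of these permutations, FO and SRM, can be sketched on complex baseband samples as shown below. This is a simplified illustration under stated assumptions: the frequency offset is normalized to the sample rate, the sample rate mismatch is emulated with linear interpolation rather than a proper resampling filter, and the function names and values are hypothetical.

```python
import cmath
import math


def apply_frequency_offset(samples, offset_cycles_per_sample):
    """Rotate each sample by a phase that grows linearly with the sample index."""
    return [
        s * cmath.exp(2j * math.pi * offset_cycles_per_sample * n)
        for n, s in enumerate(samples)
    ]


def apply_sample_rate_mismatch(samples, ratio):
    """Resample by reading the input at positions 0, ratio, 2*ratio, ...

    Assumption: linear interpolation between neighbors, which is crude but
    enough to show the time-axis stretch or compression.
    """
    out = []
    position = 0.0
    last = len(samples) - 1
    while position <= last:
        index = int(position)
        frac = position - index
        nxt = samples[min(index + 1, last)]
        out.append(samples[index] * (1 - frac) + nxt * frac)
        position += ratio
    return out


# Hypothetical captured observation: a short complex tone
capture = [cmath.exp(2j * math.pi * 0.05 * n) for n in range(64)]
augmented = apply_sample_rate_mismatch(apply_frequency_offset(capture, 0.01), 1.002)
print(len(capture), len(augmented))
```

Because the perturbations are applied on top of a captured observation, any hardware and channel effects already present in the capture are carried through to the augmented example.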

Here the different origins of data are kept separated from each other to get a better understanding of the characteristics of each approach, but the fusion of such datasets either by directly combining the datasets, or by performing staged learning should be done in practice.

4 Data Quantity and Quality

The work performed in Clark et al [26] showed that, within the realm of AMC, the quantity of data has a functional relationship to the performance of a trained system given all other variables are constant. Additionally, the work showed that the performance could be found as a log-linear relation to the quantity of data for lower performance regions, but a log-sigmoidal relationship is more appropriate as performance reaches a maximum. The process of regressing the relationship between quantity and performance was then suggested as a quantification measure of dataset quality in [28], where different datasets could be compared across different quantities with the expected accuracy (e.g. one dataset needs a given number of observations, while another dataset needs a different number of observations, to achieve the same accuracy), or other metric of performance, with the required number of observations taken as the quality of each dataset. The inherent quality of any dataset can be described in three generalized terms of Comprehensiveness, Correctness, and Variety [10]. In this work the datasets are already examined and confirmed to be Comprehensive, in that all the information being sought is included within the dataset, and Correct, in that the observations for each modulation are correctly identified and labeled. The main concept of quality being examined is then that of Variety, or rather whether the distributions on the observations within the datasets match, approximate, or deviate from the distributions of the test set, and therefore only the effect of quality in terms of Variety can be examined in this work.

While these works gave an initial understanding of the data quantity and quality that fundamentally drive the process of an ML system, they provide minimal utility when trying to understand how much data is needed in order to achieve ideal performance and therefore reliably plan a data collection campaign. For example, in Fig. 4 of [28] looking at the 10-class classification performance, the log-linear fit predicts a performance of 90% accuracy at roughly an order of magnitude less data than the corresponding log-sigmoidal fit, while both fits use the full range of trials available to regress the fit. The results discussed above all depend on some initial good dataset to contrast with, and while this work does not alleviate that requirement, here the question is answered of how to best use a limited good dataset to forecast how much total data would be needed during training if neither the model, nor training approach is modified.

An ideal approach would be to use a metric that is both strongly correlated with the desired performance of the system, such as accuracy, in terms of Kendall’s τ and has a relationship with data quantity that can be linearly derived from minimal data; however, a metric that reduces the error over that of performance directly regressed with quantity will be sufficient. For this reason, the metrics that have been developed to predict the transferability of a pre-trained model onto a new target dataset, discussed in Section 2.2, are repurposed to predict data quantity requirements and provide a new metric of quality for a model’s training dataset with regard to the target dataset, which is the evaluation dataset in this work as shown in Fig. 1 (c).

4.1 Examining the Correlation between Performance and Metrics

The first step is confirming that the chosen metrics correlate in a beneficial manner with the performance value of interest, classification accuracy in this case. In order to understand whether a metric is well correlated with classification accuracy, the weighted Kendall’s τ is calculated using the SciPy implementation [29] for the three training datasets (Table I) and compared across the three modulation classification sets (Table II). The weighted Kendall’s τ correlations are presented in Table III and show high correlation for all three metrics in the case of the captured and augmented datasets; however, the synthetic dataset shows a worse correlation between accuracy and all three metrics. Looking at the relationships between performance and the proposed metrics in Fig. 3 shows that, for the synthetic dataset, the performance and metrics are tightly clustered, while for the captured and augmented datasets definite trends are observable. Looking at the performance of the different datasets as a function of quantity used during training in Fig. 4 helps to further explain this decrease in correlation, in that the performance of networks trained on the synthetic dataset is comparatively independent of the quantity of data used. Therefore, the classification accuracy and metrics extracted from the networks trained on the synthetic dataset are more akin to noisy point measurements than a discernable trend to examine.

NCE     LEEP    LogME
0.9774  0.9533  0.8033
0.8249  0.8144  0.7382
0.9666  0.9639  0.9377
0.9438  0.9443  0.8788
0.6554  0.6553  0.6334
0.9794  0.9791  0.9582
0.9794  0.9688  0.9609
0.5165  0.4262  0.5298
0.9836  0.9808  0.9764
TABLE III: Kendall’s weighted correlation across datasets and waveform sets between Accuracy and each metric (NCE, LEEP, LogME). Strong correlations will have an absolute value near 1, while no discernable correlation will be around 0. Bold values represent the combination of problem set and metric with the highest correlation with accuracy on the evaluation set.

Fig. 3: Visualization of the relationships between the three metrics (Left column: NCE, Middle column: LEEP, Right column: LogME) and the performance metric (Accuracy) of each network when measured on the results of the evaluation set, or the target dataset in TL vernacular. The datasets used for training are positioned along the rows. Linear trends are shown between the metrics and accuracy for better clarity in the relationships.

Fig. 4: Plots show the relationship between the quantity of data used from each dataset and the Accuracy achieved by networks trained on that amount of data. In general, the networks trained on the captured and augmented datasets have an increasing relation, but networks trained on the synthetic dataset have a stagnant relation between performance and the quantity of data used to train.

The main observation is that when there is a discernable trend between performance and data quantity, the correlations of all three metrics are considerably high, and the metrics are therefore candidates with which to regress the relationship with data quantity in search of an estimator for the total data needed to achieve a desired performance.

4.2 Regression of Quantity and Metrics

With the confidence that the TL metrics discussed above have a positive and significant correlation with the performance of the system when performance increases with regard to the quantity of data used during training, the goal is to now derive the relationship between those metrics and data quantity, with preference being given to the metric that has a better goodness-of-fit (GoF) with a form of linear regression. In this case, a log-linear regression is used between the metrics and the data quantity. Starting with the accuracy of each network as shown in Fig. 4, the log-linear fit is able to provide a quality value in terms of the accuracy achievable for a given number of observations per class (OPC) for the three datasets. Looking at the problem set shows the quality quantification as

  • accuracy M OPC

  • accuracy M OPC

  • accuracy M OPC,

but the quality can just as easily be defined as the OPC needed in order to achieve a given accuracy given the linear fit can be inverted as

  • M OPC accuracy

  • OPC accuracy

  • M OPC accuracy.
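The log-linear fit and its inversion can be sketched with an ordinary least-squares regression of accuracy against the logarithm of OPC. The base-10 logarithm, the function names, and the data points below are assumptions chosen for illustration and are not values from this work.

```python
import math


def fit_log_linear(quantities, values):
    """Least-squares fit of value = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(quantity, slope, intercept):
    """Value expected for a given quantity under the fit."""
    return slope * math.log10(quantity) + intercept


def quantity_for(target_value, slope, intercept):
    """Invert the fit: quantity expected to be needed for a target value."""
    return 10 ** ((target_value - intercept) / slope)


# Hypothetical observations per class and the accuracy reached by each network
opc = [100, 1000, 10000, 100000]
accuracy = [0.40, 0.55, 0.70, 0.85]

slope, intercept = fit_log_linear(opc, accuracy)
print(f"accuracy at 50k OPC: {predict(50000, slope, intercept):.3f}")
print(f"OPC for 90% accuracy: {quantity_for(0.90, slope, intercept):,.0f}")
```

The same two functions apply unchanged when accuracy is swapped for NCE, LEEP, or LogME, since only the regressed quantity changes.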

However, the log-linear regression between data quantity and accuracy has an undesired effect between the data points and the linear fit, which is that at the ends of the available data there is increased error relative to the center of the data points. Additionally, because the sign of error is the same at both ends, this suggests that the linear fit between data quantity and accuracy when there is minimal data will severely underestimate the data quantity needed to achieve high performance systems. For a better look at this issue, Fig. 5 examines the residuals for the waveform set across the three dataset types.

Fig. 5: Plots show the residuals between the accuracy predicted by the regressed log-linear fit of the quantity of data available during training and the observed accuracy of each trained network. The plots show similar trends across the datasets used, with the edges of the available data deviating in the same direction, indicating that a log-linear fit is not the ideal relationship between data quantity and accuracy.

This same sign of error at the ends of the available data suggests that a non-log-linear fit would be more appropriate for regressing the relationship between accuracy and data quantity; however, such a fit is poorly suited to understanding the full relationship as the available data becomes limited to a narrow subset of the full data range. For example, looking at only a narrow portion of either end, high or low data quantity, does not provide enough context to predict a good non-linear fit. Therefore, a GoF measure that weights the outer errors more significantly than the errors toward the center of the data range is desired. Additionally, since both edges of the residual are of equal significance and the results are non-uniformly sampled across the observation space, a weighting that balances the weights into histogram bins will be used to give the edges equal significance in the GoF measure. For simplicity, three bins will be used, indicating lower, mid, and high data quantity observations relative to the log-linear fit. The weights are suggested as


where the middle bin has one third the weight of the edge bins; without this factor, all three regions would be equally weighted. The division edges between bins are evenly spaced on a log scale between the minimum and maximum data quantities in the set, and the weights are defined in terms of the number of elements in the set and the number of observations within each bin. The weights are then normalized such that their sum is unity. The GoF is then taken as the Normalized Root Weighted Mean Squared Error (NRWMSE)


where the quantities and accuracies are used to derive the log-linear fit; however, the accuracies and fit can be swapped out for any other metric and its matching fit.
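The binned weighting and the NRWMSE can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `binned_weights` and `nrwmse` are hypothetical, the fit is an ordinary least-squares line from `numpy.polyfit`, and normalizing by the standard deviation of the observed values is an assumption because the normalizer is not restated in the text.

```python
import numpy as np

def binned_weights(quantity, n_bins=3, mid_scale=1.0 / 3.0):
    """Per-observation weights balanced across log-spaced quantity bins."""
    log_q = np.log10(np.asarray(quantity, dtype=float))
    # Bin edges evenly spaced on the log scale between min and max quantity.
    edges = np.linspace(log_q.min(), log_q.max(), n_bins + 1)
    # Use interior edges only so the maximum lands in the last bin.
    bin_idx = np.digitize(log_q, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins)
    # Split each bin's share evenly over the observations it contains.
    weights = log_q.size / counts[bin_idx]
    # Interior bins get one third of the edge-bin weight.
    interior = (bin_idx > 0) & (bin_idx < n_bins - 1)
    weights[interior] *= mid_scale
    # Normalize so the weights sum to unity.
    return weights / weights.sum()

def nrwmse(quantity, observed, n_bins=3):
    """Weighted RMSE of a log-linear fit, normalized by std (assumption)."""
    log_q = np.log10(np.asarray(quantity, dtype=float))
    observed = np.asarray(observed, dtype=float)
    slope, intercept = np.polyfit(log_q, observed, 1)
    fit = slope * log_q + intercept
    w = binned_weights(quantity, n_bins)
    rwmse = np.sqrt(np.sum(w * (observed - fit) ** 2))
    return rwmse / np.std(observed)

# Hypothetical usage: an exactly log-linear metric gives a GoF near 0.
q = np.array([1e1, 1e2, 1e3, 1e4, 1e5, 1e6])
print(binned_weights(q))
print(nrwmse(q, 0.1 * np.log10(q) + 0.2))
```

Passing a metric such as NCE, LEEP, or LogME in place of accuracy as `observed` gives the corresponding GoF for that metric's regression.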

The GoF for the accuracy, NCE, LEEP, and LogME metrics is given in Table IV. A general conclusion is that all three metrics have the potential to provide a better prediction of the data quantity needed to achieve high performance; however, considering the correlation presented in Table III in addition to these results suggests that NCE will be the most consistent estimate, with LEEP a close second. LogME, by comparison, offers the most promise with regard to the augmented dataset, but has the highest variability among the three metrics examined here. One more distinguishing attribute of the linear regressions is that accuracy, NCE, and LEEP all have residuals typically indicating that the true quantity of data needed will be underestimated, while LogME’s residuals are inverted, suggesting that LogME’s regression will overestimate the amount of data. This gives soft bounds on the required quantity of data, which lies between the NCE and LogME predictions.

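| Waveform set | Dataset | Accuracy | NCE | LEEP | LogME |
|---|---|---|---|---|---|
| First | Capture | 0.2478 | 0.2054 | **0.1885** | 0.2662 |
| First | Synthetic | 1.0196 | **0.9515** | 0.9521 | 0.9531 |
| First | Augmented | 0.3120 | 0.2674 | 0.2636 | **0.1672** |
| Second | Capture | 0.2499 | 0.1552 | **0.1458** | 0.1987 |
| Second | Synthetic | 0.9491 | **0.9367** | 0.9451 | 0.9652 |
| Second | Augmented | 0.3016 | 0.2163 | 0.2208 | **0.1433** |
| Third | Capture | 0.1514 | **0.1138** | 0.1173 | 0.1179 |
| Third | Synthetic | 0.9853 | 0.9783 | 0.9731 | **0.9697** |
| Third | Augmented | 0.2706 | 0.2652 | 0.2797 | **0.1102** |

TABLE IV: Goodness-of-fit (GoF) for a log-linear regression between dataset quantity available for training across datasets and waveform sets for Accuracy and the metrics (NCE, LEEP, LogME). A perfect fit would have a value of 0. Bold values represent the best GoF value for the log-linear regression between the metric and data quantity available during training.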

4.3 Predicting The Data Quantity Needed

Now that the metrics have been compared in terms of a regressed log-linear fit with the quantity of data used to train the model, the question is how to determine what value of each metric will provide a desired performance. Looking back at Fig. 3 shows that the metrics and accuracy do not have an easily fit relationship that would map a metric back to accuracy; attempting such a mapping would only trade one non-linear regression for another. To overcome this problem, label whitening to acquire near perfect performance is proposed to act as a quasar, a fixed reference point, that can help map the performance of the metrics to accuracy.

The procedure starts with label smoothing (14) of the truth labels for the evaluation set, followed by a logit transform (15). Without the label smoothing, the logit transform would not be a useful approach, as it would return infinity for the correct class and negative infinity for all other classes.


Label smoothing applied on its own does not affect the value of accuracy, NCE, or LogME, but it does affect the LEEP score, and the effect depends on the smoothing factor and the number of classes in the classification problem. Because the smoothing factor can significantly affect the LEEP metric, it is chosen to be the minimum value that the approach of

within the chosen machine precision. The combined effect of label smoothing and the logit transform is that the values now sit at finite coordinates, to which noise can be added to stochastically decrease the accuracy of the system in a controlled manner. The normal distribution is used to whiten the logits in this case, where the standard deviation of the noise can be chosen for a given degradation of accuracy of the true labels, given the number of classes in the problem space and the label smoothing in use.


With the whitened logits, the inverse logit, or logistic, transform is applied and balanced such that the sum of any result is unity.
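The whitening steps described above (smooth, logit, add normal noise, logistic, renormalize) can be sketched as follows. This is a sketch under assumptions: the function name `whiten_labels` is hypothetical, the common smoothing form (1 - alpha) * y + alpha / C is assumed to be what (14) denotes, and the values of `alpha` and `sigma` in the usage lines are illustrative only, since the selection rules for both are not reproduced here.

```python
import numpy as np

def whiten_labels(labels, n_classes, alpha, sigma, rng):
    """Turn integer truth labels into noisy pseudo-predictions."""
    one_hot = np.eye(n_classes)[labels]
    # Label smoothing (assumed standard form) keeps the logits finite.
    smoothed = one_hot * (1.0 - alpha) + alpha / n_classes
    # Logit transform: log(p / (1 - p)).
    logits = np.log(smoothed) - np.log1p(-smoothed)
    # Whiten the logits with zero-mean normal noise.
    noisy = logits + rng.normal(0.0, sigma, size=logits.shape)
    # Inverse logit (logistic), then balance each row to sum to unity.
    probs = 1.0 / (1.0 + np.exp(-noisy))
    return probs / probs.sum(axis=1, keepdims=True)

# Hypothetical usage with illustrative parameter values.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=1000)
probs = whiten_labels(labels, n_classes=3, alpha=1e-5, sigma=5.0, rng=rng)
accuracy = np.mean(probs.argmax(axis=1) == labels)
print("accuracy after whitening:", accuracy)
```

With `sigma` set to zero the pseudo-predictions reproduce the truth labels exactly, and increasing `sigma` lowers the accuracy; the resulting probabilities can then be scored with accuracy, NCE, LEEP, or LogME.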


Fig. 6 shows the effects of this procedure on the error and metrics for a given averaged over 1000 iterations, and shows a trend that can be maintained with increasing ; however, an important note is that this type of error does not properly reflect the distributions of error that can be expected, so smaller values () of will likely be more appropriate than larger values (). Looking at the residual error, (19), in terms of the dependent variable, , relative to the measured value, , as seen in the top left plot of Fig. 6, the minimum average error across the three classes is achieved at , with the average normalized residuals being nearly equal at the extremes ().

Fig. 6: The change in the value of {Top Left: Accuracy (1-Accuracy, logscale); Top Right: NCE (logscale); Bottom Left: LEEP (logscale); Bottom Right: LogME (linear)} as a function of the induced error, , expected in accuracy from whitening the truth labels of the evaluation set . Results plotted are the average values over 1000 iterations per data point.

At this point, a means for determining the target value of each metric has been proposed that does not require a perfect response to be available, and that helps with metrics such as LogME where the maximum is not immediately known given the iterative solution employed to produce the score. These values for a given small error can then be used to regress the corresponding metric’s data estimate for achieving such performance. The log-linear regressions for each metric, dataset, and waveform space combination are shown in Fig. 7, while the log-linear regressions for accuracy are shown in Fig. 3, and together they help to better visualize the GoF results given in Table IV.

Fig. 7: Plots show the relationship between the quantity of data used from each dataset (Left: , Center: , Right: ) and the metric (Top: NCE, Middle: LEEP, Bottom: LogME) achieved by networks trained on that amount of data. In general, the networks trained on the capture and augmented datasets show an increasing relation, but networks trained on the synthetic dataset show a stagnant relation between performance and the quantity of data used to train. With a solution for a system that can perform arbitrarily close to perfect on the evaluation set, these linear regressions can then predict how much data would be required to achieve such a system.

Making use of the whitening procedure above and the log-linear regressions between data quantity and each metric’s score, a prediction for the data quantity needed to achieve arbitrarily high performance can then be found. For example, applying an error of to each examined problem space for the metrics and selecting a value averaged over 1000 iterations, the data quantity predictions can be made for each metric as shown in Table V. The predictions are found by inverting the linear fit to estimate the quantity from the predicted metric as


where the left-hand side is the base-ten logarithm of the quantity estimate for the selected metric (Accuracy, NCE, LEEP, or LogME), the metric value is the one found through the whitening procedure, and the slope and y-intercept are those of the log-linear regression with a base-ten logarithm applied to the quantity.
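The inversion of the log-linear fit can be sketched as below. The function name `predict_quantity` and the sample values are hypothetical, and the ordinary least-squares fit via `numpy.polyfit` is an assumed stand-in for the regression used to produce Table V.

```python
import numpy as np

def predict_quantity(quantity, metric, target_metric):
    """Invert metric ~ slope * log10(quantity) + intercept at a target value."""
    log_q = np.log10(np.asarray(quantity, dtype=float))
    slope, intercept = np.polyfit(log_q, np.asarray(metric, dtype=float), 1)
    # Solve the fitted line for log10(quantity), then undo the logarithm.
    log_q_hat = (target_metric - intercept) / slope
    return 10.0 ** log_q_hat

# Hypothetical observations: metric rises 0.1 per decade of data.
quantities = [1e2, 1e3, 1e4]
metric_values = [0.5, 0.6, 0.7]
print(predict_quantity(quantities, metric_values, target_metric=0.9))
```

Because the estimate is an exponential of the fitted line, small changes in slope or intercept shift the predicted quantity by large factors, which is why the fit quality in Table IV matters for these predictions.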

Metric Metric Metric Metric Quantity Quantity Quantity Quantity
0.999990 -1.275e-4 -2.668e-3 2.696 48.7e6 237e6 88.5e6 2.46e12
6.42e224 1.64e190 1.38e158
72.8e6 291e6 144e6 3.47e15
0.999990 -1.272e-4 -2.972e-3 2.842 394e6 13.5e9 4.45e9 2.57e19
9.15e95 6.57e100
181e6 3.60e9 1.11e9 2.41e21
0.999990 -1.289e-4 -3.963e-3 3.066 34.2e6 514e6 430e6 1.61e21
29.3e6 337e6 233e6 1.57e23
TABLE V: The performance of Accuracy (), NCE, LEEP, and LogME for the label whitening procedure proposed in Section 4 for a desired error . Additionally, the data quantity needed per metric for the log-linear regression of the metric with the data quantity used during training is provided.

5 Prediction Performance with Minimal Data

The prior sections made use of all data points taken in order to establish the best predictions for data quantity with their given metric. As these are estimates intended to predict the data quantity needed to achieve high performance systems through the increase of available data alone, certifying any result in particular is beyond the scope of this work, as the predicted quantities far exceed the available data acquired. Instead, the focus shifts to how having less data available during training affects the prediction capability of each metric relative to having greater quantities available.

Due to the performance of the synthetic dataset stagnating, further analysis will ignore this case going forward. To determine how well the log-linear regression with each metric predicts the data quantity needed, the data quantities provided in Table V are used, with the metric that achieved the best GoF in Table IV taken as the truth for each problem space. Therefore, predictions for models trained on the capture dataset use the LEEP metric’s quantity prediction for two of the waveform sets and the NCE prediction for the remaining one, while predictions for models trained on the augmented dataset all use the LogME metric. These quantity predictions are summarized in Table VI.

| Waveform set | Capture | Augmented |
|---|---|---|
| First | 88.5e6 | 3.47e15 |
| Second | 4.45e9 | 2.41e21 |
| Third | 514e6 | 1.57e23 |

TABLE VI: Quantity estimates being taken as truth for the combinations of waveform groups and training datasets. The augmented quantities are estimated using the LogME metric regression, while the captured quantities are estimated from either the NCE or LEEP metric based on the GoF in Table IV.

The predictions for each metric can be seen in Fig. 8, where the top row shows the predictions when using the capture dataset, while the bottom row shows the predictions for the augmented dataset. The columns correspond to the three waveform spaces, from left to right.

Fig. 8: Plots show the quantity predictions based on a limited amount of available data used to regress the estimate on the {Top Row: Capture ; Bottom Row: Augmented } datasets when being used to estimate across the {Left: ; Center: ; Right: } waveform space. The lines represent using {Accuracy: circle, NCE: square, LEEP: diamond, LogME: pentagram, Log Scale Midpoint of NCE and LogME: hexagram, Target: none} to predict data quantity needed, while Target is determined in Table VI. The midpoint serves as a balance between the underestimates produced by Accuracy/NCE/LEEP and overestimates from LogME for when a Target is not known a priori.

The general understanding given by Fig. 8 is that both NCE and LEEP give a more realistic prediction for data quantity than Accuracy alone, while the prediction given by LogME can serve as an upper bound. Because the regression is log-linear, any deviation can result in orders of magnitude of error, as either underestimation or overestimation. Since minimal data is not enough to identify which metric produces the best GoF regression, a midpoint estimate is recommended when only minimal data is available. The midpoint estimate seeks to balance the two extremes by averaging the NCE and LogME quantity estimates on the log scale rather than the linear scale.
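A log-scale midpoint of two quantity estimates is their geometric mean, which can be sketched as follows. The function name `log_midpoint` is hypothetical; the two inputs in the usage line are the NCE and LogME quantity estimates from the first row of Table V, used here only to show the scale of the result.

```python
import math

def log_midpoint(q_nce, q_logme):
    """Average two quantity estimates on the log10 scale."""
    mean_log = 0.5 * (math.log10(q_nce) + math.log10(q_logme))
    return 10.0 ** mean_log

# NCE and LogME quantity estimates from the first row of Table V.
midpoint = log_midpoint(237e6, 2.46e12)
linear_mean = 0.5 * (237e6 + 2.46e12)
print(f"log-scale midpoint: {midpoint:.3e}")
print(f"linear mean:        {linear_mean:.3e}")
```

The log-scale midpoint lands roughly two orders of magnitude from each input, whereas the linear mean is dominated by the larger LogME estimate and so provides almost no balance between the two.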

Returning to Table VI, it is worth understanding how long a sequential collection of data of this kind would take. For a collection that records at 10 kHz across the three waveform groups {,,}, the number of observations implied would require [1.72, 144.5, 33.4] years to acquire for the target observations needed for the predictions, and [0.989, 82.9, 19.1] terabytes of storage in an uncompressed state. While this could be feasible if the collection were performed in parallel rather than sequentially, the suggestion, rather than immediately starting a long term collection, is to improve the training routine and model architecture so that this procedure produces a regression with a more significant slope than the approach used to produce these results.
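The collection time and storage figures can be approximated with the arithmetic below. Several parameters are assumptions that are not stated in this section and were chosen because they reproduce the reported numbers: 3, 5, and 10 classes in the three waveform groups, 2048 samples per observation, 2 bytes per sample, a 365-day year, and binary terabytes (2^40 bytes). The per-class observation counts are the capture column of Table VI.

```python
SAMPLE_RATE_HZ = 10_000           # stated in the text
SAMPLES_PER_OBSERVATION = 2048    # assumption, not stated in this section
BYTES_PER_SAMPLE = 2              # assumption, not stated in this section
SECONDS_PER_YEAR = 365 * 24 * 3600
BYTES_PER_TIB = 2 ** 40

# Assumed class count -> target observations per class (Table VI, capture).
targets = {3: 88_500_000, 5: 4_450_000_000, 10: 514_000_000}

def collection_cost(n_classes, obs_per_class):
    """Return (years of sequential recording, uncompressed TiB)."""
    samples = n_classes * obs_per_class * SAMPLES_PER_OBSERVATION
    years = samples / SAMPLE_RATE_HZ / SECONDS_PER_YEAR
    tib = samples * BYTES_PER_SAMPLE / BYTES_PER_TIB
    return years, tib

for n_classes, obs in targets.items():
    years, tib = collection_cost(n_classes, obs)
    print(f"{n_classes} classes: {years:.2f} years, {tib:.3f} TiB")
```

Under these assumptions the results agree with the reported [1.72, 144.5, 33.4] years and [0.989, 82.9, 19.1] terabytes to within rounding, and dividing the time by the number of parallel recorders gives a first estimate for a parallel collection.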

6 Conclusion

The problem of estimating the amount of data needed to achieve a high performing ML model for AMC problems is discussed within this work. When large quantities of data are already available, observations per class in the problem presented in this work, a log scale nonlinear regression between performance and data quantity can help to determine how much more data is needed. However, this is often not feasible for problems with only a small amount of data on hand (i.e. observations per class), because the asymptotic bends in performance may not be visible, leading the performance regression to significantly underestimate the amount required. By making use of metrics developed in TL for choosing a model, which help linearize, on a log scale, the relation between a metric and data quantity, more accurate predictions can be made using NCE, LEEP, and LogME. These metrics in turn can help bound the amount of data that would be needed with a current approach, and help determine whether a large scale data collection should take place or whether further refinement of the training procedure and model architecture is a better approach given the program’s constraints. Given the tendency of the NCE and LEEP metrics to underestimate the amount of data needed, and the tendency of LogME to overestimate it, a midpoint approach, on a log scale, between NCE and LogME is proposed to offer a balanced estimate. Given that NCE and LEEP offer similar performance estimates and GoF measures on the collected dataset problems, and that LogME offers a better GoF for the augmented dataset problems, balancing the two offers a more reasonable measure under data constrained conditions.

While this approach shows traction within the RFML problem space of AMC, additional research is still needed in order to understand if these techniques can be more widely applied to other classification problem spaces in ML in general.


This research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via [2021-2106240007]. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.


  • [1] “Machine Learning,” in Oxford English Dictionary, 3rd ed.   Oxford, UK: Oxford Univ. Press, Mar. 2012. [Online]. Available: https://www.oed.com
  • [2] H. Sanders and J. Saxe, “Garbage in, garbage out: how purportedly great ML models can be screwed up by bad data,” Technical report, 2017.
  • [3] T. O’Shea and N. West, “Radio Machine Learning Dataset Generation with GNU Radio,” Proceedings of the GNU Radio Conference, vol. 1, no. 1, 2016. [Online]. Available: https://pubs.gnuradio.org/index.php/grcon/article/view/11
  • [4] M. Abadi et al, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
  • [5] A. Paszke et al, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds.   Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • [6] E. Blossom. (2004) GNU Radio: Tools for Exploring the Radio Frequency Spectrum. Linux Journal. [Online]. Available: https://www.linuxjournal.com/article/7319
  • [7] J. Gaeddart, “Liquid DSP,” (Date last accessed 7-January-2020). [Online]. Available: https://github.com/jgaeddert/liquid-dsp/
  • [8] L. J. Wong, W. H. Clark, IV, B. Flowers, R. M. Buehrer, W. C. Headley, and A. J. Michaels, “An RFML Ecosystem: Considerations for the Application of Deep Learning to Spectrum Situational Awareness,” IEEE Open Journal of the Communications Society, vol. 2, pp. 2243–2264, 2021.
  • [9] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.   MIT Press, 2016, http://www.deeplearningbook.org.
  • [10] H. Chen, J. Chen, and J. Ding, “Data evaluation and enhancement for quality improvement of machine learning,” IEEE Transactions on Reliability, vol. 70, no. 2, pp. 831–847, June 2021.
  • [11] N. E. West and T. O’Shea, “Deep architectures for modulation recognition,” in 2017 IEEE International Symposium on Dynamic Spectrum Access Networks (DySPAN), March 2017, pp. 1–6.
  • [12] B. Flowers and W. C. Headley, “Adversarial Radio Frequency Machine Learning (RFML) with PyTorch, MILCOM 2019 - 2019 IEEE Military Communications Conference (MILCOM),” 2019.
  • [13] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
  • [14] F. Zhuang et al, “A comprehensive survey on transfer learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2021.
  • [15] L. J. Wong and A. J. Michaels, “Transfer Learning for Radio Frequency Machine Learning: A Taxonomy and Survey,” Sensors, vol. 22, no. 4, 2022. [Online]. Available: https://www.mdpi.com/1424-8220/22/4/1416
  • [16] C. V. Nguyen, T. Hassner, C. Archambeau, and M. W. Seeger, “LEEP: A new measure to evaluate transferability of learned representations,” CoRR, vol. abs/2002.12462, 2020. [Online]. Available: https://arxiv.org/abs/2002.12462
  • [17] A. Tran, C. Nguyen, and T. Hassner, “Transferability and hardness of supervised classification tasks,” in

    2019 IEEE/CVF International Conference on Computer Vision (ICCV)

    , 2019, pp. 1395–1405.
  • [18] K. You, Y. Liu, J. Wang, and M. Long, “LogME: Practical Assessment of Pre-trained Models for Transfer Learning,” in Proceedings of the 38th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, M. Meila and T. Zhang, Eds., vol. 139.   PMLR, 18–24 Jul 2021, pp. 12 133–12 143. [Online]. Available: https://proceedings.mlr.press/v139/you21b.html
  • [19] M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, no. 1/2, pp. 81–93, 1938. [Online]. Available: http://www.jstor.org/stable/2332226
  • [20] A. V. Dandawate and G. B. Giannakis, “Detection and classification of cyclostationary signals via cyclic-HOS: a unified approach,” vol. 1770, 1992, pp. 315–326. [Online]. Available: http://dx.doi.org/10.1117/12.130939
  • [21] A. Swami and B. Sadler, “Hierarchical digital modulation classification using cumulants,” Communications, IEEE Transactions on, vol. 48, no. 3, pp. 416–429, Mar 2000.
  • [22] W. Headley and C. da Silva, “Asynchronous classification of digital amplitude-phase modulated signals in flat-fading channels,” Communications, IEEE Transactions on, vol. 59, no. 1, pp. 7–12, January 2011.
  • [23] O. Dobre, A. Abdi, Y. Bar-Ness, and W. Su, “Survey of automatic modulation classification techniques: classical approaches and new trends,” Communications, IET, vol. 1, no. 2, pp. 137–156, April 2007.
  • [24] A. Nandi and E. Azzouz, “Modulation recognition using artificial neural networks,” Signal Processing, vol. 56, no. 2, pp. 165 – 175, 1997. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S016516849600165X
  • [25] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Convolutional radio modulation recognition networks,” in Engineering Applications of Neural Networks, C. Jayne and L. Iliadis, Eds.   Cham: Springer International Publishing, 2016, pp. 213–226.
  • [26] W. H. Clark, IV, S. Hauser, W. C. Headley, and A. J. Michaels, “Training data augmentation for deep learning radio frequency systems,” The Journal of Defense Modeling and Simulation, 2021. [Online]. Available: https://doi.org/10.1177/1548512921991245
  • [27] G. Fettweis, M. Lohning, D. Petrovic, M. Windisch, P. Zillmann, and W. Rave, “Dirty RF: a new paradigm,” in 2005 IEEE 16th International Symposium on Personal, Indoor and Mobile Radio Communications, vol. 4, 2005, pp. 2347–2355 Vol. 4.
  • [28] W. H. Clark, IV, and A. J. Michaels, “Quantifying Dataset Quality in Radio Frequency Machine Learning,” in MILCOM 2021 Track 1 - Waveforms and Signal Processing (MILCOM 2021 Track 1), San DIego, USA, Nov. 2021.
  • [29] P. Virtanen et al, “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, pp. 261–272, 2020.