I. Introduction
The widespread use of wireless devices has raised the issue of massive device authentication in wireless communication. Conventional authentication based on cryptographic techniques [1, 2] has significant difficulty in detecting compromised keys. Moreover, as the number of devices increases, key-based authentication at the higher layers suffers from excessive latency caused by the heavy computation in key management procedures [3]. Consequently, physical layer authentication (PLA) has emerged as an alternative solution for fast and efficient authentication of large numbers of connected wireless devices. In contrast to conventional higher-layer authentication schemes, PLA exploits inherent physical layer properties of the device hardware [4] and enables authentication with low latency, low power consumption, and low computational overhead [5]. As a result, PLA has attracted considerable attention in the past few years [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16].
Radio-frequency fingerprints (RFFs) play an essential role in enabling PLA for device classification and authentication. In general, an RFF is a set of physical layer features sufficient to uniquely identify a wireless device. The quality of the RFFs crucially determines the reliability of PLA. Like human biometrics, these features are difficult to modify or tamper with [17].
Historically, RFF methods extracted physical layer features from the on/off transients of the radio signal received from a device. Methods of this kind date back to the work of Toonstra and Kinsner in 1996, who distinguished seven VHF FM transmitters from four different manufacturers using wavelet analysis [18]. Later, other on/off transient features were introduced for RFF, including phase offsets [19] and amplitude, power, and DWT coefficients [20, 21, 22].
These transient-based RFFs are sensitive to the position of the devices, the propagation environment, and the precision of the receivers [8]. To overcome these drawbacks, modulation-based methods were proposed to extract more stable features from a received signal (e.g., the preamble). A representative work [23] proposed exploiting a union of the synchronization correlation, in-phase/quadrature (I/Q) offset, phase offset, and magnitude of the received signal for RFF. Subsequent works further made use of the automatic gain control (AGC) response [24], amplifier nonlinearity [15], sampling frequency offset [13], and carrier frequency offset [11, 5, 12], each introducing a different trade-off between system complexity and authentication performance. All of these methods are "hand-crafted" in that they extract the RFF according to expert knowledge. Although applicable in many situations, hand-crafted RFF methods generalize poorly and are thus not suitable for general device classification.
Machine learning techniques such as decision trees [10], linear classifiers [8], nearest neighbors (NN) [25], and support vector machines (SVM) [12, 7] have been applied to the problem of RFF authentication. The effectiveness of these machine learning (ML)-based RFF classifiers relies heavily on the quality of the extracted RFFs. In particular, if the extracted RFF features are not well separable for different devices (e.g., if the relationship between the extracted RFF and the device identity is highly nonlinear), these traditional ML models ("shallow models" in the taxonomy of the machine learning community) are usually incapable of correctly classifying devices using RFFs. Moreover, these traditional ML-based methods perform poorly in many real-world environments, as hardware imperfections contain nonlinear features that hand-crafted RFF methods cannot easily model. To overcome the limitations of shallow models, deep neural networks (DNNs) have been employed in RFF authentication for better performance. In principle, DNNs are by design flexible models for representing nonlinear functions. They can be viewed either as a more powerful classifier or as a feature extraction framework that automatically learns high-level separable features
[26] for simple classifiers. As a result, DNNs have achieved enormous success in a wide range of domains [27, 28, 29, 30, 31, 32, 33]. For RFF, the major advantage of adopting DNNs is their ability to exploit a wider family of RFFs, even those that are not directly separable for different devices. Such features can potentially contain more information about the device identity. Under this assumption, the pioneering work in [14] extracted RFFs using a low-pass filter after synchronizing the raw signals, which facilitated the application of a convolutional neural network (CNN) to extract important features. Thanks to the power of CNNs, a significant performance improvement over shallow-model-based methods was observed. In [16], the authors further applied multiple sampling rates to the received signal for RFF extraction. Specifically, a CNN was trained to classify these sophisticated RFFs, and it effectively identified devices from highly non-separable RFFs.

Despite these promising improvements, the performance of the above DL-based RFF methods still depends crucially on the quality of the original RFFs. However, typical preprocessing methods such as synchronization were designed for the general task of communication rather than for discriminative RFF extraction. Such preprocessing can weaken or even discard important information about the device identity. This loss of information can reduce the generalizability of the extracted RFF to the open-set identification of unknown devices [34], which explains why most existing methods can only treat PLA as a closed-set identification problem where the set of devices remains static. The open-set setting is more realistic, since in real-world applications the set of devices in a system often varies over time. Directly feeding raw signals to a neural network may not be viable either, as the trained network would presumably overfit the data. To preserve enough information about the device identities while maintaining balanced generalizability, a new preprocessing technique tailored to RFF extraction is needed. To this end, this paper proposes a new RFF extraction framework that contains a special-purpose preprocessing module for PLA. The main contributions of this paper are summarized as follows.
1) Methodologically, we propose an end-to-end RFF extraction framework that combines signal processing priors and deep learning for open-set RFF authentication. This combination is realized by a novel preprocessing module called neural synchronization (NS), which generalizes traditional carrier synchronization (TS) with deep neural networks. Additionally, a hyperspherical representation is proposed to encourage large cosine distances between the RFFs of different devices. The resulting RFF can be directly used to distinguish known or unknown devices, and even for outlier detection and new device discovery.
2) Theoretically, we prove that the learning process of the proposed RFF extraction framework is equivalent to optimizing a lower bound on the mutual information between the device identity and the RFF. In comparison, traditional hand-crafted techniques designed for communication cannot make such an optimality claim. This observation highlights the necessity of the proposed learning framework in AI-assisted RFF authentication systems.
3) Experimentally, we demonstrate that the proposed framework outperforms state-of-the-art methods in terms of both robustness and accuracy on closed-set and open-set RFF authentication tasks. This performance gain is mainly due to the use of an inductive bias (synchronization) in the design of the resulting deep neural networks. We also verify that traditional methods tend to rely on channel information rather than device information to distinguish devices, which explains their unsatisfactory performance in open-set settings. This conclusion sheds important light on the design of RFF extraction methods.
The rest of this paper is organized as follows. Section II describes the system model. Section III details the proposed method, and Section IV provides its theoretical analysis. Section V presents the experimental results. Finally, Section VI concludes the paper.
II. System Model
II-A. RF Fingerprinting
We consider a PLA system as shown in Fig. 1, containing $K$ transmitters (Txs), denoted by $\mathrm{Tx}_1, \ldots, \mathrm{Tx}_K$, and one receiver (Rx). The transmission of a preamble signal from $\mathrm{Tx}_i$ can be represented as

$r_i = h_i(x), \quad x, r_i \in \mathbb{C}^N,$  (1)

where $x$ is the preamble signal with length $N$, $r_i$ is the received signal at the Rx, $\mathbb{C}$ indicates the set of complex numbers, and the function $h_i(\cdot)$ represents the transmission between $\mathrm{Tx}_i$ and the Rx. Note that for an ideal channel, $h_i$ includes a representation of the hardware properties of $\mathrm{Tx}_i$, thereby imprinting identity-relevant information on the corresponding received signal $r_i$. In reality, $h_i$ also contains distinctions related to the different channels among the devices, which complicates the extraction of the hardware features.
As described in Fig. 1, the goal of PLA is to determine the identity $y$ (a $K$-dimensional one-hot vector) of a device from the received signal $r$. This is typically done using a feature vector (i.e., an RFF) derived from the signal rather than the signal itself. Let $\mathcal{D} = \{(r_m, y_m)\}_{m=1}^{M}$ be the training set. PLA can then be formulated as a classification problem:

$\max_{\theta, g} \ \sum_{(r, y) \in \mathcal{D}} \log p_{\theta}(y \mid f),$  (2)

$\mathrm{s.t.} \ f = g(r),$  (3)

where $p_{\theta}(y \mid f)$ is the probability that the system classifies $r$ to device identity $y$ (often realized by a softmax function), $\theta$ represents the parameters of the softmax function, and $g(\cdot)$ is the function used to extract the RFF $f$ from the received signal, referred to as the RFF extractor. Different RFF solutions lead to different forms of $g$. In traditional RFF solutions, $g$ is a hand-crafted signal processing function, whereas in recent deep learning-based approaches $g$ is replaced by a deep neural network^1 to effectively mimic the nonlinear signal processing.

^1 Strictly speaking, by a composition of several preprocessing steps and a deep neural network.

II-B. Open-Set Physical Layer Authentication
Regardless of the specific form of the RFF extractor $g$, prior RFF solutions share a common problem: the inability to cope with unknown devices. This is because the system is built to maximize classification performance among already known devices. In reality, however, it is impossible to obtain all potential devices in advance when training a classifier, since the number of devices in the system is unlikely to remain unchanged, a problem known as open-set authentication [34]. An obvious solution is to retrain the classifier whenever a new device enters the system [35, 36]. While this approach is theoretically sound, it may consume substantial system resources (e.g., time and energy) in a practical deployment. As an alternative, we seek a practical, low-cost solution that avoids retraining as much as possible.
To this end, we propose a completely different solution to open-set RFF authentication. Instead of continually retraining the classifier as in existing approaches, we learn to extract a discriminative RFF that generalizes to unknown devices. In other words, the learned RFF extractor not only extracts distinguishing features for known devices but also discriminates among completely unknown devices. We then perform authentication by comparing the similarity between RFFs using a given distance function (e.g., the Euclidean or cosine distance). If two RFFs, say $f_1$ and $f_2$, are highly similar to each other, they are considered to come from the same device; otherwise, they are assumed to come from different devices. Mathematically, this procedure is formulated as follows:
$\text{decision} = \begin{cases} \text{same device}, & d(f_1, f_2) \leq \tau, \\ \text{different devices}, & d(f_1, f_2) > \tau, \end{cases}$  (4)

where $\tau$ is a threshold optimized on the training dataset. No classifier is needed in this procedure. To facilitate subsequent learning by the softmax-based loss function, in this work we use the cosine distance^2

$d(f_1, f_2) = 1 - \frac{\langle f_1, f_2 \rangle}{\|f_1\| \, \|f_2\|}.$  (5)

^2 We adopt the cosine distance based on the following two considerations: 1) the finite range of the cosine distance meets the requirements of the Lipschitz continuity condition in DL training; 2) it allows more stable and direct optimization of the distance among RFFs (see Section II).
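As a concrete illustration, the distance-threshold decision rule above can be sketched as follows. This is a minimal sketch assuming real-valued RFF vectors; the toy features and threshold value are illustrative and not from the paper (in practice the threshold is tuned on the training set):

```python
import numpy as np

def cosine_distance(f1, f2):
    """Cosine distance of (5): d = 1 - <f1, f2> / (||f1|| ||f2||)."""
    return 1.0 - np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2))

def same_device(f1, f2, tau):
    """Decision rule of (4): same device iff the RFF distance is below tau."""
    return cosine_distance(f1, f2) <= tau

# toy RFFs: f1 and f2 nearly aligned, f3 orthogonal to f1
f1 = np.array([1.0, 0.0, 0.0])
f2 = np.array([0.9, 0.1, 0.0])
f3 = np.array([0.0, 1.0, 0.0])
tau = 0.5  # illustrative threshold
```

Note that only pairwise distances are compared, so no classifier over a fixed device set is involved, which is what makes the rule applicable to unknown devices.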
The only remaining question is how to learn such an RFF extractor without access to unknown devices. In the next section, we achieve this goal with a novel model-and-data-driven DL framework.
III. Neural Synchronization for RFF Extraction
In this section, we elaborate on the proposed neural synchronization (NS)-based RFF framework. The framework combines the advantages of model-based signal processing priors with the data-driven learning ability of DNNs. A block diagram of the proposed framework is shown in Fig. 2.
III-A. Module Design
As shown in Fig. 2, the NS-based RFF extractor consists of three basic neural networks with similar architectures. The first two form the proposed NS module, which estimates the frequency and phase offsets of the input signals and outputs the offset-compensated signals. The third neural network is the RFF extractor itself.
We begin with the design of the proposed NS module, which generalizes the traditional carrier synchronization (TS) techniques used in previous RFF extraction methods [8, 14, 16]. TS was originally designed to compensate for the frequency and phase offsets in the received signals. These offsets are often caused by low-cost oscillators at the receiver and degrade communication quality. TS performs compensation by first estimating these offsets, typically via maximum likelihood estimation (MLE):
$(\hat{f}, \hat{\phi}) = \arg\max_{f, \phi} \ p(r \mid f, \phi),$  (6)

where $f$ and $\phi$ are the frequency and phase offsets, respectively, and $r$ is the received signal. In general, the problem in (6) can be rewritten as [14]:

$(\hat{f}, \hat{\phi}) = \arg\min_{f, \phi} \sum_{n=0}^{N-1} \left| r[n] \, e^{-j(2\pi f n + \phi)} - x[n] \right|^2,$  (7)

where $x$ is the preamble signal, $r$ is the received signal, $x[n]$ and $r[n]$ denote the $n$-th elements of $x$ and $r$, respectively, and $N$ is the length of $x$. Once $\hat{f}$ and $\hat{\phi}$ are obtained, the compensation of the received signal is performed as

$\tilde{r}[n] = r[n] \, e^{-j(2\pi \hat{f} n + \hat{\phi})}, \quad n = 0, \ldots, N-1.$  (8)
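To make the TS pipeline concrete, the following sketch estimates the offsets of (7) by brute-force grid search and applies the compensation of (8). This is purely illustrative: practical receivers use closed-form or correlation-based estimators rather than exhaustive search, and the signal model below is a toy example:

```python
import numpy as np

def ts_estimate(r, x, freq_grid, phase_grid):
    """Brute-force MLE of (7): pick the frequency/phase offset pair that best
    matches the received signal r back to the known preamble x."""
    n = np.arange(len(x))
    best, best_cost = (0.0, 0.0), np.inf
    for f in freq_grid:
        for p in phase_grid:
            cost = np.sum(np.abs(r * np.exp(-1j * (2 * np.pi * f * n + p)) - x) ** 2)
            if cost < best_cost:
                best, best_cost = (f, p), cost
    return best

def ts_compensate(r, f_hat, p_hat):
    """Compensation of (8): remove the estimated offsets from r."""
    n = np.arange(len(r))
    return r * np.exp(-1j * (2 * np.pi * f_hat * n + p_hat))

# toy example: a preamble distorted by known frequency and phase offsets
x = np.exp(1j * 2 * np.pi * 0.1 * np.arange(64))               # ideal preamble
r = x * np.exp(1j * (2 * np.pi * 0.01 * np.arange(64) + 0.3))  # offset-corrupted signal
f_hat, p_hat = ts_estimate(r, x,
                           np.linspace(-0.02, 0.02, 41),
                           np.linspace(-np.pi, np.pi, 63))
r_sync = ts_compensate(r, f_hat, p_hat)
```

After compensation, `r_sync` closely matches the ideal preamble, which is exactly the behavior that helps communication but, as argued next, can erase device-specific information.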
TS has proven to be a useful technique for communication tasks. However, the offsets themselves may carry information about the device, and simply removing them from the received signal may discard information about the device identity when RFFs are extracted from the compensated signals.
To this end, we develop a neural network generalization of the synchronization process that automatically learns how to perform compensation in a data-driven manner. More specifically, rather than performing offset estimation as in (7), which removes information about the identity of the devices, we estimate the offsets with two deep neural networks trained on data. Let $g_f$ and $g_\phi$ be the deep neural networks used to estimate the frequency and phase offsets, respectively. The compensation needed for RFF extraction is performed as follows:
Neural Frequency Compensation
We first adopt a neural network $g_f$ to estimate the frequency offset $\hat{f}_d$:

$\hat{f}_d = g_f(r; \theta_f),$  (9)

where $\theta_f$ contains the parameters (i.e., weights and biases) of the neural network. We call $\hat{f}_d$ the device-irrelevant frequency offset, since its estimation is driven by device identification rather than communication. Given $\hat{f}_d$, frequency compensation of the received signal is performed by

$r_f[n] = r[n] \, e^{-j 2\pi \hat{f}_d n}, \quad n = 0, \ldots, N-1.$  (10)
Neural Phase Compensation
Similarly, the device-irrelevant phase offset, denoted by $\hat{\phi}_d$, is estimated by another neural network $g_\phi$ using $r_f$:

$\hat{\phi}_d = g_\phi(r_f; \theta_\phi),$  (11)

where $\theta_\phi$ includes the parameters of the phase neural network. The synchronized signal is then obtained by computing

$\tilde{r}[n] = r_f[n] \, e^{-j \hat{\phi}_d}, \quad n = 0, \ldots, N-1.$  (12)
The two networks $g_f$ and $g_\phi$ constitute the NS module, which compensates for the device-irrelevant frequency and phase offsets in the received signal. After compensation, the signal is sent to a third neural network $g_e$ with parameters $\theta_e$ to compute the desired RFF:

$f = g_e(\tilde{r}; \theta_e).$  (13)

We denote the entire "neural frequency compensation + neural phase compensation + RFF computation" procedure above as

$f = F(r; \Theta), \quad \Theta = \{\theta_f, \theta_\phi, \theta_e\},$  (14)

where $F$ is the proposed NS-based RFF extractor and $\Theta$ denotes the parameters of the entire RFF extraction network. The parameters in $\Theta$ are learned jointly using the objective function described later.
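The data flow of (9)-(14) can be sketched as below. The "networks" here are random linear stand-ins, not the trained CNNs of the paper, and the names `g_freq`, `g_phase`, and `g_rff` are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def tiny_net(in_dim, rng):
    """Stand-in for a trained CNN: a random linear map from the stacked
    real/imag parts of the signal to a single scalar offset estimate."""
    w = rng.normal(scale=0.01, size=2 * in_dim)
    return lambda r: float(w @ np.concatenate([r.real, r.imag]))

N = 64                      # illustrative preamble length
g_freq = tiny_net(N, rng)   # neural frequency-offset estimator, cf. (9)
g_phase = tiny_net(N, rng)  # neural phase-offset estimator, cf. (11)
g_rff = lambda r: np.concatenate([r.real, r.imag])  # stand-in RFF extractor, cf. (13)

def ns_forward(r):
    """Data flow of the NS-based extractor F in (14)."""
    n = np.arange(len(r))
    f_hat = g_freq(r)                              # (9)  estimated frequency offset
    r_f = r * np.exp(-1j * 2 * np.pi * f_hat * n)  # (10) frequency compensation
    p_hat = g_phase(r_f)                           # (11) estimated phase offset
    r_sync = r_f * np.exp(-1j * p_hat)             # (12) phase compensation
    return g_rff(r_sync)                           # (13) RFF computation

r = rng.normal(size=N) + 1j * rng.normal(size=N)   # dummy received signal
rff = ns_forward(r)
```

The key design point is that the whole chain is differentiable, so the offset estimators are trained by the downstream identification objective rather than by a communication criterion.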
We use a convolutional neural network (CNN) for all three networks $g_f$, $g_\phi$, and $g_e$. The motivation for using convolution is that the preamble signal often exhibits high periodicity (e.g., an ideal preamble is often composed of several identical symbols, as in the IEEE 802.15.4 standard [37]), and convolution excels at extracting periodic patterns [38]. In principle, 1D convolution could be used for this purpose; however, it cannot capture the cross-period patterns that the signal exhibits, which may be useful for identifying different devices. We therefore use 2D convolution to extract a richer set of patterns. As shown in Fig. 3, the proposed 2D convolutional network (which we refer to as the basic CNN (BCNN) from now on) has two components: a signal-to-image layer and a series of convolution layers. The complexity of these layers is controlled by hyperparameters summarized in Table I.
i) Signal-to-Image Layer: This layer converts the original 1D signal into a 2D image to facilitate subsequent processing. Formally, given an input signal $r \in \mathbb{C}^N$, this layer computes its image representation $I \in \mathbb{R}^{2 \times H \times W}$ as

$I[1, h, w] = \Re\{ r[(h-1)W + w] \},$  (15)

$I[2, h, w] = \Im\{ r[(h-1)W + w] \},$  (16)

where $\Re\{\cdot\}$ and $\Im\{\cdot\}$ are, respectively, the real and imaginary parts of the input, and $r[n]$ is the $n$-th element of the received signal $r$. The two channels of the image correspond to the real and imaginary parts of the signal, with each pixel representing one dimension of the input signal. The width of the image, denoted by $W$, is set such that each row of the image corresponds to one-half of a symbol period (i.e., 16 chips in the IEEE 802.15.4 standard). Therefore, pixels in the same row come from the same symbol, whereas pixels in different rows come from different symbols.
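A minimal sketch of the signal-to-image layer follows, assuming the image width evenly divides the signal length; the width value used below is illustrative rather than the paper's setting:

```python
import numpy as np

def signal_to_image(r, width):
    """Signal-to-image layer of (15)-(16): fold a length-N complex signal into
    a (2, N // width, width) array; channel 0 holds the real part, channel 1
    the imaginary part, and each row holds `width` consecutive samples."""
    assert len(r) % width == 0, "width must evenly divide the signal length"
    grid = r.reshape(len(r) // width, width)
    return np.stack([grid.real, grid.imag])

r = np.arange(1280) + 1j * np.arange(1280)  # dummy 1280-sample preamble
img = signal_to_image(r, width=32)          # width chosen for illustration only
```

Because consecutive rows cover consecutive signal segments, a 2D kernel sliding over this image mixes samples within a row (intra-period) and across rows (inter-period) at once.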
ii) Convolution Layers: With the image representation in (16), we can perform 2D convolution to extract both intra-period and inter-period patterns. Mathematically, given an image $I$ and a kernel $K$, the 2D convolution is defined as

$(I * K)[m, n] = \sum_{i} \sum_{j} I[m + i, n + j] \, K[i, j].$  (17)
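The operation in (17) can be sketched directly. As is conventional in deep learning libraries, cross-correlation without kernel flipping is computed; this naive loop implementation is for clarity only:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2-D convolution as in (17) (cross-correlation form):
    out[m, n] = sum_{i, j} image[m + i, n + j] * kernel[i, j]."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = np.sum(image[m:m + kh, n:n + kw] * kernel)
    return out

img = np.arange(16.0).reshape(4, 4)
kernel = np.ones((3, 3))    # a small 3x3 filter
out = conv2d(img, kernel)   # output shape: (2, 2)
```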
Inspired by [39], we adopt small $3 \times 3$ convolutional filters in our convolutional neural networks; these are the smallest filters able to capture the 2D correlations of an image. By stacking layers of small filters, we can achieve the same effective receptive field as larger filters while using fewer parameters [39]. As in many other deep learning frameworks [40, 41, 42], we apply batch normalization (BN) and a leaky ReLU activation (LReLU) after each convolution operation. The former helps stabilize and accelerate training, whereas the latter serves as the nonlinear transformation in the network. The three operations together form one layer of the convolutional network. We repeatedly apply these layers to extract high-level patterns from the input image $I$, and append a fully connected layer to compute the final output $\hat{f}_d$, $\hat{\phi}_d$, or $f$. Note that the number of layers depends on the size of the input image. For example, the network structure used in our experiments is presented in Table I. After six convolutional layers, the output of the sixth layer is too small for further convolution (i.e., smaller than the filter size); we therefore stop convolving and instead apply a fully connected layer, which is also the final layer of the CNN. The number of filters is controlled by a hyperparameter used to adjust model complexity.

To explain the benefits of NS more intuitively, we visualize the real part of the synchronized signals from three different devices in Fig. 4. In contrast to TS, which directly eliminates the distinctions between these signals, the proposed NS module preserves device information while maintaining signal alignment.
III-B. Learning Algorithm
Given the proposed NS-based RFF extractor $F$, we now turn to training the network with the collected data $\mathcal{D}$. Ideally, the learning algorithm and objective should maximally separate the RFFs of different devices. We show how to achieve this goal below with a novel objective, motivated by the limitations of a naïve solution.
A Naïve Solution
A vanilla approach to optimizing the objective (2) is to define the classification distribution as a softmax distribution:

$p_{\theta}(y = i \mid f) = \frac{\exp(w_i^{\top} f)}{\sum_{j=1}^{K} \exp(w_j^{\top} f)},$  (18)

where $f$ is the RFF and $\theta = \{w_1, \ldots, w_K\}$ are the parameters of the softmax scores. With this construction, we can optimize the objective in (2) to learn the extractor.
There is one major issue with this naïve approach: maximizing the classification score does not necessarily produce a good distance between features. This is because the classification score in (18) can be pushed arbitrarily high by enlarging the norm of the obtained RFF, as demonstrated by the following proposition.

Proposition 1. For all $f$ that satisfy $w_i^{\top} f > w_j^{\top} f, \ \forall j \neq i$, and any $\alpha > 1$, we have

$p_{\theta}(y = i \mid \alpha f) > p_{\theta}(y = i \mid f).$  (19)

Proof.

Dividing the numerator and denominator of (18) by $\exp(\alpha \, w_i^{\top} f)$, we obtain

$p_{\theta}(y = i \mid \alpha f) = \frac{1}{1 + \sum_{j \neq i} \exp\!\big(-\alpha (w_i^{\top} f - w_j^{\top} f)\big)},$

where $w_i^{\top} f - w_j^{\top} f > 0$ for all $j \neq i$, so the right-hand side is monotonically increasing in $\alpha$. The inequality follows directly. ∎
This proposition implies that the classifier can achieve a high classification score simply by manipulating the norms of the learned RFFs: increasing the norms of RFFs whose classification score is high and decreasing the norms of those whose score is low. Therefore, although this method is optimal in terms of the classification score, the learned RFFs are not necessarily well separated in the feature space (note that in open-set settings we can only classify RFFs by their distance). As a result, the features obtained through such naïve training may not be suitable for open-set RFF authentication.
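The norm-inflation effect described above is easy to verify numerically: scaling a winning feature vector by α > 1 raises its softmax probability without changing its direction. The weight matrix and feature below are toy values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

W = np.array([[1.0, 0.0],    # class-0 weight vector (the winning class)
              [0.0, 1.0],    # class-1 weight vector
              [-1.0, 0.0]])  # class-2 weight vector
f = np.array([0.6, 0.2])     # toy RFF with w_0^T f > w_j^T f for all j != 0

p_orig = softmax(W @ f)[0]            # class-0 probability at the original norm
p_scaled = softmax(W @ (5.0 * f))[0]  # class-0 probability after scaling by alpha = 5
```

Here `p_scaled` exceeds `p_orig` even though the direction of the RFF, and hence its cosine distance to any other RFF, is unchanged.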
Hypersphere Projection
To overcome this issue with the softmax classifier, we propose a hyperspherical representation for the learned RFF, inspired by recent advances in face recognition [43, 44, 45]. The RFF in our framework does not lie in Euclidean space but on the surface of a hypersphere. These hyperspherical RFFs are obtained by applying a hyperspherical projection to the original RFF, namely

$\bar{f} = s \, \frac{f}{\|f\|},$  (20)

where $s$ is the radius of the hypersphere, left as a hyperparameter. With this hyperspherical representation, the classification probability is (re)defined as

$q_{\theta}(y = i \mid \bar{f}) = \frac{\exp(\bar{w}_i^{\top} \bar{f})}{\sum_{j=1}^{K} \exp(\bar{w}_j^{\top} \bar{f})},$  (21)

where

$\bar{w}_i = \frac{w_i}{\|w_i\|}.$  (22)

Now, as both $\bar{f}$ and $\bar{w}_i$ in (21) have fixed norms, maximizing (21) is equivalent to minimizing the cosine distance between $\bar{f}$ and $\bar{w}_i$, and hence the cosine distance between all $\bar{f}$ belonging to the same device. In other words, the hyperspherical representation guarantees that optimizing the classification probability and optimizing the cosine distances coincide. In the following experiments, we will see that this hyperspherical representation is also a key step in improving the performance of the proposed RFF framework.
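A sketch of the hyperspherical projection and the resulting cosine-based softmax classifier follows; the radius value and the identity weight matrix are illustrative choices, not the paper's settings:

```python
import numpy as np

def hypersphere_project(f, s=8.0):
    """Hyperspherical projection of (20): keep only the direction of the RFF
    and rescale it to radius s (a hyperparameter; s = 8 is illustrative)."""
    return s * f / np.linalg.norm(f)

def cosine_softmax(f_bar, W):
    """Classification probability of (21): softmax over inner products between
    the projected RFF and unit-norm class weights, cf. (22)."""
    W_bar = W / np.linalg.norm(W, axis=1, keepdims=True)  # (22)
    logits = W_bar @ f_bar  # bounded logits: both factors have fixed norm
    e = np.exp(logits - logits.max())
    return e / e.sum()

f = np.array([3.0, 4.0])
f_bar = hypersphere_project(f)  # now ||f_bar|| = 8 exactly
p = cosine_softmax(f_bar, np.eye(2))
```

Because the projection discards the norm, rescaling an RFF by any positive factor leaves both its projection and its classification probability unchanged, closing the loophole of the naïve softmax.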
Finally, given the proposed NS-based RFF extractor $F$, the auxiliary classifier $q_\theta$ in (21), and the training set $\mathcal{D}$, the reformulated learning objective of (2)-(3) becomes

$\max_{\Theta, \theta} \ \sum_{(r, y) \in \mathcal{D}} \log q_{\theta}(y \mid \bar{f}), \quad \mathrm{s.t.} \ \bar{f} = s \, \frac{F(r; \Theta)}{\|F(r; \Theta)\|}.$  (23)

The end-to-end training algorithm is summarized in Algorithm 1. Note that the auxiliary classifier only provides training supervision for more discriminative RFFs; once training is complete, the classifier is discarded. Only the RFF extractor $F$ is required for RFF authentication (since we compare RFFs by their pairwise distance).
IV. Theoretical Analysis
In this section, we illustrate the effectiveness of the proposed NS-based framework from an information-theoretic perspective. Let us begin with a Markov chain that describes a regular RF fingerprinting process:

$y \rightarrow r \rightarrow z \rightarrow f,$  (24)

where $y$ is the identity of the device, $r$ is the received signal (determined by the preamble, the wireless channel, and the device hardware), $z$ is the preprocessed version of $r$, and $f$ is the RFF extracted from $z$. Therefore, given $z$, the received signal $r$ and the RFF $f$ are independent. According to the data processing inequality [46], we have

$I(y; f) \leq I(y; z) \leq I(y; r),$  (25)

where $I(\cdot\,; \cdot)$ denotes mutual information, and equality is achieved only if $z$ and $f$ are both sufficient statistics of $r$ with respect to $y$. The inequalities in (25) imply that introducing any preprocessing step inevitably loses information about $y$. Traditional synchronization (TS) is designed to recover the transmitted preamble, which is not directly relevant to extracting the device hardware information of $y$ from $r$. Therefore, the $z$ produced by TS is not necessarily a sufficient statistic for RFF identification. Let $T(\cdot)$ be any preprocessing that is not aimed at identity extraction, with $z = T(r)$, and define the information cost as $C_T = I(y; r) - I(y; z) \geq 0$. Then (25) can be rewritten as

$I(y; f) \leq I(y; r) - C_T.$  (26)
This inequality indicates that no matter how powerful the learning model applied in the subsequent fingerprint extraction, a certain amount of information loss persists due to the inappropriate preprocessing.
By contrast, the proposed NS-based RFF extractor is a unified trainable model that combines preprocessing and RFF extraction. The Markov chain in (24) then simplifies to

$y \rightarrow r \rightarrow f,$  (27)

and the corresponding inequality becomes

$I(y; f) \leq I(y; r).$  (28)

In this setting, the information cost can be made arbitrarily close to 0 by directly maximizing $I(y; f)$. In fact, the proposed end-to-end training with the learning objective in (23) is equivalent to maximizing a lower bound on $I(y; f)$, as demonstrated by the following theorem.

Theorem 1. Given the Markov chain in (27), the RFF extractor $F$, the auxiliary classifier $q_\theta$, and the training dataset distribution $p(r, y)$, a variational lower bound of $I(y; f)$ is given by

$I(y; f) \geq H(y) + \mathbb{E}_{p(y, f)}\left[ \log q_{\theta}(y \mid f) \right],$  (29)

with equality if and only if $D_{\mathrm{KL}}\big(p(y \mid f) \,\|\, q_{\theta}(y \mid f)\big) = 0$, where $D_{\mathrm{KL}}(\cdot \| \cdot)$ is the Kullback-Leibler divergence.
Proof.
See Appendix A. ∎
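The bound in (29) follows the standard variational argument for mutual information; a sketch (with $q_{\theta}$ the auxiliary classifier, and assuming the notation of the theorem) is:

```latex
\begin{align}
I(y;f) &= H(y) - H(y \mid f) \\
       &= H(y) + \mathbb{E}_{p(y,f)}\big[\log p(y \mid f)\big] \\
       &= H(y) + \mathbb{E}_{p(y,f)}\big[\log q_{\theta}(y \mid f)\big]
          + \mathbb{E}_{p(f)}\Big[D_{\mathrm{KL}}\big(p(y \mid f)\,\|\,q_{\theta}(y \mid f)\big)\Big] \\
       &\geq H(y) + \mathbb{E}_{p(y,f)}\big[\log q_{\theta}(y \mid f)\big],
\end{align}
```

where the inequality holds because the KL divergence is nonnegative, with equality if and only if $q_{\theta}(y \mid f) = p(y \mid f)$ almost everywhere.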
Note that $f = F(r; \Theta)$ is deterministic, while $p(r, y)$ can be approximated by the empirical distribution of the data $\mathcal{D}$. The bound in (29) can then be approximated by

$I(y; f) \geq H(y) + \frac{1}{|\mathcal{D}|} \sum_{(r, y) \in \mathcal{D}} \log q_{\theta}\big(y \mid F(r; \Theta)\big),$  (30)

whose maximization is exactly the proposed learning objective in (23) (the entropy $H(y)$ is a constant determined by the data). This means that the proposed framework optimizes the mutual information between device identities and RFFs better than hand-crafted preprocessing methods can. This mutual information directly reflects the quality of the learned features, which in turn influences their generalizability to unknown devices. This is why the proposed learning algorithm is necessary for open-set RFF authentication.
V. Experimental Evaluation
In this section, we conduct a series of experiments to verify the effectiveness of the proposed NS-based RFF. We compare the performance of the proposed framework with that of conventional model-driven (TS-based RFF) and data-driven (pure DL-based RFF) methods. For a comprehensive evaluation, we divide the experiments into four parts: 1) performance comparison on closed and open test sets; 2) performance evaluation under different signal-to-noise ratios (SNRs); 3) performance comparison between the proposed NS-based RFF extractor and the pure DL-based RFF extractor at different network complexities; and 4) performance comparison with TS-based RFF augmented with the frequency and phase offsets as additional features.

The source code for the proposed NS approach is implemented in PyTorch with the DL research toolbox MarverToolbox. The source code is openly available at [47]. MarverToolbox is an open-source toolbox we developed for GPU acceleration of complex tensor computations and for facilitating DL-based communication research; it is available at [48].

V-A. Experimental Setup
Dataset
We collected data from 54 TI CC2530 ZigBee devices using a USRP N210 as the receiver and IEEE 802.15.4 as the physical layer standard. The ZigBee devices transmitted at a maximum power of 19 dBm and were located within one meter of the receiver. The experimental system operates in the 2.4 GHz frequency band with a USRP sampling rate of 10 Msample/s. Each preamble signal contains 1280 sample points (i.e., the dimension of the dataset), and its energy is normalized to unity. All datasets were collected on a real testbed, so the received data contain an unavoidable, practical level of noise.
As shown in Fig. 5, the dataset consists of nine sample blocks. Blocks 1-5 were sampled within the same day (without device aging^3), while blocks 6-9 were sampled 18 months later (with device aging). Among the latter, blocks 6-7 were sampled on the same day, and blocks 8-9 on another day. The extended data collection interval ensures data independence and helps verify the generalizability and robustness of the proposed NS framework.

^3 The term "device aging" refers to the degradation of device performance over time. The devices used for collecting data had been operating for 18 months without interruption.
We split the whole dataset into seven parts for a comprehensive comparison. Except for the training set, we list the six test sets in order of classification difficulty from easy to hard:

Closed test set: known devices without device aging, all conditions identical to the training set;

Open 1: unknown devices without device aging, all conditions identical to the training set;

Open 2: known devices with unknown device aging, collected 18 months later;

Open 4: unknown devices with unknown device aging, collected 18 months later;

Open 2-3: known devices with two types of unknown device aging, collected 18 months later;

Open 4-5: unknown devices with two types of unknown device aging, collected 18 months later.
The Closed test set is the simplest, containing known devices under channel conditions similar to the training set. In contrast, Open 4-5 is the most challenging test set, with unknown devices and different, unknown device aging (sampled on different dates).
Baselines and the Proposed NS-based RFF
We consider three types of baselines, classified according to whether they are model-driven, purely data-driven, or model-and-data-driven, as shown in Table II:

M: model-driven methods, which use hand-crafted operations for signal preprocessing, with different RFF extractors and auxiliary linear classifier backends.

D: data-driven methods, which use only neural networks to extract the RFF.

M&D: the proposed methods, which combine signal processing priors with learnable parameters.
In Table II, TS refers to traditional synchronization, NS indicates that the design uses the proposed neural synchronization, and "HP" denotes the proposed hyperspherical projection. Except for the model in [9], which has 63 million parameters, the number of parameters of the other methods is restricted to 12 million. All models are trained on the training dataset shown in Fig. 5 for 150 epochs using the Adam optimizer [49] with a fixed learning rate. The code for reproducing our experiments is available at https://github.com/xrjcom/NSRFF.

Metric
Similar to biometric identification systems, we use the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), and the equal error rate (EER) as metrics to evaluate the quality of the extracted RFF. The ROC curve is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at various thresholds [50]. Given the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the TPR and FPR are defined as

$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}.$  (31)

The TPR, also known as the probability of detection, is the fraction of positive samples (intra-class pairs) correctly identified as positive. Here, positive samples refer to signal pairs from the same device (similarly, negative samples refer to pairs from different devices). The FPR, also known as the probability of false alarm, is the fraction of negative samples (inter-class pairs) incorrectly identified as positive. The ROC curve depicts the trade-off between TPR and FPR. The EER is the point where the FNR and FPR are equal, with FNR = 1 − TPR.
A higher AUC and a lower EER mean that the ROC curve is closer to the top-left corner, which corresponds to "perfect classification": fewer false negatives and fewer false positives simultaneously.
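The metrics above can be computed from similarity scores as in the following sketch. This is a naive threshold sweep over toy Gaussian scores, for illustration only; a practical implementation would use a library routine such as scikit-learn's `roc_curve`:

```python
import numpy as np

def roc_points(pos, neg, thresholds):
    """TPR and FPR of (31), sweeping a decision threshold over similarity
    scores; a pair is declared 'same device' when its score >= threshold."""
    tpr = np.array([(pos >= t).mean() for t in thresholds])
    fpr = np.array([(neg >= t).mean() for t in thresholds])
    return tpr, fpr

def auc_and_eer(pos, neg):
    ts = np.linspace(0.0, 1.0, 501)
    tpr, fpr = roc_points(pos, neg, ts)
    fpr_inc, tpr_inc = fpr[::-1], tpr[::-1]  # reorder so FPR is increasing
    auc = np.sum(np.diff(fpr_inc) * (tpr_inc[1:] + tpr_inc[:-1]) / 2.0)  # trapezoid rule
    eer_idx = np.argmin(np.abs((1.0 - tpr) - fpr))  # point where FNR ~ FPR
    return auc, fpr[eer_idx]

rng = np.random.default_rng(0)
pos = np.clip(rng.normal(0.8, 0.1, 2000), 0.0, 1.0)  # same-device score samples
neg = np.clip(rng.normal(0.3, 0.1, 2000), 0.0, 1.0)  # different-device score samples
auc, eer = auc_and_eer(pos, neg)
```

With well-separated score distributions as in this toy example, the AUC is close to 1 and the EER close to 0, the regime a good RFF extractor should approach.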
V-B Performance Under Closed-Set and Open-Set Settings
To validate the superiority of the proposed model-and-data framework, we compare its performance with that of existing methods. For this purpose, we plot the ROC curves of the proposed method and the baselines under both closed-set and open-set settings, and we compare the EERs in Table III. All results are measured on the test set.
Power of End-to-End Learning
Overall, end-to-end methods (BCNN, NS+BCNN) outperform their non-end-to-end counterparts (Yu et al. [16], TS+BCNN) by a large margin in all cases. These results verify our hypothesis: traditional preprocessing steps such as the carrier synchronization used in these methods do result in a loss of information about the device identities. We further observe that, in cases with different sampled channels, non-end-to-end methods perform no better than random guessing (see, e.g., Fig. 6(b) to Fig. 6(d)). This suggests that these TS-based methods may indeed rely more on channel distinctions than on hardware imperfections to distinguish devices. The use of deep neural networks, unfortunately, does not solve this problem. These results fully demonstrate the suboptimality of TS for open-set RFF authentication and highlight the need for end-to-end learning.
Power of Signal Processing Priors
To investigate the usefulness of the proposed NS module, we compare the two end-to-end methods, namely BCNN (which learns the RFF directly from raw signals) and NS+BCNN (which preprocesses the signal with the NS module). We find that BCNN with NS significantly outperforms BCNN without NS, as evidenced by both the ROC curves in Fig. 6 and the EER values in Table II. This is especially the case in highly open-set scenarios (e.g., unknown devices with device aging, as shown in Fig. 6(e)(f)), where the gap between the ROC curves of the two methods is enlarged. This confirms the advantage of the inductive bias introduced by the proposed NS module and demonstrates its necessity for open-set RFF authentication.
Power of Hypersphere Representation
In this section we further investigate the usefulness of the proposed hypersphere representation by applying the HP operation to all methods. The results are presented in Table II. Interestingly, we find that while HP significantly improves the performance of the proposed model-and-data-driven approaches, it deteriorates the performance of both the model-driven and the data-driven methods. We conjecture that the underlying reason is that HP is designed to encourage the separation between different RFFs in the training set; when the RFFs learned from the training set cannot generalize to the test set, this encouragement instead causes overfitting. In other words, only those methods that extract highly generalizable RFFs have good affinity for HP. This interesting result also serves as indirect evidence that the signal processing priors used in the NS module indeed help to learn high-quality RFFs that generalize better to unseen data.
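One common way to realize a hyperspherical projection of this kind, used for example in face verification [44, 45], is to L2-normalize the embedding onto the unit hypersphere so that only the angular separation between RFFs matters. The sketch below illustrates that operation; it is an assumption about the form of HP, not necessarily the exact operation used in the paper.

```python
import numpy as np

def hyperspherical_projection(z, eps=1e-12):
    """Project RFF embeddings onto the unit hypersphere via L2 normalization.

    z: array of shape (batch, dim) holding raw embedding vectors.
    After projection every vector has unit norm, so distances between
    RFFs depend only on their angle, which encourages class separation.
    """
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    return z / np.maximum(norm, eps)  # eps guards against zero vectors
```

Because the projection discards magnitude, any magnitude-based shortcut a network might learn on the training set is removed, which is consistent with the overfitting behavior discussed above.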
V-C Visualizations
Visualization of the Distance Distribution
To provide intuition about the superiority of the proposed method, we present histograms of the distances between intra- and inter-device RFFs in Fig. 7. These distances are calculated on the Open 4 dataset, which is a mixture of unknown and aged devices. As seen in Fig. 7, the intra-device and inter-device distances are nearly indistinguishable for the TS-based RFF, which explains its poor performance on this dataset. Although the situation is much better for the DL-based RFF, the overlap between intra- and inter-device distances is still evident. In contrast, the distributions of the intra-device and inter-device distances for the NS-based method are well separated, meaning that one can easily distinguish one device from the others. This illustrates why the proposed framework achieves satisfactory performance even in open-set settings.
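The histograms of this kind are built by splitting all pairwise distances according to whether the two RFFs come from the same device. A minimal sketch of that bookkeeping follows (function name and interface are illustrative):

```python
import numpy as np

def pairwise_distance_split(embeddings, device_ids):
    """Split pairwise Euclidean distances into intra- and inter-device sets.

    embeddings: (n, dim) array of RFF vectors
    device_ids: (n,) array of device labels
    Returns two arrays suitable for plotting as overlaid histograms.
    """
    n = len(device_ids)
    intra, inter = [], []
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(embeddings[i] - embeddings[j])
            (intra if device_ids[i] == device_ids[j] else inter).append(d)
    return np.array(intra), np.array(inter)
```

A well-separated RFF space shows up as the intra-device histogram lying entirely to the left of the inter-device one, as observed for the NS-based method in Fig. 7.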
Visualization of the Frequency and Phase Offsets
Fig. 10(a) and Fig. 10(b) show scatter plots of the offsets estimated by TS and by the proposed NS, respectively. Different colors indicate different device identities. Each point in Fig. 10 represents a frequency-phase pair, i.e., (, ).
In Fig. 10(a), the offsets from the same device tend to cluster over a small frequency range and can be separated by a linear classifier. This implies that the offsets estimated by TS have a certain degree of correlation with the device identities. In fact, some device-dependent information can be lost if and are estimated and then removed from the received signals before the RFF extraction; therefore, the RFF discrimination performance is weakened by using TS. In contrast, as shown in Fig. 10(b), the (, ) pairs from the proposed NS are randomly distributed on the plane. This means that the estimates and obtained by the proposed NS exhibit little dependence on the device identity, suggesting that NS removes device-irrelevant information from the input signals while better retaining device-relevant information for subsequent RFF extraction.
V-D Performance Versus SNR
In this subsection, we further investigate the robustness of the proposed method with respect to noise. For this purpose, we artificially add random noise at different signal-to-noise ratios (SNRs) to the input signal and investigate how the performance varies with the SNR. Noise levels of SNR = dB are considered. We also retrain all the models with data augmentation using random SNRs from 5 to 30 dB. The results are presented in Fig. 8.
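The noise-injection step described above can be sketched as follows, assuming complex baseband samples and additive white Gaussian noise (AWGN) scaled to hit a target SNR; the function name and interface are illustrative:

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add complex white Gaussian noise to a signal at a target SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    sig_power = np.mean(np.abs(signal) ** 2)          # empirical signal power
    noise_power = sig_power / (10 ** (snr_db / 10))   # power needed for target SNR
    noise = np.sqrt(noise_power / 2) * (              # split power over I and Q
        rng.standard_normal(signal.shape)
        + 1j * rng.standard_normal(signal.shape)
    )
    return signal + noise
```

The same routine, driven with SNRs drawn uniformly from 5 to 30 dB, also serves as the data-augmentation step used when retraining the models.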
Again, we see that end-to-end learning methods are much more robust to noise than preprocessing-based methods, even in closed-set and weakly open-set (i.e., without device aging) settings. The reason why traditional preprocessing methods are not robust may be their inability to achieve high mutual information (here is the routine for computing the RFF), as we previously analyzed. This low mutual information causes the system to be sensitive to injected noise, an issue familiar to the communication community. End-to-end learning methods, by contrast, can better attain high mutual information between the extracted RFF and the device identity, and hence are much less sensitive to noise.
It is worth highlighting that the performance drop due to injected noise in TS-based methods is not comparable to that due to a changed channel (i.e., device aging). This again confirms that TS-based methods tend to overfit to variations in the channels.
Types | Filters in NS | Filters in RFF extractor
BCNN+HP | N/A | 8, 12, 16, 24, 32, 35, 40, 45
NS+BCNN+HP | 4, 8, 12, 16 | 8, 12, 16, 24, 32
V-E Comparison of Varied Complexity
In this subsection, we conduct a final experiment that studies the effects of model complexity and the effectiveness of NS. We construct the model architectures by traversing the combinations of values for and in Table IV. Here, and control the number of filters of the neural networks in the NS module and the RFF extractor, respectively.
The resulting scatter plots are shown in Fig. 9. The blue circles correspond to the proposed NSbased RFF with different levels of complexity, while the orange triangles correspond to the purely DLbased RFFs.
As seen in Fig. 9, for each test set, the typical BCNN without NS must become considerably more complex to achieve satisfactory performance, while the NS-based methods converge easily with only 9.9 M parameters (, ).
We argue that this is because of the signal processing priors introduced by synchronization. The synchronization preprocessing restricts the neural network’s model space and thereby significantly reduces the difficulty of the learning task. This is also why the proposed NS-based RFF performs better in more general situations.
V-F Performance of TS with Offsets as Additional Features
One may wonder whether the gain of the proposed NS-based RFF is simply due to considering the frequency and phase offsets in TS. To answer this question, we train a series of classifiers, each using the corresponding feature set as the training and evaluation data. They include the following four setups:

(, ): A classifier using only the frequency and phase offsets extracted by TS;
: A classifier using only RFFs extracted from the signal compensated by TS;
+ (, ): A classifier using both the TS-based RFFs and the TS-based offsets;
: A classifier using RFFs extracted from the signal compensated by the proposed NS.
Overall, we found that the performance of the TS-based methods is not comparable with that of the NS-based method, even when offsets are included as additional features. In particular, the performance gain achieved by adding (, ) is minor, only around . In comparison, the performance gain from the proposed NS is significantly higher, around . This indicates that (, ) contains little additional information about the device identity, and that the proposed NS does a much better job of preserving device-related information.
One of the underlying reasons why (, ) is less informative about the device identity is the imprecise estimation of and . Note that TS estimates the frequency and phase offsets by comparing the received signal with an ideal signal, which means that the noise corrupting the received signal also affects the precision of TS. When the received signal is synchronized with such imprecise and , estimation errors are introduced into the processed signal, resulting in a loss of device-relevant information.
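To illustrate why receiver noise propagates into the TS estimates, a minimal data-aided estimator of the frequency and phase offsets can be sketched as follows. The signal model, function name, and interface are assumptions for illustration; the exact TS routine may differ.

```python
import numpy as np

def estimate_offsets(received, ideal, ts):
    """Data-aided estimation of carrier frequency and phase offsets.

    received: received preamble samples (complex baseband)
    ideal:    known ideal preamble (complex baseband)
    ts:       sampling interval in seconds
    Assumed model: received[n] = ideal[n] * exp(j*(2*pi*f*n*ts + phi)) + noise.
    """
    c = received * np.conj(ideal)  # strip the known data, keep the rotation
    # frequency offset from the average phase increment between samples
    f_hat = np.angle(np.sum(c[1:] * np.conj(c[:-1]))) / (2 * np.pi * ts)
    n = np.arange(len(c))
    # phase offset after de-rotating the estimated frequency
    phi_hat = np.angle(np.sum(c * np.exp(-2j * np.pi * f_hat * n * ts)))
    return f_hat, phi_hat
```

Since both estimates are phase angles of noisy sums, any noise on the received samples perturbs them directly; compensating the signal with the perturbed estimates then injects those errors into the preprocessed signal, as argued above.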
VI Conclusion
This paper has proposed a new model-and-data-driven framework for open-set PLA. Traditional preprocessing techniques such as TS have been widely used for RFF extraction; however, such preprocessing may cause a loss of information about the device identity, according to the data processing inequality. Based on this observation, the proposed framework uses a “neural” generalization of carrier synchronization, referred to as NS, as a preprocessing module. This module is an essential part of the proposed end-to-end deep learning framework, introducing the inductive bias of signal processing models. We also proposed a hyperspherical representation to further improve the quality of RFF identification. Experimental results show that TS-based methods tend to extract weak features, i.e., channel distinctions, rather than device-inherent features. In contrast, the proposed NS module and hypersphere representation, together with the proposed end-to-end training framework, minimize information corruption and reduce the difficulty of the RFF learning task. The resulting RFF not only performs well on known devices but also generalizes to unknown devices and unknown channels.
Some challenging tasks remain for future work: 1) for deployment in real-world scenarios, the complexity of the RFF extractor can be further reduced by recent model compression techniques [51, 52]; 2) the proposed scheme is developed only for single-input single-output (SISO) systems; how to devise a multiple-input multiple-output (MIMO) version of NS is another interesting and challenging research topic.
Appendix A Proof of Theorem 1
Given the Markov chain in (27), let us begin with :
(32)  
where
denotes the entropy of , which is a constant and does not affect the optimization. The density in (32) is fully defined by the proposed RFF extractor and the given Markov chain as
(33) 
However, computing (33) is intractable. To accurately estimate , we use an auxiliary classifier as a variational approximation of . Given the Kullback-Leibler divergence between and , it follows that
(34)  
By substituting (34) into (32), we have
(35)  
References
 [1] Y. C. Chen, “Fully incrementing visual cryptography from a succinct nonmonotonic structure,” IEEE Trans. Inf. Forensics Security, vol. 12, no. 5, pp. 1082–1091, May 2016.
 [2] M. Iwamoto, K. Ohta, and J. Shikata, “Security formalizations and their relationships for encryption and key agreement in informationtheoretic cryptography,” IEEE Trans. Inf. Theory, vol. 64, no. 1, pp. 654–685, Jan. 2017.
 [3] H. Fang, X. Wang, and L. Hanzo, “Learningaided physical layer authentication as an intelligent process,” IEEE Trans. Commun., vol. 67, no. 3, pp. 2260–2273, Nov. 2018.
 [4] B. Danev, D. Zanetti, and S. Capkun, “On physicallayer identification of wireless devices,” ACM Comput. Surv., vol. 45, no. 1, pp. 1–29, Dec. 2012.
 [5] W. Hou, X. Wang, J.Y. Chouinard, and A. Refaey, “Physical layer authentication for mobile systems with timevarying carrier frequency offsets,” IEEE Trans. Commun., vol. 62, no. 5, pp. 1658–1667, Apr. 2014.
 [6] L. Peng, J. Zhang, M. Liu, and A. Hu, “Deep learning based RF fingerprint identification using differential constellation trace figure,” IEEE Trans. Veh. Technol., vol. 69, no. 1, pp. 1091–1095, Oct. 2019.
 [7] K. Youssef, L. Bouchard, K. Haigh, J. Silovsky, B. Thapa, and C. Vander Valk, “Machine learning approach to RF transmitter identification,” IEEE J. Radio Freq. Identification, vol. 2, no. 4, pp. 197–205, Nov. 2018.
 [8] L. Peng, A. Hu, J. Zhang, Y. Jiang, J. Yu, and Y. Yan, “Design of a hybrid RF fingerprint extraction and device classification scheme,” IEEE Internet Things J., vol. 6, no. 1, pp. 349–360, May 2018.
 [9] J. Yu, A. Hu, G. Li, and L. Peng, “A multisampling convolutional neural networkbased RF fingerprinting approach for lowpower devices,” in INFOCOM. Paris, France. IEEE, Mar. 2019, pp. 1–6.
 [10] H. J. Patel, M. A. Temple, and R. O. Baldwin, “Improving ZigBee device network authentication using ensemble decision tree classifiers with radio frequency distinct native attribute fingerprinting,” IEEE Trans. Rel., vol. 64, no. 1, pp. 221–233, Mar. 2014.
 [11] N. T. Nguyen, G. Zheng, Z. Han, and R. Zheng, “Device fingerprinting to enhance wireless security using nonparametric bayesian method,” in Proc. IEEE INFOCOM, Jun. 2011, pp. 1404–1412.
 [12] P. Robyns, E. Marin, W. Lamotte, P. Quax, D. Singelée, and B. Preneel, “Physicallayer fingerprinting of LoRa devices using supervised and zeroshot learning,” in Proc. 10th ACM Conf. Security Privacy Wireless Mobile Netw., Jul. 2017, pp. 58–63.
 [13] T. D. VoHuu, T. D. VoHuu, and G. Noubir, “Fingerprinting WiFi devices using software defined radios,” in Proc. 9th ACM Conf. Security Privacy in Wireless Mobile Netw, Darmstadt, Germany, Jul. 2016, pp. 3–14.
 [14] K. Merchant, S. Revay, G. Stantchev, and B. Nousain, “Deep learning for RF device fingerprinting in cognitive communication networks,” IEEE J. Sel. Top. Signal Process., vol. 12, no. 1, pp. 160–167, Jan. 2018.
 [15] G. Huang, Y. Yuan, X. Wang, and Z. Huang, “Specific emitter identification based on nonlinear dynamical characteristics,” Canadian J. Electr Comput. Eng., vol. 39, no. 1, pp. 34–41, Dec. 2016.
 [16] J. Yu, A. Hu, G. Li, and L. Peng, “A robust RF fingerprinting approach using multisampling convolutional neural network,” IEEE Internet Things J., vol. 6, no. 4, pp. 6786–6799, Apr. 2019.
 [17] W. Wang, Z. Sun, S. Piao, B. Zhu, and K. Ren, “Wireless physicallayer identification: Modeling and validation,” IEEE Trans. Inf. Forensics Security, vol. 11, no. 9, pp. 2091–2106, Sep. 2016.
 [18] J. Toonstra and W. Kinsner, “A radio transmitter fingerprinting system ODO-1,” in Proc. IEEE Can. Conf. Elect. Comput. Eng., May 1996, pp. 60–63.
 [19] D. A. Knox and T. Kunz, “Practical RF fingerprints for wireless sensor network authentication,” in Proc. Int. Wireless Commun. Mobile Computing Conf. (IWCMC), Cyprus, Aug. 2012, pp. 531–536.
 [20] J. Hall, “Detection of rogue devices in wireless networks,” Ph.D. dissertation, Carleton University, Aug. 2006.
 [21] J. Hall, M. Barbeau, and E. Kranakis, “Enhancing intrusion detection in wireless networks using radio frequency fingerprinting.” in Proc. Third IASTED Int. Conf. Commun. Internet Inf. Technol. Eng, Jan. 2004, pp. 201–206.
 [22] J. Hall, M. Barbeau, and E. Kranakis, “Radio frequency fingerprinting for intrusion detection in wireless networks,” IEEE Trans. Dependable Secur. Comput., vol. 12, pp. 1–35, Jan. 2005.
 [23] V. Brik, S. Banerjee, M. Gruteser, and S. Oh, “Wireless device identification with radiometric signatures,” in Proc. 14th ACM Int. Conf. Mobile Comput. Netw. (MobiCom), New York, NY, USA, Sep. 2008, pp. 116–127.
 [24] D. A. Knox and T. Kunz, “AGC-based RF fingerprints in wireless sensor networks for authentication,” in Proc. IEEE Int. Symp. WoWMoM, Montreal, QC, Canada, Aug. 2010, pp. 1–6.
 [25] Q. Tian, Y. Lin, X. Guo, J. Wen, Y. Fang, J. Rodriguez, and S. Mumtaz, “New security mechanisms of highreliability IoT communication based on radio frequency fingerprint,” IEEE Internet Things J., vol. 6, no. 5, pp. 7980–7987, Oct. 2019.
 [26] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep Learning. MIT press Cambridge, 2016.

 [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
 [28] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, Jun. 2018.
 [29] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, pp. 354–359, Oct. 2017.
 [30] Y. Chen, D. Zhang, M. Gutmann, A. Courville, and Z. Zhu, “Neural approximate sufficient statistics for implicit models,” arXiv preprint arXiv:2010.10079, Oct. 2020.
 [31] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. 32nd Int. Conf. Mach. Learn., May 2015, pp. 1530–1538.
 [32] S. S. Gu, Z. Ghahramani, and R. E. Turner, “Neural adaptive sequential monte carlo,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Dec. 2015, pp. 2629–2637.

 [33] X. Lu, L. Xiao, T. Xu, Y. Zhao, Y. Tang, and W. Zhuang, “Reinforcement learning based PHY authentication for VANETs,” IEEE Trans. Veh. Technol., vol. 69, no. 3, pp. 3068–3079, 2020.
 [34] C. Geng, S.-J. Huang, and S. Chen, “Recent advances in open set recognition: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., Mar. 2020.
 [35] S. Hanna, S. Karunaratne, and D. Cabric, “Open set wireless transmitter authorization: deep learning approaches and dataset considerations,” IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 1, pp. 59–72, Mar. 2021.
 [36] S. Hanna, S. Karunaratne, and D. Cabric, “Deep learning approaches for open set wireless transmitter authorization,” in IEEE 21st Int. Workshop Signal Process. Adv. Wireless Commun, May 2020, pp. 1–5.
 [37] “IEEE standard for lowrate wireless networks,” in IEEE Std 802.15.42015 (Revision of IEEE Std 802.15.42011), Apr. 22, Apr. 2015, pp. 61–69.
 [38] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” in arXiv:1609.03499v2, Sep. 2016, pp. 1–12.
 [39] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” in Proc. Int. Conf. Learn. Representations, May 2014.
 [40] Y. Yu, Z. Gong, P. Zhong, and J. Shan, “Unsupervised representation learning with deep convolutional neural network for remote sensing images,” in Proc. Int. Conf. Image Graph. New York, NY, USA: Springer, Sep. 2017, pp. 97–108.

 [41] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE Int. Conf. Comput. Vision, Feb. 2015, pp. 1026–1034.
 [42] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, Oct. 2016, pp. 630–645.
 [43] Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in Eur. Conf. Comput. Vision. Springer, Jul. 2017, pp. 499–515.
 [44] R. Ranjan, C. D. Castillo, and R. Chellappa, “L2constrained softmax loss for discriminative face verification,” arXiv preprint arXiv:1703.09507, Jun. 2017.

 [45] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “ArcFace: Additive angular margin loss for deep face recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 4690–4699.
 [46] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
 [47] R. Xie and W. Xu. (2021) NSRFF. [Online]. Available: https://github.com/xrjcom/NSRFF
 [48] R. Xie and W. Xu. (2020) MarvelToolbox. [Online]. Available: https://github.com/xrjcom/marveltoolbox
 [49] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, Dec. 2014.
 [50] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognit. Lett., vol. 27, no. 8, pp. 861–874, Jun. 2006.
 [51] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” in Proc. Int. Conf. Learn. Representations, May 2019.
 [52] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proc. Int. Conf. Mach. Learn., vol. 97. PMLR, Jun. 2019, pp. 6105–6114.