I Introduction
With the exponential growth of the Internet of Things (IoT), billions of new wireless devices are being deployed across the world every year [1]. The sheer number of devices means that the security systems that authenticate them must become cheaper, more secure and more robust. While traditional cryptography-based authentication systems have been the mainstay of wireless authentication, the unique requirements of IoT devices call for alternative methods that are sensitive to their computation and power constraints.
Passive Physical Layer Authentication (passive PLA) has been proposed as a low-overhead authentication method that requires little to no work on the part of the transmitter [2]. Here, the authenticator uses channel state information and fingerprints due to hardware impairments to identify transmitters. Recently, research on passive PLA that uses deep learning techniques has been gaining momentum. Most such techniques process raw IQ samples from transmitters to extract features that are used to build classifiers. Since deep-learning-based classifiers tend to extract better features, these approaches have been shown to outperform others that use handcrafted features, reaching markedly higher accuracies [3]. However, deep-learning-based classifiers are inherently vulnerable to so-called adversarial examples. For example, the existence of targeted adversarial examples has been pointed out [4]: given a valid input, a classifier and a target class, it is possible to find a perturbed input that the classifier assigns to the target class while the size of the perturbation is minimized. In this light, it is critical that deep-learning-based PLA systems be analyzed for their vulnerabilities.
Radio fingerprints are usually considered hard to reproduce or replay because the replaying device suffers from its own impairments, which disturb the features in the RF fingerprint. As such, naive replay attacks have limited success; only very recently has this problem been approached in more methodical ways. In [5], Generative Adversarial Networks are used to train a spoofing device. While the method shows promising results, it relies on being able to place an adversarial receiver near the authenticating receiver, and only considers the case of a single authorized transmitter. Furthermore, it is only verified through simulations, and little information is given about how the transmitter fingerprints were simulated. In [6], this problem was investigated from a variety of angles, considering targeted and untargeted adversarial attacks, where the adversary tries to make signals from a target transmitter be recognized either as another specified transmitter (targeted) or as any transmitter other than itself (untargeted). They showed that spoofing can be done with high accuracy both when the full gradients and the activations of the classifier are known, and when only the activations of the final layer are known. We consider the availability of either to be practically unrealistic: the most we can expect from the authenticator is binary feedback, such as an ACK or NACK denoting its decision. Although such 1-bit feedback is technically compatible with their approach, the efficacy of the proposed method was not evaluated in that regard. Additionally, their approach relies on signals from authorized transmitters being available to the adversary, and was only verified through training on offline data.
Inspired by this past work, we explore an adversarial attack that expects at most binary feedback from the authenticator; achieves high fooling rates in realistic channel conditions under a wide range of signal-to-noise ratios (SNRs); and attacks in real time through online training. In this paper, we formulate this problem as a reinforcement learning problem and propose the use of policy gradient methods to spoof transmitters in a wireless network. Our results show that by distorting the IQ samples of an adversarial transmitter before transmission, subject to a maximum distortion level, it is possible to fool a deep-learning-based authenticator with high success rates even at low SNR and even when the only information available about the authenticator is the binary feedback received from it.
II System Model
We consider a wireless environment in which a set of transmitters is authorized to transmit to a single receiver. The receiver is equipped with a pretrained neural-network-based authenticator that uses raw IQ samples of the received signals to make a binary authentication decision at the physical layer, indicating whether the signal under consideration is from an authorized transmitter or not. There is an adversarial transmitter that wants to gain access to the receiver, and it tries to do this by impersonating one of the authorized transmitters. The adversary employs a generator whose purpose is to distort the complex IQ samples of an input discrete-time signal such that, after transmission, the signal is classified as authenticated at the receiver. We assume that the receiver sends its authentication decision for each received signal back to the transmitter that sent it. The adversary is also aware of the modulation being used by the authorized transmitters. This is visualized in Fig. 1.
In a wireless communication system, there are three main sources of nonlinearities imparted on the intended transmitted signal: the signal at the end of the receiver chain is the signal at the beginning of the transmitter chain after passing, in order, through the nonlinearities introduced by the transmitter hardware, the channel and the receiver hardware. Since the channel is variable, the channel nonlinearity is not a good fingerprint, and since there is only one receiver, the receiver nonlinearity is effectively the same for all transmitters. Therefore, the authenticator will discriminate transmitters based on the transmitter hardware nonlinearity.
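We will try to spoof one of the authorized transmitters using only the feedback from the receiver. If we assume that the adversarial transmitter can be placed reasonably close to the authorized transmitters, the channel nonlinearity can be assumed to have the same variability for the adversary as for the authorized transmitters. Such an assumption is justifiable when the authorized transmitters are close together and reasonably far away from the receiver. This ensures that the authenticator discriminates signals from both the adversary and the authorized transmitters based on the transmitter hardware nonlinearity.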
III Proposed Solution
We model this problem as a Markov Decision Process (MDP). An MDP is characterized by an agent and an environment that interact at each discrete time step: the agent selects an action according to a policy that takes into account the environment's state and, in response to that action, receives a numerical reward from the environment and transitions to the next state [7]. We model the generator, represented by a neural network parameterized by its weights and biases, as the policy of the agent. The action is the distorted value of the current input IQ sample, and the reward is the binary feedback received from the authenticator, which is part of the environment. The state can be represented in a number of ways, which are discussed later in this section.
Assume the agent has collected a trajectory of a given length, defined as a sequence of states, actions, and rewards. The goal is then to tune the parameters of the policy under the following optimization problem:
(1)
Here the objective is a metric of the policy's performance, which is simply the cumulative reward of the trajectory discounted by a discount factor. To solve this, we can use a policy gradient method: we repeatedly estimate the gradient of the expected value of this metric with respect to the policy parameters and use that estimate to update them. To estimate the gradients, we use a score function gradient estimator. With the introduction of a baseline to reduce variance, an estimate of the gradient is [8]
(2)
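For reference, the objective in (1) and the gradient estimate in (2) are commonly written as follows in the policy-gradient literature [7], [8]. The symbols used here ($\theta$ for the policy parameters, $\pi_\theta$ for the policy, $s_t$, $a_t$ and $r_t$ for the state, action and reward at step $t$, $\gamma$ for the discount factor, $T$ for the trajectory length, $\tau$ for the trajectory and $b$ for the baseline) are assumed notation chosen for illustration, so the exact form used in this work may differ.

```latex
\begin{align}
  \max_{\theta}\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ J(\tau) \right],
  \qquad J(\tau) &= \sum_{t=0}^{T-1} \gamma^{t}\, r_t \\
  \hat{g} &= \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
  \left( \sum_{t'=t}^{T-1} \gamma^{\,t'-t}\, r_{t'} - b(s_t) \right)
\end{align}
```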
Now, the policy update can be done with any gradient ascent algorithm (e.g., adding the gradient estimate, scaled by a learning rate, to the parameters). This can be repeated for a number of iterations, with a trajectory collected for each iteration, until the policy converges to a satisfactory state. This algorithm also allows the baseline to be trained along with the policy [8].
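The update loop described above can be illustrated with a minimal sketch. It is a toy, not the method's implementation: the "policy" is a one-dimensional Gaussian whose per-step means are the trainable parameters, and the function names (`discounted_returns`, `policy_gradient_estimate`, `gradient_ascent_step`) are hypothetical.

```python
def discounted_returns(rewards, gamma):
    """Discounted reward-to-go for every time step of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def gaussian_score_wrt_mean(action, mean, std):
    """Derivative of log N(action; mean, std^2) with respect to the mean."""
    return (action - mean) / (std ** 2)


def policy_gradient_estimate(actions, means, std, rewards, gamma, baseline=0.0):
    """Score-function gradient estimate for each per-step mean (assumed toy policy)."""
    returns = discounted_returns(rewards, gamma)
    return [
        gaussian_score_wrt_mean(a, m, std) * (g - baseline)
        for a, m, g in zip(actions, means, returns)
    ]


def gradient_ascent_step(params, grads, learning_rate):
    """Plain gradient ascent; any other ascent rule (e.g. Adam) could be swapped in."""
    return [p + learning_rate * g for p, g in zip(params, grads)]
```

In this toy setting, an action that landed above the mean and earned a return larger than the baseline pushes the mean upward, which is the behaviour the score-function estimator is meant to produce.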
There are practical considerations when designing the state and the action. For example, depending on the main type of distortion that the impersonator tries to mimic, different definitions of the state can be used:
1. The state is a vector containing the real and imaginary parts of the most recent IQ sample of the input signal. This is applicable when the distortion over each sample is independent of the other samples. For example, the distortion imparted by the power amplifier in the RF chain will have this property [9].
2. The state consists of the most recent IQ sample together with the hidden state of the generator, when the generator is modeled as a recurrent neural network. This state can in theory apply to any type of nonlinearity.
Irrespective of the particular definition of the state, the action is expected to reflect the distorted value of the most recent IQ sample; hence, the action is also a two-dimensional real vector.
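Although a reward is required for each action, a signal transmitted by the adversary is usually a sequence of complex symbols (effectively a real vector), and hence a reward cannot be obtained immediately for each action. So we set the trajectory length equal to the number of samples in the signal and transmit the full sequence of actions as the signal to get the feedback. To estimate the reward of an intermediate action from this end-of-signal feedback, a Monte Carlo search can be performed from the current time step until the end of the trajectory, using a rollout policy [10]. Specifically, at a given time step, with the actions taken so far held fixed, the rest of the trajectory is sampled from the rollout policy to produce a simulated complete signal. Then, for any time step, the reward can be written as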
(3) 
where, for every time step before the last, we average over several Monte Carlo searches because the rollout policy is stochastic. In this approach, the rollout policy is periodically updated to be the same as the generator policy; however, considering the large number of Monte Carlo searches expected to be run, we can use a faster (and possibly less accurate) function approximator for the rollout policy [11].
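The per-step reward estimation described above can be sketched as follows. This is an illustrative sketch under stated assumptions: `rollout_policy` and `feedback` are hypothetical callables standing in for the rollout policy and for the binary decision returned by the authenticator, and actions are plain numbers rather than IQ pairs.

```python
import random


def estimate_step_rewards(actions, rollout_policy, feedback, num_searches, rng):
    """Estimate a reward for every action of one trajectory.

    actions        -- the actions actually chosen by the policy (one per step)
    rollout_policy -- hypothetical callable (prefix, rng) -> next action
    feedback       -- hypothetical callable (complete signal) -> 0 or 1
    num_searches   -- number of Monte Carlo completions averaged per step
    """
    horizon = len(actions)
    rewards = []
    for t in range(horizon):
        if t == horizon - 1:
            # The last step completes the real signal, so its feedback is used directly.
            rewards.append(float(feedback(actions)))
            continue
        total = 0.0
        for _ in range(num_searches):
            # Keep the first t + 1 actions and let the rollout policy finish the signal.
            simulated = list(actions[: t + 1])
            while len(simulated) < horizon:
                simulated.append(rollout_policy(simulated, rng))
            total += feedback(simulated)
        rewards.append(total / num_searches)
    return rewards
```

Note that every simulated completion costs one feedback query, which is why the number of searches per step directly drives the number of feedbacks the attack needs.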
We now add several optimizations to the method proposed above. First, to encourage exploration, we introduce entropy regularization [12]: the agent gets a bonus reward at each time step proportional to the entropy of the policy at that time step; i.e., the objective in (1) changes to
(4) 
where the entropy coefficient controls the size of the bonus (a higher coefficient means more exploration). Also, in practice, the components of a symbol cannot be distorted arbitrarily, as the decodability of the signal at the receiver must be ensured. To integrate this constraint, we impose an action space limitation (clipping):
(5) 
This effectively means that the maximum distortion level allowed is relative to the input state.
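The two optimizations can be illustrated with a short sketch. Both the entropy bonus and, in particular, the clipping rule below are assumptions: the policy is taken to be a diagonal Gaussian, and each action component is kept within a fraction `max_distortion` of the magnitude of the matching input component, which is one plausible reading of a distortion limit "relative to the input state". The names `beta` and `max_distortion` are hypothetical.

```python
import math


def diagonal_gaussian_entropy(stds):
    """Differential entropy of a Gaussian with independent components."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)


def entropy_regularized_reward(reward, stds, beta):
    """Reward plus an exploration bonus proportional to the policy entropy."""
    return reward + beta * diagonal_gaussian_entropy(stds)


def clip_action(action, state, max_distortion):
    """Clip each action component to within a relative distance of the input.

    Assumed rule: |a_i - s_i| <= max_distortion * |s_i| for each component i.
    """
    clipped = []
    for a, s in zip(action, state):
        bound = max_distortion * abs(s)
        clipped.append(min(max(a, s - bound), s + bound))
    return clipped
```

For example, with `max_distortion = 0.25` an input component of 1.0 can be moved anywhere in [0.75, 1.25] but no further.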
Algorithm 1 summarizes the method proposed above. Note that the generator is initially trained with a Mean Squared Error (MSE) loss on a set of captured signals, such that at the beginning there is no distortion (the generator acts as an autoencoder). The rollout policy is initialized to the generator policy and updated to match it periodically, after the generator has been trained for a fixed number of iterations. This process is repeated for a fixed number of steps.
IV Experimental Evaluation
This section is divided into three parts: Section IV-A details the simulation environment and the hardware testbed used for the evaluation, as well as the choices for different parameters; Section IV-B presents the neural network architectures used for the authenticator and the generator; and Section IV-C describes the four experiments conducted and the results obtained.
IV-A Setup and parameters
The proposed method was first evaluated on a simulated wireless environment written in Python. Power amplifier nonlinearities are modeled by a Volterra series whose coefficients are unique to each transmitter and generated to follow a nonlinear curve. Every transmitted packet of data consisted of completely random bits. We use QPSK modulation and root-raised-cosine (RRC) pulse shaping with 0.2 excess bandwidth.
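As a rough illustration of the transmitter side of such a simulation, the sketch below maps random bits to QPSK symbols and passes them through a memoryless odd-order polynomial, which is a simple special case of a Volterra series. The bit-to-symbol mapping, the polynomial form and the coefficient values are all assumptions made for this example; pulse shaping is omitted for brevity.

```python
import math
import random


def qpsk_modulate(bits):
    """Map pairs of bits to unit-energy QPSK symbols (assumed mapping: 0 -> +, 1 -> -)."""
    if len(bits) % 2 != 0:
        raise ValueError("QPSK needs an even number of bits")
    scale = 1.0 / math.sqrt(2.0)
    return [
        complex((1 - 2 * bits[i]) * scale, (1 - 2 * bits[i + 1]) * scale)
        for i in range(0, len(bits), 2)
    ]


def power_amplifier(samples, coeffs):
    """Memoryless polynomial model: y = sum_k coeffs[k] * x * |x|^(2k)."""
    distorted = []
    for x in samples:
        mag_sq = abs(x) ** 2
        y = 0j
        for k, c in enumerate(coeffs):
            y += c * x * (mag_sq ** k)
        distorted.append(y)
    return distorted


# Example: one hypothetical transmitter with made-up coefficients.
rng = random.Random(1)
packet_bits = [rng.randint(0, 1) for _ in range(16)]
tx_signal = power_amplifier(qpsk_modulate(packet_bits), [1.0, -0.1])
```

Giving each simulated transmitter its own coefficient list is what produces a distinct, repeatable fingerprint in this kind of model.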
Two channel models were investigated. The first one is a simple additive white Gaussian noise (AWGN) channel. The second one is a dynamic channel, which includes a set of more realistic impairments: timing errors, frequency errors, fading, inter-symbol interference and noise. The timing error is simulated by interpolating the signal by a factor of 32, choosing a random offset, and then downsampling. The frequency error is obtained by multiplying the signal with a complex exponential whose frequency is drawn from a Gaussian distribution with zero mean and 1 kHz standard deviation. For fading and inter-symbol interference, a three-tap channel was used along with a Rayleigh coefficient of scale 0.5.
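A dynamic channel of this kind could be sketched as below. Several details are assumptions rather than facts from the text: linear interpolation stands in for the interpolation filter, the sample rate is a free parameter, every tap is drawn with a Rayleigh magnitude of scale 0.5 and a uniform phase, and the impairments are applied in the order timing, frequency, multipath, noise.

```python
import cmath
import math
import random


def timing_offset(samples, offset, factor=32):
    """Fractional delay of offset/factor samples via linear interpolation.

    Equivalent to linearly interpolating by `factor`, skipping `offset`
    samples and then keeping every `factor`-th sample.
    """
    frac = offset / factor
    last = len(samples) - 1
    return [
        (1.0 - frac) * samples[n] + frac * samples[min(n + 1, last)]
        for n in range(len(samples))
    ]


def frequency_offset(samples, offset_hz, sample_rate_hz):
    """Multiply by a complex exponential at the given offset frequency."""
    return [
        x * cmath.exp(2j * math.pi * offset_hz * n / sample_rate_hz)
        for n, x in enumerate(samples)
    ]


def multipath(samples, taps):
    """Causal FIR channel, truncated to the input length."""
    out = []
    for n in range(len(samples)):
        acc = 0j
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out


def awgn(samples, snr_db, rng):
    """Add complex white Gaussian noise at the requested SNR."""
    power = sum(abs(x) ** 2 for x in samples) / len(samples)
    sigma = math.sqrt(power / (10.0 ** (snr_db / 10.0)) / 2.0)
    return [x + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for x in samples]


def rayleigh_tap(scale, rng):
    """Complex tap with Rayleigh-distributed magnitude and uniform phase (assumed)."""
    magnitude = scale * math.sqrt(-2.0 * math.log(1.0 - rng.random()))
    return magnitude * cmath.exp(2j * math.pi * rng.random())


def dynamic_channel(samples, snr_db, sample_rate_hz, rng):
    """Apply timing error, frequency error, three-tap fading and noise."""
    out = timing_offset(samples, rng.randrange(32))
    out = frequency_offset(out, rng.gauss(0.0, 1000.0), sample_rate_hz)
    out = multipath(out, [rayleigh_tap(0.5, rng) for _ in range(3)])
    return awgn(out, snr_db, rng)
```

The simple channel corresponds to calling `awgn` alone on the transmitted samples.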
The second state definition in Section III was used (the state includes the hidden state of the generator). The discount factor was set to 1 (undiscounted). Gradient ascent on the policy parameters was done with the Adam optimizer, with the default configuration provided in the Keras API for TensorFlow, except that the learning rate was annealed starting from 0.001 to ensure convergence. We do not use any baseline function to train the generator, as entropy regularization and clipping already allow us to train the generator successfully. The values of the SNR, the number of authorized transmitters and the maximum distortion level were varied across tests.
Finally, to test our attack on real hardware, we created a testbed consisting of 8 Analog Devices ADALM-Pluto Software Defined Radios (SDRs); for convenience of operation, all were connected to a single computer. 6 SDRs were designated as authorized transmitters, 1 as an unauthorized transmitter and the remaining one as the receiver, as shown in Fig. 9. The Python module pyadi-iio was used to interface with the SDRs.
IV-B Neural Network Architectures of the Authenticator and the Generator
We used a binary discriminator architecture for the authenticator, which has been shown to perform well in [13] for similar transmitter-fingerprinting-based classifications. It comprises a feature extractor, made up of a series of residual blocks with different numbers of filters, and a classifier block; the architecture of each type of block is shown in Fig. 4. Note that it produces a scalar output through a sigmoid activation; when providing binary feedback, this was thresholded at 0.5 to get a binary value (1 if greater than 0.5 and 0 otherwise). L2 regularization was used in the dense layers with weights of either 0.001 or 0.002 to avoid overfitting.
When using the simple channel, the IQ samples of the raw signal were passed to the discriminator classifier in Fig. 4 without any preprocessing (each signal being a vector). However, when using the dynamic channel model, this approach yielded poor discriminators with high fooling rates to begin with. So we first calculated the Discrete Fourier Transform of the raw signal, took the magnitude of the result, and reshaped it into a 2D signal before feeding it to the discriminator. This preprocessing stage was chosen as it has been shown to produce superior results in similar transmitter-fingerprinting-based classifications [9]. Fig. 5 and Fig. 6 depict this pictorially.
The architecture of the generator (and of the rollout policy) is shown in Fig. 3. Following the input, an LSTM layer with an output dimensionality of 100 was used, with the default configuration provided in Keras. Its outputs were modeled as the mean and the diagonal covariance of a two-dimensional Gaussian distribution (one dimension each for the real and imaginary parts of the IQ samples), which was sampled to obtain the action and used to find the action probability (when calculating gradients) and the entropy.
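How an action and its log-probability can be obtained from such a Gaussian output head is sketched below. The network itself is left out: `mean` and `std` are assumed to be the two-dimensional outputs already produced by it, and the function names are hypothetical.

```python
import math
import random


def sample_action(mean, std, rng):
    """Draw one (I, Q) action from a Gaussian with independent components."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


def log_prob(action, mean, std):
    """Log-density of the action under the diagonal Gaussian policy."""
    total = 0.0
    for a, m, s in zip(action, mean, std):
        total += -0.5 * math.log(2.0 * math.pi * s * s) - (a - m) ** 2 / (2.0 * s * s)
    return total


# Example with made-up head outputs for a single time step.
rng = random.Random(0)
action = sample_action([0.7, -0.7], [0.05, 0.05], rng)
score_weight = log_prob(action, [0.7, -0.7], [0.05, 0.05])
```

The log-probability is the quantity whose gradient with respect to the network parameters enters the score-function estimator.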
IV-C Results
In this section, we report the results of four experiments; Experiments 1–3 are conducted on the simulated environment and Experiment 4 is conducted on the hardware testbed.
For Experiment 1, we used a set of 10 authorized transmitters and a fixed maximum distortion level. Then, for five SNR values, we evaluated the fooling rate of the generator at convergence for both channel models. The results are shown in Fig. 7. The dashed lines show the initial fooling rate; for the simple channel, it starts at around 7% for low SNR and decreases to near 0% for higher SNRs. For the dynamic channel, the initial fooling rates are higher but still less than 10% for even moderately high SNRs. This shows that the discriminator performs very well at the beginning (except at 5 dB SNR for the dynamic channel). Even at very low SNR, significant increases in fooling rate can be achieved, with near 100% fooling rates at and above 20 dB SNR for both types of channels. Although slightly higher fooling rates are achieved for the dynamic channel at certain SNRs, this should be put in perspective with the higher initial fooling rates of the discriminator in the dynamic channel; in fact, the simple channel gives a higher relative improvement. Fig. 7 also reports the corresponding convergence time, measured by the number of gradient updates of the generator (the number of iterations of the inner loop of Algorithm 1). As expected, the algorithm converges faster for higher SNRs, except for the jump from 5 dB to 10 dB. This is because the fooling rate gain at 5 dB is much smaller than at 10 dB, and hence the algorithm achieves that smaller gain in fewer iterations.
For a practical perspective, consider the simple channel at 20 dB SNR with a convergence time of roughly 100 iterations. Since each iteration requires 256 feedbacks, we need 25600 feedbacks in total from the receiver. Although this might seem excessive, assuming the system environment stays fairly static, we can space out the attack over, say, 24 hours to reduce suspicion. This means that we only need a feedback roughly every 3.4 s, which is several orders of magnitude larger than a typical packet transmission time (e.g. 1 ms); if we do not desire a near 100% fooling rate, this time interval can be greatly increased.
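The feedback interval implied by these numbers can be checked with simple arithmetic, assuming exactly 100 iterations of 256 feedbacks each and an attack spread evenly over 24 hours:

```latex
\frac{24 \times 3600\ \text{s}}{100 \times 256}
  = \frac{86400\ \text{s}}{25600}
  = 3.375\ \text{s}
```

Compared with a 1 ms packet, this interval is larger by a factor of about 3400.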
In Experiment 2, we wish to evaluate the effect of the maximum distortion level on the fooling rate; specifically, we seek justification for our intuition that allowing more freedom for distortion should allow the generator to reach higher fooling rates. Fig. 8 shows the results obtained when the fooling rate was evaluated for several values of the maximum distortion level, with otherwise the same settings as in Experiment 1, but only for the simple channel. As expected, a higher maximum distortion level allows a higher fooling rate to be achieved, and a sufficiently high one allows for near 100% fooling rates. This means that by limiting the maximum distortion level, we can still launch a successful attack while keeping the amount of distortion imparted on the transmitted signals at a controlled level.
We seek to understand a fundamental property of our algorithm in Experiment 3: is the generator learning adversarial noise, or is it somehow learning to replicate the fingerprint of one of the authorized transmitters? Unlike with images, where such a question might be answered by visual inspection, we try to answer it numerically with the following experiment: first the generator is allowed to converge on a particular instance of the discriminator, and then the signals it produces are tested on several other realizations of the discriminator, of different neural network architectures, trained to discriminate the same set of authorized transmitters. If the generator had actually learned to replicate RF fingerprints, it should achieve similar fooling rates irrespective of the particular discriminator it is tested upon. To test this, we trained 6 different discriminators; three different architectures, disc, dclass and ova, were used, and two instances were created from each architecture. disc is the binary discriminator architecture described in Section IV-B. dclass and ova are two additional architectures defined and tested in [13] for RF fingerprinting, both sharing the same feature extractor given in Fig. 4 and differing only in the classifier blocks used. dclass consists of a multiclass classifier with one more output than the number of authorized transmitters; all but the last output correspond to the authorized transmitters, and the last one corresponds to outliers. ova has a single feature extractor shared across multiple copies of the binary classifier block in Fig. 4, one per authorized transmitter, with each block indicating whether the signal is from its transmitter or not. Table I reports the fooling rates observed when the generator was allowed to attack one discriminator and its signals were then tested on each of the six discriminators. This was then repeated with a second discriminator as the one under attack. It is clear that while a generator trained on either of these two discriminators could be used to attack the other with practically the same level of effectiveness, the effectiveness drops significantly when it is used against other architectures (although the fooling rate still increases significantly). This confirms the hypothesis we set out to test; that is, the generator learns to produce adversarial examples and does not learn the actual RF fingerprints of the authorized transmitters.

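Table I: Fooling rates of a generator trained against one discriminator (rows) and tested on each of the six discriminators (columns).
Trained against | Tested on 1 | 2     | 3     | 4     | 5     | 6
Discriminator 1 | 0.999       | 0.999 | 0.514 | 0.552 | 0.548 | 0.449
Discriminator 2 | 1           | 1     | 0.276 | 0.893 | 0.217 | 0.359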
For Experiment 4, a binary discriminator was trained offline from a dataset captured on the SDR testbed; each transmitter took turns repeatedly transmitting the same predefined sequence of 256 IQ samples to the receiver, and the signals received at the receiver were collected for each transmitter. Then the impersonator started transmitting (authorized transmitters were inactive), and it was allowed to modify its IQ samples before transmission according to Algorithm 1, using the feedback from the receiver. Note that the SDRs were simply used for over-the-air transmission and reception; all other operations, such as training the attacker and calculating authentication decisions, were done inside the computer. The results obtained, given in Fig. 9, closely resemble the trend suggested in the moderate SNR region in Fig. 8. This means that our experimental results are consistent with the simulation results.
V Conclusion and Future Work
In this paper, we evaluated the feasibility of using policy gradient methods to penetrate a physical layer wireless authentication system that uses a passive deep-learning-based classifier. We introduced an algorithm that adds carefully learned perturbations to the IQ samples transmitted by an adversarial transmitter to fool the authenticator into classifying it as an authorized transmitter. Experiments on a simulated wireless environment and an SDR testbed revealed that it is possible to fool the authenticator with very high fooling rates using surprisingly little information, namely the binary feedback from the authenticator indicating its decision and the modulation and pulse shaping used by the authorized transmitters. We also showed that by limiting the maximum distortion level, the distortion of the impersonator's signals could be kept low while still reaching a high fooling rate. Furthermore, we provided empirical evidence that our approach in fact produces adversarial examples and does not replicate the RF fingerprints of the transmitters.
While we only considered untargeted attacks, the possibility of launching targeted attacks with this method, where we try to impersonate a particular authorized transmitter, still remains. In future work, we expect to present an algorithm for the case when the receiver is non-cooperative (it does not provide feedback endlessly), where an adversarial receiver is used instead to aid the impersonator. We also wish to evaluate the possible defenses that can be put in place against these types of attacks, both proactively and reactively.
Acknowledgments
We wish to thank Samer Hanna (UCLA) for help in implementing the wireless system of the simulation environment.
References
[1] Statista, Number of IoT devices 2015-2025, 2020 (accessed October 30, 2020). https://www.statista.com/statistics/471264/iot-number-of-connected-devices-worldwide/.
[2] W. Wang, Z. Sun, S. Piao, B. Zhu, and K. Ren, “Wireless Physical-Layer Identification: Modeling and Validation,” IEEE Transactions on Information Forensics and Security, vol. 11, pp. 2091–2106, Sept. 2016.
[3] S. Riyaz, K. Sankhe, S. Ioannidis, and K. Chowdhury, “Deep Learning Convolutional Neural Networks for Radio Identification,” IEEE Communications Magazine, vol. 56, pp. 146–152, Sept. 2018.
[4] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” in International Conference on Learning Representations, 2014.
[5] Y. Shi, K. Davaslioglu, and Y. E. Sagduyu, “Generative adversarial network for wireless signal spoofing,” in Proceedings of the ACM Workshop on Wireless Security and Machine Learning, pp. 55–60, 2019.
[6] F. Restuccia, S. D’Oro, A. Al-Shawabka, B. C. Rendon, K. Chowdhury, S. Ioannidis, and T. Melodia, “Hacking the Waveform: Generalized Wireless Adversarial Deep Learning,” arXiv:2005.02270 [cs, eess], May 2020.
[7] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction. Adaptive computation and machine learning series, Cambridge, Massachusetts: The MIT Press, second ed., 2018.
[8] J. Schulman, Optimizing Expectations: From Deep Reinforcement Learning to Stochastic Computation Graphs. PhD thesis, EECS Department, University of California, Berkeley, Dec 2016.
[9] S. S. Hanna and D. Cabric, “Deep learning based transmitter identification using power amplifier nonlinearity,” in 2019 International Conference on Computing, Networking and Communications (ICNC), pp. 674–680, IEEE, 2019.
[10] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[11] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, pp. 484–489, Jan. 2016.
[12] Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans, “Understanding the impact of entropy on policy optimization,” vol. 97 of Proceedings of Machine Learning Research, (Long Beach, California, USA), pp. 151–160, PMLR, 09–15 Jun 2019.
[13] S. Hanna, S. Karunaratne, and D. Cabric, “Deep Learning Approaches for Open Set Wireless Transmitter Authorization,” in 2020 IEEE 21st International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, May 2020. ISSN: 1948-3252.