FedComm: Federated Learning as a Medium for Covert Communication

01/21/2022
by   Dorjan Hitaj, et al.

Proposed as a solution to mitigate the privacy implications related to the adoption of deep learning solutions, Federated Learning (FL) enables large numbers of participants to successfully train deep neural networks without having to reveal the actual private training data. To date, a substantial amount of research has investigated the security and privacy properties of FL, resulting in a plethora of innovative attack and defense strategies. This paper thoroughly investigates the communication capabilities of an FL scheme. In particular, we show that a party involved in the FL training process can use FL as a covert communication medium to send an arbitrary message. We introduce FedComm, a novel covert-communication technique that enables robust sharing and transfer of targeted payloads within the FL framework. Our extensive theoretical and empirical evaluations show that FedComm provides a stealthy communication channel, with minimal disruptions to the training process. Our experiments show that FedComm successfully delivers 100% of a payload in the order of kilobits before the FL procedure converges. Our evaluation also shows that FedComm is independent of the application domain and the neural network architecture used by the underlying FL scheme.


I Introduction

The single biggest problem in communication is the illusion that it has taken place.

George Bernard Shaw

Deep Learning (DL) is the key factor for an increased interest in research and development in the area of Artificial Intelligence (AI), resulting in a surge of Machine Learning (ML) based applications that are reshaping entire fields and seeding new ones. Variations of Deep Neural Networks (DNN), the algorithms residing at the core of DL, have been successfully implemented in a plethora of domains, including but not limited to image classification [80, 35, 17], natural language processing [22, 73, 6], speech recognition [33, 37], data (image, text, audio) generation [61, 44, 7, 40], cyber-security [59, 18, 21], and even aiding with the COVID-19 pandemic [55, 52].

DNNs can ingest large quantities of training data and autonomously extract and learn relevant features, while constantly improving at a given task. However, DNN models require significant amounts of information-rich data and demand hardware able to support their computational needs. These requirements limit the use of DNNs to institutions that can satisfy them, and push entities that lack the necessary resources to pool their data on third-party infrastructure.

While strategies like transfer learning alleviate these drawbacks, their adoption is not always possible. Moreover, as highlighted and emphasized by prior research [78, 38], sharing data with third parties is not a viable solution for entities like hospitals or government institutions, as doing so would risk potential privacy violations, infringing on current laws designed to protect privacy and ensure data security.

Shokri and Shmatikov [78] are the first to address the issues described above. In their work, the authors introduce collaborative learning, a DL scheme that allows multiple participants to train a DNN without needing to share their proprietary training data. In the collaborative learning scheme, participants train replicas of the target DNN model on their local private data and share a fraction of the model's updated parameters with the other participants via a global parameter server. This process allows the participants to train a DNN without having access to other participants' data or pooling their data on third-party infrastructure.

In the same line of thought, McMahan et al. propose federated learning (FL), a decentralized learning scheme that scales the capabilities of collaborative learning to thousands of devices [57] and has been successfully incorporated into the Android Gboard [58]. Additionally, FL introduces the concept of secure aggregation, which adds an additional layer of security to the process [10, 57]. Currently, a growing body of work proposes variations of FL schemes [86, 65, 53], novel attacks on existing schemes [38, 60, 5, 88], and approaches for mitigating such adversarial threats [2].

Our investigation focuses on the extent to which FL schemes can be exploited. To this end, we ask the following question: Is it possible for a subset of participants to intentionally transmit a desired payload to other members of an FL training process?

Fig. 1: High level overview of the FedComm communication scheme.

We demonstrate that the transmission of information "hidden" within the model's parameters is feasible by encoding such information using Code-Division Multiple Access (CDMA). Our proposed communication scheme, FedComm, stealthily transmits information that is hidden from the global parameter server and from the other participants who are neither senders nor receivers of the communication. Figure 1 shows the high-level operation of the FedComm scheme. First, the sender encodes the message in its model's updated parameters and then sends them to the global parameter server. The receiver obtains the global model from the parameter server and can precisely decode the message that was sent. We demonstrate the transmission of covert content varying from simple text messages, e.g., "Hello World!", to complex files, such as large images. Furthermore, we show theoretically and empirically that the FedComm scheme causes no disruption to the FL process and has a negligible effect on the performance of the learned model.

We bring to the attention of our readers that the work proposed here differs from the work on backdoors [5, 82, 88, 34], trojanning [51, 83], and watermarking mechanisms [91, 63, 3]. Our proposed approach, FedComm, does not aim at changing the behavior of the DNN model in the presence of triggers; it uses an FL scheme as a medium to covertly transmit additional content to other participants without altering the behavior of the resulting ML model. Since our proposed approach is not a privacy attack, a differential privacy (DP)-based FL learning scheme is orthogonal to FedComm. In Section VII we conclude that, even in scenarios where DP is employed, our proposed communication mechanism is not significantly affected.

Our contributions include the following:

  1. We introduce FedComm, a novel covert-channel communication technique that can be incorporated seamlessly into existing FL schemes. FedComm permits entities (individuals, institutions) participating in the FL process to covertly transmit payloads of varying size to other participants.

  2. We incorporate Code-Division Multiple Access (CDMA), a spread-spectrum channel-coding technique introduced in the 1950s [84] and designed to ensure secure and stealthy military communications, and we build the foundations of FedComm around it.

  3. We theoretically demonstrate the feasibility and effectiveness of the proposed approach. Moreover, we establish the foundations of deploying FedComm in different ‘stealthiness’ levels, allowing tunable trade-offs between the stealthiness level of the covert communication and the number of FL rounds required for successful message delivery.

  4. We demonstrate that FedComm is domain-independent and conduct an extensive empirical evaluation under varying conditions: a) diverse payload sizes; b) different DNN architectures; c) several benchmark datasets; d) different classification tasks; e) different number of participants selected for updating the model on each FL round; and f) multiple domains, including image, text, and audio.

This paper is organized as follows: Section II provides the necessary background information on the topics treated in the following sections. Section III introduces the threat model. Section IV describes FedComm, the covert communication channel procedure for FL proposed in this paper. Section V and Section VI provide details about the experimental setup and the evaluation of FedComm. Section VII discusses the implications of our covert communication technique and potential countermeasures. Section VIII covers the related work, and Section IX concludes the paper and presents future directions.

II Background

II-A Deep Learning

Deep Learning (DL) relies heavily on the use of neural networks (NN), which are ML algorithms inspired by the human brain and are designed to resemble the interactions amongst neurons [64]. While standard ML algorithms require the presence of handcrafted features to operate, NNs determine relevant features on their own, learning them directly from the input data during the training process [32].

Despite this crucial advantage, the breakthrough did not come until 2012, when the seminal work by Krizhevsky et al., whose resulting NN model is referred to by the research community as AlexNet [46], won the ImageNet classification challenge using convolutional neural networks (CNNs), a NN variant widely adopted in image-related tasks [80, 35, 17]. (The work by McCulloch and Pitts [56], which laid the basis for NNs, dates back to 1943.)

Two main requirements underlie the success of AlexNet and NNs in general: 1) substantial quantities of rich training data, and 2) powerful computational resources. Large amounts of diverse training data enable NNs to learn features suitable for the task at hand, while simultaneously preventing them from memorizing (i.e., overfitting) the training data. Such features are better learned when NNs have multiple layers, hence the term deep neural networks (DNNs). Research has shown that single-layer, shallow counterparts are not good at learning meaningful features and are often outperformed by other ML algorithms [31]. DNN training translates to vast numbers of computations requiring powerful resources, with graphics processing units (GPUs) being a prime example.

II-B Privacy Concerns with Deep Learning

Large quantities of training data, often contributed or collected from end-users' devices, enable DNNs to achieve tremendous results. This begs the question: Do DNNs introduce a new path to potential privacy violations? The answer is "Yes". A growing body of work has successfully demonstrated that it is possible to extract meaningful, potentially privacy-violating information from DNNs. Novel attacks such as property inference [4], model inversion [26], or membership inference [79] have shown that it is possible to extract additional properties from a model and correlate them to a specific subset of data contributors [4, 29], reconstruct training data by simply querying the DNN [26, 27, 89, 36, 15], and determine the presence of a given input in the training set used to train a DNN [79, 77, 76], emphasizing the need for privacy-preserving ML (PPML) mechanisms.

Proposed PPML strategies include, but are not limited to, restricting the information provided when querying a DNN [26], differential privacy (DP) [1], learning from synthetic data [42], and federated learning [57]. In particular, federated learning (FL) (discussed more in-depth in Section II-C) enables multiple parties to jointly contribute to the training of a DNN without the need to share the actual training data [57, 10].

New adversary models have shown that FL is also prone to privacy attacks, often making the attacks even stronger [38, 60, 69]. Combinations of FL with DP have been shown to mitigate the effects of such attacks. However, the noise introduced by DP can harm the accuracy of the FL model, and it is often hard to find the right balance between accuracy and privacy.

Additionally, DP must be applied at the right granularity level (i.e., clearly define what is being protected – the training data, the parameters of the model, or both), as that determines the effect of DP as a PPML mechanism. Misapplications of DP lead to information leakage, with the task becoming significantly harder when the goal is to scale to thousands or millions of participants, as with FL.

II-C Federated Learning

As emphasized earlier, federated learning (FL) removes the necessity to share the training data with (untrusted) third parties. Instead, the participants locally train replicas of the target DNN model and only share aggregated model updates with other participants. This form of learning enables collaborating parties to successfully train a DNN, making FL an attractive alternative for entities that are interested in benefiting from DL but do not possess large quantities of data and powerful resources, or that possess sensitive data which cannot be easily shared (e.g., medical records).

Advantages of the FL approach include enhanced data privacy protection and distribution of the computational load of training the DNN model across thousands of participating devices. FL [57] allows for high-quality ML models and lower power consumption, while ensuring privacy. A typical ML scenario requires a homogeneously partitioned dataset across multiple servers connected through high-throughput, low-latency links so that its optimization algorithm can run effectively. In FL, by contrast, the dataset is distributed unevenly across millions of devices, which have significantly higher-latency, lower-speed connections, reduced computing power, and intermittent availability for training.

McMahan et al. [57] alleviate these limitations by enabling the training of DNNs using up to 100 times less communication than the typical cloud training procedure. This reduction in communication is possible because a few local training iterations can produce high-quality updates, and these updates can be uploaded more efficiently by compressing them beforehand. To limit the ability of the parameter server to observe individual updates, Bonawitz et al. [10] developed the Secure Aggregation protocol, which uses cryptographic techniques so that the parameter server can decrypt the aggregated parameter update only if hundreds of thousands of users have participated. The Secure Aggregation protocol prevents the parameter server from inspecting individual user updates before averaging. This averaging protocol is practical and efficient for DNN-sized tasks and takes into account real-world connectivity constraints.
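To make the aggregation constraint concrete, the following is a minimal NumPy sketch of the additive pairwise-masking idea that underlies secure aggregation. It is a simplification of our own (the actual protocol in [10] additionally uses key agreement and secret sharing to tolerate dropouts), and all sizes and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, dim = 4, 6
updates = [rng.normal(size=dim) for _ in range(n_users)]

# Each ordered pair (u, v), u < v, agrees on a random mask; u adds it, v subtracts it.
# Pairwise masks cancel in the sum, so the server only learns the aggregate update.
pair_masks = {(u, v): rng.normal(size=dim)
              for u in range(n_users) for v in range(u + 1, n_users)}

def masked_update(u):
    y = updates[u].copy()
    for (a, b), mask in pair_masks.items():
        if a == u:
            y += mask
        elif b == u:
            y -= mask
    return y

server_sum = sum(masked_update(u) for u in range(n_users))
assert np.allclose(server_sum, sum(updates))  # masks cancel; the aggregate is exact
```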

II-C1 How does FL work?

We denote the weight parameters of a DNN by $w$. FL is typically organized in rounds (time-steps here). At time $t$, a subset of $m$ users, out of the $N$ participants that have signed up for collaborating, is selected for improving the global model. Each user $u$ trains their model and computes the new model:

$$w_{t+1}^{u} = w_t - \eta\, g_u^t \qquad (1)$$

where $w_t$ are the DNN weights at time $t$ and $g_u^t$ is the mini-batch gradient for user $u$ at time $t$.

The participants of the FL scheme send the gradient $g_u^t$ to the global parameter server. On receiving the gradients, the server recomputes the model at time $t+1$ as follows:

$$w_{t+1} = w_t - \frac{\eta}{m} \sum_{u=1}^{m} g_u^t \qquad (2)$$

where $m$ is the number of updates obtained by the parameter server (i.e., the number of participants taking part in the current training round) and $\eta$ is a weight factor chosen by the global parameter server.
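As a concrete illustration of Equations (1) and (2), here is a minimal NumPy sketch of a single FL round; the toy quadratic objective, the number of users, and the value of the weight factor eta are illustrative assumptions, not the configuration used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_users, eta = 10, 5, 0.1
w_global = rng.normal(size=d)                               # w_t: current global weights
local_data = [rng.normal(size=d) for _ in range(n_users)]   # toy per-user targets

def local_gradient(w, target):
    # Gradient of the toy objective 0.5 * ||w - target||^2 on this user's data.
    return w - target

# Each selected user computes its mini-batch gradient g_u^t on the current weights.
grads = [local_gradient(w_global, local_data[u]) for u in range(n_users)]

# Server-side aggregation, cf. Eq. (2): average the m received gradients,
# scale by the weight factor eta, and update the global model.
m = len(grads)
w_next = w_global - (eta / m) * np.sum(grads, axis=0)
print("global update norm:", np.linalg.norm(w_next - w_global))
```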

II-D Code-Division Multiple Access (CDMA)

In digital communications, spread-spectrum techniques [84] are methods by which a signal (e.g., an electrical, electromagnetic, or acoustic signal) with a particular bandwidth is deliberately spread in the frequency domain. These techniques enable the spreading of a narrowband information signal over a wider bandwidth. On receiving the signal, a receiver that knows the spreading mechanism can recover the original narrowband signal. These techniques were developed in the 1950s for military communications, because they resist enemy efforts to jam the communication channel and hide the fact that the communication is taking place [87].

In practice, two main techniques are used to spread the bandwidth of a signal: frequency hopping and direct sequence. In frequency hopping, the narrowband signal is transmitted for a few milliseconds or microseconds in a given frequency band, which is constantly changed following a pseudo-random hopping pattern agreed upon with the receiver. The receiver, in coordination, tunes its filter to the agreed-on frequency band to recover the full message. Direct sequence, the spreading technique we use in our research, works by directly coding the data at a higher chip rate using pseudo-randomly generated codes that the receiver knows.

In the 1990s, Direct Sequence Spread Spectrum was proposed as a multiple-access technique (i.e., Code-Division Multiple Access, or CDMA) in the IS-95 standard for mobile communications in the US, and it was adopted worldwide in the Universal Mobile Telecommunications System (UMTS) standard in the early 2000s, better known as 3G. In CDMA, if several mobile users want to transmit information to a base station, they all transmit at the same time over the same frequency, but with different, quasi-orthogonal codes. The base station correlates the received signal with each user's spreading code to detect that user's transmitted bits; the other users' signals simply contribute to the noise level. CDMA is proven to have a higher capacity than Time-Division Multiple Access (TDMA) [87], used in the GSM standard. CDMA has been superseded in LTE/4G and 5G by Orthogonal Frequency-Division Multiple Access (OFDMA), because OFDMA is more robust against channel bandwidth limitations.

Spreading the spectrum of a signal is fairly straightforward at both the transmitter and the receiver. For example, let us assume that we have a binary sequence 1, 0, 1 that we transmit with Phase-Shift Keying (PSK) [72], i.e., the logical 1 is transmitted as the carrier with one phase and the logical 0 as the same carrier with the opposite phase at the transmitting frequency. If we transmit each bit at a rate of one every millisecond (1 ms), the bandwidth of that signal will be approximately 1 kHz. For simplicity, we assume unit amplitude, but without loss of generality any other value is possible. In this case, using PSK, the above-mentioned sequence is translated as +1, -1, +1. If we multiply this sequence with a 5-chip spreading code (e.g., +1, -1, +1, +1, -1), we will get the following 15-chip sequence: +1, -1, +1, +1, -1, -1, +1, -1, -1, +1, +1, -1, +1, +1, -1. The chips in this sequence are transmitted every 0.2 milliseconds, so the time for transmitting a bit stays the same (i.e., 1 ms) while the bandwidth increases to 5 kHz. The correlation to recover the original signal is simple: every 0.2 milliseconds, the received sequence is convolved with the time-reversed spreading code. When the spreading code and the sequence are aligned, in the noiseless case, we get a value of 5 (if a 1 was transmitted) or -5 (if a 0 was transmitted). When they are not aligned, we get a small value of the order of +/-1. Also, if we transmit with the same total energy, the energy per kHz is divided by 5. If we make the spreading code long enough, the transmitted sequence can be hidden under the noise level, so it is not detectable by an unintended user, but it can be recovered by the receiver once we add all the contributions from the code. Typically, the spreading codes are tens to hundreds of chips long, so the signal is only visible when the spreading code is known, and the gain of using CDMA is proportional to the length of the code.
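The spreading example above can be reproduced in a few lines of Python; this is a minimal baseband sketch (no carrier), where the three data bits and the 5-chip code are illustrative choices.

```python
import numpy as np

bits = np.array([+1, -1, +1])                # logical 1, 0, 1 mapped to +/-1 (PSK)
code = np.array([+1, -1, +1, +1, -1])        # an illustrative 5-chip spreading code

# Spreading: each bit is multiplied by the whole code -> 15 chips,
# transmitted 5x faster, so the occupied bandwidth grows by a factor of 5.
chips = np.concatenate([b * code for b in bits])

# Despreading: correlate each 5-chip block with the code.
# Aligned blocks give +5 or -5; the sign recovers the bit.
correlations = [chips[i:i + 5] @ code for i in range(0, len(chips), 5)]
decoded = np.sign(correlations)
print(correlations)                          # [5, -5, 5] in the noiseless case
assert np.array_equal(decoded, bits)
```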

III Threat Model

A typical communication scheme is composed of two parties, namely a sender and a receiver. Both parties work by alternating their respective roles depending on the destination of the message they want to communicate.

In FedComm, a participant (or subset of participants) takes the role of sender to covertly transmit a message to other participants in the FL scheme who are acting as the receiving parties. In FedComm, the sender does not deviate from the FL procedure, and the receiver has previously agreed on the decoding method(s).

In particular, the sender can only change its model’s updated parameters. Such changes must be as small as possible to eliminate or minimize disruption to the current FL process, and to prevent discovery by the global parameter server. For example, the global parameter server may try to detect the covert communication channel by looking for anything unusual in the parameter updates from a particular participant compared to the parameter updates sent by the rest of the participants.

The transmitted payload can vary from simple text messages to dangerous malware packages. For instance, the proposed FedComm scheme can be applied to scenarios where the sender is transmitting a malicious payload and the receiver is unaware of the sender's presence. In this adversarial setting, the receivers' devices contain a monitoring process (maliciously) installed before the sender starts sending the malicious "message" using FedComm. The monitoring process knows the method used to decode the message; this assumption is in line with previous research [50]. Methods to infiltrate such monitoring processes into the devices of the target participants are beyond the scope of this paper.

IV FedComm

This section introduces FedComm, our covert communication channel technique built on top of the FL scheme. FedComm employs a direct application of CDMA in building this covert channel. In this view, each weight of the NN is a time instant in which we can encode information, i.e., a chip in CDMA parlance. Hence, the codes for each bit are as long as the number of weights of the NN. It also means that the gradients from the other users act as the noise in the CDMA channel, and we need to make sure that the information is buried under this noise so that it cannot be detected by the aggregator (in this paper, we use the terms aggregator and parameter server interchangeably), but can, at the same time, be decoded by the receiver at the other end.

Let's assume that the sender wants to transmit a payload of $b$ bits. The bits are encoded as $m_i \in \{-1, +1\}$, and the code for each bit $i$ is represented by $c_i$, a vector of $\pm 1$ entries with the same length as the weight vector $w$, namely $d$. $C$ is a $d \times b$ matrix that collects all the codes. We assume that the codes have been randomly generated with equal probabilities for $+1$ and $-1$. The sender joins a federated learning scheme where $N$ users have signed up for collaborating. The aggregator proposes a set of initial weights $w_0$, which are distributed to all the users. At every iteration, each user $u$ uses their data, or a mini-batch of their data, to compute the gradient $g_u^t$, for $u = 0, \ldots, N-1$ and $t = 1, \ldots, T$. The sender, which we assume without loss of generality to be user 0, encodes the payload in its gradient as follows:

$$\tilde{g}_0^t = \gamma_g\, g_0^t + \gamma_m\, C m \qquad (3)$$

where $\gamma_g$ and $\gamma_m$ are two gain factors used to ensure that the message cannot be detected and that the power of the modified gradient is comparable to that of the unmodified gradients of the other users.

The aggregator updates the weights of the network using Equation (2) and distributes $w_{t+1}$ to all of the users. Now, the receiver can recover the payload hidden in the sender's gradient by correlating the weights with the spreading codes. For example, for bit $i$, the receiver computes

$$\hat{m}_i \propto c_i^{\top} \big( w_{t+1} - w_0 \big), \qquad (4)$$

and recovers the bit as the sign of this correlation. The same operation can be performed for any other bit $j$ too.
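The following is a minimal, self-contained NumPy sketch of the encode/decode loop described by Equations (3) and (4), with a single sender, random $\pm 1$ codes as long as the weight vector, stand-in Gaussian gradients for the other users, and no LDPC coding. The dimensions, the gain values, and the learning rate are illustrative assumptions rather than the settings used in our evaluation.

```python
import numpy as np

rng = np.random.default_rng(42)
d, b = 20_000, 16          # number of weights, payload length in bits (toy sizes)
n_users, T, eta = 10, 30, 1.0

payload = rng.choice([-1, +1], size=b)            # message bits encoded as +/-1
C = rng.choice([-1, +1], size=(d, b))             # one +/-1 spreading code per bit
gamma_g, gamma_m = 1.0, 1.0 / np.sqrt(b)          # illustrative gain factors

w = np.zeros(d)                                   # w_0, known to the receiver
w0 = w.copy()
for t in range(T):
    grads = [rng.normal(size=d) for _ in range(n_users)]   # stand-in user gradients
    # Sender (user 0) embeds the spread payload into its gradient, cf. Eq. (3).
    grads[0] = gamma_g * grads[0] + gamma_m * (C @ payload)
    w = w - (eta / n_users) * np.sum(grads, axis=0)        # server aggregation, Eq. (2)

# Receiver: correlate the accumulated weight change with each code, cf. Eq. (4).
# The server update subtracts gradients, so the payload appears with a minus sign.
stats = C.T @ (w0 - w)
decoded = np.sign(stats)
print("bit errors:", int(np.sum(decoded != payload)))      # expected 0 for these toy settings
```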

How can we know that $\hat{m}_i$ will recover $m_i$? Let's do the math!

(5)
(6)
(7)
(8)
(9)
(10)

To derive (8), we have divided the matrix $C$ into the column vector $c_i$ and a matrix that contains all the columns except the $i$-th one, resulting in a $d \times (b-1)$ matrix. The corresponding vector of remaining message bits is a $(b-1)$-dimensional vector, which is only missing the $i$-th entry. In (8), we have also defined the shorthand used in the subsequent steps.

In (9), we have defined the inter-code interference term. The distribution of each component of the cross-correlation between $c_i$ and the other codes is a symmetric binomial distribution between $-d$ and $+d$, because the entries of all the codes are $\pm 1$ with equal probabilities. When we multiply this vector by the remaining message bits and add all the components together, we get a binomial distribution with values between $-d(b-1)$ and $+d(b-1)$. Hence, the distribution of this term for large $d$ can be approximated by a zero-mean Gaussian with variance $d(b-1)$.

To compute the distribution of the noise coming from the other users' gradients, we assume for now that each gradient component is zero-mean with a fixed variance (it can follow any distribution; it does not need to be Gaussian). Each component of the accumulated update adds up many of these values, so by the central limit theorem (for a large enough number of contributions) each one of these variables is approximately zero-mean Gaussian. (We have simplified the variance expression, as we fix the remaining quantities and assume the number of contributions is large enough; in scenarios where it is not, we use an upper bound on this variance.) When we multiply this vector by $c_i$ and add all the components together, we end up with a zero-mean Gaussian, because the components of $c_i$ are $\pm 1$ and therefore do not change the distribution of the components they multiply.

The assumption that each gradient is zero-mean is not restrictive: since we are adding over all the users and all the time instants, users pull and push in different directions until the right value has been reached, so the mean of the distribution should be negligible after many rounds. Moreover, the constant-variance assumption is not a limiting factor, because all users' gradients would be normalized by the aggregator; if a user has a gradient that is significantly larger than the others, it should not be used in the FL procedure, as that user would hijack the whole learning procedure.

Finally, given that the two noise sources and the codes are mutually independent, the resulting estimation error is zero-mean and Gaussian distributed, with a variance that is the sum of the variances of the two noise terms. The distribution of the estimate is then given by the following expressions (we have dropped the time index without loss of generality); if we further normalize, this leads to:

(11)
(12)

To be able to recover the bits with some certainty, we need the variance of this estimate to be less than one. For example, if the variance were one, the probability of making a mistake would be 16%, but this probability reduces to 8% and 4%, respectively, if the variance drops to 1/2 or 1/3. If we use a long enough error-correcting code, we can ensure errorless communication when the variance is about 1, so we will use this value as a reference in our calculations. We describe the Low-Density Parity-Check codes we use in Section IV-A.

To compute the variance of our estimate, we need to set the values of $\gamma_g$ and $\gamma_m$ in Equation (3). If we set $\gamma_g = 0$ and choose $\gamma_m$ so that the embedded message has the same power as the original gradient, our modified gradient would have the same power as the original one, but a simple hypothesis test looking for a binomial or a Gaussian distribution would be able to detect that our gradient is not a true gradient, as we show in Section VI-A. We can also set $\gamma_g = 1$ and make $\gamma_m$ small enough to bury our signal in the true gradient. In this case, the information would be practically impossible to detect, as it would sit 10dB under the power of the gradient. In Section VI, we also focus on an in-between case, in which $\gamma_g = 1$ and the message power matches the gradient power; in this case, the signal might be detected.

For now, we focus on the analysis for a fixed choice of the gains $\gamma_g$ and $\gamma_m$. In this case, the variance of the estimate of $m_i$ becomes:

(13)
(14)

For a number of payload bits $b$ that is small (significantly smaller than the number of weights $d$), the error in the estimate is driven by the gradients of the other users. If $b$ were larger than $d$, the noise would instead be driven by the other bits in the message, which is the standard scenario in CDMA, and eventually the message can be decoded. We should expect to need a minimum number of rounds before the message can be decoded. If the gradients from the other users do not behave as a zero-mean Gaussian with constant variance, we might need more rounds to be able to decode the message (we show this in Section VI). In general, the number of iterations that we need before the message can be seen by the receiver grows with the number of participants and with the chosen stealthiness, being larger in the stealthy mode than in the non-stealthy mode.

If we need to deliver the payload faster, we can add the same payload through more users that transmit the same information with the same code. This information would be added coherently, even if the users are not transmitting the information at the same time, and would lead to an amplification of the message without additional noise. If we have $S$ senders, instead of one, adding the same payload with the same code to their gradients, the mean of the decoded statistic is multiplied by $S$ while the noise is not. In this case, the payload becomes visible $S^2$ times quicker.

In the derivation above, we have assumed that all the users send their gradient in each iteration and that all the gradients are used to update the weights. If only a subset of users is included in each iteration, the analysis above still holds if we define each iteration as spanning several communication rounds. If the aggregator uses a round-robin scheme, the analysis is exactly the same. If the aggregator chooses the participants' gradients at random, it holds in mean and, given that the number of rounds should be large, the deviation would be negligible (we also test this setting in Section VI).

As a final note, if we do not have access to the initial weights $w_0$ when doing the decoding, we would have an additional error source coming from the initialization of the weights. This error becomes negligible as the number of rounds and the number of weights grow.

IV-A Low-Density Parity-Check codes

If we have a payload of a few kilobits and we do not use error-correcting codes, we would need to significantly reduce the variance to obtain error-free communication; however, thanks to the Shannon channel-coding theorem [20], we can add redundancy to our sequence. We use a standard-rate low-density parity-check (LDPC) code over all the available bits, with three ones per column, which has very good error-correction properties with linear-time decoding [75].

For LDPC codes to work, they need an estimate of the noise level. To obtain this value, our senders send a preamble of 100 random bits in the first 100 bits of the payload, so that the receivers can use these values to estimate the noise variance.
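As an illustration of how such a preamble can be used, the sketch below estimates the channel gain and noise variance from 100 known preamble symbols and converts the soft despreading statistics into Gaussian-channel log-likelihood ratios of the kind an LDPC decoder consumes. The assumption that the preamble is known to the receiver (e.g., generated from the same shared seed as the spreading codes), and all numeric values, are ours.

```python
import numpy as np

rng = np.random.default_rng(7)
n_pre = 100
preamble = rng.choice([-1.0, +1.0], size=n_pre)   # known +/-1 preamble symbols

# Soft despreading statistics as seen by the receiver: scaled bits plus channel noise.
gain, noise_std = 3.0, 1.5                        # unknown to the receiver
soft = gain * preamble + rng.normal(scale=noise_std, size=n_pre)

# Because the preamble is known, gain and noise variance can be estimated directly;
# these estimates parameterize the LLRs fed to the LDPC decoder.
gain_hat = np.mean(soft * preamble)               # correlating removes the bit signs
noise_var_hat = np.var(soft - gain_hat * preamble)
llr = 2.0 * gain_hat * soft / noise_var_hat       # Gaussian-channel log-likelihood ratios
print(round(gain_hat, 2), round(noise_var_hat, 2))
```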

V Experimental Setup

We conduct a thorough and extensive evaluation of our proposed scheme considering: 1) a range of benchmark image [48, 47], text [62], and audio [71] datasets; 2) well-known convolutional neural network (CNN) and recurrent neural network (RNN) architectures [28, 39]; and 3) different payload sizes. This evaluation demonstrates that FedComm is domain- and model-independent and can be generalized to future areas where FL is deployed.

V-A Datasets

We used the following datasets in our experiments.

The MNIST handwritten digits dataset consists of 60,000 training and 10,000 testing grayscale images of dimensions 28x28-pixels, equally divided among 10 classes (0-9) [48].

The CIFAR-10 dataset is another benchmark image dataset consisting of 50,000 training and 10,000 testing samples of 32x32 colour images divided in 10 classes, with roughly 6,000 images per class [47].

The WikiText-2 language modeling dataset, a subset of the larger WikiText dataset, is composed of approximately 2.5 million tokens representing 720 Wikipedia articles, divided into 2,088,628 training tokens, 217,646 validation tokens, and 245,569 testing tokens [62].

The ESC-50 dataset consists of 2,000 labeled environmental recordings equally balanced across 50 classes, with 40 clips per class [71].

V-B DNN Architectures

We adopted different DNN models depending on the task. For the image classification tasks on MNIST and CIFAR-10, we used two CNN-based architectures: a) a standard CNN model composed of two convolutional layers and two fully connected layers; b) a modified VGG model [80]. For the language-modeling task on WikiText-2, we used an RNN model composed of two LSTM layers. For the audio classification, we used a CNN model composed of four convolutional layers and one fully connected layer. Summaries of these models can be found in the Appendix.

V-C Transmitted Messages

We used three payloads of different sizes for transmission in our covert communication approach for federated learning. The smallest payloads were two text messages of 96 and 136 bits, corresponding to the text phrases "hello world!" and "The answer is 42!". The third payload is a 7,904-bit image. For simplicity, we refer to the text messages as SHORT and to the image as the LONG message.
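For reference, the SHORT payload sizes follow directly from 8-bit character encoding; below is a minimal sketch of one possible text-to-symbol conversion, where the $\pm 1$ mapping matches the bit encoding used by the CDMA layer (the function name and bit ordering are our own illustrative choices).

```python
def text_to_symbols(message):
    """Encode a text payload as a list of +/-1 symbols (8 bits per character)."""
    symbols = []
    for byte in message.encode("ascii"):
        for k in range(7, -1, -1):                 # most-significant bit first
            symbols.append(+1 if (byte >> k) & 1 else -1)
    return symbols

print(len(text_to_symbols("hello world!")))        # 96 bits
print(len(text_to_symbols("The answer is 42!")))   # 136 bits
```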

V-D Software and Hardware Specifications

FedComm is built on top of version 1.7.1 of the PyTorch ML framework [70], using an environment with Python version 3.8.5. The experiments were conducted on a desktop PC running the Ubuntu 20.04.2 LTS operating system with a Ryzen 9 3900X processor, 64GB of RAM, and an Nvidia GeForce RTX 2080 Ti GPU with 12GB of memory.

VI FedComm Evaluation

This section focuses on the evaluation of FedComm. We rigorously assess the effectiveness of the proposed scheme along three main axes: (i) stealthiness, (ii) impact on the overall model performance, and (iii) message delivery time (the total number of global rounds needed for the receiver to detect the presence of the message). The following sections provide a step-by-step analysis of each of these metrics.

Fig. 2: Stealthiness level; a comparison of the distribution of the gradient updates when FedComm is running with different stealthiness parameters against the gradient updates when FedComm is not running (i.e., the baseline). (a) Non-stealthy mode ($\gamma_g = 0$) vs. baseline; (b) full-stealthy mode ($\gamma_g = 1$) vs. baseline.
Fig. 3: Gradient-update norm comparison in a round between regular participants and a sender that employs FedComm in full-stealthy mode ($\gamma_g = 1$).
Fig. 4: The FedComm approach run on different FL setups. Each sub-figure reports the stealthiness setting, the number of senders, the size of the message, the dataset, and the percentage of users selected at random to participate in each FL round: (a) non-stealthy, 1 sender, short message, MNIST, 100% aggregated per round; (b) full-stealthy, 10 senders, short message, MNIST, 100%; (c) non-stealthy ($\gamma_g = 0$), 2 senders, 2 short messages, MNIST, 100%; (d) 5 senders, short message, MNIST, 10%; (e) 10 senders, short message, MNIST, 20%; (f) non-stealthy ($\gamma_g = 0$), 1 sender, short message, MNIST, 50%; (g) 5 senders, long message, CIFAR-10, 100%; (h) 10 senders, short message, ESC-50, 100%; (i) 10 senders, short message, WikiText-2, 100%; (j) 1 sender, short message, MNIST, 100%; (k) 2 senders, short message, MNIST, 100%; (l) 4 senders, short message, MNIST, 100%. The vertical yellow line on each plot indicates the round at which the message was correctly received by the receiver.

VI-A Stealthiness

During an FL epoch, the participants perform a round of training over their respective local datasets, and the gradients of the updated weights are aggregated and uploaded to the parameter server, which, in turn, updates the global model. Because FL usually relies on secure aggregation [10], the parameter server is oblivious to the individual updates. While typically the parameter server of an FL scheme does not possess any tracking mechanism, we assume a hypothetical scenario in which the parameter server can actually observe each of the uploaded updates. Additionally, the server is equipped with tooling for performing statistical analysis of the provided updates to detect and mitigate potential anomalies. We position ourselves in such a scenario because, in this setting, we can properly evaluate the stealthiness of FedComm and demonstrate that the transmitted messages remain undetected.

To demonstrate the stealthiness of FedComm under different stealthiness parameters, we analyse the distribution of the gradient updates produced when FedComm is used to transmit a message versus the gradient updates when FedComm is not used. We depict this comparison in Figure 2, in which we compare the distribution of the FedComm gradient updates in two extreme cases, non-stealthy ($\gamma_g = 0$) and full-stealthy ($\gamma_g = 1$), with the regular gradient updates.

In Figure 2(a), the distribution of typical gradient updates after the local iterations (the light color) differs from the distribution of the updates where the message is being transmitted in a non-stealthy manner. When the message is transmitted in non-stealthy mode, the parameter server can actually detect that something abnormal is happening and might even choose to discard that particular gradient update.

Figure 2(b) displays the distribution of the gradient updates after a typical local update, and the distribution of the same gradient updates with the message transmitted using FedComm in full-stealthy mode. The two distributions are nearly indistinguishable, and in the eyes of the parameter server nothing abnormal is happening. The impossibility of distinguishing between the distribution of a typical gradient update and that of a gradient update containing a message transmitted by FedComm in full-stealthy mode aligns with the theoretical results reported in Section IV, which showed that the full-stealthy parameter setting allows us to bury the message in the gradients, making it undetectable as it sits 10dB under the power of the gradient.

To provide additional evidence of FedComm's ability to covertly transmit a message, we compared the vector norms of all the participants' (i.e., senders and regular participants) gradient updates. Similar to the above, the parameter server tries to detect anything unusual in the parameter updates from a particular participant compared to the parameter updates sent by the rest of the participants, this time by employing a different measure, the norm of the gradient update; in this experiment, we used the Frobenius norm [30]. Figure 3 shows that the norm of the gradient update of the sender (highlighted in dark blue) is similar to the norms of the gradient updates of the other participants.
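The checks discussed in this section can be expressed in a few lines; the sketch below computes per-participant update norms and entry-wise standard deviations over toy stand-in updates. The per-user scales, the payload power, and the number of participants are illustrative assumptions, and real gradients are neither i.i.d. nor identically scaled, so this is only meant to show how such a server-side check could be computed.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_users = 50_000, 20
# Toy stand-ins for per-participant updates; scales are drawn per user to mimic
# the natural variability of real gradient updates.
scales = rng.uniform(0.008, 0.012, size=n_users)
updates = [rng.normal(scale=s, size=d) for s in scales]

# User 0 acts as a full-stealthy FedComm sender: true update plus a low-power payload.
payload = rng.choice([-1.0, 1.0], size=d) * scales[0] * np.sqrt(0.1)  # ~10 dB below
updates[0] = updates[0] + payload

# Server-side checks: Frobenius norm and entry-wise standard deviation per participant.
for u, upd in enumerate(updates[:5]):
    tag = "sender " if u == 0 else "regular"
    print(f"user {u} ({tag}): norm={np.linalg.norm(upd):.3f}  std={upd.std():.5f}")
```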

VI-B FedComm's impact on FL model performance

To measure the impact on the performance of the resulting FL model when the FedComm covert communication scheme is in use, we ran different experiments on a variety of tasks (MNIST, CIFAR-10, ESC-50, WikiText-2) and a variety of DNN architectures (see Section V). In this way, we also empirically evaluate the generality of FedComm (i.e., its domain and DNN-architecture independence). We set the number of participants in the FL scheme to 100 and considered the following percentages of participants randomly selected to update the parameters in each round: 10%, 20%, 50%, and 100%. To show that FedComm does not impact the performance of the FL scheme, we performed baseline runs with the same setup as the runs in which we used FedComm to transmit the message, and we display those results in Figure 4. Each plot presents the FL baseline training accuracy against the training accuracy when FedComm is employed. A message is transmitted in each round in which the sender is selected to participate (i.e., when 100% of participants are aggregated, the sender is always selected and transmits the message in every FL round). The vertical yellow line in each plot shows the FL round in which the message is correctly received by the receiver.

To display the effect on model performance when the sender employs different levels of FedComm stealthiness, Figures 4(a) and 4(b) show the model performance when using FedComm at the non-stealthy (4(a)) and full-stealthy (4(b)) levels to transmit the message. In both cases, we can see that the performance of the learned model when FedComm is used is similar to the performance of the learned model when FedComm is not used.

Another important benefit that the use of spread-spectrum channel-coding techniques brings to the table is that multiple senders can send their respective messages to their respective target receivers. Figure 4(c) showcases an experiment in which two senders transmit two different text messages (i.e., "hello world!" and "The answer is 42!") to their respective receivers. The performance of the learned model is unaffected in each FL round, and both messages are correctly delivered to their respective receivers.

As previously mentioned in Section IV, we also tested the case in which a subset of the total participants is selected at random to update the global model in each FL round. We assess the impact of this setting on FedComm's ability to transmit the message in Figures 4(d), 4(e), and 4(f). These figures show that the baseline's and FedComm's training accuracies remain closely comparable even when a limited percentage of participants is selected for averaging (i.e., 10%, 20%, and 50%).

Figures 4(g), 4(h), and 4(i) evaluate FedComm's performance when it is applied to different tasks and different DNN architectures. Figure 4(g) shows that FedComm can transmit the long message in under 400 global rounds while simultaneously training the VGG [80] network on the CIFAR-10 image-recognition task. Figure 4(h) displays the baseline vs. FedComm performance on the ESC-50 [71] audio classification task; the performance of the learned model in each round is not affected by the ongoing covert communication powered by FedComm. Figure 4(i) displays the baseline vs. FedComm performance on a language-modeling task, WikiText-2, using an LSTM-based recurrent NN. Different from the other plots, the performance assessment on this task compares the perplexity of the learned models in each FL round. Perplexity measures how well a probability model predicts a sample; in our case, the language model is trying to learn a probability distribution over the WikiText-2 [62] dataset. Even in this case, the performance of the FL scheme is not impacted by having FedComm transmit a message alongside the learning process.

Figures 4(j), 4(k), and 4(l) show FedComm's results when using the same settings (task, message, and stealthiness level) while changing the number of simultaneous senders of the same message, and demonstrate how this scenario impacts the message delivery time. We elaborate on this in the next section, where we discuss the message delivery time of FedComm.

VI-C Message delivery time

Having shown that FedComm can covertly deliver a message without impacting the ability of the FL scheme to learn a high-performing ML model, we now focus on measuring the time (in terms of FL rounds) it takes for a message to be delivered to its intended receiver. As the various FL configuration experiments in Figure 4 demonstrate, the message delivery time varies according to the number of senders in the network and their stealthiness levels.

Typically, an FL scheme either runs in a continuous-learning fashion (i.e., the learning never stops) or it stops when the gradient updates can no longer improve the model. To display the potential of FedComm, we assume the latter case, which is also our worst-case scenario because it requires FedComm to transmit and deliver the message to the receiver before the FL procedure converges (i.e., before the training stops). In Figure 4, the vertical line indicates the global round in which the receiver correctly received the message. In every FL run, the model performance continues improving even after the FL round in which the message is received, so the FL execution does not stop before the message is received.

Fig. 5: Error rate until the message can be retrieved by the receiver; 5 senders, long message, CIFAR-10, 100% aggregated per round.

The message delivery time drops significantly as the number of senders who send the message concurrently increases. In Section IV, we showed that the number of iterations drops with $S^2$. We highlight this observation in the experiments displayed in Figures 4(j), 4(k), and 4(l), which show the message delivery time when using 1, 2, and 4 senders with the same stealthiness parameters; in each round, 100% of participants are selected by the parameter server for averaging. The vertical line in Figure 4(j) shows that fewer than 400 global rounds are needed for the receiver to correctly decode the message when one sender is used. According to the calculations, two senders should require roughly a quarter, and four senders roughly a sixteenth, of those FL rounds for the message to be correctly decoded. Figure 4(k) and Figure 4(l) show that with 2 and 4 concurrent senders, the receiver can decode the message in under 120 and 30 rounds, respectively. The small mismatch between the theory (the number of iterations is reduced by $S^2$) and practice (a slightly slower rate of decrease, especially from 1 to 2 senders in Figures 4(j) and 4(k)) can be due to several factors. First, the variance of the gradients reduces as the number of iterations increases, so there is more noise present in the first few iterations. Second, the gradients of the senders differ, so they might not all be adding the same amount of information in each iteration. Third, there is natural stochasticity in FL training, as each user has a different gradient in each iteration that depends on all the other gradients in previous iterations.

We can also see in Figures 4(a) and 4(b) that ten stealthy senders take a time that is of the same order of magnitude as one non-stealthy sender, as the larger number of senders and the reduced message power roughly cancel each other out in the predicted number of iterations, which is also what the theory in Section IV predicts.

Finally, to highlight how parts of the message become visible in the global model after each FL round, Figure 5 displays the decrease in the error rate of the message. This error rate is calculated as the portion of the message that has not yet been correctly received by the receiver after each FL round. Note that the use of LDPC codes causes the error rate to drop rapidly to zero in the last few iterations of FedComm.

VI-D Validating the Gaussian assumption

In Section IV, for tractability of the theoretical analysis, we assumed that the gradients from all the users in every iteration are identically distributed as a zero-mean Gaussian with a fixed standard deviation. In practice, the gradients are not Gaussian distributed and do not have the same standard deviation for every user at each iteration. Hence, the resulting noise on our covert communication channel is not Gaussian distributed, and this has an impact on the number of iterations needed to decode the message.

In Figure 4(a), we can see that we need 120 iterations before the message can be detected. If we apply the theory developed in Section IV, we should expect to decode the message after just two iterations. We have performed two experiments to understand where this deviation comes from. First, we recorded the power of all the gradients in each iteration and for each user, and repeated the FL procedure substituting the gradients with Gaussian noise (in this experiment there is no learning happening; we just want to see when the message is detected). In this case, we were able to detect the message within 30 iterations, so the aggregated noise is not Gaussian in this case either. If we use the same Gaussian distribution for all the users and all the iterations, the resulting noise is Gaussian distributed, and we recover the message in the two iterations predicted by the theory.

Both the LDPC decoder and the CDMA detector rely on the resulting noise being Gaussian for optimal performance. The degradation observed in our experiments could be mitigated if we had the exact distribution of the noise affecting our CDMA communication. However, this distribution depends on the architecture of the DNN, the data each user has, and the iteration of the FL procedure, so it would be impossible to theoretically predict the number of needed iterations. In this work, we have focused on showing that the message can be detected and leave the design of the optimal detector as future work.

VII Discussing Potential Countermeasures

This section analyzes possible approaches that can be employed as countermeasures to FedComm and discusses the extent to which these countermeasures can impact the performance of the covert communication channel.

VII-A Differential Privacy

Differential privacy [24, 23] uses random noise to ensure that publicly visible information does not change much if one individual record in the dataset changes. As no individual sample can significantly affect the visible information, attackers cannot confidently infer private information corresponding to any individual sample. When employed in an FL setting, the noise introduced by DP has to be lower than the magnitude of the participants' gradients to avoid impeding the learning process.

Since we are using FedComm to transmit $b$ bits in an $N$-participant FL scenario, decoding one of these bits requires FedComm to account for the noise on the channel that comes from two sources: the noise from the gradients of the other users and the noise from the codes of the other bits of the message (see Section IV). When DP is employed in the learning scheme, FedComm has to account for an extra source of noise in the channel. Because the noise coming from DP must remain lower than the magnitude of the gradients to avoid preventing the learning process, it can affect the message transmission only by slightly slowing down the time to delivery. This potential delay does not have a significant impact on the transmission time, because the DP noise is low and it also slows down the FL learning process itself, thus providing FedComm with more time to deliver the message before the model converges.
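A minimal sketch of how DP-style noise enters the covert channel is shown below: the participant's update is norm-clipped and perturbed with zero-mean Gaussian noise, and the receiver simply sees one more zero-mean term whose correlation with a spreading code averages out over rounds. The clipping bound and noise scale are illustrative and do not correspond to a calibrated privacy budget.

```python
import numpy as np

rng = np.random.default_rng(11)
d = 50_000
update = rng.normal(scale=0.01, size=d)            # a participant's update

# Gaussian-mechanism style perturbation: clip the update norm, then add noise.
clip_bound, dp_noise_std = 1.0, 0.002              # illustrative values
clipped = update * min(1.0, clip_bound / np.linalg.norm(update))
dp_update = clipped + rng.normal(scale=dp_noise_std, size=d)

# For the FedComm receiver, the DP noise is one more zero-mean noise source:
# its correlation with any +/-1 spreading code stays small, while the payload
# term accumulates coherently across rounds.
code = rng.choice([-1.0, 1.0], size=d)
print("correlation with the DP noise only:", round(float(code @ (dp_update - clipped)), 3))
```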

VII-B Parameter Pruning

Parameter pruning is a technique commonly used to reduce the size of a neural network while attempting to retain a performance similar to that of the non-pruned counterpart. Parameter pruning consists of removing unused or least-used neurons of a neural network. Detecting and pruning these neurons requires a dataset that represents the whole population on which the NN was trained: by iteratively querying the model, the least-activated neurons are identified and pruned. Pruning is not directly applicable in federated learning because neither the parameter server nor the participants possess a dataset that represents the whole population. Going against the FL paradigm, we assume that the parameter server has such a dataset. If the parameter server performs the pruning during the learning phase, the transmitter will find out when it downloads the next model update and can re-transmit the message using the parameters of the new architecture as target weights. If a transmitter wants to increase the chances that a message will not be disrupted by pruning, it can analyse the model updates to discover the most used parameters, which are less likely to be pruned, and use that subset of parameters to transmit the message, as sketched below.
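The last strategy can be sketched as follows: the sender ranks parameters by how much they move across the global-model snapshots it observes and restricts the spreading-code support to the most-updated ones. The ranking heuristic, the 50% cut-off, and the toy snapshot generator are illustrative choices of ours, not the exact procedure evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_rounds = 10_000, 20

# Global-model snapshots observed by the sender over several rounds (toy values).
snapshots = np.cumsum(rng.normal(scale=0.01, size=(n_rounds, d)), axis=0)

# Rank parameters by average absolute change across rounds: heavily updated
# parameters are the ones least likely to be pruned by the aggregator.
avg_change = np.mean(np.abs(np.diff(snapshots, axis=0)), axis=0)
keep = int(0.5 * d)                                   # embed in the top 50% (illustrative)
robust_idx = np.argsort(avg_change)[-keep:]

# Spreading codes are then defined only over the selected coordinates.
code = np.zeros(d)
code[robust_idx] = rng.choice([-1.0, 1.0], size=keep)
print("embedding restricted to", keep, "of", d, "parameters")
```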

VII-C Gradient Clipping

Gradient clipping is a technique to mitigate the exploding-gradient problem in DNNs [92]. Typically, gradient clipping introduces a pre-determined threshold and then scales down gradients whose norm exceeds the threshold so that they match it, introducing a bias in the resulting gradient values, which helps stabilize the training process. In a federated learning scenario, the aggregator could employ gradient clipping on the participants' gradients. We performed many experiments in which the aggregator employs gradient clipping, using a wide range of clipping values, and observed that this method incurs no penalty on FedComm's ability to transmit the message. Even if a few bits of the message are corrupted by the aggregator's clipping action, FedComm's error-correction technique will repair them.
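A minimal sketch of server-side norm clipping and its effect on the despreading statistic is given below; the threshold, payload amplitude, and dimensions are illustrative. Because clipping only rescales an update, the sign of the correlation used for decoding is preserved, which is consistent with the robustness we observe.

```python
import numpy as np

rng = np.random.default_rng(9)
d, clip_threshold = 50_000, 1.0                    # illustrative threshold

update = rng.normal(scale=0.01, size=d)
code = rng.choice([-1.0, 1.0], size=d)
sent = update + 0.003 * code                       # update carrying one payload bit (+1)

def clip_by_norm(vec, max_norm):
    """Scale the whole vector down if its L2 norm exceeds the threshold."""
    norm = np.linalg.norm(vec)
    return vec * (max_norm / norm) if norm > max_norm else vec

clipped = clip_by_norm(sent, clip_threshold)

# Clipping rescales but never flips signs, so the despread statistic keeps its sign.
print("before clipping:", np.sign(code @ sent), " after clipping:", np.sign(code @ clipped))
```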

VIII Related Work

VIII-A Attacks on Federated Learning

Recent years have seen an increasing and constantly evolving pool of attacks against deep-learning models, and FL has also been shown to be susceptible to adversarial attacks [43, 41, 54]. For instance, while FL is designed with privacy in mind [57, 78], attacks such as property inference [4, 60, 93], model inversion [26], and other generative adversarial network based reconstruction attacks [38] have shown that the privacy of the users participating in the FL protocol can be compromised too. One of the first property-inference attacks is that of Ateniese et al. [4], which shows how an adversary with white-box access to an ML model can extract valuable information about the training data. Fredrikson et al. [26] extended the work in [4] by proposing a model-inversion attack that exploits the confidence values revealed by ML models. Along this line of work, Song et al. [81] demonstrated that it is possible to design algorithms that embed information about the training data into the model (i.e., backdooring) and to extract the embedded information from the model given only black-box access. Similarly, Carlini et al. [12] showed that deep generative sequence models can unintentionally memorize training inputs, which can then be extracted in a black-box setting. Ganju et al. [29] extended the work by Ateniese et al. [4] by crafting a property-inference attack against fully connected neural networks, exploiting the fact that fully connected neural networks are invariant under permutation of nodes in each layer. Zhang et al. [93] extended the above-mentioned property-inference attacks [4, 29] to the domain of multi-party learning by devising an attack that can extract the distribution of other parties' sensitive attributes in a black-box setting using a small number of inference queries. Melis et al. [60] crafted various membership-inference attacks against the FL protocol, under the assumption that the participants upload their weights to the parameter server after each mini-batch instead of after a local training epoch. Nasr et al. [67] presented a framework to measure the privacy leakage through the parameters of fully trained models as well as the parameter updates of models during training, both in the standalone and FL settings. Hitaj et al. [38] demonstrated that a malicious participant in a collaborative deep-learning scenario can use generative adversarial networks (GANs) to reconstruct class representatives. Finally, Bhagoji et al. [8] presented a model-poisoning attack that can poison the global model while ensuring convergence, assuming that the adversary controls a small number of participants of the learning scheme.

FedComm is not an attack against the federated learning protocol; we aim to transmit a hidden message within the model's updated parameters without impeding the learning process (Section VI-B).

VIII-B Defenses to Attacks on Federated Learning

In past years, several proposed attacks on privacy and integrity have demonstrated that distributed deep learning presents new challenges that need to be solved to guarantee a satisfactory level of privacy and security. Shokri et al. [78] were among the first to introduce the concept of distributed deep learning with the privacy of training data as one of its main objectives. They attempted to achieve a level of privacy by modifying the participants' behavior, requiring them to upload only a small subset of their trained parameters. To defend against membership-inference attacks, techniques such as model regularization are promising [26, 79].

Differential Privacy, another fundamental defense against privacy attacks, was introduced by Dwork [24, 23] to guarantee privacy up to a parameter $\epsilon$. DP uses random noise to ensure that the publicly visible information does not change much if one individual record in the dataset changes. As no individual sample can significantly affect the output, attackers cannot confidently infer the private information corresponding to any individual sample. Nasr et al. [66] built on the concept of DP by modeling the learning problem as a min-max privacy game and training the model in an adversarial setting, improving membership privacy with a negligible loss in model performance. Other defense strategies have been proposed in [94, 45]. In an attempt to prevent model-poisoning attacks, several robust distributed aggregators have been proposed [9, 90, 74, 68, 14], assuming direct access to training data or participants' updates. However, Fang et al. [25] recently demonstrated that these types of resilient distributed aggregators do little to defend against poisoning in an FL setting.

FedComm, in full-stealth mode (Section IV), does not introduce any artifact during the learning process, because the senders do not behave maliciously: they do not supply inconsistent inputs, attempt to poison the model, or provide updates that differ from those of other participants (Section VI-A). For these reasons, the aforementioned approaches are not directly applicable to preventing FedComm communication from happening.

VIII-C Backdooring Federated Learning

Backdoors are a class of attacks [34, 16] against ML algorithms where the adversary manipulates model parameters or training data in order to change the classification label given by the model to specific inputs. Bagdasaryan et al. [5] were the first to show that FL is vulnerable to this class of attacks. Simultaneously, Wang et al. [88] presented a theoretical setup for backdoor injection in FL, demonstrating how a model that is vulnerable to adversarial attacks is, under mild conditions, also vulnerable to backdooring.

Because this class of attacks is particularly disruptive, many mitigation techniques [16, 49, 85, 13] have been proposed over the years. Burkhalter et al. [11] presented a systematic study to assess the robustness of FL, extending the secure-aggregation technique proposed in [10]. They integrated a variety of properties and constraints on model updates using zero-knowledge proofs, which are shown to improve FL’s resilience against malicious participants who attempt to backdoor the learned model.

With the same end goal as us (i.e., covertly transmitting a message in the FL setting), Costa et al. [19] exploit backdooring of DNNs in FL to encode information that can be retrieved by observing the predictions of the global model at a particular round. To do so, they define a communication frame as the number of FL epochs necessary to transmit one bit. To transmit a bit during a frame, the sender applies model-poisoning techniques to force the model to switch the classification label of a chosen input. The receiver then monitors the classification label of that input at the beginning and at the end of the frame: an unchanged label encodes one bit value, while a switched label encodes the other. We highlight four major advantages of FedComm when compared to [19]: 1) The covert channel of [19] relies on backdooring FL [5], requiring specific tailoring to each domain. FedComm does not rely on backdooring FL and, more importantly, is domain-independent: our proposed strategy requires no additional modifications when deployed on different tasks or model architectures. 2) Bagdasaryan et al. [5] emphasize that the model needs to be close to convergence to achieve successful backdoor injection; FedComm is not bound by such restrictions. 3) In [19], transmitting a payload of n bits requires backdooring the global model n times. Given that the authors of [19] do not report details about the frame length, let us assume a hypothetical best-case scenario where [19] can add one backdoor per round. If n=370, [19] can only send 370 bits, whereas FedComm can covertly transmit 7904 bits (see Figure 4g). 4) Work on backdoor detection in FL [13, 49, 85] can detect the covert channel introduced by [19]. FedComm is not a backdooring attack and does not attempt to alter the behavior of the learned model in any way. FedComm employs spread-spectrum techniques to encode extra information within the model’s updated parameters without impairing the FL learning process. In FedComm’s full-stealth mode (Section IV), the gradient updates of the sender participant do not differ from the updates of other participants of the FL scheme. As such, backdooring defenses cannot prevent FedComm’s covert communication.
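To make the comparison concrete, the sketch below shows how a receiver in a label-flipping channel of this kind might decode one bit per frame from its observations of the global model; the function name, the observation interface, and the bit convention (unchanged label = 0) are our own illustrative assumptions, not details taken from [19].

```python
def decode_bits(labels_at_frame_boundaries):
    """Decode a bit string from a label-flipping covert channel.

    labels_at_frame_boundaries[i] is the global model's predicted label for the
    chosen trigger input, observed at the i-th frame boundary. An unchanged
    label across a frame decodes to 0, a switched label to 1 (an arbitrary
    convention chosen here for illustration).
    """
    bits = []
    for prev, curr in zip(labels_at_frame_boundaries, labels_at_frame_boundaries[1:]):
        bits.append(0 if curr == prev else 1)
    return bits

# Toy usage: labels observed for the trigger input at four frame boundaries.
print(decode_bits(["cat", "cat", "dog", "dog"]))  # -> [0, 1, 0]
```

The sketch also illustrates the throughput limitation discussed above: every decoded bit costs at least one full frame of poisoned training rounds.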

IX Conclusions and Future Work

In this work, we introduced FedComm, a covert communication technique that uses the federated learning (FL) scheme as a communication channel. We employ the CDMA spread-spectrum technique to transmit a desired message in a reliable and covert manner during the ongoing FL procedure. We hide the message in the gradients of the weight parameters of the neural network architecture being trained and then upload those parameters to the parameter server, where they are averaged.
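To give a sense of how such spread-spectrum encoding can operate on gradients, the sketch below spreads a toy payload over a flattened gradient vector using pseudo-random ±1 codes and recovers it by correlation at the receiver; the payload size, spreading gain, and code construction are illustrative assumptions of ours and do not reproduce FedComm’s exact encoder or parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

n_params = 10_000                       # length of the flattened gradient vector
payload_bits = rng.integers(0, 2, 16)   # toy payload of 16 bits
gain = 1e-3                             # small gain keeps the perturbation weak

# One pseudo-random +/-1 spreading code per payload bit, shared with the receiver.
codes = rng.choice([-1.0, 1.0], size=(len(payload_bits), n_params))

# Sender: superimpose the spread payload on top of an honest gradient update.
honest_grad = rng.normal(0.0, 0.01, n_params)
symbols = 2.0 * payload_bits - 1.0              # map bits {0,1} to symbols {-1,+1}
stego_grad = honest_grad + gain * (symbols @ codes)

# Receiver: correlate the observed update with each spreading code and threshold.
correlations = codes @ stego_grad
decoded = (correlations > 0).astype(int)
print("payload recovered:", np.array_equal(decoded, payload_bits))
```

Because each code is nearly orthogonal to the honest gradient and to the other codes, the correlation step recovers the payload even though the added perturbation is small relative to the gradient itself.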

FedComm does not introduce any particular artifact during the learning process, such as supplying inconsistent inputs, attempting to poison the model, or providing updates that differ significantly from those of other participants. We empirically show that our covert communication technique does not hamper the FL scheme by observing the accuracy and the loss of the global model at every round while the communication occurs: the performance of the global model is almost identical to when no covert communication is taking place. We also show that a message transmitted in full-stealth mode cannot be detected by the global parameter server, even if the server could observe individual gradient updates.

We believe FedComm paves the way for new attacks that can further compromise the security, privacy, and utility of FL training procedures. For instance, a covert communication channel such as ours can be adopted to distribute malicious payloads to honest participants. Given the nature of FL, such a payload would reach thousands of participants in a short amount of time.

Existing defense strategies do not hinder the effectiveness of FedComm; if anything, they make the channel stealthier. To this end, we stress that it is imperative to investigate the correlation between the payload (message) size and the model capacity. Understanding this relationship would allow countermeasures to be tuned accordingly. A thorough investigation of such relationships and the development of defense mechanisms targeting FedComm specifically are left as future work.

References

  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. Cited by: §II-B.
  • [2] D. Adam, Choquette-Choo, C. A., D. Natalie, and N. Papernot (2021) Beyond federation: collaborating in ml with confidentiality and privacy. Note: http://www.cleverhans.io/2021/05/01/capc.html Cited by: §I.
  • [3] Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), Baltimore, MD. External Links: Link Cited by: §I.
  • [4] G. Ateniese, L. V. Mancini, A. Spognardi, A. Villani, D. Vitali, and G. Felici (2015) Hacking smart machines with smarter ones: how to extract meaningful data from machine learning classifiers. International Journal of Security and Networks 10 (3), pp. 137–150. Cited by: §II-B, §VIII-A.
  • [5] E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin, and V. Shmatikov (2020) How to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics, pp. 2938–2948. Cited by: §I, §I, §VIII-C, §VIII-C.
  • [6] T. Bansal, D. Belanger, and A. McCallum (2016) Ask the GRU: multi-task learning for deep text recommendations. In proceedings of the 10th ACM Conference on Recommender Systems, pp. 107–114. Cited by: §I.
  • [7] J. Behrmann, W. Grathwohl, R. T. Chen, D. Duvenaud, and J. Jacobsen (2019) Invertible residual networks. In International Conference on Machine Learning, pp. 573–582. Cited by: §I.
  • [8] A. N. Bhagoji, S. Chakraborty, P. Mittal, and S. Calo (2019) Analyzing federated learning through an adversarial lens. In International Conference on Machine Learning, pp. 634–643. Cited by: §VIII-A.
  • [9] P. Blanchard, E. M. El Mhamdi, R. Guerraoui, and J. Stainer (2017) Machine learning with adversaries: byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §VIII-B.
  • [10] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2017) Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191. Cited by: §I, §II-B, §II-C, §VI-A, §VIII-C.
  • [11] L. Burkhalter, A. Viand, N. Küchler, A. Hithnawi, et al. (2021) RoFL: attestable robustness for secure federated learning. arXiv preprint arXiv:2107.03311. Cited by: §VIII-C.
  • [12] N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song (2019) The secret sharer: evaluating and testing unintended memorization in neural networks. External Links: 1802.08232 Cited by: §VIII-A.
  • [13] B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy, and B. Srivastava (2019) Detecting backdoor attacks on deep neural networks by activation clustering. ArXiv abs/1811.03728. Cited by: §VIII-C, §VIII-C.
  • [14] L. Chen, H. Wang, Z. B. Charles, and D. Papailiopoulos (2018) DRACO: byzantine-resilient distributed training via redundant gradients. In ICML, Cited by: §VIII-B.
  • [15] S. Chen, M. Kahla, R. Jia, and G. Qi (2021) Knowledge-enriched distributional model inversion attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 16178–16187. Cited by: §II-B.
  • [16] X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. ArXiv abs/1712.05526. Cited by: §VIII-C, §VIII-C.
  • [17] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807. Cited by: §I, §II-A.
  • [18] A. Continella, A. Guagnelli, G. Zingaro, G. De Pasquale, A. Barenghi, S. Zanero, and F. Maggi (2016) ShieldFS: a self-healing, ransomware-aware filesystem. In ACSAC, Cited by: §I.
  • [19] G. Costa, F. Pinelli, S. Soderi, and G. Tolomei (2021) Covert channel attack to federated learning systems. CoRR abs/2104.10561. External Links: Link, 2104.10561 Cited by: §VIII-C.
  • [20] T. M. Cover and J. A. Thomas (2006) Elements of information theory. Wiley & Sons. Cited by: §IV-A.
  • [21] F. De Gaspari, D. Hitaj, G. Pagnotta, L. De Carli, and L. V. Mancini (2020) The naked sun: malicious cooperation between benign-looking processes. In International Conference on Applied Cryptography and Network Security, pp. 254–274. Cited by: §I.
  • [22] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, pp. 4171–4186. External Links: Link, Document Cited by: §I.
  • [23] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor (2006) Our data, ourselves: privacy via distributed noise generation. In Advances in Cryptology - EUROCRYPT 2006, S. Vaudenay (Ed.), Berlin, Heidelberg, pp. 486–503. External Links: ISBN 978-3-540-34547-3 Cited by: §VII-A, §VIII-B.
  • [24] C. Dwork (2006) Differential privacy. In Automata, Languages and Programming, M. Bugliesi, B. Preneel, V. Sassone, and I. Wegener (Eds.), Berlin, Heidelberg, pp. 1–12. External Links: ISBN 978-3-540-35908-1 Cited by: §VII-A, §VIII-B.
  • [25] M. Fang, X. Cao, J. Jia, and N. Z. Gong (2020) Local model poisoning attacks to byzantine-robust federated learning. In USENIX Security Symposium, pp. 1605–1622. External Links: Link Cited by: §VIII-B.
  • [26] M. Fredrikson, S. Jha, and T. Ristenpart (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pp. 1322–1333. Cited by: §II-B, §II-B, §VIII-A, §VIII-B.
  • [27] M. Fredrikson, E. Lantz, S. Jha, S. Lin, D. Page, and T. Ristenpart (2014) Privacy in pharmacogenetics: an end-to-end case study of personalized warfarin dosing. In 23rd USENIX Security Symposium (USENIX Security 14), pp. 17–32. Cited by: §II-B.
  • [28] K. Fukushima (1980) Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics. External Links: ISSN 1432-0770, Document, Link Cited by: §V.
  • [29] K. Ganju, Q. Wang, W. Yang, C. A. Gunter, and N. Borisov (2018) Property inference attacks on fully connected neural networks using permutation invariant representations. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 619–633. Cited by: §II-B, §VIII-A.
  • [30] G. H. Golub and C. F. Van Loan (2013) Matrix computations, 4th edition. Johns Hopkins University Press. Cited by: §VI-A.
  • [31] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Note: http://www.deeplearningbook.org Cited by: §II-A.
  • [32] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §II-A.
  • [33] A. Graves, A. Mohamed, and G. Hinton (2013-05) Speech recognition with deep recurrent neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. , pp. 6645–6649. External Links: Document, ISSN 1520-6149 Cited by: §I.
  • [34] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg (2019) Badnets: evaluating backdooring attacks on deep neural networks. IEEE Access 7, pp. 47230–47244. Cited by: §I, §VIII-C.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §I, §II-A.
  • [36] Z. He, T. Zhang, and R. B. Lee (2019) Model inversion attacks against collaborative inference. In Proceedings of the 35th Annual Computer Security Applications Conference, pp. 148–162. Cited by: §II-B.
  • [37] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal processing magazine 29 (6), pp. 82–97. Cited by: §I.
  • [38] B. Hitaj, G. Ateniese, and F. Pérez-Cruz (2017) Deep models under the GAN: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 603–618. Cited by: §I, §I, §II-B, §VIII-A.
  • [39] S. Hochreiter and J. Schmidhuber (1997-11) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667, Document, Link, https://direct.mit.edu/neco/article-pdf/9/8/1735/813796/neco.1997.9.8.1735.pdf Cited by: §V.
  • [40] J. Jacobsen, A. Smeulders, and E. Oyallon (2018-04) i-RevNet: Deep Invertible Networks. In ICLR 2018 - International Conference on Learning Representations, Vancouver, Canada. External Links: Link Cited by: §I.
  • [41] M. S. Jere, T. Farnan, and F. Koushanfar (2020) A taxonomy of attacks on federated learning. IEEE Security & Privacy 19 (2), pp. 20–28. Cited by: §VIII-A.
  • [42] J. Jordon, J. Yoon, and M. Van Der Schaar (2018) PATE-GAN: generating synthetic data with differential privacy guarantees. In International conference on learning representations, Cited by: §II-B.
  • [43] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. (2019) Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977. Cited by: §VIII-A.
  • [44] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 4396–4405. External Links: Document Cited by: §I.
  • [45] N. Koti, M. Pancholi, A. Patra, and A. Suresh (2021-08) SWIFT: super-fast and robust privacy-preserving machine learning. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2651–2668. External Links: ISBN 978-1-939133-24-3, Link Cited by: §VIII-B.
  • [46] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, USA, pp. 1097–1105. External Links: Link Cited by: §II-A.
  • [47] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §V-A, §V.
  • [48] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ Cited by: §V-A, §V.
  • [49] K. Liu, B. Dolan-Gavitt, and S. Garg (2018) Fine-pruning: defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273–294. Cited by: §VIII-C, §VIII-C.
  • [50] T. Liu, Z. Liu, Q. Liu, W. Wen, W. Xu, and M. Li (2020) StegoNet: turn deep neural network into a stegomalware. In Annual Computer Security Applications Conference, pp. 928–938. Cited by: §III.
  • [51] Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2017) Trojaning attack on neural networks. Cited by: §I.
  • [52] M. A. Lozano, E. Piñol, M. Rebollo, K. Polotskaya, M. A. Garcia-March, J. A. Conejero, F. Escolano, N. Oliver, et al. (2021) Open Data Science to Fight COVID-19: Winning the 500k XPRIZE Pandemic Response Challenge. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 384–399. Cited by: §I.
  • [53] W. Luping, W. Wei, and L. Bo (2019) CMFL: mitigating communication overhead for federated learning. In 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), pp. 954–964. Cited by: §I.
  • [54] L. Lyu, H. Yu, J. Zhao, and Q. Yang (2020) Threats to federated learning. In Federated Learning, pp. 3–16. Cited by: §VIII-A.
  • [55] W. Marx (2021) How valencia crushed covid with ai. Note: https://www.wired.co.uk/article/valencia-ai-covid-data Cited by: §I.
  • [56] W. S. McCulloch and W. Pitts (1943) A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics 5 (4), pp. 115–133. Cited by: footnote 1.
  • [57] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. Cited by: §I, §II-B, §II-C, §II-C, §VIII-A.
  • [58] B. McMahan and D. Ramage (2017) Federated learning: collaborative machine learning without centralized training data. Note: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html Cited by: §I.
  • [59] S. Mehnaz, A. Mudgerikar, and E. Bertino (2018) RWGuard: a real-time detection system against cryptographic ransomware. In Research in Attacks, Intrusions, and Defenses, RAID ’18. Cited by: §I.
  • [60] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov (2019) Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 691–706. Cited by: §I, §II-B, §VIII-A.
  • [61] J. Menick and N. Kalchbrenner (2019) Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, External Links: Link Cited by: §I.
  • [62] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016) Pointer sentinel mixture models. CoRR abs/1609.07843. External Links: Link, 1609.07843 Cited by: §V-A, §V, §VI-B.
  • [63] E. L. Merrer, P. Perez, and G. Trédan (2017) Adversarial frontier stitching for remote neural network watermarking. CoRR abs/1711.01894. Cited by: §I.
  • [64] T. M. Mitchell (1997) Machine learning. 1 edition, McGraw-Hill, Inc., New York, NY, USA. External Links: ISBN 0070428077, 9780070428072 Cited by: §II-A.
  • [65] F. Mo, H. Haddadi, K. Katevas, E. Marin, D. Perino, and N. Kourtellis (2021) PPFL: privacy-preserving federated learning with trusted execution environments. arXiv preprint arXiv:2104.14380. Cited by: §I.
  • [66] M. Nasr, R. Shokri, and A. Houmansadr (2018) Machine learning with membership privacy using adversarial regularization. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pp. 634–646. Cited by: §VIII-B.
  • [67] M. Nasr, R. Shokri, and A. Houmansadr (2019) Comprehensive privacy analysis of deep learning: passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE symposium on security and privacy (SP), pp. 739–753. Cited by: §VIII-A.
  • [68] X. Pan, M. Zhang, D. Wu, Q. Xiao, S. Ji, and Z. Yang (2020-08) Justinian’s gaavernor: robust distributed learning with gradient aggregation agent. In 29th USENIX Security Symposium (USENIX Security 20), pp. 1641–1658. External Links: ISBN 978-1-939133-17-5, Link Cited by: §VIII-B.
  • [69] D. Pasquini, G. Ateniese, and M. Bernaschi (2020) Unleashing the tiger: inference attacks on split learning. arXiv preprint arXiv:2012.02670. Cited by: §II-B.
  • [70] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §V-D.
  • [71] K. J. Piczak (2015-10-13) ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, pp. 1015–1018. External Links: Link, Document, ISBN 978-1-4503-3459-4 Cited by: §V-A, §V, §VI-B.
  • [72] J. Proakis (1998) Digital communications. McGraw-Hill. Cited by: §II-D.
  • [73] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. Cited by: §I.
  • [74] S. Rajput, H. Wang, Z. B. Charles, and D. Papailiopoulos (2019) DETOX: a redundancy-based framework for faster and more robust gradient aggregation. In NeurIPS, Cited by: §VIII-B.
  • [75] T. Richardson and R. Urbanke (2008) Modern coding theory. Cambridge University Press. Cited by: §IV-A.
  • [76] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes (2018) ML-leaks: model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246. Cited by: §II-B.
  • [77] A. Shafran, S. Peleg, and Y. Hoshen (2021) Membership inference attacks are easier on difficult problems. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14820–14829. Cited by: §II-B.
  • [78] R. Shokri and V. Shmatikov (2015) Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pp. 1310–1321. Cited by: §I, §I, §VIII-A, §VIII-B.
  • [79] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §II-B, §VIII-B.
  • [80] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. Cited by: §I, §II-A, §V-B, §VI-B.
  • [81] C. Song, T. Ristenpart, and V. Shmatikov (2017) Machine learning models that remember too much. In Proceedings of the 2017 ACM SIGSAC Conference on computer and communications security, pp. 587–601. Cited by: §VIII-A.
  • [82] Z. Sun, P. Kairouz, A. T. Suresh, and H. B. McMahan (2019) Can you really backdoor federated learning?. arXiv preprint arXiv:1911.07963. Cited by: §I.
  • [83] R. Tang, M. Du, N. Liu, F. Yang, and X. Hu (2020) An embarrassingly simple approach for trojan attack in deep neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 218–228. Cited by: §I.
  • [84] D. Torrieri (2018) Principles of spread-spectrum communication systems, 4th edition. Springer, Cham. External Links: Document Cited by: item 2, §II-D.
  • [85] B. Tran, J. Li, and A. Madry (2018) Spectral signatures in backdoor attacks. In NeurIPS, Cited by: §VIII-C, §VIII-C.
  • [86] A. Triastcyn and B. Faltings (2019) Federated learning with bayesian differential privacy. In 2019 IEEE International Conference on Big Data (Big Data), pp. 2587–2596. Cited by: §I.
  • [87] S. Verdu (1998) Multiuser detection. Cambridge University Press. Cited by: §II-D, §II-D.
  • [88] H. Wang, K. Sreenivasan, S. Rajput, H. Vishwakarma, S. Agarwal, J. Sohn, K. Lee, and D. Papailiopoulos (2020) Attack of the tails: yes, you really can backdoor federated learning. Advances in Neural Information Processing Systems. Cited by: §I, §I, §VIII-C.
  • [89] X. Wu, M. Fredrikson, S. Jha, and J. F. Naughton (2016) A methodology for formalizing model-inversion attacks. In 2016 IEEE 29th Computer Security Foundations Symposium (CSF), pp. 355–370. Cited by: §II-B.
  • [90] D. Yin, Y. Chen, R. Kannan, and P. Bartlett (2018-10–15 Jul) Byzantine-robust distributed learning: towards optimal statistical rates. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 5650–5659. External Links: Link Cited by: §VIII-B.
  • [91] J. Zhang, Z. Gu, J. Jang, H. Wu, M. Ph. Stoecklin, H. Huang, and I. Molloy (2018) Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, ASIACCS ’18, New York, NY, USA, pp. 159–172. External Links: ISBN 978-1-4503-5576-6, Link, Document Cited by: §I.
  • [92] J. Zhang, T. He, S. Sra, and A. Jadbabaie (2019) Why gradient clipping accelerates training: a theoretical justification for adaptivity. In International Conference on Learning Representations, Cited by: §VII-C.
  • [93] W. Zhang, S. Tople, and O. Ohrimenko (2021-08) Leakage of dataset properties in multi-party machine learning. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2687–2704. External Links: ISBN 978-1-939133-24-3, Link Cited by: §VIII-A.
  • [94] W. Zheng, R. A. Popa, J. E. Gonzalez, and I. Stoica (2019) Helen: maliciously secure coopetitive learning for linear models. In 2019 IEEE Symposium on Security and Privacy (SP), pp. 724–738. Cited by: §VIII-B.