Complex-valued deep learning with differential privacy

10/07/2021
by   Alexander Ziller, et al.

We present ζ-DP, an extension of differential privacy (DP) to complex-valued functions. After introducing the complex Gaussian mechanism, whose properties we characterise in terms of (ε, δ)-DP and Rényi-DP, we present ζ-DP stochastic gradient descent (ζ-DP-SGD), a variant of DP-SGD for training complex-valued neural networks. We experimentally evaluate ζ-DP-SGD on three complex-valued tasks, i.e. electrocardiogram classification, speech classification and magnetic resonance imaging (MRI) reconstruction. Moreover, we provide ζ-DP-SGD benchmarks for a large variety of complex-valued activation functions and on a complex-valued variant of the MNIST dataset. Our experiments demonstrate that DP training of complex-valued neural networks is possible with rigorous privacy guarantees and excellent utility.


1 Introduction

The ability to harness diverse, feature-rich datasets for algorithm training can allow the scientific community to create machine learning (ML) models capable of solving challenging data-driven tasks. These include the creation of robust autonomous vehicles (Rao and Frtunikj, 2018), early-stage cancer discovery (Cruz and Wishart, 2006) or disease survival prediction (Rau et al., 2018). A subclass of these ML problems is able to profit particularly from the ability to execute deep learning workflows over complex-valued datasets, such as magnetic resonance imaging (MRI) (Virtue et al., 2017) or time-series data (Fan and Xiong, 2013; Kociuba and Rowe, 2016). Complex-valued deep learning has seen increased traction in the past years, owing in part to the improved support by ML frameworks and the broader availability of graphics processing unit (GPU) hardware able to tackle the increased computational requirements (Bassey et al., 2021). However, since complex numbers are often used to represent signals derived from sensitive biological or medical records (Cole et al., 2020; Küstner et al., 2020; Peker, 2016), privacy constraints can render such datasets hard to obtain. The resulting data scarcity impairs effective model training, prompting the adoption of regulation-compliant and privacy-preserving methods for data access. Distributed computation methods such as federated learning (FL) (Konečný et al., 2016) can partially address this requirement by only requiring participants to share results of the local computation rather than exchange data over the network. However, FL on its own has repeatedly been shown to be insufficient in the task of privacy protection (Geiping et al., 2020; Yin et al., 2021). Thus, bridging the gap between data protection and utilisation for algorithmic training requires methods able to offer objective privacy guarantees. Differential privacy (DP) (Dwork et al., 2014) has established itself as the cornerstone of such techniques and has been deployed in contexts like the US Census (Abowd, 2018) and distributed learning on mobile devices (Cormode et al., 2018). DP’s purview has been expanded to encompass deep learning through the introduction of DP stochastic gradient descent (DP-SGD) (Abadi et al., 2016), allowing for the training of deep neural networks on private data. So far, however, the application of DP to complex-valued ML tasks remains drastically under-explored. Our work attempts to address this challenge through the following contributions:

  1. We extend DP to the complex domain through a collection of techniques we refer to as ζ-DP. We use this term instead of complex-valued DP for brevity and to avoid confusion with the abbreviation cDP, which is already used for concentrated DP (Dwork and Rothblum, 2016). The letter ζ alludes to the complex-valued Riemann ζ-function and is intended to convey the notion of continuation to the complex domain.

  2. We define and discuss the complex Gaussian Mechanism (GM) in Section 4.1 and show that its properties generalise corresponding results on real-valued functions. This allows us to interpret the complex GM through the lens of previous work on (ε, δ)-DP and Rényi-DP (RDP).

  3. To enable the design and privacy-preserving training of complex-valued deep learning models, we introduce ζ-DP-SGD in Section 4.2.

  4. Finally, in Section 5 we experimentally evaluate our techniques on several real-life neural network training tasks, i.e. speech classification, abnormality detection in electrocardiograms and magnetic resonance imaging (MRI) reconstruction. Moreover, we establish baselines for future work by providing benchmark results on a complex-valued variant of the MNIST dataset and on complex neural network activation functions both with and without ζ-DP-SGD.

2 Related work

Prior work has addressed several challenges in non-private complex-valued deep learning, including the introduction of appropriate activation functions, and has presented applications in domains such as MRI reconstruction (Küstner et al., 2020) or time series analysis (Fink et al., 2014). For a detailed overview of methodology and applications, we refer to Hirose (2012); Bassey et al. (2021). Until recently, deep learning frameworks did not fully support complex arithmetic and automatic differentiation. Hence, previous works (Trabelsi et al., 2017; Nazarov and Burnaev, 2020) express ℂ as ℝ² and use two real-valued channels rather than complex floating-point numbers. This approach can lead to a spurious increase in function sensitivity and, by extension, to the addition of excessive noise in the private setting, adversely impacting utility. Our work specifically addresses this shortcoming through the use of Wirtinger/CR-calculus (Wirtinger, 1927; Kreutz-Delgado, 2009). A limited number of studies have utilised DP techniques in conjunction with complex-valued data (Fan and Xiong, 2013; Fioretto et al., 2019); however, to our knowledge, none has formalised a notion of complex-valued DP or investigated neural network applications. The (ε, δ)-definition of DP and the Gaussian mechanism are essential to our formalism, and details on their real-valued definitions can be found in Dwork et al. (2014). As stated above, DP-SGD was introduced by Abadi et al. (2016). Rényi-DP (RDP) was introduced by Mironov (2017) as a relaxation of (ε, δ)-DP with favourable properties under composition, rendering it particularly useful for DP-SGD privacy accounting.

3 Background

We begin by introducing key terminology required in the rest of our work. We assume that a trusted analyst in possession of sensitive data wishes to publish the results of some analysis performed on this data while offering the individuals to whom the data belongs a DP guarantee. We will refer to the set of all sensitive records as the sensitive database D, whereby we assume that one individual’s data is only present in the database once. Let 𝒟, the metric space of all sensitive databases, be equipped with the Hamming metric d_H and let D ∈ 𝒟. D’s adjacent database D′ can be constructed from D by adding or removing exactly one database row (that is, one individual’s data), such that d_H(D, D′) = 1. The analyst executes a query (function) f, for example a mean calculation, over the database. We first define the sensitivity of f:

Definition 1 (Sensitivity of f).

Let the database space 𝒟 and the query function f be defined as above. f maps the elements of 𝒟 to elements of a metric space 𝒳 equipped with a metric d. The (global) sensitivity Δf of f is then defined as:

Δf = max_{D, D′ ∈ 𝒟 : d_H(D, D′) = 1} d(f(D), f(D′))    (1)

The maximum is taken over all adjacent database pairs in 𝒟. When 𝒳 is the Euclidean space ℝⁿ and d is the L2 (Euclidean) metric, Δf is referred to as the L2 sensitivity. We will only use the L2 sensitivity in this work.
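As a brief worked example (our illustration, not part of the original formalism): for a query that sums records individually clipped to L2-norm at most C, adding or removing one individual changes the output by at most C, so the L2 sensitivity equals the clipping bound. A minimal numpy sketch with hypothetical data:

```python
import numpy as np

def clipped_sum(records, clip_norm=1.0):
    """Hypothetical query: sum of records, each clipped to L2-norm <= clip_norm."""
    records = np.atleast_2d(records)
    norms = np.linalg.norm(records, axis=1, keepdims=True)
    clipped = records / np.maximum(1.0, norms / clip_norm)
    return clipped.sum(axis=0)

# Adjacent databases: D_adj is D with one individual's row removed.
rng = np.random.default_rng(0)
D = rng.normal(size=(100, 3))
D_adj = D[:-1]

# The released value changes by at most clip_norm, i.e. the L2 sensitivity is C = 1.0.
print(np.linalg.norm(clipped_sum(D) - clipped_sum(D_adj)))
```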

In private data analysis and ML, we are often concerned with differentiable functions; for Lipschitz-continuous query functions, the equivalence of the Lipschitz constant and the L2-sensitivity (Raskhodnikova and Smith, 2016) can be exploited:

Definition 2 (Lipschitz constant of f).

Let f and d be defined as above. Then f is said to be K-Lipschitz continuous if and only if a non-negative real number K exists for which the following holds for all D, D′ ∈ 𝒟:

d(f(D), f(D′)) ≤ K · d_H(D, D′)    (2)

Evidently, Δf ≤ K by Equation (1) and the definition of adjacency. Moreover, letting Df denote the differential (total derivative) of f, we have K = sup ‖Df‖, where ‖·‖ is the operator norm (O’Searcoid, 2006). Therefore, for a scalar-valued query function, K = sup ‖∇f‖₂.

A DP mechanism adds noise to the results of f calibrated to its sensitivity. Here, we provide the definition of the (real-valued) Gaussian mechanism (GM):

Definition 3 (Gaussian mechanism).

The Gaussian mechanism M operates on the results of a query function f with L2 sensitivity Δf over a sensitive database D by outputting M(D) = f(D) + ξ, where ξ ∼ 𝒩(0, σ²·I_n). Here, I_n denotes the identity matrix with n diagonal elements and σ² is the variance of the Gaussian noise calibrated to Δf.
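A minimal sketch of Definition 3, assuming a query whose L2 sensitivity is known and a noise multiplier chosen by the analyst (all names and values are illustrative):

```python
import numpy as np

def gaussian_mechanism(query_output, sensitivity, noise_multiplier, rng=None):
    """Definition 3 sketch: release f(D) + xi with xi ~ N(0, sigma^2 * I_n),
    where sigma = noise_multiplier * sensitivity is calibrated to the L2 sensitivity."""
    rng = rng or np.random.default_rng()
    sigma = noise_multiplier * sensitivity
    noise = rng.normal(0.0, sigma, size=np.shape(query_output))
    return np.asarray(query_output, dtype=float) + noise

# Hypothetical query result with L2 sensitivity 1.0:
print(gaussian_mechanism(np.array([0.3, -1.2, 0.7]), sensitivity=1.0, noise_multiplier=2.0))
```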

The application of the GM with properly calibrated noise satisfies (ε, δ)-DP:

Definition 4 ((ε, δ)-DP).

The randomised mechanism M preserves (ε, δ)-DP if, for all pairs of adjacent inputs D and D′ and all subsets S of M’s range:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ    (3)

A number of relaxations have been proposed to characterise the properties of the GM, of which Rényi DP is arguably the most widely employed in DP deep learning frameworks owing to its favourable properties under composition.

Definition 5 (Rényi DP).

The mechanism M preserves (α, ε)-Rényi-DP (RDP) if, for all pairs of adjacent inputs D and D′:

D_α(M(D) ∥ M(D′)) ≤ ε    (4)

where D_α denotes the Rényi divergence of order α.

4 ζ-DP

In this section we introduce ζ-DP, an extension of DP to complex-valued query functions and mechanisms. ζ-DP generalises real-valued DP and allows the re-use of prior theoretical results and software implementations.

4.1 The complex Gaussian mechanism

We begin by introducing a variant of the GM suitable for query functions with codomain ℂⁿ.

Definition 6 (Complex Gaussian mechanism).

The complex Gaussian mechanism M_ℂ on f: 𝒟 → ℂⁿ outputs M_ℂ(D) = f(D) + ξ, where ξ ∼ 𝒞𝒩(0, σ²·I_n) and 𝒞𝒩 denotes circularly symmetric complex-valued Gaussian noise with variance σ².

Of note, a random variable ξ ∼ 𝒞𝒩(0, σ²) can be constructed by independently drawing two random variables ξ_R and ξ_I from a real-valued normal distribution and outputting ξ = ξ_R + i·ξ_I, where i is the imaginary unit.
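The construction above can be sketched directly in code. We assume here the convention that the total variance E[|ξ|²] of the complex noise equals σ², so each real component is drawn with variance σ²/2; this scaling is our reading of the definition and is not spelled out in the text:

```python
import numpy as np

def complex_gaussian_noise(shape, sigma, rng=None):
    """Circularly symmetric complex Gaussian noise built from two independent
    real-valued Gaussians. Assumption: total variance E[|xi|^2] = sigma^2,
    i.e. each component has variance sigma^2 / 2."""
    rng = rng or np.random.default_rng()
    real = rng.normal(0.0, sigma / np.sqrt(2.0), size=shape)
    imag = rng.normal(0.0, sigma / np.sqrt(2.0), size=shape)
    return real + 1j * imag

def complex_gaussian_mechanism(query_output, sigma, rng=None):
    """Complex GM sketch: release f(D) + xi with circularly symmetric complex Gaussian noise."""
    query_output = np.asarray(query_output, dtype=complex)
    return query_output + complex_gaussian_noise(query_output.shape, sigma, rng)

print(complex_gaussian_mechanism(np.array([1.0 + 2.0j, -0.5j]), sigma=1.0))
```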

We now state our two main theoretical results:

Theorem 1.

Let f: 𝒟 → ℂⁿ be a query function with L2 sensitivity Δf. Then, M_ℂ preserves (ε, δ)-DP if and only if the following holds:

Φ(Δf/(2σ) − εσ/Δf) − e^ε · Φ(−Δf/(2σ) − εσ/Δf) ≤ δ    (5)

where Φ denotes the cumulative distribution function of the standard (real-valued) normal distribution.

Proof.

The claim represents a generalisation of the Analytic Gaussian Mechanism (Balle and Wang, 2018) to ℂⁿ. It suffices to show that the magnitude of the privacy-loss random variable ℓ is bounded by ε with probability at least 1 − δ. As shown in Dwork and Rothblum (2016) and Balle and Wang (2018), given some fixed output y, ℓ is given by:

ℓ(y) = ln( p_{M_ℂ(D)}(y) / p_{M_ℂ(D′)}(y) )    (6)

where ln is the natural logarithm, and ℓ is distributed as:

ℓ ∼ 𝒩(η, 2η),  η = Δf² / (2σ²)    (7)

This is because the inner product term appearing in ℓ, in which ·^H denotes the Hermitian transpose, has a real-valued mean and hence follows a real-valued normal distribution, even when f is complex-valued. From here, the proof proceeds identically to the proof of Theorem 8 of Balle and Wang (2018). ∎
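For intuition, the noise scale satisfying a target (ε, δ) can be found numerically from the analytic-Gaussian-mechanism condition of Balle and Wang (2018), which Theorem 1 carries over to ℂⁿ; the sketch below assumes the condition exactly as reconstructed in Equation (5):

```python
from math import exp, sqrt
from statistics import NormalDist

def analytic_gm_delta(sigma, eps, sensitivity):
    """delta achieved at privacy level eps by a Gaussian mechanism of scale sigma,
    per the analytic GM condition (Balle & Wang, 2018) assumed in Equation (5)."""
    Phi = NormalDist().cdf
    a = sensitivity / (2.0 * sigma) - eps * sigma / sensitivity
    b = -sensitivity / (2.0 * sigma) - eps * sigma / sensitivity
    return Phi(a) - exp(eps) * Phi(b)

def calibrate_sigma(eps, delta, sensitivity, lo=1e-3, hi=1e3, iters=100):
    """Bisection (in log-space) for the smallest sigma with delta(sigma) <= delta."""
    for _ in range(iters):
        mid = sqrt(lo * hi)
        if analytic_gm_delta(mid, eps, sensitivity) <= delta:
            hi = mid
        else:
            lo = mid
    return hi

print(calibrate_sigma(eps=1.0, delta=1e-5, sensitivity=1.0))
```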

Theorem 2.

Let f, Δf and M_ℂ be defined as above. Then, M_ℂ preserves (α, ε)-RDP if:

σ² ≥ α · Δf² / (2ε)    (8)

We will rely on the following fact about the Rényi divergence of order α between arbitrary distributions:

Corollary 1 (Definition 2 in (Van Erven and Harremos, 2014)).

Let P and Q be two arbitrary distributions defined on a measurable space with densities p and q. Then, for orders α ∈ (0, 1) ∪ (1, ∞):

D_α(P ∥ Q) = 1/(α − 1) · ln ∫ p(x)^α · q(x)^(1−α) dx    (9)

In particular, for two normal distributions with means μ₀ and μ₁ and common variance σ²:

D_α(𝒩(μ₀, σ²·I) ∥ 𝒩(μ₁, σ²·I)) = α · ⟨μ₀ − μ₁, μ₀ − μ₁⟩ / (2σ²)    (10)

where ⟨·, ·⟩ denotes the inner product.

We can now prove Theorem 2:

Proof.

By Definition 6 and the additive property of the Gaussian distribution, the density functions of M_ℂ on D and D′ follow a circularly symmetric complex-valued Gaussian distribution with means f(D) and f(D′) and common covariance matrix σ²·I_n. By substituting in Equation (10):

D_α(M_ℂ(D) ∥ M_ℂ(D′)) = α · ⟨f(D) − f(D′), f(D) − f(D′)⟩_H / (2σ²) ≤ α · Δf² / (2σ²)    (11)

Hence, to preserve (α, ε)-RDP, it suffices to choose σ such that

α · Δf² / (2σ²) ≤ ε    (12)
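Under the bound sketched above (assuming, as the theorem states, that the standard real-valued RDP formula carries over unchanged), calibrating σ for a target (α, ε)-RDP guarantee is a one-line computation:

```python
from math import sqrt

def rdp_epsilon(alpha, sensitivity, sigma):
    """RDP guarantee of the (complex) Gaussian mechanism at order alpha,
    assuming the standard bound eps(alpha) = alpha * Delta^2 / (2 * sigma^2)."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)

def sigma_for_rdp(alpha, eps, sensitivity):
    """Smallest sigma achieving (alpha, eps)-RDP under the same assumption (Eq. (12))."""
    return sensitivity * sqrt(alpha / (2.0 * eps))

sigma = sigma_for_rdp(alpha=8.0, eps=0.5, sensitivity=1.0)
print(sigma, rdp_epsilon(8.0, 1.0, sigma))  # recovers eps = 0.5
```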

These findings allow for a seamless transfer of results which apply to real-valued functions to the complex domain. In particular, they yield the following insights:

  1. The complex GM inherits all properties of the (ε, δ)-DP and RDP interpretations of the real-valued GM, such as composition and sub-sampling amplification.

  2. The complex GM, like the real-valued GM, is fully characterised by the L2 sensitivity Δf and the magnitude of the noise σ.

  3. The GM naturally fits ζ-DP due to the convenient properties of the circularly symmetric complex-valued Gaussian distribution. As a counterexample, a complex-valued Laplace random variable is naturally non-circular in the complex (and multivariate) case, even when constructed from independent Laplace distributions (Kotz et al., 2001). Moreover, the utilisation of the L1 metric on the output space of f is disadvantageous, as even for scalar (complex) outputs, the L1 sensitivity can be higher than the L2 sensitivity. Lastly, the utilisation of elliptical noise is inherently unable to satisfy ε-DP in any dimension (Reimherr and Awan, 2019). We thus leave the introduction of alternative strategies for obtaining ζ-DP in the complex-valued setting to future investigation.

We conclude this section by introducing a modification of the DP stochastic gradient descent (DP-SGD) algorithm, which will be employed in our experimental evaluation.

4.2 ζ-DP-SGD

The DP-SGD algorithm (Abadi et al., 2016) represents an application of the GM to the training of deep neural networks. Using the terminology above, each training step of the neural network (whose loss function, in this setting, represents the query) leads to the release of a privatised gradient. Evidently, the noise magnitude of the GM must be calibrated to the sensitivity of the loss function. However, most neural network loss functions have a Lipschitz constant which is too high to preserve DP while maintaining acceptable utility (and, generally, the Lipschitz constant of neural networks is NP-hard to compute (Scaman and Virmaux, 2018)). Thus, DP-SGD (Abadi et al., 2016) artificially induces a bounded sensitivity condition by clipping the L2-norm of the gradient to a pre-defined value. A real-valued loss function is required for minimisation, as the complex plane, contrary to the real number line, does not admit a natural ordering. Our implementation of the algorithm makes use of Wirtinger (or CR-) calculus (Kreutz-Delgado, 2009) for gradient computations, similar to previous works on complex-valued deep learning (Virtue et al., 2017; Boeddeker et al., 2017). This technique, discussed in detail in Appendix A.1, provides several benefits: it relaxes the requirement for component functions to be holomorphic (that is, differentiable in the complex sense), only requiring them to be individually differentiable with respect to their real and imaginary components (differentiable in the real sense). For holomorphic functions, CR-derivatives nevertheless recover the correct derivative definition. Thus, CR-derivatives can also be used to compute the global sensitivity in the complex-valued case via the Lipschitz constant. More importantly, for real-valued loss functions of complex arguments, they lead to a correct gradient magnitude calculation, whereas expressing complex-valued functions as vector-valued functions in ℝ²ⁿ, a technique often employed in complex-valued neural network training (Trabelsi et al., 2017), can incur an undesirable multiplicative sensitivity increase which would diminish the utility of ζ-DP-SGD. We exemplify this phenomenon and the noise savings CR-calculus can enable in Appendix A.1. ζ-DP-SGD is presented in Algorithm 1 and relies on a modification of the gradient clipping step: we clip the conjugate gradient, which represents the direction of steepest ascent for a real-valued loss function ℒ:

∇ℒ ≔ ∂ℒ/∂w̄    (13)

where w̄ is the conjugate weight vector.

0:  Database D with samples {x₁, …, x_N}, neural network with loss function ℒ and weight vector w. Hyperparameters: learning rate η, noise variance σ², sampling probability q, gradient norm bound C, total steps T.
  Initialize w₀ randomly
  for t = 0, …, T − 1 do
     Draw a lot L_t with sampling probability q using Poisson or uniform sampling
     Compute per-sample conjugate gradient
     For each x_i ∈ L_t, compute g_t(x_i) ← ∂ℒ(w_t, x_i)/∂w̄_t
     Clip conjugate gradient
     ḡ_t(x_i) ← g_t(x_i) / max(1, ‖g_t(x_i)‖₂ / C)
     Apply the Complex Gaussian Mechanism and average
     g̃_t ← (1/|L_t|) · (Σ_i ḡ_t(x_i) + ξ),  ξ ∼ 𝒞𝒩(0, σ²C²·I)
     Descend
     w_{t+1} ← w_t − η · g̃_t
  end for
  Output updated neural network weight vector w_T and compute the privacy cost.
Algorithm 1 ζ-DP-SGD
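A minimal PyTorch sketch of one ζ-DP-SGD step following Algorithm 1. This is an illustration only, not the paper's reference implementation (which relies on the deepee library); the per-sample loop, the hyperparameter names and the noise convention for complex parameters are our assumptions:

```python
import torch

def zeta_dp_sgd_step(model, loss_fn, batch, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """Sketched zeta-DP-SGD step: per-sample conjugate gradients are clipped to
    L2-norm <= clip_norm, summed, perturbed with (complex) Gaussian noise of scale
    noise_multiplier * clip_norm, averaged, and used for a gradient descent step."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in batch:  # naive per-sample gradient computation
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.abs().pow(2).sum() for g in grads))
        factor = 1.0 / max(1.0, (total_norm / clip_norm).item())  # clipping factor
        for s, g in zip(summed, grads):
            s += factor * g

    sigma = noise_multiplier * clip_norm
    with torch.no_grad():
        for p, s in zip(params, summed):
            if p.is_complex():
                # circularly symmetric complex noise from two real Gaussians
                noise = torch.complex(torch.randn_like(p.real),
                                      torch.randn_like(p.real)) * (sigma / 2 ** 0.5)
            else:
                noise = torch.randn_like(p) * sigma
            p -= lr * (s + noise) / len(batch)
```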

5 Experimental evaluation

Throughout this section, we present results from the experimental evaluation of ζ-DP-SGD. Details on dataset preparation and training can be found in Appendix A.2.

5.1 Benchmarking ζ-DP-SGD on PhaseMNIST

The MNIST dataset (LeCun et al., 2010) is widely used as a benchmark dataset in real-valued DP-SGD literature. As a means of comparison, we thus begin our experimental evaluation with results on an adapted, complex-valued version of MNIST, which we term PhaseMNIST. In brief, for each example of the original MNIST dataset with label y, we obtain the imaginary component by selecting an image whose label ỹ stands in a fixed relation to y, resulting in a complex-valued input image composed of the real-valued and imaginary-component images. Only the label of the real-valued image is used. The results are summarised in Table 1, where we also provide baselines for real-valued MNIST training on the same architecture (with real-valued weights).

Accuracy
PhaseMNIST ζ-DP-SGD
MNIST DP-SGD
PhaseMNIST non-DP
MNIST non-DP
Table 1: Results for PhaseMNIST training in a private (ζ-DP-SGD) and non-private (non-DP) fashion. Results for real-valued MNIST are provided for approximate comparison using the same model architecture (but with real-valued weights) trained with identical settings.

The complex-weighted neural networks reached high accuracy (Table 1) at a low privacy budget consumption. We assume this to be due to the increased amount of information provided by the alignment with a second image as well as the higher entropic capacity of the network due to the complex-valued weights. A similar phenomenon was observed by Scardapane et al. (2018).

5.2 Privacy-preserving electrocardiogram abnormality detection on wearable devices

The advent of wearable devices incorporating electrocardiography (ECG) sensors has provided consumers the ability to detect signs of an abnormal heart rhythm. In this section, we demonstrate the utilisation of a small neural network architecture suitable for deployment, e.g. to a mobile device connected to such a biosensor, to be trained on ECG data from the China Physiological Signal Challenge (CPSC) 2018 dataset (Liu et al., 2018). We selected the task of automated Left Bundle Branch Block (LBBB) detection, formulated as a binary classification task against a normal (sinus) rhythm. This task is clinically relevant, as the sudden appearance of LBBB can herald acute coronary syndrome, which requires urgent attention to avert myocardial infarction. As ECG data constitutes personal health information, its protection is mandated both legally and ethically. We utilised ζ-DP-SGD for training a complex-valued neural network on Fourier-transformed ECG acquisitions. We adopt this strategy as it can benefit from two key properties of the Fourier transform: ECG data can contain high-frequency noise which is irrelevant for diagnosis and can be reduced using Fourier filtering. Concurrently, this technique compresses the signal, which can drastically reduce the amount of data transferred. Table 2 shows classification results and Figure 1 shows exemplary source data.

ROC-AUC
Non-DP
ζ-DP-SGD
Table 2: Private and non-private training results in the task of ECG abnormality detection.


Figure 1: Exemplary ECG data used for classification. A shows an example of a sinus rhythm (normal ECG) and B shows an example of an ECG exhibiting signs of LBBB. Observe also the high frequency noise around the baseline which can be filtered using the Fourier transform.

5.3 Differentially private speech command classification for voice assistant applications

In recent years, voice assistants have gained popularity in consumer applications such as home speakers, and rely heavily on ML. Recordings collected from users for training speech processing algorithms can be used in impersonation attacks, resulting in successful identity theft (Sweet, 2016), or in acoustic attacks, which trigger unintended behaviour in voice assistants (Yuan et al., 2018; Carlini et al., 2016). Protecting privacy in this setting is therefore paramount to increase trust and applicability, as well as to safeguard both users and systems from adversarial interference. Convolutional neural networks (CNNs) have been demonstrated to yield state-of-the-art performance on spectrogram-transformed audio data (Palanisamy et al., 2020). However, this and other works (Zhou et al., 2021) typically discard the imaginary components. We here experimentally demonstrate the differentially private training of a 2-dimensional CNN directly on the complex spectrogram data. We utilised a subset of the SpeechCommands dataset (Warden, 2018), specifically samples from the categories Yes, No, Up, Down, Left, Right, On, Off, Stop, and Go (dataset details are given in Appendix A.2.3). We transformed each waveform signal to a complex-valued 2-D spectrogram and used ζ-DP-SGD to train a complex-valued CNN. These results are summarised in Table 3 and Figure 2.

ROC-AUC
Non-DP
ζ-DP-SGD
Table 3: Results on the spectrogram-transformed samples from the SpeechCommands dataset for private and non-private training.


Figure 2: Exemplary waveform (A), real (B) and imaginary (C) spectrogram components from the utterance Stop. Spectrograms are log-magnitude transformed for clarity.

5.4 Benchmarking complex-valued activation functions for ζ-DP-SGD

A number of specialised activation functions, designed for utilisation with complex-valued neural networks, have been proposed in the literature. To guide practitioner choice in our newly proposed setting of ζ-DP-SGD training, we here provide activation function benchmarks on the SpeechCommands dataset used in the previous section. Table 4 summarises these results (a code sketch of several of these activation functions follows the table). We consistently found the inverted Gaussian (iGaussian) activation function to perform best in the ζ-DP-SGD setting. This may be in part due to its bounded magnitude, thereby recapitulating the effect Papernot et al. (2020) discuss for real-valued networks, i.e. that bounded activation functions lead to improved performance in DP-SGD. We leave the further investigation of this finding to future work.

Activation function Reference ROC-AUC
Separable Sigmoid Nitta (1997)
zReLU Guberman (2016)
Trainable Cardioid (per-feature bias) Virtue et al. (2017)
SigLog Georgiou and Koutsougeras (1992)
Trainable ModReLU (per-feature bias) Arjovsky et al. (2016)
Cardioid Virtue et al. (2017)
Trainable Cardioid (single bias) Virtue et al. (2017)
ModReLU Arjovsky et al. (2016)
cReLU Trabelsi et al. (2017)
iGaussian Virtue et al. (2017)
Table 4: ROC-AUC (mean ± STD) of complex-valued activation functions on the SpeechCommands dataset trained with identical settings and the same network architecture over five repetitions with ζ-DP-SGD.
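For reference, minimal PyTorch sketches of three of the benchmarked activation functions, based on our reading of the cited works (the iGaussian and the trainable variants are omitted; formulas should be checked against the respective references):

```python
import torch

def crelu(z):
    """cReLU (Trabelsi et al., 2017): ReLU applied separately to real and imaginary parts."""
    return torch.complex(torch.relu(z.real), torch.relu(z.imag))

def mod_relu(z, bias=-0.1, eps=1e-8):
    """modReLU (Arjovsky et al., 2016): ReLU on the magnitude (plus a bias), phase preserved."""
    mag = z.abs()
    return torch.relu(mag + bias) * z / (mag + eps)

def cardioid(z):
    """Cardioid (Virtue et al., 2017): scale z by (1 + cos(angle(z))) / 2."""
    return 0.5 * (1.0 + torch.cos(z.angle())) * z

z = torch.randn(4, dtype=torch.cfloat)
print(crelu(z), mod_relu(z), cardioid(z), sep="\n")
```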

5.5 MRI reconstruction

MRI is an important medical imaging modality and has been studied extensively in the context of deep learning (Akçakaya et al., 2019; Hammernik et al., 2018; Küstner et al., 2020; Muckley et al., 2020). MRI data is acquired in the so-called k-space. Sampling only a subset of k-space data allows for a considerable speed-up in acquisition time, benefiting patient comfort and costs; however, it typically leads to image artifacts, which reduce the diagnostic quality of the resulting MR images. Although neural networks have the ability to produce high-quality reconstructions, their usage for this task has been shown to sometimes lead to the appearance of spurious image content from the fully-sampled reference images the models have been originally trained on (Hammernik et al., 2021; Muckley et al., 2020; Shimron et al., 2021). DP could counteract such hallucination as it is designed to limit the effect of individual training examples on model training. However, this positive effect of DP may be counterbalanced by an unacceptable decrease in the diagnostic suitability of the reconstructed images. In this section, we investigate the ramifications of DP on the quality of MRI reconstructions. For this purpose, we trained a complex-valued U-Net model architecture on the task of reconstructing single-coil knee MRI images from the fastMRI dataset (Zbontar et al., 2018) using pseudo-random k-space undersampling (see Appendix A.2.4 for the acceleration settings). We observed a nearly equivalent performance in the non-DP and the ζ-DP-SGD settings, whereby the non-DP model enjoyed a slight performance advantage in all metrics. Moreover, to assess the diagnostic suitability of the reconstructed images, we asked a diagnostic radiologist, who was blinded to whether or not ζ-DP-SGD was used, to compare the resulting scans. No differences in diagnostic suitability were observed by the expert in any of the reconstructed images. We thus conclude that, at least with respect to image quality, DP can indeed match the non-private training of MRI reconstruction models, even at the ε value used in our experiments; we intend to investigate its effect on preventing training data hallucination into reconstructed images in future work. Results from these experiments are summarised in Table 5 and Figure 3.

NMSE PSNR SSIM
Non-DP
ζ-DP-SGD
Table 5: Results on the MRI reconstruction task. NMSE: normalised mean squared error, PSNR: peak signal-to-noise ratio in dB, SSIM: structural similarity index metric.


Figure 3: Exemplary reconstruction from a coronal proton-density weighted image of the knee. A: reference image, B: reconstruction model trained non-privately, C: reconstruction model trained with ζ-DP-SGD.
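To make the undersampled-acquisition setting concrete, the following toy numpy sketch applies a simplified pseudo-random, fastMRI-style column mask to the k-space of an image and computes the artifact-prone zero-filled baseline that a reconstruction network improves upon (illustrative only; our experiments use the fastMRI reference pipeline, see Appendix A.2.4):

```python
import numpy as np

def undersample_kspace(image, acceleration=4, center_fraction=0.08, rng=None):
    """Pseudo-random column-wise k-space undersampling (fastMRI-style mask, simplified)."""
    rng = rng or np.random.default_rng(0)
    kspace = np.fft.fftshift(np.fft.fft2(image))
    n_cols = kspace.shape[-1]
    mask = rng.random(n_cols) < (1.0 / acceleration)           # random columns
    center = int(center_fraction * n_cols)
    mask[n_cols // 2 - center // 2 : n_cols // 2 + center // 2] = True  # keep k-space centre
    return kspace * mask[None, :], mask

def zero_filled_reconstruction(masked_kspace):
    """Baseline: inverse FFT of the zero-filled k-space (exhibits aliasing artifacts)."""
    return np.fft.ifft2(np.fft.ifftshift(masked_kspace))

phantom = np.zeros((64, 64))
phantom[16:48, 24:40] = 1.0                                     # toy image
masked, mask = undersample_kspace(phantom)
recon = zero_filled_reconstruction(masked)
print("kept k-space columns:", int(mask.sum()), "of", mask.size)
```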

6 Conclusion

Our work presents ζ-DP, an extension of DP to the complex domain, and introduces key building blocks of DP training, namely the complex Gaussian mechanism and ζ-DP-SGD. Our experiments on real-world tasks demonstrate that the training of DP complex-valued neural networks is possible with high utility under tight privacy guarantees. This may, in part, be attributable to the increased learning capacity of complex-valued models resulting from incorporating two degrees of freedom (real and imaginary) per trainable model parameter. On the flip side, both complex-valued deep learning and DP incur a considerable computational performance penalty. Despite steadily improving complex number support, current deep learning frameworks have not yet implemented a full palette of complex-valued layers and activation functions. Moreover, the software framework utilised to computationally realise ζ-DP-SGD in our work relies on multithreading, which suffers from considerable overhead compared to implementations utilising vector instructions and hardware. We discuss the topic of software implementation and provide computational performance benchmarks in Appendices A.3 and A.4. Our conclusions highlight a requirement for mature software frameworks able to offer feature and performance parity with their real-valued counterparts. We focus on the Gaussian mechanism and the (ε, δ)-DP/RDP interpretations in this work, believing them to be the most relevant for deep learning applications. The formalisation of other DP mechanisms and interpretations (such as f-DP (Dong et al., 2019)) is a promising future research direction. Such future work could benefit from improved privacy accounting (interestingly, also relying on complex numbers, e.g. Zhu et al. (2021)) to diminish, as much as possible, the utility gap to non-private training. In conclusion, we contend that DP and other privacy-enhancing technologies can increase the amount of data available for scientific study, and are optimistic that our work represents a worthwhile contribution to their implementation in a broad variety of tasks.

Ethics statement

Our work follows all applicable ethical research standards and laws. All experiments were conducted on publicly available datasets. No new data concerning human or animal subjects was generated during our investigation.

Reproducibility Statement

We adhere to ICLR's reproducibility standards and include all necessary information to reproduce our experimental and theoretical results either in the main manuscript or in the Appendix. Theoretical results and proofs can be found in the main manuscript, Section 4, and additional information can be found in Appendix A.1. Details of dataset preparation and analysis can be found in Appendix A.2. Specifically, it contains details about the used datasets, their number of samples, all training, validation and test splits, as well as preprocessing steps. Furthermore, we describe model architectures, employed optimisers, learning rates, and the number of epochs for which models were trained. Lastly, for all DP trainings we provide the noise multipliers, clipping norms and sampling rates, as well as the δ values at which the ε values were calculated. Software implementation details and computational resources used can be found in Appendices A.3 and A.4.

References

  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pp. 308–318. Cited by: §1, §2, §4.2.
  • J. M. Abowd (2018) The US Census Bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2867–2867. Cited by: §1.
  • M. Akçakaya, S. Moeller, S. Weingärtner, and K. Uğurbil (2019) Scan-specific robust artificial-neural-networks for k-space interpolation (RAKI) reconstruction: Database-free deep learning for fast imaging. Magnetic Resonance in Medicine 81 (1), pp. 439–453. Cited by: §5.5.
  • M. Arjovsky, A. Shah, and Y. Bengio (2016) Unitary evolution recurrent neural networks. In International Conference on Machine Learning, pp. 1120–1128. Cited by: Table 4.
  • B. Balle and Y. Wang (2018) Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In International Conference on Machine Learning, pp. 394–403. Cited by: §4.1.
  • J. Bassey, L. Qian, and X. Li (2021) A survey of complex-valued neural networks. arXiv preprint arXiv:2101.12249. Cited by: §1, §2.
  • C. Boeddeker, P. Hanebrink, L. Drude, J. Heymann, and R. Haeb-Umbach (2017) On the computation of complex-valued gradients with application to statistically optimum beamforming. arXiv preprint arXiv:1701.00392. Cited by: §4.2.
  • D. Brandwood (1983) A complex gradient operator and its application in adaptive array theory. In IEE Proceedings H-Microwaves, Optics and Antennas, Vol. 130, pp. 11–16. Cited by: §A.1.
  • N. Carlini, P. Mishra, T. Vaidya, Y. Zhang, M. Sherr, C. Shields, D. Wagner, and W. Zhou (2016) Hidden voice commands. In 25th USENIX Security Symposium, pp. 513–530. Cited by: §5.3.
  • S. Chatterjee, C. Sarasaen, A. Sciarra, M. Breitkopf, S. Oeltze-Jafra, A. Nürnberger, and O. Speck (2021) Going beyond the image space: undersampled MRI reconstruction directly in the k-space using a complex valued residual neural network. In 2021 ISMRM & SMRT Annual Meeting & Exhibition, pp. 1757. Cited by: §A.3.
  • E. K. Cole, J. Y. Cheng, J. M. Pauly, and S. S. Vasanawala (2020) Analysis of deep complex-valued convolutional neural networks for MRI reconstruction. arXiv preprint arXiv:2004.01738. Cited by: §1.
  • G. Cormode, S. Jha, T. Kulkarni, N. Li, D. Srivastava, and T. Wang (2018) Privacy at scale: Local differential privacy in practice. In Proceedings of the 2018 International Conference on Management of Data, pp. 1655–1658. Cited by: §1.
  • J. A. Cruz and D. S. Wishart (2006) Applications of machine learning in cancer prediction and prognosis. Cancer informatics 2, pp. 59–77. Cited by: §1.
  • J. Dong, A. Roth, and W. J. Su (2019) Gaussian differential privacy. arXiv preprint arXiv:1905.02383. Cited by: §6.
  • C. Dwork, A. Roth, et al. (2014) The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 9 (3-4), pp. 211–407. Cited by: §1, §2.
  • C. Dwork and G. N. Rothblum (2016) Concentrated differential privacy. arXiv preprint arXiv:1603.01887. Cited by: item 1, §4.1.
  • L. Fan and L. Xiong (2013) Adaptively sharing real-time aggregate with differential privacy. IEEE Transactions on Knowledge and Data Engineering (TKDE) 26 (9), pp. 2094–2106. Cited by: §1, §2.
  • O. Fink, E. Zio, and U. Weidmann (2014) Predicting component reliability and level of degradation with complex-valued neural networks. Reliability Engineering & System Safety 121, pp. 198–206. Cited by: §2.
  • F. Fioretto, T. W. Mak, and P. Van Hentenryck (2019) Differential privacy for power grid obfuscation. IEEE Transactions on Smart Grid 11 (2), pp. 1356–1366. Cited by: §2.
  • J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller (2020) Inverting Gradients–How easy is it to break privacy in federated learning?. arXiv preprint arXiv:2003.14053. Cited by: §1.
  • G. M. Georgiou and C. Koutsougeras (1992) Complex domain backpropagation. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing 39 (5), pp. 330–334. Cited by: Table 4.
  • N. Guberman (2016) On complex valued convolutional neural networks. arXiv preprint arXiv:1602.09046. Cited by: Table 4.
  • K. Hammernik, T. Klatzer, E. Kobler, M. P. Recht, D. K. Sodickson, T. Pock, and F. Knoll (2018) Learning a variational network for reconstruction of accelerated MRI data. Magnetic resonance in medicine 79 (6), pp. 3055–3071. Cited by: §5.5.
  • K. Hammernik, J. Schlemper, C. Qin, J. Duan, R. M. Summers, and D. Rueckert (2021) Systematic evaluation of iterative deep neural networks for fast parallel MRI reconstruction with sensitivity-weighted coil combination. Magnetic Resonance in Medicine 86 (4), pp. 1859–1872. Cited by: §5.5.
  • A. Hirose (2012) Complex-valued neural networks. Vol. 400, Springer Science & Business Media. Cited by: §2.
  • M. C. Kociuba and D. B. Rowe (2016) Complex-valued time-series correlation increases sensitivity in FMRI analysis. Magnetic resonance imaging 34 (6), pp. 765–770. Cited by: §1.
  • J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon (2016) Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492. Cited by: §1.
  • S. Kotz, T. J. Kozubowski, and K. Podgórski (2001) The Laplace Distribution and Generalizations. Birkhäuser Boston. Cited by: item 3.
  • K. Kreutz-Delgado (2009) The complex gradient operator and the CR-calculus. arXiv preprint arXiv:0906.4835. Cited by: §A.1, §2, §4.2.
  • T. Küstner, N. Fuin, K. Hammernik, A. Bustin, H. Qi, R. Hajhosseiny, P. G. Masci, R. Neji, D. Rueckert, R. M. Botnar, et al. (2020) CINENet: deep learning-based 3D cardiac CINE MRI reconstruction with multi-coil complex-valued 4D spatio-temporal convolutions. Scientific reports 10 (1), pp. 1–13. Cited by: §1, §2, §5.5.
  • Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. Cited by: §5.1.
  • F. Liu, C. Liu, L. Zhao, X. Zhang, X. Wu, X. Xu, Y. Liu, C. Ma, S. Wei, Z. He, J. Li, and E. N. Y. Kwee (2018) An Open Access Database for Evaluating the Algorithms of Electrocardiogram Rhythm and Morphology Abnormality Detection. Journal of Medical Imaging and Health Informatics 8 (7), pp. 1368–1373. Cited by: §A.2.2, §5.2.
  • I. Mironov (2017) Rényi Differential Privacy. 2017 IEEE 30th Computer Security Foundations Symposium (CSF). Cited by: §2.
  • M. J. Muckley, B. Riemenschneider, A. Radmanesh, S. Kim, G. Jeong, J. Ko, Y. Jun, H. Shin, D. Hwang, M. Mostapha, et al. (2020) State-of-the-art Machine Learning MRI reconstruction in 2020: Results of the second fastMRI challenge. arXiv preprint arXiv:2012.06318. Cited by: §5.5.
  • I. Nazarov and E. Burnaev (2020) Bayesian Sparsification of Deep C-valued Networks. In International Conference on Machine Learning, Vol. 119, pp. 7230–7242 (en). Note: ISSN: 2640-3498 Cited by: §2.
  • T. Nitta (1997) An extension of the back-propagation algorithm to complex numbers. Neural Networks 10 (8), pp. 1391–1415. Cited by: Table 4.
  • M. O’Searcoid (2006) Metric Spaces. Springer Undergraduate Mathematics Series, Springer London. External Links: ISBN 9781846286278, LCCN 2006924371 Cited by: §3.
  • K. Palanisamy, D. Singhania, and A. Yao (2020) Rethinking CNN models for audio classification. arXiv preprint arXiv:2007.11154. Cited by: §5.3.
  • N. Papernot, A. Thakurta, S. Song, S. Chien, and Ú. Erlingsson (2020) Tempered sigmoid activations for deep learning with differential privacy. arXiv preprint arXiv:2007.14191. Cited by: §5.4.
  • M. Peker (2016) An efficient sleep scoring system based on EEG signal using complex-valued machine learning algorithms. Neurocomputing 207, pp. 165–177. Cited by: §1.
  • Q. Rao and J. Frtunikj (2018) Deep learning for self-driving cars: Chances and challenges. In Proceedings of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pp. 35–38. Cited by: §1.
  • S. Raskhodnikova and A. Smith (2016) Lipschitz extensions for node-private graph statistics and the generalized exponential mechanism. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 495–504. Cited by: §3.
  • C. Rau, P. Kuo, P. Chien, C. Huang, H. Hsieh, and C. Hsieh (2018) Mortality prediction in patients with isolated moderate and severe traumatic brain injury using machine learning models. PloS one 13 (11), pp. e0207192. Cited by: §1.
  • M. Reimherr and J. Awan (2019) Elliptical Perturbations for Differential Privacy. arXiv preprint arXiv:1905.09420. Cited by: item 3.
  • K. Scaman and A. Virmaux (2018) Lipschitz regularity of deep neural networks: analysis and efficient estimation. arXiv preprint arXiv:1805.10965. Cited by: §4.2.
  • S. Scardapane, S. Van Vaerenbergh, A. Hussain, and A. Uncini (2018) Complex-valued neural networks with nonparametric activation functions. IEEE Transactions on Emerging Topics in Computational Intelligence 4 (2), pp. 140–150. Cited by: §5.1.
  • E. Shimron, J. I. Tamir, K. Wang, and M. Lustig (2021) Subtle Inverse Crimes: Naively training machine learning algorithms could lead to overly-optimistic results. arXiv preprint arXiv:2109.08237. Cited by: §5.5.
  • C. E. Sweet (2016) The Hidden Scam: Why Consumers Should No Longer Be Forced to Shoulder the Burden of Liability for Mobile Cramming. J. Bus. & Tech. L. 11, pp. 69. Cited by: §5.3.
  • C. Trabelsi, O. Bilaniuk, Y. Zhang, D. Serdyuk, S. Subramanian, J. F. Santos, S. Mehri, N. Rostamzadeh, Y. Bengio, and C. J. Pal (2017) Deep complex networks. arXiv preprint arXiv:1705.09792. Cited by: §2, §4.2, Table 4.
  • T. Van Erven and P. Harremos (2014) Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60 (7), pp. 3797–3820. Cited by: Corollary 1.
  • P. Virtue, S. Yu, and M. Lustig (2017) Better than real: Complex-valued neural nets for MRI fingerprinting. In 2017 IEEE international conference on image processing (ICIP), pp. 3953–3957. Cited by: §1, §4.2, Table 4.
  • P. Warden (2018) Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: §A.2.3, §5.3.
  • W. Wirtinger (1927) Zur formalen Theorie der Funktionen von mehr komplexen Veränderlichen. Mathematische Annalen 97 (1), pp. 357–375. Cited by: §2.
  • H. Yin, A. Mallya, A. Vahdat, J. M. Alvarez, J. Kautz, and P. Molchanov (2021) See through Gradients: Image Batch Recovery via GradInversion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16337–16346. Cited by: §1.
  • X. Yuan, Y. Chen, Y. Zhao, Y. Long, X. Liu, K. Chen, S. Zhang, H. Huang, X. Wang, and C. A. Gunter (2018) Commandersong: A systematic approach for practical adversarial voice recognition. In 27th USENIX Security Symposium, pp. 49–64. Cited by: §5.3.
  • J. Zbontar, F. Knoll, A. Sriram, T. Murrell, Z. Huang, M. J. Muckley, A. Defazio, R. Stern, P. Johnson, M. Bruno, et al. (2018) fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839. Cited by: §A.2.4, §5.5.
  • Q. Zhou, J. Shan, W. Ding, C. Wang, S. Yuan, F. Sun, H. Li, and B. Fang (2021) Cough Recognition Based on Mel-Spectrogram and Convolutional Neural Network. Frontiers in Robotics and AI 8. Cited by: §5.3.
  • Y. Zhu, J. Dong, and Y. Wang (2021) Optimal Accounting of Differential Privacy via Characteristic Function. arXiv preprint arXiv:2106.08567. Cited by: §6.
  • A. Ziller, D. Usynin, R. Braren, M. Makowski, D. Rueckert, and G. Kaissis (2021) Medical imaging deep learning with differential privacy. Scientific Reports 11 (1), pp. 1–8. Cited by: §A.3.

Appendix A Appendix

a.1 Wirtinger/CR-Calculus

In this section, we present key results from Wirtinger (or CR-) calculus which are used in our work. For a detailed treatment, we refer to Kreutz-Delgado (2009).
Consider a function f: ℂ → ℂ. As for real-valued functions, the derivative of f at a point z₀ can be defined as:

f′(z₀) = lim_{h→0} [f(z₀ + h) − f(z₀)] / h    (14)

If this limit is defined for the (infinitely many) series approaching z₀, f is called complex differentiable at z₀ (equivalently, differentiable in the complex sense). If, in addition, f′ exists everywhere in the neighbourhood of z₀, f is called holomorphic. It is also possible to write z = x + i·y and to then express f as two real-valued functions u and v of the variables x and y:

f(z) = f(x + i·y) = u(x, y) + i·v(x, y)    (15)

The derivative of f in this real-valued representation can then be written as the Jacobian of (u, v) with respect to (x, y):

Df(x, y) = [[∂u/∂x, ∂u/∂y], [∂v/∂x, ∂v/∂y]]    (16)

If this derivative exists at z₀, f is called differentiable in the real sense. This interpretation represents ℂ as ℝ² or, more generally, for vector-valued functions, ℂⁿ as ℝ²ⁿ. The Cauchy-Riemann equations state that, for f to be holomorphic, it must satisfy:

∂u/∂x = ∂v/∂y  and  ∂u/∂y = −∂v/∂x    (17)

As discussed above, the complex plane does not admit a natural ordering. Hence, the minimisation of a complex-valued function is not defined. Therefore, for complex-valued deep learning, we only consider real-valued (loss) functions ℒ. By Equation (15), such a function has v(x, y) = 0 everywhere. Thus, by the Cauchy-Riemann equations, such a real-valued function is only holomorphic if:

∂u/∂x = ∂u/∂y = 0    (18)

This means that any holomorphic real-valued function must be constant, which invalidates its usefulness for optimisation. The Wirtinger/CR-derivatives provide an alternative interpretation of the Cauchy-Riemann equations which allows us to consider holomorphicity and differentiability in the real sense separately. Thus, they recover the usefulness of interpreting ℂ as ℝ² while preventing multiplicative penalties on the gradient norm as a consequence of following this interpretation too closely. We will motivate this somewhat informal notion with an example below. The Wirtinger/CR-derivatives (the term derivative represents an abuse of terminology, as they are formal operators and not derivatives with respect to actual variables; however, the interpretation as derivatives is intuitive, and we will thus retain it) of f are defined as:

∂f/∂z = ½ (∂f/∂x − i·∂f/∂y),   ∂f/∂z̄ = ½ (∂f/∂x + i·∂f/∂y)    (19)

An immediate consequence of this definition is that the Cauchy-Riemann equations can be expressed as:

∂f/∂z̄ = 0    (20)

Therefore, if a function is holomorphic, ∂f/∂z corresponds to the derivative in the complex sense (that is, f′), while, if f is differentiable in the real sense, both ∂f/∂z and ∂f/∂z̄ are valid (and are conjugates of each other). As stated above, it can be shown that the steepest ascent of a real-valued function ℒ is aligned with ∂ℒ/∂z̄. In this sense, ∂/∂z̄ fulfils the role of the ∇ operator for real, scalar-valued loss functions. Evidently, compared to the actual gradient of ℒ in the real sense, the following relationship holds:

∇_{(x,y)} ℒ = ∂ℒ/∂x + i·∂ℒ/∂y = 2 · ∂ℒ/∂z̄    (21)

However, re-defining the gradient used for optimisation as ∇ℒ ≔ ∂ℒ/∂z̄ is desirable (and correct, as shown by Brandwood (1983)). We will motivate this requirement with an example: let f be a real-valued function of a complex variable such that:

f(z) = z · z̄ = |z|²    (22)

The Wirtinger/CR-derivative of f is ∂f/∂z̄ = z, whose L2-norm is |z| = √(x² + y²). The same output can be realised by interpreting f as a function of a real-valued vector (x, y)ᵀ:

f(x, y) = x² + y²    (23)

The gradient of f in this interpretation is (2x, 2y)ᵀ, whose L2-norm is 2·√(x² + y²) = 2·|z|. This undesirable multiplicative penalty, which would translate to a superfluous multiplicative increase in the noise scale of the GM to preserve DP, is a consequence of ignoring the connection between the real and imaginary parts inherent to complex numbers, but not to components of vectors. In fact, x² + y² is neither equivalent to z² (as would be the case if z = x, where y = 0), nor is it equivalent to ⟨z, z⟩ in the complex sense, as complex multiplication lacks the bilinearity inherent to a real inner product space. Both complications are avoided by the re-definition of the Wirtinger/CR-derivative as the gradient used for optimisation, which prompts its utilisation in our work. As a note to practitioners, certain deep learning frameworks silently re-scale the Wirtinger/CR-gradient by a factor of 2 to avoid user confusion caused by a lower effective learning rate. To ascertain a correct implementation, we therefore recommend examining this behaviour by testing the gradient norm of known functions.
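Following the practitioner's note above, the behaviour can be probed with a known function such as ℒ(z) = z·z̄ from the example; whether the framework returns ∂ℒ/∂z̄ or a re-scaled multiple of it is exactly what the printed factor reveals (a PyTorch sketch; the observed factor may differ between frameworks and versions):

```python
import torch

# Known test function: L(z) = z * conj(z) = |z|^2, whose conjugate Wirtinger
# derivative is dL/dz-bar = z (so its L2-norm is |z|).
z = torch.tensor([3.0 + 4.0j], requires_grad=True)
loss = (z * z.conj()).real.sum()
loss.backward()

analytic = z.detach()   # dL/dz-bar = z; norm 5 for z = 3 + 4i
returned = z.grad       # whatever the framework's autograd returns

print("analytic conjugate Wirtinger derivative:", analytic)
print("autograd gradient:                      ", returned)
print("framework re-scaling factor:", (returned / analytic).real.item())
```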

a.2 Dataset preparation and Model Training

a.2.1 PhaseMNIST

Dataset construction

As described in the main manuscript, PhaseMNIST is intended as a benchmark dataset for complex-valued computer vision tasks and contains images of handwritten digits from 0 to 9. (The version of the dataset used in this study will be made publicly available upon acceptance.) The training and testing sets are derived from the corresponding splits of the original MNIST dataset. For each example of the original MNIST dataset, from which PhaseMNIST is constructed, we performed the following procedure: let y be the label corresponding to the real-valued image. We then constructed the imaginary component by (deterministically) sampling uniformly with replacement from the set of images whose label ỹ stands in a fixed relation to y. We used the label of the real-valued image as the label of the overall training example.

Model training

We used a complex-valued model consisting of three fully connected layers and an output layer of 10 units. The Cardioid activation function was used between layers and the Softmax activation function after the output layer. The model was trained with the Stochastic Gradient Descent optimiser at the same learning rate both for ζ-DP-SGD and for non-private training. The non-private model converged after 3 epochs, whereas the ζ-DP-SGD model required 10 epochs to achieve the same accuracy. For ζ-DP-SGD, a fixed noise multiplier and L2 clipping norm were used; the ε value was calculated at a fixed δ. A fixed sampling rate was used for ζ-DP-SGD, and a fixed batch size for non-DP training.

a.2.2 ECG Dataset

Dataset preparation

We utilised the China Physiological Signal Challenge 2018 (Liu et al., 2018) dataset for this task. We used the normal and left bundle branch block classes and a single ECG channel. The ECGs were loaded from the provided Matlab format using the SciPy library and trimmed or padded to a fixed length. The numpy Fast Fourier Transform implementation was used, whereby the signal was pre-trimmed to a fixed length before transformation. The final dataset was split into training and testing examples.
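A sketch of this preprocessing pipeline (single-channel selection, trimming/padding, truncated FFT); the lengths and the commented loading step are illustrative placeholders, since the exact values and file layout are not reproduced here:

```python
import numpy as np
from scipy.io import loadmat

def trim_or_pad(signal, length):
    """Trim or zero-pad a 1-D signal to a fixed length."""
    if signal.shape[0] >= length:
        return signal[:length]
    return np.pad(signal, (0, length - signal.shape[0]))

def ecg_to_spectrum(signal, time_len=30000, fft_len=5000):
    """Hypothetical preprocessing sketch: fix the length of a single-channel ECG,
    then keep a (pre-trimmed) FFT as the complex-valued network input.
    time_len and fft_len are illustrative placeholders, not the paper's values."""
    signal = trim_or_pad(np.asarray(signal, dtype=np.float64), time_len)
    return np.fft.fft(signal[:fft_len])

# record = loadmat("A0001.mat")            # CPSC2018 recordings ship as Matlab files
# spectrum = ecg_to_spectrum(record[...])  # select one channel from the loaded struct
```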

Model training

We implemented a complex-valued fully-connected neural network architecture consisting of input/hidden layers and a single output unit. The cReLU activation function was used both in the non-DP and the ζ-DP-SGD setting. The output layer implemented the magnitude operation followed by a logistic sigmoid activation function. Models were trained using the SGD optimiser, with L2 regularisation for non-DP training and a separately tuned learning rate for ζ-DP-SGD training. A fixed batch size was used for non-private training; ζ-DP-SGD used a fixed sampling rate, noise multiplier and L2 clipping norm. ε was calculated at a δ of 1/N, where N is the number of training samples. Both models were trained for a fixed number of epochs.

a.2.3 Speech command classification dataset

Dataset preparation

We used a subset of the SpeechCommands dataset (Warden, 2018) as described above, consisting of samples from each of the categories Yes, No, Up, Down, Left, Right, On, Off, Stop, and Go, split into a training set and a testing set. The waveform data was decoded using the TensorFlow library and, where necessary, padded to a fixed number of samples. The TensorFlow implementation of the Short Time Fourier Transform was used with a fixed frame length and frame step.
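A sketch of the spectrogram transformation using TensorFlow's STFT; the target length, frame length and frame step below are illustrative placeholders rather than the values used in our experiments:

```python
import numpy as np
import tensorflow as tf

def waveform_to_complex_spectrogram(waveform, target_len=16000,
                                    frame_length=256, frame_step=128):
    """Pad/trim a waveform to a fixed length and compute its complex-valued STFT.
    target_len, frame_length and frame_step are illustrative placeholders."""
    waveform = np.asarray(waveform, dtype=np.float32)[:target_len]
    if waveform.shape[0] < target_len:
        waveform = np.pad(waveform, (0, target_len - waveform.shape[0]))
    # tf.signal.stft returns a complex64 tensor of shape (frames, fft_bins)
    return tf.signal.stft(tf.convert_to_tensor(waveform),
                          frame_length=frame_length, frame_step=frame_step)

spec = waveform_to_complex_spectrogram(np.random.randn(15000))
print(spec.shape, spec.dtype)
```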

Model training

For this task, we employed a complex-valued 2D CNN using filters of a fixed size without zero-padding and a fixed stride. The convolutional layers had a fixed number of output filters, whereby a MaxPooling layer was used between the second layer and the third layer and an adaptive MaxPooling layer after the final convolutional layer. The convolutional block was followed by a fully connected layer and an output layer of 10 units. Both employed the iGaussian activation function. The non-DP model was trained with a fixed batch size and learning rate using the Stochastic Gradient Descent optimiser for a fixed number of epochs, whereas the ζ-DP-SGD network was trained using a fixed sampling rate with the same learning rate and optimiser, a fixed noise multiplier and a fixed L2 clipping norm. We calculated ε at a fixed δ value.

a.2.4 fastMRI knee dataset

Dataset preparation

We utilised the single-coil knee MRI dataset of the fastMRI challenge proposed by Zbontar et al. (2018). We used the reference implementation (https://github.com/facebookresearch/fastMRI/tree/main/fastmri_examples/unet) and employed the default settings, using a fixed acceleration rate and a fixed fraction of densely sampled k-space centre lines in the mask. Masks are sampled pseudo-randomly during training time. The dataset offers dedicated train and validation images.

Model training

We changed the U-Net network to use complex-valued weights and accept complex-valued inputs instead of the magnitude image employed in the original example. We replaced the original ReLU activation functions with cReLU. In the DP setting, we used a fixed noise multiplier, L2 clipping norm and sampling rate, and calculated the ε at a fixed δ. We used the RMSProp optimiser with a fixed initial learning rate and a stepwise learning rate scheduler. We trained both in the non-private and the ζ-DP-SGD setting for 30 epochs and disabled the collection of running statistics in the BatchNormalisation layers to render them compatible with DP.

a.3 Software libraries and computational resources used

Implementations of the DP-SGD algorithm, and, by extension, ζ-DP-SGD, require access to per-example gradients. We utilised the deepee software library (Ziller et al., 2021) to implement ζ-DP-SGD, as it is compatible with arbitrary neural network architectures, including those containing complex-valued weights. We report results using uniform sampling without replacement and using the RDP accounting option provided by deepee. For complex-valued neural network components, the PyTorch Complex library (Chatterjee et al., 2021) and PyTorch 1.9 were used. TensorFlow 2.4 was used for loading data and the Short Time Fourier Transforms discussed above, but no neural network components were used from this library. Experiments were carried out in Python 3.8.5 on a single workstation computer running Ubuntu Linux 20.04 and equipped with a single NVidia Quadro RTX 8000 GPU, 12 CPU cores and 64 GB of RAM.

a.4 Computational considerations

We conclude by presenting a systematic evaluation of the computational considerations incurred by the utilisation of complex-valued neural networks and by the implementation of ζ-DP-SGD using the above-mentioned libraries. Two main sources of computational overhead arise between real-valued and complex-valued neural networks. Complex numbers are internally represented as a pair of floating point numbers; this affects inputs and neural network weights. Moreover, even though a complex-valued architecture may contain the same number of parameters as its real-valued counterpart, an increased number of computational operations is required in ℂ. For instance, a multiplication a·b requires a single multiplication operation in ℝ. However, in ℂ it can require up to 4 real multiplications and 2 additions, depending on whether vector hardware is used and whether complex floating point instructions are implemented in the respective framework (e.g., cuDNN). Table 6 shows results for individual matrix multiplication operations and convolutions with real/complex-valued inputs and weight matrices.

Linear Conv.
CPU s ms ms ms
GPU s s ms ms
Table 6: Average computation times (over repeated runs) for a batched matrix multiplication with fixed batch size and matrix dimensions (Linear) and a convolution operation with fixed input dimensions (batch, channel, height, width) and kernel dimensions (in, out, kernel size) (Conv.). ℝ: real-valued input and weight matrix; ℂ: complex-valued input and weight matrix. Times are given on CPU and GPU.
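A small timing probe in the spirit of Table 6, comparing a real-valued and a complex-valued matrix multiplication of the same nominal size (dimensions and repetition count are illustrative; absolute numbers depend on hardware, and GPU timing additionally requires synchronisation):

```python
import time
import torch

def time_matmul(dtype, size=1024, reps=50, device="cpu"):
    """Average wall-clock time of a (size x size) @ (size x size) matmul for `dtype`.
    For device="cuda", wrap the timed region with torch.cuda.synchronize()."""
    a = torch.randn(size, size, dtype=dtype, device=device)
    b = torch.randn(size, size, dtype=dtype, device=device)
    torch.matmul(a, b)                      # warm-up
    start = time.perf_counter()
    for _ in range(reps):
        torch.matmul(a, b)
    return (time.perf_counter() - start) / reps

print("real    :", time_matmul(torch.float32))
print("complex :", time_matmul(torch.cfloat))
```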

ζ-DP-SGD carries additional overhead as it requires per-sample gradients. In the utilised deepee framework, this is realised through dispatching one computation thread per example in the minibatch (more precisely, lot) to perform a forward and backward pass, which incurs substantial overhead compared to pure vectorisation. These results are shown in Table 7. Of note, for the non-private models, the computation time includes the forward pass, backward pass, loss gradient calculation (Mean Squared Error against a fixed-dimensional target vector) and weight update (Stochastic Gradient Descent). For the ζ-DP-SGD model, the following additional steps occur between the loss gradient calculation and the weight update: gradient clipping, averaging of per-sample gradients, and noise application. Moreover, the deepee framework requires an additional step between the weight update and the subsequent batch.

Non-DP ζ-DP-SGD
CPU ms s s min
GPU ms ms s s
Table 7: Average computation times for a model consisting of a 2D convolutional layer (fixed numbers of input and output channels and a fixed kernel shape) followed by a linear layer (fixed matrix dimensions), executed on an input of dimensionality (batch, channel, height, width) over repeated runs.