The handling and processing of sensitive human data inherently entails the risk of compromising individual privacy and exposing unwarranted quantities of personal information. After a widespread enforcement of personal data regulations as well as the emergence of an increasingly privacy-conscious population, the requirement has arisen to augment machine learning (ML) systems with a number of algorithmic solutions for private, regulation-compliant training. Extensions to popular ML libraries and deep learning toolkits enabling DP deep learning provide a solution by retrofitting established DP mechanisms to ML workflows.
The most successful application of DP to deep learning has arguably been the DP-SGD algorithm (abadi2016deep), which, although empirically effective in many cases, fundamentally relies on imposing a bound on the sensitivity by clipping gradient norms. The resulting geometric bias is associated with a loss of utility of the final model which may exceed the utility penalty of noise addition alone. The very requirement for gradient clipping is a result of a limited ability for introspection of the training process: at any time during training, some quantity in the network (input, label, weight, etc.) may cause an unbounded growth of the gradient norm. Moreover, recent work has highlighted that merely applying DP mechanisms such as DP-SGD to existing model architectures may not lead to optimal outcomes, as DP deep learning introduces specific requirements which may be better served by different architectural choices than learning in the non-private setting (papernot2019making). Such choices should (a) be made in a principled way and (b) ideally not require the use of private data. Current deep learning frameworks (and their DP extensions) are designed with efficient computations over batches of input data in mind. Neither the decoupling of model development from the actual data processing nor the ability to easily obtain closed-form mathematical expressions is inherent to their design.
Even beyond deep learning, in the setting of statistical learning and queries over databases, recent work (feldman2020individual) showcases that introspecting the properties of the functions applied to private database records can provide improved privacy guarantees to the individual.
From the above, we identify three key requirements for new computational tools in the realm of private machine learning: (1) An improved ability to model the properties of function compositions over private data and, especially, their influence on sensitivity; (2) a design philosophy centred on the data of a single individual and their features; (3) the option to reason about an algorithm’s privacy characteristics without having to input sensitive data. Our work tackles these requirements by introducing a novel hybrid automatic differentiation (AD) framework. It combines the efficiency of reverse-mode AD over computational graphs with the ability to obtain a closed-form expression for every quantity present in the graph. It can thus be used to quantify the privacy characteristics of algorithms used to analyse sensitive data.
The rest of the paper is organised as follows:
Section 2 provides an overview of key terms used and related work.
Section 3 gives an overview of our proposed hybrid automatic differentiation system.
Section 4 exemplifies our system in the context of statistical queries and the DP-SGD algorithm.
We conclude with a concise discussion of limitations and future directions in Section 5.
2 Related work
Automatic and symbolic differentiation
Automatic differentiation (AD) is a method to automatically determine the differentials of expressions with respect to their components to machine precision. Deep neural networks use reverse-mode automatic differentiation, which is computationally efficient when expressions with multiple inputs have a scalar output. Forward-mode automatic differentiation also exists, but is not covered in this work. On a high level, AD stores computations in a graph data structure in which every node maintains a reference to its parents as well as the operation from which it originated. The partial derivatives of these primitive operations are then combined by repeated application of the chain rule of calculus over the graph sorted in reverse topological order (reverse accumulation step) to obtain the differentials (or gradients) with respect to the leaves of the graph. Modified versions of this system exist, but focus on memory efficiency in the context of computational graph storage. We refer the reader to (baydin2018automatic) for an overview. AD is a key component of deep learning frameworks, where it is used behind the scenes to perform the backpropagation step. Hence, the user need only specify the forward path of computation (from inputs to e.g. the loss calculation), a step often referred to as graph generation, since the computational graph is defined (either ahead of time or at run-time) by the programmatic inputs of the user.
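The reverse accumulation step described above can be illustrated with a minimal sketch in plain Python. It is not taken from any particular framework: the `Node` class, the `sin` helper and the `backward` function are hypothetical names introduced here only to show how parent references and local partial derivatives combine under the chain rule.

```python
import math


class Node:
    """A value in the computational graph, with links to its parents."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent_node, local partial derivative d(self)/d(parent)).
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def sin(x):
    return Node(math.sin(x.value), ((x, math.cos(x.value)),))


def backward(output):
    """Reverse accumulation: apply the chain rule in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)  # parents are appended before their children

    visit(output)
    output.grad = 1.0
    for node in reversed(order):  # children are processed before their parents
        for parent, local in node.parents:
            parent.grad += local * node.grad


# f(x, y) = x * y + sin(x), so df/dx = y + cos(x) and df/dy = x.
x, y = Node(2.0), Node(3.0)
f = x * y + sin(x)
backward(f)
```

After `backward(f)`, `x.grad` and `y.grad` hold the numeric gradients at the chosen inputs. The result is a pair of numbers, not a closed-form expression.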
Symbolic differentiation (SD) differs from automatic differentiation in that it allows the calculation of derivatives in an analytical (closed) form by manipulating mathematical expressions under a constrained set of rules. It is commonly used in symbolic algebra systems (sympy), which focus more on algebraic manipulation than on the training of complex neural networks. Therefore, most existing systems do not provide a simple method to specify such complex architectures. Beyond this design choice, SD suffers from a phenomenon termed expression swell, in which the derivatives of large functions become impractical to store in memory and very slow to calculate analytically. To our knowledge, no SD system is widely used for training neural networks.
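Expression swell can be made concrete with a toy symbolic differentiator. The sketch below is an illustration only, and it assumes a deliberately naive setting: expressions are nested tuples, no simplification or sharing of common subexpressions takes place, and the names `diff`, `size` and `nested` are hypothetical. It handles first derivatives only, because the `cos` nodes that `diff` emits are not themselves differentiable by it.

```python
def diff(e, var):
    """Differentiate a tuple-encoded expression with respect to `var`."""
    op = e[0]
    if op == "const":
        return ("const", 0.0)
    if op == "var":
        return ("const", 1.0 if e[1] == var else 0.0)
    if op == "add":
        return ("add", diff(e[1], var), diff(e[2], var))
    if op == "mul":  # the product rule copies both operands into the result
        return ("add",
                ("mul", diff(e[1], var), e[2]),
                ("mul", e[1], diff(e[2], var)))
    if op == "sin":  # chain rule: d/dx sin(u) = cos(u) * du/dx
        return ("mul", ("cos", e[1]), diff(e[1], var))
    raise ValueError(f"unknown operation: {op}")


def size(e):
    """Number of nodes in the expression tree."""
    return 1 + sum(size(child) for child in e[1:] if isinstance(child, tuple))


def nested(depth):
    """Build f_0 = x and f_{k+1} = f_k * sin(f_k)."""
    e = ("var", "x")
    for _ in range(depth):
        e = ("mul", e, ("sin", e))
    return e


# Compare the size of each expression with the size of its derivative.
for depth in range(1, 6):
    f = nested(depth)
    print(depth, size(f), size(diff(f, "x")))
```

In this naive setting the derivative grows faster than the expression it was derived from. Term simplification and common subexpression elimination are the usual ways to counter that growth.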
The closest line of work to our system are static-graph-based systems such as Aesara, a successor to the Theano differentiable programming framework (bergstra2011theano). Such systems are typically optimised for tensor operations, as they have been designed to allow efficient gradient computations for deep learning or gradient-assisted Monte Carlo sampling. Thus, their focus is not the provision of closed-form mathematical expressions for functions in the network.
Differential Privacy and DP-SGD
We assume familiarity with differential privacy (DP) and hence omit a detailed discussion here; we refer the reader to (Dwork2013) and use the definition of DP put forward in that work unless noted otherwise.
The application of DP to the training of deep neural networks was described in (abadi2016deep) and termed DP-SGD. We reproduce some key insights of this work here to motivate our discussion below:
On a high level, the mechanism applied to private data in DP-SGD is a deep neural network, whose output, a scalar value termed the loss, is differentiated with respect to the weights to obtain an update rule. As deep neural networks do not necessarily satisfy any assumptions about Lipschitz continuity of their gradients and the gradient's $L_2$-norm represents the sensitivity term, gradients are processed by clipping, that is, bounding them to a desired $L_2$-norm, proportional to which Gaussian noise is added to satisfy DP. The clipping step introduces geometric bias to the gradient (chen2020understanding), which is undesirable yet unavoidable unless a concrete bound on the gradient's norm can be provided through alternative means. Tackling this challenge is a key contribution of our work.
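The clip-and-noise step described above can be sketched in a few lines of plain Python. This is a simplified illustration and not a reference implementation of (abadi2016deep): the function and parameter names (`dp_sgd_step`, `max_norm`, `noise_multiplier`) are assumptions, gradients are plain lists, and privacy accounting is omitted entirely.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def clip(grad, max_norm):
    """Scale `grad` down so that its L2 norm is at most `max_norm`."""
    scale = min(1.0, max_norm / (l2_norm(grad) + 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_sample_grads, max_norm, noise_multiplier, lr, rng):
    """One update: clip each sample's gradient, sum, add noise, average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    summed = [sum(column) for column in zip(*clipped)]
    # The noise standard deviation is proportional to the clipping bound.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_sample_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # the first gradient has norm 5 and is clipped
new_weights = dp_sgd_step([0.0, 0.0], grads, max_norm=1.0,
                          noise_multiplier=1.0, lr=0.1, rng=rng)
```

The first gradient is rescaled while the second passes through unchanged. This changes the direction of the summed gradient and is the source of the geometric bias mentioned above.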
A recent line of work is closely related to the above-mentioned topic (feldman2020individual; lecuyer2021practical). Conceptually, these works attempt a more precise characterisation of individual privacy loss (termed individual differential privacy, IDP) which is used to e.g. automatically halt the utilisation of the individual’s data in an analysis while the data of other individuals can still be used (DP filtering). Notably the work by (feldman2020individual) also relates individual privacy guarantees to the Lipschitz constant of the function which is applied to the data.
3 System Description
We begin by describing the main components of the proposed framework. Our system consists of a front-end user-facing component and of a tensor manipulation library similar to common machine learning and linear algebra libraries. The user specifies a series of computations to be performed (e.g. defining a neural network architecture) as well as the inputs to the computation (e.g. input tensors, weights, etc.). Two key principles apply at the first step of the computation: (1) the inputs can be specified abstractly, that is, with a symbolic name (e.g. ) and (2) operations are by default defined in terms of individuals, and not as batches of inputs. At any point in the graph specification, the user can request the partial derivatives of an operation with respect to some arbitrary input.
The second component of our system is a compiler tool-chain consisting of a pre-processor which emits an intermediate representation (IR) of the specified computations and a compiler which converts the IR into low-level code. Compilation can occur just-in-time, which immediately returns a reference to a byte-code object representing the compiled expression to the user (at the cost of decreased performance), or ahead-of-time, where a function is statically compiled into an optimised binary for execution on an arbitrary computational back-end (at the cost of longer compilation times). By default, these computational kernels can be executed on batches of inputs, thus recovering the functionality of other frameworks. We note that, up to this point, no actual numeric input is required. The final computation takes place by substituting the abstract inputs specified at the beginning with concrete values. In the case of just-in-time compiled expressions, partial evaluation can also take place, where only a subset of inputs is specified numerically.
The system, like other contemporary AD systems, is capable of computing differentials of arbitrary functions, provided a valid sub-differential exists. However, as such operations may contain discontinuities (e.g. piece-wise or step functions), the compiler maintains a reference to all possible execution paths and converts discontinuous expressions into conditional statements while optimising all computations beyond them for memory economy.
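The workflow described in this section (symbolic inputs, per-individual operations, compilation without data, late substitution of concrete values) can be mimicked in plain Python. The sketch below is a loose analogy under strong assumptions and does not reflect the actual design of our compiler tool-chain: expressions are stored as Python source text, Python's built-in `compile` stands in for the just-in-time path, and `Sym`, `relu` and `jit` are hypothetical names.

```python
class Sym:
    """A symbolic expression, stored as Python source text plus its free names."""

    def __init__(self, src, names=()):
        self.src = src
        self.names = frozenset(names)

    @classmethod
    def var(cls, name):
        return cls(name, {name})

    def _combine(self, other, op):
        if not isinstance(other, Sym):
            other = Sym(repr(float(other)))  # lift numeric constants
        return Sym(f"({self.src} {op} {other.src})", self.names | other.names)

    def __add__(self, other):
        return self._combine(other, "+")

    def __mul__(self, other):
        return self._combine(other, "*")


def relu(x):
    """A piece-wise function, emitted as a conditional expression."""
    return Sym(f"({x.src} if {x.src} > 0.0 else 0.0)", x.names)


def jit(expr):
    """Compile the expression to a code object and wrap it in a function."""
    code = compile(expr.src, "<symbolic>", "eval")

    def kernel(**inputs):
        missing = expr.names - set(inputs)
        if missing:
            raise ValueError(f"unbound inputs: {sorted(missing)}")
        return eval(code, {"__builtins__": {}}, dict(inputs))

    return kernel


# Specify a one-neuron model for a single individual; no data is needed yet.
x, w, b = Sym.var("x"), Sym.var("w"), Sym.var("b")
neuron = relu(w * x + b)

kernel = jit(neuron)
# Concrete values are substituted only now, one record at a time over a batch.
batch = [-1.0, 0.5, 2.0]
outputs = [kernel(x=xi, w=2.0, b=-0.5) for xi in batch]
```

Here `neuron.src` is a closed-form description of the computation that can be inspected before any record is supplied, and the piece-wise activation appears in it as an explicit conditional.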
4.1 Application to the DP analysis of database queries
Our system can be utilised to determine the local Lipschitz constant of an arbitrary differentiable function for calculating the associated DP guarantees, which we showcase in the setting of a fictional database query under Rényi DP (RDP) (Mironov_2017).
Assume an analyst wishes to calculate the (fictional) quantity age-adjusted body mass index, calculated from an individual’s age , weight and height as follows:
It is obvious that (1) is not globally Lipschitz continuous. It is however possible to determine a local Lipschitz constant given bounds on the inputs, which can be shown to correspond to
and which we can determine by automatic differentiation. We receive:
Constrained minimisation of the inverse of this expression with reasonable bounds on the inputs given prior knowledge yields the desired quantity which can be used to compute the RDP guarantee as .
Of note, this method relies on retrieving a global optimum for the expression, which can either be obtained in closed form by higher-order differentiation or algorithmically, e.g. through the use of simplicial homology global optimization (endres2018simplicial). Alternatively, methods for approximate Lipschitz-constant estimation such as (Scaman2018-gr) can be utilised.
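The procedure can be sketched end to end on a stand-in function. Several assumptions apply and should be read as such: the query is not reproduced here, so the sketch uses the ordinary body mass index `weight / height ** 2` as a hypothetical substitute; its gradient is written out by hand; the global optimum is approximated by a coarse grid search over the input box instead of simplicial homology global optimization; the sensitivity is taken as the local Lipschitz constant times an assumed bound `max_record_distance` on the distance between neighbouring records; and the guarantee uses the standard Rényi DP expression for the Gaussian mechanism, alpha * sensitivity^2 / (2 * sigma^2).

```python
import math
from itertools import product


def bmi(weight, height):
    """Hypothetical stand-in query: body mass index."""
    return weight / height ** 2


def grad_norm(weight, height):
    """L2 norm of the closed-form gradient of `bmi`."""
    d_weight = 1.0 / height ** 2
    d_height = -2.0 * weight / height ** 3
    return math.hypot(d_weight, d_height)


def linspace(lo, hi, n):
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def local_lipschitz(bounds, points_per_axis=50):
    """Largest gradient norm found on a grid over the input box."""
    axes = [linspace(lo, hi, points_per_axis) for lo, hi in bounds]
    return max(grad_norm(*point) for point in product(*axes))


def rdp_epsilon_gaussian(alpha, sensitivity, sigma):
    """Renyi DP of order `alpha` for the Gaussian mechanism."""
    return alpha * sensitivity ** 2 / (2.0 * sigma ** 2)


# Assumed prior knowledge: weight in kg, height in m.
bounds = [(40.0, 150.0), (1.4, 2.1)]
lipschitz = local_lipschitz(bounds)

max_record_distance = 1.0  # hypothetical bound on the L2 distance of two records
epsilon = rdp_epsilon_gaussian(alpha=2.0,
                               sensitivity=lipschitz * max_record_distance,
                               sigma=100.0)
```

For this stand-in the gradient norm increases with weight and decreases with height, so the grid maximum sits at a corner of the box. In general a grid search can under-estimate the true supremum, which is why a global optimiser or a certified upper bound is preferable.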
4.2 Application to DP-SGD
In Algorithm 1, we present a modified version of DP-SGD enabled by obtaining closed-form bounds on the maximum $L_2$-norm of the gradients, realised through the ahead-of-time compilation of a closed-form expression for the gradient norm using our framework. This allows the following distinct options for training with DP-SGD:
(1) It is possible to compute the maximum sensitivity of the gradient given specific bounds on the function inputs (features, weights, biases, etc.). This calculation can be performed ahead of the actual training and used to guide network architecture design decisions. We note that, for complex functions, estimating this constant precisely may be very hard, and some bound with an acceptable effect on privacy guarantees may have to be used instead (fazlyab2019efficient). Provided a reasonable bound exists, this method can be used to replace gradient clipping during training: recall that the maximum sensitivity can be computed given bounds on all function inputs. Hence, the sensitivity will also remain bounded as long as the function inputs satisfy the pre-specified constraints. During training, data and targets are typically normalised to a certain range. This renders the realised values of the weights at every iteration the sole determinant of sensitivity. By re-normalising the weights between two values, or clipping weights which exceed a pre-specified bound, it is thus possible to avoid gradient clipping altogether. However, bounding the weights may also introduce bias or hinder training by over-regularisation.
(2) For simple architectures, it is possible to determine the maximum sensitivity at every iteration. Samples whose gradient norm remains below the maximum sensitivity require no clipping, and the noise can be scaled to the true sensitivity value. This process is similar to the technique described by (feldman2020individual) and (lecuyer2021practical). However, a costly optimisation procedure is necessary at every step, which does not scale well.
(3) To combine the advantages of both approaches, it has been recommended to use activation functions which naturally bound the sensitivity, as described in (papernot2020tempered). Our framework allows the exact Lipschitz constant to be determined in advance (given potential additional bounds on the inputs or weights), without requiring the optimisation in (2) to be executed at every iteration.
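The first option, bounding the weights instead of clipping the gradients, can be illustrated on a model simple enough for the bound to be derived by hand. The sketch assumes a linear model with squared loss 0.5 * (w.x - y)^2, whose per-sample gradient is (w.x - y) * x. With ||x|| <= x_max, |y| <= y_max and ||w|| <= w_max, its L2 norm is at most (w_max * x_max + y_max) * x_max. All names are hypothetical, and the sketch is not a privacy proof and does not reproduce Algorithm 1.

```python
import math
import random


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def project_to_ball(v, radius):
    """Rescale `v` so that its L2 norm does not exceed `radius`."""
    norm = l2_norm(v)
    return list(v) if norm <= radius else [x * radius / norm for x in v]


def per_sample_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def grad_norm_bound(w_max, x_max, y_max):
    """Closed-form bound on the per-sample gradient norm."""
    return (w_max * x_max + y_max) * x_max


def private_step(w, batch, w_max, x_max, y_max, noise_multiplier, lr, rng):
    """Noisy update that bounds the weights instead of clipping gradients."""
    sensitivity = grad_norm_bound(w_max, x_max, y_max)
    grads = [per_sample_grad(w, x, y) for x, y in batch]
    summed = [sum(column) for column in zip(*grads)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * sensitivity) for s in summed]
    w = [wi - lr * n / len(batch) for wi, n in zip(w, noisy)]
    return project_to_ball(w, w_max)  # keeps the bound valid for the next step


rng = random.Random(0)
# Features are normalised to the unit ball and targets to [-1, 1].
batch = [(project_to_ball([rng.uniform(-1, 1) for _ in range(3)], 1.0),
          rng.uniform(-1, 1)) for _ in range(8)]
weights = private_step([0.0, 0.0, 0.0], batch, w_max=1.0, x_max=1.0,
                       y_max=1.0, noise_multiplier=1.0, lr=0.1, rng=rng)
```

The bound holds only while the inputs and weights stay inside their stated ranges, so the projection is applied after every update. As noted above, this may over-regularise the model.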
We note that our system naturally provides per-sample gradients, as operations are specified per individual and are only dispatched at runtime. Thus, our framework mitigates the performance degradation associated with obtaining per-sample gradients or their norms from batches of inputs.
5 Discussion and conclusion
Our work demonstrates that closed-form representations of arbitrary differentiable function compositions over private data can be utilised for the modelling of sensitivity in private machine learning. Our framework fits into a larger ecosystem of tools targeting automated sensitivity analysis (traskkritika) and privacy budgeting through composition analysis of heterogeneous mechanisms (wang2019subsampled). It moreover allows decoupling network/algorithm design from data analysis, a desideratum in privacy-by-design workflows. We sketch the use of our system in the context of IDP and DP-SGD, but also view it as a useful component in the toolkit of adversarial machine learning researchers, whom it can assist in reasoning concretely over the mathematical properties of the model. Moreover, as our framework allows the derivation of analytic expressions for every input to the function, it can be used to quantify the impact of individual features on overall privacy loss. We intend to investigate the benefits of such feature-level analyses in future work.
A notable limitation of the current form of our system is memory consumption and computational performance. Static compilation times scale quadratically with the number of parameters in the network. We note that, after compilation, execution times of the kernels are constant for a given batch size. Currently, the compilation of a 2.5M parameter network (such as MobileNetV3) would require approximately 60 hours on a single CPU core. Scaling our system to larger network architectures will require compiler-level optimisations such as multi-threaded compilation and employment of term simplification strategies such as common subexpression elimination, which we outline as future work.
In conclusion, hybrid automatic differentiation allows obtaining analytical expressions for quantities of interest in differentiable function compositions over private data, such as neural networks. We are hopeful that our work will stimulate research at the intersection of compiler engineering, differentiable programming and privacy-preserving machine learning to yield specialised tooling capable of efficiently manipulating such analytical representations to address key challenges in differentially private machine learning.