Despite the growing availability of high-performance algorithmic tools for advanced statistical modelling and machine learning (ML), solutions to many of the world’s most important problems require access to sensitive or confidential data. Technologies such as differential privacy (DP) can allow drawing insights from such data while objectively allocating and quantifying individual privacy expenditure. Although DP is the gold standard for data protection, its application to everyday ML workflows is, in practice, often constrained. For one, tightly introspecting the privacy attributes of complex models such as deep neural networks can be very challenging. Moreover, substantial expertise is required on the analyst’s behalf to correctly apply DP mechanisms to such models. Software libraries opacus; tfprivacy; holohan2019diffprivlib; googledp are being developed to alleviate these issues in specific domains such as DP deep learning. They are, however, limited to a small number of programming languages and application programming interfaces (APIs). The democratisation of DP machine learning therefore awaits generic infrastructure, not only compatible with arbitrary workflows, but designed from first principles to facilitate the implementation of DP.
At its core, contemporary ML is based around the manipulation of multidimensional arrays and the composition of differentiable functions, a programming paradigm referred to as differentiable programming. Besides deep learning, some of the most successful ML algorithms chen2016xgboost and a large number of statistical queries, especially from the domain of robust statistics, can be expressed within this paradigm. Automatic differentiation (AD) systems are the core of differentiable programming frameworks and are able to track the flow of computation to return precise derivatives with respect to arbitrary computational quantities. Although this functionality may, at first, seem orthogonal to the goals described above, we contend that it is in fact not only highly compatible, but synonymous with automatic DP tracking.
In the current work, we present Tritium, a differentiable programming framework aiming to integrate the requirements of ML and privacy analysis through the use of AD. We recapitulate the link between the sensitivity of differentiable queries and the Lipschitz constant in section 2. We outline our system’s implementation in section 3 and present the substantial improvements in computational efficiency and sensitivity bound tightness in section 4. A discussion of prior work can be found in the appendix.
2 Theoretical motivation
We begin by briefly introducing an interpretation of DP using the language of functional analysis, which forms the theoretical motivation behind our work. We concentrate on the Gaussian mechanism, which forms the basis for private data analysis in high dimensions. Differentially private ML can fundamentally be abstracted as the application of a higher-order function (or functional) to private data. This higher-order function (often termed a DP mechanism) receives as its input another function (termed a query) which has been applied to a private dataset, inspects the query to derive its privacy attributes and modifies it to preserve DP (Definitions 1 and 2).
Definition 1 (Query).
A query is a function $f: \mathbb{R}^{n \times \ast} \to \mathbb{R}^{m \times \ast}$, where $\ast$ represents arbitrary (possibly unused) dimensions, which receives as input some private dataset $D$ and outputs a result $y$ representing the result of a computation over $D$ (e.g. a mean calculation or the output of a neural network).
Definition 2 (DP mechanism).
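A DP mechanism is a higher-order function $\mathcal{M}$ which receives as its input one or more query functions $f$ and outputs $f(D) + \xi$, where $D$ is the private dataset, $\xi \sim \mathcal{N}(0, \sigma^2 I)$ and $\sigma$ is selected based on the privacy properties of $f$.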
The tight characterisation of these privacy properties is central to enabling privacy expenditure tracking. The effect of the inputs on the output of the query functions is reflected in the query sensitivity. We use the Lipschitz constant to reason about sensitivity.
Definition 3 (Lipschitz constant and sensitivity).
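Let $f: X \to Y$ be a function between metric spaces $X$ and $Y$ with distance metrics $d_X$ and $d_Y$, respectively. Then $f$ is Lipschitz continuous with constant $K$ (equivalently, $K$-Lipschitz) if
$$d_Y\big(f(D), f(D')\big) \leq K \, d_X(D, D') \quad \text{for all } D, D' \in X.$$
The smallest value of $K$ corresponds to the sensitivity $\Delta_f$ of $f$ raskhodnikova2016lipschitz:
$$\Delta_f = \sup_{D \neq D'} \frac{d_Y\big(f(D), f(D')\big)}{d_X(D, D')},$$
where $d_X$ is the Hamming distance. Recall that $d_X(D, D') = 1$ in this case, as $D, D'$ are adjacent, i.e. the Hamming distance between $D$ and $D'$ equals $1$. Thus $\Delta_f = K$ for differentiable query functions, when $X$ and $Y$ are Euclidean spaces endowed with the $L_2$-norm. Then,
$$K = \sup_{D \in X} \lVert J_f(D) \rVert_2,$$
where $J_f$ is the Jacobian matrix (the differential operator).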
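The link between the Lipschitz constant and the largest derivative magnitude can be illustrated numerically. The following is a minimal sketch, not Tritium's implementation: the helper `numeric_lipschitz` and the chosen test functions are hypothetical, and a grid search only yields an estimate of the supremum (a lower bound), not a certified bound suitable for DP.

```python
import math


def numeric_lipschitz(f, lo, hi, steps=10_001, h=1e-6):
    """Estimate sup |f'(x)| on [lo, hi] via central differences on a grid.

    This is an illustrative estimate only: the true supremum could lie
    between grid points, so the result is not a guaranteed bound.
    """
    best = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        slope = abs(f(x + h) - f(x - h)) / (2.0 * h)
        best = max(best, slope)
    return best


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# The logistic sigmoid has a globally bounded derivative (at most 1/4).
k_sigmoid = numeric_lipschitz(sigmoid, -5.0, 5.0)

# For x**2 the estimate grows with the input bounds (2 * bound), so a finite
# constant only exists once the input domain is restricted.
k_square_1 = numeric_lipschitz(lambda x: x * x, -1.0, 1.0)
k_square_10 = numeric_lipschitz(lambda x: x * x, -10.0, 10.0)

print(k_sigmoid, k_square_1, k_square_10)
```

The second and third estimates differ only because the assumed input bounds differ, which is why bounds must be fixed in a data-agnostic manner before a sensitivity can be stated.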
This equivalence between Lipschitz constant and query sensitivity allows one, in principle, to reason over the privacy attributes of individual query functions and calibrate noise appropriately. Typical functions with globally bounded sensitivity are affine queries or linear functions of the form $f(x) = ax + b$, with $a, b \in \mathbb{R}$. However, queries exist for which the Lipschitz constant is not defined over the entire input domain. One example of such a function is $f(x) = x^2$ with $x \in \mathbb{R}$, whose sensitivity is unbounded, as it depends on the value of $x$. The sensitivity analysis of such queries therefore sometimes depends on (private) properties of the dataset. We term such a case data-dependent sensitivity. Reasoning over sensitivity in such cases is complicated by the requirement to propagate this data dependency through function composition. Previous works on Lipschitz analysis of machine learning algorithms bhowmick2021lipbab; LipOuterBound achieve this through techniques such as interval bound propagation gowal2019effectiveness, which carry the bounds on input variables (which, for DP, should be defined in a data-agnostic manner) through the computation flow. This technique can easily be made compatible with tracing-type AD systems, which are widely used in contemporary machine learning. However, due to well-known limitations of interval arithmetic (such as interval dependency Krmer2006) and due to the fact that the Lipschitz constant is defined by an inequality, the resulting sensitivity terms may be valid, but too loose to be of any practical utility. A final challenge is that the actual effect of an individual’s data on the query’s output may, in fact, be much smaller than the worst case assumed by the definition, resulting in more noise being added by the mechanism than would be required for the guarantee to hold.
Consequently, although the worst-case sensitivity value has typically been used for privacy accounting, newer techniques base the accounting not only on the worst case but also on the actual output $L_2$-norm feldman2020individual.
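The interval dependency problem mentioned above can be reproduced in a few lines. The sketch below uses a hypothetical minimal `Interval` class (only the two operations needed here) and an illustrative function `g`; it is not taken from any of the cited libraries. Because interval arithmetic treats the two occurrences of `x` as independent, the propagated range is valid but four times wider than the true range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Closed interval [lo, hi] with just enough arithmetic for the example."""
    lo: float
    hi: float

    def __rsub__(self, scalar):
        # scalar - [lo, hi] = [scalar - hi, scalar - lo]
        return Interval(scalar - self.hi, scalar - self.lo)

    def __mul__(self, other):
        # The product range is spanned by the four endpoint products.
        products = [
            self.lo * other.lo, self.lo * other.hi,
            self.hi * other.lo, self.hi * other.hi,
        ]
        return Interval(min(products), max(products))


def g(x):
    # x appears twice; interval arithmetic cannot see that both are the same x.
    return x * (1 - x)


# Forward-propagated bound over the input interval [0, 1].
propagated = g(Interval(0.0, 1.0))

# Range observed by evaluating g on a dense grid of actual inputs.
samples = [g(i / 1000) for i in range(1001)]
true_range = Interval(min(samples), max(samples))

print(propagated)   # Interval(lo=0.0, hi=1.0)
print(true_range)   # Interval(lo=0.0, hi=0.25)
```

The gap widens as more operations are composed, which is one reason forward-propagated bounds on deep models can become vacuous.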
3 Implementation details
Our work presents Tritium, an automatic-differentiation-based machine learning and sensitivity analysis system engineered to address the above-mentioned challenges. It consists of the following components:
(1) A user-facing front-end to specify a query abstractly, i.e. without directly utilising private data during model creation. This is achieved through the utilisation of abstract tensors with pre-defined dimensions. The system creates an optimised computational graph based on this specification.
(2) During model specification, the user can impose bounds on the quantities (e.g. inputs, weights) used in the model.
(3) The user selects the desired privacy parameters, e.g. $\varepsilon$ and $\delta$ values or a maximum allowed sensitivity.
(4) A compiler then emits a program which receives a private dataset and outputs an appropriately privatised result.
Internally, Tritium undertakes the following steps:
(1) The computational graph is compiled into a program which outputs the Jacobian $J_f$ of the query with respect to the inputs.
(2) The Lipschitz constant $K$ is computed given the input bounds.
(3) Finally, $K$ is compiled into the program described in step (4) above, which receives a private dataset $D$, computes $f(D)$, potentially clips out-of-bound values to preserve the required $K$, adds Gaussian noise with standard deviation $\sigma$ proportional to $K$ to satisfy the required $\varepsilon$ value for a given $\delta$ and outputs the privatised result.
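The final step (clip, calibrate, add noise, release) can be sketched as follows. This is an illustrative stand-in under stated assumptions rather than the program Tritium emits: all function names are hypothetical, the query is a simple bounded mean whose sensitivity under replace-one adjacency is `(upper - lower) / n`, and the noise scale uses the classical Gaussian-mechanism calibration, which is only valid for $0 < \varepsilon < 1$.

```python
import math
import random


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (assumes 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this calibration requires 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def clamp(values, lower, upper):
    """Clip out-of-bound inputs so the pre-computed sensitivity still holds."""
    return [min(max(v, lower), upper) for v in values]


def privatised_mean(dataset, lower, upper, epsilon, delta, rng):
    """Release a noisy mean; only the privatised value leaves the function."""
    bounded = clamp(dataset, lower, upper)
    # Assumption: dataset size is public and adjacency replaces one record.
    sensitivity = (upper - lower) / len(bounded)
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return sum(bounded) / len(bounded) + rng.gauss(0.0, sigma)


data = [0.2, 0.9, 1.7, -0.3]  # toy values; two lie outside the bounds [0, 1]
release = privatised_mean(
    data, 0.0, 1.0, epsilon=0.5, delta=1e-5, rng=random.Random(0)
)
print(release)
```

A seeded `random.Random` is used here only to make the sketch reproducible; a real deployment would need a cryptographically secure noise source.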
This system architecture has several benefits. It avoids utilising private data until the moment the final computation is executed (data minimisation). Moreover, it provides a tight sensitivity calculation by optimising the entire query function at once gouk2021regularisation instead of the above-mentioned forward-propagation, which can lead to vacuous sensitivity values. Furthermore, it utilises the pre-specified bounds on the input variables to not only enable the calculation of data-dependent sensitivity, but also greatly accelerate the process. Moreover, it is agnostic to the method used to actually obtain the desired sensitivity. For example, Lipschitz neural network layers shavit2019exploring; anil2019sorting or activation functions with bounded outputs and gradients papernot2020tempered can be used for model building, but bounded sensitivity can also be enforced by clipping, as is common in DP-SGD abadi2016deep. In addition, the system is able to compute the full Jacobian matrix (which is required in DP-SGD) as well as arbitrary higher-order derivative matrices (which can be used to accelerate the sensitivity computation). Additionally, as the system outputs both the Lipschitz constant and the norm of the outputs, it can be leveraged to provide tighter privacy guarantees through the use of e.g. individual privacy accounting, as shown below. Finally, the system is designed to output privatised values by default instead of outputting non-private values and relying on the user to perform an appropriate privatisation step. This can reduce both user workload and the probability of failure due to incorrect application of DP mechanisms on the user’s behalf.
4 Experimental evaluation
4.1 Exact sensitivity calculations through a posteriori optimisation
To assess the benefits of computing query sensitivity by assessing the entire computational graph at once instead of forward-propagating interval bounds, we constructed a small neural network comprising 4 linear layers with logistic sigmoid activations and the binary cross-entropy cost function. We set the bounds for the neural network weights and the input bounds to the interval. We then calculated the sensitivity using two techniques: Interval Bound Propagation (IBP gowal2019effectiveness) and our proposed method optimising the entire computational graph at once. The estimate of the sensitivity was returned as by IBP (which is a valid, but vacuous bound) and as by Tritium. The IBP bounds are also similar to previous work (compare e.g. zhang2019recurjac). However, IBP was faster, requiring ms to return a result, compared with ms for our technique (excluding compilation time of ca. s). These results are summarised in Table 1 in the appendix.
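Why analysing the whole function at once can give a tighter value than composing per-layer bounds can be seen on a toy example. The sketch below is an assumption-laden stand-in: it uses a two-layer scalar sigmoid network with hypothetical weights `W1` and `W2`, compares the product of per-layer Lipschitz constants (a simpler relative of forward-propagated bounds, not IBP itself) with a grid search over the exact chain-rule derivative, and does not use Tritium's optimiser. The grid result is an estimate, not a certified bound.

```python
import math

W1, W2 = 2.0, 3.0  # hypothetical fixed weights of a two-layer scalar network


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def network_grad(x):
    """Exact derivative of sigmoid(W2 * sigmoid(W1 * x)) via the chain rule."""
    hidden = sigmoid(W1 * x)
    return sigmoid_grad(W2 * hidden) * W2 * sigmoid_grad(W1 * x) * W1


# Layer-wise composition: each sigmoid contributes at most 1/4, each weight |W|.
# This is always valid, but assumes every layer attains its worst case at once.
layerwise_bound = abs(W1) * 0.25 * abs(W2) * 0.25

# Whole-function view: search the end-to-end derivative over the input bounds.
grid = [-1.0 + 2.0 * i / 2000 for i in range(2001)]
global_estimate = max(abs(network_grad(x)) for x in grid)

print(layerwise_bound, global_estimate)
```

The layer-wise value is larger because the inner sigmoid never drives the outer one to the point where both derivatives peak simultaneously; the discrepancy compounds with depth.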
4.2 Compilation improvements
In this section, we compare the compilation and execution times of Tritium to those of the previously described framework by ziller2021sensitivity on neural networks with the architecture described above, but with increasing width of the linear layers. We recall that the system proposed in that work relies on a scalar AD implementation and that its authors report compilation times quadratic in the number of model parameters (that is, around 60 hours for a million-parameter model). In contrast, the implementation presented here utilises the Aesara library (a fork of the now-defunct Theano framework theano) as a computational back-end. This allows both a memory-efficient vectorised execution of tensor operations and a more mature and faster compilation back-end. For all but the smallest architectures, this back-end achieved substantially faster compilation times, which were independent of the number of parameters, and was able to leverage caching to accelerate re-compilation. A similar effect was observed in execution times, where our system achieved considerably higher performance. These results are visualised in Figure 1 in the appendix.
5 Discussion and conclusion
We propose Tritium, an automatic-differentiation-based system for differentially private machine learning. Our framework relies on an interpretation of DP queries and mechanisms through the language of functional analysis, linking them by the definition of Lipschitz continuity. We found the combination of an efficient computational and compilation back-end with the consideration of the entire query function at once to yield both improved performance and tighter sensitivity bounds compared to previous work. Our proposed framework relies on static graph-based AD, which can apply specific compiler optimisations to the entire computational graph and is responsible for Tritium’s high performance. However, such systems have noteworthy limitations. For instance, the definition of control flow statements is cumbersome and such systems are not well-suited for utilisation with just-in-time compilers. Most currently used machine learning frameworks utilise tracing/eager execution AD back-ends, which, while more user-friendly, cannot always leverage the same optimisations. An alternative AD implementation, source-to-source translation, can combine the benefits of dynamic graph specification with the high performance and optimisations of static compilation. It forms the basis of a recent paradigm in programming language design (e.g. saeta2021swift), attempting to merge a general-purpose programming language with differentiable programming primitives. A limitation of our method stems from the computational hardness of exactly calculating the Lipschitz constant scaman2018lipschitz. Our system will output the true bound using Simplicial Homology Global Optimisation endres2018simplicial if the bound exists and can be found, and the application of constraints can substantially accelerate this process. However, it will output a warning and switch to an approximate algorithm without a guaranteed bound otherwise.
If such an approximation is undesirable, an alternative technique is the utilisation of model components with known (or manually adjustable) Lipschitz constants. This can allow one to avoid the utility penalty imposed by clipping-based approaches, enabling the design of algorithms with milder privacy-utility penalties while additionally reaping the benefits of well-defined model sensitivity, such as (certifiable) robustness to perturbations by adversarial samples. In conclusion, our work serves as a first proof-of-concept for the design of generic infrastructure exposing familiar APIs to data scientists while automatically tracking privacy loss through the computation flow. We view the further development of such systems as an accelerator for the widespread adoption of privacy-preserving machine learning algorithms across data-driven research disciplines.
Appendix A Tables and Figures
|Upper bound||Computation time (ms)|
Appendix B Related works
Our work can be seen as a natural evolution of the previous study by [ziller2021sensitivity] in the context of AD-based sensitivity analysis for DP machine learning. In comparison to that work, Tritium relies on a vectorised, GPU-compatible execution engine and a mature graph compiler, which drastically improves performance, as shown in the experimental section. The properties of Lipschitz continuous functions have been leveraged in several domains beyond DP. Works such as [shavit2019exploring, yoshida2017spectral, anil2019sorting, zhang2019recurjac] attempt to constrain the Lipschitz constant to reason over and control the properties of neural networks. The utilisation of this approach has been proposed for network certification against adversarial samples [lecuyer2019certified], whereby a certified network is provably robust to input perturbations within a norm ball of a given radius. Additionally, constraining the Lipschitz constant has been proposed for DP model training, as this allows the noise addition to be calibrated based on the bounded Lipschitz constant [shavit2019exploring]. Moreover, certain works [gupta2021adaptive] have addressed the problem of machine unlearning, providing methods for the reliable removal of contributions associated with an individual in the context of neural network training. We note that the approach to sensitivity analysis employed in these studies is orthogonal to our work, as our work is compatible with manual sensitivity constraints (such as directly adjusting the Lipschitz constant of neural network layers through appropriate layers as described above, which may however impair their expressivity) but also with sensitivity tracking for privacy loss calculation.
Several recent works, mostly focused on ReLU networks, concentrate on the computation of accurate estimates of the Lipschitz constant [scaman2018lipschitz, anil2019sorting]. However, most of these obtain upper bounds rather than the exact value of the Lipschitz constant, often resulting in valid, but extremely loose approximations that are not intended to be applied in DP training. A line of work centred on languages for differentially private programming also exists. Among these, the recently proposed DDuo framework [abuah2021dduo] performs dynamic sensitivity analysis in the context of DP algorithm specification. As shown above, however, this approach does not attempt to derive a tight bound on sensitivity in the setting of unbounded queries, declaring sensitivity as infinite and relying exclusively on clipping. More general approaches, including category-theoretical views on the intersection of differentiable programming and differential privacy such as [pistone2021identity], have also recently been proposed.