The prevalence of data has been one of the key drivers of technological innovation in the last decade. The abundance of data, allied with ever-increasing computing power, has driven the rapid development of sophisticated machine learning techniques, many of which have achieved hitherto unseen levels of performance. Data collection today is pervasive, across applications and devices. The has resulted in data privacy becoming a matter of public concern.
It has long been known that querying even aggregated or perturbed data can lead to leakage of private information Dinur and Nissim (2003), motivating the development of databases and algorithms that mitigate such privacy breaches. Differential privacy Dwork et al. (2006) is one of the most rigorous ways of analysing and ameliorating such privacy risks. If an algorithm is
-differentially private, it means we can apply a multiplicative bound to the worst-case leakage of an individual’s private information. Many algorithms have been developed with this goal in mind, such as differentially private variants of linear regressionKifer et al. (2012)
, k-means clusteringHuang and Liu (2018); Su et al. (2016) et al. (2016).
Such privacy guarantees come at a cost—modifying an algorithm to be -differentially private typically involves adding noise to any queries made by that algorithm, which will tend to negatively affect the algorithm’s performance. This cost will tend to be higher when privatizing more complex algorithms that require multiple queries of the data, such as non-linear regression and classification algorithms Smith et al. (2018); Abadi et al. (2016). One such family of non-linear regression and classification algorithms, that allows for flexible modeling but involves multiple queries, is the class of tree-based methods. In particular, random forests and their variants have been shown to perform well on both classification and regression problems Breiman (2001); Geurts et al. (2006), and are used in many real-world applications. Tree-based ensemble methods make minimal assumptions on the parametric forms of relationships within the data, and can be easily applied to a mixture of continuous and categorical covariates. However, building trees that capture the appropriate structure requires many queries of the data, making them challenging to privatize. Further, since the data is partitioned into arbitrarily small subsets, the noise that must be added to each subset to ensure differential privacy can quickly swamp the signal.
We propose differentially private median (DiPriMe) forests, a novel, differentially private machine learning algorithm for nonlinear regression and classification with potentially sensitive data. Unlike previous differentially private tree-based algorithms, our approach can deal with continuous or categorical covariates, and is appropriate for either regression or classification. DiPriMe forests avoid overwhelming the signal at each leaf with noise, by using a median-based splitting criteria to encourage a balanced distribution of data across the leaf nodes Dwork et al. (2014). The median is a robust statistic, making it relatively cheap to privatize, and the noise added during privatization encourages exploration of the solution space.
We begin by reviewing the concept of differential privacy, and discussing related approaches, in Section 2, before introducing DiPriMe forests in Section 3. We provide theoretical and empirical guarantees on both the privacy and the utility of our approach. In Section 4 we show that our method outperforms existing differentially private tree-based algorithms on a variety of classification and regression tasks.
2.1 Differential Privacy
Differential privacy Dwork (2008) is a rigorous framework for limiting the amount of information that can be inferred from an individual’s inclusion in a database, or in the training set for an algorithm. Formally, a randomized mechanism satisfies -differential privacy for all datasets and differing on at most one element and all if
) implies that the inclusion of an individual’s data can change the probability of any given outcome by at most a multiplicative factor of. A lower value of provides a stronger privacy guarantee, as it limits the effect the omission of a data point can have on the statistic.
Typically, the mechanism is a randomized form of some deterministic query . To determine the degree of randomization required to satisfy (1), we need to know the global sensitivity Bojarski et al. (2014); Fletcher and Islam (2015),
The global sensitivity tells us the maximum change in the outcome of the query due to changing a single data point. Armed with the global sensitivity, we can use a number of approaches to ensure -DP; we outline the two most common mechanisms below.
Starting with a deterministic query , where is the space of possible data sets, we can construct an -DP query-answering mechanism that adds appropriately scaled Laplace noise, so that Dwork et al. (2014)
where denotes the sensitivity of the -th coordinate of the output. Note how the noise added scales as
– increased privacy directly translates to increased noise variance.
The Laplace mechanism assumes that our query returns values in . A more generally applicable privacy mechanism is the exponential mechanism McSherry and Talwar (2007), which allows us to pick an outcome , where is some arbitrary space. We define a scoring function with global sensitivity , and a base measure on . For any dataset , selecting an outcome with probability
ensures -differential privacy. Clearly, the scoring function should be constructed such that preferred outputs are assigned higher scores. We usually adopt a uniform McSherry and Talwar (2007).
Often, algorithms will involve multiple queries, requiring us to account for the overall differential privacy of the composite algorithm. We can make use of the following two composition theorems to obtain the overall privacy level.
tells us that a sequence of differentially private queries maintain differential privacy. Let each provide -differential privacy. Then sequentially evaluating provides differential privacy McSherry (2009).
tells us that the privacy guarantee for a set of queries on disjoint subsets of data is only limited by the worst-case privacy guarantee for any of the queries. If each provide -differential privacy, and are arbitrary disjoint subsets of the dataset , then the sequence of provides differential privacy McSherry (2009).
2.2 Differentially private tree-based methods
There have been several methods proposed in recent literature to learn ensembles of decision trees for classification in a differentially private manner (although, to the best of our knowledge, this paper is the first to consider regression). Learning a decision tree entails greedily choosing the best split at each internal node and storing sufficient statistics (counts, in the case of classification) at the leaf nodes. To ensure differential privacy, we must privatize both the splitting mechanism and the sufficient statistics. Privatizing the sufficient statistics is relatively straightforward via the Laplace mechanism, with only minor differences between approaches explored in the literature; the key difference between the proposed methods lies in how they privatize the splits.
One way of determining the splits in a decision tree is to choose the split with lowest entropy. Friedman and Schuster (2010) privatizes this by noising the node counts using the Laplace mechanism, and then selecting a binary split from a set of candidates using the exponential mechanism with entropy as the score. This is built upon in the differentially private random forest (DP-RF) algorithm Patil and Singh (2014), which builds multiple private trees using bootstrapped samples. The trees are grown iteratively until they reach a specified maximum depth or the node is pure, i.e., contains instances of a single class. The differentially private decision forest Fletcher and Islam (2015) and the differentially private greedy decision forest Xin et al. (2019) both use the Gini index instead of entropy as the score function in the exponential mechanism.
A limitation of these splitting approaches based on the exponential mechanism is that they assume categorical covariates. Continuous covariates need to be discretized in some manner. Any discretization techniques that utilize information about the data must also be privatized, increasing the overall privacy budget Kotsiantis and Kanellopoulos (2006). Rana et al. (2015) proposes a relaxation on the DP-RF approach so that the ensemble of trees preserves the variance, instead of the entire distribution of the data. This relaxation enables their method to achieve better performance, and allows for numerical covariates, at the expense of losing any claim to -differential privacy.
Extremely randomized trees Geurts et al. (2006) are an alternative tree-based method where the splits are chosen randomly, without considering the data. These are the motivation for the learning scheme proposed in Bojarski et al. (2014), which avoids the need to split the privacy budget across the levels of each tree by drawing completely random splits. As this choice is entirely random, none of the privacy budget is consumed when determining splits. Training the ensemble of trees then involves passing all the data instances through the trees and using the Laplace mechanism to store class counts.
3 Differentially private median forests
As we saw in Section 2.2, existing privatized versions of random forests make use of private queries at each internal node, both to determine the dimension to split the node, and the value at which to split. This very quickly eats up the privacy budget, leading to very noisy function estimates. Further, the privacy mechanisms used assume that all inputs are categorical, or have been discretized, limiting their use on real-valued or mixed real/categorical data.
Privatized versions of extremely random trees Bojarski et al. (2014) involve far fewer queries, since the splits are chosen randomly. This allows us to spend more of our privacy budget on privatizing the leaf nodes. However, the differential privacy requirement means we cannot deterministically stop pruning the tree based on the number of data points at a node, and as a result we are likely to end up with many low or zero-occupancy leaves. Since the detrimental impact of privatizing the parameter estimate at a leaf node increases as the number of occupants decreases, this approach can lead to excessively noisy predictions.
We take an in-between approach, using the data to gently inform the splitting of the tree. We split our privacy budget between the two tasks of privatizing splits and privatizing leaf node sufficient statistics, with assigned to the former and to the latter. Since the splits at a given depth are performed on disjoint subsets of the data, we can assign to each internal split, where is the maximum tree depth.
At each internal node, we create a candidate split for each dimension using a differentially private version of the median. Let be the subset of associated with node , and let . For numerical covariates, we order the observations and use of our privacy budget to select the th observation, using the exponential mechanism with probability
We then pick the value for the split uniformly from the interval between the th and th value. This scoring function penalizes imbalanced splits, as . Similarly, for a categorical covariate we consider all possible splits, selecting split with probability
Having selected candidate splits from covariates, we select a covariate to split on using the negative of the mean-squared error as the scoring function. The sensitivity of the mean-squared error is , where . So, we choose an attribute to split on with probability
Forests of trees based on median splits of attributes have been proposed as a simplification of random forests, and the resulting estimators have been proven to be consistent Breiman (2004). In our setting, the median serves to partition the data into sets of similar sizes, discouraging low-occupancy leaf nodes. Both these phenomena are empirically observed in Figure 1. Using a noisy, rather than deterministic, median increases estimate variation across trees, allowing better exploration of the solution space.
Once we have reached our desired depth, we privatize the statistics at each leaf. In the classification setting, the class counts each have sensitivity 1, so we add Laplace() noise to each count. In the regression setting, the sensitivity of the mean is , where is the number of data points at node , so we add Laplace( noise to the mean. In both cases, the quality of the privatized prediction deteriorates as decreases. However since we are encouraging even splits at each node, we are likely to avoid scenarios where the amount of noise is excessive relative to the signal, and we can pre-select a depth that will have no empty leaves with high probability.
We summarize the process of constructing a single DiPriMe tree in Algorithm 1. As is common in tree-based algorithms, rather than use a single tree, we construct an ensemble of trees. As described in Algorithm 2 (with ), we randomly partition our data into groups, and learn a DiPriMe tree on each group. The parallel composition theorem ensures that the overall privacy loss is given by the loss on a single tree.111If we have less data, we may choose not to partition, however this would multiply the privacy budget by . We use the following notation in Algorithms 1 and 2: refers to the set of data points to which the tree is being fit. denotes the input features and denotes the corresponding target values. refers to the set of features that the tree can split on with denoting the corresponding range of categories. and refer to the lower and upper bounds on the target values , is total privacy budget for the tree, and is the fraction of the privacy budget allocated to determine the median split. We include code in the supplement, and will make this public upon publication.
3.1 Utility analysis
We consider the utility of a single, privatized DiPriMe regression tree, in comparison with a non-private tree with median splits Breiman (2004). For ease of analysis, we restrict ourselves to continuous features and assume that the dataset is perfectly divisible into the leaf nodes, i.e., for a depth-d tree. Hence, for a tree with median splits, we will have data points at each leaf node. Let be the loss under the non-private median tree, i.e., the sums of the per-leaf-node sums of squares. Let denote the sum of squares at node .
Let be the loss under a tree where the splits have been randomized as described above, but where the sufficient statistics at the leaf nodes are not randomized. Let be the number of data points at the leaf node. Then if , then , where .
If , then
We can bound the tail probability of using the tighter of:
Let be the loss due to Algorithm 1. Then, with probability at least , where .
To consider the utility of our proposed algorithm, we look at the estimate qualities obtained across a range of regression and classification tasks.
We consider three datasets to benchmark our method’s regression performance: the Parkinson’s telemonitoring dataset () and the Appliance Energy Prediction dataset () from the UCI Machine Learning Repository Dua and Graff (2017), and the Flight Delay dataset used by Jagannathan et al. (2009); Hensman et al. (2013). The UCI datasets contain a mixture of categorical and numeric features, while the Flight Delay dataset contains only numeric features. For the purposes of computational complexity, we sampled 800,000 data instances from the Flight Delay dataset for this experiment. For each dataset, we scaled the target variable to lie in , took 90% of the data as the training set and computed the mean squared error (MSE) over the test set. The DP-ERT algorithm refers to a privatized version of the Extremely Randomized Trees algorithm Geurts et al. (2006), akin to that delineated in Bojarski et al. (2014) for classification (we are not aware of any differentially private tree-based methods developed explicitly for regression). The results shown in Table 1 are for trees in each ensemble, with the number of covariate splits to consider set to for all but the DP-ERT. The private methods were run for and . The maximum depth was taken to be 5 for the two UCI datasets and 10 for the Flight Delay dataset.
|Parkinson’s||Appliances||Flight Delay (800K)|
DiPriMe clearly outperforms DP-ERT as a private tree-based ensemble learner for regression. We then compared the performance of DiPriMe with these algorithms for varying values of maximum depth, number of trees and privacy budget.
The utility of using the median to form splits is borne out by all the plots in in Figure 3, with the ensemble of median trees showing comparable performance and similar trends to both the random forest and extremely randomized trees algorithms. Figure 2(a) illustrates the inherent trade-off between learning deeper trees and utility. Deeper trees give a finer approximation of the data, demonstrated by the decreasing MSE of the non-private methods. However, this deteriorates the utility of DiPriMe by (a) reducing the privacy budget for the split at each node (b) increasing the sensitivity of the mean at the leaf nodes as there are likely to be fewer data instances in deeper nodes. Increasing the number of trees in the ensemble results in a similar trade-off, as shown in Figure 2(b); the number of data points to learn each tree is inversely related to the number of trees in the ensemble. So, while more trees are expected to generally reduce the mean squared error, each tree has less data to learn from.
We use three datasets from the UCI Machine Learning Repository Dua and Graff (2017), the Banknote Authentication (, Credit Card Default () and Wall-Following Robot Navigation () datasets, to compare the performance of our proposed algorithm to the DP-DF algorithm Fletcher and Islam (2015) and the DP-ERT algorithm Bojarski et al. (2014). The Credit Card Default data contains both numeric and categorical features, while the other two datasets contain only numeric features. As DP-DF requires categorical features, we bin the numeric features into 5 bins. We chose this data-agnostic discretization procedure to avoid leaking privacy. Some papers have proposed other discretization strategies Patil and Singh (2014) but these have to be privatized and hence incur additional privacy budget.
|Banknote Authentication||Credit Card Default||Robot Navigation|
|DP-DF Fletcher and Islam (2015)||0.413||0.223||0.397|
Table 2 shows the classification errors obtained by each method. The performance of median trees is comparable to that of extremely randomized trees. In contrast, DiPriMe far exceeds the performance of privatized extremely randomized trees and DP-DF. This can likely be attributed to (a) DiPriMe’s capability to directly utilize and split on numeric features without the need for prior discretization (b) the performance gain derived from learning splits. As we see in Figure 4, while DiPrime has a notable loss of accuracy compared with the nonprivate algorithms for small values of , as increases we get comparable performance. By contrast, DP-ERT and DP-DF continue to underperform even as increases.
We have presented a new, differentially private, tree-based method for both regression and classification, based on random forests with median splits. To the best of our knowledge, this is the first differentially private tree-based method for regression, and works with both categorical and numeric covariates. Moreover, we have demonstrated, both theoretically and empirically, that our algorithm obtains impressive utility to competing methods, while maintaining the same level of differential privacy.
The privacy implications of the use of data are beginning to be acknowledged, with states and countries introducing regulation to protect the privacy of individuals. Differential privacy remains one of the few privatization methods able to achieve theoretically guaranteed bounds on privacy loss for individuals. This work provides users a tool to perform nonlinear regression and classification, while protecting the privacy of individuals.
Like most differentially private algorithms, the privacy in DiPriMe forests comes at a cost: adding noise to ensure privacy leads to lower accuracy. This decrease in accuracy could have negative implications for those relying on predictions obtained using this algorithm.
Our work does not address any potential bias either in the data, or in the resulting algorithm. It has been shown that the goals of differential privacy are often in opposition to those of fairness Cummings et al. (2019); Bagdasaryan et al. (2019). Mozannar et al. (2020) discuss methods of post-hoc bias correction for differentially private algorithms.
- Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. Cited by: §1.
- Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems 32, pp. 15479–15488. Cited by: Broader Impact.
- Practical privacy: the sulq framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 128–138.
- Differentially-and non-differentially-private random decision trees. arXiv preprint arXiv:1410.6973. Cited by: §2.1, §2.2, §3, §4.1, §4.2.
- Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §1.
- Consistency for a simple model of random forests. Technical report Technical Report 670, Statistics Department, University of California at Berkeley. Cited by: §3.1, §3.
- On the compatibility of privacy and fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pp. 309–315. Cited by: Broader Impact.
- Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 202–210. Cited by: §1.
- UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Cited by: §4.1, §4.2.
Differential privacy and robust statistics.
Proceedings of the forty-first annual ACM symposium on Theory of computing, pp. 371–380.
- Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284. Cited by: §1.
- The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9 (3–4), pp. 211–407. Cited by: §1, §2.1.
- Differential privacy: a survey of results. In International conference on theory and applications of models of computation, pp. 1–19. Cited by: §2.1.
- A differentially private decision forest.. In AusDM, pp. 99–108. Cited by: §2.1, §2.2, §4.2, Table 2.
- Decision tree classification with differential privacy: a survey. ACM Computing Surveys (CSUR) 52 (4), pp. 1–33.
- Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 493–502. Cited by: §2.2.
- Extremely randomized trees. Machine learning 63 (1), pp. 3–42. Cited by: §1, §2.2, §4.1.
- Gaussian processes for big data. arXiv preprint arXiv:1309.6835. Cited by: §4.1.
- Optimal differentially private algorithms for k-means clustering. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 395–408. Cited by: §1.
A practical differentially private random decision tree classifier. In 2009 IEEE International Conference on Data Mining Workshops, pp. 114–121. Cited by: §4.1.
- Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pp. 25–1. Cited by: §1.
- Discretization techniques: a recent survey. GESTS International Transactions on Computer Science and Engineering 32 (1), pp. 47–58. Cited by: §2.2.
- Mondrian forests for large-scale regression when uncertainty matters. In Artificial Intelligence and Statistics, pp. 1478–1487.
- Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pp. 19–30. Cited by: §2.1, §2.1.
- Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pp. 94–103. Cited by: §2.1.
- Fair learning with private demographic data. arXiv preprint arXiv:2002.11651. Cited by: Broader Impact.
- DP-em: differentially private expectation maximization. arXiv preprint arXiv:1605.06995. Cited by: §1.
- Differential private random forest. In 2014 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pp. 2623–2630. Cited by: §2.2, §4.2.
- Differentially private random forest with high utility. In 2015 IEEE International Conference on Data Mining, pp. 955–960. Cited by: §2.2.
- Novel tractable bounds on the lambert function with application to maximum determinant problems. arXiv preprint arXiv:2004.01115. Cited by: §A.2.2.
- Differentially private regression with Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1195–1203. Cited by: §1.
- Differentially private k-means clustering. In Proceedings of the sixth ACM conference on data and application security and privacy, pp. 26–37. Cited by: §1.
- The eu general data protection regulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International Publishing.
- High-dimensional statistics: a non-asymptotic viewpoint. Vol. 48, Cambridge University Press. Cited by: §A.2.2.
- Differentially private greedy decision forest. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. , pp. 2672–2676. Cited by: §2.2.
Appendix A Appendix
We derive sensitivities used in the main paper in Section A.1 and provide proofs for the theorems in Section A.2. Algorithm 3 provides additional implementation details. Section A.3 provides some additional experimental results
We shall use the following constants in our analysis. We assume all the target values are bounded, i.e., , and .
a.1 Sensitivity analysis
We assume have a set with data points such that . denotes the set of all data points besides , i.e., .
a.1.2 Mean squared error
The mean squared error is equivalent to the variance. Denoting the variance by ,
Using the triangle inequality repeatedly, we get
a.2.1 Theorem 1
In a depth-d median tree, there are data points at the leaf node. W,l.o.g. let these be . Then, the value of our objective function is
Assuming there be a subset of data points at node with private median splits, our noisy objective value is
Sensitivity of objective
W.l.o.g., we take excluded points to be . We assume all the points are bounded, i.e., .
Returning to the utility analysis,
we see that noising the median by leads to a maximum change of in the objective value at node .
Extending this to tree of depth , if denotes the number of data points at the leaf node, then
Corollary 1.1 follows from applying the above result over all the leaf nodes of a depth-d tree.
a.2.2 Theorem 2
We consider a depth-d tree. Let be the noise added at the first level, be the noise added at the second level, and so on. In the noised tree, at the leaf node (depth d), we will therefore have
is unbounded, so we bound the tail probability instead, i.e., find an upper bound on
. We shall use two approaches to arrive at this bound: (a) sub-exponential random variables (b) Chebyshev’s inequality.
Using sub-exponential random variables
This implies that where denotes the class of sub-exponential random variables. To find , we need to find the range of for which
where is the principal branch of the Lambert W function. For ease of analysis, we use the lower bound on from Roig-Solvas and Sznaier (2020):
Let . Then,
We can now bound the tail probabilities Wainwright (2019) as
Using Chebyshev’s inequality
A simpler method of bounding the tail probabilities is to use Chebyshev’s inequality. This gives
This is tighter than the sub-exponential bound for
There is a clear dependence on in the bounds which intuitively makes sense as the noise scales by that factor.
a.2.3 Theorem 3
Let us examine the effect of noising the mean of the leaf node. We have
The conditional expectation of the perturbation due to this noise is
Note that .
As the distribution of is symmetric about , we have with probability , where .
a.3 Additional results
Figure 5 displays the results for DiPriMe with . The trends are similar to those seen in Figure 3. An observation of note is that the optimal maximum depth for trees fit on all the data is higher than that fit on disjoint subsets of data. This is congruent with the intuition that low-occupancy nodes are noised more heavily. Hence, trees fitted to more data can be grown deeper before suffering from a similar loss of utility. This line of reasoning leads us to believe that learning an ensemble of DiPriME trees on disjoint subsets of data will be a more powerful learner with larger amounts of training data.
Mean squared error of DiPriMe for various hyperparameter settings
Figure 6 once again exhibits improved performance with larger privacy budgets. It also shows that increasing improves performance only to a certain limit before the increased noise reduces the utility of the DiPriMe trees. The key insight here is the importance of the hyperparameter for good performance; large values of leaves less privacy budget for storing the means, resulting in deterioration in the MSE. This effect is more pronounced at smaller values of as the noise scales as .