The decision tree is one of the most fundamental structures used in machine learning. Constructing a tree of good quality is, however, a computationally hard problem. Needless to say, choosing the optimal attribute according to which the data should be partitioned in any given node of the tree requires nontrivial calculations involving the data points located in that node. Nowadays, with the increasing importance of mechanisms preserving the privacy of the data handled by machine learning algorithms, the need arises to construct these algorithms with strong privacy guarantees (see e.g., , , , ). One of the strongest currently used notions of privacy is so-called differential privacy, which was introduced in a quest to achieve the dual goal of maximizing data utility and preserving data confidentiality. A differentially-private database access mechanism preserves the privacy of any individual in the database, irrespective of the amount of auxiliary information available to an adversarial database client. Differential-privacy techniques perturb data by adding noise (such as Laplacian noise) whose magnitude depends on the sensitivity of the statistics being output. Even though the overall scheme looks simple, in practice it is usually very difficult to obtain a reasonable level of differential privacy while maintaining good accuracy, since the perturbation error that needs to be added is usually too large. In particular, this happens when machine learning computations access the data frequently during the entire execution of the algorithm and output structures that are very sensitive to the data. This is also an obstacle to proposing a scheme that computes an optimal decision tree in a differentially-private way. In such a scenario the attribute chosen in every node, as well as any additional information stored there, depends on the data, and that is why it must be perturbed in order to keep the desired level of differential privacy. The large perturbation added in this setting leads to a substantially lower quality of the constructed tree.
Instead of constructing one differentially-private decision tree, in this paper we consider constructing a random forest. Random forests constitute an important member of the family of decision tree-based algorithms due to their effectiveness and excellent performance. They are also the most accurate general-purpose classifiers available [8, 7]. In this paper we construct a forest consisting of random decision trees ( is the size of the dataset, i.e. the number of data samples). The attribute according to which the selection is performed in any given node is chosen uniformly at random from all the attributes, independently of the dataset in that node. In the continuous case, the threshold value for the chosen attribute is then also chosen uniformly at random from the range of all possible values. This simple rule enables us to construct each decision tree very fast, since the choice of the nodes' attributes does not depend on the data at all. The obtained algorithm is therefore fast and scalable, with minimal memory requirements. It also takes only one pass over the data to construct the classifier. Since most of the structure of each random decision tree is constructed without examining the data, the algorithm suits the differentially-private scenario very well. After a sufficient number of random decision trees is constructed, every point from the dataset is classified according to one of three schemes: majority voting, threshold averaging, or probabilistic averaging. In the differentially-private setting we add perturbation error to the counters in the leaves, but no perturbation error is added to the inner nodes. This leads to a much more accurate learning mechanism. Performing voting/averaging (see for applications of the voting methods) instead of just taking the best tree for a given dataset is important, since it enables us to add a smaller perturbation error to obtain the same level of differential privacy.
In this paper we analyze both the non-differentially-private and the differentially-private setting in all three variants: majority voting, threshold averaging, and probabilistic averaging. To the best of our knowledge, we are the first to give a comprehensive and unified theoretical analysis of all three models in both settings; in the case of the differentially-private setting, no theoretical analysis had ever been provided in the context of random decision trees. The differentially-private setting is especially difficult to analyze since increasing the number of trees does not necessarily decrease the training (and test) error in this setting. Having more random decision trees requires adding a bigger perturbation error, which may decrease the efficiency of the learning algorithm. In this paper we thoroughly investigate this phenomenon. The dependence of the quality of the random decision tree methods on the chosen level of differential privacy, the height of the trees, and the number of trees in the forest is the central focus of our theoretical and empirical analysis. Understanding these dependencies is crucial when applying these methods in practice. Our theoretical analysis relates the empirical error and the generalization error of the classifier to the average tree accuracy, and explains quantitatively how the quality of the system depends on the number of chosen trees. Furthermore, we show that the random forest does not need many trees to achieve good accuracy. In particular, we prove both theoretically and empirically that in practice a logarithmic number of random decision trees (further in the paper, by "logarithmic number of random decision trees" we always mean "logarithmic, in the size of the dataset, number of random decision trees") suffices to achieve good performance. We also show not only that there exist parameters of the setting (such as the number of random trees in the forest, the height of the trees, etc.) under which one can effectively learn, but that the setting is very robust. To be more precise, we empirically demonstrate that the parameters do not need to be chosen in the optimal way; for example, one can choose far fewer trees and still achieve good performance. We also show that majority voting and threshold averaging are good candidates for differentially-private classifiers. Our experiments reveal that a simple majority voting rule is competitive with the threshold averaging rule, and both simultaneously outperform the probabilistic averaging rule. Furthermore, the majority voting rule is much less sensitive to the choice of the parameters of the random forest (such as the number of trees and the height of the trees) than the remaining two schemes.
This article is organized as follows. In Section 2 we describe previous work regarding random decision trees. We then introduce our model and the notion of differential privacy in Section 3. In Section 4 we present a differentially-private supervised algorithm that uses random decision trees. Section 5 contains our theoretical analysis. We conclude the paper with experiments (Section 6) and a brief summary of our results (Section 7).
2 Prior work
Random decision trees are important methods in machine learning, often used for supervised learning due to their simplicity, excellent practical performance, and somewhat unreasonable effectiveness in practice. They have been successful in a number of practical problems, e.g. [11, 12, 13, 14, 15, 16, 17, 18, 19, 20] (many more examples exist). The original random forests were ensemble methods combining many CART-type decision trees using bagging (a convenient review of random forests can, for instance, be found in ). They were inspired by some earlier published random approaches [12, 23, 24, 25]. Despite their popularity, the statistical mechanism of random forests is difficult to analyze [8, 26] and to this day remains largely not understood [26, 27, 28]. Next we review the existing theoretical results in the literature.
A notable line of works provides an elegant analysis of the consistency of random forests [26, 27, 28, 8, 29, 30, 31, 32, 29, 33]. Among these works, one of the most recent studies proves that a previously proposed random forest approach is consistent and achieves a rate of convergence which depends only on the number of strong features and not on the number of noise variables. Another recent paper provides the first consistency result for an online variant of random forests. The predecessor of this work proposes the Hoeffding tree algorithm and proves that, with high probability and under certain assumptions, the online Hoeffding tree converges to the offline tree. In our paper we focus on error bounds rather than the consistency analysis of random decision trees.
It has been noted that the most famous theoretical result concerning random forests provides an upper bound on the generalization error of the forest in terms of the correlation and strength of the trees . Simultaneously, the authors show that the generalization error converges almost surely to a limit as the number of trees in the forest becomes large. It should be noted, however, that the algorithm considered by the authors has a data-dependent tree structure, in contrast to the algorithms in our paper. To be more specific, the original "random forests" method randomly selects a subset of features and then chooses the best splitting criterion from this feature subset. This affects efficiency, since computing the heuristics (the best splitting criterion) is expensive. Furthermore, it also causes the tree structure to be data-dependent (another approach where the tree structure is data-dependent is presented, for example, in ) rather than fully random, which poses a major problem when extending the method to the differentially-private setting, since a data-independent tree structure is important for preserving differential privacy . In contrast to this approach, in our algorithms we randomly draw, in each tree node, the attribute according to which we split, and then we randomly choose a threshold used for splitting. This learning model is therefore much simpler. Our fully random approach is inspired by a methodology already described in the literature (this work, however, has no theoretical analysis). Our theoretical results consider error bounds, similarly to the original work on random forests . The difference of approaches, however, does not allow us to use the theoretical results from in our setting. Finally, note that in either or only a single voting rule is considered: majority voting or probabilistic averaging, respectively. In this paper we consider a wider spectrum of voting approaches.
Next, we briefly review some additional theoretical results regarding random forests. A simplified analysis of random forests in one-dimensional settings was provided in the literature in the context of regression problems, where minimax rates of convergence were proved [37, 38]. Another set of results explores the connection of random forests with a specific framework of adaptive nearest-neighbor methods . Finally, for completeness, we emphasize that there also exist some interesting empirical studies regarding random decision trees in the literature, e.g. , and , which, however, are not directly related to our work.
Privacy-preserving data mining has emerged as an effective method to solve the problem of data sharing in many fields of computer science and statistics. One of the strongest currently used notions of privacy is the so-called differential privacy (some useful tutorial material on differential privacy research can be found in ). In this paper we are interested in the differentially-private setting in the context of random decision trees. It was first observed in that random decision trees may turn out to be an effective tool for constructing a differentially-private decision tree classifier. The authors presented a very efficient heuristic that averages over random decision trees and gives good practical results. Their work, however, lacks theoretical results regarding the quality of the differentially-private algorithm that uses random decision trees. In another published empirical study the authors develop protocols to implement privacy-preserving random decision trees that enable efficient parallel and distributed privacy-preserving knowledge discovery. A very recent work on differentially-private random forests shows experimental results demonstrating that quality functions such as information gain, the max operator, and the Gini index give almost equal accuracy regardless of their sensitivity to the noise. Furthermore, they show that the accuracy of the classical random forest and that of its differentially-private counterpart are almost equal for various sizes of datasets. To the best of our knowledge, none of the published works on differentially-private random decision trees provide any theoretical guarantees. Our paper provides strong theoretical guarantees for both non-differentially-private and differentially-private random decision trees. This is a major contribution of our work. We simultaneously develop a unified theoretical framework for analyzing both settings.
3 Preliminaries
3.1 Differential privacy
Differential privacy is a model of privacy for database access mechanisms. It guarantees that small changes in a database (removal or addition of an element) do not change the output of the mechanism substantially.
(See .) A randomized algorithm M gives ε-differential privacy if for all datasets D₁ and D₂ differing on at most one element, and all S ⊆ Range(M), Pr[M(D₁) ∈ S] ≤ exp(ε) · Pr[M(D₂) ∈ S]. The probability is taken over the coin tosses of M.
The smaller ε, the stronger the level of differential privacy obtained. Assume that the non-perturbed output of the mechanism can be encoded by a function f. A mechanism can compute a differentially-private noisy version of f over a database by adding noise with magnitude calibrated to the sensitivity of f.
(See .) The global sensitivity GS(f) of a function f is the smallest number s such that for all databases D₁ and D₂ which differ on at most one element, ‖f(D₁) − f(D₂)‖₁ ≤ s.
Let Lap(0, λ) denote the Laplace distribution with mean 0 and scale λ. We will write L(λ) for an independent copy of a Lap(0, λ) random variable.
(See .) Let f be a function on databases with range Rᵈ, where d is the number of rows of the databases (the number of rows of a database is the number of attributes of any data point from the database). Then the mechanism that outputs f(D) + (Y₁, …, Y_d), where the Yᵢ are drawn i.i.d. from Lap(0, GS(f)/ε), satisfies ε-differential privacy.
Stronger privacy guarantees and more sensitive functions require a larger variance of the Laplacian noise being added. Differential privacy is preserved under composition, but with an extra loss of privacy for each conducted query.
The sequential application of mechanisms M₁, …, M_k, each giving εᵢ-differential privacy, satisfies (∑ᵢ εᵢ)-differential privacy.
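To make the Laplace mechanism and sequential composition concrete, here is a minimal sketch (our own illustrative code, not from the paper; all function names are hypothetical) of a differentially-private count query of global sensitivity 1, together with a naive split of the privacy budget ε across several queries:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample from Lap(0, scale) via inverse transform sampling."""
    u = random.random() - 0.5  # U ~ Uniform(-1/2, 1/2)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Lap(0, sensitivity/epsilon) noise: epsilon-DP."""
    return true_count + laplace_noise(sensitivity / epsilon)

def private_counts(true_counts, epsilon: float):
    """Sequential composition: k queries, each (epsilon/k)-DP, are epsilon-DP overall."""
    per_query_eps = epsilon / len(true_counts)
    return [private_count(c, per_query_eps) for c in true_counts]
```

Note how the per-query budget shrinks as the number of queries grows, so the noise scale (and its variance) grows linearly with the number of composed queries.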
3.2 The model
All data points have a fixed number of attributes, each of which is either discrete or real-valued. We assume that for every attribute its smallest and largest possible values are publicly available and that the labels are binary. We consider only binary decision trees (all our results can easily be translated to the setting where inner nodes of the tree have more than two children); therefore, if an attribute is discrete, we will assume it is binary. In the continuous setting, for each inner node of the tree we store the attribute according to which the selection is done and the threshold value for this attribute. All decision trees considered in this paper are complete and of a fixed height that does not depend on the data. For a random decision tree and one of its leaves, we consider the fraction of all training points in that leaf with a given label. If the leaf does not contain any training points, we choose this value uniformly at random from [0, 1]. The set of all possible decision trees is finite in the binary setting. It should be emphasized that this is true also in the continuous case: in that setting the set of all possible threshold values for a node is infinite but, needless to say, the set of all possible partitionings in the node is still finite. Thus, without loss of generality, we assume the set of all possible decision trees is finite. It can be very large, but this does not matter, since we will never need its actual size in our analysis. For a given tree and a given data point, denote by the weight of the point in the tree the fraction of points (from the training set if the point is from this set, and from the test set otherwise) with the same label as the point that end up in the same leaf of the tree as the point. Notice that a training point is classified correctly by a tree in the single-tree setting iff its weight in the tree is larger than 1/2 (for a single decision tree we consider the majority voting model for point classification).
The average value of a point's weight over all the trees of the forest will be called the weight of the point in the forest. We also consider the fraction of trees from the forest with the property that most of the points in the leaf of the tree containing the given point have the same label as that point (again, the points are taken from the training set if the given point is from it, and from the test set otherwise). We call this quantity the goodness of the point in the forest. For a given dataset, the average tree accuracy of a random decision tree model is the average accuracy of a random decision tree from the model, where the accuracy is the fraction of data points that a given tree classifies correctly (accuracy is computed under the assumption that the same distribution was used in both the training and the test phase).
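As a concrete illustration of these two definitions, the following sketch (hypothetical names, not code from the paper) computes the weight and the goodness of a single point in a forest, given, for each tree, the labels of all points sharing the point's leaf:

```python
def weight_and_goodness(point_label, leaf_labels_per_tree):
    """point_label: the binary label (0 or 1) of a point x.
    leaf_labels_per_tree: for each tree, the labels of all points that
    end up in the same leaf as x (including x itself).
    Returns (weight, goodness) of x in the forest."""
    # Fraction of same-label points in x's leaf, one value per tree.
    fractions = [sum(1 for lbl in leaf if lbl == point_label) / len(leaf)
                 for leaf in leaf_labels_per_tree]
    weight = sum(fractions) / len(fractions)               # average over trees
    goodness = sum(f > 0.5 for f in fractions) / len(fractions)
    return weight, goodness
```

A single tree classifies the point correctly under majority voting iff its per-tree fraction exceeds 1/2, i.e. iff that tree contributes to the point's goodness.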
Algorithm 1: Supervised learning with random decision trees (RDT).
  Input: train and test sets; the height of the trees
  Random forest construction:
    construct the random decision trees by choosing, for each inner node of a tree, its attribute independently at random (uniformly from the set of all the attributes);
    in the continuous case, for each chosen attribute choose a threshold value independently and uniformly at random from its publicly known range;
    add each tree to the forest by updating, for every leaf, the label counters of the training points corresponding to it
  Testing, majority voting version:
    compute the number of trees classifying a test point as positive;
    classify the point as positive iff the majority of the trees do so
  Testing, threshold averaging version:
    compute the average fraction of positive training points over the set of all leaves of the forest that correspond to the point;
    classify the point as positive iff this average exceeds 1/2
  Testing, probabilistic averaging version:
    compute the average fraction of positive training points over the set of all leaves of the forest that correspond to the point;
    classify the point as positive with probability equal to this average
    /* random tosses here are done independently from all other previously conducted */
  Output: classification of all test points
Algorithm 2: Differentially-private supervised learning with random decision trees.
  Input: train and test sets; the height of the trees; the privacy parameter ε
  Random forest construction: as in Algorithm 1;
    find the leaf for every training point in every tree and update that leaf's two counters, where:
      the first counter is the number of training points with a positive label belonging to that leaf;
      the second counter is the number of training points with a negative label belonging to that leaf
    for every leaf, perturb both counters with Laplacian noise;
    if ( … ) choose the leaf's value uniformly at random from ( … );
    publish the perturbed counters for every leaf
  Testing: as in Algorithm 1, but replace the exact counters with the perturbed ones
  Output: classification of all test points
Algorithm 1 captures the non-differentially-private algorithm for supervised learning with random decision trees (RDT). Its differentially-private counterpart is captured in Algorithm 2. We consider three versions of each algorithm: majority voting, threshold averaging, and probabilistic averaging.
Only the variables stored in the leaves depend on the data. This fact will play a crucial role in the analysis of the differentially-private version of the algorithm, where Laplacian error is added to the point counters at every leaf, with variance calibrated to the number of all trees used by the algorithm.
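The non-differentially-private variant can be sketched as follows (our own illustrative code, restricted to binary attributes; the class and function names are not from the paper). The tree structure is drawn without looking at the data, only the leaf counters depend on it, and all three classification rules read the same counters:

```python
import random
from collections import defaultdict

class RandomDecisionTree:
    """Complete binary tree of height h over d binary attributes;
    the attribute at each inner node is chosen without looking at the data."""
    def __init__(self, d: int, h: int):
        self.h = h
        # Heap-indexed inner nodes 1 .. 2^h - 1, each with a random attribute.
        self.attr = [random.randrange(d) for _ in range(2 ** h)]
        self.counts = defaultdict(lambda: [0, 0])  # leaf -> [#label 0, #label 1]

    def leaf(self, x):
        node = 1
        for _ in range(self.h):
            node = 2 * node + x[self.attr[node]]
        return node

    def fit(self, X, y):
        # One pass over the data: only the leaf counters are updated.
        for x, label in zip(X, y):
            self.counts[self.leaf(x)][label] += 1

def classify(forest, x, rule="majority"):
    votes, frac_sum, trees = 0, 0.0, 0
    for t in forest:
        n0, n1 = t.counts[t.leaf(x)]
        # Empty leaf: fraction chosen uniformly at random from [0, 1].
        p1 = random.random() if n0 + n1 == 0 else n1 / (n0 + n1)
        votes += p1 > 0.5
        frac_sum += p1
        trees += 1
    if rule == "majority":    # most trees vote for label 1
        return int(votes > trees / 2)
    if rule == "threshold":   # average fraction exceeds 1/2
        return int(frac_sum / trees > 0.5)
    return int(random.random() < frac_sum / trees)  # probabilistic averaging
```

Since `fit` never inspects the data while building the structure, extending this sketch to the differentially-private variant only requires perturbing the published leaf counters.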
5 Theoretical results
In this section we derive the upper-bounds on the empirical error (the fraction of the training data misclassified by the algorithm) and the generalization error (the fraction of the test data misclassified by the algorithm where the test data is taken from the same distribution as the training data) for all methods in Algorithm 1 and 2.
We also show how to find the number of random decision trees to obtain good accuracy and, in the differentially-private setting, good privacy guarantees.
We start with two technical results which, as we will see later, give an intuition why the random decision tree approach works very well in practice.
Assume that the average tree accuracy of the set of all decision trees of height on the training/test set is for some . Then the average goodness of a training/test point in is at least .
Assume that the average tree accuracy of the set of all decision trees of height on the training/test set is for some . Then the average weight of a training/test point in is at least .
The theorems above imply that if the average accuracy of the tree is better than random, then this is also reflected by the average values of and . This fact is crucial for the theoretical analysis since we will show that if the average values of and are slightly better than random then this implies very small empirical and generalization error. Furthermore, for most of the training/test points their values of and are well concentrated around those average values and that, in a nutshell, explains why the random decision trees approach works well. Notice that Theorem 5.1 gives better quality guarantees than Theorem 5.2.
We are about to propose several results regarding differentially-private learning with random decision trees. They are based on a careful structural analysis of the bipartite graph between the set of decision trees and the data points. The edges of that bipartite graph connect data points with the trees that classify those data points correctly. In the differentially-private setting, the key observation is that under relatively weak conditions one can assume that the sizes of the sets of data points residing in the leaves of the trees are substantial. Thus, adding the Laplacian noise will not perturb the statistics to an extent that would affect the quality of learning. All upper bounds on the generalization error were obtained by combining this analysis with concentration results (such as Azuma's inequality).
5.1 Non-differentially-private setting
We start by providing theoretical guarantees in the non-differentially-private case. Below we consider majority voting and threshold averaging. The results for the probabilistic averaging are stated later in this subsection.
Let . Assume that the average tree accuracy of the set of all decision trees of height on the training/test set is for some . Let be: the fraction of training/test points with goodness in at least / for (in the majority version) or: the fraction of training/test points with weight in at least / for (in the threshold averaging version). Then Algorithm 1 for every and selected random decision trees gives empirical error with probability . The generalization error will be achieved for trees with probability , where . Probabilities and are under random coin tosses used to construct the forest and the test set.
Note that the parameter is always in the range . The more decision trees classify the data in a nontrivial way (i.e. with accuracy better than random), the larger its value is. The result above in particular implies that if most of the points have goodness/weight in the forest a little better than random, then both errors are very small. This is indeed the case: the average point's goodness/weight in the forest is bounded from below by the expressions of Theorem 5.1 and Theorem 5.2, which are better than random whenever the average tree accuracy is slightly better than the worst possible. Besides, the goodness/weight of most of the points, as was tested experimentally, is well concentrated around the average goodness/weight. We conclude that an average decision tree accuracy separated from random (but not necessarily very close to perfect) suffices to classify most of the data points correctly. The intuition behind this result is as follows: if the constructed forest of decision trees contains at least a few "nontrivial" trees giving better accuracy than random, then they guarantee correct classification of most of the points.
If we know that the average tree accuracy is big enough then techniques used to prove Theorem 5.3 give us more direct bounds on the empirical and generalization errors captured in Theorem 5.4. No assumptions regarding goodness/weight are necessary there.
Let . Assume that the average tree accuracy of the set of all decision trees of height on the training/test set is for some . Then Algorithm 1 for every , and selected random decision trees gives empirical error: (in the majority version) or: (in the threshold averaging version) with probability . The generalization error: (in the majority version) or: (in the threshold averaging version) will be achieved for trees with probability , where . Probabilities and are under random coin tosses used to construct the forest and the test set.
Theorems 5.3 and 5.4 show that in practice a logarithmic number of random decision trees suffices to obtain high prediction accuracy with a very large probability. In particular, the upper bound on the generalization error is about two times the average error of a tree. The existence of a tree with lower accuracy in the forest does not harm the entire scheme, since all trees of the forest play a role in the final classification.
We now state our results (analogous to Theorem 5.4) for the probabilistic averaging setting. The following is true.
Let . Assume that the average tree accuracy of the set of all decision trees of height on the training/test set is for some . Let be a constant. Let . Then with probability at least the probabilistic averaging version of Algorithm 1 gives empirical error and with probability , where , it gives generalization error . Probabilities are under random tosses used to construct the forest and the test set.
Notice that this result is nontrivial for almost the entire range of , and and close to , and large . This is the case since and the equality holds only for .
5.2 Differentially-private setting
We begin this section with the proof that all three methods captured in Algorithm 2, where the Laplacian noise is added to certain counts, are indeed ε-differentially-private. Notice that in every method, to obtain the forest of random decision trees with perturbed counters in the leaves, we need only as many queries to the private data as there are trees in the forest (this is true since the structure of the inner nodes of the trees does not depend on the data at all, and the data subsets corresponding to the leaves of a tree are pairwise disjoint). Furthermore, the values that are being perturbed by the Laplacian noise are simple counts of global sensitivity 1. Thus we can use Theorem 3.1 and Theorem 3.2 to conclude that, in order to obtain ε-differential privacy of the entire system, we need to add a Laplacian of appropriately calibrated variance to every count in the leaves. This proves that our algorithms are indeed ε-differentially-private.
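The noise calibration used in this argument can be sketched as follows (our own illustrative code, not from the paper; the function name is hypothetical). With k trees, each tree's disjoint leaf counters form one count query of global sensitivity 1, so by sequential composition each counter receives Laplacian noise of scale k/ε:

```python
import numpy as np

def privatize_leaf_counts(leaf_counts, epsilon):
    """leaf_counts: one dictionary per tree, mapping a leaf id to its pair
    of label counters [n_minus, n_plus]. Returns perturbed copies that keep
    the whole forest epsilon-differentially-private: each of the k trees is
    one sensitivity-1 count query (its leaves are pairwise disjoint), so
    every counter is perturbed with Lap(0, k/epsilon) noise."""
    k = len(leaf_counts)
    scale = k / epsilon
    return [{leaf: [n_minus + np.random.laplace(0.0, scale),
                    n_plus + np.random.laplace(0.0, scale)]
             for leaf, (n_minus, n_plus) in counts.items()}
            for counts in leaf_counts]
```

Note how the noise scale grows linearly with the number of trees, which is exactly the accuracy/privacy trade-off analyzed below.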
Next we show the theoretical guarantees we obtained in the differentially-private setting. As in the previous section, we first focus on the majority voting and threshold averaging, and then we consider the probabilistic averaging.
Assume that we are given a parameter . Let . Assume that the average tree accuracy of the set of all decision trees of height on the training/test set is for some . Let be the fraction of training/test points with: goodness in at least / (in the majority version) or: weight in at least / (in the threshold averaging version) for . Then Algorithm 2 for selected random decision trees and differential privacy parameter gives empirical error with probability and generalization error with probability , where: and . Probabilities and are under random coin tosses used to construct the forest and the test set. Furthermore, we always have: / in the majority version and: / in the threshold averaging version.
Notice that if the number of trees in the forest is logarithmic in then is close to one and so is .
Again, as in the non-differentially-private case, we see that if there are many points whose goodness/weight in the forest is close to the average goodness/weight, then the empirical and generalization errors are small. Notice also that increasing the number of trees too much has a negative impact on the empirical error (through the corresponding term in the bound). More trees mean a bigger variance of the single Laplacian used in each leaf of a tree. This affects the tree quality. The theorem above describes this phenomenon quantitatively.
If the average tree accuracy is big enough, then the following result becomes of independent interest. It considers in particular the empirical error of the threshold averaging version of Algorithm 2 (similar results hold for the generalization error and for the majority voting version of Algorithm 2).
Assume that we are given a parameter . Assume, besides, that the average tree accuracy of the set of all decision trees of height on the training set is for some . Let . Let and let be the integer value for which the value of the function is smallest possible. Then with probability at least the -differentially-private threshold averaging version of Algorithm 2 gives empirical error at most for the forest with randomly chosen decision trees. The probability is under the random coin tosses used to construct the forest.
Both theorems show that in practice a logarithmic number of random decision trees suffices to obtain good accuracy and a high level of differential privacy.
The next theorem considers the differentially-private probabilistic averaging setting.
Assume that we are given a parameter . Let and . Assume that the average tree accuracy of the set of all decision trees of height on the training/test set is for some . Let . Then for selected random decision trees the -differentially-private probabilistic averaging version of Algorithm 2 gives empirical error with probability and generalization error with probability , where: . Probabilities and are under random coin tosses used to construct the forest and the test set.
As in the two previous settings, information about the average accuracy of just a single tree gives strong guarantees regarding the classification quality achieved by the differentially-private version of the forest. The next result (analogous to Theorem 5.7) shows how to choose the optimal number of trees and that this number is again at most logarithmic in the data size.
Assume that we are given a parameter . Assume besides that the average tree accuracy of the set of all decision trees of height on the training set is for some . Let and let be the integer value for which the value of the function is smallest possible. Then with probability at least the -differentially-private probabilistic averaging version of Algorithm 2 gives empirical error at most for the forest with randomly chosen decision trees. Probability is under random coin tosses used to construct the forest.
6 Experiments
The experiments were performed on benchmark datasets (downloaded from http://osmot.cs.cornell.edu/kddcup/, http://archive.ics.uci.edu/ml/datasets.html, and http://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets): Banknote Authentication (Ban_Aut), Blood Transfusion Service Center (BTSC), Congressional Voting Records (CVR), Mammographic Mass (Mam_Mass), Mushroom, Adult, Covertype and Quantum. A fixed fraction of each dataset was used for training and the remaining part for testing. Furthermore, a fraction of the training dataset was used as a validation set. All code for our experiments is publicly released.
We first compare the test error obtained using five different methods: an open-source implementation of CART called rpart , and non-differentially-private (n-dp) and differentially-private (dp) random forests with majority voting (RFMV) and threshold averaging (RFTA). For all methods except rpart we also report the number of trees in the forest and the height of the trees for which the smallest validation error was obtained, where we explored several settings of both parameters. In all the experiments the differential privacy parameter was set as a function of the number of training examples. Table 1 captures the results. For each experiment we report the average test error over multiple runs. We also show symmetric binomial confidence intervals for our results. The performance of the random forest with probabilistic averaging (RFPA) was significantly worse than that of the competitive methods (RFMV, RFTA, rpart) and is not reported in the table. The performance of RFPA will, however, be shown in the next set of results.
The next set of results (Figures 1 and 2; all figures in this section are best viewed in color) is reported for exemplary datasets (Banknote Authentication, Congressional Voting Records, Mammographic Mass and Mushroom) and for the following methods: dpRFMV, dpRFTA and dpRFPA. Similar results were obtained for the remaining datasets. In Figure 1a we report the test error vs. for selected settings of (recall that when the forest contains only one tree, the majority voting and threshold averaging rules are equivalent, thus the blue curve overlaps with the green curve on the corresponding plot). In Figure 1b we also show the minimal, average and maximal test error vs. for dpRFMV, whose performance was overall the best. Similarly, in Figure 1c we report the test error vs. for two selected settings of , and in Figure 1d we also show the minimal, average and maximal test error vs. for dpRFMV.
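The three voting rules behind dpRFMV, dpRFTA and dpRFPA can be sketched using the standard definitions for random decision tree ensembles (the list-of-leaf-counts interface and tie-breaking choices below are our own illustration):

```python
import random

def majority_vote(leaf_counts):
    # leaf_counts: one (n_pos, n_neg) pair per tree, taken from the leaf the
    # test point falls into. Each tree votes for its leaf's majority label
    # (ties broken toward +1); the forest returns the majority of the votes.
    votes = sum(1 if p >= n else -1 for p, n in leaf_counts)
    return +1 if votes >= 0 else -1

def threshold_average(leaf_counts):
    # Average the per-tree fractions of positive points; threshold at 1/2.
    fracs = [p / (p + n) for p, n in leaf_counts if p + n > 0]
    return +1 if sum(fracs) / len(fracs) >= 0.5 else -1

def probabilistic_average(leaf_counts, rng=random):
    # Output +1 with probability equal to the averaged positive fraction.
    fracs = [p / (p + n) for p, n in leaf_counts if p + n > 0]
    avg = sum(fracs) / len(fracs)
    return +1 if rng.random() < avg else -1
```

With a single tree the two deterministic rules coincide, consistent with the overlapping curves noted above for forests containing only one tree.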
Finally, in Figure 2a we report the test error for various settings of and two selected settings of . For each experiment, the value was chosen from the explored set to give the smallest validation error. Additionally, in Figure 2b we show how the test error changes with for a fixed and various levels of .
Figure 2a shows that in most cases dpRFTA outperforms the remaining differentially-private classifiers; however, it requires careful selection of the forest parameters to obtain optimal performance, as illustrated in Figures 1c and 2b. This problem can be overcome by using dpRFMV, which has performance comparable to dpRFTA but is much less sensitive to the setting of the forest parameters. Therefore dpRFMV is much easier to use in the differentially-private setting.
In this paper we first provide a novel theoretical analysis of supervised learning with non-differentially-private random decision trees in three cases: majority voting, threshold averaging and probabilistic averaging. Second, we show that the algorithms we consider here can be easily adapted to the setting where strong privacy guarantees must be achieved. We furthermore provide both a theoretical and an experimental evaluation of the differentially-private random decision trees approach. To the best of our knowledge, no theoretical analysis of differentially-private random decision trees has been done before. Our experiments reveal that majority voting and threshold averaging are good differentially-private classifiers, and that majority voting in particular exhibits less sensitivity to the forest parameters.
-  R. Agrawal and R. Srikant. Privacy-preserving data mining. In ACM SIGMOD, 2000.
-  W. Du and J. Zhan. Using randomized response techniques for privacy-preserving data mining. In KDD, 2003.
-  K. Choromanski, T. Jebara, and K. Tang. Adaptive anonymity via b-matching. In NIPS, 2013.
-  K. Chaudhuri and C. Monteleoni. Privacy-preserving logistic regression. In NIPS, 2008.
-  G. Jagannathan and R. N. Wright. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In KDD, 2005.
-  C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.
-  L. Breiman. Random forests. Machine Learning, 45:5-32, 2001.
-  G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. J. Mach. Learn. Res., 9:2015–2033, 2008.
-  W. Fan. On the optimality of probability estimation by random decision trees. In AAAI, 2004.
-  R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26:1651–1686, 1998.
-  J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, 2011.
-  Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Comput., 9:1545–1588, 1997.
-  C. Xiong, D. Johnson, R. Xu, and J. J. Corso. Random forests for metric learning with implicit pairwise position dependence. In KDD, 2012.
-  R. Agarwal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.
-  A. Z. Kouzani and G. Nasireding. Multilabel classification by bch code and random forests. International Journal on Network Security, 1(2):5, 2010.
-  R. Yan, J. Tesic, and J. R. Smith. Model-shared subspace boosting for multi-label classification. In KDD, 2007.
-  V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston. Random forest: A classification and regression tool for compound classification and qsar modeling. Journal of Chemical Information and Computer Sciences, 43(6):1947–1958, 2003.
-  A. M. Prasad, L. R. Iverson, and A. Liaw. Newer Classification and Regression Tree Techniques: Bagging and Random Forests for Ecological Prediction. Ecosystems, 9(2):181–199, 2006.
-  A. Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer Publishing Company, 2013.
-  D. Zikic, B. Glocker, and A. Criminisi. Atlas encoding by randomized forests for efficient label propagation. In MICCAI, 2013.
-  L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. CRC Press LLC, Boca Raton, Florida, 1984.
-  L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
-  T. K. Ho. Random decision forest. In ICDAR, 1995.
-  T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell., 20(8):832–844, 1998.
-  T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.
-  G. Biau. Analysis of a random forests model. J. Mach. Learn. Res., 13:1063–1095, 2012.
-  M. Denil, D. Matheson, and N. de Freitas. Consistency of online random forests. In ICML, 2013.
-  M. Denil, D. Matheson, and N. de Freitas. Narrowing the gap: Random forests in theory and in practice. In ICML, 2014.
-  N. Meinshausen. Quantile regression forests. Journal of Machine Learning Research, 7:983–999, 2006.
-  Y. Lin and Y. Jeon. Random forests and adaptive nearest neighbors. Journal of the American Statistical Association, 101:578–590, 2006.
-  L. Breiman. Some infinite theory for predictor ensembles. Technical Report 577, Statistics Department, UC Berkeley, http://www.stat.berkeley.edu/~breiman, 2000.
-  L. Breiman. Consistency for a simple model of random forests. Technical Report 670, Statistics Department, UC Berkeley, 2004.
-  H. Ishwaran and U. B. Kogalur. Consistency of random survival forests. Statistics and Probability Letters, 80(13-14):1056–1064, 2010.
-  P. Domingos and G. Hulten. Mining high-speed data streams. In KDD, 2000.
-  P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. Mach. Learn., 63:3–42, 2006.
-  G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright. A practical differentially private random decision tree classifier. Trans. Data Privacy, 5(1):273–295, 2012.
-  R. Genuer. Risk bounds for purely uniformly random forests. In ArXiv:1006.2980, 2010.
-  R. Genuer. Variance reduction in purely random forests. Journal of Nonparametric Statistics, 24:543–562, 2012.
-  W. Fan, E. Greengrass, J. McCloskey, P. Yu, and K. Drummey. Effective estimation of posterior probabilities: Explaining the accuracy of randomized decision tree approaches. In ICDM, 2005.
-  W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? On its accuracy and efficiency. In ICDM, 2003.
-  Y. Yang, Z. Zhang, G. Miklau, M. Winslett, and X. Xiao. Differential privacy in data publication and analysis. In ACM SIGMOD, 2012.
-  J. Vaidya, B. Shafiq, W. Fan, D. Mehmood, and D. Lorenzi. A random decision tree framework for privacy-preserving data mining. IEEE Transactions on Dependable and Secure Computing, 11:399–411, 2014.
-  A. Patil and S. Singh. Differential private random forest. In ICACCI, 2014.
-  C. Dwork. Differential privacy: A survey of results. In Theory and Applications of Models of Computation, 5th International Conference, TAMC 2008, Xi’an, China, April 25-29, 2008. Proceedings, pages 1–19, 2008.
-  F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, 2007.
-  R. Hall, A. Rinaldo, and L. A. Wasserman. Differential privacy for functions and functional data. Journal of Machine Learning Research, 14(1):703–727, 2013.
-  T. M. Therneau, B. Atkinson, and B. Ripley. rpart: Recursive partitioning. http://CRAN.R-project.org/package=rpart, 2011.
8 Empirical and generalization errors
We prove here results regarding the empirical and generalization errors of all the variants of the algorithm mentioned in the paper, as well as Theorem 5.1 and Theorem 5.2. Without loss of generality we assume that all attributes are binary. The proofs translate directly to the continuous case; we leave this simple exercise to the reader.
Let us introduce first some useful notation that will be very helpful in the proofs we present next.
We denote by the size of the dataset (training or test) . Recall that is the number of attributes of any given data point, is the height of a random decision tree, and is the set of all random decision trees under consideration.
We focus first on classifying with just one decision tree. Fix some decision tree and one of its leaves. The leaf contains some points with one label and some with the other. We associate with that leaf the label of the majority of the points it contains. To classify a given point using that tree, we route the point down the tree and assign to it the label of the leaf it reaches. Denote by the number of data points that were correctly classified by a tree . Denote . We call the quality (or accuracy) of the tree . Note that the quality of any tree is always at least 1/2, since for every leaf of any given tree the majority of the data points from that leaf are classified correctly. Denote: . We call the average tree accuracy. This parameter measures how well data points are classified on average by a complete decision tree of a given height . Note that . Denote . Parameter is the number of leaves of the decision tree.
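The per-tree quantities above can be made concrete. In the sketch below (our own illustration, which simplifies by testing one random attribute per level, shared across that level) a complete tree of height h on binary attributes is identified with a tuple of attribute indices, a leaf with the tuple of corresponding attribute values, and the tree accuracy is computed by majority-labeling each leaf:

```python
import random
from collections import Counter

def tree_accuracy(data, attrs):
    # data: list of (x, y) with x a tuple of 0/1 attributes, y in {-1, +1}.
    # attrs: the attribute indices tested at levels 0..h-1 (simplified
    # random tree: the same attribute is used across each level).
    leaves = {}
    for x, y in data:
        leaf = tuple(x[a] for a in attrs)  # path of attribute values
        leaves.setdefault(leaf, Counter())[y] += 1
    # Majority-label every leaf: each leaf contributes its majority count.
    correct = sum(max(c.values()) for c in leaves.values())
    return correct / len(data)

def random_tree(n_attrs, height, rng=random):
    # Pick one random attribute per level of the complete tree.
    return [rng.randrange(n_attrs) for _ in range(height)]
```

Since each leaf contributes its majority count, the returned accuracy is always at least 1/2, matching the lower bound on tree quality noted above.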
For and denote by the number of points from the dataset in the leaf of a decision tree . Denote by the number of points from the dataset in the leaf of the decision tree that were classified correctly. Denote for and for . Note that for every . Note also that we have: and . Denote by the number of data points in the leaf of the decision tree that are of label . Denote by the number of data points in the leaf of the decision tree that are of label .
We will use frequently the following structure in the proofs.
Let be a bipartite graph with color classes: , and weighted edges. Color class consists of points from the dataset. Color class consists of elements of the form , where , and .
Data point is adjacent to iff it belongs to the larger of the two groups (those with labels and ) of the data points that are in the leaf of the decision tree . An edge joining with has weight . Data point is adjacent to iff it belongs to the smaller of the two groups of the data points that are in the leaf of the decision tree . An edge joining with has weight . Note that the degree of a vertex is and the degree of a vertex is .
In the proofs we will refer to the size of the set of decision trees under consideration as: or (note that is used in the main body of the paper).
We start with the proof of Theorem 5.2. Note that from the definition of we get:
Therefore, using formula on , we get:
Note that we have: . From Jensen’s inequality, applied to the function , we get: , where is the average quality of the system of all complete decision trees of height (the average tree accuracy). Similarly, . Thus we get:
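For reference, Jensen's inequality in the form invoked here (the specific function was lost in extraction; the statement holds for any convex $\varphi$ and integrable random variable $Z$):

```latex
\varphi\big(\mathbb{E}[Z]\big) \;\le\; \mathbb{E}\big[\varphi(Z)\big],
\qquad \text{with the inequality reversed for concave } \varphi .
```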
That completes the proof of Theorem 5.2. The proof of Theorem 5.1 is even simpler. Notice that for any data point the expression counts the number of decision trees from that classified it correctly (this follows directly from the definition of ). Thus we have: . Therefore and we are done.
We need one more technical result, Azuma's inequality:
Let be a martingale with mean and suppose that for some non-negative constants: we have: for . Then for any , :
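In standard notation (chosen here, since the original symbols were lost in extraction): if $X_0, X_1, \ldots, X_n$ is a martingale with $X_0 = \mu$ and $|X_i - X_{i-1}| \le c_i$ for non-negative constants $c_1, \ldots, c_n$, then for any $\lambda > 0$,

```latex
\mathbb{P}\big(\,|X_n - \mu| \ge \lambda\,\big)
\;\le\; 2\exp\!\left(-\frac{\lambda^2}{2\sum_{i=1}^{n} c_i^2}\right).
```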
8.2 Majority voting and threshold averaging setting - empirical error
Again, we start with the analysis of the threshold averaging.
Take random decision tree , where . For a given data point from the training set let be a random variable defined as follows.
If does not belong to any leaf of then let . Otherwise let be the number of points from the training set with label in that leaf and let be the number of points from the training set with label in that leaf. If has label then we take . Otherwise we take .
When it is clear from the context which data point we refer to, we will skip the upper index and simply write or , respectively.
Fix some point from the training set. Note that if then the point is correctly classified. Notice that the weight of the point, denoted , is simply the sum of the weights of all the edges of incident to , divided by the number of all trees (or the average weight of an edge incident to if we consider real-valued attributes). Note that we have and that from Theorem 5.2 we get:
Take . Denote by the fraction of points from the training data such that . From the lower bound on that we have just derived, we get: , which gives us:
Take a point from the training set such that . Denote by the probability that is misclassified. We have:
Denote: for . We have:
Note that, since and random variables are independent, we can conclude that is a martingale. Note also that for some such that .
Using Lemma 8.1, we get:
Therefore the probability that at least one of the points for which will be misclassified by the set of random decision trees is, by the union bound, at most: . That, for , completes the proof of the upper bound on the empirical error from Theorems 5.3 and 5.4, since we have already proved that . The proof of the majority-voting version goes along exactly the same lines. This time, instead of Theorem 5.2, we use Theorem 5.1. We know that , where . Denote by the fraction of points with for . Then, by an argument similar to the one presented above, we have:
All other details of the proof for the majority voting are exactly the same as for the threshold averaging scheme.
Let be a constant. We first consider the threshold averaging scheme. Take a decision tree . Denote by the set of points from the training set with the following property: the point belongs, in , to a leaf that contains at least points. Note that since each has exactly leaves, we can conclude that . In this proof and in the proofs of the theorems presented in the next section we will consider the graph obtained from by deleting the edges adjacent to those vertices of the color class that correspond to leaves containing fewer than points from the training set. Take a point from the training set with , where