S-GBDT: Frugal Differentially Private Gradient Boosting Decision Trees
Privacy-preserving learning of gradient boosting decision trees (GBDT) has the potential for strong utility-privacy tradeoffs on tabular data, such as census data or medical metadata: classical GBDT learners can extract non-linear patterns even from small datasets. The state-of-the-art notion of provable privacy is differential privacy, which requires that the impact of any single data point is limited and deniable. We introduce a novel differentially private GBDT learner and utilize four main techniques to improve the utility-privacy tradeoff. (1) We use an improved noise scaling approach with tighter accounting of the privacy leakage of a decision tree leaf compared to prior work, resulting in noise that scales in expectation as O(1/n) for n data points. (2) We integrate individual Rényi filters into our method to learn from data points that have been underutilized during the iterative training process, which, potentially of independent interest, yields a natural yet effective insight into learning from streams of non-i.i.d. data. (3) We incorporate random decision tree splits to concentrate the privacy budget on learning the leaves. (4) We deploy subsampling for privacy amplification. Our evaluation shows for the Abalone dataset (<4k training data points) an R^2-score of 0.39 for ε=0.15, which the closest prior work only achieved for ε=10.0. On the Adult dataset (50k training data points) we achieve a test error of 18.7% for ε=0.07, which the closest prior work only achieved for ε=1.0. For the Abalone dataset at ε=0.54 we achieve an R^2-score of 0.47, which is very close to the R^2-score of 0.54 of the non-private version of GBDT. For the Adult dataset at ε=0.54 we achieve a test error of 17.1%, which is very close to the test error of 13.7% of the non-private version of GBDT.
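To make techniques (1) and (3) concrete, below is a minimal Python sketch, not the paper's actual mechanism: it shows a differentially private leaf value computed via gradient clipping and the Laplace mechanism, so that the noise magnitude scales as O(1/n), and a data-independent random split that consumes no privacy budget. The names dp_leaf_value and random_split, the clipping bound clip_bound, the assumption that the feature domain bounds are public, and the simplification that the leaf size n is treated as public are all illustrative assumptions, not taken from the paper.

```python
import numpy as np


def dp_leaf_value(gradients, clip_bound, epsilon, rng=None):
    """Sketch of technique (1): a differentially private GBDT leaf value.

    Clipping each per-example gradient to [-clip_bound, clip_bound]
    bounds the sensitivity of the gradient sum by clip_bound, so the
    Laplace noise added to the sum has scale clip_bound / epsilon.
    Dividing the noisy sum by n then gives noise of expected magnitude
    clip_bound / (epsilon * n), i.e. O(1/n), matching the scaling
    behavior described in the abstract.

    Simplification (assumption): the leaf size n is treated as public;
    a full mechanism would also account for its privacy.
    """
    rng = rng or np.random.default_rng()
    n = len(gradients)
    clipped = np.clip(np.asarray(gradients, dtype=float),
                      -clip_bound, clip_bound)
    noisy_sum = clipped.sum() + rng.laplace(scale=clip_bound / epsilon)
    return noisy_sum / n


def random_split(feature_ranges, rng=None):
    """Sketch of technique (3): choose a split uniformly at random.

    Because the split is drawn without inspecting the data (only from
    assumed-public feature domain bounds), it consumes no privacy
    budget, leaving the full budget for learning the leaves.
    """
    rng = rng or np.random.default_rng()
    j = int(rng.integers(len(feature_ranges)))
    lo, hi = feature_ranges[j]
    return j, float(rng.uniform(lo, hi))


# Usage example with toy values:
grads = np.random.default_rng(0).normal(size=1000)
print(dp_leaf_value(grads, clip_bound=1.0, epsilon=0.5))
print(random_split([(0.0, 1.0), (-5.0, 5.0)]))
```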