Making AI Forget You: Data Deletion in Machine Learning

07/11/2019 ∙ by Antonio Ginart, et al. ∙ Stanford University

Intense recent discussions have focused on how to provide individuals with control over when their data can and cannot be used -- the EU's Right To Be Forgotten regulation is an example of this effort. In this paper we initiate a framework studying what to do when it is no longer permissible to deploy models derivative from specific user data. In particular, we formulate the problem of how to efficiently delete individual data points from trained machine learning models. For many standard ML models, the only way to completely remove an individual's data is to retrain the whole model from scratch on the remaining data, which is often not computationally practical. We investigate algorithmic principles that enable efficient data deletion in ML. For the specific setting of k-means clustering, we propose two provably deletion efficient algorithms which achieve an average of over 100X improvement in deletion efficiency across 6 datasets, while producing clusters of comparable statistical quality to a canonical k-means++ baseline.






1 Introduction

Recently, one of the authors received the redacted email below, informing us that an individual’s data cannot be used any longer. The UK Biobank [67] is one of the most valuable collections of genetic and medical records with half a million participants. Thousands of machine learning classifiers are trained on this data, and thousands of papers have been published using this data.

EMAIL ---- UK BIOBANK ----
Subject: UK Biobank Application [REDACTED], Participant Withdrawal Notification [REDACTED]
Dear Researcher,
As you are aware, participants are free to withdraw form the UK Biobank at any time and request that their data no longer be used. Since our last review, some participants involved with Application [REDACTED] have requested that their data should longer be used.

The email request from the UK Biobank illustrates a fundamental challenge the broad data science and policy community is grappling with: How should we provide individuals with flexible control over how corporations, governments, and researchers use their data? Individuals could decide at any time that they do not wish for their personal data to be used for a particular purpose by a particular entity. This ability is sometimes legally enforced. For example, the European Union’s General Data Protection Regulation (GDPR) and former Right to Be Forgotten [21, 20] both require that companies and organizations enable users to withdraw consent to their data at any time under certain circumstances. These regulations broadly affect international companies and technology platforms with EU customers and users. Legal scholars have pointed out that the continued use of AI systems directly trained on deleted data could be considered illegal under certain interpretations and ultimately concluded that “it may be impossible to fulfill the legal aims of the Right to be Forgotten in artificial intelligence environments” [74]. Furthermore, so-called model-inversion attacks have demonstrated real capability of adversaries to extract user information from trained models [73].

Concretely, we may frame the problem of data deletion in machine learning as follows. Suppose a statistical model is trained on $n$ datapoints. For example, the model could be trained to perform disease diagnosis from data collected from $n$ patients. To delete the data sampled from the $i$-th patient from our trained model, we would like to update it such that it becomes independent of sample $i$, and looks as if it had been trained on the remaining $n-1$ patients. A naive approach to satisfy the requested deletion would be to retrain the model from scratch on the data from the remaining $n-1$ patients. For many applications, this is not a tractable solution: the costs (in time, computation, and energy) for training many machine learning models can be quite high. Large scale algorithms can take weeks to train and consume large amounts of electricity and other resources. Hence, we posit that efficient data deletion is a fundamental data management operation for machine learning models and AI systems, just like in relational databases or other classical data structures.

Beyond supporting individual data rights, there are various other possible use cases in which efficient data deletion is desirable. To name a few examples, it could be used to speed-up leave-one-out-cross-validation [2], support a user data marketplace [64, 69], or identify key or valuable datapoints within a model [32].

Deletion efficient learning has not been previously studied. While the desired output of a deletion operation on a deterministic model is fairly obvious, we have yet to even define data deletion for stochastic learning algorithms. At present, the only learning algorithms known to support fast data deletion operations are linear models [47, 24, 71], and certain types of lazy learning [76, 6, 11] techniques such as non-parametric Nadaraya-Watson kernel regressions [53]. Even so, there is no pre-existing notion of how engineers should think about the deletion efficiency of AI systems, nor understanding of the kinds of trade-offs such systems face. Related ideas for protecting data in machine learning (e.g., cryptography [55, 16, 14, 13, 54, 27] and differential privacy [26, 19, 18, 56, 1]) do not lead to efficient data deletion, but rather attempt to make data private or non-identifiable. Conversely, the ability to efficiently perform data deletion does not imply privacy.

The key components of this paper include introducing deletion efficient learning, based on an intuitive and operational notion of what it means to (efficiently) delete data from a (possibly stochastic) statistical model. We pose data deletion as an online problem, from which a notion of optimal deletion efficiency emerges from a natural lower bound on amortized computation time. We present a case study on deletion efficient learning using the simple, yet perennial, k-means clustering problem. We propose two deletion efficient algorithms that (in certain regimes) achieve optimal deletion efficiency. Empirically, on six datasets, our methods achieve an average of over 100X speedup in amortized runtime with respect to the canonical Lloyd’s algorithm seeded by k-means++ [45, 5]. Simultaneously, our proposed deletion efficient algorithms perform comparably to the canonical algorithm on three different statistical metrics of clustering quality. Finally, we synthesize an algorithmic toolbox for designing deletion efficient learning systems.

We summarize our work into three contributions:

1) Formal: We formalize the problem of efficient data deletion in the context of machine learning.

2) Algorithmic: We propose two different deletion efficient solutions for k-means clustering that have theoretical guarantees and strong empirical results.

3) Conceptual: From our theory and experiments, we synthesize four general engineering principles for designing deletion efficient learning systems.

2 Problem Formulation

We proceed by describing our setting and defining the notion of data deletion in the context of a machine learning algorithm and model. Our definition formalizes the intuitive goal that after a specified datapoint, $x$, is deleted, the resulting model is updated to be indistinguishable from a model that was trained from scratch on the dataset sans $x$. Once we have defined data deletion, we will conclude this section by defining a notion of deletion efficiency in the context of an online setting in which a stream of data deletion requests must be processed.

Throughout we denote dataset $D = \{x_1, \ldots, x_n\}$ as a set consisting of $n$ datapoints, with each datapoint $x_i \in \mathbb{R}^d$; for simplicity, we often represent $D$ as a real-valued $n \times d$ matrix as well. Let $A$ denote a (possibly randomized) algorithm that maps a dataset to a model in hypothesis space $\mathcal{H}$. We allow models to also include arbitrary metadata that may not be used at inference time. Note that algorithm $A$ operates on datasets of any size. Since $A$ is often stochastic, we can also treat $A$ as implicitly defining a conditional distribution over $\mathcal{H}$ given dataset $D$.

Definition 2.1.

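Data Deletion Operation: We define a data deletion operation for learning algorithm $A$, $R_A(D, A(D), i)$, which maps the dataset $D$, model $A(D)$, and index $i \in \{1, \ldots, n\}$ to some model in $\mathcal{H}$. Such an operation is a data deletion operation if, for all $D$ and $i$, random variables $A(D_{-i})$ and $R_A(D, A(D), i)$ are equal in distribution, $A(D_{-i}) =_d R_A(D, A(D), i)$, where $D_{-i}$ denotes $D$ with datapoint $i$ removed.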

Here we focus on exact data deletion: after deleting a training point from the model, the model should be as if this training point had never been seen in the first place. The above definition can naturally be relaxed to approximate data deletion by requiring a bound on the distance (or divergence) between distributions of $A(D_{-i})$ and $R_A(D, A(D), i)$, though we defer a discussion of this to future work.

A Computational Challenge

Every learning algorithm, $A$, supports a trivial data deletion operation corresponding to simply retraining on the new dataset after the specified datapoint has been removed, namely running algorithm $A$ on the dataset $D_{-i}$. Because of this, the challenge of data deletion is computational: 1) Can we design a learning algorithm $A$, and supporting data structures, so as to allow for a computationally efficient data deletion operation? 2) For what algorithms $A$ is there a data deletion operation that runs in time sublinear in the size of the dataset, or at least sublinear in the time it takes to compute the original model, $A(D)$?
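The trivial deletion operation described above can be sketched in a few lines. This is only an illustration under stated assumptions: `train_mean` is a hypothetical toy "learning algorithm" (a coordinate-wise mean), and the names `delete_index` and `naive_deletion` are ours, not the paper's.

```python
from typing import Callable, List, Sequence, TypeVar

Point = Sequence[float]
Model = TypeVar("Model")


def delete_index(data: List[Point], i: int) -> List[Point]:
    """Return a copy of the dataset without datapoint i."""
    return data[:i] + data[i + 1:]


def naive_deletion(train: Callable[[List[Point]], Model],
                   data: List[Point], model: Model, i: int) -> Model:
    """Trivial deletion: ignore the old model and retrain on the reduced dataset."""
    return train(delete_index(data, i))


def train_mean(data: List[Point]) -> List[float]:
    """Hypothetical toy learner: the coordinate-wise mean of the data."""
    n = len(data)
    return [sum(col) / n for col in zip(*data)]


data = [[0.0, 0.0], [2.0, 4.0], [4.0, 2.0]]
model = train_mean(data)
updated = naive_deletion(train_mean, data, model, i=1)
```

The cost of `naive_deletion` is the full training cost on every request, which is exactly what a deletion efficient algorithm tries to avoid.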

Data Deletion as an Online Problem

One convenient way of concretely formulating the computational challenge of data deletion is via the lens of online algorithms [17]. Given a dataset of $n$ datapoints, a specific training algorithm $A$, and its corresponding deletion operation $R_A$, one can consider a stream of $m \le n$ distinct indices, $i_1, i_2, \ldots, i_m \in \{1, \ldots, n\}$, corresponding to the sequence of datapoints to be deleted. The online task then is to design a data deletion operation that is given the indices $\{i_j\}$ one at a time, and must output $A(D_{-\{i_1, \ldots, i_j\}})$ (a model for the dataset with the first $j$ requested datapoints removed) upon being given index $i_j$. As in the extensive body of work on online algorithms, the goal is to minimize the amortized computation time. The amortized runtime in the proposed online deletion setting is a natural and meaningful way to measure deletion efficiency. A formal definition of our proposed online problem setting can be found in Appendix A.

In online data deletion, a simple lower bound on amortized runtime emerges. All (sequential) learning algorithms $A$ run in time $\Omega(n)$ under the natural assumption that $A$ must process each datapoint at least once. Furthermore, in the best case, $A$ comes with a constant time deletion operation (or a deletion oracle).

Remark 2.1.

In the online setting, for $n$ datapoints and $m$ deletion requests we establish an asymptotic lower bound of $\Omega(n/m)$ for the amortized computation time of any (sequential) learning algorithm.

We refer to an algorithm achieving this lower bound as deletion efficient. Developing tight upper and lower bounds is an open question for many basic learning paradigms including ridge regression, decision tree models, and settings where the learned model corresponds to the solution to a stochastic optimization problem. In this paper, we do a case study on k-means clustering, showing that we can achieve deletion efficiency without sacrificing statistical performance.
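The lower bound in Remark 2.1 can be read as a short accounting argument. The sketch below writes $n$ for the number of datapoints and $m \le n$ for the number of deletion requests, and assumes (as in the text) that training must touch every datapoint at least once and that each deletion costs at least constant time.

```latex
\begin{align*}
\text{total time}
  &\;\ge\; \underbrace{\Omega(n)}_{\text{train once}}
   \;+\; \underbrace{m \cdot \Omega(1)}_{m \text{ deletions}}, \\
\text{amortized time}
  &\;=\; \frac{\text{total time}}{m}
   \;\ge\; \Omega\!\left(\frac{n}{m}\right) + \Omega(1)
   \;=\; \Omega\!\left(\frac{n}{m}\right)
   \qquad \text{for } m \le n .
\end{align*}
```

An algorithm is deletion efficient in this sense when its training cost, spread over the $m$ requests, dominates its total deletion cost.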

Data Deletion and Data Privacy

We reiterate that deletion is not privacy. Algorithms that support efficient deletion do not have to be private, and algorithms that are private do not have to support efficient deletion. To see the difference between privacy and our notion of data deletion, note that every map, $A$, supports the naive data deletion operation of retraining from scratch. The model is not required to satisfy any privacy guarantees. Even an operation that outputs the entire dataset in the clear could support data deletion, whereas such an operation is certainly not private. In this sense, the challenge of data deletion only arises in the presence of computational limitations. Privacy, on the other hand, presents statistical challenges, even in the absence of any computational limitations. With that being said, data deletion has direct connections and consequences in data privacy and security, which we explore in more detail in Appendix A.

3 Deletion Efficient Clustering

Data deletion is a general challenge for machine learning. Due to its simplicity and ubiquity, we focus on k-means clustering as a case study. In Section 5, we discuss general principles and insights that can be applied to support efficient deletion in other learning paradigms. We propose two algorithms for deletion efficient k-means clustering. In the context of k-means, we treat the output centroids as the model from which we are interested in deleting datapoints. We summarize our proposed algorithms and state theoretical runtime complexity and statistical performance guarantees. Please refer to [28] for background concerning k-means clustering.

3.1 Quantized k-Means

We propose a quantized variant of Lloyd’s algorithm as a deletion efficient solution to k-means clustering, called Q-k-means. By quantizing the centroids at each iteration, we show that the algorithm’s centroids are constant with respect to deletions with high probability. Under this notion of quantized stability, we can support efficient deletion, since most deletions can be resolved without re-computing the centroids from scratch. Our proposed algorithm is distinct from other quantized versions of k-means [63], which quantize the data to minimize memory or communication costs. We present an abridged version of the algorithm here (Algorithm 1). Detailed pseudo-code for Q-k-means and its deletion operation may be found in Appendix B.

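  Input: data matrix $D \in \mathbb{R}^{n \times d}$
  Parameters: $k \in \mathbb{N}$, $T \in \mathbb{N}$, $\gamma \in (0, 1)$, $\epsilon > 0$
  $c \leftarrow$ k-means++$(D)$ // initialize centroids with k-means++
  Save initial centroids: save($c$).
  $\mathcal{L} \leftarrow$ k-means loss of initial partition $\pi(c)$
  for $\tau = 1$ to $T$ do
      Store current centroids: $c' \leftarrow c$
      Compute centroids: $c \leftarrow c(\pi)$
      Apply correction to $\gamma$-imbalanced partitions
      Quantize to random $\epsilon$-lattice: $\hat{c} \leftarrow Q(c; \theta)$
      Update partition: $\pi' \leftarrow \pi(\hat{c})$
      Save state to metadata: save($c, \theta, \hat{c}, |\pi'|$)
      Compute loss $\mathcal{L}'$
      if $\mathcal{L}' < \mathcal{L}$ then $(c, \pi, \mathcal{L}) \leftarrow (\hat{c}, \pi', \mathcal{L}')$ else break
  end for
  return $c$ // output final centroids as model
Algorithm 1 Quantized k-means (abridged)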

Q-k-means follows the same iterative protocol as the canonical Lloyd’s algorithm (and makes use of the k-means++ initialization). There are four key differences from Lloyd’s algorithm. First and foremost, the centroids are quantized in each iteration before updating the partition. The quantization maps each point to the nearest vertex of a uniform $\epsilon$-lattice [33]. To de-bias the quantization, we apply a random phase shift to the lattice. The particulars of the quantization scheme are discussed in Appendix B. Second, at various steps throughout the computation, we memoize the optimization state into the model’s metadata for use at deletion time (incurring an additional memory cost). Third, we introduce a balance correction step, which compensates for $\gamma$-imbalanced clusters by averaging current centroids with a momentum term based on the previous centroids. Explicitly, for some $\gamma \in (0, 1)$, we consider any partition $\pi_\kappa$ to be $\gamma$-imbalanced if $|\pi_\kappa| \le \gamma n / k$. We may think of $\gamma$ as being the ratio of the smallest cluster size to the average cluster size. Fourth, because of the quantization, the iterations are no longer guaranteed to decrease the loss, so we have an early termination if the loss increases at any iteration.

Deletion in Q-k-means is straightforward. Using the metadata saved from training time, we can verify if deleting a specific datapoint would have resulted in a different quantized centroid than was actually computed during training. If this is the case (or if the point to be deleted is one of the randomly chosen initial centroids according to k-means++) we must retrain from scratch to satisfy the deletion request. Otherwise, we may satisfy deletion by updating our metadata to reflect the deletion of the specified datapoint, but we do not have to recompute the centroids.
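The stability check at the heart of this deletion operation can be illustrated for a single centroid at a single iteration. The sketch below is a simplification under stated assumptions: the names `lattice_index`, `deletion_is_stable`, `eps`, and `phase` are hypothetical, the means are recomputed from the raw points rather than from memoized metadata, and partition changes across iterations are ignored.

```python
from typing import List, Sequence

Point = Sequence[float]


def lattice_index(c: Point, eps: float, phase: Sequence[float]) -> List[int]:
    """Integer coordinates of the nearest vertex of an eps-lattice shifted by `phase`."""
    return [round((x - p) / eps) for x, p in zip(c, phase)]


def mean(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean, i.e. the (unquantized) centroid of a cluster."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def deletion_is_stable(cluster: List[Point], j: int,
                       eps: float, phase: Sequence[float]) -> bool:
    """True if removing cluster[j] leaves the quantized centroid unchanged."""
    before = lattice_index(mean(cluster), eps, phase)
    after = lattice_index(mean(cluster[:j] + cluster[j + 1:]), eps, phase)
    return before == after


# A large cluster: removing one point barely moves the mean.
cluster = [[i / 1000.0] for i in range(1000)]
stable = deletion_is_stable(cluster, 0, eps=0.1, phase=[0.03])

# A tiny cluster: removing one point moves the mean across a lattice cell.
unstable = deletion_is_stable([[0.0], [1.0]], 1, eps=0.25, phase=[0.0])
```

When the check passes, only bookkeeping (counts and sums) needs updating; when it fails, this sketch would fall back to retraining, as the text describes.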

Deletion Time Complexity

We turn our attention to an asymptotic time complexity analysis of the Q-k-means deletion operation. Q-k-means supports deletion by quantizing the centroids, so they are stable against small perturbations (caused by deletion of a point).

Theorem 3.1.

Let $D$ be a dataset on $[0, 1]^d$ of size $n$. Fix parameters $T$, $k$, $\epsilon$, and $\gamma$ for Q-k-means. Then, Q-k-means supports $m$ deletions in time $O(m^2 d^{5/2} / \epsilon)$ in expectation, with the expectation taken over the randomness in the quantization phase and k-means++ initialization.

The proof for the theorem is given in Appendix C. The intuition is as follows. Centroids are computed by taking an average. With enough terms in an average, the effect of a small number of those terms is negligible. The removal of those terms from the average can be interpreted as a small perturbation to the centroid. If that small perturbation is on a scale far below the granularity of the quantizing $\epsilon$-lattice, then it is unlikely to change the quantized value of the centroid. Thus, beyond stability verification, no additional computation is required for a majority of deletion requests. This result is in expectation with respect to the randomized initializations and randomized quantization phase, but is actually worst-case over all possible (normalized) dataset instances. The number of clusters $k$, iterations $T$, and cluster imbalance ratio $\gamma$ are usually small constants in many applications, and are treated as such here. Interestingly, for constant $m$ and $\epsilon$, the expected deletion time is independent of $n$ due to the stability probability increasing at the same rate as the problem size (see Appendix C). Deletion time for this method may not scale well in the high-dimensional setting. In the low-dimensional case, the most interesting interplay is between $\epsilon$, $n$, and $m$. To obtain as high-quality statistical performance as possible, it would be ideal if $\epsilon \to 0$ as $n \to \infty$. In this spirit, we can parameterize $\epsilon = \Theta(n^{-\beta})$ for $\beta \in (0, 1)$. We will use this parameterization for theoretical analysis of the online setting in Section 3.3.

Theoretical Statistical Performance

We proceed to state a theoretical guarantee on statistical performance of Q-k-means, which complements the asymptotic time complexity bound of the deletion operation. Recall that the loss for a k-means problem instance is given by the sum of squared Euclidean distance from each datapoint to its nearest centroid. Let $\mathcal{L}^*$ be the optimal loss for a particular problem instance. Achieving the optimal solution is, in general, NP-Hard [3]. Instead, we can approximate it with k-means++, which achieves $\mathbb{E}[\mathcal{L}^{++}] \le (8 \log k + 16)\,\mathcal{L}^*$ [5].

Corollary 3.1.1.

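Let $\mathcal{L}$ be a random variable denoting the loss of Q-k-means on a particular problem instance of size $n$. Then $\mathbb{E}[\mathcal{L}] \le (8 \log k + 16)\,\mathcal{L}^* + \epsilon \sqrt{n d\,(8 \log k + 16)\,\mathcal{L}^*} + \tfrac{1}{4} n d \epsilon^2$.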

This corollary follows from the theoretical guarantees already known to apply to Lloyd’s algorithm when initialized with k-means++, given by [5]. The proof can be found in Appendix C. We can interpret the bound by looking at the ratio of expected loss upper bounds for k-means++ and Q-k-means. If we assume our problem instance is generated by iid samples from some arbitrary non-atomic distribution, then it follows that $\mathcal{L}^* = \Theta(n)$. Taking the loss ratio of upper bounds yields $1 + O(d \epsilon^2 + \sqrt{d}\,\epsilon)$. Ensuring that $\epsilon \ll 1/\sqrt{d}$ implies the upper bound is as good as that of k-means++.

3.2 Divide-and-Conquer k-Means

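  Input: data matrix $D \in \mathbb{R}^{n \times d}$
  Parameters: $k \in \mathbb{N}$, $T \in \mathbb{N}$, tree width $w \in \mathbb{N}$, tree height $h \in \mathbb{N}$
  Initialize a $w$-ary tree of height $h$ such that each node has a pointer to a dataset and centroids
  for $i = 1$ to $n$ do
      Select a leaf node uniformly at random
      Add datapoint $D_i$ to the dataset of the selected leaf
  end for
  for $l = h$ down to $0$ do
      for each node in level $l$ do
          $c \leftarrow$ k-means++ on the node’s dataset (with $k$ centroids and $T$ iterations)
          Store $c$ as the node’s centroids
          if $l > 0$ then
              Add $c$ to the dataset of the node’s parent
          else
              save all nodes as metadata
              return $c$ // model output
          end if
      end for
  end for
Algorithm 2 DC-k-means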

We turn our attention to another variant of Lloyd’s algorithm that also supports efficient deletion, albeit through quite different means. We refer to this algorithm as Divide-and-Conquer k-means (DC-k-means). At a high level, DC-k-means works by partitioning the dataset into small sub-problems, solving each sub-problem as an independent k-means instance, and recursively merging the results. We present pseudo-code for DC-k-means here, and we refer the reader to Appendix B for pseudo-code of the deletion operation.

DC-k-means operates on a perfect $w$-ary tree of height $h$ (this could be relaxed to any rooted tree). The original dataset is partitioned into each leaf in the tree as a uniform multinomial random variable with datapoints as trials and leaves as outcomes. At each of these leaves, we solve for some number of centroids via k-means++. When we merge leaves into their parent node, we construct a new dataset consisting of all the centroids from each leaf. Then, we compute new centroids at the parent via another instance of k-means++. For simplicity, we keep $k$ fixed throughout all of the sub-problems in the tree, but this could be relaxed. We make use of the tree hierarchy to modularize the computation’s dependence on the data. At deletion time, we need only to recompute the sub-problems from one leaf up to the root. This observation allows us to support fast deletion operations.

Our method has close similarities to pre-existing distributed k-means algorithms [59, 58, 9, 7, 34, 8, 79], but is in fact distinct (not only in that it is modified for deletion, but also in that it operates over general rooted trees). For simplicity, we restrict our discussion to only the simplest of divide-and-conquer trees. We focus on depth-1 trees with $w$ leaves where each leaf solves for $k$ centroids. This requires only one merge step with a root problem size of $kw$.

Analogous to how $\epsilon$ serves as a knob to trade off between deletion efficiency and statistical performance in Q-k-means, for DC-k-means, we imagine that $w$ might also serve as a similar knob. For example, if $w = 1$, DC-k-means degenerates into canonical Lloyd’s (as does Q-k-means as $\epsilon \to 0$). The dependence of statistical performance on tree width $w$ is less theoretically tractable than that of Q-k-means on $\epsilon$, but in Appendix D, we empirically show that statistical performance tends to decrease as $w$ increases, which is perhaps somewhat expected.

As we show in our experiments, depth-1 DC-k-means demonstrates an empirically compelling trade-off between deletion time and statistical performance. There are various other potential extensions of this algorithm, such as weighting centroids based on cluster mass as they propagate up the tree or exploring the statistical performance of deeper trees.
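The depth-1 variant discussed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name `DepthOneDCKMeans` and its attributes are hypothetical, uniform random seeding stands in for k-means++, and no attempt is made to prove the distributional guarantee of Definition 2.1.

```python
import random
from typing import Dict, List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd(points: List[Point], k: int, iters: int, rng: random.Random) -> List[List[float]]:
    """Plain Lloyd's iterations; uniform seeding is a stand-in for k-means++."""
    if not points:
        return []
    centroids = [list(p) for p in rng.sample(points, min(k, len(points)))]
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the previous centroid if a cluster empties out
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids


class DepthOneDCKMeans:
    """Depth-1 divide-and-conquer tree: w leaves plus one root merge step."""

    def __init__(self, data: List[Point], k: int, w: int, iters: int = 10, seed: int = 0):
        self.k, self.iters, self.rng = k, iters, random.Random(seed)
        # Assign every datapoint to a leaf uniformly at random.
        self.leaf_of: Dict[int, int] = {i: self.rng.randrange(w) for i in range(len(data))}
        self.leaves: List[Dict[int, Point]] = [dict() for _ in range(w)]
        for i, p in enumerate(data):
            self.leaves[self.leaf_of[i]][i] = p
        self.leaf_centroids = [self._solve(list(leaf.values())) for leaf in self.leaves]
        self.centroids = self._merge()

    def _solve(self, points: List[Point]) -> List[List[float]]:
        return lloyd(points, self.k, self.iters, self.rng)

    def _merge(self) -> List[List[float]]:
        pooled = [c for cs in self.leaf_centroids for c in cs]
        return self._solve(pooled)

    def delete(self, i: int) -> None:
        """Recompute only the affected leaf, then redo the root merge."""
        leaf = self.leaf_of.pop(i)
        del self.leaves[leaf][i]
        self.leaf_centroids[leaf] = self._solve(list(self.leaves[leaf].values()))
        self.centroids = self._merge()


data = [[x / 10, 0.0] for x in range(20)] + [[5 + x / 10, 1.0] for x in range(20)]
model = DepthOneDCKMeans(data, k=2, w=4)
model.delete(0)
```

A deletion touches one leaf of roughly n/w points and a root problem of at most kw points, which is where the speedup over full retraining comes from.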

Deletion Time Complexity

For the ensuing asymptotic analysis, we may consider parameterizing tree width $w$ as $w = \Theta(n^{\rho})$ for $\rho \in (0, 1)$. As before, we treat $k$ and $T$ as small constants. Although intuitive, there are some technical minutiae to account for to prove correctness and runtime for the DC-k-means deletion operation. The proof of Proposition 3.2 may be found in Appendix C.

Proposition 3.2.

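Let $D$ be a dataset on $\mathbb{R}^d$ of size $n$. Fix parameters $T$ and $k$ for DC-k-means. Let $w = \Theta(n^{\rho})$ and $\rho \in (0, 1)$. Then, with a depth-1, $w$-ary divide-and-conquer tree, DC-k-means supports $m$ deletions in time $O(m \max\{n^{\rho}, n^{1-\rho}\}\, d)$ in expectation, with the expectation taken over the randomness in dataset partitioning.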

3.3 Amortized Runtime Complexity in Online Deletion Setting

We state the amortized computation time for both of our algorithms in the online deletion setting defined in Section 2. We are in an asymptotic regime where the number of deletions $m = \Theta(n^{\alpha})$ for $0 < \alpha < 1$ (see Appendix C for more details). Recall the $\Omega(n/m)$ lower bound from Remark 2.1. For a particular fractional power $\alpha$, an algorithm achieving the optimal asymptotic lower bound on amortized computation is said to be $\alpha$-deletion efficient. The following corollaries result from direct calculations which may be found in Appendix C. Note that Corollary 3.2.2 assumes DC-k-means is training sequentially.

Corollary 3.2.1.

With $\epsilon = \Theta(n^{-\beta})$, for $0 < \beta < 1$, the Q-k-means algorithm is $\alpha$-deletion efficient in expectation if $\alpha \le \frac{1 - \beta}{2}$.

Corollary 3.2.2.

With $w = \Theta(n^{\rho})$, for $0 < \rho < 1$, and a depth-1 $w$-ary divide-and-conquer tree, DC-k-means is $\alpha$-deletion efficient in expectation if $\alpha < 1 - \max\{1 - \rho, \rho\}$.

4 Experiments

With a theoretical understanding in hand, we seek to empirically characterize the trade-off between runtime and performance for the proposed algorithms. In this section, we provide proof-of-concept for our algorithms by benchmarking their amortized runtimes and clustering quality on a simulated stream of online deletion requests. As a baseline, we use the canonical Lloyd’s algorithm initialized by k-means++ seeding [45, 5]. Following the broader literature, we refer to this baseline simply as k-means, and refer to our two proposed methods as Q-k-means and DC-k-means.


We run our experiments on five real, publicly available datasets: Celltype [37], Covtype [12], MNIST [43], Postures [30, 29], and Botnet [49], and a synthetic dataset made from a Gaussian mixture model which we call Gaussian. We refer the reader to Appendix D for more details on the datasets. All datasets come with ground-truth labels as well. Although we do not make use of them at learning time, we can use them to evaluate the statistical quality of the clustering methods.

Online Deletion Benchmark

We simulate a stream of 1,000 deletion requests, selected uniformly at random and without replacement. An algorithm trains once, on the full dataset, and then runs its deletion operation to satisfy each request in the stream, producing an intermediate model at each request. For the canonical k-means baseline, deletions are satisfied by re-training from scratch.
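A benchmark of this shape can be sketched as a small harness. This is an illustration under stated assumptions: the function name `online_deletion_benchmark` and the toy `train_mean` and `retrain_delete` callables are hypothetical, and "amortized" is taken here to mean total wall-clock time (training plus all deletions) divided by the number of requests.

```python
import random
import time
from typing import Any, Callable, Dict, List, Tuple


def online_deletion_benchmark(train: Callable[[List[float]], Any],
                              delete: Callable[[Any, int], Any],
                              data: List[float],
                              num_deletions: int,
                              seed: int = 0) -> Tuple[float, Any]:
    """Train once, then satisfy a random deletion stream; return (amortized seconds, final state)."""
    rng = random.Random(seed)
    # Indices drawn uniformly at random, without replacement.
    stream = rng.sample(range(len(data)), num_deletions)
    start = time.perf_counter()
    state = train(data)
    for i in stream:
        state = delete(state, i)  # an intermediate model after every request
    total = time.perf_counter() - start
    return total / num_deletions, state


def train_mean(data: List[float]) -> Tuple[Dict[int, float], float]:
    """Toy learner: keep the live points and their mean."""
    alive = dict(enumerate(data))
    return alive, sum(alive.values()) / len(alive)


def retrain_delete(state: Tuple[Dict[int, float], float], i: int) -> Tuple[Dict[int, float], float]:
    """Baseline deletion: drop point i and recompute the model from scratch."""
    alive, _ = state
    del alive[i]
    return alive, sum(alive.values()) / len(alive)


points = [float(x) for x in range(1000)]
amortized, final_state = online_deletion_benchmark(train_mean, retrain_delete, points, 100)
```

Swapping `retrain_delete` for a cheaper deletion operation, while keeping the stream fixed via `seed`, gives a like-for-like runtime comparison.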


To measure statistical performance, we evaluate with three metrics (see Section 4.1) that measure cluster quality. To measure deletion efficiency, we measure the wall-clock time to complete our online deletion benchmark. For both of our proposed algorithms, we always fix 10 iterations of Lloyd’s, and all other parameters are selected with simple but effective heuristics (see Appendix D). This alleviates the need to tune them. To set a fair k-means baseline, when reporting runtime on the online deletion benchmark, we also fix 10 iterations of Lloyd’s, but when reporting statistical performance metrics, we run until convergence. We run five replicates for each method on each dataset and include standard deviations with all our results. We refer the reader to Appendix D for more experimental details.

4.1 Statistical Performance Metrics

To evaluate clustering performance of our algorithms, the most obvious metric is the optimization loss of the k-means objective. Recall that this is the sum of squared Euclidean distances from each datapoint to its nearest centroid. To thoroughly validate the statistical performance of our proposed algorithms, we additionally include two canonical clustering performance metrics.

Silhouette Coefficient [62]: This coefficient measures a type of correlation (between -1 and +1) that captures how dense each cluster is and how well-separated different clusters are. The silhouette coefficient is computed without ground-truth labels, and uses only spatial information. Higher scores indicate denser, more well-separated clusters.

Normalized Mutual Information (NMI) [75, 42]: This quantity measures the agreement of the assigned clusters to the ground-truth labels, up to permutation. NMI is upper bounded by 1, achieved by perfect assignments. Higher scores indicate better agreement between clusters and ground-truth labels.
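To make the NMI metric concrete, the sketch below computes it from two label lists using only the standard library. The normalization by the geometric mean of the two entropies is an assumption on our part (other normalizations exist, and the text does not say which one is used); the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Mutual information (in nats) between two labelings of the same points."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """NMI with geometric-mean normalization (an assumed convention)."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0.0 or hb == 0.0:
        # Degenerate case: at least one labeling has a single cluster.
        return 1.0 if ha == hb else 0.0
    return mutual_information(a, b) / math.sqrt(ha * hb)


truth = [0, 0, 1, 1]
permuted = nmi(truth, [1, 1, 0, 0])     # same clustering, labels swapped
independent = nmi(truth, [0, 1, 0, 1])  # clustering unrelated to the truth
```

The first call illustrates the "up to permutation" property: relabeling the clusters does not change the score.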

4.2 Summary of Results

We summarize our key findings in four tables. In Tables 1-3, we report the statistical clustering performance of the 3 algorithms on each of the 6 datasets. In Table 1, we report the optimization loss ratios of our proposed methods over the k-means++ baseline. In Table 2, we report the silhouette coefficient for the clusters. In Table 3, we report the NMI. In Table 4, we report the amortized total runtime of training and deletion for each method. Overall, we see that the statistical clustering performance of the three methods is competitive. Furthermore, we find that both proposed algorithms yield orders of magnitude of speedup. As expected from the theoretical analysis, Q-k-means offers greater speedups when the dimension is lower relative to the sample size, whereas DC-k-means is more consistent across dimensionalities.

Table 1: Loss Ratio
Dataset | k-means | Q-k-means | DC-k-means

Table 2: Silhouette Coefficients (higher is better)
Dataset | k-means | Q-k-means | DC-k-means

Table 3: Normalized Mutual Information (higher is better)
Dataset | k-means | Q-k-means | DC-k-means

Table 4: Amortized Runtime in Online Deletion Benchmark (Train once + 1,000 Deletions)
Dataset | k-means Runtime (s) | Q-k-means Runtime (s) | Q-k-means Speedup | DC-k-means Runtime (s) | DC-k-means Speedup

In particular, note that MNIST has the highest $d/n$ ratio of the datasets we tried, followed by Covtype. These two datasets are, respectively, the datasets for which Q-k-means offers the least speedup. On the other hand, DC-k-means offers consistently increasing speedup as $n$ increases, irrespective of $d$. Furthermore, we see that Q-k-means tends to have higher variance around its deletion efficiency, due to the randomness in centroid stabilization having a larger impact than the randomness in the dataset partitioning. We remark that 1,000 deletions is less than 10% of every dataset we test on, and statistical performance remains virtually unchanged throughout the benchmark. In Figure 1, we plot the amortized runtime on the online deletion benchmark as a function of number of deletions in the stream. We refer the reader to Appendix D for supplementary experiments providing more detail on our methods.

Figure 1: Deletion efficiency in the online deletion setting: number of deletion requests vs. amortized runtime (seconds) for 3 algorithms on 6 datasets.

5 Discussion

We formulate the computational problem of efficient data deletion in machine learning. While our analysis focuses on the concrete problem of clustering, we identify four design principles which we envision as the pillars of deletion efficient learning algorithms. We outline them here, and discuss them in greater detail in Appendix E, including their potential applications to supervised learning algorithms such as random forests, SVMs, and neural networks.

Linearity: Use of linear computation allows for simple post-processing to undo the influence of a single datapoint on a set of parameters. For example, in Q-k-means, we are able to leverage the linearity in centroid computation to recycle computation. Generally speaking, the Sherman-Morrison-Woodbury matrix identity and matrix factorization techniques can be used to derive fast and explicit formulas for updating linear computations [47, 24, 71, 38]. For example, in the case of linear least squares regressions, QR factorization can be used to delete datapoints from learned weights in time $O(d^2)$ [36]. Linearity should be most effective in domains in which randomized [60], reservoir [77, 65], domain-specific [46], or pre-trained feature spaces elucidate linear relationships in the data.
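As a small illustration of the linearity principle, the sketch below fits a one-dimensional least squares line from running sums, so that deleting a datapoint only requires subtracting its contribution. This is a simplified stand-in for the matrix-identity and QR-based updates mentioned above, not an implementation of them; the class name `SimpleLeastSquares` is hypothetical.

```python
from typing import Tuple


class SimpleLeastSquares:
    """1-D least squares (slope and intercept) kept as linear sufficient statistics."""

    def __init__(self) -> None:
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def add(self, x: float, y: float) -> None:
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def delete(self, x: float, y: float) -> None:
        """Undo the influence of one datapoint in constant time."""
        self.n -= 1
        self.sx -= x
        self.sy -= y
        self.sxx -= x * x
        self.sxy -= x * y

    def coefficients(self) -> Tuple[float, float]:
        """Return (slope, intercept) of the least squares line."""
        slope = (self.n * self.sxy - self.sx * self.sy) / (self.n * self.sxx - self.sx ** 2)
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


model = SimpleLeastSquares()
for x, y in [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 100.0)]:
    model.add(x, y)
model.delete(4.0, 100.0)  # remove the outlier without refitting from scratch
fit = model.coefficients()
```

Because every statistic is a sum over datapoints, the model after `delete` is identical to one fitted on the remaining points alone.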

Laziness: Lazy learning methods delay computation until inference time [76, 11, 6], resulting in trivial deletions. One of the simplest examples of lazy learning is k-nearest neighbors [28, 4], where deleting a point from the dataset at deletion time directly translates to an updated model at inference time. There is a natural affinity between lazy learning and non-parametric techniques [53, 15]. Although we did not make use of laziness in this work, pre-existing literature on kernel density estimation for clustering would be a natural starting place [39]. Laziness should be most effective in regimes when there are fewer constraints on inference time and model memory than training time or deletion time.
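The laziness principle can be illustrated with a toy nearest-neighbor classifier whose "model" is just the stored data, so deletion is a dictionary removal. This is a minimal sketch; the class name `LazyKNN` and its methods are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


class LazyKNN:
    """Nearest-neighbor classifier: all computation is deferred to predict()."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.points: Dict[Hashable, Tuple[Point, str]] = {}

    def add(self, key: Hashable, x: Point, label: str) -> None:
        self.points[key] = (x, label)

    def delete(self, key: Hashable) -> None:
        """Deletion is trivial: forget the point; there is nothing to retrain."""
        del self.points[key]

    def predict(self, query: Point) -> str:
        nearest = sorted(self.points.values(), key=lambda p: sq_dist(p[0], query))[: self.k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]


knn = LazyKNN(k=1)
knn.add("a", [0.0], "red")
knn.add("b", [1.0], "blue")
before = knn.predict([0.1])
knn.delete("a")
after = knn.predict([0.1])
```

The trade-off is visible in `predict`, which scans all stored points: deletion is cheap precisely because the work has been moved to inference time.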

Modularity: In the context of forgetful learning, modularity is the restriction of dependence of computation state or model parameters to specific partitions of the dataset. Under such a modularization, we can isolate specific modules of data processing that need to be recomputed in order to account for deletions to the dataset. Our notion of modularity is conceptually similar to its use in software design [10] and distributed computing [58]. In DC-k-means, we leverage modularity by managing the dependence between computation and data via the divide-and-conquer tree. Modularity should be most effective in regimes for which the dimension of the data is small compared to the dataset size, allowing for partitions of the dataset to capture the important structure and features.

Quantization: Many models come with a sense of continuity from dataset space to model space: small changes to the dataset should result in small changes to the (distribution over the) model space. We can leverage this by quantizing the mapping from datasets to models (either explicitly or implicitly). Then, for a small number of deletions, such a quantized model is unlikely to change. In Q-$k$-means, we explicitly quantize the centroids, which implies a quantization over the datasets. Quantization is most effective in regimes for which the number of parameters is small compared to the dataset size.

References


  • [1] M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 308–318. Cited by: §1.
  • [2] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H. Lin (2012) Learning from data. Vol. 4, AMLBook, New York, NY, USA. Cited by: §1.
  • [3] D. Aloise, A. Deshpande, P. Hansen, and P. Popat (2009) NP-hardness of euclidean sum-of-squares clustering. Machine learning 75 (2), pp. 245–248. Cited by: §3.1.
  • [4] N. S. Altman (1992) An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician 46 (3), pp. 175–185. Cited by: §5.
  • [5] D. Arthur and S. Vassilvitskii (2007) K-means++: the advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035. Cited by: §C.2.1, §1, §3.1, §3.1, §4.
  • [6] C. G. Atkeson, A. W. Moore, and S. Schaal (1997) Locally weighted learning for control. In Lazy learning, pp. 75–113. Cited by: §1, §5.
  • [7] O. Bachem, M. Lucic, and A. Krause (2017) Distributed and provably good seedings for k-means in constant rounds. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 292–300. Cited by: §3.2.
  • [8] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii (2012) Scalable k-means++. Proceedings of the VLDB Endowment 5 (7), pp. 622–633. Cited by: §3.2.
  • [9] M. F. Balcan, S. Ehrlich, and Y. Liang (2013) Distributed k-means and k-median clustering on general topologies. In Advances in Neural Information Processing Systems, pp. 1995–2003. Cited by: §3.2.
  • [10] O. Berman and N. Ashrafi (1993) Optimization models for reliability of modular software systems. IEEE Transactions on Software Engineering 19 (11), pp. 1119–1123. Cited by: §5.
  • [11] M. Birattari, G. Bontempi, and H. Bersini (1999) Lazy learning meets the recursive least squares algorithm. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, Cambridge, MA, USA, pp. 375–381. External Links: ISBN 0-262-11245-0, Link Cited by: §1, §5.
  • [12] J. A. Blackard and D. J. Dean (1999) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture 24 (3), pp. 131–151. Cited by: 3rd item, §4.
  • [13] D. Bogdanov, L. Kamm, S. Laur, and V. Sokk (2018) Implementation and evaluation of an algorithm for cryptographically private principal component analysis on genomic data. IEEE/ACM transactions on computational biology and bioinformatics 15 (5), pp. 1427–1432. Cited by: §1.
  • [14] K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ramage, A. Segal, and K. Seth (2017) Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175–1191. Cited by: §1.
  • [15] G. Bontempi, H. Bersini, and M. Birattari (2001) The local paradigm for modeling and control: from neuro-fuzzy to lazy learning. Fuzzy sets and systems 121 (1), pp. 59–72. Cited by: §5.
  • [16] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser (2015) Machine learning classification over encrypted data.. In NDSS, Cited by: §1.
  • [17] L. Bottou (1998) Online learning and stochastic approximations. On-line learning in neural networks 17 (9), pp. 142. Cited by: §2.
  • [18] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate (2011) Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (Mar), pp. 1069–1109. Cited by: §1.
  • [19] K. Chaudhuri, A. D. Sarwate, and K. Sinha (2013) A near-optimal algorithm for differentially-private principal components. The Journal of Machine Learning Research 14 (1), pp. 2905–2943. Cited by: §1.
  • [20] Council of European Union (2014) Council regulation (EU) no 2012/0011. Cited by: §1.
  • [21] Council of European Union (2014) Council regulation (EU) no 2016/678. Cited by: §1.
  • [22] M. Courbariaux, Y. Bengio, and J. David (2014) Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024. Cited by: §E.2.5.
  • [23] T. M. Cover and J. A. Thomas (2012) Elements of information theory. John Wiley & Sons. Cited by: §C.2.1.
  • [24] D. A. Belsley, E. Kuh, and R. E. Welsch (1980) Regression diagnostics: identifying influential data and sources of collinearity. John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 9780471058564 Cited by: §1, §5.
  • [25] S. Dasgupta and A. Gupta (2003) An elementary proof of a theorem of johnson and lindenstrauss. Random Structures & Algorithms 22 (1), pp. 60–65. Cited by: §E.3.
  • [26] C. Dwork, A. Roth, et al. (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3–4), pp. 211–407. Cited by: §A.2.1, §A.2.1, §1.
  • [27] Z. Erkin, T. Veugen, T. Toft, and R. L. Lagendijk (2012) Generating private recommendations efficiently using homomorphic encryption and data packing. IEEE transactions on information forensics and security 7 (3), pp. 1053–1066. Cited by: §1.
  • [28] J. Friedman, T. Hastie, and R. Tibshirani (2001) The elements of statistical learning. Vol. 1, Springer series in statistics New York. Cited by: §3, §5.
  • [29] A. Gardner, C. A. Duncan, J. Kanno, and R. Selmic (2014) 3d hand posture recognition from small unlabeled point sets. In 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 164–169. Cited by: 2nd item, §4.
  • [30] A. Gardner, J. Kanno, C. A. Duncan, and R. Selmic (2014) Measuring distance between unordered sets of different sizes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 137–143. Cited by: 2nd item, §4.
  • [31] P. Geurts, D. Ernst, and L. Wehenkel (2006) Extremely randomized trees. Machine learning 63 (1), pp. 3–42. Cited by: §E.2.3.
  • [32] A. Ghorbani and J. Zou (2019) Data shapley: equitable valuation of data for machine learning. arXiv preprint arXiv:1904.02868. Cited by: §1.
  • [33] R. M. Gray and D. L. Neuhoff (1998) Quantization. IEEE transactions on information theory 44 (6), pp. 2325–2383. Cited by: §3.1.
  • [34] S. Guha, R. Rastogi, and K. Shim (1998) CURE: an efficient clustering algorithm for large databases. In ACM Sigmod Record, Vol. 27, pp. 73–84. Cited by: §3.2.
  • [35] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi (2018) Ristretto: a framework for empirical study of resource-efficient inference in convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §E.2.5.
  • [36] S. Hammarling and C. Lucas (2008) Updating the qr factorization and the least squares problem. Cited by: §5.
  • [37] X. Han, R. Wang, Y. Zhou, L. Fei, H. Sun, S. Lai, A. Saadatpour, Z. Zhou, H. Chen, F. Ye, et al. (2018) Mapping the mouse cell atlas by microwell-seq. Cell 172 (5), pp. 1091–1107. Cited by: 1st item, §4.
  • [38] N. J. Higham (2002) Accuracy and stability of numerical algorithms. Vol. 80, Siam. Cited by: §5.
  • [39] A. Hinneburg and H. Gabriel (2007) Denclue 2.0: fast clustering based on kernel density estimation. In International symposium on intelligent data analysis, pp. 70–80. Cited by: §5.
  • [40] W. B. Johnson and J. Lindenstrauss (1984) Extensions of lipschitz mappings into a hilbert space. Contemporary mathematics 26 (189-206), pp. 1. Cited by: §E.3.
  • [41] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a tensor processing unit. In Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pp. 1–12. Cited by: §E.2.5.
  • [42] Z. F. Knops, J. A. Maintz, M. A. Viergever, and J. P. Pluim (2006) Normalized mutual information based registration using k-means clustering and shading correction. Medical image analysis 10 (3), pp. 432–439. Cited by: §4.1.
  • [43] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: 5th item, §4.
  • [44] D. Lin, S. Talathi, and S. Annapureddy (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858. Cited by: §E.2.5.
  • [45] S. Lloyd (1982) Least squares quantization in pcm. IEEE transactions on information theory 28 (2), pp. 129–137. Cited by: §1, §4.
  • [46] D. G. Lowe et al. (1999) Object recognition from local scale-invariant features.. In iccv, Vol. 99, pp. 1150–1157. Cited by: §5.
  • [47] J. H. Maindonald (1984) Statistical computation. John Wiley & Sons, Inc., New York, NY, USA. External Links: ISBN 0471864528 Cited by: §1, §5.
  • [48] A. May, A. B. Garakani, Z. Lu, D. Guo, K. Liu, A. Bellet, L. Fan, M. Collins, D. Hsu, B. Kingsbury, et al. (2017) Kernel approximation methods for speech recognition. arXiv preprint arXiv:1701.03577. Cited by: §E.2.2.
  • [49] Y. Meidan, M. Bohadana, Y. Mathov, Y. Mirsky, A. Shabtai, D. Breitenbacher, and Y. Elovici (2018) N-BaIoT: network-based detection of IoT botnet attacks using deep autoencoders. IEEE Pervasive Computing 17 (3), pp. 12–22. Cited by: 4th item, §4.
  • [50] V. M. Muggeo et al. (2008) Segmented: an r package to fit regression models with broken-line relationships. R news 8 (1), pp. 20–25. Cited by: §E.2.1.
  • [51] V. M. Muggeo (2003) Estimating regression models with unknown break-points. Statistics in medicine 22 (19), pp. 3055–3071. Cited by: §E.2.1.
  • [52] V. M. Muggeo (2016) Testing with a nuisance parameter present only under the alternative: a score-based approach with application to segmented modelling. Journal of Statistical Computation and Simulation 86 (15), pp. 3059–3067. Cited by: §E.2.1.
  • [53] E. A. Nadaraya (1964) On estimating regression. Theory of Probability & Its Applications 9 (1), pp. 141–142. Cited by: §1, §5.
  • [54] V. Nikolaenko, U. Weinsberg, S. Ioannidis, M. Joye, D. Boneh, and N. Taft (2013) Privacy-preserving ridge regression on hundreds of millions of records. In Security and Privacy (SP), 2013 IEEE Symposium on, pp. 334–348. Cited by: §1.
  • [55] O. Ohrimenko, F. Schuster, C. Fournet, A. Mehta, S. Nowozin, K. Vaswani, and M. Costa (2016) Oblivious multi-party machine learning on trusted processors.. In USENIX Security Symposium, pp. 619–636. Cited by: §1.
  • [56] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar (2016) Semi-supervised knowledge transfer for deep learning from private training data. arXiv preprint arXiv:1610.05755. Cited by: §1.
  • [57] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: §D.1.3.
  • [58] D. Peleg (2000) Distributed computing. SIAM Monographs on discrete mathematics and applications 5, pp. 1–1. Cited by: §3.2, §5.
  • [59] J. Qin, W. Fu, H. Gao, and W. X. Zheng (2016) Distributed k-means algorithm and fuzzy c-means algorithm for sensor networks based on multiagent consensus theory. IEEE transactions on cybernetics 47 (3), pp. 772–783. Cited by: §3.2.
  • [60] A. Rahimi and B. Recht (2008) Random features for large-scale kernel machines. In Advances in neural information processing systems, pp. 1177–1184. Cited by: §E.2.2, §5.
  • [61] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §E.2.5.
  • [62] P. J. Rousseeuw (1987) Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, pp. 53–65. Cited by: §4.1.
  • [63] V. Schellekens and L. Jacques (2018) Quantized compressive k-means. IEEE Signal Processing Letters 25 (8), pp. 1211–1215. Cited by: §3.1.
  • [64] F. Schomm, F. Stahl, and G. Vossen (2013) Marketplaces for data: an initial survey. ACM SIGMOD Record 42 (1), pp. 15–26. Cited by: §1.
  • [65] B. Schrauwen, D. Verstraeten, and J. Van Campenhout (2007) An overview of reservoir computing: theory, applications and implementations. In Proceedings of the 15th European Symposium on Artificial Neural Networks, pp. 471–482. Cited by: §5.
  • [66] C. E. Shannon (1949) Communication theory of secrecy systems. Bell system technical journal 28 (4), pp. 656–715. Cited by: §A.1.
  • [67] C. Sudlow, J. Gallacher, N. Allen, V. Beral, P. Burton, J. Danesh, P. Downey, P. Elliott, J. Green, M. Landray, et al. (2015) UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS medicine 12 (3), pp. e1001779. Cited by: §1.
  • [68] J. A. Suykens and J. Vandewalle (1999) Least squares support vector machine classifiers. Neural processing letters 9 (3), pp. 293–300. Cited by: §E.2.4.
  • [69] H. Truong, M. Comerio, F. De Paoli, G. Gangadharan, and S. Dustdar (2012) Data contracts for cloud-based data marketplaces. International Journal of Computational Science and Engineering 7 (4), pp. 280–295. Cited by: §1.
  • [70] S. Van Der Walt, S. C. Colbert, and G. Varoquaux (2011) The numpy array: a structure for efficient numerical computation. Computing in Science & Engineering 13 (2), pp. 22. Cited by: §D.1.1.
  • [71] C. F. Van Loan and G. H. Golub (1983) Matrix computations. Johns Hopkins University Press. Cited by: §1, §5.
  • [72] V. Vanhoucke, A. Senior, and M. Z. Mao Improving the speed of neural networks on cpus. Cited by: §E.2.5.
  • [73] M. Veale, R. Binns, and L. Edwards (2018) Algorithms that remember: model inversion attacks and data protection law. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 376 (2133), pp. 20180083. Cited by: §1.
  • [74] E. F. Villaronga, P. Kieseberg, and T. Li (2018) Humans forget, machines remember: artificial intelligence and the right to be forgotten. Computer Law & Security Review 34 (2), pp. 304–313. Cited by: §1.
  • [75] N. X. Vinh, J. Epps, and J. Bailey (2010) Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. Journal of Machine Learning Research 11 (Oct), pp. 2837–2854. Cited by: §4.1.
  • [76] G. I. Webb (2010) Lazy learning. In Encyclopedia of Machine Learning, C. Sammut and G. I. Webb (Eds.), pp. 571–572. External Links: ISBN 978-0-387-30164-8, Document, Link Cited by: §1, §5.
  • [77] J. Yin and Y. Meng (2012) Self-organizing reservoir computing with dynamically regulated cortical neural networks. In The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. Cited by: §5.
  • [78] J. Zhang, A. May, T. Dao, and C. Ré (2018) Low-precision random fourier features for memory-constrained kernel approximation. arXiv preprint arXiv:1811.00155. Cited by: §E.2.2.
  • [79] W. Zhao, H. Ma, and Q. He (2009) Parallel k-means clustering based on mapreduce. In IEEE International Conference on Cloud Computing, pp. 674–679. Cited by: §3.2.

Appendix A Supplementary Materials

Here we provide material supplementary to the main text. While some of the material provided here may be somewhat redundant, it also contains technical minutiae perhaps too detailed for the main body.

A.1 Online Data Deletion

We precisely define the notion of a learning algorithm for theoretical discussion in the context of data deletion.

Definition A.1.

Learning Algorithm

A learning algorithm $A$ is an algorithm (on some standard model of computation) taking values in some hypothesis space and metadata space $\mathcal{H} \times \mathcal{M}$ based on an input dataset $D$. Learning algorithm $A$ may be randomized, implying a conditional distribution over $\mathcal{H} \times \mathcal{M}$ given $D$. Finally, learning algorithms must process each datapoint in $D$ at least once, and are constrained to sequential computation only, yielding a runtime bounded by $\Omega(n)$.

We re-state the definition of data deletion. We distinguish between a deletion operation and a robust deletion operation. We focus on the former throughout our main body, as it is appropriate for average-case analysis in a non-security context. We use $=_d$ to denote distributional equality.

Definition A.2.

Data Deletion Operation

Fix any dataset $D$ and learning algorithm $A$. Operation $R_A$ is a deletion operation for $A$ if $R_A(D, A(D), i) =_d A(D_{-i})$ for any $i$ selected independently of $A(D)$.

For notational simplicity, we may let $R_A$ refer to an entire sequence of deletions ($\Delta = \{i_1, i_2, \ldots, i_m\}$) by writing $R_A(D, A(D), \Delta)$. This notation means the output of a sequence of applications of $R_A$ to each $i$ in deletion sequence $\Delta$. We also may drop the dependence on $A$ when it is understood to which $A$ the deletion operation $R$ corresponds. We also drop the arguments for $A$ and $R$ when they are understood from context. For example, when dataset $D$ can be inferred from context, we let $A_{-i}$ directly mean $A(D_{-i})$, and when $D$ and deletion stream $\Delta$ can be inferred, we let $R$ directly mean $R(D, A(D), \Delta)$.

Our definition is somewhat analogous to information-theoretic (or perfect) secrecy in cryptography [66]. Much like in cryptography, it is possible to relax to weaker notions, for example by statistically approximating deletion and bounding the amount of computation some hypothetical adversary could use to determine if a genuine deletion took place. Such relaxations are required for encryption algorithms because perfect secrecy can only be achieved via a one-time pad [66]. In the context of deletion operations, retraining from scratch is, at least loosely, analogous to one-time pad encryption: both are simple solutions that satisfy distributional equality requirements, but both are impractical. However, unlike in encryption, when it comes to deletion we can, at least for some learning algorithms, find deletion operations that are both practical and perfect.

The upcoming robust definition may be of more interest in a worst-case, security setting. In such a setting, an adaptive adversary makes deletion requests while also having perfect eavesdropping capabilities to the server (or at least the internal state of the learning algorithm, model and metadata).

Definition A.3.

Robust Data Deletion Operation

Fix any dataset $D$ and learning algorithm $A$. Operation $R_A$ is a robust deletion operation if $R_A(D, A(D), i) = A(D_{-i})$ in distribution, for any $i$, perhaps selected by an adversarial agent with knowledge of $A(D)$.

To illustrate the difference between these two definitions, consider Q-$k$-means and DC-$k$-means. Assume an adversary has compromised the server with read-access and gained knowledge of the algorithm’s internal state. Further assume that said adversary may issue deletion requests. Such a powerful adversary could compromise the exactness of DC-$k$-means deletion operations by deleting datapoints from specific leaves. For example, if the adversary always deletes datapoints partitioned to the first leaf, then the number of datapoints assigned to each leaf is no longer uniform or independent of deletion requests. Strictly speaking, this violates equality in distribution. Note that this can only occur if requests are somehow dependent on the partition. However, although an adversary can compromise the correctness of the deletion operation, it cannot compromise the efficiency, since that depends on the maximum number of datapoints partitioned to a particular leaf, which was decided randomly without input from the adversary.

In the case of Q-$k$-means, we can easily see that the deletion is robust to the adversary, by the enforced equality of outcome imposed by the deletion operation. However, an adversary with knowledge of the algorithm state could make the Q-$k$-means deletion operation entirely inefficient by always deleting an initial centroid. This causes every single deletion to be satisfied by retraining from scratch. From the security perspective, it could be of interest to study deletion operations that are both robust and efficient.

We continue by defining the online data deletion setting in the average-case.

Definition A.4.

Online Data Deletion (Average-Case)

We may formally define the runtime in the online deletion setting as the expected runtime of Algorithm 3. We amortize the total runtime by $m$.

  Input: Dataset , learning algorithm , deletion operation
  for  to  do
       //constant time
       //constant time
  end for
Algorithm 3 Online Data Deletion
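
To make the online setting concrete, here is a minimal sketch of the loop for a toy learner whose model is the mean of the data. The names `train`, `delete`, `online_deletion`, and `alpha` are assumptions introduced for this illustration, and the toy learner stands in for whichever learning algorithm and deletion operation are being studied.

```python
import random


def train(dataset):
    """Toy learner: the model is the mean; the running sum is kept as metadata."""
    total = sum(dataset.values())
    return {"sum": total, "mean": total / len(dataset)}


def delete(dataset, state, index):
    """Exact deletion for the toy learner in O(1) time."""
    total = state["sum"] - dataset.pop(index)
    return {"sum": total, "mean": total / len(dataset)}


def online_deletion(dataset, alpha, rng):
    """Train once, then serve m = floor(n ** alpha) uniformly random deletions."""
    m = int(len(dataset) ** alpha)
    state = train(dataset)                      # processes every point once
    requests = rng.sample(sorted(dataset), m)   # uniform, without replacement
    for index in requests:
        state = delete(dataset, state, index)   # model stays valid after each request
    return state, m


data = {i: float(i) for i in range(100)}
state, m = online_deletion(data, alpha=0.5, rng=random.Random(0))
print(m, state["mean"])
```

For this toy learner the total cost is one pass over the data plus constant work per request, so the amortized cost per deletion shrinks as the number of requests grows.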

Fractional power regime. For dataset size $n$, when $m = \Theta(n^{\alpha})$ for $0 < \alpha < 1$, we say we are in the fractional power regime. For $k$-means, our proposed algorithms achieve the ideal lower bound for small enough $\alpha$, but not for all $\alpha$ in $(0, 1)$.

Online data deletion is interesting in both the average-case setting (treated here), where the indices are chosen uniformly and independently without replacement, as well as in a worst-case setting, where the sequence of indices is computed adversarially (left for future work). It may also be practical to include a bound on the amount of memory available to the data deletion operation and model (including metadata) as an additional constraint.

Definition A.5.

Deletion Efficient Learning Algorithm

Recall the lower bound on amortized computation for any sequential learning algorithm in the online deletion setting (Section 2). Given some fractional power scaling $m = \Theta(n^{\alpha})$, we say an algorithm $A$ is $\alpha$-deletion efficient if it runs Algorithm 3 in amortized time $O(n^{1-\alpha})$.

Inference Time

Of course, lazy learning and non-parametric techniques are a clear exception to our notions of learning algorithm. For these methods, data is processed at inference time rather than training time – a more complete study of the systems trade-offs between training time, inference time, and deletion time is left for future work.

A.2 Approximate Data Deletion

We present one possible relaxation from exact to approximate data deletion.

Definition A.6.

Approximate deletion

We say that such a data deletion operation $R_A$ is a $\delta$-deletion for algorithm $A$ if, for all $D$ and for every measurable subset $S \subseteq \mathcal{H} \times \mathcal{M}$:

$$\Pr[A(D_{-i}) \in S \mid D_{-i}] \geq \delta \, \Pr[R_A(D, A(D), i) \in S \mid D_{-i}].$$

The above definition corresponds to requiring that the probability that the data deletion operation returns a model in some specified set, $S$, cannot be more than a factor $\delta^{-1}$ larger than the probability that algorithm $A$ retrained on the dataset $D_{-i}$ returns a model in that set. We note that the above definition does allow for the possibility that some outcomes that have positive probability under $A(D_{-i})$ have zero probability under the deletion operation. In such a case, an observer could conclude that the model was returned by running $A$ from scratch.

A.2.1 Approximate Deletion and Differential Privacy

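We recall the definition of differential privacy [26]. A map, $A$, from a dataset $D$ to a set of outputs, $\mathcal{H}$, is said to be $\epsilon$-differentially private if, for any two datasets $D_1, D_2$ that differ in a single datapoint, and any subset $S \subseteq \mathcal{H}$,

$$\Pr[A(D_1) \in S] \leq e^{\epsilon} \, \Pr[A(D_2) \in S].$$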

Under the relaxed notion of data deletion, it is natural to consider privatization as a way to support approximate deletion. The idea could be to privatize the model, and then resolve deletion requests by ignoring them. However, there are some nuances here that call for care. For example, differential privacy does not privatize the number of datapoints, but this should not be leaked in data deletion. Furthermore, since we wish to support a stream of deletions in the online setting, we would need to use group differential privacy [26], which can greatly increase the amount of noise needed for privatization. Even worse, this requires selecting the group size (i.e. total privacy budget) during training time (at least for canonical constructions such as the Laplace mechanism). In differential privacy, this group size is not necessarily a hidden parameter. In the context of deletion, it could leak information about the total dataset size as well as how many deletions any given model instance has processed. While privatization-like methods are perhaps a viable approach to support approximate deletion, there remain some technical details to work out, and this is left for future work.

Appendix B Algorithmic Details

In Appendix B, we present pseudo-code for the algorithms described in Section 3. Python implementations of our algorithms are also available.

B.1 Quantized $k$-Means

We present the pseudo-code for Q-$k$-means (Algo. 4).

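  Input: data matrix $D \in \mathbb{R}^{n \times d}$
  Parameters: number of clusters $k$, number of iterations $T$, balance parameter $\gamma$, quantization granularity $\epsilon$
  Initialize centroids with $k$-means++
  Save initial centroids to metadata
  Compute the $k$-means loss of the initial partition
  for $\tau = 1$ to $T$ do
      Store current centroids
      Compute centroids of the current partition
      for $\kappa = 1$ to $k$ do
          if cluster $\kappa$ is $\gamma$-imbalanced then
              Apply correction to $\gamma$-imbalanced partition
          end if
      end for
      Generate random phase
      Quantize centroids to $\epsilon$-lattice
      Update partition
      Save state to metadata
      Compute loss
      if loss has decreased then
           Update state
      else
           Terminate early
      end if
  end for
  return final centroids as model
Algorithm 4 Quantized $k$-means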


Q-$k$-means follows the same iterative protocol as the canonical Lloyd’s algorithm (and makes use of the $k$-means++ initialization). As mentioned in the main body, four key variations distinguish this method from the canonical Lloyd’s algorithm: quantization, memoization, balance correction, and early termination. The memoization of the optimization state and the early termination for increasing loss are self-explanatory from Algo. 4. We provide more details concerning the balance correction and the quantization step in Appendix B.1.1 and B.1.2, respectively.

Although it is rare, it is possible for a Lloyd’s iteration to result in a degenerate (empty) cluster. In this scenario, we have two reasonable options, and all of the theoretical guarantees remain valid under both. The first option is to re-initialize a new cluster via a $k$-means++ seeding. Since the number of clusters and iterations are constant, this does not impact any of the asymptotic deletion efficiency results. The second option is to simply leave a degenerate partition. This does not impact the upper bound on expected statistical performance, which is derived only as a function of the $k$-means++ initialization. For most datasets, this issue hardly matters in practice, since Lloyd’s iterations do not usually produce degenerate clusters (even in the presence of quantization noise).

In our implementation, we have chosen to re-initialize degenerate clusters, and are careful to account for this in our metadata, since such a re-initialization could trigger the need to retrain at deletion time if the re-initialized point is part of the deletion stream.

Below we present the pseudo-code for the deletion operation (Algo. 5), and then elaborate on the quantization scheme and balance correction in the following subsections.

  Input: data matrix $D$, target deletion index $i$, training metadata
  Obtain target deletion point
  Retrieve initial centroids from metadata
  if target point is one of the initial centroids then
     // Selected initial point.
     return // Need to retrain from scratch.
  else
     for $\tau = 1$ to $T$ do
         Retrieve state for iteration $\tau$
         Find cluster assignment of target point
         Compute perturbed centroid of that cluster without the target point
         Apply $\gamma$-correction to perturbed centroid if necessary
         Quantize perturbed centroid
         if quantized perturbed centroid differs from stored quantized centroid then
            // Centroid perturbation gives unstable quantization
            return // Need to retrain from scratch.
         end if
         Update metadata with perturbed state
     end for
  end if
  Update dataset; return stored centroids //Successfully verified centroid stability
Algorithm 5 Deletion Op for Q-$k$-means
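
The core check in the deletion operation can be sketched for a single cluster and a single iteration as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, the lattice is taken to be the set of points `eps * (z + theta)` for integer vectors `z`, and the balance correction and the bookkeeping across iterations are omitted.

```python
def quantize(x, eps, theta):
    """Round x to the nearest vertex of the lattice {eps * (z + theta)}, z an integer vector."""
    return [eps * (round(xi / eps - ti) + ti) for xi, ti in zip(x, theta)]


def deletion_keeps_centroid(cluster_sum, cluster_size, point, eps, theta):
    """True if removing `point` leaves the cluster's quantized centroid unchanged.

    `cluster_sum` is the coordinate-wise sum of the cluster's points, which a
    memoizing implementation could keep as metadata (an assumption of this sketch).
    """
    old = [s / cluster_size for s in cluster_sum]
    new = [(s - p) / (cluster_size - 1) for s, p in zip(cluster_sum, point)]
    return quantize(old, eps, theta) == quantize(new, eps, theta)


theta = [0.25, 0.25]
# Large cluster: one point barely moves the centroid, so the quantized value is stable.
stable = deletion_keeps_centroid([500.0, 500.0], 1000, [0.9, 0.1], 0.1, theta)
# Tiny cluster: the centroid jumps, so this deletion would force retraining.
unstable = deletion_keeps_centroid([1.0, 1.0], 2, [0.9, 0.1], 0.1, theta)
print(stable, unstable)
```

When the check succeeds for every iteration, the stored model can be returned unchanged; a single failure falls back to retraining from scratch.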

B.1.1 $\gamma$-Balanced Clusters

Definition B.1.


Given a partition $\pi$, we say it is $\gamma$-balanced if $|\pi_\kappa| \geq \frac{\gamma n}{k}$ for all partitions $\kappa$. The partition is $\gamma$-imbalanced if it is not $\gamma$-balanced.

In Q-$k$-means, imbalanced partitions can lead to unstable quantized centroids. Hence, it is preferable to avoid such partitions. As can be seen in the pseudo-code, we add mass to small clusters to correct for $\gamma$-imbalance. At each iteration, we apply the correction to every cluster that is $\gamma$-imbalanced, combining its current centroid with the corresponding centroid from the previous iteration.

In prose, for small clusters, current centroids are averaged with the centroids from the previous iteration to increase stability.
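
One way such a correction could look is sketched below. The normalized weighting, the threshold `gamma * n / k`, and the names `balance_correct` and `gamma` are assumptions made for this illustration; the exact formula used by the authors may differ.

```python
def balance_correct(centroid, prev_centroid, cluster_size, n, k, gamma):
    """Blend a small cluster's centroid with its centroid from the previous iteration.

    A cluster counts as small when it holds fewer than gamma * n / k points
    (an assumed threshold). The missing mass is filled with the previous centroid,
    so the result is a convex combination of the two.
    """
    threshold = gamma * n / k
    if cluster_size >= threshold:
        return list(centroid)  # large enough: no correction needed
    weight = cluster_size / threshold
    return [weight * c + (1.0 - weight) * p for c, p in zip(centroid, prev_centroid)]


small = balance_correct([1.0, 0.0], [0.0, 1.0], cluster_size=2, n=100, k=4, gamma=0.2)
large = balance_correct([1.0, 0.0], [0.0, 1.0], cluster_size=10, n=100, k=4, gamma=0.2)
print(small, large)
```

With this weighting, removing one point from a small cluster changes the corrected centroid by no more than it would change a cluster sitting exactly at the threshold size, which is what keeps the quantized centroid stable.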

For use in practice, a choice of must be made. If no class balance information is known, then setting is a solid heuristic for all but severely imbalanced datasets, in which case it is likely that DC--means would be preferable to Q--means.

B.1.2 Quantizing with an $\epsilon$-Lattice

We detail the quantization scheme used. A quantization maps analog values to a discrete set of points. In our scheme, we uniformly cover $\mathbb{R}^d$ with an $\epsilon$-lattice, and round analog values to the nearest vertex on the lattice. It is also important to add an independent, uniform random phase shift to each dimension of the lattice, effectively de-biasing the quantization.

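We proceed to formally define our quantization $Q_{(\epsilon, \theta)}$. $Q_{(\epsilon, \theta)}$ is parameterized by a phase shift $\theta \in [-\frac{1}{2}, \frac{1}{2}]^d$ and a granularity parameter $\epsilon > 0$. For brevity, we omit the explicit dependence on phase and granularity when it is clear from context. For a given $x \in \mathbb{R}^d$:

$$Q(x) = \epsilon \left( \theta + \arg\min_{j \in \mathbb{Z}^d} \| x - \epsilon (\theta + j) \|_2 \right)$$

We set $\{\theta_\tau\}$ with an iid random sequence such that $\theta_\tau \sim \mathrm{Unif}[-\frac{1}{2}, \frac{1}{2}]^d$.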
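
A minimal sketch of randomized lattice quantization follows. It assumes the lattice consists of the points `eps * (z + theta)` for integer vectors `z`, with the phase drawn uniformly from `[-1/2, 1/2)` in each dimension; the function names are hypothetical.

```python
import random


def random_phase(d, rng):
    """Independent uniform phase shift in [-1/2, 1/2) for each dimension."""
    return [rng.random() - 0.5 for _ in range(d)]


def quantize(x, eps, theta):
    """Round x to the nearest vertex of the lattice {eps * (z + theta)}, z an integer vector."""
    return [eps * (round(xi / eps - ti) + ti) for xi, ti in zip(x, theta)]


rng = random.Random(0)
theta = random_phase(2, rng)
x = [0.37, 0.82]
q = quantize(x, eps=0.1, theta=theta)
print(q)
```

Each coordinate moves by at most `eps / 2`, and drawing a fresh phase per iteration means no fixed region of space is systematically rounded in one direction.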

B.2 Divide-and-Conquer $k$-Means

We present pseudo-code for the deletion operation of divide-and-conquer $k$-means. The pseudo-code for the training algorithm may be found in the main body. The deletion update is conceptually simple. Since a deleted datapoint only belongs to one leaf’s dataset, we only need to recompute the sub-problems on the path from said leaf to the root.

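  Input: data matrix $D$, target deletion index $i$, model metadata
  Obtain target deletion point
  node ← leaf node assignment of target point
  Remove target point from node's dataset and recompute node.centroids
  while node is not root do
      Update node's centroids in the dataset of node.parent
      node ← node.parent
      Recompute node.centroids
  end while
  Update metadata; return node.centroids
Algorithm 6 Deletion Op for DC-$k$-means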
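
The path-recomputation idea can be sketched with a tree of depth one, where `w` leaves feed a single root. Everything here is an illustrative assumption: the class `DCKMeans`, the plain Lloyd's routine with random initialization (standing in for $k$-means++), and the parameter names. The sketch shows which sub-problems are re-solved and does not claim the guarantees proved in the paper.

```python
import random


def lloyd(points, k, rng, iters=10):
    """Plain Lloyd's iterations from a random initialization (stand-in for k-means++)."""
    centroids = [list(p) for p in rng.sample(points, min(k, len(points)))]
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


class DCKMeans:
    """Depth-1 divide-and-conquer tree: w leaves, one root."""

    def __init__(self, data, k, w, seed=0):
        self.k, self.rng = k, random.Random(seed)
        self.leaves = [{} for _ in range(w)]   # leaf -> {index: point}
        self.leaf_of = {}                      # index -> leaf
        for i, p in enumerate(data):
            leaf = self.rng.randrange(w)       # uniform random leaf assignment
            self.leaves[leaf][i] = p
            self.leaf_of[i] = leaf
        self.leaf_centroids = [self._solve_leaf(leaf) for leaf in range(w)]
        self.centroids = self._solve_root()

    def _solve_leaf(self, leaf):
        pts = list(self.leaves[leaf].values())
        return lloyd(pts, self.k, self.rng) if pts else []

    def _solve_root(self):
        merged = [c for cs in self.leaf_centroids for c in cs]
        return lloyd(merged, self.k, self.rng)

    def delete(self, i):
        """Re-solve only the affected leaf and the root; other leaves are untouched."""
        leaf = self.leaf_of.pop(i)
        del self.leaves[leaf][i]
        self.leaf_centroids[leaf] = self._solve_leaf(leaf)
        self.centroids = self._solve_root()
        return self.centroids


data_rng = random.Random(1)
data = [[data_rng.random(), data_rng.random()] for _ in range(40)]
model = DCKMeans(data, k=2, w=4, seed=1)
new_centroids = model.delete(0)
print(new_centroids)
```

A deletion touches roughly `n / w` points in one leaf plus at most `w * k` leaf centroids at the root, rather than all `n` points.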

Appendix C Mathematical Details

Here, we provide proofs for the claims in the main body. We follow notation introduced in the main body and Appendix A. As a notational shorthand, we will denote $A(D_{-i})$ by $A_{-i}$ and $A(D)$ by $A$ when there is only one dataset in the context. Also, when it is unambiguous, we will use $A$ to denote the specific learning algorithm in question, and $R$ to denote its corresponding deletion operation.

C.1 Proof of Theorem 3.1

Refer to the main body for the statement of Theorem 3.1. Here is an abridged version:


Q-$k$-means supports $m$ deletions in expected time $O(m^2 d^{5/2} / \epsilon)$.

Note that we assume the dataset is scaled onto the unit hypercube. Otherwise, the theorem still holds with an assumed constant factor radial bound. We prove the theorem in three successive steps, given by Lemma C.1 through Lemma C.3.

Lemma C.1.

Define for some . is the hypercube in Euclidean -space of side length centered at the origin. Let for some . Let be a uniform random variable with support . Then, .


(Lemma C.1) If , then there exists some such that . Marginally, Taking a union bound over the dimensions obtains the bound. ∎

We make use of Lemma C.1 in proving the following lemma. First, recall the definition of our quantization scheme from Section 3:

We take , implying a distribution for .

Lemma C.2.

Let be a uniform quantization -lattice over with uniform phase shift . Let denote the quantization mapping over and let denote the quantization image for subsets of . Let . Then , where is the -ball about under Euclidean norm.


(Lemma C.2) Due to invariance of measure under translation, we may apply a coordinate transformation by translation to the origin of . Under this coordinate transform, . Further, note that is precisely equivalent to . Because is uniform, applying Lemma C.1 as an upper bound completes the proof. ∎
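
The union-bound argument can be checked numerically. The sketch below estimates how often a ball of radius `radius` around a point straddles a quantization cell boundary when the phase is uniform, and compares the estimate with `2 * d * radius / eps`, which is what a union bound over the `d` coordinates gives. The symbol names, the lattice convention `eps * (z + theta)`, and the exact form of the bound are assumptions of this illustration.

```python
import random


def ball_is_split(x, radius, eps, theta):
    """True if the Euclidean ball of the given radius around x crosses a cell boundary.

    Cell boundaries of the lattice {eps * (z + theta)} are the axis-aligned
    hyperplanes at eps * (z + theta + 1/2), so the ball is split exactly when
    some coordinate of x lies within `radius` of one of them.
    """
    for xi, ti in zip(x, theta):
        u = xi / eps - ti - 0.5
        if abs(u - round(u)) * eps < radius:
            return True
    return False


def split_frequency(d, radius, eps, trials, rng):
    """Estimate P(ball is split) using a fresh uniform phase in each trial."""
    hits = 0
    for _ in range(trials):
        x = [rng.random() for _ in range(d)]
        theta = [rng.random() - 0.5 for _ in range(d)]
        hits += ball_is_split(x, radius, eps, theta)
    return hits / trials


d, radius, eps = 3, 0.005, 0.1
estimate = split_frequency(d, radius, eps, trials=20000, rng=random.Random(0))
union_bound = 2 * d * radius / eps
print(estimate, union_bound)
```

The estimate should sit somewhat below the bound, since the union bound counts points that are close to boundaries in several coordinates more than once.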

With Lemma C.2 in hand, we proceed to state and prove Lemma C.3.

Lemma C.3.

Let be a dataset on of size . Let be the centroids computed by quantized Lloyd’s algorithm with parameters , , , and . Then, with probability greater than , it holds that for any with , where probability is with respect to the randomness in the quantization phase and the k-means++ initialization.


(Lemma C.3) We analyze two instances of Q-Lloyd’s algorithm operating with the same initial centroids and the same sequence of iid quantization maps . One instance runs on input dataset and the other runs on input dataset . This is the only difference between the two instances.

Let denote the -th analog (i.e. non-quantized) centroid at the -th iteration of Q-Lloyd’s on some input dataset with initialization . By construction of Q-Lloyd’s, for any datasets , we have that if = for all and all .

Fix any particular and . We can bound as follows. Note that where denotes the indicator function. Furthermore, . Assume that . Because and , these sums can differ by at most . On the other hand, assume that . In this case, the -correction still ensures that the sums differ by at most . This bounds .

To complete the proof, apply Lemma C.2 setting . Taking a union bound over and yields the desired result.

We are now ready to complete the proof of the theorem. We briefly sketch and summarize the argument before presenting the proof. Recall the deletion algorithm for Q-k-means (Appendix B). Using the runtime memo, we verify that the deletion of a point does not change what would have been the algorithm’s output. If it would have changed the output, then we retrain the entire algorithm from scratch. Thus, we take a weighted average of the computational expense in these two scenarios. Recall that retraining from scratch takes time and verifying the memo at deletion time takes time . Finally, note that we must sequentially process a sequence of deletions, with a valid model output after each request. We are mainly interested in the scaling with respect to , and , treating other factors as non-asymptotic constants in our analysis. We now proceed with the proof.
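The weighted average described above can be sketched in Python as follows. The costs are abstract units and the instability probability is a free parameter. The function names and the numeric values are hypothetical, and the sketch assumes that the verification cost is paid on every deletion while the retraining cost is paid only when the output would change.

```python
import random

def expected_deletion_cost(p_unstable, verify_cost, retrain_cost):
    """Expected cost of one deletion: always verify, retrain only if the output would change."""
    return (1 - p_unstable) * verify_cost + p_unstable * (verify_cost + retrain_cost)

def handle_deletion(is_stable, verify_cost, retrain_cost):
    """Cost actually paid for a single deletion request."""
    cost = verify_cost            # check the training-time memo
    if not is_stable:
        cost += retrain_cost      # fall back to retraining from scratch
    return cost

def simulate(m, p_unstable, verify_cost, retrain_cost, seed=0):
    """Average cost per deletion over m independent simulated requests."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        is_stable = rng.random() >= p_unstable
        total += handle_deletion(is_stable, verify_cost, retrain_cost)
    return total / m

analytic = expected_deletion_cost(0.1, 1.0, 100.0)
empirical = simulate(20_000, 0.1, 1.0, 100.0)
```

When the instability probability is small, the expected cost is dominated by the cheap verification step, which is the intuition behind the runtime bound.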


Q-Lloyd’s supports deletions in expected time .


In order for to be a valid deletion, we require that for any . In this setting, we identify models with the output centroids: . Consider the sources of randomness: the iid sequence of random phases and the k-means++ initializations.

Let be a random function computing the k-means++ initializations over a given dataset. Let denote the event that . Then, from the construction of k-means++, we have that for all , . Thus, and are equal in distribution conditioned on not being an initial centroid.

Let denote the iid sequence of random phases for and let denote the iid sequence of random phases for . Within event , we define a set of events , parameterized by , as the event that output centroids are stable under deletion conditioned on a given sequence of phases :

By construction of , we have where event is verified given the training time memo. To conclude, let be any Borel set:

by law of total probability.

by construction of

by definition of

by .



Let be the total runtime of after training once and then satisfying deletion requests with . Let denote the deletion sequence, with each deletion sampled uniformly without replacement from .

Let be the event that the centroids are stable for all deletions. Using Theorem 3.1 to bound the event complement probability :


In the centroids are stable, and verifying in takes time in total. In we coarsely upper bound by assuming we retrain to satisfy each deletion. ∎

C.2 Proofs of Corollaries and Propositions

We present the proofs of the corollaries in the main body. We are primarily interested in the asymptotic effects of , , , and . We treat other variables as constants. For the purposes of online analysis, we let for some and for some

C.2.1 Proof of Corollary 3.1.1

We state the following theorem of Arthur and Vassilvitskii concerning k-means++ initializations [5]:

Theorem C.4.

(Arthur and Vassilvitskii)

Let be the optimal loss for a k-means clustering problem instance. Then k-means++ achieves expected loss
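For reference, the guarantee proved by Arthur and Vassilvitskii can be written as below. The symbols $\phi$ and $\phi_{\mathrm{OPT}}$ follow their paper and are an assumption here; they may differ from the notation used in this appendix.

```latex
% \phi: loss (potential) of the clustering induced by the k-means++ initialization.
% \phi_{\mathrm{OPT}}: loss of an optimal k-means clustering of the same instance.
\mathbb{E}[\phi] \;\le\; 8(\ln k + 2)\,\phi_{\mathrm{OPT}}
```

The expectation is over the random seeding alone, so the bound holds before any Lloyd's iterations are run.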

We restate Corollary 3.1.1:


Let be a random variable denoting the loss of Q-k-means on a particular problem instance of size . Then .


Let be the initialization produced by k-means++.