Many applications require finding entities in raw data, such as individual objects or people in image streams or particular speakers in audio streams. Often, entity-finding tasks are addressed by applying clustering algorithms such as $k$-means (for instance in [Niebles et al.2008]). We argue that instead they should be approached as instances of the frequent items problem, also known as the heavy hitters problem. The classic frequent items problem assumes discrete data and involves finding the most frequently occurring items in a stream of data. We propose to generalize it to continuous data.
Figure 3 shows examples of the differences between clustering and entity finding. Some clustering algorithms fit a global objective assigning all or most points to centers, whereas entities are defined locally, leading to more robustness to noise (figure 3(a)). Others join nearby dense groups while trying to detect sparse groups, whereas entities remain distinct (figure 3(b)). These scenarios are common because real-world data is often noisy and group sizes are often very unbalanced [Newman2005].
We characterize entities using two natural properties: similarity (the feature vectors should be similar according to some, not necessarily Euclidean, distance measure, such as cosine distance) and salience (the region should include a sufficient number of detections over time).
Even though our problem is not well-formulated as a clustering problem, it might be tempting to apply clustering algorithms to it. Clustering algorithms optimize for a related, but different, objective. This makes them less accurate for our problem; moreover, our formulation overcomes typical limitations of some clustering algorithms such as relying on the Euclidean distance metric and performing poorly in high-dimensional spaces. This is important because many natural embeddings, especially those coming from neural networks, are in high dimensions and use non-Euclidean metrics.
In this paper we suggest addressing the problem of entity finding as an extension of heavy hitters, instead of clustering, and propose an algorithm called hac with multiple desirable properties: handles an online stream of data; is guaranteed to place output points near high-density regions in feature space; is guaranteed to not place output points near low-density regions (i.e., is robust to noise); works with any distance metric; can be time-scaled, weighting recent points more; is easy to implement; and is easily parallelizable.
We begin by outlining a real-world application of tracking important objects in a household setting without any labeled data and discussing related work. We go on to describe the algorithm and its formal guarantees and describe experiments that find the main characters in video of a TV show and that address the household object-finding problem.
1.1 Household Setting
The availability of low-cost, network-connected cameras provides an opportunity to improve the quality of life for people with special needs, such as the elderly or the blind. One application is helping people to find misplaced objects.
More concretely, consider a set of cameras recording video streams from some scene, such as a room, an apartment or a shop. At any time, the system may be queried with an image or a word representing an object, and it has to answer with candidate positions for that object. Typical queries might be: “Where are my keys?” or “Warn me if I leave without my phone.” Note that, in general, the system won’t know the query until it is asked and thus cannot know which objects in the scene it has to track. For such an application, it is important for the system to not need specialized training for every new object that might be the focus of a query.
Our premise is that images of interesting objects are such that 1) a neural network embedding [Donahue et al.2014, Johnson et al.2016, Mikolov et al.2013] will place them close together in feature space, and 2) their position stays constant most of the time, but changes occasionally. Therefore objects will form high-density regions in a combined feature × position space. Random noise, such as people moving or false positive object detections, will not form dense regions. Objects that don’t move (walls, sofas, etc.) will always be dense; interesting objects create dense regions in feature × position space, but eventually change position and form a new dense region somewhere else. We will exploit the fact that our algorithm is easy to scale in time to detect these changes over time.
1.2 Related work
Our algorithm, hac, addresses the natural generalization of heavy hitters, a very well-studied problem, to continuous settings. In heavy hitters we receive a stream of elements from a discrete vocabulary and our goal is to estimate the most frequently occurring elements using a small amount of memory, which does not grow with the size of the input. Optimal algorithms have been found for several classes of heavy hitters, which are a logarithmic factor faster than our algorithm, but they are all restricted to discrete elements [Manku and Motwani2002]. In our use case (embeddings of real-valued data), elements are not drawn from a discrete set, and thus repetitions have to be defined using regions and distance metrics. Another line of work [Chen and Zhang2016] estimates the total number of different elements in the data, in contrast to hac, which finds (not merely counts) different dense regions.
Our problem bears some similarity to clustering but the problems are fundamentally different (see figure 3). The closest work to ours within the clustering literature is density-based (DB) clustering. In particular, DB clustering algorithms first find all dense regions in space (as we do) and then join points via paths in those dense regions to find arbitrarily-shaped clusters. In contrast, we only care about whether a point belongs to one of the dense regions. This simplification has two advantages: first, it prevents joining two close-by entities; second, it allows much more efficient, general and simple methods.
The literature on DB clustering is very extensive. Most of the popular algorithms, such as DBSCAN [Ester et al.1996] and Level Set Tree Clustering [Chaudhuri and Dasgupta2010], as well as more recent algorithms [Rodriguez and Laio2014], require simultaneous access to all the points and have complexity quadratic in the number of points; this makes them impractical for big datasets and especially for streaming data. There are some online DB clustering algorithms [Chen and Tu2007], [Wan et al.2009], [Cao et al.2006], but they either tessellate the space or assume a small timescale, tending to work poorly for non-Euclidean metrics and high dimensions.
Two pieces of work join ideas from clustering with heavy hitters, albeit in very different settings and with different goals. [Larsen et al.2016] uses graph partitioning to attack the discrete heavy hitters problem in the general turnstile model. [Braverman et al.2017] query a heavy hitter algorithm in a tessellation of a high dimensional discrete space, to find a coreset which allows them to compute an approximate $k$-medians algorithm in polynomial time. Both papers tackle streams with discrete elements and either use clustering as an intermediate step to compute heavy hitters or use heavy hitters as an intermediate step to do clustering ($k$-medians). In contrast, we make a connection pointing out that the generalization of heavy hitters to continuous spaces allows us to do entity finding, previously seen as a clustering problem.
We can detect Gaussians with different variances by customizing the radius for each output. The goal isn’t to cover the whole group with the circle but to return the smallest radius that contains a fraction of the data. Points near an output are guaranteed to need a similar radius to contain the same fraction of data.
We illustrate our algorithm in some applications that have been addressed using different methods. Clustering faces is a well-studied problem with commercially deployed solutions. However, these applications generally assume we care about most faces in the dataset and that faces occur in natural positions. This is not the case for many real-world applications, where photos are taken in motion from multiple angles and are often blurry. Therefore, algorithms that use clustering in the conventional sense, [Schroff et al.2015, Otto et al.2017], do not apply.
[Rituerto et al.2016] proposed using DB clustering in a setting similar to our object localization application. However, since our algorithm is online, we allow objects to change position over time. Their method, which uses DBSCAN, can be used to detect what we will call stable objects, but not movable ones (which are generally what we want to find). [Nirjon and Stankovic2012] built a system that tracks objects assuming they will only change position when interacting with a human. However, they need an object database, which makes the problem easier and the system much less practical, as the human has to register every object to be tracked.
2 Problem setting
In this section we argue that random sampling is surprisingly effective (both theoretically and experimentally) at finding entities by detecting dense regions in space and describe an algorithm for doing so in an online way. The following definitions are of critical importance.
Let $d(\cdot,\cdot)$ be the distance metric. A point $p$ is $(r,f)$-dense with respect to dataset $D$ if the subset of points in $D$ within distance $r$ of $p$ represents a fraction of the points that is at least $f$. If $N = |D|$, then $p$ must satisfy:
$$|\{x \in D \mid d(x,p) \le r\}| \ge fN$$
A point $p$ is $(r,f)$-sparse with respect to dataset $D$ if and only if it is not $(r,f)$-dense.
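As a concrete reading of the two definitions above, the following minimal sketch tests denseness by brute force. The function names, the parameter names `r` (radius) and `f` (fraction), and the default Euclidean metric are illustrative assumptions, not part of the paper.

```python
import math

def is_dense(p, data, r, f, dist=math.dist):
    """Return True if at least a fraction f of `data` lies within distance r of p.

    `dist` can be any distance function; Euclidean distance is only a default
    chosen for this example.
    """
    if not data:
        return False
    near = sum(1 for x in data if dist(x, p) <= r)
    return near >= f * len(data)

def is_sparse(p, data, r, f, dist=math.dist):
    """A point is sparse exactly when it is not dense."""
    return not is_dense(p, data, r, f, dist)

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(is_dense((0.0, 0.0), data, r=0.2, f=0.5))  # 3 of the 4 points are within 0.2
print(is_dense((5.0, 5.0), data, r=0.2, f=0.5))  # only 1 of the 4 points is within 0.2
```

Note that the query point does not have to belong to the dataset, which matches the requirement that guarantees hold for every point in space.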
The basic version of our problem is the natural generalization of heavy hitters to continuous spaces. Given a metric $d$, a frequency threshold $f$, a radius $r$ and a stream of points $D$, after each input point the output is a set of points. Every $(r,f)$-dense point (even those not in the dataset) has to be close to at least one output point and every sufficiently sparse region has to be far away from all output points.
Our algorithm is based on samples that hop between data points and count points nearby; we therefore call it Hop And Count (hac).
2.1 Description of the algorithm
A very simple non-online algorithm to detect dense regions is to take a random sample of $m$ elements and output only those samples that satisfy the definition of $(r,f)$-dense with respect to the whole data set. For a large enough $m$, each dense region in the data will contain at least one of the samples with high probability, so the output will include a sample from this region. For sparse regions, even if they contain a sampled point, this sample will not be in the output since it will not pass the denseness test.
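The offline procedure just described can be sketched in a few lines. This is only an illustration: the function name, the parameter names (`r`, `f`, `m`) and the toy data are assumptions made for the example.

```python
import math
import random

def dense_samples_offline(data, r, f, m, dist=math.dist, rng=None):
    """Draw m uniform samples and keep those that pass the denseness test
    against the whole dataset (requires all points in memory)."""
    rng = rng or random.Random()
    n = len(data)
    outputs = []
    for _ in range(m):
        s = rng.choice(data)
        count = sum(1 for x in data if dist(x, s) <= r)
        if count >= f * n:
            outputs.append(s)
    return outputs

# Toy data: one tight group holding 60% of the points plus isolated noise.
blob = [(0.001 * i, 0.0) for i in range(60)]
noise = [(10.0 + 3.0 * i, 10.0) for i in range(40)]
outputs = dense_samples_offline(blob + noise, r=0.1, f=0.5, m=30,
                                rng=random.Random(0))
print(len(outputs), "of 30 samples passed the denseness test")
```

Samples that land on noise points are discarded by the test, so every returned point belongs to the dense group.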
Let us try to make this into an online algorithm. A known way to maintain a uniform distribution in an online fashion is reservoir sampling [Vitter1985]: we keep $m$ stored samples. After the $i$-th point arrives, each sample changes, independently, with probability $1/i$ to this new point. At each time step, samples are uniformly distributed over all the points in the data. However, once a sample has been drawn we cannot go back and check whether it belongs to a dense or sparse region of space, since we have not kept all points in memory.
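The sampling step can be illustrated with a short sketch in which each of the stored samples is an independent size-one reservoir. The function name and the parameter name `m` are assumptions made for this example.

```python
import random

def reservoir_samples(stream, m, rng=None):
    """Maintain m independent samples, each uniform over the points seen so far.

    When the i-th point arrives (1-indexed), every sample is replaced by it
    independently with probability 1/i.
    """
    rng = rng or random.Random()
    samples = [None] * m
    for i, point in enumerate(stream, start=1):
        for j in range(m):
            # random() lies in [0, 1), so the first point always replaces None.
            if rng.random() < 1.0 / i:
                samples[j] = point
    return samples

samples = reservoir_samples(range(4), m=4000, rng=random.Random(1))
print([samples.count(v) for v in range(4)])  # each count should be roughly 1000
```

Only the current samples are kept in memory, which is exactly why a separate mechanism is needed to decide whether a sample lies in a dense region.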
The solution is to keep a counter for each sample in memory and update the counters every time a new point arrives. In particular, for any sample $s$ in memory, when a new point $p$ arrives we check whether $d(s,p) \le r$; if so, we increase $s$’s counter by 1. When the sample hops to a new point $p'$, the counter is no longer meaningful and we set it to 0.
Since we are in the online setting, every sample only sees points that arrived after it and thus only the first point in a region sees all the other points in that region. Therefore, if we want to detect a region containing a fraction $f$ of the data, we have to introduce an acceptance threshold lower than $f$, for example $f/2$, and only output points with frequency above it. The probability of any sample being in the first half of any dense region is at least $f/2$ and thus, for a large enough number of samples $m$, with high probability every dense region will contain a sample detected as dense. Moreover, since we set the acceptance threshold to $f/2$, regions much sparser than $f$ will not produce any output points. In other words, we will have false positives but they will be good false positives, since those points are guaranteed to be in regions almost as dense as the target dense regions we actually care about. In general, the factor $1/2$ can be replaced by another constant, trading memory for performance. Finally, note that this algorithm is easy to parallelize because all samples and their counters are completely independent.
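Putting hopping, counting and the lowered acceptance threshold together gives the single-radius online procedure. The class below is a minimal sketch under stated assumptions: the class and attribute names are invented for illustration, and the acceptance threshold is taken to be half of the target fraction, as in the example discussed above.

```python
import math
import random

class HopAndCount:
    """Single-radius sketch: m samples hop between points and count neighbours."""

    def __init__(self, r, f, m, dist=math.dist, accept=0.5, rng=None):
        self.r, self.f, self.m = r, f, m
        self.dist = dist
        self.accept = accept            # acceptance threshold is accept * f (assumed 1/2)
        self.rng = rng or random.Random()
        self.n = 0                      # number of points seen so far
        self.samples = [None] * m
        self.counts = [0] * m

    def update(self, p):
        self.n += 1
        for j in range(self.m):
            if self.rng.random() < 1.0 / self.n:
                self.samples[j] = p     # hop to the new point
                self.counts[j] = 0      # the old count is no longer meaningful
            elif self.dist(self.samples[j], p) <= self.r:
                self.counts[j] += 1     # new point falls inside the sample's ball

    def outputs(self):
        threshold = self.accept * self.f * self.n
        return [s for s, c in zip(self.samples, self.counts) if c >= threshold]

hac = HopAndCount(r=0.1, f=0.5, m=50, rng=random.Random(0))
for i in range(100):
    hac.update((0.001 * (i % 10), 0.0))    # dense group
    hac.update((0.001 * (i % 10), 0.005))  # dense group
    hac.update((100.0 + i, 100.0))         # isolated noise point
print(len(hac.outputs()), "samples currently reported as dense")
```

Each sample only touches its own counter, so the inner loop could be split across workers without coordination.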
2.2 Multiple radii
In the previous section we assumed a specific known threshold $r$. What if we don’t know $r$, or if every dense region has a different diameter? We can simply have counts for multiple values of $r$ for every sample. In particular, for every sample in memory we maintain a count of streamed points within distance $r$ for every $r$ in an exponentially spaced sequence of radii between a minimum and a maximum radius. At output time we can output the smallest $r$ such that the sample is $(r,f)$-dense. With this exponential sequence we guarantee a constant-factor error while only losing a logarithmic factor in memory usage. The minimum and maximum radii may be user-specified or automatically adjusted at runtime.
Following is the pseudo-code version of the algorithm with multiple specified radii. Note that the only data-dependent parameters are the minimum and maximum radii and the minimum fraction that we will be able to query. The other parameters trade off memory vs. probability of satisfying guarantees.
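A minimal Python sketch of the multi-radius variant might look as follows. It is not the authors' pseudo-code: the class name, the doubling factor between consecutive radii, and the acceptance factor of one half are all assumptions made for illustration.

```python
import math
import random

def radii_sequence(r_min, r_max, base=2.0):
    """Exponentially spaced radii starting at r_min; the last one is >= r_max.
    The base of 2 is an assumption of this sketch."""
    radii = [r_min]
    while radii[-1] < r_max:
        radii.append(radii[-1] * base)
    return radii

class MultiRadiusHAC:
    def __init__(self, r_min, r_max, m, dist=math.dist, rng=None):
        self.radii = radii_sequence(r_min, r_max)
        self.m = m
        self.dist = dist
        self.rng = rng or random.Random()
        self.n = 0
        self.samples = [None] * m
        self.counts = [[0] * len(self.radii) for _ in range(m)]

    def update(self, p):
        self.n += 1
        for j in range(self.m):
            if self.rng.random() < 1.0 / self.n:
                self.samples[j] = p                       # hop
                self.counts[j] = [0] * len(self.radii)    # reset all counters
            else:
                d = self.dist(self.samples[j], p)
                for k, r in enumerate(self.radii):
                    if d <= r:
                        self.counts[j][k] += 1

    def outputs(self, f, accept=0.5):
        """Return (sample, smallest radius whose count passes the threshold)."""
        threshold = accept * f * self.n
        out = []
        for s, counts in zip(self.samples, self.counts):
            for r, c in zip(self.radii, counts):
                if c >= threshold:
                    out.append((s, r))
                    break
        return out

mr = MultiRadiusHAC(r_min=0.05, r_max=0.8, m=50, rng=random.Random(0))
for i in range(100):
    mr.update((0.001 * (i % 10), 0.0))
    mr.update((0.001 * (i % 10), 0.005))
    mr.update((100.0 + i, 100.0))
print(mr.outputs(f=0.5)[:3])
```

Because the fraction is only supplied at output time, the same counters can answer queries for different fractions without reprocessing the stream.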
We make a guarantee for every dense or sparse point in space, even those that are not in the dataset. Our guarantees are probabilistic; they hold with probability where is a parameter of the algorithm that affects the memory usage. We have three types of guarantees, from loose but very certain, to tighter but less certain. For simplicity, we assume here that . Here, we state the theorems; the proofs are available in appendix A.
$r_f(p)$ is the smallest $r$ s.t. $p$ is $(r,f)$-dense. For each point $p$ we refer to its circle/ball as the sphere of radius $r_f(p)$ centered at $p$.
For any tuple , with probability , for any point s.t. our algorithm will give an output point s.t. .
Moreover, the algorithm always needs at most memory and time per point. Finally, it outputs at most points.
Any -sparse point will not have an output point within .
Notice that we can use this algorithm as a noise detector with provable guarantees. Any -dense point will be within of an output point and any -sparse point will not.
For any tuple , with probability , for any -dense point our algorithm will output a point s.t. with probability at least .
We can apply a post-processing algorithm that takes parameter in time to reduce the number of output points to while guaranteeing that for any point there is an output within . The same algorithm guarantees that for any -dense point there will be an output within .
Note that the number of outputs can be arbitrarily close to the optimal .
The post-processing algorithm is very simple: iterate through the original outputs in increasing . Add to the final list of outputs if there is no s.t. . See appendix A for a proof of correctness.
In high dimensions many clustering algorithms fail; in contrast, our performance can be shown to be provably good in high dimensions. We prove asymptotically good performance for dimension with a convergence fast enough to be meaningful in real applications.
With certain technical assumptions on the data distribution, if we run hac in high dimension , for any -dense point there will be an output point within , with , with probability , where is the total number of datapoints.
Moreover, the probability that a point is -sparse yet has an output nearby is at most .
We refer the reader to appendix A for a more detailed definition of the theorem and its proof.
The intuition behind the proof is the following: let us model the dataset as a set of high-dimensional Gaussians plus uniform noise. It is well-known that most points drawn from a high dimensional Gaussian lie in a thin spherical shell. This implies that all points drawn from the same Gaussian will be similarly dense (have a similar ) and will either all be dense or all sparse. Therefore, if a point is -dense it is likely that another point from the same Gaussian will be an output and will have a similar radius. Conversely, a point that is -sparse likely belongs to a sparse Gaussian and no point in that Gaussian can be detected as dense.
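The thin-shell fact used in this intuition is easy to check numerically. The sketch below, which uses only the standard library, draws points from a standard Gaussian in low and high dimension and compares their norms to the square root of the dimension; the function name and the chosen dimensions are assumptions for illustration.

```python
import math
import random

def gaussian_norms(dim, n_points, rng):
    """Norms of n_points samples from a standard Gaussian in `dim` dimensions."""
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(n_points)
    ]

rng = random.Random(0)
for dim in (2, 1000):
    ratios = [x / math.sqrt(dim) for x in gaussian_norms(dim, 100, rng)]
    # In high dimension the ratios concentrate tightly around 1.
    print(dim, round(min(ratios), 2), round(max(ratios), 2))
```

In two dimensions the ratios vary widely, whereas in a thousand dimensions they stay within a few percent of one, which is why points from the same Gaussian need similar radii to capture the same fraction of data.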
Note that, for and the theorem guarantees that any dense point will have an output within with probability and any -sparse point will not, with probability ; close to the ideal guarantees. Furthermore, in the appendix we show how these guarantees are non-vacuous for values as small as : the values of the dataset in section 3.
2.4 Time scaling
We have described a time-independent version of hac in which all points have equal weight, regardless of when they arrive. However, it is simple and useful to extend this algorithm to make point $p_i$ have weight proportional to $e^{-(t - t_i)/\tau}$ for any timescale $\tau$, where $t$ is the current time and $t_i$ is the time when point $p_i$ was inserted.
Trivially, a point inserted right now will still have weight $1$. Now, let $t'$ be the time of the last inserted point. We can update all the weights of the previously received points by a factor $e^{-(t - t')/\tau}$. Since all the weights are multiplied by the same factor, sums of weights can also be updated by multiplying by $e^{-(t - t')/\tau}$.
We now only need to worry about hops. We can keep a counter for the total weight of the points received until now. Let us define $w_{j,i}$ as the weight of point $p_j$ at the time point $p_i$ arrives. Since we want the samples to be distributed proportionally to those weights, when the $i$-th point arrives we simply assign the probability of hopping to be $1/\sum_{j \le i} w_{j,i}$. Note that for the previous case of all weights being $1$ (i.e. $\tau = \infty$) this reduces to a probability of $1/i$ as before.
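The time-scaled updates can be sketched as below. This is an illustration under assumptions: exponential decay with timescale `tau` is used because it is the decay for which all weights shrink by one common factor, a new point is given weight one, and the class and attribute names are invented for the example.

```python
import math
import random

class TimeScaledHAC:
    """Single-radius sketch with exponentially decaying point weights."""

    def __init__(self, r, m, tau, dist=math.dist, rng=None):
        self.r, self.m, self.tau = r, m, tau
        self.dist = dist
        self.rng = rng or random.Random()
        self.total_weight = 0.0      # decayed weight of all points seen so far
        self.last_time = None
        self.samples = [None] * m
        self.counts = [0.0] * m      # decayed weight of neighbours seen since the hop

    def update(self, p, t):
        if self.last_time is None:
            decay = 1.0
        else:
            decay = math.exp(-(t - self.last_time) / self.tau)
        self.last_time = t
        self.total_weight = self.total_weight * decay + 1.0   # new point has weight 1
        for j in range(self.m):
            self.counts[j] *= decay
            # Hop with probability (weight of new point) / (total weight).
            if self.rng.random() < 1.0 / self.total_weight:
                self.samples[j] = p
                self.counts[j] = 0.0
            elif self.dist(self.samples[j], p) <= self.r:
                self.counts[j] += 1.0

    def outputs(self, f, accept=0.5):
        threshold = accept * f * self.total_weight
        return [s for s, c in zip(self.samples, self.counts) if c >= threshold]

hac_t = TimeScaledHAC(r=0.5, m=20, tau=5.0, rng=random.Random(0))
for t in range(50):
    hac_t.update((0.0, 0.0), t)        # the object sits at the first location
for t in range(50, 100):
    hac_t.update((10.0, 10.0), t)      # then it is moved to a second location
print(hac_t.outputs(f=0.5))
```

With an infinite timescale the decay factor is exactly one, the total weight equals the number of points, and the hop probability falls back to the unweighted rule.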
We prove in the appendix that by updating the weights and modifying the hopping probability, the time-scaled version has guarantees similar to the original ones.
2.5 Fixing the number of outputs
We currently have two ways of querying the system: 1) Fix a single distance $r$ and a frequency threshold $f$, and get back all regions that are $(r,f)$-dense; 2) Fix a frequency $f$, and return a set of points $\{p_i\}$, each with a different radius $r_i$ s.t. a point $p$ near output point $p_i$ is guaranteed to need a radius similar to $r_i$ to contain the same fraction of the data.
It is sometimes more convenient to directly fix the number of outputs instead. With hac we go one step further and return a list of outputs sorted according to density (so, if you want $k$ outputs, you pick the first $k$ elements from the output list). Here are two ways of doing this: 1) Fix radius $r$. Find a set of outputs $p_i$, each $(r,f_i)$-dense. Sort by decreasing $f_i$, thus returning the densest regions first. 2) Fix frequency $f$, sort the list of regions from smallest to biggest $r_i$. Note, however, that the algorithm is given a fixed memory size which governs the size of the possible outputs and the frequency guarantees.
In general, it is useful to apply duplicate removal. In our experiments we sort all -dense outputs by decreasing , and add a point to the final list of outputs if it is not within of any previous point on the list. This is similar to but not exactly the same as the method in theorem 2.4; guarantees for this version can be proved in a similar way.
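The duplicate-removal step used in the experiments can be sketched as a greedy pass over the outputs. The function name and the separation parameter `min_sep` are hypothetical; the outputs are assumed to come as (point, estimated frequency) pairs.

```python
import math

def remove_duplicates(outputs, min_sep, dist=math.dist):
    """Greedy duplicate removal.

    Visit outputs by decreasing frequency and keep a point only if it is not
    within `min_sep` of a point that was already kept.
    """
    kept = []
    for point, freq in sorted(outputs, key=lambda o: o[1], reverse=True):
        if all(dist(point, q) > min_sep for q, _ in kept):
            kept.append((point, freq))
    return kept

raw = [((0.0, 0.0), 0.40), ((0.1, 0.0), 0.30), ((5.0, 5.0), 0.20), ((5.0, 5.1), 0.25)]
deduped = remove_duplicates(raw, min_sep=0.5)
print(deduped)       # densest representative of each region, densest region first
print(deduped[:1])   # fixing the number of outputs: take the first k entries
```

Since the result stays sorted by decreasing frequency, asking for a fixed number of outputs reduces to slicing the list.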
3 Identifying people
As a test of hac’s ability to find a few key entities in a large, noisy dataset, we analyze a season of the TV series House M.D.. We pick 1 frame per second and run a face-detection algorithm (dlib [King2009]) that finds faces in images and embeds them in a 128-dimensional space. Manually inspecting the dataset reveals a main character in of the images, a main cast of four characters appearing in each and three secondary characters in each. Other characters account for and poor detections (such as figure 4(a)) for .
We run hac with and apply duplicate reduction with . These parameters were not fine-tuned; they were picked based on comments from the paper that created the CNN and on figure 6. We fix for all experiments; these large values are sufficient because hac works better in high dimensions than guaranteed by theorem 2.1.
Figure 4: (a) The closest training example to the mean of the dataset (1-output of $k$-means) is a blurry misdetection. (b) DBSCAN merges different characters through paths of similar faces.
We compare hac against several baselines to find the most frequently occurring characters. For we ask each algorithm to return outputs and check how many of the top characters it returned. The simplest baseline, Random, returns a random sample of the data. Maximal Independent Set starts with an empty list and iteratively picks a random point and adds it to the set iff it is at least apart from all points in the list. We use sklearn [Pedregosa et al.2011] for both $k$-means and DBSCAN. DBSCAN has two parameters: we set its parameter to , since its role is exactly the same as our and grid-search to find the best . For $k$-means we return the image whose embedding is closest to each center and for DBSCAN we return a random image in each cluster.
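For reference, the Maximal Independent Set baseline described above can be sketched as follows. The function name, the separation parameter `min_sep` and the optional cap `k` on the number of outputs are assumptions; visiting the points in a shuffled order stands in for repeatedly picking a random point.

```python
import math
import random

def maximal_independent_set(data, min_sep, k=None, dist=math.dist, rng=None):
    """Visit points in random order; keep a point iff it is at least `min_sep`
    away from every point already kept. Stops early once k points are kept."""
    rng = rng or random.Random()
    order = list(data)
    rng.shuffle(order)
    chosen = []
    for p in order:
        if all(dist(p, q) >= min_sep for q in chosen):
            chosen.append(p)
            if k is not None and len(chosen) == k:
                break
    return chosen

line = [(float(i), 0.0) for i in range(10)]
picked = maximal_independent_set(line, min_sep=2.5, rng=random.Random(0))
print(picked)
```

This baseline ignores how many detections fall near each chosen point, so a noisy detection is as likely to be selected as a frequent character.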
As seen in figure 4, hac consistently outperforms all baselines. In particular, $k$-means suffers from trying to account for most of the data, putting centers near unpopular characters or noisy images such as figure 4(a). DBSCAN’s problem is more subtle: to detect secondary characters, the threshold frequency for being dense needs to be lowered to . However, this creates a path of dense regions between two main characters, joining the two clusters (figure 4(b)).
While we used offline baselines with fine-tuned parameters, hac is online and its parameters do not need to be fine-tuned. Succeeding even when put at a disadvantage gives strong evidence that hac is a better approach for the problem.
4 Object localization
In this section we show an application of entity finding that cannot be easily achieved using clustering. We will need the flexibility of hac: working online, with arbitrary metrics and in a time-scaled setting as old observations become irrelevant.
4.1 Identifying objects
In the introduction we outlined an approach to object localization that does not require prior knowledge of which objects will be queried. To achieve this we exploit many of the characteristics of the hac algorithm. We assume that: 1) A convolutional neural network embedding will place images of the same object close together and images of different objects far from each other. 2) Objects only change position when a human picks them up and places them somewhere else.
Points in the data stream are derived from images as follows. First, we use SharpMask [Pinheiro et al.2016] to segment the image into patches containing object candidates (figure 7). Since SharpMask is not trained on our objects, proposals are both unlabeled and very noisy. For every patch, we feed the RGB image into a CNN (Inception-V3 [Szegedy et al.2016]), obtaining a 2048-dimensional embedding. We then have 3 coordinates for the position (one indicates which camera is used, and then 2 indicate the pixel in that image).
We need a distance for this representation. It is natural to assume that two patches represent the same object if their embedding features are similar and they are close in the 3-D world. We can implement this with a metric that is the maximum between the distance in feature space and the distance in position space:
$$d\big((x_f, x_p), (y_f, y_p)\big) = \max\big(d_f(x_f, y_f),\; d_p(x_p, y_p)\big)$$
We can use cosine distance for and for ; hac allows for the use of arbitrary metrics. However, for good performance, we need to scale the distances such that close in position space and close in feature space correspond to roughly similar numerical values.
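A combined metric of this kind can be sketched as below. Several choices here are assumptions made only for illustration: Euclidean distance between pixel coordinates, treating detections from different cameras as infinitely far apart, and the two scale parameters used to bring both terms to comparable numerical ranges.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def combined_distance(x, y, feature_scale=1.0, position_scale=1.0):
    """Maximum of the scaled feature distance and the scaled position distance.

    Each detection is (feature_vector, camera_id, (pixel_x, pixel_y)).
    """
    (feat_x, cam_x, pix_x), (feat_y, cam_y, pix_y) = x, y
    if cam_x != cam_y:
        return math.inf   # assumption: different cameras never match
    d_feat = cosine_distance(feat_x, feat_y) / feature_scale
    d_pos = math.dist(pix_x, pix_y) / position_scale
    return max(d_feat, d_pos)

a = ((1.0, 0.0), 0, (100.0, 100.0))
b = ((1.0, 0.0), 0, (130.0, 100.0))   # same features, 30 pixels away
c = ((0.0, 1.0), 0, (100.0, 100.0))   # same place, orthogonal features
print(combined_distance(a, b, position_scale=100.0))
print(combined_distance(a, c, position_scale=100.0))
```

Taking the maximum means two detections are close only when they are close in both feature space and position space, which is why the relative scaling of the two terms matters.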
We can now apply hac to the resulting stream of points. In contrast to our previous experiment, time is now very important. In particular, if we run hac with a large timescale and a small timescale, we’ll have 3 types of detections:
Noisy detections (humans passing through, false positive camera detections): not dense in either timescale;
Detections from stable objects (sofas, walls, floor): dense in both timescales; and
Detections from objects that move intermittently (keys, mugs): not dense in the large timescale, and alternating dense and sparse in the small timescale. (When a human picks up an object from a location, that region will become sparse; when the human places it somewhere else, a new region will become dense.)
We are mainly interested in the third type of detections.
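The three cases above amount to a small decision rule. The sketch below is illustrative only: it assumes we already know, for one region, whether it is dense at the large timescale and a history of dense/sparse flags at the small timescale; the function name and labels are invented.

```python
from typing import Sequence

def classify_detection(dense_long: bool, dense_short: Sequence[bool]) -> str:
    """Classify a region from its density at a large and a small timescale.

    dense_long:  whether the region is dense at the large timescale.
    dense_short: dense/sparse flags at the small timescale over successive queries.
    """
    if dense_long and all(dense_short):
        return "stable"      # sofas, walls, floor
    if not dense_long and not any(dense_short):
        return "noise"       # people passing by, false positive detections
    if not dense_long and any(dense_short) and not all(dense_short):
        return "movable"     # keys, mugs
    return "unclear"         # combinations not covered by the three cases

print(classify_detection(True, [True, True, True]))
print(classify_detection(False, [False, True, True, False]))
print(classify_detection(False, [False, False]))
```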
4.2 Experiment: relating objects to humans
We created a dataset of 8 humans moving objects around 20 different locations in a room; you can find it on http://lis.csail.mit.edu/alet/entities.html. Locations were spread across 4 tables with 8, 4, 4, 4 on each respectively. Each subject had a bag and followed a script with the following pattern: Move to the table of location A; Pick up the object in your location and put it in your bag; Move to the table of location B; Place the object in your bag at your current location.
The experiment was run in steps of 20 seconds: in the first 10 seconds humans performed actions, and in the last 10 seconds we recorded the scene without any actions happening. Since we’re following a script and humans have finished their actions, during the latter 10 seconds we know the position of every object with an accuracy of 10 centimeters. The total recording lasted for 10 minutes and each human picked or placed an object an average of 12 times. In front of every table we used a cell phone camera to record that table (both human faces and objects on the table).
We can issue queries to the system such as: Which human has touched each object? Which objects have not been touched? Where can I find a particular object? Note that if the query had to be answered based on only the current camera image, two major issues would arise: 1) We would not know whether an object is relevant to a human. 2) We would not detect objects that are currently occluded.
This experimental domain is quite challenging for several reasons: 1) The face detector only detects about half the faces. Moreover, false negatives are very correlated, sometimes missing a human for tens of seconds. 2) Two of the 8 subjects are identical twins. We have checked that the face detector can barely tell them apart. 3) The scenes are very cluttered: when an interaction happens, an average of 1.7 other people are present at the same table. 4) Cameras are 2D (no depth map) and the object proposals are very noisy.
We focus on answering the following query: for a given object, which human interacted with it the most? The algorithm doesn’t know the queries in advance nor is it provided training data for particular objects or humans. Our approach, shown in figure 8, is as follows:
Run hac with (all points have the same weight regardless of their time), seconds, and a distance function and threshold which link two detections that happen roughly within 30 centimeters and have features that are close in embedding space.
Every , query for outputs representing dense regions.
For every step, look at all outputs from the algorithm and check which ones do not have any other outputs nearby in the previous step. Those are the detections that appeared. Similarly, look at the outputs from the previous step that do not have an output nearby in the current step; those are the ones that disappeared.
(Figure 8) For any output point becoming dense/sparse on a given camera, we take its feature vector (and drop the position); call these features and the current time . We then retrieve all detected faces for that camera at times , which is when a human should have either picked or placed the object that made the dense region appear/disappear. For any face we add the pair to a list with a score of , which aims at distributing the responsibility of the action between the humans present.
Now, at query time we want to know how much each human interacted with each object. We pick a representative picture of every object and every human to use as queries. We compute the pair of feature vectors, compare against each object-face pair in the list of interactions and sum its weight if both the objects and the faces are close. This estimates the number of interactions between human and object.
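The bookkeeping in the steps above can be sketched as follows. This is a simplified illustration, not the system used in the experiments: the function names and thresholds are hypothetical, a single distance function is used for objects and faces, and the score of one over the number of faces present is an assumed way of sharing responsibility among the humans present.

```python
import math

def diff_outputs(prev, curr, radius, dist=math.dist):
    """Outputs with no counterpart nearby in the other step."""
    appeared = [c for c in curr if all(dist(c, p) > radius for p in prev)]
    disappeared = [p for p in prev if all(dist(p, c) > radius for c in curr)]
    return appeared, disappeared

def record_interactions(log, object_features, faces):
    """Log one (object, face, score) entry per face seen around the event.
    The score 1/len(faces) is an assumption of this sketch."""
    for face in faces:
        log.append((object_features, face, 1.0 / len(faces)))

def interaction_score(log, object_query, face_query, obj_thr, face_thr,
                      dist=math.dist):
    """Sum the scores of logged pairs close to both the object and the face query."""
    return sum(
        score
        for obj, face, score in log
        if dist(obj, object_query) <= obj_thr and dist(face, face_query) <= face_thr
    )

prev_step = [(0.0, 0.0), (5.0, 5.0)]
curr_step = [(0.1, 0.0), (9.0, 9.0)]
appeared, disappeared = diff_outputs(prev_step, curr_step, radius=0.3)

mug, alice, bob = (1.0, 0.0), (0.0, 0.0), (10.0, 10.0)   # toy feature vectors
log = []
record_interactions(log, mug, [alice, bob])   # two people present at the event
record_interactions(log, mug, [alice])        # only one person present
print(appeared, disappeared)
print(interaction_score(log, mug, alice, obj_thr=0.5, face_thr=0.5))
```

Ranking the humans by this score for a given object yields the predicted top human for that object.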
Results are shown in table 1. There is one row per object. For each object, there was a true primary human who interacted with it the most. The columns correspond to: the number of times the top human interacted with the object, the number of times the system predicted the top human interacted with the object, the rank of the true top human in the predictions, and explanations. hac successfully solves all but the extremely noisy cases, despite the dataset being hard and hac receiving no labels and no specific training.
| top human | pred. human | human (of 8) |
In many datasets we can find entities, subsets of the data with internal consistency, such as people in a video, popular topics from Twitter feeds, or product properties from sentences in its reviews. Currently, most practitioners wanting to find such entities use clustering.
We have demonstrated that the problem of entity finding is well-modeled as an instance of the heavy hitters problem and provided a new algorithm, hac, for heavy hitters in continuous non-stationary domains. In this approach, entities are specified by indicating how close data points have to be in order to be considered from the same entity and when a subset of points is big enough to be declared an entity. We showed, both theoretically and experimentally, that random sampling (on which hac is based) works surprisingly well on this problem. Nevertheless, future work on more complex or specialized algorithms could achieve better results.
We used this approach to demonstrate a home-monitoring system that allows a wide variety of post-hoc queries about the interactions among people and objects in the home.
We gratefully acknowledge support from NSF grants 1420316, 1523767 and 1723381 and from AFOSR grant FA9550-17-1-0165. F. Alet is supported by a La Caixa fellowship. R. Chitnis is supported by an NSF GRFP fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
We want to thank Marta Alet, Sílvia Asenjo, Carlota Bozal, Eduardo Delgado, Teresa Franco, Lluís Nel-lo, Marc Nel-lo and Laura Pedemonte for their collaboration in the experiments and Maria Bauza for her comments on initial drafts.
- [Blum et al.2016] Avrim Blum, John Hopcroft, and Ravindran Kannan. Foundations of data science. 2016.
- [Braverman et al.2017] Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler, and Lin F Yang. Clustering high dimensional dynamic data streams. arXiv preprint arXiv:1706.03887, 2017.
- [Cao et al.2006] Feng Cao, Martin Ester, Weining Qian, and Aoying Zhou. Density-based clustering over an evolving data stream with noise. In SIAM international conference on data mining, 2006.
- [Chaudhuri and Dasgupta2010] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In NIPS, 2010.
- [Chen and Tu2007] Yixin Chen and Li Tu. Density-based clustering for real-time stream data. In ACM SIGKDD International Conference On Knowledge Discovery And Data Mining, pages 133–142, 2007.
- [Chen and Zhang2016] Di Chen and Qin Zhang. Streaming algorithms for robust distinct elements. In International Conference on Management of Data, 2016.
- [Donahue et al.2014] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
- [Ester et al.1996] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, volume 96, pages 226–231, 1996.
- [Johnson et al.2016] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. Google’s multilingual neural machine translation system: enabling zero-shot translation. arXiv preprint arXiv:1611.04558, 2016.
- [King2009] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10:1755–1758, 2009.
- [Larsen et al.2016] Kasper Green Larsen, Jelani Nelson, Huy L Nguyên, and Mikkel Thorup. Heavy hitters via cluster-preserving clustering. In Foundations of Computer Science (FOCS), 2016 IEEE 57th Annual Symposium on, pages 61–70. IEEE, 2016.
- [Laurent and Massart2000] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
- [Manku and Motwani2002] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In VLDB’02: Proceedings of the 28th International Conference on Very Large Databases, pages 346–357. Elsevier, 2002.
- [Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- [Newman2005] Mark EJ Newman. Power laws, pareto distributions and zipf’s law. Contemporary physics, 46(5):323–351, 2005.
- [Niebles et al.2008] Juan Carlos Niebles, Hongcheng Wang, and Li Fei-Fei. Unsupervised learning of human action categories using spatial-temporal words. IJCV, 79(3), 2008.
- [Nirjon and Stankovic2012] Shahriar Nirjon and John A Stankovic. Kinsight: Localizing and tracking household objects using depth-camera sensors. In Distributed Computing in Sensor Systems (DCOSS), 2012 IEEE 8th International Conference on, 2012.
- [Otto et al.2017] Charles Otto, Anil Jain, et al. Clustering millions of faces by identity. IEEE PAMI, 2017.
- [Pedregosa et al.2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [Pinheiro et al.2016] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In European Conference on Computer Vision, 2016.
- [Rituerto et al.2016] Alejandro Rituerto, Henrik Andreasson, Ana C Murillo, Achim Lilienthal, and José Jesús Guerrero. Building an enhanced vocabulary of the robot environment with a ceiling pointing camera. Sensors, 16(4), 2016.
- [Rodriguez and Laio2014] Alex Rodriguez and Alessandro Laio. Clustering by fast search and find of density peaks. Science, 344(6191):1492–1496, 2014.
- [Schroff et al.2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
- [Szegedy et al.2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
- [Vitter1985] Jeffrey S Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS), 11(1):37–57, 1985.
- [Wan et al.2009] Li Wan, Wee Keong Ng, Xuan Hong Dang, Philip S Yu, and Kuan Zhang. Density-based clustering of data streams at multiple resolutions. ACM Transactions on Knowledge discovery from Data (TKDD), 3(3):14, 2009.
Appendix A: Proofs and detailed theoretical explanations
|Thm. A.3 and corollary A.2.1 prove that with high probability:|
|All||-dense pts||will||have an output within|
|For hac with radius , thm. A.1 and corollary A.2.1 prove:|
|Most||-dense pts||will||have an output within|
|Fig. 6 and thm A.11 show that in high dimensions:|
|Most||-dense pts||will||” ” ” within|
We make a guarantee for every dense or sparse point in space, even those that are not in the dataset. Our guarantees are probabilistic; they hold with probability where is a parameter of the algorithm that affects the memory usage. We have three types of guarantees, from loose but very certain, to tighter but less certain. Those guarantees are summarized in table 2. For simplicity, the guarantees in that table assume that there’s a single radius . We also start by proving properties of the single radius algorithm.
First we prove that if we run most -dense points will have an output within distance using a small amount of memory (in particular, independent of the length of the stream). Notice that, for practical values such as we’re guaranteeing that an -dense point will be covered with probability.
Let . For any -interesting point , outputs a point within distance with probability . Moreover, it always needs at most memory and time per point. Finally, it outputs at most points.
Proof We maintain independent points that hop to the -th point with probability . Each carries an associated counter: the number of points that came after its last hop and were within of its current position. When the algorithm is asked for centers, it returns every point in memory whose counter is greater than .
By triangular inequality any point within ’s ball will count towards any other point in the sphere, since we’re using a radius of . Moreover, the first points within ’s sphere will come before at least a fraction of the points that are within ’s ball. Therefore there’s at least a fraction of points within distance of point that, if sampled, would be returned.
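The hop-and-count procedure in this paragraph can be illustrated with a minimal Python sketch. The symbols of the original statement are not recoverable here, so several choices are assumptions: the hop probability is taken to be the reservoir-sampling value `1/i` for the `i`-th point (in the spirit of [Vitter1985]), the output rule compares each counter against a hypothetical `threshold_fraction` times the stream length, and the names `SingleRadiusSketch`, `process` and `outputs` are illustrative only.

```python
import random
from typing import Callable, List, Optional, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any other metric could be passed instead."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class SingleRadiusSketch:
    """Illustrative sketch (not the authors' code) of hop-and-count sampling."""

    def __init__(self, num_samples: int, radius: float,
                 threshold_fraction: float,
                 dist: Callable[[Point, Point], float] = euclidean,
                 seed: int = 0) -> None:
        self.radius = radius
        # ASSUMPTION: the output cutoff is threshold_fraction * stream length.
        self.threshold_fraction = threshold_fraction
        self.dist = dist
        self.rng = random.Random(seed)
        self.n = 0  # number of stream points seen so far
        self.positions: List[Optional[Point]] = [None] * num_samples
        self.counters: List[int] = [0] * num_samples

    def process(self, p: Point) -> None:
        """Feed one stream point to every independent sample."""
        self.n += 1
        for j in range(len(self.positions)):
            # ASSUMPTION: hop to the n-th point with probability 1/n.
            # At n == 1 this is always true, so no position stays None.
            if self.rng.random() < 1.0 / self.n:
                self.positions[j] = p
                self.counters[j] = 0  # counter restarts after each hop
            elif self.dist(self.positions[j], p) <= self.radius:
                self.counters[j] += 1

    def outputs(self) -> List[Point]:
        """Return the sampled positions whose counter exceeds the cutoff."""
        cutoff = self.threshold_fraction * self.n
        return [pos for pos, c in zip(self.positions, self.counters)
                if c > cutoff]
```

Memory and per-point time are proportional to `num_samples` and do not grow with the length of the stream, which matches the claim in the statement above.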
We have samples. The probability that none of that fraction gets sampled is:
Therefore the probability that at least one sample is within that fraction (and therefore that there is at least one output within of ) is at least .
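As a numeric illustration of this step, the sketch below evaluates the generic miss probability for independent uniform samples. The paper's exact expression is not reproduced; `good_fraction` (the fraction of stream points that would be returned if sampled) and `delta` (the allowed failure probability) are hypothetical parameter names. It relies only on the standard facts that `m` independent samples all miss a fraction `p` with probability `(1 - p)^m`, and that `(1 - p)^m <= exp(-p * m)`.

```python
import math


def miss_probability(good_fraction: float, num_samples: int) -> float:
    """Probability that none of num_samples independent uniform samples
    lands in a subset containing good_fraction of the stream."""
    return (1.0 - good_fraction) ** num_samples


def samples_needed(good_fraction: float, delta: float) -> int:
    """Smallest number of samples with miss probability at most delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - good_fraction))
```

For example, with `good_fraction = 0.1` and `delta = 0.05`, `samples_needed` returns 29, a number that does not depend on the stream length.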
Now we want to prove that the same algorithm will not output points near sufficiently sparse points.
If we run , any -sparse point will not have an output point within .
Let us prove it by contradiction. Let be a -sparse point. Suppose outputs a point within distance of . By triangular inequality, any point within distance of the output is also within distance of . Since a point has to have at least a fraction within distance to be output, this implies there is at least a fraction within of . However, this contradicts the assumption that was -sparse.
If we run , any -sparse point will not have an output within and any -sparse point will not have an output within .
Proof Use and in the previous lemma.
We have shown that, when running , most -dense points will have an output within and none of the -sparse points will. Therefore we can use HAC as a dense/noise detector by checking whether a point is within of an output.
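The detector use just described amounts to a nearest-output check. A minimal sketch, assuming the outputs are already available as a list of points; `detect_radius` is a hypothetical name for the detection distance, whose exact value in terms of the algorithm's radius is given by the guarantees above and is not restated here.

```python
from typing import Callable, Iterable, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; any metric satisfying the triangle inequality works."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def is_near_output(point: Point, outputs: Iterable[Point],
                   detect_radius: float,
                   dist: Callable[[Point, Point], float] = euclidean) -> bool:
    """Label a query point as dense (True) if some output lies within
    detect_radius of it, and as noise (False) otherwise."""
    return any(dist(point, o) <= detect_radius for o in outputs)
```

The query point does not need to belong to the stream, in line with the guarantees covering every point in space.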
We now want a probabilistic guarantee that works for all dense points, not only for most of them. Notice there may be an uncountable number of dense points and thus we cannot prove it simply using probability theory; we need to find a correlation between results. In particular we will create a finite coverage: a set of representatives that is close to all dense points. Then we will apply theorem A.1 to those points and translate the result of those points to all dense points.
Let . With probability , for any -interesting point , outputs a point within distance . Moreover, it always needs at most memory and time per point. Finally, it outputs at most points.
Proof Let be the set of -dense points. Let be the biggest subset of such that for any . Since the pairwise intersection is empty and for any , we have . However, , so we must have .
We now look at a single run of . Using theorem A.1, for any the probability of having a center within is at least . Therefore, by union bound the probability that all have a center within is at least: .
Let us assume that all points in have an output within . Let us show that this implies something about all dense points, not just those in the finite coverage. For any point s.t. . If that were not the case, we could add to , contradicting its maximality. Since their balls of radius intersect this implies their distance is at most . We now know s.t. and that center s.t. . Again by triangular inequality, point will have a center within distance .
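The finite-coverage argument can be illustrated on a finite point set with a greedy construction. This is only a sketch of the idea: the proof works with the full (possibly uncountable) set of dense points, whereas the code takes a list, and `separation` is a hypothetical parameter standing for the distance below which two balls intersect. Because the returned set is maximal, every input point lies within `separation` of some representative, which is the property the proof uses.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance; the argument only needs the triangle inequality."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def greedy_representatives(points: Sequence[Point], separation: float,
                           dist: Callable[[Point, Point], float] = euclidean
                           ) -> List[Point]:
    """Build a maximal subset whose members are pairwise more than
    `separation` apart, scanning the points in order."""
    reps: List[Point] = []
    for p in points:
        # Keep p only if it is far from every representative chosen so far.
        if all(dist(p, q) > separation for q in reps):
            reps.append(p)
    return reps
```

A guarantee proved for each representative then transfers to every point at the cost of one extra `separation` in distance, by the triangle inequality.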
Both runtime and memory are directly proportional to the number of samples, which we specified to be .
Let us now move to the multiple radii case. For that we need the following definition:
is the smallest s.t. is -dense. For each point we refer to its circle/ball as the sphere of radius centered at .
Note that now any point will be dense for some . Given that all points are dense for some , there are two ways of giving guarantees:
All output points are paired with the radius needed for them to be dense. Then, guarantees can be made about outputs of a specific radius.
We can still have a , for which all guarantees for the single radius case apply directly.
When we pair outputs with radii, we call an output that needed a radius to be dense an output of radius . In that case, we can make a very general guarantee about not putting centers near sufficiently sparse regions, where sparsity is a term relative to .
If we run ; for any point , there will not be an output of radius within distance less than .
Proof Similar to A.2, we can assume there is an output point within that distance and apply triangular inequality. We then see that all points within distance of the output would be within distance of . However, we know that the output has at least a fraction within distance , contradicting the minimality of .
For any tuple , for any point s.t. our algorithm will give an output point within of at most radius with probability at least .
Moreover, the algorithm always needs at most memory and time per point. Finally, it outputs at most points.
Proof Let us run our algorithm with multiple radii and then filter only the outputs of radius less than . Since radii are discretized we are actually filtering by the biggest radius of the form . Nevertheless, since there’s one of those radii for every scale, there must be one between and , let’s call it . Running the multiple radii version and then filtering by is equivalent to running the single radius version with radius . Since , counters for must all be at least as big as for and thus the outputs for are a superset of those for . We can apply the equivalent theorem for a single radius (thm A.1) to know that if we had run the single radius version we would get an output within with probability at least . Therefore the filtered version of multiple radii must also do so. Since the filter keeps at least one output of radius less than within distance , the multiple radii version will output such a center with probability at least .
Since memory mainly consists of an array of dimensions , the memory cost is . Notice that, to process a point we do not go over all discrete radii but rather only add a counter to the smallest radius that contains it, therefore the processing time per point is .
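The per-point update described here (increment only the smallest discretized radius that contains the new point) can be sketched for a single sample as follows. The exact form of the discretized radii is not recoverable from the text, so a geometric grid `r_min * growth**k` is an assumption, as are the helper names. Counts within any grid radius are recovered by a prefix sum over the buckets.

```python
import bisect
from itertools import accumulate
from typing import List


def radius_grid(r_min: float, growth: float, levels: int) -> List[float]:
    """ASSUMPTION: radii form a geometric grid r_min * growth**k."""
    return [r_min * growth ** k for k in range(levels)]


def add_to_smallest_bucket(buckets: List[int], radii: List[float],
                           distance: float) -> None:
    """Increment only the bucket of the smallest radius >= distance.
    The binary search keeps the per-point work logarithmic in the grid size."""
    k = bisect.bisect_left(radii, distance)
    if k < len(radii):  # points beyond the largest radius are ignored
        buckets[k] += 1


def counts_within(buckets: List[int]) -> List[int]:
    """Number of counted points within each grid radius (prefix sums)."""
    return list(accumulate(buckets))
```

Keeping one such bucket array per sample gives the two-dimensional array of counters mentioned in the text.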
For any tuple , with probability , for any point s.t. our algorithm will give an output point within of at most radius .
Moreover, the algorithm always needs at most memory and time per point. Finally, it outputs at most points.
Note that we proved guarantees for all points and for all possible metrics. Using only triangular inequality we were able to get reasonably good guarantees for an uncountable number of points, even those not in the dataset. We finally argue that the performance of hac in high dimensions is guaranteed to be almost optimal.
A.1 The blessing of dimensionality: stronger guarantees in high dimensions
In high dimensions many clustering algorithms fail; in contrast, our performance can be shown to be provably good in high dimensions. We will prove asymptotically good performance for dimension with a convergence fast enough to be meaningful in real applications. In particular, we will prove the following theorem:
Let , , . Let -dimensional samples come from Gaussians with means infinitely far apart, unit variance and empirical frequencies . If we run hac with radius and frequency , any point with will have an output within with associated radius () at most with probability at least .
Moreover, the probability that a point has yet has an output nearby is at most
Later, we will add 2 conjectures that make guarantees applicable to our experiments. Since the proof is rather long, we first give a roadmap and intuition.
If we fix a point in Gaussian we can look at other points and their distance to ; we call this distribution . is the distance for which a fraction of the dataset is within of . Since all but of points are infinitely far away, is equivalent to the quantile of . One problem is that this quantile is a random variable, which we will have to bound probabilistically.
Remember that the quantile function of the theoretical distribution is simply the inverse of the Cumulative Distribution Function; i.e. there is a probability that a sample is smaller than the quantile. We denote the quantile for by , sometimes omitting when implicit; notice is a function. For finite data, samples don’t follow the exact CDF and therefore quantiles are random variables; we denote these empirical quantiles by . We refer to figure 9 for more intuition.
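To make the notion of an empirical quantile concrete, here is a minimal sketch that computes it from a finite sample and uses it to obtain the radius of a point, i.e. the smallest distance within which a given fraction of the dataset lies. The function names and the convention of taking the `ceil(q * n)`-th smallest value are assumptions made for illustration.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]


def empirical_quantile(values: Sequence[float], q: float) -> float:
    """Smallest value v such that at least a fraction q of the samples
    are <= v (ASSUMED convention: the ceil(q * n)-th order statistic)."""
    if not values or not 0.0 < q <= 1.0:
        raise ValueError("need a non-empty sample and 0 < q <= 1")
    ordered = sorted(values)
    k = math.ceil(q * len(ordered))
    return ordered[k - 1]


def radius_for_fraction(x: Point, data: Sequence[Point], fraction: float,
                        dist: Callable[[Point, Point], float]) -> float:
    """Smallest distance r such that at least `fraction` of `data`
    lies within distance r of x."""
    return empirical_quantile([dist(x, y) for y in data], fraction)
```

Since the result depends on the particular sample drawn, repeating the computation on fresh data gives slightly different values, which is why the proof has to bound the empirical quantiles probabilistically.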
Model the data as a set of -dimensional Gaussians with the same variance but different means. If we want to have uniform noise, we can have many Gaussians with only 1 sample.
Without loss of generality (everything is the same up to scaling) assume .
Most points in a high dimensional Gaussian lie in a shell between and , for a small constant (lemma A.7). We will restrict our proof to points in that shell.
The function we care about, , from a particular fixed point to points coming from the same Gaussian follows a non-central chi distribution, a complex distribution with few known bounds; we will thus try to avoid using it.
The distribution where follows a (central) chi-squared distribution, a well-studied distribution with known bounds.
where and where are very similar distributions because most are in the shell. Bounds on the former distribution will imply bounds on the latter.
We need to fix and only sample . We show quantiles of the distribution are 1-Lipschitz and use it along with Bolzano’s Theorem to get bounds with fixed shell.
Since we care about finite-data bounds, we need bounds on empirical quantiles; we bridge the gap from theoretical quantiles using Chernoff bounds.
We will see that quantiles are all very close together because in high dimensional Gaussians most points are roughly at the same distance. For any point we will be able to bound its radius using the bounds on quantiles of the distance function.
With this bound we will be able to bound the ratio between the radius of a point and the distance to its closest output or the radius of such output.
We join all the probabilistic assertions made in the previous steps via the union bound, getting a lower bound on the probability that all the assertions are simultaneously true.
In high dimensions, Gaussians look like high dimensional shells with all points being roughly at the same distance from the center of the cluster, which is almost empty. We will first assume that the Gaussians are infinitely far apart and have variance 1. For many lemmas we will assume mean 0 since it doesn’t lose generality for those proofs.
We first use lemma 2.8, found in an online version of [Blum et al.2016] (https://www.cs.cmu.edu/~venkatg/teaching/CStheory-infoage/chap1-high-dim-space.pdf), which was substituted by a weaker lemma in the final book version. This lemma formalizes the intuition that most points in a high dimensional Gaussian are in a shell:
For a -dimensional spherical Gaussian of variance 1, a sample will be outside the shell with probability at most for any .
We will prove that things work well for points inside the shell; which for it’s of points and it’s . For future proofs let us denote ; to further simplify notation we will sometimes omit the dependence on .
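The shell phenomenon is easy to check numerically. The sketch below samples from a spherical Gaussian of variance 1 using only the standard library and measures how many samples have a norm close to the square root of the dimension. The shell half-width `width` is a hypothetical parameter; it does not reproduce the constants of the lemma.

```python
import math
import random
from typing import List


def sample_norms(dim: int, num_points: int, seed: int = 0) -> List[float]:
    """Norms of num_points samples from a dim-dimensional standard Gaussian."""
    rng = random.Random(seed)
    return [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
            for _ in range(num_points)]


def fraction_in_shell(norms: List[float], dim: int, width: float) -> float:
    """Fraction of samples whose norm is within `width` of sqrt(dim)."""
    center = math.sqrt(dim)
    return sum(1 for n in norms if abs(n - center) <= width) / len(norms)
```

For instance, in dimension 400 the norms concentrate around 20 with a spread below 1, so a shell of half-width 3 already contains essentially every sample.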
Let with and , but no restriction on the norm of . Then:
If we forget for a moment about the shell and consider then and .
We now observe that there are two options for , either it is inside the shell () or outside. Since the probability of being inside the shell is very high, and are very close. In the worst case, using that and are chosen independently, we have:
using we can get the following inequalities:
Now [Laurent and Massart2000] shows that:
Remember that , we transform into by taking the square root and multiplying by :
To shorten formulas let us denote the lowerbound by and the upperbound by . Finally, if we look at the and quantiles we know from the equations above that they must be above and below .
Note that setting we get bounds on quantiles and setting we get bounds on quantiles .
We now have bounds on theoretical quantiles for ; as mentioned before we can translate them to bounds on getting probabilities bounded by .
Up until now we have proved things about arbitrary . Our ultimate goal is proving that the radius for a particular point in the shell cannot be too big or too small. To reflect this change in goal we change the notation from to . is defined as the minimum distance for a fraction of the dataset to be within distance of . Therefore we care about samples from with constant . Since is sampled only once those samples are correlated and we have to get different bounds.
Let be fixed. Let be the theoretical . Then the quantiles and are both contained in .
Proof From the previous lemma A.8 we know that when is not fixed, the quantiles and from that distribution are in .
By rotational symmetry of the Gaussian we know that this distribution only depends on the radius ; overriding notation let us call it .
Let us now consider two radii , . We can consider the path from to passing through , which upperbounds the distance from to by triangular inequality. The shortest path from to is following the line from to the origin taking length .
We thus have that and thus the Cumulative Distribution Function of is upperbounded by shifted by .
As figure 11 illustrates, this implies that the quantiles of are -Lipschitz and, in particular, also continuous. Remember that a function is -Lipschitz if .
Since and are chosen independently, we can first select then . Let us consider three options:
. Taking the lower fraction for every represents fraction of the total samples . We have data of fraction all less than . This contradicts the definition of quantile.
. By definition of no other sample can be below which implies that the quantile is above . Again this contradicts the definition of quantile.
. Since is continuous, by Bolzano’s Theorem we know:
After this, as shown in figure 12, we apply that quantiles are 1-Lipschitz and since the maximum distance in that interval is we know that for all points in the shell their theoretical quantiles and must be inside .
Note that we now have bounds on theoretical quantiles. Empirical quantiles (those that we get when the data comes through) will be noisier for finite data and thus a bit more spread out, as illustrated in figure 9. This difference can be bounded with Chernoff bounds. In particular let us compare the probability that the empirical and quantiles are more extreme than the theoretical and quantiles.
Let us have fixed s.t. and take samples . Then with probability higher than the empirical quantile of