Tight Bound of Incremental Cover Trees for Dynamic Diversification

06/15/2018 ∙ by Hannah Marienwald, et al. ∙ Berlin Institute of Technology (Technische Universität Berlin)

Dynamic diversification---finding a set of data points with maximum diversity from a time-dependent sample pool---is an important task in recommender systems, web search, database search, and notification services, to avoid showing users duplicate or very similar items. The incremental cover tree (ICT) with high computational efficiency and flexibility has been applied to this task, and shown good performance. Specifically, it was empirically observed that ICT typically provides a set with its diversity only marginally (∼ 1/ 1.2 times) worse than the greedy max-min (GMM) algorithm, the state-of-the-art method for static diversification with its performance bound optimal for any polynomial time algorithm. Nevertheless, the known performance bound for ICT is 4 times worse than this optimal bound. With this paper, we aim to fill this very gap between theory and empirical observations. For achieving this, we first analyze variants of ICT methods, and derive tighter performance bounds. We then investigate the gap between the obtained bound and empirical observations by using specially designed artificial data for which the optimal diversity is known. Finally, we analyze the tightness of the bound, and show that the bound cannot be further improved, i.e., this paper provides the tightest possible bound for ICT methods. In addition, we demonstrate a new use of dynamic diversification for generative image samplers, where prototypes are incrementally collected from a stream of artificial images generated by an image sampler.

1 Introduction

The diversification problem is an important issue in a number of popular applications, e.g., in recommender systems [24, 23, 22], web search [11, 20], and database search [19, 6, 15]. Diversity helps to avoid showing users duplicate or very similar items. Formally, it is defined as follows: (Diversification problem) Let $P$ be a set of points in some metric space with distance metric $d$. The goal of the $k$-max-min diversification problem (for $k \ge 2$) is to select a subset $S^*$ of $P$ such that

$$S^* = \operatorname*{argmax}_{S \subseteq P,\ |S| = k} \operatorname{div}(S), \tag{1}$$

where the diversity is defined as the minimum distance of any pair of data points in a set, i.e.,

$$\operatorname{div}(S) = \min_{p_i, p_j \in S,\ p_i \neq p_j} d(p_i, p_j). \tag{2}$$

This problem is known to be NP-hard [9], and a variety of approximation methods have been proposed [9, 18, 10, 20, 16, 8]. Among them, greedy max-min (GMM) [18] is one of the state-of-the-art methods, although it was proposed several decades ago [10, 8]. It was shown that GMM is guaranteed to provide an approximate solution with its diversity no smaller than $\operatorname{div}^*/2$, where $\operatorname{div}^*$ is the diversity of the optimal solution [18]. Furthermore, it was also shown that $\operatorname{div}^*/2$ is the best achievable bound by any polynomial time algorithm [18].

Figure 1: Achieved inverse diversities relative to GMM on the Cities, Faces and MNIST data set. Detailed setting of experiments is described in Section 4.

Recently, in many web applications including recommender systems, web search, and notification services, the diversification algorithm is required to handle streaming data, where the sample pool is dynamic: new items can be added and old items can be removed from time to time. Drosou and Pitoura (2014) [8] argued that the cover tree (CT), which was originally proposed for nearest neighbor search [3, 12], is an appropriate tool to meet those streaming requirements. In particular, a family of CT approaches, called the incremental cover trees (ICTs), is suitable for dynamic diversification, because it allows addition and removal of items online. The authors reported excellent empirical performance. In particular, two variants, called ICT Greedy and ICT Inherit, typically provide a set with its diversity only about 1.2 times smaller than GMM (see Fig. 1 for our experimental results).

Nevertheless, the theoretically known performance bound of ICT is $\operatorname{div}^*/8$ [8], which is 4 times worse than GMM. The goal of this paper is therefore to fill this gap between theory and empirical observations. We theoretically analyze the property of ICT, and derive a tighter performance bound, $\operatorname{div}^*/6$, for ICT Greedy and ICT Inherit. Then, we investigate the obtained bound and the empirical observations in detail, by using specially designed artificial data for which the optimal diversity is known. Finally, we analyze the tightness of performance bounds, and show that our proposed bounds cannot be improved further, i.e., for worst case analysis, this paper gives the tightest possible bounds of all ICT variants. In addition, we demonstrate a new application of dynamic diversification for generative image samplers, where prototypes are incrementally collected from a stream of artificial images generated by an image sampler.

2 Background

As discussed above, the $k$-max-min diversification problem is known to be NP-hard [18]. Therefore, various approximation algorithms have been proposed (see [13] for an overview). In order to assess the accuracy of their approximation, a theoretical bound on the worst possible performance is usually given. [21] ($\alpha$-approximation algorithm) An $\alpha$-approximation algorithm for an optimization problem produces for all instances of the problem a solution whose value is within a factor of $\alpha$ of the value of the optimal solution. $\alpha$ is called the approximation factor. Let $S_A$ be a diverse subset of $P$ of size $k$, which was computed using method $A$. $A$ is said to be an $\alpha$-approximation algorithm, iff for all possible diversification problems

$$\frac{\operatorname{div}^*}{\operatorname{div}(S_A)} \le \alpha, \tag{3}$$

where $\operatorname{div}^*$ is the optimal diversity. Small values for $\alpha$ indicate better guaranteed performance, but because the approximate solution can at most be as good as the optimal solution, $\alpha \ge 1$. Along with the computational complexity, the approximation factor is an important criterion that theoretically guarantees the performance of an algorithm.

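Input: $P$ - a set of points, $k$ - size of the subset
Output: $S$ - diverse subset of $P$ of size $k$
1:  $S \leftarrow \{p\}$ for a randomly selected point $p \in P$
2:  while $|S| < k$ do
3:     $p^* \leftarrow \operatorname{argmax}_{p \in P \setminus S} \min_{s \in S} d(p, s)$
4:     $S \leftarrow S \cup \{p^*\}$
5:  end while
6:  return $S$
Algorithm 1 Greedy Max-Min (GMM)

Input: $T$ - a cover tree for data set $P$, $k$ - size of the subset
Output: $S$ - diverse subset of $P$ of size $k$
1:  $l \leftarrow l_{\max}$
2:  while $|C_l| < k$ do
3:     $l \leftarrow l - 1$
4:  end while
5:  $S \leftarrow$ select $k$ nodes from $C_l$
6:  return $S$
Algorithm 2 Cover Tree for $k$-Max-Min Diversification Problem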

In this section we summarize GMM, one of the most established algorithms, and cover tree-based approaches. We give for each method the complexity and the approximation factor.

2.1 Greedy Max-Min (GMM) Algorithm

Although proposed several decades ago, greedy max-min (GMM) [18] is still state-of-the-art for the diversification problem. A detailed description of the algorithm can be found in Algorithm 1. GMM approximates the diverse subset in a greedy manner, starting with either a randomly selected data point (line 1) [1] or the two most distant data points of the set [18]. In subsequent iterations the data point with the largest pairwise distance to the current subset is added to the diverse set (line 3, 4). As the authors showed, GMM has an approximation factor of 2 and a complexity that is in .
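As an illustration, Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names `diversity` and `gmm`, the `seed` parameter, the toy data, and the use of the Euclidean distance (`math.dist`) are assumptions made for the example. The sketch caches each point's distance to the current subset, so every iteration scans the data once.

```python
import math
import random


def diversity(subset, dist=math.dist):
    """Minimum pairwise distance within `subset` (needs at least two points)."""
    return min(dist(p, q) for i, p in enumerate(subset) for q in subset[i + 1:])


def gmm(points, k, dist=math.dist, seed=0):
    """Greedy max-min selection of k points (illustrative sketch of Algorithm 1)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    first = random.Random(seed).randrange(len(points))  # random start (line 1)
    selected = [first]
    # closest[i] = distance from points[i] to the nearest selected point
    closest = [dist(p, points[first]) for p in points]
    while len(selected) < k:
        # the point farthest from the current subset (line 3) ...
        nxt = max(range(len(points)), key=closest.__getitem__)
        selected.append(nxt)  # ... joins the subset (line 4)
        closest = [min(c, dist(p, points[nxt])) for c, p in zip(closest, points)]
    return [points[i] for i in selected]


# Four points on a line; the best 3-subset {0, 2, 10} has diversity 2.
pool = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
subset = gmm(pool, k=3)
```

On this toy pool the result depends on the random start, but in line with the approximation factor of 2 every start yields a diversity of at least half the optimum.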

2.2 Incremental Cover-tree (ICT) Approaches

Although originally proposed for sublinear-time $k$-nearest neighbor search, the cover tree [4] can easily be adapted for diverse set approximation. A cover tree for a data set $P$ is a leveled tree such that each layer of the tree covers the layer beneath it. Every layer of the tree is associated with an integer level $l$, which decreases as we descend the tree. The lowest layer of the tree holds the whole data set $P$ and is associated with the smallest level $l_{\min}$, whereas the root of the tree is associated with the largest level $l_{\max}$. A node in the tree corresponds to a single data point in $P$, but a data point might map to multiple nodes. However, any point can only appear once in each layer. Let layer $C_l$ be the set of all nodes at level $l$. For all levels $l$ with $l_{\min} < l \le l_{\max}$, the following invariants must be met: (1) Nesting: $C_l \subseteq C_{l-1}$. Once a point $p$ appears in $C_l$, every lower layer in the tree has a node associated with $p$. (2) Covering: For every $p \in C_{l-1}$, there exists a $q \in C_l$ such that $d(p, q) \le b^l$ and the node in $C_l$ associated with $q$ is a parent of the node of $p$ in $C_{l-1}$. (3) Separation: For all distinct $p, q \in C_l$, $d(p, q) > b^l$. Here $b > 1$ denotes the base of the cover tree. The cover tree was proposed with $b = 2$, but extended to arbitrary bases in [8] (cf. Fig. 8 in Appendix F for cover tree examples).
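The three invariants can be checked mechanically on a small example. The following Python sketch is only an illustration under stated assumptions: it uses a hypothetical flat view of the tree (a dictionary from consecutive integer levels to lists of points, without parent pointers), it assumes the common convention that nodes at level $l$ are more than $b^l$ apart and cover the layer below within $b^l$, and the function name and toy data are made up for the example. Because parents are not modelled, the covering check only verifies that some node of the upper layer is close enough.

```python
import math


def satisfies_invariants(layers, base=2.0, dist=math.dist):
    """Check nesting, covering and separation on a flat layer view.

    `layers` maps an integer level l to the list of points in layer C_l
    (hypothetical representation; levels are assumed to be consecutive).
    """
    levels = sorted(layers, reverse=True)  # root level first
    for upper, lower in zip(levels, levels[1:]):
        # Nesting: every point of the upper layer reappears one layer below.
        if not set(layers[upper]) <= set(layers[lower]):
            return False
        # Covering: each lower point lies within base**upper of an upper point.
        for p in layers[lower]:
            if min(dist(p, q) for q in layers[upper]) > base ** upper:
                return False
    for level in levels:
        # Separation: distinct points of a layer are more than base**level apart.
        pts = layers[level]
        for i, p in enumerate(pts):
            if any(dist(p, q) <= base ** level for q in pts[i + 1:]):
                return False
    return True


# One-dimensional toy tree with base 2 (points are 1-tuples).
layers = {
    4: [(0,)],
    3: [(0,), (11,)],
    2: [(0,), (11,), (6,)],
    1: [(0,), (11,), (6,), (3,), (14,)],
}
valid = satisfies_invariants(layers)
```

In the toy tree, the point 6 first appears at level 2 because it is more than $2^2$ away from both 0 and 11, while 3 and 14 only fit at level 1.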

Due to the invariants, each layer of the tree might already be a useful approximation. The general procedure of using a cover tree for the $k$-max-min diversification problem is shown in Algorithm 2. It aims to find the termination layer of the cover tree, i.e., the first layer that holds at least $k$ nodes. In general, the termination layer holds more than $k$ nodes; thus, a subset of $k$ nodes must be selected (line 5). Possible selection strategies were introduced in [8] and are presented below. (Footnote 1: The authors of [8] also proposed a CT variant, called the cover-tree batch (CT Batch), with a guaranteed approximation factor. However, its cover-tree construction is as slow as GMM, and it cannot accept addition and removal of items in streaming data.)

ICT Basic

ICT Basic is the most straight-forward approach. It randomly selects nodes out of , with its complexity in . Because of the random selection, the diversity of the subset computed with ICT Basic might be the same as the diversity of the whole termination layer.

ICT Greedy

ICT Greedy combines the cover tree approach with GMM. After the termination layer was located, we apply GMM on in order to select nodes. Compared to the purely random approach, this selection strategy will, in most cases, give results with higher diversity. By applying GMM only on instead of , the complexity drastically reduces and is in .
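The two selection strategies described so far can be sketched as follows. This is a minimal illustration, not the implementation of [8]: the cover tree is again represented by a hypothetical dictionary from levels to lists of points, the function names and the toy data are assumptions, and the greedy variant starts from the first node of the layer instead of a random one so that the result is reproducible.

```python
import math
import random


def termination_layer(layers, k):
    """Return the first layer, scanning from the root down, with at least k nodes."""
    for level in sorted(layers, reverse=True):
        if len(layers[level]) >= k:
            return layers[level]
    raise ValueError("no layer holds k nodes")


def ict_basic(layers, k, seed=0):
    """ICT Basic: random choice of k nodes from the termination layer."""
    return random.Random(seed).sample(termination_layer(layers, k), k)


def ict_greedy(layers, k, dist=math.dist):
    """ICT Greedy: greedy max-min restricted to the termination layer."""
    layer = termination_layer(layers, k)
    selected = [layer[0]]  # fixed start for reproducibility (assumption)
    while len(selected) < k:
        rest = [p for p in layer if p not in selected]
        # add the node whose distance to the current subset is largest
        selected.append(max(rest, key=lambda p: min(dist(p, s) for s in selected)))
    return selected


# One-dimensional toy tree with base 2 (points are 1-tuples).
layers = {
    4: [(0,)],
    3: [(0,), (11,)],
    2: [(0,), (11,), (6,)],
    1: [(0,), (11,), (6,), (3,), (14,)],
}
greedy = ict_greedy(layers, k=4)  # the termination layer is level 1 (5 nodes)
basic = ict_basic(layers, k=4)
```

Both functions only touch the termination layer, which is why they are cheaper than running GMM on the whole data set.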

ICT Inherit
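Input: $T$ - a cover tree for data set $P$, $k$ - size of the subset
Output: $S$ - diverse subset of $P$ of size $k$
1:  $l \leftarrow l_{\max}$
2:  while $|C_l| < k$ do
3:     $l \leftarrow l - 1$
4:  end while
5:  $S \leftarrow C_{l+1}$
6:  while $|S| < k$ do
7:     $p^* \leftarrow \operatorname{argmax}_{p \in C_l \setminus S} \min_{s \in S} d(p, s)$
8:     $S \leftarrow S \cup \{p^*\}$
9:  end while
10:  return $S$
Algorithm 3 Cover Tree Inherit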

ICT Inherit is the most enhanced approach. It maintains the performance of ICT Greedy but further reduces the complexity. A detailed description can be found in Algorithm 3. Instead of applying GMM on the whole layer , we initialize the diverse subset with the previous layer (line 5) and only select some nodes from the termination layer (line 6 to 9). Due to the separation invariant of the cover tree, already has a high diversity and, therefore, is an adequate initialization for the selection process. The complexity of ICT Inherit is in .
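A possible Python rendering of the inherit strategy is given below. It is a sketch under assumptions rather than the authors' code: the tree is a hypothetical dictionary from consecutive integer levels to lists of points, some layer is assumed to hold at least $k$ nodes, and the function name and toy data are made up for the example.

```python
import math


def ict_inherit(layers, k, dist=math.dist):
    """Inherit the layer above the termination layer, then fill up greedily."""
    level = max(layers)                # start at the root level (line 1)
    while len(layers[level]) < k:      # descend to the termination layer (lines 2-4)
        level -= 1
    if level + 1 in layers:
        selected = list(layers[level + 1])  # inherit the previous layer (line 5)
    else:
        selected = [layers[level][0]]       # the root layer itself already suffices
    candidates = [p for p in layers[level] if p not in selected]
    while len(selected) < k:           # greedy max-min fill-up (lines 6-9)
        best = max(candidates, key=lambda p: min(dist(p, s) for s in selected))
        selected.append(best)
        candidates.remove(best)
    return selected


# One-dimensional toy tree with base 2 (points are 1-tuples).
layers = {
    4: [(0,)],
    3: [(0,), (11,)],
    2: [(0,), (11,), (6,)],
    1: [(0,), (11,), (6,), (3,), (14,)],
}
subset = ict_inherit(layers, k=4)
```

For $k = 4$ the three nodes of level 2 are inherited unchanged, and only one further node has to be chosen among the two remaining candidates of level 1.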

2.2.1 Approximation Factor

In [8], a first attempt to estimate the approximation factor was made.

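[8] For a subset $S$ of $k$ nodes selected from the termination layer of a cover tree with base $b$,

$$\operatorname{div}(S) \ge \frac{b - 1}{2 b^2} \operatorname{div}^*. \tag{4}$$

For a cover tree built with $b = 2$, this results in an approximation factor of $8$, regardless of which selection strategy is used. However, as we will see in Sec. 3, a lower and tighter approximation factor for ICT Greedy and ICT Inherit can be proven.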

2.3 Diversification of Dynamic Data

In many applications the set for which a diverse subset is required is not static. New data points must be added or old ones have to be removed from time to time. Unfortunately, GMM is not an adequate choice for the diversification of dynamic data: whenever the data set is changed, GMM has to be rerun from scratch. The cover tree, in contrast, is a dynamic data structure. Insertion and removal of data points pose no problem, and the cover tree can easily be used for the diversification of dynamic data using the approaches presented above. The complexity of adding or removing a data point depends on the expansion constant of the data set [4, 8].

3 Theoretical Analysis

Because of the cover tree properties, each layer of the cover tree might already be an appropriate starting point for the approximation of the diverse set. However, depending on the selection strategy, the quality of the diversity is likely to differ. We expect ICT Greedy and ICT Inherit to give results with higher diversity than ICT Basic. In the following we will prove that the approximation factor derived in [8] is only a loose bound when it comes to the GMM-based selection strategies. Let be the termination layer, i.e. the first layer of the cover tree that holds at least nodes

(5)

and be defined, such that

(6)

The diversity of every subset of can be bounded

(7)

Furthermore, will hold a subset of nodes with diversity of at least

(8)

and . It follows

(9)

We give a sketch of the proof. For a detailed discussion see Appendix B. Eq. (7) and the first part of the maximum in Eq. (8) follows from the separation property of the cover tree. The latter part of the maximum follows from the covering and nesting property. Any layer of the cover tree covers the whole data set. As a consequence is guaranteed to hold a subset of nodes, that cover the optimal diverse set. The triangle inequality can then be used, to bound the distance between those nodes. This gives rise to the existence of and the latter part of the maximum.

Figure 2: Visualization of the theoretical bounds derived in Sec. 3. (left) Guaranteed approximation factor for ICT Basic and ICT Inherit for different values of the base . ICT Greedy has the same approximation factor as ICT Inherit. (right) The bound on the diversity derived in Eq. (8) for various values of and . Separation Property shows the first part of the maximum and Covering Property shows the latter part of the maximum. The intercept of both lines shows what is stated in Eq. (9). The worst case approximation factor is 6.

For some it is crucial to select a subset of in an appropriate manner. Thus, the diversity strongly depends on the strategy, that is used to select nodes out of .

Let be a subset of nodes selected from using GMM (ICT Greedy). The diversity of is at least

(10)

The approximation factor of ICT Greedy is given by

(11)

For a detailed proof see Appendix C. The diversity of the selected subset cannot be worse than , because that bound is given by the separation criterion. As it was shown in [18], GMM has an approximation factor of . The best possible diversity in layer is given by Eq. (8). We get what is stated in Eq. (10).
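As a hedged worked check of how the two parts of the bound combine, the following derivation assumes the usual cover tree conventions: nodes of the termination layer $C_l$ are more than $b^l$ apart, and every descendant of a node in $C_l$ lies within $\sum_{i \le l} b^i = b^{l+1}/(b-1)$ of it. The symbol $S_{\mathrm{greedy}}$ for the subset selected by ICT Greedy is a name chosen for this sketch. Measured against $\operatorname{div}^*$, the separation term shrinks and the covering term grows as the termination layer moves down the tree, so the guarantee is weakest where the two terms coincide.

```latex
% Assumed conventions: separation > b^l in C_l, descendants within b^{l+1}/(b-1).
\begin{align*}
\operatorname{div}(S_{\mathrm{greedy}})
  &\ge \max\Bigl\{\, b^{l},\ \tfrac{1}{2}\Bigl(\operatorname{div}^{*} - \tfrac{2\,b^{l+1}}{b-1}\Bigr) \Bigr\}, \\
\text{worst case:}\quad
b^{l} &= \tfrac{1}{2}\Bigl(\operatorname{div}^{*} - \tfrac{2\,b^{l+1}}{b-1}\Bigr)
  \;\Longrightarrow\;
  \operatorname{div}^{*} = \frac{2(2b-1)}{b-1}\, b^{l}, \\
\frac{\operatorname{div}^{*}}{\operatorname{div}(S_{\mathrm{greedy}})}
  &\le \frac{2(2b-1)}{b-1}.
\end{align*}
```

The factor $\tfrac{1}{2}$ in the second term is the approximation factor of GMM applied inside the layer, and the subtracted term accounts for moving two optimal points to the nodes that cover them.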

For $b = 2$, this results in an approximation factor of $6$. Compared to ICT Greedy, ICT Inherit has a lower complexity. Moreover, ICT Inherit has the same bound on the diversity and the same approximation factor. Let $S$ be a subset of $k$ nodes selected from $C_l$. It holds all nodes from the previous layer $C_{l+1}$, and the remaining nodes were selected from $C_l$ using GMM (ICT Inherit). The bound on the diversity of $S$ is the same as the bound for ICT Greedy, i.e.

(12)

The approximation factor of ICT Inherit is given by

(13)

For a detailed proof, see Appendix D. Instead of starting with a randomly selected data point, ICT Inherit initializes GMM with the previous layer $C_{l+1}$. An approximation factor of 2 for GMM was proven by induction in [18]. In order to prove that initializing GMM with $C_{l+1}$ still leads to an approximation factor of 2, it is thus sufficient to show that the minimum pairwise distance in $C_{l+1}$ is larger than or equal to half of the optimal diversity of $C_l$. Because of the separation property of the cover tree, the nodes in $C_{l+1}$ are guaranteed to have larger pairwise distance than the nodes in $C_l$. Using also the nesting property $C_{l+1} \subseteq C_l$, we can conclude that the diversity of $C_{l+1}$ is sufficiently large.

Figure 2 (left) shows the approximation factor for different bases . Especially for , using ICT Greedy or ICT Inherit as selection strategy reduces the approximation factor. Figure 2 (right) shows the bound on the diversity derived in Eq. (8). One can see, for small the diversity of the subset is given by the separation criterion. This can also be explained intuitively. As increases, the pairwise distance that is bounded by the separation criterion decreases. However, the cover radius of each node decreases as well, so that at least nodes in must lie close to the data points in the optimal solution to be able to cover them. Therefore, the pairwise distance of those nodes increases as decreases. When the termination layer corresponds to a lower level of the tree (larger ), it is beneficial to use a GMM-based selection strategy. The minimum of both bounds corresponds to the approximation factor.

4 Tightness of Performance Bounds

According to our theory in Section 3, the approximation factor of ICT methods are , for . In this section, we investigate if these bounds are tight or whether there is still room for improvement. We first conducted an artificial data experiment, designed for validating our theory. We created a set of data points consisting of grid points and random points (see Fig. 3 for examples). For equal to the number of grid points, the optimal diverse subset is given by the grid.
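A data set of this kind, and the relative inverse diversity evaluated on it, might be generated as in the following sketch. The grid spacing of 1, the bounding box for the random points, the sample sizes, and all function names are assumptions for illustration; the sizes used in the experiments are not reproduced here. The sketch takes the statement that the grid is the optimal diverse subset as given and, for brevity, evaluates a purely random subset rather than one of the cover tree methods.

```python
import itertools
import math
import random


def diversity(subset, dist=math.dist):
    """Minimum pairwise distance within `subset`."""
    return min(dist(p, q) for i, p in enumerate(subset) for q in subset[i + 1:])


def grid_pool(side, n_random, seed=0):
    """Unit-spaced side x side grid plus uniform random points in its bounding box."""
    rng = random.Random(seed)
    grid = [(float(i), float(j)) for i, j in itertools.product(range(side), repeat=2)]
    noise = [(rng.uniform(0, side - 1), rng.uniform(0, side - 1))
             for _ in range(n_random)]
    return grid, grid + noise


def rel_inv_diversity(optimal_subset, subset):
    """Optimal diversity divided by achieved diversity (1.0 is best)."""
    return diversity(optimal_subset) / diversity(subset)


grid, pool = grid_pool(side=3, n_random=50)
random_pick = random.Random(1).sample(pool, len(grid))  # as many points as the grid
ratio = rel_inv_diversity(grid, random_pick)
```

Replacing `random_pick` by the output of a diversification method gives the quantity whose distribution is studied in this section.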

Figure 3: 2D artificial data consisting of grid points, with a different number of grid points in the left, center, and right panels, and random, uniformly distributed points.

Using the methods presented above, we approximated the diverse subsets and computed the relative inverse diversity, i.e., the optimal diversity divided by the diversity of the computed subset. According to Eq. (3), the computed rel. inv. diversity can at most be as large as the approximation factor, and a small rel. inv. diversity indicates better results. By repeating the experiments (100 trials each), we were able to estimate the distribution of the rel. inv. diversity. Figure 4 shows the result on the 2D and 5D grid data for base set to $b = 2$.

Figure 4: Estimated distribution of the relative inverse diversity on the 2D (left) and 5D (right) grid data. The cover tree was built with $b = 2$. The dashed vertical lines indicate the proven approximation factor.

We still observe a big gap between the theoretical bounds and empirical observations. Although GMM has an approximation factor that is three times lower than the guaranteed approximation factor of ICT Greedy and ICT Inherit with , only minor differences can be seen. One can see, that the center of the distribution of the observed rel. inv. diversity is even lower than 2. The highest observed rel. inv. diversity was for ICT Inherit, for ICT Greedy and for ICT Basic on the 2D data set. Note that for larger , we tend to select nodes from lower layers of the tree. Lower layers hold more, but evenly spaced nodes, because their cover radius is smaller. This might explain the excellent performance on the 5D grid data.

The approximation factor is defined as an upper bound. Thus, a proven approximation factor does not imply that no smaller approximation factor is possible. Because of the high performance of the cover tree approaches on the grid data sets, one might expect lower approximation factors than the ones that were proven in Sec. 3. However, we provide two examples that show the tightness of the proven approximation factors (see Appendix E). Thereby, we have shown that no lower approximation factor can be proven. Note that the rel. inv. diversity depends not only on the data pool, but also on the order in which the data is added to the tree. Even for the examples discussed in the Appendix, we only get the worst case rel. inv. diversity if the data points are added in a specific order. When this order is changed, we might also get the optimal diverse set. See Appendix A for the incremental insertion algorithm of the cover tree. With the excellent performance on the artificial data and this observation at hand, we can conclude that observing the worst possible rel. inv. diversity is highly unlikely, but not impossible.

We also conducted real world data experiments - both high dimensional and large sample sized. Cities [5] consists of the latitude and longitude of cities in Europe. Faces [2] consists of 60x64 grey-scale images of people taken with varying pose and expressions. MNIST [14] holds 28x28 images of handwritten digits. For each of the data sets we ran GMM and the cover tree approaches (built with ) with varying diverse set sizes (). In general the optimal diversity is unknown. Therefore, we plotted the inverse diversity relative to the diversity of GMM, i.e., , where corresponds to the applied method. This can be used to bound the true rel. inv. diversity. If, for example, , we can conclude , because GMM has an approximation factor of 2.
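The step from the GMM-relative quantity to a bound on the true relative inverse diversity can be written out explicitly. In this sketch, $S_{\mathrm{GMM}}$ and $S_A$ are names assumed here for the subsets returned by GMM and by the applied method, and the only ingredient is the approximation factor of 2 of GMM.

```latex
\[
\frac{\operatorname{div}^{*}}{\operatorname{div}(S_{A})}
  = \frac{\operatorname{div}^{*}}{\operatorname{div}(S_{\mathrm{GMM}})}
    \cdot \frac{\operatorname{div}(S_{\mathrm{GMM}})}{\operatorname{div}(S_{A})}
  \;\le\; 2 \cdot \frac{\operatorname{div}(S_{\mathrm{GMM}})}{\operatorname{div}(S_{A})}.
\]
```

As an illustrative number, an inverse diversity of 1.2 relative to GMM would therefore imply a true relative inverse diversity of at most 2.4.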

Figure 1 shows the inv. diversity relative to the diversity of GMM on the Cities, Faces and MNIST data sets. ICT Greedy, ICT Inherit and GMM compute subsets with diversities of almost the same magnitude. Even ICT Basic gives results with high diversity. Note that the inv. diversity relative to GMM of ICT Basic resembles a step function. This can be explained by the layer-wise structure of the cover tree. When $k$ is small compared to the number of nodes in the termination layer, ICT Basic might select poorly, which results in lower diversity than ICT Inherit. As $k$ approaches the size of the layer, both approaches will select similar subsets.

5 Application to Image Generator

Figure 5: randomly chosen images (left) from the MCMC sequence of PPGN for Volcano class, and the corresponding diverse sets (right) chosen by ICT Inherit, after 20 (bottom), 100 (middle), and 200 (top) MCMC steps.

As a new use of dynamic diversification, we applied ICT Inherit to sequentially collect a diverse set from the MCMC image sequence generated by the plug and play generative networks (PPGN) [17] (the code is available from https://github.com/Evolving-AI-Lab/ppgn), which have been shown to provide highly diverse, realistic images with high resolution. For images generated by MCMC sampling, there is strong correlation between subsequent images. To show how diverse the images generated by a new sampling strategy are, one needs to cherry-pick diverse images by hand, or generate a long sequence and randomly choose images from it. Dynamic diversification by ICT Inherit can automate this process, and might be used as a standard tool to assess the diversity of samplers.

Fig. 5 shows images of the Volcano class generated by PPGN. We ran PPGN up to 200 steps, adding a new image to ICT Inherit in each step. In the right half of Fig. 5, we show the diverse images after 20 (bottom row), 100 (middle row), and 200 (top row) steps of image generation. For comparison, we show randomly chosen images in the left half. We can see that ICT Inherit successfully chose more diverse images than the random choice. More examples for other classes are shown in Appendix G.

6 Conclusion

Selecting a diverse subset of a dynamic sample pool—dynamic diversification—is a common problem in many machine learning applications. The diversification problem is known to be NP-hard, but polynomial time approximation algorithms, such as greedy max-min (GMM) algorithm, can have an approximation factor of 2, i.e., they are guaranteed to give a solution with diversity larger than half of the optimal diversity. However, GMM has to be performed from scratch every time the sample pool is updated, e.g., new items are added or old items are removed, and therefore is not suitable for dynamic diversification. Previous work argued that cover trees, originally proposed for nearest neighbor search, could be adapted for dynamic diversification, and proposed ICT Basic, ICT Greedy and ICT Inherit, which gave results only marginally worse than the results of GMM, while the approximation factor of those approaches was assumed to be four times larger.

In this work we have conducted both theoretical analyses and extensive experiments to fill the gap between the theoretical bound and empirical observation. Specifically, we proved a tighter approximation factor for ICT Greedy and ICT Inherit, reducing it to 6 instead of 8 for a cover tree with base $b = 2$. Through experiments on artificial data, we have validated the bounds and assessed their tightness. Even on real world data sets, all three cover tree approaches give excellent results, with diversities almost of the same magnitude as the diversity given by GMM. The empirical performance of the cover tree approaches is remarkably better than the theoretical approximation factor suggests, which might imply that our bounds are still loose. However, we found worst case examples that achieve the theoretical approximation factor, which proves that our bounds for ICT are the tightest possible.

In general, the diversity of subsets computed with one of the cover tree approaches does not only depend on the data pool, but also on the order in which the data is added to the tree. Therefore, we conclude, that observing the worst possible relative inverse diversity is highly unlikely. Further effort must be made to assess how likely cover tree approaches give solutions with worst possible diversity.

Finally, our demonstration of dynamic diversification for generative image samplers shows the high potential of our theoretical insight for practical applications. Future work will also study diversification in scientific applications, where systematically creating a diverse (and therefore representative) subset of large data corpora can lead to novel insight, e.g. in molecular dynamics or sampling applications in chemistry.

Acknowledgments

The authors thank Vignesh Srinivasan of Fraunhofer HHI for discussion on the image generator application. This work was supported by the German Research Foundation (GRK 1589/1) and by the Federal Ministry of Education and Research (BMBF) under the project Berlin Big Data Center (FKZ 01IS14013A).

References

  • [1] Sofiane Abbar, Sihem Amer-Yahia, Piotr Indyk, Sepideh Mahabadi, and Kasturi R Varadarajan. Diverse near neighbor problem. In Proceedings of the twenty-ninth annual symposium on Computational geometry, pages 207–214. ACM, 2013.
  • [2] The UCI KDD Archive. CMU faces images. http://kdd.ics.uci.edu/databases/faces/faces.html. Accessed on 2018-05-18.
  • [3] Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In Proc. of ICML, pages 97–104, 2006.
  • [4] Alina Beygelzimer, Sham Kakade, and John Langford. Cover trees for nearest neighbor. In Proceedings of the 23rd international conference on Machine learning, pages 97–104. ACM, 2006.
  • [5] Center For International Earth Science Information Network-CIESIN-Columbia University. Gridded population of the world: Basic demographic characteristics. https://doi.org/10.7927/H45H7D7F. Accessed on 2016-12-13.
  • [6] E. Demidova, P. Fankhauser, X. Zhou, and W. Nejdl. DivQ: Diversification for keyword search over structured databases. In Proc. of SIGIR, pages 331–338, 2010.
  • [7] Marina Drosou and Evaggelia Pitoura. Poikilo: A tool for evaluating the results of diversification models and algorithms. Proceedings of the VLDB Endowment, 6(12):1246–1249, 2013.
  • [8] Marina Drosou and Evaggelia Pitoura. Diverse set selection over dynamic data. IEEE Transactions on Knowledge and Data Engineering, 26(5):1102–1116, 2014.
  • [9] E. Erkut. The discrete p-dispersion problem. European Journal of Operational Research, 1(1), 1990.
  • [10] E. Erkut, Y. Ulkusal, and O. Yenicerioglu. A comparison of p-dispersion heuristics. Computers and Operations Research, 21(10), 1995.
  • [11] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In Proc. of WWW, pages 381–390, 2009.
  • [12] M. Izbicki and C. R. Shelton. Fast cover trees. In Proc. of ICML, 2015.
  • [13] Matevž Kunaver and Tomaž Požrl. Diversity in recommender systems–a survey. Knowledge-Based Systems, 123:154–162, 2017.
  • [14] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/, 2010.
  • [15] Z. Liu, P. Sun, and Y. Chen. Structured search result differentiation. In Proc. of VLDB, 2009.
  • [16] E. Minack, W. Siberski, and W. Nejdl. Incremental diversification for very large sets: A streaming-based approach. In Proc. of SIGIR, 2011.
  • [17] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3510–3520, 2017.
  • [18] SS Ravi, Daniel J Rosenkrantz, and Giri Kumar Tayi. Facility dispersion problems: Heuristics and special cases. In Workshop on Algorithms and Data Structures, pages 355–366. Springer, 1991.
  • [19] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. A. Yahia. Efficient computation of diverse query results. In Proc. of ICDE, pages 228–236, 2008.
  • [20] M. R. Vieira, H. L. Razente, M. C. N. Barioni, M. Hadjieleftheriou, D. Srivastava, C. T. Jr., and V. J. Tsotras. On query result diversification. In Proc. of ICDE, 2011.
  • [21] David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge university press, 2011.
  • [22] C. Yu, L. V. Lakshmanan, and S. Amer-Yahia. Recommendation diversification using explanations. In Proc. of ICDE, 2009.
  • [23] M. Zhang and N. Hurley. Avoiding monotony: Improving the diversity of recommendation lists. In Proc. of RecSys, 2008.
  • [24] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In Proc. of WWW, pages 381–390, 2005.

Appendix A Incremental Insertion

Algorithm 4 shows how a data point can be inserted in an existing cover tree. It is based on the insertion algorithm in [4], which was extended to arbitrary bases in [8]. We reduced the complexity of the algorithm by omitting the nearest-parent heuristic.

0:   - data point - subset of nodes of layer of the cover tree
0:  Cover tree with inserted
1:  
2:  if  then
3:     return  true
4:  else
5:     
6:     flag Insert
7:     if flag and  then
8:         choose node in as parent
9:        make a child of
10:        return  false
11:     else
12:        return  flag
13:     end if
14:  end if
Algorithm 4 Insert(, , )

We start with the root layer Insert and traverse the cover tree until we find the first layer such that is separated from all other nodes (line 2). Only the nodes that can cover are considered in each iteration (). When we found an appropriate layer (line 7), we add as a child to one node in (line 8,9).

Appendix B Proof of Theorem 1

B.1 Bound on

Recall (Eq. (6)), that is defined as the fraction of the optimal diversity and the cover radius of the termination layer. Because the optimal solution is not known, the exact value of can in general not be identified. We will use the properties of the cover tree to give a bound on possible values of . We first consider how must be chosen, so that holds at least nodes. To do so, assume we try to enclose, two data points and with with one ball. A single ball won’t be able to enclose those two data points, if its radius is less than . No matter where the ball is placed, the maximal distance it can enclose is less than and thus, smaller than the distance between and . Because it holds for any pair of data points, the same will hold for the general case of data points. If the pairwise distance between two data points is at least , any ball with radius less than can only enclose one of those data points at once. Thus, we need at least balls to enclose data points.

Let denote the set of descendants of node . We say, that a node covers a node , if . Furthermore, we say that a layer of the cover tree covers a set of nodes , if

(14)

As a consequence of the covering and the nesting property, each layer of the cover tree does not only cover the layer beneath it, but covers the whole data set , i.e.

(15)

Let with denote the optimal diverse set. Because is a subset of , will cover the optimal solution i.e.

(16)

The maximal distance to any descendant of any node can be bounded

(17)

Thus, if the maximal distance to any descendant is less than , any node in will only be able to cover one of the optimal data points in . Because , is guaranteed to hold at least nodes, if

(18)
(19)
(20)

(we can use instead of strict inequality, because is already a strict upper bound for the distance to the descendants).

We have shown, that will hold at least data points, if . Note, that the cover tree might already hold nodes in a higher layer, especially because must not only cover but the whole data set . In those cases, the diversity of the solution will be higher (separation criterion).

This property can be used to find an upper bound for . Assume is the termination layer and . In that case . However, as shown above, this implies that holds at least nodes. Thus, cannot be the termination layer, because would have been. We can conclude

(21)

B.2 Bound on the Diversity

We prove each part of the maximum in Eq. (8) individually. The first part follows immediately from the separation criterion and the definition of . Any subset of will have a diversity of at least ,

(22)

In order to prove the second part of the maximum, we assume every data point of the optimal solution is covered by a single node in , i.e. . Let be an arbitrary pair of nodes in the termination layer which covers . Without loss of generality we assume covers , covers and . Because of the covering property

(23)
(24)

Using the triangle inequality, we get

(25)
(26)
(27)
(28)
(29)

Because every data point of the optimal solution is by assumption covered by one node in and the pairwise distance of those data points is at least , it follows that holds a subset of nodes that have a pairwise distance of at least .

Recall that we assumed at the beginning of the second part of the proof, to make sure that every data point of the optimal solution is covered by a different node of the termination layer. This does not impose any restrictions on the validity of the bound, because for , and the overall bound still holds. This concludes the proof of the theorem.

B.3 Fraction of Diversities

First note that decreases and increases as increases. As a consequence, the bound on the diversity in Eq. (8) is minimal when

(30)

Solving for leads to

(31)
(32)
(33)

We use this minimum and give a bound on the fraction of the diversities

(34)
(35)
(36)

Appendix C Proof of Corollary of Theorem 1

Suppose we use ICT Greedy to select a subset of nodes from . The diversity of the selected subset cannot be worse than , because that bound is given by the separation criterion. As shown in [18], GMM has an approximation factor of 2. We select nodes from , not from itself. Thus, the best possible diversity is not , but rather the one given by Eq. (8), i.e. the optimal diversity in layer . We get what is stated in Eq. (10).
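The selection step described above runs the greedy max-min heuristic on the nodes of one layer. The following is a minimal sketch of that heuristic, not the authors' implementation: the function names `diversity` and `gmm`, the deterministic `start` index (used here in place of a random starting point), and the example points are all assumptions made for illustration. It further assumes that the candidates contain at least k distinct points.

```python
import math


def diversity(points):
    """Smallest pairwise distance within a set of at least two points."""
    return min(math.dist(p, q)
               for i, p in enumerate(points)
               for q in points[i + 1:])


def gmm(candidates, k, start=0):
    """Greedy max-min selection of k points from `candidates`.

    Repeatedly adds the candidate whose distance to the already
    selected points is largest.
    """
    selected = [start]
    # min_dist[i]: distance from candidate i to its nearest selected point.
    min_dist = [math.dist(candidates[start], c) for c in candidates]
    while len(selected) < k:
        nxt = max(range(len(candidates)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Only distances to the newly selected point need to be computed.
        for i, c in enumerate(candidates):
            min_dist[i] = min(min_dist[i], math.dist(candidates[nxt], c))
    return [candidates[i] for i in selected]


# Hypothetical stand-in for the nodes of a termination layer.
layer = [(0.0,), (1.0,), (2.0,), (10.0,)]
subset = gmm(layer, k=3)
print(subset, diversity(subset))
```

Already selected points have a nearest-selected distance of zero, so they are not picked again as long as an unselected point at positive distance remains.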

As shown in [18], the complexity of GMM is in where is the size of the set from which data points are selected. Here we select data points from the termination layer, i.e. . Analogous to what was shown in Sec. B.3, we can prove the approximation factor of ICT Greedy. Solving

(37)

for leads to

(38)

Again, we use this value for the calculation of the minimum of

(39)
(40)
(41)

Appendix D Proof of Theorem 2

ICT Inherit uses GMM to calculate a subset of layer . Instead of starting with a randomly selected data point (as in GMM or ICT Greedy, respectively), we start with and subsequently add new data points. The proof of the approximation factor of GMM relies on induction: it is shown that the approximation factor holds after each iteration (see [18] for more details). Once it is shown that does not contradict an approximation factor of 2, the remaining proof in [18] can simply be applied. Thus, we show that, for any possible value of , the minimum possible pairwise distance in is larger than or equal to half of the optimal diversity in . Because the diversity of can at most be as large as the optimal diversity, (recall Eq. (7)). As stated in the first theorem, . According to the separation criterion

(42)

and thus, we show

(43)
  • (44)

    holds, since .

  • (45)
    (46)
    (47)
    (48)
    (49)
    (50)

    holds for .

Thus,

(52)

which was to be shown.

When GMM is initialized with , it only has to select data points. It is easy to see that the complexity reduces. In the first iteration, the distances from each data point in to every data point in are computed in order to find the next data point. In each subsequent iteration, only the distances between the data point selected in the previous iteration and the remaining available data points are computed. In total, ICT Inherit performs

(53)

distance computations and therefore has a complexity of .
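The counting argument above can be made concrete with a small sketch of greedy max-min selection that starts from a given initial set and tallies its distance computations. This is an illustration under assumptions, not the authors' code: the name `gmm_with_init`, its signature, and the example data are hypothetical, the initial set must be non-empty, and the returned count reflects only the bookkeeping of this sketch rather than the exact expression in Eq. (53).

```python
import math


def gmm_with_init(candidates, init, k):
    """Extend the non-empty set `init` to k points by greedy max-min selection.

    Returns the selected points and the number of distance computations.
    """
    selected = list(init)
    remaining = [c for c in candidates if c not in selected]
    count = 0

    # First iteration: every remaining candidate against every initial point.
    min_dist = []
    for c in remaining:
        dists = [math.dist(c, s) for s in selected]
        count += len(dists)
        min_dist.append(min(dists))

    while len(selected) < k and remaining:
        j = max(range(len(remaining)), key=min_dist.__getitem__)
        chosen = remaining.pop(j)
        min_dist.pop(j)
        selected.append(chosen)
        if len(selected) == k:
            break
        # Later iterations: only distances to the point selected last.
        for i, c in enumerate(remaining):
            count += 1
            min_dist[i] = min(min_dist[i], math.dist(c, chosen))
    return selected, count


# Hypothetical layer nodes and an inherited initial set of two points.
layer = [(0.0,), (1.0,), (2.0,), (5.0,), (10.0,)]
subset, n_dist = gmm_with_init(layer, init=[(0.0,), (10.0,)], k=4)
print(subset, n_dist)
```

In this example the first iteration accounts for 3 x 2 distance computations and the single later update for 2 more, which illustrates why a larger inherited set leaves fewer greedy iterations to pay for.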

Because ICT Inherit has the same bound on the diversity as ICT Greedy (cf. (10) and (12)), both approaches have the same approximation factor.

Appendix E Tightness of the Proven Approximation Factors

The approximation factor is defined as an upper bound. Thus, proving an approximation factor does not imply that no smaller approximation factor is possible. In this section we give two examples that show the tightness of the derived approximation factors for arbitrary bases and .

E.1 Tightness of

Let the data set be defined as

(54)

where is small and is the smallest integer such that

(55)

Most of the data points in lie on the line between and , where the distance between neighboring data points decreases as we move to the ends of the line. Only the data point does not lie on that line, but above the midpoint . We choose the Euclidean distance as the distance metric. The optimal solution for is given by

(56)

with a diversity of

(57)

The cover tree for is built with base such that data points are incrementally added, ordered according to their absolute value. As a result, we get the following cover tree:

  • :

  • :

  • :

  • :

The data point will be the root node and each lower layer of the tree will hold the next two outer data points on the line. must be large enough so that the root is able to cover even the most distant data points, i.e. the optimal solution.

(58)
(59)
(60)

For simplicity, we assume (any layer above will hold only the root). The layer with will hold the whole data set . Because the distance between any data point in and is larger than , there will be a node for in layer . We now verify the cover tree properties to show that this is a valid cover tree. (1) Nesting: Obviously holds, as . (2) Covering: Holds for , because the maximal distance of any point in to the root is . Let be an arbitrary level. The nodes and are the parents of the nodes and . Their distance can be computed as

(61)
(62)
(63)
(64)
(65)
(66)
(67)

(analogously for negative data points). Therefore, the covering property holds. It also holds for level , because the nodes and are the parents of the nodes and . Their distance can be computed as

(68)
(69)
(70)