Agglomerative Clustering of Growing Squares

06/30/2017
by Thom Castermans, et al.
Aarhus Universitet
TU Eindhoven

We study an agglomerative clustering problem motivated by interactive glyphs in geo-visualization. Consider a set of disjoint square glyphs on an interactive map. When the user zooms out, the glyphs grow in size relative to the map, possibly with different speeds. When two glyphs intersect, we wish to replace them by a new glyph that captures the information of the intersecting glyphs. We present a fully dynamic kinetic data structure that maintains a set of n disjoint growing squares. Our data structure uses O(n (log n log log n)^2) space, supports queries in worst case O(log^3 n) time, and updates in O(log^7 n) amortized time. This leads to an O(n α(n) log^7 n) time algorithm to solve the agglomerative clustering problem. This is a significant improvement over the current best O(n^2) time algorithms.


1 Introduction

We study an agglomerative clustering problem motivated by interactive glyphs in geo-visualization. Our specific use case stems from the eHumanities, but similar visualizations are used in a variety of application areas. GlamMap [5] (http://glammap.net/glamdev/maps/1, best viewed in Chrome) is a visual analytics tool which allows the user to interactively explore datasets which contain (at least) the following metadata of a book collection: author, title, publisher, year of publication, and location (city) of publisher. Each book is depicted by a square, color-coded by publication year, and placed on a map according to the location of its publisher. Overlapping squares (many books are published in Leipzig, for example) are recursively aggregated into a larger glyph until all glyphs are disjoint (see Fig. 1). As the user zooms out, the glyphs “grow” relative to the map to remain legible. As a result, glyphs start to overlap and need to be merged into larger glyphs to keep the map clear and uncluttered. It is straightforward to compute the resulting agglomerative clustering whenever a data set is loaded and to serve it to the user as needed by the current zoom level. However, GlamMap allows the user to filter by author, title, year of publication, or other applicable metadata. It is impossible to pre-compute the clustering for every conceivable combination of filter values. To allow the user to browse at interactive speeds, we hence need an efficient agglomerative clustering algorithm for growing squares (glyphs). Interesting bibliographic data sets (such as the catalogue of WorldCat, which contains more than 321 million library records at hundreds of thousands of distinct locations) are too large by a significant margin to be clustered fast enough with the current state-of-the-art O(n^2) time algorithms (here n is the number of squares or glyphs).

In this paper we formally analyze the problem and present a fully dynamic data structure that uses O(n (log n log log n)^2) space, supports updates in O(log^7 n) amortized time, and queries in O(log^3 n) time, which allows us to compute the agglomerative clustering for n glyphs in O(n α(n) log^7 n) time. Here, α(n) is the extremely slowly growing inverse Ackermann function. To the best of our knowledge, this is the first fully dynamic clustering algorithm which beats the classic O(n^2) time bound.

Figure 1: Zooming out in GlamMap will merge overlapping squares. This figure shows a sequence of three steps zooming out from the surroundings of Leipzig.
Figure 2: The timeline of squares that grow and merge as they touch.

Formal problem statement.

Let P be a set of n points in the plane (the locations of publishers from our example). Each point p in P has a positive weight w_p (the number of books published in this city). Given a “time” parameter t ≥ 0, we interpret the points in P as squares. More specifically, the square of p at time t is the square centered at p with width w_p t. For ease of exposition we assume all coordinates and weights to be unique. With some abuse of notation we may refer to P as a set of squares rather than the set of center points of squares. Observe that initially, i.e. at t = 0, all squares in P are disjoint. As t increases, the squares in P grow, and hence they may start to intersect. When the squares of two points p and q intersect at some time, we remove both p and q and replace them by a new point of weight w_p + w_q (see Fig. 2).
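To make the growth model concrete, the following minimal Python sketch expresses when two growing squares first touch and how a merge produces a new glyph. The width w_p t of a square is taken from the problem statement above; placing the merged glyph at the weighted centroid is our own assumption for illustration, since the exact placement rule is not specified here.

    from dataclasses import dataclass

    @dataclass
    class Glyph:
        x: float
        y: float
        w: float  # positive weight; the square of this glyph has width w * t at time t

    def intersection_time(p: Glyph, q: Glyph) -> float:
        """First time t >= 0 at which the squares of p and q touch."""
        # Axis-aligned squares of half-width w * t / 2 intersect as soon as both
        # the horizontal and the vertical gap between the centers are covered.
        gap = max(abs(p.x - q.x), abs(p.y - q.y))
        return 2.0 * gap / (p.w + q.w)

    def merge(p: Glyph, q: Glyph) -> Glyph:
        """Replace two intersecting glyphs by a single glyph of combined weight.
        The weighted-centroid placement is an assumption made for this sketch."""
        w = p.w + q.w
        return Glyph((p.w * p.x + q.w * q.x) / w, (p.w * p.y + q.w * q.y) / w, w)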

Related Work.

Funke, Krumpe, and Storandt [6] introduced so-called “ball tournaments”, a related but simpler problem motivated by map labeling. Their input is a set of balls with an associated set of priorities. The balls grow linearly and whenever two balls touch, the ball with the lower priority is eliminated. The goal is to compute the elimination sequence efficiently. Bahrdt et al. [4] and Funke and Storandt [7] improved upon the initial results and presented bounds which depend on the ratio of the largest to the smallest radius. Specifically, Funke and Storandt [7] show how to compute an elimination sequence in arbitrary dimensions, with running times parametrized by the number of different radii. In our setting eliminations are not sufficient, since merged glyphs need to be re-inserted. Furthermore, as opposed to typical map labeling problems where labels come in a fixed range of sizes, the sizes of our glyphs can vary by a factor of 10,000 or more (Amsterdam with its many well-established publishers vs. Kaldenkirchen with one obscure one).

Ahn et al. [2] very recently and independently developed the first sub-quadratic algorithms to compute elimination orders for ball tournaments. Their results apply to balls and boxes in two or higher dimensions. Specifically, for squares in two dimensions they can compute an elimination order in time. Their results critically depend on the fact that they know the elimination priorities at the start of their algorithm and that they only have to handle deletions. Hence they do not have to run an explicit simulation of the growth process and can achieve their results by the clever use of advanced data structures. In contrast, we are handling the fully dynamic setting with both insertions and deletions, and without a specified set of priorities.

Our clustering problem combines both dynamic and kinetic aspects: squares grow, which is a restricted form of movement, and squares are both inserted and deleted. There are comparatively few papers which tackle dynamic kinetic problems. Alexandron et al. [3] present a dynamic and kinetic data structure for maintaining the convex hull of points (or analogously, the lower envelope of lines) moving in the plane. Their data structure processes (in expectation) a near-quadratic number of events, in polylogarithmic time each; the bounds involve the maximum length of a Davenport-Schinzel sequence. Agarwal et al. [1] present dynamic and kinetic data structures for maintaining the closest pair and all nearest neighbors. The expected number of events processed is again near-quadratic, each of which can be handled in polylogarithmic expected time. We use some ideas and constructions similar in flavor to the structures presented in their paper.

Results.

We present a fully dynamic data structure that can maintain a set P of disjoint growing squares. Our data structure will produce an intersection event at every time at which two squares in P start to intersect (i.e. at any time before that, all squares in P remain disjoint). At such a time, we then have to delete some of the squares to make sure that the squares in P are again disjoint. At any time, our data structure supports inserting a new square that is disjoint from the squares in P, or removing an existing square from P. Our data structure can handle a sequence of n updates in a total of O(n log^7 n) time; each update is performed in O(log^7 n) amortized time.

The Main Idea.

We develop a data structure that can maintain a dynamic set of disjoint squares, and produce an intersection event at every time at which a square starts to intersect the square of a point that dominates it. We say that a point p dominates a point q if and only if p_x ≥ q_x and p_y ≥ q_y. We then combine four of these data structures, one for each quadrant, to make sure that all squares in P remain disjoint. The main observation that allows us to maintain this efficiently is that we can maintain the points dominating a point q in an order such that a prefix of this order will have their squares intersect the top side of the square of q first, and the remaining squares will intersect the right side first. We formalize this in Section 2. We then present our data structure (essentially a pair of range trees interlinked with linking certificates) in Section 3. While our data structure is conceptually simple, the exact implementation is somewhat intricate, and the details are numerous. Our initial analysis bounds the number of certificates that our data structure maintains per square, which yields the O(log^7 n) amortized update time. This allows us to simulate the process of growing the squares in P (and thus solve the agglomerative glyph clustering problem) in O(n α(n) log^7 n) time. In Section 4 we analyze the relation between canonical subsets in dominance queries. We show that for two range trees, the number of pairs of nodes for which one node occurs in the canonical subset of a dominance query defined by the other, and vice versa, is only near-linear in the total size of the two trees. This implies that the number of linking certificates that our data structure maintains, as well as the total space used, is only O(n (log n log log n)^2). Since the linking certificates provide an efficient representation of all dominance relations between two point sets (or within a point set), we believe that this result is of independent interest as well.

Figure 3: The squares and the projection of their centers and relevant corners onto the line ℓ.

2 Geometric Properties

Let denote the bottom left vertex of a square , and let denote the top right vertex of . Furthermore, let denote the subset of points of dominating , and let denote the set of bottom left vertices of the squares of those points.

Observation .

Let p be a point dominating a point q. The square of p and the square of q intersect at time t if and only if the top right vertex of the square of q dominates the bottom left vertex of the square of p at time t.

Consider a line ℓ with slope minus one, project all points in this set, for some time t, onto ℓ, and order them from left to right. Observe that, since all these points move along lines with slope one, this order does not depend on the time t. Moreover, a point and its bottom left vertex project to the same point on ℓ, so we can easily compute this order by projecting the centers of the squares onto ℓ and sorting them. This splits the dominating points into the (ordered) subset that occurs before the dominated point in the order along ℓ, and the (ordered) subset that occurs after it; the corresponding subsets for the other direction are defined analogously.
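The following Python sketch (reusing the Glyph class from the earlier sketch) computes this order and the resulting split of the dominators of a point; the names d_top and d_right are ours. Note that the weights cancel out in the comparison, which is exactly why the time-independent order of the centers along ℓ suffices.

    def ell_key(p):
        # Projecting onto a line of slope -1 orders points by x - y (left to right along ell).
        return p.x - p.y

    def split_dominators(q, points):
        """Split the points dominating q into those whose squares will first reach
        the top side of q's square (d_top) and those that will first reach the
        right side (d_right), each ordered along ell."""
        dom = [p for p in points if p is not q and p.x >= q.x and p.y >= q.y]
        dom.sort(key=ell_key)
        d_top = [p for p in dom if ell_key(p) <= ell_key(q)]   # before q along ell
        d_right = [p for p in dom if ell_key(p) > ell_key(q)]  # after q along ell
        return d_top, d_right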

Observation .

Let be a point dominating point , and let be the first time at which dominates . We then have that

  • and if and only if , and

  • and if and only if .

See Fig. 3 for an illustration.

Observation 2 implies that the points in will start to intersect at some time because the bottom left vertex of will enter through the top edge, whereas the bottom left vertex of the (squares of the) points in will enter through the right edge. We thus obtain the following result.

Let be the first time at which a square of a point intersects . We then have that

  1. , and is the point with minimum -coordinate among the points in at time ,

if and only if , and
  2. , and is the point with minimum -coordinate among the points in at time ,

otherwise (i.e. if and only if ).

3 A Kinetic Data Structure for Growing Squares

In this section we present a data structure that can detect the first intersection among a dynamic set of disjoint growing squares. In particular, we describe a data structure that can detect intersections between all pairs of squares in such that . We build an analogous data structure for when . This covers all intersections between pairs of squares , where . We then use four copies of these data structures, one for each quadrant, to detect the first intersection among all pairs of squares.

We describe the data structure itself in Section 3.1, and we briefly describe how to query it in Section 3.2. We deal with updates, e.g. inserting a new square into or deleting an existing square from , in Section 3.3. In Section 3.4 we analyze the total number of events that we have to process, and the time required to do so, when we grow the squares.

3.1 The Data Structure

Our data structure consists of two three-layered trees and a set of certificates linking nodes of one tree to nodes of the other. These trees essentially form two 3D range trees on the centers of the squares, taking the third coordinate of each point to be its rank in the order along the line ℓ (ordered from left to right). The third layer of the first tree will double as a kinetic tournament tracking the bottom left vertices of the squares. Similarly, the second tree will track the top right vertices of the squares.

The Layered Trees.

The first tree is a 3D range tree storing the center points. Each layer is implemented by a weight-balanced binary search tree (BB[α] tree) [9], and each node corresponds to the canonical subset of points stored in the leaves of the subtree rooted at that node. The points are ordered on x-coordinate first, then on y-coordinate, and finally on their rank along ℓ. For a node, we consider the set of bottom left vertices of the squares corresponding to its canonical subset.

Consider the associated structure of some secondary node. We consider it as a kinetic tournament on the x-coordinates of these bottom left vertices [1]. More specifically, every internal node corresponds to a set of points consecutive along the line ℓ. Since a point and its bottom left vertex project to the same position on ℓ, this node also corresponds to a set of consecutive bottom left vertices. The node stores the vertex with minimum x-coordinate in this set, and will maintain certificates that guarantee this [1].
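A Python sketch of a single tournament certificate, assuming the bottom left vertex of a glyph p moves as x(t) = p.x − w_p t / 2 (consistent with the growth model sketched earlier); the attribute names are ours.

    def corner_x(p, t):
        """x-coordinate of the bottom left vertex of p's square at time t."""
        return p.x - 0.5 * p.w * t

    def certificate_failure_time(a, b, now):
        """Given that a currently wins (corner_x(a, now) <= corner_x(b, now)),
        return the first time at which b overtakes a, or None if it never does."""
        dx = a.x - b.x
        dw = 0.5 * (a.w - b.w)   # corner_x(a, t) - corner_x(b, t) = dx - dw * t
        if dw >= 0:
            return None          # the difference never increases, so a keeps winning
        return dx / dw           # the two linear functions cross at this time (>= now)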

The second tree has the same structure as the first: it is a three-layered range tree on the center points. The difference is that the ternary structure of a secondary node forms a kinetic tournament maintaining the maximum x-coordinate over the top right vertices of the squares whose center points it stores. Hence, every ternary node stores the vertex with maximum x-coordinate among its canonical subset.

Let and denote the set of all kinetic tournament nodes in and , respectively.

Linking the Trees.

Next, we describe how to add linking certificates between the kinetic tournament nodes in the trees and that guarantee the squares are disjoint. More specifically, we describe the certificates, between nodes and , that guarantee that the squares and are disjoint, for all pairs and .

Consider a point . There are nodes in the secondary trees of , whose canonical subsets together represent exactly . For each of these nodes we can then find nodes in representing the points in . So, in total is interested in a set of kinetic tournament nodes. It now follows from Lemma 2 that if we were to add certificates certifying that is left of the point stored at the nodes in we can detect when intersects with a square of a point in . However, as there may be many points interested in a particular kinetic tournament node , we cannot afford to maintain all of these certificates. The main idea is to represent all of these points by a number of canonical subsets of nodes in , and add certificates to only these nodes.
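Both here and in the queries below, everything reduces to canonical subsets: the nodes of a balanced search tree whose subtrees partition the answer to a one-dimensional query. The following Python sketch shows this standard decomposition on a simplified, perfectly balanced tree; it is a stand-in for one layer of the range trees, not for the BB[α] trees used by the actual structure.

    class Node:
        """A static, perfectly balanced search tree over a sorted list of keys."""
        def __init__(self, keys):
            self.min, self.max = keys[0], keys[-1]
            if len(keys) == 1:
                self.left = self.right = None
            else:
                mid = len(keys) // 2
                self.left, self.right = Node(keys[:mid]), Node(keys[mid:])

    def canonical_nodes(v, lo, hi, out):
        """Append to `out` the minimal set of nodes whose subtrees together
        contain exactly the keys in [lo, hi]."""
        if v is None or v.max < lo or v.min > hi:
            return out
        if lo <= v.min and v.max <= hi:   # subtree of v lies entirely inside the query
            out.append(v)
            return out
        canonical_nodes(v.left, lo, hi, out)
        canonical_nodes(v.right, lo, hi, out)
        return out

For example, canonical_nodes(root, q, float('inf'), []) yields the logarithmically many nodes that together represent all keys of at least q, which is the one-dimensional analogue of a dominance query.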

Consider a point . Symmetric to the above construction, there are nodes in kinetic tournaments associated with that together exactly represent the (top right corners of) the points dominated by and for which . Let denote this set of kinetic tournament nodes.

Figure 4: The points are defined by a pair of nodes, one in each tree. If both containment conditions hold, then we add a linking certificate between the rightmost upper right vertex and the leftmost bottom left vertex.

Next, we extend the definitions of and to kinetic tournament nodes. To this end, we first associate each kinetic tournament node with a (query) point in . Consider a kinetic tournament node in a tournament , and let be the node in the primary tree for which . Let be the point associated with (note that we take the minimum over different sets , , and for the different coordinates), and define . Symmetrically, for a node in a tournament , with and , we define and .

We now add a linking certificate between every pair of nodes and for which (i) is a node in the canonical subset of , that is , and (ii) is a node in the canonical subset of , . Such a certificate will guarantee that the point currently stored at lies left of the point stored at . Every kinetic tournament node is involved in linking certificates, and thus every point is associated with at most certificates.

Proof.

We start with the first part of the lemma statement. Every node can be associated with at most linking certificates: one with each node in . Analogously, every node can be associated with at most linking certificates: one for each node in .

Every point occurs in the canonical subset of at most kinetic tournament nodes in both trees: it is stored in leaves of the kinetic tournaments, and in each such tournament it can participate in certificates (at most two tournament certificates per node). As we argued above, each such node itself occurs in at most certificates. The lemma follows. ∎

What remains to argue is that we can still detect the first upcoming intersection.

Consider two sets of elements, say blue elements and red elements , stored in the leaves of two binary search trees and , respectively, and let and , with , be leaves in trees and , respectively. There is a pair of nodes and , such that

  • and , and

  • and ,

where , , and denotes the minimal set of nodes in whose canonical subsets together represent exactly the elements of .

Figure 5: The nodes and in the trees and .

Proof. Let be the first node on the path from the root of to such that the canonical subset of is contained in the interval , but the canonical subset of the parent of is not. We define to be the root of if no such node exists. We define to be the first node on the path from the root of to for which is contained in but the canonical subset of the parent is not. We again define as the root of if no such node exists. See Fig. 5. Clearly, we now directly have that is one of the nodes whose canonical subsets form , and that (as lies on the search path to ). It is also easy to see that , as lies on the search path to . All that remains is to show that is one of the canonical subsets that together form . This follows from the fact that —and thus is indeed a subset of — and the fact that the subset of the parent of contains an element smaller than , and can thus not be a subset of . ∎

Consider the first pair of squares to intersect, at some time. Then there is a pair of nodes that have a linking certificate that fails at that time.

Proof.

Consider the leaves representing and in and , respectively. By Lemma 3.1 we get that there is a pair of nodes and that, among other properties, have and . Hence, we can apply Lemma 3.1 again on the associated trees of and , giving us nodes and which again have and . Finally, we apply Lemma 3.1 once more on and giving us nodes and with and . In addition, these three applications of Lemma 3.1 give us two points and such that:

  • occurs as a canonical subset representing ,

  • occurs as a canonical subset representing , and

  • occurs as a canonical subset representing ,

and such that

  • occurs as a canonical subset representing ,

  • occurs as a canonical subset representing , and

  • occurs as a canonical subset representing .

Combining these first three facts, and observing that gives us that occurs as a canonical subset representing , and hence . Analogously, combining the latter three facts and gives us . Therefore, and have a linking certificate. This linking certificate involves the leftmost bottom left vertex for some point and the rightmost top right vertex for some point . Since and , we have that and , and thus we detect their intersection at time . ∎

From Lemma 3.1 it follows that we can now detect the first intersection between a pair of squares , with . We define an analogous data structure for when . Following Lemma 2, the kinetic tournaments will maintain the vertices with minimum and maximum -coordinate for this case. We then again link up the kinetic tournament nodes in the two trees appropriately.

Space Usage.

Our two trees are three-dimensional range trees, and thus use O(n log^2 n) space. However, it is easy to see that this is dominated by the space required to store the certificates. For all kinetic tournament nodes we store at most the number of certificates given by Lemma 4, which bounds the total space used by our data structure. In Section 4 we will show that the number of certificates that we maintain is actually only O(n (log n log log n)^2). This means that our data structure also uses only O(n (log n log log n)^2) space.

3.2 Answering Queries

The basic query that our data structure supports is testing if a query square currently intersects a square of a point that dominates its center. To this end, we simply select the kinetic tournament nodes whose canonical subsets together represent the relevant dominating points. For each such node we check if the x-coordinate of the bottom left vertex stored at that node (which has minimum x-coordinate among its canonical subset) is smaller than the x-coordinate of the top right vertex of the query square. If so, the squares intersect. The correctness of our query algorithm directly follows from Observation 2. The total time required for a query is O(log^3 n). Similarly, we can test if a given query point is contained in such a square. Note that our full data structure will contain analogous trees that can be used to check for intersections with squares in the other quadrants.
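A Python sketch of this query, reusing corner_x from the earlier sketch and assuming a helper tournament_nodes_for(s) that returns the kinetic tournament nodes whose canonical subsets together represent the dominating points relevant to this tree (found by decompositions like the one sketched in Section 3.1), and that each node caches its current winner; both names are ours.

    def query_intersects(s, t, tournament_nodes_for):
        """Does the square of s intersect, at time t, the square of some point
        that dominates s and is handled by this tree?"""
        tr_x = s.x + 0.5 * s.w * t                 # x-coordinate of s's top right vertex
        for node in tournament_nodes_for(s):
            winner = node.winner                   # point whose bottom left vertex has minimum x
            if corner_x(winner, t) <= tr_x:
                return True
        return False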

3.3 Inserting or Deleting a Square

At an insertion or deletion of a square we proceed in three steps. First, we update the individual trees and , making sure that they once again represent 3D range trees of all center points in , and that the ternary data structures are, by themselves, correct kinetic tournaments. For each kinetic tournament node in affected by the update, we then query to find a new set of linking certificates. We update the affected nodes in analogously. Finally, we update the global event queue that stores all certificates. Inserting a square into or deleting a square from takes amortized time.

Proof.

We use the following standard procedure for updating the three-level BB[α] trees in amortized time. An update (insertion or deletion) in a ternary data structure can easily be handled in time. When we insert into or delete an element in a BB[α] tree that has associated data structures, we add or remove the leaf that contains , rebalance the tree by rotations, and finally add or remove from the associated data structures. When we do a left rotation around an edge we have to build a new associated data structure for node from scratch. See Fig. 6. Right rotations are handled analogously. It is well known that if building the associated data structure at node takes time, for some , then the costs of all rebalancing operations in a sequence of insertions and deletions take a total of time, where is the maximum size of the tree at any time [8]. We can build a new kinetic tournament for node (using the associated data structures at its children) in linear time. Note that this cost excludes updating the global event queue. Building a new secondary tree , including its associated kinetic tournaments, takes time. It then follows that the cost of our rebalancing operations is at most . This is dominated by the total number of nodes created and deleted, , during these operations. Hence, we can insert or delete a point (square) in in amortized time. ∎

Figure 6: After a left rotation around an edge , the associated data structure of node (pink) has to be rebuilt from scratch as its canonical subset has changed. For node we can simply use the old associated data of node . No other nodes are affected.
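A self-contained Python sketch of the rebalancing step illustrated in Fig. 6; here the associated structure is simply a sorted list of the keys in the subtree, standing in for a kinetic tournament, and the balancing criteria of a real BB[α] tree are omitted.

    class RNode:
        """Toy tree node with an associated structure over its subtree."""
        def __init__(self, key, left=None, right=None):
            self.key, self.left, self.right = key, left, right
            self.assoc = self.subtree_keys()      # stand-in for an associated kinetic tournament

        def subtree_keys(self):
            out = [self.key]
            for child in (self.left, self.right):
                if child is not None:
                    out += child.subtree_keys()
            return sorted(out)

    def rotate_left(u):
        """Left rotation around the edge (u, u.right). Only one associated
        structure has to be rebuilt: the promoted node covers exactly the keys
        that u used to cover and can reuse u's old structure, while u's
        canonical subset shrank and is rebuilt from scratch."""
        v = u.right
        u.right, v.left = v.left, u
        v.assoc = u.assoc
        u.assoc = u.subtree_keys()
        return v                                  # new root of the rotated subtree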

Analogously to Lemma 3.3, we can update the other tree within the same amortized time bound. Next, we update the linking certificates. We say that a kinetic tournament node in is affected by an update if (i) the update added or removed a leaf node in the subtree rooted at , (ii) node was involved in a tree rotation, or (iii) occurs in a newly built associated tree (for some node ). Let denote the set of nodes affected by update . Analogously, we define the set of nodes of affected by the update. For each node , we query to find the set of nodes whose canonical subsets represent . For each node in this set, we test if we have to add a linking certificate between and . As we show next, this takes constant time for each node , and thus time in total, for all nodes . We update the linking certificates for all nodes in analogously.

We have to add a link between a node and if and only if we also have . We test this as follows. Let be the node whose associated tree contains , and let be the node in whose associated tree contains . We have that if and only if , , and . We can test each of these conditions in constant time:

Observation .

Let q be a query value, let v be a node in a binary search tree, and consider the canonical subset of the parent of v (taken to be empty if no such node exists). Then v is one of the canonical nodes for the one-sided query defined by q if and only if the canonical subset of v is contained in the query range and the canonical subset of its parent is not.
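A minimal Python sketch of this constant-time test, assuming each node stores the minimum key of its subtree (as the Node class sketched earlier does); the query is the one-sided range of all keys of at least q.

    def is_canonical_node(v, parent, q):
        """Is v a canonical node of the query [q, infinity)?"""
        inside = q <= v.min                                   # v's whole subtree is in range
        parent_inside = parent is not None and q <= parent.min
        return inside and not parent_inside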

Finally, we delete all certificates involving no longer existing nodes from our global event queue, and replace them by all newly created certificates. This takes time per certificate. We charge the cost of deleting a certificate to when it gets created. Since every node affected creates at most new certificates, all that remains is to bound the total number of affected nodes. We can show this using basically the same argument as we used to bound the update time. This leads to the following result.

Inserting a disjoint square into , or deleting a square from takes amortized time.

Proof.

An update visits at most nodes itself (i.e. leaf nodes and nodes on the search path). All other affected nodes occur as newly built trees due to rebalancing operations. As in Lemma 3.3, the total number of nodes created due to rotations in a sequence of updates is . It follows that the total number of affected nodes in such a sequence is . Therefore, we create linking certificates in total, and we can compute them in time. Updating the global event queue then takes time. ∎

3.4 Running the Simulation

All that remains is to analyze the number of events processed. We show that in a sequence of operations, our data structure processes at most events. This leads to the following result. We can maintain a set of disjoint growing squares in a fully dynamic data structure such that we can detect the first time that a square intersects with a square , with . Our data structure uses space, supports updates in amortized time, and queries in time. For a sequence of operations, the structure processes a total of events in a total of time.

Proof.

We argued the bounds on the space, the query, and the update times before. All that remains is to bound the number of events processed, and the time to do so.

We start with the observation that each failure of a linking certificate produces an intersection, and thus a subsequent update. It follows that the number of such events is at most proportional to the number of updates.

To bound the number of events created by the tournament trees we extend the argument of Agarwal et al. [1]. For any kinetic tournament node in , the minimum -coordinate corresponds to a lower envelope of line-segments in the -space. This envelope has complexity , where is the multiset of points that ever occur in , i.e. that are stored in a leaf of the subtree rooted at at some time . Hence, the number of tournament events involving node is also at most . It then follows that the total number of events is proportional to the size of these sets , over all in our tree. As in Lemma 3.3, every update directly contributes one point to nodes. The remaining contribution is due to rebalancing operations, and this cost is again bounded by . Thus, the total number of events processed is .

At every event, we have to update the linking certificates of . This can be done in time (including the time to update the global event queue). Thus, the total time for processing all kinetic tournament events in is . The analysis for the kinetic tournament nodes in is analogous. ∎

To simulate the process of growing the squares in P, we now maintain eight copies of the data structure from Theorem 3.4: two data structures for each quadrant. We thus obtain the following result. We can maintain a set of n disjoint growing squares in a fully dynamic data structure such that we can detect the first time that two squares intersect. Our data structure uses O(n (log n log log n)^2) space, supports updates in O(log^7 n) amortized time, and queries in O(log^3 n) time. For a sequence of n operations, the structure processes events in a total of O(n α(n) log^7 n) time. We thus obtain the following solution to the agglomerative glyph clustering problem: given a set of n initial square glyphs, we can compute an agglomerative clustering of the squares in O(n α(n) log^7 n) time using O(n (log n log log n)^2) space.
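For contrast with the data-structure-based simulation, the following deliberately naive Python driver (reusing intersection_time and merge from the sketch in Section 1) computes the merge sequence by brute force, recomputing all pairwise intersection times after every merge; it runs in roughly cubic time overall, which is exactly what the result above improves upon. The returned events depend on the hypothetical centroid placement assumed in that earlier sketch.

    import math

    def agglomerate(glyphs):
        """Brute-force agglomerative clustering of growing squares.
        Returns a list of (time, merged_glyph, (child_a, child_b)) events."""
        glyphs = list(glyphs)
        events, now = [], 0.0
        while len(glyphs) > 1:
            t_best, pair = math.inf, None
            for i in range(len(glyphs)):
                for j in range(i + 1, len(glyphs)):
                    # clamp to `now` so that cascading merges happen at the current time
                    t = max(intersection_time(glyphs[i], glyphs[j]), now)
                    if t < t_best:
                        t_best, pair = t, (i, j)
            i, j = pair
            merged = merge(glyphs[i], glyphs[j])
            events.append((t_best, merged, (glyphs[i], glyphs[j])))
            glyphs = [g for k, g in enumerate(glyphs) if k not in (i, j)] + [merged]
            now = t_best
        return events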

4 Efficient Representation of Dominance Relations

The linking certificates of our data structure actually comprise an efficient representation of all dominance relations between two point sets. We therefore think that this representation, and in particular the tighter analysis in this section, is of independent interest.

Consider two point sets in d dimensions and range trees built on each of them. We assume that each layer of these trees consists of a BB[α]-tree, although similar analyses can be performed for other types of balanced binary search trees. By definition, every node on the lowest layer of either tree has an associated d-dimensional range (the hyper-box, not the subset of points). For a node of the first tree, we consider the subset of points of the second set that dominate all points in its associated range, which consists of canonical subsets represented by nodes of the second tree. Similarly, for a node of the second tree, we consider the subset of points of the first set that are dominated by all points in its associated range, which consists of canonical subsets represented by nodes of the first tree. We now link a node of the first tree and a node of the second tree if and only if each represents such a canonical subset for the other. By repeatedly applying Lemma 3.1 for each dimension, it can easily be shown that these links represent all dominance relations between the two sets.
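The following Python sketch, specialized to one dimension (the case analyzed first below) and reusing Node and canonical_nodes from the earlier sketch, counts these links on perfectly balanced toy trees; the names are ours, and it is meant only to make the definition concrete, not to reproduce the BB[α] trees of the analysis.

    def all_nodes(v):
        if v is not None:
            yield v
            yield from all_nodes(v.left)
            yield from all_nodes(v.right)

    def count_links_1d(A, B):
        """Count pairs (v, u) with v in the tree on A and u in the tree on B such
        that u is canonical for the keys of B dominating all keys under v, and
        v is canonical for the keys of A dominated by all keys under u."""
        TA, TB = Node(sorted(A)), Node(sorted(B))
        inf = float('inf')
        links = 0
        for v in all_nodes(TA):
            for u in canonical_nodes(TB, v.max, inf, []):
                if v in canonical_nodes(TA, -inf, u.min, []):
                    links += 1
        return links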

As a d-dimensional range tree consists of many nodes, a trivial bound on the number of links follows directly. Below we show that the number of links can be bounded much more tightly. We first consider the one-dimensional case.

4.1 Analyzing the Number of Links in 1D

Consider two one-dimensional point sets and the range trees built on them. Now, every associated range of a node in either tree is an interval. We can extend the interval to infinity in one direction: to the left for one tree, and to the right for the other. For analysis purposes we construct another range tree on the union of the two point sets, which is not a BB[α]-tree, but instead a perfectly balanced tree of logarithmic height. For convenience we assume that the associated intervals of this auxiliary tree are slightly expanded so that all points are always interior to the associated intervals. We associate a node of either original tree with a node of the auxiliary tree if the finite endpoint of its interval is contained in the associated interval of that node.

Observation .

Every node of or is associated with at most one node per level of .

For two intervals and , corresponding to a node and a node , let be the spanning interval of and . We now want to charge spanning intervals of links to nodes of . We charge a spanning interval to a node of if and only if is a subset of , and is cut by the splitting coordinate of . Clearly, every spanning interval can be charged to exactly one node of .

Now, for a node of , let be the height of the highest node of associated with , and let be the height of the highest node of associated with . The number of spanning intervals charged to a node of is .

Proof.

Let be the splitting coordinate of and let and form a spanning interval that is charged to . We claim that, using the notation introduced in Lemma 3.1, (and symmetrically, ). Let be the associated interval of , where . By definition, . If , then the right endpoint of must lie between and . But then the spanning interval of and would not be charged to . As a result, we can only charge spanning intervals between nodes of and nodes of , of which there are at most . ∎

Using Lemma 4.1, we count the total number of charged spanning intervals and hence, links between and . We refer to this number as . This is simply . We can split the sum and assume w.l.o.g. that . Rewriting the sum based on heights in gives

where is the number of nodes of that have a node of height associated with it.

To bound this quantity we use Observation 4.1 and the fact that we use BB[α] trees. From the properties of BB[α] trees we obtain a bound on the number of nodes of a given height.

Proof.

As argued, there are at most nodes in of height . Consider cutting the tree at level . This results in a top tree of size , and bottom trees. Clearly, the top tree contributes at most its size to . All bottom trees have height at most . Every node in of height can, in the worst case, be associated with one distinct node per level in the bottom trees by Observation 4.1. Hence, the bottom trees contribute at most to . ∎

Using this bound on in the sum we previously obtained gives:

Where indeed, because . Thus, we conclude:

The number of links between two one-dimensional range trees containing n and m points, respectively, is bounded as derived above.

4.2 Extending to Higher Dimensions

We now extend the bound to higher dimensions. The idea is very simple. We first determine the links for the top layer of the range trees. This results in links between associated range trees of one dimension lower (see Fig. 7). We then determine the links within the linked associated trees, the number of which can be bounded by induction on the dimension.

The number of links between two -dimensional range trees and containing and () points, respectively, is bounded by .

Proof.

We show by induction on that the number of links is bounded by the minimum of and . The second bound is simply the trivial bound given at the start of Section 4. The base case for is provided by Theorem 4.1. Now consider the case for . We first determine the links for the top layer of and . Now consider the links between an associated tree in containing points and other associated trees that contain at most points. Since can be linked with only one associated tree per level, and because both range trees use BB[α] trees, the number of points in satisfies () where . Now let . Then, for , we get that . Since the sizes of the associated trees decrease geometrically, the total number of links between and for is bounded by . The links with the remaining trees can be bounded by . Finally, note that the top layer of each range tree has levels, and that each level contains points in total. Thus, we obtain links in total. The remaining links for which the associated tree in is larger than in can be bounded in the same way. ∎

Figure 7: Two layered trees with two layers, and the links between them (sketched in black). We are interested in bounding the number of such links.

It follows from Theorem 4.2 that our data structure from Section 3 actually maintains only O(n (log n log log n)^2) certificates. This directly implies that the space usage is only O(n (log n log log n)^2) as well.

5 Conclusion and Future Work

We presented an efficient fully dynamic data structure for maintaining a set of disjoint growing squares. This leads to an efficient algorithm for agglomerative glyph clustering. The main future challenge is to improve the analysis of the running time. Our analysis from Section 4 shows that at any time, we need only a small number of linking certificates. However, we would like to bound the total number of linking certificates used throughout the entire sequence of operations. An interesting question is whether we can extend our argument to this case. This may also lead to a more efficient algorithm for maintaining the linking certificates during updates.

References

  • [1] P. K. Agarwal, H. Kaplan, and M. Sharir. Kinetic and Dynamic Data Structures for Closest Pair and All Nearest Neighbors. ACM Transactions on Algorithms, 5(1):4:1–4:37, 2008.
  • [2] H.-K. Ahn, S. W. Bae, J. Choi, M. Korman, W. Mulzer, E. Oh, J.-w. Park, A. van Renssen, and A. Vigneron. Faster Algorithms for Growing Prioritized Disks and Rectangles. ArXiv e-prints, 2017. arXiv:1704.07580.
  • [3] G. Alexandron, H. Kaplan, and M. Sharir. Kinetic and dynamic data structures for convex hulls and upper envelopes. Computational Geometry, 36(2):144–158, 2007.
  • [4] D. Bahrdt, M. Becher, S. Funke, F. Krumpe, A. Nusser, M. Seybold, and S. Storandt. Growing Balls in . In Proceedings of the 19th Workshop on Algorithm Engineering and Experiments (ALENEX), pages 247–258. SIAM, 2017.
  • [5] T. H. A. Castermans, B. Speckmann, K. A. B. Verbeek, M. A. Westenberg, A. Betti, and H. van den Berg. GlamMap: Geovisualization for e-Humanities. In Workshop on Visualization for the Digital Humanities (Vis4DH), 2016.
  • [6] S. Funke, F. Krumpe, and S. Storandt. Crushing Disks Efficiently. In International Workshop on Combinatorial Algorithms (IWOCA), pages 43–54. Springer, 2016.
  • [7] S. Funke and S. Storandt. Parametrized Runtimes for Ball Tournaments. In European Workshop on Computational Geometry (EuroCG), pages 221–224, 2017.
  • [8] K. Mehlhorn. Data Structures and Algorithms 1: Sorting and Searching. Springer-Verlag, 1984.
  • [9] J. Nievergelt and E. M. Reingold. Binary Search Trees of Bounded Balance. SIAM Journal on Computing (SICOMP), 2(1):33–43, 1973.