Analysis of Spatiotemporal Anomalies Using Persistent Homology: Case Studies with COVID-19 Data

07/19/2021
by   Abigail Hickok, et al.

We develop a method for analyzing spatiotemporal anomalies in geospatial data using topological data analysis (TDA). To do this, we use persistent homology (PH), a tool from TDA that allows one to algorithmically detect geometric voids in a data set and quantify the persistence of these voids. We construct an efficient filtered simplicial complex (FSC) such that the voids in our FSC are in one-to-one correspondence with the anomalies. Our approach goes beyond simply identifying anomalies; it also encodes information about the relationships between anomalies. We use vineyards, which one can interpret as time-varying persistence diagrams (an approach for visualizing PH), to track how the locations of the anomalies change over time. We conduct two case studies using spatially heterogeneous COVID-19 data. First, we examine vaccination rates in New York City by zip code. Second, we study a year-long data set of COVID-19 case rates in neighborhoods in the city of Los Angeles.


1 Introduction

Many systems are spatial in nature. When working with spatial data sets, it is important to study the role of underlying spatial relationships [11]. To illustrate this importance, consider the spatiotemporal dynamics of Coronavirus disease 2019 (COVID-19) case rates, which is one of the key motivations for our work. The spatial adjacencies between the neighborhoods of a city affect these dynamics, and it is important to account for them. Researchers have studied a wide variety of spatial data sets, such as gross domestic product (GDP) and life expectancy by country [40, 3] and voting in elections across different regions of a state [17]. Such data sets often also include temporal information (e.g., daily COVID-19 case rates), and it is also important to take this into account.

We develop new methods for using topological data analysis (TDA) to analyze geospatial (i.e., geographical) and geospatiotemporal data sets in a way that directly incorporates spatial information. TDA is a way of studying the “shape” of a data set [6]. Persistent homology (PH), a tool from algebraic topology, allows one to find geometric voids of different dimensions in a data set and to quantify the “persistence” of these voids [30]. Zero-dimensional (0D) voids are connected components, and one-dimensional (1D) voids are holes; such voids are particularly important in two-dimensional (2D) spatial data. To quantify the persistence of holes and other voids, one constructs a simplicial complex (a combinatorial description of a topological space) and a filtration function (see Section 2.1). In our work, we treat the geographical data as 2D data and construct a 2D filtered simplicial complex to represent it. PH has yielded insights into a wide variety of areas, such as dynamical systems [42, 45, 24, 38, 1], neuroscience [33, 18], materials science [5], and chemistry [25]. Spatial applications that have been examined as 2D data sets include sensor networks [14], percolation [35], and city-street networks and other complex systems [16].

In our consideration of time-dependent data, we use vineyards, which were introduced in [13] as a way of representing time-varying PH, to incorporate temporal information. One can visualize a vineyard as a continuous stack of persistence diagrams (PDs), with one PD for each time point. The PH features trace out curves, which are called vines, in $\mathbb{R}^3$. See Section 2.2 for the definition of a vineyard.

In our approach, the voids that PH identifies correspond to local extrema of real-valued geospatial data. Our approach captures both local information (specifically, the locations and values of the local extrema) and global information about the relationships between the extrema, such as the extent to which extrema are spatially separated. The vineyards allow us to measure the persistence of the extrema over time and to track how the locations of the extrema change over time. We are thus able to track how the spatial structure of the data changes over time.

One of the contributions of our paper is a method for constructing an efficient simplicial complex that is homeomorphic to a geographical space (the set of regions, as we will explain shortly). It is important to attempt to minimize the number of simplices in a simplicial complex because PH and vineyard computation times are very sensitive to the number of simplices. More precisely, we construct a 2D simplicial complex that comes with a natural mapping from the set of 2D simplices to the geographical regions. In our construction, the union of any subset of geographical regions is homeomorphic to the subsimplicial complex (see Section 3 for the definition of a subsimplicial complex) that is induced by the union of the corresponding simplices. When every geographical region is simply connected, our construction uses the minimal number of simplices that a simplicial complex with the property above can have. We believe that our construction is also minimal in more general cases, but we do not prove this. Other methods of producing simplicial complexes from geospatial data, such as rasterization of a shapefile or by treating the regions as a point cloud, require a trade-off between the number of simplices and the accuracy of the representation of the geographical regions. For example, the level-set-based PH method of [17] uses orders-of-magnitude more simplices to achieve sufficient resolution of the smallest geographical regions (e.g., densely populated urban centers that are important to analyze). See Section 7 for further discussion.

Our method addresses several limitations of previous efforts to combine TDA with geospatial analysis. In [37], Stolz et al. studied the percentage of United Kingdom voters by district that voted to leave the European Union in the so-called “Brexit” referendum. The holes that they identified using PH corresponded to districts that voted differently than the surrounding districts. However, the approach in [37] did not distinguish between PH classes that were merely noise and PH classes that corresponded to small geographical districts. In [17], Feng and Porter developed an approach for studying PH in which they constructed filtered simplicial complexes using the level-set method of front propagation [29]. Using their level-set complexes, they examined the percentage of voters in each precinct of California counties that voted for a given candidate (e.g., Hillary Clinton) in the 2016 presidential election. The PH features represented precincts that voted more heavily for Clinton than the surrounding precincts (e.g., “islands of blue in a sea of red”). The level-set complexes in [17] have two key limitations. The first is that they cannot handle time-dependent data, as they are built to study data at a single point in time or data that has been aggregated over some time window to yield time-independent data. The second limitation is that they reduce real-valued data (e.g., the percentage of voters who voted for Clinton) to binary data (e.g., whether or not the majority voted for Clinton). Consequently, in this example, the level-set-based PH does not capture the extent to which a blue “political island” voted more heavily for Clinton. By contrast, our approach is designed specifically to capture such information. As a trade-off, we no longer capture the geographical sizes of the political islands.
See Feng, Hickok, and Porter [15], who applied the level-set filtration to study one of the COVID-19 data sets that we also study in the present paper, for further discussion.

Our new approach for computing PH is also able to resolve some other technical issues in [17]. In particular, some of the PH features in the level-set approach from [17] are geographical artifacts that are indistinguishable from true features of a data set. In our method, by contrast, the finite 1D PH features have either a one-to-one correspondence with the local maxima of the real-valued geospatial function or a one-to-one correspondence with the local minima, depending on the choices that one makes. Additionally, unlike the level-set approach in [17], we are able to detect extrema that are adjacent to the boundary of the geographical space.

As case studies, we apply our method to two data sets. The first data set is a geospatial data set of per capita vaccination rates in New York City (NYC) by zip code. The PH features identify zip codes in which the vaccination rate is either lower or higher (depending on choices that one can make in our approach) than in the surrounding zip codes. The second data set consists of 14-day mean per capita COVID-19 case rates in neighborhoods in the city of Los Angeles (LA) in the time period 25 April 2020–25 April 2021. Modeling the spatiotemporal spread of COVID-19 is a complex task [2, 44]. In this geospatiotemporal data set, the PH features of our approach identify COVID-19 anomalies, which are regions whose case rates are higher than in the surrounding regions [footnote 1]. It is important to examine such anomalies, as COVID-19 spreads with significant spatial heterogeneity and thus has heterogeneous effects on different areas [footnote 2]. Many factors (such as mobility, population density, socioeconomic differences, and racial demographics) play a role in how COVID-19 affects regions differently [19, 20, 9]. In our case study of COVID-19 case rates in LA, we construct a vineyard that (1) identifies which anomalies are most persistent in time and (2) reveals how the anomalies move geographically over time.

Footnote 1: We examine local maxima in the case-rate data. This contrasts with COVID-19 “hotspots,” which the CDC has defined using an absolute threshold for the number of cases and criteria that are related to the temporal increase in the number of cases [10].

Footnote 2: Other scholars have studied contagions using TDA in ways that do not yield topological features with geographical meaning. For example, recent work has used TDA to study the spatiotemporal spread of COVID-19 [32] and Zika [34]. These papers studied topological features in atmospheric data, which were then used to forecast case rates. TDA was also used in [39] to analyze the Watts threshold model of a social contagion on noisy geometric networks.

Our paper proceeds as follows. In Section 2, we review relevant topological background. In Section 3, we formulate how we construct simplicial complexes. In Section 4, we define several filtration functions and discuss how to interpret the resulting PDs and vineyards. In Section 5, we apply our method to the LA and NYC data sets. In Section 6, we discuss our choices in our methodology. In Section 7, we summarize our work and discuss some of its implications. In Appendix A, we discuss technical details of the simplicial-complex construction. In Appendix B, we discuss alternative topological approaches for studying PH in geospatiotemporal data. Our code is available at https://bitbucket.org/ahickok/vineyard/src/main/.

2 Background

2.1 Persistent Homology (PH)

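We briefly review persistent homology (PH). See [30] for a more thorough introduction. To start, let $K$ be a simplicial complex. A filtration function (or simply a filtration) is a function $f: K \to \mathbb{R}$ such that if the simplex $\tau$ is a face of $\sigma$, then $f(\tau) \leq f(\sigma)$. The pair $(K, f)$ is a filtered simplicial complex (FSC). Let $K_\alpha := \{\sigma \in K \mid f(\sigma) \leq \alpha\}$ be the $\alpha$-sublevel simplicial complex, and let $\alpha_1 < \alpha_2 < \cdots < \alpha_n$ be the image of $f$. The sequence $K_{\alpha_1} \subseteq K_{\alpha_2} \subseteq \cdots \subseteq K_{\alpha_n}$ is a nested sequence of simplicial complexes. See Figure 6 for an example of an FSC.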
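The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch and is not the authors' code: the dictionary representation of a filtration and the helper names `faces`, `is_filtration`, and `sublevel_complex` are assumptions made for this example, and the toy complex is hypothetical.

```python
from itertools import combinations

def faces(simplex):
    """Return all nonempty proper faces of a simplex (a sorted tuple of vertices)."""
    return [face for k in range(1, len(simplex))
            for face in combinations(simplex, k)]

def is_filtration(f):
    """Check that every face is present and that f(face) <= f(simplex)."""
    return all(face in f and f[face] <= value
               for simplex, value in f.items()
               for face in faces(simplex))

def sublevel_complex(f, alpha):
    """Return the alpha-sublevel complex: all simplices with value <= alpha."""
    return {simplex for simplex, value in f.items() if value <= alpha}

# Hypothetical toy filtration on a single filled triangle.
f = {(0,): 0.0, (1,): 0.0, (2,): 1.0,
     (0, 1): 1.0, (0, 2): 2.0, (1, 2): 2.0,
     (0, 1, 2): 3.0}

levels = sorted(set(f.values()))          # the image of f, in increasing order
nested = [sublevel_complex(f, alpha) for alpha in levels]
```

Because the face condition holds, each sublevel set in `nested` is itself a simplicial complex and each one contains the previous one.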

Figure 6: An example of nested simplicial complexes in a filtered simplicial complex (panels (a)–(e)).
Figure 7: The persistence diagram for the filtered simplicial complex in Figure 6.

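We compute the homology of each sublevel complex $K_{\alpha_i}$ over a field $\mathbb{F}$, which we set to $\mathbb{Z}/2\mathbb{Z}$ in the present paper. Homology classes represent connected components, holes, and higher-dimensional voids in a simplicial complex. The inclusion $\iota^{i,j}: K_{\alpha_i} \hookrightarrow K_{\alpha_j}$ (for $i \leq j$) induces a map $\iota^{i,j}_*: H_*(K_{\alpha_i}, \mathbb{F}) \to H_*(K_{\alpha_j}, \mathbb{F})$ from the homology of $K_{\alpha_i}$ over $\mathbb{F}$ to the homology of $K_{\alpha_j}$ over $\mathbb{F}$. The persistent homology (PH) of the filtered simplicial complex $(K, f)$ is the $\mathbb{F}[x]$-module $\bigoplus_i H_*(K_{\alpha_i}, \mathbb{F})$, where the action of $x$ is given by the maps $\iota^{i,i+1}_*$; that is, if $\gamma$ is a homology class in $H_*(K_{\alpha_i}, \mathbb{F})$, then $x \cdot \gamma = \iota^{i,i+1}_*(\gamma)$. We say that a PH class $\gamma$ is born at filtration level $\alpha_i$ if $\alpha_i$ is the earliest filtration level at which $\gamma$ exists. More precisely, $\gamma \in H_*(K_{\alpha_i}, \mathbb{F})$ is born at $\alpha_i$ if $\gamma$ is not equivalent to $\iota^{j,i}_*(\gamma')$ for all $j < i$ and $\gamma' \in H_*(K_{\alpha_j}, \mathbb{F})$. We say that the PH class $\gamma$ dies at filtration level $\alpha_j$ if $j$ is the minimum index such that $\iota^{i,j}_*(\gamma) = 0$. Not every class dies; we refer to classes that do die as finite and classes that do not die as infinite.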

The Fundamental Theorem of Persistent Homology yields a set of generators for a given persistence module. According to it, the persistence module is isomorphic to

$$\bigoplus_{i} x^{a_i} \cdot \mathbb{F}[x] \;\oplus\; \bigoplus_{j} x^{b_j} \cdot \left(\mathbb{F}[x] / (x^{c_j})\right) \qquad (1)$$

for some $a_i$, $b_j$, $c_j$. An $x^{a_i} \cdot \mathbb{F}[x]$ summand corresponds to a PH class that is born at filtration level $\alpha_{a_i}$ and never dies. An $x^{b_j} \cdot (\mathbb{F}[x]/(x^{c_j}))$ summand corresponds to a PH class that is born at filtration level $\alpha_{b_j}$ and dies at filtration level $\alpha_{b_j + c_j}$. Each generator has a birth simplex $\sigma_b$ that creates the homological class and (if finite) a death simplex $\sigma_d$ that destroys the homological class. For example, in Figure 6, there is one 1D PH generator. Its birth simplex is the edge that completes the loop that encircles the hole, and its death simplex is the triangle that fills in the hole. The birth filtration level of the PH class is $f(\sigma_b)$, and the death filtration level (if finite) is $f(\sigma_d)$.

A persistence diagram (PD) is a way of representing PH as a multiset of points in the extended plane $\overline{\mathbb{R}}^2$, where $\overline{\mathbb{R}} = \mathbb{R} \cup \{\infty\}$. Each point represents a PH class; the point’s coordinates are the class’s birth and death filtration levels. Given a decomposition of the persistence module of the form Eq. 1, the PD includes the points $(\alpha_{a_i}, \infty)$ for all $i$, the points $(\alpha_{b_j}, \alpha_{b_j + c_j})$ for all $j$, and all points on the diagonal. One includes the points on the diagonal for technical reasons; one can think of them as PH classes that die instantaneously upon birth. See Figure 7 for an example of a PD.
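To make birth simplices, death simplices, and PD points concrete, the following is a minimal sketch of the standard boundary-matrix reduction over $\mathbb{Z}/2\mathbb{Z}$, applied to a hypothetical toy complex (a triangle whose hole is later filled in). It is not the authors' implementation; the function name `persistence_pairs` and the dictionary input format are assumptions made for this example.

```python
def persistence_pairs(f):
    """Compute PD points (dimension, birth, death) of a filtered simplicial complex.

    `f` maps each simplex (a sorted tuple of vertices) to its filtration value.
    Homology is computed over Z/2Z by column reduction of the boundary matrix.
    """
    # Sort by (value, dimension, name) so that faces precede their cofaces.
    order = sorted(f, key=lambda s: (f[s], len(s), s))
    index = {s: i for i, s in enumerate(order)}
    columns = []       # reduced boundary columns, stored as sets of row indices
    pivot_owner = {}   # lowest row index of a reduced column -> column index
    paired = set()
    points = []
    for j, s in enumerate(order):
        col = set()
        if len(s) > 1:
            col = {index[s[:k] + s[k + 1:]] for k in range(len(s))}
        # Add earlier columns (mod 2) until the lowest entry is unique or gone.
        while col and max(col) in pivot_owner:
            col ^= columns[pivot_owner[max(col)]]
        columns.append(col)
        if col:
            i = max(col)               # order[i] is the birth simplex,
            pivot_owner[i] = j         # and s is the death simplex
            paired.update((i, j))
            points.append((len(order[i]) - 1, f[order[i]], f[s]))
    # Unpaired simplices create classes that never die (infinite classes).
    for j, s in enumerate(order):
        if j not in paired:
            points.append((len(s) - 1, f[s], float("inf")))
    return points

# Hypothetical toy FSC: the loop closes at level 3 and is filled at level 5.
f = {(0,): 0.0, (1,): 0.0, (2,): 0.0,
     (0, 1): 1.0, (1, 2): 2.0, (0, 2): 3.0,
     (0, 1, 2): 5.0}
points = persistence_pairs(f)
```

In this toy example, the single finite 1D class has birth simplex `(0, 2)` (the edge that closes the loop) and death simplex `(0, 1, 2)` (the triangle that fills the hole), so it appears in the PD as the point (3, 5).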

2.2 Vineyards

Vineyards are a tool for computing time-varying PH [22]. A time-dependent filtration function on a simplicial complex $K$ is a function $f: K \times [t_0, t_1] \to \mathbb{R}$ such that $f(\cdot, t)$ is a filtration for all $t$ and $(K, f(\cdot, t))$ is a filtered simplicial complex for all $t$. We compute the PH of $(K, f(\cdot, t))$ for all times $t$. We visualize the vineyard in $\mathbb{R}^3$ as a continuous stack of PDs (see Figure 8). The points in the PDs trace out curves over time; these curves are the vines. Each vine corresponds to a PH class; a vine is the graph of the birth and death filtration levels of a particular PH class over time. The PH class that is represented by a vine has a time-dependent birth simplex $\sigma_b(t)$ and (if finite) a time-dependent death simplex $\sigma_d(t)$. At time $t$, the homology class is created by the simplex $\sigma_b(t)$ at filtration level $f(\sigma_b(t), t)$ and destroyed by the simplex $\sigma_d(t)$ at filtration level $f(\sigma_d(t), t)$ (if finite). The functions $\sigma_b$ and $\sigma_d$ are piecewise constant. We measure the overall persistence of a vine by calculating $\int_{t_0}^{t_1} \left[ f(\sigma_d(t), t) - f(\sigma_b(t), t) \right] dt$.

Cohen-Steiner et al. [13] developed an algorithm for computing vineyards when they first introduced the concept. One computes the initial PH at time $t_0$, and one then updates the pairings of birth and death simplices as the order of the simplices (as induced by $f(\cdot, t)$) changes over time. Each change in the order of the simplices occurs one transposition at a time. One can make these updates in $O(N)$ time (where $N$ is the number of simplices) per transposition of simplices.
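The statement that the simplex order changes one transposition at a time can be illustrated with a short sketch. The function below (a hypothetical helper, not part of the vineyard algorithm of [13] or of the authors' code) decomposes the change between two simplex orders into adjacent transpositions using a bubble sort. It ignores the times at which the swaps occur and does not perform the pairing update that a vineyard algorithm applies at each swap.

```python
def adjacent_transpositions(old_order, new_order):
    """List the adjacent swaps that turn `old_order` into `new_order`.

    Both arguments are lists that contain the same simplices. Each returned
    pair (a, b) means that a was directly before b and the two were swapped.
    """
    rank = {s: i for i, s in enumerate(new_order)}
    seq = list(old_order)
    swaps = []
    changed = True
    while changed:
        changed = False
        for i in range(len(seq) - 1):
            if rank[seq[i]] > rank[seq[i + 1]]:
                swaps.append((seq[i], seq[i + 1]))
                seq[i], seq[i + 1] = seq[i + 1], seq[i]
                changed = True
    return swaps

# Hypothetical example: three edges whose filtration values cross over time.
swaps = adjacent_transpositions(["a", "b", "c"], ["c", "a", "b"])
```

The number of swaps equals the number of inversions between the two orders, so the total update cost between two time points scales with that number times the per-transposition cost.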

Figure 8: An example of a vineyard. Each curve is a vine in the vineyard. This figure is a slightly modified version of a figure that appeared originally in [22].

3 Constructing a Simplicial Complex

We now show how we construct a simplicial complex from geographical data (e.g., a shapefile that specifies geographical boundaries). We partition the geographical space into regions. In Section 5.1, the regions are zip codes in NYC; in Section 5.2, the regions are neighborhoods in the city of LA. Let $\mathcal{R}$ be the set of regions. We refer to the complement of $\bigcup_{R \in \mathcal{R}} R$ as the exterior region. Given the geographical boundaries of such a set $\mathcal{R}$, we construct a 2D simplicial complex $K$ with the following property:

  1. There is an assignment of 2D simplices to regions such that the union of any subset of regions is homeomorphic to the subsimplicial complex that is induced by the union of the corresponding 2D simplices. The subsimplicial complex $K(S)$ that is induced by a set $S$ of simplices is the smallest simplicial complex that contains the set $S$ of simplices. That is, if $K'$ is a simplicial complex that contains $S$, then $K(S) \subseteq K'$. When $K$ is 1D, a subsimplicial complex is equivalent to an induced subgraph.

In Figure 11, we present an example of our construction.

Figure 11: (a) A set of geographical regions. (b) The resulting simplicial complex $K$.

When every geographical region is simply connected, our simplicial complex has the minimal number of simplices that is possible for a simplicial complex with the property above. We believe that our construction is minimal in more general cases (specifically, under the assumptions that we define shortly). Constructing an efficient simplicial complex is important because the run time of TDA computations is very sensitive to the number of simplices.

We make the following (mild) assumptions about geographical regions:

  1. There are a finite number of regions, and each region has a finite number of connected components.

  2. Each region is a compact subset of $\mathbb{R}^2$.

  3. The boundary of each component of a region is a finite collection of curves that are homeomorphic to $S^1$. (This ensures that region boundaries are not self-intersecting. Each component of a region has an outer boundary component and some number (which can be $0$) of inner boundary components.)

  4. The intersection between any two regions has a finite number of components. Each component of the intersection is homeomorphic to a point, a closed interval in $\mathbb{R}$, or $S^1$.

  5. The intersection between three or more regions is either a point or $\emptyset$.

These conditions are very reasonable for human-made geographical boundaries. We do not even require the regions to be simply connected or for the region intersections to be connected. In Figure 11(a), we illustrate the most typical situation that we encounter. In this example, LA neighborhood Granada Hills is homeomorphic to a disc and its intersection with each of its neighbors is homeomorphic to a closed interval in $\mathbb{R}$. In Figures 14 and 17(a), we illustrate a few of the other possible configurations that arise in geospatial applications. In our case studies, the geographical data take the form of shapefiles. In a shapefile, each region is represented by a polygon (or by multiple polygons, if the region is disconnected). If the interiors of the polygons do not intersect, then assumptions 1–5 are satisfied. In practice, the polygon boundaries are not always aligned perfectly and thus may overlap slightly, but we can still approximate a given set of regions by a set of regions that do satisfy assumptions 1–5. In our data, one of these assumptions does not always hold; for example, see Figure (a)a for a violation in the NYC data set. However, by making a few modifications, we are still able to construct a simplicial complex with the property above for this data set. One can make analogous modifications for similar data sets. We discuss this in more detail in Section 5.1.

Figure 14: Various neighborhoods in LA. (a) Valley Glen. The intersection between Valley Glen and Valley Village is a point. The four neighborhoods Valley Glen, Valley Village, Sherman Oaks, and North Hollywood intersect in a point. (b) West Vernon. The neighborhood West Vernon has an inner boundary component because of its neighbor Vermont Square. The intersection between West Vernon and Vermont Square is homeomorphic to $S^1$.
Figure 17: (a) A geographical set that consists of the neighborhood Koreatown and its neighbors. We observe that the neighborhood Little Bangladesh has only two neighbors and that the intersection between Koreatown and Wilshire Center has two components. (b) The simplicial complex $K$ that results from gluing Koreatown’s simplicial complex to the simplicial complexes of its neighbors. Koreatown’s simplicial complex has two edges with the annotation “Wilshire Center”. The two vertices of one edge have the adjacency sequences {Koreatown, Hancock Park, Wilshire Center} and {Koreatown, Wilshire Center, Little Bangladesh}, respectively. The two vertices of the other edge have the adjacency sequences {Koreatown, Little Bangladesh, Wilshire Center} and {Koreatown, Wilshire Center, Pico-Union}, respectively. To find the correct matching with the two edges with the annotation “Koreatown” in Wilshire Center’s simplicial complex, we compare vertex adjacencies.

To build a simplicial complex from our geographical data, we proceed as follows. First, we construct a 2D simplicial complex $K_R$ for each region $R$. We then glue their boundaries together in a way that respects the geographical region boundaries. In Figure 11, we show an example of this procedure. For each of the five regions in the example, we construct a simplicial complex that consists of a few triangles. We then glue the five simplicial complexes together along their boundaries to obtain a simplicial complex $K$ with the property above. We assign a 2D simplex $\sigma$ to the region $R$ whose simplicial complex $K_R$ originally contained $\sigma$. In the remainder of this section, we discuss the details of this process.

Under the geographical assumptions 1–5, the intersections of a region with its neighbors are such that for each component of the region’s boundary, one can order the neighbors in clockwise (or counterclockwise) fashion, possibly with repetition [footnote 3]. We list intersections with the exterior region in the same manner as for any other neighboring region. We also record whether each intersection is 1D or 0D. For example, in Figure 14(a), the clockwise sequence of neighbors around the boundary of Valley Glen is {Van Nuys, North Hollywood, Valley Village, Sherman Oaks}. The intersection with Valley Village is 0D and the other intersections are 1D. For regions such as West Vernon in Figure 14(b), we obtain a sequence for each boundary component. Each sequence is unique up to the choice of starting neighbor.

Footnote 3: Theoretically, several 0D intersections can be adjacent to each other, although this scenario does not occur in our data sets. That is, in principle, there can be a sequence $N_1, \ldots, N_k$ of neighbors of a region $R$ such that $R \cap N_i$ is the same point $p$ for all $i$. The order of this sequence is not determined uniquely by the intersections of the neighbors with $R$. Instead, we order them in the order in which they appear clockwise (or counterclockwise) around the point $p$. This sequence must be finite because there are a finite number of regions and our assumptions imply that $N_i \neq N_j$ if $i \neq j$.

Given a sequence of neighbors for each boundary component of each region (which, if necessary, we adjust as in Appendix A.1), we construct a 2D simplicial complex $K_R$ for each region $R$ using Algorithm A.2. In Figure 21, we illustrate examples of the resulting simplicial complexes. Without loss of generality, we assume that each region is connected; if not, we treat each component of a region as if it were its own unique region. To region $R$, we assign a simplicial complex $K_R$ such that the $i$th boundary component of $K_R$ is a cycle that has one edge for each neighbor $N$ such that $R \cap N$ is 1D. For example, Granada Hills (see Figure 11) is assigned the simplicial complex in Figure 21(a). We annotate each edge of the boundary with the neighbor that corresponds to it. We also annotate each vertex with the sequence of its adjacent regions, which we list in clockwise order starting with $R$.

Figure 21: Construction of a simplicial complex $K_R$ for a region $R$ when $R$ (a) has no inner boundary components, (b) has a single inner boundary component, and (c) has multiple inner boundary components.

We then glue the simplicial complexes along their edges according to their edge and vertex annotations. More precisely, if $K_R$ has $m$ disjoint edges with the annotation $N$ (which is the typical situation when $R \cap N$ has $m$ components that are 1D), then $K_N$ has exactly $m$ disjoint edges with the annotation $R$. Let $(v_1, v_2)$, with $v_1$ and $v_2$ in clockwise order, be the vertices of an edge $e$ in $K_R$ with annotation $N$. Because the edges are disjoint, each of $v_1$ and $v_2$ must be adjacent to at least one region other than $R$ and $N$. We seek an edge $e' = (w_1, w_2)$ (with $w_2$ and $w_1$ in clockwise order) in $K_N$ with the annotation $R$ such that (1) $v_1$ and $w_1$ are annotated with the same sequences and (2) $v_2$ and $w_2$ are annotated with the same sequences. We know that there must be at least one such edge because $e$ represents a component of $R \cap N$ and there is some edge in $K_N$ that represents the same component (and thus its vertices have the same sequences of adjacent regions as $v_1$ and $v_2$). In Appendix A.3, we prove that there is a unique such edge. In Figure 17, we show an example of this case. If there are $k$ consecutive edges $e_1, \ldots, e_k$ on the boundary of $K_R$ with annotation $N$, then there are $k$ consecutive edges $e'_1, \ldots, e'_k$ on the boundary of $K_N$ with annotation $R$. This situation arises precisely because of the adjustments we discuss in Appendix A.1. We glue $e_i$ to $e'_i$ for all $i$. If $R \cap N$ is homeomorphic to $S^1$, then the choice of $e'_1$ as the first edge in $K_N$ is not unique, but all choices result in topologically equivalent spaces. The result of this gluing process is a topological space with the property that we stated at the beginning of this section.
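The matching step for disjoint edges can be sketched as follows, using the Koreatown example from the caption of Figure 17. This is a hedged illustration rather than the authors' code. It assumes that two vertex annotations count as "the same sequence" when they agree up to cyclic rotation (because each region lists the sequence starting with itself). The Wilshire Center annotations below are hypothetical; they are obtained by rotating the Koreatown sequences in the caption.

```python
def canonical(seq):
    """Return the lexicographically smallest rotation of a cyclic sequence."""
    seq = tuple(seq)
    return min(seq[i:] + seq[:i] for i in range(len(seq)))

def find_matching_edge(edge, candidates):
    """Return the index of the unique candidate edge whose two vertex
    annotations agree (up to rotation) with those of `edge`."""
    key = tuple(canonical(s) for s in edge)
    hits = [i for i, cand in enumerate(candidates)
            if tuple(canonical(s) for s in cand) == key]
    if len(hits) != 1:
        raise ValueError("expected exactly one matching edge, found %d" % len(hits))
    return hits[0]

K, W = "Koreatown", "Wilshire Center"
H, L, P = "Hancock Park", "Little Bangladesh", "Pico-Union"

# Edges of Koreatown's complex with annotation "Wilshire Center" (from Figure 17).
koreatown_edges = [((K, H, W), (K, W, L)),
                   ((K, L, W), (K, W, P))]
# Hypothetical edges of Wilshire Center's complex with annotation "Koreatown".
wilshire_edges = [((W, K, L), (W, P, K)),
                  ((W, K, H), (W, L, K))]

matches = [find_matching_edge(e, wilshire_edges) for e in koreatown_edges]
```

Although both Koreatown edges touch Little Bangladesh and Wilshire Center, the cyclic order of the adjacent regions differs at the two vertices, which is what makes the match unique in this example.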

Code for our simplicial-complex algorithm is available at https://bitbucket.org/ahickok/vineyard/src/main/. This code has one limitation that the algorithm in the present paper does not: It requires that no interior region (i.e., a region that is contained within the outer boundary of another region) intersects any other interior region. This does not occur in our data, and we believe that it does not occur in most geographical spaces.

4 Our Filtration Functions

We define various filtrations that one can use with the simplicial complex that we constructed in Section 3, and we discuss how to interpret the resulting PDs and vineyards. Let $\mathcal{R}$ be the set of geographical regions that the simplicial complex $K$ represents, and let $g: \mathcal{R} \to \mathbb{R}$ be a real-valued function on $\mathcal{R}$. For example, in Section 5.1, $g(R)$ is the per capita full vaccination rate (i.e., having received all required doses of some vaccine) for COVID-19 in NYC zip code $R$. In Sections 4.1 and 4.2, we define two filtration functions that are induced by $g$. Given a time-dependent and real-valued function $g: \mathcal{R} \times [t_0, t_1] \to \mathbb{R}$, we define time-dependent filtration functions in Section 4.3. For example, in Section 5.2, $g(R, t)$ is the 14-day mean per capita COVID-19 case rate in neighborhood $R$ on day $t$. From a time-dependent filtration function, we compute a vineyard.

4.1 The Sublevel Filtration

In this subsection, we define a sublevel filtration. In our applications, we use the 1D PH of the sublevel filtration to analyze local maxima in our data sets. We illustrate the idea of a sublevel filtration in Figure 32.

Figure 32: In panels (a)–(e), we show the $\alpha$-sublevel sets for increasing $\alpha$ of a function $g_1$ that has two well-separated local maxima. In (a), where $\alpha$ is smallest, there is one hole that corresponds to the global maximum. In (b), a second hole appears; it corresponds to the other local maximum. In (d), the second hole is filled in. In (e), the first hole is filled in. In panels (f)–(j), we show the $\alpha$-sublevel sets for increasing $\alpha$ of a function $g_2$ whose two local maxima have the same locations and values as $g_1$, but are less separated from each other. Consequently, the second hole does not appear until the sublevel set in panel (h). In all panels, the jagged edges are artifacts of the way that the Python package matplotlib plots surfaces.
Definition 1 (Sublevel Filtration)

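Let $K$ be a simplicial complex from the construction in Section 3 for a set $\mathcal{R}$ of regions, and let $\pi$ be the assignment of 2D simplices to regions. Let $g: \mathcal{R} \to \mathbb{R}$. We define the sublevel filtration function $f$ by considering the sublevel sets of $g$. On the 2D simplices, we define the filtration function by

$$f(\sigma) = g(\pi(\sigma)).$$

We extend the filtration function to the remaining simplices by setting

$$f(\sigma) = \min_{R \in \mathcal{R}} g(R)$$

if $\sigma$ is a vertex or edge on the boundary of $K$ and by setting

$$f(\sigma) = \min \{ f(\tau) \mid \tau \text{ is a 2D simplex such that } \sigma \subseteq \partial\tau \} \qquad (2)$$

otherwise, where $\partial\tau$ denotes the boundary of $\tau$.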
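A sublevel filtration of this kind can be sketched on a small, hypothetical complex. The sketch below is not the authors' code. It assumes that exterior-adjacent vertices and edges receive the global minimum of the region values and that every other vertex or edge receives the minimum value over the triangles that contain it; the function name `sublevel_filtration`, the toy triangulation, and the region names are all made up for this example.

```python
from itertools import combinations

def sublevel_filtration(triangle_region, g):
    """Build filtration values for all simplices of a 2D complex.

    `triangle_region` maps each triangle (a sorted 3-tuple) to its region,
    and `g` maps each region to a real value.
    """
    f = {tri: g[region] for tri, region in triangle_region.items()}
    cofaces = {}  # vertex or edge -> list of triangles that contain it
    for tri in triangle_region:
        for k in (1, 2):
            for face in combinations(tri, k):
                cofaces.setdefault(face, []).append(tri)
    # An edge is exterior-adjacent if it lies in exactly one triangle;
    # a vertex is exterior-adjacent if it is an endpoint of such an edge.
    boundary_edges = [e for e, tris in cofaces.items()
                      if len(e) == 2 and len(tris) == 1]
    exterior = set(boundary_edges) | {(v,) for e in boundary_edges for v in e}
    g_min = min(g.values())
    for face, tris in cofaces.items():
        f[face] = g_min if face in exterior else min(f[t] for t in tris)
    return f

# Hypothetical toy map: region "M" (inner triangle) is surrounded by a ring
# that is split into the regions "W" and "E".
triangle_region = {
    (0, 1, 3): "W", (1, 3, 4): "W", (1, 2, 4): "W",
    (2, 4, 5): "E", (0, 2, 5): "E", (0, 3, 5): "E",
    (3, 4, 5): "M",
}
g = {"W": 1.0, "E": 2.0, "M": 5.0}
f = sublevel_filtration(triangle_region, g)
```

In this toy example, the exterior-adjacent edge `(0, 2)` borders only region "E" (value 2.0) but receives the value 1.0, which is the kind of adjustment that the text motivates for detecting extrema on the boundary of the geographical space.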

At filtration level $\alpha$, the simplicial complex $K_\alpha$ is the subsimplicial complex of $K$ that is induced by the union of the set of 2D simplices $\sigma$ such that $f(\sigma) \leq \alpha$ and the set of vertices and edges that are on the boundary of $K$ (that is, the set of vertices and edges that represent intersections of the regions in $\mathcal{R}$ with the exterior region). (Henceforth, we say that such simplices are “exterior-adjacent”.) By construction, $K_\alpha$ is homeomorphic to the union of regions $R$ such that $g(R) \leq \alpha$ along with the exterior boundary. We set $f(\sigma) = \min_{R \in \mathcal{R}} g(R)$ for exterior-adjacent vertices and edges $\sigma$ for technical reasons that we will explain in a few paragraphs. In Appendix B.2, we explore an alternative definition in which we set the filtration values of exterior-adjacent vertices and edges $\sigma$ to $\min_{R \subseteq C} g(R)$, where $C$ is the connected component that contains $\sigma$.

The 1D PH of the sublevel filtration encodes information about the structure of the local maxima of $g$. A region $R$ is a local maximum if the value of $g(R)$ is larger than the value of $g(N)$ for all neighboring regions $N$ of $R$ for which $R \cap N$ is 1D. If $R$ is a local maximum, there is a 1D PH class whose death simplex is one of the simplices in the preimage $\pi^{-1}(R)$. The class dies at filtration level $g(R)$. For example, if $g(R)$ is the COVID-19 case rate in region $R$, then 1D PH classes correspond to COVID-19 anomalies and the death simplex of a 1D PH class is the epicenter of that anomaly. The larger the value of $g(R)$ in comparison to the surrounding regions (including regions that are nearby but not necessarily immediate neighbors), the more persistent the PH class is. If the union of all regions (excluding the exterior region) is not simply connected, then there is at least one 1D PH class with an infinite death time. See Figure (b)b for an example. The infinite 1D PH classes correspond to the holes in the geographical space, rather than to local maxima. The local maxima of $g$ are in one-to-one correspondence with the set of 1D PH classes with finite death times. There is a canonical mapping from finite 1D PH classes to regions. A class that is represented by simplex pair $(\sigma_b, \sigma_d)$ is mapped to the region $\pi(\sigma_d)$ that contains $\sigma_d$. The region $\pi(\sigma_d)$ is the location of the local maximum of $g$ that corresponds to the PH class, and the death simplex’s filtration value is the value of the local maximum. The death simplices $\sigma_d$ of the finite 1D PH classes and their filtration values give the local-maximum locations $\pi(\sigma_d)$ and their function values $g(\pi(\sigma_d))$.
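The definition of a local maximum can be checked directly on a region-adjacency structure, which gives a simple baseline against which to compare the finite 1D PH classes. The sketch below is illustrative only. The helper `local_maxima` and the toy data are hypothetical, and the `neighbors` dictionary is assumed to list only neighbors whose intersection with the region is 1D (the exterior region is excluded because it carries no function value).

```python
def local_maxima(g, neighbors):
    """Return the regions whose value exceeds that of every 1D-adjacent neighbor.

    `g` maps regions to values; `neighbors` maps each region to the collection
    of regions that share a 1D intersection with it.
    """
    return sorted(r for r in g
                  if all(g[r] > g[n] for n in neighbors.get(r, ())))

# Hypothetical toy data: five regions with two local maxima ("B" and "D").
g = {"A": 0.1, "B": 0.7, "C": 0.3, "D": 0.5, "E": 0.2}
neighbors = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
maxima = local_maxima(g, neighbors)
```

According to the correspondence described above, one would expect a sublevel-filtration computation on the matching simplicial complex to produce one finite 1D PH class for each region in `maxima`, with death values `g["B"]` and `g["D"]`.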

The 1D PH does more than simply identify local maxima and their locations; it also reveals information about their relationships to each other. If the local maxima are well-separated from one another, then the corresponding PH classes all have early birth times. In the NYC data set, for example, there are several connected components and one can think of the global maxima of the connected components as “totally separated” from each other because they are on different connected components. The corresponding 1D PH classes are all born at the earliest possible filtration time, which is $\min_{R \in \mathcal{R}} g(R)$ (see Figure (a)a). We show an example of well-separated local maxima in Figure 32(e). By contrast, the two local maxima in Figure 32(j) are not well-separated, so the PH class that corresponds to the lower peak in Figure 32(j) is born at a higher filtration value than the PH class in Figure 32(e). The birth times of the 1D PH classes reflect structural information about the local maxima.

We set the filtration value of exterior-adjacent vertices and edges to so that 1D PH can detect local maxima on the boundary of a geographical space. This is important for the LA data set of COVID-19 case rates. In Figure 51, we observe that many of the most persistent COVID-19 anomalies are on the boundary of the geographical space, and it is crucial that we are able to detect them. If we had not made this adjustment, the filtration value of exterior-adjacent vertices and edges would be the value of , where is the unique region that is adjacent to . If is a local maximum, its corresponding 1D PH class would be born and die at filtration level . In the PD, it would then appear as a point on the diagonal. Therefore, for 1D PH to detect local maxima on the boundary of a geographical space, we must adjust the filtration values of exterior-adjacent vertices and edges.

The 0D PH classes correspond to local minima of . However, unlike for the 1D PH classes, there is not a natural mapping from 0D PH classes to the locations of the minima. In Appendix B.1, we discuss the interpretation and computation of 0D PH classes in more detail.
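The definition of a local maximum in this section can be checked directly on a region-adjacency structure, without any PH machinery. The following is a minimal sketch of that check only; it does not compute persistence. The function name `local_maxima`, the dictionary-based inputs, and the toy values are all illustrative assumptions, and `neighbors` is assumed to list only the neighbors that share a 1D boundary.

```python
from typing import Dict, Hashable, List, Set


def local_maxima(values: Dict[Hashable, float],
                 neighbors: Dict[Hashable, Set[Hashable]]) -> List[Hashable]:
    """Return the regions whose value strictly exceeds that of every neighbor.

    `neighbors[r]` is assumed to contain only regions that share a 1D
    boundary with r. A region with no neighbors is trivially a maximum.
    """
    maxima = []
    for region, value in values.items():
        if all(value > values[other] for other in neighbors.get(region, set())):
            maxima.append(region)
    return maxima


# Toy data (made up for illustration): four regions in a row, A-B-C-D.
toy_values = {"A": 0.2, "B": 0.7, "C": 0.4, "D": 0.9}
toy_neighbors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(local_maxima(toy_values, toy_neighbors))
```

In this toy example, regions B and D are the local maxima. The 1D PH additionally records how persistent each maximum is and how the maxima relate to one another, which this check does not capture.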

4.2 The Superlevel Filtration

An alternative to using the sublevel filtration from Section 4.1 is to instead consider superlevel sets of and use them to construct a superlevel filtration. In our case studies, we use the superlevel filtration to analyze local minima in our data sets. We illustrate the idea of the superlevel filtration in Figure 33.

Figure 33: The -superlevel sets for decreasing for the graph of a planar function with two local minima.

Definition 2 (Superlevel Filtration)

Let for a set of regions. The superlevel filtration function is the sublevel filtration function that is induced by .

At filtration level , the simplicial complex is the subsimplicial complex of that is induced by the union of the set of 2D simplices for which . By construction, is homeomorphic to the union of regions for which . Local maxima of now correspond to 0D PH classes, and local minima of now correspond to 1D PH classes; this is the opposite situation from the sublevel filtration. Our discussion of local maxima for the sublevel filtration in Section 4.1 applies to local minima for the superlevel filtration, and our discussion of local minima for the sublevel filtration in Section 4.1 applies to local maxima for the superlevel filtration. The only difference is that the filtration values in the superlevel filtration are the additive inverses of the function values of . This implies, for example, that the death filtration value of a 1D PH class that corresponds to a local minimum at region is , rather than .
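The sign convention of the superlevel filtration is easy to get wrong in practice. The sketch below illustrates it under the assumption that region values are stored in a dictionary; the helper names `superlevel_values` and `to_function_value` are hypothetical and are not taken from the accompanying code.

```python
from typing import Dict, Hashable


def superlevel_values(values: Dict[Hashable, float]) -> Dict[Hashable, float]:
    """Negate the region values.

    A sublevel filtration that is induced by the negated values is the
    superlevel filtration of the original values.
    """
    return {region: -value for region, value in values.items()}


def to_function_value(filtration_value: float) -> float:
    """Convert a superlevel filtration value back to a function value."""
    return -filtration_value


# Toy values (made up for illustration).
toy = {"A": 0.25, "B": 0.5}
negated = superlevel_values(toy)
print(negated)
print(to_function_value(negated["A"]))
```

A death filtration value that is read off a superlevel PD must be negated in this way before one interprets it as the value at a local minimum.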

4.3 A Time-Dependent Filtration

Suppose that we have a time-dependent, real-valued function whose domain is , where is the initial time and is the final time. For example, in Section 5.2, the value of is the 14-day mean per capita COVID-19 case rate in Los Angeles on day . We seek to analyze the structure of local extrema as they change over time.

Definition 3 (Time-Dependent Sublevel Filtration)

Let be a time-dependent function on a set of regions. At each time , we define the time-dependent filtration function to be the sublevel filtration that is induced by . To extend this filtration function to the entire interval , we linearly interpolate.
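The interpolation step in Definition 3 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the case studies: time steps are assumed to be indexed 0, 1, 2, … with unit spacing, each snapshot is assumed to be a dictionary from simplices to filtration values, and the name `interpolate_filtration` is hypothetical.

```python
import math
from typing import Dict, Hashable, Sequence


def interpolate_filtration(snapshots: Sequence[Dict[Hashable, float]],
                           t: float) -> Dict[Hashable, float]:
    """Linearly interpolate per-simplex filtration values at a real time t.

    snapshots[k] holds the filtration value of every simplex at time step k.
    """
    last = len(snapshots) - 1
    if not 0 <= t <= last:
        raise ValueError("t is outside the sampled time interval")
    lo = int(math.floor(t))   # nearest sampled time step at or before t
    hi = min(lo + 1, last)    # next sampled time step (clamped at the end)
    w = t - lo                # interpolation weight in [0, 1)
    return {simplex: (1 - w) * snapshots[lo][simplex] + w * snapshots[hi][simplex]
            for simplex in snapshots[lo]}


# Toy snapshots (made up): one simplex whose value rises from 0.0 to 2.0.
toy_snapshots = [{"s": 0.0}, {"s": 2.0}]
print(interpolate_filtration(toy_snapshots, 0.5))
```

The same simplicial complex is used at every time step, so only the filtration values change with time.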

In the present paper, we only use the time-dependent sublevel filtration, but one can analogously define a time-dependent superlevel filtration. We have implemented both of these filtrations in our code.

We use a time-dependent sublevel filtration to construct a vineyard. This allows us to track how the extrema move in both space and time. As in Section 4.1, each finite vine corresponds to a local maximum whose location at time is given by the region that contains the vine’s time-dependent death simplex . The length of a vine corresponds to its persistence in time.

5 Case Studies

We now apply our methods to two data sets, which we visualize in Figure 36.

(a) NYC zip codes
(b) LA neighborhoods
Figure 36: We show the (a) per capita COVID-19 full vaccination rate in New York City (NYC) by (modified) zip code on 23 February 2021 and (b) 14-day mean per capita COVID-19 case rate in the city of Los Angeles (LA) by neighborhood on 30 June 2020.

5.1 COVID-19 Vaccination Rates in New York City

We examine vaccination rates in (modified) zip codes of NYC [Footnote 4: Modified zip-code tabulation areas (MODZCTA) are used by the NYC Department of Health & Mental Hygiene for COVID-19 data [27]. In these modified zip codes, some zip codes with small populations are combined [28]. We henceforth refer to modified zip codes as simply “zip codes”.]. We demonstrate the effects of the two filtrations that we defined in Section 4. The geographical boundaries of the zip codes are given by a shapefile [27]. New York City zip codes do not satisfy assumption 3 because some of the zip-code boundaries have a component that is homeomorphic to two circles that are glued at a point (i.e., a figure-8). For an example, see Figure 39(a). We construct a simplicial complex in a way that is similar to the construction in Section 3; everything is the same except for some minor modifications to the way that we construct the simplicial complexes for zip codes with figure-8 boundaries. In Figure 39(b), we show how one constructs a simplicial complex  for such a region. Our construction still has property 3.

(a)
(b)
Figure 39: (a) An example of a zip code in NYC whose boundary is homeomorphic to a figure-8. The blue region is the zip code 10469, and the orange region is a subset of the zip code 10461. (b) An illustration of how we construct a simplicial complex  for a region that has the form of zip code 10469.

The data set, which we obtained from the NYC Department of Health & Mental Hygiene website [12], consists of the number of fully vaccinated people in each zip code on 23 February 2021 [Footnote 5: The NYC Department of Health & Mental Hygiene defines “fully vaccinated” people to be individuals who have either received both doses of the Pfizer or Moderna vaccine or one dose of the Johnson & Johnson vaccine. (This differs from common parlance, in which people are sometimes labeled as “fully vaccinated” only after two weeks have passed after their final dose of a vaccine.)]. For each zip code, we divide this number by its population estimate in [12] to obtain a per capita vaccination rate. For zip code , we define to be the per capita full vaccination rate in on 23 February 2021.
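The normalization step is a simple division of counts by population estimates. A minimal sketch follows, assuming that both quantities are stored in dictionaries that are keyed by zip code; the function name `per_capita_rates` and the numbers are made up for illustration and are not values from the data set.

```python
from typing import Dict


def per_capita_rates(counts: Dict[str, int],
                     populations: Dict[str, int]) -> Dict[str, float]:
    """Divide each region's count by its population estimate."""
    rates = {}
    for region, count in counts.items():
        population = populations[region]
        if population <= 0:
            raise ValueError(f"nonpositive population for region {region!r}")
        rates[region] = count / population
    return rates


# Hypothetical counts and populations (not real NYC data).
toy_counts = {"zipA": 500, "zipB": 120}
toy_populations = {"zipA": 10000, "zipB": 4000}
print(per_capita_rates(toy_counts, toy_populations))
```

The explicit population check guards against regions with a missing or zero population estimate, for which a per capita rate is undefined.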

We do not possess the daily vaccination-rate data that is necessary to compute a vineyard, so instead we calculate the PH of with the sublevel and superlevel filtrations from Sections 4.1 and 4.2. We show the resulting PDs for the 1D PH in Figure 42. As we described in Section 4.1, the points in the sublevel-filtration PD correspond to zip codes in which vaccination rates are higher than in the surrounding zip codes. The death filtration level of a PH class is the vaccination rate in that zip code, and the birth filtration level of a PH class reflects the extent of spatial isolation of that zip code from other local maxima; an earlier birth filtration implies more spatial isolation. Similarly, the points in the superlevel filtration PD correspond to zip codes in which the vaccination rate is lower than in surrounding areas. As we discussed in Section 4.1, we obtain the zip code of a PH class from its death simplex . We color all points in the PDs by the borough of the corresponding zip code.

An issue arises from the fact that several of the NYC zip codes are isolated islands. These islands are trivial extrema because they are not adjacent to any other zip codes. One may want to exclude these trivial extrema from the PD. In Appendix B.2, we propose alternative methods for handling disconnected geographical spaces such as NYC.

(a) Sublevel filtration
(b) Superlevel filtration
Figure 42: PDs for the 1D PH of the NYC simplicial complex with filtrations that are induced by the per capita full vaccination rate by zip code on 23 February 2021. We show only the finite PH classes. Each point in a PD corresponds to a zip code, which we label according to its borough [26], that has (a) a higher vaccination rate than its neighboring zip codes or (b) a lower vaccination rate than its neighboring zip codes.

One can use the PDs in Figure 42 to study inequities in vaccine access. For example, it seems potentially desirable to discern patterns in demographic data that correspond to the most persistent points in the PDs.

5.2 COVID-19 Infections in the City of Los Angeles

We now examine COVID-19 case rates in neighborhoods of the city of Los Angeles (LA) [Footnote 6: We exclude Angeles National Forest because it has only 20 inhabitants.]. The geographical boundaries of the neighborhoods are given by a shapefile [23]. From this, we construct a simplicial complex in the manner that we described in Section 3. We also know the number of cases in each neighborhood from 25 April 2020 through 25 April 2021. For each neighborhood, we divide the case count by the neighborhood population to obtain per capita case rates, and we calculate a running 14-day mean [Footnote 7: On day , we take the mean of the case rates on days , , …, . Some outlets (e.g., [36]) report running 14-day means of COVID-19 case counts, and other outlets (e.g., [41]) report 14-day trends.] on each day to smooth the data. For neighborhood and time , we define to be the 14-day mean per capita case rate in on day after April 2020. We compute the vineyard for a simplicial complex using the time-dependent sublevel filtration that is induced by . We show our vineyard in Figure 45, and we show subsets of this vineyard in Figures 49 and 54.
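The smoothing step can be sketched as follows. The exact window alignment is not recoverable from the text here, so this sketch assumes a trailing window (the current day and the 13 days that precede it); the function name `running_mean` is hypothetical.

```python
from typing import List, Sequence


def running_mean(rates: Sequence[float], window: int = 14) -> List[float]:
    """Trailing running mean of a daily series.

    Entry k of the output is the mean of the `window` consecutive days that
    end on day k + window - 1 of the input. Days that lack a full window of
    history are skipped (an assumption of this sketch).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(rates[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(rates))]


# Toy series (made up) with a short window so that the arithmetic is visible.
print(running_mean([1.0, 2.0, 3.0, 4.0], window=2))
```

With the default window of 14, a series of n daily rates yields n - 13 smoothed values.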

(a)
(b)
Figure 45: (a) The vineyard for the LA simplicial complex that we construct using the sublevel filtration from the 14-day mean per capita case rate during the period 25 April 2020–25 April 2021. Each vine represents a COVID-19 anomaly. We color each vine according to the geographical locations of its associated anomaly. (See Figure 46 for the legend.) Because the geographical location of an anomaly can change over time, a single vine can have multiple colors. (b) A different view of the same vineyard.
Figure 46: The legend for Figure 45. Each of the depicted regions is a local maximum of the COVID-19 case-rate function for some subset of the time period 25 April 2020–25 April 2021.
(a)
(b)
Figure 49: (a) The five most-persistent vines of the vineyard in Figure 45. Each vine represents a COVID-19 anomaly. We color each vine according to the geographical locations of its associated anomaly. Because the geographical location of an anomaly can change over time, a single vine can have multiple colors. (See Figure 50 for the legend.) (b) A different view of the same five vines.
Figure 50: The legend for Figure 49. Each of the depicted regions is a local maximum of the COVID-19 case-rate function for some subset of the time period 25 April 2020–25 April 2021.

The vines in the vineyard correspond to COVID-19 anomalies, which we define to be neighborhoods that have a higher running 14-day mean COVID-19 case rate than the surrounding neighborhoods for at least one day. Anomalies that are more spatially isolated yield vines with early birth-filtration levels, and anomalies with high case rates yield vines with late death-filtration levels. See Section 4.1 for a more detailed discussion. We color each vine according to the geographical location(s) of its anomaly. As we discussed in Section 4.3, we obtain the anomaly location(s) from the time-dependent death simplex of a vine. The function is a piecewise-constant function; as it changes, so does the location of the associated anomaly. Therefore, the color of a vine can change over time. For example, consider Figure 49, where we show the five most-persistent vines. The global maximum of the data set is initially in Little Armenia, but it moves to Vermont Square at about . In the vineyard, we see this from the vine that is initially blue (for Little Armenia) from time until about and then orange (for Vermont Square) starting from about time through time . There are also other vines whose locations change over time. Such geographical location changes do not need to be adjacent, but they often are near each other. In Figure 51, we highlight these anomalies on a map.

Figure 51: A map of the most persistent anomalies of the COVID case-rate function in LA during the time period 25 April 2020–25 April 2021. Each of the highlighted regions is a local maximum of the COVID-19 case-rate function for some subset of the time period.

A vineyard encodes the temporal persistence of anomalies. The length of time that a vine is not on the diagonal plane of a vineyard, which we henceforth call the “length” of a vine, is the amount of time that an anomaly exists. At the beginning of the COVID-19 pandemic, all neighborhoods had low per capita case rates. We expect emerging anomalies to have a low case rate for a long time and then for the case rate to grow rapidly starting at some later time. An emerging anomaly in the “low case rate” phase yields a vine that is close to the diagonal for a long time. By examining the lengths of vines, we hypothesize that one can distinguish between very concerning emerging anomalies (i.e., those that may become major COVID-19 anomalies in the future) and anomalies of lesser concern, even when the anomalies have similar case rates.
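The “length” of a vine can be approximated from a sampled vine by counting the time steps at which the vine lies off the diagonal. The sketch below assumes that a vine is stored as a sequence of (birth, death) pairs with one pair per time step and uniform spacing `dt`; the function name `vine_length`, the tolerance, and the toy vine are illustrative assumptions.

```python
from typing import Sequence, Tuple


def vine_length(vine: Sequence[Tuple[float, float]],
                dt: float = 1.0,
                tol: float = 1e-12) -> float:
    """Approximate the time that a sampled vine spends off the diagonal.

    A sample counts as off the diagonal when death - birth exceeds `tol`.
    """
    return dt * sum(1 for birth, death in vine if death - birth > tol)


# Toy vine (made up): on the diagonal at the first and last time steps.
toy_vine = [(0.1, 0.1), (0.1, 0.3), (0.2, 0.5), (0.4, 0.4)]
print(vine_length(toy_vine))
```

Because this approach counts samples, it is only as accurate as the time resolution of the vineyard.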

In Figure 54, we show case rates early in the time period that we track by computing the vineyard for the period 25 April 2020–25 May 2020. The COVID-19 pandemic was declared a national emergency on 13 March 2020 [43], and the city of LA closed its public schools and ordered the closure of restaurants, bars, and gyms on 16 March 2020 [21]. In our vineyard, we exclude the twenty most-persistent vines to more easily visualize the vines that are close to the diagonal plane. Many of these latter vines are short, so their associated anomalies are short-lived. The longer vines are anomalies that are longer-lived and thus of greater concern in the long run, even though they are close to the diagonal during the period 25 April 2020–25 May 2020. For example, there is an anomaly at Wilmington that we show with the light-blue vine. This vine is close to the diagonal plane, but it has high temporal persistence during the period 25 April 2020–25 May 2020. In Figure 49, we see that Wilmington eventually becomes one of the worst hotspots of COVID-19 case rates in LA.

(a)
(b)
Figure 54: Vineyard for the LA simplicial complex with a sublevel filtration from 14-day mean per capita case rate during the period 25 April 2020–25 May 2020. We exclude the 20 most-persistent vines to more easily see the vines near the diagonal plane. Each vine represents a COVID-19 anomaly; we color each vine according to the geographical location(s) of its anomaly. See Figure 55 for the legend.
Figure 55: The legend for Figure 54. Each of the depicted regions is a local maximum of the COVID-19 case-rate function for some subset of the time period 25 April 2020–25 May 2020.

6 Discussion

In our approach, we needed to make a variety of choices. There are other ways to construct a simplicial complex to represent a geographical space. There are also other choices in topological tools for analyzing time-varying data. We briefly discuss some of these possibilities in the next several paragraphs.

Rasterization is an alternative method for constructing a simplicial complex from shapefile data. When one rasterizes a shapefile, one can transform the resulting image into a simplicial complex by imposing the pixels of the image onto a triangulation of the plane. However, our approach has several key advantages over rasterization. First, the number of simplices in the simplicial complex that one obtains by rasterizing a shapefile is orders-of-magnitude larger than the number of simplices in our construction. Computing the PH of a simplicial complex with fewer simplices allows significantly faster computations. Second, the simplicial complex that one obtains by rasterization has no guarantee of “topological correctness”, as property 3 may not hold. The extent to which the resulting simplicial complex is topologically correct depends on the resolution of the rasterization, and using a higher resolution requires more simplices. Our construction of simplicial complexes also yields a natural way to map a 2D simplex to the geographical region that contains it. We use this preservation of geographical information to find the locations of the local extrema. Lastly, our construction allows us to detect anomalies on the boundary of a geographical space.

Our construction uses direct geographical adjacencies, but one may instead wish to employ “effective” distances between regions. One can calculate effective distances using mobility and transportation data. Two regions that are closely connected via transportation are effectively closer than they are based on direct geographical considerations; this affects phenomena such as the dynamics of infectious diseases [4, 31].

We used only 1D PH to study extrema, but one can alternatively use 0D PH if one is not interested in the geographic locations of the extrema; we discuss this in Appendix B.1. In Appendix B.2, we discuss alternative filtrations that one can apply to geographical spaces (such as NYC) that are disconnected. We used a time-dependent function on a geographical space to compute vineyards, but an alternative is to use an approach that is based on multiparameter zigzag PH. We discuss this in Appendix B.4. When the time-dependent function is monotonic for all regions , one can also use an approach that is based on multiparameter PH (i.e., without needing to invoke zigzag PH); we discuss this in Appendix B.3. However, both multiparameter PH and multiparameter zigzag PH are difficult to visualize, and they both suffer from a lack of easily interpretable invariants. Consequently, we only computed vineyards in our applications.

7 Conclusions

We developed methods to directly incorporate spatial structure into applications of topological data analysis (and, specifically, of persistent homology) to geospatiotemporal and geospatial data. We defined a way to construct a simplicial complex that efficiently and accurately represents a geographical space. Given a function on a geographical space, we defined filtration functions on a simplicial complex such that the PH classes are in one-to-one correspondence with either local minima or local maxima. By constructing a vineyard, one can track how the local extrema move and change over time.

We conducted case studies using COVID-19 vaccination and case-rate data. In one case study, we examined the geospatial vaccination structure in New York City on one day. In our other case study, in which we examined geospatiotemporal data, we constructed a vineyard to analyze COVID-19 case-rate anomalies in the city of Los Angeles over the course of one year. From the vineyard, we identified the locations of these anomalies and measured the severity of the disease outbreaks. The vineyard also captured information about the relationships between anomalies, such as the extent to which they are isolated from each other. We calculated the temporal persistence of an anomaly based on the length of its corresponding vine.

There are several ways to build on our research. It is desirable to discover how to use a vineyard to produce systematic forecasts of how a disease (or something else) will spread in space and time. We hypothesized in Section 5.2 that one can identify “emerging anomalies” in the COVID-19 data set as vines that are long but close to the diagonal plane. In other applications, one may want to predict the locations of the local extrema with the largest data values and/or highest temporal persistences. One may also want to forecast how the extrema will move in space. It will be valuable to investigate how to use the output of our approach as an input to forecasting models.

Our approach is useful for a wide variety of applications, and it seems possible to generalize it for many others. For example, given spatiotemporal voting data, one can identify regions that vote differently from the surrounding regions. This would allow one to generalize the work of [17] to track the intensity of voting differences and study spatial relationships between different political islands. Our methodology is not restricted to geographical data; it is applicable whenever one has a surface that is partitioned into a finite number of regions and a real-valued function (or a sequence of real-valued functions) on those regions. For example, it may be possible to apply our approach to grayscale image data by partitioning an image into regions in which pixel values are close to each other. It also seems possible to extend our approach to higher dimensions; this would require constructing a higher-dimensional simplicial complex given boundary intersection data on the higher-dimensional regions. For example, in three dimensions, one could use such an extension of our approach to study atmospheric, oceanic, and video dynamics.

Appendix A Details of the Simplicial-Complex Construction

A.1 Boundary-Sequence Adjustment

Before constructing the simplicial complexes for each region , we adjust the boundary sequences as follows. Let denote the sequence of neighbors around the outer boundary of region , and let , …, (where is the number of inner boundary components of ) denote the sequences of neighbors around the inner boundary components of . First, we adjust the sequences so that for each region and each boundary component , the first element of has a 1D intersection with . To do this, let be the elements of , where is the number of neighbors. If is not 1D, let be the smallest index such that is 1D. We then set to be equal to . Following this, we adjust the sequences such that for all and . If , there are two cases:

  1. (Case 1) If , let be the unique element of . This situation occurs if is an island, and it can also occur if lies inside or if lies inside . We adjust to be the sequence . If is not the exterior region, let be the index of the boundary component of that intersects . Adjust to be the sequence to compensate for the adjustment that we made to .

  2. (Case 2) If , let and be the two elements of , where (without loss of generality) is the exterior region if is adjacent to the exterior along its th boundary component. For example, in Figure (a)a, . We adjust to be the sequence . If is not the exterior region, which occurs if is not adjacent to the exterior, then we also adjust to compensate, where is the index of the boundary component of that intersects . In that case, we adjust by repeating an additional time.

Finally, we adjust the outer boundary sequences so that for all . If for some region (which does not occur in either of our geographical data sets), let be an element of , where (without loss of generality) is the exterior region if is adjacent to the exterior along its outer boundary component. We adjust so that repeats an additional times. If is not the exterior, let be the index of the boundary component of that intersects . We adjust to compensate by repeating neighbor an additional times.
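The first adjustment in this appendix ensures that each boundary sequence starts with a neighbor that has a 1D intersection with the region. The sketch below reads this adjustment as a cyclic rotation of the sequence; that reading is an assumption, because the explicit formula is not legible in the text here. The function name `rotate_to_first_1d` and the Boolean-flag input are hypothetical.

```python
from typing import Hashable, List, Sequence


def rotate_to_first_1d(neighbor_seq: Sequence[Hashable],
                       is_1d: Sequence[bool]) -> List[Hashable]:
    """Cyclically rotate a boundary sequence to start at a 1D-adjacent neighbor.

    is_1d[k] states whether the region's intersection with neighbor_seq[k]
    is 1D. The cyclic order of the neighbors is preserved.
    """
    for j, flag in enumerate(is_1d):
        if flag:
            # j is the smallest index whose intersection with the region is 1D.
            return list(neighbor_seq[j:]) + list(neighbor_seq[:j])
    raise ValueError("no neighbor has a 1D intersection with the region")


# Toy sequence (made up): the first neighbor touches the region only at a point.
print(rotate_to_first_1d(["P", "Q", "R"], [False, True, True]))
```

The later adjustments (repeating neighbors so that each sequence is long enough) are not covered by this sketch.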

A.2 Construction of for Region

We assume that we have already adjusted the boundary sequences as in Appendix A.1 whenever necessary.

Algorithm 1: Construct the annotated simplicial complex  for region

        Input:

  • The sequence of neighbors in clockwise order around the outer boundary component

  • The sequence of neighbors in clockwise order around the th inner boundary component, where and is the number of inner boundary components

  • , where is the dimension of

Output: An annotated simplicial complex .

1:, the annotated simplicial complex 
2:for  to  do
3:     Initialize , the sequence of vertices that are on the th boundary component of
4:     Add vertex to
5:     Add vertex to
6:     Initialize , the annotations of vertex
7:end for
8:
9:
10:while  do
11:     while  and  do
12:         
13:         if  then
14:              Append to
15:         end if
16:     end while
17:     Annotate with
18:     
19:     
20:     if  then
21:         Add vertex to
22:         Append to
23:         , the annotations of vertex
24:         Add edge to ; the edge has the annotation
25:     end if
26:end while
27:Add edge to ; the edge has the annotation
28:for  to  do
29:     Add edge to ; the edge is unannotated 
30:end for
31:for  to  do
32:     outer
33:     for  to  do
34:         Add edge (outer, ) to ; the edge is unannotated 
35:         Add edge (outer, ) to ; the edge is unannotated 
36:     end for
37:     for  to  do
38:         Add edge to ; the edge is unannotated 
39:     end for
40:     for  to  do
41:         Add 2D simplex (outer, , ) to ; the simplex is unannotated 
42:         Add 2D simplex (outer, outer, ) to ; the simplex is unannotated 
43:     end for
44:     for  to  do
45:         Add 2D simplex to ; the simplex is unannotated 
46:     end for
47:end for
48:for  to  do
49:     Add 2D simplex to ; the simplex is unannotated 
50:end for

 

A.3 Construction of from the Collection

We present two lemmas that were used in Section 3 to construct by gluing together the collection of simplicial complexes .

Lemma 1

Let be the boundary components of a region . If a region is connected and has a nonempty intersection with the boundary component , then does not intersect any other boundary component .

Proof

By the Jordan Curve Theorem, every boundary component of divides the plane into an “inside” and an “outside”. Therefore, because is connected, it either lies outside the outer boundary component of or inside one of the inner boundary components of . If lies inside an inner boundary component , then can intersect but cannot intersect any other for . If lies outside the outer boundary component, then can intersect the outer boundary component of , but it cannot intersect any other boundary components.

For example, suppose that is West Vernon in Figure (b)b. Vermont Square intersects its inner boundary component but not its outer boundary component.

Lemma 2

Let be the annotated simplicial complex for region , let be one of its boundary components, and let be a vertex in . Let be the sequence of region adjacencies of . If , then the boundary component has at most one other vertex with the same set of region adjacencies. Additionally, if exists, its sequence of region adjacencies must be , which is the mirror of the orientation of neighbors around .

Proof

Suppose that is a vertex in with the same set of region adjacencies as . Let be the boundary component of that corresponds to the boundary component of . Let and be the points on that correspond, respectively, to and in . Either the interior of is contained in the region that is bounded by , or it is contained in the complement of the region that is bounded by . Without loss of generality, we suppose the former case. Let be the permutation of such that the sequence of region adjacencies around is . Let , with , be a pair of indices. By Lemma 1, is adjacent only to one boundary component of and to one boundary component of . Let be the boundary component of to which is adjacent, and let be the boundary component of to which is adjacent. We have .

Because is homeomorphic to , there exist paths , from to such that . Because the interior of does not intersect , it follows that and are both in the complement of the region that is bounded by . There are two paths from to on . Let be the unique choice of path such that is not contained in the region that is bounded by the closed curve . Either is in the region that is bounded by the closed curve or is in the region that is bounded by the closed curve . Without loss of generality, we suppose that the latter is true.

Analogously to our argument above, there exist paths , from to such that and and are in the complement of the region that is bounded by . Because is homeomorphic to , and are either both contained in the region that is bounded by or both contained in the complement of the region that is bounded by . Because , it must be the former case. Therefore, . It follows that is order-reversing. If there were another vertex in that is adjacent to the same set of regions, then the orientation of those regions around would be the mirror of both the orientation of regions around and the orientation of regions around , which gives a contradiction when .

For example, let be the region Koreatown in Figure (a)a. The two vertices that are shared by Koreatown and Little Bangladesh have the same region adjacencies, but they have mirrored orientations.

Appendix B Alternative Topological Approaches

B.1 0D Persistent Homology

Let be a real-valued function on a set of geographical regions. In Sections 4.1 and 4.2, we described how one can analyze the local maxima (respectively, minima) of by computing the 1D PH of the sublevel (respectively, superlevel) filtration. In this section, we discuss how the 0D PH of the sublevel (respectively, superlevel) filtration yields information about local minima (respectively, maxima) of .

The 0D PH of the sublevel filtration encodes information about the structure of local minima of in a way that is similar to how the 1D PH encodes information about the structure of local maxima. One can imagine taking -sublevel sets of the function in Figure 33 (where we display -superlevel sets) to see why this is true. A region is a local minimum if the value of is less than the value of for all neighboring regions of for which is 1D. If is a local minimum, there is a 0D PH class whose birth simplex is one of the vertices in one of the triangles in the preimage . The class is born at filtration level . For the LA data set of COVID-19 case rates, 0D PH classes correspond to regions that have a lower case rate than surrounding regions. The smaller the value of in comparison to the surrounding regions, the more persistent the PH class is. There is also one infinite 0D PH class for each connected component. One can think of these classes as corresponding to the “local minimum” at the exterior region. However, unlike for 1D PH classes, there is no canonical map from 0D PH classes to regions because the birth simplex of a 0D class is a vertex that belongs to several regions. The 0D PH of the superlevel filtration analogously encodes information about the structure of local maxima of . However, as with the sublevel filtration, there is no canonical map from 0D PH classes to regions. Therefore, one cannot easily use the 0D PH of the sublevel (respectively, superlevel) filtration to identify the geographical locations of the local minima (respectively, maxima), so we did not study 0D PH in our case studies.

Although we did not compute 0D PH in the present paper, using 0D PH to study the structure of local extrema is appropriate when one is not interested in their locations. One can compute the 0D PH of the sublevel (respectively, superlevel) filtration more efficiently than the 1D PH of the superlevel (respectively, sublevel) filtration using the following approach. Given any filtration function , the 0D PH of is isomorphic to the 0D PH of (, where is the 1-skeleton of (i.e., the vertices and edges of ) and is the simplicial complex that we constructed in Section 3. In particular, if is the sublevel or superlevel filtration that is induced by , then one can construct an alternative 1D simplicial complex  with even fewer simplices than such that the 0D PH of is isomorphic to the 0D PH of . Because has fewer simplices, one can do TDA computations more efficiently. To build , we start with the collection . For each , we remove all 2D simplices. We then remove all non-boundary edges, except that for each inner boundary component, we leave one edge that connects that inner boundary component to the outer boundary component. (The latter step is so that is still connected after we remove the 2D simplices.) We then glue together the collection of simplicial complexes according to their edge and vertex annotations. This yields the desired 1D simplicial complex . However, we are not interested in 0D PH in our case studies, so we do not use this construction.
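Because 0D PH depends only on vertices and edges, one can compute it with a union-find data structure. The following is a minimal sketch of that standard approach for a generic graph with vertex values; it is not the reduced complex that is described above, nor the implementation that accompanies this work. It assumes a lower-star convention (each edge enters at the larger of its endpoint values), and the name `zero_dim_persistence` is hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def zero_dim_persistence(vertex_values: Dict[Hashable, float],
                         edges: Iterable[Tuple[Hashable, Hashable]]
                         ) -> List[Tuple[float, float]]:
    """Return sorted (birth, death) pairs of 0D PH for a sublevel filtration.

    Pairs with zero persistence are omitted. Each connected component
    contributes one class with an infinite death value.
    """
    parent = {v: v for v in vertex_values}

    def find(v):
        # Find the root of v's component, compressing the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def edge_value(edge):
        return max(vertex_values[edge[0]], vertex_values[edge[1]])

    pairs = []
    for u, v in sorted(edges, key=edge_value):
        elder, younger = find(u), find(v)
        if elder == younger:
            continue  # the edge closes a cycle and kills no 0D class
        # Each root is the lowest-valued vertex of its component.
        if vertex_values[elder] > vertex_values[younger]:
            elder, younger = younger, elder
        birth, death = vertex_values[younger], edge_value((u, v))
        if death > birth:
            pairs.append((birth, death))
        parent[younger] = elder  # elder rule: the younger component dies
    roots = {find(v) for v in vertex_values}
    pairs.extend((vertex_values[r], float("inf")) for r in roots)
    return sorted(pairs)


# Toy path graph a-b-c-d-e (made up) with local minima at a, c, and e.
toy_vertex_values = {"a": 0.0, "b": 3.0, "c": 1.0, "d": 4.0, "e": 2.0}
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
print(zero_dim_persistence(toy_vertex_values, toy_edges))
```

In the toy example, the three local minima yield three classes, and the class of the global minimum never dies.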

b.2 Alternative Filtrations for Disconnected Geographical Spaces

In Section 4.1 (respectively, Section 4.2), we defined a sublevel (respectively, superlevel) filtration in which we set the filtration values of all exterior-adjacent vertices and edges to the global minimum (respectively, negative global maximum) of the function that induces the filtration. In applications in which the union of all regions is not connected, such as for the NYC zip codes in Section 5.1, an alternative definition is to consider extrema on each connected component separately, rather than on the entire geographical space at once. This solves the problem that an isolated region (a geographical island; these are literal islands, rather than “islands” from a PH computation) is trivially both a local maximum and a local minimum because it is not adjacent to any other regions. In Definitions 1 and 2, such isolated regions appear as 1D PH classes that are born at the earliest filtration time, which may falsely emphasize the persistence of these trivial extrema.

Definition 4 (Alternative Sublevel Filtration)

Let be the simplicial complex from Section 3 for a set of regions, and let be the assignment of 2D simplices to regions. Let . If is a vertex or edge on the boundary of , let be the 2D simplex for which is on the boundary of . On , we define the alternative sublevel filtration function to be

where is the connected component that contains the region . On all other simplices, the filtration function is equal to the sublevel filtration function.

Definition 5 (Alternative Superlevel Filtration)

Let for a set of regions. The alternative superlevel filtration function is the alternative sublevel filtration function that is induced by .

Definitions 4 and 5 are appropriate options if one seeks to treat each connected component independently. In these alternative definitions, each connected component uses only information about other regions in the same component. One then compares region values to global extremum values on their connected components. One consequence of using these definitions is that one ignores isolated regions, which are trivial extrema. In Definitions 4 and 5, these isolated extrema appear as points on the diagonal of a PD. This is often an appropriate way to handle isolated regions. However, when an isolated region is a global extremum of a data set, this may be undesirable. This situation never occurs in our data.
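The per-component comparison can be sketched as follows. This is an illustrative assumption rather than the code in our repository: it works on a hypothetical region-adjacency list instead of the simplicial complex, and the name `componentwise_minimum` and the toy rates are made up for the example. It computes, for each region, the minimum value over that region's connected component, which is the value that the alternative sublevel filtration assigns to exterior-adjacent vertices and edges of that component.

```python
def componentwise_minimum(values, adjacency):
    """Map each region to the minimum value on its connected component.

    values:    dict mapping each region to its function value.
    adjacency: iterable of (a, b) pairs of regions that share a boundary.
    """
    neighbors = {r: set() for r in values}
    for a, b in adjacency:
        neighbors[a].add(b)
        neighbors[b].add(a)

    baseline = {}
    for start in values:
        if start in baseline:
            continue  # this region's component was already processed
        # Depth-first search to collect the component that contains `start`.
        component, stack = {start}, [start]
        while stack:
            r = stack.pop()
            for s in neighbors[r]:
                if s not in component:
                    component.add(s)
                    stack.append(s)
        low = min(values[r] for r in component)
        for r in component:
            baseline[r] = low
    return baseline

# Toy data: three mutually reachable regions and one isolated region.
rates = {"A": 0.30, "B": 0.10, "C": 0.40, "island": 0.05}
adjacency = [("A", "B"), ("B", "C")]
baseline = componentwise_minimum(rates, adjacency)
```

For the isolated region, the baseline equals the region's own value, so its class is born and dies at the same filtration level and lies on the diagonal of the PD. Applying the same function to the negated values gives the analogous baseline for the alternative superlevel filtration.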

For example, NYC has 14 connected components; several of these are zip codes that correspond to isolated islands. The alternative sublevel and superlevel filtrations effectively treat each connected component of NYC separately. In Figures 58(a) and 58(b), we show the PDs that we compute using the alternative sublevel and superlevel filtrations that are induced by the vaccination-rate function that we defined in Section 5.1. In these PDs, we compare a zip code’s per capita vaccination rate to the global minimum or maximum rate on its connected component, rather than the global extremum of the entire data set. More precisely, the birth time of a connected component’s global extremum is either the lowest per capita vaccination rate of that component (for the alternative sublevel filtration) or the additive inverse of the highest per capita vaccination rate of that component (for the alternative superlevel filtration). Consequently, the trivial island extrema are represented by PH classes on the diagonal.

(a) Alternative sublevel filtration
(b) Alternative superlevel filtration
Figure 58: PDs for the 1D PH of the NYC simplicial complex with filtrations that are induced by the per capita full vaccination rate by zip code on 23 February 2021. We show only the finite PH classes. Each point in a PD corresponds to a non-isolated zip code, which we label according to its borough [26], that has (a) a higher vaccination rate than its neighboring zip codes or (b) a lower vaccination rate than its neighboring zip codes.

The alternative sublevel filtration and the alternative superlevel filtration, along with their time-dependent versions, are implemented in our code that is available at https://bitbucket.org/ahickok/vineyard/src/main/.

b.3 Multiparameter Persistent Homology

One can use multiparameter persistent homology to study how the topology of a data set changes as one varies multiple parameters [8]. In applying multiparameter PH to our COVID-19 case-rate data, two feasible parameters are (1) time and (2) the cumulative COVID-19 case rate. To compute multiparameter PH, one starts with a multiparameter filtration $\{X_{\mathbf{u}}\}_{\mathbf{u} \in \mathbb{N}^n}$, where $X_{\mathbf{u}} \subseteq X_{\mathbf{v}}$ for all $\mathbf{u} \leq \mathbf{v}$ and $n$ is the number of parameters. When $n = 1$, this is a filtered simplicial complex (i.e., an ordinary filtration); when $n = 2$, it is a bifiltration. The multiparameter PH in dimension $k$ over field $\mathbb{F}$ is the graded $\mathbb{F}[x_1, \ldots, x_n]$-module $\bigoplus_{\mathbf{u}} H_k(X_{\mathbf{u}}; \mathbb{F})$. For $i \in \{1, \ldots, n\}$, the action of $x_i$ is the map that is induced by the inclusion $X_{\mathbf{u}} \hookrightarrow X_{\mathbf{u} + \mathbf{e}_i}$. When $n = 1$, this definition reduces to PH. One can use multiparameter PH to study local extrema of functions that are nondecreasing over time.

Definition 6

Let be the simplicial complex from the construction in Section 3 for a set of regions. Let be a function for which for all . Define the function to be the sublevel filtration that is induced by . Let be the image of , where is the number of elements in the image. We define the bifiltration

One can use Definition 6 to study cumulative COVID-19 case rates over time.

b.4 Multiparameter Zigzag Persistent Homology

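One can use zigzag persistent homology to study how the topology of a data set changes as one varies a parameter nonmonotonically [7]. In single-parameter zigzag PH, one starts with a sequence $X_1, \ldots, X_m$ of simplicial complexes such that, for all $i$, either $X_i \subseteq X_{i+1}$ or $X_i \supseteq X_{i+1}$. (By contrast, in ordinary PH, $X_i \subseteq X_{i+1}$ for all $i$.) An inclusion $X_i \hookrightarrow X_{i+1}$ induces a map $H_k(X_i) \to H_k(X_{i+1})$, and an inclusion $X_{i+1} \hookrightarrow X_i$ induces a map $H_k(X_{i+1}) \to H_k(X_i)$. Analogously to PH, one can decompose the resulting zigzag module into “interval modules” $\mathbb{I}[b, d]$.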
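The defining condition on a zigzag sequence can be checked directly. The following minimal sketch assumes that each simplicial complex is stored as a set of simplices and that each simplex is a frozenset of vertex labels; the function name `zigzag_directions` and the toy complexes are hypothetical. It records the direction of the inclusion between each pair of consecutive complexes and raises an error if neither inclusion holds.

```python
def zigzag_directions(complexes):
    """Return the inclusion direction between consecutive complexes.

    complexes: list of sets of simplices (each simplex is a frozenset).
    '->' means the earlier complex is contained in the later one;
    '<-' means the later complex is contained in the earlier one.
    Equal complexes are reported as '->'.
    """
    directions = []
    for i in range(len(complexes) - 1):
        current, following = complexes[i], complexes[i + 1]
        if current <= following:
            directions.append("->")
        elif current >= following:
            directions.append("<-")
        else:
            raise ValueError(f"complexes {i} and {i + 1} are not nested")
    return directions

# Toy data: an edge is added between two vertices and then removed again.
a, b = frozenset({"a"}), frozenset({"b"})
ab = frozenset({"a", "b"})
sequence = [{a, b}, {a, b, ab}, {a, b}]
directions = zigzag_directions(sequence)
```

An ordinary filtration is the special case in which every direction is '->'.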

One can use multiparameter zigzag PH when there are multiple parameters that vary nonmonotonically. See Section 2.1 of [7] for a short discussion. In applying multiparameter zigzag PH to our COVID-19 case-rate data, two feasible parameters are (1) time and (2) the current COVID-19 case rate. Given a diagram of simplicial complexes, such as in Equation 3, one can construct a diagram of homology groups that is induced by the maps between the simplicial complexes. This is a representation of a quiver. However, in contrast to single-parameter zigzag PH, there are no well-behaved statistical summaries of multiparameter zigzag PH.

Definition 7

Let be the simplicial complex from the construction in Section 3 for a set of regions, and suppose that . Define half steps for , and let . Define the function as follows:

We define the function to be the sublevel filtration that is induced by . Let be the image of . We define

This yields the following diagram:

(3)

The inclusion maps induce a corresponding diagram of homology groups.

One can use Definition 7 to study non-cumulative COVID-19 case rates over time.

Acknowledgements

We thank Henry Adams, Heather Zinn Brooks, Michelle Feng, Lara Kassab, and Nina Otter for helpful discussions. Additionally, we are grateful to Michelle Feng for teaching us how to work with geospatial data. We thank the Los Angeles County Department of Public Health for providing the LA city data on COVID-19 and the LA neighborhood population estimates.

References