## 1 Preliminaries

This section briefly describes our formal setting and presents an overview of our approach. An introduction to topological data analysis can be found in [13].

### 1.1 Background

Input data: Without loss of generality, we assume that the input data is a piecewise linear (PL) scalar field $f : \mathcal{M} \to \mathbb{R}$ defined on a PL $d$-manifold $\mathcal{M}$ with $d = 2$ or $3$. It has value at the vertices of $\mathcal{M}$ and is linearly interpolated on the simplices of higher dimension. Adjacency relations on $\mathcal{M}$ can be described in a dimension independent way. The *star* $St(v)$ of a vertex $v$ is the set of simplices of $\mathcal{M}$ which contain $v$ as a face. The *link* $Lk(v)$ is the set of faces of the simplices of $St(v)$ which do not intersect $v$. The topology of $\mathcal{M}$ can be described with its *Betti numbers* $\beta_i$ (the ranks of its homology groups [13]), which correspond in 3D to the numbers of connected components ($\beta_0$), non-collapsible cycles ($\beta_1$) and voids ($\beta_2$).

Critical points:
For visualization and data analysis purposes, several low-level geometric features can be defined given the input data. Given an isovalue $i$, the *sub-level set* of $i$, noted $L^-(i)$, is defined as the pre-image of the open interval $(-\infty, i)$ onto $\mathcal{M}$ through $f$: $L^-(i) = \{p \in \mathcal{M} \mid f(p) < i\}$. Symmetrically, the *sur-level set* is defined by $L^+(i) = \{p \in \mathcal{M} \mid f(p) > i\}$. These two objects serve as fundamental segmentation tools in many analysis tasks [5]. The points of $\mathcal{M}$ where the Betti numbers of $L^-(i)$ change are the *critical points* of $f$ (Fig. 1) and their associated $f$ values are called *critical values*. Let $Lk^-(v)$ be the *lower link* of the vertex $v$: $Lk^-(v) = \{\sigma \in Lk(v) \mid \forall u \in \sigma : f(u) < f(v)\}$. The *upper link* is given by $Lk^+(v) = \{\sigma \in Lk(v) \mid \forall u \in \sigma : f(u) > f(v)\}$. To classify $Lk(v)$ without ambiguity into either lower or upper links, the restriction of $f$ to the vertices of $\mathcal{M}$ is assumed to be injective. This is easily enforced in practice by a variant of simulation of simplicity [15]. This is achieved by considering an associated injective integer offset $O(v)$, which initially typically corresponds to the vertex position offset in memory. Then, when comparing two vertices, if these share the same value $f$, their order is disambiguated by their offset $O$. A vertex $v$ is regular if and only if both $Lk^-(v)$ and $Lk^+(v)$ are simply connected. Otherwise, $v$ is a *critical point* of $f$. Let $d$ be the dimension of $\mathcal{M}$. Critical points can be classified with their *index* $I(v)$, which equals $0$ for minima ($Lk^-(v) = \emptyset$), $1$ for 1-saddles ($\beta_0(Lk^-(v)) = 2$), $d-1$ for $(d-1)$-saddles ($\beta_0(Lk^+(v)) = 2$) and $d$ for maxima ($Lk^+(v) = \emptyset$). Vertices for which $\beta_0(Lk^-(v))$ or $\beta_0(Lk^+(v))$ are greater than 2 are called *degenerate saddles*.
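In the simplest, one-dimensional setting, the link of an interior vertex reduces to its two neighbors, and the classification above boils down to counting lower neighbors. The following sketch illustrates this (a toy illustration only, with a hypothetical `classify_1d` helper; it is not the implementation used in this work), including the offset-based tie-breaking:

```python
def classify_1d(f):
    """Classify the interior vertices of a 1D PL scalar field.

    In 1D, the link of an interior vertex v is its two neighbors:
    v is a minimum if its lower link is empty, a maximum if its
    upper link is empty, and regular otherwise. Ties on f are
    broken by vertex index (simulation-of-simplicity offset).
    """
    def is_lower(u, v):
        # lexicographic comparison on (value, offset) makes the
        # restriction of f to the vertices injective
        return (f[u], u) < (f[v], v)

    kinds = {}
    for v in range(1, len(f) - 1):
        lower_link = [u for u in (v - 1, v + 1) if is_lower(u, v)]
        if not lower_link:
            kinds[v] = "minimum"    # index 0
        elif len(lower_link) == 2:
            kinds[v] = "maximum"    # index d = 1
        else:
            kinds[v] = "regular"
    return kinds

# the two vertices with value 1.0 are disambiguated by their offset
print(classify_1d([0.0, 2.0, 1.0, 1.0, 3.0]))
# → {1: 'maximum', 2: 'minimum', 3: 'regular'}
```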

Persistence diagrams:
The distribution of critical points of $f$ can be represented visually by a topological abstraction called the *persistence diagram* $\mathcal{D}(f)$ [14, 11] (Fig. 1). By applying the Elder Rule [13], critical points can be arranged in a set of pairs, such that each critical point appears in only one pair $(c, c')$ with $f(c) < f(c')$ and $I(c) = I(c') - 1$. Intuitively, the Elder Rule [13] states that if two topological features of $L^-(i)$ (for instance two connected components) meet at a given saddle $s$ of $f$, the *youngest* of the two features (the one created last) *dies* at the advantage of the oldest. For example, if two connected components of $L^-(i)$ merge at a saddle $s$, the one created by the highest minimum $m$ (the youngest one) is considered to die at $s$, and $m$ and $s$ will form a critical point pair. The persistence diagram $\mathcal{D}(f)$ embeds each pair $(c, c')$ in the plane such that its horizontal coordinate equals $f(c)$, and the vertical coordinates of $c$ and $c'$ are $f(c)$ and $f(c')$, corresponding respectively to the *birth* and *death* of the pair. The height of the pair, $P(c, c') = f(c') - f(c)$, is called the *persistence* and denotes the life-span of the topological feature created in $c$ and destroyed in $c'$. Thus, features with a short life span (noise) will appear in $\mathcal{D}(f)$ as low-persistence pairs near the diagonal (Figure 1, bottom).
In low dimension, the persistence of the pairs linking critical points of index $(0, 1)$, $(d-1, d)$ and (in 3D) $(1, 2)$ denotes the life-span of connected components, voids and non-collapsible cycles of $L^-(i)$. In the rest of the paper, when discussing persistence diagrams, we will only consider critical point pairs of index $(0, 1)$ and $(d-1, d)$. The impact of this simplification is discussed in Sec. 4.4. In practice, persistence diagrams serve as an important visual representation of the distribution of critical points in a scalar data-set. Small oscillations in the data due to noise will typically be represented by critical point pairs with low persistence, in the vicinity of the diagonal. In contrast, the most prominent topological features in the data will be associated with large vertical bars (Figure 1). In many applications, persistence diagrams help users as a visual guide to interactively tune simplification thresholds in topology-based, multi-scale data segmentation tasks based on the Reeb graph [9, 34, 45, 19, 42] or the Morse-Smale complex [21, 22].
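For $(0, 1)$ pairs on a 1D domain, the Elder Rule can be sketched with a union-find sweep over the vertices in increasing order of value: each local minimum creates a component, and when two components merge at a vertex, the younger one dies there. This is a toy illustration only (`persistence_pairs_1d` is a hypothetical helper), not the algorithm used in our experiments:

```python
def persistence_pairs_1d(f):
    """(birth, death) persistence pairs of a 1D scalar field.

    Sweeps the vertices by increasing value; a component is born at
    each local minimum and, when two components meet, the one born
    at the higher (younger) minimum dies there (Elder Rule). The
    essential pair (global min, global max) is appended last.
    """
    n = len(f)
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    birth = {}   # component root -> index of the minimum that created it
    pairs = []
    for v in sorted(range(n), key=lambda i: (f[i], i)):
        roots = {find(u) for u in (v - 1, v + 1)
                 if 0 <= u < n and (f[u], u) < (f[v], v)}
        if not roots:
            birth[v] = v                     # local minimum: new component
            continue
        ordered = sorted(roots, key=lambda r: (f[birth[r]], birth[r]))
        for r in ordered[1:]:                # younger components die at v
            pairs.append((f[birth[r]], f[v]))
            parent[r] = ordered[0]
        parent[v] = ordered[0]
    pairs.append((min(f), max(f)))           # essential pair
    return sorted(pairs)

print(persistence_pairs_1d([0.0, 4.0, 1.0, 3.0, 2.0, 5.0]))
# → [(0.0, 5.0), (1.0, 4.0), (2.0, 3.0)]
```

The pair $(2.0, 3.0)$, for instance, is the component born at the minimum of value $2.0$, which dies when it merges, at the saddle of value $3.0$, with the older component born at value $1.0$.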

Distance: In order to evaluate the quality of compression algorithms, several metrics have been defined to evaluate the distance between the decompressed data, noted $g : \mathcal{M} \to \mathbb{R}$, and the input data, $f$. The $L^p$-norm, noted $\|f - g\|_p$, is a classical example:

$$\|f - g\|_p = \Big( \sum_{v \in \mathcal{M}} |f(v) - g(v)|^p \Big)^{1/p} \tag{1}$$

Typical values of $p$ with practical interest include $2$ and $\infty$. In particular, the latter case, called the *maximum norm*, is used to estimate the maximum pointwise error:

$$\|f - g\|_\infty = \max_{v \in \mathcal{M}} |f(v) - g(v)| \tag{2}$$

In the compression literature, a popular metric is the *Peak Signal to Noise Ratio* (PSNR), where $n$ is the number of vertices in $\mathcal{M}$:

$$PSNR = 20 \log_{10} \Big( \sqrt{n} \, \frac{\max_{v \in \mathcal{M}} f(v) - \min_{v \in \mathcal{M}} f(v)}{2 \, \|f - g\|_2} \Big) \tag{3}$$
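As an illustration, these metrics follow directly from their definitions; `lp_distance` and `psnr` below are hypothetical helper names for a minimal sketch:

```python
import math

def lp_distance(f, g, p):
    """L^p distance (Eqs. 1-2) between two fields on the same vertices."""
    if p == math.inf:
        return max(abs(a - b) for a, b in zip(f, g))
    return sum(abs(a - b) ** p for a, b in zip(f, g)) ** (1.0 / p)

def psnr(f, g):
    """Peak Signal to Noise Ratio (Eq. 3); higher means more faithful."""
    n = len(f)
    return 20 * math.log10(
        math.sqrt(n) * (max(f) - min(f)) / (2 * lp_distance(f, g, 2)))

f = [0.0, 1.0, 2.0, 4.0]
g = [0.0, 1.0, 2.0, 3.0]
print(lp_distance(f, g, math.inf))   # → 1.0
print(round(psnr(f, g), 2))          # → 12.04
```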

In the context of topological data analysis, several metrics [11] have been introduced too, in order to compare persistence diagrams. In our context, such metrics will be instrumental to evaluate the preservation of topological features after decompression. The *bottleneck* distance [11], noted $d_B$, is a popular example. Persistence diagrams can be associated with a pointwise distance, noted $d_\infty$, inspired by the $L^\infty$-norm. Given two critical points $p = (x_p, y_p) \in \mathcal{D}(f)$ and $q = (x_q, y_q) \in \mathcal{D}(g)$, $d_\infty$ is defined as:

$$d_\infty(p, q) = \max\big(|x_p - x_q|, |y_p - y_q|\big) \tag{4}$$

Assume that the number of critical points of all possible indices is the same in both $\mathcal{D}(f)$ and $\mathcal{D}(g)$. Then the bottleneck distance can be defined as follows:

$$d_B\big(\mathcal{D}(f), \mathcal{D}(g)\big) = \min_{\phi \in \Phi} \max_{p \in \mathcal{D}(f)} d_\infty\big(p, \phi(p)\big) \tag{5}$$

where $\Phi$ is the set of all possible bijections $\phi$ mapping the critical points of $\mathcal{D}(f)$ to critical points of the same index in $\mathcal{D}(g)$. If the numbers of critical points of index $i$ do not match in both diagrams, $\phi$ will be an injection from the smaller set of critical points to the larger one. Additionally, $\phi$ will collapse the remaining, unmatched, critical points in the larger set by mapping each critical point $c$ to the other extremity of its persistence pair $(c, c')$, in order to still penalize the presence of unmatched, low-persistence features.

Intuitively, in the context of scalar data compression, the bottleneck distance between two persistence diagrams can usually be interpreted as the maximal size of the topological features which have not been maintained through compression (Fig. 1).
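For small diagrams, Eq. 5 can be evaluated naively by enumerating all bijections. The sketch below assumes both diagrams have the same number of points of a given index; practical implementations rely on dedicated assignment algorithms instead of this exponential enumeration:

```python
from itertools import permutations

def d_inf(p, q):
    """Pointwise distance (Eq. 4) between diagram points (birth, death)."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

def bottleneck(diag_f, diag_g):
    """Brute-force bottleneck distance (Eq. 5) between two small
    diagrams with the same number of points (toy inputs only)."""
    return min(max(d_inf(p, q) for p, q in zip(diag_f, perm))
               for perm in permutations(diag_g))

# one persistent pair kept exactly, one small pair slightly displaced
print(bottleneck([(0.0, 10.0), (2.0, 3.0)],
                 [(0.0, 10.0), (2.5, 3.5)]))   # → 0.5
```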

A simple variant of the bottleneck distance, that is slightly more informative in the context of data compression, is the *Wasserstein* distance, noted $d_W$ (sometimes called the *Earth Mover's Distance* [30]), between the persistence diagrams $\mathcal{D}(f)$ and $\mathcal{D}(g)$:

$$d_W\big(\mathcal{D}(f), \mathcal{D}(g)\big) = \min_{\phi \in \Phi} \sum_{p \in \mathcal{D}(f)} d_\infty\big(p, \phi(p)\big) \tag{6}$$

In contrast to the bottleneck distance, the Wasserstein distance takes into account the persistence of *all* the pairs which have not been maintained through compression (not only the largest one).
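Under the same assumptions as for the bottleneck distance (small diagrams with matching sizes), Eq. 6 only changes the aggregation from a maximum to a sum; a minimal brute-force sketch:

```python
from itertools import permutations

def wasserstein(diag_f, diag_g):
    """Brute-force Wasserstein distance (Eq. 6): same matchings as the
    bottleneck distance, but summing the pointwise distances."""
    d_inf = lambda p, q: max(abs(p[0] - q[0]), abs(p[1] - q[1]))
    return min(sum(d_inf(p, q) for p, q in zip(diag_f, perm))
               for perm in permutations(diag_g))

# two small pairs are displaced by 0.2 and 0.4: both contribute
print(round(wasserstein([(0.0, 10.0), (2.0, 3.0), (4.0, 4.5)],
                        [(0.0, 10.0), (2.0, 3.2), (4.0, 4.9)]), 6))  # → 0.6
```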

### 1.2 Overview

An overview of our compression approach is presented in Fig. 2. First, the persistence diagram $\mathcal{D}(f)$ of the input data is computed, so as to evaluate the noisy topological features to discard later. The diagram consists of all critical point pairs of index $(0, 1)$ and $(d-1, d)$. Next, given a target size for the preservation of topological features, expressed as a persistence threshold $\varepsilon$, a simplified function $f'$ is reconstructed [46] from the persistence diagram of $f$, $\mathcal{D}(f)$, from which all persistence pairs below $\varepsilon$ have been removed (Fig. 2(a)). Next, the image of $\mathcal{M}$, $f'(\mathcal{M})$, is segmented along each critical value of $f'$. A new function $g'$ is then obtained from $f'$ by assigning to each vertex the mid-value of the interval it maps to. This constitutes a topologically adaptive quantization of the range (Fig. 2(c)). This quantization can optionally be further subdivided to enforce a maximal pointwise error (Fig. 2(d)). At this point, the data can be compressed by storing the list of critical values of $g'$ and storing for each vertex the identifier of the interval it maps to. Optionally, the input data can be compressed independently by state-of-the-art compressors, such as ZFP [31] (Fig. 2(e)).

At decompression, a first function $g'$ is constructed by re-assigning to each vertex the mid-value of the interval it maps to. Optionally, if the data has been compressed with a third-party compressor, such as ZFP [31], at decompression, each vertex value is cropped to the extent of the interval it should map to. Last, a function $g$ is reconstructed from the prescribed critical points of $g'$ [46], to remove any topological feature resulting from compression artifacts.

## 2 Data compression

This section presents our topologically controlled compression scheme. In addition to topological control (Sec. 2.1), our approach can optionally support pointwise error control (Sec. 2.3) as well as combinations with existing compressors (Sec. 2.4). The format of the files generated by our compressor is described in Sec. 2.2.

### 2.1 Topological control

The input of our algorithm is the input data, $f : \mathcal{M} \to \mathbb{R}$, as well as the size of the topological features to preserve through compression. This size is expressed as a persistence threshold $\varepsilon$.

First, the persistence diagram of the input data, noted $\mathcal{D}(f)$, is computed. Next, a simplified version of the input data, noted $f'$, is constructed such that $f'$ admits a persistence diagram $\mathcal{D}(f')$ which corresponds to that of $f$, but from which the critical point pairs with persistence smaller than $\varepsilon$ have been removed. This simplification is achieved by using the algorithm by Tierny and Pascucci [46], which iteratively reconstructs sub-level sets to satisfy topological constraints on the extrema to preserve. In particular, this algorithm is given as constraints the extrema of $f$ to preserve, which are in our current setting the critical points involved in pairs with persistence larger than $\varepsilon$. In such a scenario, this algorithm has been shown to reconstruct a function $f'$ such that $\|f - f'\|_\infty \le \varepsilon$ [46]. At this point, $f'$ carries all the necessary topological information that should be preserved through compression.

In order to compress the data, we adopt a strategy based on range quantization. By assigning only a small number $n_I$ of possible data values to the vertices of $\mathcal{M}$, only $\lceil \log_2(n_I) \rceil$ bits should be required in principle for the storage of each value (instead of 64 for traditional floating point data with double precision). Moreover, encoding the data with a limited number of possible values is known to constitute a highly favorable configuration for post-process lossless compression, which achieves high compression rates for redundant data.

The difficulty in our context is to define a quantization that respects the topology of $f'$, as described by its persistence diagram $\mathcal{D}(f')$. To do so, we collect all critical values of $f'$ and segment the image of $\mathcal{M}$ by $f'$, noted $f'(\mathcal{M})$, into a set of contiguous intervals $\{I_0, I_1, \dots, I_{n_I - 1}\}$, all delimited by the critical values of $f'$ (Figure 2, second column, top). Next, we create a new function $g'$, where all critical points of $f'$ are maintained at their corresponding critical value and where all regular vertices are assigned the mid-value of the interval they map to. This constitutes a topologically adaptive quantization of the range: only $n_I$ possible values will be assigned to regular vertices. Note that although we modify data values in the process, the critical vertices of $f'$ are still critical vertices (with identical indices) in $g'$, as the lower and upper links (Sec. 1.1) of each critical point are preserved by construction.
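Assuming the critical values of $f'$ are known, this quantization can be sketched as follows (`quantize` is a hypothetical standalone helper operating on a flat list of vertex values, not the implementation used in this work):

```python
import bisect

def quantize(f_prime, critical_vertices, critical_values):
    """Topologically adaptive quantization sketch: the range is split
    into intervals delimited by the critical values; critical vertices
    keep their exact value, regular vertices get the mid-value of the
    interval they fall in."""
    cuts = sorted(critical_values)
    g_prime = []
    for v, x in enumerate(f_prime):
        if v in critical_vertices:
            g_prime.append(x)    # critical value, kept exactly
        else:
            # interval containing x (the global extrema are critical,
            # so regular values always fall between two cuts)
            i = min(max(bisect.bisect_left(cuts, x), 1), len(cuts) - 1)
            g_prime.append(0.5 * (cuts[i - 1] + cuts[i]))
    return g_prime

# critical vertices 0, 4 and 6 delimit two intervals, [0, 0.6] and [0.6, 1]
print(quantize([0.0, 0.2, 0.5, 0.8, 1.0, 0.9, 0.6], {0, 4, 6},
               [0.0, 0.6, 1.0]))
# → [0.0, 0.3, 0.3, 0.8, 1.0, 0.8, 0.6]
```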

### 2.2 Data encoding

The function $g'$ is encoded in a two-step process. First, a topological index is created. This index stores the identifier of each critical vertex of $g'$ as well as its critical value and, for each of these, the identifier of the interval immediately above it, if and only if some vertices of $\mathcal{M}$ indeed project to this interval through $g'$. This strategy avoids wasting identifiers on empty intervals.

The second step of the encoding focuses on the data values of the regular vertices of $\mathcal{M}$. Each vertex of $\mathcal{M}$ is assigned the identifier of the interval it projects to through $g'$. Overall, for $n_v$ vertices, $n_c$ critical points and $n_I$ non-empty intervals between critical points, we store the per-vertex interval identifiers ($n_v$ words of $\lceil \log_2(n_I) \rceil$ bits), the critical point positions in a vertex index ($n_c$ words of $\lceil \log_2(n_v) \rceil$ bits), the critical point types ($n_c$ words) and the critical values ($n_c$ floating point values).

Since it uses a finite set of data values, the buffer storing the interval assignments for all vertices of $\mathcal{M}$ is highly redundant.
Thus, we further compress the data (topological index and interval
assignment) with a standard lossless compressor (*Bzip2*
[38]).
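The interval assignment buffer can be sketched as follows: each identifier is packed on $\lceil \log_2(n_I) \rceil$ bits and the resulting buffer is passed to a lossless compressor (Python's `bz2` is used here as a stand-in for *Bzip2*; `encode_intervals` is a hypothetical helper):

```python
import bz2
import math

def encode_intervals(interval_ids, num_intervals):
    """Pack per-vertex interval identifiers on ceil(log2(n_I)) bits
    each, then compress the packed buffer losslessly."""
    bits = max(1, math.ceil(math.log2(num_intervals)))
    acc, nacc, packed = 0, 0, bytearray()
    for ident in interval_ids:
        acc = (acc << bits) | ident
        nacc += bits
        while nacc >= 8:                     # flush full bytes
            nacc -= 8
            packed.append((acc >> nacc) & 0xFF)
    if nacc:                                 # flush the remainder
        packed.append((acc << (8 - nacc)) & 0xFF)
    return bz2.compress(bytes(packed))

# a quantized field only uses a few distinct values, so the packed
# buffer is highly redundant and compresses extremely well
ids = [0, 0, 0, 1, 1, 1, 2, 2] * 1000
compressed = encode_intervals(ids, 3)
print(len(compressed), "bytes for", len(ids), "identifiers")
```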

### 2.3 Pointwise error control

Our approach has been designed so far to preserve topological features thanks to a topologically adaptive quantization of the range. However, this quantization may be composed of arbitrarily large intervals, which may result in large pointwise error.

Our strategy can be easily extended to optionally support a maximal pointwise error with regard to the input data $f$, still controlled by the parameter $\varepsilon$. In particular, this can be achieved by subdividing each interval $I_i$ (Sec. 2.1) according to a target maximal width $\delta$, prior to the actual quantization and data encoding (Sec. 2.2). Since the topologically simplified function $f'$ is guaranteed to be at most $\varepsilon$-away from $f$ ($\|f - f'\|_\infty \le \varepsilon$, Sec. 2.1), further subdividing each interval with a maximum authorized width of $\delta$ will result in a maximum error of $\varepsilon + \delta/2$ when quantizing the data into $g'$. For instance, a local maximum of $f$ of persistence lower than $\varepsilon$ can be pulled down by at most $\varepsilon$ when simplifying $f$ into $f'$ [46] (Fig. 3, center) and then further pulled down by up to $\delta/2$ when being assigned the mid-value of its new interval (of width $\delta$, Fig. 3, right). In practice, for simplicity, we set $\delta = \varepsilon$. Hence, a maximum pointwise error of $3\varepsilon/2$ is guaranteed at compression ($\|f - g'\|_\infty \le 3\varepsilon/2$).
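The resulting bound follows from the triangle inequality, combining the simplification error with the quantization error:

```latex
\|f - g'\|_\infty
  \;\le\; \underbrace{\|f - f'\|_\infty}_{\le\,\varepsilon}
  \;+\;   \underbrace{\|f' - g'\|_\infty}_{\le\,\delta/2}
  \;\le\; \varepsilon + \frac{\delta}{2}
  \;\stackrel{\delta = \varepsilon}{=}\; \frac{3\varepsilon}{2}
```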

### 2.4 Combination with state-of-the-art compressors

The compression approach we presented so far relies on a topologically adaptive
quantization of the range (with optional pointwise error control). The
compression is achieved by only allowing a small number of possible scalar
values in the compressed data, which may typically result in noticeable
*staircase* artifacts. To address this, our method can be optionally
combined seamlessly with any state-of-the-art lossy compressor.
For our experiments, we used *ZFP* [31].
Such a combination is
straightforward at compression time.
In particular, in addition
to the topological index and the compressed quanta identifiers
(subsection 2.2),
the input data is additionally and independently compressed
by the third-party compressor (*ZFP*).

## 3 Data decompression

This section describes the decompression procedure of our approach, which is symmetric to the compression pipeline described in the previous section (Sec. 2). This section also further details the guarantees provided by our approach regarding the bottleneck ($d_B$) and Wasserstein ($d_W$) distances between the persistence diagrams of the input data $f$ and the decompressed data, noted $g$ (Sec. 3.4).

### 3.1 Data decoding

First, the compressed data is decompressed with the lossless decompressor *Bzip2* [38]. Next, a function $g'$ is constructed based on the topological index and the interval assignment buffer (Sec. 2.2). In particular, each critical vertex is assigned its critical value (as stored in the topological index) and regular vertices are assigned the mid-value of the interval they project to, based on the interval assignment buffer (Sec. 2.2).

### 3.2 Combination with state-of-the-art decompressors

If a state-of-the-art compression method has been used in conjunction with our approach (Sec. 2.4), we use its decompressor to generate a function $h$. Next, for each vertex $v$ of $\mathcal{M}$, if $h(v)$ is outside of the interval $I_i$ where $v$ is supposed to project, we snap $h(v)$ to the closest extremity of $I_i$. This guarantees that the decompressed data respects the topological constraints of $\mathcal{D}(f')$, as well as the optional target pointwise error (Sec. 2.3).
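This snapping step amounts to a per-vertex clamp. A minimal sketch (with a hypothetical `snap_to_intervals` helper, where `lows[v]` and `highs[v]` encode the interval vertex `v` should map to):

```python
def snap_to_intervals(h, lows, highs):
    """Clamp each third-party-decompressed value h[v] to the interval
    [lows[v], highs[v]] it is supposed to map to, so that the
    topological constraints survive the combination with a lossy
    compressor such as ZFP."""
    return [min(max(x, lo), hi) for x, lo, hi in zip(h, lows, highs)]

h     = [0.12, 0.47, 0.55, 0.93]
lows  = [0.0, 0.5, 0.5, 0.8]
highs = [0.5, 0.8, 0.8, 1.0]
print(snap_to_intervals(h, lows, highs))   # → [0.12, 0.5, 0.55, 0.93]
```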

### 3.3 Topological reconstruction

The decompressed function $g'$ may contain at this point extraneous critical point pairs, which were not present in $f'$ (Sec. 2.1). For instance, if a state-of-the-art compressor has been used in conjunction with our approach, arbitrary oscillations within a given interval can still occur and result in the appearance of critical point pairs in $g'$ (with a persistence smaller than the target interval width $\delta$, Sec. 2.3) which were not present in $\mathcal{D}(f')$. The presence of such persistence pairs impacts the distance metrics introduced in Sec. 1.1, and therefore the quality of our topologically controlled compression. Thus, such pairs need to be simplified in a post-process.

Note that, even if no third-party compressor has been used, since our approach is based on a topologically adaptive quantization of the range, large flat plateaus will appear in $g'$. Depending on the vertex offsets $O$ (used to disambiguate flat plateaus, Sec. 1.1), arbitrarily small persistence pairs can also occur. Therefore, for such flat plateaus, $g'$ must be simplified to guarantee its monotonicity everywhere except at the desired critical vertices (i.e. those stored in the topological index, Sec. 2.2).

Thus, whether a state-of-the-art compressor has been used or not, the last step of our approach consists in reconstructing the function $g$ from $g'$ by enforcing the critical point constraints of $f'$ (stored in the topological index) with the algorithm by Tierny and Pascucci [46]. Note that this algorithm will automatically resolve flat plateaus, by enforcing the monotonicity of $g$ everywhere except at the prescribed critical points [46]. Therefore, the overall output of our decompression procedure is the scalar field $g$ as well as its corresponding vertex integer offset $O$.

### 3.4 Topological guarantees

The last step of our decompression scheme, topological reconstruction (Sec. 3.3), guarantees that $g$ admits no other critical points than those of $f'$ (specified in the topological index). Moreover, the corresponding critical values have been strictly enforced (Sec. 3.1). This guarantees that $\mathcal{D}(g) = \mathcal{D}(f')$, and thus:

$$d_B\big(\mathcal{D}(g), \mathcal{D}(f')\big) = 0 \tag{7}$$

Since we know that $f'$ is at most $\varepsilon$-away from the original data ($\|f - f'\|_\infty \le \varepsilon$, Sec. 2.1) and due to the stability of persistence diagrams [11], we then have:

$$d_B\big(\mathcal{D}(g), \mathcal{D}(f)\big) \le \varepsilon \tag{8}$$

Thus, the bottleneck distance between the persistence diagrams of the input and decompressed data is indeed bounded by $\varepsilon$, which precisely describes the size of the topological features that the user wants to preserve through compression.

Since $\mathcal{D}(g) = \mathcal{D}(f')$ and $\mathcal{D}(f') \subseteq \mathcal{D}(f)$, we have $\mathcal{D}(g) \subseteq \mathcal{D}(f)$. This further implies that:

$$d_B\big(\mathcal{D}(f), \mathcal{D}(g)\big) = \max_{(c, c') \in \mathcal{D}(f) \setminus \mathcal{D}(g)} P(c, c') \tag{9}$$

where $P(c, c')$ denotes the persistence of a critical point pair $(c, c')$ and where $\mathcal{D}(f) \setminus \mathcal{D}(g)$ denotes the symmetric difference between $\mathcal{D}(f)$ and $\mathcal{D}(g)$ (i.e. the set of pairs present in $\mathcal{D}(f)$ but not in $\mathcal{D}(g)$). In other words, the bottleneck distance between the persistence diagrams of the input and decompressed data exactly equals the persistence of the most persistent pair present in $\mathcal{D}(f)$ but not in $\mathcal{D}(g)$ (in red in Fig. 1). This guarantees the *exact* preservation of the topological features selected with an $\varepsilon$ persistence threshold. As for the Wasserstein distance, with the same rationale, we get:

$$d_W\big(\mathcal{D}(f), \mathcal{D}(g)\big) = \sum_{(c, c') \in \mathcal{D}(f) \setminus \mathcal{D}(g)} P(c, c') \tag{10}$$

In other words, the Wasserstein distance between the persistence diagrams of the input and decompressed data will be exactly equal to the sum of the persistences of all the pairs present in $\mathcal{D}(f)$ but not in $\mathcal{D}(g)$ (small bars near the diagonal in Fig. 1, bottom), which correspond to all the topological features that the user precisely wanted to discard.

Finally, for completeness, we recall that, if pointwise error control was enabled, our approach guarantees $\|f - g\|_\infty \le 3\varepsilon/2$ (Sec. 2.3).

## 4 Results

This section presents experimental results obtained on a desktop
computer with two Xeon CPUs (3.0 GHz, 4 cores each), with 64 GB of RAM.
For the computation of the persistence
diagram and the topological simplification of the data,
we used the algorithms by Tierny and Pascucci
[46]
and Gueunet et al. [20], whose implementations are
available in the Topology ToolKit (TTK) [43]. The other components of our approach (including bottleneck and Wasserstein distance computations) have been implemented as TTK modules. Note that our approach has been described so far for triangulations. However, we restrict our experiments to regular grids in the following, as most state-of-the-art compressors (including *ZFP* [31]) have been specifically designed for regular grids. For this, we use the triangulation data structure from TTK, which implicitly represents regular grids with no memory overhead, using a 6-tet subdivision.

Fig. 4 presents an overview of the compression capabilities of our approach on a noisy 2D example. A noisy data set is provided as input. Given a user threshold on the size of the topological features to preserve, expressed as a persistence threshold $\varepsilon$, our approach generates decompressed data-sets that all share the same persistence diagram (Figure 4(b), bottom), which is a subset of the diagram of the input data (Figure 4(b), top) from which the pairs with a persistence lower than $\varepsilon$ have been removed, and in which those above $\varepsilon$ have been *exactly* preserved. As shown in this example, augmenting our approach with pointwise error control or combining it with a state-of-the-art compressor allows for improved geometrical reconstructions, but at the expense of much lower compression rates. Note that Figure 4(e) shows the result of the compression with *ZFP* [31], augmented with our topological control. This shows that our approach can enhance any existing compression scheme, by providing strong topological guarantees on the output.

### 4.1 Compression performance

We first evaluate the performance of our compression scheme with topological control only, on a variety of 3D data sets, all sampled on regular grids. Fig. 5 (left) presents the evolution of the compression rates for increasing target persistence thresholds $\varepsilon$, expressed as percentages of the data range. This plot confirms that when fine-scale structures need to be preserved (small $\varepsilon$ values, left), smaller compression rates are achieved, while higher compression rates (right) are reached when this constraint is relaxed. Compression rates vary among data sets, as the persistence diagrams vary in complexity. The Ethane Diol data set (topmost curve) is a very smooth function coming from chemical simulations. High compression factors are achieved for it, almost irrespective of $\varepsilon$. On the contrary, the random data set (bottom curve) exhibits a complex persistence diagram, and hence lower compression rates.

In between, all data sets exhibit the same increase in compression rate for increasing $\varepsilon$ values. Their respective position between the two extreme configurations of the spectrum (elevation and random) depends on their input topological complexity (number of pairs in the persistence diagram).

Fig. 5 (right) plots the evolution of the bottleneck distance between the input and decompressed data, $d_B(\mathcal{D}(f), \mathcal{D}(g))$, for increasing target persistence thresholds $\varepsilon$, for all data sets. This plot shows that all curves are located below the identity diagonal. This constitutes a practical validation of our guaranteed bound on the bottleneck distance (Eq. 8). Note that, for a given data set, the proximity of its curve to the diagonal is directly dependent on its topological complexity. This result confirms the strong guarantees regarding the preservation of topological features through compression.

| Data-set | P (%) | S (%) | Q (%) | L (%) | Total (s) | Compr. rate | Total (s), no simpl. | Compr. rate, no simpl. | Decompr. time (s) |
|---|---|---|---|---|---|---|---|---|---|
| Combustion | 8.4 | 89.3 | 0.7 | 1.6 | 593.9 | 121.3 | 64.1 | 111.1 | 213.3 |
| Elevation | 14.6 | 84.1 | 1.2 | 0.1 | 157.0 | 174,848.0 | 25.3 | 174,848.0 | 211.5 |
| EthaneDiol | 12.6 | 86.7 | 0.5 | 0.2 | 490.0 | 2,158.6 | 63.0 | 2,158.6 | 228.9 |
| Enzo | 9.5 | 86.7 | 1.0 | 2.7 | 695.6 | 24.5 | 91.8 | 19.8 | 204.0 |
| Foot | 12.6 | 81.9 | 1.6 | 3.8 | 380.6 | 12.1 | 68.4 | 7.75 | 205.7 |
| Jet | 22.2 | 75.7 | 0.6 | 1.5 | 451.3 | 315.6 | 111.1 | 287.6 | 220.3 |
| Random | 15.9 | 76.0 | 2.8 | 5.3 | 1357.1 | 1.5 | 307.7 | 1.5 | 101.2 |

Table 1 provides detailed timings for our approach and shows that most of the compression time (at least 75%, column S) is spent simplifying the original data $f$ into $f'$ (Sec. 2). If desired, this step can be skipped to drastically improve time performance, but at the expense of compression rates (Table 1, right). Indeed, as shown in Figure 6, skipping this simplification step at compression time results in a quantized function that still admits a rich topology, which therefore constitutes a less favorable ground for the post-process lossless compression (higher entropy). Note however that our implementation has not been optimized for execution time. We leave time performance improvements for future work.

### 4.2 Comparisons

Next, we compare our approach with topological control only to the *SQ* compressor [27], which has been explicitly designed to control the pointwise error. Thus, it is probably the compression scheme most closely related to our approach. *SQ* proposes two main strategies for data compression: a straight range quantization (with a constant step size, *SQ-R*) and a strategy which grows regions in the 3D domain, within a target function interval width (*SQ-D*). Both variants, which we implemented ourselves, provide an explicit control on the resulting pointwise error. As such, thanks to the stability result on persistence diagrams [11], *SQ* also bounds the bottleneck distance between the persistence diagrams of the input and decompressed data. Each pair completely included within one quantization step is indeed flattened out. Only the pairs larger than the quantization step size survive through compression. However, the latter are snapped to admitted quantization values. In practice, this can artificially and arbitrarily reduce the persistence of certain pairs, and increase the persistence of others. This is particularly problematic as it can reduce the persistence of important features and increase that of noise, which prevents a reliable multi-scale analysis after decompression. This difficulty is one of the main motivations which led us to design our approach.

To evaluate this, we compare *SQ* to our approach in the light of the Wasserstein distance between the persistence diagrams of the input and decompressed data, $d_W(\mathcal{D}(f), \mathcal{D}(g))$. As described in Sec. 1.1, this distance is more informative in the context of compression, since it tracks not only the pairs which have been lost, but also the changes to the pairs which have been preserved. Fig. 7 (left) presents the evolution of the Wasserstein distance, averaged over all our data sets, for increasing compression rates. This plot shows that our approach (red curve) achieves a significantly better preservation of the topological features than *SQ*, for all compression rates, especially the lowest ones. As discussed in the previous paragraph, given a quantization step $\delta$, *SQ* will preserve all pairs more persistent than $\delta$, but it will also degrade them, as shown in the above experiment. Another drawback of *SQ* regarding the preservation of topological features is the compression rate. Since it uses a constant quantization step size, it may require many quantization intervals to preserve pairs above a target persistence $\varepsilon$, although large portions of the range may be devoid of important topological features. To illustrate this, we compare the compression rates achieved by *SQ* and our approach, for increasing values of the parameter $\varepsilon$. As in Fig. 5, increasing slopes can be observed. However, our approach always achieves higher compression rates, especially for larger persistence targets.

Next, we study the capabilities offered by our approach to augment a third-party compressor with topological control (Secs. 2.4 and 3.2). In particular, we augmented the *ZFP* compressor [31], by using the original implementation provided by the author (with 1 bit per vertex value). Fig. 8 (left) indicates the evolution of the compression rates as the target persistence threshold $\varepsilon$ increases. In particular, in these experiments, a persistence target of $\varepsilon = 0$ indicates that no topological control was enforced. Thus, apart from this point, these curves indicate the overhead of topological guarantees over the *ZFP* compressor in terms of data storage.
These curves, as could be expected from Fig. 5, show that compression rates rapidly drop for topologically rich data sets (such as the random one). On the contrary, for smoother data sets, such as Ethane Diol or Jet, high compression rates can be maintained. This shows that augmenting a third-party compressor with topological control results in compression rates that adapt to the topological complexity of the input. Fig. 8 (right) shows the evolution of the PSNR (Sec. 1.1) for decreasing persistence targets $\varepsilon$. Surprisingly, in this context, the enforcement of topological control improves the quality of the data decompressed with *ZFP*, with higher PSNR values for small persistence targets. This is due to the rather aggressive compression rate which we used for *ZFP* (1 bit per vertex value), which tends to add noise to the data. Thanks to our topological control (Sec. 3.3), such compression artifacts can be cleaned up by our approach.

Table 2 evaluates the advantage of our topology-aware compression over a standard lossy compression followed, at decompression, by a topological cleanup (which simplifies all pairs less persistent than $\varepsilon$ [46]). In particular, this table shows that augmenting a third-party compressor (such as ZFP) with our topological control (second line) results in more faithful decompressed data (lower Wasserstein distances to the original data) than simply executing the compressor and topologically cleaning the data in a post-process after decompression (first line). This further motivates our approach for augmenting existing compressors with topological control.

| $d_W(\mathcal{D}(f), \mathcal{D}(g))$ | Combustion | Elevation | EthaneDiol | Enzo | Foot | Jet |
|---|---|---|---|---|---|---|
| ZFP + Cleanup | 18.08 | 0.00 | 1.53 | 189.66 | 520,371 | 351.97 |
| Topology-aware ZFP | 13.73 | 0.00 | 0.40 | 131.11 | 506,714 | 153.45 |

Finally, Table 3 provides a comparison of the running times (for comparable compression rates) between our approach, SQ and ZFP. ZFP has been designed to achieve high throughput and thus delivers the best time performance. The running times of our approach are on par with those of the other approaches enforcing strong guarantees on the decompressed data (SQ-R and SQ-D).

| Data-set | ZFP (s) | SQ-R (s) | SQ-D (s) | Ours (s) |
|---|---|---|---|---|
| Combustion | 4.6 | 37.6 | 242.9 | 64.1 |
| Elevation | 7.2 | 31.4 | 204.2 | 25.3 |
| EthaneDiol | 4.7 | 34.4 | 197.1 | 63.0 |
| Enzo | 4.7 | 33.0 | 229.5 | 91.8 |
| Foot | 2.9 | 18.2 | 198.0 | 68.4 |
| Jet | 4.7 | 31.4 | 203.4 | 111.1 |
| Random | 4.1 | 31.6 | 182.7 | 307.7 |

### 4.3 Application to post-hoc topological data analysis

A key motivation of our compression scheme is to allow users to faithfully conduct advanced topological data analysis in a post-process, on the decompressed data, with guarantees on the compatibility between the outcome of their analysis and that of the same analysis on the original data. We illustrate this aspect in this sub-section, where all examples have been compressed with our algorithm, without pointwise error control or combination with ZFP.

We first illustrate this in the context of medical data segmentation with Fig. 9, which shows a foot scan. The persistence diagram of the original data counts more than 345 thousand pairs (top left). The split tree [5, 8] is a topological abstraction which tracks the connected components of the sur-level sets of the data. It has been shown to excel at segmenting medical data [9]. In this context, users typically compute multi-scale simplifications of the split tree to extract the most important features. Here, the user simplified the split tree until it counted only 5 leaves, corresponding to 5 features of interest (i.e. the 5 toes). Next, the segmentation induced by the simplified split tree has been extracted by considering the regions corresponding to each arc connected to a leaf. This immediately results in a sharp segmentation of the toe bones (Fig. 9, bottom left). We applied the exact same analysis pipeline to the data compressed with our approach. In particular, since it can be known a priori that this data has only 5 features of interest (the 5 toes), we compressed the data with a target persistence $\varepsilon$ such that only 5 pairs remained in the persistence diagram (top right). Although such an aggressive compression greatly modifies data values, the outcome of the segmentation is identical (Table 4), for an approximate compression rate of 360.

Next, we evaluate our approach on a more challenging pipeline (Fig. 10), where the features of interest are not explicitly related to the persistence diagram of the input data. We consider a snapshot of a simulation run of viscous fingering and apply the topological data analysis pipeline described by Favelier et al. [16] for the extraction and tracking of viscous fingers. This pipeline first isolates the largest connected component of the sur-level set of the salt concentration. Next, it considers its height function, on which persistent homology is applied to retrieve the deepest points of the geometry (corresponding to the finger tips). Finally, a distance field is grown from the finger tips and the Morse complex of this distance field is computed to isolate the fingers. In contrast to the previous example, the original data undergoes many transformations and changes of representation before the extraction of topological features. Despite this, when applied to the data compressed with our scheme, the analysis pipeline still outputs results consistent with those obtained on the original data (Fig. 10, compression rate: 56). Only slight variations can be perceived in the local geometry of the fingers, but their number is unchanged and their overall geometry is compatible. Table 4 provides a quantitative estimate of the similarity between the segmentations obtained before and after compression with several algorithms. This table shows that our approach enables the computation of more faithful topological segmentations (higher Rand index) than SQ and ZFP, which further underlines the superiority of our strategy at preserving topological features.
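The first step of both pipelines, isolating the largest connected component of a sur-level set, can be sketched in a few lines. The following is a minimal illustration on a regular grid using `scipy.ndimage.label`, not the TTK-based implementation used in the paper:

```python
import numpy as np
from scipy import ndimage

def largest_surlevel_component(field, isovalue):
    """Boolean mask of the largest connected component of the
    sur-level set {p : field(p) >= isovalue} on a regular grid."""
    mask = field >= isovalue
    labels, count = ndimage.label(mask)  # face-connectivity by default
    if count == 0:
        return np.zeros_like(mask, dtype=bool)
    sizes = np.bincount(labels.ravel())[1:]  # component sizes, skipping background
    return labels == (1 + int(np.argmax(sizes)))
```

In the viscous-fingering pipeline, `field` would be the salt concentration; the subsequent steps (height function, persistence-based extraction of the deepest points, Morse complex of a distance field) then operate on the geometry selected by this mask.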

*Rand index between the segmentations computed on the original and decompressed data (Table 4).*

Experiment | Ours | SQ-R | SQ-D | ZFP
---|---|---|---|---
Foot scan | 1.000 | 0.913 | 0.895 | 0.943
Viscous fingers | 0.985 | 0.977 | 0.977 | 0.973
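For reference, the Rand index reported above can be computed by pair counting from two segmentation label arrays. Below is a minimal textbook sketch, not the implementation used for the paper's measurements:

```python
import numpy as np

def rand_index(labels_a, labels_b):
    """Pair-counting Rand index between two segmentations of the same
    domain: the fraction of point pairs on which the labelings agree."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    n = a.size
    # contingency table between the two labelings
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    cont = np.zeros((ia.max() + 1, ib.max() + 1), dtype=np.int64)
    np.add.at(cont, (ia, ib), 1)
    same_both = (np.sum(cont ** 2) - n) / 2              # together in both
    same_a = (np.sum(cont.sum(axis=1) ** 2) - n) / 2     # together in A
    same_b = (np.sum(cont.sum(axis=0) ** 2) - n) / 2     # together in B
    total = n * (n - 1) / 2
    diff_both = total - same_a - same_b + same_both      # apart in both
    return (same_both + diff_both) / total
```

`rand_index(seg_original, seg_decompressed)` returns 1.0 exactly when the two segmentations group every pair of points identically, which is the perfect score obtained on the foot scan.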

### 4.4 Limitations

Like all lossy techniques, our approach is subject to an input parameter that controls the loss, namely the persistence threshold above which features should be strictly preserved. While we believe this parameter to be intuitive, prior domain knowledge about the size of the features to preserve may be required. However, conservative values (typically ) can be used by default, as they already achieve high compression rates while preserving most of the features. In some applications, ad-hoc metrics [9] may be preferred over persistence. Our approach can be used in this setting too, as the simplification algorithm that we use [46] supports an arbitrary selection of the critical points to preserve. However, it then becomes more difficult to express clear guarantees on the compression quality in terms of the bottleneck and Wasserstein distances between the persistence diagrams of the input and decompressed data. When pointwise error control is enabled, the -norm between the input and decompressed data is guaranteed by our approach to be bounded by . This is due to the topological simplification algorithm that we employ [46], which is a flooding-only algorithm. Alternatives combining flooding and carving [4, 44] could be considered to reach a guaranteed -norm of . Finally, our approach only considers persistence pairs corresponding to critical points of index (0, 1) and (d-1, d). However, (1, 2) pairs may have a practical interest in certain 3D applications, and it might be interesting to enforce their preservation throughout compression. This would require an efficient data reconstruction algorithm for such pairs, which seems challenging [1].
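To make the role of the persistence threshold concrete, the pairs on which such guarantees are stated can be computed for a 1D scalar field with a simple union-find sweep. The sketch below covers 0-dimensional (minimum-maximum) pairs only and assumes distinct values, as enforced in the paper by simulation of simplicity; it is an illustration, not the paper's algorithm:

```python
def persistence_pairs_1d(values):
    """(birth, death) persistence pairs of the sub-level sets of a 1D
    scalar field with distinct values, via a union-find sweep."""
    n = len(values)
    parent = list(range(n))
    alive = [False] * n
    birth = list(range(n))  # index of the minimum of each component

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for i in sorted(range(n), key=lambda k: values[k]):
        alive[i] = True
        for j in (i - 1, i + 1):
            if 0 <= j < n and alive[j]:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # the component born at the higher minimum dies at values[i]
                young, old = (ri, rj) if values[birth[ri]] > values[birth[rj]] else (rj, ri)
                if values[i] > values[birth[young]]:  # skip zero-persistence pairs
                    pairs.append((values[birth[young]], values[i]))
                parent[young] = old
    vs = sorted(values)
    pairs.append((vs[0], vs[-1]))  # essential pair: global min with global max
    return sorted(pairs)
```

Discarding from this diagram every pair whose persistence (death minus birth) falls below the threshold yields exactly the set of features the compressor is allowed to cancel, while pairs above the threshold must survive decompression.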

## 5 Conclusion

In this paper, we presented the first compression scheme, to the best of our
knowledge, which provides
strong topological guarantees on the
decompressed data. In particular,
given a target topological feature size
to preserve, expressed as a persistence threshold , our approach
discards all persistence pairs below in order to achieve high
compression rates, while *exactly* preserving persistence pairs above
. Guarantees are given on the bottleneck and
Wasserstein distances between the persistence diagrams of the
input and
decompressed data. Such guarantees are key to ensure the reliability
of any post-hoc, possibly multi-scale, topological data analysis performed on
decompressed data.
Our approach is simple to implement; we provide a lightweight VTK-based C++
reference implementation.

Experiments demonstrated the superiority of our approach in terms of
topological feature preservation in comparison to existing
compressors,
for comparable compression rates. Our approach can be extended to
include pointwise error control. Further, we showed, with the example of the
*ZFP* compressor [31], how to make any third-party compressor
*topology-aware* by combining it with our approach, making it
benefit from our strong topological guarantees
without overly affecting isosurface geometry. We also showed that, when
aggressive compression rates were selected, our topological approach could
improve existing compressors in terms of PSNR by cleaning up topological
compression artifacts. We finally showed the utility of our approach by
illustrating, qualitatively and quantitatively, the compatibility
between the output of post-hoc topological data analysis pipelines, executed
on the input and decompressed data, for simulated or acquired data sets.
Our contribution enables users to faithfully conduct
advanced topological data analysis on decompressed data,
with guarantees on the
size of missed features and the *exact* preservation of
the most prominent ones.

In the future, we plan to improve practical aspects of our algorithm for in-situ deployment and to handle time-varying datasets. Runtime limitations will be investigated, with the objective of mitigating the effects of using (or not) a sequential topological simplification step, and of determining how many cores are necessary to outperform raw storage. A streaming version of the algorithm, which would not require the whole dataset to be loaded at once, would be of great interest in this framework. Finally, our approach focuses on regular grids; in the case of unstructured grids, the actual mesh is often the information that requires the most storage space, so we will investigate mesh compression techniques with guarantees on the preservation of topological features of the scalar fields defined on them.

###### Acknowledgements.

This work is partially supported by the Bpifrance grant “AVIDO” (Programme d’Investissements d’Avenir, reference P112017-2661376/DOS0021427) and by the French National Association for Research and Technology (ANRT), in the framework of the LIP6 - Total SA CIFRE partnership, reference 2016/0010. The authors would like to thank the anonymous reviewers for their thoughtful remarks and suggestions.

## References

- [1] D. Attali, U. Bauer, O. Devillers, M. Glisse, A. Lieutier. Homological reconstruction and simplification in R3. SoCG, 117–126, 2013.
- [2] C. Bajaj and D. Schikore. Topology preserving data simplification with error bounds. Computers and Graphics, 22(1):3–12, 1998.
- [3] C. L. Bajaj, V. Pascucci, G. Zhuang. Single resolution compression of arbitrary triangular meshes with properties. IEEE DC, 247–256, 1999.
- [4] U. Bauer, C. Lange, M. Wardetzky. Optimal topological simplification of discrete functions on surfaces. DCG, 47(2):347–377, 2012.
- [5] P. Bremer, G. Weber, J. Tierny, V. Pascucci, M. Day, J. Bell. Interactive exploration and analysis of large scale simulations using topology-based data segmentation. IEEE TVCG, 17(9):1307–1324, 2011.
- [6] M. Burrows and D. Wheeler. A block sorting lossless data compression algorithm. Technical report, Digital Equipment Corporation, 1994.
- [7] M. Burtscher and P. Ratanaworabhan. High throughput compression of double-precision floating-point data. IEEE DCC, 293–302, 2007.
- [8] H. Carr, J. Snoeyink, U. Axen. Computing contour trees in all dimensions. Symp. on Dis. Alg., 918–926, 2000.
- [9] H. Carr, J. Snoeyink, M. van de Panne. Simplifying flexible isosurfaces using local geometric measures. IEEE VIS, 497–504, 2004.
- [10] J. Cleary and I. Witten. Data compression using adaptive coding and partial string matching. IEEE Trans. Comm., 32(4):396–402, 1984.
- [11] D. Cohen-Steiner, H. Edelsbrunner, J. Harer. Stability of persistence diagrams. Symp. on Comp. Geom., 263–271, 2005.
- [12] S. Di and F. Cappello. Fast error-bounded lossy HPC data compression with SZ. IEEE Symp. on PDP, 730–739, 2016.
- [13] H. Edelsbrunner and J. Harer. Computational Topology: An Introduction. American Mathematical Society, 2009.
- [14] H. Edelsbrunner, D. Letscher, A. Zomorodian. Topological persistence and simplification. Disc. Compu. Geom., 28(4):511–533, 2002.
- [15] H. Edelsbrunner and E. P. Mucke. Simulation of simplicity: a technique to cope with degenerate cases in geometric algorithms. ToG, 9(1):66–104, 1990.
- [16] G. Favelier, C. Gueunet, J. Tierny. Visualizing ensembles of viscous fingers. IEEE SciVis Contest, 2016.
- [17] S. W. Golomb. Run-length encodings. IEEE Trans. on IT, 12(3):399–401, 1966.
- [18] D. Guenther, R. Alvarez-Boto, J. Contreras-Garcia, J.-P. Piquemal, J. Tierny. Characterizing molecular interactions in chemical systems. IEEE TVCG, 20(12):2476–2485, 2014.
- [19] C. Gueunet, P. Fortin, J. Jomier, J. Tierny. Contour forests: Fast multi-threaded augmented contour trees. IEEE LDAV, 85–92, 2016.
- [20] C. Gueunet, P. Fortin, J. Jomier, J. Tierny. Task-based augmented merge trees with Fibonacci heaps. IEEE LDAV, 2017.
- [21] A. Gyulassy, P. T. Bremer, B. Hamann, V. Pascucci. A practical approach to morse-smale complex computation: Scalability and generality. IEEE TVCG, 14(6):1619–1626, 2008.
- [22] A. Gyulassy, D. Guenther, J. A. Levine, J. Tierny, V. Pascucci. Conforming morse-smale complexes. IEEE TVCG, 20(12):2595–2603, 2014.
- [23] C. Heine, H. Leitte, M. Hlawitschka, F. Iuricich, L. De Floriani, G. Scheuermann, H. Hagen, C. Garth. A survey of topology-based methods in visualization. Comp. Grap. For., 35(3):643–667, 2016.
- [24] P. G. Howard and J. S. Vitter. Analysis of arithmetic coding for data compression. IEEE DCC, 3–12, 1991.
- [25] D. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101, 1952.
- [26] M. Isenburg, P. Lindstrom, J. Snoeyink. Lossless Compression of Predicted Floating-Point Geometry. Computer-Aided Design, 37(8):869–877, 2005.
- [27] J. Iverson, C. Kamath, G. Karypis. Fast and effective lossy compression algorithms for scientific datasets. Euro-Par, 843–856, 2012.
- [28] S. Lakshminarasimhan, N. Shah, S. Ethier, S. Klasky, R. Latham, R. B. Ross, N. F. Samatova. Compressing the incompressible with ISABELA: in-situ reduction of spatio-temporal data. Euro-Par, 366–379, 2011.
- [29] N. K. Laurance and D. M. Monro. Embedded DCT Coding with Significance Masking. IEEE ICASPP, 2717–2720, 1997.
- [30] E. Levina and P. Bickel. The earthmover’s distance is the mallows distance: some insights from statistics. IEEE ICCV, vol. 2, 251–256, 2001.
- [31] P. Lindstrom. Fixed-rate compressed floating-point arrays. IEEE TVCG, 20(12):2674–2683, 2014.
- [32] P. Lindstrom, P. Chen, E. Lee. Reducing Disk Storage of Full-3D Seismic Waveform Tomography (F3DT) through Lossy Online Compression. Computers & Geosciences, 93:45–54, 2016.
- [33] P. Lindstrom and M. Isenburg. Fast and efficient compression of floating-point data. IEEE TVCG, 12(5):1245–1250, 2006.
- [34] V. Pascucci, G. Scorzelli, P. T. Bremer, A. Mascarenhas. Robust on-line computation of Reeb graphs: simplicity and speed. ToG, 26(3):58, 2007.
- [35] V. Pascucci, X. Tricoche, H. Hagen, J. Tierny. Topological Data Analysis and Visualization: Theory, Algorithms and Applications. Springer, 2010.
- [36] P. Ratanaworabhan, J. Ke, M. Burtscher. Fast lossless compression of scientific floating-point data. IEEE Data Compression, 133–142, 2006.
- [37] J. Schneider and R. Westermann. Compression Domain Volume Rendering. IEEE VIS, 293–300, 2003.
- [38] J. Seward. Bzip2 data compressor. http://www.bzip.org, 2017.
- [39] S. W. Son, Z. Chen, W. Hendrix, A. Agrawal, W. k. Liao, A. Choudhary. Data compression for the exascale computing era - survey. Supercomputing Frontiers and Innovations, 1(2), 2014.
- [40] T. Sousbie. The persistent cosmic web and its filamentary structure: Theory and implementations. Royal Astronomical Society, 414(1):350–383, 2011.
- [41] G. Taubin and J. Rossignac. Geometric compression through topological surgery. ACM Transactions on Graphics, 17(2):84–115, 1998.
- [42] J. Tierny and H. Carr. Jacobi fiber surfaces for bivariate Reeb space computation. IEEE TVCG, 23(1):960–969, 2016.
- [43] J. Tierny, G. Favelier, J. A. Levine, C. Gueunet, M. Michaux. The Topology ToolKit. IEEE TVCG, 24(1):832–842, 2017.
- [44] J. Tierny, D. Guenther, V. Pascucci. Optimal general simplification of scalar fields on surfaces. Topological and Statistical Methods for Complex Data, 57–71. Springer, 2014.
- [45] J. Tierny, A. Gyulassy, E. Simon, V. Pascucci. Loop surgery for volumetric meshes: Reeb graphs reduced to contour trees. IEEE TVCG, 15(6):1177–1184, 2009.
- [46] J. Tierny and V. Pascucci. Generalized topological simplification of scalar fields on surfaces. IEEE TVCG, 18(12):2005–2013, 2012.
- [47] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. on IT, 23(3):337–343, 1977.
- [48] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Trans. on IT, 24(5):530–536, 1978.
