1 Introduction
Data visualization is extensively used to reveal patterns and structures in data. The display of highdimensional datasets concerning point clouds with a high number of attributes continues to be a relevant research field due to the wide range of applications. The choice of an informative visualization technique depends not only on the characteristics of the data but also on the tasks to be performed. For example, a visualization to analyze the evolution of a highdimensional time series requires a different approach than projecting a document corpus. While both aim to represent the data in a limited number of dimensions, the first emphasizes the progressive and continuous changes that occur in time and the second aims to find differences between groups of documents.
Dimensionality reduction techniques allow for embedding highdimensional data into a plot with two or three axes. These solutions provide visual scalability advantage over classical scatterplot matrices and parallel coordinates [33]. However, the lowdimensional transformations rely on assumptions and parameterizations which can compromise part of the original information. The most recent methods such tSNE [29] or UMAP [30] are effective in identifying similar elements and projecting them separated from other groups. These projections focus on preserving the closest neighbors rather than preserving all distances between the points which can cause an incomplete mapping of the dataset in the lower space [42].
On the other hand, Topological Data Analysis (TDA) aims to deduce and recognize geometric structures from underlying data by means of connecting elements in a graph. The combination of a scalar function with the original source allows exploration of data from different perspectives highlighting information which otherwise would be hidden. Unlike dimension reduction techniques the reconstruction of topology may not be faithful to the original data geometry, but they preserve the continuity between data shapes. Although TDA has demonstrated remarkable results in specialized studies [Lum et al. [28], Nielson et al. [34], Lakshmikanth et al. [25]], it relies on data summaries (e.g. clusters) instead of individual data points and therefore limits the resolution of exploration phase. In addition, the cornerstone of TDA is clustering and appropriate lens which precludes hypothesisfree data exploration.
In this paper we present STAD, a parameterfree dimensionality reduction method which transforms the highdimensional data into a graph highlighting the underlying structure. The projection into a graph provides a higher degree of flexibility to represent the interdependencies between points than coordinate mapping techniques (Fig. 1). Furthermore, STAD allows for incorporating additional functions which can intensify the specific signals adding new perspectives to the exploratory analysis. Additionally, STAD networks generate a representation of data without aggregation, i.e., STAD encodes the original data points as vertices in the graph which provides a high resolution of the data.
This paper is organized as follows. In section 2 we give an overview of related work in the detection of structure in data through dimension reduction and exploratory techniques using graph representations. Section 3 describes the proposed methodology, followed by section 4, in which we present two case studies applying STAD. The approach is discussed in section 5, and finally, section 6 covers conclusions and possible directions for future work.
2 Related Work
The exploration of highdimensional data has been presented in different areas of research in information visualization, data mining, and machine learning
[27]. We review the related work in these areas below, more specifically the topics of dimension reduction, visualization techniques and TDA.2.1 Dimension reduction and embeddings
Dimensionality reduction techniques transform a highdimensional space to a low dimensional one. Considering as input the data points of a pairwise distance matrix, the methods to visualize the global structure falls in the multidimensional scaling (MDS) class [31]. Torgerson Scaling [49], a particular case of PCA, finds a linear and orthogonal transformation of data revealing the most informative view without modifying the local and global relationship between elements. More versatile approaches are nonlinear metric MDS (NMDS) [23] and Sammon mapping [40]
, which overcome the linearity limitation introducing an iterative approach to match the distances between the original and projected space by minimizing the error between both matrices. The difference between Sammon mapping and nonlinear MDS is that Sammon normalizes the original distance to emphasize small differences. The iterative MDS approach has been the basis for other models with improved versions of the loss functions
[39].The Isomap algorithm [4] is also based on the iterative MDS model, but using geodesic distances instead of Euclidean distance. The algorithm defines a neighboring region based on a parameter
given by the user. Once the neighbors are defined, the lowdimensional embedding is generated in a similar fashion to iterative MDS. This method eliminates the need of estimating distances between widely separated elements.
The underlying idea of associating the pairwise distances between the original space and a projected space is also employed in the STAD method. The difference with MDS techniques is that the projection takes place into a graph and more precisely in the path described by nodes and edges. The change in spatialization from a fixed coordinate system to a freedimensional space provides a more flexible visual technique to represent the information as compared in Fig. 1. Instead of mapping the position in a lower space, the STAD graph aims to visualize the relationships between the data elements.
Beyond MDS techniques, more recent dimensionality reduction techniques such as tSNE [29], LargeVis [48] and UMAP [30]
improve the projection of lowdimensional space by intensifying the detection of nearest neighbors. These techniques assume an intrinsic probability distribution which smooth the projection in the lowdimensional space improving the detection of local patterns distorting the global one. However, these approaches tend to increase the division among neighborhoods, which are beneficial for identifying local clusters but contrary to the accurate identification of global trends or continuous patterns
[42].2.2 Exploration of data through network structures
A number of earlier research projects used the network metaphor to facilitate the understanding of multidimensional data (e.g., [Demiralp et al. [9], Alcaide and Aerts [2], Stuetzle [47], Jänicke et al. [21]]). All these techniques depend on a parameter which determines the elements connectivity. The exploratory system presented by Jänicke et al. [21] employs the the Minimum Spanning Tree (MST) to define the minimal structure of the data. Additional edges are added to establish a more consistent data structure using a forcedirected graph layout. STAD uses the same concept of adding edges on top of a MST to draw the data shape but the number of edges are automatically selected through a minimization process. The structures presented in STAD generate a more accurate representation of the original highdimensional space by providing not only the structural shape but also an intuition of the density through the interconnection of nodes in a region of the graph.
Topology studies the global structure of a data from a geometrical perspective providing an informative summary. Topological Data analysis is the general term used for a collection of particular methods to analyze highdimensional datasets [32]. Graph representations are commonly used to illustrate the underlying structure of data, but nodes are aggregations of points rather than individual elements. Under this umbrella, data skeletonization is an important shape descriptor from a disconnected point cloud [Zomorodian [58], Kurlin [24], Rieck and Leitte [37]]. The selection of a proper skeleton is defined by the representation which shows the most persistent features. The stability of topological features is visualized in a socalled barcode [10] and analyzed to identify suitable parameterization [Wasserman [55], Ghrist [14]]. Other TDA methods rely on scalar functions to guide the summarization of highdimensional data such as MorseSmale Complexes (MSC) [Gerber et al. [13], Shivashankar et al. [43], Günther et al. [17]], Reeb graphs [6] and the Mapper algorithm introduced by Singh et al. [44]. While these functions are defined to determine the continuous space of a manifold, the functions in STAD modify the projections of distances by controlling the connections between nodes. However, since the same functions can be applied in both structures, they can be similar. In addition the evaluation criteria in STAD relies on the association between the graph representation and the original space which differ from the geometrical persistent implications of TDA.
3 Methodology
Network visualization representations are projections of data expressed independently from the coordinate system; the visual structure connects elements according to their relationship and not their location. We extend the concept to represent the similarity (distance) between two nodes as the path described by the edges in the network. Once a similarity metric is chosen, a distance matrix containing the pairwise distances between all elements can be defined. The distance matrix can be considered a complete weighted graph , where the indices of the matrix represent the vertices of the graph and the edge weights the distance between any two elements.
STAD proposes a new method to generate the structure of data by transforming the distance matrix into a graph. This method converts the complete graph into a noncomplete unitdistance graph (i.e., all edges in the network have the same length of 1) where the distance between datapoints is reflected in the length of the shortest path between them. That is, the distance between two distant points is build from the neighborhoods of other nodes. A STAD network forms a single connected component, and it ensures the path for any combination of vertices exists. The information presented in the STAD networks is an approximation of the original complete weighted graph due to discretizing the distances in unitary segments. The number of edges in the unitdistance graph controls the shape of the data, and a final graph can vary from the minimum spanning tree to the complete graph. The STAD procedure selects the number of edges automatically by maximizing the correlation between the weighted distance matrix and the unitdistance graph.
In section 3.1, we describe the details of the STAD algorithm and illustrate the steps with a simulated example. In section 3.2, we present an extension of STAD to amplify signals in data through the addition of filters.
3.1 STAD base algorithm
The STAD algorithm can be split in eight sequential steps (Fig. 2): create the distance matrix, build an MST from the complete weighted graph, convert the MST to the unitdistancegraph, add edges to the unitdistance graph, evaluate the objective function, visualize the relationship between correlation and the number of edges, identify the optimal number of edges automatically and create a nodelink diagram of the final network.
3.1.1 Create distance matrix in original highdimensional space
Let be a space with elements and dimensions in , and a metric exists which determines all pairwise distances with , . The distance matrix is the squared matrix with size containing all the distances . This distance matrix is the only required element to generate a STAD network. From our perspective, the matrix is understood as an undirected, weighted and fully connected graph with vertices and edges where each edge has a weight of value . Fig. 2a illustrates the distance matrix creation from a point cloud and the representation of the fully connected graph. The similarity between each pair of elements is projected as edges in the graph. Notice that edges in are undirected due to the symmetric property of the matrix and the diagonal elements are omitted in the graph.
3.1.2 Build MST from complete weighted graph
Next, a minimum spanning tree (MST) is computed for . The MST is a subset of edges that connects all vertices without loops, with a minimal sum of edge weights. Fig. 2b shows the MST network for the complete graph . Note that the MST may not be unique and alternative combinations of edges can produce the same result.
3.1.3 Convert MST to unitdistance graph
The MST is the first unitdistance graph considered in STAD (Fig. 2c) and the addition of edges will improve the association between and . By transforming the complete graph into a unitdistance graph , we reduce the graph dimension of the original into a twodimensional graph formalized by Erdős [11].
3.1.4 Add edges into unitdistance graph
The addition of new edges to the unitdistance graph produces a new graph , where is the number of edges included in addition to the MST.
First, edges are sorted by weight to define the order in which they are added. For instance, means that the weight of is smaller than that of , or that the datapoints referred to in are closer together in the original space than those referred to in . Then the edges are added into the network following a cumulative process so that if then where is the unitdistance graph with the sequence of edges , , , …, from . Although edges are sorted and added into the unitdistance graph by their weight, all edges in itself are unweighted. Fig. 2d shows three examples of unitary graphs to give an intuition into how the network evolves by adding new edges in addition to the MST. The number of possible unitary graphs depends on the number of data elements in the space , so that .
3.1.5 Evaluate the objective function: calculate correlation between distances in original space and those in graph space
From the unitdistance graph , the computation of the shortest path for every pair of vertices produces a squared matrix with size comparable to the distance matrix for the original space . The Pearson correlation is used to measure the agreement between and . Contrary to the statistics absolute error, correlation is invariant under different scalings and takes a known range within 1 and 1 [26]. The correlation between the two matrices during the STAD process is limited to the range from 0 to 1 because the distances projected in follow a similar mapping, i.e. long distances in are projected as long distances in . This finite range provides an intuition of the algorithm performance and a comparable benchmark between iterations at all levels of data. Fig. 2e exemplifies the changes in the distance matrix for different values of together with association value with the original distance matrix .
3.1.6 Optional: Visualize the relationship between correlation and the number of edges
The evaluation of the correlation for consecutive values of describes a quasiconvex function. The influence of an edge addition at the beginning of the sequence (i.e., at low values of ) has a large effect on the correlation with distance matrix . The evaluation of correlation at this stage may fluctuate when contains few edges but describes a convex curve when the amount of edges is big enough. The number of edges needed to reach maximum correlation depends on the size and nature of the data. Intuitively, the association between and is related to the concept of persistence of clustering solutions [46] , i.e., if the shape is persistent along the edges, the computed correlation will be consistently similar. Fig. 2f shows the correlation curve for multiple evaluations. The correlation value initially increases by adding edges on top of the MST, reaching its maximum quickly. After the maximum, there is a constant decrease in the correlation when adding more edges demonstrating that these additional edges degrade the projection of data.
3.1.7 Identify the optimal number of edges automatically
STAD uses simulated annealing (SA) to estimate the optimal which maximizes the correlation and provides a representative data projection of . SA approximates the combination of links which maximize the correlation between and
. This heuristic is a stochastic process and estimates the global optimum by exploring the discrete space of edges (Fig.
2g). The SA candidate generator reduces failures on nonconvexity regions produced by the correlation, mainly when the graph is composed by few edges. Note that as the resulting network is ultimately explored visually and structural features are kept along a range of edges, the difference between a global maximum and an approximation thereof does not imply noticeable deviations in STAD. The start and end of these ranges are only identifiable after the evaluation of all links. Fortunately, these do not all have to be computed, as the most important characteristics are also the most easily detectable.The time complexity of calculating is as described in the Breadthfirst search algorithm definition in [45]; where is the number of vertices and is the number of edges in the graph. Although the number of vertices is fixed, the number of edges evaluated at iteration i influences the running time and varies between when the MST is evaluated and when the graph is complete.
3.1.8 Create nodelink diagram of final network
Networks require a graphical convention to be visualized. Generally, they are drawn as nodelinks representations projected in the twodimensional plane. Although the STAD methodology generates an unweighted graph (unitdistance graph), we include the distances from the original metric in the final nodelink diagram as this improves the visual representation. Fig. 2h shows the resulting network including the distances as weights in the edges. Note that the STAD graph is independent of the graph drawing algorithm used, the focus is on the identification of signals described by the connections of elements rather than the coordinates of nodes. However, graph drawings which minimize the number of crossings and place together small edge weight are appropriate to detect data structures, e.g., ForceAtlas2 [20] and KamadaKawai [22].
3.2 Filters in STAD
The transformation of datapoints and weights from a complete graph into a unitdistance graph representation can hide part of the information, and we can expect that not all patterns in data will be revealed in a STAD graph. In particular, prior knowledge might reveal that for certain applications, specific datapoints should be pulled apart even if they are located close together when considering the complete highdimensional space. We propose an extension of the STAD base algorithm to highlight other signals in the data by the inclusion of functions that acts as filters on data projections.
Filters are an ordered set of values associated with an indexed sequence of natural numbers. They provide limits and thus a context to an arbitrary metric space. Filters can be defined from derived dimensions through statistical functions, subsets of dimensions or external data. Real sequences may be discretized by defining equidistant intervals based on scale or density. The addition of filters focuses on the exploration of data allowing, for example, to integrate domain knowledge. Formally, filters are defined as a space with elements and dimensions in where a mapping exists between and , i.e., there is a function . Filter functions theory is also present in topological methods [7] and the same filters used in TDA can be applied in STAD, providing in some cases similar shapes. However, although both share similarities conceptually, the fundamentals are different. While STAD aims to accentuate particular traits in the projection of a noncontinuous set of points through filter functions, in TDA the filters are the basis of the projection itself and are used as an instrument to generate the manifold.
The inclusion of a filter replaces the first two steps of the STAD base algorithm (3.1.1 and 3.1.2) by a new approach composed by three steps: define the filter (Fig. 3a), create a reduced distance matrix based on this filter (Fig. 3b), and build a MST from the semicomplete weighted graph (Fig. 3c).
3.2.1 Define the filter function
Filter definition depends on data characteristics and the purpose of the analysis. As with other topological techniques, density estimates, centrality functions, orthogonal coordinates, a subset of dimensions and intermediate algorithmic results [Goldfarb [15], Carlsson et al. [8]] are also supported in STAD. The inclusion of filters aims to enrich data exploration through explicit prior knowledge [54] rather than hypothesisfree research. In practice, subsets of dimensions or external data tend to be most interpretable.
Filters in STAD are understood as a linear space where data follows a sequential order. However, the nature of the data included in a filter can be diverse, and both the filter definition and interpretation must be adapted to it. For example, cyclical patterns in temporal data are common such as the day of the week or the month in a year. The last element of this cyclical pattern is as far from the previous as it is from the following although the sequential labeling does not reflect this, i.e., Sunday (day 7) is close to Monday (day 1) as is December (month 12) to January (month 1). In these cases, filters must be represented in a polar space where the repetitive pattern is preserved [57].
Filters are mostly defined as onedimensional or twodimensional. Higher dimensionality although possible tends to overrestrict the space. When filters are real, a discretization process is required to transform them into a natural sequence of indices. In this paper, we present the real filter transformation by specifying as the number of intervals to divide each dimension in. Fig. 3a illustrates the transformation of a real filter into a natural sequence. The effect of variations in the value of during the transformation influences the final network. The implications are discussed in section 3.3. The value can take independent values for each dimension when the filter dimension is greater than one, but the intervals must allow forming a single connected component network as STAD output. Empty intervals in a onedimensional filter are omitted and the adjacency of the intervals is considered to the closest nonempty range. In filters of higher dimensionality, empty intervals are evaluated together with their neighbors defining a consistent grid. The algorithm to generate a consistent grid in STAD is provided as supplemental material.
3.2.2 Create reduced distance matrix based on filter
The inclusion of a filter establishes limits in the metric. We can use these boundaries to introduce the effect of the filter in STAD by reducing the distance matrix and in consequence the complete graph . We define three types of possible connections between datapoints (Fig. 3a):

Intraedges are all connections where and belong to the same index.

Interadjacents are all connections where and belong to adjacent indices.

Internonadjacents are all connections where or belong to different, nonadjacent indices.
Based on these definitions, the distance matrix is reduced to by removing all nonadjacent connections (Fig. 3b). The STAD process uses the distance matrix as input in the estimation of the filter. The derived graph from the distances becomes a semicomplete weighted graph where only links within and between adjacent intervals are considered. The reduction of connections draws networks based on the structure of the filter highlighting properties of data such as the temporality of timeseries or abnormality level of a centrality measure. Additionally, the performance of STAD with filters improves due to the smaller size of the distance matrix to be evaluated.
3.2.3 Build MST from semicomplete weighted graph
From the MST can be computed as described in 3.1.2. However, one might want to ensure certain datapoints to be close together based on specific domain knowledge, even if they are further apart in highdimensional space (or vice versa). In case the specific domain knowledge is expressed in a particular dimension, this would mean that the datapoints are far apart when considering all dimensions, but close together in the dimension under consideration.
Although the classical MST provides valid results in STAD with filters, we propose a version of the MST which better preserves the filter structure by prioritizing intraedges in the process. Artificial connections (i.e. connections made as an artifact of splitting the data along the filters) are detected through community detection and reevaluated globally. This process is split into these three steps:

The is created first inside of each index (intraedge connections). Interadjacent connections are added after the computation to define a single connected component (Fig. 3c left).

The intraedge connections from are evaluated through community detection using the original distances as weights. We implemented the random walk methodology Walktrap [36] due to its adaptability to short sequences. This step aims to detect distant points in highdimensional space that were connected inside of each index. A sensitive configuration of community detection is desired to detect the different signals of data, as false negative divisions are automatically corrected in the following steps. All intraedges falling in the same community and index are preserved and fixed. Remaining edges, i.e. discrepant intraedge and interadjacent edges are omitted and reevaluated in the following step (Fig. 3c center).

Edges from the previous steps are preserved and act as a base, and additional edges are added until a single connected component is created (Fig. 3c right).
3.3 STAD network interpretation
STAD networks generate shapes which provide both global and local intuition of a data structure. Locals signals can refer to clusters, i.e., a homogeneous group of data points according to their similarity, but also broader meanings, for instance, a set of points with gradual dissimilarity (which presents itself as a flare). The graph density provides a notion of data distribution; homogeneous elements appear highly interconnected in the graph and dissimilar elements appear in nonadjacent sections of the graph. The visual edge length of STAD graphs indicates the similarity between the two vertices (see 3.1.8). Specific graph layout algorithms such as the forcedirected layout will search for an equilibrium between these edge lengths and their optimization function, e.g., to minimize overlap of nodes and/or edges [19].
The inclusion of filters intensifies specific information contained in their dimensions. Fig. 4 exposes the effect of filters in a comparable setting where the same variable has been split in a different number of natural indices. According to the distance matrix detailed in 3.2.2, when the number of filter indices is two or smaller, no nonadjacent connections exist and therefore we obtain a STAD network identical to the filterfree approach (Fig. 4b). A higher number of indices produces a finegrained representation of the filter definition but penalize the structural representation of points contained in the underlying dataset . If the number of indices in the filter is equivalent to the number of elements in the dataset, the generated network is forced to connect the adjacent indices. In this case, the STAD network exposes the structure of the filter instead of the structure of data. Additional features of the graph (e.g., node color and/or size as used below) can aid in the interpretation of the data.
4 Case Studies
We applied STAD to two realworld datasets. We present results, derived insights extracted from the shapes and discuss choices on the filter selection. The visual analytics approach helps to discover nonevident patterns in data through the connection between the points describing data shapes as flares and loops.
4.1 Barcelona traffic
We collected a dataset from the public repository Open Data BCN [51]
which contains traffic activity in the city of Barcelona. The analysis was performed with all records from October 2017 until November 2018. The dataset describes measurements of traffic density collected every five minutes in 534 locations of the city which is stored as an ordinal variable from one to six, one corresponding to freely moving traffic and six to standstill. We explored the daily changes in the city by averaging the individual sections into a onedimensional time series for each day. Similarity between days was computed using Euclidean distance to identify differences at identical timestamps. In this section, we will discuss two analyses: one without filter and one using a twodimensional filter composed of the week number and the daily mean of the densest point in the city.
4.1.1 Filterfree STAD analysis of Barcelona traffic
Filterfree STAD analysis results are shown in Fig. 5 where we identify three different patterns. In this example, nodes are colored by day of the week (blue shades correspond to workdays from Monday to Friday, orange refers to Saturday, and red to Sunday) and the size of the node indicates the mean of traffic density in that day. The most significant signals correspond to the difference in activity between weekdays (Fig. 5a) and weekends (Fig. 5c). The groups of workdays on the left are highly connected indicating the high similarity between these days. In the center of the graph (Fig. 5b), we find a subset of days between the largest group of workdays and weekends. This subset corresponds to low activity days in the city; more specifically, they are workdays in the first week of January, Easter week and the month of August. These periods of the year traditionally are associated with holiday periods and are distinguishable from the rest of patterns in data. On the right of the graph, we recognize the weekend and official holidays. Inside this subnetwork, we recognize two more groups which mostly correspond to the two days of the weekend. Saturdays are days with higher activity than Sundays as reflected on the node size of the figure. Official holidays behave like a typical Sunday; we highlight Christmas day as the day of the year with the smallest traffic activity, located at the extreme topright. In contrast, there are some Sundays with higher traffic activity which have been related to some featured event. For example, we name the Political Prisoners Demonstration on April 15th, the 40th Zurich Barcelona Marathon on March 11th and the Final World Cup 2018 on July 15. These days are associated with a higher movement of people and the closing of some section of the city.
Although the connectivity of the network does not provide additional structural insights, the color of nodes helps to recognize weaker signals in the graph as well. More concretely, we can see that Fridays are particularly clustered. Digging deeper into the data we can identify a peak of activity between 14:00 and 16:00 on Fridays (see image in supplemental material). The increased traffic is associated with departures leaving the city.
The global structure of the network presents coherent connectivity between the groups according to their traffic density: the group with highest traffic congestion i.e. typical workdays (Fig. 5a) connects to the group with workdays on holiday period (Fig. 5b) and this is linked to the weekends days (Fig. 5c).
4.1.2 STAD analysis of Barcelona traffic using twodimensional filter
We continue the analysis of Barcelona traffic by incorporating filter functions to identify additional signals. We applied STAD using a twodimensional filter composed of the week number and the mean of the densest point in the city for each day. The resulting network in Fig. 6 maintains the three groups from the approach without filters (Fig. 5) but additional features are revealed. Two additional loops are present, one in the group of workdays and the other on weekends. Further investigation indicated that these structures correspond to renovation works [50] starting in May 2018 which resulted in the closing of a transversal avenue in the west of the city (Fig. 6ei).
The visual gaps in the graph, for instance, between groups a and b, are created by public or bank holidays, which end up in the cluster of the weekend days (groups cd and g) and generate this weaker connectivity between the indices of the temporal filter week number. Likewise, the gap between groups e and f is due to August present at the center of the graph. The circular pattern of traffic between years is reflected in the connectivity between groups f and a. In the weekends we can identify the same separation due to the renovation works (cd vs hi).
4.2 Airquality in Castile and León
We applied STAD on the airquality dataset collected from the Castile and León initiative in Spain [35] to illustrate STAD as a visualization tool for the identification of patterns on highdimensional time series. The dataset contains daily measurements of seven chemicals such as carbon monoxide (CO), nitrogen oxide (NO), nitrogen dioxide (NO2), ozone (O3), sulfur dioxide (SO2) and particulate matter 10 and 2.5 (PM10 and PM2.5). The measurements have been collected at different locations from January 1997 to June 2018. The data was aggregated by week due to the presence of missing values. The explored multidimensional time series contains 1139 records with seven dimensions, and we computed the Euclidean distance to evaluate the similarity between elements. The resulting STAD network graph is presented at Fig. 7. The structural shape generates an intrinsic separation of time identifying changes in the airquality over the years. We recognize three dense groups of points which mainly correspond to different periods: 19972002 (Fig. 7a), 20032008 (Fig. 7b) and 20092018 (Fig. 7c). These visual splits identify relevant changes in the airquality. The vertical position of nodes provides an intuition of seasonality, i.e. nodes on top of the network correspond to autumnwinter dates and nodes on the bottom to springsummer. The coloring of nodes by seasonality is provided as supplemental material.
To investigate seasonality signals further, we extend our exploration by incorporating the week number as the filter in STAD (Fig. 8). The network conserves the signals identified in Fig. 7 although they also reveal additional ones. For instance, between 1997 and 2002 (Fig. 8a) two groups are evident, corresponding to different seasons (autumnwinter and springsummer). In contrast, between 2009 and 2018 (Fig. 8c) the nodes are highly connected, forming a cycle. Further analysis on the network shapes indicates the following:

The two different groups identified at (Fig. 8a) are mainly caused by the chemicals nitrogen oxide and dioxide, carbon monoxide, and sulfur dioxide. These measurements are higher during the autumnwinter and are related to the burning of fossil fuels [56]. In recent years, electric systems started to substitute the previous technologies [1]. This fact is visible in the period 2009 to 2018 (Fig. 8c) where the difference between seasons is less evident.

Different coloring of nodes may help reveal additional patterns in data (see images in supplemental material). For example, ozone fluctuates according to the season period. During the springsummer, ozone levels are higher due to variations in sunlight and UV radiation [41]. In addition, concentration of carbon monoxide, nitrogen, ozone, sulfur dioxide, and particulate matter decreases gradually over the years, reaching stability in 2010. This finding is associated with an improvement of airquality [53].
5 Discussion
In this section, we discuss some limitations of the STAD methodology, their possible solutions and open challenges that remain to be addressed.
5.1 Scalability
While STAD analysis of some datasets results in sparse networks with easily interpretable structures, other datasets end up represented in more complicated networks. As the algorithm works at the level of the individual datapoints, the analysis of large datasets comes at a significant computational cost. In addition, drawing of a large resultant network also becomes cumbersome. For example, the construction of the largest network (airquality dataset) in this paper takes 1.2 minutes. All code was implemented in R with a singlethread and run on an Apple laptop MacBook Pro (dualcore, Mid 2014).
5.1.1 Computational scalability
In STAD, the recursive computation of distances in the unitdistance graph comprises the main bottleneck of network estimation. A possible approach to alleviate this issue is to work with a smaller sample of the initial dataset. Preliminary tests have indicated that such smaller dataset retains the same visual structure as the fullsize dataset, while not suffering from the high computation cost. The addition of edges upon the MST is a cumulative process (3.1.4), i.e., if an edge with weight  with larger weight meaning larger dissimilarity  is added into the network all edges with smaller weight are part of the network . When the algorithm determines the optimal network, it finds an edgeweight threshold which establishes the resulting number of connections. This optimum can be calculated on a subsample of the fullsize dataset. Multiple iterations on a downsampled dataset can be performed in a parallelized setting providing a more robust threshold estimation.
5.1.2 Visual scalability
When large networks are considered, the number of links might become an issue for visualization, resulting in the dreaded ”hairball”. Current approaches on graph layouts [20] could manage up to a million nodes if the network is sparse [GómezRomero et al. [16], Hua et al. [19]]. Consequently, additional transformations such as aggregation to reduce the number of nodes and/or edges are still needed. A possible solution can be found in community detection pipelines [12], such as MCLEAN [2], that simplify this visual representation.
5.2 Addition of edges
As mentioned in 5.1.1
, the addition of edges from the MST follows a sequential and cumulative procedure based on their distances. The incremental approach may cause edge redundancy and contribute to increasing clutter in the plot. A nonadditive and refined procedure such as an evolutionary algorithms
[3] might reduce the number of links conserving the association with the original distance matrix . Nevertheless, the use of evolutionary algorithm would also penalize the performance on the network estimation due to the assessment of crossovers in every iteration. Moreover, elimination of unnecessary edges does not change the structure of the resultant network, and such approach does not guarantee a benefit in making new features recognizable.5.3 Stability and reproducibility
The optimal number of links is dictated by the association between and . The SA algorithm estimates an optimum through a stochastic procedure which can generate different results in each iteration. However, the correlation curve describes a soft convexity with a quasiconstant function around the maximum.
Filters can bring additional signals in the underlying structure of data to the foreground. In extreme cases with very small bins, these might however fragmentize the original structure, although these changes have demonstrated to be reasonably robust (Fig. 4).
5.4 Comparison with related techniques
Alternative methodologies can reveal equivalent signals in data to STAD, especially in the filterfree approach. In this section, we present the different results between STAD and related techniques in the Barcelona traffic case. Nonlinear multidimensional scaling (NMDS), tSNE, the Mapper algorithm, and hierarchical clustering were selected due to the possibility of establishing distances matrices as inputs. Fig.
9 presents the resulting projections for each method:
NMDS (Fig. 9a) captures the difference between weekdays and weekends, although low activity days are not clearly separated from the rest of workdays.

tSNE identifies the three most relevant patterns presented with STAD (Fig. 9b). Fridays are also distinguishable in the embedding. However, densities are poorly preserved. For instance, workdays (especially from Monday to Thursday) have low variability in the dataset but the projection generates an excessive separation between them.

The Mapper algorithm (Fig. 9c) requires the definition of lenses. We selected the projection of NMDS to have a comparable view with the evaluated methods. The network presents a summarization of NMDS, but no additional signal can be detected.
6 Conclusion and future work
STAD propose a parameterfree methodology to visualize the structure of highdimensional datasets, allowing for the identification of signals by means of geometrical shapes as flares and loops. Edges in the graph correspond to the similarity between datapoints; therefore, similarities between individual datapoints are used to encode the higherlevel patterns in the resultant graph and the resulting visualization is a compressed projection of the distance matrix into a freescale space. In addition, integrating filters adds an additional perspective to the exploratory analysis.
On R implementation is available at https://github.com/vdalab/stad; Python and Clojure implementations are under development. Results presented in this paper have been generated with this R implementation. The final graphs included in section 4 were enhanced through Gephi [5].
Future work includes improving the efficiency of computational methods by retaining the information from previous iterations and approximating the shortest path [18]. In addition, we also aim to devise novel visual approaches to compare and interpret networks structures in an integrated environment.
Acknowledgments
The authors wish to thank Danai Kafetzaki for proofreading. This work was supported in part by the IWT/SBO 150056 project ”ACquiring CrUcial Medical information Using LAnguage TEchnology” (ACCUMULATE).
References
 [1] Achieving lowcarbon heating and cooling through electrification. Note: hhttps://setis.ec.europa.eu/setisreports/setismagazine/lowcarbonheatingcooling/achievinglowcarbonheatingandcoolingAccessed: 20190220 Cited by: 1st item.
 [2] (2018) MCLEAN: Multilevel Clustering Exploration As Network. PeerJ Computer Science 4, pp. e145. Cited by: §2.2, §5.1.2.
 [3] (2018) Evolutionary computation 1: Basic algorithms and operators. CRC press. Cited by: §5.2.
 [4] (2002) The isomap algorithm and topological stability. Science 295 (5552), pp. 7–7. Cited by: §2.1.
 [5] (2009) Gephi: an open source software for exploring and manipulating networks. In Third international AAAI conference on weblogs and social media, Cited by: §6.
 [6] (2008) Reeb graphs for shape analysis and applications. Theoretical computer science 392 (13), pp. 5–22. Cited by: §2.2.
 [7] (2013) General Topology: Chapters 1–4. Vol. 18, Springer Science & Business Media. Cited by: §3.2.
 [8] (2012) Topological data analysis and machine learning theory. In BIRS Workshop, Alberta, Cited by: §3.2.1.
 [9] (2013) invis: Exploring highdimensional RNA sequences from in vitro selection. In 2013 IEEE Symposium on Biological Data Visualization (BioVis), pp. 1–8. Cited by: §2.2.
 [10] (2008) Persistent homologya survey. Contemporary mathematics 453, pp. 257–282. Cited by: §2.2.
 [11] (1965) On the dimension of a graph. Mathematika 12 (2), pp. 118–122. Cited by: §3.1.3.
 [12] (2016) Community detection in networks: A user guide. Physics reports 659, pp. 1–44. Cited by: §5.1.2.
 [13] (2010) Visual exploration of high dimensional scalar functions. IEEE Transactions on Visualization and Computer Graphics 16 (6), pp. 1271–1280. Cited by: §2.2.
 [14] (2008) Barcodes: the persistent topology of data. Bulletin of the American Mathematical Society 45 (1), pp. 61–75. Cited by: §2.2.

[15]
(2018)
Understanding Deep Neural Networks Using Topological Data Analysis
. CoRR abs/1811.00852. External Links: Link, 1811.00852 Cited by: §3.2.1. 
[16]
(2018)
Visualizing large knowledge graphs: A performance analysis
. Future Generation Computer Systems 89, pp. 224–238. Cited by: §5.1.2.  [17] (2012) Efficient computation of 3D Morse–Smale complexes and persistent homology using discrete Morse theory. The Visual Computer 28 (10), pp. 959–969. Cited by: §2.2.
 [18] (2014) Approximation of distances and shortest paths in the broadcast congest clique. arXiv preprint arXiv:1412.3445. Cited by: §6.
 [19] (2018) Graph Layout Performance Comparisons of ForceDirected Algorithms. International Journal of Performability Engineering. Cited by: §3.3, §5.1.2.
 [20] (2014) ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PloS one 9 (6), pp. e98679. Cited by: §3.1.8, §5.1.2.
 [21] (2008) Brushing of attribute clouds for the visualization of multivariate data. IEEE Transactions on Visualization and Computer Graphics 14 (6), pp. 1459–1466. Cited by: §2.2.
 [22] (1989) An algorithm for drawing general undirected graphs. Information processing letters 31 (1), pp. 7–15. Cited by: §3.1.8.
 [23] (1964) Nonmetric multidimensional scaling: a numerical method. Psychometrika 29 (2), pp. 115–129. Cited by: §2.1.
 [24] (2015) A onedimensional homologically persistent skeleton of an unstructured point cloud in any metric space. In Computer Graphics Forum, Vol. 34, pp. 253–262. Cited by: §2.2.
 [25] (2017) Mass cytometry and topological data analysis reveal immune parameters associated with complications after allogeneic stem cell transplantation. Cell reports 20 (9), pp. 2238–2250. Cited by: §1.
 [26] (1988) Thirteen ways to look at the correlation coefficient. The American Statistician 42 (1), pp. 59–66. Cited by: §3.1.5.
 [27] (2017) Visualizing highdimensional data: Advances in the past decade. IEEE transactions on visualization and computer graphics 23 (3), pp. 1249–1268. Cited by: §2.
 [28] (2013) Extracting insights from the shape of complex data using topology. Scientific reports 3, pp. 1236. Cited by: §1.
 [29] (2008) Visualizing data using tSNE. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §1, §2.1.
 [30] (2018) UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: §1, §2.1.
 [31] (2018) Multidimensional Scaling. In Statistical Methods in Social Science Research, pp. 113–122. Cited by: §2.1.
 [32] (2017) A User’s Guide to Topological Data Analysis.. Journal of Learning Analytics 4 (2), pp. 47–61. Cited by: §2.2.
 [33] (2014) Visualization analysis and design. AK Peters/CRC Press. Cited by: §1.
 [34] (2015) Topological data analysis for discovery in preclinical spinal cord injury and traumatic brain injury. Nature communications 6, pp. 8581. Cited by: §1.
 [35] Open data Castile and León. Note: https://datosabiertos.jcyl.es/web/es/datosabiertoscastillaleon.htmlAccessed: 20190220 Cited by: §4.2.
 [36] (2005) Computing communities in large networks using random walks. In International symposium on computer and information sciences, pp. 284–293. Cited by: item 2.
 [37] (2014) Structural analysis of multivariate point clouds using simplicial chains. In Computer Graphics Forum, Vol. 33, pp. 28–37. Cited by: §2.2.
 [38] (2004) Comparative PM10–PM2. 5 source contribution study at rural, urban and industrial sites during PM episodes in Eastern Spain. Science of the Total Environment 328 (13), pp. 95–113. Cited by: 2nd item.
 [39] (2018) A survey on multidimensional scaling. ACM Computing Surveys (CSUR) 51 (3), pp. 47. Cited by: §2.1.
 [40] (1969) A nonlinear mapping for data structure analysis. IEEE Transactions on computers 100 (5), pp. 401–409. Cited by: §2.1.
 [41] (1997) On the spatial distribution and seasonal variation of lowertroposphere ozone over Europe. Journal of Atmospheric Chemistry 28 (13), pp. 11–28. Cited by: 3rd item.

[42]
(2017)
Intrinsic tstochastic neighbor embedding for visualization and outlier detection
. In International Conference on Similarity Search and Applications, pp. 188–203. Cited by: §1, §2.1.  [43] (2012) Parallel computation of 2D MorseSmale complexes. IEEE Transactions on Visualization and Computer Graphics 18 (10), pp. 1757–1770. Cited by: §2.2.
 [44] (2007) Topological methods for the analysis of high dimensional data sets and 3d object recognition.. In SPBG, pp. 91–100. Cited by: §2.2.
 [45] (2012) Weighted Graph Algorithms. In The Algorithm Design Manual, pp. 191–229. Cited by: §3.1.7.
 [46] (2018) On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset. CoRR abs/1811.00102. External Links: Link, 1811.00102 Cited by: §3.1.6.
 [47] (2003) Estimating the cluster tree of a density by analyzing the minimal spanning tree of a sample. Journal of classification 20 (1), pp. 025–047. Cited by: §2.2.
 [48] (2016) Visualizing largescale and highdimensional data. In Proceedings of the 25th international conference on world wide web, pp. 287–297. Cited by: §2.1.
 [49] (1958) Theory and methods of scaling.. Cited by: §2.1.
 [50] Traffic calming in Av. Príncep d’Astúries.. Note: https://ajuntament.barcelona.cat/guardiaurbana/en/noticia/trafficcalminginavprincepdasturies_562824Accessed: 20190220 Cited by: §4.1.2.
 [51] Traffic state information by sections of the city of Barcelona  Open Data Barcelona.. Note: https://opendataajuntament.barcelona.cat/data/en/dataset/tramsAccessed: 20190220 Cited by: §4.1.
 [52] (2010) Diesel passenger car PM emissions: From Euro 1 to Euro 4 with particle filter. Atmospheric Environment 44 (7), pp. 909–916. Cited by: 2nd item.
 [53] (2015) An integrated assessment of two decades of air pollution policy making in Spain: Impacts, costs and improvements. Science of the Total Environment 527, pp. 351–361. Cited by: 3rd item.
 [54] (2009) Defining and applying knowledge conversion processes to a visual analytics system. Computers & Graphics 33 (5), pp. 616–623. Cited by: §3.2.1.
 [55] (2018) Topological data analysis. Annual Review of Statistics and Its Application 5, pp. 501–532. Cited by: §2.2.
 [56] (1976) Production of atmospheric nitrous oxide by combustion. Geophysical Research Letters 3 (12), pp. 751–753. Cited by: 1st item.
 [57] (2005) Spherical coordinates. Cited by: §3.2.1.
 [58] (2010) Fast construction of the VietorisRips complex. Computers & Graphics 34 (3), pp. 263–271. Cited by: §2.2.
Comments
There are no comments yet.