SAX Navigator: Time Series Exploration through Hierarchical Clustering

08/15/2019
by   Nicholas Ruta, et al.
Keio University
Harvard University

Comparing many long time series is challenging to do by hand. Clustering time series enables data analysts to discover relationships between, and anomalies among, multiple time series. However, even after reasonable clustering, analysts have to scrutinize correlations between clusters or similarities within a cluster. We developed SAX Navigator, an interactive visualization tool that allows users to hierarchically explore global patterns as well as individual observations across large collections of time series data. Our visualization provides a unique way to navigate time series that involves a "vocabulary of patterns" developed by using a dimensionality reduction technique, Symbolic Aggregate approXimation (SAX). With SAX, time series data can be clustered efficiently and queried quickly at scale. We demonstrate the ability of SAX Navigator to analyze patterns in large time series data based on three case studies for an astronomy data set. We verify the usability of our system through a think-aloud study with an astronomy domain scientist.

1 Related Work

We first review methods to visually query time series data to extract patterns and then describe visualization methods for clustered data.

Query Definition for Time Series Analysis. Query-by-example and query-by-sketch interfaces are powerful approaches to querying data intuitively. Query-by-example techniques [DBLP:journals/ivs/HochheiserS04, DBLP:journals/tkde/WangDS12] aim to find similar data points (e.g., time slices) to a user-specified example. However, they do not address how to find the initial interesting time slice from a large collection of time series data. Query-by-sketch techniques do not have this restriction, as users can directly draw the shape they are interested in. However, query-by-sketch techniques have to deal with the user-introduced uncertainties of sketches [Correll2016].

TimeSearcher is a visual exploration tool for time series data [DBLP:journals/ivs/HochheiserS04], which was extended with a query-by-example interface named SearchBox [Buono2008]. SOMFlow [Sacha2018] presents techniques for time series clustering based on query-by-example, grouping selections based on their relative neighborhoods and by filtering and splitting using metadata-based attribute values. QuerySketch [Wattenberg2001] is a tool for database queries in which users directly sketch the shape of a pattern and the system automatically extracts matching time slices. Correll and Gleicher [Correll2016] defined a vocabulary of invariants for queries by sketch to deal with the uncertainties of sketches.

While these query-by-sketch systems allow users to draw and query time variation in an arbitrary shape, SAX Navigator provides users with building blocks that can be pieced together to create query examples based on observed patterns in the data. It provides a comprehensive exploration of time series collections using query-by-example for specific observations of interest and query-by-sketch to collect results based on a general trend.

Visual Cluster Analysis.

We now survey a selection of tools that inspired the design of SAX Navigator. Seo and Shneiderman [Seo2002], for example, presented the Hierarchical Clustering Explorer (HCE), a dendrogram-based interactive visual exploration tool for hierarchical clustering. It allows users to filter clusters according to similarities and to compare clusters. NodeTrix [Henry2007] reduces the complexity of node-link diagrams of large networks by aggregating nodes into clusters and displaying dense clusters as matrices within the overall node-link diagram. CyteGuide [Hollt2018] enables users to explore the hierarchical representation of the data by viewing both the current status of exploration and the unexplored parts based on sunburst diagrams. Clustervision [Kwon2018] is a visual analytics (VA) tool that helps users find a proper clustering method from various techniques and parameters. Zeckzer et al. [Zeckzer2018] proposed tiled binned clustering and visualized the results in 3D scatterplots. The method conducts clustering after assigning data points to bins, as we do in SAX Navigator.

All these VA approaches efficiently show clusters and allow users to explore the cluster space. However, it is still often difficult to gain a comprehensive overview of the individual data samples contained in a cluster. Similar to NodeTrix, we aim to reduce visual complexity by showing general patterns within clusters rather than individual observations. We, therefore, incorporate a heat map-based cluster aggregation view into SAX Navigator.

2 A Vocabulary of Patterns (SAX)

The basis of our vocabulary of patterns is a time series dimensionality reduction technique called Symbolic Aggregate approXimation (SAX) [Lin2007]. SAX allows users not only to control the resolution of their analysis, but also to apply established and well-understood natural language processing techniques, such as text similarity and retrieval through regular expressions or topic analysis.

Figure 1: Transforming a timeline into its SAX representation for a chosen number of letters and word length. Both the value dimension and the time dimension are reduced.

2.1 SAX

To prepare our data, we first center and scale it (i.e., we subtract out the mean and divide by the standard deviation). However, depending on the data characteristics, different pre-processing techniques might be used.

Figure 1 depicts an example of converting a time series into the SAX representation. Conceptually, SAX quantizes a continuous time series into discrete bins (along both the time and the amplitude axis) and assigns a letter to each quantized bin. The first step in converting a time series into the SAX representation is to define the number of letters and the maximum word length (which determines the bin size) to apply to the data set. Both should be chosen as the smallest values that still allow for good clustering without smoothing away details. To determine the distribution of the letters, SAX pools the values of all time series together and fits a normal distribution. It then creates one partition of equal probability per letter and assigns the lowest partition to the letter “a”, the second lowest to “b”, and so on, to create the set of letters in our vocabulary. Some observations may be shorter than the maximum word length, since not all time series have to be of the same length. Binning, a form of smoothing that removes noise from the data, improves the ability of the clustering algorithm to find similar groups of time series. For each bin, we average the values and determine the letter whose range contains the average.

The result is that each observation is a word of at most the maximum word length. We cluster the resulting words with the goal of finding groups of time series with similar words within our vocabulary. Lastly, dimensionality reduction promotes scalability by reducing each time series from its original value and time resolution to the chosen number of letters and word length.
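The conversion steps described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the SAX Navigator implementation: the function names (`znormalize`, `fit_breakpoints`, `to_word`, `sax_words`) and the parameters `n_letters` and `bin_size` are hypothetical, and the choice to fit the normal distribution on the already normalized, pooled values is an assumption.

```python
from statistics import NormalDist, mean, pstdev

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def znormalize(series):
    """Center and scale one series (subtract the mean, divide by the standard deviation)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # constant series: nothing to scale
    return [(x - mu) / sigma for x in series]

def fit_breakpoints(all_series, n_letters):
    """Pool all values, fit a normal distribution, and cut it into
    n_letters partitions of equal probability (ASSUMPTION: fitted on normalized values)."""
    pooled = [x for series in all_series for x in series]
    dist = NormalDist.from_samples(pooled)
    return [dist.inv_cdf(k / n_letters) for k in range(1, n_letters)]

def to_word(series, cuts, bin_size):
    """Average each bin of bin_size points and map the average to a letter."""
    letters = []
    for start in range(0, len(series), bin_size):
        avg = mean(series[start:start + bin_size])
        index = sum(avg >= c for c in cuts)  # number of breakpoints at or below the average
        letters.append(ALPHABET[index])
    return "".join(letters)

def sax_words(all_series, n_letters, bin_size):
    """Convert a collection of (possibly unequal-length) series into SAX words."""
    normalized = [znormalize(s) for s in all_series]
    cuts = fit_breakpoints(normalized, n_letters)
    return [to_word(s, cuts, bin_size) for s in normalized]

words = sax_words([[0, 0, 5, 5, 10, 10, 5, 5, 0, 0], [0, 0, 10, 10]],
                  n_letters=3, bin_size=2)
print(words)  # ['abcba', 'ac']
```

The second input series is shorter than the first, so its word is shorter as well, which mirrors the remark that not every observation reaches the maximum word length.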

2.2 Clustering

We use agglomerative hierarchical clustering with complete linkage to cluster time series into similar groups, since it heuristically provides better cluster separation than single or average linkage. The distance metric is a variation of the MINDIST function described by Lin et al. [Lin2007], which achieves exact matching even though SAX words may contain empty values in our data set. It performs better than Euclidean distance in terms of recovering cluster assignments. The distance between two time series observations as SAX representations is defined as follows:
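As a point of reference only, the original MINDIST of Lin et al. [Lin2007], rather than the variation used in SAX Navigator, can be sketched as below. It scales the per-letter lookup distances by the ratio of series length to word length. The names `letter_distance`, `mindist`, and `series_length` are hypothetical, the breakpoints are taken from a standard normal distribution, and restricting the sum to positions present in both words is an assumption about how empty values might be handled.

```python
from math import sqrt
from statistics import NormalDist

def letter_distance(a, b, cuts):
    """Lookup-table distance between two letters: zero for identical or
    adjacent letters, otherwise the gap between the enclosing breakpoints."""
    i, j = sorted((ord(a) - ord("a"), ord(b) - ord("a")))
    if j - i <= 1:
        return 0.0
    return cuts[j - 1] - cuts[i]

def mindist(word_a, word_b, n_letters, series_length):
    """Original MINDIST between two SAX words.

    ASSUMPTION: only positions present in both words are compared, and the
    number of compared positions is used as the word length.
    """
    cuts = [NormalDist().inv_cdf(k / n_letters) for k in range(1, n_letters)]
    shared = min(len(word_a), len(word_b))
    if shared == 0:
        return 0.0
    total = sum(letter_distance(a, b, cuts) ** 2 for a, b in zip(word_a, word_b))
    return sqrt(series_length / shared) * sqrt(total)

# Adjacent letters contribute nothing; distant letters contribute the breakpoint gap.
print(mindist("aa", "ab", n_letters=4, series_length=8))
print(mindist("ad", "aa", n_letters=4, series_length=8))
```

A matrix of such pairwise distances could then be passed to any complete-linkage agglomerative clustering routine.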

3 Design of SAX Navigator

SAX Navigator supports the following analysis tasks:

T1 – Explore clusters and general data distribution. Users should be able to explore the cluster space to see general trends and relationships among clusters and get a high-level impression on the data distribution and variability within a cluster (see subsection 3.1).

T2 – Analyze individual time series within a cluster. The system needs to support details-on-demand for individual time series to analyze similarities and detect anomalies and errors (see subsection 3.2).

T3 – Interactive queries based on sketching. Users can sketch patterns of interest to find similar data points (see subsection 3.3).

3.1 Global Patterns

The tree diagram (i.e., dendrogram) of SAX Navigator (subsubsection 3.1.1) shows the global structure of the hierarchical clustering result and represents each cluster node as a heat map (subsubsection 3.1.2), which allows users to identify the general pattern of a cluster (T1).

3.1.1 Tree Diagram


(a) Two cluster nodes, their heat maps, and links of the tree diagram.
(b) Superimposing 59 timelines in the lower cluster of (a).

Figure 2: Each node in the tree diagram is represented by a circle showing the cluster size and a heat map. A superimposed graph is shown on the cluster detail view.

To show the result of the hierarchical clustering, we designed a horizontal node-link tree diagram, as shown in panel (a) of the SAX Navigator overview figure. By following links connecting the nodes, users can easily understand how clusters divide into smaller, more similar groups and identify global patterns. In the tree diagram, each node is represented by a circle with a number indicating the cluster size and a heat map. The cluster size is redundantly encoded in the width of the link to the node (see Figure 2 (a)).

To address perceptual scalability for a large collection of time series, we filter out small clusters from the tree diagram to reduce visual complexity and clutter. By default, SAX Navigator shows only clusters whose size is more than 2% of the total collection. When users want to see more details of a cluster, clicking a node expands the sub-tree of the node. Users are allowed to pan and zoom the tree diagram to explore it at different scales or contexts.
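The default size filter amounts to a simple threshold on cluster size. The sketch below assumes cluster sizes are available as a mapping from a cluster identifier to a member count; the function name `visible_clusters` and the parameter `min_fraction` are hypothetical.

```python
def visible_clusters(cluster_sizes, total, min_fraction=0.02):
    """Keep only clusters whose size is more than min_fraction of the collection."""
    threshold = min_fraction * total
    return {cid: size for cid, size in cluster_sizes.items() if size > threshold}

# With 2,000 observations, the default threshold is 40 members.
print(visible_clusters({"c1": 59, "c2": 40, "c3": 1000}, total=2000))
# {'c1': 59, 'c3': 1000}
```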

3.1.2 Heat map for Cluster Aggregation

We visually aggregate all time series in a cluster, which are translated into words by SAX, into a heat map display that shows the overall pattern and distribution of the timelines within the cluster without visual clutter (see Figure 2 (a)). In the heat map display, the x-axis shows the bins ordered by time and the y-axis shows the SAX-assigned letter. The color of each cell encodes the proportion of observations with that particular letter assignment at each time slice. The color is on a linear scale that goes from white (no observations) to navy (all observations). The lighter the color of a heat map is, as seen in the upper heat map of Figure 2 (a), the more uncertainty or divergence there is in the cluster. As shown in the lower heat map in Figure 2 (a), the heat map is a much clearer display of the general shape than superimposing all time series in a single line chart like Figure 2 (b).
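The aggregation behind such a heat map can be sketched as a letter-by-bin grid of proportions. This is an illustrative sketch: the function name `heatmap_proportions` is hypothetical, and dividing by the full cluster size (so that bins missing in shorter words simply lower the proportions) is an assumption.

```python
def heatmap_proportions(words, n_letters):
    """Return a grid with one row per letter ('a' first) and one column per
    time bin; each cell holds the share of the cluster's observations that
    carry that letter in that bin (0.0 = white, 1.0 = navy)."""
    if not words:
        return []
    width = max(len(word) for word in words)
    grid = [[0] * width for _ in range(n_letters)]
    for word in words:
        for t, letter in enumerate(word):
            grid[ord(letter) - ord("a")][t] += 1
    # ASSUMPTION: normalize by the cluster size, not by the per-bin count.
    return [[count / len(words) for count in row] for row in grid]

cluster = ["ab", "ab", "bb", "a"]
for row in heatmap_proportions(cluster, n_letters=2):
    print(row)
# [0.75, 0.0]
# [0.25, 0.75]
```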

3.2 Local Observations

Local observations are important to understand why we see certain patterns at a global scale. SAX Navigator supports detailed cluster exploration and local comparisons between two observations, between an observation and its cluster, and between two clusters (T2).

3.2.1 Cluster Detail View

To analyze individual time series, SAX Navigator can show details-on-demand for all data within a cluster. Hovering over a cluster node activates the cluster detail view shown in panel (d) of the SAX Navigator overview figure. The raw time series are shown superimposed on one another in the upper part of the view and juxtaposed as sparklines within a data table in the lower part. Each row of the table consists of data for a single observation. The line chart on top and the rows of the table are connected via brushing and linking. Furthermore, when a user clicks on a row of the table, SAX Navigator highlights all branches in the tree diagram that connect to the observation’s ID. Users can compare a selected observation in the context of the cluster (observation-to-cluster comparison) or directly to a second selected observation (observation-to-observation comparison).

3.2.2 Cluster Comparison

Figure 3: Cluster comparison view. Users can compare two clusters by selecting two of the heat maps within the tree diagram.

Users can select two clusters in the tree diagram to start a visual comparison (see Figure 3). The new heat map shows the differences of the values in the first selected cluster versus the second one. The pattern of Cluster A (left) is colored green, and the pattern of Cluster B (right) is colored magenta. Comparisons can be made in either raw counts or percentages. Furthermore, we can show the mean and standard deviation of the “vocabulary of patterns” of the time series in both clusters as a line chart with a confidence interval band, as shown in the upper right heat map of Figure 3. The comparison view is particularly helpful for comparing patterns between clusters that are difficult to compare across the tree diagram (cluster-to-cluster comparison).
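One way to compute such a difference heat map is to subtract the per-cell values of the second cluster from those of the first, in either raw counts or percentages. The sketch below is illustrative only: the names `letter_counts` and `difference_heatmap` are hypothetical, and the sign convention (positive where Cluster A dominates, negative where Cluster B dominates) is an assumption about how the green and magenta colors might be assigned.

```python
def letter_counts(words, n_letters, width):
    """Count, per letter (row) and time bin (column), how many words match."""
    grid = [[0] * width for _ in range(n_letters)]
    for word in words:
        for t, letter in enumerate(word):
            grid[ord(letter) - ord("a")][t] += 1
    return grid

def difference_heatmap(words_a, words_b, n_letters, as_percentage=True):
    """Cell-wise difference between Cluster A and Cluster B.

    ASSUMPTION: positive cells mean A dominates, negative cells mean B dominates.
    """
    width = max(len(word) for word in words_a + words_b)
    counts_a = letter_counts(words_a, n_letters, width)
    counts_b = letter_counts(words_b, n_letters, width)
    scale_a = len(words_a) if as_percentage else 1
    scale_b = len(words_b) if as_percentage else 1
    return [
        [a / scale_a - b / scale_b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(counts_a, counts_b)
    ]

cluster_a = ["ab", "ab"]
cluster_b = ["bb", "ab"]
print(difference_heatmap(cluster_a, cluster_b, n_letters=2))
# [[0.5, 0.0], [-0.5, 0.0]]
```

Normalizing by cluster size keeps the comparison meaningful when the two clusters differ greatly in size, whereas raw counts emphasize where the larger cluster has more members.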

3.3 Scalable Query Interface

SAX Navigator provides an interactive sketch-based query interface that allows users to search for observations of interest (T3).

The query tool consists of two options. The first is a drop-down menu where users can select a specific name or ID from the loaded data set. In this case, the path to the selected observation of interest will be highlighted. The second option supports searching via user-specified patterns. Inspired by query-by-sketch, we create a grid for users to “draw” their pattern of interest (i.e., the SAX letter sequence of interest). Panel (b) of the SAX Navigator overview figure shows a user’s selection of an upside-down “V” shape that corresponds to the pattern “abcba”. Using regular expressions, we can quickly search the data set and automatically highlight all branches in the tree diagram that contain the specified pattern (see panel (a) of the same figure).
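The pattern search can be illustrated with Python's standard `re` module. This is a minimal sketch rather than the tool's actual query code: the function names `grid_to_pattern` and `find_matches`, the observation IDs, and the convention that an undrawn grid column becomes a wildcard are all assumptions.

```python
import re

def grid_to_pattern(selected_rows):
    """Turn a drawn grid into a regular expression: one entry per column,
    holding the selected row index (0 = 'a') or None for an undrawn column.
    ASSUMPTION: undrawn columns match any single letter."""
    return "".join("." if row is None else chr(ord("a") + row) for row in selected_rows)

def find_matches(words_by_id, pattern):
    """Return the IDs of all observations whose SAX word contains the pattern."""
    regex = re.compile(pattern)
    return [obs_id for obs_id, word in words_by_id.items() if regex.search(word)]

# Hypothetical SAX words keyed by observation ID.
words_by_id = {"obs1": "aabcbaa", "obs2": "abcabc", "obs3": "abcba"}

pattern = grid_to_pattern([0, 1, 2, 1, 0])  # upside-down "V"
print(pattern)                              # abcba
print(find_matches(words_by_id, pattern))   # ['obs1', 'obs3']
```

Because the words are plain strings, the same approach extends to richer queries, such as character classes for "either of two letters" in a given bin.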

4 Implementation

SAX Navigator is a web application based on D3.js [d3_framework] and the Flask microframework [flask_framework]. Readers can access a fully interactive prototype at https://sax-navigator.herokuapp.com/.

5 Evaluation

Our evaluation is based on three case studies and feedback from a domain expert. We used 2,000 observations from the Catalina surveys data release 2 consisting of 46,000 brightness observations [catalina], and retrieved commonly used features. The Catalina survey is a well-known and trusted data set about different types of stars. Initial feedback from astronomers indicates that they can find search results of interest faster with SAX Navigator than with traditional methods such as table-based feature comparisons.

5.1 Case Studies


(a) Observation-to-observation comparison.

(b) Observation-to-cluster comparison.

(c) Cluster-to-cluster comparison.

Figure 4: Case studies of observation-to-observation, observation-to-cluster, and cluster-to-cluster comparisons. (a) Two interesting observations within a single cluster can be examined and compared in high detail. (b) The blue sparkline represents an observation that appears to be incorrectly associated with the cluster. (c) The astronomer can quickly observe the differences between two clusters from completely separate sections of the tree diagram.

For astronomical time series clustering, we implemented and used a kernelized cross-correlation distance metric [Wachman2009KernelsFP] as the primary form of morphological comparison. Using SAX Navigator, the analyst can discover new patterns and verify the classifications provided for the survey’s collection. Let us revisit the example of an astronomer with case studies for our three types of local comparisons.

5.1.1 Observation-to-Observation Comparison

Astronomers frequently compare well-known objects to new observations of interest to classify them. In SAX Navigator, astronomers can perform an observation-to-observation comparison by using the details-on-demand features for local observations in a single cluster. For example, to determine whether the gold sparkline seen in Figure 4 (a) is simply noisy or an actual misclassification, users can investigate the data by looking at side-by-side comparisons of the shape of the observations as well as at the metadata of the two selected time series.

5.1.2 Observation-to-Cluster Comparison

Astronomers have to deal with uncertainties due to instrumental errors related to telescope machinery. These errors can lead to misclassifying the types of celestial observations present in a large astronomy survey. For example, suppose an astronomer has identified a cluster in the tree diagram and wants to determine if any of its members have been erroneously assigned due to instrumental error. As seen in Figure 4 (b), the astronomer can hover over the cluster’s heat map on the left to view the cluster detail view seen on the right. By hovering over a row in the cluster detail view, the astronomer isolated a data error present in the cluster and highlighted it as a blue sparkline to make a comparison. The comparison allowed the astronomer to verify the error’s abrupt spikes at the beginning and just after the middle of the timeline when compared to the more gradual increases and decreases seen in the grey sparklines.

5.1.3 Cluster-to-Cluster Comparison

Oftentimes, astronomers explore subtle differences between periodic observations which lead to correct classifications. For example, the heat map comparisons depicted in Figure 4 (c) show the differences at specific points in time between two clusters of periodic observations from separate sections of the hierarchical tree structure. In this instance, the heat map can provide a starting point to understand why one cluster is made up primarily of RR Lyrae variables, while the other additionally contains Cepheids. While the heat maps of both clusters show a similar periodic shape, the gaps seen as white and grey space throughout the pattern in Cluster A’s heat map suggest that observations were missing at points throughout the timeline. Cluster B shows a fuller pattern which strengthens the astronomer’s confidence in the sampling. The larger difference heat map further highlights the points at which Cluster A lacks samples.

5.2 Domain User Feedback

To assess the application’s usability, we conducted a think-aloud study with an astronomy graduate researcher who has worked with astronomy observation data for 6 years. We gave the participant no suggestions on how to use the system upfront and observed his usage. We answered clarifying questions about the options available and how to pan/zoom on the main visualization. The participant first explored the options panel at the top half of the screen. The main visualization was most appealing to the participant, and he quickly focused his attention on navigating the tree. At first, he did not understand how the clusters were formed and suggested that more transparency was needed in the design to explain the distance metric utilized. Once he gained more experience using the tree navigation and had examined specific cluster members in the cluster detail view, his overall response was very positive. He stated, “Wow, this is a great way to quickly see what patterns are in the survey!” and immediately wanted to load his own data set. He noted that using the tree diagram and heat map comparison tool enabled him to separate prominent collections of periodic eclipsing binaries. He was able to find subtle differences across these collections at certain points in time, an important and difficult task, much faster than with traditional methods like a table-based visualization.

6 Conclusion & Future Work

We developed an interactive visualization that allows domain experts to explore their time series data in an efficient and meaningful manner. Utilizing the SAX algorithm, we extract a vocabulary of patterns specific to the imported data, which allows for efficient clustering and querying at scale. Our interactive interface gives users the ability to show the overall structure of the hierarchical clustering and individual cluster details for thousands of time series.

To generalize our approach to other data and domains, we want to add interactive sliders to change the values of the SAX letter-count and word-length parameters. This will allow users to fine-tune the amount of smoothing and clustering. Furthermore, we want to optimize our implementation with regard to scalability and evaluate how well our visualization scales up to millions of observations.

References