1 Related work
Our work is closely related to research on visual summarization, query, and comparison of event sequences. In addition, our design is inspired by unit visualization. We review the state of the art in each of these areas in turn.
1.1 Visual summarization of event sequence datasets
To obtain a high-level overview of common paths and their volumes in an event sequence dataset, some researchers extracted tree structures and illustrated them using icicle plots or node-link diagrams [Wongsuphasawat2011, Liu2017, Shen2012, Monroe2013, Kruskal1983]. Others attempted to consolidate the whole dataset into a transition graph to demonstrate the flows of events, for example, OutFlow [Wongsuphasawat2012], CareFlow [Perer2013], and DecisionFlow [Gotz2014].
An increasing number of visualization systems incorporate analytical approaches to understand common patterns shared by sequences. FP-Viz is one of the early systems, which visualizes frequent patterns using a Sunburst diagram [Leung2009]. Wang et al. [Wang2016] and Vrotsou et al. [Vrotsou2009] mined dominating user behaviors from clickstream data and further presented them using intuitive visualizations, such as node-link diagrams and circle packing layouts. In addition, Wei et al. proposed an overview of clusters in clickstream data with a Self-Organizing Map [Wei2012]. Similarly, WebCANVAS [Cadez2003] and LogView [Makanju2008] are designed to show the hierarchical structures generated by sequence clustering methods using tree visualization techniques, such as TreeMaps [Shneiderman1992] and node-link diagrams. Recently, Liu et al. proposed a three-stage analysis pipeline for event sequence analysis [Liu2017a]. In their work, patterns are displayed using glyphs and sequences can be aligned with a particular event.
The design of ICE has been inspired by many of the above visualizations, but we focus on the comparative analysis of event sequence data. Although comparing two sets of sequences could be supported by creating two instances of these techniques, the process may be cumbersome because they are not designed specifically for comparison tasks.
1.2 Visual query of event sequences
The increasing volume of event sequence data has created an overarching need for querying meaningful records. Several techniques have been proposed to allow users to define queries based on intervals, event absence, and other temporal constraints of events [Hibino1997, Monroe2013a, Jin2010, Krause2016]. However, these approaches do not support querying by starting or ending events. To address this, OutFlow [Wongsuphasawat2012] and CareFlow [Perer2013] offer the specification of ending events and the exploration of pathways associated with these results. DecisionFlow moves one step forward to support querying with preconditions (starting events) and outcomes (ending events) [Gotz2014]. To augment expressiveness, Eventpad [Cappers2018] and (s|qu)eries [Zgraggen2015] enable users to visually construct regular expressions for sequence queries.
These techniques are effective when analysts know what events to specify in their queries. However, in many scenarios, they lack such knowledge about the data, and visual exploration is needed to identify meaningful sets of sequences to compare. ICE addresses this issue by offering a matrix-based visualization to provide a visual summarization of the entire data from the perspectives of starting and ending events.
1.3 Visual comparison of event sequences
Although the concept of visual comparison is not new (see [Gleicher2011] for visual comparison techniques), its application to event sequences is generally under-explored in the literature. Some existing work targets the comparison of individual sequences. For example, Similan [Wongsuphasawat2009] and EventAction [Du2016] employ certain metrics to identify similar sequences and present the sequences side by side. Another example is TimeSlice [Zhao2012], which provides exploration and comparison of faceted event sequences. Although useful, these techniques are not adequate for comparing two sets of sequences, because they focus on event details in individual sequences and cannot be generalized to compare sequence sets. To support such comparison, MatrixWave [Zhao2015] utilizes a series of transition matrices, and CoCo [Malik2015, Malik2016] compares sequence sets based on some predefined metrics, such as the number of records, number of events, and prevalence of an event. Further, TimeStitch defines cohorts by frequent patterns and compares two cohorts side by side [Polack2015].
The above techniques focus on the comparison of either raw sequences or frequent patterns, whereas ICE supports comparing sets of sequences at different granularities. For flexibility, individual sequences can be organized into patterns and displayed in a range of layouts, and can then be further explored by analysts.
1.4 Unit visualizations
According to Park et al. [Park2017], unit visualization is defined as “visualization that maintains the identity property of its visual marks, i.e., where each visual mark is a unique entity that is associated with a corresponding unique data item.” Unit visualization has many benefits, including semantic constancy, direct interaction, and smooth animation [Park2017]. Many systems based on this approach have been proposed in recent years, such as SandDance [Drucker2015], Squares [Ren2017], Gatherplots [Park2016], and Past Visions [Glinka2016]. To enhance unit visualization, Oelke et al. discussed different approaches for visual boosting [Oelke2011]. To formalize the creation of unit visualizations, Drucker and Fernandez characterized the design space of unit visualization and proposed a unifying framework [Drucker2015]. Similarly, Park et al. proposed a grammar for unit visualizations named ATOM [Park2017]. The expressive power of ATOM enables it to describe unit visualizations of varying complexity.
In this paper, we have explored the design space of unit visualizations for comparing two sets of sequences. Our design space borrows the spatial layouts of unit visualizations in ATOM [Park2017], and moves one step further by adapting the layouts to visual comparison tasks based on Gleicher’s schema [Gleicher2011].
2 Task analysis
We aim to design a visual analysis tool to facilitate the comparison of different subsets of sequences discovered based on the visual exploration of the entire event sequence dataset. We follow Brehmer and Munzner’s typology [Brehmer2013] to derive proper analytical tasks by expressing them as a series of basic tasks, and articulating the why and how.
T1: Explore event sequences based on prefixes and/or suffixes.
Why: discover → locate + browse + explore → identify
Analysts often seek to understand the whole dataset from the perspectives of starting (prefix) and ending (suffix) events. For example, in web analytics, a web analyst may ask questions like: “What are the most popular entry points to our website in this marketing campaign,” and “For customers brought by the search engine X, how many of them check out finally?” [Zhao2015]. However, in many situations, analysts lack preliminary knowledge about the data and may not know what to compare. Thus, they need to discover insights with the help of visualization, in which they have to locate, browse, and explore the sequences because the location and target are unknown. The general goal is to identify interesting sets of sequences, which is tied to T2.
How: aggregate + encode + arrange
To support this task, the visualization should aggregate all sequences based on their prefixes and suffixes to enable effective exploration within the whole dataset. Moreover, the visualization should encode and arrange the aggregated data in a meaningful way (e.g., sorted by some criteria).
T2: Identify interesting sets of event sequences for comparison.
Why: discover → look up + browse → compare
Based on the starting and ending events, analysts could discover interesting sets of sequences in three ways: by the prefix, the suffix, and both [Gotz2014]. This process is multi-scale in nature, in which analysts would like to browse sequence sets with different prefix and/or suffix lengths, and look up the events from the start and/or the end. For example, in clickstreams on E-commerce websites, analysts may be interested in the sequences ending with “Checkout” followed by “Error” to diagnose the problems that users had after checkout. Overall, the goal is to find candidate sets of sequences to compare, which leads to T3 and T4.
How: navigate + filter + select
To enable this action, the visualization should help analysts navigate the whole dataset at multiple levels, both from the starts and ends of the sequences. Further, due to the overwhelming data volume and the large space of event type permutations, the visualization must allow users to filter out parts that are not considered for comparison. This filtering could be supported based on specific events that analysts select during their multi-scale visual exploration.
T3: Compare two sets of event sequences at pattern level.
Why: discover → search → summarize + compare
After analysts identify two sequence sets of interest, one important task is to discover high-level insights by comparing their frequent patterns mined by algorithms, such as VMSP [Fournier-Viger2014a] and SPAM [Ayres2002]. Example questions that analysts may raise include: “How do patterns differ between two sets,” “Which pattern clusters are dominated by one set or the other,” and “Do these patterns form some clusters?” To answer these questions, analysts need to conduct visual search of the patterns from two different sets of sequences in order to summarize and compare them.
How: arrange + aggregate + encode + change
The visualization should therefore allow analysts to view an ensemble of the diverse patterns, which should be arranged and aggregated to reflect their relationships and similarity. Techniques such as clustering and sorting can be applied. To reflect the nuances among the patterns and the sequences containing them (which may come from two different sets), some quantities, such as the support of the mined patterns and the proportion of sequences from each set, should be encoded to ease the comparison at the pattern level. Also, the visualization should offer the ability to change the representation, such as arrangement and encoding methods, for different comparison goals.
T4: Compare raw sequences exhibiting particular patterns.
Why: discover → browse + locate → compare
In addition to comparing two sets of sequences at a higher level, analysts may want to dig into the raw sequences that contain a specific pattern of interest [Liu2017a]. Thus, they would like to discover the connections between the patterns and sequences, where they need to browse events in the sequences and locate individual events with significance, with the ultimate goal to compare the sequences, for example, based on key events.
How: select + arrange + change
To achieve this task, the visualization should allow analysts to select a specific pattern and arrange the associated sequences according to events in the pattern or other key events of interest (e.g., aligning all sequences based on a particular event). To assist the comparison, analysts should be able to change the events of interest to alter the alignment of the sequences.
3 Design space of sequence set comparison
Assuming that two sets of sequences are identified in the data, a major question is how to compare them effectively at different granularities, as mentioned in T3 and T4. Such comparison is the ultimate goal of an analyst using the visualization tool in our scenario. Since a large body of work (e.g., [Wongsuphasawat2009, Du2016, Zhao2015]) has focused on comparison at the sequence level, in this section, we discuss the design space of visualization techniques that address the pattern level comparison of sequences. Following the notations by Liu et al. [Liu2017a], we first describe our data model and then introduce the design space.
3.1 Data model
An event sequence dataset $\mathcal{S}$ contains a number of sequences $s_i$. That is, $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$. A sequence is defined as a set of ordered events: $s = [e_1, e_2, \ldots, e_m]$. Each event could be multivariate. For example, an event has a “name” to present its identity and a “timestamp” to record when it happened, etc. A set of patterns $\mathcal{P}$ can be obtained by applying the sequential pattern mining algorithm to the dataset $\mathcal{S}$. Each pattern is a series of events contained in one or more sequences. That is, $p = [e_1, e_2, \ldots, e_l]$, such that $p \sqsubseteq s$ for some $s \in \mathcal{S}$, where $p \sqsubseteq s$ denotes that $s$ contains $p$. Each pattern $p$ is associated with a support set, i.e., $\mathcal{S}_p$, which is a set of sequences in $\mathcal{S}$ that contain the pattern: $\mathcal{S}_p = \{s \in \mathcal{S} \mid p \sqsubseteq s\}$. The support of a pattern is the ratio of $|\mathcal{S}_p|$ and $|\mathcal{S}|$: $\mathrm{supp}(p) = |\mathcal{S}_p| / |\mathcal{S}|$.
In our implementation, we employ the VMSP algorithm [Fournier-Viger2014a] for mining maximal sequential patterns; however, ICE is designed to work with any sequential pattern mining algorithm for extracting any kind of pattern, such as frequent patterns [Ayres2002] and closed patterns [Fournier-Viger2014a]. After an analyst identifies two sets of sequences to compare, $\mathcal{S}_A$ and $\mathcal{S}_B$, we join the two sets and apply the VMSP algorithm to obtain the patterns $\mathcal{P}$. The support set of each pattern contains sequences from either $\mathcal{S}_A$ or $\mathcal{S}_B$.
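The notions of support set and support described above can be illustrated with a short Python sketch. It is not the VMSP algorithm and does not mine patterns. It only checks a given pattern against a set of sequences, under the assumption that a sequence contains a pattern when the pattern's events occur in order with arbitrary gaps. All function names here are hypothetical.

```python
from typing import List, Sequence, Tuple


def contains_pattern(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if the events of `pattern` occur in `sequence` in order.

    Assumption: gaps between matched events are allowed (subsequence match).
    """
    remaining = iter(sequence)
    # `event in remaining` advances the iterator, so the order is enforced.
    return all(event in remaining for event in pattern)


def support_set(sequences: Sequence[Sequence[str]], pattern: Sequence[str]) -> List[int]:
    """Indices of the sequences that contain the pattern."""
    return [i for i, seq in enumerate(sequences) if contains_pattern(seq, pattern)]


def support(sequences: Sequence[Sequence[str]], pattern: Sequence[str]) -> float:
    """Ratio of the support set size to the dataset size (dataset must be non-empty)."""
    return len(support_set(sequences, pattern)) / len(sequences)


def split_by_origin(set_a, set_b, pattern) -> Tuple[int, int]:
    """Join two sequence sets and count how many supporting sequences come from each."""
    joined = list(set_a) + list(set_b)
    ids = support_set(joined, pattern)
    from_a = sum(1 for i in ids if i < len(set_a))
    return from_a, len(ids) - from_a
```

The pair returned by `split_by_origin` corresponds to the per-set proportions that a pattern level comparison would need to encode.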
3.2 Design space of pattern level comparison
When comparing two sets of sequences at the pattern level, it is essential to reveal the relationships between the mined patterns and the sequences containing a particular pattern (i.e., the support sets). This facilitates the “Overview first, zoom and filter, and details on demand” approach of conducting comparison tasks. To allow an analyst to obtain an effective overview of the patterns while accessing some information about the support sets, we employ the unit visualization technique [Park2017] where each sequence is mapped to a visual mark (i.e., the basic unit) and a pattern is a group of corresponding visual marks.
3.2.1 Expressing design space with ATOM
ATOM is a visual grammar for systematically describing the spatial arrangement of each visual mark (that represents a data item) in unit visualizations [Park2017]. The grammar defines a container as the composition of a dataset and a canvas. Hence, a root container consists of the entire dataset and all the visual space available. Then, we recursively apply the unit visualization layout operations until all containers are associated with only one data item. In particular, the operations manipulate containers in both the data domain and the spatial domain. Specifically, in the data domain, the operations “divides a dataset of parent container into a set of datasets for child containers,” and in the spatial domain, the operations “split the parent space into child spaces” [Park2017]. Different layout operations in the spatial domain are demonstrated in Figure 1.
In our case, the root container consists of the entire dataset, i.e., all the patterns detected from the sequential pattern mining, and the whole canvas. We notice that the data domain is hierarchical in nature. Thus, we assign a sub-container for each pattern in the dataset; therefore, each sub-container consists of the corresponding support set and the visual space allocated by the upper level. We can apply any of the layout operations, including Map2D, FillX, FillY, MaxFill, and Pack, to both the root container and sub-containers, which form the design space of visualizing the patterns and their related sequences (Figure 2). As FillX and FillY are similar, we group them in our design space for simplicity.
Particularly, we employ non-uniform layouts at the pattern level (i.e., within the root container), where the area of the visual space of a pattern is proportional to the number of sequences in its support set. This allows for more screen real estate for more frequent patterns in order to display the sequences associated with the pattern. If not, each sequence may be too small due to the limitation of unit visualizations that every unit needs to be displayed and distinguished [Park2017]. At the sequence level (i.e., within a sub-container), we use uniform layouts with shared size of visual marks. This is because the geometrical properties of the visual representations of all sequences are the same, which facilitates the comparison across different patterns. However, our discussion of the design space (see Section 3.2.3) could be generalized to other configurations, such as uniform layouts at both levels.
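As a minimal sketch of the non-uniform allocation at the pattern level, the following Python function divides a canvas along the vertical axis (a FillY-style arrangement) so that each pattern receives a band whose height, and hence area at fixed width, is proportional to the size of its support set. The function name and the band representation are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def allocate_fill_y(total_height: float, support_sizes: Sequence[int]) -> List[Tuple[float, float]]:
    """Return one (y_offset, height) band per pattern.

    Each band's height is proportional to the pattern's support set size,
    so more frequent patterns receive more screen space.
    """
    total = sum(support_sizes)
    if total == 0:
        return [(0.0, 0.0) for _ in support_sizes]
    bands = []
    y = 0.0
    for size in support_sizes:
        height = total_height * size / total
        bands.append((y, height))
        y += height
    return bands
```

For example, two patterns with support set sizes 1 and 3 on a canvas of height 100 would receive bands of height 25 and 75.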
3.2.2 Cooperating with visual comparison
Based on the above design space described with ATOM, we consider the basic visual comparison approaches summarized by Gleicher et al. [Gleicher2011], including juxtaposition, superposition, and explicit encoding, in order to accommodate the comparison of two sequence sets with unit visualizations.
As shown in Figure 2, we finally employ superposition, so patterns detected from both sets are merged and displayed on the same canvas, and sequences within each pattern (i.e., the support set) are color-coded based on which set they belong to. If a sequence appears in multiple patterns, we simply duplicate that sequence in our visualization.
If juxtaposition is applied, sequences from two different sets are isolated and located on two visualizations side by side, which may hinder an analyst from interpreting and comparing the data because they need to visually search and match the patterns of two sets on two canvases. For explicit encoding, a difference quantity needs to be computed and visually represented, in our case, the difference between two patterns, or two support sets. However, it is difficult to define the difference as a numerical variable to encode visually, because each support set contains multiple sequences that could be dramatically different. Although some simple variables can be calculated such as the size difference of support sets, it is not informative for comparison. However, future empirical studies need to be conducted to verify the above intuitions.
3.2.3 Discussion on design alternatives
Following Figure 2, we discuss how each design alternative in the design space can facilitate the visual comparison of two sequence sets at the pattern level.
For the Map2D layout, patterns or sequences can be positioned based on a similarity metric (e.g., Levenshtein distance [Yujian2007]) using the MDS projection [Kruskal1978], or based on two attribute axes like a scatterplot. This is beneficial for revealing the relationships among all detected patterns (see the 1st column of Figure 2). When applying Map2D to the sequence level, a shared coordinate system is needed across all the pattern sub-containers in order to facilitate the comparison of different patterns (see the 1st row of Figure 2). However, the MDS algorithm is not applicable because each 2D projection of sequences in a support set could be different. The scatterplot method requires two meaningful attributes of the sequences as the axes, and a shared scale is necessary among all pattern sub-containers. However, some event sequence data may not contain such metadata. Further, it is worth noting that such a sequence level Map2D layout could be more effective when the uniform layout is employed at the pattern level (Figure 1).
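To make the similarity metric concrete, the sketch below computes the standard Levenshtein (edit) distance between two patterns treated as lists of event names, and builds a pairwise distance matrix. Such a matrix could then be handed to any MDS implementation that accepts precomputed dissimilarities. The function names are hypothetical, and unit costs for insertion, deletion, and substitution are an assumption.

```python
from typing import List, Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two event lists with unit costs."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, event_a in enumerate(a, start=1):
        current = [i]
        for j, event_b in enumerate(b, start=1):
            cost = 0 if event_a == event_b else 1
            current.append(min(
                previous[j] + 1,         # delete event_a
                current[j - 1] + 1,      # insert event_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def distance_matrix(patterns: Sequence[Sequence[str]]) -> List[List[int]]:
    """Symmetric matrix of pairwise edit distances between patterns."""
    return [[levenshtein(p, q) for q in patterns] for p in patterns]
```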
In FillX or FillY, patterns or sequences are aligned in one direction, which enables ordering and sorting based on certain criteria (e.g., the size of the support set at the pattern level). Note that we employ a non-uniform layout for patterns (see the 2nd column of Figure 2) and a uniform layout with shared size for sequences (see the 2nd row of Figure 2), so some pattern sub-containers may have unfilled space [Park2017]. There are four different ways to apply FillX or FillY to the pattern and sequence levels. Figure 2 only demonstrates the configuration of FillY for patterns and FillX for sequences. However, in any configuration, filling along one direction is not scalable for a large number of data items, where the visual representation of one item could be too small. Further, at the sequence level when the size is shared, much visual space would be wasted.
The MaxFill layout tries to utilize all the visual space, and there exist several space-filling methods such as the TreeMap [Wongsuphasawat2009] and grid-based methods. We employ the TreeMap for MaxFill at the pattern level because of its non-uniform space allocation (see the 3rd column of Figure 2). Although more visually compact, the TreeMap is less informative because the inter-pattern relationship or the ordering feature is missing compared to Map2D or FillX/Y. Further, we use the grid-based approach at the sequence level to facilitate the visual comparison, since the sequences could have a common alignment (see the 3rd row of Figure 2). Similar to FillX/Y, due to the shared size configuration, some pattern sub-containers may not be fully filled, but MaxFill utilizes the visual space more efficiently. It can also reflect a certain order of the sequences determined by the positioning algorithm in MaxFill.
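One way to reason about the shared mark size in a grid-based layout is sketched below. For a single sub-container, the function searches over the number of grid columns to find the largest square mark that still fits all sequences. The shared size is then taken as the minimum over all sub-containers, which is why larger sub-containers may remain partly unfilled. This is an illustrative assumption about how a shared size could be derived, not a description of the actual positioning algorithm, and the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def max_square_side(width: float, height: float, n: int) -> float:
    """Largest side of n equal squares arranged on a grid inside width x height."""
    if n <= 0:
        raise ValueError("n must be positive")
    best = 0.0
    for cols in range(1, n + 1):
        rows = math.ceil(n / cols)  # fewest rows that hold n marks
        best = max(best, min(width / cols, height / rows))
    return best


def shared_mark_side(containers: Iterable[Tuple[float, float, int]]) -> float:
    """Mark side usable in every (width, height, n_sequences) container."""
    return min(max_square_side(w, h, n) for w, h, n in containers)
```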
The last layout is Pack, which also comprises a family of methods such as packing compactly in 2D [Wang2006], in 1D, or by a grid system [Ren2017]. As the size of pattern sub-containers is non-uniform, we apply the method by Wang et al. [Wang2006] for Pack at the pattern level (see the 4th column of Figure 2), and use the grid-based packing at the sequence level, again, to facilitate the cross-pattern comparison of sequences (see the 4th row of Figure 2). Similar to MaxFill at the pattern level, Pack is also less informative compared to Map2D or FillX/Y. Moreover, at the sequence level, Pack may be less space efficient than MaxFill because it does not always maximize the size of visual marks.
In summary, each design alternative has its own advantages and disadvantages, which should be carefully considered case by case according to the characteristics of tasks and data. In general, based on the above discussion, the design alternatives shown in Figure 2(i) and (j) may be the most effective in our scenarios. Further, when the number of detected patterns is large, Figure 2(j) could be less effective. If a shared coordinate system can be built for positioning sequences of different patterns (i.e., support sets), Figure 2(a) and (b) could be good alternatives, but may be subject to the constraint of non-uniform pattern level layouts. However, empirical studies comparing user performance with different design alternatives are needed to further confirm this conclusion.
4 ICE interface
In this section, we describe the interface design of ICE in detail (Figure 3). We first introduce the Matrix View that offers visual summarization and exploration of an event sequence dataset, and then the Comparison View that is used to compare two sets of sequences at multiple granularities.
4.1 Matrix View: getting the gist
As shown in Figure 3(a)(c), the Matrix View aims to provide a summarization for the entire dataset based on prefixes and/or suffixes (T1), and help an analyst identify two meaningful sets of sequences for comparison (T2).
4.1.1 Visual encoding
By using tree-like representations, some systems support the visual exploration of event sequences according to prefixes [Wongsuphasawat2011, Liu2017, Shen2012, Monroe2013], so an analyst can track sequences by a series of events from the beginning. In contrast, ICE allows for the exploration from both the start and the end of sequences (T1). To do so, we first construct a prefix tree and a suffix tree based on the entire dataset. Figure 4(a) and (b) illustrate a simple example of this process. Next, we construct a matrix by combining the two trees, where the columns correspond to the prefix tree nodes and the rows correspond to the suffix tree nodes. Thus, each cell in the matrix denotes all the sequences with the same starting and ending events determined by the trees (Figure 4(c)(d)(e)). Some statistical quantities of sequences in each cell can be visually encoded, which allows an analyst to browse the distribution of all sequences based on their prefixes and suffixes. In our implementation, we map the number of sequences of a cell to its color density.
When the cardinality of events is large and sequences are lengthy, the size of the matrix grows exponentially. To avoid information overload, ICE supports dynamic expanding and collapsing the trees. For example, Figure 4(d) shows the result after expanding the cell (a, c) in Figure 4(c). The prefixes and suffixes are displayed with indentation to show the hierarchical structure. Further, to help an analyst understand the characteristics of each column and row, some statistical values can be visualized at the periphery space of the matrix. In ICE, we implemented two metrics including the number of sequences and the average sequence length [Malik2015], where an analyst can choose to show one metric at a time as bar charts (Figure 4(e)). Other metrics can be easily added based on real demand.
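The matrix construction and the expansion of a single cell can be sketched as follows. Instead of building explicit prefix and suffix trees, this simplified Python illustration counts sequences per (prefix, suffix) pair at a given depth, and refines one cell by extending its prefix and suffix by one event. The function names are hypothetical, suffixes are stored in their original event order, and both lengths are assumed to be at least one.

```python
from collections import Counter
from typing import Sequence, Tuple

Cell = Tuple[Tuple[str, ...], Tuple[str, ...]]


def prefix_suffix_matrix(sequences: Sequence[Sequence[str]],
                         prefix_len: int = 1, suffix_len: int = 1) -> Counter:
    """Count sequences per (prefix, suffix) cell; lengths must be >= 1."""
    cells: Counter = Counter()
    for seq in sequences:
        if len(seq) < max(prefix_len, suffix_len):
            continue  # too short to have this prefix or suffix
        cells[(tuple(seq[:prefix_len]), tuple(seq[-suffix_len:]))] += 1
    return cells


def expand_cell(sequences: Sequence[Sequence[str]],
                prefix: Tuple[str, ...], suffix: Tuple[str, ...]) -> Counter:
    """Split one cell by the next prefix event and the preceding suffix event."""
    p, s = len(prefix), len(suffix)
    children: Counter = Counter()
    for seq in sequences:
        if len(seq) <= max(p, s):
            continue  # cannot be extended by one more event
        if tuple(seq[:p]) == tuple(prefix) and tuple(seq[-s:]) == tuple(suffix):
            children[(tuple(seq[:p + 1]), tuple(seq[-(s + 1):]))] += 1
    return children
```

The counts in each cell correspond to the quantity that is mapped to cell color in the Matrix View.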
4.1.2 User interaction
The Matrix View provides various interaction techniques for exploring and identifying meaningful sequence sets at multiple levels based on the sequence prefixes, suffixes, and both (T2).
Expanding and collapsing. Besides clicking one cell to expand the prefix tree and suffix tree simultaneously, an analyst can click the bars outside the matrix to expand just one column or row (Figure 4(e)). To facilitate multi-scale analysis, three quick access buttons are placed at the top of the Matrix View to allow an analyst to unfold and fold all nodes at the next level for both trees (Figure 3(a)). Unfolding or folding all nodes of the two trees is also available.
Filtering and sorting. An analyst could filter sequences according to their length. This is useful for removing some abnormal sequences that are extremely long or short. By default, ICE shows sequences with length greater than two, which can be further adjusted using a text box. The columns and rows of the matrix can be sorted based on a metric, for example, the total number or the average length of all sequences in a row or column. Since the rows and columns correspond to two trees, the sorting is performed locally to keep the hierarchical structure. That is, only sibling tree nodes are sorted among each other.
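The length filter and the local sorting can be sketched in Python as below. The tree node structure, the function names, and the choice of sequence count as the default sort key are assumptions for illustration. The point shown is that only siblings are reordered, so every child stays under its parent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class TreeNode:
    event: str
    count: int  # number of sequences passing through this node
    children: List["TreeNode"] = field(default_factory=list)


def filter_by_length(sequences: Sequence[Sequence[str]],
                     min_len: int = 3, max_len: Optional[int] = None) -> list:
    """Keep sequences within the length bounds (default: longer than two)."""
    return [s for s in sequences
            if len(s) >= min_len and (max_len is None or len(s) <= max_len)]


def sort_siblings(node: TreeNode,
                  key: Callable[[TreeNode], float] = lambda n: n.count,
                  descending: bool = True) -> None:
    """Sort children of every node in place; the hierarchy is preserved."""
    node.children.sort(key=key, reverse=descending)
    for child in node.children:
        sort_siblings(child, key, descending)


def display_order(node: TreeNode, depth: int = 0) -> List[Tuple[int, str]]:
    """Row or column order as (indent depth, event) pairs, excluding the root."""
    rows: List[Tuple[int, str]] = []
    for child in node.children:
        rows.append((depth, child.event))
        rows.extend(display_order(child, depth + 1))
    return rows
```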
Linking, zooming & panning, and selecting. When an analyst hovers over a matrix cell, the corresponding row and column are highlighted in red. Additionally, more information about the cell, including the total number and the average length of the sequences, is displayed in a tooltip. Similarly, hovering over a bar in the bar charts highlights the entire column or row, and detailed information about the contained sequences is offered. Moreover, zooming and panning are offered to help an analyst move around the matrix. Three modes of selection are supported in Matrix View: row, column, or cell, by clicking the item while holding a modifier key. Two sets of event sequences (determined by a cell or the intersection of a row and a column) can be selected to feed to the Comparison View for further analysis.
4.1.3 Design alternative
During the design of ICE, we explored a design based on two icicle plots [Kruskal1983], which are placed side by side with each representing a prefix or a suffix tree of the data (Figure 5(a)). The size of each node in the icicle plots is mapped to the number of sequences it represents. When an analyst hovers over a node in one icicle plot, the number of related sequences is mapped to the color of each node in the other icicle plot, as shown in Figure 5(a). Although this visualization is more space efficient, multiple hovering operations are needed to obtain a big picture of the distribution of event sequences with their prefixes and suffixes. In addition, presenting multiple statistics for each icicle node is not a trivial task. Hence, this design does not help an analyst choose interesting sets of sequences (T2).
As a result, we proposed the above matrix-based visualization that allows for more effective visual summarization and investigation of the entire dataset (Figure 5(b)). By employing various interaction techniques for expanding and collapsing, the matrix shows an overview of a dataset at multiple levels by balancing between space efficiency and information load.
4.2 Comparison View: distinguishing two sets
After two meaningful sets of sequences are selected in the Matrix View, they are combined and fed into the sequential pattern mining algorithm [Fournier-Viger2014a]. The computed patterns, together with the raw sequences, are visualized in the Comparison View (Figure 3(b)(d)) to support the visual comparison of the two sets at different granularities, including the pattern level (T3) and the sequence level (T4).
4.2.1 Pattern level comparison
As discussed above, we employ the unit visualization technique for comparing two sets at the pattern level (T3). Each visual mark (i.e., a small rectangle) represents a sequence with its color indicating which set the sequence belongs to (Figure 3(d)). Multiple visual marks wrapped by a gray rectangle are displayed to indicate the pattern that these sequences contain, based on the layouts discussed in the design space. In Section 3, we have identified two design alternatives, “Map2D_MaxFill” and “FillX/Y_MaxFill”, which are the most effective for the visual comparison at the pattern level in our case (Figure 2(i)(j)). By default, the Comparison View uses the “Map2D_MaxFill” configuration. An analyst can switch between the two options by clicking the buttons on top of the Comparison View (Figure 3(b)). In addition, we provide options for choosing any combination of pattern and sequence layouts in Figure 2 based on an analyst’s experience and goals.
4.2.2 Sequence level comparison
After identifying patterns of interest, an analyst may want to examine how sequences in the support set manifest the pattern, and compare the two sets at the sequence level (T4). When an analyst clicks a pattern, all sequences in its support set are displayed on the right of the Comparison View (Figure 3(f)). Inspired by Liu et al.’s work [Liu2017a], we visualize the events of each sequence from top to bottom, and align all sequences horizontally. Each event is represented as a circle with the color mapped to event type. At the top of each sequence, a small triangle filled with blue or orange color is presented to indicate which set it belongs to.
On the left of all sequences, key events in the pattern are displayed with text labels and aligned vertically in order. Several interactions are implemented to assist with the understanding of the relationships between the pattern and the corresponding sequences. First, when an analyst hovers over a key event, the first occurrence of that event is rendered with a gray border in all sequences, as shown in Figure 6(a). Second, when an analyst clicks a key event, sequences are aligned by this event (Figure 6(b)). Animated transitions are also applied to indicate the change of sequences in the alignment.
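Aligning sequences by a key event can be sketched as computing, for each sequence, how many positions it must be shifted so that the first occurrence of the event lands on the same row in every sequence. The function name and the handling of sequences that lack the event (returning `None`) are assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def alignment_offsets(sequences: Sequence[Sequence[str]],
                      key_event: str) -> List[Optional[int]]:
    """Shift per sequence that puts the first `key_event` on a common row.

    Sequences without the key event get None (left unaligned).
    """
    first = [list(seq).index(key_event) if key_event in seq else None
             for seq in sequences]
    found = [pos for pos in first if pos is not None]
    anchor = max(found, default=0)  # row on which the key event is placed
    return [anchor - pos if pos is not None else None for pos in first]
```

For the sequences `[A, B, C]`, `[B, C]`, and `[X, Y, B]` aligned on `B`, the offsets are 1, 2, and 0, so that `B` sits on the third row in each of them.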
5 Case studies
In this section, we present three case studies from different domains to illustrate how analysts use ICE to explore real-world datasets and obtain insights. The first dataset records important action events in football matches [football], the second one tracks the interactions when students perform digital electronics exercises [Vahdat2015], and the third one contains clickstreams of customers on an E-commerce website [ecommerce].
5.1 Case I: analyzing football matches
The football match dataset records events in 7 matches between Manchester City (Man City) and West Bromwich (West Brom), where 626 events are categorized into 15 event types, including Announcement, Attempt, Corner, Foul, Yellow Card, Second Yellow Card, Red Card, Substitution, Free Kick Won, Offside, Hand Ball, Penalty Conceded, West Brom Start, West Brom End, Man City Start, and Man City End. After sorting the events based on the timestamp, a sequence is defined as a list of events that is launched by one team before it is interrupted by the other. For example, during a ball control of West Brom, they may have a sequence like: West Brom Start, Free Kick Won, Foul, and West Brom End. The whole dataset contains 355 sequences and the average length is .
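The segmentation into per-team sequences can be sketched as follows. The record format, a (timestamp, team, event name) tuple, is an assumption because the raw data layout is not described here, and the function name is hypothetical. The sketch sorts events by timestamp, groups consecutive events of the same team, and wraps each group with the team's Start and End markers.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

Event = Tuple[float, str, str]  # (timestamp, team, event name), assumed format


def segment_by_possession(events: Sequence[Event]) -> List[List[str]]:
    """Split a match into sequences of consecutive events by the same team."""
    ordered = sorted(events, key=lambda e: e[0])  # stable sort by timestamp
    sequences = []
    for team, run in groupby(ordered, key=lambda e: e[1]):
        names = [name for _, _, name in run]
        sequences.append([f"{team} Start", *names, f"{team} End"])
    return sequences
```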
Assume Tom is a fan of West Brom and wants to know the behavioral differences between his favorite team and its opponent Man City, which has a higher rank. To begin with, Tom wants to get a general idea of the event sequences of the two teams by analyzing starting and ending events. Upon loading the dataset, Tom sees two dark cells in a matrix whose columns are titled Man City Start and West Brom Start, and whose rows are titled Man City End and West Brom End. The two dark matrix cells, i.e., (Man City Start, Man City End) and (West Brom Start, West Brom End), include all the sequences of the two teams. To learn what the two teams do first and last during their turns, Tom clicks the two cells to expand both columns and rows to the second level (Figure 7(a)). He observes two clusters in the matrix representing Man City and West Brom, respectively. Thinking that Free Kick Won is a kind of starting point with various possible endings, he focuses on the columns titled Free Kick Won at the second level. He observes that the cell (Free Kick Won-Attempt) of West Brom is darker in color, whereas the same cell of Man City is empty, as shown in Figure 7(a). This indicates that some attempts of West Brom start with winning a free kick, but no attempt of Man City is initiated by a free kick. Thus, Tom guesses that West Brom tends to use the free kick as the beginning of a set play that leads to an attempt opportunity.
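The counts behind such a matrix can be illustrated with a short sketch. It assumes that a column corresponds to a prefix of a given depth and a row to a suffix of the same depth, which matches the description above; the function name `prefix_suffix_counts` and the toy sequences are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from typing import Counter as CounterType, List, Tuple

Cell = Tuple[Tuple[str, ...], Tuple[str, ...]]  # (prefix, suffix)


def prefix_suffix_counts(sequences: List[List[str]], depth: int) -> CounterType[Cell]:
    """Count sequences per (prefix, suffix) pair at the given tree depth.
    Depth 1 uses only the first and last event; depth 2 adds the second
    and second-to-last events, and so on."""
    counts: CounterType[Cell] = Counter()
    for seq in sequences:
        if len(seq) < depth:
            continue  # too short to have a prefix/suffix of this depth
        # Assumption: prefix and suffix may overlap in short sequences.
        counts[(tuple(seq[:depth]), tuple(seq[-depth:]))] += 1
    return counts


sequences = [
    ["West Brom Start", "Free Kick Won", "Attempt", "West Brom End"],
    ["West Brom Start", "Free Kick Won", "Foul", "Attempt", "West Brom End"],
    ["Man City Start", "Corner", "Attempt", "Man City End"],
]
level1 = prefix_suffix_counts(sequences, 1)
level2 = prefix_suffix_counts(sequences, 2)
```

In this toy data, the second-level cell with column Free Kick Won and row Attempt is populated for West Brom and absent for Man City, which is the kind of contrast Tom reads from cell darkness.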
Tom wonders if there is a key difference between the two teams during their ball control. He selects the two darkest cells, i.e., (Man City Start, Man City End) and (West Brom Start, West Brom End), which are rendered with orange and blue borders in the matrix, respectively. Sequential patterns are then calculated and displayed in the Comparison View. He explores various layouts of patterns and sequences by clicking the buttons on top of the Comparison View. Some interesting insights emerge in the “FillY_MaxFill” layout when patterns are sorted by length (Figure 7(b)). He finds that the blue rectangles (West Brom) mainly appear at the top, while the orange rectangles (Man City) are scattered across the entire visualization. This indicates that some of the longest patterns are composed of the sequences of Man City only, and the sequences of West Brom are involved only in shorter patterns. Tom hypothesizes that Man City has better abilities to control the ball and clearer strategies in offense. Then, he clicks some of the longest patterns (which appear at the bottom of the Comparison View) and browses the key events on the right of the Comparison View (Figure 7(b3)). He observes that these patterns consist of consecutive Attempt events with some Corner and Free Kick Won events, indicating that Man City launches dense offense effectively when the ball is under their control. After clicking the patterns within the blue rectangles (Figure 7(b1, b2)), Tom identifies some key events including Foul, Yellow Card, and Substitution. Tom infers that West Brom faces great challenges in both defense and offense.
5.2 Case II: understanding students’ learning behaviors
Digital Electronics Education and Design Suite (Deeds) is an e-learning software tool for digital electronics, which is used for digital circuit simulation, finite state machine simulation, and so on. When students do digital electronics exercises with Deeds, the procedure usually involves sequences of events. For example, when a student starts to read the content of an exercise, the Study_Es event is recorded. Then, she may use Deeds to do exercises (Deeds_Es). The student sometimes needs to draw diagrams (Diagram) and adjust properties of the simulation (Properties). Finally, she may use a text editor to write a report (TextEditor_Es), or do activities irrelevant to the course, such as opening a web browser (Other). Overall, the dataset contains 8005 events with 23 event types. The events are recorded with a timestamp, a student ID, and a session ID.
Suppose that students in the course Digital Electronics are required to finish four sessions of exercises with increasing difficulty, which focus on gates, arithmetic circuits, flip-flops, and counters, respectively. Linda is a teaching assistant of the course, and some students complain to her that switching between several software tools makes them easily distracted. So she wants to know whether this is a common problem faced by all students and how students behave differently in different exercise sessions. She defines a sequence as a series of consecutive events performed by one student doing one exercise. To facilitate the analysis based on sessions, each sequence starts and ends with events representing the session identifiers, such as SessionArithmetic_Start and SessionArithmetic_End. Finally, she obtains 977 sequences with average length .
First of all, Linda wants to explore the whole dataset to understand what students do during the exercise sessions. After loading the dataset, she filters out the sequences with length smaller than three to remove outlier sequences. Then, she observes a matrix with the diagonal cells indicating the event sequences from the four different sessions. Wondering what the most common starting events are in each session, she unfolds all the columns in the matrix (Figure 8(a)). Linda observes that no matter what the session is, Deeds_Es and TextEditor_Es have relatively higher numbers of sequences based on the column bar chart, indicating that most students start their exercises by using Deeds and editing text. Further, Linda notices that the Other bars under SessionGate_Start and SessionCounter_Start are higher than those of the other two sessions. Knowing that session Gate is the easiest and session Counter is the hardest, Linda hypothesizes that students may finish exercises in session Gate quickly and then do something irrelevant to the course, such as surfing the Internet (i.e., the event Other). For exercises in session Counter, which require students to analyze counters consisting of many kinds of flip-flops, students may need to browse web pages to search for basic information about these flip-flops.
Now Linda wants to confirm whether students switch among different software tools. She focuses on the session Arithmetic because this session contains the most sequences as shown in the bar chart (Figure 8(a)). To drill down along the sequences and study the most common behaviors, she progressively expands the columns with the largest numbers of sequences at different tree levels under the session Arithmetic. Linda observes that some sequences end by alternating between Deeds_Es and TextEditor_Es (Figure 8(b)), suggesting that students need to frequently switch between Deeds and a text editor. Linda further investigates other sessions and also finds such alternating behaviors. She concludes that it is common for students to switch among several software tools when doing exercises. Therefore, she would like to suggest that the Deeds development team integrate a lightweight text reader and text editor into the current system.
Next, Linda wonders if students perform differently in different sessions. Both bar charts next to the matrix columns and rows indicate that the Arithmetic and Flipflop sessions contain the fewest and the most sequences among the four sessions, respectively. Thus, she selects the two corresponding cells for comparison, namely (SessionArithmetic_Start, SessionArithmetic_End) and (SessionFlipflop_Start, SessionFlipflop_End), which represent all sequences from the two sessions (Figure 3(c)).
Then, Linda shifts her focus to the Comparison View, which uses the “Map2D_MaxFill” layout by default (Figure 3(d)). She clicks the patterns containing only orange sequences, which belong to the Arithmetic session, and discovers that they mainly consist of Deeds_Es, TextEditor_Es, and Study_Es (Figure 3(e)). This is plausible because the Arithmetic session focuses on arithmetic circuit design, which requires students to draw the schematics of circuits in Deeds. Moreover, she observes that some patterns consisting of only blue sequences (from the Flipflop session) are clustered at the bottom left of the view (Figure 3(d)). By clicking some of them, Linda discovers that these patterns contain Diagram and Properties as key events (Figure 3(f)). This is mainly because exercises in the Flipflop session require students to frequently change input properties during simulations.
5.3 Case III: investigating website clickstreams
This dataset collects customers’ clicks when they shop on a real-world E-commerce website. Customers can view items, then add them to the cart, and finally complete a transaction. According to the category tree [ecommerce], customers might view items from the same, parent, or child categories, or two consecutively viewed items can be unrelated to each other. Hence, a view event is defined by the relationship between the category of the currently viewed item and that of the previous one, such as View_Parent, View_Child, View_Brother, and View_Other. Overall, the dataset contains 4637 events with 7 types. A sequence is a list of consecutive events made by one customer, which results in 820 sequences with the average length . Assume that Sam is a web analyst of this E-commerce platform and he would like to get some ideas about how customers shop on this website.
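The event definition can be illustrated with a small sketch. Several details here are assumptions rather than facts from the dataset: the category tree is given as a child-to-parent dictionary, View_Brother is taken to mean that two consecutive items share a category, the first view of a visit is labeled plainly as View, and the function names `relation` and `label_views` are hypothetical.

```python
from typing import Dict, List, Optional


def relation(prev_cat: str, cur_cat: str, parent_of: Dict[str, Optional[str]]) -> str:
    """Label a view by how its category relates to the previous one."""
    if cur_cat == prev_cat:
        return "View_Brother"  # assumption: same category as the previous item
    if parent_of.get(prev_cat) == cur_cat:
        return "View_Parent"   # moved up to the parent category
    if parent_of.get(cur_cat) == prev_cat:
        return "View_Child"    # moved down to a child category
    return "View_Other"        # otherwise unrelated categories


def label_views(categories: List[str], parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Turn the categories of consecutively viewed items into event types."""
    if not categories:
        return []
    labels = ["View"]  # assumption: the first view has no previous item
    for prev_cat, cur_cat in zip(categories, categories[1:]):
        labels.append(relation(prev_cat, cur_cat, parent_of))
    return labels


# Hypothetical category tree: child -> parent (None marks a root).
parent_of = {"electronics": None, "phones": "electronics", "garden": None}
viewed = ["phones", "phones", "electronics", "phones", "garden"]
labels = label_views(viewed, parent_of)
```

Under these assumptions, a customer who keeps comparing items inside one category produces a run of View_Brother events, while one who jumps across unrelated categories produces a run of View_Other events.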
Initially, the Matrix View shows a matrix with most cells colored black. Sam first expands the column titled View, and notices that View_Other and View_Brother are dominant at the second level based on the column bar charts, and that the View_Brother bar is higher than the View_Other bar (Figure 9(a)). This indicates that most people start their exploration among items in the same category. Sam keeps expanding View_Brother iteratively, and observes that the View_Brother bars are always the highest (Figure 9(a)), reflecting that the main event following View_Brother is still View_Brother. Next, he unfolds the View_Other column from the second level in the same way and finds that the most frequent event following View_Other is View_Other. These observations show two distinct browsing behaviors, which indicate two kinds of customers. One group tends to compare many similar items and may hope to find the best one, while the other group tends to view items from unrelated categories.
In addition, Sam observes an interesting phenomenon: some customers end their visits with Add to Cart, which means they do not check out in the end. Thus, Sam would like to discover the difference between these customers and those who complete a transaction. From the Matrix View, he selects two darker cells, (View, Add to Cart) and (View, Transaction), and shifts his focus to the Comparison View, where all patterns are positioned by the “Map2D_MaxFill” layout (Figure 9(b)). In the top left corner, Sam finds a cluster of patterns containing only the blue sequences from the cell (View, Transaction). He clicks some of the patterns and finds that View_Other is the dominant event (Figure 9(b1)). Meanwhile, a cluster of patterns at the bottom with many orange sequences from the cell (View, Add to Cart) attracts his attention. He explores these patterns and notices that View_Brother is the major event (Figure 9(b2)). Sam infers that customers who explore similar items are those who cannot make up their minds about buying things; on the contrary, those who have clear targets, and usually switch between different categories, are likely to check out in the end.
Although the three case studies from various domains have demonstrated the effectiveness of ICE in visually summarizing event sequence datasets and comparing sets of sequences at different granularities, the current prototype still has limitations.
First, although the Matrix View scales with the number of sequences, it may not scale well when the cardinality of the event sequence data is large, i.e., when there are many event types. Too many event types result in a large matrix visualization. Also, the node expanding and collapsing operations on the matrix columns and rows may overwhelm analysts when the trees are deep. However, categorizing events based on their attributes might alleviate this problem.
Another limitation of the matrix concerns its capability to select sequence sets with more complicated criteria. The current ICE design focuses on the identification of sequences according to their prefixes and suffixes. In future research, we plan to explore visualizations that allow for identifying sequences based on event attributes, timestamps, time intervals, etc. Further, supporting event query specifications beyond prefix and suffix, such as regular expressions as in (s|qu)eries [Zgraggen2015], would greatly increase the flexibility of the matrix visualization. That is, each matrix row or column could be a specification, and each cell would represent all the sequences satisfying the criteria of both the row and the column.
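To make the idea of a regular-expression specification concrete, the following sketch selects sequences with Python's standard `re` module. It does not describe how (s|qu)eries or ICE works: encoding each event type as one private-use Unicode character is simply one convenient way to reuse a string regex engine, and the function names `build_alphabet`, `encode`, and `select` are hypothetical.

```python
import re
from typing import Dict, List, Pattern


def build_alphabet(sequences: List[List[str]]) -> Dict[str, str]:
    """Map every event type to a single private-use Unicode character."""
    event_types = sorted({event for seq in sequences for event in seq})
    return {t: chr(0xE000 + i) for i, t in enumerate(event_types)}


def encode(sequence: List[str], alphabet: Dict[str, str]) -> str:
    """Encode a sequence as a string with one character per event."""
    return "".join(alphabet[event] for event in sequence)


def select(sequences: List[List[str]], pattern: Pattern[str],
           alphabet: Dict[str, str]) -> List[List[str]]:
    """Keep the sequences whose encoding matches the pattern somewhere."""
    return [s for s in sequences if pattern.search(encode(s, alphabet))]


sequences = [
    ["West Brom Start", "Free Kick Won", "Attempt", "West Brom End"],
    ["West Brom Start", "Free Kick Won", "Foul", "Attempt", "West Brom End"],
    ["Man City Start", "Corner", "Attempt", "Man City End"],
    ["Man City Start", "Attempt", "Free Kick Won", "Man City End"],
]
alphabet = build_alphabet(sequences)
# Specification: Free Kick Won, then any events, then Attempt.
pattern = re.compile(
    re.escape(alphabet["Free Kick Won"]) + ".*" + re.escape(alphabet["Attempt"])
)
matches = select(sequences, pattern, alphabet)
```

A row or column of the matrix could hold one such specification, and a cell would then contain the sequences returned by both the row's and the column's selection.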
Third, unit visualization may suffer from three main scalability issues: computational scalability, display scalability, and perceptual scalability [Park2017]. The Comparison View shares all these scalability issues, but they can be mitigated by adjusting the computation process in two ways. On one hand, an analyst can choose to perform random sampling on the entire dataset in the Matrix View to limit the volume of input data. However, random sampling may not be the ideal option because it ignores the characteristics of the datasets, e.g., the distribution of sequences. On the other hand, an analyst can raise the “support” value of the maximal pattern mining algorithm [Fournier-Viger2014a] in the Comparison View, which generally yields fewer patterns. However, this prevents an analyst from “seeing” all the patterns, which may result in missed opportunities.
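The two adjustments can be illustrated with a small sketch. It is not the maximal pattern mining algorithm of [Fournier-Viger2014a]: it only computes the relative support of given candidate patterns (the fraction of sequences that contain the pattern as a subsequence) and filters them by a threshold. The function names, the candidate patterns, and the toy sequences are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def contains(sequence: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if pattern occurs in sequence as a (possibly gapped) subsequence."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the matched event.
    return all(event in remaining for event in pattern)


def support(pattern: Sequence[str], sequences: List[List[str]]) -> float:
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def filter_by_support(candidates: List[Tuple[str, ...]],
                      sequences: List[List[str]],
                      min_support: float) -> List[Tuple[str, ...]]:
    """Keep only the candidates whose support reaches the threshold."""
    return [p for p in candidates if support(p, sequences) >= min_support]


def sample_sequences(sequences: List[List[str]], k: int, seed: int = 0) -> List[List[str]]:
    """Uniform random sample; it ignores how sequences are distributed."""
    return random.Random(seed).sample(sequences, k)


sequences = [
    ["Study_Es", "Deeds_Es", "TextEditor_Es"],
    ["Deeds_Es", "TextEditor_Es", "Deeds_Es"],
    ["Study_Es", "Deeds_Es"],
    ["Other", "Deeds_Es", "TextEditor_Es"],
]
candidates = [
    ("Deeds_Es", "TextEditor_Es"),
    ("Study_Es", "Deeds_Es"),
    ("TextEditor_Es", "Deeds_Es"),
]
kept_low = filter_by_support(candidates, sequences, 0.5)
kept_high = filter_by_support(candidates, sequences, 0.75)
subset = sample_sequences(sequences, 2)
```

Raising the threshold from 0.5 to 0.75 drops one more candidate in this toy data, which mirrors the trade-off described above: fewer units to draw, at the cost of hiding less frequent patterns.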
There also exist limitations of our study. Although we have applied ICE to three real-world datasets to demonstrate its generalizability to various domain applications, deployment studies are required to further verify our conclusions. Further, controlled user studies need to be conducted to validate our choices and to better understand the pros and cons of each design alternative in Figure 2.
There are a number of promising directions for extending our visualization techniques. First, the Matrix View can be applied to other tasks, such as origin-destination analysis, where the multi-level matrix can show the traffic volume at different scales. We notice that MapTrix [Yang2017] employs a matrix to show the flow volume between origins and destinations. However, it does not support exploration at multiple levels. Second, some of our discussion of the design space can be generalized to guide other work in applying the unit visualization technique to visual comparison, such as the advantages and disadvantages of combining two layouts in comparison tasks. Since we only discuss one configuration of unit visualization, i.e., pattern layouts are uniform while sequence layouts are non-uniform, we plan to address other configurations in the future.
7 Conclusion and future work
We have introduced an interactive visualization, called ICE, for summarizing an event sequence dataset based on prefixes and suffixes, and for helping analysts identify promising sets of event sequences to compare at both the pattern and sequence levels. To design ICE, we have performed a task analysis based on Brehmer and Munzner’s typology [Brehmer2013]. To support the comparison task, we have explored the design space of employing the unit visualization technique in our scenario. Moreover, we have described three case studies to illustrate the effectiveness and usefulness of ICE with real-world datasets in three different application domains.
In the future, we plan to enhance the matrix-based visualization by encoding attributes, such as event timestamps and time intervals, on the columns and rows, as well as by supporting more complicated event specifications in addition to prefix and suffix. Analysts would then be able to identify and select interesting sets of sequences based on both event attributes and cardinality in a richer manner. Next, to verify the hypotheses in our discussion of the design space (Figure 2), we plan to conduct empirical studies to evaluate user performance with different design alternatives. Finally, we are interested in applying ICE to other comparison tasks in different application domains.