Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining (Extended Version)

08/11/2016 ∙ by Chitta Ranjan, et al. ∙ Georgia Institute of Technology

The ubiquitous presence of sequence data across fields such as the web, healthcare, bioinformatics, and text mining has made sequence mining a vital research area. However, sequence mining is particularly challenging because of the difficulty in finding the (dis)similarity/distance between sequences: a distance measure is not obvious due to their unstructuredness---arbitrary strings of arbitrary length. Feature representations, such as n-grams, are often used, but they either compromise on extracting both short- and long-term sequence patterns or incur a high computation cost. We propose a new function, Sequence Graph Transform (SGT), that extracts short- and long-term sequence features and embeds them in a finite-dimensional feature space. Importantly, SGT has a low computation cost and can extract any amount of short- to long-term patterns without any increase in computation, which we also prove theoretically in this paper. As a result, SGT yields significantly higher accuracy and lower computation than existing methods. We show this via several experiments and via SGT's real-world applications to clustering, classification, search and visualization.


1. Introduction

A sequence can be defined as a contiguous chain of discrete alphabets, where an alphabet can be an event, a value or a symbol, sequentially tied together in a certain order. Sequences are one of the most common data types found in diverse fields such as social science, the web, healthcare, bioinformatics, marketing and text mining. Some examples of sequences are: web logs, music listening histories, patient movements through hospital wards, and protein sequences in bioinformatics.

The omnipresence of sequence data has made the development of new sequence mining methods important. Some examples of its applications include: a) understanding users’ behavior from their web surfing and buying sequences to serve them better advertisements, product placements, promotions, and so on, b) assessing process flows (sequences) in a hospital to find the expected patient movement based on diagnostic profiles, to better optimize hospital resources and services, and c) analysis of biological sequences to understand human evolution, physiology and diseases.

A fundamental requirement for data mining is the ability to measure the (dis)similarity between objects, often translated as measuring a distance between them. In sequence mining, sequences are the objects, but finding the distances between them is challenging because they are unstructured—arbitrary strings of arbitrary length. To mitigate this challenge, feature representations of sequences are often used. For example, an n-gram method extracts sequence features by looking at the substrings up to length n and embeds them in a feature vector; in an n-order Markov model, the sequence features are represented by the transition probability matrix.

However, in addition to other limitations (discussed in §1.1), most of the existing methods either limit themselves to extracting only short-term patterns or suffer from exponentially increasing computation when extracting long-term patterns. For instance, in the abovementioned n-gram and n-order Markov models, n is kept small due to computational limitations and, thus, they are unable to capture long-term patterns.
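To make this baseline concrete, the following is a minimal sketch (not the method proposed in this paper) of building character n-gram count features with scikit-learn; the toy sequences and the choice of n up to 3 are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sequences written as strings of alphabets; counts of all substrings of
# length 1 to 3 form each sequence's feature vector.
sequences = ["BAABCCABB", "ABCBABAB", "CCABCACC"]
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 3))
X = vectorizer.fit_transform(sequences)  # shape: (num sequences, num distinct n-grams)
print(X.shape, vectorizer.get_feature_names_out()[:5])
```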

Figure 1. SGT’s unique property.

In this paper, we develop a new function, Sequence Graph Transform (SGT), that extracts the short- and long-term sequence features without any increase in the computation. As depicted in Fig. 1, SGT can extract any amount of short- to long-term sequence patterns just by tuning a single hyperparameter. This unique property of SGT removes the computation limitation; it enables us to tune the amount of short- to long-term patterns that will be optimal for a given sequence mining problem. Additionally, SGT yields a finite-dimensional feature space that can be used as a vector for implementing any mainstream data mining method, and as a graph for applying graph mining methods and interpretable visualizations.

These properties lead to a significantly higher sequence data modeling accuracy with lower computation. We theoretically prove the SGT properties, and experimentally and practically validate its efficacy. In the following, §1.1–1.2, we discuss the related work and the problem specification in more detail.

1.1. Related Work

Early research works used edit-distances between sequences after alignment. Methods for global alignment and local alignment, with or without overlapping, were developed by (Needleman and Wunsch, 1970; Smith and Waterman, 1981). Based on these methods, heuristic approaches were proposed for larger datasets (Edgar, 2004, 2010; Fu et al., 2012). These methods mainly focus on bioinformatics sequence problems and lack general applicability due to difficulty in tuning, high computational complexity, and inability to work on sequences with significantly varying lengths. Additionally, these methods do not provide any feature representation of sequences.

More universally applicable and relatively powerful methods broadly work on one of the following assumptions: a) the sequence process has an underlying parametric distribution, b) similar sequences have common substrings, and c) a sequence evolves from hidden strings.

The parametric methods typically make Markovian distribution assumptions (more specifically, first-order Markovian) on the sequence process (Cadez et al., 2003; Ranjan et al., 2015). However, such distributional assumptions are not always valid. General n-order Markov models have also been proposed but are not popular in practice due to high computation. Beyond them, Hidden Markov model-based approaches are popular in both bioinformatics and general sequence problems (Remmert et al., 2012; Helske and Helske, 2016). These assume a hidden layer of latent states that results in the observed sequence. Due to the multi-layer setting, the first-order property is not transmitted to the observed sequence. However, tuning an HMM (finding the optimal hidden states) is difficult and computationally intensive, thus affecting HMM’s generality and scalability.

N-gram methods (also known as k-mer methods in the bioinformatics area) are the most popular approaches that work on the second assumption (Tomović et al., 2006). Although the premise of this assumption seems appropriate, the optimal selection of the substring length, i.e., n in n-gram or k in k-mer, is difficult. In sequence mining, selecting a small value for n can lead to inclusion of noise, but increasing it severely increases the computation. Some other variants, such as spaced words and adaptive n, are more difficult to optimize (Comin and Verzotto, 2012).

Another class of methods hypothesizes that sequences are generated from some evolutionary process in which a sequence is produced by reproducing complex strings from simpler substrings ((Siyari et al., 2016) and references therein). These methods solve an NP-hard optimization problem to identify the underlying evolution hierarchy and the corresponding substrings. These substrings can also be used as features for sequence data mining. However, the estimation algorithms for this and similar methods are heuristics that usually do not guarantee optimality. The algorithms can also lead to several solutions, which causes identifiability and ambiguity issues. Moreover, the evolutionary assumption may not always hold.

The above methods either limit the extent of sequence pattern extraction due to restrictive assumptions or search for (hidden) states or strings in an unobservable universe. This causes limited accuracy and/or computational issues.

Besides these methods, Prefixspan (Han et al., 2001) is another sequence pattern mining approach, but it works on a different type of sequence in which the sequence is a list of elements and each element consists of a set of items, e.g. ⟨a(abc)(ac)d(cf)⟩. For the sequence problems addressed here, Prefixspan’s performance will be similar to n-grams.

Sequence mining problems have also received attention from the deep learning research community. Embedding spaces for sequences have been proposed using Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) networks (Graves, 2013). However, the dimension of these embeddings is typically large, and finding the optimal dimension and embeddings requires solving rigorous optimization problems in a deep learning network. Training such models is computationally intensive, sometimes not interpretable, and requires a large amount of training data.

1.2. Problem Specification

The related methods discussed above fail to address at least one of the following challenges: a) feature mapping: effective extraction of sequence characteristics into a finite-dimensional feature space (a vector), b) universal applicability: requiring no distributional or domain-specific assumptions and only a small number of tuning hyper-parameters, and c) scalability: the computational complexity should be small with respect to the sequence length, the size of the database, and the alphabet set.

We propose a new sequence feature extraction function, Sequence Graph Transform (SGT), that addresses all of the above challenges and is shown to outperform existing state-of-the-art methods in sequence data mining. SGT works by quantifying the pattern in a sequence by scanning the positions of all alphabets relative to each other. We call it a graph transform because of its inherent property of interpretation as a graph, where the alphabets form the nodes and a directed connection between two nodes shows their “association.” These “associations” between all alphabets represent the signature features of a sequence. A Markov model transition matrix can be compared analogously with the SGT’s feature space; however, among other differences (explored further in the paper), the associations (graph edges) do not represent a probability, and SGT is non-parametric. The non-parametric property also makes it robust to any underlying sequence generation distribution.

Broadly, sequence analysis problems can be divided into: a) length-sensitive: the inherent patterns, as well as the sequence lengths, should match to render two sequences as similar, e.g., in protein sequence clustering; and b) length-insensitive: the inherent patterns should be similar, irrespective of the lengths, e.g., weblog comparisons. In contrast with the existing literature, SGT provides a solution for both scenarios. The advantage of this property becomes more pronounced when we have to perform both types of analysis on the same data and implementing different methods for each becomes cumbersome.

In this paper, our major contribution is the development of a new feature extraction function, SGT, for sequences. In the following, we develop SGT, and provide its theoretical support and extensions. We perform an extensive experimental evaluation, and show that SGT bridges the gap between sequence mining and mainstream data mining through implementation of fundamental methods, viz. PCA, k-means, SVM and graph visualization via SGT on real sequence data analysis.

2. Sequence Graph Transform (SGT)

2.1. Overview and Intuition

Figure 2. Showing “effect” of elements on each other.

By definition, a sequence can be either feed-forward or “undirected.” In a feed-forward sequence, events (alphabet instances) occur in succession; e.g., in a clickstream sequence, the click events occur one after another in a “forward direction.” On the other hand, in an “undirected” sequence, the directional or chronological order of alphabet instances is not present or not important. In this paper, we present SGT for feed-forward sequences; SGT for undirected sequences is given as an extension in §2.5.1.

For either of these sequence types, the developed Sequence Graph Transform works on the same fundamental premise—the relative positions of alphabets in a sequence characterize the sequence—to extract the pattern features of the sequence. This premise holds true for most sequence mining problems because similarity between sequences is often measured based on the similarity of the patterns formed by their alphabet positions.

In the following, we illustrate and develop the SGT’s feature extraction approach for a feed-forward sequence and later extend it to “undirected” sequences.

Fig. 2 shows an illustrative example of a feed-forward sequence. In this example, the presence of the alphabet at positions 5 and 8 should be seen in context with, or as a result of, all of its predecessors. To extract the sequence features, we take the relative positions of one alphabet pair at a time. For example, the relative positions for the pair (,) are {(2,3),5} and {(2,3,6),8}, where the values in the position set for are the ones preceding . In the SGT procedure defined and developed in the following sections (§2.3–2.4), the sequence features are shown to be extracted from this position information.

(a) Feature extracted as a vector with a graph interpretation.
(b) Use of sequences’ SGT features for data mining.
Figure 3. SGT overview.

These extracted features are an “association” between and , which can be interpreted as a connection feature representing “ leading to .” We should note that “ leading to ” will be different from “ leading to .” The associations between all alphabets in the alphabet set, denoted as , can be extracted similarly to obtain sequence features in a -dimensional feature space.

This is similar to the Markov probabilistic models, in which the transition probability of going from to is estimated. However, SGT is different because the connection feature 1) is not a probability, and 2) takes into account all orders of relationship without any increase in computation.

Besides, the SGT features also make it easy to visualize the sequence as a directed graph, with sequence alphabets in as graph nodes and the edge weights equal to the directional association between nodes. Hence, we call it a sequence graph transform. Moreover, we show that under certain conditions, the SGT also allows node (alphabet) clustering.

A high-level overview of our approach is given in Fig. 3(a)-(b). In Fig. 3(a), we show that applying SGT on a sequence yields a finite-dimensional SGT feature vector for the sequence, which can also be interpreted and visualized as a directed graph. For a general sequence data analysis, SGT can be applied on each sequence in a data corpus, as shown in Fig. 3(b), to yield a finite and equal dimensional representation for each sequence. This facilitates a direct distance-based comparison between sequences and thus makes application of mainstream data mining methods for sequence analysis rather straightforward.

2.2. Notations

Suppose we have a dataset of sequences denoted by . Any sequence in the dataset, denoted by (), is made of alphabets in set . A sequence can have instances of one or many alphabets from . For example, sequences from a dataset, , made of alphabets in (suppose) can be , . The length of a sequence, , denoted by, , is equal to the number of events (in this paper, the term “event” is used for an alphabet instance) in it. In the sequence, will denote the alphabet at position , where and .

As mentioned in the previous section, we extract a sequence ’s features in the form of “associations” between the alphabets, represented as , where , are the corresponding alphabets and is a function of a helper function . is a function that takes a “distance,” , as an input and as a tuning hyper-parameter.

2.3. SGT Definition

As also explained in §2.1, SGT extracts the features from the relative positions of events. A quantification of the “effect” from the relative positions of two events in a sequence is given by , where are the positions of the events and is a distance measure. This quantification is an effect of the preceding event on the later event. For example, see Fig. 4(a), where and are at positions and , and the directed arc denotes the effect of on .

For developing SGT, we require the following conditions on : a) strictly greater than 0: ; b) strictly decreasing with : ; and c) strictly decreasing with : .

The first condition is to keep the extracted SGT feature, , easy to analyze and interpret. The second condition strengthens the effect of closer neighbors. The last condition helps in tuning the procedure, allowing us to change the effect of neighbors.

There are several functions that satisfy the above conditions, e.g., Gaussian, inverse and exponential. We take $\phi$ as an exponential function because it yields interpretable results for the SGT properties (§2.4):

$\phi_{\kappa}(d) = e^{-\kappa d}, \quad d > 0,\; \kappa > 0$   (1)

(a)

(b)
Figure 4. Visual illustration of the effect of alphabets’ relative positions.

In a general sequence, we will have several instances of an alphabet pair. For example, see Fig. 4(b), where there are five pairs, and an arc for each pair shows an effect of on . Therefore, the first step is to find the number of instances of each alphabet pair. The instances of alphabet pairs are stored in an asymmetric matrix, $\Lambda$, whose entry $\Lambda_{uv}$ holds all instances of the alphabet pair $(u, v)$ such that, in each pair instance, $v$'s position is after $u$'s.

$\Lambda_{uv}(s) = \{(l, m) :\ s_l = u,\ s_m = v,\ l < m\}$, where $s_l$ denotes the alphabet at position $l$ of sequence $s$.   (2)

After computing from each pair instance for the sequence, we define the “association” feature as a normalized aggregation of all instances, as shown below in Eq. 3a-3b. Here, is the size of the set , which is equal to the number of pair instances. Eq. 3a gives the feature expression for a length-sensitive sequence analysis problem because it also contains the sequence length information within it (proved with a closed-form expression under certain conditions in §2.4). In Eq. 3b, the length effect is removed by standardizing with the sequence length for length-insensitive problems.

(3a)
(3b)

and is the SGT feature representation of sequence .

For illustration, the SGT feature for alphabet pair in sequence in Fig. 2 can be computed as (for in length-sensitive SGT): and .

The features, , can be either interpreted as a directed “graph,” with edge weights, , and nodes in , or vectorized to a -vector denoting the sequence in the feature space.
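To make the definition concrete, below is a minimal sketch of the association feature for a single alphabet pair, assuming the exponential φ of Eq. 1 and reading the standardization in Eq. 3b as a division by the sequence length; the sequence, pair and κ are illustrative.

```python
import numpy as np

def sgt_pair_feature(seq, u, v, kappa=1.0, length_sensitive=True):
    # All position pairs (l, m) with seq[l] == u, seq[m] == v and l < m.
    pairs = [(l, m) for l in range(len(seq)) for m in range(l + 1, len(seq))
             if seq[l] == u and seq[m] == v]
    if not pairs:
        return 0.0
    w = sum(np.exp(-kappa * (m - l)) for l, m in pairs)  # aggregated effects
    psi = w / len(pairs)                # normalize by the number of pair instances (Eq. 3a)
    if not length_sensitive:
        psi /= len(seq)                 # remove the length effect (Eq. 3b, assumed form)
    return psi

print(sgt_pair_feature("BAABCCABB", "A", "B", kappa=5))
```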

2.4. SGT properties

In this section, we show SGT’s property of capturing both short- and long-term sequence pattern features. This is shown via closed-form expressions for the expectation and variance of the SGT feature under an assumption. Note that the assumption defined below is only for deriving an interpretable expression and is not required in practice.

Assumption 1.

A sequence of length with an inherent pattern: , occurs closely together with in-between stochastic gap as , and the intermittent stochastic gap between the pairs as , such that, (See Fig. 5). and characterize the short- and long-term patterns, respectively.

Figure 5. Representation of short- and long-term pattern.
Theorem 1.

The expectation and variance of the SGT feature have closed-form expressions under Assumption 1, which show that it captures both the short- and long-term patterns present in a sequence, in both the length-sensitive and length-insensitive SGT variants.

(4)
(5)

where,

(6)

and, , .

Proof.

See Appendix A. ∎

As we can see in Eq. 4, the expected value of the SGT feature is proportional to the term . The numerator of contains the information about the short-term pattern, and its denominator has the long-term pattern information.

In Eq. 6, we can observe that if either of (the closeness of and in the short-term) and/or (the closeness of and in the long-term) decreases, will increase, and vice versa. This emphasizes two properties: a) the SGT feature, , is affected by changes in both short- and long-term patterns, and b) increases when becomes closer in the short or long range in the sequence, providing an analytical connection between the observed pattern and the extracted feature. Besides, it also proves the graph interpretation of SGT: that denotes the edge weight for nodes and (in the SGT-graph) increases if closeness between increases in the sequence, meaning that the nodes become closer in the graph space (and vice versa).

Furthermore, the length-sensitive SGT feature expectation in Eq. 4 contains the sequence length, . This shows that the SGT feature has the information of the sequence pattern, as well as the sequence length. This enables an effective length-sensitive sequence analysis because sequence comparisons via SGT will require both patterns and sequence lengths to be similar.

In the length-insensitive SGT feature expectation in Eq. 4, it is straightforward to show that it becomes independent of the sequence length as the length increases. Therefore, as sequence length, , increases, the SGT feature approaches a constant, given as .

Besides, it is shown in Appendix A that the expected value of the SGT feature becomes independent of the sequence length at a rate inversely proportional to the length. In our experiments, we observe that the SGT feature approaches a length-invariant constant when .

(7)

Furthermore, if the pattern variances in the above scenario are small, the hyperparameter allows regulating the feature extraction: a higher value reduces the effect of long-term patterns, and vice versa.

The properties discussed above play an important role in SGT’s effectiveness. Due to these properties, unlike the methods discussed in §1.1, SGT can capture higher orders of relationship without any increase in computation. Besides, SGT can effectively find sequence features without the need for any hidden string/state(s) search.

2.5. Extensions of SGT

2.5.1. Undirected sequences

SGT can be extended to work on undirected sequences. In such sequences, the directional pattern or directional relationships (as in feed-forward) are not important. In other words, it is immaterial whether occurs before or after ; occurring closely (or farther) is important. From SGT’s operation standpoint, we have to remove the condition, , from Eq. 2, denoted by .

It is easy to show that and

(8)

where and are given in Eq. 2 and Eq. 3a-3b, respectively (see Appendix B for proof).

Moreover, for sequences with a uniform marginal distribution of element occurrences, the feed-forward SGT will be close to symmetric; thus, the undirected sequence graph can be approximated from it. In practice, this approximation is useful in most cases.
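A minimal sketch of the undirected variant, assuming the only change from the feed-forward case is dropping the ordering condition, so that every co-occurrence of the two alphabets contributes an effect based on their absolute position gap:

```python
import numpy as np

def undirected_pair_feature(seq, u, v, kappa=1.0):
    # The ordering condition l < m is removed: (u, v) co-occurrences count both ways.
    pairs = [(l, m) for l in range(len(seq)) for m in range(len(seq))
             if l != m and seq[l] == u and seq[m] == v]
    if not pairs:
        return 0.0
    return sum(np.exp(-kappa * abs(m - l)) for l, m in pairs) / len(pairs)
```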

2.5.2. Alphabet clustering


(a) are closer than .

(b) Corresponding SGT’s Graph view.
Figure 6. Illustrative sequence example for alphabet clustering.

Node clustering in graphs is a classical problem solved by various techniques, including spectral clustering, graph partitioning and others. SGT’s graph interpretation facilitates grouping of alphabets that occur closely via any of these node clustering methods.

This is because SGT gives larger weights to the edges, , corresponding to alphabet pairs that occur closely. For instance, consider the sequence in Fig. 6(a), in which occurs closer to than , also implying . Therefore, in this sequence’s SGT, the edge weight for should be greater than for , i.e. .

From Assumption 1 in §2.4, we will have (see Appendix C). Therefore, and , and due to Condition 2 on given in §2.3, if , then .

Moreover, for an effective clustering, it is important to bring the “closer” alphabets in the sequence more close in the graph space. In the SGT’s graph interpretation, it implies that should go as high as possible to bring closer to in the graph and vice versa for . Thus, effectively, should be increased. It is proved in Appendix C that will increase with , if , where we have .

Thus, an SGT can represent a sequence as a graph with its alphabets connected with weighted edges, which enables clustering of associated alphabets.
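As an illustration of this use, the sketch below treats a symmetrized SGT matrix as a weighted affinity graph over the alphabet nodes and applies off-the-shelf spectral clustering; the symmetrization by averaging and the use of scikit-learn are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_alphabets(psi, alphabet, n_clusters=2):
    # SGT edge weights act as node affinities: a larger weight means the
    # two alphabets occur more closely in the sequences.
    affinity = (psi + psi.T) / 2.0
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(affinity)
    return dict(zip(alphabet, labels))
```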

3. SGT Algorithm

1:A sequence, , alphabet set, , and .
2:Initialize:
3:      , and length,
4:for  do
5:     for  do
6:          
7:          
8:          where,
9:     end for
10:     
11:end for
12:if  then
13:     
14:end if
15:;
Algorithm 1 Parsing a sequence to obtain its SGT.

We have devised two algorithms for SGT. The first algorithm (see Algorithm 1) is faster for , while the second (see Algorithm 2) is faster for . Their time complexities are and , respectively. The space complexity is . However, in most datasets, not all alphabets in are present in a sequence, resulting in a sparse SGT features representation. In such cases, the complexity reduces by a factor of the sparsity level.

Additionally, as is also evident from Fig. 3(b), the SGT operation on any sequence in a dataset is independent of the others. This means we can easily parallelize the SGT operation on the sequences in a dataset to reduce the computation time.

The resulting SGT for the sequence, , will be a matrix, which can be vectorized (size, ) for use in distance-based data mining methods, or it can be used as is for visualization and interpretation purposes. Note that we output the root as the final SGT features as it makes the SGTs easy to interpret and comparable for any .
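For concreteness, the following is a minimal, unoptimized sketch of the full transform consistent with the description above: a pairwise scan, normalization by the pair counts, an optional length standardization, and a final κ-th root (our reading of the “root” mentioned here). It is a sketch only, not the paper's exact Algorithm 1 or 2.

```python
import numpy as np

def sgt_transform(seq, alphabet, kappa=1.0, length_sensitive=True):
    idx = {a: i for i, a in enumerate(alphabet)}
    V = len(alphabet)
    W = np.zeros((V, V))   # aggregated effects: sum of exp(-kappa * gap)
    Z = np.zeros((V, V))   # number of pair instances
    L = len(seq)
    for l in range(L - 1):
        for m in range(l + 1, L):
            W[idx[seq[l]], idx[seq[m]]] += np.exp(-kappa * (m - l))
            Z[idx[seq[l]], idx[seq[m]]] += 1
    psi = np.divide(W, Z, out=np.zeros_like(W), where=Z > 0)
    if not length_sensitive:
        psi /= L                     # standardize by the sequence length
    return psi ** (1.0 / kappa)      # root taken as the output features

psi = sgt_transform("BAABCCABB", alphabet=list("ABC"), kappa=5)
features = psi.flatten()             # vectorized form for distance-based mining
```

Since each sequence's transform is independent of the others, a corpus can be processed in parallel, e.g., by mapping such a function over the sequences with multiprocessing.Pool.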

The optimal selection of the hyper-parameter will depend on the problem at hand. If the end objective is building a supervised learning model, methods such as cross-validation can be used. For unsupervised learning, any goodness-of-fit criterion can be used for the selection. In cases of multiple parameter optimization, e.g., optimizing the number of clusters and the hyper-parameter together in clustering, we can use a random search procedure. In such a procedure, we randomly initialize one parameter, compute the best value of the other based on some goodness-of-fit measure, then fix that value and find the best value of the first parameter, and repeat until there is no change. From our experiments on real and synthesized data, the results of SGT-based data mining are not sensitive to minor differences in the hyper-parameter. In our implementations, we selected it from a small grid of candidate values.
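A sketch of such an alternating search is given below, assuming k-means as the clustering method, the Davies-Bouldin index as the goodness-of-fit measure, and an illustrative grid of hyper-parameter values; sgt_features(kappa) is a hypothetical helper returning the corpus's SGT vectors for a given hyper-parameter value.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def fit_score(X, k):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    return davies_bouldin_score(X, labels)   # lower is better

def alternating_search(sgt_features, kappa_grid=(1, 5, 10), k_grid=range(2, 11), n_iter=5):
    kappa = kappa_grid[0]                    # arbitrary initialization
    for _ in range(n_iter):
        X = sgt_features(kappa)
        k = min(k_grid, key=lambda kk: fit_score(X, kk))                        # best k for fixed kappa
        kappa = min(kappa_grid, key=lambda kp: fit_score(sgt_features(kp), k))  # best kappa for fixed k
    return k, kappa
```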

1:A sequence, , alphabet set, , and .
2:function GetAlphabetPositions()
3:     
4:     for  do
5:          
6:     end for
7:     return positions
8:end function
9:Initialize:
10:      , and length, positions GetAlphabetPositions()
11:for  do
12:     
13:     for  do
14:          
15:          
16:          
17:          
18:     end for
19:     
20:end for
21:if  then
22:     
23:end if
24:SGT: ;
Algorithm 2 Extract SGT features by scanning alphabet positions of a sequence.

4. Experimental Analysis

Here we perform an experimental analysis to assess the performance of the proposed SGT. The most important motivation behind SGT is the need for an accurate method to find (dis)similarity between sequences. Therefore, to test SGT’s efficacy in finding sequence (dis)similarities, we built a sequence clustering experimental setup. A clustering operation requires accurate computation of (dis)similarity between objects and thus is a good choice for efficacy assessment.

We performed four types of experiments: a) Exp-1: length-insensitive with non-parametric sequence patterns, b) Exp-2: length-insensitive with parametric sequence patterns, c) Exp-3: a length-sensitive sequence problem, and d) Exp-4: alphabet clustering. The settings for each are given in Table 1. The same alphabet set is used for all sequences. Except for Exp-2, clustered sequences were generated such that sequences within a cluster share common patterns. Here, two sequences having a common pattern primarily means that the sequences have some common subsequences of any length, and these subsequences can be present anywhere in the sequence. The sequences also comprise other events, which can be either noise or some other pattern. This setting is non-parametric; however, the subsequences can also bring some inherent parametric properties, such as a mixture of Markov distributions of different orders. In Exp-2, clustered sequences were generated from a mixture of parametric distributions. In all the experiments, k-means with Manhattan distance was applied on the sequences’ SGTs (a sketch of this clustering step follows Table 1).

Experiment   Sequence length (mean, sd)   Noise     #clusters
Exp-1        116.4, 47.7                  35-65%    5
Exp-2        98.2, 108.3                  --        5
Exp-3        424.6, 130.6                 45-50%    5
Exp-4        103.9, 33.6                  30-50%    3
Table 1. Experimentation settings.
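The clustering step used throughout the experiments, k-means with Manhattan distance on the vectorized SGTs, is sketched below as a simple k-medians-style loop, since standard k-means implementations assume Euclidean distance; this is a simplified stand-in rather than the exact implementation used for the reported results.

```python
import numpy as np
from scipy.spatial.distance import cdist

def manhattan_kmeans(X, k, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = cdist(X, centers, metric="cityblock").argmin(axis=1)          # Manhattan assignment
        centers = np.array([np.median(X[labels == j], axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])                # per-cluster median update
    return labels
```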

In Exp-1, we compared SGT with commonly used sequence analysis techniques, viz. n-gram, mixture hidden Markov model (HMM), Markov model (MM) and semi-Markov model (SMM)-based clustering. For n-gram, we take different values of n and their combinations. Note that 1-gram is equivalent to the bag-of-words method. For these methods, we provided the known number of clusters to the algorithms. We use the F1-score as the accuracy metric.

In this experiment, we set different scenarios such that the overlap of the clusters’ “centroid” is increased. A high overlap between clusters implies that the sequences belonging to these clusters have a higher number of common patterns. Thus, separating them for clustering becomes difficult, and clustering accuracy is expected to be lower.

Exp-1’s results in Fig. 7(a)-(b) show the accuracy (F1-score) and the runtimes (in the legend) for each method, where SGT is seen to outperform all others in accuracy. MM and SMM have poorer accuracy because of the first-order Markovian assumption. HMM is found to have a comparable accuracy, but its runtime is more than six times that of SGT, demonstrating SGT’s superiority. The n-gram methods’ accuracies lie in between. Low-order n-grams have smaller runtimes than SGT but significantly lower accuracy. Interestingly, the 1-gram method is better when the overlap is high, showing the higher-order n-grams’ inability to distinguish between sequences when pattern overlap is high.

(a) F1 scores.
(b) Run times.
Figure 7. Exp-1 results.

Furthermore, we conducted Exp-2 to see the performance of SGT on sequence datasets having an underlying mixture of parametric distributions, viz. mixtures of HMM, MM and SMM. The objective of this experiment is to test SGT’s efficacy on parametric datasets against parametric methods. In addition to obtaining datasets from mixed HMM and first-order mixed MM and SMM distributions, we also generated second-order Markov (MM2) and third-order Markov (MM3) datasets. Fig. 8(a)-(b) shows the F1-scores and the runtimes (as bar labels). As expected, the mixture clustering method corresponding to the underlying distribution performs the best. Note that SMM is slightly better than MM in the MM setting because of its over-representative formulation, i.e., a higher-dimensional model that includes a variable time distribution. However, the proposed SGT’s accuracy is always close to the best. This shows SGT’s robustness to the underlying distribution and its universal applicability. And, again, its runtime is smaller than all others.

(a) F1 scores.
(b) Run times.
Figure 8. Exp-2 results.

In Exp-3, we compared SGT with length-sensitive algorithms, viz. MUSCLE, UCLUST and CD-HIT, which are popular in bioinformatics. These methods are hierarchical in nature, and thus, themselves find the optimal number of clusters. For SGT-clustering, the number of clusters is found using the random search procedure recommended in §3.

Fig. 9 shows the results, where the y-axis is the ratio of the estimated optimal number of clusters to the true number of clusters. The x-axis shows the clustering accuracy, i.e., the proportion of sequences assigned to the same cluster given that they were actually from the same cluster. For the best-performing algorithm, both metrics should be close to 1. As shown in the figure, CD-HIT and UCLUST overestimated the number of clusters by about two and five times, respectively. MUSCLE had a better estimate but only about 95% accuracy. On the other hand, SGT accurately estimated the number of clusters and achieved 100% clustering accuracy.

Figure 9. Exp-3 results.
Figure 10. Exp-4 results.

Finally, we validated the efficacy of the SGT extensions given in §2.5 in Exp-4. Our main aim in this validation is to perform alphabet clustering (§2.5.2). Additionally, we use the undirected SGT (§2.5.1). We set up a test experiment such that, across different sequence clusters, some alphabets occur closer to each other. We create a dataset that has sequences from three clusters and alphabets belonging to two clusters (alphabets A-H in one cluster and I-P in another).

This emulates a biclustering scenario in which sequences in different clusters have distinct patterns; however, the pattern of closely occurring alphabets is common across all sequences. This is a complex scenario in which clustering both sequences and alphabets can be challenging.

Upon clustering the sequences, the F1-score is found to be 1.0. For alphabet clustering, we applied spectral clustering on the aggregated SGT of all sequences, which yielded an accurate result with only one alphabet misclustered. Moreover, the heat map in Fig. 10 clearly shows that alphabets within the same underlying cluster have significantly higher associations. This validates that SGT can accurately cluster alphabets along with clustering the sequences.

5. Applications on Real Data

5.1. Clustering

Sequence clustering is an important application area across various fields. One important problem in this area is clustering user activity on the web (weblog sequences) to understand user behavior. This analysis can result in better services and design.

We took users’ navigation data on msnbc.com collected during a 24-hour period. The navigation data are weblogs that correspond to the page views of each user. The alphabets of these sequences are the events corresponding to a user’s page requests. These requests are recorded at the higher abstraction level of page category, e.g., frontpage, tech, which is representative of the structure of the website. We use a random sample of 100,000 sequences for our analysis. The sequences’ average length is 6.9 with a standard deviation of 27.3; the lengths range between 2 and 7440 with a skewed distribution.


Figure 11. Frequency distribution.

Our objective is to cluster the users with similar navigation patterns, irrespective of differences in their session lengths. We, therefore, take the length-insensitive SGT and use the random search procedure for optimal clustering described in §3. We performed k-means clustering with Manhattan distance and the db-index as the goodness-of-fit criterion, and found the optimal number of clusters and hyper-parameter value.

The frequency distribution (Fig. 11) of the number of members in each cluster has a long-tail—the majority of users belong to a small set of clusters. These clusters tell us the distinct behaviors of both the majority and minority user types.

5.2. Visualization

Effective visualization is a critical need for easy interpretation of data and its underlying properties. For example, in the above msnbc.com navigation data analysis, interpreting behavior of different user clusters is quite important.


(a) Cluster #1

(b) Cluster #3
Figure 12. Graphical visualization of cluster centroids.

SGT enables a visual interpretation of the different user behaviors. In Fig. 12(a)-(b), we show a graph visualization of some clusters’ centroids (which are in the SGT space), because a centroid represents the behavior of the users present in the cluster. We have filtered out edges with small weights for better visualization.
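The sketch below shows how such a centroid (a matrix in SGT space over the page categories) can be rendered as a directed graph after dropping weak edges, using networkx; the threshold and the names are illustrative, not the exact settings used for Fig. 12.

```python
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

def plot_centroid(centroid, page_categories, threshold=0.1):
    keep = np.where(centroid >= threshold, centroid, 0.0)   # filter out small edge weights
    G = nx.from_numpy_array(keep, create_using=nx.DiGraph)  # nonzero entries become edges
    G = nx.relabel_nodes(G, dict(enumerate(page_categories)))
    G.remove_nodes_from(list(nx.isolates(G)))
    widths = [d["weight"] for _, _, d in G.edges(data=True)]
    nx.draw_networkx(G, width=widths, arrows=True)
    plt.show()
```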

Fig. 12(a) shows the centroid for the first cluster, which contains the highest membership (12%) and thus indicates the “most” general behavior. This cluster’s users’ behavior is centered around frontpage and misc, with users tending to navigate between frontpage, misc, weather, opinion, news, travel and business at different levels.

Fig. 12(b) shows another majority cluster, with about 7.5% membership. This group of users seems to have a liking for sports. They primarily visit sports-related pages (the box around the sports node indicates a self-visiting edge), and also move back and forth between sports and frontpage, travel and others.

5.3. Classification

On many occasions, we have labeled sequence data on which a classification model must be built. SGT can be used for this, and we demonstrate it on two datasets: a) protein sequences (www.uniprot.org), each having one of two known functions, which act as the labels, and b) network intrusion data (https://www.ll.mit.edu/ideval/data/1998data.html) containing audit logs, with any attack as the positive label.

Attribute Protein Network
Sample size 2113 115
Sequence length range (289, 300) (12, 1773)
Class distribution 46.4%+ 11.3%+
Alphabet set size 20 (amino acids) 49 (log events)
Table 2. Dataset attributes.

The dataset details are given in Table 2. For both problems we use the length-sensitive SGT. For proteins, it is due to their nature, while for network logs, the lengths are important because sequences with similar patterns but different lengths can have different labels. Take a simple example of the following two sessions: {login, password, login, password, mail,...} and {login, password,...(repeated several times)..., login, password}. While the first session can be a regular user mistyping the password once, the other session is possibly an attack to guess the password. Thus, the sequence lengths are as important as the patterns.

For the network intrusion data, the sparsity of SGTs was high. Therefore, we performed principal component analysis (PCA) on it and kept the top 10 PCs as sequence features. We call it SGT-PC, for further modeling. For proteins, the SGTs are used directly.

SVM on         Protein   Network
SGT            99.61%    89.65%
Bag-of-words   88.45%    48.32%
2-gram         93.87%    63.12%
3-gram         95.12%    49.09%
1+2-gram       94.34%    64.39%
1+2+3-gram     96.89%    49.74%
Table 3. Classification accuracy (F1-score) based on 10-fold cross-validation results.

After obtaining the SGT (or SGT-PC) features, we trained an SVM classifier on them. For comparison, we used popular n-gram sequence classifiers, viz. bag-of-words (1-gram), 2-, 3-, 1+2-, and 1+2+3-gram. The SVM was built with an RBF kernel. The cost parameter, C, is equal to 1, and the RBF kernel parameter was set separately for each feature representation. Table 3 reports the average test accuracy (F1-score) from a 10-fold cross-validation.
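A sketch of this setup (an RBF-kernel SVM with C = 1 on SGT features, preceded by PCA when the SGTs are sparse, and evaluated with 10-fold cross-validation) is given below; the scoring choice and default gamma are illustrative, and the variable names in the usage comment are hypothetical.

```python
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def evaluate_sgt_svm(X, y, n_pcs=None, gamma="scale"):
    steps = ([PCA(n_components=n_pcs)] if n_pcs else []) + [SVC(kernel="rbf", C=1.0, gamma=gamma)]
    model = make_pipeline(*steps)
    return cross_val_score(model, X, y, cv=10, scoring="f1").mean()

# e.g., evaluate_sgt_svm(protein_sgts, protein_labels) for the protein data, or
#       evaluate_sgt_svm(network_sgts, network_labels, n_pcs=10) for the intrusion data
```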

As we can see in Table 3, the F1-scores are high for all methods on the protein data, with the SGT-based SVM surpassing all others. On the other hand, the accuracies are lower for the network intrusion data. This is primarily due to: a) a small dataset with a high dimension (related to the alphabet set size), leading to weak predictive ability of the models, and b) few positive-class examples (unbalanced data), causing a poor recall rate. Still, SGT outperformed the other methods by a significant margin. Although the accuracies of all methods could be further increased using other classifiers, such as boosting and random forests, that is beyond the scope of this paper. Here our purpose is to make a simple comparison that highlights the superiority of SGT features for building a supervised learning model.

5.4. Search

Most sequence databases found in the real world are very large. For example, protein databases have billions of sequences, and they are growing. Here we show that SGT sequence features can lead to a fast and accurate sequence search.

Protein SGT-PC distance Identity
S9A4Q5 33.02 46.3%
S8TYW5 34.78 46.3%
A0A029UVD9 39.21 45.1%
A0A030I738 39.34 45.1%
A0A029UWE3 39.41 45.1%
Table 4. Protein search query (Q9ZIM1) result.

We collected a random sample of 1000 protein sequences from the UniProtKB database on www.uniprot.org. We transformed them to the feature space using the length-sensitive SGT to incorporate their length information. Thereafter, to reduce the dimension, we applied principal component analysis and preserved the first 40 principal components (explaining 83% of the variance), denoted by SGT-PC. We arbitrarily chose a protein sequence, Q9ZIM1 (ID notation from UniProtKB), as the search query. The protein sequence of Q9ZIM1 is MSYQQQQCKQPCQPPPVCPTPKCPEPCPPPKCPEPYLPPPCPPEHCPPPPCQDKCPPVQPYPPCQQKYPPKSK.

We compute the Manhattan distance between the SGT-PCs of the query and each sequence in the dataset. The top five closest sequences are shown with their SGT-PC distances in Table 4. For reference, we also find the identities—an identity between two sequences is the edit-distance between them after alignment. Here, we find the identities after a global alignment, with a gap-opening cost of 0 and a gap-extension cost of 1. Note that alignment algorithms are approximate heuristics; thus, the identities should be seen only as a guideline and not as ground truth.

We find that the maximum pairwise identity (46.3%) corresponds to the smallest SGT-PC distance (33.02), for the pair {Q9ZIM1 (query), S9A4Q5}, and the identity decreases with increasing SGT-PC distance. This shows a consistency between the SGT results and commonly accepted identity measures. However, the computation time for finding the SGT-PC distances between the query and the entire dataset is found to be 0.0014 sec on a 2.2 GHz Intel i7, while the identity computations took 194.4 sec. Although methods currently in use for protein databases, such as BLAST, have faster alignment and identity computation procedures than a pairwise comparison, their runtime will still be greater than computing vector differences as in an SGT-based search.
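A sketch of this search pipeline, assuming the database's length-sensitive SGT vectors are stacked in X and the query's SGT vector is q (hypothetical names): project onto the first 40 principal components and rank the database by Manhattan distance to the projected query.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial.distance import cdist

def sgt_search(X, q, n_pcs=40, top=5):
    pca = PCA(n_components=n_pcs).fit(X)
    db = pca.transform(X)
    query = pca.transform(q.reshape(1, -1))
    dist = cdist(query, db, metric="cityblock").ravel()  # Manhattan distances in PC space
    order = np.argsort(dist)[:top]
    return order, dist[order]
```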

6. Discussion and Conclusion

As we showed in §2.4 and validated in §4-5, SGT’s ability to capture the overall pattern—short- and long-range structures—of a sequence into a fixed finite-dimensional space with shorter computation time makes it stand out. The other methods were found to have lower performance than SGT for this reason.

To bring this into perspective, compare SGT with a first-order Markov model. Suppose we are analyzing sequences in which one alphabet occurs closely after another. Due to stochasticity, the observed sequences can differ, with (b) being the same as (a) except for a noise event appearing in between the pair. While the transition probability in the Markov model will be significantly different for (a: 1.0) and (b: 0.5), SGT is robust to such noise, with (a: 0.50) and (b: 0.45). The effect of the intermittent noise can be easily regulated by changing the hyper-parameter: choose a high value to reduce the noise effect, with the caution that sometimes the noise may be part of the pattern. Furthermore, a Markov model cannot easily distinguish between two sequences whose transition probabilities are the same (=1 for both) but whose overall patterns differ. In contrast, the SGT feature changes from 1.72 to 2.94, because it looks at the overall pattern. On another note, although deep learning methods can capture such overall patterns, their representations are in an arbitrary and usually very high dimension.

Besides, SGT may have poor performance if the alphabet set is small, because the feature space will then be too small to sufficiently capture the pattern characteristics. Additionally, for clustering, Manhattan distance is found to outperform Euclidean distance. This can be because the differences between SGT features typically have small values.

In summary, SGT is a novel approach for feature extraction of a sequence to give it a finite-dimensional representation. This bridges an existing gap between sequence problems and powerful mainstream data mining methods. SGT estimation does not require any numerical optimization, which makes it simple and fast. Due to SGT’s faster execution and ease of parallelization, it can be scaled to most big sequence data problems. For future research, SGT-based methods can be developed for sequence problems in speech recognition, text analysis, bioinformatics, etc., or used as an embedding layer in deep learning. In addition, efficacy of higher-order SGTs can also be explored.

Appendix A Mean and Variance of

To easily denote various pairs in Fig. 5, we use a term, neighboring pair, where an neighbor pair for will have other ’s in between. A first neighbor is thus the immediate neighboring pairs, while the -neighbor has one other in between, and so on (see Fig. 5). The immediate neighbor mentioned in the assumption in Sec. 2.4 is the same as the first neighbor defined here.

The following derivation follows Assumption 1. Based on it, the expected number of first-neighbor pairs is given as . Consequently, it is easy to show that the expected number of neighboring pairs is , i.e., the second neighboring pairs will be , for the third, and so on, till one instance for the neighbor (see Fig. 5). The gap distance for an neighbor is given as .

Besides, the total number of pair instances will be (, by definition). Suppose we define a set that contains the distances for each possible pair as . Also, since , becomes a lognormal distribution. Thus,

(9)
(10)

where,

(11)

Besides, the feature, , in Eq. 3a can be expressed as,

(12)

This yields the expectation expression in Eq. 4. Besides, the variances will be

Appendix B SGT for undirected sequences

As explained in Appendix A, under Assumption 1, the expected number of or pair instances will be . Therefore, .

Next,

Next, the SGT for the undirected sequence in Eq. 8 can be expressed as,

Appendix C Proof for Alphabet Clustering

We have, . For , we want, , in turn, . This will hold if , that is, slope, is increasing with . For an exponential expression for (Eq. 1), the above condition holds true if . Hence, under these conditions, the separation increases as we increase the tuning parameter, .

References

  • Cadez et al. (2003) Igor Cadez, David Heckerman, Christopher Meek, Padhraic Smyth, and Steven White. 2003. Model-based clustering and visualization of navigation patterns on a web site. Data Mining and Knowledge Discovery 7, 4 (2003), 399–424.
  • Comin and Verzotto (2012) Matteo Comin and Davide Verzotto. 2012. Alignment-free phylogeny of whole genomes using underlying subwords. Algorithms for Molecular Biology 7, 1 (2012), 1.
  • Edgar (2004) Robert C Edgar. 2004. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC bioinformatics 5, 1 (2004), 1.
  • Edgar (2010) Robert C Edgar. 2010. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 19 (2010), 2460–2461.
  • Fu et al. (2012) Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu, and Weizhong Li. 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 23 (2012), 3150–3152.
  • Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850 (2013).
  • Han et al. (2001) Jiawei Han, Jian Pei, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, and MC Hsu. 2001. Prefixspan: Mining sequential patterns efficiently by prefix-projected pattern growth. In proceedings of the 17th international conference on data engineering. 215–224.
  • Helske and Helske (2016) Satu Helske and Jouni Helske. 2016. Mixture hidden Markov models for sequence data: the seqHMM package in R. (2016).
  • Needleman and Wunsch (1970) Saul B Needleman and Christian D Wunsch. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology 48, 3 (1970), 443–453.
  • Ranjan et al. (2015) Chitta Ranjan, Kamran Paynabar, Jonathan E Helm, and Julian Pan. 2015. The Impact of Estimation: A New Method for Clustering and Trajectory Estimation in Patient Flow Modeling. arXiv preprint arXiv:1505.07752 (2015).
  • Remmert et al. (2012) Michael Remmert, Andreas Biegert, Andreas Hauser, and Johannes Söding. 2012. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nature methods 9, 2 (2012), 173–175.
  • Siyari et al. (2016) Payam Siyari, Bistra Dilkina, and Constantine Dovrolis. 2016. Lexis: An Optimization Framework for Discovering the Hierarchical Structure of Sequential Data. In Proceedings of the 22nd ACM SIGKDD (KDD ’16). ACM, New York, NY, USA, 1185–1194. DOI:http://dx.doi.org/10.1145/2939672.2939741 
  • Smith and Waterman (1981) Temple F Smith and Michael S Waterman. 1981. Identification of common molecular subsequences. Journal of molecular biology 147, 1 (1981), 195–197.
  • Tomović et al. (2006) Andrija Tomović, Predrag Janičić, and Vlado Kešelj. 2006. n-Gram-based classification and unsupervised hierarchical clustering of genome sequences. Computer methods and programs in biomedicine 81, 2 (2006), 137–153.