Sequence classification - the task of assigning classes to sequences of atomic symbols - occurs in a multitude of application scenarios such as ubiquitous computing, bioinformatics, finance, and security surveillance. A concrete example is the determination of protein types based solely on the amino-acid sequence. Deep neural architectures deliver excellent predictive power, at the expense of human interpretability and a high demand on computational resources. Other sequence classification approaches provide better interpretability (e.g. linear models), but this is often achieved at the expense of predictive power. Not only is such an accuracy-interpretability trade-off hard to achieve, but comparing the interpretability of models is often left to manual inspection, as there are no agreed-upon, human-independent measures available. At the same time, a large number of textual and graph-based background knowledge bases are available on the web. Nevertheless, there are no existing works that merge predictive models trained on sequences of symbols with prior knowledge from external knowledge bases.
In this work, we focus on the problem of designing an interpretable model for sequence classification. The problem consists of two parts: i) conceiving the model itself, and ii) validating the improvement in interpretability with a proper metric. Unlike existing sequence classification models, the intuition behind our work is that auxiliary, external background information can i) enhance the interpretability of sequence classification models, and ii) help measure such interpretability. We show that linear models for classification - which are known to strike a good balance between predictive power and interpretability - can be enriched with auxiliary background knowledge to obtain a quantifiable improvement in the interpretability of their features, without affecting the predictive power. Our contribution (Figure 1) includes:
Emb-SEQL: a feature selection and learning algorithm for sequence classification that uses external embeddings to refine the selection of candidate features.
Semantic Fidelity: a metric to quantify the interpretability of features extracted from symbolic sequences. The metric casts the problem into computing distances in a background knowledge embedding space, and does not depend on human-grounded evaluation protocols.
We evaluate our approach on human activity recognition from wearables (HAR) and amino acid sequence classification. We assess both predictive power and interpretability, experimenting with pre-trained word embeddings and knowledge graph embeddings. We find that using auxiliary knowledge to refine the selection of candidate features results in more interpretable models. At the same time, it does not reduce the predictive power of the learned model. Links to data and code will be inserted in the final version of the paper.
2 Related Work
Sequence Classification: Learning classification models for symbolic sequences often uses the presence or the frequency of consecutive groups of symbols, so-called $k$-mers (or $n$-grams in text processing), as features. Support Vector Machines (SVMs) show promising results: specific string kernels have been proposed, as well as implementation tricks that improve their efficiency [7, 18, 19]. Hidden Markov Models [15] model the probability distribution of sequences for each class separately and assign the class with the highest likelihood to unseen sequences at inference time. Current state-of-the-art results are held by Convolutional Neural Networks (CNNs) that operate on sequences of characters, which have been successfully applied to sequences [24, 1]. Nevertheless, CNNs and SVMs are black-box models and have poor interpretability.
Background Knowledge Injection: Background knowledge is typically used to improve accuracy: auxiliary knowledge can be encoded as rules, for more accurate relation extraction, or to predict missing links in knowledge graphs. There have been attempts to incorporate semantic monotonic constraints derived from background knowledge, but not for sequence classification.
Interpretability Metrics: Doshi-Velez and Kim [3] discuss evaluation protocols to assess the interpretability of machine learning models. They take into account human-grounded experiments, with real-world and simplified tasks. Besides, they also acknowledge the need for functionally-grounded protocols that replace human intervention with proxy tasks. A simple proxy to compare linear classifiers is measuring the size of the model (number of features with non-zero weights), the assumption being that the smaller the model, the higher the interpretability. Nevertheless, this is an over-simplistic assumption, as size does not capture the semantics of the model features.
3.1 Sequence Classification
Learning a mapping from sequences of symbols to categorical labels is commonly known as sequence classification. Let $D = \{(s_1, y_1), \dots, (s_N, y_N)\}$ be a sequence database of instance-label pairs, where $s_i$ is a sequence and $y_i$ the corresponding label. The goal of sequence classification is to learn a mapping from the sequence database $D$ so that we can predict the label of a yet unlabeled sequence $s$. Formally, such a mapping is a function $f: \mathcal{S} \rightarrow \mathcal{Y}$, where $\mathcal{S}$ is the set of all possible sequences and $\mathcal{Y}$ the set of class labels. A sequence has the form $s = \sigma_1 \sigma_2 \dots \sigma_l$, and each of the individual symbols $\sigma_j$ belongs to a predefined finite alphabet $\Sigma$. For example, if $\Sigma = \{A, C, G, T\}$, a sequence could be $s = ACGTA$. Note that the length of the sequences is variable.
A $k$-mer is a sequence of $k$ consecutive symbols, e.g., $ACG$ for $k = 3$. We write $m \subseteq s$ and say $m$ is present in $s$ if an exact match of $m$ is found in $s$. Given this definition and an enumeration schema of all $k$-mers present in the training data, we can represent a sequence $s_i$ as a binary vector $x_i = (x_{i1}, \dots, x_{id})^\top \in \{0, 1\}^d$, where $x_{ij} = 1$ means that $k$-mer $m_j$ occurs in sequence $s_i$.
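The binary $k$-mer representation described above can be illustrated with a small Python sketch. It is only feasible for tiny alphabets and short $k$-mers, since it enumerates the feature space explicitly; the function names `enumerate_kmers` and `to_binary_vector` and the cap `max_k` are assumptions made for this illustration, not part of the original method.

```python
from typing import Dict, List, Sequence, Tuple

KMer = Tuple[str, ...]


def enumerate_kmers(sequences: Sequence[Sequence[str]], max_k: int) -> Dict[KMer, int]:
    """Assign a column index to every k-mer (1 <= k <= max_k) seen in the data."""
    index: Dict[KMer, int] = {}
    for seq in sequences:
        for k in range(1, max_k + 1):
            for start in range(len(seq) - k + 1):
                kmer = tuple(seq[start:start + k])
                index.setdefault(kmer, len(index))
    return index


def to_binary_vector(seq: Sequence[str], index: Dict[KMer, int], max_k: int) -> List[int]:
    """Return x with x[j] = 1 iff the j-th enumerated k-mer occurs in seq."""
    x = [0] * len(index)
    for k in range(1, max_k + 1):
        for start in range(len(seq) - k + 1):
            j = index.get(tuple(seq[start:start + k]))
            if j is not None:  # k-mers unseen during enumeration are ignored
                x[j] = 1
    return x


if __name__ == "__main__":
    train = ["AB", "BA"]  # strings are treated as sequences of one-character symbols
    kmer_index = enumerate_kmers(train, max_k=2)
    print(to_binary_vector("ABA", kmer_index, max_k=2))
```

Only exact matches set a feature to 1, which is the restriction that wildcards and groups later relax.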
Such a representation allows us to learn a linear model, i.e., a parameter vector $\beta$ of feature weights, to predict the class label of a sequence $s$ with binary representation $x$ by setting $f(s) = \operatorname{sign}(\beta^\top x)$. Although linear models are not powerful enough to capture non-linear relationships, by working in a very complex feature space (e.g., all $k$-mers) it is possible to learn powerful models, similar to the kernel trick applied by kernel Support Vector Machines.
Although the entire $k$-mer space is huge and in practice infeasible to generate explicitly, it can still be used by exploiting the nested structure of the feature space using SEQL [6]. In this work we adopt SEQL, a linear sequence classification algorithm, as we want to learn a model that is interpretable but still achieves high accuracy. The main idea behind SEQL is to use greedy coordinate gradient descent with the Gauss-Southwell rule [13], which avoids the explicit generation of the feature vectors. A key step of this approach is the efficient search for the current best $k$-mer, in the sense of maximum absolute gradient value, followed by an update of the corresponding weight value. These two steps are executed iteratively until a convergence threshold is reached. The search itself is realized with a branch-and-bound tree search, which is made feasible by a bound on the gradient value of a $k$-mer based on its own sub-$k$-mers. In particular, each iteration starts by computing the gradient values of all 1-mers, whereby the best gradient value found so far is saved in $\tau$. For each 1-mer $m$, the corresponding upper bound $\mu(m)$ is computed. The sub-tree starting at $m$ can be pruned whenever $\mu(m) \leq \tau$; otherwise we expand $m$ and repeat the procedure. This search procedure finds the best $k$-mer in an efficient and timely manner. The resulting model is a weighted list of $k$-mers, which is easy for humans to understand. SEQL supports two classification losses: i) logistic loss and ii) squared hinge loss; here, we use i) to learn linear binary sequence classification models.
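To make the Gauss-Southwell selection rule concrete, the following is a minimal sketch of greedy coordinate descent for the logistic loss. It is a simplification and not the SEQL implementation: it assumes an explicit (small) binary feature matrix instead of the branch-and-bound search over the $k$-mer tree, labels in $\{-1, +1\}$, and a fixed step size; the function name and the parameters `step`, `tol` and `max_iter` are hypothetical.

```python
import numpy as np


def greedy_coordinate_descent(X, y, step=0.5, tol=1e-3, max_iter=200):
    """Minimise sum_i log(1 + exp(-y_i * beta^T x_i)) one coordinate at a time."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        margins = y * (X @ beta)
        # Gradient of the logistic loss with respect to every weight
        grad = -(X.T @ (y / (1.0 + np.exp(margins))))
        # Gauss-Southwell rule: pick the coordinate with the largest |gradient|
        j = int(np.argmax(np.abs(grad)))
        if abs(grad[j]) < tol:
            break  # convergence threshold reached
        beta[j] -= step * grad[j]  # update only the selected weight
    return beta


if __name__ == "__main__":
    X = [[1, 0], [1, 0], [0, 1], [0, 1]]  # column 0 marks positives, column 1 negatives
    y = [1, 1, -1, -1]
    print(greedy_coordinate_descent(X, y))
```

In this sketch the full gradient is computed at every iteration; the point of the bound-based tree search described above is precisely to find the same best coordinate without doing so.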
Moreover, SEQL goes beyond traditional $k$-mers, since it can use wildcards within the generated $k$-mers through the $*$-character. Such a wildcard allows $k$-mers with gaps, which leads to more general features. Nevertheless, this is computationally expensive.
3.3 Word Embeddings
Word embeddings are representation learning techniques widely adopted in natural language processing. They map words in a text corpus to a low-dimensional, continuous vector space. Such vectors act as representations of terms in an $n$-dimensional metric space. Word embeddings are mostly generated by processing word co-occurrences, or by using neural architectures, the most popular models being word2vec [9] and GloVe [14]. Popular pre-trained word embedding collections such as ConceptNet Numberbatch [20] and GloVe (https://nlp.stanford.edu/projects/glove/) are available on the web. The main shortcoming of word embeddings is that single vectors may represent words that carry multiple meanings.
3.4 Knowledge Graph Embeddings
Knowledge graphs are graph-based knowledge bases whose facts are modeled as relationships between entities. Examples are DBpedia, WordNet, and YAGO. Formally, a knowledge graph $G = \{(s, p, o)\} \subseteq E \times R \times E$ is a set of triples in the form $t = (s, p, o)$, each including a subject $s \in E$, a predicate $p \in R$, and an object $o \in E$. $E$ and $R$ are the sets of all entities and relation types of $G$.
Knowledge graph embedding models are neural architectures that encode concepts from a knowledge graph $G$ (i.e. entities $E$ and relation types $R$) into low-dimensional, continuous vectors. Such knowledge graph embeddings have many applications, e.g., in knowledge graph completion, entity resolution, and link-based clustering. Knowledge graph embeddings are learned by training a neural architecture over a knowledge graph. Although such architectures vary, the training phase always consists in minimizing a loss function $L$ (usually negative log-likelihood or hinge loss) that includes a scoring function $f_m(t)$, i.e., a model-specific function that assigns a score to a triple $t = (s, p, o)$. The optimization procedure learns optimal embeddings by minimizing $L$, such that the model assigns high scores to true statements, and low scores to statements unlikely to be true.
In this section we describe Emb-SEQL, our background knowledge-enriched sequence classification model. We also present the Semantic Fidelity, the metric that we use to assess the interpretability of the features learned by Emb-SEQL.
One of the drawbacks of the $k$-mer-based approach of SEQL is that it fully relies on matching exact $k$-mers. The $*$-wildcard relaxes this constraint, but it is very general, as it allows an arbitrary symbol. As an alternative approach, we introduce the concept of groups. The main intuition behind these groups is that there exist symbols in the alphabet that are exchangeable in certain situations. Conceptually, a group forms a new symbol that can be considered as an OR combination of multiple base symbols from the original alphabet. We use these new symbols to extend the all-$k$-mer representation that SEQL uses, and write them in a notation similar to alternation in regular expressions. Groups can be formed by hand, but we are more interested in forming them automatically by exploiting background knowledge.
Symbols in more complex alphabets (e.g., Activities, NLP) often have relationships between each other, in the sense that some symbols are semantically closer than others. We use the embedded representations of symbols to measure such closeness and form groups based on this measurement. To find sensible groups automatically, we first map all base symbols into the embedding space and then cluster them. Various (overlapping) clustering techniques can be used for this task. We adopt a simple radius-based approach: for each symbol $\sigma$ in the alphabet, a group is formed by aggregating all the symbols that fall within a fixed radius $r$ around the embedding of $\sigma$. After collecting the groups around each individual symbol, all exact duplicates are removed to obtain a final list of groups. The selection of the radius is crucial, as it directly determines the group sizes. If it is chosen too small, no groups are formed; on the other hand, a radius that is too big leads to large and general groups, and eventually to only one group that contains all symbols (and hence emulates the $*$-wildcard). Currently, we rely on manual selection of an appropriate radius. Our initial tests with a $k$-nearest-neighbour approach as an alternative to the radius-based approach did not achieve better performance. Further exploration of automatic group selection mechanisms is left for future work.
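The radius-based group formation described above can be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: the Euclidean metric, the choice to discard singleton neighbourhoods (so that a very small radius yields no groups), and the name `radius_groups` are all assumptions.

```python
import numpy as np


def radius_groups(embeddings, radius):
    """Form one candidate group per symbol from all symbols within `radius`."""
    symbols = sorted(embeddings)
    vectors = np.stack([np.asarray(embeddings[s], dtype=float) for s in symbols])
    # Pairwise Euclidean distances between all symbol embeddings (assumed metric)
    dist = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    groups = set()  # a set of frozensets removes exact duplicate groups
    for i in range(len(symbols)):
        members = frozenset(symbols[j] for j in np.flatnonzero(dist[i] <= radius))
        if len(members) > 1:  # a singleton is just the base symbol itself
            groups.add(members)
    return sorted(groups, key=lambda g: (len(g), sorted(g)))


if __name__ == "__main__":
    toy = {"walk": [0.0, 0.0], "run": [0.1, 0.0], "sit": [1.0, 1.0], "lie": [1.0, 1.1]}
    for radius in (0.01, 0.2, 10.0):
        print(radius, [sorted(g) for g in radius_groups(toy, radius)])
```

On the toy embeddings, the three radii reproduce the behaviour discussed in the text: no groups, two small groups, and a single group containing every symbol.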
We extend SEQL by first pre-computing groups, followed by the normal SEQL learning procedure. We call this Emb-SEQL, for Embedding-enriched SEQL. Once the groups are generated, each of them acts as a new base symbol for SEQL and can be part of any $k$-mer. During the tree search of SEQL, groups behave exactly like a normal symbol of the alphabet.
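How a group behaves like an ordinary symbol when matching a $k$-mer against a sequence can be shown with a short sketch. Representing a group as a `frozenset` of base symbols, and the helper names `token_matches` and `kmer_occurs`, are assumptions made for this example only.

```python
from typing import FrozenSet, Sequence, Union

Token = Union[str, FrozenSet[str]]  # a base symbol or a group of base symbols


def token_matches(token: Token, symbol: str) -> bool:
    """A group matches any of its members; a base symbol matches only itself."""
    if isinstance(token, frozenset):
        return symbol in token
    return symbol == token


def kmer_occurs(kmer: Sequence[Token], sequence: Sequence[str]) -> bool:
    """True if the k-mer (possibly containing groups) matches some window of the sequence."""
    k = len(kmer)
    return any(
        all(token_matches(t, s) for t, s in zip(kmer, sequence[start:start + k]))
        for start in range(len(sequence) - k + 1)
    )


if __name__ == "__main__":
    feature = ["open", frozenset({"cup", "glass"})]
    print(kmer_occurs(feature, ["reach", "open", "glass", "close"]))
    print(kmer_occurs(feature, ["open", "plate"]))
```

A group is therefore more permissive than an exact symbol but more restrictive than a wildcard, which would match `"plate"` as well.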
4.2 Semantic Fidelity
Our base assumption is that binary linear classification models (i.e., weighted lists of features) are understandable and interpretable as long as their features are. This complies with the decomposability property of interpretable models proposed by Lipton [8]. Nevertheless, determining how interpretable a set of features is remains a task that is often neglected and still requires manual intervention. To overcome this problem, we propose a functionally-grounded protocol [3] based on the Semantic Fidelity, a novel metric to measure the interpretability of the features of a linear model for sequence classification without the need for user intervention. Following the rationale that explanations should match user expectations, we cast the problem of measuring the interpretability of a set of features to computing distances in the embedding space of an auxiliary background knowledge base. The intuition is that features with positive weights should be highly related to the concept of the target class, whereas features with negative weights should relate to the non-target class.
We define the Semantic Fidelity as follows:
is the set of features, is a feature, is the number of features, and is defined as:
where $c^+$ is the positive class of the binary classification task (i.e. the target) and $c^-$ the negative class, $w_f$ is the weight associated to feature $f$, and $d(f, c)$ is the distance between a $k$-mer feature $f$ and the concept of the target class $c$. The distance is defined as the average distance between the embeddings of each individual $k$-mer symbol and the embedding of the class:
$$d(f, c) = \frac{1}{|f|} \sum_{i=1}^{|f|} \mathrm{dist}(e_{f_i}, e_c)$$
where $|f|$ is the number of symbols in $f$, $e_{f_i}$ is the embedding of symbol $f_i$, and $e_c$ is the embedding of class $c$. $f_i$ represents a single symbol or, in the case of an Emb-SEQL group $g$ of length $n$, the average of all symbols in the group:
$$e_{f_i} = \frac{1}{n} \sum_{j=1}^{n} e_{g_j}$$
We assume the embedding space and weights to be normalized so that () and the maximum distance , and consequently , where a higher value means a more interpretable model.
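The distance computations that underlie the Semantic Fidelity can be sketched in a few lines. This is a hedged illustration: cosine distance is an assumed choice of embedding distance, the function names are hypothetical, and the sketch deliberately stops at the feature-to-class distance and the choice of reference class; it does not attempt to reproduce the per-feature score or the final aggregation.

```python
import numpy as np


def cosine_distance(u, v):
    """Assumed embedding distance; any normalised distance could be substituted."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def token_embedding(token, emb):
    """Embedding of a single symbol, or the mean embedding of a group's members."""
    if isinstance(token, (frozenset, set, tuple, list)):
        return np.mean([emb[s] for s in token], axis=0)
    return np.asarray(emb[token], dtype=float)


def feature_class_distance(kmer, class_label, emb):
    """Average distance between each k-mer token and the class embedding."""
    class_vec = emb[class_label]
    return float(np.mean([cosine_distance(token_embedding(t, emb), class_vec) for t in kmer]))


def reference_class(weight, positive_class, negative_class):
    """Positive-weight features are compared to the target class, others to the non-target class."""
    return positive_class if weight > 0 else negative_class


if __name__ == "__main__":
    emb = {"a": [1.0, 0.0], "b": [0.0, 1.0], "target": [1.0, 0.0], "other": [0.0, 1.0]}
    kmer, weight = ["a", frozenset({"a", "b"})], 0.8
    cls = reference_class(weight, "target", "other")
    print(cls, feature_class_distance(kmer, cls, emb))
```

A small distance for a positive-weight feature means that its symbols lie close to the target class concept in the background knowledge space, which is the behaviour the metric is designed to reward.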
In this section we assess the interpretability and the predictive power of Emb-SEQL. We experiment in two distinct application scenarios: human activity recognition from wearables, and amino-acid sequence classification. In a second experiment, we show that the background knowledge injection of Emb-SEQL does not affect its predictive power, but improves interpretability.
5.1 Experimental Settings
Datasets. We experiment with a number of symbolic sequence classification datasets and a range of auxiliary background knowledge sources. The symbolic sequence datasets used in the experiments are:
OPPORTUNITY (HAR): Human activity recognition dataset of wearable sensor data collected from subjects performing actions in a room [17]. It includes inertial measurements from 15 subjects, resulting in 113 sensor recordings provided as multivariate time series. Data points are annotated at different levels of abstraction. For this paper, we aggregate the four low-level labels (left hand action, left hand object, right hand action, right hand object) as well as the locomotion annotations to form a 5-let. We transform the multidimensional symbolic sequences of OPPORTUNITY by encoding 5-lets into unique symbols (we merge adjacent repeated 5-lets). This procedure results in more than 1,400 unique symbols. We concatenate all records for all subjects, and we window with size 1,000 (roughly 30 seconds) and stride 50. We label a window with its majority class. Our task is predicting the five top-level activities (Relaxing, Coffee time, Clean up, Sandwich time, Early morning) from sequences of 5-lets. We use a one-vs-all approach to address the multiclass setting of the dataset. All results are obtained with 10-fold cross validation. Note that for Emb-SEQL we compute the embedding of a group by averaging the five embeddings of symbols in a 5-let.
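The preprocessing steps described for OPPORTUNITY (encoding 5-lets as symbols, merging adjacent repeats, windowing, and majority labeling) can be sketched as follows. The helper names are hypothetical, and the sketch assumes that one top-level label is available per element of the symbol sequence being windowed; how labels are carried through the merging step is not specified in the text and is therefore left out.

```python
from collections import Counter
from itertools import groupby


def encode_tuples(rows):
    """Map each distinct tuple (e.g. a 5-let) to a unique integer symbol."""
    ids = {}
    return [ids.setdefault(tuple(row), len(ids)) for row in rows]


def merge_adjacent_repeats(symbols):
    """Collapse every run of identical consecutive symbols into one symbol."""
    return [key for key, _ in groupby(symbols)]


def sliding_windows(symbols, labels, size=1000, stride=50):
    """Cut fixed-size windows and label each one with its majority class."""
    if len(symbols) != len(labels):
        raise ValueError("symbols and labels must have the same length")
    windows = []
    for start in range(0, len(symbols) - size + 1, stride):
        majority = Counter(labels[start:start + size]).most_common(1)[0][0]
        windows.append((symbols[start:start + size], majority))
    return windows


if __name__ == "__main__":
    print(encode_tuples([("reach", "cup"), ("move", "cup"), ("reach", "cup")]))
    print(merge_adjacent_repeats(["a", "a", "b", "a"]))
    print(sliding_windows(list("abcdef"), ["x", "x", "x", "y", "y", "y"], size=4, stride=2))
```

The defaults of `sliding_windows` mirror the window size and stride stated above; the toy call uses smaller values so that the output is easy to inspect.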
Protein: An excerpt of PhosphoELM (http://phospho.elm.eu.org/) used in [23]. It includes sequences of 21 distinct amino acids from the S/T/Y phosphorylation site. Each sequence is labeled with a protein group. We narrow down to two kinase groups (the PKA group with 381 sequences and SRC with 157), to compare against the binary classification results in [23]. We obtain the results by applying 10-fold cross validation, as done in [23].
We used the following pre-trained word embeddings:
GloVe: these pre-trained embeddings have been created with the GloVe unsupervised model from a large corpus of data crawled from the web , and cover 1.9M words. Embeddings have dimensionality .
We used the following knowledge graphs (detailed statistics are reported in Table 1):
WordNet: WordNet is a popular lexical database of English terms. Words are grouped into synsets, sets of cognitive synonyms that express a distinct concept. Synsets are connected with typed relations that represent conceptual, semantic, and lexical relations. We use the RDF version of WordNet 3.1 (http://wordnet-rdf.princeton.edu/about).
YAGO-41: YAGO is a large, broad-scope knowledge graph. We used version 3.1 (http://bit.ly/YAGO3). Due to the large size of YAGO, we only used the following splits: yagoDBpediaClasses, yagoDBpediaInstances, yagoTaxonomy, yagoTypes, and yagoFacts.
ChEBI-ChEMBL: the knowledge graph includes triples from the RDF versions of ChEBI (https://www.ebi.ac.uk/chebi/) and ChEMBL (http://bit.ly/ChEMBL-RDF). ChEBI includes information about small chemical compounds, i.e., molecular entities involved in processes of living organisms. We use the ChEBI-core split (1.8M triples). ChEMBL-RDF 24.1 is a manually curated chemical database of bioactive molecules with drug-like properties. We downloaded the splits describing the target triples, and the mappings to ChEBI entities.
For each embedding, we manually select a radius for the group generation. The main criteria for the selection are the total number of groups as well as their size. The best radii are: GloVe: 0.35, WordNet: 0.185, YAGO-41: 0.23, ConceptNet: 0.23, ChEBI-ChEMBL: 0.65.
Implementation Details. Emb-SEQL is implemented in C++. The Semantic Fidelity function is written in Python 3.6. The implementation of the knowledge graph embeddings model uses TensorFlow, on Python 3.6.
Knowledge Graph Embeddings Generation. Besides using pre-trained word embeddings (GloVe and ConceptNet Numberbatch), we also experiment with knowledge graph embeddings. This is done to overcome the single-vector multiple-meaning shortcoming of word embeddings. We learn knowledge graph embeddings for each knowledge graph listed in Table 1. We use ComplEx [21], the neural embedding model that strikes the best trade-off between predictive power and training speed. This is crucial given the size of the knowledge graphs used in the experiments. We rely on typical hyperparameter values known to perform well for splits of WordNet and YAGO; these cover the embedding dimensionality, the initial learning rate of the AdaGrad optimizer, the margin of the margin-based pairwise loss function, and the negatives-per-positive ratio. Figure 2 shows a PCA-reduced scatterplot of the concept embeddings for the human activity recognition scenario.
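For readers unfamiliar with the embedding model, the following sketch shows the ComplEx scoring function (the real part of a trilinear product over complex-valued embeddings) together with a margin-based pairwise loss. It is a NumPy illustration only, not the TensorFlow training code used in the experiments; the function names, the toy vectors and the margin value are assumptions, and no optimizer or negative sampling is shown.

```python
import numpy as np


def complex_score(e_s, w_p, e_o):
    """ComplEx score of a triple: Re(sum_k e_s[k] * w_p[k] * conj(e_o[k]))."""
    e_s = np.asarray(e_s, dtype=complex)
    w_p = np.asarray(w_p, dtype=complex)
    e_o = np.asarray(e_o, dtype=complex)
    return float(np.real(np.sum(e_s * w_p * np.conj(e_o))))


def pairwise_margin_loss(pos_scores, neg_scores, margin):
    """Sum of max(0, margin + f(negative) - f(positive)) over all pairs.

    pos_scores has shape (n,), neg_scores has shape (n, negatives_per_positive).
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)
    return float(np.sum(np.maximum(0.0, margin + neg - pos)))


if __name__ == "__main__":
    subject = [1 + 0j, 0 + 1j]
    predicate = [1 + 0j, 1 + 0j]
    obj = [1 + 0j, 0 + 1j]
    positive = complex_score(subject, predicate, obj)
    print(positive)
    print(pairwise_margin_loss([positive], [[1.5, 0.0]], margin=1.0))
```

The loss is zero for a negative whose score is at least `margin` below the positive score, so training pushes true triples above corrupted ones by that margin.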
5.2 Features Interpretability
Human Activity Recognition. We learn features from the OPPORTUNITY dataset with SEQL and Emb-SEQL, experimenting with word and graph embeddings generated from different auxiliary knowledge bases. We compute the Semantic Fidelity (Equation 1) of the learned features to assess if the features of Emb-SEQL obtained with the embedding-driven groups are more interpretable than those obtained with plain SEQL.
Table 2 reports the Semantic Fidelity obtained by the five binary classifiers defined for the five top-level activities to predict. Results are stable across the five target classes. Emb-SEQL outperforms SEQL with most of the embeddings: WordNet brings a 4.6% increase in Semantic Fidelity, while GloVe obtains a 2.3% increase and ConceptNet a 0.4% increase. YAGO-41, on the other hand, does not bring any advantage over plain SEQL. This is probably due to sparse relations and lack of redundancy in the YAGO splits we used to build YAGO-41. Future experiments will use a complete version of YAGO. Figure 3 shows an example of the embedded features of Emb-SEQL and SEQL in a PCA-reduced representation.
Amino Acids. We also experiment with the Protein dataset. As for the HAR scenario, we learn features with both SEQL and Emb-SEQL. For Emb-SEQL we use the ChEBI-ChEMBL auxiliary knowledge base, which we inject in the model as knowledge graph embeddings. Table 2 reports the Semantic Fidelity over a single class, as this is a binary classification task: Emb-SEQL reaches a Semantic Fidelity 1.5% higher than its counterpart, thus making its features more interpretable.
| Embeddings | Model | std | Class 1 | Class 2 | Class 3 | Class 4 | Class 5 |
5.3 Classification Quality
Human Activity Recognition. Besides interpretability, we are also interested in assessing whether Emb-SEQL achieves a predictive power comparable to SEQL. Therefore, we compare the performance of Emb-SEQL to SEQL without any knowledge injection, as well as to an SVM baseline and an LSTM-based neural network. We use an SVM with RBF kernel implemented with libsvm [2]. We extract all $k$-mers up to a fixed maximum length and explicitly generate the feature vector representation of each example as input for the SVM. The LSTM has 64 hidden units followed by a single 16-unit hidden layer classifier which maps to the number of output classes. The model is trained with Adam on a weighted cross-entropy loss, to mitigate class imbalance. It is trained for 100 epochs with early stopping.
Table 3 shows the results of the experiment on the OPPORTUNITY dataset with the above-mentioned embeddings. We show the weighted F1 score, as well as the accuracy, excluding the null class (as done in prior work). Results show that all SEQL models, regardless of the injection of auxiliary knowledge, outperform both the SVM and the LSTM in both metrics. The performance of Emb-SEQL is comparable to SEQL. We conclude that the injection of knowledge into Emb-SEQL did not hurt the performance of the model with regard to accuracy, but led to better model interpretability (according to Semantic Fidelity).
Amino Acids. A similar conclusion can be drawn for the evaluation on the Protein dataset. Table 3 shows the F1 score and accuracy of SEQL and Emb-SEQL with ChEBI-ChEMBL embeddings, of the LSTM-based architecture described for HAR, and of SCIS_MA (sequence classification based on association rules) and its HMM baseline [23]. The accuracy of SEQL and Emb-SEQL lags somewhat behind SCIS_MA and HMM, but the injection of knowledge does not significantly hurt the performance of Emb-SEQL.
We show that semantic embeddings help generate more interpretable features for sequence classification with linear models. Besides, we also show that distances in embedding spaces can be used to quantify how interpretable such features are. Future work will include exploration of different clustering techniques to form groups in Emb-SEQL. An important axis of work will be validating the Semantic Fidelity against human-grounded and application-grounded evaluation protocols. Furthermore, we will investigate the application of the Semantic Fidelity to other feature-based models.
[1] Alipanahi, B., Delong, A., Weirauch, M.T., Frey, B.J.: Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology 33(8), 831–838 (2015)
[2] Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 27:1–27:27 (2011)
[3] Doshi-Velez, F., Kim, B.: Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608 (2017)
[4] Freitas, A.A.: Comprehensible classification models: a position paper. ACM SIGKDD Explorations 15(1), 1–10 (2014)
[5] Gsponer, S., Smyth, B., Ifrim, G.: Efficient sequence regression by learning linear models in all-subsequence space. In: Procs of ECML-KDD. pp. 37–52 (2017)
[6] Ifrim, G., Wiuf, C.: Bounded coordinate-descent for biological sequence classification in high dimensional predictor space. Procs of SIGKDD (2011)
[7] Leslie, C., Eskin, E., Noble, W.S.: The spectrum kernel: a string kernel for SVM protein classification. Procs of the Pacific Symp on Biocomputing 7, 564–575 (2002)
[8] Lipton, Z.C.: The mythos of model interpretability. Queue 16(3) (Jun 2018)
[9] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS (2013)
[10] Miller, T.: Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence (2018)
[11] Minervini, P., Demeester, T., Rocktäschel, T., Riedel, S.: Adversarial sets for regularising neural link predictors. In: Procs of UAI (2017)
[12] Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A review of relational machine learning for knowledge graphs. Procs of the IEEE 104(1), 11–33 (2016)
[13] Nutini, J., Schmidt, M., Laradji, I.H., Friedlander, M., Koepke, H.: Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In: Procs of ICML. vol. 37, pp. 1632–1641 (2015)
[14] Pennington, J., Socher, R., Manning, C.: GloVe: Global vectors for word representation. In: Procs of EMNLP. pp. 1532–1543 (2014)
[15] Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recognition. Procs of the IEEE 77(2), 257–286 (1989)
[16] Rocktäschel, T., Singh, S., Riedel, S.: Injecting logical background knowledge into embeddings for relation extraction. In: HLT-NAACL. pp. 1119–1129 (2015)
[17] Roggen, D., Calatroni, A., Rossi, M., Holleczek, T., Förster, K., Tröster, G., Lukowicz, P., Bannach, D., Pirkl, G., Ferscha, A., Doppler, J., Holzmann, C., Kurz, M., Holl, G., Chavarriaga, R., Sagha, H., Bayati, H., Creatura, M., d. R. Millàn, J.: Collecting complex activity datasets in highly rich networked sensor environments. In: INSS. pp. 233–240 (June 2010)
[18] Sonnenburg, S., Rätsch, G., Schäfer, C.: Learning interpretable SVMs for biological sequence classification. Research in Computational Molecular Biology (2005)
[19] Sonnenburg, S., Rätsch, G., Rieck, K.: Large scale learning with string kernels. In: Bottou, L., Chapelle, O., DeCoste, D., Weston, J. (eds.) Large Scale Kernel Machines, pp. 73–103. MIT Press, Cambridge, MA (2007)
[20] Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In: Procs of AAAI. pp. 4444–4451 (2017)
[21] Trouillon, T., Welbl, J., Riedel, S., Gaussier, É., Bouchard, G.: Complex embeddings for simple link prediction. In: Procs of ICML. pp. 2071–2080 (2016)
[22] Xing, Z., Pei, J., Keogh, E.J.: A brief survey on sequence classification. SIGKDD Explorations 12(1), 40–48 (2010)
[23] Zhou, C., Cule, B., Goethals, B.: Pattern based sequence classification. IEEE Transactions on Knowledge and Data Engineering 28(5), 1285–1298 (2016)
[24] Zhou, J., Troyanskaya, O.G.: Predicting effects of noncoding variants with deep learning based sequence model. Nature Methods 12(10), 931–934 (Aug 2015)