I Introduction
Machine learning algorithms are being widely deployed in different applications as automated decision-making tools due to their generalization capabilities. This includes malware detection, which is our main concern in this work. Traditional algorithms for malware detection search for known signatures, which requires them to have a copy of all malware samples. These algorithms are no longer effective because (i) polymorphism is used within a malware family, (ii) the number of new malware families is growing rapidly, and (iii) they are not capable of zero-day malware detection. Machine learning algorithms are good candidates for automated malware detection. This is because they can extract complex patterns using different attributes of a malware sample, and they can also help with zero-day malware detection as they can generalize to new samples [1].
Malware detection can be divided into two main categories: dynamic (behavioral) and static (code) malware detection. In dynamic malware detection, samples are executed, and their runtime behavior is monitored to create indicators of malicious activities. In this type of malware detection, malware samples can easily adapt their runtime behavior to evade detection when they are aware of the normal behavior. In static malware detection, binary codes of samples are examined without executing them to create indicators of malicious activities. In this work, we consider malware detection in portable executable (PE) files using static analysis. Different types of features have been used for static malware detection in PE files such as API calls [2], byte-level N-grams [3], features from the PE header [4], and a combination of different types of features [5]. We consider API calls of PE files to distinguish between malware and benign samples. The presence/absence of API calls forms a transactional dataset. Transactional datasets can also appear in other applications such as healthcare where a sample represents a set of symptoms of a patient, marketing where a sample represents a basket of items purchased by a customer, and natural language processing where a sample represents a set of terms in a document.
A major challenge in designing classifiers for malware detection is that malware authors actively try to evade anti-malware systems by modifying existing malware samples without changing their functionalities. Therefore, the final aim is to develop a classifier that not only performs well on available samples, but is also robust to functionality-preserving modifications. The first step towards having a robust classifier is to understand why a sample is classified under one class rather than the other. This helps to develop a better understanding of how the model works, which in turn can be used to make the model more robust to evasion attacks. Therefore, interpretability is an important property for a classifier to improve its robustness. An interpretable model should be able to provide its users with good explanations as to why a specific decision is made regarding a given sample. Several properties are considered for a good explanation to a user [6]. One important property is that explanations need to be contrastive. This means that a user is interested to know not only why a decision is made but also why another decision is not made. Another important property is that explanations need to be succinct. This means that they need to provide a short list of important reasons for a decision rather than a complete list of reasons. There are also several other important motivations for having an interpretable machine learning model. One is that interpretable models help us to understand the scenarios where these models fail. This is important where wrong decisions by a model have serious consequences. For instance, classifying a malware sample as benign can bring a whole critical infrastructure down. Another important motivation is to detect biases in models. This can happen when a model is trained on a biased training dataset. Having an interpretable model helps us to extract these biases by looking at the reasons for its decisions.
In this work, we design a classifier based on the minimum description length (MDL) principle as a type of intrinsically interpretable model, intrinsic in the sense that the model itself is interpretable due to its structure. The design involves selecting a model that best describes (compresses) the training dataset for each class considering the MDL criterion. The MDL principle has already been used for both classification and anomaly detection in transactional datasets [7, 8, 9, 10]. However, our proposed model-selection method is able to handle much larger datasets (more than 10 to 100 times larger).
We search among code tables of patterns as the family of models for the compression of transactional datasets. Code table construction can also be viewed as a pattern summarization problem aimed at selecting a small interesting subset of a large list of candidate patterns. The MDL criterion selects a subset of patterns that best describes the dataset under consideration, i.e., gives the shortest possible description. The code table of selected patterns is considered as the selected model by the MDL criterion [7]. In transactional datasets, closed frequent pattern mining (CFPM) [11] is used to create the large list of candidate patterns required for code table construction. Frequent patterns are sets of items occurring together at least as often as a user-decided threshold. Closed frequent patterns (CFPs) are those frequent patterns that do not have a superset with the same number of occurrences. In large datasets, a large threshold yields only obvious and short patterns, and a small threshold can cause pattern explosion, which makes both pattern mining and code table construction computationally very expensive.
We therefore propose an MDL-based model-selection method for large transactional datasets. In our method, we first employ clustering to divide our dataset into a number of clusters. We then select a subset of clusters based on a criterion proposed to determine the quality of a cluster. We next perform CFPM separately in only the high-quality clusters. The outputs of CFPM for all high-quality clusters are merged as the final output of CFPM. This approach extracts a subset of all CFPs for our dataset. We show that our approach helps to avoid pattern explosion by giving priority to longer CFPs, and without requiring the extraction of all CFPs. We finally use the MDL criterion to further summarize extracted patterns, and construct a code table of patterns as the selected model.
We utilize our classifier for static malware detection in PE files using a dataset consisting of 19696 benign and 19696 malware samples, each a binary sequence of size 22761 (representing 22761 unique API calls in our dataset). We compare our classifier with deep neural networks, which provide the state-of-the-art performance. The comparison shows that our classifier performs very close to neural networks. We also discuss the interpretability of our classifier, and how it can help to understand why a sample is classified under one class rather than the other.
Organization: This work is organized as follows. In Section II, we present some preliminaries for transactional datasets. In Section III, we describe the MDL principle and its applications for pattern summarization, classification and anomaly detection. In Section IV, we present our proposed method for MDL-based model selection. In Section V, we show the advantages of our proposed method for model selection, and also compare our classifier with deep neural networks. In Section VI, we discuss the interpretability of our classifier. In Section VII, we conclude the work. In the Appendix, we review the two algorithms that we use for CFPM and clustering.
II Transactional Datasets
In this section, we present some preliminaries for transactional datasets, and also outline some issues related to CFPM and clustering for these datasets.
II-A Preliminaries
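Assuming that a dataset consists of $n$ possible items, $\mathcal{I} = \{1, 2, \dots, n\}$ represents the set of all items. The whole dataset, denoted by $\mathcal{D}$, is a nonempty multiset (bag) of transactions, i.e., $\mathcal{D} = \{t_1, t_2, \dots, t_m\}$, where each transaction is a subset of $\mathcal{I}$, i.e., $t_j \subseteq \mathcal{I}$. We say that a transaction $t$ supports an itemset $X$ (which is also a subset of $\mathcal{I}$) if $X \subseteq t$. The support of an itemset, denoted by $\mathrm{supp}(X)$, is the number of transactions that support the itemset. Considering $\mathcal{D}_X$ as the multiset of transactions that support the itemset $X$, and $\mathcal{D}_Y$ as the multiset of transactions that support the itemset $Y$, we therefore have

$\mathrm{supp}(X) = |\mathcal{D}_X|$,

$\mathrm{supp}(X \cup Y) = |\mathcal{D}_X \cap \mathcal{D}_Y|$,

If $X \subseteq Y$, then $\mathcal{D}_Y \subseteq \mathcal{D}_X$,

If $X \subseteq Y$, then $\mathrm{supp}(Y) \le \mathrm{supp}(X)$,

where $|\cdot|$ denotes the cardinality of a multiset.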
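The definitions above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the dataset is stored as a `Counter` mapping each distinct transaction to its multiplicity, and the function name `support` and the toy data are hypothetical, not taken from this work.

```python
from collections import Counter

def support(itemset, dataset):
    """Number of transactions (counted with multiplicity) that contain `itemset`."""
    itemset = frozenset(itemset)
    return sum(count for t, count in dataset.items() if itemset <= t)

# A multiset of transactions: distinct transaction -> multiplicity (toy data).
dataset = Counter({
    frozenset({1, 2, 3}): 2,
    frozenset({2, 3}): 1,
    frozenset({3}): 1,
})

# Growing an itemset can only keep or reduce its support.
for itemset in ({3}, {2, 3}, {1, 2, 3}):
    print(sorted(itemset), support(itemset, dataset))
```

Storing multiplicities rather than repeated rows keeps the multiset semantics explicit while avoiding repeated subset tests for identical transactions.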
II-B Closed Frequent Pattern Mining
We here address the problem of CFPM in transactional datasets [11]. An itemset is frequent if its support is greater than or equal to a user-decided threshold, denoted by $\sigma$. A frequent itemset is closed if it has no superset with the same support. We employ the Linear Time Closed Itemset Mining (LCM) algorithm [12] to directly extract CFPs. This is as opposed to first extracting all frequent patterns (via algorithms such as the Apriori algorithm [13]), and then selecting the subset of CFPs. The LCM algorithm can dramatically reduce the computation time when the number of frequent patterns is exponentially larger than the number of CFPs. Refer to the Appendix for an overview of the LCM algorithm.
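To make the notion of a closed frequent pattern concrete, the following Python sketch mines CFPs by brute force. It is not the LCM algorithm: it relies on the fact that an itemset with nonzero support is closed exactly when it equals the intersection of the transactions that support it, so it simply enumerates all intersections of transactions. This is only practical for tiny datasets, and the function name and toy data are hypothetical.

```python
def closed_frequent_patterns(transactions, min_support):
    """Brute-force CFP mining (not LCM): closed itemsets are exactly the
    non-empty intersections of one or more transactions."""
    transactions = [frozenset(t) for t in transactions]
    closed = set(transactions)
    frontier = set(transactions)
    # Keep intersecting until no new itemset appears (fixpoint).
    while frontier:
        new = {a & b for a in frontier for b in closed} - closed
        closed |= new
        frontier = new
    result = {}
    for pattern in closed:
        if not pattern:
            continue  # ignore the empty itemset
        supp = sum(1 for t in transactions if pattern <= t)
        if supp >= min_support:
            result[pattern] = supp
    return result

toy = [{1, 2, 3}, {1, 2, 3}, {2, 3}, {3}]
for pattern, supp in closed_frequent_patterns(toy, min_support=2).items():
    print(sorted(pattern), supp)
```

In the toy data, the itemset {2} is frequent but not closed, because its superset {2, 3} has the same support.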
II-C Clustering
We here address the problem of clustering for transactional datasets. As traditional clustering algorithms using a pairwise similarity do not perform well for transactional datasets, several algorithms have been developed for these datasets which consider a global criterion function [14, 15]. Using a global criterion function also has the advantage that the user does not need to know the number of clusters in advance. The global criterion function is defined such that intra-cluster similarity is maximized, and inter-cluster similarity is minimized.
In this work, we employ the Clustering with sLOPE (CLOPE) algorithm [15], which is a fast and scalable algorithm for clustering transactional datasets. In this algorithm, we do not need to know the number of clusters in advance. The two parameters of this algorithm are the repulsion factor, used to control intra-cluster similarity, and the maximum cluster number, used to provide an upper limit for the number of clusters. Refer to the Appendix for an overview of the CLOPE algorithm.
III MDL Principle and Its Applications
In this section, we present the MDL principle, and its applications for pattern summarization, classification, and anomaly detection.
III-A MDL Principle
Kolmogorov complexity theory, also known as algorithmic information theory, was developed to measure the information in objects in isolation, i.e., without knowing the distribution underlying the object. Since in data mining we normally do not know the underlying distribution of our data, we use algorithmic information theory to measure the information in our data. The Kolmogorov complexity of an object is the descriptive complexity of the object, which is the length of the shortest computer program that can describe the object. This is formally defined as follows [16].
Definition 1
The Kolmogorov complexity of an object $x$ with respect to a universal computer $\mathcal{U}$, denoted by $K_{\mathcal{U}}(x)$, is defined as
$K_{\mathcal{U}}(x) = \min_{p \,:\, \mathcal{U}(p) = x} l(p)$,
which is the minimum length $l(p)$ over all programs $p$ that print $x$ and halt.
However, we cannot compute the Kolmogorov complexity of an object. Therefore, in practice, the MDL principle is utilized. Using the crude MDL criterion, we choose a model $M$ from a family of models, $\mathcal{M}$, that minimizes the two-term objective function $L(x \mid M) + L(M)$, where $L(x \mid M)$ is the number of bits required to describe the object given the model, and $L(M)$ is the number of bits required to describe the model itself. Hence, using the crude MDL criterion, we have
$M^* = \arg\min_{M \in \mathcal{M}} \left[ L(x \mid M) + L(M) \right]$.
III-B MDL-based Pattern Summarization
The MDL principle can be used for pattern summarization where we want to select a small subset of an existing large set of candidate patterns denoted by $\mathcal{F}$. In this part, we present the algorithm proposed by Vreeken et al. [7], which uses the MDL principle for pattern summarization. This algorithm performs pattern summarization by searching among code tables of patterns as the family of models to describe the data. A code table, denoted by $CT$, has two columns: the first column consists of selected patterns, and the second column consists of binary codes used to encode the patterns in the first column. This algorithm, which basically outputs a semi-adaptive compression dictionary, selects the best code table from the family of code tables $\mathcal{CT}$ as
$CT^* = \arg\min_{CT \in \mathcal{CT}} \left[ L(\mathcal{D} \mid CT) + L(CT) \right]$ (1)
In the algorithm, as the search space for constructing code tables is very large, a heuristic approach is used to select the best code table. This heuristic approach consists of three steps. In the first step, candidate patterns in the candidate set $\mathcal{F}$ are ordered descending first by their support, second by their length. In the second step, a standard code table consisting of all singleton items is constructed. In the third step, candidate patterns from the ordered $\mathcal{F}$ are examined one by one. In this step, if adding a candidate pattern to the current code table results in a smaller objective function, i.e., a smaller $L(\mathcal{D} \mid CT) + L(CT)$, it is kept in the code table; otherwise it is dropped. This leads to keeping only a small subset of $\mathcal{F}$ in the final code table. The patterns in the final code table are considered as the patterns chosen by the MDL principle.
We here explain how the two terms $L(\mathcal{D} \mid CT)$ and $L(CT)$ in equation (1) are calculated. The first term in equation (1), $L(\mathcal{D} \mid CT)$, is calculated as
$L(\mathcal{D} \mid CT) = \sum_{t \in \mathcal{D}} \sum_{X \in \mathrm{cover}(t)} L(\mathrm{code}_{CT}(X))$,
where $L(\mathrm{code}_{CT}(X))$ is the length of the binary code for the pattern $X$, and $\mathrm{cover}(t)$ is the set of patterns used to cover $t$. The patterns covering a transaction satisfy the following properties
$\bigcup_{X \in \mathrm{cover}(t)} X = t$
and
$X \cap Y = \emptyset$ for all distinct $X, Y \in \mathrm{cover}(t)$.
As there can be several ways (different sets of patterns) to cover a transaction, the patterns in the code table are ordered descending first by their length, next by their support; the patterns are selected according to this order to cover a transaction.
The lengths of binary codes in the second column of the code table, i.e., $L(\mathrm{code}_{CT}(X))$, are determined by the Shannon code, which is a prefix code. The more often a pattern is used in the cover of transactions, the shorter its code. Therefore, by defining the usage of a pattern as
$\mathrm{usage}(X) = |\{ t \in \mathcal{D} : X \in \mathrm{cover}(t) \}|$,
the code for the pattern $X$ is of length
$L(\mathrm{code}_{CT}(X)) = -\log_2 \left( \frac{\mathrm{usage}(X)}{\sum_{Y \in CT} \mathrm{usage}(Y)} \right)$.
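The cover, usage, and code-length computations described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the code table is assumed to be a list of frozensets already sorted in cover order, patterns with zero usage are simply left out of the length table (instead of being given a small usage), and all names and the toy data are hypothetical.

```python
from math import log2

def cover(transaction, code_table):
    """Greedy cover: walk the code table in order and take every pattern that
    still fits into the not-yet-covered part of the transaction."""
    remaining = set(transaction)
    used = []
    for pattern in code_table:
        if pattern and pattern <= remaining:
            used.append(pattern)
            remaining -= pattern
    return used

def code_lengths(dataset, code_table):
    """Shannon code length (in bits) of every pattern with non-zero usage."""
    usage = {pattern: 0 for pattern in code_table}
    for t in dataset:
        for pattern in cover(t, code_table):
            usage[pattern] += 1
    total = sum(usage.values())
    return {p: -log2(u / total) for p, u in usage.items() if u > 0}

def encoded_size(dataset, code_table):
    """L(D | CT): sum of code lengths over the covers of all transactions."""
    lengths = code_lengths(dataset, code_table)
    return sum(lengths[p] for t in dataset for p in cover(t, code_table))

# Toy data (hypothetical). Code tables are listed in cover order:
# longer patterns first, then higher support.
dataset = [{1, 2}, {1, 2}, {1, 2}, {3}]
standard = [frozenset({1}), frozenset({2}), frozenset({3})]
with_pattern = [frozenset({1, 2})] + standard

print(encoded_size(dataset, standard))
print(encoded_size(dataset, with_pattern))
```

Adding the pattern {1, 2} shortens the encoded data because three transactions are then covered by a single frequently used code; whether the pattern is kept also depends on the cost of the code table itself, which this sketch does not compute.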
The second term in equation (1), $L(CT)$, is calculated as
(2) 
where is the number of times that item appears in the patterns in the first column of the code table. The number of all possible items in the first column of the code table considering a separator between each two patterns is . The first two terms on the left-hand side of equation (2) correspond to encoding the first column of the code table. The last term on the left-hand side of equation (2) corresponds to encoding the second column of the code table consisting of prefix binary codes.
III-B1 Example
We here provide an example for pattern summarization. In this example, we consider the following dataset which consists of five items and 10 transactions.
1  2  3  4  5 

1  1  1  1  0 
1  1  1  1  0 
1  1  0  1  0 
0  1  1  1  1 
0  0  1  1  1 
0  0  0  1  1 
0  1  0  0  0 
0  0  1  0  0 
0  0  0  1  0 
0  0  0  0  1 
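Each row represents a transaction. This dataset can be represented as
$\mathcal{D} = \{\{1,2,3,4\}^2, \{1,2,4\}, \{2,3,4,5\}, \{3,4,5\}, \{4,5\}, \{2\}, \{3\}, \{4\}, \{5\}\}$,
where $\mathcal{D}$ is a multiset, and the superscript for an element shows the multiplicity of that element. We use CFPM with the support threshold $\sigma = 1$ to extract all CFPs of this dataset. This is to form the list of candidate patterns required to construct an MDL-based code table for this dataset. Using extracted CFPs, the ordered list of candidate patterns is
support  |  pattern

7  |  $\{4\}$
5  |  $\{2\}$
5  |  $\{3\}$
4  |  $\{2,4\}$
4  |  $\{3,4\}$
4  |  $\{5\}$
3  |  $\{1,2,4\}$
3  |  $\{2,3,4\}$
3  |  $\{4,5\}$
2  |  $\{1,2,3,4\}$
2  |  $\{3,4,5\}$
1  |  $\{2,3,4,5\}$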
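The candidate order used in the first step of the heuristic can be checked with a few lines of Python. The sketch below recomputes the supports of the closed patterns of the ten-transaction example and sorts them descending by support, then by length. The lexicographic tie-break for patterns with equal support and length is an assumption, since the text does not specify one.

```python
# The ten transactions of the example dataset.
dataset = [
    {1, 2, 3, 4}, {1, 2, 3, 4}, {1, 2, 4}, {2, 3, 4, 5}, {3, 4, 5},
    {4, 5}, {2}, {3}, {4}, {5},
]
# Closed patterns of this dataset (every non-empty intersection of transactions).
patterns = [
    {4}, {2}, {3}, {5}, {2, 4}, {3, 4}, {4, 5}, {1, 2, 4}, {2, 3, 4},
    {1, 2, 3, 4}, {3, 4, 5}, {2, 3, 4, 5},
]

def support(pattern):
    """Number of transactions that contain every item of `pattern`."""
    return sum(1 for t in dataset if pattern <= t)

# Descending support, then descending length, then lexicographic (assumed).
ordered = sorted(patterns, key=lambda p: (-support(p), -len(p), sorted(p)))
for p in ordered:
    print(support(p), sorted(p))
```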
The final code table using the described approach is
binary code length  

which shows the effectiveness of the MDL principle for pattern summarization. In the second column of the code table, we have provided the lengths of binary codes rather than the binary codes themselves. This is because the lengths are more important than the codes themselves. Note that item 1 does not appear in the cover of any transaction, i.e., its usage is equal to zero. We keep all singleton items in the final code table by giving them a small usage when their usage is zero. This is to be able to cover any unseen transactions.
III-C MDL-based Classifier
We here explain how to utilize the MDL principle to build a binary classifier. Supervised learning consists of two phases: training and test. In the training phase, we select a model that best describes the training dataset of each class, $\mathcal{D}_1$ and $\mathcal{D}_2$, using the MDL criterion
$M_i^* = \arg\min_{M \in \mathcal{M}} \left[ L(\mathcal{D}_i \mid M) + L(M) \right], \quad i \in \{1, 2\}$.
In the test phase, if for a transaction $t$, we have
$L(t \mid M_1^*) > L(t \mid M_2^*)$,
this implies that
$P(t \mid M_1^*) < P(t \mid M_2^*)$.
Consequently, we classify the sample under the second class. Otherwise, we classify it under the first class. Note that the term $L(M)$ in the crude MDL criterion prevents the model from being overfitted during the training phase.
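The decision rule can be sketched in Python as below. This is a hedged illustration rather than the authors' code: each code table is assumed to be a list of `(pattern, code_length_in_bits)` pairs already sorted in cover order, and the function names, the toy code tables, and their code lengths are hypothetical.

```python
def encoded_length(transaction, code_table):
    """Bits needed to encode `transaction` with a code table given as a list of
    (pattern, code_length_in_bits) pairs sorted in cover order."""
    remaining = set(transaction)
    bits = 0.0
    for pattern, length in code_table:
        if pattern and pattern <= remaining:
            bits += length
            remaining -= pattern
    if remaining:
        raise ValueError(f"items {sorted(remaining)} are not in the code table")
    return bits

def classify(transaction, code_table_1, code_table_2):
    """Return 2 if class 2 gives the strictly shorter encoding, otherwise 1."""
    l1 = encoded_length(transaction, code_table_1)
    l2 = encoded_length(transaction, code_table_2)
    return 2 if l1 > l2 else 1

# Hypothetical code tables with made-up code lengths (bits).
ct_1 = [({1, 2}, 1.0), ({1}, 3.0), ({2}, 3.0), ({3}, 2.0)]
ct_2 = [({2, 3}, 1.0), ({1}, 2.0), ({2}, 3.0), ({3}, 3.0)]

print(classify({1, 2}, ct_1, ct_2))
print(classify({2, 3}, ct_1, ct_2))
```

Keeping all singleton items in each code table guarantees that every test transaction can be covered, so the error branch is only reached for items that never occurred in training.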
III-D MDL-based Anomaly Detector
We here explain how to utilize the MDL principle to build an anomaly detector. In anomaly detection, we assume that we have access to only a dataset of normal samples (possibly with a small number of anomalies which have been mislabelled as normal samples). Therefore, we just select a model that best describes the normal dataset, $\mathcal{D}_n$, using the MDL criterion
$M^* = \arg\min_{M \in \mathcal{M}} \left[ L(\mathcal{D}_n \mid M) + L(M) \right]$.
Hence, if for the two samples $t_1$ and $t_2$, we have
$L(t_1 \mid M^*) > L(t_2 \mid M^*)$,
this implies
$P(t_1 \mid M^*) < P(t_2 \mid M^*)$.
This says that the larger $L(t \mid M^*)$, the smaller $P(t \mid M^*)$. Therefore, in this method, we need to define a threshold $\tau$ using which we say that a sample $t$ is an anomaly if
$L(t \mid M^*) > \tau$.
IV Proposed Method for Model Selection
In this work, we aim to construct a classifier for large transactional datasets based on the MDL principle. To do so, as mentioned earlier, we need to select a model that best describes the training dataset for each class using the MDL criterion. This can be done using the two-step KRIMP algorithm proposed by Vreeken et al. [7], shown in Fig. 1. In this algorithm, CFPM is first used to extract all CFPs. These CFPs are then considered as candidate patterns to construct a code table of patterns as described in Section III-B. The code table is considered as the selected model by the MDL criterion. However, for large datasets, this algorithm can be computationally very expensive. In the first step, having a small support threshold $\sigma$ in CFPM can lead to pattern explosion, and consequently extracting all CFPs is computationally very expensive. In the second step, we need to test each extracted pattern to decide whether to keep the pattern in the final code table or to drop it. This step is also computationally expensive and very slow. This is because it needs to be done for all extracted CFPs in a specific serial order (it cannot be parallelized), and the whole dataset is used for testing each pattern. To address these problems, we propose using clustering in conjunction with CFPM. We show that this approach extracts a subset of all CFPs by giving priority to longer CFPs. Here, we first explain how clustering affects CFPM. We then present our MDL-based model-selection method.
IV-A CFPM after Clustering
We consider applying a CFPM algorithm to the clusters of a dataset separately, and then taking the union over the outputs of the CFPM algorithm for the clusters. This provides us with a subset of all CFPs for the whole dataset (i.e., the output of directly applying a CFPM algorithm to the whole dataset). This is because if a pattern is a CFP considering one of the clusters, it is also a CFP considering the whole dataset. We argue here that this method can be considered as a pattern summarization method that gives priority to longer patterns. This is important for our application as the compression is mainly achieved through longer patterns.
To cluster the whole dataset, we utilize a clustering algorithm designed for transactional datasets as described in Section II-C. The clustering algorithm tries to group similar transactions into one cluster, and dissimilar ones into separate clusters. This implies that the longer a pattern supported by several transactions, the higher the probability that clustering groups those transactions into one cluster. After clustering, four types of clusters can exist corresponding to a CFP, given the support threshold $\sigma$: Type-I cluster where the support of the pattern is zero; Type-II cluster where the support of the pattern is nonzero but less than $\sigma$; Type-III cluster where the support of the pattern is greater than or equal to $\sigma$, but there is a superset of the pattern with the same support; and Type-IV cluster where the support of the pattern is greater than or equal to $\sigma$, and there is no superset of the pattern with the same support. Therefore, if we do not have any Type-IV clusters corresponding to a CFP, clustering leads to dropping that pattern. We divide cases leading to dropping a CFP into two scenarios. In the first scenario, all clusters are Type-I or Type-III. In the second scenario, there is at least one Type-II cluster. In both scenarios, we do not have any Type-IV clusters.
We here show these two scenarios via two examples. In these two examples, we use the CLOPE algorithm with repulsion factor equal to four, and maximum cluster number equal to two. We also consider $\sigma$ to be two.
In this first example where we face the first scenario, our dataset consists of five items and seven transactions as shown here
1  2  3  4  5 

1  1  1  1  0 
1  1  1  1  0 
0  1  1  1  1 
0  1  1  1  1 
0  0  1  1  1 
0  0  1  1  1 
0  0  0  1  1 
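which is represented as
$\mathcal{D} = \{\{1,2,3,4\}^2, \{2,3,4,5\}^2, \{3,4,5\}^2, \{4,5\}\}$.
The CFPs of this dataset are
$\{4\}$, $\{3,4\}$, $\{4,5\}$, $\{2,3,4\}$, $\{3,4,5\}$, $\{1,2,3,4\}$, and $\{2,3,4,5\}$.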
After clustering, the following two clusters are formed
and the union of CFPs for these two clusters is
For any of the missing CFPs, both clusters are Type-III.
In this second example where we face the second scenario, our dataset consists of nine items and four transactions
1  2  3  4  5  6  7  8  9 

0  0  0  0  0  0  1  1  1 
0  0  0  1  1  1  1  1  1 
1  1  1  1  1  1  0  0  0 
1  1  1  0  0  0  0  0  0 
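which is represented as
$\mathcal{D} = \{\{7,8,9\}, \{4,5,6,7,8,9\}, \{1,2,3,4,5,6\}, \{1,2,3\}\}$.
The CFPs of this dataset are
$\{1,2,3\}$, $\{4,5,6\}$, and $\{7,8,9\}$.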
After clustering, these two clusters are formed
and the union of CFPs for these two clusters is
It can be seen that a pattern is dropped as the result of clustering. Both clusters are Type-II for this pattern.
The patterns dropped as the result of facing the first scenario are those which are formed from the intersection of longer CFPs. In both scenarios, the patterns dropped as the result of clustering are mainly from shorter patterns. That is why we consider CFPM after clustering as a pattern summarization method which gives priority to longer patterns.
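The four cluster types can be made concrete with a small Python sketch. It is an illustration under stated assumptions: the function name is hypothetical, and the split of the second example dataset into two clusters is one possible split chosen here for illustration, not necessarily the one produced by CLOPE.

```python
def cluster_type(pattern, cluster, sigma):
    """Classify a cluster as Type I, II, III or IV with respect to `pattern`."""
    pattern = set(pattern)
    supporting = [set(t) for t in cluster if pattern <= set(t)]
    if not supporting:
        return "I"    # support is zero
    if len(supporting) < sigma:
        return "II"   # supported, but below the threshold
    # The pattern has a superset with the same support exactly when the
    # intersection of its supporting transactions is larger than the pattern.
    closure = set.intersection(*supporting)
    return "IV" if closure == pattern else "III"

# An assumed split of the nine-item example dataset (illustrative only).
cluster_a = [{7, 8, 9}, {4, 5, 6, 7, 8, 9}]
cluster_b = [{1, 2, 3, 4, 5, 6}, {1, 2, 3}]

for pattern in ({1, 2, 3}, {4, 5, 6}, {7, 8, 9}):
    print(sorted(pattern),
          cluster_type(pattern, cluster_a, sigma=2),
          cluster_type(pattern, cluster_b, sigma=2))
```

A pattern survives clustering only if at least one cluster is Type-IV for it; under the assumed split, the pattern {4, 5, 6} has support one in each cluster and is therefore lost.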
IV-B Proposed Model-Selection Method
In this section, we explain our proposed MDL-based model-selection method shown in Fig. 2.
As discussed in the last section, we use clustering in conjunction with CFPM to extract a subset of CFPs by giving priority to longer patterns. The maximum number of clusters can be decided based on the support threshold $\sigma$. The larger $\sigma$, the smaller the number of clusters. For a large $\sigma$, we do not face pattern explosion, and consequently we do not need clustering. We use this method when we have a small $\sigma$, and as a result we face pattern explosion. This is to directly avoid pattern explosion, i.e., not by first extracting all CFPs, and then selecting a subset of them. As our target is to minimize the probability that a long CFP is dropped, and also to maximize pattern summarization, we propose the following strategy. We first cluster the dataset, and rank clusters according to the following criterion
$Q(C_i) = \frac{H(C_i)}{N(C_i)}$ (3)
where $H(C_i)$ and $N(C_i)$ are the height and the number of transactions of cluster $C_i$ respectively. The height of cluster $C_i$ is defined as
$H(C_i) = \frac{\sum_{t \in C_i} |t|}{\left| \bigcup_{t \in C_i} t \right|}$.
The cluster quality takes a value between zero and one. It is equal to one where all the transactions of a cluster are the same (the highest quality). We next select a subgroup of clusters as high-quality (HQ) clusters by setting a quality threshold, and perform CFPM in only HQ clusters. In HQ clusters, transactions share the majority of their items, and as a result the number of CFPs in these clusters is not large even when considering a small $\sigma$. Low-quality (LQ) clusters are the main reason for pattern explosion, and the output of CFPM in these clusters consists mainly of short patterns.
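A minimal Python sketch of the cluster-quality ranking follows. It assumes that the height of a cluster is the CLOPE-style height, i.e., the total number of item occurrences divided by the number of distinct items, and that the quality is this height divided by the number of transactions; this reading is consistent with the quality being one for a cluster of identical transactions, but it is an assumption. The function names, the toy clusters, and the threshold value are hypothetical.

```python
def cluster_quality(cluster):
    """Quality = height / number of transactions, assuming
    height = total item occurrences / number of distinct items."""
    n = len(cluster)
    size = sum(len(t) for t in cluster)    # total item occurrences
    width = len(set().union(*cluster))     # number of distinct items
    height = size / width
    return height / n

def high_quality_clusters(clusters, threshold):
    """Keep only clusters whose quality reaches the (user-chosen) threshold."""
    return [c for c in clusters if cluster_quality(c) >= threshold]

identical = [{1, 2, 3}] * 4             # all transactions equal
disjoint = [{1}, {2}, {3}, {4}]         # no shared items at all
mixed = [{1, 2, 3}, {1, 2}, {1, 2, 4}]

for cluster in (identical, disjoint, mixed):
    print(cluster_quality(cluster))
```

The sketch expects non-empty clusters of non-empty transactions; an empty cluster would cause a division by zero.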
As the output of the pattern-mining stage, we take the union over the outputs of CFPM in HQ clusters. We finally construct a code table of patterns according to Section III-B as the selected model.
V Performance Evaluation
In this section, we first compare our proposed model-selection method with the two-step KRIMP method using the small mushroom dataset. This is to show the advantages of our proposed model-selection method in pattern summarization and in constructing a code table. We then evaluate our classifier on a large dataset of API calls for static malware detection in PE files. We also compare our classifier with deep neural networks, which provide the state-of-the-art performance in static malware detection.
V-A Small Dataset
We use the mushroom dataset to compare our model-selection method with the KRIMP method. The mushroom dataset, which is a categorical dataset, consists of 4208 edible samples and 3916 poisonous samples. After converting the dataset into a transactional dataset using one-hot encoding, each sample of the dataset is a binary sequence of size 117.
V-A1 KRIMP Method
We here use the KRIMP method, shown in Fig. 1, to construct a code table, $CT_e$, for the edible dataset, $\mathcal{D}_e$, and a code table, $CT_p$, for the poisonous dataset, $\mathcal{D}_p$. We first use the LCM algorithm for CFPM. Considering the support threshold $\sigma$ to be 0.5 percent of the dataset size for each class (i.e., $\sigma \approx 21$ for the edible dataset, and $\sigma \approx 20$ for the poisonous dataset), we have 34781 CFPs in the edible dataset, and 24041 CFPs in the poisonous dataset. Using extracted CFPs as candidate patterns, we then construct two code tables for the edible and the poisonous datasets. The final code table for the edible dataset consists of 238 patterns, and the one for the poisonous dataset consists of 176 patterns. Using constructed code tables via the KRIMP method, we have
which show the compression achieved for the edible, and the poisonous datasets. Before the compression, the edible dataset consists of bits, and the poisonous dataset consists of bits.
V-A2 Proposed Method
We here use our proposed method, shown in Fig. 2, to construct a code table for the edible dataset, and a code table for the poisonous dataset. To cluster the dataset for each class, we use the CLOPE algorithm with repulsion factor equal to four, and maximum cluster number equal to eight. This provides us with eight clusters for the edible dataset, and six clusters for the poisonous dataset. The cluster qualities of the edible dataset are 0.73, 0.71, 0.71, 0.67, 0.28, 0.63, 0.73, and 0.76. We consider all edible clusters as HQ clusters. The cluster qualities of the poisonous dataset are 0.71, 0.71, 0.65, 0.65, 0.35, and 0.58. We also consider all poisonous clusters as HQ clusters.
After clustering, we now perform CFPM in all clusters separately. We again consider the support threshold $\sigma$ to be 0.5 percent of the dataset size for each class, i.e., $\sigma \approx 21$ for edible clusters, and $\sigma \approx 20$ for poisonous clusters. After taking the union over the outputs of CFPM in different clusters, we have 10831 CFPs corresponding to the edible dataset, and 16554 CFPs corresponding to the poisonous dataset. Note that we now have a shorter list of candidate patterns corresponding to each dataset compared to the KRIMP method.
Using extracted CFPs as candidate patterns, we finally construct two code tables for the edible and the poisonous datasets. The final code table for the edible dataset consists of 183 patterns, and the one for the poisonous dataset consists of 186 patterns. Using constructed code tables via our method, we have
This shows that even after making the list of candidate patterns shorter using our proposed method, we can achieve the same order of compression as the KRIMP method. As the CLOPE algorithm in our method is a fast, low-complexity algorithm, our method makes the process of code table construction computationally much less expensive, especially for large transactional datasets.
V-B Malware Detection Dataset
We use the dataset provided by Al-Dujaili et al. [2] to test our classifier for static malware detection in PE files. Our dataset is constructed using 14772 benign training samples, 14772 malware training samples, 4924 benign test samples, and 4924 malware test samples. The total number of API calls in the dataset is 22761. Therefore, each sample of the dataset is a binary sequence of size 22761 where the locations of ones determine the API calls of that sample.
V-B1 Neural Network and Its Performance
We use fully connected feedforward neural networks to find the state-of-the-art performance for our malware dataset. We use five-fold cross-validation to optimize the hyperparameters of our network. Our network consists of five layers: one input layer of size 22761, three hidden layers of size 300, and one output layer of size two. The rectified linear unit (ReLU) is used as the activation function at the hidden layers, and the softmax function is used at the output layer. We use a dropout rate of 50 percent to avoid overfitting. The mini-batch size is 100 samples, the learning rate of the Adam optimizer is 0.0001, and the number of epochs is 50. The accuracy, false positive rate (FPR), and false negative rate (FNR) obtained by this network are 91.91, 8.04, and 8.12 percent respectively.
V-B2 Proposed Classifier and Its Performance
We first use the KRIMP method, shown in Fig. 1, for our dataset. Considering the support threshold $\sigma$ to be 60 percent of the training dataset size for each class, i.e., $\sigma \approx 8863$, we have 7769 CFPs in the benign training dataset, and 2058 CFPs in the malware training dataset. Using these CFPs as candidate patterns, we construct two code tables for the benign and the malware training datasets referred to as $CT_b$ and $CT_m$ respectively. We decide a sample $t$ in the test dataset to be malicious if
$L(t \mid CT_m) < L(t \mid CT_b)$ (4)
and to be benign otherwise. The accuracy, FPR, and FNR obtained by this approach are 85.29, 4.18, and 25.22 percent respectively. In order to improve the performance, we try to use a smaller $\sigma$ by which we can extract longer patterns as candidate patterns for model selection. However, by decreasing $\sigma$ to 50 percent of the training dataset size for each class, i.e., $\sigma = 7386$, we have 218608 CFPs in the benign training dataset, and 85842 CFPs in the malware training dataset. As can be seen, by decreasing the threshold by only ten percent of the dataset size, we have a dramatic increase in the number of CFPs. This prevents us from using the KRIMP method as we need to set a much smaller $\sigma$, and consequently the complexity of both CFPM and code table construction dramatically increases.
We therefore use our proposed approach. To cluster the training dataset for each class, we use the CLOPE algorithm with repulsion factor equal to four, and maximum cluster number equal to eight. This provides us with eight clusters for each of the benign and the malware training datasets. The cluster qualities of the benign training dataset are 0.85, 0.67, 0.25, 0.69, 0.58, 0.84, 0.03, and 0.29. We consider only the cluster with quality 0.03 as an LQ cluster, and consider the rest as HQ clusters. The cluster qualities of the malware training dataset are 0.99, 0.33, 0.89, 0.01, 0.69, 0.36, 0.47, and 0.59. We consider only the cluster with quality 0.01 as an LQ cluster, and consider the rest as HQ clusters.
After clustering and selecting HQ clusters, we now perform CFPM for HQ clusters separately. We consider the support threshold $\sigma$ to be 0.5 percent of the training dataset size for each class, i.e., $\sigma \approx 74$. After taking the union, we have 24274 CFPs corresponding to the benign training dataset, and 812 CFPs corresponding to the malware training dataset. Note that we now only have a small number of candidate patterns for each training dataset even when considering a very small $\sigma$.
We finally construct two code tables using extracted patterns as candidate patterns for the benign and the malware training datasets. Using these selected models and the decision criterion in (4), the accuracy, FPR, and FNR of our approach are 89.43, 12.77, and 8.36 percent respectively. It can be seen that we have been able to improve the accuracy to be very close to the one for deep neural networks. In the next section, we argue that MDL-based classifiers can be considered as interpretable classifiers, which motivates us to use them even at the cost of a small penalty in accuracy.
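As a quick consistency check of the reported figures, note that the test set is balanced (4924 benign and 4924 malware samples), so the accuracy should equal 100 minus the average of the FPR and the FNR. The short Python sketch below recomputes the accuracy from the reported rates; the function name and dictionary labels are illustrative, and small differences are expected because the reported values are rounded.

```python
def accuracy_from_rates(fpr, fnr):
    """Accuracy (percent) on a test set with equally many benign and malware
    samples, given the false positive and false negative rates (percent)."""
    return 100.0 - (fpr + fnr) / 2.0

# (reported accuracy, FPR, FNR) in percent, as stated in the text.
reported = {
    "neural network": (91.91, 8.04, 8.12),
    "KRIMP": (85.29, 4.18, 25.22),
    "proposed": (89.43, 12.77, 8.36),
}

for name, (acc, fpr, fnr) in reported.items():
    print(name, acc, round(accuracy_from_rates(fpr, fnr), 2))
```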
VI Interpretability of MDL-based Classifiers
In this section, we illustrate the interpretability of MDL-based classifiers. As mentioned in the introduction, interpretability is about understanding why a decision is made rather than only what the decision is. Methods for interpreting machine learning models fall into two classes: intrinsic and post hoc. Intrinsic interpretability is when the machine learning model itself is interpretable due to its structure. Post hoc interpretability is when a method is developed to interpret a machine learning model after its training. Intrinsically interpretable models can also serve as post hoc methods by approximating the main model in order to explain its decisions. We show here that MDL-based classifiers can be considered intrinsically interpretable models. For a two-class classifier, we can easily understand why a sample is classified under one class rather than the other in the following cases.
Case 1: The cover of a sample using the code table for one class consists of a few long and several short patterns, and the cover using the code table for the other class consists of many short patterns. This shows that the sample should belong to the class with a few long patterns. This is because longer patterns represent higher similarity with the samples of a dataset.
Case 2: The cover of a sample consists of few patterns under the code tables for both classes, but the patterns have a shorter total code length under one class, say class 1, than under class 2. This means that the cover consists of patterns with higher usage under class 1, so the sample should be classified under class 1.
Case 3: A sample contains items whose support is zero in one class but not in the other. This shows that the sample should not be classified under the class that does not support some of its items.
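The three cases can be made concrete with a small sketch. It assumes a KRIMP-style code table, represented here as a list of (pattern, usage) pairs already sorted in cover order. It also assumes that a pattern's code length is the negative base-2 logarithm of its relative usage, and that a sample containing an item the table cannot cover gets infinite length. The function names, the API-call names, and the usage counts are all hypothetical and are not taken from the experiments.

```python
import math


def cover(sample, code_table):
    """Greedy cover: walk the table in order and take every pattern that
    still fits inside the not-yet-covered items of the sample."""
    remaining = set(sample)
    used = []
    for pattern, _usage in code_table:
        if pattern <= remaining:
            used.append(pattern)
            remaining -= pattern
    return used, remaining


def encoded_length(sample, code_table):
    """Return (bits, cover). Bits are infinite if some item is unsupported."""
    total_usage = sum(usage for _, usage in code_table)
    usage_of = dict(code_table)
    used, remaining = cover(sample, code_table)
    if remaining:  # Case 3: the class does not support some item
        return math.inf, used
    bits = sum(-math.log2(usage_of[p] / total_usage) for p in used)
    return bits, used


def classify(sample, tables):
    """Pick the class whose code table encodes the sample in fewest bits."""
    lengths = {label: encoded_length(sample, ct)[0] for label, ct in tables.items()}
    return min(lengths, key=lengths.get), lengths


# Hypothetical code tables (patterns and usage counts are made up).
tables = {
    "malware": [
        (frozenset({"VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"}), 8),
        (frozenset({"VirtualAllocEx"}), 2),
        (frozenset({"WriteProcessMemory"}), 2),
        (frozenset({"CreateRemoteThread"}), 2),
        (frozenset({"CloseHandle"}), 2),
    ],
    "benign": [
        (frozenset({"CreateFileW", "ReadFile", "CloseHandle"}), 8),
        (frozenset({"VirtualAllocEx"}), 1),
        (frozenset({"WriteProcessMemory"}), 1),
        (frozenset({"CreateRemoteThread"}), 1),
        (frozenset({"CloseHandle"}), 5),
    ],
}
sample = {"VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread", "CloseHandle"}
label, lengths = classify(sample, tables)
print(label, lengths)
```

In this toy example the malware table covers the sample with one long pattern and one singleton, while the benign table needs four low-usage singletons, which is the situation described in Case 1. Printing the cover returned by `encoded_length` for each class shows which patterns drove the decision.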
VII Conclusion
We utilized the minimum description length (MDL) principle to design a classifier for large transactional datasets. To do so, we proposed an MDL-based model-selection method for these datasets. The model selection involves first constructing a list of closed frequent patterns (CFPs), and then selecting a subset using the MDL criterion. We showed that, using our method, we can dramatically shorten the list by giving priority to longer patterns, as the compression is mainly achieved through longer patterns. This is important because extracting all CFPs and then summarizing them is computationally very expensive due to pattern explosion. We applied our classifier to a dataset of API calls for static malware detection in portable executable (PE) files. We also applied deep neural networks to this dataset to obtain the state-of-the-art performance. The comparison showed that we can obtain an accuracy very close to that of deep neural networks for our dataset. We also argued that MDL-based classifiers can be considered intrinsically interpretable classifiers. Although we might need to pay a small penalty in terms of accuracy, interpretability motivates us to use MDL-based classifiers. We believe that such interpretability can provide a good stepping stone to developing a robust classifier that can withstand evasion attacks. An understanding of why a certain decision is made can provide useful hints about how to counteract functionality-preserving modifications to malware samples.
Appendix
In this appendix, we present the two algorithms which have been used in this work for CFPM and clustering.
VII-A LCM Algorithm
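We here provide an overview of the LCM algorithm [12] for CFPM in transactional datasets. In the LCM algorithm, the closure of a pattern $P$ is defined as
$$\mathrm{clo}(P) = \bigcap_{t \in \mathcal{T}(P)} t,$$
where $\mathcal{T}(P) = \{t \in \mathcal{T} : P \subseteq t\}$ is the set of transactions in the dataset $\mathcal{T}$ that contain $P$. Hence, for every pair of patterns $P$ and $Q$, we have

If $\mathcal{T}(P) = \mathcal{T}(Q)$, then $\mathrm{clo}(P) = \mathrm{clo}(Q)$,

A pattern $P$ is closed if and only if $\mathrm{clo}(P) = P$.
In the LCM algorithm, the key notion of prefix-preserving closure extension (ppc-extension) is also defined as follows. A pattern $Q$ is called a ppc-extension of $P$ if

i) $Q = \mathrm{clo}(P \cup \{i\})$ and,

ii) $Q(i-1) = P(i-1)$,
for some $i \notin P$ and $i > \mathrm{core\_i}(P)$. Here, $P(i) = P \cap \{1, \ldots, i\}$, and the core index of $P$, $\mathrm{core\_i}(P)$, is the minimum index $i$ such that $\mathcal{T}(P(i)) = \mathcal{T}(P)$.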
Based on the notion of ppc-extension, the LCM algorithm works as follows. It starts with an empty pattern, where the core index of the empty set is taken to be zero, i.e., $\mathrm{core\_i}(\emptyset) = 0$. All the frequent ppc-extensions of the empty pattern are calculated. Then, for every newly generated frequent ppc-extension, this procedure is repeated. The algorithm ends when there is no new frequent ppc-extension. All the generated frequent ppc-extensions are the output of this algorithm. It has been proved that this output consists of all the CFPs.
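The recursion described above can be sketched in a few lines. This is a minimal, unoptimized illustration and not the implementation of [12]: it recomputes occurrence lists by scanning the whole dataset, and it assumes that items are positive integers so that prefixes are well defined. One detail differs from the wording above: the root is taken to be the closure of the empty pattern, which equals the empty pattern whenever no item occurs in every transaction. The function name and the toy dataset are hypothetical.

```python
def lcm_closed_patterns(transactions, min_support):
    """Enumerate closed frequent patterns by recursive ppc-extension.

    Assumes integer items and min_support >= 1.
    Returns a list of (pattern, support) pairs.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    found = []

    def occurrences(pattern):
        # Naive scan; a real implementation maintains occurrence lists.
        return [t for t in transactions if pattern <= t]

    def expand(pattern, core, support):
        if pattern:  # the empty pattern itself is not reported
            found.append((pattern, support))
        for i in items:
            if i <= core or i in pattern:
                continue
            occ = occurrences(pattern | {i})
            if len(occ) < min_support:
                continue
            candidate = frozenset.intersection(*occ)  # closure of pattern + {i}
            # Prefix-preserving check: items below i must be unchanged.
            if {j for j in candidate if j < i} == {j for j in pattern if j < i}:
                expand(candidate, i, len(occ))

    if transactions and len(transactions) >= min_support:
        root = frozenset.intersection(*transactions)  # closure of the empty pattern
        expand(root, 0, len(transactions))
    return found


# Toy dataset (made up) with a minimum support of two transactions.
toy = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}]
for pattern, support in lcm_closed_patterns(toy, min_support=2):
    print(sorted(pattern), support)
```

The prefix check is what lets each closed pattern be generated exactly once, so no table of previously seen patterns is needed.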
VII-B CLOPE Algorithm
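We here provide an overview of the CLOPE algorithm [15] as a fast and scalable algorithm for clustering transactional datasets. This algorithm uses the global criterion function defined as
$$\mathrm{Profit}(\mathcal{C}) = \frac{\sum_{i=1}^{k} \frac{H(C_i)}{W(C_i)}\,|C_i|}{\sum_{i=1}^{k} |C_i|} = \frac{\sum_{i=1}^{k} \frac{S(C_i)}{W(C_i)^2}\,|C_i|}{\sum_{i=1}^{k} |C_i|}, \qquad (5)$$
where $H(C_i)$, $W(C_i)$, and $S(C_i)$ are the height, the width, and the size of cluster $C_i$, respectively, $|C_i|$ is the number of transactions in $C_i$, and $k$ is the number of clusters. The width of a cluster is calculated as $W(C_i) = |D(C_i)|$, where $D(C_i)$ is the set of distinct items in $C_i$. The size of a cluster is calculated as $S(C_i) = \sum_{t \in C_i} |t|$. Using the width and the size of a cluster $C_i$, its height is defined as $H(C_i) = S(C_i)/W(C_i)$.
The global criterion function in (5) can be generalized as
$$\mathrm{Profit}_r(\mathcal{C}) = \frac{\sum_{i=1}^{k} \frac{S(C_i)}{W(C_i)^r}\,|C_i|}{\sum_{i=1}^{k} |C_i|},$$
where $r$, called the repulsion factor, is used to control intra-cluster similarity. A larger repulsion factor leads to clusters in which transactions share more common items.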
The CLOPE algorithm consists of two phases: an allocation phase and a refinement phase. In the allocation phase, we read the transactions one by one. We either allocate a transaction to an existing cluster or create a new cluster, whichever maximizes the profit. In the refinement phase, we read the transactions again. In this phase, we check whether moving a transaction to another existing cluster or to a new cluster increases the profit. If a transaction is moved, we update the clusters and continue to scan the whole dataset. The refinement phase ends when no transaction is moved in an iteration.
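The two phases can be sketched as follows. This is a minimal in-memory illustration, not the implementation of [15]. It assumes non-empty transactions, works with the numerator of the profit only (the denominator is the fixed number of transactions), and breaks ties in favour of leaving a transaction where it is. It omits the cap on the number of clusters used in our experiments. The class and function names and the toy data are hypothetical.

```python
from collections import Counter


class Cluster:
    def __init__(self):
        self.n = 0            # number of transactions in the cluster
        self.size = 0         # S(C): total number of item occurrences
        self.occ = Counter()  # item -> count; len(self.occ) is the width W(C)

    def add(self, t):
        self.n += 1
        self.size += len(t)
        self.occ.update(t)

    def remove(self, t):
        self.n -= 1
        self.size -= len(t)
        for item in t:
            self.occ[item] -= 1
            if self.occ[item] == 0:
                del self.occ[item]

    def delta_add(self, t, r):
        """Change in S(C) * |C| / W(C)^r if transaction t joins the cluster."""
        new_width = len(self.occ.keys() | t)
        new_value = (self.size + len(t)) * (self.n + 1) / new_width ** r
        old_value = self.size * self.n / len(self.occ) ** r if self.n else 0.0
        return new_value - old_value


def clope(transactions, r=2.0):
    """Return one cluster label per transaction."""
    transactions = [frozenset(t) for t in transactions]
    clusters, assignment = [], []

    def best_cluster(t, current=None):
        fresh = Cluster()
        best = current
        best_gain = current.delta_add(t, r) if current is not None else float("-inf")
        for c in clusters + [fresh]:
            gain = c.delta_add(t, r)
            if gain > best_gain:  # strict, so ties keep the current cluster
                best, best_gain = c, gain
        if best is fresh:
            clusters.append(fresh)
        return best

    for t in transactions:  # allocation phase
        c = best_cluster(t)
        c.add(t)
        assignment.append(c)

    moved = True
    while moved:  # refinement phase
        moved = False
        for k, t in enumerate(transactions):
            old = assignment[k]
            old.remove(t)
            new = best_cluster(t, current=old)
            new.add(t)
            if new is not old:
                assignment[k] = new
                moved = True

    live = [c for c in clusters if c.n > 0]
    return [live.index(c) for c in assignment]


# Toy transactions (made up): two groups with disjoint items.
toy = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"},
       {"x", "y", "z"}, {"x", "y", "z"}, {"x", "y"}]
print(clope(toy, r=2.0))
```

Removing a transaction from its cluster before scoring the candidates is equivalent to comparing full profit changes, since the cost of the removal is the same for every candidate destination.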
References
[1] M. G. Schultz, E. Eskin, E. Zadok, and S. J. Stolfo, “Data mining methods for detection of new malicious executables,” in Proc. IEEE Symp. Security and Privacy (S&P), Oakland, USA, May 2001, pp. 38–49.
[2] A. Al-Dujaili, A. Huang, E. Hemberg, and U.-M. O’Reilly, “Adversarial deep learning for robust detection of binary encoded malware,” in Proc. IEEE Security and Privacy Workshops (SPW), San Francisco, USA, May 2018, pp. 76–82.
[3] J. Z. Kolter and M. A. Maloof, “Learning to detect and classify malicious executables in the wild,” Journal of Machine Learning Research, vol. 7, no. 1, pp. 2721–2744, Dec. 2006.
[4] M. Z. Shafiq, S. M. Tabish, F. Mirza, and M. Farooq, “PE-Miner: Mining structural information to detect malicious executables in realtime,” in Proc. Int. Symp. Recent Advances in Intrusion Detection (RAID), Saint-Malo, France, Sep. 2009, pp. 121–141.
[5] H. S. Anderson and P. Roth. (2018, Apr. 16) EMBER: An open dataset for training static PE malware machine learning models. [Online]. Available: https://arxiv.org/abs/1804.04637v2
[6] C. Molnar, Interpretable Machine Learning, 2019, https://christophm.github.io/interpretable-ml-book/.
[7] J. Vreeken, M. van Leeuwen, and A. Siebes, “Krimp: mining itemsets that compress,” Data Min. Knowl. Disc., vol. 23, no. 1, pp. 169–214, July 2011.
[8] K. Smets and J. Vreeken, “The odd one out: Identifying and characterising anomalies,” in Proc. of the 11th SIAM Int. Conf. on Data Mining (SDM), Mesa, USA, Apr. 2011, pp. 804–815.
[9] K. Smets and J. Vreeken, “Slim: Directly mining descriptive patterns,” in Proc. of the 12th SIAM Int. Conf. on Data Mining (SDM), Anaheim, USA, Apr. 2012, pp. 236–247.
[10] L. Akoglu, H. Tong, J. Vreeken, and C. Faloutsos, “Fast and reliable anomaly detection in categorical data,” in Proc. of the 21st ACM Int. Conf. on Information and Knowledge Management (CIKM), Maui, USA, Oct./Nov. 2012, pp. 415–424.
[11] C. C. Aggarwal and J. Han, Frequent Pattern Mining. Springer, 2007.
[12] T. Uno, T. Asai, Y. Uchida, and H. Arimura, “An efficient algorithm for enumerating closed patterns in transaction databases,” in Proc. 7th Int. Conf. Discovery Science, Padova, Italy, Oct. 2004, pp. 16–31.
[13] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proc. 20th Int. Conf. Very Large Databases (VLDB), Santiago, Chile, Sep. 1994, pp. 487–499.
[14] K. Wang, C. Xu, and B. Liu, “Clustering transactions using large items,” in Proc. Eighth Int. Conf. Information and Knowledge Management (CIKM), Kansas City, USA, Nov. 1999, pp. 483–490.
[15] Y. Yang, X. Guan, and J. You, “CLOPE: A fast and effective clustering algorithm for transactional data,” in Proc. Eighth ACM SIGKDD Conf. Knowledge Discovery and Data Mining (KDD), Edmonton, Canada, July 2002, pp. 682–687.
[16] T. A. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2006.