Randomized LU decomposition: An Algorithm for Dictionaries Construction

02/17/2015 ∙ by Aviv Rotbart, et al. ∙ Tel Aviv University

In recent years, distinctive-dictionary construction has gained importance due to its usefulness in data processing. Usually, one or more dictionaries are constructed from training data and are then used to classify signals that did not participate in the training process. A new dictionary construction algorithm is introduced. It is based on a low-rank matrix factorization achieved by applying the randomized LU decomposition to training data. This method is fast, scalable, parallelizable, consumes little memory, outperforms SVD in these categories and also works extremely well on large sparse matrices. In contrast to existing methods, the randomized LU decomposition constructs an under-complete dictionary, which simplifies both the construction and the classification of newly arrived signals. The dictionary construction is generic and fits different applications. We demonstrate the capabilities of this algorithm for file type identification, which is a fundamental task in the digital security arena, performed nowadays, for example, by sandboxing mechanisms, deep packet inspection, firewalls and anti-virus systems. We propose a content-based method that detects file types and depends neither on the file extension nor on metadata. Such an approach is harder to deceive, and we show that only a few file fragments from a whole file are needed for a successful classification. Based on the constructed dictionaries, we show that the proposed method can effectively identify execution code fragments in PDF files. Keywords. Dictionary construction, classification, LU decomposition, randomized LU decomposition, content-based file detection, computer security.




I Introduction

Recent years have shown a growing interest in dictionary learning. Dictionaries were found to be useful for applications such as signal reconstruction, denoising, image inpainting, compression, sparse representation, classification and more. Given a data matrix $A$, a dictionary learning algorithm produces two matrices $D$ and $X$ such that $\|A - DX\|$ is small, where $D$ is called the dictionary and $X$ is a coefficient matrix, also called the representation matrix. Sparsity of $X$ means that each signal from $A$ is described with only a few signals (also called atoms) from the dictionary $D$. It is a major property pursued by many dictionary learning algorithms. The algorithms that learn dictionaries for sparse representations optimize a goal function such as $\min_{D,X} \|A - DX\| + \lambda \|X\|_0$, which considers both the accuracy and the sparsity of the solution, by alternately optimizing these two properties ($\lambda$ is a regularization term). This construction is computationally expensive and does not scale well to big data. It becomes even worse when dictionary learning is used for classification, since another distinctive term, in addition to the two aforementioned ones, is introduced in the objective function. This term gives the learned dictionary a discriminative ability. This can be seen, for example, in the optimization problem $\min_{D,X,W} \|A - DX\| + \lambda \|X\|_0 + \xi \|H - WX\|$, where $W$ is a classifier and $H$ is a vector of labels. $\|H - WX\|$ is the penalty term for achieving a wrong classification. In order to achieve the described properties, dictionaries are usually over-complete, namely, they contain more atoms than the signal dimension. As a consequence, dictionaries are redundant in the sense that there are linear dependencies between atoms. Therefore, a given signal can be represented in more than one way using dictionary atoms. On one hand this enables us to get sparse representations, but on the other hand it complicates the representation process because it is NP-hard to find the sparsest representation for a signal by an over-complete dictionary [13].

In this work, we provide a generic way to construct an under-complete dictionary. Its capabilities will be demonstrated for a signal classification task. Since we do not look for a sparse signal representation, we remove the alternating optimization process used in the construction of over-complete dictionaries. Our dictionary construction is based on matrix factorization. We use the randomized LU matrix factorization algorithm [16] for dictionary construction. This algorithm, which is applied to a given data matrix $A$ of size $m \times n$ with $m$ features and $n$ data-points, decomposes $A$ into two matrices $L$ and $U$, where $L$ is the dictionary and $U$ is the coefficient matrix. The size of $L$ is determined by the decaying spectrum of the singular values of the matrix $A$, and bounded by $\min\{m, n\}$. The columns of $L$ and the rows of $U$ are linearly independent. The proposed dictionary construction has several advantages: it is fast, scalable, parallelizable and thus can run on GPU and multicore-based systems, consumes little memory, outperforms SVD in these categories and works extremely well on large sparse matrices. Under this construction, the classification of a newly arrived signal is done by a fast projection method that represents this signal by the columns of the matrix $L$. The computational cost of this method is linear in the input size, while in the over-complete case finding the optimal solution is NP-hard [13]. Approximation algorithms for sparse signal reconstruction, like Orthogonal Matching Pursuit [17] or Basis Pursuit [6], have no guarantees for general dictionaries.

In order to evaluate the performance of the dictionaries, which are constructed by applying the randomized LU algorithm to a training set, we use them to classify file types. The experiments were conducted on a dataset that contains files of various types. The goal is to classify each file or portion of a file to the class describing its type. To the best of our knowledge, this work is the first to use a dictionary learning method for file type classification. This work considers three different scenarios that represent real security tasks: examining the full content of the tested files, classifying a file type using a small number of fragments from the file, and detecting malicious code hidden inside innocent-looking files. While the first two scenarios were examined by other works, none of the papers described in this work dealt with the latter scenario. It is difficult to compare our results to other algorithms since the datasets they used are not publicly available. For similar testing conditions, we improve the state-of-the-art results. The datasets we used will be made publicly available.

The paper has the following structure: Section II reviews related work on dictionary construction and on file content recognition algorithms. Section III presents the dictionary construction algorithm. Section IV shows how to utilize it to develop our classification algorithms for file content detection. Section V addresses the problem of computing the correct dictionary sizes needed by the classifiers. Experimental results are presented in Section VI and compared with other content classification methods.

II Related Work

Dictionary-based classification models have been the focus of much recent research, leading to results in face recognition [22, 19, 20, 21, 15, 9], digit recognition [21], object categorization [15, 9] and more. Many of these works [9, 22, 15] utilize K-SVD [1] for their training, or in other words, for their dictionary learning step. Others define different objective functions such as the Fisher Discriminative Dictionary Learning [21]. The majority of these methods use an alternating optimization process in order to construct their dictionary. This optimization procedure seeks a dictionary which is re-constructive, enables sparse representation and sometimes also has a discriminative property. In some works (see for example [22, 9]) the dictionary learning algorithm requires meta-parameters to regulate these properties of the learned dictionary. Finding the optimal values for these parameters is a challenging task that adds complexity to the proposed solutions. A dictionary construction that uses a multivariate optimization process is a computationally expensive task (as described in [15], for example). The approach proposed in this paper avoids these complexities by using the randomized LU algorithm [16]. The dictionary it creates is under-complete: the number of atoms is smaller than the signal dimension. As a result, the dictionary construction is fast, without compromising its ability to achieve high classification accuracy. We improve upon the state-of-the-art results in file type classification [4], as demonstrated by the experimental results.

The testing phase in many dictionary learning schemes is simple. Usually, a linear classifier is used to assign test signals to one of the learned classes [22, 9]. However, classifier learning combined with dictionary learning adds overhead to the process [22, 9]. The method proposed in this paper does not require a separate classifier learning step. We utilize the output of the randomized LU algorithm to create a projection matrix. This matrix is used to measure the distance between a test signal and the dictionary. The signal is then classified as belonging to the class that approximates it best. The classification process is fast and simple. The results described in Section VI show high accuracy in the content-based file type classification task.

We used this classification task to test the randomized LU dictionary construction and to measure its discriminative power. This task is useful in computer security applications like anti-virus systems and firewalls that need to detect files transmitted through a network and respond quickly to threats. Previous works in this field mainly use deep packet inspection (DPI) and byte frequency distribution features (1-gram statistics) in order to analyze a file [5, 2, 4, 3, 18, 10, 7, 11, 12]. In some works, other features were tested, like consecutive byte differences [4, 5] and statistical properties of the content [5]. The randomized LU decomposition [16] construction is capable of dealing with a large number of features. This enables us to test our method on high dimensional feature sets like double-byte frequency distributions (2-gram statistics) and Markov-walk based features, where each measurement has 65536 features. We refer the reader to [4] and the references within for an exhaustive comparison of the existing methods for content-based file type classification.

Throughout this work, when $A$ is a matrix, the norm $\|A\|$ indicates the spectral norm (the largest singular value of $A$) and when $x$ is a vector, $\|x\|$ indicates the standard $l_2$ norm (Euclidean norm).

III Randomized LU

In this section, we present the randomized LU decomposition algorithm for computing the rank $k$ LU approximation of a full matrix (Algorithm 1). The main building blocks of the algorithm are random projections and Rank Revealing LU (RRLU) [14], which are used to obtain a stable low-rank approximation of an input matrix $A$.

The RRLU algorithm, used in the randomized LU algorithm, reveals the connection between LU decomposition of a matrix and its singular values. This property is very important since it connects between the size of the decomposition to the actual numerical rank of the data. Similar algorithms exist for rank revealing QR decompositions (see, for example


Theorem III.1 ([14]).

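Let $A$ be an $m \times n$ matrix ($m \geq n$). Given an integer $1 \leq k < n$, then the following factorization

$$P A Q = \begin{pmatrix} L_{11} & 0 \\ L_{21} & I_{m-k} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix} \quad \text{(III.1)}$$

holds where $L_{11}$ is lower triangular with ones on the diagonal, $U_{11}$ is upper triangular, and $P$ and $Q$ are orthogonal permutation matrices. Let $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_n \geq 0$ be the singular values of $A$, then:

$$\sigma_k \geq \sigma_{\min}(L_{11} U_{11}) \geq \frac{\sigma_k}{k(n-k)+1} \quad \text{and} \quad \sigma_{k+1} \leq \|U_{22}\| \leq (k(n-k)+1)\,\sigma_{k+1}. \quad \text{(III.2)}$$
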
Based on Theorem III.1, we have the following definition:

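Definition III.1 (RRLU Rank $k$ Approximation denoted RRLU$_k$).

Given a RRLU decomposition (Theorem III.1) of a matrix $A$ with an integer $k$ (as in Eq. III.1) such that $PAQ = LU$, then the RRLU rank $k$ approximation is defined by taking $k$ columns from $L$ and $k$ rows from $U$ such that

$$\mathrm{RRLU}_k(PAQ) = \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \end{pmatrix}$$

where $L_{11}$, $L_{21}$, $U_{11}$ and $U_{12}$ are defined in Theorem III.1.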

Lemma III.2 ([16], RRLU Approximation Error).

The error of the RRLU$_k$ approximation of $A$ is $\|PAQ - \mathrm{RRLU}_k(PAQ)\| \leq (k(n-k)+1)\,\sigma_{k+1}$.
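The lemma can be read off the block form of the factorization. A short derivation, assuming the notation in which $PAQ$ is the product of the block matrices with blocks $L_{11}, L_{21}, I$ and $U_{11}, U_{12}, U_{22}$, and assuming the bound $\|U_{22}\| \leq (k(n-k)+1)\,\sigma_{k+1}$ from the rank revealing LU theorem of [14]:

$$PAQ - \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \end{pmatrix} = \begin{pmatrix} L_{11}U_{11} & L_{11}U_{12} \\ L_{21}U_{11} & L_{21}U_{12} + U_{22} \end{pmatrix} - \begin{pmatrix} L_{11}U_{11} & L_{11}U_{12} \\ L_{21}U_{11} & L_{21}U_{12} \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 0 & U_{22} \end{pmatrix}$$

$$\Rightarrow \quad \left\| PAQ - \begin{pmatrix} L_{11} \\ L_{21} \end{pmatrix} \begin{pmatrix} U_{11} & U_{12} \end{pmatrix} \right\| = \|U_{22}\| \leq (k(n-k)+1)\,\sigma_{k+1}.$$

The only part of the factorization that the rank $k$ truncation discards is therefore the trailing block $U_{22}$.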

Algorithm 1 describes the flow of the randomized LU decomposition algorithm.

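Input: $A$ matrix of size $m \times n$ to decompose; $k$ rank of $A$; $l$ number of columns to use (for example, $l = k + 5$).
Output: Matrices $P, Q, L, U$ such that $\|PAQ - LU\| \leq O(\sigma_{k+1}(A))$ where $P$ and $Q$ are orthogonal permutation matrices, $L$ and $U$ are the lower and upper triangular matrices, respectively, and $\sigma_{k+1}(A)$ is the $(k+1)$th singular value of $A$.
1:  Create a matrix $G$ of size $n \times l$ whose entries are i.i.d. Gaussian random variables with zero mean
2:  $Y \leftarrow AG$.
3:  Apply RRLU decomposition (See [14]) to $Y$ such that $P Y Q_y = L_y U_y$.
4:  Truncate $L_y$ and $U_y$ by choosing the first $k$ columns and $k$ rows, respectively: $L_y \leftarrow L_y(:, 1:k)$ and $U_y \leftarrow U_y(1:k, :)$.
5:  $B \leftarrow L_y^{\dagger} P A$. ($L_y^{\dagger}$ is the pseudo inverse of $L_y$).
6:  Apply LU decomposition to $B$ with column pivoting $BQ = L_b U_b$.
7:  $L \leftarrow L_y L_b$.
8:  $U \leftarrow U_b$.
Algorithm 1 Randomized LU Decomposition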
Remark III.3.

In most cases, it is sufficient to compute the regular LU decomposition in Step 3 instead of computing the RRLU decomposition.
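The steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: it follows Remark III.3 and uses a plain partially pivoted LU in Step 3, the helper `lu_partial_pivot` and the function and argument names are hypothetical, and the Gaussian matrix is drawn with unit variance as an assumption.

```python
import numpy as np


def lu_partial_pivot(M):
    """Row-pivoted LU of a rectangular matrix: M[perm] = L @ U."""
    M = np.array(M, dtype=float)
    m, n = M.shape
    r = min(m, n)
    perm = np.arange(m)
    for j in range(r):
        p = j + np.argmax(np.abs(M[j:, j]))      # largest pivot in column j
        if p != j:
            M[[j, p]] = M[[p, j]]
            perm[[j, p]] = perm[[p, j]]
        if M[j, j] != 0.0:
            M[j + 1:, j] /= M[j, j]              # multipliers stored in place
            M[j + 1:, j + 1:] -= np.outer(M[j + 1:, j], M[j, j + 1:])
    L = np.tril(M[:, :r], -1) + np.eye(m, r)     # unit lower trapezoidal
    U = np.triu(M[:r, :])                        # upper trapezoidal
    return perm, L, U


def randomized_lu(A, k, l, seed=None):
    """Return P, Q, L, U with P @ A @ Q ~= L @ U, L of size m x k, U of size k x n."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    G = rng.standard_normal((n, l))              # step 1 (unit variance assumed)
    Y = A @ G                                    # step 2
    perm_y, Ly, _ = lu_partial_pivot(Y)          # step 3, plain LU (Remark III.3)
    Ly = Ly[:, :k]                               # step 4
    B = np.linalg.pinv(Ly) @ A[perm_y]           # step 5: B = Ly^+ P A
    # Step 6: column-pivoted LU of B via a row-pivoted LU of B.T.
    perm_b, Lt, Ut = lu_partial_pivot(B.T)       # B[:, perm_b] = Ut.T @ Lt.T
    P = np.eye(m)[perm_y]                        # P @ A == A[perm_y]
    Q = np.eye(n)[:, perm_b]                     # B @ Q == B[:, perm_b]
    return P, Q, Ly @ Ut.T, Lt.T                 # steps 7 and 8
```

For a matrix of exact rank `k`, the product `L @ U` should reproduce `P @ A @ Q` up to rounding error; for general matrices the error depends on the decay of the singular values, as stated in Theorem III.4.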

The running time complexity of Algorithm 1 is (see Section 4.1 and [16] for a detailed analysis). It is shown in Section 4.2 in [16] that the error bound of Algorithm 1 is given by the following theorem:

Theorem III.4 ([16]).

Given a matrix $A$ of size $m \times n$. Then, its randomized LU decomposition produced by Algorithm 1 with integers $k$ and $l$ ($l \geq k$) satisfies

with probability not less than

for all and .

IV Randomized LU Based Classification Algorithm

This section describes the application of the randomized LU decomposition (Algorithm 1) to a classification task. The training phase includes dictionary construction for each learned class from a given dataset. The classification phase assigns a newly arrived signal to one of the classes based on its similarity to the learned dictionaries. Let $A$ be the $m \times n$ matrix whose $n$ columns are the training signals (samples). Each column is defined by $m$ features. Based on Section III, we apply the randomized LU decomposition (Algorithm 1) to $A$, yielding $PAQ \approx LU$. The outputs $P$ and $Q$ are orthogonal permutation matrices. Theorem IV.1 shows that $P^T L$ forms (up to a certain accuracy) a basis to $A$. This is the key property of the classification algorithm.

Theorem IV.1.

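Given a matrix $A$. Its randomized LU decomposition is $PAQ \approx LU$. Then, the error of representing $A$ by $P^T L$ satisfies:

$$\|L L^{\dagger} P A - P A\| = \|LU - PAQ\|,$$

where the right-hand side obeys the bound of Theorem III.4 with the same probability as in Theorem III.4.

Proof. By combining Theorem III.4 with the fact that $BQ = L_b U_b = L_y^{\dagger} P A Q$ (Steps 5 and 6 of Algorithm 1) we get

$$\|LU - PAQ\| = \|L_y L_b U_b - PAQ\| = \|L_y L_y^{\dagger} P A Q - P A Q\|.$$

Then, by using the fact that $L_b$ is square and invertible we get $L L^{\dagger} = L_y L_b L_b^{-1} L_y^{\dagger} = L_y L_y^{\dagger}$, so that

$$\|LU - PAQ\| = \|L L^{\dagger} P A Q - P A Q\|.$$

By using the fact that the spectral norm is invariant to multiplication by an orthogonal matrix, we get

$$\|LU - PAQ\| = \|L L^{\dagger} P A - P A\|$$

with the same probability as in Theorem III.4. ∎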

Assume that our dataset is composed of the sets $X_1, X_2, \ldots, X_r$. We denote by $D_i$ the dictionary learned from the set $X_i$ by Algorithm 1. $U_i$ is the corresponding coefficient matrix. It is used to reconstruct signals from $X_i$ as a linear combination of atoms from $D_i$. The training phase of the algorithm is done by applying Algorithm 1 to different training datasets that correspond to different classes. For each class, a different dictionary is learned. The size of $D_i$, namely its number of atoms, is determined by the parameter $k_i$ that is related to the decaying spectrum of the matrix $X_i$. The dictionaries do not have to be of equal sizes. A discussion about the dictionary sizes appears later in this section and in Section V. The third parameter that Algorithm 1 needs is the number of projections $l$ on the random matrix columns. $l$ is related to the error bound in Theorem III.4 and it is used to ensure a high success probability for Algorithm 1. Taking $l$ to be a little bigger than $k$ is sufficient. The training process of our algorithm is described in Algorithm 2.

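Input: $X = \{X_1, X_2, \ldots, X_r\}$ training datasets for $r$ sets; $K = \{k_1, k_2, \ldots, k_r\}$ dictionary size of each set.
Output: $D = \{D_1, D_2, \ldots, D_r\}$ set of dictionaries.
1:  for $t \in \{1, 2, \ldots, r\}$ do
       $P_t, Q_t, L_t, U_t \leftarrow$ Randomized LU Decomposition($X_t$, $k_t$, $l$), $D_t \leftarrow P_t^T L_t$; (Algorithm 1)
Algorithm 2 Dictionaries Training using Randomized LU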

For the test phase of the algorithm, we need a similarity measure that provides a distance between a given test signal and a dictionary.

Definition IV.1.

Let $x$ be a signal and $D$ be a dictionary. The distance between $x$ and the dictionary $D$ is defined by

$$dist(x, D) \triangleq \|D D^{\dagger} x - x\|$$

where $D^{\dagger}$ is the pseudo-inverse of the matrix $D$.
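The distance in Definition IV.1 is the norm of the residual left after projecting the signal onto the column space of the dictionary. A minimal NumPy sketch, in which the function name `dist` and the toy dictionary are illustrative assumptions:

```python
import numpy as np


def dist(x, D):
    """Norm of the part of x that the columns (atoms) of D cannot represent."""
    projection = D @ (np.linalg.pinv(D) @ x)
    return float(np.linalg.norm(projection - x))


# Toy dictionary whose two atoms span the first two coordinates of R^3.
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
in_span = np.array([2.0, -1.0, 0.0])       # lies in the span of the atoms
orthogonal = np.array([0.0, 0.0, 3.0])     # orthogonal to every atom

print(dist(in_span, D))      # numerically zero
print(dist(orthogonal, D))   # approximately 3.0, the norm of the signal
```

The two extreme cases illustrate the range of the measure: a signal inside the span has distance close to zero, and a signal orthogonal to all atoms has a distance equal to its own norm.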

The geometric meaning of $dist(x, D)$ is related to the projection of $x$ onto the column space of $D$, where $D$ is the dictionary learned for class $C$ of the problem. $dist(x, D)$ denotes the distance between $x$ and $D D^{\dagger} x$, which is the vector built with the dictionary $D$. If $x \in C$ then Theorem IV.1 guarantees that $dist(x, D)$ is small. For $x \notin C$, $dist(x, D)$ is large. Thus, $dist$ is used for classification as described in Algorithm 3.

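Input: $x$ input test signal; $D = \{D_1, D_2, \ldots, D_r\}$ set of dictionaries.
Output: $t_x$ the classified class label for $x$.
1:  for $t \in \{1, 2, \ldots, r\}$ do
       $ERR_t \leftarrow dist(x, D_t)$
2:  $t_x \leftarrow \arg\min_t \{ERR_t\}$
Algorithm 3 Dictionary based Classification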
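The minimum-distance rule of Algorithm 3 can be sketched as follows. The names `dist` and `classify` and the two toy dictionaries are assumptions made for this illustration; in the paper the dictionaries come from the randomized LU training step.

```python
import numpy as np


def dist(x, D):
    """Residual norm of x after projection onto the column space of D."""
    return float(np.linalg.norm(D @ (np.linalg.pinv(D) @ x) - x))


def classify(x, dictionaries):
    """Return the index of the dictionary with the smallest distance to x."""
    errors = [dist(x, D) for D in dictionaries]
    return int(np.argmin(errors))


# Two toy dictionaries spanning complementary planes of R^4.
D0 = np.eye(4)[:, :2]
D1 = np.eye(4)[:, 2:]
x = np.array([0.1, 0.0, 1.0, 2.0])   # mostly inside the span of D1
label = classify(x, [D0, D1])        # distance 0.1 to D1, sqrt(5) to D0
```

No classifier is trained here: the decision depends only on how well each dictionary reconstructs the signal, which is why the dictionary sizes matter so much for the result.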

The core of Algorithm 3 is the $dist$ function from Definition IV.1. It examines the portion of the signal that is spanned by the dictionary atoms. If the signal can be expressed with high accuracy as a linear combination of the dictionary atoms then their $dist$ will be small. The best accuracy is achieved when the examined signal belongs to the span of $D$. In this case, $dist$ is small and bounded by Theorem III.4. On the other hand, if the dictionary atoms cannot express a signal well then their $dist$ will be large. The largest distance is achieved when a signal is orthogonal to the dictionary atoms. In this case, $dist$ will be equal to the norm of the signal. Signal classification is accomplished by finding the dictionary with the minimal distance to it. This is where the dictionary size comes into play. The more atoms a dictionary has, the larger is the space of signals that have low $dist$ to it, and vice versa. By adding or removing atoms from a dictionary, the distances between this dictionary and the test signals are changed. This affects the classification results of Algorithm 3. The practical meaning of this observation is that dictionary sizes need to be chosen carefully. Ideally, we wish that each dictionary will have zero $dist$ to test signals of its type, and large $dist$ values for signals of other types. However, in reality, some test signals are represented more accurately by a dictionary of the wrong type than by a dictionary of their own class type. For example, we encountered several cases where GIF files were represented more accurately by a PDF dictionary than by a GIF dictionary. An incorrect selection of the dictionary size, $k$, will result in either a dictionary that cannot represent signals of its own class well (which causes misdetections), or in a dictionary that represents signals from other classes too accurately (which causes false alarms). The first problem occurs when the dictionary is too small whereas the second occurs when the dictionary is too large.
In Section V, we discuss the problem of finding the optimal dictionary sizes and how they relate to the spectrum of the training data matrices.

V Determining the Dictionaries Sizes

One possible way to find the dictionary sizes is to observe the spectrum decay of the training data matrix. In this approach, the number of atoms in each dictionary is selected as the number of singular values that capture most of the energy of the training matrix. This method is based on estimating the numerical rank of the matrix, namely the dimension of its column space. Such a dictionary approximates the column space of the data well and accurately represents signals of its own class. Nevertheless, it is possible in this construction that the dictionary of a certain class will have a high rate of false alarms. In other words, this dictionary might approximate signals from other classes with a low error.
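The spectrum-based choice of a dictionary size can be sketched as below. The text does not state how much energy counts as "most", so the threshold of 0.95 and the use of squared singular values as the energy measure are assumptions of this sketch, as is the function name.

```python
import numpy as np


def size_from_energy(A, energy=0.95):
    """Smallest k whose k leading singular values hold `energy` of the total."""
    s = np.linalg.svd(A, compute_uv=False)         # singular values, descending
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(k, len(s))                          # guard against rounding at 1.0


# Singular values 3, 2, 1 hold 9/14, 13/14 and 14/14 of the energy cumulatively.
A = np.diag([3.0, 2.0, 1.0])
k_90 = size_from_energy(A, energy=0.90)   # two values reach 13/14 ~ 0.93
k_95 = size_from_energy(A, energy=0.95)   # all three values are needed
```

A size obtained this way looks at one class in isolation, which is exactly the limitation that motivates the pairwise search described in the rest of this section.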

Two different actions can be taken to prevent this situation. The first option is to reduce the size of this dictionary so that it approximates mainly signals of its class and not signals from other classes. This should be done carefully so that this dictionary still identifies signals of its class better than the other dictionaries do. The second option is to increase the sizes of the other dictionaries in order to overcome their misdetections. This should also be done with caution, since these enlarged dictionaries might in turn represent signals from other classes well. Therefore, relying only on the spectrum analysis of the training data is insufficient, because this method finds the size of each dictionary independently of the other dictionaries. It ignores the interrelations between dictionaries, while the classification algorithm is based on those relations. Finding the optimal $k$ values can be described by the following optimization problem:


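$$\{k_1, \ldots, k_r\} = \arg\min_{k_1, \ldots, k_r} \sum_{1 \leq i, j \leq r,\; i \neq j} C_{i,j,k_i,k_j}(X) \quad \text{(V.1)}$$

where $C_{i,j,k_i,k_j}(X)$ is the number of signals from class $i$ in the dataset $X$ classified as belonging to class $j$ for the respective dictionary sizes $k_i$ and $k_j$. The term that we wish to minimize in Eq. V.1 is therefore the total number of wrong classifications in our dataset $X$ when using a set of dictionaries $D_1, \ldots, D_r$ with sizes $k_1, \ldots, k_r$, respectively.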

We propose an algorithm for finding the dictionary sizes by examining each specific pair of dictionaries separately, and thus identifying the optimized dictionary sizes for this pair. Then, the global values for all dictionaries will be determined by finding an agreement between all the local results. This process is described in Algorithm 4.

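Input: $X = \{X_1, X_2, \ldots, X_r\}$ training datasets for the $r$ classes; $K_{range}$ set of possible values of $k$ to search in.
Output: $K = \{k_1, k_2, \ldots, k_r\}$ dictionaries sizes.
1:  for $i, j \in \{1, 2, \ldots, r\}$, $i < j$ do
       for $k_i, k_j \in K_{range}$ do
              $ERROR_{i,j}(k_i, k_j) \leftarrow$ number of signals of classes $i$ and $j$ that are classified as the other class when the dictionary sizes are $k_i$ and $k_j$
2:  $K \leftarrow$ find_optimal_agreement($\{ERROR_{i,j}\}$)
Algorithm 4 Dictionary Sizes Detection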

Algorithm 4 examines each pair of classes $i$ and $j$ for different $k$ values and produces the matrix $ERROR_{i,j}$, such that the element $ERROR_{i,j}(k_i, k_j)$ is the number of classification errors for those two classes, when the dictionary size of class $i$ is $k_i$ and the dictionary size of class $j$ is $k_j$. This number is the sum of signals from each class that were classified as belonging to the other class. The matrix $ERROR_{i,j}$ reveals the ranges of $k$ values for which the number of classification errors is minimal. These are the ranges that fit when dealing with a problem that contains only two classes of signals. However, many classification problems need to deal with a large number of classes. For this case, we create the matrix $ERROR_{i,j}$ for all possible pairs, find the $k$ ranges for each pair and then find the optimal agreement between all pairs. The step find_optimal_agreement describes this idea in Algorithm 4. Finding this agreement can be done by making a list of constraints for each pair and then finding $k$ values that satisfy all the constraints and give the minimal solution to the problem described in Eq. V.1. The constraints can bound the size of a specific dictionary from below or above, or the relation between the sizes of two dictionaries (for example, the dictionary of the first class should have 10 more elements than the dictionary of the second class). The step find_optimal_agreement is not described here formally but is demonstrated in detail as part of Algorithm 4 in Section VI-B.
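The pairwise error matrix that Algorithm 4 builds for a single pair of classes can be sketched as follows. To keep the example short and self-contained, the dictionaries are built from the leading left singular vectors of the training matrix, which is a stand-in for the randomized LU construction and not the paper's method; all function names are hypothetical, and the agreement step between pairs is not shown.

```python
import numpy as np


def build_dictionary(X, k):
    """Stand-in dictionary: the k leading left singular vectors of X."""
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :k]


def residuals(X, D):
    """Distance of every column of X to the orthonormal dictionary D."""
    return np.linalg.norm(X - D @ (D.T @ X), axis=0)


def pairwise_errors(Xi, Xj, sizes):
    """E[a, b]: misclassified training signals for sizes (sizes[a], sizes[b])."""
    E = np.zeros((len(sizes), len(sizes)), dtype=int)
    for a, ki in enumerate(sizes):
        Di = build_dictionary(Xi, ki)
        for b, kj in enumerate(sizes):
            Dj = build_dictionary(Xj, kj)
            # Signals of class i that are closer to the class j dictionary.
            i_as_j = np.sum(residuals(Xi, Dj) < residuals(Xi, Di))
            # Signals of class j that are closer to the class i dictionary.
            j_as_i = np.sum(residuals(Xj, Di) < residuals(Xj, Dj))
            E[a, b] = i_as_j + j_as_i
    return E
```

Plotting `E` as a heat map for each pair gives the kind of picture from which low-error ranges of dictionary sizes can be read off before looking for sizes that suit all pairs at once.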

VI Experimental Results

In order to evaluate the performance of the dictionary construction and classification algorithms in Section IV, Algorithm 3 was applied to a dataset that contains six different file types. The goal is to classify each file or portion of a file to the class that describes its type. This dataset consists of 1200 files that were collected in the wild using automated Web crawlers. The files were equally divided into six types: PDF, EXE, JPG, GIF, HTM and DOC. 100 files of each type were chosen randomly as training datasets and the other 100 files were used for testing the algorithms. In order to get results that reflect the true nature of the problem, no restrictions were imposed on the file collection process. Thus, some files contain only a few kilobytes while others are several megabytes in size. In addition, some of the PDF files contain pictures, which makes it hard for a content-based algorithm to classify the correct file type. Similarly, DOC files may contain pictures and the executables may contain text and pictures. Clearly, these phenomena have a negative effect on the accuracy of the results in this section. However, we chose to leave the dataset in its original form.

Throughout this work, we came across several similar works [5, 2, 4, 3, 18, 10, 7, 11, 12] that classify unknown files to their type based on their content. None of these works made their datasets publicly available for analysis and comparison with other methods. We decided to publicize the dataset of files that we collected to enable future comparisons. The details about downloading and using the dataset can be obtained by contacting one of the authors.

Three different scenarios were tested with the common goal of classifying files or portions of files to their class type, namely, assigning them to one of the six file types described above. In each scenario, six dictionaries were learned that correspond to the six file types. Then, the classification algorithm (Algorithm 3) was applied to classify the type of a test fragment or a file. The learning phase, which is common to all scenarios, was done by applying Algorithm 4 to find the dictionary sizes and Algorithm 2 to construct the dictionaries. The testing phase varies according to the specific goal of each scenario. Sections VI-A, VI-B and VI-C provide a detailed description of each scenario and its classification results.

VI-A Scenario A: Entire File is Analyzed

In this scenario, we process a whole file and the extracted features are taken from its entire content. The features are the byte frequency distribution (BFD), which contains 256 features, followed by the consecutive differences distribution (CDD), which adds another 256 features. A total of 512 features are measured for each training and testing file. CDD is used in addition to BFD because the latter fails to capture any information about the ordering of bytes in the file. CDD turned out to be very discriminative and improved the classification results. The features extracted from each file were normalized by its size since there are files of various sizes in the dataset. An example of BFD construction is shown in Fig. VI.1 and an example of CDD construction is given in Fig. VI.2.

File fragment: AABCCCDR

Byte   Probability (BFD)
A      0.25
B      0.125
C      0.375
D      0.125
R      0.125

Fig. VI.1: Byte Frequency Distribution (BFD) features extracted from the file fragment “AABCCCDR”.

File fragment: AABCCCDFG

Difference   Probability (CDD)
0            0.375
1            0.5
2            0.125

Fig. VI.2: Consecutive Differences Distribution (CDD) features extracted from the file fragment “AABCCCDFG”. There are three consecutive-pairs of bytes with difference 0, four with difference 1 and one with difference 2. These distributions are normalized to produce the shown probabilities. The normalization factor is the length of the string minus one (8 in this example).
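The two feature sets can be sketched in NumPy as follows. The function names are hypothetical, and the use of the absolute difference between consecutive bytes is an assumption, since the example in Fig. VI.2 only contains non-decreasing bytes and does not show how negative differences are treated.

```python
import numpy as np


def bfd(data: bytes) -> np.ndarray:
    """Byte frequency distribution: 256 values normalized by the length."""
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / len(data)


def cdd(data: bytes) -> np.ndarray:
    """Consecutive differences distribution: 256 values over len(data) - 1 pairs."""
    values = np.frombuffer(data, dtype=np.uint8).astype(np.int16)
    diffs = np.abs(np.diff(values))            # assumption: absolute differences
    return np.bincount(diffs, minlength=256) / (len(data) - 1)


def features(data: bytes) -> np.ndarray:
    """The 512-dimensional BFD + CDD feature vector of a file or fragment."""
    return np.concatenate([bfd(data), cdd(data)])


bfd_example = bfd(b"AABCCCDR")     # the fragment of Fig. VI.1
cdd_example = cdd(b"AABCCCDFG")    # the fragment of Fig. VI.2
```

On the two fragments from the figures, `bfd_example` holds 0.25 at the index of "A" and 0.375 at the index of "C", and the first three entries of `cdd_example` are 0.375, 0.5 and 0.125. Inputs shorter than two bytes are not handled by this sketch.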

This scenario can be useful when the entire tested file is available for inspection. The training was done by applying Algorithms 4 and 2 to the training data. The set of candidate dictionary sizes given to Algorithm 4 was determined by the numerical rank of the training matrix. The possible dictionary sizes need to be close to this rank in order to represent their datasets well. The dictionary sizes were 60 atoms per dictionary. The set of dictionaries $D$ is the output of Algorithm 2, which is later used for the classification of test files. Each test file was analyzed using Algorithm 5 and classified to one of the six classes. The classification results are presented as a confusion matrix in Table VI.1. Each column corresponds to an actual file type and the rows correspond to the file type classified by Algorithm 5. A perfect classification produces a table with score 100 on the diagonal and zero elsewhere. Our results are similar to those achieved in [4] (Table II), which uses different methods. However, we did not have the dataset that [4] used, so a fair comparison is not possible.

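Input: $x$ input file; $D = \{D_1, D_2, \ldots, D_r\}$ set of dictionaries.
Output: $t_x$ file type predicted for $x$.
1:  for $t \in \{1, 2, \ldots, r\}$ do
       $ERR_t \leftarrow dist(x, D_t)$
2:  $t_x \leftarrow \arg\min_t \{ERR_t\}$
Algorithm 5 File Content Dictionary Classification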
                             Correct File Type
Classified File Type    PDF   DOC   EXE   GIF   JPG   HTM
PDF                      98     0     1     1     0     0
DOC                       0    97     1     0     0     0
EXE                       0     3    98     2     1     0
GIF                       0     0     0    97     1     0
JPG                       2     0     0     0    98     0
HTM                       0     0     0     0     0   100
TABLE VI.1: Confusion matrix for Scenario A. 100 files of each type were classified by Algorithm 5.

VI-B Scenario B: Fragments of a File

In this scenario we describe a situation in which the entire file is unavailable for analysis and only some fragments taken from random locations are available. The goal is to classify the file type based on this partial information. This serves real applications such as a firewall that examines packets transmitted through a network, or a file being downloaded from a network server. This scenario contains three experiments, each using different features. The training phase, which is common to all three experiments, extracts features from 10 kilobyte fragments that belong to the training data. These features serve as an input to Algorithm 2, which produces the dictionaries for the classification phase. The second parameter in Algorithm 2 is a set of dictionary sizes, which were determined by Algorithm 4. We use the first set of features in this scenario (described hereafter) to demonstrate in more depth how Algorithm 4 works. The sizes of six dictionaries need to be determined based on the agreement between the pairwise error matrices. Fig. VI.3 shows the error matrices for the pairs PDF-JPG and PDF-EXE.

Fig. VI.3(a) describes the number of classification errors for the PDF and JPG types as a function of the respective dictionary sizes. It can be observed that there is a large number of errors for many size pairs, suggesting that the PDF and JPG dictionaries exhibit a high measure of similarity. This property makes the distinction between these two types a hard task. A closer look at Fig. VI.3(a) enables us to find the optimal sizes for those dictionaries by making the following observations. Only a few values in the cells above the main diagonal provide good results for this pair. Additionally, the JPG dictionary should have 10 atoms more than the PDF dictionary. It can also be learned that both dictionary sizes should be greater than 50 atoms.

The PDF and EXE error values in Fig. VI.3(b) indicate that these dictionaries are well separated. There is a large set of dictionary sizes near the diagonal for which the classification error is low. The following intuition helps to understand why a large range of low errors achieves better classification results. The error matrices are built from the training data and represent the classification error of the algorithm when it is applied to this data (see Eq. V.1 when using 2 sets). The best values from the matrix fit the training data, in the sense that a PDF training signal will be represented more accurately by a PDF dictionary of size $k_{PDF}$ than by a JPG dictionary of size $k_{JPG}$. However, this is not necessarily the case for a PDF test signal, which may need a larger PDF dictionary or a smaller JPG dictionary in order to be classified correctly. This might happen because many PDF-dictionary atoms are irrelevant for reconstructing this signal while too many JPG-dictionary atoms are relevant for it. This means that from this signal's perspective, the PDF dictionary size is smaller than $k_{PDF}$ and the JPG dictionary is larger than $k_{JPG}$. In terms of Fig. VI.3, which shows the classification errors for the two discussed pairs, this means moving away from the diagonal (which has the best dictionary sizes for the training set). In the JPG-PDF case, this shift will increase the classification error because all the off-diagonal entries in Fig. VI.3(a) have higher error counts. On the other hand, there is a low probability of getting a classification error in Fig. VI.3(b), because there are many off-diagonal options for dictionary sizes that generate a low error. The pair JPG-PDF is more sensitive to noise than the pair EXE-PDF. This observation is supported by the confusion matrix of the first experiment, as shown in Table VI.2.

(a) Error matrix for the pair PDF-JPG
(b) Error matrix for the pair PDF-EXE
Fig. VI.3: Error matrices produced by Algorithm 4. The matrix is presented in cold to hot colormap to show ranges of low (blue) and high (red) errors.

In the first experiment, the dictionary sizes, which were determined by Algorithm 4, are 150 atoms per PDF, DOC, EXE, GIF, and HTM dictionaries and 160 atoms per JPG dictionary. 10 fragments of 1500 bytes each were sampled randomly from each examined file. BFD and CDD based features were extracted from each fragment and then normalized by the fragment size (similarly to the normalization by file size conducted in Scenario A in Section VI-A). Then, the distance between each fragment and each of the six dictionaries was calculated. The mean value of the distances was computed for each dictionary. Eventually, the examined file was classified to the class that has the minimal mean value. This procedure is described in Algorithm 6. The classification results are presented in Table VI.2.

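Input: input fragments; set of dictionaries.
Output: file type predicted for the examined file.
1:  for each input fragment do
       for each dictionary do
          compute the distance between the fragment and the dictionary
2:  for each dictionary do
       compute the mean of its distances over all fragments
3:  return the file type whose dictionary has the minimal mean distance
Algorithm 6 File fragment classification using dictionary learning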
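A minimal sketch of the mean-distance decision rule behind Algorithm 6 is given below. It rests on an assumption: the distance between a fragment and a dictionary is taken to be the residual norm after a least-squares projection onto the dictionary columns, which stands in for the distance defined earlier in the paper. The names `distance_to_dictionary` and `classify_file` are hypothetical.

```python
import numpy as np


def distance_to_dictionary(x, D):
    """Residual norm of x after least-squares projection onto the columns of D.

    Assumption: this stands in for the paper's signal-to-dictionary distance.
    """
    coeffs, *_ = np.linalg.lstsq(D, x, rcond=None)
    return float(np.linalg.norm(D @ coeffs - x))


def classify_file(fragments, dictionaries):
    """Return the label with the smallest mean distance, plus all mean distances.

    fragments:    list of 1-D feature vectors (one per sampled fragment)
    dictionaries: mapping label -> (features x atoms) array
    """
    mean_dist = {
        label: float(np.mean([distance_to_dictionary(x, D) for x in fragments]))
        for label, D in dictionaries.items()
    }
    return min(mean_dist, key=mean_dist.get), mean_dist
```

Averaging over fragments before taking the minimum means that a single atypical fragment does not decide the file type on its own.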
                                 Correct File Type
Classified File Type    PDF    DOC    EXE    GIF    JPG    HTM
PDF                      93      0      2      0     14      0
DOC                       0     96      2      0      0      0
EXE                       0      4     95      0      0      0
GIF                       0      0      0    100      2      0
JPG                       6      0      0      0     82      0
HTM                       1      0      1      0      2    100
TABLE VI.2: Confusion matrix for Scenario B where BFD+CDD based features were chosen. 100 files of each type were classified by Algorithm 6.

The second experiment used a double-byte frequency distribution (DBFD), which contains 65536 features. Figure VI.4 demonstrates the DBFD feature extraction from a small file fragment.

File fragment: AABCCC
Double-Byte        Probability (DBFD)
AA                 0.2
AB                 0.2
BC                 0.2
CC                 0.4
all other pairs    0

Fig. VI.4: Features extracted from the file fragment “AABCCC” using Double Byte Frequency Distribution (DBFD). The normalization factor is the length of the string minus one.
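The extraction shown in Fig. VI.4 can be reproduced with a short sketch. The function names `dbfd` and `dbfd_dense` are hypothetical, and the dense layout (index 256·first byte + second byte) is an assumption about how the 65536 features might be ordered.

```python
from collections import Counter


def dbfd(fragment: bytes):
    """Double-byte frequency distribution over overlapping byte pairs.

    Counts are divided by the number of pairs, i.e. the fragment length minus one.
    """
    total = len(fragment) - 1
    if total <= 0:
        return {}
    pairs = Counter(zip(fragment, fragment[1:]))
    return {bytes(pair): count / total for pair, count in pairs.items()}


def dbfd_dense(fragment: bytes):
    """Dense 65536-entry vector; the index layout is an assumption."""
    vec = [0.0] * 65536
    for pair, p in dbfd(fragment).items():
        vec[256 * pair[0] + pair[1]] = p
    return vec


features = dbfd(b"AABCCC")
# features maps b"AA", b"AB" and b"BC" to 1/5 each and b"CC" to 2/5
```

Most of the 65536 entries are zero for a fragment of a few thousand bytes, so a sparse representation such as the dictionary returned by `dbfd` is usually the more economical one.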

Similarly to the first experiment, 10 fragments were sampled from random locations in each examined file. This time, however, we used 2000 bytes per fragment, since smaller fragments do not capture sufficient information when DBFD features are used. The feature vectors were normalized by the fragment size as before. Algorithm 6 was applied to classify the type of each examined file. The dictionary sizes in this experiment are 80 atoms for each of the PDF, DOC and JPG dictionaries and 60 atoms for each of the EXE, GIF and HTM dictionaries. The classification results of this experiment are presented in Table VI.3. We see that DBFD based features reveal patterns in the data that were not revealed by BFD and CDD based features. In particular, they capture very well GIF files that BFD and CDD based features fail to capture.

                                 Correct File Type
Classified File Type    PDF    DOC    EXE    GIF    JPG    HTM
PDF                      92      0      2      0      5      1
DOC                       2     97      2      0      5      0
EXE                       3      1     88      2      0      0
GIF                       1      1      5     98      0      0
JPG                       1      1      2      0     90      0
HTM                       1      0      1      0      0     99
TABLE VI.3: Confusion matrix for Scenario B that is based on DBFD based features. 100 files of each type were classified by Algorithm 6.

The third experiment defines a Markov-walk (MW) like set of 65536 features extracted from the dataset for each signal. The transition probability between each pair of bytes is calculated. Figure VI.5 demonstrates how to extract MW type features from a file fragment.

File fragment: AABCCCF
Transition               Probability (MW)
A → A                    0.5
A → B                    0.5
B → C                    1
C → C                    0.66
C → F                    0.33
all other transitions    0

Fig. VI.5: Markov Walk (MW) based features extracted from the file fragment “AABCCCF”.
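The transition probabilities of Fig. VI.5 can be computed with the sketch below. The function name `markov_walk` is hypothetical. Each pair count is divided by the number of transitions that leave its first byte, which is what distinguishes MW features from DBFD features.

```python
from collections import Counter


def markov_walk(fragment: bytes):
    """Transition probabilities P(next byte | current byte) within a fragment."""
    pair_counts = Counter(zip(fragment, fragment[1:]))
    # Number of transitions leaving each byte (the last byte has no successor).
    out_counts = Counter(fragment[:-1])
    return {
        bytes(pair): count / out_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


mw = markov_walk(b"AABCCCF")
# mw[b"AA"] and mw[b"AB"] are 1/2, mw[b"BC"] is 1,
# mw[b"CC"] is 2/3 and mw[b"CF"] is 1/3
```

The probabilities leaving any given byte sum to one, whereas the DBFD values sum to one over the whole fragment.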

Both MW based features and DBFD based features are calculated using the double-byte frequencies, but they capture different information from the data. DBFD based features focus on finding the pairs of bytes that are most prevalent and those that have a low chance of appearing in a file. MW based features, on the other hand, represent the probability that a specific byte will appear in the file given the appearance of the previous byte. This is well suited to file types such as EXE, where similar addresses and opcodes are used repeatedly. Each memory address or opcode comprises two or more bytes and can therefore be described by the transition probabilities between these bytes. Text files are also a good example for the applicability of MW based features, because it is well known that natural language can be described by patterns of transition probabilities between words or letters. Our study shows that MW based features also capture the structure of media files such as GIF and HTM files. The relatively unsatisfactory performance on JPG files arises because our PDF dictionary was trained on PDF files that contain pictures; it therefore attracted some of the JPG files.

The prediction accuracy is described in Table VI.4. These results (97% avg. accuracy) outperform the results obtained by the BFD+CDD and DBFD features. They also improve on all the methods surveyed in [4] (Table VI), including the algorithm proposed in [4], which has 85.5% average accuracy. However, it should be noted that we used 10 fragments for the classification of each file, whereas in [4] a single fragment is used. In this experiment, the dictionary sizes are 500 atoms for each of the PDF, DOC and EXE dictionaries, 600 for the GIF dictionary, 800 for the JPG dictionary and 220 for the HTM dictionary. The HTM dictionary is smaller than the other dictionaries because the HTM training set contains only 230 samples, and the LU dictionary size is bounded by the dimensions of the training matrix (see Algorithm 1).

                                 Correct File Type
Classified File Type    PDF    DOC    EXE    GIF    JPG    HTM
PDF                      93      1      0      0      9      0
DOC                       0     98      0      0      0      0
EXE                       2      0     98      1      0      0
GIF                       3      1      1     99      0      0
JPG                       1      0      0      0     91      0
HTM                       1      0      1      0      0    100
TABLE VI.4: Confusion matrix for Scenario B using MW based features. 100 files of each type were classified by Algorithm 6.
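As a sanity check on the reported average accuracy, the per-class and mean accuracies can be recomputed from the entries of Table VI.4. The sketch assumes that the columns (correct types) follow the same order as the rows, namely PDF, DOC, EXE, GIF, JPG, HTM; the variable and function names are hypothetical.

```python
import numpy as np

# Rows: classified type, columns: correct type (assumed order PDF, DOC, EXE, GIF, JPG, HTM).
conf_mw = np.array([
    [93,  1,  0,  0,  9,   0],
    [ 0, 98,  0,  0,  0,   0],
    [ 2,  0, 98,  1,  0,   0],
    [ 3,  1,  1, 99,  0,   0],
    [ 1,  0,  0,  0, 91,   0],
    [ 1,  0,  1,  0,  0, 100],
])


def per_class_accuracy(conf):
    """Fraction of files of each correct type that were assigned to that type."""
    return np.diag(conf) / conf.sum(axis=0)


def average_accuracy(conf):
    """Mean of the per-class accuracies."""
    return float(per_class_accuracy(conf).mean())


avg = average_accuracy(conf_mw)  # 579 / 600 = 0.965
```

Under this column ordering every column sums to 100 files, and the mean of the diagonal is 96.5%, which agrees with the approximately 97% quoted in the text.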

VI-C Scenario C: Detecting Execution Code in PDF Files

PDF is a common file format that can contain different media elements such as text, fonts, images, vector graphics and more. This format is widely used on the Web because it is self-contained and platform independent. While the PDF format is considered to be safe, it can contain any file format, including executables such as EXE files and various script files. Detecting malicious PDF files can be challenging, as it requires a deep inspection of every file fragment that can potentially hide executable code segments. The embedded code is not automatically executed when the PDF is viewed in a PDF reader, since execution first requires exploiting a vulnerability in the viewer code or in the PDF format. Still, detecting such a potential threat can lead to a preventive action by the inspecting system.

To evaluate how effective our method can be in detecting executable code embedded in PDF files, we generated several PDF files which contain text, images and executable code. We used four datasets of PDF files as our training data:

: 100 PDF files containing mostly text.
: 100 PDF files containing GIF images.
: 100 PDF files containing JPG images.
: 100 PDF files containing EXE files.

All the GIF, JPG and EXE files were taken from the previous experiments and were embedded into the PDF files. We generated four dictionaries, one for each dataset, using Algorithm 2. The input for the algorithm was

We then created a test dataset that consisted of 100 regular PDF files and 10 PDF files that contain executable code. Algorithm 6 classified the 110 files. The input fragments were the PDF file fragments, and the input set of dictionaries was the output of Algorithm 2. A file is classified as malicious (contains executable code) if the number of its fragments classified as EXE exceeds a threshold; otherwise it is classified as a safe PDF file. The threshold was chosen to minimize the total number of misclassifications. The training step was applied to 10-kilobyte fragments and the classification step was applied to 5-kilobyte fragments. We used the MW based features (65,536 extracted features). By using Algorithm 6, we detected all 10 malicious PDF files with a false alarm rate of 8% (8 safe PDF files were classified as malicious). The results are summarized in Table VI.5.

                             Correct File Type
Classified File Type    PDF    Malicious PDF
Safe PDF                 92                0
Malicious PDF             8               10
TABLE VI.5: Confusion matrix for malicious PDF detection experiment. 110 files were classified by Algorithm 6.
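The decision rule of this scenario and the rates implied by Table VI.5 can be sketched as follows. The function names are hypothetical, and the threshold passed in the usage line is a placeholder, not the value used in the experiment.

```python
def is_malicious(fragment_labels, threshold):
    """Flag a PDF when more than `threshold` of its fragments are labeled EXE."""
    exe_count = sum(1 for label in fragment_labels if label == "EXE")
    return exe_count > threshold


def detection_and_false_alarm(true_pos, false_neg, false_pos, true_neg):
    """Detection rate over malicious files and false alarm rate over safe files."""
    detection = true_pos / (true_pos + false_neg)
    false_alarm = false_pos / (false_pos + true_neg)
    return detection, false_alarm


# Placeholder threshold of 1, chosen only for illustration.
flag = is_malicious(["PDF", "EXE", "JPG", "EXE"], threshold=1)

# Counts from Table VI.5: 10 malicious files detected, none missed,
# 8 safe files flagged, 92 safe files passed.
detection, false_alarm = detection_and_false_alarm(10, 0, 8, 92)
```

With the counts of Table VI.5 this gives a detection rate of 100% and a false alarm rate of 8%. Raising the threshold trades false alarms for missed detections, which is why it is tuned on the total number of misclassifications.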

Other file formats, which contain embedded data (DOC files for example), can be classified in the same way.

VI-D Time Measurements

Computer security software frequently faces the situations described in Sections VI-A to VI-C. Therefore, any solution to file type classification must respond quickly to queries. We measured the time required for both the training phase and the classification phase of our method, which classifies a file or a fragment of a file. Since the training phase operates offline, it does not need to be fast. A classification query, on the other hand, should be fast to meet real-time requirements and to support high-volume applications. Tables VI.6 and VI.7 describe the execution times in Scenarios A (Section VI-A) and B (Section VI-B), respectively. The times are divided into a preprocessing step and the actual analysis step. The preprocessing includes feature extraction from files (data preparation) and loading this data into Matlab. The feature extraction was done in Python and the output files were loaded into Matlab. This is not an optimal configuration, as it involves intensive, slow disk I/O, and we did not optimize these steps. The computation time of the dictionary size is not included in the tables, because the dictionary size is a meta-parameter of Algorithm 2 that can be computed in different ways, depending on the application. The analysis time refers to the time needed by Algorithm 2 to build six dictionaries (left column in each table) and the time needed to classify a single file into one of the six classes (right column). The classification was performed by Algorithm 5 in Scenario A (Table VI.6) and by Algorithm 6 in Scenario B (Table VI.7). All training and classification times are normalized by the data size, which allows the algorithm performance to be evaluated regardless of the actual file sizes (which vary widely). The analysis part of the classification time in Scenario B is not normalized, because Algorithm 6 does not depend on the input file size (it samples the same amount of data from each file, regardless of its size). Our classification process is fast: the analysis step takes 0.0005 seconds per 1 MB of data in Scenario A and between 0.01 and 0.27 seconds per file in Scenario B. The preprocessing step can be further optimized for real-time applications. All the experiments were conducted on a Windows 64-bit machine with an Intel i7 2.93 GHz CPU and 8 GB of RAM.


Features    Step             Training time (sec)    Classification time (sec)
                             per 1 MB of data       per 1 MB of data

            Preprocessing    1.8                    1.88
            Analysis         0.004                  0.0005
            Total            1.804                  1.8805


TABLE VI.6: Running times for Scenario A.


Features    Step             Training time (sec)    Classification time (sec)
                             per 1 MB of data

            Preprocessing    1.93                   0.1 (per 1 MB)
            Analysis         0.008                  0.01 (per file)
            Total            1.938

            Preprocessing    13.78                  1.6 (per 1 MB)
            Analysis         0.54                   0.26 (per file)
            Total            14.32

            Preprocessing    18.42                  2.41 (per 1 MB)
            Analysis         0.65                   0.27 (per file)
            Total            19.07


TABLE VI.7: Running times for Scenario B.

VII Conclusion

In this work, we presented a novel algorithm for dictionary construction that is based on a randomized LU decomposition. Using the constructed dictionaries, the algorithm classifies the content of a file and can deduce its type by examining only a few file fragments. The algorithm can also detect anomalies in PDF files (or in any other rich-content format) that may be malicious. This approach can be applied to detect suspicious files that can potentially contain a malicious payload. Anti-virus systems and firewalls can therefore analyze and classify PDF files using the described method and block suspicious files. The use of dictionary construction and classification in our algorithm differs from classical methods for file content detection, which rely on statistical methods and on pattern matching in the file header for classification via deep packet inspection. The fast dictionary construction makes it possible to rebuild the dictionary from scratch when it is out of date, which is important when building evolving systems that classify continuously changing data.


This research was partially supported by the Israel Science Foundation (Grant No. 1041/10), by the Israeli Ministry of Science & Technology (Grants No. 3-9096, 3-10898), by US - Israel Binational Science Foundation (BSF 2012282) and by a Fellowship from Jyväskylä University.


  • [1] M. Aharon, M. Elad, and A. Bruckstein, K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. on Signal Processing, 54 (2006), pp. 4311–4322.
  • [2] I. Ahmed, K.-s. Lhee, H. Shin, and M. Hong, On improving the accuracy and performance of content-based file type identification, in Information Security and Privacy, Springer, 2009, pp. 44–59.
  • [3] I. Ahmed, K.-s. Lhee, H. Shin, and M. Hong, Fast file-type identification, in Proceedings of the 2010 ACM Symposium on Applied Computing, ACM, 2010, pp. 1601–1602.
  • [4] M. C. Amirani, M. Toorani, and S. Mihandoost, Feature-based type identification of file fragments, Security and Communication Networks, 6 (2013), pp. 115–128.
  • [5] W. C. Calhoun and D. Coles, Predicting the types of file fragments, Digital Investigation, 5 (2008), pp. S14–S20.
  • [6] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM journal on scientific computing, 20 (1998), pp. 33–61.
  • [7] R. F. Erbacher and J. Mulholland, Identification and localization of data types within large-scale file systems, in Systematic Approaches to Digital Forensic Engineering, 2007. SADFE 2007. Second International Workshop on, IEEE, 2007, pp. 55–70.
  • [8] M. Gu and S. C. Eisenstat, Efficient algorithms for computing a strong rank-revealing QR factorization, SIAM Journal on Scientific Computing, 17 (1996), pp. 848–869.
  • [9] Z. Jiang, Z. Lin, and L. S. Davis, Learning a discriminative dictionary for sparse coding via label consistent K-SVD, in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1697–1704.
  • [10] M. Karresand and N. Shahmehri, File type identification of data fragments by their binary structure, in Information Assurance Workshop, 2006 IEEE, IEEE, 2006, pp. 140–147.
  • [11] W.-J. Li, K. Wang, S. J. Stolfo, and B. Herzog, Fileprints: Identifying file types by n-gram analysis, in Information Assurance Workshop, 2005. IAW’05. Proceedings from the Sixth Annual IEEE SMC, IEEE, 2005, pp. 64–71.
  • [12] M. McDaniel and M. H. Heydari, Content based file type detection algorithms, in System Sciences, 2003. Proceedings of the 36th Annual Hawaii International Conference on, IEEE, 2003, pp. 10–pp.
  • [13] B. K. Natarajan, Sparse approximate solutions to linear systems, SIAM journal on computing, 24 (1995), pp. 227–234.
  • [14] C.-T. Pan, On the existence and computation of rank-revealing LU factorizations, Linear Algebra and its Applications, 316 (2000), pp. 199–222.
  • [15] D.-S. Pham and S. Venkatesh, Joint learning and dictionary construction for pattern recognition, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, IEEE, 2008, pp. 1–8.
  • [16] G. Shabat, Y. Shmueli, and A. Averbuch, Randomized LU decomposition, arXiv preprint arXiv:1310.7202, (2013).
  • [17] J. Tropp, Greed is good: algorithmic results for sparse approximation, Information Theory, IEEE Transactions on, 50 (2004), pp. 2231–2242.
  • [18] C. J. Veenman, Statistical disk cluster classification for file carving, in Information Assurance and Security, 2007. IAS 2007. Third International Symposium on, IEEE, 2007, pp. 393–398.
  • [19] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, Robust face recognition via sparse representation, Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31 (2009), pp. 210–227.
  • [20] A. Y. Yang, J. Wright, Y. Ma, and S. S. Sastry, Feature selection in face recognition: A sparse representation perspective, submitted to IEEE Transactions Pattern Analysis and Machine Intelligence, (2007).
  • [21] M. Yang, D. Zhang, and X. Feng, Fisher discrimination dictionary learning for sparse representation, in Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 543–550.
  • [22] Q. Zhang and B. Li, Discriminative K-SVD for dictionary learning in face recognition, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 2691–2698.