Evolutionary Simplicial Learning as a Generative and Compact Sparse Framework for Classification

05/14/2020, by Yigit Oktar, et al.

Dictionary learning for sparse representations has been successful in many reconstruction tasks. Simplicial learning is an adaptation of dictionary learning, where subspaces become clipped and acquire arbitrary offsets, taking the form of simplices. Such an adaptation is achieved through additional constraints on the sparse codes. Furthermore, an evolutionary approach can be chosen to determine the number and the dimensionality of the simplices composing the simplicial, in which the most generative and compact simplicials are favored. This paper proposes an evolutionary simplicial learning method as a generative and compact sparse framework for classification. The proposed approach is first applied to a one-class classification task, where it appears as the most reliable method within the considered benchmark. The most surprising results are observed when evolutionary simplicial learning is considered within a multi-class classification task. Since sparse representations are generative in nature, they bear a fundamental problem of not being capable of distinguishing two classes lying on the same subspace. This claim is validated through synthetic experiments, and the superiority of simplicial learning, even as a generative-only approach, is demonstrated. Simplicial learning loses its superiority over discriminative methods in high-dimensional cases but can further be modified with discriminative elements to achieve state-of-the-art performance in classification tasks.


1 Introduction

Sparse representations have been proven to be very successful at restoration and reconstruction tasks such as compression, denoising, deblurring, inpainting and super-resolution Elad et al. (2010). In essence, they aim at modeling the data/signal through concise linear combinations attained from an overcomplete basis or set of elements. This overcomplete set of elements is called the dictionary, and it can either be carefully fixed (experimentally or analytically) or be adapted to the data at hand through learning Tosic and Frossard (2011). The conventional nonconvex optimization of dictionary learning for sparse representations is given in Eqn. (1) as follows,

$$\min_{\mathbf{D},\,\mathbf{X}} \sum_{i} \|\mathbf{y}_i - \mathbf{D}\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 \le T_0, \;\forall i \qquad (1)$$

where the matrix $\mathbf{D}$ is the designated overcomplete dictionary and $\mathbf{x}_i$ is the sparse representation vector of the data point $\mathbf{y}_i$. While minimizing the reconstruction error of $\mathbf{y}_i$ over the dictionary $\mathbf{D}$, each sparse vector $\mathbf{x}_i$ can have at most $T_0$ nonzero components due to the strict $\ell_0$-norm constraint. In the literature, there exist approximate iterative solutions (namely, sparse coding and dictionary update) to this highly nonconvex problem and its variants Gribonval et al. (2015).
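For illustration only, the alternating structure behind Eqn. (1) can be sketched in a few lines of NumPy, with sparse coding performed by a greedy orthogonal matching pursuit and the dictionary update performed by a MOD-style least-squares step; function names, initialization and parameter values below are illustrative assumptions rather than the authors' implementation.

import numpy as np

def omp(D, y, T0):
    # Greedy sparse coding: approximately minimize ||y - D x||_2 s.t. ||x||_0 <= T0.
    residual, support = y.copy(), []
    x = np.zeros(D.shape[1])
    coeffs = np.zeros(0)
    for _ in range(T0):
        j = int(np.argmax(np.abs(D.T @ residual)))        # atom most correlated with residual
        support.append(j)
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs             # re-fit on the support, update residual
    x[support] = coeffs
    return x

def dictionary_learning(Y, n_atoms, T0, n_iter=20, seed=0):
    # Alternate sparse coding and a MOD-style closed-form dictionary update.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_iter):
        X = np.column_stack([omp(D, y, T0) for y in Y.T])  # sparse coding step
        D = Y @ np.linalg.pinv(X)                          # dictionary update step
        D /= np.linalg.norm(D, axis=0) + 1e-12             # keep atoms normalized
    return D, X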

In addition to reconstructive signal processing tasks, dictionary learning can also be employed in machine learning problems such as classification and clustering Akhtar et al. (2016); Oktar and Turkan (2018, 2019). At this point, it is proper to introduce one-class classification, as the fundamental form of the general classification problem, to bridge the gap between reconstructive signal processing and machine learning. Supervised machine learning in the form of classification inherently suggests the existence of more than one label. The concept of one-class learning, also known as unary or unitary classification, emerges when there exists only a single label within the dataset, and one needs to discriminate it against all possible unseen labels Moya and Hush (1996). It is actually a special case of binary classification where there is the “in-class” label and also the “out-of-class” label, but there are no, or not enough, “out-of-class” samples within the training dataset. Therefore, in the absence or weakness of the opposing class samples, conventional binary classification methods will have difficulties as they target the decision boundary in-between.

One-class learning methods can be categorized by the type of the targeted classifier model. There exist decision-boundary approaches which seek enclosing hyperspheres, hyperplanes or hypersurfaces in general Khan and Madden (2014). These methods can adjust the level of detail through the usage of parametrized kernels to cope with the over- or under-fitting problem. On the other hand, graph-based methods try to fit a skeleton within the data in a bottom-up manner. As an example, a minimum spanning tree model can be utilized as a one-class classifier Juszczak et al. (2009), in which the classification procedure relies on the distance to the tree. A generalization of graph-based approaches is attained through the concept of a hypergraph, in which a hyperedge can now connect more than two data points or vertices. Hypergraph models not only allow custom connectivity but also lead the way to heterogeneous dimensionality. Such models are investigated in Wei et al. (2003); Silva and Willett (2008). As detailed in Sec. 2, simplicial learning through an extension of dictionary learning can be thought of as the utmost generalization of the graph-based domain, in which vertices of a hypergraph can now move freely in space, taking the form of a simplicial.

By definition, an inner-skeleton method seeks a low and possibly heterogeneous dimensional piecewise linear model that expresses the data well in a compact manner. Most importantly, the dictionary learning concept can be categorized as an inner-skeleton method. However, the skeleton attained is not bounded in space but rather an infinite one, where each infinite linear bone is connected to all others at the origin. Technically speaking, a bone corresponds to a linear subspace of arbitrary dimensions. This conception will indeed be helpful when dictionary learning is considered within a multi-class classification framework. In its traditional multi-class formulations, the sparse representation based classifier models a separate dictionary for each distinct class through a data fidelity term together with a norm regularization constraint on the sparse codes ($\ell_0$ or $\ell_1$ in general). Later, the test data is encoded sparsely and classified accordingly, favoring the most reconstructive or representative dictionary Wei et al. (2013). In the absence of other modifications, this form of sparse representation based classifier is known to be generative-only. Generative type approaches can create natural random instances of a class, in contrast to discriminative-only methods which focus on decision boundaries between classes.

In a simplistic manner, one can draw parallels between inner-skeleton and generative formulations, which discard the existence of other classes, and, on the other hand, between decision-boundary and discriminative approaches, which need the existence of opposing classes. Not surprisingly, a method can be both generative and discriminative at the same time. Discrimination, in this sense, arises from the fact that while learning a dictionary (or a model) for a class, the data points from other classes are also taken into consideration, i.e., distances to those other points are to be maximized. Some examples of discriminative dictionary learning methods are given in Mairal et al. (2009); Jiang et al. (2013).

There is a subtle but crucial point that goes unnoticed in sparse representation based classifier applications, and it forms the backbone of the study proposed in this paper. Correspondingly, the XOR problem of neural networks dictates that a single-layer perceptron is not capable of separating XOR inputs, as only a single linear decision boundary is at hand. This has paved the way for multilayer formulations that can solve linearly non-separable cases. A similar problem haunts dictionary learning methods silently. Consider the case demonstrated in Fig. 1, in which there are two classes of the digit 8. The “Pale class” includes pale images, while the “Bright class” contains exactly the same images but brightened up. In technical terms, there are two opposing classes lying on the same subspace in the eyes of linear dictionary learning methods. No matter how discriminative they are, traditional techniques will be incapable of totally distinguishing these two classes. In other words, dictionary learning in its conventional form is insensitive to intensity/magnitude, and it will never be able to solve problems requiring intensity/magnitude distinction.
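This insensitivity is easy to verify numerically. The short sketch below (with arbitrary illustrative dimensions) shows that a dimmed copy of any signal admits exactly the same sparse support and zero reconstruction error over any dictionary, so a purely reconstructive criterion cannot tell the two classes apart:

import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))       # an arbitrary (over)complete dictionary
x = np.zeros(128)
x[[3, 17, 42]] = [0.9, -0.4, 1.3]        # a 3-sparse code
bright = D @ x                           # "bright" sample, lies in the span of 3 atoms
pale = 0.3 * bright                      # "pale" sample: same pattern, lower intensity

x_pale = 0.3 * x                         # identical support, merely rescaled coefficients
print(np.allclose(pale, D @ x_pale))     # True: both classes are reconstructed exactly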

Figure 1: Conventional dictionary learning is incapable of distinguishing intensity/magnitude, or more technically two classes within the same subspace.

This study proposes a new dictionary learning framework for sparse representations through simplicials. By adapting the conventional optimization constraints on sparse codes, the developed evolutionary simplicial learning algorithm leads to a strong generative approach. Experimental validation on different classification tasks demonstrates that this generative-only structure can successfully distinguish two different classes lying on the same subspace as an advantage, while there exist some shortcomings when its discriminative power is under consideration. Achieving state-of-the-art performance in most cases is highly possible through further modifications with discriminative elements. The remaining part of this paper is organized as follows. Sec. 2 introduces the basic concepts and mathematical foundations of simplicial learning as an extension to classical dictionary learning for sparse representations. Then, Sec. 3 details the proposed simplicial learning algorithm by adopting an evolutionary approach with a fitness function appropriate to the problem. Sec. 4 later reports experimental simulations over several datasets and illustrates the obtained results in different classification tasks. Finally, Sec. 5 briefly concludes this study together with possible considerations which can be adapted to strengthen both theoretical and application aspects of the proposed framework.

2 Simplicial Learning: An Extension of Dictionary Learning

2.1 Definitions

Dictionary learning optimization in Eqn. (1) basically tries to fit a union of subspaces to the data. Such subspaces are indeed of infinite extent and all cross the origin without offsets, designated by the dictionary elements usually referred to as atoms. Simplicial learning, as an adaptation of dictionary learning, aims instead at fitting bounded generic piecewise linear objects to the data. Table 1 considers certain bounded generic piecewise linear objects. There are many non-equivalent formal definitions of the first construct to be discussed, namely the polytope. This study strictly sticks with the definition that “a polytope is an intact object which admits a simplicial decomposition.” Hence, a polytope is made up of one or more simplices, whereas it is left open whether such simplices can be of different dimensions.

There are two possible ways to generalize the concept of a polytope. In the first generalization, connectedness can be discarded, leading to the fact that there is not a single object but multiple objects being considered at the same time. The second one allows the building blocks, namely simplices, to have different dimensions, thus leading to heterogeneously dimensional objects. A formal name for such a union of simplices is a simplicial complex, but restrictions on intersections are imposed for a rigorous treatment. By definition, a simplicial complex is a set of simplices satisfying the following two conditions: (i) every face of a simplex from this set is also in this set, and (ii) the non-empty intersection of any two simplices is a face of these two simplices. Losing a bit of formalism, utmost flexibility can be reached by allowing such objects to intersect each other and themselves in arbitrary ways, and this final construct is simply named a simplicial in the remaining part of this paper, to refer to an arbitrary union of simplices in the most general sense. For a more rigorous treatment of these definitions and related concepts, readers might refer to Munkres (2018).

                     May not be intact   Piecewise linear   Heterogeneous dimensionality   Arbitrary intersections
Polytope             no                  yes                ?                              no
Simplicial complex   yes                 yes                yes                            no
Simplicial           yes                 yes                yes                            yes
Table 1: Distinctions between the terms for generic objects.

2.2 Related work

Simplex and simplicial complex based data applications are becoming popular in the literature as data analysis receives more and more topological considerations Luo et al. (2017); Huang et al. (2015); Belton et al. (2018); Tasaki et al. (2016); Patania et al. (2017). Moreover, utilizing simplices for data applications is not a completely new idea from the perspective of sparse representations Wang et al. (2016); Nguyen et al. (2013). Quite similarly, this study chooses an adaptation of the sparse representations framework that casts a union of subspaces into a union of simplices. A rigorous mathematical formulation is detailed in the following.

(a) Subspace
(b) Flat
(c) Simplex
Figure 2: A simple example of how additional constraints on sparse codes affect the solution of sparse representations. (a) The conventional sparsity constraint together with (b) the sum-to-one ($\mathbf{1}^T\mathbf{x}_i = 1$) and (c) the sum-to-one and non-negativity ($\mathbf{1}^T\mathbf{x}_i = 1$ and $\mathbf{x}_i \ge \mathbf{0}$) constraints.

2.3 Mathematical formulation

There are three necessary modifications to make a successful transition from the traditional dictionary learning formulation to simplicial learning. First of all, an additional sum-to-one constraint is needed on the sparse codes as noted in Eqn. (2) as follows,

$$\min_{\mathbf{D},\,\mathbf{X}} \sum_{i} \|\mathbf{y}_i - \mathbf{D}\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 \le T_0, \; \mathbf{1}^T\mathbf{x}_i = 1, \;\forall i \qquad (2)$$

where $\mathbf{1}$ denotes the column vector of ones, of appropriate size with the sparse vectors $\mathbf{x}_i$. Such a modification casts $k$-dimensional subspaces into $(k-1)$-dimensional flats, a flat being a subspace translated by an arbitrary offset. A geometric explanation is illustrated in Fig. 2(a-b) for the case when $T_0 = 2$. In this example, a subspace solution (i.e., an infinite-extent plane) of sparse representations is reduced into a flat (i.e., an infinite-extent line) with an additional sum-to-one constraint on the sparse codes.

In addition to the above constraint, the second necessary modification is an additional non-negativity constraint on the sparse codes, as noted in Eqn. (3) as follows,

$$\min_{\mathbf{D},\,\mathbf{X}} \sum_{i} \|\mathbf{y}_i - \mathbf{D}\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \|\mathbf{x}_i\|_0 \le T_0, \; \mathbf{1}^T\mathbf{x}_i = 1, \; \mathbf{x}_i \ge \mathbf{0}, \;\forall i \qquad (3)$$

where $\mathbf{0}$ denotes the column vector of zeros, of appropriate size with the sparse vectors $\mathbf{x}_i$. Together with the sum-to-one constraint, the sparse codes are now restricted to the range $[0, 1]$ in magnitude, and thus the represented flat as an infinite-extent line turns into a simplex (i.e., a bounded line, or line segment), as apparent in Fig. 2(b-c) for $T_0 = 2$. In the most generic sense, a simplex can be regarded as a bounded flat.
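A two-atom toy example (purely illustrative) makes the geometric effect of the two constraints explicit: the sum-to-one constraint confines the reachable points to the flat through the atoms, and adding non-negativity further confines them to the simplex (segment) between the atoms:

import numpy as np

d1, d2 = np.array([1.0, 0.5]), np.array([-0.5, 1.0])   # two atoms in the plane
a = np.linspace(-2.0, 3.0, 11)                          # candidate coefficients for d1

# Sum-to-one only: x = (a, 1 - a) traces the infinite line (flat) through d1 and d2.
flat = np.array([t * d1 + (1.0 - t) * d2 for t in a])

# Sum-to-one and non-negativity: both a and 1 - a must be >= 0, i.e. a in [0, 1],
# so only the bounded segment (1-simplex) between d1 and d2 remains reachable.
simplex = flat[(a >= 0.0) & (a <= 1.0)]
print(flat.shape, simplex.shape)                        # (11, 2) versus (3, 2)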

Note here that there is no structural constraint on the sparse code patterns for the optimization problems in Eqns. (1)-(3). In other words, all possible $T_0$-combinations of dictionary atoms are available for a $T_0$-sparse vector solution $\mathbf{x}_i$. Since most of these combinations are unnecessary for a given overcomplete dictionary, keeping a set of possible valid combinations (i.e., forcing certain patterns in sparse codes) will provide a more efficient and more compact representation. This finally leads to the concept of structured sparsity, or group sparsity in exact terms Yuan and Lin (2006); Jacob et al. (2009), as a last modification on the road to simplicial learning.

Referring back to Sec. 1, when positional information is removed from a simplicial, the structure left corresponds to a hypergraph, in which a hyperedge refers to a specific simplex within the simplicial. In relation to group sparsity, a hyperedge exactly corresponds to a group of atoms, hence a valid pattern of sparse codes. As a consequence, a set of groups/hyperedges, or more technically a hypergraph data structure, needs to be kept to define the shape of the simplicial. This hypergraph structure will be denoted as $H = \{h_1, h_2, \ldots, h_m\}$, where $h_j$ designates the hyperedge referring to simplex $j$ within the simplicial. In accordance with this definition, simplicial learning with a structure imposed by $H$ can be formulated in Eqn. (4) as follows,

$$\min_{\mathbf{D},\,\mathbf{X},\,H} \sum_{i} \|\mathbf{y}_i - \mathbf{D}\mathbf{x}_i\|_2^2 \quad \text{s.t.} \quad \mathbf{1}^T\mathbf{x}_i = 1, \; \mathbf{x}_i \ge \mathbf{0}, \; \|\mathbf{x}_i\|_0 \le d_{j(i)} + 1, \; \mathbf{x}_i[\bar{h}_{j(i)}] = \mathbf{0}, \;\forall i \qquad (4)$$

where $h_{j(i)}$ is the hyperedge indexing the closest simplex for the data point $\mathbf{y}_i$, $d_{j(i)}$ denotes the dimension of that simplex, and the last constraint ensures the group sparsity such that only the optimal group (i.e., the hyperedge referring to the closest simplex) in $\mathbf{x}_i$ is to be filled and the other entries, represented as $\mathbf{x}_i[\bar{h}_{j(i)}]$ for the complement $\bar{h}_{j(i)}$, shall all be zero. Note here that groups can be not only overlapping but also of different sizes, hence leading to heterogeneous dimensionality. In this final form, $H$ needs to be learned together with $\mathbf{D}$, but a further careful consideration is needed over the compactness of the simplicial in return.

In summary, as is, the optimization in Eqn. (4) is highly ill-posed since there is no restriction on the number of simplices to be used or the dimensions of those simplices. One could even choose a very high-dimensional simplicial construct and zero-out the approximation error easily. Therefore, additional penalty terms need to be investigated based on the number and the dimensionality of simplices for a compact solution. Such a challenge appears to be highly combinatorial in nature and an evolutionary approach can be adopted after a careful consideration of an appropriate fitness function, as described and detailed in Sec. 3.
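The group-sparse coding implied by Eqn. (4) can be sketched as follows: for every hyperedge (a list of atom indices), the data point is projected onto the corresponding simplex by solving a small constrained least-squares problem, and only the winning group is filled. The routine below is an illustrative sketch relying on SciPy's SLSQP solver, not the authors' implementation.

import numpy as np
from scipy.optimize import minimize

def barycentric_code(V, y):
    # min_b ||y - V b||_2^2  s.t.  b >= 0 and sum(b) = 1  (a small convex QP).
    k = V.shape[1]
    res = minimize(lambda b: np.sum((y - V @ b) ** 2),
                   np.full(k, 1.0 / k),
                   jac=lambda b: 2.0 * V.T @ (V @ b - y),
                   bounds=[(0.0, None)] * k,
                   constraints=[{"type": "eq", "fun": lambda b: np.sum(b) - 1.0}],
                   method="SLSQP")
    return res.x

def group_sparse_code(D, H, y):
    # H is a list of hyperedges (lists of atom indices), one per simplex.
    best_err, best_x = np.inf, np.zeros(D.shape[1])
    for edge in H:
        b = barycentric_code(D[:, edge], y)              # project y onto this simplex
        err = np.sum((y - D[:, edge] @ b) ** 2)
        if err < best_err:                               # keep only the closest simplex
            best_err = err
            best_x = np.zeros(D.shape[1])
            best_x[edge] = b                             # all other entries stay zero
    return best_x, best_err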

3 Evolutionary Approach

To obtain an optimal or at least a suitable simplicial in a heuristic manner, a certain number of simplicials are to compete against each other on instances of the same dataset. Basically, an evolutionary approach includes a suitable fitness function to guide this search process, and sub-procedures such as mutations and breeding to perform the actual search.

3.1 The fitness function

There are certain critical points to be carefully considered before designating the fitness function for the problem defined in this study. First of all, a straightforward optimization procedure for the number and the dimensionality of simplices will not be enough to attain the desired compact model. For example, consider that the data is distributed in the shape of a triangle with a certain area. In this case, the triangle with the most compact area should be preferred as the targeted model. However, one could fit a triangle to this data with correct angles but excessive area; in such a case, neither the dimensionality nor the number of simplices changes. In conclusion, one also needs to take into account the volume, or more technically the content, of the simplicials, besides considering the number and the dimensionality of simplices.

The content (or volume) of an arbitrary simplex can be calculated using the Cayley-Menger determinant Li et al. (2015). Let $S$ be a $d$-dimensional simplex in $\mathbb{R}^n$ with vertices $\mathbf{v}_0, \ldots, \mathbf{v}_d$, and let $\mathbf{M}$ denote the matrix of squared distances between vertices such that $M_{pq} = \|\mathbf{v}_p - \mathbf{v}_q\|_2^2$. Then the content $C$ of $S$ is given by the relation in Eqn. (5) as follows,

$$C^2 = \frac{(-1)^{d+1}}{(d!)^2\, 2^d} \det(\hat{\mathbf{M}}) \qquad (5)$$

where $\hat{\mathbf{M}}$ is the $(d+2) \times (d+2)$ matrix obtained from $\mathbf{M}$ by bordering it with a top row of $(0, 1, \ldots, 1)$ and a left column of $(0, 1, \ldots, 1)^T$.
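A direct implementation of Eqn. (5) is short; the illustrative helper below computes the content of a simplex from its vertex coordinates and checks it on a unit segment and on the 3-4-5 right triangle (whose area is 6):

import numpy as np
from math import factorial

def simplex_content(vertices):
    # vertices: (d+1, n) array, one vertex of a d-simplex per row.
    v = np.asarray(vertices, dtype=float)
    d = v.shape[0] - 1
    diff = v[:, None, :] - v[None, :, :]
    M = np.sum(diff ** 2, axis=-1)                        # squared pairwise distances
    B = np.ones((d + 2, d + 2))                           # Cayley-Menger bordering
    B[0, 0] = 0.0
    B[1:, 1:] = M
    c2 = (-1) ** (d + 1) / (factorial(d) ** 2 * 2 ** d) * np.linalg.det(B)
    return float(np.sqrt(max(c2, 0.0)))                   # guard tiny negative round-off

print(simplex_content([[0.0], [1.0]]))                    # 1.0 (unit segment)
print(simplex_content([[0, 0], [3, 0], [0, 4]]))          # 6.0 (3-4-5 triangle)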

Related with the content calculation here, another issue arises because of the allowed heterogeneous dimensionality in the optimization formula. The content of a line segment (as a 1-dimensional object) and of a triangle (as a 2-dimensional object) are incomparable in a general continuous setting, since a triangle contains infinitely many line segments itself. To resolve this problem, an exponential term is introduced through an approximated cumulative discrete content calculation of a simplicial as given in Eqn. (6) as follows,

$$V = \sum_{j=1}^{m} (C_j + 1)^{d_j} \qquad (6)$$

where $m$ denotes the number of hyperedges, or equivalently the number of simplices, $C_j$ is the content of the $j$-th simplex and $d_j$ is the dimension of that simplex. As a content $C_j < 1$ would complicate the exponentiation used, the shift $C_j + 1$ is needed in the discrete approximation.

Having pinned down the above term, which will be a component of the fitness function driving the evolutionary process, a fitness function candidate (in a minimization form) is given in Eqn. (7) as follows,

$$F = \mathrm{SSE} + \lambda V \qquad (7)$$

where the sum of squared errors (SSE) is used as the data fidelity term and the approximated cumulative discrete content $V$ regulates the compactness of the representation. $\lambda$ denotes the regularization parameter controlling the contribution of the compactness prior on the solution.
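Assuming the $(C_j + 1)^{d_j}$ form reconstructed in Eqn. (6) above, the two fitness components of Eqn. (7) can be combined as in the following sketch; the value of the regularization parameter below is an arbitrary placeholder, not the setting used in the experiments:

def cumulative_content(contents, dims):
    # Approximated cumulative discrete content of a simplicial, following the
    # (C_j + 1)^(d_j) form reconstructed in Eqn. (6).
    return float(sum((c + 1.0) ** d for c, d in zip(contents, dims)))

def fitness_minimization(sse, contents, dims, lam=0.1):
    # Eqn. (7): data fidelity (SSE) plus a lambda-weighted compactness penalty.
    # lam here is an arbitrary placeholder value.
    return sse + lam * cumulative_content(contents, dims)

# A simplicial made of one segment of length 2 and one triangle of area 6:
print(fitness_minimization(sse=4.2, contents=[2.0, 6.0], dims=[1, 2]))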

While initially experimenting with the above fitness function, it is observed that the parameter $\lambda$ has a very broad optimality range, which changes drastically from dataset to dataset. This is due to the fact that there is a high dynamic-range imbalance between the two cumulative terms. Therefore, a variant of the defined fitness function is considered by transforming Eqn. (7) into the logarithmic scale in order to compress the dynamic range, leading to a more natural maximization setting formulated in Eqn. (8) as follows,

(8)

where $n$ denotes the number of data points and the parameter $\lambda$ regulates over- or under-fitting. When $\lambda = 0$, the fitness function simply reduces to the data fidelity term, favoring only the reconstruction quality. Instead, a high $\lambda$ value forces the simplicial to be compact. Empirical investigations suggest a single global setting of $\lambda$, as it provides excellent results over all datasets considered in this study, while the remaining parameter of Eqn. (8) is kept fixed.

3.2 Mutations and breeding

First of all, it is important to note here that the hypergraph $H$ is kept in the form of an incidence matrix of zeros and ones, where the row count corresponds to the number of simplices and the column count matches the number of vertices, or rather the number of atoms (columns), in the dictionary $\mathbf{D}$. Mutations can easily be applied on this binary matrix. In detail, there are four main processes that provide the background for evolution: (i) increasing/decreasing the dimension of a simplex, (ii) adding/removing a simplex, (iii) subdividing a simplex and (iv) adding/removing a vertex. All of these mutation operations are performed randomly without any optimality consideration.

As an additional tool to assist the search process, breeding of two simplicials is also undertaken, in which both the dictionary elements and the hypergraph structures of those two simplicials are split and then merged appropriately in order to create a new simplicial representative of its two parents up to a certain extent. Details of the breeding procedure are depicted in Alg. 1. At first, the hypergraph structures $H_1$ and $H_2$ and the corresponding dictionaries $\mathbf{D}_1$ and $\mathbf{D}_2$ are taken from the two parent simplicials. Then random submatrices $H_1'$ and $H_2'$ from each hypergraph are attained together with the corresponding columns of these dictionaries, contained in matrices $\mathbf{D}_1'$ and $\mathbf{D}_2'$. While the vertices (atoms) are directly concatenated into $\mathbf{D}_c$, the hypergraphs are concatenated in a disjoint manner into $H_c$. In short, two subsimplicials are extracted and then grouped together in a disjoint manner to form a new simplicial $(\mathbf{D}_c, H_c)$. Such a tool can be suitably employed to exploit the underlying dimensionality of the dataset, since these splitting and merging processes may lead child simplicials to acquire a properly representative data-dimensionality in a very fast manner, much faster than mutation processes alone. Therefore, as a general observation, breeding determines the core dimensionality of the simplicial and mutations fine-tune the simplicial to the data.

Input: parent simplicials $(\mathbf{D}_1, H_1)$ and $(\mathbf{D}_2, H_2)$
1:  $H_1' \leftarrow$ random subset of the rows (simplices) of $H_1$
2:  $\mathbf{D}_1' \leftarrow$ columns of $\mathbf{D}_1$ used by $H_1'$
3:  $H_2' \leftarrow$ random subset of the rows (simplices) of $H_2$
4:  $\mathbf{D}_2' \leftarrow$ columns of $\mathbf{D}_2$ used by $H_2'$
5:  $\mathbf{D}_c \leftarrow [\mathbf{D}_1' \;\; \mathbf{D}_2']$ (vertices concatenated directly)
6:  $H_c \leftarrow \mathrm{blockdiag}(H_1', H_2')$ (hyperedges concatenated disjointly)
7:  return the child simplicial $(\mathbf{D}_c, H_c)$
Algorithm 1 Breeding Algorithm
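A compact NumPy sketch of the breeding step, written directly from the description above, is given next; variable names and the one-half subset size are illustrative choices rather than the authors' settings:

import numpy as np

def breed(D1, H1, D2, H2, rng):
    # D_i: vertices (atoms) in the columns; H_i: binary incidence matrix, one row per simplex.
    def random_part(D, H):
        rows = rng.choice(H.shape[0], size=max(1, H.shape[0] // 2), replace=False)
        Hs = H[rows]                                  # random subset of simplices
        cols = np.flatnonzero(Hs.any(axis=0))         # vertices those simplices actually use
        return D[:, cols], Hs[:, cols]
    D1s, H1s = random_part(D1, H1)
    D2s, H2s = random_part(D2, H2)
    Dc = np.hstack([D1s, D2s])                        # vertices concatenated directly
    Hc = np.block([[H1s, np.zeros((H1s.shape[0], H2s.shape[1]), dtype=H1s.dtype)],
                   [np.zeros((H2s.shape[0], H1s.shape[1]), dtype=H2s.dtype), H2s]])
    return Dc, Hc                                     # hyperedges merged disjointly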

3.3 Implementation details

The algorithm to learn an evolutionary simplicial model on a set of data points stored in the columns of a data matrix $\mathbf{Y}$ is given in Alg. 2. At first, the initial simplicial is to be generated from the given data points (line 1). It is observed that choosing a single point (i.e., the centroid of the dataset) as an initial simplicial is sufficient for low-dimensional problems. Through mutation and breeding processes, the initial simplicial takes an appropriate form in a fast manner since the search space is relatively small. However, a procedure involving the $k$-means algorithm Jain (2010) as a subroutine is employed to designate the initial simplicial for high-dimensional problems. In such cases, starting from a single point greatly slows down the process of evolution since the search space is quite large. Hence, an initialization based on $k$-means ensures that the starting simplicial is already a relatively fit one. A last point worth mentioning related to initialization is that the initial simplicial should satisfy the condition that the numerator of Eqn. (8) is positive, in order to lead to a meaningful evolution.

1:  initialize the population of simplicials from the data $\mathbf{Y}$ (centroid or $k$-means based)
2:  while the iteration budget is not exhausted do
3:     apply random mutations to the simplicials in the population
4:     breed randomly selected pairs of simplicials (Alg. 1)
5:     for all simplicials $(\mathbf{D}, H)$ in the population do
6:        sparse coding: project each data point of $\mathbf{Y}$ onto every simplex of $(\mathbf{D}, H)$ and fill $\mathbf{X}$
7:        vertex update: $\mathbf{D} \leftarrow \mathbf{Y}\mathbf{X}^{+}$ (least-squares solution)
8:        evaluate the fitness of $(\mathbf{D}, H)$ via Eqn. (8)
9:     end for
10:    keep the fittest simplicials (parents included) as the next population
11:  end while
12:  return the fittest simplicial $(\mathbf{D}, H)$
Algorithm 2 Evolutionary Simplicial Learning (ESL) Algorithm

On line 6, the algorithm performs the projection of the data points in $\mathbf{Y}$ onto each simplex of the simplicial Duchi et al. (2008); Golubitsky et al. (2012), which basically corresponds to the sparse coding optimization. The closest simplex for a data point $\mathbf{y}_i$ is determined through the minimum approximation error acquired after projecting onto each simplex. The positive barycentric coordinates of the projection points, corresponding to the sparse codes, are acquired, and then the necessary spots of the sparse representation matrix $\mathbf{X}$ are filled accordingly.

On line 7, the dictionary matrix $\mathbf{D}$ is updated using a direct least-squares solution. Setting the derivative of the reconstruction error with respect to $\mathbf{D}$ to zero yields the analytic solution $\mathbf{D} = \mathbf{Y}\mathbf{X}^{+}$, where $\mathbf{X}^{+}$ represents the Moore-Penrose pseudo-inverse of $\mathbf{X}$. Note that there is no evolutionary process for learning $\mathbf{D}$, namely the vertices of the simplicial. Instead, the vertices are updated exactly once on this line at each iteration of the algorithm.
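The closed-form vertex update of this step amounts to a single pseudo-inverse call; the toy shapes and the stand-in codes below are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
Y = rng.standard_normal((2, 200))             # data points in the columns of Y
X = np.abs(rng.standard_normal((5, 200)))     # stand-in barycentric codes
X /= X.sum(axis=0)                            # columns sum to one, entries non-negative
D = Y @ np.linalg.pinv(X)                     # least-squares vertex update D = Y X^+
print(np.linalg.norm(Y - D @ X))              # residual after the closed-form update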

Finally, the surviving simplicials are determined based on the fitness scores they attain (line 10). Experimental trials suggest that keeping the population size fixed is an efficient strategy, and that a limited iteration count is sufficient instead of running to full convergence. Notice here that the parent simplicials are to be kept in the population pool when their fitness scores are higher than their children’s.

4 Experimental Results

The proposed method is tested in two phases of experiments to evaluate its classification capabilities. In the first experimental setup, the performance is evaluated in a one-class classification task for outlier detection. The datasets contain a certain proportion of outliers in such outlier detection problems, and methods learn models in an unsupervised manner, agnostic of the data labels. In the second task, the performance of the proposed method is evaluated in a multi-class setting. At this stage, seven synthetic multi-class datasets are generated in addition to two handwritten digit recognition datasets. The synthetic datasets are special in that they contain cases which require intensity/magnitude distinction, which is especially challenging for conventional dictionary learning methods.

Dataset #Samples #Dimensions Outlier Ratio (%)
arrhythmia
cardio
glass
ionosphere
letter
lympho
mnist
musk
optdigits
pendigits
pima
satellite
satimage-2
shuttle
vertebral
vowels
wbc
Table 2: Information regarding the datasets used in outlier detection experiments.

4.1 Outlier detection

In total, 17 benchmark datasets are taken from the ODDS Library Rayana (2016) for the one-class learning task. Information regarding these datasets in terms of the number of samples, sample dimensionality and outlier percentages is summarized in Table 2, and interested readers might refer to Rayana (2016) for details about each individual dataset. Using these benchmark datasets, a random train-test set split is repeated for independent simulations, and the mean Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) results are reported in Table 3.

The proposed Evolutionary Simplicial Learning (ESL) method is evaluated against an extensive outlier detection benchmark named PyOD Zhao et al. (2019). The competing methods include Angle-based Outlier Detector (ABOD) Kriegel et al. (2008), Clustering-based Local Outlier Factor (CBLOF) He et al. (2003), Feature Bagging (FB) Lazarevic and Kumar (2005), Histogram-based Outlier Score (HBOS) Goldstein and Dengel (2012), Isolation Forest (IForest) Liu et al. (2008), K Nearest Neighbors (KNN) Ramaswamy et al. (2000), Local Outlier Factor (LOF) Breunig et al. (2000), Minimum Covariance Determinant (MCD) Hardin and Rocke (2004), One-class Support Vector Machine (OCSVM) Scholkopf et al. (2001) and Principal Component Analysis (PCA) Shyu et al. (2003), together with one of the most recent results obtained in Weng et al. (2018) on the same benchmark (averaged over repeated runs for each dataset).

Dataset ABOD CBLOF FB HBOS IForest KNN LOF MCD OCSVM PCA  Weng et al. (2018) ESL
arrhythmia
cardio
glass -
ionosphere
letter -
lympho
mnist
musk
optdigits -
pendigits
pima -
satellite
satimage-2
shuttle
vertebral
vowels -
wbc -
MEAN n/a
STDEV n/a
Table 3: Mean AUC ROC results from independent simulations for outlier detection on various datasets.

The last two rows of Table 3 illustrate the mean AUC ROC results over all datasets and their standard deviations. ESL not only presents the best average AUC ROC performance among all methods in the benchmark but also has the least standard deviation. One can conclude that it is the most reliable method among the considered techniques for this performance measure. Moreover, ESL shows top AUC ROC performance in three datasets. However, additional tests show that it does not have a noticeable advantage in Precision at n (P@n) performance.

4.2 Multi-class classification

For the multi-class classification task, six challenging synthetic datasets are generated by following the procedures in [1], and these datasets are depicted in Fig. 3. Four of these datasets (namely, Cluster-in-Cluster, Two-Spirals, Half-Kernel and Crescent&Full-moon) contain binary classification tasks, while the remaining two (Corners and Outliers) consist of four-class classification problems. In addition, a synthetically altered dataset (named MNIST8) is included in the experimental setup, in which all samples of the digit 8 from the original MNIST LeCun et al. (2010) are designated as the “Bright class”, while a new “Pale class” is generated from all these original samples by dimming them with a fixed scale, in line with the previous discussion related to Fig. 1.

Dataset SRC LCKSVD1 LCKSVD2 DLSI FDDL DLCOPAR LRSDL ESL
Cluster-in-Cluster
Two-Spirals
Half-Kernel
Crescent&Full-moon
Corners
Outliers
MNIST8
Table 4: Classification success rates of different dictionary learning methods for six synthetic datasets, and for the proposed binary MNIST8 problem (last row).
(a) Cluster-in-Cluster
(b) Two-Spirals
(c) Half-Kernel
(d) Crescent&Full-moon
(e) Corners
(f) Outliers
Figure 3: Examples of learned simplicial models on six synthetic datasets. Best visualized in color.

The proposed ESL algorithm in this setup is compared against Sparse Representation-based Classification (SRC) Wright et al. (2008), Label Consistent K-SVD (LCKSVD1 and LCKSVD2) Jiang et al. (2013), Dictionary Learning with Structured Incoherence (DLSI) Ramirez et al. (2010), Fisher Discrimination Dictionary Learning (FDDL) Yang et al. (2011), Dictionary Learning for Commonality and Particularity (DLCOPAR) Kong and Wang (2012) and Low-rank Shared Dictionary Learning (LRSDL) Vu and Monga (2016, 2017). Experimental results in terms of classification success rates are presented in Table 4. It is apparent that ESL easily outperforms all considered dictionary learning methods over all cases. This should not be a surprising result since all utilized synthetic datasets require intensity/magnitude distinction to various extents. On the other hand, some discriminative methods such as LCKSVD2, FDDL and LRSDL undergo meaningful learning (i.e., better than random) over some datasets. This observation leads to an important conclusion that discriminative modifications may alleviate insensitivity to intensity to a certain degree.

Fig. 3 depicts examples of learned simplicial models on the six synthetic datasets. As can be observed clearly, simplicials are bounded and they are composed of simplices (i.e., points and line segments in these cases) with arbitrary offsets, providing an advantage over unbounded, zero-offset dictionary learning models in all these classification tasks.

Generative-only                                             Discriminative
Dataset SDL-G TDDL-G LLC LDL ESL KNN SVM-Gauss SDL-D FDDL TDDL-D
USPS
MNIST - - -
Table 5: Classification error rates of various methods on handwritten digit datasets, USPS and MNIST. ESL appears as a superior generative method, nearly performing at the capacity of discriminative Gaussian SVM on both datasets.

Digit Classification: In most practical pattern recognition applications, the pattern, or rather the direction of the feature vector utilized, plays an important role in the success rate. For instance, a “star pattern” is a “star pattern” no matter how bright or pale it is. Therefore, the advantage of simplicial learning over dictionary learning is expected to diminish in some real-world applications. This is observable in the digit classification experiments featuring the USPS Hull (1994) and MNIST datasets, as reported in Table 5. In this set of experiments, ESL is compared to classification methods including Supervised Dictionary Learning Mairal et al. (2009) with generative training (SDL-G) and with discriminative learning (SDL-D), Task-driven Dictionary Learning Mairal et al. (2011) in its unsupervised (TDDL-G) and supervised (TDDL-D) forms, FDDL, KNN, Gaussian SVM, Locality-constrained Linear Coding (LLC) Wang et al. (2010) and Locality-sensitive Dictionary Learning (LDL) Wei et al. (2013). The LLC and LDL methods have the sum-to-one constraint on sparse codes; therefore, they learn subspaces with arbitrary offsets, but the learned models are still not bounded (without the non-negativity constraint).

As apparent from Table 5, ESL appears to be a successful generative-only method which performs nearly at the capacity of the Gaussian SVM (i.e., a well-known and widely used discriminative classifier). However, it cannot outperform discriminative dictionary learning methods such as FDDL and TDDL-D on these datasets. A final note is that ESL can also be modified through discriminative elements. The discriminative methods SDL-D and TDDL-D have a noticeable advantage over their generative counterparts SDL-G and TDDL-G. Hence, a successful discriminative version of ESL can be projected to reach the state of the art, an estimation open to discussion and further investigation.

5 Discussion and Conclusion

Dictionary learning through simplicials is more flexible than classical dictionary learning models since simplices are bounded and freely positioned in space. The proposed sparsity-based evolutionary structure, called ESL, is highly applicable if the characteristics of the problem at hand require such localized models. In this study, a global fitness function is employed and there is no restriction on the local fitness of each individual simplex within the simplicial. If the local fitness of each simplex were considered and optimized individually, the resulting simplicial model might be in a more compact form. For example, the unnecessary simplex of the green simplicial in Fig. 3(c) would most probably be eliminated as it does not have any local fitness, thus leading to an increased classification accuracy. Another point worth mentioning is that the employed fitness function in Eqn. (8) is reminiscent of the Poisson distribution in a multidimensional form Belyaev and Lumen’skii (1988). Hence, other probabilistic considerations and also discriminative elements can be adapted to strengthen both theoretical and application aspects of the proposed framework.

As exemplified in this paper, simplicial learning can successfully address some weak points of conventional dictionary learning for the considered machine learning problems; it is a promising approach inherently capable of performing signal processing tasks and can become a general machine learning tool with many application domains.

References

  • [1] 6 functions for generating artificial datasets, File Exchange, MATLAB Central. Accessed: 2019-10-09. Cited by: §4.2.
  • N. Akhtar, F. Shafait, and A. Mian (2016) Discriminative bayesian dictionary learning for classification. IEEE Trans. Patt. Anal. Mach. Intell. 38 (12), pp. 2374–2388. Cited by: §1.
  • R. L. Belton, B. T. Fasy, R. Mertz, S. Micka, D. L. Millman, D. Salinas, A. Schenfisch, J. Schupbach, and L. Williams (2018) Learning simplicial complexes from persistence diagrams. In Conf. Comput. Geometry, pp. 18. Cited by: §2.2.
  • Y. K. Belyaev and Y. P. Lumen’skii (1988) Multidimensional poisson walks. J. Soviet Math. 40 (2), pp. 162–165. Cited by: §5.
  • M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander (2000) LOF: identifying density-based local outliers. In ACM SIGMOD Record, Vol. 29, pp. 93–104. Cited by: §4.1.
  • J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra (2008) Efficient projections onto the l 1-ball for learning in high dimensions. In Int. Conf. Mach. Learn., pp. 272–279. Cited by: §3.3.
  • M. Elad, M. A. T. Figueiredo, and Y. Ma (2010) On the role of sparse and redundant representations in image processing. Proc. IEEE 98 (6), pp. 972–982. Cited by: §1.
  • M. Goldstein and A. Dengel (2012) Histogram-based outlier score (HBOS): a fast unsupervised anomaly detection algorithm. KI-2012: Poster and Demo Track, pp. 59–63. Cited by: §4.1.
  • O. Golubitsky, V. Mazalov, and S. M. Watt (2012) An algorithm to compute the distance from a point to a simplex. Commun. Comput. Algebra 46, pp. 57–57. Cited by: §3.3.
  • R. Gribonval, R. Jenatton, F. Bach, M. Kleinsteuber, and M. Seibert (2015) Sample complexity of dictionary learning and other matrix factorizations. IEEE Trans. Inf. Theory 61 (6), pp. 3469–3486. Cited by: §1.
  • J. Hardin and D. M. Rocke (2004) Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator. Computational Stat. Data Anal. 44 (4), pp. 625–638. Cited by: §4.1.
  • Z. He, X. Xu, and S. Deng (2003) Discovering cluster-based local outliers. Pattern Recog. Lett. 24 (9-10), pp. 1641–1650. Cited by: §4.1.
  • J. Huang, F. Nie, and H. Huang (2015) A new simplex sparse learning model to measure data similarity for clustering. In Int. Joint Conf. Artif. Intell., pp. 3569–3575. Cited by: §2.2.
  • J. J. Hull (1994) A database for handwritten text recognition research. IEEE Trans. Patt. Anal. Mach. Intell. 16 (5), pp. 550–554. Cited by: §4.2.
  • L. Jacob, G. Obozinski, and J.-P. Vert (2009) Group lasso with overlap and graph lasso. In Int. Conf. Mach. Learn., pp. 433–440. Cited by: §2.3.
  • A. K. Jain (2010) Data clustering: 50 years beyond K-means. Pattern Recog. Lett. 31 (8), pp. 651–666. Cited by: §3.3.
  • Z. Jiang, Z. Lin, and L. S. Davis (2013) Label consistent K-SVD: learning a discriminative dictionary for recognition. IEEE Trans. on Patt. Anal. Mach. Intell. 35 (11), pp. 2651–2664. Cited by: §1, §4.2.
  • P. Juszczak, D. M. J. Tax, E. Pękalska, and R. P. W. Duin (2009) Minimum spanning tree based one-class classifier. Neurocomput. 72 (7-9), pp. 1859–1869. Cited by: §1.
  • S. S. Khan and M. G. Madden (2014) One-class classification: taxonomy of study and review of techniques. The Know. Eng. Review 29 (3), pp. 345–374. Cited by: §1.
  • S. Kong and D. Wang (2012) A dictionary learning approach for classification: separating the particularity and the commonality. In European Conf. Comp. Vis., pp. 186–199. Cited by: §4.2.
  • H.-P. Kriegel, M. Schubert, and A. Zimek (2008) Angle-based outlier detection in high-dimensional data. In Int. Conf. Knowledge Discovery Data Mining, pp. 444–452. Cited by: §4.1.
  • A. Lazarevic and V. Kumar (2005) Feature bagging for outlier detection. In Int. Conf. Knowledge Discovery Data Mining, pp. 157–166. Cited by: §4.1.
  • Y. LeCun, C. Cortes, and C. J. C. Burges (2010) MNIST Handwritten Digit Database. Cited by: §4.2.
  • H.-C. Li, M. Song, and C. Chang (2015) Simplex volume analysis for finding endmembers in hyperspectral imagery. In Satellite Data Comp. Commun. Process. XI, Vol. 9501, pp. 950107. Cited by: §3.1.
  • F. T. Liu, K. M. Ting, and Z.-H. Zhou (2008) Isolation forest. In IEEE Int. Conf. Data Mining, pp. 413–422. Cited by: §4.1.
  • C. Luo, C. Ma, C. Wang, and Y. Wang (2017) Learning discriminative activated simplices for action recognition. In AAAI Conf. Artif. Intell., pp. 4211–4217. Cited by: §2.2.
  • J. Mairal, F. Bach, and J. Ponce (2011) Task-driven dictionary learning. IEEE Trans. Patt. Anal. Mach. Intell. 34 (4), pp. 791–804. Cited by: §4.2.
  • J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach (2009) Supervised dictionary learning. In Adv. Neural Inf. Process. Syst., pp. 1033–1040. Cited by: §1, §4.2.
  • M. M. Moya and D. R. Hush (1996) Network constraints and multi-objective optimization for one-class classification. Neural Networks 9 (3), pp. 463–474. Cited by: §1.
  • J. R. Munkres (2018) Analysis on manifolds. CRC Press. Cited by: §2.1.
  • D. K. Nguyen, K. Than, and T. B. Ho (2013) Simplicial nonnegative matrix factorization. In Int. Conf. Comput. Commun. Tech.-Res. Innov. Vis. Fut., pp. 47–52. Cited by: §2.2.
  • Y. Oktar and M. Turkan (2018) A review of sparsity-based clustering methods. Signal Process. 148, pp. 20–30. Cited by: §1.
  • Y. Oktar and M. Turkan (2019) K-polytopes: a superproblem of k-means. Signal, Image, Video Process. 13 (6), pp. 1207–1214. Cited by: §1.
  • A. Patania, F. Vaccarino, and G. Petri (2017) Topological analysis of data. EPJ Data Sci. 6 (1), pp. 7. Cited by: §2.2.
  • S. Ramaswamy, R. Rastogi, and K. Shim (2000) Efficient algorithms for mining outliers from large data sets. In ACM SIGMOD Record, Vol. 29, pp. 427–438. Cited by: §4.1.
  • I. Ramirez, P. Sprechmann, and G. Sapiro (2010) Classification and clustering via dictionary learning with structured incoherence and shared features. In IEEE Conf. Comp. Vis. Patt. Recog., pp. 3501–3508. Cited by: §4.2.
  • S. Rayana (2016) ODDS library. Stony Brook Univ., Dept. of Computer Science. Cited by: §4.1.
  • B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471. Cited by: §4.1.
  • M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang (2003) A novel anomaly detection scheme based on principal component classifier. In Int. Conf. Data Mining, Cited by: §4.1.
  • J. Silva and R. Willett (2008) Hypergraph-based anomaly detection of high-dimensional co-occurrences. IEEE Trans. Patt. Anal. Mach. Intell. (3), pp. 563–569. Cited by: §1.
  • H. Tasaki, R. Lenz, and J. Chao (2016) Simplex-based dimension estimation of topological manifolds. In Int. Conf. Patt. Recog., pp. 3609–3614. Cited by: §2.2.
  • I. Tosic and P. Frossard (2011) Dictionary learning: what is the right representation for my signal?. IEEE Signal Process. Mag. 28, pp. 27–38. Cited by: §1.
  • T. H. Vu and V. Monga (2016) Learning a low-rank shared dictionary for object classification. In IEEE Int. Conf. Image Process., pp. 4428–4432. Cited by: §4.2.
  • T. H. Vu and V. Monga (2017) Fast low-rank shared dictionary learning for image classification. IEEE Trans. Image Process. 26 (11), pp. 5160–5175. Cited by: §4.2.
  • C. Wang, J. Flynn, Y. Wang, and A. Yuille (2016) Recognizing actions in 3D using action-snippets and activated simplices. In AAAI Conf. Artif. Intell., pp. 3604–3610. Cited by: §2.2.
  • J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong (2010) Locality-constrained linear coding for image classification. In IEEE Conf. Comp. Vis. Patt. Recog., pp. 3360–3367. Cited by: §4.2.
  • C.-P. Wei, Y.-W. Chao, Y.-R. Yeh, and Y.-C. F. Wang (2013) Locality-sensitive dictionary learning for sparse representation based classification. Pattern Recog. 46 (5), pp. 1277–1287. Cited by: §1, §4.2.
  • L. Wei, W. Qian, A. Zhou, W. Jin, and X. Y. Jeffrey (2003) Hot: hypergraph-based outlier test for categorical data. In Pacific-Asia Conf. Know. Discov. Data Mining, pp. 399–410. Cited by: §1.
  • Y. Weng, N. Zhang, and C. Xia (2018) Multi-agent-based unsupervised detection of energy consumption anomalies on smart campus. IEEE Access 7, pp. 2169–2178. Cited by: §4.1, Table 3.
  • J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma (2008) Robust face recognition via sparse representation. IEEE Trans. Patt. Anal. Mach. Intell. 31 (2), pp. 210–227. Cited by: §4.2.
  • M. Yang, L. Zhang, X. Feng, and D. Zhang (2011) Fisher discrimination dictionary learning for sparse representation. In Int. Conf. Comp. Vis., pp. 543–550. Cited by: §4.2.
  • M. Yuan and Y. Lin (2006) Model selection and estimation in regression with grouped variables. J. Royal Stat. Soc. B 68 (1), pp. 49–67. Cited by: §2.3.
  • Y. Zhao, Z. Nasrullah, and Z. Li (2019) PyOD: a python toolbox for scalable outlier detection. J. Mach. Learn. Res. 20, pp. 1–7. Cited by: §4.1.