Designing labeled graph classifiers by exploiting the Rényi entropy of the dissimilarity representation

08/22/2014, by Lorenzo Livi, et al.

Representing patterns as labeled graphs is becoming increasingly common in the broad field of computational intelligence. Accordingly, a wide repertoire of pattern recognition tools, such as classifiers and knowledge discovery procedures, is nowadays available and tested on various datasets of labeled graphs. However, the design of effective learning procedures operating in the space of labeled graphs is still a challenging problem, especially from the computational complexity viewpoint. In this paper, we present a major improvement of a general-purpose classifier for graphs, which is conceived on an interplay between dissimilarity representation, clustering, information-theoretic techniques, and evolutionary optimization algorithms. The improvement focuses on a specific key subroutine devised to compress the input data. We prove several theorems which are fundamental to setting the parameters controlling such a compression operation. We demonstrate the effectiveness of the resulting classifier by benchmarking the developed variants on well-known datasets of labeled graphs, considering as distinct performance indicators the classification accuracy, computing time, and parsimony in terms of structural complexity of the synthesized classification models. The results show state-of-the-art standards in terms of test set accuracy and a considerable speed-up in terms of computing time.


1 Introduction

A labeled graph [33] offers a powerful model for representing patterns characterized by interacting elements, in both static and dynamic scenarios. A labeled graph (also called attributed graph) is a tuple $G=(\mathcal{V},\mathcal{E},\mu,\nu)$, where $\mathcal{V}$ is the finite set of vertices, $\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}$ is the set of edges, $\mu:\mathcal{V}\rightarrow\mathcal{L}_{\mathcal{V}}$ is the vertex labeling function, with $\mathcal{L}_{\mathcal{V}}$ denoting the set of vertex labels, and finally $\nu:\mathcal{E}\rightarrow\mathcal{L}_{\mathcal{E}}$ is the edge labeling function, with $\mathcal{L}_{\mathcal{E}}$ denoting the set of edge labels. The topology of a graph enables the characterization of a pattern in terms of “interacting” elements. Moreover, the generality of both $\mathcal{L}_{\mathcal{V}}$ and $\mathcal{L}_{\mathcal{E}}$ makes it possible to cover a broad range of real-world patterns. Applications of labeled graphs as data representation models can be cited in many scientific fields, such as electrical circuits [18, 45], networks of dynamical systems [49, 67], biochemical interaction networks [24, 39, 17], time-varying labeled graphs [5], and segmented images [61, 47]. Owing to the rapid diffusion of (cheap) multicore computing hardware, and motivated by the increasing availability of interesting datasets describing complex interaction-oriented patterns, recent research on graph-based pattern recognition systems has produced numerous methods [14, 31, 68, 52, 37, 6, 44, 22, 47, 4, 13, 40, 1, 38, 20, 10, 62, 41, 2, 12].
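To make the data type concrete, the following minimal sketch shows one possible in-memory representation of a labeled graph; the class and field names are illustrative choices and do not refer to any of the cited implementations.

```python
from dataclasses import dataclass, field

@dataclass
class LabeledGraph:
    """A labeled (attributed) graph: vertices and edges carry arbitrary labels."""
    vertex_labels: dict                          # vertex id -> label (e.g., 2D coords, chemical symbol)
    edges: dict = field(default_factory=dict)    # (u, v) -> edge label (e.g., a bond valence)

    def add_edge(self, u, v, label=None):
        # Store the edge together with its label; both endpoints must already exist.
        assert u in self.vertex_labels and v in self.vertex_labels
        self.edges[(u, v)] = label

# A toy molecule-like graph: vertices labeled with chemical symbols, edges with valences.
g = LabeledGraph(vertex_labels={0: "C", 1: "O", 2: "H"})
g.add_edge(0, 1, label=2)   # double bond
g.add_edge(0, 2, label=1)   # single bond
```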

Focusing on the high-level design of classification systems for graphs, it is possible to identify two main approaches: those that operate directly in the domain of labeled graphs and those that deal with the classification problem in a suitable embedding space. Of notable interest are those systems that are based on so-called explicit graph embedding algorithms, which transform the input graphs into numeric vectors by means of a mapping or feature extraction technique [33]. Graph embedding algorithms [23, 56, 6, 59, 19, 42, 9, 51] operate by explicitly developing an embedding space. The distance between two graphs is hence computed by processing their vector representations in such a space, usually according to either a geometric or an information-theoretic interpretation (e.g., based on divergences [20]). We distinguish two main categories of graph embedding algorithms: those that are defined in terms of a core inexact graph matching (IGM) procedure working directly in the graph domain, and those that exploit a matrix representation of the graph to extract characterizing information. The former (e.g., see [6, 56, 9]) can process virtually any type of labeled graph, depending on the capability of the adopted core matching algorithm. The latter, although usually defined in a more sophisticated mathematical setting (e.g., see [35, 51, 59, 19]), are constrained to process a restricted variety of labeled graphs, in which all the relevant information can be effectively condensed into a matrix representation of the graph, such as the (weighted) adjacency, transition, or Laplacian matrix. Such a transformation (i.e., the embedding) simplifies the problem of learning and testing a classification model by making it possible to adopt well-established pattern recognition tools. However, it may also cause information loss due to the intrinsic complexity of the labeled graph data type, which synthesizes both topological and semantic information (i.e., the vertex/edge attributes) in the same structure. The interested reader is referred to [33, 12, 26] and references therein for reviews of recent graph embedding techniques.

The dissimilarity representation offers a valuable framework for this purpose, since it makes it possible to describe arbitrarily complex objects by means of their pairwise dissimilarity values (DVs) [48]. In the dissimilarity representation [48, 60], the elements of an input dataset are characterized by vectors made of their pairwise DVs. The key component is hence the definition of a nonnegative (bounded) dissimilarity measure $d(\cdot,\cdot)$. A set of prototypes $\mathcal{R}$, called representation set (RS), is used to develop the dissimilarity matrix (DM), $\mathbf{D}$, whose elements are given by $\mathbf{D}_{ij}=d(x_i,r_j)$, for every input sample $x_i$ and every prototype $r_j\in\mathcal{R}$. By means of $\mathbf{D}$, it is possible to embed the data in a vector space by developing the so-called dissimilarity space representation: each input sample is finally represented by the corresponding row-vector of $\mathbf{D}$. The dissimilarity representation has found many applications so far, such as in image analysis, query processing, and signature verification [28, 3, 46, 63, 43], as well as in the design of labeled graph clustering and classification systems [56, 37, 9].
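As a concrete illustration of this mechanism, the sketch below builds a dissimilarity matrix from a generic dissimilarity measure and a representation set, and uses its rows as the embedding vectors; the function names and the toy dissimilarity measure are illustrative and do not refer to the actual ODSE implementation.

```python
import numpy as np

def dissimilarity_matrix(data, prototypes, dissimilarity):
    """Build the DM: entry (i, j) is the dissimilarity between sample i and prototype j."""
    D = np.empty((len(data), len(prototypes)))
    for i, x in enumerate(data):
        for j, r in enumerate(prototypes):
            D[i, j] = dissimilarity(x, r)
    return D

# Toy usage with numeric "objects" and the absolute difference as dissimilarity measure.
data = [0.1, 0.5, 0.9, 0.3]
prototypes = [0.0, 1.0]                                    # representation set (RS)
D = dissimilarity_matrix(data, prototypes, lambda a, b: abs(a - b))
embedding = D                                              # each row is a dissimilarity-space vector
print(embedding)
```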

Recently, the Optimized Dissimilarity Space Embedding (ODSE) system has been proposed as a general labeled graph classifier, achieving state-of-the-art results in terms of classification accuracy on well-known benchmarking datasets [37]. The synthesis of the ODSE classification model is performed by means of a novel information-theoretic interpretation of the DM in terms of conveyed information. In practice, the system estimates the informativeness of the input data dissimilarity representation by calculating the quadratic Rényi entropy (QRE) [50]. Such an entropic characterization of the underlying distribution of the DVs proved to be effective, and it has been used in the compression–expansion scheme as well as being an important factor of the ODSE objective function. However, deriving the ODSE classification model is computationally demanding. As a consequence, we developed two improved versions of the ODSE graph classification system [34], which are founded on a fast clustering-based compression (CBC) scheme. The setting of such a clustering algorithm is determined analytically and is supported by a formal proof. This resulted in a considerable computational speed-up of the model synthesis stage, while maintaining state-of-the-art test set classification performance.

In this paper, we elaborate further on the same CBC scheme by estimating the differential α-order Rényi entropy of the DVs by means of a faster technique that relies on an entropic Minimum Spanning Tree (MST) [7]. Also in this case, we give a formal proof pertaining to the setting of the clustering algorithm governing the compression. We experimentally demonstrate that the performance of ODSE operating with the MST-based estimator is comparable to the one using the kernel-based estimator. Additionally, we observe that, in general, the former yields a lower overall computing time.

The remainder of the paper is organized as follows. In Table 1 we report all acronyms used in this paper. Section 2 provides the necessary theoretical background related to the entropy estimators used in this work. In Section 3 we give an overview of the original ODSE graph classification system design [37]. In Section 4 we present the improved ODSE system, which is primarily discussed considering the QRE estimator. In Section 4.3, we discuss a relevant topic related to the (worst-case) efficiency of the developed CBC procedure. Section 5 introduces the principal theoretical contribution of this paper. We prove a theorem related to the CBC scheme when considering the MST-based estimator. Experiments and comparisons with other graph classifiers on well-known benchmarking datasets are presented in Section 6. Results are discussed by considering many different ODSE variants. In particular, we consider two classification systems operating in the dissimilarity space (DS): a k-NN rule based classifier and a neurofuzzy min-max network (MMN) [57]. Conclusions and future directions follow in Section 7.

Acronym Full name
BSAS Basic sequential algorithmic scheme
CBC Clustering-based compression
DM Dissimilarity matrix
DS Dissimilarity space
DV Dissimilarity value
IGM Inexact graph matching
MinSOD Minimum sum of distances
MMN Min-max network
MS Mode Seek
MST Minimum spanning tree
MST-RE Minimum spanning tree - Rényi entropy
ODSE Optimized dissimilarity space embedding
QRE Quadratic Rényi entropy
RS Representation set
SOA State-of-the-art
SVM Support vector machines
TWEC Triple-weight edit scheme
Table 1: Acronyms sorted in alphabetic order.

2 Differential Rényi entropy estimators

Designing pattern recognition (sub)systems by using concepts derived from information theory is nowadays well-established [64, 32, 15, 20, 25, 8]. A key issue in this context is the estimation of information-theoretic quantities from a given dataset, such as entropy and mutual information. The entropy of a distribution describes and quantifies the uncertainty of the modeled system/process in terms of randomness. Such a quantification is widely used in data analysis for characterizing the observed system/process at a mesoscopic level. Since the groundbreaking work of Shannon, different generalized formulations have been proposed. Here we are interested in the generalization proposed by Rényi, which is called the α-order Rényi entropy. Given a continuous random variable $X$, distributed according to a probability density function $p(x)$, the α-order Rényi entropy is defined as:

$H_{\alpha}(X) = \frac{1}{1-\alpha} \log \int p^{\alpha}(x)\, dx, \quad \alpha \geq 0,\ \alpha \neq 1. \qquad (1)$

In the following two subsections, we provide the details of the non-parametric α-order Rényi entropy estimation techniques that we use as components of the ODSE graph classifier. In Sec. 2.1, we introduce the kernel-based entropy estimation technique proposed by Príncipe [50], while in Sec. 2.2 we present another entropy estimation technique that is based on an entropic MST [7].

2.1 The QRE estimator

Recently, Príncipe [50] provided a formulation of Eq. 1 in terms of the so-called information potential of order α, $V_{\alpha}(X)$,

$H_{\alpha}(X) = \frac{1}{1-\alpha} \log V_{\alpha}(X), \qquad V_{\alpha}(X) = \int p^{\alpha}(x)\, dx. \qquad (2)$

When α = 2 holds in Eq. 2, the entropy measure simplifies to the so-called quadratic Rényi entropy. Non-parametric kernel-based estimators provide a plug-in solution to the problem of estimating information-theoretic data descriptors, such as the entropy. A well-known density estimation technique is the Parzen–Rosenblatt windowing, also called kernel density estimation. Usually, a zero-mean Gaussian kernel function $G_{\sigma}(\cdot)$ is adopted, giving rise to a probability density estimator of the form:

$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} G_{\sigma}(x - x_i). \qquad (3)$

The Gaussian kernel enables a controllable bias–variance trade-off of the estimator, dependent on the kernel size σ (and on the data sample size n). According to Príncipe [50], the QRE of the joint distribution of a d-dimensional random vector can be estimated by relying on d unidimensional kernel estimators combined as follows:

$\hat{H}_{2}(\mathbf{X}) = -\log \hat{V}_{2}(\mathbf{X}) = -\log\!\left( \frac{1}{n^{2}} \sum_{i=1}^{n} \sum_{j=1}^{n} \prod_{k=1}^{d} G_{\sigma\sqrt{2}}\big(x_{k}^{(i)} - x_{k}^{(j)}\big) \right), \qquad (4)$

where $\hat{V}_{2}(\cdot)$ is the quadratic information potential and $G_{\sigma\sqrt{2}}(\cdot)$ is a convolved Gaussian kernel with doubled variance, evaluated at the difference between the realizations. When the input domain is bounded, the entropy is maximized when the distribution is uniform,

$H_{2}^{\max} = \log \Delta, \qquad (5)$

where Δ is a (finite) measure of the input data extent [50]. Notably, Eq. 5 can be used to normalize the (estimated) QRE within the [0, 1] range.

A number of kernel evaluations quadratic in the sample size is needed to compute (4), which may become onerous due to the cost of computing the exponential function.
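A minimal sketch of the kernel-based QRE estimation just described, written under the product-of-unidimensional-Gaussians reading of Eq. 4; variable names are illustrative and the snippet is not taken from the ODSE implementation.

```python
import numpy as np

def quadratic_renyi_entropy(X, sigma):
    """Kernel-based QRE estimate of a sample X (n x d), following Eq. 4."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, d = X.shape
    # Pairwise squared distances; the convolved kernel has doubled variance (2 * sigma^2).
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    var2 = 2.0 * sigma ** 2
    kernel = np.exp(-sq_dist / (2.0 * var2)) / (2.0 * np.pi * var2) ** (d / 2.0)
    information_potential = kernel.sum() / n ** 2        # quadratic information potential
    return -np.log(information_potential)

# Toy usage: spread-out samples yield a larger entropy estimate than concentrated ones.
rng = np.random.default_rng(0)
print(quadratic_renyi_entropy(rng.uniform(0.0, 1.0, size=(100, 2)), sigma=0.1))
print(quadratic_renyi_entropy(0.5 + 0.01 * rng.standard_normal((100, 2)), sigma=0.1))
```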

2.2 The MST-based estimator

Let $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ be the data sample of measurements (points), with $\mathbf{x}_i \in \mathbb{R}^d$ and $d \geq 2$, and let $G_{\mathbf{X}}$ be the complete (entropic) graph constructed over these measurements. An edge of such a graph connects $\mathbf{x}_i$ and $\mathbf{x}_j$ by means of a straight line whose length $e_{ij}$ is computed taking the Euclidean distance:

$e_{ij} = \| \mathbf{x}_i - \mathbf{x}_j \|_2. \qquad (6)$

The α-order Rényi entropy (1) can be estimated according to a geometric interpretation of an MST of $G_{\mathbf{X}}$ (shortened as MST-RE). To this end, let $L_{\gamma}(\mathbf{X})$ be the weighted length of an MST connecting the points, which is defined as

$L_{\gamma}(\mathbf{X}) = \min_{T \in \mathcal{T}(G_{\mathbf{X}})} \sum_{e \in T} |e|^{\gamma}, \qquad (7)$

where γ is a user-defined parameter and $\mathcal{T}(G_{\mathbf{X}})$ is the set of all possible (entropic) spanning trees of $G_{\mathbf{X}}$. The Rényi entropy of order α, elaborated using the MST length (7), is defined as follows [7, 27]:

$\hat{H}_{\alpha}(\mathbf{X}) = \frac{1}{1-\alpha} \left[ \log \frac{L_{\gamma}(\mathbf{X})}{n^{\alpha}} - \log \beta_{L,\gamma} \right], \qquad (8)$

where the order α is determined by calculating:

$\alpha = \frac{d - \gamma}{d}. \qquad (9)$

The term $\beta_{L,\gamma}$ is a constant (given the data dimensionality) that can be approximated, for large enough dimensions d, as:

$\beta_{L,\gamma} \simeq \frac{\gamma}{2} \log \frac{d}{2 \pi e}. \qquad (10)$

Modifying γ we obtain different α-order Rényi entropies. This fact suggests that the selection of the γ parameter could be performed according to a suitable performance criterion defined for the task at hand. By definition of α, MST-RE (8) is not sensitive to the dimensionality of the input measurements.

Assuming the estimation is performed on a set of n measurements in $\mathbb{R}^d$, the computational complexity involved in computing Eq. 8 is given by:

$O(dn^2) + O(n^2 \log n) + O(n). \qquad (11)$

The first term in (11) accounts for the generation of $G_{\mathbf{X}}$, computing the respective Euclidean distances for the edge weights. The second term quantifies the cost involved in the MST computation using the well-known Kruskal's algorithm; this cost can be reduced by adopting faster approximations for the MST computation [7], which however are not taken into account in this paper. The last term in (11) concerns the computation of the MST length.
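The sketch below follows the MST-RE construction of Eqs. 6–10 as reconstructed above, using scipy's MST routine in place of an explicit Kruskal implementation; the large-d approximation of the constant is assumed, and all names are illustrative rather than taken from the ODSE implementation.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_renyi_entropy(X, gamma=1.0):
    """Estimate the alpha-order Renyi entropy of a sample X (n x d) via an entropic MST."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    alpha = (d - gamma) / d                                   # Eq. 9
    dist = squareform(pdist(X))                               # complete graph with Euclidean lengths (Eq. 6)
    mst = minimum_spanning_tree(dist)                         # entropic MST (sparse matrix of edge lengths)
    L = np.sum(mst.data ** gamma)                             # weighted MST length (Eq. 7)
    # Large-d approximation of the constant (Eq. 10); for small d it must be obtained
    # differently, e.g., by Monte Carlo simulation.
    beta = (gamma / 2.0) * np.log(d / (2.0 * np.pi * np.e))
    return (np.log(L / n ** alpha) - np.log(beta)) / (1.0 - alpha)   # Eq. 8

# Toy usage: spread-out points yield a larger entropy estimate than concentrated ones.
rng = np.random.default_rng(0)
print(mst_renyi_entropy(rng.uniform(0.0, 1.0, size=(200, 20)), gamma=1.0))
print(mst_renyi_entropy(0.5 + 0.01 * rng.standard_normal((200, 20)), gamma=1.0))
```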

3 The original ODSE graph classifier

The ODSE graph classification system [37] is founded on an explicit graph embedding mechanism that represents the input set of graphs, using a suitable RS, by initially computing the corresponding DM. The configuration of the embedding vectors representing the input data in the DS is derived directly from the rows of the DM. The adopted IGM dissimilarity measure is the symmetric version of the procedure called best matching first, which uses a triple-weight edit scheme (TWEC). Although TWEC provides a heuristic solution to the graph edit distance problem, it offers a good compromise between computational complexity (quadratic in the graph order) and the number of characterizing parameters [33, 37, 6]. TWEC performs a greedy assignment of the vertices of the two input graphs on the basis of the dissimilarity of the corresponding labels; edge operations are induced accordingly.

ODSE synthesizes the classification model by optimizing the DS representation by means of two dedicated operations, called compression and expansion. Both operations make use of the QRE estimator (Sec. 2.1) to quantify the information conveyed by the DM. Since the DVs fall into a continuous interval, the underlying distribution is assumed to be continuous as well. When the distribution is uniform, the entropy reaches its unique maximum value (see Eq. 5), which is used in the ODSE system to normalize the entropy evaluations to [0, 1]. To make this computation straightforward, TWEC has been re-normalized, yielding a fixed, known extent of the DVs.

Another important component of the ODSE graph classification system is the feature-based classifier, which operates directly in the DS; its own classification model is trained during the ODSE synthesis. Such a classifier can be any well-known classification system, such as an MMN [57] or a kernelized support vector machine (SVM). Test labeled graphs are classified by ODSE by feeding the corresponding dissimilarity representation to the learned feature-based classifier, which assigns proper class labels to the test patterns.

Fig. 1(a) and 1(b) give, respectively, the schematics of the ODSE embedding procedure and of the overall synthesis (i.e., optimization) of the ODSE classifier. The ODSE classification model is defined by the RS, the TWEC parameters, and the learned feature-based classifier (see Fig. 1(b)). During the synthesis stage additional parameters are optimized, which are fundamental to the determination of the best-performing ODSE classification model. Those parameters, synthetically denoted in Fig. 1(a), are the kernel size used by the entropy estimator and the two entropy thresholds, which play a fundamental role in the compression and expansion operations, respectively. The ODSE model is synthesized by cross-validating the learned models on the training set over a suitable validation set. The global optimization is governed by a genetic algorithm, since the recognition rate, which guides the optimization among other factors, is not available in closed form with respect to (w.r.t.) the model parameters. The genetic algorithm, although it does not assure convergence towards a global optimum, is easily and effectively parallelizable, which makes it possible to exploit multicore hardware/software implementations.

(a) ODSE embedding space synthesis step.
(b) Synthesis of the ODSE classification model.
Figure 1: Schematic descriptions of the ODSE embedding space and classification model synthesis. Taken from [37].

3.1 The ODSE objective function

All parameters characterizing the ODSE model are arranged into codes. These include the two entropy thresholds, the kernel size of the entropy estimator, the weights of TWEC, and any parameter of the vertex/edge label dissimilarity measures, all ranging in [0, 1]. Since each code induces a specific RS, the optimization problem that characterizes the ODSE synthesis consists in deriving the best-performing RS:

(12)

The objective function (12) is defined as a linear convex combination of two objectives,

(13)

where the arguments denote the dissimilarity representation of an entire dataset computed using the compressed-and-expanded RS instance. The first term evaluates the recognition rate achieved on a validation set, while the second accounts for the quality of the synthesized classification model. Specifically,

(14)

where the first term quantifies the cost related to the number of prototypes. Accordingly,

(15)

where the number of classes characterizing the classification problem at hand acts as a normalization constant. The second term captures the informativeness of the DM:

(16)

We consider the entropy factor (16) in the ODSE objective function (13) to increase the spread (dispersion) of the DVs, which in turn is assumed to magnify the separability of the classes.
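The following sketch only mirrors the structure described in this subsection (a convex combination of a recognition-rate term and a model-quality term that mixes a prototype-count cost with the DM entropy); the specific weighting parameters and functional forms are illustrative placeholders rather than the exact expressions of Eqs. 13–16.

```python
def odse_objective(recognition_rate, n_prototypes, n_training, n_classes,
                   dm_entropy, alpha=0.9, eta=0.5):
    """Illustrative convex combination of recognition rate and model quality."""
    # Cost in [0, 1] penalizing large representation sets (at least one prototype per class).
    prototype_cost = (n_prototypes - n_classes) / max(n_training - n_classes, 1)
    # Model quality rewards parsimony and an informative (high-entropy) dissimilarity matrix.
    model_quality = eta * (1.0 - prototype_cost) + (1.0 - eta) * dm_entropy
    return alpha * recognition_rate + (1.0 - alpha) * model_quality

# Toy usage: two candidate models with the same accuracy but different parsimony/informativeness.
print(odse_objective(0.95, n_prototypes=40, n_training=750, n_classes=15, dm_entropy=0.8))
print(odse_objective(0.95, n_prototypes=400, n_training=750, n_classes=15, dm_entropy=0.4))
```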

3.2 The ODSE compression operation

The compression operation searches for subsets of the initial RS which convey similar information w.r.t. the whole training set; in the original ODSE, the initial RS is equal to the whole training set. In order to describe the mechanism behind the ODSE compression operation, we need to define when a given subset of prototypes is compressible. Consider the DM computed over the training set and the initial RS, and let $\mathcal{S}$ be a subset of prototypes; basically, $\mathcal{S}$ individuates a subset of columns of the DM. Let $\mathbf{S}$ be the filtered DM, i.e., the submatrix restricted to the prototypes in $\mathcal{S}$ only. We say that $\mathcal{S}$ is compressible if

$\hat{H}_2(\mathbf{S}) \leq \tau_C, \qquad (17)$

where $\tau_C$ is the compression threshold and $\hat{H}_2(\mathbf{S})$ estimates the QRE of the underlying joint distribution of $\mathbf{S}$. In practice, the rows of $\mathbf{S}$ are interpreted as measurements of a $|\mathcal{S}|$-dimensional random vector; this is the notation that we use throughout the paper to denote a sample of random measurements elaborated from the DM. If the measurements are concentrated around a single $|\mathcal{S}|$-dimensional support point, the estimated joint entropy is close to zero. This fact allows us to use Eq. 17 as a systematic compression rule, retaining only a single representative prototype graph of $\mathcal{S}$.
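A sketch of the compression rule of Eq. 17 under the reconstruction above: the joint QRE of the DM columns associated with a candidate subset of prototypes is compared against a threshold. The QRE helper is a compact version of the kernel-based sketch from Sec. 2.1; names and the (unnormalized) threshold are illustrative.

```python
import numpy as np

def qre(X, sigma):
    """Kernel-based QRE estimate of a sample X (n x d), as in Eq. 4."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    n, d = X.shape
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    var2 = 2.0 * sigma ** 2
    kernel = np.exp(-sq / (2.0 * var2)) / (2.0 * np.pi * var2) ** (d / 2.0)
    return -np.log(kernel.sum() / n ** 2)

def is_compressible(D, subset_columns, tau_c, sigma):
    """Eq. 17: a subset of prototypes is compressible if the joint QRE of the
    corresponding (filtered) DM columns does not exceed the compression threshold."""
    S = D[:, subset_columns]      # rows are |subset|-dimensional measurements
    return qre(S, sigma) <= tau_c

# Toy usage: columns concentrated around one value are compressible, a spread column is not
# (the threshold here refers to the raw, unnormalized differential QRE of this sketch).
rng = np.random.default_rng(0)
concentrated = 0.5 + 0.01 * rng.standard_normal((50, 3))
spread = rng.uniform(0.0, 1.0, size=(50, 1))
D = np.hstack([concentrated, spread])
print(is_compressible(D, [0, 1, 2], tau_c=-1.0, sigma=0.1))   # True
print(is_compressible(D, [3], tau_c=-1.0, sigma=0.1))         # False
```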

The selection of the candidate subsets of prototypes for the compressibility evaluation is the first important algorithmic issue to be addressed. In the original ODSE [37], the subset selection was performed by means of a randomized algorithm, whose computational complexity does not scale adequately as the input size grows.

3.3 The ODSE expansion operation

The expansion focuses on each single prototype, by analyzing the corresponding columns of the compressed DM. By denoting with $\mathbf{s}_j$ the sample containing the DVs corresponding to the j-th column of the compressed DM, we say that the j-th prototype is expandable if

$\hat{H}_2(\mathbf{s}_j) \leq \tau_E, \qquad (18)$

where $\tau_E$ is the expansion threshold. Practically, the information provided by the prototype is low if the unidimensional measurements are concentrated around a single real-valued number. In such a case, the estimated entropy would be low, approaching zero as the underlying distribution becomes degenerate. Examples of such prototypes are outliers and prototype graphs that are equally dissimilar to all other graphs. Once an expandable prototype is individuated through (18), it is substituted by new graphs elaborated from the training set. Notably, those new graphs are derived by searching for recurrent subgraphs in a suitable subset of the training graphs.

Although the idea of trying to extract new features by searching for (recurrent) subgraphs is interesting, it is also very expensive in terms of computational complexity.

4 The improved ODSE graph classifier

The improved ODSE system [34] follows the same learning scheme as the original one, but with the primary goal of achieving a significant computational speed-up. The first variant, which is presented in Sec. 4.1, considers a simple yet fast RS initialization strategy and a more advanced compression mechanism. The compression is grounded on the formal result discussed in Sec. 4.1.2, which builds on the mathematical properties of the QRE estimator. The second variant of the ODSE classifier is presented in Sec. 4.2. This version includes a more elaborate initialization of the RS, while it is characterized by the same CBC operation of Sec. 4.1.2. The expansion operation, in both cases, has been greatly simplified. Finally, in Sec. 4.3 we discuss an important fact related to the efficiency of the implemented CBC.

4.1 ODSE with clustering-based compression operation

4.1.1 Randomized representation set initialization

The initial RS, that is, the RS used during the synthesis, is defined by sampling the training set according to a selection probability $p_s$. The size of the initial RS is thus characterized by a binomial distribution, containing on average $p_s\,n$ graphs, with variance $p_s(1-p_s)\,n$, where $n$ is the training set size. Although such a selection criterion is linear in the training set size, it operates blindly and may cause an unbalanced selection of the prototypes with respect to the prior class distributions. However, such a simple sampling scheme is mostly useful when the available hardware cannot process the entire dataset at hand.

4.1.2 Compression by a clustering-based subset selection

The entropy measured by the QRE estimator (4) is used to determine the compressibility of a subset of prototypes. Since the entropy estimation is directly related to the DVs between the graphs of the subset, we design a subset selection strategy that aggregates the initial prototypes according to their mutual distance in the DS. Such subsets are assured to be compressible by definition, thus avoiding the computational burden involved in the entropy estimation.

We make use of the well-known Basic Sequential Algorithmic Scheme (BSAS) clustering algorithm (see the pseudo-code of Algorithm 1) with the aim of grouping the dissimilarity column-vectors into (hyper)spherical clusters, using the Euclidean metric. The main reason behind the use of such a simple cluster generation rule is that it is much faster than other, more sophisticated approaches [21], and it gives full control over the generated cluster geometry through a single real-valued parameter, the cluster radius. Since the cluster radius constrains the maximum intra-cluster DV (i.e., the diameter) of each cluster, we can deduce its value analytically considering the particular instance of the kernel size and of the entropy threshold used in Eq. 17. Accordingly, the following theorem (see [34] for the proof) allows us to determine a partition that contains clusters that are compressible by construction.

Theorem 1.

The compressible partition obtained on a training set of graphs is derived by setting the BSAS cluster radius as:

(19)
0:  The ordered input elements, a dissimilarity measure, the cluster radius θ, and the maximum number of allowed clusters Q
0:  The partition of the input elements
1:  for each input element do
2:      if the current partition is empty then
3:          Create a new cluster in the partition, define the current element as the set representative, and proceed with the next element
4:      end if
5:      Get the distance value from the closest representative modeling a cluster of the current partition
6:      Let the j-th cluster be the one achieving such a minimum distance
7:      if the minimum distance is greater than θ AND the number of clusters is lower than Q then
8:          Add a new cluster to the partition and define the current element as the set representative
9:      else
10:          Add the current element to the j-th cluster and update the set representative element
11:      end if
12:  end for
Algorithm 1 BSAS cluster generation rule.
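A compact sketch of the BSAS rule in Algorithm 1, assuming Euclidean distances between dissimilarity vectors and, for simplicity, a running-mean cluster representative; in ODSE the representative is instead maintained with the MinSOD technique discussed below. Names are illustrative.

```python
import numpy as np

def bsas(vectors, theta, max_clusters):
    """Basic Sequential Algorithmic Scheme: one pass over the (ordered) input vectors."""
    clusters, representatives = [], []
    for x in vectors:
        x = np.asarray(x, dtype=float)
        if not clusters:
            clusters.append([x])
            representatives.append(x.copy())
            continue
        # Distance to the closest cluster representative.
        dists = [np.linalg.norm(x - r) for r in representatives]
        j = int(np.argmin(dists))
        if dists[j] > theta and len(clusters) < max_clusters:
            clusters.append([x])
            representatives.append(x.copy())
        else:
            clusters[j].append(x)
            representatives[j] = np.mean(clusters[j], axis=0)   # update the representative
    return clusters

# Toy usage: two well-separated groups of 2D dissimilarity vectors yield two clusters.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0.0, 0.05, (10, 2)), rng.normal(1.0, 0.05, (10, 2))])
print([len(c) for c in bsas(data, theta=0.3, max_clusters=10)])
```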

The optimization of the kernel size and of the compression threshold, together with the result of Theorem 1, allows us to search for the best level of training set compression for the problem at hand. Algorithm 2 shows the pseudo-code of the compression operation described here. Since the ultimate aim of the compression is to aggregate prototypes that convey similar information w.r.t. the training set, we represent each cluster using the minimum sum of distances (MinSOD) technique [16]. In fact, the MinSOD selects a single representative element of a cluster $\mathcal{C}$ according to the following expression:

$\mathbf{m}^{*} = \arg\min_{\mathbf{m} \in \mathcal{C}} \sum_{\mathbf{x} \in \mathcal{C}} d_{2}(\mathbf{m}, \mathbf{x}). \qquad (20)$

Eventually, the prototype graphs corresponding to the computed MinSOD elements in the DS populate the compressed RS.
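A sketch of the MinSOD selection of Eq. 20: within each cluster, the representative is the element minimizing the sum of distances to all other elements (here with the Euclidean distance; in ODSE the corresponding prototype graph is then retrieved). Names are illustrative.

```python
import numpy as np

def minsod(cluster):
    """Return the index (within the cluster) of the minimum-sum-of-distances element."""
    cluster = np.asarray(cluster, dtype=float)
    pairwise = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=-1)
    return int(np.argmin(pairwise.sum(axis=1)))

# Toy usage: the central element of a small 1D cluster is its MinSOD representative.
print(minsod([[0.0], [0.1], [0.9]]))   # prints 1
```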

0:  The initial set of prototype graphs, the DM, the compression threshold, and the kernel size
0:  The compressed set of prototype graphs
1:  Configure BSAS by setting the cluster radius according to Eq. 19
2:  Let the (ordered) set of dissimilarity vectors be elaborated from the columns of the DM
3:  Execute the BSAS on such vectors and obtain the compressible partition
4:  Compute the MinSOD element of each cluster according to Eq. 20, and retrieve the prototype graph corresponding to each MinSOD dissimilarity vector
5:  Define the compressed set as the collection of the retrieved prototype graphs
6:  return  the compressed set of prototype graphs
Algorithm 2 Clustering-based compression algorithm.

The search interval for the kernel size can be effectively reduced as follows:

(21)

A proof for (21) can be found in [34]. This bound is important, since it makes it possible to narrow the search interval for the kernel size, which would otherwise be defined over the entire extended real line.

4.1.3 Expansion based on replacement with maximum dissimilar graphs

The genetic algorithm evolves a population of models over the iterations. Let the initial RS be defined as shown in Sec. 4.1.1, and consider, at a given iteration, the set of unselected training graphs (i.e., those not currently included in the RS) and the compressed RS. The expansion operation described here makes use of the unselected training graphs to replace, in the compressed RS, those prototypes that do not discriminate the classes. The check for the expansion of a single prototype graph is still performed as described in Sec. 3.3. Notably, if the entropy estimated from the j-th column vector is lower than the expansion threshold, then a user-defined number of new training graphs is selected from the unselected set for each class. Those new graphs are selected such that they are maximally dissimilar w.r.t. the j-th prototype under analysis. The new expansion procedure is outlined in [34, Algorithm 2].
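A sketch of the replacement step just described: for a prototype whose column entropy falls below the expansion threshold, a user-defined number of unselected training graphs per class is picked, choosing those maximally dissimilar from that prototype. Function and variable names are illustrative.

```python
def select_expansion_graphs(prototype, unselected, labels, dissimilarity, per_class=1):
    """Pick, for each class, the unselected graphs maximally dissimilar from `prototype`."""
    selected = []
    for cls in sorted(set(labels)):
        candidates = [g for g, y in zip(unselected, labels) if y == cls]
        candidates.sort(key=lambda g: dissimilarity(prototype, g), reverse=True)
        selected.extend(candidates[:per_class])
    return selected

# Toy usage with numeric "graphs" and the absolute difference as dissimilarity measure.
unselected = [0.1, 0.2, 0.8, 0.9]
labels = ["a", "a", "b", "b"]
print(select_expansion_graphs(0.0, unselected, labels, lambda x, y: abs(x - y), per_class=1))
```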

Since compression and expansion are evaluated considering two different interpretations of the DM, we accordingly use two different kernel sizes, one for each operation.

4.1.4 Analysis of computational complexity

The computational complexity is dictated by the execution of the genetic algorithm, and is determined by the cost of the RS initialization, the maximum number of evolutions, the population size, and the cost related to a single fitness function evaluation. In this system variant, the initialization is linear in the training set size; on average, we select a fraction of the training graphs given by the selection probability. The detailed cost related to the fitness function is articulated as the sum of the following costs:

(22)

The first cost is related to the computation of the initial DM corresponding to the training set and the RS obtained through the initialization of Sec. 4.1.1, where each entry requires one execution of the adopted IGM procedure. The second cost is due to the compression operation, which consists in a single BSAS execution and depends on the cache size of the MinSOD [16] and on the cost of a single Euclidean distance computation. The third cost characterizes the expansion operation and depends on the cardinality of the set of unselected training graphs; this operation is repeated at most once per prototype, with an entropy estimation cost that is quadratic in the training set size. The remaining costs are related to the embedding of the DM, to the classification of the validation set using a k-NN rule-based classifier (this cost is updated according to the specific classifier), and to the QRE evaluation over the compressed-and-expanded DM.

As can be deduced from Eq. 22, the model synthesis is now characterized by a cost that is quadratic in both the training set size and the RS size, while in the original ODSE it was (pseudo) cubic in both.

4.2 ODSE with mode seeking initialization

This ODSE version is different from the one described in Sec. 4.1, since it does not implement any expansion operation; however, it adopts a more elaborate initialization of the RS. The RS initialization is now part of the synthesis, since it depends on some of the parameters tuned during the optimization. Compression is still implemented as described in Sec. 4.1.2.

The initialization makes use of the Mode Seek (MS) algorithm [48], a well-known procedure able to identify the modes of a distribution. For each class, and considering a user-defined neighborhood size $k$, the algorithm proceeds as illustrated in [34, Algorithm 3]. The elements found in this way are the estimated modes of the class distribution; hence it is a supervised procedure. The cardinality of the resulting RS depends on the choice of $k$: the larger $k$, the fewer modes are retained. This approach is very appropriate when elements of the same class are distributed over different and heterogeneous clusters: the cluster representatives are the modes identified by the MS algorithm. Moreover, the MS algorithm can be useful to filter out outliers, since they are characterized by a low neighborhood density. The procedure depends on $k$, which directly influences the outcome of the initialization. Additionally, since the neighborhood is defined in the graph domain, MS also depends on the weights characterizing TWEC (in our case). For this very reason, the initialization is now performed during the ODSE synthesis.
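One common formulation of the mode-seeking prototype selection in [48] estimates a local density for each object via the distance to its k-th nearest neighbor and retains the objects that are locally densest within their own neighborhood; the sketch below follows this formulation on a per-class dissimilarity matrix and is not meant to reproduce [34, Algorithm 3] exactly.

```python
import numpy as np

def mode_seek(D, k):
    """Return the indices of the estimated modes of one class, given its pairwise DM and a
    neighborhood size k. An object is a mode if no object in its k-neighborhood has a
    smaller k-NN radius (i.e., a higher local density estimate)."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    order = np.argsort(D, axis=1)               # neighbors sorted by dissimilarity (self first)
    radius = D[np.arange(n), order[:, k]]       # distance to the k-th nearest neighbor
    modes = []
    for i in range(n):
        neighborhood = order[i, :k + 1]         # the object itself plus its k nearest neighbors
        if radius[i] <= radius[neighborhood].min():
            modes.append(i)
    return modes

# Toy usage: two tight groups within one class yield (at least) one mode per group.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0.0, 0.05, (8, 2)), rng.normal(1.0, 0.05, (8, 2))])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(mode_seek(D, k=3))
```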

To limit the complexity of such an initialization, in the experiments we systematically assign small values to the neighborhood size, constraining the search to small neighborhoods. A possible side effect of this choice is that an excessive number of prototypes/modes may be identified. This effect is however attenuated by the subsequent execution of the compression algorithm (Algorithm 2).

4.2.1 Analysis of computational complexity

The overall computational cost of the synthesis is now bounded by the expression in (23). The two main steps of the fitness function involve the execution of the MS algorithm followed by the compression algorithm. The first cost refers to the MS algorithm and depends on the number of training data belonging to each class. The second cost refers to the computation of the initial DM, constructed using the training set and the prototypes derived with MS. The third cost is due to the compression operation. The remaining costs are equivalent to the ones described in Sec. 4.1.4. The overall cost is dominated by the initialization stage, which is (pseudo) quadratic in the class size and quadratic in the neighborhood size.

(23)

4.3 The efficiency of the ODSE clustering-based compression

BSAS (see Algorithm 1) is characterized by a linear computational complexity. However, due to its sequential processing nature, the outcome is sensitive to the data presentation order. In the following, we study the effect of the input ordering on the effectiveness of the CBC, by calculating what we call the ODSE compression efficiency factor.

Let $s$ be the sequence of dissimilarity vectors describing the prototypes in the DS, which are presented as input to Algorithm 1. Let $\Pi(s)$ be the set of all permutations of the sequence $s$. We define the optimal compression ratio for the sequence $s$ as:

$C^{*}(s) = \min_{\pi \in \Pi(s)} \frac{|\mathcal{R}_{c}(\pi)|}{|\mathcal{R}|}, \qquad (24)$

where $\mathcal{R}_{c}(\pi)$ is the compressed RS obtained by analyzing the prototypes arranged according to $\pi$, and $\mathcal{R}$ is the uncompressed RS, i.e., the initial RS. Let $C(s)$ be the effective compression ratio achieved by ODSE considering a generic ordering of $s$. The ratio

$\rho(s) = \frac{C^{*}(s)}{C(s)} \qquad (25)$

describes the asymptotic efficiency of the ODSE compression as the initial RS size grows.

Theorem 2.

The asymptotic worst-case ODSE compression efficiency factor is .

The proof can be found in Appendix A. An interpretation of the result of Theorem 2 is that, in the general case, the asymptotic efficiency of the implemented CBC varies within the range of the optimum compression.

5 ODSE with the MST-based Rényi entropy estimator

In the following, we contextualize the MST-RE estimation technique introduced in Sec. 2.2 as a component of the improved ODSE system presented in Sec. 4. Notably, we provide a theorem for determining the cluster radius parameter of BSAS used in the compression operation (Algorithm 2). In this case, we generate clusters according to the particular instances of the compression threshold and of the γ parameter, since the kernel size parameter is not present in the MST-based estimator. The γ parameter is optimized during the ODSE synthesis. While γ is defined in $(0, d)$, where $d$ is the dimensionality of the samples, we restrict the search interval to a much smaller upper bound in the experiments. This technical choice is motivated by the fact that γ is used as an exponent in Eq. 7, and an excessively large value would easily cause overflow problems in the floating-point representation of the MST length variable.

Theorem 3.

Considering the instances of the compression threshold and of γ, the compressible partition is derived by executing the BSAS algorithm on the training graphs and setting:

(26)

The proof of this theorem can be found in Appendix B. Defining the cluster radius according to Eq. 26 constrains the BSAS to generate clusters that are compressible by construction. Since the compression threshold and γ are optimized during the synthesis of the classifier, the result of Theorem 3, like that of Theorem 1, allows us to evaluate different levels of training set compression according to the overall system performance. The computational complexity discussed in the previous sections is readily updated by considering the cost of the MST-based estimator (see Eq. 11).

6 Experiments

The experimental evaluation of the improved ODSE classifier is organized in three subsections. First, in Sec. 6.1 we introduce the IAM benchmarking datasets. In Sec. 6.2 we discuss the main experimental setting adopted in this paper. Finally, in Sec. 6.3 we show and discuss the results.

6.1 Datasets

The experimental evaluation is performed on the well-known IAM graph benchmarking databases [53]. The IAM repository contains many different datasets representing real-world data collected from various fields, from images to biochemical compounds. In particular, we use the Letter LOW (L-L), Letter MED (L-M), Letter HIGH (L-H), AIDS (AIDS), Proteins (P), GREC (G), Mutagenicity (M), and finally the Coil-Del (C-D) datasets. The first three are datasets of digitized characters modeled as labeled graphs, characterized by three different levels of noise. The AIDS, P, and M datasets represent biochemical networks, while G and C-D are images of various types. For the sake of brevity, we report only essential details in Tab. 2, referring the reader to [53] (and references therein) for a more in-depth discussion of the data. Moreover, since each dataset contains graphs characterized by different vertex and edge labels, we adopted the same vertex and edge dissimilarity measures described in [37, 6].

Dataset # (tr, vs, ts) Classes Vertex Label Edge Label Avg. |V| Avg. |E|
L-L (750, 750, 750) 15 2D coords none 4.7 3.1
L-M (750, 750, 750) 15 2D coords none 4.7 3.2
L-H (750, 750, 750) 15 2D coords none 4.7 4.5
AIDS (250, 250, 1500) 2 Chem. Symbol Valence 15.7 16.2
P (200, 200, 200) 6 Complex Struc. Complex Struc. 32.6 62.1
G (286, 286, 528) 22 2D coords Line Type 11.5 12.2
M (1500, 500, 2337) 2 Chem. Symbols Valence 30.3 30.8
C-D (2400, 500, 1000) 100 2D coords none 21.5 54.2
Table 2: IAM datasets. See [53] for details.

6.2 Experimental setting

The ODSE system version described in Sec. 4.1 is denoted as ODSE2v1, while the version described in Sec. 4.2 as ODSE2v2. Those two versions make use of the QRE estimator; the setting of the clustering algorithm parameter used during the compression is therefore performed according to the result of Theorem 1. By following the same algorithmic scheme, we consider two additional ODSE variants that differ only in the use of the MST-RE estimator. We denote those two variants as ODSE2v1-MST and ODSE2v2-MST. The setting of the cluster radius is hence performed according to the result of Theorem 3. However, the MST-based estimator is conceived for high-dimensional data. As a consequence, in the ODSE2v1-MST system version we still use the QRE estimator in the expansion operation, since during the expansion we analyze unidimensional distributions of DVs. We adopted two core classifiers operating in the DS. The first one is a k-nearest neighbors (k-NN) rule-based classifier equipped with the Euclidean distance, testing three values of k: 1, 3, and 5. This first choice is dictated primarily by the need to compare the results presented here directly with our previous works [37, 34]. As a second classifier we used a fast MMN, which is trained with the ARC algorithm [57]. The four aforementioned ODSE variants (i.e., ODSE2v1, ODSE2v2, ODSE2v1-MST, and ODSE2v2-MST) are therefore replicated into four additional variants, straightforwardly denoted as ODSE2v1-MMN, ODSE2v2-MMN, ODSE2v1-MST-MMN, and ODSE2v2-MST-MMN, meaning that we simply use the neuro-fuzzy MMN on the embedding space instead of the k-NN. Tab. 3 summarizes all ODSE configurations evaluated in this paper.

Tests are executed setting the genetic algorithm with a (fixed) population size of 30 individuals, and performing a maximum of 40 evolutions for the synthesis; a check on the fitness value is however performed, terminating the optimization if the fitness does not change for 15 evolutions. This setup has been chosen to allow a fair comparison with the previously obtained results [37, 34]. The genetic algorithm performs roulette wheel selection, two-point crossover, and random mutation on the aforementioned codes encoding the real-valued model parameters; in addition, the genetic algorithm implements an elitism strategy which automatically imports the fittest individual into the next population. In all configurations, we executed the system with fixed settings of the weighting parameters in Eqs. 13 and 14. Moreover, the neighborhood size parameter characterizing the MS algorithm, which affects both the ODSE2v2 and ODSE2v2-MST versions, has been set as follows: 10 for L-L, L-M, and L-H, 20 for AIDS, 2 for P, 8 for G, and finally 100 for both M and C-D. Note that these values have been defined according to the training dataset sizes and some preliminary tests. Each dataset has been processed five times using different random seeds, hence reporting the average test set classification accuracy together with its standard deviation. We also report the required average serial CPU time and the average RS size obtained after the synthesis. Tests have been conducted on a regular desktop machine with an Intel Core2 Quad CPU Q6600 at 2.40 GHz and 4 GB of RAM; the software is implemented in C++ on a Linux operating system (Ubuntu 12.04), making use of the SPARE library [36]. Finally, the computing time is measured using the clock() routine of the standard ctime library.

Acronym Init Compression / Est. Expansion / Est. Obj. Func. (16) FB Class.
ODSE2v1 Sec. 4.1.1 Sec. 4.1.2 / QRE Sec. 4.1.3 / QRE QRE k-NN
ODSE2v2 Sec. 4.2 Sec. 4.1.2 / QRE QRE k-NN
ODSE2v1-MST Sec. 4.1.1 Sec. 4.1.2 / MST-RE Sec. 4.1.3 / QRE MST-RE k-NN
ODSE2v2-MST Sec. 4.2 Sec. 4.1.2 / MST-RE MST-RE k-NN
ODSE2v1-MMN Sec. 4.1.1 Sec. 4.1.2 / QRE Sec. 4.1.3 / QRE QRE MMN
ODSE2v2-MMN Sec. 4.2 Sec. 4.1.2 / QRE QRE MMN
ODSE2v1-MST-MMN Sec. 4.1.1 Sec. 4.1.2 / MST-RE Sec. 4.1.3 / QRE MST-RE MMN
ODSE2v2-MST-MMN Sec. 4.2 Sec. 4.1.2 / MST-RE MST-RE MMN
Table 3: Summary of the ODSE configurations evaluated in the experiments. The “Init” column refers to the RS initialization scheme, “Compression / Est.” refers to the compression algorithm and adopted entropy estimator, “Expansion / Est.” the same but for the expansion algorithm, and “Obj. Func. (16)” refers to the entropy estimator adopted in Eq. 16. Finally, “FB Class.” specifies the feature-based classifier operating in the DS.

6.3 Results and discussion

All test set classification accuracy results have been collected in Tab. 4. These include the results of three baseline reference systems and several state-of-the-art (SOA) classification systems based on graph embedding techniques. The table is divided into appropriate macro blocks to simplify the comparison of the results. The three reference systems are denoted as RPS+TWEC+k-NN, k-NN+TWEC, and RPS+TWEC+MMN. The first one performs a (class-independent) randomized selection of the training graphs to develop the dissimilarity representation of the input data; it adopts the same TWEC used in ODSE and performs the classification in the DS by means of a k-NN classifier equipped with the Euclidean distance. The second one operates directly in the graph domain by means of a k-NN rule-based classifier equipped with TWEC. Finally, the third reference system differs from the first one by using the MMN instead of the k-NN. In all cases, to obtain a fair comparison with ODSE, the configuration of the dissimilarity measures for the vertex/edge labels is consistent with the one adopted for ODSE. Additionally, k = 1, 3, and 5 are used in the k-NN rule, and the TWEC parameters (i.e., the weighting parameters) are optimized by means of the same aforementioned genetic algorithm implementation. Therefore, also in this case the test set results are to be intended as the average of five different runs (we omit the standard deviations for the sake of brevity).

Tab. 4 presents the obtained test set classification accuracy results, while Tab. 5 gives the corresponding standard deviations. We provide two types of statistical evaluation of such results. First, we perform pairwise comparisons by means of t-test; we adopt the usual 5% as significance threshold. Notably, we check if any of the improved ODSE variants significantly outperforms, for each dataset, both the reference systems and original ODSE. Best results satisfying such a condition are reported in bold in Tab. 4. In addition to the pairwise comparisons, we calculate also a global ranking of all classifiers by means of the Friedman test. Missing values are replaced by the dataset-specific averages.

First of all, we note that the results obtained with the baseline reference systems are always worse than those obtained with ODSE. Test set classification accuracy percentages obtained by ODSE2v1-MST and ODSE2v2-MST are comparable with those of ODSE2v1 and ODSE2v2, although we note a slight general improvement for the former two variants. Results are also more stable when varying the neighborhood size parameter k of the k-NN rule. It is worth noting that, for difficult datasets such as P and C-D, increasing the neighborhood size in the k-NN rule significantly affects the test set performance (i.e., results degrade considerably). Test set classification accuracy results obtained by means of the MMN operating in the DS are in general (slightly) inferior w.r.t. the ones obtained with the k-NN rule. This result is not too unusual, since the k-NN rule is a valuable classifier, especially in the absence of noisy data. Since ODSE operates by searching for the best-performing DS for the data at hand, we may deduce that the embedding vectors are sufficiently well-organized w.r.t. the classes. Test set results on the first four datasets (i.e., L-L, L-M, L-H, and AIDS) denote an important improvement over a large part of the SOA systems. On the other hand, results on the P, G, and M datasets are comparable w.r.t. those of the SOA systems. For all ODSE configurations, we observe unconvincing results on the C-D dataset; in this case, results are comparable only with those of the reference systems (first block of Tab. 4). However, a rational explanation of this fact has not emerged from the tests yet, thus requiring further investigation in the future. The global picture provided by the column denoted as “Rank” shows that the ODSE classifiers rank in general very well w.r.t. the SOA systems. Standard deviations (Tab. 5) are reasonably small, denoting a reliable classifier regardless of the particular ODSE variant.

We demonstrated that the asymptotic computational complexity of the improved ODSE is quadratic, while the original ODSE was characterized by a cubic computational complexity. Here, in order to complement this result with experimental evidence, we also discuss the effective computing time. The measured serial CPU time, for each dataset, is shown in Tab. 6, which includes both ODSE synthesis and test set evaluation. The ODSE variants based on the MST entropy estimator are faster, with the only exceptions being the P and C-D datasets. This fact is magnified on the first four datasets, on which the speed-up factor w.r.t. the original ODSE increases considerably. The speed-up factors obtained for the first three datasets are one order of magnitude higher than the ones obtained on the other datasets. In order to provide an explanation for such differences, we need to take a closer look at the dataset details shown in Tab. 2, the computational complexity in Eqs. 22 and 23, and the computational complexity of the original ODSE [37]. It is possible to notice that the first three datasets contain smaller (on average) labeled graphs. Therefore, this points us to the related terms in the computational complexity formulae. The cost of the graph matching algorithm is directly affected by the size of the graphs and appears in Eq. 22 and in Eq. 23; the same term also appears in Eq. 24 of [37]. In the original ODSE version [37], the dissimilarity matrix is constructed using an initial set of prototypes equal to the training set (which is then compressed and expanded). In the new versions presented here, we instead use a reduced initial set. In the first variant, graphs are selected randomly from the training set based on a selection probability. In the second variant, instead, we use the MS algorithm, which finds a much lower number of representatives (although, as said in the experimental setting section, we use a conservative setting for MS). This fact provides a first rational justification for the aforementioned differences. In fact, graph matching algorithms are expensive from the computational viewpoint (the adopted algorithm is quadratic in the number of vertices). In addition, the compression and expansion operations are now much faster (from cubic to quadratic in time). As shown in Tab. 8, the new ODSE versions compute a smaller RS, a direct consequence of the improved compression operation. This is another important factor contributing to the overall speed-up, since smaller RSs imply fewer graph matching computations during the validation and test stages (we recall that ODSE is trained by cross-validation). Clearly, there are also other factors, such as the convergence of the optimization algorithm, which might be affected by the specific dataset at hand.

As expected, the speed-up factors obtained by using the MMN as classifier are in general higher than those obtained with the k-NN. In fact, the MMN synthesizes a classification model over the training data embedded into a DS. This significantly reduces the computing time necessary for the evaluation of the test set (and also of the validation stage performed during the synthesis of the model). This is demonstrated by the results in Tab. 7, where we report the CPU time for the test set evaluation only. This fact might assume more importance in particular applications, especially those where the synthesis of the classifier can be effectively performed only once, in off-line mode, and the classification model is employed to process high-rate data streams in real-time [58].

Let us now focus on the structural complexity of the synthesized classification models. The cardinalities of the best-performing RSs are shown in Tab. 8. It is possible to note that the cardinalities are slightly larger for the variants operating with MST-RE (especially on the first three datasets, i.e., L-L, L-M, and L-H). From this fact we deduce that, when configuring the CBC procedure with the MST-RE estimator, the ODSE classifier requires a more complex model than the variants involving the QRE estimator in order to obtain good results in terms of test set accuracy. This behavior is however magnified by the setting of the objective function weighting adopted in our tests, which biases the ODSE system towards the recognition rate performance. Notably, the variants operating with the MMN develop considerably less costly classification models (see Tabs. 8 and 9 for details). This particular aspect becomes very important in resource-constrained scenarios and/or when the input datasets are very large. The considerable reductions of the RS size achieved here strengthen the claim that the entropy estimation operates adequately in the dissimilarity representation context.

Classifier Dataset Rank
L-L L-M L-H AIDS P G M C-D
Reference systems
RPS+TWEC+k-NN, 98.4 96.0 95.0 98.5 45.5 95.0 69.0 81.0 15
k-NN+TWEC, 96.8 66.3 36.3 73.9 52.1 95.0 57.7 61.2 38
RPS+TWEC+k-NN, 98.6 97.2 94.7 98.2 40.5 92.0 68.7 63.2 23
k-NN+TWEC, 97.5 57.4 39.1 71.4 48.5 91.8 56.1 33.7 39
RPS+TWEC+k-NN, 98.3 97.1 95.0 97.6 35.4 84.8 68.5 59.7 32
k-NN+TWEC, 97.6 60.4 42.2 76.7 43.0 88.5 56.9 27.8 40
RPS+TWEC+MMN 98.0 96.0 93.6 97.4 49.5 95.0 66.0 68.4 28
SOA systems
GMM+soft all+SVM [23] 99.7 93.0 87.8 - - 99.0 - 98.1 12
Fuzzy k-means+soft all+SVM [23] 99.8 98.8 85.0 - - 98.1 - 97.3 9
sk+SVM [54] 99.7 85.9 79.1 97.4 - 94.4 55.4 - 30
le+SVM [54] 99.3 95.9 92.5 98.3 - 96.8 74.3 - 7
PCA+SVM [55] 92.7 81.1 73.3 98.2 - 92.9 75.9 93.6 26
MDA+SVM [55] 89.8 68.5 60.5 95.4 - 91.8 62.4 88.2 37
svm+SVM [11] 99.2 94.7 92.8 98.1 71.5 92.2 68.3 - 17
svm+kPCA [11] 99.2 94.7 90.3 98.1 67.5 91.6 71.2 - 14
lgq [30] 81.5 - - - - 86.2 - - 35
bayes [29] 80.4 - - - - 80.3 - - 36
bayes [29] 81.3 - - - - 89.9 - - 34
FMGE+k-NN [42] 97.1 75.7 66.5 - - 97.5 69.1 - 31
FMGE+SVM [42] 98.2 83.1 70.0 - - 99.4 76.5 - 21
d-sps-SVM [9] 99.5 95.4 93.4 98.2 73.0 92.5 71.5 - 8
GRALGv1 [6] 98.2 75.6 69.6 99.7 - 97.7 73.0 94.0 10
GRALGv2 [6] 97.6 89.6 82.6 99.7 64.6 97.6 73.0 97.8 6
Original ODSE
ODSE, [37] 98.6 96.8 96.2 99.6 61.0 96.2 73.4 - 1
Improved ODSE with QRE
ODSE2v1, [34] 99.0 97.0 96.1 99.1 61.2 98.1 68.2 78.1 4
ODSE2v2, [34] 98.7 97.1 95.4 99.5 51.9 95.4 68.1 77.2 5
ODSE2v1, [34] 99.0 97.2 96.1 99.3 41.4 90.2 68.7 64.3 13
ODSE2v2, [34] 98.8 97.4 95.1 99.4 31.4 38.0 69.4 59.0 24
ODSE2v1, [34] 99.1 96.8 95.2 99.0 38.9 85.4 69.0 58.6 27
ODSE2v2, [34] 98.7 97.0 95.6 99.4 31.3 82.5 70.0 54.0 25
ODSE2v1-MMN 98.3 95.2 94.0 99.3 53.1 94.5 67.9 62.8 22
ODSE2v2-MMN 97.8 95.6 93.6 99.6 48.7 94.8 68.2 59.2 29
Improved ODSE with MST-RE
ODSE2v1-MST, 98.6 96.8 98.9 99.3 61.3 95.6 70.0 81.0 3
ODSE2v2-MST, 98.4 97.1 96.0 99.7 51.0 94.1 71.6 82.0 2
ODSE2v1-MST, 98.7 97.0 96.8 99.5 43.0 92.3 68.6 64.8 11
ODSE2v2-MST, 98.8 96.9 96.0 99.7 35.0 91.0 69.4 60.0 16
ODSE2v1-MST, 99.0 96.8 95.6 99.6 41.4 85.0 68.6 60.0 18
ODSE2v2-MST, 98.8 97.0 95.5 99.7 32.9 83.3 70.0 54.0 19
ODSE2v1-MST-MMN 97.9 95.4 93.6 99.3 49.9 95.0 68.3 62.6 20
ODSE2v2-MST-MMN 97.9 95.1 91.8 99.2 48.5 94.8 67.1 59.0 33
Table 4: Test set classification accuracy results – grayed lines denote novel results introduced in this paper. The “-” sign means that the result is not available to our knowledge.
Classifier Dataset
L-L L-M L-H AIDS P G M C-D
ODSE [37] 0.0256 1.2346 0.2423 0.0000 0.7356 0.4136 0.6586 -
ODSE2v1, [34] 0.0769 0.2309 0.1539 0.0000 2.6242 1.3350 0.5187 4.3863
ODSE2v2, [34] 0.0769 0.0769 0.4000 0.0000 0.2915 0.8021 0.5622 2.2654
ODSE2v1, [34] 0.0769 0.2309 0.2666 0.0000 1.0513 1.2236 0.0856 0.0577
ODSE2v2, [34] 0.0769 0.4618 5.0800 0.1924 1.1666 3.1540 0.0356 1.2361
ODSE2v1, [34] 0.5047 0.0769 0.9365 0.1924 0.5050 2.5585 0.3803 1.3279
ODSE2v2, [34] 0.1333 0.2309 0.0769 0.0000 2.7815 4.5220 1.2666 0.0026
ODSE2v1-MMN 0.1520 0.3320 0.3932 0.1861 1.7740 0.7315 1.1300 1.0001
ODSE2v2-MMN 0.2022 0.2022 0.7682 0.0000 2.7290 1.3584 1.4080 0.3896
ODSE2v1-MST, 0.0730 0.0730 0.1115 0.2772 1.5500 0.1055 1.0786 0.4163
ODSE2v2-MST, 0.0596 0.2231 0.0730 0.0000 1.1660 0.2943 0.9534 0.2146
ODSE2v1-MST, 0.1192 0.1520 0.0942 0.6982 1.0940 0.0000 0.5926 1.7088
ODSE2v2-MST, 0.1460 0.2022 0.0730 0.0000 0.0000 0.1112 0.2365 0.5655
ODSE2v1-MST, 0.1115 0.0942 0.2190 0.0596 0.4748 0.0000 0.0547 1.2356
ODSE2v2-MST, 0.0730 0.0596 0.9933 0.0000 0.0000 0.1112 1.0023 0.9563
ODSE2v1-MST-MMN 0.1115 0.4216 0.7624 0.3217 2.5735 0.3067 0.7926 0.9899
ODSE2v2-MST-MMN 0.0596 0.7636 0.7477 0.0000 2.7290 0.5828 0.8911 1.2020
Table 5: Standard deviations of ODSE results shown in Tab. 4.
Classifier Dataset
L-L L-M L-H AIDS P G M C-D
ODSE [37] 63274 52285 28938 394 8460 601 43060 -
ODSE2v1 [34] 284 (222) 329 (158) 328 (88) 38 (10) 3187 (3) 210 (3) 3494 (12) 2724
ODSE2v2 [34] 126 (502) 268 (195) 183 (158) 110 (3) 1683 (5) 96 (6) 10326 (4) 8444
ODSE2v1-MMN 129 (490) 284 (184) 263 (110) 17 (23) 3638 (2) 170 (4) 8837 (5) 5320
ODSE2v2-MMN 195 (324) 422 (124) 183 (158) 86 (5) 1444 (6) 77 (8) 28511 (2) 20301
ODSE2v1-MST 213 (297) 231 (226) 225 (129) 18 (22) 3860 (2) 168 (4) 2563 (17) 3261
ODSE2v2-MST 145 (463) 160 (327) 107 (270) 93 (4) 2075 (4) 74 (8) 7675 (6) 10092
ODSE2v1-MST-MMN 201 (315) 249 (210) 205 (141) 15 (26) 3450 (2) 155 (4) 5496 (8) 7135
ODSE2v2-MST-MMN 117 (541) 176 (292) 118 (245) 83 (5) 1380 (6) 75 (8) 28007 (2) 16599
Table 6: Average serial CPU time in minutes (and speed-up factor w.r.t. the original ODSE system) considering ODSE model synthesis and test set evaluation. In the k-NN case, we report the results for a single value of k only.
Class. Sys. Datasets
L-L L-M L-H AIDS P G M C-D
ODSE2v1-MST, 0.740 0.740 0.740 0.130 0.020 0.060 9.020 9.700
ODSE2v1-MST-MMN 0.105 0.105 0.105 0.005 0.014 0.045 6.600 5.250
Table 7: Average serial CPU time in seconds for test set evaluation only. For simplicity, we report the results of only one system variant operating in the DS with the k-NN classifier and only one with the MMN.
Classifier Dataset
L-L L-M L-H AIDS P G M C-D
ODSE [37] 435 750 750 250 200 283 1500 -
ODSE2v1 [34] 146 449 449 8 197 283 760 615
ODSE2v2 [34] 183 431 338 7 82 126 801 770
ODSE2v1-MMN 136 192 144 6 190 163 563 555
ODSE2v2-MMN 197 546 80 2 93 115 815 740
ODSE2v1-MST 597 595 597 6 198 283 687 618
ODSE2v2-MST 551 574 447 61 122 129 813 775
ODSE2v1-MST-MMN 600 606 500 5 190 184 424 549
ODSE2v2-MST-MMN 550 580 411 61 93 115 456 733
Table 8: Average cardinality of the best-performing RS. In the k-NN case, we report the results for a single value of k only, since the results for the other values are similar.
Classifier Dataset
L-L L-M L-H AIDS P G M C-D
ODSE2v1-MMN 15 39 34 5 43 27 164 357
ODSE2v2-MMN 15 28 41 4 48 28 159 368
ODSE2v1-MST-MMN 15 27 38 3 48 28 168 348
ODSE2v2-MST-MMN 15 27 34 4 43 27 175 365
Table 9: Average number of hyperboxes generated by the MMN. The number of hyperboxes can be used also as a complexity indicator of the model synthesized by the MMN on the DS. Such values should be taken into account considering also the dataset characteristics of Tab. 2 and the computed average representation set sizes in Tab. 8.

7 Conclusions and future directions

In this paper, we have presented different variants of the improved ODSE graph classification system. All the discussed variants are based on the characterization of the informativeness of the DM through the estimation of the α-order Rényi entropy. The first adopted estimator computes the QRE by means of a kernel-based density estimator, while the second one uses the length of an entropic MST. The improved ODSE system has been designed by providing different strategies for the initialization, the compression, and the expansion operation of the RS. In particular, we conceived a fast CBC scheme, which allowed us to directly control the compression level of the data through the explicit setting of the cluster radius parameter. We provided formal proofs for the two estimation techniques. These proofs enabled us to determine the value of the cluster radius analytically, in accordance with the ODSE model optimization procedure. We have also studied the asymptotic worst-case efficiency of the CBC scheme implemented by means of a sequential cluster generation rule (BSAS).

Experimental evaluations and comparisons with several state-of-the-art systems have been performed on well-known benchmarking datasets of labeled graphs (the IAM database). We used two different feature-based classifiers operating in the DS: the k-NN classifier equipped with the Euclidean distance and a neurofuzzy MMN trained with the ARC algorithm. Overall, the variants adopting the MST-based estimator turned out to be faster but less parsimonious as concerns the synthesized ODSE model (i.e., the cardinality of the best-performing RS was larger). The use of the k-NN rule yielded slightly better test set accuracy results w.r.t. the MMN; however, in the latter case we observed important differences in terms of (serial) CPU computing time, especially in the test set processing stage. The test set classification accuracy results confirmed the effectiveness of the ODSE classifier w.r.t. state-of-the-art standards. Moreover, the significant CPU time improvements w.r.t. the original ODSE version, together with the highly parallelizable global optimization scheme based on a genetic algorithm, bring the ODSE graph classifier one step closer to applicability to bigger labeled graphs and larger datasets.

The vector representation of the input graphs has been obtained directly using the rows of the dissimilarity matrix. Such a choice, while known to be effective, has been mainly dictated by the computing time requirements of the system. It is worth analyzing the performance of ODSE also when the embedding space is obtained by a (non)linear embedding of the (corrected) pairwise dissimilarity values [65]. Future experiments include testing other core IGM procedures, different α-order Rényi entropy estimators, and additional feature-based classifiers.

Appendix A Proof of Theorem 2

Proof.

We focus on the worst-case scenario, thus giving a lower bound for the efficiency (25). Let the i-th element of the sequence $s$ denote the i-th dissimilarity vector, corresponding to the i-th prototype graph. Let the best ordering for $s$ be the one achieving the optimal compression ratio, i.e.,

(27)

Let us assume the case in which the Euclidean distance among any pair of vectors in the sequence is given by

(28)

where the involved quantity is the cluster radius adopted during the ODSE compression. It is easy to see that this is the worst-case scenario for the compression purpose in the sequential clustering setting. In fact, each vector in the sequence has a distance from its predecessor/successor equal to the maximum cluster radius. As a consequence, there is still a possibility to compress the vectors, but it is strictly dependent on the specific ordering of the sequence.

First of all, it is important to note that, due to the distances assumed in (28), only three elements of the sequence can be contained in a single cluster. In fact, any three consecutive elements of the sequence would form a cluster whose diameter equals the maximum allowed value. Therefore, considering the sequential rule shown in Algorithm 1 and the corresponding setting of the cluster radius, the best possible ordering is the one that preserves the assumed distance between any two adjacent elements of the sequence, achieving a compression ratio of:

(29)

The worst possible ordering, instead, yields the maximum number of clusters, which can be achieved (for instance, assuming an odd number of elements) when considering the following ordering w.r.t. the optimal one:

(30)

In this case, Algorithm 1 would generate exactly

(31)

clusters, corresponding to the first elements of the permuted sequence, since every pair of consecutive elements in it is at a distance of exactly the assumed value. Therefore, (31) is the maximum number of clusters that can be generated considering the distances assumed in (28). Combining Eq. 29 and 31, we obtain, for a given sequence,

(32)

which allows us to claim that the worst-case efficiency of the ODSE compression varies according to the following ratio:

(33)

Taking the limit in Eq. 33 as the number of prototypes grows gives us the claim. ∎

Appendix B Proof of Theorem 3

Proof.

Let us focus the analysis on a single cluster containing a certain number of prototypes within a training set of graphs. Let us recall that, in the spherical cluster case, the diameter equals twice the cluster radius. Therefore, we can obtain an upper bound for the MST length (7) by considering the case in which all edges of the corresponding MST of the complete graph generated from the measurements have weights equal to the cluster diameter. Specifically,

(34)

In the following, we evaluate the constant exactly as defined in Eq. 10, considering the dimensionality of the measurements. Eq. 34 allows us to derive the following upper bound for the MST-based entropy estimator (8):

(35)
(36)
(37)
(38)

However, the entropy estimator shown in Eq. 8 does not yield normalized values (e.g., in [0, 1]). We can normalize the estimates by considering the following factor:

(39)

The relevant quantity is the maximum distance in a Euclidean hypercube whose side length equals the input data extent, which is 2 in our case. Eq. 39 is a maximizer of Eq. 8, since the logarithm is a monotonically increasing function and the other relevant factors in the expression remain constant when changing the input distribution. The MST length, instead, achieves its maximum value only in the specific case when all points are at a distance equal to this maximum. Therefore, by normalizing Eq. 38 using Eq. 39, we obtain:

(40)

Rewriting the expression in terms of the ODSE compression rule (17), we have:

(41)

Solving for the cluster radius, the right-hand side of (41) can be manipulated as follows: