Probabilistic prototype models for attributed graphs

09/22/2011 ∙ by S. Deepak Srinivasan, et al. ∙ Berlin Institute of Technology (Technische Universität Berlin)

This contribution proposes a new approach towards developing a class of probabilistic methods for classifying attributed graphs. The key concept is the random attributed graph, defined as an attributed graph whose nodes and edges are annotated by random variables. Every node and edge has two random processes associated with it: an occurrence probability and a probability distribution over the attribute values. These are estimated within the maximum likelihood framework. The likelihood of a random attributed graph generating an outcome graph is used as a feature for classification. The proposed approach is fast and robust to noise.


1 Introduction

Attributed graphs are used to represent data as diverse as images, shapes, molecules and protein structures. The statistical analysis of a dataset of patterns represented by graphical structures is a challenging problem and is closely related to tasks such as density estimation, mixture modelling, classification and clustering. There have been some efforts to develop probabilistic models for attributed graphs in the context of pattern recognition. Wong et al. [1, 2] propose the concept of a random graph which takes into account structural and contextual probabilities. An instantiation (outcome) of a random graph is an attributed graph, which enables the characterization of an ensemble of outcome graphs with a probability distribution. Sole-Ribalta et al. [3] generalize the idea of random graphs to structurally-described random graphs (SDRG), with node and edge value distributions. Algorithms have been proposed in which random attributed graph models are used for classification in the maximum likelihood framework. This framework has also been adopted by Seong et al. [4] to develop an incremental clustering algorithm for attributed graphs, and by Sengupta et al. [5] to efficiently organize large structural modelbases for quick retrieval. Two features of such a definition are noteworthy: (i) both the structural and contextual probabilities are considered, and are estimated with suitable independence assumptions, and (ii) a wide variety of attribute values can be handled.

The present contribution aims to develop a theory for probabilistic modeling of attributed graphs, similar to generative models for feature vectors, and to demonstrate its utility for classifying graph patterns. We propose random graph models as prototypes for a set of graphs with continuous node and edge attribute vectors, and estimate their parameters. Instead of using the random graph models to classify the patterns by maximum likelihood, we use the likelihood values as features for classification by subsequent discriminative classifiers such as support vector machines.

2 Random attributed graph models

2.1 Definitions

A random attributed graph (referred to simply as a random graph) is a graph whose nodes and edges are finite probability distributions. Each outcome of a random graph is a labeled graph together with a morphism of the labeled graph into the random graph. The morphism specifies, for each vertex (or edge) of the outcome graph, which vertex (or edge) of the random graph generated it. The probability space of a random graph should be such that the outcomes are attributed graphs with a specified morphism relation, and it should be complete. The definitions in this section follow [1] closely.

Technically, the random graph $\mathcal{R} = (\mathcal{V}, \mathcal{E})$ (elements of the random attributed graph are denoted by script letters) is defined to be such that:

1. Each vertex $\boldsymbol{v} \in \mathcal{V}$ and edge $\boldsymbol{e} \in \mathcal{E}$ is a finite probability distribution,

2. $P(\boldsymbol{e} \neq \phi \mid \boldsymbol{u} = \phi \text{ or } \boldsymbol{v} = \phi) = 0$, where $(\boldsymbol{u}, \boldsymbol{v}) = \Gamma(\boldsymbol{e})$ are the terminal nodes of the edge $\boldsymbol{e}$ and $\phi$ denotes the null outcome,

3. The space of the joint distribution of all random nodes and random edges is complete.

Condition 2 ensures that an edge can occur in an outcome only if both its ends (terminal nodes, given by $\Gamma$) occur. Completeness means that the space is indeed a (standard) probability space. Consider the probability space of the joint distribution: this space is the probability space of attributed graphs, and every outcome is an attributed graph.

Let $G = (V, E)$ be an outcome graph. A morphism $\mu : V \to \mathcal{V}$ and $\nu : E \to \mathcal{E}$ specifies the structural mapping between the random graph and its outcome. Thus, an outcome of a random graph is specified by the tuple $(G, \mu, \nu)$. It is to be noted that the mappings $\mu$ and $\nu$ are into, and the inverse mappings $\mu^{-1}$ and $\nu^{-1}$ are such that some elements could be mapped to $\phi$, i.e. if no morphism exists. The probability of an outcome graph is then the probability of its joint outcome, described by

$$P(G, \mu, \nu \mid \mathcal{R}) = P\Big( \bigwedge_{\boldsymbol{v} \in \mathcal{V}} \boldsymbol{v} = \alpha(\mu^{-1}(\boldsymbol{v})) \,\wedge\, \bigwedge_{\boldsymbol{e} \in \mathcal{E}} \boldsymbol{e} = \beta(\nu^{-1}(\boldsymbol{e})) \Big) \qquad (1)$$

where $\alpha$ is the node attribute function that assigns an attribute to every node, $\alpha(v)$ denotes the particular node attribute value (with $\alpha(\phi) = \phi$), and $\beta$, $\beta(e)$ are the corresponding edge attribute function and values, respectively.

Figure 1: A random attributed graph (centre) with two outcomes and their respective likelihoods

We make the following assumptions to make the definition computationally feasible: node occurrences are independent, and edge occurrences depend only on the nodes that the edge is incident to.

Then, we can simplify Eq. (1) to

$$P(G, \mu, \nu \mid \mathcal{R}) = \prod_{\boldsymbol{v} \in \mathcal{V}_G} p_{\boldsymbol{v}}\, f_{\boldsymbol{v}}\big(\alpha(\mu^{-1}(\boldsymbol{v}))\big) \prod_{\boldsymbol{v} \in \mathcal{V} \setminus \mathcal{V}_G} (1 - p_{\boldsymbol{v}}) \prod_{\boldsymbol{e} \in \mathcal{E}_G} p_{\boldsymbol{e}}\, f_{\boldsymbol{e}}\big(\beta(\nu^{-1}(\boldsymbol{e}))\big) \prod_{\boldsymbol{e} \in \mathcal{E} \setminus \mathcal{E}_G} (1 - p_{\boldsymbol{e}}) \qquad (2)$$

where $\mathcal{V}_G$ and $\mathcal{E}_G$ denote the random nodes and edges that occur in the outcome $G$, $p_{\boldsymbol{v}}$ denotes the probability that the node $\boldsymbol{v}$ occurs, and $(1 - p_{\boldsymbol{v}})$ is the probability that it does not occur; similar notation is adopted for the edges (with edge probabilities conditioned on the occurrence of their terminal nodes), and $f_{\boldsymbol{v}}$, $f_{\boldsymbol{e}}$ denote the attribute value distributions. We note that the formula in Eq. (2) decomposes the probability of an attributed graph instance into the product of the occurrence probabilities of the nodes/edges of the generating random graph that occur in the outcome, the non-occurrence probabilities of the nodes/edges that are absent from the outcome, and the probabilities of the occurring nodes/edges assuming their respective attribute values. Figure 1 illustrates the above definitions with an example.
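For concreteness, the following sketch (ours, not part of the paper) evaluates the logarithm of Eq. (2) for a given outcome and morphism. The data structures and names are illustrative assumptions, and the edge occurrence probabilities are assumed to be already conditioned on their terminal nodes.

```python
import numpy as np
from scipy.stats import multivariate_normal

def outcome_log_likelihood(random_graph, outcome, morphism):
    """Log of Eq. (2): probability that `random_graph` generates `outcome`.

    random_graph: {'nodes': {r: (p_occ, mean, cov)}, 'edges': {...}}
                  p_occ for an edge is assumed already conditioned on its
                  terminal nodes occurring.
    outcome:      {'nodes': {o: attribute_vector}, 'edges': {...}}
    morphism:     {'nodes': {o: r}, 'edges': {o: r}}, mapping each outcome
                  element to the random-graph element that generated it.
    """
    ll = 0.0
    for kind in ('nodes', 'edges'):
        occurred = set(morphism[kind].values())
        # occurrence / non-occurrence terms over all random elements
        for r, (p_occ, _, _) in random_graph[kind].items():
            ll += np.log(p_occ) if r in occurred else np.log(1.0 - p_occ)
        # Gaussian attribute terms for the elements that did occur
        for o, attr in outcome[kind].items():
            _, mean, cov = random_graph[kind][morphism[kind][o]]
            ll += multivariate_normal.logpdf(attr, mean=mean, cov=cov)
    return ll
```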

2.2 Model based clustering of attributed graphs

The estimation of the structural parameters of a random graph given a dataset follows from maximizing the likelihood: the node and edge occurrence probabilities of the random graph are set to those values which maximize the likelihood of the dataset being generated by the random graph. The cost function is

$$L(\mathcal{R}) = \prod_{k=1}^{N} P(G_k \mid \mathcal{R}) \qquad (3)$$

where $P(G_k \mid \mathcal{R})$ is the likelihood that the random graph $\mathcal{R}$ generates the graph $G_k$. We now consider the case where the node and edge attributes are given by feature vectors. In order to simplify the analytical treatment, we assume the attribute vectors to be generated by Gaussian distributions whose means and covariances are to be determined.

Initially, we maximize the cost function with respect to the node and edge occurrence probabilities $p_{\boldsymbol{v}}$ and $p_{\boldsymbol{e}}$. As the node occurrences are modelled by independent Bernoulli distributions, the maximum likelihood estimate is the fraction of occurrences in the dataset:

$$\hat{p}_{\boldsymbol{v}} = \frac{n_{\boldsymbol{v}}}{N} \qquad (4)$$

where $n_{\boldsymbol{v}}$ is the number of occurrences of node $\boldsymbol{v}$ in the sample set of size $N$. Similar estimates hold for the edges, except that the edge occurrence counts are normalized by the joint occurrence counts of their terminal nodes (accounting for the fact that an edge cannot occur if either of its end nodes does not occur).
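A minimal sketch of these counting estimates, assuming the training graphs have already been aligned with the prototype; the interface (per-sample sets of matched random-graph elements) is our own assumption:

```python
from collections import Counter

def estimate_occurrence_probabilities(alignments, num_samples):
    """Counting estimates of Eq. (4). `alignments` holds, per training
    graph, the pair (matched_nodes, matched_edges): the sets of random-graph
    elements that occurred in that sample (edges keyed as (u, v) tuples).
    """
    node_counts, edge_counts = Counter(), Counter()
    for nodes, edges in alignments:
        node_counts.update(nodes)
        edge_counts.update(edges)
    # node occurrence: fraction of occurrences in the dataset
    p_node = {v: c / num_samples for v, c in node_counts.items()}
    # edge occurrence: normalized by the joint occurrence count of its end
    # nodes, since an edge cannot occur unless both terminal nodes occur
    p_edge = {}
    for (u, v), c in edge_counts.items():
        both = sum(1 for ns, _ in alignments if u in ns and v in ns)
        p_edge[(u, v)] = c / both  # both >= c > 0 by construction
    return p_node, p_edge
```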

2.3 Density estimation

We now consider the problem of estimating the means and covariances of the node and edge attribute distributions. It is possible to derive gradient descent update rules for the means and covariance matrices. Vanilla gradient descent, where the means and covariances are updated in the direction of the gradient (as it is assumed to be the steepest direction), is not ideal, as it ignores the geometry of the underlying probability space. Therefore, we use natural gradient descent to estimate the means and covariances online [6, 7].

Natural gradient descent is a modification of the gradient descent procedure which takes into account the geometry of the manifold by incorporating a corrective term given by the Riemannian metric tensor. The equations for updating the means and covariances in the direction of the natural gradient are given by

$$\mu_{t+1} = \mu_t + \eta\, G_{\mu}^{-1}\, \nabla_{\mu} \ln L \qquad (5)$$
$$\Sigma_{t+1} = \Sigma_t + \eta\, G_{\Sigma}^{-1}\, \nabla_{\Sigma} \ln L \qquad (6)$$

where $G_{\mu}$ and $G_{\Sigma}$ are the Riemannian metric tensors in the space of means and covariances, respectively, and $\eta$ is the learning rate. The Riemannian metric tensor in the space of mean vectors is given by

$$G_{\mu} = \Sigma^{-1} \qquad (7)$$

Hence, for a sample attribute vector $x_t$, the online update equation for the Gaussian means is given by

$$\mu_{t+1} = \mu_t + \eta\, (x_t - \mu_t) \qquad (8)$$

The metric tensor in the space of covariance matrices is defined as

$$g_{\Sigma}(U, V) = \tfrac{1}{2}\, \mathrm{tr}\!\left( \Sigma^{-1} U\, \Sigma^{-1} V \right) \qquad (9)$$

so that the natural gradient of the log-likelihood with respect to $\Sigma$, after some simplification, turns out to be

$$\tilde{\nabla}_{\Sigma} \ln f = \Sigma \left( \nabla_{\Sigma} \ln f \right) \Sigma = \tfrac{1}{2}\left( (x_t - \mu_t)(x_t - \mu_t)^{\mathsf{T}} - \Sigma \right) \qquad (10)$$

The online covariance estimation is then given by the first-order update rule

$$\Sigma_{t+1} = \Sigma_t + \eta \left( (x_t - \mu_t)(x_t - \mu_t)^{\mathsf{T}} - \Sigma_t \right) \qquad (11)$$
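The two update rules reduce to a particularly simple online form. A minimal sketch, assuming Eqs. (8) and (11) as reconstructed above and a learning rate of our own choosing:

```python
import numpy as np

def natural_gradient_step(mean, cov, x, eta=0.05):
    """One online natural-gradient update of a Gaussian attribute model
    per Eqs. (8) and (11); eta is an illustrative learning rate."""
    diff = x - mean
    new_mean = mean + eta * diff                          # Eq. (8)
    new_cov = cov + eta * (np.outer(diff, diff) - cov)    # Eq. (11)
    return new_mean, new_cov

# toy usage: track a shifted Gaussian online
mean, cov = np.zeros(2), np.eye(2)
for x in np.random.default_rng(0).normal(loc=1.0, size=(500, 2)):
    mean, cov = natural_gradient_step(mean, cov, x)
```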

3 Random graphs generating classification features

Prototype-based classification schemes are widespread in the domain of attributed graphs [8]. The key idea is to embed the graphs into a vector space in the following manner. Given a set of graphs $\mathcal{G} = \{G_1, \dots, G_N\}$, we synthesize a set of prototype graphs $\mathcal{P} = \{P_1, \dots, P_m\}$ such that every graph $G \in \mathcal{G}$ is embedded in $\mathbb{R}^m$ as

$$\varphi(G) = \big( d(G, P_1), d(G, P_2), \dots, d(G, P_m) \big) \qquad (12)$$

where $d(\cdot, \cdot)$ is a dissimilarity measure between the graphs and the prototypes. The choice of prototypes influences the distance measure and hence the dissimilarity space. To illustrate: when the prototype graphs are chosen to be set medians, means, or cluster centres, it is clear how the distance is calculated. However, what is a suitable distance measure when we choose random graphs as prototypes?

The key lies in defining the Kullback-Leibler divergence between the probability density $q_{\mathcal{R}}$ of the random prototype graph $\mathcal{R}$ and the true (hidden) probability distribution $p$ [9, 10]:

$$D_{\mathrm{KL}}(p \,\|\, q_{\mathcal{R}}) = \int p(G) \ln \frac{p(G)}{q_{\mathcal{R}}(G)}\, dG \qquad (13)$$

The unknown probability distribution is represented by the empirical density $p(G) = \frac{1}{N} \sum_{k=1}^{N} \delta(G - G_k)$, where $\delta$ is the Dirac delta function at every data sample $G_k$. Separating the $\ln$ term into $\ln p - \ln q_{\mathcal{R}}$ and noting that only the second term depends on the prototype, minimizing the divergence amounts to maximizing

$$\frac{1}{N} \sum_{k=1}^{N} \ln q_{\mathcal{R}}(G_k) \qquad (14)$$

which is the log-likelihood that the random graph $\mathcal{R}$ generates the outcomes $G_k$. Hence the likelihood (or, more precisely, its logarithm) can naturally be used as a feature for classification in the dissimilarity/distance representation framework. We also note here that a feature space embedding of graphs defined by likelihood values corresponds to the framework of Jaakkola et al. [11], who propose to use kernels derived from generative models.
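In code, the embedding step might look as follows; `align` and `log_likelihood` stand in for the graph matcher and the Eq. (2) evaluation sketched earlier, and are placeholders of our own.

```python
import numpy as np

def embed_in_likelihood_space(graphs, prototypes, align, log_likelihood):
    """Eq. (12) with the log-likelihood as the dissimilarity measure.
    `align` (e.g. graduated assignment) and `log_likelihood` (e.g. the
    Eq. (2) evaluation above) are placeholder callables."""
    features = np.empty((len(graphs), len(prototypes)))
    for i, g in enumerate(graphs):
        for j, proto in enumerate(prototypes):
            morphism = align(proto, g)                    # find best morphism
            features[i, j] = log_likelihood(proto, g, morphism)
    return features
```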

We thus summarize the scheme as follows. Given a dataset of graphs representing patterns belonging to different classes, first synthesize a random attributed graph acting as a model/prototype for each class. The largest graph in each class (i.e. the graph with the maximum number of nodes) is used to initialize the corresponding prototype. We then present every graph in the training set, align it with the corresponding prototype, and update the node and edge occurrence (structural) probabilities. The means and covariances are also updated according to the formulae in Eqs. (8) and (11). Once the parameters of the random prototype graphs are determined, we embed the dataset into a feature space by calculating the log-likelihood between every graph in the dataset and every element in the prototype set. We point out the following notable features of this scheme: (1) more than one prototype could be used per class, especially for datasets with diverse graphs in the same class; however, in our analysis and experiments, we consider just one random prototype per class in view of the computational complexity of graph matching; (2) the size of the prototypes is bounded by the size of the largest graph in the dataset; (3) the number of graph matching operations during the parameter estimation stage equals the size of the training set; once the prototype random graphs are synthesized, the training set and the test set have to be embedded in the likelihood space, which requires further graph matching operations (every sample against every class prototype).

4 Experiments

4.1 Algorithmic details

Matching attributed graphs: the problem of aligning random graphs with each sample graph, as well as the likelihood calculation, involves attributed graph matching. We again adopt the graduated assignment algorithm [12] with a suitable compatibility function for this purpose. This algorithm minimizes a cost function over all possible matchings by an iterative procedure which estimates the match matrix at every step and then normalizes it. The matching quality is influenced by the node compatibilities, which measure how similar the nodes are both structurally and attribute-wise. In determining the morphism between random attributed graphs and outcome graphs, the compatibility function is set to the likelihood of the node being structurally present in the outcome, thus in effect finding the most probable morphism.
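A compact sketch of the graduated assignment iteration in the spirit of [12]; the simplified objective (edge compatibilities folded into a structural term via the adjacency matrices), the parameter values, and the omission of slack rows/columns for unequal graph sizes are all our own simplifications.

```python
import numpy as np

def graduated_assignment(C_node, A1, A2, beta0=0.5, beta_max=10.0,
                         rate=1.075, norm_iters=30):
    """Sketch of graduated assignment for graphs with adjacency matrices
    A1 (n1 x n1), A2 (n2 x n2) and node compatibilities C_node (n1 x n2)."""
    n1, n2 = C_node.shape
    M = np.full((n1, n2), 1.0 / n2)               # soft match matrix
    beta = beta0
    while beta < beta_max:
        Q = C_node + A1 @ M @ A2.T                # gradient of the objective
        M = np.exp(beta * (Q - Q.max()))          # softassign (stabilized)
        for _ in range(norm_iters):               # alternating row/column
            M /= M.sum(axis=1, keepdims=True)     # normalization (Sinkhorn)
            M /= M.sum(axis=0, keepdims=True)
        beta *= rate                              # anneal towards hard match
    return M
```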

Classification procedure: once the random graphs have been synthesized classwise, the dataset was embedded into a feature space by calculating the log-likelihood of the graphs being generated by the prototype random graphs. In the feature space, various classifiers were trained on the training set and validated (by performance on the validation set or by cross-validation on the training set). The classifier exhibiting the best validation performance was used to classify the test data. Extensive experimentation indicated that support vector machines with polynomial/Gaussian kernels yielded the best performance. All classification experiments were done using the PyML software [13].
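The classification step can be reproduced with any standard SVM implementation; the sketch below uses scikit-learn as a stand-in for PyML, with an illustrative parameter grid and toy features replacing the actual likelihood embeddings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy stand-ins for the log-likelihood embeddings of train/test graphs
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2)); y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 2));   y_test = (X_test[:, 0] > 0).astype(int)

# polynomial/Gaussian kernels, model selected by cross-validation
param_grid = {'kernel': ['poly', 'rbf'], 'C': [0.1, 1, 10, 100]}
clf = GridSearchCV(SVC(), param_grid, cv=5)
clf.fit(X_train, y_train)
print(clf.best_params_, clf.score(X_test, y_test))
```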

4.2 Synthetic datasets

We first analyzed the performance of this algorithm on synthetic datasets. We consider a dataset consisting of 200 graphs in the training and test sets, belonging to two classes. The dataset is generated by considering distortions of two base graphs (one per class) at different distortion levels. Node and edge attributes are generated according to a normal distribution. Noise according to the specified distortion level is added, which modifies node and edge occurrences as well as their respective attributes. The nodes are then randomly permuted. The dataset is then divided uniformly into training and test sets. The classification scheme described here is referred to as RAG + LF (Random Attributed Graph model + Likelihood as Feature). The standard nearest neighbour algorithm (kNN) in the graph domain is chosen as the benchmark classifier.
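A sketch of the distortion procedure described above, under our own assumptions about how the noise level maps to deletion probabilities and attribute perturbations:

```python
import numpy as np

def distort(base_nodes, base_edges, level, rng):
    """One noisy sample from a base graph: attributes are perturbed, nodes
    and edges are dropped, and node identities are randomly permuted.
    base_nodes: {v: attribute_vector}; base_edges: {(u, v): attribute_vector}
    """
    nodes = {v: a + rng.normal(scale=level, size=a.shape)
             for v, a in base_nodes.items() if rng.random() > level / 2}
    edges = {(u, v): a + rng.normal(scale=level, size=a.shape)
             for (u, v), a in base_edges.items()
             if u in nodes and v in nodes and rng.random() > level / 2}
    perm = {v: i for i, v in enumerate(rng.permutation(list(nodes)))}
    return ({perm[v]: a for v, a in nodes.items()},
            {(perm[u], perm[v]): a for (u, v), a in edges.items()})
```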

Figure 2: Classifier ROC plots for different distortion levels

The classifiers are evaluated on the basis of the area under the ROC curve (AUC) [14] (Figure 2). The classification rates of kNN compared with the proposed algorithm are shown in Table 1. As can be seen, for low values of distortion, both families of classifiers give near-ideal performance. For higher noise levels, the RAG + LF algorithm achieves higher robustness to noise than kNN.

Distortion level (increasing left to right)
RAG + LF    97    97    81    74
kNN         95    84    72    56
Table 1: Classification rates (%) on the synthetic datasets

4.3 Experiments on IAM graph database repository

A set of experiments was conducted on two standard datasets from the IAM graph dataset repository [15]. A brief description of the datasets is reproduced below.

Dataset          train, val, test    Classes    max nodes    avg edges
Letter (HIGH)    750, 750, 750       15         8            3.1
Fingerprint      500, 300, 2000      4          26           4.42
Table 2: Summary of the main characteristics of the data sets

In order to examine the performance of the proposed approach on a two-class problem consisting of patterns from morphologically distinct classes, a reduced dataset called Fingerprint (AW) was created, consisting of patterns belonging only to the classes arch and whorl.

4.4 Results and discussion

The state-of-the-art techniques chosen are kNN (chosen as the reference system) [15], embedding based on similarity kernels (SK + SVM), embedding based on Lipschitz embedding (LE + SVM) [16], and structurally-described random graphs (SDRG) [3]. The approach proposed here is referred to as RAG + LF (Random Attributed Graph model + Likelihood as Feature). RAG + ML denotes the method where a graph pattern is assigned to the class of the random prototype graph which has the maximum likelihood of having generated it. SK + SVM and LE + SVM each refer to a family of related classifiers out of which the best-performing model is chosen; hence, the comparison is biased in their favour.

Method      Letter (HIGH)    Fingerprint (AW)    Fingerprint
kNN         82               91.8                76.6
SK + SVM    79.1             -                   41
LE + SVM    92.5             -                   82.8
SDRG        64.3             -                   -
RAG + ML    67.2             87.5                61.1
RAG + LF    75.7             95.9                78.2

Table 3: Classification rates (%) on the IAM graph datasets. The improvements of RAG + LF over RAG + ML and over SDRG are statistically significant at significance level 0.05.

The following observations are made: the results compare well for the Fingerprint dataset overall, and on the Letter (HIGH) dataset the proposed approach compares well with SK + SVM and is superior to SDRG. Although kNN yields good results overall, it faces the computationally challenging task of choosing k. For SK + SVM and LE + SVM, the task of choosing an effective prototype set and calculating the graph-edit distance between the dataset and the prototype set is expensive as well, and offers no analytical insight. The approach presented here is fast, as it estimates the parameters of the random graph model analytically and needs far fewer graph matching operations, corresponding to generating only one prototype model per class. The prototypes also give a good summary of the node and edge occurrence probabilities in the dataset and of the probability distributions of their attributes. Embedding the dataset in the space spanned by likelihood values offers statistically significant improvement with almost no loss of speed, as fast packages exist for SVMs and other classification algorithms.

5 Conclusions

This work builds upon the notion of random graph models with applications in structural pattern recognition, with the following contributions: under independence assumptions, a random attributed graph is represented as a joint random variable over its node and edge occurrences and their respective attribute values; an analytical method, based on a maximum likelihood procedure, is presented for estimating the probability distributions of a random graph model serving as a prototype for an ensemble of attributed graphs; and the utility of the random graph as a prototype is demonstrated by using the likelihood of an outcome graph as a feature for classification. The proposed approach is suited to contexts involving large numbers of graph data samples, as the determination of the random prototype graph is a density estimation problem. It is robust to noise and faster than other approaches on account of the smaller number of graph matching operations that need to be performed.

There are several possible extensions to this approach. First, a method to derive a class of probabilistic clustering and classification algorithms is currently being investigated; here the random prototype graph is learned from the dataset in a procedure akin to a standard quantization-type scheme. Second, is there a way to tie the classifiers in the feature space directly to the learning of the prototypes? To elaborate, it is important to investigate the link between the type/family of classifiers on the likelihood-induced feature space and how the random prototypes are estimated/learned. This would help to integrate probabilistic learning in the domain of graphs with discriminative methods for classification in the subsequent likelihood space. Lastly, the foundations of the random graph definitions need to be explored: although node and edge independence is useful in that it allows an easy analytical estimation of the model parameters, it is too strong an assumption. Is there a way to model dependencies between nodes and edges and their attributes (node/edge co-occurrences)? Such a model would help enormously in probabilistic sub-structure analysis methods and could also yield superior classification and clustering algorithms.

References

  • [1] Wong, A.K.C., You, M.: Entropy and distance of random graphs with application to structural pattern recognition. IEEE Trans. PAMI.,Vol-7, No. 5, (1985) 599-609
  • [2] Wong, A.K.C., Ghahraman, D.E.: Random graphs: Structural-contextual dichotomy. IEEE Trans. PAMI.,Vol-2, No. 4, (1980) 341-348
  • [3] Sole-Ribalta, A., Serratosa, F.: A structural and semantic probabilistic model for matching and representing a set of graphs. GbRPR 2009, LNCS 5534, (2009), 164-173
  • [4] Seong, D.S., Kim, H.S., Park, K.H.: Incremental clustering of attributed graphs. IEEE Trans. Sys., Man, Cyb., Vol-23, No.5 (1993) 1399-1410
  • [5] Sengupta, K., Boyer, K.L.: Organizing large structural modelbases. IEEE Trans. PAMI., Vol-17, No. 4, (1995), 321-332
  • [6] Amari, S.-I.: Natural gradient works efficiently in learning. Neural Comput., Vol.10, (1998), 251-276
  • [7] Honkela, A., Tornio, M., Raiko, T., Karhunen, J.: Natural Conjugate Gradient in Variational Inference. Neural Information Processing, (2008), Springer, 305-314
  • [8] Fischer, A., Riesen, K., Bunke, H.: An experimental study of graph classification using prototype selection. ICPR, IEEE, (2008), 1-4
  • [9] Hollmen, J., Tresp, V., Simula, O.: A self-organizing map algorithm for clustering probabilistic models. ICANN'99, IEE, Vol. 2, (1999), 946-951

  • [10] Duda, R.O., Hart, P.E., Stork, D.G.: Pattern classification, Wiley-Interscience, (2000)
  • [11] Jaakkola, T., Haussler, D.: Exploiting generative models in discriminative classifiers. Advances in Neural Information Processing Systems, (1998), 487-493
  • [12] Gold, S., Rangarajan, A.: A graduated assignment algorithm for graph matching. IEEE Trans. PAMI, Vol. 18, No. 4, (1996), 377-388
  • [13] Ben-Hur, A.: PyML - A Python Machine Learning package, (2008)

  • [14] Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett., Vol. 27, No. 8, (2006), 861-874
  • [15] Riesen, K., Bunke, H.: IAM graph database repository for graph based pattern recognition and machine learning. SSPR+SPR '08, Springer, (2008), 287-297
  • [16] Riesen, K., Bunke, H.: Graph classification by means of Lipschitz embedding. IEEE Trans. Sys. Man Cyber. Part B, Vol. 39, No. 6, (2009), 1472-1483