1 Introduction
Nowadays, with the increasing volume of data generated, for instance, by the internet and social networks, there is a need for efficient ways to extract useful predictions from these data. Numerous data mining, machine learning and pattern recognition algorithms have been developed for predicting information from a labeled database. These data can take several different forms and, in that case, it is useful to exploit these alternative views in the prediction model. In this paper, we focus on supervised classification using both regular tabular data and structural information coming from graphs or networks.
Many different approaches have been developed for information fusion in machine learning, pattern recognition and applied statistics, such as simple weighted averages (see, e.g., Cooke1991 ; Jacobs1995 ), Bayesian fusion (see, e.g., Cooke1991 ; Jacobs1995 ), majority vote (see, e.g., Chen2001 ; Kittler2003 ; Lad1996 ), models coming from uncertainty reasoning such as fuzzy logic and possibility theory Klir1988 (see, e.g., Dubois1999 ), standard multivariate statistical analysis techniques such as correspondence analysis Merz1999 , and maximum entropy modeling (see, e.g., Levy1994 ; Myung1996 ; FoussMaxEnt2004 ).
As is well known, the goal of classification is to automatically assign data to predefined classes. This is also called supervised learning since it uses known labels (the desired prediction for an instance) to fit the classification model. One alternative is to use semi-supervised learning instead Zhu2009 ; Chapelle2006 ; Hill2006 ; Macskassy07 . Indeed, traditional pattern recognition, machine learning or data mining classification methods require large amounts of labeled training instances – which are often difficult to obtain – to fit accurate models. Semi-supervised learning methods can reduce this effort by including unlabeled samples. The name comes from the fact that the dataset is a mixture of supervised and unsupervised data (it contains training samples that are unlabeled), and the classifier takes advantage of both. The benefit is that unlabeled data are often much less costly to obtain than labeled data, so this technique reduces the number of labeled instances needed to achieve a given level of classification accuracy Zhu2009 ; Chapelle2006 ; Hill2006 ; Macskassy07 . In other words, exploiting the distribution of unlabeled data during the model fitting process can prove helpful.
Semi-supervised classification comes in two different settings: inductive and transductive Zhu2009 . The goal of the former is to predict the labels of future test data, unknown when fitting the model, while the goal of the latter is to classify (only) the unlabeled instances of the training sample. Some often-used semi-supervised algorithms include expectation-maximization with generative mixture models, self-training, co-training, transductive support vector machines, and graph-based methods Daudin08 ; Joachims99 ; Zhu2008 .
The structure of the data can also be of different types. This paper focuses on a particular data structure: we assume that our dataset takes the form of a network with features associated with the nodes. Nodes are the samples of our dataset and links between these nodes represent a given type of relation between these samples. For each node, a number of features or attributes characterizing it are also available (see Figure 1 for an example). Other data structures exist but are not studied in this paper; for instance:

Different types of nodes can be present, with different types of feature sets describing them.

Different types of relations can link the different nodes.
This problem has numerous applications such as classification of individuals in social networks, categorization of linked documents (e.g. patents or scientific papers), or protein function prediction, to name a few. In this kind of application (as in many others), unlabeled data are usually available in large quantities and are easy to collect: friendship links can be recorded on Facebook, text documents can be crawled from the internet and DNA sequences of proteins are readily available from gene databases.
In this work, we experimentally investigate various models combining information on the nodes of a graph with the graph structure. Indeed, it has been shown that network information significantly improves prediction accuracy in a number of contexts Hill2006 ; Macskassy07 . To this end, 14 classification algorithms using various combinations of data sources, mainly described in Fouss2016 , are compared. The different considered algorithms are described in Sections 4.1, 4.2 and 4.3. A standard support vector machine (SVM) classifier is used as a baseline algorithm, but we also investigated the ridge logistic regression classifier. The results and conclusions obtained with this second model were similar to those of the SVM and are therefore not reported in this paper.
In short, the main questions investigated in this work are:

Does the combination of features on nodes and network structure work better than using the features only?

Does the combination of features on nodes and network structure work better than using the graph structure only?

Which classifier performs best on network structure alone, without considering features on nodes?

Which classifier performs best when combining information, that is, using network structure with features on nodes?
Finally, this comparison leads to some general conclusions and advice for tackling classification problems on network data with node features.
In summary, this work has four main contributions:

The paper reviews different algorithms used for learning from both a graph structure and node features. Some algorithms are inductive while others are transductive.

An empirical comparison of these algorithms is performed on ten real-world datasets.

It investigates the effect of extracting features from the graph structure (based on some well-known indicators from spatial statistics) in a classification context.

Finally, this comparison is used to draw general conclusions and advice for tackling graph-based classification tasks.
The remainder of this paper is organized as follows. Section 2 provides some background and notation. Section 3 discusses related work. Section 4 introduces the investigated classification methods. Then, Section 5 presents the experimental methodology and the results. Finally, Section 6 concludes the paper.
2 Background and notation
This section introduces the necessary theoretical background and notation used in the paper. Consider a weighted, undirected, strongly connected graph or network G (with no self-loop) containing a set of n vertices (or nodes) and a set of edges (or arcs, links). The adjacency matrix of the graph, containing non-negative affinities between nodes, is denoted as A, with elements a_ij >= 0.
Moreover, to each edge between nodes i and j is associated a non-negative number c_ij. This number represents the immediate cost of transition from node i to node j. If there is no link between i and j, the cost is assumed to take a large value, denoted by infinity. The cost matrix C is an n × n matrix containing the c_ij as elements. Costs are set independently of the adjacency matrix – they quantify the cost of a transition according to the problem at hand. If there is no reason to introduce a cost, we simply set c_ij = 1 (paths are penalized by their length) or c_ij = 1/a_ij (in this case, a_ij is viewed as a conductance and c_ij as a resistance) – this last setting will be used in the experimental section.
We also introduce the Laplacian matrix L of the graph, defined in the usual manner:
(1)  L = D − A
where D = Diag(Ae) is the diagonal (out)degree matrix of the graph, containing the degrees d_i = Σ_j a_ij on its diagonal, and e is a column vector full of ones. One of the properties of L is that its eigenvalues provide useful information about the connectivity of the graph Chung1997 . The smallest eigenvalue of L is always equal to 0, and the second smallest one is equal to 0 only if the graph is composed of at least two connected components. This second smallest eigenvalue is called the algebraic connectivity.
Moreover, a natural random walk on G is defined in the standard way. In node i, the random walker chooses the next edge to follow according to the transition probabilities
(2)  p_ij = a_ij / Σ_{k ∈ Succ(i)} a_ik
representing the probability of jumping from node i to node j ∈ Succ(i), the set of successor nodes of i. The corresponding transition probabilities matrix will be denoted as P and is stochastic. Thus, the random walker chooses to follow an edge with a likelihood proportional to its affinity (apart from the sum-to-one normalization), therefore favoring edges with a large affinity.
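As an illustration, the Laplacian of Equation (1), the algebraic connectivity, and the transition matrix of Equation (2) can be computed in a few lines; this is a minimal numpy sketch on a toy adjacency matrix chosen purely for illustration:

```python
import numpy as np

# Toy symmetric adjacency matrix of a small undirected, connected graph
# (illustrative values; no self-loops).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

e = np.ones(A.shape[0])           # column vector full of ones
D = np.diag(A @ e)                # diagonal degree matrix
L = D - A                         # Laplacian matrix (Equation (1))

# Transition matrix of the natural random walk (Equation (2)):
# p_ij = a_ij / sum_k a_ik, so that P is row-stochastic.
P = A / (A @ e)[:, None]

eigenvalues = np.sort(np.linalg.eigvalsh(L))
algebraic_connectivity = eigenvalues[1]   # > 0 iff the graph is connected
```

Since this toy graph is connected, the smallest Laplacian eigenvalue is 0 and the algebraic connectivity is strictly positive.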
Moreover, we will consider that each of the n nodes of G has the same set of features, or attributes, with no missing values. The column vector x_i contains the values of the features of node i, and x_ij stands for the value of feature j taken by node i. Moreover, X will refer to the data matrix containing the x_i on its rows.
Finally, we define y as the column vector containing the class labels of the nodes. Moreover, y^c is a binary vector indicating whether or not each node belongs to class number c.
Recall that the purpose of the classification task will be to predict the class of the unlabeled data (in a transductive setting), or of new test data (in an inductive setting), knowing the values of the features for all the nodes of G and the class labels of the labeled nodes only, for each class c. Our baseline classifier based on features only will be a linear support vector machine (SVM).
3 Some related work
The 14 investigated models are presented in Section 4. In addition to those models, other approaches exist.
For example, Belkin2006 ; He2010 use a standard ridge regression model complemented by a Laplacian regularization term, called the Laplacian regularized least squares. This option was investigated but provided poor results compared to the reported models (and is therefore not reported). Note that using a logistic ridge regression as the base classifier was also investigated in this work, but results are not reported here for conciseness as it provided performances similar to SVMs.
Laplacian support vector machines (LapSVMs) extend the SVM classifier in order to take the structure of the network into account. They exploit both the information on the nodes and the graph structure in order to categorize the nodes through the Laplacian matrix (see Section 2). To this end, Belkin2006 proposed to add a graph Laplacian regularization term to the traditional SVM cost function in order to obtain a semi-supervised version of this model. A Matlab toolbox for this model is available, but it provided poor results in terms of performance and tractability.
Chakrabarti et al. developed, in the context of patent classification Chakrabarti1998 , a naive Bayes model in the presence of structural autocorrelation. The main idea is to use a naive Bayes classifier (see, for example, Bishop95 ; Hastie2009 ; Theodoridis03 ) combining both feature information on the nodes and structural information by making some independence assumptions. More precisely, it is assumed that the label of a node is influenced by two sources: the features of the node and the labels of the neighboring nodes (and does not depend on other information). The model first considers that the labels of all neighboring nodes are known and then relaxes this constraint by using a kind of relaxation labeling (see Chakrabarti1998 for details). However, we found that this procedure is very time-consuming, even for small-size networks, and decided not to include it in the present work.
Other semi-supervised classifiers based on network data only (features on nodes are not available) were also developed Macskassy07 ; Zhu2008 . The interested reader is invited to consult, e.g., MOI ; Abney2008 ; Chapelle2006 ; Fouss2016 ; Hofmann2008 ; Silva2016 ; Subramanya2014 ; Zhu2008 ; Zhu2009b (and included references) for a comprehensive description of this topic. Finally, an interesting survey and a comparative experiment of related methods can be found in Macskassy07 .
4 Survey of relevant classification methods
The different classification methods compared in this work are briefly presented in this section, which is largely inspired by Fouss2016 . For a more thorough presentation, see the provided references to the original work or Fouss2016 . The classification models are sorted into different families: graph embedding-based classifiers, extensions of feature-based classifiers, and graph-based classifiers.
4.1 Graph embedding-based classifiers
A first interesting way to combine information from the features on the nodes and from the graph structure is to perform a graph embedding, projecting the nodes of the graph into a low-dimensional space (an embedding space) preserving as much as possible of its structural information, and then to use the coordinates of the projected nodes as additional features in a standard classification model, such as a logistic regression or a support vector machine.
This procedure has been proposed in the field of spatial statistics for ecological modeling Borcard2002 ; Dray2006 ; Meot1993 , but also more recently in data mining Tang2009 ; Tang2009b ; Tang2010 ; Zhang2008 ; Zhang2008b . While many graph embedding techniques could be used, Dray2006 suggests exploiting Moran's I or Geary's c index of spatial autocorrelation in order to compute the embedding.
Let us briefly develop their approach (see Fouss2016 ). Moran's I and Geary's c (see, e.g., Haining2003 ; Pfeiffer2008 ; Waldhor2006 ; Waller2004 ) are two coefficients commonly used in spatial statistics in order to test the hypothesis of spatial autocorrelation of a continuous measure defined on the nodes. Four possibilities will be investigated to extract features from the graph structure: maximizing Moran's I, minimizing Geary's c, local principal component analysis, and maximizing the bag-of-paths (BoP) modularity.
4.1.1 Maximizing Moran’s
Moran’s is given by
(3) 
where and are the values observed on nodes and respectively, for a considered quantity measured on the nodes. The column vector is the vector containing the values and is the average value of . Then, is simply the sum of all entries of – the volume of the graph.
can be interpreted as a correlation coefficient similar to the Pearson correlation coefficient Haining2003 ; Pfeiffer2008 ; Waldhor2006 ; Waller2004 . The numerator is a measure of covariance among the neighboring in
, while the denominator is a measure of variance.
Moran's I lies (approximately) in the interval [−1, 1]. A value close to zero indicates no evidence of autocorrelation, a positive value indicates positive autocorrelation and a negative value indicates negative autocorrelation. Autocorrelation means that close nodes tend to take similar values.
In matrix form, Equation (3) can be rewritten as
(4)  I(x) = (n / vol(G)) · (xᵀ H A H x) / (xᵀ H x)
where H = I − eeᵀ/n is the centering matrix Mardia1978 and eeᵀ is a matrix full of ones. Note that the centering matrix is idempotent, H² = H.
The objective is now to find the score vector x that achieves the largest autocorrelation, as defined by Moran's index. This corresponds to the values that best explain the structure of A. It can be obtained by setting the gradient equal to zero; we then obtain the following eigensystem:
(5)  H A H w = λ w
The idea is thus to extract the first eigenvector w_1 of the centered adjacency matrix in (5), corresponding to the largest eigenvalue, and then to compute the second-largest eigenvector w_2, orthogonal to w_1, etc. The eigenvalues are proportional to the corresponding Moran's I. The largest centered eigenvectors of (5) are thus extracted and then used as additional features for a supervised classification model (here an SVM). In other words, the matrix W of extracted eigenvectors is a new data matrix, capturing the structural information of A, that can be concatenated to the feature-based data matrix X, forming the extended data matrix [X, W].
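The embedding step above can be sketched as follows; this is a minimal numpy illustration where the toy adjacency matrix, the toy features and the number of extracted eigenvectors are arbitrary assumptions:

```python
import numpy as np

def moran_embedding(A, p):
    """p dominant eigenvectors of the doubly centered adjacency matrix
    H A H of Equation (5), sorted by decreasing eigenvalue (i.e., by
    decreasing Moran's I)."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n                    # centering matrix
    eigenvalues, eigenvectors = np.linalg.eigh(H @ A @ H)  # ascending order
    order = np.argsort(eigenvalues)[::-1]                  # largest first
    return eigenvectors[:, order[:p]]

# Illustrative usage: concatenate the embedding to a feature matrix X.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
X = np.arange(8.).reshape(4, 2)     # toy node features
W = moran_embedding(A, p=2)
X_extended = np.hstack([X, W])      # extended data matrix [X, W]
```

An SVM (or any standard classifier) can then be fitted on the rows of the extended data matrix.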
4.1.2 Minimizing Geary’s
On the other hand, Geary’s
is another weighted estimate of partial autocorrelation given by
(6) 
and is related to Moran’s . However, while Moran’s considers a covariance between neighboring nodes, Geary’s considers distances between pairs of neighboring nodes. It ranges from 0 to 2 with 0 indicating perfect positive autocorrelation and 2 indicating perfect negative autocorrelation Meot1993 ; Pfeiffer2008 ; Waller2004 .
In matrix form, Geary’s can be written as
(7) 
This time, the objective is to find the score vector minimizing Geary’s . By proceeding as for Moran’s , we find that minimizing aims to compute the lowest nontrivial eigenvectors of the Laplacian matrix:
(8) 
and then use these eigenvectors as additional
features in a classification model. We therefore end up with the problem of computing the lowest eigenvectors of the Laplacian matrix, which also appears in spectral clustering (ratio cut, see, e.g.,
Luxburg2007 ; Fouss2016 ; Newman2010 ).Geary’s has a computational advantage over Moran’s : the Laplacian matrix is usually sparse, which is not the case for Moran’s . Moreover, note that since the Laplacian matrix L is centered, any solution of is also a solution of Equation (8).
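The Geary-based embedding admits an analogous sketch, extracting the smallest non-trivial Laplacian eigenvectors; again a minimal numpy illustration on a toy connected graph:

```python
import numpy as np

def geary_embedding(A, p):
    """p eigenvectors of the Laplacian L = D - A associated with the
    smallest non-trivial eigenvalues (Equation (8)); the constant
    eigenvector, with eigenvalue 0, is skipped."""
    L = np.diag(A.sum(axis=1)) - A
    eigenvalues, eigenvectors = np.linalg.eigh(L)   # ascending order
    return eigenvectors[:, 1:p + 1]                 # drop the trivial solution

# Toy connected graph: the extracted columns serve as structural features.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
W = geary_embedding(A, p=2)
```

In practice a sparse eigensolver would be used, since L is usually sparse; the dense call above keeps the sketch minimal.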
4.1.3 Local principal component analysis
In Benali1990 ; Lebart2000 , the authors propose to use a measure of local, structural association between nodes. The contiguity ratio is defined as
(9)  cr(x) = [ (1/n) Σ_i (x_i − x̃_i)² ] / [ (1/n) Σ_i (x_i − x̄)² ]
where x̃_i = Σ_{j ∈ N(i)} a_ij x_j / Σ_{j ∈ N(i)} a_ij is the average value observed on the neighbors N(i) of node i. As for Geary's index, the value is close to zero when there is a strong structural association. However, there are no clear bounds indicating no structural association or negative correlation Lebart2000 .
The numerator of Equation (9) is the mean squared difference between the value on a node and the average of its neighboring values; it is called the local variance in Lebart2000 . The denominator is the standard sample variance. In matrix form,
(10)  cr(x) = (x − Px)ᵀ (x − Px) / (xᵀ H x)
where P is the transition matrix defined in Section 2.
Proceeding as for Geary's and Moran's indexes, minimizing cr amounts to solving the generalized eigensystem
(11)  (I − P)ᵀ (I − P) w = λ H w
Here again, the eigenvectors corresponding to the smallest non-trivial eigenvalues of the eigensystem (11) are extracted. This procedure is also referred to as local principal component analysis in Lebart2000 .
4.1.4 Bag-of-paths modularity
For this algorithm, we also compute a number of structural features, but now derived from the modularity measure computed in the bag-of-paths (BoP) framework Robin2014 , and concatenate them to the node features X. Again, an SVM is then used to classify all unlabeled nodes. Indeed, it has been shown that using the dominant eigenvectors of the BoP modularity matrix provides better performances than using the eigenvectors of the standard modularity matrix. The results for the standard modularity matrix are therefore not reported here.
It can be shown (see Robin2014 for details) that the BoP modularity matrix is equal to
(12) 
where Z is the fundamental bag-of-paths matrix and e is a length-n column vector full of ones. Then, as for Moran's I and Geary's c, an eigensystem
(13) 
must be solved, and the largest eigenvectors are used as new, additional structural features.
4.2 Extensions of standard feature-based classifiers
These techniques rely on extensions of standard feature-based classifiers (for instance, a logistic regression model or a support vector machine). The extension is designed to take the network structure into account.
4.2.1 The auto-SVM: taking autocovariates into account
This model is also known as the autologistic or autologit model Bezag1972 ; Augustin1996 ; Augustin1998 ; Lu2003 , and is frequently used in the spatial statistics and biostatistics fields.
Note that, as an SVM is used as the base classifier in this work (see Section 1), we adapted this model (instead of the logistic regression of Augustin1998 ) in order to take the graph structure into account. The method is based on the quantity ac_i^c = Σ_{j ∈ N(i)} a_ij ŷ_j^c / Σ_{j ∈ N(i)} a_ij, where ŷ_j^c is the predicted membership of node j to class c, called the autocovariate in Augustin1998 (other forms are possible, see Augustin1996 ; Augustin1998 ). It corresponds to the weighted average membership to class c within the neighborhood N(i) of node i: it indicates to which extent the neighbors of i belong to class c. The assumption is that node i has a higher chance of belonging to class c if its neighbors also belong to that class. AC will denote the matrix containing the ac_i^c for all nodes i and classes c.
However, since the predicted value for a node depends on the predicted values of the other nodes, building the model is not straightforward. For the autologistic model, this is done through the maximization of the (pseudo-)likelihood (see, for example, Pawitan2001 ; Bezag1972 ), but we will consider another alternative Augustin1998 which uses a kind of expectation-maximization-like heuristic (EM, see, e.g., Dempster77 ; Mclachlan2008 ) and is easy to adapt to our SVM classifier.
Here is a summary (see Fouss2016 ) of the estimation procedure proposed in Augustin1998 :

Initially, set the predicted class memberships of the unlabeled nodes with a standard SVM depending on the feature vectors only, thus disregarding the structural information (the information about neighbors' labels). For the labeled nodes, the membership values are not modified and are thus set to the true, observed memberships.

Compute the current values of the autocovariates for all nodes.

Train a so-called auto-SVM model based on these current autocovariate values as well as the features on the nodes. This provides the parameter estimates.

Compute the predicted class memberships of the set of unlabeled nodes from the fitted auto-SVM model. This is done by sequentially selecting each unlabeled node in turn and applying the fitted auto-SVM model obtained in step 3, based on the autocovariates of step 2. After having considered all the nodes, we obtain the new predicted membership values.

Steps 2 to 4 are iterated until convergence of the predicted membership values.
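The estimation loop above can be sketched as follows. To keep the sketch self-contained, a regularized least-squares score is used as a stand-in for the SVM of step 3 (an assumption; the toy graph, features and labeling are also purely illustrative):

```python
import numpy as np

def autocovariates(A, Y_hat):
    """ac_ic = sum_j a_ij * yhat_jc / sum_j a_ij: weighted average
    membership of node i's neighbors to each class c."""
    return (A @ Y_hat) / A.sum(axis=1, keepdims=True)

def ridge_fit(Z, Y, lam=1e-2):
    """Regularized least-squares stand-in for the SVM (an assumption made
    to keep the sketch self-contained; the paper fits an SVM here)."""
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ Y)

def auto_classify(A, X, Y, labeled, n_iter=50):
    """EM-like estimation heuristic of Section 4.2.1."""
    n, c = Y.shape
    unlabeled = ~labeled
    Y_hat = Y.astype(float).copy()
    # Step 1: initialize unlabeled memberships from the features only.
    W = ridge_fit(X[labeled], Y[labeled])
    Y_hat[unlabeled] = np.eye(c)[(X[unlabeled] @ W).argmax(axis=1)]
    for _ in range(n_iter):
        AC = autocovariates(A, Y_hat)           # step 2: autocovariates
        Z = np.hstack([X, AC])                  # features + autocovariates
        W = ridge_fit(Z[labeled], Y[labeled])   # step 3: refit the model
        new = np.eye(c)[(Z[unlabeled] @ W).argmax(axis=1)]   # step 4
        if np.array_equal(new, Y_hat[unlabeled]):            # step 5
            break
        Y_hat[unlabeled] = new
    return Y_hat

# Toy example: two fully connected clusters, one labeled node per class.
A = np.kron(np.eye(2), np.ones((3, 3))) - np.eye(6)
X = np.array([[1., 0.], [.9, .1], [1.1, 0.], [0., 1.], [.1, .9], [0., 1.1]])
labeled = np.array([True, False, False, True, False, False])
Y = np.zeros((6, 2)); Y[0, 0] = Y[3, 1] = 1.
predicted = auto_classify(A, X, Y, labeled)
```

On this toy example, the features and the neighborhood memberships agree, so the procedure converges in a single pass.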
4.2.2 Double-kernel SVM
Here, we describe another simple way of combining the information coming from the features on nodes and the graph structure. The basic idea (Roth2001 ; Fouss2016 ) is to:

Compute a kernel matrix K_X based on the node features Scholkopf2002 ; ShaweTaylor2004 , for instance a linear kernel or a Gaussian kernel.

Compute a kernel matrix K_G on the graph FoussKernelNN2012 ; Fouss2016 ; Gaertner2008 ; ShaweTaylor2004 , for instance the regularized commute-time kernel (see Subsection 5.2.2).

Fit an SVM based on these two combined kernels.
Then, by using the kernel trick, everything happens as if the new data matrix were
(14)  X̃ = [K_X, K_G]
where K_G is a kernel on the graph and K_X is the kernel matrix associated with the features on the nodes (see Fouss2016 for details). We can then fit an SVM classifier based on this new data matrix and the class labels of the labeled nodes.
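A minimal sketch of the construction of the extended data matrix, assuming a linear kernel on the features and the regularized commute-time kernel on the graph (the toy matrices and the value alpha = 0.8 are illustrative assumptions):

```python
import numpy as np

def double_kernel_matrix(X, K_G):
    """Extended data matrix of Equation (14): the kernel on the node
    features concatenated with a kernel on the graph, so that a linear
    SVM fitted on its rows uses both sources of information."""
    K_X = X @ X.T                    # linear kernel on the node features
    return np.hstack([K_X, K_G])

# Illustrative usage with the regularized commute-time kernel as K_G.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
D = np.diag(A.sum(axis=1))
K_G = np.linalg.inv(D - 0.8 * A)     # graph kernel (alpha = 0.8, assumed)
X = np.array([[1., 0.], [0., 1.], [1., 1.]])
X_tilde = double_kernel_matrix(X, K_G)   # n rows, 2n columns
```

Each row of the extended matrix describes one node by its similarities to all nodes, both in feature space and on the graph.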
4.2.3 A spatial autoregressive model
This model is a spatial extension of a standard regression model LeSage2009 and is well known in spatial econometrics. This extended model assumes that the vector of class memberships y^c is generated, for each class c, according to
(15)  y^c = ρ W y^c + X β + ε
where β is the usual parameter vector, ρ is a scalar parameter introduced to account for the structural dependency through a (typically row-normalized) adjacency matrix W, and ε is an error term. Obviously, if ρ is equal to zero, there is no structural dependency and the model reduces to a standard linear regression model. LeSage's Econometrics Matlab toolbox was used for the implementation of this model LeSage2009 ; see this reference for more information.
4.3 Graph-based classifiers
We also investigate some semi-supervised methods based on the graph structure only (no node features exist, or the features are simply not taken into account). We selected the techniques performing best in a series of experimental comparisons FoussKernelNN2012 ; MOI ; Mantrach2011 . They rely on a strong assumption about the distribution of labels: neighboring nodes (or "close nodes") are likely to share the same class label Chapelle2006 .
4.3.1 The bag-of-paths group betweenness
This model MOI considers a bag containing all the possible paths between pairs of nodes in G. Then, a Boltzmann distribution, depending on a temperature parameter T, is defined on the set of paths such that long (high-cost) paths have a low probability of being picked from the bag, while short (low-cost) paths have a high probability of being picked. The bag-of-paths (BoP) probabilities, providing the probability of drawing a path starting in node i and ending in node j, can be computed in closed form, and a betweenness measure quantifying to which extent a node lies in between two nodes is defined. A node receives a high betweenness if it has a large probability of appearing on paths connecting two arbitrary nodes of the network. A group betweenness between classes is defined as the sum of the contributions of all paths starting and ending in a particular class and passing through the considered node. Each unlabeled node is then classified into the class showing the highest group betweenness. More information can be found in MOI .
4.3.2 A sum-of-similarities based on the regularized commute-time kernel
We finally investigate a classification procedure based on a simple alignment with the regularized commute-time kernel K = (D − αA)⁻¹, a sum-of-similarities defined by K y^c Zhou2003 ; FoussKernelNN2012 ; Fouss2016 . This expression quantifies to which extent each node is close (in terms of the similarity provided by the regularized commute-time kernel) to class c. This similarity is computed for each class in turn; then, each node is assigned to the class showing the largest sum of similarities. Element (i, j) of this kernel can be interpreted as the discounted cumulated probability of visiting node j when starting from node i. The (scalar) parameter α ∈ (0, 1) corresponds to a killed random walk where the random walker has a (1 − α) probability of disappearing at each step. This method provided good results in a comparative study on graph-based semi-supervised classification FoussKernelNN2012 ; Mantrach2011 ; KEVIN .
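This classification rule can be sketched as follows; the two-cluster toy graph, the labeled nodes and alpha = 0.8 are illustrative assumptions:

```python
import numpy as np

def rct_classify(A, Y, alpha=0.8):
    """Sum-of-similarities with the regularized commute-time kernel
    K = (D - alpha * A)^(-1): column c of K @ Y sums the similarities of
    each node to the labeled nodes of class c; each node is assigned to
    the class with the largest sum."""
    D = np.diag(A.sum(axis=1))
    K = np.linalg.inv(D - alpha * A)
    return (K @ Y).argmax(axis=1)

# Toy example: two triangles joined by a weak bridge, one label per class
# (Y has a single 1 in the row of each labeled node, zero rows elsewhere).
A = np.kron(np.eye(2), np.ones((3, 3))) - np.eye(6)
A[2, 3] = A[3, 2] = 0.1
Y = np.zeros((6, 2)); Y[0, 0] = Y[3, 1] = 1.
predicted_classes = rct_classify(A, Y)
```

Within-cluster similarities dominate the weak bridge, so each unlabeled node is assigned to the class of its own cluster's labeled node.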
5 Experiments
In this section, the different classification methods are compared on semi-supervised classification tasks over several datasets. The goal is to classify the unlabeled nodes of partially labeled graphs and to compare the results obtained by the different methods in terms of classification accuracy.
This section is organized as follows. First, the datasets used for semisupervised classification are described in Subsection 5.1. Then, the compared methods are recalled in Subsection 5.2. The experimental methodology is explained in Subsection 5.3. Finally, results are presented and discussed in Subsection 5.4.
5.1 Datasets
Table 1: Class distribution for the four WebKB cocite datasets.

Class               Cornell (DB1)  Texas (DB2)  Washington (DB3)  Wisconsin (DB4)
Course              42             33           59                70
Faculty             32             30           25                32
Student             83             101          103               118
Project + staff     38             19           28                31
Total               195            183          230               251
Majority class (%)  42.6           55.2         44.8              47.0
Number of features  1704           1704         1704              1704
Table 2: Class distribution for the three Ego Facebook datasets.

Class               FB 107 (DB5)  FB 1684 (DB6)  FB 1912 (DB7)
Main group          737           568            524
Other groups        308           225            232
Total               1045          793            756
Majority class (%)  70.5          71.2           69.3
Number of features  576           319            480
Table 3: Class distribution for the Citeseer, Cora and Wikipedia datasets.

Class               Citeseer (DB8)  Cora (DB9)  Wikipedia (DB10)
Class 1             269             285         248
Class 2             455             406         509
Class 3             300             726         194
Class 4             75              379         99
Class 5             78              214         152
Class 6             188             131         409
Class 7             –               344         181
Class 8             –               –           128
Class 9             –               –           364
Class 10            –               –           351
Class 11            –               –           194
Class 12            –               –           81
Class 13            –               –           233
Class 14            –               –           111
Total               1392            2708        3271
Majority class (%)  32.7            26.8        15.6
Number of features  3703            1434        4973
All datasets are described by (i) the adjacency matrix A of the underlying graph, (ii) a class vector y (to predict), and (iii) a number of features on the nodes, gathered in the data matrix X. Using a chi-square test, we kept only the 100 most significant variables for each dataset. The datasets are available at http://www.isys.ucl.ac.be/staff/lebichot/research.htm.
For each of these datasets, if more than one connected component is present, we only use the largest connected component, deleting all the other nodes together with their features and target classes. Also, we chose to work with undirected graphs for all datasets: if a graph is directed, we introduced reciprocal edges.
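The pre-processing step keeping only the largest connected component can be sketched with a standard breadth-first search; the dictionary-of-sets graph representation is an assumption made for illustration:

```python
from collections import deque

def largest_connected_component(adjacency):
    """Return the sorted node indices of the largest connected component
    of an undirected graph given as a dict {node: set_of_neighbors}."""
    seen, best = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = [], deque([start])
        seen.add(start)
        while queue:                       # breadth-first search
            u = queue.popleft()
            component.append(u)
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        if len(component) > len(best):
            best = component
    return sorted(best)

# Illustrative toy graph with three components; only {2, 3, 4} is kept.
graph = {0: {1}, 1: {0}, 2: {3, 4}, 3: {2, 4}, 4: {2, 3}, 5: set()}
kept_nodes = largest_connected_component(graph)
```

The rows of A, X and y corresponding to the returned indices are then retained, and all other nodes are dropped.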

WebKB cocite (DB1–DB4) senaimag08 . These four datasets consist of web pages gathered from the computer science departments of four universities (there are four datasets, one per university), each page being manually labeled into one of four categories: course, faculty, student and project Macskassy07 . The pages are linked by citations (if page a links to page b, it means that b is cited by a). Each web page in the dataset is also characterized by a binary word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1703 unique words (words appearing less than 10 times were ignored). Originally, a fifth category, staff, was present but, since it contained only very few instances, it was merged with the project class. Details on these datasets are shown in Table 1.

The three Ego Facebook datasets (DB5–DB7) McAuley2012 consist of circles (or friend communities) from Facebook. Facebook data were collected from survey participants using a Facebook application. The original dataset includes node features (profiles), circles, and ego networks for 10 networks. These data are anonymized. We kept the first three networks and defined the classification task as follows: we picked the majority circle (the target circle) and aggregated all the others (non-target circles). Details on these datasets are shown in Table 2. Each dataset has two classes.

The CiteSeer dataset (DB8) senaimag08 consists of 3312 scientific publications classified into six classes. The publications are linked by citations (if publication a links to publication b, it means that b is cited by a). Each publication in the dataset is described by a binary word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words (words appearing less than 10 times were ignored). The target variable is the domain of the publications (six topics, not reported here). Details on this dataset are shown in Table 3.

The Cora dataset (DB9) senaimag08 consists of 2708 scientific publications classified into one of seven classes denoting topics, as for the previous dataset. Publications are linked by citations (if publication a links to publication b, it means that b is cited by a). Each publication is also described by a binary word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1434 unique words or features (words appearing less than 10 times were ignored). The target variable is the topic of the publications. Details on this dataset are shown in Table 3.

The Wikipedia dataset (DB10) senaimag08 consists of 3271 Wikipedia articles that appeared in the featured list in the period Oct. 7–21, 2009. Each document belongs to one of 14 distinct categories (such as Science, Sport, Art, …), obtained from the category under which each article is listed. After stemming and stop-word removal, the content of each document is represented by a tf-idf-weighted feature vector, for a total of 4973 words. Pages are linked by citations (if article a links to article b, it means that b is cited by a). The target variable is the article's field (14 different fields, not reported here). Details on this dataset are shown in Table 3.
Moreover, in order to study the impact of the relative information provided by the graph structure and the features on the nodes, we created new derived datasets by gradually weakening the information provided by the node features. More precisely, for each dataset, the features available on the nodes were ranked by decreasing association (using a chi-square statistic) with the target classes to be predicted. Then, datasets with subsets of the features containing respectively the 5 (5F), 10 (10F), 25 (25F), 50 (50F) and 100 (100F) most informative features were created. These datasets are weakened versions of the original datasets, allowing us to investigate the respective roles of the features on nodes and of the graph structure. We also investigated sets with more features (200 and 400); the conclusions were the same, so they are not reported here for conciseness.
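The feature-weakening step can be sketched as follows; this is a minimal numpy illustration of chi-square ranking, using the classical statistic computed from class-wise feature sums (which assumes nonnegative features with a nonzero total, as is the case for binary word vectors):

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square association between each nonnegative feature and the
    class labels: class-wise feature sums (observed) are compared with
    the sums expected from the class frequencies alone."""
    classes = np.unique(y)
    G = (y[:, None] == classes[None, :]).astype(float)   # one-hot labels
    observed = G.T @ X                                   # classes x features
    expected = np.outer(G.mean(axis=0), X.sum(axis=0))
    return ((observed - expected) ** 2 / expected).sum(axis=0)

def keep_top_features(X, y, k):
    """Columns of X with the k largest chi-square scores, as used to build
    the weakened 5F/10F/25F/50F/100F dataset variants."""
    order = np.argsort(chi2_scores(X, y))[::-1]
    return X[:, order[:k]]

# Toy example: the first feature separates the classes, the second does not.
X = np.array([[1., 1.], [1., 1.], [0., 1.], [0., 1.]])
y = np.array([0, 0, 1, 1])
X_weak = keep_top_features(X, y, k=1)   # keeps only the informative column
```
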
5.2 Compared classification models
In this work, a transductive scheme is used, as we need to know the whole graph structure to label the unlabeled nodes. Fourteen different algorithms are compared; they can be grouped into three categories according to the information they use. Some algorithms use only the features to build the model (denoted as X, for the data matrix with features only), others use only the graph structure (denoted as A, for the adjacency matrix of the graph), and the third category uses both the graph structure and the node features (denoted as AX).
5.2.1 Using features on nodes only

This reduces to a standard classification problem and we use a linear support vector machine (SVM) based on the features of the nodes to label these nodes (SVMX). Here, we consider SVMs in the binary classification setting (i.e., $y \in \{-1, +1\}$). For multiclass problems, we used a one-vs-one strategy Hsu2002 . This classifier is used as a baseline. In practical terms, we use the well-known Liblinear library REF08a . Notice that the SVM follows an inductive scheme, unlike all other methods. Transductive SVMs Vapnik1998 were also considered, but their implementation was too time-consuming to be included in the present analysis.
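The one-vs-one decomposition Hsu2002 trains one binary classifier per pair of classes and lets them vote. A minimal numpy sketch of this voting scheme, with a simple least-squares linear classifier standing in for the linear SVM (the Liblinear library itself is not assumed available here):

```python
import numpy as np
from itertools import combinations

def train_ls(X, y_pm):
    """Least-squares linear classifier (a stand-in for the linear SVM).
    y_pm must be in {-1, +1}; returns weights for sign(X_aug @ w)."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])     # append a bias column
    w, *_ = np.linalg.lstsq(X_aug, y_pm, rcond=None)
    return w

def one_vs_one_predict(X_train, y_train, X_test):
    """One binary model per pair of classes; each test point gets one vote
    per pair and is assigned to the class collecting the most votes."""
    classes = np.unique(y_train)
    votes = np.zeros((len(X_test), len(classes)))
    X_test_aug = np.hstack([X_test, np.ones((len(X_test), 1))])
    for i, j in combinations(range(len(classes)), 2):
        mask = np.isin(y_train, [classes[i], classes[j]])
        y_pm = np.where(y_train[mask] == classes[i], 1.0, -1.0)
        score = X_test_aug @ train_ls(X_train[mask], y_pm)
        votes[score > 0, i] += 1
        votes[score <= 0, j] += 1
    return classes[np.argmax(votes, axis=1)]

# Three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in ([0, 0], [5, 0], [0, 5])])
y = np.repeat([0, 1, 2], 20)
pred = one_vs_one_predict(X, y, X)
```

With $m$ classes this trains $m(m-1)/2$ binary models, which is the scheme Liblinear-style linear SVMs commonly use for multiclass problems.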
5.2.2 Using graph structure only
Three different families of methods using graph structure only are investigated.

The bag-of-paths classifier based on the bag-of-paths group betweenness (BoPA). This betweenness is computed for each class in turn. Then, each unlabeled node is assigned to the class showing the largest value (see Section 4.3.1 for more details).

The sum-of-similarities method based on the regularized commute-time kernel (CTKA). The classification procedure is the same as for BoPA: the class similarity is computed for each class in turn and each unlabeled node is assigned to the class showing the largest similarity (see Section 4.3.2 for more details).

The four graph embedding techniques discussed in Section 4.1 are used together with an SVM, without considering any node feature. The SVM is trained using a given number of extracted eigenvectors derived from each measure (this number is a parameter to tune). The SVM model is then used to classify the unlabeled nodes.

SVM using the dominant eigenvectors derived from Moran's index; see Section 4.1.1 (SVMMA).

SVM using the dominant eigenvectors derived from Geary's index; see Section 4.1.2 (SVMGA).

SVM using the dominant eigenvectors extracted from local principal component analysis; see Section 4.1.3 (SVMLA).

SVM using the dominant eigenvectors extracted from the bag-of-paths modularity; see Section 4.1.4 (SVMBoPMA).
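As an illustration of the graph-only classifiers above, the sum-of-similarities rule of CTKA can be sketched with the regularized commute-time kernel in its standard form $\mathbf{K} = (\mathbf{D} - \alpha \mathbf{A})^{-1}$; the exact setting of Section 4.3.2 is not reproduced here, so the graph, labeling and $\alpha$ below are toy assumptions:

```python
import numpy as np

def ctk_classify(A, labels, alpha=0.9):
    """Sum-of-similarities rule with the regularized commute-time kernel
    K = inv(D - alpha*A), alpha in (0, 1). labels holds a class id for
    labeled nodes and -1 for unlabeled ones; each unlabeled node receives
    the class whose labeled nodes it is most similar to in total."""
    D = np.diag(A.sum(axis=1))
    K = np.linalg.inv(D - alpha * A)
    classes = np.unique(labels[labels >= 0])
    # similarity of every node to each class = sum of kernel values
    # toward the labeled nodes of that class
    sims = np.stack([K[:, labels == c].sum(axis=1) for c in classes], axis=1)
    pred = labels.copy()
    pred[labels == -1] = classes[np.argmax(sims[labels == -1], axis=1)]
    return pred

# Two triangles joined by a bridge edge; one labeled node on each side.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
labels = np.array([0, -1, -1, -1, -1, 1])
pred = ctk_classify(A, labels)
```

On this toy graph, the unlabeled nodes of the left triangle inherit the class of node 0 and those of the right triangle the class of node 5, as the kernel similarity decays with graph distance.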

5.2.3 Using both information sources (node features and graph structure)
Here, we investigate:

A double-kernel SVM (SVMDK). In this case, two kernels are computed, one defined on the graph and the other from the node features (see Section 4.2.2). An SVM is then used to classify the unlabeled nodes.

A support vector machine using autocovariates (ASVMAX). In this algorithm, autocovariates are added to the node features (see Section 4.2.1).

A spatial autoregressive model (SARAX). This model is a spatial extension of the standard regression model (see Section 4.2.3) and is used to classify the unlabeled nodes.
The dominant eigenvectors (whose number is a parameter to tune) provided by the four graph embedding techniques (Section 4.1), combined with the node features and then injected into a linear SVM classifier:

SVM using Moran's index derived features (see 4.1.1) in addition to the node features (SVMMAX).

SVM using Geary's index derived features (see 4.1.2) in addition to the node features (SVMGAX).

SVM using graph local principal component analysis (see 4.1.3) in addition to the node features (SVMLAX).

SVM using the bag-of-paths modularity (see 4.1.4) in addition to the node features (SVMBoPMAX).
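For illustration, a common way to obtain a Moran-type graph embedding is to take the dominant eigenvectors of the doubly-centered adjacency matrix; this sketch assumes that construction (Section 4.1.1 is not reproduced here). The resulting columns are what would be appended to the node features for a method like SVMMAX, or used alone for SVMMA:

```python
import numpy as np

def moran_embedding(A, p=2):
    """p dominant eigenvectors of the doubly-centered adjacency matrix
    H A H (with H = I - 11'/n), i.e., the score vectors with the largest
    Moran-type spatial autocorrelation on the graph."""
    n = A.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(H @ A @ H)        # ascending eigenvalues
    return vecs[:, np.argsort(-vals)[:p]]         # keep the p largest

# Two 4-node cliques linked by a single edge: the first embedding
# dimension separates the two clusters.
A = np.zeros((8, 8))
A[:4, :4] = 1.0
A[4:, 4:] = 1.0
np.fill_diagonal(A, 0.0)
A[3, 4] = A[4, 3] = 1.0
U = moran_embedding(A, p=2)
```

Since only a few dominant eigenvectors are needed (the tuned number is low, as reported in the results), sparse eigensolvers can replace the dense `eigh` call on large graphs.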

The considered classifiers, together with their parameters to be tuned, are listed in Table 4.
Table 5: Classification accuracy in percent (mean ± standard deviation over the 5 runs) for each method, each feature set (5F to 100F) and each dataset (DB1 to DB10).

Method   Info  Feat.  DB1          DB2          DB3          DB4          DB5          DB6          DB7          DB8          DB9          DB10
SAR      AX    100F   53.18±12.66  62.90±5.20   64.16±12.41  70.55±7.87   50.68±25.83  72.71±23.86  56.36±24.09  61.18±9.57   77.50±1.34   32.07±10.75
SAR      AX    50F    66.94±9.75   66.54±5.19   65.86±8.12   71.98±7.00   66.15±28.63  78.89±17.94  61.48±22.45  64.52±3.16   64.28±5.23   35.64±7.59
SAR      AX    25F    65.43±10.55  66.90±5.20   69.07±4.48   68.08±6.99   80.29±20.51  86.28±10.30  69.68±12.28  56.04±7.31   53.60±9.30   35.44±5.74
SAR      AX    10F    64.37±5.13   64.84±5.62   63.56±4.99   67.70±6.16   76.14±9.22   78.32±9.27   66.02±13.80  44.18±10.32  42.93±9.14   29.03±2.48
SAR      AX    5F     62.03±4.93   55.70±8.39   60.47±6.69   61.64±8.26   74.70±7.70   78.12±9.39   66.17±13.90  42.74±11.03  37.07±6.25   21.20±1.84
SVMG     AX    100F   83.99±3.19   80.98±2.36   80.77±3.84   83.17±2.00   87.82±1.76   91.63±1.72   74.94±2.05   70.46±1.06   71.37±1.54   54.62±0.91
SVMG     AX    50F    79.51±3.81   76.64±2.64   77.21±3.27   78.47±3.81   87.59±1.35   89.65±1.10   77.45±1.78   62.92±1.21   65.98±1.27   45.94±2.42
SVMG     AX    25F    69.77±4.36   70.60±4.48   73.57±3.18   74.32±3.69   90.36±1.24   91.65±1.61   79.31±1.05   59.92±1.46   67.51±2.18   43.40±1.70
SVMG     AX    10F    63.03±2.62   58.89±5.68   64.00±4.36   64.76±4.32   93.12±1.61   93.47±1.27   78.70±3.10   57.11±2.15   72.06±1.77   39.23±1.22
SVMG     AX    5F     57.31±4.58   56.83±6.35   64.28±5.44   59.56±3.16   93.21±2.14   93.34±1.53   79.86±2.68   57.14±2.57   73.19±2.70   35.90±1.29
SVMM     AX    100F   83.67±3.50   80.62±3.32   80.70±3.97   83.25±2.17   87.68±2.06   93.17±3.56   74.91±2.62   70.47±1.19   71.28±1.24   54.41±0.87
SVMM     AX    50F    79.46±3.72   76.45±3.08   77.31±2.97   78.51±4.85   89.93±1.19   94.97±2.92   76.03±1.78   64.10±1.57   72.07±2.60   46.67±2.00
SVMM     AX    25F    68.10±5.00   71.01±4.75   73.05±3.34   74.42±3.99   93.73±2.60   96.56±2.01   79.94±2.65   65.41±1.08   73.83±1.63   45.50±1.21
SVMM     AX    10F    62.21±4.30   57.29±6.67   63.67±3.94   64.23±4.05   94.91±1.73   97.59±1.43   80.66±2.55   66.33±2.15   76.36±1.19   41.91±1.74
SVMM     AX    5F     58.11±4.65   56.37±5.53   65.35±2.69   59.38±3.04   94.14±2.01   97.55±1.57   79.78±2.72   67.51±3.51   76.06±1.01   38.74±1.77
BoP      A     100F   54.45±3.65   41.84±4.25   48.42±0.85   45.70±0.21   96.76±10.34  98.63±11.08  82.48±6.06   69.91±19.19  78.11±19.89  35.34±4.16
BoP      A     50F    54.45±3.65   41.84±4.25   48.51±0.90   45.72±0.21   96.74±12.06  98.62±11.82  82.50±6.02   69.91±19.19  78.11±19.89  35.34±4.16
BoP      A     25F    54.45±3.65   41.84±4.25   48.33±0.88   45.64±0.15   96.80±11.26  98.62±10.15  82.50±6.04   69.91±19.19  78.11±19.89  35.34±4.16
BoP      A     10F    54.45±3.65   41.84±4.25   48.33±0.90   45.70±0.32   96.76±12.06  98.63±11.08  82.49±6.00   69.91±19.19  78.11±19.89  35.34±4.16
BoP      A     5F     54.45±3.65   41.84±4.25   48.30±0.86   45.70±0.25   96.77±10.33  98.63±11.82  82.50±6.07   69.91±19.19  78.11±19.89  35.34±4.16
CTK      A     100F   54.21±2.66   42.43±4.06   46.70±3.13   42.94±7.17   96.30±0.33   98.44±0.40   82.87±1.21   70.46±1.27   81.69±0.78   36.38±1.00
CTK      A     50F    54.21±2.66   42.43±4.06   47.25±2.80   42.88±7.14   96.31±0.34   98.44±0.38   82.92±1.17   70.46±1.27   81.69±0.78   36.38±1.00
CTK      A     25F    54.21±2.66   42.41±4.06   46.86±2.95   42.92±7.17   96.27±0.31   98.44±0.40   82.91±1.18   70.46±1.27   81.69±0.78   36.38±1.00
CTK      A     10F    54.21±2.66   42.49±3.95   46.87±3.02   42.98±7.17   96.32±0.34   98.41±0.41   82.92±1.19   70.46±1.27   81.69±0.78   36.38±1.00
CTK      A     5F     53.86±3.13   42.46±3.95   47.14±2.85   42.94±7.17   96.30±0.29   98.42±0.38   82.90±1.17   70.46±1.27   81.69±0.78   36.38±1.00
SVM      X     100F   83.72±3.70   80.76±2.46   80.93±3.88   83.11±2.54   88.45±1.10   91.41±1.15   74.66±2.65   70.46±1.12   71.41±0.89   54.59±1.06
SVM      X     50F    80.60±2.34   76.91±2.78   78.33±3.24   81.12±4.07   89.10±1.35   91.19±1.11   77.98±2.24   68.81±1.00   68.58±0.85   40.21±1.51
SVM      X     25F    74.56±3.29   75.49±4.25   77.52±2.89   79.42±3.19   89.45±1.68   91.76±1.12   78.89±5.87   66.68±1.07   64.18±0.63   34.23±0.96
SVM      X     10F    70.85±3.36   74.35±3.78   75.64±4.13   71.71±3.27   89.52±1.54   91.97±0.67   80.52±0.93   59.45±1.33   56.27±1.86   30.82±0.51
SVM      X     5F     65.55±3.13   66.17±7.07   71.02±2.88   75.02±2.44   87.18±0.92   92.19±0.49   78.48±5.79   53.90±1.60   42.26±3.22   25.23±0.46
SVMM     A     100F   46.26±6.78   33.13±5.08   47.03±7.53   40.03±4.52   95.43±1.10   95.07±1.40   81.60±1.59   55.93±1.63   74.51±1.62   30.82±1.10
SVMM     A     50F    46.26±6.78   33.13±5.08   47.03±7.36   40.09±4.52   95.43±1.10   95.06±1.53   81.61±1.59   55.93±1.63   74.51±1.62   30.82±1.10
SVMM     A     25F    46.26±6.78   32.96±5.08   47.10±7.51   40.09±4.52   95.43±1.09   95.15±1.53   81.61±1.59   55.93±1.63   74.51±1.62   30.82±1.10
SVMM     A     10F    46.29±6.82   33.13±5.08   47.17±7.46   40.03±4.50   95.43±1.09   95.08±1.53   81.60±1.59   55.93±1.63   74.51±1.62   30.82±1.10
SVMM     A     5F     46.39±6.78   33.04±5.08   47.14±7.51   40.03±4.50   95.43±1.09   95.07±1.53   81.61±1.59   55.93±1.63   74.51±1.62   30.82±1.10
SVMG     A     100F   43.45±8.24   33.32±6.54   44.10±4.30   39.97±4.93   89.82±12.23  93.79±1.53   78.53±1.25   68.13±1.93   75.58±1.16   35.03±1.49
SVMG     A     50F    43.45±8.54   33.13±6.55   44.21±4.03   39.79±4.93   89.82±12.23  93.67±1.55   78.53±1.24   68.13±1.93   75.58±1.16   35.03±1.49
SVMG     A     25F    43.45±7.89   33.29±6.54   44.21±4.10   39.81±4.48   89.98±12.20  93.80±1.54   78.52±1.24   68.14±1.93   75.58±1.16   35.03±1.49
SVMG     A     10F    43.45±7.89   33.15±6.60   44.21±4.23   39.97±4.93   89.95±12.23  93.54±1.44   78.52±1.25   68.13±1.93   75.58±1.16   35.03±1.49
SVMG     A     5F     43.45±8.54   33.15±6.57   44.17±4.03   39.81±4.93   89.83±12.23  93.66±1.53   78.52±1.24   68.13±1.93   75.58±1.16   35.03±1.49
ASVM     AX    100F   79.38±3.71   75.08±3.78   80.10±3.29   81.12±3.69   92.07±1.74   92.65±2.27   79.80±1.17   66.14±1.97   70.58±3.58   44.55±4.77
ASVM     AX    50F    79.21±4.09   74.40±2.84   77.66±3.51   79.41±4.43   92.32±2.16   94.76±1.90   80.50±2.41   66.08±1.63   76.28±2.43   37.13±7.96
ASVM     AX    25F    74.64±4.89   73.27±3.50   75.74±3.76   79.80±3.48   95.34±1.80   95.98±1.52   81.59±2.75   65.96±2.59   76.95±2.46   32.66±8.29
ASVM     AX    10F    71.41±3.45   72.29±6.41   75.74±7.78   76.15±4.31   95.48±1.76   96.65±1.34   81.16±2.99   62.93±3.41   74.93±2.85   28.24±6.85
ASVM     AX    5F     65.82±7.35   70.92±6.94   69.39±6.70   75.46±7.53   95.58±1.05   96.60±1.93   80.85±5.92   61.89±3.59   71.80±4.94   25.10±6.08
SVMDK    AX    100F   84.89±2.59   79.89±3.40   81.31±3.31   84.34±2.50   88.75±2.04   92.14±0.91   76.09±1.95   70.38±0.76   71.32±1.20   54.23±0.88
SVMDK    AX    50F    80.88±4.78   76.79±2.72   77.74±5.01   80.05±4.56   88.88±3.96   91.54±3.52   79.62±3.46   68.50±2.06   69.43±3.88   41.90±2.39
SVMDK    AX    25F    73.59±4.12   74.24±3.09   76.89±4.23   78.65±3.88   89.08±2.64   91.95±5.71   80.32±3.95   70.25±2.10   74.04±10.60  36.74±3.48
SVMDK    AX    10F    69.68±3.54   72.36±5.40   74.91±7.21   72.23±4.44   89.40±7.05   89.70±9.53   80.22±4.34   72.71±1.65   76.62±17.59  31.83±3.17
SVMDK    AX    5F     65.17±5.68   68.11±8.81   71.12±7.52   74.42±9.03   87.25±8.24   88.54±17.01  80.48±4.20   71.97±2.90   78.01±19.64  25.82±2.82
SVMBoPM  AX    100F   84.07±4.00   80.43±3.02   81.19±2.98   83.39±2.48   88.15±1.54   92.54±3.27   74.10±2.15   70.60±0.86   71.70±2.17   56.24±2.61
SVMBoPM  AX    50F    79.05±4.53   76.58±2.37   78.31±6.22   81.64±4.71   88.94±2.53   92.92±2.71   77.75±3.44   68.67±3.13   71.94±2.99   44.97±2.41
SVMBoPM  AX    25F    73.89±5.29   73.38±6.51   78.19±6.81   79.19±5.48   91.08±3.46   94.67±2.02   79.13±3.19   66.05±3.69   76.21±3.24   41.60±3.04
SVMBoPM  AX    10F    69.84±6.26   72.09±8.59   74.86±7.14   70.15±5.75   92.92±2.05   95.60±1.81   79.33±3.21   63.64±2.17   77.26±1.85   42.20±2.55
SVMBoPM  AX    5F     60.38±7.24   65.70±9.05   68.45±9.61   73.70±9.29   90.13±2.62   95.00±2.61   77.12±7.46   63.05±3.33   80.35±1.67   41.99±2.23
SVML     AX    100F   83.67±3.72   80.41±3.00   80.79±4.02   83.15±2.50   87.90±1.80   91.82±2.44   75.26±2.11   70.48±1.11   71.18±1.11   54.87±0.87
SVML     AX    50F    79.46±3.62   76.70±3.02   77.26±2.97   78.75±3.75   88.78±3.47   93.09±2.43   78.40±1.96   62.81±4.09   73.04±3.04   48.56±1.62
SVML     AX    25F    68.57±6.27   70.99±4.87   73.45±4.07   74.11±5.22   90.77±2.29   95.20±3.24   79.95±1.57   62.33±2.68   73.92±3.37   46.22±2.66
SVML     AX    10F    62.71±5.48   58.59±9.74   64.61±8.36   65.69±7.10   93.68±1.95   96.65±1.56   81.18±1.90   61.35±3.82   76.79±2.00   42.31±1.98
SVML     AX    5F     57.95±5.99   57.27±7.30   64.12±8.45   58.83±8.49   94.19±2.30   96.47±2.80   80.94±1.80   61.91±4.95   76.33±1.99   39.62±2.44
SVML     A     100F   40.85±4.08   33.07±5.31   40.43±4.70   42.01±3.00   90.93±1.92   95.85±1.75   79.07±2.33   62.40±1.67   76.58±0.93   34.42±1.00
SVML     A     50F    40.85±4.10   33.59±4.88   39.66±5.11   41.99±3.00   90.93±1.92   95.83±1.75   79.07±2.33   62.40±1.67   76.58±0.93   34.42±1.00
SVML     A     25F    40.74±4.08   33.96±5.57   39.56±5.13   42.01±3.00   90.96±1.87   95.80±1.78   79.07±2.33   62.40±1.67   76.58±0.93   34.42±1.00
SVML     A     10F    40.79±4.10   33.67±4.90   39.73±4.94   42.01±3.00   90.93±1.92   95.72±1.80   79.07±2.33   62.40±1.67   76.58±0.93   34.42±1.00
SVML     A     5F     40.79±4.13   33.96±5.57   39.68±5.11   42.01±3.00   90.96±1.87   95.71±1.80   79.07±2.33   62.40±1.67   76.58±0.93   34.42±1.00
SVMBoPM  A     100F   43.54±0.78   32.32±3.38   39.35±2.34   42.61±3.32   91.75±0.63   94.41±1.12   79.13±0.87   67.50±1.16   80.28±0.31   14.87±0.34
SVMBoPM  A     50F    43.54±0.78   32.32±3.38   39.35±2.34   42.61±3.32   91.75±0.63   94.41±1.12   79.13±0.87   67.50±1.16   80.28±0.31   14.87±0.34
SVMBoPM  A     25F    43.54±0.78   32.32±3.38   39.35±2.34   42.61±3.32   91.75±0.63   94.41±1.12   79.13±0.87   67.50±1.16   80.28±0.31   14.87±0.34
SVMBoPM  A     10F    43.54±0.78   32.32±3.38   39.35±2.34   42.61±3.32   91.75±0.63   94.41±1.12   79.13±0.87   67.50±1.16   80.28±0.31   14.87±0.34
SVMBoPM  A     5F     43.54±0.78   32.32±3.38   39.35±2.34   42.61±3.32   91.75±0.63   94.41±1.12   79.13±0.87   67.50±1.16   80.28±0.31   14.87±0.34
5.3 Experimental methodology
The classification accuracy will be reported for a 20% labeling rate, i.e., the proportion of nodes for which the label is known. The labels of the remaining nodes are deleted during the model fitting phase and are used as test data during the assessment phase, where the various classification models predict the most suitable category of each unlabeled node in the test set.
For each considered feature set and each dataset, samples are randomly assigned to 25 folds: 5 external folds are defined and, for each of them, 5 nested folds are also defined. This procedure is repeated 5 times to obtain 5 runs for each feature set and dataset. The performances of one specific run are then computed by taking the average over the 5 external folds and, for each of them, a 5-fold nested cross-validation is performed on the inner folds to tune the parameters of the models (see Table 4).
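The fold construction just described can be sketched as follows (the actual models and parameter grids of Table 4 are not reproduced; a trivial nearest-centroid model and a dummy parameter grid stand in):

```python
import numpy as np

def nested_cv_accuracy(X, y, fit_predict, param_grid, n_outer=5, n_inner=5, seed=0):
    """5x5 nested cross-validation: the inner folds select the parameter,
    the outer folds estimate the accuracy with the selected parameter.
    fit_predict(X_tr, y_tr, X_te, param) must return labels for X_te."""
    rng = np.random.default_rng(seed)
    outer = np.array_split(rng.permutation(len(y)), n_outer)
    accs = []
    for k in range(n_outer):
        test = outer[k]
        train = np.concatenate([outer[j] for j in range(n_outer) if j != k])
        inner = np.array_split(train, n_inner)
        best_param, best_acc = None, -1.0
        for param in param_grid:              # tune on the inner folds only
            inner_accs = []
            for i in range(n_inner):
                val = inner[i]
                tr = np.concatenate([inner[j] for j in range(n_inner) if j != i])
                pred = fit_predict(X[tr], y[tr], X[val], param)
                inner_accs.append(np.mean(pred == y[val]))
            if np.mean(inner_accs) > best_acc:
                best_acc, best_param = np.mean(inner_accs), param
        pred = fit_predict(X[train], y[train], X[test], best_param)
        accs.append(np.mean(pred == y[test]))
    return float(np.mean(accs))               # one run; the paper repeats this 5 times

def nearest_centroid(X_tr, y_tr, X_te, param):
    """Trivial stand-in model (param is ignored here)."""
    classes = np.unique(y_tr)
    cents = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    d = ((X_te[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (50, 2)), rng.normal(4.0, 0.2, (50, 2))])
y = np.repeat([0, 1], 50)
acc = nested_cv_accuracy(X, y, nearest_centroid, param_grid=[None])
```

The key point is that the test folds never influence the parameter selection, which happens entirely inside the inner loop.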
5.4 Results and discussion
First of all, the most frequently selected parameter values are reported in Table 4. We observe that the most frequently selected number of eigenvectors extracted for representing the graph structure (see Section 4.1) is actually low. This is good news, since efficient eigensystem solvers can be used to compute the first eigenvectors corresponding to the largest (or smallest) eigenvalues.
The classification accuracy and its standard deviation, averaged over the 5 runs, are reported in Table 5 for the 10 different datasets and the 5 feature sets. Bold values indicate the best performance over those 50 combinations. Recall that the BoPA, CTKA, SVMMA, SVMGA, and SVMLA methods do not depend on the node features, as they are based on the graph structure only. This explains why their results do not depend on the feature set.
Moreover, the different classifiers have been compared across datasets through a Friedman test and a Nemenyi post-hoc test Demsar2006 . The Friedman test is a nonparametric equivalent of the repeated-measures ANOVA. It ranks the methods for each dataset separately, the best algorithm getting rank 1, the second best rank 2, etc. Once the null hypothesis (the mean ranking of all methods is equal, meaning all classifiers are equivalent) is rejected ($p$-value less than 0.05), the (nonparametric) post-hoc Nemenyi test is computed. Notice that all Friedman tests were found to be positive. The Nemenyi test determines whether or not each method is significantly better ($p$-value less than 0.05, based on the 5 runs) than another. This is reported, for each feature set in turn (5F, 10F, …, 100F), and thus for increasing information available on the nodes, in Figures 2 to 6; an overall test based on all feature sets and datasets is reported in Figure 7.
5.4.1 Overall performances on all datasets and all node feature sets
From Table 5 and Figure 7, the overall best performances on all datasets and all node feature sets are often obtained either by an SVM based on the node features combined with new features derived from the graph structure (Subsection 4.1) or, unexpectedly, by the CTKA sum-of-similarities method (using the graph structure only; see Subsection 4.3.2), which performs quite well on datasets five to nine. The BoPA node betweenness (using the graph structure only; see Subsection 4.3.1) is also competitive and achieves results similar to the sum-of-similarities CTKA method (as already observed in MOI ).
On the contrary, the best method among the graph-structure-plus-node-features SVMs is not straightforward to determine (see Figure 7). From Figures 2 to 6, the main trend is that the performance decreases when the number of features decreases, as expected.
However, this trend is not always observed; for example, with the SVMMAX method (SVM with features extracted from Moran's index plus features on nodes, see Subsection 4.1.1) and database DB5, the performances rise when the number of features decreases. This can be explained if we realize that each dataset can be better described either in terms of its graph structure (graph-driven datasets, databases DB5 to DB9) or by its node features (features-driven datasets, databases DB1 to DB4 and DB10). To highlight this behavior, the network autocorrelation was computed for each class in turn and the average over classes is reported for each dataset. This measure quantifies to which extent the target variable is correlated across neighboring nodes. The values are reported in Table 6 for Moran's $I$, Geary's $c$ and the LPCA contiguity ratio (see Subsection 4.1.1). For Moran's $I$, high values (large autocorrelation) indicate that the graph structure is highly informative. The opposite holds for Geary's $c$ and the LPCA contiguity ratio, for which small values correspond to large autocorrelation.
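Moran's $I$ itself is easy to compute from the adjacency matrix; here is a minimal numpy sketch on two toy graphs, a perfectly assortative and a perfectly anti-assortative labeling, giving $I = +1$ and $I = -1$ respectively (the per-class averaging used for Table 6 is not reproduced):

```python
import numpy as np

def morans_I(A, x):
    """Moran's I of a signal x on a graph with symmetric weight matrix A:
    I = (n / sum(A)) * (xc' A xc) / (xc' xc), with xc the centered signal.
    Values near +1: neighbors tend to share the value (graph-driven signal);
    values near 0: no network autocorrelation."""
    xc = x - x.mean()
    return (len(x) / A.sum()) * (xc @ A @ xc) / (xc @ xc)

# Class-membership indicator on 4 nodes: nodes 0-1 in the class, 2-3 outside.
x = np.array([1.0, 1.0, 0.0, 0.0])
# Edges only within the class (assortative) vs. only across it (anti-assortative).
A_assort = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
A_anti = np.array([[0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
I_plus, I_minus = morans_I(A_assort, x), morans_I(A_anti, x)
```

Applied to each class indicator and averaged, this is the kind of diagnostic that distinguishes graph-driven from features-driven datasets.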
Nevertheless, from Table 5 and Figure 7, the best overall performing methods combining node features and graph structure are (excluding the methods based on the graph alone, BoPA and CTKA) SVMBoPMAX (SVM with bag-of-paths modularity, see Subsection 4.1.4) and ASVMAX (SVM based on autocovariates, see Subsection 4.2.1). They are not statistically different from SVMMAX, SVMLAX and SVMDKAX, but their mean rank is slightly better.
Notice also that, from Figure 7, if we compare with the performances obtained by the baseline linear SVM based on node features only (SVMX), we clearly observe that integrating the information extracted from the graph structure significantly improves the results. Therefore, it seems to be a good idea to collect link information, as it can improve the predictions.
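The rank-based comparison underlying Figures 2 to 7 can be sketched as follows (the accuracy matrix is hypothetical, and the Nemenyi critical value $q_{0.05} = 2.343$ for $k = 3$ methods is taken from the tables in Demsar2006 ):

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies: one row per dataset, one column per classifier.
acc = np.array([[0.84, 0.80, 0.54],
                [0.81, 0.77, 0.42],
                [0.83, 0.79, 0.46],
                [0.88, 0.92, 0.96],
                [0.74, 0.80, 0.83]])

# Friedman test: are the mean ranks of the classifiers equal across datasets?
stat, p = stats.friedmanchisquare(*acc.T)

# Per-dataset ranks (rank 1 = best) and mean rank per classifier,
# the quantity compared by the Nemenyi post-hoc test.
ranks = np.array([stats.rankdata(-row) for row in acc])
mean_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: two classifiers differ significantly when
# their mean ranks differ by more than CD (q_0.05 = 2.343 for k = 3).
k, N = acc.shape[1], acc.shape[0]
cd = 2.343 * np.sqrt(k * (k + 1) / (6.0 * N))
```

If the null hypothesis is rejected (small `p`), pairs of methods whose mean ranks differ by more than `cd` are declared significantly different, which is exactly what the critical-difference diagrams in the figures display.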
Table 6: Average network autocorrelation of the class memberships for each dataset, for Moran's I, Geary's c and the LPCA contiguity ratio. The three reference columns recall the value of each index for a graph-driven signal (strong positive autocorrelation), for no autocorrelation, and for negative autocorrelation.

Index          graph-driven  none  negative   DB1    DB2    DB3    DB4    DB10
Moran's I      +1            0     -1         1.27   1.09   0.66   0.53   0.79
Geary's c      0             1     2          0.09   0.09   0.33   0.19   0.12
LPCA c. ratio  ~0 (*)        -     -          0.20   0.13   0.58   0.67   0.26

Index          graph-driven  none  negative   DB5    DB6    DB7    DB8    DB9
Moran's I      +1            0     -1         0.22   0.12   0.15   0.06   0.15
Geary's c      0             1     2          0.78   0.59   0.63   0.57   0.43
LPCA c. ratio  ~0 (*)        -     -          2.54   2.10   1.86   1.90   0.82

(*) The LPCA contiguity ratio is positive and lower-bounded by a quantity, close to zero, depending on the largest eigenvalue of the involved matrix; its upper bound is unknown Lebart2000 .
5.4.2 Exploiting either the structure of the graph or the node features alone
Obviously, as already mentioned, databases DB5 to DB9 are graph-driven, which explains the good performance of the sum-of-similarities CTKA and of BoPA on these data. For these datasets, the features on the nodes do not help much for predicting the class label, as observed in Figure 9, where results are displayed for these datasets only. This also explains the behavior of the SVMMAX method on database DB5, among others.
In this case, the best performing methods are the sum-of-similarities CTKA and the bag-of-paths betweenness BoPA (see Subsection 4.3). This is clearly confirmed by displaying the results of the methods based on the graph structure only (Figure 10) and the results obtained on the graph-driven datasets (Figure 9). Interestingly, in this setting, these two methods, which ignore the node features, outperform the SVM-based methods.
Conversely, on the node-features-driven datasets (DB1 to DB4 and DB10; results displayed in Figure 8 and Table 5), all SVM methods based on node features and graph structure perform well, while the methods based on the graph structure only obtain much worse results, as expected. In this setting, the two best performing methods are the same as for the overall results, that is, SVMBoPMAX (SVM with bag-of-paths modularity, see Subsection 4.1.4) and SVMDKAX (SVM based on a double kernel, see Subsection 4.2.2). However, these methods are not significantly better than a simple linear SVM based on the features only (SVMX), as shown in Figure 8.
From another point of view, Figure 11 takes all datasets into account and compares only the methods combining node features and graph structure. In this setting, the best methods are again SVMBoPMAX, ASVMAX and SVMDKAX, which are now significantly better than the baseline SVMX (fewer methods and more datasets are compared).
5.4.3 Comparison of graph embedding methods
Concerning the embedding methods described in Subsection 4.1, we can conclude that Geary's index (SVMGA and SVMGAX) should be avoided, in favor of the bag-of-paths modularity (SVMBoPMAX), Moran's index (SVMMAX) or graph local principal component analysis (SVMLAX). This is clearly observable when displaying only the results of the methods combining node features and graph structure (Figure 12). The result is confirmed when comparing only the methods based on graph embeddings (Figure 10).
5.4.4 Main findings
To summarize, the experiments lead to the following conclusions:

The best performing methods are highly dependent on the dataset. We observed (see Table 6) that, quite naturally, some datasets are graph-driven, in the sense that the network structure conveys important information for predicting the class labels, while other datasets are node-features-driven and, in this case, the graph structure does not help much. However, it is always a good idea to consider information about the graph structure, as this additional information can significantly improve the results (see Figures 11 and 12).

If we consider the graph structure alone, the two best investigated methods are the sum-of-similarities CTKA and the bag-of-paths betweenness BoPA (see Subsection 4.3). They clearly outperform the graph embedding methods, and even the SVMs on some datasets.

When informative features on the nodes are available, it is always a good idea to combine both sources of information; we found that the best performing methods are SVMBoPMAX (SVM with bag-of-paths modularity, see Subsection 4.1.4), ASVMAX (SVM based on autocovariates, see Subsection 4.2.1) and SVMDKAX (SVM based on a double kernel, see Subsection 4.2.2); see Figure 11. Taking the graph structure into account clearly improves the results over a baseline SVM based on the node features only.
6 Conclusion
This work considered a data structure made of a graph with plain features on its nodes. Fourteen semi-supervised classification methods were investigated in order to compare the feature-based approach, the graph-structure-based approach, and the dual approach combining both sources of information. It appears that the best performances are often obtained either by an SVM (the considered baseline classifier) based on the plain node features combined with a number of new features derived from the graph structure (namely from the bag-of-paths modularity or from autocovariates), or by the sum-of-similarities and bag-of-paths betweenness methods, based on the graph structure only, which perform well on the datasets for which the graph structure carries important class information.
Indeed, we observed empirically that some datasets are better explained by their graph structure (graph-driven datasets) and others by their node features (features-driven datasets). Consequently, neither the graph-derived features alone nor the plain features alone are sufficient to obtain optimal performances. In other words, standard feature-based classification results can be significantly improved by integrating information from the graph structure. In particular, the most effective methods were based on the bag-of-paths modularity (SVMBoPMAX), autocovariates (ASVMAX) or a double kernel (SVMDKAX).
The take-away message can be summarized as follows: if the dataset is graph-driven, a simple sum-of-similarities or a bag-of-paths betweenness is sufficient, but this is not the case when the features on the nodes are more informative. In both cases, SVMBoPMAX, ASVMAX and SVMDKAX still ensure good overall performance, as shown on the investigated datasets.
A key point is therefore to determine a priori whether a given dataset is graph-driven or features-driven. In this paper, we proposed spatial autocorrelation indexes to tackle this issue. Further investigations will be carried out in that direction. In particular, how can we automatically infer the nature of a new dataset (graph-driven or features-driven) when not all class labels are known (during cross-validation, for example)?
Finally, the present work does not take scalability and computational complexity into account. This analysis is left for further work.
Acknowledgement
This work was partially supported by the ElisIT project funded by the “Région wallonne” and the Brufence project supported by INNOVIRIS (“Région bruxelloise”). We thank these institutions for giving us the opportunity to conduct both fundamental and applied research.
References
 (1) R. M. Cooke, Experts in uncertainty, Oxford University Press, 1991.
 (2) R. A. Jacobs, Methods for combining experts’ probability assessments, Neural Computation 7 (1995) 867–888.
 (3) D. Chen, X. Cheng, An asymptotic analysis of some expert fusion methods, Pattern Recognition Letters 22 (2001) 901–904.
 (4) J. Kittler, F. M. Alkoot, Sum versus vote fusion in multiple classifier systems, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (1) (2003) 110–115.
 (5) F. Lad, Operational subjective statistical methods, John Wiley & Sons, 1996.
 (6) G. J. Klir, T. A. Folger, Fuzzy sets, uncertainty, and information, Prentice-Hall, 1988.
 (7) D. Dubois, M. Grabisch, H. Prade, P. Smets, Assessing the value of a candidate: comparing belief function and possibility theories, Proceedings of the Fifteenth International Conference on Uncertainty in Artificial Intelligence (1999) 170–177.
 (8) C. Merz, Using correspondence analysis to combine classifiers, Machine Learning 36 (1999) 226–239.
 (9) W. B. Levy, H. Delic, Maximum entropy aggregation of individual opinions, IEEE Transactions on Systems, Man and Cybernetics 24 (4) (1994) 606–613.
 (10) I. J. Myung, S. Ramamoorti, J. Andrew D. Bailey, Maximum entropy aggregation of expert predictions, Management Science 42 (10) (1996) 1420–1436.
 (11) F. Fouss, M. Saerens, Yet another method for combining classifiers outputs: a maximum entropy approach, Proceedings of the 5th International Workshop on Multiple Classifier Systems (MCS 2004), Lecture Notes in Computer Science, Vol. 3077, Springer-Verlag (2004) 82–91.
 (12) X. Zhu, A. Goldberg, Introduction to semi-supervised learning, Morgan & Claypool Publishers, 2009.
 (13) O. Chapelle, B. Scholkopf, A. Zien (Eds.), Semi-supervised learning, MIT Press, 2006.
 (14) S. Hill, F. Provost, C. Volinsky, Network-based marketing: identifying likely adopters via consumer networks, Statistical Science 21 (2) (2006) 256–276.
 (15) S. A. Macskassy, F. Provost, Classification in networked data: a toolkit and a univariate case study, Journal of Machine Learning Research 8 (2007) 935–983.
 (16) J.-J. Daudin, F. Picard, S. Robin, A mixture model for random graphs, Statistics and Computing 18 (2) (2008) 173–183.
 (17) T. Joachims, Transductive inference for text classification using support vector machines, International Conference on Machine Learning (ICML) (1999) 200–209.
 (18) X. Zhu, Semi-supervised learning literature survey, unpublished manuscript (available at http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview.html) (2008).
 (19) F. Fouss, M. Saerens, M. Shimbo, Algorithms and models for network data and link analysis, Cambridge University Press, 2016.
 (20) F. R. Chung, Spectral graph theory, American Mathematical Society, 1997.
 (21) M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from examples, Journal of Machine Learning Research 7 (2006) 2399–2434.
 (22) X. He, Laplacian regularized D-optimal design for active learning and its application to image retrieval, IEEE Transactions on Image Processing 19 (1) (2010) 254–263.
 (23) S. Chakrabarti, B. Dom, P. Indyk, Enhanced hypertext categorization using hyperlinks, in: Proceedings of the ACM International Conference on Management of Data (SIGMOD 1998), 1998, pp. 307–318.
 (24) C. Bishop, Neural networks for pattern recognition, Oxford University Press, 1995.
 (25) T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learning, 2nd ed., Springer, 2009.
 (26) S. Theodoridis, K. Koutroumbas, Pattern recognition, 2nd ed., Academic Press, 2003.
 (27) B. Lebichot, I. Kivimaki, K. Françoisse, M. Saerens, Semi-supervised classification through the bag-of-paths group betweenness, IEEE Transactions on Neural Networks and Learning Systems 25 (2014) 1173–1186.
 (28) S. Abney, Semi-supervised learning for computational linguistics, Chapman and Hall/CRC, 2008.
 (29) T. Hofmann, B. Schölkopf, A. J. Smola, Kernel methods in machine learning, The Annals of Statistics 36 (3) (2008) 1171–1220.
 (30) T. Silva, L. Zhao, Machine learning in complex networks, Springer, 2016.
 (31) A. Subramanya, P. P. Talukdar, Graph-based semi-supervised learning, Morgan & Claypool Publishers, 2014.
 (32) X. Zhu, A. Goldberg, Introduction to semi-supervised learning, Morgan & Claypool Publishers, 2009.
 (33) D. Borcard, P. Legendre, All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices, Ecological Modelling 153 (1–2) (2002) 51–68.
 (34) S. Dray, P. Legendre, P. Peres-Neto, Spatial modelling: a comprehensive framework for principal coordinate analysis of neighbour matrices, Ecological Modelling 196 (3–4) (2006) 483–493.
 (35) A. Meot, D. Chessel, R. Sabatier, Operateurs de voisinage et analyse des donnees spatio-temporelles (in French), in: D. Lebreton, B. Asselain (Eds.), Biometrie et environnement, Masson, 1993, pp. 45–72.
 (36) L. Tang, H. Liu, Relational learning via latent social dimensions, in: Proceedings of the ACM conference on Knowledge Discovery and Data Mining (KDD 2009), 2009, pp. 817–826.
 (37) L. Tang, H. Liu, Scalable learning of collective behavior based on sparse social dimensions, in: Proceedings of the ACM conference on Information and Knowledge Management (CIKM 2009), 2009, pp. 1107–1116.
 (38) L. Tang, H. Liu, Toward predicting collective behavior via social dimension extraction, IEEE Intelligent Systems 25 (4) (2010) 19–25.
 (39) D. Zhang, R. Mao, A new kernel for classification of networked entities, in: Proceedings of 6th International Workshop on Mining and Learning with Graphs, Helsinki, Finland, 2008.
 (40) D. Zhang, R. Mao, Classifying networked entities with modularity kernels, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management (CIKM 2008), ACM, 2008, pp. 113–122.
 (41) R. Haining, Spatial data analysis, Cambridge University Press, 2003.
 (42) D. Pfeiffer, T. Robinson, M. Stevenson, K. Stevens, D. Rogers, A. Clements, Spatial analysis in epidemiology, Oxford University Press, 2008.
 (43) T. Waldhor, Moran’s spatial autocorrelation coefficient, in: Encyclopedia of Statistical Sciences, 2nd ed. (S. Kotz, N. Balakrishnana, C. Read, B. Vidakovic and N. Johnson, editors), Vol. 12, Wiley, 2006, pp. 7875–7878.
 (44) L. Waller, C. Gotway, Applied spatial statistics for public health data, Wiley, 2004.
 (45) K. V. Mardia, Some properties of classical multidimensional scaling, Communications in Statistics  Theory and Methods 7 (13) (1978) 1233–1241.
 (46) U. von Luxburg, A tutorial on spectral clustering, Statistics and Computing 17 (4) (2007) 395–416.
 (47) M. Newman, Networks: an introduction, Oxford University Press, 2010.
 (48) H. Benali, B. Escofier, Analyse factorielle lissee et analyse des differences locales, Revue de Statistique Appliquee 38 (2) (1990) 55–76.
 (49) L. Lebart, Contiguity analysis and classification, in: W. Gaul, O. Opitz, M. Schader (Eds.), Data Analysis, Studies in classification, data analysis, and knowledge organization, Springer, 2000, pp. 233–243.
 (50) R. Devooght, A. Mantrach, I. Kivimäki, H. Bersini, A. Jaimes, M. Saerens, Random walks based modularity: Application to semi-supervised learning, in: Proceedings of the 23rd International Conference on World Wide Web (WWW 2014), 2014, pp. 213–224.
 (51) J. E. Besag, Nearest-neighbour systems and the auto-logistic model for binary data, Journal of the Royal Statistical Society. Series B (Methodological) 34 (1) (1972) 75–83.
 (52) N. H. Augustin, M. A. Mugglestone, S. T. Buckland, An autologistic model for the spatial distribution of wildlife, Journal of Applied Ecology 33 (2) (1996) 339–347.
 (53) N. H. Augustin, M. A. Mugglestone, S. T. Buckland, The role of simulation in modelling spatially correlated data, Environmetrics 9 (2) (1998) 175–196.
 (54) Q. Lu, L. Getoor, Link-based classification, in: Proceedings of the 20th International Conference on Machine Learning (ICML 2003), 2003, pp. 496–503.
 (55) Y. Pawitan, In all likelihood: statistical modelling and inference using likelihood, Oxford University Press, 2001.
 (56) A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society B 39 (1) (1977) 1–38.
 (57) G. McLachlan, T. Krishnan, The EM algorithm and extensions, 2nd ed., Wiley, 2008.
 (58) V. Roth, Probabilistic discriminative kernel classifiers for multiclass problems, in: B. Radig, S. Florczyk (Eds.), Pattern Recognition: Proceedings of the 23rd DAGM Symposium, Vol. 2191 of Lecture Notes in Computer Science, Springer, 2001, pp. 246–253.
 (59) B. Schölkopf, A. Smola, Learning with kernels, The MIT Press, 2002.
 (60) J. Shawe-Taylor, N. Cristianini, Kernel methods for pattern analysis, Cambridge University Press, 2004.
 (61) F. Fouss, K. Françoisse, L. Yen, A. Pirotte, M. Saerens, An experimental investigation of kernels on graphs for collaborative recommendation and semi-supervised classification, Neural Networks 31 (2012) 53–72.
 (62) T. Gärtner, Kernels for structured data, World Scientific Publishing, 2008.
 (63) J. LeSage, R. K. Pace, Introduction to spatial econometrics, Chapman & Hall, 2009.
 (64) A. Mantrach, N. van Zeebroeck, P. Francq, M. Shimbo, H. Bersini, M. Saerens, Semi-supervised classification and betweenness computation on large, sparse, directed graphs, Pattern Recognition 44 (6) (2011) 1212–1224.
 (65) D. Zhou, O. Bousquet, T. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Proceedings of the Neural Information Processing Systems Conference (NIPS 2003), 2003, pp. 237–244.
 (66) K. Françoisse, I. Kivimäki, A. Mantrach, F. Rossi, M. Saerens, A bag-of-paths framework for network data analysis, to appear in: Neural Networks.
 (67) P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, T. Eliassi-Rad, Collective classification in network data, AI Magazine 29 (3) (2008) 93–106.
 (68) J. McAuley, J. Leskovec, Learning to discover social circles in ego networks, in: Advances in Neural Information Processing Systems (NIPS 2012), 2012.
 (69) C.-W. Hsu, C.-J. Lin, A comparison of methods for multiclass support vector machines, IEEE Transactions on Neural Networks 13 (2) (2002) 415–425.
 (70) R. Fan, K. Chang, C. Hsieh, X. Wang, C. Lin, LIBLINEAR: A library for large linear classification, Journal of Machine Learning Research 9 (2008) 1871–1874.
 (71) V. Vapnik, Statistical Learning Theory, Wiley, 1998.
 (72) J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.