Relationships between data can often be well described with a graph structure. Although many datasets, including social and traffic networks, come with a pre-existing graph that helps in interpreting them, there is still a large number of datasets (e.g., brain activity information) where a graph is not readily available. Many graph learning techniques have been proposed in the past years  to help in analysing such datasets. Currently, a lot of interest in the field of graph learning is focused on designing graph learning methods that can take into account prior information we might have on the graph structure , or different relationships between data and the underlying graph 
. However, most of these works only consider simple data, where all datapoints follow the same model defined with only one graph. While there are still many topics of interest in those settings, we argue that natural data often comes in more complicated forms. In fact, such data opens an entire field of unsupervised learning methods. A natural example of such a dataset can be found in brain fMRI data, where signals usually measure the brain going through different processes. Each of these processes can be explained with a different brain functional network, with regions of interest as shared network nodes. However, it is not clear which network is activated at what time, causing the need to separate signals corresponding to different networks.
In this paper, we precisely consider data that naturally forms clusters, where signals from each of the clusters live on a different graph. This allows analysis of more complex datasets where simple graph learning methods would suffer from intertwined data and thus lose the ability to capture a meaningful graph structure. In particular, we study the problem of multiple-graph inference from a general group of signals that are an unknown combination of data of different structures. Namely, we propose a generative model for data represented as a mixture of signals naturally living on a collection of different graphs. As is often the case with data, separation of these signals into clusters is assumed to be unknown. We thus propose an algorithm that will jointly cluster the signals and infer multiple graph structures, one for each of the clusters. Numerical simulations suggest the effectiveness of our method in real-world datasets in both clustering signals and recovering the structure of corresponding graphs.
As we will deal with clustering in large dimensionalities in these settings, it is worth noting that inherently high dimensional clustering problems often suffer from the curse of dimensionality and poor interpretability. While imposing that data lives on a graph implicitly reduces the dimensionality of the problem, graphs also offer a natural representation for connections between data. Therefore, they provide interpretability both in terms of direct inspection of graph structure, as well as the ability to further deploy various data analysis algorithms.
In the literature, the graph learning problem has been first considered as sparse precision matrix inference. Data is modelled as a multivariate Gaussian distribution, whose inverse covariance matrix reveals direct pairwise connections between nodes 
. Based on these methods, several models have been proposed to infer Gaussian mixture models with sparse precision matrices  . All these works actually focus on inferring GMRFs (Gaussian Markov Random Fields), and do not constrain values in the precision matrix in any special way. Therefore, the resulting graphs can have both positive and negative weights, as well as self-loops, which can be difficult to interpret in practice. On the other hand, graph representation constrained to a valid Laplacian matrix circumvents this problem, while opening the door to numerous data analysis methods . For these reasons, an increasing amount of work has recently been focusing on inferring (generalised) graph Laplacians. Among the first researchers to focus on graph Laplacian inference, Dong et al.  adopt a graph signal processing perspective to enforce data smoothness on the inferred graph. The proposed model results in assumptions similar to those in GMRFs, but with added constraints that ensure that the graph is given by a valid Laplacian matrix. Kalofolias 
uses a similar framework and proposes a computationally more efficient solution by inferring a weight matrix, which can eventually be easily transformed into a Laplacian. Other recent works in this vein include inference of graph shift operators, with the assumption that data is the result of a graph diffusion process. A popular approach to this problem consists in exploiting the fact that the eigenvectors of the graph will be shared with those of any graph filter. Therefore, they can be estimated from the data sample covariance matrix, and the optimisation can be done only over the eigenvalues . Dictionary based methods try to model signal heterogeneity by taking into account sparse combinations or dictionary atoms. In , signals are represented as linear combinations of atoms from a heat kernel dictionary. As those are still bound to be smooth,  model signals as sparse linear combinations of atoms in a predefined polynomial graph dictionary. While this allows for a larger frequency support, the method assumes the kernels are known, often prohibitive in practice. All aforementioned methods assume the whole set of signals can be well explained with one single graph model. Naturally, this is unlikely to be the case in many applications, where signals might be generated in different ways or exist as a combination of distinct causes.
Finally, there have been some works focused on signals in one dataset that naturally live on different graphs. Kalofolias et al.  infer the structure of a time varying graph, effectively capturing multiple graphs in different periods of time. Segarra et al.  infer multiple networks under the assumption of signal stationarity. However, both of these methods assume that signal clusters are given a priori, i.e. it is clear which signals correspond to which graph. The joint network inference then becomes a problem of imposing similarities on different graph structures, rather than decoupling the signals into groups and learning a graph to explain each of the groups, which is the focus of the present work.
2 Graph Signal Processing Basics
Let be an undirected, weighted graphs with a set of vertices , edges and a weighted adjacency matrix . The value is equal to if there is no edge between and , and denotes the weight of that edge otherwise. The non-normalised (or combinatorial) graph Laplacian is defined as
where is a diagonal matrix of node degrees. A graph Laplacian satisfies:
When these conditions are satisfied, we write , where is a set of valid Laplacian matrices .
We also define a signal on a graph as a function , where denotes the value of a signal on a vertex . A graph signal is considered smooth if most of its energy is concentrated in the low frequencies of the underlying graph, which can be measured with a quadratic form of the graph Laplacian:
Indeed, it is clear from Equation (4) that the signal difference will get more penalised for two vertices linked by a strong edge. It might be less apparent that there is also a strong spectral interpretation. Namely, as a real symmetric matrix, can be decomposed into a set of
orthogonal eigenvectors and associated eigenvalues. These play the role of Fourier transform in graph signal processing, with the eigenvalues taking up a notion associated to frequency. Using this decomposition withbeing the eigenvectors and diagonal matrix of eigenvalues, we can see that the above relation
penalises signals according to their frequency support. In addition, assuming the signal can be modelled through a latent variable , such that , yields the minimiser of 4 as a maximum likelihood estimator for the graph . Note here that represents the pseudo-inverse of the eigenvalue matrix, implying that signal support is inversely proportional to frequency. This gives us a direct relationship between signals and the graph Laplacian matrix :
As , we can now see as a special degenerate GMRF. Namely, the precision matrix has at least one zero eigenvalue, as well as a special structure ensuring .
3 Graph Laplacian Mixture Model
We propose a probabilistic model for a set of graph signals with distinguishable subsets, where the behaviour of signals in each of the subsets (groups) is well explained with a different graph. One might imagine such a dataset arising from measures (signals) of human behaviour, where our work environments, family and different friend groups naturally describe many distinctive social networks between people as nodes. Consistently, the social network having preference at the moment a measurement is taken (e.g., professional network during working hours) will have stronger influence on the measurement. A toy example of our data model is given in Figure (1).
Our goal is to distinguish these groups of signals, and eventually infer a graph that will model the structure of signals in each cluster. As the clusters of signals are unknown and the graph structures largely influence behaviours inside them, we argue that identifying the groups and learning associated graphs are intertwined problems. Therefore, we propose a generative model that jointly explains both signals and clusters, under an assumption of smoothness of signals on the learned graph, for each cluster.
3.1 Multigraph signal representation
The graphical representation for our model is given in Figure (2). Let us assume that there are undirected, weighted graphs with a set of shared vertices . Each graph has a specific set of edges and a weighted adjacency matrix . From each of these weight matrices , we can define a graph Laplacian matrix , as in Equation (1).
We further assume there are clusters, and each of the observed signals on the nodes , belongs to exactly one of the clusters. Cluster participation is modelled through a binary latent variable , with if signal belongs to cluster . Mixing coefficients model the size of each cluster, and define a prior distribution on variables , with .
Finally, we model data in each cluster with a mean and a graph Laplacian , assuming associated signals will be close to and smooth on graph :
Marginalising over latent variables , we have:
where Equation (10) comes from the assumption of smoothness and Equation (6). This fully describes the generative model for our observed signals, with the constraint in (11) ensuring all s are valid Laplacian matrices, while Equations (12) and (13) ensure that
defines a valid probability measure.
3.2 Problem formulation
Given a set of -dimensional graph signals with some intrinsic grouping into
clusters associated to it, we look at the maximum a posteriori estimate for our parametersand . Namely, assuming the data has been sampled independently from the distribution in Equation (10), and allowing for a prior on the graph structure, we want to maximise over the a posteriori distribution of our model:
which does not have a closed form solution. We will thus estimate the parameters using an expectation maximisation (EM), as explained below.
We propose an expectation maximisation algorithm that alternates between optimising for expected cluster participations in the expectation step, and signal means , class proportions and graph topologies in the maximisation step. Therefore, the joint clustering and multi-graph learning problem iterates over two steps: the first trying to estimate the correct clustering, and the second trying to describe said clusters by inferring cluster means and proportions, as well as the graphs describing them. Precisely, we first initialise and randomly, noting that we ensure are set to a random valid Laplacian matrix, guaranteeing it truly describes a graph structure. The alternating steps follow:
4.1 Expectation (E step)
Let us first define
as a matrix of posterior probabilities, withmodelling the probability that signal belongs to the group . Note that, at the same time, this is the expected value of the indicator variable , or .
However, this naive formulation has some limitations. Namely, it is well known that a graph Laplacian has at least one eigenvalue zero, corresponding to the eigenvector . This makes the distribution in Equation (21) degenerate. At the same time, from the signal processing point of view, the corresponding eigenvector is completely smooth on all possible graphs, and is therefore non-informative. The disintegration theorem  guarantees we can restrict our problem to the dimensional subspace spanned by all remaining eigenvectors. We thus proceed by projecting signals to this subspace, where we can then compute and retrieve the probabilities we wanted to model.
Furthermore, if a graph has disconnected components, it will have as many zero eigenvalues as the number of components in the graph. Even if the final graph is connected, there is no guarantee that the algorithm does not return a disconnected graph in one of the steps of the optimisation problem. It is easy to see how this can pose large numerical problems, both in each graph separately as their eigenvalues approach zero, but also in terms of comparing the probabilities these graphs define when trying to infer . To avoid this problem, we add a small regularising constant to every eigenvalue corresponding to the non-trivial eigenvectors of the graph Laplacian. The computation of responsibilities sums up as follows:
4.2 Maximisation (M step)
Having estimated responsibilities in the E-step, we can now maximise over the expected posterior distribution given all observed signals:
Equation (34) is concave over and , and offers closed form solutions for both:
In order to maximise over , we substitute with variables . Now we can formulate a problem of multiple graph inference:
It is clear that these are independent optimisation problems, and we can maximise each separately:
Similarly to the graph inference problem explored in , each of these problems can be solved efficiently with:
with the graph prior encoded in , as well as in the constraint.
To efficiently update through equation 39, it is crucial to choose a good prior on the graph Laplacians. Namely, we want to maximise signal smoothness on the graph, while controlling graph sparsity. Due to it’s computational efficiency, we will use the algorithm from Kalofolias  with the small addition of cluster probabilities . The graph update step thus becomes:
in which diag(
) is vector with diagonal values (node degrees) from, and is the Frobenius norm of off-diagonal values in . Notice that = , where is the weight matrix of the same graph. Compared to the formulation from Equation 39, the function consists of two parts, such that increasing strengthens graph connectivity, while decreasing promotes sparsity. As before, the constraint that ensures is a valid Laplacian matrix.
In this section, we evaluate our algorithm on both synthetic and real data. To the best of our knowledge, there are no other methods tackling the problem of jointly clustering graph signals and learning multiple graph Laplacians. Thus, we compare our algorithm (GLMM) to the estimation of a Gaussian mixture model (GMM) and a signal clustering by simple K-means, followed by graph learning in each inferred cluster separately (K-means + GL). As the Gaussian mixture model does not naturally provide sparse inverse Gaussians, we can obtain graphs by choosing only the largest absolute values in the inferred precision matrices.
5.1 Synthetic data
We consider randomly generated connected graphs of size following an Erdos-Renyi model with . Each graph is given a randomly generated 15-dimensional mean signal that lives on its vertices. We fix the total number of signals to 150 and consider a case with 2 classes, given by the probability vectors and . For each signal we randomly generate through probabilities . Then is randomly generated through a distribution , if , which gives us the full synthetic dataset for one experiment.
We examine the performance of all three algorithms in recovering the original clusters of signals (given by ) and corresponding graph topologies. For each of the methods under comparison, the experiment is repeated 5 times on the same synthetic dataset with different random initialisations, and only the result with the best signal clustering performance is kept. Note that, to avoid numerical issues and render the comparison more fair, we train the GMM in - 1 dimensions, ignoring the direction of the eigenvector which is known to be zero (see Section 2). Further, the whole experiment has been repeated 100 times, each time with different graphs, means and randomly generated data. We present the error in terms of class identification NMSE (, presented in ) in Table 1. The performance of graph inference for each of the methods is presented in Table 2 in terms of mean edge recovery F measure.
|GLMM||GMM||K-Means + GL|
As expected, we can see that all methods are affected by unbalanced clusters, and give significantly poorer results in terms of clustering performance, as well as edge recovery for the graph in less represented cluster. As the graph in the more represented cluster (with ) will always be inferred from the bigger set of relevant signals, all three methods outperform their own F measure score compared to the case of balanced clusters. Finally, our method shows promising results in comparison with other inspected methods in all tasks on synthetic data.
5.2 Real data
We further evaluate our algorithm on real data. In applications where the real network is not known, we evaluate the performance by inspecting the clustering results. We further compare inferred graphs to a constructed ground-truth graph, and use visual inspection to asses graphs quality.
5.2.1 Molene weather dataset
The Molene weather dataset  provides joint measurements of temperature and wind strength in 28 stations across Bretagne, France. There are in total 744 measurements of each of the 2 types, in each of the 28 stations. Note that we refer to signals as measurements, while measure represents a whole class of temperature or wind strength signals. We preprocess the data by subtracting the mean of each signal and normalising both measures, to ensure the data is in the same range. We compare the results of all three algorithms described above in terms of clustering accuracy, graph inference, as well as model generalisability.
We first look at clustering accuracy and graph inference performance. We randomly select 300 preprocessed measurements describing temperature and 300 describing wind speed to create a dataset for evaluation. We ran all algorithms 5 times on this dataset with different random initialisations, and select the best performing run in terms of clustering. The whole experiment has been repeated 100 times, each time with different randomly selected dataset of measurements. Clustering performance is presented in Table 3 in terms of NMSE ().
|GLMM||GMM||K-means + GL|
We can see that GLMM outperforms other considered methods in terms of clustering accuracy. This is especially interesting as the examined dataset does not have an inherent ground truth graph structure supporting it. We argue that this suggests that our data can be well modelled with graphs. Therefore, additional constraints posed by Laplacian inference reduce the scope of the problem when compared to GMM, while they still allow for a lot more adaptivity when compared to simple K-means.
We also investigate graph inference performance on this data. As there are no ground-truth graphs for this data, we construct our own using the knowledge of original clusters. For each cluster separately, we use all available data, preprocess them by subtracting the mean, and use a graph learning algorithm  to infer the graph. We then treat those graphs as ground-truth for comparison and present the results in Table (3).
Next, we evaluate how well trained models generalise to new data. Namely, we separate both temperature and wind data into 600 training and 100 testing signals. We then randomly choose subsets of training data of different sizes to fit generative models. Each algorithm has been ran 5 times with different random initialisations, and the best run in terms of training data clustering performance has been selected. The unseen (test) data is then clustered using trained models. The whole experiment has been repeated 100 times, each time with different randomly selected measurements. Figure 3 shows the test data clustering NMSE () for all three methods, given different training set sizes.
We can see a significantly better performance in our method compared to the others in terms of generalisability to new data. We reiterate that this is especially significant as the results are demonstrated on data that does not inherently live on a graph. We argue that this shows the flexibility of our model, as it is able to well generalise to unseen data on signals that do not necessarily live on a mixture of graphs (but might be well representable with them). Furthermore, we note that while increasing the size of our training set improves the performance of more adaptive algorithms, like GMM and GLMM, K-means does not show the possibility to adapt due to its very simple nature.
As there are no ground-truth graphs, it is difficult to compare inference performance in a fair way. We therefore further investigate graph quality by visual inspection. Figure 4 shows inferred temperature and wind graph topologies for GLMM. Figure 5 presents the same for GMM. We note that these are geographical graphs, plotted with node position corresponding to true measuring station coordinates. We can see that graphs inferred by GLMM offer much more structure, mostly connecting neighbouring nodes.
5.2.2 MNIST digits
Finally, we comment on the advantage our method brings over standard GMMs in terms of interpretability in high dimensional settings. We tackle a well known classification problem in which there is no reason to believe that signals live on a graph. Each signal is one MNIST digit representing ”0” or ”1”. Since each digit is given as a 20x20 pixel grayscale image, the dimension of our signal is 400. We randomly choose 1000 digits from class ’0’ and 1000 digits from class ’1’. In order to give complete results, we show clustering performance of all three algorithms in Table (4). However, we note that MNIST digit clustering is inherently a much lower dimensional problem, and should be treated as such. The experiments confirm this, giving better results for simpler methods: K-means performs best as the simplest model, while GLMM still performs better than GMM due to its more focused nature and lower sensitivity to noise.
While all inspected algorithms achieve acceptable results in terms of clustering performance, we discuss here the interpretability advantage offered by GLMM in these very high dimensional settings. Figure 6 shows the inferred graph topology corresponding to the cluster of digit zero. To visualise the graph, we only considered edges adjacent to at least one vertex with a significant mean value larger than a threshold.
As expected, we can see that neighbouring pixels that form the number zero are strongly correlated. We can also see that pixels are rarely connected if they are not close in the image, even though no pixel position information was given to the graph inference algorithm.
We finish with noting that the cost of using this model when data does not live on a graph comes, of course, in terms of restrictions imposed by a valid graph Laplacian, as well as the sparsity constraint in the graph learning method. These restrictions reduce model flexibility and could therefore lead to less accurate results. However, in large dimensional settings where this model is meant to be used, these restrictions are not too constraining. Even more, as they implicitly reduce the dimensionality of the problem, they seem to even ameliorate the performance on some datasets (as seen in Table 3).
We propose a generative model for mixtures of signals living on a collection of different graphs. We assume that both the mapping of signals to graphs, as well as the topologies of the graphs are unknown. We therefore jointly decouple the signals into clusters and infer multiple graph structures, one for each cluster. To do so, we formulate a Maximum a posteriori problem and solve it through Expectation minimisation. Our experimental results indicate that our method outperforms other baseline methods both on synthetic and real data. We further argue that our model can be used for high dimensional clustering even when data is not assumed to have inherent graph structures. This further brings additional interpretability to clustering results, which is increasingly important in data science applications.
-  Xiaowen Dong, Dorina Thanou, Michael Rabbat, and Pascal Frossard, “Learning graphs from data: A signal representation perspective,” arXiv preprint arXiv:1806.00848, 2018.
-  Hilmi E Egilmez, Eduardo Pavez, and Antonio Ortega, “Graph learning from data under laplacian and structural constraints,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 6, pp. 825–841, 2017.
-  Santiago Segarra, Antonio G Marques, Gonzalo Mateos, and Alejandro Ribeiro, “Network topology inference from spectral templates,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 467–483, 2017.
-  Richard E Bellman, Adaptive control processes: a guided tour, vol. 2045, Princeton university press, 2015.
-  Arthur P Dempster, “Covariance selection,” Biometrics, pp. 157–175, 1972.
-  Jerome Friedman, Trevor Hastie, and Robert Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.
-  Patrick Danaher, Pei Wang, and Daniela M Witten, “The joint graphical lasso for inverse covariance estimation across multiple classes,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 2, pp. 373–397, 2014.
-  Anani Lotsi and Ernst Wit, “High dimensional sparse gaussian graphical mixture model,” arXiv preprint arXiv:1308.3381, 2013.
-  Botao Hao, Will Wei Sun, Yufeng Liu, and Guang Cheng, “Simultaneous clustering and estimation of heterogeneous graphical models,” arXiv preprint arXiv:1611.09391, 2016.
David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre
“The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains,”IEEE Signal Processing Magazine, vol. 30, no. 3, pp. 83–98, 2013.
-  X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst, “Learning laplacian matrix in smooth graph signal representations,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6160–6173, 2016.
-  Vassilis Kalofolias, “How to learn a graph from smooth signals,” in Artificial Intelligence and Statistics, 2016, pp. 920–929.
-  Bastien Pasdeloup, Vincent Gripon, Grégoire Mercier, Dominique Pastor, and Michael G Rabbat, “Characterization and inference of graph diffusion processes from observations of stationary signals,” IEEE Transactions on Signal and Information Processing over Networks, 2017.
-  Dorina Thanou, Xiaowen Dong, Daniel Kressner, and Pascal Frossard, “Learning heat diffusion graphs,” IEEE Transactions on Signal and Information Processing over Networks, vol. 3, no. 3, pp. 484–499, 2017.
-  Hermina Petric Maretic, Dorina Thanou, and Pascal Frossard, “Graph learning under sparsity priors,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. Ieee, 2017, pp. 6523–6527.
-  Vassilis Kalofolias, Andreas Loukas, Dorina Thanou, and Pascal Frossard, “Learning time varying graphs,” in Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on. IEEE, 2017, pp. 2826–2830.
-  Santiago Segarra, Yuhao Wangt, Caroline Uhler, and Antonio G Marques, “Joint inference of networks from stationary graph signals,” in Signals, Systems, and Computers, 2017 51st Asilomar Conference on. IEEE, 2017, pp. 975–979.
-  Olav Kallenberg, Foundations of modern probability, Springer Science & Business Media, 2006.
-  Benjamin Girault, Signal processing on graphs-contributions to an emerging field, Ph.D. thesis, Ecole normale supérieure de lyon-ENS LYON, 2015.