 # A path following algorithm for the graph matching problem

We propose a convex-concave programming approach for the labeled weighted graph matching problem. The convex-concave programming formulation is obtained by rewriting the weighted graph matching problem as a least-square problem on the set of permutation matrices and relaxing it to two different optimization problems: a quadratic convex and a quadratic concave optimization problem on the set of doubly stochastic matrices. The concave relaxation has the same global minimum as the initial graph matching problem, but the search for its global minimum is also a hard combinatorial problem. We therefore construct an approximation of the concave problem solution by following a solution path of a convex-concave problem obtained by linear interpolation of the convex and concave formulations, starting from the convex relaxation. This method allows to easily integrate the information on graph label similarities into the optimization problem, and therefore to perform labeled weighted graph matching. The algorithm is compared with some of the best performing graph matching methods on four datasets: simulated graphs, QAPLib, retina vessel images and handwritten chinese characters. In all cases, the results are competitive with the state-of-the-art.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The graph matching problem is among the most important challenges of graph processing, and plays a central role in various fields of pattern recognition. Roughly speaking, the problem consists in finding a correspondence between vertices of two given graphs which is optimal in some sense. Usually, the optimality refers to the alignment of graph structures and, when available, of vertices labels, although other criteria are possible as well. A non-exhaustive list of graph matching applications includes document processing tasks like optical character recognition

[1, 2], image analysis (2D and 3D) [3, 4, 5, 6], or bioinformatics [7, 8, 9].

During the last decades, many different algorithms for graph matching have been proposed. Because of the combinatorial nature of this problem, it is very hard to solve it exactly for large graphs, however some methods based on incomplete enumeration may be applied to search for an exact optimal solution in the case of small or sparse graphs. Some examples of such algorithms may be found in [10, 11, 12].

Another group of methods includes approximate algorithms which are supposed to be more scalable. The price to pay for the scalability is that the solution found is usually only an approximation of the optimal matching. Approximate methods may be divided into two groups of algorithms. The first group is composed of methods which use spectral representations of adjacency matrices, or equivalently embed the vertices into a Euclidean space where linear or nonlinear matching algorithms can be deployed. This approach was pioneered by Umeyama , while further different methods based on spectral representations were proposed in [14, 5, 4, 3, 15]. The second group of approximate algorithms is composed of algorithms which work directly with graph adjacency matrices, and typically involve a relaxation of the discrete optimization problem. The most effective algorithms were proposed in [16, 17, 18, 6].

An interesting instance of the graph matching problem is the matching of labeled graphs. In that case, graph vertices have associated labels, which may be numbers, categorical variables, etc… The important point is that there is also a similarity measure between these labels. Therefore, when we search for the optimal correspondence between vertices, we search a correspondence which matches not only the structures of the graphs but also vertices with similar labels. Some widely used approaches for this application only use the information about similarities between graph labels. In vision, one such algorithm is the shape context algorithm proposed in

, which involves an efficient algorithm of node label construction. Another example is the BLAST-based algorithms in bioinformatics such as the Inparanoid algorithm , where correspondence between different protein networks is established on the basis of BLAST scores between pairs of proteins. The main advantages of all these methods are their speed and simplicity. However, the main drawback of these methods is that they do not take into account information about the graph structure. Some graph matching methods try to combine information on graph structures and vertex similarities, examples of such method are presented in [18, 7].

In this article we propose an approximate method for labeled weighted graph matching, based on a convex-concave programming approach which can be applied for matching of graphs of large sizes. Our method is based on a formulation of the labeled weighted graph matching problem as a quadratic assignment problem (QAP) over the set of permutation matrices, where the quadratic term encodes the structural compatibility and the linear term encodes local compatibilities. We propose two relaxations of this problem, resulting in one quadratic convex and one quadratic concave minimization problem on the set of doubly stochastic matrices. While the concave relaxation has the same global minimum as the initial QAP, solving it is also a hard combinatorial problem. We find a local minimum of this problem by following a solution path of a family of convex-concave minimization problems, obtained by linearly interpolating between the convex and concave relaxations. Starting from the convex formulation with a unique local (and global) minimum, the solution path leads to a local optimum of the concave relaxation. We refer to this procedure as the PATH algorithm111The PATH algorithm as well as other referenced approximate methods are implemented in the software GraphM available at http://cbio.ensmp.fr/graphm. We perform an extensive comparison of this PATH algorithm with several state-of-the-art matching methods on small simulated graphs and various QAP benchmarks, and show that it consistently provides state-of-the-art performances while scaling to graphs of up to a few thousands vertices on a modern desktop computer. We further illustrate the use of the algorithm on two applications in image processing, namely the matching of retina images based on vessel organization, and the matching of handwritten chinese characters.

The rest of the paper is organized as follows: Section 2 presents the mathematical formulation of the graph matching problem. In Section 3, we present our new approach. Then, in Section 4, we present the comparison of our method with Umeyama’s algorithm 

and the linear programming approach

 on the example of artificially simulated graphs. In Section 5, we test our algorithm on the QAP benchmark library, and we compare obtained results with the results published in  for the QBP algorithm and graduated assignment algorithms. Finally, in Section 6 we present two examples of applications to real-world image processing tasks.

## 2 Problem description

A graph of size is defined by a finite set of vertices and a set of edges . We consider only undirected graphs with no self-loop, i.e., such that and for any vertices . Each such graph can be equivalently represented by a symmetric adjacency matrix of size , where is equal to one if there is an edge between vertex and vertex , and zero otherwise. An interesting generalization is a weighted graph which is defined by association of nonnegative real values (weights) to all edges of graph . Such graphs are represented by real valued adjacency matrices with . This generalization is important because in many applications the graphs of interest have associated weights for all their edges, and taking into account these weights may be crucial in construction of efficient methods. In the following, when we talk about “adjacency matrix” we mean a real-valued “weighted” adjacency matrix.

Given two graphs and with the same number of vertices , the problem of matching and consists in finding a correspondence between vertices of and vertices of which aligns and in some optimal way. We will consider in Section 3.8 an extension of the problem to graphs of different sizes. For graphs with the same size , the correspondence between vertices is a permutation of vertices, which can be defined by a permutation matrix , i.e., a -valued matrix with exactly one entry in each column and each row. The matrix entirely defines the mapping between vertices of and vertices of , being equal to if the -th vertex of is matched to the -th vertex of , and otherwise. After applying the permutation defined by to the vertices of we obtain a new graph isomorphic to which we denote by . The adjacency matrix of the permuted graph, , is simply obtained from by the equality .

In order to assess whether a permutation defines a good matching between the vertices of and those of , a quality criterion must be defined. Although other choices are possible, we focus in this paper on measuring the discrepancy between the graphs after matching, by the number of edges (in the case of weighted graphs, it will be the total weight of edges) which are present in one graph and not in the other. In terms of adjacency matrices, this number can be computed as:

 F0(P)=||AG−AP(H)||2F=||AG−PAHPT||2F, (1)

where is the Frobenius matrix norm defined by . A popular alternative to the Frobenius norm formulation (1) is the -norm formulation obtained by replacing the Frobenius norm by the -norm , which is equal to the square of the Frobenius norm when comparing -valued matrices, but may differ in the case of general matrices.

Therefore, the problem of graph matching can be reformulated as the problem of minimizing over the set of permutation matrices. This problem has a combinatorial nature and there is no known polynomial algorithm to solve it . It is therefore very hard to solve it in the case of large graphs, and numerous approximate methods have been developed.

An interesting generalization of the graph matching problem is the problem of labeled graph matching. Here, each graph has associated labels to all its vertices and the objective is to find an alignment that fits well graph labels and graph structures at the same time. If we let denote the cost of fitness between the -th vertex of and the -th vertex of , then the matching problem based only on label comparison can be formulated as follows:

 minP∈P tr(CTP)=N∑i=1N∑j=1CijPij, (2)

where denotes the set of permutation matrices. A natural way of unifying (2) and (1) to match both the graph structure and the vertices’ labels is then to minimize a convex combination :

 minP∈P (1−α)F0(P)+αtr(CTP), (3)

that makes explicit, through the parameter , the trade-off between cost of individual matchings and faithfulness to the graph structure. A small value emphasizes the matching of structures, while a large value gives more importance to the matching of labels.

### 2.1 Permutation matrices

Before describing how we propose to solve (1) and (3), we first introduce some notations and bring to notice some important characteristics of these optimization problems. They are defined on the set of permutation matrices, which we denoted by . The set is a set of extreme points of the set of doubly stochastic matrices, that is, matrices with nonnegative entries and with row sums and column sums equal to one: , where denotes the

-dimensional vector of all ones

. This shows that when a linear function is minimized over the set of doubly stochastic matrices , a solution can always be found in the set of permutation matrices. Consequently, minimizing a linear function over is in fact equivalent to a linear program and can thus be solved in polynomial time by, e.g., interior point methods . In fact, one of the most efficient methods to solve this problem is the Hungarian algorithm, which uses a specific primal-dual strategy to solve this problem in . Note that the Hungarian algorithm allows to avoid the generic complexity associated with a linear program with variables.

At the same time may be considered as a subset of orthonormal matrices of and in fact . An (idealized) illustration of these sets is presented in Figure 1: the discrete set of permutation matrices is the intersection of the convex set of doubly stochastic matrices and the manifold of orthogonal matrices. Figure 1: Relation between three matrix sets. O—set of orthogonal matrices, D — set of doubly stochastic matrices, P=D∩O—set of permutation matrices.

### 2.2 Approximate methods: existing works

A good review of graph matching algorithms may be found in 

. Here, we only present a brief description of some approximate methods which illustrate well ideas behind two subgroups of these algorithms. As mentioned in the introduction, one popular approach to find approximate solutions to the graph matching problem is based on the spectral decomposition of the adjacency matrices of the graphs to be matched. In this approach, the singular value decompositions of the graph adjacency matrices are used:

 AG=UGΛGUTG, AH=UHΛHUTH,

where the columns of the orthogonal matrices and

consist of eigenvectors of

and respectively, and and

are diagonal matrices of eigenvalues.

If we consider the rows of eigenvector matrices and

as graph node coordinates in eigenspaces, then we can match the vertices with similar coordinates through a variety of methods

[13, 5, 15]. However, these methods suffer from the fact that the spectral embedding of graph vertices is not uniquely defined. First, the unit norm eigenvectors are at most defined up to a sign flip and we have to choose their signs synchronously. Although it is possible to use some normalization convention, such as choosing the sign of each eigenvector in such a way that the biggest component is always positive, this usually does not guarantee a perfect sign synchronization, in particular in the presence of noise. Second, if the adjacency matrix has multiple eigenvalues, then the choice of eigenvectors becomes arbitrary within the corresponding eigen-subspace, as they are defined only up to rotations .

One of the first spectral approximate algorithms was presented by Umeyama . To avoid the ambiguity of eigenvector selection, Umeyama proposed to consider the absolute values of eigenvectors. According to this approach, the correspondence between graph nodes is established by matching the rows of and (which are defined as matrices of absolute values). The criterion of optimal matching is the total distance between matched rows, leading to the optimization problem:

 minP∈P ∥ |UG|−P|UH| ∥F,

or equivalently:

 maxP∈P tr(|UH||UG|TP). (4)

The optimization problem (4) is a linear program on the set of permutation matrices which can be solved by the Hungarian algorithm in [27, 28].

The second group of approximate methods consists of algorithms which work directly with the objective function in (1), and typically involve various relaxations to optimizations problems that can be efficiently solved. An example of such an approach is the linear programming method proposed by Almohamad and Duffuaa in . They considered the -norm as the matching criterion for a permutation matrix :

 F′0(P)=||AG−PAHPT||1=||AGP−PAH||1.

The linear program relaxation is obtained by optimizing on the set of doubly stochastic matrices instead of :

 minP∈D F′0(P), (5)

where the 1-norm of a matrix is defined as the sum of the absolute values of all the elements of a matrix. A priori the solution of (5

) is an arbitrary doubly stochastic matrix

, so the final step is a projection of on the set of permutation matrices (we let denote the projection of onto ) :

 P∗=ΠPX=argminP∈P ||P−X||2F,

or equivalently:

 P∗=argmaxP∈PXTP. (6)

The projection (6) can be performed with the Hungarian algorithm, with a complexity cubic in the dimension of the problem. The main disadvantage of this method is that the dimensionality (i.e., number of variables and number of constraints) of the linear program (6) is , and therefore it is quite hard to process graphs of size more than one hundred nodes.

Other convex relaxations of (1) can be found in  and . In the next section we describe our new algorithm which is based on the technique of convex-concave relaxations of the initial problems (1) and (3).

## 3 Convex-concave relaxation

Let us start the description of our algorithm for unlabeled weighted graphs. The generalization to labeled weighted graphs is presented in Section 3.7. The graph matching criterion we consider for unlabeled graphs is the square of the Frobenius norm of the difference between adjacency matrices (1). Since permutation matrices are also orthogonal matrices (i.e., and ), we can rewrite on as follows:

 F0(P)=∥AG−PAHPT∥2F=∥(AG−PAHPT)P∥2F=∥AGP−PAH∥2F.

The graph matching problem is then the problem of minimizing over , which we call GM:

 {GM}: minP∈P F0(P). (7)

### 3.1 Convex relaxation

A first relaxation of GM is obtained by expanding the convex quadratic function on the set of doubly stochastic matrices :

 {QCV}: minP∈D F0(P). (8)

The QCV problem is a convex quadratic program that can be solved in polynomial time, e.g., by the Frank-Wolfe algorithm  (see Section 3.5 for more details). However, the optimal value is usually not an extreme points of , and therefore not a permutation matrix. If we want to use only QCV for the graph matching problem, we therefore have to project its solution on the set of permutation matrices, and to make, e.g., the following approximation:

 argminP F0(P)≈ΠPargminD F0(P). (9)

Although the projection can be made efficiently in by the Hungarian algorithm, the difficulty with this approach is that if is far from then the quality of the approximation (9) may be poor: in other words, the work performed to optimize is partly lost by the projection step which is independent of the cost function. The PATH algorithm that we present later can be thought of as a improved projection step that takes into account the cost function in the projection.

### 3.2 Concave relaxation

We now present a second relaxation of GM, which results in a concave minimization problem. For that purpose, let us introduce the diagonal degree matrix of an adjacency matrix , which is the diagonal matrix with entries for , as well as the Laplacian matrix . having only nonnegative entries, it is well-known that the Laplacian matrix is positive semidefinite . We can now rewrite as follows:

 F0(P)=||AGP−PAH||2F=||(DGP−PDH)−(LGP−PLH)||2F=||DGP−PDH||2F−2tr[(DGP−PDH)T(LGP−PLH)]+||LGP−PLH||2F. (10)

Let us now consider more precisely the second term in this last expression:

 tr[(DGP−PDH)T(LGP−PLH)]=trPPTDTGLG∑d2G(i)+trLHDTHPTP∑d2H(i)−trPTDTGPLH∑dG(i)dP(H)(i)−trDTHPTLGP∑dP(H)(i)dG(i)=∑(dG(i)−dP(H)(i))2=∥DG−DP(H)∥2F=∥DGP−PDH∥2F. (11)

Plugging (11) into (10) we obtain

 F0(P)=∥DGP−PDH∥2F−2∥DGP−PDH∥2F +∥LGP−PLH∥2F=−∥DGP−PDH∥2F+∥LGP−PLH∥2F=−∑i,jPij(DG(j)−DH(i))2+tr(PPTILTGLG)+tr(LTHPTPILH)−2tr(PTLTGPLH)vec(P)T(LTH⊗LTG)vec(P)=−tr(ΔP)+tr(L2G)+tr(L2H)−2vec(P)T(LTH⊗LTG)vec(P), (12)

where we introduced the matrix and we used to denote the Kronecker product of two matrices (definition of the Kronecker product and some important properties may be found in the appendix B).

Let us denote the part of (12) which depends on :

 F1(P)=−tr(ΔP)−2vec(P)T(LTH⊗LTG)vec(P).

From (12) we see that the permutation matrix which minimizes over is the solution of the graph matching problem. Now, minimizing over gives us a relaxation of (7) on the set of doubly stochastic matrices. Since graph Laplacian matrices are positive semi-definite, the matrix is also positive semidefinite as a Kronecker product of two symmetric positive semi-definite matrices . Therefore the function is concave on , and we obtain a concave relaxation of the graph matching problem:

 {QCC}: minP∈D F1(P). (13)

Interestingly, the global minimum of a concave function is necessarily located at a boundary of the convex set where it is minimized, so the minimium of on is in fact in .

At this point, we have obtained two relaxations of GM as nonlinear optimization problems on : the first one is the convex minimization problem QCV (8), which can be solved efficiently but leads to a solution in that must then be projected onto , and the other is the concave minimization problem QCC (13) which does not have an efficient (polynomial) optimization algorithm but has the same solution as the initial problem GM. We note that these convex and concave relaxation of the graph matching problem can be more generally derived for any quadratic assignment problems .

### 3.3 PATH algorithm

We propose to approximately solve QCC by tracking a path of local minima over of a series of functions that linearly interpolate between and , namely:

 Fλ(P)=(1−λ)F0(P)+λF1(P),

for . For all , is a quadratic function (which is in general neither convex nor concave for away from zero or one). We recover the convex function for , and the concave function for . Our method searches sequentially local minima of , where moves from to . More precisely, we start at , and find the unique local minimum of (which is in this case its unique global minimum) by any classical QP solver. Then, iteratively, we find a local minimum of given a local minimum of by performing a local optimization of starting from the local minimum of , using for example the Frank-Wolfe algorithm. Repeating this iterative process for small enough we obtain a path of solutions , where and is a local minimum of for all . Noting that any local minimum of the concave function on is in , we finally output as an approximate solution of GM.

Our approach is similar to graduated non-convexity : this approach is often used to approximate the global minimum of a non-convex objective function. This function consists of two part, the convex component, and non-convex component, and the graduated non-convexity framework proposes to track the linear combination of the convex and non-convex parts (from the convex relaxation to the true objective function) to approximate the minimum of the non-convex function. The PATH algorithm may indeed be considered as an example of such an approach. However, the main difference is the construction of the objective function. Unlike , we construct two relaxations of the initial optimization problem, which lead to the same value on the set of interest (), the goal being to choose convex/concave relaxations which approximate in the best way the objective function on the set of permutation matrices.

The pseudo-code for the PATH algorithm is presented in Figure 2. The rationale behind it is that among the local minima of on , we expect the one connected to the global minimum of through a path of local minima to be a good approximation of the global minima. Such a situation is for example shown in Figure 3, where in dimension the global minimum of a concave quadratic function on an interval (among two candidate points) can be found by following the path of local minima connected to the unique global minimum of a convex function. Figure 3: Illustration for path optimization approach. F0 (λ=0) — initial convex function, F1 (λ=1) — initial concave function, bold black line — path of function minima P∗(λ) (λ=0…0.1…0.2…0.3…0.75…1)

More precisely, and although we do not have any formal result about the optimality of the PATH optimization method (beyond the lack of global optimality, see Appendix A), we can mention a few interesting properties of this method:

• We know from (12) that for , , where is a constant independent of . As a result, it holds for all that, for :

 Fλ(P)=F0(P)−λκ.

This shows that if for some the global minimum of over lies in , then this minimum is also the global minimum of over and therefore the optimal solution of the initial problem. Hence, if for example the global minimum of is found on by the PATH algorithm (for instance, if is still convex), then the PATH algorithm leads to the global optimum of . This situation can be seen in the toy example in Figure 3 where, for , has its unique minimum at the boundary of the domain.

• The sub-optimality of the PATH algorithm comes from the fact that, when increases, the number of local minima of may increase and the sequence of local minima tracked by PATH may not be global minima. However we can expect the local minima followed by the PATH algorithm to be interesting approximations for the following reason. First observe that if and are two local minima of for some , then the restriction of to being a quadratic function it has to be concave and and must be on the boundary of . Now, let be the smallest such that has several local minima on . If denotes the local minima followed by the PATH algorithm, and denotes the “new” local minimum of , then necessarily the restriction of to must be concave and have a vanishing derivative in (otherwise, by continuity of in , there would be a local minimum of near for slightly smaller than ). Consequently we necessarily have . This situation is illustrated in Figure 3 where, when the second local minimum appears for , it is worse than the one tracked by the PATH algorithm. More generally, when “new” local minima appear, they are strictly worse than the one tracked by the PATH algorithm. Of course, they may become better than the PATH solution when continues to increase.

Of course, in spite of these justifications, the PATH algorithm only gives an approximation of the global minimum in the general case. In Appendix A, we provide two simple examples when the PATH algorithm respectively succeeds and fails to find the global minimum of the graph matching problem.

### 3.4 Numerical continuation method interpretation

Our path following algorithm may be considered as a particular case of numerical continuation methods (sometimes called path following methods) 

. These allow to estimate curves given in the following implicit form:

 T(u)=0 where T is a mapping T:RK+1→RK. (14)

In fact, our PATH algorithm corresponds to a particular implementation of the so-called Generic Predictor Corrector Approach  widely used in numerical continuation methods.

In our case, we have a set of problems parametrized by . In other words for each we have to solve the following system of Karush-Kuhn-Tucker (KKT) equations:

 (1−λ)∇PF0(P)+λ∇PF1(P)+BTν+μS = 0, BP−12N = 0, PS = 0,

where is a set of active constraints, i.e., of pairs of indices that satisfy , codes the conditions and , and are dual variables. We have to solve this system for all possible sets of active constraints on the open set of matrices that satisfy for , in order to define the set of stationary points of the functions . Now if we let denote the left-hand part of the KKT equation system then we have exactly (14) with . From the implicit function theorem , we know that for each set of constraints ,

 WS={(P,ν,μS,λ):T(P,ν,μS,λ)=0 and T′(P,ν,μS,λ) has the maximal % possible rank}

is a smooth 1-dimensional curve or the empty set and can be parametrized by . In term of the objective function , the condition on may be interpreted as a prohibition for the projection of on any feasible direction to be a constant. Therefore the whole set of stationary points of when is varying from 0 to 1 may be represented as a union where each is homotopic to a 1-dimensional segment. The set may have quite complicated form. Some of may intersect each other, in this case we observe a bifurcation point, some of may connect each other, in this case we have a transformation point of one path into another, some of may appear only for and/or disappear before reaches 1. At the beginning the PATH algorithm starts from , then it follows until the border of (or a bifurcation point). If such an event occurs before then PATH moves to another segment of solutions corresponding to different constraints , and keeps moving along segments and sometimes jumping between segments until . As we said in the previous section one of the interesting properties of PATH algorithm is the fact that if appears only when and is a local minimum then the value of the objective function in is greater than in the point traced by the PATH algorithm.

### 3.5 Some implementation details

In this section we provide a few details relevant for the efficient implementation of the PATH algorithms.

#### Frank-Wolfe

Among the different optimization techniques for the optimization of starting from the current local minimum tracked by the PATH algorithm, we use in our experiments the Frank-Wolfe algorithm which is particularly suited to optimization over doubly stochastic matrices . The idea of the this algorithm is to sequentially minimize linear approximations of . Each step includes three operations:

1. estimation of the gradient ,

2. resolution of the linear program ,

3. line search: finding the minimum of on the segment .

An important property of this method is that the second operation can be done efficiently by the Hungarian algorithm, in .

Another essential point is that we do not need to store matrices of size for the computation of

, because the tensor product in

can be expressed in terms of matrix multiplication:

 ∇F1(P)=−vec(ΔT)−2(LTH⊗LTG)vec(P)=−vec(ΔT)−2vec(LTGPLH).

The same thing may be done for the gradient of the convex component

 ∇F0(P)=∇[vec(P)TQvec(P)] where Q=(I⊗AG−ATH⊗I)T(I⊗AG−ATH⊗I)∇F0(P)=2Qvec(P)=2vec(A2GP−ATGPATH−AGPAH+PA2H)

#### Initialization

The proposed algorithm can be accelerated by the application of Newton algorithm as the first step of QCV (minimization of ). First, let us rewrite the QCV problem as follows:

 minP∈D ∥AGP−PAH∥2F⇔ minP∈D vec(P)TQvec(P)⇔⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩minPvec(P)TQvec(P) such that Bvec(P)=12Nvec(P)≥0N2 (15)

where is the matrix which codes the conditions and . The Lagrangian has the following form

 L(P,ν,λ)=vec% (P)TQvec(P)+νT(Bvec(P)−12N)+μTvec(P),

where and are Lagrange multipliers. Now we would like to use Newton method for constrained optimization  to solve (15). Let denote the set of variables associated to the set of active constraints at the current points, then the Newton step consist in solving the following system of equations:

 ⎡⎢ ⎢⎣2QBTIaB00Ia00⎤⎥ ⎥⎦⎡⎢ ⎢⎣vec(P)νμa⎤⎥ ⎥⎦=⎡⎢ ⎢⎣010⎤⎥ ⎥⎦N2 elements,2N elements,\footnotesize\# of act. ineq. cons. (16)

More precisely we have to solve (16) for . The problem is that in general situations this problem is computationally demanding because it involves the inversion of matrices of size . In some particular cases, however, the Newton step becomes feasible. Typically, if none of the constraints are active, then (16) takes the following form222It is true if we start our algorithm, for example, from the constant matrix .:

 [2QBTB0][vec(P)ν]=  N2 elements,2N elements. (17)

The solution is then obtained as follows:

 vec(P)KKT=12Q−1BT(BQ−1BT)−112N. (18)

Because of the particular form of matrices and , the expression (18) may be computed very simply with the help of Kronecker product properties in instead of . More precisely, the first step is the calculation of where . The matrix may be represented as follows:

 Q−1=(UH⊗UG)(I⊗ΛG−ΛH⊗I)−2(UH⊗UG)T. (19)

Therefore the -th element of is the following product:

 BiQ−1BTj=vec(UTH˜BiTUG)T)(ΛG−ΛH)−2×vec(UTG˜BjTUH), (20)

where is the -th row of and is reshaped into a matrix. The second step is an inversion of the matrix , and a sum over columns . The last step is a multiplication of by , which can be done with the same tricks as the first step. The result is the value of matrix . We then have two possible scenarios:

1. If , then we have found the solution of (15).

2. Otherwise we take the point of intersection of the line and the border as the next point and we continue with Frank-Wolfe algorithm. Unfortunately we can do the Newton step only once, then some of constraints become active and efficient calculations are not feasible anymore. But even in this case the Newton step is generally very useful because it decreases a lot the value of the objective function.

In practice, we found it useful to have the parameter in the algorithm of Figure 2 vary between iterations. Intuitively, should depend on the form of the objective function as follows: if is smooth and if increasing the parameter does not change a lot the form of the function, then we can afford large steps, in contrast, we should do a lot of small steps in the situation where the objective function is very sensitive to changes in the parameter . The adaptive scheme we propose is the following. First, we fix a constant , which represents the lower limit for . When the PATH algorithm starts, is set to . If we see after an update that then we double and keep multiplying by as long as . On the contrary, if is too large in the sense that , then we divide by until the criterion is met, or . Once the update on is done, we run the optimization (Frank-Wolfe) for the new value . The idea behind this simple adaptation schema is to choose which keeps just below .

#### Stopping criterion

The choice of the update criterion is not unique. Here we check whether the function value has been changed a lot at the given point. But in fact it may be more interesting to trace the minimum of the objective function. To compare the new minimum with the current one, we need to check the distance between these minima and the difference between function values. It means that we use the following condition as the stopping criterion

 |Fλnew(P∗(λnew))−Fλ(P∗(λ))|<ϵFλ  and   ||P∗(λnew)−P∗(λ)||<ϵPλ

Although this approach takes a little bit more computations (we need to run Frank-Wolfe on each update of ), it is quite efficient if we use the adaptation schema for .

To fix the values and we use a parameter which defines a ratio between these parameters and the parameters of the stopping criterion used in the Frank-Wolfe algorithm: (limit value of function decrement) and (limit value of argument changing): and . The parameter represents an authorized level of stopping criterion relaxation when we increment . In practice, it means that when we start to increment we may move away from the local minima and the extent of this move is defined by the parameter . The larger the value of , the further we can move away and the larger may be used. In other words, the parameter controls the width of the tube around the path of optimal solutions.

### 3.6 Algorithm complexity

Here we present the complexity of the algorithms discussed in the paper.

• Umeyama’s algorithm has three components: matrix multiplication, calculation of eigenvectors and application of the Hungarian algorithm for (4). Complexity of each component is equal to . Thus Umeyama’s algorithm has complexity .

• LP approach (5) has complexity (worst case) because it may be rewritten as an linear optimization problem with variables .

In the PATH algorithm, there are three principal parameters which have a big impact on the algorithm complexity. These parameters are , , and . The first parameter defines the precision of the Frank-Wolfe algorithm, in some cases its speed may be sublinear , however it should work much better when the optimization polytope has a “smooth” border .

The influence of the ratio parameter is more complicated. In practice, in order to ensure that the objective function takes values between and , we usually use the normalized version of the objective function:

 Fnorm(P)=||AGP−PAH||2F||AG||2F+||AH||2F

In this case if we use the simple stopping criterion based on the value of the objective function then the number of iteration over (number of Frank-Wolfe algorithm runs) is at least equal to where .

The most important thing is how the algorithm complexity depends on the graph size . In general the number of iterations of the Frank-Wolfe algorithm scales as where is the conditional number of the Hessian matrix describing the objective function near a local minima . It means that in terms of numbers of iterations, the parameter is not crucial. defines the dimensionality of the minimization problem, while may be close to zero or one depending on the graph structures, not explicitly on their size. On the other hand, has a big influence on the cost of one iteration. Indeed, in each iteration step we need to calculate the gradient and to minimize a linear function over the polytope of doubly stochastic matrices. The gradient estimation and the minimization may be done in . In Section 4.2 we present the empirical results on how algorithm complexity and optimization precision depend on (Figure 7b) and (Figure 8).

### 3.7 Vertex pairwise similarities

If we match two labeled graphs, then we may increase the performance of our method by using information on pairwise similarities between their nodes. In fact one method of image matching uses only this type of information, namely shape context matching . To integrate the information on vertex similarities we use the approach proposed in (3), but in our case we use instead of

 minP Fαλ(P)=minP (1−α)Fλ(P)+αtr(CTP),. (21)

The advantage of the last formulation is that is just with an additional linear term. Therefore we can use the same algorithm for the minimization of as the one we presented for the minimization of .

### 3.8 Matching graphs of different sizes

Often in practice we have to match graphs of different sizes and (suppose, for example, that ). In this case we have to match all vertices of graph to a subset of vertices of graph . In the usual case when , the error (1) corresponds to the number of mismatched edges (edges which exist in one graph and do not exist in the other one). When we match graphs of different sizes the situation is a bit more complicated. Let denote the set of vertices of graph that are selected for matching to vertices of graph , let denote all the rest. Therefore all edges of the graph are divided into four parts , where are edges between vertices from , are edges between vertices from , and are edges from to and from to respectively. For undirected graphs the sets and are the same (but, for directed graphs we do not consider, they would be different). The edges from , and are always mismatched and a question is whether we have to take them into account in the objective function or not. According to the answer we have three types of matching error (four for directed graphs) with interesting interpretations.

1. We count only the number of mismatched edges between and the chosen subgraph . It corresponds to the case when the matrix from (1) is a matrix of size and rows of contain only zeros.

2. We count the number of mismatched edges between and the chosen subgraph . And we also count all edges from , and . In this case from (1) is a matrix of size . And we transform matrix into a matrix of size of size by adding zero rows and zero columns. It means that we add dummy isolated vertices to the smallest graph, and then we match graphs of the same size.

3. We count the number of mismatched edges between and chosen subgraph . And we also count all edges from (or ). It means that we count matching error between and and we count also the number of edges which connect and . In other words we are looking for subgraph which is similar to and which is maximally isolated in the graph .

Each type of error may be useful according to context and interpretation, but a priori, it seems that the best choice is the second one where we add dummy nodes to the smallest graph. The main reason is the following. Suppose that graph is quite sparse, and suppose that graph has two candidate subgraphs (also quite sparse) and (dense). The upper bound for the matching error between and is , the lower bound for the matching error between and is . So if then we will always choose the graph with the first strategy, even if it is not similar at all to the graph . The main explanation of this effect lies in the fact that the algorithm tries to minimize the number of mismatched edges, and not to maximize the number of well matched edges. In contrast, when we use dummy nodes, we do not have this problem because if we take a very sparse subgraph it increases the number of edges in