Establishing correspondences (or matches) between two sets of semantic features is an important problem in computer vision, with applications such as panoramic stitching and object tracking. Over the past decades, graph matching has become one of the most popular and powerful approaches to feature correspondence: it regards each feature set as a graph and models the correspondence task as a quadratic assignment problem (QAP) that incorporates unary similarities and pairwise affinities [10, 19, 58]. Recently, many deep graph matching methods [40, 48, 17, 60, 42] that employ graph neural networks (GNNs) to aggregate neighbor information and form structured representations of nodes and/or edges have demonstrated strong matching performance.
Despite the progress made by these graph matching approaches, they depend heavily on the input graph patterns, which are usually built upon the raw feature sets using heuristics such as Delaunay triangulation, $k$-nearest neighbors or complete graphs. Such handcrafted strategies may fail to faithfully capture the semantic relationships between nodes and may introduce unreliable structural biases into the built graphs. For example, Fig. 1 (a) illustrates the graph structures of two car images from the Pascal VOC dataset, which are built from the annotated keypoints via Delaunay triangulation. The connections between the keypoints R_F_RoofTop, R_F_WheelCenter and R_SideviewMirror are inconsistent in the two graphs due to the significant viewpoint variation. Such differences in graph structure introduce incorrect relational biases into graph matching and finally lead to incorrect matches between the two keypoint sets.
Aiming to address the issues mentioned above, in this paper, instead of taking handcrafted graph structures as input, we work directly on the feature sets and learn simultaneously the discriminative feature representations and the semantic relationships to benefit the graph matching. Inspired by the success of the transformer framework  that employs pure attention computation models for representation learning, we design a joint graph learning and matching (GLAM) network by integrating relationship learning, representation learning and matching decision into a unified attention computation framework. Specifically, our attention module consists of two types of attention layers: (1) the self-attention layer (SAL) aims to learn reliable semantic relationships between keypoints for each feature set, and it forms structured feature representations according to the learnt graph structures; and (2) the cross-attention layer (CAL) computes cross-dependencies among the two feature sets to be matched for reconstructing the feature representations, and it also outputs a soft matching solution based on the computed cross-dependencies.
To evaluate the proposed GLAM network, we report its performance on three popular visual matching benchmarks, namely Willow Objects, Pascal VOC Keypoints and SPair-71k, in comparison with state-of-the-art deep graph matching algorithms. The experimental results show that our network outperforms all the baseline algorithms by significant margins on all three datasets, which validates its effectiveness and adaptability in various scenarios. Specifically, our algorithm achieves matching accuracies of 99.6%, 86.2% and 82.5%, outperforming the previously strongest competitors by 1.5%, 6.1% and 3.6%, on Willow Objects, Pascal VOC Keypoints and SPair-71k, respectively.
We further validate the learnt graph structures in Sec. 6 to illustrate their effectiveness. We first visualize the learnt graph patterns of all categories in the Pascal VOC Keypoints dataset, which exhibit rich and discriminative semantic relationships between keypoints, like the manually labeled ones. Moreover, we combine the learnt graph patterns with several state-of-the-art deep graph matching algorithms by replacing their heuristic graph construction strategies while keeping their main bodies unchanged. The comparison results show that all these algorithms benefit from the learnt graph patterns to different degrees, which demonstrates the effectiveness of the proposed attention module for relationship learning. Fig. 1 (b) shows a representative sample where the learnt graph structures are consistent across the two images. Employing the learnt graph patterns to boost graph matching, we achieve perfect matching results between the two keypoint sets. We will make our code publicly available once the paper is accepted.
In summary, with the proposed learnable full attention network for feature correspondence, this paper makes three contributions:
we analyze shortcomings of graph construction strategies in previous algorithms and propose to learn reliable graph structures to boost graph matching, which is, to the best of our knowledge, among the first attempts to learn proper graph patterns in the deep graph matching framework;
unlike previous deep graph matching algorithms that aggregate node/edge embeddings using various GNN models, we design a fully attention-based network that integrates the learning of graph structures, node representations and matching decisions into a unified framework; and
we validate the excellent performance of the proposed GLAM network on three public benchmarks, and provide insightful analysis and discussion about the effectiveness of learnt graph patterns.
In the rest of this paper, Sec. 2 summarizes related work; then Sec. 3 briefly reviews feature corresponding and graph matching formulation; after that, Sec. 4 describes the proposed framework for deep graph matching based on attention mechanism. We finally present experimental validations for graph matching in Sec. 5 and graph learning in Sec. 6, respectively.
2 Related works
2.1 Feature correspondence
To solve the feature correspondence problem, several early methods search for the optimal correspondences under an optimization scheme. For example, Lian and Zha formulate feature correspondence as minimizing a linear assignment function with quadratic regularization terms, and relax it to a concave quadratic program that is effectively solved by a large-scale concave optimizer. Subsequently, Wei et al. extend it to the point matching problem in the presence of outliers. For efficiency, the authors reduce the objective function to a function with few nonlinear terms by relaxing the one-to-one matching constraints, and adopt the branch-and-bound algorithm for optimization. Furthermore, CODE incorporates a coherence-based decision boundary under the assumption that pixels in a local region share the same motion, and embeds a non-linear regression technique into a correspondence likelihood model to solve the feature correspondence problem.
A similar assumption has also been utilized in other studies [5, 45, 4, 1, 63] that model feature correspondence as a correspondence selection task. Specifically, based on the observation that nearby features on the same object typically share similar homographies, Co-Seg, HPM and HVIV try to identify the correct correspondences by computing the densities of homographies via Hough voting from correspondence candidates. In addition, considering that corresponding points on two similar shapes generally have similar shape contexts, Belongie et al. attach a descriptor, named shape context, to each point to improve the matching accuracy. Although these methods achieve significant improvements in feature matching between rigid objects under different viewpoints, they usually fail to obtain satisfactory solutions for semantic matching of non-rigid objects, where keypoints with close semantic relationships do not always share the same motion. To address this issue, instead of directly using spatially local information for correspondence selection, NM-Net employs a compatibility-specific neighbor mining layer as a group model to explore reliable neighbors for each correspondence, and embeds it into a hierarchical deep learning network to classify the correspondences as positive or negative.
In the above-mentioned methods, the matching solutions are mainly determined under the unary keypoint similarities with local consistent constraints, while the structural agreements between correspondences are not fully explored in their procedure. By fully exploring structural cues, graph-based methods [64, 9, 50, 28] have become one of the most popular and effective branches for semantic feature correspondence, in which features in the object are organized as a graph and the problem of feature correspondence is hence reformulated as the graph matching problem.
2.2 Graph matching
Due to the combinatorial nature of the graph matching problem, many efforts [51, 31, 65, 62, 34, 29, 30, 23, 11, 8, 14] have been devoted to developing approximate algorithms that find acceptable solutions by relaxing some constraints. Readers can refer to the comprehensive surveys [10, 19, 58] for a detailed introduction to traditional learning-free graph matching methods.
Recently, with the growing interest in utilizing deep neural network for structured data, learning graph matching with graph neural network (GNN) has gained significant success. A classic way is to relax the graph matching problem to the linear assignment problem (LAP) and focus on learning representative node/edge embeddings or robust affinity metrics via message passing mechanism. For example, Nowak et al.  introduce a Siamese GNN encoder to generate normalized node embeddings for each graph to be matched, and then predict the matching solution by minimizing the cosine distance between the generated embeddings. Wang et al. [48, 49] employ the graph convolutional network (GCN) framework  to produce node embeddings by aggregating inner-graph and cross-graph structure information, and adopt the Sinkhorn network  to solve the relaxed linear assignment problem. Fey et al.  propose a two-stage graph matching framework that starts from an initial ranking of soft correspondences computed via graph neural networks, and iteratively re-ranks the solution by synchronous message passing networks to reach neighborhood consensus between matched node pairs.
Different from the above-mentioned methods, which focus mainly on learning node and/or edge embeddings and relax the quadratic matching constraints, several recently proposed methods work directly on pairwise affinities and embed strong differentiable solvers for quadratic optimization. For instance, Zanfir et al. formulate graph matching as a quadratic assignment problem under learnable unary and pairwise affinities, and adopt a spectral matching algorithm as the combinatorial solver for optimization. Rolínek et al. relax the graph matching problem based on Lagrangian decomposition, which is solved by embedding a blackbox implementation of a heavily optimized solver based on dual block coordinate ascent. Reformulating graph matching as Koopmans-Beckmann's QAP to minimize the adjacency discrepancy of the graphs to be matched, Gao et al. adopt the Frank-Wolfe algorithm as the solver to obtain approximate solutions. Besides, Wang et al. integrate the learning of affinities and the solving of the combinatorial optimization into a unified learnable framework, where the graph matching problem is transformed into a binary vertex classification problem on a constructed assignment graph. Later, a similar classification-based framework was proposed, which models the unary similarities and pairwise agreements in the adjacency matrix of a so-called association graph, and reformulates the graph matching problem as a node classification task under soft one-to-one constraints.
The GNN-based methods mentioned above perform message-passing along edges to form discriminative node representations, and their performances depend largely on the graph structures. However, all of them take the graph structures generated by some heuristic strategies as input, hence their performances are still limited in challenging scenarios where the generated graph patterns may not accurately reflect the inherent relationships between nodes (e.g., Fig 1(a)).
2.3 Attention mechanism
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various natural language processing (NLP) tasks. A notable work is the transformer model, which relies entirely on self-attention to compute representations of its input and output and has achieved great success in NLP. Inspired by that, many efforts have been devoted to designing attention-based models in the computer vision community [12, 25]. For example, Xu et al. propose an attention-based generative adversarial network for text-to-image generation, which consists of multi-stage refinement modules designed to generate high-quality sub-regions according to the text-image attention computed by a transformer encoder. For image segmentation, Ye et al. introduce a cross-modal self-attention network to effectively learn long-range dependencies from constructed multimodal features that represent both visual and linguistic information, enabling the model to adaptively focus on important image regions and informative keywords. To improve the quality of scene segmentation, Fu et al. introduce a dual attention network consisting of a position attention module and a channel attention module, both of which are designed on top of the transformer encoder. For image recognition, Dosovitskiy et al. split an image into a sequence of flattened patches whose linear projection features are directly taken as the input of a transformer encoder to learn the embedding of class tokens. More recently, automatically searching for transformer architectures has shown further improvements in visual recognition and downstream tasks.
Recently, several studies have been devoted to introducing attention mechanism into graph neural networks for graph representation learning. Example models include GAT  that leverages masked self-attentional layers in a convolution-style neural network, and U2GNN  that induces a powerful aggregation function leveraging a self-attention mechanism followed by a recurrent transition. As for the graph matching task, Yu et al.  propose a node and edge embedding strategy that enables the information in each channel to be merged independently, thus improving the robustness of the learned affinities. SuperGlue  borrows the attention encoder from the transformer model and integrates it into graph neural networks to generate discriminative node representations. Specifically, it jointly takes spatial relationships and visual information into considerations for node embedding using intra-graph and inter-graph attentions. As reported in these studies, embedding attention mechanism into graph neural networks improves matching accuracy, but the message passing procedure for updating node/edge embeddings still relies on the graph structures constructed in advance.
Our work falls into the group of deep graph matching algorithms and is designed on top of the attention mechanism. Compared with prior work, our learning framework focuses on learning not only the structured representations but also their relational inductive biases, which have not been well explored before. The performance of the proposed method is reported in Sec. 5, and the effectiveness of the learnt graph patterns is further validated in Sec. 6.
3 Problem Formulation
3.1 Feature Correspondence via Graph Matching
We now describe the problem of semantic feature correspondence and review its graph matching formulation. Given two images, a set of semantic keypoints is annotated on each of them. For simplicity, we assume that the two sets to be matched have the same size $n$; the following formulation and algorithms can be easily extended to handle different sizes, e.g., by adding dummy keypoints. The goal of feature correspondence is to find the optimal global correspondences, represented by an $n \times n$ binary assignment matrix, under one-to-one matching constraints, i.e., every row and every column of the assignment matrix sums to one.
As a popular approach to feature correspondence, graph matching methods build graphs on the feature sets to impose structural agreement while searching for the optimal matching. Specifically, through graph construction strategies such as Delaunay triangulation and $k$-nearest neighbors, a graph is built upon each keypoint set. Each graph consists of the keypoint set as its nodes, an edge set that is usually represented by an adjacency matrix, and the attributes of the nodes and edges, which may be hand-crafted features or learnable deep features. Given the two graphs, graph matching methods formulate the feature correspondence problem as maximizing a global consistency score that combines two kinds of terms: the unary similarity between each node in the first graph and its matched node in the second graph, and the pairwise affinity between each edge in the first graph and its matched edge in the second graph.
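For concreteness, the objective described above can be written in the standard quadratic assignment form. The symbols used here ($\mathbf{X}$ for the assignment matrix, $c_{ia}$ for unary similarities, $d_{ijab}$ for pairwise affinities, $n$ for the set size) are notation assumed for this sketch rather than taken from the source:

```latex
\begin{aligned}
\max_{\mathbf{X}} \quad & \sum_{i,a} c_{ia}\, X_{ia} \;+\; \sum_{i,j,a,b} d_{ijab}\, X_{ia} X_{jb} \\
\text{s.t.} \quad & \mathbf{X} \in \{0,1\}^{n \times n}, \qquad
\mathbf{X}\mathbf{1}_n = \mathbf{1}_n, \qquad
\mathbf{X}^{\top}\mathbf{1}_n = \mathbf{1}_n .
\end{aligned}
```

Under this reading, $X_{ia} = 1$ means that node $i$ of the first graph is matched to node $a$ of the second, so the quadratic term rewards pairs of matches whose edges $(i, j)$ and $(a, b)$ have high affinity, and the two equality constraints encode one-to-one matching.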
3.2 Learning of Graph Matching
In the task of learning feature correspondence, we aim to learn a function that maps two input sets of $d$-dimensional features to their feature correspondences, where the output is an assignment matrix.
Under the graph matching pipeline, this function can in general be decomposed into three functions: (1) a graph generation function that maps an input feature set to its graph structure, output as a weighted adjacency matrix; (2) a feature updating function that forms structured features according to the graph structure, producing an updated feature set for each graph; and (3) a matching decision function that maps the two graphs with updated features to the final matching solution.
In previous deep graph matching algorithms (e.g., [40, 48, 17, 60, 42, 53, 54, 41, 22]), the graph generation function is designed heuristically, for instance as Delaunay triangulation graphs, $k$-nearest neighbor graphs or complete graphs; the feature updating functions are realized by various multi-layer GNN models; and the matching decision function is usually implemented as a differentiable version of a combinatorial solver, e.g., the Sinkhorn network.
To explore reliable graph structures that boost graph matching, we propose to learn a proper graph generation function rather than use a heuristic one. Furthermore, the three functions are not designed separately but are intertwined in a unified attention-based framework, which is described in detail in Sec. 4.2.
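To make the contrast concrete, a heuristic graph generation function of the kind used by previous methods can be sketched as a $k$-nearest-neighbor graph builder. The function name `knn_adjacency`, the toy coordinates and the symmetrization step are illustrative assumptions, not taken from any of the cited implementations:

```python
import math


def knn_adjacency(points, k):
    """Build a symmetric 0/1 adjacency matrix linking each point to its k nearest neighbors."""
    n = len(points)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Rank all other points by Euclidean distance to point i.
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in others[:k]:
            adj[i][j] = 1
            adj[j][i] = 1  # symmetrize so that the graph is undirected
    return adj


# Toy keypoint coordinates (hypothetical values for illustration only).
keypoints = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 4.0)]
adjacency = knn_adjacency(keypoints, k=1)
```

The resulting graph depends only on keypoint coordinates, so a viewpoint change that moves the keypoints can change the edges even when the semantics of the object are unchanged. This is the limitation that a learnt graph generation function is meant to address.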
4 Full Attention-based Graph Matching Network
Given two images with the semantic keypoints to be matched, our GLAM network first extracts the visual attributes and positional embeddings, and takes their element-wise sum as the input to the learnable attention module. The attention module adopts both self-attention and cross-attention mechanisms for learning graph structures, updating keypoint features and finally computing the matching solution. Note that our network produces only soft matching solutions, as required for training, and we adopt the Hungarian method [27, 38] as a post-processing step to obtain the desired discrete solutions at inference time.
Fig. 2 illustrates the pipeline of the proposed GLAM framework, and detailed descriptions of each key component are provided in the following subsections.
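The discretization step mentioned above can be sketched as follows. This is a minimal pure-Python version of the Hungarian (Kuhn-Munkres) method in its potential-based $O(n^3)$ form, applied to a square soft matching matrix; the function name `hungarian_max` and the toy scores are assumptions made for illustration and do not come from the authors' implementation:

```python
def hungarian_max(scores):
    """Return assignment[i] = column matched to row i, maximizing the total score."""
    n = len(scores)
    inf = float("inf")
    # Maximizing the score is the same as minimizing the negated score.
    cost = [[-s for s in row] for row in scores]
    u = [0.0] * (n + 1)    # row potentials (1-based)
    v = [0.0] * (n + 1)    # column potentials (1-based)
    match = [0] * (n + 1)  # match[j] = row currently assigned to column j
    way = [0] * (n + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0 = match[j0]
            delta = inf
            j1 = 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j] = cur
                        way[j] = j0
                    if minv[j] < delta:
                        delta = minv[j]
                        j1 = j
            # Shift the potentials so that one more reduced cost becomes zero.
            for j in range(n + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        # Flip the assignments along the augmenting path.
        while j0:
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    assignment = [0] * n
    for j in range(1, n + 1):
        assignment[match[j] - 1] = j - 1
    return assignment


# Toy soft matching scores (hypothetical): rows 0 and 1 both prefer column 0.
soft = [[0.90, 0.80, 0.10],
        [0.85, 0.10, 0.05],
        [0.10, 0.20, 0.70]]
discrete = hungarian_max(soft)
```

In this toy case a row-wise argmax would assign rows 0 and 1 to the same column, whereas the Hungarian step returns a valid one-to-one assignment with the largest total score.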
4.1 Feature extraction
We use a convolutional neural network (CNN) to prepare the feature maps of the whole image, and employ an interpolation strategy to extract the visual features of the keypoints.
Considering that the locations of keypoints provide geometric cues, we augment the visual features with location information by designing a learnable position encoder that transforms the keypoint locations into high-dimensional positional embeddings. Given the visual features and the location coordinates of the keypoints in one image, the augmented keypoint attributes are obtained as the element-wise sum of the visual features and the positional embeddings of the coordinates, where the position encoder is parameterized as a multi-layer perceptron (MLP) that maps the keypoint coordinates into embeddings with the same dimension as the visual features. The same CNN and position encoder are also applied to the other image to produce its keypoint attributes.
Note that, to make full use of the location information, in each iteration we add the positional embeddings to the updated features output by the self-attention and cross-attention layers.
4.2 The Attention Module
As the core component of the proposed framework, the attention module consists of a self-attention layer (SAL) and a cross-attention layer (CAL), which are devoted to updating features based on the intra- and inter-dependencies between keypoints and finally outputting the matching result.
The key idea of the attention layer is akin to the transformer encoder  that maps the queries, keys and values to output vectors and uses the outputs for feature updating. In SAL, our GLAM network learns the intra-dependencies, i.e., the weighted graph, between keypoints in the same image, and then updates the representations according to the learnt relationship between keypoints. By contrast, CAL is designed for incorporating the information from the other keypoint set to enhance the discrimination of features, and it finally outputs a soft matching result between two keypoint sets.
The SAL and CAL are stacked iteratively in the attention module, of which the pipeline is summarized in Algorithm 1.
4.2.1 Self-attention layer
Given the keypoint attributes extracted from one image, we design a multi-head SAL to learn the graph structure and then update the keypoint features based on the learnt structure.
Specifically, we first initialize the queries, keys and values with the input keypoint attributes, and then project them into different latent spaces, where each attention head in SAL has its own projection parameters for the queries, keys and values.
Subsequently, the attention between the queries and keys is computed for each head, with softmax normalization applied along each row of the resulting matrix. Based on this attention matrix, the keypoint attributes are reconstructed from the values.
The updated attributes of all attention heads are concatenated along the channel dimension and transformed into a new feature space by a learnable parameter matrix.
Finally, the updated features are passed through a residual learning architecture.
As each training sample contains a pair of input feature sets, we apply the same self-attention computation, with the same parameters, to the other feature set to learn its graph structure and update its keypoint attributes.
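The self-attention computation described in this subsection can be sketched as follows. This is a minimal single-head version in pure Python, assuming the usual transformer form: a scaled dot product between projected queries and keys, row-wise softmax, and a plain residual sum. The scaling by the square root of the feature dimension, the identity projections and all function names are assumptions of this sketch, and the multi-head concatenation and any further mapping inside the residual branch are omitted:

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention_head(feats, w_q, w_k, w_v):
    """Return (attention matrix, reconstructed features) for one attention head."""
    q = matmul(feats, w_q)
    k = matmul(feats, w_k)
    v = matmul(feats, w_v)
    scale = math.sqrt(len(q[0]))  # assumed scaling, as in the standard transformer
    k_t = [list(col) for col in zip(*k)]
    logits = [[x / scale for x in row] for row in matmul(q, k_t)]
    attn = softmax_rows(logits)   # n x n weights, read here as the learnt weighted graph
    return attn, matmul(attn, v)


# Toy attributes for three keypoints (hypothetical values) and identity projections.
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
graph, updated = self_attention_head(feats, identity, identity, identity)
# Residual update: add the reconstructed features back to the input attributes.
residual = [[f + u for f, u in zip(f_row, u_row)] for f_row, u_row in zip(feats, updated)]
```

Each row of `graph` sums to one and can be read as the soft adjacency of one keypoint to all keypoints in the same image, which is how the attention matrix plays the role of a learnt graph structure.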
4.2.2 Cross-attention Layer
Different from SAL, CAL aims to enhance the keypoint attributes of one image using the information from the counterpart image. Thus, we initialize the queries with the keypoint attributes of the image being updated, and the keys and values with those of the counterpart image.
Similar to SAL, CAL employs a multi-head architecture to project the input attributes into different latent spaces, where each attention head in CAL has its own projection parameters.
Afterward, the cross-attention between the queries and keys is computed by applying a sigmoid function followed by the Sinkhorn network, where the sigmoid function maps its input into $(0, 1)$ to meet the requirement that the input of the Sinkhorn network be nonnegative. Unlike most attention mechanisms, which use the softmax function for row-wise normalization, here we adopt the Sinkhorn network to impose both row- and column-wise normalization. Subsequently, the query attributes are reconstructed from the values based on the cross-attention matrix.
Similar to SAL, we transform the concatenation of the outputs of all attention heads in CAL through a learnable parameter matrix. Finally, we update the keypoint attributes using a residual learning scheme.
Note that the mirrored computation is also applied in CAL to update the other feature set and compute its cross-attention matrix, and we average the cross-attention matrices at the last layer to obtain the output matching solution.
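The normalization used in CAL can be sketched as follows: a sigmoid makes the query-key scores positive, and a few Sinkhorn iterations then alternate row and column normalization. This is a minimal sketch under stated assumptions: the scaled dot product, the omission of the learnable projections, the transposition of the mirrored matrix before averaging, the function names and the toy features are all choices made for this illustration, while the five iterations follow the setting reported in the experiments:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sinkhorn(m, iters=5):
    """Alternately normalize the rows and columns of a positive matrix."""
    m = [row[:] for row in m]
    for _ in range(iters):
        m = [[x / sum(row) for x in row] for row in m]             # rows sum to 1
        col_sums = [sum(col) for col in zip(*m)]
        m = [[x / c for x, c in zip(row, col_sums)] for row in m]  # columns sum to 1
    return m


def cross_attention(queries, keys, iters=5):
    """Sigmoid-activated, Sinkhorn-normalized attention from queries to keys."""
    scale = math.sqrt(len(queries[0]))  # assumed scaling, as in the standard transformer
    logits = [[sum(a * b for a, b in zip(q, k)) / scale for k in keys] for q in queries]
    positive = [[sigmoid(x) for x in row] for row in logits]
    return sinkhorn(positive, iters)


# Toy attributes (hypothetical): keypoint 0 of A resembles keypoint 2 of B, and so on.
feats_a = [[2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]]
feats_b = [[-3.0, 3.0], [3.0, 3.0], [3.0, -3.0]]
soft_ab = cross_attention(feats_a, feats_b)  # A attends to B
soft_ba = cross_attention(feats_b, feats_a)  # mirrored computation: B attends to A
# Average the two matrices, transposing the mirrored one so that indices agree.
n = len(feats_a)
soft_match = [[(soft_ab[i][j] + soft_ba[j][i]) / 2 for j in range(n)] for i in range(n)]
```

Because both rows and columns are normalized, the output is close to doubly stochastic and can be read as a soft one-to-one matching, which a row-wise softmax alone would not guarantee.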
4.3 Loss function
Given the predicted soft matching solution and the ground-truth matching solution, we treat the labeling of positive matches as a binary classification task, and use the difference between them as the supervision signal for end-to-end training. Thus the loss for the predicted solution is a weighted cross-entropy function computed over the vectorized predicted and ground-truth matching matrices, with a hyper-parameter that balances the loss to avoid the dominance of the negative candidate matches during training.
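A weighted cross-entropy of this kind can be sketched as below. The exact placement of the balancing weight is not recoverable from the text, so this sketch assumes that the weight multiplies the positive term and that the loss is averaged over all candidate matches; the function name, the `eps` clamp and the toy values are likewise illustrative assumptions:

```python
import math


def weighted_cross_entropy(pred, target, pos_weight, eps=1e-12):
    """Mean binary cross-entropy over vectorized matchings, up-weighting positives."""
    loss = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep the logarithms finite
        loss -= pos_weight * t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss / len(pred)


# Vectorized 2 x 2 toy example (hypothetical): the ground truth is the identity matching.
truth = [1, 0, 0, 1]
good = weighted_cross_entropy([0.9, 0.1, 0.2, 0.8], truth, pos_weight=2.0)
bad = weighted_cross_entropy([0.1, 0.9, 0.8, 0.2], truth, pos_weight=2.0)
```

With $n$ keypoints per image there are $n$ positive entries against $n^2 - n$ negative ones, so a weight larger than one on the positive term is one way to keep the negatives from dominating the gradient.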
5 Experiments on Graph Matching
To validate the effectiveness of our GLAM model, we evaluate it on three public visual semantic matching benchmarks, named Willow Object dataset , Pascal VOC Keypoints  and SPair-71k , in comparison with several state-of-the-art graph matching methods including GMN , SuperGlue , PCA , IPCA , LCS , qc-DGM , DGMC , CIE , BBGM , NGM  and NGM-v2 . Among these baseline methods, IPCA  extends the PCA  by adding an iterative update strategy for cross-graph embeddings. NGM-v2  is the upgraded version of NGM  by augmenting the raw features with SplineCNN . Furthermore, SuperGlue  and CIE  are the methods designed by integrating the attention mechanism into graph neural networks for feature aggregation.
In experiments, we generate the visual feature following PCA  that produces the raw keypoint features by interpolating the image feature maps extracted through the standard VGG16 backbone . The dimensions of visual features and positional embeddings are both 1024 in practice.
For the attention module, we set the number of attention layers , the numbers of attention heads , and fix the dimensions of the features as . For CAL, considering that too many normalization iterations in the Sinkhorn network will weaken the gradient used for training our model, we set the number of row- and column-normalization iterations for cross-relationships to 5. In the weighted cross-entropy loss, we set the weight to balance the significance of positive labels and negative labels in training.
5.2 Comparison on Benchmarks
5.2.1 Willow Object dataset
The Willow Object dataset contains five categories, each of which is represented by 40 or more images of different instances. In this dataset, each target object is annotated with 10 distinctive landmarks, and each pair of keypoint sets from the same category can form a matching sample. To split the dataset, we follow [53, 48, 49] and choose 20 images from each category for training, keeping the rest for testing. In practice, any pair of images from the same category can be taken as a sample, so the training set contains a total of 2,000 training samples. For testing, we randomly select 1,000 pairs of images from the testing set of each category to evaluate the compared approaches.
The experimental results on the Willow Object dataset are summarized in Table I. Owing to the lack of scale, pose and illumination changes, this dataset is considered relatively easy. The heuristic graph construction strategies of the compared baseline algorithms are able to find consistent graph patterns across each category, and thus most of them achieve satisfactory matching performance on this dataset, especially in the Face category. Our model, in contrast, works directly on keypoint sets to learn proper graph patterns; it obtains the best matching accuracy in all categories and raises the average accuracy to 99.6%.
5.2.2 Pascal VOC Keypoint
Pascal VOC with Berkeley annotations of keypoints contains 20 categories of instances labeled with semantic keypoint locations. Compared with the Willow Object dataset, this dataset is considered more challenging because the instances vary in scale, pose, illumination and the number of inlier keypoints. Following previous work [53, 48, 49], we split the dataset into two groups, i.e., 7,020 images for training and 1,682 for testing. In training, we take each pair of images from the same category as a training sample. For testing, we randomly select 1,000 samples from each category of the testing set to evaluate the performance of the compared methods.
Table II reports the matching accuracy of the compared methods on Pascal VOC. Due to the challenges arising from scaling, viewpoint change and outliers, the compared baselines are unable to find stable relationships between keypoints using heuristic strategies, and most of them fail to obtain satisfactory matching results on this dataset. By learning relationships between keypoints with the attention module, our GLAM model is able to discover more reliable structural patterns among the instances to be matched, and uses them for the subsequent representation learning and the final matching decision. As a result, our GLAM model achieves the best matching results on all categories except boat and tv, and raises the average matching accuracy to 86.2%, surpassing the strongest competitors (80.1% for BBGM and NGM-v2) by 6.1%.
Several representative examples of graph matching selected from the bottle, chair and horse categories are shown in Figs. 3, 4 and 5 respectively, where our method discovers more consistent and semantically meaningful graph structures and achieves clearly better matching results than the compared baseline algorithms. Specifically, in Fig. 3 the two bottles differ severely in pose and viewpoint, and there are many inconsistent edges between the two heuristically constructed graphs, which contributes greatly to the failure of the baseline methods to find correct correspondences. For our GLAM model, the edges in both learnt graphs generally cover the contour of the bottles, which drives our model to focus more on the affinities that reflect the semantic relationships on the bottle shape. Similarly, Fig. 4 shows a representative example of chair images with large changes in viewpoint, illumination and even the material of the chair. The graphs generated by the compared baseline algorithms fail to discover inherent relationships between keypoints, and they introduce many noisy relationships, e.g., the adjacency between Leg_R_F (annotated with purple nodes) and other keypoints related to Seat. Thus, all of these graph matching methods find fewer correct matches between the two chair instances. In contrast, the learnt graph of our model generally covers the contour of the chair instances and excludes most unreliable relationships. As a result, our model succeeds in finding all correct correspondences between the two keypoint sets. In the horse images in Fig. 5, we observe that all heuristic graphs establish many unreliable relationships between keypoints with little semantic correlation, e.g., the Ear is adjacent to TailBase, and the Nose is associated with Paw. Our learnt graph structures in both images, in contrast, exclude such unreliable relationships, and most relationships are built between keypoints that have close semantic meanings. For example, the keypoints related to Paw are more connected with Elbow, to which they are closely related anatomically, and the keypoints in the head are more likely to be adjacent to each other. Under such graph structures, our model achieves the best performance on the horse category.
5.2.3 SPair-71k
With a collection of 70,958 pairs of images from Pascal 3D+ and Pascal VOC 2012, the SPair-71k dataset is very well organized, with rich annotations for learning. Compared with Pascal VOC Keypoint, the pair annotations in this dataset have more challenging and diverse variations in viewpoint, scale, truncation and occlusion, thus reflecting the more general visual correspondence problem in the real world. Furthermore, it removes two ambiguous and poorly annotated categories, i.e., dining table and sofa, from the Pascal VOC dataset.
In Table III we report the matching results of the proposed GLAM method on the SPair-71k dataset in comparison with state-of-the-art graph matching methods. As reported, our GLAM model achieves the best matching results on all of the 18 categories except bottle, motorbike, plant and train, and raises the average matching accuracy to 82.5%, surpassing the strongest competitor (78.9% for BBGM) by 3.6%. This demonstrates that our model generalizes better to more difficult real-world visual matching problems.
5.3 Ablation Studies
5.3.1 The Number of Attention Layers
The number of attention layers determines the number of feature-updating iterations, and has a direct influence on the performance of our model. To explore the impact of the number of attention layers on the matching accuracy and time consumption, we vary it from 1 to 6 and test the proposed model on Pascal VOC.
As expected, the computation time grows almost linearly with the number of attention layers, as illustrated in Fig. 6. Fig. 6 also shows that the average matching accuracy is only 67.1% when a single attention layer is used in our model. The matching accuracy rises rapidly to 83.3% once we stack two attention layers, and it saturates when the number of attention layers is larger than 3. As a result, we set the number of attention layers to 3 throughout our experiments to balance matching accuracy and time consumption.
5.3.2 Attention Module
To explore the contributions of the different attention modules in our model, we design two downgraded versions of our framework, named GLAM-S̄ and GLAM-C̄, where S̄ and C̄ denote removing the SAL module and the CAL module from GLAM, respectively.
As reported in Table I, both GLAM-S̄ and GLAM-C̄ achieve results comparable with GLAM on the Willow Object dataset, and they outperform all compared baseline algorithms. This indicates that using only one of the proposed attention layers is enough to produce discriminative representations for keypoints in these relatively simple scenarios. However, on the more difficult datasets, Pascal VOC (Table II) and SPair-71k (Table III), the performance of our framework deteriorates remarkably once one of the proposed attention modules is removed. Specifically, the average matching accuracy declines by 17.6% and 8.3% on Pascal VOC when dropping the SAL module and the CAL module, respectively. On SPair-71k, removing the SAL module and the CAL module decreases the matching accuracy by 10.2% and 6.1%, respectively.
These results show that both the self-attention and cross-attention modules contribute significantly to our framework, and that their combination helps us to better learn discriminative representations for keypoint matching.
6 Experiments on Graph Learning
The main motivation of this paper lies in learning graph patterns, instead of utilizing handcrafted graph structures as in previous work, to boost graph matching. To further illustrate the learnt graph patterns and explore their benefits to graph matching, in this section we first visualize the learnt graph patterns of each category on the Pascal VOC dataset, and then compare the baseline algorithms after replacing their handcrafted graph structures with our learnt graph patterns.
6.1 Learnt graph patterns
To explore the learnt graph patterns, we take the average self-attention dependencies output from the last updating iteration as the final learnt adjacency matrices, and for each category we average the matrices in this category as its graph pattern.
To obtain the learnt adjacency matrix, we randomly choose 2,000 pairs of images from each category in the training set and pass them to our trained network. The output matrices (e.g., 8) of the self-attention module in the last attention layer are used as the learnt adjacency matrix of the graph. Since the self-attention module has multiple attention heads, we take the average over all attention heads as the weighted adjacency matrix.
Note that, on the Pascal VOC dataset, images in the same category may contain different numbers of labeled keypoints, so we expand the computed weighted adjacency matrix to the maximum size by adding dummy rows and columns for absent keypoints. Subsequently, the weighted adjacency matrices computed for all chosen images in the same category are averaged to form the learnt graph pattern of this category.
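The procedure above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: the function names, the `(H, n, n)` layout of the attention output, the list of present keypoint indices, and the choice of zeros for the dummy rows and columns are all hypothetical.

```python
import numpy as np

def average_heads(attn):
    """Average a (H, n, n) stack of per-head attention maps into one (n, n) matrix."""
    attn = np.asarray(attn, dtype=float)
    return attn.mean(axis=0)

def pad_to_full(adj, present, n_max):
    """Embed an (n, n) matrix into an (n_max, n_max) one.

    `present` lists the category-level indices of the keypoints that are
    labeled in this image; rows and columns of absent keypoints stay zero
    (assumption: the dummy entries are zeros).
    """
    full = np.zeros((n_max, n_max))
    idx = np.asarray(present)
    full[np.ix_(idx, idx)] = adj
    return full

def category_pattern(samples, n_max):
    """Average the padded, head-averaged matrices of all images in a category.

    `samples` is a list of (attention, present_indices) pairs.
    """
    padded = [pad_to_full(average_heads(a), p, n_max) for a, p in samples]
    return np.mean(padded, axis=0)
```

With zero-valued dummy entries, a keypoint that is often absent receives lower average weights; averaging only over the images in which both keypoints are present would be an alternative design choice.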
We split the categories in Pascal VOC into a difficult set and a simple one according to the number of semantic keypoints. For illustration, we visualize the heat maps of learnt weighted adjacency for them in Fig. 7 and Fig. 8 respectively, where darker elements indicate stronger relationships between keypoints.
It is obvious that the graph patterns learnt by our framework contain rich and discriminative semantic relationships between keypoints in each category. For example, in the Person category in Fig. 7, L_Eye has much stronger connections to the keypoints L_Ear, Nose and R_Eye than to other keypoints, which is consistent with our semantic adjacency rule. The aeroplane category contains various aircraft types that differ greatly in structure and appearance, and its heat map is thus vaguer than those of other categories. However, keypoints within a local area still have stronger relationships; for instance, Nose_Bottom has higher linking weights with NoseTip and Nose_Top than with other keypoints. Furthermore, it is interesting that all animal categories except bird have very similar distributions in their heat maps. Specifically, for the categories cow, dog, cat, horse and sheep, L_Eye consistently has stronger relationships with L_EarBase, Nose and R_Eye than with other keypoints. Besides the categories mentioned above, the heat maps of the remaining categories, including bird, car and sofa, also show clear semantic relationships between pairs of keypoints. In Fig. 8, which illustrates the distributions of the learnt patterns of the simple categories, the disparity between strong and weak relationships is more obvious for most heat maps. In particular, for the categories bottle, diningtable and pottedplant, which have regular shapes in the real world, stronger connections are more likely to exist between keypoints that have left-to-right or up-and-down relationships.
6.2 Manually labeled graph patterns
Benefiting from the fact that keypoints in Pascal VOC have been annotated with semantic names, we are able to manually label a semantic graph pattern for each category, and compare these patterns with the graph patterns learnt by our method for validation.
Since different instances in the same category vary greatly in scale, pose and viewpoint, we label the graph pattern of each category mainly according to the semantic relationships between the keypoints, instead of their spatial locations. For example, among the keypoints in person, the Nose is more related to the organs on the head, such as Eyes and Ears, and less related to other parts of the body, including Hands and Elbows. Thus, for all instances of the person category, the keypoint Nose is connected to Eyes and Ears, but not to other keypoints, no matter how close they are spatially.
Fig. 9 illustrates several representative examples of the manual semantic adjacency and the learnt weighted adjacency, in which the thickness of a line represents the strength of the relationship, and very weak links are filtered out for better illustration. It is observed that the learnt patterns are very similar to the manually labeled ones: the link weights between pairs of keypoints with a close semantic relationship tend to be greater, and those between keypoint pairs that are weakly related in the semantic labels are relatively small. For example, in the last row, which shows the adjacencies of cat, the left ear has relatively strong relationships with the right ear and the left eye, while it has weak relationships with all of the legs, which is consistent with the manually annotated semantic adjacency.
6.3 Comparison with handcrafted graph structures
For further validation of the effectiveness of the learnt graph patterns, we compare them, as well as the manually labeled patterns, with the handcrafted graph patterns in several state-of-the-art graph matching algorithms, including GMN, PCA, IPCA, qc-DGM, CIE, BBGM, NGM and NGMv2. We replace their graph construction strategy with the learnt graph patterns of our model (or, respectively, the manually labeled patterns), and then use the pretrained parameters of these models to perform the graph matching task (the parameters and results of the pretrained models in Table IV are provided at https://github.com/Thinklab-SJTU/ThinkMatch). In practice, to filter possible noise in the learnt adjacency matrix, we keep the 70% of edges with the highest weights and remove the remaining edges.
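The edge-filtering step can be illustrated with a short NumPy sketch. It is a minimal example under assumptions that the text does not specify: edges are treated as directed (the attention matrix need not be symmetric), self-loops are excluded, ties at the threshold are kept, and the function name `keep_top_edges` is hypothetical.

```python
import numpy as np

def keep_top_edges(weights, keep_ratio=0.7):
    """Binarize a weighted adjacency matrix by keeping the strongest edges.

    Keeps roughly `keep_ratio` of the off-diagonal entries (those with the
    highest weights) and removes the rest. Assumptions: directed edges,
    no self-loops, ties at the threshold are kept.
    """
    w = np.asarray(weights, dtype=float)
    n = w.shape[0]
    off_diag = ~np.eye(n, dtype=bool)        # mask that ignores self-loops
    values = w[off_diag]
    k = int(round(keep_ratio * values.size))  # number of edges to keep
    if k == 0:
        return np.zeros((n, n), dtype=np.uint8)
    threshold = np.sort(values)[::-1][k - 1]  # k-th largest weight
    adj = (w >= threshold) & off_diag
    return adj.astype(np.uint8)
```

The resulting binary matrix can then stand in for the output of a heuristic graph construction step such as Delaunay triangulation, provided the baseline accepts an arbitrary adjacency matrix.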
Table IV reports the comparison of matching performance between different types of graph patterns, where MODEL+MA and MODEL+LA denote using the manual semantic adjacency and the learnt adjacency for graph construction, respectively. Even with network parameters that were pretrained under the original graph construction strategies and not fine-tuned, imposing either the manual semantic patterns or the learnt graph patterns significantly improves the matching accuracy of most baseline algorithms. This indicates that building high-quality relational biases is crucial for the graph matching framework. Furthermore, using the learnt graph patterns achieves performance comparable to using the manual semantic patterns, which also illustrates the effectiveness and representativeness of the learnt graph patterns.
In this paper, we propose a joint graph learning and matching network for the feature correspondence problem, by integrating a graph construction procedure into an end-to-end learnable framework. Specifically, we design two types of attention layers, self-attention layer and cross-attention layer, with the multi-head attention strategy. The self-attention layer is devoted to updating the features using learnt graph structures. By contrast, the cross-attention layer updates one feature set according to the information from the other, and it finally outputs a soft matching solution for graph matching. Extensive experimental results on three public visual matching benchmarks have demonstrated the superiority of the proposed model. In addition, we show that the learnt structures are consistent with manually determined semantic patterns in general, and demonstrate the effectiveness of the learnt graph patterns in improving previous graph matching methods by replacing their graph construction strategies.
This work is supported by the National Natural Science Foundation of China (Nos. 62076021 and 61872032) and the Beijing Municipal Natural Science Foundation (Nos. 4202060 and 4212041).
-  (2002) Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24 (4), pp. 509–522. Cited by: §2.1.
-  (2009) Poselets: body part detectors trained using 3d human pose annotations. In IEEE Int. Conf. Comput. Vis., pp. 1365–1372. Cited by: §5.2.2.
-  (2007) Automatic panoramic image stitching using invariant features. Int. J. Comput. Vis. 74 (1), pp. 59–73. Cited by: §1.
-  (2013) Robust feature matching with alternate hough and inverted hough transforms. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2762–2769. Cited by: §2.1.
-  (2015) Co-segmentation guided hough transform for robust feature matching. IEEE Trans. Pattern Anal. Mach. Intell. 37 (12), pp. 2388–2401. Cited by: §2.1.
-  (2021) AutoFormer: searching transformers for visual recognition. In IEEE Int. Conf. Comput. Vis., Cited by: §2.3.
-  (2013) Learning graphs to match. In IEEE Int. Conf. Comput. Vis., pp. 25–32. Cited by: §1, §5.2.1, §5.2.1, §5.2.2, §5.
-  (2010) Reweighted random walks for graph matching. In Eur. Conf. Comput. Vis., Vol. 6315, pp. 492–505. Cited by: §2.2.
-  (2014) Finding matches in a haystack: A max-pooling strategy for graph matching in the presence of outliers. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2091–2098. Cited by: §2.1.
-  (2004) Thirty years of graph matching in pattern recognition. Int. J. Pattern Recognit. Artif. Intell 18 (3), pp. 265–298. Cited by: §1, §2.2.
-  (2006) Balanced graph matching. In Conf. Neural Inform. Process. Syst., pp. 313–320. Cited by: §2.2.
-  (2021) An image is worth 16x16 words: transformers for image recognition at scale. In Int. Conf. Learn. Represent., Cited by: §2.3.
-  (2013) A probabilistic approach to spectral graph matching. IEEE Trans. Pattern Anal. Mach. Intell. 35 (1), pp. 18–27. Cited by: §2.2.
-  (2015) The pascal visual object classes challenge: A retrospective. Int. J. Comput. Vis. 111 (1), pp. 98–136. Cited by: §5.2.3.
-  (2010) The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88 (2), pp. 303–338. Cited by: §1, §1, §5.2.2, §5.2.2, §5.2.3, §5.3.1, §5, §6.2.
-  (2020) Deep graph matching consensus. In Int. Conf. Learn. Represent., Cited by: §1, §2.2, §3.2, TABLE I, TABLE II, TABLE III, §5.
-  (2018) SplineCNN: fast geometric deep learning with continuous b-spline kernels. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 869–877. Cited by: §5.
-  (2014) Graph matching and learning in pattern recognition in the last 10 years. Int. J. Pattern Recognit. Artif. Intell 28 (1), pp. 1–40. Cited by: §1, §2.2.
-  (1956) An algorithm for quadratic programming. Nav. Res. Logist. Q. 3, pp. 95–100. Cited by: §2.2.
-  (2019) Dual attention network for scene segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 3146–3154. Cited by: §2.3.
-  (2021) Deep graph matching under quadratic constraint. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 5069–5078. Cited by: §2.2, §3.2, TABLE I, TABLE II, TABLE III, §5, §6.3, TABLE IV.
-  (1996) A graduated assignment algorithm for graph matching. IEEE Trans. Pattern Anal. Mach. Intell. 18 (4), pp. 377–388. Cited by: §2.2.
-  (2016) Deep residual learning for image recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 770–778. Cited by: §4.2.1.
-  (2021) Transformers in vision: A survey. CoRR abs/2101.01169. Cited by: §2.3.
-  (2017) Semi-supervised classification with graph convolutional networks. In Int. Conf. Learn. Represent., Cited by: §2.2.
-  (2010) The hungarian method for the assignment problem. In 50 Years of Integer Programming 2010, pp. 29–47. Cited by: §4.
-  (2020) Progressive feature matching: incremental graph construction and optimization. IEEE Trans. Image Process. 29, pp. 6992–7005. Cited by: §2.1.
-  (2009) An integer projected fixed point method for graph matching and MAP inference. In Conf. Neural Inform. Process. Syst., pp. 1114–1122. Cited by: §2.2.
-  (2005) A spectral technique for correspondence problems using pairwise constraints. In IEEE Int. Conf. Comput. Vis., pp. 1482–1489. Cited by: §2.2, §2.2.
-  (2012) Robust point matching revisited: A concave optimization approach. In Eur. Conf. Comput. Vis., Vol. 7573, pp. 259–272. Cited by: §2.1.
-  (2018) CODE: coherence based decision boundaries for feature correspondence. IEEE Trans. Pattern Anal. Mach. Intell. 40 (1), pp. 34–47. Cited by: §2.1.
-  (2014) GNCCP - graduated nonconvexity and concavity procedure. IEEE Trans. Pattern Anal. Mach. Intell. 36 (6), pp. 1258–1267. Cited by: §2.2.
-  (2007) A survey for the quadratic assignment problem. Eur. J. Oper. Res. 176 (2), pp. 657–690. Cited by: §2.2.
-  (2017) Sinkhorn networks: using optimal transport techniques to learn permutations. In NIPS Workshop Optim. Transp. Mach. Learn., pp. 1–10. Cited by: §2.2, §3.2, §4.2.2, §5.1.
-  (2019) SPair-71k: A large-scale benchmark for semantic correspondence. CoRR abs/1908.10543. Cited by: §1, §5.2.3, §5.
-  (1957) Algorithms for the assignment and transportation problems. J. Soc. Industr. Appl. Math. 5 (1), pp. 32–38. Cited by: §4.
-  (2019) Universal graph transformer self-attention networks. CoRR abs/1909.11855. Cited by: §2.3.
-  (2018) Revised note on learning quadratic assignment with graph neural networks. In IEEE Data Sci. Workshop, pp. 229–233. Cited by: §1, §2.2, §3.2.
-  (2020) Deep graph matching via blackbox differentiation of combinatorial solvers. In Eur. Conf. Comput. Vis., Vol. 12373, pp. 407–424. Cited by: §2.2, §3.2, §5.2.2, TABLE I, TABLE II, TABLE III, §5, §6.3, TABLE IV.
-  (2020) SuperGlue: learning feature matching with graph neural networks. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 4937–4946. Cited by: §1, §2.3, §3.2, TABLE I, TABLE II, TABLE III, §5.
-  (2015) Very deep convolutional networks for large-scale image recognition. In Int. Conf. Learn. Represent., Y. Bengio and Y. LeCun (Eds.), Cited by: §4.1, §5.1.
-  (2017) A study of lagrangean decompositions and dual ascent solvers for graph matching. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7062–7071. Cited by: §2.2.
-  (2011) Speeded-up, relaxed spatial matching. In IEEE Int. Conf. Comput. Vis., pp. 1653–1660. Cited by: §2.1.
-  (2017) Attention is all you need. In Conf. Neural Inform. Process. Syst., pp. 5998–6008. Cited by: §1, §2.3, §4.2.
-  (2018) Graph attention networks. In Int. Conf. Learn. Represent., Cited by: §2.3.
-  (2019) Learning combinatorial embedding networks for deep graph matching. In IEEE Int. Conf. Comput. Vis., pp. 3056–3065. Cited by: §1, §2.2, §3.2, §5.1, §5.2.1, §5.2.2, TABLE I, TABLE II, TABLE III, §5, §6.3, TABLE IV.
-  (2020) Combinatorial learning of robust deep graph matching: an embedding based approach. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §2.2, §5.2.1, §5.2.2, TABLE II, TABLE III, §5, §6.3, TABLE IV.
-  (2019) Deformable surface tracking by graph matching. In IEEE Int. Conf. Comput. Vis., pp. 901–910. Cited by: §2.1.
-  (2018) Graph matching with adaptive and branching path following. IEEE Trans. Pattern Anal. Mach. Intell. 40 (12), pp. 2853–2867. Cited by: §2.2.
-  (2018) Gracker: A graph-based planar object tracker. IEEE Trans. Pattern Anal. Mach. Intell. 40 (6), pp. 1494–1501. Cited by: §1.
-  (2020) Learning combinatorial solver for graph matching. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 7565–7574. Cited by: §2.2, §3.2, §5.2.1, §5.2.2, TABLE I, TABLE II, TABLE III, §5.
-  (2021) Neural graph matching network: learning Lawler's quadratic assignment problem with extension to hypergraph and multiple-graph matching. IEEE Trans. Pattern Anal. Mach. Intell. Cited by: §2.2, §3.2, §5.2.2, TABLE I, TABLE II, TABLE III, §5, §6.3, TABLE IV.
-  (2014) Point matching in the presence of outliers in both point sets: A concave optimization approach. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 352–359. Cited by: §2.1.
-  (2014) Beyond PASCAL: A benchmark for 3d object detection in the wild. In IEEE Winter Conf. Appl. Comput. Vis., pp. 75–82. Cited by: §5.2.3.
-  (2018) AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 1316–1324. Cited by: §2.3.
-  (2016) A short survey of recent advances in graph matching. In Int. Conf. Multimed. Retr., pp. 167–174. Cited by: §1, §2.2.
-  (2019) Cross-modal self-attention network for referring image segmentation. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 10502–10511. Cited by: §2.3.
-  (2020) Learning deep graph matching with channel-independent embedding and hungarian attention. In Int. Conf. Learn. Represent., Cited by: §1, §2.3, §3.2, TABLE I, TABLE II, TABLE III, §5, §6.3, TABLE IV.
-  (2018) Deep learning of graph matching. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2684–2693. Cited by: §2.2, TABLE I, TABLE II, TABLE III, §5, §6.3, TABLE IV.
-  (2009) A path following algorithm for the graph matching problem. IEEE Trans. Pattern Anal. Mach. Intell. 31 (12), pp. 2227–2242. Cited by: §2.2.
-  (2019) NM-net: mining reliable neighbors for robust feature correspondences. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 215–224. Cited by: §2.1.
-  (2013) Deformable graph matching. In IEEE Conf. Comput. Vis. Pattern Recog., pp. 2922–2929. Cited by: §2.1.
-  (2016) Factorized graph matching. IEEE Trans. Pattern Anal. Mach. Intell. 38 (9), pp. 1774–1789. Cited by: §2.2.