blackbox-deep-graph-matching
Repository for our paper: Deep Graph Matching via Blackbox Differentiation of Combinatorial Solvers
Building on recent progress at the intersection of combinatorial optimization and deep learning, we propose an end-to-end trainable architecture for deep graph matching that contains unmodified combinatorial solvers. Using the presence of heavily optimized combinatorial solvers together with some improvements in architecture design, we advance state-of-the-art on deep graph matching benchmarks for keypoint correspondence. In addition, we highlight the conceptual advantages of incorporating solvers into deep learning architectures, such as the possibility of post-processing with a strong multi-graph matching solver or the indifference to changes in the training setting. Finally, we propose two new challenging experimental setups.
Matching discrete structures is a recurring theme in numerous branches of computer science. Aside from extensive analysis of its theoretical and algorithmic aspects [9, 26], there is also a wide range of applications. Computer vision, in particular, abounds in tasks with a matching flavor: optical flow [4, 49, 50], person re-identification [45, 25], stereo matching [36, 12], pose estimation [11, 25], and object tracking [39, 57], to name just a few. Matching problems are also relevant in a variety of scientific disciplines including biology [28], language processing [40], bioinformatics [19], correspondence problems in computer graphics [43], and social network analysis [35].
In computer vision in particular, the matching problem has two parts: extraction of local features from raw images and resolving conflicting evidence, e.g. multiple long-term occlusions in a tracking context. Each of these parts can be addressed efficiently in isolation, namely by deep networks on the one side and by specialized purely combinatorial algorithms on the other. The latter requires a clean abstract formulation of the combinatorial problem. Complications arise if concessions on either side harm performance. Deep networks on their own have a limited capability of combinatorial generalization [6], and purely combinatorial approaches typically rely on fixed features that are often suboptimal in practice. To address this, many hybrid approaches have been proposed.
In case of deep graph matching some approaches rely on finding suitable differentiable relaxations [62, 60], while others benefit from a tailored architecture design [59, 24, 27, 64]. What all these approaches have in common is that they compromise on the combinatorial side in the sense that the resulting “combinatorial block” would not be competitive in a purely combinatorial setup.
In this work, we present a novel type of end-to-end architecture for semantic keypoint matching that does not make any concessions on the combinatorial side while maintaining strong feature extraction. We build on recent progress at the intersection of combinatorial optimization and deep learning [56] that allows blackbox implementations of a wide range of combinatorial algorithms to be embedded seamlessly into deep networks in a mathematically sound fashion. As a result, we can leverage heavily optimized graph matching solvers [51, 53] based on dual block coordinate ascent for Lagrange decompositions.
Since the combinatorial aspect is handled by an expert algorithm, we can focus on the rest of the architecture design: building representative graph matching instances from visual and geometric information. In that regard, we leverage the recent finding [24] that a large performance improvement can be obtained by correctly incorporating relative keypoint locations via SplineCNN [22].
Additionally, we observe that correct matching decisions are often simplified by leveraging global information such as viewpoint, rigidity of the object, or scale (see also Fig. 1). With this in mind, we propose a natural global feature attention mechanism that adjusts the weighting of different node and edge features based on a global feature vector.
Finally, the proposed architecture allows a stronger post-processing step. In particular, we use a multi-graph matching solver [53] during evaluation to jointly resolve multiple graph matching instances in a consistent fashion.
On the experimental side, we achieve state-of-the-art results on the standard keypoint matching datasets Pascal VOC (with Berkeley annotations [20, 8]) and Willow ObjectClass [14]. Motivated by the lack of challenging standardized benchmarks, we additionally propose two new experimental setups. The first is evaluation on SPair-71k [38], a high-quality dataset that was recently released in the context of dense image matching. For the second, we suggest dropping the common practice of keypoint pre-filtering, which forces future methods to address the presence of keypoints without a match.
The contributions presented in this paper can be summarized as follows.
We present a novel and conceptually simple architecture that seamlessly incorporates a combinatorial graph matching solver. In addition, improvements are attained on the feature extraction side by processing global image information.
We introduce two new experimental setups and suggest them as future benchmarks.
We perform an extensive evaluation on existing benchmarks as well as on the newly proposed ones. Our approach reaches higher matching accuracy than previous methods, particularly in more challenging scenarios.
We exhibit further advantages of incorporating a combinatorial solver:
possible post-processing with a multi-graph matching solver,
an effortless transition to more challenging scenarios with unmatchable keypoints.
Research at the intersection of combinatorial optimization and deep learning is driven by two main paradigms.
The first one attempts to improve combinatorial optimization algorithms with deep learning methods. Examples include the use of reinforcement learning for improving branch-and-bound decisions [30, 5, 25] as well as heuristic greedy algorithms for NP-hard graph problems [32, 17, 7, 29].
The other mindset aims at enhancing the expressivity of neural nets by turning combinatorial algorithms into differentiable building blocks. The work on differentiable quadratic programming [3] served as a catalyst, and progress was achieved even in more discrete settings [37, 21, 58]. In a recent culmination of these efforts [56], a “differentiable wrapper” was proposed for blackbox implementations of algorithms minimizing a linear discrete objective, effectively allowing free flow of progress from combinatorial optimization to deep learning.
The graph matching problem, also known as the quadratic assignment problem [33] in the combinatorial optimization literature, is famous for being one of the practically most difficult NP-complete problems. There exist instances with fewer than 100 nodes that can be extremely challenging to solve with existing approaches [10]. Nevertheless, in computer vision efficient algorithmic approaches have been proposed that can routinely solve sparse instances with hundreds of nodes. Among those, solvers based on Lagrange decomposition [54, 65, 51] have been shown to perform especially well, being able to quickly produce high-quality solutions with small gaps to the optimum. Lagrange decomposition solvers split the graph matching problem into many small subproblems linked together via Lagrange multipliers. These multipliers are iteratively updated in order to reach agreement among the individual subproblems, typically with subgradient-based techniques [48] or dual block coordinate ascent [52].
Graph matching solvers have a rich history of applications in computer vision. A non-exhaustive list includes uses for finding correspondences of landmarks between various objects in several semantic object classes [54, 66, 55], for estimating sparse correspondences in wide-displacement optical flow [54, 2], for establishing associations in multiple object tracking [13], for object categorization [18], and for matching cell nuclei in biological image analysis [28].
Wider interest in deep graph matching was ignited by [62], where a fully differentiable graph matching solver based on spectral methods was introduced. While differentiable relaxation of quadratic graph matching has reappeared [60], most methods [59, 27, 61] rely on the Sinkhorn iterative normalization [47, 1] for the linear assignment problem or even on a single row normalization [24]. Another common feature is the use of various graph neural networks [44, 34, 6], sometimes also in a cross-graph fashion [59], for refining the node embeddings provided by the backbone architecture. There has also been a discussion regarding suitable loss functions [62, 59, 61]. Recently, nontrivial progress has been achieved by extracting more signal from the available geometric information [24, 64].
When incorporating a combinatorial solver into a neural network, differentiability constitutes the principal difficulty. Such solvers take continuous inputs (vertex and edge costs in our case) and return a discrete output (an indicator vector of the optimal matching). This mapping is piecewise constant because a small change of the costs typically does not affect the optimal matching. Therefore, the gradient exists almost everywhere but is equal to zero. This prohibits any gradient-based optimization.
A recent method proposed in [56] offers a mathematically backed solution to overcome these obstacles. It introduces an efficient “implicit interpolation” of the solver’s mapping while still treating the solver as a blackbox. In effect, the intact solver is executed on the forward pass and, as it turns out, only one other call to the solver is sufficient to provide meaningful gradient information during the backward pass.
Specifically, the method of [56] applies to solvers that solve an optimization problem of the form
$$w \in W \subseteq \mathbb{R}^N \;\mapsto\; y(w) \in Y \subset \mathbb{R}^N \quad \text{such that} \quad y(w) = \operatorname*{arg\,min}_{y \in Y} \; w \cdot y, \tag{1}$$
where $w$ is the continuous input and $Y$ is any discrete set. This general formulation covers large classes of combinatorial algorithms that include the shortest path problem, the traveling salesman problem and many others. As will be shown in the subsequent sections, graph matching is also included in this definition.
If $L$ denotes the final loss of the network, the suggested gradient of the piecewise constant mapping $w \mapsto L(y(w))$ takes the form
$$\frac{\mathrm{d}L(y(w))}{\mathrm{d}w} := \frac{y(w_\lambda) - y(w)}{\lambda}, \tag{2}$$
in which $w_\lambda$ is a certain modification of the input $w$ depending on the gradient of $L$ at $y(w)$. This is in fact the exact gradient of a piecewise linear interpolation of $L(y(w))$ in which a hyperparameter $\lambda > 0$ controls the interpolation range, as Fig. 2 suggests.
It is worth pointing out that the framework does not require any explicit description of the set $Y$ (such as via linear constraints). For further details and mathematical guarantees, see [56].
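The scheme can be illustrated on a toy problem. The following is a minimal sketch, not the implementation of [56]: the solver is a hypothetical brute-force routine that picks the $k$ cheapest entries of $w$, the loss is a Hamming distance to a target vector, and the perturbed input is assumed to be $w_\lambda = w + \lambda \, \nabla_y L(y(w))$, the form used in [56]. The names `solver` and `blackbox_gradient` are illustrative.

```python
from itertools import combinations


def solver(w, k=2):
    """Toy blackbox solver: indicator vector of the k-subset minimizing w . y."""
    n = len(w)
    best = min(combinations(range(n), k), key=lambda s: sum(w[i] for i in s))
    return [1.0 if i in best else 0.0 for i in range(n)]


def blackbox_gradient(w, grad_loss_at_y, lam, k=2):
    """Gradient suggested by (2): (y(w_lam) - y(w)) / lam."""
    y = solver(w, k)  # forward pass: the unmodified solver
    w_lam = [wi + lam * gi for wi, gi in zip(w, grad_loss_at_y)]
    y_lam = solver(w_lam, k)  # the single extra solver call
    return [(a - b) / lam for a, b in zip(y_lam, y)]


w = [1.0, 2.0, 3.0, 4.0]
y_target = [0.0, 1.0, 1.0, 0.0]
# On binary vectors, the Hamming loss to y* has gradient 1 - 2 y* with respect to y.
grad_loss = [1.0 - 2.0 * t for t in y_target]

grad_small = blackbox_gradient(w, grad_loss, lam=0.5)  # perturbation too small to change the solution
grad_large = blackbox_gradient(w, grad_loss, lam=3.0)  # solution changes, gradient is informative
```

With the small $\lambda$ the perturbed problem has the same minimizer and the returned gradient is zero; with the larger $\lambda$ a gradient descent step raises the cost of the wrongly selected entry and lowers the cost of the missed one. This is one way to see why $\lambda$ is described as controlling the interpolation range.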
The aim of graph matching is to find an assignment between vertices of two graphs that minimizes the sum of local and geometric costs.
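Let $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ be two directed graphs. We denote by $v \in \{0, 1\}^{|V_1||V_2|}$ the indicator vector of matched vertices, that is $v_{i,j} = 1$ if a vertex $i \in V_1$ is matched with $j \in V_2$ and $v_{i,j} = 0$ otherwise. Analogously, we set $e \in \{0, 1\}^{|E_1||E_2|}$ as the indicator vector of matched edges. Obviously, the vector $e$ is fully determined by the vector $v$. Further, we denote by $\mathrm{Adm}(G_1, G_2)$ the set of all pairs $(v, e)$ that encode a valid matching between $G_1$ and $G_2$.
Given two cost vectors $c^v \in \mathbb{R}^{|V_1||V_2|}$ and $c^e \in \mathbb{R}^{|E_1||E_2|}$, we formulate the graph matching optimization problem as
$$\mathrm{GM}(c^v, c^e) = \operatorname*{arg\,min}_{(v, e) \in \mathrm{Adm}(G_1, G_2)} \; \{ c^v \cdot v + c^e \cdot e \}. \tag{3}$$
It is immediate that $\mathrm{GM}$ fits the definition of the solver given in (1). If $L = L(v, e)$ is the loss function, the mapping
$$(c^v, c^e) \mapsto L\big(\mathrm{GM}(c^v, c^e)\big) \tag{4}$$
is the piecewise constant function for which the scheme of [56] suggests
$$\nabla_{c^v} L\big(\mathrm{GM}(c^v, c^e)\big) := \frac{1}{\lambda}\,[v_\lambda - v], \qquad \nabla_{c^e} L\big(\mathrm{GM}(c^v, c^e)\big) := \frac{1}{\lambda}\,[e_\lambda - e], \tag{5}$$
where $(v, e) = \mathrm{GM}(c^v, c^e)$ and the vectors $v_\lambda$ and $e_\lambda$ stand for
$$(v_\lambda, e_\lambda) = \mathrm{GM}(c^v_\lambda, c^e_\lambda), \qquad c^v_\lambda = c^v + \lambda \nabla_v L(v, e), \qquad c^e_\lambda = c^e + \lambda \nabla_e L(v, e). \tag{6}$$
The implementation is listed in Alg. 1.
In our experiments, we use the Hamming distance between the proposed matching and the ground truth matching of vertices as a loss. In this case, $L$ does not depend on $e$ and, consequently, $\nabla_e L = 0$ and $c^e_\lambda = c^e$.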
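The forward and backward pass for graph matching with a Hamming loss can be sketched on a tiny instance. This is an illustration under simplifying assumptions, not the paper's Alg. 1: the hypothetical `solve_gm` enumerates all bijections between two graphs with the same number of vertices (the solver used in the paper is a dual block coordinate ascent method and also handles partial matchings), and edges are given as lists of directed vertex pairs.

```python
from itertools import permutations


def solve_gm(c_v, c_e, edges1, edges2):
    """Brute-force minimizer of c_v . v + c_e . e over bijections (toy stand-in for the solver)."""
    n = len(c_v)

    def edge_indicator(perm):
        # e[a][b] = 1 iff edge a = (i, k) of graph 1 is mapped onto edge b = (j, l) of graph 2.
        return [[1.0 if perm[i] == j and perm[k] == l else 0.0
                 for (j, l) in edges2] for (i, k) in edges1]

    def cost(perm):
        e = edge_indicator(perm)
        unary = sum(c_v[i][perm[i]] for i in range(n))
        quadratic = sum(c_e[a][b] * e[a][b]
                        for a in range(len(edges1)) for b in range(len(edges2)))
        return unary + quadratic

    best = min(permutations(range(n)), key=cost)
    v = [[1.0 if best[i] == j else 0.0 for j in range(n)] for i in range(n)]
    return v, edge_indicator(best)


def gm_gradients(c_v, c_e, edges1, edges2, v_gt, lam):
    """Gradients with respect to vertex and edge costs for a Hamming loss on v."""
    v, e = solve_gm(c_v, c_e, edges1, edges2)
    # The Hamming loss has gradient 1 - 2 v* in v and none in e, so edge costs stay unperturbed.
    c_v_lam = [[c + lam * (1.0 - 2.0 * g) for c, g in zip(cr, gr)]
               for cr, gr in zip(c_v, v_gt)]
    v_lam, e_lam = solve_gm(c_v_lam, c_e, edges1, edges2)
    grad_cv = [[(a - b) / lam for a, b in zip(ar, br)] for ar, br in zip(v_lam, v)]
    grad_ce = [[(a - b) / lam for a, b in zip(ar, br)] for ar, br in zip(e_lam, e)]
    return grad_cv, grad_ce


edges1, edges2 = [(0, 1)], [(0, 1), (1, 0)]
c_v = [[0.0, 1.0], [1.0, 0.0]]
c_e = [[0.5, -0.5]]
v_gt = [[0.0, 1.0], [1.0, 0.0]]  # ground truth swaps the two vertices
grad_cv, grad_ce = gm_gradients(c_v, c_e, edges1, edges2, v_gt, lam=1.0)
```

Although the loss depends only on $v$, the edge costs still receive a nonzero gradient here, because the perturbed solver call changes the matched edges along with the matched vertices.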
A more sophisticated variant of graph matching involves more than two graphs. The aim of multi-graph matching is to find a matching for every pair of graphs such that these matchings are consistent in a global fashion (i.e. satisfy so-called cycle consistency, see Fig. 3) and minimize the global cost. Although the framework of [56] is also applicable to multi-graph matching, we will only use it for post-processing.
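Cycle consistency can be made concrete with a short check. The sketch below assumes full matchings between graphs with equally many vertices, each stored as a list `perm` with `perm[i] = j`; the function names and the dictionary layout keyed by graph index pairs are illustrative choices, not part of the multi-graph solver [53].

```python
from itertools import permutations


def invert(perm):
    """Inverse of a matching given as a permutation list."""
    inv = [0] * len(perm)
    for i, j in enumerate(perm):
        inv[j] = i
    return inv


def is_cycle_consistent(matchings, num_graphs):
    """Check that matching a -> b -> c agrees with matching a -> c for all triples."""
    m = dict(matchings)
    for (a, b), perm in matchings.items():
        m[(b, a)] = invert(perm)  # add the reverse direction of every pair
    for a, b, c in permutations(range(num_graphs), 3):
        if any(m[(b, c)][m[(a, b)][i]] != m[(a, c)][i] for i in range(len(m[(a, b)]))):
            return False
    return True


consistent = {(0, 1): [1, 0, 2], (1, 2): [0, 2, 1], (0, 2): [2, 0, 1]}
inconsistent = {(0, 1): [1, 0, 2], (1, 2): [0, 2, 1], (0, 2): [0, 1, 2]}
```

Pairwise matchings produced independently for each image pair will generally fail this check, which is what a multi-graph matching solver is meant to repair.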
One disadvantage of using Hamming distance as a loss function is that it reaches its minimum value zero even if the ground truth matching has only fractionally lower cost than competing matchings. This increases sensitivity to distribution shifts and potentially harms generalization. The issue was already observed in [42], where the method [56] was also applied. We adopt the solution proposed in [42], namely the cost margin. In particular, during training we increase the unary costs $c^v$ that correspond to the ground truth matching by $\alpha > 0$, i.e.
$$c^v \leftarrow c^v + \alpha \, v^*, \tag{7}$$
where $v^*$ denotes the ground truth matching indicator vector. In all experiments, we use the same fixed value of $\alpha$.
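The effect of the margin can be seen on a two-vertex example. This is a sketch under stated assumptions: only unary costs are used, the brute-force `solve_assignment` is a hypothetical stand-in for the solver, and the value `alpha=0.5` is chosen for illustration only and is not the value used in the paper.

```python
from itertools import permutations


def solve_assignment(c_v):
    """Brute-force minimum-cost bijection, returned as an indicator matrix."""
    n = len(c_v)
    best = min(permutations(range(n)), key=lambda p: sum(c_v[i][p[i]] for i in range(n)))
    return [[1.0 if best[i] == j else 0.0 for j in range(n)] for i in range(n)]


def add_cost_margin(c_v, v_gt, alpha):
    """Training-time margin: raise the costs of ground truth entries by alpha."""
    return [[c + alpha * g for c, g in zip(cr, gr)] for cr, gr in zip(c_v, v_gt)]


c_v = [[0.0, 0.1], [0.1, 0.0]]   # ground truth is cheaper, but only by a small amount
v_gt = [[1.0, 0.0], [0.0, 1.0]]

plain = solve_assignment(c_v)                                        # already returns the ground truth
with_margin = solve_assignment(add_cost_margin(c_v, v_gt, alpha=0.5))  # now returns the competitor
```

Without the margin the Hamming loss is already zero and training receives no signal. With the margin the solver returns the competing matching, so the loss stays positive until the ground truth is cheaper than every competitor by more than the added margin.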
We employ a dual block coordinate ascent solver [51] based on a Lagrange decomposition of the original problem. In every iteration, a dual lower bound is monotonically increased and the resulting dual costs are used to round primal solutions using a minimum cost flow solver.
Our end-to-end trainable architecture for keypoint matching consists of three stages. We call it BlackBox differentiation of Graph Matching solvers (BB-GM).
Extraction of visual features. A standard CNN architecture extracts a feature vector for each of the keypoints in the image. Additionally, a global feature vector is extracted.
Geometry-aware feature refinement. Keypoints are converted to a graph structure with spatial information. Then a graph neural network architecture is applied.
Construction of combinatorial instance. Vertex and edge similarities are computed using the graph features and the global features. This determines a graph matching instance that is passed to the solver.
The resulting matching is compared to the ground truth matching and their Hamming distance is the loss function to optimize.
While the first and the second stage (Fig. 4) are rather standard design blocks, the third one (Fig. 5) constitutes the principal novelty. More detailed descriptions follow.
We closely follow previous work [24, 62, 59] and also compute the outputs of the relu4_2 and relu5_1 operations of the VGG16 [46] network pre-trained on ImageNet [16]. The spatially corresponding feature vector for each keypoint is recovered via bilinear interpolation.
An image-wide global feature vector is extracted by max-pooling the output of the final VGG16 layer, see Fig. 4. Both the keypoint feature vectors and the global feature vectors are normalized with respect to their norm.
The graph is created as a Delaunay triangulation [15] of the keypoint locations. We deploy SplineCNN [22], an architecture that proved successful in point-cloud processing. Its inputs are the VGG vertex features and spatial edge attributes defined as normalized relative coordinates of the associated vertices (called anisotropic in [24, 23]). We use two layers of SplineCNN with max aggregations. The outputs are additively composed with the original VGG node features to produce the refined node features. For subsequent computation, we set the edge features as the differences of the refined node features. For illustration, see Fig. 4.
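Recovering a keypoint feature by bilinear interpolation can be sketched as follows. The sketch assumes a feature map stored as nested lists `fmap[row][col][channel]` and keypoint coordinates already expressed in feature-map units; the rescaling from image pixels to feature-map units depends on the layer stride and is left out. The function name is illustrative.

```python
import math


def bilinear_sample(fmap, x, y):
    """Interpolate the feature vector at continuous position (x = column, y = row)."""
    h, w = len(fmap), len(fmap[0])
    # Clamp to the valid range so keypoints on the border remain well defined.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0  # fractional offsets inside the cell
    channels = len(fmap[0][0])
    return [
        (1 - ay) * ((1 - ax) * fmap[y0][x0][k] + ax * fmap[y0][x1][k])
        + ay * ((1 - ax) * fmap[y1][x0][k] + ax * fmap[y1][x1][k])
        for k in range(channels)
    ]


# A 2 x 2 feature map with a single channel.
fmap = [[[0.0], [1.0]],
        [[2.0], [3.0]]]
center = bilinear_sample(fmap, 0.5, 0.5)
corner = bilinear_sample(fmap, 1.0, 0.0)
```

Because the interpolation is a weighted average of the four surrounding cells, it is differentiable with respect to the feature map, so gradients can flow from the matching cost back into the backbone.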
Both source and target image are passed through the two described procedures. Their global features are concatenated to one global feature vector $g$. A standard way to prepare a matching instance (the unary costs $c^v$) is to compute the inner product similarity (or affinity) of the vertex features $c^v_{i,j} = f^s_i \cdot f^t_j$, where $f^s_i$ is the feature vector of the vertex $i$ in the source graph and $f^t_j$ is the feature vector of the vertex $j$ in the target graph, possibly with a learnable vector or a matrix of coefficients as in [59].
In our case, the vector of “similarity coefficients” $a$ is produced as a linear transformation of $g$ followed by a nonlinearity. In particular,
$$c^v_{i,j} = \sum_k f^s_i(k) \, a_k \, f^t_j(k) \quad \text{with} \quad a = \sigma(A g), \tag{8}$$
where $A$ is a learnable matrix and $\sigma$ is a nonlinearity. This allows for a gating-like behavior; the individual coordinates of the feature vectors may play a different role depending on the global feature vector $g$. It is intended to enable integrating various global semantic aspects such as rigidity of the object or the viewpoint perspective. Higher order cost terms $c^e$ are calculated in the same vein using edge features instead of vertex features with an analogous learnable affinity layer. For an overview, see Fig. 5.
We evaluate our method on the standard datasets for keypoint matching, Pascal VOC with Berkeley annotations [20, 8] and Willow ObjectClass [14]. Additionally, we propose a harder setup for Pascal VOC that avoids keypoint filtering as a preprocessing step. Finally, we report our performance on a recently published dataset, SPair-71k [38]. Even though this dataset was designed for a slightly different community, its high quality makes it very suitable also in this context. The two new experimental setups aim to address the lack of difficult benchmarks in this line of work. The code used for our experiments is available at https://github.com/martius-lab/blackbox-deep-graph-matching.
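Returning briefly to the affinity construction in (8), the global feature attention can be sketched in a few lines. The sketch is an illustration under assumptions: the coefficient vector is computed as a nonlinearity applied to a learnable matrix times the global feature vector, `tanh` is an assumed choice of nonlinearity, and any sign convention needed to turn similarities into costs for a minimizing solver is not shown. All names are hypothetical.

```python
import math


def global_attention_affinity(src_feats, tgt_feats, g, A):
    """Weighted inner products sum_k f_s[k] * a[k] * f_t[k] with a = tanh(A g)."""
    # One coefficient per feature coordinate, computed from the global feature vector g.
    a = [math.tanh(sum(A_kr * g_r for A_kr, g_r in zip(row, g))) for row in A]
    return [
        [sum(fs_k * a_k * ft_k for fs_k, a_k, ft_k in zip(fs, a, ft)) for ft in tgt_feats]
        for fs in src_feats
    ]


src = [[1.0, 2.0]]           # one source vertex with a 2-dimensional feature
tgt = [[3.0, 4.0]]           # one target vertex
A = [[1.0, 0.0], [0.0, 1.0]]  # learnable matrix, here fixed for illustration

gated = global_attention_affinity(src, tgt, [10.0, -10.0], A)  # a is close to [1, -1]
muted = global_attention_affinity(src, tgt, [0.0, 0.0], A)     # a is [0, 0]
```

The same pair of vertex features yields different affinities depending on the global vector, which is the gating-like behavior described for (8).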
In some cases, we report our own evaluation of DGMC [24], the strongest competing method, which we denote by DGMC*. We used the publicly available implementation [23].
All experiments were run on a single Tesla-V100 GPU. Due to the efficient C++ implementation of the solver [52], the computational bottleneck of the entire architecture is evaluating the VGG backbone. Around 30 image pairs were processed every second.
In all experiments, we use the exact same set of hyperparameters. Only the number of training steps is dataset-dependent. The optimizer in use is Adam [31], with an initial learning rate that is halved four times at regular intervals. The learning rate for fine-tuning the VGG weights is scaled down by a fixed factor. We process image pairs in batches, and the hyperparameter $\lambda$ from (2) is set to the same value throughout. For remaining implementation details, the full code base will be made available.
The standard benchmark datasets provide images with annotated keypoints but do not define pairings of images or which keypoints should be kept for the matching instance. While it is the designer’s choice how this is handled during training, it is imperative that only one pair-sampling and keypoint filtering procedure is used at test time. Otherwise, the change in the distribution of test pairs and the corresponding instances may have unintended effects on the evaluation metric (as we demonstrate below), and therefore hinder fair comparisons.
We briefly describe two previously used methods for creating evaluation data, discuss their impact, and propose a third one.
Keypoint intersection filtering. Only the keypoints present in both source and target image are preserved for the matching task. Clearly, any pair of images can be processed this way, see Fig. 5(a).
Keypoint inclusion filtering. Target image keypoints have to include all the source image keypoints. The target keypoints that are not present in the source image are then disregarded. Not every pair of images obeys this property, see Fig. 5(b).
When keypoint inclusion filtering is used for evaluation, some image pairs are discarded, which introduces some biases. In particular, pairs of images seen from different viewpoints become underrepresented, as such pairs often have incomparable sets of visible keypoints, see Fig. 6. Another effect is a bias towards a higher number of keypoints in a matching instance, which makes the matching task more difficult. While the effect on mean accuracy is not strong, Tab. 1 shows large differences in individual classes.
Another unsatisfactory aspect of both methods is that label information is required at evaluation time, rendering the setting quite unrealistic. For this reason, we propose to evaluate without any keypoint removal.
Unfiltered keypoints. For a given pair of images, the keypoints are used without any filtering. Matching instances may contain a different number of source and target vertices, and vertices without a corresponding match.
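The three procedures for building evaluation instances can be contrasted in a short sketch. It assumes each image is represented only by the set of its annotated keypoint labels; the function names and the example labels are hypothetical.

```python
def intersection_filter(src_kps, tgt_kps):
    """Keep only keypoints present in both images; every pair is usable."""
    common = sorted(set(src_kps) & set(tgt_kps))
    return common, common


def inclusion_filter(src_kps, tgt_kps):
    """Require target keypoints to include all source keypoints, else discard the pair."""
    if not set(src_kps) <= set(tgt_kps):
        return None  # this image pair is dropped from the evaluation set
    kept = sorted(src_kps)
    return kept, kept  # extra target keypoints are disregarded


def unfiltered(src_kps, tgt_kps):
    """Keep all keypoints; sizes may differ and some keypoints have no match."""
    return sorted(src_kps), sorted(tgt_kps)


src = {"nose", "left_ear", "tail"}
tgt = {"nose", "tail", "right_ear"}

by_intersection = intersection_filter(src, tgt)
by_inclusion = inclusion_filter(src, tgt)
without_filtering = unfiltered(src, tgt)
```

The first two procedures need to know which keypoints are shared, which is label information at evaluation time; only the unfiltered variant can be built from each image independently.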
The Pascal VOC [20] dataset with Berkeley annotations [8] contains images with bounding boxes surrounding objects of 20 classes. We follow the standard data preparation procedure of [59]. Each object is cropped to its bounding box and scaled to a fixed resolution. The resulting images contain up to 23 annotated keypoints, depending on the object category.
The results under the most common experimental conditions (keypoint intersection filtering) are reported in Tab. 2 and we can see that BB-GM outperforms competing approaches.
We propose, see Sec. 4.0.3, to preserve all keypoints (no filtering). Matching accuracy is no longer a good evaluation metric as it ignores false positives. Instead, we report the F1-score, the harmonic mean of precision and recall.
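The F1-score of a single matching instance can be computed as follows. This is a minimal sketch in which a matching is a set of (source keypoint, target keypoint) pairs; how scores are aggregated over instances and classes is not specified here and is left out.

```python
def matching_f1(pred, gt):
    """Harmonic mean of precision and recall for one predicted matching."""
    true_positives = len(pred & gt)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gt) if gt else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


gt = {(0, 0), (1, 1)}                 # source keypoint 2 has no match in the target
maximal = {(0, 0), (1, 1), (2, 3)}    # a maximal matching is forced to match it anyway
partial = {(0, 0), (1, 1)}            # a partial matching can leave it unmatched

f1_maximal = matching_f1(maximal, gt)
f1_partial = matching_f1(partial, gt)
```

In this example the forced extra match is a false positive that lowers precision to 2/3 and the F1-score to 0.8, while recall alone would not register it.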
Since the underlying solver used by our method also works for partial matchings, our architecture is applicable out of the box. Competing architectures rely on either the Sinkhorn normalization or a softmax and as such, they are hard-wired to produce maximal matchings and do not offer a simple adjustment to the unfiltered setup. To simulate the negative impact of maximal matchings we provide an ablation of BB-GM where we modify the solver to output maximal matchings. This is denoted by BB-GM-Max.
In addition, we report the scores obtained by running the multi-graph matching solver [53] as post-processing. Instead of sampling pairs of images, we sample sets of 5 images and recover from the architecture the costs of the matching instances. The multi-graph matching solver then searches for a globally optimal set of consistent matchings. The results are provided in Tab. 3.
Note that sampling sets of 5 images instead of image pairs does not interfere with the statistics of the test set. The results are therefore comparable.
The Willow ObjectClass dataset contains a total of 256 images from 5 categories. Each category is represented by at least 40 images, all of them with consistent orientation. Each image is annotated with the same 10 distinctive category-specific keypoints, which means there is no difference between the described keypoint filtering methods. Following standard procedure, we crop the images to the bounding boxes of the objects and rescale them to a fixed resolution.
Multiple training strategies have been used in prior work. Some authors decide to train only on the relatively small Willow dataset, or pretrain on Pascal VOC and fine-tune on Willow afterward [59]. Another approach is to pretrain on Pascal VOC and evaluate on Willow without fine-tuning, to test transferability [60]. We report results for all different variants, following the standard procedure of using 20 images per class when training on Willow and excluding the classes car and motorbike from Pascal VOC when pre-training, as these images overlap with the Willow dataset. We also evaluated the strongest competing approach DGMC [24] under all settings.
The results are shown in Tab. 4. While our method achieves good performance, we are reluctant to claim superiority over prior work. The small dataset size, the multitude of training setups, and high standard deviations all prevent statistically significant comparisons.
We also report performance on SPair-71k [38], a dataset recently published in the context of dense image matching. It contains 70,958 image pairs prepared from Pascal VOC 2012 and Pascal 3D+. It has several advantages over the Pascal VOC dataset, namely higher image quality, richer keypoint annotations, difficulty annotation of image pairs, as well as the removal of the ambiguous and poorly annotated sofas and dining tables.
Again, we evaluated DGMC [24] as the strongest competitor of our method. The results are reported in Tab. 5 and Tab. 6. We consistently improve upon the baseline, particularly on pairs of images seen from very different viewpoints. This highlights the ability of our method to resolve instances with conflicting evidence. Some example matchings are presented in Fig. 1 and Fig. 7.
Method | Viewpoint difficulty: easy | medium | hard | All
DGMC* | | | |
BB-GM | | | |
To isolate the impact of single components of our architecture, we conduct various ablation studies. The results on Pascal VOC are summarized in Tab. 7 where large performance differences are highlighted.
Influence of the global feature vector is removed. In (8), instead of coefficients computed from the global feature vector, we use a single learnable vector of coefficients.
The matching instances consist only of unary (vertex) costs. The quadratic (edge) costs are set to zero.
We have demonstrated that deep learning architectures that integrate combinatorial graph matching solvers perform well on deep graph matching benchmarks.
Opportunities for future work now fall into multiple categories. For one, it should be tested whether such architectures can be useful outside the designated playground for deep graph matching methods. If more progress is needed, two major directions lend themselves: (i) improving the neural network architecture even further so that input costs to the matching problem become more discriminative and (ii) employing better solvers that improve in terms of obtained solution quality and ability to handle a more complicated and expressive cost structure (e.g. hypergraph matching solvers).
Finally, the potential of building architectures around solvers for other computer vision related combinatorial problems such as multicut or max-cut can be explored.
German Conference on Pattern Recognition, pages 285–296. Springer, 2015.
International Conference on Machine Learning, ICML’17, pages 136–145, 2017.
Realtime multi-person 2d pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’17, 2017.
Intl. Conf. on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 170–181. Springer, 2018.
Exact combinatorial optimization with graph convolutional neural networks. In Advances in Neural Information Processing Systems, NIPS’19, pages 15554–15566, 2019.
Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’15, pages 815–823, 2015.