Establishing dense correspondences across images is one of the fundamental tasks in computer vision [okutomi1993multiple, brox2004high, liu2011sift]. Early works focused on handling different views of the same scene (stereo matching [okutomi1993multiple, hosni2013fast]) or successive frames of a video sequence (optical flow [brox2004high, brox2009large]). Semantic correspondence algorithms (e.g., SIFT Flow [liu2011sift]) go one step further, finding a dense flow field between images depicting different instances of the same object or scene category. This has proven useful in various computer vision tasks, including object recognition [liu2011sift, duchenne2011graph], semantic segmentation [kim2013deformable], co-segmentation [taniai2016joint], image editing [dale2009image], and scene parsing [kim2013deformable, zhou2015flowweb].

Establishing dense semantic correspondences is very challenging, especially in the presence of large changes in appearance or scene layout and background clutter. Classical approaches to semantic correspondence [liu2011sift, kim2013deformable, hur2015generalized, bristow2015dense, yang2014daisy] typically use an objective function involving fidelity and regularization terms. The fidelity term encourages hand-crafted features (e.g., SIFT [lowe2004distinctive], HOG [dalal2005histograms], DAISY [tola2010daisy]) to be matched along a dense flow field between images, and the regularization term makes the field smooth while aligning discontinuities to object boundaries. Hand-crafted features, however, do not capture high-level semantics (e.g., appearance and shape variations), and they are not robust to image-specific details (e.g., texture, background clutter, occlusion).
Convolutional neural networks (CNNs) have allowed remarkable advances in semantic correspondence in the past few years. Recent methods using CNNs [han2017scnet, choy2016universal, novotny2017anchornet, kanazawa2016warpnet, kim2017fcss, zhou2016learning, rocco2017convolutional, rocco2018end, Seo2018AttentiveSA, Jeon2018PARN] benefit from rich semantic features invariant to intra-class variations, achieving state-of-the-art results. Semantic flow approaches [han2017scnet, choy2016universal, novotny2017anchornet, kim2017fcss, zhou2016learning] attempt to find correspondences for individual pixels or patches. They are not seriously affected by non-rigid deformations, but are easily distracted by background clutter. They also require a large amount of data with ground-truth correspondences for training. Although pixel-level semantic correspondences impose very strong constraints, manually annotating them is extremely labor-intensive and somewhat subjective, which limits the amount of training data available [ham2016proposal]. An alternative is to learn feature descriptors only [choy2016universal, novotny2017anchornet, kim2017fcss] or to exploit 3D CAD models together with rendering engines [zhou2016learning]. Semantic alignment methods [kanazawa2016warpnet, rocco2017convolutional, rocco2018end, Seo2018AttentiveSA, Jeon2018PARN]
, on the other hand, formulate semantic correspondence as a geometric alignment problem and directly regress the parameters of a global transformation model (e.g., an affine deformation or a thin plate spline) between images. They leverage self-supervised learning, where ground-truth parameters are generated synthetically using random transformations, at the cost of a higher sensitivity to non-rigid deformations. Moreover, background clutter prevents focusing on individual objects and interferes with the estimation of the transformation parameters. To overcome this problem, recent methods reduce the influence of distractors by inlier counting [rocco2018end] or an attention process [Seo2018AttentiveSA].
In this paper, we present a new approach to establishing an object-aware semantic flow and propose to exploit binary foreground masks as a supervisory signal during training (Fig. 1). Our approach builds upon the insight that high-quality correspondences between images make it possible to segment common objects from the background. To implement this idea, we introduce a new CNN architecture, dubbed SFNet, that outputs a semantic flow field at a sub-pixel level. We leverage a new and differentiable version of the argmax function, the kernel soft argmax, together with mask/flow consistency and smoothness terms, to train SFNet end to end, establishing object-aware correspondences while filtering out distracting details. Our approach has the following advantages: First, it is a good compromise between current semantic flow and alignment methods, since foreground masks are available for large datasets, and they provide an object-level prior for the semantic correspondence task. Exploiting these masks during training makes it possible to focus on learning correspondences between prominent objects and scene elements (masks are of course not used at test time). Second, our method establishes a dense non-parametric flow field (i.e., semantic flow), which is more robust to non-rigid deformations than parametric regression (i.e., semantic alignment). Finally, using the kernel soft argmax allows us to train the whole network end to end, and hence our approach further benefits from high-level semantics specific to the task of semantic correspondence. The main contributions of this paper can be summarized as follows:
We propose to exploit binary foreground masks, that are widely available and can be annotated more easily than individual point correspondences, to learn semantic flow by incorporating an object-level prior in the learning task.
We introduce a kernel soft argmax function, making our model quite robust to multi-modal distributions while providing a differentiable flow field at a sub-pixel level.
We set a new state of the art on standard benchmarks for semantic correspondence, mask transfer, and pose keypoint propagation, clearly demonstrating the effectiveness of our approach. We also provide an extensive experimental analysis with ablation studies.
A preliminary version of this work appeared in [lee2019sfnet]. This version adds (1) a detailed description of related works exploiting object priors for semantic correspondence; (2) an in-depth presentation of SFNet including the kernel soft argmax and loss terms; (3) more comparisons with the state of the art on different benchmarks including the TSS [taniai2016joint] and recent SPair-71k [min2019hyperpixel] datasets; (4) an evaluation on the task of pose keypoint propagation with the JHMDB dataset [jhuang2013towards]; and (5) an extensive experimental evaluation including a runtime comparison and a performance analysis on SFNet trained using noisy labels (i.e., bounding boxes) or with different datasets. To encourage comparison and future work, our code and model are available online: https://cvlab-yonsei.github.io/projects/SFNet.
2 Related Work
Correspondence problems cover a broad range of topics in computer vision including stereo, motion analysis, object recognition and shape matching. Giving a comprehensive review on these topics is beyond the scope of this paper. We thus focus on representative works related to ours.
2.1 Semantic Flow
Classical approaches focus on finding sparse correspondences, e.g., for instance matching [lowe2004distinctive], or establishing dense matches between nearby views of the same scene/object, e.g., for stereo matching [hosni2013fast, okutomi1993multiple] and optical flow estimation [brox2009large, brox2004high]. Unlike these, semantic correspondence methods estimate dense matches across pictures containing different instances of the same object or scene category. Early works on semantic correspondence focus on matching local features from hand-crafted descriptors, such as SIFT [liu2011sift, kim2013deformable, hur2015generalized, bristow2015dense], DAISY [yang2014daisy] and HOG [ham2016proposal, taniai2016joint, yang2017object], together with spatial regularization using graphical models [liu2011sift, kim2013deformable, taniai2016joint, hur2015generalized] or random sampling [barnes2009patchmatch, yang2014daisy]. However, hand-crafting features capturing high-level semantics is extremely hard, and similarities between them are easily distracted, e.g., by clutter, texture, occlusion and appearance variations. There have been many attempts to estimate correspondences robust to background clutter or scale changes between objects/object parts. These use object proposals as candidate regions for matching [ham2016proposal, yang2017object] or perform matching in scale space [qiu2014scale].
Recently, image features from CNNs have demonstrated a capacity both to represent high-level semantics and to be robust to appearance and shape variations [krizhevsky2012imagenet, simonyan2014very, he2016deep]. Long et al. [long2014convnets] apply CNNs to establish semantic correspondences between images. They follow the same procedure as the SIFT Flow [liu2011sift]
method, but exploit off-the-shelf CNN features trained for the ImageNet classification task [deng2009imagenet] due to a lack of training datasets with pixel-level annotations. This problem can be alleviated by synthesizing ground-truth correspondences from 3D models [zhou2016learning]
or augmenting the number of match pairs in a sparse keypoint dataset using interpolation [taniai2016joint]. More recently, new benchmarks for semantic correspondence have been released. PF-PASCAL [ham2017proposal] provides 1,300+ image pairs of 20 image categories with ground-truth annotations from the PASCAL 2011 keypoint dataset [BourdevMalikICCV09]. SPair-71k [min2019hyperpixel] consists of over 70k image pairs from PASCAL 3D+ [xiang2014beyond] and PASCAL VOC 2012 [everingham2010pascal] with rich annotations including keypoints, segmentation masks and bounding boxes. These enable learning local features [han2017scnet, kim2017fcss, rocco2018neighbourhood, min2019hyperpixel] specific to the task of semantic correspondence. FCSS [kim2017fcss] introduces a learnable local self-similarity descriptor robust to intra-class variations. SCNet [han2017scnet] and HPF [min2019hyperpixel] present region descriptors exploiting geometric consistency among object parts. NCN [rocco2018neighbourhood]
analyzes neighborhood consensus patterns in the 4D space of all possible correspondences in order to find spatially consistent matches, disambiguating feature matches on repetitive patterns. Although these approaches using CNN features outperform early methods by large margins, the loss functions they use for training typically do not involve a spatial regularizer mainly due to a lack of differentiability of the flow field. In contrast, our flow field is differentiable, allowing us to train the whole network end to end with a spatial regularizer.
2.2 Semantic Alignment
Several recent methods [kanazawa2016warpnet, rocco2017convolutional, rocco2018end, Seo2018AttentiveSA, Jeon2018PARN]
formulate semantic correspondence as a geometric alignment problem using parametric models. In particular, these methods first compute feature correlations between images. The feature correlations are then fed into a regression layer to estimate the parameters of a global transformation model (e.g., affine, homography, and thin plate spline) to align the images. This makes it possible to leverage self-supervised learning [kanazawa2016warpnet, rocco2017convolutional, rocco2018end, Seo2018AttentiveSA] using synthetically-generated data, and to train the entire CNN end to end. These approaches apply the same transformation to all pixels, which has the effect of an implicit spatial regularization, providing smooth matches and often outperforming semantic flow methods [choy2016universal, ham2016proposal, han2017scnet, kim2017fcss, zhou2016learning]. However, they are easily distracted by background clutter and occlusion [kanazawa2016warpnet, rocco2017convolutional]. Even when the influence of distractors is reduced by an attention process [Seo2018AttentiveSA] or by suppressing outlier matches [rocco2018end], global transformation models remain highly sensitive to non-rigid deformations or local geometric variations. Alternative approaches include estimating local transformation models in a coarse-to-fine scheme [Jeon2018PARN] or applying the geometric transformation recursively [kim2018recurrent], but they are computationally expensive. In contrast, our method avoids the problem efficiently by establishing semantic correspondences directly from feature correlations.
2.3 An Object-level Prior for Semantic Correspondence
Several methods [ham2016proposal, han2017scnet, yang2017object, Jeon2018PARN, kim2017fcss, zhou2015flowweb, zhou2016learning] leverage object priors (e.g., object proposals, bounding boxes or foreground masks) to learn semantic correspondence. Proposal flow [ham2016proposal] and its CNN version [han2017scnet] use object proposals as matching primitives, and consider appearance and geometric consistency constraints to establish region correspondences. OADSC [yang2017object] also exploits object proposals, but leverages hierarchical graphs built on the proposals in a coarse-to-fine manner, allowing pixel-level correspondences. Similar to ours, other methods leverage bounding boxes or foreground masks for semantic correspondence. They, however, do not incorporate the object location prior explicitly into loss functions, and use the prior for pre-processing training samples instead. For example, PARN [Jeon2018PARN] and FCSS [kim2017fcss] use bounding boxes or foreground masks to generate positive/negative matches within object regions at training time. In [zhou2015flowweb, zhou2016learning], bounding boxes are used to limit the candidate regions for matching at both training and test time. Contrary to these methods, we incorporate this prior (e.g., bounding boxes or foreground masks) directly into the loss functions to train the network, and outperform the state of the art by a significant margin.
In this section, we describe our approach to establishing object-aware semantic correspondences including the network architecture (Section 3.1) and loss functions (Section 3.2). An overview of our method is shown in Fig. 2.
3.1 Network Architecture
Our model consists of three main parts (Fig. 2): We first extract features from the source and target images using a fully convolutional siamese network, where the two branches have the same structure and share parameters. We then compute matching scores between all pairs of local features in the two images, and assign the best match to each feature using the kernel soft argmax function defined in Section 3.1.3. All components are differentiable, allowing us to train the whole network end to end. In the following, we describe the network architecture for source-to-target matching in detail. Target-to-source matches are computed in the same manner.
3.1.1 Feature Extraction
For feature extraction, we exploit a ResNet-101 [he2016deep] pretrained for the ImageNet classification task [deng2009imagenet]. Although such CNN features give rich semantics, they typically fire on highly discriminative parts for classification [zhou2016cam]. This may be less adequate for feature matching, which requires capturing spatial deformations for fine-grained localization. We thus use additional adaptation layers to extract features specific to the task of semantic correspondence, making them highly discriminative w.r.t. both appearance and spatial context. This gives a feature map for each image that corresponds to a spatial grid of local features. We then apply L2 normalization to the individual local features. As will be seen in our experiments, the adaptation layers boost the matching performance drastically.
3.1.2 Feature Matching
Matching scores are computed as the dot product between local features, resulting in a 4-dimensional correlation map as follows:
$$c(p, q) = f^s(p)^\top f^t(q),$$
where we denote by $f^s(p)$ and $f^t(q)$ the $d$-dimensional features at positions $p$ and $q$ in the source and target images, respectively.
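The feature matching step can be sketched in a few lines. Below is an illustrative NumPy version (the actual network operates on GPU feature tensors); `feat_s` and `feat_t` stand for the L2-normalized source and target feature maps:

```python
import numpy as np

def correlation_map(feat_s, feat_t):
    # feat_s, feat_t: (h, w, d) feature maps of L2-normalized local features.
    # Returns a 4D correlation map c of shape (h, w, h, w) with
    # c[py, px, qy, qx] = <f_s(p), f_t(q)>.
    return np.einsum('ijd,kld->ijkl', feat_s, feat_t)
```

The einsum contracts the shared feature dimension, so each entry of the 4D map is the dot product (a cosine similarity, given the normalization) between one source and one target local feature.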
3.1.3 Kernel Soft Argmax Layer
We could assign the best matches by applying the argmax function over the 2-dimensional correlation map $c(p, \cdot)$, w.r.t. all target features, at each spatial location $p$. However, argmax is not differentiable. The soft argmax function [honari2018improving, kendall2017end] computes an output as a weighted average of all spatial positions with corresponding matching probabilities (i.e., an expected value of all spatial coordinates weighted by the corresponding probabilities). Although it is differentiable and enables fine-grained localization at a sub-pixel level, its output is influenced by all spatial positions, which is problematic especially in the case of multi-modal distributions (Fig. 3). In other words, the soft argmax best approximates the discrete argmax when the matching probability is uni-modal, with one clear peak.
We introduce a hybrid version, the kernel soft argmax, that takes advantage of both the soft and discrete argmax. Concretely, it computes the correspondence $\phi(p)$ for each location $p$ as an average of all coordinate pairs weighted by matching probabilities as follows:
$$\phi(p) = \sum_{q} m(p, q)\, q.$$
The matching probability $m(p, q)$ is computed by applying a spatial softmax function to an L2-normalized version $\bar{c}$ of the correlation map $c$:
$$m(p, q) = \frac{\exp\left(\beta\, k_p(q)\, \bar{c}(p, q)\right)}{\sum_{q'} \exp\left(\beta\, k_p(q')\, \bar{c}(p, q')\right)}.$$
We perform L2 normalization on the 2-dimensional correlation map $c(p, \cdot)$, adjusting the matching scores to a common scale before applying the softmax function. $\beta$ is a “temperature” parameter adjusting the distribution of the softmax output. As the temperature parameter increases, the softmax function approaches the discrete argmax, with one clear peak, but this may cause an unstable gradient flow at training time. $k_p$ is a 2-dimensional Gaussian kernel centered on the position obtained by applying the discrete argmax to the correlation map, i.e., $\arg\max_q c(p, q)$. The Gaussian kernel allows us to retain the scores near the output of the discrete argmax while suppressing the others. That is, the kernel has the effect of restricting the range of averaging in (2), and makes the kernel soft argmax less susceptible to multi-modal distributions (e.g., from ambiguous matches in clutter and repetitive patterns) while maintaining differentiability. Note that the center position of the Gaussian kernel changes at every iteration during training, and the matching probability remains differentiable, since we do not train the Gaussian kernel itself and no gradients are propagated through the discrete argmax. Note also that normalizing the correlation map is particularly important for semantic alignment methods [rocco2017convolutional, rocco2018end, Seo2018AttentiveSA, Jeon2018PARN] (see, for example, Table 2 in [rocco2017convolutional]), but its purpose there is different from ours: they use the normalization to penalize features having multiple highly-correlated matches, boosting the scores of discriminative matches.
We visualize soft and kernel soft argmax operators in Fig. 3, which shows that the soft argmax yields an incorrect correspondence in the presence of multiple highly correlated features, since a weighted average of matching probabilities having multi-modal distributions accumulates positional errors. The kernel soft argmax instead suppresses matching probabilities for the highest value, making them have an (approximately) uni-modal distributions and favoring correct correspondences.
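The kernel soft argmax can be sketched as follows. This is a minimal NumPy illustration for a single source location; the temperature and kernel width below are hypothetical placeholder values (the paper tunes them on a validation split):

```python
import numpy as np

def kernel_soft_argmax(corr, beta=50.0, sigma=5.0):
    # corr: (h, w) matching scores of one source location against all
    # target positions. Returns sub-pixel (x, y) coordinates.
    h, w = corr.shape
    # L2-normalize the 2D correlation map to a common scale.
    norm = corr / (np.linalg.norm(corr) + 1e-8)
    # 2D Gaussian kernel centered on the discrete argmax; in the actual
    # network no gradients flow through this step.
    cy, cx = np.unravel_index(np.argmax(corr), corr.shape)
    ys, xs = np.mgrid[0:h, 0:w]
    kernel = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    # Spatial softmax with temperature beta; the kernel suppresses
    # secondary modes far from the discrete argmax.
    logits = beta * kernel * norm
    prob = np.exp(logits - logits.max())
    prob /= prob.sum()
    # Expected coordinates under the (approximately uni-modal) distribution.
    return float((prob * xs).sum()), float((prob * ys).sum())
```

With a distractor mode present, the plain soft argmax would be pulled toward it, whereas the Gaussian kernel keeps the expectation near the dominant peak.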
We exploit binary foreground masks, which give a strong object prior, as a supervisory signal to train the network. To this end, we define three loss terms that guide the network to learn object-aware correspondences without pixel-level ground truth:
$$\mathcal{L} = \lambda_{mask} \mathcal{L}_{mask} + \lambda_{flow} \mathcal{L}_{flow} + \lambda_{smooth} \mathcal{L}_{smooth},$$
which consists of mask consistency $\mathcal{L}_{mask}$, flow consistency $\mathcal{L}_{flow}$ and smoothness $\mathcal{L}_{smooth}$ terms, balanced by the weight parameters ($\lambda_{mask}$, $\lambda_{flow}$, $\lambda_{smooth}$). In the following, we describe each term in detail.
3.2.1 Mask Consistency Term
We define a flow field from source to target images as
$$\mathcal{F}^{s \to t}(p) = \phi^{s \to t}(p) - p.$$
Similarly, a flow field from target to source images is defined as $\mathcal{F}^{t \to s}$. We denote by $M^s$ and $M^t$ the binary masks of the source and target images, respectively. Values of 0 and 1 in the masks respectively indicate background and foreground regions. We assume that the binary mask of the source image can be reconstructed by warping the mask of the target image, and vice versa, if we have discriminative features and correct dense correspondences. To implement this idea, we transfer the target mask by warping [jaderberg2015spatial] using the flow field and obtain an estimate of the source mask as follows:
$$\hat{M}^s = \mathcal{W}(M^t; \mathcal{F}^{s \to t}),$$
where $\mathcal{W}$ denotes a warping operator using the flow field, e.g., $\mathcal{W}(M^t; \mathcal{F}^{s \to t})(p) = M^t(p + \mathcal{F}^{s \to t}(p))$. We then compute the difference between the source mask $M^s$ and its estimate $\hat{M}^s$. Similarly, we reconstruct the target mask $\hat{M}^t$ from $M^s$ using the flow field $\mathcal{F}^{t \to s}$ and compute the difference between $M^t$ and $\hat{M}^t$. Accordingly, we define the mask consistency loss (Fig. 4) as
$$\mathcal{L}_{mask} = \frac{1}{N^s} \sum_{p} \left| M^s(p) - \hat{M}^s(p) \right| + \frac{1}{N^t} \sum_{p} \left| M^t(p) - \hat{M}^t(p) \right|,$$
where $N^s$ and $N^t$ are the numbers of pixels in the masks $M^s$ and $M^t$, respectively. Although the mask consistency loss does not constrain the background itself, it prevents matches between foreground and background regions by penalizing them. This encourages correspondences to be established between features within the foreground masks (and, likewise, within the background), guiding our model to learn object-aware correspondences. Note that the consistency loss using binary masks does not prevent many-to-one matching (Fig. 5(a)). That is, it does not penalize the case where many foreground features in one image are matched to a single one in the other image. For example, the foreground mask in the source image can even be reconstructed when all points in the foreground region are matched to a single foreground point in the target image.
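The mask consistency loss can be sketched as follows. This NumPy version uses nearest-neighbor warping for simplicity, whereas the network uses differentiable bilinear warping [jaderberg2015spatial]:

```python
import numpy as np

def warp(field, flow):
    # Inverse warping: out(p) = field(p + flow(p)), nearest neighbor.
    # field: (h, w); flow: (h, w, 2) holding (dx, dy) per pixel.
    h, w = field.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    qx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    qy = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return field[qy, qx]

def mask_consistency_loss(mask_s, mask_t, flow_st, flow_ts):
    # Reconstruct each mask by warping the other one with the
    # corresponding flow field, then compare with an L1 difference.
    est_s = warp(mask_t, flow_st)
    est_t = warp(mask_s, flow_ts)
    return (np.abs(mask_s - est_s).sum() / mask_s.size
            + np.abs(mask_t - est_t).sum() / mask_t.size)
```

When the two flow fields correctly relate the two foreground regions, both reconstructions match and the loss vanishes.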
3.2.2 Flow Consistency Term
To address the many-to-one matching problem, we propose to use a flow consistency loss. It measures the consistency between the flow fields $\mathcal{F}^{s \to t}$ and $\mathcal{F}^{t \to s}$ within the foreground masks as
$$\mathcal{L}_{flow} = \frac{1}{N_f^s} \sum_{p} \left\| \left( \mathcal{F}^{s \to t} + \widetilde{\mathcal{F}}^{t \to s} \right)(p) \odot M^s(p) \right\|_2 + \frac{1}{N_f^t} \sum_{p} \left\| \left( \mathcal{F}^{t \to s} + \widetilde{\mathcal{F}}^{s \to t} \right)(p) \odot M^t(p) \right\|_2,$$
where $N_f^s$ is the number of foreground pixels in the mask $M^s$, and
$$\widetilde{\mathcal{F}}^{t \to s} = \mathcal{W}(\mathcal{F}^{t \to s}; \mathcal{F}^{s \to t}),$$
which aligns the flow field $\mathcal{F}^{t \to s}$ with respect to $\mathcal{F}^{s \to t}$ by warping. $\widetilde{\mathcal{F}}^{s \to t}$ is computed similarly to (10). We denote by $\left\| \cdot \right\|_2$ and $\odot$ the L2 norm and element-wise multiplication, respectively. The multiplication is applied separately to each $x$ and $y$ component.
The flow consistency term penalizes inconsistent correspondences (Fig. 5(b)) and favors one-to-one matching (Fig. 5(c)), alleviating the many-to-one matching problem of the mask consistency loss. For example, when the flow fields are consistent with each other, the forward flow field and the aligned backward flow field in (9) have the same magnitude but opposite directions. Note that having multiple matches for individual points (i.e., one-to-many matching) is impossible within our framework. Similar ideas have been explored in stereo fusion [zbontar2015computing, godard2017unsupervised] and optical flow [meister2018unflow, zou2018df], but without appearance and shape variations. It is hard to incorporate this term into current CNN-based semantic flow methods [choy2016universal, han2017scnet, kim2017fcss, rocco2018neighbourhood, min2019hyperpixel], mainly due to a lack of differentiability of the flow field. Recently, Zhou et al. [zhou2016learning] exploit cycle consistency between flow fields, but they regress correspondences directly from concatenated source and target features and do not consider background clutter. In contrast, our method establishes a differentiable flow field by computing feature similarities explicitly while accounting for background clutter.
Although the flow consistency term relieves the many-to-one matching problem, computing this term for a source or a target image only may cause a flow shrinkage problem (Fig. 6(a)). To address this problem, we compute this term w.r.t both source and target images in (9). This penalizes inconsistent matches, e.g., between the entire foreground region in the source image and small parts of the target image (Fig. 6(b)). Note that spreading the flow fields over the entire regions is particularly important to handle scale changes between objects (Fig. 6(c)).
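The flow consistency computation can be sketched as follows, again as a NumPy illustration with nearest-neighbor alignment of the two flow fields (`warp` follows the same inverse-warping convention as for the masks):

```python
import numpy as np

def warp(field, flow):
    # Inverse warping of an (h, w, c) field by an (h, w, 2) flow,
    # using nearest-neighbor lookup for simplicity.
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    qx = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    qy = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return field[qy, qx]

def flow_consistency_loss(flow_st, flow_ts, mask_s, mask_t):
    # The forward flow plus the backward flow aligned into the same frame
    # should cancel out inside the foreground (one-to-one matching).
    diff_s = (flow_st + warp(flow_ts, flow_st)) * mask_s[..., None]
    diff_t = (flow_ts + warp(flow_st, flow_ts)) * mask_t[..., None]
    n_s = max(mask_s.sum(), 1.0)
    n_t = max(mask_t.sum(), 1.0)
    return (np.linalg.norm(diff_s, axis=-1).sum() / n_s
            + np.linalg.norm(diff_t, axis=-1).sum() / n_t)
```

Computing the term in both directions, as in the text above, is what spreads the flow fields over the entire foreground regions instead of letting them collapse.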
3.2.3 Smoothness Term
The differentiable flow field also allows us to exploit a smoothness term, as widely used in classical energy-based approaches [liu2011sift, kim2013deformable, hur2015generalized]. We define this term using the first-order derivatives of the flow fields $\mathcal{F}^{s \to t}$ and $\mathcal{F}^{t \to s}$ as
$$\mathcal{L}_{smooth} = \sum_{p} \left\| \nabla \mathcal{F}^{s \to t}(p) \odot M^s(p) \right\|_1 + \sum_{p} \left\| \nabla \mathcal{F}^{t \to s}(p) \odot M^t(p) \right\|_1,$$
where $\left\| \cdot \right\|_1$ and $\nabla$ are the L1 norm and the gradient operator, respectively. This regularizes (or smooths) flow fields within foreground regions without being affected by (incorrect) correspondences in the background.
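A NumPy sketch of the smoothness term for one flow direction (the full loss sums the same quantity for both flow fields):

```python
import numpy as np

def smoothness_loss(flow, mask):
    # First-order (total-variation style) smoothness of the flow field,
    # restricted to the foreground by the binary mask.
    # flow: (h, w, 2); mask: (h, w) with values in {0, 1}.
    dy = np.abs(np.diff(flow, axis=0)) * mask[1:, :, None]
    dx = np.abs(np.diff(flow, axis=1)) * mask[:, 1:, None]
    return dy.sum() + dx.sum()
```

A constant flow incurs no penalty, while a flow that varies inside the foreground is penalized in proportion to its first-order differences.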
In this section, we give experimental details (Section 4.1), and present a detailed analysis and evaluation of our approach on the tasks of semantic correspondence (Section 4.2), mask transfer (Section 4.3) and pose keypoint propagation (Section 4.4). We then present ablation studies for different losses and network architectures, as well as a performance analysis for different training datasets (Section 4.5).
4.1 Experimental Details
Following [kim2018recurrent, rocco2018neighbourhood, rocco2018end, Seo2018AttentiveSA, min2019hyperpixel], we use CNN features from a ResNet-101 [he2016deep] trained for ImageNet classification [deng2009imagenet]. Specifically, we use the networks cropped at the conv4-23 and conv5-3 layers, respectively. This results in two feature maps at different resolutions for a pair of input images, giving a good compromise between localization accuracy and high-level semantics. Adaptation layers are trained from random initialization, separately for each feature map, in a residual fashion [he2016deep]
. To compute residuals, we add two blocks of convolutional and batch normalization [Ioffe2015BatchNA] layers, with padding, on top of each feature map, where different convolutional kernel sizes are used for the conv4-23 and conv5-3 features. Each block outputs a residual, which is then added to the corresponding input features. Adaptation layers aggregate the features nonlinearly from large receptive fields, transforming them, guided by semantic correspondences and the corresponding loss terms in (4), to be highly discriminative w.r.t. both appearance and spatial context. With the resulting two feature maps (we upsample the features adapted from conv5-3 using bilinear interpolation), we compute pairwise matching scores and then combine them by element-wise multiplication, resulting in a single correlation map. Following [rocco2017convolutional, Seo2018AttentiveSA], we fix the feature extractor and train the adaptation layers only. At test time, we upsample the flow field to the input image size using bilinear interpolation.
We empirically set the temperature parameter of the softmax to 50 and the standard deviation of the Gaussian kernel to 5. We determine these values using a grid search over pairs of the two parameters, where the maximum search ranges are 100 and 10 with intervals of 10 and 1, respectively. We select the parameters that give the best performance on the validation split of the PF-PASCAL dataset [ham2017proposal, rocco2018end]. The other parameters (the loss weights) are chosen similarly using the validation split of the PF-PASCAL dataset. We fix these parameters in all experiments.
Training our network requires pairs of foreground masks for source and target images depicting different instances of the same object category. Although the TSS [taniai2016joint] and Caltech-101 [fei2006one] datasets provide such pairs, the number of masks in TSS [taniai2016joint] is not sufficient to train our network, and images in Caltech-101 [fei2006one] lack background clutter. A model trained on these datasets thus suffers from overfitting and may not generalize well to other images containing clutter. Motivated by [kanazawa2016warpnet, rocco2017convolutional, Seo2018AttentiveSA, novotny2018self], we generate pairs of source and target images synthetically from single images by applying random affine transformations, and use the synthetically warped pairs as training samples. The corresponding foreground masks are transformed with the same parameters. Contrary to [kanazawa2016warpnet, rocco2017convolutional, Seo2018AttentiveSA], our model does not perform parametric regression, and thus does not require ground-truth transformation parameters for training. We use the PASCAL VOC 2012 segmentation dataset [everingham2010pascal], which consists of 1,464, 1,449, and 1,456 images for training, validation and testing, respectively. We exclude 122 images from the train/validation sets that overlap with the test split of PF-PASCAL [ham2017proposal, rocco2018end]
, and train our model with the corresponding 2,791 images. We augment the training dataset by horizontal flipping and color jittering. Note that we do not use the segmentation masks provided by the PASCAL VOC 2012 dataset, which specify the class of the object at each pixel. We instead generate binary foreground masks from all labeled objects, regardless of image categories and the number of objects, at training time. We train our model with a batch size of 16 for about 7k iterations, corresponding to roughly 40 epochs over the training data. We use the Adam optimizer [kingma2014adam]. The learning rate, initially set to 3e-5, is divided by 5 after 30 epochs. All networks are trained end to end using PyTorch [paszke2017automatic].
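Generating a synthetic training pair from a single image can be sketched as follows. The sampling ranges below are hypothetical placeholders, not the ones used in the paper, and nearest-neighbor lookup stands in for proper image resampling:

```python
import numpy as np

def random_affine(h, w, max_scale=0.2, max_trans=0.1, max_rot=np.pi / 12,
                  rng=None):
    # Sample a random similarity-type affine transform (scale, rotation,
    # translation). Ranges are illustrative only.
    rng = np.random.default_rng() if rng is None else rng
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    a = rng.uniform(-max_rot, max_rot)
    A = s * np.array([[np.cos(a), -np.sin(a)],
                      [np.sin(a), np.cos(a)]])
    t = rng.uniform(-max_trans, max_trans, 2) * np.array([w, h])
    return A, t

def warp_affine(img, A, t):
    # Nearest-neighbor inverse warping: out(p) = img(A^{-1}(p - t)).
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
    src = (pts - t) @ np.linalg.inv(A).T
    sx = np.clip(np.round(src[:, 0]).astype(int), 0, w - 1)
    sy = np.clip(np.round(src[:, 1]).astype(int), 0, h - 1)
    return img[sy, sx].reshape(img.shape)
```

The same `A` and `t` are applied to the image and its binary foreground mask, so the warped pair comes with a pixel-accurate mask correspondence for free.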
4.1.3 Evaluation Metric
We use the probability of correct keypoint (PCK) [yang2013articulated] to measure the precision of the overall assignment, particularly at sparse keypoints of semantic relevance. We compute the Euclidean distance between keypoints warped using the estimated dense flow and the ground truth, and count the number of keypoints whose distances lie within $\alpha \cdot \max(h, w)$ pixels, where $\alpha$ is a tolerance value, and $h$ and $w$ are the height and width, respectively, of an image or an object bounding box. Following [rocco2017convolutional, rocco2018end], when $h$ and $w$ are those of the image, we divide keypoint coordinates by the height and width of the image, such that they are normalized to the range of 0 and 1.
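The PCK metric described above can be sketched as follows (a NumPy version; `size` is either the image size or the object bounding-box size, depending on the evaluation protocol):

```python
import numpy as np

def pck(pred_kps, gt_kps, size, alpha=0.1):
    # pred_kps, gt_kps: (n, 2) arrays of (x, y) keypoint coordinates.
    # size: (h, w) of the reference frame (image or object bounding box).
    # A keypoint is correct if its distance to the ground truth lies
    # within alpha * max(h, w) pixels.
    thresh = alpha * max(size)
    dist = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dist <= thresh).mean())
```
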
Table I: Average PCK on the PF-WILLOW and PF-PASCAL datasets.

| Type | Method | PF-WILLOW | PF-PASCAL |
| --- | --- | --- | --- |
| A | (T) A2Net [Seo2018AttentiveSA] | 68.8 | 70.8 |
| A | (T) CNNGeo [rocco2017convolutional] | 69.2 | 71.9 |
| A | (T+P) WS-SA [rocco2018end] | 70.2 | 75.8 |
| A | (P) RTN [kim2018recurrent] | 71.9 | 75.9 |
| F | (B+P) PF-LOM [kim2019fcss] | 58.4 | 68.9 |
| F | (P) NCN [rocco2018neighbourhood] | 67.0 | 78.9 |
| F | (C+P) HPF [min2019hyperpixel] | 74.4 | 80.4 |
4.2 Semantic Correspondence
We compare our model to the state of the art in semantic correspondence, including hand-crafted and CNN-based methods, on the following four benchmark datasets: PF-WILLOW [ham2016proposal], PF-PASCAL [ham2017proposal], SPair-71k [min2019hyperpixel], and TSS [taniai2016joint]. Following the experimental protocol in [rocco2017convolutional, rocco2018end, min2019hyperpixel], we measure PCK w.r.t. the image size for PF-PASCAL and TSS, and w.r.t. the object bounding-box size for PF-WILLOW and SPair-71k. Results for all comparisons have been obtained from the source code or models provided by the authors, unless otherwise specified.
Per-class PCK on the PF-PASCAL dataset:

| Type | Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | all |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| A | (T) A2Net [Seo2018AttentiveSA] | 83.2 | 82.8 | 83.8 | 44.4 | 57.8 | 81.3 | 89.4 | 86.1 | 40.1 | 91.7 | 21.4 | 73.2 | 33.8 | 76.3 | 74.3 | 63.3 | 100.0 | 45.5 | 45.3 | 60.0 | 70.8 |
| A | (T) CNNGeo [rocco2017convolutional] | 82.4 | 80.9 | 85.9 | 47.2 | 57.8 | 83.1 | 92.8 | 86.9 | 43.8 | 91.7 | 28.1 | 76.4 | 70.2 | 76.6 | 68.9 | 65.7 | 80.0 | 50.1 | 46.3 | 60.6 | 71.9 |
| A | (T+P) WS-SA [rocco2018end] | 83.7 | 88.0 | 83.4 | 58.3 | 68.8 | 90.3 | 92.3 | 83.7 | 47.4 | 91.7 | 28.1 | 76.3 | 77.0 | 76.0 | 71.4 | 76.2 | 80.0 | 59.5 | 62.3 | 63.9 | 75.8 |
| F | (P) NCN [rocco2018neighbourhood] | 86.8 | 86.7 | 86.7 | 55.6 | 82.8 | 88.6 | 93.8 | 87.1 | 54.3 | 87.5 | 43.2 | 82.0 | 64.1 | 79.2 | 71.1 | 71.0 | 60.0 | 54.2 | 75.0 | 82.8 | 78.9 |
| F | (C+P) HPF [min2019hyperpixel] | 86.5 | 88.9 | 81.6 | 75.0 | 81.3 | 89.7 | 93.7 | 87.6 | 62.2 | 87.5 | 52.6 | 87.5 | 74.2 | 83.5 | 73.5 | 66.2 | 60.0 | 66.2 | 68.5 | 66.7 | 80.4 |
Per-class PCK on the SPair-71k dataset.

Transferred models:

| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | dog | horse | mbike | person | plant | sheep | train | tv | all |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (T) CNNGeo [rocco2017convolutional] | 21.3 | 15.1 | 34.6 | 12.8 | 31.2 | 26.3 | 24.0 | 30.6 | 11.6 | 24.3 | 20.4 | 12.2 | 19.7 | 15.6 | 14.3 | 9.6 | 28.5 | 28.8 | 18.1 |
| (T) A2Net [Seo2018AttentiveSA] | 20.8 | 17.1 | 37.4 | 13.9 | 33.6 | 29.4 | 26.5 | 34.9 | 12.0 | 26.5 | 22.5 | 13.3 | 21.3 | 20.0 | 16.9 | 11.5 | 28.9 | 31.6 | 20.1 |
| (T+P) WS-SA [rocco2018end] | 23.4 | 17.0 | 41.6 | 14.6 | 37.6 | 28.1 | 26.6 | 32.6 | 12.6 | 27.9 | 23.0 | 13.6 | 21.3 | 22.2 | 17.9 | 10.9 | 31.5 | 34.8 | 21.1 |
| (P) NCN [rocco2018neighbourhood] | 24.0 | 16.0 | 45.0 | 13.7 | 35.7 | 25.9 | 19.0 | 50.4 | 14.3 | 32.6 | 27.4 | 19.2 | 21.7 | 20.3 | 20.4 | 13.6 | 33.6 | 40.4 | 26.4 |

SPair-71k trained models:

| Method | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | dog | horse | mbike | person | plant | sheep | train | tv | all |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (T) CNNGeo [rocco2017convolutional] | 23.4 | 16.7 | 40.2 | 14.3 | 36.4 | 27.7 | 26.0 | 32.7 | 12.7 | 27.4 | 22.8 | 13.7 | 20.9 | 21.0 | 17.5 | 10.2 | 30.8 | 34.1 | 20.6 |
| (T) A2Net [Seo2018AttentiveSA] | 22.6 | 18.5 | 42.0 | 16.4 | 37.9 | 30.8 | 26.5 | 35.6 | 13.3 | 29.6 | 24.3 | 16.0 | 21.6 | 22.8 | 20.5 | 13.5 | 31.4 | 36.5 | 22.3 |
| (T+P) WS-SA [rocco2018end] | 22.2 | 17.6 | 41.9 | 15.1 | 38.1 | 27.4 | 27.2 | 31.8 | 12.8 | 26.8 | 22.6 | 14.2 | 20.0 | 22.2 | 17.9 | 10.4 | 32.2 | 35.1 | 20.9 |
| (P) NCN [rocco2018neighbourhood] | 17.9 | 12.2 | 32.1 | 11.7 | 29.0 | 19.9 | 16.1 | 39.2 | 9.9 | 23.9 | 18.8 | 15.7 | 17.4 | 15.9 | 14.8 | 9.6 | 24.2 | 31.1 | 20.1 |
| (C+P) HPF [min2019hyperpixel] | 25.2 | 18.9 | 52.1 | 15.7 | 38.0 | 22.8 | 19.1 | 52.9 | 17.9 | 33.0 | 32.8 | 20.6 | 24.4 | 27.9 | 21.1 | 15.9 | 31.5 | 35.6 | 28.2 |
4.2.1 PF-WILLOW and PF-PASCAL
The PF-WILLOW [ham2016proposal] and PF-PASCAL [ham2017proposal] datasets provide 900 and 1,351 image pairs of 4 and 20 image categories, respectively, with corresponding ground-truth object bounding boxes and keypoint annotations. The PF-PASCAL dataset is more challenging than other datasets [ham2016proposal, taniai2016joint] for semantic correspondence evaluation, featuring different instances of the same object class in the presence of large changes in appearance and scene layout, clutter, and scale changes between objects. To evaluate our model, we use PF-WILLOW and the test split of PF-PASCAL provided by [ham2017proposal, rocco2018end], corresponding to roughly 900 and 300 image pairs, respectively.
We show in Table I the average PCK scores for the PF-WILLOW and PF-PASCAL datasets, and compare our method with the state of the art. From this table, we observe four things: (1) Our model outperforms the state of the art by a significant margin in terms of PCK, especially on the PF-PASCAL dataset. In particular, it shows better performance than other object-aware methods [ham2016proposal, kim2019fcss] that focus on establishing region correspondences between prominent objects. A plausible explanation is that establishing correspondences between object proposals is susceptible to shape deformations. (2) Our model gives better results than semantic alignment methods on both datasets, but the performance gain on the PF-PASCAL dataset, which typically contains pictures depicting non-rigid deformations and clutter (e.g., in the cow and sofa classes), is more significant. For example, the PCK gain over RTN [kim2018recurrent] on PF-PASCAL (81.9 vs. 75.9) is about four times larger than that on PF-WILLOW (73.5 vs. 71.9), indicating that our semantic flow method is more robust to non-rigid deformations and background clutter than semantic alignment approaches. (3) By comparing our model with CNN-based semantic flow methods, we can see that incorporating a spatial regularizer is significant. These techniques focus on designing fidelity terms (e.g., using a contrastive loss [choy2016universal, kim2019fcss]) to learn a feature space preserving semantic similarities, but they cannot exploit a spatial regularizer because their flow fields are not differentiable. In contrast, our model gives a differentiable flow field, allowing us to exploit a spatial regularizer while further leveraging high-level semantics from CNN features more specific to semantic correspondence.
(4) We confirm once more a finding in [long2014convnets] that CNN features trained for ImageNet classification [deng2009imagenet] clearly show a better ability to handle intra-class variations than hand-crafted ones (HOG [dalal2005histograms] in PF-LOM [ham2016proposal]).
Table II shows per-class PCK scores on the PF-PASCAL dataset [ham2017proposal]. Our model achieves state-of-the-art results for 11 object categories, and outperforms all methods on average by a large margin. The performance gain is significant especially in the presence of non-rigid deformations (e.g., in cow and sheep classes) or distractions such as clutter (e.g., in table and sofa classes). This demonstrates once again that our method is able to establish reliable semantic correspondences of keypoints, even for images with large shape variations and clutter by which semantic alignment methods are easily distracted.
4.2.2 SPair-71k
The SPair-71k dataset [min2019hyperpixel], a large-scale benchmark for semantic correspondence, provides 70,958 image pairs of 18 object categories with ground-truth annotations for object bounding boxes, segmentation masks, and keypoints. The image pairs in SPair-71k feature various changes in viewpoint, scale, truncation, and occlusion. Following the experimental protocol of [min2019hyperpixel], we evaluate our model on the test split of 12,234 image pairs, and compute PCK scores with a threshold relative to object bounding-box size. We show in Table III the per-class and average PCK scores, and compare our model with the state of the art [rocco2017convolutional, Seo2018AttentiveSA, rocco2018end, rocco2018neighbourhood, min2019hyperpixel]. The first five rows show the PCK scores for the models provided by the authors, without retraining or fine-tuning on the SPair-71k dataset. We can see that our model achieves the second-best performance, demonstrating that it generalizes well to unseen images. It is slightly outperformed by NCN [rocco2018neighbourhood] (by 0.4% in terms of average PCK) but runs about 11 times faster at test time (Table V). The last six rows show the PCK scores for the models trained on SPair-71k. For a fair comparison, we train our model on the training set of SPair-71k (986 images). The results show that it performs best in the presence of non-rigid deformations (i.e., in the cat and cow classes). For other object categories, our model outperforms other CNN-based methods except for HPF [min2019hyperpixel]. Note that HPF exploits ground-truth correspondences at training time, which provide strong constraints but are extremely labor-intensive to annotate. In contrast, our model uses only binary foreground masks, which are widely available and much cheaper to obtain.
(Figure caption) Keypoints in the source and target images are shown as diamonds and crosses, respectively, with a vector representing the matching error. All methods use ResNet-101 features. Compared to the state of the art, our method is more robust to local non-rigid deformations, scale changes between objects, and clutter. See text for details. (Best viewed in color.)
| Group | Type | Methods | FG3DCar | JODS | PASCAL |
|---|---|---|---|---|---|
| CNN-based | A | (T) A2Net [Seo2018AttentiveSA] | 87.0 | 67.0 | 55.0 |
| | A | (T) CNNGeo [rocco2017convolutional] | 90.1 | 76.4 | 56.3 |
| | A | (T+P) WS-SA [rocco2018end] | 90.3 | 76.4 | 56.5 |
| | A | (P) RTN [kim2018recurrent] | 90.1 | 78.2 | 63.3 |
| | F | (B+P) PF-LOM [kim2017fcss] | 83.9 | 63.5 | 58.2 |
| | F | (B+P) DCTM [kim2017dctm] | 89.1 | 72.1 | 61.0 |
4.2.3 TSS
The TSS dataset [taniai2016joint] consists of three subsets (FG3DCar, JODS, and PASCAL) that contain 400 image pairs of 7 object categories. It provides dense flow fields, obtained by interpolating sparse keypoint matches, together with co-segmentation masks. Following the experimental protocol of [rocco2018end], we compute PCK scores densely over the foreground object. Table IV compares the average PCK on each subset of the TSS dataset. Our method shows better performance than the state of the art on FG3DCar and JODS. We do not do as well on the PASCAL part of TSS, which contains many image pairs with different poses (e.g., cars captured from left- and right-side viewpoints). Current methods, except for OADSC [yang2017object], which is specifically designed to handle changes in viewpoint, have a limited capability of finding matches between images with different poses; ours is no exception.
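The dense evaluation above differs from keypoint-based PCK in that correctness is measured at every foreground pixel. A minimal sketch, assuming flow fields stored as (H, W, 2) arrays and a boolean foreground mask (hypothetical conventions, not the TSS evaluation code):

```python
import numpy as np

def dense_pck(flow_pred, flow_gt, fg_mask, alpha, ref_size):
    """Dense PCK over foreground pixels.

    flow_pred, flow_gt: (H, W, 2) flow fields in the same convention;
    fg_mask: (H, W) boolean foreground mask. A pixel is correct when
    the endpoint error of its predicted flow is at most alpha * ref_size.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=2)
    return float(np.mean(err[fg_mask] <= alpha * ref_size))
```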
| Type | Methods | Runtime (ms) |
|---|---|---|
| A | (T) CNNGeo [rocco2017convolutional] | 34.2 |
| A | (T+P) WS-SA [rocco2018end] | 34.4 |
| A | (T) A2Net [Seo2018AttentiveSA] | 61.2 |
| F | (P) NCN [rocco2018neighbourhood] | 284.2 |
| F | (C+P) HPF [min2019hyperpixel] | 48.9 |
4.2.4 Runtime Analysis
Table V shows runtime comparisons of state-of-the-art methods. For comparison, we run the original source code implemented in PyTorch [paszke2017automatic]. The average runtime is measured on the same machine with an NVIDIA Titan RTX GPU. The table shows that our model is the fastest among the state of the art. Semantic alignment methods [rocco2017convolutional, rocco2018end, Seo2018AttentiveSA] estimate the parameters of affine and thin-plate-spline transformations sequentially, which degrades runtime performance. Semantic flow methods involve 4D convolutions [rocco2018neighbourhood] or a Hough voting process [min2019hyperpixel] on top of the correlation volume, requiring additional computations to establish pixel-level correspondences. Our model, on the other hand, simply assigns the best matches from the correlation volume in a single stage. Most of the computation time is spent extracting features (23.7 milliseconds); computing matching scores and establishing correspondences with the kernel soft argmax take just 1.2 milliseconds.
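The single-stage matching described above can be illustrated with a small sketch (hypothetical shapes and names, not the authors' implementation): with L2-normalized features, the correlation volume holds cosine similarities, and each source position simply takes its highest-scoring target position.

```python
import numpy as np

def best_matches(feat_src, feat_tgt):
    """feat_src, feat_tgt: (H, W, C) features, L2-normalized along C.

    Returns, for every source position, the (y, x) coordinates of the
    target position with the highest cosine similarity.
    """
    h, w, c = feat_src.shape
    src = feat_src.reshape(-1, c)
    tgt = feat_tgt.reshape(-1, c)
    corr = src @ tgt.T                 # (H*W, H*W) correlation volume
    idx = corr.argmax(axis=1)          # hard best match per source position
    return np.stack([idx // w, idx % w], axis=1).reshape(h, w, 2)
```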
4.2.5 Qualitative Results
Figure 7 shows a visual comparison of alignment results between source and target images with the state of the art on the test split of PF-PASCAL [ham2017proposal, rocco2018end]. To this end, the source images are warped to the target images using the dense flow fields computed by each method. We can see that our method is robust to local non-rigid deformations (e.g., bird beaks and horse legs in the first two rows), scale changes between objects (e.g., front wheels in the third row), and clutter (e.g., wheels in the last row), while semantic alignment methods [rocco2017convolutional, rocco2018end] are not. In particular, the fourth example clearly shows that our method gives more discriminative correspondences, cutting off matches for non-common objects. For example, it does not establish correspondences between a person in the source image and background regions in the target image, while CNNGeo [rocco2017convolutional] and WS-SA [rocco2018end] fail to cut off matches on these regions. We can also see that none of the methods establishes correspondences for occluded regions (e.g., a bicycle saddle in the last row). We also show in Fig. 8 the top 60 matches chosen according to matching probabilities on the PF-WILLOW [ham2016proposal], PF-PASCAL [ham2017proposal], and TSS [taniai2016joint] datasets. Most strong matches are established between prominent objects, and matches between foreground and background regions have low matching probabilities.
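The warping used for this visual comparison can be sketched as follows (a simplified nearest-neighbor version with hypothetical conventions; real implementations typically use differentiable bilinear sampling, e.g., torch.nn.functional.grid_sample):

```python
import numpy as np

def warp_nearest(src_img, flow):
    """Warp src_img into the target frame.

    flow: (H, W, 2) giving, for each target pixel (y, x), the absolute
    source coordinates (sy, sx) to sample from. Out-of-bounds samples
    are left as zeros.
    """
    h, w = flow.shape[:2]
    out = np.zeros_like(src_img)
    for y in range(h):
        for x in range(w):
            sy, sx = np.round(flow[y, x]).astype(int)
            if 0 <= sy < src_img.shape[0] and 0 <= sx < src_img.shape[1]:
                out[y, x] = src_img[sy, sx]
    return out
```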
| Group | Type | Methods | LT-ACC | IoU |
|---|---|---|---|---|
| CNN-based | A | (T) A2Net [Seo2018AttentiveSA] | 0.80 | 0.57 |
| | A | (T) CNNGeo [rocco2017convolutional] | 0.83 | 0.61 |
| | A | (T+P) WS-SA [rocco2018end] | 0.85 | 0.63 |
| | F | (B+P) PF-LOM [kim2017fcss] | 0.83 | 0.52 |
| | F | (C+P) SCNet-AG [han2017scnet] | 0.79 | 0.51 |
| | F | (P) NCN [rocco2018neighbourhood] | 0.85 | 0.60 |
| | F | (C+P) HPF [min2019hyperpixel] | 0.87 | 0.63 |
4.3 Mask Transfer
We apply our model to the task of mask transfer on the Caltech-101 [fei2006one] dataset. This dataset, originally introduced for image classification, provides pictures of 101 image categories with ground-truth object masks. Unlike the PF [ham2016proposal, ham2017proposal] and TSS [taniai2016joint] datasets, it does not provide ground-truth keypoint annotations. For fair comparison, we use the 15 image pairs per object category provided by [han2017scnet, rocco2018end], i.e., 1,515 image pairs in total, for evaluation. Following the experimental protocol in [kim2013deformable], we compute matching accuracy with two metrics using the ground-truth masks: label transfer accuracy (LT-ACC) and the intersection-over-union (IoU) metric. Both metrics count the number of correctly labeled pixels between the ground-truth and transferred masks obtained with dense correspondences; LT-ACC evaluates overall matching quality, while IoU focuses more on foreground objects. Following [Seo2018AttentiveSA, rocco2018end], we exclude the LOC-ERR metric: for lack of keypoint annotations, it measures the localization error of correspondences using object bounding boxes, which do not account for rotations or affine and deformable transformations.
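The two metrics can be sketched directly from their definitions (a minimal NumPy illustration for binary masks, not the original evaluation code):

```python
import numpy as np

def lt_acc(pred_mask, gt_mask):
    """Label transfer accuracy: fraction of pixels whose (binary)
    label agrees between the transferred and ground-truth masks."""
    return float(np.mean(pred_mask == gt_mask))

def iou(pred_mask, gt_mask):
    """Intersection-over-union of the foreground regions."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / max(float(union), 1.0)
```

Note that LT-ACC rewards agreement on background pixels as well, which is why IoU is the more object-centric of the two.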
The LT-ACC and IoU comparisons on the Caltech-101 dataset are shown in Table VI. Although this dataset provides ground-truth object masks, we do not retrain or fine-tune our model, in order to evaluate its ability to generalize to other datasets. From this table, we can see that (1) our model generalizes better than other CNN-based methods to images outside the training dataset; and (2) it outperforms the state of the art in terms of LT-ACC and IoU, verifying once more that our model focuses on regions containing objects while filtering out background clutter, even without using object proposals [ham2017proposal, han2017scnet, yang2017object, kim2017fcss] or inlier counting [rocco2018end]. In Fig. 9, we show alignment and label transfer examples on the Caltech-101 [fei2006one] dataset. We can see that our method is robust to local non-rigid deformations (e.g., the bird’s neck, body, and legs).
| Methods | PCK@0.1 | PCK@0.2 |
|---|---|---|
| (V+C) Optical Flow [ilg2017flownet] | 45.2 | 62.9 |
| (V) Video Colorization [vondrick2018tracking] | 45.2 | 69.6 |
| (V) TimeCycle [wang2019learning] | 57.3 | 78.1 |
| (V) TimeCycle [wang2019learning] | 57.7 | 78.5 |
| (V) TimeCycle [wang2019learning] | 58.4 | 78.4 |
| (C+P) HPF [min2019hyperpixel] | 58.7 | 76.8 |
4.4 Pose Keypoint Propagation
We apply our model to the task of keypoint propagation on the JHMDB [jhuang2013towards] dataset, propagating ground-truth pose keypoints in the first frame to subsequent ones by estimating semantic correspondences between frames. The JHMDB dataset contains 928 clips of 21 action categories with pose keypoints and segmentation masks of humans in action, obtained with a 2D articulated human puppet model [zuffi2012pictorial], and provides three splits, each consisting of training and test sets. Following the experimental protocol of [wang2019learning, vondrick2018tracking], we test our model on the test set of split 1, corresponding to 268 clips, without retraining or fine-tuning on the dataset. We normalize keypoint coordinates to the range [0, 1] by dividing them by the height and width of the human bounding box, respectively, and use PCK scores with two threshold values (0.1 and 0.2) for evaluation.
We show in Table VII the average PCK scores for the keypoint propagation task, and compare our method with the state of the art, including self-supervised methods [wang2019learning, vondrick2018tracking]. From this table, we can see that our model based on ResNet-50 [he2016deep] outperforms the state of the art, even without using video datasets for training. For example, TimeCycle [wang2019learning] is trained on the VLOG [fouhey2018lifestyle] dataset, which contains 114k videos with a total length of 344 hours. Training networks with such video datasets requires substantial computational resources and training time. We can also see that our model outperforms HPF [min2019hyperpixel], demonstrating once again its ability to generalize to images unseen during training. Figure 10 shows a visual comparison of keypoint propagation results with TimeCycle [wang2019learning] on the JHMDB [jhuang2013towards] dataset. The qualitative results for the comparison have been obtained from the original model provided by the authors [wang2019learning]. We predict keypoints in the remaining frames by propagating the ground truth in the first frame. We can see that our method is more robust to background clutter (e.g., body parts in the first row) and large displacements (e.g., elbows and wrists in the second row). Moreover, it is not seriously affected by occlusion (e.g., ankles and wrists in the last two rows), as the smoothness term regularizes flow fields within prominent objects.
4.5 Ablation Study and Effect of Training Data
We present an ablation analysis of the different components and losses in our model. We measure PCK scores at a stricter threshold than the one used in the previous comparisons, and report results of semantic correspondence on the test split of PF-PASCAL [ham2017proposal, rocco2018end]. We also study the effect of different training datasets on performance.
4.5.1 Training Loss
We show the average PCK for three variants of our model in Table VIII. The mask consistency term encourages establishing correspondences between prominent objects. Our model trained with this term alone, however, may not yield spatially distinctive correspondences, resulting in the worst performance. The flow consistency term, which spreads flow fields over foreground regions, overcomes this problem, but it does not differentiate correspondences between background and objects. Accordingly, the two terms are complementary to each other, and exploiting both significantly boosts the performance of our model from 67.5/71.8 to 78.2. An additional smoothness term further boosts performance to 78.7.
4.5.2 Network Architecture
Table IX compares the performance of networks with different components in terms of average PCK. The baseline models in the first three rows compute matching scores using multi-level features from the conv4-23 and conv5-3 layers, and estimate correspondences with different argmax operators. They do not involve any training, similar to [long2014convnets], which uses off-the-shelf CNN features for semantic correspondence. We can see that applying the soft argmax directly to the baseline model degrades performance severely, since it is highly susceptible to multi-modal distributions. The results in the next three rows are obtained with a single adaptation layer on top of conv4-23, demonstrating that the adaptation layer extracts features more adequate for pixel-wise semantic correspondence and significantly boosting the performance of all baseline models. In particular, the kernel soft argmax outperforms the others by a large margin, since it enables training our model end to end, including the adaptation layers, at a sub-pixel level and is less susceptible to multi-modal distributions. The last three rows suggest that exploiting deeper features is important, and that using all components with the kernel soft argmax performs best in terms of average PCK. We show in Fig. 11 alignment examples for the variants of our model in Table IX. They confirm once more that the adaptation layers and multi-level features boost matching performance drastically, regardless of the type of argmax operator, and that the soft argmax is highly susceptible to multi-modal distributions, e.g., those caused by ambiguous matches between a bottle and a glass in the source and target images, respectively.
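The contrast between the soft argmax and the kernel soft argmax can be illustrated in 1-D (a simplified, hypothetical version of the 2-D operator: a Gaussian kernel centered on the hard argmax suppresses secondary modes before the softmax-weighted average, keeping the output differentiable around the dominant mode while avoiding the pull of multi-modal distributions):

```python
import numpy as np

def soft_argmax_1d(scores, beta=10.0):
    """Softmax-weighted average of positions: differentiable, but
    pulled toward secondary modes of a multi-modal distribution."""
    p = np.exp(beta * (scores - scores.max()))
    p /= p.sum()
    return float(np.dot(p, np.arange(len(scores))))

def kernel_soft_argmax_1d(scores, beta=10.0, sigma=1.0):
    """A Gaussian kernel centered on the hard argmax down-weights
    distant modes before the softmax-weighted average."""
    pos = np.arange(len(scores))
    center = scores.argmax()
    kernel = np.exp(-((pos - center) ** 2) / (2 * sigma ** 2))
    masked = kernel * scores
    p = np.exp(beta * (masked - masked.max()))
    p /= p.sum()
    return float(np.dot(p, pos))

# Bimodal scores: a strong mode at index 2, a slightly weaker one at 8.
scores = np.array([0.0, 0.1, 1.0, 0.1, 0.0, 0.0, 0.1, 0.1, 0.9, 0.1])
```

With these scores, the plain soft argmax is dragged between the two modes, whereas the kernel variant stays near the dominant one at index 2.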
4.5.3 Training with Bounding Boxes
We train our model using object bounding boxes themselves as binary masks. The generated masks are noisy, but they are less expensive to annotate than ground-truth foreground masks. We use the same 2,791 images from the PASCAL VOC 2012 segmentation dataset [everingham2010pascal] for training, and obtain an average PCK score on the PF-PASCAL dataset [ham2017proposal] comparable with that obtained using ground-truth masks. This suggests that using bounding boxes might be a less accurate but cheaper alternative.
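Turning a bounding box into such a binary mask is straightforward; a minimal sketch (hypothetical helper, assuming (x_min, y_min, x_max, y_max) boxes in pixel coordinates):

```python
import numpy as np

def bbox_to_mask(h, w, box):
    """Rasterize an object bounding box (x_min, y_min, x_max, y_max)
    into an (h, w) binary foreground mask: a noisy but cheap
    substitute for a ground-truth segmentation mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w), dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask
```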
4.5.4 Training on PF-PASCAL
Semantic correspondence methods based on CNNs use different training sets. For example, the methods of [rocco2017convolutional, rocco2018end] use the PASCAL VOC 2011 (11,540 images) and Tokyo Time Machine (20,000 images) datasets. In [han2017scnet, rocco2018end, Jeon2018PARN, kim2018recurrent, rocco2018neighbourhood], the training split of PF-PASCAL [ham2017proposal] (about 700 image pairs over 1,001 images) is used. Following these approaches, we train a network on the training split of the PF-PASCAL dataset. We exclude 302 images in this split that overlap with either target or source images in the test dataset; note that current methods ignore this bias. We use object bounding boxes due to the lack of ground-truth foreground masks in this split. We obtain an average PCK of 77.8 on PF-PASCAL, which is comparable with the score of 78.7 for the model trained using 1,464 images from the PASCAL VOC 2012 segmentation dataset. This indicates that our model is robust to the size of the training data.
4.5.5 Training on Larger Datasets
We use the training split of MS COCO 2014 [lin2014microsoft] to train our model on a larger dataset. Among its 80 object categories, we select 16,624 images of the 20 object classes of PASCAL VOC 2012 [everingham2010pascal] using segmentation masks, which is about 6 times the number used in Section 4.2 (2,791 images). We test our model on the PF-PASCAL dataset [ham2017proposal], since MS COCO does not provide a benchmark for semantic correspondence. Despite the larger number of training samples, the average PCK decreases slightly, mainly due to domain differences between the MS COCO and PASCAL VOC datasets. This, however, demonstrates once more the ability of our approach to generalize to samples outside the training domain.
We have presented a CNN model, dubbed SFNet, for learning object-aware semantic flow end to end, with a novel kernel soft argmax layer that outputs differentiable matches at a sub-pixel level. We have proposed to use binary foreground masks, which are widely available and much easier to obtain than pixel-level annotations, to train a model for learning pixel-to-pixel correspondences. Ablation studies clearly demonstrate the effectiveness of each component and loss in our model. Finally, we have shown that the proposed method is robust to distracting details and focuses on establishing dense correspondences between prominent objects, outperforming the state of the art on standard benchmarks for the tasks of semantic correspondence, mask transfer, and pose keypoint propagation in terms of accuracy and speed.
This research was supported in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2019R1A2C2084816), the Louis Vuitton/ENS chair on artificial intelligence, and the Inria/NYU collaboration agreement.