Establishing correspondences and transferring attributes across semantically similar images can facilitate a variety of computer vision applications [35, 34, 25]. In these tasks, the images resemble each other in content but differ in visual attributes, such as color, texture, and style, e.g., the images with different faces exemplified in Fig. 1. Numerous techniques have been proposed for semantic correspondence [15, 24, 42, 19, 43, 23] and attribute transfer [11, 6, 28, 21, 38, 16, 20, 34, 12], but these two tasks have been studied independently although they can be mutually complementary.
To establish reliable semantic correspondences, state-of-the-art methods have leveraged deep convolutional neural networks (CNNs) for extracting descriptors [7, 53, 24] and regularizing correspondence fields [15, 42, 19, 43, 23]. Compared to conventional handcrafted methods [35, 22, 5, 54, 48], they have achieved highly reliable performance. To overcome the problem of limited ground-truth supervision, some methods [42, 19, 43, 23] have tried to learn deep networks using only weak supervision in the form of image pairs, based on the intuition that the matching cost between the source and target features over a set of transformations should be minimized at the correct transformation. These methods presume that the attribute variations between source and target images are negligible in the deep feature space. In practice, however, deep features often handle differing attributes in the source and target images poorly, which can degrade the matching accuracy dramatically.
To transfer the attributes between source and target images, following the seminal work of Gatys et al., numerous methods have been proposed to separate and recombine the contents and attributes using deep CNNs [11, 6, 28, 21, 38, 16, 20, 34, 12]. Unlike the parametric methods [11, 21, 38, 16], which match the global statistics of deep features while ignoring the spatial layout of contents, the non-parametric methods [6, 28, 34, 12] directly find neural patches in the target image similar to the source patch and synthesize them to reconstruct the stylized image. These non-parametric methods generally estimate nearest-neighbor patches between source and target images with weak implicit regularization [6, 28, 34, 12], using a simple local aggregation followed by winner-takes-all (WTA) selection. However, photorealistic attribute transfer needs highly regularized and semantically meaningful correspondences, and thus existing methods [6, 28, 12] frequently fail when the images have background clutter and different attributes while exhibiting similar global feature statistics. A method called deep image analogy [34] has tried to estimate more semantically meaningful dense correspondences for photorealistic attribute transfer, but it still has limited localization ability with PatchMatch [3].
In this paper, we present semantic attribute matching networks (SAM-Net) for overcoming the aforementioned limitations of current semantic matching and attribute transfer techniques. The key idea is to weave the advantages of semantic matching and attribute transfer networks in a boosting manner. Our networks accomplish this through an iterative process of establishing more reliable semantic correspondences by reducing the attribute discrepancy between semantically similar images and synthesizing an attribute transferred image with the learned semantic correspondences. Moreover, our networks are learned from weak supervision in the form of image pairs using the proposed semantic attribute matching loss. Experimental results show that SAM-Net outperforms the latest methods for semantic matching and attribute transfer on several benchmarks, including the TSS dataset [48], the PF-PASCAL dataset, and the CUB-200-2011 dataset [51].
2 Related Work
Most conventional methods for semantic correspondence that use handcrafted features and regularization methods [35, 22, 5, 54, 48] have provided limited performance due to their low discriminative power. Recent approaches have used deep CNNs for extracting features [7, 53, 24, 39] and regularizing correspondence fields [15, 41, 42]. Rocco et al. [41, 42] proposed a deep architecture for estimating a geometric matching model, but these methods estimate only globally-varying geometric fields. To deal with locally-varying geometric deformations, some methods such as UCN [7] and CAT-FCSS were proposed based on STNs [18]. Recently, PARN [19], NC-Net [43], and RTNs [23] were proposed to estimate locally-varying transformation fields using a coarse-to-fine scheme, neighbourhood consensus, and an iteration technique, respectively. These methods [19, 43, 23] presume that the attribute variations between source and target images are negligible in the deep feature space. However, in practice the deep features often show limited performance in handling different attributes. Aberman et al. [1] presented a method to deal with the attribute variations between the images using a variant of instance normalization. However, the method does not have an explicit learnable module to reduce the attribute discrepancy, thus yielding limited performance.
There have been many works on the transfer of visual attributes, e.g., color, texture, and style, from one image to another, and most approaches are tailored to their specific objectives [40, 47, 8, 2, 52, 9]. Since our method represents and synthesizes deep features to transfer the attribute between semantically similar images, neural style transfer [11, 6, 21, 20] is highly related to ours. In general, these approaches can be classified into parametric and non-parametric methods.
In parametric methods, inspired by the seminal work of Gatys et al., numerous methods have been presented, such as the work of Johnson et al. [21], AdaIN [16], and WCT [31]. Since these methods are globally formulated, they have shown limited performance for photorealistic stylization tasks [32, 38]. To alleviate these limitations, Luan et al. proposed deep photo style transfer [38], which computes and uses semantic labels. Li et al. proposed Photo-WCT [32] to eliminate artifacts using an additional smoothing step. However, these methods are still formulated without considering semantically meaningful correspondence fields.
Among non-parametric methods, the seminal work of Li et al. first searches the target style image for local neural patches that are similar to the patches of the content image, so as to preserve the local structure prior of the content image, and then uses them to synthesize the stylized image. Chen et al. [6] sped up this process using feed-forward networks to decode the synthesized features. Inspired by this, various approaches have been proposed to synthesize locally blended features efficiently [29, 49, 37, 30, 50]. However, the aforementioned methods are tailored to artistic style transfer, and thus they focus on finding patches that reconstruct more plausible images, rather than finding semantically meaningful dense correspondences. They generally estimate the nearest-neighbor patches using weak implicit regularization methods such as WTA. Recently, Gu et al. [12] introduced a deep feature reshuffle technique to connect parametric and non-parametric methods, but they search for the nearest neighbor using expectation-maximization (EM), which also yields limited localization accuracy.
More related to our work is a method called deep image analogy [34] that searches semantic correspondences using deep PatchMatch [3] in a coarse-to-fine manner. However, PatchMatch inherently has limited regularization power, as shown in [27, 36, 33]. In addition, the method still needs a greedy optimization for feature deconvolution, which induces computational bottlenecks, and it considers only translational fields, which limits its ability to handle more complicated deformations.
3 Problem Statement
Let us denote semantically similar source and target images as and , respectively. The objective of our method is to jointly establish a correspondence field between the two images that is defined for each pixel and synthesize an attribute transferred image by transferring an attribute of target image to a content of source image .
CNN-based methods for semantic correspondence [41, 25, 42, 19, 43, 23] involve first extracting deep features [45, 25], denoted by and , from and within local receptive fields, and then estimating correspondence field of the source image using deep regularization models [41, 42, 23], as shown in Fig. 2(a). To learn the networks using only image pairs, some methods [42, 23] formulate the loss function based on the intuition that the matching cost between the source feature and the target feature over a set of transformations should be minimized. For instance, they formulate the matching loss defined as
where denotes Frobenius norm. To deal with more complex deformations such as affine transformation [27, 23], instead of , or can be used with a matrix . Although semantically similar images share similar contents, they can have different attributes, yet these methods [41, 42, 19, 43, 23] simply assume that the attribute variations between source and target images are negligible in the deep feature space. They thus cannot guarantee an accurate matching cost without an explicit module to reduce the attribute gaps.
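As a rough illustration of the weakly supervised matching cost described above, the following NumPy sketch sums, over all pixels, the squared distance between each source feature and the target feature sampled at the displaced location. It covers the purely translational case only. The function name, the `(dy, dx)` integer flow layout, and the border clamping are assumptions made for this example, not details taken from the cited methods.

```python
import numpy as np

def matching_loss(feat_src, feat_tgt, flow):
    """Translational matching cost (illustrative sketch).

    feat_src, feat_tgt: feature maps of shape (H, W, C).
    flow: integer array of shape (H, W, 2) holding (dy, dx) per pixel.
    Samples that fall outside the target map are clamped to the border,
    which is an assumption of this sketch.
    """
    h, w, _ = feat_src.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    ty = np.clip(ys + flow[..., 0], 0, h - 1)
    tx = np.clip(xs + flow[..., 1], 0, w - 1)
    warped = feat_tgt[ty, tx]  # target features gathered per source pixel
    return float(np.sum((feat_src - warped) ** 2))
```

Because the cost compares raw features, any attribute difference between the two images inflates it even at the correct displacement, which is the weakness the paragraph above points out.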
To minimize the attribute discrepancy between source and target images, attribute or style transfer methods [11, 6, 21, 20] separate and recombine the content and attribute. Unlike the parametric methods [11, 38], the non-parametric methods [6, 28, 34, 12] directly find neural patches in the target image similar to the source patch and synthesize them to reconstruct the stylized feature and image , as shown in Fig. 2(b). Formally, they formulate two loss functions including the content loss defined as
and the non-parametric attribute transfer loss defined as
where is the center point of the patch in that is most similar to a patch centered at in . Generally, is determined using the matching scores of normalized cross-correlation [6, 28] aggregated on over all local patches followed by the labeling optimization such that
where the operator denotes inner product.
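The nearest-neighbor patch search with local aggregation and WTA labeling can be sketched as a brute-force loop over candidate translations. This is a minimal illustration under stated assumptions: the function name, the square patch and search windows, the box-sum aggregation, and the zero padding outside the feature maps are all choices made for this example.

```python
import numpy as np

def wta_patch_match(feat_src, feat_tgt, radius=1, search=2):
    """Winner-takes-all translation per pixel (illustrative sketch).

    For every displacement m in a (2*search+1)^2 window, the cosine
    similarity between feat_src[j] and feat_tgt[j + m] is summed over a
    (2*radius+1)^2 patch around each pixel, and the best m is kept.
    """
    h, w, _ = feat_src.shape
    eps = 1e-8
    src = feat_src / (np.linalg.norm(feat_src, axis=-1, keepdims=True) + eps)
    tgt = feat_tgt / (np.linalg.norm(feat_tgt, axis=-1, keepdims=True) + eps)
    pad = search
    tgt_p = np.pad(tgt, ((pad, pad), (pad, pad), (0, 0)))  # zeros outside
    best_score = np.full((h, w), -np.inf)
    best_flow = np.zeros((h, w, 2), dtype=int)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            shifted = tgt_p[pad + dy: pad + dy + h, pad + dx: pad + dx + w]
            sim = np.sum(src * shifted, axis=-1)  # per-pixel inner product
            sim_p = np.pad(sim, radius)
            # Box-sum aggregation over the local patch.
            agg = sum(sim_p[r: r + h, c: c + w]
                      for r in range(2 * radius + 1)
                      for c in range(2 * radius + 1))
            better = agg > best_score
            best_score[better] = agg[better]
            best_flow[better] = (dy, dx)
    return best_flow
```

Each pixel picks its label independently, so nothing beyond the patch aggregation enforces smoothness between neighboring displacements.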
However, the hand-designed discrete labeling techniques such as WTA [6, 28], PatchMatch [3], and EM [12] used to optimize (4) rely on weak implicit smoothness constraints, often producing poor matching results. In addition, they consider only translational fields, which limits the handling of more complicated deformations caused by scale, rotation, and skew that may exist among object instances.
We present the networks to recurrently estimate semantic correspondences and synthesize the stylized images in a boosting manner, as shown in Fig. 2(c). In the networks, correspondences are robustly established by matching the stylized source and target images, in contrast to existing methods [42, 23] that directly match source and target images that have the attribute discrepancy. At the same time, blended neural patches using the correspondences are used to reconstruct the attribute transferred image in a semantic-aware and geometrically aligned manner.
Our networks are split into three parts as shown in Fig. 3: feature extraction networks to extract source and target features and , semantic matching networks to establish correspondence fields , and attribute transfer networks to synthesize the attribute transferred image . Since our networks are formulated in a recurrent manner, they output and at each -th iteration, as exemplified in Fig. 4.
4.2 Network Architecture
Feature extraction networks.
Our model accomplishes the semantic matching and attribute transfer using deep features [45, 25]. To extract the features for source and target , the source and target images ( and ) are first passed through shared feature extraction networks with parameters such that , respectively. In the recurrent formulation, an attribute transferred feature from target to source images and a warped target feature , i.e., warped using the transformation fields , are reconstructed at each -th iteration.
Semantic matching networks.
Our semantic matching networks consist of the matching cost computation and inference modules motivated by conventional RANSAC-like methods . We first compute the correlation volume with respect to translational motion only [41, 42, 43, 23] and then pass it to subsequent convolutional layers to determine dense affine transformation fields .
Unlike existing methods [41, 42, 23], our method computes the matching similarity between not only source and target features but also synthesized source and target features to minimize errors from the attribute discrepancy between source and target features such that:
where for local search window centered at . controls the trade-off between content and attribute when computing the similarity, which is similar to . Note that when , we only consider the source feature without considering the stylized feature . These similarities undergo normalization to reduce errors .
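One possible reading of this similarity computation is sketched below in NumPy: the query feature is a convex combination of the source feature and the stylized (attribute transferred) feature, and it is correlated with the warped target feature over a local window. The function name, the blending form `(1 - lam) * source + lam * stylized`, the zero padding, and the L2 normalization across candidates are assumptions made for illustration; the exact formulation is the one given by the equation above.

```python
import numpy as np

def blended_correlation(feat_src, feat_sty, feat_tgt_warp, lam, radius=1):
    """Local correlation volume with a content/attribute trade-off (sketch).

    feat_src, feat_sty, feat_tgt_warp: arrays of shape (H, W, C).
    lam: trade-off weight in [0, 1]; lam = 0 uses the source feature only.
    Returns an array of shape (H, W, (2*radius+1)**2).
    """
    query = (1.0 - lam) * feat_src + lam * feat_sty
    h, w, _ = query.shape
    k = 2 * radius + 1
    tgt_p = np.pad(feat_tgt_warp, ((radius, radius), (radius, radius), (0, 0)))
    vol = np.empty((h, w, k * k))
    idx = 0
    for dy in range(k):
        for dx in range(k):
            # Inner product between the query and one translated candidate.
            vol[..., idx] = np.sum(query * tgt_p[dy: dy + h, dx: dx + w], axis=-1)
            idx += 1
    # Normalize across candidates (one possible choice of normalization).
    vol /= np.linalg.norm(vol, axis=-1, keepdims=True) + 1e-8
    return vol
```

With `lam = 0` the volume depends only on the source feature, which matches the special case noted in the text.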
Based on this, the matching inference networks with parameters iteratively estimate the residual between the previous and current transformation fields  as
The current transformation fields are then estimated in a recurrent manner  as follows:
where . Unlike [41, 42] that estimate a global affine or thin-plate spline transformation field, our networks are formulated as the encoder-decoder networks as in  to estimate locally-varying transformation fields.
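The recurrent estimation can be pictured as repeatedly adding a predicted residual to the previous transformation field. The sketch below is a toy illustration, not the authors' network: `predict_residual` is a hypothetical stand-in for the matching inference networks, and the per-pixel 2x3 affine layout initialized to the identity is an assumption of this example.

```python
import numpy as np

def identity_affine_fields(height, width):
    """Per-pixel 2x3 affine matrices initialized to [I | 0]."""
    fields = np.zeros((height, width, 2, 3))
    fields[..., 0, 0] = 1.0
    fields[..., 1, 1] = 1.0
    return fields

def refine_transform_fields(initial_fields, predict_residual, num_iters):
    """Accumulate residual updates: T_k = T_{k-1} + residual(T_{k-1})."""
    fields = np.array(initial_fields, dtype=float)
    for _ in range(num_iters):
        fields = fields + predict_residual(fields)
    return fields
```

In this toy setting, a residual predictor that moves halfway toward a fixed target field on each call makes the error shrink geometrically with the iteration count.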
Attribute transfer networks.
To transfer the attribute of target feature into the content of source feature at -th iteration, our attribute transfer networks first blend the source and target features as using estimated transformation field and then reconstruct the stylized source image using the decoder networks with parameters such that .
Specifically, our neural patch blending between and with the current transformation field is formulated as shown in Fig. 5 such that
where . is a confidence of each pixel that has computed similar to  such that
Our neural patch blending module differs from the existing methods [34, 28, 12] in its use of learned transformation fields and its consideration of more complex deformations such as affine transformations. In addition, unlike existing style transfer methods [28, 12], our networks employ the confidence to transfer the attribute of matchable points only, tailored to our objective, as exemplified in Fig. 6.
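A minimal sketch of confidence-weighted blending is given below, assuming a per-pixel convex combination of the source feature and the warped target feature. The function name and the convex-combination form are assumptions for illustration, and the confidence map is taken as a given input rather than computed as in the paper.

```python
import numpy as np

def blend_features(feat_src, feat_tgt_warp, confidence):
    """Blend source and warped target features per pixel (sketch).

    feat_src, feat_tgt_warp: arrays of shape (H, W, C).
    confidence: array of shape (H, W); values are clipped to [0, 1].
    Pixels with low confidence keep the source feature, so attributes are
    transferred only where a match is considered reliable.
    """
    alpha = np.clip(confidence, 0.0, 1.0)[..., None]
    return (1.0 - alpha) * feat_src + alpha * feat_tgt_warp
```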
In addition, our decoder networks are formulated as a symmetric structure to feature extraction networks. Since the single-level decoder networks as in  cannot capture both complicated structures at high-level features and low-level information at low-level features, the multi-level decoder networks have been proposed as in [31, 32], but they are not very economical. Instead, we use the skip connection from the source features to capture both low- and high-level attribute characteristics [31, 32, 12]. However, using the skip connection through simple concatenation makes the decoder networks reconstruct an image using only low-level features. To alleviate this, inspired by a dropout layer [46], we present a droplink layer such that the skipped features and upsampled features are stochastically linked to avoid overfitting to features of a certain level:
where and are the intermediate and skipped features at -th level for . is the parameters until -th level.
is a binary random variable. Note that if, this becomes the no-skip connected layer.
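The droplink idea can be sketched as a stochastic gate on the skip path. In the illustration below, one Bernoulli draw per call decides whether the skipped features are linked, and a gate of zero reduces the layer to a no-skip connection. The function name, the channel-wise concatenation, and the single gate per call are assumptions of this sketch, and the subsequent learned layer is omitted.

```python
import numpy as np

def droplink(upsampled, skipped, keep_prob, rng):
    """Stochastically link skipped features to upsampled features (sketch).

    upsampled: array of shape (H, W, C1) from the previous decoder level.
    skipped:   array of shape (H, W, C2) from the feature extractor.
    keep_prob: probability that the skip connection is kept.
    rng:       a numpy.random.Generator.
    """
    gate = float(rng.random() < keep_prob)  # binary random variable
    return np.concatenate([upsampled, gate * skipped], axis=-1)
```

Gating the whole skip path, rather than individual units as in dropout, means the decoder sometimes has to reconstruct from the upsampled features alone.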
4.3 Loss Functions
Semantic attribute matching loss.
Our networks are learned using weak supervision in the form of image pairs. Concretely, we present a semantic attribute matching loss in a manner that the transformation field and the stylized image can be simultaneously learned and inferred to minimize a single loss function. After the convergence of iterations at -th iteration, an attribute transferred feature and a warped target feature are used to define the loss function. This intuition can be realized by minimizing the following objective:
In comparison to the existing matching loss and attribute transfer loss, this objective enables us to handle the photometric and geometric variations across semantically similar images simultaneously.
Although using only this objective provides satisfactory performance, we extend this objective to consider both positive and negative samples to enhance network training and precise localization ability based on the intuition that the matching score should be minimized at the correct transformation while keeping the scores of other neighbor transformation candidates high. Finally, we formulate our semantic attribute matching loss as a cross-entropy loss as
is the softmax probability defined as
It makes the center point within the neighborhood a positive sample and the other points negative samples. In addition, the truncated max operator is used to focus on the salient parts such as objects during training with the parameter .
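The positive/negative formulation can be illustrated with a small NumPy sketch in which, for each pixel, the candidates in a local window of the warped target feature are scored by negative squared distance to the stylized source feature, a softmax is taken over the candidates, and the center candidate is the positive class. The function name, the distance-based logits, and the zero padding are assumptions of this example, and the truncated max operator is omitted.

```python
import numpy as np

def center_positive_cross_entropy(feat_sty, feat_tgt_warp, radius=1):
    """Cross-entropy with the window center as the positive class (sketch).

    feat_sty: stylized source features, shape (H, W, C).
    feat_tgt_warp: warped target features, shape (H, W, C).
    """
    h, w, _ = feat_sty.shape
    k = 2 * radius + 1
    tgt_p = np.pad(feat_tgt_warp, ((radius, radius), (radius, radius), (0, 0)))
    # Squared distance to every candidate in the local window.
    dists = np.stack(
        [np.sum((feat_sty - tgt_p[dy: dy + h, dx: dx + w]) ** 2, axis=-1)
         for dy in range(k) for dx in range(k)],
        axis=-1,
    )
    logits = -dists
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    center = radius * k + radius  # index of the zero-offset candidate
    return float(-log_prob[..., center].mean())
```

Under this sketch the loss is low only when the center candidate is closer than its neighbors, so a field that is off by a pixel is penalized even if some neighbor matches well.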
5.1 Training and Implementation Details
To learn our SAM-Net, large-scale semantically similar image pairs are needed, but such public datasets are limited in size. To overcome this, we adopt a two-step training technique, similar to . In the first step, we train our networks using a synthetic training dataset provided in , where synthetic transformations are randomly applied to a single image to generate the image pairs, and thus the images do not have appearance variations. This enables the attribute transfer networks to be learned in an auto-encoder manner [31, 16, 32], but the matching networks still have limited ability to deal with the attribute variations. To overcome this, in the second step, we finetune this pretrained network on semantically similar image pairs from the training set of PF-PASCAL  following the split used in .
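Table 1 (excerpt). Flow accuracy on the TSS benchmark.
| Taniai et al. [48] | 0.830 | 0.595 | 0.483 | 0.636 |
| GMat. w/Inl. [42] | 0.892 | 0.758 | 0.562 | 0.737 |

Table 2 (excerpt). PCK on the PF-PASCAL benchmark.
| GMat. w/Inl. [42] | 0.490 | 0.748 | 0.840 |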
For feature extraction, we used the ImageNet-pretrained VGG-19 networks, where the activations are extracted from the ‘relu4-1’ layer (i.e., ). We gradually increase until 1 such that . During training, we set the maximum number of iterations to 5 to avoid the vanishing and exploding gradient problems. During testing, the iteration count is increased to 10. Following , the window sizes of , , and are set to , , and , respectively. The probability of is defined as 0.9 and in testing is set to 0.5.
5.2 Experimental Settings
In the following, we comprehensively evaluate SAM-Net through comparisons to state-of-the-art methods for semantic matching, including Taniai et al. [48], PF, SCNet [15], DCTM [27], DIA [34], GMat. [41], GMat. w/Inl. [42], NC-Net [43], and RTNs [23], and for attribute transfer, including Gatys et al., CNN-MRF [28], Photo-WCT [32], Gu et al. [12], and DIA [34]. Performance was measured on the TSS dataset [48], the PF-PASCAL dataset, and the CUB-200-2011 dataset [51].
5.3 Ablation Study
To validate the components within SAM-Net, we evaluated the matching accuracy for different numbers of iterations, with various sizes of , and with and without the attribute transfer module. For quantitative assessment, we examined the accuracy on the TSS benchmark [48]. As shown in Fig. 7, Fig. 8, and Table 1, SAM-Net converges in 2–3 iterations. In addition, the results of ‘SAM-Net wo/Att.’, i.e., SAM-Net without attribute transfer, show the effectiveness of the attribute transfer module in the recurrent formulation. The results of ‘SAM-Net wo/(11)’, i.e., SAM-Net trained without the loss of (11), show the importance of considering negative samples during training. By enlarging the size of , the accuracy improves until 9×9, but larger window sizes reduce matching accuracy due to greater matching ambiguity. Note that following to .
5.4 Semantic Matching Results
We evaluated SAM-Net on the TSS benchmark [48], consisting of 400 image pairs. As in [24, 27], flow accuracy is reported in Table 1. Fig. 9 shows qualitative results. Unlike existing methods [7, 48, 13, 15, 24, 41, 42, 23] that do not consider the attribute variations between semantically similar images, our SAM-Net shows substantially improved performance both qualitatively and quantitatively. DIA [34] shows limited matching accuracy compared to other deep methods [42, 23], due to its limited regularization power. In contrast, the results of our SAM-Net show that transferring the attribute between source and target images improves the semantic matching accuracy.
For the evaluation metric, we used the PCK between flow-warped keypoints and the ground truth, as done in prior experiments. Table 2 summarizes the PCK values, and Fig. 10 shows qualitative results. Similar to the experiments on the TSS benchmark [48], CNN-based methods [15, 41, 42, 23] including our SAM-Net yield better performance, with SAM-Net providing the highest matching accuracy.
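For readers unfamiliar with the metric, PCK (percentage of correct keypoints) counts a warped keypoint as correct when it lies within a fraction of a reference length from the ground-truth keypoint. The sketch below is a generic illustration: the function name, the default `alpha`, and the choice of reference length (for example the larger side of the image or of the object bounding box) are assumptions, since the exact protocol follows the benchmark.

```python
import numpy as np

def pck(pred_kps, gt_kps, ref_size, alpha=0.1):
    """Fraction of keypoints within alpha * ref_size of the ground truth.

    pred_kps, gt_kps: arrays of shape (N, 2) with (x, y) coordinates.
    ref_size: reference length in pixels (assumed supplied by the caller).
    """
    pred = np.asarray(pred_kps, dtype=float)
    gt = np.asarray(gt_kps, dtype=float)
    dist = np.linalg.norm(pred - gt, axis=-1)
    return float(np.mean(dist <= alpha * ref_size))
```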
Photorealistic attribute transfer.
We evaluated SAM-Net for photorealistic attribute transfer on the TSS [48] and PF-PASCAL benchmarks. For evaluation, we sampled image pairs from these datasets and transferred the attribute of the target image to the source image, as shown in Fig. 11. Note that SAM-Net is designed to work on images that contain semantically similar contents and is not effective for generic artistic style transfer applications as in [10, 21, 16]. As expected, existing methods tailored to artistic stylization, such as the method of Gatys et al. and CNN-MRF [28], produce images of limited quality. Moreover, recent photorealistic stylization methods such as Photo-WCT [32] and Gu et al. [12] show limited performance on images that have background clutter. DIA [34] provides degraded results due to its weak regularization technique. Unlike these methods, our SAM-Net produces accurate and plausible results thanks to the learned transformation fields used to synthesize the images. Note that some methods such as Photo-WCT [32] and DIA [34] refine their results using additional smoothing modules, whereas SAM-Net does not use any post-processing.
Foreground mask transfer.
We evaluated SAM-Net for mask transfer on the CUB-200-2011 dataset [51], which contains images of 200 bird categories with annotated foreground masks. For semantically similar images that have very challenging photometric and geometric variations, our SAM-Net successfully transfers the semantic labels, as shown in Fig. 12.
6 Conclusion
We presented SAM-Net, which recurrently estimates dense correspondences and transfers the attributes across semantically similar images in a joint and boosting manner. The key idea of this approach is to formulate the semantic matching and attribute transfer networks to complement each other through an iterative process. For weakly-supervised training of SAM-Net, the semantic attribute matching loss is presented, which enables us to alleviate the photometric and geometric variations across the images simultaneously.
References
- [1] K. Aberman, J. Liao, M. Shi, D. Lischinski, B. Chen, and D. Cohen-Or. Neural best-buddies: Sparse cross-domain correspondence. In: SIGGRAPH, 2018.
- [2] M. Ashikhmin. Fast texture transfer. IEEE Comput. Graph. and Appl., (4):38–43, 2003.
- [3] C. Barnes, E. Shechtman, A. Finkelstein, and D. B. Goldman. Patchmatch: A randomized correspondence algorithm for structural image editing. ACM Trans. ToG, 28(3):24, 2009.
- [4] L. Bourdev and J. Malik. Poselets: Body part detectors trained using 3d human pose annotations. In: ICCV, 2009.
- [5] H. Bristow, J. Valmadre, and S. Lucey. Dense semantic correspondence where every pixel is a classifier. In: ICCV, 2015.
- [6] T. Q. Chen and M. Schmidt. Fast patch-based style transfer of arbitrary style. arXiv:1612.04337, 2016.
- [7] C. B. Choy, Y. Gwak, and S. Savarese. Universal correspondence network. In: NIPS, 2016.
- [8] A. A. Efros and W. T. Freeman. Image quilting for texture synthesis and transfer. In: SIGGRAPH, 2001.
- [9] O. Frigo, N. Sabater, J. Delon, and P. Hellier. Split and match: Example-based adaptive patch sampling for unsupervised style transfer. In: CVPR, 2016.
- [10] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv:1508.06576, 2015.
- [11] L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In: CVPR, 2016.
- [12] S. Gu, C. Chen, J. Liao, and L. Yuan. Arbitrary style transfer with deep feature reshuffle. 2018.
- [13] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow. In: CVPR, 2016.
- [14] B. Ham, M. Cho, C. Schmid, and J. Ponce. Proposal flow: Semantic correspondences from object proposals. IEEE Trans. PAMI, 2017.
- [15] K. Han, R. S. Rezende, B. Ham, K. Y. K. Wong, M. Cho, C. Schmid, and J. Ponce. Scnet: Learning semantic correspondence. In: ICCV, 2017.
- [16] X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV, 2017.
- [17] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Object retrieval with large vocabularies and fast spatial matching. In: CVPR, 2007.
- [18] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In: NIPS, 2015.
- [19] S. Jeon, S. Kim, D. Min, and K. Sohn. Parn: Pyramidal affine regression networks for dense semantic correspondence estimation. In: ECCV, 2018.
- [20] Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song. Neural style transfer: A review. arXiv:1705.04058, 2017.
- [21] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In: ECCV, 2016.
- [22] J. Kim, C. Liu, F. Sha, and K. Grauman. Deformable spatial pyramid matching for fast dense correspondences. In: CVPR, 2013.
- [23] S. Kim, S. Lin, S. Jeon, D. Min, and K. Sohn. Recurrent transformer networks for semantic correspondence. In: NIPS, 2018.
- [24] S. Kim, D. Min, B. Ham, S. Jeon, S. Lin, and K. Sohn. Fcss: Fully convolutional self-similarity for dense semantic correspondence. In: CVPR, 2017.
- [25] S. Kim, D. Min, B. Ham, S. Lin, and K. Sohn. Fcss: Fully convolutional self-similarity for dense semantic correspondence. IEEE Trans. PAMI, 2018.
- [26] S. Kim, D. Min, S. Kim, and K. Sohn. Unified confidence estimation networks for robust stereo matching. IEEE Trans. IP, 26(3):1299–1313, 2018.
- [27] S. Kim, D. Min, S. Lin, and K. Sohn. Dctm: Discrete-continuous transformation matching for semantic flow. In: ICCV, 2017.
- [28] C. Li and M. Wand. Combining markov random fields and convolutional neural networks for image synthesis. In: CVPR, 2016.
- [29] C. Li and M. Wand. Precomputed real-time texture synthesis with markovian generative adversarial networks. In: ECCV, 2016.
- [30] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang. Diversified texture synthesis with feed-forward networks. In: CVPR, 2017.
- [31] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang. Universal style transfer via feature transforms. In: NIPS, 2017.
- [32] Y. Li, M. Liu, X. Li, M. Yang, and J. Kautz. A closed-form solution to photorealistic image stylization. In: ECCV, 2018.
- [33] Y. Li, D. Min, M. S. Brown, M. N. Do, and J. Lu. Spm-bp: Sped-up patchmatch belief propagation for continuous mrfs. In: ICCV, 2015.
- [34] J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang. Visual attribute transfer through deep image analogy. In: SIGGRAPH, 2017.
- [35] C. Liu, J. Yuen, and A. Torralba. Sift flow: Dense correspondence across scenes and its applications. IEEE Trans. PAMI, 33(5):815–830, 2011.
- [36] J. Lu, H. Yang, D. Min, and M. N. Do. Patchmatch filter: Efficient edge-aware filtering meets randomized search for fast correspondence field estimation. In: CVPR, 2013.
- [37] M. Lu, H. Zhao, A. Yao, F. Xu, Y. Chen, and L. Zhang. Decoder network over lightweight reconstructed feature for fast semantic style transfer. In: ICCV, 2017.
- [38] F. Luan, S. Paris, E. Shechtman, and K. Bala. Deep photo style transfer. CoRR, abs/1703.07511, 2, 2017.
- [39] D. Novotny, D. Larlus, and A. Vedaldi. Anchornet: A weakly supervised network to learn geometry-sensitive features for semantic matching. In: CVPR, 2017.
- [40] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Comput. Graph. and Appl., 21(5):34–41, 2001.
- [41] I. Rocco, R. Arandjelovic, and J. Sivic. Convolutional neural network architecture for geometric matching. In: CVPR, 2017.
- [42] I. Rocco, R. Arandjelovic, and J. Sivic. End-to-end weakly-supervised semantic alignment. In: CVPR, 2018.
- [43] I. Rocco, M. Cimpoi, R. Arandjelovic, A. Torii, T. Pajdla, and J. Sivic. Neighbourhood consensus networks. In: NIPS, 2018.
- [44] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In: MICCAI, 2015.
- [45] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In: ICLR, 2015.
- [46] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15:1929–1958, 2014.
- [47] Y. Tai, J. Jia, and C. Tang. Local color transfer via probabilistic segmentation by expectation-maximization. In: CVPR, 2005.
- [48] T. Taniai, S. N. Sinha, and Y. Sato. Joint recovery of dense correspondence and cosegmentation in two images. In: CVPR, 2016.
- [49] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. arXiv:1603.03417, 2016.
- [50] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In: CVPR, 2017.
- [51] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
- [52] W. Zhang, C. Cao, S. Chen, J. Liu, and X. Tang. Style transfer via image component analysis. IEEE Trans. Multimedia, 15(7):1594–1601, 2013.
- [53] T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros. Learning dense correspondence via 3d-guided cycle consistency. In: CVPR, 2016.
- [54] T. Zhou, Y. J. Lee, S. X. Yu, and A. A. Efros. Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In: CVPR, 2015.