GetNet: Get Target Area for Image Pairing

10/08/2019 ∙ by Henry H. Yu, et al. ∙ Tsinghua University ∙ NetEase, Inc

Image pairing is an important research task in computer vision. Finding image pairs that contain objects of the same category underlies many tasks, such as tracking and person re-identification, and it is the focus of our research. Existing traditional and deep learning-based methods fall short in either speed or accuracy. In this paper, we improve on the Siamese network and propose GetNet, which combines a Spatial Transformer Network (STN) with a Siamese network to first extract the target area and then perform the subsequent processing. Experiments show that our method achieves competitive results in speed and accuracy.




I Introduction

Finding image pairs that are related in some way is a basic technique in computer vision. Many research fields, such as image matching and image retrieval, are essentially a process of finding image pairs. Image pairing also plays a key role in other areas, such as tracking, object recognition, multi-view 3D reconstruction and structure-from-motion (SfM). The problem we are concerned with here is whether two images contain objects of the same category, which is very common in tasks such as image matching, image retrieval and tracking. Although the rapid development of deep learning in recent years has greatly advanced computer vision and related fields, finding image pairs that meet certain criteria across large, unstructured image datasets can be very time-consuming and error-prone, especially when the target object is small in the image or sits in a cluttered background.

Recently, the Siamese architecture [1] has been utilised in various image pairing problems, such as face verification, local image patch pairing and whole-image matching, but not yet in generic object-centred image pairing and retrieval. In this paper, we establish a theoretical connection between Spatial Transformer Networks (STNs) and Siamese networks. The resulting model can distinguish matching from non-matching image pairs (i.e. pairs that do or do not contain the same object) and can also output the common parts of matching pairs (i.e. the target objects), because an STN applies an affine transformation to the image that helps extract the region of interest (ROI). The network is trained on example image pairs and supervised only by the label “1” (pairing) or “0” (non-pairing). A new way of training the network is also proposed to obtain better performance. Our experimental results show that the proposed method improves pairing performance compared to the original Siamese networks. With the regions of interest output by the STNs, our method also provides a convenient way of locating the valuable part of an image and an effective way of auto-labelling dataset ground truth.

To sum up, the main contributions of our work are:


Propose a method for object-oriented image pairing that can extract specific subparts that are relevant for image retrieval and matching in clutter with just 0/1 supervision.


Provide a convenient way of locating the valuable part of an image and an effective way of auto-labelling dataset ground truth.


Propose a new way to train the network to get better performance.

The paper is organized as follows. We first introduce the related work in Section II, then review Spatial Transformer Networks and Siamese networks and describe our proposed network structure in Section III. The evaluation datasets and experimental results are given in Section IV, and future work is discussed in Section V. Finally, our conclusion is presented in Section VI.

II Related Work

Image matching is an important problem in computer vision and can be seen as a sub-problem of image pairing. Although recent rapid advances in convolutional neural networks (CNNs) have achieved state-of-the-art performance in tasks such as image recognition and object segmentation, finding similar images (e.g. images that contain the same objects) across a large, unstructured image dataset can still be very time-consuming, and performance degrades when the object is relatively small or the background is cluttered. This remains a common problem in computer vision. The main obstacles to image matching include viewpoint variation, scale variation, illumination variation, occlusion and background clutter. Over the years, different methods have been proposed to address the image matching problem and to improve accuracy and performance. Generally, these methods fall into two categories.

The first category is based on hand-crafted image feature extraction. Jyoti et al. used SIFT features to match stereo image pairs for 3D reconstruction [3]. SURF features [4], ORB features [5], color histograms [6] and HOG features can also be used for image matching to reduce computation time, but they still need a lot of time. Researchers have also proposed many other methods for extracting features or keypoints for matching, such as CSIFT [7], BRISK [8], ORB [9], FREAK [10], stereo keypoint matching [11] and LBP [12].

The second category is based on CNNs, which have been widely applied in many areas of computer vision, including image matching, and have outperformed traditional methods. Iaroslav et al. [1] presented a method that measures whole-image similarity with a deep neural network and predicts the similarity of a query image pair, showing very promising results. However, this method requires a pre-trained CNN classifier, which is inconvenient. Moreover, because it measures the similarity of the whole images, it does not perform well when the similar parts are relatively small in the image pair or the backgrounds are cluttered. Although many improved Siamese-based networks, such as SConE [13], Patch Match Networks [14] and SimNet [15], and other deep CNN-based methods such as [16] and [17] have been proposed, the problem is still far from resolved.

Image retrieval has become an important research area in computer vision in recent years. Its task is to find images that are related to the query image, which is in essence image pairing. Image retrieval is mainly classified into several types, such as text-based, content-based and sketch-based. Our focus here is content-based image retrieval (CBIR), since it retrieves images based on their content, which is similar to our research. In CBIR, different researchers focus on different aspects and have achieved good results. Chang et al. [18] proposed image retrieval using the color distribution, mean and standard deviation, and Sun et al. [19] suggested a color distribution entropy method. Some researchers treat shape as an important feature [20][21], while others favour texture [22][23]. In addition, the kernel-based approach proposed by Karmakar et al. [24] is also very instructive.

As with image matching, research on image retrieval algorithms can be divided into traditional methods and deep learning-based methods. Among the former, Krishna et al. [25] proposed indexing images using the k-means algorithm, and Sonali et al. [26] proposed using an SVM as a classifier. The success of deep neural networks in feature representation has also led to their wide use in image retrieval tasks. Models pre-trained on popular datasets such as ImageNet [27], Landmarks [28] and COCO [29] can be used to extract image features and are found to generalize well. In particular, convolutional layers have proved most beneficial for retrieving images [28][30][31][32]. Nearest neighbor search is then applied to the feature vectors to find the images most similar to a query. Despite this progress, speed and accuracy remain open problems in large-scale retrieval.

III Proposed Method

III-A Spatial Transformer Networks

In our method, a Spatial Transformer Network (STN) [2], shown in Fig. 1, is applied separately to the two images of every pair. The spatial transformer consists of three parts: a localisation net, a grid generator and a sampler. First, the localisation net takes the original image $U \in \mathbb{R}^{H \times W \times C}$ as input, where $H$ is the height, $W$ is the width and $C$ is the number of channels, and outputs the parameters $\theta$ of the transformation $\mathcal{T}_\theta$:

$$\theta = f_{loc}(U)$$

Second, the grid generator produces the parameterised sampling grid $\mathcal{T}_\theta(G)$. Applying this grid to the original input image then yields the deformed output image $V \in \mathbb{R}^{H' \times W' \times C}$ with height $H'$, width $W'$ and $C$ channels.

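To apply the sampling grid to the input image, every output pixel, defined on a regular grid $G = \{G_i\}$ with coordinates $G_i = (x_i^t, y_i^t)$, is computed to form the output image. The localisation net of a 2D affine transformation outputs six parameters, so the STN can apply an affine transformation to the original image as follows:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

where $x_i^t$ and $y_i^t$ are the coordinates in the output image, $x_i^s$ and $y_i^s$ are the coordinates in the input feature map, and $A_\theta$ represents the affine transformation. Thus, the precise region of the original image that contains the target object can be extracted using the STN.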

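In our experiments, to facilitate training, the network is modified so that the localisation net outputs only three parameters, $s$, $t_x$ and $t_y$, which realise a local crop and translation of the original image:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

where $t_x$ is the translation distance along the $x$ axis, $t_y$ is the translation distance along the $y$ axis and $s$ is the cropping ratio. Once $x_i^s$ and $y_i^s$, which define the spatial location in the input feature map, are obtained, the output feature map can be calculated as follows:

$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y), \quad \forall i \in [1 \ldots H'W'], \; \forall c \in [1 \ldots C]$$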
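Here $k$ is the image interpolation kernel function (e.g. bilinear or nearest neighbor), and $\Phi_x$ and $\Phi_y$ are the parameters of $k$. $U_{nm}^c$ is the value at location $(n, m)$ in channel $c$ of the input feature map, and $V_i^c$ is the value at location $(x_i^t, y_i^t)$ in channel $c$ of the output feature map. To allow backpropagation, a sampling kernel can be used only if gradients can be defined with respect to the source coordinates $x_i^s$ and $y_i^s$. Take the bilinear sampling kernel as an example:

$$V_i^c = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|)$$

Its gradients with respect to $U$ and the sampling coordinates can be defined, and the partial derivatives are:

$$\frac{\partial V_i^c}{\partial U_{nm}^c} = \sum_{n}^{H} \sum_{m}^{W} \max(0, 1 - |x_i^s - m|) \max(0, 1 - |y_i^s - n|)$$

$$\frac{\partial V_i^c}{\partial x_i^s} = \sum_{n}^{H} \sum_{m}^{W} U_{nm}^c \max(0, 1 - |y_i^s - n|) \begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases}$$

and similarly for $\partial V_i^c / \partial y_i^s$. The STN can therefore be trained end-to-end and extract the ROI for subsequent processing.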

Fig. 1: Spatial Transformer Networks.
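As a minimal sketch of the restricted transformer described in this section, the following pure-Python code builds a sampling grid from a cropping ratio and two translations and then samples the input with the bilinear kernel. The function names, the normalised [-1, 1] coordinate convention (borrowed from the original STN formulation [2]) and the single-channel list-of-rows image format are assumptions made for illustration, not the authors' implementation.

```python
def affine_grid(s, tx, ty, out_h, out_w):
    """Source coordinates (x_s, y_s) in [-1, 1] for every output pixel.

    Applies x_s = s * x_t + tx and y_s = s * y_t + ty, i.e. the
    restricted affine matrix [[s, 0, tx], [0, s, ty]] (assumed form).
    """
    grid = []
    for i in range(out_h):
        # Normalised target coordinate of row i (a single row maps to 0).
        y_t = 2.0 * i / (out_h - 1) - 1.0 if out_h > 1 else 0.0
        row = []
        for j in range(out_w):
            x_t = 2.0 * j / (out_w - 1) - 1.0 if out_w > 1 else 0.0
            row.append((s * x_t + tx, s * y_t + ty))
        grid.append(row)
    return grid


def bilinear_sample(image, grid):
    """Sample a single-channel image (list of rows) at the grid points."""
    h, w = len(image), len(image[0])
    out = []
    for grid_row in grid:
        out_row = []
        for x_s, y_s in grid_row:
            # Map normalised coordinates back to pixel indices.
            px = (x_s + 1.0) * (w - 1) / 2.0
            py = (y_s + 1.0) * (h - 1) / 2.0
            value = 0.0
            for n in range(h):
                wy = max(0.0, 1.0 - abs(py - n))  # bilinear weight along y
                if wy == 0.0:
                    continue
                for m in range(w):
                    wx = max(0.0, 1.0 - abs(px - m))  # bilinear weight along x
                    value += image[n][m] * wx * wy
            out_row.append(value)
        out.append(out_row)
    return out


def crop_with_stn(image, s, tx, ty, out_h, out_w):
    """Crop/translate an image with the three-parameter transform."""
    return bilinear_sample(image, affine_grid(s, tx, ty, out_h, out_w))
```

With `s = 1` and zero translation the grid reproduces the input; with `s = 0.5` and `tx = ty = 0.5` it selects the lower-right quarter of the image. In a trained network these three values would come from the localisation net rather than being set by hand.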

III-B Siamese CNN architecture

A Siamese CNN architecture, shown in Fig. 2, is used to match the image pair. It is a classical approach that first extracts features from each image of the input pair and then compares the features to calculate the similarity of the pair. The details of the network are given below; the two extracted one-dimensional feature vectors are concatenated into a single one-dimensional feature vector, which is then fed into the fully connected (fc) layer.

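The contrastive loss [33] of the output features, defined below, is applied to measure the similarity of the images in every pair, and simple supervised learning is carried out by giving label 1 or 0 to indicate whether the pairing is successful.

$$L = \frac{1}{2N} \sum_{n=1}^{N} \left[ y_n d_n^2 + (1 - y_n) \max(margin - d_n, 0)^2 \right]$$

where $N$ is the number of pairs, $d$ is the Euclidean distance [34] between the features of an image pair, given below, $y$ is the label (1 or 0) and $margin$ is a given threshold.

$$d = \lVert a - b \rVert_2$$

with $a$ and $b$ the feature vectors of the two images in the pair.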


Fig. 2: Siamese CNN architecture.
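The loss described in this subsection can be sketched in a few lines. The code below assumes the widely used form of the contrastive loss, in which a matching pair (label 1) is penalised by its squared distance and a non-matching pair (label 0) is penalised only when its distance falls inside the margin. The function names and the default margin of 1.0 are illustrative assumptions, since the text does not give the value that was used.

```python
import math


def euclidean_distance(f1, f2):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def contrastive_loss(pairs, labels, margin=1.0):
    """Mean contrastive loss over a batch.

    pairs  : list of (f1, f2) feature-vector tuples
    labels : list of 1 (pairing) or 0 (non-pairing)
    margin : assumed threshold; non-matching pairs farther apart cost nothing
    """
    total = 0.0
    for (f1, f2), y in zip(pairs, labels):
        d = euclidean_distance(f1, f2)
        total += y * d ** 2 + (1 - y) * max(margin - d, 0.0) ** 2
    return total / (2 * len(pairs))
```

Minimising this quantity pulls the features of matching pairs together and pushes the features of non-matching pairs at least `margin` apart, which is why a plain 1/0 label is enough supervision for the comparison stage.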

III-C Proposed network

The original image pair is the input to the STNs, and the image pair output by the STNs is the input to the Siamese networks, which share parameters to extract features. The Siamese networks then output the prediction using the contrastive loss. The proposed network is named GetNet and is shown in Fig. 3.

Fig. 3: Structure of proposed network.
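The data flow of Fig. 3 can be illustrated with a small structural sketch. Everything here is hypothetical scaffolding: `stn` and `embed` are arbitrary callables standing in for the trained networks, the same `stn` is reused for both images (the text does not say whether the two transformers share weights), and the thresholded-distance decision rule is an assumption, since the paper also mentions a fully connected layer on the concatenated features.

```python
import math


def getnet_forward(image_a, image_b, stn, embed):
    """Return (distance, crop_a, crop_b) for one image pair.

    stn   : callable mapping an image to its cropped target area
    embed : callable mapping a crop to a feature vector; the SAME callable
            serves both branches, which is what parameter sharing means
    """
    crop_a, crop_b = stn(image_a), stn(image_b)
    feat_a, feat_b = embed(crop_a), embed(crop_b)
    distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(feat_a, feat_b)))
    return distance, crop_a, crop_b


def predict_pairing(distance, threshold):
    """Label 1 (pairing) when the embedded crops are close, else 0."""
    return 1 if distance < threshold else 0


def toy_stn(image):
    """Stand-in for a trained STN: drop a one-pixel border."""
    return [row[1:-1] for row in image[1:-1]]


def toy_embed(crop):
    """Stand-in for the shared CNN: mean intensity of each row."""
    return [sum(row) / len(row) for row in crop]
```

The toy stand-ins only show why cropping first helps: two images with the same centre but different borders embed identically once the border is removed, whereas a whole-image comparison would see them as different.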

III-D Network training

A new approach is proposed here to train GetNet. CNNs use the back-propagation algorithm to compute gradients and update the weights during training. However, conventional end-to-end training has less impact on the front end of the network, and the relatively weak supervision used here (only the label 1 or 0) makes it even more difficult to update the STN. We therefore propose a strategy of training the STN and the overall network alternately. When the Siamese network parameters are frozen and the STN is trained alone, the STN parameters can be adjusted adequately so that the target object is sampled more accurately from the input image. When the overall network is trained, the Siamese network can extract more suitable features for testing similarity according to the label.

These two modes of training are used in turn and achieve good results, which demonstrates the effectiveness of this training approach.
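The alternation can be written down as a simple schedule. The sketch below is framework-independent and only decides, for each epoch, which parameter groups should receive updates. The group names, the choice to start with the STN-only phase and the number of epochs per phase are assumptions, as the text does not specify them.

```python
def alternating_schedule(num_rounds, stn_epochs=1, full_epochs=1):
    """Yield (epoch_index, phase, trainable_groups) in training order.

    Each round first trains the STN alone with the Siamese weights frozen,
    then trains the whole network end-to-end (assumed ordering).
    """
    epoch = 0
    for _ in range(num_rounds):
        # Phase 1: Siamese weights frozen, only the STN is updated.
        for _ in range(stn_epochs):
            yield epoch, "stn_only", ("stn",)
            epoch += 1
        # Phase 2: every parameter group is trainable.
        for _ in range(full_epochs):
            yield epoch, "full", ("stn", "siamese")
            epoch += 1
```

In a training loop, the `trainable_groups` tuple would decide which parameters are passed to the optimiser, or have their gradients enabled, for that epoch.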

IV Experimental Results

IV-A Dataset

IV-A1 MNIST

MNIST [35] is a dataset of handwritten digits in which every original image is 28 × 28 pixels. To test our networks, a “distorted MNIST” is made by placing images from the MNIST dataset on a larger background and adding random noise, as in Fig. 4.

Fig. 4: A background with random noise.

The distorted MNIST dataset is clearly more difficult to pair than MNIST, which makes it suitable for testing the performance of our network.
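A distorted sample of this kind could be generated roughly as follows. This is only a sketch: the canvas size, the number of noise pixels and the noise model (isolated pixels with random intensity) are hypothetical parameters chosen for illustration and are not taken from the paper.

```python
import random


def make_distorted_digit(digit, canvas_size, num_noise_pixels, rng):
    """Place a digit on a noisy square canvas at a random position.

    digit : list of rows with values in [0, 1]
    rng   : a random.Random instance, passed in for reproducibility
    Returns the canvas and the (top, left) offset of the digit.
    """
    dh, dw = len(digit), len(digit[0])
    if dh > canvas_size or dw > canvas_size:
        raise ValueError("canvas must be at least as large as the digit")
    canvas = [[0.0] * canvas_size for _ in range(canvas_size)]
    # Random noise: set randomly chosen pixels to random intensities.
    for _ in range(num_noise_pixels):
        r, c = rng.randrange(canvas_size), rng.randrange(canvas_size)
        canvas[r][c] = rng.random()
    # Paste the digit at a random offset, overwriting any noise under it.
    top = rng.randint(0, canvas_size - dh)
    left = rng.randint(0, canvas_size - dw)
    for i in range(dh):
        for j in range(dw):
            canvas[top + i][left + j] = digit[i][j]
    return canvas, (top, left)
```

The returned offset is not used for training, which relies only on the 1/0 pair label, but it is useful for checking afterwards whether the STN crop actually lands on the digit.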

IV-A2 “Shelf Tote” Benchmark Dataset

The Shelf & Tote Benchmark Dataset [36] was created by the MIT and Princeton Vision Group team for the worldwide Amazon Picking Challenge 2016. It contains scenes with unique object poses seen from multiple viewpoints.

It was used for self-supervised deep learning for 6D pose estimation in the Amazon Picking Challenge, and we found it suitable for evaluating the performance of our network. Example images from the dataset are shown in Fig. 5.

Fig. 5: The Shelf & Tote Benchmark Dataset.

A subset of object categories was picked from the dataset, and all the images were resized to a fixed size for the pairing performance test of the network.

IV-A3 Caltech Leaves Dataset

The Caltech leaves dataset is part of the Caltech Computational Vision collection and contains images of leaves of different species against different backgrounds, as in Fig. 6.

Fig. 6: The Caltech leaves dataset.

The images in this dataset were also resized to a fixed size and used in the experiments.

IV-B Performance

All three datasets described above were used. For each image in each dataset, another image from the same category is taken to create a pair with label 1, and an image from a different category is taken to create a pair with label 0. This yields three datasets labelled only with 1 or 0, in which the numbers of positive and negative samples are nearly equal. The traditional Siamese network is used as a baseline to evaluate the performance of our network.
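The pair construction described above can be sketched as follows. The function name and the `(image_id, category)` input format are hypothetical, and excluding an image from being paired with itself is an assumption the text does not state.

```python
import random


def build_pairs(samples, rng):
    """Build labelled pairs from (image_id, category) samples.

    For every image, one partner from the same category gives a pair with
    label 1 and one partner from another category gives a pair with label 0.
    Returns a list of (id_a, id_b, label) tuples.
    """
    by_category = {}
    for image_id, category in samples:
        by_category.setdefault(category, []).append(image_id)

    pairs = []
    for image_id, category in samples:
        # Candidates from the same category, excluding the image itself.
        same = [i for i in by_category[category] if i != image_id]
        # Candidates from every other category.
        other = [i for c, ids in by_category.items() if c != category for i in ids]
        if same:
            pairs.append((image_id, rng.choice(same), 1))
        if other:
            pairs.append((image_id, rng.choice(other), 0))
    return pairs
```

Because each image contributes at most one positive and one negative pair, the two classes stay nearly balanced, matching the description above; they differ only when a category has a single image or the dataset has a single category.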

Dataset          Siamese (%)   Our network (%)
MNIST            98.2          99.3
Tote dataset     80.4          87.1
Leaves dataset   84.3          88.6
TABLE I: Results of the experiments.

Fig. 7: The results of our network performance.

Table I gives the results of the experiments, and some example outputs are shown in Fig. 7. The left pairs are the input image pairs and the right pairs are the image pairs output by the STN. The output images frame the target object more tightly, so the Siamese network can perform better on the right-hand pairs. Note also that only a very simple label is used to supervise training, yet the STN still locates the target object in the input image well. This effectively labels the target object in the original image, so it can also eliminate manual data labelling in some special tasks.

Fig. 8: Some examples on which the traditional Siamese network failed but our network succeeded.

Fig. 8 shows some examples on which the traditional Siamese network failed but our network succeeded. Two examples on which our method failed are presented in Fig. 9.

Fig. 9: Two examples on which our method failed.

Analysis of the failures shows that our method is more likely to perform badly when the image background is so complex that the STN cannot detect the right object. The result can also be affected by physical noise such as lighting. Traditional methods without STNs do not perform well in these cases either, and we will address this in future work.

IV-C Contribution

In this paper, a generic object-centred image pairing method (i.e. one that determines whether an image pair contains the same object) is proposed that gives promising results even when the target object is small in the image or sits in a cluttered background. It achieves this with a novel network structure named GetNet that combines Spatial Transformer Networks (STNs) and the Siamese architecture.

Our idea is intuitive and reasonable. Humans have an amazing ability to home in on the parts of an image that contain the objects they are interested in, even if the objects are inconspicuous. If a retrieval system can focus on the potential objects and ignore the “distractions” as humans do, and then apply similarity measurement to the potential objects in the image pair instead of to the whole images, the accuracy and performance of image matching can be improved significantly. In our approach, STNs determine which parts of each image in the query pair to use for matching and output these subparts. The Siamese architecture is then applied to measure the similarity of the two subparts and determine whether they pair. Example pairs labelled simply “1” (pairing) or “0” (non-pairing) are used to train the network.

Our approach not only improves the accuracy of image pairing, but also presents a new and effective method of training networks to obtain better performance. The alternating training method can fully update the weights of the part of the network that is crucial for the task, so that the network achieves better performance, and it can be applied to other similar networks as well. At the same time, a new and efficient way of auto-labelling image ground truth is provided, since the target object can be located given only the label 1 or 0.

GetNet also provides a promising approach to finding lesion locations in medical image research. In recent years, the application of AI in the medical field has attracted the attention of more and more researchers, and medical image research is one of its most important aspects. The difficulty in medical image research is that the images cannot be understood as easily as natural images, and doctors often have to determine the location of a lesion based on outcomes such as cancer recurrence or lymph node metastasis, which can be very challenging. Our proposed method can help determine the region of interest and predict the outcome of the treatment, which we believe is instructive for future research.

V Future Work

At present, the STN in our method outputs only three parameters (two translations and a cropping ratio), so it can only translate and crop. In the next step, we will study the STN further and make it output more parameters, covering rotation and other affine transformations, to obtain better performance. In addition, training the current network is difficult, so the network structure will be optimized to simplify the training process. The effect of noise will also be addressed later on.

VI Conclusion

To solve the problem of pairing images that contain the same objects, we propose a new network structure named GetNet, built from a Spatial Transformer Network and a Siamese network. Our experiments confirm that the network improves accuracy and also labels the target object effectively. This paper also proposes a new and efficient way of training CNNs, which can help improve network performance in some tasks.


  • [1] I. Melekhov, J. Kannala, and E. Rahtu, “Siamese network features for image matching,” in Pattern Recognition (ICPR), 2016 23rd International Conference on.   IEEE, 2016, pp. 378–383.
  • [2] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
  • [3] J. Joglekar and S. S. Gedam, “Image matching with sift features - a probabilistic approach,” Proceedings of IAPRS, vol. 38, pp. 7–12, 2010.
  • [4] Y. Pang, W. Li, Y. Yuan, and J. Pan, “Fully affine invariant surf for image matching,” Neurocomputing, vol. 85, pp. 6–10, 2012.
  • [5] L. Li, L. Wu, and Y. Gao, “Improved image matching method based on orb,” in Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), 2016 17th IEEE/ACIS International Conference on.   IEEE, 2016, pp. 465–468.
  • [6] X. Zhang, A. Zang, G. Agam, and X. Chen, “Learning from synthetic models for roof style classification in point clouds,” in Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems.   ACM, 2014, pp. 263–270.
  • [7] G. J. Burghouts and J.-M. Geusebroek, “Performance evaluation of local colour invariants,” Computer Vision and Image Understanding, vol. 113, no. 1, pp. 48–62, 2009.
  • [8] S. Leutenegger, M. Chli, and R. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 IEEE international conference on computer vision (ICCV).   IEEE, 2011, pp. 2548–2555.
  • [9] E. Rublee, V. Rabaud, K. Konolige, and G. R. Bradski, “Orb: An efficient alternative to sift or surf.” in ICCV, vol. 11, no. 1.   Citeseer, 2011, p. 2.
  • [10] A. Alahi, R. Ortiz, and P. Vandergheynst, “Freak: Fast retina keypoint,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 510–517.
  • [11] M. Seabright, L. Streeter, M. Cree, M. Duke, and R. Tighe, “Simple stereo matching algorithm for localising keypoints in a restricted search space,” in 2018 International Conference on Image and Vision Computing New Zealand (IVCNZ).   IEEE, 2018, pp. 1–6.
  • [12] T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 12, pp. 2037–2041, 2006.
  • [13] T. Trzcinski, J. Komorowski, L. Dabala, K. Czarnota, G. Kurzejamski, and S. Lynen, “Scone: Siamese constellation embedding descriptor for image matching,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0–0.
  • [14] M. S. Hanif, “Patch match networks: Improved two-channel and siamese networks for image patch matching,” Pattern Recognition Letters, vol. 120, pp. 54–61, 2019.
  • [15] S. Appalaraju and V. Chaoji, “Image similarity using deep cnn and curriculum learning,” arXiv preprint arXiv:1709.08761, 2017.
  • [16] A. Kumar, S. Srivastava, A. Mukhopadhyay, and S. M. Bhandarkar, “Deep spectral correspondence for matching disparate image pairs,” arXiv preprint arXiv:1809.04642, 2018.
  • [17] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “Deepflow: Large displacement optical flow with deep matching,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1385–1392.
  • [18] C.-C. Chang and T.-C. Lu, “A color-based image retrieval method using color distribution and common bitmap,” in Asia Information Retrieval Symposium.   Springer, 2005, pp. 56–71.
  • [19] J. Sun, X. Zhang, J. Cui, and L. Zhou, “Image retrieval based on color distribution entropy,” Pattern Recognition Letters, vol. 27, no. 10, pp. 1122–1126, 2006.
  • [20] Y. Liu, D. Zhang, G. Lu, and W.-Y. Ma, “A survey of content-based image retrieval with high-level semantics,” Pattern recognition, vol. 40, no. 1, pp. 262–282, 2007.
  • [21] R. O. Stehling, M. A. Nascimento, and A. X. Falcao, “On “shapes” of colors for content-based image retrieval,” in Proceedings of the 2000 ACM workshops on Multimedia.   ACM, 2000, pp. 171–174.
  • [22] H. Kekre, S. D. Thepade, T. K. Sarode, and S. P. Sanas, “Image retrieval using texture features extracted using lbg, kpe, kfcg, kmcg, kevr with assorted color spaces,” International Journal of Advances in Engineering & Technology, vol. 2, no. 1, p. 520, 2012.
  • [23] A. Sandhu and A. Kochhar, “Content based image retrieval using texture, color and shape for image analysis,” International Journal of Computers & Technology, vol. 3, no. 1c, pp. 149–152, 2012.
  • [24] P. Karmakar, S. W. Teng, G. Lu, and D. Zhang, “A kernel-based approach for content-based image retrieval,” in 2018 International Conference on Image and Vision Computing New Zealand (IVCNZ).   IEEE, 2018, pp. 1–6.
  • [25] N. Raja and K. S. Bhanu, “Content bases image search and retrieval using indexing by kmeans clustering technique,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 5, pp. 2181–2189, 2013.
  • [26] S. Jain and S. Shrivastava, “A novel approach for image classification in content based image retrieval using support vector machine,” International Journal of Computer Science & Engineering Technology (IJCSET), vol. 4, no. 3, pp. 223–227, 2013.
  • [27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   IEEE, 2009, pp. 248–255.
  • [28] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, “Neural codes for image retrieval,” in European conference on computer vision.   Springer, 2014, pp. 584–599.
  • [29] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [30] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, “Deep image retrieval: Learning global representations for image search,” in European conference on computer vision.   Springer, 2016, pp. 241–257.
  • [31] F. Radenović, G. Tolias, and O. Chum, “Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples,” in European conference on computer vision.   Springer, 2016, pp. 3–20.
  • [32] A. S. Razavian, J. Sullivan, S. Carlsson, and A. Maki, “Visual instance retrieval with deep convolutional networks,” ITE Transactions on Media Technology and Applications, vol. 4, no. 3, pp. 251–258, 2016.
  • [33] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.   IEEE, 2005, pp. 539–546.
  • [34] A. Ghafoor, N. I. Rao, and S. Khan, “Image matching using distance transform,” Lecture Notes in Computer Science, vol. 2749, pp. 654–660, 2003.
  • [35] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [36] A. Zeng, K.-T. Yu, S. Song, D. Suo, E. Walker Jr, A. Rodriguez, and J. Xiao, “Multi-view self-supervised deep learning for 6d pose estimation in the amazon picking challenge,” in ICRA, 2017.