3rd Place Solution to Google Landmark Recognition Competition 2021

by Cheng Xu, et al.
ByteDance Inc.

In this paper, we present our solution to the Google Landmark Recognition 2021 Competition. First, image embeddings are extracted via various architectures (CNN-, Transformer-, and hybrid-based), optimized with an ArcFace loss. Then we apply an efficient pipeline that re-ranks predictions by adjusting the retrieval score with classification logits and non-landmark distractor penalties. Finally, the ensembled model scores 0.489 on the private leaderboard, achieving 3rd place in the 2021 edition of the Google Landmark Recognition Competition.





1 Introduction

The Google Landmark Recognition 2021 Competition [1] is the fourth landmark recognition competition on Kaggle, and this year it is organized together with the ICCV 2021 Instance-Level Recognition workshop. Participants need to build models that correctly recognize the landmarks (if any) in a private test set, and the code-submission format is adopted as in previous editions. This year, the sponsor collected a new set of test images [8], created with a focus on fair worldwide representation. The training data for this competition comes from the Google Landmarks Dataset v2 (GLDv2) [13]. GLDv2 is a large-scale benchmark for instance-level recognition and retrieval tasks, including approximately 5M images with about 200k distinct instance labels, and it presents several challenges such as intra-class heterogeneity, class imbalance, and a large fraction of non-landmark test images. The cleaned subset of GLDv2 (GLDv2 CLEAN) consists of approximately 1.5 million images with 81,313 classes. Both GLDv2 and GLDv2 CLEAN can be used for training in this competition. Competition entries are evaluated using Global Average Precision (GAP) [10, 13]. This paper summarizes our solution to the competition.

Figure 1: Overview of the whole solution. The top diagram illustrates the model structure and the final prediction, which aggregates the retrieval score and the top-1 classification logit. Specifically, the retrieval score for each index ID is adjusted as: raw retrieval score + classification logit - distractor score.

2 Method

2.1 Overview

Our final prediction comes from two parts: retrieval scores and classification logits. The whole solution can be summarized as the following pipeline: 1) data preprocessing; 2) model training and retrieval (concatenation and retrieval for the model ensemble); 3) classification logit adjustment for class imbalance; 4) distractor score penalization for non-landmarks; and 5) top-1 classification aggregation. The whole solution is shown in Figure 1. Next, we explain each part in detail.

2.2 Data Preprocessing

Following previous solutions [7], we split the training dataset as follows, and we use the landmark samples from the 2019 test set as the validation set.

  • GLDv2c: the clean version of GLDv2, which consists of 1.5M images and 81,313 landmarks.

  • GLDv2x: all images from GLDv2 belonging to the 81,313 landmarks, which consists of 3.2M images.

  • GLDv2: all images of GLDv2, which consists of 4.1M images and 203,094 landmarks.

  • Non-landmark: the non-landmark images from the 2019 test set, which consists of 11k images.

2.3 Model training and Retrieval

To calculate the similarity of different landmark samples, the 512-dimensional embeddings of input images are extracted from various backbone models.

2.3.1 Model design

As with previous solutions, the model architecture consists of a backbone, GeM pooling, a neck for embedding, and a head for classification. Specifically, backbone outputs are aggregated via a Generalized-Mean (GeM) pooling layer [5] and then fed into an embedding neck (Linear(512) + 1D-BN + PReLU). Finally, the image embeddings are used to classify specific landmarks, supervised by an ArcFace loss [3] with adaptive margin [6]. Note that GeM pooling is removed in the Transformer- and hybrid-based models, since each token's output feature is already a global representation thanks to the self-attention mechanism.
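As a rough illustration of the pooling step, GeM can be sketched in NumPy as below. This is only a standalone sketch of the operator from [5], not the authors' code; in the actual model it is a learnable layer whose exponent p is typically trained alongside the backbone.

```python
import numpy as np

def gem_pool(feature_map: np.ndarray, p: float = 3.0, eps: float = 1e-6) -> np.ndarray:
    """Generalized-Mean (GeM) pooling over the spatial dimensions.

    feature_map: (C, H, W) backbone output.
    p = 1 recovers average pooling; p -> infinity approaches max pooling.
    """
    clamped = np.clip(feature_map, eps, None)       # avoid 0 ** p issues
    pooled = (clamped ** p).mean(axis=(1, 2))       # per-channel generalized mean, shape (C,)
    return pooled ** (1.0 / p)
```

The pooled (C,)-vector is what the embedding neck (Linear + BN + PReLU) would consume.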

Considering the diversity of model architectures, we chose three types of backbones, as follows:

  • CNN-based: EfficientNet [11, 12] B5, B6, B7, V2

  • Transformer-based: Swin-L-384 [9]

  • Hybrid-based: CvT-W24-384 [14]

Our final submission contains all of the above backbones. Swin-L achieved the best retrieval performance on the public/private leaderboards, with the other models performing comparably on retrieval score. We believe that the greater the difference in model structure, the greater the complementarity in the fusion stage; indeed, we found that Swin-L and CvT lead to a more significant improvement in the final ensemble performance than the CNN models.

2.3.2 Training schedule

Similar to last year’s solutions, different image resolutions and training splits are adopted to accelerate convergence. Our training schedule could be divided into three stages.

  • Stage 1: GLDv2c is used to train the model to classify landmarks. The model is initialized from ImageNet-pretrained weights and trained at a relatively small input resolution.

  • Stage 2: GLDv2x is used to finetune the model, initialized from the stage-1 weights. The input resolution varies across models and is larger than in stage 1.

  • Stage 3: GLDv2 is used to train the model to classify all landmarks for the classification logits, initialized from the stage-2 weights. We found experimentally that GLDv2 could not further improve the discrimination of the embeddings, so we freeze the backbone and neck and optimize only the classification head.

As for training details, each stage is trained for 10-20 epochs with a cosine annealing scheduler and one warm-up epoch. We use the AdamW optimizer; the learning rate and weight decay vary across models. The batch size varies between 512 and 1536 with Sync-BN on Tesla T4 16GB GPUs. For augmentation, RandAug [2], CutOut [4], and RandomResizedCrop are adopted; as the image resolution increases, the strength of the data augmentation increases gradually.

At inference time, we extract the features of the input image and retrieve against the index set: we select the top candidate images from the index set according to the retrieval similarity score and accumulate the candidates' scores w.r.t. their labels.
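The retrieval-and-accumulation step above can be sketched as follows. This is a minimal NumPy sketch under the assumption that all embeddings are already L2-normalized (so a dot product is cosine similarity); the function name is illustrative, not from the paper.

```python
import numpy as np

def retrieve_and_vote(query_emb, index_embs, index_labels, k=7):
    """Cosine-similarity retrieval followed by per-label score accumulation.

    query_emb:    (D,) L2-normalized query embedding.
    index_embs:   (N, D) L2-normalized index embeddings.
    index_labels: (N,) landmark id of each index image.
    Returns {landmark_id: accumulated similarity over its top-k hits}.
    """
    sims = index_embs @ query_emb           # cosine similarity, since inputs are normalized
    topk = np.argsort(-sims)[:k]            # indices of the k most similar index images
    scores = {}
    for i in topk:
        lid = int(index_labels[i])
        scores[lid] = scores.get(lid, 0.0) + float(sims[i])
    return scores
```

A landmark that appears several times in the top-k thus accumulates a higher score than one with a single hit.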

2.4 Classification Logit Adjustment

We find that it is crucial to use classification logits to support predictions; we measured the pure top-1 classification accuracy of both our B5 and Swin-L models on the public LB. The classification logit represents the cosine similarity between the image feature and the class center learned by ArcFace. We believe this logit is complementary to the retrieval score, especially when a landmark has only a few images in the index.

At this stage, we have the top-7 retrieved training images. For each of the 7 images, we look up its classification logits from the 4 chosen models (B5 512 & 768, B6 512 & 768) and simply add the averaged logit to its retrieval cosine score as an adjustment.
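The adjustment can be sketched as below. This is a hypothetical minimal implementation, assuming each model exposes its logits as a mapping from landmark id to logit; the function and parameter names are illustrative.

```python
def adjust_with_logits(topk_hits, logits_per_model):
    """Add the model-averaged classification logit of each retrieved image's
    landmark to that image's retrieval cosine score.

    topk_hits:        list of (landmark_id, retrieval_score) for the top-7 images.
    logits_per_model: list of {landmark_id: logit} dicts, one per classification model.
    """
    adjusted = []
    for lid, score in topk_hits:
        # average the logit for this landmark over all models
        avg_logit = sum(m.get(lid, 0.0) for m in logits_per_model) / len(logits_per_model)
        adjusted.append((lid, score + avg_logit))
    return adjusted
```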

2.5 Distractor Score penalization

Similar to what [7] did last year, we use the 2019 test set's non-landmark images as an index; for each of the ~4M training images, we find its top-3 matched scores and take their average as the distractor score. We then build a mapping from each training image id to its distractor score, and upload this dict to Kaggle for use at submission time.
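A minimal NumPy sketch of computing the distractor scores, assuming L2-normalized embeddings on both sides (names are illustrative; in practice this would be chunked rather than materializing the full similarity matrix):

```python
import numpy as np

def distractor_scores(train_embs, nonlandmark_embs, k=3):
    """For each training embedding, the mean cosine similarity of its top-k
    matches against the non-landmark index.

    train_embs:       (N_train, D) L2-normalized training embeddings.
    nonlandmark_embs: (N_nl, D)    L2-normalized non-landmark embeddings.
    Returns an (N_train,) array to subtract from the adjusted retrieval scores.
    """
    sims = train_embs @ nonlandmark_embs.T     # (N_train, N_nl) cosine similarities
    topk = np.sort(sims, axis=1)[:, -k:]       # k largest similarities per training image
    return topk.mean(axis=1)
```

A training image that closely resembles many distractors gets a high score and is therefore penalized in the final ranking.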

We then subtract the distractor score from each adjusted cosine score from Sec 2.4, so the final score for each retrieved candidate is:

Raw retrieval score + classification logit - distractor score

2.6 Top 1 Classification Aggregation

We find that using the top 1 classification logits from our best classification models (i.e., EfficientNet B5 and B6) can have another boost.

In Sec 2.5, we obtained all top-7 indexed images with their adjusted scores, and we simply aggregate them by landmark id. In this section, we add another pair of (top-1 classification landmark id, top-1 classification logit) into the aggregation step. The classification logit used is simply the raw top-1 logit, after averaging all classification models' 200k-class prediction logits.

We cannot apply a distractor penalty to the classification pair, since it does not originate from any image like the top-7 matches do, and therefore has no distractor score. This turns out not to be a problem, as we found the top-1 classification score naturally acts as a penalty for non-landmarks.
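The aggregation step can be sketched as below. This is a hypothetical minimal version: the retrieval scores are assumed to already carry the logit adjustment and distractor penalty, and the function name is illustrative.

```python
def aggregate_with_classification(retrieval_scores, cls_landmark, cls_logit):
    """Merge per-landmark adjusted retrieval scores with the top-1
    (classification landmark id, logit) pair, then pick the best landmark.

    retrieval_scores: {landmark_id: adjusted accumulated retrieval score}.
    cls_landmark:     top-1 predicted landmark id from classification.
    cls_logit:        its (model-averaged) raw logit.
    Returns the final (landmark_id, score) prediction.
    """
    merged = dict(retrieval_scores)
    merged[cls_landmark] = merged.get(cls_landmark, 0.0) + cls_logit
    return max(merged.items(), key=lambda kv: kv[1])
```

When the classifier agrees with the retrieval result, the winning landmark's score is reinforced; when it disagrees, the classification pair competes as one more candidate.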

2.7 Ensembling

For blending our various models (see Sec 2.3), we first l2-normalize each model's embeddings separately, concatenate them, and apply another l2-normalization to conduct the ensembled retrieval. Next, we run the ranking routine elaborated in Secs. 2.4 to 2.6 on the concatenated embedding space, i.e., the whole procedure from start to finish on the larger embedding vectors.
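The normalize-concatenate-normalize scheme can be sketched in a few lines of NumPy (a minimal sketch, not the authors' code):

```python
import numpy as np

def ensemble_embeddings(per_model_embs):
    """L2-normalize each model's embeddings, concatenate along the feature
    axis, then L2-normalize the concatenated vectors again.

    per_model_embs: list of (N, D_m) arrays, one per model.
    Returns an (N, sum(D_m)) array of unit-norm ensemble embeddings.
    """
    normed = [e / np.linalg.norm(e, axis=1, keepdims=True) for e in per_model_embs]
    concat = np.concatenate(normed, axis=1)
    return concat / np.linalg.norm(concat, axis=1, keepdims=True)
```

The per-model normalization keeps any one model from dominating the concatenated space by embedding magnitude alone; the final normalization makes dot products on the ensemble vectors cosine similarities again.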

3 Conclusion

In this paper, we presented our solution to the Google Landmark Recognition 2021 competition. We use features and classification logits extracted from several models of different architectures (CNN-, Transformer-, and hybrid-based), optimized with an ArcFace loss, and we present an efficient re-ranking pipeline: retrieval, classification logit adjustment, distractor score adjustment, and top-1 classification aggregation, to generate more accurate landmark recognition results. After aggregating several models with different architectures, we reached a final score of 0.489 on the private leaderboard, taking 3rd place.