The Google Landmark Recognition 2021 Competition is the fourth landmark recognition competition on Kaggle, and this year it is organized together with the ICCV 2021 Instance-Level Recognition workshop. Participants need to build models that correctly recognize the landmarks (if any) in a private test set, and the code-submission format is adopted as in previous years. This year, the sponsor collected a new set of test images, created with a focus on fair worldwide representation. The training data for this competition comes from the Google Landmarks Dataset v2 (GLDv2). GLDv2 is a large-scale benchmark for instance-level recognition and retrieval tasks, containing approximately 5M images with about 200k distinct instance labels; it poses several challenges, such as intra-class heterogeneity, class imbalance, and a large fraction of non-landmark test images. The cleaned subset of GLDv2 (GLDv2 CLEAN) consists of approximately 1.5M images with 81,313 classes. Both GLDv2 and GLDv2 CLEAN can be used for training in this competition. Competition entries are evaluated using Global Average Precision (GAP) [10, 13]. This paper summarizes our solution to the competition.
Our final prediction comes from two parts: retrieval scores and classification logits. The whole solution can be summarized as the following pipeline: 1) data preprocessing; 2) model training and retrieval (concat & retrieval for model ensemble); 3) classification logit adjustment for class imbalance; 4) distractor score penalization for non-landmarks; and 5) top-1 classification aggregation. The whole solution is shown in Figure 1. Next, we explain each part in detail.
2.2 Data Preprocessing
Following previous solutions, we split the training dataset as follows, and we use the landmark samples from the 2019 test set as the validation set.
GLDv2c: the clean version of GLDv2, consisting of 1.5M images and 81,313 landmarks.
GLDv2x: all GLDv2 images belonging to the 81,313 landmarks, consisting of 3.2M images.
GLDv2: all images of GLDv2, consisting of 4.1M images and 203,094 landmarks.
Non-landmark: the non-landmark images from the 2019 test set, consisting of 11k images.
2.3 Model Training and Retrieval
To calculate the similarity of different landmark samples, the 512-dimensional embeddings of input images are extracted from various backbone models.
2.3.1 Model Design
As with previous solutions, the model architecture consists of a backbone, GeM pooling, a neck for embedding, and a head for classification. Specifically, backbone outputs are aggregated via a Generalized-Mean (GeM) pooling layer, then fed into an embedding neck (Linear(512) + 1D-BN + PReLU). Finally, the image embeddings are used to classify specific landmarks, supervised by an ArcFace loss with adaptive margin. Note that GeM pooling is removed in the transformer- and hybrid-based models, since each token's output feature is already a global representation thanks to the self-attention mechanism.
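The GeM aggregation step can be sketched in numpy (an illustrative sketch only; in the actual model the exponent p is a learnable parameter of the pooling layer):

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-Mean pooling over the spatial dims of a (C, H, W) map.

    p = 1 recovers average pooling; as p grows, the result approaches
    max pooling, letting the layer emphasize the most salient activations.
    """
    x = np.clip(feature_map, eps, None)                   # avoid 0 ** p issues
    flat = x.reshape(x.shape[0], -1)                      # (C, H*W)
    return (flat ** p).mean(axis=1) ** (1.0 / p)          # (C,) pooled vector
```

The pooled vector is what the Linear(512) + 1D-BN + PReLU neck then maps to the 512-dimensional embedding.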
Considering the diversity of model architectures, we chose three types of backbones: CNN-based (EfficientNet), Transformer-based (Swin), and hybrid (CvT).
Our final submission contains the above model backbones. Swin-L achieved the best retrieval performance on the public/private leaderboards, and the other models performed comparably on retrieval score. We believe that the greater the difference between model structures, the greater the complementarity of performance in the fusion stage. We found that Swin-L and CvT led to a more significant improvement of the final ensemble performance compared with the CNN models.
2.3.2 Training Schedule
Similar to last year's solutions, different image resolutions and training splits are adopted to accelerate convergence. Our training schedule can be divided into three stages.
Stage 1: GLDv2c is used to train the model to classify the 81,313 landmarks. The model is initialized from ImageNet-pretrained weights.
Stage 2: GLDv2x is used to fine-tune the model, initialized from the stage-1 weights. The input resolution differs across models.
Stage 3: GLDv2 is used to train the model to classify all 203,094 landmarks for the classification logits, initialized from the stage-2 weights. Our experiments showed that GLDv2 could not further improve the discriminativeness of the embeddings; thus, we freeze the backbone and neck and only optimize the classification head.
As for training details, each stage is trained for 10-20 epochs with a cosine annealing scheduler and one warm-up epoch. We use the AdamW optimizer; the learning rate and weight decay vary across models. The batch size varies between 512 and 1536 with Sync-BN on Tesla T4 16GB GPUs. For augmentation, RandAugment, CutOut, and RandomResizedCrop are adopted; as the image resolution increases, the strength of data augmentation increases gradually.
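The learning-rate schedule above (cosine annealing after one linear warm-up epoch) can be sketched as follows; `lr_at_epoch` is a hypothetical helper and the base learning rate here is purely illustrative:

```python
import math

def lr_at_epoch(epoch, total_epochs, base_lr, warmup_epochs=1):
    """Cosine annealing with linear warm-up. Epochs are 0-indexed;
    the rate ramps to base_lr over warmup_epochs, then decays to 0."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs            # linear warm-up
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress)) # cosine decay
```

In practice this is what e.g. a framework's cosine-annealing scheduler with a warm-up wrapper computes per epoch.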
At inference time, we extract the features of the input image, retrieve against the index set, select the top-k (k = 7) candidate images from the index set according to the retrieval similarity score, and accumulate these candidates' scores w.r.t. their labels.
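The retrieval-and-accumulation step can be sketched as below (an illustrative sketch, not the submission code; embeddings are assumed l2-normalized so a dot product equals cosine similarity, and k = 7 follows the top-7 candidates used in Sec 2.4):

```python
import numpy as np
from collections import defaultdict

def retrieve_and_aggregate(query_emb, index_embs, index_labels, k=7):
    """Cosine retrieval against the index, then accumulate the top-k
    candidates' similarity scores by landmark label."""
    sims = index_embs @ query_emb               # cosine similarity per index image
    top = np.argsort(-sims)[:k]                 # k most similar index images
    scores = defaultdict(float)
    for i in top:
        scores[index_labels[i]] += sims[i]      # sum scores per landmark id
    return max(scores.items(), key=lambda kv: kv[1])  # (best label, its score)
```

Summing per-label scores (rather than taking the single best match) makes the prediction robust to one spurious nearest neighbor.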
2.4 Classification Logit Adjustment
We find that it is crucial to use classification logits to support predictions; notably, the top-1 pure classification accuracy of our B5 and Swin-L models is already competitive on the public LB on its own. The classification logit represents the cosine similarity between the image feature and the class center learned by ArcFace. We believe this logit is complementary to the retrieval score, especially when some landmarks have few images in the index.
At this stage, we have the top-7 retrieved training images. For each of the 7 images, we look up its classification logits from all four chosen models (B5 512 & 768, B6 512 & 768), and simply add the averaged logit to its corresponding retrieval cosine score as an adjustment.
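A minimal sketch of this adjustment step (the helper name and the dict layout `model_name -> {label: logit}` are hypothetical; the four model names are illustrative):

```python
import numpy as np

def adjust_retrieval_scores(retrieval_scores, candidate_labels, model_logits):
    """For each retrieved candidate, average its classification logit
    (cosine to the ArcFace class center) across models, and add it to
    the candidate's retrieval cosine score."""
    adjusted = []
    for score, label in zip(retrieval_scores, candidate_labels):
        avg_logit = np.mean([logits[label] for logits in model_logits.values()])
        adjusted.append(score + avg_logit)
    return adjusted
```

Since both terms are cosine similarities on comparable scales, a plain sum works without further calibration.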
2.5 Distractor Score Penalization
Similar to last year's solution, we use the 2019 test set's non-landmark images as an index, and for each training image (4M in total) we find its top-3 matched scores and take their average as the distractor score. We then build a mapping from each of the 4M training image ids to its distractor score, and upload this dict to Kaggle for use at submission time.
We then subtract the distractor score for each adjusted cosine score above from Sec 2.4, and the final score for each retrieval is:
final score = raw retrieval score + classification logit - distractor score
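The distractor score and the final combination can be sketched as below (illustrative helpers under the assumption of l2-normalized embeddings, so dot products are cosine similarities):

```python
import numpy as np

def distractor_score(train_emb, nonlandmark_embs, top=3):
    """Mean of the top-3 cosine similarities between one training image
    and the 2019 non-landmark index: high means the image looks like
    a distractor, so its retrieval matches should be penalized."""
    sims = nonlandmark_embs @ train_emb
    return float(np.sort(sims)[-top:].mean())

def final_score(raw_retrieval, classification_logit, distractor):
    """Final per-candidate score, as in Sec 2.5."""
    return raw_retrieval + classification_logit - distractor
```

Because the distractor score is precomputed per training image and shipped as a dict, applying the penalty at submission time is a simple lookup.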
2.6 Top 1 Classification Aggregation
We find that using the top-1 classification logits from our best classification models (i.e., EfficientNet B5 and B6) provides another boost.
In Sec 2.5, we obtain all top-7 indexed images with their scores adjusted, and we simply aggregate them by landmark id. In this section, we additionally add the pair (top-1 classification landmark id, top-1 classification logit) into the aggregation step. The classification logit used is just the raw top-1 logit, after averaging the 200k-class prediction logits across all classification models.
We cannot penalize the classification pair, since it does not come from any indexed image like the top-7 matches and therefore has no distractor score. This turns out not to be a problem: we found that the top-1 classification score naturally acts as a penalty for non-landmarks.
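The aggregation step with the extra classification pair can be sketched as follows (a hypothetical helper illustrating the voting logic only):

```python
from collections import defaultdict

def aggregate_with_top1(adjusted_candidates, top1_label, top1_logit):
    """Sum the adjusted (label, score) pairs from the top-7 retrieved
    images by landmark id, then add the top-1 classification pair --
    which has no distractor penalty -- into the same pool."""
    scores = defaultdict(float)
    for label, score in adjusted_candidates:
        scores[label] += score
    scores[top1_label] += top1_logit      # extra classification "vote"
    return max(scores.items(), key=lambda kv: kv[1])
```

The classification vote can flip the final prediction when retrieval evidence is split across labels, which is exactly the regime where it helps.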
For blending our various models (see Sec 2.3), we first l2-normalize each model's embedding separately, concatenate them, and apply another l2-normalization to conduct ensembled retrieval. Next, we employ the ranking routine elaborated in Secs 2.4-2.6; the ensemble process simply runs the whole procedure from start to finish on the larger concatenated embedding vectors.
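The normalize-concatenate-normalize blending can be sketched as below (an illustrative helper; embedding sizes in the test are toy values, not the 512-d model outputs):

```python
import numpy as np

def ensemble_embeddings(per_model_embs):
    """l2-normalize each model's embedding, concatenate, and l2-normalize
    again; the result is the single embedding used for ensembled retrieval."""
    normed = [e / np.linalg.norm(e) for e in per_model_embs]
    concat = np.concatenate(normed)
    return concat / np.linalg.norm(concat)
```

The per-model normalization gives every model equal weight in the concatenated space regardless of its raw feature magnitudes, and the final normalization keeps dot products interpretable as cosine similarities.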
In this paper, we presented our solution to the Google Landmark Recognition 2021 competition. We used features and classification logits extracted from several different models (CNN-, Transformer-, and hybrid-based), optimized with an ArcFace loss, and presented an efficient re-ranking pipeline: retrieval, classification logit adjustment, distractor score penalization, and top-1 classification aggregation, to generate more accurate landmark recognition results. After aggregating several models with different architectures, we reached our final scores on the public and private leaderboards, respectively.
-  Google Landmark Recognition 2021 Competition. https://www.kaggle.com/c/landmark-recognition-2021/overview/iccv-2021.
-  E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le. Randaugment: Practical automated data augmentation with a reduced search space. pages 702–703, 2020.
-  J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
-  T. DeVries and G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
-  Y. Gu, C. Li, and J. Xie. Attention-aware generalized mean pooling for image retrieval. arXiv preprint arXiv:1811.00202, 2018.
-  Q. Ha, B. Liu, F. Liu, and P. Liao. Google landmark recognition 2020 competition third place solution. arXiv preprint arXiv:2010.05350, 2020.
-  C. Henkel and P. Singer. Supporting large-scale image recognition with out-of-domain samples. CoRR, abs/2010.01650, 2020.
-  Z. Kim, A. Araujo, B. Cao, C. Askew, J. Sim, M. Green, N. F. Yilla, and T. Weyand. Towards A fairer landmark recognition dataset. CoRR, abs/2108.08874, 2021.
-  Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
-  F. Perronnin, Y. Liu, and J.-M. Renders. A family of contextual measures of similarity between distributions with application to image retrieval. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2358–2365. IEEE, 2009.
-  M. Tan and Q. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
-  M. Tan and Q. V. Le. Efficientnetv2: Smaller models and faster training. arXiv preprint arXiv:2104.00298, 2021.
-  T. Weyand, A. Araujo, B. Cao, and J. Sim. Google landmarks dataset v2: A large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2575–2584, 2020.
-  H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang. Cvt: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808, 2021.