
CORE-Text: Improving Scene Text Detection with Contrastive Relational Reasoning

Localizing text instances in natural scenes is regarded as a fundamental challenge in computer vision. Nevertheless, owing to the extremely varied aspect ratios and scales of text instances in real scenes, most conventional text detectors suffer from the sub-text problem, in which only fragments of a text instance (i.e., sub-texts) are localized. In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, the COntrastive RElation (CORE) module, to mitigate that issue. CORE first leverages a vanilla relation block to model the relations among all text proposals (sub-texts of multiple text instances) and further enhances relational reasoning via instance-level sub-text discrimination in a contrastive manner. In this way, CORE naturally learns instance-aware representations of text proposals and thus facilitates scene text detection. We integrate the CORE module into the two-stage text detector Mask R-CNN and devise our text detector CORE-Text. Extensive experiments on four benchmarks demonstrate the superiority of CORE-Text. Code is available: <>.





1 Introduction

Scene text detection, i.e., localizing text instances in natural scene images, is a profound challenge in both the computer vision and deep learning communities. Practical automatic scene text detection systems have great potential impact on numerous applications, e.g., document analysis, industrial automation, and autonomous driving. The recent development of deep learning techniques for generic object detection [1, 2] and segmentation [3, 4] has successfully pushed the limits of scene text detection, leading to a surge of deep text detectors [5, 6, 7, 8, 9] that follow the typical region proposal-based detection paradigm. Nevertheless, considering that the aspect ratios and scales of text instances often vary more than those of generic objects, directly applying generic object detectors will inevitably result in broken detections of text instances [6, 7]. Taking the text detection results in Figure 1 (a) as an example, the generic object detector (Mask R-CNN [3]) fails to accurately localize the whole text instances (i.e., full-text) and only detects them as multiple text fragments (i.e., sub-text), especially when the aspect ratios of text instances are large. These facts motivate the exploration of contextual information among sub-texts to alleviate this sub-text problem.

(a) Results of generic object detector (Mask R-CNN)

(b) Results of our CORE-Text

Figure 1: Scene text detection on three images by (a) directly applying generic object detector (Mask R-CNN) and (b) utilizing CORE-Text in this work. Red box: sub-text detection that merely detects sub-regions of full-text instance; Green box: full-text detection.

In the literature, a series of innovations have been proposed to improve scene text detection by exploiting contextual information among sub-texts to associate the sub-texts belonging to the same text instance, e.g., segment linking [6] or link merging over local graphs [8]. Nevertheless, most of them solve the sub-text problem in a two-phase manner, i.e., first localizing sub-texts of multiple text instances in an image and then grouping the sub-texts of the same instance. This separation may break the integration between localizing and associating sub-texts of the same text instance, resulting in a sub-optimal solution. Moreover, though these methods have demonstrated gains in detection accuracy by addressing the sub-text problem, it is still unclear to what extent the sub-text problem affects the overall performance of a text detector on the scene text detection task.

In this work, we focus on solving the sub-text problem in scene text detection. First, we quantitatively analyze the frequency of the sub-text problem for a generic object detector (Mask R-CNN) on a benchmark (e.g., ICDAR 2017 MLT) and provide the performance upper-bound obtained by fully eliminating the negative effect of sub-texts. Surprisingly, we find that the sub-text problem accounts for a large proportion of bad cases in the existing benchmark, and a significant performance boost in the Hmean metric is attained when the sub-text problem is fully addressed.

Moreover, by consolidating the idea of unifying both localization and association of text proposals (containing sub-texts), we present a novel COntrastive RElation (CORE) module to mitigate the sub-text problem in the scene text detection task. Technically, the multi-scale text proposals, i.e., a group of sub-texts and full-texts derived from multiple text instances in an input image, are first produced via Region Proposal Networks (RPN). Next, we leverage a vanilla relation block [10] to perform relational reasoning among all text proposals. The relational reasoning is further guided with an instance-wise contrastive objective, which pursues instance-level sub-text discrimination in a contrastive manner. This design learns instance-aware representations of text proposals through joint relational reasoning and text instance identification, and thus facilitates the localization and classification of text proposals. Our CORE module can be regarded as a general text proposal refiner and is readily pluggable into any two-stage text detector. We name the whole text detector architecture (Mask R-CNN here) with the CORE module CORE-Text, and empirically demonstrate that CORE-Text better mitigates the sub-text problem and obtains encouraging detection results, as illustrated in Figure 1 (b).

2 Related Work

Scene Text Detection. Recent progress on this task has evolved through two paradigms: segmentation-based [11, 12] and proposal-based methods [5, 6, 7, 8, 9]. The primary challenge of the former is to distinguish text instances from pixel perspective. The latter may suffer from various aspect ratios and scales of scene texts, and result in the sub-text problem. Several existing works [5, 6, 7, 8] mitigate the sub-text problem in a two-phase way (i.e. localizing and grouping sub-texts), which may break the interaction between localizing and associating sub-text, and lead to a sub-optimal solution. In contrast, we unify both localization and association of text proposals, without any additional grouping post-process.

Relational Reasoning. There is strong evidence that relational reasoning supports various tasks, e.g., object detection [10, 13, 14, 15], feature learning [16], and vision-language [17, 18]. For example, [15] plugs the non-local operation into a conventional CNN to enable pixel-level relational interaction within feature maps, and [10] presents an object relation module to model the relations of regions via the interaction among appearance features and geometry.

The CORE module in our work is also a type of object-to-object relational reasoning. Unlike [10], which is developed for generic object detection, ours goes beyond the self-supervised exploration of contextual information among proposals and additionally guides relational reasoning with an instance-wise contrastive objective to mitigate the sub-text problem. This design naturally unifies both localization and association of text proposals (consisting of sub-texts and full-texts), and thus facilitates scene text detection.

3 Sub-text Problem

The sub-text problem refers to the broken detection results in the scene text detection task, where a text instance is detected as multiple text fragments (sub-texts). Though the unsatisfactory results caused by sub-texts have been mentioned in several existing works [6, 7], how the sub-texts affect the overall performance of a text detector is not yet fully understood in the literature. In this section, we look into this problem and provide a detailed quantitative analysis of the sub-text problem, including the concrete definition of sub-text, the frequency of the sub-text problem in bad cases, and the performance upper-bound if the sub-text problem is solved. (The performances are reported on the ICDAR 2017 MLT val set, and the detailed experimental settings are given in Section 5.3.)

(a) Sub-text & full-text definition (b) Performance upper-bound
Figure 2: Quantitative analysis of sub-text problem: (a) the definition of sub-text and full-text; (b) the performance upper-bound when sub-text problem is fully addressed.

Sub-text Definition. To formalize this problem, we first present the concrete definition of sub-text and full-text conditioned on the relative position between the detection proposal and the ground truth (Figure 2(a)). Note that we leverage the commonly adopted Intersection over Union (IoU) and the Intersection over Foreground (IoF), i.e., the intersection area divided by the area of the detection proposal, to measure the relative position. Specifically, we define the detection proposal as sub-text if its IoU is below a given threshold (i.e., the sub-text only covers fragments of the ground truth) and its IoF is above a given threshold (i.e., most parts of the sub-text are covered by the ground truth) simultaneously. The detection proposal is defined as full-text if its IoU reaches the IoU threshold. Here we set the threshold hyper-parameter of the sub-text definition as 0.7, and will elaborate its impact in Section 5.3.
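The definition above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: boxes are simplified to axis-aligned rectangles (the benchmarks use quadrangles and polygons), the function names are hypothetical, and the assignment of thresholds (`iou_thr=0.5` for the evaluation IoU, `iof_thr=0.7` for the IoF) is an assumption, since the extracted text does not preserve which symbol takes which value.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned (assumption)


def area(box: Box) -> float:
    """Area of a box; degenerate boxes have zero area."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def iou(proposal: Box, gt: Box) -> float:
    """Intersection over Union."""
    inter = intersection(proposal, gt)
    union = area(proposal) + area(gt) - inter
    return inter / union if union > 0 else 0.0


def iof(proposal: Box, gt: Box) -> float:
    """Intersection over Foreground: overlap divided by the proposal's own area."""
    fg = area(proposal)
    return intersection(proposal, gt) / fg if fg > 0 else 0.0


def classify(proposal: Box, gt: Box, iou_thr: float = 0.5, iof_thr: float = 0.7) -> str:
    """Label a proposal relative to one ground truth (threshold values are assumptions)."""
    if iou(proposal, gt) >= iou_thr:
        return "full-text"
    if iof(proposal, gt) > iof_thr:
        return "sub-text"  # low IoU, but almost entirely inside the ground truth
    return "other"


gt_box = (0.0, 0.0, 100.0, 10.0)
fragment = (0.0, 0.0, 40.0, 10.0)  # covers only the left 40% of the text line
label = classify(fragment, gt_box)
```

Here the fragment has an IoU of 0.4 and an IoF of 1.0 against the ground truth, so it is labeled as a sub-text under the assumed thresholds.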

Figure 3: An overview of CORE-Text framework that integrates Mask R-CNN with our CORE module. The input image is first fed into the Feature Pyramid Network (FPN) backbone, and a set of text proposals are produced by Region Proposal Networks (RPN). Then, CORE module performs relational reasoning among all text proposal features. The process of relational reasoning is further guided with instance-wise contrastive objective, that encourages instance-level sub-text discrimination. In this way, CORE module learns instance-aware proposal features, which are leveraged to facilitate the classification and regression of text proposals in box branch of Mask R-CNN. For scene text with arbitrary shape, the mask branch in Mask R-CNN is additionally utilized to achieve the final segmentation results.

Frequency. To measure the frequency of the sub-text problem, we collect all the bad cases on the ICDAR 2017 MLT val set. Under a moderate evaluation setting, the proportion of the sub-text problem in bad cases is 24.2%. In addition, we find that a stricter evaluation setting results in more sub-texts, e.g., the ratio of sub-texts increases to 49.1% under a stricter setting.

Performance Upper-bound. Since the sub-text problem accounts for a large proportion of bad cases in the existing benchmark, here we investigate the performance upper-bound by fully eliminating sub-texts. Specifically, for each detected sub-text, we first measure its IoUs against all ground truths. Next, the detected sub-text is replaced by the ground truth with the largest IoU for evaluation, and the obtained performance is treated as the upper-bound. As shown in Figure 2(b), after fully eliminating sub-texts, the Hmean of our base model (Mask R-CNN) increases by 6% under the moderate evaluation setting. Furthermore, under a higher IoU threshold, a larger performance improvement is attained. The results show that there is still much room for improvement in scene text detection, especially for addressing the sub-text problem under stricter evaluation.
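The replacement procedure used for the upper-bound can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: boxes are assumed to be axis-aligned rectangles, and the helper names and the `is_sub_text` flags (which would come from the sub-text definition) are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned (assumption)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def replace_sub_texts(
    detections: Sequence[Box],
    is_sub_text: Sequence[bool],
    ground_truths: Sequence[Box],
) -> List[Box]:
    """Swap every detected sub-text for the ground truth it overlaps most."""
    refined: List[Box] = []
    for det, sub in zip(detections, is_sub_text):
        if sub and ground_truths:
            best_gt = max(ground_truths, key=lambda gt: iou(det, gt))
            refined.append(best_gt)
        else:
            refined.append(det)  # non-sub-text detections are left untouched
    return refined


gts = [(0.0, 0.0, 100.0, 10.0), (0.0, 20.0, 50.0, 30.0)]
dets = [(0.0, 0.0, 40.0, 10.0), (0.0, 20.0, 50.0, 30.0)]
refined = replace_sub_texts(dets, [True, False], gts)
```

Evaluating `refined` instead of `dets` with the usual protocol then gives the upper-bound score, since every sub-text error has been corrected by construction.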

4 Approach

We design a universal module, named COntrastive RElation (CORE), that mitigates the sub-text problem by jointly performing relational reasoning and instance-level sub-text discrimination. The CORE module can be further integrated into any two-stage text detector (e.g., Mask R-CNN here) to improve scene text detection. We name the whole text detector as CORE-Text, and Figure 3 depicts the detailed architecture.

4.1 Vanilla Relation Block

We first provide a brief review of the vanilla relation block [10], which is commonly adopted in generic object detection for relational reasoning among region proposals. Formally, given the input proposals $\{(f_A^n, f_G^n)\}_{n=1}^{N}$ ($f_A^n$: appearance feature; $f_G^n$: geometric feature), the vanilla relation block achieves the relation-augmented representation of each proposal by refining its appearance feature with $N_r$ learnt relation features as

$$f_A^n \leftarrow f_A^n + \mathrm{Concat}\big[f_R^1(n), \ldots, f_R^{N_r}(n)\big].$$

Here the $k$-th relation feature of the $n$-th proposal is calculated as the weighted sum of appearance features from other proposals:

$$f_R^k(n) = \sum_{m} \omega_k^{mn} \cdot \big(W_V^k \cdot f_A^m\big),$$

where the relation weight $\omega_k^{mn}$ represents the pairwise relation between two proposals based on their appearance and geometric features, and $W_V^k$ indicates the transformation matrix. Accordingly, the vanilla relation block strengthens proposal representations via relational reasoning that exploits the contextual information among region proposals.
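A simplified NumPy sketch of this block is given below for intuition. It follows the general recipe of Relation Networks [10] (scaled dot-product appearance weights, modulated by a geometry weight and normalized over proposals), but several parts are assumptions made for brevity: the geometry weights are passed in as a precomputed non-negative matrix `geo_w` instead of being learnt from box coordinates, the projection matrices are random, and all function and variable names are hypothetical.

```python
import numpy as np


def relation_feature(f_a, geo_w, W_q, W_k, W_v):
    """One relation feature per proposal.

    f_a:   (N, d) appearance features.
    geo_w: (N, N) non-negative geometry weights; geo_w[n, m] relates proposal m to n.
    W_q, W_k, W_v: (d, d_r) projection matrices.
    """
    q = f_a @ W_q
    k = f_a @ W_k
    v = f_a @ W_v
    logits = q @ k.T / np.sqrt(q.shape[1])               # appearance affinity, (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = geo_w * np.exp(logits)                           # modulate by geometry
    w = w / (w.sum(axis=1, keepdims=True) + 1e-12)       # normalize over proposals m
    return w @ v                                         # weighted sum of transformed features


def relation_block(f_a, geo_w, heads):
    """Concatenate all relation features and add them to the input (shortcut)."""
    rel = [relation_feature(f_a, geo_w, *h) for h in heads]
    return f_a + np.concatenate(rel, axis=1)


rng = np.random.default_rng(0)
N, d, n_rel, d_rel = 5, 32, 4, 8   # toy sizes; n_rel * d_rel must equal d
feats = rng.normal(size=(N, d))
geo = np.ones((N, N))              # uniform geometry weights (assumption)
heads = [tuple(rng.normal(size=(d, d_rel)) * 0.1 for _ in range(3)) for _ in range(n_rel)]
out = relation_block(feats, geo, heads)
```

The concatenated relation features must have the same width as the appearance feature so that the shortcut addition is well defined, which is why the toy sizes satisfy `n_rel * d_rel == d`.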

4.2 Contrastive Relation Module

The vanilla relation block constructs fully-connected relations among all region proposals and performs relational reasoning in a self-attention manner. This leaves the contextual information at instance level not fully explored, given that the text proposals (both sub-text and full-text) belonging to the same text instance should share inherently similar semantics. Therefore, we devise the COntrastive RElation (CORE) module by remolding the vanilla relation block with an additional instance-wise contrastive objective. The idea is to guide the relational reasoning with instance-level sub-text discrimination in a contrastive manner, and thus learn instance-aware representations of text proposals to mitigate the sub-text problem. Technically, given the text proposals produced by RPN, the CORE module first utilizes the vanilla relation block to trigger the relational reasoning that learns relation features of text proposals. The learning of relation features is further guided with the instance-wise contrastive objective to enrich relation features with instance-level information. After that, the learnt instance-aware relation features are aggregated with the primary input proposal features via a shortcut connection, leading to the instance-aware features of all text proposals. The instance-aware proposal features are finally fed into the classification and regression modules for text instance localization.

Instance-wise Contrastive Objective. Traditional contrastive learning [19, 20, 21] aims to learn a feature embedding by attracting positives (semantically similar samples) while repelling negatives (semantically dissimilar samples). The common contrastive objective is InfoNCE [22], which frames contrastive learning as a classification problem:

$$L_{q} = -\log \frac{\exp(q \cdot k^{+} / \tau)}{\exp(q \cdot k^{+} / \tau) + \sum_{i=1}^{K} \exp(q \cdot k_i^{-} / \tau)},$$

where $(q, k^{+})$ is a positive pair, $(q, k_i^{-})$ is a negative pair, $K$ is the number of negative samples, and $\tau$ is the temperature parameter. Taking inspiration from contrastive learning, we derive a particular form of loss, i.e., the instance-wise contrastive objective, to penalize the incompatibility of each text proposal pair. That is, we maximize the agreement of different proposals of the same text instance, while minimizing the agreement of proposals derived from different instances. Formally, conditioned on all the input text proposals from RPN, we first take the relation features of ground truth proposals as queries $q$. For each query $q$, the corresponding positive samples $k^{+}$ are defined as the relation features of both sub-text and full-text proposals belonging to the same text instance as $q$. In contrast, all the relation features of sub-text, full-text, and ground truth proposals belonging to different text instances are taken as the negative samples $k^{-}$. Note that we additionally involve a 2-layer MLP plus ReLU (hidden layer size: 1,024) to transform relation features into a 128-dimensional embedding space in contrastive learning. These output vectors are normalized via an L2-norm layer. The instance-wise contrastive loss $L_{cl}$ is then calculated in the InfoNCE form over each query and its positive and negative samples.
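As a rough illustration of how such an instance-wise objective could be computed, the sketch below applies an InfoNCE-style loss to each (query, positive) pair, with the negatives taken from proposals of other text instances. It is a hedged sketch, not the paper's exact formula: averaging over all positive pairs is an assumption, the MLP projection head is omitted (the inputs stand in for already projected embeddings), and the function names are hypothetical. The default temperature of 0.2 is the best value reported in the experimental analysis.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def instance_contrastive_loss(queries, query_ids, keys, key_ids, tau=0.2):
    """InfoNCE-style loss averaged over all (query, positive) pairs.

    queries / query_ids: embeddings of ground-truth proposals and their instance indices.
    keys / key_ids: embeddings of candidate proposals and the instance each belongs to.
    """
    queries = [l2_normalize(q) for q in queries]
    keys = [l2_normalize(k) for k in keys]
    losses = []
    for q, qid in zip(queries, query_ids):
        sims = [math.exp(dot(q, k) / tau) for k in keys]
        # Negatives: every key that belongs to a different text instance.
        neg_sum = sum(s for s, kid in zip(sims, key_ids) if kid != qid)
        for s, kid in zip(sims, key_ids):
            if kid == qid:  # positive: same text instance as the query
                losses.append(-math.log(s / (s + neg_sum)))
    return sum(losses) / len(losses) if losses else 0.0


gt_embeddings = [[1.0, 0.0], [0.0, 1.0]]       # one query per text instance
proposal_embeddings = [[1.0, 0.1], [0.1, 1.0]]
aligned = instance_contrastive_loss(gt_embeddings, [0, 1], proposal_embeddings, [0, 1])
swapped = instance_contrastive_loss(gt_embeddings, [0, 1], proposal_embeddings, [1, 0])
```

When each proposal embedding lies close to the query of its own instance (`aligned`), the loss is small; assigning the proposals to the wrong instances (`swapped`) yields a much larger loss, which is the behavior the objective is meant to penalize.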


4.3 Overall Objective

Recall that our CORE module is a unified text proposal refiner, so it is feasible to plug the CORE module into any two-stage text detector for scene text detection. We next present the overall objective of our CORE-Text by integrating the CORE module into Mask R-CNN [3], which consists of an RPN for producing proposal features, a box branch for the classification and regression tasks, and a mask branch for binary segmentation.

Following the multi-task learning paradigm in Mask R-CNN, the overall objective of our CORE-Text is calculated as the integration of the RPN loss $L_{rpn}$, classification loss $L_{cls}$, regression loss $L_{reg}$, binary segmentation loss $L_{mask}$, and instance-wise contrastive loss $L_{cl}$:

$$L = L_{rpn} + L_{cls} + L_{reg} + L_{mask} + \lambda L_{cl},$$

where the weight $\lambda$ is set as 0.01 in our experiments. Note that we adopt a two-phase strategy for training CORE-Text. In the first phase, we pretrain CORE-Text with the RPN loss and the instance-wise contrastive loss. In the second phase, the whole architecture is finetuned with the overall objective $L$.
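The two training phases can be summarized by which loss terms are active in each. The snippet below is a schematic sketch only: the individual loss values are placeholders that would come from the detector, the function names are hypothetical, and reusing the same weight of 0.01 on the contrastive term during pretraining is an assumption, since the text does not state how the two pretraining losses are weighted.

```python
def pretrain_objective(l_rpn: float, l_cl: float, lam: float = 0.01) -> float:
    """Phase 1: RPN loss plus the instance-wise contrastive loss (weighting assumed)."""
    return l_rpn + lam * l_cl


def overall_objective(
    l_rpn: float,
    l_cls: float,
    l_reg: float,
    l_mask: float,
    l_cl: float,
    lam: float = 0.01,
) -> float:
    """Phase 2: all Mask R-CNN losses plus the weighted contrastive loss."""
    return l_rpn + l_cls + l_reg + l_mask + lam * l_cl


# Placeholder loss values for one training step.
phase1 = pretrain_objective(l_rpn=0.5, l_cl=4.0)
phase2 = overall_objective(l_rpn=0.5, l_cls=0.3, l_reg=0.2, l_mask=0.4, l_cl=4.0)
```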

5 Experiments

We empirically evaluate our CORE-Text on four scene text detection benchmarks: ICDAR 2017 MLT [23], ICDAR 2015 [24], CTW1500 [25], and Total-Text [26].

5.1 Dataset and Experimental Settings

Dataset. ICDAR 2017 MLT is a popular benchmark with multi-oriented, multi-scripting and multi-lingual scene texts, containing 7,200 train images, 1,800 val images, and 9,000 test images with word-level annotations. ICDAR 2015 is another multi-oriented scene text detection benchmark that focuses on English texts, and consists of 1,000 train images and 500 test images with annotations labeled as word-level quadrangles. CTW1500 contains curved texts in natural scenes, and includes 1,000 train images and 500 test images with text-line level annotations. Total-Text consists of 1,255 train images and 300 test images with horizontal, multi-oriented and curved texts. The text instances are annotated at word level.

| Method | ICDAR 2017 MLT (H / P / R) | ICDAR 2015 (H / P / R) | CTW1500 (H / P / R) | Total-Text (H / P / R) |
|---|---|---|---|---|
| CTPN [5] | - / - / - | 61.0 / 74.0 / 52.0 | - / - / - | - / - / - |
| SegLink [6] | - / - / - | 75.0 / 73.1 / 76.8 | - / - / - | - / - / - |
| Xue et al. [7] | 66.6 / 73.9 / 60.6 | - / - / - | - / - / - | - / - / - |
| DRRG [8] | 67.3 / 75.0 / 61.0 | 86.6 / 88.5 / 84.7 | 84.5 / 85.9 / 83.0 | 85.7 / 86.5 / 84.9 |
| PSENet [11] | 70.8 / 73.7 / 68.2 | 85.7 / 86.9 / 84.5 | 82.2 / 84.8 / 79.7 | 80.9 / 84.0 / 78.0 |
| DB [12] | 74.7 / 83.1 / 67.9 | 87.3 / 91.8 / 83.2 | 83.4 / 86.9 / 80.2 | 84.7 / 87.1 / 82.5 |
| PMTD [9] | 78.5 / 85.2 / 72.8 | 89.3 / 91.3 / 87.4 | - / - / - | - / - / - |
| Base | 77.2 / 82.7 / 72.5 | 88.2 / 90.1 / 86.4 | 84.9 / 86.3 / 83.6 | 85.3 / 87.4 / 83.3 |
| CORE-Text | 78.7 / 85.3 / 73.0 | 89.3 / 91.1 / 87.5 | 85.7 / 87.8 / 83.7 | 86.3 / 87.7 / 85.0 |

Table 1: Performance comparisons on four standard benchmark test sets. H, P, and R are short for Hmean, Precision, and Recall, respectively.

Network Setups. We adopt the ImageNet [27] pretrained ResNet-50 [28] with a 5-level Feature Pyramid Network [1] as the backbone. For the prior anchor setting, we set the anchor scale and aspect ratios to 4.82 and {0.57, 1.10, 1.82, 2.81, 5.54} by running k-means clustering on the train set bounding boxes. Two stacked CORE modules are utilized to generate the 1,024-d final features for proposal classification and bounding box regression in the box head. Following the default setting of Relation Networks [10], we set the number of relation features as 16, and the dimension of each relation feature is 64. The mask head contains a four-layer FCN to produce the feature map for instance segmentation.

Training Details. Our model is trained using SGD with 0.9 momentum and 0.0001 weight decay. The batch size is 16. To avoid overfitting, our data augmentation contains: 1) random horizontal flipping with a probability of 0.5; 2) random cropping followed by resizing to a fixed size; 3) random rotation within a fixed angle range.

Inference Details. At inference, we obtain 1,000 proposals from the RPN for each testing image. Next, we run the two stacked CORE modules and the box branch on these proposals, followed by Non-Maximum Suppression (NMS). The mask branch is then applied to the detection boxes, aiming to localize scene texts with arbitrary orientations and shapes. Finally, we adopt a mask-level NMS to further remove duplicates.
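For readers unfamiliar with the box-level NMS step, a minimal greedy version is sketched below. It is a generic illustration, not the code used in the experiments: boxes are assumed to be axis-aligned, and the IoU threshold of 0.5 is a placeholder assumption because the value used in the paper is not preserved in this text. The mask-level NMS would follow the same logic with mask IoU in place of box IoU.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), axis-aligned (assumption)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float], iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep  # indices of surviving boxes, highest score first


boxes = [(0.0, 0.0, 100.0, 10.0), (2.0, 0.0, 100.0, 10.0), (0.0, 50.0, 100.0, 60.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

In this toy case the second box overlaps the first with an IoU of 0.98 and is suppressed, while the third box on a different text line is kept.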

5.2 Comparisons with State-of-the-Art

Table 1 summarizes the quantitative results of our CORE-Text on four benchmarks. We compare CORE-Text with several existing state-of-the-art scene text detection techniques. It is worth noting that we additionally include a degraded version of our CORE-Text (named Base), which is implemented as the basic Mask R-CNN without the CORE module. Overall, the results across different datasets consistently show that our CORE-Text performs better than the other text detectors on most metrics. This highlights the advantage of performing relational reasoning and instance-level sub-text discrimination for scene text detection.

| Method | Hmean | Precision | Recall | Sub-text number |
|---|---|---|---|---|
| Base | 80.0 | 82.7 | 77.4 | 1190 |
| Base + VRM | 81.1 | 85.2 | 77.4 | 923 |
| Base + CORE | 82.1 | 87.1 | 77.7 | 754 |

Table 2: Ablation study of CORE module on ICDAR 2017 MLT val set. VRM: Vanilla Relation Module.
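The Hmean column is the harmonic mean of Precision and Recall, so it can be recomputed from the other two columns. The short check below, with a hypothetical helper name, reproduces the Hmean values of Table 2 from the reported Precision and Recall after rounding to one decimal.

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    total = precision + recall
    return 2.0 * precision * recall / total if total > 0 else 0.0


# (Precision, Recall) pairs taken from Table 2.
rows = {
    "Base": (82.7, 77.4),
    "Base + VRM": (85.2, 77.4),
    "Base + CORE": (87.1, 77.7),
}
recomputed = {name: round(hmean(p, r), 1) for name, (p, r) in rows.items()}
```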


ICDAR 2017 MLT. In the first training phase, we pretrain our CORE modules with the instance-wise contrastive objective for 40 epochs, with the initial learning rate of 0.02 annealed by the cosine decay rule. Following the commonly adopted setting in [9], we train CORE-Text on both the ICDAR 2017 MLT train and val sets for 160 epochs in the second training phase. The learning rate is set to 0.04 and is decreased to one-tenth at the 80th and 128th epochs, respectively. During inference, we adopt the single-scale testing strategy and resize the long side of each image to 1,920. Our CORE-Text achieves 78.7% Hmean, an absolute improvement of 1.5% over the Base model.
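The two learning-rate schedules described here can be written down compactly. The sketch below is an interpretation under stated assumptions: the cosine schedule is taken to decay from the initial rate to zero without warmup, the step schedule multiplies the rate by 0.1 at each milestone, both are evaluated per epoch, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_lr(epoch: int, total_epochs: int = 40, base_lr: float = 0.02) -> float:
    """Phase 1: cosine annealing from base_lr down to zero (no warmup assumed)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


def step_lr(
    epoch: int,
    base_lr: float = 0.04,
    milestones: Sequence[int] = (80, 128),
    gamma: float = 0.1,
) -> float:
    """Phase 2: multiply the rate by gamma at each milestone epoch that has been reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


phase1_rates = [cosine_lr(e) for e in (0, 20, 40)]
phase2_rates = [step_lr(e) for e in (0, 79, 80, 128, 159)]
```

Under these assumptions the phase-1 rate falls from 0.02 to 0.01 at the midpoint and reaches zero at epoch 40, while the phase-2 rate is 0.04, then 0.004 from epoch 80, then 0.0004 from epoch 128.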

ICDAR 2015. As in [9], we initialize CORE-Text with the ICDAR 2017 MLT pretrained model and further finetune the model for 40 epochs on the ICDAR 2015 train set. The learning rate is set to 0.002 and is decayed to one-tenth at the 20th epoch. At inference, the long side of images is resized to 1,920. Finally, CORE-Text achieves 1.1% higher Hmean than the Base model.

CTW1500 & Total-Text. To fully verify the generalizability of our CORE-Text, we further evaluate CORE-Text on two challenging benchmarks with curved and multi-oriented texts (i.e., CTW1500 and Total-Text). As in the training settings on ICDAR 2015, we finetune CORE-Text with 40 epochs on each benchmark. The long side of images is resized to 640 and 1,280 on CTW1500 and Total-Text, respectively. Similar to the observations on ICDAR 2015, our CORE-Text consistently outperforms the Base model by 0.8% and 1.0% in Hmean on CTW1500 and Total-Text.

5.3 Experimental Analysis

Next, we conduct an ablation study to verify the effectiveness of our CORE module, and further analyze the impact of several hyper-parameters in CORE-Text. Note that all discussions here are based on the ICDAR 2017 MLT train and val sets. Specifically, we train CORE-Text on the ICDAR 2017 MLT train set for 40 epochs, and the initial learning rate is set as 0.04 (decreased by a factor of 10 at the 20th and 32nd epochs, respectively). The final results are reported over the 1,800 val images.

Ablation study. To examine the impact of each design in the CORE module, we conduct an ablation study by comparing different variants of CORE-Text in Table 2. We start from the Base model, which is a degraded version of CORE-Text without the CORE module. Next, we extend the Base model by involving the Vanilla Relation Module (VRM) to trigger relational reasoning among proposals in a self-supervised manner, which achieves better performance and meanwhile reduces the number of sub-text bad cases. The results demonstrate the effectiveness of relational reasoning in VRM. In addition, the integration of both relational reasoning and instance-level sub-text discrimination, i.e., our CORE module, obtains the highest performance in terms of all three metrics and further reduces the sub-text number. The performance gains validate the merit of guiding relational reasoning with the instance-wise contrastive objective for scene text detection.

Impact of the threshold in sub-text definition. Table 3 shows the results when varying the threshold in the range of [0.5, 0.9]. The Hmean scores only change between 81.7% and 82.1%, which in practice eases the selection of the optimal threshold in the sub-text definition.

Impact of temperature $\tau$. The temperature $\tau$ controls the flatness of the softmax function in the contrastive loss. As shown in Table 4, the best performance is attained when $\tau$ is set as 0.2. Note that the training loss fails to converge when $\tau = 0.01$.

Impact of instance-wise contrastive loss weight $\lambda$. Figure 4 depicts the performance curve of Hmean when $\lambda$ varies within [0, 1]. In the extreme case of $\lambda = 0$, no instance-level sub-text discrimination is performed and the CORE module degenerates to the vanilla relation module. The best Hmean score is achieved at a non-zero value of $\lambda$. This again demonstrates that it is reasonable to exploit both relational reasoning and instance-level sub-text discrimination for boosting scene text detection.

| Threshold | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|
| Hmean (%) | 82.0 | 81.7 | 82.1 | 81.8 | 82.0 |

Table 3: Impact of the threshold hyper-parameter in sub-text definition on ICDAR 2017 MLT val set.

| Temperature | 0.01 | 0.05 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
|---|---|---|---|---|---|---|---|
| Hmean (%) | - | 80.7 | 82.0 | 82.1 | 81.7 | 81.4 | 81.6 |

Table 4: Impact of temperature in contrastive loss on ICDAR 2017 MLT val set.

Figure 4: Impact of instance-wise contrastive loss weight.

6 Conclusions

In this paper, we investigate the sub-text problem in the scene text detection task and present a novel COntrastive RElation (CORE) module to alleviate this issue. Particularly, unlike existing methods that tackle the sub-text problem in a two-phase fashion, our CORE module jointly localizes and associates text proposals (sub-texts and full-texts from multiple instances) to boost scene text detection. To materialize our idea, we remold the vanilla relation block by additionally involving an instance-wise contrastive objective to guide the process of relational reasoning among text proposals. This design naturally enables joint learning of relational reasoning and instance-level sub-text discrimination in a contrastive manner, leading to instance-aware representations of text proposals. Furthermore, we devise a novel text detector (i.e., CORE-Text) that integrates the CORE module into the generic object detector (Mask R-CNN). Extensive experiments conducted on four benchmarks demonstrate the efficacy of CORE-Text.

References


  • [1] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie, “Feature pyramid networks for object detection,” in CVPR, 2017.
  • [2] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” IEEE Trans. PAMI, 2017.
  • [3] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick, “Mask R-CNN,” in ICCV, 2017.
  • [4] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR, 2015.
  • [5] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao, “Detecting text in natural image with connectionist text proposal network,” in ECCV, 2016.
  • [6] Baoguang Shi, Xiang Bai, and Serge J. Belongie, “Detecting oriented text in natural images by linking segments,” in CVPR, 2017.
  • [7] Chuhui Xue, Shijian Lu, and Fangneng Zhan, “Accurate scene text detection through border semantics awareness and bootstrapping,” in ECCV, 2018.
  • [8] Shi-Xue Zhang, Xiaobin Zhu, Jie-Bo Hou, Chang Liu, Chun Yang, et al., “Deep relational reasoning graph network for arbitrary shape text detection,” in CVPR, 2020.
  • [9] Jingchao Liu, Xuebo Liu, Jie Sheng, Ding Liang, Xin Li, and Qingjie Liu, “Pyramid mask text detector,” in CVPR, 2019.
  • [10] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei, “Relation networks for object detection,” in CVPR, 2018.
  • [11] Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, and Shuai Shao, “Shape robust text detection with progressive scale expansion network,” in CVPR, 2019.
  • [12] Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, and Xiang Bai, “Real-time scene text detection with differentiable binarization,” in AAAI, 2020.
  • [13] Qi Cai, Yingwei Pan, et al., “Exploring object relation in mean teacher for cross-domain detection,” in CVPR, 2019.
  • [14] Jiajun Deng, Yingwei Pan, Ting Yao, et al., “Relation distillation networks for video object detection,” in ICCV, 2019.
  • [15] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He, “Non-local neural networks,” in CVPR, 2018.
  • [16] Yingwei Pan, Yehao Li, Ting Yao, Tao Mei, et al., “Learning deep intrinsic video representation by exploring temporal coherence and graph structure,” in IJCAI, 2016.
  • [17] Yingwei Pan, Ting Yao, Yehao Li, and Tao Mei, “X-linear attention networks for image captioning,” in CVPR, 2020.
  • [18] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei, “Exploring visual relationship for image captioning,” in ECCV, 2018.
  • [19] Raia Hadsell, Sumit Chopra, et al., “Dimensionality reduction by learning an invariant mapping,” in CVPR, 2006.
  • [20] Qi Cai, Yu Wang, Yingwei Pan, et al., “Joint contrastive learning with infinite possibilities,” in NeurIPS, 2020.
  • [21] Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, and Tao Mei, “Seco: Exploring sequence supervision for unsupervised representation learning,” in AAAI, 2021.
  • [22] Aäron van den Oord, Yazhe Li, and Oriol Vinyals, “Representation learning with contrastive predictive coding,” arXiv:1807.03748, 2018.
  • [23] Nibal Nayef, Fei Yin, et al., “ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification - RRC-MLT,” in ICDAR, 2017.
  • [24] Dimosthenis Karatzas, Lluis Gomez-Bigorda, et al., “ICDAR 2015 competition on robust reading,” in ICDAR, 2015.
  • [25] Yuliang Liu, Lianwen Jin, Shuaitao Zhang, and Sheng Zhang, “Detecting curve text in the wild: New dataset and new solution,” arXiv:1712.02170, 2017.
  • [26] Chee Kheng Chng and Chee Seng Chan, “Total-text: A comprehensive dataset for scene text detection and recognition,” in ICDAR, 2017.
  • [27] Olga Russakovsky, Jia Deng, et al., “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vis., 2015.
  • [28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.