Textual Visual Semantic Dataset for Text Spotting
Text Spotting in the wild consists of detecting and recognizing text appearing in images (e.g. signboards, traffic signals or brands in clothing or objects). This is a challenging problem due to the complexity of the context where texts appear (uneven backgrounds, shading, occlusions, perspective distortions, etc.). Only a few approaches try to exploit the relation between text and its surrounding environment to better recognize text in the scene. In this paper, we propose a visual context dataset for Text Spotting in the wild, where the publicly available dataset COCO-text [Veit et al. 2016] has been extended with information about the scene (such as objects and places appearing in the image) to enable researchers to include semantic relations between texts and scene in their Text Spotting systems, and to offer a common framework for such approaches. For each text in an image, we extract three kinds of context information: objects in the scene, image location label and a textual image description (caption). We use state-of-the-art out-of-the-box available tools to extract this additional information. Since this information has textual form, it can be used to leverage text similarity or semantic relation methods into Text Spotting systems, either as a post-processing or in an end-to-end training strategy. Our data is publicly available at https://git.io/JeZTb.
Recognition of scene text in images in the wild is still an open problem in computer vision. Recognizing text in images is difficult because of the many possible lighting conditions, variations in textures, complex backgrounds, font types and perspective distortions. The ability to automatically detect and recognize text in natural images, a.k.a. text spotting or OCR in the wild, is an important challenge for many applications such as visually-impaired assistants or autonomous vehicles. In recent years, the interest of the Computer Vision community in Text Spotting has significantly increased [1, 36, 16, 8, 7, 32]. However, state-of-the-art scene text recognition methods do not leverage object and scene recognition. Therefore, in this work, we introduce a textual dataset of visual semantic context (e.g. object and scene information) for text spotting tasks. Our goal is to fill this gap, bringing vision and language closer together, by understanding the scene text and its relationship with the environmental visual context.
The relation between text and its surrounding environment is very important to understand text in the scene. While there are some publicly available datasets for text spotting, none includes information about visual context in the image. Therefore, we propose a visual context semantic knowledge dataset for the text spotting pipeline, as our aim is to combine natural language processing and computer vision. In particular, we exploit the semantic relatedness between the spotted text and its image context. For example, as shown in Figure 1, the word “dunkin” has a stronger semantic relation with “coffee”, so it is more likely to appear in that visual context than other possible candidates such as “junking” or “unkind”.
|Text Recognition Dataset|
|Label||Description||Dictionary||# bbox||# text|
|IC17-T2||from COCO-text dataset ICDAR 17 Task-2 ||-||46K||-|
|Synth90K||Synthetic dataset with 90K dict ||90K||9M||-|
|SVT||Street View Text ||-||647||-|
|IC13||ICDAR 2013 ||-||1K||-|
|IC17-V||Image+Textual dataset from IC17 Task-3 (ours)||-||10K||25K|
|COCO-Text-V||Image+Textual dataset from COCO-text (ours)||-||16K||60K|
|COCO-Pairs||Only Textual dataset from COCO-text (ours)||-||-||158K|
Each sample has a bounding box, its full image and the textual visual context (object, scene and caption).
Building on [34, 33], in this paper we describe in depth the construction of the visual context dataset. This dataset is based on COCO-text, which uses the Microsoft COCO dataset and annotates texts appearing in the images. We further extend the dataset using out-of-the-box tools to extract visual context or additional information from images. Our main contribution is this combined visual context dataset, which gives the unrestricted-OCR research community the chance to use semantic relatedness between text and image to improve results. The computer vision community tackles this problem by dividing the task into two sub-models, one for text and the other for objects [46, 18, 29]. Our approach uses existing state-of-the-art visual context generation approaches, and thus the resulting visual context can be used with text spotting either as post-processing (OCR correction) or in end-to-end training.
While there are some publicly available datasets for text spotting, none of them includes visual context information such as objects in the scene, location id or textual image descriptions. In this section we describe several publicly available text spotting datasets.
Table 1 summarizes the number of examples in different datasets. Sizes of real image datasets with annotated texts are in the order of thousands, and they have a very limited vocabulary, which makes them insufficient for deep learning methods. Therefore, a synthetic data generator that requires no human labeling cost was introduced (Synth90K in Table 1). Words are sampled from a 90K-word dictionary and rendered synthetically to generate images with complex backgrounds, fonts, distortions, etc. The dataset contains 9 million cropped word images. All current state-of-the-art text spotting algorithms [36, 16, 8, 7] are trained on this dataset.
Text Spotting shared tasks carried out at ICDAR conferences released several relevant datasets:
ICDAR 2013 (IC13). The ICDAR 2013 dataset consists of two sections for different text spotting subtasks: (1) text localization and (2) text segmentation. Text localization consists of 328 training images and 233 test images. Given its reduced size, the ICDAR 2013 dataset is typically used for evaluation of scene text understanding tasks: localization, segmentation, and recognition.
ICDAR 2017 (IC17). ICDAR 2017 is based on COCO-text and aims at end-to-end text spotting (i.e. detection and recognition). The dataset consists of 43,686 full images with 145,859 text instances for training, and 10,000 images with 27,550 instances for validation.
Street View Text (SVT). This dataset consists of 349 images downloaded from Google Street View. For each image, only one word-level bounding box is provided. This is the first dataset that deals with text images in real scenarios, such as shop signs in a wide range of font styles.
COCO-Text. This dataset is based on Microsoft COCO (Common Objects in Context) and consists of 63,686 images and 173,589 text instances (annotations of the images). The COCO-text dataset differs from the other datasets in three aspects. First, the dataset was not collected with text recognition in mind, so the annotated text instances lie in their natural context. Second, it contains a wide variety of text instances, such as machine-printed and handwritten text. Finally, COCO-text has a much larger scale than other datasets for text detection and recognition.
We use state-of-the-art tools to extract textual information for each image. In particular, for each image we extract: 1) spotted text candidates (text hypotheses), and 2) surrounding visual context information.
To extract the text associated with each image or bounding box, we employ several off-the-shelf pre-trained Text Spotting baselines to generate text hypotheses. All the pre-trained models are trained on a synthetic dataset. We build our text hypotheses dataset for each image as the union of the predictions of all baselines. We next describe these models.
The first baseline is a CNN with fixed-lexicon recognition, able to recognize words in a predefined 90K-word dictionary. Recognition is cast as multi-class classification: each class corresponds to one word in the 90K dictionary. The dictionary is composed of various forms of English words (e.g. nouns, verbs, adjectives, adverbs, etc.). In short, the model classifies each input word image into one of the pre-defined words in the 90K fixed lexicon. Each word $w$ in the dictionary $W$ corresponds to one output neuron, so the final output for a given image $x$ is the word with the highest predicted probability:
$$w^{*} = \arg\max_{w \in W} P(w \mid x)$$
|11, il, j, m, …||railroad||train||a train is on a train track with a train on it|
|lossing, docile, dow, dell, …||bookshop||bookstore||a woman sitting at a table with a laptop|
|29th, 2th, 2011, zit, …||parking||shopping||a man is holding a cell phone while standing|
|happy, hooping, happily, nappy, …||childs||bib||a cake with a bunch of different types of scissors|
|coke, gulp, slurp, fluky,…||plate||pizzeria||a table with a pizza and a fork on it|
|will, wii, xviii, wit,….||remote||room||a close up of a remote control on a table|
Convolutional Recurrent Neural Network (CRNN). The second baseline is a CRNN that learns the words directly from sequence labels, without relying on character annotations. The encoder uses a CNN to extract a set of features from the image. The CNN has no fully connected layers and extracts sequential feature representations of the input image, which are fed into a bidirectional RNN. A Connectionist Temporal Classification (CTC) based method is used to convert the per-frame predictions made by the RNN into a label sequence. In the standard CTC formulation this reads:
$$p(l \mid y) = \sum_{\pi : \mathcal{B}(\pi) = l} p(\pi \mid y)$$
where $\mathcal{B}$ is the sequence-to-sequence mapping function, $l$ the sequence label and $y$ the sequence of per-frame predictions.
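To make the role of the CTC mapping function concrete, the following is a minimal sketch of best-path (greedy) decoding: pick the most probable label per frame, merge consecutive repeats, then drop blanks. The blank symbol `-`, the function names and the toy alphabet are assumptions made for illustration; they are not taken from the CRNN baseline itself.

```python
from itertools import groupby

BLANK = "-"  # assumed symbol for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then drop blank labels."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)


def greedy_decode(frame_probs, alphabet, blank=BLANK):
    """Take the most probable label in every frame and collapse the path."""
    best_path = []
    for probs in frame_probs:
        best_index = max(range(len(probs)), key=probs.__getitem__)
        best_path.append(alphabet[best_index])
    return ctc_collapse(best_path, blank)


if __name__ == "__main__":
    print(ctc_collapse("--hh-e-l-ll-oo--"))
```

Greedy decoding only approximates the most probable label sequence; beam search over paths is the usual alternative when accuracy matters more than speed.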
LSTM-Visual Attention (LSTM-V). The third baseline also generates output words as probable character sequences, without relying on a lexicon. The network is based on an encoder-decoder architecture with a visual attention mechanism. In particular, the authors use the pre-trained CNN model mentioned above as the encoder, but without the final layer, to extract the most important feature vectors from each text image. These feature vectors are passed to a soft attention model, which reduces model complexity by focusing on the most relevant parts of the image at every step. The LSTM decoder computes the output character probability for each time step from the visual attention, using the deep output parameters of each layer.
CNN-Attention. Finally, we employ one of the most recent state-of-the-art systems, which also produces the final output words as probable character sequences, without any fixed lexicon. The model is based on a CNN encoder-decoder with attention and a CNN character language model. The final character prediction is an element-wise addition of the attention and language vectors, where two softmax functions convert the attention and language vectors to predicted characters separately.
To extract the visual context from each image, we use out-of-the-box state-of-the-art classifiers. We obtain three kinds of contextual information: objects in the image, location/scenario labels, and a textual description or caption.
The output of each of the following classifiers is a 1000-dimensional vector with the probabilities of the object types. We retain the top-5 most likely objects.
GoogLeNet. The design of this network is based on the inception module, which uses 1×1 convolutions to reduce the number of parameters. Also, the fully connected layer is replaced with global average pooling at the end of the network. The network is a 22-layer deep CNN with a reduced number of parameters. It has a top-5 error rate of 6.67%.
Inception-ResNet. Inspired by the breakthrough ResNet performance, a hybrid inception module was proposed. Inception-ResNet combines the two architectures (Inception modules and residual connections) to boost performance even further. We use Inception-ResNet-v2, with a top-5 error rate of 3.1%.
The object hypotheses are obtained by extracting the top-5 class labels from each classifier and re-ranking them based on the cosine distance.
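As a rough illustration of how object hypotheses from several classifiers could be collected, the sketch below takes the top-k labels from each probability vector and merges them, keeping the highest probability per label. The function names, the toy labels and the merge rule are assumptions; the cosine-based re-ranking mentioned above is not reproduced here.

```python
def top_k(probabilities, labels, k=5):
    """Return the k most probable (label, probability) pairs."""
    ranked = sorted(zip(labels, probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def merge_hypotheses(*ranked_lists):
    """Union of several top-k lists, keeping the best probability per label."""
    best = {}
    for ranked in ranked_lists:
        for label, prob in ranked:
            best[label] = max(prob, best.get(label, 0.0))
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    labels = ["train", "pizza", "plate", "remote", "bib", "umbrella"]
    first = top_k([0.05, 0.40, 0.30, 0.10, 0.05, 0.10], labels, k=2)
    second = [("plate", 0.5), ("fork", 0.2)]  # hypothetical second classifier
    print(merge_hypotheses(first, second))
```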
To extract scene information, we considered just one scene classifier. This is a pre-trained scene classifier able to recognize different scenario classes. The original model is a deep convolutional network trained on Places365-Standard, with 1.8 million images from 365 scene categories. The same work proposed a better model, which we use: a fine-tuned Places365-ResNet (http://places2.csail.mit.edu/) based on the ResNet architecture.
Finally, we use a caption generator to extract more visual context information from each image, as a natural language description. Image caption generation can follow either a top-down or a bottom-up approach. The bottom-up approach consists of detecting objects in the image and then attempting to combine the identified objects into a caption. The top-down approach, on the other hand, learns a semantic representation of the image, which is then decoded into the caption. Most current state-of-the-art systems adopt the top-down approach using RNN-based architectures. In this work, we use a top-down model to extract the visual description of the image.
The encoder of the caption generator uses a ResNet architecture trained on the ILSVRC competition dataset for the general image classification task, and the decoder is tuned on COCO-caption, the same dataset for which we extract all visual context information. Table 3 shows that the captions are semantically richer.
As described above, the output of several text spotting systems is included in the dataset as text hypotheses or possible candidates for each image. However, some filtering is applied to remove duplicates and unlikely words:
First, we use a unigram language model (ULM) to filter out rare words (e.g. pretzel), non-words (e.g. tolin), or very short words (e.g. inc) unlikely to be in the image. The ULM was built from Opensubtitles (https://www.opensubtitles.org), a large database of movie subtitles containing around 3 million unique word forms, including numbers and other alphanumeric combinations that make it well suited for our task. We combined this corpus with google-ngrams (https://books.google.com/ngrams), which contains 5 million tokens from English literature books. The combined corpora contain around 8 million tokens, as shown in Table 4.
Secondly, we add the ground truth if it was removed by the filter or was not included in the hypothesis list generated by the baselines. Note that this may occur often, since according to the authors of COCO-text the most significant shortcoming of this dataset is the recall of bounding box detection. As a result, in about 40% of the images the text is not properly detected and thus the classifiers fail to recognize it.
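The two filtering steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `unigram_counts` stands in for the unigram language model, and the `min_count` and `min_length` thresholds are hypothetical values, not the ones used to build the dataset.

```python
def filter_hypotheses(hypotheses, unigram_counts, ground_truth=None,
                      min_count=5, min_length=3):
    """Drop duplicates, short words and rare or unknown words, then make
    sure the ground-truth word is present in the final list."""
    kept = []
    for word in hypotheses:
        word = word.lower()
        if word in kept:                              # duplicate hypothesis
            continue
        if len(word) < min_length:                    # very short word
            continue
        if unigram_counts.get(word, 0) < min_count:   # rare word or non-word
            continue
        kept.append(word)
    if ground_truth is not None and ground_truth.lower() not in kept:
        kept.append(ground_truth.lower())             # re-add the ground truth
    return kept


if __name__ == "__main__":
    counts = {"coffee": 900, "dunkin": 40, "pretzel": 2, "unkind": 60}
    candidates = ["dunkin", "Dunkin", "tolin", "pretzel", "in", "unkind"]
    print(filter_hypotheses(candidates, counts, ground_truth="pretzel"))
```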
Although we extract the top-5 objects from each image, we use a semantic similarity measure and a threshold to filter out predictions where the object classifier is not confident enough. We use two approaches to filter out duplicated cases and false positive examples.
Threshold measure. First, we consider a threshold to extract the most likely classes in the images, and eliminate low confidence predictions.
Semantic alignment. We use the cosine similarity to select the most likely visual context in the image. Concretely, we use a general text word-embedding [25, 27] to compute the similarity score between different visual context elements, and then we select objects or places detected with: 1) a high confidence and that have 2) strong semantic similarity with other image elements. The underlying assumption is that if two objects in the image are related, the classifier prediction we are relying on will be more likely to be correct.
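A minimal sketch of the threshold and semantic alignment filters is given below. The toy three-dimensional embeddings, the threshold values `min_prob` and `min_sim`, and the function names are assumptions for illustration; in practice the vectors would come from a pre-trained word embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_context(predictions, embeddings, min_prob=0.3, min_sim=0.5):
    """Keep labels that are confident AND similar to another confident label."""
    confident = [(label, prob) for label, prob in predictions
                 if prob >= min_prob and label in embeddings]
    selected = []
    for label, prob in confident:
        others = [other for other, _ in confident if other != label]
        best_sim = max((cosine(embeddings[label], embeddings[other])
                        for other in others), default=0.0)
        if best_sim >= min_sim:
            selected.append((label, prob, best_sim))
    return selected


if __name__ == "__main__":
    toy_embeddings = {"pizza": [1.0, 0.2, 0.0], "plate": [0.9, 0.3, 0.1],
                      "umbrella": [0.0, 0.1, 1.0]}
    detections = [("pizza", 0.6), ("plate", 0.5),
                  ("umbrella", 0.4), ("fork", 0.1)]
    print(select_context(detections, toy_embeddings))
```

In this toy example "umbrella" passes the confidence threshold but is dropped because it is not semantically close to any other detected element.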
|Unique Count for Textual Dataset|
|Flickr 30K ||30k||-||160k||-||2604,646||509,459||139128||169158|
Finally, we enrich the dataset with text-object co-occurrence frequencies. Since this information is not associated with each image, but is an aggregate over the whole dataset, it is provided in a separate table. This co-occurrence information may be useful when the text hypotheses and the scenes are not close in the semantic space but are related in the real world (e.g. delta and airliner, or the sports TV channel kt and racket, may not be close according to a general word embedding model, but they co-occur often in the image dataset). A sample of these co-occurrence frequencies is shown in Figure 3 (b).
The co-occurrence information can be used to estimate the conditional probability $P(w \mid c)$ of a word $w$ given that object $c$ appears in the image:
$$P(w \mid c) = \frac{count(w, c)}{count(c)}$$
where $count(w, c)$ is the number of training images in which $w$ appears as the gold standard (ground truth) annotation for the recognized text and the object classifier detects object label $c$ in the image. Similarly, $count(c)$ is the number of training images in which the object classifier detects object class $c$.
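The count-based estimate above can be sketched in a few lines. The sample format, one ground-truth word plus a list of detected object labels per image, is an assumption made for illustration and may differ from the released files.

```python
from collections import Counter


def cooccurrence_counts(samples):
    """Count images per object label and per (word, object) pair.

    Each sample is (ground_truth_word, detected_object_labels)."""
    pair_counts, object_counts = Counter(), Counter()
    for word, objects in samples:
        for obj in set(objects):          # count each label once per image
            object_counts[obj] += 1
            pair_counts[(word, obj)] += 1
    return pair_counts, object_counts


def conditional_probability(word, obj, pair_counts, object_counts):
    """Estimate P(word | object) as count(word, object) / count(object)."""
    if object_counts[obj] == 0:
        return 0.0
    return pair_counts[(word, obj)] / object_counts[obj]


if __name__ == "__main__":
    toy_samples = [("delta", ["airliner"]), ("delta", ["airliner", "wing"]),
                   ("kt", ["racket"]), ("boeing", ["airliner"])]
    pairs, objects = cooccurrence_counts(toy_samples)
    print(conditional_probability("delta", "airliner", pairs, objects))
```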
In this section, we outline in more detail our textual visual context dataset, which is an extension to COCO-text. First, we explain the original dataset and then we describe our proposed textual visual context.
As we described in Section 2.2, the COCO-text dataset is much larger than other text detection and recognition datasets. It consists of 63,686 images and 173,589 text instances (annotations of the images).
We propose three different visual textual datasets for COCO-text as shown in Table 1: 1) training dataset (COCO-Text-V), 2) benchmark testing (IC17-V) and 3) object and text co-occurrence database (COCO-Pairs).
COCO-Text-V: It consists of 16K images with associated bounding boxes and 60K lines of textual data; each line has caption, object and scene visual information. As shown in Table 2, for each bounding box we extract $k$=10 text hypotheses, and each of them has either different or the same visual context information depending on the semantic alignment.
ICDAR17-Task3-V (IC17-V) is based on ICDAR17 task 3 end-to-end text recognition dataset. Similar to COCO-Text-V, we only introduce the visual context (textual dataset) for each bounding box. It consists of 10K images with 25k textual data for testing and validation.
To be able to use another type of word embedding, namely knowledge-based embeddings, we use the external knowledge base BabelNet to extract multiple senses for each word. BabelNet (https://babelnet.org/) is the largest semantic network with a multilingual encyclopedic dictionary, comprising approximately 16 million entries for concepts and named entities linked by semantic relations. Each class label in ResNet has a sense or meaning that is extracted from the predefined sense inventory (BabelNet). This allows the model to learn more accurate semantic relations between the spotted text and its visual context. The sense ID can be used to extract a word vector from any pre-trained sense embedding [13, 28, 31, 4, 14]. This subset consists of 1800 images with sense IDs (e.g. orange as fruit and orange as color) that can be used to compute the similarity vector. Some of the words can be used multiple times because they have only one meaning. For example, “umbrella” means the same in all contexts, whereas the word “bar” has multiple meanings, such as a steel bar or a bar that serves alcoholic beverages.
COCO-Pairs: This textual dataset has no bounding boxes, only textual information. It consists of 158K word-visual context pairs, each pairing a text with an object or scene extracted from an image. We combined the output of the visual classifiers with the ground truth to create the pairs (e.g. text-scene, text-object).
Table 3 shows the unique word counts per part-of-speech tag (nouns, verbs, etc.) in our dataset. Our proposed textual datasets are semantically richer than the original COCO-text dataset. Also, as seen in Figure 4, real text in the wild is a very challenging problem, and thus current state-of-the-art methods, including those used to build our dataset, struggle to detect the correct bounding box coordinates. We therefore use the COCO-text ground truth annotations to overcome these inaccurate bounding box coordinates.
To evaluate the utility of the proposed dataset, we define a novel task, consisting of using the visual context in the image where the text appears to re-rank a list of candidates for the spotted text generated by some pre-existing model.
More specifically, the task is to use different similarity or relatedness scorers to reorder the $k$-best hypotheses produced by a trained model with a softmax output. This candidate word re-ranking should filter out false positives and eliminate low-frequency short words. The softmax score and the probabilities of the most related elements in the visual context are then combined by simple multiplication. In this work, we experimented with extracting and re-ranking $k$-best hypotheses for several values of $k$.
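The re-ranking step can be sketched as a product of two scores per candidate. In this minimal example, `context_scores` is a hypothetical mapping from each candidate word to the probability derived from its most related visual context element; how that probability is obtained depends on the similarity system used.

```python
def rerank(candidates, context_scores):
    """Multiply each baseline softmax score by its visual-context score
    and sort the candidates by the combined score, best first."""
    rescored = [(word, score * context_scores.get(word, 0.0))
                for word, score in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy k-best list from a baseline and toy context scores for a coffee shop.
    k_best = [("junking", 0.5), ("dunkin", 0.3), ("unkind", 0.2)]
    context = {"junking": 0.10, "dunkin": 0.60, "unkind": 0.05}
    for word, score in rerank(k_best, context):
        print(word, round(score, 3))
```

Candidates without a context score receive zero here; a small smoothing constant would be a reasonable alternative if the baseline score should never be discarded entirely.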
For evaluation, we used a less restrictive protocol than the standard one adopted in most state-of-the-art benchmarks, which does not consider words with fewer than three characters. That protocol was introduced to overcome the false positives on short words that most current state-of-the-art systems struggle with, including our baseline. Instead, we consider all cases in the dataset, and words with fewer than three characters are also evaluated.
Since our task is re-ranking, we use the Mean Reciprocal Rank (MRR) to evaluate the quality of the re-ranker outputs. MRR is computed as $MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$, where $rank_i$ is the position of the first correct answer in the candidate list of the $i$-th sample and $|Q|$ is the number of samples. MRR only looks at the rank of the first correct answer; hence it is more suitable in cases such as ours, where for each candidate list there is only a single right answer.
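For reference, MRR can be computed as in the following sketch. Treating a sample whose gold word is missing from the candidate list as contributing zero is an assumption made here, not a convention stated in the text.

```python
def mean_reciprocal_rank(ranked_lists, gold_words):
    """Average of 1/rank of the gold word over all samples.

    Samples whose gold word is absent from the list contribute 0."""
    if not gold_words:
        return 0.0
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_words):
        if gold in candidates:
            total += 1.0 / (candidates.index(gold) + 1)
    return total / len(gold_words)


if __name__ == "__main__":
    lists = [["dunkin", "junking"], ["wii", "will", "wit"], ["coke"]]
    gold = ["dunkin", "will", "pizza"]
    print(mean_reciprocal_rank(lists, gold))
```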
Human Evaluation as an Upper Bound. To calibrate the difficulty of the task we picked 33 random pictures from the test dataset and had 16 human subjects try to select the right word among the top candidates produced by the baseline text spotting system. We observed that human subjects more familiar with ads and commercial logos obtain higher scores. Average human performance was 63% (highest 87%, lowest 39%). Figure 5 shows the user interface for human annotation.
To generate the list of candidate words that will be re-ranked, we rely on two pre-trained baseline systems: a CNN and an LSTM. Each baseline takes a text image bounding box Bb as input and produces $k$ candidate words plus a probability for each prediction.
The CNN baseline uses a closed lexicon and cannot recognize any word outside its 90K-word dictionary. The LSTM baseline uses a visual soft-attention mechanism, which performs unconstrained text recognition without relying on a lexicon.
We performed two experiments, and in each of them we compared the performance of several existing semantic similarity/relatedness systems.
The first experiment consists of re-ranking the text hypotheses produced by the baseline spotting system using only word-to-word similarity metrics. In this experiment each candidate word is compared to objects and places appearing in the image, and re-ranked according to the obtained similarity scores. In the second experiment, we re-rank the candidate words comparing them with an automatically generated caption for the image. For this, we require semantic similarity systems able to produce word-to-sentence or sentence-to-sentence similarity scores.
We used different off-the-shelf semantic similarity systems to compare the candidate words with the visual context in the image (objects and places), and evaluated the performance of each of them. The used systems are:
Glove : Word embedding system that derives the semantic relationships between words from the co-occurrence matrix. The advantage of Glove over Word2Vec  is that it does not rely on local word-context information, but it incorporates global co-occurrence statistics.
Fasttext : Extension of Word2Vec that, instead of learning a vector for each word directly, learns a character $n$-gram representation. Thus, it can deal with rare words not seen during training by breaking them down into character $n$-grams.
Relational Word Embeddings  (RWE): Enhanced version of Word2Vec that encodes complementary relational knowledge into the standard word-embedding in the semantic space. This enhanced embedding is still learned from pure co-occurrence statistics and not relying on any external knowledge. The model intends to capture and combine new knowledge complementary to standard similarity-centric embeddings.
TWE : Semantic Relatedness with Word Embeddings. Word embedding trained using Word2Vec, but instead of a general corpus, it is trained on the presented dataset, so it can learn associations between candidate words and their visual context that are uncommon in general text. The model is a Skip-gram model, which works well with small amounts of training data and is able to represent low-frequency words.
LSTMEmbed : LSTMEmbed is the most recent model in sense embeddings. It utilizes a BiLSTM architecture to learn the word and sense embeddings from annotated corpora. We use the same setup as in the original work: 200-dimension embeddings trained on the English portion of BabelWiki and English Wikipedia.
Once the similarity between the candidate word and the most closely related element in the visual context is computed, we need to convert that score to a probability in order to combine them in the re-ranking process. We use two different methods to obtain the final probability:
For TWE, the similarity score is first mapped to a value that serves as our approximation of the joint probability of the candidate word and the visual context; this value is then divided by the probability of the context to obtain the conditional probability.
For all other word-level similarity methods, we combine the obtained cosine similarity, the probability of the detected context (provided by the object/place classifier), and the probability of the candidate word (estimated from an 8M token corpus). The final probability is computed under the confirmation assumption.
Results of experiment 1 are shown in Table 5-top.
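As an illustration of how a similarity score, a context probability and a word prior might be combined for the word-level methods in the first experiment, the sketch below assumes a similarity-to-probability model of the form $P(w \mid c) = P(w)^{\alpha}$ with $\alpha = \left(\frac{1 - sim(w,c)}{1 + sim(w,c)}\right)^{1 - P(c)}$. This specific form is an assumption made here and should be checked against the cited work; the function name and the clamping of the similarity to [0, 1] are likewise illustrative choices.

```python
def similarity_to_probability(p_word, p_context, similarity):
    """Assumed combination: P(w|c) = P(w) ** alpha, where
    alpha = ((1 - sim) / (1 + sim)) ** (1 - P(c))."""
    sim = min(max(similarity, 0.0), 1.0)   # clamp to [0, 1] (assumption)
    alpha = ((1.0 - sim) / (1.0 + sim)) ** (1.0 - p_context)
    return p_word ** alpha


if __name__ == "__main__":
    # A rare word becomes more plausible as its similarity to the context grows.
    for sim in (0.0, 0.2, 0.7, 1.0):
        print(sim, similarity_to_probability(0.01, 0.8, sim))
```

Under this assumed form, a similarity of zero leaves the word prior unchanged, and higher similarity raises the conditional probability towards one.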
In the second experiment we used sentence-level semantic similarity. For this, we resorted to state-of-the-art sentence embedding models fine-tuned using the caption dataset.
USE-T : Universal Sentence Encoder with a Transformer encoder, the variant that targets high accuracy at the cost of complexity and resource consumption. We experimented with USE-T fine-tuning and feature extraction to compute the semantic relation with cosine distance.
Bert (we use the basic bert-base-uncased model) : Bidirectional Encoder Representations from Transformers has shown groundbreaking results in many semantics-related NLP tasks.
Fine-tuned Bert: According to the Bert authors, the model is not suited for the Semantic Textual Similarity (STS) task, since it does not generate a meaningful vector to compute the cosine distance. Thus, we also evaluated a fine-tuned version of the model with one extra layer to compute the semantic score between the caption and the candidate word. In particular, we fed the sentence representation into a linear layer and a softmax for sentence pair tasks (Q&A re-ranking task).
Results for the second experiment are shown in Table 5-bottom. Fine-tuned Bert outperforms all other models. BL+TWE ranks second in accuracy.
|BL+ Glove ||22.0||7||44.5||19.1||4||78.8|
|BL+ LSTMEmbed ||21.6||7||44.0||19.2||4||79.6|
|BL+ BERT-feature ||21.7||7||45.0||19.3||4||81.2|
|BL+ BERT (fine-tune) ||22.7||8||45.9||20.1||9||79.1|
We have proposed a dataset that extends COCO-text with visual context information, which we believe is useful for the text spotting problem. In contrast to the most recent method, which relies on limited classes of context objects and uses a complex architecture to extract visual information, our approach utilizes out-of-the-box state-of-the-art tools. Therefore, the dataset annotation will be improved in the future as better systems become available. This dataset can be used to leverage the semantic relation between image context and candidate texts in text spotting systems, either as post-processing or in end-to-end training. We also use our dataset to train/tune and evaluate existing semantic similarity systems when applied to the task of re-ranking the text hypotheses produced by a text spotting baseline, showing that this can improve the accuracy of the original baseline by between 2 and 3 points. Note that there is still considerable room for improvement, up to 7.4 points on a benchmark dataset.
This work is supported by the KASP Scholarship Program and by the Spanish government under projects HuMoUR TIN2017-90086-R and María de Maeztu Seal of Excellence MDM-2016-0656.
Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pp. . Cited by: Table 3.