
Semantic Relatedness Based Re-ranker for Text Spotting

by   Ahmed Sabir, et al.
Universitat Politècnica de Catalunya

Applications such as textual entailment, plagiarism detection or document clustering rely on the notion of semantic similarity, and are usually approached with dimension reduction techniques like LDA or with embedding-based neural approaches. We present a scenario where semantic similarity is not enough, and we devise a neural approach to learn semantic relatedness. The scenario is text spotting in the wild, where a text in an image (e.g. street sign, advertisement or bus destination) must be identified and recognized. Our goal is to improve the performance of vision systems by leveraging semantic information. Our rationale is that the text to be spotted is often related to the image context in which it appears (word pairs such as Delta-airplane, or quarters-parking are not similar, but are clearly related). We show how learning a word-to-word or word-to-sentence relatedness score can improve the performance of text spotting systems up to 2.9 points, outperforming other measures in a benchmark dataset.



1 Introduction

Deep learning has been successful in tasks related to deciding whether two short pieces of text refer to the same topic, e.g. semantic textual similarity (Daniel:18), textual entailment (Ankur:16) or answer ranking for Q&A (Aliaksei:15).

However, other tasks require a broader perspective: deciding whether two text fragments are related, rather than whether they are similar. In this work, we describe one such task, and we retrain some of the existing sentence similarity approaches to learn this semantic relatedness. We also present a new Deep Neural Network (DNN) that outperforms existing approaches when applied to this particular scenario.

The task we tackle is Text Spotting, which is the problem of recognizing text that appears in unrestricted images (a.k.a. text in the wild) such as traffic signs, commercial ads, or shop names. Current state-of-the-art results on this task are far from those of OCR systems with simple backgrounds.

Existing approaches to Text Spotting usually divide the problem into two fundamental tasks: 1) text detection, which selects the image regions likely to contain text, and 2) text recognition, which converts the images within these bounding boxes into a readable string. In this work, we focus on the recognition stage, aiming to show that semantic relatedness between the image context and the recognized text can boost system performance. We use existing pre-trained architectures for Text Recognition, and add a shallow network that performs a post-processing operation to re-rank the proposed candidate texts. In particular, we re-rank the candidates using their semantic relatedness score with other visual information extracted from the image (e.g. objects, scenario, image caption). Extensive evaluation shows that our approach consistently improves over other semantic similarity methods.

Figure 1: Overview of the system pipeline. An end-to-end post-processing step scores the semantic relatedness between a candidate word and the context in the image (objects, scenarios, natural language descriptions, …).

2 Text Hypothesis Extraction

We use two pre-trained Text Spotting baselines to extract text hypotheses. The first baseline is a CNN (Max:16) with fixed-lexicon recognition, able to recognize words in a predefined 90K-word dictionary. The second is an LSTM architecture with a visual attention model (Suman:17) that generates the final output words as probable character sequences, without relying on any lexicon. Both models are trained on a synthetic dataset (Max:14b). The output of both models is a Softmax score for each of the candidate words.

3 Learning Semantic Relatedness for Text Spotting

To learn the semantic relatedness between the visual context information and the candidate word we introduce a multi-channel convolutional LSTM with an attention mechanism. The network is fed with the candidate word plus several words describing the image visual context (object and place labels, and descriptive captions), and is trained to produce a relatedness score between the candidate word and the context. (All this visual context information is automatically generated using off-the-shelf existing modules; see Section 4.)

Our architecture is inspired by (Aliaksei:15), who proposed CNN-based re-rankers for Q&A. Our network consists of two subnetworks, each with four channels of different kernel sizes, and an overlap layer, as shown in Figure 1. We next describe the main components:

Multi-Channel Convolution: The first subnetwork consists of only convolution kernels, and aims to extract n-gram or keyword features from the caption sequence.

The convolution is applied over a sequence to extract n-gram features from different positions. Let $x \in \mathbb{R}^{L \times d}$ be the sentence matrix, where $L$ is the sentence length and $d$ is the dimension of the $i$-th word vector $x_i$ in the sentence. Let $k \in \mathbb{R}^{h \times d}$ denote the kernel for the convolution operation, spanning $h$ words. For each $j$-th position in the sentence, $w_j$ is the concatenation of $h$ consecutive words, i.e., $w_j = [x_j, x_{j+1}, \ldots, x_{j+h-1}]$. Our architecture uses multiple such kernels to generate feature maps $m_j$. The feature map for each window vector $w_j$ can be written as:

$$m_j = f(w_j \circ k + b)$$

where $\circ$ is element-wise multiplication, $f$ is a nonlinear function, in our case the ReLU function (Vinod:10), and $b$ is a bias. For $n$ kernels, the $n$ generated feature maps can be arranged as a feature representation for each window $w_j$ as $W = [m^{(1)}; m^{(2)}; \ldots; m^{(n)}]$, where $m^{(i)}$ is the feature map of the $i$-th kernel. Each row $W_j$ of $W \in \mathbb{R}^{(L-h+1) \times n}$ holds the new features generated from the $i$-th kernel, $i = 1, \ldots, n$, for the window vector at position $j$. The new generated features (window representations) are then fed into the joint layer and LSTM as shown in Figure 1.
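As a rough illustration of the windowed convolution described above, the following pure-Python sketch slides several kernels over a toy sentence matrix and keeps every window position (no pooling). The function names, the toy vectors, and the choice to sum the element-wise product into a single scalar per window are assumptions made for this example; it also assumes all kernels span the same number of words.

```python
from typing import List

Vector = List[float]


def relu(value: float) -> float:
    """Rectified linear unit."""
    return max(0.0, value)


def conv_feature_map(sentence: List[Vector], kernel: List[Vector], bias: float) -> List[float]:
    """Slide one kernel over the sentence and return one feature per window."""
    width = len(kernel)  # number of consecutive words covered by the kernel
    features = []
    for j in range(len(sentence) - width + 1):
        window = sentence[j:j + width]
        # Element-wise product of window and kernel, summed to a scalar.
        total = sum(
            w * m
            for word_vec, kernel_row in zip(window, kernel)
            for w, m in zip(word_vec, kernel_row)
        )
        features.append(relu(total + bias))
    return features


def multi_kernel_features(sentence, kernels, biases):
    """Arrange the feature maps so that each row is one window position
    and each column is one kernel (same-width kernels assumed)."""
    maps = [conv_feature_map(sentence, k, b) for k, b in zip(kernels, biases)]
    return [list(row) for row in zip(*maps)]


# Toy input: three words with 2-dimensional embeddings, two bigram kernels.
toy_sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_kernels = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
W = multi_kernel_features(toy_sentence, toy_kernels, biases=[0.0, 0.0])
```

With three words and bigram kernels there are two window positions, so `W` has two rows and one column per kernel; the sequence order of the windows is preserved for the LSTM that follows.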

Multi-Channel Convolution-LSTM: Following C-LSTM (Chunting:15) we forward the output of the CNN layers into an LSTM, which captures the long term dependencies over the features. We further introduce an attention mechanism to capture the most important features from that sequence. The advantage of this attention is that the model learns the sequence without relying on the temporal order. We describe in more detail the attention mechanism below.

Also, following (Chunting:15), we do not use a pooling operation after the convolution feature map. A pooling layer is usually applied after the convolution layer to extract the most important features in the sequence. However, the output of our Convolutional-LSTM model is fed into an LSTM (Sepp:97) to learn the extracted sequence, and a pooling layer would break that sequence by downsampling it to a selected feature. In short, the LSTM is specialized in learning sequence data, and a pooling operation would break the sequence order. For the Multi-Channel Convolution model, we likewise learn the extracted word n-gram sequence directly, without feature selection via pooling.

Attention Mechanism: Attention-based models have shown promising results on various NLP tasks (Dzmitry:14). Such a mechanism learns to focus on a specific part of the input (e.g. a relevant word in a sentence). We apply an attention mechanism (Colin:15) via an LSTM that captures the temporal dependencies in the sequence. This attention uses a Feed Forward Neural Network (FFN) attention function:

$$e_t = \tanh(h_t W_a)\, v_a^{\top}$$

where $W_a$ is the attention weight matrix applied to the hidden state and $v_a$ is the output vector. As shown in Fig. 1, the vector $c$ is computed as a weighted average of $h_t$, given by $\alpha_t$ (defined below). The attention mechanism is used to produce a single vector $c$ for the complete sequence as follows:

$$e_t = a(h_t), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{t'=1}^{T} \exp(e_{t'})}, \qquad c = \sum_{t=1}^{T} \alpha_t h_t$$

where $T$ is the total number of steps, $\alpha_t$ is the computed weight of each time step $t$ for each state $h_t$, and $a$ is a learnable function that depends only on $h_t$. Since this attention computes an average over time, it discards the temporal order, which is ideal for learning semantic relations between words. In this way, the attention gives higher weights to the more important words in the sentence without relying on sequence order.
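The order-invariance claim can be checked with a small sketch of feed-forward attention: score each hidden state independently, normalise the scores with a softmax, and take the weighted average of the states. The scoring function used here (`tanh` of the sum of the state) is only a stand-in for the learned function, and all names are hypothetical.

```python
import math
from typing import Callable, List, Tuple


def feed_forward_attention(
    states: List[List[float]],
    score: Callable[[List[float]], float],
) -> Tuple[List[float], List[float]]:
    """Return (context vector, attention weights) for a list of hidden states."""
    energies = [score(h) for h in states]
    # Softmax over time steps, shifted by the maximum for numerical stability.
    max_energy = max(energies)
    exps = [math.exp(e - max_energy) for e in energies]
    normaliser = sum(exps)
    weights = [x / normaliser for x in exps]
    # Weighted average of the states: one vector for the whole sequence.
    dim = len(states[0])
    context = [sum(a * h[i] for a, h in zip(weights, states)) for i in range(dim)]
    return context, weights


def toy_score(h: List[float]) -> float:
    """Placeholder for the learned scoring function (an assumption)."""
    return math.tanh(sum(h))


toy_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = feed_forward_attention(toy_states, toy_score)
# Reversing the sequence changes the order of the weights but not the context.
reversed_context, _ = feed_forward_attention(list(reversed(toy_states)), toy_score)
```

Because each score depends only on its own state, shuffling the time steps permutes the weights but leaves the context vector unchanged up to floating-point rounding.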

Overlap Layer: The overlap layer is just a frequency count dictionary that computes overlap information for the inputs. The idea is to give more weight to the most frequent visual element, especially when it is observed by more than one visual classifier. The dictionary output is a fully connected layer.
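One way to picture such a frequency count dictionary is sketched below. It counts how often each context word occurs and in how many context sources it appears; the exact features used by the overlap layer are not specified in the text, so the feature pair, the function name, and the toy labels are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def overlap_features(context_sources: Dict[str, List[str]]) -> Dict[str, Tuple[int, int]]:
    """Map each context word to (total frequency, number of sources containing it)."""
    frequency = Counter()
    source_hits = Counter()
    for tokens in context_sources.values():
        frequency.update(tokens)         # every occurrence counts
        source_hits.update(set(tokens))  # each source counts at most once
    return {word: (frequency[word], source_hits[word]) for word in frequency}


# Hypothetical outputs of an object classifier, a place classifier and a captioner.
toy_context = {
    "objects": ["airliner", "wing"],
    "places": ["runway"],
    "caption": ["a", "large", "airliner", "on", "a", "runway"],
}
features = overlap_features(toy_context)
```

In this toy input, `airliner` and `runway` are each reported by two sources, so they would receive more weight than `wing`, which only the object classifier produced.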

Finally, we merge all subnetworks into a joint layer that is fed to a loss function which calculates the semantic relatedness between both inputs. We call the combined model Fusion Dual Convolution-LSTM-Attention (FDCLSTM).

Since we have only one candidate word at a time, we apply a convolution with masking on the candidate word side (first channel). In this case, simply zero-padding the sequence has a negative impact on the learning stability of the network. We concatenate the CNN outputs with the additional feature and feed them into MLP layers, followed by a sigmoid layer performing binary classification. We trained the model with a binary cross-entropy loss, where the target value (in $[0, 1]$) is the semantic relatedness between the word and the visual context. Instead of restricting ourselves to a simple similarity function, we let the network learn the margin between the two classes, i.e. the degree of similarity. For this, we increase the depth of the network after the MLP merge layer with more fully connected layers. The network is trained using Nesterov-accelerated Adam (Nadam) (Timothy:16), as it yields better results (especially in cases such as word vectors and neural language modelling) than other optimizers that use only classical momentum (Adam). We apply batch normalization (BN) (Sergey:15) after each convolution and between the MLP layers. We omitted the BN after the convolution for the model without attention (FDCLSTM), as BN deteriorated the performance. Additionally, we use 70% dropout (Nitish:14) between the MLP layers for regularization purposes.
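For reference, the standard binary cross-entropy between sigmoid outputs and relatedness targets can be written in a few lines. This is the textbook definition, not code from the authors; the clipping constant `eps` and the toy values are assumptions added to keep the logarithm finite.

```python
import math
from typing import Sequence


def binary_cross_entropy(targets: Sequence[float], predictions: Sequence[float], eps: float = 1e-12) -> float:
    """Mean binary cross-entropy; targets and predictions lie in [0, 1]."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(targets)


# A related pair (target 1) and an unrelated pair (target 0).
uninformed = binary_cross_entropy([1.0, 0.0], [0.5, 0.5])
confident = binary_cross_entropy([1.0, 0.0], [0.9, 0.1])
```

A network that always outputs 0.5 scores a loss of ln 2, while confident correct predictions drive the loss towards zero.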

| Model | CNN full | CNN dict | CNN list | CNN k | CNN MRR | LSTM full | LSTM list | LSTM k | LSTM MRR |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (BL) | 19.7 | 56.0 | - | - | - | 17.9 | - | - | - |
| BL+ Glove (Jeffrey:14) | 22.0 | 62.5 | 75.8 | 7 | 44.5 | 19.1 | 75.3 | 4 | 78.8 |
| BL+C-LSTM (Chunting:15) | 21.4 | 61.0 | 71.3 | 8 | 45.6 | 18.9 | 74.7 | 4 | 80.7 |
| BL+CNN-RNN (Xingyou:16) | 21.7 | 61.8 | 73.3 | 8 | 44.5 | 19.5 | 77.1 | 4 | 80.9 |
| BL+MVCNN (Wenpeng:16) | 21.3 | 60.6 | 71.9 | 8 | 44.2 | 19.2 | 75.8 | 4 | 78.8 |
| BL+Attentive LSTM (Ming:16) | 21.9 | 62.4 | 74.0 | 8 | 45.7 | 19.1 | 71.4 | 5 | 80.2 |
| BL+fasttext (Armand:17) | 21.9 | 62.2 | 75.4 | 7 | 44.6 | 19.4 | 76.1 | 4 | 80.3 |
| BL+InferSent (Alexis:17) | 22.0 | 62.5 | 75.8 | 7 | 44.5 | 19.4 | 76.7 | 4 | 79.7 |
| BL+USE-T (Daniel:18) | 22.0 | 62.5 | 78.3 | 6 | 44.7 | 19.2 | 75.8 | 4 | 79.5 |
| BL+TWE (Ahmed:18) | 22.2 | 63.0 | 76.3 | 7 | 44.7 | 19.5 | 76.7 | 4 | 80.2 |
| BL+FDCLSTM (ours) | 22.3 | 63.3 | 75.1 | 8 | 45.0 | 20.2 | 67.9 | 9 | 79.8 |
| BL+FDCLSTM (ours) | 22.4 | 63.7 | 75.5 | 8 | 45.9 | 20.1 | 67.6 | 9 | 81.8 |
| BL+FDCLSTM (ours) | 22.6 | 64.3 | 76.3 | 8 | 45.1 | 19.4 | 76.4 | 4 | 78.8 |
| BL+FDCLSTM (ours) | 22.6 | 64.3 | 76.3 | 8 | 45.1 | 19.7 | 77.8 | 4 | 80.4 |

Table 1: Best results after re-ranking using different re-rankers and different values of $k$ for the $k$-best hypotheses extracted from the baseline output (%). In addition, to evaluate our re-ranker with MRR we fixed $k$ for the CNN and LSTM baselines.

4 Dataset and Visual Context Extraction

We evaluate the performance of the proposed approach on the COCO-text dataset (Andreas:16). This dataset is based on Microsoft COCO (Common Objects in Context) (Tsung-Yi:14), and consists of 63,686 images and 173,589 text instances (annotations of the images). The dataset does not include any visual context information, so we used out-of-the-box object (Kaiming:16) and place (Bolei:14) classifiers and tuned a caption generator (Oriol:15) on the same dataset to extract contextual information from each image, as seen in Figure 2.

5 Related Work and Contribution

Understanding the visual environment around the text is very important for scene understanding. This has recently been explored by a relatively small number of works.

Anna:16 shows that the visual context can be beneficial for text detection. This work uses a 14-class pixel classifier to extract context features from the image, such as tree, river, or wall, which then assist scene text detection. Chulmoo:17 employs topic modeling to learn the correlation between visual context and the spotted text in social media. The metadata associated with each image (e.g. tags, comments and titles) is then used as context to enhance recognition accuracy. Sezer:17 takes advantage of text and visual context for the logo retrieval problem. Most recently, Shitala:18 use object information (limited to 42 predefined object classes) surrounding the spotted text to guide text detection. They propose two sub-networks to learn the relation between text and object class (e.g. relations such as car-plate or sign board-digit).

Unlike these methods, our approach uses direct visual context from the image where the text appears, and neither relies on any extra resource such as human-labeled metadata (Chulmoo:17) nor limits the context object classes (Shitala:18). In addition, our approach is easy to train and can be used as a drop-in complement for any text-spotting algorithm that outputs a ranking of word hypotheses.

6 Experiments and Results

In the following we use different similarity or relatedness scorers to reorder the $k$-best hypotheses produced by an off-the-shelf state-of-the-art text spotting system. We experimented with extracting $k$-best hypotheses for different values of $k$.

We use two pre-trained deep models: a CNN (Max:16) and an LSTM (Suman:17) as baselines (BL) to extract the initial list of word hypotheses.
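The re-ranking step itself can be sketched as follows: keep the $k$ most probable baseline hypotheses, rescore each with a relatedness scorer, and sort again. Multiplying the baseline probability by the relatedness score is an assumption made for this illustration, as are the function name, the toy hypotheses, and the lookup table standing in for a trained scorer.

```python
from typing import Callable, List, Tuple

Hypothesis = Tuple[str, float]  # (candidate word, baseline probability)


def rerank(hypotheses: List[Hypothesis], relatedness: Callable[[str], float], k: int) -> List[Hypothesis]:
    """Re-rank the k-best baseline hypotheses by a combined score."""
    # Keep only the k most probable candidates from the baseline.
    top_k = sorted(hypotheses, key=lambda h: h[1], reverse=True)[:k]
    # Assumed combination rule: baseline probability times relatedness score.
    rescored = [(word, prob * relatedness(word)) for word, prob in top_k]
    return sorted(rescored, key=lambda h: h[1], reverse=True)


# Hypothetical baseline output for an image whose context mentions an airplane.
baseline_output = [("della", 0.35), ("delta", 0.30), ("data", 0.10)]
toy_relatedness = {"delta": 0.9, "della": 0.2, "data": 0.4}

reranked = rerank(baseline_output, lambda w: toy_relatedness.get(w, 0.0), k=3)
reranked_words = [word for word, _ in reranked]
```

Here the baseline prefers "della", but the context-aware score moves "delta" to the top. Note that a correct word outside the $k$-best list can never be recovered, which is why the list accuracy in Table 1 depends on $k$.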

Figure 2: Some examples of candidate re-ranking. The top three examples are re-ranked based on the semantic relatedness score. The pair delta-airliner, which frequently co-occurs in the training data, is captured by the overlap layers. The pair 12-football shows the relation between sports and numbers. The pair program-school has a much more distant relation, but our model is still able to re-rank the most related word. The pairs blue-runway and bus-private show that the overlap layer can be effective when the visual context appears in more than one visual context classifier. Finally, hotdog-a and by-ski have no semantic correlation but are solved by the network thanks to the frequency count dictionary.

The CNN baseline uses a closed lexicon; therefore, it cannot recognize any word outside its 90K-word dictionary. Table 1 presents four different accuracy metrics for this case: 1) full columns correspond to the accuracy on the whole dataset. 2) dict columns correspond to the accuracy over the cases where the target word is among the 90K words of the CNN dictionary (which correspond to 43.3% of the whole dataset). 3) list columns report the accuracy over the cases where the right word was among the $k$-best produced by the baseline. 4) MRR columns report the Mean Reciprocal Rank, which is computed as $\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$, where $|Q|$ is the number of evaluated words and $\mathrm{rank}_i$ is the position of the correct answer in the hypotheses list proposed by the baseline.
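As a worked example of the MRR metric, the sketch below averages the reciprocal rank of the correct word over a set of hypothesis lists. Treating a missing correct word as contributing zero is an assumption of this sketch, and the toy lists are invented for illustration.

```python
from typing import List


def mean_reciprocal_rank(ranked_lists: List[List[str]], gold: List[str]) -> float:
    """Average of 1/rank of the correct word over all hypothesis lists."""
    total = 0.0
    for hypotheses, target in zip(ranked_lists, gold):
        if target in hypotheses:
            rank = hypotheses.index(target) + 1  # ranks start at 1
            total += 1.0 / rank
        # Assumption: a list without the correct word contributes 0.
    return total / len(gold)


toy_lists = [["delta", "della"], ["park", "parking", "pork"]]
toy_gold = ["delta", "parking"]
mrr = mean_reciprocal_rank(toy_lists, toy_gold)
```

The first list has the correct word at rank 1 and the second at rank 2, so the value is (1 + 1/2) / 2 = 0.75.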

Comparing with sentence level model: We compare the results of our encoder with several state-of-the-art sentence encoders, tuned or trained on the same dataset. We use the cosine to compute the similarity between the caption and the candidate word. Word-to-sentence representations are computed with the Universal Sentence Encoder with the Transformer, USE-T (Daniel:18), and InferSent (Alexis:17) with GloVe (Jeffrey:14). The rest of the systems in Table 1 are trained under the same conditions as our model, with GloVe initialization and dual-channel overlapping non-static pre-trained embeddings, on the same dataset. Our model FDCLSTM without attention achieves a better result in the case of the second baseline, the LSTM, whose output is full of false positives and short words. The advantage of the attention mechanism is its ability to integrate information over time, which allows the model to refer to specific points in the sequence when computing its output. However, in this case, the attention attends to the wrong context, as many candidate words have no correlation with the context or do not correspond to actual words. On the other hand, USE-T seems to require a shorter hypothesis list to get top performance when the right word is in the hypothesis list.

Comparing with word level model: We also compare our results with current state-of-the-art word embeddings trained on a large general text corpus, using GloVe and fastText. The word-level model uses only object and place information, and ignores the caption. Our proposed models achieve better performance than our previous TWE model (Ahmed:18), which trained a word embedding (Tomas:13) from scratch on the same task.

Similarity to probabilities: After computing the cosine similarity we need to convert that score to probabilities. As we proposed in previous work (Ahmed:18), we obtain the final probability by combining (Sergey:03) the similarity score, the probability of the detected context (provided by the object/place classifier), and the probability of the candidate word (estimated from a 5M token corpus).
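The cosine similarity that serves as the input to this conversion can be sketched in a few lines. The embedding vectors below are invented toy values, and the sketch covers only the similarity computation, not the probability combination of (Sergey:03).

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings for a candidate word and a context word.
candidate_vec = [0.2, 0.7, 0.1]
context_vec = [0.25, 0.6, 0.05]
unrelated_vec = [0.9, -0.3, 0.3]

sim_related = cosine_similarity(candidate_vec, context_vec)
sim_unrelated = cosine_similarity(candidate_vec, unrelated_vec)
```

The raw cosine lies in [-1, 1] and is not a probability, which is why a separate conversion step is required before it can be combined with the baseline scores.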


Effect of Unigram Probabilities: Suman:17 showed the utility of a language model (LM) when the data is too small for a DNN, obtaining significant improvements. Thus, we introduce a basic model of unigram probabilities with smoothing for out-of-vocabulary (OOV) words. The model is applied at the end to re-rank out false-positive short words, with the main goal of demoting less probable words that were over-ranked by the deep model. As seen in Table 1, the introduction of this unigram lexicon produces the best results.
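A unigram model with OOV smoothing can be sketched as below. The text does not say which smoothing scheme is used, so add-one (Laplace) smoothing with a single reserved slot for unseen words is an assumption of this example, as are the function name and the toy corpus.

```python
from collections import Counter
from typing import Callable, List


def build_unigram_model(tokens: List[str]) -> Callable[[str], float]:
    """Return a function giving the smoothed unigram probability of a word."""
    counts = Counter(tokens)
    total = len(tokens)
    # Assumed scheme: add-one smoothing, with one extra slot for unseen words.
    vocabulary_size = len(counts) + 1

    def probability(word: str) -> float:
        return (counts[word] + 1) / (total + vocabulary_size)

    return probability


toy_corpus = "the bus stops near the bus station".split()
unigram = build_unigram_model(toy_corpus)
p_seen = unigram("bus")    # frequent in-vocabulary word
p_unseen = unigram("bvs")  # OOV string, e.g. a misrecognised candidate
```

In the toy corpus, "bus" gets probability 3/13 and the unseen string 1/13, so multiplying a candidate's score by such a probability would push implausible strings down the list without removing them entirely.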

Human performance: To estimate an upper bound for the results, we picked 33 random pictures from the test dataset and had 16 human subjects try to select the right word among the top candidates produced by the baseline text spotting system. Our proposed model performance on the same images was 57%. Average human performance was 63% (highest 87%, lowest 39%).

7 Conclusion

In this work, we propose a simple deep learning architecture to learn semantic relatedness between word-to-word and word-to-sentence pairs, and show how it outperforms other semantic similarity scorers when used to re-rank candidate answers in the Text Spotting problem.

In the future, we plan to use the same approach to tackle similar problems, including lexical selection in Machine Translation and word sense disambiguation (Chiraag:18). We believe our approach could also be useful in multimodal machine translation, where an image caption must be translated using not only the text but also the image content (barrault2018findings). Tasks that lie at the intersection of computer vision and NLP, such as the challenges posed in the new BreakingNews dataset (popularity prediction, automatic text illustration), could also benefit from our results.



We would like to thank José Fonollosa, Marta Ruiz Costa-Jussà and the anonymous reviewers for discussion and feedback. This work was supported by the KASP Scholarship Program and by the MINECO project HuMoUR TIN2017-90086-R.