Deep learning has been successful in tasks related to deciding whether two short pieces of text refer to the same topic, e.g. semantic textual similarity (Daniel:18), textual entailment (Ankur:16) or answer ranking for Q&A (Aliaksei:15).
However, other tasks require a broader perspective, deciding whether two text fragments are related rather than whether they are similar. In this work, we describe one such task, and we retrain some of the existing sentence similarity approaches to learn this semantic relatedness. We also present a new Deep Neural Network (DNN) that outperforms existing approaches when applied to this particular scenario.
The task we tackle is Text Spotting, which is the problem of recognizing text that appears in unrestricted images (a.k.a. text in the wild) such as traffic signs, commercial ads, or shop names. Current state-of-the-art results on this task are far from those of OCR systems with simple backgrounds.
Existing approaches to Text Spotting usually divide the problem into two fundamental tasks: 1) text detection, which consists of selecting the image regions likely to contain text, and 2) text recognition, which converts the images within these bounding boxes into a readable string. In this work, we focus on the recognition stage, aiming to show that semantic relatedness between the image context and the recognized text can boost system performance. We use existing pre-trained architectures for Text Recognition, and add a shallow deep network that performs a post-processing operation to re-rank the proposed candidate texts. In particular, we re-rank the candidates using their semantic relatedness score with other visual information extracted from the image (e.g. objects, scenario, image caption). Extensive evaluation shows that our approach consistently improves over other semantic similarity methods.
2 Text Hypothesis Extraction
We use two pre-trained Text Spotting baselines to extract text hypotheses. The first baseline is a CNN (Max:16) with a fixed lexicon, able to recognize only the words in its 90K-word dictionary. The second is an LSTM (Suman:17) that generates the final output words as probable character sequences, without relying on any lexicon. Both models are trained on a synthetic dataset (Max:14b). The output of both models is a Softmax score for each of the candidate words.
3 Learning Semantic Relatedness for Text Spotting
To learn the semantic relatedness between the visual context information and the candidate word we introduce a multi-channel convolutional LSTM with an attention mechanism. The network is fed with the candidate word plus several words describing the image visual context (object and place labels, and descriptive captions), and is trained to produce a relatedness score between the candidate word and the context. All this visual context information is automatically generated using existing off-the-shelf modules (see Section 4).
Our architecture is inspired by (Aliaksei:15), who proposed CNN-based re-rankers for Q&A. Our network consists of two subnetworks, each with 4 channels and multiple kernel sizes, and an overlap layer, as shown in Figure 1. We next describe the main components:
Multi-Channel Convolution: The first subnetwork consists of only convolution kernels, and aims to extract n-gram or keyword features from the caption sequence.
The convolution is applied over a sequence to extract n-gram features from different positions. Let $X \in \mathbb{R}^{L \times d}$ be the sentence matrix, where $L$ is the sentence length, and $d$ the dimension of the $i$-th word $x_i$ in the sentence. Let us also denote by $m \in \mathbb{R}^{k \times d}$ the kernel for the convolution operation. For each $j$-th position in the sentence, $w_j$ is the concatenation of $k$ consecutive words, i.e., $w_j = [x_j, x_{j+1}, \ldots, x_{j+k-1}]$. Our architecture uses multiple such kernels to generate feature maps $c \in \mathbb{R}^{L-k+1}$. The feature map $c_j$ for each window vector $w_j$ can be written as:

$$c_j = f(w_j \circ m + b)$$

where $\circ$ is element-wise multiplication, $f$ is a nonlinear function, in our case the ReLU function (Vinod:10), and $b$ is a bias. For $n$ kernels, the generated $n$ feature maps can be arranged as a feature representation for each window $w_j$ as: $W = [c_1; c_2; \ldots; c_n]$. Each row $W_j$ of $W \in \mathbb{R}^{(L-k+1) \times n}$ is the new generated feature from the $n$ kernels for the window vector at position $j$. The new generated features (window representations) are then fed into the joint layer and LSTM as shown in Figure 1.
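As a rough illustration of the n-gram convolution described above, the following sketch computes one feature map per kernel and then arranges the maps into per-window representations. It uses plain Python lists instead of a deep learning framework, and the function names, toy embeddings and kernel values are assumptions made for this example, not the actual implementation. As in standard convolution, the element-wise products inside each window are summed before the bias and ReLU are applied.

```python
def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)


def conv_feature_map(sentence, kernel, bias=0.0):
    """Slide one kernel (k x d) over a sentence matrix (L x d).

    Returns a feature map with L - k + 1 entries, one per window of
    k consecutive word vectors.
    """
    k = len(kernel)
    feature_map = []
    for j in range(len(sentence) - k + 1):
        window = sentence[j:j + k]
        # Element-wise product of window and kernel, summed to a scalar.
        total = sum(
            w * m
            for w_row, m_row in zip(window, kernel)
            for w, m in zip(w_row, m_row)
        )
        feature_map.append(relu(total + bias))
    return feature_map


def window_representations(sentence, kernels, biases):
    """Stack the feature maps of n same-length kernels.

    Row j of the result holds the n features for the window at position j.
    """
    maps = [conv_feature_map(sentence, m, b) for m, b in zip(kernels, biases)]
    return [list(row) for row in zip(*maps)]


# Toy input: 4 words with 2-dimensional embeddings (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],    # bigram kernel (k = 2)
    [[-1.0, 0.0], [0.0, -1.0]],  # its negation, zeroed out by the ReLU
]
W = window_representations(sentence, kernels, biases=[0.0, 0.0])
```

With a sentence of length 4 and kernels of length 2, `W` has 3 rows (one per window) and 2 columns (one per kernel). No pooling is applied, so the window order is preserved for the layers that follow.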
Multi-Channel Convolution-LSTM: Following C-LSTM (Chunting:15) we forward the output of the CNN layers into an LSTM, which captures the long-term dependencies over the features. We further introduce an attention mechanism to capture the most important features from that sequence. The advantage of this attention is that the model learns the sequence without relying on the temporal order. We describe the attention mechanism in more detail below.
Also, following (Chunting:15), we do not use a pooling operation after the convolution feature map. A pooling layer is usually applied after the convolution layer to extract the most important features in the sequence. However, the output of our Convolutional-LSTM model is fed into an LSTM (Sepp:97) to learn the extracted sequence, and a pooling layer would break that sequence by downsampling it to a selected feature. In short, the LSTM is specialized in learning sequence data, and a pooling operation would break the sequence order. For the Multi-Channel Convolution model, we likewise learn the extracted n-gram word sequence directly, without feature selection by a pooling operation.
Attention Mechanism: Attention-based models have shown promising results on various NLP tasks (Dzmitry:14). Such a mechanism learns to focus on a specific part of the input (e.g. a relevant word in a sentence). We apply an attention mechanism (Colin:15) via an LSTM that captures the temporal dependencies in the sequence. This attention uses a Feed Forward Neural Network (FFN) attention function:

$$e_t = \tanh(h_t W_a)\, v_a$$

where $W_a$ is the attention of the hidden weight matrix and $v_a$ is the output vector. As shown in Fig. 1 the vector $c$ is computed as a weighted average of $h_t$, given by $\alpha_t$ (defined below). The attention mechanism is used to produce a single vector $c$ for the complete sequence as follows:

$$e_t = a(h_t), \qquad \alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T} \exp(e_k)}, \qquad c = \sum_{t=1}^{T} \alpha_t h_t$$

where $T$ is the total number of time steps, $\alpha_t$ is the computed weight of each time step $t$ for each state $h_t$, and $a$ is a learnable function that depends only on $h_t$. Since this attention computes the average over time, it discards the temporal order, which is ideal for learning semantic relations between words. By doing this, the attention gives higher weights to more important words in the sentence without relying on sequence order.
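The attention pooling described above can be sketched in a few lines of plain Python: each hidden state receives a scalar score, the scores are normalized with a softmax over time, and the states are averaged with those weights. The tanh scoring function, the parameter names `W_a` and `v_a`, and all numeric values are assumptions chosen for illustration. In the actual model the scorer is learned.

```python
import math


def make_tanh_scorer(W_a, v_a):
    """Build a scoring function h -> tanh(h W_a) . v_a (assumed form)."""
    def score(h):
        projected = [
            math.tanh(sum(h[i] * W_a[i][j] for i in range(len(h))))
            for j in range(len(v_a))
        ]
        return sum(p * v for p, v in zip(projected, v_a))
    return score


def attention_pool(hidden_states, score_fn):
    """Collapse a sequence of hidden states into one context vector.

    Returns (context, alphas), where alphas are the softmax-normalized
    weights over time steps.
    """
    scores = [score_fn(h) for h in hidden_states]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(e - max_score) for e in scores]
    norm = sum(exps)
    alphas = [x / norm for x in exps]
    dim = len(hidden_states[0])
    context = [
        sum(a * h[i] for a, h in zip(alphas, hidden_states))
        for i in range(dim)
    ]
    return context, alphas


# Hypothetical hidden states for a 3-step sequence (dimension 2).
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scorer = make_tanh_scorer(W_a=[[0.5, -0.5], [0.25, 0.75]], v_a=[1.0, -1.0])
context, alphas = attention_pool(states, scorer)
context_reversed, _ = attention_pool(list(reversed(states)), scorer)
```

Because each score depends only on its own hidden state, reversing the sequence yields the same context vector up to floating-point rounding. This is the order-independence property noted in the text.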
The overlap layer is just a frequency count dictionary that computes overlap information of the inputs. The idea is to give more weight to the most frequent visual element, especially when it is observed by more than one visual classifier. The dictionary output is a fully connected layer.
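One possible reading of the frequency count dictionary is sketched below. The function name, the lower-casing and whitespace tokenization, and the example labels are all assumptions made for this illustration. The sketch only counts how often each token appears across the visual context sources, so that elements reported by several classifiers get a higher count.

```python
from collections import Counter


def overlap_counts(object_labels, place_labels, caption):
    """Count how often each token occurs across the visual context inputs."""
    counts = Counter()
    counts.update(label.lower() for label in object_labels)
    counts.update(label.lower() for label in place_labels)
    counts.update(caption.lower().split())
    return counts


# Hypothetical outputs of the object classifier, place classifier
# and caption generator for one image.
counts = overlap_counts(
    object_labels=["bakery"],
    place_labels=["bakery", "shop"],
    caption="a bakery with bread on display",
)
most_frequent, frequency = counts.most_common(1)[0]
```

In this toy case `bakery` is observed by all three sources, so it receives the highest count and would be weighted most strongly.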
Finally, we merge all subnetworks into a joint layer that is fed to a loss function which calculates the semantic relatedness between both inputs. We call the combined model Fusion Dual Convolution-LSTM-Attention (FDCLSTM).
Since we have only one candidate word at a time, we apply a convolution with masking on the candidate word side (first channel). In this case, simply zero-padding the sequence has a negative impact on the learning stability of the network. We concatenate the CNN outputs with the additional feature and feed them into MLP layers, followed by a sigmoid layer performing binary classification. We trained the model with a binary cross-entropy loss, where the target value (in $[0, 1]$) is the semantic relatedness between the word and the visual context. Instead of restricting ourselves to a simple similarity function, we let the network learn the margin between the two classes, i.e. the degree of similarity. For this, we increase the depth of the network after the MLP merge layer with more fully connected layers. The network is trained using Nesterov-accelerated Adam (Nadam) (Timothy:16), as it yields better results (especially in cases such as word vectors/neural language modelling) than other optimizers using only classical momentum (Adam). We apply batch normalization (BN) (Sergey:15) after each convolution and between the MLP layers. We omitted the BN after the convolution for the model without attention (FDCLSTM), as BN deteriorated the performance. Additionally, we use 70% dropout (Nitish:14) between the MLP layers for regularization purposes.
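For reference, the binary cross-entropy objective mentioned above can be written out directly. The sketch below is a framework-free version using the standard definition of the loss. The clipping constant `eps` and the example values are assumptions for illustration, not settings taken from the experiments.

```python
import math


def sigmoid(z):
    """Map a raw network output to a relatedness score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy between targets and predicted scores."""
    total = 0.0
    for y, p in zip(targets, predictions):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(targets)


# A related pair (target 1) scored confidently vs. poorly.
loss_good = binary_cross_entropy([1.0], [sigmoid(2.0)])
loss_bad = binary_cross_entropy([1.0], [sigmoid(-2.0)])
```

The loss is smaller when the sigmoid output agrees with the target, which is what drives the network to separate related from unrelated word-context pairs.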
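Table 1:

| Model | CNN full | CNN dict | CNN list | CNN k | CNN MRR | LSTM full | LSTM list | LSTM k | LSTM MRR |
|---|---|---|---|---|---|---|---|---|---|
| Baseline (BL) | 19.7 | 56.0 | - | - | - | 17.9 | - | - | - |
| BL+Glove (Jeffrey:14) | 22.0 | 62.5 | 75.8 | 7 | 44.5 | 19.1 | 75.3 | 4 | 78.8 |
| BL+Attentive LSTM (Ming:16) | 21.9 | 62.4 | 74.0 | 8 | 45.7 | 19.1 | 71.4 | 5 | 80.2 |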
4 Dataset and Visual Context Extraction
We evaluate the performance of the proposed approach on the COCO-text dataset (Andreas:16). This dataset is based on Microsoft COCO (Tsung-Yi:14) (Common Objects in Context), and consists of 63,686 images and 173,589 text instances (annotations of the images). The dataset does not include any visual context information, so we used out-of-the-box object (Kaiming:16) and place (Bolei:14) classifiers, and tuned a caption generator (Oriol:15) on the same dataset, to extract contextual information from each image, as seen in Figure 2.
5 Related Work and Contribution
Understanding the visual environment around the text is very important for scene understanding. This has recently been explored by a relatively small number of works. Anna:16 shows that visual context can be beneficial for text detection. That work uses a 14-class pixel classifier to extract context features from the image, such as tree, river, or wall, which then assist scene text detection. Chulmoo:17 employs topic modeling to learn the correlation between visual context and the spotted text in social media. The metadata associated with each image (e.g. tags, comments and titles) is then used as context to enhance recognition accuracy. Sezer:17 takes advantage of text and visual context for the logo retrieval problem. Most recently, Shitala:18 use object information (limited to 42 predefined object classes) surrounding the spotted text to guide text detection. They propose two sub-networks to learn the relation between text and object class (e.g. relations such as car–plate or sign board–digit).
Unlike these methods, our approach uses direct visual context from the image where the text appears, and does not rely on any extra resource such as human labeled meta-data (Chulmoo:17) nor limits the context object classes (Shitala:18). In addition, our approach is easy to train and can be used as a drop-in complement for any text-spotting algorithm that outputs a ranking of word hypotheses.
6 Experiments and Results
In the following we use different similarity or relatedness scorers to reorder the $k$-best hypotheses produced by an off-the-shelf state-of-the-art text spotting system. We experimented with extracting $k$-best hypotheses for several values of $k$.
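The re-ranking step can be pictured with a small sketch. Everything specific here is an assumption for illustration: the function names, the toy hypotheses and scores, and in particular the choice to combine the baseline probability and the relatedness score by a simple product. This is not necessarily the combination used in the experiments.

```python
def rerank(hypotheses, relatedness, context, k):
    """Re-order the top-k baseline hypotheses by a combined score.

    hypotheses: list of (word, baseline_probability), best first.
    relatedness: callable (word, context) -> score in [0, 1].
    The product of both scores is an assumed combination rule.
    """
    top_k = hypotheses[:k]
    rescored = [(word, prob * relatedness(word, context)) for word, prob in top_k]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical relatedness scores between candidate words and the context.
toy_scores = {"barkery": 0.1, "bakery": 0.9, "bank": 0.3}


def toy_relatedness(word, context):
    """Stand-in for a learned scorer; ignores the context in this toy case."""
    return toy_scores.get(word, 0.0)


baseline_output = [("barkery", 0.4), ("bakery", 0.3), ("bank", 0.2)]
reranked = rerank(baseline_output, toy_relatedness, context=["bread", "shop"], k=3)
top1_only = rerank(baseline_output, toy_relatedness, context=["bread", "shop"], k=1)
```

With `k=3` the correct word `bakery` moves to the top, while with `k=1` the re-ranker cannot change the baseline decision. This is why the size of the hypothesis list matters.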
We use two pre-trained deep models: a CNN (Max:16) and an LSTM (Suman:17) as baselines (BL) to extract the initial list of word hypotheses.
The CNN baseline uses a closed lexicon; therefore, it cannot recognize any word outside its 90K-word dictionary. Table 1 presents four different accuracy metrics for this case: 1) full columns correspond to the accuracy on the whole dataset. 2) dict columns correspond to the accuracy over the cases where the target word is among the 90K words of the CNN dictionary (which correspond to 43.3% of the whole dataset). 3) list columns report the accuracy over the cases where the right word was among the $k$-best produced by the baseline. 4) MRR columns report the Mean Reciprocal Rank (MRR), which is computed as follows: $\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$, where $|Q|$ is the number of evaluated words and $\mathrm{rank}_i$ is the position of the correct answer in the hypotheses list proposed by the baseline.
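The MRR metric follows the standard definition and can be computed as in the sketch below. The function name and the toy hypothesis lists are illustrative assumptions, and treating a missing target as contributing zero is also an assumption of this sketch.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the correct word over all evaluated cases.

    ranked_lists: one ranked hypothesis list per test case, best first.
    targets: the correct word for each test case.
    """
    total = 0.0
    for hypotheses, target in zip(ranked_lists, targets):
        if target in hypotheses:
            rank = hypotheses.index(target) + 1  # ranks start at 1
            total += 1.0 / rank
    return total / len(targets)


# Toy example: correct word at rank 1, at rank 2, and missing from the list.
ranked_lists = [["exit", "exist"], ["stop", "shop"], ["bank", "back"]]
targets = ["exit", "shop", "bakery"]
mrr = mean_reciprocal_rank(ranked_lists, targets)
```

Here the three cases contribute 1, 1/2 and 0, so the resulting MRR is 0.5.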
Comparing with sentence level model: We compare the results of our encoder with several state-of-the-art sentence encoders, tuned or trained on the same dataset. We use cosine similarity between the caption and the candidate word. Word-to-sentence representations are computed with the Universal Sentence Encoder with the Transformer, USE-T (Daniel:18), and InferSent (Alexis:17) with GloVe (Jeffrey:14). The rest of the systems in Table 1 are trained under the same conditions as our model, with GloVe initialization and dual-channel overlapping non-static pre-trained embeddings on the same dataset. Our model FDCLSTM without attention achieves a better result in the case of the second baseline, the LSTM, whose output is full of false positives and short words. The advantage of the attention mechanism is its ability to integrate information over time, which allows the model to refer to specific points in the sequence when computing its output. However, in this case, the attention attends to the wrong context, as many of the words have no correlation with it or do not correspond to actual words. On the other hand, USE-T seems to require a shorter hypothesis list to reach top performance when the right word is in the hypothesis list.
Comparing with word level model: We also compare our results with current state-of-the-art word embeddings trained on a large general text corpus using GloVe and fastText. The word model used only object and place information, and ignored the caption. Our proposed models achieve better performance than our previous TWE model (Ahmed:18), which trained a word embedding (Tomas:13) from scratch on the same task.
Similarity to probabilities: After computing the cosine similarity we need to convert that score to probabilities. As we proposed in previous work (Ahmed:18), we obtain the final probability by combining (Sergey:03) the similarity score, the probability of the detected context (provided by the object/place classifier), and the probability of the candidate word (estimated from a 5M token corpus) (Pierre:16).
Effect of Unigram Probabilities: Suman:17 showed the utility of a language model (LM) when the data is too small for a DNN, obtaining significant improvements. Thus, we introduce a basic model of unigram probabilities with smoothing for out-of-vocabulary (OOV) words. The model is applied at the end, with the main goal of demoting less probable words that were over-ranked by the deep model, in particular false-positive short words. As seen in Table 1, the introduction of this unigram lexicon produces the best results.
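A unigram model with OOV smoothing can be sketched as follows. The text does not state which smoothing scheme is used, so add-one (Laplace) smoothing with a single reserved OOV slot is an assumption of this example, as are the function name and the toy corpus.

```python
from collections import Counter


def build_unigram_model(tokens):
    """Return a function word -> probability with add-one smoothing.

    One extra vocabulary slot is reserved for out-of-vocabulary words,
    so unseen words get a small non-zero probability (assumed scheme).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # +1 for the OOV slot

    def probability(word):
        return (counts.get(word, 0) + 1) / (total + vocab_size)

    return probability


# Tiny hypothetical corpus; the real model would use a large text corpus.
unigram = build_unigram_model(["the", "the", "shop", "bakery"])
p_frequent = unigram("the")   # seen twice
p_rare = unigram("shop")      # seen once
p_oov = unigram("xqz")        # never seen
```

Multiplying or otherwise combining such a prior with the re-ranker score penalizes strings that are unlikely to be real words, which matches the stated goal of demoting over-ranked false positives.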
Human performance: To estimate an upper bound for the results, we picked 33 random pictures from the test dataset and had 16 human subjects try to select the right word among the top candidates produced by the baseline text spotting system. Our proposed model performance on the same images was 57%. Average human performance was 63% (highest 87%, lowest 39%).
In this work, we propose a simple deep learning architecture to learn semantic relatedness between word-to-word and word-to-sentence pairs, and show how it outperforms other semantic similarity scorers when used to re-rank candidate answers in the Text Spotting problem.
In the future, we plan to use the same approach to tackle similar problems, including lexical selection in Machine Translation, or word sense disambiguation (Chiraag:18). We believe our approach could also be useful in multimodal machine translation, where an image caption must be translated using not only the text but also the image content (barrault2018findings). Tasks that lie at the intersection of computer vision and NLP, such as the challenges posed in the new BreakingNews dataset (popularity prediction, automatic text illustration), could also benefit from our results (Ramisa_pami2018).
We would like to thank José Fonollosa, Marta Ruiz Costa-Jussà and the anonymous reviewers for discussion and feedback. This work was supported by the KASP Scholarship Program and by the MINECO project HuMoUR TIN2017-90086-R.