Written communication is arguably one of the most important human inventions, as it allows information to be transmitted explicitly. Moreover, given that text is omnipresent in man-made scenarios [veit2016coco, karatzas2015icdar], and given the implicit relation between visual information and scene text instances, the design of holistic computer vision models for scene interpretation is fundamental.
With the purpose of designing a holistic model, in this work we leverage textual information applied to the problem of fine-grained classification and image retrieval. Fine-grained classification tackles the problem of classifying different object instances that are visually similar and difficult to discriminate. The complexity of this task lies in finding discriminative features, which often require domain-specific knowledge [maji2013fine, xiao2015application].
An early work that demonstrated the importance of text (domain-specific knowledge) for fine-grained storefront classification was put forward by Movshovitz [movshovitz2015ontological], in which the trained classifier automatically learned to attend to text found in an image as the sole way of solving the task. Since then, additional research has explicitly combined textual and visual cues; the works of Karaoglu et al. [karaoglu2013text, karaoglu2017words] and Bai et al. [bai2018integrating] are the most closely related to ours. In this work, we propose using the state-of-the-art text retrieval model presented by Gomez [Gomez_2018_ECCV] to detect scene text and obtain its Pyramidal Histogram of Characters (PHOC). We use the PHOC descriptors extracted from images and explore different fusion strategies to merge the visual and textual modalities. Additionally, we construct a Fisher Vector (FV) encoding from the obtained PHOCs to be used as a fixed-length text feature in our pipeline and further improve the classifier results. Our model leverages the visual features combined with the morphology of the words (refer to Figure 1) that belong to specific fine-grained classes, without the need to understand them semantically. Contrary to previous methods, this approach is especially useful when dealing with text recognition errors and named entities, which are often difficult to encode in a purely semantic space. The combination of these two modalities produces an output probability vector that addresses the classification task at hand. As an additional application, we evaluate the proposed model on fine-grained image retrieval in available datasets. Overall, the main contributions of our work are:
We propose a novel architecture that achieves state of the art on fine-grained classification by considering text and visual features of an image.
We show that by using Fisher Vectors obtained from PHOCs of scene text, we obtain a more robust representation in which words with similar structure get encoded on the same Gaussian component, thus creating a more powerful discriminative descriptor than PHOCs alone.
We provide exhaustive experiments in which we compare the performance of different alternative modules in our model and previous state of the art.
2 Related Work
2.1 Scene Text Detection and Recognition
Even though deep learning has made significant progress [lecun2015deep], localizing and recognizing text in images remains an open problem in the computer vision community due to the ample variety of text occurrences in natural images [zhu2016scene]. Essentially, a system capable of reading text requires two steps: detection and recognition. Jaderberg et al. [jaderberg2016reading] tackle this problem by generating text proposals that are refined by a CNN. The resulting bounding boxes are used as input to another CNN trained to classify them according to a fixed dictionary. In another work, [gupta2016synthetic] defined a Fully Convolutional Regression Network to detect text by regressing bounding boxes, and employed the same classification network as [jaderberg2016reading] for text recognition. More recent approaches use customized variations of object detectors such as [kim2016pvanet] and [liu2016ssd], fine-tuned to detect text instances, resulting in the models proposed by [zhou2017east] and [liao2017textboxes, liao2018textboxes++]. Recently, the community has devoted additional effort to the development of end-to-end models. The underlying notion is that features that help to improve detection are also useful when recognizing text instances. He et al. [he2018end] use a CNN to extract proposals, which are fed into an LSTM (Long Short-Term Memory) that refines the bounding boxes; these are later employed as input to yet another LSTM that performs recognition. In parallel, additional work has been conducted on multilingual scene text recognizers, such as the work of [buvsta2018e2e], which consists of two CNNs. The first one is optimized to detect text and the second one employs a Connectionist Temporal Classification (CTC) [graves2006connectionist] module for recognition, with both trained in an end-to-end manner.
In this work, we leverage the Pyramidal Histogram Of Characters (PHOC) descriptor [almazan2014word, sudholt2017learning] (see Figure 3) commonly used to query a given text instance in handwritten documents and natural scene images. The PHOC of a word encodes the position of a specific character in a particular spatial region of the detected text instance. Such a descriptor has proven to perform as the state of the art in scene text retrieval [Gomez_2018_ECCV], and as our experiments show, encoding it with the Fisher Vector [perronnin2007fisher] provides an improved text descriptor for fine-grained classification.
2.2 Fine-Grained Classification
Recent works on fine-grained classification base their approach on localizing salient parts of an image [deng2013fine, yang2012unsupervised], and use the saliency maps to classify the objects. Later approaches such as the one of Tang [tang2017learning], use a weakly supervised method to find discriminative features and leverage them to perform the classification between similar instances. Other methods use existing prior knowledge from unstructured text to propose a semantic embedding that differentiates similar classes [xu2018fine]. A self-supervision method is introduced in [yang2018learning] that learns to propose significant image regions to find inter-class discriminative features.
More related to our work, [karaoglu2017words] tackles this task by extracting visual features with a pre-trained GoogLeNet [szegedy2015going] and using a Bag of Words feature to represent the text instances found in an image and further classify them. More recently, Bai et al. [bai2018integrating] use a similar approach and extract visual features using a GoogLeNet and a combination of two models: [liao2017textboxes] to detect and [shi2017end] to recognize text. The text found is represented as GloVe features [pennington2014glove], a word embedding that is further used with attention on the visual features to find a semantic relation between the two modalities to classify the image.
2.3 Multimodal Fusion
The combination of different modalities provides a richer content description than one modality alone; the contained knowledge should therefore be leveraged to further exploit explicit information according to the task [srivastava2012multimodal]. In this work we explore other fusion methods used in multimodal learning that show a performance increase, especially in tasks that require exploiting two modalities such as Visual Question Answering (VQA) and Visual Relationship Detection (VRD).
One of the initial works, presented by [ben2017mutan], modeled a Tucker decomposition of the bilinear interaction of two distinct modalities. Later, a Multimodal Low-rank Bilinear Attention Network (MLB) was proposed by [kim2016hadamard], in which the fusion of two modalities is based on a low-rank bilinear pooling operation using the Hadamard product along with an attention mechanism. A factorized bilinear pooling (MFB) is proposed by [yu2017multi], where each third-mode section of the tensor is constrained by a rank. Later, a Multimodal Factorized High-order pooling (MFH) fusion was presented by [yu2018beyond], which uses a high-order fusion formed by cascaded MFB modules. In the work conducted by [ben2017mutan], a bilinear pooling is performed where the tensor is represented as a Tucker decomposition. The obtained main tensor has the same rank constraint as the MFB technique. Lately, a Multimodal Bilinear Superdiagonal Block (Block) fusion strategy, presented by [ben2019block], has achieved state-of-the-art results in VQA and VRD.
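To make the low-rank bilinear idea concrete, the following is a minimal NumPy sketch of a Hadamard-product fusion in the spirit of [kim2016hadamard]. The function name, the matrix names `u`, `v`, `p` and all shapes are assumptions chosen for illustration, and the attention mechanism of the original method is omitted.

```python
import numpy as np

def low_rank_bilinear(x, y, u, v, p, bias=0.0):
    """Fuse two modality vectors with a low-rank bilinear form (illustrative).

    x: (dx,) first modality, y: (dy,) second modality.
    u: (dx, r) and v: (dy, r) project both inputs to a shared rank-r space.
    p: (r, dout) maps the element-wise product to the fused output.
    """
    # Hadamard product of the two non-linear projections, shape (r,)
    joint = np.tanh(x @ u) * np.tanh(y @ v)
    return joint @ p + bias
```

The rank `r` controls the number of parameters: a full bilinear form would need a `dx * dy * dout` tensor, whereas this factorization needs only `(dx + dy + dout) * r` weights.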
3 Proposed Model
The devised model consists mainly of four processing blocks: visual feature extraction, textual feature extraction, an attention unit and classification. The whole model pipeline is shown in Figure 2.
The first block extracts the visual features from a given image and produces a fixed-size representation of it. The second block extracts the PHOC representation of each text instance found in an image and uses a pre-trained Gaussian Mixture Model (GMM) to obtain the corresponding FV descriptor. The third block consists of an attention unit that multiplies learned weights with the encoded FV depending on the visual features extracted previously. Finally, the last block consists of a concatenation of the two different modalities followed by a fully connected layer to obtain a probability output vector which is used for classification. For the rest of the paper, let be the set of all possible categories in a given dataset; be the set of images; be the labelling function.
3.1 Visual Features
In our model, we use a Convolutional Neural Network (CNN) [he2016deep] pre-trained on ImageNet [deng2009imagenet] as a visual feature extractor, denoted as . We use the output of the last convolutional block of before the last average pooling layer as the visual features, denoted as . Attention on visual features has proven to yield improved performance on several tasks. Following [dey2019doodle], we use a soft-attention mechanism because it is differentiable, thus allowing end-to-end learning. The proposed attention function learns an attention mask which assigns weights to different regions of an image given a feature map . The attention mask is learned by applying convolution layers on the output features from the CNN. Lastly, to obtain the final output of the attention module along with the visual features, the operation is computed by .
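As an illustration of soft attention over a convolutional feature map, the sketch below scores every spatial location with a single 1x1 convolution, normalizes the scores with a softmax, and reweights the features. The single-layer scoring and the residual combination `features + features * mask` are assumptions made for this example; the exact operation used in the model is not recoverable from this text.

```python
import numpy as np

def soft_attention(features, conv_w, conv_b=0.0):
    """Apply a spatial soft-attention mask to a feature map (illustrative).

    features: (C, H, W) feature map from the CNN.
    conv_w:   (C,) weights of one 1x1 convolution producing a score per location.
    """
    # A 1x1 convolution is a weighted sum over channels at every location.
    scores = np.tensordot(conv_w, features, axes=(0, 0)) + conv_b  # (H, W)
    scores = scores - scores.max()            # numerical stability
    mask = np.exp(scores) / np.exp(scores).sum()  # softmax over all locations
    # Assumed residual combination: keep the original signal, boost attended regions.
    attended = features + features * mask[None, :, :]
    return attended, mask
```

Every step is differentiable, which is what allows such a mask to be learned jointly with the rest of the network.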
3.2 Textual Features
Methods shown in previous works [karaoglu2017words, bai2018integrating] have three main drawbacks. First, the employed text recognizers are bound to a fixed dictionary, which may or may not include the exact words that are present in the image. Second, some words, such as license plates, brand names and acronyms, may be contained in the fixed recognition dictionary yet not exist in the proposed semantic embedding (GloVe, Word2Vec). Third, any mistake made by the recognizer will yield a vector embedding that lies far from the semantic embedding of the correct word. Conversely, correct recognition of semantically similar words that might indicate different fine-grained classes will lead to embeddings close to each other, which are not discriminative enough to perform correct classification. This is the case for semantically similar words such as restaurant and steakhouse, cafe and bistro, or coke and pepsi, among other sample classes from the datasets used.
In order to exploit the morphology of a word to obtain discerning features, we employ the PHOC representation. The PHOC representation employed in this work is composed of the concatenation of vectors from the levels to plus the most common bi-grams in the English language. This yields a 604-dimensional discrete binary vector that represents the characters contained in a word (see Figure 3). A dictionary given by [jaderberg2016reading] is employed to obtain a PHOC per word; in this way, we populate a matrix of this compact representation. In order to reduce the dimensionality and to find linearly uncorrelated variables of this compact vector, a Principal Component Analysis (PCA) is performed. This procedure yields a more compact yet still informative vectorial representation of a given word.
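The following pure-Python sketch shows how the unigram part of a PHOC can be built from a word string. It follows the usual construction of [almazan2014word], where a character is assigned to a region when at least half of the character's extent falls inside it. The pyramid levels 2 to 5 and the 36-symbol alphabet (lowercase letters and digits) are assumptions made here; the bigram part is omitted, so the sketch produces 504 dimensions rather than the 604 of the full descriptor.

```python
import string

ALPHABET = string.ascii_lowercase + string.digits  # assumed 36-symbol alphabet
LEVELS = (2, 3, 4, 5)                              # assumed pyramid levels

def phoc(word, levels=LEVELS, alphabet=ALPHABET):
    """Return the binary unigram PHOC of a word as a list of 0/1 values."""
    word = word.lower()
    n = len(word)
    vec = []
    for level in levels:
        for region in range(level):
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                if ch not in alphabet:
                    continue
                # Work in integer units of 1 / (n * level) to avoid float error:
                # character i spans [i*level, (i+1)*level],
                # region r spans [r*n, (r+1)*n].
                overlap = min((region + 1) * n, (i + 1) * level) - max(region * n, i * level)
                # Assign the character if at least half of it lies in the region.
                if 2 * overlap >= level:
                    bits[alphabet.index(ch)] = 1
            vec.extend(bits)
    return vec
```

Because the descriptor depends only on which characters occur where, a single misread character changes only a few bits, whereas a semantic embedding of the misread string could land far from the intended word.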
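The obtained data points were used to construct a Gaussian Mixture Model (GMM) [gregor1969algorithm] formed by $K$ Gaussian components. We denote the parameters of the $K$-component GMM by $\lambda = \{w_k, \mu_k, \Sigma_k,\ k = 1, \dots, K\}$, where $w_k$, $\mu_k$ and $\Sigma_k$ are respectively the mixture weight, mean vector and covariance matrix of Gaussian $k$. We define:
$$p_\lambda(x) = \sum_{k=1}^{K} w_k\, p_k(x),$$
where $p_k$ denotes Gaussian $k$, with $D$ the dimensionality of $x$:
$$p_k(x) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\},$$
and we require:
$$\forall k:\ w_k \ge 0, \qquad \sum_{k=1}^{K} w_k = 1.$$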
Once the GMM model is trained, it will be used to extract a single Fisher Vector representation per image which encodes its contained textual information. The textual features per image are obtained by using the model from [Gomez_2018_ECCV]. Given an input image, the model outputs a list of bounding boxes, each one containing a confidence score and a PHOC prediction.
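We get the top- object proposals set . The resulting PHOCs , where is the dimensionality of the PHOC embedding obtained and the recognized words embedded in the PHOC space. It is essential to note that the model from [Gomez_2018_ECCV] is able to generalize and construct PHOCs from previously unseen samples, out-of-vocabulary words and different languages that employ a similar character set (e.g. Latin), making it suitable for the task at hand. Afterwards, we project each embedded textual instance of the obtained descriptors into a reduced dimensional space by employing PCA. The resulting vectors are used to obtain the Fisher Vector [perronnin2007fisher] from the previously trained GMM. The GMM associates each PCA-reduced vector $x_i$ to a component $k$ in the mixture model with a weight given by the posterior probability:
$$q_{ik} = \frac{w_k\, p_k(x_i)}{\sum_{t=1}^{K} w_t\, p_t(x_i)},$$
where $w_k$ and $p_k$ are the mixture weight and the density of Gaussian $k$. For each mode $k$, consider the mean and the covariance deviation vectors
$$u_{jk} = \frac{1}{N \sqrt{w_k}} \sum_{i=1}^{N} q_{ik}\, \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}, \qquad v_{jk} = \frac{1}{N \sqrt{2 w_k}} \sum_{i=1}^{N} q_{ik} \left[ \left( \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}} \right)^{2} - 1 \right],$$
where $j = 1, 2, \dots, D$ spans the vector dimensions, $N$ is the number of vectors extracted from the image, and $\mu_{jk}$ and $\sigma_{jk}^2$ are the $j$-th entries of the mean and of the (diagonal) covariance of Gaussian $k$. The FV of a given image is simply the concatenation of the obtained vectors $u_k$ and $v_k$ for each of the $K$ components in the Gaussian mixture model.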
The FV and the GMM encode inherently similar information. This takes place because they both include statistics of order , and [sanchez2013image, perronnin2007fisher]. However, the FV provides a vectorial representation which is more compact, faster to compute and suitable for processing. The dimension of the FV obtained is given by $2KD$, where $D$ is the PHOC dimension after performing the PCA and $K$ is the number of Gaussian clusters. The intuition captured by the FV is to compute the gradient of a PHOC sample (bag of textual features) that shows the probability of belonging to each of the Gaussian components, which can be understood as a probabilistic textual vocabulary based on its morphological structure (see Figure 1).
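The encoding step described in this section can be sketched in NumPy as follows, assuming a GMM with diagonal covariances whose parameters have already been estimated. The function name and argument layout are hypothetical, and normalization steps sometimes applied to Fisher Vectors (power and L2 normalization) are deliberately left out.

```python
import numpy as np

def fisher_vector(x, weights, means, sigmas):
    """Encode a set of descriptors as a Fisher Vector (illustrative sketch).

    x:       (N, D) PCA-reduced PHOCs of one image.
    weights: (K,)   mixture weights, positive and summing to one.
    means:   (K, D) component means.
    sigmas:  (K, D) per-dimension standard deviations (diagonal covariance assumed).
    Returns a vector of length 2 * K * D.
    """
    n = x.shape[0]
    # Standardized differences to every component, shape (N, K, D).
    diff = (x[:, None, :] - means[None, :, :]) / sigmas[None, :, :]
    # Log of weight * Gaussian density; the (2*pi)^(D/2) constant cancels below.
    log_post = (np.log(weights)[None, :]
                - np.sum(np.log(sigmas), axis=1)[None, :]
                - 0.5 * np.sum(diff ** 2, axis=2))
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    q = np.exp(log_post)
    q /= q.sum(axis=1, keepdims=True)                 # posteriors, shape (N, K)
    # Posterior-weighted mean and variance deviations, each of shape (K, D).
    u = np.einsum("nk,nkd->kd", q, diff) / (n * np.sqrt(weights)[:, None])
    v = np.einsum("nk,nkd->kd", q, diff ** 2 - 1.0) / (n * np.sqrt(2.0 * weights)[:, None])
    return np.concatenate([u.ravel(), v.ravel()])
```

The output length is independent of the number of input descriptors, which is what makes the result usable as a fixed-length text feature regardless of how many words an image contains.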
3.3 Attention on features
In the proposed fine-grained classification task, we can intuitively state that some recognized text will be more relevant than other text when discriminating similar classes. Therefore, it is important to capture the inner correlation between the textual and visual features. To incorporate this idea into our pipeline, we propose a modified attention mechanism inspired by [you2016image]. The attention mechanism learns a tensor of weights that is used between the visual features and the obtained FV. The implemented attention is defined by:
The resulting tensor , contains a normalized attention vector that is multiplied with the textual features to obtain the final attended textual features .
The obtained attended textual features and the visual features are concatenated, such that the final features are formed by . Finally, the resulting vector serves as input to a final classification layer that outputs the probability of a given class. The proposed network is trained to optimize the cross entropy loss function given by:
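The attention, concatenation and classification steps of this subsection can be sketched as below. This is only one plausible instantiation under stated assumptions: the attention scores are taken to be a linear map of the visual features followed by a softmax, and all parameter names (`w_att`, `w_cls`, `b_cls`) and shapes are hypothetical. The loss is the standard mean cross entropy over a batch.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(visual, text, w_att, w_cls, b_cls):
    """visual: (B, Dv), text: (B, Dt), w_att: (Dv, Dt), w_cls: (Dv + Dt, C)."""
    # Normalized attention vector over the textual dimensions,
    # conditioned on the visual features (assumed form).
    att = softmax(visual @ w_att)                        # (B, Dt)
    attended_text = att * text                           # element-wise reweighting
    fused = np.concatenate([visual, attended_text], axis=1)
    return softmax(fused @ w_cls + b_cls)                # class probabilities (B, C)

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the correct classes."""
    picked = probs[np.arange(len(labels)), labels]
    return -np.mean(np.log(picked + 1e-12))
```

Concatenation keeps the two modalities in separate coordinates of the fused vector, so the final linear layer can weigh visual and textual evidence independently.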
4 Experiments and Results
This section describes the datasets employed and the implementation details, along with an analysis of the results obtained from the experiments conducted.
4.1.1 Con-Text Dataset
Originally presented by [karaoglu2013text], Con-Text is a dataset taken from the ImageNet [deng2009imagenet] "building" and "place of business" sub-categories. It consists of categories with images in total. The classes from this dataset are visually similar (Pizzeria, Restaurant, Diner, Cafe) and require text to successfully perform a fine-grained classification. The dataset was not built for text recognition purposes, thus not all images contain text. A high variability of text size, location and font styles makes text recognition on this dataset a challenging task.
4.1.2 Drink Bottle Dataset
The Drink Bottle dataset, presented by [bai2018integrating], comprises the sub-categories soft drink and alcoholic drink found in ImageNet [deng2009imagenet]. There are 18,488 images divided into 20 categories. The dataset contains several uncommon, occluded, rotated, low-quality and blurred text instances, which increase the difficulty of performing successful text recognition.
4.2 Implementation Details
The visual features of the proposed model are obtained by attending to the output of the last block of the ResNet, before the last average pooling layer. These features are passed through a fully connected layer to down-sample them to a final dimension of . To construct the textual features, a maximum number of PHOC proposals are obtained per image. If fewer PHOC proposals are obtained, a zero padding scheme is employed to fix the size of the input features. The resulting PHOCs are reduced in size through PCA, to obtain features of a dimensionality of.
The Fisher Vector is calculated from the PCA-ed PHOCs by employing a pre-trained Gaussian Mixture Model as it is described in Section 3.2. The trained GMM employs Gaussian components thus yielding a FV of dimension. The obtained textual features are down-sampled by passing them through a fully connected layer to finally obtain a resulting size of before the attention mechanism is computed. The attention between both modalities produces an output vector of , that multiplies the learned weights to the textual features. As the last step, a concatenated vector of the visual and textual features () is used to produce the final classification probability vector.
The network is trained for epochs with the combination of RAdam [liu2019radam] and the Lookahead [zhang2019lookahead] optimizers. The batch size employed in all our experiments is , with a learning rate of , momentum of that decays by every epochs.
4.3 Comparison with the State of the Art
When comparing our method to the current state of the art, the proposed pipeline consistently outperforms previous approaches. The performance of our method is shown in Table 1 (see the Supplementary Material Section for the results of each of the classes found in the Con-Text and Drink Bottle datasets respectively). As can be seen, our method surpasses [bai2018integrating] on the Drink Bottle dataset by a significant margin, although this margin is smaller on the Con-Text dataset. Nonetheless, it is important to note that the method presented by [bai2018integrating] employs two additional classifiers to solve this task, thus relying on an ensemble model. Such approaches require longer training times, as well as more computational resources, since several deep networks need to be trained. Therefore, when compared to the single classifier presented by [bai2018integrating], our model offers a significant improvement. In the upcoming sections, we provide explanations and exhaustive experiments that show the main strengths and advantages of our model.
4.4 Importance of Textual Features
Several baselines of growing complexity were defined in order to assess the effectiveness of the proposed model, discern the added performance of employing textual features along with visual ones, and verify the improvement obtained from using a fusion mechanism.
Visual Only: This baseline assesses the performance of the CNN encoder based on visual features solely. To this end, the 2048 dimensional output features , serve as the input to a fully connected layer according to the number of classes of the evaluated dataset.
Textual Only: We evaluate the performance of two state of the art text recognizers: Textspotter [he2018end] and E2E_MLT [buvsta2018e2e] along with the most confident PHOCs obtained from the model presented by [Gomez_2018_ECCV].
For illustration purposes, Figure 4 shows heat maps obtained by employing the model from [Gomez_2018_ECCV] according to the confidence scores obtained when a text instance is detected. It is important to note that Textspotter [he2018end] is bound to a dictionary to output the final recognized word, whereas the multilingual model E2E_MLT from [buvsta2018e2e] is not. The recognized text is embedded with pretrained versions of GloVe [pennington2014glove], FastText [bojanowski2017enriching] and Word2Vec [mikolov2013distributed], finally outputting tensors of size , which in our experiments . When working with PHOCs, the output vector has a size . As we can observe in Table 2, in the visual only baseline, the ResNet152 CNN [he2016deep] performed better in this task, due to the major expressiveness of the model and the residual block architecture that it is based on.
In the text only baseline, by using standard text recognizers we can observe that the E2E_MLT performs better in the Con-Text dataset, whereas the Textspotter model surpasses E2E_MLT in the Drink Bottle dataset. Nonetheless, both of them are outperformed by employing the PHOCs obtained from [Gomez_2018_ECCV] as the word embedding. This effect is due to the inherent morphological nature of the PHOC embedding.
Overall, the best results in the textual only baseline are obtained by the Fisher Vector computed from the PHOCs. As shown qualitatively in Figure 1, the Gaussian Mixture gracefully captures the morphology of words obtained from PHOCs. Therefore, words with similar syntax are clustered together in the GMM, thus allowing the Fisher Vector to be a powerful descriptor relevant for this task that yields even more discriminative features than other embeddings. It is important to note as well that in our experiments FastText performs better than Word2Vec or GloVe because it can produce embeddings of out-of-vocabulary words while considering character n-grams, which strengthens our conjecture on the importance of text morphology for solving this task.
4.5 Comparison of Models
Extensive experiments were conducted regarding the different combinations of text recognizers, word embeddings and fusion techniques. Table 3 shows the results obtained in both the Con-Text and Drink Bottle datasets.
When introducing fusion techniques to the models, E2E_MLT performs better than Textspotter in Con-Text, thus achieving a higher mAP. The opposite effect is found in the Drink Bottle dataset, in which Textspotter performs better than E2E_MLT. It is interesting to note that the PHOCs obtained perform consistently in both datasets, yielding comparable results to the traditional recognizers employed. Regarding the embedding mechanism utilized, morphological embeddings (FastText, PHOC) work better than purely semantic embeddings due to the discriminative space learned.
We can observe that the usage of fusion techniques usually improves the mAP obtained by each method, aside from the cases when the models employ Fisher Vector features. Nonetheless, in our experiments we have not found a specific fusion technique that generalizes to every tested method. Each fusion technique increases the performance for a specific model, with MFH and Block being slightly more consistent than the others. It is necessary to indicate that employing Fisher Vector features obtained from PHOCs consistently achieves the best performance across both datasets.
In order to assess the efficacy of using the Fisher Vector along with another embedding that captures out-of-vocabulary words while at the same time considering the character morphology, we employ the Fisher Vector obtained from FastText. FastText employs character n-grams to construct a relevant vectorial representation of a word, thus it also uses the syntax of a detected word. The results of the conducted experiments using Fisher Vector features from FastText and PHOC are shown in the last two columns of Table 3. There are two results to highlight from this experiment. Firstly, working with PHOCs along with FVs always yields better performance compared to FastText. The cause might be that the information captured by FastText encapsulates morphology in the form of character n-grams as well as semantics, whereas the PHOC is a compact representation based solely on word morphology.
[Figure 5: Predictions of our model on sample images from the Drink Bottle (first row) and Con-Text (second row) datasets. Each sample lists the ground truth (GT) and the top-3 predicted class probabilities; blue and red denote correct and incorrect predictions, respectively.]
Secondly, combining the explored fusion methods with Fisher Vectors did not provide a significant advantage. A straightforward concatenation operation between the FV and the visual features reinforces the notion that both modalities contain discriminative and orthogonal features well suited for this task. As an additional advantage, by employing concatenation the model converges faster while at the same time providing better performance.
4.6 Qualitative Results
Fine-grained classification probabilities obtained from our model output are depicted in Figure 5. The textual features employed are able to generalize to unseen textual instances or named entities, such as bottle brands or business places. We can observe that our model has a hard time reading handwritten text or vertical textual occurrences, thus wrongly predicting a class, such as the example shown in the first row, seventh column. Nonetheless, the model seems to capture text morphology, as can be seen in the prediction of the class 'pawn shop'. Finally, in the last two samples of each row, there are not enough guiding textual features and the model relies only on similar visual features. Nonetheless, classifying these samples correctly is a hard task even for humans.
4.7 Fine-grained Image Retrieval
In the same manner as the work presented in [karaoglu2017words] and [bai2018integrating], we conduct a retrieval experiment by utilizing the computed vector of the last output layer of the proposed model as retrieval features.
We take the query-by-example approach: given a sample image that belongs to a specific class, the system must return a ranked list of images of the same class as the query. The metric employed to conduct this experiment is the cosine similarity. The combination of visual and textual features is discriminative enough to successfully carry out a different task, as is the case in fine-grained image retrieval. The quantitative retrieval performance for both datasets is shown in Table 4; for qualitative results please refer to the Supplementary Material.
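A minimal, standard-library sketch of query-by-example retrieval with cosine similarity is given below. The function names are hypothetical, and average precision is included as an assumed evaluation measure for a single query; the feature vectors would be the outputs of the last layer of the model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    sims = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: sims[i], reverse=True)

def average_precision(ranked_labels, query_label):
    """Average of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

Averaging `average_precision` over all queries gives the mean average precision (mAP) of the retrieval system.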
5 Conclusions and Future Work
In this work, we have presented a deep neural network framework suitable for a fine-grained classification task. Through extensive experiments, we have shown that leveraging textual information is a key approach to extract information from images. Exploiting these textual cues can pave the road towards more holistic computer vision models of scene understanding. We have shown that current text recognizers that are limited by a dictionary are not the best alternative for this task, because it requires a recognizer able to generalize to out-of-vocabulary words from unseen samples. Additionally, we have found that using semantic embeddings in a fine-grained classification task does not produce the best results, due to the related semantic space shared across similar classes. By integrating state-of-the-art techniques and constructing a powerful morphological descriptor from text contained in images, we show that a better suited feature for this task can be learned. Such a feature proves to be useful for a fine-grained classification task as well as for query-by-example image retrieval. Leveraging this robust textual feature yields state-of-the-art results in both tasks across the assessed datasets. Classification and retrieval are possible due to the discriminative features learnt by the model. As future work, we plan to develop a morphological descriptor that captures the same discriminative features using a smaller feature dimension. A continuous-valued embedding can replace the binary PHOC while preserving the ability to generalize to unseen samples. We want to explore the usefulness of this embedding in other computer vision tasks such as visual question answering [biten2019stvqa, singh2019towards] and text-based image retrieval.