With the evolution of artificial intelligence (AI) and deep learning, the research community is now moving toward problems whose solutions yield real-world gains. Deep learning has played a vital role in tasks ranging from image classification to object detection. Natural language processing (NLP), for its part, has enabled a wide range of applications, from simple text classification to fully automated chatbots in native languages.
Caption generation is a growing research field that combines computer vision with NLP. It has a number of applications: describing scenes for blind people, classifying videos and images based on different scenarios, image-based search engines for optimized search, visual question answering, and context understanding.
Although a large amount of research has been done in this field for languages such as English and French, there is currently no research or publication that focuses on caption generation in Urdu. Urdu is spoken by more than 100 million people in Pakistan and India. This motivates our problem of generating Urdu captions to remove language barriers, which will help native speakers understand visuals in many real-world applications.
Several image datasets with multiple English captions per image are available: Flickr8k [hodosh2013framing], Flickr30k [flickr30k], and COCO [lin2014microsoft]. Flickr8k contains 8k images with 5 English captions per image. Flickr30k has 30k images with 5 English captions per image. COCO contains around 200k labelled images. Deep learning is a data-driven method, so a larger dataset would be preferable; however, owing to limited time, we chose to work with Flickr8k [hodosh2013framing]. The practices followed when translating the captions to Urdu are detailed in Section 3.1.
The major contributions of our paper are:
We prepare a dataset of Urdu language captions.
We evaluate the performance of a few previously proposed techniques for the task of caption generation.
We enhance the performance by combining the previous techniques with newly proposed techniques.
We embed an Urdu grammar classifier in the training of our base model and back-propagate its loss to our attention-based model.
We do extensive testing and provide a detailed discussion of the results.
The rest of the paper is organized as follows. Section 2 reviews previously proposed techniques. Section 3 explains our proposed solution in detail. Section 4 evaluates our approach and presents the results. Section 5 discusses in detail the future directions that can be explored further. Section 6 concludes the paper. The flow of our work is shown in Figure 1.
2 Related Work
In the pre-deep-learning era, caption generation relied on template-based techniques, in which templates are filled in by detecting objects and attributes. Farhadi et al. [farhadi2010every] proposed a triplet of scene elements to fill the template. Li et al. [li2011composing] extracted phrases related to the detected objects, attributes, and their relationships. Kulkarni et al. [bybaby] proposed a conditional random field (CRF) to infer the objects, attributes, and prepositions before filling in the gaps. Although template-based methods generate grammatically correct captions, they offer no flexibility in the length of the text. Retrieval-based caption generation was introduced in [kuznetsova2012collective]. Visually similar images and their captions are first retrieved from a large database, and captions for the query image are then generated from them. This method produces correct and general captions, but it cannot generate semantically correct, image-specific captions.
The use of deep neural networks (DNNs) for caption generation was first proposed by Kiros et al. [kiros2014multimodal]. They proposed using convolutional neural networks (CNNs) to extract features from images for caption generation. They use a multi-modal space to represent images and text jointly for multi-modal representation learning and image caption generation. This model has several advantages over previous methods: its high-level features provide more information, it does not use fixed-length templates, and it uses multi-modal neural language models to improve the results on the language side. [jozefowicz2016exploring] revealed that neural language models cannot handle large amounts of data. To mitigate this problem, [kiros2014unifying] proposed an LSTM-based image caption generation model that can handle long dependencies in a sentence. The paper, which introduces the "structure-content neural language model" (SC-NLM), also uses sentence encoding along with an LSTM network to generate captions.
LSTMs and RNNs removed the barriers in sequence-to-sequence sentence generation. Karpathy et al. [karpathy2014deep] proposed a model that learns a joint embedding space for ranking and generation. Their model learns to score sentence and image similarity as a function of R-CNN object detections and the outputs of a bidirectional recurrent neural network (RNN). This method works at a finer level and embeds fragments of images and sentences: it breaks images down into a number of objects and sentences into dependency tree relations (DTR) [de2006generating]. After that, it reasons about their latent, inter-modal alignment. The authors showed that the method achieves significant improvements in the retrieval task compared to previous methods. Despite its strong performance, this method has a few limitations in the generation of dependency trees, as the modelled relations are not always appropriate.
Mao et al. [mao2014deep] proposed multi-modal RNNs for image caption generation. This method uses two types of neural networks: a deep CNN for images and a deep RNN for sentences. These two networks interact with each other in a multi-modal layer to form the whole model. Both images and sentences are given as input to this method, which calculates a probability distribution over the next word of the caption. Vinyals et al. [vinyals2015show] proposed a method called Neural Image Caption Generator. This method uses a CNN as an encoder for feature extraction from images and a long short-term memory (LSTM) network for generating captions. The output of the last hidden layer of the encoder is used as the input to the LSTM. The LSTM is capable of keeping track of the objects that have already been described by the text. The model is trained using maximum likelihood estimation (MLE) and uses a joint embedding to generate image captions.
Fang et al. [fang2015captions] proposed a three-step pipeline for caption generation by incorporating object detection. Their model first learns detectors for several visual concepts based on a multi-instance learning framework. A language model trained on captions is then applied to the detector outputs, followed by re-scoring from a joint image-text embedding space.
Xu et al. [xu2015show] proposed using the output of the convolutional layers instead of the last hidden fully connected (FC) layers of the CNN, and introduced an attention-based image captioning method for the first time. The method can automatically detect the salient objects in an image. Attention-based methods can concentrate on the salient parts of the image and generate the corresponding words at the same time. Two types of attention mechanisms are used in this method: stochastic hard attention and deterministic soft attention. It uses the convolutional layers of a CNN as the encoder to extract information about the salient objects in the image. The method does not use fully connected layers, so it can focus only on the salient objects in images. The attention mechanism helps to train the model so that it can detect which word describes which part of an image. An LSTM is used as the decoder that generates captions. A disadvantage of this model is that image information may be lost, as the focus is only on the salient features of images.
Our main focus is to generate captions for images in Urdu, and there is no publication available on caption generation in Urdu. This makes our work unique. Our work is closely related to the work of Xu et al. [xu2015show] on English. We train our caption generation model with newer CNN architectures to extract more features from images and overcome the disadvantages of their model. We also use different optimization techniques in order to get better results, and we replace the LSTM in hard attention models with a gated recurrent unit (GRU) [cho2014learning].
3 Proposed Solution
Our work focuses on "whole-scene-based" captioning. As discussed in the introduction, no dataset provides captions in Urdu. We therefore divide our project into three parts: the first focuses on developing a dataset containing Urdu captions, the second aims at developing a deep learning model that can generate understandable Urdu captions for a given image, and the last is to critically test this model. The testing covers both the correctness of the predicted captions and their grammar. Furthermore, we also try some architectural changes to the best model for improved performance. These changes are discussed in the deep learning model subsection.
Our final proposed architecture is shown in Figure 2. The details of each part of the model are discussed in the following subsections.
3.1 Dataset Generation
We use the Flickr8k dataset to build our dataset. There are two possible approaches: translating the captions using the free APIs provided by service providers such as Google (https://translate.google.com/) and Microsoft (https://www.microsoft.com/en-us/translator/business/translator-api/), or captioning the images by hand from scratch. The first technique does not provide good translations because less data is available for Urdu. We therefore had to use the second one, i.e., translating by hand from scratch.
The following practices were used in generating the dataset:
The captions were divided among the members of the group, i.e., not all captions were translated by the same person.
The captions of a given image were not all translated at the same time.
Where the English captions did not match the image, the Urdu captions were written from scratch based on the image.
After a person had translated a set of captions, the translations were reviewed by other members of the group.
Google Input Tools (https://www.google.com/inputtools/try/) was used to write the translations. It lets us type Urdu in Roman script and then transcribes it into Nastaliq.
This data was then preprocessed so that it could be used as input to the DL model. The preprocessing involved removing punctuation and tokenizing the dataset. Tokenization was performed on the basis of the spaces present in a sentence. Urdu writing does not reliably separate words with spaces, so this had to be taken into account while translating: a space was added at the end of each word.
Tokenizing Urdu text that has no spaces between words is a difficult task, as there are two types of characters: non-joiners and joiners. The latter change their shape depending on their position in the word. The former retain their shape regardless of their position in the word. Even with joiners, we cannot be sure that a word is ending, as they can also occur in the middle of a word. Several tokenization methods have been proposed, but none provides code that is usable for our problem. We are currently working on this.
These tokens were then further processed, and start and end markers were added to each sentence.
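The preprocessing described above (punctuation removal, space-based tokenization, and wrapping each sentence in boundary markers) can be sketched as follows. This is a minimal illustration, not the authors' code: the marker strings `<start>`, `<end>`, and `<pad>`, the function names, and the set of Urdu punctuation marks are assumptions made for the example.

```python
import string

# Hypothetical marker strings; the exact tokens used in the paper are not given.
START_TOKEN = "<start>"
END_TOKEN = "<end>"

# ASCII punctuation plus common Urdu marks (full stop, comma, question mark, semicolon).
# This set is an assumption and may need extending for real data.
PUNCTUATION = string.punctuation + "۔،؟؛"
_STRIP_TABLE = str.maketrans("", "", PUNCTUATION)


def preprocess_caption(caption):
    """Remove punctuation, split on whitespace, and add boundary markers."""
    cleaned = caption.translate(_STRIP_TABLE)
    tokens = cleaned.split()
    return [START_TOKEN, *tokens, END_TOKEN]


def build_vocabulary(captions):
    """Map each distinct token to an integer id; id 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for caption in captions:
        for token in preprocess_caption(caption):
            vocab.setdefault(token, len(vocab))
    return vocab
```

Because the split relies only on whitespace, this sketch works only when the captions were typed with a space after every word, as described above.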
3.2 Deep Learning Model
As discussed before, our technique is heavily inspired by that of Xu et al. [xu2015show], and we improve upon their performance. Our model consists of three major parts: an encoder, an attention mechanism, and a decoder. They work in series: the input to the encoder is the images with their captions, and its output is passed to the attention mechanism and the decoder, which work in parallel to predict the caption of the input image.
The encoder consists of a CNN that extracts features from the image. The features are taken from the last convolutional layer instead of the fully connected layers, and the output of this CNN is fed to the attention mechanism. The advantage of taking the output of the last convolutional layer is that it retains salient features that might otherwise be lost when taking the output of the fully connected layers. Recent CNN architectures, namely ResNet-101 [he2016deep], DenseNet-161 [huang2017densely], and InceptionV3 [szegedy2016rethinking], were used as the encoder. ResNet-101 and InceptionV3 performed best and were therefore chosen for further testing. Using these architectures, we were able to extract comparatively more features. We also tried different optimizers, namely Momentum [sutskever2013importance], Adam [kingma2014adam], and RMSprop [hinton2012neural].
[Table: Our Dataset (Man (Urdu)) | Other Datasets]
With the help of an attention mechanism, the model can learn to focus on the relevant part of the image. We used the attention mechanism of Bahdanau et al. [bahdanau2014neural], which Xu et al. [xu2015show] termed "soft attention". This attention is deterministic, i.e., it assigns a weight to every part of the image rather than sampling a single location. The image features and the previous hidden state of the decoder are passed to the attention mechanism, which first calculates the alignment score. This score tells the decoder how much attention it has to pay to a particular part of the image. The alignment score is shown in equation 1.
This score is then passed through a softmax to obtain the attention weights, as shown in equation 2.
The context vector is then generated by multiplying the encoder outputs (extracted features) element-wise by the attention weights and summing over the image locations, as shown in equation 3. Because of the softmax, if the weight of a specific input element is close to 1, its influence on the decoder output is amplified, whereas if the weight is close to 0, its influence is suppressed.
The context vector is then passed to the decoder, where it is concatenated with the current decoder input. The concatenated vector is passed to the GRU to generate the next word, and this process repeats until the last word of the caption is generated. The generated words are then concatenated to form a complete sentence.
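For reference, the three steps can be written in the standard additive-attention notation of Bahdanau et al. [bahdanau2014neural]. The symbols are assumptions introduced here for illustration: a_i is the feature vector of image location i (out of L locations), h_{t-1} is the previous hidden state of the decoder, and W_1, W_2, and v are learned parameters. The three lines give the alignment score, the attention weight, and the context vector, in that order.

```latex
\begin{align*}
e_{t,i}      &= v^{\top} \tanh\left(W_1 a_i + W_2 h_{t-1}\right) \\
\alpha_{t,i} &= \frac{\exp(e_{t,i})}{\sum_{k=1}^{L} \exp(e_{t,k})} \\
c_t          &= \sum_{i=1}^{L} \alpha_{t,i}\, a_i
\end{align*}
```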
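The attention step described above, from alignment scores to the context vector, can be sketched in plain Python. This is a minimal illustration of standard additive (soft) attention under stated assumptions, not the authors' implementation: the parameter names `W1`, `W2`, and `v` and the list-of-lists matrix representation are chosen for the example, and a real model would use a tensor library with learned weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def soft_attention(features, hidden, W1, W2, v):
    """Additive soft attention.

    features: list of L feature vectors, one per image location.
    hidden:   previous decoder hidden state.
    W1, W2, v: projection parameters (hypothetical names).
    Returns (context_vector, attention_weights).
    """
    projected_hidden = matvec(W2, hidden)
    scores = []
    for feature in features:
        projected_feature = matvec(W1, feature)
        activation = [math.tanh(f + h)
                      for f, h in zip(projected_feature, projected_hidden)]
        # Alignment score for this image location.
        scores.append(sum(vi * ai for vi, ai in zip(v, activation)))
    weights = softmax(scores)
    dim = len(features[0])
    # Weighted sum of the feature vectors over all locations.
    context = [sum(w * feat[d] for w, feat in zip(weights, features))
               for d in range(dim)]
    return context, weights
```

In the decoder, the returned context vector would be concatenated with the embedding of the current input word before being passed to the GRU.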
The decoder consists of a GRU [cho2014learning] that generates captions. A GRU is similar to an LSTM but has fewer parameters and no output gate. These properties make the GRU faster, computationally cheaper, and more memory-efficient. The GRU generates one word at each time step, conditioned on its previous hidden state, the previously generated word, and the context vector. Chung et al. [chung2014empirical] reported that GRUs can give more accurate results than LSTMs on small datasets. This makes the GRU a suitable choice for our problem.
A grammar-testing model was added at the end of this DL model; it is detailed in the following section.
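To make the gating concrete, a single GRU update can be sketched as below, following the formulation of Cho et al. [cho2014learning] with biases omitted for brevity. The parameter dictionary and its key names (`Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh`) are assumptions for this example; in the captioning model, the input `x` would be the concatenation of the word embedding and the context vector.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gru_step(x, h_prev, params):
    """One GRU time step; returns the new hidden state.

    x:      input vector at this time step.
    h_prev: previous hidden state.
    params: dict of weight matrices (hypothetical key names).
    """
    # Update gate: how much of the previous state to keep.
    z = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wz"], x), matvec(params["Uz"], h_prev))]
    # Reset gate: how much of the previous state feeds the candidate.
    r = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wr"], x), matvec(params["Ur"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate state computed from the input and the reset-gated state.
    candidate = [math.tanh(a + b) for a, b in
                 zip(matvec(params["Wh"], x), matvec(params["Uh"], gated))]
    # Interpolate between the previous state and the candidate.
    return [zi * hi + (1.0 - zi) * ci
            for zi, hi, ci in zip(z, h_prev, candidate)]
```

Only two gates and one state vector are involved, which is where the parameter saving relative to an LSTM comes from.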
3.3 Urdu Grammar Correction
Our aim in the grammar correction part is to propose a model that detects grammatical errors in the generated captions and assigns them a score. This score can be treated as a loss and back-propagated into the GRU in order to generate grammatically correct captions. We are currently working on this, and we have created a dataset of grammatically wrong and correct sentences. So far, we have trained a deep learning model only to classify grammar as correct or incorrect. This is discussed further in the discussion section.
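One simple way to treat the grammar score as a loss is to add a weighted penalty to the captioning loss. The sketch below is purely illustrative and is not the method used in this work, which is still under development: the function name, the `weight` parameter, and the assumption that the classifier outputs a probability of the caption being grammatical are all hypothetical.

```python
def combined_loss(caption_loss, grammar_prob_correct, weight=0.5):
    """Add a grammar penalty to the captioning loss.

    caption_loss:         loss of the caption model (e.g. cross-entropy).
    grammar_prob_correct: classifier probability in [0, 1] that the
                          generated caption is grammatical (assumed output).
    weight:               hypothetical trade-off hyperparameter.
    """
    if not 0.0 <= grammar_prob_correct <= 1.0:
        raise ValueError("grammar_prob_correct must lie in [0, 1]")
    grammar_penalty = 1.0 - grammar_prob_correct
    return caption_loss + weight * grammar_penalty
```

For the penalty to influence the GRU through back-propagation, the classifier score must be differentiable with respect to the decoder outputs, which this scalar sketch does not address.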
4 Experimental Setup and Results
Owing to limited resources, the experiments were performed on Google Colab. The datasets were saved to and loaded from Google Drive.
We tested our proposed technique on our Flickr8k dataset, our own 'man' dataset, and the 'dog' dataset generated by the previous year's group. Table 1 shows the BLEU performance of our model. It can be seen that our model achieves a substantial BLEU score. We outperform the previous year's group by almost a factor of two: their BLEU score was 0.4, and we achieve a BLEU score of 0.83. This can be increased further by using the Urdu grammar correction techniques discussed above. The BLEU score on the 'man' dataset is better with hard attention because very specific, contextually similar images were used and the translations were done very carefully. Another advantage is that many words recur frequently across Urdu sentences.
Some predictions of our model on images from the validation dataset are shown in Figure 4. It can be seen that the model performs sufficiently well in predicting Urdu captions, although a few wrong predictions can also be seen. We are still working on the grammar correction part. The results of our model's attention mechanism are given in Figure 3.
All our code is available on GitHub (https://github.com/abdullahzia510/Urdu_Caption_Generation).
We also trained a model for Urdu grammar detection on our self-annotated dataset. The model achieved an accuracy of 63% in telling wrong grammar from correct grammar.
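For readers unfamiliar with the metric, a sentence-level BLEU score can be sketched as below: clipped n-gram precisions are combined by a geometric mean and multiplied by a brevity penalty. This is a generic, unsmoothed illustration and not the evaluation script used for Table 1; the text does not state which n-gram order or smoothing was used, so `max_n` and the function names are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against the references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    ref_len = min((abs(len(r) - len(candidate)), len(r)) for r in references)[1]
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1.0 - ref_len / len(candidate))
    return penalty * math.exp(log_mean)
```

Because captions are tokenized on spaces, the score depends directly on the tokenization choices discussed in Section 3.1.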
There are multiple challenges that we faced while processing the Urdu language. Some of them are detailed in the following paragraphs.
The major focus of previous research has been on western languages. As a result, resources for eastern languages such as Urdu are scarce. This has been a huge hurdle in our work, as we had to create the dataset ourselves. Urdu is comparatively complex, as its morphology and syntax are a combination of Persian, Sanskrit, English, Turkish, and Arabic [adeeba2011experiences].
Defining word boundaries is a difficult task for Urdu owing to the difference between joiners and non-joiners. Even if we are able to define the word boundaries properly, compound words such as "yahan-wahan", "torr-marror", and "koh-paima" need to be treated as single words. This is a major hurdle, as there is no complete dataset that can identify these words and hence help in tokenizing the data. Techniques such as minimum edit distance have been proposed to detect them, but they are limited by the presence of outliers. Spaces must be ignored when tokenizing Urdu [daud2017urdu] in the following cases:
Compound words such as "wazeer azam"
Reduplication such as "subah subah"
Affixation such as "sarmaya kaari"
Proper nouns such as "inaam ilahi"
Abbreviations and acronyms
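The cases listed above can be handled after space-based splitting by merging adjacent words that appear in a lexicon of multi-word units. The sketch below assumes such a lexicon exists, which, as noted, is not currently the case for Urdu. The `COMPOUNDS` set is a hypothetical stand-in built from the Roman-script examples in the text, and the underscore joiner is an arbitrary choice.

```python
# Hypothetical lexicon of two-word units, in Roman transliteration.
COMPOUNDS = {
    ("wazeer", "azam"),    # compound word
    ("subah", "subah"),    # reduplication
    ("sarmaya", "kaari"),  # affixation
    ("inaam", "ilahi"),    # proper noun
}


def tokenize_with_compounds(sentence, compounds=COMPOUNDS, joiner="_"):
    """Split on spaces, then greedily merge adjacent pairs found in the lexicon."""
    words = sentence.split()
    tokens = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            # Treat the two space-separated words as a single token.
            tokens.append(joiner.join(pair))
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens
```

For example, `tokenize_with_compounds("wazeer azam subah subah aaye")` returns `["wazeer_azam", "subah_subah", "aaye"]`. The greedy left-to-right merge is a simplification and covers only two-word units.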
The model does not reliably detect the grammatical gender of nouns: it cannot distinguish between "karta hai" (masculine) and "karti hai" (feminine). This is a major problem when predicting with our trained model. Such problems are not faced in English.
Stemming is a major part of NLP. The objective of stemming is to standardize words by reducing each word to its origin or root [riaz2007challenges]. We need to stem the data and keep the morphemes. A morpheme is the smallest meaningful unit of a word. These morphemes can help us identify the object and the action being performed in the image very clearly. The predicted sentences can then be corrected grammatically. This should produce much better results than simple tokenization with no other pre-processing.
Three types of techniques can be used for Urdu language processing: rule-based, statistical, and hybrid. As there are exceptions to every rule, the hybrid approach is the most suitable. Part-of-speech tagging can help in determining the correct grammar, so that the model can learn to generate grammatically better captions. The unavailability of such a model for Urdu is a hurdle in our task.
Bhatt et al. [bhatt2009multi] proposed a multi-representational and multi-layered treebank for Urdu. Their model produces the dependency and phrase structure of the input sentence. It might have been a great help to our project, but they have not made their code public, and they performed the task for Roman Urdu, whereas we are working on Nastaliq Urdu.
Kabir et al. [kabir2002two] propose a two-pass parsing algorithm. In the first stage, the model tries to predict the POS tags of the input sentence. If it finds the input sentence to be inconsistent with the rules of Urdu grammar, it suggests changes to the sentence, and after these changes are applied the sentence is passed to the parser again. This suggestion of changes followed by re-parsing is the second stage of the algorithm. Along with grammatical mistakes, it also looks for structural mistakes. Here, too, the code was unavailable. Their proposed workflow is shown in Figure 5.
Durrani et al. [durrani2010urdu] propose a good workflow for Urdu word segmentation. Figure 6 shows the sequence that should be followed to pre-process Urdu data for use in any NLP task.
6 Conclusions and Future Work
We have prepared a dataset for Urdu caption generation consisting of 700 images. Using this dataset, we have proposed a deep learning algorithm that produces good results for similar unseen images. We show that our algorithm outperforms the previous year's group (which worked on the same problem) by a margin of almost a factor of two. We have also worked on the grammar correction part, but this work is still in progress. We have been able to distinguish grammatically wrong from correct captions among the predicted captions. We are currently working on devising a technique through which we can back-propagate the loss from the grammar-testing model to the GRU.