Visual understanding and language generation are two tasks that are intuitive for humans but challenging for computers. Recently, convolutional neural networks (CNNs) (Krizhevsky et al., 2012) and long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) have achieved state-of-the-art results on image understanding and natural language generation, respectively. For image captioning, one of the most successful approaches has been the encoder-decoder architecture, in which the image is first “encoded” by a CNN into a latent semantic vector, and then “decoded” into a natural language sentence by a recurrent, language-generating LSTM.
For most people, describing a picture is an intuitive task that requires little effort. However, patients with neurodegenerative disorders have impaired brain function that inhibits some cognitive subtasks; this manifests as difficulties with picture description (Croisile et al., 1996). In fact, because linguistic impairment is one of the earliest signs of Alzheimer’s disease (AD), picture description is useful as a cognitive test for AD (Forbes-McKay and Venneri, 2005).
In this paper, we implement a variant of the “Show and Tell” neural network for image captioning (Vinyals et al., 2015), and simulate the effects of neurodegeneration by adding dropout during inference, which randomly sets a subset of the neuron outputs in a layer to zero. We evaluate the effects of dropout on language generation, and compare the results to picture descriptions produced by patients with diseases such as AD and aphasia.
2 Related Work
2.1 Image Captioning
Mao et al. (2014) were the first to apply deep recurrent neural networks (RNNs) to image captioning. Their architecture used a multimodal layer that combined image representations preprocessed by a CNN, the input word embedding, and the output of the RNN at each time step to generate output word embeddings.
Vinyals et al. (2015) used an encoder-decoder architecture that only looked at the image once. The encoder was a CNN that encoded images into vector representations and the decoder was an LSTM that decoded the image features into natural language descriptions. By using an LSTM, their model was able to retain long-term dependencies and avoid having to show the image to the RNN multiple times. Xu et al. (2015) added a visual attention mechanism, which learned which part of the image to focus on at each time step.
2.2 Picture Description
Alzheimer’s disease (AD) is a neurodegenerative disorder affecting 47 million people worldwide (Prince et al., 2016). One of the earliest symptoms of AD is cognitive impairment, especially difficulty with language production. One widely-used cognitive test for AD is the Cookie Theft picture description task from the Boston Diagnostic Aphasia Examination (Goodglass and Kaplan, 1983). In this task, the patient is shown a drawing of a chaotic kitchen scene, and is asked to describe it in as much detail as possible. Language in patients with AD is characterized by semantic impairment, particularly difficulty finding words for concepts and ideas (Taler and Phillips, 2008).
Wernicke’s aphasia (WA), also known as fluent aphasia, is caused by damage to Wernicke’s area, which is partially responsible for language production. Patients with WA tend to produce long stretches of words in a seemingly random order (Buckingham Jr and Kertesz, 1974). A video of a patient with WA performing a picture description task is available at https://www.youtube.com/watch?v=xzp-XUBknQI.
2.3 Dropout in neural networks
Dropout is the technique of randomly selecting neurons in a layer with some probability $p$ and setting their outputs to zero. When used during training, dropout has been shown to be an effective regularization method, with an effect similar to averaging an ensemble of models (Srivastava et al., 2014). Dropout has been applied to RNN language models as well, with a similar regularizing effect (Zaremba et al., 2014). While previous work considered dropout during training as a regularization mechanism, we consider dropout during inference and its effects on language generation. To our knowledge, dropout during inference in RNNs has not been studied.
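As a minimal sketch (pure Python; the `dropout` helper is our own illustration, not a framework API), inference-time dropout simply masks each element of an activation vector independently with probability $p$. Note that real frameworks also rescale surviving activations by $1/(1-p)$ during training ("inverted dropout"), which we omit here for clarity:

```python
import random

def dropout(vec, p, rng):
    # Zero each element independently with probability p.
    # (Illustrative only: no 1/(1-p) rescaling is applied.)
    return [0.0 if rng.random() < p else x for x in vec]

rng = random.Random(0)
hidden = [0.5, -1.2, 0.3, 2.0]
print(dropout(hidden, 0.0, rng))  # p = 0: vector passes through unchanged
```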
AD may be caused by a misfolding in the beta-amyloid protein, causing beta-amyloid plaques to form in neurons (Goedert and Spillantini, 2006). This inhibits electrochemical signal transmission in the synapses. Thus, the effect of AD in brain cells is similar to that of dropout in neural networks. The goal of our work is to simulate the effects of AD so as to produce pathological linguistic effects similar to picture description by patients with neurodegenerative disorders.
3.1 Image caption network
We implement an encoder-decoder neural architecture. For the CNN, we use the VGG16 convolutional network (Simonyan and Zisserman, 2014) and initialize it with weights from Torchvision (Paszke et al., 2017), which were trained on ImageNet. During the training of our image captioning network, all of the weights for the CNN are frozen.
During inference, the image is processed by the VGG16 network up to the last hidden layer. This hidden layer is fed into a linear layer to produce the initial hidden state of the LSTM. A sentence is then generated as follows: on each iteration, the LSTM hidden state is passed through a linear layer followed by a softmax to produce a probability distribution over the entire vocabulary, and the word with the highest probability is picked as the next word in the sentence. We then use a word embedding lookup to convert this word into a vector, and feed it into the next iteration of the LSTM. This process continues until the LSTM generates a special end marker, or exceeds a fixed word limit of 20 words (fewer than 3% of sentences in the training data exceed this length).
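The greedy decoding loop just described can be sketched as follows. The `step` function is a hypothetical interface standing in for one LSTM iteration plus the linear-softmax layer (it returns the next hidden state and a vector of word scores); this is an illustration of the procedure, not our actual implementation:

```python
def greedy_decode(step, h0, start_id, end_id, max_len=20):
    # Greedy decoding: pick the arg-max word at each iteration,
    # stopping at the end marker or the fixed word limit.
    h, word, out = h0, start_id, []
    for _ in range(max_len):
        h, logits = step(h, word)
        word = max(range(len(logits)), key=logits.__getitem__)
        if word == end_id:
            break
        out.append(word)
    return out
```

In practice `step` would embed the word id, run one LSTM step, and apply the output projection; everything else about the loop is unchanged.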
To train the network, we compute the perplexity of each image-caption pair, which corresponds to the likelihood of the pair under our model. It is computed by feeding the image and caption sentence into the network and summing the cross-entropy errors at each step. This quantity is then minimized using backpropagation and stochastic gradient descent.
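As an illustration of this loss (not our training code, which operates on logits in PyTorch), the summed cross-entropy for one caption reduces to the negative log-likelihood of the correct words:

```python
from math import log

def caption_loss(step_probs):
    # step_probs[t] is the probability the model assigns to the correct
    # word at step t; summing cross-entropy terms gives the negative
    # log-likelihood of the caption, which training minimizes.
    return -sum(log(p) for p in step_probs)

caption_loss([1.0, 1.0, 1.0])  # perfect predictions give zero loss
```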
3.2 Dropout in GRU
The LSTM was the first recurrent neural network model that addressed the vanishing and exploding gradient problems, in which simple RNNs had difficulty learning long-term dependencies due to numerical instability (Bengio et al., 1994). More recently, the gated recurrent unit (GRU) was found to achieve similar performance to the LSTM, while using only a hidden state and no cell state (Cho et al., 2014). We use a modified version of the GRU, in which a constant level of dropout is added to the hidden state between iterations. This is represented by the following equations:
\begin{align*}
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
\tilde{h}_t &= \tanh\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big) \\
h_t &= \mathrm{Dropout}\big((1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,\; p\big)
\end{align*}

where $\odot$ denotes element-wise multiplication between vectors and $\sigma$ is the sigmoid function. The variable $h_t$ is the hidden state at time $t$, and $x_t$ is the input vector. First, we compute the vectors $r_t$, representing the reset gate, and $z_t$, representing the update gate, both in the range $(0, 1)$. A temporary hidden state $\tilde{h}_t$ is computed from the input and the previous hidden state masked by the reset gate $r_t$. The next hidden state $h_t$ is set to a combination of the previous hidden state and the temporary hidden state, as determined by the update gate $z_t$. Finally, dropout is applied to $h_t$. The variables $W$, $U$, and $b$ (with various subscripts) are weights to be learned by backpropagation.
We define $p_{\text{train}}$ to be the dropout probability during training, and $p_{\text{eval}}$ to be the dropout probability during evaluation. Typically, dropout is used during training for regularization and turned off during evaluation, so $p_{\text{train}} > 0$ and $p_{\text{eval}} = 0$. However, in this work we explore the possibility of $p_{\text{eval}} > 0$ and its effects on language generation.
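A single step of this modified GRU can be sketched in NumPy as below. The dictionary-of-matrices parameterization and random mask are our own illustration (the weights would be learned by backpropagation in practice, and our real model runs in PyTorch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b, p_drop, rng):
    # One step of the GRU with dropout applied to the new hidden state.
    # W, U, b are dicts keyed by gate name ("r", "z", "h").
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])          # reset gate
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])          # update gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])
    h = (1 - z) * h_prev + z * h_tilde                          # gated blend
    mask = (rng.random(h.shape) >= p_drop).astype(h.dtype)      # dropout mask
    return h * mask
```

Setting `p_drop = 0` recovers a standard GRU step; any `p_drop > 0` during inference implements the degradation we study.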
3.3 Evaluation metrics
To test the performance of our model, we generate captions for images in a validation set, and consider the BLEU and METEOR scores; for BLEU, we consider BLEU-4, which correlates most highly with human ratings of performance (Papineni et al., 2002; Denkowski and Lavie, 2014).
Next, we evaluate the effects of dropout on caption accuracy and vocabulary diversity. We use Kullback-Leibler (KL) divergence to measure the distance between the word frequency distribution of the ground-truth validation captions and the sentences generated by our network on the validation set. That is,
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{w} P(w) \log \frac{P(w)}{Q(w)}$$

where $P$ is the word distribution of the generated captions and $Q$ is the word distribution of the most common 10,000 words among the ground-truth captions.
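This computation can be sketched in Python as follows; the add-epsilon smoothing is an assumption on our part, used here to keep the divergence finite when a word appears in one distribution but not the other:

```python
from collections import Counter
from math import log

def kl_divergence(generated, reference, eps=1e-12):
    # D_KL(P || Q) between word frequency distributions, where P comes
    # from the generated captions and Q from the reference captions.
    p_counts, q_counts = Counter(generated), Counter(reference)
    vocab = set(p_counts) | set(q_counts)
    n_p, n_q = sum(p_counts.values()), sum(q_counts.values())
    return sum(
        (p_counts[w] / n_p) * log((p_counts[w] / n_p + eps) / (q_counts[w] / n_q + eps))
        for w in vocab
        if p_counts[w] > 0
    )
```

Identical distributions give a divergence of zero; the more the generated word frequencies drift from the references, the larger the value.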
For each run of the experiment, we also calculate the number of unique words among all generated captions, and the proportion of generated captions that exceed the word limit of 20.
4 Results and Discussion
We train our model using the COCO2014 dataset (Lin et al., 2014), which contains 82,783 training images and 40,504 validation images.
The neural network model is implemented using PyTorch (Paszke et al., 2017). We use GloVe embeddings from spaCy (Pennington et al., 2014; Honnibal and Johnson, 2015), trained on Common Crawl. The network was trained using Adam (Kingma and Ba, 2014).
The LSTM version of our model achieves a BLEU-4 score of 20.6 and METEOR score of 20.0. The GRU version performed similarly, with a BLEU-4 score of 20.1 and METEOR score of 19.8.
Our model slightly underperforms the scores reported by Vinyals et al. (2015). We did not implement beam search to minimize perplexity across a sequence, instead using the greedy approach of picking the highest-probability word at each step. Additional hyperparameter optimization might have improved our model's accuracy, though that is not our purpose here.
Next, we evaluate the accuracy and vocabulary diversity of the model as dropout is added during inference. We train two versions of the GRU model, one with no training dropout and one with training dropout. For each model, we generate captions for the validation set under a range of evaluation dropout probabilities. The results are shown in Table 1.
For both versions of the model, the BLEU-4 and METEOR scores are maximized when evaluation dropout is zero; this is expected, since dropout is usually disabled during evaluation for best performance. When evaluation dropout is moderate, the model trained with dropout performs better than the model trained without it. When evaluation dropout is high, both models perform poorly.
The standard model without dropout generates a vocabulary of only 733 words, out of a total possible vocabulary of 10,000; when dropout is added during inference, the generated vocabulary is more diverse. In both versions of the model, the KL divergence between word frequency distributions is minimized at a moderate level of evaluation dropout; thus, a moderate amount of inference dropout produces a word frequency distribution closer to the target distribution. However, when evaluation dropout is too high, the generated captions have low BLEU-4 and METEOR scores, high KL divergence, and an increased probability of exceeding the sentence length limit.
Next, we comment on the qualitative effects of dropout on language generation. A sample of captions generated with various levels of dropout is given in the appendix. Generally, errors fall into two common patterns:
A caption starts out normally, then repeats the same word several times: “a small white kitten with red collar and yellow chihuahua chihuahua chihuahua”
A caption starts out normally, then becomes nonsense: “a man in a baseball bat and wearing a uniform helmet and glove preparing their handles won while too frown”
Both of these patterns are sometimes observed in patients with paraphasic disorders, like fluent aphasia (Berti et al., 2015). In particular, one symptom of Wernicke's aphasia is the use of jargon speech: long streams of words with seemingly no meaning but retaining some phonological and grammatical structure. Buckingham Jr and Kertesz (1974) give some examples of jargon speech:
“I know a deprecol over american person churches such as no dish or penthenis”
“I think my foremust acoushner looks ellington”
“I would say that the mick daysas nosis or chpickters”
5 Conclusion and Future Work
In this paper, we have implemented an encoder-decoder image captioning neural network, and applied dropout during inference to simulate the effects of neurodegeneration on language production. The resulting sentences qualitatively resemble speech produced by patients with language production disorders like fluent aphasia. Future research will include quantitative comparisons of our language model with speech by aphasic patients.
Our model applies dropout in the hidden layer of the GRU, but there are other ways to simulate neurodegeneration as well. For example, one may instead add Gaussian noise to the hidden layer, or deterministically drop out specific neurons rather than selecting them at random. Furthermore, it is unknown whether the effects we observed can be reproduced using different corpora. The next step is to train a recurrent neural language model on a picture description corpus, such as the DementiaBank Corpus in the TalkBank project (MacWhinney et al., 2011).
This work may be suitable as a data augmentation preprocessing step for systems that automatically detect dementia and aphasia using speech. A neural language model is trained using normative data, then dropout is applied during inference to generate degenerate data. This synthetic data is then combined with real data from patients with neurodegenerative disorders using semi-supervised methods to train a classifier. However, the implementation of this concept is a topic for future research.
- Bengio et al. (1994) Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166.
- Berti et al. (2015) Anna Berti, Francesca Garbarini, and Marco Neppi-Modona. 2015. Disorders of higher cortical function. In Neurobiology of Brain Disorders, pages 525–541. Elsevier.
- Buckingham Jr and Kertesz (1974) Hugh W Buckingham Jr and Andrew Kertesz. 1974. A linguistic analysis of fluent aphasia. Brain and Language, 1(1):43–61.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Croisile et al. (1996) Bernard Croisile, Bernadette Ska, Marie-Josee Brabant, Annick Duchene, Yves Lepage, Gilbert Aimard, and Marc Trillet. 1996. Comparative study of oral and written picture description in patients with alzheimer’s disease. Brain and language, 53(1):1–19.
- Denkowski and Lavie (2014) Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380.
- Forbes-McKay and Venneri (2005) Katrina E Forbes-McKay and Annalena Venneri. 2005. Detecting subtle spontaneous language decline in early alzheimer’s disease with a picture description task. Neurological sciences, 26(4):243–254.
- Fraser et al. (2016) Kathleen C Fraser, Jed A Meltzer, and Frank Rudzicz. 2016. Linguistic features identify alzheimer’s disease in narrative speech. Journal of Alzheimer’s Disease, 49(2):407–422.
- Goedert and Spillantini (2006) Michel Goedert and Maria Grazia Spillantini. 2006. A century of alzheimer’s disease. science, 314(5800):777–781.
- Goodglass and Kaplan (1983) Harold Goodglass and Edith Kaplan. 1983. Boston diagnostic aphasia examination booklet. Lea & Febiger.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Honnibal and Johnson (2015) Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
- MacWhinney et al. (2011) Brian MacWhinney, Davida Fromm, Margaret Forbes, and Audrey Holland. 2011. Aphasiabank: Methods for studying discourse. Aphasiology, 25(11):1286–1307.
- Mao et al. (2014) Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, and Alan Yuille. 2014. Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- Prince et al. (2016) Martin Prince, Adelina Comas-Herrera, Martin Knapp, Maëlenn Guerchet, and Maria Karagiannidou. 2016. World alzheimer report 2016: improving healthcare for people living with dementia: coverage, quality and costs now and in the future.
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
- Taler and Phillips (2008) Vanessa Taler and Natalie A Phillips. 2008. Language performance in alzheimer’s disease and mild cognitive impairment: a comparative review. Journal of clinical and experimental neuropsychology, 30(5):501–556.
- Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. 2015. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 3156–3164. IEEE.
- Xu et al. (2015) Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057.
- Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.
Appendix A Example sentences
Below are examples of sentences generated with dropout during inference. An ellipsis indicates that the sentence exceeded the length limit of 20 words.
a bear walking through a field of tall grass
a group of people standing on top of a snow covered slope
a cat laying on a bed with a cat
a baseball player swings his bat at a baseball
a group of people sitting around a table with food
a boat is in the water near a dock
a man in a red white and black hair is lying on a fitting fitting
a herd of sheep grazing in a field of grass
a table with a lot of food including grapes melon seaweed
a bathroom with a toilet ripped out
a bus is parked by packed packed macbook sequential sequential drawer basebal funky western sanctioned confident automobile leaguer crossroad peson …
a herd of sheep are in a williams williams twp poeple khakis accommodate surveying unenthused unenthused homey recreation slider clinton …
professional clothing great overstuffed handlebars tailed prepped photos recovery version volkswagen brings lose broiled papered sprouting valve mets halfway lavishly …
seven shoppers hunched petite westmark gril chives caucasian end yellowed trashcans crumb photographic slipper poeple poeple soaked barbecuing twp twp …
a offspring fitted stemmed pregnant nurse urns surveys consume reservoir snuggled meatballs curry twp terrace trailers peple motocross youngsters specialized …