DeepDiary: Automatic Caption Generation for Lifelogging Image Streams

by Chenyou Fan, et al.
Indiana University Bloomington

Lifelogging cameras capture everyday life from a first-person perspective, but generate so much data that it is hard for users to browse and organize their image collections effectively. In this paper, we propose to use automatic image captioning algorithms to generate textual representations of these collections. We develop and explore novel techniques based on deep learning to generate captions for both individual images and image streams, using temporal consistency constraints to create summaries that are both more compact and less noisy. We evaluate our techniques with quantitative and qualitative results, and apply captioning to an image retrieval application for finding potentially private images. Our results suggest that our automatic captioning algorithms, while imperfect, may work well enough to help users manage lifelogging photo collections.




1 Introduction

Wearable cameras that capture first-person views of people’s daily lives have recently become affordable, lightweight, and practical, after many years of being explored only in the research community [23, 37, 1]. These new devices come in various types and styles, from the GoPro, which is marketed for recording high-quality video of sports and other adventures, to Google Glass, which is a heads-up display interface for smartphones but includes a camera, to Narrative Clip and Autographer, which capture “lifelogs” by automatically taking photos throughout one’s day (e.g., every 30 seconds). These devices, and others like them, are being used for a variety of applications, from documenting police officers’ interactions with the public [39], to studying people’s activities at a fine grain resolution for psychological studies [26, 12], to keeping visual diaries of people’s lives for promoting health [40] or just for personal use [21, 11]. No matter the purpose, however, all of these devices can record huge amounts of imagery, which makes it difficult for users to organize and browse their image data.

In this paper, we attempt to produce automatic textual narrations or captions of a visual lifelog. We believe that describing lifelogs with sentences is most natural for the average user, and allows for interesting applications like generating automatic textual diaries of the “story” of someone’s day based on their lifelogging photos. We take advantage of recent breakthroughs in image captioning using deep learning that have shown impressive results for consumer-style images from social media [28, 30], and evaluate their performance on the novel domain of first-person images (which are significantly more challenging due to substantial noise, blurring, poor composition, etc.). We also propose a new strategy to try to encourage diversity in the sentences, which we found to be particularly useful in describing lifelogging images from different perspectives.

Of course, lifelogging photo streams are highly redundant since wearable cameras indiscriminately capture thousands of photos per day. Instead of simply captioning individual images, we also consider the novel problem of jointly captioning lifelogging streams, i.e. generating captions for temporally-contiguous groups of photos corresponding to coherent activities or scene types. Not only does this produce a more compact and potentially useful organization of a user’s photo collection, but it also could create an automatically-generated textual “diary” of a user’s day based only on their photos. The sentences themselves are also useful to aid in image retrieval by keyword search, which we illustrate for the specific application of searching for potentially private images (e.g. containing keywords like “bathroom”). Joint caption estimation over multiple images also reduces noise and errors in the captioning results, since evidence from multiple photos is used to infer each caption. We formulate this joint captioning problem in a Markov Random Field model and show how to solve it efficiently.

To our knowledge, we are the first to propose image captioning as an important task for lifelogging photos, as well as the first to apply and evaluate automatic image captioning models in this domain. To summarize our contributions, we (1) learn and apply deep image captioning models to lifelogging photos, including proposing a novel method for generating photo descriptions with diverse structures and perspectives; (2) propose a novel technique for inferring captions for streams of photos taken over time in order to find and summarize coherent activities and other groups of photos; (3) create an online framework for collecting and annotating lifelogging images, and use it to collect a realistic lifelogging dataset consisting of thousands of photos and thousands of reference sentences; and (4) evaluate these techniques on our data, both quantitatively and qualitatively, under different simulated use cases.

2 Related Work

While wearable cameras have been studied for over a decade in the research community [23, 37, 1], only recently have they become practical enough for consumers to use on a daily basis. Recent work has explored using them to aid human memory for retrospection [7, 53, 11, 21], to help students learn [4], to assist people with visual impairments [27], and to study people’s activities at a fine grain resolution for psychological studies [26, 12, 2], among many other applications. These wearable camera applications raise a number of challenges. From a privacy perspective, for example, Denning et al. [10] and Nguyen et al. [41] study how bystanders react to wearable cameras, while Hoyle et al. [24] identify privacy risks to the camera wearers themselves. From a technical standpoint, many applications would require automatic techniques to analyze and organize the vast quantities of images that wearable cameras collect. In the computer vision field, recent work has begun to study this new style of imagery, which is significantly different from photos taken by traditional point-and-shoot cameras. Specific research topics have included recognizing objects [32, 16], scenes [17], and activities [15, 44, 6, 45]. Some computer vision work has specifically tried to address privacy concerns, by recognizing photos taken in potentially sensitive places like bathrooms [49], or containing sensitive objects like computer monitors [32]. However, these techniques typically require that classifiers be explicitly trained for each object, scene type, or activity of interest, which limits their scalability.

Instead of classifying lifelogging images into pre-defined and discrete categories, we propose to annotate them with automatically-generated, free-form image captions, inspired by recent progress in deep learning. Convolutional Neural Networks (CNNs) have recently emerged as powerful models for object recognition in computer vision [48, 14, 18, 33], while Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) have been developed for learning models of sequential data, like natural language sentences [13, 19]. The combination of CNNs for recognizing image content and RNNs for modeling language has recently been shown to generate surprisingly rich image descriptions [38, 52, 28], essentially “translating” from image features to English sentences [29].

Some closely related work has been done to generate textual descriptions from videos. Venugopalan et al. [51] use an image captioning model to generate video descriptions from a sequence of video frames. Like previous image captioning papers, their method estimates a single sentence for each sequence, while we explicitly generate multiple diverse sentences and evaluate the image-sentence matching quality to improve the captions from noisy, poorly-composed lifelogging images. Zhu et al. [54] use neural sentence embedding to model a sentence-sentence similarity function, and use LSTMs to model image-sentence similarity in order to align subtitles of movies with sentences from the original books. Their main purpose is to find corresponding movie clips and book paragraphs based on visual and semantic patterns, whereas ours is to infer novel sentences from new lifelogging image streams.

3 Lifelogging Data Collection

To train and test our techniques, two of the authors wore Narrative Clip lifelogging cameras over a period of about five months (June-Aug 2015 and Jan-Feb 2016), to create a repository of 7,716 lifelogging photos. To facilitate collecting lifelogging photos and annotations, we built a website which allowed users to upload and label photos in a unified framework, using the Narrative Clip API.

We collected textual annotations for training and testing the system in two different ways. First, the two authors and three of their friends and family members used the online system to submit sentences for randomly-selected images, producing 2,683 sentences for 696 images. Annotators were asked to produce at least two sentences per image: one that described the photo from a first-person perspective (e.g., “I am eating cereal at the kitchen table.”) and one from a third-person perspective (e.g. “A bowl of cereal sits on a kitchen table.”). We requested sentences from each of these perspectives because we have observed that some scenes are more naturally described by one perspective or the other. Annotators were welcome to enter multiple sentences, and each image was viewed by an average of 1.45 labelers.

Second, to generate more diversity in annotators and annotations, we published 293 images on Amazon’s Mechanical Turk (AMT), showing each photo to at least three annotators and, as before, asking each annotator to give at least one first-person and one third-person sentence. (We randomly chose 300 images, but removed 7 that we were not comfortable sharing with the public, e.g. photos of strangers whose permission we were not able to obtain.) This produced a set of 1,813 sentences, or an average of 6.2 sentences per image. A total of 121 distinct Mechanical Turk users contributed sentences.

Finally, we also downloaded COCO [35], a popular publicly-available dataset of 80,000 photos and 400,000 sentences. These images are from Internet and social media sources, and thus are significantly different from the lifelogging context we consider here, but we hypothesized that they may be useful additional training data to augment our smaller lifelogging dataset.

4 Automatic Lifelog Image Captioning

We now present our technique for using deep learning to automatically annotate lifelogging images with captions. We first give a brief review of deep image captioning models, and then show how to take advantage of streams of lifelogging images by estimating captions jointly across time, which not only helps reduce noise in captions by enforcing temporal consistency, but also helps summarize large photo collections with smaller subsets of sentences.

4.1 Background: Deep networks for captioning

Automatic image captioning is a difficult task because it requires not only identifying important objects and actions, but also describing them in natural language. However, recent work in deep learning has demonstrated impressive results in generating image and video descriptions [28, 51, 54]. The basic high-level idea is to learn a common feature space that is shared by both images and words. Then, given a new image, we generate sentences that are “nearby” in the same feature space. The encoder (mapping from image to feature space) is typically a Convolutional Neural Network (CNN), which abstracts images into a vector of local and global appearance features. The decoder (mapping from feature space to words) produces a word vector using a Recurrent Neural Network (RNN), which abstracts out the semantic and syntactic meaning.

For extracting visual features, Convolutional Neural Networks (CNNs) [33] have become very popular. A typical CNN is much like a classical feed-forward neural network, except that the connections between early layers are not all fully-connected, but instead have specially-designed structures with shared weights that encode operations like convolutions and spatial pooling across local image regions. Modern CNNs are also typically very deep, often with 20 or more layers. While the final output of a CNN differs based on the task (e.g., a class label for image classification), a common trick is to use the output of one of the last hidden layers as a feature vector to represent the visual appearance of an input image. These “deep features” produced by CNNs have been repeatedly shown [43, 18] to outperform traditional hand-made image features such as SIFT [36] and HOG [9].

Figure 1: Illustration of deep image captioning using a CNN and a two-layer LSTM. An image is fed into a CNN to produce a visual feature representation, and presented to the LSTM as its initial input. Then each word in a training sentence is presented to the LSTM at each step. A softmax layer is attached to the second layer of the LSTM to generate predicted sentences and softmax loss.

For modeling sequences like sentences, Recurrent Neural Networks (RNNs) [13, 19], and specifically Long Short-Term Memory (LSTM) models [22, 19], have become popular. RNNs include hidden units that are self-connected (i.e. some of their inputs are connected to their own outputs), which have the effect of having “memory” that can develop internal representations for the patterns of input sequences [13]. LSTMs are a special form of RNNs that include an array of specially-designed hidden units called memory blocks, each of which contains three gates and a memory cell. In any given iteration of training, a memory block can choose to read or ignore its input, to remember or forget its current cell value, or to output or suppress the new cell value.

These two ingredients of CNN and LSTM models are usually combined for image captioning in the following way [52, 28, 51]. The training data consists of a set of images, each with at least one human-generated reference sentence. During training, for any given image, we first generate a corresponding deep visual feature vector using a CNN. This vector is then presented as the initial input to the LSTM model. Then, the LSTM model is presented with each word in the training sentence in turn, by inputting the word vector corresponding to each word to the LSTM. The output from the hidden unit predicts the next word in the sentence, in particular giving a probability distribution over words in the dictionary. During each step of training, error is back-propagated from the following step as well as from the softmax (word generation) layer. Word vectors as well as weights of hidden units are trained and updated during back-propagation. An intuitive way to visualize an LSTM is to unroll the hidden states at each time step, as shown in Figure 1. As shown in the figure, in practice we use a two-layer LSTM to get better predictions, such that the hidden states of the first layer serve as input to the second layer and word predictions are emitted from the output of the second layer.

At test time, to generate a caption for a new image, we again use a CNN to produce an image feature vector and present it to the LSTM, which then predicts the first word of the sentence based on the visual features. After that, the best predicted word at step $t$ is used as the input for step $t+1$. In the prediction stage, a forward pass of the LSTM generates a full sentence terminated by a stop word for each input image. Similar image captioning models have been discussed in detail in recent papers [52, 28, 51]. In Section 4.2.1, we discuss in detail how to generate diverse captions for a single image.
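The test-time loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for the CNN features plus the two-layer LSTM, and the `<stop>` token name is invented for the example.

```python
from typing import Callable, Dict, List

STOP = "<stop>"  # hypothetical end-of-sentence token


def greedy_decode(
    next_word_probs: Callable[[List[str]], Dict[str, float]],
    max_len: int = 20,
) -> List[str]:
    """Feed the best word predicted at each step back in as the next input."""
    words: List[str] = []
    for _ in range(max_len):
        probs = next_word_probs(words)      # distribution over the dictionary
        best = max(probs, key=probs.get)    # best prediction at this step
        if best == STOP:                    # the stop word ends the sentence
            break
        words.append(best)
    return words


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for the real network: favors a fixed word at each position."""
    script = ["i", "am", "eating", "cereal", STOP]
    target = script[min(len(prefix), len(script) - 1)]
    return {w: (0.9 if w == target else 0.025) for w in set(script)}


caption = greedy_decode(toy_model)
print(" ".join(caption))
```

In a real system the callable would wrap a forward pass of the trained network; the loop itself stays the same.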

4.2 Photo Grouping and Activity Summarization

The techniques in the last section automatically estimate captions for individual images. However, lifelogging users do not typically capture individual images in isolation, but instead collect long streams of photos taken at regular intervals over time (e.g., every 30 seconds for Narrative Clip). This is a significant difference from most applications of image captioning that have been studied before, which target isolated images found on the Internet or in social media, and represents both a challenge and an opportunity. The challenge is that generating thousands of captions for a day’s activities, one for each photo, could easily overwhelm a user. The opportunity is that photos taken close together in time usually show the same scene or activity. This means that evidence from multiple images can be combined together to produce better captions than is possible from observing any single image, in effect “smoothing out” noise in any particular image by examining the photos taken nearby in time. The resulting sentences could also provide more concise summarizations, helping people find, remember, and organize photos according to broad events instead of individual moments.

Suppose we wish to estimate captions for a stream of images $I_1, I_2, \dots, I_K$, which are sorted in order of increasing timestamps. We first generate multiple diverse captions for each individual image, using a technique we describe in the next subsection. We combine all of these sentences together across images into a large set of candidates $C$ (with $|C| = dK$, where $d$ is the number of diverse sentences generated per image; we use $d = 15$). We wish to estimate a sequence of sentences $S_1, S_2, \dots, S_K$ such that each sentence describes its corresponding image well, but also such that the sentences are relatively consistent across time. In other words, we want to estimate a sequence of sentences so as to minimize an energy function,

$$E(S_1, \dots, S_K) = \sum_{i=1}^{K} \mathrm{Score}(S_i, I_i) + \beta \sum_{i=1}^{K-1} \mathbb{1}(S_i, S_{i+1}), \tag{1}$$

where each $S_i \in C$, $\mathrm{Score}(S_i, I_i)$ is a unary cost function measuring the quality of a given sentence $S_i$ in describing a single image $I_i$, $\mathbb{1}(S_i, S_{i+1})$ is a pairwise cost function that is 0 if $S_i$ and $S_{i+1}$ are the same and 1 otherwise, and $\beta$ is a constant. Intuitively, $\beta$ controls the degree of temporal smoothing of the model: when $\beta = 0$, for example, the model simply chooses sentences for each image independently without considering neighboring images in the stream, whereas when $\beta$ is very large, the model will try to find a single sentence to describe all of the images in the stream.

Equation (1) is a chain-structured Markov Random Field (MRF) model [31], which means that the optimal sequence of sentences can be found efficiently using the Viterbi algorithm. All that remains is to define two key components of the model: (1) a technique for generating multiple, diverse candidate sentences for each image, in order to obtain the candidate sentence set, and (2) the Score function, which requires a technique for measuring how well a given sentence describes a given image. We now describe these two ingredients in turn.
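As a rough illustration of the inference step, the sketch below runs the Viterbi recursion on a chain with a unary cost per (image, candidate) pair and a constant penalty whenever neighboring sentences differ. The function name, the toy cost table, and the assumption that the smoothing constant is non-negative are ours, not details taken from the paper's implementation.

```python
from typing import List, Sequence


def viterbi_captions(unary: Sequence[Sequence[float]], beta: float) -> List[int]:
    """Minimize sum_i unary[i][s_i] + beta * (number of i with s_i != s_{i+1}).

    unary[i][c] is the cost of candidate c for image i (lower is better).
    Assumes beta >= 0. Returns the chosen candidate index for each image.
    """
    n_images, n_cands = len(unary), len(unary[0])
    cost = list(unary[0])
    back: List[List[int]] = []
    for i in range(1, n_images):
        # With a 0/1 pairwise cost, the best predecessor of candidate c is
        # either c itself (no penalty) or the cheapest candidate (penalty beta).
        cheapest = min(range(n_cands), key=cost.__getitem__)
        new_cost, pointers = [], []
        for c in range(n_cands):
            if cost[c] <= cost[cheapest] + beta:
                prev, prev_cost = c, cost[c]
            else:
                prev, prev_cost = cheapest, cost[cheapest] + beta
            new_cost.append(prev_cost + unary[i][c])
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    state = min(range(n_cands), key=cost.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy costs: 3 images, 2 candidate sentences.
unary = [[0.1, 0.9], [0.6, 0.4], [0.2, 0.8]]
print(viterbi_captions(unary, beta=0.0))  # independent choice per image
print(viterbi_captions(unary, beta=1.0))  # smoothing favors a single sentence
```

Because the pairwise term only distinguishes "same" from "different", each step costs time linear in the number of candidates rather than quadratic, which matters when the candidate set pools sentences from every image in the stream.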

4.2.1 Generating Diverse Captions

Our joint captioning model above requires a large set of candidate sentences. Many possible sentences can correctly describe any given image, and thus it is desirable for the automatic image captioning algorithm to generate multiple sentences that describe the image in multiple ways. This is especially true for lifelogging images that are often noisy, poorly composed, and ambiguous, and can be interpreted in different ways. Vinyals et al. [52] use beam search to generate multiple sentences, by having the LSTM model keep $b$ candidate sentences at each step of sentence generation (where $b$ is called the beam size). However, we found that this existing technique did not work well for lifelogging sentences, because it produced very homogeneous sentences, even with a high beam size.

Figure 2: Sample captions generated by models pre-trained on COCO and fine-tuned on the lifelogging dataset. Three different colors show the top three predictions produced in three beam searches by applying the Diverse M-Solutions technique. Within each beam search, sentences tend to have similar structures and describe the image from a similar perspective; between consecutive beam searches, structures and perspectives tend to be different.

To encourage greater diversity, we apply the Diverse M-best solutions technique of Batra et al. [5], which was originally proposed to find multiple high-likelihood solutions in graphical model inference problems. We adapt this technique to LSTMs by performing multiple rounds of beam search. In the first round, we obtain a set of predicted words for each position in the sentence. In the second round, we add a bias term that reduces the network activation values of words found in the first beam search by a constant value. Intuitively, this decreases the probability that a word found during the previous beam search is selected again at the same word position in the sentence. Depending on the degree of diversity needed, additional rounds of beam search can be conducted, each time penalizing words that have occurred in any previous round. In our current implementation, we use three rounds of beam search and set the beam size to five, so we generate a total of 15 candidate sentences for each individual image. The set of all of these sentences across all images in the photo stream forms the candidate sentence set in Equation (1).

Figure 2 presents sample automatically-generated results using three rounds of beam search and a beam size of 3 for illustration purposes. We see that the technique successfully injects diversity into the set of estimated captions. Many of the captions are quite accurate, including “A man is sitting at a table” and “I am having dinner with my friends,” while others are not correct (e.g. “A man is looking at a man in a red shirt”), and others are nonsensical (“There is a man sitting across the table with a man”). Nevertheless, the captioning results are overall remarkably accurate for an automatic image captioning system, reflecting the power of deep captioning techniques to successfully model both image content and sentence generation.
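The multi-round beam search described above might look roughly like the following sketch. It simplifies in several ways that are our assumptions: sentences have a fixed length with no stop word, the "network" is a hypothetical table of per-position word scores, and the penalty is subtracted from those scores rather than from real LSTM activations.

```python
from typing import Callable, Dict, List, Set, Tuple

Scorer = Callable[[Tuple[str, ...]], Dict[str, float]]


def beam_search(scorer: Scorer, beam: int, length: int,
                banned: List[Set[str]], penalty: float) -> List[Tuple[str, ...]]:
    """Fixed-length beam search; words in banned[t] lose `penalty` at step t."""
    beams: List[Tuple[float, Tuple[str, ...]]] = [(0.0, ())]
    for t in range(length):
        expanded = []
        for score, prefix in beams:
            for word, s in scorer(prefix).items():
                if word in banned[t]:
                    s -= penalty  # bias against words used in earlier rounds
                expanded.append((score + s, prefix + (word,)))
        beams = sorted(expanded, key=lambda x: -x[0])[:beam]
    return [prefix for _, prefix in beams]


def diverse_captions(scorer: Scorer, rounds: int = 3, beam: int = 5,
                     length: int = 4, penalty: float = 10.0):
    """Run several rounds, penalizing words seen at the same position before."""
    banned: List[Set[str]] = [set() for _ in range(length)]
    results = []
    for _ in range(rounds):
        found = beam_search(scorer, beam, length, banned, penalty)
        results.extend(found)
        for sentence in found:
            for t, word in enumerate(sentence):
                banned[t].add(word)
    return results


# Hypothetical two-word "language model" indexed by position only.
TABLE = [{"a": 2.0, "i": 1.0}, {"man": 2.0, "am": 1.0}]


def toy_scorer(prefix: Tuple[str, ...]) -> Dict[str, float]:
    return TABLE[len(prefix)]


print(diverse_captions(toy_scorer, rounds=2, beam=1, length=2))
```

With the settings reported in the text (three rounds, beam size five), the same loop would return 15 candidates per image.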

4.2.2 Image-sentence quality alignment

The joint captioning model in Equation (1) also requires the Score function, which measures how well an arbitrary sentence describes a given image. The difficulty here is that the LSTM model described above tells us how to generate sentences for an image, but not how to measure their similarity to a given image. Doing this requires us to explicitly align certain words of the sentence to certain regions of an image, i.e., determining which “part” of an image generated each word. Aligning the words of a sentence with regions of an image is a difficult task. Common captioning datasets such as COCO [35] contain image-level captions, but do not have ground truth data mapping words to specific objects or image regions.

Karpathy et al. [30] propose matching each region with the word with maximum inner product (interpreted as a similarity measure) across all words in terms of learnable region vectors and word vectors, and to sum all similarity measures over all regions as the total score. They use Regions with CNN (R-CNN) [18] to detect image regions and obtain region feature vectors. Word vectors are encoded by a Bidirectional LSTM (BLSTM) [46, 20], which is a variant of LSTM that captures contextual information from not only previous words but also future ones. They also construct positive training instances (true image-description pairs) and negative training instances (images with randomly sampled sentences from other images) to train this model in an unsupervised fashion which seeks to maximize the margin between scores of positive training instances and negative instances. We implement their method and train this image-sentence alignment model on our lifelogging dataset. To generate the matching score for Equation (1), we extract region vectors from the image, retrieve trained word vectors for the words in the sentence, and sum the similarity measures of regions with their best-aligned words.
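The region-word matching score can be illustrated with a small sketch. The vectors below are made-up two-dimensional examples and `alignment_score` is a hypothetical helper name; in the model described above, region vectors would come from R-CNN and word vectors from a BLSTM. Note that this score is larger for better matches, so using it as the unary cost in a minimization would presumably require negating or otherwise inverting it, which is an assumption here rather than something the text states.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product, interpreted as a region-word similarity."""
    return sum(a * b for a, b in zip(u, v))


def alignment_score(regions: List[Vector], words: List[Vector]) -> float:
    """Align each region to its best-matching word and sum the similarities."""
    return sum(max(dot(r, w) for w in words) for r in regions)


# Toy example: two image regions and three word vectors.
regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[0.5, 0.0], [0.0, 2.0], [1.0, 1.0]]

score = alignment_score(regions, words)
print(score)
```

Here the first region aligns best with the third word (similarity 1.0) and the second region with the second word (similarity 2.0).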

4.2.3 Image grouping

Finally, once captions have been jointly inferred for each image in a photostream, we can group together contiguous substreams of images that share the same sentence. Figure 3 shows examples of activity summarization. In general, the jointly-inferred captions are reasonable descriptions of the images, and much less noisy than those produced from individual images in Figure 2, showing the advantage of incorporating temporal reasoning into the captioning process. For example, the first row of images shows that the model labeled several images as “I am talking with a friend while eating a meal in a restaurant,” even though the friend is only visible in one of the frames, showing how the model has propagated context across time. Of course, there are still mistakes ranging from the minor error that there is no broccoli on the plate in the second row to the more major error that the last row shows a piano and not someone typing on a computer. The grammar of the sentences is generally good considering that the model has no explicit knowledge of English besides what it has learned from training data, although usage errors are common (e.g., “I am shopping kitchen devices in a store”).

Figure 3: Randomly-chosen samples of activity summarization on our dataset.
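The grouping step in Section 4.2.3 amounts to collapsing runs of identical captions in a time-ordered stream. A minimal sketch, with a hypothetical `group_stream` helper and made-up captions:

```python
from itertools import groupby
from typing import List, Tuple


def group_stream(captions: List[str]) -> List[Tuple[str, int, int]]:
    """Return (sentence, first_index, last_index) for each contiguous run."""
    groups = []
    start = 0
    for sentence, run in groupby(captions):
        n = len(list(run))
        groups.append((sentence, start, start + n - 1))
        start += n
    return groups


stream = [
    "I am eating a meal",
    "I am eating a meal",
    "I am walking down a street",
    "I am eating a meal",
]
for sentence, first, last in group_stream(stream):
    print(f"photos {first}-{last}: {sentence}")
```

Only contiguous photos are merged, so a sentence that recurs later in the day starts a new group rather than joining the earlier one.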

5 Experimental evaluation

We first use automatic metrics that compare to ground truth reference sentences with quantitative scores. To give a better idea of the actual practical utility of our technique, we also evaluate in two other ways: using a panel of human judges to rate the quality of captioning results, and testing the system in a specific application of keyword-based image retrieval using the generated captions.

5.1 Quantitative captioning evaluation

Automated metrics such as BLEU [42], CIDEr [50], Meteor [3] and Rouge-L [34] have been proposed to score sentence similarity compared to reference sentences provided by humans, and each has different advantages and disadvantages. BLEU was originally intended for evaluating machine translations but is also commonly used to evaluate captioning. BLEU counts occurrences of n-grams in a candidate sentence (clipped by the maximum occurrences in the reference sentences), and normalizes by the total number of n-grams in the candidate. It is typically evaluated for different values of $n$, and in particular BLEU-$n$ refers to the geometric mean from 1-grams to $n$-grams; here we report scores of BLEU-1 to 4. CIDEr [50] computes average cosine similarity (with TF-IDF weighting) between the $n$-grams (typically up to 4-grams) of the generated sentence and human-generated reference sentences. Meteor [3] compares unigrams between generated and reference sentences using different degrees of similarity (exact match, matching stems, synonymy). Rouge-L [34] produces an F-measure based on the length of the longest common subsequence of the candidate and reference sentences. We present results using all of these metrics (using the MS COCO Detection Challenge implementation), and also summarize the seven scores with their mean.
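To make the clipping step in BLEU concrete, here is a small sketch of the clipped n-gram precision only. It deliberately omits the rest of BLEU (the geometric mean across n-gram orders and the brevity penalty), and the function names and example sentences are ours.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: List[str], references: List[List[str]], n: int) -> float:
    """Fraction of candidate n-grams matched, clipped by max reference counts."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref: Counter = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())


cand = "the the the the the the the".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(clipped_precision(cand, refs, 1))
print(clipped_precision(cand, refs, 2))
```

The degenerate candidate repeats "the" seven times, but the word appears at most twice in any one reference, so only two of the seven unigrams are credited; without clipping the precision would be a misleading 7/7.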

5.1.1 Implementation

A significant challenge with deep learning-based methods is that they typically require huge amounts of training data, both in terms of images and sentences. Unfortunately, collecting this quantity of lifelogging images and annotations is very difficult. To try to overcome this problem, we augmented our lifelogging training set with COCO data, and compared three different training strategies: Lifelog only training used only our lifelogging dataset, consisting of 736 lifelogging photos with 4,300 human-labeled sentences; COCO only training used only the COCO dataset; and COCO then Lifelog started with the COCO only model, and then used it as initialization when re-training the model on the lifelogging dataset (i.e., “fine-tuning” [33]).

For extracting image features, we use the VGGNet [47] CNN model. The word vectors are learned from scratch. Our image captioning model stacks two LSTM layers, and each layer structure closely follows the one described in [52]. To boost training speed, we re-implemented the LSTM model in C++ using the Caffe [25] deep learning package. It takes about 2.5 hours for COCO pre-training, and about 1 hour for fine-tuning on the Lifelog dataset, with 10,000 iterations for both.

At test time, the number of beam searches conducted during caption inference controls the degree of diversity in the output; here we use three to match the three styles of captions we expect (COCO, first-person, and third-person perspectives). Samples of predicted sentences are shown in Figure 2. These samples suggest that different genres of training sentences help tune the hidden states of the LSTM and thus enable it to produce diverse sentence structures at test time.

| Training | Testing | Bleu-1 | Bleu-2 | Bleu-3 | Bleu-4 | CIDEr | METEOR | ROUGE | Mean |
|---|---|---|---|---|---|---|---|---|---|
| Lifelog | Lifelog 100 | 0.669 | 0.472 | 0.324 | 0.218 | 0.257 | 0.209 | 0.462 | 0.373 |
| COCO | Lifelog 100 | 0.561 | 0.354 | 0.206 | 0.118 | 0.143 | 0.149 | 0.374 | 0.272 |
| COCO+Lifelog | Lifelog 100 | 0.666 | 0.469 | 0.319 | 0.210 | 0.253 | 0.207 | 0.459 | 0.369 |
| Lifelog@Usr1 | Lifelog@Usr2 | 0.588 | 0.410 | 0.279 | 0.189 | 0.228 | 0.195 | 0.431 | 0.331 |
| Lifelog@2015 | Lifelog@2016 | 0.557 | 0.379 | 0.249 | 0.160 | 0.325 | 0.202 | 0.425 | 0.328 |

Table 1: Bleu-1 to Bleu-4, CIDEr, Meteor and Rouge scores for diverse 3-best beams of captions on the test set.

5.1.2 Results

Table 1 presents quantitative results of each of these training strategies, all tested on the same set of 100 randomly-selected photos having 1,000 ground truth reference sentences, using each of the seven automatic scoring metrics mentioned above. We find that the Lifelog only strategy achieves much higher overall accuracy than COCO only, with a mean score of 0.373 vs. 0.272. This suggests that even though COCO is a much larger dataset, images from social media are different enough from lifelogging images that the COCO only model does not generalize well to our application. Moreover, this may also reflect an artifact of the automated evaluation, because Lifelog only benefits from seeing sentences with similar vocabulary and in a similar style as in the reference sentences, since the same small group of humans labeled both the training and test datasets. More surprisingly, we find that Lifelog only also slightly outperforms COCO then Lifelog (0.373 vs. 0.369). The model produced by the latter training strategy has a larger vocabulary and produces richer styles of sentences than Lifelog only, which hurts its quantitative score. Qualitatively, however, it often produces more diverse and descriptive sentences because of its larger vocabulary and ability to generate sentences in first-person, third-person, and COCO styles. Samples of generated diverse captions are shown in Figure 2.
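The mean scores quoted above are unweighted averages of the seven metric values in each row of Table 1, which can be checked with a few lines of arithmetic. The dictionary layout and variable names are ours; the numbers are copied from the table.

```python
# Seven metric scores per training strategy, in Table 1 column order:
# Bleu-1, Bleu-2, Bleu-3, Bleu-4, CIDEr, METEOR, ROUGE.
rows = {
    "Lifelog":      [0.669, 0.472, 0.324, 0.218, 0.257, 0.209, 0.462],
    "COCO":         [0.561, 0.354, 0.206, 0.118, 0.143, 0.149, 0.374],
    "COCO+Lifelog": [0.666, 0.469, 0.319, 0.210, 0.253, 0.207, 0.459],
}

# Unweighted mean of the seven scores, rounded as in the table.
means = {name: round(sum(vals) / len(vals), 3) for name, vals in rows.items()}

for name, mean in means.items():
    print(f"{name}: {mean}")
```

Since the metrics live on somewhat different scales, this mean is best read as a convenient summary rather than a principled combined score.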

We conducted experiments with two additional strategies in order to simulate more realistic scenarios. The first scenario reflects when a consumer first starts using our automatic captioning system on their images without having supplied any training data of their own. We simulate this by training the image captioning model on one user’s photos and testing on another’s. The training set has 805 photos and 3,716 reference sentences; the testing set has 40 photos and 565 reference sentences. The mean quantitative accuracy declines relative to our earlier experiments, in which training and testing images were sampled from the same set, as shown in Table 1, although the decline is not very dramatic (from 0.373 to 0.331), and the result is still much better than training on COCO (0.272). This result suggests that the captioning model has learned general properties of lifelogging images, instead of overfitting to one particular user (e.g., simply “memorizing” the appearance of the places they frequently visit and the activities they frequently do).

The second scenario is when an existing model trained on historical lifelogging data is used to caption new photos. We simulate this by taking all lifelogging photos from 2015 as training data and photos from 2016 as test data. The training set has 673 photos and 3,610 sentences; the test set has 30 photos and 172 sentences. As shown in Table 1, this scenario performs very slightly worse than training on data from a different user (0.328 vs. 0.331), although the difference is likely not statistically significant.

5.2 Evaluation with human judges

The evaluation metrics used in the last section are convenient because they can be automatically computed from ground-truth reference sentences, and are helpful for objectively comparing different methods. However, they give little insight into how accurate or descriptive the generated captions are, or whether the captions would be useful for real lifelogging users.

We conducted a small study using human judges to rate the quality of our automatically-generated captions. In particular, we randomly selected 21 images from the Lifelog 100 test dataset (used in Table 1) and generated captions using our model trained on the COCO then Lifelog scenario. For each image, we generated 15 captions (with 3 rounds of beam search, each with beam size 5), and then kept the top-scoring caption according to our model and four randomly-sampled from the remaining 14, to produce a diverse set of five automatically-generated captions per image. We also randomly sampled five of the human-generated reference sentences for each image.
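The selection step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `(caption, score)` pair format, and the use of a seeded random generator are assumptions, and the scores are taken to be whatever likelihood the captioning model assigns to each candidate.

```python
import random


def select_diverse_captions(scored_captions, num_random=4, seed=None):
    """Keep the top-scoring caption plus `num_random` others sampled at random.

    `scored_captions` is a list of (caption, score) pairs, where a higher
    score means the model considers the caption more likely (an assumption
    about how candidates are ranked).
    """
    if len(scored_captions) < num_random + 1:
        raise ValueError("not enough candidate captions to sample from")
    rng = random.Random(seed)
    ranked = sorted(scored_captions, key=lambda pair: pair[1], reverse=True)
    top, rest = ranked[0], ranked[1:]
    # Sample without replacement from the remaining candidates.
    return [top] + rng.sample(rest, num_random)


# Hypothetical candidates: 15 captions, as if from 3 rounds of beam size 5.
candidates = [(f"caption {i}", -0.5 * i) for i in range(15)]
chosen = select_diverse_captions(candidates, seed=0)
```

With 15 candidates this keeps the best caption and draws four of the remaining 14, giving the five automatic captions per image used in the study.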

For each of the ten captions (five automatic plus five human), we showed the image (after reviewing it for potentially private content and obtaining permission of the photo-taker) and caption to a user on Amazon Mechanical Turk, without telling them how the caption had been produced. We asked them to rate, on a five-point Likert scale, how strongly they agreed with two statements: (1) “The sentence or phrase makes sense and is grammatically correct (ignoring minor problems like capitalization and punctuation),” and (2) “The sentence or phrase accurately describes either what the camera wearer was doing or what he or she was looking at when the photo was taken.” The task involved 630 individual HITs from 37 users.

Table 2 summarizes the results, comparing the average ratings over the 5 human reference sentences, the average over all 5 diverse automatically-generated captions (Auto-5 column), and the single highest-likelihood caption as estimated by our complete model (Auto-top). About 92% of the human reference sentences were judged as grammatically correct (i.e., somewhat or strongly agreeing with statement (1)), compared to about 77% for the automatically-generated diverse captions and 81% for the single best sentence selected by our model. Humans also described images more accurately than the diverse captions (88% vs 54%), although the fact that 64% of our single best estimated captions were accurate indicates that our model is often able to identify which one is best among the diverse candidates. Overall, our top automatic caption was judged to be both grammatically correct and accurate 59.5% of the time, compared to 84.8% of the time for human reference sentences.

We view these results as very promising, as they suggest that automatic captioning can generate reasonable sentences for over half of lifelogging images, at least in some applications. For example, for 19 (90%) of the 21 images in the test set, at least one of the five diverse captions was unanimously judged to be both grammatically correct and accurate by all 3 judges. This may be useful in retrieval applications where recall is important and noise in some captions is tolerable as long as at least one of them is correct. We consider one such application in the next section.

                  Grammar                        Accuracy
Rating    Human   Auto-5   Auto-top      Human   Auto-5   Auto-top
1         1.9%    7.6%     11.9%         2.9%    22.4%    21.4%
2         3.8%    10.0%    7.1%          3.8%    15.2%    7.1%
3         0.5%    5.7%     0.0%          4.8%    8.1%     7.1%
4         19.0%   17.6%    4.8%          22.4%   17.6%    19.0%
5         73.3%   59.0%    76.2%         65.2%   36.7%    45.2%
Mean      4.60    4.10     4.26          4.45    3.31     3.60
Table 2: Summary of grammatical correctness and accuracy of lifelogging image captions, on a rating scale from 1 (Strongly Disagree) to 5 (Strongly Agree), averaged over 3 judges. The Human column is averaged over 5 human-generated reference sentences, Auto-5 is averaged over 5 diverse computer-generated sentences, and Auto-top is the single highest-likelihood computer-generated sentence predicted by our model.
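The agreement percentages quoted in the text and the Mean row of Table 2 can be recomputed from the rating distributions. The sketch below is illustrative only: the dictionary layout and function names are assumptions, and dividing by the column total (rather than by 100) is a choice made here because not every column in the table sums to exactly 100%.

```python
RATINGS = (1, 2, 3, 4, 5)

# Percentage of responses at each rating (1..5), copied from Table 2.
TABLE2 = {
    ("grammar", "Human"): (1.9, 3.8, 0.5, 19.0, 73.3),
    ("grammar", "Auto-5"): (7.6, 10.0, 5.7, 17.6, 59.0),
    ("grammar", "Auto-top"): (11.9, 7.1, 0.0, 4.8, 76.2),
    ("accuracy", "Human"): (2.9, 3.8, 4.8, 22.4, 65.2),
    ("accuracy", "Auto-5"): (22.4, 15.2, 8.1, 17.6, 36.7),
    ("accuracy", "Auto-top"): (21.4, 7.1, 7.1, 19.0, 45.2),
}


def agreement_rate(percentages):
    """Percent of responses that somewhat or strongly agree (rating 4 or 5)."""
    return percentages[3] + percentages[4]


def mean_rating(percentages):
    """Mean Likert rating, normalized by the column total."""
    total = sum(percentages)
    return sum(r * p for r, p in zip(RATINGS, percentages)) / total


for (statement, column), pcts in TABLE2.items():
    print(f"{statement:8s} {column:8s} "
          f"agree={agreement_rate(pcts):5.1f}%  mean={mean_rating(pcts):.2f}")
```

Summing the rating 4 and 5 rows gives the roughly 92%, 77%, and 81% grammar figures and the 88%, 54%, and 64% accuracy figures discussed above, and the normalized means should land within about 0.01 of the table's Mean row.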

5.3 Keyword-based image retrieval

Image captioning allows us to directly implement keyword-based image retrieval by searching on the generated captions. We consider a particular application of this image search feature here that permits a quantitative evaluation. As mentioned above, wearable cameras can collect a large number of images containing private information. Automatic image captioning could allow users to find potentially private images easily, and then take appropriate action (like deleting or encrypting the photos). We consider two specific types of potentially embarrassing content here: photos taken in potentially private locations like bathrooms and locker rooms, and photos containing personal computer or smartphone displays which may contain private information such as credit card numbers or e-mail contents.

We chose these two types of concerns specifically because they have been considered by others in prior work: Korayem et al. [32] present a system for detecting monitors in lifelogging images using deep learning with CNNs, while Templeman et al. [49] classify images according to the room in which they were taken. Both of these papers present strongly-supervised techniques, which were given thousands of training images manually labeled with ground truth for each particular task. In contrast, identifying private imagery based on keyword search on automatically-generated captions could avoid the need to create a training set and train a separate classifier for each type of sensitive image.

We evaluated captioning-based sensitive image retrieval against standard state-of-the-art strongly-supervised image classification using CNNs [33] (although we cannot compare directly to the results presented in [32] or [49] because we use different datasets). We trained the strongly-supervised model by first generating a training set consisting of photos having monitors and not having monitors, and photos taken in bathrooms and locker rooms or elsewhere, by using the ground truth categories given in the COCO and Flickr8k datasets. This yielded 34,736 non-sensitive images, 6,135 images taken in sensitive places, and 4,379 images with displays. We used a pre-trained AlexNet model (a 1000-way classifier trained on ImageNet data) and fine-tuned it on our dataset, replacing the final fully connected layer with a 3-way classifier to match our three-class problem.

3-way classification
                   CNN-based                    Caption-based
Actual     NotSen   Place    Display     NotSen   Place    Display
NotSen     0.730    0.130    0.140       0.686    0.117    0.197
Place      0.189    0.811    0           0.151    0.792    0.057
Display    0.300    0.043    0.657       0.143    0.008    0.849

2-way classification
               CNN-based          Caption-based
Actual     NotSen   Sen       NotSen   Sen
NotSen     0.730    0.270     0.686    0.314
Sen        0.317    0.683     0.161    0.839
Table 3: Confusion matrices for two approaches on two tasks for detecting sensitive images. Left: Results on 3-way problem of classifying into not sensitive, sensitive place (bathroom), or digital display categories. Right: Results on 2-way problem of classifying into sensitive or not (regardless of sensitivity type). Actual classes are in rows and predicted classes are in columns.

We also ran the technique proposed here, where we first generate automatic image captions, and then search through the top five captions for each image for a set of pre-defined keywords (specifically “toilet,” “bathroom,” “locker,” “lavatory,” and “washroom” for sensitive place detection, and “computer,” “laptop,” “iphone,” “smartphone,” and “screen” for display detection). If any of these keywords is detected in any of the five captions, the image is classified as sensitive, and otherwise it is estimated to be not sensitive.
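A minimal sketch of this keyword rule is shown below. The keyword lists come from the text; everything else is an assumption made for illustration: the function names, case-insensitive whole-word matching (so "lockers" would not match "locker"), and the choice to report "Place" when both keyword sets match, which the text does not specify.

```python
import re

PLACE_KEYWORDS = {"toilet", "bathroom", "locker", "lavatory", "washroom"}
DISPLAY_KEYWORDS = {"computer", "laptop", "iphone", "smartphone", "screen"}


def tokenize(caption):
    """Lowercase a caption and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", caption.lower()))


def classify_image(captions, max_captions=5):
    """Label an image from its top generated captions.

    Returns "Place", "Display", or "NotSen". Checking place keywords first
    is an arbitrary tie-breaking assumption.
    """
    words = set()
    for caption in captions[:max_captions]:
        words |= tokenize(caption)
    if words & PLACE_KEYWORDS:
        return "Place"
    if words & DISPLAY_KEYWORDS:
        return "Display"
    return "NotSen"


def is_sensitive(captions):
    """2-way decision: sensitive if any keyword appears in any caption."""
    return classify_image(captions) != "NotSen"


example = ["I am sitting at a desk.", "I am typing on my laptop."]
label = classify_image(example)
```

Adding a new type of sensitive content under this scheme only requires a new keyword set, not a new labeled training set.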

Figure 4: Precision-recall curves for retrieving sensitive images using CNNs (left) and generated captions (right).

Table 3 presents the confusion matrix for each method, using a set of 600 manually-annotated images from our lifelogging dataset as test data (with 300 non-sensitive images, 53 images in sensitive places, and 252 with digital displays). The supervised classifier is better at finding sensitive places than the keyword-based classifier (0.811 vs. 0.792), while the caption-based classifier is better at finding the second type of sensitive image, digital displays (0.849 vs. 0.657). In a real application, determining the type of private image is likely less important than simply deciding whether it is private. The 2-way classification results in Table 3 reflect this scenario, showing a confusion matrix which combines the two sensitive types and focuses on whether photos are sensitive or not.

Sensitive photo detection can also be viewed as a retrieval problem. Figure 4 shows precision-recall curves for the CNN and caption-based classifiers, respectively. The curves show the trade-off between retrieving only photos that are truly sensitive (high precision) and retrieving most of the sensitive photos (high recall). For example, using the CNN classifier, we can retrieve 80% of type 1 (sensitive place) photos with precision around 58% (Figure 4 (left), green curve); using the caption-based classifier, we can retrieve 80% of type 2 (digital display) photos with precision around 78% (Figure 4 (right), blue curve).
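For readers unfamiliar with how such curves are built, the sketch below computes precision-recall points from a ranked list. It is a generic illustration rather than the procedure used for Figure 4: the per-image confidence scores, the labels, and the function names are hypothetical, and this section does not state how the ranking score was obtained for either method.

```python
def precision_recall_points(scores, labels):
    """Return (precision, recall) after each rank position.

    `scores` are per-image confidences that the image is sensitive and
    `labels` are 1 for truly sensitive images, 0 otherwise.
    """
    total_positive = sum(labels)
    if total_positive == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        points.append((true_positives / k, true_positives / total_positive))
    return points


def precision_at_recall(points, target_recall):
    """Best precision among operating points reaching at least `target_recall`."""
    eligible = [p for p, r in points if r >= target_recall]
    return max(eligible) if eligible else 0.0


# Hypothetical scores and ground truth for five images.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]
points = precision_recall_points(scores, labels)
best = precision_at_recall(points, 0.8)
```

Reading a statement such as "80% recall at about 58% precision" off a curve corresponds to a lookup like `precision_at_recall(points, 0.8)` on the real data.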

Overall, these results suggest that keyword search in automatically-generated captions could yield similar accuracies to strongly-supervised classifiers, but without having to be explicitly re-trained on each type of private image. The two approaches may also be complementary, since they use different forms of evidence in making classification decisions, and users in a real application could choose their own trade-off on how aggressively to filter lifelogging images.

6 Conclusion

In this paper, we have proposed the concept of using automatically-generated captions to help organize and annotate lifelogging image collections. We have proposed a deep learning-based captioning model that jointly labels photo streams in order to take advantage of temporal consistency between photos. Our evaluation suggests that modern automated captioning techniques could work well enough to be used in practical lifelogging photo applications. We hope our research will motivate further efforts to use lifelogging photos and descriptions together to help people recall activities and scenarios.

7 Acknowledgements

This work was supported in part by the National Science Foundation (IIS-1253549 and CNS-1408730) and Google, and used compute facilities provided by NVidia, the Lilly Endowment through support of the IU PTI, and the Indiana METACyt Initiative. We thank Zhenhua Chen, Sally Crandall, and Xuan Dong for helping to label our lifelogging photos.


  • [1] R. Azuma, Y. Baillot, R. Behringer, S. Feiner, S. Julier, and B. MacIntyre. Recent advances in augmented reality. IEEE Computer Graphics and Applications, 21(6):34–47, 2001.
  • [2] S. Bambach, S. Lee, D. Crandall, and C. Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In IEEE Intl. Conf. on Computer Vision, 2015.
  • [3] S. Banerjee and A. Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
  • [4] D. Barreau, A. Crystal, J. Greenberg, A. Sharma, M. Conway, J. Oberlin, M. Shoffner, and S. Seiberling. Augmenting memory for student learning: Designing a context-aware capture system for biology education. American Society for Information Science and Technology, 43(1):1–6, 2006.
  • [5] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich. Diverse m-best solutions in markov random fields. In European Conf. on Computer Vision, pages 1–16. Springer, 2012.
  • [6] D. Castro, S. Hickson, V. Bettadapura, E. Thomaz, G. Abowd, H. Christensen, and I. Essa. Predicting daily activities from egocentric images using deep learning. In Intl. Symposium on Wearable Computers, 2015.
  • [7] S. Clinch, P. Metzger, and N. Davies. Lifelogging for observer view memories: an infrastructure approach. In 2014 ACM Intl. Joint Conf. on Pervasive and Ubiquitous Computing: Adjunct Publication, pages 1397–1404, 2014.
  • [8] D. Crandall and C. Fan. Deepdiary: Automatically captioning lifelogging image streams. In European Conference on Computer Vision International Workshop on Egocentric Perception, Interaction, and Computing.
  • [9] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In IEEE Conf. on Computer Vision and Pattern Recognition, volume 1, pages 886–893, 2005.
  • [10] T. Denning, Z. Dehlawi, and T. Kohno. In situ with bystanders of augmented reality glasses: Perspectives on recording and privacy-mediating technologies. In ACM SIGCHI Conf. on Human Factors in Computing Systems, pages 2377–2386, 2014.
  • [11] A. Doherty, K. Pauly-Takacs, N. Caprani, C. Gurrin, C. Moulin, N. O’Connor, and A. Smeaton. Experiences of aiding autobiographical memory using the sensecam. Human–Computer Interaction, 27(1-2):151–174, 2012.
  • [12] A. R. Doherty, N. Caprani, V. Kalnikaite, C. Gurrin, A. F. Smeaton, E. Noel, et al. Passively recognising human activities through lifelogging. Comput. Hum. Behav., 27(5):1948–1958, 2011.
  • [13] J. L. Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • [14] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 2155–2162, 2014.
  • [15] A. Fathi, Y. Li, and J. M. Rehg. Learning to recognize daily actions using gaze. In European Conf. on Computer Vision, pages 314–327. Springer, 2012.
  • [16] A. Fathi, X. Ren, and J. M. Rehg. Learning to recognize objects in egocentric activities. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 3281–3288, 2011.
  • [17] A. Furnari, G. Farinella, and S. Battiano. Recognizing personal contexts from egocentric images. In ICCV Workshops, 2015.
  • [18] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 580–587, 2014.
  • [19] A. Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
  • [20] A. Graves, N. Jaitly, and A.-R. Mohamed. Hybrid speech recognition with deep bidirectional lstm. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 273–278, 2013.
  • [21] C. Gurrin, A. F. Smeaton, D. Byrne, N. Hare, G. J. Jones, and N. Connor. An examination of a large visual lifelog. In Information Retrieval Technology, pages 537–542. Springer, 2008.
  • [22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [23] S. Hodges, L. Williams, E. Berry, S. Izadi, J. Srinivasan, A. Butler, G. Smyth, N. Kapur, and K. Wood. Sensecam: A retrospective memory aid. In ACM Conf. on Ubiquitous Computing, pages 177–193, 2006.
  • [24] R. Hoyle, R. Templeman, S. Armes, D. Anthony, D. Crandall, and A. Kapadia. Privacy behaviors of lifeloggers using wearable cameras. In ACM Intl. Joint Conf. on Pervasive and Ubiquitous Computing, pages 571–582, 2014.
  • [25] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
  • [26] V. Kalnikaite, A. Sellen, S. Whittaker, and D. Kirk. Now let me see where I was: Understanding how lifelogs mediate memory. In ACM SIGCHI Conf. on Human Factors in Computing Systems, pages 2045–2054, 2010.
  • [27] S. Karim, A. Andjomshoaa, and A. Tjoa. Exploiting sensecam for helping the blind in business negotiations. In Computers Helping People with Special Needs. Springer, 2006.
  • [28] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. arXiv:1412.2306, 2014.
  • [29] A. Karpathy, J. Johnson, and L. Fei-Fei. Visualizing and understanding recurrent networks. arXiv:1506.02078, 2015.
  • [30] A. Karpathy, A. Joulin, and L. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems, pages 1889–1897, 2014.
  • [31] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
  • [32] M. Korayem, R. Templeman, D. Chen, D. Crandall, and A. Kapadia. Enhancing lifelogging privacy by detecting screens. In ACM CHI Conf. on Human Factors in Computing Systems, 2016.
  • [33] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
  • [34] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. In Workshop On Text Summarization Branches Out, 2004.
  • [35] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conf. on Computer Vision, pages 740–755. Springer, 2014.
  • [36] D. Lowe. Distinctive image features from scale-invariant keypoints. Intl. J. of computer vision, 60(2):91–110, 2004.
  • [37] S. Mann, J. Nolan, and B. Wellman. Sousveillance: Inventing and using wearable computing devices for data collection in surveillance environments. Surveillance & Society, 1(3):331–355, 2002.
  • [38] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille. Explain images with multimodal recurrent neural networks. arXiv:1410.1090, 2014.
  • [39] L. Miller and J. Toliver. Implementing a body-worn camera program: Recommendations and lessons learned. Technical report, Office of Community Oriented Policing Services, 2014.
  • [40] M. Moghimi, W. Wu, J. Chen, S. Godbole, S. Marshall, J. Kerr, and S. Belongie. Analyzing sedentary behavior in life-logging images. In Intl. Conf. on Image Processing, 2014.
  • [41] D. H. Nguyen, G. Marcu, G. R. Hayes, K. N. Truong, J. Scott, M. Langheinrich, and C. Roduner. Encountering sensecam: personal recording technologies in everyday life. In ACM Intl. Conf. on Ubiquitous Computing, pages 165–174, 2009.
  • [42] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • [43] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In IEEE Conf. on Computer Vision and Pattern Recognition Workshops, pages 512–519, 2014.
  • [44] M. Ryoo, T. J. Fuchs, L. Xia, J. K. Aggarwal, and L. Matthies. Robot-centric activity prediction from first-person videos: What will they do to me. In ACM/IEEE Intl. Conf. on Human-Robot Interaction, pages 295–302, 2015.
  • [45] M. Ryoo and L. Matthies. First-person activity recognition: What are they doing to me? In IEEE Conf. on Computer Vision and Pattern Recognition, pages 2730–2737, 2013.
  • [46] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE T. Signal Processing, 45(11):2673–2681, 1997.
  • [47] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014.
  • [48] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In Advances in Neural Information Processing Systems, pages 2553–2561, 2013.
  • [49] R. Templeman, M. Korayem, D. J. Crandall, and A. Kapadia. Placeavoider: Steering first-person cameras away from sensitive spaces. In Network and Distributed Systems Security Symposium, 2014.
  • [50] R. Vedantam, C. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
  • [51] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence–video to text. arXiv:1505.00487, 2015.
  • [52] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. arXiv:1411.4555, 2014.
  • [53] C. Yoo, J. Shin, I. Hwang, and J. Song. Facelog: capturing user’s everyday face using mobile devices. In ACM Conf. on Pervasive and Ubiquitous Computing, pages 163–166, 2013.
  • [54] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv:1506.06724, 2015.