With the prevalence of deep neural networks in visual computing, detecting patterns or recognizing objects in an image dataset has become far less mysterious than it used to be. Given manually labeled training data with sufficient images, we can tune the parameters of a convolutional neural network to yield near-perfect performance. However, these concepts are typically learned in a supervised setting, and it is still time-consuming and expensive to organize the large volume of manual work needed to label training data, especially since the number of images generated by smart devices in a single day is far beyond the capacity of manual labor. We ask whether such newly generated, unlabeled images can be classified, recognized, or labeled automatically, so that they can either be utilized for large-scale training purposes, or simply be better documented and organized on local devices.
In this paper, we combine AlexNet and Latent Dirichlet Allocation (LDA) to form a hybrid supervised-unsupervised method that extracts topics (visual concepts) from unlabeled datasets; see Fig. 1. The idea is to construct an embedding for each image from a universal pre-trained model, and then apply a topic model that groups salient semantic features into visual topics. Concretely, we consider two scenarios. First, we take on the challenge of a life-logging dataset. Life-logging cameras create huge collections of photos, even for a single person on a single day [Bambach et al.(2015)Bambach, Lee, Crandall, and Yu, Ryoo and Matthies(2013), Korayem et al.()Korayem, Templeman, Chen, and Kapadia], which makes it difficult for users to browse or organize their photos effectively. Unlike text corpora, in which words provide an intermediate representation that carries semantic meaning for higher-level concepts such as topics, images have no such obvious intermediate representation. Egocentric photos are particularly challenging because they are taken opportunistically, so they are often blurry and poorly composed compared to consumer-style images. We use this method to “summarize” a subject’s living genre from an egocentric life-logging dataset.
Second, we use the COCO dataset, a labeled dataset, as ground truth to evaluate the hybrid method in terms of a consistent rate. We apply LDA on top of the bag-of-words representation from a pre-trained AlexNet and obtain the topic assignment matrix over all images. By defining the notion of a consistent rate, we compare the ground-truth labels with the “concept clusters” from our method and measure the consistency of those clusters. The results show that even a space built from semantically “irrelevant” dimensions suffices as a dissimilarity measure for image data, achieving a high average consistent rate. We also apply Harp-LDA [Zhang et al.(2016)Zhang, Peng, and Qiu], a parallel LDA based on sparse matrix decomposition, on a state-of-the-art Intel Knights Landing cluster for parallelization, which shows the feasibility of scaling up to larger experimental settings.
The method provides a hybrid way to extract image topics with a pre-trained AlexNet and a probabilistic topic model. It automatically labels images before manual double-checking, rather than requiring people to label the entire dataset; the living genre extracted from egocentric images can support potential psychological research; and it can detect duplicated images and organize photo albums on local computers. We describe related work, the data, our methods, and the experiments and results in the following sections, and then summarize the work.
2 Related work
Many methods have been proposed that adapt techniques from text topic modeling to vision. Fei-Fei and Perona [Fei-Fei and Perona(2005)] propose a Bayesian hierarchical model to learn characteristic intermediate themes in an unsupervised way, for example, while Sivic et al. [Sivic et al.(2008)Sivic, Russell, Zisserman, Freeman, and Efros] and Li et al. [Li et al.(2010)Li, Wang, Lim, Blei, and Fei-Fei] introduce ways to discover hierarchical image structure from unlabeled datasets. However, most of these techniques were developed before the prevalence of deep neural networks and are based on hand-tuned features. Moreover, none have studied egocentric imagery, as we do here.
Research on computer vision for first-person images and video has been popular in recent years, including in the fields of object tracking [Lee et al.(2014)Lee, Bambach, Crandall, Franchak, and Yu], activity recognition [Pirsiavash and Ramanan(2012), Iwashita et al.(2014)Iwashita, Takamine, Kurazume, and Ryoo], and event detection [Lu and Grauman(2013)]. However, there has been relatively little work on unsupervised object discovery and scene summarization in this domain [Ryoo and Matthies(2013)]. Here we take a first step towards understanding the extent to which existing hierarchical Bayesian topic models, paired with deeply-learned features, can be successfully applied to the unique properties of first-person imagery, such as repetitive scenes, frequent motion blur, and poor image composition.
We are interested in extracting topics from egocentric images as a way to visualize the subject’s living genres. This is assumed to be relatively feasible because such datasets may contain a large number of duplicated images (for example, a series of images of a subject sitting in front of a laptop are similar enough to be considered duplicates). We then test the method on the COCO dataset with quantitative analysis of accuracy and scalability.
3 Data

3.1 Lifelogging dataset
The dataset of first-person images was captured with a Narrative Clip lifelogging camera worn by one of the authors for two weeks during the summer. The camera takes pictures automatically at regular intervals, and it captured images of a wide variety of daily activities, including commuting to work, having meetings, preparing and eating meals, and interacting with friends and family. The lifelogging user reviewed all of the images after collection and removed those they felt were too private to share.
3.2 COCO dataset
In order to evaluate the performance quantitatively and test the scalability of the hybrid supervised-unsupervised method, we use the Microsoft COCO (Common Objects in Context) dataset [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick], as it contains a large set of object categories organized into super-categories, with hundreds of thousands of images in total (some images are associated with multiple categories). We randomly select a subset of COCO with four categories, and compute and compare the consistent rate with other methods. The whole dataset is used for the scalability experiment.
4 Methods

As in previous work, we assume that features from the output layer of a CNN can be regarded as visual words that compose images much as words compose documents in text corpora, and that the ordering of visual words can be ignored. Empirical studies show that the occurrence of words in each document can be modeled as a multinomial distribution [Nigam et al.(2000)Nigam, McCallum, Thrun, and Mitchell, McCallum et al.(1998)McCallum, Nigam, et al.].
4.1 Data preprocessing and pre-trained model
To obtain an order-irrelevant representation (the order of “words” within each image carries no information) and learn a joint distribution of images and topics, we process the data in three steps, shown in Fig. 1. We first use a pre-trained AlexNet (trained on ImageNet) to extract the label response of each image from the output layer. We choose AlexNet, instead of more recent architectures, as the pre-trained model because it is pervasively used in current research and applications. The representation is extracted from the softmax layer and is a 1000-dimensional vector corresponding to the 1000 ImageNet labels of AlexNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton]. Each entry is a probability in the range [0, 1] indicating how likely the image is to be related to the corresponding label; a greater probability indicates a larger likelihood.
We set a threshold and keep the indices of those labels with probabilities above it. This removes labels that are less semantically salient and yields a “bag-of-words” representation in which the order of the data is irrelevant. After preprocessing, each image is represented by a vector of label IDs bounded between 1 and 1000. With this representation, we build a classic LDA model to learn a distribution over topics for each image.
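As an illustration, the thresholding step can be sketched as follows; the threshold value of 0.01 is a hypothetical placeholder, not necessarily the value used in our experiments:

```python
import numpy as np

def bag_of_words(softmax_probs, threshold=0.01):
    """Keep the IDs of labels whose softmax probability exceeds the
    threshold, yielding an order-irrelevant 'bag of visual words'.
    Label IDs are 1-based, matching indices bounded between 1 and 1000."""
    probs = np.asarray(softmax_probs)
    return [int(i) + 1 for i in np.nonzero(probs > threshold)[0]]

# Toy example: a 1000-dim softmax vector with mass on three labels.
probs = np.full(1000, 1e-4)
probs[[12, 407, 980]] = [0.5, 0.3, 0.1]
probs /= probs.sum()
words = bag_of_words(probs, threshold=0.01)  # -> [13, 408, 981]
```

Only the label IDs survive this step; the probabilities themselves are discarded, so repeated runs of LDA see each image as an unordered set of visual words.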
4.2 Generative model and collapsed Gibbs sampler
We model the relation of a feature to an image as that of a word to a document, and assume a hidden topic variable between words and images. In a generative process similar to that proposed in Latent Dirichlet Allocation [Blei et al.(2003)Blei, Ng, and Jordan], an image is generated by first assigning topics and then sampling features (visual words) from the selected topics, as given in Eq. 1. There are three sets of latent variables, i.e. θ, φ, and z, to be inferred from the posterior distribution, where α and β are hyperpriors, θ is the distribution over topics for each document (image), φ is the distribution over words for each topic, and z is the topic allocations.
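For reference, the generative process of Eq. 1 follows the standard LDA joint distribution [Blei et al.(2003)Blei, Ng, and Jordan], which can be written as:

```latex
p(w, z, \theta, \phi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\phi_k \mid \beta)
    \prod_{d=1}^{D} p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \phi_{z_{d,n}})
```

Here $D$ is the number of images, $K$ the number of topics, and $N_d$ the number of visual words in image $d$.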
Collapsed Gibbs sampling works because θ and φ can be integrated out and represented in closed form through the count statistics of z; see Eq. 2 [Griffiths and Steyvers(2004)]. Here n_{d,k} is the count of words in image d assigned to topic k; n_{k,w} is the total number of times word w has been assigned to topic k; and K and W are the total numbers of topics and distinct words, respectively. In this work, we are particularly interested in the document-topic matrix θ, through which we visualize each topic by displaying representative images (in practice, we pick the top images of each topic).
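Selecting representative images from the document-topic matrix amounts to a per-topic argsort; the sketch below illustrates this, with a hypothetical n_top parameter since the exact number of displayed images is not fixed here:

```python
import numpy as np

def top_images_per_topic(theta, n_top=4):
    """Given the document-topic matrix theta (images x topics), return for
    each topic the indices of the images with the highest topic weight."""
    theta = np.asarray(theta)
    return [np.argsort(-theta[:, k])[:n_top].tolist()
            for k in range((theta.shape[1]))]

# Toy example: 5 images, 2 topics.
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.7, 0.3],
                  [0.1, 0.9],
                  [0.6, 0.4]])
tops = top_images_per_topic(theta, n_top=2)  # -> [[0, 2], [3, 1]]
```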
The collapsed Gibbs sampler calculates the probability of the i-th word in image d being assigned to topic k, given all the topic assignments of the remaining words in all images. Integrating out the multinomial parameters, we calculate this probability via Eq. 3, where the superscript ¬i denotes all words except the i-th one: n_{d,k}^{¬i} is the count of words in image d assigned to topic k, excluding the i-th word, and n_{k,w}^{¬i} is the count of words from all images assigned to topic k, excluding the i-th word in image d. In practice, the term that is constant within each image is dropped.
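As a concrete illustration, the standard collapsed Gibbs update of Griffiths and Steyvers can be sketched in a few dozen lines. The hyperparameter values, iteration count, and toy corpus below are arbitrary placeholders, not the settings used in our experiments:

```python
import numpy as np

def collapsed_gibbs(docs, K, V, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of documents, each a list of word IDs in [0, V).
    Returns the document-topic count matrix (the unnormalized theta)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # words in doc d assigned to topic k
    n_kw = np.zeros((K, V))           # times word w assigned to topic k
    n_k = np.zeros(K)                 # total words assigned to topic k
    z = [rng.integers(K, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):    # initialize counts from random z
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]           # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # Eq. 3 (up to normalization):
                # p(z_i = k | z_{-i}, w) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = int(rng.choice(K, p=p / p.sum()))
                z[d][i] = k           # record the new assignment
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk

# Toy corpus: 2 "images" over a 4-word vocabulary, 2 topics.
theta_counts = collapsed_gibbs([[0, 0, 1], [2, 2, 3]], K=2, V=4)
```

Production runs use sparse count structures and parallel updates (as in Harp-LDA) rather than this dense loop, but the sampled quantities are the same.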
5 Experiment and results
We conducted the “bag-of-words” representation step on a Dell PowerEdge server with two NVIDIA Tesla GPU boards via Caffe [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell]. For the lifelogging dataset, generating the corpus took on the order of minutes; for the COCO dataset, it took on the order of hours. We implemented a sequential version of LDA for the lifelogging data and a subset of the COCO data for accuracy evaluation. To test scalability, we ran Harp-LDA [Zhang et al.(2016)Zhang, Peng, and Qiu] on Juliet, a cluster for digital science computing that is part of FutureSystems at Indiana University. In practice, we conducted the Harp-LDA experiment on a cluster of multi-core machines.
5.1 Lifelogging data summarization with duplicated images
The method was first applied to the egocentric dataset of a single school day on which the subject rode to school and work. We extracted topics from the dataset at two granularities and selected the top images from each topic, as shown in Fig. 2.
In the coarser case, the T-shaped bicycle is grouped with T-shaped tables, and the second topic indicates a working scenario with the laptop. With more topics, the grouping of semantic features is conducted at a finer granularity: it decomposes the second topic of the coarser case into one topic related to the computer screen and another related to the desk. The first topic also has higher purity, with all top images being bicycles.
We generalized the method to a larger dataset of weekly images and selected topics with top representative images. The results based on two different layers are shown in Fig. 3. Both representations yield a similar living genre, with topics on driving, meeting with colleagues, working with the laptop, walking in the yard, etc. The output from the softmax layer leads to marginally more coherent semantic groupings.
While the hybrid supervised-unsupervised method generates a set of topics, each of which conveys a consistent semantic meaning, the original labels, derived from ImageNet, are not highly correlated with the content of our life-logging dataset. As shown in Fig. 4, where we display the corresponding AlexNet labels of each topic, these labels may not form appropriate captions for the images in each topic. However, the high-dimensional space constructed from these labels suffices to embed the semantic dissimilarity. In the next section, we calculate the consistent rate to evaluate the method on a subset of COCO data with ground-truth labels.
5.2 Topic spectrogram and consistent rate
To analyze the performance of this method quantitatively, we first randomly selected images from four COCO categories (broccoli, frisbee, fridge, and cow), with an equal number of images per category, extracted the bag-of-words representation from the CNN, and calculated the topic distribution with LDA, as shown in Fig. 5. Overall, we see a clear topic assignment “mode” for each category. Note that the label set of ImageNet includes neither “frisbee” nor “cow”. With related labels, such as “French_bulldog”, “Bernese_mountain_dog”, and “lawn_mower”, the method built synthesized semantic labels that could group images in those two categories. Some images from these categories are correlated semantically: dogs are likely to be seen with a frisbee during outdoor activities. From the spectrogram, we see that images responding to both topics contain both dogs and frisbees. For the categories whose topic assignment is more consistent, if the topic assignment of an image differs from the majority in its category, it is likely that the image was not labeled correctly (we list two images that should have contained broccoli and a refrigerator, respectively).
We compared the output of the pure CNN with that of CNN+LDA by extending the number of LDA topics to 1000, the same number of labels as the output layer of AlexNet trained on ImageNet. Fig. 6 shows the spectrogram of the output from the softmax layer of AlexNet compared with the output of LDA with the same number of topics.
To measure the consistent rate of the output labels, we calculate, for each image subset c, the label index i_c with the largest mean probability across the subset, count the number of images whose largest-probability label equals i_c, and divide that count by the number of images in the subset, i.e., r_c = |{images in c with argmax label i_c}| / |c|. The result is shown in Table 1.
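The consistent-rate definition above can be computed as in the following sketch (the toy probabilities are illustrative only):

```python
import numpy as np

def consistent_rate(probs):
    """probs: (n_images, n_labels) topic/label distributions for one category.
    The reference label i_c is the one with the largest mean probability;
    the consistent rate is the fraction of images whose argmax matches it."""
    probs = np.asarray(probs)
    ref = int(np.argmax(probs.mean(axis=0)))
    return float(np.mean(np.argmax(probs, axis=1) == ref))

# Toy example: 4 images over 3 labels; 3 of 4 agree with the dominant label.
p = np.array([[0.7, 0.2, 0.1],
              [0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.8, 0.1, 0.1]])
rate = consistent_rate(p)  # -> 0.75
```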
The consistent rate of the CNN alone is relatively low, with a wide range of responding labels. With LDA extracting topics from the labels, the consistent rate improves substantially on average. For CNN+LDA, the consistent rate of “frisbee” and “fridge” is much lower than that of the other categories because these two categories are correlated: some images under “frisbee” were assigned to “fridge” and vice versa. When the top several largest probabilities are considered, the consistent rate of these two categories becomes close to that of the other two.
5.3 Parallelization for scaled-up dataset
Running the full preprocessed COCO image dataset, with tokens from all images and a large number of topics, is beyond the computing capacity of most local computers. Instead, we ran the experiment on an Intel Knights Landing cluster at Indiana University, using one node with Xeon Phi processors connected by Omni-Path Fabric. We used Harp-LDA [Zhang et al.(2016)Zhang, Peng, and Qiu], which is based on sparse matrix decomposition, recorded the execution time per iteration, and estimated the overall execution time across all iterations; see Fig. 7.
The per-iteration execution time for each experimental setting converges after a number of iterations. With a single thread on one node, the overall execution time is on the order of hours; it drops substantially as threads are added, reaching the order of minutes at the highest thread counts.
6 Conclusion

This paper showed that the image features generated by a convolutional neural network contain semantic information that can be grouped and extracted by topic modeling in a hybrid supervised-unsupervised way. The pre-trained network, with its “non-specific” set of labels, provides a bag-of-words representation that measures dissimilarity, though with low purity. With topic modeling, we further grouped these semantic features into higher-level visual topics. The method was applied to our life-logging dataset to extract living genres by listing duplicated images, and to a subset of the COCO dataset for consistent-rate analysis. We also conducted a parallel experiment on a KNL machine with all COCO images as a scalability test; the topic assignment procedure completes in minutes. The method can be used to extract living genres from egocentric data with duplicated images, and to automatically group, pre-label, and semantically organize images by topic.
- [Bambach et al.(2015)Bambach, Lee, Crandall, and Yu] Sven Bambach, Stefan Lee, David J Crandall, and Chen Yu. Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions. In Proceedings of the IEEE International Conference on Computer Vision, pages 1949–1957, 2015.
- [Blei et al.(2003)Blei, Ng, and Jordan] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
- [Fei-Fei and Perona(2005)] Li Fei-Fei and Pietro Perona. A bayesian hierarchical model for learning natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 524–531, 2005.
- [Griffiths and Steyvers(2004)] Thomas L Griffiths and Mark Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1):5228–5235, 2004.
- [Iwashita et al.(2014)Iwashita, Takamine, Kurazume, and Ryoo] Yumi Iwashita, Asamichi Takamine, Ryo Kurazume, and MS Ryoo. First-person animal activity recognition from egocentric videos. In 2014 22nd International Conference on Pattern Recognition (ICPR), pages 4310–4315. IEEE, 2014.
- [Jia et al.(2014)Jia, Shelhamer, Donahue, Karayev, Long, Girshick, Guadarrama, and Darrell] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
- [Korayem et al.()Korayem, Templeman, Chen, and Kapadia] Mohammed Korayem, Robert Templeman, Dennis Chen, David Crandall, and Apu Kapadia. Enhancing lifelogging privacy by detecting screens.
- [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- [Lee et al.(2014)Lee, Bambach, Crandall, Franchak, and Yu] Stefan Lee, Sven Bambach, David Crandall, John Franchak, and Chen Yu. This hand is my hand: A probabilistic approach to hand disambiguation in egocentric video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 543–550, 2014.
- [Li et al.(2010)Li, Wang, Lim, Blei, and Fei-Fei] Li-Jia Li, Chong Wang, Yongwhan Lim, David M Blei, and Li Fei-Fei. Building and using a semantivisual image hierarchy. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3336–3343, 2010.
- [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
- [Lu and Grauman(2013)] Zheng Lu and Kristen Grauman. Story-driven summarization for egocentric video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2714–2721, 2013.
- [McCallum et al.(1998)McCallum, Nigam, et al.] Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naive bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48. Citeseer, 1998.
- [Nigam et al.(2000)Nigam, McCallum, Thrun, and Mitchell] Kamal Nigam, Andrew Kachites McCallum, Sebastian Thrun, and Tom Mitchell. Text classification from labeled and unlabeled documents using em. Machine learning, 39(2-3):103–134, 2000.
- [Pirsiavash and Ramanan(2012)] Hamed Pirsiavash and Deva Ramanan. Detecting activities of daily living in first-person camera views. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2847–2854. IEEE, 2012.
- [Ryoo and Matthies(2013)] Michael Ryoo and Larry Matthies. First-person activity recognition: What are they doing to me? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2730–2737, 2013.
- [Sivic et al.(2008)Sivic, Russell, Zisserman, Freeman, and Efros] Josef Sivic, Bryan C Russell, Andrew Zisserman, William T Freeman, and Alexei A Efros. Unsupervised discovery of visual object class hierarchies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
- [Zhang et al.(2016)Zhang, Peng, and Qiu] Bingjing Zhang, Bo Peng, and Judy Qiu. High performance lda through collective model communication optimization. Procedia Computer Science, 80:86–97, 2016.