With the increasing popularity of online social platforms such as Facebook, Twitter, and Pinterest, multi-modal data streams (e.g. text, image, audio, video) are generated as byproducts of people’s everyday online activities. The wide availability of these digital breadcrumbs [Estrin:2014:SDN:2580723.2580944] has already motivated major research efforts in industry and academia to develop techniques for understanding personal preferences. These techniques have led to the success of recommendation systems [das2007google, middleton2004ontological], such as those of Yelp and Foursquare, that help users find things they will enjoy, and have enabled accurate targeting of advertisements.
Text-centric data, such as tweets and status updates, are among the most popular data streams for profiling personal attributes [schwartz2013personality] due to their early adoption and pervasiveness. It has been shown [schwartz2013personality, correa2010interacts, bamman2014gender] that various personal traits, such as gender, age, extroversion and openness, are manifested in language features. More recently, driven by the emergence of photo-sharing social media sites (e.g. Pinterest and Instagram) and the wide availability of embedded cameras on mobile devices, images have become a significant portion of the content that people post online, and text data alone is limited because it does not capture visual preferences. Some recent work has therefore started to explore the value of visual contents in uncovering people’s interests [you2015picture, ottoni2014pins, lovato2013we, lovato2014faved]. However, most current research in this domain [you2015picture, ottoni2014pins, lovato2013we] converts images to one or more labels, and uses the resulting text-based, categorical information to understand users’ preferences. While such image-to-text approaches can benefit from existing techniques developed for text-based data, they potentially miss the rich context and visual cues that are known to affect and guide people’s perceptions of image contents [gibson1950perception]. This limitation is especially pronounced on image-intensive social networks such as Pinterest. For example, as Fig.1 shows, even under the same category, Travel, there are obvious distinctions between the pins (i.e. the images on Pinterest) curated by different users. These distinctions could play an important role not only in image recommendation itself, but also in related domains such as travel destination recommendation.
In this paper, we take a step deeper into profiling users’ visual preferences for images under the same label. We propose a novel framework based on Deep Convolutional Neural Networks (CNN) to directly learn an image distance metric from a large set of similar and dissimilar image pairs. We then leverage this similarity measure to profile each user’s visual preferences. The experimental results, based on 5,790 Pinterest users’ pins under the Travel category, indicate that the proposed approach is able to reveal each user’s distinct visual preferences, and that the derived user profile has strong power to predict the images that the user will pin.
Compared with traditional solutions, our work offers three major contributions:
Our approach enables fine-grained user interest profiling directly from visual contents. For images under the same label, we reveal intra-categorical variances that traditional classification methods were not able to capture.
We propose a novel distance-metric learning method based on the combination of a traditional CNN and a Siamese Network [chopra2005learning]. This framework outperforms the state-of-the-art CNN model in terms of mean Average Precision (mAP).
Our experiment demonstrates beyond-classification utilities of visual contents in user interest profiling. We believe that our findings, while preliminary, shed light on the potential of incorporating fine-grained visual content analysis as an important technique for personalization.
II Related Work
II-A Visual Content Analysis on OSNs
The pioneering work in this domain studied online photos on Flickr [lovato2013we, lovato2014faved, schifanella15image] and demonstrated the feasibility of extracting aesthetic and biometric features from user-generated image collections. It has been shown that people’s preferences over these photographic features are identifiable and could be used for personalization [yeh2010personalized]. Building on these prior efforts, recent literature has begun to explore the possibilities of profiling users’ behavior [ottoni2014pins, zhong2013sharing, bernardini2014pin] and interests [you2015picture] from visual contents posted on social media. Although the work in [you2015picture] has shown initial findings of intra-categorical image variations among different users, most existing approaches treat image analysis as a classification problem where one or more labels are assigned and processed in a manner similar to text data. The major limitation of such approaches is that a general classification model is trained and applied to all users, ignoring individual users’ distinct perceptions of and preferences within an image category. Our preliminary experiments show that individual users do have distinct preferences even under the same category, and that this personal preference is consistent over the user’s lifetime.
II-B Image Retrieval and Personalization
The algorithms we propose in this paper are related to the similar-image retrieval problem in computer vision [kim2012web, deng2011hierarchical, fu2015tagging], where, given a text query, semantically relevant images are returned from a large database. This problem is similar to our work because the image similarity metric is an important component of the retrieval function, and it has been shown that retrieval performance improves substantially when user interest profiles and temporal patterns of social events are incorporated [kim2012web]. Although most retrieval functions directly use visual features for similarity measurement [kim2012web, deng2011hierarchical], it is still unclear whether images themselves could provide utilities other than categorical labels, and to what extent they are useful in personal interest profiling. In this paper, we conduct experiments using publicly available data from 5,790 Pinterest users. The results demonstrate identifiable signals from visual contents that extend beyond classification and image categories.
III Problem Definition
The general question we intend to answer in this paper is whether user-generated visual contents have predictive power for users’ preferences beyond labels. To quantitatively measure the differences of visual contents posted by different users under the same category, we consider the following setup of the problem.
Under an image category, each active user who posted in this category is denoted by $u_i$, and the images that user posted are denoted by $S_i = \{I_{i,1}, I_{i,2}, \dots\}$ in chronological order. The problem is to find a function $f$ such that $f(S_i)$ can accurately characterize user $u_i$’s distinct visual preferences. More specifically, we consider the following two tasks:
(1) Pairwise Comparison: Given the general characteristics of images posted under this category, we analyze whether the proposed profiling function can distinguish between the preferences of each pair of users, such that the differences between each pair of derived profiles are statistically significant.
(2) Prediction: We divide every user’s image set into training and testing subsets, and evaluate the predictive power of each user’s profile by using it to identify, among all the testing sets, the one that belongs to that user’s collection (board).
IV Dataset Collection
We choose Pinterest as the target platform since it is one of the most popular image-centric social networks. On Pinterest, users post pins (i.e. typically an image along with a short description) and organize them in self-defined boards, each of which is associated with one of 34 predefined categories. This fully structured way of collecting images makes Pinterest a natural candidate for investigating intra-categorical user preferences. In this paper, we scraped different users’ boards within the travel category. These travel boards are further filtered by the following two criteria: (1) the board should contain at least 100 pins to guarantee that there is enough data for each user; and (2) the board should have at least one pin posted after June 2014 to ensure that the user is still active [danescu2013no]. After filtering, we obtained 5,790 travel boards, each of which belongs to a different user. We use 1,800 of them as the background corpus and exclude them from the analysis.
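The two filtering criteria can be sketched as a small predicate. This is only an illustration under stated assumptions: the `Board` record and its field names are hypothetical, and "after June 2014" is read here as strictly later than 30 June 2014.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Board:
    # Hypothetical record layout; the field names are not taken from the paper.
    user_id: str
    pin_dates: List[date]  # posting date of every pin on the board


MIN_PINS = 100                    # criterion (1): at least 100 pins
ACTIVE_AFTER = date(2014, 6, 30)  # criterion (2), assumed cut-off date


def keep_board(board: Board) -> bool:
    """Return True if the board satisfies both filtering criteria."""
    has_enough_pins = len(board.pin_dates) >= MIN_PINS
    is_active = any(d > ACTIVE_AFTER for d in board.pin_dates)
    return has_enough_pins and is_active


def filter_boards(boards: List[Board]) -> List[Board]:
    """Keep only the boards that pass both criteria."""
    return [b for b in boards if keep_board(b)]
```

A board with many old pins but no recent activity, or a recent board with too few pins, is dropped by this predicate.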
V Proposed Methodology
Fig.2 shows an overview of the proposed framework. The framework consists of three major components: (1) Each image (i.e. pin) $I$ is first embedded in a 410-dimensional feature space via a pre-trained Siamese network and the Places-CNN. The feature vector for each image is denoted by $v \in \mathbb{R}^{410}$; (2) Based on the distance between $v$ and the center of each pre-trained visual cluster, the image is soft-assigned to 200 pre-trained clusters such that the final representation ($r \in \mathbb{R}^{200}$) for the image is its vector of affinities to all the clusters; (3) Finally, a user profile is defined as the aggregate of the final representations $r$ of all the images in the user’s collection. In the following, we discuss important design decisions and the rationales behind each component.
V-A Deep Distance Metric Learning
Distance metric learning using Deep Siamese Network has achieved significant performance improvements in face verification[taigman2014deepface], geo-localization[lin2015learning] and food image embedding[yang2015CIKM]. In addition, it is suggested by [sun2013hybrid] that feature concatenation (hybrid) from CNNs trained under different conditions will further strengthen the discriminative power of the model. In light of these prior efforts, we fine-tuned a Siamese Network based on the Places dataset[zhou2014learning] and concatenated its features with the pre-trained Places-CNN model[zhou2014learning] (Fig.2), both of which utilized the AlexNet[krizhevsky2012imagenet] architecture. We choose to use the Places dataset and include the Places-CNN model because the images we deal with are mostly scene photos from the travel category. In this section, we focus on our design and training choices for the Siamese Network. Interested readers can refer to the original papers for details[krizhevsky2012imagenet].
As illustrated in Fig. 3, our Siamese Network architecture is the same as AlexNet [krizhevsky2012imagenet] except that we change the output dimension of the last fully connected layer to 205 in order to stay consistent with the output of Places-CNN. We also add a Batch Normalization layer [ioffe2015batch] at the end to normalize the 205-dimensional feature so that each dimension has zero mean and unit variance within a training batch. Our goal is to learn a low-dimensional feature embedding where similar scene images are pulled together while dissimilar images are pushed far apart. Specifically, writing $g(\cdot)$ for the embedding computed by the network, we want $g(x_1)$ and $g(x_2)$ to have a small distance (close to 0) if $x_1$ and $x_2$ are similar instances; otherwise, their distance should be larger than a margin $m$. In this paper, we choose the Contrastive Loss proposed in [hadsell2006dimensionality] as the loss function when optimizing the Siamese Network.
In eqn.(1), the similarity label $y \in \{0, 1\}$ indicates whether the input pair of scene images is similar or not ($y = 1$ for similar, $y = 0$ for dissimilar), $m$ is the margin for dissimilar scenes and $D$ is the Euclidean distance between the embeddings $g(x_1)$ and $g(x_2)$ of the two input images in the embedding space. We use the open-source implementation of gradient descent and back-propagation provided by Caffe [jia2014caffe] to train and test the Siamese Network.
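As a concrete illustration of the loss described above, the following is a minimal sketch of the standard contrastive loss for a single pair. It assumes the label convention 1 for similar and 0 for dissimilar, and the usual squared-hinge form on the margin; the function names are hypothetical and this is not the training code used in the paper.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a: Sequence[float], emb_b: Sequence[float],
                     y: int, margin: float) -> float:
    """Contrastive loss for one pair (assumed convention: y=1 similar, y=0 dissimilar).

    Similar pairs are penalized by their squared distance; dissimilar pairs
    are penalized only while they are closer than the margin.
    """
    d = euclidean_distance(emb_a, emb_b)
    similar_term = y * d ** 2
    dissimilar_term = (1 - y) * max(margin - d, 0.0) ** 2
    return 0.5 * (similar_term + dissimilar_term)
```

Under this form, a dissimilar pair that is already farther apart than the margin contributes zero loss, so training effort concentrates on pairs that still violate the desired geometry.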
In the training phase, we treat Places dataset images with the same label as similar pairs and those under different categories as dissimilar pairs. We sample 102,500 similar pairs and 1,045,500 dissimilar pairs to train our Siamese Network. We set separate learning rates for the last fully connected layer and for the remaining layers. The model that we use in this paper is trained for 50,000 iterations. Finally, the output of the Siamese Network (205 dimensions) is concatenated with the output of the last fully connected layer in Places-CNN, which together form a 410-dimensional feature embedding for each image.
V-B Clustering and User Profiling
After the training phase, we use the pretrained Siamese Network and Places-CNN to extract a 410-dimensional feature $v$ for each image $I$. We randomly sample 1,800 users and use their images as the background corpus to discover latent clusters (these users are excluded from the following pairwise comparison and prediction tasks). A traditional K-means [macqueen1967some] unsupervised clustering algorithm is used to divide the background image set into 200 visual clusters, whose centers are denoted by $c_1, \dots, c_{200}$. Built on the pre-trained cluster centers, each image is then soft-assigned to the 200 clusters based on eqn.(2) such that each dimension of the final representation reveals the likelihood of the image belonging to a specific visual cluster.
The parameters of eqn.(2) are defined in terms of $m$, the margin of the Siamese Network.
Finally, for each user $u_i$, we derive her profile $p_i$ by aggregating all the image feature representations in her collection of pins via eqn.(3). This profile intuitively represents the distribution of the user’s interests over different visual clusters.
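The soft assignment and aggregation steps can be sketched as follows. The exact forms of eqn.(2) and eqn.(3) are not reproduced here, so this sketch makes two loudly stated assumptions: affinities come from a normalized Gaussian kernel on the distance to each cluster center, with a hypothetical bandwidth `sigma`, and the profile is the plain sum of per-image affinities. The cluster centers are taken as given (for example, from K-means on the background corpus).

```python
import math
from typing import List, Sequence


def soft_assign(feature: Sequence[float],
                centers: Sequence[Sequence[float]],
                sigma: float) -> List[float]:
    """Affinity of one image feature to every cluster center.

    Assumption: normalized Gaussian kernel on squared Euclidean distance.
    The returned affinities are non-negative and sum to 1.
    """
    sq_dists = [sum((f - c) ** 2 for f, c in zip(feature, center))
                for center in centers]
    d_min = min(sq_dists)  # shift by the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / (2.0 * sigma ** 2)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def user_profile(features: Sequence[Sequence[float]],
                 centers: Sequence[Sequence[float]],
                 sigma: float) -> List[float]:
    """Aggregate per-image affinities into one vector per user.

    Assumption: aggregation is a sum, so entry k acts like a soft count
    of the user's images that fall in visual cluster k.
    """
    profile = [0.0] * len(centers)
    for feature in features:
        for k, affinity in enumerate(soft_assign(feature, centers, sigma)):
            profile[k] += affinity
    return profile
```

With a sum as the aggregate, the entries of a profile add up to the number of images, which is convenient when the profile is later treated as a vector of cluster counts.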
V-C User Pairwise Comparison
Given a pair of users $u_i$ and $u_j$, we investigate whether the derived profiles have the power to discriminate between different users’ preferences. Users’ pairwise differences are evaluated against the general distribution of images under travel boards, which is derived from the background corpus. We adopt the log odds ratio with informative Dirichlet prior proposed in [monroe2008fightin] to analyze pairwise differences; this approach was originally used for comparing word frequencies between articles.
We first calculate the log odds ratio with respect to each visual cluster $k$ as in eqn.(4), where a prior-strength parameter controls the effective size of the background corpus.
In addition, we consider the estimated uncertainty as suggested in[monroe2008fightin] and calculate the variance value as in eqn.(5).
The final statistic for each visual cluster $k$ is the z-score of the log odds ratio, computed as in eqn.(6).
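The method we adopt in this section takes the background corpus into account as a prior, which alleviates the data sparsity problem and makes the differences in very frequent visual clusters detectable. Because the statistic is a z-score, it can be read against a standard normal distribution: the larger its magnitude for a pair of users $u_i$ and $u_j$, the higher the confidence that the two users are significantly different (a magnitude above 1.96, for instance, corresponds to a confidence level greater than 95% in a two-sided test). We show the overall distribution of all pairwise user differences in the following experiments section.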
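The three steps above (log odds ratio, variance estimate, z-score) can be sketched with the formulas from [monroe2008fightin], treating each profile entry as a soft count for one visual cluster. The function and parameter names are hypothetical, and `alpha_0` stands for the prior strength; the prior for cluster k is assumed to be `alpha_0` times that cluster's share of the background corpus.

```python
import math
from typing import List, Sequence


def log_odds_z_scores(profile_i: Sequence[float],
                      profile_j: Sequence[float],
                      background: Sequence[float],
                      alpha_0: float) -> List[float]:
    """Per-cluster z-scores of the log odds ratio between two users.

    profile_i, profile_j: (soft) counts per visual cluster for each user.
    background: counts per visual cluster in the background corpus; every
        entry must be positive so that each prior alpha_k is positive.
    alpha_0: prior strength (assumed name), scaling the background's weight.
    """
    n_i, n_j, n_bg = sum(profile_i), sum(profile_j), sum(background)
    z_scores = []
    for y_i, y_j, y_bg in zip(profile_i, profile_j, background):
        alpha_k = alpha_0 * y_bg / n_bg  # informative Dirichlet prior
        log_odds_i = math.log((y_i + alpha_k) / (n_i + alpha_0 - y_i - alpha_k))
        log_odds_j = math.log((y_j + alpha_k) / (n_j + alpha_0 - y_j - alpha_k))
        delta = log_odds_i - log_odds_j                          # log odds ratio
        variance = 1.0 / (y_i + alpha_k) + 1.0 / (y_j + alpha_k)  # approx. variance
        z_scores.append(delta / math.sqrt(variance))
    return z_scores
```

A positive z-score for cluster k indicates that the first user favors that cluster more than the second, relative to what the background prior would suggest; identical profiles give zero for every cluster.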
VI-A Distance Metric Evaluation
Table I: mAP of the compared methods. Columns: Hybrid CNN | Places CNN | SIFT+BOW | Random Guess
We evaluate the efficacy of the distance metric derived from our hybrid model by measuring its clustering performance, namely to what extent the distance metric can cluster test images that share the same labels in the Places Dataset [zhou2014learning]. We check the nearest $k$ neighbors of each test image for $k = 1, 2, \dots, N$, where $N$ is the size of the testing dataset, and calculate the Precision and Recall values for each $k$. We use mean Average Precision (mAP) as the evaluation metric to compare the performance with the competing algorithms, as suggested in [yang2015CIKM]. For every method, the Precision/Recall values are averaged over all the images in the testing set. The results are shown in Table.I, where an ideal algorithm has an mAP value equal to 1.
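One common way to compute mAP for nearest-neighbor retrieval is sketched below: every test image is used as a query, the remaining images are ranked by distance in the embedding space, and precision is averaged at each rank where a same-label image appears. This is an illustrative sketch with hypothetical function names; the exact averaging protocol used in the evaluation may differ.

```python
from typing import Hashable, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (sufficient for ranking neighbors)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_precision(ranked_labels: Sequence[Hashable],
                      query_label: Hashable) -> float:
    """Mean of precision@k over the ranks k where a relevant item appears."""
    hits = 0
    precisions: List[float] = []
    for k, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(features: Sequence[Sequence[float]],
                           labels: Sequence[Hashable]) -> float:
    """mAP over all images, each used in turn as the query."""
    scores = []
    for q, (query_feature, query_label) in enumerate(zip(features, labels)):
        # Rank every other image by distance to the query.
        others = [(squared_distance(query_feature, f), lab)
                  for idx, (f, lab) in enumerate(zip(features, labels))
                  if idx != q]
        others.sort(key=lambda pair: pair[0])
        scores.append(average_precision([lab for _, lab in others], query_label))
    return sum(scores) / len(scores)
```

An embedding that always ranks same-label images ahead of all others reaches the ideal value of 1.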
We compare our hybrid model with two important competing algorithms: (1) Pretrained Places CNN [zhou2014learning]: we extract a 205-dimensional feature from the output of the last fully connected layer in the Places CNN and use it as the representation for each image; (2) SIFT + Bag of Words (BoW) [lowe2004distinctive]: for this state-of-the-art hand-crafted representation, we extract features using 410 visual words so that it has the same feature dimension as our hybrid model. As shown in Table.I, the traditional feature representation (SIFT + BoW) does not have enough discriminative power for the task of scene image embedding. The hybrid model that we propose outperforms both of the approaches mentioned above in terms of mAP. These evaluation results not only justify the value of the Siamese network method, but also show that concatenating different CNN features can improve the performance of the model.
The feature embedding model proposed in this paper also shows promise for visualizing and discovering image clusters among travel images. We randomly sample 10,000 pins from the background corpus and project all images to a 2-D plane using t-Distributed Stochastic Neighbor Embedding (t-SNE) [van2008visualizing]. As shown in Fig.4, we divide the plane into many small blocks, and for each block we randomly sample a representative scene image that resides in that area. The final embedding clearly groups similar scenes more closely in the new space. The embedding results (Fig.4) indicate that we can capture rather fine-grained image categories that are likely to appear in travel boards, for instance natural scenes (e.g. beach, mountains), city views (e.g. building, street) and travel necessities (e.g. bags, shoes).
VI-B Pairwise Comparison
To investigate how much intra-categorical variance exists between Pinterest users, for each pair of users (except the 1,800 users used for the background corpus), we estimate the pairwise dissimilarity between them using the z-score described in Section V. More specifically, let $z_k^{(i,j)}$ denote the z-score that estimates the difference between users $u_i$ and $u_j$ in visual cluster $k$. Then, the overall preference difference between users $u_i$ and $u_j$, denoted by $d_{ij}$, is estimated by the maximum z-score over all visual clusters as defined in eqn.(7).
We plot the empirical cumulative distribution function (eCDF) of the overall preference difference for all user pairs in Fig.5. The distribution shows that more than half of the user pairs have a statistically significant difference in their visual preferences even within the same category of images. This result verifies our assumption that there is significant intra-categorical variance among different users and underscores the importance of understanding users’ fine-grained interests and preferences.
VI-C Prediction of Future Pins Collections
In addition to pairwise comparisons, the other question we want to answer is whether the user profile derived with our hybrid model has the power to predict a user’s future pins. In order to measure that quantitatively, we propose the following prediction task: (1) 100 images (denoted as $\hat{S}_i$) are randomly sampled from each image set $S_i$ to guarantee that each user has the same number of pins for training and prediction; (2) each sampled image set is then divided into training and testing subsets based on chronological order; (3) each user’s profile is calculated on the two subsets separately; (4) for each user and her profile based on her training set, we predict which testing set belongs to her using Euclidean distances. More specifically, we sort all the testing sets by the Euclidean distance between their profile and the user’s training profile in ascending order, and the rank of the user’s real testing set is denoted as $rank_i$. Finally, Mean Reciprocal Rank (MRR), as defined in eqn.(8), is used to evaluate the overall prediction accuracy across all the users. MRR is a standard metric for evaluating ranked predictions.
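The ranking step and the MRR metric can be sketched as follows, using the standard definition of MRR as the mean of the reciprocal ranks. The function names are hypothetical, profiles are assumed to be plain lists of equal length, and ties are resolved optimistically (an assumption; the source does not specify tie handling).

```python
import math
from typing import List, Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two profile vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def reciprocal_ranks(train_profiles: Sequence[Sequence[float]],
                     test_profiles: Sequence[Sequence[float]]) -> List[float]:
    """Reciprocal rank of each user's own testing set.

    train_profiles[i] and test_profiles[i] belong to the same user. For each
    user, all testing profiles are ranked by distance to the training profile.
    """
    results = []
    for i, train_profile in enumerate(train_profiles):
        distances = [euclidean_distance(train_profile, t) for t in test_profiles]
        own_distance = distances[i]
        # Rank = 1 + number of testing sets strictly closer than the user's own.
        rank = 1 + sum(1 for d in distances if d < own_distance)
        results.append(1.0 / rank)
    return results


def mean_reciprocal_rank(train_profiles: Sequence[Sequence[float]],
                         test_profiles: Sequence[Sequence[float]]) -> float:
    """MRR: average of 1/rank over all users."""
    ranks = reciprocal_ranks(train_profiles, test_profiles)
    return sum(ranks) / len(ranks)
```

An MRR of 1 means every user's own testing set is the closest one to her training profile; a random ordering over many users gives a value close to 0.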
In order to show the effect of the training set size, we fix the testing set to contain the last 50 pins of each sampled image set and vary the training set to include the first 10, 20, 30, 40, or 50 pins. In addition, we compare our approach to a text-based user interest profiling approach. The procedure for this text-based profiling is similar to the one shown in Fig.2, but, instead of using the hybrid deep neural network, we adopt the state-of-the-art PV-DM model [le2014distributed] to embed each pin’s text description into a 100-dimensional feature space.
As shown in Fig.6, the profiles calculated from visual contents perform significantly better than the text and random baselines in terms of Mean Reciprocal Rank. The results further suggest that, in image-centric social networks (e.g. Pinterest), visual contents play a more significant role in affecting users’ behavior and preferences than they do on traditional text-based platforms. Although there is still a large space of algorithmic improvements to explore, our preliminary results provide promising evidence for using intra-categorical variance information to understand people’s interests and preferences.
VII Future Work
Moving forward, there are several directions we would like to pursue. (1) Comprehensive intra-categorical image analysis model: in this paper, we only consider images under the travel category. However, in real-world applications, there are a large number of image categories. A general and comprehensive model to analyze users’ intra-categorical preferences for a wide variety of image categories will be of significant importance. (2) Information fusion of inter- and intra-categorical image analysis: one of the opportunities enabled by fine-grained image analysis is to fuse and propagate inter- and intra-categorical information. A hierarchical model could be built to analyze users’ visual preferences at different levels and their inter-level interactions. (3) Cross-platform information sharing: cross-platform behavior analysis is a user-centric idea to explore the sharing and fine-tuning of user profiles across multiple platforms. This will be particularly useful for solving cold-start problems [park2006naive] in many recommender systems. For example, one can use users’ fine-grained interests learned from Pinterest to recommend friends or places in another social network.
To conclude, in this paper, we propose a user preference profiling framework that extracts signals with strong discriminative power for users’ fine-grained preferences. Compared to previous work, the proposed framework is a hybrid one that takes advantage of a Siamese Network and a traditional CNN to directly extract similarity information from images. Our experimental results based on data from 5,790 Pinterest users show that the proposed method is able to characterize the intra-categorical interests of a user with a resolution beyond what coarse-grained image classification can achieve. Our findings suggest that there is great potential in finer-grained profiling of users’ visual preferences, and we hope this paper will fuel future development of a deeper and finer understanding of users’ latent preferences and interests.
We thank the anonymous reviewers for their insightful comments. This research is partly funded by AOL-Program for Connected Experiences and further supported by the small data lab at Cornell Tech, which receives funding from UnitedHealth Group, Google, Pfizer, RWJF, NIH and NSF.