Image feature representation plays an essential role in image recognition and related tasks. The current state-of-the-art feature learning paradigm is supervised learning from labeled data. However, this paradigm requires large-scale category labels, which limits its applicability to domains where labels are hard to obtain. In this paper, we propose a new data-driven feature learning paradigm which does not rely on category labels. Instead, we learn from user behavior data collected on social media. Concretely, we use the image relationships discovered in the latent space from the user behavior data to guide the image feature learning. We collect a large-scale image and user behavior dataset from Behance.net, consisting of 1.9 million images and over 300 million view records from 1.9 million users. We validate our feature learning paradigm on this dataset and find that the learned feature significantly outperforms state-of-the-art image features in learning better image similarities. We also show that the learned feature performs competitively on various recognition benchmarks.
Image recognition is a central problem in Computer Vision which has enjoyed great progress over the last decade. Feature learning plays an essential role in image recognition. Traditional recognition methods, such as [10, 13, 4, 1, 18, 12], are based on hand-crafted image features. These feature representations require a significant amount of domain knowledge and do not generalize well to new domains. The current state-of-the-art feature learning paradigm is supervised learning from data. This data-driven supervised feature learning paradigm does not require domain knowledge but requires large datasets with category labels to train properly. However, collecting large labeled datasets is not an easy task even with the help of crowdsourcing. For instance, we often need experts to label images in a specialized domain, and such experts may be hard to find via crowdsourcing. The lack of large labeled datasets limits the applicability of the supervised feature learning paradigm in new problem domains.
Researchers have explored alternatives such as unsupervised feature learning (learning from unlabeled data) and transfer learning (using labeled data from different domains). These methods hold great promise, but however successful they become, an interesting research question remains: are category-level labels the only way to do data-driven feature learning?
There has been a surge of social media websites over the last ten years. Most social media websites, such as Pinterest, have been collecting both the content data that users share and the behavior data of the users. User behavior data are the activities of individual users, such as likes, comments, or view histories, and they carry rich information about the corresponding content data. For instance, two photos of a similar style on Pinterest tend to be pinned by the same user. If we aggregate the user behavior data across many users, we may recover interesting properties of the content. For instance, the photos liked by a group of users with similar interests tend to have very similar styles.
In this paper, we propose a new paradigm for data-driven image feature learning which we call collaborative feature learning. The main idea of collaborative feature learning is to learn image features from user behavior data on social media. In particular, we use the user behavior data collected on social media to recover latent representations of individual images and learn a feature transformation from the images to the recovered latent representations. This is a major departure from existing feature learning paradigms such as supervised learning, in that we do not rely on category labels at all. There are several challenges in this new feature learning paradigm. In particular, user behavior data can be very sparse and noisy. For instance, most users only see a very small portion of all the images, and some user behavior data are erroneous. Fortunately, there exist structures in the behavior data across all the users, and these structures can be exploited to deal with sparsity and noise in the data.
We acknowledge that our new data-driven feature learning paradigm only applies where social media data are available, and its effectiveness is determined by the quality of the data. In fact, all data-driven methods share this limitation. To test our feature learning paradigm, we collect a large-scale dataset from Behance.net, a popular social media website that focuses on artists and designers. We download about 1.9 million Behance artworks along with the view histories of about 1.9 million users, which results in more than 300 million user-artwork view records. In our experiments, we find that the learned latent representations indeed reflect rich visual and semantic information about the images. We further observe that the image features learned from the latent representations not only perform well on standard image recognition benchmarks but also outperform the state-of-the-art feature (the ImageNet feature) on tasks such as finding images of similar styles (see Figure 1).
We propose a new paradigm for feature learning from social media. We completely forgo the category labels used in existing feature learning paradigms and instead use user behavior data collected on social media. Our paradigm can take advantage of the massive data collected on social media, which mitigates the dataset scalability issue in feature learning and image recognition in general. We further validate and test our paradigm on large-scale data collected from a real-world social media website and show promising results. Finally, we want to remark that although the focus of this paper is images and visual data, our feature learning paradigm is by no means limited to learning visual features. For instance, it could be used to learn interesting audio features from social media websites such as Spotify.
Image features play an important role in various image recognition problems, and there is a rich body of literature in Computer Vision on image features; a comprehensive review is beyond the scope of this paper. Early methods [10, 13] use low-level features which capture appearance, while recent methods, such as [4, 1, 18, 12], focus on high-level features which capture semantics. Different from hand-crafted features, features learned directly from data are the current state-of-the-art. Data-driven features have been shown to effectively encode both semantics and appearance and to outperform previous methods on many recognition benchmarks, but they need large numbers of labeled images (on the order of millions) to train properly. Unsupervised feature learning methods, e.g. [2, 11, 16, 15, 17], hold significant promise for overcoming this labeled-dataset limitation.
A related line of work addresses the cold start problem in music recommendation: it mines latent factors from user music listening logs and uses convolutional neural networks as a nonlinear regressor to predict the latent factors. Different from that work, ours focuses on how to learn image features from user behavior data. This work also differs from approaches that use deep belief nets to fuse and learn unified features for multi-modal social media data; rather than learning features, those approaches start from low-level features of multi-modal social media data and focus on learning a common feature map for various tasks.
The proposed approach is a framework that unifies latent factor analysis and deep convolutional neural networks for image feature learning from social media. Although rich social information can be harvested from social websites, such as content items, item tags, user social friendships, user views, and comments, we focus on the simple form of user-item view data in this work to keep our feature learning framework general. Given a set of content items $\mathcal{I}$ and a set of users $\mathcal{U}$, the corresponding user-item view data take the form of a matrix $V$ between $\mathcal{U}$ and $\mathcal{I}$. We set $V_{ui} = 1$ if user $u$ viewed content item $i$ (regardless of the number of views), and $V_{ui} = 0$ otherwise (missing entries). Note that we use $0$ to denote the missing entries, which should not be confused with the negative signals mentioned in the following sections. As we will show later, the user-item view matrix encodes a lot of information about the similarity between different content items, which we can use for supervised image feature learning on the social content. The content items could presumably be of any media format, such as videos, images, or audio. In this work, we focus on images and leave feature learning for other media formats as future work.
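To make the data format concrete, the following is a minimal sketch (not from the paper; the `(user_id, item_id)` record format and the SciPy sparse representation are our assumptions) of how such a binary view matrix could be assembled:

```python
# Hypothetical sketch: building the binary user-item view matrix V from raw
# view records. Repeated views of the same item collapse to a single 1.
import numpy as np
from scipy.sparse import csr_matrix

def build_view_matrix(view_records, n_users, n_items):
    """view_records: iterable of (user_id, item_id) pairs, one per view event.
    Returns sparse V with V[u, i] = 1 if user u viewed item i, else an
    implicit 0 (missing entry)."""
    users, items = zip(*view_records)
    data = np.ones(len(users), dtype=np.int8)
    V = csr_matrix((data, (users, items)), shape=(n_users, n_items))
    V.data[:] = 1  # duplicates were summed during construction; reset to binary
    return V
```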
Our approach proceeds as follows (see the overview figure). Based on the user-item view matrix, we use collaborative filtering to decompose it into the product of content item latent factors and user latent factors. As the latent factors of content items encode rich information about the similarity between the content items, we then generate pseudo classes for the content items by clustering their corresponding latent factors using k-means. A deep convolutional neural network (DCNN) is then trained on these pseudo classes in the traditional supervised way. Finally, the trained DCNN can be used to extract content features for our social content domain. The following sections cover the details of our approach.
In this section, we explain how to extract content latent factors from the user-item view matrix.
The view matrix we consider here records only the minimal information of whether a user viewed a particular content item or not. This is called implicit feedback in the literature: an indirect reflection of a user's true opinion of the content item. Compared with explicit feedback data, such as user ratings of movies on Netflix or of products on Amazon, where explicit positive or negative feedback is given, implicit feedback data are more general and available at a much larger scale. For example, Amazon can collect many more clicks and views, far more easily, than reviews or comments from buyers. However, implicit feedback data are typically much noisier and a weaker indication of the user's true opinion. Since they contain no explicit negative signals, there is no easy way to identify negative signals among the missing data: a missing entry could be interpreted as a sign of dislike, or the user may simply not have discovered the content yet. Given the massive amount of content on social media, the user-item view matrix is extremely sparse (e.g., well over 99% of entries are missing), and it is very likely that most missing entries are due to the content not having been discovered by the user.
To address this issue, inspired by experience from an industrial competition on building music recommendation systems, we sample "negatives" from the large number of missing entries with a probabilistic trick. When drawing the negative samples, the sampling follows a probability distribution proportional to the popularity of the content. The rationale behind our sampling is that a popular content item has a higher chance of having been discovered by the user, and therefore a missing entry more likely suggests a negative attitude from the user. The popularity of a content item is a measure of how much exposure it receives from users, i.e., how many users have viewed it. Formally, given the view matrix, we define the popularity of item $i$ as

$$p_i = \sum_{u \in \mathcal{U}} V_{ui}.$$
Based on the content popularity, the sampling distribution for negative data is defined, for each user $u$ over her missing entries, as

$$q_u(i) \propto p_i.$$
Apparently, we avoid sampling entries that are positive. In practice, we take the logarithm of $p_i$ and normalize with respect to each user, so that for each user the sampling probabilities sum to 1. Algorithm 1 describes our sampling process for obtaining negative data, where $\Omega^-$ is the set of sampled negative view entries. Every sampled entry $(u,i) \in \Omega^-$ keeps $V_{ui} = 0$ but is treated as an observed (negative) entry in the further analysis.
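A sketch in the spirit of Algorithm 1 follows (variable names are ours): for each user, negatives are drawn from her missing entries with probability proportional to log-popularity, normalized per user. A production implementation would avoid materializing the full per-user candidate list.

```python
# Popularity-weighted negative sampling over the missing entries of V.
import numpy as np

def sample_negatives(V, n_neg_per_user):
    """V: binary scipy.sparse CSR view matrix (users x items).
    Returns the set Omega^- of sampled (user, item) negative entries."""
    rng = np.random.default_rng(0)
    popularity = np.asarray(V.sum(axis=0)).ravel()   # p_i: number of viewers of item i
    log_pop = np.log(np.maximum(popularity, 2.0))    # clamp so log stays positive
    negatives = []
    n_items = V.shape[1]
    for u in range(V.shape[0]):
        viewed = set(V.indices[V.indptr[u]:V.indptr[u + 1]])
        candidates = np.array([i for i in range(n_items) if i not in viewed])
        probs = log_pop[candidates] / log_pop[candidates].sum()  # per-user normalization
        drawn = rng.choice(candidates, size=n_neg_per_user, replace=False, p=probs)
        negatives.extend((u, i) for i in drawn)
    return negatives
```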
Among the various collaborative filtering methods for latent factor analysis, matrix-factorization-based models have gained a lot of popularity thanks to their attractive accuracy and scalability. In this work, we focus on applying a matrix factorization model to our user-item view matrix. The model associates each user $u$ with a user latent factor vector $x_u \in \mathbb{R}^d$ and each item $i$ with an item latent factor vector $y_i \in \mathbb{R}^d$, where $d$ is the dimension of the latent space. The prediction for an entry is the inner product of the two latent factors, i.e., $\hat{V}_{ui} = x_u^\top y_i$. With the positive entries and the sampled negative entries from our user-item view matrix, we conduct matrix factorization with regularization as

$$\min_{\{x_u\}, \{y_i\}} \; \sum_{(u,i) \in \Omega} \left( V_{ui} - x_u^\top y_i \right)^2 + \lambda \left( \sum_u \|x_u\|^2 + \sum_i \|y_i\|^2 \right), \qquad (3)$$

where $\Omega$ denotes the set of non-missing entries.
Here $\lambda$ is the weight placed on the regularization term. Note that the summation is over the non-missing entries of $V$ only, including both positives and the sampled negatives. Due to the large size of our problem (on the order of millions), we adopt stochastic gradient descent (SGD) to solve Equation (3). At each iteration of SGD, a single non-missing entry of $V$ is randomly picked, and the partial gradients with respect to the involved $x_u$ and $y_i$ are calculated in order to update them. We further improve optimization efficiency with asynchronous SGD, which updates parameters in parallel for multiple non-missing entries of $V$. Because the user-item view matrix is extremely sparse, there is little chance of parameter update conflicts in our asynchronous SGD. In practice, we find that asynchronous SGD significantly speeds up the optimization without compromising the quality of the solution, and that it has stable convergence behavior.
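For concreteness, here is a minimal single-threaded SGD sketch for the objective in Equation (3); the asynchronous variant simply runs the same per-entry update from multiple threads. The hyperparameter names (`lr`, `lam`) and initialization scale are our own choices, not the paper's.

```python
# SGD for regularized matrix factorization over the non-missing entries.
import random
import numpy as np

def sgd_mf(entries, n_users, n_items, d=100, lam=0.01, lr=0.05, epochs=10):
    """entries: list of (u, i, v) non-missing entries with v in {1, 0}
    (positives and sampled negatives). Returns user factors X and item
    factors Y such that V_ui is approximated by X[u] @ Y[i]."""
    rng = np.random.default_rng(0)
    X = rng.normal(scale=0.01, size=(n_users, d))
    Y = rng.normal(scale=0.01, size=(n_items, d))
    for _ in range(epochs):
        random.shuffle(entries)
        for u, i, v in entries:
            err = v - X[u] @ Y[i]                 # residual on this entry
            xu = X[u].copy()                      # snapshot before the update
            X[u] += lr * (err * Y[i] - lam * X[u])
            Y[i] += lr * (err * xu - lam * Y[i])
    return X, Y
```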
As we will see in the experiment section, the computed latent factors of our content items encode rich high-level visual and semantic information about the corresponding content items. Therefore, the learned latent factors can serve as a good source of supervision for learning meaningful features for our social media, i.e., image visual features in this work. Inspired by the recent success of deep convolutional neural networks (DCNNs) in learning quite generic visual features on large-scale ImageNet, we use the latent factors to create pseudo classes for our content items and then apply a DCNN to learn high-level features in a supervised way.
Even though $V$ can be noisy, the latent factors largely reflect the semantic relationships between our content items. Although this source of information could be used in different forms to learn visual features, such as learning from sampled triplets, we resort to a more traditional supervised approach by creating pseudo classes for the content items. Specifically, we first cluster the latent factor space into $K$ clusters with k-means. Then we create the pseudo classes by partitioning the content items according to the cluster index of their latent factors, i.e., item $i$ receives the pseudo label $c_i = \arg\min_k \|y_i - \mu_k\|^2$, where $\mu_k$ is the centroid of cluster $k$.
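A minimal sketch of the pseudo-class creation step, assuming item latent factors `Y` as produced above ($K = 2000$ mirrors the setting reported to work best in our experiments):

```python
# Cluster item latent factors with k-means; cluster indices become labels.
from sklearn.cluster import KMeans

def make_pseudo_classes(Y, K=2000, seed=0):
    """Y: (n_items, d) array of item latent factors.
    Returns an array of pseudo labels, one per content item."""
    km = KMeans(n_clusters=K, random_state=seed, n_init=10).fit(Y)
    return km.labels_   # labels_[i] is the pseudo class c_i of item i
```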
Finally, we use the same deep neural network structure proposed by Krizhevsky et al., which contains five convolutional layers and two fully connected layers, to learn a $K$-way DCNN classification model, from which we can extract high-level visual features for our social content.
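A hedged PyTorch sketch of such a $K$-way classifier is shown below. The layer layout follows the description above (five conv layers, two fully connected layers before the classifier, with an AlexNet-style configuration assuming 224x224 input crops); the exact filter sizes are illustrative, not taken from the paper.

```python
# AlexNet-style K-way classifier; features are taken from "FC6"/"FC7".
import torch.nn as nn

def make_dcnn(num_classes=2000):
    return nn.Sequential(
        nn.Conv2d(3, 64, 11, stride=4, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
        nn.Conv2d(64, 192, 5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
        nn.Conv2d(192, 384, 3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(),
        nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
        nn.Flatten(),
        nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),   # "FC6"
        nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),          # "FC7"
        nn.Linear(4096, num_classes),                                # K-way softmax
    )
```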
Since the latent factor space is continuous, the above procedure might suffer from a quantization problem. Alternatively, we could learn a DCNN regression function directly from a content item to its latent factor $y_i$, which defines a continuous mapping between the visual images and the latent factors. However, in practice, we find that training the network this way results in inferior features. First, the latent space is a high-dimensional continuous space; learning a regression function directly from image pixels to this continuous space is very slow. Second, the latent factors obtained from the user behavior data are noisy and encode both visual semantic information and some amount of purely social and cultural information. Directly enforcing a mapping between image pixels and noisy latent factors can derail the learning process, especially with an $\ell_2$ norm cost function that is not robust to outliers. By creating discrete pseudo classes, which is essentially a lossy vector quantization coding transform, we can use the softmax loss function, which is more robust to outliers. Moreover, as the latent factors are noisy themselves, the quantization loss may not matter much. In our experiments, we find that learning from discrete pseudo classes produces satisfactory visual features.
To validate our new feature learning paradigm, we collect a large-scale image and user behavior dataset from Behance.net. Behance is a popular social media website where professional photographers, artists, and designers share their work. Content data on Behance are mostly in the form of images, with a small portion of videos as well. The content is very diverse, ranging from photographs to cartoons, paintings, typography, graphic design, etc. Content data on Behance are organized as projects, each of which has associated images and videos. The website organizes all its projects into 67 fields, and each project may be associated with multiple fields. (Note that fields are a very coarse categorization with large overlaps between each other; therefore, they are not suitable as labels for image classification training.) In Figure 3, we show several images from some representative fields.
The content data on Behance are all shared by the users. The project owner, who uploads the project, picks the most representative image as the cover image, which is presented to other users. While browsing a large number of cover images, a user simply clicks on a cover image of interest and is directed to the project page with the full content. Behance records the view data for each project as a list of users who have viewed it. Choosing Behance as our testbed is motivated by the observation that a user tends to view projects of similar content or style, which is the key to our approach.
We first download the cover images of 1.9 million projects, and for each project we obtain the list of users who have viewed it, resulting in 326 million view records from 1.9 million users. Note that we only download the cover images and use them in our experiments, although it would be possible to obtain the additional project content. The density of the view matrix (without negative sampling) is about 0.0093%.
We further process the raw data by removing the most and the least popular projects, because projects with too many or too few views cannot be modeled properly by latent factors. Similarly, we remove users that are too active or too inactive. The minimum and maximum thresholds we use for both project view counts and user view counts are (10, 20000); the thresholds are chosen to retain most of the data. After thresholding, we have 1.9 million projects and 310 million view records from 0.93 million users, and the density of the view matrix goes up to 0.0176%. Although the most and least popular projects and the most and least active users have been removed, the distribution of views over projects, as well as of user activity (the number of views a user gives), is still uneven and long-tailed.
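A possible implementation of this thresholding step is sketched below. Note one subtlety the paper does not specify: removing rows can push some columns below threshold (and vice versa), so we apply the filter iteratively until it stabilizes; this is our assumption.

```python
# Drop projects and users whose view counts fall outside [lo, hi].
import numpy as np

def filter_matrix(V, lo=10, hi=20000):
    """V: binary scipy.sparse CSR view matrix (users x items)."""
    while True:
        item_views = np.asarray(V.sum(axis=0)).ravel()
        user_views = np.asarray(V.sum(axis=1)).ravel()
        keep_i = (item_views >= lo) & (item_views <= hi)
        keep_u = (user_views >= lo) & (user_views <= hi)
        if keep_i.all() and keep_u.all():
            return V                      # stable: nothing left to remove
        V = V[keep_u][:, keep_i]          # re-check after each pruning pass
```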
In this section, we first study the characteristics of our learned latent factors and analyze the information they capture. We then investigate the neural-network-based visual feature and evaluate its performance on the Behance dataset for image similarity, as well as on standard benchmarks for style classification and image category classification.
We split the processed data carefully for fair experiments. First, we split the set of all images into 95% for training and 5% for testing; for clarity, we denote these sets as $I_{tr}$ and $I_{te}$ in the rest of the paper. This first split is for feature learning: $I_{tr}$ is used for DCNN training and $I_{te}$ for evaluating the learned feature representation. While training the DCNN, we hold out a small portion of $I_{tr}$ for validation. Accordingly, the user behavior matrix after negative sampling is split based on $I_{tr}$ and $I_{te}$: we form the matrix $V_{tr}$ by slicing $V$ to the images belonging to $I_{tr}$, and similarly we form $V_{te}$. Next, we split $V_{tr}$ for latent factor analysis. Note that, because the goal of latent factor analysis is to recover the missing entries of $V_{tr}$, this split is over the non-missing entries of $V_{tr}$: we randomly sample 80% of the non-missing entries for training and use the remaining 20% for validation.
To infer latent factors of the projects in $I_{tr}$, we first sample negative entries for $V_{tr}$ following Algorithm 1. For each user $u$, we set the number of negative samples to be twice the number of her/his positives. As a result, $V_{tr}$ has about 900 million non-missing entries, so the matrix density becomes 0.0527%. Note that as the number of non-missing entries grows, the computational cost of solving Equation (3) increases linearly, so for computational efficiency we avoid sampling too many negatives. We then apply regularized matrix factorization on the training split of $V_{tr}$, and use the validation split to estimate the optimal value of the regularization weight $\lambda$. The validation criteria are root-mean-square error (RMSE) on predicting the validation split of $V_{tr}$ with the inferred latent factors, and personalized ranking (PR), which measures the relative ranks of positive and negative items for each user. Note that both RMSE and PR are used in an industrial machine learning competition. The entire $V_{tr}$ is then used to compute the final latent factors with the optimal $\lambda$. We also experimented with the dimensionality $d$ of $x_u$ and $y_i$: a very high-dimensional latent space (large $d$) may cause overfitting and increase computational cost, while a very low-dimensional one may fail to capture the latent structure. Our validation process shows that setting $\lambda$ to 0.01 and $d$ to 100 strikes a reasonable balance, achieving good RMSE and PR while keeping the computation efficient. With this setting, we obtain a validation RMSE of 0.2955 and a PR of 0.1891. Note that smaller values of both measurements indicate better performance, and the expectation of PR at random guessing is 0.5. More details of RMSE and PR can be found in the competition's documentation.
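Hedged sketches of the two validation criteria follow. RMSE is standard; for PR the paper gives only a verbal description, so we implement one metric consistent with it (the fraction of mis-ranked positive/negative pairs per user, i.e. 1 - AUC, averaged over users): 0.5 at random, smaller is better. This interpretation is our assumption.

```python
# Validation metrics over held-out (u, i, v) entries given factors X, Y.
import numpy as np

def rmse(entries, X, Y):
    errs = [(v - X[u] @ Y[i]) ** 2 for u, i, v in entries]
    return float(np.sqrt(np.mean(errs)))

def personalized_ranking(entries, X, Y):
    by_user = {}                                  # u -> ([pos scores], [neg scores])
    for u, i, v in entries:
        by_user.setdefault(u, ([], []))[0 if v == 1 else 1].append(X[u] @ Y[i])
    ranks = []
    for pos, neg in by_user.values():
        if pos and neg:
            misranked = sum(p <= n for p in pos for n in neg)
            ranks.append(misranked / (len(pos) * len(neg)))
    return float(np.mean(ranks))
```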
Following the previous steps, we run k-means clustering on the learned latent factors. We vary the number of clusters $K$ and train a DCNN for each of the corresponding cluster assignments. Following the data preparation pipeline of Krizhevsky et al., we resize training images so that the short side has 256 pixels, then take the center crop of the resized images. At training time, we use random crops to augment the dataset for improved robustness, and dropout is adopted to avoid overfitting. Training approximately converges in 60 epochs, and we extract features from the fully connected layers of the DCNNs.
As discussed earlier, latent factors from view data should reveal some properties of the content data. Since the latent factors also serve as implicit supervision in our feature learning, it is natural to examine what information they capture about individual images and whether our assumption about the correlation structure is valid. To this end, we present a simple experiment to empirically study these latent factors. For each project, represented by its cover image, we retrieve its nearest neighbors in the training set $I_{tr}$ using cosine similarity between latent factors. We find strong visual and semantic proximity between query projects and their nearest neighbors (NNs), and the observation is consistent across the entire set. In Figure 4, we show several randomly selected queries of various semantics along with their top NNs. For clearer illustration, we also list representative tags of the queries. Note that tags are obtained from Behance and are provided by project owners at upload time; they are therefore typically random and noisy. (We only show the informative ones here.)
From Figure 4, we first observe a clear categorical correlation at the coarse level between a query and its NNs across all examples. For instance, rows 1-3 are all portraits of women, rows 4-5 are about automotive design, and the following rows include various subjects and contexts, such as houses, footwear, food photos, etc. Furthermore, we notice that the latent factors also reveal richer visual context at a finer level than category. For instance, despite all being female portraits, queries 1-3 have drastically different image styles and contexts, and these differences are well respected in the retrieved NNs. A similar phenomenon appears in rows 4 and 5: the car in query 4 is a classic car, so there are more classic cars among its NNs, while the cars in row 5 are of more modern design. On the other hand, we also observe a small portion of failed queries whose NNs are irrelevant. These failure cases are mainly caused by two factors: (1) the sparsity and noise in the user behavior data, and (2) social factors that are not visually related, e.g., a user who always likes to view his/her friends' projects.
Given the latent factors analyzed above, a natural question is how well our learned image feature captures the concepts embedded in the latent space. To validate this, we conduct a retrieval experiment with the learned image features: we use images in $I_{te}$ to query against $I_{tr}$, with similarity between images calculated using cosine similarity. To validate the effectiveness of our learned feature, we compare it with the state-of-the-art image classification feature learned on the ImageNet ILSVRC2012 dataset, using the same network structure and training procedure as Krizhevsky et al.; we refer to it as the ImageNet feature.
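The retrieval mechanics are the same in both experiments (cosine similarity over vectors, whether latent factors or DCNN features); a minimal sketch:

```python
# Cosine-similarity nearest-neighbor retrieval over a vector database.
import numpy as np

def top_k_neighbors(query, database, k=5):
    """query: (d,) vector; database: (n, d) matrix of vectors.
    Returns indices of the k most cosine-similar rows of `database`."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every row
    return np.argsort(-sims)[:k]       # indices of the top-k neighbors
```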
We first visually inspect the relationship between a large number of queries and their NNs. In Figure 1, we show a set of randomly selected queries and the NNs retrieved by our feature and by the ImageNet feature; our feature clearly shows significantly better visual consistency.
Furthermore, we quantitatively evaluate the NNs with two measurements designed for the Behance data. The measurements go beyond visual similarity and, most importantly, reflect the relationships between images on Behance. Given a query image, the measurements are:
(1) the viewer overlap between the query and the retrieved NNs. For each pair of query and NN, we measure the ratio between the size of their common viewer set and the size of its union; as shown in Figure 5, this ratio drops as the rank of the neighbor falls further behind. We calculate this measure over the top 100 NNs for each query in $I_{te}$ and report its mean over all queries at every rank position (a minimal sketch of this overlap measure follows the list);
(2) the number of retrieved NNs that have been viewed by the owner of the query image. This quantity is likewise measured between a query and all of its top 100 NNs, and we report the average of this number across all queries in $I_{te}$.
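Measurement (1) is the Jaccard index between viewer sets; a minimal sketch:

```python
# Jaccard-style overlap between the viewer sets of a query and a neighbor.
def common_viewer_ratio(query_viewers, nn_viewers):
    """Both arguments are sets of user ids. Returns |intersection| / |union|."""
    inter = len(query_viewers & nn_viewers)
    union = len(query_viewers | nn_viewers)
    return inter / union if union else 0.0
```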
Plots of the two measurements are shown in Figure 5, where our feature consistently outperforms the ImageNet feature. Because of the marginal difference between the DCNN features from the FC6 and FC7 fully connected layers, we report results for the FC6 features. (For better comparison, we also randomly sample a large number of image pairs and calculate the random-chance expectation of the two measurements.)
We apply our feature as an image descriptor for image classification on standard benchmarks for object class classification and visual style classification. We choose Caltech256 for object class classification. We focus more on image style classification, owing to the nature of our dataset, which is crawled from an artistic asset sharing site. We use three major image style benchmarks: Flickr style, which has 80k images covering 20 visual styles; Wikipaintings, which contains 85k images belonging to 25 styles; and AVA style, which consists of 14k images with 14 photographic styles. Flickr style and Wikipaintings are both from a recent work by Karayev et al., which evaluates various established image features on these benchmarks.
Table 1: Classification accuracy of our feature, the ImageNet feature, and Meta-Class on the benchmarks.
We choose the ImageNet feature and Meta-Class as competing features for their competitive performance. For the deep learning features, we experimented with features from both the FC6 and FC7 layers and found that for object classification the performance difference is marginal, whereas for image style classification, due to dataset bias, FC7 features obtain lower accuracy; therefore, we report results for FC6 features. We also experimented with our feature trained on different numbers of pseudo classes and found that 2000 consistently provides better performance than the others. Each of the style benchmarks is split into 80% for training and 20% for testing, and for Caltech256 we use 50 images per category for training and 20 for testing. We use a linear SVM as the classification model.
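A minimal sketch of this evaluation protocol, where `extract_fc6` is a hypothetical helper (not from the paper) that runs images through the trained DCNN and returns FC6 activations:

```python
# Linear-SVM classification on top of extracted FC6 features.
from sklearn.svm import LinearSVC

def evaluate(train_imgs, train_labels, test_imgs, test_labels, extract_fc6):
    clf = LinearSVC(C=1.0)        # C would be chosen by validation in practice
    clf.fit(extract_fc6(train_imgs), train_labels)
    return clf.score(extract_fc6(test_imgs), test_labels)   # test accuracy
```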
As shown in Table 1, on the image style classification benchmarks our feature achieves similar or even better accuracy than the ImageNet feature, the state-of-the-art single feature for style recognition reported by Karayev et al. For object classification, our feature obtains competitive results on Caltech256. It is worth pointing out that the ImageNet feature is learned on the ILSVRC2012 dataset with 1000 categorical labels, and Meta-Class is trained on a subset of 8000 synsets of the entire ImageNet database, whereas our feature is learned from images with noisy user view data.
We propose a novel data-driven feature learning paradigm that exploits user behavior data collected from social media websites. Our feature learning paradigm is different from existing methods in that it does not rely on category labels. We show that the learned feature outperforms the state-of-the-art features on our Behance 2M Dataset in terms of learning better image similarities while performing competitively on various standard recognition benchmarks. We believe using social data for feature learning and image recognition in general is a promising new research direction.
In terms of future work, we plan to pursue two directions. First, we want to make our learning paradigm more robust to data noise and non-visual factors. Second, we would like to go beyond the view records and incorporate additional social data, such as user relationships on social media. For more information, please visit the project webpage: http://www.cs.dartmouth.edu/~chenfang/proj_page/Behance_CVPR15/.