Recommender systems help to discover items of personal interest by learning from historical feedback in order to understand the factors that influence users’ decisions. Recently, there has been an interest in developing recommender systems that are ‘visually aware,’ in the sense that the visual features (extracted from product images) are incorporated directly into the recommendation objective. Such systems can substantially improve recommendation accuracy, especially in settings (such as clothing recommendation) where visual factors strongly guide users’ decisions.
However, actually incorporating visual signals can be challenging. Extracting meaningful representations from image data alone is not straightforward given the complexity of notions like style, and can require costly, high-dimensional representations (e.g. CNN-based methods). Furthermore, high-dimensional ‘black box’ image models offer little by way of interpretability, which can impede usability when building interfaces that interact with these representations.
In this paper we seek to build visually-aware representations on top of interpretable visual features based on fine-grained parsing of product images, for the problem of clothing recommendation. We show that such features can lead to superior performance compared to ‘black box’ image representations, while substantially reducing their dimensionality, and also that such features can be used to develop more usable and interactive systems.
2. Related Work
We build upon latent factor models, and in particular Bayesian Personalized Ranking (BPR) (Rendle et al., 2009), which is trained using implicit feedback
(i.e., purchases vs. non-purchases) in order to estimate rankings of items that users are likely to interact with. Specifically, our work extends ideas from visually-aware recommendation as well as models of fashion and clothing style.
Visually-aware Recommender Systems.
Recent works have introduced visually-aware recommender systems where users’ rating dimensions are modeled in terms of visual signals in the system (product images). Systems have been built for link prediction (McAuley et al., 2015) and personalized search, though most closely related are methods that extend traditional recommender systems (such as Bayesian Personalized Ranking) to incorporate visual dimensions to facilitate item recommendation tasks (He and McAuley, 2016b). We build on extensions to such models that incorporate temporal dynamics in addition to visual signals to capture the evolution of fashion style (He and McAuley, 2016a).
Fashion and Clothing Style.
Beyond the methods mentioned above, modeling fashion or style characteristics has emerged as a popular computer vision task in settings other than recommendation, e.g. with a goal to categorize or extract features from images, without necessarily building any model of a ‘user.’ This includes categorizing images as belonging to a certain style (Bossard et al., 2013), as well as models that create rich stylistic annotations, like DeepFashion (Liu et al., 2016).
3.1. Visual Feature Generation
To compute the features and attribute probabilities used in the recommendation experiments, we implemented a variation of the model proposed in (Dong et al., 2016), pre-trained on ImageNet (Russakovsky et al., 2015) to obtain a general set of intermediate feature representations, and subsequently fine-tuned for several epochs on a large proprietary dataset of fashion images annotated with the target attributes. Care was taken to ensure that each attribute is represented to a sufficient extent in the fine-tuning dataset, to guarantee a consistent degree of generalization to new types of images for all classes.
In terms of performance, our model does comparably to or better than DeepFashion (Liu et al., 2016) or MTCT (Dong et al., 2016) according to the numbers reported in their respective papers, although it is difficult to compare exactly, as the test datasets are different and our attribute categories are not necessarily the same as theirs.
| Notation | Explanation |
| --- | --- |
| $\mathcal{U}$, $\mathcal{I}$ | user set, item set |
| $\hat{x}_{u,i}$ | predicted ‘score’ user $u$ gives to item $i$ |
| $K$ | number of latent factors |
| $D$ | number of visual factors |
| $F$ | number of image features |
| $\alpha$ | global offset (scalar) |
| $\beta_u$, $\beta_i$ | bias of user $u$, item $i$ (scalar) |
| $\gamma_u$, $\gamma_i$ | latent factors of user $u$, item $i$ ($K$) |
| $\theta_u$, $\theta_i$ | visual factors of user $u$, item $i$ ($D$) |
| $f_i$ | visual (image) features of item $i$ ($F$) |
| $\mathbf{E}$ | embedding matrix ($D \times F$) |
| $\beta'$ | visual bias vector ($F$) |
| $a_u$ | user $u$’s affinity vector ($F$) |
| $a_u^{(t)}$ | user $u$’s affinity vector at time $t$ ($F$) |
| $a_{u,j}$ | $j$’th index of user $u$’s affinity vector (scalar) |
| $\mathbf{1}_j$ | one-hot vector ‘on’ at index $j$ ($F$) |
| $c$ | item scaling factor (scalar) |
| $d$ | feature scaling factor (scalar) |
3.2. Bayesian Personalized Ranking
The core of our prediction model is built on Matrix Factorization (MF), a state-of-the-art method for rating prediction. The basic MF formulation describes each user’s preference towards an item in terms of a set of user- and item-specific latent factors ($\gamma_u$, $\gamma_i$), such that the inner product $\langle \gamma_u, \gamma_i \rangle$ encodes the compatibility between user $u$ and item $i$. In our case, the preference predictor extends the basic latent factor model by learning a set of visual factors ($\theta_u$, $\theta_i$), where $\langle \theta_u, \theta_i \rangle$ encodes a separate visual-specific compatibility. Using the image features $f_i$ directly for $\theta_i$ is problematic due to the high dimensionality of $f_i$, so we learn an embedding kernel $\mathbf{E}$ which maps our image features to a lower-dimensional space ($\theta_i = \mathbf{E} f_i$). Thus, a user $u$’s predicted rating of an item $i$ is given by the following predictor:

$$\hat{x}_{u,i} = \alpha + \beta_u + \beta_i + \langle \gamma_u, \gamma_i \rangle + \langle \theta_u, \mathbf{E} f_i \rangle + \langle \beta', f_i \rangle \quad (1)$$
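As a concrete illustration, the predictor above can be sketched in a few lines of NumPy. This is a hypothetical sketch, not the paper’s implementation: the function name is our own, and the array shapes follow the notation table ($K$ latent factors, $D$ visual factors, $F$ image features).

```python
import numpy as np

def predict_score(alpha, beta_u, beta_i, gamma_u, gamma_i,
                  theta_u, E, f_i, beta_prime):
    """Sketch of the VBPR-style preference predictor: global offset
    + user/item biases + latent compatibility + visual compatibility
    (via the D x F embedding E) + a per-feature visual bias."""
    visual_factors_i = E @ f_i            # project F-dim features to D dims
    return (alpha + beta_u + beta_i
            + gamma_u @ gamma_i           # non-visual compatibility
            + theta_u @ visual_factors_i  # visual compatibility
            + beta_prime @ f_i)           # visual bias term
```

Items with identical latent factors can still receive different scores through the visual terms, which is what enables the cold start behavior discussed later.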
3.2.1. Temporal Dynamics
To incorporate temporal dynamics in addition to visual signals, we employ the technique introduced in (He and McAuley, 2016a), which extends eq. 1 by parameterizing the visual factors and bias by a set of learned epochs (a fixed number of flexible time-based dataset partitions). Critically, we can describe an item’s visual factors at time (epoch) $t$ in terms of a time-specific weighting vector $w^{(t)}$ (users weigh visual dimensions, or ‘styles’, differently over time) and a temporal drift term $\Delta\theta_i^{(t)}$ (items gain and lose attractiveness in different dimensions over time):

$$\theta_i^{(t)} = w^{(t)} \odot \theta_i + \Delta\theta_i^{(t)} \quad (2)$$

where $\odot$ indicates the Hadamard product.
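A minimal sketch of this reweighting, assuming the Hadamard-product-plus-drift form described above:

```python
import numpy as np

def temporal_visual_factors(theta_i, w_t, delta_theta_i_t):
    """Sketch: an item's visual factors at epoch t are the base factors
    theta_i, reweighted per-dimension by w_t (Hadamard product), plus a
    per-item, per-epoch drift term."""
    return w_t * theta_i + delta_theta_i_t
```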
3.2.2. Fitting the Model
To fit our model, we use Bayesian Personalized Ranking (BPR), a pairwise ranking optimization framework. BPR adopts stochastic gradient descent to efficiently learn the parameters of the model, using the desired preference predictor and implicit feedback from the training data. The same training procedure is used for the temporally-aware model, adjusted to incorporate review timestamps as additional feedback (used to learn the optimal epoch segmentation). For further details regarding the training procedure and model formulation, refer to (He and McAuley, 2016a).
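For illustration, a single BPR-SGD step on plain MF factors might look as follows. This is a hedged sketch, not the training code used here: bias terms, visual factors, and triple sampling are omitted, and the learning rate and regularizer values are placeholders.

```python
import numpy as np

def bpr_step(gamma_u, gamma_i, gamma_j, lr=0.05, reg=0.01):
    """One BPR-SGD step for a sampled triple (u, i, j), where item i is
    observed (e.g. purchased) and item j is not. We ascend the gradient
    of ln sigmoid(x_uij) - regularization, which pushes the score of i
    above the score of j for user u."""
    x_uij = gamma_u @ gamma_i - gamma_u @ gamma_j  # score difference
    sig = 1.0 / (1.0 + np.exp(x_uij))              # d/dx ln sigmoid(x)
    new_u = gamma_u + lr * (sig * (gamma_i - gamma_j) - reg * gamma_u)
    new_i = gamma_i + lr * (sig * gamma_u - reg * gamma_i)
    new_j = gamma_j + lr * (-sig * gamma_u - reg * gamma_j)
    return new_u, new_i, new_j
```

Each step widens the gap between an observed item and a non-observed one, which is exactly the pairwise ranking objective that the AUC metric below evaluates.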
| Dataset | Setting | RAND | POP | MM-MF | BPR-MF | VBPR | **I-VBPR** | TVBPR | **I-TVBPR** |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Clothing (all) | All Items | 0.504556 | 0.443571 | 0.640032 | 0.645513 | 0.748643 | 0.744377 | 0.761767 | 0.749785 |
| Women’s clothing | All Items | 0.494471 | 0.391156 | 0.612812 | 0.621887 | 0.741344 | 0.740134 | 0.758179 | 0.745743 |
| Men’s clothing | All Items | 0.505361 | 0.356822 | 0.629649 | 0.631144 | 0.714754 | 0.726443 | 0.734160 | 0.730461 |

Table 2. AUC on the test set. Boldface indicates the models using interpretable visual features. All models’ hyperparameters were optimized during training for AUC on the validation set. Women’s and Men’s are the two largest subcategories within Clothing.
We use a modified version of the Amazon.com dataset introduced by McAuley et al. (He and McAuley, 2016a). The original dataset is a content-rich dataset containing millions of items, metadata, images, and implicit feedback (e.g., review data), which we use for our ground truth ‘preference’ statistic.
In our testing, we use a subset of the original dataset (Clothing) containing 1.4 million items filtered from categories that encode fashion dynamics (clothing, jewelry, bags, etc.). Additionally, we report results on the largest two subcategories within our dataset, Women’s and Men’s (643,195 and 278,762 items, respectively). As part of preprocessing, we filter out users who have fewer than 5 written reviews, to increase the density of the dataset.
4.2. Evaluation Methodology
Our data is split into training, validation and test sets by sampling, for each user, one item for validation ($\mathcal{V}_u$) and another for testing ($\mathcal{T}_u$); the remaining data is used for training ($\mathcal{P}_u$). Similar to (He and McAuley, 2016a), all methods reported are evaluated on the test set with the widely used AUC (Area Under the ROC curve) metric:

$$AUC = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{|E(u)|} \sum_{(i,j) \in E(u)} \delta(\hat{x}_{u,i} > \hat{x}_{u,j}) \quad (3)$$

where $\delta(b)$ is an indicator function that returns 1 iff $b$ is true, and the set of evaluation pairs for user $u$ is:

$$E(u) = \{(i, j) \mid (u, i) \in \mathcal{T}_u \wedge (u, j) \notin (\mathcal{P}_u \cup \mathcal{V}_u \cup \mathcal{T}_u)\} \quad (4)$$
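The per-user AUC described above can be sketched as follows. This is a simplified illustration: `scores` is assumed to be the model’s precomputed scores for one user over all items, and `excluded_items` the items in that user’s training and validation sets.

```python
def auc_for_user(scores, test_item, excluded_items, n_items):
    """Sketch of the per-user AUC: the fraction of evaluation pairs
    (i, j) for which the held-out test item i outranks a
    non-interacted item j."""
    candidates = [j for j in range(n_items)
                  if j != test_item and j not in excluded_items]
    wins = sum(scores[test_item] > scores[j] for j in candidates)
    return wins / len(candidates)
```

Averaging this quantity over all users gives the overall AUC; a random ranker scores about 0.5, which matches the RAND baseline in Table 2.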
All methods are evaluated under two settings, ‘All Items’ and ‘Cold Start’. ‘All Items’ evaluates the average AUC across the entire test set, whereas ‘Cold Start’ evaluates the average AUC for items with fewer than five recorded feedback instances in the training set. Cold start performance is particularly important in the clothing recommendation setting, where most datasets will have long-tailed distributions due to the constant flow of new items with no prior feedback.
4.3. Comparison Methods
We compare our method using low-dimensional interpretable visual features (I-VBPR), and additionally temporal dynamics (I-TVBPR), with state-of-the-art visually- and temporally-aware methods (VBPR and TVBPR) using ‘black box’ image features extracted from AlexNet (Krizhevsky et al., 2012). We also compare our method to several Matrix Factorization approaches for reference; both MF-based baselines (MM-MF and BPR-MF) were implemented using MyMediaLite (http://www.mymedialite.net). MM-MF is a pairwise MF model optimized for hinge ranking loss, while BPR-MF is the state-of-the-art non-visual method for personalized ranking on implicit feedback datasets. RAND (random) assigns preferences at random, while POP (popularity) rank prediction is equivalent to an item’s popularity.
4.4. Performance and Analysis
Table 2 shows the average AUC for each method on the Women’s, Men’s, and overall Clothing test datasets. We used 10 latent factors for all models, and 10 additional visual factors for the visually-aware models. For the visually-aware methods, the regularization hyperparameter $\lambda$ was set to 5 and the remaining regularization hyperparameters were set to 0. We summarize the results as follows:
Temporally- and visually-aware methods. Incorporating visual signals significantly increases performance: all visually-aware methods improve by at least 10% compared to BPR-MF, the next-best baseline. Incorporating temporal information on top of visual information increased accuracy in all cases, though with smaller margins.
Interpretable vs. ‘black box’ image features. Due to the difference in the dimensionality of $f_i$ between the interpretable features and the black box features (several hundred vs. several thousand image features), the interpretable models use a small fraction of the total model parameters compared to the black box models (<5%). Despite the large discrepancy in parameter count, models using the interpretable features achieve comparable prediction accuracy: in the overall dataset and in Women’s Clothing, the black box methods maintained a 1-2% performance boost in the ‘All Items’ setting, while the interpretable methods maintained a similar lead in most of Men’s Clothing and several cold start scenarios.
Ultimately, our results demonstrate that by using a relatively low number of interpretable image features, our model can produce comparable and in several cases superior results to the black box approach while substantially reducing model complexity and allowing for more usable and interactive systems.
5. Interactive recommendation
In this section we demonstrate how our prediction model can be extended to build a personalized, interactive, fashion-aware recommendation system. In addition to improving raw performance, another tangible benefit of incorporating interpretable visual features into our prediction model is that it allows us to model users’ preferences in terms of directly observable item properties. Unlike the typical features extracted from pre-trained CNNs, interpretable visual features allow the recommender system to generate sets of recommendations driven by a specific feature, or, similarly, to narrow down a set of items based on explicit visual criteria.
5.1. User personalization
We initialize the recommender system for user $u$ by constructing an affinity vector $a_u$, which represents the sensitivity of user $u$ towards each visual dimension. This is done by first fitting the model, then recording an average response for each visual dimension using a modified version of the preference predictor function (eq. 5). Given a user $u$ and a feature dimension $j$, we calculate the average response towards $j$ using the one-hot vector $\mathbf{1}_j$ (1.0 at index $j$, 0.0 at all other indices) across a random sample of items $S$, and store the result in the affinity vector:

$$a_{u,j} = \frac{1}{|S|} \sum_{i \in S} \langle \theta_u, \mathbf{E}\,(f_i \odot \mathbf{1}_j) \rangle \quad (5)$$
Once the response towards each dimension has been recorded in the respective $a_{u,j}$, we rescale the affinity vector to match the original feature scaling using a normalization function that divides each element by the sum of the elements in $a_u$. This allows us to uncover which visual dimensions the user is most responsive to.
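The initialization procedure can be sketched as follows, under an assumed form for the masked response (the item features are masked by a one-hot vector before being embedded, so only dimension $j$ contributes):

```python
import numpy as np

def init_affinity(theta_u, E, item_features):
    """Sketch of affinity-vector initialization: for each feature
    dimension j, average the visual response to one-hot-masked item
    features over a sample of items, then normalize to sum to 1."""
    n_items, F = item_features.shape
    a_u = np.zeros(F)
    for j in range(F):
        one_hot = np.zeros(F)
        one_hot[j] = 1.0
        # visual response when only feature j is 'visible'
        responses = [theta_u @ (E @ (f_i * one_hot)) for f_i in item_features]
        a_u[j] = np.mean(responses)
    return a_u / a_u.sum()  # rescale to the original feature scaling
```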
5.2. User interaction
Our goal is not only to generate highly-personalized item recommendations based on visual signals, but also to allow the user to tailor their own recommendation results dynamically. Once the affinity vector has been initialized, we can begin to generate item recommendations using a nearest neighbor search within the item set using the visual feature space. A user can dynamically update the model’s generated recommendations by scaling their own affinity vector in the direction of a chosen item $i$ (see figure 1):

$$a_u \leftarrow a_u + c \cdot f_i \quad (6)$$
Additionally, a user can choose to boost a specific visual feature $j$ (e.g., color) within their affinity vector:

$$a_u \leftarrow a_u + d \cdot \mathbf{1}_j \quad (7)$$
In each case, the scaling constants $c$ and $d$ are fixed prior to runtime and determined experimentally.
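Both interactions can be sketched as simple vector updates. These are assumed forms consistent with the description above; the re-normalization mirrors the rescaling step in section 5.1.

```python
import numpy as np

def steer_toward_item(a_u, f_i, c=0.1):
    """Sketch: shift the affinity vector toward a chosen item's feature
    vector, weighted by the fixed item scaling factor c, then
    re-normalize so elements sum to 1."""
    updated = a_u + c * f_i
    return updated / updated.sum()

def boost_feature(a_u, j, d=0.1):
    """Sketch: boost a single visual feature dimension j by the fixed
    feature scaling factor d, then re-normalize."""
    updated = a_u.copy()
    updated[j] += d
    return updated / updated.sum()
```

After either update, recommendations are refreshed with a nearest neighbor search in the visual feature space using the new affinity vector.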
6. Tracking fashion trends
Previous work has focused on visualizing the temporal shift of the latent visual dimensions by plotting the weighting vector $w^{(t)}$ (eq. 2) at each epoch (He and McAuley, 2016a). However, the meaningfulness of such visualizations is ambiguous, since it requires inferring which visual property or style (among many) each latent dimension may be capturing.
Since our model utilizes interpretable visual features (instead of extracted CNN features), we are able to visualize the temporal dynamics of our dataset at the feature level. To track the popularity of a feature $j$ within a learned epoch $t$, we can sum the time-weighted influence of the one-hot vector $\mathbf{1}_j$ on each latent visual dimension:

$$\mathrm{influence}(j, t) = \sum_{k=1}^{D} \left( w^{(t)} \odot \mathbf{E}\,\mathbf{1}_j \right)_k \quad (8)$$
Thus, given an interpretable feature $j$, using the feature’s influence we can model how its popularity has changed over time (see figure 2). This allows us to directly observe the evolution of real interpretable feature dimensions, as opposed to the ambiguous dimensions of the latent embedding.
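The feature-level influence computation can be sketched as follows (an assumed form: $\mathbf{E}\,\mathbf{1}_j$ selects the embedding column for feature $j$, which is then reweighted per latent dimension by the epoch’s weighting vector):

```python
import numpy as np

def feature_influence(E, w_t, j):
    """Sketch: popularity of interpretable feature j in epoch t,
    computed by summing the time-weighted influence of the feature's
    embedding column (E @ one_hot_j == E[:, j]) over the D latent
    visual dimensions."""
    return float(np.sum(w_t * E[:, j]))
```

Plotting this quantity across epochs traces a feature’s rise or decline in popularity directly, without having to interpret the latent dimensions themselves.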
In this paper we introduced a novel approach to fashion-aware product recommendation that utilizes interpretable visual features. We show that such features can lead to superior performance compared to ‘black box’ image representations, while substantially reducing their dimensionality, and also that such features can be used to develop more usable and interactive systems. Future work will focus on extended applications enabled by the feature generation process, including iterative clustering and querying of recommendation results.
- Bossard et al. (2013) Lukas Bossard, Matthias Dantone, Christian Leistner, Christian Wengert, Till Quack, and Luc Van Gool. 2013. Apparel classification with style. In ACCV.
- Dong et al. (2016) Qi Dong, Shaogang Gong, and Xiatian Zhu. 2016. Multi-Task Curriculum Transfer Deep Learning of Clothing Attributes. Technical Report. arXiv:1610.03670 http://arxiv.org/abs/1610.03670
- He and McAuley (2016a) Ruining He and Julian McAuley. 2016a. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In World Wide Web.
- He and McAuley (2016b) Ruining He and Julian McAuley. 2016b. VBPR: Visual Bayesian personalized ranking from implicit feedback. In AAAI.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with Deep Convolutional Neural Networks. In NIPS.
- Liu et al. (2016) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016. DeepFashion: Powering Robust Clothes Recognition and Retrieval with Rich Annotations. In CVPR.
- McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on style and substitutes. In SIGIR.
- Rendle et al. (2009) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In UAI.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.