Personalized Fashion Recommendation from Personal Social Media Data: An Item-to-Set Metric Learning Approach

05/25/2020 ∙ by Haitian Zheng, et al. ∙ University of Rochester

With the growth of online shopping for fashion products, accurate fashion recommendation has become a critical problem. Meanwhile, social networks provide an open and new data source for personalized fashion analysis. In this work, we study the problem of personalized fashion recommendation from social media data, i.e. recommending new outfits to social media users that fit their fashion preferences. To this end, we present an item-to-set metric learning framework that learns to compute the similarity between a set of historical fashion items of a user to a new fashion item. To extract features from multi-modal street-view fashion items, we propose an embedding module that performs multi-modality feature extraction and cross-modality gated fusion. To validate the effectiveness of our approach, we collect a real-world social media dataset. Extensive experiments on the collected dataset show the superior performance of our proposed approach.




1. Introduction

With social networks thriving, people have started to share everyday moments online. For instance, they share the places they visited, the food they had, and the outfits they wore. There are multiple fashion-oriented online communities where users show off their dressing styles and connect with new people who share similar fashion interests. As an example, Lookbook users can showcase their fashion styles with various street-view selfie posts (Fig. 1), which no doubt reveal their individual fashion preferences. This emerging trend presents a new opportunity for personalized fashion analysis through analyzing user-created content, and allows us to uncover the fashion interests of individual users at a personal level.

Figure 1. An example user web page on Lookbook. A user web page contains street-view fashion selfie posts uploaded by the user, which reveal his/her personal fashion interests. Taking such user activities as inputs, our model can recommend personalized outfits based on the user's fashion preferences. In this specific example, the top-3 recommendations in the figure reflect the minimal and monotonic color looks that the user prefers. Note that outfit #1 correctly predicts one of her own outfits in the testing pool.

In this work, we study the problem of personalized fashion recommendation with personal social media data, which seeks to recommend new fashion outfits based on online activities of social network users. Although there have been a number of studies on clothing retrieval and recommendation (Hadi Kiapour et al., 2015; Huang et al., 2015; Kuang et al., 2019; Wang et al., 2017; Jagadeesh et al., 2014; Simo-Serra et al., 2015; Iwata et al., 2011; Hu et al., 2015; Liu et al., 2017; Ma et al., 2019), exploiting personal social media data for fashion recommendation is fundamentally challenging and less explored. In particular, the online activities of social media users are often no more than street-view selfies with additional word descriptions. The granularity of such data is much coarser than that of other explored data types, such as transaction records (Cardoso et al., 2018), human evaluation (Tangseng et al., 2017; Tangseng and Okatani, 2020), garment item annotations (Tangseng et al., 2017; Li et al., 2017; Hu et al., 2015; Yu et al., 2018) and annotated attributes (Yang et al., 2019). As a result, most established models are not directly applicable to our task due to the lack of supervision.

Without requiring other fine-grained supervision beyond the selfie posts, we propose a self-supervised approach for effective and personalized fashion recommendation. We regard the selfie posts of users either as a set that reveals their personal fashion preferences, or as the outfit items to be recommended. Upon this basis, we propose to learn an item-to-set metric that measures the similarity between a set and an item for personalized recommendation. To this end, we propose a self-supervised task that seeks to minimize the item-to-set distance for the set and items of the same user, while maximizing such distances for sets and items of different users. Benefiting from such a training scheme, our framework is able to perform personalized recommendation without requiring any additional supervision such as transaction records (Cardoso et al., 2018) or human evaluation (Tangseng et al., 2017; Tangseng and Okatani, 2020).

Although metric learning has been well-studied in the literature (Hadsell et al., 2006; Schroff et al., 2015; Sohn, 2016; He, 2019), learning such an item-to-set metric is previously unexplored and faces new challenges. In reality, a user can have interest in more than one fashion style. Therefore, the item-to-set similarity cannot be captured by an oversimplified average of multiple item-wise similarities. Alternatively, a nearest-neighbor item-to-set metric is difficult to learn as it is susceptible to noise and outliers.

To address the above issues, we propose a new and generalized item-to-set metric. Specifically, we propose an importance weight for each item in the set. The importance weight changes with the set and the query, and it serves to filter out outliers and unrelated items from the set. Different from nearest-neighbor classification, it can update the features of all set items and enable effective learning. We consider two principles, namely neighboring importance and intra-set importance, to implement the importance function. The neighboring importance serves to filter out set items that are far away from the new item, while the intra-set importance serves to filter out the noise and outliers inside the set.

We further propose a user-specific item-to-set metric. The new metric is motivated by the fact that different users focus on different aspects of fashion products. As a result, the similarity metric should depend on the set of selfie posts to facilitate more targeted fashion recommendation. To utilize user-specific information, we propose a space transform operation that transforms item features into user-specific space before the similarity computation. The user-specific design further boosts the fashion recommendation performances.

Extracting fashion preference information from user selfie posts involves understanding the raw fashion images and the associated text descriptions, as well as fusing information from multiple sources for better feature integration. To this end, we design a multi-modal embedding module. In particular, we design an image embedding module that extracts high-level fashion features from raw selfie images. We also design hashtag and title embedding modules that utilize attentive averaging to extract semantic features from sets of word descriptions. To alleviate the influence of incorrect parsing, missing modalities or typos, we design a cross-gated fusion module that performs progressive feature fusion for each modality.

To validate the effectiveness of our proposed approach, we collect a real-world social media dataset. Through extensive experiments on the network design, we validate the effectiveness of our approach.

We highlight our contributions as follows:

  • We present a fashion recommendation system built on personal social media data. Our system recommends personalized outfits using a few unconstrained street-view selfie posts of users.

  • We propose a self-supervised scheme to enable the training of the system. Our approach is based on a novel item-to-set metric learning framework that requires only the user selfie posts as the supervision.

  • We design a multi-modal embedding module that better fuses the social media data for extraction of fashion features.

  • We evaluate our approach on our collected social media dataset. Extensive experiments on the real-world dataset demonstrate the effectiveness of our approach.

2. Related Works

Fashion analysis has drawn broad interest in the multimedia community. The recent studies on fashion analysis can be categorized into four aspects, namely: 1) fashion annotation, 2) fashion retrieval, 3) fashion composition, and 4) fashion recommendation.

2.1. Fashion Annotation

Fashion annotation aims at generating fashion attributes to facilitate automatic fashion analysis. It includes clothing parsing, recognition, attribute annotation and landmark detection. Clothing parsing (Yamaguchi et al., 2012; Dong et al., 2014; Liu et al., 2016a) predicts garment items at pixel-level. Recent works in this field (Liang et al., 2018; Zhou et al., 2018; Gong et al., 2018) apply techniques from semantic segmentation (Ren et al., 2015) and achieve significant improvements. Attribute annotation (Liu et al., 2012a, b; Chen et al., 2015, 2012) aims to generate fashion attributes from clothing images. Liu et al. (Liu et al., 2016a) propose a large-scale fashion dataset with attribute annotation. Ak et al. (Ak et al., 2018) utilize weakly supervised learning to annotate attributes with localization. Towards fashion landmark detection, Liu et al. (Liu et al., 2016b) propose a cascade of multiple convolutional networks to detect landmarks. Yan et al. (Yan et al., 2017) propose a recurrent transformer network for unconstrained fashion landmark detection.

2.2. Fashion Retrieval

Clothing retrieval (Yamaguchi et al., 2014, 2013; Wang et al., 2013) attempts to find similar clothing from a person image query. Typical approaches on fashion retrieval (Liu et al., 2012b; Wang et al., 2013) utilize attributes to learn a fashion representation. Recently, Kiapour et al. (Hadi Kiapour et al., 2015) propose a deep metric network to retrieve garment items. Huang et al. (Huang et al., 2015) design an attribute-aware ranking network for retrieval feature learning. Kuang et al. (Kuang et al., 2019) design a graph reasoning network that learns visual similarity for fashion retrieval. Wang et al. (Wang et al., 2017) design a self-learning model that learns to retrieve from image inputs.

2.3. Fashion Composition

Fashion composition focuses on measuring whether clothing items are compatible and aims to generate visually complementary combinations of fashion items. To this end, Li et al. (Li et al., 2017) propose a learning-based approach on set data for mining outfit compositions. Han et al. (Han et al., 2017) predict compatibility relationships of fashion items with sequence models. Song et al. (Song et al., 2017) learn compatibility models by Bayesian personalized ranking. Hsiao et al. (Hsiao and Grauman, 2018) study the problem of automatic capsule creation. Ma et al. (Ma et al., 2019) derive fashion knowledge using a social media database.

2.4. Fashion Recommendation

There are several attempts at personalized fashion recommendation. Jagadeesh et al. (Jagadeesh et al., 2014) design a data-driven model that performs complementary fashion recommendation from visual input. Simo-Serra et al. (Simo-Serra et al., 2015) propose a random field model that jointly reasons about fashionability factors of users for fashion outfit recommendation. Iwata et al. (Iwata et al., 2011) propose a probabilistic topic model for learning fashion coordinates. Hu et al. (Hu et al., 2015) propose a tensor factorization approach for collaborative fashion recommendation. Liu et al. (Liu et al., 2017) design a visual-based model that learns style features of items for sensing preferences of users. Hidayati et al. (Ma et al., 2019) study the problem of fashion recommendation for personal body types.

3. Data Construction

We crawl a social media dataset from a popular fashion-focused website, where users can freely post their outfits and selfies. The left part of Fig. 1 shows the profile of a user, with recent selfies posted by that user. We crawl a total of 2,293 personal profiles. For each user, we keep their 100 most recent selfie posts with the corresponding photo titles and hashtags. The 2,293 users do not include any user with more than 7,000 fans, because such accounts are most likely commercial accounts of fashion brands, which contain diverse photos of different fitting models.

3.1. Data Overview

In Table 1, we show the basic statistics of our collected data, which include user attributes such as age, number of looks, likes per picture, number of fans, and number of followings. We also visualize the 20 most frequent words from the hashtags and titles in Fig. 2. The frequency plot suggests that hashtags are often words that describe the styles of outfits, such as street, blogger and summer. On the other hand, title words are more specific and usually describe attributes and colors of outfits. In Table 2, we show garment statistics of our processed dataset. The garment statistics show that basic garments such as upper clothes, pants and shoes have higher proportions, while accessories have lower proportions.

Attribute      Min    Max      Mean       Median   Std        Skewness
Age            19     70       29.20      29       4.48       1.88
#Looks         100    2,414    203.33     158      149.64     5.20
#Likes         0      8,730    113.23     80       119.57     4.23
#Fans          3      6,672    1,171.36   765      1,157.77   1.45
#Followings    0      21,423   269.28     73       982.48     12.87

Table 1. The user statistics of our collected dataset.

Figure 2. The word frequency for the hashtag and title modalities. Words are sorted by the frequency in the modalities.

Garment          Count      Proportion
Upper Clothes    178,041    0.21
Pants            167,697    0.20
Shoes            156,880    0.19
Coat             104,710    0.12
Dress            62,627     0.07
Skirt            60,315     0.07
Hat              49,887     0.06
Socks            38,627     0.05
Glove            11,785     0.01
Scarf            9,775      0.01
Jumpsuits        4,619      0.01

Table 2. The garment statistics of our processed dataset.

3.2. Data Preprocessing

3.2.1. Image Data Preprocessing

Since the raw selfie images often contain multiple concatenated pictures, we utilize a detection model (Joseph and Ali, 2018) to crop person bounding-boxes, and then select the ones with the highest scores to obtain the best-fitted person images. We also exclude grayscale images which typically cannot fully reflect the outfit styles.

3.2.2. Word Data Preprocessing

For title and hashtag features, we utilize GloVe word embeddings pretrained on Wikipedia to extract features. Specifically, for each hashtag, we first apply the Viterbi algorithm to segment it into words. The embedding of each word is taken as input to generate hashtag features. Likewise, we use the embedding of every word from the title to generate the title features. After extracting both word and image features, the features are simply concatenated as a row vector that corresponds to a single post.
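The hashtag segmentation step can be illustrated with a small Viterbi word splitter. This is a minimal sketch, not the authors' implementation: the unigram probabilities in `UNIGRAM`, the unknown-word penalty and the maximum word length are hypothetical values chosen only for the example.

```python
import math

# Hypothetical toy unigram probabilities (NOT from the paper); a real system
# would estimate them from a large text corpus.
UNIGRAM = {"street": 0.02, "style": 0.02, "summer": 0.015, "look": 0.01,
           "streets": 0.001, "my": 0.03}
UNKNOWN_LOGP = math.log(1e-10)  # assumed per-character penalty for unknown words
MAX_WORD_LEN = 12               # assumed upper bound on word length


def word_logp(word):
    """Log-probability of a candidate word under the toy unigram model."""
    if word in UNIGRAM:
        return math.log(UNIGRAM[word])
    return UNKNOWN_LOGP * len(word)


def segment(hashtag):
    """Split a hashtag into its most likely word sequence (Viterbi search)."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_logp(text[start:end])
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow the back-pointers from the end of the string to recover the words.
    words, end = [], n
    while end > 0:
        words.append(text[back[end]:end])
        end = back[end]
    return words[::-1]


print(segment("#streetstyle"))
```

Each recovered word would then be looked up in the pretrained embedding table to form the hashtag input set.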

4. Method

As illustrated in Fig. 1, we consider the problem of personalized fashion recommendation from personal social media, which aims at recommending new outfits to a user based on several selfies of that user. We intend to achieve the following objectives:

  1. Multi-modality feature extraction. The multi-modality social-media activities can reveal fashion preferences of individuals. Our system should extract the fashion preferences of individuals based on user activities, and then match the preference with candidate outfits.

  2. Multi-interest awareness. User selfie posts can reveal multiple fashion interests that a user may have. Our system should effectively represent the multiple interests of a user for better recommendation.

  3. User uniqueness. Different users may pay attention to specific fashion components while being less sensitive to others. Our recommendation should take into account such user-specific fashion interests for a more targeted recommendation.

To meet those objectives, we propose an embedding network in Sec. 4.1 that embeds the unique activities and outfit candidates into a feature space. Next, in Sec. 4.2, we present an item-to-set metric learning framework that learns to match user selfie posts to new outfits. Finally, in Sec. 4.3, we provide the training scheme and objective of our model.

4.1. Fashion Item Embedding Module

Social media users often post fashion selfies with photo titles and hashtags. Such multi-modal user activities often reveal the personal fashion preferences of the users. To extract fashion information from user activities, we propose a fashion item embedding module. As shown in Fig. 3, the embedding module extracts fashion information from image-hashtag-title triplets by first extracting features from the three modalities and then performing multi-modality fusion.

Figure 3. Our embedding module takes a multi-modal user post (image, hashtags and title) as input and outputs a joint embedding. We extract visual features for individual garment parts to better capture fashion clues.

4.1.1. Image Feature Extraction

Image inputs can reveal the outfit combination and appearance preference of a user. To extract fashion information from images, we incorporate a body parsing model (Liang et al., 2018) to extract garment regions from the input image. The garment regions that we use include common garment semantics such as {dress, coat, pant, skirt} and exclude non-garment semantics (face, hair, background) or rare regions (socks, sunglasses, etc.). We then utilize a pre-trained image recognition model (Simonyan and Zisserman, 2014) to extract visual information from each garment region. Specifically, we compute the feature response of (Simonyan and Zisserman, 2014) at layers conv1 and conv2, and average the feature response within each garment region. We then concatenate the features of all the garment regions, resulting in a fixed-length image feature that represents the visual style of a fashion outfit. An illustrative example of such a process is shown on the top of Fig. 3.
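The region-wise averaging described above can be sketched in a few lines of plain Python. The sketch assumes that the feature map and the parsing label map have already been brought to the same spatial size, and that a garment region absent from the image contributes a zero vector; both are assumptions made for illustration, and the function name `region_pooled_feature` is hypothetical.

```python
def region_pooled_feature(feature_map, label_map, garment_labels):
    """Average features inside each garment region, then concatenate.

    feature_map: H x W x C nested lists (e.g. a convolutional response).
    label_map:   H x W nested lists of integer parsing labels.
    garment_labels: ordered list of labels to keep (fixes the output layout).
    """
    channels = len(feature_map[0][0])
    pooled = []
    for label in garment_labels:
        total = [0.0] * channels
        count = 0
        for feat_row, label_row in zip(feature_map, label_map):
            for feat, lab in zip(feat_row, label_row):
                if lab == label:
                    count += 1
                    for c in range(channels):
                        total[c] += feat[c]
        if count:
            total = [t / count for t in total]
        # Assumption: an absent region keeps its all-zero vector.
        pooled.extend(total)
    return pooled


features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
labels = [[1, 1],
          [2, 0]]  # 0 = background, 1 and 2 = two garment regions
print(region_pooled_feature(features, labels, [1, 2, 3]))
```

Because the list of garment labels is fixed, the output length is always `len(garment_labels) * C`, regardless of which garments appear in a given image.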

4.1.2. Hashtag Feature Extraction

A hashtag modality input is a set of word embedding vectors $\{w_i\}_{i=1}^{n}$ that are extracted from the hashtags of a fashion item. In the case that the hashtag modality is missing, a zero vector is used to represent an empty hashtag set. Since semantically different words can sometimes refer to the same fashion style, e.g., corset, leatherjacket and black can refer to the goth style, we employ an additional MLP $g$ to transform the general word embedding $w_i$ into a fashion-related embedding $e_i = g(w_i)$.

Similar to image feature extraction, we aim at generating a fixed-length vector to represent the hashtag features. Because the number $n$ of embedding features may vary, we present an attentive averaging operation that computes a weighted average of the embedding features. Specifically, an MLP $a$ is applied to generate an unnormalized weight $a(e_i)$ for each feature $e_i$, followed by a softmax operation and weighted averaging to produce an averaged feature:

$$
\bar{e} = \sum_{i=1}^{n} \frac{\exp\big(a(e_i)\big)}{\sum_{j=1}^{n} \exp\big(a(e_j)\big)} \, e_i
$$

The averaged feature $\bar{e}$ serves to represent the feature of the hashtag modality.
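Attentive averaging, as described above, amounts to a softmax-weighted mean over a variable-length set of embeddings. The following sketch uses a caller-supplied `score_fn` as a stand-in for the scoring MLP, and returns a zero vector for an empty set as a simplification of the zero-vector convention for missing modalities; both choices are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_average(embeddings, score_fn, dim):
    """Softmax-weighted average of a variable-length set of embeddings.

    score_fn is a hypothetical stand-in for the scoring MLP: it maps one
    embedding to one unnormalized scalar weight.
    """
    if not embeddings:
        return [0.0] * dim  # simplification for a missing modality
    weights = softmax([score_fn(e) for e in embeddings])
    return [sum(w * e[k] for w, e in zip(weights, embeddings))
            for k in range(dim)]


words = [[1.0, 0.0], [0.0, 1.0]]
print(attentive_average(words, lambda e: 0.0, 2))         # equal scores: plain mean
print(attentive_average(words, lambda e: 50.0 * e[0], 2))  # favors the first word
```

With constant scores the operation reduces to a plain mean, while a peaked scoring function lets a few informative words dominate the modality feature.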

4.1.3. Title Feature Extraction

A title modality input is a set of word embedding vectors that are extracted from the title words. In a similar fashion to the hashtag feature extraction, we utilize an MLP to extract fashion-aware features. Next, attentive averaging is used to extract features for title modality.

4.1.4. Cross-modality Gated Fusion

The multi-modality features extracted in the previous steps often suffer from incorrect body parsing, missing modalities or misspelling. To improve the quality of features, we propose a multi-modality cross-gating scheme that sequentially integrates information from alternative modalities for feature fusion.

Specifically, since title and hashtag features are less noisy than image features and they carry complementary semantic information, our scheme first performs a cross-gating operation between the hashtag and title modalities. Denoting the hashtag and title features by $f_H$ and $f_T$, the two features are updated simultaneously using the following updating rules:

$$
f_H' = f_H \odot \sigma\big(g_H(f_T)\big), \qquad f_T' = f_T \odot \sigma\big(g_T(f_H)\big)
$$

where $g_H$ and $g_T$ generate a filtering score in each feature dimension, respectively. The operation $\odot$ represents an element-wise product and $\sigma$ represents a sigmoid function.

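In a similar fashion, the cross-filtered features from the hashtag and title modalities, denoted $f_H'$ and $f_T'$, are used to filter the low-level image features. Specifically, the image modality feature $f_I$ is updated by:

$$
f_I' = f_I \odot \sigma\big(g_I([f_H'; f_T'])\big)
$$

where $[\cdot\,;\cdot]$ denotes concatenation, $\odot$ is an element-wise product, $\sigma$ is a sigmoid function, and $g_I$ generates the filtering score for every dimension of the image modality.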

Finally, a 2-layer MLP is employed on the concatenation of the gated image, hashtag and title features, denoted $f_I'$, $f_H'$ and $f_T'$, to generate the fused feature $f$ of a selfie post:

$$
f = \mathrm{MLP}\big([f_I'; f_H'; f_T']\big)
$$
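The gating order described in this subsection (hashtag and title gate each other first, and their filtered features then gate the image feature) can be sketched as follows. Each gating function is modeled here as a single hypothetical linear layer, and the final 2-layer MLP is left out; these are simplifying assumptions, not the authors' architecture.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate(feature, control, weight):
    """Scale each dimension of `feature` by a sigmoid score computed from `control`.

    `weight` is a hypothetical linear gating layer with one row per dimension
    of `feature` and one column per dimension of `control`.
    """
    scores = [sum(w * c for w, c in zip(row, control)) for row in weight]
    return [f * sigmoid(s) for f, s in zip(feature, scores)]


def cross_gated_fusion(f_img, f_tag, f_title, w_tag, w_title, w_img):
    # Hashtag and title gate each other; both updates read the ORIGINAL features.
    new_tag = gate(f_tag, f_title, w_tag)
    new_title = gate(f_title, f_tag, w_title)
    # The filtered text features then gate the image feature.
    new_img = gate(f_img, new_tag + new_title, w_img)
    # Concatenation of all modalities; the final MLP is omitted in this sketch.
    return new_img + new_tag + new_title


fused = cross_gated_fusion(
    f_img=[2.0, 4.0], f_tag=[2.0], f_title=[6.0],
    w_tag=[[0.0]], w_title=[[0.0]], w_img=[[0.0, 0.0], [0.0, 0.0]],
)
print(fused)  # zero gating weights give sigmoid(0) = 0.5, so every feature is halved
```

A strongly negative gating score drives a dimension toward zero, which is how an unreliable modality can be suppressed before fusion.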

Figure 4. The pipeline of item-to-set metric learning. Given a set of activities of a user, item-to-set metric learning aims at generating low distance scores for the positive item and high distance scores for the negative items. Our approach generates both intra-set and neighboring importance scores to dynamically select items from the set for distance computation. Furthermore, we propose a user-specific transformation for learning user-specific metric.

4.2. Item-to-set Metric Learning

We consider a generalized metric learning problem that learns a similarity measurement from a set of user selfie posts to a candidate outfit item in the recommendation pool. In the following, we first review the concept of metric learning, then propose our item-to-set metric learning framework.

4.2.1. Item-to-set Similarity Metric

Metric learning typically aims to learn a similarity measurement between two items. Typical metric learning approaches (Hadsell et al., 2006; Schroff et al., 2015; Jing and Tian, 2020) often regard two items $x_i$ and $x_j$ as feature points $f_i$ and $f_j$ in a normed vector space and use a point-wise distance $d(f_i, f_j)$ to measure the dissimilarity of the two items. In (Hadsell et al., 2006; Jing and Tian, 2020), the $\ell_2$ distance is used as the item-wise dissimilarity measurement.

Built upon the item-wise measurement $d(\cdot, \cdot)$, we propose an item-to-set similarity metric $D(S, f)$, which measures how dissimilar an item $f$ is to a set of items $S = \{f_1, \dots, f_K\}$. The item-to-set metric aims to predict how similar an outfit candidate is to a set of user selfies for personalized fashion recommendation. In the following, we will first discuss the weaknesses of two item-to-set metric definitions, then propose our improved item-to-set metric.

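We first consider an averaged item-to-set distance that computes the averaged distance between all items $f_i$ in a set $S$ of size $K$ and an item $f$, specifically:

$$
D_{\mathrm{avg}}(S, f) = \frac{1}{K} \sum_{i=1}^{K} d(f_i, f) \tag{4}
$$

We note that the averaged distance is equivalent to first averaging all features in the set and then computing the item-wise distance, assuming that a suitable distance is used for the item-wise distance $d$ (e.g., for the squared $\ell_2$ distance, the two differ only by a term that does not depend on $f$). The feature averaging operation is also proposed in (Snell et al., 2017) for few-shot learning.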
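The relation between averaging distances and averaging features can be made concrete for one common choice of distance. As an illustration only, assume the squared Euclidean distance (an assumption made for this derivation), write $f_1, \dots, f_K$ for the set features, $f$ for the query, and $\bar{f} = \frac{1}{K}\sum_i f_i$ for the set mean:

```latex
\begin{aligned}
\frac{1}{K}\sum_{i=1}^{K} \lVert f_i - f \rVert^2
  &= \frac{1}{K}\sum_{i=1}^{K} \lVert (f_i - \bar{f}) + (\bar{f} - f) \rVert^2 \\
  &= \frac{1}{K}\sum_{i=1}^{K} \lVert f_i - \bar{f} \rVert^2
     + \frac{2}{K}\Big(\sum_{i=1}^{K} (f_i - \bar{f})\Big)^{\!\top} (\bar{f} - f)
     + \lVert \bar{f} - f \rVert^2 \\
  &= \underbrace{\frac{1}{K}\sum_{i=1}^{K} \lVert f_i - \bar{f} \rVert^2}_{\text{independent of } f}
     + \lVert \bar{f} - f \rVert^2 ,
\end{aligned}
```

since $\sum_i (f_i - \bar{f}) = 0$. Under this assumption the averaged distance and the distance to the averaged feature differ only by the spread of the set, so they rank candidate items $f$ identically for a fixed set.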

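Alternatively, a nearest-neighbor item-to-set distance computes the nearest distance from all items $f_i$ in the set $S$ to the query item $f$:

$$
D_{\mathrm{NN}}(S, f) = \min_{i} d(f_i, f) = d(f_{i^*}, f) \tag{5}
$$

where $i^* = \arg\min_i d(f_i, f)$. It can be seen as a weighted averaged distance, where we assign weight $1$ to item $f_{i^*}$ and $0$ to the remaining items.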

Both of these item-to-set metrics have drawbacks. First, as the averaged distance performs feature averaging, it cannot properly capture the similarity when the set contains items in multiple fashion styles. Second, although the nearest-neighbor distance is more adaptive to the case of multiple fashion styles, the minimum operation is susceptible to outliers and noise. Moreover, it only updates the features of the closest item during training. As a result, training with the nearest-neighbor metric can hardly converge.

To design a metric that better captures the multiple interests of a user while facilitating robust training, we propose a generalized item-to-set distance. Specifically, given a set $S = \{f_1, \dots, f_K\}$ and a query $f$, we propose to first assign an importance weight $\alpha_i$ to each item $f_i \in S$ before feature averaging and distance computation. The importance weight is computed using an importance estimator $\alpha_i = w(f_i; f, S)$. Such an item-to-set distance is defined by:

$$
D(S, f) = d\Big(\sum_{i=1}^{K} \alpha_i f_i, \; f\Big) \tag{6}
$$

Our formulation is a generalized form of Eq. 4 and Eq. 5. To understand that, note that with $\alpha_i$ being a constant value $1/K$ or the indicator $\mathbb{1}[i = i^*]$ of the nearest item $i^* = \arg\min_i d(f_i, f)$, our formulation recovers Eq. 4 and Eq. 5, respectively. However, our importance weights are generated using a learnable function, which allows our metric to explore a better weight assignment strategy. Next, we will elaborate on the design of the importance estimator.
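The way one weighted formulation covers both special cases can be checked numerically. The sketch below assumes the squared Euclidean distance as the item-wise distance and applies the weights before feature averaging, as described above; the helper names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (assumed item-wise distance for this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_mean(items, weights):
    """Importance-weighted average of the set features."""
    dim = len(items[0])
    return [sum(w * f[k] for w, f in zip(weights, items)) for k in range(dim)]


def item_to_set_distance(items, query, weights):
    """Distance between the weighted set feature and the query item."""
    return sq_dist(weighted_mean(items, weights), query)


def uniform_weights(items, query):
    """Constant weights 1/K: the feature-averaging special case."""
    return [1.0 / len(items)] * len(items)


def nearest_weights(items, query):
    """Weight 1 on the closest item and 0 elsewhere: the nearest-neighbor case."""
    nearest = min(range(len(items)), key=lambda i: sq_dist(items[i], query))
    return [1.0 if i == nearest else 0.0 for i in range(len(items))]


items = [[0.0, 0.0], [4.0, 0.0]]  # a user with two distinct "styles"
query = [4.0, 1.0]                # candidate close to the second style
avg = item_to_set_distance(items, query, uniform_weights(items, query))
nn = item_to_set_distance(items, query, nearest_weights(items, query))
print(avg, nn)
```

In this toy case the uniform weights place the set feature between the two styles and report a large distance, while the one-hot weights match the candidate to the closer style. A learned importance estimator can move between these two extremes.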

4.2.2. Importance Estimation

First, we consider the neighboring importance weight of a set item $f_i$ with respect to a query $f$:

$$
u(f_i; f, S) = \frac{\exp\big(-\gamma \, d(f_i, f)\big)}{\sum_{f_j \in S} \exp\big(-\gamma \, d(f_j, f)\big)} \tag{7}
$$

where $\gamma$ is a non-negative and learnable parameter. The neighboring importance mimics a nearest-neighbor operation in the sense that it assigns more weight to items $f_i$ that are closer to $f$ to capture the multiple interests of a user. However, unlike nearest neighbor, which ignores all items other than the minimum-distance one, the above equation can update all item features during learning for more robust training. In addition, $\gamma$ is learned from data to balance the trade-off between utilizing all items or only the nearest item.
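One function with the behavior described here is a softmax over negatively scaled distances. The particular form below, the parameter name `gamma` and the squared Euclidean distance are assumptions made for illustration; in training the scale would be learned rather than set by hand.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance (assumed item-wise distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def neighboring_importance(items, query, gamma):
    """Softmax over -gamma * distance: closer set items get larger weights.

    gamma = 0 gives uniform weights (all items used equally); a large gamma
    approaches a one-hot weight on the nearest item.
    """
    scores = [-gamma * sq_dist(f, query) for f in items]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


items = [[0.0, 0.0], [4.0, 0.0]]
query = [4.0, 1.0]
print(neighboring_importance(items, query, gamma=0.0))
print(neighboring_importance(items, query, gamma=0.1))
print(neighboring_importance(items, query, gamma=10.0))
```

Because every weight stays strictly positive for a finite scale, gradients reach all set items, unlike a hard minimum.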

Due to incorrect parsing, missing modalities or typos, noise and outliers in the set are inevitable. To reduce the influence of noise and outliers when computing the distance, we further consider an intra-set importance weight of a set item $f_i \in S$:

$$
v(f_i; S) = \mathrm{MLP}\big([f_i; \mathrm{stat}(S)]\big) \tag{8}
$$

where $\mathrm{MLP}$ outputs a scalar from an input vector, and $\mathrm{stat}(S)$ is a vector that captures the statistics of the set along all feature dimensions (we employ mean, standard deviation, min and max functions to compute the statistics along all feature dimensions, then concatenate the extracted statistics into a vector). In this way, we compare each item with the set to eliminate the outliers from the set.
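The set statistics vector mentioned here (mean, standard deviation, min and max along every feature dimension, concatenated) can be computed as in the following sketch. The use of the population standard deviation and the ordering of the four blocks are assumptions; the function name `set_statistics` is hypothetical.

```python
import statistics


def set_statistics(items):
    """Concatenate per-dimension mean, std, min and max of a set of features.

    items: list of equal-length feature vectors.
    Returns a vector of length 4 * dim.
    """
    columns = list(zip(*items))  # one tuple of values per feature dimension
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # population std (assumed)
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return means + stds + mins + maxs


user_set = [[1.0, 10.0], [3.0, 30.0]]
print(set_statistics(user_set))
```

Concatenating this vector with an individual item feature gives a scoring function the context it needs to judge whether that item is typical of the set or an outlier.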

Our overall importance weights (Eq. 9) are generated using a linear combination of the neighboring importance of Eq. 7 and the intra-set importance of Eq. 8.


4.2.3. User-specific Metric Space

As different users may focus on different aspects of fashion items, the item-to-set metric itself should be user-specific. For instance, for the minimalist fashion style users, the item-to-set distance should be more sensitive to the amount of colors that are used. However, for users of the artsy style, the item-to-set distance should focus more on unusual prints and the complexity of accessories.

To extend our similarity metric in Eq. 6 to a user-specific metric, we perform a user-specific space transformation before the distance computation. In particular, given the set $S$, we compute a scaling vector $t(S)$ which indicates the scaling factor at each feature dimension:

$$
t(S) = \mathrm{softmax}\big(h(S)\big) \tag{10}
$$

where $h$ is a learnable function that maps the set $S$ to one score per feature dimension. One could also apply the sigmoid function instead of the softmax function. However, we found that the softmax function slightly boosts the recommendation accuracy because it ensures that all weights sum to $1$.

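Using the space transformation, we can extend the item-to-set metric of Eq. 6 to a set-specific metric. Specifically, with importance weights $\alpha_i$ for the set items $f_i \in S$ and the scaling vector $t(S)$, we define a user-specific item-to-set metric for a query $f$:

$$
D_{\mathrm{us}}(S, f) = d\Big(t(S) \odot \sum_{i=1}^{K} \alpha_i f_i, \; t(S) \odot f\Big) \tag{11}
$$

where $\odot$ represents vector element-wise multiplication. Eq. 11 filters out the feature dimensions that a user focuses less on before the distance computation. This procedure helps the recommendation system to be more user-specific.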
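The effect of the user-specific scaling can be seen in a small numeric sketch. It assumes the squared Euclidean distance and takes the per-dimension scores (`scale_logits`, a hypothetical name for the output of the learned set-dependent function) as given; the scaling vector is their softmax, applied to both the set feature and the query before the distance is computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def user_specific_distance(set_feature, query, scale_logits):
    """Squared distance after scaling both inputs by a softmax scaling vector.

    set_feature:  the (importance-weighted) feature of the user's set.
    scale_logits: hypothetical per-dimension scores produced from the set.
    """
    scale = softmax(scale_logits)
    return sum((t * (a - b)) ** 2 for t, a, b in zip(scale, set_feature, query))


# Equal scores: both dimensions contribute with weight 0.5.
even = user_specific_distance([2.0, 0.0], [0.0, 2.0], [0.0, 0.0])
# A user who only cares about dimension 0 ignores the mismatch in dimension 1.
focused = user_specific_distance([2.0, 0.0], [2.0, 5.0], [50.0, 0.0])
print(even, focused)
```

Dimensions with a small scaling factor barely affect the distance, which is the sense in which less relevant aspects are filtered out for a given user.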

4.3. Learning Objectives

Metric learning generally serves to reduce the distances of positive pairs and enlarge the distances of negative pairs. To adapt this principle for item-to-set metric learning, we sample the item set $S$ of size $K$ from a random user, then generate a positive item $f^+$ from the same user and $N$ negative items $\{f_j^-\}_{j=1}^{N}$ from the other users. The item-to-set metric learning is cast as an $(N+1)$-way classification problem, which aims to classify the positive sample from all negative samples. In particular, we minimize the negative log-likelihood as follows:

$$
\mathcal{L} = -\log \frac{\exp\big(-D(S, f^+)\big)}{\exp\big(-D(S, f^+)\big) + \sum_{j=1}^{N} \exp\big(-D(S, f_j^-)\big)} \tag{12}
$$

We employ the user-specific item-to-set distance (Eq. 11) as the item-to-set distance function $D$. In the testing stage, given a user selfie post set $S$, we recommend items that are close to the set under the learned item-to-set distance.
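The classification objective can be sketched as a softmax cross-entropy in which negated item-to-set distances act as logits and the positive item is the target class. Treating the negated distance directly as the logit, without a temperature, is an assumption of this sketch, and the function name is hypothetical.

```python
import math


def set_classification_loss(pos_dist, neg_dists):
    """Negative log-likelihood of picking the positive item among all candidates.

    pos_dist:  item-to-set distance of the positive item (same user).
    neg_dists: item-to-set distances of the negative items (other users).
    """
    logits = [-pos_dist] + [-d for d in neg_dists]
    peak = max(logits)
    # log-sum-exp computed in a numerically stable way
    log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_norm - logits[0]


print(set_classification_loss(0.0, [0.0, 0.0, 0.0]))  # chance level among 4 candidates
print(set_classification_loss(0.1, [50.0]))            # positive far closer: near zero
```

The loss falls as the positive item moves closer to the set than the negatives, and each additional negative adds one more competing class.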

5. Experiments

To evaluate the effectiveness and performance of our approach, we conduct experiments using the dataset collected in Sec. 3. To this end, we reserve the latest activities of all the 2,293 users as an outfit candidate pool for recommendation. This candidate pool is used to evaluate the recommendation performance. Afterwards, we randomly split the 2,293 users into a training set, which contains 1,834 users, and a test set, which contains the remaining 459 users. Such a data split ensures that the training set, the test set and the outfit pool are disjoint for fair evaluation. We perform model training on the training set and evaluate the fashion recommendation results on the test set.

5.1. Implementation

Our metric learning framework is implemented using PyTorch (Paszke et al., 2019). Although our item-to-set metric computes the distances between set items and multiple query items, our implementation utilizes standard matrix operations to support efficient batched training. We optimize the weights via SGD with momentum, starting from an initial learning rate that is decayed by a constant factor at a fixed epoch interval. The batch size and the number of negative items $N$ are kept fixed for most of our experiments; we also evaluate the influence of $N$ with more experiments. For evaluation, we randomly select a fixed number of user selfie posts as input from each user for recommendation. We repeat the sampling process multiple times and report the averaged recommendation performance. The random seed for sampling is fixed during evaluation. We also test other input sizes with more experiments.

5.2. Quantitative Evaluation

The quantitative performance for fashion recommendation is evaluated with top-k recall at k = 1, 10 and 25. In Table 3, we compare our method with two item-wise metric learning schemes, triplet+avg and triplet+NN, which learn a selfie post embedding and compute the item-to-set distance for recommendation with the average distance and the nearest distance, respectively. In addition, a recent few-shot learning method (Li et al., 2019) proposed the image-to-class measure, which measures the similarity from an image to a set of queries. We incorporate the image-to-class measure (with 3-NN) as the item-to-set metric and train a baseline, DN4, using the triplet loss for recommendation. Our model and the comparative methods are trained with the same sampling strategy and training settings for fair comparison. From the table, our approach based on the item-to-set metric achieves a 230 times improvement over random guess in terms of recall@1 and shows substantial advantages over the comparative methods. We also perform extensive comparative experiments to study the effectiveness of our design, which are elaborated as follows:

Methods                  Recall@1   Recall@10   Recall@25
random guess             0.0004     0.0043      0.0108
triplet+NN               0.0141     0.0589      0.1037
triplet+avg              0.0144     0.0650      0.1085
DN4 (Li et al., 2019)    0.0163     0.0741      0.1283
ours                     0.1005     0.2420      0.3336

Table 3. The fashion recommendation performance in comparison to different methods. Performances are measured in recall scores at 1, 10 and 25, respectively.
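For reference, a top-k recall of the kind reported here can be computed as in the sketch below. The exact protocol is an assumption: the sketch ranks the candidate pool by ascending item-to-set distance for one user and reports the fraction of that user's own held-out outfits that appear in the top k; averaging over users would then give the reported score. The function name and the toy numbers are hypothetical.

```python
def recall_at_k(distances, relevant, k):
    """Fraction of a user's relevant pool items ranked within the top k.

    distances: item-to-set distance of every candidate in the pool.
    relevant:  set of pool indices that belong to the user (assumed ground truth).
    """
    ranked = sorted(range(len(distances)), key=lambda i: distances[i])
    hits = sum(1 for i in ranked[:k] if i in relevant)
    return hits / len(relevant)


pool_distances = [0.3, 0.1, 0.9, 0.2]  # toy distances for a pool of 4 outfits
own_items = {0}                        # the user's held-out outfit is candidate 0
print(recall_at_k(pool_distances, own_items, 1))
print(recall_at_k(pool_distances, own_items, 3))
```

Under this definition, a random ranking of a pool with P candidates and one relevant item yields an expected recall@k of k / P, which is why the random-guess row grows roughly linearly with k.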

Multimodalities and fusion. To study the benefit of using multi-modal features, we train our fashion recommendation system with different combinations of modalities and fusion methods, and report their performances. We also evaluate the different implementation of the embedding module, i.e. hashtag and title modalities with or without using attentive feature averaging (denoted by w/ or w/o att). We compare our cross-modal gated fusion with fusion by concatenation (denoted by w/ or w/o cross). From Table 4, we observe that: i) hashtag is the most informative modality as it often contains fashion style descriptions of outfits, ii) attentive feature averaging can improve the representative power of hashtag and title modalities, iii) multi-modality fusion improves the recommendation performance, iv) cross-modal gated fusion can improve the recommendation performance.

Baselines                    Recall@1   Recall@10   Recall@25
Image (I)                    0.0236     0.1003      0.1666
Hashtag (H) w/o att          0.0601     0.1128      0.1485
Hashtag (H) w/ att           0.0738     0.1293      0.1630
Title (T) w/o att            0.0183     0.0693      0.0988
Title (T) w/ att             0.0268     0.0687      0.1012
T+H w/o att, w/o cross       0.0766     0.1572      0.2097
T+H w/ att, w/o cross        0.0841     0.1730      0.2178
T+H w/ att, w/ cross         0.0901     0.1736      0.2219
I+T+H w/o att, w/o cross     0.0857     0.2185      0.3040
I+T+H w/ att, w/o cross      0.0914     0.2289      0.3118
I+T+H w/ att, w/ cross       0.1055     0.2420      0.3336

Table 4. The impacts of modalities and fusion schemes on the performance of fashion recommendation. Performances are measured in recall at 1, 10 and 25, respectively.

Metric designs. In Table 5 and Fig. 5 (left), we evaluate different variants of our proposed item-to-set metric and show their convergence curves: i) NN, denoting the nearest-neighbor item-to-set distance defined by Eq. 5, ii) average, denoting the averaged item-to-set distance defined by Eq. 4, iii) ours w/ v, denoting a metric that uses importance weights but only applies intra-set importance as described in Eq. 8, iv) ours w/ u+v, denoting a metric that applies both intra-set importance and neighboring importance as described in Eq. 9, v) average w/ specified, denoting a metric that applies the averaged distance together with the user-specific formulation in Eq. 11, and vi) ours full, denoting our full metric. From Table 5 and Fig. 5 (left), we observe that: i) NN fails to converge, as it is susceptible to embedding noise, while the performance of average is the second lowest, ii) with our formulation, intra-set importance (ours w/ v) and neighboring importance (ours w/ u+v) improve over average, and iii) the user-specific formulation improves performance upon both average and ours w/ u+v. Notably, our full metric raises recall@1 from 0.0587 to 0.1005 in comparison with average, a relative improvement of about 71%.

| Methods | Recall@1 | Recall@10 | Recall@25 |
|---|---|---|---|
| NN | 0.0000 | 0.0044 | 0.0087 |
| average | 0.0587 | 0.1872 | 0.2731 |
| ours w/ v | 0.0631 | 0.1961 | 0.2759 |
| ours w/ u+v | 0.0794 | 0.2253 | 0.3150 |
| average w/ specified | 0.0805 | 0.2038 | 0.2885 |
| ours full | 0.1005 | 0.2420 | 0.3336 |

Table 5. The fashion recommendation performance for different item-to-set metric schemes. Performances are measured in recall scores at 1, 10 and 25, respectively.
Figure 5. The convergence curves of recommendation models with different item-to-set metric schemes. Recommendation performances are measured using recall scores at 25.
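The two simplest item-to-set distances compared above, average and NN, can be sketched as follows. Eq. 4 and Eq. 5 are not restated here, so the use of Euclidean distance between embeddings is an assumption, and `weighted_distance` is a generic importance-weighted form for illustration only, not Eq. 8 or Eq. 9.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two embeddings (assumed base metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_distance(item: Vector, user_set: List[Vector]) -> float:
    """Mean distance from the candidate item to every post in the set."""
    return sum(euclidean(item, s) for s in user_set) / len(user_set)


def nearest_distance(item: Vector, user_set: List[Vector]) -> float:
    """Distance from the candidate item to its closest post in the set."""
    return min(euclidean(item, s) for s in user_set)


def weighted_distance(item: Vector, user_set: List[Vector],
                      weights: Sequence[float]) -> float:
    """Generic importance-weighted distance with hypothetical, non-negative
    weights that are normalized to sum to one."""
    total = sum(weights)
    return sum(w * euclidean(item, s) for w, s in zip(weights, user_set)) / total


# A user with two fashion interests: two posts at the origin, one at (3, 4).
posts = [[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]]
candidate = [3.0, 4.0]  # matches the minority interest exactly

avg = average_distance(candidate, posts)
nn = nearest_distance(candidate, posts)
print(avg, nn)
```

Uniform weights recover the average distance and a one-hot weight on the closest post recovers the NN distance, so the two baselines can be read as the two extremes of an importance-weighted metric. In the toy example, the average distance penalizes a candidate that matches only the minority interest, while the NN distance depends on a single post and is therefore sensitive to a noisy embedding.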

Learning objectives. In Table 6 and Fig. 5 (right), we further evaluate training objectives other than our proposed objective. In particular, we test the contrastive loss (Hadsell et al., 2006) and the triplet loss (Schroff et al., 2015) in our item-to-set setting. We also vary the negative sample size in Eq. 12 to analyze its impact (the results are denoted by cls- followed by the negative sample size). We observe that for our task, the contrastive and triplet objectives are not as effective as the classification objective. In addition, increasing the negative sample size improves the recommendation performance.

| Methods | Recall@1 | Recall@10 | Recall@25 |
|---|---|---|---|
| random guess | 0.0004 | 0.0043 | 0.0108 |
| contrastive | 0.0231 | 0.0738 | 0.1193 |
| triplet | 0.0162 | 0.0831 | 0.1460 |
| cls-10 | 0.0756 | 0.2061 | 0.2937 |
| cls-50 | 0.1005 | 0.2420 | 0.3336 |
| cls-200 | 0.1210 | 0.2627 | 0.3545 |

Table 6. The fashion recommendation performance for different training objectives. Performances are measured in recall scores at 1, 10 and 25, respectively.
Figure 6. The user selfie embedding learned with the averaged item-to-set metric (left) and with our metric with importance weights (right). Our metric represents a user (e.g. colored in yellow or cyan) with multiple clusters, which effectively captures the multiple fashion interests of a user.
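The three families of training objectives compared in Table 6 can be contrasted with a short sketch operating on precomputed item-to-set distances. The contrastive and triplet forms follow the standard definitions of the cited works. The classification form, a softmax cross-entropy over negated distances with one positive and several negatives, is only an assumed stand-in, since Eq. 12 is not restated here. The margins and toy distances are hypothetical.

```python
import math
from typing import Sequence


def contrastive_loss(d: float, is_positive: bool, margin: float = 1.0) -> float:
    """Pull positive pairs together; push negatives beyond a margin."""
    if is_positive:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


def triplet_loss(d_pos: float, d_neg: float, margin: float = 0.2) -> float:
    """Require the positive to be closer than one negative by a margin."""
    return max(0.0, d_pos - d_neg + margin)


def classification_loss(d_pos: float, d_negs: Sequence[float]) -> float:
    """Cross-entropy of picking the positive among 1 + len(d_negs) candidates,
    using negated distances as logits (an assumed form)."""
    logits = [-d_pos] + [-d for d in d_negs]
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[0]


# The positive is closer (0.5) than every negative (1.0).
loss_triplet = triplet_loss(0.5, 1.0)
loss_cls_10 = classification_loss(0.5, [1.0] * 10)
loss_cls_50 = classification_loss(0.5, [1.0] * 50)
print(loss_triplet, loss_cls_10, loss_cls_50)
```

In this toy case the triplet loss is already zero because the single negative satisfies the margin, whereas the classification loss stays positive and grows with the number of negatives, so each update still receives a training signal from all sampled negatives. This may be one reason why a larger negative sample size helps, although the sketch does not establish that.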

Influence of set size. We also study the influence of the set size. Specifically, we vary the number of input activities per user and observe the recommendation performance. As shown in Table 7, a larger set size improves the recommendation performance, as expected.

Recall@1 0.0735 0.0897 0.1005 0.1144 0.1230
Recall@10 0.1786 0.2146 0.2420 0.2611 0.2798
Recall@25 0.2492 0.2919 0.3336 0.3541 0.3813
Table 7. The fashion recommendation performance with different numbers of input selfie posts. Performances are measured in recall scores at 1, 10 and 25, respectively.
Figure 7. Based on the posts of users (left columns), outfits are recommended and the top 3 are shown (right columns). A red box marks a recommendation that correctly predicts the outfit of the user. With our top-3 recommendations, 18.9% of the test users are given the correct outfit (87 out of 459). This shows that our model can make good personalized recommendations based on the fashion taste of users.

5.3. Qualitative Results

5.3.1. Recommendation Results

For qualitative evaluation, we depict the top-3 recommendation results of our model on the test set. The top-3 recall of our model is 18.9%, meaning that it gives a correct outfit recommendation to 18.9% of the test users (87 out of 459). As shown in Fig. 7 (right), our recommendation system is able to recommend style-coherent outfits to individual users (a red box indicates a correct prediction on the test set). For instance, one user prefers vintage and country styles and the combination of dresses and skirts with heels or boots. From posts #3, #7, #8 and #9, we learn that for the fall season, she prefers neutral colors such as black and brown with boots. She also prefers boater or beret hats as embellishments (posts #5 and #9). For this user, our first recommendation (k=1) correctly predicts her outfit, which consists of a vintage-style blue dress with a textured straw hat. Our model also successfully recommends her preferred outfit style for the fall season (k=2), which consists of a brown jacket with black boots and a beret hat. Our third recommendation is a U.S. country-style burgundy jumpsuit (k=3), which also matches the dressing tastes of this user.

Our recommendations to two other users further substantiate the personalized recommendation capacity of our model. For instance, one of them prefers street-style outfits such as jackets and jeans in neutral colors (white, gray and black), with white sport shoes or black boots. For this user, our model recommends black jackets with white sport shoes (k=1,2), which match her preferred items. It also recommends a simple street-style outfit with a gray t-shirt and a dark gray maxi skirt (k=3), which matches her dressing style. The other user prefers warm colors and floral patterns for her outfits. Our model recommends a white dress with contrast-colored floral patterns (k=1), which is her own outfit. It also recommends a terse dress with a combination of warm colors (k=2), which also matches the color preference of the user. Readers are referred to the supplementary material for more visualizations of our recommendation results.

5.3.2. Effects of Importance Weights

It is useful to understand the role that the importance weights play. In Fig. 6, we visualize the embedding of the average item-to-set baseline, as well as the embedding learned using our importance weights (the baseline ours w/ u+v in Table 5). It can be observed that average tends to represent user posts with loosely clustered points. In contrast, ours w/ u+v represents user posts with multiple tight clusters of points, which effectively captures the multiple fashion interests of a user for more accurate and flexible recommendation.

6. Conclusion

In this work, we study the problem of personalized fashion recommendation from personal social media data. We present an item-to-set metric learning framework that learns the similarity between user posts and fashion outfits. To account for the diversity of fashion interests of users, we propose neighboring importance. To reduce the influence of noise and outliers in a set, we propose intra-set importance. The combination of the two terms serves to dynamically assign weights for adaptive item-to-set similarity measurement. We further propose a user-specific space transformation that learns user-specific metrics for more personalized recommendation. To extract features from user activities and outfits, we propose a multi-modality feature extraction module with cross-modality fusion. We collect a real-world social media dataset to assess the performance of fashion recommendation. The effectiveness of our framework is shown through extensive experiments and analysis.


References

  • K. E. Ak, A. A. Kassim, J. Hwee Lim, and J. Yew Tham (2018) Learning attribute representations with localization for flexible fashion search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7708–7717. Cited by: §2.1.
  • Â. Cardoso, F. Daolio, and S. Vargas (2018) Product characterisation towards personalisation: learning attributes from unstructured data to recommend fashion products. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 80–89. Cited by: §1, §1.
  • H. Chen, A. Gallagher, and B. Girod (2012) Describing clothing by semantic attributes. In European conference on computer vision, pp. 609–623. Cited by: §2.1.
  • Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, and S. Yan (2015) Deep domain adaptation for describing people based on fine-grained clothing attributes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5315–5324. Cited by: §2.1.
  • J. Dong, Q. Chen, X. Shen, J. Yang, and S. Yan (2014) Towards unified human parsing and pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 843–850. Cited by: §2.1.
  • K. Gong, X. Liang, Y. Li, Y. Chen, M. Yang, and L. Lin (2018) Instance-level human parsing via part grouping network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 770–785. Cited by: §2.1.
  • M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg (2015) Where to buy it: matching street clothing photos in online shops. In Proceedings of the IEEE international conference on computer vision, pp. 3343–3351. Cited by: §1, §2.2.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §1, §4.2.1, §5.2.
  • X. Han, Z. Wu, Y. Jiang, and L. S. Davis (2017) Learning fashion compatibility with bidirectional lstms. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1078–1086. Cited by: §2.3.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722. Cited by: §1.
  • W. Hsiao and K. Grauman (2018) Creating capsule wardrobes from fashion images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7161–7170. Cited by: §2.3.
  • Y. Hu, X. Yi, and L. S. Davis (2015) Collaborative fashion recommendation: a functional tensor factorization approach. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 129–138. Cited by: §1, §2.4.
  • J. Huang, R. S. Feris, Q. Chen, and S. Yan (2015) Cross-domain image retrieval with a dual attribute-aware ranking network. In Proceedings of the IEEE international conference on computer vision, pp. 1062–1070. Cited by: §1, §2.2.
  • T. Iwata, S. Watanabe, and H. Sawada (2011) Fashion coordinates recommender system using photographs from fashion magazines. In Twenty-Second International Joint Conference on Artificial Intelligence. Cited by: §1, §2.4.
  • V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan (2014) Large scale visual recommendations from street fashion images. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1925–1934. Cited by: §1, §2.4.
  • L. Jing and Y. Tian (2020) Self-supervised visual feature learning with deep neural networks: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §4.2.1.
  • J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §3.2.1.
  • Z. Kuang, Y. Gao, G. Li, P. Luo, Y. Chen, L. Lin, and W. Zhang (2019) Fashion retrieval via graph reasoning networks on a similarity pyramid. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3066–3075. Cited by: §1, §2.2.
  • W. Li, L. Wang, J. Xu, J. Huo, Y. Gao, and J. Luo (2019) Revisiting local descriptor based image-to-class measure for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7260–7268. Cited by: §5.2, Table 3.
  • Y. Li, L. Cao, J. Zhu, and J. Luo (2017) Mining fashion outfit composition using an end-to-end deep learning approach on set data. IEEE Transactions on Multimedia 19 (8), pp. 1946–1955. Cited by: §1, §2.3.
  • X. Liang, K. Gong, X. Shen, and L. Lin (2018) Look into person: joint body parsing & pose estimation network and a new benchmark. IEEE transactions on pattern analysis and machine intelligence 41 (4), pp. 871–885. Cited by: §2.1, §4.1.1.
  • Q. Liu, S. Wu, and L. Wang (2017) DeepStyle: learning user preferences for visual recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 841–844. Cited by: §1, §2.4.
  • S. Liu, J. Feng, Z. Song, T. Zhang, H. Lu, C. Xu, and S. Yan (2012a) Hi, magic closet, tell me what to wear!. In Proceedings of the 20th ACM international conference on Multimedia, pp. 619–628. Cited by: §2.1.
  • S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, and S. Yan (2012b) Street-to-shop: cross-scenario clothing retrieval via parts alignment and auxiliary set. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3330–3337. Cited by: §2.1, §2.2.
  • Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016a) Deepfashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1096–1104. Cited by: §2.1.
  • Z. Liu, S. Yan, P. Luo, X. Wang, and X. Tang (2016b) Fashion landmark detection in the wild. In European Conference on Computer Vision, pp. 229–245. Cited by: §2.1.
  • Y. Ma, X. Yang, L. Liao, Y. Cao, and T. Chua (2019) Who, where, and what to wear? extracting fashion knowledge from social media. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 257–265. Cited by: §1, §2.3, §2.4.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: §5.1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §2.1.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §4.2.1, §5.2.
  • E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun (2015) Neuroaesthetics in fashion: modeling the perception of fashionability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 869–877. Cited by: §1, §2.4.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §4.1.1.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §4.2.1.
  • K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems. Cited by: §1.
  • X. Song, F. Feng, J. Liu, Z. Li, L. Nie, and J. Ma (2017) Neurostylist: neural compatibility modeling for clothing matching. In Proceedings of the 25th ACM international conference on Multimedia, pp. 753–761. Cited by: §2.3.
  • P. Tangseng and T. Okatani (2020) Toward explainable fashion recommendation. In The IEEE Winter Conference on Applications of Computer Vision, pp. 2153–2162. Cited by: §1, §1.
  • P. Tangseng, K. Yamaguchi, and T. Okatani (2017) Recommending outfits from personal closet. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 2275–2279. Cited by: §1, §1.
  • X. Wang, T. Zhang, D. R. Tretter, and Q. Lin (2013) Personal clothing retrieval on photo collections by color and attributes. IEEE transactions on multimedia 15 (8), pp. 2035–2045. Cited by: §2.2.
  • Z. Wang, Y. Gu, Y. Zhang, J. Zhou, and X. Gu (2017) Clothing retrieval with visual attention model. In 2017 IEEE Visual Communications and Image Processing (VCIP), pp. 1–4. Cited by: §1, §2.2.
  • K. Yamaguchi, M. Hadi Kiapour, and T. L. Berg (2013) Paper doll parsing: retrieving similar styles to parse clothing items. In Proceedings of the IEEE international conference on computer vision, pp. 3519–3526. Cited by: §2.2.
  • K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg (2012) Parsing clothing in fashion photographs. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3570–3577. Cited by: §2.1.
  • K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg (2014) Retrieving similar styles to parse clothing. IEEE transactions on pattern analysis and machine intelligence 37 (5), pp. 1028–1040. Cited by: §2.2.
  • S. Yan, Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2017) Unconstrained fashion landmark detection via hierarchical recurrent transformer networks. In Proceedings of the 25th ACM international conference on Multimedia, pp. 172–180. Cited by: §2.1.
  • X. Yang, X. He, X. Wang, Y. Ma, F. Feng, M. Wang, and T. Chua (2019) Interpretable fashion matching with rich attributes. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 775–784. Cited by: §1.
  • W. Yu, H. Zhang, X. He, X. Chen, L. Xiong, and Z. Qin (2018) Aesthetic-based clothing recommendation. In Proceedings of the 2018 World Wide Web Conference, pp. 649–658. Cited by: §1.
  • Q. Zhou, X. Liang, K. Gong, and L. Lin (2018) Adaptive temporal encoding network for video instance-level human parsing. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1527–1535. Cited by: §2.1.