Predicting which fashion items will go together to form an outfit is a high-value task in e-commerce. While there is an established literature on this topic, it remains hard to build accurate systems. The key is to tell whether two items are compatible, which is complicated because both complex appearance properties and perceptual issues are in play. Current methods rely on similarity measures obtained by learning embeddings. Until recently, such approaches ignored garment type (e.g. “dress” vs. “hat”), which is odd because one does not usually have many items of the same kind in an outfit Veit2015; He2016LearningCA. Recent work has shown that acknowledging type produces improvements in standard metrics DBLP:journals/corr/abs-1803-09196.
Remarkably, we are not aware of any recent work using discriminative methods to predict compatibility. This absence is odd because the problem is naturally discriminative: one wants to know whether a particular pair of items is compatible or not. Embedding methods attack the harder problem of learning a good metric of similarity. In this paper, we demonstrate that discriminative methods produce robust systems. Our methods exceed the state of the art on established metrics by 2.5% on both the compatibility prediction task and the fill-in-the-blank task on the Polyvore Outfits dataset.
The current standard, Polyvore Outfits, was created by online community users who are not necessarily experts in the fashion domain. To learn the compatibility rules defined by professionals, we introduce three new outfit datasets, crawled from three e-commerce sites and consisting of 360,176 outfits in total. We evaluate our discriminative method across datasets and demonstrate that it performs consistently well on expert-curated outfit data. We also introduce pairwise AUC as a new metric for measuring compatibility. Unlike the compatibility (per outfit) AUC used by prior works, pairwise AUC is not misled by the average outfit size of the dataset when comparing across datasets.
To compare the compatibility rules defined by multiple sources, we compute the type-pair AUCs for every pair of product types (e.g., outerwear-bottom) and visualize them using heat maps. The analysis reveals that the underlying ways items match in different outfit datasets are different, which suggests that the notion of fashion compatibility is subjective. The type-pair AUCs also indicate that the model predicts some type pairs more accurately than others. Exploiting this property, we show that compatibility can also be queried indirectly, using a third type as an anchor. We demonstrate that indirect queries improve the AUC results for infrequent pairs of types.
Our contributions are:
Three new datasets: We introduce three new datasets generated by professional stylists from three e-commerce sites, consisting of 360,176 outfits in total.
Improved accuracy: We show that a discriminative approach improves accuracy over the state of the art in learning compatibility relationships.
Cross-dataset comparisons: We show that the method performs consistently well on the standard dataset and our three datasets. However, the current evaluation uses compatibility (per outfit) AUC. We show that, for statistical reasons, compatibility AUC is higher on datasets with larger outfits. We describe pairwise AUC as an alternative metric for measuring compatibility performance and demonstrate that it is more stable when comparing across outfit datasets.
Indirect query strategy: The model performs poorly for pairs of types with relatively few compatible examples in the training data. We show that, for these pairs, an indirect strategy for querying compatibility (e.g., searching for “eyewear” that are compatible with a “scarf” through “tops” that are compatible with both) improves over a direct query.
2 Related Work
The established approach for learning representations of complex relationships involves constructing an embedding space, training either a siamese structure 1467314 or a triplet loss 7298682 with samples of positive and negative pairs. Embedding methods are also commonly used to capture hard-to-define relationships in the fashion domain, such as style, fashionability, and the matching between clothing items DBLP:journals/corr/McAuleyTSH15; 7780408; Hsiao2017LearningTL; 7298688; Veit2015; He2016LearningCA. DBLP:journals/corr/HanWJD17 trained a bidirectional LSTM model to predict the next compatible clothing item within an outfit, always regarding an outfit as a whole. Noticing the non-transitive nature of the compatibility relationship, DBLP:journals/corr/abs-1803-09196 recently demonstrated that enforcing type-awareness in the embeddings produces better performance than prior works.
Discriminative methods, in other words classification methods, have been commonly used to solve a variety of standard computer vision tasks, such as image classification, object detection, and image segmentation NIPS2012_4824; NIPS2015_5638; Ronneberger2015UNetCN. However, surprisingly, we are unable to find examples of their use in solving the compatibility problem.
Prior works mainly used curated outfits as supervision signals and as the ground truth for evaluation when learning fashion compatibility. Many recent works, including the state of the art, are trained and evaluated using outfit data mined from Polyvore DBLP:journals/corr/HanWJD17; DBLP:journals/corr/abs-1803-09196. Polyvore is a social network where fashion lovers curate outfits using a set of fashion product images. We emphasize that the quality of the outfits curated by Polyvore users is not guaranteed. Also, prior works only evaluate their methods using outfits from a single source DBLP:journals/corr/HanWJD17; DBLP:journals/corr/abs-1803-09196. Assessing one outfit dataset does not ensure a method’s performance on other outfit datasets.
In the fashion domain, other recent works have focused on product recognition and retrieval Liu2012StreettoshopCC; Yang2014ClothingCB; Hu2015CollaborativeFR; Kiapour2015WhereTB, fashion recommendation Vittayakorn2016AutomaticAD; Shen2007WhatAI; McAuley2015ImageBasedRO, fashion attribute detection Kiapour2014HipsterWD, and discovering fashion trends AlHalah2017FashionFF.
3 E-commerce datasets
To learn compatibility relationships that meet professional standards, one should use outfits created by experts. As a contribution, we introduce three new datasets consisting of outfits made by professional stylists on three fashion e-commerce sites. The E-commerce datasets comprise 636,407 fashion products from three different e-commerce sites: Farfetch (www.farfetch.com), Net-A-Porter (www.net-a-porter.com) and Moda Operandi (www.modaoperandi.com). For each product, the site provides a studio photo of a model wearing the item together with a set of other products, forming a complete outfit, and links to those other products. We mine the outfit data directly from the e-commerce websites using web-scraping techniques. As links may disappear when products go out of stock, we may not always obtain the complete outfit shown in the studio image.
We obtained a total of 360,176 unique outfits, each comprising two or more fashion products, with statistics shown in Table 1. Every fashion item comes with a product image on a white background, a name, a list of category keywords, and other metadata. We selected these three sites because they consistently supply labeled front-view product images on a white background, which matches the Polyvore Outfits dataset. To obtain the 11 type labels used by DBLP:journals/corr/abs-1803-09196, we train a classification model to predict the type from product metadata. The model uses the text CNN architecture introduced by DBLP:journals/corr/Kim14f. Training labels are obtained by manually creating a mapping between the Farfetch product category keywords and the 11 target types. The classification model achieves over 99% accuracy and is used to label the type of all products.
We split the outfit data into three datasets, each comprising outfits from only one e-commerce site. The split is necessary because outfits from different sites may not follow the same compatibility rules, which is later verified by the results. Each dataset is split into 60% for training, 20% for validation, and 20% for testing, based on the number of outfits. We do not prevent items from being shared between splits, because DBLP:journals/corr/abs-1803-09196 showed that preventing this has minimal effect on the results but significantly reduces the number of available outfits. From the test set, we create tests for the Fashion Compatibility Prediction task introduced by DBLP:journals/corr/HanWJD17. The original outfits are regarded as positive examples. We create a negative example to match every positive example by substituting each item in the outfit with a random item of the same type.
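The negative-outfit construction above can be sketched as follows; this is an illustrative reading, where items are represented as dicts with hypothetical `id` and `type` fields and the product pool is grouped by type:

```python
import random

def make_negative_outfit(outfit, items_by_type, rng=random):
    """Build a negative example by swapping every item in the outfit
    for a random different item of the same type from the product pool."""
    negative = []
    for item in outfit:
        candidates = [c for c in items_by_type[item["type"]] if c != item]
        negative.append(rng.choice(candidates))
    return negative

# Toy pool grouped by type; "id" and "type" fields are illustrative.
items_by_type = {
    "tops": [{"id": i, "type": "tops"} for i in range(3)],
    "shoes": [{"id": i, "type": "shoes"} for i in range(3, 6)],
}
outfit = [items_by_type["tops"][0], items_by_type["shoes"][0]]
neg = make_negative_outfit(outfit, items_by_type)
assert [n["type"] for n in neg] == ["tops", "shoes"]  # types are preserved
assert all(n != o for n, o in zip(neg, outfit))       # every item was swapped
```

By construction, a negative outfit has exactly the same type composition as its matching positive, so the compatibility task cannot be solved from type statistics alone.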
| No. of Outfits | No. of Items | Outfits of k items: k = 2 | k = 3 | k = 4 | k = 5 | k = 6 | k = 7 | k > 7 |
| 360,176 | 636,407 | 104,144 | 100,049 | 74,813 | 42,704 | 20,702 | 9,926 | 7,838 |
4 Discriminative methods for predicting compatibility
We must learn a function that takes a pair of fashion item images as input and produces a compatibility score in the range $[0, 1]$. This score will be tested against a threshold, to be chosen later. The function is computed from a (learned) feature embedding vector $\mathbf{u} = \phi(x)$ computed from each separate image. From the embeddings $\mathbf{u}$ and $\mathbf{v}$ of the two items, we compute a joint feature vector $F(\mathbf{u}, \mathbf{v})$, then apply a predictor to obtain the score $p(F(\mathbf{u}, \mathbf{v}))$.
Training data consists of pairs that are compatible or incompatible, and the function is learned using a binary cross-entropy loss. For the image encoder $\phi$, we use the identical architecture used by DBLP:journals/corr/abs-1803-09196, with ResNet18 as the backbone 7780459. The scoring function is constructed to be symmetric in its arguments. Write $u_i$ for the $i$'th component of the vector $\mathbf{u}$. We consider the following options:
dot, where $F^{\text{dot}}_i(\mathbf{u}, \mathbf{v}) = u_i v_i$;
diff, where $F^{\text{diff}}_i(\mathbf{u}, \mathbf{v}) = |u_i - v_i|$;
sum, where $F^{\text{sum}}_i(\mathbf{u}, \mathbf{v}) = u_i + v_i$.
Each appears to contribute strongly to the predictions. We investigate producing feature vectors by stacking the three options (using all three turns out to be best). As each formulation predicts compatibility, we argue that supplying all three of them to the probability predictor will yield better results. We write the concatenation of feature vectors as, for example, dot$\oplus$sum $= [F^{\text{dot}}; F^{\text{sum}}]$, and so on.
A natural generalization is to form an outer product, where $F^{\text{outer}}_{ij}(\mathbf{u}, \mathbf{v}) = u_i v_j + u_j v_i$.
While this exposes all possible pairwise quadratic and linear monomials, it does not outperform the other feature vectors (as the ablation study demonstrates) and is inconveniently large.
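A minimal NumPy sketch of one plausible reading of the three symmetric transformations and their concatenation (the exact form of diff and the toy 4-D embeddings are assumptions; real embeddings would come from the ResNet18 encoder):

```python
import numpy as np

def joint_features(u, v):
    """Symmetric pairwise feature transformations of two item embeddings."""
    dot = u * v              # element-wise product: symmetric in u, v
    diff = np.abs(u - v)     # absolute difference: symmetric in u, v
    s = u + v                # element-wise sum: symmetric in u, v
    return np.concatenate([dot, diff, s])  # the "dot + diff + sum" stack

# Illustrative 4-D embeddings (a real encoder would produce 64-D or 512-D).
u = np.array([1.0, 0.5, -0.2, 0.0])
v = np.array([0.5, 0.5, 0.2, 1.0])

f_uv = joint_features(u, v)
f_vu = joint_features(v, u)
assert np.allclose(f_uv, f_vu)  # symmetry: order of the pair does not matter
```

The concatenated feature is three times the embedding dimension; a small predictor head on top of it outputs the compatibility probability.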
4.1 Generating the training data
All pairs of items that occur in an outfit are assumed compatible, so we use all pairs in any training outfit as positive training pairs. As is usual for embedding methods, we assume that an arbitrary pair that does not appear in an outfit is incompatible. When randomly sampling negative items, we adopt the category-aware negative sampling method introduced by DBLP:journals/corr/abs-1803-09196, which requires the negative item to have the same category as the positive item.
We conduct an experiment to compare the performance of our methods against the state-of-the-art compatibility model on the established Polyvore Outfits dataset. The results demonstrate that our model outperforms the best prior work by 2.5% on AUC and on the fill-in-the-blank test. An ablation study is also conducted to compare different variants of our method.
Following prior work, we evaluate our method on the fashion compatibility task and the fill-in-the-blank (FITB) task. FITB measures a method’s accuracy in choosing the correct item from a list of four to fill a blank in an outfit. For the fashion compatibility task, we follow the usual practice of averaging the score over all pairs of items in an outfit, then computing the AUC for ground-truth outfits against random outfits (compatibility AUC). For both tasks, we use the standard test set created by DBLP:journals/corr/abs-1803-09196. In Section 5, we show that compatibility AUC tends to be higher for datasets where outfits are larger, which can mislead. Therefore, we also report pairwise AUC, defined as the AUC over all pairs (of any type) under the compatibility score.
To compare these methods, we use the full Polyvore Outfits dataset, as in DBLP:journals/corr/abs-1803-09196. We compare to the type-aware embedding compatibility model of that paper. This model is advantaged over our methods because it is trained using text information (precomputed HGLMM Fisher vectors from Klein2014FisherVD) as well as image information; our models use only image information. Each method is trained for 20 epochs using the Adam optimizer with a learning rate of 10e-5. We choose the epoch with the highest average of pairwise AUC and compatibility AUC on the validation set for testing.
Our discriminative method strongly outperforms the type-aware embedding (Table 2), despite not using text information. Table 3 compares different variants of our method on the Polyvore Outfits dataset. All conditions are trained using 64-dimensional item representations, since Table 2 already confirms that larger item representations yield better results. Using the dot transformation alone performs 2% better than diff or sum alone. Concatenating either diff or sum with dot increases performance by 1%. Concatenating all three relationships yields the best performance, with an additional improvement of 1%. The outer product yields minimal improvement over dot, and so can be discarded.
| Method | compatibility AUC | pairwise AUC | FITB |
| type-aware embedding (64-D) | .862 | .654 | 55.3% |
| type-aware embedding (512-D) | .875 | .695 | 58.0% |
| discriminative method (64-D) | .895 | .715 | 59.1% |
| discriminative method (512-D) | .903 | .720 | 60.4% |
| Transformation Function | compatibility AUC | pairwise AUC | FITB |
| dot $\oplus$ diff $\oplus$ sum | .895 | .715 | 59.1% |
5 Pairwise AUC and multiple datasets
We evaluate our best performing model (concatenating the dot, diff and sum features, retrained as appropriate) on the three e-commerce datasets to demonstrate the consistent strength of our approach. However, we carefully choose the metric to use, because different datasets have different average outfit sizes, as shown in Table 5.
5.1 Pairwise AUC vs. Compatibility AUC
Datasets with a large average outfit size will tend to have higher compatibility AUC, quite independent of the accuracy of the prediction of individual compatibility scores. This effect is easily seen with a simple model. Assume the compatibility predictor produces a score that is a normal random variable. Scale and translate as required so that this score is distributed as $\mathcal{N}(0, 1)$ for non-compatible edges; in this case, for compatible edges, the score will be distributed as $\mathcal{N}(\mu, \sigma^2)$, for $\mu > 0$. Write $A(\mu, \sigma)$ for the AUC computed from these distributions. Then the false positive rate at some threshold $t$ is $1 - \Phi(t)$ (for $\Phi$ the cumulative distribution of the unit normal) and the true positive rate is $1 - \Phi\left((t - \mu)/\sigma\right)$. The AUC is then $A(\mu, \sigma) = \int_{-\infty}^{\infty} \left[1 - \Phi\left((t - \mu)/\sigma\right)\right] \phi(t)\, dt$. This integral either does not appear in standard tables or we were unable to find it there; the shift seems to be the problem. But clearly if $\mu \to \infty$, $A(\mu, \sigma) \to 1$, so larger shifts give higher AUC. Similarly, if $\mu \to 0$ and $\sigma \to 1$, $A(\mu, \sigma) \to 1/2$.
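One can check numerically that this integral equals $\Phi(\mu/\sqrt{1+\sigma^2})$, the probability that a compatible score exceeds a non-compatible one, which is consistent with the limiting behaviour described in the text. A sketch of the check (the grid and tolerances are implementation choices):

```python
import math
import numpy as np

def Phi(x):
    """CDF of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def auc_numeric(mu, sigma):
    """A(mu, sigma): numerically integrate TPR d(FPR) over thresholds t."""
    t = np.linspace(-12.0, 12.0, 200001)
    phi = np.exp(-t**2 / 2.0) / math.sqrt(2.0 * math.pi)  # N(0, 1) pdf
    tpr = 1.0 - np.vectorize(Phi)((t - mu) / sigma)
    # dFPR = -phi(t) dt, so A = integral of TPR(t) * phi(t) dt
    return float(np.sum(tpr * phi) * (t[1] - t[0]))

# Closed form: probability a N(mu, sigma^2) draw exceeds a N(0, 1) draw.
closed_form = lambda mu, sigma: Phi(mu / math.sqrt(1.0 + sigma**2))

checks = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.5)]
assert all(abs(auc_numeric(m, s) - closed_form(m, s)) < 1e-6 for m, s in checks)
```

The closed form follows because the difference of two independent normals is itself normal, with variance $1 + \sigma^2$.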
But outfit compatibility is computed by averaging the prediction over all edges in the outfit. Considering a dataset where all outfits have $n$ pairs, the compatibility score for a random outfit will be distributed as $\mathcal{N}(0, 1/n)$, and the compatibility for a true outfit will be distributed as $\mathcal{N}(\mu, \sigma^2/n)$; after rescaling, the compatibility AUC is $A(\mu\sqrt{n}, \sigma)$. In turn, as $n$ grows the compatibility AUC must grow, even if the quality of prediction for each particular pair does not change. This means that compatibility AUC can mislead when comparing methods across datasets. Note that $\sqrt{n}$ is linear in the size of outfits (an outfit of $k$ items has $n = k(k-1)/2$ pairs), meaning growth could be quite fast. Pairwise AUC (the AUC for all pairs of items under the compatibility score) does not suffer from this effect, and so is better for comparing between datasets.
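A small simulation under this normal model illustrates the effect: the per-pair score quality is held fixed, yet the outfit-level (compatibility) AUC climbs as the number of pairs per outfit grows. The effect size, sample count, and pair counts below are arbitrary choices:

```python
import numpy as np

def empirical_auc(pos, neg):
    """Probability that a positive score exceeds a negative score."""
    return float(np.mean(pos[:, None] > neg[None, :]))

rng = np.random.default_rng(0)
mu, n_outfits = 1.0, 2000  # arbitrary effect size and sample count

aucs = []
for n_pairs in (1, 3, 10):
    # Per-pair scores: N(mu, 1) within true outfits, N(0, 1) within random
    # ones; outfit compatibility is the average over the outfit's pairs.
    true_c = rng.normal(mu, 1.0, (n_outfits, n_pairs)).mean(axis=1)
    rand_c = rng.normal(0.0, 1.0, (n_outfits, n_pairs)).mean(axis=1)
    aucs.append(empirical_auc(true_c, rand_c))

# Pair-level quality is fixed, yet outfit-level AUC rises with n_pairs.
assert aucs[0] < aucs[1] < aucs[2]
```

Pairwise AUC corresponds to the `n_pairs = 1` row and is therefore unaffected by outfit size.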
|Training Set||pairwise AUC||compatibility AUC|
|Polyvore testset||Moda testset||Farfetch testset||Nap testset|
This table shows the average number of items per outfit in each of the 4 test sets and the standard deviation (SD).
We use the full transformation function (dot $\oplus$ diff $\oplus$ sum) with 512-dimensional item feature vectors. Other parts of the experimental setup are identical to those described in Section 4.2. We train the model on each of the four available outfit datasets (3 e-commerce + Polyvore) and test every model on each of the four datasets. For each of the 16 conditions, we compute both the pairwise AUC and the compatibility AUC.
The results in Tables 4 and 5 show that compatibility AUC is consistently biased by the average outfit size of the test set, which supports the earlier claim. On compatibility AUC, we observe the unlikely event that the model trained on Polyvore outperforms the model trained on Net-A-Porter when tested on the Net-A-Porter test set. This event is an artifact of outfit size, and the pairwise AUC scores do not reflect it. Notice that the pairwise AUC scores for training and testing on the same dataset fall into a smaller range than the compatibility AUC scores. This also suggests that pairwise AUC is the more stable metric for cross-dataset comparison of the same method. Further discussion considers only the pairwise AUC results.
The results show that the discriminative method for learning compatibility performs reliably well on different outfit datasets. The scores show that the compatibility links used by professionals are qualitatively different from those used by Polyvore users. The compatibility relationships on each site are also very different: training on one does not give good results when predicting on the others. In Figure 2, we query for compatible bottoms to match a plain white t-shirt from the same pool of selectable products. The query results suggest that each dataset has different matching rules for a plain white t-shirt, with some degree of overlap (all match based on the color white). We also observe that training on Polyvore consistently gives the second-best result when testing on every e-commerce dataset. Both the quantitative and qualitative findings suggest that the compatibility links in the Polyvore dataset are more conservative and generalizable than the relationships in the e-commerce datasets.
6 Indirect querying
Some type pairs yield poor results (as shown in Figure 3), likely because there are few positive training examples for those pairs. Variance effects mean that we will see a poor AUC for those pairs using the direct strategy of computing compatibility as a function of the two items. We show that an indirect computation of compatibility can help.
Our indirect strategy works as follows. Assume we wish to compute the compatibility between an item $a$ of type $\tau_a$ and an item $b$ of type $\tau_b$. Rather than computing $p(a, b)$ directly, we obtain a set $S$ of items of a third type $\tau_c$, and compute
$\max_{c \in S} \min\left(p(a, c),\; p(b, c)\right)$
(i.e., find an item of type $\tau_c$ that is compatible with both, and attribute the compatibility from that). The performance depends on the selection of $\tau_c$. At training time, we search for a strategy as follows: for each of the 11 least populous pairs of types (i.e., 20% of the pairs), compute the AUC for that pair using the direct strategy and the indirect strategy for all possible third types. From the search results, we select the strategy with the best AUC for each type pair as its indirect strategy. This fixed strategy is used at test time for that pair of types.
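A sketch of the indirect query as a max–min over anchor items; this is one plausible form of the "compatible with both" attribution, and the cosine-based scorer `p` below is only a stand-in for the learned pairwise predictor:

```python
import numpy as np

def indirect_score(a, b, anchors, p):
    """Score pair (a, b) through a third type: find the anchor item most
    compatible with both, and attribute the smaller of its two scores."""
    return max(min(p(a, c), p(b, c)) for c in anchors)

# Stand-in pairwise scorer: cosine similarity squashed to [0, 1].
def p(u, v):
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 0.5 * (1.0 + cos)

rng = np.random.default_rng(1)
eyewear, scarf = rng.normal(size=8), rng.normal(size=8)  # rare type pair
tops = [rng.normal(size=8) for _ in range(5)]            # well-sampled anchor type
s = indirect_score(eyewear, scarf, tops, p)
assert 0.0 <= s <= 1.0
```

The anchor type is fixed per type pair by the training-time search; at test time only the fixed strategy is evaluated.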
Indirect strategy results: To evaluate whether there is any advantage in using an indirect strategy, we compare two cases (Table 6) using the models of Section 5.2 trained for that experiment. In the first, we use only the direct strategy on test data (direct). In the second, we use whichever strategy emerged from the search at training time (combined). We evaluate by comparing the average of the type-pair AUCs, confined to the 11 pairs that could have changed. Using the combined strategy yields a small but useful improvement on test. The search spaces are also sufficiently small that the improvement generalizes.
|Moda testset||Farfetch testset||Nap testset|
This paper introduced three new, professionally curated fashion outfit datasets for learning fashion compatibility. The work proposed using discriminative methods to learn pairwise compatibility relationships and demonstrated that they outperform the state of the art on an established dataset and perform consistently well across different datasets. We proposed pairwise AUC as a new metric for evaluating compatibility tasks across datasets. The work also analyzed the model’s performance on different pairs of types and demonstrated that an indirect query strategy improves the results for rarely occurring type pairs. Our cross-dataset comparison also indicates that fashion compatibility is a subjective notion: even experts do not have a unified standard. Therefore, we consider explainability and personalization two important future directions.