1 Introduction
Predicting which fashion items will go together to form an outfit is a high-value task in e-commerce. While there is an established literature on this topic, it remains hard to build accurate systems. The key is to tell whether two items are compatible, which is complicated because both complex appearance properties and perceptual issues are in play. Current methods rely on similarity measures obtained by learning embeddings. Until recently, such approaches have ignored garment type (e.g. “dress” vs. “hat”), which is odd because one does not usually have many items of the same kind in an outfit Veit2015; He2016LearningCA. Recent work has shown that acknowledging type produces improvements in standard metrics DBLP:journals/corr/abs180309196.
Remarkably, we are not aware of any recent work that uses discriminative methods to predict compatibility. This absence is odd because the problem is naturally discriminative: one wants to know whether a particular pair of items is compatible or not. Embedding methods try to solve the harder problem of learning a good similarity metric. In this paper, we demonstrate that discriminative methods produce robust systems. Our methods exceed the state of the art on established metrics by 2.5% on both the compatibility prediction task and the fill-in-the-blank task on the Polyvore Outfits dataset.
The current standard, Polyvore Outfits, was created by online community users who are not necessarily experts in the fashion domain. To learn the compatibility rules defined by professionals, we introduce three new outfit datasets crawled from three e-commerce sites, comprising 360,176 outfits in total. We evaluate our discriminative method across datasets and demonstrate that it performs consistently well on expert-curated outfit data. We also introduce pairwise AUC as a new metric for measuring compatibility. Unlike the compatibility (per outfit) AUC used by prior works, pairwise AUC is not misled by the average size of outfits in a dataset when comparing across datasets.
To compare the compatibility rules defined by multiple sources, we compute the type-pair AUCs for every pair of product types (e.g., outerwear-bottom) and visualize them using heat maps. The analysis reveals that items are matched in different ways in different outfit datasets, which suggests that the notion of fashion compatibility is subjective. The type-pair AUCs also indicate that the model predicts some type pairs more accurately than others. Exploring this property, we show that compatibility can also be queried indirectly, using a third type as the anchor. We demonstrate that indirect queries improve the AUC for infrequent pairs of types.
Our contributions are:

Three new datasets: We introduce three new datasets generated by professional stylists from three e-commerce sites, consisting of 360,176 outfits in total.

Improved accuracy: We show that a discriminative approach improves accuracy over the state of the art in learning compatibility relationships.

Cross-dataset comparisons: We show that the method performs consistently well on the standard dataset and our three datasets. However, the current evaluation uses compatibility (per outfit) AUC. We show that, for statistical reasons, compatibility AUC is higher on datasets with larger outfits. We describe pairwise AUC as an alternative metric for measuring compatibility performance and demonstrate that it is more stable when comparing across outfit datasets.

Indirect query strategy: The model performs poorly for pairs of types with relatively few compatible examples in the training data. We show that, for these pairs, an indirect strategy for querying compatibility (e.g., searching for “eyewear” that are compatible with a “scarf” through “tops” that are compatible with both) improves over a direct query.
2 Related Work
The established approach for learning representations of complex relationships involves constructing an embedding space by training a siamese structure 1467314 or by using a triplet loss 7298682 with samples of positive and negative pairs. Embedding methods are also commonly used to capture hard-to-define relationships in the fashion domain such as style, fashionability, and the matching between clothing items DBLP:journals/corr/McAuleyTSH15; 7780408; Hsiao2017LearningTL; 7298688; Veit2015; He2016LearningCA. DBLP:journals/corr/HanWJD17 trained a bidirectional LSTM model to predict the next compatible clothing item within an outfit, always regarding an outfit as a whole. Noting the non-transitive nature of the compatibility relationship, DBLP:journals/corr/abs180309196 recently demonstrated that enforcing type-awareness in the embeddings produces better performance than prior works.
Discriminative methods, in other words classification methods, have been commonly used to solve a variety of standard computer vision tasks, such as image classification, object detection, and image segmentation NIPS2012_4824; NIPS2015_5638; Ronneberger2015UNetCN. However, surprisingly, we are unable to find examples of their use in solving the compatibility problem.
Prior works mainly used curated outfits both as supervision signals for learning fashion compatibility and as the ground truth for evaluation. Many recent works, including the state of the art, are trained and evaluated using outfit data mined from Polyvore DBLP:journals/corr/HanWJD17; DBLP:journals/corr/abs180309196. Polyvore is a social network where fashion lovers curate outfits using a set of fashion product images. We emphasize that the quality of the outfits curated by Polyvore users is not guaranteed. Also, prior works only evaluate their methods using outfits from a single source DBLP:journals/corr/HanWJD17; DBLP:journals/corr/abs180309196. Assessing a method on one outfit dataset does not ensure its performance on other outfit datasets.
In the fashion domain, other recent works focus on product recognition and retrieval Liu2012StreettoshopCC; Yang2014ClothingCB; Hu2015CollaborativeFR; Kiapour2015WhereTB, fashion recommendation Vittayakorn2016AutomaticAD; Shen2007WhatAI; McAuley2015ImageBasedRO, fashion attribute detection Kiapour2014HipsterWD, and discovering fashion trends AlHalah2017FashionFF.
3 E-commerce datasets
To learn compatibility relationships that meet the professional standard, one should use outfits created by experts. As a contribution, we introduce three new datasets consisting of outfits made by professional stylists at three fashion e-commerce sites. The e-commerce datasets comprise 636,407 fashion products from three different e-commerce sites: Farfetch (www.farfetch.com), NetAPorter (www.netaporter.com) and Moda Operandi (www.modaoperandi.com). For each product, the site provides a studio photo of a model wearing the item together with a set of other products, forming a complete outfit. The websites contain links to these other products. We mine the outfit data directly from the e-commerce websites using web-scraping techniques. As the links may disappear when products go out of stock, we may not always obtain the complete outfit shown in the studio image.
We obtained a total of 360,176 unique outfits, each comprising two or more fashion products, with statistics shown in Table 1. Every fashion item comes with a product image on a white background, a name, a list of category keywords, and other metadata. We specifically selected the three sites because they consistently supply labeled front-view product images with a white background, which matches the Polyvore Outfits dataset. To obtain the 11 type labels used by DBLP:journals/corr/abs180309196, we train a classification model to predict the type from product metadata. The model uses the text CNN architecture introduced by DBLP:journals/corr/Kim14f. Training labels are obtained by manually creating a mapping between the Farfetch product category keywords and the 11 target types. The classification model achieves over 99% accuracy and is used to label the type of all products.
We split the outfit data into three datasets, each comprising only outfits from one e-commerce site. The split is necessary because outfits from different sites may not follow the same compatibility rules, which is later verified by the results. Each dataset is split into 60% for training, 20% for validation, and 20% for testing, based on the number of outfits. We do not prevent items from being shared between splits, because DBLP:journals/corr/abs180309196 showed that doing so has minimal effect on the results but significantly reduces the number of available outfits. From the test set, we create tests for the Fashion Compatibility Prediction task introduced by DBLP:journals/corr/HanWJD17. The original outfits are regarded as positive examples. We create a negative example to match every positive example by substituting each item in the outfit with a random item of the same type.
Table 1. Statistics of the e-commerce outfit data: totals with the number of outfits of k items (top), and counts per site (bottom).
No. of Outfits  No. of Items  k = 2  k = 3  k = 4  k = 5  k = 6  k = 7  k > 7
360,176  636,407  104,144  100,049  74,813  42,704  20,702  9,926  7,838

Site  No. of Outfits  No. of Items
Farfetch  234,591  501,610
NetAPorter  88,251  111,249
Modaoperandi  14,336  46,546
4 Discriminative methods for predicting compatibility
We must learn a function $c(x_1, x_2)$ that takes a pair of fashion item images $x_1, x_2$ as input and produces a compatibility score in the range $[0, 1]$. This score will be tested against a threshold, to be chosen later. The function is computed from a (learned) feature embedding vector $f(x)$ computed from each image separately. From the embeddings, we compute a joint feature vector $g(f(x_1), f(x_2))$, then apply a predictor $s$ to obtain $c(x_1, x_2) = s(g(f(x_1), f(x_2)))$.
Training data consists of pairs that are compatible or incompatible, and the function is learned using binary cross-entropy loss. For the image encoder $f$, we use the same architecture as DBLP:journals/corr/abs180309196, with ResNet18 as the backbone 7780459. The scoring function $s$ consists of a fully connected neural network with 2 hidden layers with ReLU nonlinearities and an output layer with sigmoid activation. We investigate various choices of $g$, constructed to be symmetric in their arguments. Let $u = f(x_1)$ and $v = f(x_2)$, and write $u_i$ for the $i$'th component of vector $u$. We consider the following options:
dot, where $g_i = u_i v_i$;

diff, where $g_i$ is a symmetric function of the difference $u_i - v_i$ (such as its absolute value);

sum, where $g_i = u_i + v_i$.
Each appears to contribute strongly to the predictions. We investigate producing feature vectors by stacking the three options (using all turns out to be best). As each of the formulations predicts compatibility, we argue that supplying all three of them to the predictor will yield better results. We name a concatenation of feature vectors after its parts: for example, dot-sum is the concatenation $[g^{\mathrm{dot}}; g^{\mathrm{sum}}]$, and so on.
A natural generalization is to form an outer product of the two embedding vectors. While this exposes all possible pairwise quadratic and linear monomials, it does not outperform the other feature vectors (as the ablation study in Section 4.2 shows) and is inconveniently large.
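The feature constructions and predictor described in this section can be sketched in a few lines. This is only an illustrative sketch, not the trained model: the embeddings are hand-written 4-dimensional vectors standing in for the image encoder output, the predictor weights are random, the hidden width of 8 is arbitrary, and the diff feature is assumed to be the element-wise absolute difference (one symmetric choice). All function names are hypothetical.

```python
import math
import random

def dot_feature(u, v):
    """Element-wise product of two embeddings."""
    return [a * b for a, b in zip(u, v)]

def diff_feature(u, v):
    """Element-wise absolute difference (an assumed symmetric form)."""
    return [abs(a - b) for a, b in zip(u, v)]

def sum_feature(u, v):
    """Element-wise sum of two embeddings."""
    return [a + b for a, b in zip(u, v)]

def joint_feature(u, v, parts=("dot", "diff", "sum")):
    """Concatenate the selected symmetric features into one vector."""
    table = {"dot": dot_feature, "diff": diff_feature, "sum": sum_feature}
    out = []
    for name in parts:
        out.extend(table[name](u, v))
    return out

def init_mlp(sizes, seed=0):
    """Random (weights, bias) per layer; sizes = [input, hidden, hidden, 1]."""
    rng = random.Random(seed)
    return [
        ([[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)],
         [0.0] * n_out)
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]

def predict(params, feature):
    """Two ReLU hidden layers, then a sigmoid output in (0, 1)."""
    h = feature
    for layer, (weights, bias) in enumerate(params):
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, bias)]
        if layer < len(params) - 1:
            h = [max(0.0, x) for x in h]      # ReLU on hidden layers only
    return 1.0 / (1.0 + math.exp(-h[0]))      # sigmoid on the single output

# Toy embeddings standing in for the encoder output of two item images.
u = [0.2, -1.0, 0.5, 0.3]
v = [1.0, 0.4, -0.5, 0.1]
g = joint_feature(u, v)                       # dot, diff and sum stacked
params = init_mlp([len(g), 8, 8, 1])
score = predict(params, g)
print(len(g), score)
```

Because every feature is symmetric in its two arguments, swapping the items leaves the joint vector, and hence the score, unchanged.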
4.1 Generating the training data
All pairs of items that occur in an outfit are assumed compatible, so we use all pairs in any training outfit as positive training pairs. As is usual for embedding methods, we assume that an arbitrary pair that does not appear in an outfit is incompatible. When randomly sampling negative items, we adopt the category-aware negative sampling method introduced by DBLP:journals/corr/abs180309196. The method requires the negative item to have the same category as the positive item it replaces.
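The pair generation described above can be sketched as follows. The outfits, item names and type labels are toy data, and the helper names (`positive_pairs`, `sample_negative`) are hypothetical. As a simplifying assumption, the sketch always replaces the second item of a (sorted) positive pair and rejects replacements that happen to form a known positive pair.

```python
import itertools
import random

def positive_pairs(outfits):
    """Every unordered pair of items that co-occurs in a training outfit."""
    pairs = set()
    for outfit in outfits:
        for a, b in itertools.combinations(sorted(outfit), 2):
            pairs.add((a, b))
    return pairs

def sample_negative(pair, item_type, items_by_type, positives, rng, max_tries=100):
    """Replace the second item with a random item of the same type.

    Returns None if no replacement forming an unseen pair is found.
    """
    anchor, positive = pair
    candidates = items_by_type[item_type[positive]]
    for _ in range(max_tries):
        negative = rng.choice(candidates)
        key = tuple(sorted((anchor, negative)))
        if negative != positive and negative != anchor and key not in positives:
            return (anchor, negative)
    return None

# Toy catalogue: two outfits plus one extra shoe that appears in no outfit.
outfits = [["top1", "bottom1", "shoe1"], ["top2", "bottom2"]]
item_type = {"top1": "top", "top2": "top",
             "bottom1": "bottom", "bottom2": "bottom",
             "shoe1": "shoe", "shoe2": "shoe"}
items_by_type = {}
for item, type_name in sorted(item_type.items()):
    items_by_type.setdefault(type_name, []).append(item)

rng = random.Random(0)
positives = positive_pairs(outfits)
training_pairs = []                      # list of ((item_a, item_b), label)
for pair in sorted(positives):
    training_pairs.append((pair, 1))
    negative = sample_negative(pair, item_type, items_by_type, positives, rng)
    if negative is not None:
        training_pairs.append((negative, 0))
print(training_pairs)
```

Keeping the negative item in the same category as the item it replaces means the classifier cannot separate positives from negatives using type alone.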
4.2 Experiments
We conduct an experiment to compare the performance of our methods against the state-of-the-art compatibility model on the established Polyvore Outfits dataset. The results demonstrate that our model outperforms the best prior work by 2.5% on AUC and on the fill-in-the-blank test. We also conduct an ablation study to compare different variants of our method.
Metrics
Following prior works, we evaluate our method on the fashion compatibility task and the fill-in-the-blank (FITB) task. FITB measures a method’s accuracy in choosing the correct item from a list of four to fill in a blank in an outfit. For the fashion compatibility task, we follow the usual practice of averaging the score over all pairs of items in an outfit, then computing the AUC for ground-truth outfits against random outfits (compatibility AUC). For both tasks, we use the standard test set created by DBLP:journals/corr/abs180309196. In Section 5, we show that compatibility AUC tends to be higher for datasets where outfits are larger, which could mislead. Therefore, we also report the pairwise AUC, defined as the AUC for all pairs (of any type) under the compatibility score.
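To make the difference between the metrics concrete, the sketch below computes compatibility AUC, pairwise AUC and an FITB choice from a generic pair scorer. The scorer is a toy stand-in (items sharing a first letter count as compatible), the function names are hypothetical, and the FITB rule used here (pick the candidate with the highest mean score against the remaining items) is an assumption about how the choice is scored.

```python
from itertools import combinations

def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def outfit_score(outfit, pair_score):
    """Mean pair score over all item pairs in the outfit."""
    pairs = list(combinations(outfit, 2))
    return sum(pair_score(a, b) for a, b in pairs) / len(pairs)

def compatibility_auc(true_outfits, random_outfits, pair_score):
    """AUC of per-outfit averaged scores: true outfits vs. random outfits."""
    return auc([outfit_score(o, pair_score) for o in true_outfits],
               [outfit_score(o, pair_score) for o in random_outfits])

def pairwise_auc(pos_pairs, neg_pairs, pair_score):
    """AUC of individual pair scores: compatible vs. incompatible pairs."""
    return auc([pair_score(a, b) for a, b in pos_pairs],
               [pair_score(a, b) for a, b in neg_pairs])

def fitb_choice(partial_outfit, candidates, pair_score):
    """Pick the candidate with the highest mean score against the outfit."""
    def mean_score(c):
        return sum(pair_score(c, x) for x in partial_outfit) / len(partial_outfit)
    return max(candidates, key=mean_score)

def toy_pair_score(a, b):
    """Toy scorer: items whose names share a first letter are compatible."""
    return 0.9 if a[0] == b[0] else 0.1

true_outfits = [["a1", "a2", "a3"], ["b1", "b2"]]
random_outfits = [["a1", "b2", "a3"], ["b1", "a2"]]
pos_pairs = [("a1", "a2"), ("b1", "b2")]
neg_pairs = [("a1", "b2"), ("a1", "a3")]   # second "negative" is a labelling error

c_auc = compatibility_auc(true_outfits, random_outfits, toy_pair_score)
p_auc = pairwise_auc(pos_pairs, neg_pairs, toy_pair_score)
choice = fitb_choice(["a1", "a2"], ["b1", "a3", "b2", "b3"], toy_pair_score)
print(c_auc, p_auc, choice)
```

In this toy case the averaged outfit scores still separate perfectly even though one pair is scored "wrongly", which is a small instance of the averaging effect discussed in Section 5.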
Experiment Setting
To compare these methods, we use the full Polyvore Outfits dataset, as in DBLP:journals/corr/abs180309196. We compare to the type-aware embedding compatibility model of that paper. This model has an advantage over our methods because it is trained using text information (precomputed HGLMM Fisher vectors from Klein2014FisherVD) as well as image information; our models use only image information. Each method is trained for 20 epochs using the Adam optimizer with a learning rate of 10e-5. For testing, we choose the epoch with the highest average of pairwise AUC and compatibility AUC on the validation set.
Results
Our discriminative method strongly outperforms the type-aware embedding (Table 2), despite not using text information. Table 3 compares different variants of our method on the Polyvore Outfits dataset. All variants are trained with 64-dimensional item representations; Table 2 confirms that larger item representations yield better results. Using dot alone performs 2% better than diff or sum alone. Concatenating either diff or sum with dot increases the performance by 1%. Concatenating all three yields the best performance, with an additional improvement of 1%. The outer product yields minimal improvement over dot, and so can be discarded.
Table 2. Comparison with the type-aware embedding on the Polyvore Outfits dataset.
Method  compatibility AUC  pairwise AUC  FITB
type-aware embedding (64D)  .862  .654  55.3%
type-aware embedding (512D)  .875  .695  58.0%
discriminative method (64D)  .895  .715  59.1%
discriminative method (512D)  .903  .720  60.4%
Table 3. Variants of the transformation function on the Polyvore Outfits dataset (64-dimensional item representations).
Transformation Function  compatibility AUC  pairwise AUC  FITB
diff  .853  .687  52.5%
sum  .858  .691  53.5%
dot  .878  .703  56.1%
outer product  .879  .704  56.2%
diff sum  .879  .704  56.8%
dot diff  .889  .711  57.6%
dot sum  .892  .712  58.6%
dot diff sum  .895  .715  59.1%
5 Pairwise AUC and multiple datasets
We evaluate our best-performing model (concatenating dot, diff and sum features, retrained as appropriate) on the three e-commerce datasets to demonstrate the consistent strength of our approach. However, we must choose the metric carefully, because different datasets have different average outfit sizes, as shown in Table 5.
5.1 Pairwise AUC vs. Compatibility AUC
Datasets with a large average outfit size will tend to have higher compatibility AUC, quite independently of the accuracy of the prediction of individual compatibility scores. This effect is easily seen with a simple model. Assume the compatibility predictor produces a score that is a normal random variable. Scale and translate as required so that this score is distributed as $N(0, 1)$ for non-compatible edges; in this case, for compatible edges, the score will be distributed as $N(\mu, 1)$, for $\mu \geq 0$. Write $A(\mu)$ for the AUC computed from these distributions. Then the false positive rate at some threshold $t$ is $1 - \Phi(t)$ (for $\Phi$ the cumulative distribution of the unit normal) and the true positive rate is $1 - \Phi(t - \mu)$. The AUC is then $A(\mu) = \int_{-\infty}^{\infty} \left(1 - \Phi(t - \mu)\right) \phi(t) \, dt$, where $\phi$ is the unit normal density. This integral does not appear in standard tables (or we were unable to find it there; the shift seems to be the problem). But clearly if $\mu = 0$, the true and false positive rates coincide, so $A(0) = 1/2$. Similarly, if $\mu \to \infty$, the true positive rate tends to 1 at every threshold, so $A(\mu) \to 1$.
But outfit compatibility is computed by averaging the prediction over all edges in the outfit. Consider a dataset where all outfits have $k$ pairs. After rescaling (and treating the pair scores as independent), the compatibility score for a random outfit will still be distributed as $N(0, 1)$. The compatibility score for a true outfit will be distributed as $N(\sqrt{k}\,\mu, 1)$. In turn, as $k$ grows the compatibility AUC must grow, even if the quality of prediction for each particular pair does not change. This means that compatibility AUC can mislead when comparing methods across datasets. Note that $\sqrt{k}$ is linear in the size of outfits, meaning growth could be quite fast. Pairwise AUC (the AUC for all pairs of items under the compatibility score) does not suffer from this effect, and so is better for comparing between datasets.
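The size effect can also be checked numerically. Under the same equal-variance normal model, with scores distributed as $N(0, 1)$ for non-compatible pairs and $N(\mu, 1)$ for compatible pairs, the AUC is the probability that a compatible score exceeds an independent non-compatible one. Their difference is distributed as $N(\mu, 2)$, which gives the standard binormal closed form $\Phi(\mu / \sqrt{2})$. The sketch below uses that closed form. It assumes, beyond the text, that the pair scores within an outfit are independent and that an outfit of $n$ items has $k = n(n-1)/2$ pairs; the function names and the value $\mu = 0.5$ are illustrative only.

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of the unit normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pair_auc(mu):
    """AUC for N(0, 1) negatives against N(mu, 1) positives."""
    return normal_cdf(mu / math.sqrt(2.0))

def outfit_auc(mu, n_items):
    """AUC after averaging over the k = n(n-1)/2 pairs of an outfit.

    Averaging k independent unit-variance scores shrinks the standard
    deviation by sqrt(k); after rescaling, the shift becomes sqrt(k) * mu.
    """
    k = n_items * (n_items - 1) // 2
    return pair_auc(math.sqrt(k) * mu)

mu = 0.5  # fixed per-pair separation; only the outfit size changes below
for n_items in (2, 3, 4, 5, 6):
    print(n_items, round(outfit_auc(mu, n_items), 3))
```

The per-pair AUC is the same in every row, yet the outfit-level AUC increases with the number of items, which is the effect that motivates reporting pairwise AUC.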
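Table 4. Pairwise AUC and compatibility AUC for every training set (rows) and test set (columns). The last row of each block gives the variance of the column.
pairwise AUC
Training Set  Polyvore test  Moda test  Farfetch test  Nap test
Polyvore  .718  .685  .594  .636
Modaoperandi  .589  .713  .566  .601
Farfetch  .538  .575  .707  .620
NetAPorter  .577  .617  .544  .668
Variance  .0061  .0040  .0052  .0008

compatibility AUC
Training Set  Polyvore test  Moda test  Farfetch test  Nap test
Polyvore  .902  .724  .646  .749
Modaoperandi  .660  .732  .592  .636
Farfetch  .560  .585  .734  .661
NetAPorter  .616  .632  .559  .730
Variance  .0227  .0051  .0058  .0029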

Table 5. Average number of items per outfit in each of the four test sets, with the standard deviation (SD).
Test set  Mean  SD
Polyvore  5.35  1.62
Moda  2.51  .74
Farfetch  3.36  1.15
Nap  4.77  1.65
5.2 Experiment
We use the full transformation function (dot diff sum) with 512-dimensional item feature vectors. Other parts of the experimental setup are identical to those described in Section 4.2. We train a separate model on each of the four available outfit datasets (3 e-commerce + Polyvore) and test every model on each of the four datasets. For each of the 16 conditions, we compute both the pairwise AUC and the compatibility AUC.
The results in Tables 4 and 5 show that compatibility AUC is consistently biased by the average outfit size of the test set, which supports the earlier claim. On compatibility AUC, we observe the unlikely event that the model trained on Polyvore outperforms the model trained on NetAPorter when tested on the NetAPorter test set. This event is an artifact of the outfit size, and the pairwise AUC scores show no such effect. Notice that the pairwise AUC scores for training and testing on the same dataset fall into a smaller range than the compatibility AUC scores. This observation also suggests that pairwise AUC is a more stable metric for cross-dataset comparison of the same method. Further discussion only considers the pairwise AUC results.
The results show that the discriminative method for learning compatibility performs reliably well on different outfit datasets. The scores show that the compatibility links used by professionals are qualitatively different from those used by Polyvore users. Compatibility relationships also differ strongly between sites: training on one site does not give good results when predicting on the others. In Figure 2, we query for compatible bottoms to match a plain white t-shirt from the same pool of selectable products. The query results suggest that each dataset has different matching rules for a plain white t-shirt, with some degree of overlap (all match based on the color white). We also observe that training on Polyvore consistently gives the second-best result when testing on every e-commerce dataset. Both the quantitative and qualitative findings suggest that the compatibility links in the Polyvore dataset are more conservative and generalizable than the relationships in the e-commerce datasets.
6 Indirect querying
Some type pairs yield poor results (as shown in Figure 3), likely because there are few positive training examples for those pairs. Variance effects mean that we will see a poor AUC for those pairs using the direct strategy of computing compatibility as a function of the two items. We show that an indirect computation of compatibility can help.
Our indirect strategy works as follows. Assume we wish to compute the compatibility between an item $x$ of type $a$ and an item $y$ of type $b$. Rather than computing the compatibility of $x$ and $y$ directly, we obtain a set of items of a third type $t$, and compute the compatibility of $x$ and $y$ through them (i.e., find an item of type $t$ that is compatible with both, and attribute the compatibility from that). The performance depends on the selection of $t$. At training time, we search for a strategy as follows: for each of the 11 least populous pairs of types (i.e., 20% of the pairs), compute the AUC for that pair using the direct strategy and the indirect strategy for all possible third types. From the search result, we select the strategy with the best AUC for each type pair. This fixed strategy is used at test time for that pair of types.
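A minimal sketch of the indirect query and the strategy search is given below. The aggregation rule is an assumption, since the exact formula is not reproduced in this text: the sketch scores a pair through an anchor as the weaker of its two links and keeps the best anchor (a max over anchors of the min of the two scores). The score table, item names and function names are all hypothetical toy values.

```python
def auc(pos_scores, neg_scores):
    """Probability that a positive outscores a negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))

def make_pair_score(table, default=0.5):
    """Symmetric lookup scorer standing in for the learned model."""
    def pair_score(a, b):
        return table.get(frozenset((a, b)), default)
    return pair_score

def indirect_score(x, y, anchors, pair_score):
    """Assumed rule: best anchor z, scored by the weaker of its two links."""
    return max(min(pair_score(x, z), pair_score(z, y)) for z in anchors)

def choose_strategy(pos_pairs, neg_pairs, anchors_by_type, pair_score):
    """Compare the direct query with an indirect query per third type."""
    def evaluate(score_fn):
        return auc([score_fn(a, b) for a, b in pos_pairs],
                   [score_fn(a, b) for a, b in neg_pairs])
    results = {"direct": evaluate(pair_score)}
    for type_name, anchors in anchors_by_type.items():
        results[type_name] = evaluate(
            lambda a, b, anchors=anchors: indirect_score(a, b, anchors, pair_score))
    best = max(results, key=results.get)   # ties keep the earlier entry (direct)
    return best, results

# Toy scores: the eyewear-scarf links are uninformative (default 0.5),
# while links through tops separate the compatible pair from the other.
pair_score = make_pair_score({
    frozenset(("e1", "t1")): 0.9, frozenset(("t1", "s1")): 0.8,
    frozenset(("e2", "t1")): 0.2, frozenset(("e1", "t2")): 0.3,
    frozenset(("t2", "s1")): 0.4, frozenset(("e2", "t2")): 0.3,
})
best, results = choose_strategy(
    pos_pairs=[("e1", "s1")], neg_pairs=[("e2", "s1")],
    anchors_by_type={"tops": ["t1", "t2"]}, pair_score=pair_score)
print(best, results)
```

In practice the search would be run on held-out training or validation pairs for each infrequent type pair, and the selected strategy then frozen for test time, as the paragraph above describes.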
Indirect strategy results: To evaluate whether there is any advantage in using an indirect strategy, we compare two cases (Table 6) using the models trained for the experiment of Section 5.2. In the first, we use only the direct strategy on test data (direct). In the second, we use whichever strategy emerged from the search at training (combined). We evaluate by comparing the average of the type-pair AUCs, confined to the 11 that could have changed. Using the combined strategy yields a small but useful improvement on test data. The search spaces are also sufficiently small that the improvement generalizes.
Table 6. Average type-pair AUC over the 11 least populous type pairs, using the direct and the combined strategy.
Test set  direct  combined  gain
Moda  .615  .676  .061
Farfetch  .584  .615  .029
Nap  .573  .584  .011
7 Conclusion
This paper introduced three new, professionally curated fashion outfit datasets for learning fashion compatibility. The work proposed using a discriminative method to learn pairwise compatibility relationships and demonstrated that it outperforms the state of the art on an established dataset and performs consistently well across different datasets. We proposed pairwise AUC as a new metric for evaluating compatibility tasks across datasets. The work also analyzed the model’s performance on different pairs of types and demonstrated that an indirect query strategy improves the results for rarely occurring type pairs. Our cross-dataset comparison also indicates that fashion compatibility is a subjective notion: even experts do not have a unified standard. Therefore, we consider explainability and personalization two important future directions.