Fashion compatibility learning, whose goal is to automatically compose compatible outfits for recommendation, is important to a variety of academic and industrial tasks such as outfit composition [feng2018interpretable], wardrobe creation [hsiao2018creating], item recommendation [he2016learning], and fashion generation [bettaney2019fashion]. It has recently attracted increasing attention [han2017learning, Shih2017, li2017mining, Tangseng2018, Nakamura2018, Vasileva2018, hsiao2017learning, simo2016fashion, he2016learning, veit2015learning, song2017neurostylist].
In general, fashion compatibility learning methods fall into two classes. The first formulates the problem as a pair-wise learning task [McATarShiHen15] [Veit2017] [Shih2017] [Vasileva2018] and develops measures (e.g., metric learning) for pair-wise compatibility. The second is outfit-wise compatibility learning, which models the process of forming an outfit as sequence learning (e.g., with an LSTM) [han2017learning] [li2017mining] [Tangseng2018]. Most existing works treat fashion compatibility purely as a matter of visual appearance. As a result, although significant progress has been made, we are still not able to answer the question shown in Figure 1: “is Outfit A compatible for a business occasion?”.
Fashion compatibility also depends on the theme. For example, as shown in Figure 1, Outfit A may be compatible in terms of visual appearance and can be worn on a date. But a wearer who wants it for business may prefer to adjust it to Outfit B (long shirt instead of miniskirt for business). Therefore, theme-aware fashion compatibility is very important for fashion recommendation.
Most existing fashion datasets, such as the Polyvore dataset [han2017learning] and DeepFashion2 [DeepFashion2], however, do not support estimating theme-aware fashion compatibility. Hence, we built a new real-world fashion dataset called Fashion32, which is the first one with rich annotations including outfit themes and fine-grained fashion categories. Since the annotations were labeled by fashion stylists from brand vendors, they are generally of high quality. Fashion32 contains 32 theme tags for more than 13K outfits, and 152 fine-grained categories for more than 40K outfit items. To learn theme-aware fashion compatibility models from this dataset, we face two challenges: how to measure the pair-wise compatibility of outfit items, and how to associate a theme with pair-wise compatibility in order to compute outfit-wise compatibility.
To address the above challenges, we propose a theme-attention model, which is built on a category-specific embedding space. Figure 2 illustrates the overview of our framework. Given an outfit and a specific theme, pairs of items are projected into the category-specific subspace (Figure 2 (a)). Unlike traditional embedding, which maps all fashion items into a common space, we employ a triplet network and embedding masks (Figure 2 (b)) to project items into category-specific subspaces. This “task-oriented” embedding makes the subspaces more discriminative for compatibility computation. We further build a theme-attention model to associate the themes with pairwise compatibility (Figure 2 (c)). As a result, a theme-specific attention matrix is learned to link the theme to the pairwise compatibility of outfit items, and then to aggregate the pairwise values into an estimate of the outfit-wise compatibility.
To the best of our knowledge, our work is the first one to explicitly estimate fashion compatibility given a specific theme. [han2017learning] may be able to answer a question like “what to dress for a biz meeting” thanks to their visual-semantic embedding, but this capability relies on the quality of the image captions. Also, their Bi-LSTM framework is less flexible because it requires a specific item order and number, whereas our theme-aware model has no such constraints. The category-specific embedding is inspired by [Veit2017][Vasileva2018]. But unlike [Veit2017] and [Vasileva2018], which simply group fashion items into coarse categories (e.g., top, bottom, shoe, etc.), we employ fine-grained categories because their properties usually connect strongly to fashion themes. For instance, T-shirts imply a casual look, shirts are more formal, and polo shirts are in between. All these properties essentially imply some fashion themes. Coarse categories do not carry this advantage.
To summarize, our work has the following contributions:
We introduce the first theme-aware fashion dataset, which makes it possible to compute fashion compatibility given a specific theme.
We propose a theme-attention model to associate themes with pairwise compatibility to compute outfit-wise compatibility. To the best of our knowledge, this is the first attempt to study the theme-matters compatibility learning problem.
We leverage fine-grained categories and the category-specific embedding to effectively support our theme-attention model.
We demonstrate that our proposed approach can outperform state-of-the-art approaches on the Fashion32 dataset and the improved Polyvore dataset [han2017learning].
Fashion Datasets. In general, the existing datasets built for fashion compatibility learning can be grouped into two categories: online shopping datasets [yamaguchi2015mix, he2016ups, hsiao2018creating] and social media datasets [han2017learning, Vasileva2018, Tangseng2018]. The former are mined from online e-commerce platforms by leveraging buyers’ shopping carts to form outfits and labels. A shopping cart, however, usually contains mixed items, which may not form a compatible outfit. Therefore, the labels in online shopping datasets can be very noisy. The social media datasets, such as Maryland Polyvore [han2017learning] and Polyvore Outfits [Vasileva2018], are collected from social media platforms. Since the outfits are created by fashion enthusiasts, these datasets have less outfit noise. But the quality of the outfits varies widely, because outfits created on social media largely depend on what users upload, and the labels are also arbitrary and biased.
Unlike previous datasets, our Fashion32 not only carries fashion themes, fine-grained categories, and recommendation descriptions but also has high-quality outfit model pictures and abundant annotations. More importantly, our outfit composition and its annotations come from fashion designers of brand vendors. Consequently, our dataset is more realistic and convincing.
Two other popular fashion datasets, DeepFashion [liu2016deepfashion] and DeepFashion2 [DeepFashion2], are specifically built for fashion research including attribute prediction, image retrieval, and fashion synthesis, but not for fashion compatibility.
Fashion Compatibility Learning. In general, telling whether an outfit is compatible or not is a subjective task. One needs to check all possible compatibility relationships between items, which can involve very subtle differences. One current solution is to leverage metric learning [McATarShiHen15, Vasileva2018, chen2018dress] or embedding techniques [van2012stochastic, veit2015learning, wilber2015learning] to project the fashion items into a specific space, in which the outfit compatibility is explicitly measured by the distances between pairs of items [Veit2017]. This measurement is built on pairwise compatibility rather than outfit-wise compatibility (namely, computing the compatibility of an outfit as a whole). Han et al. [han2017learning] employ a Bi-LSTM together with a visual-semantic embedding to estimate outfit compatibility in an end-to-end model. Meanwhile, Vasileva et al. [Vasileva2018] propose a category-aware embedding approach that includes garment/item types or coarse-grained categories (e.g., top, bottom, etc.) during learning. By taking the garment type into consideration, the embedding space consists of a set of type-specific sub-spaces, which further improves the fashion compatibility estimation.
Table 1. Columns: Theme Type | Theme Tag (outfit counts)
The Fashion32 Dataset
As mentioned above, most current fashion datasets cannot support theme-aware fashion compatibility learning because they provide neither fashion themes nor fine-grained item category labels. Accordingly, we collected a new fashion dataset called Fashion32. It contains about 13K outfits, each of which is labeled with one or more themes from a set of 32 themes. Also, each fashion item of an outfit is tagged with one of 152 fine-grained categories. Figure 3 shows some outfit examples from our dataset. Every single outfit has rich meta information and various labels, as well as real model pictures. To the best of our knowledge, this is the first real-world dataset carrying both theme and fine-grained category annotations for each outfit and fashion item. The annotations were labeled by fashion stylists from the brand vendors. It can be publicly accessed through the following link: ToBeReleased.
Dataset collection. The Fashion32 dataset is crawled from the fashion channels of the e-commerce platform JD.com, one of the largest e-commerce platforms for fashion shopping. We collected 32 fashion themes as listed in Table 1. These fashion themes were proposed by fashion designers and utilized by the platform to index its products. Each collected outfit in Fashion32 is designed by fashion designers and uploaded to the platform by the brand vendors. The brand name of designers is recorded in each outfit for reference.
We collected 13,914 outfits, as well as an additional 40,667 images of fashion items and 51,415 model pictures. One outfit usually contains 2 or 3 fashion items and more than 4 pictures of a model wearing these fashion items, as shown in Figure 3. The model pictures are important in fashion compatibility because they demonstrate how to select fashion items to form an outfit. Each fashion item was assigned one label for its coarse-grained category (6 categories: inner top, outer top, bottom, shoe, bag, and accessory) and one label for its fine-grained category (152 categories). This assignment was done when the vendors uploaded their outfits to the platform.
In addition, all fashion items carry further information including product name, Stock Keeping Unit (SKU) ID, tags of design/style, tags of texture/fabric, tags of color, and a paragraph of product description. To our knowledge, this dataset has the most detailed fashion labels, which can be used not only in fashion compatibility but also in fashion image analysis.
Theme tags and descriptions. The fashion theme of an outfit carries rich context information about it. This high-level fashion knowledge can reflect an outfit’s style, occasion, or culture. Hence, in this dataset, we mainly have the following four groups of fashion themes: occasion, style, fit, and gender. There are 32 themes in total as shown in Table 1, which also lists the number of outfits collected for each theme. Each outfit is labeled with at least one theme. For each outfit, we also collected a paragraph of description, which explains the reason for its fashion compatibility and recommendation. As an example, Outfit C in Figure 3 is labeled with the theme tags travel, thin, nature, and female, and its recommendation description reads “Denim jackets open or bare shoulders, with straw hat sunglasses, full of holiday atmosphere”.
Fine-grained category. As mentioned above, Theme-Attention is built on the fine-grained categories, as they usually carry more high-level knowledge that can be used to form a theme. Each fashion item is labeled with one of 152 fine-grained categories such as T-shirt, jacket, boots, wallet, sunglasses, and so on. The fashion designers labeled the fashion attributes of each fashion item to construct the fine-grained categories. For example, the jacket of Outfit C in Figure 3 has the fine-grained category short jacket, and its attributes are lapel, ruffle, simple, cowboy, long sleeve, tops, female, and jacket. Each fashion item has 7 attributes on average.
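As an illustration of the annotations described in this section, the following is a minimal sketch of how one Fashion32 record could be represented. The class and field names, the SKU and outfit IDs, the product name, and the coarse category chosen for the jacket are all hypothetical and are not the released schema; only the theme tags, the description, the fine-grained category, and the attributes of Outfit C are taken from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FashionItem:
    sku_id: str
    name: str
    coarse_category: str   # one of the 6 coarse-grained categories
    fine_category: str     # one of the 152 fine-grained categories
    attributes: List[str] = field(default_factory=list)


@dataclass
class Outfit:
    outfit_id: str
    themes: List[str]      # one or more of the 32 theme tags
    description: str
    items: List[FashionItem] = field(default_factory=list)

    def __post_init__(self):
        # Every outfit in the dataset carries at least one theme tag.
        if not self.themes:
            raise ValueError("each outfit carries at least one theme tag")


jacket = FashionItem(
    sku_id="sku-0001",            # hypothetical
    name="denim jacket",          # hypothetical
    coarse_category="outer top",  # assumed, not stated for this item
    fine_category="short jacket",
    attributes=["lapel", "ruffle", "simple", "cowboy",
                "long sleeve", "tops", "female", "jacket"],
)
outfit_c = Outfit(
    outfit_id="outfit-c",         # hypothetical
    themes=["travel", "thin", "nature", "female"],
    description="Denim jackets open or bare shoulders, with straw hat "
                "sunglasses, full of holiday atmosphere",
    items=[jacket],
)
```

Keeping the theme tags at the outfit level and the categories at the item level mirrors how the two kinds of labels are used later: categories select the embedding subspace, while themes select the attention weights.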
In this section, we first formulate the theme-aware compatibility learning problem. Then, we propose our framework as shown in Figure 2, which consists of two main parts: (a) Category-Specific Embedding Network, and (b) Theme-Aware Attention Learning.
The proposed fashion compatibility learning framework consists of two major components: category-aware triplet embedding and Theme-Attention, where the category serves as a bridge connecting the two components. Given a fashion outfit $O$, let $x_i^{(u)}$ be one of the fashion items of $O$, where the superscript $u$ denotes that this item’s category is $u$ ($u \in \mathcal{C}$, where $\mathcal{C}$ is the fine-grained category set). Then, the theme-aware fashion compatibility of outfit $O$, given the fashion theme $t$, can be computed as,
$$ S(O \mid t) = \sum_{x_i^{(u)},\, x_j^{(v)} \in O,\; i < j} d^{\,t}_{uv}\big(x_i^{(u)}, x_j^{(v)}\big) \tag{1} $$
where $u$ and $v$ are item categories, and $d^{\,t}_{uv}$ is the pairwise distance between any pair of items in outfit $O$, computed in a category-specific embedding subspace with theme-attention. The employment of theme-attention enables our approach to obtain the outfit-wise compatibility conditional on a theme $t$, rather than simply averaging the pairwise compatibility of an outfit.
Category-Specific Embedding Network
Given one pair of compatible items and one incompatible pair, a triplet network can learn a mapping function that projects compatible items close to each other and keeps incompatible ones apart in the embedding space. In general, the fine details of fashion items are important for the embedding to learn a fashion compatibility “metric”. To better capture the details, we prefer to learn a category-specific embedding, since the compatibility measurement between one pair of categories can be different from that of another pair. In other words, a mapping function $f_{uv}$ is learned to measure the compatibility between items from categories $u$ and $v$. For simplicity, in this section, the category superscript of items will be omitted.
Given an outfit $O$ and two items $x_i$, $x_j$ in $O$ whose categories are $u$ and $v$, to compute their distance in terms of compatibility, we adopt a multi-layer CNN with deep residual blocks [he2016deep] (see Section Implementation Details for details) to embed the two items into the category-specific space as $f_{uv}(x_i; \theta)$ and $f_{uv}(x_j; \theta)$, where $\theta$ represents the CNN parameters. As a result, the distance between two outfit items can be computed as,
$$ d_{uv}(x_i, x_j) = \big\| f_{uv}(x_i; \theta) - f_{uv}(x_j; \theta) \big\|_2 \tag{2} $$
where $\|\cdot\|_2$ is the Euclidean distance [euclidean1980]. In the experiments (Section Implementation Details), we will introduce the details of the deep networks used to construct the non-linear projections.
To learn a category-specific (i.e., category-pair) mapping function $f_{uv}$, we form a training set $\mathcal{T}$ consisting of triplets $(x_i, x_j, x_k)$, obtained by selecting two items $x_i$ and $x_j$ from an outfit $O$ and a third item $x_k$ from another outfit $O'$. The selected items are from either category $u$ or $v$. One assumption in our triplet formulation is that items $x_i$ and $x_j$ from the same outfit are compatible, while $x_k$ from another outfit is incompatible with the other two items. If $x_i$ is the anchor of a specific outfit, the optimization goal is to make the distance between items in the same outfit smaller than that between items from different outfits. Therefore, our goal during the triplet network learning is to minimize the following loss function over the training set:
$$ \mathcal{L}_{\mathrm{tri}} = \sum_{(x_i, x_j, x_k) \in \mathcal{T}} \max\big(0,\; d_{uv}(x_i, x_j) - d_{uv}(x_i, x_k) + \alpha\big) \tag{3} $$
where $\alpha$ is some margin.
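The triplet objective described above can be sketched in a few lines. This is a minimal illustration on plain Python lists that stand in for CNN embeddings, assuming the standard hinge form of the triplet loss; the function names and the margin value of 0.2 are hypothetical choices, not values from our experiments.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: zero once the negative is farther from the
    anchor than the positive by at least `margin`."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings: `positive` shares an outfit with `anchor`,
# `negative` comes from another outfit.
anchor, positive, negative = [0.0, 0.0], [0.3, 0.4], [3.0, 4.0]
easy = triplet_loss(anchor, positive, negative)  # constraint satisfied
hard = triplet_loss(anchor, negative, positive)  # roles swapped: violated
```

In the first call the negative item is already far from the anchor, so the triplet contributes no loss; in the second call the ordering constraint is violated and the loss grows with the size of the violation.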
The category-specific embedding network attempts to learn an independent mapping function $f_{uv}$ for every pair of categories $(u, v)$. This results in $O(|\mathcal{C}|^2)$ CNNs or embedding spaces, where $\mathcal{C}$ is the category set. Learning these embeddings individually is inefficient, because the CNNs can be highly redundant: the difference between two category-specific embeddings may be small. Therefore, instead of learning individual spaces, we propose to learn category-specific sub-spaces, which makes it possible to share one mapping function $f(\cdot\,; \theta)$ across all category-specific embeddings. To this end, we further introduce a category-specific mask into the triplet embedding process. The mask serves as a gate function that selects the relevant bins to project an item into its category-specific subspace, and is denoted $m_{uv}$.
Then, the distance between two items in Equation (2) can be represented with category-specific compatibility:
$$ d_{uv}(x_i, x_j) = \big\| f(x_i; \theta) \odot m_{uv} - f(x_j; \theta) \odot m_{uv} \big\|_2 \tag{4} $$
where $m_{uv} \in \mathbb{R}^{D}$ is a vector, $\odot$ denotes element-wise multiplication, and $D$ is also the output size of the feature extractor $f$.
Therefore, the modified conditional triplet loss is represented as:
$$ \mathcal{L}_{\mathrm{mask}} = \sum_{(x_i, x_j, x_k) \in \mathcal{T}} \max\big(0,\; d_{uv}(x_i, x_j) - d_{uv}(x_i, x_k) + \alpha\big) \tag{5} $$
where $d_{uv}$ is the masked distance of Equation (4) and $\alpha$ is the margin. The loss can be minimized by learning the embedding mask for each category pair.
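To make the role of the mask concrete, the sketch below gates a shared feature vector with a per-category-pair mask before measuring the Euclidean distance. It is a toy illustration under stated assumptions: the 4-dimensional features, the binary mask values, and the category names are hypothetical, whereas in our model the masks are learned.

```python
import math


def masked_distance(feat_i, feat_j, mask):
    """Euclidean distance after gating both feature vectors with the
    same category-pair mask (element-wise product)."""
    return math.sqrt(sum((m * (p - q)) ** 2
                         for p, q, m in zip(feat_i, feat_j, mask)))


# Hypothetical masks, one per category pair, over a shared 4-D feature.
masks = {
    ("t-shirt", "jeans"): [1.0, 1.0, 0.0, 0.0],
    ("shirt", "trousers"): [0.0, 0.0, 1.0, 1.0],
}
feat_a = [1.0, 2.0, 3.0, 4.0]
feat_b = [1.0, 2.0, 0.0, 0.0]

d_casual = masked_distance(feat_a, feat_b, masks[("t-shirt", "jeans")])
d_formal = masked_distance(feat_a, feat_b, masks[("shirt", "trousers")])
```

The same two feature vectors are identical in the subspace selected by the first mask and far apart in the subspace selected by the second, which is how a single shared feature extractor can serve many category pairs.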
Theme-Aware Attention Learning
Given an outfit $O$, we can compute its outfit-wise compatibility by evaluating the pairwise compatibility of all pairs of items in $O$. One straightforward solution is to average all pairwise compatibilities as follows,
$$ S(O) = \frac{1}{N} \sum_{x_i,\, x_j \in O,\; i < j} d_{uv}(x_i, x_j) \tag{6} $$
where $x_i$ and $x_j$ are items in outfit $O$, and $N$ is the total number of item pairs in $O$. This solution treats each pair equally without considering the outfit’s theme tags. As shown in Figure 1, Outfit A may be highly compatible when the fashion theme is ignored, while it may not be suitable for a business purpose. Therefore, we should take fashion themes into account when measuring outfit-wise compatibility. To this end, Theme-Attention is proposed in this work.
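The unweighted averaging baseline described above can be sketched as follows. The scalar "embeddings" and the absolute-difference distance are toy stand-ins chosen for illustration; any pairwise distance function can be passed in.

```python
from itertools import combinations


def average_compatibility(items, pair_distance):
    """Mean pairwise distance over all unordered item pairs of one outfit."""
    pairs = list(combinations(items, 2))
    return sum(pair_distance(a, b) for a, b in pairs) / len(pairs)


# Toy 1-D "embeddings" and an absolute-difference distance.
outfit = [0.0, 1.0, 3.0]
score = average_compatibility(outfit, lambda a, b: abs(a - b))
```

Every pair contributes with the same weight here, which is exactly the property that the theme-aware formulation replaces.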
In fact, theme-aware fashion compatibility is an attention problem. The theme-attention is built to link themes to pairwise compatibility and thereby to bring the theme into the estimation of outfit-wise compatibility. As shown in Figure 2 (c), the pairwise compatibility of each category pair $(u, v)$ is computed from the category-specific embedding. The yellow edges indicate the likelihood of associating a theme $t$ with a category pair. This association likelihood is the theme-attention weight $a^{t}_{uv}$ applied to a pairwise compatibility when all pairwise values are aggregated into the outfit-wise compatibility. Consequently, learning theme-attention amounts to learning the weights $a^{t}_{uv}$.
Therefore, given a fashion theme $t$, the theme-aware compatibility for an outfit $O$ can be computed by,
$$ S(O \mid t) = \sum_{x_i^{(u)},\, x_j^{(v)} \in O,\; i < j} a^{t}_{uv}\, d_{uv}(x_i, x_j) \tag{7} $$
where $a^{t}_{uv}$ indicates the attention weight for the category pair $u$ and $v$. Putting all together, we obtain an attention matrix $A^{t} = [a^{t}_{uv}]$ for a given theme $t$. As we can see, $d_{uv}(\cdot, \cdot)$ directly measures the compatibility based on the items’ appearance and can be treated as instance-level compatibility conditional on the items’ categories, while the attention matrix $A^{t}$ carries high-level human knowledge for judging whether an outfit is compatible or not.
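The theme-aware aggregation can be illustrated with a small sketch in which each category pair receives a theme-specific weight. The attention values, the category names, and the scalar embeddings below are hypothetical, and the weighted-sum form without further normalization is an assumption of this sketch.

```python
from itertools import combinations


def theme_aware_score(items, pair_distance, attention):
    """Attention-weighted sum of pairwise distances for one theme.

    `items` is a list of (category, embedding) tuples; `attention` maps an
    unordered category pair (a frozenset) to its weight under the theme.
    Pairs without an entry receive weight 0.
    """
    total = 0.0
    for (cat_a, emb_a), (cat_b, emb_b) in combinations(items, 2):
        weight = attention.get(frozenset((cat_a, cat_b)), 0.0)
        total += weight * pair_distance(emb_a, emb_b)
    return total


outfit = [("shirt", 0.0), ("trousers", 1.0), ("sneakers", 3.0)]
# Hypothetical attention weights for a "business" theme.
business = {
    frozenset(("shirt", "trousers")): 0.8,
    frozenset(("shirt", "sneakers")): 0.1,
    frozenset(("trousers", "sneakers")): 0.1,
}
score = theme_aware_score(outfit, lambda a, b: abs(a - b), business)
```

Setting all weights to the reciprocal of the number of pairs recovers the plain average, so the unweighted baseline is a special case of this aggregation.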
To learn the attention network for theme $t$, we treat all compatible outfits associated with theme $t$ as positive examples. To obtain negative examples, we select outfits from other themes and also create incompatible outfits by replacing one or several items in a compatible one. We treat the compatibility prediction as a classification problem and formulate the loss function as a cross-entropy loss [Goodfellow2016],
$$ \mathcal{L}_{\mathrm{ce}} = -\big[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\big] \tag{8} $$
where $\hat{y}$ is the output, and $y$ is the ground truth of theme-aware compatibility under the condition of theme $t$.
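For reference, a binary cross-entropy of the kind described above can be written as below. This is a generic sketch, assuming the model output has already been mapped to a probability in (0, 1); the clamping constant is an implementation detail of the sketch.

```python
import math


def binary_cross_entropy(predictions, labels, eps=1e-12):
    """Mean binary cross-entropy; `predictions` are probabilities in (0, 1),
    `labels` are 1 for compatible and 0 for incompatible outfits."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


confident = binary_cross_entropy([0.9, 0.1], [1, 0])  # correct predictions
wrong = binary_cross_entropy([0.1, 0.9], [1, 0])      # inverted predictions
```

Correct, confident predictions give a small loss, while confidently wrong predictions are penalized heavily.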
To evaluate the effectiveness of theme tags, we conduct experiments comparing the proposed theme-aware approach with state-of-the-art methods as baselines. To show the generality of our method, we perform experiments not only on the proposed Fashion32 dataset but also on the Polyvore dataset [han2017learning], which has no theme labels and has been used by several previous works.
In all experiments, we use a ResNet-50 model [he2016deep] with the bottleneck architecture, pre-trained on ImageNet [russakovsky2015imagenet], as the backbone. Input images are resized to a fixed resolution, and the output of the embedding is a 1000-dimensional feature vector. All models are trained on a single Tesla V100 GPU, and each input mini-batch has 32 outfits. Training for 50 epochs takes about 6 hours. The learning rate is decayed exponentially by a factor of 0.2 every 10 epochs. The optimizer is SGD with a momentum of 0.9. Only the model parameters with the best performance on the validation set are saved. For Fashion32, we split the outfits into three parts: training (11,040), validation (853), and testing (2,021) sets. The experimental settings for the Polyvore dataset are the same as those for Fashion32.
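The learning-rate schedule described above (decay by a factor of 0.2 every 10 epochs) can be sketched as a simple step function. The starting value of 0.1 used here is purely hypothetical and is not the value used in our experiments.

```python
def step_decay_lr(base_lr, epoch, factor=0.2, step=10):
    """Learning rate after multiplying by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)


base_lr = 0.1  # hypothetical starting value, for illustration only
schedule = [step_decay_lr(base_lr, e) for e in (0, 9, 10, 25)]
```

Over 50 epochs this schedule applies the decay four times, at epochs 10, 20, 30, and 40.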
Negative Sample. All labeled outfits in a dataset are naturally positive samples (i.e., compatible outfits). There are no annotated negative outfits (i.e., incompatible outfits), because in real life people do not deliberately compose an incompatible outfit. To generate a negative sample, we select a positive outfit and substitute one of its items with an item of the same category that is randomly selected from the other outfits. The ratio of compatible to incompatible outfits is kept the same for the training, validation, and test sets. During evaluation, each model is evaluated 5 times with different negative samples.
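The negative sampling procedure described above can be sketched as follows. The data layout (tuples of item ID and category) and the function name are assumptions of this sketch.

```python
import random


def make_negative(outfit, other_items, rng):
    """Copy `outfit` and swap one item for a same-category item taken
    from other outfits.

    `outfit` is a list of (item_id, category) tuples; `other_items` holds
    candidates from the remaining outfits in the same format. Returns None
    if no same-category replacement exists for any position.
    """
    positions = list(range(len(outfit)))
    rng.shuffle(positions)  # try positions in random order
    for pos in positions:
        _, category = outfit[pos]
        candidates = [it for it in other_items if it[1] == category]
        if candidates:
            negative = list(outfit)  # leave the positive outfit untouched
            negative[pos] = rng.choice(candidates)
            return negative
    return None


outfit = [("a1", "shirt"), ("a2", "trousers")]
pool = [("b1", "shirt"), ("b2", "skirt")]
negative = make_negative(outfit, pool, random.Random(0))
```

Passing a seeded `random.Random` makes the generated negatives reproducible, and using different seeds yields the different negative sets needed for repeated evaluation.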
Area Under Curve (AUC). To evaluate the model’s binary prediction of compatibility, we report the area under the Receiver Operating Characteristic (ROC) curve.
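As a reminder of what this metric measures, the AUC equals the probability that a randomly chosen compatible outfit is scored higher than a randomly chosen incompatible one. The sketch below computes it directly from that definition; it is a quadratic-time illustration, not the evaluation code used in our experiments.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive scores higher; ties count as one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Higher score means "more compatible" in this toy example.
auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

If the model outputs a distance rather than a score, the sign has to be flipped before applying this function so that larger values mean more compatible.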
Fill-in-the-blank (FITB) Accuracy. The FITB task aims at filling the blank of an outfit with the most compatible item. Each blank has 4 options in our experiment, and the accuracy is computed over the option selections. Our model chooses the answer by predicting scores for the 4 candidate outfits, which differ only in the option substituted into the blank.
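The FITB protocol can be sketched as below. The question format, the function names, and the toy scorer are hypothetical; the sketch assumes a scoring function in which a higher value means a more compatible outfit.

```python
def fitb_accuracy(questions, score_outfit):
    """Fraction of questions where the ground-truth option scores best.

    Each question is (partial_outfit, options, answer_index), and
    `score_outfit` maps a completed outfit to a compatibility score.
    """
    correct = 0
    for partial, options, answer in questions:
        scores = [score_outfit(partial + [opt]) for opt in options]
        if scores.index(max(scores)) == answer:
            correct += 1
    return correct / len(questions)


def toy_score(outfit):
    """Toy scorer: items whose values lie close together are compatible."""
    return -(max(outfit) - min(outfit))


questions = [
    ([1.0, 1.2], [5.0, 1.1, 9.0, 7.0], 1),   # scorer picks option 1: correct
    ([4.0, 4.1], [4.05, 0.0, 8.0, 9.0], 2),  # scorer picks option 0: wrong
]
accuracy = fitb_accuracy(questions, toy_score)
```

With 4 options per blank, random guessing yields an expected accuracy of 25%, which is the reference point for the FITB numbers reported below.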
Quantitative Experiments on Fashion32 Dataset
To verify the effectiveness of our theme-aware fashion compatibility model, we also implemented a Baseline version of the Theme-Attention method. The Theme-Attention method and the Baseline minimize the compatibility loss via Equation (8) and Equation (5), respectively. Theme-Attention is built on the Baseline by adding theme-aware attention.
Table 2. Columns: Methods | Type | Compat. AUC (%) | FITB Acc. (%)
Table 2 shows the AUC and FITB scores of both methods. In terms of AUC, the Theme-Attention method outperforms our Baseline in almost all theme groups, the exception being the gender group, and FITB scores generally follow the same trend as AUC scores. In particular, Theme-Attention achieves a 6.89% increase in AUC for the occasion theme group. These results demonstrate that the Theme-Attention method is able to improve the quality of fashion compatibility estimation. In addition, we observe that Theme-Attention performs better on distinctive themes. For example, outfits of “sports” usually consist of T-shirts, shorts, or running shoes, which makes the theme-aware weights easy to learn. In contrast, the gender theme group does not help fashion compatibility estimation. We conjecture that this is because it is easy to tell whether an outfit is designed for men or women from its visual appearance alone, so the “Female” and “Male” themes add little information. Adding this uninformative attention to the model actually hurts performance, which is why it performs worse than our Baseline.
In terms of the FITB scores, the Theme-Attention method can achieve up to 7.42% improvement compared to the Baseline. Overall, the Theme-Attention method can effectively learn the theme-aware fashion compatibility under a theme with narrow variations of category combinations.
Figure 4 illustrates some visual examples of our theme-aware fashion compatibility learning results. As we can see, our model generates different compatibility scores for an outfit given different themes. The theme with the highest score is taken as the most relevant theme for the outfit. Hence, outfits (a), (b), and (c) are correctly detected as business, sports, and travel, respectively. Example (d) is a negative example without a theme. The result for (b) also shows that an outfit can be suitable for multiple themes: both travel and sports fit (b). This visualization further supports the proposed theme-aware fashion compatibility.
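Table 3. Columns: Method | Compat. AUC (%) | FITB Acc. (%) | Compat. AUC (%) | FITB Acc. (%), with one AUC/FITB column pair for each of the two datasets (fine-grained Polyvore and Fashion32)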
Subjective Experiment on Online Fashion Shop
We further evaluate our model’s performance using click rates, a measure derived from customers’ browsing history that is widely used by e-commerce platforms. An outfit with a higher click rate is taken to have a higher compatibility score. Unlike the previous experiments, this one is more subjective.
We collected 500 fashion items from an online fashion shop as the search pool. Given a fashion item and a target theme, two algorithms (our Theme-Attention method and the non-theme method) each recommend 3 to 5 items from the search pool to generate a complete outfit. Figure 5 shows some recommended outfits. As we can see, the results are not only visually compatible but also suitable for the given theme. For the evaluation, 5 subjects were assigned to conduct the click rate experiments. The subjects were asked to tell which of the two recommendations is more compatible given a fashion theme. To avoid bias toward one recommendation algorithm, we randomly switched the order of the two recommendations. Each subject evaluated 200 outfits on three themes: business, travel, and sports.
The final results show that the Theme-Attention method is better than the non-theme method, with an 8.6% improvement in terms of click rates. This further demonstrates that our approach can recommend theme-compatible outfits from a pool of new fashion items that were not seen during training.
Since we are the first to work on the theme-aware fashion compatibility problem, it is very difficult to directly compare our approach with previous work. Because our Baseline is the non-theme version of our approach, we can apply it to the Polyvore dataset for comparison. However, our Baseline requires fine-grained categories, which Polyvore does not have. To this end, we extended the Polyvore dataset to a fine-grained Polyvore (please refer to our supplemental material for this improved version). In addition, we also ran previous approaches on our Fashion32 dataset without leveraging the theme information.
Table 3 shows the detailed performance comparison on both the fine-grained Polyvore and Fashion32 datasets. On the fine-grained Polyvore dataset, our Baseline is comparable to previous approaches for both the compatibility prediction and FITB tasks. Since our Baseline is the non-theme version of our approach, this comparison is on equal terms. On the Fashion32 dataset, our Theme-Attention method outperforms the state-of-the-art approaches. In terms of FITB, our approach is 7% better than the state-of-the-art approaches. Although this is not a direct comparison, because previous approaches have no mechanism to leverage theme information for fashion compatibility, the results still demonstrate the advantages of theme-aware fashion compatibility.
To solve the theme-aware fashion compatibility problem, in this paper, we collected the first theme-aware fashion dataset, which contains 13K outfits in total over 32 themes and 152 fine-grained categories. We further propose a novel method, Theme-Attention, which is built on category-specific embedding, to learn theme-aware fashion compatibility. We evaluate our approach with both objective and subjective experiments. Our Theme-Attention method improves over the Baseline and achieves 94.26% AUC in outfit compatibility prediction and 78.85% accuracy in the FITB task. Compared with several recent works on the Polyvore dataset, the Baseline version of the Theme-Attention method also achieves competitive results on the compatibility prediction and FITB tasks.