Fashion plays important and sophisticated roles in many aspects of life, including society, culture, and identity. A fashion can be roughly understood as a type of reaction common to a considerable number of people. Fashion styles suggested by fashion experts, fashion icons, or, increasingly, Key Opinion Leaders (KOLs) on social media are commonly imitated by the public.
On the internet, a set of outfits with descriptions is a common way to demonstrate a fashion style. People imitate fast-changing fashion styles from these demonstrations without interacting with the experts. However, learning fashion styles from demonstrated outfits by machine is challenging, because it requires understanding how low-level fashion elements map to high-level styles.
Fashion styles are composed of important design elements such as color, pattern, material, silhouette, and trim. Each item in an outfit may demonstrate some design elements at a low level through its images and attribute information. High-level style information may be available too: the relationships between items and, sometimes, the cultural meaning behind the suggested style. All items in an outfit should be consistent with the described style and compatible with each other.
Taking the outfit information as input data, a learning-based method can be used if a good dataset with labels for each item and the corresponding style is available. Most existing outfit composition methods follow this supervised learning approach and concentrate on predicting the compatibility between fashion items. However, compatibility is only part of fashion style knowledge: compatible items are not guaranteed to be consistent with the definition of the conditioning style. Besides, supervised learning methods based on behavioral cloning (BC) suffer from distribution shift: because the agent greedily imitates demonstrated actions, subtle differences can cause it to drift from one style to another.
In this paper, we address this problem from two aspects. First, we design a hierarchical multimodal representation to describe the complex latent information of the whole outfit structure. Second, based on this representation, we propose an inverse reinforcement learning method that infers the experts’ composition reward function and learns the composition value function simultaneously.
Corresponding to the three parts of an outfit demonstration (image, attributes, and style description), our outfit encoder network consists of three parts that learn the rich information conveyed in an outfit. At the low level, a shared multimodal variational autoencoder learns a joint representation from the image and attribute information of each item. At the high level, we cover the whole outfit style in two steps: once the item vectors are generated, a matching strategy extracts the relations between pairs of items; and a pre-trained BERT model encodes the explanation text as the condition of the whole outfit.
Adversarial inverse reinforcement learning (AIRL) is used in our algorithm so that the agent learns the reward function and the value function simultaneously. This lets us both make use of the efficient adversarial formulation and recover a generalizable and robust reward function for item compatibility and style consistency.
Our key contributions can be summarized as follows:
- We propose a learning-based framework for effective fashion imitation. To the best of our knowledge, our method is the first to address imitation behavior in fashion style learning.
- We design a hierarchical multimodal neural network that effectively encodes the rich latent information of the whole outfit: at the low level, it captures complementary information from multimodal observations; at the high level, it covers the interleaving factors between items and style.
- We propose an adversarial inverse reinforcement learning method for recovering the style reward function, which learns a more robust reward and so avoids style drift.
For the rest of the paper, we first discuss related work in Section 2. We then describe the hierarchical multimodal fusion model for outfits in Section 3. Next, the details of the proposed adversarial training method are introduced in Section 4. We show our experimental results in Section 5 and give our discussion and future work in Section 6.
2 Related Work
Computer vision and recommendation techniques have important and rich applications in the fashion domain. The majority of research in this domain focuses on fashion image attribute recognition and retrieval, fashion item semantic embedding learning, fashion recommendation, visual compatibility learning, and outfit composition.
Fashion Attribute Recognition and Retrieval. Clothing attributes provide a useful tool to assess clothing products as mid-level semantic visual concepts. Recently, deep networks such as DeepFashion and MTCT have proven efficient at attribute recognition on large datasets. Inoue et al. proposed a multi-label joint learning network to predict clothing attributes from images with minimal human supervision. Several works utilized weakly labeled image-text pairs to discover attributes. Image retrieval and product recommendation tasks can also benefit from attribute results. Chen et al. focused on describing people based on fine-grained clothing attributes. AMNet conducted a fashion search after changing an attribute. Mix and match combined a deep network with conditional random fields to explore the compatibility of clothing items and attributes. Wei et al. considered both global and localized attributes as ‘words’ to describe clothing. Many of the works above utilized attributes for image retrieval and recommendation tasks, but attributes alone cannot capture the high-level information of the whole outfit. In our work, we combine multimodal information into a global feature. Our experiments show that this feature can also be used to infer missing attributes.
Fashion Visual-semantic Embedding Learning. Fashion item representation learning is the fundamental step of all downstream inference work. Many works investigate this problem with different network structures and learning methods. The most common approaches are trained as a Siamese network or with a triplet loss. This is extended in Han et al. by feeding the visual representation of each garment within an outfit into an LSTM in order to reason jointly about the outfit as a whole. Simo-Serra et al. trained a classifier network with ranking as a constraint to extract distinctive feature representations, and also used high-level features of the classifier network as an embedding of fashion style. Many unsupervised representation learning methods have also been proposed to learn latent features directly from unlabeled data. Most of them utilize Variational Auto-Encoder (VAE) and Predictability Minimization (PM) models to learn fashion item embeddings. The encoded embeddings usually contain mixed and unexplainable features of the original images. Several approaches implement multimodal embedding methods to reveal novel feature structures. These multimodal methods only map image and text into the same space, without considering the deep correlation between the modalities. In this paper, we try to learn a distributed representation for specific fashion styles. We assume that a complementary representation of fashion style should cover both the compatibility between items and the common features shared by the whole outfit. To reach this goal, our method learns a joint representation for fashion items from image and attribute information at the low level, and captures the relationship between items and the conditioning style at the high level.
Fashion Recommendation and Outfit Composition. There are several approaches to fashion item recommendation. In the context of fashion analysis, visual compatibility measures whether clothing items complement one another across visual categories. For example, “sweat pants” are more compatible with “running shoes” than with “high-heeled shoes”. Most existing fashion-related research focuses on the compatibility between fashion items and learns from image data only. Iwata et al. proposed a topic model to recommend “Tops” for “Bottoms”; the goal of their work is to compose fashion outfits automatically by building product coordinates from the visual features of each fashion item region. Veit et al. built a Siamese Convolutional Neural Network (CNN) architecture to learn matching clothing product pairs from the Amazon co-purchase dataset, focusing on the representative problem of learning compatible clothing styles. Simo-Serra et al. introduced a Conditional Random Field (CRF) to learn different outfit formulas and types of people; the model is further used to predict how fashionable a person looks in a particular photograph. Li et al. used multimodal embeddings as features and quality scores as labels to train a grading model. Song et al. proposed to model the compatibility between fashion items based on Bayesian personalized ranking (BPR). Han et al. employed a Bi-LSTM model to jointly learn the compatibility relationships among fashion items by modeling an outfit as a sequence. Another line of work used style topic models for compatibility and defined recommendation as a subset selection problem. Chen et al. proposed an encoder-decoder model to generate personalized fashion outfits. Most of the works above use supervised learning to predict the compatibility between fashion items and validate model performance on a validation dataset generated with negative sampling. Thus, these methods are inclined to learn general patterns rather than capture the subtle differences between fashion styles. To address this problem, our method uses adversarial learning to learn a more robust composition policy.
Imitation Learning. Imitation learning techniques aim to mimic human behavior in a given task. An agent, i.e., a learning machine, is trained to perform a task from demonstrations by learning a mapping between observations and actions. Inverse reinforcement learning (IRL) is a form of imitation learning that accomplishes this by first inferring the expert’s reward function and then training a policy to maximize it. Most imitation learning works focus on the imitation process in the fields of robotics, adaptive planning, and data-driven animation. In this work, we formalize the outfit composition task as a Markov Decision Process (MDP) and work under the maximum causal entropy IRL framework, which allows us to cast the reward learning problem as a maximum likelihood problem. Our IRL algorithm is built upon previously proposed adversarial IRL architectures. A discriminator is trained to distinguish the experts’ selections, while the agent is trained to “fool” the discriminator into thinking that it is the expert. To our knowledge, this is the first approach that treats outfit composition as a style imitation learning problem in the fashion domain.
3 Hierarchical Multimodal Representation for Fashion Outfit
3.1 Fusion Representation for Fashion Item
For a single fashion item, the corresponding image and attribute tags carry complementary information. For instance, people can easily tell the color and pattern of a garment from its image, but need to check the attribute tags to know the garment’s material and specific functional usage. Learning from diverse modalities with generative approaches has the potential to yield more generalized joint representations. Inspired by the product-of-experts (PoE) inference network, which assumes conditional independence among the modalities so that the joint posterior over multiple modalities is a product of individual posteriors, we utilize a multimodal variational autoencoder (MVAE) to learn a joint distribution from both image and attribute tags, as shown in Figure 2.
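In the multimodal setting, we assume the $N$ modalities, $x_1, \ldots, x_N$, are conditionally independent given the common latent variable, $z$. That is, we assume a generative model of the form $p_\theta(x_1, \ldots, x_N, z) = p(z)\, p_\theta(x_1 \mid z) \cdots p_\theta(x_N \mid z)$. If an item is presented by a collection of modalities $X = \{x_i\}$, then the evidence lower bound (ELBO) becomes:
$$\mathrm{ELBO}(X) = \mathbb{E}_{q_\phi(z \mid X)}\Big[\sum_{x_i \in X} \log p_\theta(x_i \mid z)\Big] - \mathrm{KL}\big[q_\phi(z \mid X) \,\|\, p(z)\big] \quad (1)$$
The MVAE can be trained by simply optimizing the evidence lower bound given in Eqn. 1. The product and quotient distributions are not in general solvable in closed form. However, when $p(z)$ and $q_\phi(z \mid x_i)$ are Gaussian there is a simple analytical solution: a product of Gaussian experts is itself Gaussian with mean $\mu = (\sum_i \mu_i T_i)(\sum_i T_i)^{-1}$ and covariance $V = (\sum_i T_i)^{-1}$, where $\mu_i$, $V_i$ are the parameters of the $i$-th Gaussian expert, and $T_i = V_i^{-1}$ is the inverse of the covariance.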
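The closed-form product of Gaussian experts can be illustrated with a minimal sketch. It assumes diagonal covariances, so that each expert is described by a mean vector and a variance vector and the precision $T_i$ is simply the element-wise inverse of the variance. The function name and the toy numbers are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple


def product_of_gaussian_experts(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Fuse diagonal Gaussian experts into one Gaussian, dimension by dimension."""
    if not means or len(means) != len(variances):
        raise ValueError("need one mean vector and one variance vector per expert")
    dim = len(means[0])
    fused_mean: List[float] = []
    fused_var: List[float] = []
    for d in range(dim):
        # Precision T_i is the inverse of the variance of expert i.
        precisions = [1.0 / var[d] for var in variances]
        total_precision = sum(precisions)
        weighted_sum = sum(mu[d] * t for mu, t in zip(means, precisions))
        # Fused variance is (sum_i T_i)^-1; fused mean is the precision-weighted average.
        fused_var.append(1.0 / total_precision)
        fused_mean.append(weighted_sum / total_precision)
    return fused_mean, fused_var


# Hypothetical 2-D latent space: a standard-normal prior expert,
# an image expert, and an attribute expert.
mean, var = product_of_gaussian_experts(
    means=[[0.0, 0.0], [2.0, 0.0], [2.0, 3.0]],
    variances=[[1.0, 1.0], [1.0, 1.0], [0.5, 1.0]],
)
```

In this toy case the more confident attribute expert (variance 0.5 in the first dimension) pulls the fused mean further than the image expert does, and the fused variance is smaller than that of any single expert. When a modality is missing, its expert can simply be left out of the lists.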
3.2 Language-Conditioned Hierarchical Representation for Outfit Style
Compared with the standard compatibility prediction task, fashion imitation learning aims to learn a reward function that generalizes across different types of styles. Whereas standard supervised methods are typically trained and evaluated without considering style information, we want our language-conditioned reward function to produce correct behavior when generating outfits under given style conditions.
Notice that the description text of an outfit usually explains its common salient features, such as occasion, season, and trend information, with which all the items are consistent. We encode this explanation text as the condition of the whole outfit structure. As shown in Figure 3, for any pair of items in an outfit, a pretrained BERT model encodes the description text, and the MVAE encoder converts each item’s image and attribute information into a joint representation. The interaction between the two items is then computed from their joint representations. The outfit embedding is the concatenation of these two parts, the encoded description and the item interaction, and the outfit style rewards are learned from it.
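As a rough illustration of how such an outfit embedding could be assembled, the sketch below concatenates a style vector with a pairwise interaction of two item vectors. The exact interaction used in the paper is not recoverable from the text, so the matching form here (both vectors, their absolute difference, and their element-wise product) is purely an assumption, as are the function names and the tiny vector sizes.

```python
from typing import List, Sequence


def pair_interaction(u: Sequence[float], v: Sequence[float]) -> List[float]:
    """Assumed matching features: [u, v, |u - v|, u * v] (not confirmed by the paper)."""
    if len(u) != len(v):
        raise ValueError("item vectors must have the same length")
    abs_diff = [abs(a - b) for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + abs_diff + product


def outfit_embedding(
    style_vec: Sequence[float],
    item_a: Sequence[float],
    item_b: Sequence[float],
) -> List[float]:
    """Concatenate the encoded style description with the item-pair interaction."""
    return list(style_vec) + pair_interaction(item_a, item_b)


# Hypothetical inputs: a 1-D style vector (stand-in for a BERT encoding)
# and two 2-D item vectors (stand-ins for MVAE latent means).
embedding = outfit_embedding([0.5], [1.0, 2.0], [3.0, 1.0])
```

With real encoders the style vector would come from the frozen text model and the item vectors from the multimodal autoencoder; only the concatenation pattern is meant to carry over.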
4 Adversarial Inverse Reinforcement Learning for Style Reward Learning
For the outfit composition task, we formalize the process as follows. Consider the set of all fashion items and an outfit composed of items drawn from that set. Each item belongs to one of a limited number of categories. The fashion outfit composition process can then be formulated as an iterative item selection process in which at most one item is selected for each category. For example, a user may want to compose an outfit in the “UK Smart Casual Style”. He or she then needs to select one item from each of four categories: “Shirt”, “Jacket”, “Pant”, and “Shoes”.
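The iterative selection process can be sketched as a loop over categories in which the state is the tuple of items chosen so far and each action adds one item. The greedy choice and the toy scoring function below are stand-ins for a learned policy and reward, and all item names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Item = str
State = Tuple[Item, ...]  # items selected so far


def compose_outfit(
    candidates: Dict[str, List[Item]],
    category_order: List[str],
    score: Callable[[State, Item], float],
) -> State:
    """Select at most one item per category, one category at a time."""
    state: State = ()
    for category in category_order:
        options = candidates.get(category, [])
        if not options:
            continue  # a category may stay empty
        # Greedy action: pick the highest-scoring item given the current state.
        best = max(options, key=lambda item: score(state, item))
        state = state + (best,)
    return state


# Toy stand-in for a style reward: favor items from a fixed smart-casual set.
SMART_CASUAL = {"oxford shirt", "navy blazer", "chinos", "loafers"}


def toy_score(state: State, item: Item) -> float:
    return 1.0 if item in SMART_CASUAL else 0.0


outfit = compose_outfit(
    candidates={
        "Shirt": ["graphic tee", "oxford shirt"],
        "Jacket": ["navy blazer", "track jacket"],
        "Pant": ["sweat pants", "chinos"],
        "Shoes": ["loafers", "running shoes"],
    },
    category_order=["Shirt", "Jacket", "Pant", "Shoes"],
    score=toy_score,
)
```

Because the score receives the current state, a learned reward can account for the items already chosen as well as for the candidate itself.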
This process can be described with the maximum causal entropy IRL framework, which considers an entropy-regularized Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, r, \gamma, \rho_0)$. $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces, respectively, and $\gamma \in (0, 1)$ is the discount factor. In the standard reinforcement learning setup, the dynamics or transition distribution $\mathcal{T}(s' \mid a, s)$, the initial state distribution $\rho_0(s)$, and the reward function $r(s, a)$ are unknown and can only be queried through interaction with the MDP. Specifically, a state $s$ is represented by the fashion items selected so far, and an action $a$ refers to an item selection during the process. The reward function indicates the compatibility between fashion items and their consistency with the described style.
Because the reward function is unknown, we assume the experts’ demonstration outfits are composed with an optimal policy $\pi^*$. Inverse reinforcement learning then seeks to infer the reward function $r(s, a)$ given a set of demonstrations $\mathcal{D} = \{\tau_1, \ldots, \tau_N\}$. Moreover, the dynamics of the composition process are known. Instead of using full trajectories, we can focus on the single state-action case. The entire training procedure is detailed in Algorithm 1. During training, our algorithm alternates between training a discriminator to separate the experts’ selections from outfits generated by the current policy, and updating the policy to confuse the discriminator. The discriminator takes the form:
$$D_{\theta,\phi}(s, a, s') = \frac{\exp\{f_{\theta,\phi}(s, a, s')\}}{\exp\{f_{\theta,\phi}(s, a, s')\} + \pi(a \mid s)},$$
where $f_{\theta,\phi}$ is restricted to a reward approximator $g_\theta$ and a shaping term $h_\phi$ as
$$f_{\theta,\phi}(s, a, s') = g_\theta(s, a) + \gamma h_\phi(s') - h_\phi(s).$$
Suppose we are given an expert policy $\pi_E$ that we wish to rationalize with IRL, and let $r^*$ be the true reward function. Then $f^*$ is the advantage function that needs to be recovered, and $h^*$ recovers the optimal value function $V^*$, which serves as the reward shaping term:
$$f^*(s, a, s') = r^*(s, a) + \gamma V^*(s') - V^*(s) = A^*(s, a).$$
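To make the discriminator concrete, the sketch below assumes the standard AIRL form, $D = \exp(f) / (\exp(f) + \pi(a \mid s))$ with $f = g(s, a) + \gamma h(s') - h(s)$, and evaluates it on scalar inputs. The function names and numeric values are hypothetical; in practice $g$ and $h$ would be neural networks over the outfit embedding.

```python
import math


def airl_logit(g_sa: float, h_s: float, h_next: float, gamma: float) -> float:
    """f(s, a, s') = g(s, a) + gamma * h(s') - h(s)."""
    return g_sa + gamma * h_next - h_s


def airl_discriminator(f_value: float, policy_prob: float) -> float:
    """D = exp(f) / (exp(f) + pi(a|s)), computed as sigmoid(f - log pi) for stability."""
    z = f_value - math.log(policy_prob)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def policy_reward(d_value: float) -> float:
    """Reward signal passed to the policy: log D - log(1 - D) = f - log pi(a|s)."""
    return math.log(d_value) - math.log(1.0 - d_value)


# Hypothetical scalar values for one (state, action, next state) transition.
f = airl_logit(g_sa=1.0, h_s=0.5, h_next=0.25, gamma=0.9)
d = airl_discriminator(f, policy_prob=0.2)
r = policy_reward(d)
```

Writing the discriminator as a sigmoid of $f - \log \pi(a \mid s)$ avoids overflow in $\exp(f)$ and makes it clear that the policy reward reduces to the learned advantage estimate minus the log-probability of the action.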
5 Experimental Results
In this section, we introduce the data collection we used for style imitation learning, the evaluation of the composition agent, and some further analysis.
Fashion datasets collected from Polyvore or Lookastic are suitable for data mining, but a dataset with more complete fashion style information works better for fashion imitation. Specifically, we want a dataset that has an explanation text describing the style of each outfit and an optional list of alternatives for every demonstrated item under the given style. This is natural for human imitation, and many fashion e-commerce websites demonstrate fashion items on their platforms this way. We found a website named Chuanda, on wqs.jd.com, that suits our needs. On the Chuanda website, all outfits are curated by fashion experts and every item is mapped to an identical product for sale.
We collect a fashion style dataset from Chuanda containing 3,557 outfits covering 67 basic fashion styles. In this dataset, each outfit is composed of up to three items and a short description of the outfit style. Moreover, for each outfit there are on average around 21.04 other candidate items suggested by fashion experts. Every item in this dataset is mapped to an identical sale item on the JD.com website. This is convenient because we can collect the product name, images, and attributes of the corresponding fashion item. In total, the fashion items in the dataset are labeled with 1,879 distinct fashion-related attributes that belong to 5 types: Gender, Season, Style, Material, and Function. More statistics of this dataset are shown in Table 1.
| Outfits | Items | Attributes | Basic styles | Avg Opts |
We perform two evaluations of the proposed learning framework. First, we evaluate the effectiveness of the learned multimodal representation by predicting missing attributes of a given fashion item. Second, we measure style consistency by computing the similarity between composed outfits and the recommendation list provided by fashion experts.
Missing Attribute Imputation. Fashion item representation is critical for the downstream style imitation learning. To verify whether our method learns more complementary information, we use a missing attribute imputation task to evaluate the effectiveness of the MVAE. On the Chuanda dataset, we simulate incomplete supervision by randomly reserving a fraction of the dataset as multimodal examples. We examine the effect of supervision on the attribute prediction task, e.g., predicting the correct attribute label from an image. For the MVAE, the total number of examples shown to the model is always fixed; only the proportion of complete bi-modal examples is varied. Five important types of attributes (Gender, Season, Style, Material, and Function) are first masked in the input and then predicted with the MVAE decoder; the evaluation results are provided in Table 2. We also perform a qualitative analysis of the item representations generated by the MVAE and visualize the feature space in Figure 5 using t-SNE. Our representation is robust to background variance, and items that share a similar style and visual appearance are clustered together. For example, items within groups such as “casual white tops” and “business style pants” lie very close together in the space regardless of background noise.
Style Consistency. In every Chuanda outfit, each item has a list of optional alternatives that are also consistent with the given style. Given a conditioning style and a query top, we adopt the common strategy that feeds the selected top item and the conditioning style description as the query and randomly selects K bottoms as candidates. The item in the experts’ demonstration and its optional alternatives are the positive candidates. Thus, we can evaluate the effectiveness of imitation learning by measuring the positions of the consistent items in the ranking list with the mean average precision (MAP) metric. In total, the test set contains 1,000 unique tops and styles. We compare our method (HM-AIRL) with two methods: a feature-based pointwise mutual information (PMI) ranking algorithm and the Bi-LSTM-based method of Han et al. Pointwise mutual information is a measure of association and is used for finding collocations and associations between items. In the Chuanda dataset, all items are labeled with 1,879 attributes. We pre-calculate the PMI scores between every pair of attributes and rank the candidate items by the sum of their attribute PMI scores. For the Bi-LSTM method, we follow the original setting and retrain the network on the Chuanda dataset.
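The PMI baseline can be sketched as follows. The sketch assumes that co-occurrence is counted at the outfit level (two attributes co-occur when they appear in the same outfit) and that attribute pairs never seen together contribute a score of zero; both choices, along with the function names and toy data, are assumptions rather than details given in the paper.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Sequence, Set


def pmi_table(outfits: Sequence[Set[str]]) -> Dict[FrozenSet[str], float]:
    """PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ), with probabilities over outfits."""
    n = len(outfits)
    single: Counter = Counter()
    pair: Counter = Counter()
    for attrs in outfits:
        single.update(attrs)
        pair.update(frozenset(p) for p in combinations(sorted(attrs), 2))
    table: Dict[FrozenSet[str], float] = {}
    for key, joint in pair.items():
        a, b = tuple(key)
        table[key] = math.log((joint / n) / ((single[a] / n) * (single[b] / n)))
    return table


def pmi_score(query: Set[str], candidate: Set[str], table: Dict[FrozenSet[str], float]) -> float:
    """Sum the PMI of every (query attribute, candidate attribute) pair; unseen pairs add 0."""
    return sum(
        table.get(frozenset((a, b)), 0.0) for a in query for b in candidate if a != b
    )


def rank_candidates(
    query: Set[str], candidates: Dict[str, Set[str]], table: Dict[FrozenSet[str], float]
) -> List[str]:
    """Order candidate items from highest to lowest summed PMI with the query."""
    return sorted(candidates, key=lambda name: pmi_score(query, candidates[name], table), reverse=True)


# Toy attribute sets, one per outfit (hypothetical data).
table = pmi_table([
    {"casual", "cotton"},
    {"casual", "cotton"},
    {"business", "wool"},
    {"business", "cotton"},
])
ranking = rank_candidates(
    {"casual"}, {"suit trousers": {"wool"}, "jeans": {"cotton"}}, table
)
```

In this toy data "casual" co-occurs with "cotton" more often than chance, so the cotton item is ranked first for a casual query.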
The performance of the three methods at different numbers of top-k candidate items is shown in Figure 4. HM-AIRL achieves the highest MAP for the top 5 to top 35 ranking results. We also notice that many items selected by HM-AIRL that are not in the experts’ optional list are nevertheless compatible with the query top and consistent with the conditioning style, which is reasonable in a real application. The Bi-LSTM method obtains the lowest MAP score on this task. After analyzing the ranking lists generated by the Bi-LSTM model, we attribute this mainly to three reasons. First, Bi-LSTM does not consider the style constraint, so many selected items are inconsistent with the conditioning style. Second, the images in the Chuanda dataset are much noisier. Unlike the clean images on the Polyvore website, many images in Chuanda contain irrelevant information such as price tags and promotional ads. Third, the Chuanda dataset is smaller than Polyvore. It is much more challenging to learn complex style concepts on a relatively small dataset with supervised learning methods.
In Figure 6, we show the outfits generated with attribute PMI ranking, the Bi-LSTM method of Han et al., and our HM-AIRL, together with the fashion experts’ selections, for query tops under three distinct conditioning styles. Compared with the experts’ selections, only HM-AIRL ensures both compatibility between items and consistency with the conditioning style description. In the outfits generated by the Bi-LSTM and PMI-based algorithms, the selected matching items actually belong to other styles.
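For reference, the MAP at top k reported above can be computed along the following lines. The paper does not state how average precision is normalized, so dividing by the smaller of k and the number of positives is an assumption, and the item identifiers are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def average_precision_at_k(ranked: Sequence[str], positives: Set[str], k: int) -> float:
    """Average the precision at each rank (within the top k) where a positive item appears."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in positives:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(positives), k)  # assumed normalization
    return precision_sum / denom if denom else 0.0


def mean_average_precision(queries: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Mean of the per-query average precision over all (ranking, positives) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision_at_k(r, p, k) for r, p in queries) / len(queries)


# Two hypothetical queries: ranked bottoms and the expert-approved positives.
ap_first = average_precision_at_k(["b1", "b2", "b3", "b4"], {"b1", "b3"}, k=4)
map_score = mean_average_precision(
    [(["b1", "b2", "b3", "b4"], {"b1", "b3"}), (["c1", "c2"], {"c2"})], k=4
)
```

Here the positives for each query would be the expert-demonstrated item together with its listed alternatives.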
In the experiments, we use the Adam optimizer with a batch size of 256 and a learning rate of 0.00005 for the HM-AIRL optimization. For the MVAE model used to learn the fashion item fusion representation, the image encoder and decoder follow the standard DCGAN architecture. The attribute encoder and decoder form a standard 3-layer fully connected VAE architecture. The two VAEs share an identical latent variable size of 256. The outfit style description encoder uses the base 12-layer BERT model pre-trained on a Chinese Wikipedia corpus with 21,128 unique Chinese characters, and we keep it fixed during training. The entire framework is implemented with PyTorch.
6 Remarks and Future Work
Fashion experts keep proposing novel styles that are appreciated and imitated by individuals who share the same taste or preference. In this work, we propose a framework to imitate fashion styles from outfit demonstrations. A hierarchical multimodal network is introduced to represent the whole outfit structure. Compared with other work, our method captures the latent contextual information behind a fashion style by learning both the joint representation of each item from its image and attributes and the compatibility and style consistency between items.
Relying on this hierarchical multimodal representation, we train the agent with an inverse reinforcement learning algorithm based on adversarial learning. Our approach builds upon a vast line of work on IRL. Hence, just like IRL, it does not interact with the expert during training, and it adapts training samples to improve learning efficiency. Our experiments show that HM-AIRL can learn the value function for imitating fashion styles and is robust to style shift.
In recommendation, user behaviour data such as browsing history is very important, and content-based analysis such as style suggestion is only one factor. In future work, we would like to integrate our framework into a full recommendation system and evaluate its performance. Note that the framework presented in this paper is not limited to fashion. Design artifacts in many domains contain latent concepts that can be expressed with sets of human-interpretable features capturing different levels of granularity. The model also offers attractive capabilities: it can infer latent abstract concepts and imitate experts from their demonstrations. We would also like to explore how this framework can empower applications in other domains such as interior design and architecture.
-  (2016) Guided cost learning: deep inverse optimal control via policy optimization. pp. . Cited by: §2.
-  (2017) Fashion forward: forecasting visual style in fashion. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pp. 388–397. External Links: Cited by: §2.
-  (2008) Visual complexity and aesthetic perception of web pages. In Special Interest Group for Design of Communication (SIGDOC), Cited by: §6.
-  (2015) Learning visual clothing style with heterogeneous dyadic co-occurrences. Cited by: §2.
-  (2010) Automatic attribute discovery and characterization from noisy web data. In ECCV, Cited by: §2.
-  (1924) Fashion imitation. Chapter 13 in Fundamentals of Social Psychology, pp. 151–167. Cited by: §1.
-  (1993) Signature verification using a ”siamese” time delay neural network. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS’93, San Francisco, CA, USA, pp. 737–744. External Links: Cited by: §2.
-  (2009) A survey of robot learning from demonstration. In Robotics and autonomous systems, Cited by: §2.
-  (2018) Generalized product of experts for automatic and principled fusion of gaussian process predictions. Cited by: §3.1.
- Deep domain adaptation for describing people based on fine-grained clothing attributes. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2019) POG: personalized outfit generation for fashion recommendation at alibaba ifashion. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD’19. External Links: Cited by: §2.
- Supervised learning of universal sentence representations from natural language inference data. Conference on Empirical Methods on Natural Language Processing (EMNLP), Cited by: §1.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Cited by: §1, §3.2.
-  (2016) Learning from humans. pp. 1995–2014. Cited by: §2.
- Multi-task curriculum transfer deep learning of clothing attributes. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). Cited by: §2.
-  (2018) Creating capsule wardrobes from fashion images. Cited by: §2.
-  (2017) Learning fashion compatibility with bidirectional lstms. Note: ACM Multimedia Cited by: §2, §2, §5.1, §5.2.
Training products of experts by minimizing contrastive divergence. Tranining. Cited by: §3.1.
-  (2017) Multi-label fashion image classification with minimal human supervision. 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2261–2267. Cited by: §2.
-  (2016) Generative adversarial imitation learning. Cited by: §4.
-  (2018) Learning robust rewards with adversarial inverse reinforcement learning. pp. . Cited by: §1, §2, §4.
-  (2013) Auto-encoding variational bayes. Cited by: §2.
-  (2008) Factorization meets the neighborhood: a multifaceted collaborative filtering model. Cited by: §5.2.
-  (2014) CommandSpace: modeling the relationships between tasks, descriptions and features. In ACM Symposium on User Interface Software and Technology, Cited by: §6.
A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. pp. . Cited by: §2, §4.
-  (2016-06) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018) Multimodal generative models for scalable weakly supervised learning. NeurIPS. Cited by: §3.1.
-  (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §5.2.
-  (2004) Apprenticeship learning via inverse reinforcement learning. Cited by: §2.
-  (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. Cited by: §5.2.
-  (2011) Fashion coordinates recommender system using photographs from fashion magazines. Cited by: §2.
-  (2008) Learning factorial codes by predictability minimization. Cited by: §2.
-  (2003) Learning a distance metric from relative comparisons. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS’03, Cambridge, MA, USA, pp. 41–48. External Links: Cited by: §2.
-  (2017) Compatibility family learning for item recommendation and generation. ArXiv abs/1712.01262. Cited by: §2.
Fashion style in 128 floats: joint ranking and classification using weak data for feature extraction. Cited by: §2.
-  (2017) NeuroStylist: neural compatibility modeling for clothing matching. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17. Cited by: §5.1.
-  (2017) NeuroStylist: neural compatibility modeling for clothing matching. In ACM Multimedia, Cited by: §2.
-  (2006) The fundamentals of fashion design. pp. . Cited by: §1.
-  (2015) Neuroaesthetics in fashion: modeling the perception of fashionability. Cited by: §2.
- Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: Figure 5, §5.2.
-  (2016) Automatic attribute discovery with neural activations. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, pp. 252–268. External Links: Cited by: §2.
-  (2017) Learning the latent “look”: unsupervised discovery of a style-coherent embedding from fashion images. Note: IEEE International Conference on Computer Vision, (ICCV) Cited by: §2, §2.
-  (2017) Mining fashion outfit composition using an end-to-end deep learning approach on set data. Cited by: §2.
-  (2015-09) Mix and match: joint model for clothing and attribute recognition. In Proceedings of the British Machine Vision Conference (BMVC), G. K. L. Tam (Ed.), pp. 51.1–51.12. External Links: Cited by: §2.
-  (2019) Interpretable fashion matching with rich attributes. In Proceedings of the 42Nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’19. Cited by: §5.1.
-  (2017-06) Memory-augmented attribute manipulation networks for interactive fashion search. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2008) Maximum entropy inverse reinforcement learning. In Proc. AAAI, pp. 1433–1438. Cited by: §2.
-  (2010) Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Cited by: §2, §4.