“Costly thy habit as thy purse can buy, But not expressed in fancy; rich, not gaudy, For the apparel oft proclaims the man.”
Fashion style tells a lot about one’s personality. With the fashion industry moving online, clothing fashion is becoming an increasingly popular topic among the general public. There have been a number of research studies on clothes retrieval and recommendation [6, 8, 10, 11, 16, 27], clothing category classification [3, 13, 17, 23, 28], attribute prediction [1, 2, 4, 29] and clothing fashion analysis [16, 19]. However, because the concept of fashion is often subtle and subjective, there is still no consensus on how a fashion outfit should be composed, and outfit composition remains an open problem.
present an approach by concatenating hand-crafted features into a vector as an embedding for each clothing item. The attribute features extracted by this approach are usually mixed with other attribute features. Very recently, Li et al. [15] present a deep neural network based method that adopts a mixed multi-modal embedding to represent an outfit item as a whole. Such deep neural network based embedding methods have two characteristics: 1) the embedding is unexplainable; 2) the embedding does not carry attribute information. Unfortunately, in many practical applications, designers, businesses and consumers need to understand the importance of different attributes in an outfit composition. That is to say, an interpretable and partitioned embedding is vital for a practical fashion outfit composition system.
To address the aforementioned problems, we propose a partitioned embedding network in this paper. The proposed network architecture consists of three components: an auto-encoder module, a supervised attributes module, and a multi-independent module. The auto-encoder module serves to encode all useful information into the embedding. In the supervised attributes module, multiple attribute labels are adopted to ensure that different parts of the overall embedding correspond to different attributes. In the multi-independent module, a mutual independence constraint ensures that each part of the embedding relates only to its corresponding attribute. Then, considering that matching items may appear multiple times in different outfits, we propose a fashion composition graph to model the matching relationships in outfits with the extracted partitioned embeddings. Meanwhile, an attribute matching map, which learns the importance of different attributes in the composition, is also built. A comparison between the framework of our model and that of the multi-modal embedding based method [15] is shown in Figure 1.
To summarize, our work has three primary contributions: 1) We present a partitioned embedding network, which can extract interpretable embeddings of fashion outfit items. 2) We put forward a weakly-supervised fashion outfit composition model, which depends solely on a large number of outfits and, unlike other methods, does not require quality scores for the outfits. Besides, our model can be extended to a dataset with annotated quality scores. 3) We give an iterative and customized fashion outfit composition scheme. Since fashion trends change direction quickly and dramatically, our model can keep up with them by easily incorporating new fashion outfit datasets.
2 Related Work
Embedding Methods. In recent years, several models [19, 23, 26] have been proposed that can be used to get embeddings of clothes. Vittayakorn et al. [26] extract five basic features (color, texture, shape, parse and style) of outfit appearance and concatenate them to form a vector representing the outfit. They use the vector to learn outfit similarity and analyze visual trends in fashion. Matzen et al. [19] adopt a deep learning method to train several attribute classifiers and use high-level features of the trained classifiers to create a visual embedding of clothing style. Then, using the embedding, millions of Instagram photos of people sampled worldwide are analyzed to study spatio-temporal trends in clothing around the globe. Simo-Serra et al. [23] train a classifier network with ranking as the constraint to extract discriminative feature representations and also use high-level features of the classifier network as embeddings of fashion style. Other embedding methods related to our work include the Auto-Encoder [bengio2009learning], the Variational Auto-Encoder (VAE) [14] and the Predictability Minimization (PM) model [21]. The Auto-Encoder and VAE are used to encode embeddings from unlabeled data. The encoded embeddings usually contain mixed and unexplainable features of the original images. To get partitioned embeddings, Schmidhuber adopted an adversarial operation to obtain embeddings whose units are independent. However, the independent units are not meaningful.
Fashion Outfit Composition. As described above, due to the difficulty of modeling outfit composition, there are few works [12, 15, 22] studying fashion outfit composition. Iwata et al. [12] propose a topic model to recommend “Tops” for “Bottoms”. The goal of this work is to compose fashion outfits automatically, which is challenging because many aspects of fashion outfits, such as compatibility and aesthetics, must be modeled. Veit et al. [25] use a Siamese Convolutional Neural Network (CNN) architecture to learn clothing matching from the Amazon co-purchase dataset, focusing on the representative problem of learning compatible clothing style. Simo-Serra et al. [22] introduce a Conditional Random Field (CRF) to learn the different outfits, types of people and settings. The model is used to predict how fashionable a person looks in a particular photograph. Li et al. [15] use quality scores as the label and multi-modal embeddings as features to train a grading model. The framework of Li’s approach is shown in Figure 1(a). However, the mixed multi-modal embedding used by Li’s model is unexplainable, which leads to unexplainable outfit composition.
3 The Proposed Method
To overcome the limitations of existing fashion outfit composition approaches, we propose an interpretable partitioned embedding method to achieve customized fashion outfit composition. In Section 3.1, we first present how the partitioned embedding network partitions an embedding into interpretable parts that correspond to different attributes. Then, in Section 3.2, we introduce how composition relationships are built in our proposed composition graph and attribute maps using the interpretable and partitioned embedding.
3.1 Interpretable Partitioned Embedding
Let $I$ denote an item of an outfit and $R$ denote the encoded embedding of item $I$. To make the extracted embedding interpretable and partitionable, two constraints should be satisfied: on the one hand, fixed parts of the whole embedding should correspond to specific attributes; on the other hand, different parts of the whole embedding should be mutually independent. Thus, the attribute embedding of item $I$ can be described as below:

$$R = [r_1, r_2, \dots, r_N], \quad \text{s.t.}\ P(r_i \mid \{r_j, j \neq i\}) = P(r_i),$$

where $r_1, \dots, r_N$ correspond to different parts of the embedding and $N$ is the total number of attributes. The condition $P(r_i \mid \{r_j, j \neq i\}) = P(r_i)$ implies that $r_i$ does not depend on $\{r_j, j \neq i\}$. Figure 2 shows the framework of the interpretable partitioned embedding network. The whole embedding network is an auto-encoder network which embeds items of outfits into another feature space. The encoder is composed of $N$ attribute encoder networks $E_k$ ($k = 1, \dots, N$), each of which embeds an original item into one part $r_k$ of the whole embedding. The decoder network decodes the whole embedding back to the original image. In the process of decoding, the attribute networks and labels serve to learn useful features that are related to the corresponding attributes. Meanwhile, the mutual independence constraint is taken into account, which not only ensures that each part of the embedding is related solely to its corresponding attribute but also extracts the embedding of indefinite attributes, such as texture and style. As a consequence, the loss of the partitioned embedding network is defined as follows:
$$L = L_{ae}(I, I') + \alpha L_{att} + \beta L_{ind},$$

where $\alpha$ and $\beta$ are balancing parameters, $I'$ is the decoded image of item $I$, $L_{ae}$ is the auto-encoder loss, $L_{att}$ is the sum of the auxiliary attribute losses $loss(I, Label_k)$, and $L_{ind}$ is the mutually independent loss. Inspired by the adversarial approach in [21], we adopt an adversarial operation to meet the mutual independence constraint. The mutually independent loss includes a predicting loss $L_{pre}$ and an encoding loss $L_{enc}$. In the prediction stage, each prediction net $P_k$ tries to predict the corresponding embedding part $r_k$ as accurately as possible, that is, to maximize predictability. The predicted loss can be defined as:
where $f_k$ is the function represented by prediction net $P_k$. In the encoding stage, all encoder nets try to make every prediction net fail to predict the corresponding embedding part, which means minimizing predictability. The encoded loss can thus be defined as:
In summary, the auto-encoder loss makes sure that the whole embedding contains all the information in an item of an outfit. The summed attribute loss serves to learn useful information related to the corresponding attributes. The mutually independent loss ensures that different parts of the embedding are related only to their corresponding attributes.
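The interplay of the three loss terms can be illustrated with a minimal NumPy sketch. It rests on assumptions that the text does not fix: the predictors are scored with a squared error (as is common in predictability minimization), the total loss is a weighted sum, and the encoder side of the independence term is simply the negated prediction error. The names `mse`, `prediction_loss`, `encoding_loss`, `total_loss`, `alpha` and `beta` are hypothetical.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))


def prediction_loss(parts, predictors):
    """Error of each predictor when guessing part k from all the other parts."""
    total = 0.0
    for k, predict in enumerate(predictors):
        others = np.concatenate([p for j, p in enumerate(parts) if j != k])
        total += mse(predict(others), parts[k])
    return total


def encoding_loss(parts, predictors):
    """Encoders want the predictors to fail, so they minimise the negated error."""
    return -prediction_loss(parts, predictors)


def total_loss(image, decoded, attribute_losses, parts, predictors, alpha, beta):
    """Assumed weighted sum: reconstruction + alpha * attributes + beta * independence."""
    l_ae = mse(decoded, image)
    l_att = float(sum(attribute_losses))
    l_ind = encoding_loss(parts, predictors)
    return l_ae + alpha * l_att + beta * l_ind


# Toy usage: three embedding parts and predictors that always output zeros.
parts = [np.array([1.0, 2.0]), np.array([3.0]), np.array([4.0, 5.0])]
predictors = [lambda o: np.zeros(2), lambda o: np.zeros(1), lambda o: np.zeros(2)]
print(prediction_loss(parts, predictors), encoding_loss(parts, predictors))
```

In an actual training loop the two adversarial terms would be optimized in alternation, the predictors on the prediction error and the encoders on its negation; the sketch only shows how the quantities relate.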
As aforementioned, there is a large number of attributes [17, 19, 26] for describing fashion, such as category, color, shape, texture, style and so on. Considering that some attributes are indefinable (such as texture and style), we group those attributes into a single class, the remaining attributes. The information of the remaining attributes can be extracted with the mutual independence constraint. To get the importance of attributes in fashion clothing, we administer a questionnaire among 30 professional stylists. According to the results of the questionnaire, we split the attributes into 4 classes and rank them as follows: category, color, shape, and remaining attributes (texture, style and others). In the experiment, we split the whole dataset into different groups according to category, so the number of attributes $N$ equals 3. $E_1$, $E_2$ and $E_3$ are the encoding networks of the color attribute, shape attribute and remaining attributes (texture, style and so on), respectively. $A_1$, $A_2$ and $A_3$ are the supervised networks of the color attribute, shape attribute and remaining attributes, respectively. $P_1$, $P_2$ and $P_3$ are the partitioning networks of the color attribute, shape attribute and remaining attributes, respectively.
Color embedding extraction. As described above, color is a primary element in fashion outfit composition. To get the color label, we adopt a recently proposed color theme extraction method [7] to extract color themes. This method can extract ranked color themes of large span and of any number. We modify it to extract colors only from the foreground area of the item, and adopt the top-5 extracted color themes of an item as its color label. In the experiment, we adopt generative adversarial nets (GAN) [9] to extract the color embedding. The color attribute supervised network is therefore a GAN architecture, which includes a generative color model $G_c$ and a color discriminative model $D_c$. The generative model $G_c$ outputs the corresponding color themes, and the discriminative model $D_c$ takes color themes as input and outputs a discriminant value. The architectures of these two networks are summarized in Table 1.
Shape embedding extraction. To get an explicit mapping between shape and embedded codes, we adopt a Variational Auto-Encoder model [14] to encode and decode shape information. We conduct a toy experiment to validate that the shape network can encode all shapes in the shape space. The toy VAE network encodes the original hat items into latent embeddings with two parameters. Then, by uniformly sampling the two parameters, we obtain the decoded shapes shown in Figure 3, where we can see that almost all kinds of hat shapes are included. Meanwhile, shape information is encoded and decoded with relatively satisfactory quality. For each item of an outfit, we use basic open-close operations and thresholding to get the mask of the item as the shape label. The shape attribute network outputs the corresponding shape mask.
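The mask extraction step can be sketched as follows. This is only an illustration under stated assumptions: product images are taken to have a near-white background, the threshold `tol` and the 3x3 structuring element are arbitrary choices, and all function names are hypothetical.

```python
import numpy as np


def _neighbourhoods(mask):
    """Stack the 3x3 neighbourhood of every pixel into a (9, H, W) boolean array."""
    h, w = mask.shape
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    return np.stack([padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)])


def dilate(mask):
    """A pixel becomes foreground if any pixel in its 3x3 neighbourhood is."""
    return _neighbourhoods(mask).any(axis=0)


def erode(mask):
    """A pixel stays foreground only if its whole 3x3 neighbourhood is."""
    return _neighbourhoods(mask).all(axis=0)


def shape_mask(image, background=255, tol=10):
    """Threshold an H x W x 3 image, then apply an opening followed by a closing."""
    gray = np.asarray(image, dtype=float).mean(axis=2)
    mask = np.abs(gray - background) > tol   # foreground = not close to the background
    mask = dilate(erode(mask))               # opening removes isolated specks
    mask = erode(dilate(mask))               # closing fills small holes
    return mask
```

Running the opening before the closing first discards background noise and then repairs small gaps inside the item, so the resulting mask is a cleaner shape label than the raw threshold.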
Remaining attributes embedding extraction. As described above, indefinable attributes (texture, style and others) are grouped into a single class, the remaining attributes. To extract the corresponding information of the remaining attributes, the mutual independence constraint is taken into account. In this article, an adversarial method is adopted to realize the mutual independence constraint. For each attribute embedding part $r_k$, there is a prediction network $P_k$ that ensures $r_k$ is independent of all other parts $r_j$ ($j \neq k$). All the prediction networks share the same architecture. Each prediction network $P_k$ takes the other embedding parts as input and outputs a prediction of $r_k$.
$Ck$ denotes a Convolution-BatchNorm-ReLU layer with $k$ filters. $CDk$ denotes a Convolution-BatchNorm-Dropout-ReLU layer with a dropout rate of 30% and $k$ filters. $DCk$ denotes a deconvolution-ReLU layer with $k$ filters, and $FCk$ denotes a fully connected layer with $k$ neurons. All convolutions are 4×4 spatial filters with stride 2. Convolutions in both the encoder and the discriminator downsample by a factor of 2, whereas in the decoder they upsample by a factor of 2. The input image size is 128×128×3. All the network architectures are summarized in Table 1.
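The factor-of-2 resampling follows from the standard output-size arithmetic for 4×4 kernels with stride 2. The short sketch below works this out for a 128-pixel input; the padding of 1 and the number of layers are assumptions, since the text does not state them, and the function names are hypothetical.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of the matching transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel


def encoder_sizes(size, n_layers):
    """Spatial sizes after each of n_layers stacked stride-2 convolutions."""
    sizes = []
    for _ in range(n_layers):
        size = conv_out(size)
        sizes.append(size)
    return sizes


print(encoder_sizes(128, 5))
print(deconv_out(64))
```

With these settings every convolution halves the spatial resolution and every deconvolution doubles it, which matches the downsampling and upsampling behaviour described above.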
3.2 Fashion Outfit Composition
Considering that some matching items may appear many times in different outfits, we propose a fashion composition graph to model the matching relationships in outfits. For each category, we first cluster all items into $K_c$ cluster centers according to the color embedding. Then, for each color cluster, the items that belong to it are clustered into $K_s$ cluster centers according to the shape embedding. Lastly, the items belonging to each shape cluster center are clustered into $K_r$ cluster centers according to the remaining embedding. In this way, we obtain the final cluster centers $v$. After getting all cluster centers, the fashion composition graph is defined as follows:

$$G = (V, E, W),$$

where $V$ is the vertex set of all cluster centers $v$, $E$ is the edge set and $W$ is the weight set.
At the initial stage, no vertices are connected and all weights are equal to zero. If item $I_i$ and item $I_j$ appear in the same outfit and there is no connection between their vertices $v_i$ and $v_j$, a connection is created and its weight $w_{ij}$ is set to one. If the corresponding vertices $v_i$ and $v_j$ are already connected, the weight between them is updated as follows:
where $w'_{ij}$ is the weight in the last stage. Figure 4 shows an example of the connection process.
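The construction of the composition graph can be sketched with a plain dictionary of edge weights. The sketch assumes that each further co-occurrence adds a fixed increment of one to the existing weight, which is one natural reading of the update rule but is not confirmed by the text; `build_composition_graph`, `vertex_of` and the string vertex ids are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_composition_graph(outfits, vertex_of, increment=1.0):
    """Return {(u, v): weight} for cluster-center vertices that co-occur in outfits.

    outfits   : iterable of lists of item ids
    vertex_of : dict mapping an item id to the id of its final cluster center
    increment : assumed amount added per co-occurrence (an assumption, not from the text)
    """
    weights = defaultdict(float)          # missing edges start with weight zero
    for outfit in outfits:
        vertices = sorted({vertex_of[item] for item in outfit})
        for u, v in combinations(vertices, 2):
            weights[(u, v)] += increment  # first co-occurrence gives weight one
    return dict(weights)


outfits = [["top1", "skirt1"], ["top2", "skirt1"], ["top1", "skirt2", "hat1"]]
vertex_of = {"top1": "tops/0", "top2": "tops/0", "skirt1": "bottoms/3",
             "skirt2": "bottoms/4", "hat1": "hats/1"}
graph = build_composition_graph(outfits, vertex_of)
print(graph)
```

Because `top1` and `top2` fall into the same cluster center, both outfits reinforce the same edge, which is how the graph accumulates evidence from items that never appear together directly.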
Attribute Matching Map. After obtaining the interpretable and partitioned embeddings, for each category, the parts of the embeddings corresponding to each attribute class are clustered into several clusters. In the process of building the fashion outfit graph $G$, an attribute composition score map $M$ over the different attributes is also built, which is defined as:
where $C$ is the number of categories; $K_c$, $K_s$ and $K_r$ are the numbers of clusters of the color feature, shape feature and remaining features, respectively; $c^{p}_{i}$ is the $i$-th color attribute cluster of category $p$; $h^{p}_{j}$ is the $j$-th shape attribute cluster of category $p$; $o^{p}_{k}$ is the $k$-th other attribute cluster of category $p$; and $S_c$, $S_s$ and $S_r$ are the sets of score values, all of which are initially zero. When an item of category $p$ and an item of category $q$ appear together in the same outfit, the score value between color cluster $c^{p}_{i}$ and color cluster $c^{q}_{j}$ is updated using the following equation:
where $s'$ is the score value in the last stage. The shape and remaining attribute scores are updated in the same way.
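One way to picture the attribute matching map is as three separate score tables, one per attribute, each keyed by a pair of (category, cluster) entries. The sketch below assumes a unit increment per co-occurrence, since the update equation is not available here; `build_attribute_map`, `clusters_of` and `ATTRIBUTES` are hypothetical names.

```python
from collections import defaultdict
from itertools import combinations

ATTRIBUTES = ("color", "shape", "remaining")


def build_attribute_map(outfits, clusters_of):
    """Return {attribute: {((cat, cluster), (cat, cluster)): score}}.

    clusters_of[item] is (category, {"color": i, "shape": j, "remaining": k}).
    A unit increment per co-occurrence is an assumption of this sketch.
    """
    scores = {attr: defaultdict(float) for attr in ATTRIBUTES}
    for outfit in outfits:
        for a, b in combinations(outfit, 2):
            (cat_a, cl_a), (cat_b, cl_b) = clusters_of[a], clusters_of[b]
            for attr in ATTRIBUTES:
                key = tuple(sorted([(cat_a, cl_a[attr]), (cat_b, cl_b[attr])]))
                scores[attr][key] += 1.0
    return scores


clusters_of = {
    "top1": ("tops", {"color": 0, "shape": 1, "remaining": 2}),
    "bag1": ("bags", {"color": 5, "shape": 1, "remaining": 0}),
}
attribute_map = build_attribute_map([["top1", "bag1"], ["bag1", "top1"]], clusters_of)
print(attribute_map["color"])
```

Keeping one table per attribute is what makes the final score interpretable: the contribution of color, shape and the remaining attributes to a match can be read off separately.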
After getting the outfit composition graph $G$ and the attribute map $M$, the composition score of an outfit $O$ is defined as follows:
where $S_G$ is the matching score in the composition graph, $S_M$ is the attribute matching score, $n$ is the number of matching weights in outfit $O$, and $\sigma^2$ is the variance among the matching weights. To get a matching outfit for an item $I$, an exhaustive search is adopted to find the best-matching outfit. To reduce time complexity, only the top connected items of each category are taken into account.
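The pruned exhaustive search can be sketched as below. The scoring function here is a deliberate simplification: it sums pairwise graph weights only and stands in for the full composition score, which also involves the attribute matching scores and the variance term. `best_outfit`, `weight` and `top_k` are hypothetical names.

```python
from itertools import product


def best_outfit(query, candidates, weight, top_k=3):
    """Search all combinations of the top_k best-connected candidates per category.

    query      : vertex id of the given item
    candidates : {category: [vertex id, ...]} for the other categories
    weight     : function (u, v) -> edge weight in the composition graph
    """
    shortlists = []
    for items in candidates.values():
        ranked = sorted(items, key=lambda v: weight(query, v), reverse=True)
        shortlists.append(ranked[:top_k])     # prune to the top connected items

    best, best_score = None, float("-inf")
    for combo in product(*shortlists):        # exhaustive over the pruned lists
        outfit = (query,) + combo
        # Placeholder score: sum of all pairwise weights inside the outfit.
        score = sum(weight(u, v) for i, u in enumerate(outfit) for v in outfit[i + 1:])
        if score > best_score:
            best, best_score = outfit, score
    return best, best_score


W = {frozenset(("t", "b1")): 3.0, frozenset(("t", "b2")): 1.0,
     frozenset(("t", "s1")): 2.0, frozenset(("t", "s2")): 2.5,
     frozenset(("b1", "s1")): 4.0, frozenset(("b2", "s2")): 1.0}
result = best_outfit("t", {"bottoms": ["b1", "b2"], "shoes": ["s1", "s2"]},
                     lambda u, v: W.get(frozenset((u, v)), 0.0), top_k=2)
print(result)
```

With `top_k` candidates in each of the remaining categories, the search visits `top_k` to the power of the number of categories combinations instead of every item combination, which is what keeps the exhaustive strategy tractable.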
4 Experiments
4.1 Implementation Details
Dataset. In the experiment, we collect a dataset from Polyvore.com, the most popular fashion-oriented website in the US, where tens of thousands of fashion outfits are created every day. In the dataset, each fashion outfit is associated with a number of likes and some fashion items. Each fashion item has a corresponding image, category and number of likes. In practical applications, an outfit may contain many items, such as tops (blouse, cardigan, sweater, sweatshirt, hoodie, tank, tunic), bottoms (pant, skirt, short), hats, bags, shoes, glasses, watches, jewelry and so on. In this article, we choose five prerequisite categories (tops, bottoms, hats, bags and shoes). We perform some simple filtering over the raw dataset by discarding items whose images also contain a human body. Finally, we get a set of 1.56 million outfits and 7.53 million items.
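Setting. In our experiment, the balancing parameters $\alpha$ and $\beta$ are 2 and 0.7, respectively. The number of cluster centers for the color attribute is 1000. The number of cluster centers for the shape attribute is set from the total number of items in the corresponding color cluster, and the number of cluster centers for the remaining attributes is set from the total number of items in the corresponding shape cluster.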
4.2 Outfit Composition’s Psycho-Visual Tests
To verify the validity of our proposed method, we conduct a pairwise comparison study. In this experiment, a total of one hundred items are used, 20 items for each category. For each item, each method produces an outfit. Thirty professional stylists, 13 males and 17 females, take part in the psychophysical experiment. In each paired comparison, a stylist is presented with a pair of recommended outfits and asked to choose the better-matching composition. All pairs are presented twice to reduce errors due to individual preferences. The number of pairs is determined by $m$, the number of composition methods, and $n$, the total number of test outfits. For each pair, the outfit chosen by an observer is given a score of 1 and the other outfit a score of 0; both receive a score of 0.5 when they are judged equally matching. After obtaining the raw data, we convert it to a frequency matrix $F$, where the entry $f_{ij}$ denotes the score given to composition method $i$ as compared with composition method $j$. From matrix $F$, the percentage matrix $P$ is calculated through $P = F / N_{obs}$, where $N_{obs}$ denotes the total number of observations. The matrix obtained in our experiment is given in the table.
To get preference scales from the matrix $F$, we apply Case V of Thurstone’s law of comparative judgment [24], following Morovic’s thesis [20]. Firstly, a logistic matrix $LG$ is calculated through Eq.(10) as

$$LG_{ij} = \ln\left(\frac{f_{ij} + c}{N_{obs} - f_{ij} + c}\right),$$

where $f_{ij}$ is an entry of $F$, $N_{obs}$ is the total number of observations, and $c$ is an arbitrary additive constant (0.5 was suggested by Bartleson (1984) and is also used in our experiment). The logistic matrix $LG$ is then transformed to a $z$-score matrix $Z$ by a simple scaling of the form $Z = \lambda \, LG$, where the coefficient $\lambda$ is determined to be 0.5441 for the number of observations made for each pair of algorithms. From the matrix $Z$, an accuracy score of composition method $j$ is calculated by averaging the scores of the $j$-th column of the matrix $Z$. The final matching scores lie in an interval around 0, and a confidence interval can be estimated for each of them. The resulting $z$-scores and confidence intervals for all outfits are shown in Figure 6. It is evident that our model outperforms the other methods and that the auto-encoder is the least suitable. The matching scores of the five composition methods are -0.7123, -0.1546, 0.1227, 0.2655 and 0.4788, respectively, each reported with its confidence interval. Figure 5 gives a direct visual comparison of representative composed outfits. From the figure, we can see that our method produces more satisfactory composition results, and that the method with score labels gives more reasonable compositions. Overall, our method produces better outfit compositions than the other methods.
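The scaling procedure can be sketched in a few lines of NumPy. The sketch assumes the convention that `freq[i, j]` counts how often method `j` was preferred over method `i`, so that a column average is the score of method `j`; the source does not make the index convention explicit. The coefficient 0.5441 is the one quoted above and is tied to the number of observations in that experiment, so reusing it with the toy `n_obs` below is purely illustrative. `thurstone_case_v` is a hypothetical name.

```python
import numpy as np


def thurstone_case_v(freq, n_obs, c=0.5, scale=0.5441):
    """Column-averaged z-scores from a paired-comparison frequency matrix.

    freq[i, j] : assumed to count how often method j was preferred over method i
    n_obs      : number of observations per pair of methods
    c          : additive constant of the logistic transform
    scale      : coefficient mapping logistic values to z-scores
    """
    freq = np.asarray(freq, dtype=float)
    logistic = np.log((freq + c) / (n_obs - freq + c))
    z = scale * logistic
    np.fill_diagonal(z, 0.0)   # a method is never compared with itself
    return z.mean(axis=0)


# Toy data: three methods, ten observations per pair.
freq = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
scores = thurstone_case_v(freq, n_obs=10)
print(scores)
```

Because the logistic transform is antisymmetric when the two counts of a pair sum to `n_obs`, the resulting scores are centered on zero, which is consistent with the scores lying in an interval around 0.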
4.3 Validity of Attribute Embedding
In our experiment, we adopt a clustering method to validate the discriminative ability of the encoded attribute embeddings. After training the whole network, we use the encoding network to extract the embeddings of items. Then, we use the $k$-means method to cluster the part of the embedding corresponding to each attribute. Figure 7 gives some randomly picked visual clustering results. Figure 7(a) shows that tops in the same class are similar in color, which indicates that the color parts of the extracted embeddings are discriminative. In Figure 7(b), items that have very similar shapes are clustered into the same class. To verify the effectiveness of the mutual independence constraint, we use the remaining-attribute parts of the embeddings to cluster bag items. From Figure 7(c), we can observe that bags in the same class usually have some attributes in common. Bags in class 1 and class 2 have similar texture and bags in class 3 have similar styles, which demonstrates that the remaining feature includes other useful attribute features. Meanwhile, bags in the same class usually vary in shape and color, which indicates that the mutual independence constraint works.
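Clustering a single attribute part of the embedding can be sketched with a small NumPy implementation of k-means. The embedding layout in `PARTS` (which dimensions belong to which attribute) is invented for the example, and `kmeans` and `cluster_by_attribute` are hypothetical names rather than the implementation used in the experiments.

```python
import numpy as np


def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()

    def assign(current_centers):
        dists = np.linalg.norm(points[:, None, :] - current_centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        for j in range(k):
            members = points[labels == j]
            if len(members):               # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return assign(centers), centers


# Hypothetical layout: which slice of the whole embedding belongs to which attribute.
PARTS = {"color": slice(0, 2), "shape": slice(2, 4)}


def cluster_by_attribute(embeddings, attribute, k):
    """Cluster items using only the embedding dimensions of one attribute."""
    part = np.asarray(embeddings, dtype=float)[:, PARTS[attribute]]
    return kmeans(part, k)[0]


embeddings = [[0, 0, 5, 5], [0, 1, -5, 9], [10, 10, 5, 5], [10, 11, -5, 9]]
print(cluster_by_attribute(embeddings, "color", 2))
print(cluster_by_attribute(embeddings, "shape", 2))
```

In the toy data the color part groups items 0 and 1 together while the shape part groups items 0 and 2, illustrating how the same items can be clustered differently depending on which partition of the embedding is used.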
4.4 Personalized Attributes Composition
In the stage of building the composition graph, we cluster the attribute parts of the embeddings into cluster centers at different levels, which are set as vertices in the graph. In the testing stage, users can specify their preferred attributes, such as “color=red” and “shape=circle”, and our model then gives a series of items with the specified attributes, together with interpretable matching scores, for users to choose from. Figure 5 shows some ordered recommended matching items with five composition scores. For example, the sum composition score of the recommended hat in the purple dashed box is 51.87. The weight composition score in the blue box is 16.45. The attribute matching scores corresponding to color, shape and remaining attributes are 14.36, 11.74 and 9.32, respectively. From the attribute matching scores, we can see that the color attribute and shape attribute contribute more to the final composition than the remaining attributes, which is consistent with the result of our questionnaire. For the specified color attribute, the recommended items are nearly indistinguishable in color. For the specified shape, all recommended items have similar shapes that are in line with the originally specified shape.
4.5 Outfit Composition’s Extension
As described above, our fashion composition graph is weakly supervised: it depends only on the outfit dataset and needs no matching score labels. Our model can also be extended to a dataset with matching score labels. After the users' favorite scores are normalized, the fashion composition graph with scores is built using Eq.(6) with the normalized score of the corresponding outfit. Outfit composition results with score labels are given in Figure 5, Figure 6 and Table 2.
Fashion trends are known for their quick and dramatic variation. The proposed technique enables us to keep up with fashion trends over time. When a new trend outfit dataset joins, each old weight $w'$ of the composition graph is divided by $w_{max}$, where $w_{max}$ is the maximum weight of the graph. Then, the fashion composition graph is updated following Eq.(6) with the rescaled weights. In the experiments, we split our whole outfit dataset into four parts according to year: dataset1 (2007-2009), dataset2 (2010-2012), dataset3 (2013-2015) and dataset4 (2016-2017). We use our proposed iterated model with dataset1 (2007-2009) as the basic dataset to get the initial outfit composition graph. Then, by successively adding each new dataset to the basic dataset, we obtain a new outfit composition graph that keeps up with the fashion trend. Figure 9 visualizes the most popular outfits over the years. We can see that our iterated model keeps up with fashion trends closely.
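The iterative update can be sketched as a rescaling step followed by the usual edge updates. As before, the unit increment per new co-occurrence is an assumption of the sketch, and `rescale_weights` and `update_with_new_outfits` are hypothetical names.

```python
def rescale_weights(weights):
    """Divide every old weight by the maximum weight of the graph."""
    w_max = max(weights.values())
    return {edge: w / w_max for edge, w in weights.items()}


def update_with_new_outfits(weights, new_edges, increment=1.0):
    """Damp the old graph, then add the co-occurrences of the new dataset.

    weights   : {(u, v): weight} built from the older datasets
    new_edges : iterable of (u, v) vertex pairs co-occurring in new outfits
    increment : assumed amount added per co-occurrence (an assumption)
    """
    updated = rescale_weights(weights) if weights else {}
    for edge in new_edges:
        updated[edge] = updated.get(edge, 0.0) + increment
    return updated


old = {("a", "b"): 4.0, ("a", "c"): 1.0}
new = update_with_new_outfits(old, [("a", "c"), ("c", "d")])
print(new)
```

After rescaling, no old edge weighs more than one, so a single co-occurrence in the new data is already comparable to the strongest old relationship, which lets recent trends take over quickly.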
5 Conclusion
Fashion style tells a lot about one’s personality. Interpretable embeddings of items are vital in design, marketing and clothing retrieval. In this paper, we propose a partitioned embedding network, which can extract interpretable embeddings from items. With attribute labels as the constraint, different parts of the embedding are constrained to relate to their corresponding attributes. With the multi-independent constraint, different parts of the embedding are constrained to relate only to their corresponding attributes. Then, using the extracted partitioned embeddings, a composition graph with an attribute matching map is built. When users specify their preferred attributes, such as color and shape, our model can recommend desirable outfits with interpretable attribute matching scores. Meanwhile, extensive experiments demonstrate that the interpretable and partitioned embedding helps designers, businesses and consumers to better understand composition in outfits. In applications, people’s skin color and stature have great influence on outfit composition. Thus, personalization will be taken into consideration in our future work. Furthermore, modeling straightforward composition relationships among items is another direction of our future work.
References
[1] K. Abe, T. Suzuki, S. Ueta, A. Nakamura, Y. Satoh, and H. Kataoka. Changing fashion cultures. arXiv preprint arXiv:1703.07920, 2017.
[2] L. Bossard, M. Dantone, C. Leistner, C. Wengert, T. Quack, and L. Van Gool. Apparel classification with style. In Asian Conference on Computer Vision, pages 321–335. Springer, 2012.
[3] H. Chen, A. Gallagher, and B. Girod. Describing clothing by semantic attributes. Computer Vision–ECCV 2012, pages 609–623, 2012.
[4] Q. Chen, J. Huang, R. Feris, L. M. Brown, J. Dong, and S. Yan. Deep domain adaptation for describing people based on fine-grained. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5315–5324, 2015.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
[6] W. Di, C. Wah, A. Bhardwaj, R. Piramuthu, and N. Sundaresan. Style finder: Fine-grained clothing style detection and retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 8–13, 2013.
[7] Z. Feng, W. Yuan, C. Fu, J. Lei, and M. Song. Finding intrinsic color themes in images with human visual perception. Neurocomputing, 2017.
[8] J. Fu, J. Wang, Z. Li, M. Xu, and H. Lu. Efficient clothing retrieval with semantic-preserving visual phrases. In Asian Conference on Computer Vision, pages 420–431. Springer, 2012.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[10] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, and T. L. Berg. Where to buy it: Matching street clothing photos in online shops. In Proceedings of the IEEE International Conference on Computer Vision, pages 3343–3351, 2015.
[11] J. Huang, R. S. Feris, Q. Chen, and S. Yan. Cross-domain image retrieval with a dual attribute-aware ranking network. In Proceedings of the IEEE International Conference on Computer Vision, pages 1062–1070, 2015.
[12] T. Iwata, S. Watanabe, and H. Sawada. Fashion coordinates recommender system using photographs from fashion magazines. In IJCAI, volume 22, page 2262, 2011.
[13] M. H. Kiapour, K. Yamaguchi, A. C. Berg, and T. L. Berg. Hipster wars: Discovering elements of fashion styles. In European Conference on Computer Vision, pages 472–488. Springer, 2014.
[14] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[15] Y. Li, L. Cao, J. Zhu, and J. Luo. Mining fashion outfit composition using an end-to-end deep learning approach on set data. IEEE Transactions on Multimedia, 2017.
[16] Q. Liu, S. Wu, and L. Wang. Deepstyle: Learning user preferences for visual recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 841–844. ACM, 2017.
[17] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1096–1104, 2016.
[18] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[19] K. Matzen, K. Bala, and N. Snavely. Streetstyle: Exploring world-wide clothing styles from millions of photos. arXiv preprint arXiv:1706.01869, 2017.
[20] J. Morovic. To develop a universal gamut mapping algorithm. 1998.
[21] J. Schmidhuber. Learning factorial codes by predictability minimization. Learning, 4(6), 2008.
[22] E. Simo-Serra, S. Fidler, F. Moreno-Noguer, and R. Urtasun. Neuroaesthetics in fashion: Modeling the perception of fashionability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 869–877, 2015.
[23] E. Simo-Serra and H. Ishikawa. Fashion style in 128 floats: Joint ranking and classification using weak data for feature extraction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 298–307, 2016.
[24] L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273, 1927.
[25] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the IEEE International Conference on Computer Vision, pages 4642–4650, 2015.
[26] S. Vittayakorn, K. Yamaguchi, A. C. Berg, and T. L. Berg. Runway to realway: Visual analysis of fashion. In Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, pages 951–958. IEEE, 2015.
[27] X. Wang and T. Zhang. Clothes search in consumer photos via color matching and attribute learning. In Proceedings of the 19th ACM International Conference on Multimedia, pages 1353–1356. ACM, 2011.
[28] K. Yamaguchi, M. H. Kiapour, L. E. Ortiz, and T. L. Berg. Parsing clothing in fashion photographs. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 3570–3577. IEEE, 2012.
[29] K. Yamaguchi, T. Okatani, K. Sudo, K. Murasaki, and Y. Taniguchi. Mix and match: Joint model for clothing and attribute recognition. In BMVC, pages 51–1, 2015.