Algorithmic clothing frameworks have routinely been operationalised using recommender systems that utilise deep convolutional networks for feature representation and prediction (Sengupta et al., 2017b). Such content-based image retrieval amounts to recommending items to users who might have similar styles (contemporary vs. retro, etc.), similar patterns (stripes vs. polka dots, etc.) or colours. Often, frameworks rely on recommending not a single item but a pair (or triplet, etc.) of products. For the fashion vertical, this would mean recommending, for example, what trousers to wear with a given shirt. Yet, service providers often do not have access to a consumer’s behaviour – be it click-through rate, prior purchases, etc. This means the recommender system generally suffers from a cold-start problem, resulting in incorrect recommendations. This problem is intrinsic to many collaborative recommendation algorithms, which are built on the a priori assumption that a user preference database is available.
In this paper, we describe a scalable method to recommend fashion inventory (tops, trousers, etc.) wherein user preferences are learnt from experts (oracles), which we accumulate using images from fashion blogs; we call these ‘street-style images’. Utilising deep neural networks enables us to parse such images into a high-dimensional feature representation, allowing us to recommend a pair from a high-dimensional user preference tensor that can be constrained to exist in a retailer’s database. Whilst the database presented in this paper has a 2-D form, utilising a deep neural network enables us to learn an n-D tensor. This helps in recommending multiple items that have a structure – for example, predicting not only whether a pair of trousers fits well with a shirt but also what allied accessories (belts, cuffs, etc.) can be worn with them.
In section 2, we detail the technical infrastructure that allows us to produce recommendations starting from a single image. We start by (a) segmenting and localising the object of interest and (b) designing a knowledge base that constrains the recommendation from the knowledge derived from street-style images to only those commodities that are available from a retailer’s database.
Recommending an item that is suitable to wear with yet another item involves the following steps:
1. Localisation/segmentation of garments from street-style images (the oracle)
2. Generation of associations between pairs of garments, i.e., determining which items in each image are worn by the same person
3. Construction of a joint distribution (co-occurrence matrix) based on either visual features from street-style images or items from a vendor’s inventory
4. Production of a recommendation using (a) colour, (b) pattern or (c) a street-style oracle under a content-based retrieval framework
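These four steps can be sketched at a toy scale as follows; the function bodies and feature names are illustrative placeholders standing in for the dCNN components described in the rest of this section, not the deployed system:

```python
import numpy as np

def segment_garments(image):
    # (1) Localise/segment garments; here we just return dummy crops.
    return [{"person": 0, "type": "top", "crop": image},
            {"person": 0, "type": "bottom", "crop": image}]

def associate_pairs(garments):
    # (2) Pair up garments worn by the same person.
    by_person = {}
    for g in garments:
        by_person.setdefault(g["person"], {})[g["type"]] = g
    return [(p["top"], p["bottom"]) for p in by_person.values()
            if "top" in p and "bottom" in p]

def update_cooccurrence(matrix, pairs, feature):
    # (3) Accumulate a joint distribution over paired garment features.
    for top, bottom in pairs:
        matrix[feature(top), feature(bottom)] += 1
    return matrix

def recommend(matrix, query_feature):
    # (4) Recommend the bottom-garment feature most often seen with the query.
    return int(np.argmax(matrix[query_feature]))
```

In the real pipeline, `feature` would be a colour bin, a texture-pattern class or a deep feature index rather than the toy lambda used here.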
In the next sections, we describe the steps comprising a recommendation engine (Figure 1), starting with a description of deep convolutional neural networks (dCNN). A VGG-16 model (Simonyan and Zisserman, 2014) pre-trained on the ImageNet dataset is used as a base model to fine-tune our fashion segmentation, localisation and pattern classification models.
Segmentation using a dCNN
Images are segmented using a dCNN framework so that each image can be partitioned into groups of unique objects. Specifically, in order to remove background effects for dominant-colour generation, each clothing item in a street-style image is segmented using an FCN (Fully Convolutional Network) followed by a CRF (Conditional Random Field), as shown in Figure 2. Two such architectures are evaluated: FCN-8 (Long et al., 2015) and DeepLab (Chen et al., 2015). Both methods first convert the pre-trained dCNN classifier by replacing the last fully connected layers with fully convolutional layers; this produces coarse output maps. For pixel-wise prediction, upsampling and concatenation of the scores from intermediate feature maps are applied to connect these coarse outputs back to the pixels. DeepLab speeds up segmentation of dCNN feature maps by using the ‘atrous’ (with holes) algorithm (Mallat, 2008).
Instead of deconvolutional layers, the atrous algorithm is applied in the layers that follow the last two max-pooling layers. This is done by introducing zeros into the convolutional filters to increase their length, which controls the Field-of-View (FOV) of the model by adjusting the input stride, without increasing the number of parameters or the amount of computation. Additionally, atrous spatial pyramid pooling (ASPP) is employed in DeepLab to encode objects as well as image context at multiple scales. After coupling with dCNN-based pixel-wise prediction (blob-like coarse segmentation), a fully connected pairwise Conditional Random Field (CRF) (Krahenbuhl and Koltun, 2011) is applied to model fine edge details. This is done by coupling neighbouring nodes so as to assign the same label to spatially proximal pixels.
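The zero-insertion at the heart of the atrous trick can be illustrated in one dimension; `atrous_filter` is our own naming for this sketch, not part of any DeepLab API:

```python
import numpy as np

def atrous_filter(weights, rate):
    """Insert (rate - 1) zeros between filter taps, enlarging the
    field of view without adding learnable parameters."""
    k = len(weights)
    dilated = np.zeros((k - 1) * rate + 1)
    dilated[::rate] = weights
    return dilated

w = np.array([1.0, 2.0, 3.0])
atrous_filter(w, 2)  # array([1., 0., 2., 0., 3.]) -- FOV grows from 3 to 5
```

The number of non-zero (learnable) taps stays at 3 while the filter now spans 5 input positions, which is exactly the parameter-free FOV control described above.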
Localisation using a dCNN
Owing to the complicated backgrounds in street-style images, object detection is applied to localise clothing items within the cluttered scene. The detected garments are used as queries to find similar items from inventory images of the same class; for details on the deep learning architecture refer to (Sengupta et al., 2017b), for feature encoding refer to (Sengupta et al., 2017a) and for similarity measurements see (Qian et al., 2017). Additionally, detected items are classified according to different texture patterns. With the associated bounding boxes, a co-occurrence matrix of patterns is generated from the street-style images.
Three state-of-the-art neural networks for object detection are evaluated by training on 45K street-style images (Section 3). Faster R-CNN (Faster Region-based Convolutional Neural Network) (Ren et al., 2015) is the first quasi-real-time network. Its architecture is based on a Region Proposal Network (RPN) for quick region proposals and a classifier network for assessing the proposals. A non-maximum suppression (NMS) algorithm suppresses redundant boxes. These steps share the same base network, saving computational time. SSD (Single Shot Multi-Box Detector) (Liu et al., 2016a) is the best network for optimising speed, at the cost of a small drop in accuracy. Its structure is equivalent to a number of class-specific RPNs working at different scales, whose results are then combined by NMS. R-FCN (Region-based Fully Convolutional Networks) (Dai et al., 2016) is yet another improvement of Faster R-CNN, with a structure based on an RPN and a classifier. It has a reduced overhead owing to a smaller fully connected classifier, which facilitates classifying the different region proposals independently.
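As a concrete reference, the greedy NMS step shared by all three detectors can be sketched as follows (boxes in (x1, y1, x2, y2) form; a minimal version, not the exact implementation used in these networks):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop any remaining box overlapping it by more than iou_thresh."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Keep only boxes that do not overlap the chosen one too much.
        order = order[1:][iou <= iou_thresh]
    return keep
```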
In the next section, we describe three different methods for recommending items to users that are based on visual content such as the dominant colour, texture pattern, etc.
Recommendation by colour
A common scheme reflecting consumer behaviour is to recommend garments based on colour. We operationalise such a scheme by (a) segmenting clothing items from street-style images, (b) extracting the dominant colour from the segmented items using density estimation, (c) finding the associations of segmented items in the street-style dataset and (d) building a joint distribution of associations based on the co-occurrence of dominant colours in street-style images.
First, a colour map is generated using k-means on the CIELab (CIE, 1977) values of the segmented pixels in each individual garment category. Each category has its own colour map; these maps are then used as indices of the co-occurrence matrix. When a query is submitted, the dominant colour is extracted from the segmented garment and a search over the corresponding co-occurrence matrix is initiated to find the colour that best matches the query garment. Finally, we recommend inventory images from the corresponding category that have the matching colour to go with the query image. Figure 3 shows the framework for dominant-colour extraction and co-occurrence matrix generation from the street-style image dataset. Additionally, once the dominant colour is extracted from the query image, we use a knowledge-based recommendation engine wherein a colour wheel is utilised to recommend items with specific attributes – for example, complementary colours, triadic colours, etc. (AttireClub, 2014).
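The dominant-colour lookup can be sketched as follows, assuming the k-means centroids (the per-category colour map) have already been computed; the function names are ours and the Lab values are toy data:

```python
import numpy as np

def dominant_colour(pixels, colour_map):
    """Assign each segmented pixel to its nearest colour-map bin and
    return the most frequent bin (the dominant-colour index)."""
    # pixels: (N, 3) Lab values; colour_map: (K, 3) k-means centroids.
    d = np.linalg.norm(pixels[:, None, :] - colour_map[None, :, :], axis=2)
    bins = d.argmin(axis=1)
    return int(np.bincount(bins, minlength=len(colour_map)).argmax())

def matching_colour(cooccurrence, query_bin):
    """Look up the colour bin that co-occurs most often with the query bin."""
    return int(cooccurrence[query_bin].argmax())
```

Inventory items whose dominant colour falls in the returned bin are then surfaced to the user.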
Recommendation by pattern
A similar framework is used for pattern recommendation, as shown in Figure 4, to make a recommendation based on the pattern that is intrinsic to an object. Again, we recommend items that have a similar pattern by (a) detecting garments in street-style images, (b) classifying each cropped garment into one of the texture patterns and (c) searching the corresponding co-occurrence matrix of texture patterns built from street-style images.
Recommendation via content-based retrieval
In Figure 5, a content-based recommendation system is operationalised as follows: (a) locate and crop garments in street-style images, (b) associate top and bottom garments worn by the same person and (c) generate a table from each associated top–bottom pair of the inventory dataset. Specifically, we initiate a query with the cropped top garment from a street-style image against inventory images of the same category (e.g., a street-style blouse is queried against inventory blouses); similarly, we run a query with the bottom garment. For details on the specific architecture used for retrieval, please see (Sengupta et al., 2017b). A joint table can then be constructed by adding the scores of all possible combinations of the top-5 retrieval results for the top garment (blouse) with the top-5 retrieval results for the bottom garment (trousers). Such a table tells us how fashionable a given garment combination is. Given an image of a blouse, for example, a skirt may then be recommended by running retrieval on the query image and then looking up in the table which skirts gather higher frequencies when combined with the blouse.
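The joint-table construction above can be sketched as follows; the (item id, retrieval score) tuples and function names are illustrative, not the retrieval engine's actual interface:

```python
import numpy as np

def update_joint_table(table, top_results, bottom_results):
    """Add the summed retrieval scores of every (top, bottom) combination
    of the top-5 results of each query to the joint table."""
    for top_id, top_score in top_results:
        for bot_id, bot_score in bottom_results:
            table[top_id, bot_id] += top_score + bot_score
    return table

def recommend_bottom(table, top_id):
    # Recommend the bottom garment with the highest accumulated score
    # for the given top garment.
    return int(table[top_id].argmax())
```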
3. Datasets and Results
In order to recommend inventory images based on fashion trends (street-to-shop), we generate two fashion datasets, i.e., a street-style image dataset (no. 1) and an inventory image dataset (no. 2; Figure 6). Dataset no. 1 has 280K street-style images that were downloaded from the latest fashion blogs. Of these, 70K street-style images were used to build co-occurrence matrices of fashion inventories. Dataset no. 2 has 100K inventory images that come from various fashion retailers. The images are categorised according to 5 classes, i.e., Coats/Jackets, Dresses, Skirts, Tops/Blouses and Trousers. Inventory images are recommended to users based on colour, pattern and visual similarity by querying the co-occurrence matrices generated from street-style images. As seen in Figure 6, most street-style images contain cluttered backgrounds: people with large pose variations, multiple overlapping persons standing side by side, etc. Because of such backgrounds, object detection and segmentation are applied to street-style images to localise the requisite fashion items. Most inventory images show a single item with a plain background; localisation is also required here to mask out the model’s legs and head.
Dataset for segmentation
10K street-style images are selected and manually segmented using the GrabCut algorithm (Rother et al., 2004). The dataset is split into 7K images for training, 1.5K for validation and 1.5K for testing. Table 1 shows the split for the training, validation and testing datasets, with the number of masks for each fashion item.
| Split | Coats/Jackets | Dresses | Skirts | Tops/Blouses | Trousers |
|---|---|---|---|---|---|
| Train (7K images) | 2507 masks (22%) | 2080 (19%) | 2018 (18%) | 2637 (23%) | 2078 (18%) |
Dataset for localisation
45K street-style images are selected and annotated manually by drawing a bounding box around each fashion item. The data are split into 36K images for training, 4.5K for validation and 4.5K for testing. The split results for the training, validation and testing datasets, with the number of bounding boxes (BBs) for each item, are shown in Table 2.
Dataset for texture classification
By combining some categories in DTD (Cimpoi et al., 2014), texture tags in the DeepFashion dataset (Liu et al., 2016b) and some fashion blogs, the 10 most popular pattern categories for fashion inventory are selected. 14K single-fashion-item images (11K for training and 3K for testing) are sourced from a client to train a neural network model that classifies the pattern of each texture; some examples of the 10 categories are shown in Figure 7.
Results for segmentation
Two methods for segmenting objects (Section 2) are evaluated on street-style images. Both models are initialised with a VGG-16 model pre-trained on ImageNet, trained on the 8.5K (train + validation) segmentation dataset and evaluated on the 1.5K testing dataset. Table 4 shows the Intersection over Union (IoU = TP/(TP + FP + FN)) and Pixel Accuracy (PA = TP/(TP + FN)) for each fashion class, together with the mean over classes, for both models. DeepLab with multiple scales and a large FOV, along with a CRF, achieves the best performance of 59.66% mean IoU and 73.99% mean PA. It is also evident that combining a CRF with the FCN increases the mean IoU by 4% and the PA by 2%. We also trained the model by initialising with a VGG-16 trained on our fashion classification dataset (Sengupta et al., 2017b); here, the mean IoU drops by 4%. Figure 8 shows segmentation results on street-style images using the DeepLab-MultipleScales-LargeFOV combination with a CRF. Average test time to segment an image: FCN = 115 ms and CRF = 638 ms on an NVIDIA Titan X GPU with 12GB memory.
| Model | Metric | Coats/Jackets | Dresses | Skirts | Tops/Blouses | Trousers | Background | Mean |
|---|---|---|---|---|---|---|---|---|
| FCN-8 | PA | 52.74% | 74.23% | 64.72% | 57.91% | 71.85% | 96.36% | 69.63% |
| FCN-8 | IoU | 44.80% | 44.73% | 42.88% | 36.11% | 54.38% | 94.07% | 52.83% |
| DeepLab-MSc-LargeFOV + CRF | PA | 70.70% | 70.94% | 68.69% | 56.14% | 80.01% | 97.45% | 73.99% |
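The two metrics reported in Table 4 can be computed from flattened label maps as follows (a minimal sketch; the class indexing and toy arrays are illustrative):

```python
import numpy as np

def iou_and_pa(pred, truth, n_classes):
    """Per-class IoU = TP/(TP + FP + FN) and pixel accuracy
    PA = TP/(TP + FN), computed from flattened label maps."""
    iou, pa = [], []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (truth == c))
        fp = np.sum((pred == c) & (truth != c))
        fn = np.sum((pred != c) & (truth == c))
        iou.append(tp / (tp + fp + fn) if tp + fp + fn else 0.0)
        pa.append(tp / (tp + fn) if tp + fn else 0.0)
    return iou, pa
```

Because PA omits false positives from the denominator, PA is always at least as large as IoU for any class.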
Results for localisation
In order to evaluate the performance of the three networks (Section 2) on localisation, the three models are trained on the 40.5K (train + validation) street-style localisation dataset and tested on the 4.5K test set, using the default parameters from the original papers. Table 5 shows the Average Precision (AP) for each object class and the mean Average Precision (mAP) for the three models. A bounding box is counted as correct only if its IoU with the ground truth is larger than 50%. Average testing time per image is evaluated on an NVIDIA Quadro M6000 GPU with 24GB memory. Table 5 shows that R-FCN has an edge over the other models evaluated, while SSD is particularly suitable when speed is the main concern. Figure 9 shows R-FCN detection results on street-style images.
| Model | Coats/Jackets | Dresses | Skirts | Tops/Blouses | Trousers | mAP | Test time (ms) |
|---|---|---|---|---|---|---|---|
| SSD 500×500 | 84% | 74% | 76% | 76% | 86% | 79.2% | 70 |
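The IoU > 50% acceptance criterion used in this evaluation amounts to the following check (boxes in (x1, y1, x2, y2) form; the helper names are ours):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) form."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def is_true_positive(detection, ground_truths, thresh=0.5):
    """A detection counts towards AP only if it overlaps some
    ground-truth box by more than the IoU threshold."""
    return any(box_iou(detection, gt) > thresh for gt in ground_truths)
```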
Results for texture classification
For the pattern recommendation system, the cropped garments are classified into 10 texture patterns. For this, a pattern classifier is trained by fine-tuning a pre-trained VGG-16. We use 11K images for training and 3K images for testing (Section 3). Results are listed in Table 6.
Results for garment association
After detecting garments, a person detector (Dollár et al., 2010) is applied to ensure that associated garment crops are worn by the same person. In total, 6 types of association between pairs of garments in the 70K street-style images are generated and listed in Table 7. The numbers indicate how many people wear the corresponding garment pairs in the 70K street-style images.
Recommendation using colour
After garment association, 6 co-occurrence matrices of dominant colour are generated from the 70K street-style images. In our system, a colour map with 130 bins for each category is created by running k-means on all segmented pixels of the corresponding garment in the street-style images. When a query image is submitted, the dominant colour is extracted from the segmented item and a search over the corresponding co-occurrence matrix is initiated to find the colour that best matches the query item. For example, in Figure 10(1), the first row shows the query image and the best matching colour obtained from the tops/blouses–skirts colour co-occurrence matrix. The second row shows the skirts recommended according to this colour from an inventory database; some reference examples with the same matching colour from the street-style dataset are displayed in the third row. In Figure 10(2-4) we show some examples of colour recommendation based on different aspects of the colour wheel: Figure 10(2) shows complementary-coloured trousers for a query top, Figure 10(3) shows one of the triadic-coloured skirts for a yellow top, and Figure 10(4) shows triadic-coloured skirts and tops/blouses for a yellow coat.
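The colour-wheel rules can be approximated with hue rotations; this sketch uses Python's colorsys and RGB/HLS space rather than the CIELab pipeline described above, so it is an illustration of the rules, not the deployed implementation:

```python
import colorsys

def complementary(rgb):
    """Rotate the hue by 180 degrees on the colour wheel."""
    h, l, s = colorsys.rgb_to_hls(*rgb)
    return colorsys.hls_to_rgb((h + 0.5) % 1.0, l, s)

def triadic(rgb):
    """Return the two hues 120 degrees either side of the query colour."""
    h, l, s = colorsys.rgb_to_hls(*rgb)
    return [colorsys.hls_to_rgb((h + shift) % 1.0, l, s)
            for shift in (1 / 3, 2 / 3)]
```

For a pure red query, the complementary colour is cyan and the triadic colours are green and blue, matching the standard colour wheel.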
Recommendation using pattern
For pattern recommendation, when a query image is submitted, the garments are cropped from the image and classified into one of the ten texture patterns. We then search the corresponding pattern co-occurrence matrix to find the pattern that best matches the query item. This then allows us to recommend items with a matching pattern from the inventory dataset. Two examples are shown in Figure 11: (1) shows a top/blouse that forms the query; a plain-coloured trouser is recommended, taking into account the attributes of the query pattern. Figure 11(2) shows the query, i.e., a top/blouse with a dotted pattern; the FRS recommends that a plain-coloured skirt be worn with such a top. The third row of each figure shows some reference images from the street-style dataset with the same matching pattern.
Recommendation using content-based retrieval
Given a query image, we run a query against inventory images of the same “top” category (Qian et al., 2017), pick some of the higher-ranking “top” garments and use the look-up table to find and recommend to the user the most frequent “bottom” garments for each “top” garment. Two examples are shown in Figure 12: (1) tops/blouses with trousers and (2) coats/jackets with skirts.
This paper has detailed an end-to-end, commercially deployable system – from image segmentation and localisation to recommending a dyad of clothing accessories. The knowledge representation is learnt by crawling fashion blogs (street-style oracles) for images that are prescriptive of the variety of styles preferred by consumers. Deep neural networks complement this knowledge by learning a latent feature representation, which then enables dyadic recommendations. We also propose two other, simpler recommendation schemes, utilising the colour wheel to prescribe dyads of colours or deep feature vectors to recommend clothing accessories based on the texture of the fabric. The framework is scalable and has been deployed on cloud-service providers.
Our work adds to the burgeoning vertical of algorithmic clothing (Liu et al., 2016b), which uses discriminative and probabilistic models to advise consumers on how to finesse their dressing style. For example, Murillo et al. (2012) learnt ‘urban tribes’ by learning which groups of people are more likely to socialise with one another and therefore may have similar dressing styles. Classifying styles of clothing has been the focus of Bossard et al. (2013), where the authors use a random forest classifier to distinguish a variety of dressing styles. Similarly, Oramas and Tuytelaars (2016) use neural networks to measure the visual compatibility of different clothing items by analysing co-occurrences of base-level elements between images of compatible objects. The combination of street-style oracles with a deep learning based feature representation framework is similar to the work of Veit et al. (2015), wherein a Siamese convolutional neural network is used to learn the compatibility of a variety of clothing items. There is a stark dissimilarity, though – Veit et al. (2015) used Amazon.com’s co-purchase dataset, which is instantiated on the assumption that two items purchased together are worn together. This may not always be true – therefore, the present work bases recommendations on current trends in fashion (well represented by fashion blogs). The framework is flexible, such that ‘clothing trends’ can be updated to keep up with seasonality (Al-Halah et al., 2017) or dissected into hierarchical models suited to the demographics or age range of the clientele. With the availability of GPUs, such a framework becomes highly scalable.
There are a few challenges that can imperil the formation of a joint occurrence matrix. The first is the sparsity of the matrix – caused by an inadequate number of street-style images containing a specific combination of two clothing items. An easy way to alleviate this issue is to use generative models (Murphy, 2012). A neural network can also be utilised as a function approximator (see below) such that the learnt features can be encoded (Sengupta et al., 2017a) to reveal dependencies between inventory items.
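As a minimal illustration of such a fix, additive (Laplace) smoothing fills the empty cells of a sparse co-occurrence matrix; the uniform prior used here is a deliberate simplification of a full generative model:

```python
import numpy as np

def smooth_cooccurrence(counts, alpha=1.0):
    """Additive (Laplace) smoothing: add a pseudo-count alpha to every
    cell, then normalise to a joint probability distribution, so that
    unseen garment combinations receive non-zero probability."""
    smoothed = counts + alpha
    return smoothed / smoothed.sum()
```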
Our framework would recommend similar items to those previously suggested to the user. Whilst such a problem is severe for collaborative filtering approaches, our hybrid recommender system alleviates only part of it, especially if we relax the assumption that the co-occurrence matrix has a stable probability distribution. Thus, a vital strand of our current research lies in personalisation – how can we alter the recommendations such that they take into account not only a consumer’s shopping behaviour but also the granularity of their ‘personal’ taste? One way to formulate this feature-based exploration/exploitation problem is to frame it as a contextual bandit problem (Langford and Zhang, 2008). Put simply, such an algorithm sequentially selects the dyad recommendation based on the interaction of the consumer with the recommendation system.
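A toy epsilon-greedy variant conveys the exploration/exploitation idea; a full contextual bandit would also condition on user features, so this class is purely illustrative:

```python
import random

class EpsilonGreedyDyadRecommender:
    """With probability epsilon, explore a random dyad; otherwise
    exploit the dyad with the highest average observed reward.
    Rewards would come from user interaction (clicks, purchases)."""

    def __init__(self, n_dyads, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_dyads     # times each dyad was shown
        self.totals = [0.0] * n_dyads   # accumulated reward per dyad

    def select(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))  # explore
        means = [t / c if c else 0.0
                 for t, c in zip(self.totals, self.counts)]
        return max(range(len(means)), key=means.__getitem__)  # exploit

    def update(self, dyad, reward):
        self.counts[dyad] += 1
        self.totals[dyad] += reward
```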
The present work focuses on a recommendation dyad, i.e., a pair of trousers to go with a shirt; nevertheless, the framework is equipped to make recommendations over a much larger combination of co-occurrences. As noted earlier, the next step would be to replace the joint-occurrence matrix with a neural network so that a non-linear function over multiple items can be learnt. This would be necessary for next-generation algorithms that recommend an entire wardrobe rather than dyads or triads of clothing items.
This work was supported by two Technology Strategy Board (TSB) UK grants (Ref: 720499 and Ref: 720695).
- Al-Halah et al. (2017) Ziad Al-Halah, Rainer Stiefelhagen, and Kristen Grauman. 2017. Fashion Forward: Forecasting Visual Style in Fashion. arXiv:1705.06394 (2017).
- AttireClub (2014) AttireClub. 2014. A Guide to Coordinating the Colors of Your Clothes. (2014). http://attireclub.org/2014/05/05/coordinating-the-colors-of-your-clothes/
- Bossard et al. (2013) Lukas Bossard, Matthias Dantone, Christian Leistner, Christian Wengert, Till Quack, and Luc Van Gool. 2013. Apparel Classification with Style. In Proceedings of the 11th Asian Conference on Computer Vision - Volume Part IV (ACCV’12). Springer-Verlag, 321–335.
- Chen et al. (2015) L. Chen, E. Shelhamer, and T. Darrell. 2015. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs. ICLR (2015).
- Cimpoi et al. (2014) M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. 2014. Describing Textures in the Wild. In CVPR.
- Dai et al. (2016) J. Dai, Y. Li, K. He, and J. Sun. 2016. R-FCN: object detection via region-based fully convolutional networks. NIPS (2016).
- Dollár et al. (2010) P. Dollár, S. Belongie, and P. Perona. 2010. The Fastest Pedestrian Detector in the West. BMVC (2010).
- CIE (1977) CIE. 1977. CIE Recommendations on Uniform Color Spaces, Color-Difference Equations, and Metric Color Terms. Color Research & Application 2, 1 (1977), 5–6.
- Krahenbuhl and Koltun (2011) P. Krahenbuhl and V. Koltun. 2011. Efficient inference in fully connected CRFS with gaussian edge potentials. NIPS (2011).
- Langford and Zhang (2008) John Langford and Tong Zhang. 2008. The Epoch-Greedy Algorithm for Multi-armed Bandits with Side Information. In Advances in Neural Information Processing Systems 20, J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis (Eds.). Curran Associates, Inc., 817–824.
- Liu et al. (2016a) W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. 2016a. SSD: single shot multibox detector. ECCV (2016).
- Liu et al. (2016b) Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. 2016b. DeepFashion: Powering Robust Clothes Recognition and Retrieval With Rich Annotations. In CVPR.
- Long et al. (2015) J. Long, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. 2015. Fully Convolutional Networks for Semantic Segmentation. CVPR (2015).
- Mallat (2008) S. Mallat. 2008. A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way (3rd ed.). Academic Press.
- Murillo et al. (2012) A. C. Murillo, I. S. Kwak, L. Bourdev, D. Kriegman, and S. Belongie. 2012. Urban tribes: Analyzing group photos from a social perspective. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 28–35.
- Murphy (2012) Kevin P. Murphy. 2012. Machine Learning: A Probabilistic Perspective. The MIT Press.
- Oramas and Tuytelaars (2016) José Oramas and Tinne Tuytelaars. 2016. Modeling Visual Compatibility through Hierarchical Mid-level Elements. CoRR (2016).
- Qian et al. (2017) Y. Qian, E. Vazquez, and B. Sengupta. 2017. Deep Geometric Retrieval. CoRR abs/1702.06383 (2017).
- Ren et al. (2015) S. Ren, K. He, R. B. Girshick, and J. Sun. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. NIPS (2015).
- Rother et al. (2004) C. Rother, V. Kolmogorov, and A. Blake. 2004. GrabCut: interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. (2004).
- Sengupta et al. (2017a) B. Sengupta, E. Vasquez, and Y. Qian. 2017a. Deep Tensor Encoding. In Proceedings of KDD Workshop on Machine Learning meets Fashion, Vol. abs/1703.06324.
- Sengupta et al. (2017b) B. Sengupta, E. Vazquez, M. Sasdelli, Y. Qian, M. Peniak, L. Netherton, and G. Delfino. 2017b. Large-scale image analysis using docker sandboxing. CoRR abs/1703.02898 (2017).
- Simonyan and Zisserman (2014) K. Simonyan and A. Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. CoRR (2014).
- Veit et al. (2015) Andreas Veit, Balazs Kovacs, Sean Bell, Julian McAuley, Kavita Bala, and Serge Belongie. 2015. Learning Visual Clothing Style with Heterogeneous Dyadic Co-occurrences. In ICCV.