A Deep Learning System for Predicting Size and Fit in Fashion E-Commerce

07/23/2019 ∙ by Abdul-Saboor Sheikh, et al. ∙ Shopify Zalando 15

Personalized size and fit recommendations bear crucial significance for any fashion e-commerce platform. Predicting the correct fit drives customer satisfaction and benefits the business by reducing costs incurred due to size-related returns. Traditional collaborative filtering algorithms seek to model customer preferences based on their previous orders. A typical challenge for such methods stems from extreme sparsity of customer-article orders. To alleviate this problem, we propose a deep learning based content-collaborative methodology for personalized size and fit recommendation. Our proposed method can ingest arbitrary customer and article data and can model multiple individuals or intents behind a single account. The method optimizes a global set of parameters to learn population-level abstractions of size and fit relevant information from observed customer-article interactions. It further employs customer and article specific embedding variables to learn their properties. Together with learned entity embeddings, the method maps additional customer and article attributes into a latent space to derive personalized recommendations. Application of our method to two publicly available datasets demonstrates an improvement over the state-of-the-art published results. On two proprietary datasets, one containing fit feedback from fashion experts and the other involving customer purchases, we further outperform comparable methodologies, including a recent Bayesian approach for size recommendation.




Code Repositories


Deep Learning based solution for fit clothing size recommendation



PyTorch Implementation of A Deep Learning System for Predicting Size and Fit in Fashion E-Commerce (RecSys'19)



A personalized fit recommendation system using SFNET to learn input and latent representations of customers and articles for size and fit prediction


1. Introduction

Fashion is a way to express identity, moods, and opinions. Recent studies show that size and fit are among the most influential factors driving e-commerce customer satisfaction (Pisut2017). A crucial difference when engaging in online compared to traditional brick-and-mortar retail is the lack of immediate sensory feedback about the fit and feel of a product. For many, this is a major deterrent against fashion e-commerce.

To make matters worse, the notion of size is inherently ambiguous: for instance, size systems may be coarsely defined (e.g., ‘Small’, ‘Medium’, ‘Large’) or they may vary between regions (e.g., EU vs. US shoe sizes). There is furthermore vanity sizing, where brands modify standardized size specifications to target a particular clientele. As a result, there exists a myriad of overlapping size systems in the fashion industry, with no agreed standard for conversion between them. Even within brands there is not necessarily one consistent conversion logic employed to convert sizes from one country or region to another.

One way to assist customers in finding the correct size is to provide size conversion charts which convert body measurements to article sizes. However, this requires customers to know their body measurements. Interestingly, even if the customer gets accurate measurements with the aid of tailor-like tutorials and expert explanations, the size charts themselves almost always suffer from high variance, even within a single brand. This is especially true for fast fashion brands that represent the largest part of sales volume. In a fast-moving fashion environment, designers strive to beat competition by continuously serving consumers with the latest trends at competitive prices. To meet time, cost and design constraints, the same articles with varying attributes (e.g., color, material, etc.) are often sourced from different production channels, causing inconsistencies in size and fit characteristics.

There are numerous other factors that make it essential for fashion e-commerce platforms to develop data-driven systems for providing informed size and fit advice to their customers (e.g., abdulla2017; guigoures2018hierarchical; sembium2017; Sembium2018; misra2018decomposing).

In this work, we propose a deep learning based content-collaborative methodology for personalized size and fit prediction. Standard approaches to collaborative filtering solely rely on interaction data to model customer behavior (KorenBell2015), but for a vast majority of customers, such data is sparse. This results in an extremely sparse customer-article interaction matrix, which makes it difficult to model preferences of every individual customer on a personalized level. Additional information in the form of customer and article attributes can however help to deal with the sparsity and cold-start recommendations (see e.g., ShiEtAl2014; sembium2017). In the same spirit, our proposed method uses both interaction data as well as arbitrary customer and article features for personalized size/fit prediction. Our method employs a split-input neural network architecture with global and entity-specific parameters. The global set of parameters allows the model to capture information relevant for predicting size and fit across customers, whereas the entity-level embedding variables equip the model with the capacity to discover implicit properties of individual customers and articles for personalized recommendations. The method is a priori independent of underlying semantics behind its targets and can model multiple individuals or intents behind an account.

2. Related Work

The topic of understanding article size issues as well as predicting size and fit on a personalized level has gained momentum in the research community. In the following we outline some recent developments on the subject and draw parallels between our work and closely-related methodologies in collaborative filtering:

The authors of (PengSayegh2014) put forth the idea of mapping customer images to existing 3D body scans, which are aligned with articles to generate fit ratings.

The method introduced in (abdulla2017) proposes to use a skip-gram based word2vec model (word2vec) on the purchase history data to learn latent representations of articles. The approach then forms a customer representation by aggregating over the learned representations of said customer’s purchased articles. A gradient boosted classifier is then trained on customer and article latent representations to predict the fit.

In (guigoures2018hierarchical), the authors propose a hierarchical Bayesian approach for personalized size recommendation. Conditioned on customer and article pairs, the method models the joint conditional probability of sizes ordered by customers together with their outcomes (i.e., kept vs. size-related return) as observed in training data. For making personalized size recommendations, the method uses the conditional probability of size given a customer and an article with the outcome set to keep. The method uses approximate probabilistic inference for parameter optimization and testing.

The authors of (sembium2017) propose to deduce ‘true’ sizes of customers and articles from purchase and return data using a latent factor model. The deduced size features are fed into a standard classification regime to perform ordinal fit prediction (i.e., ‘Small’, ‘Fit’, ‘Large’). The method in addition performs hierarchical clustering on individual customer data to handle multiple customers behind an account. A follow-up work proposes a Bayesian version of the ordinal regression model. The method relies on approximate probabilistic inference (mean-field variational approximation with Polya-Gamma augmentation) for posterior distribution estimation over customer and article sizes.

An approach conceptually similar to our work is proposed in (misra2018decomposing), which models the size recommendation problem as a fit prediction problem. In a two-step procedure, the method first learns to embed customers and articles in a latent space with the same dimensionality. Once the embeddings are obtained using an ordinal regression procedure, they are used in the next step to learn representations for each class by applying prototyping and metric learning techniques. The authors of (misra2018decomposing) also provide the public datasets that we use to benchmark our approach.

Most of the works mentioned above do not take an end-to-end approach to the task at hand, while some are limited w.r.t. scalability (e.g., due to their probabilistic nature) or capacity (e.g., due to predefined interactions, linearity assumptions, ability to handle cold-starts or model multiple users/intents behind one identity). Our work in contrast presents a scalable, end-to-end deep learning approach to size and fit recommendation. The two-pathway neural network architecture employed in this work (Figure 1) flexibly consumes both categorical and continuous customer and article features and it learns (potentially non-linear) customer-article interactions from data.

Our model architecture is rather generic in the context of collaborative filtering. It is for instance closely related to the Deep Structured Semantic Model (DSSM) (DSSM2013) and Neural Collaborative Filtering (NCF) (ncf2017). Developed for web search, DSSM uses independent neural network layers to embed customers and articles into a latent space. It then uses a predefined interaction between the latent embeddings to predict its target. NCF employs an architecture inspired by Neural Tensor Networks to learn input embeddings or features for (one-hot encoded) customers and articles. The architecture comprises a shallow (GMF) as well as a deep (MLP) feedforward pathway to respectively model both linear and non-linear interactions between customer and article pairs. A notable difference between our architecture and DSSM or NCF is that our architecture uses skip connections (resnet2016) between layers.

Our proposed approach can be seen as a generalization of logistic matrix factorization (johnson2014logistic), which is a linear model of customer-item interactions. Aside from interaction data, the method does not take any additional customer or item information into account for making personalized recommendations.
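For contrast with the non-linear model proposed here, the linear interaction at the heart of logistic matrix factorization can be sketched as follows. This is a minimal illustration assuming the commonly stated form, in which the interaction probability is a sigmoid of the dot product of customer and item factors plus bias terms; the function names and example vectors are hypothetical and not taken from (johnson2014logistic).

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lmf_probability(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    """Probability of a positive interaction under a linear factor model.

    The score is linear in the latent factors: no customer or item
    attributes enter, only the learned vectors and biases.
    """
    score = sum(u * v for u, v in zip(user_vec, item_vec)) + user_bias + item_bias
    return sigmoid(score)

# Hypothetical two-dimensional latent factors for one customer and one item.
p = lmf_probability([0.5, -1.0], [1.0, 0.25], user_bias=0.1, item_bias=-0.1)
```

Replacing the dot product with learned non-linear layers, and adding attribute inputs next to the identifiers, is what moves such a model toward the architecture described in this work.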

Figure 1. Schematic of SFnet architecture for size and fit prediction. A dedicated symbol in the figure indicates concatenation, while each trapezoid represents a cascade of fully-connected feedforward layers with skip connections.

3. Problem Formulation

We build our recommendation system via likelihood maximization. To that end, we formulate a probabilistic model and optimize its parameters so as to maximize the probability of the outcomes of observed customer-article interactions in the training data. Our training data is a set of tuples $\mathcal{D} = \{(c_n, a_n, y_n)\}_{n=1}^{N}$, where $c$ denotes a customer, $a$ an article and $y$ is a categorical variable such as fit feedback or size of the article. Given the data we can define a conditional probability distribution $p(y \mid c, a; \Theta)$, such that it allows us to define a statistical model for associating customer-article interactions with respective outcomes. Given $p(y \mid c, a; \Theta)$ and a set of $N$ customer-article interactions, we can define the following likelihood function:

$$\mathcal{L}(\Theta) = \prod_{n=1}^{N} p(y_n \mid c_n, a_n; \Theta) \qquad (1)$$

where $\Theta$ represents the set of parameters of the conditional distribution. We seek values for $\Theta$ so that (1), or equivalently its logarithm, is maximized. Once optimized, we can evaluate the conditional distribution with customer-article pairs to estimate the odds of modeled outcomes, i.e. size or fit. For brevity, we will omit $\Theta$ in our later references to the conditional distribution in (1).
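As a minimal sketch of how the logarithm of the likelihood in (1) can be evaluated, assume a model has already produced one probability vector per observed interaction. The names `log_likelihood`, `probs` and `outcomes`, as well as the example numbers, are illustrative and not part of the original work.

```python
import math

def log_likelihood(probs, outcomes):
    """Sum of log-probabilities that the model assigns to observed outcomes.

    probs:    one probability vector per interaction (each sums to 1)
    outcomes: the observed class index for each interaction
    """
    return sum(math.log(p[y]) for p, y in zip(probs, outcomes))

# Three hypothetical interactions with classes 0: 'Small', 1: 'Fit', 2: 'Large'.
probs = [[0.2, 0.7, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
outcomes = [1, 1, 0]

total = log_likelihood(probs, outcomes)
average = total / len(outcomes)  # per-interaction average log-likelihood
```

Because the data points are treated as independent, the product in the likelihood turns into a sum of per-interaction terms, which is what makes mini-batch optimization straightforward.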

3.1. Modeling Assumptions

In (1) we make a simplifying assumption that each of the data points in the training dataset is independently and identically distributed given a customer and article pair. This allows us to model the outcome as a categorical variable. One can however consider modeling the outcome as a multivariate categorical vector $\mathbf{y}$, e.g., to capture interactions among multiple sizes in selection orders, i.e. orders in which a customer orders more than one size. Such a modeling scheme would allow us to capture co-dependencies among the elements of $\mathbf{y}$, but at the cost of increased model complexity.

Furthermore both this work and other models compared here do not take the temporal nature of the data into account. A more elaborate model could further condition every order on all previous orders.

As we shall see, the simplifying assumptions discussed above yield a computationally amenable objective (1) that can be optimized at scale in an end-to-end fashion for predicting customer size or fit on a personalized level for a given query article.

3.2. Modeling Personalized Size/Fit Preferences

In general, the conditional distribution $p(y \mid c, a)$ in (1) takes the form of a categorical distribution over one of $K$ possible outcomes of the output variable $y$. For instance, in case of a binary outcome (e.g., ‘Fit’, ‘No fit’), $p(y \mid c, a)$ can be modeled as a Bernoulli distribution. In the simplest form, we can marginalize over all the articles in a customer’s history to have $p(y \mid c)$ only conditioned on the customer. Such a customer-only-level personalization approach (with some population-level smoothing) aggregates over articles, and hence to a certain degree alleviates the data sparsity problem. Marginalization of articles may also be a reasonable assumption so long as customers’ size and fit preferences are not influenced by article attributes. However, article attributes, including brand, style, material etc. can indeed influence a customer’s size preferences, which makes it desirable to model dependencies of such kind even when individual customer order histories may only sparsely reflect such fine-grained information. We therefore define a global model of $p(y \mid c, a)$ such that its parameters are (partially) shared across all customers and articles:

$$p(y \mid c, a) = \mathrm{Cat}(y;\ \boldsymbol{\psi}), \qquad \boldsymbol{\psi} = f_{W}(\mathbf{x}_c, \mathbf{x}_a) \qquad (2)$$

Here we define the parameters $\boldsymbol{\psi}$ of $p(y \mid c, a)$ to be the output of a neural network (i.e. $\boldsymbol{\psi}$ is the output of a feedforward neural network $f_W$). The elements of the vector $\boldsymbol{\psi}$ signify the odds of the $K$ possible outcomes such as sizes of an article or one of the possible fit feedback values. Our neural network is parameterized by a set of matrices $W$ and consumes feature sets $\mathbf{x}_c$ and $\mathbf{x}_a$ corresponding to both customer and article. The features can be comprised of both explicit attributes as well as variables that can be uniquely identified with individual customers and articles and allow us to encode implicit information such as customer style preferences or intrinsic article sizes. As we will see in Section 3.3, such encodings in neural network based models can be learned in an end-to-end fashion by means of input feature embeddings.

By plugging (2) into (1), we can globally optimize for $\Theta$ by minimizing a loss function such as categorical cross-entropy via (stochastic) gradient descent (SGD). Note that $\Theta$ includes neural network weight matrices as well as the embedded input features of customers and articles.
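The link between maximizing the likelihood and minimizing categorical cross-entropy can be spelled out in one line. The notation below is an assumption made for this illustration: $\Theta$ denotes all trainable parameters, $N$ the number of interactions, $K$ the number of outcomes, $y_n$ the observed outcome of interaction $n$, and $\psi_{n,k}$ the probability the network assigns to outcome $k$ for that interaction.

```latex
-\log \mathcal{L}(\Theta)
  = -\sum_{n=1}^{N} \log p(y_n \mid c_n, a_n; \Theta)
  = -\sum_{n=1}^{N} \sum_{k=1}^{K} \mathbb{1}[y_n = k] \, \log \psi_{n,k}
```

The right-hand side is the categorical cross-entropy between the one-hot encoded outcomes and the predicted distributions, summed over the data, so minimizing it with SGD maximizes the likelihood.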

3.3. Size and Fit Network (SFnet) Architecture

For the neural network in (2), we choose an architecture that is loosely inspired by Siamese networks (BromleyEtAl1993); a crucial difference, however, is that the input pathways of the model are not weight-sharing replicas of each other (Elkahky2015). As illustrated in Figure 1, the size and fit network (SFnet) architecture ingests customer and article information through non-identical feedforward input pathways. As shown in the figure, the input layers of both customer and article pathways embed categorical features (e.g., customer id, article id, brand, etc.) such that their unique values get mapped to trainable vector variables. Note that by embedding unique customer or article identifiers, we indeed equip the model with the capacity to learn personalized latent features of individual customers and articles in an end-to-end fashion. Both customer and article input pathways concatenate their set of embedded and non-embedded (i.e. continuous) features to pass them through a cascade of non-linear layers with skip connections (resnet2016) to obtain latent embeddings of customers and articles. This allows the model to capture latent information about both entities that is only contained in (higher-order) implicit patterns in data. Through such an embedding scheme, we can theoretically learn to disentangle information and identify multiple personas with diverging size or fit preferences behind a single account or discover properties that are intrinsic to certain articles or brands.

After obtaining the so-called latent embeddings of both customer and article, we simply concatenate the embeddings and send the combined information through another set of non-linearities (with skip connections) to yield the parameter vector that parameterizes the conditional distribution in (2).

In the neural network architecture described above, the continuous features as well as the learned input embeddings of categorical features jointly allow the model to represent customers and articles on a personalized level. On the other hand, through the weight matrices which parameterize the network layers, the model learns to represent higher-order patterns in the data that are globally relevant for predicting size and fit. Such a model can be efficiently trained at scale, given (individually) sparse customer-article interaction histories.
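The data flow of the two-pathway architecture can be illustrated with a small forward pass. This is a sketch under stated assumptions rather than the authors' implementation: it uses plain Python instead of Keras so that it is self-contained, the weights are random and untrained, the layer widths and feature names (`height`, `brand`) are hypothetical, and the skip connection is assumed to take the form x + tanh(Wx + b), which the text does not specify.

```python
import math
import random

random.seed(0)

def dense(n_in, n_out):
    """Random weight matrix (n_out rows, n_in columns) and a zero bias."""
    w = [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def affine(layer, x):
    """Compute W x + b for one layer."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def tanh_layer(layer, x):
    return [math.tanh(v) for v in affine(layer, x)]

def residual_tanh(layer, x):
    """Assumed skip connection: x + tanh(W x + b), input and output same width."""
    return [xi + hi for xi, hi in zip(x, tanh_layer(layer, x))]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def table(n_rows, dim):
    """Trainable lookup table: one vector per distinct categorical value."""
    return {i: [random.gauss(0.0, 0.1) for _ in range(dim)] for i in range(n_rows)}

EMB, HID, K = 4, 8, 3  # hypothetical embedding size, hidden width, outcomes
customer_emb, article_emb, brand_emb = table(5, EMB), table(7, EMB), table(2, EMB)

# Separate pathways: the two sides do not share weights.
cust_in, cust_res = dense(EMB + 1, HID), dense(HID, HID)
art_in, art_res = dense(2 * EMB, HID), dense(HID, HID)
top_in, top_res, out = dense(2 * HID, HID), dense(HID, HID), dense(HID, K)

def sfnet_forward(customer_id, height, article_id, brand):
    # Customer pathway: embedded identifier concatenated with a continuous feature.
    c = customer_emb[customer_id] + [height]
    c = residual_tanh(cust_res, tanh_layer(cust_in, c))
    # Article pathway: two embedded categorical features.
    a = article_emb[article_id] + brand_emb[brand]
    a = residual_tanh(art_res, tanh_layer(art_in, a))
    # Top layers: concatenate both latent embeddings and map to class probabilities.
    h = residual_tanh(top_res, tanh_layer(top_in, c + a))
    return softmax(affine(out, h))

psi = sfnet_forward(customer_id=3, height=0.5, article_id=6, brand=1)
```

In training, the lookup tables and all weight matrices would be updated jointly by backpropagating the cross-entropy loss, which is what makes the embeddings entity-specific while the layer weights stay global.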

4. Empirical Evaluation

We demonstrate the generality of our method by applying it to different datasets and tackle a variety of size and fit related classification tasks. Two of the datasets we use are publicly available benchmarks for size recommendation (misra2018decomposing), while another two are our internal datasets. One of the internal datasets contains feedback from fashion experts on length and width deviation of a large number of shoes with respect to their given sizes. The other internal dataset is comprised of a large number of customer orders and purchases, on which in a backtesting setup we learn to predict sizes of ordered and kept articles for individual customer accounts. We compare our approach with a number of methodologies and report micro-averaged area under the ROC curve (AUC), accuracy and average log-likelihood as performance metrics.
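The reported metrics can be sketched for a multi-class setting as follows. This is an illustrative implementation, assuming that micro-averaging means pooling every (sample, class) pair into a single binary problem before computing the AUC; the quadratic pairwise AUC is chosen for clarity, not efficiency, and all names and numbers are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def micro_auc(probs, outcomes):
    """Pool all one-vs-rest decisions across classes, then compute one AUC."""
    labels, scores = [], []
    for p, y in zip(probs, outcomes):
        for k, pk in enumerate(p):
            labels.append(1 if k == y else 0)
            scores.append(pk)
    return auc(labels, scores)

def accuracy(probs, outcomes):
    """Share of interactions whose most probable class is the observed one."""
    hits = sum(1 for p, y in zip(probs, outcomes) if p.index(max(p)) == y)
    return hits / len(outcomes)

# Hypothetical predictions for three interactions and three classes.
probs = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.4, 0.3]]
outcomes = [0, 1, 2]
print(micro_auc(probs, outcomes), accuracy(probs, outcomes))
```

Micro-averaging weights every prediction equally, so frequent classes such as ‘Fit’ dominate the score; this is worth keeping in mind when the class distribution is imbalanced.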

4.1. Experimental Setup

We use the Keras functional API with TensorFlow backend in Python for our implementation. For parameter optimization we use the Adam optimizer (kingma2014) to perform SGD. We use performance on validation data (taken to be a split of the data at hand) for hyperparameter tuning and to avoid overfitting. Table 1 describes the hyperparameter settings we used in our experiments (the settings listed in Table 1 were not found exhaustively, and in our experience the performance is fairly robust to minor deviations in the listed settings). Apart from regularization as listed in Table 1, we did not observe significant performance gains from applying other regularization measures such as dropout (srivastava2014) or batch normalization.


Due to the input embedding of categorical features, the parametric capacity and with it the memory requirement of our method increase linearly with respect to both the cardinality of embedded customer and article features, as well as customer and article numbers. Otherwise the number of parameters as defined by customer and article input pathways and top layers in Table 1 remains constant throughout.
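The linear growth of the embedding parameters can be made concrete with a short calculation. The sketch below assumes that every embedded feature uses the same embedding dimension (Table 1 lists 10) and, as an example, takes the customer and article counts of ModCloth from Table 2; the function name is hypothetical.

```python
def embedding_parameters(cardinalities, embedding_dim):
    """Number of trainable embedding values: one vector per distinct feature value."""
    return sum(cardinalities.values()) * embedding_dim

# Distinct customers and articles in ModCloth (Table 2), embedded identifiers only.
modcloth = {"user id": 47958, "item id": 5012}
n_params = embedding_parameters(modcloth, embedding_dim=10)

# Doubling the number of customers and articles doubles the embedding parameters.
doubled = embedding_parameters({k: 2 * v for k, v in modcloth.items()}, embedding_dim=10)
```

The weights of the pathway and top layers do not appear in this count because their number depends only on the layer widths, not on how many customers or articles exist.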

SFnet Hyperparameters
Customer/Article Pathway   (emb. + cont.) feats. → 25 → 15 → 10
Top Layers                 50 → 100 → 200 → 500 → output
L2 Reg.
L2 Reg. Cust. Emb.         0.1
L2 Reg. Article Emb.       0.01
Embedding Dimensions       10
Hidden Unit Activation     tanh
Loss                       cross-entropy
SGD Batch Size             2048
Epochs                     15–50
Table 1. Hyperparameter settings used in our experiments.

4.2. Experiments on Public Datasets

The two publicly available datasets we use were introduced by (misra2018decomposing). One of the datasets, ‘ModCloth’, comes from an online vintage clothing retailer. The data contains three categories of clothing: dresses, bottoms and tops. The other dataset, ‘RentTheRunWay’, comes from an online clothing rental platform for women. The dataset is comprised of several clothing categories (including shoes). Both datasets contain customer-article interactions with categorical feedback on fit: ‘Small’, ‘Fit’ or ‘Large’. Table 2 contains general statistics of the datasets as provided by (misra2018decomposing). The datasets are sparse in customer-article interaction. Following the protocol used by (misra2018decomposing), we randomly split the data into training, validation and testing; however, since we do not know the exact split used in (misra2018decomposing), we report the average results with standard deviation computed from independent trials.

Statistic/Dataset              ModCloth   RentTheRunWay
# Transactions                 82,790     192,544
# Customers                    47,958     105,571
# Articles                     5,012      30,815
% Small                        15.7       13.4
% Large                        15.8       12.8
Single Transaction Customers   31,858     71,824
Single Transaction Articles    2,034      8,023
Table 2. General statistics of public datasets.

Table 3 lists customer and article features available in both datasets that we use to train our neural network. We further indicate the categorical features that we embed via the input embedding technique described in Section 3.2. To handle cold-start cases during test (and validation), we define a ‘default’ input embedding for each embedded feature. The default embeddings were then trained by randomly and independently assigning each of them to a fraction of the data points in every SGD epoch.
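The default-embedding scheme for cold-start cases can be sketched as a preprocessing step that runs once per epoch. This is an illustration under assumptions: the share of data points assigned to the default embedding is not recoverable from the text, so it appears here as the hypothetical parameter `default_rate`, and the key `DEFAULT`, the function names and the example rows are likewise invented.

```python
import random

DEFAULT = "<default>"  # hypothetical key of the shared default embedding

def mask_ids(rows, features, default_rate, rng):
    """Return a copy of rows in which, independently per embedded feature,
    a random share of values is replaced by DEFAULT for the current epoch."""
    masked = []
    for row in rows:
        new_row = dict(row)
        for f in features:
            if rng.random() < default_rate:
                new_row[f] = DEFAULT
        masked.append(new_row)
    return masked

def lookup_key(value, known_values):
    """At validation or test time, unseen values fall back to DEFAULT."""
    return value if value in known_values else DEFAULT

rows = [{"user_id": "u%d" % i, "item_id": "i%d" % (i % 3)} for i in range(10)]
rng = random.Random(0)
epoch_rows = mask_ids(rows, ["user_id", "item_id"], default_rate=0.1, rng=rng)
```

Drawing a fresh mask every epoch means that each default vector is fitted on a changing random subset of the data, so it tends toward a population-level representation that is a reasonable stand-in for entities never seen in training.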

Features/Dataset   ModCloth                                                                  RentTheRunWay
Article            category, quality, item id, size                                          category, rating, rented for, item id, size
Customer           shoe width, shoe size, waist, bust, cup size, bra size, hips, height, user id   age, body type, bust size, height, weight, user id
Table 3. Benchmark customer and article features. Marked features were categorical and were embedded using input embedding; a second marker denotes features that were split into alphabetical (for embedding) and numerical parts.

MLP Baseline: As a deep learning baseline, we train another neural network to parameterize (2). The architecture of the model is a feedforward neural network that we obtain by simply concatenating the customer and article input pathways of SFnet. It therefore corresponds to the MLP pathway of NCF (ncf2017), however with additional customer and article input features and skip connections between layers. For both benchmarks, the network takes as input a concatenated set of customer and article features listed in Table 3. All categorical features marked in the table are embedded via input embeddings. We follow hyperparameter settings from Table 1 to endow the model with a capacity comparable to SFnet. Following the same protocol as for SFnet, we perform independent runs of the model to report mean and standard deviation of the performance metrics.

Results: We compare the performance of SFnet on benchmark datasets in Table 4. The first four rows in the table are results from (misra2018decomposing), where the authors compare latent variable (LV) vs. latent factor (LF) based embeddings of customers and articles with logistic regression (LR) or metric learning (ML) on top for classification. The approach is conceptually analogous to ours, but we learn both customer and article embeddings as well as their interaction end-to-end with a neural network. To our knowledge, the results of (misra2018decomposing) represent the previous state-of-the-art on both benchmarks; SFnet however clearly outperforms (misra2018decomposing) as well as the MLP baseline, which is analogous to the MLP pathway in NCF. As illustrated in Figure 2, in one of our runs we could achieve an improvement on the average AUC over the previously best performing LF-ML. While (misra2018decomposing) do not publish results on accuracy and average log-likelihood, compared to the MLP baseline, SFnet achieves better results on both datasets.

Figure 2. The ROC curves for one of the best runs of SFnet on benchmark datasets.

                 Micro-avg. AUC                   Accuracy                         Average log-likelihood
Method/Dataset   ModCloth        RentTheRunWay    ModCloth        RentTheRunWay    ModCloth         RentTheRunWay
LV-LR            0.617           0.676
LF-LR            0.626           0.672
LV-ML            0.621           0.681
LF-ML            0.657           0.719
MLP Baseline     0.624 ± 0.007   0.692 ± 0.010    0.681 ± 0.004   0.733 ± 0.006    -0.819 ± 0.004   -0.708 ± 0.01
SFnet            0.689 ± 0.005   0.749 ± 0.004    0.690 ± 0.004   0.760 ± 0.004    -0.758 ± 0.006   -0.610 ± 0.008
Table 4. Comparison on publicly available benchmark datasets.

4.2.1. Customer and Article Embeddings and Data Sparsity:

As discussed in Section 3.3, the method we propose can learn implicit features of customers and articles through entity-specific input embeddings; the model however requires enough interactions of an entity (i.e. a customer or an article) to learn a meaningful representation of it through input embedding. This is evident in Table 5, where we compare the performance of SFnet on ModCloth and RentTheRunWay benchmarks w.r.t. inclusion vs. exclusion of user and item identifiers from customer and article features. As indicated by the first two rows of the table, including or excluding user ID from the list of customer features in Table 3 does not have a significant effect on performance for either dataset. This should not come as a surprise, as the general statistics of data in Table 2 indicate that most customers in both datasets have only one transaction; hence we cannot expect the model to capture anything meaningful by embedding the customer identifier. Table 2 on the other hand indicates that the datasets are relatively less sparse on the article side. Indeed, removing item ID from article features in Table 3 affects the performance of our model, which is reflected by the third and fourth rows of Table 5.

Given these results for the benchmarks, we surmise that SFnet makes use of both explicit and implicit features of articles, while for customers it mainly relies on explicit features to handle the task. In the next sections, our method will completely rely on input embeddings learned against unique identifiers to represent customers for personalized size and fit predictions.

Entity embedding    Micro-avg. AUC                   Accuracy                         Average log-likelihood
user id   item id   ModCloth        RentTheRunWay    ModCloth        RentTheRunWay    ModCloth         RentTheRunWay
                    0.689 ± 0.005   0.749 ± 0.004    0.690 ± 0.004   0.760 ± 0.004    -0.758 ± 0.006   -0.610 ± 0.008
                    0.693 ± 0.009   0.751 ± 0.004    0.691 ± 0.004   0.760 ± 0.001    -0.757 ± 0.009   -0.607 ± 0.004
                    0.637 ± 0.004   0.667 ± 0.007    0.686 ± 0.004   0.733 ± 0.007    -0.803 ± 0.006   -0.716 ± 0.023
                    0.638 ± 0.007   0.674 ± 0.003    0.683 ± 0.005   0.739 ± 0.002    -0.806 ± 0.009   -0.698 ± 0.006
Table 5. Effect of including or excluding customer and article embeddings on the performance of SFnet.

4.3. Experiments on Expert Feedback Data

In order to gain insights on size and fit characteristics of new articles before their online activation, we ask different fashion experts to physically try on articles and provide qualitative feedback on their fit. Each fitting session involves one fashion expert and the sessions are run independently so that the experts do not influence each other. We run three fitting sessions for each article. For every session we draw an expert from a pool of experts.

The motivation for this experiment is that if using SFnet we can learn to reliably predict fit feedback of individual experts given the attributes of an article, we can select new articles for try-ons based on the predicted feedback: for instance when there is a degree of disagreement in the predicted feedback of different experts or if there is a consensus on deviation from true to size fit.

The data for the experiment comprises a large number of distinct pairs of shoes. We collect feedback on both length and width of the shoes. The feedback is defined as an ordinal variable and it takes one of five values: ‘Too small’, ‘Small’, ‘True to size’, ‘Big’ or ‘Too big’. The dataset is highly imbalanced, with ‘True to size’ being the dominant response for both length and width.

We train individual instances of SFnet and the methods we compare with to independently predict the feedback on length and width. We treat each fashion expert as a customer who is represented by a unique identifier. For shoes we use attributes such as brand, fitted size, color, main material and other categorical attributes, which define non-overlapping subcategories of shoes. All features we consider are categorical and are embedded through input embedding. We perform independent runs to report the mean and standard deviation of the performance metrics. For each run we randomly split the data into a training portion and two equally sized portions that are kept for validation and testing. We benchmark our method against two other well-suited approaches for the problem: a Naive Bayes classifier and boosted trees.

Naive Bayes: When dealing with classification tasks with categorical input features, Naive Bayes is a straightforward choice. However, in our case some of the features have very high cardinality (e.g., the number of distinct brands) and some of the feature values are sparsely or never observed in the training data. Hence we apply Laplace smoothing (manning2013introduction) to avoid computational issues with the conditional probability estimation.
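A categorical Naive Bayes classifier with Laplace smoothing can be sketched in a few lines. This is an illustrative implementation, not the one used in the experiments: the class name, the smoothing constant `alpha=1.0`, the extra slot reserved for unseen values and the toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict

class CategoricalNB:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace (add-alpha) smoothing constant

    def fit(self, X, y):
        self.n = len(y)
        self.class_counts = Counter(y)
        self.counts = defaultdict(Counter)  # (feature index, class) -> value counts
        self.values = defaultdict(set)      # feature index -> values seen in training
        for row, label in zip(X, y):
            for f, v in enumerate(row):
                self.counts[(f, label)][v] += 1
                self.values[f].add(v)
        return self

    def predict_proba(self, row):
        log_post = {}
        for c, n_c in self.class_counts.items():
            lp = math.log(n_c / self.n)  # class prior
            for f, v in enumerate(row):
                card = len(self.values[f]) + 1  # assumed: one extra slot for unseen values
                num = self.counts[(f, c)][v] + self.alpha
                lp += math.log(num / (n_c + self.alpha * card))
            log_post[c] = lp
        m = max(log_post.values())
        z = sum(math.exp(lp - m) for lp in log_post.values())
        return {c: math.exp(lp - m) / z for c, lp in log_post.items()}

# Hypothetical shoe attributes (brand, material) with fit feedback on length.
X = [("brandA", "leather"), ("brandA", "textile"), ("brandB", "leather")]
y = ["True to size", "True to size", "Small"]
model = CategoricalNB().fit(X, y)
proba = model.predict_proba(("brandC", "leather"))  # brandC never seen in training
```

Without smoothing, the unseen brand would receive zero probability under every class and the posterior would be undefined; the added pseudo-counts keep every conditional probability strictly positive.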

Boosted trees: Another well-suited methodology to compare against is gradient boosted trees. High feature cardinality also poses a problem for tree-based approaches, as it requires the training algorithm to find the best among all possible partitions of feature values into classes, the number of which is given by the Stirling number of the second kind (graham1989concrete). We therefore encode fashion experts and shoe attributes using smoothed target encoding (micci2001preprocessing) to reduce the complexity of the task.
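To give a sense of how quickly the number of candidate partitions grows, the Stirling numbers of the second kind can be computed with their standard recurrence S(n, k) = k S(n-1, k) + S(n-1, k-1). The function name and the chosen arguments below are illustrative.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Number of ways to partition n distinct values into k non-empty groups."""
    if n == k:
        return 1          # includes S(0, 0) = 1
    if k == 0 or k > n:
        return 0
    # Either the n-th value joins one of the k existing groups,
    # or it forms a new group on its own.
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

# Partitions of a categorical feature with 10, 20 and 40 values into 3 groups.
growth = [stirling2(n, 3) for n in (10, 20, 40)]
```

The count grows roughly like 3^n / 6 for three groups, which is why replacing each categorical value with a single smoothed numeric statistic of the target is a practical way to make tree splits tractable.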

Results: Table 6 shows the results obtained on test data. All three approaches are comparable in terms of accuracy; however, the numbers hover around the a priori probability of the dominant ‘true to size’ class for both length and width. We take the results as an indication of expert feedback being unbiased and therefore independent of the considered article attributes. In terms of other metrics, while SFnet takes a clear lead w.r.t. the average AUC, the relatively low likelihood values of SFnet, despite it being more accurate in comparison to Naive Bayes, suggest that the output distributions of SFnet may tend to be more peaky in nature. This leads to a relatively high loss in likelihood when the method predicts the wrong outcome with a high probability.
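The effect of peaky output distributions on the average log-likelihood can be illustrated with a toy comparison. The two sets of predictions below are invented for this purpose: both always favor the same class and therefore have identical accuracy, but they differ in how confident they are.

```python
import math

def avg_log_likelihood(probs, outcomes):
    return sum(math.log(p[y]) for p, y in zip(probs, outcomes)) / len(outcomes)

def accuracy(probs, outcomes):
    return sum(p.index(max(p)) == y for p, y in zip(probs, outcomes)) / len(outcomes)

# Four hypothetical observations; the last one is not the dominant class.
outcomes = [0, 0, 0, 1]
peaky = [[0.98, 0.01, 0.01]] * 4   # very confident in class 0
smooth = [[0.60, 0.20, 0.20]] * 4  # same ranking, less confident

acc_peaky, acc_smooth = accuracy(peaky, outcomes), accuracy(smooth, outcomes)
ll_peaky = avg_log_likelihood(peaky, outcomes)
ll_smooth = avg_log_likelihood(smooth, outcomes)
```

The single confident mistake contributes log(0.01) to the peaky model's total, which outweighs its small gains on the three correct predictions, so the smoother model ends up with the higher average log-likelihood at equal accuracy.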

Micro-avg. AUC Accuracy Average log-likelihood
Method/Feedback Length Width Length Width Length Width
Naive Bayes
Boosted Trees
Table 6. Comparison on expert feedback prediction task.

4.4. Experiments on Purchase Data

In this section, we present results on modeling customer size preferences given their purchase history. Our goal here is to predict the size of articles which customers order and keep. Note that a “customer” in this context refers to a customer account which is potentially used by multiple individuals. This is a realistic scenario for most e-commerce retail platforms and for personalized recommendation, it demonstrates the need for modeling multiple personas behind one identity. We will analyze SFnet’s performance on multi-user accounts in Section 4.4.2.

For these experiments, we use our proprietary dataset of customer purchases spanning a period of roughly years. The purchased articles in the data belong to the sub-categories of shoes, textile and sportswear. We only consider customer accounts with at least purchases in the history. The dataset contains roughly million purchases involving around customers and articles. There are more than distinct sizes in the data. Multidimensional sizes such as jeans size and are taken to be independent of each other. Due to overlapping size systems, a distinct size can be used in multiple clothing sub-categories.

Apart from an anonymous customer identifier, our data does not contain any other customer information. We therefore do not consider cold-start customers in this experiment [2]. For articles we use unique identifiers together with categorical attributes such as brand, main material, country of origin, season and taxonomical attributes which, including gender (female, male or unisex), define a non-overlapping hierarchy of clothing items.

[2] In the absence of additional features as in Table 3, if (akin to Section 4.2) we learn a default customer embedding for cold-start customers, we can only expect to approximate population-level marginal distributions over kept sizes in article sub-categories, which will be rather non-informative for personalization.

Backtesting: To simulate a realistic scenario, we perform our experiments in a backtesting setup. To that end, we split the data chronologically into train, validation and test sets. This implies that our training instances come from the past, while the validation and test splits contain more recent purchases, with the test split containing the latest ones. In backtesting, aside from encountering cold-start customers, we may also encounter new articles in the test split for which we have not learned any dedicated input embeddings during training; nonetheless, the default article embedding (as described in Section 4.2) together with shared attributes such as brand, material, etc. allows us to evaluate new articles in the test (and validation) data split.
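A chronological split with a fallback for unseen articles can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the field names (`timestamp`, `article_id`), the split fractions and the convention of reserving index 0 for the default article embedding are all assumptions made for the example.

```python
def chronological_split(purchases, train_frac=0.8, val_frac=0.1):
    """Split purchases by time: oldest go to train, newest to test.

    The fractions are placeholders; the actual ratios are not restated here.
    """
    ordered = sorted(purchases, key=lambda p: p["timestamp"])
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test

DEFAULT_ARTICLE_INDEX = 0  # assumed slot for the shared default embedding

def build_article_vocab(train):
    """Assign a dedicated embedding index to every article seen in training."""
    vocab = {}
    for p in train:
        # Indices start at 1 because 0 is reserved for the default embedding.
        vocab.setdefault(p["article_id"], len(vocab) + 1)
    return vocab

def article_index(article_id, vocab):
    """Articles unseen during training fall back to the default index."""
    return vocab.get(article_id, DEFAULT_ARTICLE_INDEX)

# Hypothetical toy data: ten purchases in shuffled time order.
purchases = [{"timestamp": t, "article_id": f"a{t % 4}"} for t in (5, 1, 9, 3, 7, 0, 8, 2, 6, 4)]
purchases[2]["article_id"] = "new_article"  # the latest purchase is a new article

train, val, test = chronological_split(purchases)
vocab = build_article_vocab(train)
print([article_index(p["article_id"], vocab) for p in test])
```

Attributes such as brand or material would be looked up in the same way, so a new article is still represented by its shared attributes even though its identifier maps to the default slot.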

With train, validation [3] and test, we keep data split ratios the same as before. During test, we truncate and renormalize the output distributions of SFnet and the compared methods to the available sizes of test articles. Moreover, since we allow customers to order more than one size, we further report top-2 and top-3 accuracies along with the other performance metrics.

[3] Since the methods we compare with in this section do not require extensive hyperparameter tuning, we merge the validation split into the training data for those methods.
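The truncation and renormalization step, followed by a top-k check, can be sketched as below. The function names, the size labels and the uniform fallback for the case where no probability mass falls on the available sizes are assumptions made for this illustration.

```python
def restrict_to_available(dist, available_sizes):
    """Keep only the sizes an article is offered in and renormalize to sum to 1."""
    masked = {size: dist.get(size, 0.0) for size in available_sizes}
    total = sum(masked.values())
    if total == 0.0:
        # Assumed fallback: uniform over available sizes when no mass remains.
        return {size: 1.0 / len(masked) for size in masked}
    return {size: prob / total for size, prob in masked.items()}

def top_k_hit(dist, true_size, k):
    """True if the kept size is among the k most probable sizes."""
    ranked = sorted(dist, key=dist.get, reverse=True)
    return true_size in ranked[:k]

# Hypothetical model output over a global size vocabulary.
predicted = {"S": 0.15, "M": 0.5, "L": 0.3, "XL": 0.05}

# The test article is not offered in size M.
restricted = restrict_to_available(predicted, ["S", "L", "XL"])
print(restricted)
print(top_k_hit(restricted, "S", k=1), top_k_hit(restricted, "S", k=2))
```

Here the mass originally placed on the unavailable size M is redistributed proportionally, so L becomes the top-1 recommendation and the kept size S is only recovered at top-2.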

Bayesian Model: We benchmark our approach against a recently introduced Bayesian method for size recommendation (guigoures2018hierarchical). The approach is based on a hierarchical Bayesian model exploiting the customer purchase history to learn the usual size of multiple users of a single account. Originally, the method was proposed to model both returns and keeps in a customer history, but in our setting we are only interested in modeling the size distribution of kept articles in customer accounts. In this case, the model proposed by (guigoures2018hierarchical) reduces to an infinite Gaussian mixture model with an associated truncated Dirichlet process of level four (we refer to (guigoures2018hierarchical) for more details).

We train an independent instance of the Bayesian model for articles of all genders (i.e. female, male and unisex) within each of the main clothing categories in data – including shoes and upper and lower body garments. Moreover, since the approach is meant to be for continuous size systems, we employ expert knowledge to convert alpha-numeric sizes (e.g., Small, Medium, Large) into a continuous size range. To disambiguate overlapping numerical size systems, we further use a semi-supervised Gaussian Expectation Maximization algorithm (basu2002) to cluster articles based on the characteristics of their size systems (e.g., minimum, maximum and median sizes, step between sizes, etc.). Once clustered, the size that represents a cluster is defined by a domain expert.

Baseline: We also estimate a population-level marginal distribution of kept sizes, which we obtain by training the Bayesian model for each clothing category and gender across all customers.
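For intuition, a category and gender-conditioned marginal can be approximated by simple empirical frequencies, as in the sketch below. Note that this is a swapped-in simplification: the baseline in the text is obtained by training the Bayesian model across all customers, not by counting. The field names and toy records are hypothetical.

```python
from collections import Counter, defaultdict

def marginal_size_distributions(kept_purchases):
    """Empirical distribution of kept sizes for each (category, gender) pair."""
    counts = defaultdict(Counter)
    for p in kept_purchases:
        counts[(p["category"], p["gender"])][p["size"]] += 1
    marginals = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        marginals[key] = {size: n / total for size, n in counter.items()}
    return marginals

# Hypothetical kept purchases pooled over all customers.
kept = [
    {"category": "shoes", "gender": "female", "size": "38"},
    {"category": "shoes", "gender": "female", "size": "38"},
    {"category": "shoes", "gender": "female", "size": "39"},
    {"category": "shoes", "gender": "male", "size": "43"},
]

marginals = marginal_size_distributions(kept)
print(marginals[("shoes", "female")])
```

Because the distribution ignores the customer entirely, every customer receives the same recommendation for a given category and gender, which is what makes it a non-personalized reference point.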

Results: As shown in Table 7, SFnet outperforms both the Bayesian and baseline approaches on all the metrics. We further observe a narrowing gap between SFnet and the Bayesian approach w.r.t. top-k accuracies. This is because, for a given article, there are usually only a handful of sizes to choose from; hence increasing k significantly boosts the chances of hitting the right size for both algorithms.

Micro-avg. Accuracy Average
Method AUC top-1 top-2 top-3 log-likelihood
Table 7. Comparison on test data containing articles in various clothing categories and overlapping size systems.

4.4.1. Dealing with Category Cold-Starts:

An appealing use-case for size recommendation in e-commerce fashion retail is that of category cold-start where an existing customer with purchase history in other categories orders an article from a new category. Note that for category cold-starts, the Bayesian approach defaults to the baseline approach, which is a category and gender-conditioned marginal distribution of purchased sizes.

Results: While the baseline approach recommends, among the available sizes, the most purchased size for a category cold-start article, we expect SFnet to do better than that. Indeed, in Table 8, we find SFnet’s performance on cold-start recommendation in three different categories to be significantly better than the baseline default mode of the Bayesian approach.

Micro-avg. Accuracy Average
AUC top-1 top-2 top-3 log-likelihood
Men’s Shirts
Table 8. Category cold-start performance in three different categories.

4.4.2. Modeling Multiple Users Behind One Identity:

In our last experiment we assess SFnet’s capacity to deal with multiple users behind one account. We use gender profiles (i.e. female, male or unisex) of purchased articles to assume customer accounts to be single or multi-user. Based on the gender profiles, we first filter the data to contain only those accounts with both female and male articles in the test split. We then perform ablations by partitioning the filtered accounts w.r.t. their gender distribution in the training data, yielding the three rows of Table 9. The first row represents user accounts that contain either female and unisex, or male and unisex articles in their training histories. During test, as indicated by the male and female columns of the table, those accounts are tested on the articles of the gender that was lacking in their training histories. We term such cases ‘gender cold-starts’. The second row of the table represents the opposite of the first row, where accounts with female and unisex (respectively male and unisex) articles in training data are tested on female (respectively male) articles. The last row represents the accounts which contain all the three genders in their training histories, and we test their performance on female vs. male articles.
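The partitioning into the three rows of Table 9 can be sketched as follows. This is one possible reading of the setup, with hypothetical function and label names; in particular, the sketch does not check for the presence of unisex articles in the training history, which is a simplifying assumption.

```python
def in_filtered_set(test_genders):
    """Keep only accounts with both female and male articles in the test split."""
    return {"female", "male"} <= set(test_genders)

def table_row(train_genders, test_gender):
    """Map an account's training history and a test article gender to a row.

    Returns 'gender cold-start' (row 1), 'same gender' (row 2),
    'multi-user' (row 3), or None if the account fits no row.
    """
    seen = set(train_genders) & {"female", "male"}
    if seen == {"female", "male"}:
        return "multi-user"
    if len(seen) == 1:
        (train_gender,) = seen
        return "same gender" if test_gender == train_gender else "gender cold-start"
    return None  # e.g. only unisex articles in the training history

# Hypothetical account: female and unisex articles in training.
history = ["female", "unisex", "female"]
print(in_filtered_set(["female", "male", "male"]))
print(table_row(history, "male"), "|", table_row(history, "female"))
```

Under this reading, the same account contributes to the first row when tested on the gender missing from its history and to the second row when tested on the gender it has already purchased.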

Baseline Bayesian SFnet
gender male female male female male female
Table 9. Top-1 accuracy on multi-user accounts in test. Rows represent different types of customer histories encountered during training.

Results: As the Bayesian approach defaults to the baseline for the gender cold-starts, we see identical numbers for both methods in the first row of Table 9; to our surprise however, SFnet’s performance for the gender cold-starts is significantly better than the baseline marginals. We hypothesize that SFnet makes use of higher-order correlations discovered from multi-user accounts to achieve the results. In the second row of the table, we see SFnet is most accurate with user accounts that are consistently one gender (plus unisex) during training and test. For multi-user accounts in the third row, we observe a reduction in SFnet’s performance, yet the accuracy is significantly higher than the Bayesian (and baseline) approach. The results are indicative of SFnet’s capacity for modeling multiple users, although further analysis is warranted to assess SFnet’s ability to disambiguate multiple intents.

5. Conclusion

In this work we proposed SFnet, a deep learning based methodology which combines collaborative and content-based modeling techniques to learn input and latent representations of customers and articles for size and fit prediction. The method is highly scalable and works end-to-end without requiring a priori knowledge about the underlying ordinal structure of its prediction targets. As demonstrated by competitive empirical performance in a variety of experiments on multiple datasets, our SFnet architecture offers both the flexibility and the capacity for capturing higher-order abstractions of size and fit relevant information from arbitrary customer and article features. Future extensions of this work can include multi-view objectives (Elkahky2015) (such as predicting both categorical and ordinal targets) or time-dependent modeling of customer behavior (DonkersEtAl2017) with respect to size and fit.

We acknowledge and appreciate constructive feedback from our reviewers and area chair. We thank Alan Akbik and Calvin Seward for their valuable feedback in the preparation of this manuscript. We would also like to thank Julia Lasserre for helpful discussions on the design of experiments on customer purchase data.