Assessing Fashion Recommendations: A Multifaceted Offline Evaluation Approach

by Jake Sherman, et al.
True Fit Corporation

Fashion is a unique domain for developing recommender systems (RS). Personalization is critical to fashion users. As a result, highly accurate recommendations are not sufficient unless they are also specific to users. Moreover, fashion data is characterized by a large majority of new users, so a recommendation strategy that performs well only for users with prior interaction history is a poor fit to the fashion problem. Critical to addressing these issues in fashion recommendation is an evaluation strategy that: 1) includes multiple metrics that are relevant to fashion, and 2) is performed within segments of users with different interaction histories. Here, we present our multifaceted offline strategy for evaluating fashion RS. Using our proposed evaluation methodology, we compare the performance of three different algorithms, a most popular (MP) items strategy, a collaborative filtering (CF) strategy, and a content-based (CB) strategy. We demonstrate that only by considering the performance of these algorithms across multiple metrics and user segments can we determine the extent to which each algorithm is likely to fulfill fashion users' needs.




1. Introduction

Few industries touch the lives and identities of consumers as intimately as fashion. Everyone makes decisions about what to wear, and these decisions reflect not only the prevailing cultural norms (souiden2011cross), but also the individual identity of the wearer (casidy2009predicting; mulyanegara2009big). Clothing can affect how the wearer feels and behaves as well as how others feel and behave in response to the wearer (Johnson2014). The motivations that drive fashion consumers include fashionability, individualization, self assurance, flaw minimization, and comfort, yet the importance of these motivations to fashion purchasing depends on the characteristics of the consumer (Tiggemann2009). That is, consumers' motivations for buying clothing are personal (Vaccaro2016). Consequently, any recommender system (RS) developed for the fashion domain should be evaluated on criteria that are specifically relevant to the needs of fashion consumers.

As with other domains, fashion users should receive “accurate” recommendations (i.e., recommendations that are relevant), but accuracy alone is not sufficient. Although some researchers have been urging RS developers to think beyond accuracy for more than a decade (e.g., (Ziegler2005)), accuracy is still the predominant focus of most RS evaluations. A survey of recent papers at the ACM RecSys Conference noted that while roughly 85% of papers used some form of offline accuracy metric, a mere 20% included a measure of diversity, novelty, or another alternative metric (Jannach2016). A dogged focus on maximizing accuracy can unintentionally degrade the end user experience. McNee, Riedl, and Konstan (McNee2006) provide an illustrative example: a travel RS that recommends solely locations that users have previously visited would perform better on most accuracy metrics than a system that recommends novel travel destinations that are more interesting to the user. Thus, non-accuracy based evaluation measures are necessary for properly evaluating RS in general, and fashion RS specifically.

Personalization is critical for good fashion recommendations. In marketing, “personalization” is operationalized as the customization of goods and/or services to meet the needs of specific consumers (Goldsmith1999). Different fashion consumers have different motivations for purchasing fashion items (Tiggemann2009). As a result, one way to evaluate the personalization of fashion recommendations is by measuring how much recommendation lists vary from user to user (e.g., the approach in (zhou2010solving)). In addition to list diversity, understanding the popularity bias in recommendations is important for evaluating personalization because recommendations dominated by popular items are necessarily depersonalized.

We also want to understand how well RS perform within multiple user segments. Providing accurate and engaging recommendations is easier for users with rich interaction histories. However, within our own fashion datasets, few users have prior sales or even item views. That is, most users are new. Because personalization is so critical to fashion, approaches such as collaborative filtering (CF) that cannot provide recommendations for new users are unlikely to provide a good experience to most fashion shoppers. Consequently, we must assess how fashion RS perform with new as well as established users.

The goal of our current work is to develop a methodology for more comprehensively evaluating the extent to which fashion RS produce quality recommendations. To provide an understanding of how different types of RS perform, we evaluate three algorithms: 1) a most popular (MP) items strategy, 2) a CF-based strategy, and 3) a content-based (CB) strategy. Our evaluation method consists of multiple measures of recommendation quality performed on multiple user segments.

2. Related Works

Several approaches have been developed to address the unique demands of providing fashion recommendations. Despite the prominence of the cold-start problem inherent in fashion data, some fashion recommendation approaches have nevertheless relied on CF. Hwangbo and colleagues (Hwangbo2018) developed a novel user-based CF approach for recommending complementary and substitute fashion items that offers interesting algorithmic ideas, but is limited in that 1) only existing products are accommodated, and 2) recommendations are not personalized to users. Rue La La, a flash sale fashion retailer, developed a latent factors CF approach for providing fashion recommendations that overcomes the cold-start problem for items by recommending product groups to users instead of individual products (Harrison2017). Although this approach does address the cold start problem for items, it does not allow them to make personalized recommendations for new users.

Other methodologies for providing fashion recommendations eschew CF in favor of models that circumvent the cold-start problem by leveraging user and/or product attributes to make recommendations. De Melo, Nogueira, and Guliato (Melo2015) developed a content-based (CB) fashion RS that constructs detailed clothing item attributes to build content profiles for each user and then uses k-nearest neighbors to make recommendations for new items. Although this approach overcomes the item cold-start problem, it cannot provide personalized recommendations for new users. An RS developed by Zalando (Freno2017) leverages both user and product attributes within a learning to rank (L2R) framework, allowing them to make recommendations for both new items and new users. However, Zalando only measured the accuracy of recommendations and did not evaluate how the system performed within different user segments.

3. Approach

| | Users | Products | Sales (%) | Views (%) | Unobserved (%) |
|---|---|---|---|---|---|
| Retailer 1 | 39,307 | 376 | 7,461 (0.05%) | 103,829 (0.7%) | 14,668,142 (99.2%) |
| Retailer 2 | 42,490 | 865 | 8,276 (0.02%) | 143,781 (0.4%) | 36,601,793 (99.6%) |
| Retailer 3 | 60,333 | 386 | 21,904 (0.1%) | 141,320 (0.6%) | 23,125,314 (99.3%) |

Table 1. Descriptive Statistics for Training Data

| | New Users (%) | View Users (%) | Sale Users (%) | Products | Sales (%) | Views (%) | Unobserved (%) |
|---|---|---|---|---|---|---|---|
| Retailer 1 | 2,477 (73.8%) | 667 (19.9%) | 213 (6.3%) | 319 | 1,727 (0.2%) | 8,850 (0.8%) | 1,060,306 (99.0%) |
| Retailer 2 | 6,048 (69.0%) | 1,997 (22.8%) | 720 (8.2%) | 676 | 5,171 (0.1%) | 50,443 (0.9%) | 5,869,526 (99.1%) |
| Retailer 3 | 5,164 (71.2%) | 1,513 (20.9%) | 578 (7.9%) | 314 | 2,753 (0.1%) | 19,578 (0.9%) | 2,255,739 (99.0%) |

Table 2. Descriptive Statistics for Test Data

3.1. Evaluation

3.1.1. Data Selection

We performed separate evaluations on three different retailers. Within retailers, we trained models for women's dresses. We constructed each dataset by taking all user-product interactions that occurred in a one year period and split our data into training and test sets by allocating the first eight months to the training data and the remaining four months to the test data. Descriptive statistics for the training and test data are in Table 1 and Table 2, respectively. In both the training and test data for all retailers, the overwhelming majority of observations in the user-item matrix are unobserved (i.e., the users did not view or buy the item). Our three retailers are also similar in terms of the distribution of users across our three user segments (see Section 3.1.2).
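The time-based split described above can be sketched as follows. This is only an illustration: the record layout `(user, item, event, day)`, the function name, and the example dates are assumptions, not details taken from the study.

```python
from datetime import date


def split_by_time(interactions, start, train_months=8):
    """Split (user, item, event, day) records into train and test sets.

    The cutoff is `train_months` calendar months after `start`. Records dated
    before the cutoff go to training; the remaining records go to test.
    Assumes `start` falls on a day of the month that exists in every month
    (for example the 1st).
    """
    month_index = start.month - 1 + train_months
    cutoff = date(start.year + month_index // 12, month_index % 12 + 1, start.day)
    train = [record for record in interactions if record[3] < cutoff]
    test = [record for record in interactions if record[3] >= cutoff]
    return train, test, cutoff


# Hypothetical interaction log covering one year.
log = [
    ("u1", "d1", "view", date(2018, 2, 3)),
    ("u1", "d2", "sale", date(2018, 8, 31)),
    ("u2", "d1", "view", date(2018, 9, 1)),
    ("u3", "d3", "sale", date(2018, 12, 24)),
]
train, test, cutoff = split_by_time(log, start=date(2018, 1, 1))
```

Splitting on time rather than at random keeps every test interaction strictly later than the training data, which is what allows users to be classified by their prior history.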

3.1.2. User Segmentation

One of our goals was to understand how RS would perform in user segments with different product interaction histories. Many of our users have no prior interaction history. Therefore, we define “new users” as users who have no sales or views in the training data. Some users have viewed items in the training data, but never made a purchase. We consider these users “view users” (i.e., users with views in the training data, but no sales). Lastly, a minority of users have a prior purchase falling within the training data. We consider these users “sale users.” We note that these user distinctions are based on the training data only. We perform our evaluations within each of these user segments as well as across all user segments to gain insight into the recommendation experience for different types of users.
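A minimal sketch of this segmentation rule is shown below. The event encoding (`"view"` and `"sale"` strings) and the function name are hypothetical; the only logic taken from the text is that a training sale takes precedence over training views, and that users absent from the training data are new.

```python
def segment_users(train_events, test_users):
    """Label each test user as 'new', 'view', or 'sale' from training data only.

    train_events: iterable of (user_id, event_type) pairs, where event_type is
    'view' or 'sale' (a hypothetical encoding).
    """
    events = list(train_events)
    buyers = {user for user, event in events if event == "sale"}
    viewers = {user for user, event in events if event == "view"}
    segments = {}
    for user in test_users:
        if user in buyers:
            segments[user] = "sale"   # at least one purchase in training
        elif user in viewers:
            segments[user] = "view"   # views in training, but no purchase
        else:
            segments[user] = "new"    # no training history at all
    return segments
```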

3.1.3. Accuracy

Because we are primarily interested in how well RS perform at ranking items, we focus our evaluation on top-n performance (Valcarce2018a). To assess model accuracy, we use a modified version of normalized discounted cumulative gain at k ($\mathrm{NDCG}@k$). $\mathrm{NDCG}@k$ is a normalized version of the discounted cumulative gain ($\mathrm{DCG}@k$) metric, which is computed for a particular user as:

$$\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)}$$

where $rel_i$ is the relevance label for the $i$-th item recommended to a user. $\mathrm{NDCG}@k$ normalizes the $\mathrm{DCG}@k$ by dividing it by the ideal $\mathrm{DCG}@k$, or the $\mathrm{DCG}@k$ that would be achieved by a perfect ranking. One of the limitations of typical $\mathrm{NDCG}@k$ implementations is that if predictions are tied, the value can be non-deterministic since the gain for tied items will be based on arbitrary ordering. To mitigate this problem, we implement the tie-aware approach proposed in (mcsherry2008computing). We set k to 10 as users typically see about 10 recommendations, and micro-average the per-user $\mathrm{NDCG}@k$ values to report an aggregated value. In addition to reporting raw values, we also report the percentage change (%) between our $\mathrm{NDCG}@k$ values and the value that would result from a random ranking of the items, since $\mathrm{NDCG}@k$ will vary based on the number of items in the data.
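One way to make the metric deterministic under ties is to replace the gain of each tied item with the average gain of its tie group, which gives the expected DCG over all orderings of the tied items. The sketch below follows that idea. The exponential gain `2**rel - 1`, the `log2` discount, and all function names are assumptions made for illustration and may differ from the implementation used in the study.

```python
import math
from itertools import groupby


def discount(rank):
    """Logarithmic position discount for a 1-based rank."""
    return 1.0 / math.log2(rank + 1)


def tie_aware_dcg(scores, rels, k):
    """Expected DCG@k when items with equal scores are ordered at random."""
    ranked = sorted(zip(scores, rels), key=lambda pair: -pair[0])
    dcg, rank = 0.0, 1
    for _, group in groupby(ranked, key=lambda pair: pair[0]):
        gains = [2 ** rel - 1 for _, rel in group]
        last = rank + len(gains) - 1
        mean_gain = sum(gains) / len(gains)
        # Every item in a tie group contributes the group's average gain at
        # each of the group's rank positions that fall inside the top k.
        dcg += mean_gain * sum(discount(r) for r in range(rank, min(last, k) + 1))
        rank = last + 1
        if rank > k:
            break
    return dcg


def ndcg_at_k(scores, rels, k):
    """Tie-aware NDCG@k for one user; 0.0 when the user has no relevant items."""
    ideal = sorted(rels, reverse=True)[:k]
    idcg = sum((2 ** rel - 1) * discount(i + 1) for i, rel in enumerate(ideal))
    return tie_aware_dcg(scores, rels, k) / idcg if idcg > 0 else 0.0
```

A convenient side effect of this formulation is that passing identical scores for every item yields the expected value under a random ranking, which is the baseline needed for the percentage-change comparison.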

3.1.4. Personalization

To date, there is little consensus as to how best to measure recommendation diversity and personalization directly. We leverage two indirect measures that speak to diversity and personalization: an inter-user average distinct recommendations at k metric and also a relative popularity metric.

In order to measure inter-user diversity, we use the $\mathrm{AD}@k$ (average distinct at $k$) metric, which is defined as:

$$\mathrm{AD}@k = \frac{2}{N(N-1)} \sum_{u=1}^{N-1} \sum_{v=u+1}^{N} d_{uv}$$

where $N$ is the total number of users, and $d_{uv}$, or the distinctness between a single pair of users, is defined as:

$$d_{uv} = \left| R_u \,\triangle\, R_v \right|$$

where $R_u$ is the set of top-k recommended items for user $u$, and $R_v$ is the set of top-k recommended items for user $v$. $d_{uv}$ measures the cardinality of the symmetric difference between two different users' top-k recommendations. In the case where two users' top-k recommendations are exactly the same, the value of $d_{uv}$ will be zero. When they have no items in common, the value will be $2k$. In order to avoid the complexity associated with computing $d_{uv}$ across the entire population of user pairs, we randomly sample a proportion $p$ of user pairs from the population of user pairs in order to create a randomly sampled set of user pairs. Then, we redefine $\mathrm{AD}@k$ as:

$$\mathrm{AD}@k = \frac{\sum_{u=1}^{N-1} \sum_{v=u+1}^{N} s_{uv}\, d_{uv}}{\sum_{u=1}^{N-1} \sum_{v=u+1}^{N} s_{uv}}$$

where $s_{uv}$ is an indicator variable that takes a value of 1 when the pair of users $(u, v)$ is in the randomly sampled set of user pairs, and a value of 0 otherwise.
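The pairwise distinctness and its sampled average can be sketched in a few lines. The function names, the `sample_prop` parameter, and the choice to include each pair independently with that probability are illustrative assumptions rather than the procedure used in the study.

```python
import random
from itertools import combinations


def distinctness(recs_a, recs_b):
    """Cardinality of the symmetric difference of two top-k lists."""
    return len(set(recs_a) ^ set(recs_b))


def average_distinct(top_k, sample_prop=1.0, seed=0):
    """Mean pairwise distinctness over a random subset of user pairs.

    top_k: dict mapping user id -> list of that user's top-k recommended items.
    sample_prop: probability with which each user pair is kept (1.0 keeps all).
    """
    rng = random.Random(seed)
    pairs = [
        pair for pair in combinations(sorted(top_k), 2)
        if rng.random() < sample_prop
    ]
    if not pairs:
        raise ValueError("no user pairs were sampled")
    return sum(distinctness(top_k[a], top_k[b]) for a, b in pairs) / len(pairs)


recs = {"u1": [1, 2, 3], "u2": [1, 2, 3], "u3": [4, 5, 6]}
full_average = average_distinct(recs)  # pairs score 0, 6 and 6
```

This version still enumerates every pair before deciding whether to keep it, which is fine for small examples; for a large user base one would draw pair indices directly instead.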

In order to measure relative popularity, we use the $\mathrm{RP}@k$ (relative popularity at k) metric to quantify the popularity of users’ top-k recommendations relative to recommending the most popular items. $\mathrm{RP}@k$ is defined as:

$$\mathrm{RP}@k = \frac{1}{N} \sum_{u=1}^{N} \mathrm{RP}_u@k$$

where $\mathrm{RP}_u@k$, or the relative popularity at k for a single user, is defined as:

$$\mathrm{RP}_u@k = \frac{\sum_{i=1}^{k} q_{u,i}}{\sum_{i=1}^{k} q^{*}_{i}}$$

where $q_{u,i}$ is the quantity sold of the $i$-th top-k recommendation for user $u$, and $q^{*}_{i}$ is the quantity sold of the $i$-th most popular product across all users. In the scenario where the top-k most popular products are being recommended to all users, $\sum_{i=1}^{k} q_{u,i}$ becomes $\sum_{i=1}^{k} q^{*}_{i}$, resulting in a value of one, its upper bound.
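As a rough illustration, relative popularity can be computed as the units sold of a user's recommended items divided by the units sold of the k best-selling items. The data layout and names below are hypothetical, and the ratio-of-sums form is one reading of the definition above.

```python
def relative_popularity(top_k, units_sold, k):
    """Return (mean relative popularity, per-user relative popularity).

    top_k: dict mapping user id -> ranked list of recommended item ids.
    units_sold: dict mapping item id -> total quantity sold.
    """
    # Denominator: units sold of the k most popular items overall.
    best_total = sum(sorted(units_sold.values(), reverse=True)[:k])
    per_user = {
        user: sum(units_sold.get(item, 0) for item in recs[:k]) / best_total
        for user, recs in top_k.items()
    }
    return sum(per_user.values()) / len(per_user), per_user


sold = {"a": 50, "b": 30, "c": 15, "d": 5}
mean_rp, rp_by_user = relative_popularity(
    {"u1": ["a", "b"], "u2": ["c", "d"]}, sold, k=2
)
```

Here `u1` receives exactly the two best sellers and therefore reaches the upper bound of one, while `u2` receives only long-tail items and scores much lower.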

3.2. Recommendation Algorithms

3.2.1. MP Recommendations

By definition, items are popular if they have broad appeal across a wide swath of users. Therefore, we might expect that by recommending popular items, we can achieve high levels of accuracy (Cremonesi2010a). We determine which items are most popular based on sales. Specifically, we sum the total number of units sold for each item within our sales data on a retailer by retailer basis, and then recommend the top-k items with the highest quantity of units sold. This MP recommendation strategy serves as a baseline algorithm that gives depersonalized but broadly palatable recommendations.
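The MP baseline amounts to a sort by total units sold. A minimal sketch, assuming the sales data is available as `(item, quantity)` records for a single retailer (a hypothetical layout):

```python
from collections import Counter


def most_popular_items(sales, k):
    """Return the k items with the highest total units sold.

    sales: iterable of (item_id, quantity) records for one retailer.
    """
    units = Counter()
    for item, quantity in sales:
        units[item] += quantity
    return [item for item, _ in units.most_common(k)]


top_two = most_popular_items([("d1", 2), ("d2", 5), ("d1", 4), ("d3", 1)], k=2)
```

Every user receives this same list, which is why the strategy is depersonalized by construction.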

3.2.2. CF Recommendations

Because some fashion RS use CF, we also include a CF-based recommendation algorithm. For our fashion items, we do not have explicit ratings of user preferences for products (e.g., a star rating of 1 to 5). Instead we must rely on implicit proxies for user preferences, in our case, product sales and views. In contrast to traditional item-based (e.g., (sarwar2001)) and user-based CF (e.g., (Resnick1994)), alternating least squares (ALS) is a matrix factorization CF strategy developed specifically for implicit feedback datasets (Hu2008). ALS allows user preference to be separated from confidence in user preference, which is useful for implicit datasets since indirect measures of user preferences are inherently noisy. Because sales can be considered a stronger signal of user preference than views, we weight sales more heavily in our model. We treat views as a binary for whether or not an item was viewed since multiple views may or may not indicate increased user preference. With our trained ALS model, we make predictions for all user-item combinations, and recommend the top-k items with the highest predicted values.
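In the implicit-feedback formulation of (Hu2008), each observed interaction is turned into a binary preference and a confidence of the form 1 + alpha * r, where r is the interaction strength. The sketch below shows how sales and binarized views might be combined into r. The specific weights, the value of `alpha`, and the use of units sold as the sales signal are illustrative assumptions; the weights used in the study are not stated.

```python
def implicit_feedback(interactions, sale_weight=5.0, view_weight=1.0, alpha=40.0):
    """Map raw interactions to (preference, confidence) pairs.

    interactions: dict mapping (user, item) -> (units_sold, view_count).
    Pairs missing from the dict are unobserved and implicitly carry
    preference 0 and the baseline confidence of 1.
    """
    feedback = {}
    for pair, (units_sold, view_count) in interactions.items():
        viewed = 1.0 if view_count > 0 else 0.0   # views are binarized
        strength = sale_weight * units_sold + view_weight * viewed
        preference = 1.0 if strength > 0 else 0.0
        confidence = 1.0 + alpha * strength       # sales raise confidence more
        feedback[pair] = (preference, confidence)
    return feedback


example = implicit_feedback({("u1", "d1"): (1, 3), ("u1", "d2"): (0, 2)})
```

With these assumed weights, a purchased-and-viewed dress ends up with far higher confidence than one that was only viewed, while repeated views add nothing beyond the first.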

3.2.3. CB Recommendations

Many fashion RS use product and/or user attributes to give recommendations, so we also include a CB approach that leverages information about users and products. This CB recommendation strategy consists of two major phases. In the first phase, we fit an ALS CF model to user sales and views and make predictions for user preferences. We then use these predictions to augment our original user-item interaction data. Specifically, if the user-item interaction was observed (i.e., was either viewed or sold), we retain the value of the original user-item interaction. If the user-item interaction was not observed, we substitute the value predicted by ALS. In the second phase, we train a random forest (RF) model using the augmented outcomes as labels and information about users and products as features. For product features, we represent fashion details such as style attributes (e.g., dress shape, sleeve length) and price. For user features, we use fashion-relevant information about users such as body mass index (BMI), user age, and brand preferences. We train models separately for different retailers since the relationships between user and product features can be assumed to vary by retailer. Using the trained RF model, we recommend the top-k items with the highest predicted values.
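The label-augmentation step can be sketched as follows: observed interactions keep their original value, unobserved ones take the ALS prediction, and each label is paired with the concatenated user and product features. All names, feature values, and the dictionary-based layout are hypothetical.

```python
def build_training_rows(observed, als_scores, user_features, item_features):
    """Build (feature_vector, label) rows for the second-phase model.

    observed: dict mapping (user, item) -> interaction value for viewed or
        sold pairs.
    als_scores: dict mapping (user, item) -> first-phase predicted preference
        for every user-item pair.
    user_features / item_features: dicts mapping ids -> lists of numbers.
    """
    rows = []
    for (user, item), score in sorted(als_scores.items()):
        # Keep the original value when the pair was observed; otherwise fall
        # back to the first-phase prediction.
        label = observed.get((user, item), score)
        rows.append((user_features[user] + item_features[item], label))
    return rows


rows = build_training_rows(
    observed={("u1", "d1"): 1.0},
    als_scores={("u1", "d1"): 0.8, ("u1", "d2"): 0.1},
    user_features={"u1": [23.5, 34]},            # e.g. BMI, age
    item_features={"d1": [1, 79.0], "d2": [0, 120.0]},  # e.g. style flag, price
)
```

Rows in this shape could then be passed to any random forest regressor. Because the features describe users and products rather than their identities, the fitted model can score pairs involving users who never appeared in the interaction data.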

4. Results

Results across retailers are presented in Tables 3-5. To help illustrate the evaluation metrics and user segmentation, we provide depictions of results for Retailer 1 in Figures 1-3.

4.1. Accuracy

Figure 1. NDCG@10 for Retailer 1. The yellow dotted line corresponds to the NDCG@10 value for Retailer 1 that would result from a random ranking of the items.

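| Segment | Algorithm | Retailer 1 (%) | Retailer 2 (%) | Retailer 3 (%) |
|---|---|---|---|---|
| Sale Users | MP | 0.077 (340.2%) | 0.025 (253.8%) | 0.122 (703.4%) |
| Sale Users | CF | 0.032 (85.4%) | 0.023 (222.7%) | 0.094 (517.0%) |
| Sale Users | CB | 0.046 (164.4%) | 0.021 (194.1%) | 0.108 (608.5%) |
| View Users | MP | 0.065 (269.6%) | 0.031 (332.6%) | 0.110 (623.1%) |
| View Users | CF | 0.030 (69.9%) | 0.024 (236.7%) | 0.078 (415.5%) |
| View Users | CB | 0.033 (90.2%) | 0.025 (258.6%) | 0.115 (657.2%) |
| New Users | MP | 0.136 (681.6%) | 0.025 (259.6%) | 0.094 (519.1%) |
| New Users | CF | - | - | - |
| New Users | CB | 0.025 (46.0%) | 0.012 (68.6%) | 0.089 (485.8%) |
| All Users | MP | 0.126 (620.3%) | 0.026 (259.8%) | 0.102 (569.2%) |
| All Users | CF | - | - | - |
| All Users | CB | 0.029 (65.8%) | 0.014 (95.5%) | 0.094 (521.1%) |

Note: CF cannot make predictions for new users.
% is % improvement over random.

Table 3. NDCG@10 Evaluation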

For NDCG@10, the MP recommendation strategy generally outperformed the CF and CB strategies; the one exception was view users at Retailer 3, where CB scored slightly higher than MP (0.115 versus 0.110). In general, CB recommendations outperformed CF recommendations on NDCG@10, with sale users in Retailer 2 being the only exception. Algorithm performance between user segments was dependent on retailer. For example, across all retailers, NDCG@10 for CB recommendations was lowest for new users, but in Retailer 1, NDCG@10 for MP recommendations was higher for new users than for both view and sale users. Overall, although MP was generally more accurate than CB, and CB was generally more accurate than CF, a finer grained analysis by user type and retailer revealed a more complex pattern of results.

4.2. Personalization

Figure 2. AD@10 for Retailer 1. Each error bar represents the 95% confidence interval of the distribution of 1,000 bootstrap samples of AD@10 values. The MP recommendation strategy produces the same recommendations for all users, resulting in AD@10 values of 0.

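| Segment | Algorithm | Retailer 1 (SD) | Retailer 2 (SD) | Retailer 3 (SD) |
|---|---|---|---|---|
| Sale Users | MP | 0 (0) | 0 (0) | 0 (0) |
| Sale Users | CF | 18.5 (3.7) | 18.3 (3.9) | 16.7 (4.1) |
| Sale Users | CB | 15.9 (2.8) | 18.0 (2.3) | 12.9 (3.4) |
| View Users | MP | 0 (0) | 0 (0) | 0 (0) |
| View Users | CF | 18.0 (4.3) | 18.2 (4.1) | 16.5 (4.2) |
| View Users | CB | 15.7 (2.9) | 18.0 (2.1) | 12.6 (3.5) |
| New Users | MP | 0 (0) | 0 (0) | 0 (0) |
| New Users | CF | - | - | - |
| New Users | CB | 15.7 (2.8) | 18.1 (2.1) | 12.5 (3.4) |
| All Users | MP | 0 (0) | 0 (0) | 0 (0) |
| All Users | CF | - | - | - |
| All Users | CB | 15.8 (2.8) | 18.1 (2.1) | 12.4 (3.4) |

SD is the standard deviation between user pairs.

Table 4. AD@10 Evaluation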

For AD@10, CF provided more distinctive recommendations than CB across all three retailers for the view and sale user segments. Meanwhile, MP provided the same popular recommendations to all users, resulting in AD@10 values of 0. Within each retailer/model combination, AD@10 exhibited very little variance across user segments. Figure 2 shows the increased distinctiveness of CF over CB and the low within-retailer/model variance in AD@10 across user segments for Retailer 1.
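The error bars in the figures are described as 95% confidence intervals over 1,000 bootstrap samples. One common way to obtain such an interval is the percentile bootstrap of the mean, sketched below. The exact resampling scheme used in the study is not stated, so the resampling unit, the percentile method, and the function name are all assumptions.

```python
import random


def bootstrap_ci(values, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement n_boot times and record each resample's mean.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


# Hypothetical per-pair distinctness values for one model and segment.
low, high = bootstrap_ci([18, 20, 16, 20, 14, 18, 20, 12, 18, 16])
```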

Figure 3. RP@10 for Retailer 1. Each error bar represents the 95% confidence interval of the distribution of 1,000 bootstrap samples of RP@10 values. By only recommending the most popular items, the MP recommendation strategy always produces RP@10 values of 1.

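| Segment | Algorithm | Retailer 1 (SD) | Retailer 2 (SD) | Retailer 3 (SD) |
|---|---|---|---|---|
| Sale Users | MP | 1 (0) | 1 (0) | 1 (0) |
| Sale Users | CF | 0.38 (0.06) | 0.31 (0.07) | 0.43 (0.13) |
| Sale Users | CB | 0.42 (0.08) | 0.24 (0.06) | 0.55 (0.13) |
| View Users | MP | 1 (0) | 1 (0) | 1 (0) |
| View Users | CF | 0.37 (0.06) | 0.31 (0.07) | 0.43 (0.12) |
| View Users | CB | 0.42 (0.08) | 0.24 (0.06) | 0.56 (0.12) |
| New Users | MP | 1 (0) | 1 (0) | 1 (0) |
| New Users | CF | - | - | - |
| New Users | CB | 0.43 (0.08) | 0.24 (0.06) | 0.57 (0.11) |
| All Users | MP | 1 (0) | 1 (0) | 1 (0) |
| All Users | CF | - | - | - |
| All Users | CB | 0.43 (0.08) | 0.24 (0.06) | 0.57 (0.11) |

Note: CF cannot make predictions for new users.
SD is the standard deviation across users.

Table 5. RP@10 Evaluation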

The MP recommendation strategy provided the most popularity-biased recommendations because by recommending the same, most-popular items to all users, MP always results in RP@10 values of 1. CB had more popularity-biased recommendations than CF for Retailers 1 and 3, while the opposite was true for Retailer 2. Overall, recommendations for Retailer 3 were the most popularity-biased, followed by Retailer 1, and then Retailer 2. Within each retailer/model combination, RP@10 exhibited very little variance across the four user segments.

4.3. Summary of model results

CB modeling was generally more accurate but less personalized than CF. Although CF generally provided more personalized recommendations for view and sale users compared with CB modeling, CF had much lower user-space coverage than CB modeling because CF cannot make recommendations for new users. While the MP recommendation strategy was able to provide very accurate recommendations, those recommendations were completely depersonalized, with the lowest possible AD@10 and highest possible RP@10. Only by performing a holistic evaluation that includes measures of personalization and evaluations across user segments are we able to expose the shortcomings of the MP and CF recommendation strategies. Despite having lower accuracy compared with MP and lower personalization when compared with CF, CB models were able to successfully balance accuracy with personalization while making recommendations to new users.

4.4. Summary of retailer results

Figure 4. Sales distributions for our three retailers. Items are ordered by popularity, with the most popular items at the bottom. The set of popular items that make up a third of sales is known as the short-head, while the set of remaining items make up the long-tail (Cremonesi2010a). The yellow dashed line provides the demarcation between the items in the short-head and long-tail.


Given the results by model type, we suspect that the patterns of results by retailer may be driven by differences in retailer sales distributions. In general, recommendations for Retailer 3 had the highest accuracy as measured by NDCG@10, but the lowest diversity and the highest popularity bias, which may be explained by Retailer 3 having the sales distribution most dominated by popular items. As shown in Figure 4, one third of sales for Retailer 3 involve only the 3.7% of most popular items, compared with Retailers 1 and 2, where one third of sales involve the 8.7% and 12.1% of most popular items, respectively. Additionally, in most cases CB modeling was more accurate but less personalized than CF, with the exception of higher RP@10 for CF than CB at Retailer 2, and lower NDCG@10 for CB than CF for sale users at Retailer 2. This exception may be explained by Retailer 2 having the sales distribution least dominated by popular items, where the accuracy and popularity bias of CB might be directly affected by the sales distribution of the underlying retailer data.
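The short-head percentages quoted above can be computed by sorting items by units sold and counting how many are needed to cover one third of all sales. A minimal sketch with a hypothetical function name and made-up sales counts:

```python
def short_head_share(units_sold, sales_fraction=1 / 3):
    """Fraction of items, most popular first, needed to cover `sales_fraction`
    of all units sold.

    units_sold: list of per-item sales counts, in any order.
    """
    quantities = sorted(units_sold, reverse=True)
    target = sales_fraction * sum(quantities)
    covered = 0
    for count, quantity in enumerate(quantities, start=1):
        covered += quantity
        if covered >= target:
            return count / len(quantities)
    return 1.0


concentrated = short_head_share([50, 20, 10, 10, 5, 5])  # one item suffices
uniform = short_head_share([10] * 10)                    # four items needed
```

A smaller share indicates a catalog whose sales are more concentrated in a few popular items.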

5. Discussion

Our goal was to propose an offline methodology for evaluating fashion RS. Because personalization is a critical feature of fashion, our evaluation framework includes accuracy as well as recommendation diversity and popularity bias. Moreover, because most users in our fashion datasets are new, we performed our analyses separately for users based on prior interaction history. By considering multiple metrics within multiple user segments, we gain a better understanding of how algorithm decisions are likely to influence the experience of the end users.

Although our results varied to some extent by user segment and retailer, we can still draw several important conclusions. First, across all of our retailers, our data is very sparse. For comparison, the Netflix dataset and the MovieLens dataset, both of which have been used extensively for RS research (Bennett2007; F.Maxwell2015), are denser than the data from any of our three retailers. The overwhelming majority of users represented in the test dataset had no views or sales in the training dataset. As a result, our CF algorithm was unable to provide recommendations for roughly 70% of test users (69.0% to 73.8% across retailers), making CF a poor algorithm choice for fashion. In contrast, our MP algorithm was able to provide accurate recommendations for all user segments; however, because item popularity was calculated across all users, the MP algorithm provides no personalization. Our CB approach represents the best algorithm choice of the three because it provides: 1) relatively accurate recommendations, 2) an acceptable level of personalization, and 3) complete user-space coverage.

Our fashion RS evaluation approach has many advantages over more simplistic approaches; however, there are several ways in which our approach is limited. Here, we define users based on interaction history, but user groups could be defined along many axes (e.g., demographics, frequent versus infrequent shoppers). Also, our approach focused on segmenting users, not products. Content providers may also be interested in how well RS perform within specific subsets of their products (e.g., new versus classic products). Furthermore, we limited our RS comparisons to three relatively basic algorithms. Comparing different variants of these algorithms (e.g., neighborhood-based CF versus model-based CF) could provide additional nuance to our results. Future research could apply our evaluation approach to more variants of common algorithms as well as to novel algorithms specifically tailored to fashion recommendation (e.g., an algorithm focused on flaw minimization or comfort).

Here, we have proposed a more comprehensive offline evaluation. However, prior research has indicated that offline and online metrics are not always correlated (beel2013comparative), calling into question the utility of offline evaluation. One of the reasons why offline and online metrics disagree could be that most offline evaluation methods are singularly focused on accuracy (beel2013comparative) and, as a result, fail to capture the full range of human factors that influence users' experiences. A multifaceted evaluation approach applied to multiple user segments is more likely to promote algorithms that perform well on online metrics (e.g., click-through rates, increased sales). Nevertheless, an important next step will be performing an online evaluation to validate our offline results.

In sum, our current work demonstrates the importance of evaluating recommendations from multiple angles. By performing a multifaceted offline evaluation, we can develop a better insight into how our RS are likely to perform when encountered by real-world fashion users.