Off-line vs. On-line Evaluation of Recommender Systems in Small E-commerce

by Ladislav Peska, et al.
Charles University in Prague

In this paper, we present our work towards comparing on-line and off-line evaluation metrics in the context of small e-commerce recommender systems. Recommending for small e-commerce enterprises is rather challenging due to the lower volume of interactions and low user loyalty, which rarely extends beyond a single session. On the other hand, we usually have to deal with lower volumes of objects, which are easier for users to discover through various browsing/searching GUIs. The main goal of this paper is to determine the applicability of off-line evaluation metrics in learning the true usability of recommender systems (evaluated on-line in A/B testing). In total, 800 variants of recommending algorithms were evaluated off-line w.r.t. 18 metrics covering rating-based, ranking-based, novelty and diversity evaluation. The off-line results were afterwards compared with the on-line evaluation of 12 selected recommender variants. Off-line results showed a great variance in performance w.r.t. different metrics, with the Pareto front covering 68% of the approaches. On-line metrics correlate positively with ranking-based metrics (AUC, MRR, nDCG), while too high values of diversity and novelty had a negative impact on the on-line results. We further trained two regressors to predict on-line results based on the off-line metrics and to estimate the performance of recommenders not evaluated in A/B testing directly.




1. Introduction

Recommender systems (RS) belong to the class of automated content-processing tools, aiming to provide users with unknown, surprising, yet relevant objects without the necessity of explicitly querying for them. The core of recommender systems are machine learning algorithms applied to the matrix of user-to-object preferences. As such, recommender systems are a highly studied research topic and are extensively used in real-world applications.

However, throughout the decades of recommender systems research, there has been a discrepancy between industry and academia in the evaluation of proposed recommending models. While academic researchers often focused on off-line evaluation scenarios based on recorded past data, industry practitioners place more value on the results of on-line experiments on live systems, e.g., via A/B testing. While off-line evaluation is easier to conduct, repeatable, fast and can incorporate arbitrarily many recommending models, it is often argued that it does not reflect well the true utility of recommender systems as seen in on-line experiments (Gilotte et al., 2018). On-line evaluation is able to naturally incorporate the current context, tasks or search needs of the user, the appropriateness of recommendation presentation as well as the causality of user behavior. On the other hand, A/B testing on live systems is time-consuming, the necessary time scales linearly with the volume of evaluated approaches, and it can even harm the retailer’s reputation if bad recommendations are shown to users.

A wide range of approaches aimed to bridge the gap between industry and academia.

Jannach and Adomavicius (Jannach and Adomavicius, 2016) argue for recommendations with a purpose: now that RS research has reached a certain level of maturity, in particular with established numerical estimators of users’ preferences, the authors suggest stepping back and revisiting some of the foundational aspects of RS. They aimed to reconsider, in a more systematic manner, the variety of purposes for which recommender systems are already used today, and proposed a framework covering both the consumer’s/provider’s viewpoint and the strategic/operational perspective.

One way to approach this goal is through user studies via questionnaires (e.g. (Pu et al., 2011)) or more involved frameworks, e.g. (Knijnenburg et al., 2012). Still, the main problem remains: we may lack participants whose motivation, information needs and behavior would be similar to those of real-world users.

A recent contribution to the academia-industry discussion was the 2017 Recommender Systems Challenge (Abel et al., 2017), focused on the problem of job recommendations. In the first phase, participants developed their models on off-line data. Afterwards, invited participants were tasked to provide and evaluate recommendations on-line. Most of the teams managed to preserve their off-line performance during the on-line phase as well. Quite surprisingly, traditional methods and metrics for estimating users’ preferences for unknown items (of course, tuned to the specifics of the task) worked best. The winning team combined content and neighbor-based models with feature extraction, balanced sampling and minimizing a tricky classification objective (Volkovs et al., 2017).

Another approach to the off-line/on-line phenomenon stems from considerations about the relevance of statistical learning in understanding causation, confounding and missing not at random (MNAR) data (see (Little and Rubin, 2002)).

The starting point of this approach is the observation that implicit feedback (despite many advantages) has inherent biases and these are a key obstacle to its effective usage. For example, position bias in search rankings strongly influences how many clicks a result receives, so that directly using click-through data as a training signal in Learning-to-Rank (LTR) methods yields sub-optimal results (Joachims and Radlinski, 2007). To overcome the bias problem, Joachims et al. (Joachims et al., 2017) presented a counterfactual inference framework that provides the theoretical basis for unbiased LTR via Empirical Risk Minimization despite the biased data.

Gilotte et al. (Gilotte et al., 2018) also considered off-line methods to estimate the potential uplift in on-line performance of a novel approach. The authors proposed a new counterfactual estimator for this purpose and utilized a proprietary dataset of 39 past A/B tests, containing several hundreds of billions of recommendations in total.

The previously mentioned approaches are user-centric. However, in the RecSys Challenge 2017 (Abel et al., 2017), we could observe the success of item-based methods. The main cause was probably the cold-start problem, which is also prevalent in small e-commerce enterprises.

Kaminskas et al. (Kaminskas et al., 2017) observed that the small number of returning customers makes traditional user-centric personalization techniques inapplicable and designed an item-centric product recommendation strategy. The authors deployed the proposed solution on two retailers’ websites and evaluated it in both on-line and off-line settings.

Jannach et al. (Jannach et al., 2015) considered the problem of recommending to users with short-term shopping goals. The authors observed the necessity of item-based approaches, but also the importance of algorithms usually used for long-term preferences.

In our previous work (Peska and Vojtas, 2017), we considered the usage of implicit preference relations for the problem of recommending in small e-commerce enterprises with short-term user goals.

In general, providing a recommendation service for small e-commerce enterprises brings several specific challenges and opportunities, which change some of the recommending paradigms applied, e.g., in large-scale multimedia enterprises. Let us briefly list the key challenges:

  • High competition has a negative impact on user loyalty. Typical sessions are very short, and users quickly leave for other vendors if their early experience is not satisfactory enough. Only a fraction of users ever return.

  • For those single-time visitors, it is not sensible to provide any unnecessary information (e.g., ratings, reviews, registration details).

  • Consumption rate is low; users often visit only a handful (0-5) of objects and rarely buy anything.

  • Small e-commerce enterprises generally offer a lower volume of objects (usually ranging from hundreds to tens of thousands instead of millions as in, e.g., Amazon).

  • Objects often contain extensive textual description as well as a range of categorical attributes. Browsing and attribute search GUIs are present and widely used.

The first three mentioned factors contribute to the data sparsity problem and limited applicability of user-based collaborative filtering (CF). Although the total number of users may be relatively large (hundreds or thousands per day), the volume of visited objects per user is limited and the timespan between the first and last feedback is short.

The last two factors contribute towards objects’ discoverability. This may seemingly decrease the necessity of recommender systems (although objects are more discoverable and users do not depend on recommendations only, they are often not willing to spend too much time in the discovery process, and recommendations may considerably shorten it), but it also decreases the effect of missing not at random data (see e.g. (Marlin and Zemel, 2009)) and therefore contributes to the consistency of off-line and on-line evaluation.

Despite the mentioned obstacles, the potential benefit of recommender systems in small e-commerce enterprises is still considerable, e.g., “more-of-the-kind” and “related-to-purchased” recommendations are not easy to mimic with a standard search/browsing GUI.

1.1. Main Contribution

Within the scope of small e-commerce enterprises, the main goal of this paper is to determine the usability of various off-line evaluation methods and their combination in learning the relevance of recommendations w.r.t. on-line production settings. In total, 800 variants of recommender systems were evaluated off-line w.r.t. 18 metrics covering rating-based, ranking-based, novelty and diversity evaluation. The off-line results were afterwards compared with the on-line evaluation of 12 selected algorithm variants.

Specifically, the main contributions of this paper are as follows:

  • By comparing on-line and off-line results, we identified off-line metrics that correlate with the actual on-line results (visits after recommendation, VRR). Overall, the ranking-based metrics provided the most consistent positive predictions, while too high diversity and novelty scores had a negative impact on the on-line results.

  • We further trained two regression methods aiming to predict on-line results from off-line metrics.

  • Based on both previous points, we may recommend word2vec and cosine CB methods to be used on small e-commerce enterprises. Furthermore, one of the regressors predicted that diversity and temporal novelty enhancements may improve the on-line results of cosine CB method.

  • Datasets acquired during both off-line and on-line evaluation are available for future work.

2. Materials and Methods

2.1. Dataset and Evaluation Domain

As the choice of suitable recommending algorithms is data-dependent, let us first briefly describe the dataset and the domain we used for evaluation.

Experiments described in this paper were conducted on a medium-sized Czech travel agency. The agency sells tours of various types to several dozens of countries. Each object (tour) is available on selected dates. Some tours (such as trips to major sport events) are one-time only events; others, e.g., seaside holidays or sightseeing tours, are offered on a similar schedule with only minimal changes for several years. All tours contain a textual description accompanied by a range of content-based (CB) attributes, e.g., tour type, meal plan, type of accommodation, length of stay, prices, destination country/ies, points of interest etc.

The agency’s website contains a simple attribute and keyword search GUI as well as extensive browsing and sorting options. Recommendations are displayed on the main page, browsed categories, search results and opened tours. However, due to the importance of other GUI elements, recommendations are usually placed below the initially visible content.

2.2. Recommending Algorithms

In accordance with Kaminskas et al. (Kaminskas et al., 2017), we considered user-based recommending algorithms, e.g., matrix factorization models, impractical for small e-commerce due to high user fluctuation and the short timespan between first and last visits. Instead, we opted for item-to-item recommending models and defined users through the history of their visits.

2.2.1. Item-to-item Recommending Models

We considered three recommending approaches corresponding with the three principal sources of data: object’s CB attributes, their textual description and the history of users’ visits (collaborative filtering). The information sources are mostly orthogonal, each focused on a different recommending paradigm. The expected output of recommendations based on CB attributes is to reveal similar objects to the ones in question. By utilizing the stream of user’s visits, it is possible to uncover objects that are related, yet not necessarily similar. The expected outcome of textual-based approaches is also to provide similar objects, however the similarity may be hidden within the text, e.g., seaside tours with the same type of beach, both suitable for families, located in a small peaceful village, but in a different country.

For each type of source information, we proposed one state-of-the-art algorithm as follows:

– Skip-gram word2vec model (Mikolov et al., 2013) utilizes the stream of users’ visits. Similarly to (Barkan and Koenigstein, 2016), the sequence of visited objects is used instead of a sentence of words; however, we kept the original window size parameter in order to better model the stream of visits. The output of the algorithm is an embedding of a given size for each object, where similar embeddings denote objects appearing in a similar context. In evaluation, the embedding size was selected from {32, 64, 128} and the context window size was selected from {1, 3, 5}.

– Doc2vec model (Le and Mikolov, 2014) utilizes the textual description of objects. Doc2vec extends the word2vec model by an additional attribute defining the source document (object) for each word in question. In addition to the word embeddings, the model also calculates embeddings of the document itself; therefore, the output of the algorithm is an embedding of a given size for each object (document). The textual description of objects was preprocessed by a Czech stemmer and stop-word removal. In evaluation, the embedding size was selected from {32, 64, 128} and the window size from {1, 3, 5}.

– Finally, we used cosine similarity on CB attributes. Nominal attributes were binarized, while numeric attributes were standardized before the similarity calculation. We evaluated two variants of the approach, differing in whether to allow evaluating similarity on self (otherwise, the similarity of an object to itself is zero by definition). In this way, we may promote/restrict recommendations of already visited objects, which belongs to the commonly used strategies.

Given a query of a single object, the base recommended list would be a list of top-k objects most similar to the query object (or its embeddings vector).
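A minimal sketch of the top-k lookup described above, assuming the objects are already represented as rows of a NumPy matrix (embeddings or preprocessed CB attributes). The function name `top_k_similar`, the toy vectors and the choice to exclude the query object by setting its score to negative infinity are illustrative assumptions, not the implementation used in the experiments.

```python
import numpy as np

def top_k_similar(vectors: np.ndarray, query_idx: int, k: int,
                  allow_self: bool = False) -> list[int]:
    """Return indices of the k objects most cosine-similar to the query object."""
    # L2-normalise rows so that a dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit[query_idx]
    if not allow_self:
        # Assumption: the query object is simply excluded from the list.
        sims[query_idx] = -np.inf
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-sims, kind="stable")
    return order[:k].tolist()

# Toy data: four objects with 3-dimensional vectors (hypothetical values).
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print(top_k_similar(toy_vectors, query_idx=0, k=2))
```

The same function would serve all three models, since word2vec, doc2vec and the CB attribute vectors all reduce to one vector per object.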

2.2.2. Using History of User’s Visits

While the algorithms described above focus on modeling item-item relations, we may possess a longer record of visited objects for some users. Although many approaches focus on the last visited object only, e.g., (Kaminskas et al., 2017), some approaches using the whole user session emerged recently (Hidasi et al., 2015). Therefore, we proposed several methods to process users’ history and aggregate recommendations for individual objects. The variants are as follows:

  • Using mean of recommendations for all visited objects.

  • For each candidate object, use max of its similarity w.r.t. any of the visited objects.

  • Using last visited object only.

  • Using weighted average of recommendations with linearly decreasing weights. In this case, only the last-k visited objects are considered, and their weights linearly decrease for older visits. We evaluated results considering the last 3, 5 and 10 objects.

  • Using weighted average of recommendations with temporal weights. This variant is the same as the previous one, except that the weights of objects are calculated based on the timespan between the current date and the date of the visit. We evaluated results considering the last 3, 5 and 10 objects as well as the full user profile.

While the first two approaches consider uniform importance of the visited objects, the others rely on variants of the “the newer the better” heuristic. Using the history of the user instead of the last item only is one of the extensions of our work compared to (Kaminskas et al., 2017).
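The aggregation variants listed above can be sketched as follows. This is only an illustration under stated assumptions: `sim` is a hypothetical precomputed item-to-item similarity matrix, the history is ordered oldest first, and the linear weights 1..n are one possible reading of "linearly decreasing". The temporal variant is omitted because its weighting formula is not given in the text.

```python
import numpy as np

def aggregate_history(sim: np.ndarray, history: list[int],
                      method: str = "mean", last_k: int = 3) -> np.ndarray:
    """Combine per-object similarity rows into one score per candidate object.

    sim[i, j] is the similarity of candidate j to visited object i;
    history lists visited object indices, oldest first.
    """
    rows = sim[history]                  # shape: (len(history), n_objects)
    if method == "mean":
        return rows.mean(axis=0)
    if method == "max":
        return rows.max(axis=0)
    if method == "last":
        return rows[-1]
    if method == "linear":
        recent = rows[-last_k:]          # keep only the last-k visits
        # Assumed weights 1..n: the newest visit gets the largest weight.
        weights = np.arange(1, len(recent) + 1, dtype=float)
        return weights @ recent / weights.sum()
    raise ValueError(f"unknown method: {method}")

# Toy similarity matrix for three objects (hypothetical values).
toy_sim = np.array([
    [1.0, 0.2, 0.8],
    [0.2, 1.0, 0.4],
    [0.8, 0.4, 1.0],
])
toy_history = [0, 1]                     # object 0 visited first, then object 1
for name in ("mean", "max", "last", "linear"):
    print(name, aggregate_history(toy_sim, toy_history, name, last_k=2))
```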

2.2.3. Novelty and Diversity Enhancements

The performance of recommenders may also depend on many subjective, user-perceived criteria, as introduced in (Ricci et al., 2011), such as novelty or diversity of recommended items. In this paper, we evaluated two types of novelty: temporal novelty, considering the timespan from the object’s last update, and user-perceived novelty, describing the fraction of recommended objects which were previously visited by the user (our definition of user novelty differs from, e.g., popularity-based novelty (Vargas and Castells, 2011), which is more suitable for domains such as movie recommendation, where objects are often widely known). As for diversity, we evaluated intra-list diversity (ILD) (Di Noia et al., 2014) based on the cosine distance of CB attributes.

As certain types of algorithms may provide recommendations that lack sufficient novelty or diversity, we utilized strategies enhancing temporal novelty as well as diversity. For enhancing the diversity of the recommended list, we adopted the Maximal Marginal Relevance approach (Carbonell and Goldstein, 1998) with its trade-off parameter held constant. For enhancing temporal novelty, we re-ranked the list of objects based on a weighted average of their original relevance and temporal novelty, with the weighting parameter also held constant.
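For illustration, a greedy Maximal Marginal Relevance re-ranking could look roughly like the sketch below. The function name `mmr_rerank`, the trade-off value `lam=0.5` and the toy data are assumptions made for this example; the parameter value used in the experiments is not stated in the text.

```python
import numpy as np

def mmr_rerank(relevance, sim, k, lam=0.5):
    """Greedily pick k items, trading relevance against redundancy.

    relevance[j] is the original score of item j, sim[i, j] the similarity
    between items i and j, and lam the (assumed) trade-off parameter.
    """
    selected = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def mmr_score(j):
            # Redundancy: highest similarity to anything already selected.
            redundancy = max((sim[j, s] for s in selected), default=0.0)
            return lam * relevance[j] - (1.0 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: items 0 and 1 are near-duplicates, item 2 is different.
toy_relevance = [0.9, 0.85, 0.3]
toy_item_sim = np.array([
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
])
print(mmr_rerank(toy_relevance, toy_item_sim, k=2, lam=0.5))
print(mmr_rerank(toy_relevance, toy_item_sim, k=2, lam=1.0))
```

With `lam=1.0` the procedure degenerates to plain relevance ordering, which makes the effect of the diversity term easy to compare.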

As the choice of recommending algorithm, user’s history aggregation, novelty and diversity enhancements are orthogonal, we ran the off-line evaluation for all possible combinations. In total, 800 variants of recommender systems were evaluated.
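The count of 800 follows from the options listed in Sections 2.2.1 to 2.2.3: 20 algorithm configurations, 10 history aggregations and two on/off enhancements. The enumeration below is a small sketch of that grid; the tuple layout and the labels such as `linear-3` are naming assumptions made for this example.

```python
from itertools import product

embedding_sizes = [32, 64, 128]
window_sizes = [1, 3, 5]

# 9 word2vec + 9 doc2vec configurations + 2 cosine variants (similarity on self).
algorithms = (
    [("word2vec", e, w) for e, w in product(embedding_sizes, window_sizes)]
    + [("doc2vec", e, w) for e, w in product(embedding_sizes, window_sizes)]
    + [("cosine", allow_self) for allow_self in (False, True)]
)

# mean, max, last + linear weights (3, 5, 10) + temporal weights (3, 5, 10, full).
histories = (
    ["mean", "max", "last"]
    + [f"linear-{k}" for k in (3, 5, 10)]
    + [f"temporal-{k}" for k in (3, 5, 10, "full")]
)

# Novelty and diversity enhancements are independent on/off switches.
variants = list(product(algorithms, histories, (False, True), (False, True)))
print(len(algorithms), len(histories), len(variants))  # 20 10 800
```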

3. Evaluation Scenario

In this section, we describe the evaluation scenario and metrics. We separate the evaluation into two distinct parts: off-line evaluation on historic data and on-line A/B testing on a production server.

3.1. Off-line Evaluation

For the off-line experiment, we recorded users’ visits for the period of January 2016 - July 2018. The dataset contained over 560K records from 370K users. However, after applying restrictions on the volume of visits (only the users with at least 2 and no more than 150 visited objects were kept), the resulting dataset contained 260K records of 72K users. We split the dataset into a train set and a test set based on a fixed time-point. All feedback recorded before June 1, 2018 was used for training, while feedback dating between June 1, 2018 and July 19, 2018 was used as a test set. The test set was further restricted to only incorporate users who have at least one record in the train set as well, resulting in 3400 records of 970 users.
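The filtering and splitting procedure can be sketched as below, assuming the log is a list of hypothetical `(user_id, object_id, visit_date)` tuples. Counting distinct objects per user and treating July 19, 2018 as an inclusive bound are assumptions of this sketch.

```python
from collections import defaultdict
from datetime import date

SPLIT_DATE = date(2018, 6, 1)    # feedback before this date is training data
TEST_END = date(2018, 7, 19)     # assumed inclusive end of the test window

def filter_by_volume(records, min_objects=2, max_objects=150):
    """Keep only users with 2 to 150 visited objects (distinct objects assumed)."""
    visited = defaultdict(set)
    for user, obj, _ in records:
        visited[user].add(obj)
    return [r for r in records if min_objects <= len(visited[r[0]]) <= max_objects]

def temporal_split(records):
    """Split on a fixed time-point; test users must also appear in the train set."""
    train = [r for r in records if r[2] < SPLIT_DATE]
    train_users = {r[0] for r in train}
    test = [r for r in records
            if SPLIT_DATE <= r[2] <= TEST_END and r[0] in train_users]
    return train, test

# Toy log (hypothetical values).
toy_log = [
    ("u1", "a", date(2018, 5, 1)),
    ("u1", "b", date(2018, 6, 10)),
    ("u2", "c", date(2018, 6, 15)),   # single visit: removed by the volume filter
    ("u3", "d", date(2018, 3, 3)),
    ("u3", "e", date(2018, 8, 1)),    # after the test window
]
train_set, test_set = temporal_split(filter_by_volume(toy_log))
print(len(train_set), len(test_set))
```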

In evaluation, we focused on four types of metrics, commonly used in recommender system’s evaluation: rating prediction, ranking prediction, novelty and diversity. We evaluated several metrics for each class.

For rating prediction, we suppose that visited objects have the rating $r = 1$ and all others $r = 0$. Mean absolute error (MAE) and coefficient of determination ($R^2$) were evaluated.

For ranking-based metrics, we supposed that the relevance of all visited objects is equal and that all other objects are irrelevant. The following metrics were evaluated: area under ROC curve (AUC), mean average precision (MAP), mean reciprocal rank (MRR), precision and recall at top-5 and top-10 recommendations (p5, p10, r5, r10) and normalized discounted cumulative gain at top-10, top-100 and a full list of recommendations (nDCG10, nDCG100, nDCG). The choice of ranking metrics reflects the importance of the head of the recommended list (p5, p10, r5, r10, nDCG10, MRR, MAP), as only a short list of recommendations can be displayed to the user. However, as the list of recommendable objects may be restricted due to the current context of the user (e.g., currently browsed category), we also included metrics evaluating longer portions of the recommended lists (AUC, nDCG100, nDCG).
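Under binary relevance, several of these metrics reduce to a few lines each. The sketch below shows textbook per-user definitions of reciprocal rank, precision/recall at k, nDCG at k and AUC; the function names and the toy ranking are illustrative, and averaging over users (as done for MRR) is left out.

```python
import math

def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant object (0 if none is ranked)."""
    for pos, obj in enumerate(ranked, start=1):
        if obj in relevant:
            return 1.0 / pos
    return 0.0

def precision_at_k(ranked, relevant, k):
    return sum(obj in relevant for obj in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    return sum(obj in relevant for obj in ranked[:k]) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-gain nDCG: DCG of the list divided by the DCG of an ideal list."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, obj in enumerate(ranked[:k], start=1) if obj in relevant)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

def auc(ranked, relevant):
    """Fraction of (relevant, irrelevant) pairs where the relevant one ranks higher."""
    n_rel = sum(obj in relevant for obj in ranked)
    n_irr = len(ranked) - n_rel
    if n_rel == 0 or n_irr == 0:
        return 0.0
    misordered, irr_seen = 0, 0
    for obj in ranked:
        if obj in relevant:
            misordered += irr_seen   # irrelevant objects ranked above this one
        else:
            irr_seen += 1
    return 1.0 - misordered / (n_rel * n_irr)

# Toy example: a full ranking of five objects, two of which were visited.
toy_ranking = ["a", "b", "c", "d", "e"]
toy_relevant = {"b", "d"}
print(reciprocal_rank(toy_ranking, toy_relevant),
      precision_at_k(toy_ranking, toy_relevant, 5),
      auc(toy_ranking, toy_relevant))
```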

As discussed in Section 2.2.3, we distinguish two types of novelty in recommendations: recommending recently created or updated objects (temporal novelty) and recommending objects not seen by the user in the past (user novelty). For temporal novelty, we utilized a logarithmic penalty on the timespan between the current date and the date of the object’s last update. The mean of this score over the top-5 and top-10 recommendations was evaluated. For user novelty, a fraction of already known vs. all recommended objects was used and evaluated for the top-5 and top-10 objects. Finally, the intra-list diversity (ILD) (Di Noia et al., 2014) evaluated at top-5 and top-10 recommendations was utilized as the diversity metric.
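As an illustration of the diversity metric, the sketch below computes intra-list diversity as the mean pairwise cosine distance between the CB attribute vectors of the recommended objects. This is the common textbook form of ILD and is assumed here; the exact variant used in the experiments may differ in detail.

```python
import numpy as np

def intra_list_diversity(features: np.ndarray) -> float:
    """Mean pairwise cosine distance between the rows of `features`."""
    n = len(features)
    if n < 2:
        return 0.0
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    cos_sim = unit @ unit.T
    upper = np.triu_indices(n, k=1)      # count each unordered pair once
    return float(np.mean(1.0 - cos_sim[upper]))

# Toy lists (hypothetical attribute vectors).
orthogonal_list = np.eye(3)                        # maximally diverse
redundant_list = np.array([[1.0, 2.0], [2.0, 4.0]])  # same direction
print(intra_list_diversity(orthogonal_list), intra_list_diversity(redundant_list))
```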

All off-line metrics were evaluated for each pair of user and recommender. Mean values for each recommender are reported.

3.2. On-line Evaluation

The on-line A/B testing was conducted on the travel agency’s production server between July 19, 2018 and August 17, 2018 (the evaluation is ongoing; we aim to provide additional results in the future). Out of the 800 algorithms evaluated off-line, we selected in total 12 recommenders with (close to) best and (close to) worst results w.r.t. each evaluated metric. Details of the selection procedure are in Section 4.1. One recommender was assigned to each user based on his/her ID. During the on-line evaluation, we monitored which objects were recommended to the user, whether (s)he clicked on some of them and which objects (s)he visited.

Based on the collected data, we evaluated two metrics: click-through rate (CTR) and visit after recommend rate (VRR). CTR is the ratio between the volume of clicked and recommended objects and indicates that a recommendation was both relevant for the user and successful in catching his/her attention. VRR is a weaker criterion checking that, after the object was recommended, the user also visited it (i.e., (s)he might not have paid attention to the recommendations or the presentation was not so persuasive, however the recommended object itself was probably relevant). Although VRR is weaker than CTR, we selected it as the main evaluation metric due to the higher volume of recorded feedback and also because recommended objects were often placed outside of the initially visible area and therefore the CTR results may underestimate the true utility of recommendations.
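Both on-line metrics are simple ratios over the volume of recommended objects. The sketch below applies them to the overall totals reported in Section 4.2 (130261 recommended objects, 928 click-through events, 10961 visits after recommendation); the helper name `rate` is an assumption, and these aggregate values are not the per-recommender figures of Table 2.

```python
def rate(events: int, recommended: int) -> float:
    """Ratio of feedback events to recommended objects (0 if nothing was shown)."""
    return events / recommended if recommended else 0.0

# Overall totals from the on-line evaluation (Section 4.2).
total_recommended = 130261
total_clicks = 928
total_visits_after_recommendation = 10961

overall_ctr = rate(total_clicks, total_recommended)
overall_vrr = rate(total_visits_after_recommendation, total_recommended)
print(f"CTR = {overall_ctr:.4f}, VRR = {overall_vrr:.4f}")
```

The roughly tenfold gap between the two values illustrates why VRR offers a higher volume of recorded feedback than CTR.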

4. Results and Discussion

4.1. Off-line Evaluation

Our aim in off-line evaluation was threefold. First, to determine whether all evaluated metrics are necessary and provide valuable additional information. Second, to identify whether there are some general trends in the sub-classes of evaluated approaches or consistently dominating recommenders. Finally, to select suitable candidates for on-line evaluation.

First, we constructed a matrix of Pearson’s correlation for all off-line metrics (see Figure 1). The figure reveals several interesting patterns in the results. All novelty, diversity and rating prediction metrics are anti-correlated with ranking prediction metrics. The relation is especially strong for diversity. While this may be an artifact of the selected algorithms, it still seems to be an interesting phenomenon worth further study. Metrics from the rating prediction, temporal novelty, user novelty and diversity classes were highly correlated within each class, and therefore only one metric for each category was selected (MAE, one temporal novelty metric, one user novelty metric and ILD10). As for ranking-based metrics, results were slightly more diversified. We identified three main clusters: {AUC}, {MAP, MRR, p5, p10, r5, r10, nDCG10} and {nDCG100, nDCG}. The AUC, MRR and nDCG100 metrics were selected as representatives of each cluster. (We also evaluated metrics w.r.t. Spearman’s correlation. The results were highly similar, only the differences between ranking-based metrics were smaller in general.)

Figure 1. A matrix of Pearson’s correlation for off-line evaluation metrics.
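A correlation matrix of this kind can be obtained with a single NumPy call once the off-line results are arranged with one row per recommender and one column per metric. The values below are hypothetical toy numbers chosen only to show the call; they are not taken from the experiments.

```python
import numpy as np

metric_names = ["AUC", "MRR", "ILD10"]
# Rows are recommenders, columns are metrics (hypothetical toy values).
toy_results = np.array([
    [0.60, 0.03, 0.80],
    [0.70, 0.06, 0.70],
    [0.78, 0.12, 0.30],
    [0.81, 0.10, 0.50],
])

# rowvar=False treats each column (metric) as one variable.
metric_corr = np.corrcoef(toy_results, rowvar=False)
for name, row in zip(metric_names, metric_corr):
    print(name, np.round(row, 2))
```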

We further evaluated the recommenders’ results according to this restricted set of metrics. The first thing to note is that the results were extremely diverse; 547 out of 800 recommenders were on the Pareto front. Therefore, we focused on providing some insight into the recommenders’ subclasses. Table 1 contains mean results as well as results of the best member for each type of recommending algorithm. We may observe that while doc2vec models were superior in MAE and ILD, word2vec and cosine similarity performed considerably better w.r.t. ranking-based metrics. Furthermore, the ILD scores of doc2vec and word2vec were on average more than double those of cosine similarity.
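Membership in the Pareto front can be checked with a straightforward dominance test, sketched below. The function assumes that higher values are better for every column, so an error metric such as MAE would have to be negated first; the function name and toy scores are illustrative.

```python
import numpy as np

def pareto_front(scores: np.ndarray) -> list[int]:
    """Indices of rows not dominated by any other row (higher is better)."""
    front = []
    for i, row in enumerate(scores):
        dominated = any(
            # `other` dominates `row` if it is at least as good everywhere
            # and strictly better in at least one metric.
            np.all(other >= row) and np.any(other > row)
            for j, other in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Toy scores for four recommenders on two metrics (hypothetical values).
toy_scores = np.array([
    [0.8, 0.2],
    [0.6, 0.6],
    [0.5, 0.5],   # dominated by the row above
    [0.2, 0.9],
])
print(pareto_front(toy_scores))
```

When the metrics are strongly anti-correlated, as observed for ranking-based versus diversity metrics, most recommenders excel in at least one direction, which is consistent with a large front.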

As for the history aggregation methods, we observed that recommenders utilizing a major portion of the user’s history (mean, temporal) achieved better results w.r.t. ranking-based metrics. Furthermore, recommenders with temporal-based user profiling also exhibited higher novelty scores. Diversity and novelty enhancements considerably increased ILD and temporal novelty, respectively, with a negligible impact on other metrics (detailed results are available in the supplementary materials). In general, the type of the algorithm (cosine, word2vec, doc2vec) seems to have the determining impact on the results, surpassing the effects of history aggregation, novelty enhancements or diversity enhancements.

| Algorithm | MAE | AUC | MRR | nDCG100 | Temporal nov. | User nov. | ILD10 |
|---|---|---|---|---|---|---|---|
| doc2vec | 0.372 / 0.213 | 0.586 / 0.715 | 0.025 / 0.056 | 0.051 / 0.101 | 0.233 / 0.289 | 0.999 / 1.000 | 0.804 / 0.888 |
| cosine | 0.395 / 0.360 | 0.780 / 0.797 | 0.140 / 0.186 | 0.208 / 0.242 | 0.228 / 0.259 | 0.996 / 0.999 | 0.262 / 0.443 |
| word2vec | 0.382 / 0.236 | 0.795 / 0.825 | 0.088 / 0.134 | 0.178 / 0.229 | 0.225 / 0.278 | 0.989 / 0.999 | 0.609 / 0.855 |

Table 1. Off-line results for recommending algorithm types. Each cell shows the average score / the best member’s result.

While selecting candidates for on-line A/B testing, our main task was to determine predictability of on-line results from off-line metrics. However, due to the limited time and available traffic, the volume of recommenders evaluated in on-line A/B testing cannot be too high.

Therefore, we adopted the following strategy: for each off-line metric, we selected the best and the worst performing recommender by default. However, if another recommender achieved close-to-best / close-to-worst performance (ranked within the top 5% of results and within a bounded absolute difference to the best result) and was already present in the set of candidates, we selected this one to save space. Furthermore, if a different type of algorithm achieved close-to-best performance, we considered its inclusion as well for the sake of diversity. Table 2 contains the final list of candidates for on-line evaluation.

4.2. On-line Evaluation

A total of 4287 users participated in the on-line evaluation, and a total of 130261 objects were recommended to them. (We excluded global-only recommendations provided to users without any past visited objects and the results of users with too many visited objects, probably the agency’s employees.) The total volume of click-through events was 928 and the total volume of visits after recommendation was 10961.

The results were not conclusive in general, but we found a strong relation with the volume of objects previously visited by the user. Therefore, we only report on the results of novel users with 1-5 visited objects (who represent the main part of the website’s traffic with 65% of the received feedback). We will aim to provide a more comprehensive analysis of the results w.r.t. all classes of users in future work.

Table 2 contains the results of on-line A/B testing (VRR and CTR) as well as off-line results (MAE, AUC, MRR, nDCG100, temporal novelty, user novelty and ILD10) for the twelve recommender variants selected for on-line evaluation. As expected, CTR values were approximately an order of magnitude lower than VRR. We suppose that this may be mainly attributed to the problem of recommendations’ visibility. As future work, we would like to confirm this hypothesis by applying a more detailed implicit feedback analysis as in (Peska and Vojtas, 2017) to estimate the probability that recommendations were perceived by the user. Another possibility is that recommended objects were potentially relevant, but not in the current context (the user eventually processes them after some time). This factor may also be revealed by a more detailed feedback analysis in the future.

Figure 2 depicts Pearson’s and Spearman’s correlation between on-line and off-line evaluation metrics for users with 1-5 visited objects. The results show a substantial positive correlation between VRR and CTR results as well as a correlation between rank-based metrics and both CTR and VRR. The highest correlation scores were observed for the AUC metric, which is further depicted in Figure 3. The results seem to corroborate the findings of RecSys Challenge 2017 (Abel et al., 2017) that, if the off-line metric is reasonably defined, the on-line results can be predicted from it to some extent. On the other hand, novelty, diversity and rating-based metrics correlate negatively with the on-line results in most cases.
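The two correlation coefficients can be computed as in the sketch below, which uses the AUC and VRR columns of Table 2 as input. Spearman’s coefficient is implemented here as Pearson’s correlation of average ranks (AUC contains one tie); the helper names are assumptions, and the printed values are not claimed to reproduce Figure 2 exactly.

```python
import numpy as np

def pearson(x, y) -> float:
    return float(np.corrcoef(x, y)[0, 1])

def average_ranks(values) -> np.ndarray:
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y) -> float:
    return pearson(average_ranks(x), average_ranks(y))

# AUC and VRR columns of Table 2 (recommenders 1 to 12).
table2_auc = [0.617, 0.679, 0.555, 0.555, 0.526, 0.797,
              0.795, 0.783, 0.809, 0.816, 0.734, 0.814]
table2_vrr = [0.050, 0.075, 0.054, 0.060, 0.052, 0.020,
              0.088, 0.055, 0.062, 0.065, 0.056, 0.089]

print("Pearson:", round(pearson(table2_auc, table2_vrr), 3))
print("Spearman:", round(spearman(table2_auc, table2_vrr), 3))
```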

The correlations of rating prediction metrics (MAE) with the on-line results were low or negative in general (also for users with more visited objects), and we may conclude that rating prediction metrics seem irrelevant in assessing the on-line performance of recommender systems.

Furthermore, the results on average preferred word2vec and cosine over the doc2vec model w.r.t. CTR and word2vec over both other models w.r.t. VRR.

Figure 2. A matrix of correlations between off-line and on-line evaluation metrics.
Figure 3. Ranking-based comparison of AUC and on-line evaluation metrics. Labels denote IDs of recommenders and types of Item-to-Item recommending algorithms.
| Algorithm | Parameters | History | Nov. | Div. | MAE | AUC | MRR | nDCG100 | (unlabeled) | (unlabeled) | ild10 | CTR | VRR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1: doc2vec | e: 128, w: 1 | last | yes | no | 0.292 | 0.617 | 0.031 | 0.057 | 0.234 | 1.000 | 0.800 | 0.0070 | 0.050 |
| 2: doc2vec | e: 128, w: 1 | temp. | no | yes | 0.362 | 0.679 | 0.031 | 0.075 | 0.221 | 0.999 | 0.838 | 0.0084 | 0.075 |
| 3: doc2vec | e: 32, w: 5 | mean | no | no | 0.455 | 0.555 | 0.028 | 0.050 | 0.211 | 0.997 | 0.786 | 0.0089 | 0.054 |
| 4: doc2vec | e: 32, w: 5 | mean | no | yes | 0.455 | 0.555 | 0.025 | 0.046 | 0.214 | 0.998 | 0.859 | 0.0062 | 0.060 |
| 5: doc2vec | e: 128, w: 5 | max | yes | no | 0.214 | 0.526 | 0.012 | 0.031 | 0.229 | 0.995 | 0.741 | 0.0077 | 0.052 |
| 6: cosine | s:False | temp. | yes | no | 0.406 | 0.797 | 0.146 | 0.215 | 0.255 | 0.994 | 0.270 | 0.0057 | 0.020 |
| 7: cosine | s:True | mean | yes | no | 0.400 | 0.795 | 0.149 | 0.214 | 0.229 | 0.994 | 0.223 | 0.0119 | 0.088 |
| 8: cosine | s:True | last-10 | no | no | 0.390 | 0.783 | 0.127 | 0.205 | 0.218 | 0.996 | 0.208 | 0.0075 | 0.055 |
| 9: word2vec | e: 64, w: 5 | mean | no | yes | 0.414 | 0.809 | 0.103 | 0.182 | 0.215 | 0.973 | 0.683 | 0.0090 | 0.062 |
| 10: word2vec | e: 128, w: 3 | last | no | no | 0.438 | 0.816 | 0.102 | 0.195 | 0.244 | 0.977 | 0.495 | 0.0095 | 0.065 |
| 11: word2vec | e: 128, w: 3 | last | no | no | 0.290 | 0.734 | 0.097 | 0.168 | 0.212 | 0.997 | 0.534 | 0.0077 | 0.056 |
| 12: word2vec | e: 32, w: 3 | last-10 | no | no | 0.432 | 0.814 | 0.134 | 0.229 | 0.214 | 0.988 | 0.443 | 0.0080 | 0.089 |
Table 2. On-line and off-line results of recommenders selected for A/B testing. Nov. and Div. stand for novelty and diversity enhancements; parameter e stands for embeddings size, w denotes context window size and s denotes whether calculating similarity on self is allowed. Best results w.r.t. each metric are in bold.

4.3. Post-processing Off-line Results

After the completion of on-line experiments, we also aimed to revisit previous off-line results with the knowledge gained from the on-line / off-line comparison. To do so, we trained two regression methods aiming to predict VRR from off-line evaluation metrics. We utilized linear regression (LM) and a MARS model, and predicted the VRR score for all recommenders.
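The linear part of this step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it fits ordinary least squares through the normal equations, uses only AUC and MRR as features (the actual feature set is not restated here), takes its training data from Table 2, and leaves the MARS model aside. The names `fit_ols` and `predict` are hypothetical.

```python
# Off-line features (AUC, MRR) and on-line target (VRR), copied from Table 2.
FEATURES = [(0.617, 0.031), (0.679, 0.031), (0.555, 0.028), (0.555, 0.025),
            (0.526, 0.012), (0.797, 0.146), (0.795, 0.149), (0.783, 0.127),
            (0.809, 0.103), (0.816, 0.102), (0.734, 0.097), (0.814, 0.134)]
VRR = [0.050, 0.075, 0.054, 0.060, 0.052, 0.020,
       0.088, 0.055, 0.062, 0.065, 0.056, 0.089]


def fit_ols(X, y):
    """Least-squares fit of y ~ b0 + b1*x1 + ... + bk*xk (normal equations)."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(n):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][n] for i in range(n)]  # [intercept, b1, ..., bk]


def predict(coef, x):
    """Predicted target for one feature vector x."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


if __name__ == "__main__":
    coef = fit_ols(FEATURES, VRR)
    # Hypothetical recommender that was evaluated off-line only.
    print("predicted VRR:", round(predict(coef, (0.800, 0.120)), 4))
```

With only 12 training points, any such regressor is prone to overfitting, so its predictions are better read as a ranking aid for untested variants than as precise estimates.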

Both predictors were generally consistent in terms of Pearson’s correlation; however, their predictions slightly differed in some key questions. Both LM and MARS favored word2vec and cosine CB models over doc2vec, but the comparison of word2vec and cosine CB models was contradictory (one predictor slightly favored word2vec over cosine, while the other predicted the opposite).

Both predictors agreed that novelty enhancements would decrease the predicted VRR and that diversity enhancements would have only a minimal effect in the average case. However, according to the MARS method, both novelty and diversity enhancements could increase the results of several top-scoring algorithms. Upon a closer look, the MARS method mostly predicts increased VRR for the diversity/novelty-enhanced cosine CB model, which exhibited the lowest ILD in general, so at least diversity enhancements seem appropriate in this case. This is in line with some other studies (Noia et al., 2017), claiming that an appropriate level of diversity should be maintained and that this level may vary across CB attributes and users as well.

The usage of a longer user history (mean, temporal, temporal5, temporal10, last5, last10) led to slightly worse performance on average, but improved over approaches with a shorter history (last, last3, temporal3) w.r.t. best per-class results, indicating that the usage of a longer user history may be beneficial, but does not work for all algorithms.
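The distinction between average and best per-class results can be made concrete with a short sketch. It groups recommenders by history class and reports both the mean and the maximum VRR of each class. Note the assumptions: the data are the 12 observed VRR values from Table 2 rather than the predictions for all recommenders discussed above, the assignment of profile names to the "short" and "long" classes is our reading of the text, and `summarize` is a hypothetical helper.

```python
from statistics import mean

# (history profile, VRR) pairs copied from Table 2, recommenders 1-12.
TABLE2 = [("last", 0.050), ("temp.", 0.075), ("mean", 0.054),
          ("mean", 0.060), ("max", 0.052), ("temp.", 0.020),
          ("mean", 0.088), ("last-10", 0.055), ("mean", 0.062),
          ("last", 0.065), ("last", 0.056), ("last-10", 0.089)]

# Assumption: only "last" counts as a short history among the profiles
# that appear in Table 2; all remaining profiles are treated as long.
SHORT = {"last"}


def summarize(rows):
    """Return {class: (mean VRR, best VRR)} for short and long histories."""
    groups = {"short": [], "long": []}
    for history, vrr in rows:
        groups["short" if history in SHORT else "long"].append(vrr)
    return {name: (mean(vals), max(vals)) for name, vals in groups.items()}


if __name__ == "__main__":
    for name, (avg, best) in summarize(TABLE2).items():
        print(f"{name}: mean VRR={avg:.4f}, best VRR={best:.3f}")
```

A class can have a lower mean and still contain the single best variant, which is why both aggregates are worth reporting.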

Finally, the overall best performing recommenders according to one predictor were cosine CB approaches with disallowed similarity on self, a longer user history profile (temporal, mean), and enhanced novelty and diversity in most cases. According to the other predictor, the overall best were word2vec models with a particular embeddings size, no diversity or novelty enhancements, and longer history profiles (max, mean, temporal).

5. Conclusions and Future Work

In this paper, we conducted an extensive comparison of off-line and on-line evaluation metrics in the context of small e-commerce enterprises. Experiments were conducted on a Czech travel agency and showed a moderate correlation between ranking-based off-line metrics (AUC, MRR, nDCG100) and both visits after recommend rate (VRR) and click-through rate (CTR). Similarly, results indicated a negative correlation between on-line metrics (CTR, VRR) and novelty, diversity and rating-based metrics. Although this relation may be caused by the choice of recommending algorithms, it would be interesting to verify whether this is a global feature of the domain or whether there are, e.g., recommenders, users or attributes for which higher diversity or novelty is appropriate.

Although there was a significant correlation between CTR and VRR results, the high absolute difference between them illustrates a relatively high independence of users from recommendations as well as good discoverability of objects. Both factors reduce the effect of the missing-not-at-random problem in small e-commerce enterprises.

In addition to the direct on-line / off-line comparison, we trained two models aiming to predict on-line results (VRR) from off-line metrics and provided a comparison of all variants of recommenders. Both temporal novelty and diversity enhancements seem to have the potential to improve results of some algorithms (cosine CB), and it may be worthwhile to evaluate them in the future.

The utility of CF vs. CB information (expressed by the results of the word2vec and cosine CB models) was highly similar. The word2vec model seems to be slightly better, but the small margin in results prevents us from voting for a single most suitable information source. Therefore, future work should include a more detailed analysis of the algorithms’ performance w.r.t. different segments of users and relevant context, as well as an evaluation of hybrid approaches utilizing both sources of information. Furthermore, the estimated results should be verified by additional on-line A/B testing.

Future work should also utilize more complex implicit user feedback in order to assess the importance of visited objects as well as to decrease the visibility noise in on-line evaluation, especially for CTR.

Acknowledgements. This paper has been supported by Charles University project Progres Q48. Source codes, evaluation data and complete results are available from