1. Introduction
It has been well established that the effect of beyond-accuracy properties on user satisfaction is a critical success factor in deploying a recommender system (McNee et al., 2006; Herlocker et al., 2004). Among these properties, surprise has recently been the subject of several studies owing to its links to serendipity (Adamopoulos and Tuzhilin, 2011; Kaminskas and Bridge, 2014; Silveira et al., 2017) and the problem of over-specialisation in content-based recommender systems (de Gemmis et al., 2015), as well as its importance in some application domains (Mourão et al., 2017).
In the literature, the notion of surprise generally reflects the capacity to make recommendations that are dissimilar from the items known to a given user: the more dissimilar a recommended item is, the more surprising it is (Kaminskas and Bridge, 2014; Adamopoulos and Tuzhilin, 2011; Zhang et al., 2012). Current surprise metrics and evaluation methods allow us to estimate the average surprise in recommendations produced by an algorithm. Given a set of algorithms and a fixed experimental setting, statistical tools can be used to estimate whether there is a significant difference in the average degree of surprise between recommendations produced by any two algorithms. Although this kind of approach can be successfully applied when selecting which algorithm might be more promising for an application domain, it is unable to provide a common measurement scale that remains consistent across different experimental settings; nor does it reveal how much room there is for improvement with respect to surprise.
In this study, we express the view that surprise can be regarded as a system resource: at any given time, a recommender system has a limited “stock of surprise” that is available to each user. This theoretical “stock of surprise” is referred to here as “maximum potential surprise”. The recommendation algorithm, which is designed to optimise a set of objectives, controls how much surprise is embedded in each recommendation it produces. However, there is a limit to how much of the available surprise any recommendation algorithm can embed in a recommendation. By pursuing this line of thought, we were able to devise a surprise metric, called “normalised surprise”, which provides a measurement scale whose meaning remains consistent across different settings. For example, it can provide the information that, on average, a recommender system has embedded 20% of the available surprise in each recommendation it produces. As a result, 80% of the available surprise remains to be appropriated by the system.
The remainder of this paper is structured as follows: Section 2 examines related work on surprise evaluation; Section 3 develops a theoretical model of potential surprise and presents the proposed surprise metric; Section 4 describes the experiments conducted to validate the proposed metric on both an ancillary synthetic dataset and the popular MovieLens dataset, and the results are discussed in Section 5. Finally, we conclude by summarising the work and making suggestions for possible future work in Section 6.
2. Related Work
Before reviewing work related to this research, this section examines the properties of a recommender system that are related to surprise (Section 2.1). This is followed by a discussion of several surprise-related metrics that have been proposed in the literature (Section 2.2) and of the one plus random offline evaluation method for surprise (Section 2.3).
2.1. The Surprise Property
The challenge of discovering new items that might be useful to a user has been the focus of a large number of works in the literature on recommender systems. In general, the approach involves finding new items that bear some similarity to items which have been given good ratings by some users. An even greater challenge is to find new items that do not resemble items known to a user, yet would still be useful to them. This would be a serendipitous recommendation.
Herlocker et al. (Herlocker et al., 2002) offer a definition of serendipity that is usually cited by work in this area: “A serendipitous recommendation helps the user find a surprisingly interesting item he might not have otherwise discovered.” In a sense, this definition supports a perspective whereby serendipity, as a system property, results from the interaction of two other, more fundamental properties: surprise and relevance. Being surprising and relevant (or useful) to a user are the basic requirements of a serendipitous recommendation.
It has been recently pointed out by Kaminskas and Bridge (Kaminskas and Bridge, 2016) that there is a conceptual overlap between the properties of novelty or unexpectedness and the notion of surprise. In this study, we subscribe to the categorisation suggested by these authors, in which a) novelty is related to the notion of an item being popular, and thus is not directly related to serendipity, b) unexpectedness usually conveys the same notion as surprise, and, as mentioned earlier, c) surprise can be regarded as a component of serendipity.
In view of this, our focus is on the metrics employed for estimating surprise, serendipity and unexpectedness. We begin by noting that several authors have approached the problem of measuring these properties by adopting strategies that, although clearly distinct from each other, have some key features in common. Each strategy is analysed on the basis of three factors:
Intrinsic vs extrinsic evaluation (an analogy to the same dichotomy employed in clustering quality evaluation methods (Han et al., 2011)): some studies have defined metrics that only use data available within the system under evaluation (Akiyama et al., 2010; Zhang et al., 2012; Kaminskas and Bridge, 2014), while others have defined metrics that, in addition to the internal data, use data made available by an external system, often referred to as a PPM (Primitive Prediction Model) (Murakami et al., 2008; Ge et al., 2010; Adamopoulos and Tuzhilin, 2011).
Subjective vs objective view: some metrics assume that surprise is subjective in nature, since it depends on the set of items known to each user (Murakami et al., 2008; Adamopoulos and Tuzhilin, 2011; Zhang et al., 2012; Kaminskas and Bridge, 2014), while others view surprise as a property of the item itself (Ge et al., 2010; Akiyama et al., 2010), and thus independent of the users.
Reductionist vs non-reductionist approach: some authors have employed a reductionist approach, insofar as they seek to isolate the surprise and relevance components of serendipity and examine them as separate metrics (Murakami et al., 2008; Akiyama et al., 2010; Adamopoulos and Tuzhilin, 2011; Zhang et al., 2012; Kaminskas and Bridge, 2014), while others have proposed metrics that treat surprise and relevance in a more integrated way (Ge et al., 2010).
2.2. Surprise Metrics
This section reviews six surprise-related metrics from the literature. Figure 1 provides a summary of the review and illustrates how the metrics are positioned with regard to the factors described in Section 2.1. In the diagram, each metric is annotated with its year of publication, and the ellipses show trends or changes in the factors. This review is not meant to be exhaustive but rather aims to capture the approaches that have evolved over a period of time.
A metric for unexpectedness
Murakami et al. (Murakami et al., 2008) have proposed a metric to evaluate serendipity that explores the idea that a serendipitous recommendation must be “non-obvious”, whereas recommendations made by a PPM are expected to be obvious. As shown in Equation 1, the metric is calculated from a recommendation list produced for the user by the system under evaluation. In this definition, the predicate isrel accounts for the predicted relevance of an item to the user, while the non-obviousness term accounts for surprise and reflects the degree to which an item is similar to items rated highly by the user (i.e. a subjective view). Since there are separate terms for surprise and relevance, it can be assumed that a reductionist approach is being adopted. Note that the non-obviousness term, defined in Equation 2, includes the relevance predicted by both the system under evaluation (Pr) and the external system (Prim), and the metric can thus be regarded as an extrinsic evaluation.
$$\mathit{unexp}(L_u) = \frac{1}{|L_u|} \sum_{i \in L_u} \mathit{nonobv}(i) \cdot \mathit{isrel}(i, u) \qquad (1)$$

$$\mathit{nonobv}(i) = \max\big(\mathit{Pr}(i) - \mathit{Prim}(i),\ 0\big) \qquad (2)$$
A metric for serendipity
Ge et al. (Ge et al., 2010) devised a metric for evaluating serendipity, srdp, that follows the same line of thought pursued by Murakami et al. (Murakami et al., 2008), although the external system is employed in a different way. As shown in Equation 3, srdp is applied to a recommendation list RS, and the predicate u estimates the usefulness of each item, which accounts for relevance. In Equation 4, UNEXP is defined as a list that consists of the elements recommended to the user by the system under evaluation (RS) that do not appear in the list drawn up for the user by an external system (PM). This means that UNEXP comprises non-obvious, unexpected items, and hence only accounts for surprise.
$$\mathit{srdp}(RS) = \frac{1}{|\mathit{UNEXP}|} \sum_{i \in \mathit{UNEXP}} u(i) \qquad (3)$$

$$\mathit{UNEXP} = RS \setminus PM \qquad (4)$$
In addition, as there is no specific term for surprise in Equations 3 and 4, it can be assumed that srdp adopts a non-reductionist approach. Note that srdp operates in an objective way, since estimating surprise (through UNEXP) does not involve evaluating the degree to which new items are similar to items already known to the user.
A metric for general unexpectedness
Akiyama et al. (Akiyama et al., 2010) set out a metric called “general unexpectedness” that explores a combinatorial intuition: an item that shows a rare combination of attributes must be treated as unexpected. It assumes that each item has some content associated with it, and that such content can be described by a set of attributes; this is usually the case with content-based recommenders (de Gemmis et al., 2015). As shown in Equation 5, the unexp metric is estimated for the recommendation list produced for the user by the system under evaluation, and aggregates the uscore obtained for each item in the list. The uscore, defined in Equation 6, is the reciprocal of the average joint probability estimated for each pair of attributes of an item. In this equation, A_i represents the set of attributes that describe item i, and n_{a,b} is the number of items in the repository I that have both attributes a and b. Thus an objective view is adopted, since surprise can be seen as a property of the content of an item. Unlike the metrics previously described, this metric does not employ an external system (i.e. it is an intrinsic evaluation). In addition, it should be noted that this metric only accounts for surprise, and thus adopts a reductionist approach.

$$\mathit{unexp}(L_u) = \frac{1}{|L_u|} \sum_{i \in L_u} \mathit{uscore}(i) \qquad (5)$$

$$\mathit{uscore}(i) = \left( \binom{|A_i|}{2}^{-1} \sum_{\{a,b\} \subseteq A_i} \frac{n_{a,b}}{|I|} \right)^{-1} \qquad (6)$$
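To make the combinatorial intuition concrete, the following sketch computes a uscore-like value from attribute sets; the function names, the `pair_counts` data structure and the joint-probability estimate (pair co-occurrence counts divided by the repository size) are our assumptions for illustration, not Akiyama et al.'s exact formulation.

```python
from itertools import combinations

def uscore(attrs, pair_counts, n_items):
    """Reciprocal of the average joint probability over all pairs of an
    item's attributes: rarer attribute combinations yield higher scores.
    `pair_counts` maps sorted attribute pairs to co-occurrence counts."""
    pairs = list(combinations(sorted(attrs), 2))
    if not pairs:
        return 0.0  # fewer than two attributes: no pair to evaluate
    avg_joint = sum(pair_counts.get(p, 0) / n_items for p in pairs) / len(pairs)
    return 1.0 / avg_joint if avg_joint > 0 else float("inf")
```

For instance, an item whose two attributes co-occur in half of a 100-item repository scores 2, while a pair seen in only one item scores 100: the rarer the combination, the more unexpected the item.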
A metric for unexpectedness
Adamopoulos and Tuzhilin (Adamopoulos and Tuzhilin, 2011) propose a metric for unexpectedness that examines an intuition about user expectation: an item is expected for a user if it is known to them or bears some similarity to items known to them. As shown in Equation 7, the unexp metric is calculated from L_u, the recommendation list produced for user u by the system under evaluation, and E_u, a list of obvious, expected items that is defined in Equation 8. In that equation, PM_u is a recommendation list produced for user u by an external system, I_u represents the set of items that have been rated by user u, and the predicate neighbours represents the set of items in the repository (I) that are similar to the items in I_u up to some degree specified by the threshold parameters in d. This approach adopts an external system (extrinsic evaluation), the metric only accounts for surprise (a reductionist approach), and it adheres to a subjective view, since it takes account of the past experience of the user.
$$\mathit{unexp}(L_u) = \frac{|L_u \setminus E_u|}{|L_u|} \qquad (7)$$

$$E_u = PM_u \cup I_u \cup \mathit{neighbours}(I_u, I, d) \qquad (8)$$
The unserendipity metric
Zhang et al. (Zhang et al., 2012) explore the idea that a serendipitous recommendation must be dissimilar, in a semantic sense, to the items known to the user. Their unserendipity metric resembles the metric proposed by Akiyama et al. (Akiyama et al., 2010), since it assumes that each item is associated with some content, but in this case content attributes are represented as vectors instead of sets. As shown in Equation 9, the metric is computed from the recommendation list drawn up for user u by the system under evaluation (L_u), and results in a score that is the average cosine similarity obtained from the items in L_u and the set of items known to the user (I_u). a) This approach does not employ an external system (intrinsic evaluation); b) the metric only accounts for surprise (a reductionist approach); and c) it adheres to a subjective view of surprise. It should be noted that, unlike the metrics shown earlier, unserendipity is scale-inverted: the lower the score, the more surprising the recommendation list is.

$$\mathit{unsrdp}(L_u) = \frac{1}{|L_u| \cdot |I_u|} \sum_{i \in L_u} \sum_{j \in I_u} \cos(i, j) \qquad (9)$$
A metric for surprise
In a similar way to Zhang et al. (Zhang et al., 2012), Kaminskas and Bridge (Kaminskas and Bridge, 2014) argue that a surprising recommendation must be dissimilar to items known to the user, but do not require that this dissimilarity be semantic in nature. They also explore the interplay between the notions of distance and similarity (given a metric for distance, a similarity metric can be derived, and vice-versa (Deza and Deza, 2009)). Equation 10 shows that the metric is calculated from the recommendation list produced for the user by the system under evaluation, and produces the average surprise computed for each item in the list. The surprise of an item is estimated as either a) the minimum distance between the item and each item known to the user (I_u), as described in Equation 11, or b) the maximum degree of similarity between the same items, as shown in Equation 12. The predicate dist is defined as the Jaccard distance between the sets of attributes recovered from the content linked to the two items, while the predicate sim computes the normalised pointwise mutual information score (NPMI) (Bouma, 2009) for the same items. This approach does not employ an external system (intrinsic evaluation); the metric only accounts for surprise (a reductionist approach); and it supports a subjective view of surprise, since it takes account of the past experience of the users.
$$\mathit{surp}(L_u, u) = \frac{1}{|L_u|} \sum_{i \in L_u} s(i, I_u) \qquad (10)$$

$$s_{\mathit{dist}}(i, I_u) = \min_{j \in I_u} \mathit{dist}(i, j) \qquad (11)$$

$$s_{\mathit{sim}}(i, I_u) = \max_{j \in I_u} \mathit{sim}(i, j) \qquad (12)$$
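The min-distance form of this metric is straightforward to sketch. Here items are represented as attribute sets and the distance is the Jaccard distance, one of the two options named above; the function names are ours.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two attribute sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def item_surprise(item, known):
    """Surprise of an item as its minimum distance to any item the user
    already knows (the min-distance variant, Equation 11)."""
    return min(jaccard_distance(item, k) for k in known)

def list_surprise(rec_list, known):
    """Average surprise over a recommendation list (Equation 10)."""
    return sum(item_surprise(i, known) for i in rec_list) / len(rec_list)
```

Recommending an item identical to a known one yields surprise 0, while an item sharing no attributes with anything the user knows yields 1, so the list-level score averages between those extremes.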
In summary, we argue that all the metrics described involve (in an abstract sense) a notion of distance in their surprise component, whether it is applied to a) known and unknown items (subjective view) (Murakami et al., 2008; Adamopoulos and Tuzhilin, 2011; Zhang et al., 2012; Kaminskas and Bridge, 2014), b) expected and unknown items (extrinsic evaluation) (Murakami et al., 2008; Ge et al., 2010; Adamopoulos and Tuzhilin, 2011), or c) the content linked to different items (Akiyama et al., 2010; Adamopoulos and Tuzhilin, 2011; Zhang et al., 2012; Kaminskas and Bridge, 2014):

Murakami et al. (Murakami et al., 2008): the predicate isrel measures how similar an item is to the items known to a user;

Akiyama et al. (Akiyama et al., 2010): the predicate uscore is the reciprocal of a measure of similarity given by a joint probability;

Adamopoulos and Tuzhilin (Adamopoulos and Tuzhilin, 2011): the unexpectedness predicate is based on the difference between sets;

Zhang et al. (Zhang et al., 2012): the unserendipity predicate is defined as the reciprocal of geometric similarity, and is thus a distance;

Kaminskas and Bridge (Kaminskas and Bridge, 2014): the surprise predicate directly specifies the distance and similarity functions.
Finally, if one accepts the idea that surprise is a form of distance, and that there is a tendency to separate surprise from relevance when seeking serendipity, it can be argued that it might be fruitful to tackle the problem of evaluating surprise from an informational perspective, by trying to answer the following questions:

Are there limits to how much surprise a recommender can offer to a user?

Are there limits to how surprising a recommendation list can be?

If these limits exist, is it possible to use them to create a scale on which the performance of a system can be measured?

If these limits exist, do decisions on how to represent data or which distance function to employ influence them?

The current metrics for surprise do not address these questions, and this study seeks to fill in the gaps.
2.3. Evaluation of Surprise
All the metrics described in Section 2.2 evaluate the surprise (in this subsection, the term surprise also encompasses the serendipity and unexpectedness predicates described in Section 2.2) of a recommendation list produced by the system being evaluated. An evaluation method is required to obtain an estimate of how the system performs with regard to surprise. Most studies follow a statistical procedure to compute this kind of estimate: a sample of users is selected, recommendation lists are produced, surprise evaluations are made, and the average is calculated.
On the other hand, in response to a general consensus about the limited ability of accuracy metrics to evaluate the performance of recommenders in top-N recommendation tasks, a new offline evaluation method called “one plus random” was designed to estimate the recall of a recommender system (Cremonesi et al., 2008; Bellogin et al., 2011). This method follows the intuition that, in a sufficiently large set of items unknown to a user, most of the items are irrelevant to them. If an item that was highly rated by the user is added to this set, and an algorithm attributes to that item a score such that it ranks among the top-N items, then the algorithm has succeeded in the task. In a recent study, Kaminskas and Bridge (Kaminskas and Bridge, 2014) adapted this method to estimate the degree of surprise of a recommender system. The intuition behind the one plus random method is retained and, in addition to computing an estimate for recall, the method also computes the average surprise obtained from the recommendation lists produced for a sample of users.
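The intuition above can be sketched as a single trial: mix one held-out, highly-rated item into a large pool of random unknown items and check whether the scoring function ranks it in the top-N. The names and signature below are ours, not the original authors'.

```python
import random

def one_plus_random_hit(score_fn, liked_item, unknown_items, top_n=10, pool_size=1000):
    """One trial of the one plus random method: the algorithm succeeds if
    the held-out liked item ranks among the top-N of a pool made of
    `pool_size` random unknown items plus the liked item itself."""
    pool = random.sample(unknown_items, min(pool_size, len(unknown_items)))
    pool.append(liked_item)
    ranked = sorted(pool, key=score_fn, reverse=True)
    return liked_item in ranked[:top_n]
```

Recall is then estimated as the fraction of successful trials over many held-out highly-rated items.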
3. A theoretical model for surprise
Before addressing the questions posed in Section 2.2, it should be made clear what properties a metric for surprise should have.
Surprise must be subjective: Barto et al. (Barto et al., 2013) carried out a review of the concepts of surprise and novelty in cognitive science, as well as of their quantitative models (the notion of surprise in Kaminskas and Bridge (Kaminskas and Bridge, 2016) is closer to the notion of novelty in cognitive science than to that of surprise, as described in Barto et al. (Barto et al., 2013)). Reisenzein et al. (Reisenzein et al., 2017) examined the extent to which the experimental evidence supports different quantitative models of surprise. Both these studies portray these phenomena as subjective in nature, and show that their quantitative models usually involve some form of subjective probability, which may represent expectations or beliefs held by an individual. From a more intuitive standpoint, and when applied to recommender systems, this idea can be illustrated through Scenario 1:

Suppose two items are similar to each other;

Suppose one user has not been exposed to either item, and all the items known to this user are very dissimilar from both;

Suppose a second user has been exposed to one of the two items;

If the system recommended the other item to the first user, it would be a surprising recommendation;

If it recommended the same item to the second user, it would not be as surprising, because that user already knows a similar item.
Surprise must be dynamic: assuming that surprise depends on beliefs or expectations, it seems reasonable to presume that it changes over time, as the user is constantly being exposed to new experiences. This idea can be illustrated through Scenario 2:

Suppose two items are similar to each other;

Suppose a user has not been exposed to either item, and all the items known to this user are very dissimilar from both;

Suppose the system recommends the first item to the user;

After some time has passed, the second item is recommended to the same user;

Unlike what happened in Scenario 1, recommending the second item is not as surprising, since the user now knows a similar item.
Surprise is related to the notion of distance: all the metrics reviewed in Section 2.2 involve the notion of distance; most of them reflect the extent to which a new item resembles the items known to a user. This is in accordance with the subjective and dynamic views of surprise, since both involve assessing similarity between objects.
In adopting these three ideas as premises for this work, we support the definition of surprise given by Kaminskas and Bridge (Kaminskas and Bridge, 2014) and described in Equation 11. This definition assumes that the surprise of an item is inversely proportional to the degree to which the item is similar to the items known to the user; it adopts a subjective view, since it considers surprise to be a function of the items known to the user. It also accounts for changes in the surprise of an unobserved item as the set of known items grows.
The remainder of this section has two objectives. First, to devise a theoretical model that can be used to estimate the total amount of surprise a system can offer to an arbitrary user (Sections 3.1 to 3.5). Second, to employ the theoretical model to estimate the maximum amount of surprise a system can embed in a recommendation list of arbitrary length (Section 3.6). The theoretical model described next assumes the following settings:

Initial condition: each user has rated at least one item;

Interaction: the system produces a recommendation list for the user that contains only one item, which is promptly consumed;

The repository of the system has a finite number of items;

The repository of the system remains stable (no new items are introduced), and thus, after a finite number of interactions, all the users will have been exposed to all of the items.
3.1. Surprise is a finite resource
At any given time, a recommender system has a finite number of items in its repository. On the basis of this premise, we argue that surprise is a finite resource in this kind of system. Let I represent the set of items in the repository of the system. Suppose user u has been exposed to all but one item in the repository, namely item i. Let I'_u represent the set of items unknown to u, and I_u the set of items to which user u has been exposed. Thus, the total amount of surprise the system can offer to u is given by surp(i, I_u).
This scenario can be modified to allow for two unknown items: suppose that user u has been exposed to all but two items, namely i and j. Then I'_u = {i, j} and I_u = I ∖ {i, j}. Suppose that the system recommends items i and j, in this order. Then the total amount of surprise the system can offer to u is surp(i, I_u) + surp(j, I_u ∪ {i}). The last term accounts for the fact that item i was already known to user u when item j was recommended. It should be noted that the order in which the items are recommended may produce a different amount of surprise.
3.2. The surprise of a sequence
Building on the previous scenario, suppose that user u has been exposed to all but k items. Suppose that the system recommends these items in a specific order, represented by the sequence S = ⟨s_1, …, s_k⟩. Then the surprise of such a sequence of recommendations to the user can be generalised by the following predicate (the surprise of a sequence):

$$s_{seq}(S, I_u) = \mathit{surp}(h(S), I_u) + s_{seq}(t(S), I_u \cup \{h(S)\}) \qquad (13)$$

where h(S) represents the head of the sequence S, namely s_1, and t(S) its remaining items, ⟨s_2, …, s_k⟩. Since s_{seq} is a recursive predicate, let s_{seq}(S, I_u) = 0 when S = ⟨⟩.
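This recursion translates directly into code. The item-level surprise is passed in as a function, since the model leaves it as a free choice (e.g. a minimum distance to the known items); the names below are ours.

```python
def sequence_surprise(seq, known, surp):
    """Surprise of recommending `seq` in order: the head's surprise given
    the items known so far, plus the tail's surprise once the head has
    joined the known set. The empty sequence contributes zero."""
    if not seq:
        return 0.0
    head, tail = seq[0], seq[1:]
    return surp(head, known) + sequence_surprise(tail, known | {head}, surp)
```

With surp(i, K) = min over k in K of |i − k| on scalar "items", recommending ⟨11, 10⟩ to a user who knows only item 0 yields 11 + 1 = 12, while ⟨10, 11⟩ yields 10 + 1 = 11: the order of the recommendations matters.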
3.3. The potential surprise
As stated earlier, the amount of surprise a system can potentially offer to a user u is finite and depends on the sequence in which the items are ordered. Thus, the maximum potential surprise a system is able to offer to the user must correspond to the surprise obtained by a specific ordering of the elements of I'_u:

$$s_{max}(u) = \max_{S \in perm(I'_u)} s_{seq}(S, I_u) \qquad (14)$$

where perm(I'_u) is the set of permutations of the items in I'_u. The definition of the maximum potential surprise, s_{max}(u), asserts that there are some permutations of the items in I'_u that maximise the surprise for the user u. This amount of surprise can be interpreted as the “stock of surprise” a system can offer to user u. Following the same principle, the minimum potential surprise, s_{min}(u), corresponds to a permutation of the items that minimises the potential surprise for that user:

$$s_{min}(u) = \min_{S \in perm(I'_u)} s_{seq}(S, I_u) \qquad (15)$$
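For a tiny catalogue, both limits can be computed exactly by enumerating every permutation of the unknown items. The sketch below is self-contained, with the item-level surprise again left as a free choice.

```python
from itertools import permutations

def sequence_surprise(seq, known, surp):
    """Total surprise of a sequence, accumulated left to right."""
    total = 0.0
    seen = set(known)
    for item in seq:
        total += surp(item, seen)
        seen.add(item)
    return total

def potential_surprise_limits(unknown, known, surp):
    """Exact maximum and minimum potential surprise (Equations 14 and 15)
    by exhaustive enumeration; cost grows as |unknown|!, so this is only
    feasible for very small item sets."""
    totals = [sequence_surprise(p, known, surp) for p in permutations(unknown)]
    return max(totals), min(totals)
```

On the two-item example from Section 3.2 the limits are 12 and 11, confirming that different orderings yield different totals.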
3.4. The normalised surprise of a sequence
Once the maximum and the minimum amounts of potential surprise a system can offer to a user have been defined, these limits can be used to create a scale on which the surprise of any sequence comprising all the items in I'_u can be measured:

$$ns(S) = \frac{s_{seq}(S, I_u) - s_{min}(u)}{s_{max}(u) - s_{min}(u)} \qquad (16)$$

The normalised surprise of a sequence, ns(S), results in a score within the interval [0, 1]. If ns(S) = 1, then S is a permutation of the items in I'_u that maximises the surprise for the user u. On the other hand, if ns(S) = 0, then it offers the minimum amount of surprise to the user.
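Given the two limits, mapping any sequence's total surprise onto the scale is a one-liner; the guard against a degenerate zero-width interval is our addition.

```python
def normalised_surprise(total, s_min, s_max):
    """Equation 16: position of a sequence's total surprise within the
    [0, 1] scale bounded by the minimum and maximum potential surprise."""
    if s_max == s_min:
        return 0.0  # degenerate case: every ordering is equally surprising
    return (total - s_min) / (s_max - s_min)
```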
3.5. Computational costs and approximations
Real recommender systems have a huge number of items that are unknown to any given user. Since calculating s_{max}(u) requires evaluating the surprise of |I'_u|! permutations, its exact computation is not feasible. However, by applying optimisation techniques for combinatorial problems (Johnson and McGeoch, 2015; Mehdi, 2011), an approximation to s_{max}(u) (Equation 14) can be computed by means of a greedy estimation strategy:

$$\hat{s}_{max}(I'_u, I_u) = \mathit{surp}(i^{*}, I_u) + \hat{s}_{max}(I'_u \setminus \{i^{*}\},\ I_u \cup \{i^{*}\}) \qquad (17)$$

$$i^{*} = \arg\max_{i \in I'_u} \mathit{surp}(i, I_u) \qquad (18)$$

In Equation 18, i^{*} is the most surprising item in I'_u with respect to I_u. The same technique can be used to obtain an approximation for s_{min}(u) (Equation 15):

$$\hat{s}_{min}(I'_u, I_u) = \mathit{surp}(i_{*}, I_u) + \hat{s}_{min}(I'_u \setminus \{i_{*}\},\ I_u \cup \{i_{*}\}) \qquad (19)$$

$$i_{*} = \arg\min_{i \in I'_u} \mathit{surp}(i, I_u) \qquad (20)$$

In Equation 20, i_{*} is the least surprising item in I'_u with regard to I_u. We can now define an approximation to ns(S) (Equation 16) using the approximations for the maximum and minimum potential surprises (Equations 17 and 19, respectively):

$$\widehat{ns}(S) = \frac{s_{seq}(S, I_u) - \hat{s}_{min}(u)}{\hat{s}_{max}(u) - \hat{s}_{min}(u)} \qquad (21)$$
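The greedy strategy avoids the factorial enumeration: at each step it consumes the most (or least) surprising remaining item and treats it as known. A sketch, with our own names:

```python
def greedy_potential_surprise(unknown, known, surp, maximise=True):
    """Greedy approximation to the maximum (maximise=True) or minimum
    (maximise=False) potential surprise: repeatedly pick the item with
    the extreme surprise w.r.t. the items known so far, then add it to
    the known set. Runs in O(|unknown|^2) surprise evaluations."""
    pick = max if maximise else min
    remaining, seen, total = set(unknown), set(known), 0.0
    while remaining:
        best = pick(remaining, key=lambda i: surp(i, seen))
        total += surp(best, seen)
        remaining.remove(best)
        seen.add(best)
    return total
```

On the two-item example of Section 3.2 the greedy choice recovers the exact limits (12 and 11); in general it only approximates them.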
3.6. Surprise of a recommendation list
Up to this point, we have focused on estimating the total amount of surprise a recommender system can offer to an arbitrary user. The approach required finding a sequence consisting of all the items unknown to a user (I'_u) that maximises the potential surprise for them. We now turn to the problem of estimating the maximum amount of surprise the system can embed in a sequence that does not contain all the items unknown to a user. Such a sequence is referred to as a truncated sequence, and is represented as S^t. Now suppose that a truncated sequence S^t, consisting of k items, embeds the maximum amount of surprise for user u that can be embedded in a sequence with k items. Then it should be the case that s_{seq}(S^t, I_u) is maximal among all arrangements of k items. Solving for the normalised surprise using Equation 21 gives us:

$$\widehat{ns}(S^t) = \frac{\max_{S \in arr_k(I'_u)} s_{seq}(S, I_u) - \hat{s}_{min}(u)}{\hat{s}_{max}(u) - \hat{s}_{min}(u)}$$

where arr_k(I'_u) is the set of arrangements of k items in I'_u. An approximation for S^t can be obtained by a greedy strategy.
We recognise that the assumption that a truncated sequence can represent a recommendation list may be subject to criticism, since there are important discrepancies between the settings assumed by the theoretical model and real user interactions with the system: a user may fail to notice a recommendation list; a user may not promptly consume all the items in it; and a user may not consume the items in the order in which they are ranked in the list.
However, in our view, even in such cases, the theoretical model can still be used to provide an upper-bound estimate of the surprise experienced by the user.
3.7. Adapted evaluation method
Since the proposed metric estimates the normalised surprise of a recommendation list, we need a method to assess the performance of a recommender system with regard to it. We adopt the approach employed by Kaminskas and Bridge (Kaminskas and Bridge, 2014) and adapt the one plus random offline evaluation method for this assessment, as described in Algorithm 1. It has five parameters, namely a sample of users (U), the set of items in the repository (I), the user ratings (Ratings), the length of a recommendation list (topN), and a meta-model that, given the set of items known to a user, induces a model that computes a score for an arbitrary, unknown item (metamodel). As will be described in Section 4.2, this meta-model plays a key role in evaluating the predictions supported by the theoretical model.
In line 3, the set of items known to the user is computed, and in line 4, the set of items unknown to them. In line 6, a list with 1,000 unknown items is sampled, and each of these items is mapped to an (item, score) tuple (line 7); the score is produced by the model induced in line 5. In lines 8 and 9, the tuples are sorted in descending order of score, and the items that rank in the first topN positions form the recommendation list. Finally, in line 10, the approximate normalised surprise of this list is computed and accumulated. The algorithm returns the average amount of normalised surprise obtained from the user sample U.
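The evaluation loop can be sketched as follows. The parameter names follow Algorithm 1, but the callables `metamodel` (which induces a scoring function from the known items) and `norm_surprise` (the approximate normalised surprise of a list) are our assumed interfaces, not the authors' exact ones.

```python
import random

def evaluate_surprise(users, items, ratings, top_n, metamodel, norm_surprise):
    """Adapted one plus random evaluation: for each sampled user, score
    1,000 random unknown items, keep the top-N as the recommendation
    list, and average the normalised surprise over all users."""
    total = 0.0
    for u in users:
        known = {i for (user, i, rating) in ratings if user == u}   # line 3
        unknown = [i for i in items if i not in known]              # line 4
        score = metamodel(known)                                    # line 5
        sample = random.sample(unknown, min(1000, len(unknown)))    # line 6
        ranked = sorted(sample, key=score, reverse=True)            # lines 7-9
        rec_list = ranked[:top_n]
        total += norm_surprise(rec_list, known)                     # line 10
    return total / len(users)
```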
4. Experiments and Results
Two experiments were conducted: the first aims to assess the quality of the greedy approximations for maximum and minimum potential surprise; the purpose of the second is to validate predictions of the theoretical model through different choices of recommender algorithm, data representation, and distance function.
4.1. Evaluating the approximation strategy
Method: a synthetic dataset was employed to assess the differences between the exact computations of the maximum and minimum potential surprises and their greedy approximations.

Dataset: the dataset contains a single user and eleven items. The user was exposed to one item. The items are represented as vectors and arbitrarily distributed.

Procedure: the degree of surprise was measured for all the permutations of the set of ten unknown items, according to Equation 13. The maximum and minimum surprise measurements obtained were recorded. The greedy approximations for the maximum and minimum potential surprises, defined in Equations 17 and 19 respectively, were then computed and the results recorded.

Variations: since the surprise of a sequence uses a distance function, the procedure was repeated using four different functions: non-normalised Euclidean distance and cosine distance (geometric intuition), Jaccard distance (combinatorial intuition), and Jensen-Shannon divergence (informational intuition). The Jaccard distance and the Jensen-Shannon divergence were applied to vectors as described in (Jurafsky and Martin, 2018).
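The four distance functions can be sketched as follows. The set-based Jaccard distance and the Jensen-Shannon divergence are adapted to vectors here in a common way (weighted Jaccard on non-negative vectors, JS divergence on probability vectors), which we assume matches the treatment in Jurafsky and Martin (2018); the function names are ours.

```python
import math

def euclidean(p, q):
    """Non-normalised Euclidean distance (geometric intuition)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def cosine_distance(p, q):
    """Cosine distance: 1 minus the cosine of the angle between p and q."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

def jaccard_distance(p, q):
    """Weighted Jaccard distance for non-negative vectors
    (combinatorial intuition)."""
    mins = sum(min(a, b) for a, b in zip(p, q))
    maxs = sum(max(a, b) for a, b in zip(p, q))
    return 1.0 - mins / maxs

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two probability vectors
    (informational intuition), in bits."""
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```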
Results: Table 1 shows the values that were calculated using both the exact and the approximate predicates. Except for the maximum potential surprise under the non-normalised Euclidean distance, no substantial difference between the exact and approximate computations was observed. That deviation occurred because the greedy strategy underestimated the maximum potential surprise. Since underestimating the maximum (or overestimating the minimum) could lead the normalised surprise to a score outside the interval [0, 1], in practice it seems reasonable to clip its value if it falls outside this interval, and this solution was adopted in Experiment 2. Although these results do not support the general claim that a greedy algorithm will always achieve approximations as good as those obtained here, they suggest that the local approximation approach is feasible in real settings.
Distance          Exact max   Greedy max   Exact min   Greedy min
Euclidean           37.684      36.948       23.935      23.935
cosine               1.367       1.367        0.277       0.277
Jaccard              3.784       3.784        2.552       2.552
Jensen-Shannon       2.089       2.089        0.494       0.494
4.2. Evaluating the theoretical model
The theoretical model supports the following predictions:

If a recommender system embeds the maximum amount of surprise in each recommendation it produces, then its evaluation should achieve the maximum score on the potential surprise scale (a mean normalised surprise of 1).

If a system embeds the minimum amount of surprise in each recommendation, then its evaluation should obtain the minimum value of the scale (a mean normalised surprise of 0).

Any recommender whose objective is neither to maximise nor to minimise surprise will achieve an intermediate value within the scale (between 0 and 1).

These predictions should be confirmed regardless of the choices of data representation and distance function.
Method: a controlled environment was created to enable this experiment to be carried out. This environment submits data from the MovieLens 1M dataset to a process that produces a time series consisting of measurements of surprise, according to Equation 21.

Dataset: the MovieLens 1M dataset was employed (Harper and Konstan, 2015). It contains 3,883 items, 6,040 users and just over 1 million ratings. It was enhanced with short movie descriptions collected from the online MovieLens system in September 2017. Items whose short description was not available or was not written in English were rejected.
Process: the dataset is submitted to a process comprising three stages: preprocessing, segmentation and measurement.
In the preprocessing stage, a matrix of distances between all pairs of items in the dataset is computed, following the parameters specified in each variation outlined below.
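For illustration, the preprocessing stage could be sketched roughly as follows (a minimal version, assuming dense item vectors and a plug-in distance function; names are hypothetical):

```python
import math

def distance_matrix(vectors, dist):
    """Precompute a symmetric item-item distance matrix."""
    items = sorted(vectors)
    d = {}
    for idx, a in enumerate(items):
        for b in items[idx:]:
            d[a, b] = d[b, a] = dist(vectors[a], vectors[b])
    return d

def euclidean(u, v):
    """One of the six distance choices explored in this study."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Toy item vectors (hypothetical movie representations)
vecs = {"m1": [1.0, 0.0], "m2": [0.0, 1.0], "m3": [1.0, 1.0]}
D = distance_matrix(vecs, euclidean)
```

Any of the distance functions discussed later can be substituted for `euclidean`, provided its premises hold for the chosen representation.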
During the segmentation stage, the ratings are ordered by timestamp and aggregated into timeframes that include 1,500 ratings each.
Consecutive timeframes that contain ratings from at least 30 common users are marked as eligible for measurement if there is at least one 5-star rating for each user.
These criteria enable us to: 1) control the number of measurement samples; 2) control variation among the samples, since the users that are randomly allocated to a timeframe  will preferably be allocated to ; and 3) satisfy the conditions required by the original one-plus-random method (Cremonesi et al., 2008; Bellogin et al., 2011), by allowing us to obtain recall evaluations for each sample for future work.
Finally, during the measurement stage, Algorithm 1 is applied to each eligible interval.
An interval is a sequence of consecutive timeframes, starting from the first timeframe and stretching to an eligible timeframe.
The measurements are sequenced and recorded as a time series.
Variations: The process was repeated using different choices of recommender algorithm, data representation, and distance function.
Recommender algorithms: three algorithms were used: a) the traditional item-kNN (with ), which scores items according to the weighted average rating attributed to the most similar items known to a given user; b) an algorithm that scores items according to their degree of surprise (MSI, Most Surprising Item), which promotes surprising recommendation lists; and c) an algorithm that scores items according to their familiarity (LSI, Least Surprising Item), which promotes non-surprising recommendation lists. Since Algorithm 1 is used, a set of 1,000 randomly-sampled unknown items is submitted to each algorithm, which in turn attributes a score to each item. The top-ranking items () are then selected as a recommendation list.
Data representation: four models were used: Models C and P are semantic models, Model U is a user-item model, and Model N is an NPMI model.
Model C is a traditional vector space model of semantics (or count-based VSM) (Baroni et al., 2014; Turney and Pantel, 2010). It uses the short description linked to an item to produce its respective semantic vector. Items whose description was too short (fewer than 13 terms after stop-word removal, using the default NLTK stop words for English (Loper and Bird, 2002)) were rejected. Before computing tf-idf scores (Manning et al., 2008), the terms were stemmed by means of the Snowball algorithm (Porter, 2001; Loper and Bird, 2002).
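A minimal count-based VSM along these lines might look like this (stop-word removal and Snowball stemming are omitted, and the smoothed idf variant is an assumption, not necessarily the one used in the study):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build one tf-idf vector per document over a shared vocabulary."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = {t: sum(t in d for d in docs) for t in vocab}
    idf = {t: math.log(n / df[t]) + 1 for t in vocab}   # smoothed idf
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

# Toy short descriptions, already tokenised
docs = [["space", "wars"], ["space", "romance"], ["romance", "drama"]]
vocab, vecs = tfidf_vectors(docs)
```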
Model P is a distributed vector space model of semantics (or prediction-based VSM) (Baroni et al., 2014). It uses the short description linked to each item to produce a semantic vector. Semantic vectors were extracted through an implementation of the Paragraph Vector (Mikolov et al., 2013; Řehůřek and Sojka, 2010), which requires neither stop-word removal nor stemming.
Model U is a user-item model (Ning et al., 2015). Each item is represented as a vector with length equal to the number of users in the system repository. Each component represents a) the rating the user has attributed to item , or b) zero if the item was not rated by . Items without any rating were rejected.
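Model U's item vectors can be sketched as follows (the rating-lookup layout is an assumption made for illustration):

```python
def item_vectors(ratings, users, items):
    """One vector per item: component u holds user u's rating, or zero
    if the user did not rate the item; unrated items are rejected."""
    vecs = {}
    for item in items:
        v = [ratings.get((u, item), 0) for u in users]
        if any(v):          # reject items without any rating
            vecs[item] = v
    return vecs

ratings = {("u1", "m1"): 5, ("u2", "m1"): 3, ("u2", "m2"): 4}
vecs = item_vectors(ratings, ["u1", "u2"], ["m1", "m2", "m3"])
```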
Model N is an NPMI score model (Kaminskas and Bridge, 2014). The model consists of two probability distributions,  and : the first is estimated as the proportion of users who have been exposed to each item , and the latter as the proportion of users who have been exposed to both items . The distance between  is calculated by means of the NPMI score (Bouma, 2009), which measures similarity, and is then inverted and rescaled so that the image is mapped to .

Variations  MSI  kNN  LSI
Model  Distance  Median  Mean  St.Dev.  Median  Mean  St.Dev.  Median  Mean  St.Dev. 
C  Euclidean  0.912  0.910  0.031  0.448  0.443  0.087  0.023  0.024  0.010 
C  cosine  0.985  0.980  0.019  0.742  0.740  0.077  0.211  0.219  0.093 
C  Jaccard  0.969  0.964  0.025  0.704  0.697  0.084  0.172  0.193  0.102 
C  Jensen-Shannon  0.984  0.975  0.031  0.626  0.615  0.088  0.081  0.097  0.074 
C  Aitchison  0.979  0.978  0.015  0.512  0.510  0.088  0.037  0.040  0.018 
P  Euclidean  0.985  0.984  0.011  0.615  0.605  0.082  0.096  0.099  0.042 
P  cosine  0.971  0.951  0.061  0.571  0.566  0.083  0.093  0.096  0.050 
U  Euclidean  0.921  0.918  0.032  0.835  0.813  0.102  0.004  0.007  0.018 
U  cosine  0.983  0.970  0.036  0.618  0.633  0.179  0.037  0.042  0.027 
U  Jaccard  0.999  0.939  0.097  0.585  0.609  0.203  0.053  0.059  0.038 
U  Jensen-Shannon  0.953  0.948  0.029  0.603  0.602  0.162  0.081  0.085  0.036 
U  Aitchison  0.946  0.943  0.025  0.758  0.745  0.098  0.008  0.011  0.015 
N  NPMI  0.687  0.678  0.091  0.545  0.535  0.111  0.098  0.111  0.072 
Table 2. Median, mean and standard deviation of  over the MSI, kNN, and LSI algorithms.

Distance functions: six distance functions were used to explore different intuitions: the Euclidean and cosine distances (geometric), the Jaccard distance (combinatorial), the Jensen-Shannon divergence and NPMI (informational), and the Aitchison distance (statistical) (Egozcue et al., 2011).
The Jensen-Shannon divergence and the Aitchison distance are not defined when one vector has a zero component, so they require smoothed vectors.
When needed, smoothing was applied using Bayesian Multiplicative Treatment (BMT) with Perks prior (Egozcue et al., 2011).
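As an illustration of how such smoothing behaves, here is a minimal BMT sketch with a Perks prior; the exact parameterisation used in the study is an assumption on our part:

```python
def bmt_perks(x, n):
    """Bayesian Multiplicative Treatment with a Perks prior (s_j = 1/D,
    so s = 1): each zero part of the composition x (assumed to sum to 1
    and to derive from n counts) is replaced by t = (1/D) / (n + 1), and
    the non-zero parts are rescaled so the result still sums to 1."""
    D = len(x)
    t = (1.0 / D) / (n + 1.0)           # replacement value per zero part
    zero_mass = t * sum(1 for v in x if v == 0)
    return [t if v == 0 else v * (1 - zero_mass) for v in x]

smoothed = bmt_perks([0.5, 0.5, 0.0, 0.0], n=10)
```

The smoothed vector has strictly positive components, so the Jensen-Shannon divergence and Aitchison distance become well defined.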
In addition, since each distance function rests on particular premises, some of them cannot be applied to vectors from all the representation models. The Jaccard distance, Jensen-Shannon divergence, and Aitchison distance require compositional data (Egozcue et al., 2011), which means that they can only be applied to vectors from Models C and U; the NPMI score requires an NPMI score model.
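The conversion of the NPMI score into a distance, as described for Model N, might look like this (the exact rescaling used in the study is an assumption):

```python
import math

def npmi(p_i, p_j, p_ij):
    """Normalised PMI (Bouma, 2009); ranges over [-1, 1]."""
    return math.log(p_ij / (p_i * p_j)) / -math.log(p_ij)

def npmi_distance(p_i, p_j, p_ij):
    """Invert and rescale the similarity so the image maps to [0, 1]."""
    return (1 - npmi(p_i, p_j, p_ij)) / 2

# Independent items land in the middle of the scale; items that always
# co-occur are at distance 0.
d_independent = npmi_distance(0.2, 0.2, 0.04)
d_identical = npmi_distance(0.2, 0.2, 0.2)
```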
Results: Table 2 shows the median, mean, and standard deviation of values obtained from the time series that were produced by executing the process under different variations.
The mean under kNN was within the predicted range, but none of the variations under MSI or LSI achieved their predicted results.
The average values for  under MSI were consistently higher than those under kNN and closest to 1, whereas the values for  under LSI were consistently lower than those under kNN and closest to 0.
There are two reasons for this discrepancy.
First, Algorithm 1 draws 1,000 items from the set of unknown items (), while  and  select top-N items from  without sampling.
In this situation, the probability that a 1,000-item sample will miss one of the top-N items selected by  (or ) is over 66%.
Since the mean in Table 2 is calculated over a time series comprising at least 30 intervals, each containing at least 30 users, the probability that all 900 draws contain all the top-N items is practically nil.
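The argument can be checked with a hypergeometric computation; the figures below (roughly 3,000 unknown items per user) are illustrative assumptions rather than values taken from the study:

```python
from math import comb

def p_miss_topn(m, k, n):
    """Probability that a k-item sample without replacement from m
    unknown items misses at least one of the top-n items."""
    return 1 - comb(m - n, k - n) / comb(m, k)

p_one = p_miss_topn(3000, 1000, 1)    # ~2/3 for a single top item
p_all_draws = (1 - p_one) ** 900      # all 900 draws keep the top item
```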
Second, Algorithm 1 does not apply a greedy search when selecting items from the sample of 1,000 items, as and do.
A supplementary experiment was conducted in which Algorithm 1 was altered to use the  without sampling and with a greedy search, and then applied to the variations with the largest discrepancies, namely Model N with the NPMI score under MSI and Model C with the cosine distance under LSI.
The results confirmed the predicted results for both MSI and LSI.
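A generic version of the greedy selection used in the supplementary experiment can be sketched as follows; the details of Algorithm 1's altered search are not given here, so this is a schematic assumption:

```python
def greedy_topn(candidates, gain, n):
    """Repeatedly add the candidate with the largest marginal gain given
    the items already selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < n:
        best = max(pool, key=lambda c: gain(c, selected))
        selected.append(best)
        pool.remove(best)
    return selected

# With a selection-independent gain this degenerates to plain top-n scoring.
picks = greedy_topn([1, 5, 3], lambda c, selected: c, n=2)
```

In the surprise setting, `gain` would score a candidate by its distance to the items the user already knows, given the items selected so far.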
5. Discussion
The aim of this study was to determine a) if there are limits to how much surprise a recommender system can offer to its users, b) how much surprise it can embed in a recommendation list, and c) how these limits can be used to design a metric that reflects how much room there is for improving surprise in recommendations. While previous studies focused on designing metrics that explore different intuitions about what makes a surprising recommendation (Murakami et al., 2008; Ge et al., 2010; Akiyama et al., 2010; Adamopoulos and Tuzhilin, 2011; Zhang et al., 2012; Kaminskas and Bridge, 2014) or how to combine different metrics (Silveira et al., 2017), we explored a novel perspective: surprise as a finite resource in any recommender system, whatever intuition about surprise is adopted.
As the results suggest, there are limits to surprise, and the proposed metric obtained values consistent with these theoretical bounds. The values were also consistent across a number of experimental settings in which the choices of data representation and distance function varied. In fact, one contribution made by this work is that it provides further evidence that such choices have a non-negligible effect. For example, the coefficient of variation of  over Models C, P and U under kNN is 13.6% with the cosine distance, and 29.9% with the Euclidean distance. In addition, the coefficient of variation under kNN is 13.8% for Model U and 20.7% for Model C.
As briefly discussed in Section 3.6, we recognise that there are important discrepancies between the theoretical model employed in this study and a real-world setting. As argued, the model can still be used to estimate an upper bound on the real surprise experienced by a user. However, there is another limitation that still needs to be addressed. The theoretical model assumes that the repository remains stable. This requirement was necessary to allow for a fixed upper bound of potential surprise. As a means of evaluating the impact of undermining this premise, in Experiment 2 the dataset is segmented in chronological order so that each interval only includes items with an estimated release date within that interval, thus simulating an evolving repository. The behaviour of the metric under this condition, as the results suggest, remained within the limits predicted by the theoretical model.
It should also be noted that a surprise model, like any model, is a simplification of the world. For example, the definition of surprise adopted in this study (Equation 11) models the user experience as a set of items. As a result, everything known to a user is represented as points in , and all they know is the subject of movies. The definition also assumes that a user has no bias when recovering from memory the one item that is most similar to the one being recommended, according to some notion of similarity. Both premises are obviously unrealistic, but in our view such oversimplified models are necessary and still useful, especially in the absence of realistic computational models of surprise that can feasibly be employed to describe an arbitrary user interacting with a recommender system.
6. Conclusion
This study adopts a new approach to evaluating surprise in recommender systems. A theoretical model of surprise was designed on the assumption that surprise is a limited resource in any recommender system. The model predicts bounds on how much surprise a recommender system can offer its users, and these bounds were employed to design a surprise metric that can be used to determine how competent a system is at embedding surprise in its recommendations and how much room there is for improvement.
Further work should be carried out to explore other datasets and recommendation algorithms, as well as other choices of local optimisation algorithm. The decision to approximate the potential surprise by means of a greedy algorithm took account of the fact that its results are easy to check manually. However, this algorithm lacks the theoretical precision guarantees that some alternatives offer.
Finally, the theoretical model opens up a line of thought that may be fruitful to pursue. On the assumption that surprise arises from a lack of information, the concept of maximum potential surprise can be framed as the total amount of information a system can offer to a given user. Since the order in which items are recommended to a user can lead to a lower amount of experienced surprise, does this mean that information is lost? It might be the case that this imbalance between maximum potential surprise and actual user surprise can establish a close relationship with relevance or other properties of recommender systems. In a loose analogy with mechanical systems, potential energy is never lost; rather, it can only be transformed into something else.
References
 Adamopoulos and Tuzhilin (2011) Panagiotis Adamopoulos and Alexander Tuzhilin. 2011. On Unexpectedness in Recommender Systems: Or How to Expect the Unexpected. In Proceedings of the Workshop on Novelty and Diversity in Recommender Systems at the Fifth ACM International Conference on Recommender Systems (DiveRS @ RecSys 2011). ACM, New York, NY, USA, 11–18. https://doi.org/10.1145/2043932.2044019
 Akiyama et al. (2010) Takayuki Akiyama, Kiyohiro Obara, and Masaaki Tanizaki. 2010. Proposal and Evaluation of Serendipitous Recommendation Method Using General Unexpectedness. In Proceedings of the Workshop on the Practical Use of Recommender Systems, Algorithms and Technologies at the Fourth ACM International Conference on Recommender Systems (PRSAT @ RecSys 2010). ACM, New York, NY, USA, 3–10. https://doi.org/10.1145/1864708.1864795
 Baroni et al. (2014) Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Stroudsburg, PA, USA, 238–247. https://doi.org/10.3115/v1/P141023
 Barto et al. (2013) Andrew Barto, Marco Mirolli, and Gianluca Baldassarre. 2013. Novelty or Surprise? Frontiers in Psychology 4 (2013), 907. https://doi.org/10.3389/fpsyg.2013.00907
 Bellogin et al. (2011) Alejandro Bellogin, Pablo Castells, and Ivan Cantador. 2011. Precision-oriented Evaluation of Recommender Systems: An Algorithmic Comparison. In Proceedings of the Fifth ACM Conference on Recommender Systems (RecSys ’11). ACM, New York, NY, USA, 333–336. https://doi.org/10.1145/2043932.2043996
 Bouma (2009) Gerlof Bouma. 2009. Normalized (Pointwise) Mutual Information in Collocation Extraction. In Proceedings of the Conference of the German Society for Computational Linguistics and Language Technology (GSCL 2009). GSCL e.V., Mannheim, Germany, 31–40. https://svn.spraakdata.gu.se/repos/gerlof/pub/www/Docs/npmipfd.pdf
 Cremonesi et al. (2008) Paolo Cremonesi, Roberto Turrin, Eugenio Lentini, and Matteo Matteucci. 2008. An Evaluation Methodology for Collaborative Recommender Systems. In International Conference on Automated Solutions for Cross Media Content and Multi-Channel Distribution (AXMEDIS 2008). IEEE, Washington, DC, USA, 224–231. https://doi.org/10.1109/AXMEDIS.2008.13
 de Gemmis et al. (2015) Marco de Gemmis, Pasquale Lops, Cataldo Musto, Fedelucio Narducci, and Giovanni Semeraro. 2015. Semantics-Aware Content-Based Recommender Systems. In Recommender Systems Handbook (2nd. ed.), Francesco Ricci, Lior Rokach, and Bracha Shapira (Eds.). Springer, New York, NY, Chapter 26, 119–159. https://doi.org/10.1007/9781489976376
 Deza and Deza (2009) Michel Marie Deza and Elena Deza. 2009. Encyclopedia of Distances. Springer, Berlin, Heidelberg. 1–583 pages. https://doi.org/10.1007/9783642002342
 Egozcue et al. (2011) Juan José Egozcue, Carles Barceló-Vidal, Josep Antoni Martín-Fernández, Eusebi Jarauta-Bragulat, José Luis Díaz-Barrero, and Glòria Mateu-Figueras. 2011. Compositional Data Analysis. Wiley-Blackwell, Hoboken, NJ, USA, Chapter 11, 139–157. https://doi.org/10.1002/9781119976462.ch11
 Ge et al. (2010) Mouzhi Ge, Carla Delgado-Battenfeld, and Dietmar Jannach. 2010. Beyond Accuracy: Evaluating Recommender Systems by Coverage and Serendipity. In Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys ’10). ACM, New York, NY, USA, 257–260. https://doi.org/10.1145/1864708.1864761
 Han et al. (2011) Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data Mining: Concepts and Techniques (3rd. ed.). Morgan Kaufmann, San Francisco, CA, USA.
 Harper and Konstan (2015) F Maxwell Harper and Joseph A Konstan. 2015. The Movielens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 19. https://doi.org/10.1145/2827872
 Herlocker et al. (2002) Jon Herlocker, Joseph A Konstan, and John Riedl. 2002. An Empirical Analysis of Design Choices in NeighborhoodBased Collaborative Filtering Algorithms. Information retrieval 5, 4 (2002), 287–310.
 Herlocker et al. (2004) Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John T Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 5–53.
 Johnson and McGeoch (2015) David S Johnson and Lyle A McGeoch. 2015. The traveling salesman problem: A case study in local optimization (preliminary version). (2015). http://www.csc.kth.se/utbildning/kth/kurser/DD2440/avalg14/TSPJohMcg97.pdf
 Jurafsky and Martin (2018) Dan Jurafsky and James H Martin. 2018. Speech and Language Processing (draft manuscript of the 3rd ed.). (2018). https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
 Kaminskas and Bridge (2014) Marius Kaminskas and Derek Bridge. 2014. Measuring Surprise in Recommender Systems. In Proceedings of the Workshop on Recommender Systems Evaluation: Dimensions and Design, at the 8th ACM Conference on Recommender Systems (REDD @ RecSys ’14). ACM, New York, NY, USA, 393–394. https://doi.org/10.1145/2645710.2645780
 Kaminskas and Bridge (2016) Marius Kaminskas and Derek Bridge. 2016. Diversity, Serendipity, Novelty, and Coverage: A Survey and Empirical Analysis of BeyondAccuracy Objectives in Recommender Systems. ACM Trans. Interact. Intell. Syst. 7, 1, Article 2 (Dec. 2016), 42 pages. https://doi.org/10.1145/2926720

 Loper and Bird (2002) Edward Loper and Steven Bird. 2002. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1 (ETMTNLP ’02). Association for Computational Linguistics, Stroudsburg, PA, USA, 63–70. https://doi.org/10.3115/1118108.1118117
 Manning et al. (2008) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.
 McNee et al. (2006) Sean M. McNee, John Riedl, and Joseph A. Konstan. 2006. Being Accurate is Not Enough: How Accuracy Metrics Have Hurt Recommender Systems. In Extended Abstracts on Human Factors in Computing Systems (CHI EA ’06). ACM, New York, NY, USA, 1097–1101. https://doi.org/10.1145/1125451.1125659
 Mehdi (2011) Malika Mehdi. 2011. Parallel Hybrid Optimization Methods for Permutation Based Problems. Ph.D. Dissertation. Université des Sciences et Technologie de Lille  Lille I, Lille, France.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.). Curran Associates, Inc., Red Hook, NY, USA, 3111–3119.
 Mourão et al. (2017) Fernando Mourão, Leonardo Rocha, Camila Araújo, Wagner Meira Jr, and Joseph Konstan. 2017. What surprises does your past have for you? Information Systems 71 (2017), 137–151. https://doi.org/10.1016/j.is.2017.08.001

 Murakami et al. (2008) Tomoko Murakami, Koichiro Mori, and Ryohei Orihara. 2008. Metrics for Evaluating the Serendipity of Recommendation Lists. In New Frontiers in Artificial Intelligence, Ken Satoh, Akihiro Inokuchi, Katashi Nagao, and Takahiro Kawamura (Eds.). Springer, Berlin, Heidelberg, 40–46. https://doi.org/10.1007/9783540781974_5
 Ning et al. (2015) Xia Ning, Christian Desrosiers, and George Karypis. 2015. A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook (2nd. ed.), Francesco Ricci, Lior Rokach, and Bracha Shapira (Eds.). Springer, New York, NY, Chapter 2, 37–76. https://doi.org/10.1007/9781489976376
 Porter (2001) Martin F Porter. 2001. Snowball: A language for stemming algorithms. (2001). http://snowball.tartarus.org/texts/introduction.html
 Řehůřek and Sojka (2010) Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/884893/en.
 Reisenzein et al. (2017) Rainer Reisenzein, Gernot Horstmann, and Achim Schützwohl. 2017. The CognitiveEvolutionary Model of Surprise: A Review of the Evidence. Topics in Cognitive Science (Online Early View) (2017). https://doi.org/10.1111/tops.12292
 Silveira et al. (2017) Thiago Silveira, Leonardo Rocha, Fernando Mourão, and Marcos Gonçalves. 2017. A Framework for Unexpectedness Evaluation in Recommendation. In Proceedings of the Symposium on Applied Computing (SAC ’17). ACM, New York, NY, USA, 1662–1667. https://doi.org/10.1145/3019612.3019760
 Turney and Pantel (2010) Peter D. Turney and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, 1 (jan 2010), 141–188. http://dl.acm.org/citation.cfm?id=1861751.1861756
 Zhang et al. (2012) Yuan Cao Zhang, Diarmuid Ó Séaghdha, Daniele Quercia, and Tamas Jambor. 2012. Auralist: Introducing Serendipity into Music Recommendation. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM ’12). ACM, New York, NY, USA, 13–22. https://doi.org/10.1145/2124295.2124300