Collaborative Filtering under Model Uncertainty

08/23/2020 ∙ by Robin M. Schmidt, et al. ∙ Universität Tübingen

In their work, Dean, Rich, and Recht create a model to study recourse and availability of items in a recommender system. We use the definition of predictive multiplicity by Marx, Pin Calmon, and Ustun to examine different variations of this model, obtained by varying two model parameters. Pairwise comparisons of these models show that most of them produce very similar results in terms of discrepancy and ambiguity for availability, and that the availability sets differ significantly only in some cases.






I Introduction

In today’s society, recommendation systems have a huge influence on how individuals explore and experience information DeanRR20. They are applied in a broad variety of domains, including media (e.g. videos or music), product recommendations, travel and real estate. This raises interesting questions such as “How easily can a user be pigeonholed by their viewing history?” or “How does a recommender system encode bias that limits the availability of content?”. These are some of the main questions that inspired the work of DeanRR20, which forms the basis of our contributions by providing the respective models and definitions.

II Model by DeanRR20

II-A Problem Setting

Fig. 1: Test RMSE (y-axis) of the matrix factorization models with varying latent space dimensions (x-axis) on (a) the MovieLens dataset (left) and (b) the LastFM dataset (right) DeanRR20.

For their model, DeanRR20 define a recommender system as a collection of users $u$ and items $i$. A rating for a user-item combination is then denoted as $r_{ui}$. The set of items with observed ratings, $\Omega_u$, for a user $u$ is used to store the ratings inside a sparse rating history vector $\mathbf{r}_u$ with defined values at $\Omega_u$ and $0$ at every other position. This allows the recommender system to make decisions following a policy based on this sparse vector, denoted as $\pi(\mathbf{r}_u)$, which yields a subset of items.

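Based on these constraints, DeanRR20 define an item $i$ to be reachable from user $u$ if there is a modification to the rating history $\mathbf{r}_u$ that results in item $i$ being recommended to user $u$. With that, they define the whole reachability problem as

$$\min_{\tilde{\mathbf{r}} \in \mathcal{M}(\mathbf{r}_u)} \; \mathrm{cost}(\tilde{\mathbf{r}}; \mathbf{r}_u) \quad \text{s.t.} \quad i \in \pi(\tilde{\mathbf{r}}) \tag{1}$$

utilizing a modification set $\mathcal{M}(\mathbf{r}_u)$, which describes the possible modifications to the rating history, and a cost function $\mathrm{cost}(\tilde{\mathbf{r}}; \mathbf{r}_u)$, which describes the difficulty of making these changes. Intuitively, the cost function can correspond to the number of changes needed or to how far these changes are from the current preferences of the user DeanRR20.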

Moreover, DeanRR20 focus on linear preference models, which use a user vector $p_u$ and an item vector $q_i$ in combination with an item bias $b_i$, a user bias $b_u$ and an overall bias $\mu$ to yield a predicted user rating as

$$\hat{r}_{ui} = q_i^\top p_u + b_i + b_u + \mu$$

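When incorporating matrix factorization with user and item representations as factors lying in a latent space of size $d$, we can represent the factors as $P \in \mathbb{R}^{n \times d}$ and $Q \in \mathbb{R}^{m \times d}$ for $n$ users and $m$ items. Together with the regularizer $\Lambda$, this yields

$$\min_{P, Q} \; \sum_{u} \sum_{i \in \Omega_u} \left( p_u^\top q_i - r_{ui} \right)^2 + \Lambda(P, Q)$$

for fitting the model, where the inner sum runs over the items $\Omega_u$ rated by user $u$ and DeanRR20 use $\ell_2$ regularization on user and item factors.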
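To make the linear preference model and the fitting objective more concrete, the following is a minimal sketch in plain Python. The function names (`predict`, `objective`), the dictionary layout for the observed ratings, and the choice of a squared-error loss with an L2 penalty weighted by `lam` are assumptions for illustration and are not taken from the implementation of DeanRR20.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def predict(p_u: Vector, q_i: Vector, b_u: float, b_i: float, mu: float) -> float:
    """Linear preference model: inner product plus user, item and overall bias."""
    return dot(p_u, q_i) + b_u + b_i + mu


def objective(
    ratings: Dict[Tuple[int, int], float],  # (user, item) -> observed rating
    P: List[Vector],  # one latent vector per user
    Q: List[Vector],  # one latent vector per item
    lam: float,  # regularization weight (hypothetical name)
) -> float:
    """Squared error over observed ratings plus an L2 penalty on all factors.

    Bias terms are left out of the objective to keep the sketch short;
    this is a simplifying assumption.
    """
    error = sum((dot(P[u], Q[i]) - r) ** 2 for (u, i), r in ratings.items())
    penalty = lam * (sum(dot(p, p) for p in P) + sum(dot(q, q) for q in Q))
    return error + penalty
```

Only observed user-item pairs enter the error term, which mirrors the idea that unobserved positions of the sparse rating history carry no information about the user.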

When defining the cost function, instead of modeling it as the penalty on change from existing ratings, DeanRR20 penalize the change from predicted ratings. For edits on observed items (history edits) it is defined as


while edits on the recommended items (reactions) are defined as


II-B Recourse and Availability

Further, DeanRR20 define recourse and availability, which are important for understanding how a user’s preferences limit the reachable content and how available a certain item is inside the recommender system, respectively. They are defined as:

Definition 1 (recourse)

The amount of recourse available to a user $u$ is defined as the percentage of unseen items that are reachable, i.e. for which Equation 1 is feasible. The difficulty of recourse is defined by the average value of the recourse problem over all reachable items $i$.

Definition 2 (availability)

The availability of items in a recommender system is defined as the percentage of items that are reachable by some user.
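Assuming the reachable items have already been determined for each user, the two definitions reduce to simple set computations. The sketch below illustrates this; the function names and the representation of items as integer ids are hypothetical.

```python
from typing import Dict, Set


def recourse(reachable: Set[int], seen: Set[int], all_items: Set[int]) -> float:
    """Percentage of a user's unseen items that are reachable."""
    unseen = all_items - seen
    if not unseen:
        return 0.0
    return 100.0 * len(reachable & unseen) / len(unseen)


def availability(reachable_by_user: Dict[int, Set[int]], all_items: Set[int]) -> float:
    """Percentage of items that are reachable by at least one user."""
    if not all_items:
        return 0.0
    reachable_by_some = set().union(*reachable_by_user.values())
    return 100.0 * len(reachable_by_some & all_items) / len(all_items)
```

Recourse is a per-user quantity, whereas availability aggregates over all users, so an item counts as available as soon as a single user can reach it.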

II-C User Cold-Start

When a new user enters the recommender system, they have no prior rating history from which to predict preferences. This situation is known as the user cold-start problem, and common strategies for recommender systems include presenting items which are most likely to be rated highly or to be most informative about the respective user’s preferences DeanRR20. With the definition of recourse, DeanRR20 propose to choose the onboarding set not only for its contribution to model accuracy but also for the amount of recourse it provides.

II-D Sufficient Conditions for Top-N

Fig. 2: Discrepancy of availability on the MovieLens dataset, comparing the available items of any baseline model (y-axis) with the available items of any model in the $\epsilon$-level set (x-axis) for varying latent space size $d$. Panels (a) to (f) each show one Top-N recommender system. The content of each cell is the number of elements of the smaller available set that are missing from the larger one. Final discrepancy (row-wise maximum) for the baseline model and the size of available items of the smaller set are highlighted in red and green.

The recommender system described so far by DeanRR20, a Top-1 recommender system, only ever recommends one item at a time. Since most real-world applications serve several items at once, DeanRR20 extend this system to recommend $N$ items at the same time, creating a Top-N recommender system with $N > 1$.
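A recommendation policy of this kind can be sketched in a few lines. The example assumes that predicted ratings are available as a dictionary, that only unseen items are recommended, and that ties are broken by item id; all three are assumptions made for illustration.

```python
from typing import Dict, List, Set


def top_n(predicted: Dict[int, float], seen: Set[int], n: int) -> List[int]:
    """Return the n unseen items with the highest predicted rating."""
    candidates = [item for item in predicted if item not in seen]
    # Sort by descending score; ties are broken by item id (an assumption).
    candidates.sort(key=lambda item: (-predicted[item], item))
    return candidates[:n]
```

With `n = 1` this reduces to recommending a single item, and larger values of `n` return a ranked set of items.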

DeanRR20 define an item-region for the Top-N case, when $N > 1$, as follows:

$$\mathcal{R}_i = \left\{ v \in \mathbb{R}^d \;:\; q_i^\top v > q_j^\top v \right\}$$

with the inequality holding for all but at most $N - 1$ items $j \neq i$.

According to DeanRR20, this region is contained in the latent space, which is generally of relatively small dimension $d$, but its description depends on the number of items, which in general will be quite large. While the description is linear for $N = 1$, for $N > 1$ each region requires a number of linear inequalities that becomes expensive very quickly, even for small values of $N$.

II-D1 Sufficient Condition for Availability

To bypass those computational concerns, DeanRR20 show that the full description of the region is not necessary; instead, finding any single point in the latent space that lies inside the region is sufficient. They propose a sampling approach to determine the availability of an item with a complexity of only .

Items that are shown to be available via this sampling approach are called aligned-reachable by DeanRR20. Counting them gives a lower bound on the availability of items, i.e. an underestimate of the availability of items in a system.

Using the aligned-reachable condition as a generic model audit, DeanRR20 propose an item-based audit algorithm with and an increased value for . The model audit counts the number of aligned-unreachable items, returning a lower bound on the overall availability of items. The model audit can also be used to propose constraints or penalties on the model during training.
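The item-based audit can be illustrated with a small sketch. It assumes that the sampled test point for item `i` is the item's own factor vector, that bias terms are ignored, and that ties count against the item; the function names are hypothetical. This is one plausible reading of the sampling idea rather than a reproduction of the audit algorithm of DeanRR20.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def is_aligned_reachable(i: int, Q: List[Vector], n: int) -> bool:
    """Check whether item i ranks among the top n at the test point v = Q[i]."""
    v = Q[i]  # assumed test point: the item's own factor vector
    own_score = dot(Q[i], v)
    # Count the other items that score at least as high at this point.
    beaten_by = sum(
        1 for j, q_j in enumerate(Q) if j != i and dot(q_j, v) >= own_score
    )
    return beaten_by <= n - 1


def audit(Q: List[Vector], n: int) -> List[int]:
    """Return the items that fail the check (aligned-unreachable items)."""
    return [i for i in range(len(Q)) if not is_aligned_reachable(i, Q, n)]
```

Because the check only tests a single point per item, an item that fails it may still be reachable elsewhere in the latent space, which is why the count of passing items is only a lower bound on availability.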

II-D2 Sufficient Condition for Recourse

As user recourse inherits the same computational problems as described for availability, DeanRR20 continue with the sampling perspective to test feasibility. This yields a lower bound on the amount of recourse available to a user, based on their specific rating history and the allowable actions. They show that items which are aligned-reachable are also reachable by users, so that item availability implies recourse for any user with control over at least $d$ ratings whose corresponding item factors are linearly independent.

DeanRR20 conclude that user recourse follows from the ability to modify ratings for a set of diverse items, and that immutable ratings ensure the reachability of some items, potentially at the expense of others.

II-E Experimental Demonstration

Fig. 3: Discrepancy of availability on the LastFM dataset, comparing the available items of any baseline model (y-axis) with the available items of any model in the $\epsilon$-level set (x-axis) for varying latent space size $d$. Panels (a) to (f) each show one Top-N recommender system. The content of each cell is the number of elements of the smaller available set that are missing from the larger one. Final discrepancy (row-wise maximum) for the baseline model and the size of available items of the smaller set are highlighted in red and green.

DeanRR20 demonstrate how the proposed analyses can be used for auditing and interpreting characteristics of a matrix factorization model. They use the MovieLens 10M dataset Harper2016, a common benchmark for evaluating rating predictions, and the method described by RendleZK19 in their recent work on baselines for recommender systems. Their setup matched the one presented by RendleZK19 and reproduced the reported accuracies. Models of a variety of latent dimensions ranging from to were examined. Additionally, they conducted a similar set of experiments on the LastFM dataset Bertin-Mahieux2011, which yielded similar results. Those results were included in Appendix B of the original paper.

After performing the item-based audit described previously, DeanRR20 found that for larger values of as well as , the number of items that are aligned-reachable is significantly higher. Baseline reachability is especially low for small values of . At the same time, they found that unavailable items have systematically lower popularity, while popularity alone does not determine reachability.

II-E1 System Recourse for Users

For this experiment, DeanRR20 used continuous ratings instead of ratings rounded to the nearest step, since these were easier for the model to work with. In addition, only the most rated items were considered, which significantly reduces the computation time. This produces a small overestimation because of popularity bias but was still considered a good approximation.

When allowing history edits, DeanRR20 recognised two distinct phases in the recourse curve as the number of items in the history grows. For all values of $d$, there was first a sharp increase in the available recourse, which then leveled off. This can be explained by two factors: the sharp increase is determined by the limiting effect of the projection, which rises continuously, while the leveling off is determined by the baseline item-reachability. DeanRR20 note that while higher complexity, and therefore a larger number of latent dimensions, provides a larger amount of recourse, lower complexity lets the model reach its maximum faster.

When considering fixed ratings and allowing no history edits, DeanRR20 drew the following conclusions:

  1. The amount of recourse is actually bigger for a lower number of latent dimensions

  2. For a small history length, more recourse is available

  3. When rating random items, the available recourse is bigger, compared to rating only the recommended items

This does not contradict the previous results, as using a fixed history eliminates the availability of additional recourse and only concerns the anchor points. It is worth noting that the advantages of additional recourse seem to outweigh the disadvantages of the anchor points for large histories and more latent dimensions.

II-E2 Recourse Difficulty

Finally, DeanRR20 examined the cost of recourse over all users for a single item. A Top- recommender system was used to once again reduce the computational burden of computing the exact set . The cost was posed as the size of the difference between the user input and the predicted ratings. In two trial runs, the cost of recourse was determined for a set of random items as well as for the set of the highest rated items. DeanRR20 arrived at the following findings:

  1. The cost does not increase, but the amount of recourse is lower for a larger number of latent dimensions.

  2. The cost was actually lower for random items than for the highest rated items.

The experimental demonstration ended with the conclusion that future work should more carefully examine methods for constructing recommended sets that trade off predicted ratings against measures like diversity under the lens of user recourse.

III Drawbacks of the Dean et al. Model

Fig. 4: Percentage of availability discrepancy on (a) the MovieLens dataset (left) and (b) the LastFM dataset (right), comparing the available items of any baseline model (y-axis) with the available items of any model in the $\epsilon$-level set (x-axis) for varying Top-N recommender systems and latent space size $d$, denoted as . The content of each cell is based on the elements of the smaller available set that are missing from the larger one. Comparisons where the set size difference is less than are marked with borders.

The method proposed by DeanRR20 has a number of drawbacks, which are described briefly and left for further research. These drawbacks include the user cold start problem, popularity biases, filter bubbles and human-model interactions.

III-1 User Cold Start

While proposing onboarding sets to provide additional recourse, rather than focusing on model accuracy, DeanRR20 do not provide any demonstration of how the onboarding set plays a potential role in availability and recourse.

III-2 Popularity Bias

DeanRR20 find that the differences in availability of items do, to some extent, relate to their general popularity or unpopularity in the training data. According to Steck11, this seems to be a general phenomenon in recommender systems, which among other things reproduces undesirable demographic biases. DeanRR20 provide no further explanation of how to combat this problem.

III-3 Filter Bubbles

The model used by DeanRR20 solely provides a reachability criterion, which is based on the possibility of a user reaching a specific item. However, it does not predict whether a specific user in a real-world scenario will actually reach that item. The possibility of recourse, for example, does not fix the problem of filter bubbles: it merely provides the means to escape them, but the user also has to use these means. Therefore, further research is warranted to examine whether the proposed cost function models actual user behaviour or whether it needs to be fundamentally changed so as not to give a false appearance of fairness.

III-4 Human-Model Interactions

Lastly, future work could examine the interactions between users and models, as the models evolve over time while user behaviour may in turn be influenced by the model. DeanRR20 note that this path would likely lead towards understanding phenomena like filter bubbles.

IV Model by MarxCU19

Fig. 5: Total percentage of ambiguous items (y-axis) when comparing the models with different latent space dimension $d$ in the $\epsilon$-level set on their set of available items, for (a) the MovieLens dataset and (b) the LastFM dataset. We split between different Top-N recommender systems (x-axis).

Shifting our focus, the key concept in the work by MarxCU19 is multiplicity. If there are at least two competing models within an error tolerance $\epsilon$, the respective problem exhibits multiplicity MarxCU19. MarxCU19 refer to the set of competing models as the $\epsilon$-level set and extend the term to predictive multiplicity, where two models within the $\epsilon$-level set assign different predictions to an instance in the training data.

Further, MarxCU19 propose formal measures for the possibility of multiple competing models (predictive multiplicity) in the form of discrepancy and ambiguity and define them as:

Definition 3 (discrepancy)

Maximum number of conflicting predictions between a baseline model and any good model. If the discrepancy is small, near-optimal models (in the $\epsilon$-level set) output similar predictions and vice versa.

Definition 4 (ambiguity)

Number of individuals that can be assigned a different prediction by at least one model in the $\epsilon$-level set.

While discrepancy bounds the number of predictions that can change when the baseline is replaced by a single competing model, ambiguity counts every instance whose prediction depends on the choice among the set of good models. Further, they provide integer programming tools to compute these measures for linear classification problems, taking into account all possible models within a certain performance margin MarxCU19.
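For a finite list of competing models, both measures can be computed directly from their predictions. The sketch below does this by enumeration; it is not the integer programming formulation of MarxCU19, which searches over all models within the performance margin, and the function names are hypothetical.

```python
from typing import List, Sequence


def conflicts(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of instances on which two models predict different labels."""
    return sum(1 for x, y in zip(a, b) if x != y)


def discrepancy(baseline: Sequence[int], competitors: List[Sequence[int]]) -> int:
    """Largest number of conflicts between the baseline and any competitor."""
    return max((conflicts(baseline, c) for c in competitors), default=0)


def ambiguity(baseline: Sequence[int], competitors: List[Sequence[int]]) -> int:
    """Number of instances whose prediction is changed by at least one competitor."""
    return sum(
        1
        for k, label in enumerate(baseline)
        if any(c[k] != label for c in competitors)
    )
```

Since every instance that conflicts with one competitor is also counted as ambiguous, ambiguity is never smaller than discrepancy for the same baseline and set of competitors.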

For their experiments, they construct binary classification problems based on the ProPublica COMPAS dataset machinebias, the Felony Defendants in Large Urban Counties dataset pretrial and the Recidivism of Prisoners Released in 1994 dataset recidivism. The results by MarxCU19 show that for example on the COMPAS dataset a competing model with only less accuracy can disagree on over of the predictions (discrepancy) and of predictions are vulnerable to model selection (ambiguity). Further, they try to raise awareness that discrepancy and ambiguity should not be overlooked when deploying classification systems in highly influential real-world scenarios and should be reported similarly to model statistics such as the test error MarxCU19.

V Experimental Setup

For our experiments, we want to apply the definition of predictive multiplicity and evaluate the proposed measures (discrepancy and ambiguity) from MarxCU19 on the recommendation system of DeanRR20. Therefore, we first need to determine a relevant $\epsilon$-level set. We used the trained models from DeanRR20 with the same settings as in the paper for both the MovieLens 10M dataset and the LastFM 1K dataset. Hence, we use the same values for the number of latent dimensions while setting the sizes of the recommender set to . As shown in Figure 1, when deploying an on the Root-Mean-Squared-Error (RMSE), all our trained models with different latent dimensions lie within the $\epsilon$-level set. According to MarxCU19, an $\epsilon$-value of represents a conservative default for accuracy-based measures, and hence an RMSE with is reasonable in our application scenario. We also verified this by looking at the absolute changes in prediction on the test set in Figure 6. These are marginal as well, which supports our choice of $\epsilon$-level set.
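Selecting the level set from a collection of trained models can be sketched as follows. The example assumes an absolute tolerance on the test RMSE relative to the best model; the model names and error values used in the usage note are made up for illustration and are not results from our experiments.

```python
from typing import Dict, List


def epsilon_level_set(test_rmse: Dict[str, float], epsilon: float) -> List[str]:
    """Return the models whose test RMSE is within epsilon of the best model."""
    if not test_rmse:
        return []
    best = min(test_rmse.values())
    return sorted(name for name, err in test_rmse.items() if err <= best + epsilon)
```

For instance, calling `epsilon_level_set({"a": 0.80, "b": 0.805, "c": 0.90}, 0.01)` keeps the hypothetical models `a` and `b` and drops `c`.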

Based on this set of competing models, we can evaluate the discrepancy in availability on the test set by choosing a baseline classifier from the $\epsilon$-level set and comparing it to all other models in this set. This way, we can construct comparison matrices for the discrepancy of availability with varying $d$ of size which for us are matrices of size for each dataset. Additionally, by also considering $N$ as a model parameter, we can construct comparison matrices of size for the discrepancy of availability with varying $d$ and $N$, which for us is matrix for each dataset of size . By taking the row-wise maximum value, we evaluate the final discrepancy for each baseline model choice. With this structure of our experiments, the constructed comparison matrices are symmetric and have zero diagonal entries, since the set of conflicting elements between a model and itself is always empty.

Note that DeanRR20 also compare their definitions of recourse and availability for the different models in our $\epsilon$-level set (cf. Figures 3-7 and 10-13 in DeanRR20); however, they do not include an analysis of the conflicting elements on this set. Hence, we base our analysis of availability on the conflicting elements when comparing any baseline model and its available set with any model of the $\epsilon$-level set and its available set. We define the set of conflicting elements to be the complement of the intersection of the two available sets, i.e. all items of one set that are not included in the other and vice versa. We want to stress that we consider the set difference an especially relevant metric, since it highlights items that are available in a smaller set but not in the broader set, which has already progressed further. Therefore, we ensure that the baseline model has a smaller set size than the compared model of the $\epsilon$-level set. This yields the items which are available in the smaller available set but not in the larger one.
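The construction of a comparison matrix from the available sets can be sketched as below. The model labels used as dictionary keys and the function names are hypothetical, and the available sets are assumed to be given as sets of item ids.

```python
from typing import Dict, List, Set


def conflicting_items(a: Set[int], b: Set[int]) -> Set[int]:
    """Items of the smaller available set that are missing from the larger one."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    return small - large


def comparison_matrix(available: Dict[str, Set[int]]) -> List[List[int]]:
    """Pairwise number of conflicting items, rows and columns sorted by label."""
    names = sorted(available)
    return [
        [len(conflicting_items(available[r], available[c])) for c in names]
        for r in names
    ]


def final_discrepancy(matrix: List[List[int]]) -> List[int]:
    """Row-wise maximum: the discrepancy for each choice of baseline model."""
    return [max(row) for row in matrix]
```

Because the smaller set is always taken as the reference, the cell value does not depend on which of the two models is treated as the row, so the matrix is symmetric with a zero diagonal.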

Further, we analyse the ambiguity on the respective prediction items by comparing how often each item falls into a conflicting prediction set for each model of the $\epsilon$-level set. This allows us to compute an average percentage of ambiguous items (items that fall into at least one conflicting prediction set) for each element of the recommender set $N$. This yields average ambiguity percentages for the respective recommender system which, in our case, are values for each dataset.
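One way to compute such a percentage for a single recommender set size is sketched below. It assumes that the percentage is taken relative to all items that are available under at least one model, which is a choice made for this illustration; the function name and model labels are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set


def ambiguous_item_percentage(available: Dict[str, Set[int]]) -> float:
    """Percentage of items that fall into at least one conflicting set."""
    all_items = set().union(*available.values())
    if not all_items:
        return 0.0
    ambiguous: Set[int] = set()
    for a, b in combinations(available.values(), 2):
        # Conflicting set: items of the smaller set missing from the larger one.
        small, large = (a, b) if len(a) <= len(b) else (b, a)
        ambiguous |= small - large
    return 100.0 * len(ambiguous) / len(all_items)
```

Repeating this computation once per recommender set size gives one percentage per Top-N system.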

VI Results & Discussion

Fig. 6: Average absolute rating difference for the test set predictions when comparing any baseline model (y-axis) to all models in the $\epsilon$-level set (x-axis) on (a) the MovieLens dataset (left) and (b) the LastFM dataset (right). The model with the highest absolute difference to the baseline model (row-wise maximum) is highlighted in red.

We observe some high-level trends: for the same latent space size $d$ and any two Top-N recommender systems with $N_1 < N_2$, the available items of the Top-$N_1$ recommender system are a proper subset of those of the Top-$N_2$ recommender system. This observation partly motivated our experiments for Figures 2 and 3, where we compare the availability for different latent space sizes for consistent Top-N recommender systems. The results illustrate that, generally speaking, we observe contradicting availability sets when comparing two latent space sizes that are rather small (e.g.  , or ). However, for the majority of the Top-N recommender systems (mostly all but ), contradicting availability values exclusively appear in the upper left quadrant for both the MovieLens and the LastFM dataset. Even if contradicting available items appear, they tend to be comparably few with regard to the overall number of available items. This leaves us with the conclusion that, when keeping the Top-N recommendation system constant, different models have very little discrepancy in availability.

Further, when looking at the ambiguity of the conflicting available items illustrated by Figure 5, we observe a similar pattern. Overall, the number of ambiguous items for the models in our $\epsilon$-level set with varying latent dimension size tends to be very low. For the MovieLens dataset this value ranges from roughly while for the LastFM dataset the range is from . Since this value is so low, it is unlikely to have a meaningful overall impact and can therefore be neglected.

When we expand our $\epsilon$-level set from only considering the latent space dimension as a model parameter to also considering the Top-N recommendation sets, we can construct a heatmap similar to those seen previously. This new heatmap has shape and is illustrated in Figure 4. Here, we observe larger values with a maximum for the MovieLens dataset of and for the LastFM dataset of . The maximum values respectively occur when comparing (baseline) to and (baseline) to . For the MovieLens dataset, of pairings produce a set difference above , while produce a set difference above with a maximum discrepancy of . For the LastFM dataset, of pairings produce a set difference above , while produce a set difference above with a maximum discrepancy of . There seems to be no direct correlation between the discrepancy and the difference in set size.

VII Outlook

As one can see, there are many aspects one can analyse when looking at a recommendation system. It is impossible to cover all such degrees of freedom, and hence our analysis is mostly biased towards availability (and therefore inherently recourse) in recommendation systems, since we identify this as a key challenge and a critical point across different models. In the future, additional analysis can compare the discrepancy and ambiguity of recommendations for the user cold start, the cost of recourse, or recommendations based on a fixed history on models with different latent space size $d$.