1 Introduction
Rankings are ubiquitous across a large variety of online services, from search engines, online shops and recommender systems to social media and online dating. Consequently, it has become much easier to find information, products, jobs, opinions or even potential romantic partners; rankings have undoubtedly increased the utility users obtain from online services. However, rankings have also been blamed for playing a major role in an increasing number of missteps, particularly in the context of social and information systems, from fueling the spread of misinformation vosoughi2018spread , increasing polarization nyt2016polarization and degrading social discourse guardian17 to undermining democracy nyt17democracy . As the decisions taken by ranking models become more consequential to individuals and society, one must ask: what went wrong in these cases?
Current ranking models are typically designed to optimize immediate measures of utility, which often reward instant gratification. For example, one of the guiding technical principles behind the optimization of ranking models in the information retrieval literature, the Probability Ranking Principle (PRP) robertson1977probability , states that the optimal ranking should order items in terms of probability of relevance to the user. However, such measures of immediate utility do not account for long-term consequences. As a result, ranking models often have an unexpected cost to the long-term welfare. In this work, our goal is to design consequential ranking models which understand the long-term consequences of their proposed rankings.
More specifically, we focus on a problem setting that fits a variety of real-world applications, including those mentioned previously: at every time step, a ranking model receives a set of items and ranks these items on the basis of a measure of immediate utility and a set of features. [Footnote 1: Our methodology does not need to observe the immediate utility on which the ranking model bases its rankings.] Items may appear over time and be present at several time steps. Moreover, their corresponding features may change over time and these changes may be due to the influence of previous rankings. For example, the number of likes, votes, or comments (the features) that a post (the item) published by a user receives in social media depends largely on its ranking position gomez2014quantifying ; hodas2012visibility ; kang2015vip ; lerman2014leveraging . Moreover, for every sequence of rankings, there is an associated long-term (cost to the) welfare, whose specific definition is application dependent. For example, in information integrity, the welfare may be defined as the average number of posts including misinformation at the top of the rankings over time. Then, our goal is to construct consequential ranking models that optimally trade off fidelity to the original ranking model maximizing immediate utility and long-term welfare.
Our contributions. In this paper, we first introduce a joint representation of ranking models and user dynamics using Markov decision processes (MDPs), which is particularly well suited to faithfully characterize the above problem setting. [Footnote 2: In this work, for ease of exposition, we assume all users get exposed to the same rankings, as in, e.g., Reddit. However, our methodology can be readily extended to the scenario in which each user gets exposed to a different ranking, as in, e.g., Twitter.] Then, we show that this representation greatly simplifies the construction of consequential ranking models that trade off fidelity to the rankings provided by a model maximizing immediate utility and the long-term welfare. More specifically, we apply Bellman’s principle of optimality and show that it is possible to derive an analytical expression for the optimal consequential ranking model in terms of the original ranking model and the cost to the welfare. This means that we can obtain optimal consequential rankings just by applying weighted sampling on the rankings provided by the original ranking model using the (exponentiated) cost to welfare. However, in practice, such naive sampling will be inefficient, especially in the presence of high-dimensional features. Therefore, we design a practical and efficient gradient-based algorithm to learn parameterized consequential ranking models that effectively approximate optimal ones. [Footnote 3: We will release an open-source implementation of our algorithm with the final version of the paper.]
Finally, we showcase our methodology using synthetic and real data gathered from Reddit. Our results show that consequential ranking models derived using our methodology provide ranks that may mitigate the spread of misinformation and improve the civility of online discussions without significant deviations from the rankings provided by models maximizing immediate utility measures.
Further related work. The work most closely related to ours is devoted to constructing either fair rankings asudehy2017designing ; biega2018equity ; celis2017ranking ; singh2017equality ; singh2018fairness ; singh2019policy ; yang2017measuring ; zehlike2017fa or diverse rankings carbonell1998use ; clarke2008novelty ; radlinski2009redundancy ; radlinski2008learning . However, this line of research defines fairness and diversity in terms of exposure allocation on an individual ranking. In contrast, we consider sequences of rankings, we characterize the consequences of these rankings on the user dynamics, and focus on improving the welfare in the long term. Other related work includes recent approaches to address the learning-to-rank problem from the perspective of reinforcement learning singh2019policy ; feng2018greedy ; wei2017reinforcement . However, these approaches consider the construction of a single optimal ranking as an MDP in which the state is defined at the level of an item/position. In contrast, our MDP considers a sequence of rankings. Finally, there is a paucity of work on the delayed impact of machine learning algorithms hucheng2018 ; liu18delayed ; mouzannar2019fair and recommender systems sinha2016deconvolving ; schnabel2018short . However, the former has focused on classification tasks and considered simple one-step feedback models and the latter on the tradeoff between exploration and exploitation.
2 Rankings and User Dynamics
In this section, we first introduce our joint representation of rankings and user dynamics, starting from the problem setting it is designed for. Then, we formally define consequential rankings as the solution to a particular reinforcement learning problem.
Problem setting. Let be a particular ranking model^{4}^{4}4Unless stated otherwise, the notation does not imply that is a parameter within a class of ranking models, but it just serves as a placeholder to identify a specific ranking model. (or, equivalently, ranking algorithm). At each time step , the ranking model receives a set of items and these items are characterized by a feature matrix , where the th row contains the feature values for item and is the number of features per item. Here, we assume that items may appear over time and be present at several time steps. Moreover, their corresponding feature values may change over time. For example, think of the number of likes, votes or comments that a post receives in social media—they are often used as features to decide the ranking of the post and they change over time.
Then, the ranking model provides a ranking of the items on the basis of their set of features and a (hidden) measure of immediate utility. A ranking is defined as a permutation of the rank indices, , the model ranks item in position , where highest rank is position . In addition, we also define the ordering of a ranking as a permutation of the item indices, , the model ranks item in position . The ranking and orderings are related by and . Here, we assume that the provided ranking at time step may influence the feature matrix at time step . This is in agreement with recent empirical studies gomez2014quantifying ; hodas2012visibility ; kang2015vip ; lerman2014leveraging , which have shown that the posts (the items) that are ranked highly receive a higher number of likes, comments or shares (the features).
Finally, given a trajectory of feature matrices and rankings there is an additive cost to the welfare, , where is an arbitrary immediate cost whose specific definition is application dependent. For example, in information integrity, the welfare may be defined as the average number of posts including misinformation at the top of the rankings over time. In the remainder, we will say that a trajectory is induced by a ranking model .
Joint representation of rankings and user dynamics. The above problem setting naturally fits the following joint representation of rankings and user dynamics using Markov decision processes (MDPs) sutton2018reinforcement , which also has an intuitive causal interpretation:
(1) 
where the first term represents the particular choice of ranking model, the second term represents the distribution for the user dynamics, which determines the feature matrix at any given time step, and the initial feature matrix and ranking are given. [Footnote 5: In our work, we consider probabilistic ranking models, which assign a probability to each ranking. It would be interesting to extend our methodology to deterministic ranking models.] Moreover, the above representation makes two major assumptions, which are also illustrated in Figure 3 in Appendix A:


To provide a ranking for a set of items at time step , the ranking model only uses the feature matrix corresponding to that set of items. More formally, given the feature matrix , the ranking provided by the ranking model is conditionally independent of previous feature matrices , . In most practical scenarios, ranking models optimizing for immediate utility satisfy this assumption.

The dynamics of the feature matrices, which characterize the user dynamics, are Markovian. That is, given the feature matrix and ranking , the feature matrix is conditionally independent of previous feature matrices and rankings , . This is a natural assumption taken in state-of-the-art models (e.g., Hawkes processes with exponential kernels de2016learning ; du2016recurrent ; behzad2019pnas ).
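The two assumptions above can be sketched as a simple trajectory sampler. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the names `rollout`, `toy_rank_model` and `toy_dynamics` are hypothetical, a ranking is represented by its ordering (item indices, top position first), and the toy dynamics simply give one extra share to the top-ranked item.

```python
import numpy as np

def rollout(X0, y0, rank_model, dynamics, horizon, rng):
    """Sample a trajectory [(X_0, y_0), ..., (X_T, y_T)] of features and orderings."""
    trajectory = [(X0, y0)]
    X, y = X0, y0
    for _ in range(horizon):
        # Markovian user dynamics: the next features depend only on (X_t, y_t).
        X = dynamics(X, y, rng)
        # The ranking model only looks at the current feature matrix.
        y = rank_model(X, rng)
        trajectory.append((X, y))
    return trajectory

def toy_rank_model(X, rng):
    # Hypothetical model: sort items by (first feature + Gumbel noise), best first.
    noisy_scores = X[:, 0] + rng.gumbel(size=X.shape[0])
    return np.argsort(-noisy_scores)

def toy_dynamics(X, y, rng):
    # Hypothetical dynamics: the item ranked at the top gains one share.
    X_next = X.copy()
    X_next[y[0], 0] += 1.0
    return X_next

rng = np.random.default_rng(0)
X0 = np.zeros((4, 1))   # 4 items, 1 feature (number of shares)
y0 = np.arange(4)       # initial ordering
traj = rollout(X0, y0, toy_rank_model, toy_dynamics, horizon=5, rng=rng)
```

Any sampler for the ranking model and the user dynamics with these call signatures could be plugged in; the toy functions only serve to make the feedback loop between rankings and features visible.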
— Ranking model: Our approach is agnostic to the particular choice of ranking model: it provides a methodology to derive consequential rankings that are optimal under a ranking model. In our experiments, we showcase our methodology for one well-known ranking model, the Plackett-Luce (PL) ranking model luce1977choice ; plackett1975analysis , which is best described in terms of the orderings of the rankings. More specifically, under the PL model, at each time step , the ranking , with ordering , is sampled from a distribution
(2) 
where is a given parameter. In the above, we can think of as a quality score associated to the item , which controls the probability that this item is ranked at the top: the higher the quality score, the higher the probability that the item is ranked first. In practice, the quality score of the above PL ranking model may be computed using a complex nonlinear function tran2016choice , e.g., a neural network.
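As a rough illustration of how a PL model assigns probabilities to orderings, the sketch below samples an ordering position by position and evaluates its log-probability. It assumes each item already has a real-valued quality score (how the score is computed from the features is left open), and the function names are hypothetical.

```python
import itertools
import numpy as np

def pl_sample_ordering(scores, rng):
    """Sample an ordering (item indices, top position first) from a PL model."""
    scores = np.asarray(scores, dtype=float)
    remaining = list(range(len(scores)))
    ordering = []
    while remaining:
        s = scores[remaining]
        p = np.exp(s - s.max())          # softmax over the items not yet placed
        p /= p.sum()
        pick = int(rng.choice(len(remaining), p=p))
        ordering.append(remaining.pop(pick))
    return ordering

def pl_log_prob(scores, ordering):
    """Log-probability of a full ordering under the PL model."""
    s = np.asarray(scores, dtype=float)[list(ordering)]
    log_p = 0.0
    for k in range(len(s)):
        tail = s[k:]                     # items competing for position k
        m = tail.max()
        log_p += s[k] - (m + np.log(np.exp(tail - m).sum()))
    return log_p

scores = [0.3, -1.0, 2.0]
rng = np.random.default_rng(0)
sampled = pl_sample_ordering(scores, rng)
# The probabilities of all 3! orderings should sum to one.
total = sum(np.exp(pl_log_prob(scores, o))
            for o in itertools.permutations(range(3)))
```

The item with the largest score is the most likely to be placed first, but every ordering keeps non-zero probability, which is what makes the model probabilistic.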
— User dynamics: Our approach only requires the ability to sample from an arbitrary model for the transition probability , which may be estimated using historical ranking and user data. Here, in contrast with the ranking model, the user dynamics are not something that one can decide upon; they are given.
Consequential rankings. Let be an existing ranking model that optimizes some hidden immediate utility and a cost to the welfare. Then, we construct a consequential ranking model , which optimally trades off the fidelity to the original ranking model and the cost to the long-term welfare, by solving the following optimization problem:
(3) 
with
(4) 
where the expectation is taken over all the trajectories of feature matrices and rankings of length under the ranking model and is a given parameter which controls the trade-off between the fidelity to the original ranking model and the long-term cost to the welfare. In Eq. 4, the first term penalizes trajectories that achieve a large cost to the welfare and the second term penalizes ranking models whose induced trajectories differ more from those that the original model would induce, where the terms associated to the user dynamics cancel. Moreover, the choice of trajectory length will depend on the definition of long term: accounting for longer-term consequences to the welfare will require larger trajectory lengths .
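The cancellation of the user dynamics can be made explicit with a short worked equation. The notation below is an assumption introduced only for this illustration: $\tau = \{(X_t, y_t)\}_{t=0}^{T}$ denotes a trajectory, $\pi$ and $\pi_0$ the consequential and original ranking models, and $p$ the user dynamics, with each transition factorizing into a ranking term and a dynamics term as described in Section 2.

```latex
\begin{align*}
\log \frac{P_{\pi}(\tau)}{P_{\pi_0}(\tau)}
  &= \log \prod_{t=0}^{T-1}
     \frac{\pi(y_{t+1} \mid X_{t+1}) \, p(X_{t+1} \mid X_t, y_t)}
          {\pi_0(y_{t+1} \mid X_{t+1}) \, p(X_{t+1} \mid X_t, y_t)} \\
  &= \sum_{t=0}^{T-1}
     \log \frac{\pi(y_{t+1} \mid X_{t+1})}{\pi_0(y_{t+1} \mid X_{t+1})}.
\end{align*}
```

Under this reading, the fidelity term only requires evaluating the two ranking models on the sampled trajectory, and never the transition probabilities of the user dynamics.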
Finally, note that our measure of fidelity has a natural interpretation in terms of the Kullback-Leibler (KL) divergence kullback1951information , which has been extensively used as a distance measure between probability distributions, leading to the formulation of reinforcement learning as probabilistic inference levine2018reinforcement ; kappen2012optimal ; ziebartICML . More specifically, we can write the expectation of the second term as the KL divergence between the original and the consequential ranking model, , .
3 Building Consequential Rankings
In this section, we tackle the optimization problem defined by Eq. 3 from the perspective of reinforcement learning and show that the optimal consequential ranking model can be expressed in terms of the original ranking model.
We can first break the above problem into small recursive subproblems using Bellman’s principle of optimality bertsekas . This readily follows from the fact that, under the representation introduced in Section 2, the ranking model and the user dynamics are a Markov decision process (MDP). More specifically, Bellman’s principle tells us that the optimal ranking model should satisfy the following recursive equation, which is called the Bellman optimality equation:
(5) 
with . The function is called the value function and the function is called immediate loss. Moreover, in our problem, it can be readily shown that the immediate loss adopts the following form:
Within the loss function, the first term penalizes the immediate cost to the welfare and the second term penalizes consequential ranking models whose induced transition probability differs from that induced by the original ranking model.
In general, Bellman optimality equations are difficult to solve. However, the structure of our problem will help us find an analytical solution. Inspired by Todorov todorov2009efficient , we proceed as follows. Let . Then, we can rewrite the minimization in the RHS of Eq. 5 as
where we have dropped and because they do not depend on . Then, we can use Eq. 1 to factorize both transition probabilities in the numerator and the denominator within the logarithm and, as a result, the terms cancel and we obtain:
The above equation resembles a KL divergence; however, note that the fraction within the logarithm does not depend on and the denominator is not normalized to one. If we multiply and divide the fraction by the following normalization term:
(6) 
we obtain:
In the above equation, note that the first term does not depend on and the second term achieves its global minimum of zero if the numerator and the denominator are equal. Thus, the optimal consequential ranking model is just given by:
(7) 
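To make the structure of this solution concrete, the sketch below works through a single-step special case in which there is no future value term, so the optimal model reduces to the original model reweighted by the exponentiated cost. Everything here is an illustrative assumption rather than the paper's setup: the three items, their scores and risks, the cost (risk of the top-ranked item), and the convention that the trade-off parameter `lam` multiplies the cost.

```python
import itertools
import numpy as np

def pl_prob(scores, ordering):
    """Probability of an ordering under a PL model with the given item scores."""
    s = np.asarray(scores, dtype=float)[list(ordering)]
    p = 1.0
    for k in range(len(s)):
        p *= np.exp(s[k]) / np.exp(s[k:]).sum()
    return p

def objective(pi, pi0, cost, lam):
    """Expected (lam * cost + log-ratio) under pi, i.e. cost plus KL(pi || pi0)."""
    return float(np.sum(pi * (lam * cost + np.log(pi / pi0))))

scores = np.array([2.0, 1.0, 0.0])   # hypothetical quality scores
risk = np.array([0.9, 0.1, 0.1])     # hypothetical misinformation risk per item
lam = 3.0                            # hypothetical trade-off parameter

orderings = list(itertools.permutations(range(3)))
pi0 = np.array([pl_prob(scores, o) for o in orderings])   # original model
cost = np.array([risk[o[0]] for o in orderings])          # risk of the top item

# Reweight the original model by the exponentiated cost and normalize.
weights = pi0 * np.exp(-lam * cost)
Z = weights.sum()
pi_star = weights / Z

top0 = np.array([o[0] == 0 for o in orderings])
print("P(item 0 on top) original:     ", pi0[top0].sum())
print("P(item 0 on top) consequential:", pi_star[top0].sum())
```

In this toy case the objective evaluated at `pi_star` equals `-log(Z)`, which is never larger than its value at the original model, and the high-risk item is pushed away from the top position while the rest of the ranking distribution stays close to the original.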
Finally, if we substitute back into the Bellman equation, given by Eq. 5, and write it in terms of , we can also find the function using the following recursive expression:
with . The above result has an important implication. It means that we can use sampling methods to obtain (unbiased) samples from the optimal consequential ranking, e.g., stratified sampling douc2005comparison , as shown in Appendix B. However, in practice, these sampling methods may be inefficient and have high variance if the original ranking model produces rankings that have very low probability under the optimal consequential ranking model. This will be especially problematic in the presence of high-dimensional feature vectors due to the curse of dimensionality. In the next section, we will present a practical method for approximating , which iteratively adapts a parameterized consequential ranking model using a stochastic gradient-based algorithm.
4 A Stochastic Gradient-Based Algorithm
In this section, our goal is to find a consequential ranking model within a class of parameterized ranking models that approximates well the optimal consequential ranking model , given by Eq. 7, that minimizes the objective function in Eq. 3, , .
To this aim, we introduce a general gradient-based algorithm, which only requires the class of parameterized ranking models to be differentiable. In particular, we resort to stochastic gradient descent (SGD) (kiefer1952stochastic, ), , , where is the learning rate at step . Here, it may seem challenging to compute a finite sample estimate of the gradient of the objective function since the derivative is taken with respect to the parameters of the ranking model , which we are trying to learn. However, we can overcome this challenge using the log-derivative trick as in (williams1992simple, ), which allows us to write the gradient as:
(8) 
where is often referred to as the score function (Hyvarinen05:ScoreMatching, ). The overall procedure is summarized in Algorithm 1, where Minibatch samples a minibatch of size from and InitializeRankingModel initializes the parameters of the ranking model.
Remarks. Note that, to compute an empirical estimate of the gradient in Eq. 8, we only need to be able to sample from the user dynamics , since the explicit dependence cancels out within , as pointed out in Section 2. Moreover, depending on the choice of parameterized family of ranking models, one may be able to compute the score functions analytically. In our experiments, the class of Plackett-Luce (PL) ranking models allows for that, ,
where the second term within the logarithm in the last equation is the derivative of the log-sum-exp function, whose analytical expression can be found elsewhere. Finally, if we think of the parameterized ranking model as a policy, our algorithm resembles policy gradient algorithms used in the reinforcement learning literature sutton2018reinforcement . This connection opens up the possibility of using variance reduction techniques used in policy gradient to improve the empirical estimation of the gradient zhao2011analysis .
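The score function and the resulting gradient estimate can be sketched for a PL model with linear quality scores. This is a minimal single-step illustration under several assumptions that are not taken from the paper: scores are computed as `X @ theta`, the trade-off parameter `lam` multiplies the cost, orderings are sampled with the Gumbel trick, and all function names are hypothetical. A multi-step version would sum losses and score functions along each sampled trajectory.

```python
import numpy as np

def pl_log_prob(theta, X, ordering):
    """Log-probability of an ordering under a PL model with scores X @ theta."""
    s = (X @ theta)[list(ordering)]
    return sum(s[k] - np.log(np.exp(s[k:]).sum()) for k in range(len(s)))

def pl_score_function(theta, X, ordering):
    """Analytical gradient of pl_log_prob with respect to theta."""
    Xo = X[list(ordering)]
    s = Xo @ theta
    grad = np.zeros_like(theta, dtype=float)
    for k in range(len(s)):
        w = np.exp(s[k:] - s[k:].max())
        w /= w.sum()                    # softmax = derivative of log-sum-exp
        grad += Xo[k] - w @ Xo[k:]
    return grad

def gradient_estimate(theta, theta0, X, cost_fn, lam, n_samples, rng):
    """Log-derivative-trick estimate of the gradient of E[lam * cost + log-ratio]."""
    g = np.zeros_like(theta, dtype=float)
    for _ in range(n_samples):
        # Sorting scores perturbed by Gumbel noise yields a PL sample.
        o = np.argsort(-(X @ theta + rng.gumbel(size=X.shape[0])))
        loss = lam * cost_fn(o) + pl_log_prob(theta, X, o) - pl_log_prob(theta0, X, o)
        g += loss * pl_score_function(theta, X, o)
    return g / n_samples

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 2))             # 4 items, 2 features
theta = np.array([0.5, -0.3])
ordering = [2, 0, 3, 1]

# Finite-difference check of the analytical score function.
eps = 1e-6
numeric = np.array([
    (pl_log_prob(theta + eps * e, X, ordering)
     - pl_log_prob(theta - eps * e, X, ordering)) / (2 * eps)
    for e in np.eye(2)
])
analytic = pl_score_function(theta, X, ordering)

# One SGD step on a hypothetical cost: feature 1 of the top-ranked item.
g = gradient_estimate(theta, theta.copy(), X, lambda o: X[o[0], 1], 1.0, 200, rng)
theta_new = theta - 0.1 * g
```

Because the expected score function is zero, subtracting a constant baseline from `loss` leaves the estimate unbiased, which is one of the variance reduction options mentioned above.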
5 Experiments on synthetic data
In this section, our goal is to compare the performance of ranking models that maximize an immediate measure of utility against consequential rankings in a problem setting with known user dynamics satisfying the Markov property.
Experimental setup. Each trajectory has length and, at each time step , the ranking model receives a set of posts and ranks them. Given a set of items and a ranking , we assume that the set of items is just a copy of the set of items where the posts at the bottom of the ranking are replaced by new posts. Each post has two features , where is the (static) probability that the post is misinformation and is the (dynamic) rate of shares at time , initialized with . There are high risk posts () and low risk posts () and a post is either high risk or low risk uniformly at random. Thus, whether the actual post is misinformation or not is a latent variable , which is unobserved by the ranking model. The instantaneous rate of shares for each item is given by , where is the time when the post was first ranked by the ranking model, is the virality, and a post is either viral () or non viral () uniformly at random. Here, note that rate of shares of an item increases if the item is ranked at the top, as observed in previous empirical studies.
The original ranking model aims to promote viral posts on the top positions of the ranking at each time , , , where is the ordering of the ranking . To this aim, it uses a Plackett-Luce (PL) model, given by Eq. 2, with . The consequential ranking models aim to trade off fidelity to the original model and the long-term presence of misinformation on the top positions of the rankings, , . Here, we consider two consequential ranking models: (i) an optimal consequential ranking model , which provides rankings by applying weighted sampling on the rankings provided by the original ranking model ; and, (ii) a Plackett-Luce (PL) consequential ranking model , which is learned using Algorithm 1 with iterations and as batch size. Moreover, we experiment with different values of the parameter , which controls the trade-off between fidelity to the original model and cost to the welfare and, for each experiment, we perform repetitions.
Quality of the rankings. We first compare the original ranking model and the optimal consequential ranking model in terms of three quality metrics: (i) the immediate utility ; (ii) the cost to welfare ; and, (iii) the true cost to welfare , defined as . Figure 1(ab) summarizes the results. Note that the original ranking model is just the optimal consequential ranking model with . The results show that: (i) the consequential ranking model achieves lower (true) cost to welfare than the original ranking model; and, (ii) as increases, the consequential ranking model is able to reduce the (true) cost to welfare without significantly decreasing the immediate utility. Next, we investigate whether the optimal consequential ranking model treats viral and non-viral posts differently. Intuitively, the ranking model should be more willing to change the rank of high-risk viral posts than that of high-risk non-viral posts. To confirm this intuition, we compute the fraction of estimated and true misinformation, and , in the top positions of the rankings over time for both viral () and non-viral () posts. Figure 1(c) summarizes the results, which show that, as we increase , the fraction of misinformation for viral posts on the top positions is lower than the fraction of misinformation for non-viral posts. Appendix C provides an in-depth comparison between optimal and PL consequential ranking models.
6 Experiments on real data
In this section, we compare the performance of ranking models that maximize an immediate measure of utility and parameterized consequential ranking models using data from Reddit, a popular social news aggregation platform. [Footnote 6: Due to the size of the dataset, we were unable to run the weighted sampling procedure needed to implement optimal consequential ranking models.] Before we proceed further, we would like to acknowledge that:


Since we do not have access to the ranking algorithm used by Reddit (or any other social media platform), our experiments are a proof of concept, which demonstrate the practical potential of our methodology on real data using a simple PL ranking model. Evaluating the efficacy of our methodology across a wide range of deployed ranking algorithms is left as future work.

Our experiments use observational data, i.e., they are open loop. As a result, the rankings only influence the immediate utility and the cost to welfare but not the user dynamics. However, our evaluation is likely to be conservative: consequential rankings may achieve a greater reduction of the cost to welfare in an interventional experiment. For example, in the context of uncivil behavior, there is empirical evidence that users are more likely to post uncivil comments if they have previously been exposed to uncivil comments cheng2017anyone ; muddiman2017news . Therefore, penalizing the rank of uncivil comments over time may prevent other users from engaging in uncivil behavior.
Dataset description and experimental setup. We used a publicly available Reddit dataset (https://archive.org/details/2015_reddit_comments_corpus), which contains (nearly) all publicly available comments to link submissions posted by Reddit users from October 2007 to May 2015. In our experiments, we focused on the link submissions to the subreddit Politics and selected the set of submissions with more than and less than comments. After these preprocessing steps, our dataset comprised submissions and comments.
In a first set of experiments, we focus on the civility of the comments in each submission, as measured by an uncivility score . In a second set of experiments, we focus on the misinformation spread by the comments of each submission, as measured by an unreliability score . Appendix D contains more details on the definition and estimation of both scores. In both sets of experiments, we use comments from submissions as training set for learning the parameterized consequential ranking models and comments from submissions as test set for evaluation.
Each submission corresponds to one trajectory whose length is just the number of comments in the submission, , each time step corresponds to the time at which a new comment was created. Then, at each time step , the ranking model ranks the latest set of comments . Moreover, each comment has three features , where is the time (in seconds) elapsed since the first comment was posted, is the uncivility score and is the unreliability score. At each time , the original ranking model aims to promote the most recent comment to the top of the ranking, , its immediate utility is defined as , where is the item at the top of the rank . To this aim, it uses a Plackett-Luce (PL) model, given by Eq. 2, with . For the first set of experiments, the consequential ranking models aim to trade off fidelity to the original ranking model and the civility of the comments on the top positions of the rankings, , . In a second set of experiments, the consequential ranking models aim to trade off fidelity to the original ranking model and the misinformation in the comments on the top positions of the rankings, , . In both cases, the consequential ranking models are Plackett-Luce (PL) models, which we learned using Algorithm 1 with iterations and as batch size, and experiment with .
Figure 2. Lines are averages and shaded areas are 95% confidence intervals over all submissions in the test set.
Quality of the rankings. We first compare the original ranking model and the consequential PL ranking models using the average immediate utility and the cost to welfare. Here, note that, in the first set of experiments, the cost to welfare measures the degree of uncivility of the top ranking positions while, in the second set of experiments, it measures the amount of misinformation. Figure 2 summarizes the results. Note that the original ranking model is just the optimal consequential ranking model with . The results show that the consequential PL ranking models are able to reduce the cost to the welfare up to % at a minimal cost in terms of immediate utility: they are able to reduce the degree of uncivility and the amount of misinformation at the top ranking positions without significant changes to the original reverse chronological ranking.
7 Conclusions
We have initiated the design of (parameterized) consequential ranking models that optimally trade off the fidelity to ranking models optimizing for immediate utility and the long-term welfare. Our work opens up many interesting avenues for future work. For example, we have considered probabilistic ranking models and a fidelity measure based on KL divergence. A natural next step is to augment our methodology to allow for deterministic ranking models and consider other fidelity measures between rankings. It would be very interesting to apply our framework to more sophisticated ranking models. Moreover, we have assumed that the models that optimize for immediate utility are optimal. However, they may be suboptimal in terms of the sum of immediate utility over time since it is unclear that current ranking models are designed to account for the consequences that their proposed rankings have on the feature matrices. It would be very interesting to account for this in future work. It would also be interesting to experiment with settings in which the cost to welfare cannot be factorized into items, e.g., information diversity. Finally, we have evaluated our algorithm using observational real data; however, it would be very revealing to perform an evaluation based on interventional experiments.
References
 [1] A. Asudeh, H. V. Jagadish, J. Stoyanovich, and G. Das. Designing fair ranking schemes. arXiv preprint arXiv:1712.09752, 2017.
 [2] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2nd edition, 2000.
 [3] A. J. Biega, K. P. Gummadi, and G. Weikum. Equity of attention: Amortizing individual fairness in rankings. arXiv preprint arXiv:1805.01788, 2018.
 [4] J. Carbonell and J. Goldstein. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336. ACM, 1998.
 [5] L. E. Celis, D. Straszak, and N. K. Vishnoi. Ranking with fairness constraints. arXiv preprint arXiv:1704.06840, 2017.
 [6] J. Cheng, M. Bernstein, C. Danescu-Niculescu-Mizil, and J. Leskovec. Anyone can become a troll: Causes of trolling behavior in online discussions. In Proceedings of the Conference on Computer-Supported Cooperative Work (CSCW), 2017.
 [7] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, and I. MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 659–666. ACM, 2008.
 [8] A. De, I. Valera, N. Ganguly, S. Bhattacharya, and M. G. Rodriguez. Learning and forecasting opinion dynamics in social networks. In Advances in Neural Information Processing Systems, pages 397–405, 2016.
 [9] R. Douc and O. Cappé. Comparison of resampling schemes for particle filtering. In ISPA 2005. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 2005., pages 64–69. IEEE, 2005.
 [10] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555–1564, 2016.
 [11] Y. Feng, J. Xu, Y. Lan, J. Guo, W. Zeng, and X. Cheng. From greedy selection to exploratory decision-making: Diverse ranking with policy-value networks. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 125–134, 2018.
 [12] M. Gomez-Rodriguez, K. P. Gummadi, and B. Schoelkopf. Quantifying information overload in social media and its impact on social contagions. In ICWSM, 2014.
 [13] J. Herrman. Inside Facebook’s political-media machine. New York Times, 2016.
 [14] N. Hodas and K. Lerman. How visibility and divided attention constrain social contagion. In SocialCom, 2012.
 [15] L. Hu and Y. Chen. A short-term intervention for long-term fairness in the labor market. In Proceedings of the Web Conference, 2018.
 [16] A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
 [17] J. Kang and K. Lerman. Vip: Incorporating human cognitive biases in a probabilistic model of retweeting. In ICSC, 2015.
 [18] H. J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Machine learning, 87(2):159–182, 2012.
 [19] J. Kiefer, J. Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
 [20] S. Kullback and R. A. Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
 [21] K. Lerman and T. Hogg. Leveraging position bias to improve peer recommendation. PloS one, 9(6):e98914, 2014.
 [22] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
 [23] L. T. Liu, S. Dean, E. Rolf, M. Simchowitz, and M. Hardt. Delayed impact of fair machine learning. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 [24] R. D. Luce. The choice axiom after twenty years. Journal of mathematical psychology, 15(3):215–233, 1977.
 [25] H. Mouzannar, M. I. Ohannessian, and N. Srebro. From fair decision making to social equality. In FAT, 2019.
 [26] A. Muddiman and N. J. Stroud. News values, cognitive biases, and partisan incivility in comment sections. Journal of communication, 67(4):586–609, 2017.
 [27] R. L. Plackett. The analysis of permutations. Applied Statistics, pages 193–202, 1975.
 [28] F. Radlinski, P. N. Bennett, B. Carterette, and T. Joachims. Redundancy, diversity and interdependent document relevance. In ACM SIGIR Forum, volume 43, pages 46–52. ACM, 2009.
 [29] F. Radlinski, R. Kleinberg, and T. Joachims. Learning diverse rankings with multi-armed bandits. In Proceedings of the 25th international conference on Machine learning, pages 784–791. ACM, 2008.
 [30] S. E. Robertson. The probability ranking principle in ir. Journal of documentation, 33(4):294–304, 1977.
 [31] T. Schnabel, P. N. Bennett, S. T. Dumais, and T. Joachims. Short-term satisfaction and long-term coverage: Understanding how users tolerate algorithmic exploration. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 513–521, 2018.
 [32] A. Singh and T. Joachims. Equality of opportunity in rankings. In Workshop on Prioritizing Online Content, Neural Information Processing Systems, 2017.
 [33] A. Singh and T. Joachims. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018.
 [34] A. Singh and T. Joachims. Policy learning for fairness in ranking. arXiv preprint arXiv:1902.04056, 2019.
 [35] A. Sinha, D. F. Gleich, and K. Ramani. Deconvolving feedback loops in recommender systems. In Advances in Neural Information Processing Systems, pages 3243–3251, 2016.
 [36] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [37] B. Tabibian, U. Upadhyay, A. De, A. Zarezade, B. Schoelkopf, and M. Gomez-Rodriguez. Optimizing human learning via spaced repetition optimization. Proceedings of the National Academy of Sciences, 2019.
 [38] E. Todorov. Efficient computation of optimal actions. Proceedings of the national academy of sciences, 106(28):11478–11483, 2009.
 [39] T. Tran, D. Phung, and S. Venkatesh. Choice by elimination via deep neural networks. arXiv preprint arXiv:1602.05285, 2016.
 [40] S. Vaidhyanathan. Facebook wins, democracy loses. New York Times, 2017.
 [41] S. Vosoughi, D. Roy, and S. Aral. The spread of true and false news online. Science, 2018.
 [42] Z. Wei, J. Xu, Y. Lan, J. Guo, and X. Cheng. Reinforcement learning to rank with markov decision process. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 945–948, 2017.
 [43] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
 [44] J. C. Wong. Former facebook executive: social media is ripping society apart. The Guardian, 2017.
 [45] K. Yang and J. Stoyanovich. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management. ACM, 2017.
 [46] M. Zehlike, F. Bonchi, C. Castillo, S. Hajian, M. Megahed, and R. Baeza-Yates. FA*IR: A fair top-k ranking algorithm. In CIKM, 2017.
 [47] T. Zhao, H. Hachiya, G. Niu, and M. Sugiyama. Analysis and improvement of policy gradient estimation. In Advances in Neural Information Processing Systems, pages 262–270, 2011.
 [48] B. D. Ziebart, J. A. Bagnell, and A. K. Dey. Modeling interaction via the principle of maximum causal entropy. In Proceedings of the 27th International Conference on Machine Learning, 2010.
Appendix A Graphical representation of rankings and user dynamics
Appendix B Algorithm for sampling from optimal consequential ranking
Algorithm 2 shows the stratified sampling method douc2005comparison for obtaining samples from the optimal consequential ranking model. Within the algorithm, Sample samples trajectories from and StratifiedSampler generates samples weighted by using stratified sampling.
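As an illustration of the stratified sampling step only, the following is a minimal sketch of generic stratified resampling from a list of nonnegative weights. It is not a reproduction of Algorithm 2: the function name `stratified_resample`, its arguments, and the use of Python's `random` module are assumptions made for this example, and the weights are treated as arbitrary inputs rather than the specific trajectory weights used in the paper.

```python
import random
from itertools import accumulate


def stratified_resample(weights, rng=None):
    """Draw len(weights) indices by stratified resampling.

    `weights` are nonnegative and need not be normalized. This is a generic
    sketch; how the weights are computed is left to the caller.
    """
    rng = rng or random.Random()
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive weight")

    # Cumulative distribution of the normalized weights.
    cdf = list(accumulate(w / total for w in weights))
    cdf[-1] = 1.0  # guard against floating-point round-off

    indices = []
    j = 0
    for k in range(n):
        # Exactly one uniform draw inside the k-th stratum [k/n, (k+1)/n).
        u = (k + rng.random()) / n
        # Advance to the first index whose cumulative weight exceeds u.
        while cdf[j] <= u:
            j += 1
        indices.append(j)
    return indices
```

Because each stratum contributes exactly one draw, the number of copies of an index deviates from its expected count by less than it would under plain multinomial resampling, which is the usual motivation for this scheme.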
Appendix C Additional experiments on synthetic data
First, we compare the performance of the optimal consequential ranking model, computed via weighted sampling, and the PL ranking model, learned using Algorithm 1, using the same quality metrics as in the previous section. Figure 4(a-b) summarizes the results. We observe that both the optimal and PL consequential ranking models achieve similar values of immediate utility over time. However, the optimal model has an advantage in terms of cost to the welfare, which becomes more pronounced as grows. These findings suggest that, the larger the value of , the more difficult it is to learn a PL model that effectively approximates the optimal model.
Next, we compare the scalability of both models in terms of the number of samples needed per ranking. Figure 4(c) summarizes the results, which show that, as grows, it becomes computationally prohibitive to generate optimal consequential rankings using weighted sampling due to the growing difference between and . This calls into question the practicality of weighted sampling for generating optimal consequential rankings.
Appendix D Uncivility and unreliability scores
To estimate the uncivility score for each comment, we apply sentiment analysis on the text of the comments using the software package Pattern (https://www.clips.uantwerpen.be/pages/pattern-en) and, for each comment, obtain two quantities: mood and polarity. The mood of a comment can take one of the following four values: indicative, imperative, conditional and subjunctive. The polarity of a comment is a real number in [-1, 1], where lower (higher) values indicate more negative (positive) words in the text. Then, we define the uncivility score of a comment as the absolute value of the polarity of the comment if the polarity is negative and the mood of the comment is indicative or imperative, and zero otherwise.

We estimate the unreliability score for each comment as the average unreliability score of the domains that appear in it, where domain scores are obtained by aggregating publicly available data from Politifact and Snopes (https://www.kaggle.com/arminehn/rumorcitation/version/3). More specifically, our combined dataset contains fact-checking information for unique URLs from unique domains. For each URL, it assigns a label that indicates the reliability of its content. We used these labels to assign a numerical unreliability score to each URL. More specifically, if the URL is labeled as “false”, “pantsfire”, “mfalse” or “legend”, we set the unreliability score to . If the URL is labeled as “true”, “mtrue” or “mostlytrue”, then we set the unreliability score to . And, if the URL is labeled using some other label value, we set the unreliability score to . We computed an unreliability score for each domain, which measures its level of (un)trustworthiness, by taking the average of the unreliability scores of individual URLs from the domain. Then, we define the unreliability score of a comment as the average unreliability score of the domain(s) of the link(s) used in the comment if the average is negative and zero otherwise.
Note also that, if a comment does not contain any links or the domain(s) of its link(s) do not appear in our dataset, we set the unreliability score for that comment to .
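The two scoring rules described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the function names are hypothetical, the polarity and mood are assumed to have been computed beforehand (for example with Pattern), and the per-URL unreliability scores are taken as caller-supplied numbers, since the numerical value assigned to each label is not reproduced here.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Moods for which a negative polarity counts as uncivil (from the text above).
UNCIVIL_MOODS = {"indicative", "imperative"}


def uncivility_score(polarity, mood):
    """|polarity| if polarity is negative and mood is indicative/imperative, else 0."""
    if polarity < 0 and mood in UNCIVIL_MOODS:
        return abs(polarity)
    return 0.0


def domain_unreliability(url_scores):
    """Average caller-supplied per-URL scores within each domain.

    `url_scores` maps full URLs (including the scheme, e.g. "http://...")
    to numerical unreliability scores. Stripping a leading "www." is an
    assumption of this sketch, not something stated in the text.
    """
    per_domain = defaultdict(list)
    for url, score in url_scores.items():
        domain = urlparse(url).netloc.lower()
        if domain.startswith("www."):
            domain = domain[4:]
        per_domain[domain].append(score)
    return {d: sum(s) / len(s) for d, s in per_domain.items()}
```

For instance, a comment with polarity -0.4 and imperative mood receives an uncivility score of 0.4, whereas the same polarity with conditional mood receives 0.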
Tables 1 and 2 provide a few examples of comments with a high uncivility score and domains with a high unreliability score.
Table 1: Example comments and their uncivility scores.

| Comment | Uncivility |
|---|---|
| If you once tell a lie, the truth is ever after your enemy. | 0.0 |
| I dream of a world where your bigoted stupid ideas don’t have the popular shield of faith. | 0.1 |
| Shut the f**k up and die already you POS warmongering profiteer. | 0.4 |
| Crap? Or pap. Take your pick. | 0.8 |
| i blame the evil KOCH BROTHERS! | 1.0 |
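
Table 2: Example domains and their misinformation (unreliability) scores.

| URL | Misinformation |
|---|---|
| aids.gov | 0.0 |
| pbs.org | 0.26 |
| breitbart.com | 0.56 |
| lifeisajoke.com | 1.0 |
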
Appendix E Additional experiments on real data
In the experiments on real data in the main paper, we have assumed that the original ranking model maximizes immediate utility. Moreover, we have taken for granted that there is a negative correlation between the immediate utility achieved by the consequential ranking model and the Kullback-Leibler (KL) divergence between its induced probability and the probability induced by the original ranking model. However, in practice, the original ranking model may not be optimal in terms of immediate utility. This may especially be the case if the original ranking model greedily maximizes immediate utility and does not take into account how its proposed rankings change the user dynamics. Figure 5 shows that, in our first set of experiments, there is indeed a negative correlation between the immediate utility and the KL divergence. In our second set of experiments, however, the correlation is positive. This suggests that the original ranking model in the second set of experiments is suboptimal.