Playtime Measurement with Survival Analysis

01/04/2017 · Markus Viljanen et al. · Turun yliopisto

Maximizing product use is a central goal of many businesses, which makes retention and monetization two central analytics metrics in games. Player retention may refer to various duration variables quantifying product use: total playtime or session playtime are popular research targets, and active playtime is well-suited for subscription games. Such research often has the goal of increasing player retention or, conversely, decreasing player churn. Survival analysis is a framework of powerful tools well suited for retention-type data. This paper contributes new methods to game analytics on how playtime can be analyzed using survival analysis without covariates. Survival and hazard estimates provide both a visual and an analytic interpretation of the playtime phenomena as a funnel-type nonparametric estimate. Metrics based on the survival curve can be used to aggregate this playtime information into a single statistic. Comparison of survival curves between cohorts provides a scientific AB-test. All these methods work on censored data and enable the computation of confidence intervals. This is especially important with time- and sample-limited data, which occur during game development. Throughout this paper, we illustrate the application of these methods to real-world game development problems using the mobile game Hipster Sheep.


1 Playtime in games

1.1 Why Playtime is Important

Game analytics is becoming increasingly important in understanding player behavior [1]. The widespread adoption of games, internet connectivity and new business models has resulted in data gathering on an unprecedented scale. With the increasing availability of data, researchers and industry alike are motivated to gain insight into the data through game analytics.

A focal point of analytics is player retention and churn [2]. Retention has been used in connection with many related measures and methods aiming to increase the length of product use [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. Better retention simply means players are engaged with the game for longer. Player churn, meaning players quitting the game either temporarily or permanently, decreases product use and is therefore the counterpart of retention. It has also been extensively researched [13, 14, 15, 16, 17, 18, 19, 20]. Retention metrics are popular because they are thought to reflect player enjoyment, and increased product use provides increased possibilities for monetization in free-to-play and subscription-based games. Game success may be attributed to the process of acquiring new users and retaining these users with effective monetization [2].

Of the actual measures that quantify retention in analytics, total playtime is a highly useful overall retention metric [21, 22] and session playtime [23, 24, 25, 26, 27, 28, 29, 30] can be utilized to measure in-game retention. Discrete metrics such as session count [3, 4], progression [9] and active periods [13, 14, 15, 16, 17, 18, 19, 20] have also been used in connection with retention.

In this paper we describe survival analysis methods to measure retention, with a focus on total playtime: this enables game developers, managers and publishers to better benchmark the game [1]. Survival analysis is well-suited for retention analysis because it was developed specifically for duration data. Other fields with very long histories where these statistical methods have become standard include demography [31], reliability engineering and the biomedical sciences [32].

The primary reason for its de facto status in many fields is that survival analysis excels with non-normal and censored data and does not necessitate parametric approaches [33]. Playtime exhibits non-normal characteristics: duration is positive, heavily skewed towards zero and often has a long tail. Since in the industry it is often unfeasible to wait until all users have churned to obtain their total playtimes, censoring is also present. Furthermore, user retention may not always completely follow popular parametric models [12], making model-free approaches attractive. The widespread recognition of survival analysis in fields with similar data and the demand for scientific analytics suggest that game analytics will benefit from this approach.

1.2 Related Research

User behavior in terms of measured duration data has been researched in game analytics [21, 22] and game networking [23, 24, 25, 26, 27, 28, 29, 30] as a topic in itself, with networking research often analyzing playtimes along with idle times. The game analytics literature attempts to understand user retention and how the game itself contributes to it. Game networking investigates how network quality and related factors add to user retention and how user activity, on the other hand, imposes a load on servers which the operator might try to mitigate. Total playtime [21, 22, 34], session playtime [23, 24, 25, 26, 27, 28, 29, 30] and session interarrival time [23, 24, 26], or idle time, have been popular measurements. A session is commonly defined as a duration of continuous play [23, 24, 25, 26, 27, 28, 29, 30] but has also been used to refer to a completed match [3, 4, 5, 6, 7, 8]. With long-term games, popular retention measurements are subscription time [13, 14, 15] or active periods over calendar time [16, 17, 18, 19, 20], possibly combined [14]. Session counts [3, 4] and progression [9] are also instances of user retention which are on an abstract level equivalent to a discrete duration variable.

Playtime has been analyzed empirically in many studies, with some fitting a parametric playtime distribution. A notable example is the study of total playtimes [22] in over 3000 Steam games totaling 6 million players, which utilized the Weibull distribution for archetype analysis. Research with parametric models has often investigated exponential [23, 24, 35], Weibull [11, 12, 25, 26, 30, 21, 22], Gamma [21], log-logistic [12], log-normal [30, 21, 34] and Pareto-type [12, 23, 24, 29] distributions.

Survival analysis is in the early stages of being applied to game analytics. Some studies have used tools that are central in survival analysis, such as the survival curve of playtimes or the churn rate [25, 26, 27, 28]; notable is the use of the Kaplan-Meier estimate [27, 28] to deal with session durations censored by short collection times. Studies focusing on survival analysis have researched modeling the user process [11], measuring difficulty with automated playtesting [35] and retention regression over time-varying game features [34].

1.3 Survival Analysis of Playtimes

In this paper we describe how survival analysis can contribute to playtime analysis. We introduce the following fundamental analyses a game analyst can carry out on data consisting of observed playtimes, using standard survival analysis software such as R [36], SAS and Stata [33], among others:

  1. Survival and hazard curves: these two foundational concepts of survival analysis allow studying both visually and analytically the rates at which players churn from the game at different time points. They enable the analyst to better understand the overall quality, including strong and weak points of a game.

  2. The mean and the median provide singular metrics for characterizing the expected and the typical playtime. They allow the analyst to aggregate the data into a single informative number together with confidence intervals.

  3. The log-rank test provides a scientific AB-test by comparing the survival curves of different groups (e.g. players of different versions of a game). This allows the analyst to deduce whether the groups are different to a given degree of confidence.

We limit our investigation to survival analysis without covariates, which covers the first chapters of textbooks and is often the starting point for any survival analysis investigation [33]. All the presented methods are accompanied by examples analyzing total playtime in Tribeflame Ltd.'s mobile game Hipster Sheep. The methods can be used more generally to analyze any duration data; for retention, important targets could also be session and subscription durations.

2 Playtime data

2.1 Retention

The target of survival analysis is a positive duration variable, which is often called 'time to event'. This may be the person's lifetime in epidemiology, the time to machine failure in reliability engineering and the time to disease recurrence in medicine [32]. The duration variable may also be discrete, such as lifespan in years or repetitions to failure. Formally, a given population of $n$ subjects has a set of duration variables $T_1, \dots, T_n$ with $T_i > 0$.

In gaming, retention refers to player activity until the event of user churn. Based on the literature surveyed there are several candidate duration variables for retention:

  1. Total playtime

  2. Session playtime

  3. Total progression

  4. Total active or subscription time (MMORPGs etc.)

Total playtime is the total time spent playing the game, in seconds for example. Session playtime corresponds to the duration of continuous play. Total progression relies on a game developer's intuition of how game consumption is transformed into a positive non-decreasing value; a natural example would be levels completed. In games with open-ended goals and long-term gameplay, one may analyze the total time active as the calendar time the player was engaged in the game world, or the total subscription periods, such as months until cancellation.

2.2 Churn and Censoring

Censoring is very common in survival analysis type data. For example, in medical studies patients may drop out of the study before experiencing the event of interest, or the study may have a limited follow-up time which terminates the study before every patient has had the event [33]. Such a subject is called censored, and the data in this case is subject to right censoring. Censored subjects still contribute information, since we know the time to event must be greater than the time of censoring. To deal with this possibility, the set of duration observations $t_i$ is extended with corresponding censoring indicators $c_i$, to give a data set $\{(t_i, c_i)\}_{i=1}^n$. If $c_i = 1$, we indicate that the time to event is greater than or equal to $t_i$, and with $c_i = 0$ that the event occurred at $t_i$.

In successful games with very long playtimes, censoring is often unavoidable because game developers wish to perform analytics without waiting for every user to churn. Even in mobile games which display short playtimes, the scattering of sessions over calendar time implies that a very long actual follow-up may be required. There is a second important challenge which is not found in classical survival analysis. We can always observe whether a person contracted a disease or a machine failed, but it is not possible, even in principle, to know whether a player has churned for good. A player may always return to the game; it is only with the passing of time that we gain confidence this will not happen.

The challenge of detecting churn related to total playtime has been dealt with in the literature using various rules to impute churned and non-churned players: assuming all players have churned [21, 22] or defining a window of inactivity which implies churn [16, 17]. A more sophisticated way of churn detection would use one of the churn prediction algorithms in the literature to predict the censoring label [18]; here churned equals 'event observed' and non-churned 'event censored'. Complex user process models are also able to infer churn [11].

The problem manifests with quits that occur without notification, because it is sometimes difficult to say whether the user will come back. The problem does not exist if the churn event is observable; examples include session end, level failure and subscription cancellation. Nevertheless, none of the current solutions seem perfectly satisfactory, as each may add bias depending on the method. Player churn is an extended topic, and we further assume that a simple method to impute churn is available, as sketched below. This enables us to focus on the standard methods which are universally applicable and have a strong theoretical foundation [32]. We aim to discuss extended methods incorporating churn uncertainty in future research.
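As a concrete illustration, the inactivity-window rule can be sketched in a few lines of R. This is our own minimal sketch, assuming a hypothetical data frame with a last_seen timestamp per player; the paper's logging framework is not shown.

```r
# Hypothetical sketch of the 14 day inactivity rule of [16, 17]:
# players last seen within the window before the data collection limit
# are treated as still active (censored), all others as churned.
impute_churn <- function(players, collection_end, window_days = 14) {
  inactive <- as.numeric(difftime(collection_end, players$last_seen,
                                  units = "days"))
  players$event <- inactive > window_days  # TRUE = churn observed
  players
}
```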

Figure 1: Hipster Sheep promotional material also displayed in Google Play, used with permission of Tribeflame Ltd.

2.3 Playtime Data Set

In this study we chose to use the total playtimes of an in-development mobile game to illustrate every method with a real-world survival analysis application. It is important to keep in mind that the methods can be applied equivalently to any game or any duration data motivated by retention measurement.

We use data from Hipster Sheep, in Figure 1, a commercial grade puzzle game being developed by Tribeflame Ltd. The game is targeted at young adult females and has an artistic theme of making light fun of hipsters through self-irony. The goal is to guide an anthropomorphic sheep through labyrinths on her quest for the next big thing. The game is free-to-play, level-based and uses in-game purchase monetization. Energy mechanics are used to limit the possibility for unlimited free play. Levels combine skill with a hefty dose of luck, as is common in modern free-to-play games.

Figure 2: User acquisition to test versions 1.11, 1.15 and 1.18. Daily New Users (DNU) are highlighted in dark color and Daily Active Users (DAU) in transparent color. Each acquisition spikes over a few days, with the resulting user activity diminishing over time.

During the development, there were three significant user acquisition campaigns for versions 1.11, 1.15 and 1.18, whose purpose was to test the game’s appeal in-between successive development cycles. There are many other version iterations of the game, yet we focus on these versions because they make up the majority of the user base. Users were purchased randomly through advertising in social networks. There was some organic growth where users invited friends or found the game on Google Play, but by and large the data set consists of acquired users.

In Figure 2 we display the daily new users (DNU) acquired and the resulting daily active users (DAU). In version 1.11 a total of 970 users were acquired in early June 2015, in version 1.15 a total of 1246 users were acquired in early September 2015 and in version 1.18 a total of 1537 new players arrived, mostly in mid-October. The three versions hold 3753 players in total. This excludes players with only one extremely brief session, since this was deduced to be part of 'acquisition phenomena' rather than gameplay; the game takes a dozen seconds to load.

Figure 3: A histogram of playtimes for all players in Hipster Sheep; the histogram is one of the most widely utilized tools to measure user behavior.

Well-known statistical tools such as histograms, empirical density functions and cumulative density functions are used for duration data across game analytics and networking [23, 24, 25, 26, 27, 28, 29, 30, 21, 22, 35, 34]. Figure 3 plots a histogram for comparison to the new methods.

2.4 Playtime Example Data Set

Player activity was logged to a database using an in-house logging framework which operates by sending accumulated event packages at brief intervals. The data was processed to compute the total playtime. For almost every player we may confidently say that they churned, but close to the observation limit we decided, following [16, 17], that players who had played within 14 days of the end of the collection time had not churned. This applied to 1% of players and led to their playtimes being censored.

To illustrate computations, we randomly sampled a subset of 10 players from version 1.18 for Android. Results are displayed in Table 1 with playtime duration and censoring indicator.

Figure 4 visualizes the playtime data of the sample. We see that there is significant early churn, with 40% churning before 1 hour of gameplay, and a heavy tail with one 12 hour gameplay observation. The other players have more typical 1-6 hour playtimes. One player happened to be censored, but otherwise the sample seems quite representative of the population.
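For the computations in the following sections, the sample of Table 1 can be entered into R as a small data frame. This transcription is our own; gp3's 13 second playtime is kept positive rather than rounded to 0.00 h so that log-scale fits remain valid.

```r
# The 10 player sample of Table 1; event = TRUE means churn observed,
# so the censored player gp6 has event = FALSE.
sample10 <- data.frame(
  player = paste0("gp", 0:9),
  time   = c(0.38, 5.93, 0.18, 13/3600, 1.85, 2.36, 0.79, 4.76, 11.92, 0.03),
  event  = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
```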

3 Survival Analysis

Player Playtime Censored Playtime (hours)
gp0 00:22:51 False 0.38
gp1 05:55:32 False 5.93
gp2 00:10:48 False 0.18
gp3 00:00:13 False 0.00
gp4 01:50:59 False 1.85
gp5 02:21:48 False 2.36
gp6 00:47:27 True 0.79
gp7 04:45:25 False 4.76
gp8 11:55:22 False 11.92
gp9 00:01:53 False 0.03
Table 1: Hipster Sheep: 10 Players
Figure 4: A random sample of 10 players from Hipster Sheep version 1.18 for Android. The player with identifier 'gp6' had been playing very recently and was determined to be active, whereas the others had churned with high likelihood.

We now begin our investigation of how survival analysis helps to analyze playtime data. In this section, we explain two foundational concepts of survival analysis: the survival curve and the hazard. These concepts enable the analyst to analyze playtime data both visually and computationally. The survival curve is a natural way to visualize the proportion of a given population surviving, which is why it is typically used in the context of duration data [33]. The hazard function is often motivated as the cause of a given survival curve, enabling simpler analysis.

3.1 The Survival Function

Suppose $n$ players have playtimes $t_1, \dots, t_n$. Statistically speaking, these playtimes are a sample of a random variable $T$ of the population playtimes, and based on the sample we seek to analyze the distribution of $T$. In the discrete case, where $T$ is for example playtime rounded to hours, we use a probability mass function (PMF), or the probability of failure at $t$: $f(t) = P(T = t)$. If $T$ is continuous, we define a probability density function (PDF) instead:

$$f(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t)}{\Delta t}$$

Regardless of whether we use the PMF or the PDF, the cumulative density function (CDF) $F(t) = P(T \le t)$ is used to describe the accumulated probability of playtime less than or equal to $t$. In survival analysis, we often analyze the survival function (SF) [32] instead, which gives the probability of having playtime greater than $t$:

$$S(t) = P(T > t) = 1 - F(t)$$

As the complement of the cumulative density function, the survival function is a monotonically decreasing function and has the properties that $S(0) = 1$ and $S(t) \to 0$ as $t \to \infty$. Sometimes in practical applications the survival function is not constrained to approach zero, in which case the distribution is improper.

3.2 The Hazard Function

Function Session 1 Session 2 Session 3 Session 4
Retention rate 50% 80% 80% 80%
Churn rate 50% 20% 20% 20%
# Failing 500 100 80 64
# Surviving 1000 500 400 320
Table 2: 1000 Players Churn Example

Survival analysis distributions are often easiest to understand in terms of a geometric decay of players. Since churned players are no longer at risk of churn, it is useful to contemplate a constant churn rate acting on the remaining players. We might also have churn rates with a high initial churn and thereafter a simple constant churn. For example, in Table 2, 50% of players play more than one session, but after the second session 80% survive to play session 3, of those 80% survive to play session 4, etc. Churn in this case refers to the number of players who after session $k$ do not play the next session, and survival quantifies the number of players playing more than the $k$th session.

This is formalized in the concept of hazard. For the discrete case, the hazard function [32] quantifies the proportion of the remaining players who churn:

$$h(t) = P(T = t \mid T \ge t) = \frac{f(t)}{P(T \ge t)}$$

For the continuous case, the hazard is the instantaneous failure rate in the remaining player base, motivated as the limit:

$$h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t} = \frac{f(t)}{S(t)}$$

An important point to note is that the continuous hazard is not the probability of failure at time $t$; it can be greater than 1. An approximation of the probability of failure in a small interval $\Delta t$ is $h(t) \Delta t$. The relationship of proportional failure probability and rate is analogous to that of the PMF and PDF. These two settings can actually be treated together with the Riemann-Stieltjes integral; for more information we refer the reader to [32, 37].

3.3 Connection of Hazard and Survival

Knowing the hazard or the survival, we can derive the other. In practical applications the hazard is often analyzed for its simpler interpretation and the survival curve derived as a function of the hazard. For the discrete case it is easy to see that a product formulation is possible: at point $t$, the survival is the product of the fractions remaining after the churn events at $t_j \le t$:

$$S(t) = \prod_{t_j \le t} (1 - h(t_j))$$

For example, in Table 2 the survival after session three is $1000 \times (1 - 0.5)(1 - 0.2)(1 - 0.2) = 320$ players.

In the continuous case, we take a product integral, which works like the Riemann integral in partitioning the domain. Instead of a sum, the result is the limit of a product of terms over the partition, consisting of the surviving fractions $1 - h(t)\,dt$. In fact, the less well-known product integral may be written in terms of the Riemann integral by taking the logarithm [37]:

$$S(t) = \exp\left(-\int_0^t h(u)\,du\right)$$

The integral of the hazard is the cumulative hazard function $H(t) = \int_0^t h(u)\,du$ [32]. Its utility is explained by how a proportional change in $S(t)$ corresponds to a linear change in $H(t)$. Taking the logarithm gives the cumulative hazard in terms of the survival function:

$$H(t) = -\log S(t)$$
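Both relations can be checked against Table 2 in a few lines of R; this is our own illustration, not code from the paper.

```r
# Per-session churn rates from Table 2.
h <- c(0.5, 0.2, 0.2, 0.2)
S <- cumprod(1 - h)  # discrete survival: 0.500 0.400 0.320 0.256
1000 * S[3]          # 320 players play more than the 3rd session
# Continuous analogue: a constant hazard of 0.5 churns per unit time
# has cumulative hazard H(t) = 0.5 t and survival S(t) = exp(-H(t)).
exp(-0.5 * 1)        # survival at t = 1 is about 0.61
```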

In the simplest case, the hazard $h(t) = \lambda$ is homogeneous, implying the churn rate is constant over time. This hazard can be used to derive two well-known distributions: the geometric distribution for the discrete case and the exponential distribution for the continuous case. Of the common survival distributions listed in Table 3, the Weibull is one of the most popular [38], and it also has wide applicability in games [22, 12, 11].

Function | Survival function $S(t)$
Exponential | $\exp(-\lambda t)$
Weibull | $\exp(-(\lambda t)^k)$
Log-Logistic | $1 / (1 + (\lambda t)^k)$
Log-Normal | $1 - \Phi((\log t - \mu) / \sigma)$
Table 3: Simple Survival Type Distributions

4 Playtime Survival

In this chapter, we utilize the survival and hazard functions to measure the quality of a game. We first introduce the theory using the 10 player sample and then apply the methods to the entire game data. In distribution fitting, one has the problem of choosing a parametric model. However, in survival analysis it is not actually necessary to guess distributions; the data over the follow-up time can be used to make model-free estimates. These are called nonparametric methods [37].

4.1 Fitting a Survival Model

Suppose that one has a reason to believe that the data follow a parametric model and the hazard or the survival is specified. What next? Fitting a distribution is often done utilizing maximum likelihood (ML) [38]. Specifically, for a data set with durations $t_i$ and censoring indicators $c_i$, a PDF/PMF parametrized by $\theta$ is fitted by assuming the observations i.i.d. and finding the parameters which maximize the likelihood of the observed data:

$$L(\theta) = \prod_{i=1}^n f(t_i; \theta)^{1 - c_i}\, S(t_i; \theta)^{c_i}$$

Censored observations contribute the survival $S(t_i; \theta)$, since we only know the playtime exceeded $t_i$. The logarithm of the likelihood is taken in practice to avoid numerical errors associated with extremely small quantities. The ML estimate may be found iteratively using optimization algorithms such as Newton-Raphson [37].

Figure 5: Exponential fit (green) to the sample of 10 players, with confidence intervals (green, dashed). Contrasted to the KM baseline (black) to be presented later, the early observations may not fit the simple model.

For example, to fit the exponential distribution in Figure 5, one maximizes the likelihood in terms of the objective

$$\log L(\lambda) = d \log \lambda - \lambda V$$

where we have defined the number of observed churns $d = \sum_{i=1}^n (1 - c_i)$ and the total time at risk of churning $V = \sum_{i=1}^n t_i$. The log-likelihood is maximized when the derivative is zero. In this case we can directly find the ML estimate:

$$\hat\lambda = \frac{d}{V}$$

Given the survival times in Table 1, there are $n = 10$ players with $d = 9$ churning. The total time at risk is the sum of accumulated playtimes: $V = 28.2$ (hours). Therefore, we obtain a failure rate, or hazard, of $\hat\lambda = 9 / 28.2 \approx 0.32$ churns/h.

95% confidence intervals (C.I.) for the parameter may be obtained using the normal approximation

$$\hat\lambda \pm 1.96 \sqrt{\widehat{\mathrm{Var}}(\hat\lambda)}$$

where 1.96 is the value $z$ of the standard normal distribution such that $P(-z \le Z \le z) = 0.95$. To estimate confidence intervals for an asymptotically normally distributed quantity, one therefore needs an estimate of its variance.

The variance, or in general the covariance, estimate can be obtained as the inverse of the observed information, which is the negative second derivative of the log-likelihood in this case, or in general the negative Hessian [36]. Since $-\partial^2 \log L(\lambda) / \partial \lambda^2 = d / \lambda^2$, the variance estimate can be obtained with a substitution of the maximum likelihood parameter:

$$\widehat{\mathrm{Var}}(\hat\lambda) = \frac{\hat\lambda^2}{d}$$

The failure rate estimate with confidence intervals therefore is $\hat\lambda = 0.32 \pm 0.21$ churns/h.
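The closed-form fit is easy to verify by hand in R on the sample10 data frame defined earlier; a sketch of the computation above, not library code.

```r
d <- sum(sample10$event)           # 9 observed churns
V <- sum(sample10$time)            # about 28.2 hours at risk
lambda_hat <- d / V                # about 0.32 churns/h
se <- lambda_hat / sqrt(d)         # from the inverse observed information
lambda_hat + c(-1, 1) * 1.96 * se  # 95% C.I., about (0.11, 0.53)
```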

4.2 Estimating Playtime Survival

Since a chosen parametric model may not always fit the data, it is often desirable to use an empirical estimate as a benchmark. If there are no censored observations, it is straightforward to compute the SF empirically as the fraction of playtimes greater than $t$: $\hat S(t) = \#\{i : t_i > t\} / n$. However, if there are censored observations, we need to use the Kaplan-Meier estimate.

Figure 6: Kaplan-Meier estimator (line) with confidence intervals (shaded) for the sample of 10 players. Censored event time at 0.79 is denoted by a small vertical tick. The estimate reaches 0 at the last failure at 11.92 hours.

Given ordered event times $t_1 < t_2 < \dots$ and, at time $t_j$, the number of surviving non-censored, or 'at risk', players $n_j$ and churning players $d_j$, the fraction churning is $d_j / n_j$ and the Kaplan-Meier (KM) product-limit estimator is defined [33]:

$$\hat S(t) = \prod_{t_j \le t} \left(1 - \frac{d_j}{n_j}\right)$$

The estimator is simplest to describe with an example. Table 4 and Figure 6 show the estimate calculated for the data in Table 1. At every event time $t_j$, we compute the remaining fraction $1 - d_j / n_j$ and multiply it with the KM estimate of survival at the previous failure time to obtain the surviving population $\hat S(t_j)$. Note how the one censored event time at 0.79 h is not in the table of event times but reduces the risk set at the next failure time 1.85 h.

Time (h) | At risk | Churn | Haz. | Cum. Haz. (NA) | Surv. (KM) | CI.l. 95% | CI.u. 95%
0.00 | 10 | 1 | 0.10 | 0.10 | 0.90 | 0.47 | 0.99
0.03 | 9 | 1 | 0.11 | 0.21 | 0.80 | 0.41 | 0.95
0.18 | 8 | 1 | 0.13 | 0.34 | 0.70 | 0.33 | 0.89
0.38 | 7 | 1 | 0.14 | 0.48 | 0.60 | 0.25 | 0.83
1.85 | 5 | 1 | 0.20 | 0.68 | 0.48 | 0.16 | 0.75
2.36 | 4 | 1 | 0.25 | 0.93 | 0.36 | 0.09 | 0.65
4.76 | 3 | 1 | 0.33 | 1.26 | 0.24 | 0.04 | 0.54
5.93 | 2 | 1 | 0.50 | 1.76 | 0.12 | 0.01 | 0.41
11.92 | 1 | 1 | 1.00 | 2.76 | 0.00 | NA | NA
Table 4: Sample: KM Computation
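In practice the KM computation of Table 4 is a single call to the survival library listed in Table 8; a minimal sketch, where conf.type = "log-log" requests the transformed intervals described in the next subsection.

```r
library(survival)
km <- survfit(Surv(time, event) ~ 1, data = sample10,
              conf.type = "log-log")
summary(km)  # at-risk counts, KM survival and 95% C.I. per event time
plot(km)     # the step curve of Figure 6 with confidence bands
```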

4.3 Estimating Playtime Cumulative Hazard

An alternative is the Nelson-Aalen (NA) [33] estimate of the cumulative hazard, given by the sum of the fractions churning:

$$\hat H(t) = \sum_{t_j \le t} \frac{d_j}{n_j}$$

A nonparametric hazard estimate [37] requires smoothing the cumulative hazard step function estimate, with kernels for example. Various kernels exist; popular choices are the uniform kernel, the Epanechnikov kernel and the Gaussian kernel. A kernel $K$ is a mass of density concentrated around zero with a total area of one, its spread determined by the bandwidth $b$, which gives a hazard estimate:

$$\hat h(t) = \frac{1}{b} \sum_j K\left(\frac{t - t_j}{b}\right) \frac{d_j}{n_j}$$

Figure 7: Nelson-Aalen estimate (black) with confidence intervals (shaded) and Epanechnikov kernel smoothing (green) for the cumulative hazard in the 10 player sample. The survival corresponding to this estimate never reaches zero.

Of course, using the KM estimate we can derive a cumulative hazard estimate by $\hat H(t) = -\log \hat S_{KM}(t)$. Equivalently, for the NA estimate we have $\hat S_{NA}(t) = \exp(-\hat H(t))$. Both estimators are utilized extensively in practice [37].

It is possible to compute confidence intervals for the KM. The variance may be approximated with the delta method [37]:

$$\widehat{\mathrm{Var}}(\hat S(t)) = \hat S(t)^2 \sum_{t_j \le t} \frac{d_j}{n_j (n_j - d_j)}$$

The resulting intervals may extend above one and below zero, which violates survival curve assumptions. A common fix [37], which provides confidence intervals for the NA estimate as well, is to estimate the variance of the log-log transformed estimate:

$$\widehat{\mathrm{Var}}(L(t)) = \frac{1}{(\log \hat S(t))^2} \sum_{t_j \le t} \frac{d_j}{n_j (n_j - d_j)}$$

where $L(t) = \log(-\log \hat S(t))$. Transforming back with $\hat S(t) = \exp(-\exp(L(t)))$ gives the KM C.I. in Table 4.
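A sketch of both estimates in R; the muhaz library's defaults for kernel, bandwidth and boundary correction will not exactly reproduce the figures, and very small samples may require manual bandwidth settings.

```r
library(survival)
library(muhaz)
km   <- survfit(Surv(time, event) ~ 1, data = sample10)
H_na <- cumsum(km$n.event / km$n.risk)  # Nelson-Aalen: sum of d_j / n_j
H_km <- -log(km$surv)                   # KM-derived cumulative hazard
sm   <- muhaz(sample10$time, sample10$event, kern = "epanechnikov")
plot(sm)                                # kernel-smoothed churn rate
```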

4.4 Playtime Survival for Game Data

The Exponential, Weibull, Log-Normal and Log-Logistic distributions listed in Table 3 are common parametric models for survival data [32]. In Figure 8 we have fitted these four models using ML to the Hipster Sheep playtimes. We observe three models with significant model deviations. The exponential distribution overestimates early survival and underestimates late survival. The Log-Normal and Log-Logistic distributions fit short playtimes but have significantly longer tails than observed in practice. The Weibull distribution appears to have the least model deviation, corroborating the finding that it provides good approximations to multiple games [22].

Figure 8: Kaplan-Meier estimator and distribution fits for the entire population in Hipster Sheep. Confidence in the KM estimate is high due to sample size. The Weibull distribution fits data best with others having significant flaws.

Figure 8 demonstrates why the nonparametric Kaplan-Meier and Nelson-Aalen estimates are popular. Parametric models are more powerful wherever they describe the data, but when they do not, the results are incorrect. Nonparametric models are robust to model deviations, in other words they are often the safe choice, and even in limited data sets they may be sufficient to describe the quantities of interest [33]. The provided confidence intervals are informative in data-constrained industry applications. Since acquiring users costs money [2], a manager might request a statistically significant user test with the fewest possible users, which makes confidence intervals useful.
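The four fits can be reproduced with survreg from the survival library. This sketch fits the 10 player sample rather than the full data; note that survreg uses a location-scale parameterization, so the parameters of Table 3 must be recovered by transformation.

```r
library(survival)
for (dist in c("exponential", "weibull", "lognormal", "loglogistic")) {
  fit <- survreg(Surv(time, event) ~ 1, data = sample10, dist = dist)
  # e.g. for the Weibull: lambda = exp(-intercept) and k = 1 / scale.
  cat(dist, "intercept:", coef(fit), "scale:", fit$scale, "\n")
}
```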

5 Playtime Metrics

Our goal in this section is to explain three important metrics to benchmark the quality of a game. The metrics are motivated by the survival curve: the hazard, the mean playtime and the median playtime. These metrics are simple, easy to measure and it is possible to assess their reliability with confidence intervals.

5.1 Hazard as a Metric

In reliability engineering, the failure rate is a key measure of product reliability [32]: it provides a profile of how reliability evolves over time. Products may experience early failures due to defective units, the rate may then stabilize to a constant for the period of 'useful life' and go up in the 'worn-out' period. The churn rate provides a similar funnel-type visual for games and can be investigated in terms of the early, middle and late-game hazards. In general, the hazard is an informative time-dependent metric for the risk of the event occurring.

In free-to-play games, it is often observed that the failure rate is very high during initial sessions, and stabilizes or steadily continues to decline as the most dedicated players remain [12]. In pay-to-play games with campaigns, one may observe playtimes that are more clustered [34]. This analysis could prove useful for game design as well [35]. In terms of game progression, good level design should have an approximately uniform churn: unexpected increases in the churn rate signify flaws, and level-specific decreases suggest underutilized improvements.

Figure 9: Smoothing the 10 player sample with 1 h piecewise exponential rates, contrasted to Epanechnikov kernel smoothing. These bins are too small for this sample, and a higher degree of smoothing appears more informative.

A major problem with small sample hazard estimation is that the interpretation can depend on the method chosen. This is illustrated with the 10 player sample in Figure 9. The piecewise exponential method has a constant hazard, or exponential distribution, within given pieces (bins) of the domain. With 1 hour bins, there are $d = 4$ churns within the first hour with total time at risk $V = 6.38$ h, making the first 1 hour rate $\hat\lambda = 4 / 6.38 = 0.63$ churns/h. In bins with no churns the rate is 0.00, and in the last bin there is 1 churn with a single player at risk for 0.92 h, implying $\hat\lambda = 1 / 0.92$ h $= 1.09$ churns/h.
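The piecewise rates are what the pehaz function of the muhaz library computes; a minimal sketch of the first-bin calculation above.

```r
library(muhaz)
pe <- pehaz(sample10$time, sample10$event, width = 1)  # 1 hour bins
pe$Hazard[1]  # churns / time at risk in bin one: 4 / 6.38 = 0.63
```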

5.2 Mean Playtime as a Metric

The hazard function is not a single measure but a function: a set of measures, one for every point in time. Often a single measure is required to benchmark the game. If we assume monetization is proportional to retention, it is desirable to use the expected playtime, or the mean, as a singular metric to predict profit, which is the expected user value minus the acquisition cost. For a playtime distribution, the mean playtime is defined:

$$E[T] = \int_0^\infty t f(t)\,dt$$

The mean playtime has a surprising connection to the playtime survival: it is the area under the curve (AUC) [32]:

$$E[T] = \int_0^\infty S(t)\,dt$$

Therefore, to compare two survival curves using a single metric, one may compare the area underneath each. This is quite remarkable, since we then have a singular statistic quantifying the goodness of a game. The comparison is well-defined even in cases where the survival curves cross and the ranking is time-dependent. The metric quantifies how much better, in additional mean playtime, one survival curve is. Visually, the shape of the survival curve describes where the additional playtime has been accumulated: one may have achieved it by decreasing initial churn or by increasing long-term retention.

Time At risk Length Survival Add Area Tail Area
0 NA NA NA NA 3.28
0 10 0.00 1.00 0 3.27
0.03 9 0.03 0.90 0.03 3.25
0.18 8 0.15 0.80 0.12 3.13
0.38 7 0.20 0.70 0.14 2.99
1.85 5 1.47 0.60 0.88 2.11
2.36 4 0.51 0.48 0.25 1.86
4.76 3 2.39 0.36 0.86 1.00
5.93 2 1.17 0.24 0.28 0.72
11.92 1 6.00 0.12 0.72 NA
Table 5: Sample: KM-Based Area Computation

The method of deriving an estimate for the mean through the area under the survival curve is beneficial because it works with censored observations. Simply ignoring censored observations would lead to a downward bias in the estimate. Furthermore, confidence intervals for the mean can be derived utilizing the area. In Table 5 we have computed the intervals between churn events and how much each interval adds to the total area, the interval length times the survival at the previous event. The tail area $A_j$ denotes the area remaining in the tail after all areas up to $t_j$ have been accounted for. The total area, which is the expected playtime, can be computed to be $\hat\mu = 3.28$ hours.

95% confidence intervals using the normal approximation are obtained with $\hat\mu \pm 1.96 \sqrt{\widehat{\mathrm{Var}}(\hat\mu)}$, in this case approximately $3.28 \pm 2.5$ (h). The variance estimate for $\hat\mu$ can be derived [31]:

$$\widehat{\mathrm{Var}}(\hat\mu) = \sum_{t_j} \frac{A_j^2\, d_j}{n_j (n_j - d_j)}$$
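In R, the restricted mean and its standard error are printed directly from the KM fit with the print.rmean option listed in Table 8; a minimal sketch on the sample.

```r
library(survival)
km <- survfit(Surv(time, event) ~ 1, data = sample10)
print(km, print.rmean = TRUE)  # rmean about 3.28 h with its std. error
```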

Figure 10: Reading E[T] and Median[T] from the KM survival curve. E[T] is the area under the curve (AUC) whereas the curve drops below 0.5 at Median[T] with confidence intervals necessitated by the KM confidence intervals in blue.

5.3 Median Playtime as a Metric

The mean playtime is quite informative in many cases, but it may not quantify the typical player experience due to many early failures or the presence of a long tail. The median, on the other hand, attempts to quantify the 'typical' playtime. It is defined as the point to which half of the players survived:

$$\mathrm{Median}[T] = \inf\{t : S(t) \le 0.5\}$$

In general, one can define arbitrary quantiles of the survival curve. Specifically, the $p$'th quantiles are defined:

$$t_p = \inf\{t : S(t) \le 1 - p\}$$

To compare two survival curves, one may compare the points at which half of the players are lost. The game with the greater time to lose half of its players is then said to be better, relative to the median. This benchmark may be extended by creating quantile measures, such as a sequence of times at which given fractions (e.g. 25%, 50% and 75%) of players are lost for each game. These measures give an unequivocal benchmark of short term and long term retention.

The median playtime can be read from the survival curve by finding the earliest $t$ at which the survival drops to or below 0.5. Furthermore, confidence intervals for the median can also be directly read from the pointwise KM-estimate confidence intervals: one draws a horizontal line at survival 0.5 and reads the left (lower) and right (upper) values on the $t$-axis where the line meets the KM C.I. curves. Specifically, we seek the lowest and highest values of $t$ such that the following inequality with the log-log transform is satisfied [36]:

$$-1.96 \le \frac{\log(-\log \hat S(t)) - \log(-\log 0.5)}{\sqrt{\widehat{\mathrm{Var}}(L(t))}} \le 1.96$$

where $\widehat{\mathrm{Var}}(L(t))$ was estimated previously to obtain the log-log transformed KM confidence intervals. The normal approximation based value is 1.96 for the 95% C.I. The $p$'th quantile confidence interval may be estimated by substituting $1 - p$ for 0.5 in the inequality. In this case, we obtain a highly uncertain estimate: a median of 1.85 h with a 95% C.I. spanning roughly 0 to 5.93 h.
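The median and other quantiles, with intervals read from the log-log transformed KM limits, are available directly from the fitted survfit object; a minimal sketch.

```r
library(survival)
km <- survfit(Surv(time, event) ~ 1, data = sample10,
              conf.type = "log-log")
quantile(km, probs = c(0.25, 0.50, 0.75))  # estimates with 95% C.I.
```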

5.4 Playtime Metrics for Game Data

The churn rate provides a very useful time-dependent metric of game quality. As explained previously, it can be used as a funnel to quantify the strong and weak points of the game. In games with long-term consumption patterns, a simple hazard enables reliable player lifetime forecasts.

In Figure 11 we used Epanechnikov kernels to obtain a hazard estimate for the three versions in the data set. Since the playtime is terminated by a churn event, this is a smooth estimate of the churn rate. We see that version 1.18 churn is quite high initially at about 0.6 churns/h, and halves to 0.3 churns/h during the first 4 hours, designated as early gameplay. A steady decline continues as the most dedicated players remain in the game, and the rate appears to reach a near constant 0.2 churns/h from 10 hours onwards.

Figure 11: Hazard estimates for the versions, contrasted using Epanechnikov kernels with a high degree of smoothing and left continuity correction. The first version seems uniformly worse, whereas the other two appear indistinguishable.

For the singular metrics that summarize the survival curve, Table 6 lists the mean and the median with confidence intervals for the three game versions in the data set. One should note that the mean confidence interval of 1.11 does not overlap with that of either 1.15 or 1.18, which implies the difference is significant. However, between 1.15 and 1.18 each confidence interval is wide enough to contain the other mean estimate. The quantile metrics, which include the median, may be read with confidence intervals from the KM estimates in Figure 12.

Version | Mean (h) | CI.l. 95% | CI.u. 95% | Median (h) | CI.l. 95% | CI.u. 95%
1.11 | 1.55 | 1.37 | 1.73 | 0.60 | 0.55 | 0.68
1.15 | 2.21 | 2.00 | 2.42 | 0.77 | 0.66 | 0.87
1.18 | 2.41 | 2.18 | 2.64 | 0.77 | 0.66 | 0.86
Table 6: Hipster Sheep: Playtime Metrics

The difference between the mean and the median as metrics is clearly visible; whereas a random player in 1.18 typically quits after 0.77 hours, the expected playtime extracted from a random player is three times larger at 2.41 hours. The presence of both fickle and dedicated players produces effects which make both metrics informative from different managerial perspectives.

6 Playtime Comparison

6.1 Comparing Cohort Survival

Given two survival curves, a test of their difference is often called for, and statistical tests can be used for this purpose. These tests assume that the samples are identically distributed under the null hypothesis, and we obtain evidence which may reject this hypothesis with a given degree of confidence. For example, we can have two game versions and survival curves produced by the two cohorts consisting of the players of each game version. We then assume the changes had no effect, i.e. that the survival is equal, and the evidence given by the players provides a test which may reject this assumption, leading us to conclude that the changes indeed affected the game.

While several statistical tests exist, it is useful to be able to compare two survival curves in their entirety. Instead of comparing the means or testing the possibility that one curve is strictly better, we present the log-rank test [33], which tests the null hypothesis $H_0: S_0(t) = S_1(t)$ under censored observations and allows one to use the survival curve to ascertain the difference. For simplicity we call one of the cohorts the 'control cohort' and the other the 'test cohort'.

Figure 12: Kaplan-Meier estimator for the entire population in Hipster Sheep separated into cohorts by three versions. There is some censoring in the most recent version. Based on the plot, one might think the early version 1.11 is worst, but it is hard to say if the improvement from 1.15 to 1.18 is significant.

Suppose that at event time $t_j$ there are $n_{0j}$ players with $d_{0j}$ churning in the control cohort and $n_{1j}$ players with $d_{1j}$ churning in the test cohort. Denote the total players $n_j = n_{0j} + n_{1j}$ and the total churning $d_j = d_{0j} + d_{1j}$. The log-rank test is based on the observation that if the null hypothesis were true, i.e. the groups were equal, then given $n_{0j}$, $n_{1j}$ and $d_j$, the number $d_{0j}$ is a sample of a hypergeometric random variable:

$$P(d_{0j} = d) = \frac{\binom{n_{0j}}{d} \binom{n_{1j}}{d_j - d}}{\binom{n_j}{d_j}}$$

The mean and variance of this distribution are [31]:

$$e_{0j} = \frac{n_{0j} d_j}{n_j}, \qquad v_{0j} = \frac{n_{0j} n_{1j} d_j (n_j - d_j)}{n_j^2 (n_j - 1)}$$

Using these facts, it is possible to construct a linear test statistic based on a score statistic obtained by summing the differences between the observed and expected event counts [31]:

$$U_0 = \sum_j (d_{0j} - e_{0j}), \qquad V_0 = \sum_j v_{0j}$$

A chi-square test statistic allows one to obtain a p-value [31]:

$$X^2 = \frac{U_0^2}{V_0} \sim \chi_1^2$$

Test | $n_0$ | $n_1$ | $X^2$ | $p$-value
1.15 vs 1.11 | 1246 | 970 | 21.4 | 3.74e-06
1.15 vs 1.18 | 1246 | 1537 | 0.4 | 0.534
Table 7: Statistical Test of Survival Equivalence

To apply this test in a real example, in Table 7 we have taken version 1.15 as the control group and compared it to a cohort with version 1.11 and then to 1.18, which are plotted in Figure 12. The first test then answers the question of whether version 1.15 was an improvement over 1.11, and the second whether 1.18 further improved the game. The difference between 1.11/1.15 is highly significant; however, unlike what might be visually deduced from Figure 12, the 1.15/1.18 difference in the survival tail is completely nonsignificant.
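With the survival library the test is one call to survdiff. A sketch, assuming a hypothetical data frame players holding the full data set with time, event and version columns.

```r
library(survival)
d1 <- subset(players, version %in% c("1.11", "1.15"))
survdiff(Surv(time, event) ~ version, data = d1)  # chi-square and p-value
# rho = 1 gives the Peto-Peto weighting discussed below.
survdiff(Surv(time, event) ~ version, data = d1, rho = 1)
```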

There are modifications to the log-rank test which emphasize different aspects of the survival curve: one may want to weight early or late failures more heavily. For example, the weighted test statistics [36] $U_0 = \sum_j w_j (d_{0j} - e_{0j})$ and $V_0 = \sum_j w_j^2 v_{0j}$ are commonly used with the weight $w_j = \hat S(t_j)^\rho$. Setting $\rho = 1$, one obtains the Prentice or Peto-Peto modification of the Gehan-Wilcoxon test, which places more emphasis on earlier survival differences. In our case, this resulted in $p$-values 0.004 and 0.76, which have the same interpretations.

6.2 Stratification

The cohorts may not always be directly comparable. For example, user acquisitions may be conducted with different marketing campaigns or in different countries. Differences between two versions might then really reflect a different underlying composition of players rather than changes in behavior due to the versions.

To correct for such effects, one needs to adjust for the covariate which is suspected to be an alternate cause of the effects. This is equivalent to testing the null hypothesis $H_0: S_0^{(g)}(t) = S_1^{(g)}(t)$ across the strata $g = 1, \dots, G$ of the covariate. The test is based on computing the score statistic $U_0^{(g)}$ and variance $V_0^{(g)}$ for each group separately and using the test [36]:

$$X^2 = \frac{\left(\sum_g U_0^{(g)}\right)^2}{\sum_g V_0^{(g)}} \sim \chi_1^2$$

In our case, adjusting for country of origin, we obtain $p$-values 7.88e-06 and 0.608, which does not change the interpretations.
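In survdiff the adjustment is expressed with a strata() term; a sketch using the same hypothetical players data frame with an assumed country column.

```r
library(survival)
d2 <- subset(players, version %in% c("1.15", "1.18"))
survdiff(Surv(time, event) ~ version + strata(country), data = d2)
```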

Problem R function [36] R library
Fit nonparametric survival model? survfit survival
Fit parametric survival model? survreg survival
Fit nonparametric hazard? muhaz/pehaz muhaz
Compute mean/median? print(print.rmean=T/F) base
Compute log-rank test? survdiff survival
Table 8: Applied research cheatsheet

7 Conclusion

In this study, we demonstrated that survival analysis can be used to measure retention in games. Positive, skewed and censored duration data make it a very natural and powerful tool for this purpose. Duration variables quantifying retention such as playtime, session time and subscription time, even game progression, may be analyzed with the methods of survival analysis. In this study we used a real world game development example with focus on total playtime.

We presented the basic foundations of survival analysis and argued that the phenomena may be analyzed in a simple way through the churn rate or its complement, the retention rate. The study focused on three key motivations for survival analysis based measurement: computing survival curves, deriving survival metrics and comparing survival data. These methods contribute towards scientific data analysis by presenting methods new to game analytics, which are also able to deal with censoring and utilize statistical significance tests.

For computing survival curves and cumulative hazards, we presented the Kaplan-Meier and the Nelson-Aalen estimate. Kernel methods may be used to compute the churn rate and produce smooth nonparametric survival curves.

For metrics, we discussed how the hazard improves on the survival curve as a funnel-type estimate. Widely utilized in reliability engineering, the hazard is especially useful in retention and progression analysis for game analytics, where it can detect deviations from the natural pattern of constant rates. Furthermore, the mean and the median playtime metrics were derived from the survival curve with confidence intervals.

For survival comparison, we used the log-rank statistical test to perform a test of the null hypothesis that the survival curves are equal. The test may be extended to stratify over covariates and compare multiple cohorts. This method enables scientific AB-testing of game version quality, for example.

The reader may take advantage of Table 8 to use the methods for applications. It lists the methods we have presented and the R software functions implementing them.

In summary, survival analysis motivated functions, metrics and comparisons provide multiple tools to utilize for retention and progression measurement in game development. We think that the field has large potential to contribute to scientific game analytics and anticipate further research on this topic.

References

  • [1] M. S. El-Nasr, A. Drachen, and A. Canossa, eds., Game Analytics, Maximizing the Value of Player Data. Springer, 2013.
  • [2] E. B. Seufert, “Freemium metrics,” in Freemium Economics, pp. 83–113, Boston: Morgan Kaufmann, 2014.
  • [3] B. G. Weber, M. Mateas, and A. Jhala, “Using data mining to model player experience,” in Proc. FDG Evaluating Player Experience in Games Workshop, ACM, 2011.
  • [4] B. G. Weber, M. John, M. Mateas, and A. Jhala, “Modeling player retention in Madden NFL 11,” in Proc. Innovative Applications of Artificial Intelligence Conf. (D. G. Shapiro and M. P. J. Fromherz, eds.), AAAI Press, 2011.
  • [5] B. E. Harrison and D. L. Roberts, “When players quit (playing Scrabble),” in Proc. AAAI Artificial Intelligence and Interactive Digital Entertainment Conf. (M. Riedl and G. Sukthankar, eds.), The AAAI Press, 2012.
  • [6] B. Harrison and D. L. Roberts, “Analytics-driven dynamic game adaption for player retention in Scrabble,” in Proc. IEEE Computational Intelligence in Games Conf., pp. 1–8, 2013.
  • [7] B. E. Harrison and D. L. Roberts, “Analytics-driven dynamic game adaption for player retention in a 2-dimensional adventure game,” in Proc. AAAI Artificial Intelligence and Interactive Digital Entertainment Conf., 2014.
  • [8] B. Harrison and D. L. Roberts, “An analytic and psychometric evaluation of dynamic game adaption for increasing session-level retention in casual games,” IEEE Trans. Comput. Intell. AI in Games, vol. 7, no. 3, pp. 207–219, 2015.
  • [9] T. Debeauvais, C. V. Lopes, N. Yee, and N. Ducheneaut, “Retention and progression: Seven months in World of Warcraft,” in Proc. Int. Foundations of Digital Games Conf. (M. Mateas, T. Barnes, and I. Bogost, eds.), Society for the Advancement of the Science of Digital Games, 2014.
  • [10] T. Debeauvais and C. V. Lopes, “Gate me if you can: The impact of gating mechanics on retention and revenues in Jelly Splash,” in Proc Int. Foundations of Digital Games Conf. (J. P. Zagal, E. MacCallum-Stewart, and J. Togelius, eds.), Society for the Advancement of the Science of Digital Games, 2015.
  • [11] M. Viljanen, A. Airola, T. Pahikkala, and J. Heikkonen, “Modelling user retention in mobile games,” in Proc. IEEE Computational Intelligence and Games Conf., pp. 62–69, IEEE, 2016.
  • [12] M. Viljanen, A. Airola, T. Pahikkala, and J. Heikkonen, “User activity decay in mobile games determined by simple differential equations?,” in Proc. IEEE Computational Intelligence and Games Conf., pp. 126–133, IEEE, 2016.
  • [13] J. Kawale, A. Pal, and J. Srivastava, “Churn prediction in MMORPGs: A social influence based approach,” in Proc. Int. Computational Science and Engineering Conf., vol. 4, pp. 423–428, 2009.
  • [14] Z. Borbora, J. Srivastava, K. W. Hsu, and D. Williams, “Churn prediction in MMORPGs using player motivation theories and an ensemble approach,” in IEEE Int. Privacy, Security, Risk and Trust Conf. and IEEE Int. Social Computing Conf., pp. 157–164, 2011.
  • [15] Z. H. Borbora and J. Srivastava, “User behavior modelling approach for churn prediction in online games,” in IEEE Int. Privacy, Security, Risk and Trust Conf. and IEEE Int. Social Computing Conf., pp. 51–60, 2012.
  • [16] J. Runge, P. Gao, F. Garcin, and B. Faltings, “Churn prediction for high-value players in casual social games,” in Proc. IEEE Computational Intelligence and Games Conf., pp. 1–8, 2014.
  • [17] P. Rothenbuehler, J. Runge, F. Garcin, and B. Faltings, “Hidden Markov models for churn prediction,” in Proc. SAI Intelligent Systems Conf., pp. 723–730, 2015.
  • [18] F. Hadiji, R. Sifa, A. Drachen, C. Thurau, K. Kersting, and C. Bauckhage, “Predicting player churn in the wild,” in Proc. IEEE Computational Intelligence and Games Conf., pp. 1–8, 2014.
  • [19] M. Tamassia, W. Raffe, R. Sifa, A. Drachen, F. Zambetta, and M. Hitchens, “Predicting player churn in Destiny: A hidden Markov models approach,” in Proc. IEEE Computational Intelligence and Games Conf., pp. 325–332, IEEE, 2016.
  • [20] R. Sifa, S. Srikanth, A. Drachen, C. Ojeda, and C. Bauckhage, “Predicting retention in sandbox games with tensor factorization-based representation learning,” in Proc. IEEE Computational Intelligence and Games Conf., pp. 142–149, IEEE, 2016.
  • [21] C. Bauckhage, K. Kersting, R. Sifa, C. Thurau, A. Drachen, and A. Canossa, “How players lose interest in playing a game: An empirical study based on distributions of total playing times,” in Proc. IEEE Computational Intelligence and Games Conf., pp. 139–146, IEEE, 2012.
  • [22] R. Sifa, C. Bauckhage, and A. Drachen, “The playtime principle: Large-scale cross-games interest modeling,” in Proc. IEEE Computational Intelligence and Games Conf., pp. 1–8, IEEE, 2014.
  • [23] T. Henderson and S. Bhatti, “Modelling user behaviour in networked games,” in Proc. ACM International Multimedia Conf., (New York, NY, USA), pp. 212–220, ACM, 2001.
  • [24] M. Kwok and G. Yeung, “Characterization of user behavior in a multi-player online game,” in Proc. ACM SIGCHI Int. Advances in Computer Entertainment Technology Conf., (New York, NY, USA), pp. 69–74, ACM, 2005.
  • [25] C. Chambers, W.-c. Feng, S. Sahu, and D. Saha, “Measurement-based characterization of a collection of on-line games,” in Proc. ACM SIGCOMM Internet Measurement Conf., pp. 1–1, USENIX Association, 2005.
  • [26] W.-c. Feng, D. Brandt, and D. Saha, “A long-term study of a popular MMORPG,” in Proc. ACM SIGCOMM Network and System Support for Games Workshop, (New York, NY, USA), pp. 19–24, ACM, 2007.
  • [27] P.-Y. Tarng, K.-T. Chen, and P. Huang, “An analysis of WoW players’ game hours,” in Proc. ACM SIGCOMM Network and System Support for Games Workshop, (New York, NY, USA), pp. 47–52, ACM, 2008.
  • [28] K.-T. Chen, P. Huang, and C.-L. Lei, “Effect of network quality on player departure behavior in online games,” IEEE Trans. Parallel Distrib. Syst., vol. 20, no. 5, pp. 593–606, 2009.
  • [29] D. Pittman and C. GauthierDickey, “A measurement study of virtual populations in massively multiplayer online games,” in Proc. ACM SIGCOMM Network and System Support for Games Workshop, (New York, NY, USA), pp. 25–30, ACM, 2007.
  • [30] D. Pittman and C. GauthierDickey, “Characterizing virtual populations in massively multiplayer online role-playing games,” in Proc. Int. Advances in Multimedia Modeling Conf., (Berlin, Heidelberg), pp. 87–97, Springer-Verlag, 2010.
  • [31] S. Selvin, Survival Analysis for Epidemiologic and Medical Research (Practical Guides to Biostatistics and Epidemiology). Cambridge University Press, 1 ed., 2008.
  • [32] J. F. Lawless, Statistical models and methods for lifetime data. Wiley series in probability and mathematical statistics. Applied probability and statistics, New-York, Chichester, Brisbane: J. Wiley, 1982.
  • [33] D. G. Kleinbaum and M. Klein, Survival Analysis: A Self-Learning Text (Statistics for Biology and Health). Springer, 2nd ed., 2005.
  • [34] T. Allart, G. Levieux, S. Natkin, and A. Guilloux, “Design influence on player retention : A method based on time varying survival analysis,” in Proc. IEEE Computational Intelligence and Games Conf., IEEE, 2016.
  • [35] A. Isaksen, D. Gopstein, and A. Nealen, “Exploring game space using survival analysis,” in Proc. Int. Foundations of Digital Games Conf. (J. P. Zagal, E. MacCallum-Stewart, and J. Togelius, eds.), Society for the Advancement of the Science of Digital Games, 2015.
  • [36] D. F. Moore, Applied survival analysis using R. Use r!, Cham: Springer, 2016.
  • [37] R. J. Cook and J. F. Lawless, The statistical analysis of recurrent events. Statistics for biology and health, New York, London: Springer, 2007.
  • [38] F. E. Harrell, Regression Modeling Strategies, with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. New York: Springer, second edition ed., 2015.