Estimating Latent Demand of Shared Mobility through Censored Gaussian Processes

01/21/2020 · by Daniele Gammelli, et al. (DTU)

Transport demand is highly dependent on supply, especially for shared transport services where availability is often limited. As observed demand cannot be higher than available supply, historical transport data typically represents a biased, or censored, version of the true underlying demand pattern. Without explicitly accounting for this inherent distinction, predictive models of demand would necessarily represent a biased version of true demand, thus less effectively predicting the needs of service users. To counter this problem, we propose a general method for censorship-aware demand modeling, for which we devise a censored likelihood function. We apply this method to the task of shared mobility demand prediction by incorporating the censored likelihood within a Gaussian Process model, which can flexibly approximate arbitrary functional forms. Experiments on artificial and real-world datasets show how taking into account the limiting effect of supply on demand is essential in the process of obtaining an unbiased predictive model of user demand behavior.



1 Introduction

Being able to understand and predict demand is essential in the planning and decision-making processes of any given transport service, allowing service providers to make decisions coherently with user behavior and needs. Having reliable models of demand is especially relevant in shared transport modes, such as car-sharing and bike-sharing, where the high volatility of demand and the flexibility of supply modalities (e.g., infinitely many possible collocations of a car-sharing fleet) require that decisions be made in strong accordance with user needs. If, for instance, we consider the bike sharing scenario, service providers face a great variety of complex decisions on how to satisfy user demand. To name a few, concrete choices must be made for what concerns capacity positioning (i.e., where to deploy the service), capacity planning (i.e., dimensioning the fleet size), rebalancing (i.e., where and when to reallocate idle supply), and expansion planning (i.e., if and how to expand the reach of the service).

Demand modeling uses statistical methods to capture user demand behavior based on recorded historical data. However, historical transport service data is usually highly dependent on the historical supply offered by the provider itself. In particular, supply represents an upper limit on our ability to observe realizations of the true demand. For example, if we have data about a bike-sharing service with $n$ bikes available, we might observe a usage (i.e., demand) of exactly $n$ bikes, even though the actual demand might have potentially been higher. This leads to a situation in which historical data in fact represents a biased, or censored, version of the underlying demand pattern in which we are truly interested. More importantly, using censored data to build demand models will, as a natural consequence, produce a biased estimate of demand and an inaccurate understanding of user needs, which will ultimately result in suboptimal operational decisions for the service provider.

To address these problems, we propose a general approach for building models that are aware of the supply-censoring issue, and which are ultimately more reliable in reflecting user behavior. To this end, we formulate a censored likelihood function and apply it within a Gaussian Process model of user demand. Using both synthetic and real-world datasets of shared mobility services, we pit this model against non-censored models and analyze the conditions under which it is better capable of recovering true demand.

2 Review of Related Work

2.1 Demand Forecasting for Shared Mobility

There exists a long line of literature on demand forecasting, with research mainly focusing on obtaining predictions of demand based on historical data in combination with exogenous inputs, such as weather and temporal information. The bike-sharing context alone already provides many such examples. [Yang, DU201939] use Random Forest models for rental predictions, whereas [rudloff] approach the demand forecasting problem through Generalized Linear Models (GLMs) of counts, namely Poisson and Negative-Binomial regression models. [rixey] apply multivariate regression using data gathered from multiple bike-sharing systems. [Li] define a hierarchical prediction model, based on Gradient Boosting Regression Trees, for both rentals and returns at a city-aggregation level. They model rental proportions for clusters of bike stations through a multi-similarity based inference and predict returns consistently with rentals by learning transition probabilities (i.e., where bikes end up after being rented). A different approach is found in [Singhvi2015PredictingBU], where the proposed prediction model is based on log-log regression and refers to the Origin-Destination (O-D) trip demand, rather than rentals and returns in specific locations or areas. [zhang2016deep, XU201847, Lin2018258] instead propose deep learning-based approaches to model bike in-flow and out-flow for bike-sharing system areas and station-level demand, respectively.

2.2 Censored Modeling

Data censorship involves two corresponding sets of values: true and observed, so that each observed value may be clipped above or below the respective true value. Observations that have not been clipped are called non-censored, and correspond exactly with true values. All other observations are called censored as they have been clipped at their observed values, therefore giving only a limited account of the corresponding true values (e.g., observed demand not corresponding with true demand). We next review methods for modeling censored data, namely, censored modeling.

An early form of censored modeling is the Probit model [aldrich1984linear] for binary (0/1) observations, which assumes that the probability of observing a 1 depends linearly on the given explanatory features. James Tobin extended this to a model for censored regression [tobin], now called Tobit, which assumes that the latent variable depends on the explanatory features linearly with Gaussian noise, and that observations are censored according to a fixed and known threshold. Because the censored observations divert Least Squares estimators away from the true, latent values, Tobit linear coefficients are instead commonly estimated via Maximum Likelihood techniques, such as Expectation Maximization or Heckman's Two-Step estimator [amemiya1984tobit].

The Tobit model has since become a cornerstone of censored modeling in multiple research fields, including econometrics and survival analysis [Bewick2004]. It has been extended through multiple variations [Greene2011], such as: multiple latent variables, e.g., one variable determines which observations are non-censored while another determines their values [amemiya1985advanced]; Tobit-like models of censored count data [terza1985tobit]; Tobit Quantile Regression [powell1986censored]; dynamic, autoregressive Tobit [wei1999bayesian]; and combination with the Kalman Filter [allik2015tobit]. Other methods of censored regression have also been suggested, predominantly for survival analysis, such as: Proportional Hazard models [kay1977proportional]; Accelerated Failure Time models [wei1992accelerated]; Regularized Linear Regression [li2016regularized]; and Deep Neural Networks [biganzoli2002general, wu2018deep].

As reflected in all the above works, research into censored modeling commonly aims to reconstruct the latent process that yields the true values. In this work too, we aim to predict true values, not the occurrence of clipping nor the actually observed values. Similarly to Tobit, we too assume that all observations are independent and that each observation is known to be either censored or non-censored. However, contrary to all the above works, our method employs Gaussian Processes, which are non-parametric and do not assume any specific functional relationship between the explanatory features and the latent variable. A theoretical treatment of non-parametric censored regression appears in [lewbel2002nonparametric], where the authors derive estimators for the latent variable with fixed censoring, analyze them mathematically, and apply them to a single, simulated dataset. In contrast, this work proposes a likelihood function and applies it to multiple, real-world datasets with dynamic censoring.

2.3 Common Approaches to Handling Demand Censorship

Within the above stream of research, the censoring problem discussed in Section 1 is widely acknowledged. However, to the best of our knowledge, there has been no extensive study of how observed demand can differ from the true, underlying demand and how these differences could impact predictive models. To address this issue, common approaches resort to various data cleaning techniques. For example, [rudloff, omahony2015citibike, ALBINSKI201859, goh] attempt to avoid the bias induced by censored observations by filtering out, before modeling, the time periods where censoring might have occurred. As a further example, [ALBINSKI201859] substitute the censored observations with the mean of the historical (non-censored) observations for the same period. A different but related approach is proposed by [jian, Freund2019], who focus on obtaining an unbiased estimate of the arrival rate by omitting historical instances where bikes were unavailable.

These common approaches manage, to some degree, to correct the bias characterizing observed demand. However, they also highlight that this problem represents an important gap which requires further study in order to obtain reliable predictions for shared transport and other fields. We believe there are two main reasons why a more structured view of the censoring problem is needed: (i) the approaches described above might not be applicable in many real-world scenarios in which the number of censored observations is very high, leading either to an excessive elimination of useful data or to an inadequate imputation of the censored data; (ii) rather than resorting to cleaning and imputation procedures such as the ones above, it would be desirable to equip the forecasting model with some awareness of the censoring problem, so that the model can utilize the entire information captured in the observations to coherently adjust its predictions.

3 Methodology

In this Section, we incrementally describe the building blocks of our proposed censored models. First, we introduce several general concepts: Tobit likelihood, Gaussian Processes, and the role of kernels. Then, we combine these concepts by defining Censored Gaussian Processes along with a corresponding inference procedure.

3.1 Tobit Likelihood

As a reference point for developing our censored likelihood function, let us now elaborate on the likelihood function of the popular Tobit censored regression model, described in Section 2.2. For each observation $i$, let $y_i$ be the observed value and $y_i^*$ the corresponding true value. For instance, in a shared transport demand setting, $y_i^*$ is the true, latent demand for shared mobility, while $y_i$ is the observed demand; if $y_i$ is non-censored then $y_i = y_i^*$, otherwise $y_i$ is censored so that $y_i \le y_i^*$. We are also given binary censorship labels $l_i \in \{0, 1\}$, so that $l_i = 1$ if $y_i$ is censored and $l_i = 0$ otherwise (e.g., labels could be recovered by comparing observed demand to available supply).

Tobit parameterizes the dependency of $y_i^*$ on explanatory features $\mathbf{x}_i$ through a linear relationship with parameters $\boldsymbol{\beta}$ and noise terms $\varepsilon_i$, where all $\varepsilon_i$ are independently and normally distributed with mean zero and variance $\sigma^2$, namely:

$$y_i^* = \mathbf{x}_i^\top\boldsymbol{\beta} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \qquad (1)$$

There are multiple variations of the Tobit model depending on where and when censoring arises. In this work, without loss of generality, we deal with upper censorship, also known as Type I, where $y_i$ is upper-bounded by a given threshold $u_i$, so that:

$$y_i = \min\left(y_i^*, u_i\right) \qquad (2)$$

The likelihood function in this case can be derived from Eqs. 1 and 2, as follows.

  1. If $l_i = 0$, then $y_i$ is non-censored and so its likelihood is:

    $$p(y_i \mid \mathbf{x}_i) = \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - \mathbf{x}_i^\top\boldsymbol{\beta}}{\sigma}\right) \qquad (3)$$

    where $\phi$ is the standard Gaussian probability density function.

  2. Otherwise, i.e., if $l_i = 1$, then $y_i$ is censored and so its likelihood is:

    $$p(y_i \mid \mathbf{x}_i) = 1 - \Phi\!\left(\frac{u_i - \mathbf{x}_i^\top\boldsymbol{\beta}}{\sigma}\right) \qquad (4)$$

    where $\Phi$ is the standard Gaussian cumulative distribution function.

Because all observations are assumed to be independent, their joint likelihood is:

$$\mathcal{L}(\boldsymbol{\beta}, \sigma) = \prod_{i \,:\, l_i = 0} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - \mathbf{x}_i^\top\boldsymbol{\beta}}{\sigma}\right) \;\prod_{i \,:\, l_i = 1} \left[1 - \Phi\!\left(\frac{u_i - \mathbf{x}_i^\top\boldsymbol{\beta}}{\sigma}\right)\right] \qquad (5)$$

which is a function of $\boldsymbol{\beta}$ and $\sigma$.
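To make the maximum-likelihood estimation of Eq. 5 concrete, the following sketch (an illustration, not code from the paper) maximizes the Tobit log-likelihood numerically with SciPy; the array names `X`, `y`, `censored` and `u` are assumptions of this example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def tobit_neg_log_likelihood(params, X, y, censored, u):
    """Negative log of the Tobit likelihood in Eq. 5 (Type-I, upper censoring).

    params   : [beta_1, ..., beta_D, log_sigma]
    X        : (n, D) feature matrix
    y        : (n,) observed values
    censored : (n,) boolean censorship labels l_i
    u        : (n,) censoring thresholds
    """
    beta, sigma = params[:-1], np.exp(params[-1])  # log-parameterization keeps sigma > 0
    mu = X @ beta
    # Non-censored observations contribute the Gaussian density (Eq. 3).
    ll_obs = norm.logpdf(y[~censored], loc=mu[~censored], scale=sigma)
    # Censored observations contribute the tail probability P(y* >= u) (Eq. 4).
    ll_cens = norm.logsf(u[censored], loc=mu[censored], scale=sigma)
    return -(ll_obs.sum() + ll_cens.sum())


def fit_tobit(X, y, censored, u):
    """Maximum-likelihood estimate of (beta, sigma) by direct numerical optimization."""
    init = np.zeros(X.shape[1] + 1)
    res = minimize(tobit_neg_log_likelihood, init, args=(X, y, censored, u),
                   method="L-BFGS-B")
    return res.x[:-1], np.exp(res.x[-1])
```

Direct numerical maximization is only one option; Expectation Maximization or Heckman's Two-Step estimator, mentioned above, are common alternatives.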

3.2 Gaussian Processes

Gaussian Processes (GPs) [rasmussen2005gaussian] are an extremely powerful and flexible tool belonging to the field of probabilistic machine learning [Ghahramani2015]. GPs have been applied successfully to both classification and regression tasks in various transport-related scenarios, such as travel times [rodr_boris, Ide2009], congestion models [Liu:2013:ACR:2487575.2487598], crowdsourced data [rodr_pereira, rodr_Henri], and traffic volumes [Xie]. For example, given a finite set of points for regression, there are typically infinitely many functions which fit the data, and GPs offer an elegant approach to this problem by assigning a probability to every possible function. Moreover, GPs implicitly adopt a full probabilistic approach, thus enabling the structured quantification of the confidence – or equivalently, the uncertainty – in the predictions of a GP model. This ease of uncertainty quantification is one of the principal reasons why we chose to use GPs for demand prediction in the shared mobility domain. Indeed, transport service providers are not only interested in more accurate demand models, but also, and perhaps most importantly, wish to make operational decisions based on the measure with which the model is confident of its predictions.

Given a dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with input vectors $\mathbf{x}_i$ and scalar outputs $y_i$, the goal of GPs is to learn the underlying data distribution by defining a distribution over functions. GPs model this distribution by placing a multivariate Gaussian distribution, defined by a mean function and a covariance function (between all possible pairwise combinations of input points in the dataset), or kernel, on a latent variable $\mathbf{f}$. More concretely, a GP can be seen as a collection of random variables which have a joint Gaussian distribution. GPs are therefore a Bayesian approach which assumes a priori that function values behave according to:

$$p(\mathbf{f} \mid \mathbf{x}_1, \ldots, \mathbf{x}_n) = \mathcal{N}(\mathbf{f} \mid \mathbf{m}, \mathbf{K}) \qquad (6)$$

where $\mathbf{f} = [f_1, \ldots, f_n]^\top$ is the vector of latent random variables, $\mathbf{m}$ is a mean vector, and $\mathbf{K}$ is a covariance matrix with entries defined by a covariance function $k(\cdot, \cdot)$, so that $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$. As is customary in GP modeling, we shall assume without loss of generality that the joint Gaussian distribution is centered on $\mathbf{m} = \mathbf{0}$. Also, because the response variable is continuous, we model each $y_i$ as generated from a Gaussian distribution centered on $f_i$ with noise variance $\sigma^2$.

3.3 Kernels

A fundamental step in Gaussian Process modeling is the choice of the covariance matrix $\mathbf{K}$. This not only describes the shape of the underlying multivariate Gaussian distribution, but also determines the characteristics of the function we want to model. Intuitively, because entry $K_{ij}$ defines the correlation between the $i$-th and the $j$-th random variables, the kernel describes a measure of similarity between points, ultimately controlling the shape that the GP function distribution adopts.

Multiple kernels can be combined, e.g., by addition and multiplication, to generate more complex covariance functions, thus allowing for great flexibility and better encoding of domain knowledge in the regression model. For the datasets in our experiments (Section 4), we use different combinations of the following three kernels, where $r = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert$ denotes the Euclidean distance between vectors $\mathbf{x}_i$ and $\mathbf{x}_j$:

  1. Squared Exponential Kernel (SE)

    $$k_{\mathrm{SE}}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\left(-\frac{r^2}{2\ell^2}\right) \qquad (7)$$

    The SE kernel (also called RBF: Radial Basis Function) intuitively encodes the idea that nearby points should be correlated, therefore generating relatively smooth functions. The kernel is parameterized by variance $\sigma_f^2$ and length scale $\ell$: for larger $\ell$, farther points are considered more similar; $\sigma_f^2$ acts as an additional scaling factor, so that a larger $\sigma_f^2$ corresponds to a higher spread of functions around the mean.

  2. Periodic Kernel

    $$k_{\mathrm{Per}}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\!\left(-\frac{2\sin^2(\pi r / p)}{\ell^2}\right) \qquad (8)$$

    The Periodic kernel allows modeling functions with repeated patterns, and can thus extrapolate seasonality in time-series data. Parameters $\sigma_f^2$ and $\ell$ have the same effect as for the SE kernel, while parameter $p$ corresponds to the period length of the patterns.

  3. Matérn Kernel

    $$k_{\mathrm{Mat}}(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right) \qquad (9)$$

    where $\Gamma$ is the Gamma function, $K_\nu$ is the modified Bessel function of the second kind [mabramowitz64:handbook], and, as before, $\ell$ is the lengthscale and $\sigma_f^2$ is the variance. The Matérn kernel can be considered a generalization of the SE kernel, parameterized by the positive parameters $\nu$ and $\ell$, where lower values of $\nu$ yield less smooth functions.

As is common, we select kernel parameters in our experiments through Type-II Maximum Likelihood, also known as Empirical Bayes, whereby latent variables are integrated out.
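As a brief illustration of how such kernels can be composed in GPy (the package used later in Section 3.4.2), the snippet below builds the three kernels above and combines them by addition and multiplication; the parameter values are arbitrary placeholders, not the values used in our experiments.

```python
import GPy

# One-dimensional input, e.g., a temporal index.
se = GPy.kern.RBF(input_dim=1, variance=1.0, lengthscale=10.0)          # SE kernel (Eq. 7)
periodic = GPy.kern.StdPeriodic(input_dim=1, period=7.0)                # Periodic kernel (Eq. 8)
matern = GPy.kern.Matern32(input_dim=1, variance=1.0, lengthscale=5.0)  # Matern kernel (Eq. 9), nu = 3/2

# Kernels can be combined by addition and multiplication, e.g., a
# locally-periodic component (SE * Periodic) plus a Matern component.
composite = se * periodic + matern

# Kernel parameters are then typically selected via Type-II Maximum
# Likelihood, e.g.:
#   model = GPy.models.GPRegression(X, Y, composite)
#   model.optimize()
```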

3.4 Gaussian Process with Censored Likelihood

We have so far presented two separate modeling approaches: Tobit models for censored data, and non-parametric Gaussian Processes for flexible regression of arbitrarily complex patterns. For non-parametric censored regression, we now combine the two approaches within one model by defining a censorship-aware likelihood function for GPs:

$$p(\mathbf{y} \mid \mathbf{f}) = \prod_{i \,:\, l_i = 0} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - f_i}{\sigma}\right) \;\prod_{i \,:\, l_i = 1} \left[1 - \Phi\!\left(\frac{u_i - f_i}{\sigma}\right)\right] \qquad (10)$$

Eq. 10 is obtained from Eq. 5 by replacing $\mathbf{x}_i^\top\boldsymbol{\beta}$, the Tobit prediction, with $f_i$, the GP prediction. We next propose an inference procedure for obtaining the posterior distribution of the Censored GP with this likelihood function.
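For concreteness, below is a minimal sketch of evaluating the log of Eq. 10 at a given vector of latent GP values; the array names (`f`, `y`, `u`, `censored`) and the noise parameter `sigma` are placeholders for illustration.

```python
import numpy as np
from scipy.stats import norm


def censored_gp_log_likelihood(f, y, u, censored, sigma):
    """Log of the censorship-aware likelihood of Eq. 10, evaluated at latent values f.

    f        : (n,) latent GP values
    y        : (n,) observed demand
    u        : (n,) censoring thresholds (e.g., available supply)
    censored : (n,) boolean censorship labels l_i
    sigma    : observation noise standard deviation
    """
    log_terms = np.where(
        censored,
        norm.logsf(u, loc=f, scale=sigma),   # censored: log(1 - Phi((u - f) / sigma))
        norm.logpdf(y, loc=f, scale=sigma),  # non-censored: Gaussian log-density
    )
    return log_terms.sum()
```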

3.4.1 Inference

In a Bayesian setting, we are interested in computing the posterior over the latent $\mathbf{f}$, given the responses $\mathbf{y}$ and features $\mathbf{X}$. Bayes' rule defines this posterior exactly:

$$p(\mathbf{f} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{X})}{p(\mathbf{y} \mid \mathbf{X})} \qquad (11)$$

However, exact calculation of the denominator in Eq. 11 is intractable because of the Censored GP likelihood (Eq. 10), as also explained in [rasmussen2005gaussian]. The posterior distribution thus needs to be estimated approximately, which we do in this study via Expectation Propagation (EP) [minka2001expectation]. EP alleviates the denominator intractability by approximating the likelihood of each observation through an un-normalized Gaussian in the latent $f_i$, as:

$$p(y_i \mid f_i) \approx t_i\!\left(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2\right) = \tilde{Z}_i\, \mathcal{N}\!\left(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2\right) \qquad (12)$$

where $t_i$ is the $i$-th factor, or site, with site parameters $\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$. The likelihood is then approximated as a product of the independent local likelihood approximations:

$$p(\mathbf{y} \mid \mathbf{f}) \approx \prod_{i=1}^{n} t_i\!\left(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2\right) = \mathcal{N}\!\left(\tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\Sigma}}\right) \prod_{i=1}^{n} \tilde{Z}_i \qquad (13)$$

where $\tilde{\boldsymbol{\mu}} = [\tilde{\mu}_1, \ldots, \tilde{\mu}_n]^\top$, and $\tilde{\boldsymbol{\Sigma}}$ is a diagonal covariance matrix with $\tilde{\Sigma}_{ii} = \tilde{\sigma}_i^2$. Consequently, the posterior is approximated by

$$q(\mathbf{f} \mid \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \qquad (14)$$

where $\boldsymbol{\mu} = \boldsymbol{\Sigma}\tilde{\boldsymbol{\Sigma}}^{-1}\tilde{\boldsymbol{\mu}}$, $\boldsymbol{\Sigma} = \left(\mathbf{K}^{-1} + \tilde{\boldsymbol{\Sigma}}^{-1}\right)^{-1}$, and $Z_{\mathrm{EP}} \approx p(\mathbf{y} \mid \mathbf{X})$ is the EP approximation to the marginal likelihood.

The key idea in EP is to update the single site approximations sequentially by iterating the following four steps. Firstly, given a current approximation of the posterior, the current site $t_i$ is left out from this approximation, defining a cavity distribution:

$$q_{-i}(f_i) \propto \mathcal{N}\!\left(f_i \mid \mu_{-i}, \sigma_{-i}^2\right) \qquad (15)$$

where $\mu_{-i} = \sigma_{-i}^2\left(\sigma_i^{-2}\mu_i - \tilde{\sigma}_i^{-2}\tilde{\mu}_i\right)$ and $\sigma_{-i}^2 = \left(\sigma_i^{-2} - \tilde{\sigma}_i^{-2}\right)^{-1}$, with $\mu_i$ and $\sigma_i^2$ denoting the marginal mean and variance of the current approximate posterior at $f_i$.

The second step is the computation of the target non-Gaussian marginal, combining the cavity distribution with the exact likelihood $p(y_i \mid f_i)$:

$$\hat{q}(f_i) \propto q_{-i}(f_i)\, p(y_i \mid f_i) \qquad (16)$$

Thirdly, a Gaussian approximation to this non-Gaussian marginal is chosen such that it best approximates the product of the cavity distribution and the exact likelihood:

$$\hat{q}(f_i) \approx \hat{Z}_i\, \mathcal{N}\!\left(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2\right) \qquad (17)$$

To achieve this, EP uses the property that if $q(f_i)$ is Gaussian, then the distribution $q(f_i)$ which minimizes the Kullback-Leibler divergence $\mathrm{KL}\!\left(\hat{q}(f_i)\,\|\,q(f_i)\right)$ is the distribution whose first and second moments match those of $\hat{q}(f_i)$. In addition, given that $\hat{q}(f_i)$ is un-normalized, EP also imposes the condition that the zero-th moments should match. Formally then, EP minimizes the following objective:

$$\mathrm{KL}\!\left(q_{-i}(f_i)\, p(y_i \mid f_i)\;\Big\|\;\hat{Z}_i\, \mathcal{N}\!\left(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2\right)\right) \qquad (18)$$

Lastly, EP chooses the site parameters of $t_i$ for which the approximate posterior $q$ has a marginal matching the one obtained in the third step.

3.4.2 Censored Likelihood Moments

The implementation of EP in our experiments is based on GPy (https://sheffieldml.github.io/GPy/), an open source Gaussian Processes package in Python. We have implemented the censored likelihood and defined the following moments for the EP inference procedure:

$$\hat{Z}_i = \int p(y_i \mid f_i)\, q_{-i}(f_i)\, df_i, \qquad \hat{\mu}_i = \mathbb{E}_{\hat{q}}[f_i], \qquad \hat{\sigma}_i^2 = \mathbb{V}_{\hat{q}}[f_i] \qquad (19)$$

The moments of our censored likelihood (Eq. 10) can be derived analytically, depending on the censoring upper limit $u_i$, as follows (for a detailed derivation, see [groot2012gaussian]).

  • If $y_i < u_i$, then:

    $$\hat{Z}_i = \mathcal{N}\!\left(y_i \mid \mu_{-i},\, \sigma^2 + \sigma_{-i}^2\right), \quad \hat{\mu}_i = \mu_{-i} + \frac{\sigma_{-i}^2\left(y_i - \mu_{-i}\right)}{\sigma^2 + \sigma_{-i}^2}, \quad \hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4}{\sigma^2 + \sigma_{-i}^2} \qquad (20)$$

  • Otherwise, $y_i = u_i$, and:

    $$\hat{Z}_i = \Phi(z_i), \quad \hat{\mu}_i = \mu_{-i} + \frac{\sigma_{-i}^2\, \phi(z_i)}{\Phi(z_i)\sqrt{\sigma^2 + \sigma_{-i}^2}}, \quad \hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4\, \phi(z_i)}{\left(\sigma^2 + \sigma_{-i}^2\right)\Phi(z_i)}\left(z_i + \frac{\phi(z_i)}{\Phi(z_i)}\right) \qquad (21)$$

    where $\Phi$ is the Gaussian cumulative distribution function, $\phi$ is the Gaussian probability density function, and $z_i = \dfrac{\mu_{-i} - u_i}{\sqrt{\sigma^2 + \sigma_{-i}^2}}$.
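As a sanity check on analytical moment expressions of this kind, the moments of the tilted distribution $q_{-i}(f_i)\, p(y_i \mid f_i)$ can also be computed numerically; the sketch below does so by quadrature for a single site, with all inputs chosen arbitrarily for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm


def site_likelihood(f, y, u, is_censored, sigma):
    """One factor of the censored likelihood (Eq. 10) for a single observation."""
    if is_censored:
        return norm.sf(u, loc=f, scale=sigma)   # 1 - Phi((u - f) / sigma)
    return norm.pdf(y, loc=f, scale=sigma)


def tilted_moments(mu_cav, var_cav, y, u, is_censored, sigma):
    """Zero-th moment, mean and variance of q_{-i}(f) * p(y | f), by quadrature."""
    sd_cav = np.sqrt(var_cav)

    def integrand(f, k):
        return (f ** k) * norm.pdf(f, loc=mu_cav, scale=sd_cav) \
            * site_likelihood(f, y, u, is_censored, sigma)

    lo, hi = mu_cav - 10.0 * sd_cav, mu_cav + 10.0 * sd_cav
    Z = quad(integrand, lo, hi, args=(0,))[0]
    mean = quad(integrand, lo, hi, args=(1,))[0] / Z
    second = quad(integrand, lo, hi, args=(2,))[0] / Z
    return Z, mean, second - mean ** 2


# Example: a censored site with cavity N(2, 1), threshold u = 3 and noise sigma = 0.5.
print(tilted_moments(mu_cav=2.0, var_cav=1.0, y=3.0, u=3.0, is_censored=True, sigma=0.5))
```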

4 Experiments

In this Section, we conduct experiments on several datasets. First, to convey a high-level intuition of how censored models can have an advantage over non-censored models, we start with two relatively simple datasets: synthetically generated data in Section 4.2, and real-world motorcycle accident data in Section 4.3. Equipped with this intuition, we then proceed to more elaborate real-world data: in Section 4.4, we build models to predict bike pickups for a major bike-sharing service provider in Copenhagen, Denmark.

4.1 Models

As introduced in previous sections, the focus of our experiments (source code available at https://github.com/DanieleGammelli/CensoredGP) is the comparison between Censored and Non-Censored models in the estimation of true demand patterns. We thus compare three GP models:

  1. Non-Censored Gaussian Process (NCGP): represents the Gaussian Process model most commonly used in the literature, i.e., with a Gaussian observation likelihood. NCGP is trained on the entire dataset, consisting of both censored and non-censored observations, without discerning between them.

  2. Non-Censored Gaussian Process, Aware of censorship (NCGP-A): functionally equivalent to NCGP, but uses information on censoring as a pre-processing step. That is, NCGP-A is trained only on non-censored points, thus avoiding exposure to a biased version of the true demand (because of censoring). This, however, comes at the cost of ignoring relevant information embedded in the censored data.

  3. Censored Gaussian Process (CGP): this model considers all observations – censored and non-censored – through the likelihood function defined in Eq. 10 (Section 3.4).

4.2 Synthetic Dataset

To demonstrate the capabilities of our approach, we begin with a simple, synthetically generated dataset, where the censoring pattern has a relevant and meaningful structure. We define a sinusoidal latent function $f(x)$ over a fixed interval:

(22)

which we wish to recover from censored, noisy observations. We generate these observations at a set of input points by repeatedly evaluating $f$ and adding a noise term, which is independently sampled from a zero-mean Gaussian distribution. The observations are then censored from above, so that values exceeding the censoring threshold are clipped to it. As seen in Fig. 1(a), this censoring process generates a cloud of points far from the true dynamics of the latent function at the peaks of the sinusoidal oscillation.
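Since the exact latent function and censoring threshold of Eq. 22 are not reproduced above, the sketch below generates data with the same qualitative structure (a sinusoidal latent function, additive Gaussian noise, and upper censoring that clips the peaks); the specific constants are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def latent_f(x):
    """Stand-in for the sinusoidal latent function of Eq. 22 (exact form assumed)."""
    return np.sin(x)


n = 200
x = np.linspace(0.0, 6.0 * np.pi, n)
noise_sigma = 0.1      # assumed noise level
threshold = 0.6        # assumed upper censoring threshold (clips the peaks)

y_true = latent_f(x) + rng.normal(0.0, noise_sigma, size=n)
censored = y_true > threshold            # censorship labels l_i
y_obs = np.minimum(y_true, threshold)    # observed, clipped values (cf. Eq. 2)
```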

We fit NCGP, NCGP-A and CGP to the observations and measure how well each of them recovers the true $f$. The resulting predictions of the three models in Fig. 1(b), 1(c), 1(d) highlight interesting aspects worth underlining:

  • Fig. 1(b): NCGP predicts an approximately linear trend for the observed data, which, if we did not know the true function behind our observations, would seem like a reasonable conclusion. Nevertheless, the predictions are clearly far from the true data generating process.

  • Fig. 1(c): By discarding the censored observations, NCGP-A is able to correctly assess the behavior of the underlying function in regions close to observable data. However, NCGP-A is not able to generalize to regions affected by censoring.

  • Fig. 1(d): CGP, on the other hand, is able to exploit its knowledge of censoring to make better sense of the observable data. Through the use of a censored likelihood (Eq. 10), CGP assigns higher probability to plausible functional patterns, which are coherent not only with the observable data but also with the censoring process.

Figure 1: Synthetically generated data (a) and model fits for NCGP (b), NCGP-A (c) and CGP (d). The dashed line is the latent true function $f$; crosses and circles are, respectively, censored and non-censored observations; the continuous line and the shaded area are, respectively, the posterior mean over the GP latent variable and the corresponding posterior confidence interval.

4.3 Real-World Dataset 1: Motorcycle Accident

Having demonstrated our approach on an artificial dataset, we now make use of the openly accessible Motorcycle Accident dataset [motorcycle], used in the literature for a wide variety of statistical learning tasks. For this experiment, we aim at exploring how variations in the severity of the censoring process impact the models' performance. In particular, we focus on two axes of variation: the percentage of censored points, which controls how often the true function is observable, and the intensity of censoring, which controls how far censored observations are from true values. Concretely, given a percentage of censored points and a maximum censorship intensity, we apply the following censoring process to the Motorcycle data (a sketch is given after the list below).

  1. Initialize each observation to its true value, $y_i = y_i^*$ for all $i$.

  2. Define the set $\mathcal{C}$ of censored observations by randomly selecting the chosen percentage of elements from the dataset.

  3. For each $i \in \mathcal{C}$, independently sample a censorship intensity, and censor the $i$'th observation by shifting it away from its true value accordingly.
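A sketch of this censoring procedure follows; since the exact sampling distribution of the censorship intensities is not reproduced here, the uniform choice and the multiplicative shrinkage below are assumptions made for illustration only.

```python
import numpy as np


def apply_censoring(y_true, pct_censored, max_intensity, seed=0):
    """Censor a random fraction of observations, moving them away from their true values.

    pct_censored  : percentage of censored points (how often the true value is hidden)
    max_intensity : bound on the sampled censorship intensity (how far censored
                    observations end up from the true values)
    NOTE: the uniform sampling and the multiplicative shrinkage used here are
    illustrative assumptions, not necessarily the exact scheme of the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(y_true)
    y = np.asarray(y_true, dtype=float).copy()                     # step 1: start from true values
    n_censored = int(round(pct_censored * n))
    censored_idx = rng.choice(n, size=n_censored, replace=False)   # step 2: pick censored set
    intensity = rng.uniform(0.0, max_intensity, size=n_censored)   # step 3: sample intensities
    y[censored_idx] = (1.0 - intensity) * y[censored_idx]          # ...and censor
    labels = np.zeros(n, dtype=bool)
    labels[censored_idx] = True
    return y, labels
```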

We explore moderate to extreme censoring through an increasing percentage of censored points and an increasing censorship intensity. Figs. 2, 3 and 4 show how each of NCGP, NCGP-A and CGP, respectively, reacts to these variations: the percentage of censored observations increases from top to bottom, and the censorship intensity increases from left to right. We see that:

  • Fig. 2: Despite being able to successfully recover the latent function for small censorship intensities (first row), NCGP is then exposed to an excessive amount of bias because of the increased censoring. This results in completely distorted predictions, as the remaining non-censored points are essentially treated as outliers by the model.

  • Fig. 3: On the other hand, NCGP-A manages to avoid the bias coming from censored points by ignoring them, as long as the non-censored observations characterize the latent function well. When this condition no longer holds, NCGP-A is unable to generalize to unobserved domains, so that it yields biased predictions with high uncertainty.

  • Fig. 4: Lastly, CGP not only manages to avoid the bias introduced by censored observations in moderate to high censoring scenarios, but is also able to couple information coming from both censored and non-censored observations to achieve better predictions in regions of extreme censoring.

Figure 2: NCGP fit for intensifying censorship of the motorcycle accident data.
Figure 3: NCGP-A fit for intensifying censorship of the motorcycle accident data.
Figure 4: CGP fit for intensifying censorship of the motorcycle accident data.

4.4 Real-World Dataset 2: Bike-Sharing Demand and Supply

In this Section, we deal with the problem of building a demand prediction model for a bike-sharing system. We use real-world data provided by Donkey Republic, a major bike-sharing provider in the city of Copenhagen, Denmark. Donkey Republic can be considered a hub-based service, meaning that the user of the service is not free to pick up or drop off a bike in any location, but is restricted to a certain number of virtual hubs around Copenhagen. Our objective is to model daily rental demand in the hub network.

4.4.1 Data Aggregation

Figure 5: Map with: 1) marked locations of Donkey Republic hubs, 2) the three super-hubs in our experiments, as big circles around constituent hubs.

The given data consists of individual records of users renting and returning bikes at the hubs over the period from 1 March 2018 until 14 March 2019. Hence, before modeling daily rentals, we aggregate the data both spatially and temporally. Spatially, hubs were aggregated into three super-hubs by selecting three main service areas (such as the central station and main tourist attractions) and considering a radius around these points of interest (Fig. 5). The choice of radius is justified by Donkey Republic business insights, whereby users typically agree to walk only a limited distance to rent a bike and typically give up if the bike is farther. Temporally, the data at our disposal allowed us to retrieve the time series of daily rental pickups for the three super-hubs, which represent the target of our prediction model.

The spatial aggregation of individual hubs into super-hubs is an important modeling step in building the prediction model, for two main reasons:

  1. Time-series for individual hubs reveal excessive fluctuations in rental behavior. Hence separate treatment of individual hubs could likely hide most of the valuable regularities in the data and ultimately expose the predictive models to an excessive amount of noise.

  2. As seen in Fig. 5, the individual hubs are very close to one another, especially in central areas of the city. Demand patterns between neighboring hubs can thus be highly correlated, and a bike-sharing provider would actually benefit more from understanding the demand in an entire area rather than in a single hub. Moreover, bike-sharing demand conceptually covers an entire area, as bike-sharing users would likely walk a few dozen meters more to rent a bike. Hence, from here on, we assume that if no bikes are available at some hub, then users who wish to rent a bike are willing to walk to any other hub in the same super-hub.

4.4.2 Censorship of Bike-Sharing Data

Ideally, before modeling, we would like to have access to the true bike-sharing demand, free of any real-world censorship. However, this ideal setting is impossible, as historical data records are intrinsically censored to some extent. Consequently, and for the sake of experimentation, we assume that the given historical data represents true demand (which is what we would ideally like to predict). This further allows us to censor the data manually and examine the effects of such censorship.

We apply manual censorship to the time series of each super-hub in two stages. In the first stage, for each day $t$, we let $b_t \in \{0, 1\}$ indicate whether at any moment during $t$ there were no bikes available in the entire super-hub, and define accordingly the sets of censored and non-censored observations:

$$\mathcal{C} = \{t : b_t = 1\} \qquad (23)$$
$$\mathcal{N} = \{t : b_t = 0\} \qquad (24)$$

We then fix binary censorship labels as follows: $l_t = 1$ for $t \in \mathcal{C}$ and $l_t = 0$ for $t \in \mathcal{N}$. The reason for doing so is that for every day in $\mathcal{C}$, there was a moment with zero bike availability, and so there may have been additional demand, which the service could not satisfy and which was thus not recorded.

Having fixed the censorship labels, the second stage of censorship can be executed multiple times for different censorship intensities. That is, given a censorship intensity $\gamma \in [0, 1]$, we censor each observation $y_t$ for which $l_t = 1$ to $(1 - \gamma)$ of its original value. For example, Fig. 6 shows the true demand (red) as well as the result of censoring it (grey) for each day in $\mathcal{C}$ (blue markers).
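A compact sketch of this two-stage procedure follows; the array names are hypothetical, and the zero-availability indicator is assumed to have been computed beforehand from the raw rental and availability records.

```python
import numpy as np


def censor_super_hub(demand, no_bikes_available, gamma):
    """Two-stage manual censorship of a super-hub demand time series.

    demand             : (T,) daily rentals, treated as true demand in the experiments
    no_bikes_available : (T,) boolean, True if the super-hub ran out of bikes at any
                         moment of day t (stage 1, defining the sets in Eqs. 23-24)
    gamma              : censorship intensity in [0, 1] (stage 2)
    """
    labels = np.asarray(no_bikes_available, dtype=bool)      # l_t = 1 for days in C
    censored_demand = np.asarray(demand, dtype=float).copy()
    censored_demand[labels] *= (1.0 - gamma)                 # keep (1 - gamma) of the true value
    return censored_demand, labels
```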

Figure 6: Super-hub 3 data, manually censored with a fixed censorship intensity.

4.4.3 Modeling and Evaluation

As in previous experiments, we train the models on the manually censored demand and then evaluate them on the observations without censorship – this is the measure of actual interest, as it represents true demand. Evaluation is done by comparing the posterior mean to the true demand through two commonly used measures: the coefficient of determination ($R^2$) and the Root Mean Squared Error (RMSE), as detailed in Appendix A.1.

As defined in Section 4.1, the three models are NCGP, NCGP-A and CGP. We equip each model with a kernel that consists of a combination of SE and Periodic kernels on the temporal index feature and a Matérn kernel on the weather data. Namely, the covariance function between the $i$'th and the $j$'th sample is:

(25)

where $t_i$ is the temporal index of sample $i$, and $\mathbf{w}_i$ consists of the weather features detailed in Table 1. Concretely, we estimate kernel parameters through Type-II Maximum Likelihood using Adam (CGP) and BFGS (NCGP, NCGP-A) to perform gradient ascent. This choice of optimization routines performed best in practice for each respective model.
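One plausible realization of such a composite kernel in GPy is sketched below, assuming (for illustration only) an additive combination, with the temporal index in column 0 of the design matrix and the weather features of Table 1 in the remaining columns; the exact composition used in the paper is the one defined by Eq. 25 and may differ from this sketch.

```python
import GPy

n_weather = 15  # number of weather features listed in Table 1
weather_dims = list(range(1, 1 + n_weather))

# Temporal components act on the day index (column 0 of the design matrix),
# while the Matern component acts on the weather columns.
k_time = (GPy.kern.RBF(input_dim=1, active_dims=[0])
          + GPy.kern.StdPeriodic(input_dim=1, active_dims=[0]))
k_weather = GPy.kern.Matern32(input_dim=n_weather, active_dims=weather_dims)
kernel = k_time + k_weather

# Kernel parameters are then estimated via Type-II Maximum Likelihood, e.g.:
#   model = GPy.models.GPRegression(X, Y, kernel)
#   model.optimize()
```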

Solar Irradiance: Global short wave; Short wave horizontal; Short wave direct normal; Downwelling horizontal long wave
Wind: Direction min.; Direction max.; Direction avg.; Speed avg.; Speed max.
Air: Temperature; Relative humidity; Pressure
Rain: Fall; Duration; Intensity
Table 1: Weather features for bike-sharing data, collected in Copenhagen by [dtuclimate].

The main goal of our experiments is to analyze how well these models can recover the true demand pattern after being trained on a censored version of the same demand. In particular, we are interested in investigating to what degree censored models (CGP) are able to recover the underlying demand pattern compared to standard regression models (NCGP, NCGP-A), and how this comparison evolves under different censorship intensities. Our experiments thus employ incremental censorship intensities $\gamma \in \{0, 0.1, \ldots, 1\}$, so that $\gamma$ ranges between two extremes: from absence of censoring ($\gamma = 0$, full observability of the true demand) to complete censoring ($\gamma = 1$, no demand observed in historical data on censored days).

4.4.4 Results for Bike-Sharing Experiments

Figure 7: Censoring performance analysis on the entire dataset (left) and only on non-censored observations (right) in terms of $R^2$ (top) and RMSE (bottom) for super-hub 1.

This section presents the results of the predictive models applied to each of the three time series with cross-validation. Tables 2, 3 and 4 in Appendix A.2 detail the evaluation for each of the three super-hubs. We now concentrate on the results for super-hub 1, as presented in Fig. 7, because they are representative also of the results for the other two super-hubs. The plots in Fig. 7 are a visual representation of Table 2 and compare the performance of NCGP, NCGP-A and CGP for different censorship intensities. We distinguish between evaluating model performance on the entire dataset (consisting of both censored and non-censored observations) vs. only on non-censored observations.

First, we compare the models that do not discard any observations, namely CGP and NCGP. Considering that a predictive model is better the closer its RMSE is to 0 and the closer its $R^2$ is to 1, the plots show that the two models are comparable under a low degree of censoring. However, as the censorship intensifies, NCGP becomes strongly biased towards the censored observations, whereas CGP recovers the underlying demand much more consistently.

Next, we compare NCGP-A and CGP and see that NCGP-A achieves reasonable predictive accuracy, which is still mostly worse than that of CGP. As outlined previously in Sections 4.2 and 4.3, NCGP-A accuracy depends highly on the extent to which the observable data characterizes the full behavior of the latent function (in this case, true demand). Here, the share of points affected by censoring is moderate for all three super-hubs, so that NCGP-A has acceptable observability of the true demand. Even so, CGP outperforms NCGP-A also on just the non-censored data; this suggests that using a censored likelihood not only allows models to avoid predictive bias on censored data, but also enables a consistent understanding of the data generating process, ultimately leading to increased performance also on observable data.

In conclusion, and as outlined in previous Sections, the non-parametric nature of the Censored GP allows it to effectively exploit the concept of censoring, thus preventing censored observations from biasing the entire demand model. In other words, the Censored GP is capable of activating its censoring-awareness depending on the data only.

4.4.5 Behavior under Zero Censorship

It is worth emphasizing that the Non-Censored GP is a special case of the Censored GP in those cases where all observations are assumed to be non-censored. This becomes evident when juxtaposing the two likelihood functions:

$$p_{\mathrm{NCGP}}(\mathbf{y} \mid \mathbf{f}) = \prod_{i=1}^{n} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - f_i}{\sigma}\right) \qquad (26)$$

$$p_{\mathrm{CGP}}(\mathbf{y} \mid \mathbf{f}) = \prod_{i \,:\, l_i = 0} \frac{1}{\sigma}\,\phi\!\left(\frac{y_i - f_i}{\sigma}\right) \;\prod_{i \,:\, l_i = 1} \left[1 - \Phi\!\left(\frac{u_i - f_i}{\sigma}\right)\right] \qquad (27)$$

which shows that in the absence of censoring (i.e., when $l_i = 0$ for all $i$), $p_{\mathrm{CGP}}(\mathbf{y} \mid \mathbf{f}) = p_{\mathrm{NCGP}}(\mathbf{y} \mid \mathbf{f})$.

For zero censorship intensity, we still see in Fig. 7 some difference in performance. To understand why this happens, recall that by our experimental design (Section 4.4.2), censorship labels are pre-determined regardless of censorship intensity, hence some observations are labeled as censored even when the censorship intensity is 0. Furthermore, the true demand in our experiments originates from historical data, which is itself intrinsically censored. It follows that the models are never exposed to a complete absence of censored observations. Conversely, in a real-world scenario, absence of censoring corresponds to a complete absence of censored observations, so that the Censored and Non-Censored GPs are indeed equivalent.

5 Summary and Future Work

Building a model for demand prediction naturally relies on extrapolating knowledge from historical data. This is usually done by implementing different types of regression models, to both explain past demand behavior and compute reliable predictions for the future – a fundamental building block for a great number of decision-making processes. However, we have shown how a reliable predictive model must take censoring into consideration, especially in those cases in which demand is implicitly limited by supply. More importantly, we stressed the fact that, in the context of shared transport demand modeling, there is a need for models which can deal with censoring in a meaningful way, rather than resorting to data cleaning techniques.

To deal with the censoring problem, we have constructed models that incorporate a censored likelihood function within a flexible, non-parametric Gaussian Process (GP). We compare this model to commonly used GP models, which incorporate a standard Gaussian likelihood, through a series of experiments on synthetic and real-world datasets. These experiments highlight how standard regression models are prone to return a biased model of demand under data censorship, whereas the proposed Censored GP model yields consistent predictions even under severe censorship.

The experimental results thus confirm the importance of censoring in demand modeling, especially in the transport scenario where demand and supply are naturally interdependent. More generally, our results support the idea of building more knowledgeable models instead of using case-dependent data cleaning techniques. This can be done by feeding the demand models insights on how the demand patterns actually behave so that the models can adjust automatically to the available data.

For future work, we plan to study settings where censorship labels are partly or even completely unknown. We also note that shared mobility demand prediction can benefit from utilizing spatio-temporal correlations, as commutes between areas take place regularly and as nearby areas are likely to exhibit similar demand patterns. Hence also for future work, we plan to implement models which predict censored demand jointly for multiple areas.

Acknowledgement

The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Individual Fellowship H2020-MSCA-IF-2016, ID number 745673.

References

Appendix A Appendix

A.1 Evaluation Measures

Models in our experiments are evaluated with the following measures:

$$R^2 = 1 - \frac{\sum_{t=1}^{T}\left(y_t^* - \hat{\mu}_t\right)^2}{\sum_{t=1}^{T}\left(y_t^* - \bar{y}^*\right)^2} \qquad (28)$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(y_t^* - \hat{\mu}_t\right)^2} \qquad (29)$$

where $y_t^*$ are the true demand values, $\hat{\mu}_t$ are the corresponding evaluations of the posterior mean, and

$$\bar{y}^* = \frac{1}{T}\sum_{t=1}^{T} y_t^* \qquad (30)$$

is the mean true demand.
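A direct implementation of these two measures (array names are illustrative):

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination (Eq. 28)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root Mean Squared Error (Eq. 29)."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```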

A.2 Experimental Results for Section 4.4

Tables 2, 3 and 4 provide the detailed results for the experiments of Section 4.4.

Censorship Intensity: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Entire Dataset
RMSE NCGP 8.31 8.38 8.6 8.99 9.52 10.07 10.74 11.48 12.29 13.14 14.03
NCGP-A 9.03 9.03 9.03 9.03 9.03 9.03 9.03 9.03 9.03 9.03 9.03
CGP 8.73 8.47 8.38 8.26 8.34 8.44 8.51 8.67 8.85 9.04 9.28
R² NCGP 0.69 0.68 0.66 0.63 0.59 0.54 0.48 0.40 0.32 0.22 0.11
NCGP-A 0.63 0.63 0.63 0.63 0.63 0.63 0.63 0.63 0.63 0.63 0.63
CGP 0.65 0.68 0.68 0.69 0.69 0.68 0.67 0.66 0.65 0.63 0.61
Non-Censored Data
RMSE NCGP 7.96 7.99 8.11 8.36 8.74 9.02 9.43 9.91 10.44 11.01 11.61
NCGP-A 8.73 8.73 8.73 8.73 8.73 8.73 8.73 8.73 8.73 8.73 8.73
CGP 8.19 8.06 8.00 7.91 7.98 8.02 8.08 8.20 8.25 8.40 8.54
R² NCGP 0.69 0.68 0.67 0.65 0.62 0.60 0.56 0.51 0.46 0.40 0.33
NCGP-A 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62
CGP 0.67 0.68 0.68 0.69 0.68 0.68 0.68 0.67 0.66 0.65 0.64
Table 2: Model Performance for Super-hub 1
Censorship Intensity: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Entire Dataset
RMSE NCGP 6.17 6.22 6.37 6.63 7.00 7.42 7.91 8.46 9.05 9.67 10.33
NCGP-A 7.28 7.28 7.28 7.28 7.28 7.28 7.28 7.28 7.28 7.28 7.28
CGP 6.74 6.57 6.44 6.43 6.52 6.59 6.72 6.90 7.05 7.22 7.40
R² NCGP 0.65 0.65 0.63 0.60 0.56 0.50 0.43 0.35 0.26 0.15 0.03
NCGP-A 0.52 0.52 0.52 0.52 0.52 0.52 0.52 0.52 0.52 0.52 0.52
CGP 0.59 0.61 0.62 0.62 0.61 0.61 0.59 0.57 0.55 0.53 0.50
Non-Censored Data
RMSE NCGP 5.85 5.88 5.99 6.16 6.39 6.65 6.96 7.31 7.70 8.12 8.56
NCGP-A 6.72 6.72 6.72 6.72 6.72 6.72 6.72 6.72 6.72 6.72 6.72
CGP 6.21 6.11 6.01 6.01 6.05 6.09 6.13 6.28 6.42 6.51 6.68
R² NCGP 0.71 0.71 0.70 0.68 0.66 0.63 0.59 0.55 0.50 0.45 0.38
NCGP-A 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62
CGP 0.68 0.69 0.70 0.70 0.69 0.69 0.68 0.67 0.65 0.64 0.62
Table 3: Model Performance for Super-hub 2
Censorship Intensity: 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Entire Dataset
RMSE NCGP 7.27 7.32 7.45 7.62 7.87 8.19 8.56 8.99 9.46 9.96 10.5
NCGP-A 7.35 7.35 7.35 7.35 7.35 7.35 7.35 7.35 7.35 7.35 7.35
CGP 7.42 7.26 7.20 7.17 7.19 7.25 7.29 7.37 7.42 7.50 7.65
R² NCGP 0.54 0.53 0.51 0.49 0.46 0.41 0.36 0.29 0.22 0.13 0.03
NCGP-A 0.53 0.53 0.53 0.53 0.53 0.53 0.53 0.53 0.53 0.53 0.53
CGP 0.52 0.54 0.55 0.55 0.55 0.54 0.53 0.52 0.52 0.51 0.49
Non-Censored Data
RMSE NCGP 6.99 7.02 7.10 7.20 7.36 7.58 7.85 8.16 8.50 8.88 9.28
NCGP-A 7.02 7.02 7.02 7.02 7.02 7.02 7.02 7.02 7.02 7.02 7.02
CGP 7.12 7.01 6.97 6.92 6.96 6.98 6.99 7.05 7.04 7.09 7.22
R² NCGP 0.55 0.55 0.54 0.53 0.50 0.47 0.44 0.39 0.34 0.28 0.21
NCGP-A 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55 0.55
CGP 0.54 0.55 0.56 0.56 0.56 0.55 0.55 0.54 0.55 0.54 0.52
Table 4: Model Performance for Super-hub 3