The bicycle sharing system of Barcelona (called Bicing) has more than stations and about bicycles that can be rented by users. It is a very common option for local people that prefer to protect the environment, look for an alternative to the traditional public transportation, want to exploit the benefits of bicycles or simply cannot afford to drive.
Like any other system, however, it has some problems that users must deal with. One problem that most people notice in their first days of use is that the number of bicycles and the number of free slots are limited. That is, on some occasions users cannot find bikes near their location, or cannot find an available slot at the stations close to their destination. The problem of not finding a bike can be solved by taking another form of public transport, such as a bus, the subway or a train. However, if a person is already riding a bike and cannot leave it close to their destination, it becomes an inconvenience and a reason to avoid the system in the future.
Under these circumstances, a prediction of how many bikes users will find the next day at a certain station may improve the system’s reliability and increase its usage. For example, a person planning a trip for the next morning may change their time of departure if there will be no bikes at the stations close to their origin, or no space to return the bike at a station near their destination.
From the point of view of the service operator, workers must be scheduled to move bikes from stations that have many bikes to stations where only a few can be found at a certain time. This often happens, for example, at stations located at higher elevations, because users take the bikes in the morning and ride downhill to their destinations but do not bring them back, given the difficulty of riding uphill. Therefore, predicting when and which stations will be completely full or completely empty makes it possible to improve the quality of service for end users, while making better use of resources and reducing costs.
In this work, we focus on predicting how many bicycles will be available at the stations at different times of the day. To achieve that, we apply the Random Forest algorithm and classify the observations according to the context of the city, i.e., considering that external factors (such as holidays, extremely low temperatures and rainfall) may influence the behavior of the users and change the number of bikes rented on a specific day. We use public data about the status of the stations, the weather and festivities (available online) and predict the statuses of the stations up to hours earlier.
Our contribution is a way to make predictions about the status of the bike stations using several data types. We illustrate the performance of the predictions and compare them to other approaches, which do not provide responses as accurate as those achieved by the Random Forest models. Moreover, we show the potential of exploring public data to improve large-scale systems, considering that this kind of prediction may also be applied to other cities that use similar systems, or to different public services, such as the public parking system.
In Section II, we discuss the related work and give an overview of prediction methods. Later, in Section III, we explain the data sets used and how to access them, before showing how we used the data in Section IV. In Section V, we show and explain the performance of the predictions made during the tests. Finally, in Section VI, we draw some conclusions about the methods used and their possible applications.
In this Section, we give an overview of how predictions are made and list the most relevant works that use different prediction methods based on public data.
II-A Time Series
There are many methods developed to predict values from sampled data. Naïve predictions are based on aggregate functions, e.g., the average, the sum, the maximum or the median value among a set of numbers. Although they have a low computational cost, their accuracy is usually constrained.
Time series methods are more powerful. They consider the evolution of a system, given historical observations indexed by time. Examples range from the simplest Autoregressive (AR) and Moving Average (MA) models to the more complex extensions of ARIMA (SARIMA, VARIMA, ARIMA-GARCH). Such methods are most often employed in economics to predict stock prices and their variations, but their applications extend to almost any area of study. Compared with the naïve methods, they consume more computational power, but their expected accuracy is also much higher. In , the authors analyzed the Bicing system and predicted the number of bikes in the near future using Naïve and AR models. Their experiments showed that predicting the number of bikes in the next hour using AR models was much more accurate than assuming that it would be the same as in the last observations. Time series methods are widely used to predict one-dimensional data. However, in complex environments, such as a city, many parameters may have an impact on the system under study.
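The contrast between a naïve prediction and an autoregressive one can be illustrated with a minimal sketch. It assumes a first-order model, x[t] = c + phi * x[t-1], fitted by ordinary least squares; the model order, the function names and the toy series are our own illustrative choices and are not taken from the cited work.

```python
from statistics import mean


def naive_forecast(series):
    """Naive prediction: the next value equals the last observation."""
    return series[-1]


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1]; returns (c, phi)."""
    previous, following = series[:-1], series[1:]
    mean_prev, mean_next = mean(previous), mean(following)
    variance = sum((a - mean_prev) ** 2 for a in previous)
    if variance == 0:
        # Constant history: fall back to predicting the mean.
        return mean_next, 0.0
    covariance = sum(
        (a - mean_prev) * (b - mean_next) for a, b in zip(previous, following)
    )
    phi = covariance / variance
    return mean_next - phi * mean_prev, phi


def ar1_forecast(series, steps=1):
    """Iterate the fitted AR(1) recursion `steps` times from the last value."""
    c, phi = fit_ar1(series)
    value = series[-1]
    for _ in range(steps):
        value = c + phi * value
    return value


# Hypothetical bike counts that shrink by 20% per observation.
bikes = [10.0, 8.0, 6.4, 5.12]
print(naive_forecast(bikes), ar1_forecast(bikes))
```

On this toy series the naïve forecast repeats the last count, whereas the AR(1) forecast continues the downward trend.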
II-B Multivariate methods
In traditional prediction methods that use only time series as input, the data is considered univariate. That is, apart from the time (used to index the observations), only the variable itself (represented by past observations) is used as input to predict its future values. Since the input variables are used to make predictions, we will call them predictors. In many cases, the data can be multivariate, and a set of predictors is mapped to the output value. For example, besides using the time to index each observation, it may also be possible to use characteristics of the environment that are relevant and may affect future output values. As real-world situations usually comprise several aspects that influence one another, such methods may better reflect the evidence found in the environment. Corroborating this idea, recent studies have concluded that considering multiple sources of data and different points of view can produce more accurate predictions . Therefore, there is a need for multivariate prediction methods, which can incorporate several sources of information in order to produce more accurate results.
In , the authors studied the bike system operation with predictions about the number of bikes at the stations in the near future using Naïve models and Bayesian Networks. Before making the predictions, it was necessary to classify every station according to its average level of use per day, because each prediction is made for a certain group. Additionally, every time a new station is installed, its usage must be compared with the usage of the other stations in order to verify which of them have similar levels. The main restriction of their approach is that it ignores the impact of external factors, such as a large event in a public space, which may attract more users to the closest bike stations. That is, in order to change the current models, a new data analysis would be required to include these parameters before generating a new model, in a procedure whose space and time complexity may increase exponentially. In our approach, new predictors can be added without requiring extra computation apart from the creation of the models.
In , the authors observed the similarity of the subway stations from a different perspective than ours. They described a method to build clusters of stations based on data collected from users’ behavior and proposed a mechanism to make predictions based on such knowledge. Finally, they proposed three prediction methods that are based on a weighted average between similar trips. As before, the main limitation of this method is that the computational cost of including new parameters is extremely high, if including them is possible at all. That is, during the system development, it is necessary to determine which are the most relevant events that may affect the use of every single station and to select those that show a high correlation, based on historical observations.
As an example of a scalable multivariate method, decision tree learning is successfully applied in machine learning problems to make predictions. A decision tree is an n-ary tree that has a height equal to the number of predictors plus one, where the last level contains the leaves and represents the output values. The general idea of using a decision tree as a predictive model is to create a way to systematically map the characteristics of an observation (composed of several variables) to its output value. If the output value is a label instead of a number, the decision tree may be called a classification tree.
II-C Random Forest
Random Forest  is an algorithm that uses decision trees to create unbiased classification trees. To achieve that, the procedure randomly selects a subset of predictors from the original data and builds a classification tree based on it. After creating several trees, a final decision tree is built based on the average relevance of the predictors, and can be used to make predictions over other sets of observations. Therefore, besides building the decision tree, the Random Forest algorithm calculates the relevance of each predictor in the real environment.
One method to represent the relevance of the predictors is the calculation of the mean decrease in the Gini coefficient of the predictors. The Gini coefficient is a measure of inequality of a distribution, where 0 represents perfect equality and 1 perfect inequality. In the Random Forest algorithm, after the creation of the decision trees, the Gini coefficient is calculated for each level of the tree and compared to their respective parents. Changes in the Gini coefficients are summed for each predictor in all trees and normalized in order to calculate their mean decrease. In summary, the mean decrease in Gini coefficient is a measure of how each predictor contributes to the homogeneity of the nodes and leaves in the resulting decision tree. A higher decrease in Gini coefficient of one predictor means that such predictor is more relevant when classifying an observation.
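The decrease computed at a single split can be sketched as follows. The sketch assumes that the per-node measure is the Gini impurity of the class labels, 1 - sum(p_k^2), as commonly used in CART-style trees; the function names and the toy labels are hypothetical and only illustrate the calculation for one node and its two children.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared shares."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    total = len(parent)
    weighted_children = (
        len(left) / total * gini_impurity(left)
        + len(right) / total * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Hypothetical node holding four observed station statuses.
node = ["full", "full", "empty", "empty"]
informative = gini_decrease(node, ["full", "full"], ["empty", "empty"])
uninformative = gini_decrease(node, ["full", "empty"], ["full", "empty"])
print(informative, uninformative)
```

A split that separates the classes perfectly removes all the impurity of the node, while a split that leaves both children as mixed as the parent yields no decrease. Summing these decreases per predictor over all trees and normalizing gives the mean decrease described above.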
In this work, we adopted the Random Forest as the prediction method. To the best of our knowledge, this is the first study that uses Random Forest to make predictions over publicly available data. Furthermore, it can also produce better results for longer time intervals and larger data sets, as well as providing the scalability lacking in the other works.
III Data sets
In this Section we are going to explain how the data used in the experiments is handled. This description makes this research reproducible, besides explaining the source and the reliability of the data which we used in the tests.
The data generated by the Bicing system is open for public access via an API (Application Programming Interface) that can be accessed online (http://wservice.viabicing.cat/v1/getstations.php?v=1). The information available in this data set describes the current status of the stations: their location; a list of the closest stations; whether they are operating normally or not; the number of bikes available; and the number of free slots (i.e., the number of bikes that can be parked at that station at that moment). This information is updated about once a minute. For more than one year, we stored it nearly once a minute, i.e., apart from occasional failures, such as connectivity problems between our computers and our database server, we have all the information generated by the system in this time interval available for the tests.
Besides the information about the bike system, we accessed two other sources of data that describe factors that may affect the operation of Bicing: the holidays and the weather (including the forecast) in the city. The data about the holidays in Barcelona (http://www.bcn.cat/calendarifestius/en/) were chosen because they cover the national and local festivities, such as bank and religious holidays, and we expect that on these days the normal activities in shops, banks, schools and universities are suspended or reduced. Moreover, the weather information (http://www.wunderground.com/history/airport/LEBL/2015/), such as temperature, relative humidity, dew point and wind speed, may be correlated with changes in the users’ behavior and influence the use of the bikes. Finally, the weather forecast provides detailed information about the current local weather and how it is expected to be in the next 3 days. That is, it incorporates weather predictions that usually combine several factors and are computed by supercomputers, which are not accessible to everyone.
In conclusion, the external information is mainly chosen according to its descriptive power about the situation of the city in the past, besides providing a prediction about whether such scenarios will occur again in the future.
IV Study of the Data
In this Section we explain how we transformed the raw data, which is available online for open access, into analyzable information. This includes, for example, the procedures used to clean and analyze the data before performing tests and applying the results.
|Status|Definition|
|---|---|
|Full|No slots available|
|Almost full|Between 1 and 5 slots available|
|Bikes and slots available|More than 5 bikes available and more than 5 slots available|
|Almost empty|Between 1 and 5 bikes available|
|Empty|No bikes available|
First of all, we phrase our problem from the point of view of a normal user. Since users cannot rent more than one bike at a time, a prediction of the exact number of bikes at a certain time in the future is not required. It is more relevant to know whether there will be bikes or not at that moment. Therefore, we classify the status of a station as full, almost full, bikes and slots available, almost empty or empty, as shown in Table I.
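The levels of Table I can be expressed as a small function. This is only a sketch: the function name is hypothetical, and the order in which the conditions are checked (slots before bikes) is our assumption for cases that the table leaves open, such as a station that has few bikes and few free slots at the same time.

```python
def classify_station(bikes, slots):
    """Map the number of available bikes and free slots to a Table I level.

    Assumption: slot-based conditions are tested before bike-based ones,
    which settles the cases that the table does not define explicitly.
    """
    if slots == 0:
        return "full"
    if bikes == 0:
        return "empty"
    if slots <= 5:
        return "almost full"
    if bikes <= 5:
        return "almost empty"
    return "bikes and slots available"


print(classify_station(bikes=12, slots=0))   # full
print(classify_station(bikes=3, slots=20))   # almost empty
print(classify_station(bikes=10, slots=10))  # bikes and slots available
```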
The data set with the information from the bike system is large (around 10 Gigabytes). Thus, in order to make predictions using an ARIMA model, we selected the last days (approximately observations), as shown in Figure 3. The number of available bikes at the stations was considered a time series and a seasonal ARIMA model was chosen using the Akaike information criterion . For the seasonal factor, we assumed that every day has a similar usage pattern and the statuses tend to repeat through the days. In the end, we classified the predictions according to the levels explained before (full, almost full, etc.).
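For reference, the Akaike information criterion is commonly written as shown below, where k (the number of estimated parameters of a candidate model) and \hat{L} (the maximized value of its likelihood function) are the standard symbols rather than notation taken from this paper. Among the candidate seasonal ARIMA models, the one with the lowest AIC is preferred.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```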
The second type of predictions was done using the Random Forest algorithm. A subset of the data was created by randomly selecting entries from the period that starts weeks before the model creation and finishes weeks before it, i.e., a total of weeks. Additionally, we randomly select entries from a period that starts months before the model creation. Figure 4 shows which data were selected in order to generate the predictions about the next days.
We selected this data set for two reasons. The first is the memory required to construct a model, which is nearly Gigabytes when we use entries. Second, given that we cannot use all the available data, we try to choose the observations that are (probably) the most similar to the next days, which are the ones to be predicted. Having built this data set, we are able to construct two different sets of predictors for the models: one that uses only the data extracted from the bike system (RF) and another that merges them with the data from the weather forecast and the information about holidays (Extended RF, using the sources explained in Section III).
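The difference between the two sets of predictors can be sketched as follows. All field names (`day_of_week`, `minute_of_day`, `is_holiday`, `temperature`, `relative_humidity`) and the structure of the inputs are hypothetical; the sketch only illustrates how one row for the RF models could be extended with the holiday and weather forecast information for the Extended RF models.

```python
from datetime import date, datetime


def rf_predictors(station_id, timestamp):
    """Predictors derived only from the bike system data (hypothetical fields)."""
    return {
        "station": station_id,
        "day_of_week": timestamp.weekday(),  # Monday is 0
        "minute_of_day": timestamp.hour * 60 + timestamp.minute,
    }


def extended_rf_predictors(station_id, timestamp, forecast_by_day, holidays):
    """Add the holiday flag and that day's weather forecast to the RF row."""
    row = rf_predictors(station_id, timestamp)
    day = timestamp.date()
    row["is_holiday"] = day in holidays
    # Days without a forecast simply contribute no weather fields.
    row.update(forecast_by_day.get(day, {}))
    return row


# Hypothetical inputs: one forecast entry and an empty holiday calendar.
forecast = {date(2015, 2, 5): {"temperature": 9.0, "relative_humidity": 71.0}}
row = extended_rf_predictors(1, datetime(2015, 2, 5, 8, 30), forecast, set())
print(row)
```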
Given that there is no space to show the status of every station, we observed the data from the last week of January and selected stations which were either completely full or completely empty during more than 30% of the time: stations number , , and . In Figure 1, it is possible to see a map of Barcelona and their exact position. Each station is at least kilometers from the others and has distinct characteristics. For example, station is very close to the sea, while station is at meters above the sea level; stations and are very close to subway stations and the others are not.
Based on the mean decrease of Gini coefficient, we can observe in Figure 2 how external factors correlate with the statuses of the stations during the whole year. For example, it is possible to notice that the most relevant factor for all of them is the time of the day, i.e., their daily use is comparable. For example, if we observed that, at station , there were more bikes on Monday morning than on Monday afternoon, we might expect that on Tuesday morning there would be more bikes than on Tuesday afternoon. This illustrates that our assumption about the inter-day seasonality (when using ARIMA models) is valid.
Although this is relevant and explored in other studies that observed daily trends , there are interesting points which have not been noticed before. For example, at station , the relative humidity has a greater effect than at the others, which can be explained by its location (close to the sea, where people tend to avoid going when it is raining). At stations and , the day of the week is more relevant than at the others, which can also be explained by their location: since they are close to subway stations, they become an option for workers and students who use the public bikes in combination with the subway to arrive at their destination. In conclusion, we can see that each station has its own “profile”, i.e., some external factors influence the use of some stations more than that of others.
During February 5–12, 2015, we ran our tests based on the selected stations. From 00:00 to 01:00 we set up the prediction models and made predictions for every minutes. We calculated the accuracy as the share of the time in which the stations were either empty or full and this had been successfully predicted. Moreover, in order to observe the performance of the predictions, we calculated the sensitivity and the specificity of each method.
The sensitivity is the ratio between the total number of “true positives” and the number of “positives”. In our scenario, a “positive” happens when a station is predicted to be either full or empty. Thus, a “true positive” happens when the station was actually observed to be full or empty after such a prediction had been made. Therefore, if a prediction model has a higher sensitivity, its predictions about the “positives” are more reliable than those of a model with a lower rate.
Analogously, the specificity is the ratio between the total number of “true negatives” and the number of “negatives”. In our predictions, an outcome is considered a “negative” when it predicts that the station will be neither full nor empty, and a “true negative” happens if the station was neither full nor empty at that moment. Again, if a prediction model has a higher specificity, its predictions about the “negatives” are more reliable than those of a model with a lower rate.
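The three measures can be computed from paired lists of predicted and observed statuses, as in the sketch below. It follows the definitions given in this Section, in which the ratios are taken over the predicted outcomes; under that reading they correspond to what is often called the positive and negative predictive value elsewhere. The accuracy formula (true positives over observed critical statuses) is our interpretation of the description above, and the function name and the sample data are hypothetical.

```python
def evaluate(predicted, observed, positive_levels=("full", "empty")):
    """Accuracy, sensitivity and specificity as defined in this Section."""
    tp = fp = tn = fn = 0
    for pred, obs in zip(predicted, observed):
        pred_positive = pred in positive_levels
        obs_positive = obs in positive_levels
        if pred_positive and obs_positive:
            tp += 1
        elif pred_positive:
            fp += 1
        elif obs_positive:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # None signals that the measure is undefined for this sample.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fp),  # true positives / predicted positives
        "specificity": ratio(tn, tn + fn),  # true negatives / predicted negatives
        "accuracy": ratio(tp, tp + fn),     # critical statuses that were predicted
    }


# Hypothetical predictions and observations for eight time steps.
predicted = ["full", "full", "empty", "almost full",
             "bikes and slots available", "almost empty", "almost empty", "empty"]
observed = ["full", "almost full", "empty", "full",
            "bikes and slots available", "almost empty", "almost empty", "almost empty"]
print(evaluate(predicted, observed))
```

In this sketch a prediction counts as a true positive whenever both the predicted and the observed status are critical, without distinguishing full from empty; a stricter variant could require the two labels to match.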
V-A ARIMA vs. Random Forest
Other authors have shown that predictions using ARIMA models are very effective in the short term (less than 15 minutes). However, our tests showed that they are very inaccurate when predicting a complete day. In fact, they never predicted that the stations would be completely full or completely empty, because the predicted values tend to stay around an average of the last couple of days. After observing such a low accuracy, we decided to include a more flexible evaluation of the results. The flexible criterion (marked with an asterisk in the plots) considers predictions of the almost full and almost empty levels as “positive” outcomes when calculating the sensitivity and specificity. From a practical perspective, users and administrators can consider these levels as “warnings” about the possibility of reaching a critical level (full or empty).
In Figure 5 we show the results obtained for the predictions using ARIMA and those using the Random Forest models. Even under the more flexible evaluation, the ARIMA models are significantly outperformed by the predictions obtained using the Random Forest algorithm. The biggest difference occurred at station , where the ARIMA models correctly predicted of the critical statuses, while the accuracy of the Random Forest algorithm was . The sensitivity of the predictions using the Random Forest algorithm is larger for three stations and the difference for the fourth one (station ) was . Finally, the specificity of the predictions using the Random Forest algorithm is similar for three stations, and the biggest difference is of about for station .
V-B Enhanced Random Forest models
In Figure 6 we compare two different Random Forest models, the “RF” and the “Extended RF”. The difference is that the simplest Random Forest models (“RF”) have fewer predictors and, because of this, they require less data and less effort to be built. However, the other models (“Extended RF”) incorporate the predictions about the weather conditions and the information about possible holidays, which may increase the amount of information that they consider when making a prediction.
Due to the low sensitivity observed using the flexible criterion, we included only the results obtained with the standard evaluation criterion, which interprets the outcome of each prediction strictly, i.e., a prediction outcome is considered a “positive” only if it is either full or empty, and a “negative” otherwise. We notice that the sensitivity and the specificity of those predictions are always better than those obtained by the ARIMA models shown before.
Although there is no big difference for stations and , for station the difference between the sensitivity of the predictions that use external information and the predictions that use only the data from the Bicing system was around and for station , it was . On average, they were the same.
When comparing the different stations, our results indicate that some of them are “less predictable” than the others. For example, for stations and the accuracy was lower than , while for station it was almost . Considering all the observed stations together, both methods had a sensitivity of approximately and a specificity of more than . As we can also observe, the differences were not relevant for most of the stations, and we conclude that there was no gain from including external information, considering the predictions made for the next hours.
V-C Durability of the models
Considering that the high total number of stations may require a large computation time in order to create the prediction models and to make the predictions, we observed whether the quality of the predictions had decreased over several days. That is, we made predictions for the next days ( hours) and compared their accuracy, sensitivity and specificity according to the age of the predictions about the same days (from February to ).
The results (illustrated in Figure 7) show that the accuracy of the predictions using only the data from the Bicing system decreased from to , which may not be considered a big difference, but the most complex models always performed better and kept their accuracy over even when making predictions for days later. Moreover, the predictions that used external information not only had better sensitivity when made some days before, but also performed better than the simplest models. For example, the sensitivity of the extended models is around for the predictions about the next hours and when about the next hours. The specificity also increased , which we do not consider a significant change, but it shows that the quality of the predictions can be maintained (if not improved) by the use of external information.
We conclude that the weather forecast complemented our models with information about the next days, which was useful enough to improve the quality of the predictions when we had no other information about the city’s environment in the future.
VI Conclusion and Future Work
In this work we presented an analysis of the data about the public bicycle system in Barcelona and the predictions about the statuses of the stations during days (February 5–12, 2015). In order to run tests and draw our conclusions, we chose stations which often have a critical status, i.e., on average, they spend more than hours per day completely empty or completely full of bikes.
Using public data, we first observed the characteristics of such stations, for example, the impact of high temperatures and the influence of the relative humidity on the number of bicycles. Later, we compared the accuracy of the predictions using ARIMA with those using the Random Forest algorithm. It was shown that the predictions using ARIMA models are inaccurate when the goal is to inform about the critical states (i.e., full and empty stations), and cannot provide reliable information for the users. Moreover, although ARIMA models have been shown to be a good option in the short term (less than one hour), they do not perform well for longer periods, besides having higher complexity and consuming more computational resources to be created.
The predictions using the Random Forest algorithm performed better and were able to correctly predict nearly half of the times when the stations were either completely full or completely empty, up to days before they actually happened. Also, the sensitivity of the predictions that use the Random Forest algorithm is about , which means that most of the time the “positive” outcomes are correct. Furthermore, their specificity is around , which means that every times that the models predict that there will be bikes and free slots at a station, of them are correct. We remark, nonetheless, that they may require some improvement before being adopted by applications aimed at users of the bicycle system. The use of other relevant predictors may be an option to build more powerful models and can be done by using more observations and data from other sources. For example, non-public information about the times when the bikes will be collected or taken to a station should have a positive impact on their performance.
From the results shown in this paper, it is possible to observe that the use of the bike stations is partially predictable and that, based on the predictions done using only the data accessible by the public, it would be possible to improve the support schedule. In case of taking actions to improve the service, the system is expected to evolve and new trends may be observed. However, given that the models may be regenerated every days, they are able to incorporate such novelties, as well as variant user behaviors across different times of the year.
From the point of view of the system administration, the predictions may trigger different actions, such as collecting bikes from a station that is almost full before the users face the problem of not finding places to leave their bikes. Moreover, it is possible to extend these predictions with other datasets available online, like the neighborhood wealth, the proximity of a place to other public transportation, schools and companies. The extended version of the predictions may be used to decide whether it is a good option to install a new station at a certain place or not, based on how many users would use it during the year.
From the user’s perspective, this set of predictions may be used as a framework to improve the current API and show not only the current status of the stations, but also their future statuses. Our future plans include organizing the predictions for all stations in a scalable way and providing users with an interface to access this information. The interface can be a mobile application that lets users make plans based on the predictions about the number of bikes at the selected stations.
This work has been partially supported by the Spanish Government through the project TEC2012-32354 (Plan Nacional I+D) and by the Catalan Government through the project SGR-2014-1173.
-  G. E. P. Box and G. Jenkins, Time Series Analysis, Forecasting and Control, 1990.
-  A. Kaltenbrunner, R. Meza, J. Grivolla, J. Codina, and R. Banchs, “Urban cycles and mobility patterns: Exploring and predicting trends in a bicycle-based public transport system,” Pervasive and Mobile Computing, vol. 6, no. 4, pp. 455–466, Aug. 2010. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S1574119210000568
-  B. Mellers, E. Stone, P. Atanasov, N. Rohrbaugh, S. E. Metz, L. Ungar, M. M. Bishop, M. Horowitz, E. Merkle, and P. Tetlock, “The Psychology of Intelligence Analysis: Drivers of Prediction Accuracy in World Politics,” Journal of experimental psychology: Applied, vol. 21, no. 1, pp. 1–14, 2015.
-  J. Froehlich, J. Neumann, and N. Oliver, “Sensing and Predicting the Pulse of the City through Shared Bicycling.” IJCAI, no. 2, 2009. [Online]. Available: http://www.nuriaoliver.com/bicing/IJCAI09_Bicing.pdf
-  N. Lathia, J. Froehlich, and L. Capra, “Mining Public Transport Usage for Personalised Intelligent Transport Systems,” 2010 IEEE International Conference on Data Mining, no. October 2009, pp. 887–892, Dec. 2010. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5694056
-  J. Quinlan, “Induction of decision trees,” Machine learning, pp. 81–106, 1986. [Online]. Available: http://link.springer.com/article/10.1023/A:1022643204877
-  L. Breiman, “Random forests,” Machine learning, pp. 1–33, 2001. [Online]. Available: http://link.springer.com/article/10.1023/A:1010933404324
-  H. Akaike, “A new look at the statistical model identification,” Automatic Control, IEEE Transactions on, 1974. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1100705