Understanding and forecasting traffic is an important task for urban policymakers. Road networks are by far the most heavily used part of transport infrastructure (for example, of all trips in the UK were made by car in 2016); yet compared to other transportation modes (such as rail and air), basic data about traffic flow on roads is largely lacking. In the last decade, a variety of novel data sources have started to offer the possibility of filling this gap, such as data from GPS transponders on mobile phones (see ref. for a review) or data from social media, which are generating considerable academic interest. Here, we contribute to this growing literature on the use of new data sources to understand traffic by using volunteered geographic information from OpenStreetMap (OSM) to understand what types of land use are associated with traffic jams, as well as increased traffic volume.
Land use categories are, however, often classified at a highly aggregate level (e.g., defining areas as residential, commercial, or industrial), and data have typically been expensive to put together. OSM is very promising in this regard: its data is highly granular, offering a classification of different types of commercial activity, public amenities and other forms of land use, and all of this data is freely and openly available. The completeness and accuracy of OSM coverage have been assessed in previous studies [8, 9, 10, 11, 12, 13, 14, 15, 16], yielding positive but cautious results, particularly about road networks. OSM data has also been used to successfully identify the types of trips which human mobility models struggle to predict accurately.
We test the extent to which OSM data can offer a good estimation of the volume of overall traffic and the number of traffic disruptions, defined as any deviation from normal smooth traffic on a road network, by making use of a series of linear regression models. For the models of the traffic disruptions volume, observations are the geographic (latitude and longitude) points where traffic disruptions were observed in the network and the response variable is the number of traffic disruptions observed during the month of March 2017.
The data analysis pipelines for the two sets of linear models in this study are described in Figure 1. As shown in the top panels (a), we first produce kernel density estimates (KDE) of every OSM category and meta-category. We then estimate the number of traffic disruptions at a given latitude and longitude using the KDEs of either the OSM meta-categories or of the OSM categories at each point. To produce the KDEs, we made use of a Gaussian kernel and searched over a range of bandwidth parameters before adopting a bandwidth of, which captures the range of spatial variation of all OSM points of interest. The specific value of the bandwidth parameter did not qualitatively affect our results. These KDEs allow us to estimate the density of any type of OSM feature at all of the points where traffic disruptions were reported.
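The density estimation step described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the coordinates are synthetic, the function name `gaussian_kde_2d` is hypothetical, the bandwidths are placeholders, and longitude and latitude are treated as planar coordinates for simplicity.

```python
import numpy as np


def gaussian_kde_2d(poi_xy, query_xy, bandwidth):
    """Evaluate an isotropic 2D Gaussian KDE of the POIs at the query points.

    poi_xy:   (n, 2) array of point-of-interest coordinates.
    query_xy: (m, 2) array of locations where the density is wanted.
    """
    # Pairwise squared distances between query points and POIs, shape (m, n).
    diffs = query_xy[:, None, :] - poi_xy[None, :, :]
    sq_dist = np.sum(diffs ** 2, axis=-1)
    # Gaussian kernel with the 2D normalising constant 1 / (2 * pi * h^2).
    kernel = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    norm = 2.0 * np.pi * bandwidth ** 2 * poi_xy.shape[0]
    return kernel.sum(axis=1) / norm


rng = np.random.default_rng(0)
# Synthetic stand-ins for one OSM category and for disruption locations.
poi_coords = rng.normal(loc=[-1.25, 51.75], scale=0.05, size=(200, 2))
disruption_coords = rng.normal(loc=[-1.25, 51.75], scale=0.08, size=(50, 2))

# Placeholder bandwidths (assumption): compare a few values, as in a search.
densities = {
    bw: gaussian_kde_2d(poi_coords, disruption_coords, bw)
    for bw in (0.01, 0.05, 0.1)
}
```

Each entry of `densities` holds one predictor value per disruption point; repeating this for every category would yield the columns of the design matrix.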
As shown in the bottom panels (b), we also perform a second set of linear regressions where we aggregate the OSM data points into a total count for every one of the 112 electoral wards in the county of Oxfordshire, UK. We then estimate the volume of traffic going into every ward using either counts of the OSM meta-categories or all OSM categories for each ward.
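The ward-level aggregation can be sketched with pandas as below. The sketch assumes that each point of interest has already been assigned to a ward (for instance through a spatial join against the ward boundaries); the column names and the toy data are hypothetical.

```python
import pandas as pd

# Toy points of interest, each already labelled with its ward (assumption).
pois = pd.DataFrame({
    "ward": ["A", "A", "B", "B", "B", "C"],
    "category": ["pub", "school", "pub", "pub", "cafe", "school"],
})

# One row per ward, one column per OSM category, cells are counts.
ward_counts = pd.crosstab(pois["ward"], pois["category"])

# Wards without any point of interest should still appear, with zero counts.
all_wards = ["A", "B", "C", "D"]
ward_counts = ward_counts.reindex(all_wards, fill_value=0)
```

The same call with a meta-category column in place of `category` would give the coarser set of predictors.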
Only a small subset of the 40 predictor variables is shown for (b). The markers *, ** and *** indicate three p-value significance thresholds.
Estimating traffic disruptions
The first linear model to estimate traffic disruptions only makes use of the meta-categories of OSM features (see Table 3a). These meta-categories represent traditional classifications of land use types. The model only weakly fits the traffic disruptions data, resulting in an adjusted $R^2$ of . Individual coefficients show that commercial areas are the ones most associated with high traffic, whilst industrial areas are the least so. We also tested versions of the model that estimate disruptions only on weekdays or only on weekends, as the nature of traffic disruptions on these days could be different, but the overall fit to the log-transformed data was similar.
The second model uses more granular land-use data, drawing on all OSM categories that were observed at least a hundred times in Oxfordshire, which results in KDEs for 40 different types of point (from pubs, schools and restaurants to graveyards, postboxes and gardens). This model fits the log-transformed data considerably better than the meta-category model, as captured by the adjusted $R^2$, which is a goodness-of-fit metric that takes into account the different number of independent variables and is a common metric for model comparison in computational social science [18, 19, 20]. This granular model results in an adjusted $R^2$ of . The model coefficients of largest absolute value are shown in Table 3b, together with their corresponding p-values.
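For reference, the standard adjustment that penalises a model for its number of predictors can be written as a short function. The numbers passed to it below are purely illustrative and are not results from this study.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof


# Illustrative only: the same raw R^2 is penalised more with 40 predictors
# than with 6, which is why the adjusted value suits model comparison.
coarse = adjusted_r2(0.30, n_obs=112, n_predictors=6)
granular = adjusted_r2(0.30, n_obs=112, n_predictors=40)
```

A granular model therefore only achieves a higher adjusted value if its extra predictors improve the raw fit by more than the penalty.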
The second, granular model estimates how strongly the land uses we might expect to explain local traffic jams are actually associated with traffic disruptions. For example, one would expect places of worship and schools to both be associated with a relatively high number of traffic disruptions, but the model assigns very different coefficients to the two: the coefficient for points of interest tagged as schools differs markedly from the corresponding coefficient for places of worship. The analysis, however, is only correlational: OSM points of interest tagged as farmland, parking and graveyards all have high positive coefficients. The high number of traffic disruptions around such points might be due to traffic network features such as narrow roads rather than to any direct effect of these OSM features.
Estimating traffic volume
We also test the effectiveness of OSM data in estimating the traffic volume in Oxfordshire. For this variable, rather than using KDEs to estimate the density of each OSM feature at a given road, we aggregate the number of points of interest tagged with each meta-category and category, producing two sets of independent variables for every ward: one corresponding to the total number of points tagged with each one of the 6 OSM meta-categories, and one corresponding to the points in every ward in the 40 categories. We then produce two corresponding linear regression models using the log-transformed total traffic flowing into a ward as the dependent variable.
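A regression of this form can be sketched with ordinary least squares on synthetic data. The counts, coefficients and noise level below are invented for illustration, and the natural logarithm is an assumption, since the base of the transformation is not stated here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_wards, n_features = 112, 6

# Synthetic ward-level counts for six meta-categories (not real data).
counts = rng.poisson(lam=20, size=(n_wards, n_features)).astype(float)
true_coefs = np.array([0.03, -0.01, 0.02, 0.0, 0.01, -0.02])
traffic = np.exp(5.0 + counts @ true_coefs + rng.normal(scale=0.1, size=n_wards))

# Log-transform the dependent variable, then fit by least squares.
log_traffic = np.log(traffic)
X = np.column_stack([np.ones(n_wards), counts])  # first column is the intercept
beta, *_ = np.linalg.lstsq(X, log_traffic, rcond=None)

# Coefficient of determination of the fitted model.
fitted = X @ beta
ss_res = np.sum((log_traffic - fitted) ** 2)
ss_tot = np.sum((log_traffic - log_traffic.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
```

Here `beta[0]` is the intercept and `beta[1:]` holds one coefficient per predictor; the 40-category model only differs in the number of columns of `counts`.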
The linear regression models built with the traffic volume data show the same qualitative trend as the ones built with traffic disruption data. The first model, with the 6 meta-categories, results in an adjusted $R^2$ of . Its coefficients indicate that OSM points tagged as commercial are associated with heavier incoming traffic, while points tagged as recreational are negatively associated with it. Coefficients are presented in Table S1.
The finer-grained model, featuring 40 OSM categories, naturally shows a more nuanced scenario. Not only does it provide a better fit to the data, with an adjusted $R^2$ of , but it also provides more detail about the meta-categories used in the simpler linear models. Categories such as telephone and university show strong associations with higher levels of incoming traffic, whereas categories such as forest, meadow and allotments show weaker associations.
Not surprisingly, some OSM categories are also highly correlated, in the sense that they often appear in the same wards. Figure 2 shows these correlations in detail as a heatmap of the Pearson correlation between the distributions of OSM categories over wards, giving higher values to pairs of OSM categories that often appear in the same wards (e.g., forest and meadow), and lower values to pairs of categories that rarely co-occur (e.g., farmyard and fast food). The figure also shows the result of performing hierarchical clustering on the OSM categories according to their correlation. There is a cluster formed by farm, farmland, farmyard, forest, meadow, graveyard and reservoir, which separates these rural categories from more urban categories such as university or retail.
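The correlation and clustering analysis can be sketched as below, using synthetic ward counts built so that rural and urban categories co-occur. The distance measure (one minus the Pearson correlation) and the average linkage method are assumptions made for this sketch, as the exact clustering settings are not specified here.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)
n_wards = 112

# Shared latent intensities make categories within a group co-occur.
rural = rng.poisson(10, n_wards)
urban = rng.poisson(10, n_wards)
ward_counts = pd.DataFrame({
    "farmland": rural + rng.poisson(2, n_wards),
    "forest": rural + rng.poisson(2, n_wards),
    "meadow": rural + rng.poisson(2, n_wards),
    "cafe": urban + rng.poisson(2, n_wards),
    "retail": urban + rng.poisson(2, n_wards),
})

# Pearson correlation between categories across wards (heatmap input).
corr = ward_counts.corr(method="pearson")

# Turn correlations into distances and cluster hierarchically.
dist = 1.0 - corr.to_numpy()
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")

# Cut the tree into two clusters and label each category.
labels = fcluster(tree, t=2, criterion="maxclust")
clusters = dict(zip(corr.columns, labels))
```

On this toy data the two clusters correspond to the rural and the urban categories, mirroring the separation described above.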
For both the incoming traffic volume per ward and the number of traffic disruptions, the jump from 6 meta-categories to 40 OSM categories implied a change from a linear model with a poor fit to a model with a better fit, indicated by the changes in their adjusted $R^2$. It is natural to then ask if all 40 OSM categories are necessary for the new model to work, or if an equally good fit could be obtained by selecting a different number of meta-categories, or a subset of those 40 OSM categories, excluding correlated categories. This is discussed in the next subsection.
We assess the explanatory power of each variable in these linear models using feature ranking with recursive feature elimination, aided by cross-validated selection of the best number of features, as implemented in the scikit-learn Python library. For both dependent variables, i.e., the incoming traffic volume per ward and the volume of traffic disruptions at a point in the road network, we perform 1000 rounds of k-fold cross-validation with , scoring models for their . For every cross-validation round, the 6 or 40 independent variables are then ranked according to their importance, which in this case is the magnitude of their corresponding coefficients in the linear models. Selected features are assigned rank 1, with the next-best variable being assigned rank 2, and so on until the last variable.
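One way to set up such a procedure with scikit-learn's `RFECV` is sketched below on synthetic data. The number of folds (5), the scoring metric (`"r2"`) and the reduced number of rounds (20) are assumptions chosen for illustration, and `rank_features` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_samples, n_features = 200, 6

# Synthetic predictors: only the first two actually drive the response.
X = rng.normal(size=(n_samples, n_features))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)


def rank_features(X, y, n_splits, seed):
    """Run one round of cross-validated recursive feature elimination."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    selector = RFECV(LinearRegression(), step=1, cv=cv, scoring="r2")
    selector.fit(X, y)
    # ranking_ gives rank 1 to selected features, 2 to the next best, etc.
    return selector.ranking_


# Each round reshuffles the folds; rows are rounds, columns are features.
rankings = np.array(
    [rank_features(X, y, n_splits=5, seed=s) for s in range(20)]
)
```

The resulting `rankings` matrix is the input needed to summarise how consistently each variable is selected across rounds.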
As multiple cross-validation rounds might result in different rankings of the predictor variables, we combine all rankings by calculating the stability of every variable, as well as its mean rank. Stability selection is a method which provides a useful balance between feature selection and data interpretation, by evaluating how often a given feature is included among the most important (i.e., rank 1) for a model. Strong or important features should achieve scores close to , indicating that most of the 1000 cross-validation rounds ranked them as one of the best features for prediction. Any weaker but still relevant features should still have non-zero scores, as they ought to be selected as best features at least occasionally. Finally, irrelevant features should return near-zero scores, indicating that they are very unlikely to feature among the selected variables.
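Given a matrix of per-round rankings, the two summaries can be computed as in the sketch below. It follows the description above, taking stability to be the fraction of rounds in which a feature receives rank 1; expressing it as a fraction between 0 and 1 is an assumption, and the small ranking matrix is made up.

```python
import numpy as np

# Made-up rankings: rows are cross-validation rounds, columns are features.
rankings = np.array([
    [1, 1, 2, 3],
    [1, 2, 1, 3],
    [1, 1, 3, 2],
    [1, 1, 2, 3],
])

# Fraction of rounds in which each feature was among the selected (rank 1).
stability = (rankings == 1).mean(axis=0)

# Average rank of each feature over all rounds (lower is better).
mean_rank = rankings.mean(axis=0)
```

In this toy example the first feature is selected in every round, the second in three of four rounds, and the last one never.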
For the volume of traffic disruptions, both the mean rank and the stability analysis reveal the same pattern, as shown in Tables 6 and 9. The meta-category residential features at the top, with both mean rank and stability equal to , indicating a variable that featured as important in all of the 1000 cross-validation rounds. It is then followed by the meta-category of recreational, which still features as important, with all other meta-categories featuring with a lower rank, and a stability less than . The corresponding granular OSM categories show the categories farmland, residential, parking, forest, and farmyard at the top, with mean rank and stability of , indicating that they were considered important variables in all cross-validation rounds. These categories are followed by farm, meadow, and industrial, with stability of and respective mean ranks of , and .
Tables 6 and 9 also show the mean rank and stability results for the total incoming traffic volume. Reported results are for trips on weekday mornings, but qualitatively similar results are obtained when using the full collection of trips in the dataset as shown in Table S2. The meta-category commercial features at the top, with both mean rank and stability equal to , indicating a variable that featured as important in all of the 1000 cross-validation rounds. It is then followed by the meta-category of recreational, which still features as important, with all other meta-categories featuring with a lower rank, and a stability less than or equal to . The corresponding granular OSM categories show fast-food at the top, with a mean rank and stability of . The categories post box and cafe feature next. OSM categories such as farm and farmyard feature with lower mean ranks, and stability under . One must bear in mind that the OSM categories residential and commercial are not equivalent to the meta-categories residential and commercial. This point is discussed in more detail in the next section.
The analysis presented in this paper shows how fine-grained land use categories can be used to estimate traffic volume and traffic disruption patterns. In particular, we have shown that the fine-grained features available on OpenStreetMap can greatly increase the explanatory power of linear models. We have also shown the importance of different land use categories by using recursive feature elimination, and have used cross-validation to examine the predictive power of different models.
One useful application of these data and methods is to offer estimated answers to questions such as “what impact will placing another cafe at a given point have on traffic jams at that location?”. For example, according to our fine-grained traffic models, the impact of a new school on the number of traffic disruptions in its area should be comparable to the impact of a new retail store or fast food restaurant. The linear model coefficients associated with the presence of these amenities are all approximately , meaning that an increase by in these variables (number of schools, retail stores, and restaurants) implies an increase of in the log-transformed number of traffic disruptions, i.e., an increase in in the monthly number of traffic disruptions at the location. These same categories (school, retail, and fast food) also have a positive correlation with the monthly volume of traffic going into a ward, even if with different coefficients. Respectively, the three categories have coefficients of , , and , implying respective increases in , , and in the total (non-log transformed) traffic flowing into areas.
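The conversion from a coefficient on the log scale to a relative change on the original scale can be sketched as follows. The sketch assumes a natural-log transformation of the dependent variable, which is not confirmed here, and the coefficient used in the example is hypothetical rather than taken from the fitted models.

```python
import math


def relative_change(coef, delta=1.0):
    """Relative change in y implied by a change `delta` in x.

    Assumes a model of the form ln(y) = a + coef * x, so that y is
    multiplied by exp(coef * delta) when x increases by delta.
    """
    return math.exp(coef * delta) - 1.0


# Hypothetical coefficient, for illustration only.
example_coef = 0.05
change_for_one_more_amenity = relative_change(example_coef, delta=1.0)
```

For small coefficients the relative change is close to the coefficient itself, which is why such coefficients are often read directly as approximate percentage effects.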
It is important to remember the limitations of OpenStreetMap land use categories. For example, the OSM categories residential and commercial are not equivalent to the meta-categories residential and commercial, and the OSM dataset includes tags such as farmland and farmyard along with farm, which was deprecated and replaced by the two other farm categories in 2017. Categories and meta-categories might differ in the quality of the annotation, and in how informative they are for the traffic predictions. The cross-validation and recursive feature elimination performed here are first steps in tackling this issue. The rank and stability analyses provide additional evidence that higher numbers of traffic disruptions are observed in residential and rural areas, indicated by meta-categories such as residential and OSM categories such as farmland, forest and farmyard. This result matches the distribution of OSM categories over all wards, as indicated in Figure 2, which shows that OSM tags such as house, farmland, residential, and farmyard are often seen in the same wards, while rarely co-occurring with OSM categories such as commercial or cafe. The latter two OSM categories do not feature as important predictors for the number of traffic disruptions, but they do feature as important predictors for traffic volume, where they show the highest rank and stability, which is also observed for the meta-category commercial.
Our study also suggests promising avenues for future research. One of these would be to take advantage of the constantly evolving nature of OpenStreetMap to track the emergence of new physical features, and relate these to changes in traffic conditions, thus extending the correlations we have highlighted in this paper into a causal setting. Another would be to combine these with other sources of observational data, such as licensing applications, planning permission, and building regulations, to see if these can build on the baseline model we have constructed. Finally, it would be worthwhile extending our study to other countries and contexts, to see if the value of OSM’s granular point of interest data is generalizable. As our ability to understand and explain traffic patterns improves, so will the ability of policymakers to effectively design urban transport systems that serve the needs of their citizens.
Materials and Methods
Our geographical focus is the English county of Oxfordshire, a geographical area of just over km and which contains around inhabitants. For our OpenStreetMap (OSM) data, we downloaded points of interest from the OSM database which provide indications of the way land is used. Points of interest were downloaded in November 2017. One of the authors then assigned each point of interest to one of six meta-categories of land use: residential, industrial, commercial, recreational, institutional and green space. These categories are standard across the transport and land-use literature (see, for example, the typologies present in [4, 24, 7]). We also preserved the more granular categorization given to the points by OSM itself. For example, our meta-category of commercial contains categories such as restaurant, pub and cafe. We chose to ignore OSM categories and meta-categories with fewer than a hundred points of interest in Oxfordshire, as well as categories indicating the location of the transport network itself, as these are obviously coterminous with our traffic disruption data.
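The filtering and labelling of points of interest can be sketched with pandas as below. Only the commercial entries of the mapping come from the text; the `school` entry, the column names, the helper `prepare_pois` and the toy threshold are hypothetical (the study uses a threshold of one hundred points).

```python
import pandas as pd

# Toy points of interest; real data would come from an OSM extract.
pois = pd.DataFrame({
    "category": ["pub"] * 3 + ["cafe"] * 2 + ["school"] * 2 + ["bench"],
})

# Partial, illustrative mapping from OSM category to meta-category.
META_CATEGORY = {
    "restaurant": "commercial",
    "pub": "commercial",
    "cafe": "commercial",
    "school": "institutional",  # hypothetical assignment
}


def prepare_pois(pois, mapping, min_count):
    """Drop rare categories, then attach a meta-category to each point."""
    counts = pois["category"].value_counts()
    frequent = counts[counts >= min_count].index
    kept = pois[pois["category"].isin(frequent)].copy()
    kept["meta_category"] = kept["category"].map(mapping)
    return kept


# Small threshold for the toy data; the text uses one hundred.
prepared = prepare_pois(pois, META_CATEGORY, min_count=2)
```

Keeping the original `category` column alongside `meta_category` preserves both levels of granularity for the two sets of models.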
Traffic volume and traffic disruptions data
We obtained the traffic disruption data from traffic disruption reports shared with us by the Oxfordshire County Council, which are sourced from a major traffic analytics company. These reports correspond to over million traffic incidents from just over 6,500 points on the Oxfordshire traffic network (each point being approximately a 10 m × 10 m square). The number of traffic disruptions at each point ranged from 1 to 64,313, with an average of 219 per point. It is important to note that many traffic disruptions such as the ones studied in this paper do not result in casualties or police reports, meaning that data on car accidents only reflects a fraction of the incident estimates presented here.
For the traffic volume data, we used anonymised and aggregated GPS mobile phone data provided by a major smartphone operating system. Similar data sets have been validated and successfully used in urban mobility studies in San Francisco and Amsterdam. The data set contains estimated trip volumes for origin-destination pairs of wards in Oxfordshire between January and February 2017 in hourly increments. We took a subset of the data, only using trips inferred by the company to be made by vehicle (and not walking or cycling), and trips on weekdays made between 7am and 12pm (noon), which we aggregated into a total traffic going into every Oxfordshire ward over the two-month period. Using the whole day and/or including weekend trips yielded qualitatively similar results. Finally, we obtained shapefiles for the border of all Oxfordshire wards from the Digimap mapping data service. Datasets were manipulated using dataframes from the Python Pandas library.
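The subsetting and aggregation of the origin-destination data can be sketched with pandas as follows. The column names and the toy records are hypothetical, since the schema of the original data set is not described here.

```python
import pandas as pd

# Toy hourly origin-destination records (hypothetical schema).
trips = pd.DataFrame({
    "origin_ward": ["A", "B", "A", "C", "B"],
    "destination_ward": ["B", "A", "B", "B", "C"],
    "timestamp": pd.to_datetime([
        "2017-01-02 08:00",  # Monday morning
        "2017-01-02 09:00",  # Monday morning
        "2017-01-07 08:00",  # Saturday, excluded
        "2017-01-03 13:00",  # Tuesday afternoon, excluded
        "2017-02-01 11:00",  # Wednesday morning, but a walking trip
    ]),
    "mode": ["vehicle", "vehicle", "vehicle", "vehicle", "walking"],
    "trips": [10, 5, 7, 3, 4],
})

# Keep vehicle trips on weekdays that start from 7am up to, not including, noon.
is_vehicle = trips["mode"] == "vehicle"
is_weekday = trips["timestamp"].dt.dayofweek < 5
is_morning = trips["timestamp"].dt.hour.between(7, 11)
subset = trips[is_vehicle & is_weekday & is_morning]

# Total traffic flowing into each destination ward over the period.
incoming = subset.groupby("destination_ward")["trips"].sum()
```

Dropping the `is_weekday` or `is_morning` mask gives the whole-day or weekend-inclusive variants mentioned above.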
-  Department for Transport, Transport Statistics Great Britain 2016, https://bit.ly/2tsCsvq.
-  E. I. Vlahogianni, M. G. Karlaftis, J. C. Golias, Short-term traffic forecasting: Where we are and where we’re going. Transportation Research Part C: Emerging Technologies 43, 3–19 (2014).
-  G. McNeill, J. Bright, S. A. Hale, Estimating local commuting patterns from geolocated twitter data. EPJ Data Science 6, 24 (2017).
-  M. Wegener, F. Fürst, Land-use transport interaction: State of the art, http://dx.doi.org/10.2139/ssrn.1434678 (2004).
-  M. Lenormand, M. Picornell, O. G. Cantú-Ros, T. Louail, R. Herranz, M. Barthelemy, E. Frías-Martínez, M. S. Miguel, J. J. Ramasco, Comparing and modelling land use organization in cities. Royal Society Open Science 2, 150449 (2015).
-  T. Louail, M. Lenormand, M. Picornell, O. G. Cantú, R. Herranz, E. Frias-Martinez, J. J. Ramasco, M. Barthelemy, Uncovering the spatial structure of mobility networks. Nature Communications 6 (2015).
-  Y. Liu, F. Wang, Y. Xiao, S. Gao, Urban land uses and traffic ‘source-sink areas’: Evidence from gps-enabled taxi data in shanghai. Landscape and Urban Planning 106, 73–87 (2012).
-  M. Haklay, How good is volunteered geographical information? a comparative study of openstreetmap and ordnance survey datasets. Environment and planning B: Planning and design 37, 682–703 (2010).
-  J.-F. Girres, G. Touya, Quality assessment of the french openstreetmap dataset. Transactions in GIS 14, 435–459 (2010).
-  D. Zielstra, A. Zipf, 13th AGILE international conference on geographic information science (2010), vol. 2010.
-  M. Helbich, C. Amelunxen, P. Neis, A. Zipf, Comparative spatial analysis of positional accuracy of openstreetmap and proprietary geodata. Proceedings of GI_Forum pp. 24–33 (2012).
-  A. Mashhadi, G. Quattrone, L. Capra, OpenStreetMap in GIScience (Springer, 2015), pp. 125–141.
-  J. J. Arsanjani, P. Mooney, A. Zipf, A. Schauss, OpenStreetMap in GIScience (Springer, 2015), pp. 37–58.
-  H. Senaratne, A. Mobasheri, A. L. Ali, C. Capineri, M. Haklay, A review of volunteered geographic information quality assessment methods. International Journal of Geographical Information Science 31, 139–167 (2017).
-  J. Bright, S. De Sabbata, S. Lee, Geodemographic biases in crowdsourced knowledge websites: Do neighbours fill in the blanks? GeoJournal 83, 427–440 (2018).
-  J. Bright, S. De Sabbata, S. Lee, B. Ganesh, D. K. Humphreys, Openstreetmap data for alcohol research: Reliability assessment and quality indicators. Health & place 50, 130–136 (2018).
-  C. Q. Camargo, J. Bright, S. A. Hale, Diagnosing the performance of human mobility models at small spatial scales using volunteered geographic information. arXiv preprint arXiv:1905.07964 (2019).
-  H. Choi, H. Varian, Predicting the present with google trends. Economic Record 88, 2–9 (2012).
-  L. Wu, E. Brynjolfsson, Economic analysis of the digital economy (University of Chicago Press, 2015), pp. 89–118.
-  A. Y. Lin, J. Cranshaw, S. Counts, Proceedings of the 2019 World Wide Web Conference (WWW’19), May (2019), pp. 13–17.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011).
-  N. Meinshausen, P. Bühlmann, Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72, 417–473 (2010).
-  OpenStreetMap contributors, Openstreetmap mapnik and cartocss update, https://github.com/gravitystorm/openstreetmap-carto/blob/master/CHANGELOG.md (2017).
-  S. Srinivasan, R. Provost, R. Steiner, Modeling the land-use correlates of vehicle-trip lengths for assessing the transportation impacts of land developments. Journal of Transport and Land Use (2013).
-  B. Sana, J. Castiglione, D. Cooper, D. Tischler, Using Google’s Aggregated and Anonymized Trip Data to Support Freeway Corridor Management Planning in San Francisco, California. Transportation Research Record: Journal of the Transportation Research Board 2643, 65–73 (2017).
-  V. L. Knoop, P. B. C. van Erp, L. Leclercq, S. P. Hoogendoorn, 2018 21st International Conference on Intelligent Transportation Systems (ITSC) (2018), pp. 3832–3839.
-  EDINA Digimap Ordnance Survey Service, OS MasterMap Topography Layer [Shape geospatial data], Scale 1, Tile: Oxfordshire, Ordnance Survey, Using: EDINA Digimap Ordnance Survey Service, https://digimap.edina.ac.uk/ (Downloaded in June 2018).
-  W. McKinney, et al., Proceedings of the 9th Python in Science Conference (Austin, TX, 2010), vol. 445, pp. 51–56.