Road Accidents in the UK (Analysis and Visualization)

07/29/2019 ∙ by Anjul Tyagi, et al. ∙ Stony Brook University 0

Analysis of road accidents is crucial to understand the factors involved and their impact. Accidents usually involve multiple variables like time, weather conditions, age of driver, etc. and hence it is challenging to analyze the data. To solve this problem, we use Multiple Correspondence Analysis (MCA) to first, filter out the most number of variables which can be visualized effectively in two dimensions and then study the correlations among these variables in a two dimensional scatter plot. Other variables, for which MCA cannot capture ample variance in the projected dimensions, we use hypothesis testing and time series analysis for the study.



There are no comments yet.


page 1

page 2

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Analysis of road accidents data can reveal various hidden facts. Accident datasets are high dimensional in nature and techniques like MDS and PCA can be used to project the data in lower dimensions for visualization. However, these techniques don’t preserve the correlation among variables. Instead, Multiple Correspondence Analysis[2]

(MCA) can be used to visualize and correlate between variables from the high dimensional data in two dimensions. It also gives the discrimination measure of how correctly each variable from the dataset is represented in lower dimensions. We use this measure to effectively visualize some variables from the dataset. For other variables which can’t be correctly visualized by MCA, we use hypothesis testing and time series analysis to get some further insights.

2 Dataset

The dataset is taken from Kaggle[1] and it contains the details of every recorded accident in the UK from 2005 till 2015. The full dataset is divided into three major categories i.e. accident information, casualty information, and vehicle information.

3 Related Works

Ljubic et al. [3] used time series analysis to study the accidents data in the UK. Sikdar et al. [4]

used hypothesis testing to study accidents data in India. However, our work uses data visualization to filter out a smaller set of features which can’t be effectively visualized in lower dimensions.

4 Approach

4.1 Discrimination measure over variables

We use discrimination plot generated with MCA to see which variables can be represented accurately in two-dimensional visualizations of the dataset. As shown in Figure 1, more the value of a variable along any dimension, easier it is to represent that variable along that dimension. As the circles in Figure 1 show, the main variables which can be visualized using MCA are Age of the driver, the location of an accident, day of the week. Other variables like vehicle type, date, weather conditions and sex of the driver which cannot be accurately represented using MCA, we use hypothesis testing and time series analysis to analyze them.

Figure 1: Discrimination Measure plot over variables in accidents data.

4.2 Multiple Correspondence Analysis variables plot

We choose three variables which had high variance along Dimensions 1 and 2 from Discrimination Measure Analysis, namely Location of the accident (Postcode), Day of the week and the Age group of the driver and project them using MCA with all the features in a single plot to study the correlation. FigureRoad Accidents in the UK (Analysis and Visualization) shows how each of these categories is related to others in the form of a scatter plot. Several insights obtained from this analysis are discussed in the results section.

4.3 Analysis over other variables and important events in the UK

Not all the variables can be efficiently represented in lower dimensions using MCA, hence further techniques to analyze data are required. Hypothesis testing can be used to further understand how the variables are related to each other. We used Welch’s t-test statistic to study several hypothesis on this dataset. Furthermore, because this dataset is time-bound, we can make some predictions on the data using time series analysis. We applied autoregression on our dataset to analyze and predict the trend in accidents over the years. Results are discussed in the next section.

Figure 2: Prediction of monthly accidents using autoregression from 2005 till 2014.

5 Results

5.1 MCA plot with postcode, day of the week and age group of driver (Figure Road Accidents in the UK (Analysis and Visualization))

  • The number of accidents on Sundays and Wednesdays is fewer than those on other days in any postcode.

  • Age groups 11-15 years, 26-35 years and 36-45 years have the similar number of accident records and the major day of accidents for these age groups is Saturday.

  • Warrington(WA) and Guildford(GU) have more accidents on Tuesdays and the most common age group of people causing accidents is 46 to 55 years.

  • Age group 6-10 years is responsible for a lesser number of accidents compared to other age groups.

5.2 Hypothesis testing

We found out that the number of accidents before, and during the London Summer Olympics remained same. Similarly, other interesting hypothesis were tested and are discussed in Table 1.

5.3 Time Series Analysis

Figure 2 shows the prediction of the number of monthly accidents over the years. We see that the number of accidents has decreased over the years. The prediction accuracy can be measured by the root mean square error value, which was 699.84.

6 Conclusion

In this paper, we combined visualization and data analysis techniques for the effective study of a dataset. We visualized the correlation between the location of the accident, day of the week and age of the drivers using MCA. Further, we studied other important features using hypothesis testing and predicted the trend in accidents using time series analysis. Future work will include more detailed analysis of the data using Machine Learning and other advanced visualization techniques.

7 Acknowledgments

This research was partially supported by NSF grant IIS 1527200 & MSIT, Korea under the ICTCC Program (IITP-2017-R0346-16-1007).

Null Hypothesis Result
Number of daily accidents in summer and winter are equal 15 to 30 more daily accidents in summer
Number of daily accidents by young drivers (Age 18-25 years) and old drivers (Age 65-85 years) are equal 85 to 89 more accidents by young people
The number of daily accidents before and during the London Summer Olympics (2012) were same Accept Null Hypothesis. P-value 0.197.
Number of daily accidents in areas close to subway stations is same as other areas 9 to 29 more accidents daily in areas close to subway stations.
Males cause an equal number of daily accidents as females 428 to 439 more accidents by males.
Table 1: Results from hypothesis testing.


  • [1]
  • [2] H. Abdi and D. Valentin. Multiple correspondence analysis. Encyclopedia of measurement and statistics, pp. 651–657, 2007.
  • [3] P. Ljubič, L. Todorovski, N. Lavrač, and J. C. Bullas. Time-series analysis of uk traffic accident data. In Proceedings of the Fifth International Multi-conference Information Society, pp. 131–134, 2002.
  • [4] P. Sikdar, A. Rabbani, N. Dhapekar, and D. G. Bhatt. Hypothesis testing of road traffic accident data in india. International Journal of Civil Engineering and Technology (IJCIET), 8(6):430–435, 2017.