1 Introduction
Analysis of road accidents data can reveal various hidden facts. Accident datasets are high dimensional in nature and techniques like MDS and PCA can be used to project the data in lower dimensions for visualization. However, these techniques don’t preserve the correlation among variables. Instead, Multiple Correspondence Analysis[2]
(MCA) can be used to visualize and correlate between variables from the high dimensional data in two dimensions. It also gives the discrimination measure of how correctly each variable from the dataset is represented in lower dimensions. We use this measure to effectively visualize some variables from the dataset. For other variables which can’t be correctly visualized by MCA, we use hypothesis testing and time series analysis to get some further insights.
2 Dataset
The dataset is taken from Kaggle[1] and it contains the details of every recorded accident in the UK from 2005 till 2015. The full dataset is divided into three major categories i.e. accident information, casualty information, and vehicle information.
3 Related Works
Ljubic et al. [3] used time series analysis to study the accidents data in the UK. Sikdar et al. [4]
used hypothesis testing to study accidents data in India. However, our work uses data visualization to filter out a smaller set of features which can’t be effectively visualized in lower dimensions.
4 Approach
4.1 Discrimination measure over variables
We use discrimination plot generated with MCA to see which variables can be represented accurately in twodimensional visualizations of the dataset. As shown in Figure 1, more the value of a variable along any dimension, easier it is to represent that variable along that dimension. As the circles in Figure 1 show, the main variables which can be visualized using MCA are Age of the driver, the location of an accident, day of the week. Other variables like vehicle type, date, weather conditions and sex of the driver which cannot be accurately represented using MCA, we use hypothesis testing and time series analysis to analyze them.
4.2 Multiple Correspondence Analysis variables plot
We choose three variables which had high variance along Dimensions 1 and 2 from Discrimination Measure Analysis, namely Location of the accident (Postcode), Day of the week and the Age group of the driver and project them using MCA with all the features in a single plot to study the correlation. FigureRoad Accidents in the UK (Analysis and Visualization) shows how each of these categories is related to others in the form of a scatter plot. Several insights obtained from this analysis are discussed in the results section.
4.3 Analysis over other variables and important events in the UK
Not all the variables can be efficiently represented in lower dimensions using MCA, hence further techniques to analyze data are required. Hypothesis testing can be used to further understand how the variables are related to each other. We used Welch’s ttest statistic to study several hypothesis on this dataset. Furthermore, because this dataset is timebound, we can make some predictions on the data using time series analysis. We applied autoregression on our dataset to analyze and predict the trend in accidents over the years. Results are discussed in the next section.
5 Results
5.1 MCA plot with postcode, day of the week and age group of driver (Figure Road Accidents in the UK (Analysis and Visualization))

The number of accidents on Sundays and Wednesdays is fewer than those on other days in any postcode.

Age groups 1115 years, 2635 years and 3645 years have the similar number of accident records and the major day of accidents for these age groups is Saturday.

Warrington(WA) and Guildford(GU) have more accidents on Tuesdays and the most common age group of people causing accidents is 46 to 55 years.

Age group 610 years is responsible for a lesser number of accidents compared to other age groups.
5.2 Hypothesis testing
We found out that the number of accidents before, and during the London Summer Olympics remained same. Similarly, other interesting hypothesis were tested and are discussed in Table 1.
5.3 Time Series Analysis
Figure 2 shows the prediction of the number of monthly accidents over the years. We see that the number of accidents has decreased over the years. The prediction accuracy can be measured by the root mean square error value, which was 699.84.
6 Conclusion
In this paper, we combined visualization and data analysis techniques for the effective study of a dataset. We visualized the correlation between the location of the accident, day of the week and age of the drivers using MCA. Further, we studied other important features using hypothesis testing and predicted the trend in accidents using time series analysis. Future work will include more detailed analysis of the data using Machine Learning and other advanced visualization techniques.
7 Acknowledgments
This research was partially supported by NSF grant IIS 1527200 & MSIT, Korea under the ICTCC Program (IITP2017R0346161007).
Null Hypothesis  Result 

Number of daily accidents in summer and winter are equal  15 to 30 more daily accidents in summer 
Number of daily accidents by young drivers (Age 1825 years) and old drivers (Age 6585 years) are equal  85 to 89 more accidents by young people 
The number of daily accidents before and during the London Summer Olympics (2012) were same  Accept Null Hypothesis. Pvalue 0.197. 
Number of daily accidents in areas close to subway stations is same as other areas  9 to 29 more accidents daily in areas close to subway stations. 
Males cause an equal number of daily accidents as females  428 to 439 more accidents by males. 
References
 [1] https://www.kaggle.com/silicon99/dftaccidentdata/data.
 [2] H. Abdi and D. Valentin. Multiple correspondence analysis. Encyclopedia of measurement and statistics, pp. 651–657, 2007.
 [3] P. Ljubič, L. Todorovski, N. Lavrač, and J. C. Bullas. Timeseries analysis of uk traffic accident data. In Proceedings of the Fifth International Multiconference Information Society, pp. 131–134, 2002.
 [4] P. Sikdar, A. Rabbani, N. Dhapekar, and D. G. Bhatt. Hypothesis testing of road traffic accident data in india. International Journal of Civil Engineering and Technology (IJCIET), 8(6):430–435, 2017.
Comments
There are no comments yet.