I Introduction
Crime is a problem that we face every day in our society. Even though there are various reasons behind it, most crime can be attributed to socioeconomic causes. It has also been shown that urban areas and cities have a higher density of crime [1]. Crime also depends on factors such as education, culture, the economic level of neighborhoods, and unemployment. There is a strong push towards using machine learning models to obtain statistics regarding crime prediction: to understand why crimes occur, when they will occur, and to whom [2][3][4][5][6][7]. One of the reasons we wanted to work with crime data was an incident on the Drexel University campus, a rape around September 2019 that caused widespread backlash in the Drexel University and Philadelphia communities over whether Public Safety had taken enough precautions. Philadelphia, being among the top six cities in the United States by population and being our home, appeals to us as a city to study, because we wanted to see whether we could find underlying reasons for crime by building predictive models, and whether we could uncover those reasons systematically with robust workflows. The workflow we adhere to in this study consists of feature extraction, model selection, parameter tuning for those models, and feature selection.
II Related Work
There is a strong push towards building predictive models to fight crime. Studies show that one of the techniques widely used in this field is to look at how dense the crime points are on a map. It has been shown that existing crime-dense areas can be used as an indicator of future crime areas: since changes in crime depend on many different factors across multiple dimensions, the persistence of dense areas has been widely accepted as a predictor of future crime. In this study, we wanted to differentiate ourselves with the following approaches:

Work with a very large number of classes (30 labels)

Create features that do not depend on the city

Find the optimal number of clusters in the data set

Compute cluster centers and use the distances to them as features in our predictive models

Work with different supervised learning models that incorporate the aforementioned aspects, in the hope of increasing model accuracy.
Researchers have studied crime from both a time and a location perspective [8]. The time perspective is the predictive aspect of crime, as one might imagine. More specifically, one can lay a grid over a city, count the crime points in each cell, and pose the problem as regression over a time series [9]. The other perspective uses location. This might sound similar to the time perspective, but the difference lies in the fact that crime locations barely change over short periods of time. So, if one were to study the crime-dense neighborhoods of Philadelphia over a decade and then guess the crime-dense neighborhoods for the next year or month, one potential solution would be to flag the already existing crime-dense areas and predict those neighborhoods as the future potential crime-dense areas. The literature uses a specific term for such an area: a crime hot spot. There are mathematical models that label an area as a crime hot spot or not based on Euclidean distance, that is, a linear kernel function.
Even though the current literature is built on top of these approaches, we wanted to remove the assumptions that the current literature makes, even if this meant deviating from its approaches. For this reason, there is only a narrow common ground on which our findings can be compared with previous work. Removing these assumptions meant creating models that do not depend on the city. For example, as we have seen with the time perspective, a predictive model that poses the crime problem as regression needs crime counts over time. To get the crime counts, most researchers had to lay a grid over a city, count the crimes in each grid cell, and sum them over different periods of time [10][11]. This way, one can run a regression on a single grid cell and, say, predict the crime count expected in that cell at a future time. One difficulty with that approach is that the data scientists or researchers have to find a way to divide the city into grid cells. This is a problem because not every major city is a square or has a shape compatible with being treated as a rectangle. To mitigate this, we propose creating clusters in our data and using the clusters, instead of a grid, as a way of counting crime. More specifically, we create clusters from our crime points for each year. We then stack the clusters on top of each other to remove the explicit time element. Because we remove this constraint, we can still extract information from the time dimension of our data without using it explicitly in our models. This approach of creating clusters removes the dependency on laying a grid over the city, and therefore removes most of the preprocessing that a data scientist has to do to work with crime data.
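To make the idea concrete, here is a minimal sketch, assuming a pandas DataFrame df with X, Y, and Year columns (illustrative names, not necessarily the exact schema of our data set), of fitting K-Means per year and stacking the resulting cluster centers:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def yearly_cluster_centers(df: pd.DataFrame, n_clusters: int = 7) -> np.ndarray:
    """Fit K-Means on each year's crime coordinates and stack the centers.

    Assumes df has 'X', 'Y' (longitude, latitude) and 'Year' columns;
    the column names are illustrative.
    """
    centers = []
    for year, group in df.groupby("Year"):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        km.fit(group[["X", "Y"]].values)
        centers.append(km.cluster_centers_)
    # Stacking the per-year centers removes the explicit time dimension
    # while still letting the centers reflect how crime shifted over the years.
    return np.vstack(centers)
```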
III Problem Statement
The problems that we are tackling are as follows:


Can we predict crime type given location and time?

Can we predict accurately when the number of classes is very large?

Does incorporating features from unsupervised learning techniques improve our supervised models to predict crime type?

Can we develop a systematic workflow to combine both learning (supervised/unsupervised) techniques for the crime data set that we work with?
To answer these questions, we investigate creating clusters and looking at different supervised machine learning models.
IV Basic Approach
Our approach includes the following phases:

Data preprocessing, feature extraction

Finding optimal number of clusters in our data set

Creating Clusters for each year and stacking the cluster centers.

Calculating the Euclidean distance from each crime point to cluster centers

Adding the distance features to the previous features and training different models, including K-Nearest Neighbors, Logistic Regression, Decision Tree, Random Forest, and Multi-Layer Perceptron (a sketch of the distance-feature step follows this list).
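As a rough illustration of the distance-feature step, here is a sketch assuming points is an (n, 2) array of crime coordinates and centers holds the stacked cluster centers; this is not the exact code used in the study:

```python
import numpy as np
from scipy.spatial.distance import cdist

def add_distance_features(features: np.ndarray,
                          points: np.ndarray,
                          centers: np.ndarray) -> np.ndarray:
    """Append one Euclidean-distance column per cluster center.

    features: (n_samples, n_features) existing feature matrix
    points:   (n_samples, 2) crime coordinates (X, Y)
    centers:  (n_centers, 2) stacked cluster centers
    """
    dists = cdist(points, centers, metric="euclidean")  # shape (n_samples, n_centers)
    return np.hstack([features, dists])
```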
Our data set has two raw features: time and location. From location and time, we can generate the following features via some data processing (a sketch of this extraction follows the sample rows below).

Hour

Month

Year

DayOfWeek

Is_Weekend

X

Y

Is_Intersection

Is_Block

Police District

Street_Type (St, Blvd, Ave, etc.)
X  Y  Date  Description 

-75.174324  39.986978  4/3/2009 8:46  Other Assaults 
-75.238710  39.953566  2/2/2008 7:56  Robbery Firearm 
-75.069437  40.034939  4/8/2007 2:54  Driving Under Influence 
-75.113286  39.996494  5/19/2006 11:37  Thefts 
-75.065362  40.046056  7/26/2006 13:35  Other Assaults 
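As an illustration of how the time-based features above could be derived, here is a minimal pandas sketch; the column names follow the sample rows above and are assumptions rather than the published schema:

```python
import pandas as pd

# Two illustrative rows modeled on the sample table above.
df = pd.DataFrame({
    "X": [-75.174324, -75.238710],
    "Y": [39.986978, 39.953566],
    "Date": ["4/3/2009 8:46", "2/2/2008 7:56"],
})

dispatch = pd.to_datetime(df["Date"])
df["Hour"] = dispatch.dt.hour
df["Month"] = dispatch.dt.month
df["Year"] = dispatch.dt.year
df["DayOfWeek"] = dispatch.dt.dayofweek                # Monday = 0
df["Is_Weekend"] = (df["DayOfWeek"] >= 5).astype(int)  # Saturday/Sunday flag
```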
V Data
The data set used in this study has about 1.3 million samples. It was collected by the Open Data Philly organization [12]. For our supervised learning models, we used an 80/20 train/test split, giving about 838,860 samples for training and 262,030 for testing. This equates to using the first nine years, beginning in 2006, as training data and the remaining year as the testing data set. The data set spans the years 2006 to 2015. Some rows had missing values, which required data preprocessing. Before applying a classification algorithm, some preprocessing is usually performed on the data set to improve data quality; common techniques include data cleaning, feature selection, outlier detection, and component reduction and transformation. The raw features are the location and time of each crime point. The time of a crime point is the dispatch time recorded by the operator at the 911 call center, and it is therefore precise to the minute. The location of the crime point is given by its X and Y coordinates. Latitude measures the angular distance from the equator to a point north or south of it. Longitude is an angular measure east or west of the Prime Meridian, which has an angular measure of 0 since it is the origin of that measure. Latitude values change along the vertical (Y) axis, and longitude values change along the horizontal (X) axis. Philadelphia's latitude range is slightly greater than its longitude range, which might make the Y feature more important; this can be seen in the feature selection analysis that we performed. More importantly, other city councils that gather the same type of crime data can easily see that crime analysis is tightly connected to the city structure and the distribution of its neighborhoods. For this reason, data scientists and researchers usually need prior knowledge of the physical layout of the city. In our study, we tried to remove assumptions about the city's land area and focused on features that generalize well, such as the dispatch time of the crime and the angular position of the crime point on Earth, i.e., its longitude and latitude.
We also look at individual classes to see the underlying patterns. One pattern we see is that there is less crime during cold seasons than during hot seasons within a year. Interestingly, in 2009 there is a dip in the number of crimes. One might have expected the opposite: during a recession, when people panic and become more desperate, one would expect more crime. Around 6 am is the safest hour of the day, since most criminals are asleep. Additionally, there is a peak in the crime count around lunch time. Since the data set covers Philadelphia, a highly populated urban city, there are more crimes around lunch time than in the morning or slightly after lunch. Overall, the crime count peaks in the evening between 8 pm and 10 pm and stays very high until 1 am.
Next, we look at specific crime types and aggregate them over hours, months, and years. We see that some crime types, such as prostitution and sex offenses, occur mostly at night, while other crime types, such as thefts and vandalism, occur at a stable rate throughout the day. We see that driving under the influence occurs at a very high rate between 10 pm and 2 am, a natural time window for drivers leaving parties after drinking.
When we change the time scope from hours to months, we see fewer crime incidents during the cold months, while the warmer months of spring and summer show an increase in certain crime types such as thefts and prostitution.
If we look at the aggregation of crime types over years, the trends become significantly harder to see. Still, there are some general trends worth mentioning. First, some crime types, such as vandalism, occur less often in recent years. Other crime types, such as thefts, increase; with thefts, we see no decline in the number of incidents.
VI Experiment Results, Analysis and Performance Evaluation
A quick summary of our results is as follows:

Random Forest is most sensitive to the minute and the hour.

Random Forest is the best performing model, which aligns with the current literature.

Support Vector Machines over 30 labels fail to run to completion on the Google Cloud Compute Engine service.

The optimal number of clusters over all years is 7, but when we treat each year as a separate data set, the optimal number of clusters varies between 7 and 10.

Naive Bayesian inference works notably well with 30 class labels, achieving around 27% mean accuracy, compared to 5% for Logistic Regression and 19% for K Nearest Neighbors.
Having summarized these results, we now expand upon them in detail.
VI-A Unsupervised Learning
Unsupervised learning techniques are methods applied to a data set without labels in order to understand its underlying structure. The technique we use in this study is the K-Means clustering algorithm. One important aspect of this method is finding the optimal number of clusters in the data set. For this objective, we employed two different methods, namely the elbow method and the gap statistic. The elbow method can be employed as in the sketch below:
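A minimal sketch of the elbow method, assuming coords is an (n, 2) array of crime coordinates; the plotting details are illustrative rather than the exact code used here:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_curve(coords, k_values=range(1, 15)):
    """Plot within-cluster sum of squares (inertia) against k.

    The 'elbow', where the curve flattens, suggests the optimal cluster count.
    """
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0)
                .fit(coords).inertia_ for k in k_values]
    plt.plot(list(k_values), inertias, marker="o")
    plt.xlabel("Number of clusters k")
    plt.ylabel("Within-cluster sum of squares")
    plt.show()
```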
The gap statistic method gives 7 as the optimal cluster number when applied to all the years in our data set, while the elbow method gives 3. Because we have about 1 million data points, we want to maximize the variance in the distances we calculate to the cluster centers, so we go with the optimal cluster count of k = 7. We apply the K-Means clustering algorithm and show the results in Figure 27. We then plot only the cluster centers for each year and show the results in Figure 28. The cluster centers align along a direction from west to northeast, indicating that the crime points are concentrated around the West Philadelphia and Northeast Philadelphia areas. To see the effect of the cluster centers on the map, we apply a Gaussian density function to draw contours where the height of the surface indicates the density of crime expected in the future. This closely resembles the two perspectives we mentioned in the Related Work section, one being the time aspect and one being the location aspect. Northeast and West Philadelphia show the tallest surface heights.
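Here is a hedged sketch of one way to draw such a density surface, using SciPy's gaussian_kde as the Gaussian density estimator; the exact estimator and grid resolution behind our figures are assumptions here:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

def density_contours(points: np.ndarray, grid_size: int = 100):
    """Draw contours whose height reflects the local density of points.

    points: (n, 2) array of (X, Y) coordinates, e.g. the stacked
    cluster centers or the raw crime points.
    """
    kde = gaussian_kde(points.T)  # gaussian_kde expects shape (d, n)
    xs = np.linspace(points[:, 0].min(), points[:, 0].max(), grid_size)
    ys = np.linspace(points[:, 1].min(), points[:, 1].max(), grid_size)
    xx, yy = np.meshgrid(xs, ys)
    zz = kde(np.vstack([xx.ravel(), yy.ravel()])).reshape(xx.shape)
    plt.contourf(xx, yy, zz, levels=15)
    plt.xlabel("Longitude (X)")
    plt.ylabel("Latitude (Y)")
    plt.show()
```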
VI-B Supervised Learning
Supervised learning methods are techniques that employ training data, a cost function, and testing data, where the training data is used to fit the model and the testing data is used to report how well the fit behaved. Out of the 1.3 million samples in the data set, we got about 838,860 samples for training and 262,030 for testing. This equates to using the first nine years, beginning in 2006, as training data and the remaining year, 2015, as the testing data set. Some rows had missing values, which we dropped during data preprocessing. The class labels can be seen in Table II. The models that we employed in order to predict the crime types are as follows:

K Nearest Neighbors

Naive Bayesian Inference

Decision Tree

Random Forest

Logistic Regression

Support Vector Machine

Multi Layer Perceptron
We chose KNN with 5 neighbors. KNN is a classifier that makes its classification based on the majority vote of the k nearest neighbors. Naive Bayes methods perform inference under the naive assumption that feature pairs are independent. We also used a Decision Tree with a confidence factor of 0.3. Decision Trees are supervised learning models that predict the value of the target variable by learning simple splitting/decision rules on the data set. The Random Forest that we used had 10 trees. A random forest is a model that combines several decision trees trained on subsamples of the data set and uses averaging to improve predictive accuracy; because it uses several trees, it is also expected to generalize well and avoid overfitting. Logistic Regression is a supervised learning method that is well suited to binary classification and can also be used for multi-class problems; it uses a logistic function to produce probability values over the classes, which can then be used to predict the class. Support Vector Machines (SVM) are supervised learning machines that generalize well from a limited number of training patterns; we used a linear kernel, which tries to separate the crime points with a linear hyperplane in a high-dimensional space. Multi-Layer Perceptrons are layered supervised learning models that try to find hyperplanes that separate the data; we employed one hidden layer of 150 neurons.
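For reference, here is a sketch of how these models could be instantiated in scikit-learn with the hyperparameters mentioned above; the decision-tree confidence factor is a Weka-style pruning parameter with no direct scikit-learn equivalent and is omitted, and the remaining settings mirror the text rather than reproduce our exact configuration:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

models = {
    "K Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=10),     # 10 trees
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(kernel="linear"),               # linear kernel
    "Multi Layer Perceptron": MLPClassifier(hidden_layer_sizes=(150,)),
}
```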
During our study, we used the free Google Cloud Compute Engine service with 12 hours of free GPU access in order to take advantage of fast cloud computing; we encourage the reader to see the specifications at [13]. We had several time-outs while training the Support Vector Machine and Multi-Layer Perceptron models, and therefore we do not report their results in this study.
Class Label Index  Class Labels Used in the Supervised Models 

0  Aggravated Assault Firearm 
1  Aggravated Assault No Firearm 
2  All Other Offenses 
3  Arson 
4  Burglary NonResidential 
5  Burglary Residential 
6  Driving Under Influence 
7  Disorderly Conduct 
8  Embezzlement 
9  Forgery and Counterfeiting 
10  Fraud 
11  Gambling Violations 
12  Homicide  Criminal 
13  Homicide  Gross Negligence 
14  Homicide  Justifiable 
15  Liquor Law Violations 
16  Motor Vehicle Theft 
17  Narcotic / Drug Law Violations 
18  Offenses Against Family and Children 
19  Other Assaults 
20  Other Sex Offenses (Not Commercialized) 
21  Prostitution and Commercialized Vice 
22  Public Drunkenness 
23  Rape 
24  Receiving Stolen Property 
25  Recovered Stolen Motor Vehicle 
26  Robbery Firearm 
27  Robbery No Firearm 
28  Theft from Vehicle 
29  Thefts 
30  Vagrancy/Loitering 
31  Vandalism/Criminal Mischief 
32  Weapon Violations 
All the models were built using Python's scikit-learn library. Preprocessing was done by first reading the data from an Excel file and then splitting it into two parts, the first for training and the second for testing, with 80% and 20% ratios respectively.
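Here is a minimal sketch of that load-and-split step; the file name, the label column name, and the unshuffled 80/20 split (to mimic the chronological ordering) are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# File name is illustrative; the data set is published by Open Data Philly.
df = pd.read_excel("philadelphia_crime.xlsx")
df = df.dropna()                        # drop rows with missing values

X = df.drop(columns=["Description"])    # assume 'Description' holds the crime type label
y = df["Description"]

# shuffle=False keeps the chronological order, so roughly the first nine years
# end up in training and the final year in testing, matching the 80/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False)
```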
We also look at the number of components needed to retain most of the variance. This can help us eliminate components that do not add extra or important information.
When we apply Principal Component Analysis (PCA), we see that it decreases our models' performance. This can be attributed to the fact that we are working with a very small number of features, so essential information is lost as soon as we start discarding components.
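A short sketch of the variance check, assuming X_train is a purely numeric feature matrix:

```python
from sklearn.decomposition import PCA

pca = PCA().fit(X_train)
# Cumulative share of variance explained as components are added; if most of the
# variance is captured by very few components, the rest carry little information.
cumulative = pca.explained_variance_ratio_.cumsum()
print(cumulative)
```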
For each row, a uniform probability prediction (no machine learning required), where each label has a probability of 1/34, would give a log loss score of -ln(1/34) = ln(34) ≈ 3.526.
So if we calculate the log loss score per label, we can see for which labels we are performing worse than this baseline.
As can be seen in Figure 32, we apply a smoothing parameter in order to improve our models' scores: we add a small value to all of the probability predictions so that no predicted probability is exactly 0. Note that even though the rows of the prediction matrix then no longer sum exactly to 1, this is not an issue; the log loss function used by Python (scikit-learn) rescales each row of the matrix back so it sums to 1. Figure 32 gives insight into adding a smoothing parameter to the probability predictions over the different classes. We get the lowest score of 2.281062 with a smoothing parameter of 0.000170, an improvement of 0.169367%.
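Here is a hedged sketch of the smoothing step and the uniform baseline; the renormalization is made explicit rather than relying on scikit-learn's internal rescaling:

```python
import numpy as np
from sklearn.metrics import log_loss

def smoothed_log_loss(y_true, proba, eps=0.000170):
    """Add a small constant to every predicted probability before scoring.

    The constant removes exact-zero probabilities, which the log loss penalises
    infinitely hard; rows are renormalised so that each sums to 1 again.
    """
    proba = np.asarray(proba, dtype=float) + eps
    proba = proba / proba.sum(axis=1, keepdims=True)
    return log_loss(y_true, proba)

# Uniform baseline over 34 classes: every label gets probability 1/34,
# giving -ln(1/34) = ln(34) ≈ 3.526 regardless of the true labels.
baseline = -np.log(1.0 / 34)
```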
Label  No of Mispredictions  Mean Log Loss 

0  4  1.654153 
1  18  1.918882 
2  159416  2.156223 
4  344  1.337823 
5  1663  2.427808 
6  1510  2.124929 
7  1409  1.636606 
10  6561  2.401358 
16  446  1.805052 
17  7  2.574474 
18  24767  2.382460 
20  92597  2.577907 
21  1  5.540110 
22  952  1.709928 
23  15  1.773899 
26  3573  2.454954 
27  9  2.280118 
28  34  2.523617 
29  34812  2.343397 
30  49930  2.108511 
31  386  1.826013 
32  37763  2.546528 
33  79  1.984121 
Rank  Feature  Weight 

0  Hour Zone  0.091309 
1  Hour  0.089656 
2  Y  0.062149 
3  Rot60X  0.060207 
4  Radius  0.058731 
5  Rot45X  0.057662 
6  Angle  0.057590 
7  X  0.05119 
8  Rot30Y  0.056596 
9  Rot30X  0.056554 
10  Rot60Y  0.055187 
11  Rot45Y  0.054602 
12  Street1  0.038214 
13  Minute  0.032564 
14  WeekOfYear  0.031246 
15  Year  0.028269 
16  Day  0.02737 
17  DayOfWeekNum  0.019447 
18  PdDistrictNum  0.017935 
19  Month  0.017296 
20  Street2  0.010806 
21  Season  0.008322 
22  IsWeekend  0.007317 
23  IsIntersection  0.003908 
24  StreetType  0.000005 
25  IsBlock  0.00000 
Some of the feature rankings that we computed can be seen in Table IV.
Model  Log Loss  Accuracy 

Random Forest  2.312060  0.218282 
Naive Bayes  4.846123  0.274343 
Decision Tree  8.787213  0.322790 
K Neighbors  19.703055  0.195351 
Logistic Regression  9.2131214  0.052230 
SVM  NaN  NaN 
MLP  NaN  NaN 
VII Conclusion
In this paper we have proposed a novel approach to predicting multi-class crime types by incorporating unsupervised learning techniques, and we relaxed some of the assumptions made in the current literature. We kept working with all class labels, and even though this yielded lower accuracy values, we were able to see that the best performing models remained the same. By combining supervised and unsupervised learning techniques, our workflow also produces results that can easily be generalized to other cities, since we do not lay a grid over the city as other studies have done so far. Due to the small number of raw features and the large number of class labels, we systematically crafted features in order to obtain better-fitting models. Specifically, we have described a methodology that runs a clustering algorithm on the data set and then uses the distances to the cluster centers as features in our supervised learning models. We achieved a log loss of 2.2323 with our Random Forest model, the best among the various models that we used. We hope this workflow of combining unsupervised and supervised learning models will inspire robust crime prediction workflows in the fight against crime.
VIII Acknowledgements
We would like to acknowledge Dr. Andrew Cohen from the Department of Electrical and Computer Engineering for teaching this course, and Dr. Robert Kane from the Department of Criminology and Justice Studies and Dr. Matthew Burlick from the Department of Computer Science and Informatics for advising us.
IX Future Work
We have used the Euclidean distance to calculate the distance from cluster centers to crime points. Since these are urban crimes, in future work we would like to see the effect of choosing a different metric, such as the city-block (Manhattan) distance, as in the sketch below.
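With SciPy's cdist, switching metrics is a one-line change; the coordinates below are illustrative only:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[-75.17, 39.99], [-75.24, 39.95]])   # illustrative crime points
centers = np.array([[-75.16, 40.00], [-75.10, 39.97]])  # illustrative cluster centers

euclidean = cdist(points, centers, metric="euclidean")
manhattan = cdist(points, centers, metric="cityblock")  # city-block (Manhattan) distance
```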
References
 [1] JohnsonHart, L., and Kane, R. (2016). Deserts of Disadvantage: The Diffuse Effects of Structural Disadvantage on Violence in Urban Communities. Crime & Delinquency. DOI: 10.1177/0011128716682228.
 [2] Chainey, S., Tompson, L., Uhlig, S.: The utility of hotspot mapping for predicting spatial patterns of crime. Security Journal 21, 4-28 (2008)
 [3] Kim S., Joshi P., Kalsi P.S. and Taheri P.: Crime Analysis Through Machine Learning, doi: 10.1109/IEMCON.2018.8614828
 [4] Hochreiter, Sepp, and Jürgen Schmidhuber, Long short-term memory. Neural Computation 9.8, 1997, pp. 1735-1780
 [5] Stalidis P., Semertzidis T., Daras P.: Examining Deep Learning Architectures for Crime Classification and Prediction, arXiv:1812.00602 (2018)
 [6] Stec A., Klabjan D.: Forecasting Crime with Deep Learning, arXiv: 1806.01486v1 (2018)
 [7] Zhuang Y., Almeida M., Morabito M., Ding W.: Crime Hot Spot Forecasting: A Recurrent Model with Spatial and Temporal Information, IEEE International Conference on Big Knowledge (2017)
 [8] Weisburd, David, and Cody W. Telep, Hot Spots Policing: What We Know and What We Need to Know, Journal of Contemporary Criminal Justice, Vol 30, 2014, pp. 200-220
 [9] Yu, C. H., Ding, W., Chen, P., and Morabito, M., Crime forecasting using spatio-temporal pattern with ensemble learning. Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer International Publishing, 2014.
 [10] Wang, D., Ding, W., Stepinski, T., Salazar, J., Lo, H., and Morabito, M., Optimization of criminal hotspots based on underlying crime controlling factors using geospatial discriminative pattern. International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems. Springer Berlin Heidelberg, 2012
 [11] Cesario, Eugenio, Charlie Catlett, and Domenico Talia, Forecasting Crimes Using Autoregressive Models. IEEE 14th Intl Conf on Dependable, Autonomic and Secure Computing, 14th Intl Conf on Pervasive Intelligence and Computing, 2nd Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, 2016
 [12] Open Philly Publicly available Crime Data Set, URL: https://www.opendataphilly.org/
 [13] Google Cloud Compute Engine Free Service, Specifications URL: https://colab.research.google.com/drive/151805XTDgdgHb3AXJCpnWaqRhop_2
X Appendix
Here we report more of our crime type aggregations over hours, months, and years. We encourage the reader to explore these figures to see the underlying patterns for specific crime types.