CASTNet: Community-Attentive Spatio-Temporal Networks for Opioid Overdose Forecasting

05/12/2019 ∙ by Ali Mert Ertugrul, et al. ∙ University of Pittsburgh ∙ Middle East Technical University

Opioid overdose is a growing public health crisis in the United States. This crisis, recognized as the "opioid epidemic," has widespread societal consequences, including the degradation of health and increases in crime rates and family problems. To improve overdose surveillance and to identify the areas in need of prevention efforts, in this work we focus on forecasting opioid overdose using real-time crime dynamics. Previous work identified various types of links between opioid use and criminal activities, such as financial motives and common causes. Motivated by these observations, we propose a novel spatio-temporal predictive model for opioid overdose forecasting by leveraging the spatio-temporal patterns of crime incidents. Our proposed model incorporates multi-head attentional networks to learn different representation subspaces of features. This deep learning architecture, called "community-attentive" networks, allows the prediction for a given location to be optimized by a mixture of groups (i.e., communities) of regions. In addition, our proposed model allows for interpreting which features, from which communities, contribute more to predicting local incidents, as well as how these communities are captured through forecasting. Our results on two real-world overdose datasets indicate that our model achieves superior forecasting performance and provides meaningful interpretations in terms of spatio-temporal relationships between the dynamics of crime and those of opioid overdose.




1 Introduction

Opioid use disorders and overdose rates in the United States have increased at an alarming rate over the past decade [warner2011drug]. Overdose deaths have risen since the 1990s, and the number of heroin overdose deaths has risen sharply since 2010 [rudd2016increases]. Drug overdose is the main cause of injury-related death in the U.S. [CDC]. The growth rate of opioid overdose, together with the number of impacted individuals in the U.S., has led many to classify this as an “opioid epidemic” [kolodny2015prescription]. Enhanced understanding of the dynamics of the overdose epidemic may help policy-makers to develop more effective epidemic prevention mechanisms and control strategies [jalal2018changing].

The opioid epidemic is a complex social phenomenon involving and interacting with various social, spatial and temporal factors [burke2016forecasting]. Highlighting the links between opioid use and various factors has drawn significant attention. Among those, studies have identified relationships between opioid use and crime incidences, including cause (that opioid use leads to criminal activities [bennett2008statistical]), effect (that involvement in criminal behavior leads to drug use [hammersley1989relationship]), and common causes (that crime and drug use tend to co-occur [seddon2005drugs]). Crime occurrences also have non-trivial spatio-temporal characteristics. For example, routine activity theory suggests that crimes may exhibit spatio-temporal lags, as the likely offenders of one place may reach suitable targets in other places. Therefore, unveiling the complicated relationship between opioid use and crime incidences is challenging.

In seeking data for investigation, detailed assessments of opioid use disorders and overdose growth require systematically collected well-resolved spatio-temporal data [gruenewald2013b]. However, the amount of systematically monitored data either at a regional or local level in the U.S. is very limited. In addition, there is no common reporting mechanism for incidents. For instance, the incident categories and the organization of categories vary significantly across the databases. On the other hand, crime data is meticulously collected, organized and stored, at a finer-grained level. Given the plausible relationship between the crime dynamics and opioid use as well as the availability of real-time crime data for various locations, in this study, we seek to explore the capability of forecasting opioid overdose using real-time crime data.

Recent work in predictive modeling has shown significant improvement in spatio-temporal event forecasting and time series prediction [qin2017dual, zhao2018distant]. However, these studies suffer from two main limitations. First, most of them overlook the complex interactions between local (observed from within a region) and global (observed from all regions) activities across time and space. Only a few have paid attention to this problem, yet they model the global activities as a single universal representation [ertugrul2018forecasting, ertugrul2019], which is either irrespective of the event location or is reweighted based on a pre-defined fixed proximity matrix [liang2018geoman]. In other words, none of the existing works learns to differentiate the pairwise activity relationships between a particular event location and other locations. Second, most spatio-temporal forecasting studies focus mainly on prediction performance and lack the interpretability needed to uncover the underlying spatio-temporal characteristics of the activities, such as (1) which local and global activity features are more predictive of subsequent events, and (2) which locations contribute more saliently to predicting opioid overdose with respect to the target location.

Inspired by the idea of multi-head attentional networks [vaswani2017], in this work we propose a novel deep learning architecture, called “CASTNet,” for opioid overdose forecasting using spatio-temporal characteristics of crime incidents, which seeks to address the aforementioned problems. Assuming that different locations could share similar dynamics, our approach aims to learn different representation subspaces of cross-regional dynamics, where each subspace involves a set of locations, called a “community,” that share similar behaviors. The proposed architecture is called “community-attentive” as it allows the prediction for a given location to be individually optimized by the features contributed by a mixture of communities. Specifically, combining the features of the given target location and the features from the communities (referred to as local and global dynamics, respectively), the model learns to forecast the number of opioid overdoses in the target location. Meanwhile, by leveraging Group Lasso regularization [scardapane2017group] and a hierarchical attention mechanism, our method allows for interpreting which local and global features are more predictive, which communities contribute more to predicting incidences at a location, and which locations contribute more to each community.

Overall, our contributions include: (1) A community-attentive spatio-temporal network: We propose a multi-head attention-based architecture that learns different representation subspaces of global dynamics (communities) to effectively forecast opioid overdoses for different target locations. (2) Interpretability in hierarchical attention and features: First, CASTNet incorporates a hierarchical attention mechanism which allows for interpreting community memberships (which locations form the communities), community contributions to forecasting local incidents, and informative time steps in both the local and global components. Second, CASTNet incorporates Group Lasso (GL) [scardapane2017group] to select informative features, which succinctly capture the activity types at both the local and global levels that are more associated with future opioid overdoses. (3) Extensive experiments: We performed extensive experiments using real-world datasets from the City of Cincinnati and the City of Chicago. The results indicate a significant improvement in forecasting performance with greater interpretability compared to several baselines and state-of-the-art methods.

2 Related Work

Existing works have investigated the links between opioid use and various social phenomena as well as contextual factors including crime and economic stressors. Among them, Hammersley et al. [hammersley1989relationship] stated that opportunities for drug use increase with involvement in criminal behavior. People dependent on opiates are disproportionately involved in criminal activities [bennett2008statistical], especially crimes committed for financial gain [pierce2015quantifying]. Seddon et al. [seddon2005drugs] revealed that crime and drug use share a common set of causes and co-occur. Most of the existing works studying the relationship between opioid use and social phenomena have employed basic statistical analysis and focused on the current situation and trends rather than on predicting or forecasting overdose. Moreover, these studies overlooked the interactions among the spatio-temporal dynamics of the locations. Among the studies predicting or forecasting opioid overdose, regression-based approaches have been applied at the individual level [glanz2018prediction] and the state level [kennedy2016opioid]. Also, a neural network-based approach has been proposed [ertugrul2018forecasting] to forecast heroin overdose from crime data, which identifies the predictive hot-spots. However, the effect of these hot-spots on the prediction is universal and irrespective of event locations.

Furthermore, there have been studies that utilized spatial and temporal dependencies for event forecasting and time series prediction. With the success of neural network-based models, several studies employed neural models to forecast or detect events related to anomalies [chalapathy2018group], crime [huang2018deepcrime] and social movements [ertugrul2019]. Additionally, several studies utilized deep neural models for time series prediction. Among them, Ghaderi et al. [ghaderi2017deep] proposed a recurrent neural network (RNN) based model to forecast wind speeds. Qin et al. [qin2017dual] presented a dual-stage attention-based RNN model for time series prediction. Similarly, Liang et al. [liang2018geoman] proposed multi-level attention networks for geo-sensory time series prediction. A few of these studies considered the complex relationships between local and global activities, yet they modeled the global activities as a universal representation, which either does not change from event location to location or is adjusted by a pre-defined fixed proximity matrix. Most of these works simply employed a single temporal model for the various local and global spatio-temporal activities, which is insufficient to capture the complex spatio-temporal patterns at both local and global levels. Moreover, existing methods primarily focus on forecasting performance and provide little or no interpretability to unveil the underlying spatio-temporal characteristics of the local and global activities.

3 Method

3.1 Problem Definition

Suppose there are $L$ locations-of-interest (e.g. neighborhoods, districts) and each location $l$ can be represented as a collection of its static and dynamic features. While the static features (e.g. demographics, economic indicators) remain the same or change slowly over a longer period of time, the dynamic features are the updates for each time interval $t$ (e.g. day, week). Let $\mathbf{x}^{s}_{l}$ be the static features of location $l$, and $\mathbf{x}^{d}_{t,l}$ the set of dynamic features for location $l$ at time $t$. We are also given a continuous variable $y_{t+\tau,l}$ that indicates the number of opioid overdose incidents (e.g. emergency medical services (EMS) calls, deaths) at location $l$ at future time $t+\tau$. The collection of dynamic features from all locations-of-interest within an observing time window of size $w$ up to time $t$ can be represented as $\mathcal{X}_{t-w+1:t} = \{\mathbf{X}_{t-w+1}, \ldots, \mathbf{X}_{t}\}$, where $\mathbf{X}_{t} = \{\mathbf{x}^{d}_{t,1}, \ldots, \mathbf{x}^{d}_{t,L}\}$.

Our goal is to predict the number of future opioid overdose incidents $y_{t+\tau,l}$ at a specific location $l$ at a future time $t+\tau$, where $\tau$ is called the lead time for forecasting. The forecasting is based on the static and dynamic features of the target location itself, as well as the dynamic features in the environment (from all locations-of-interest). Therefore, the forecasting problem can be formulated as learning a function $f$ that maps the static and dynamic features to the number of opioid overdose incidents at the future time $t+\tau$ at a target location $l$.
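To make the windowing in this formulation concrete, the following minimal sketch pairs each observing window of size w with the target observed after the lead time. The function name `make_samples`, the argument names and the toy numbers are hypothetical and only illustrate the indexing; they are not taken from the authors' code.

```python
def make_samples(series, targets, w, tau):
    """Pair each length-w window ending at time t with the target at time t + tau.

    series  : per-time-step feature values (or vectors) for one location
    targets : overdose counts aligned with `series`
    """
    samples = []
    # t is the index of the last observed step in the window
    for t in range(w - 1, len(series) - tau):
        window = series[t - w + 1 : t + 1]  # observations t-w+1 .. t
        samples.append((window, targets[t + tau]))
    return samples


# Toy example: six weeks of one made-up feature and made-up overdose counts
features = [3, 5, 4, 6, 7, 5]
overdoses = [0, 1, 0, 2, 1, 3]
pairs = make_samples(features, overdoses, w=3, tau=1)
```

With w = 3 and a lead time of one step, the first window covers the first three weeks and is paired with the count of the fourth week.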

To facilitate spatio-temporal interpretation of the forecasting, we seek to develop a model that can differentiate the contribution of the features, the locality (local features vs. global features) and the importance of latent communities when contributing to the prediction of other locations. Therefore, we further organize the dynamic features into two sets: the local features, which represent the sequence of dynamic features for the target location $l$ within the observing window, and the global features, which contain the sequences of dynamic features for all locations-of-interest $i = 1, \ldots, L$.

Figure 1: Overview of our proposed CASTNet architecture. The local component (a) models local dynamics of the locations, and the static component (b) models the static features. The global component (c) summarizes different representation subspaces (i.e. communities) of global dynamics, learned by community blocks (d), by querying these multi-subspace representations through the embedding of the target location. Spatial Att. Block (e) reweights the global dynamics of locations. Checkered rectangles on top of the inputs represent GL regularization. Red arrows indicate the queries for the corresponding attentions. “FC”: fully-connected layer; “embed”: embedding layer.

3.2 Architecture

In this work, we propose an interpretable, community-attentive, spatio-temporal predictive model, named CASTNet. As shown in Fig. 1, our proposed architecture consists of three primary components: a local component (Fig. 1a), a static component (Fig. 1b) and a global component (Fig. 1c). The global component is designed to model the global contribution of dynamic features for all locations-of-interest by learning different representation subspaces of global dynamics, and to output a target location-specific global contribution. On the other hand, the local component is designed to model the contribution of local dynamic features for the target location. Finally, the static component models location-specific static information about the target location.

3.2.1 Global Component.

This component produces the target location-specific global contribution (from all locations) to forecast the number of incidents at the target location $l$ at future time $t+\tau$. It consists of $K$ community blocks, where each community block learns a different representation subspace of the global dynamic features, which is inspired by the idea of multi-head attention [vaswani2017]. A community block (Fig. 1d) models the global dynamic features through a hierarchical attention network which consists of a spatial attention block (Fig. 1e), a recurrent unit and a temporal attention. For the sake of clarity, we explain the internal mechanism of the global component in a bottom-up manner, following the order Fig. 1e → 1d → 1c:

Spatial Attention Block is used to reweight the contribution of the dynamic features of each location $i$ at time $t$. More specifically, the attention weight $\alpha^{k}_{t,i}$ represents the contribution of location $i$ at time $t$ to community $k$. Since a higher spatial attention weight for a location indicates the involvement of its dynamic features in this community, we call this community membership.

The context vector $\mathbf{z}^{k}_{t}$ summarizes the aggregated contribution of all locations at time $t$ by combining their dynamic features according to these attention weights. The parameters of the attention scoring function are learned, and $d$ denotes the dynamic feature size of any location. After the context vector is computed, it is fed to the recurrent unit.
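The spatial attention step can be illustrated with a minimal sketch. The additive (tanh) scoring function and the parameter names `W`, `b` and `v` are assumptions chosen for illustration, since the exact scoring form is not reproduced here; only the softmax reweighting of locations and the weighted aggregation into a context vector follow directly from the description above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(features, W, b, v):
    """Reweight per-location feature vectors at a single time step.

    features : list of L vectors, each of length d (one per location)
    W, b, v  : assumed scoring parameters (W is h x d; b and v have length h)
    Returns the attention weights (community membership) and the context vector.
    """
    scores = []
    for x in features:
        hidden = [
            math.tanh(sum(w * x_c for w, x_c in zip(row, x)) + bias)
            for row, bias in zip(W, b)
        ]
        scores.append(sum(v_r * h_r for v_r, h_r in zip(v, hidden)))
    alpha = softmax(scores)
    d = len(features[0])
    # Context vector: attention-weighted sum of the location feature vectors
    context = [sum(a * x[c] for a, x in zip(alpha, features)) for c in range(d)]
    return alpha, context


# Toy input: three locations with d = 2 dynamic features each (made-up numbers)
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
v = [1.0, 1.0]
alpha, context = spatial_attention(features, W, b, v)
```

In this toy setting the third location receives the largest weight, so it would be read as the strongest member of the community at that time step.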

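Recurrent unit is used to capture the temporal relationships among the reweighted global dynamic features for community $k$ as $\mathbf{h}^{k}_{t} = \mathrm{LSTM}_{k}(\mathbf{z}^{k}_{t}, \mathbf{h}^{k}_{t-1})$, where $\mathrm{LSTM}_{k}$ is the LSTM [hochreiter1997] for community $k$, $\mathbf{z}^{k}_{t}$ is the spatially reweighted input at time $t$, and $\mathbf{h}^{k}_{t}$ is the $t$-th hidden state of the $k$-th community. We use LSTM in our model (in each community block) since it mitigates the vanishing and exploding gradient problems of basic RNNs.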

Temporal Attention is applied on top of the LSTMs to differentiate the contribution of the latent representations of the global dynamic features at each time point and for each community. To make the output specific to the target location, we incorporate a query scheme based on a time-dependent community membership (i.e., the contribution of each location to the community), where the membership is further weighted based on each location’s spatial proximity to the target location (with nearby locations getting larger weights than farther ones). Specifically, let $\beta^{k}_{t}$ denote the attention weight over the hidden state of community $k$ at time $t$. The context vector $\mathbf{c}^{k}$, which is the aggregate contribution from community $k$, is learned through this proximity-based weighting scheme, where $\mathbf{p}_{l}$ is a vector encoding the proximity of the target location $l$ to all locations. Here, the proximity of two locations is calculated based on the inverse of their geographic (haversine) distance.
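As a hedged illustration of the proximity vector, the sketch below computes inverse haversine distances from a target location to a list of location centroids. The helper names, the Earth-radius constant, the value assigned to the target itself (`self_value`) and the coordinates are all assumptions; whether any further normalization is applied is not specified here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def proximity_vector(target, centroids, self_value=1.0):
    """Inverse-distance proximity of `target` to every centroid.

    The entry for a zero distance (the target itself) is set to `self_value`,
    which is an assumption made here to avoid division by zero.
    """
    proximities = []
    for lat, lon in centroids:
        dist = haversine_km(target[0], target[1], lat, lon)
        proximities.append(self_value if dist == 0 else 1.0 / dist)
    return proximities


# Made-up centroids: the target itself, a nearby location and a farther one
target = (41.88, -87.63)
centroids = [(41.88, -87.63), (41.90, -87.65), (41.75, -87.60)]
p = proximity_vector(target, centroids)
```

Nearby locations receive larger entries than farther ones, which is the property the temporal attention query relies on.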


Community Attention aims to produce a global contribution with respect to the target location $l$ by combining the different representation subspaces of the $K$ communities. A soft-attention approach is employed to combine the contributions from all communities. Here, to make the prediction specific to the target location, we incorporate a query scheme that takes each community vector $\mathbf{c}^{k}$ as a key and the embedding of the target location, $\mathbf{e}_{l}$, as a query. The parameters of this attention are learned, $m$ is the number of hidden units of the LSTMs in the community blocks, and $\mathbf{c}^{G}$ is the output of the global component.
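A minimal sketch of the community attention follows. It assumes an additive scoring function over the concatenation of a community vector (key) and the target-location embedding (query); the parameter names `W`, `b`, `v` and the toy values are hypothetical, and the actual scoring form may differ.

```python
import math


def community_attention(community_vecs, target_embedding, W, b, v):
    """Soft attention over K community vectors, queried by the target embedding.

    community_vecs   : list of K vectors of length m (keys)
    target_embedding : embedding of the target location (query)
    W, b, v          : assumed scoring parameters for the concatenated input
    Returns the community weights and the combined global vector.
    """
    scores = []
    for c_k in community_vecs:
        z = c_k + target_embedding  # list concatenation of key and query
        hidden = [
            math.tanh(sum(w * z_c for w, z_c in zip(row, z)) + bias)
            for row, bias in zip(W, b)
        ]
        scores.append(sum(v_r * h_r for v_r, h_r in zip(v, hidden)))
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    gamma = [e / total for e in exps]
    m = len(community_vecs[0])
    # Global output: attention-weighted sum of the community vectors
    output = [sum(g * c_k[j] for g, c_k in zip(gamma, community_vecs)) for j in range(m)]
    return gamma, output


# Toy input: K = 2 communities with m = 2, and a 1-dimensional location embedding
community_vecs = [[1.0, 0.0], [0.0, 1.0]]
target_embedding = [0.5]
W = [[1.0, -1.0, 0.0]]
b = [0.0]
v = [1.0]
gamma, global_out = community_attention(community_vecs, target_embedding, W, b, v)
```

Because the weights depend on the target embedding, two target locations can draw on the same communities in different proportions.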

3.2.2 Local Component.

This component is designed to model the contribution of the local dynamic features for any target location $l$ (Fig. 1a). It includes a recurrent unit and a temporal attention that focuses on the most informative time instants. The dynamic features of the target location are fed to the recurrent unit to model local dynamics: $\mathbf{h}_{t} = \mathrm{LSTM}(\mathbf{x}^{d}_{t,l}, \mathbf{h}_{t-1})$, where the recurrent unit is an LSTM, as in the global component, $\mathbf{x}^{d}_{t,l}$ denotes the dynamic features of the target location $l$ at time $t$, and $\mathbf{h}_{t}$ is the $t$-th hidden state of the LSTM. We also employ a temporal attention on top of the LSTM in this component, which selects the most informative hidden states (time instants) with respect to the dynamic features of the target location $l$. To be succinct, we only provide the calculation of the output vector of the local component: $\mathbf{c}^{L} = \sum_{t} \beta_{t} \mathbf{h}_{t}$, where $\beta_{t}$ is the attention weight for the hidden state at time $t$, and $\mathbf{c}^{L}$ is the output of the local component.

3.2.3 Static Component.

This component models the static information specific to the target location $l$ (Fig. 1b). The input incorporates the static features, $\mathbf{x}^{s}_{l}$, and a one-hot encoding vector that represents the target location. We apply a fully connected (FC) layer to separately learn a latent representation for each of the two types of information. In particular, the one-hot location vector is converted into an embedding, which is utilized in the aforementioned query component (see Eq. (7)). The output of this component, $\mathbf{c}^{S}$, is the concatenation of the learned embedding and the latent representation of the static features.

3.2.4 Objective Function.

The objective function consists of three terms: prediction loss, orthogonality loss and Group Lasso (GL) regularization. The two penalty terms are added to the prediction loss, weighted by the tuning parameters $\lambda$ and $\gamma$, respectively. The prediction loss is the mean squared error (MSE) between $\hat{y}_{n}$ and $y_{n}$, the predicted and actual number of opioid overdose incidents for sample $n$, respectively.

The orthogonality penalty is added to avoid learning redundant memberships across communities, i.e., multiple communities consisting of a similar group of locations. To encourage community memberships to be as distinguishable as possible, we incorporate this orthogonality loss term into the objective function. Let $\mathbf{m}_{k}$ be the community membership vector denoting how each location contributes to community $k$, averaged over time, and let $\mathbf{M}$ be the matrix consisting of these membership vectors for all communities. The orthogonality loss is computed from $\mathbf{M}$ and the identity matrix $\mathbf{I}$. This loss term encourages different communities to have non-identical locations as members as much as possible, which helps reduce the redundancy across communities.

Lastly, we incorporate GL regularization into the objective function, which imposes sparsity on a group level [scardapane2017group] and which has been found effective in several domains [zhu2016co, ochiai2017automatic] for selecting informative features. Our main motivation for employing GL is to select community-level and local-level informative features. It enables us to interpret and differentiate which features are important for opioid overdose incidents. In its definition, $\mathbf{W}^{k}$ denotes the input weight matrix of the $k$-th community block in the global component, while $\mathbf{W}^{L}$ and $\mathbf{W}^{S}$ represent the input weight matrices of the local and the static components, respectively; $\mathbf{g}$ is the vector of outgoing connections (weights) from an input neuron, $G$ denotes a set of input neurons, and $|\mathbf{g}|$ indicates the dimension of $\mathbf{g}$.
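The two penalty terms can be sketched as follows. Both forms are assumptions for illustration: the orthogonality loss is written as the squared Frobenius norm of M Mᵀ − I, and the Group Lasso term uses the common form in which the Euclidean norm of each input neuron's outgoing weights is scaled by the square root of its dimension. The function names and toy matrices are hypothetical.

```python
import math


def orthogonality_loss(M):
    """Assumed form: squared Frobenius norm of M M^T - I.

    M has one row per community (a membership vector over locations).
    """
    K = len(M)
    loss = 0.0
    for a in range(K):
        for b in range(K):
            dot = sum(x * y for x, y in zip(M[a], M[b]))
            target = 1.0 if a == b else 0.0
            loss += (dot - target) ** 2
    return loss


def group_lasso(W):
    """Assumed form: sum over input neurons of sqrt(|g|) * ||g||_2.

    W has one row per input neuron, holding its outgoing weights g.
    """
    return sum(
        math.sqrt(len(row)) * math.sqrt(sum(w * w for w in row))
        for row in W
    )


# Two communities with disjoint members incur no orthogonality penalty ...
disjoint = orthogonality_loss([[1.0, 0.0], [0.0, 1.0]])
# ... while two identical communities are penalized
redundant = orthogonality_loss([[1.0, 0.0], [1.0, 0.0]])
# An input neuron whose outgoing weights are all zero adds nothing to the GL term
gl_value = group_lasso([[3.0, 4.0], [0.0, 0.0]])
```

Driving a whole row of outgoing weights to zero removes the corresponding input feature, which is what makes the GL term usable for feature selection.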

3.3 Features

We employ two types of features namely static features and dynamic features.

Static features include the economic status, housing status and educational level of neighborhoods, as well as demographics such as population, gender diversity index and race diversity index, which are obtained from census data. The diversity index is calculated using normalized entropy. Furthermore, we employ median household income, per capita income and poverty (%) as the economic indicators. We utilize the percentage of vacant houses (housing occupancy) and the percentage of owner-occupied houses (housing tenure) as the housing-related static features. We also consider the percentage of residents whose educational attainment is high school graduation or below as the educational attainment indicator. As a result, we obtain a total of nine static features. Note that we apply z-score normalization to median household income and per capita income, and a log-transformation to population, while preparing the feature vectors.
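The preprocessing steps named above can be sketched as follows. The function names, the use of the natural logarithm, the population (rather than sample) standard deviation and the toy numbers are assumptions; only the techniques themselves (normalized entropy, z-score, log-transformation) come from the text.

```python
import math


def diversity_index(counts):
    """Normalized Shannon entropy of group counts, ranging from 0 to 1."""
    if len(counts) < 2:
        return 0.0
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))


def z_scores(values):
    """Standardize values using the population standard deviation (an assumption)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


# Made-up neighborhood statistics
even_split = diversity_index([50, 50])    # two equally sized groups
single_group = diversity_index([100, 0])  # one group only
incomes = z_scores([30000.0, 45000.0, 60000.0])
log_population = math.log(20000)          # natural log is an assumption
```

An evenly split population yields the maximum index of 1, and a single-group population yields 0.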

Dynamic features capture the crime dynamics of the locations that may be predictive of opioid overdose. We extract the dynamic features from the public safety data portals of the cities. Each crime incident is identified by a unique crime incident number and has a certain type within a hierarchical structure. The crime data gathered from different cities may have very different categories. For example, the dataset from the City of Chicago includes many more categories than that from the City of Cincinnati. Here, we only consider the highest level, “primary crime types,” and eliminate rare categories. The full list of crime categories used in this work can be found in Fig. 4. In addition to these features, we also utilize the number of total crimes and the number of total opioid overdose incidents as additional dynamic features. For each neighborhood and each time unit, the feature vector contains the total number of crimes, the total number of incidents for each primary crime type and the number of opioid overdoses. We apply z-score normalization to all dynamic features.

4 Experiments

4.1 Datasets

We apply our method to forecast opioid overdose in two cities, namely the City of Chicago and the City of Cincinnati. The neighborhood boundaries officially recognized by the City of Cincinnati and the City of Chicago are called “Statistical Neighborhood Approximations (SNAs)” and “community areas”, respectively. Hereafter, we use “neighborhoods” to refer to both. There are 77 neighborhoods in Chicago, whereas Cincinnati consists of 50 neighborhoods. While we select 47 neighborhoods from Chicago (where of opioid overdose deaths occur), we use all neighborhoods of Cincinnati in our experiments. Table 1 shows descriptive information about both datasets. For each city, we collect three types of data related to crime, opioid overdose and census as follows:

Crime data: We collect crime incident information (geo-location, time and primary type of the crimes) from the open data portals of the cities. We use the Public Safety Crime dataset and the Police Data Initiative (PDI) Crime Incidents dataset to extract such information for the City of Chicago and the City of Cincinnati, respectively. We extract 14 crime-related dynamic features for Chicago, and 9 dynamic features for Cincinnati.

Opioid overdose data: We collect different types of opioid overdose data for each city since there is no systematic monitoring of drug abuse at either a regional or state level in the U.S. For Chicago, we collect opioid overdose death records (geo-location and time) from the Opioid Mapping Initiative Open Datasets. On the other hand, we utilize the EMS response data for heroin overdoses in Cincinnati.

Census data: We use the 2010 Census data to extract the features related to the demographics, economic status, housing status and educational status of the neighborhoods.

City        #Neighborhoods  #Dynamic features  #Static features  #Crimes  #Opioid ODs     Time unit
Chicago     47              15                 9                 573207   1468 deaths     1 week
Cincinnati  50              10                 9                 75779    5401 EMS calls  1 week
Table 1: Descriptive information about our datasets.

4.2 Baselines

We compare our model with a number of baselines as follows:

  • HA: Historical average.

  • ARIMA: is a well-known method for predicting future values for time series.

  • VAR: captures the linear inter-dependencies among multiple time series and forecasts future values.

  • SVR: is Support Vector Regression. We use two variants: one trained as a separate model for each location and one trained as a single model for all locations.

  • LSTM: We train an LSTM network in which the dynamic features are fed to the LSTM, and the latent representations are then concatenated with the static features for prediction.

  • DA-RNN [qin2017dual]: is a dual-stage attention-based RNN model for spatio-temporal time series prediction.

  • GeoMAN [liang2018geoman]: is a multi-level attention-based RNN model for spatio-temporal prediction, which shows state-of-the-art performance in the air quality prediction task.

  • ActAttn [ertugrul2019]: is a hierarchical spatio-temporal predictive framework for social movements. We replace its final classification layer with a regression layer so that it can serve as a regression baseline.

Furthermore, to evaluate the effectiveness of individual components of our model, we also include its several variants for the comparison as follows:

  • CASTNet-noGL: GL regularization is not incorporated into the loss function.

  • CASTNet-noOrtho: Orthogonality penalty is not applied so that differentiation of the communities is not encouraged.

  • CASTNet-noSA: The spatial attentions are removed from the community blocks. Instead, the feature vectors of all locations are concatenated.

  • CASTNet-noTA: The temporal attentions in both local and global components are removed from the architecture.

  • CASTNet-noCA: The community attention is removed from the architecture. Instead, the context vectors of the communities are concatenated.

  • CASTNet-noSC: The static features are excluded from the architecture, yet the location-ID is still embedded.

4.2.1 Settings:

We used ‘week’ as time unit and ‘neighborhood’ as location unit. We divided datasets into training, validation and test sets with ratio of 75%, 10% and 15%, respectively. We set to make short-term predictions. For RNN-based methods, hidden unit size of LSTMs was selected from . The networks were trained using Adam optimizer with a learning rate of 0.001. For each LSTM layer, dropout of 0.1 was applied to prevent overfitting. In our models, the regularization factors and were optimized from the small sets and , respectively using grid search. For ARIMA and VAR, the orders of the autoregressive and moving average components were optimized for the time lags between 1 and 11. For RNN-based methods, we performed experiments with different window sizes , and shared the results for (the best setting for all models). Our code and data are available at

5 Results

                                     Chicago            Cincinnati
Method                               MAE      RMSE      MAE      RMSE
HA                                   0.2329   0.3385    0.5728   0.8727
ARIMA                                0.2272   0.3396    0.5717   0.8952
VAR                                  0.2242   0.3386    0.5606   0.8712
SVR (separate model per location)    0.2112   0.3321    0.5153   0.8609
SVR (single model, all locations)    0.1984   0.3063    0.4886   0.8602
LSTM                                 0.2024   0.3134    0.5235   0.8267
DA-RNN [qin2017dual]                 0.1726   0.3051    0.4817   0.8225
GeoMAN [liang2018geoman]             0.1679   0.2829    0.5034   0.8453
ActAttn [ertugrul2019]               0.1693   0.2937    0.4827   0.8326
CASTNet-noGL                         0.1662   0.3129    0.4703   0.8311
CASTNet-noOrtho                      0.1649   0.2948    0.4716   0.8109
CASTNet-noSA                         0.1608   0.2893    0.4579   0.8152
CASTNet-noTA                         0.1641   0.2876    0.4700   0.8141
CASTNet-noCA                         0.1631   0.3069    0.4730   0.8225
CASTNet-noSC                         0.1693   0.2980    0.4692   0.8291
CASTNet                              0.1391   0.2679    0.4516   0.8032
Table 2: Performance Results.

5.1 Performance Comparison

Table 2 shows that CASTNet achieves the best performance in terms of both mean absolute error (MAE) and root mean squared error (RMSE) on both datasets. Our model shows 17.2% and 5.3% improvement in terms of MAE and RMSE, respectively, on the Chicago dataset compared to the state-of-the-art approach GeoMAN. Similarly, CASTNet improves the performance by 6.3% and 2.4% on the Cincinnati dataset in terms of MAE and RMSE, respectively, compared to DA-RNN, which shows the best performance among the other baselines. Furthermore, we observe that spatio-temporal RNN-based models mostly outperform the other baselines, which indicates that they better learn the complex spatio-temporal relationships between crime and opioid overdose dynamics.
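As a worked check, the sketch below defines the two error metrics and recomputes the relative improvements from the rounded entries of Table 2. The function names are hypothetical, and the improvement is assumed to be measured relative to the baseline's error.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def relative_improvement(baseline_error, model_error):
    """Percentage reduction in error relative to the baseline (assumed definition)."""
    return 100.0 * (baseline_error - model_error) / baseline_error


# Chicago: CASTNet vs. GeoMAN (values from Table 2)
chicago_mae = relative_improvement(0.1679, 0.1391)   # about 17.2
chicago_rmse = relative_improvement(0.2829, 0.2679)  # about 5.3

# Cincinnati: CASTNet vs. DA-RNN (values from Table 2)
cincinnati_mae = relative_improvement(0.4817, 0.4516)   # about 6.2
cincinnati_rmse = relative_improvement(0.8225, 0.8032)  # about 2.3
```

Computed from the rounded table entries, the Cincinnati figures come out near 6.2% and 2.3%, slightly below the reported 6.3% and 2.4%; rounding of the table entries could account for a gap of this size.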

We further evaluate the effectiveness of each individual component of CASTNet with an ablation study. As described in Section 4.2, each variant differs from the proposed CASTNet by the removal of one tested component (with the others kept as identical as possible). Table 2 shows that the removal of GL results in a significantly lower performance compared to the others. In addition, CASTNet-noGL is no longer able to select informative features. Similarly, excluding the orthogonality term (CASTNet-noOrtho) removes the ability to learn distinguishable communities or representation subspaces and reduces the performance as well. Moreover, comparing CASTNet with CASTNet-noCA shows that employing community attention has a great impact on the performance, which indicates that learning pairwise activity relationships between a particular event location and communities is crucial. Location-specific static features are also informative, since their exclusion (CASTNet-noSC) degrades the performance in both cases. The individual component that provides the least performance gain is spatial attention in both cases. However, its removal (CASTNet-noSA) results in the loss of the ability to interpret community memberships. These results show that each individual component makes an important contribution to forecasting performance.

Figure 2: MAE and RMSE results w.r.t. change in the number of communities. Panels: (a) Chicago, (b) Cincinnati.

Moreover, we evaluate the performance of CASTNet with respect to the number of communities $K$. We conduct experiments with different values of $K$ selected from {0, 1, …, 6}, and the results are given in Fig. 2. Note that the model does not consider the global contribution when $K = 0$. Also, when $K = 1$, the model yields a single universal representation of global activities which is irrespective of the event locations. The best-performing value of $K$ for each of the Chicago and Cincinnati datasets can be read from Fig. 2. We observe that as $K$ increases up to its optimum value, the performance improves and some communities are decomposed to form new communities. However, as $K$ continues to increase beyond its optimum value, the performance starts to decrease slightly or remains stable, and the semantic subspaces of some communities become similar. This experiment indicates that learning different representations of global activities significantly improves the forecasting performance.

5.2 Analysis of Community Memberships and Community Contributions

We analyze the community memberships of the neighborhoods and the community contributions to forecasting future opioid overdose by answering the following questions.

5.2.1 How do locations contribute to communities?

CASTNet learns different representation subspaces (communities) of global dynamics, unlike previous work [liang2018geoman, ertugrul2019], and each community consists of a different group of members due to the orthogonality penalty. We show the learned communities and their memberships (i.e., the spatial attention weights in Eq. (2), averaged over time for ease of interpretation) on the left side of Fig. 3(a) and Fig. 3(b) for Chicago and Cincinnati, respectively, where the line thickness represents the degree to which a location contributes to the corresponding community. Note that neighborhoods on the left side of Fig. 3(a) and Fig. 3(b) are ordered by the number of crimes. As shown in Fig. 3, most locations are dedicated to one community. For the Chicago model (Fig. 3(a)), Austin (25), which has the highest number of crime incidents and opioid overdose deaths, formed a separate community by itself. While North Lawndale (29) and Humboldt Park (23) together formed one community, West Garfield Park (26), East Garfield Park (27) and North Lawndale (29) formed another. Note that the neighborhoods of these two communities have the highest opioid overdose death rate after Austin (25). On the other hand, one community is formed by neighborhoods with low crime and overdose death rates, including Fuller Park (37), McKinley Park (59) and West Elsdon (62). Furthermore, for the Cincinnati model (Fig. 3(b)), Westwood (49), where the highest number of crimes were committed, formed a separate community by itself, similar to the behavior of Austin in the Chicago case. East Price Hill (13), West Price Hill (48), Avondale (1) and Over-The-Rhine (34) formed a community; these neighborhoods have the highest crime rate after Westwood (49) and the highest opioid overdose rate. On the other hand, the remaining neighborhoods (with low and moderate crime rates) form a community, and their memberships in that community are almost equal.

Figure 3: Community memberships and community contributions to forecasting. (a) Chicago. (b) Cincinnati. For each community, the left side represents community memberships (how each location contributes to the community), and the right side represents the average community contribution (how the community contributes to predicting a target location). Edge thickness indicates the weight of community membership (left side) and community contribution (right side). Node size denotes the overall community membership of a location (left side) and the overall community contribution to forecasting overdose in the target neighborhood (right side). Edge color shows the input and output of a specific community. On the left side, the node color of a neighborhood indicates the community in which that neighborhood has the highest membership. On the right side, the node color of a neighborhood denotes the community from which the neighborhood receives the largest contribution. Only edges whose weights are above a certain threshold are shown.
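The membership summary described above (spatial attention weights averaged over time, a dominant community per location, and thresholded edges) can be sketched as below. This is an illustrative reading of the text, not the authors' plotting code: the nested-list layout `attn[t][c][i]`, the function names, the threshold value and the toy weights are all assumptions.

```python
def average_over_time(attn):
    """Average attention weights over time steps.

    attn[t][c][i] is assumed to hold the weight of location i in community c
    at time step t. Returns avg[c][i].
    """
    n_steps = len(attn)
    n_comm = len(attn[0])
    n_loc = len(attn[0][0])
    return [
        [sum(attn[t][c][i] for t in range(n_steps)) / n_steps for i in range(n_loc)]
        for c in range(n_comm)
    ]

def dominant_community(avg):
    """For each location, the index of the community with the highest membership."""
    n_comm, n_loc = len(avg), len(avg[0])
    return [max(range(n_comm), key=lambda c: avg[c][i]) for i in range(n_loc)]

def edges_above(avg, threshold):
    """(community, location, weight) triples whose weight exceeds the threshold."""
    return [
        (c, i, w)
        for c, row in enumerate(avg)
        for i, w in enumerate(row)
        if w > threshold
    ]

# Invented toy weights: 2 time steps, 2 communities, 3 locations.
attn = [
    [[0.8, 0.2, 0.0], [0.1, 0.1, 0.8]],
    [[0.6, 0.4, 0.0], [0.1, 0.3, 0.6]],
]
avg = average_over_time(attn)
colors = dominant_community(avg)   # community used to color each location node
edges = edges_above(avg, 0.25)     # edges that would be drawn
```

In this toy case the first two locations are assigned to the first community and the third location to the second, mirroring the observation that most locations are dedicated to one community.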

5.2.2 How do the communities contribute to forecasting?

CASTNet is capable of modeling the pairwise activity relationships between a particular event location and the communities. It allows the target location to attend to the communities to select location-specific global contributions. We analyze how these communities contribute to forecasting by visualizing the community attention weights (i.e., the weights in Eq. (8), averaged over test samples for each neighborhood) in Fig. 3(a) and Fig. 3(b) for Chicago and Cincinnati, respectively. While the left side of each figure represents the community memberships, the right side indicates the average community contributions for each neighborhood. Note that neighborhoods on the right side of Fig. 3(a) and Fig. 3(b) are ordered by the number of opioid overdoses. For the Chicago case, and contribute more than the other communities to forecasting overdose. While contributes more to neighborhoods with a low or moderate opioid overdose death rate, contributes more to the neighborhoods where the death rate is higher. also contributes more to the neighborhoods with the highest death rate (e.g. Austin (25), Humboldt Park (23)). This means that any particular neighborhood attends more to the community formed by similar neighborhoods. On the other hand, the community formed by Austin (25) alone does not contribute significantly to any neighborhood, although Austin is a crime hot-spot. For the Cincinnati case, is a very dominant community, which makes the largest global contribution to most of the neighborhoods. The neighborhoods that formed and (e.g. East Price Hill (13), West Price Hill (48), Westwood (49)) are very predictive, and changes in their dynamics have a greater impact on forecasting future overdoses in the target neighborhoods. On the other hand, has a larger contribution to neighborhoods where the overdose rate is the highest. This indicates that the crimes committed in the members of are also informative for forecasting future overdoses in opioid hot-spots.
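The per-neighborhood averaging of community attention weights over test samples can be sketched as follows. This is only an illustration of the aggregation step: the sample format `(location_id, weights)`, the function name and the toy values are assumptions, and the weights themselves would come from the trained model rather than from this code.

```python
def mean_contribution_by_location(samples):
    """Average community attention weights per target location.

    samples is an iterable of (location_id, weights) pairs, where weights
    holds one attention weight per community for a single test sample
    (an assumed format). Returns {location_id: [mean weight per community]}.
    """
    sums = {}
    counts = {}
    for loc, weights in samples:
        if loc not in sums:
            sums[loc] = [0.0] * len(weights)
            counts[loc] = 0
        sums[loc] = [s + w for s, w in zip(sums[loc], weights)]
        counts[loc] += 1
    return {loc: [s / counts[loc] for s in total] for loc, total in sums.items()}

def top_community(mean_weights):
    """Index of the community with the largest average contribution."""
    return max(range(len(mean_weights)), key=lambda c: mean_weights[c])

# Invented toy samples: two test samples for location "A", one for "B".
samples = [("A", [0.5, 0.5]), ("A", [0.25, 0.75]), ("B", [1.0, 0.0])]
contrib = mean_contribution_by_location(samples)
top = {loc: top_community(w) for loc, w in contrib.items()}
```

The index returned by `top_community` corresponds to the community from which a neighborhood receives its largest average contribution, which is the quantity used for node color on the right side of Fig. 3.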

5.3 Feature Analysis

Figure 4: Importance of dynamic features. Mean absolute values of input weights of local and global components.

We investigate the importance of dynamic features by analyzing the mean absolute input weights of the local and global components, as shown in Fig. 4. For the Chicago case, GL selects Narcotics and Assault as the most important features for future opioid overdose deaths in the same location. Moreover, Theft, Deceptive Practice, Narcotics, Burglary and Motor V. Theft are the predictive features from while Weapons Violation, Deceptive Practice (e.g. Fraud) and Criminal Trespass are significant from . Recall that and are the communities that contribute most to forecasting (see Fig. 3(a)). This shows that property crimes (e.g. Theft, Burglary, Deceptive Practice) are more significant predictors than violent crimes for Chicago. Such crimes previously committed in the members of and may be a significant indicator of future opioid overdose deaths in Chicago. On the other hand, Battery, Narcotics, Burglary, and Motor V. Theft are predictive features from while Battery, Total Crimes and Other Offenses (e.g. offenses against family) are significant from . However, has a larger contribution than the other communities only for Austin (25). does not provide as significant a contribution to any neighborhood as the other communities do. For the Cincinnati case, Opioid Overdose Occ. is the most predictive feature for forecasting future opioid overdose in the same location, which means that the local component behaves as an autoregressive module, unlike in the Chicago case. Furthermore, both violent crimes, including Agg. Assaults, Rape, Homicide and Part 2 Minor (e.g. Menacing), and property crimes, including Burglary/Breaking Ent., Theft and Part 2 Minor (e.g. Fraud), are significant features from . On the other hand, Theft and Part 2 Minor from , and Theft and Burglary from are predictive features for future opioid overdose in the target locations.
Recall that and have a more salient contribution to most of the neighborhoods, which implies that previous property crimes (especially Theft) committed in the members of those communities may be one of the potential indicators of future opioid overdose in the other neighborhoods. Note that our findings are also consistent with the literature that highlighted the connection between crime and drug use and suggested that property crimes such as theft and burglary might be committed to raise funds to purchase drugs [bennett2008statistical].
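The feature-importance measure used here, the mean absolute value of the input weights attached to each feature, can be sketched as below. This is a minimal illustration under the assumption that the input weights are available as one row per feature and one column per receiving unit; the function name, the layout, the placeholder feature names and the toy weights are hypothetical.

```python
def mean_abs_input_weights(weights, feature_names):
    """Rank features by the mean absolute value of their input weights.

    weights[f][h] is assumed to be the weight connecting input feature f to
    unit h of the component being inspected. Returns a list of
    (feature_name, score) pairs sorted from most to least important.
    """
    scores = {
        name: sum(abs(w) for w in row) / len(row)
        for name, row in zip(feature_names, weights)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented toy weights: 3 features, 2 receiving units.
feature_names = ["feature_a", "feature_b", "feature_c"]
weights = [[0.5, -0.5], [0.0, 0.25], [-1.0, 1.0]]
ranking = mean_abs_input_weights(weights, feature_names)
```

Taking absolute values before averaging matters: the weights of `feature_c` sum to zero, yet it ranks first because both of its weights are large in magnitude. Such scores are only comparable across features when the inputs are on a common scale.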

Figure 5: Importance of static features. Mean absolute values of input weights of FC layer in static component.

We explore the importance of static features by analyzing the mean absolute input weights of the FC layer in the static component (see Fig. 5). For the Chicago case, demographic features (Population, Gender Div. and Race Div.) are significant. We observe that Owner Occupied H. units, Poverty and Educational Att. are also informative. For the Cincinnati case, Gender Div. and Population are important features for forecasting, as are Educational Att. and Per Capita Income. Based on these results, neighborhoods with a higher population and lower or moderate gender diversity may require additional resources to prevent opioid overdose in both cities. Economic status is also important for neighborhoods of both cities, which is consistent with previous work suggesting that communities with a higher concentration of economic stressors (e.g. low income, poverty) may be vulnerable to abuse of opioids as a way to manage chronic stress and mood disorders [king2014determinants]. Although there are three economic status indicators, GL selects only one for each city: Poverty for Chicago and Per Capita Income for Cincinnati.

6 Discussion and Future Work

In this work, we presented a community-attentive spatio-temporal predictive model to forecast opioid overdose from crime dynamics. We developed a novel deep learning architecture based on multi-head attentional networks that learns different representation subspaces of features (communities) and allows the target locations to select location-specific community contributions for forecasting local incidents. At the same time, our proposed model allows for interpreting predictive features at both the local and community levels, as well as the community memberships and the community contributions to forecasting local incidents. We demonstrated the strength of our method through extensive experiments. Our method achieved superior forecasting performance on two real-world opioid overdose datasets compared to the baseline methods.

The experimental results suggest different potential spatio-temporal links between crime and overdose. The overdose deaths in a target neighborhood in Chicago appear to be better predicted by crime incidents in neighborhoods that share the same community with the target neighborhood. Also, change in crime incidence in the neighborhoods with low crime rates is an important indicator of future overdose deaths in most of the other neighborhoods. On the other hand, in Cincinnati, the crime incidents that occurred in communities comprising crime hot-spots seem to predict the overdose events well in most of the neighborhoods. Furthermore, the predictive local activities differ between the two cases. While local crime incidents, in particular Narcotics and Assault, are predictive of local overdose deaths in Chicago, previous overdose occurrences are informative for future overdose incidents in Cincinnati. On the other hand, the global contributions to forecasting local overdose incidents show similar patterns in both cities. Changes in property crimes, in particular Theft, Deceptive Practice and Burglary, together with Weapons Violation (a crime against society), in Chicago, and Theft and Burglary in Cincinnati, can be significant indicators of future local overdose incidents, as can certain types of violent crime (Battery for Chicago and Agg. Assault for Cincinnati). Last but not least, the demographic characteristics, economic status and educational attainment of the neighborhoods in both cities may help forecast future local incidents. Our findings support the hypothesis that criminal activities and opioid overdose incidents may reveal spatio-temporal lag effects, and they are consistent with the literature. As future work, we plan to investigate the link between opioid use (or overdose) and other social phenomena using our method. We also plan to extend our model to consider multi-resolution spatio-temporal dynamics for prediction.

Acknowledgement. This work is part of the research associated with NSF #1637067 and #1739413. Any opinions, findings, and conclusions or recommendations expressed in this material do not necessarily reflect the views of the funding sources.