    # Bayesian Sparse Covariance Structure Analysis for Correlated Count Data

In this paper, we propose a Bayesian Graphical LASSO for correlated countable data and apply it to spatial crime data. In the proposed model, we assume a Gaussian Graphical Model for the latent variables which dominate the potential risks of crimes. To evaluate the proposed model, we determine optimal hyperparameters which represent samples better. We apply the proposed model for estimation of the sparse inverse covariance of the latent variable and evaluate the partial correlation coefficients. Finally, we illustrate the results on crime spots data and consider the estimated latent variables and the partial correlation coefficients of the sparse inverse covariance.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Revealing correlations of data is the simplest way to analyze relationships of given samples. However, the simple way has a lot of unignorable problems such as noise robustness, interpretability, and treatments of discrete data. Specifically, using simple correlations only, large input dimension causes difficulty of finding significant relationships because the reasonable threshold selection is not trivial. In order to find the essential relationships between variables, the sparse modeling, such as LASSO (Least Absolute Shrinkage and Selection Operator), is focused in these decades. Graphical Lasso  is a representative method which achives sparse covariance structure analysis. Since an inverse covariance matrix corresponds to a partial correlation matrix with appropriate scaling, we can discover robust and essential relationships of data by sparse covariance structure analysis. However, because of the assumption of a Gaussian Graphical Model (GGM) for observed data, Graphical Lasso can not treat count data.

In order to overcome this limitation of Graphical Lasso, we propose a hierarchical Bayesian model for Poisson distributed observations with sparse covariance structure. In our model, the latent variables, which indicate “

potential risks” of events, follow a Bayesian Graphical LASSO(BGL) , and the occurrences of events follow the homogeneous Poisson processes. We apply the proposed model to spatial crime data analysis and investigate the effectiveness with numerical experiments.

### 1.1 Graphical Lasso

Graphical Lasso  is a well-known and powerful model for sparse covariance structure analysis of GGM. Let a zero-meaned data matrix

follows a Gaussian distribution independently with given a precision matrix

denoting each component as in .

 p(Y|Ω)=T∏t=1N(yt|0,Ω−1). (1)

The Graphical Lasso optimizes the following objective which consists of the term from maximizing likelihood estimator of under the penalty

 Ω=arg maxΩ′∈M+log(det(Ω′))−tr(Y⊤YΩ′)−λ∥Ω′∥1, (2)

where indicates the set of all positive semi-definite matrices and is a regularization parameter. Here, the norm is defined by . Note that maximizing the objective of Graphical Lasso (2) is equivalent to obtain the maximum a posteriori (MAP) estimator of the following model, which is the combination of the Laplace distributions of the off-diagonal components of denoting as of the form

and the exponential distribution of the diagonal components

of the form :

 p(Y∣Ω) =T∏t=1N(yt∣0,Ω−1), (3) p(Ω∣λ) (4)

where is a normalizing term and is an indicator function defined by

 1Ω∈M+={1   (Ω∈M+)0   (otherwise). (5)

### 1.2 Bayesian Graphical Lasso

Bayesian Graphical Lasso (BGL)  realizes a fully Bayesian treatment of Graphical Lasso. Using the fact that the double exponential distributions can be represented as a mixture of Gaussian and exponential distributions, (3) is modified to

 p(Ω|τ,λ) =C−1τ∏i

where

denote the vector of the upper off-diagonal and diagonal entries of

and is the latent scale parameters. From the decomposition of the double exponential distributions, we derive a data-augmented block Gibbs sampling algorithm for the BGL model(6).

### 1.3 The Poisson Process and Crime Data

A Poisson distribution gives probability masses of the numbers of occurred events per unit time. More precisely, if the time intervals of events’ occurrences follow identical exponential distributions, the numbers of events per unit time follow identical Poisson distributions. This model is called a homogeneous Poisson process. The Poisson process is a simple but reasonable model to represent occurrences for rare events. Common examples of Poisson processes are customers calling a help

, radioactive decay in atoms , and crime occurrence , and so on.

If data follow a multivariate point process, it is significant to understand the inter-variate relationships of data. When the data consists of continuous values, we can evaluate the structure of variables by calculating correlations or applying Graphical Lasso. However, for count data, the correlation matrix will lead to biased evaluation for the structure. Hence, we introduce a novel Bayesian framework of sparse covariance structure analysis for count data.

The remainder of this paper is organized as follows. Section 2 gives the concrete formulation of the proposed model. In Section 3, we show the effectiveness of our model through experiments with spatial crime data and Section 4 shows the discussion of our model. Finally, Section 5 is devoted to a summary of this study.

## 2 The Proposed Method

### 2.1 Our Model

Fig. 1 shows the graphical model of the proposed model. At First, we define the set of non-negative integers . When we obtain -dimensional sequential data for timesteps, we assume that the elements of count data matrix follows conditionally independent Poisson distributions

 P(Y|μ,Z) =A∏i=1T∏t=1Poisson(yti|exp(η(μi,zti))), (8) η(μi,zti) =μi+zti, (9)

and we also give priors as

 p(μ) =N(μ|0,σ2μI), (10) p(zt|Ω) =N(zt|0,Ω−1)  for t=1,…,T, (11) p(ω|λ) ∝∏i

Here, we assume the linear predictor as the potential risk of occurrence of events. Thus, indicates averaged potential risk of -th dimension and represents dispersities from . Since follows the BGL prior, we can extract the sparse and essential co-occurrence structures of count data. In addition, we discussion about the effects of choices of in Section 3. Therefore, the joint posterior distribution can be expressed as

 p(Z,μ,Ω,λ|Y)∝P(Y|μ,Z)p(μ)[T∏t=1p(zt|Ω)]p(ω|λ)p(λ). (14)

### 2.2 Sampling Scheme

We evaluate the posterior (14

) with a Markov chain Monte Carlo method which consists of two different sampling schemes. The first scheme is block Gibbs sampling for the parameters in the BGL,

i.e. and . The second one is about the parameters of the potential risks, that is, and . We adopt the Metropolis-Hastings scheme for correlated count data .
For example, we describe the Metropolis-Hastings sampling scheme from the fully conditional distribution of here. When tries to transition to new state , we define the acceptance probability for simplicity

 r=min{1,p(μ′|Y,Z,Ω,λ)p(μ|μ′)p(μ|Y,Z,Ω,λ)p(μ′|μ)} (15)

The sampling procedure is as follows:

Step 1

Find optimal about the fully conditional distribution

 ^μ=arg maxμlogp(μ|Y,Z,Ω,λ) (16)

by use of the Newton-Raphson method.

Step 2

Sample a candidate state from proposal distribution defined as the multivariate t-distribution:

 p(μ′|^μ)=Multi-t(μ′|^μ,H−1^μ,ν), (17)

where is the Hessian matrix of (16) and

is a user-defined hyperparameter which indicates the degree of freedom.

Step 3

Accept the candidate state with probability .
Update the state if accepted, and keep the state to be if rejected.

Sampling from is also executed by the similar way to . In summary, we repeat sampling from the fully conditional distributions of and respectively with Metropolis-Hastings within Gibbs sampler.

## 3 Synthetic Data Analyses

### 3.1 Synthetic Data

To assess the performance of our proposed model, we generate four synthetic datasets, whose size are . We fix and generated as with , and zero otherwise for entire simulations, where and represent constant values. Given these true parameters, we sample and the observed data matrix from (8) and (11) respectively.

### 3.2 Analyzing Effects of Hyperparameter Selection

In our Metropolis-Hastings sampling scheme, we empirically find that an appropriate selection for the degree of freedom in proposal t-distribution gives a significant effect for the sampling efficiency. In our simulations, slightly small , such as , is better than bigger . Because a t-distribution with

is equivalent to Cauchy distribution and

is corresponding to a Gaussian distribution, our choice means intermediate form of them.

Next, we determine the parameters in priors. For , which is the hyperparameter of the regularization parameter , we adopt for and . Moreover, a value seems to be reasonable for the prior . Since and determine the parameters of the Poisson distribution within the exponential function in the proposed model, the absolute value of should be small.

### 3.3 Simulation Results

Here we show the estimation results of , , in the case that the size of data is . We adopt the hyperparameters and updated all elements at once both and in the Metropolis-Hastings algorithms. Figure 2: The estimation result of μwith the MAP estimator with 95% credible intervals.

Fig. 2 shows the result of . We show the solid line estimates of with the MAP estimator with 95% credible intervals. The dotted line represents the true values. It is assumed that small is difficult to estimate because the linear predictor is exponentially transformed to be the parameter of Poisson distribution. However, we find the 95% the true values are held within credible intervals. Figure 3: (a) shows the MAP estimation results of Z, (b) is true value of Z, and (c) shows the difference between (a) and (b) in absolute value.

Fig. 3 shows the MAP estimation results of and their true values. The subfigure (c) shows the differences between MAP estimated and true in absolute value. Figure 4: The estimation result of (zt1)Tt=1. The markers show the true values of and the solid line shows MAP estimated values with the 95% credible intervals.

Fig. 4 shows the estimation result of the value of . Note that is estimated to be as shown in Fig. 2. The solid line is the MAP estimation results of with the 95% credible intervals, and the markers are their true values. For the positive , the true values are within the credible intervals of the estimation result. On the other hand, the true values seem to be out of the 95% credible intervals for the negative . However, considering that the parameter of Poisson distribution becomes a small value, the estimation results are reasonable.

### 3.4 Evaluating Partial Correlations Figure 5: (a) shows the estimated partial correlation coefficients of Ω and (b) shows the true partial correlation coefficients of Ω.

We can calculate partial correlation matrix as

 pij=−ωij√ωii√ωjj (18)

from the estimated precision matrix . In Fig. 5, we show the MAP estimated for . We compare the estimated partial correlation coefficients and true one, which consists of small number of non-zero components. We confirm the partial correlation coefficients that correspond to the non-zero components are relatively larger than the other components that correspond to the zero components in the true partial correlation coefficients. We also find that non-diagonal and non-zero estimates have shrinkage from their true values. This is affected by the Lasso-like priors of BGL.

## 4 Analysis Of Crime Spots Data

### 4.1 Spatial Crime Data

As an application example of the proposed model, we employed the proposed model to spatial crime data which is obtained from . The spatial crime data published by the National Institute of Justice (NIJ) contains criminal occurrences and their locations in the Portland City, Oregon, USA.

In the crime data, the latent variable represents the average potential risk of criminal occurrences of the -th area. Also, represents dispersities of the potential risks at each time point. Therefore, the interaction structure is captured by .

For simplicity, we used the crimes that occurred in 2016 and extracted 60 areas. Furthermore, we aggregate the number of crimes per week into one time point. Hence, the size of data matrix becomes is .

### 4.2 The Partial Correlation Of Ω On Crime Data

The estimated is the inverse of the sparse covariance matrix that contributes to the influence between variables and temporal variation. Therefore, the partial correlation calculated from represents a sparse correlation of crime occurrence risk between areas.

A strong positive correlation between two areas means that the risk of the crime in one area intends to increase when the other area risk increases.

### 4.3 Visualization Of Partial Correlations Figure 6: The visualization of sparse partial correlation coefficients between areas on Portland city’s map. The thickness of the curved black lines shows the magnitudes of the coefficients.

The partial correlation coefficients matrix is calculated from (18) with the estimated result of . Fig. 6 shows a visualized sparse partial correlation coefficients on the Portland city map. The coefficients shown on the map are the top 2% of the whole in absolute value. The correlation of crime risks could lead to a visual understanding by expressing a strong correlation on the map.

## 5 Conclusion

The proposed model can estimate reasonable values of and latent variables and . On the other hand, when the parameter is less than 1 in the Poisson process, the number of events tends to be , so that it is difficult to estimate the negative true value of the latent variable of the simulation data in the proposed model. One of the possible solutions tackle this problem is to find an alternative transformation function which maps to for the linear predictor of Poisson distribution. For example, is one of the candidates.

It is possible to find useful features by analyzing the estimation results of latent variables by the proposed model. We have been able to obtain sparse correlations between crime risk areas, by applying the proposed model to crime data. As a future issue, because there are many samples related to time in the Poisson process in general, we would like to make our model possible to catch time-series dependence.

## References

•  S. Chib and I. Jeliazkov (2001) Marginal likelihood from the metropolis–hastings output. Journal of the American Statistical Association 96 (453), pp. 270–281. Cited by: §2.2.
•  J. Friedman, T. Hastie, and R. Tibshirani (2008) Sparse inverse covariance estimation with the graphical lasso. In Biostatistics, Vol. 9, pp. 432–441. Cited by: §1.1, §1.
•  N. Homepage Real-time crime forecasting challenge. Note: https://nij.ojp.gov/funding/real-time-crime-forecasting-challenge Last accessed 18 May 2020 Cited by: §4.1.
•  Y. Igarashi, K. Nagata, T. Kuwatani, T. Omori, Y. Nakanishi-Ohno, and M. Okada (2016-03) Three levels of data-driven science. Journal of Physics: Conference Series 699, pp. 012001. External Links: Cited by: §1.
•  A. Ihler, J. Hutchins, and P. Smyth (2006) Adaptive event detection with time-varying poisson processes. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 207–216. Cited by: §1.3.
•  D. W. Osgood (2000)

Poisson-based regression analysis of aggregate crime rates

.
Journal of quantitative criminology 16 (1), pp. 21–43. Cited by: §1.3.
•  A. Sitek and A. M. Celler (2015) Limitations of poisson statistics in describing radioactive decay. Physica Medica 31 (8), pp. 1105–1107. Cited by: §1.3.
•  R. Tibshirani (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58 (1), pp. 267–288. External Links: ISSN 00359246, Link Cited by: §1.
•  H. Wang et al. (2012) Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis 7 (4), pp. 867–886. Cited by: §1.2, §1.
•  J. Weinberg, L. D. Brown, and J. R. Stroud (2007) Bayesian forecasting of an inhomogeneous poisson process with applications to call center data. Journal of the American Statistical Association 102 (480), pp. 1185–1198. Cited by: §1.3.