# High-Dimensional Bernoulli Autoregressive Process with Long-Range Dependence

We consider the problem of estimating the parameters of a multivariate Bernoulli process with auto-regressive feedback in the high-dimensional setting where the number of samples available is much less than the number of parameters. This problem arises in learning interconnections of networks of dynamical systems with spiking or binary-valued data. We allow the process to depend on its past up to a lag p, for a general p > 1, allowing for more realistic modeling in many applications. We propose and analyze an ℓ_1-regularized maximum likelihood estimator (MLE) under the assumption that the parameter tensor is approximately sparse. Rigorous analysis of such estimators is made challenging by the dependent and non-Gaussian nature of the process as well as the presence of the nonlinearities and multi-level feedback. We derive precise upper bounds on the mean-squared estimation error in terms of the number of samples, dimensions of the process, the lag p and other key statistical properties of the model. The ideas presented can be used in the high-dimensional analysis of regularized M-estimators for other sparse nonlinear and non-Gaussian processes with long-range dependence.

## Authors

• 7 publications
• 9 publications
• 19 publications
• 25 publications
• 13 publications
• ### The Impact of Regularization on High-dimensional Logistic Regression

Logistic regression is commonly used for modeling dichotomous outcomes. ...
06/10/2019 ∙ by Fariborz Salehi, et al. ∙ 4

• ### Improved Estimation of High-dimensional Ising Models

We consider the problem of jointly estimating the parameters as well as ...
11/07/2008 ∙ by M. Kolar, et al. ∙ 0

• ### Learning Partially Observed Linear Dynamical Systems from Logarithmic Number of Samples

In this work, we study the problem of learning partially observed linear...
10/08/2020 ∙ by Salar Fattahi, et al. ∙ 0

• ### Regularized Estimation of High-Dimensional Vector AutoRegressions with Weakly Dependent Innovations

There has been considerable advance in understanding the properties of s...
12/19/2019 ∙ by Ricardo P. Masini, et al. ∙ 0

• ### Impulse Response Analysis for Sparse High-Dimensional Time Series

We consider structural impulse response analysis for sparse high-dimensi...
07/30/2020 ∙ by Jonas Krampe, et al. ∙ 0

• ### A Statistical Investigation of Long Memory in Language and Music

Representation and learning of long-range dependencies is a central chal...
04/08/2019 ∙ by Alexander Greaves-Tunnell, et al. ∙ 0

• ### Data-Driven Sparse System Identification

In this paper, we study the system identification porblem for sparse lin...
03/21/2018 ∙ by Salar Fattahi, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In many signal processing applications, the underlying time series may be modeled as a multivariate Bernoulli process (MBP). For example, the spike trains from an ensemble of neurons can be modeled as a collection of Bernoulli variables where for each neuron, at each time instant, the probability of spiking could depend on the history of the spikes from the ensemble. Similarly, several other signals such as activity in social networks, trends of stock prices in financial markets

[Sta08, TY16], crime occurrences in a metropolitan area [MRW18], medical emergency call forecasting [MMW11], climate dynamics [GMM14] and certain physiological [KKMM17] and biological processes [KBS18] can all be encoded as multivariate autoregressive Bernoulli processes where the history of the process affects the present outcome.

We are interested in developing generalized linear models (GLM) to capture the behavior of temporally-dependent MBPs. Such models allow one to not only make predictions about the future of the process, but also infer the relations among the coordinates (i.e., individual time series), based on proper estimates of the parameters of the model. For example, in the neural spike train application, one can reveal a latent network among the neurons (i.e., who influences whose firing) just from the observations of patterns of neural activity, a task which is of significant interest in neuroscience [OWB05, SB03, BKM04]. Similarly, in the context of social networks, one might be interested in who is influencing whom [RWH12].

In a GLM, one models an invertible link function of the conditional mean of the observed variables as a linear function of the covariates. In time series analysis, the role of the covariates is played by the history of the evolving time series. For Gaussian random variables with the identity link function, this leads to the classical Gaussian autoregressive (AR) process. Indeed, much of the focus in time series analysis has been on VAR (vector AR) processes

[BM15, CRZ16, MP15, MM17, ABC16], which provide significant richness as a rudimentary model. However, these models fail to capture higher order correlations among the variables and become harder to interpret for variables from discrete spaces.

Another fundamental class of processes is the MBP, for which very few results are known as compared to the Gaussian VAR processes. For MBPs, we are interested in understanding the dependence of each variable on the history of the process. We consider a time series of Bernoulli variables where each variable depends on at most lags of the process, resulting in possible interaction parameters. These interactions can be arranged in an tensor , where captures the effect of variable from lags ago on variable .

Collectively, each slice given by for indicates the coupling between all the pairs of variables that are lag- apart. On the other hand, each fiber of length along the third dimension , for , can be thought of as a filter that modulates the behavior of variable on the past responses of variable . Delineating the parameters in this way helps with the interpretation of the dynamics of the process. For example, in neural signal processing, these filters encode properties of the spiking behavior of the neurons, as demonstrated by [WP17] where they provide a characterization of the filter coefficients that incite activities such as bursting, tonic spiking, phasic spiking and several others. Similarly, a slice can be thought of as an adjacency matrix for the lag- influence network. If neurons and are not connected, then for all . However, when two neurons are connected, it is possible to have different patterns of influence at different time lags. Hence, the influence networks could potentially be different for each .

In order to reveal the structure of these influence networks, one needs to estimate the tensor parameter , given a sample of observations from the process. In many scenarios, the parameter tensor is known to be sparse. For example, in the neural setting, each neuron, or the functional unit of the cortex, has limited direct connections to other units. To exploit the sparsity assumption in forming an estimate of , we propose and provide a rigorous analysis of an penalized maximum likelihood estimator (MLE).

For Gaussian random variables, the

penalized MLE takes the form of a regularized least-squares (LS) tensor regression. However, for the Bernoulli GLM, maximizing the likelihood differs from ordinary least-squares. Although one can still use a regularized LS estimator, incorporating the Bernoulli likelihood allows one to capture more information, i.e., achieve smaller variance. The resulting

-estimator in the Bernoulli case resembles a regularized logistic regression problem which is in the class of convex problems that can be solved efficiently.

While traditional statistical methods work under the assumption that the number of available samples significantly exceeds the number of parameters, i.e., the assumption is often not true in several applications. Going back to the neuroscience application, in most spiking neural data from in vivo measurements, a limited sample of spike trains are available for an experiment, owing to constraints such as subject fatigue, changes in sensor characteristics, and sensitivity to exogenous laboratory conditions. In addition, synaptic connectivity is time-varying and hence there is a limited period over which the model parameters can be assumed constant.

In such a scenario, statistically sound methods are desired that provide guarantees on the fidelity of the model trained over a limited sample of observations, especially with . However, in spite of the model identification problem being ill-posed in this case, it is often possible to perform reliable estimation even with , by constraining the parameter to lie in a low-dimensional subspace, with degrees of freedom, where . Several such low-dimensional subspaces have been investigated in the literature on compressive sensing and high-dimensional statistics. Examples include the (elementwise) sparsity where approximately of the entries are assumed non-zero, or the low-rank assumption on the parameter tensor where it is assumed to be the sum of a few rank-1 tensors. These assumptions are often practically valid, enhance the interpretability of the model, and make the problem well-posed when . Moreover, the estimation is computationally tractable due to convex optimization based estimators for these, even for the tensor models [RYC15].

Our focus in this work is on the “approximately sparse” model, as motivated by the network-of-fibers structure in the neural spike train application. The parameter, , is assumed to be well-approximated by an -sparse tensor. Such a constraint is incorporated by regularizing the (convex) maximum likelihood problem using the elementwise penalty. Our main result in Theorem 3.1 establishes the consistency of this regularized MLE in the high-dimensional regime of under some regularity conditions. Despite the fact that the techniques for establishing such high-dimensional guarantees on regularized -estimators are by now fairly mature [BVDG11], significant challenges remain in analyzing dependent non-Gaussian processes.

### 1.1 Key contributions

A major theoretical contribution of our work is to establish the so-called restricted strong convexity (RSC) [NRWY12]

for the log-likelihood of a dependent non-Gaussian process. This requires a restricted eigenvalue condition for the MBP, which is nontrivial due to the non-Gaussian and highly correlated entries of the resulting design matrix. What makes the problem more challenging is the existence of feedback from more than just the immediate past (the case

).

We establish the RSC for general using the novel approach of viewing the

-block version of the process as a Markov chain. The problem becomes significantly more challenging when going from

to even . The difficulty with this higher-order Markov chain is that its contraction coefficient is trivially 1. We develop techniques to get around this issue which could be of independent interest (cf. Section 6). Our techniques hold for all .

Much of the previous work towards proving the RSC condition of the log-likelihood function has either focused on the independent sub-Gaussian case [RWY11, ZH08] or the dependent Gaussian case [BM15, RYC15] for which powerful Gaussian concentration results such as the Hanson–Wright inequality [RV13] are still available. Our approach is to use concentration results for Lipschitz functions of discrete Markov chains, and strengthen them to uniform results using metric entropy arguments. In doing so, we circumvent the use of empirical processes which require additional assumptions for MBP estimation [RST15]. Moreover, our approach allows us to identify key assumptions on the decay rate of the fibers of the parameter, for sample-efficient estimation.

Although MBP time series are often modeled using the link function, our analysis allows for any Lipschitz continuous, log-convex link function. The analysis brings out crucial properties of the link function, and the role it plays in determining the estimation error and sample complexity.

### 1.2 Previous work

Estimating the parameters of a multivariate time series has been of interest in recent years. Much of the work, however, focuses on Gaussian VAR() processes with linear feedback [BM15, CRZ16, MP15, MM17, ABC16]. For these models, a restricted eigenvalue condition can be established fairly easily, by reducing the problem, even in the time-correlated setting, to the concentration of quadratic functionals of Gaussian vectors for which powerful inequalities exist [RV13]. These techniques do not extend to non-Gaussian setups.

In the non-Gaussian setting, Hall et al. [HRW16, ZR18] recently considered a multivariate time series evolving as a GLM driven by the history of the process. The autoregressive MBP with lags is a special case of this model. They provide statistical guarantees on the error rate for the regularized estimator, although their assumptions on the parameter space are restrictive when applied to the MBP. More importantly, their results are restricted to the case which does not allow the explicit encoding of long-term dependencies as observed in crucial neuronal phenomena such as periodic spiking.

More recently, Mark et al. [MRW18, MRW17] considered a similar model for multivariate Poisson processes with lags , albeit either through predetermined basis functions or by restricting to lags or . A key contribution of us is to bring out the explicit dependence on in multilag MBP models, allowing for a general . We show how the scaling of the sample complexity and the error rate with can be controlled by the properties of the link function and a certain norm of the parameter tensor.

Our results improve upon those in [HRW16, MRW18] when applied to the MBP. Due to the key observation that a -lag MBP can be viewed as a discrete Markov chain, our analysis relaxes several assumptions made by [HRW16, MRW18]. In doing so, we achieve better sample complexities with explicit dependence on . Our analysis borrows from martingale-based concentration inequalities for Lipschitz functions of Markov chains [KR08].

The univariate Bernoulli process for was considered by Kazemipour et. al. [Kaz18, KWB17] where they analyzed a multilag Bernoulli process for a single neuron. Their analysis does not extend to case. Even for , their analysis is restricted to the biased process with for all . Mixing times of the MBP have been considered in [KBS18]. However, their discussion is again limited to .

The rest of the paper is organized as follows. In Section 2, we present the MBP model and the proposed regularized MLE (R-MLE). Section 3 presents our main result, Theorem 3.1, on the consistency of the R-MLE and discusses its assumptions and implications. In Section 4, we provide some simulation results corroborating the theory. Section 5 provides the outline of the proof of Theorem 3.1. In Section 6, we present techniques we developed for deriving concentration inequalities for dependent multivariate processes. Other key technical results needed in the proof are given in Section 7. We conclude with a discussion in Section 8.

## 2 Problem Formulation

Consider an -dimensional time series , where denotes time and each . A general framework for analyzing is to use a GLM for modeling the conditional mean of the present given the past, i.e., is such that

 E[xti∣Xt−1]=f(⟨Θi∗∗,Xt−1⟩)

independently across coordinates . Here, is the “-lag” history at time , and is the slice of the parameter tensor along the first dimension. is the inverse link function. The notation denotes the usual Euclidean inner product between two matrices, i.e., equals

 ∑j,ℓΘijℓXt−1jℓ=∑j,ℓΘijℓxt−ℓj=∑ℓ\ipΘi∗ℓ,xt−ℓ

where we note the useful identity . (The inner product in the last equality is the usual vector inner product.) The entry captures how much the dimension of the process at time lag affects the distribution of the dimension, at each time instant. The model relates the distribution of to the history of the process over time lags .

In other words, for the MBP model, we have

 xti∣Xt−1∼Ber(zti),zti:=zti(Θ):=f(⟨Θi∗∗,Xt−1⟩), (1)

where for some , is the inverse link function. Note that represents the conditional probability of spiking for neuron at time , given the lag history of the ensemble.

A general (time-invariant) time series over the discrete space with finite dependence on the past, i.e.,

 P(xt=a∣xt−1,xt−2,…)=P(xt=a∣xt−1,xt−2…,xt−p),a∈{0,1}N

can be modeled as a homogeneous Markov chain over , with possibly parameters in the worst case. Hence the MBP model is a dimensional representation of a subset of these models. The representation power of the MBP in (1) is beyond the scope of this paper.

We are interested in estimating the “true” parameter tensor from samples of a process. (We assume that the first samples corresponding to to are given for “free” since during the estimation the data points are regressed onto their history

via the likelihood function of the Bernoulli distribution.)

Although the parameter space has ambient dimension , in real-world applications, often resides in or is well-approximated by a low dimensional subspace of , allowing reliable estimation even if as is desired in several applications. In the literature on high-dimensional statistics, this is often the assumption and several such low dimensional subspaces are now well-studied and ubiquitous in analyses, some examples include tensors being low-rank, exactly sparse, approximately sparse and sparse with a specific structure. Here we focus on parameter estimation for the approximately sparse parameter model where the true parameter has a “good enough” sparse approximation.

Let the process be generated by a true parameter . We assume to be approximately -sparse. More precisely, we assume that the following quantity

 σs(Θ∗):=min|S|≤s∥Θ∗Sc∥1,1,1, (2)

decays fast as a function of , where is the tensor with support restricted to , the complement of . This quantity captures the approximation error when is approximated by an -sparse tensor. For an exactly -sparse tensor , we have . In general, we do not impose any constraint on and state a general result involving this parameter. We denote by the optimum set that solves (2). For future reference, we also define

 τ2s(Θ∗):=σ2s(Θ∗)s. (3)
##### Notation.

Here and in what follows denotes the norm on tensors obtained by collapsing the dimensions from right to left by applying , and norms in that sequence. Thus is the elementwise norm of the tensor, i.e., the sum of the absolute values of all its entries, and is the absolute value of the entry of the tensor with largest magnitude, which is the dual norm for norm. denotes the number of non-zero entries in .

## 3 Main Result

We study a regularized maximum likelihood estimator (R-MLE) for . The (normalized) negative log-likelihood of the Bernoulli process (1) is given by,

 L(Θ) =−1nn∑t=1N∑i=1ℓit(⟨Θi∗∗,Xt−1⟩) (4)

where . In order to incorporate the sparse approximability of during estimation, we penalize the likelihood via an element-wise regularization, leading to the R-MLE

 ˆΘ=argminΘ∈RN×N×pL(Θ)+λn\normΘ1,1,1, (5)

where is the element-wise norm of the 3-tensor encouraging a sparse solution to the optimization problem. Consider samples of the multivariate Bernoulli process generated by equation (1) with parameter . We denote these samples by . We further assume that the process satisfies the the following regularity conditions:

1. [label=(A0)]

2. The samples are drawn from a stable and wide-sense stationary process which has a power spectral density matrix,

 X(ω):=∞∑ℓ=−∞Cov(xt,xt+ℓ)e−jωℓ∈CN×N, (6)

for with minimum eigenvalues uniformly bounded below as

 (7)
3. The nonlinearity or inverse link function for some is twice differentiable and has a Lipschitz constant , i.e., . We also assume that both and are strongly convex with curvatures bounded below by .

The following quantities will be key in stating the error bounds:

 gf(Θ∗) :=(3L2f2\epsp∑ℓ=1N∑i=1(N∑j=1p∑k=ℓ|Θ∗ijk|)2)1/2, (8) Gf(Θ∗) :=8c2f[1+p2(1gf(Θ∗)−1)2]. (9)

We note that is a valid norm on the space of 3-tensors. This quantity captures how fast the process is mixing, and controls how fast Lipschitz functions of the process concentrate. It indirectly controls the hardness of the estimation problem. See Section 6 for the details. We are now ready to state our main result:

###### Theorem 3.1.

Let be a process generated by (1) with parameter and satisfying 1 and 2. Suppose . Then there exist positive universal constants and such that for

 n≥c1Gf(Θ∗)c2fc6ℓs3log(N2p),

any solution to (5) with , satisfies

 (10)

Theorem 3.1 is proved in Section 5. We first discuss important implications of this result, and the conditions imposed on the scaling of various parameters.

### 3.1 Remarks on Theorem 3.1

The two terms in the error bound (10) correspond to the estimation and approximation errors, respectively. The estimation error, in general, scales at the so-called fast rate in our setting, while the approximation error scales with the slower rate . For the exact sparsity model, where the approximation error vanishes (since ) and we achieve the fast rate.

The overall sample complexity for consistent estimation (ignoring constants) is thus

 n≳Gf(Θ∗)max{s3,˜τ4s(Θ∗)}log(N2p). (11)
##### Scaling with s.

According to (11), the scaling of in the sparsity parameter “” is at best , corresponding to the case of hard sparsity where . While an dependence is not ideal, it is not clear if it can be improved significantly without imposing restrictive assumptions. It is clear that one cannot do better than , the optimal scaling in the linear independent settings. In our proof, the additional factor comes from concentration inequality (21) in Proposition 5.6. There, if one were able to show sub-Gaussian concentration for deviations of the order of instead of then the additional can be removed. It remains open whether such concentration is possible and under what additional assumptions. Figure 0(c) in Section 4 suggests a superlinear dependence on , hinting that the situation may not be as simple as the i.i.d. case.

In comparison, for , a sample complexity of was reported in [HRW16, Cor. 1], whereas [MRW18, Thm 4.4] requires samples where and are parameters defined in their respective models, both of which can potentially grow as unless assumed otherwise.

##### Scaling with p.

For , our result is the first to provide a sample complexity logarithmic in which holds for all . In contrast, [KWB17, Thm. 1] requires samples and relies heavily on .

Our bound scales with through the quantity . The scaling depends on the behavior of the tail of , that is, how fast the “influence from the past” dies down. For different regimes of the influence decay, the scaling of is summarized in Table 1. In the worst case, without any assumptions on and with , could scale as . Although this is not ideal, it is analogous to constant in the sample complexity of [MRW18], which is derived under more assumptions on . (The dependence of on in that result is also somewhat complicated.)

On the other hand, under mild assumptions of polynomial or exponential tail decay, the dependence on is much better: If decays polynomially in lag , i.e., , uniformly in , for any , or exponentially for , then the sample complexity reduces by a factor of . Furthermore, if the Lipschitz constant is allowed to drop with , more reduction in and hence the sample complexity is possible, as illustrated in Table 1.

Appendix C provides derivations for Table 1. Better bounds on and hence the sample complexity can be obtained by imposing suitable structural assumptions on

##### Scaling with N.

Our results have a logarithmic dependence on , the number of neurons in the context of neuronal ensembles, which is a notable feature of our work. We overcome the barrier for the MBP model, while also allowing .

##### Assumptions.

We use Assumption 1 to guarantee that restricted strong convexity (RSC) holds at the population level. The RSC is key in guaranteeing that any parameter tensor that maximizes the regularized likelihood does not deviate far from the true parameter. For the case, it implies that the process does not have zeros on the unit circle in the spectral domain. Assumption 1 is by now standard in estimating time-series [RYC15, BM15]. It relates to the “flatness” of the power spectral density (PSD) [BM15]. Controlling the scaling of in terms of is a non-trivial research question, even for Gaussian AR(

) processes, since the relation between the PSD and the parameter is via the Z-transform. While there could be pathological

for which , the set of parameters for which , is largely believed to be non-trivial. A line of work [BM15, HRW16, MRW18] obtains weak bounds on which could decrease with , hence the authors need to further assume .

Assumption 2 is not too restrictive. For example, for the logit link,

, the sigmoid function, we have

 cf =minuα2fα(u)(1−fα(u))=α2\eps(1−\eps)

assuming that , that is, . The Lipschitz constant in this case satisfies

## 4 Simulations

We evaluate the performance of the estimator in (5) using simulated data. We use two different metrics of performance: (1) the estimation error in Frobenius norm, and (2) support recovery, i.e., assuming that the true parameter tensor is exactly -sparse, how does the support estimated from compare to the support of . To do so, we need to estimate the support from . If we know the sparsity, we can estimate the support by taking the indices corresponding to the largest entries of in magnitude. If we do not know the sparsity in advance, we can estimate the support based on a threshold chosen by cross-validation. Given a threshold , the estimated support would be

 ˆ\supp(Θ):={(j,k,ℓ):|ˆΘMLjkℓ|≥γ}.

Note that our theoretical results do not give any guarantees for support recovery. In order to guarantee support recovery, a stronger result bounding the error uniformly for each entry of is required, i.e., we need to control with high probability. Therefore, more work is needed to obtain theoretical guarantees for support recovery. Nevertheless, our simulations show that the estimator is able to recover the support very well.

We first simulate a network with units and lags, giving a parameter space of dimension . We generate a uniformly random sparsity pattern over this space, and then generate the data using the parameter tensor, according to model (1). We use the sigmoid non-linearity . Next we use this data to obtain and compute the estimation error in Frobenius norm . This process is repeated for independent runs. The regularization parameter was set to .

The results are shown in Figure 1. Figure 0(a)

shows how the error changes with the sample size. The shaded area represents one standard deviation of the estimation error over 20 runs. For comparison, a rational function fit of the form

is also plotted on top of the error. In Figure 0(b), we plot the error vs. sparsity for a fixed sample size. The error grows almost linearly with sparsity. A linear fit of the error is also shown for comparison. Figure 0(c) shows the average error, over the runs, for different sparsity levels and different sample sizes. As expected, the error goes down as becomes sparser, or sample size increases.

Finally, the support recovery performance is shown in Figure 0(d). In this experiment, the network has units but only lag for . For recovering the support, we assumed that the sparsity is known, and took the indices corresponding to the largest entries of as the recovered support. The fraction of the correctly recovered indices is plotted against the sample size. Figure 0(d) shows that if the sample size is below some threshold, no entries of the support are recovered, while above the threshold, the recovered fraction gradually increases to .

## 5 Proof of the Main Result

We now outline the proof of Theorem 3.1. The main challenge in the regime , is that the empirical Hessian is rank-deficient and hence the likelihood cannot be strongly convex (i.e., have a positive curvature) in all directions around the true parameter . This means that even though a candidate solution is a stationary point of (5), it could be far away from if the error vector lies in the null-space of . However, for sufficiently large values of , one can guarantee that this almost never happens under certain assumptions.

Specifically, if the regularization parameter is large enough, then lies in a small “cone-like” subset of . Now, it suffices to require that is strongly convex only over this subset (i.e., is uniformly quadratically bounded from below on this set). This observation is by now standard in the high-dimensional analysis of M-estimators. It was shown in [NRWY12, Lem. 1]

that if for any loss function

, problem (5) is solved with regularization parameter satisfying

 λn≥2\norm∇L(Θ∗)∞,∞,∞, (12)

then for any , the error vector belongs to the cone-like set , defined as

 C(S;Θ∗)={Δ∈Ω:\normΔSc1,1,1≤3\normΔS1,1,1+4\normΘ∗Sc1,1,1}. (13)

In other words if satisfies (12).

###### Definition 1 (Restricted Strong Convexity).

For a loss function , let

 RL(Δ;Θ∗):=L(Θ∗+Δ)−L(Θ∗)−⟨∇L(Θ∗),Δ⟩ (14)

be the remainder of the first-order Taylor expansion of the loss function around . A loss function satisfies restricted strong convexity (RSC) relative to with curvature and tolerance if

 RL(Δ;Θ∗)≥κ\fnormΔ2−τ2 (15)

for any , where is any subset of .

The following result gives the desired error bound in Theorem 3.1.

###### Proposition 5.1.

(Adapted from [NRWY12, Thm. 1]) If (5) is solved with regularization parameter satisfying (12), and if for any the RSC condition (15) holds for all , defined in (13), then we have

 \fnormˆΘ−Θ∗2≤9λ2n|S|κ2+λnκ(2τ2+4\normΘ∗Sc1). (16)

To apply this result, we first show in Lemma 5.2 that taking is enough for (12) to hold with high probability. Equation (16) then gives a family of bounds, one for each choice of . Decreasing potentially increases and hence presents a trade-off. We choose an that balances all the terms in the bound. Specifically, we choose that solves (2), with and . For this choice , we show in Proposition 5.3 that (15) holds over with high probability for , and . Putting these together proves Theorem 3.1.

### 5.1 Choice of the regularization parameter

To set such that (12) holds, we need to find an upper bound on . Since affects the error bound directly, we would like to choose the smallest that satisfies (12). In general, one would like to have a vanishing (as ) to guarantee consistency. Our next result provides the necessary bound on the gradient of the loss, leading to a suitable choice for the regularization parameter :

###### Lemma 5.2.

For any constant ,

 \norm∇L(Θ∗)∞,∞,∞≤Lf\eps√c1log(N2p)n (17)

with probability at least , where .

###### Proof.

Fix and . From (4),

 ∂L(Θ)∂Θijℓ=−1nn∑t=1ℓ′it(\ipΘi∗∗,Xt−1)Xt−1jℓ

where It follows that

 ∂L(Θ∗)∂Θijℓ=1nn∑t=1DtijℓwhereDtijℓ:=(1−xti1−zti−xtizti)f′(\ipΘ∗i∗∗,Xt−1)Xt−1jℓ.

Let be the -field generated by the past observations of the process. We have , hence

 \ex[Dtijℓ∣Ft−1]=(1−\ex[xti∣Ft−1]1−zti−\ex[xti∣Ft−1]zti)f′(\ipΘ∗i∗∗,Xt−1)Xt−1jℓ=0.

That is, is a martingale difference sequence. Recall that , while and both by Assumption 2. If follows that is also bounded, i.e., . By the Azuma–Hoeffding inequality for martingale differences [vdG02],

 P(∣∣∂L(Θ∗)∂Θijℓ∣∣>t)=P(∣∣1nn∑i=1Dtijℓ∣∣>t)≤2exp(−n\eps2t22L2f),t>0.

Writing , by the union bound we have,

 \pr(\infnorm∇L(Θ∗)>t)≤2N2p⋅exp(−n\eps2t22L2f)

Taking establishes the result.

### 5.2 Restricted Strong Convexity of L

Proving RSC property (15) for a particular choice of is a major contribution of our work. This is a nontrivial result since it involves uniformly controlling a dependent non-Gaussian empirical process. Even for i.i.d. samples, the task is challenging since the quantity to be controlled, , is a random function and one needs a uniform bound on it from below. Controlling the behavior of this function becomes significantly harder without the independence assumption. We establish the RSC property for the negative log-likelihood loss (4) under the MBP model (1) in the following:

###### Proposition 5.3.

Let . There exists a numerical constant such that if

 n≥c1Gf(Θ∗)c2fc6ℓs3log(N2p) (18)

then, the RSC property (15) holds with

 κ=min{14cfc2ℓ,1},andτ2=σ2ss

for all tensors , with probability at least . The constant in the exponent only depends on .

#### 5.2.1 Proof sketch of Proposition 5.3

To prove this result we proceed by a establishing a series of intermediate lemmas. First, we show that is lower bounded by a quadratic function of :

###### Lemma 5.4 (Quadratic lower bound).

The remainder of the first-order Taylor expansion of the negative log-likelihood function, around , satisfies

 RL(Δ;Θ∗)≥E(Δ;X):=cfnn∑t=1N∑k=1\ipΔk∗∗,Xt−12, (19)

for all and .

Notice that is a random function due to the randomness in . Next, we show that the population mean of is strongly convex:

###### Lemma 5.5 (Strong convexity at the population level).

Under Assumption 1,

 EE(Δ;X)≥cfc2ℓ\normΔ2F,for all Δ∈RN×N×p. (20)

Lemmas 5.4 and 5.5 are proved in Appendix A. Next, we show that for a fixed , the quantity concentrates around its mean. Section 6 proves and provides more comments on the following concentration inequality:

###### Proposition 5.6 (Concentration inequality).

For any , if is generated as (1) with process parameter satisfying , then

 P(∣∣E(Δ;X)−EE(Δ;X)∣∣>t\normΔ22,1,1)≤2exp(−nt2Gf(Θ∗)). (21)

Finally, we use the structural properties of set along with Lemma 5.5 and Proposition 5.6 to give a uniform quadratic lower bound on , which holds with high probability:

###### Lemma 5.7.

The result of Proposition 5.3 holds with replaced with .

The proof of Lemma 5.7 (see Section 7.1) makes use of a discretization argument. Proving uniform laws are challenging when the parameter space is not finite. The discretization of the set uses estimates of the entropy numbers for absolute convex hulls of collections of points (Lemma 7.1). These estimates are well-known in approximation theory and have been previously adapted to the analysis of regression problems in [RWY11]. Proposition 5.3 follows by combining from Lemmas 5.4 and 5.7.

## 6 Concentration of E(Δ;X)

Proposition 5.6 is a concentration inequality for , a quadratic empirical process based on dependent non-Gaussian variables. For independent sub-Gaussian variables, such a concentration result is often called the Hanson–Wright inequality [RV13, Thm. 1]. Providing such inequalities for dependent random variables is significantly more challenging. For dependent Gaussian variables, the machinery of the Hanson–Wright inequality can still be adapted to derive the desired result; see [BM15, Prop. 2.4]. However, these arguments do not extend easily to non-Gaussian dependent variables and hence other techniques are needed to provide such concentration inequalities.

A key observation, discussed in Section 6.1, is that the MBP can be represented as a discrete-space Markov chain. This allows us to use concentration results for dependent processes in countable metric spaces. There are several results for such processes; see [KR08, M96, S00] and [Kon12] for a review. Here, we apply the result by [KR08]. These concentration inequalities are stated in terms of various mixing and contraction coefficients of the underlying process. The challenge is to control the contraction coefficients in terms of the process parameter , which in our case is done using quantities and .

We start by stating the result by Kontorovich et. al. [KR08] for a process consisting of (possibly dependent) random variables taking values in a countable space . For any , define the mixing coefficient

 ηkℓΔ=supw,w′,y ∥∥P(Xnℓ=⋅∣Xk=w,Xk−11=y)−P(Xnℓ=⋅∣Xk=w′,Xk−11=y)∥∥TV. (22)

where the supremum is over and . Let be the matrix with entries for and zero otherwise (i.e., an upper triangular matrix), and let be the operator norm of .

###### Proposition 6.1.

[KR08, Theorem 1.1] Let be an -Lipschitz function of with respect to the Hamming norm, then concentrates around its mean as,

 P(|ϕ(X)−Eϕ(X)|>\eps)≤2exp(−\eps2nL2\mnormH2∞),∀\eps>0. (23)

We apply the above result to by finding an upper bound for the Lipschitz constant of map , and bounding the mixing coefficients in terms of the process parameters , and the link function . Before doing so, let us discuss how to bound using more tractable contraction coefficients.

### 6.1 Bounding \normH∞ using Markov contraction

We refer to a time-invariant process over a countable space as -Markov if for some finite ,

 P(xt=z∣xt−1,xt−2,…)=P(xt=z∣xt−1,xt−2,…,xt−p),∀z∈S,  ∀ t∈Z. (24)

We start by recalling a well-known contraction quantity, the Dobrushin ergodicity coefficient, and relating it to the mixing coefficients of -Markov processes.

#### 6.1.1 Dobrushin ergodicity coefficient

For a Markov chain (or -Markov process) over a discrete space , let

be its (transition) kernel. The kernel is a nonnegative stochastic matrix, i.e., each row is a probability distribution. Thus,

and where is the all-ones vector. Let

 H1:={u∈R|S|∣1⊤u=0}. (25)

This subspace is invariant to every Markov kernels , i.e., for all , we have . The Dobrushin ergodicity coefficient of is defined as

 τ1(P):=supu∈H1 ∥u⊤P∥1∥u∥1