    # Censored autoregressive regression models with Student-t innovations

This paper proposes an algorithm to estimate the parameters of a censored linear regression model when the regression errors are autocorrelated, and the innovations follow a Student-t distribution. The Student-t distribution is widely used in statistical modeling of datasets involving errors with outliers and a more substantial possibility of extreme values. The maximum likelihood (ML) estimates are obtained throughout the SAEM algorithm . This algorithm is a stochastic approximation of the EM algorithm, and it is a tool for models in which the E-step does not have an analytic form. There are also provided expressions to compute the observed Fisher information matrix . The proposed model is illustrated by the analysis of a real dataset that has left-censored and missing observations. We also conducted two simulations studies to examine the asymptotic properties of the estimates and the robustness of the model.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A linear regression model is a commonly used statistical tool to analyze the relationship between the response (dependent) and some explanatory (independent) variables. In these models, the error term is usually considered a zero-mean and constant-variance random variable independent of the other error terms. However, observations collected over time are often autocorrelated rather than independent. Examples of this data abound in economics, medicine, environmental monitoring, and social sciences. The study of the dependence among observations is of considerable practical interest

(see tsay2005analysis; box2015time).

A stochastic model that has been successfully used in many real-world applications to deal with autocorrelated structures in the error term is the autoregressive (AR) model. In the AR model, the current value of the process is expressed as a finite linear combination of previous values of the process and a stochastic shock of disturbance. The shock of disturbance is also known as innovation in the time series literature. In general, it is assumed that the disturbance is normally distributed. For example,

beach1978maximum suggested a maximum likelihood (ML) estimation method to estimate the parameters of AR(1) error term regression models. ullah1983estimation

used the two-stage Prais-Winsten estimator for a linear regression model with first-order autocorrelated disturbances. The authors considered a Gaussian distribution for the innovations to derive the large-sample asymptotic approximation for the variance-covariance matrix.

alpuim2008efficiency

proposed a linear regression model with the sequence of error terms following a Gaussian autoregressive stationary process, in which the parameters of the model are estimated using ML and ordinary least squares (OLS) estimation. Nevertheless, in statistical modeling of datasets involving errors with outliers and a stronger possibility of extreme values, the normality assumption is not satisfied. In these cases, heavy-tailed distributions may be considered. In this context,

tuacc2018robust proposed considering the

distribution with known degrees of freedom as an alternative to the Normal distribution and applying the conditional maximum likelihood (CML) method to find the estimators of the model parameters.

Moreover, modeling data with AR errors can be challenging when some observations are censored or missing. In the first case, values can occur out of the range of a measuring instrument, and in the second, the value is unknown. Censored time series data are frequently encountered in environmental monitoring, medicine, and economics. In the literature, there are some works related to censored autoregressive linear regression models with Normal innovations. For example, zeger1986regression proposed a Gaussian censored autocorrelated model with parameters estimated by maximizing the full likelihood using the quasi-Newton and the EM-algorithms. The authors pointed out that the quasi-Newton approach is preferable because it estimates the variance-covariance matrix for the parameters. park2007censored

developed a censored autoregressive moving average (CENARMA) model with Gaussian white noise, in which the algorithm works iteratively imputing the censored or missing observations by sampling from the truncated Normal distribution. Then the parameters are estimated by maximizing an approximate full likelihood function. Besides,

wang2018quasi suggested a quasi-likelihood method using the complete-incomplete data framework. Recently, schumacher2017censored considered an autoregressive censored linear model with the parameters estimated via the SAEM algorithm in which the conditional mean is used to impute the censored observations. Lastly, liu2019parameter proposed a coupled MCMC-SAEM algorithm to fit an AR regression model with Student- innovations and only missing data, i.e., no censoring was considered.

During the bibliographic review, no studies were found that consider the Student-

distribution for the innovations in censored autoregressive models on a likelihood-based perspective. Hence in this work, we propose an Expectation Maximization (EM)-type algorithm to estimate the parameters of a censored regression model with autoregressive errors and innovations following a Student-

distribution. The SAEM algorithm was employed to obtain the ML estimates because of the difficulties in computing the E-step in the EM algorithm (dempster1977maximum). As it was demonstrated by jank2006implementing, the computational effort of SAEM is much smaller and achieves convergence in just a fraction of the simulation size when compared to the Monte Carlo EM algorithm (wei1990monte), thanks to a recycling of the simulated variables from one iteration to the next.

The paper is organized as follows. Section 2 shows a brief description of autoregressive regression models of order and the SAEM algorithm. Section 3 is devoted to formulate the censored autoregressive model with Student- innovations. It also provides the expressions used to estimate the parameters of the proposed model, the method used to estimate the observed information matrix (louis1982finding), and the expression to make predictions. Section 4 displays the results of two simulation studies carried out to examine the asymptotic properties of the estimates, and to demonstrate the robustness of the model. Section 5 shows the estimates for the parameters of the total phosphorus concentration dataset, available in the R package ARCensReg (schumacher2016package)

, together with the analysis of quantile residuals. Section

6 concludes with a discussion.

## 2 Preliminaries

We begin by defining some notations and introducing the basic definitions and algorithms used throughout this work. The Normal distribution with mean and variance is denoted by , the notation for a -variate Normal distribution with mean and variance-covariance matrix is , and a truncated multivariate Normal distribution with truncation region is denoted by

. The notation for a random variable with Gamma distribution is given by

, where is the shape parameter and is the rate parameter. The Student- distribution with location parameter , scale parameter and degrees of freedom is denoted by . A random variable

with uniform distribution on the interval

is denoted by . The expression “” means independent and identically distributed,

represents the transpose of a matrix or a vector,

is the gamma function evaluated at point , and is an increasing sequence of -field such that , with being a sequence of random variables. We use the index to represent a specific vector or matrix, i.e., and where is the th row of matrix X.

### 2.1 The autoregressive regression model of order p

Consider a classical linear regression model with the difference that the errors are autocorrelated as a discrete-time autoregressive process; thus, the model for an observation at time is given by

 Yt = x⊤tβ+ξt, (1) ξt = ϕ1ξt−1+…+ϕpξt−p+ηt,ηtiid∼F(⋅),t=1,…,n, (2)

where represents the dependent variable observed at time , is a vector of known covariables, is a vector of unknown regression parameters to be estimated, and is the regression error. The regression error follows an autoregressive model as given in Equation 2, where is a vector of autoregressive coefficients and is a shock of disturbance with distribution . The is also known as the innovation in the time series literature (see, for instance, box2015time, schumacher2017censored, and wang2018quasi).

Now suppose that in Equation 2 follows a Student- distribution with location parameter 0, scale parameter and

degrees of freedom, whose probability density function (PDF) will be given by

 fηt(η)=Γ(ν+12)Γ(ν2)(πνσ2)12(1+η2νσ2)−ν+12,η∈R,t=1,…,n. (3)

Then the model defined by Equations 1-3 will be called the autoregressive linear regression model of order (AR) hereinafter. Moreover, the distribution of might be written using a scale mixture of Normal (SMN) distributions as showed in kotz2004multivariate, where can be written as , where , , with being independent of . Consequently, the conditional distribution of given is Normal with zero mean and variance , i.e.,

### 2.2 The SAEM algorithm

dempster1977maximum introduced a general approach to the iterative computation of maximum likelihood (ML) estimates in problems with incomplete data. This algorithm is called the Expectation-Maximization (EM) algorithm. Let be the complete data, where is the observed data and is the incomplete data. Let be the complete log-likelihood function. The EM algorithm proceeds as follows:

• [leftmargin=0.5cm]

• E-step: Let be the estimate of at the th iteration. Compute the conditional expectation of the complete log-likelihood function .

• M-step: Maximize with respect to to obtain .

There are situations where the E-step has no analytic form. To deal with these cases, wei1990monte proposed to replace the expectation in the computation of with a Monte Carlo (MC) approximation based on a large number of independent simulations of the missing data. This algorithm was called the Monte Carlo EM (MCEM) algorithm. As an alternative to the computationally intensive MCEM algorithm, delyon1999convergence proposed a scheme that splits the E-step into a simulation step and an integration step, while the maximization step remains unchanged. The SAEM algorithm is performed as follows:

• [leftmargin=0.5cm]

• E-step:

• Simulation: Draw samples of the missing data from the conditional distribution .

• Stochastic approximation: Update by

 ˆQk(θ)=ˆQk−1(θ)+δk(1MM∑l=1ℓc(θ;y(k,l)m,yo)−ˆQk−1(θ)),

where is a decreasing sequence of positive numbers such that and , is also known as the smoothness parameter (kuhn2005maximum).

• M-step: Update as .

This process is iterated until some distance between two successive evaluations of the actual log-likelihood function becomes small enough.

## 3 Proposed model and parameter estimation

This section is devoted to formulating the censored autoregressive linear model. Here we specify the log-likelihood function, the algorithm used to overcome the parameter estimation problem, expressions to approximate the standard errors, and expressions to predict at unobserved times when the values of some covariables (if needed) at these times are available.

### 3.1 The censored ART(p) model

Assume that the response variable

, in Equation 1, is not fully observed for all times . We observe instead , where represents either an observed value or the limit of detection (LOD) of a censored variable, and is the censoring indicator defined as

 Ct={1if Vt1≤Yt≤Vt2,(censored)0if Vt=Yt.(observed) (4)

Thereby the model defined by Equations 1-4 will be called censored autoregressive regression model of order (CAR). In this context, is left-censored if and , it is right-censored if and , it is interval censored if and , and represents a missing value if and .

To compute the log-likelihood function of the CAR model, we ignore the marginal distribution of the first observations and we denote by the vector of observed data and by the vector of censored or missing observations, with . Note also that the distribution of conditioning on all the preceding data only depends on the previous observations. Then, given and has a Student- distribution with location parameter , scale parameter and degrees of freedom, where is a matrix with the covariates for and is a vector with the covariates related to , i.e.,

 f(yt|θ,Ft−1)=f(yt|θ,y(t,p))=f(yt|θ,yt−1,yt−2,…,yt−p),t=p+1,…,n.

Thus the observed log-likelihood function can be computed by

 ℓ(θ;yo)=log(∫Rmn∏t=p+1f(yt|θ,yt−1,yt−2,…,yt−p)dym). (5)

### 3.2 Parameter estimation via the SAEM algorithm

The observed log-likelihood function in Equation 5 involves complex expressions, making work directly with very difficult. Hence to obtain the ML estimates of , we apply an EM-type algorithm considering the hierarchical representation in the SMN distributions mentioned in Subsection 2.1. Treating and as hypothetical missing data, and augmenting those to the observed data vector , where and , the resulting complete data is given by the vector , and consequently, the complete log-likelihood function can be written as

 ℓ(θ;yc) = g(ν|u)−n−p2logσ2−12σ2n∑i=p+1ui(~yi−w⊤iϕ)2+cte,

with , , , and being a constant.

Let be the current estimate of . Then the conditional expectation of the complete log-likelihood given the observed data, without the constant is

 Qk(θ) = E[ℓ(θ;yc)|% V,C,ˆθ(k)] = ˆg(ν|u)(k)−n−p2logˆσ2(k)−12ˆσ2(k)(ˆuy2∗(k)−2ˆϕ(k)⊤ˆuyy(k)∗+ˆϕ(k)⊤ˆuy2(k)∗ˆϕ(k)),

where

 ˆg(ν|u)(k) = n−p2[ˆν(k)log(ˆν(k)2)−2logΓ(ˆν(k)2)]+ˆν(k)2(ˆlog(u)(k)−ˆu(k)), ˆuy2∗(k) = ˆuy2(k)−n∑i=p+1(2ˆuyi(k)x⊤iˆβ(k)−ˆu(k)iˆβ(k)⊤xix⊤iˆβ(k)), ˆuyy∗(k) = ˆuyy(k)−n∑i=p+1(ˆuyi(k)X(i,p)ˆβ(k)+ˆu%yi(k)x⊤iˆβ(k)−ˆu(k)iX(i,p)ˆβ(k)ˆβ(k)⊤xi), ˆuy2∗(k) = ˆuy2(k)−n∑i=p+1(ˆuyi(k)ˆβ(k)⊤X⊤(i,p)+X(i,p)ˆβ(k)ˆuy% i(k)⊤−ˆu(k)iX(i,p)ˆβ(k)ˆβ(k)⊤X⊤(i,p)),

such that , , , , , , and , for all .

It is easy to observe that the E-step reduces to the computation of , , , , , , and , for all . Because of the difficulties in the E-step, we work with a variation of the EM algorithm called the SAEM algorithm. Assuming that is the current estimate of , the SAEM algorithm proceeds as follows:

Step E-1 (Sampling). In the simulation step, we generate samples from the conditional distribution of the latent variables via the Gibbs sampling algorithm. Thus, we consider the observed and censored/missing part of y separately, and rearrange the location vector and scale matrix parameters in the conditional distribution of according to the following scheme:

• [leftmargin=0.5cm]

• Step 1: Sample from the , with , , and , where the elements , and are computed using Equations 16 and 17 available in Appendix A.1. Then, we construct a full vector of observations as , using the observed and the sample generated values.

• Step 2: Sample from , whose elements are independent and Gamma distributed as with , , and , .

Step E-2 (Stochastic Approximation). Since the sequence for is available, we replace the conditional expectations in by the following stochastic approximations:

 ˆu(k)i=ˆu(k−1)i+δk(1MM∑l=1u(k,l)i−ˆu(k−1)i),i=p+1,…,n. ˆlog(u)(k)=ˆlog(u)(k−1)+δk(1MM∑l=1n∑i=p+1logu(k,l)i−ˆlog(u)(k−1)). ˆuy2(k)=ˆuy2(k−1)+δk(1MM∑l=1n∑i=p+1u(k,l)iy2(k,l)i−ˆuy2(k−1)). ˆuyi(k)=ˆuyi(k−1)+δk(1MM∑l=1u(k,l)iy(k,l)i−ˆuyi(k−1)),i=p+1,…,n. ˆuyy(k)=ˆuyy(k−1)+δk(1MM∑l=1n∑i=p+1u(k,l)iy(k,l)iy(k,l)(i,p)−ˆuyy(k−1)). ˆuyi(k)=ˆuyi(k−1)+δk(1MM∑l=1u(k,l)iy(k,l)(i,p)−ˆuyi(k−1)),i=p+1,…,n. ˆuy2(k)=ˆuy2(k−1)+δk(1MM∑l=1n∑i=p+1u(k,l)i%y(k,l)(i,p)y(k,l)⊤(i,p)−ˆuy2(k−1)).

The variable will be considered as suggested by galarza2017quantile, i.e., if and if , where is the maximum number of iterations and is a cutoff point which determines the percentage of initial iterations with no-memory. If , the algorithm will have a memory for all iterations and hence will converge slowly to the ML estimates, and needs to be large. If

, the algorithm will be memory-free, it will converge quickly to a solution neighborhood, and the algorithm will initiate a Markov chain leading to a reasonably well-estimated mean after applying the necessary

burn-in and thinning steps. A number between 0 and 1 will assure an initial convergence in distribution to a solution neighborhood for the first iterations and an almost sure convergence for the rest of the iterations.

The selection of and could affect the speed of convergence of the SAEM algorithm. A graphical approach can monitor the convergence of the estimates for all parameters and determine the values for these constants, as lavielle2014mixed suggested. An advantage of the SAEM algorithm is that, even though it performs an MCMC E-step, it requires a small and fixed sample size , making it much faster than the MCEM algorithm.

Step CM (Conditional Maximization). The conditional maximization step is carried out, and is updated by maximizing over to obtain a new estimate , which leads to the expressions:

 ˆϕ(k+1)=(ˆuy2∗(k))−1ˆuyy∗(k), (6) ˆσ2(k+1)=1n−p(ˆuy2∗(k)−2ˆϕ(k+1)⊤ˆuyy∗(k)+ˆϕ(k+1)⊤ˆuy2∗(k)ˆϕ(k+1)), (7) ˆβ(k+1)=(n∑i=p+1ˆu(k)iαiα⊤i)−1n∑i=p+1(ˆuyi(k)−ˆϕ(k+1)⊤ˆuyi(k))αi, (8) ˆν(k+1)=argmaxν>2ˆg(ν|u)(k), (9)

with for all .

### 3.3 Standard error approximation

The Fisher information matrix is a good measure of the amount of information a sample dataset provides about the parameters, and it can be used to compute the asymptotic variance of the estimators. louis1982finding developed a procedure for extracting the observed information matrix when the EM algorithm is used to find the ML estimates in incomplete data problems.

Let and be the first and negative second derivatives of the complete-data log-likelihood function with respect to the parameter vector of the complete data, respectively. Let and be the corresponding first and negative second derivatives of the log-likelihood function of the observed data. Then the observed information matrix is given by

 Io(θ)=Bo(yo;θ)=E[Bc(yc;θ)∣∣yo]−E[Sc(yc;θ)S⊤c(yc;θ)∣∣yo]+So(yo;θ)S⊤o(yo;θ). (10)

delyon1999convergence adapted the method proposed by louis1982finding to compute the observed information matrix when the SAEM algorithm is used to estimate the parameters. For this, it is necessary to compute the auxiliary terms

 Hk = −Gk+ΔkΔ⊤k, (11) Gk = Gk−1+δk⎛⎜⎝1MM∑l=1⎛⎜⎝∂2ℓ(θ;y(k,l))∂θ∂θ⊤+∂ℓ(θ;y(k,l))∂θ∂ℓ(θ;y(k,l))∂θ⊤⎞⎟⎠−Gk−1⎞⎟⎠, Δk = Δk−1+δk⎛⎝1MM∑l=1∂ℓ(θ;y(k,l))∂θ−Δk−1⎞⎠,

in which is the vector of observations sampled at iteration for all and , where is the maximum number of iteration of the SAEM algorithm and is the number of MC samples for stochastic approximation. The inverse of the limiting value of can be used to assess the dispersion of the estimators (delyon1999convergence).The analytical expressions for the first and second derivatives of the complete data log-likelihood function are given in Appendix A.2.

### 3.4 Prediction

For generating predicted values from the CAR model, we denote by to the -vector of random variables (RVs) with the observed values and by another vector of RVs of length whose realized values we want to predict from the observed ones.

Let be equal to , where is the vector of uncensored observations and is the vector of censored or missing observations. To deal with the censored values existing in , we use an imputation procedure that consists in replacing the censored values with the values obtained in the last iteration of the SAEM algorithm, i.e., , since elements of can also be updated during the Step E-2 of the SAEM algorithm as

 ˆy(k)m=ˆy(k−1)m+δk(1MM∑l=1y(k,l)m−ˆy(k−1)m),k=1,…,W, (12)

with the same , and settings. The new vector of observed values will be denoted by .

Now, supposing all values in are completely observed, and the explanatory variables for are available, the forecasting procedure will be performed recursively (box2015time) as follows

 ˆyn+k=⎧⎪ ⎪⎨⎪ ⎪⎩x⊤n+kβ+∑k−1i=1ϕi(ˆyn+k−i−x⊤n+k−iβ)+∑pj=kϕj(yn+k−j−x⊤n+k−jβ),1≤k≤px⊤n+kβ+∑pj=1ϕj(ˆyn+k−j−x⊤n+k−jβ),p

where is substituted by and is the estimate obtained via the SAEM algorithm. So .

## 4 Simulation Study

In this section, we examine the asymptotic properties of the SAEM estimates through a simulation study considering different sample sizes and levels of censoring. A second simulation study is performed to demonstrate the robustness of the estimates obtained from the proposed model when the data is perturbed.

### 4.1 Asymptotic properties (Simulation study 1)

This simulation aims to provide empirical evidence about the consistence of the ML estimates under different scenarios. Hence three datasets with 300 MC samples were generated. The sample size for each dataset was , and . The data was generated from the model specified by Equations 1 and 2, considering a Student- distribution for the innovations (’s) with and . The rest of the parameters were set as , , and the covariables at time were , with and .

The idea is to have four scenarios with an average level of censoring/missing of 0%, 5%, 20%, and 35%, respectively. The first case is related to the data without censoring (original data), for the other cases were considered 1.60, 3.45, and 4.30 as LODs, i.e., values lower than the LOD are substituted by the LOD, and additionally 20% of the desired censored rate corresponds to observations randomly selected considered as missing.