 # A Robust Time Series Model with Outliers and Missing Entries

This paper studies the problem of robustly learning the correlation function for a univariate time series in the presence of noise, outliers and missing entries. The outliers or anomalies considered here are sparse and rare events that deviate from normality, which is characterized by a correlation function and an uncertainty condition. This general formulation is applied to univariate time series of event counts (or non-negative time series) where the correlation is a log-linear function and the uncertainty condition follows the Poisson distribution. Approximations to the sparsity constraint, such as ℓ^r, 0 < r < 1, are used to obtain robustness in the presence of outliers. The ℓ^r constraint is also applied to the correlation function to reduce the number of active coefficients. This also helps bypass the model selection procedure. Simulated results are presented to validate the model.


## 1 Introduction

Forecasting (anticipating) future events or activities is an important problem in data science. A common task for a forecaster is to predict normal future events using past and current observations, and to alert when the observed number of events deviates significantly from the predicted value. If events and activities are purely random, then there is no hope of making any meaningful prediction. However, if events are correlated, in the sense that events at time $t$ depend on events prior to $t$ and the correlation function that describes this dependency persists throughout the observed data, then this correlation function can be used to predict future events based on past and current observations. In many real datasets, the observed data is incomplete and is often contaminated with outliers and random noise. An important task is then to robustly learn the correlation function that describes the underlying normal activities and patterns from the observed data. We start with the following setup.

Let $\{t_i\}_{i=1}^N$ be a uniform discretization of some time interval of interest. Let $y_i$ be the number of observed events and $u_i$ be the expected number of events that occur in the $i$-th time interval. Observed events are determined by the conditional probability

$$P(y_i\mid u_i,\theta_i) \tag{1}$$

where $\theta_i$ is some auxiliary variable, representing for instance the variance, that $y_i$ may depend on. In the univariate case, the goal is to learn how $u_i$ depends on the prior $y_j$ and $u_j$, for $j\le i-1$. In other words, one is interested in finding the function $f$ such that

$$u_i=f(\{y_j\}_{j\le i-1},\{u_j\}_{j\le i-1}). \tag{2}$$

In this paper, we consider the case where only a partial series $\{\tilde{y}_i\}_{i\in D}$, for some index set $D\subset\{1,\dots,N\}$, is observed, and it may contain outliers and anomalies. To tackle this problem, the following minimization problem is proposed

$$\min_{a_0,a,b,y}J(a_0,a,b,y),\quad\text{where } J(a_0,a,b,y)=\sum_{i=1}^N\left[u_i-y_i\log(u_i)+\log(\Gamma(y_i+1))\right]+\lambda\sum_{i\in D}|y_i-\tilde{y}_i|^r+\mu\left(\sum_{k=1}^p|a_k|^s+\sum_{k=1}^q|b_k|^s\right), \tag{3}$$

for some $\lambda,\mu>0$ and $0<r,s\le 1$, to jointly learn $(a_0,a,b)$ and the complete series $\{y_i\}_{i=1}^N$ via imputation.

• Here, we consider a parametric correlation function $f$ defined as

$$u_i=f(\{y_j\}_{j\le i-1},\{u_j\}_{j\le i-1})=\max\left(\exp(\log(u_i+1))-1,\,0\right),$$

where $\log(u_i+1)$ is given by

$$\log(u_i+1)=a_0+\sum_{k=1}^p a_k\log(y_{i-k}+1), \tag{4}$$

for some $a_0,a_1,\dots,a_p\in\mathbb{R}$.

• The term $\lambda\sum_{i\in D}|y_i-\tilde{y}_i|^r$ measures the sparsity of the sequence $\{y_i-\tilde{y}_i\}_{i\in D}$ representing anomalies. Similarly, the terms $\mu\sum_k|a_k|^s$ and $\mu\sum_k|b_k|^s$ impose sparsity on $a$ and $b$ and overcome the model selection issue (e.g. AIC, BIC, etc.). Both of these constraints are important for the recovery of the correlation function.

In the presence of missing entries and outliers, Figure 1 shows the ability of the model (3) to recover the parameters, having

$$a_0 = 1,\quad a=(a_1,a_2,a_3,a_4,a_5)=(0.25,\,-0.5,\,0,\,0,\,-0.5,\,0.5).$$

Figure 1: Plots of estimated parameters $a_0$ and $a$ for 100 simulations when 50% of the data is observed and 2.5% of the observed entries are contaminated.
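As a hedged illustration (not the authors' code), the data-generating process behind such an experiment can be sketched in Python; the function name `simulate_log_linear_poisson` is an assumption, and the coefficient values below are illustrative choices in the spirit of the experiment:

```python
import numpy as np

def simulate_log_linear_poisson(a0, a, N, seed=0):
    """Simulate y_i ~ Poisson(u_i) with the log-linear recursion (4),
    using zero boundary conditions for indices before the start."""
    rng = np.random.default_rng(seed)
    p = len(a)
    y = np.zeros(N)
    u = np.zeros(N)
    for i in range(N):
        # log(u_i + 1) = a0 + sum_k a_k * log(y_{i-k} + 1)
        s = a0 + sum(a[k - 1] * np.log(y[i - k] + 1.0)
                     for k in range(1, p + 1) if i - k >= 0)
        u[i] = max(np.exp(s) - 1.0, 0.0)  # enforce u_i >= 0
        y[i] = rng.poisson(u[i])
    return y, u

# Illustrative coefficients (sparse, with both signs).
y, u = simulate_log_linear_poisson(a0=1.0, a=[0.25, -0.5, 0.0, 0.0, 0.5], N=500)
```

Randomly dropping 50% of the entries of `y` and perturbing a small fraction of the rest would then produce the partially observed, contaminated series $\{\tilde{y}_i\}_{i\in D}$ that the model takes as input.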

In what follows, we provide motivation for the proposed model. Many well-known time series models can be described by (1) and (2). For instance, in an ARMA model [5, 18], each $y_i$ is defined as

$$y_i=\phi_0+\sum_{k=1}^p\phi_k y_{i-k}+\sum_{k=1}^q\psi_k e_{i-k}+e_i, \tag{5}$$

where $e_i$ follows $\mathcal{N}(0,\sigma^2)$, a Gaussian distribution with mean $0$ and variance $\sigma^2$. Let

$$u_i:=\phi_0+\sum_{k=1}^p\phi_k y_{i-k}+\sum_{k=1}^q\psi_k e_{i-k}, \tag{6}$$

then equation (5) implies $e_i=y_i-u_i$. This implies

$$P(y_i\mid u_i,\theta_i=\sigma)=P(y_i-u_i=e_i)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{|y_i-u_i|^2}{2\sigma^2}}. \tag{7}$$

Substituting $e_{i-k}=y_{i-k}-u_{i-k}$ in (6), one obtains

$$u_i=f(\{u_j\}_{j\le i-1},\{y_j\}_{j\le i-1})=\phi_0+\sum_{k=1}^{p'}\alpha_k y_{i-k}+\sum_{k=1}^{q'}\beta_k u_{i-k}, \tag{8}$$

for some real-valued $\alpha_k$ and $\beta_k$ and some new positive integers $p'$ and $q'$. Here we assume zero boundary conditions, that is, $y_j$ and $u_j$ are identically zero whenever $j\le 0$. Clearly, other types of boundary conditions such as reflection or Neumann can be used. Thus (5) can be transformed into (7) and (8), which are (1) and (2) respectively.
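To make this reduction concrete, the conditional means $u_i$ can be computed recursively from the observations alone, using $e_i=y_i-u_i$ and zero boundary conditions. This is an illustrative sketch; the helper name `arma_conditional_means` is not from the paper:

```python
import numpy as np

def arma_conditional_means(y, phi0, phi, psi):
    """Conditional means u_i from (6), recovering each innovation as
    e_i = y_i - u_i, with zero boundary conditions (y_j = u_j = 0, j < 0)."""
    N = len(y)
    u = np.zeros(N)
    e = np.zeros(N)
    for i in range(N):
        u[i] = phi0
        u[i] += sum(phi[k - 1] * y[i - k]
                    for k in range(1, len(phi) + 1) if i - k >= 0)
        u[i] += sum(psi[k - 1] * e[i - k]
                    for k in range(1, len(psi) + 1) if i - k >= 0)
        e[i] = y[i] - u[i]  # innovation recovered from the observation
    return u
```

For example, a pure AR(1) with $\phi_1=0.5$ gives $u_i=0.5\,y_{i-1}$, while a pure MA(1) propagates the recovered innovations instead.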

Another example is the Poisson linear autoregressive model (see [20, 25], among others). Assuming $y_i$ is a non-negative integer, (1) is given by

$$P(y_i\mid u_i)=\frac{u_i^{y_i}e^{-u_i}}{y_i!}, \tag{9}$$

and (2) is given by

$$u_i=f(\{u_j\}_{j\le i-1},\{y_j\}_{j\le i-1})=a_0+\sum_{k=1}^p a_k y_{i-k}+\sum_{k=1}^q b_k u_{i-k},$$

where $a_0$, $a_k$ and $b_k$ are non-negative. This shows that the model can only capture zero or positive correlations. To overcome this drawback, the log-linear model is often used [40, 20, 25, 15], where $u_i$ is defined such that

$$\log(u_i+1)=a_0+\sum_{k=1}^p a_k\log(y_{i-k}+1)+\sum_{k=1}^q b_k\log(u_{i-k}+1).$$

Here $a_0$, $a_k$ and $b_k$ are real-valued and can therefore represent negative correlations.

In the case of complete observations, where we are given the series $\{y_i\}_{i=1}^N$ and some prior knowledge of the conditional probability (1), the task is to learn the optimal $(f,\theta)$ that maximizes the likelihood function,

$$(f^*,\theta^*)=\operatorname*{argmax}_{f,\theta}\{L(f,\theta)=P(y,f,\theta)\}. \tag{10}$$

It can be shown that the joint probability in (10) is given by

$$P(y,f,\theta)=\prod_{i=1}^N P(y_i\mid u_i,\theta_i)\,P(f)\,P(\theta), \tag{11}$$

assuming $f$ and $\theta$ are independent. The maximization problem in (10) is equivalent to

$$(f^*,\theta^*)=\operatorname*{argmin}_{f,\theta}\left\{-\sum_{i=1}^N\log(P(y_i\mid u_i,\theta_i))-\log(P(f)P(\theta))\right\}. \tag{12}$$

With some knowledge about $u_{N+1}$ and $\theta_{N+1}$, the conditional probability (1) provides the distribution of $y_{N+1}$. For a single-value prediction $\hat{y}_{N+1}$, we can solve

$$\hat{y}_{N+1}:=\operatorname*{argmax}_{y_{N+1}}P(y_{N+1}\mid u_{N+1},\theta_{N+1}). \tag{13}$$

In the case where the conditional probability follows a Poisson distribution, $P(y_{N+1}\mid u_{N+1})$ is completely determined by $u_{N+1}$, and it does not depend on the auxiliary variable $\theta_{N+1}$.
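In the Poisson case, the single-value prediction (13) is simply the mode of the Poisson distribution, which is $\lfloor u_{N+1}\rfloor$ for non-integer $u_{N+1}>0$. A minimal sketch (the helper names are illustrative, and a local search is used rather than the closed form):

```python
import math

def poisson_logpmf(y, u):
    # log P(y | u) = y*log(u) - u - log(Gamma(y + 1)); assumes u > 0
    return y * math.log(u) - u - math.lgamma(y + 1)

def predict_next(u_next, search_width=10):
    """Single-value prediction (13): the nonnegative integer y maximizing
    P(y | u_next). For the Poisson this is the mode, floor(u_next)."""
    center = int(u_next)
    candidates = range(max(0, center - search_width), center + search_width + 1)
    return max(candidates, key=lambda y: poisson_logpmf(y, u_next))
```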

In this paper, we consider the problem of learning parametrically the underlying correlation function $f$ given a partially observed series $\{\tilde{y}_i\}_{i\in D}$, for some $D\subset\{1,\dots,N\}$, which may be contaminated by outliers and anomalies. Given some prior knowledge about the uncertainty condition (1), the problem consists of: 1) extending $\{\tilde{y}_i\}_{i\in D}$ to the whole series $\{y_i\}_{i=1}^N$ via imputation such that $y_i$ is normal for all $i$, and 2) using the complete series to learn $f$. These two steps are done iteratively as they are inter-dependent. The interpretation of $y_i$ for $i\in D$ is as follows:

1. Suppose $\tilde{y}_i$ is normal, i.e. it can be described by $f$ and the uncertainty condition (1); then we would like to enforce $y_i=\tilde{y}_i$.

2. On the other hand, if $\tilde{y}_i$ is anomalous, then we allow $y_i\neq\tilde{y}_i$ and let the model decide a normal value for $y_i$.

Moreover, outliers and anomalies are interpreted as rare events that are supported sparsely on $D$. Based on this interpretation, we would like the difference series $\{y_i-\tilde{y}_i\}_{i\in D}$ to be sparse. Thus this can be seen as solving the optimization problem:

$$\min_{y,f,\theta}\left\{-\sum_{i=1}^N\log(P(y_i\mid u_i,\theta_i))-\log(P(f))-\log(P(\theta))\right\}, \tag{14}$$

with the constraint that the partial series $\{y_i-\tilde{y}_i\}_{i\in D}$ is sparse. Here, $P(f)$ and $P(\theta)$ are priors on $f$ and $\theta$, respectively. Sparsity is an essential ingredient in the theory of compressed sensing [7, 8, 13]. Approximating the sparsity constraint has been a subject of great importance, as the exact sparsity problem is NP-hard. Here we consider sparsity approximations as proposed in [7, 8, 13, 10], using $\ell^r$ for $0<r\le 1$. Incorporating these sparsity approximations into the minimizing energy (14), we propose the following unconstrained variational problem

$$\min_{f,y,\theta}\left\{J(y,f,\theta)=-\sum_{i=1}^N\log(P(y_i\mid u_i,\theta_i))-\log(P(f))-\log(P(\theta))+\lambda\sum_{i\in D}|y_i-\tilde{y}_i|^r\right\}. \tag{15}$$
###### Remark 1.

Predicting $y_{N+1}$ with the mean $u_{N+1}$ is optimal. However, for an interior index $k<N$, imputing $y_k$ with the mean is not always optimal. Indeed, suppose $u_i$ only depends on the previous $y_j$'s, i.e. $u_i=f(\{y_j\}_{i-p\le j\le i-1})$ for some $p\ge 1$. Suppose also that $f$ and $\theta$ are known with no outliers, that is, $y_i=\tilde{y}_i$ for all $i\in D$, and $y_k$ is the only non-observed entry. The task is to compute the optimal $y_k^*$ by using (15). This amounts to solving

$$y_k^*=\operatorname*{argmin}_{y_k}\left[-\sum_{i=1}^N\log(P(y_i\mid u_i,\theta_i))\right].$$

If $k=N$, then it is clear that only the last term in the above sum contains $y_N$. This implies

$$y_N^*=\operatorname*{argmin}_{y_N}\left[-\log(P(y_N\mid u_N,\theta_N))\right]=\operatorname*{argmax}_{y_N}P(y_N\mid u_N,\theta_N).$$

On the other hand, if $k<N$, then

$$y_k^*=\operatorname*{argmin}_{y_k}\left[-\sum_{i=k}^{\min(k+p,N)}\log(P(y_i\mid u_i,\theta_i))\right]=\operatorname*{argmax}_{y_k}\left[\prod_{i=k}^{\min(k+p,N)}P(y_i\mid u_i,\theta_i)\right].$$

The latter case shows that the mean $u_k$ is not always an optimal value for $y_k$.
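This remark can be made concrete with a toy Gaussian AR(1) instance, $u_i=a\,y_{i-1}$ (an illustrative example, not the paper's Poisson model): when an interior entry $y_k$ is missing but $y_{k-1}$ and $y_{k+1}$ are observed, minimizing the two negative log-likelihood terms that involve $y_k$ has a closed-form solution that differs from the conditional mean $a\,y_{k-1}$.

```python
# Toy Gaussian AR(1): u_i = a * y_{i-1}.  With y_k missing, minimize over y_k
#   (y_k - a*y_{k-1})^2 + (y_{k+1} - a*y_k)^2 ,
# whose stationary point is y_k* = a*(y_{k-1} + y_{k+1}) / (1 + a^2).
def optimal_interior_impute(a, y_prev, y_next):
    return a * (y_prev + y_next) / (1.0 + a * a)

a, y_prev, y_next = 0.8, 2.0, 3.0
mean_impute = a * y_prev                                   # conditional mean: 1.6
best_impute = optimal_interior_impute(a, y_prev, y_next)   # 4.0 / 1.64, about 2.44
```

The optimal imputation pulls $y_k$ toward the later observation $y_{k+1}$ as well, which the conditional mean ignores.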

The paper is organized as follows. Section 2 recalls the most relevant prior work. Section 3 describes the proposed Poisson log-linear model, with the uncertainty condition (1) following the Poisson distribution and the correlation function (2) following a log-linear function. One subproblem in solving (15) is to compute

$$\inf_{t\in\mathbb{R}}\left\{E_r(t)=\mu|t|^r+\frac{1}{2}|t-t'|^2\right\},$$

for some fixed $\mu>0$, $0<r<1$ and $t'\in\mathbb{R}$. Nie et al. proposed a method for solving this problem via computing a zero of a strictly convex function. For completeness, we go over in section 4 a similar method for computing the proximal operator for $\ell^r$. Section 5 goes over an algorithm for computing a minimizer of (15). Section 6 shows numerical results on simulated data to validate the proposed model. In Appendix A we show that the above minimization problem (15) is related to

$$\max_{y,f,\theta}P(\tilde{y}_D,y,f,\theta),$$

for some fixed $\tilde{y}_D$ and $D$.

## 2 Prior Work

Huber considered the classical least squares problem of learning parameters $a\in\mathbb{R}^p$ from observations $y_i$ obeying the relation

$$y_i=\sum_{j=1}^p x_{ij}a_j+e_i. \tag{16}$$

Here the $x_{ij}$ are known coefficients and the $e_i$ are i.i.d. Gaussian noise. For an autoregressive model, $x_{ij}=y_{i-j}$. Estimating the parameter $a$ amounts to minimizing the sum of squares

$$\min_{a\in\mathbb{R}^p}\left\{\sum_{i=1}^N\Big|y_i-\sum_{j=1}^p x_{ij}a_j\Big|^2=\sum_{i=1}^N|e_i|^2\right\}. \tag{17}$$

Let $X=(x_{ij})$ and $u=Xa$; then the relative condition number measuring the sensitivity of $u$ with respect to perturbations of $a$ is

$$\kappa_{a\to u}=\frac{\|X\|\,\|a\|}{\|Xa\|}\le\|X\|\,\|X^+\|,$$

where $\|\cdot\|$ is an arbitrary matrix norm and $X^+$ is the pseudo-inverse of $X$, if it exists. Let $\sigma_1$ and $\sigma_p$ be the largest and smallest singular values of $X$, respectively; then, using the 2-norm, one has $\kappa_{a\to u}\le\sigma_1/\sigma_p$. If this ratio is large, a small deviation in $u$ can create a large deviation in the solution $a$, and hence it is important to obtain an accurate estimate for $u$. As noted in the robust statistics literature, outliers affect the accuracy of the estimates. Following the standard conditioning analysis of least squares problems, the condition number measuring the sensitivity of $a$ with respect to perturbations of $y$ involves the angle $\theta$ between $y$ and the range of $X$. The more noise and outliers in the system, the closer $\theta$ is to $\pi/2$. This leads to a large condition number. Hence it is important to have good estimates of both $a$ and $u$ in the presence of missing data and outliers in the observations.
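The singular-value ratio $\sigma_1/\sigma_p$ discussed above is easy to evaluate numerically; a minimal sketch using the 2-norm:

```python
import numpy as np

def condition_number(X):
    """2-norm condition number sigma_max / sigma_min of a matrix X."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

# An ill-conditioned design: small perturbations of u = X a can move a a lot.
X = np.array([[1.0, 0.0], [0.0, 1e-3]])
```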

In [22, 23], Huber proposed a robust alternative to (17) by considering

$$\min_a\left\{\sum_{i=1}^N\rho\Big(y_i-\sum_{j=1}^p x_{ij}a_j\Big)=\sum_{i=1}^N\rho(e_i)\right\}, \tag{18}$$

where $\rho$ is chosen so that it is less sensitive to large $e_i$. In particular, the proposed $\rho$ has the form

$$\rho(x)=\begin{cases}\frac{1}{2}|x|^2 & \text{if } |x|\le c,\\[2pt] c|x|-\frac{1}{2}c^2 & \text{if } |x|>c,\end{cases}$$

where $c>0$ is some chosen constant which is data-dependent. From the definition of $\rho$ we see that if $|e_i|\le c$, least squares is performed in (18), and hence the model respects the additive Gaussian noise assumption in (16). When $|e_i|>c$, $e_i$ is no longer considered Gaussian but is assumed to follow a Laplacian distribution of the form

$$P(e_i)=f_c(e_i)=\left[C(c)\,e^{\frac{1}{2}c^2}\right]e^{-c|e_i|},$$

where $C(c)$ is a normalizing constant such that $\int f_c=1$. Note that the Laplacian distribution has a wider tail than the Gaussian and hence better accommodates large $|e_i|$. Another popular robust choice for $\rho$ is the least absolute deviation $\rho(x)=|x|$, which also corresponds to a Laplacian distribution.
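Huber's $\rho$ above is straightforward to implement; a minimal sketch (quadratic inside the threshold $c$, linear beyond it, with matching value and slope at $|x|=c$):

```python
def huber_rho(x, c):
    """Huber's rho: quadratic for |x| <= c, linear with slope c beyond."""
    ax = abs(x)
    return 0.5 * ax * ax if ax <= c else c * (ax - 0.5 * c)
```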

The standard LASSO amounts to learning a sparse parameter vector $a$ via minimizing

$$\min_a\left\{\frac{1}{2\sigma^2}\|y-Xa\|_2^2+\mu\|a\|_1\right\}.$$

This model is still sensitive to outliers. A modification to this model introduces an extra variable $z$ representing outliers (see the references therein):

$$\min_{a,z}\left\{\frac{1}{2\sigma^2}\|Xa-y-z\|_2^2+\mu\|a\|_1+\lambda\|z\|_1\right\}.$$

Suppose $u_i=f(t_i)$ for some unknown function $f$. A robust nonparametric model is proposed:

$$\min_{f,z}\left\{\sum_{i=1}^N|y_i-u_i-z_i|^2+\mu\|f\|_{\mathcal{H}}+\lambda\|z\|_1\right\},$$

where $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS), which includes Sobolev spaces. A slightly different approach is proposed where each $u_i$ is parameterized by a vector $a$, and both $a$ and $f$ are learned via solving

$$\inf_{a,f}\left\{\sum_{i=1}^N|y_i-u_i|^2+\lambda\|a\|_1\right\}.$$

In connection with the proposed model (15), take $r=s=1$ and let $u_i=\sum_{j=1}^p x_{ij}a_j$ with $\theta_i=\sigma$ and

$$P(y_i\mid u_i,\theta_i)=P(y_i\mid u_i,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{|y_i-u_i|^2}{2\sigma^2}}$$

for some fixed $\sigma>0$. Note that $u_i$ is completely determined by $a$, and estimating the parameter vector $a$ with the LASSO prior amounts to minimizing

$$\min_{a,y}\left\{\frac{1}{2\sigma^2}\|y-Xa\|_2^2+\mu\|a\|_1+\lambda\sum_{i\in D}|y_i-\tilde{y}_i|\right\}.$$

Here, the sparse vector representing outliers is $(y_i-\tilde{y}_i)_{i\in D}$. Note that the proposed method (15) is much more general and can accommodate other types of noise in the data that are not additive (multiplicative Gaussian, Poisson, negative binomial, etc.).

There are quite a few existing methods for estimating the parameters in the presence of missing data; from a high-level perspective, they align with the following two approaches.

The first approach iteratively imputes missing and unobserved data in some manner and then uses the imputed and observed data to estimate the parameters. These methods include mean imputation, expectation maximization (EM), and multiple imputation, among others. See [30, 33, 21] for a survey of some of these methods. Matrix completion [6, 9] is a form of imputation where missing entries in the data matrix are imputed under the assumption that the data matrix has low rank. The imputation performed in Algorithm 4 for solving (15) follows along the lines of the iterative approach of the EM method, but instead of maximizing the expectation we maximize the likelihood.

The second approach either doesn't impute at all, or only imputes the necessary missing entries on which the observed entries depend. For instance, in the full information maximum likelihood (FIML) method, only complete data points are used as inputs to estimate the parameters. Suppose we are only interested in estimating the constant $a_0$; then all observed data points are complete. However, for $p\ge 1$, if one in every few consecutive points is missing, then the set of complete data points may be empty and hence the FIML method is not applicable. The non-negative definite covariance method [26, 11] only considers observed data points as inputs, as opposed to complete data points. Here the observed data points may depend on the missing data; however, this method sets this dependency to zero, that is, it imposes zero boundary conditions on the unobserved entries on which some of the observed entries depend (see section 3.2.1 in the cited work). A further modification was introduced to ensure the non-negative definiteness of the covariance matrix. This second approach can also be applied to our problem, and it is more appropriate when the missing entries are systematic as opposed to random. To motivate the problem, consider the case where the correlation function is a constant, that is, $u_i=a_0$ for all $i$. Assuming that the observed series has no outliers, $a_0$ can be approximated by the mean of the observed entries and there is no need to impute the unobserved $y_i$. Similarly, suppose $u_i$ depends only on $y_{i-1}$ with $i\in D$ but $i-1\notin D$; then it is only necessary to impute $y_{i-1}$, as opposed to all missing entries. One can view such an imputed entry as an unknown boundary condition. Thus in this second approach, the minimization problem becomes

$$\min_{f,\,y_{\bar{D}},\,u_{D'},\,\theta}\left\{J(y_{\bar{D}},u_{D'},f,\theta)=-\sum_{i\in D}\log(P(y_i\mid u_i,\theta_i))-\log(P(f))+\lambda\sum_{i\in D}|y_i-\tilde{y}_i|^r\right\}, \tag{20}$$

where $\bar{D}$ consists of the unobserved indices on which $u_i$ depends for some $i\in D$, and $D'$ consists of all the indices $j$ such that $u_i$ depends on $u_j$ for some $i\in D$. Note that the difference between (15) and (20) is that in (20) the sum is only over observed indices, as opposed to all indices (observed and unobserved). It would be interesting to compare this approach with the proposed method (15), and we leave this for future work.

## 3 Poisson Log-Linear Model

In this section, we consider the Poisson distribution for the uncertainty condition (1) and a log-linear correlation function for (2) in (15). In particular, we suppose that $y_i$ is conditioned only on $u_i$ and that it follows the Poisson distribution,

$$P(y_i\mid u_i)=\frac{u_i^{y_i}e^{-u_i}}{\Gamma(y_i+1)}, \tag{21}$$

where $\Gamma(y_i+1)=y_i!$ whenever $y_i$ is a nonnegative integer. Note that (21) is defined for all real $y_i\ge 0$. We consider $f$ to be a log-linear correlation function satisfying

$$\log(u_i+1)=a_0+\sum_{k=1}^p a_k\log(y_{i-k}+1)+\sum_{k=1}^q b_k\log(u_{i-k}+1), \tag{22}$$

for some real-valued $a_0$, $a_k$ and $b_k$, with the constraint $u_i\ge 0$. In other words,

$$u_i=f(\{y_j\}_{j\le i-1},\{u_j\}_{j\le i-1})=\max\left(\exp(\log(u_i+1))-1,\,0\right),$$

where $\log(u_i+1)$ is defined as in (22). Since $u_i$ is completely determined by $(a_0,a,b)$ and the series $\{y_j\}_{j\le i-1}$, we see that the series $\{u_i\}$ is completely determined by $(a_0,a,b)$ and the series $\{y_i\}$.
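Since $\{u_i\}$ is determined by $(a_0,a,b)$ and $\{y_i\}$, the recursion (22) can be evaluated forward in a single pass. A hedged sketch (the function name is illustrative; zero boundary conditions are assumed, and the constraint $u_i\ge 0$ is enforced by clipping $\log(u_i+1)$ at $0$):

```python
import numpy as np

def log_linear_means(y, a0, a, b):
    """Evaluate u_i from the log-linear recursion (22), with zero boundary
    conditions and the constraint u_i >= 0 (i.e. log(u_i + 1) >= 0)."""
    N = len(y)
    logu1 = np.zeros(N)              # stores log(u_i + 1)
    u = np.zeros(N)
    for i in range(N):
        s = a0
        s += sum(a[k - 1] * np.log(y[i - k] + 1.0)
                 for k in range(1, len(a) + 1) if i - k >= 0)
        s += sum(b[k - 1] * logu1[i - k]
                 for k in range(1, len(b) + 1) if i - k >= 0)
        logu1[i] = max(s, 0.0)
        u[i] = np.exp(logu1[i]) - 1.0
    return u
```

This forward evaluation is what makes the data-fidelity term of (24) computable for any candidate $(a_0,a,b,y)$.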

The prior on $f$ is now given by

$$P(f)=P(a_0)\left[\prod_{k=1}^p P(a_k)\right]\left[\prod_{k=1}^q P(b_k)\right].$$

Here we assume all the parameters are independent of each other and follow a family of exponential power distributions

$$f_{\mu,s}(x)=C(\mu,s)\,e^{-\mu|x|^s},\quad\text{for some } 0<s\le 1, \tag{23}$$

where $C(\mu,s)$ is chosen such that $\int f_{\mu,s}=1$. The case $s=1$ corresponds to the LASSO constraint, and $0<s<1$ corresponds to the Bayesian bridge constraint proposed in [16, 34].

Combining (21)-(23) into (15), the proposed variational problem becomes

$$\min_{a_0,a,b,y}J(a_0,a,b,y),\quad\text{where } J(a_0,a,b,y)=\sum_{i=1}^N\left[u_i-y_i\log(u_i)+\log(\Gamma(y_i+1))\right]+\lambda\sum_{i\in D}|y_i-\tilde{y}_i|^r+\mu\left(\sum_{k=1}^p|a_k|^s+\sum_{k=1}^q|b_k|^s\right), \tag{24}$$

for some $\lambda,\mu>0$ and $0<r,s\le 1$.

###### Remark 2.

It is possible to minimize the energy in (24) over the set of nonnegative integer-valued series $\{y_i\}$. However, this set is not convex. To overcome this non-convexity, we extend (21) to all nonnegative real-valued series and therefore use $\Gamma(y_i+1)$ as opposed to $y_i!$. We remark that this extension is not the continuous version of the Poisson distribution proposed in [28, 24]. Given $\lambda>0$, the cumulative distribution function for the continuous Poisson is defined as $F_\lambda(t)=0$ if $t\le 0$ and

$$F_\lambda(t)=\frac{\Gamma(t,\lambda)}{\Gamma(t)},\quad\text{for } t>0.$$

Here

$$\Gamma(t,\lambda):=\int_\lambda^\infty e^{-\tau}\tau^{t-1}\,d\tau$$

is the upper incomplete Gamma function. Let the probability density function $f_\lambda$ be defined such that

$$F_\lambda(t)=\int_0^t f_\lambda(\tau)\,d\tau.$$

It can be shown that for all $t\ge 0$,

$$\int_t^{t+1}f_\lambda(\tau)\,d\tau=\frac{e^{-\lambda}\lambda^t}{\Gamma(t+1)}.$$

Thus instead of (21), one can use the continuous Poisson density or

$$P(y_i\mid u_i)=\frac{e^{-u_i}u_i^{\lfloor y_i\rfloor}}{\Gamma(\lfloor y_i\rfloor+1)},$$

where $\lfloor y_i\rfloor$ is the largest integer less than or equal to $y_i$.

## 4 Proximal Mapping for ℓr, 0<r<1

One of the ingredients for computing (24) is to solve a subproblem

$$\operatorname{shrink}_r(t',\mu)=J_{\partial E_r}(t'):=\operatorname*{argmin}_{t\in\mathbb{R}}\left\{E_r(t)=\mu|t|^r+\frac{1}{2}|t-t'|^2\right\}, \tag{25}$$

for some $\mu>0$, $0<r\le 1$ and $t'\in\mathbb{R}$. For $r=1$, it was shown that the minimizer is given by

$$J_{\partial E_1}(t')=\operatorname{shrink}_1(t',\mu):=\operatorname{sign}(t')(|t'|-\mu)_+, \tag{26}$$

where $(x)_+:=\max(x,0)$. For $0<r<1$, a common approach for solving (25) is to consider a regularized version via

$$\min_{t\in\mathbb{R}}\left\{H_r(t)=\mu|t+\epsilon|^r+\frac{1}{2}|t-t'|^2\right\},$$

for some small $\epsilon>0$. The minimizer of $E_r$ is approximated by the limit of the gradient iteration

$$t_{n+1}=t_n-dt\,\frac{dH_r}{dt}(t_n),$$

with some initial guess $t_0$. Since $H_r$ is nonconvex, the limit may not be a global minimizer. In Remark 3 below, we show that for a range of $\mu$, with the initial guess $t_0=t'$, the above iteration can converge to a point that is not a global minimizer.

There are explicit formulas for $\operatorname{shrink}_r$ for special values of $r$, but for general $0<r<1$, no closed-form expression exists. Nie et al. proposed a method for solving (25) via computing a zero of a strictly convex function using Newton's method. For completeness, we present here a similar method.

Recall (25); for simplicity assume $t'>0$, and denote by $t^*$ the global minimizer of $E_r$. First, note that $t^*\in[0,t']$. Now, for $t>0$ we get

$$\frac{d}{dt}E_r(t)=\mu r\,t^{r-1}+(t-t').$$

Define

$$g_r(t)=\mu r-t'\,t^{1-r}+t^{2-r},\quad\text{for } t\ge 0.$$

Note also that $g_r(t)=t^{1-r}\,\frac{d}{dt}E_r(t)$ for $t>0$, which shows that $\frac{d}{dt}E_r(t)=0$ if and only if $g_r(t)=0$. Suppose $t^*>0$. This implies that $t^*$ must satisfy

$$g_r(t^*)=\mu r-t'(t^*)^{1-r}+(t^*)^{2-r}=0. \tag{27}$$

On $(0,\infty)$, we get

$$g_r'(t)=-(1-r)\,t'\,t^{-r}+(2-r)\,t^{1-r},$$

and

$$g_r''(t)=(1-r)r\,t'\,t^{-r-1}+(2-r)(1-r)\,t^{-r},$$

which is strictly greater than $0$ since $t'>0$. This implies $g_r$ is strictly convex on $(0,\infty)$, achieving its minimal value at $t_0=\frac{(1-r)\,t'}{2-r}$, the unique zero of $g_r'$. Moreover, we have

$$g_r(0)=g_r(t')=\mu r>0,\quad\text{and}\quad g_r(t_0)=\mu r+(t_0-t')\,t_0^{1-r}<\mu r.$$

This implies the following:

1. If $g_r(t_0)>0$, that is,

$$\mu>\frac{1}{r}(t'-t_0)\,t_0^{1-r}, \tag{28}$$

then $g_r$ has no zeros on $(0,\infty)$. This shows that a positive global minimizer does not exist, and hence $t^*=0$.

2. If $g_r(t_0)<0$, that is,

$$\mu<\frac{1}{r}(t'-t_0)\,t_0^{1-r}, \tag{29}$$

then $g_r$ has exactly two zeros $t_1^*<t_2^*$ in $(0,t')$. Since $g_r(0)=\mu r>0$ and $g_r$ has the same sign as $\frac{d}{dt}E_r$ on $(0,\infty)$, the function $E_r$ is strictly increasing near zero. This implies that the first zero $t_1^*$ is not a local minimizer for $E_r$, and hence $t^*\in\{0,t_2^*\}$.

3. If $g_r(t_0)=0$, that is,

$$\mu=\frac{1}{r}(t'-t_0)\,t_0^{1-r}, \tag{30}$$

then $g_r$ has exactly one zero at $t_0$. Since $t_0$ is the first zero of $g_r$ and $E_r$ is strictly increasing near $0$, a positive global minimizer does not exist, and hence $t^*=0$.

###### Remark 3.

Cases (28) and (30) correspond to $E_r$ being strictly increasing on $[0,\infty)$, and hence $t^*=0$ is the global minimizer. Note that in case (30), $t_0$ is a saddle point of $E_r$. As for case (29), $E_r$ has one positive local minimizer at $t_2^*$, and therefore the global minimizer is $t^*=\operatorname*{argmin}_{t\in\{0,\,t_2^*\}}E_r(t)$. It is possible that $E_r(0)=E_r(t_2^*)$ for some $\mu$. In this case the global minimizer of $E_r$ is not unique.

Figure 2 shows the plots of $g_r$ and the corresponding $E_r$ for various choices of $\mu$, with the green lines marking the corresponding critical values. For $\mu$ above the threshold in (30), the global minimizer of $E_r$ is $0$. However, for $\mu$ below that threshold, the second zero $t_2^*$ of the corresponding $g_r$ may be the global minimizer of $E_r$.

From the above remark, for $t'>0$ the shrinkage operator in (25) is given by

$$\operatorname{shrink}_r(t',\mu)=\begin{cases}t_2^* & \text{if } \mu<\frac{1}{r}(t'-t_0)\,t_0^{1-r}\ \text{and } E_r(t_2^*)\le E_r(0),\\[2pt] 0 & \text{otherwise,}\end{cases} \tag{31}$$

where $t_2^*$ is the second zero of $g_r$.

Computing the second zero $t_2^*$ of $g_r$ (assuming condition (29) holds) is fast and straightforward since $g_r$ is strictly convex. For instance, one can use Newton's method as follows:

###### Algorithm 1.

Newton's method for computing the second zero $t_2^*$ of $g_r$ (assuming $t'>0$ and condition (29)):

1. Set $t=t'$ and choose a small tolerance $\epsilon>0$.

2. While $|g_r(t)|>\epsilon$:

• $t\leftarrow t-g_r(t)/g_r'(t)$.

3. End while.

In our numerical simulations, the above algorithm converges within a small number of Newton iterations.
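A hedged sketch of this Newton iteration combined with the decision rule of Remark 3 follows; the function and variable names are illustrative rather than the authors' implementation:

```python
import math

def shrink_r(t_prime, mu, r, tol=1e-12, max_iter=100):
    """Proximal operator for mu*|t|^r, 0 < r < 1: a sketch of Algorithm 1
    plus the decision rule of Remark 3.  Finds the second zero of
    g_r(t) = mu*r - t'*t^(1-r) + t^(2-r) by Newton's method started at t',
    then compares the energies E_r at 0 and at that zero."""
    s, t1 = math.copysign(1.0, t_prime), abs(t_prime)
    if t1 == 0.0:
        return 0.0
    t0 = (1.0 - r) * t1 / (2.0 - r)          # minimizer of g_r on (0, inf)
    # Cases (28) and (30): no interior local minimizer, so shrink to 0.
    if mu * r >= (t1 - t0) * t0 ** (1.0 - r):
        return 0.0
    g = lambda t: mu * r - t1 * t ** (1.0 - r) + t ** (2.0 - r)
    dg = lambda t: -(1.0 - r) * t1 * t ** (-r) + (2.0 - r) * t ** (1.0 - r)
    t = t1                                    # g(t1) = mu*r > 0, right of the zero
    for _ in range(max_iter):
        step = g(t) / dg(t)
        t -= step                             # Newton stays right of the zero (convexity)
        if abs(step) < tol:
            break
    E = lambda x: mu * abs(x) ** r + 0.5 * (x - t1) ** 2
    return s * t if E(t) < E(0.0) else 0.0
```

For small $\mu$ the operator barely moves $t'$; for large $\mu$ it snaps to $0$, which is the behavior that promotes exact sparsity.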

For general $t'\in\mathbb{R}$, we have

$$\operatorname{shrink}_r(t',\mu)=\operatorname{sign}(t')\,\operatorname{shrink}_r(|t'|,\mu).$$