1 Introduction
Forecasting (anticipating) future events or activities is an important problem in data science. A common task for a forecaster is to predict normal future events using past and current observations and to alert when the observed number of events deviates significantly from the predicted value. If events and activities are purely random, then there is no hope of making any meaningful future prediction. However, if events are correlated in the sense that events at time
depend on events prior to and the correlation function that describes this dependency persists throughout the observed data, then this correlation function can be used to predict future events from past and current observations. In many real datasets, the observed data is incomplete and is often contaminated with outliers and random noise. An important task is then to robustly learn the correlation function that describes the underlying normal activities and patterns from the observed data. We start with the following setup. Let be a uniform discretization of some time interval of interest, which we assume to be . Let be the number of observed events and be the expected number of events that occur in the time interval
. Observed events are determined by the conditional probability
(1) 
where
is some auxiliary variable, representing for instance the variance, that
may depend on. In the univariate case, the goal is to learn how depends on prior and , for . In other words, one is interested in finding the function such that
(2)
In this paper, we consider the case where only a partial series for some is observed, and it may contain outliers and anomalies. To tackle this problem, we propose the following minimization problem
(3) 
for some , and , to jointly learn and the complete series
via imputation.

Here, we parametrize the correlation function as
where is given by
(4) for some .

measures the sparsity of the sequence representing anomalies. Similarly, and impose sparsity on and and address the model selection issue (e.g., AIC, BIC). Both of these constraints are important for the recovery of .
In the presence of missing entries and outliers, Figure 1 shows the ability of the model (3) to recover the parameters: it plots the estimated parameters over 100 simulations in which only a fraction of the data is observed and a fraction of the observed entries is contaminated.
In what follows, we provide motivation for the proposed model. Many well-known time series models can be described by (1) and (2). For instance, consider a model [5, 18] in which each is defined as
(5) 
where follows
, a Gaussian distribution with mean
and variance . Let
(6)
then equation (5) implies . This implies
(7) 
Substituting for in (6), one obtains
(8) 
for some real-valued and and some new positive integers and . Here we assume zero boundary conditions, that is, and are identically zero whenever . Clearly, other types of boundary conditions, such as reflection or Neumann, can be used. Thus (5) can be transformed into (7) and (8), which are of the forms (1) and (2), respectively.
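As a concrete illustration of this setup, the sketch below simulates a second-order autoregressive series with Gaussian innovations and zero boundary conditions. The coefficients and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch (assumed parameters): simulate x_t = a_1 x_{t-1} + a_2 x_{t-2} + e_t
# with Gaussian innovations e_t and zero boundary conditions (x_t = 0 for t < 0).
rng = np.random.default_rng(0)
a = np.array([0.5, -0.3])   # AR coefficients (illustrative)
sigma = 1.0                 # innovation standard deviation (illustrative)
T = 500

x = np.zeros(T)
for t in range(T):
    # lagged values, zero-padded at the boundary
    lags = np.array([x[t - i] if t - i >= 0 else 0.0
                     for i in range(1, len(a) + 1)])
    x[t] = a @ lags + rng.normal(0.0, sigma)
```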
Another example is the Poisson linear autoregressive model (see
[20, 25], among others). Assuming is a nonnegative integer, (1) is given by
(9)
and (2) is given by
where and are nonnegative. This shows that the model can only detect zero or positive correlations. To overcome this drawback, the log-linear model is often used [40, 20, 25, 15], where is defined such that
Here , and are real-valued and can therefore represent negative correlations.
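As a concrete illustration of the log-linear variant, the sketch below simulates a first-order Poisson log-linear series. The intercept and lag coefficient are illustrative assumptions; shifting counts by one inside the logarithm keeps the recursion defined at zero counts while allowing a negative coefficient.

```python
import numpy as np

# Sketch (assumed parameter values): a first-order Poisson log-linear series
#   log(lambda_t) = d + b * log(y_{t-1} + 1),   y_t ~ Poisson(lambda_t).
# Because b may be negative, the model can represent negative correlations,
# unlike the linear Poisson autoregression.
rng = np.random.default_rng(1)
d, b = 0.5, -0.4            # intercept and lag coefficient (illustrative)
T = 300

y = np.zeros(T, dtype=np.int64)
lam = np.zeros(T)
lam[0] = np.exp(d)
y[0] = rng.poisson(lam[0])
for t in range(1, T):
    lam[t] = np.exp(d + b * np.log(y[t - 1] + 1.0))
    y[t] = rng.poisson(lam[t])
```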
In the case of complete observations, where we are given the series and some prior knowledge of the conditional probability condition (1), the task is to learn the optimal that maximizes the likelihood function,
(10) 
It can be shown (see for instance [27]) that the joint probability in (10) is given by
(11) 
assuming and are independent. The maximization problem in (10) is equivalent to
(12) 
With some knowledge about , the conditional probability provides the distribution of . For a single value prediction , we can solve
(13) 
In the case where the conditional probability follows a Poisson distribution, is completely determined by , and it does not depend on the auxiliary variable .
In this paper, we consider the problem of learning parametrically the underlying correlation function and given a partially observed series for some , which may be contaminated by outliers and anomalies. Given some prior knowledge about the uncertainty condition (1), the problem consists of: 1) extending to the whole series (including ) via imputation such that for all , and 2) using the complete series to learn . These two steps are performed iteratively, as they are interdependent. The interpretation of is as follows:

Suppose is normal, i.e. it can be described by and the uncertainty condition (1), then we would like to enforce .

On the other hand, if is anomalous, then we allow and let the model decide a normal value for .
Moreover, outliers and anomalies are interpreted as rare events that are supported sparsely on . Based on this interpretation, we would like the difference series to be sparse. Thus, this can be seen as solving the optimization problem:
(14) 
with the constraint that the partial series is sparse. Here, and are priors on and , respectively. Sparsity is an essential ingredient in the theory of compressed sensing [7, 8, 13]. Approximating the sparsity constraint has been a subject of great importance, as the exact sparsity problem is NP-hard. Here we consider sparsity approximations as proposed in [7, 8, 13, 10], using for . Incorporating these sparsity approximations into the energy (14), we propose the following unconstrained variational problem
(15) 
Remark 1.
Predicting with the mean is optimal. However, for and , imputing with the mean is not always optimal. Indeed, suppose only depends on previous 's, i.e., for some . Suppose also that both and are known with no outliers, that is, for all , and is the only unobserved entry. The task is to compute the optimal by using (15). This amounts to solving
If , then it is clear that only the last term in the above sum contains . This implies
On the other hand, if , then
The latter case shows that is not always an optimal value for .
The paper is organized as follows. Section 2 recalls the most relevant prior work. Section 3 describes the proposed Poisson log-linear model, with the uncertainty condition (1) following the Poisson distribution and the correlation function (2) following a log-linear function. One subproblem in solving (15) is to compute
for some fixed , and . In [32], Nie et al. proposed a method for solving this problem by computing a zero of a strictly convex function. For completeness, in section 4 we go over a similar method for computing the proximal operator for . Section 5 presents an algorithm to compute a minimizer of (15). Section 6 shows numerical results on simulated data that validate the proposed model. In Appendix A we show that the above minimization problem (15) is related to
for some fixed and .
2 Prior Work
In [23], Huber considered the classical least squares problem of learning parameters from observations obeying the relation
(16) 
Here are known coefficients and are i.i.d. Gaussian noise. For an autoregressive model, . Estimating the parameter amounts to minimizing the sum of squares
(17) 
Let , and , then the relative condition number measuring the sensitivity of with respect to perturbation of is [38]
where is an arbitrary matrix norm and is the pseudoinverse of , if it exists. Let and
be the largest and smallest singular values of
respectively; then by using , one has . If is large, a small deviation in can create a large deviation in the solution, and hence it is important to obtain an accurate estimate of . As noted in [23], outliers affect the accuracy of the estimates. Following Lecture 18 in [38], the condition number measuring the sensitivity of with respect to perturbations of is , where and . The more noise and outliers in the system, the closer is to , which leads to a large value of . Hence it is important to have good estimates of both and in the presence of missing data and outliers in the observations.
In [22, 23], Huber proposed a robust alternative to (17) by considering
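The sensitivity claim can be checked numerically. The sketch below computes the 2-norm condition number from the singular values of an illustrative random design matrix; the matrix itself is an assumption for demonstration only.

```python
import numpy as np

# Sketch: the 2-norm condition number kappa(A) = sigma_max / sigma_min,
# computed from the singular values of an (illustrative) random design matrix.
rng = np.random.default_rng(2)
A = rng.normal(size=(50, 3))
s = np.linalg.svd(A, compute_uv=False)   # singular values, descending
kappa = s[0] / s[-1]

# Agrees with NumPy's built-in 2-norm condition number.
assert np.isclose(kappa, np.linalg.cond(A, 2))
```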
(18) 
where is chosen so that it is less sensitive to large . In particular, the proposed has the form
(19) 
where is some chosen constant that is data dependent. From the definition of , we see that if , least squares is performed in (18), and hence the model respects the additive Gaussian noise assumption in (16). When , is no longer considered Gaussian but is assumed to follow a Laplacian distribution of the form
where is a constant such that . Note that the Laplacian distribution has a wider tail than the Gaussian distribution and hence accommodates large values of better. Another popular robust choice for is the least absolute deviation [2], which amounts to taking ; this also corresponds to a Laplacian distribution.
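The two regimes of Huber's loss can be written down directly. The sketch below uses the standard form of Huber's rho; the cutoff value 1.345 is a common default used purely for illustration, since the paper notes the cutoff is data dependent.

```python
import numpy as np

# Sketch of the standard Huber rho: quadratic for residuals |r| <= c (the
# Gaussian, least-squares regime) and linear for |r| > c (the Laplacian-like,
# robust regime).  c = 1.345 is a common default, used here for illustration.
def huber_rho(r, c=1.345):
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c,
                    0.5 * r**2,                 # least-squares branch
                    c * (np.abs(r) - 0.5 * c))  # linear (robust) branch
```

Note that the linear branch grows much more slowly than the quadratic one, so a single gross outlier contributes far less to the objective than it would under least squares.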
The standard LASSO [37]
amounts to learning a sparse parameter vector
via minimizing
This model is still sensitive to outliers. A modification of this model introduces an extra variable representing outliers (see [31] and references therein):
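A minimal sketch of such an outlier-augmented LASSO is given below, solved by proximal gradient descent (ISTA). The data, penalty weights, and variable names are illustrative assumptions, not from the cited works.

```python
import numpy as np

# Sketch (assumed setup): minimize over (theta, s)
#   0.5 * ||y - X @ theta - s||^2 + lam1 * ||theta||_1 + lam2 * ||s||_1,
# where s is the extra variable absorbing outliers, via ISTA.
def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
n, d = 100, 5
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.0, 0.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=n)
y[::25] += 10.0                      # plant a few gross outliers

theta, s_vec = np.zeros(d), np.zeros(n)
lam1, lam2 = 0.1, 1.0
L = np.linalg.norm(X, 2) ** 2 + 1.0  # Lipschitz bound for the joint gradient
for _ in range(2000):
    r = X @ theta + s_vec - y        # residual of the smooth term
    theta = soft(theta - (X.T @ r) / L, lam1 / L)
    s_vec = soft(s_vec - r / L, lam2 / L)

print(np.round(theta, 2))
```

The planted outliers are absorbed by the sparse variable, so the recovered coefficients stay close to the ground truth despite the contamination.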
Suppose . In [29], a robust nonparametric model is proposed:
where is a reproducing kernel Hilbert space (RKHS), which includes Sobolev spaces. A slightly different approach is proposed in [17], where each is given by , and both and are learned by solving
In connection with the proposed model (15), take and let with and
for some fixed . Note that is completely determined by and estimating the parameter vector with the LASSO prior amounts to minimizing
Here, the sparse vector representing outliers is . Note that the proposed method (15) is more general and can accommodate types of noise in the data that are not additive (multiplicative Gaussian, Poisson, negative binomial, etc.).
There are quite a few existing methods for estimating parameters in the presence of missing data; from a high-level perspective, they follow one of two approaches.
The first approach iteratively imputes missing and unobserved data in some manner and then uses the imputed and observed data to estimate the parameters. These methods include mean imputation, expectation maximization (EM)
[12], and multiple imputation [35], among others. See [30, 33, 21] for surveys of some of these methods. Matrix completion [6, 9] is a form of imputation in which missing entries of the data matrix are imputed under the assumption that the matrix has low rank. The imputation performed in Algorithm 4 for solving (15) follows the iterative approach of the EM method [12], but instead of maximizing the expectation we maximize the likelihood.
The second approach either does not impute at all or imputes only the missing entries on which the observed entries depend. For instance, in the full information maximum likelihood (FIML) method [14], only complete data points are used as inputs to estimate the parameters. Suppose we are only interested in estimating the constant with ; then all observed data points are complete. However, for , if one in every consecutive points is missing, then the set of complete data points is empty and the FIML method is not applicable. The nonnegative definite covariance method [26, 11] considers observed data points as inputs, as opposed to complete data points. Here the observed data points may depend on the missing data; however, this method sets this dependency to zero, that is, it imposes zero boundary conditions on the unobserved entries on which some of the observed entries may depend (see Section 3.2.1 in [11]). A further modification was introduced to ensure that the covariance matrix is nonnegative definite. This second approach can also be applied to our problem, and it is more appropriate when the missing entries are systematic as opposed to random. To motivate the problem, consider the case where the correlation function is a constant, that is, for all . Assuming that the observed series has no outliers, can be approximated by the mean, and there is no need to impute for all . Similarly, suppose and ; then it is only necessary to impute , as opposed to for all .
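The constant-correlation case above reduces to ordinary mean imputation, which can be sketched in a few lines; the data values are illustrative.

```python
import numpy as np

# Sketch of mean imputation for the constant-correlation case: missing
# entries (NaN) are replaced by the mean of the observed entries.
y = np.array([3.0, np.nan, 4.0, 5.0, np.nan, 4.0])   # illustrative data
observed = ~np.isnan(y)
y_imputed = np.where(observed, y, y[observed].mean())
```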
One can view as an unknown boundary condition. Thus in this second approach, the minimization problem becomes
(20) 
where consists of and all of the indices such that depends on for some , and consists of all the indices such that depends on for some . Note that the difference between (15) and (20) is that in (20) the sum is only over observed indices, as opposed to all indices (observed and unobserved). It would be interesting to compare this approach with the proposed method (15), and we leave this for future work.
3 Poisson Log-Linear Model
In this section, we consider the Poisson distribution for the conditional uncertainty condition and a log-linear correlation function to model in (15). In particular, we suppose that is conditioned only on and that it follows the Poisson distribution,
(21) 
where whenever is a nonnegative integer. Note that (21) is defined for all . We take to be a log-linear correlation function satisfying
(22) 
for some with the constraint . In other words,
where is defined as in (22). Since is completely determined by , and , we see that the series is completely determined by and the series .
The prior on is now given by
Here we assume that all the parameters are independent of each other and follow a family of exponential probability distributions
(23) 
where is chosen such that . The case corresponds to the LASSO constraint [37], and corresponds to the Bayesian bridge constraint proposed in [16, 34].
(24) 
for some , and .
Remark 2.
It is possible to minimize the energy in (24) over the set of nonnegative integer-valued series . However, this set is not convex. To overcome this non-convexity, we extend (21) to all nonnegative real-valued series and therefore use as opposed to . We remark that this extension is not the continuous version of the Poisson distribution proposed in [28, 24]. Given , the cumulative probability distribution for the continuous Poisson is defined as if and
Here
is the incomplete Gamma function. Let the probability distribution function be defined such that
It can be shown that for all ,
Thus instead of (21), one can use or
where is the largest integer that is less than or equal to .
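The incomplete-gamma construction above can be checked numerically. The sketch below verifies the standard identity that the discrete Poisson CDF equals the regularized upper incomplete gamma function, which is the anchor for any continuous extension; the rate value is illustrative.

```python
import numpy as np
from scipy.special import gammaincc
from scipy.stats import poisson

# Sketch: the discrete Poisson CDF satisfies
#   P(N <= n) = Gamma(n + 1, lam) / Gamma(n + 1) = gammaincc(n + 1, lam),
# so a continuous extension in the count argument passes through the
# discrete values at the integers.
lam = 3.7  # illustrative rate
for n in range(10):
    assert np.isclose(gammaincc(n + 1, lam), poisson.cdf(n, lam))
```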
4 Proximal Mapping for ,
One of the ingredients for computing (24) is to solve a subproblem
(25) 
for some , and . For , it was shown in [37] that is given by
(26) 
where . For , a common approach for solving (25) is to consider a regularized version via
for some . The minimizer for is approximated by where
with some initial guess . Since is nonconvex, may not be a global minimizer. In Remark 3 below, we show that for a range of , with the initial guess , the above iteration converges to a point that is not a global minimizer.
There are explicit formulas for when or [41], but for general , no closed-form expression for exists. In [32], Nie et al. proposed a method for solving by computing a zero of a strictly convex function using Newton's method. For completeness, we present here a method similar to the one proposed in [32].
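For reference, the explicit case above includes the familiar soft-thresholding operator, the proximal map of the absolute-value penalty; a minimal sketch:

```python
import numpy as np

# Soft-thresholding, the proximal operator of mu * |x| (the p = 1 case):
#   prox(y) = sign(y) * max(|y| - mu, 0).
def soft_threshold(y, mu):
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - mu, 0.0)
```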
Recall (25) and for simplicity assume and denote by the global minimizer for . First, note that . Now, for we get
Define
Note also that for , which shows that if and only if . Suppose . This implies that must satisfy
(27) 
On , we get
and
which is strictly greater than . This implies that is strictly convex on and achieves its minimal value at . Moreover, we have
This implies the following:

If , that is
(28) then has no zeros on . This shows that a global minimizer does not exist.

If , that is
(29) then has exactly two zeros and . Since and , the function is strictly increasing near zero. This implies that the zero is not a local minimizer for , and hence .

If , that is
(30) then has exactly one zero at . Since is the first zero of and is strictly increasing near , we get that a global minimizer does not exist.
Remark 3.
Cases (28) and (30) correspond to being strictly increasing on , and hence is the global minimizer. Note that in case (30), is a saddle point of . As for case (29), has one positive local minimizer at , and therefore the global minimizer . It is possible that for some . In this case, the global minimizer of is not unique.
Figure 2 shows the plots of and the corresponding with and for various choices of . The green lines correspond to the value in and in . Let . In the cases where , the global minimizer of is . However, for , the second zero of the corresponding is the global minimizer of .
From the above remark, the shrinkage operator in (25) for some is given by
(31) 
Computing the second zero of (assuming ) is fast and straightforward, since is strictly convex. For instance, one can use Newton's method as follows:
Algorithm 1.
Newton's method for computing the second zero of (assuming and ):

Set , and small.

while :

(Newton update) .

end while.
In our numerical simulation, the above algorithm converges in Newton iterations with .
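The case analysis above can be packaged into a working shrinkage routine. The sketch below is a minimal implementation under the assumption that, for y > 0, the stationarity condition reads h(x) = x + mu*p*x^(p-1) - y = 0 with h strictly convex on the positive axis; the variable names are ours, not the paper's.

```python
import numpy as np

# Sketch of the l_p shrinkage operator for 0 < p < 1 (in the spirit of Nie
# et al. [32]): minimize  f(x) = 0.5 * (x - y)**2 + mu * |x|**p.
# For y > 0 the stationarity condition is h(x) = x + mu*p*x**(p-1) - y = 0.
# h is convex on (0, inf) with minimizer x_bar; if h(x_bar) > 0 there is no
# positive root and the prox is 0, otherwise Newton's method from the right
# finds the larger root, which is compared against 0 by objective value.
def lp_prox(y, mu, p, tol=1e-12, max_iter=100):
    if y == 0.0:
        return 0.0
    s, a = np.sign(y), abs(y)
    # minimizer of h: solves 1 = mu*p*(1-p)*x**(p-2)
    x_bar = (mu * p * (1.0 - p)) ** (1.0 / (2.0 - p))
    if x_bar + mu * p * x_bar ** (p - 1.0) > a:
        return 0.0                        # h has no positive zero
    x = a                                 # start to the right of the larger root
    for _ in range(max_iter):
        h = x + mu * p * x ** (p - 1.0) - a
        if abs(h) < tol:
            break
        x -= h / (1.0 + mu * p * (p - 1.0) * x ** (p - 2.0))  # Newton step
    # keep whichever of {larger root, 0} gives the smaller objective value
    f_root = 0.5 * (x - a) ** 2 + mu * x ** p
    f_zero = 0.5 * a ** 2
    return s * x if f_root < f_zero else 0.0
```

Comparing the objective at the second zero against the objective at zero mirrors the discussion in Remark 3, where the global minimizer is either the positive local minimizer or the origin.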
For general , we have