Forecasting (anticipating) future events or activities is an important problem in data science. A common task for a forecaster is to predict normal future events from past and current observations and to alert when the observed number of events deviates significantly from the predicted value. If events and activities are purely random, then there is no hope of making any meaningful prediction. However, if events are correlated, in the sense that events at a given time depend on prior events and the correlation function describing this dependency persists throughout the observed data, then this correlation function can be used to predict future events from past and current observations. In many real datasets, the observed data is incomplete and often contaminated with outliers and random noise. An important task is then to robustly learn, from the observed data, the correlation function that describes the underlying normal activities and patterns. We start with the following setup.
Let be a uniform discretization of some time interval of interest which we assume to be . Let be the number of observed events and be the expected number of events that occur in the time interval
. Observed events are determined by the conditional probability
is some auxiliary variable, representing for instance the variance, that may depend on. In the univariate case, the goal is to learn how depends on prior and , for . In other words, one is interested in finding the function such that
In this paper, we consider the case where only a partial series for some is observed, and it may contain outliers and anomalies. To tackle this problem, we propose the following minimization problem
for some , and , to jointly learn and the complete series
Here, we consider a parametric form of the correlation function, defined as
where is given by
for some .
measures the sparsity of the sequence representing anomalies. Similarly, and impose sparsity on and , and sidestep the model selection issue (e.g., AIC, BIC). Both of these constraints are important for the recovery of .
Figure 1: Plots of the estimated parameters and for 100 simulations when of the data is observed and of the observed entries are contaminated.
, a normal Gaussian distribution with mean and variance . Let
then equation (5) implies . This implies
Substituting for in (6), one obtains
for some real-valued and and some new positive integers and . Here we assume zero boundary conditions, that is and are identically zero whenever . Clearly, other types of boundary conditions such as reflection or Neumann can be used. Thus (5) can be transformed into (7) and (8), which are (1) and (2) respectively.
Another example is the Poisson linear autoregressive model (see [20, 25], among others). Assuming is a non-negative integer, (1) is given by
and (2) is given by
where and are non-negative. This shows that the model can only detect zero or positive correlations. To overcome this drawback, the log-linear model is often used [40, 20, 25, 15] where is defined such that
Here , and are real-valued and therefore can represent negative correlations.
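As a concrete illustration, a log-linear Poisson autoregression of this type can be simulated directly. The sketch below assumes an order-(1,1) dependence with made-up coefficients `d`, `a`, `b` (not taken from the text), where `log(lam_t) = d + a*log(x_{t-1} + 1) + b*log(lam_{t-1})`:

```python
import math
import random

# Minimal simulation of a log-linear Poisson autoregression of order (1,1):
#   log(lam_t) = d + a * log(x_{t-1} + 1) + b * log(lam_{t-1}).
# The coefficients d, a, b are illustrative assumptions, not from the text.
def simulate_log_linear(n, d=0.3, a=0.4, b=0.3, seed=0):
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's multiplication method; adequate for the moderate rates here.
        limit, k, prod = math.exp(-lam), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        return k

    lam, x = 1.0, 1
    xs, lams = [], []
    for _ in range(n):
        lam = math.exp(d + a * math.log(x + 1) + b * math.log(lam))
        x = poisson(lam)
        xs.append(x)
        lams.append(lam)
    return xs, lams
```

Note that the exponential keeps the rate strictly positive even when `a` or `b` is negative, which is exactly the advantage of the log-linear form over the linear model noted above.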
In the case of complete observations, where we are given the series and some prior knowledge of the conditional probability condition (1), the task is to learn the optimal that maximizes the likelihood function,
assuming and are independent. The maximization problem in (10) is equivalent to
With some knowledge about , the conditional probability provides the distribution of . For a single value prediction , we can solve
In the case where the conditional probability follows a Poisson distribution, is completely determined by , and it does not depend on the auxiliary variable .
In this paper, we consider the problem of learning parametrically the underlying correlation function and given a partially observed series for some , which may be contaminated by outliers and anomalies. Given some prior knowledge about the uncertainty condition (1), the problem consists of: 1) extending to the whole series (including ) via imputation such that for all , and 2) using the completed series to learn . These two steps are carried out iteratively, as they are interdependent. The interpretation of is as follows:
Suppose is normal, i.e. it can be described by and the uncertainty condition (1), then we would like to enforce .
On the other hand, if is anomalous, then we allow and let the model decide a normal value for .
Moreover, outliers and anomalies are interpreted as rare events that are sparsely supported on . Based on this interpretation, we would like the difference series to be sparse. This can thus be cast as the optimization problem:
with the constraint that the partial series is sparse. Here, and are priors on and respectively. Sparsity is an essential ingredient in the theory of compressed sensing [7, 8, 13]. Approximating the sparsity constraint has been a subject of great importance, as the exact sparsity problem is NP-hard. Here we consider the sparsity approximations proposed in [7, 8, 13, 10], using for . Incorporating these sparsity approximations into the energy (14), we propose the following unconstrained variational problem
Predicting with the mean is optimal. However, for and , imputing with the mean is not always optimal. Indeed, suppose depends only on previous ’s, i.e., for some . Suppose also that both and are known with no outliers, that is, for all , and is the only unobserved entry. The task is to compute the optimal using (15). This amounts to solving
If , then it is clear that only the last term in the above sum contains . This implies
On the other hand, if , then
The latter case shows that is not always an optimal value for .
The paper is organized as follows. Section 2 recalls the prior work most relevant to this paper. Section 3 describes the proposed Poisson log-linear model, in which the uncertainty condition (1) follows the Poisson distribution and the correlation function (2) is log-linear. One subproblem in solving (15) is to compute
for some fixed , and . In , Nie et al. proposed a method for solving this problem by computing a zero of a strictly convex function. For completeness, in Section 4 we go over a similar method for computing the proximal operator for . Section 5 presents an algorithm for computing a minimizer of (15). Section 6 shows numerical results on simulated data that validate the proposed model. In Appendix A, we show that the minimization problem (15) is related to
for some fixed and .
2 Prior Work
In , Huber considered the classical least-squares problem of learning parameters from observations obeying the relation
Here are known coefficients and are iid random Gaussian noise. For an autoregressive model, . Estimating the parameter amounts to minimizing the sum of squares
Let , and , then the relative condition number measuring the sensitivity of with respect to perturbation of is 
where is an arbitrary matrix norm and is the pseudo inverse of if exists. Let and
be the largest and smallest singular values of , respectively. Then, using , one has . If is large, a small deviation in can create a large deviation in the solution , and hence it is important to obtain an accurate estimate of . As noted in , outliers affect the accuracy of the estimates. Following Lecture 18 in , the condition number measuring the sensitivity of with respect to perturbations of is , where and . The more noise and outliers in the system, the closer is to . This leads to a large value of . Hence it is important to have good estimates of both and in the presence of missing data and outliers in the observations.
where is chosen so that it is less sensitive to large . In particular, the proposed has the form
where is some chosen constant that is data-dependent. From the definition of , we see that if , least squares is performed in (18), and hence the model respects the additive Gaussian noise assumption in (16). When , is no longer considered Gaussian but is assumed to follow a Laplacian distribution of the form
where is a constant such that . Note that the Laplacian distribution has a wider tail than the Gaussian distribution and hence accommodates large better. Another popular robust choice for is the least absolute deviation , which amounts to having , corresponding again to a Laplacian distribution.
The standard LASSO 
amounts to learning a sparse parameter vector via minimizing
This model is still sensitive to outliers. A modification of this model introduces an extra variable representing outliers (see  and references therein):
Suppose . In , a robust nonparametric model is proposed:
where is a reproducing kernel Hilbert space (RKHS) which includes Sobolev spaces. A slightly different approach is proposed in , where each is given by , and both and are learned via solving
In connection with the proposed model (15), take and let with and
for some fixed . Note that is completely determined by and estimating the parameter vector with the LASSO prior amounts to minimizing
Here, the sparse vector representing outliers is . Note that the proposed method (15) is considerably more general and can accommodate other types of noise in the data that are not additive (multiplicative Gaussian, Poisson, negative binomial, etc.).
There are quite a few existing methods for estimating the parameters in the presence of missing data; from a high-level perspective, they fall into the following two approaches.
The first approach iteratively imputes missing and unobserved data in some manner and then uses the imputed and observed data to estimate the parameters. These methods include mean imputation, expectation maximization (EM), and multiple imputation , among others. See [30, 33, 21] for a survey of some of these methods. Matrix completion [6, 9] is a form of imputation in which missing entries of the data matrix are imputed under the assumption that the matrix has low rank. The imputation performed in Algorithm 4 for solving (15) follows the iterative approach of the EM method ; but instead of maximizing the expectation, we maximize the likelihood.
The second approach either does not impute at all, or imputes only the missing entries on which the observed entries depend. For instance, in the full information maximum likelihood (FIML) method , only complete data points are used as inputs to estimate the parameters. Suppose we are only interested in estimating the constant with ; then all observed data points are complete. However, for , if one in every consecutive points is missing, then the set of complete data points is empty and the FIML method is not applicable. The non-negative definite covariance method [26, 11] considers observed data points as inputs, as opposed to complete data points. Here the observed data points may depend on the missing data; this method sets that dependency to zero, i.e., it imposes zero boundary conditions on the unobserved entries on which some of the observed entries may depend (see Section 3.2.1 in ). A further modification was introduced to ensure the non-negative definite condition for the covariance matrix. This second approach can also be applied to our problem, and it is more appropriate when the missing entries are systematic as opposed to random. To motivate the problem, consider the case where the correlation function is a constant, that is, for all . Assuming that the observed series has no outliers, can be approximated by the mean, and there is no need to impute for all . Similarly, suppose and ; then it is only necessary to impute , as opposed to for all . One can view as an unknown boundary condition. Thus, in this second approach, the minimization problem becomes
where consists of and all indices such that depends on for some , and consists of all indices such that depends on for some . Note that the difference between (15) and (20) is that in (20) the sum is only over observed indices, as opposed to all indices (observed and unobserved). It would be interesting to compare this approach with the proposed method (15); we leave this for future work.
3 Poisson Log-Linear Model
In this section, we consider the Poisson distribution for the conditional uncertainty condition and use a log-linear correlation function to model in (15). In particular, we suppose that is conditioned only on and that it follows the Poisson distribution,
where whenever is a nonnegative integer. Note that (21) is defined for all . We consider to be a log-linear correlation function satisfying
for some with the constraint . In other words,
where is defined as in (22). Since is completely determined by , and , we see that the series is completely determined by and the series .
The prior on is now given by
Here we assume all the parameters are independent of each other and follow a family of exponential probability distributions
for some , and .
It is possible to minimize the energy in (24) over the set of nonnegative integer-valued series . However, this set is not convex. To overcome this non-convexity, we extend (21) to all nonnegative real-valued series and therefore use as opposed to . We remark that this extension is not the continuous version of the Poisson distribution proposed in [28, 24]. Given , the cumulative probability distribution of the continuous Poisson is defined as if and
is the incomplete Gamma function. Let the probability distribution function be defined such that
It can be shown that for all ,
Thus instead of (21), one can use or
where is the largest integer that is less than or equal to .
4 Proximal Mapping for ,
One of the ingredients for computing (24) is to solve a subproblem
for some , and . For , it was shown in  that is given by
where . For , a common approach for solving (25) is to consider a regularized version via
for some . The minimizer for is approximated by where
with some initial guess . Since is nonconvex, may not be a global minimizer. In Remark 3 below, we show that for a range of , with the initial guess , the above iteration converges to a point that is not a global minimizer.
There are explicit formulas for when or , but for general , no closed-form expression for exists. In , Nie et al. proposed a method for solving by computing a zero of a strictly convex function using Newton's method. For completeness, we present here a method similar to the one proposed in .
Recall (25) and for simplicity assume and denote by the global minimizer for . First, note that . Now, for we get
Note also that for , which shows that if and only if . Suppose . This implies that must satisfy
On , we get
which is strictly greater than . This implies that is strictly convex on and achieves its minimal value at . Moreover, we have
This implies the following:
If , that is
then has no zeros on . This shows that a positive global minimizer does not exist.
If , that is
then has exactly two zeros and . Since and , the function is strictly increasing near zero. This implies that the zero is not a local minimizer for , and hence .
If , that is
then has exactly one zero at . Since is the first zero of and is strictly increasing near , we conclude that a positive global minimizer does not exist.
Cases (28) and (30) correspond to being strictly increasing on , and hence is the global minimizer. Note that in case (30), is a saddle point of . As for case (29), has one positive local minimizer at , and therefore the global minimizer . It is possible that for some . In this case the global minimizer of is not unique.
Figure 2 shows the plots of and the corresponding with and for various choices of . The green lines correspond to the value in and in . Let . In the cases where , the global minimizer for is . However, for , the second zero of the corresponding is the global minimizer for .
From the above remark, the shrinkage operator in (25) for some is given by
Computing the second zero of (assuming ) is fast and straightforward, since is strictly convex. For instance, one can use Newton's method as follows:
Newton's method for computing the second zero of (assuming and ):
Set , and small.
In our numerical simulation, the above algorithm converges in Newton iterations with -.
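The whole case analysis above can be folded into one scalar routine. The sketch below (variable names and tolerances are ours; p strictly between 0 and 1, input y > 0) finds the second zero of the derivative by Newton's method and then compares energies with the candidate at zero:

```python
import math

# Sketch of the shrinkage operator for lam * |x|^p with 0 < p < 1 and
# input y > 0. Following the analysis above: locate the second zero x2
# of g(x) = x - y + lam*p*x^(p-1) by Newton's method, then compare the
# energies F(x2) and F(0).
def prox_lp(y, lam, p, tol=1e-12, max_iter=100):
    if y <= 0.0:
        raise ValueError("this sketch assumes y > 0")
    g = lambda x: x - y + lam * p * x ** (p - 1.0)
    dg = lambda x: 1.0 + lam * p * (p - 1.0) * x ** (p - 2.0)
    # x_bar minimizes g; g is strictly convex and increasing past x_bar.
    x_bar = (lam * p * (1.0 - p)) ** (1.0 / (2.0 - p))
    if g(x_bar) > 0.0:
        return 0.0                 # g has no zero: the minimizer is 0
    x = y                          # y lies to the right of the second zero
    for _ in range(max_iter):
        step = g(x) / dg(x)
        x -= step                  # monotone Newton descent toward x2
        if abs(step) < tol:
            break
    F = lambda t: 0.5 * (t - y) ** 2 + lam * t ** p
    return x if F(x) <= F(0.0) else 0.0
```

Since g is strictly convex and the starting point y lies to the right of the second zero, the Newton iterates decrease monotonically to it, consistent with the fast convergence reported above.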
For general , we have