 # Joint Probability Distribution of Prediction Errors of ARIMA

Producing probabilistic guarantees that several steps of a predicted signal follow a behavior defined in temporal logic is of rising importance in monitoring. In this paper, we derive a method to compute the joint probability distribution of the prediction errors of multiple steps based on the Autoregressive Integrated Moving Average (ARIMA) model. We cover both stationary and intrinsically stationary processes, in the univariate and multivariate cases.


## 1 Introduction

Signal Temporal Logic (STL) specifies the simultaneous behavior of a signal across different time points, while time series prediction based on the Autoregressive Integrated Moving Average (ARIMA) model only gives a guarantee at each single time point. In this paper, we prove how to calculate the joint probability distribution of the prediction errors of multiple steps.

## 2 Error of Prediction in Univariate Process

The goal is to obtain the joint distribution of

$$\{Err(X_{n+1}),\ Err(X_{n+2}),\ \dots,\ Err(X_{n+h})\}$$

### 2.1 Error of Stationary Process

Previous work concluded that for a stationary process

$$E[Err(X_{n+h})] = 0, \qquad E[Err(X_{n+h})]^2 = \gamma_X(0) - (\mathbf{a}^h_n)^\top \gamma_n(h)$$

Assuming that $Err(X_{n+h})$ is normally distributed, we have

$$Err(X_{n+h}) \sim \mathcal{N}\!\left(0,\ \gamma_X(0) - (\mathbf{a}^h_n)^\top \gamma_n(h)\right).$$

###### Definition 1

The best linear prediction of $X_{n+h}$ is:

$$P(X_{n+h} \mid X_n, X_{n-1}, \dots, X_1, 1) = a^h_0 + \sum_{i=1}^{n} a^h_i X_{n+1-i} \tag{1}$$

The optimized choices of $a^h_1, \dots, a^h_n$ satisfy

$$\sum_{j=1}^{n} a^h_j\, \gamma(i-j) = \gamma(i+h-1), \qquad i = 1, 2, \dots, n \tag{2}$$

and

$$a_0 = \mu_X\left(1 - \sum_{i=1}^{n} a_i\right) \tag{3}$$

where $\mu_X$ is the mean of the process.

###### Definition 2

$$Err(X_{n+h}) = P_n X_{n+h} - X_{n+h} \tag{4}$$

###### Definition 3

The best linear predictor $P_n$ is linear: for random variables $U, V$ and constants $\alpha_1, \alpha_2, \beta$,

$$P_n(\alpha_1 U + \alpha_2 V + \beta) = \alpha_1 P_n(U) + \alpha_2 P_n(V) + \beta$$

###### Corollary 1

The value of the best linear predictor at already observed data points follows

$$P_n\!\left(\sum_{i=1}^{n} \alpha_i X_i + \beta\right) = \sum_{i=1}^{n} \alpha_i X_i + \beta$$
###### Definition 4

If the distribution of a random vector $\mathbf{X}$ is a multivariate normal distribution, then the random vector $\mathbf{Y} = C\mathbf{X}$, where $C$ is a matrix, also follows a multivariate normal distribution. And if $\mathbf{X} \sim \mathcal{N}(\mu, \Sigma)$, then $\mathbf{Y} \sim \mathcal{N}(C\mu,\ C\Sigma C^\top)$.

###### Definition 5

If $Z_{n-q+1}, Z_{n-q+2}, \dots, Z_{n+h}$ are independent white noise, the joint distribution of

$$\begin{pmatrix} Z_{n-q+1} \\ Z_{n-q+2} \\ \vdots \\ Z_{n+h} \end{pmatrix}$$

is a multivariate normal distribution with

$$\mu_Z = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad
\Sigma_Z = \begin{pmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{pmatrix} \tag{5}$$
###### Lemma 1

Given an ARMA(p,q) process, let $\mathbf{Y}_h$ denote $(Y_{n+1}, Y_{n+2}, \dots, Y_{n+h})^\top$; then

$$\mathbf{Y}_h = C_1 \begin{pmatrix} Z_{n-q+1} \\ Z_{n-q+2} \\ \vdots \\ Z_{n+h} \end{pmatrix} \tag{6}$$

where $C_1$ is a transform matrix from $(Z_{n-q+1}, \dots, Z_{n+h})^\top$ to $\mathbf{Y}_h$:

$$C_1 = \begin{pmatrix}
\theta_q & \theta_{q-1} & \dots & \theta_1 & 1 & 0 & \dots & 0 \\
0 & \theta_q & \theta_{q-1} & \dots & \theta_1 & 1 & \dots & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & \dots & \theta_q & \theta_{q-1} & \dots & \theta_1 & 1
\end{pmatrix} \tag{7}$$

###### Proof

From the definition of the ARMA(p,q) process, for $t \ge p+1$,

$$Y_t = \theta(B) Z_t = Z_t + \theta_1 Z_{t-1} + \theta_2 Z_{t-2} + \dots + \theta_q Z_{t-q} \tag{8}$$

where $B$ is the backshift operator.
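As a concrete sketch, the banded structure of $C_1$ in equation (7) can be assembled directly from the MA coefficients. The function name `build_C1` and the ordering of `theta` as $(\theta_1, \dots, \theta_q)$ are illustrative choices, not notation from the paper:

```python
import numpy as np

def build_C1(theta, h):
    """Assemble the banded transform matrix C1 of equation (7).

    Row i carries the pattern (theta_q, ..., theta_1, 1), shifted right by
    one position per row, so that Y_{n+1+i} = sum_j C1[i, j] * Z_{n-q+1+j}.

    theta : sequence of MA coefficients (theta_1, ..., theta_q)
    h     : prediction horizon (number of rows)
    """
    q = len(theta)
    # Reverse the coefficients and append the leading 1 for Z_t itself.
    row = np.r_[list(theta)[::-1], 1.0]
    C1 = np.zeros((h, q + h))
    for i in range(h):
        C1[i, i:i + q + 1] = row
    return C1
```

For an MA(1) process with $\theta_1 = 0.5$ and $h = 3$, `build_C1([0.5], 3)` yields a $3 \times 4$ matrix whose rows are shifted copies of $(0.5,\ 1)$.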

###### Lemma 2

$$P_n \mathbf{Y}_h = C_3 C_2 \begin{pmatrix} Z_{p-q+1} \\ Z_{p-q+2} \\ \vdots \\ Z_n \end{pmatrix} \tag{9}$$

where $C_2$ is an $(n-p) \times (n-p+q)$ matrix and $C_3$ is an $h \times (n-p)$ matrix.

$$C_2 = \begin{pmatrix}
\theta_q & \theta_{q-1} & \dots & \theta_1 & 1 & 0 & \dots & 0 \\
0 & \theta_q & \theta_{q-1} & \dots & \theta_1 & 1 & \dots & 0 \\
\vdots & \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & \dots & \theta_q & \theta_{q-1} & \dots & \theta_1 & 1
\end{pmatrix} \tag{10}$$

$$C_3 = \begin{pmatrix}
a^1_{n-p} & a^1_{n-p-1} & \dots & a^1_1 \\
a^2_{n-p} & a^2_{n-p-1} & \dots & a^2_1 \\
\vdots & \vdots & \ddots & \vdots \\
a^h_{n-p} & a^h_{n-p-1} & \dots & a^h_1
\end{pmatrix} \tag{11}$$

###### Proof

$\{Y_t\}$ is a stationary process starting from $t = p+1$ with zero mean, so the prediction of $Y_{n+h}$ follows

$$P_n Y_{n+h} = a^h_0 + \sum_{i=1}^{n-p} a^h_i Y_{n+1-i} \tag{12}$$

where $a^h_0 = 0$. $C_3$ is the coefficient matrix of $(Y_{p+1}, \dots, Y_n)^\top$ when calculating $P_n \mathbf{Y}_h$, the best linear predictor of $\mathbf{Y}_h$, a stationary MA(q) process with zero mean.

###### Theorem 2.1

The covariance matrix of the prediction error of a moving average process is

$$\Sigma_{ErrY} = C_{Z \to ErrY}\, \Sigma_Z\, C^\top_{Z \to ErrY} \tag{13}$$

###### Proof

The error is defined as

$$Err(Y_{n+h}) = P_n Y_{n+h} - Y_{n+h} \tag{14}$$

From Lemma 1 and Lemma 2,

$$Err(\mathbf{Y}_h) = C_3 C_2 \begin{pmatrix} Z_{p-q+1} \\ Z_{p-q+2} \\ \vdots \\ Z_n \end{pmatrix} - C_1 \begin{pmatrix} Z_{n-q+1} \\ Z_{n-q+2} \\ \vdots \\ Z_{n+h} \end{pmatrix} \tag{15}$$

After augmenting the coefficient matrices $C_1$ and $C_2$ to $C^*_1$ and $C^*_2$ by

$$C^*_1 = \begin{pmatrix} O_{h \times (n-p)} & C_1 \end{pmatrix}, \qquad C^*_2 = \begin{pmatrix} C_2 & O_{(n-p) \times h} \end{pmatrix} \tag{16}$$

equation (15) is transformed to

$$Err(\mathbf{Y}_h) = C_3 C^*_2 \begin{pmatrix} Z_{p-q+1} \\ \vdots \\ Z_{n+h} \end{pmatrix} - C^*_1 \begin{pmatrix} Z_{p-q+1} \\ \vdots \\ Z_{n+h} \end{pmatrix} = (C_3 C^*_2 - C^*_1) \begin{pmatrix} Z_{p-q+1} \\ \vdots \\ Z_{n+h} \end{pmatrix} \tag{17}$$

Let $C_{Z \to ErrY}$ denote $C_3 C^*_2 - C^*_1$. Following Definition 4, $Err(\mathbf{Y}_h) \sim \mathcal{N}(0, \Sigma_{ErrY})$, where

$$\Sigma_{ErrY} = C_{Z \to ErrY}\, \Sigma_Z\, C^\top_{Z \to ErrY} \tag{18}$$

and the size of $\Sigma_Z$ here is $(n-p+q+h) \times (n-p+q+h)$.
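The augmentation step (16) and the sandwich form (13) translate almost directly into matrix code. A minimal sketch, assuming $\Sigma_Z = \sigma^2 I$ as in Definition 5; the helper names are illustrative:

```python
import numpy as np

def z_to_err_map(C1, C2, C3):
    """Form C_ZtoErrY = C3 @ C2* - C1* via the augmentation of equation (16).

    Expected shapes: C1 is h x (q+h), C2 is (n-p) x (n-p+q),
    C3 is h x (n-p); the augmented matrices both act on the full noise
    vector (Z_{p-q+1}, ..., Z_{n+h}) of length n-p+q+h.
    """
    h = C1.shape[0]
    n_minus_p = C2.shape[0]
    C1_star = np.hstack([np.zeros((h, n_minus_p)), C1])   # (O | C1)
    C2_star = np.hstack([C2, np.zeros((n_minus_p, h))])   # (C2 | O)
    return C3 @ C2_star - C1_star

def error_covariance(C_ZtoErrY, sigma2):
    """Sigma_ErrY = C Sigma_Z C^T with Sigma_Z = sigma2 * I (equation 13)."""
    return sigma2 * C_ZtoErrY @ C_ZtoErrY.T
```

The resulting covariance is symmetric positive semi-definite by construction, as a sandwich of $\sigma^2 I$.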

###### Theorem 2.2

$Err(\mathbf{X}_h)$ follows a multivariate normal distribution in an ARMA(p,q) process when using the best linear predictor $P_n$. The covariance matrix of the prediction error is

$$\Sigma_{ErrX} = C_{ErrY \to ErrX}\, C_{Z \to ErrY}\, \Sigma_Z\, C^\top_{Z \to ErrY}\, C^\top_{ErrY \to ErrX} \tag{19}$$

where

$$C_{ErrY \to ErrX} = \begin{pmatrix}
1 & 0 & 0 & \dots & 0 \\
c_{21} & 1 & 0 & \dots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
c_{h1} & c_{h2} & c_{h3} & \dots & 1
\end{pmatrix} \tag{20}$$

in which $c_{ij}$ is the coefficient of $Err(Y_{n+j})$ when calculating $Err(X_{n+i})$.

$c_{ij}$ can be given recursively: for any $i$ ($2 \le i \le h$),

$$c_{i,1} = \sum_{k=1}^{\min(p,\, i-1)} \phi_k\, c_{i-k,1} \tag{21}$$

and for any $i, j$ ($2 \le j \le i \le h$),

$$c_{i,j} = c_{i-1,j-1} \tag{22}$$
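The recursions (21) and (22) make $C_{ErrY \to ErrX}$ cheap to build: the first column is computed recursively from the AR coefficients, and every other column is a down-shifted copy of it. A sketch with 0-based indices and an illustrative function name:

```python
import numpy as np

def build_err_y_to_err_x(phi, h):
    """Lower-triangular matrix of equation (20) via recursions (21)-(22).

    phi : AR coefficients (phi_1, ..., phi_p)
    h   : prediction horizon
    """
    p = len(phi)
    # First column: c_{1,1} = 1, then the recursion of equation (21).
    c = np.zeros(h)
    c[0] = 1.0
    for i in range(1, h):  # 0-based row index i corresponds to row i+1
        c[i] = sum(phi[k] * c[i - 1 - k] for k in range(min(p, i)))
    # Equation (22): each column is the first column shifted down (Toeplitz).
    C = np.zeros((h, h))
    for j in range(h):
        C[j:, j] = c[:h - j]
    return C
```

For an AR(1) process with $\phi_1 = 0.5$ the first column comes out as $(1,\ 0.5,\ 0.25)$, i.e. the powers of $\phi_1$, which matches the well-known $\psi$-weights of an AR(1).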
###### Proof

From the ARIMA definition we know that, for $t > \max(p, q)$,

$$Y_t = \phi(B) X_t \tag{23}$$

Then from equation (23),

$$X_{n+h} = Y_{n+h} + \sum_{i=1}^{p} \phi_i X_{n+h-i} \tag{24}$$

From Definition 3,

$$P_n X_{n+h} = P_n\!\left(Y_{n+h} + \sum_{i=1}^{p} \phi_i X_{n+h-i}\right) = P_n Y_{n+h} + \sum_{i=1}^{p} \phi_i P_n X_{n+h-i} \tag{25}$$

From Corollary 1,

$$P_n X_{n+h} = \begin{cases}
P_n Y_{n+h} + \displaystyle\sum_{i=1}^{h-1} \phi_i P_n X_{n+h-i} + \sum_{i=h}^{p} \phi_i X_{n+h-i} & h \le p \\[2ex]
P_n Y_{n+h} + \displaystyle\sum_{i=1}^{p} \phi_i P_n X_{n+h-i} & h > p
\end{cases} \tag{26}$$

Then we have

$$\begin{aligned}
Err(X_{n+h}) &= P_n X_{n+h} - X_{n+h} \\
&= P_n Y_{n+h} + \sum_{i=1}^{h-1} \phi_i P_n X_{n+h-i} + \sum_{i=h}^{p} \phi_i X_{n+h-i} - Y_{n+h} - \sum_{i=1}^{p} \phi_i X_{n+h-i} \\
&= Err(Y_{n+h}) + \sum_{i=1}^{\min(p,\, h-1)} \phi_i\, Err(X_{n+h-i})
\end{aligned} \tag{27}$$

which can be represented in matrix form as

$$Err(\mathbf{X}_h) = C_{ErrY \to ErrX}\, Err(\mathbf{Y}_h) \tag{28}$$

where $C_{ErrY \to ErrX}$ denotes

$$\begin{pmatrix}
1 & 0 & 0 & \dots & 0 \\
c_{21} & 1 & 0 & \dots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
c_{h1} & c_{h2} & c_{h3} & \dots & 1
\end{pmatrix} \tag{29}$$

Using Definition 4, we have $Err(\mathbf{X}_h) \sim \mathcal{N}(0, \Sigma_{ErrX})$, where

$$\Sigma_{ErrX} = C_{ErrY \to ErrX}\, \Sigma_{ErrY}\, C^\top_{ErrY \to ErrX} = C_{ErrY \to ErrX}\, C_{Z \to ErrY}\, \Sigma_Z\, C^\top_{Z \to ErrY}\, C^\top_{ErrY \to ErrX} \tag{30}$$

Q.E.D.

From Definition 2,

$$X_{n+h} \in [c_1, c_2] \Leftrightarrow Err(X_{n+h}) \in [P_n X_{n+h} - c_2,\ P_n X_{n+h} - c_1] \tag{31}$$

which leads to the following conclusion:

###### Theorem 2.3

$$P(X_{n+h} \in [c_1, c_2]) = P(Err(X_{n+h}) \in [P_n X_{n+h} - c_2,\ P_n X_{n+h} - c_1])$$

Now we have a guarantee for any prediction interval of a single time step. Denoting the event $X_{n+h} \in [c_1, c_2]$ as event $E_h$, the joint probability of $\{E_1, E_2, \dots, E_h\}$ can be calculated using Theorem 2.2 when the process is stationary.
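Theorem 2.3 reduces an interval event on $X_{n+h}$ to an interval event on the error, whose joint law is the multivariate normal of Theorem 2.2. A simple Monte Carlo sketch of the resulting joint box probability (sampling the error vector rather than evaluating a multivariate normal CDF; all names are illustrative):

```python
import numpy as np

def joint_interval_probability(pred, cov, intervals, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(X_{n+i} in [c1_i, c2_i] for all i).

    pred      : length-h array of point predictions P_n X_{n+i}
    cov       : h x h error covariance Sigma_ErrX (Theorem 2.2 / 2.4)
    intervals : list of (c1, c2) bounds, one pair per step
    """
    rng = np.random.default_rng(seed)
    h = len(pred)
    # Err(X_h) ~ N(0, Sigma_ErrX); by Definition 2, X = P_n X - Err(X).
    err = rng.multivariate_normal(np.zeros(h), cov, size=n_samples)
    x = np.asarray(pred) - err
    lo = np.array([c1 for c1, _ in intervals])
    hi = np.array([c2 for _, c2 in intervals])
    # A sample counts only if every step lies inside its interval.
    inside = np.all((x >= lo) & (x <= hi), axis=1)
    return inside.mean()
```

In the one-step case with unit error variance and the interval $[-1.96, 1.96]$ around the prediction, the estimate should land near the familiar 0.95.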

### 2.2 Error of Intrinsically Stationary Process

###### Definition 6

When the d-ordered differencing of a time series is a stationary process while its (d−1)-ordered differencing is still a non-stationary process, we call this process a d-ordered intrinsically stationary process.

The d-ordered differencing is defined as:

$$\nabla^d X_t = \sum_{k=0}^{d} \binom{d}{k} (-1)^k X_{t-k} \tag{32}$$
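Equation (32) is the binomial expansion of $(1-B)^d$ applied to $X_t$. A small sketch, which can be cross-checked against `numpy.diff`; the function name is illustrative:

```python
import numpy as np
from math import comb

def nabla_d(x, d):
    """d-ordered differencing of equation (32):
    (nabla^d X)_t = sum_{k=0}^{d} C(d, k) (-1)^k X_{t-k}.

    Returns the differenced series, which is d samples shorter than x.
    """
    x = np.asarray(x, dtype=float)
    out = np.zeros(len(x) - d)
    for k in range(d + 1):
        # Slice x so that index i of the output picks up X_{(d+i)-k}.
        out += comb(d, k) * (-1) ** k * x[d - k: len(x) - k]
    return out
```

Applying it with $d = 2$ to the squares $1, 4, 9, 16, 25$ gives the constant sequence $2, 2, 2$, the same result as `np.diff(x, n=2)`.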
###### Lemma 3

For a d-ordered intrinsically stationary process, the best linear prediction of $X_{n+h}$ based on the observations $X_1, \dots, X_n$ is:

$$P_n X_{n+h} = \begin{cases}
P_n \nabla^d X_{n+h} - \displaystyle\sum_{k=h}^{d} \binom{d}{k} (-1)^k X_{n+h-k} - \sum_{k=1}^{h-1} \binom{d}{k} (-1)^k P_n X_{n+h-k} & h \le d+1 \\[2ex]
P_n \nabla^d X_{n+h} - \displaystyle\sum_{k=1}^{d} \binom{d}{k} (-1)^k P_n X_{n+h-k} & h > d+1
\end{cases} \tag{33}$$
###### Proof

From the definition of the differencing function, equation (32),

$$X_{n+h} = \nabla^d X_{n+h} - \sum_{k=1}^{d} \binom{d}{k} (-1)^k X_{n+h-k} \tag{34}$$

which means

$$P_n X_{n+h} = P_n\!\left(\nabla^d X_{n+h} - \sum_{k=1}^{d} \binom{d}{k} (-1)^k X_{n+h-k}\right) \tag{35}$$

Using Definition 3,

$$P_n X_{n+h} = P_n \nabla^d X_{n+h} - \sum_{k=1}^{d} \binom{d}{k} (-1)^k P_n X_{n+h-k} \tag{36}$$

Introducing Corollary 1: when $h \le d+1$, $X_{n+h-k}$ is already observed for $k \ge h$, so:

$$P_n X_{n+h} = P_n \nabla^d X_{n+h} - \sum_{k=h}^{d} \binom{d}{k} (-1)^k X_{n+h-k} - \sum_{k=1}^{h-1} \binom{d}{k} (-1)^k P_n X_{n+h-k} \tag{37}$$

When $h > d+1$,

$$P_n X_{n+h} = P_n \nabla^d X_{n+h} - \sum_{k=1}^{d} \binom{d}{k} (-1)^k P_n X_{n+h-k} \tag{38}$$

Q.E.D.

###### Lemma 4

The error of the $(n+h)$-th time step can be represented recursively through the errors of earlier steps:

$$Err(X_{n+h}) = Err(\nabla^d X_{n+h}) - \sum_{k=1}^{\min(h-1,\, d)} \binom{d}{k} (-1)^k Err(X_{n+h-k}) \tag{39}$$

###### Proof

When $h \le d+1$, using the definition of differencing (equation (32)):

$$X_{n+h} = \nabla^d X_{n+h} - \sum_{k=h}^{d} \binom{d}{k} (-1)^k X_{n+h-k} - \sum_{k=1}^{h-1} \binom{d}{k} (-1)^k X_{n+h-k} \tag{40}$$

and Lemma 3, the error is:

$$Err(X_{n+h}) = P_n X_{n+h} - X_{n+h} = Err(\nabla^d X_{n+h}) - \sum_{k=1}^{h-1} \binom{d}{k} (-1)^k Err(X_{n+h-k}) \tag{41}$$

since the terms with $k \ge h$ cancel, those $X_{n+h-k}$ being observed. Then for $h > d+1$, similarly, using the definition (equation (32)):

$$X_{n+h} = \nabla^d X_{n+h} - \sum_{k=1}^{d} \binom{d}{k} (-1)^k X_{n+h-k} \tag{42}$$

and Lemma 3, the error is:

$$Err(X_{n+h}) = Err(\nabla^d X_{n+h}) - \sum_{k=1}^{d} \binom{d}{k} (-1)^k Err(X_{n+h-k}) \tag{43}$$

Q.E.D.

###### Theorem 2.4

The joint distribution among the errors of different time steps is a multivariate normal distribution

$$Err(\mathbf{X}_h) \sim \mathcal{N}_h(0, \Sigma_{ErrX}) \tag{44}$$

where

$$\Sigma_{ErrX} = C_{Err\nabla X \to ErrX}\, \Sigma_{Err\nabla^d X}\, C^\top_{Err\nabla X \to ErrX} \tag{45}$$

in which $\Sigma_{Err\nabla^d X}$ is the covariance matrix of $Err(\nabla^d \mathbf{X}_h)$ given by Theorem 2.2 and

$$C_{Err\nabla X \to ErrX} = \begin{pmatrix}
1 & 0 & 0 & \dots & 0 \\
T_{21} & 1 & 0 & \dots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
T_{h1} & T_{h2} & T_{h3} & \dots & 1
\end{pmatrix} \tag{46}$$

in which $T_{ij}$ is the coefficient of $Err(\nabla^d X_{n+j})$ when calculating $Err(X_{n+i})$.

$T_{ij}$ can be given recursively: for any $i$ ($2 \le i \le h$),

$$T_{i,1} = \sum_{k=1}^{\min(i-1,\, d)} \binom{d}{k} (-1)^{k+1}\, T_{i-k,1} \tag{47}$$

and for any $i, j$ ($2 \le j \le i \le h$),

$$T_{i,j} = T_{i-1,j-1} \tag{48}$$
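Like the ARMA case, the recursions (47) and (48) give the whole matrix from its first column. A sketch with an illustrative function name; `math.comb` supplies the binomial coefficients of equation (32):

```python
import numpy as np
from math import comb

def build_err_nabla_to_err_x(d, h):
    """Lower-triangular matrix of equation (46) via recursions (47)-(48).

    d : differencing order
    h : prediction horizon
    """
    # First column: T_{1,1} = 1, then the recursion of equation (47).
    t = np.zeros(h)
    t[0] = 1.0
    for i in range(1, h):  # 0-based row index i corresponds to row i+1
        t[i] = sum(comb(d, k) * (-1) ** (k + 1) * t[i - k]
                   for k in range(1, min(i, d) + 1))
    # Equation (48): shift the first column down the diagonals (Toeplitz).
    T = np.zeros((h, h))
    for j in range(h):
        T[j:, j] = t[:h - j]
    return T
```

For $d = 1$ the first column is all ones, so the errors simply accumulate, which matches the random-walk intuition; for $d = 2$ the first column grows as $1, 2, 3, \dots$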
###### Proof

From Lemma 4,

$$\begin{pmatrix} Err(X_{n+1}) \\ Err(X_{n+2}) \\ \vdots \\ Err(X_{n+h}) \end{pmatrix} =
\begin{pmatrix}
1 & 0 & 0 & \dots & 0 \\
T_{21} & 1 & 0 & \dots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
T_{h1} & T_{h2} & T_{h3} & \dots & 1
\end{pmatrix}
\begin{pmatrix} Err(\nabla^d X_{n+1}) \\ Err(\nabla^d X_{n+2}) \\ \vdots \\ Err(\nabla^d X_{n+h}) \end{pmatrix} \tag{49}$$

where $T_{ij}$ is given by the recursions (47) and (48). Denoting

$$Err(\nabla^d \mathbf{X}_h) := \begin{pmatrix} Err(\nabla^d X_{n+1}) \\ Err(\nabla^d X_{n+2}) \\ \vdots \\ Err(\nabla^d X_{n+h}) \end{pmatrix}$$

$$C_{Err\nabla X \to ErrX} := \begin{pmatrix}
1 & 0 & 0 & \dots & 0 \\
T_{21} & 1 & 0 & \dots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
T_{h1} & T_{h2} & T_{h3} & \dots & 1
\end{pmatrix}$$

$$Err(\mathbf{X}_h) := \begin{pmatrix} Err(X_{n+1}) \\ Err(X_{n+2}) \\ \vdots \\ Err(X_{n+h}) \end{pmatrix}$$

Following Definition 4, $Err(\mathbf{X}_h) \sim \mathcal{N}(0, \Sigma_{ErrX})$, where

$$\Sigma_{ErrX} = C_{Err\nabla X \to ErrX}\, \Sigma_{Err\nabla^d X}\, C^\top_{Err\nabla X \to ErrX} \tag{52}$$

Q.E.D.

Now we have the joint guarantee for any prediction intervals over a series of time steps. We can use the conclusion of Theorem 2.4 when dealing with an intrinsically stationary process.

## 3 Error of Prediction of Multivariate Process

### 3.1 Error of Stationary Multivariate Process

All of the time series we have discussed so far are univariate; now we want to generalize our conclusions to the multivariate case. As before, we deal with the stationary case first.

###### Definition 7

The best linear prediction of $\mathbf{X}_{n+h}$ is:

$$P\!\left(\mathbf{X}_{n+h} \,\middle|\, \mathbf{X}_n, \mathbf{X}_{n-1}, \dots, \mathbf{X}_1, \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}\right) = A^h_0 + \sum_{i=1}^{n} A^h_i \mathbf{X}_{n+1-i} \tag{53}$$

in which $\{\mathbf{X}_t\}$ is a time series with $m$ variates, $A^h_0$ is an $m$-dimensional vector and $A^h_1, \dots, A^h_n$ are $m \times m$ matrices.

The optimized choices of $A^h_i$ satisfy

$$\sum_{j=1}^{n} A^h_j\, \Gamma(i-j) = \Gamma(i+h-1), \qquad i = 1, 2, \dots, n \tag{54}$$
###### Definition 8

Denote $\mathbf{Z}_t = (Z_1^\top, Z_2^\top, \dots, Z_t^\top)^\top$, where $Z_i$ is the white noise of time step $i$. Then

$$\mathbf{Z}_t \sim \mathcal{N}_{mt}(0, \Sigma_Z) \tag{55}$$

where

$$\Sigma_Z = \sigma^2 E_{mt \times mt} \tag{56}$$

and $E_{mt \times mt}$ denotes the identity matrix.
###### Lemma 5

Define a multivariate MA process, for $t > \max(p, q)$,

$$\mathbf{Y}_t = \Phi(B) \mathbf{X}_t \tag{57}$$

where each coefficient $\Phi_i$ is an $m \times m$ matrix.

Denoting $\mathbf{Y}_h = (\mathbf{Y}_{n+1}^\top, \dots, \mathbf{Y}_{n+h}^\top)^\top$, then

$$\mathbf{Y}_h = C_1 \begin{pmatrix} \mathbf{Z}_{n-q+1} \\ \mathbf{Z}_{n-q+2} \\ \vdots \\ \mathbf{Z}_{n+h} \end{pmatrix} \tag{58}$$

where $C_1$ is a transform matrix from $(\mathbf{Z}_{n-q+1}, \dots, \mathbf{Z}_{n+h})^\top$ to $\mathbf{Y}_h$:

$$C_1 = \begin{pmatrix}
\Theta_q & \Theta_{q-1} & \dots & \Theta_1 & E_{m \times m} & O_{m \times m} & \dots & O_{m \times m} \\
O_{m \times m} & \Theta_q & \Theta_{q-1} & \dots & \Theta_1 & E_{m \times m} & \dots & O_{m \times m} \\
\vdots & \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
O_{m \times m} & O_{m \times m} & \dots & \Theta_q & \Theta_{q-1} & \dots & \Theta_1 & E_{m \times m}
\end{pmatrix} \tag{59}$$

###### Proof

From the definition of the multivariate ARMA(p,q) process, for $t \ge p+1$,

$$\mathbf{Y}_t = \Theta(B) \mathbf{Z}_t = \mathbf{Z}_t + \Theta_1 \mathbf{Z}_{t-1} + \Theta_2 \mathbf{Z}_{t-2} + \dots + \Theta_q \mathbf{Z}_{t-q} \tag{60}$$

Q.E.D.
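Equation (59) is the block analogue of equation (7): each scalar entry becomes an $m \times m$ block, with $E_{m \times m}$ in place of the leading 1. A sketch assuming the $\Theta_i$ are supplied as a list of $m \times m$ arrays; the function name is illustrative:

```python
import numpy as np

def build_C1_multivariate(Thetas, h):
    """Assemble the block-banded matrix C1 of equation (59).

    Thetas : list [Theta_1, ..., Theta_q] of m x m coefficient matrices
    h      : prediction horizon (number of block rows)
    """
    q = len(Thetas)
    m = Thetas[0].shape[0]
    # Block row pattern: (Theta_q, Theta_{q-1}, ..., Theta_1, I_m).
    row = [Thetas[q - 1 - j] for j in range(q)] + [np.eye(m)]
    blocks = [[np.zeros((m, m)) for _ in range(q + h)] for _ in range(h)]
    for i in range(h):
        for j, block in enumerate(row):
            blocks[i][i + j] = block
    return np.block(blocks)  # (h*m) x ((q+h)*m)
```

With $q = 1$, $m = 2$, and $\Theta_1 = 0.5\,E_{2 \times 2}$, the result is a $4 \times 6$ matrix whose block pattern mirrors the scalar case row by row.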

###### Lemma 6

The best linear predictor of $\mathbf{Y}_h$ is a linear combination of $\mathbf{Z}_{p-q+1}, \dots, \mathbf{Z}_n$:

$$P_n \mathbf{Y}_h = C_3 C_2 \begin{pmatrix} \mathbf{Z}_{p-q+1} \\ \mathbf{Z}_{p-q+2} \\ \vdots \\ \mathbf{Z}_n \end{pmatrix} \tag{61}$$

where

$$C_2 = \begin{pmatrix}
\Theta_q & \Theta_{q-1} & \dots & \Theta_1 & E_{m \times m} & O_{m \times m} & \dots & O_{m \times m} \\
O_{m \times m} & \Theta_q & \Theta_{q-1} & \dots & \Theta_1 & E_{m \times m} & \dots & O_{m \times m} \\
\vdots & \vdots & \ddots & \ddots & \ddots & \ddots & \ddots & \vdots \\
O_{m \times m} & O_{m \times m} & \dots & \Theta_q & \Theta_{q-1} & \dots & \Theta_1 & E_{m \times m}
\end{pmatrix} \tag{62}$$

is a transform matrix from $(\mathbf{Z}_{p-q+1}, \dots, \mathbf{Z}_n)^\top$ to $(\mathbf{Y}_{p+1}, \dots, \mathbf{Y}_n)^\top$.

And

$$C_3 = \begin{pmatrix}
A^1_{n-p} & A^1_{n-p-1} & \dots & A^1_1 \\
A^2_{n-p} & A^2_{n-p-1} & \dots & A^2_1 \\
\vdots & \vdots & \ddots & \vdots \\
A^h_{n-p} & A^h_{n-p-1} & \dots & A^h_1
\end{pmatrix} \tag{63}$$

is the coefficient matrix of $(\mathbf{Y}_{p+1}, \dots, \mathbf{Y}_n)^\top$ when calculating $P_n \mathbf{Y}_h$, the best linear predictor of $\mathbf{Y}_h$, where $\{\mathbf{Y}_t\}$ is a multivariate stationary MA(q) process with zero mean.

The $A^h_i$ are given by equation (54).

###### Theorem 3.1

The joint distribution among the errors of different time steps of $\{\mathbf{X}_t\}$ is a multivariate normal distribution

 Err(