 # Online Learning for Distribution-Free Prediction

We develop an online learning method for prediction, which is important in problems with large and/or streaming data sets. We formulate the learning approach using a covariance-fitting methodology, and show that the resulting predictor has desirable computational and distribution-free properties: It is implemented online with a runtime that scales linearly in the number of samples; has a constant memory requirement; avoids local minima problems; and prunes away redundant feature dimensions without relying on restrictive assumptions on the data distribution. In conjunction with the split conformal approach, it also produces distribution-free prediction confidence intervals in a computationally efficient manner. The method is demonstrated on both real and synthetic datasets.

## Code Repositories

### online-learning

Online Learning for Distribution-Free Prediction

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Prediction is a classical problem in statistics, signal processing, system identification and machine learning

[1, 2, 3, 4]. The prediction of stochastic processes was pioneered in the temporal and spatial domains, cf. [5, 6, 7, 8], but the fundamental ideas were generalized to arbitrary domains. In general, the problem can be formulated as predicting the output of a process, , for a given input test point after observing a dataset of input-output pairs

 D={(x1,y1),…,(xn,yn)}.

In this paper, we are interested in learning predictor functions in scenarios where is very large or is increasing. We consider two common classes of predictors:

1. The first class consists of predictors in linear regression (Lr) form. That is, given a regressor function , the predictor

is expressed as a linear combination of its elements. This class includes ridge regression,

Lasso, and elastic net [9, 10, 11]. The regressor can be understood as a set of input features and a standard choice for it is simply .

2. The second class comprises predictors in linear combiner (Lc) form. That is, the predictor is expressed as a linear combination of all observed samples , using a model of the process. This class includes kernel smoothing, Gaussian process regression, and local polynomial regression [12, 13, 14, 15, 16]. A standard model choice is a squared-exponential kernel or covariance function for .

The weights in the Lr and Lc classes are denoted and , respectively. For a given prediction method in either class, the weights are functions of the data

. The set of possible weights is large and must be constrained in a balanced manner to avoid a high variance and bias of

, which correspond to overfitting and underfitting, respectively.

In most prediction methods, the constraints on the set of weights are controlled using a hyperparameter, which we denote

and which is learned from data. There exist several methods for doing so, including cross-validation, maximum likelihood, and weighted least-squares [1, 17, 18]. These learning methods are, however, nonconvex and not readily scalable to large , which is the target case of this paper. Moreover, in the process of quantifying the prediction uncertainty, using e.g. the bootstrap  or conformal approaches [20, 21], the learning methods compound the complexity and render such uncertainty quantification intractable for large .

In this paper, we consider a class of predictors that can equivalently be expressed in either linear regression or combiner form. In this class, constrains the weights for each dimension of individually. Thus irrelevant features can be suppressed and overfitting mitigated [22, 23, 4]. To learn the hyperparameters , we employ a covariance-fitting methodology [18, 24, 25] by generalizing a fitting criterion used in [26, 27, 28]. We extend this learning approach to a predictive setting, which results in a predictor with the following attributes:

• computable online in linear runtime,

• implementable with constant memory,

• does not suffer from local minima problems,

• enables tractable distribution-free confidence intervals,

• prunes away irrelevant feature dimensions.

These facts render the predictor particularly suitable for scenarios with large and/or growing number of data points. It can be viewed as an online, distribution-free alternative to the automatic relevance determination approach [22, 23, 4]. Our contributions include generalizing the covariance-fitting methodology to non-zero mean structures, providing connections between linear regression and combiner-type predictors, and an analysis of prediction performance when the distributional form of the data is unknown.

The remainder of the paper is organized as follows. In Section 2, we introduce the problem of learning hyperparameters. In Section 3, we highlight desirable constraints on the weights in an Lr setting, whereas Section 4 highlights these constraints in an Lc setting. The hyperparameters constrain the weights in different ways in each setting. In Section 5, the covariance-fitting based learning method is introduced and applied to an Lc predictor. The resulting computational and distribution-free properties are derived. Finally, in Section 6, the proposed online learning approach is compared with the offline cross-validation approach on a series of real and synthetic datasets.

###### Remark 1.

In the interest of reproducible research, we have made the code for the proposed method available at https://github.com/dzachariah/online-learning.

Notation: is the elementwise Hadamard product. The operation stacks all

into a single column vector, while

denotes the th column of . The sample mean is written as . The number of nonzero elements in a vector is denoted as . The Kronecker delta is denoted .

Abbreviations: Independent and identically distributed (i.i.d.).

## 2 Learning problem

Let denote a predictor with hyperparameters . For linear regression and combiner-type predictors, the choice of constrains the set of weights. For any input-output pair , the risk of the predictor is taken to be the mean-squared error 

 R≜E[|y−ˆy(x;θ)|2]. (1)

The optimal choice of

is therefore the hyperparameter that minimizes the unknown risk. A common distribution-free learning approach is cross-validation, which estimates the risk for a fixed choice of

. The risk estimate is formed by first dividing the training data into subsets and then predicting the output in one subset using data from the remaining subsets[1, ch. 7]:

 ˆR(θ)=K∑k=1nknˆE[|yi−ˆy¬k(xi;θ)|2], (2)

where is the number of samples in subset and denotes the predictor using training data from all subsets except . In this approach, the hyperparameter is learned by finding that minimizes . For one-dimensional hyperparameters, a numerical search for the minimum of (2) is feasible when is small; otherwise it may well be impractical.

An alternative learning approach is to assume that the distribution of belongs to a family that is parameterized by

. Then the hyperparameter can be learned by fitting a covariance model of the data to the empirical moments of the training data, cf.

. If additional assumptions are made about the distributional form, it is possible to formulate a complete probabilistic model of the data and learn using the asymptotically efficient maximum likelihood approach, cf. [30, 17].

The aforementioned statistical learning methods are, however, in general nonconvex and may therefore give rise to multiple minima problems [31, 32]. This becomes problematic when is multidimensional as the methods may require a careful choice of initialization and numerical search techniques [33, ch. 5]. In addition, the distributional assumptions employed in the maximum likelihood approach may lack robustness to model misspecifications .

More importantly for the data scenarios considered in this paper, these learning methods cannot readily be implemented online. That is, for a given dataset, the computational complexity of learning does not scale well with and the process must be repeated each time increases. Consequently, for each new , the predictor function must be computed afresh.

Our main goal is to develop a learning approach that is implementable online, obviates local minima problems, and has desirable distribution-free properties. Before we introduce the proposed learning approach, we introduce the hyperparameters in the context of Lr and Lc predictors, respectively.

## 3 Linear regression predictor

Lr predictors are written in the following form

 ˆy(x)=ϕ⊤(x)w, (3)

where is a -dimensional regression function and are weights. The empirical risk of a predictor can then be written as

 R(w)≜ˆE[|yi−ϕ⊤(xi)w|2]. (4)

When the empirical risk is minimized under a constraint on , irrelevant dimensions of can be suppressed which mitigates overfitting [1, 35]. For notational convenience we write the regressor matrix as

where denotes the th column.

When is large, we would ideally like to find a subset of at most relevant dimensions [1, 36, 37, 38]. We may then define the optimal weights as

 w⋆≜argminw:∥w∥0≤kR(w). (5)

Let denote corresponding optimal Lr predictor. This sparse predictor does not rely on any assumptions on the data, and is therefore robust to model misspecifications. We collect the prediction errors of the training data in the vector

 ε=col{ε1,…,εn},

where . In the subsequent analysis, we also make use of

 ε⋆≜max1≤j≤p∣∣ε⊤˜ϕj∣∣. (6)

Solving (5) is a computationally intractable problem, even for moderately sized . Thus will be taken as a reference against which any alternative can be compared by defining the prediction divergence :

 (7)

A tractable convex relaxation of the problem in (5) is the -regularized Lasso method , as defined by

 ˆw=argminwR(w)+θ∥w∥1, (8)

where is a hyperparameter that shrinks all weights toward zero in a sparsifying fashion. This Lr predictor has desirable risk-minimizing properties and can be computed online in linear runtime for any fixed [40, 41]. The corresponding divergence can be bounded per the following result.

###### Result 1.

If the hyperparameter satisfies

 θ≥2ε⋆n, (9)

then the divergence of the Lasso-based predictor from the optimal sparse predictor is bounded by

 Δ⋆≤2θ∥w⋆∥1. (10)
###### Proof.

See Appendix A.1

If the hyperparameter in (8) satisfies (9), then (10) ensures that the maximum divergence of Lasso-based predictor from the optimal Lr predictor remains bounded. Then redundant dimensions of are pruned away, which is a desirable property. The hyperparameter in (8) is typically learned offline using (2), which is a limitation for the scenarios of large and/or increasing considered herein. Moreover, the individual weights in (3) are constrained uniformly in the Lasso approach. To provide a more flexible predictor that constrains the weights individually, we consider an Lc approach next.

###### Remark 2.

For the special case in which the prediction errors of are i.i.d. zero mean Gaussian with variance , and the regressors are normalized such that , the inequality (9

) is satisfied with high probability when

for some positive constant , cf. . However, this is an infeasible choice since is unknown and must be estimated.

## 4 Linear combiner predictor

Lc predictors are written in the following form

 ˆy(x)=λ⊤(x)y, (11)

where contains the observed samples and are weights. Given any , we will find an unbiased predictor of that minimizes the conditional risk. That is, we seek the solution to

 minλR(λ|x),subject toE[y−λ⊤y|X,x]=0, (12)

where

 R(λ|x)=E[∣∣y−λ⊤y∣∣2|X,x],

and denotes all inputs in . The weights are then constrained by assuming a model of given .

Here we specify a simple model that does not rely on the distributional form of but merely constrains its moments :

• The conditional mean of is parameterized as

 E[y|x]=u⊤(x)w0, (13)

where is a given function and are unknown coefficents. In the simplest case we have an unknown constant mean, i.e., , by setting . Alternatively, one may consider an unknown affine function.

• The conditional covariance function can be expressed as

 Cov[y,y′|x,x′]=∞∑k=1θkψk(x)ψk(x′)+θ0δ(x,x′) (14)

using Mercer’s theorem , where is given and are nonnegative parameters. For a stationary process with an isotropic covariance function, we may for instance use the Fourier or Laplace operator basis [42, 43].

To enable an computationally efficient online implementation, we consider a model with a truncated sum of terms in (14). We write

 ψ(x)=col{ψ1(x),…,ψq(x)}

and the hyperparameter as

 θ≜col{θ0,θ1,…,θq},

for notational simplicity.

###### Remark 3.

The model, specified by (13) and (14), is invariant with respect to the distributional properties of the data. It therefore includes commonly assumed distributions, such as Gaussian and Student-t [33, 44].

###### Remark 4.

When is modelled as a stationary process, the Fourier or Laplace operator basis can be used in (14) to parameterize its power spectral density via . This can be interpreted as a way to parameterize the smoothness of the process .

###### Result 2.

Under model assumptions (13) and (14), the covariance properties of the training data are given by

 Σ≜Cov[y|X]=ΨΘΨ⊤+θ0Inr(x)=Cov[y,y|X,x]=ΨΘψ(x), (15)

where

###### Proof.

The result follows from an elementwise application of (14) to the training data:

###### Result 3.

The weights of the optimal Lc predictor in (11) are given by

 λ(x)=Σ−1U(U⊤Σ−1U)†u(x)+Σ−1Π⊥r(x), (16)

where

and is a projector onto .

###### Proof.

See Appendix B. ∎

Consequently, the hyperparameters in constrain the optimal weights via the model (13) and (14). If the model were correctly specified, we could in principle search for a risk-minimizing . This would, however, be intractable in an online setting.

Next, we show that it is possible to interchange Lc and Lr representations, (11) and (3), of the predictor .

###### Result 4.

The optimal Lc predictor in (11) and (16) can be written in Lr form

 ˆy(x;θ)=ϕ⊤(x)ˆw, (17)

where

 ϕ(x)=col{u(x),ψ(x)} (18)

and the weights are given by

 ˆw=argminwR(w)+θ0n∥w∥2D, (19)

where .

###### Proof.

See Appendix C. ∎

###### Remark 5.

Note that, if we impose , (19) is equivalent to the ridge regression method .

###### Remark 6.

The Lc predictor has an Lr intepretation: when (14) is formed using the Fourier or Laplace operator basis, the corresponding regressor (18) will approximate any isotropic covariance function by varying . This includes covariance functions belonging to the important Matérn class, of which the popular squared-exponential function is a special case [30, ch. 6.5]. Cf. [42, 43, 45] for regressor functions with good approximation properties. Similarly, by partitioning an arbitrary regressor into the form (18) yields an Lc interpretation of an Lr predictor. The regressor matrix is then . For instance, a standard regressor is the affine function , which can be interpreted as a constant mean with a covariance function that is quadratic in .

The elements of constrain each of the weights (19) individually, unlike the uniform approach in (8). Thus the hyperparameter in the optimal Lc predictor determines the relevance of individual features in , similar to the automatic relevance determination framework in [23, 46, 47]. While this enables a more flexible predictor, the -regularization term in (19) does not yield the sparsity property of the Lasso-based predictor. More importantly, the learning approach (2) is intractable when is multidimensional. We show how to get around this issue in the next section.

## 5 Online learning via covariance fitting

We consider a covariance-fitting approach for learning in the flexible Lc predictor. We show that using the learned hyperparameter results in a predictor with desirable, computational and distribution-free, properties.

The normalized sample covariance matrix of the training data can be written as

 ˜Z=(y−Uw0)(y−Uw0)⊤∥y−Uw0∥2,

using the mean structure (13). The assumed covariance structure is parameterized by in (15). We seek to fit to the sample covariance in the following sense:

 θ⋆=argminθ∥∥˜Z−Σ∥∥2Σ−1, (20)

using a weighted norm that penalizes correlated residuals, cf. [18, 24, 25]. To match the magnitude of the normalized sample covariance we subject the parameters in (20) to the normalization constraint

 tr{˜Z−Σ}=0. (21)
###### Result 5.

The learning problem defined by (20) is convex in and will therefore not suffer from local minima issues.

###### Proof.

See Appendix D. ∎

###### Remark 7.

Eq. (20) is a generalization of the covariance-fitting criterion in [26, 27] to the case of nonzero mean structures of the data. See also  for further connections.

### 5.1 Properties of the resulting predictor

Using (20) in the Lc predictor from (11) and (16), we obtain the following result:

###### Result 6.

The Lc predictor with a learned parameter has the following Lr form:

 ˆy(x;θ⋆)=ϕ⊤(x)ˆw (22)

where

 ˆw=argminw√R(w)+1√n∥φ⊙w∥1 (23)

and the elements of are set to

 φj={1√n∥˜ϕj∥2,j>u0,otherwise. (24)
###### Proof.

See Appendix E. ∎

Thus the covariance-based learning approach endows the predictor with a sparsity property, via the weighted -regularization term in (23), that prunes away redundant feature dimensions. Similar to (10), we can bound the associated prediction divergence.

###### Remark 8.

The methodology in (20) generalizes the approach in [26, 27], and following the cited references we call (22) the Spice (sparse iterative covariance-based estimation) predictor. Eq. (23) can be viewed as a nonuniformly weighted extension of the square-root Lasso method in .

###### Result 7.

If all elements (24) are positive and satisfy

 φj≥ε⋆√nR(w⋆), (25)

then the divergence of the Spice predictor from the optimal sparse predictor is bounded by

 Δ⋆≤2n∥φ⊙w⋆∥21+4√R(w⋆)n∥φ⊙w⋆∥1. (26)
###### Proof.

See Appendix A.2. ∎

The above results are valid even when the model (13) and (14) is misspecified. We have therefore shown a distribution-free property of the Spice predictor, which will prune away redundant feature dimensions automatically. It parallels (10) but is not dependent on any hyperparameters. Thus we can view the covariance-fitting approach as a vehicle for constructing a general Lr predictor with weights on the form (23).

In addition, the Spice predictor has computational properties that are appealing in cases with large and/or increasing , where an online implemention is desirable.

###### Result 8.

The Spice predictor (22) can be updated at each new data point in an online manner. The computations are based on

 (27)

which have fixed dimensions and are readily updated sequentially. The pseudocode for the online learning algorithm is summarized in Algorithm 1. It is initialized by setting the weights to .

###### Proof.

See Appendix F. ∎

###### Result 9.

The total runtime of the algorithm is , where is the number of cycles per sample, and its memory requirement is .

###### Proof.

The complexity of the loop in Algorithm 1 is proportional to for each sample. Regarding the memory, the number of stored variables is constant and the largest variable is the matrix . ∎

###### Remark 9.

In the given algorithm, the individual weights are updated in a fixed cyclic order. Other possible updating orders are described in a general context, in . As , however, the algorithm converges to the global minimizer (23) at each sample , irrespective of the order in which are updated [51, 52].

###### Remark 10.

Let denote the convex cost function in (23) at sample . The termination point of the cyclic algorithm for provides the starting point for minimizing the subsequent cost function . For finite , the termination point will deviate from the global minimizer (23). In practice, however, we found that can be chosen as a small integer and that even yields good prediction results when . Furthermore, we observed that the results are robust with respect to the orders in which are updated.

###### Remark 11.

In the special case when the prediction errors of are i.i.d. zero-mean Gaussian and the regressors are normalized as , the bound (26) can be ensured with high probability by multiplying the coefficients (24) by a factor , where is a positive constant. See Appendix A.2.

### 5.2 Distribution-free prediction and inference

In summary, the covariance-fitting methodology above has the following main attributes:

• avoids local minima problems in the learning process,

• results in a predictor that can be implemented online

• maximum divergence from the optimal sparse predictor can be evaluated.

Its computational and distribution-free properties also make it possible to combine this approach with the split conformal method in , which provides computationally efficient uncertainty measures for a predictor under minimal assumptions.

Suppose the input-output data consist of i.i.d. realizations from an unknown distribution

 (xi,yi)∼p(x,y).

For a generic point , we would like to construct a confidence interval for the predictor (22) with a targeted coverage. That is, find a finite interval

 C(x)≜[ˆy(x)−¯¯¯r,ˆy(x)+¯¯¯r], (28)

that covers the predicted output with a probability that reaches a prespecified level .

For simplicity, assume that is an even number and randomly split into two equally-sized datasets and . For a given targeted coverage level , the split conformal interval is constructed using the following three steps :

1. Train the Spice predictor using .

2. Predict the outputs in and compute the residuals

3. Sort the residuals and let denote the th smallest , where .

###### Result 10.

Setting as above in (28), yields an interval that covers the predicted output with a probability

 Pr{y∈C(x)}≥κ.

Thus the targeted level can be ensured. In addition, when the residuals have a continuous distribution, the probability is also bounded from above by .

###### Proof.

See [21, sec. 2.2]. ∎

###### Remark 12.

Predictive probabilistic models, such as those considered in [23, 53]

, provide credibility intervals but, in contrast to the proposed approach, lack coverage guarantees. In fact, even when the said models are correctly specified, their uncertainty is systematically underestimated after learning the hyperparameters

.

The Spice-based split conformal prediction interval in (28) provides a computationally efficient, distribution-free prediction and inference methodology.

## 6 Numerical experiments

In this section, we compare the online Spice approach developed above with the well-established offline -fold cross-validation approach for learning predictors that use a given regressor function . To balance the bias and variance of (2), as well as the associated computational burden, we follow the recommended choice of folds [1, ch. 7]. Finding the global minimizer of (2) in high dimensions is intractable, and in the following examples we restrict the discussion to the learning of predictors based on the scalable Ridge and Lasso regression methods, since these require only one hyperparameter.

Offline cross-validation was performed by evaluating (2) on a grid of 10 hyperparameter values and selecting the best value. Each evaluation requires retraining the predictor with a new hyperparameter value and the entire process is computationally intensive.

For the prediction problems below, we apply regression functions of the form (18):

 ϕ(x)=col{1,ψ(x)},

using a constant mean . For we use either a linear function, , or the Laplace operator basis due to its attractive approximation properties . For the latter choice, suppose is -dimensional and belongs to . Then the elements of are defined by

 ψk1,…,kd(x)=d∏j=11√Ljsin(πkj(xj+Lj)2Lj), (29)

where are the indices for dimension . We have that has dimension . The rectangular domain can easily be translated to any arbitrary point. When is large, we may apply the basis to each dimension separately. Then the resulting has dimension .

### 6.1 Sparse linear regression

In this experiment we study the learning of predictors under two challenging conditions: heavy-tailed noise and colinear regressors. The input is of dimension . The dataset was generated using an conditional Student-t distribution with mean

 E[y|x]=1+5x1+5x10+5x20+5x30+5x40

and variance . The input were generated using an i.i.d. degenerate zero-mean Gaussian variable with covariance matrix , where the numerical rank of and the variances are normalized by setting .

For the predictors, we let be linear and thus . We ran Monte Carlo simulations to evaluate the performances of the predictors. In the first set of experiments we estimate the risk . To clarify the comparison between the predictors, the risk is normalized by the noise variance and presented in decibel scale (dB) in Table 1. As expected for this data generating process, the sparse predictors outperform Ridge. For Spice and Lasso, the difference is notable when but is less significant as more samples are obtained.

Next, we evaluate the inferential properties by repeating the above experiments with samples. The dataset is randomly partitioned into two sets and , each of size , to produce confidence intervals as in (28). We target the coverage level , and report the average confidence interval length as well as average coverage of the interval in Table 2. Note that the probability that is nearly exactly equal to the targeted level without relying on any distributional assumptions. Thus the reported confidence intervals are accurate. Furthermore, the average interval lengths are significantly smaller for the sparse predictors compared to Ridge. The reported intervals for Spice and Lasso are similar in length, with the former being slightly smaller. The interval lengths can be compared to the dynamic range of , which has a length of approximately 60.

The runtime for the offline learning approach using Ridge or Lasso, is which is similar to the runtime for the online Spice method , where in this example. The average runtimes are reported in Table 3. While all three methods scale linearly in , using a cross-validated Lasso predictor is slower in this implementation.

### 6.2 Global ozone data

The ozone density determines the transmission of ultraviolet radiation through the atmosphere which has an important impact on biochemical processes and health. For this reason, measuring the total column ozone has been of interest to scientists for decades. In 1978, the Nimbus-7 polar orbiting satellite was launched, equipped with a total ozone mapping spectrometer. The satellite was sun synchronous, and due to the rotation of the Earth, its scan covered the entire globe in a 24 hour period with a certain spatial resolution . For illustrative purposes, we consider a set of spatial samples of ozone density from the satellite, measured in Dobson units (DU) and recorded on October 1st, 1988, cf. . The spatial coordinates were transformed from longitude and latitude using the area-preserving Mollweide map projection .

For , we use the Laplace operator basis with so that . The boundaries were set slightly larger than those given by the Mollweide projection: and , where is the radius of the Earth.

In the first experiment, the training is performed using samples and the ozone density is predicted on a fine spatial scale. Note that this dataset is nearly three orders of magnitude larger than that used in the previous example, which makes it too time consuming to implement the offline cross-validation method due to its computational requirement. Therefore we only evaluate the Spice predictor here. Fig. 1 illustrates both the training samples and the predicted ozone density

. It can be seen that the satellite data is not uniformly sampled and, moreover, it contains significant gaps. The predictions in these gapped areas appear to interpolate certain nontrivial patterns. Figure 1: Predicted ozone density ^y(x) in [DU] and training samples (dots) for two different areas. Note both the irregular and gapped sampling pattern.

In the second experiment, we evaluate the inferential properties of the Spice predictor by learning from a small random subset of the data. We use samples, or approximately of the data. The resulting predictions exhibit discernible patterns as in Fig. 2. Figure 2: Predicted ozone density ˆy(x) in [DU] and training samples (dots) for two different areas. The predictor was trained using a randomly selected dataset consisting of 5% of the original data.

For a comparison Fig. 3 zooms in on a region with gapped data, also highlighted in Fig. 1. Note that the predictions are consistent with the full data case. Figure 3: Predicted ozone density ^y(x) in [DU] and training samples (dots). Here n≈0.05n0 randomly selected training samples are used in contrast with the predictions shown in Fig. 1, which uses the full dataset n=n0.

The remaining samples are used for validating the predictor. Its risk is estimated by the out-of-sample prediction errors,

 ˆR=ˆE[|yi−^y(xi)|2],

which we translate to Dobson units by taking the square-root. The result was a root-risk of DU. In addition, the confidence interval with had a length of DU and an empirical coverage of 0.90. Both performance metrics compare well with the dynamic range of the data, which spans [179.40, 542.00] or DU. Fig. 4 also shows that the empirical distribution of the prediction errors is symmetric. These results illustrate the ability of the proposed method to learn, predict and infer in real, large-scale datasets. Figure 4: Histogram of the prediction errors for ¯n=164735 samples in [DU]. The predicted samples belong to the interval [179.40, 542.00].

## 7 Conclusions

In this paper we considered the problem of online learning for prediction problems with large and/or streaming data sets. Starting from a flexible Lc predictor, we formulated a convex covariance-fitting methodology for learning its hyperparameters. The hyperparameters constrain individual weights of the predictor, similar to the automatic relevance determination framework.

It was shown that using the learned hyperparameters results in a predictor with desirable computational and distribution-free properties. We denote it as the Spice predictor. It was implemented online with a runtime that scales linearly in the number of samples, its memory requirement is constant, it avoids local minima issues, and prunes away redundant feature dimensions without relying on assumed properties of the data distributions. In conjunction with the split conformal approach, it also produces distribution-free prediction confidence intervals. Finally, the Spice predictor performance was demonstrated on both real and synthetic datasets, and compared with the offline cross-validation approach.

In future work, we will investigate input-dependent confidence intervals, using the locally-weighted split conformal approach.

## Appendix A Derivation of bounds

### a.1 Lasso bound (10)

We expand the empirical risk of by

 R=ˆE[|(yi−ˆy⋆(xi))−(ˆy(xi)−ˆy⋆(xi))|2]=R⋆+Δ⋆−2ˆE[εi(ˆy(xi)−ˆy⋆(xi))]=R⋆+Δ⋆−2nε⊤Φ(ˆw−w⋆).

Using Hölder’s inequality along with the triangle inequality yields

 ε⊤Φ(ˆw−w⋆)≤|ε⊤Φ(ˆw−w⋆)|≤∥Φ⊤ε∥∞∥ˆw−w⋆∥1≤ε⋆(∥ˆw∥1+∥w⋆∥1).

Thus, the prediction divergence is bounded by

 Δ⋆≤R−R⋆+2ε⋆n(∥ˆw∥1+∥w⋆∥1). (30)

Next, by inserting and into the cost function of (8), we obtain

 R−R⋆≤θ(∥w⋆∥1−∥ˆw∥1) (31)

Applying this inequality along with (9) to (30), gives

 Δ⋆≤θ(∥w⋆∥1−∥ˆw∥1)+θ(∥ˆw∥1+∥w⋆∥1)=2θ∥w⋆∥1. (32)

Cf.  for the case when the unknown data generating mechanism belongs to a sparse linear model class.

### a.2 Spice bound (26)

For notational simplicity, we write

 g(w)=∥φ⊙w∥1=p∑j=1φj|wj|,

since . By inserting and into the cost function of (23), we have that

 R1/2−R1/2⋆≤1√n(g(w⋆)−g(ˆw))≤1√ng(w⋆). (33)

Multipling the inequality by and rearranging, yields

 R−R⋆≤1√n(R1/2+R1/2⋆)(g(w⋆)−g(ˆw))≤1√n(2R1/2⋆+1√ng(w⋆))≜f(w⋆)(g(w⋆)−g(ˆw)),=∑j1√nf(w⋆)φj(|w⋆,j|−|ˆwj|) (34)

where the second inequality follows from using in (33).

Inserting (34) into (30), yields

 Δ⋆≤∑j1√nf(w⋆)φj(|w⋆,j|−|ˆwj|)+2ε⋆n(|w⋆,j|+|ˆwj|). (35)

Given (25), we have that

 φj≥ε⋆√nR⋆≥2ε⋆√n(2R1/2⋆+1√ng(w⋆))=2ε⋆√nf(w⋆)

By re-arranging and dividing by , we obtain the following inequality

 1√nf(w⋆)φj≥2ε⋆n,∀j>u.

Therefore (35) can be bounded by

 Δ⋆≤∑j2√nf(w⋆)φj|w⋆,j|=2√n(2R1/2⋆+1√ng(w⋆))g(w⋆)=4√R⋆ng(w⋆)+2ng2(w⋆)

Note that in the special case when the prediction errors of are i.i.d. , and the regressors are normalized as , we have that

 √σ2(2lnp+δ)≥ε⋆√n, (36)

with probability greater than where is a positive constant, cf. [39, ch. 6.2]. In this setting is a consistent estimate of . Therefore (25) is satisfied with high probability in this case if the elements of are multiplied by a factor , so that

 φj=1√n∥˜ϕj∥2×c√2lnp+δ,

for some .

## Appendix B Optimal weights (16)

The conditional risk can be decomposed as

 (37)

using (15). Here is a constant and the bias vanishes due to the unbiasedness constraint on . This constraint, in turn, can be expressed as

 (u⊤(x)−λ⊤U)w0=0,∀w0,

or equivalently

 U⊤λ=u(x).

Note that a weak assumption in our case is , especially when and . Hence there exist that satisfy the equality above.

Thus the problem (12) can be reformulated using the convex Lagrangian

 L(λ,κ)=λ⊤Σλ−2λ⊤r+2(U⊤λ−u(x))⊤κ, (38)

where is the vector of Lagrange multipliers.

A stationary point of the Lagrangian in (38) with respect to and satisfies

 {Σλ−r+Uκ=0,U⊤λ−u(x)=0.

The first equality can be expressed as . We can insert the first equality into the second and solve for the Lagrange multipliers

 κ=−(U⊤Σ−1U)−(u(x)−U⊤Σ−1r).

Then we have

 λ=Σ−1r+Σ−1U(U⊤Σ−1U)−u(x)−Σ−1U(U⊤Σ−1U)−U⊤Σ−1r.

Noting that the orthogonal projector is given by concludes the proof.

## Appendix C Linear regression form (19)

Let be partitioned as in (18). Then

 ˆy(x)=λ⊤(x)y=ϕ⊤(x)ˆw

in combination with (16) allows us to identify the weights:

 ˆw=[ˆw0ˆw1]=[(U⊤Σ−1U)†U⊤Σ−1yΘΨ⊤Σ−1(y−Uˆw0)]. (39)

Next, define the problem

 minw0,w1,θ−10∥y−Uw0−Ψw1∥22+∥w1∥2Θ−1 (40)

for which the minimizing , namely,

 ˆw1=θ−10(θ−10ΨΨ⊤+Θ−1)−1Ψ⊤(y−Uw0)=θ−10(Θ−ΘΨ⊤Σ−1ΨΘ)Ψ⊤(y−Uw0)=θ−10ΘΨ⊤Σ−1(Σ−ΨΘΨ⊤)(y−Uw0)=ΘΨ⊤Σ−1(y−Uw0),

has the same form as in (39) with still to be determined. In the above calculation we made use of the matrix inversion lemma. Inserting back into (40) yields the concentrated cost function

 θ−10∥y−Uw0−Ψˆw1∥22+∥ˆw1∥2Θ−1=θ−10∥(IN−ΨΘΨ⊤Σ−1)(y−Uw0)∥22+∥ΘΨ⊤Σ−1(y−Uw0)∥2Θ−1=θ0(y−Uw0)⊤Σ−1Σ−1(y−Uw0)+(y−Uw0)⊤Σ−1ΨΘΨ⊤Σ−1(y−Uw0)=(y−Uw0)⊤Σ−1(y−Uw0).

Minimizing with respect to yields in (39). Finally, multiplying the cost function (40) by yield the desired form.

## Appendix D Convexity of learning problem (20)

By expanding the criterion (20) we obtain

 ˜y⊤Σ−1˜y+tr{Σ} (41)

where .

We note that in (15) is a linear function of . Thus the normalization constraint (21) can be written as a linear equality constraint

 nθ0+q∑j=1∥˜ϕj∥2θj=ρ (42)

Next, we define an auxiliary variable that satisfies

 α≥˜y⊤Σ−1˜y,

or equivalently

 [α˜y⊤˜yΣ]⪰0. (43)

Using the auxiliary variable and the definition of , we can therefore reformulate the learning problem

 minα,θα+nθ0+q∑j=1∥˜ϕj∥2θj,

which has a linear cost function and is subject to the constraints (42) and (43). This problem is recognized as a semidefinite program, i.e. it is convex. See  and .

## Appendix E Spice predictor (22)

The derivation follows in three steps. First we show that dropping the constraint (21) yields the simpler unconstrained problem in (20). Next, we prove that the fitted hyperparameters in the simplified problem yields the same predictor. Finally, we show how the corresponding Lr form arises as a consequence.

By dropping the constraint (21) we may consider the problem in the following form:

 ˆθ=argminθtr{ZΣ−1}+tr{Σ}, (44)

where is the unnormalized sample covariance matrix.

We now prove that, . Begin by defining a constant , such that at the minimum of (44). We show that is the only possible value and so both terms in (44) equal each other at the minimum.

Let , and observe that the cost (