I Introduction
I-A Motivation
Modern machine learning is driven in large part by training nonlinear statistical models [1] on huge data sets held in cloud storage via stochastic optimization methods. This recipe has yielded advances across fields such as vision, acoustics, speech, and countless others. Of course, universal approximation theorems [2], the underpinning for the predictive capability of methods such as deep neural networks and nonparametric methods, have been known for some time. However, when accumulating enough data is difficult or if the observations arise from a nonstationary process, as in robotics [3], energy [4], and communications [5], online training is of interest. Specifically, for problems with dynamics, one would like the predictive power of universality while stably and quickly adapting to new data.
Bayesian and nonparametric methods [6] meet these specifications in the sense that they possess the universal approximation capability and may stably adapt to new data. In particular, tools under this umbrella such as kernel regression [7], Gaussian Processes, and particle filters/Monte Carlo methods [8] can stably (in a Lyapunov sense) incorporate new information as it arrives via functional variants of convex optimization or probabilistic updates via Bayes’ Rule. These facts motivate applying Bayesian and nonparametric methods to streaming problems.
Alas, Bayesian and nonparametric methods are limited by the curse of dimensionality. Typically, these tools require storing a probability density estimate that retains all past training examples together with a vector of weights that denotes the importance of each sample. Moreover, as time becomes large, their complexity approaches infinity. This bottleneck has led to a variety of approximation schemes, both offline and online, to ensure memory is under control. Typically one fixes some memory budget, and projects all additional training examples/particles onto the likelihood of the density estimate spanned by current points [9]. Alternatives include memory-efficient approximations of the associated functional [10] or probabilistic model [11].
I-B Significance
Until recently, a significant gap in the literature of memory reduction design for Bayesian and nonparametric methods existed: how to provide a tunable tradeoff between statistical consistency (see footnote 1) and memory? Recently, the perception that nonparametric methods do not scale to streaming settings has been upended [12, 13]. The key insight of these methods is that one may fix an error-neighborhood around the current density estimate, and project it onto a nearby subspace with lower memory, thus sparsifying the sample path online and allowing the complexity of the density representation to be flexible and problem-dependent [14]. The performance guarantees of this approach echo Shannon-Nyquist sampling for nonparametric statistics. That is, the radius of the error neighborhood, which we henceforth call the compression budget, determines the asymptotic radius of convergence, i.e., bias, as well as the complexity of the resulting density function.
Footnote 1: Consistency is the statistical version of optimality, and means that an empirical estimator converges to its population analogue, i.e., is asymptotically unbiased.
I-C Impact
Proposed approaches allow stable and memory-efficient online training of universal statistical models, a foundation upon which many theoretical and practical advances may be built. For instance, in autonomous control of robotic platforms, it is well-known that optimal control based on physical models [15] is limited to the domain for which the model remains valid. When used beyond this domain, a gap emerges between an a priori hypothesized physical model and the ground truth. Alternatively, such problems can be cast within the rubric of sequential learning where one seeks to regress on the model mismatch. Initial efforts toward this end have appeared recently [16], but substantial enhancements are possible, for instance, by increasing the descriptive capacity of the class of learned models and the adaptivity of autonomous behaviors to changing environments.
A similar problem arises in communications systems, where nonlinear adaptive filters can be used to mitigate signal saturation in amplifiers and detectors. Current techniques are limited to linear models for, e.g., equalization in receivers, noise cancellation, and beamforming in radar and sonar systems. This limitation arises because current communications channel state estimators [5], modulators, and localizers need to adapt with low latency, and currently only linear methods meet this requirement [17]. However, nonparametric and Bayesian methods would enable identification and tracking of modulation type from symbol estimates that exhibit more complex interaction effects, channel state estimation in the presence of nonlinearities arising from the environment, and localization problems where the entire distribution over possible source locations is of interest, due to, e.g., a desire for confidence guarantees.
On the theory side, we note that efforts to develop optimization tools for multi-agent online learning have been mostly restricted to cases where agents learn linear statistical models [18], which exclude state-of-the-art machine learning methods. However, since kernel learning may be formulated as a stochastic convex problem over a function space, standard strategies, i.e., distributed gradient [19] and primal-dual methods [20], may be derived. These schemes allow individual agents to learn, in a decentralized online manner, a memory-efficient nonlinear interpolator as good as that of a centralized clairvoyant agent with all information in advance. Many enhancements are possible via recent advances in decentralized optimization [21], such as those which relax conditions on the network [22] and amount of communications [23].
Apart from multi-agent systems, the general approach of sparsifying as much as possible while ensuring a descent property holds may be applied to other nonparametric statistical tools such as Gaussian processes and Monte Carlo methods [24, 14] by varying the ambient space and compression metric. Due to the need for brevity, we defer such a discussion to future work.
The rest of the paper is organized as follows. In Section II, we formulate the problem of supervised online training with kernels, define memory-affordable kernel regression algorithms for solving it in Section II-A, and extend this framework to online risk-aware learning in Section II-B. In Section III, we spotlight online nonparametric learning methodologies in decentralized settings. In Section IV, we conclude with a discussion of implications and open problems.
II Online Supervised Learning with Kernels
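We begin by formalizing the problem of supervised learning as expected risk minimization (ERM), which is the foundation of filtering, prediction, and classification. Then, we detail how the problem specializes when the estimator admits a kernel parameterization. In ERM, we seek to learn a regressor so as to minimize a loss quantifying the merit of a statistical model averaged over data. Each point in the data set constitutes an input-output pair $(\mathbf{x}_n, y_n)$ which is an i.i.d. realization from the stationary distribution of the random pair $(\mathbf{x}, y) \in \mathcal{X} \times \mathcal{Y}$ with $\mathcal{X} \subset \mathbb{R}^p$. We consider problems where samples arrive sequentially in perpetuity, a setting applicable to signal processing, communication, and visual perception.
Hereafter, we quantify the quality of an estimator function $f \in \mathcal{H}$ by a convex loss function $\ell : \mathcal{H} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{H}$ is a hypothesized function class. In particular, if we evaluate $f$ at feature vector $\mathbf{x}$, its merit is encapsulated by $\ell(f(\mathbf{x}), y)$. Then, we would like to select $f$ to have optimal model fitness on average over data, i.e., to minimize the statistical loss $L(f) := \mathbb{E}_{\mathbf{x}, y}[\ell(f(\mathbf{x}), y)]$. We focus on minimizing the regularized loss $R(f) := L(f) + (\lambda/2)\|f\|_{\mathcal{H}}^2$ as
$$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}} R(f) = \operatorname*{argmin}_{f \in \mathcal{H}} \ \mathbb{E}_{\mathbf{x}, y}\big[\ell(f(\mathbf{x}), y)\big] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2 \qquad (1)$$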
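The preceding expression specializes to linear regression when $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$ for some $\mathbf{w} \in \mathbb{R}^p$, but here we focus on the case that the function class $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS); for background, see [7]. An RKHS consists of functions that have a basis expansion in terms of elements of $\mathcal{X}$ through a kernel $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined over inner products of data:
$$\text{(i)} \ \langle f, \kappa(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = f(\mathbf{x}), \qquad \text{(ii)} \ \mathcal{H} = \overline{\operatorname{span}\{\kappa(\mathbf{x}, \cdot)\}} \quad \text{for all } \mathbf{x} \in \mathcal{X} \qquad (2)$$
where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the Hilbert inner product for $\mathcal{H}$. The kernel function is henceforth assumed positive semidefinite: $\sum_{m,n} c_m c_n \kappa(\mathbf{x}_m, \mathbf{x}_n) \geq 0$ for all finite collections of points $\mathbf{x}_n \in \mathcal{X}$ and scalars $c_n$. Example kernels include the polynomial $\kappa(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + b)^c$ and Gaussian $\kappa(\mathbf{x}, \mathbf{x}') = \exp\{-\|\mathbf{x} - \mathbf{x}'\|_2^2 / 2c^2\}$, where $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$.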
The term reproducing comes from replacing $f$ by $\kappa(\mathbf{x}', \cdot)$ in (2)(i), which yields $\langle \kappa(\mathbf{x}', \cdot), \kappa(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \kappa(\mathbf{x}, \mathbf{x}')$. This reproducing property permits writing the inner product of the feature maps of two distinct vectors $\mathbf{x}$ and $\mathbf{x}'$ as a kernel evaluation, $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}} = \kappa(\mathbf{x}, \mathbf{x}')$. Here, $\phi(\mathbf{x}) = \kappa(\mathbf{x}, \cdot)$ denotes the feature map of vector $\mathbf{x}$. The preceding expression, the kernel trick, allows us to define arbitrary nonlinear relationships between data without ever computing $\phi$ [7]. Moreover, (2)(ii) states that any function $f \in \mathcal{H}$ is a linear combination of kernel evaluations at the vectors $\mathbf{x} \in \mathcal{X}$, which, for empirical versions of (1), implies that the Representer Theorem holds [7]:
$$f(\mathbf{x}) = \sum_{n=1}^{N} w_n \kappa(\mathbf{x}_n, \mathbf{x}) \qquad (3)$$
Here, $\{w_n\}_{n=1}^{N}$ is a collection of weights and $N$ is the sample size. We note that the number of terms in the sum (3), hereafter referred to as the model order, coincides with the sample size, and hence grows unbounded as empirical approximations of (1) approach their population counterpart. This complexity bottleneck is a manifestation of the curse of dimensionality in nonparametric learning. Decades of research have attempted to overcome this issue through the design of memory-reduction techniques, most of which are not guaranteed to yield the minimizers of (1). In the following subsection, we detail a stochastic approximation algorithm which explicitly trades off model fitness with memory requirements, and provide applications.
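To make the kernel expansion (3) concrete, the following minimal Python sketch evaluates a function parameterized by a dictionary of points and a weight vector under a Gaussian kernel. The names `gaussian_kernel` and `evaluate`, the bandwidth value, and the random data are illustrative assumptions, not part of the development above.

```python
import numpy as np

def gaussian_kernel(X, Y, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of X (n, p) and the rows of Y (m, p)."""
    X = np.atleast_2d(X)
    Y = np.atleast_2d(Y)
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def evaluate(dictionary, weights, x, bandwidth=1.0):
    """Evaluate f(x) = sum_n w_n * kappa(x_n, x) at one or more query points x."""
    return gaussian_kernel(dictionary, x, bandwidth).T @ weights

# Hypothetical data: N = 5 training points in p = 2 dimensions.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(5, 2))
w = rng.normal(size=5)

K = gaussian_kernel(X_train, X_train)      # kernel (Gram) matrix
f_at_train = evaluate(X_train, w, X_train) # equals K @ w
print(K.shape, f_at_train.shape)
```

Each evaluation costs one kernel computation per dictionary element, which is why the model order, and not the input dimension alone, governs the memory and time footprint of the learned function.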
II-A Trading Off Consistency and Complexity
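In this section, we derive an online algorithm to solve the problem in (1) through a functional variant of stochastic gradient descent (FSGD) [25]. Then, we detail how memory reduction may be attained through subspace projections [12]. We culminate by illuminating tradeoffs between memory and consistency both in theory and practice. We begin by noting that FSGD applied to the statistical loss (1) when the feasible set is an RKHS (2) takes the form [25]
$$f_{t+1} = (1 - \eta\lambda) f_t - \eta \nabla_f \ell(f_t(\mathbf{x}_t), y_t) = (1 - \eta\lambda) f_t - \eta\, \ell'(f_t(\mathbf{x}_t), y_t)\, \kappa(\mathbf{x}_t, \cdot) \qquad (4)$$
where $\eta > 0$ is the constant step size and $\lambda$ is the regularization parameter of (1). The equality uses the chain rule, the reproducing property of the kernel (2)(i), and the definition $\ell'(f(\mathbf{x}), y) := \partial \ell(f(\mathbf{x}), y) / \partial f(\mathbf{x})$ as in [25]. With initialization $f_0 = 0$, the Representer Theorem (3) allows one to rewrite (4) with dictionary and weight updates as
$$\mathbf{X}_{t+1} = [\mathbf{X}_t, \ \mathbf{x}_t], \qquad \mathbf{w}_{t+1} = [(1 - \eta\lambda)\mathbf{w}_t, \ -\eta\, \ell'(f_t(\mathbf{x}_t), y_t)] \qquad (5)$$
We define the matrix of training examples $\mathbf{X}_t = [\mathbf{x}_1, \ldots, \mathbf{x}_{t-1}]$ as the kernel dictionary, the kernel matrix as $\mathbf{K}_{\mathbf{X}_t, \mathbf{X}_t}$, whose entries are kernel evaluations $[\mathbf{K}_{\mathbf{X}_t, \mathbf{X}_t}]_{m,n} = \kappa(\mathbf{x}_m, \mathbf{x}_n)$, and the empirical kernel map $\boldsymbol{\kappa}_{\mathbf{X}_t}(\cdot) = [\kappa(\mathbf{x}_1, \cdot), \ldots, \kappa(\mathbf{x}_{t-1}, \cdot)]^\top$ as the vector of kernel evaluations.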
At each time $t$, the new sample $\mathbf{x}_t$ enters into the current dictionary $\mathbf{X}_t$ to obtain $\mathbf{X}_{t+1}$, and hence the model order $M_t = t - 1$, i.e., the number of points in dictionary $\mathbf{X}_t$, tends to $\infty$ as $t \to \infty$ when data is streaming. Existing strategies for online memory reduction include dropping past points when weights fall below a threshold [25], projecting functions onto fixed-size subspaces via spectral criteria [27] or the Hilbert norm [10], probabilistic kernel approximations [11], and others. A commonality to these methods is a capitulation on convergence in pursuit of memory reduction. In contrast, one may balance these criteria by projecting the FSGD sequence onto subspaces adaptively constructed from past data $\{\mathbf{x}_u\}_{u \leq t}$.
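The growth of the model order under the unprojected dictionary and weight updates in (5) can be seen in a short Python sketch. We assume, purely for illustration, the square loss $\tfrac{1}{2}(f(\mathbf{x}) - y)^2$, a Gaussian kernel, and a synthetic sinusoidal data stream; the function names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_kernel(D, x, bandwidth=0.5):
    """Kernel evaluations kappa(d_n, x) for every row d_n of the dictionary D."""
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2.0 * bandwidth**2))

def fsgd(xs, ys, eta=0.5, lam=1e-3, bandwidth=0.5):
    """Unprojected functional SGD for the (assumed) loss 0.5 * (f(x) - y)**2."""
    D = np.empty((0, xs.shape[1]))  # kernel dictionary, one row per stored sample
    w = np.empty(0)                 # coefficient vector
    for x, y in zip(xs, ys):
        # Prediction of the current iterate; the initial function is zero.
        pred = gaussian_kernel(D, x, bandwidth) @ w if len(w) else 0.0
        loss_grad = pred - y        # derivative of the square loss w.r.t. f(x)
        # Shrink old weights by (1 - eta * lam) and append the new coefficient.
        w = np.append((1.0 - eta * lam) * w, -eta * loss_grad)
        # Every sample is appended, so the dictionary grows by one per step.
        D = np.vstack([D, x])
    return D, w

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=(50, 1))
ys = np.sin(3.0 * xs[:, 0])
D, w = fsgd(xs, ys)
print("samples processed:", len(xs), "model order:", len(w))
```

In this sketch the dictionary holds one point per processed sample, which is the unbounded growth that the subspace projections discussed next are designed to prevent.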
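Model Order Control via Subspace Projections To control the complexity growth, we propose approximating the function sequence [cf. (4)] by projecting it onto subspaces $\mathcal{H}_{\mathbf{D}} \subseteq \mathcal{H}$ of dimension $M$ spanned by the elements in the dictionary $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_M] \in \mathbb{R}^{p \times M}$, i.e., $\mathcal{H}_{\mathbf{D}} = \{f : f(\cdot) = \sum_{n=1}^{M} w_n \kappa(\mathbf{d}_n, \cdot) = \mathbf{w}^\top \boldsymbol{\kappa}_{\mathbf{D}}(\cdot)\} = \operatorname{span}\{\kappa(\mathbf{d}_n, \cdot)\}_{n=1}^{M}$, where we denote $\boldsymbol{\kappa}_{\mathbf{D}}(\cdot) = [\kappa(\mathbf{d}_1, \cdot), \ldots, \kappa(\mathbf{d}_M, \cdot)]^\top$, and $\mathbf{K}_{\mathbf{D}, \mathbf{D}}$ represents the kernel matrix obtained for dictionary $\mathbf{D}$. To ensure model parsimony, we enforce the number of elements in $\mathbf{D}_t$ to satisfy $M_t \ll t$.
To introduce the projection, note that (4) is such that $f_{t+1} \in \mathcal{H}_{\mathbf{X}_{t+1}}$. Instead, we use a dictionary $\mathbf{D}_{t+1}$ whose columns are chosen from $\{\mathbf{x}_u\}_{u \leq t}$. To be specific, we augment (4) by projection:
$$f_{t+1} = \operatorname*{argmin}_{f \in \mathcal{H}_{\mathbf{D}_{t+1}}} \Big\| f - \big( (1 - \eta\lambda) f_t - \eta \nabla_f \ell(f_t(\mathbf{x}_t), y_t) \big) \Big\|_{\mathcal{H}}^2 := \mathcal{P}_{\mathcal{H}_{\mathbf{D}_{t+1}}}\Big[ (1 - \eta\lambda) f_t - \eta \nabla_f \ell(f_t(\mathbf{x}_t), y_t) \Big] \qquad (6)$$
where we denote by $\mathcal{H}_{\mathbf{D}_{t+1}}$ the subspace associated with dictionary $\mathbf{D}_{t+1}$, and the right-hand side of (6) defines the projection operator. Parsimonious Online Learning with Kernels (POLK) (Algorithm 1) projects FSGD onto the subspaces stated in (6). Initially the function is $f_0 = 0$. Then, at each step, given sample $(\mathbf{x}_t, y_t)$ and step size $\eta$, we take an unconstrained FSGD iterate $\tilde{f}_{t+1} = (1 - \eta\lambda) f_t - \eta \nabla_f \ell(f_t(\mathbf{x}_t), y_t)$, which admits the parametric representation $\tilde{\mathbf{D}}_{t+1}$ and $\tilde{\mathbf{w}}_{t+1}$. These parameters are then fed into KOMP with approximation budget $\epsilon$, such that $(f_{t+1}, \mathbf{D}_{t+1}, \mathbf{w}_{t+1}) = \text{KOMP}(\tilde{f}_{t+1}, \tilde{\mathbf{D}}_{t+1}, \tilde{\mathbf{w}}_{t+1}, \epsilon)$.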
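Table I
| | Diminishing | Constant |
| Step size/learning rate | $\eta_t = \mathcal{O}(1/t)$ | $\eta_t = \eta > 0$ |
| Sparse approximation budget | $\epsilon_t = \eta_t^2$ | $\epsilon = \mathcal{O}(\eta^{3/2})$ |
| Regularization condition | $\eta_t < 1/\lambda$ | $\eta < 1/\lambda$ |
| Convergence result | $f_t \to f^\star$ a.s. | $\liminf_t \|f_t - f^\star\|_{\mathcal{H}} = \mathcal{O}(\sqrt{\eta})$ a.s. |
| Model order guarantee | None | Finite |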
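Parameterizing the Projection The projection may be computed in terms of data and weights. To select dictionary $\mathbf{D}_{t+1}$ for each $t$, we use greedy compression, specifically, a destructive variant of kernel orthogonal matching pursuit (KOMP) [28] with budget $\epsilon$. The input function to KOMP is $\tilde{f}$ with model order $\tilde{M}$, parameterized by its kernel dictionary $\tilde{\mathbf{D}} \in \mathbb{R}^{p \times \tilde{M}}$ and coefficient vector $\tilde{\mathbf{w}} \in \mathbb{R}^{\tilde{M}}$. The algorithm outputs $f$ with a lower model order. We use $\tilde{f}_{t+1}$ to denote the function updated by an unprojected FSGD step, whose coefficients and dictionary are denoted as $\tilde{\mathbf{w}}_{t+1}$ and $\tilde{\mathbf{D}}_{t+1}$. At each stage, the dictionary element $\mathbf{d}_j$ is removed which contributes the least to the Hilbert-norm error $\min_{f \in \mathcal{H}_{\mathbf{D} \setminus \{\mathbf{d}_j\}}} \|\tilde{f} - f\|_{\mathcal{H}}$ of the original function $\tilde{f}$, when dictionary $\mathbf{D}$ is used. Doing so yields an approximation of $\tilde{f}$ inside an $\epsilon$-neighborhood $\|f - \tilde{f}\|_{\mathcal{H}} \leq \epsilon$. Then, the “energy” of removed points is reweighted onto those remaining: $\mathbf{w}_{t+1} = \mathbf{K}_{\mathbf{D}_{t+1}, \mathbf{D}_{t+1}}^{-1} \mathbf{K}_{\mathbf{D}_{t+1}, \tilde{\mathbf{D}}_{t+1}} \tilde{\mathbf{w}}_{t+1}$.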
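The greedy compression step described above can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the implementation of [12] or [28]: it uses a Gaussian kernel, stores dictionary points as rows, recomputes every candidate projection from scratch, and relies on a least-squares solve so that duplicate dictionary points do not cause a singular system. The names `gram`, `project`, and `destructive_komp` are hypothetical.

```python
import numpy as np

def gram(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth**2))

def project(D_tilde, w_tilde, keep, bandwidth=1.0):
    """Project the input function onto the span of the kept dictionary points.

    Returns the projected weights and the Hilbert-norm approximation error.
    """
    D = D_tilde[keep]
    K_dd = gram(D, D, bandwidth)
    K_dt = gram(D, D_tilde, bandwidth)
    K_tt = gram(D_tilde, D_tilde, bandwidth)
    # Least squares also handles rank-deficient kernel matrices.
    w = np.linalg.lstsq(K_dd, K_dt @ w_tilde, rcond=None)[0]
    err2 = w_tilde @ K_tt @ w_tilde - 2.0 * w @ K_dt @ w_tilde + w @ K_dd @ w
    return w, np.sqrt(max(err2, 0.0))

def destructive_komp(D_tilde, w_tilde, eps, bandwidth=1.0):
    """Remove dictionary points one at a time while the error stays within eps."""
    keep = list(range(len(D_tilde)))
    w = np.asarray(w_tilde, dtype=float).copy()
    while len(keep) > 1:
        trials = [project(D_tilde, w_tilde, keep[:j] + keep[j + 1:], bandwidth)
                  for j in range(len(keep))]
        j_min = int(np.argmin([err for _, err in trials]))
        if trials[j_min][1] > eps:
            break                  # any further removal would exceed the budget
        w = trials[j_min][0]       # reweight onto the remaining points
        del keep[j_min]
    return D_tilde[keep], w

# Toy input: two identical dictionary points plus one distinct point.
D_in = np.array([[0.0], [0.0], [1.0]])
w_in = np.array([1.0, 1.0, 0.5])
D_out, w_out = destructive_komp(D_in, w_in, eps=1e-3)
print(D_out.ravel(), w_out)
```

In the toy input the redundant copy is removed at no cost in Hilbert norm and its weight is absorbed by the surviving copy, whereas removing either distinct point would exceed the budget, so compression stops at model order two.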
Algorithm 1 converges under both diminishing and constant step-size regimes [12]. When the learning rate satisfies $\eta_t < 1/\lambda$, where $\lambda > 0$ is the regularization parameter, and is attenuating such that $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$, with approximation budget satisfying $\epsilon_t = \eta_t^2$, the iterates converge to the optimum $f^\star$ almost surely. Practically speaking, because the budget then vanishes, this means that asymptotically the iterates generated by Algorithm 1 must have unbounded complexity to converge exactly. More interestingly, however, when a constant step size $\eta$ is chosen to satisfy $\eta < 1/\lambda$, then under a constant budget $\epsilon$ which satisfies $\epsilon = \mathcal{O}(\eta^{3/2})$, the function sequence converges to a neighborhood of the optimal $f^\star$ [cf. (1)] and has finite complexity. This tradeoff is summarized in Table I.
Experiments We discuss experiments of Algorithm 1 for multi-class classification on the Gaussian Mixtures dataset (Fig. 3(a)) as in [29] for the case of kernel SVM, and compare with several alternatives: budgeted stochastic gradient descent (BSGD) [9], a fixed subspace projection method, which requires a maximum model order a priori; Dual Space Gradient Descent (Dual) [30], a hybrid of FSGD with random features; nonparametric budgeted SGD (NPBSGD) [31], which combines a fixed subspace projection with random dropping; Naive Online Regularized Min. Alg. (NORMA) [25], which truncates the memory to finite-horizon objectives; and Budgeted Passive-Aggressive (BPA), which merges incoming points via nearest neighbor [32].
In Figure 2 we plot the empirical results of this experiment. POLK outperforms many of its competitors by an order of magnitude in terms of objective evaluation (Fig. 2(a)) and test-set error rate (Fig. 2(b)). The notable exception is NPBSGD which comes close in terms of objective evaluation but less so in terms of test error. Moreover, because the marginal feature density contains modes, the optimal model order is , which is approximately learned by POLK for (i.e., ) (Fig. 2(c)). Several alternatives initialized with this parameter, on the other hand, do not converge. Moreover, POLK favorably trades off accuracy and sample complexity, reaching below error after only samples. The final decision surface of this trial of POLK is shown in Fig. 3(b), where it can be seen that the selected kernel dictionary elements concentrate at modes of the class-conditional density. Next, we discuss modifications of Algorithm 1 that avoid overfitting via notions of risk from operations research.
II-B Compositional and Risk-Aware Learning with Kernels
In this section, we explain how augmentations of (1) may incorporate risk-awareness into learning, motivated by bias-variance tradeoffs. In particular, given a particular sample path of data, one may learn a model overly sensitive to the peculiarities of the available observations, a phenomenon known as overfitting. To avoid overfitting offline, bootstrapping (data augmentation), cross-validation, or sparsity-promoting penalties are effective [33]. However, these approaches do not apply to streaming settings. For online problems, one must augment the objective itself to incorporate uncertainty, a topic extensively studied in econometrics [34]. Specifically, one may use coherent risk as a surrogate for error variance [35], which permits the derivation of online algorithms that do not overfit, and are attuned to distributions with ill-conditioning or heavy tails, as in interference channels or visual inference with less-than-laboratory levels of cleanliness.
To clarify the motivation for risk-aware augmentations of (1), we first briefly review the bias-variance (estimation-approximation) tradeoff. Suppose we run some algorithm and obtain estimator $\hat{f}$. Then one would like to make the performance of $\hat{f}$ approach the Bayes optimal $\hat{y}^\star = \operatorname*{argmin}_{\hat{y} \in \mathcal{Y}^{\mathcal{X}}} \mathbb{E}_{\mathbf{x}, y}[\ell(\hat{y}(\mathbf{x}), y)]$, where $\mathcal{Y}^{\mathcal{X}}$ denotes the space of all functions $\hat{y} : \mathcal{X} \to \mathcal{Y}$ that map data $\mathbf{x}$ to target variables $y$. The performance gap between $\hat{f}$ and $\hat{y}^\star$ decomposes as [33]
$$\mathbb{E}\big[\ell(\hat{f}(\mathbf{x}), y)\big] - \mathbb{E}\big[\ell(\hat{y}^\star(\mathbf{x}), y)\big] = \Big( \mathbb{E}\big[\ell(\hat{f}(\mathbf{x}), y)\big] - \min_{f \in \mathcal{H}} \mathbb{E}\big[\ell(f(\mathbf{x}), y)\big] \Big) + \Big( \min_{f \in \mathcal{H}} \mathbb{E}\big[\ell(f(\mathbf{x}), y)\big] - \mathbb{E}\big[\ell(\hat{y}^\star(\mathbf{x}), y)\big] \Big) \qquad (7)$$
by adding and subtracting (1) (ignoring regularization). Thus, the discrepancy decomposes into two terms: the estimation error, or bias, and the approximation error, or variance (see footnote 2). The bias is minimized as the number of data points goes to infinity. On the other hand, universality implies the variance is null, but in practice, due to inherent unknown properties of data and hyperparameter choices, it is positive. To avoid overfitting in the online setting, we propose accounting for error variance via a dispersion measure $\mathbb{D}[\ell(f(\mathbf{x}), y)]$, weighted by a coefficient $\nu \geq 0$, which yields a variant of supervised learning that accounts for approximation error [36]
$$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}} \ \mathbb{E}_{\mathbf{x}, y}\big[\ell(f(\mathbf{x}), y)\big] + \nu\, \mathbb{D}\big[\ell(f(\mathbf{x}), y)\big] \qquad (8)$$
For example, the semivariance $\mathbb{D}[\ell(f(\mathbf{x}), y)] = \mathbb{E}\{(\ell(f(\mathbf{x}), y) - \mathbb{E}[\ell(f(\mathbf{x}), y)])_+^2\}$ is commonly used. Alternatives are the $p$-th order semideviation or the conditional value-at-risk (CVaR), which quantifies the loss function at tail-end quantiles of its distribution.
Footnote 2: Approximation error is more general than variance, but for, e.g., the quadratic loss, the former reduces to the latter plus the noise of the data distribution. We conflate these quantities for ease of explanation, but they are different.
The choice of the weight $\nu$ on the dispersion term scales the emphasis on bias or variance error in (8), and its solutions, as compared with (1), are attuned to outliers and higher-order moments of the data distribution. Thus, the solution of (8) may be equipped for classification with significant class overlap, or regression (nonlinear filtering) with dips in signal-to-noise ratio. To derive solutions to (8), we begin by noting that this is a special case of compositional stochastic programming [37], given as
$$\min_{f \in \mathcal{H}} \ \mathbb{E}_{\theta}\Big[ \ell_{\theta}\big( \mathbb{E}_{\xi}[h_{\xi}(f(\xi))] \big) \Big] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2 \qquad (9)$$
where $\xi$ and $\theta$ denote random data samples, $h_{\xi}$ is the inner function, and $\ell_{\theta}$ is the outer function.
Due to nested expectations, SGD no longer applies, and hence alternate tools are required, namely, stochastic quasi-gradient (SQG) methods. Recently the behavior of SQG has been characterized in detail [37]; see references therein. Here we spotlight the use of such tools for nonparametric learning by generalizing SQG to RKHS, and applying matching pursuit-based projections [28]. Such an approach is the focus of [13], which provides a tunable tradeoff between convergence accuracy and required memory, again echoing Shannon-Nyquist sampling in nonparametric learning, but has the additional virtue that it admits an error variance which is controllable by the parameter $\nu$ in (8). We begin by noting that applying stochastic gradient descent to (9) requires access to the stochastic gradient
$$\nabla_f h_{\xi_t}(f(\xi_t))\, \ell'_{\theta_t}\big( \mathbb{E}_{\xi}[h_{\xi}(f(\xi))] \big) \qquad (10)$$
However, the preceding expression at training examples $(\xi_t, \theta_t)$ is not available due to the expectation involved in the argument of $\ell'_{\theta_t}$. A second realization of $\xi$ is required to estimate the inner expectation. Instead, we use SQG, which defines a scalar sequence $g_t$ to track the instantaneous functions $h_{\xi_t}(f_t(\xi_t))$ evaluated at sample pairs $(\xi_t, \theta_t)$:
$$g_{t+1} = (1 - \beta_t)\, g_t + \beta_t\, h_{\xi_t}(f_t(\xi_t)) \qquad (11)$$
with the intent of estimating the expectation $\mathbb{E}_{\xi}[h_{\xi}(f_t(\xi))]$. In (11), $\beta_t$ is a scalar learning rate chosen from the unit interval $(0, 1)$ which may be either diminishing or constant. Then, we define a function sequence $f_t \in \mathcal{H}$, initialized as null $f_0 = 0$, that we sequentially update using SQG:
$$f_{t+1} = (1 - \lambda\alpha_t) f_t - \alpha_t \nabla_f h_{\xi_t}(f_t(\xi_t))\, \ell'_{\theta_t}(g_{t+1}) = (1 - \lambda\alpha_t) f_t - \alpha_t\, h'_{\xi_t}(f_t(\xi_t))\, \ell'_{\theta_t}(g_{t+1})\, \kappa(\xi_t, \cdot) \qquad (12)$$
where $\alpha_t$ is a step-size parameter chosen as diminishing or constant, and the equality makes use of the chain rule and the reproducing kernel property (2)(i). Through the Representer Theorem (3), we then have parametric updates on the coefficient vector $\mathbf{w}$ and kernel dictionary $\mathbf{U}$:
$$\mathbf{U}_{t+1} = [\mathbf{U}_t, \ \xi_t], \qquad \mathbf{w}_{t+1} = [(1 - \lambda\alpha_t)\mathbf{w}_t, \ -\alpha_t\, h'_{\xi_t}(f_t(\xi_t))\, \ell'_{\theta_t}(g_{t+1})] \qquad (13)$$
In (13), the kernel dictionary $\mathbf{U}_t$ parameterizing function $f_t$ is a matrix which stacks past realizations of $\xi$, and the coefficients $\mathbf{w}_t$ are the associated scalars in the kernel expansion (3), which are updated according to (13). The function update of (12) implies that the complexity of computing $f_t$ is $\mathcal{O}(t)$, due to the fact that the number of columns in $\mathbf{U}_t$, or model order $M_t$, is $t - 1$, and thus is unsuitable for streaming settings. This computational cost is an inherent challenge of extending [37] to nonparametric kernel functions. To address this, one may project (12) onto low-dimensional subspaces in a manner similar to Algorithm 1; see [12]. The end result, (12) operating in tandem with the projections defined in the previous section, is what we call Compositional Online Learning with Kernels (COLK), and is summarized as Algorithm 2. Its behavior trades off convergence accuracy and memory akin to Table I, and is studied in [13].
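The two-timescale structure of SQG, an auxiliary scalar that tracks the inner expectation alongside a functional descent step, may be illustrated with the Python sketch below. It makes several assumptions that are ours rather than the source's: the objective is the mean of the square loss plus a weight `nu` times its variance, the quasi-gradient scales the loss derivative by `1 + 2 * nu * (loss - g)`, the kernel is Gaussian, and no subspace projection is applied. All names and parameter values are hypothetical.

```python
import numpy as np

def gaussian_kernel(D, x, bandwidth=0.5):
    """Kernel evaluations kappa(d_n, x) for every row d_n of the dictionary D."""
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2.0 * bandwidth**2))

def sqg_variance_aware(xs, ys, alpha=0.1, beta=0.5, lam=1e-3, nu=0.1, bandwidth=0.5):
    """Unprojected functional SQG for mean-plus-variance of the square loss (a sketch)."""
    D = np.empty((0, xs.shape[1]))  # kernel dictionary, one row per stored sample
    w = np.empty(0)                 # coefficient vector
    g = 0.0                         # auxiliary scalar tracking the expected loss
    g_history = []
    for x, y in zip(xs, ys):
        pred = gaussian_kernel(D, x, bandwidth) @ w if len(w) else 0.0
        loss = 0.5 * (pred - y) ** 2
        # Auxiliary update: exponential average of the instantaneous loss.
        g = (1.0 - beta) * g + beta * loss
        # Quasi-gradient: the running average g replaces the unknown expected loss.
        coeff = (pred - y) * (1.0 + 2.0 * nu * (loss - g))
        w = np.append((1.0 - alpha * lam) * w, -alpha * coeff)
        D = np.vstack([D, x])
        g_history.append(g)
    return D, w, np.array(g_history)

xs = np.linspace(-1.0, 1.0, 40)[:, None]
ys = np.sin(3.0 * xs[:, 0])
D, w, g_history = sqg_variance_aware(xs, ys)
print("model order:", len(w), "final tracked loss:", g_history[-1])
```

Because only one sample is consumed per step, the running average `g` stands in for the second realization that an unbiased stochastic gradient would require. As in the unprojected FSGD case, the dictionary grows by one point per sample.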
Experiments Now we discuss a specialization to online regression, where model fitness is determined by the square loss function $\ell(f(\mathbf{x}), y) = (f(\mathbf{x}) - y)^2$, where $\mathbf{x} \in \mathcal{X}$ and $y \in \mathbb{R}$. Due to the bias-variance tradeoff (7), we seek to minimize both the bias and variance of the loss. To account for the variance, we quantify risk by the $p$-th order central moments:
$$\mathbb{D}\big[\ell(f(\mathbf{x}), y)\big] = \sum_{p=2}^{P} \mathbb{E}\Big\{ \big( \ell(f(\mathbf{x}), y) - \mathbb{E}[\ell(f(\mathbf{x}), y)] \big)^p \Big\} \qquad (14)$$
For the purposes of the experiments, we select $P = 4$. We remark that the dispersion measure in (14), which then corresponds to the variance, skewness, and kurtosis of the loss distribution, is nonconvex. We can always convexify the dispersion measure via positive projections (semideviations); however, for simplicity, we omit the positive projections in experiments.
We evaluate COLK on data whose distribution has higher-order effects, and compare its test accuracy against existing benchmarks that minimize only bias. We inquire as to which methods overfit: COLK (Algorithm 2), as compared with BSGD [9], NPBSGD [31], and POLK [12]. We consider different training sets from the same distribution. To generate the synthetic dataset regression outliers, we used the function as the original function (a reasonable template for phase retrieval), and the targets are perturbed by additive zero-mean Gaussian noise. First we generate samples of the data, and then select as the test data set. From the remaining samples, we select at random to generate different training sets. We run COLK over these training sets with parameter selections: a Gaussian kernel with bandwidth , step size , , with parsimony constant , variance coefficient , and mini-batch size of . Similarly, for POLK we use and with parsimony constant . We fix the kernel type and bandwidth, and the parameters that define comparator algorithms are hand-tuned to optimize performance with the restriction that their complexity is comparable. We run these algorithms over different training realizations and evaluate their test accuracy as well as standard deviation.
The advantage of minimizing the bias as well as the variance may be observed in Fig. 4(a), which plots the learned function for POLK and COLK for two training data sets. The function learned by POLK varies from one training set to another, while COLK is robust to this change. In Fig. 4(b) we plot the model order of the function sequence defined by COLK, and observe that it stabilizes over time regardless of the presence of outliers. These results substantiate the convergence behavior spotlighted in [13], which also contains additional experimental validation on real data.
III Decentralized Learning Methods
In domains such as autonomous networks of robots or smart devices, data is generated at the network edge. In order to gain the benefits of laws of large numbers in learning, aggregation of information is required. However, transmission of raw data over the network may be neither viable nor secure, motivating the need for decentralized processing. Here, the goal is for each agent, based on local observations, to learn online an estimator as good as a centralized one with access to all information in advance. To date, optimization tools for multi-agent online learning are predominantly focused on cases where agents learn linear statistical models [18]. However, since kernel learning may be formulated as a stochastic convex problem over a function space, standard strategies, i.e., distributed gradient [19] and primal-dual methods [20], may be derived. Doing so is the focus of this section, leveraging the projection proposed in previous sections.
To formulate decentralized learning, we first define some key quantities. Consider an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $V = |\mathcal{V}|$ nodes and $|\mathcal{E}|$ edges. Each $i \in \mathcal{V}$ represents an agent in the network, who observes a distinct observation sequence $(\mathbf{x}_{i,t}, y_{i,t})$ and quantifies merit according to their local loss $\ell_i(f(\mathbf{x}_i), y_i)$. Based on their local data streams, they would like to learn as well as a clairvoyant agent which has access to global information for all time:
$$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}} \sum_{i \in \mathcal{V}} \Big( \mathbb{E}_{\mathbf{x}_i, y_i}\big[\ell_i(f(\mathbf{x}_i), y_i)\big] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2 \Big) \qquad (15)$$
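Decentralized learning with consensus constraints: Under the hypothesis that all agents seek to learn a common global function, i.e., agents’ observations are uniformly relevant to others, one would like to solve (15) in a decentralized manner. To do so, we define local copies $f_i$ of the global function $f$, and reformulate (15) as
$$\{f_i^\star\}_{i \in \mathcal{V}} = \operatorname*{argmin}_{\{f_i\} \subset \mathcal{H}} \sum_{i \in \mathcal{V}} \Big( \mathbb{E}_{\mathbf{x}_i, y_i}\big[\ell_i(f_i(\mathbf{x}_i), y_i)\big] + \frac{\lambda}{2}\|f_i\|_{\mathcal{H}}^2 \Big) \quad \text{s.t.} \quad f_i = f_j \ \text{ for all } (i, j) \in \mathcal{E} \qquad (16)$$
Imposing functional constraints of the form $f_i = f_j$ in (16) is challenging due to the fact that it involves computations independent of data, and hence may operate outside the realm of the Representer Theorem (3). To alleviate this issue, we approximate consensus in the form $f_i(\mathbf{x}_i) = f_j(\mathbf{x}_i)$, which is imposed for $(i, j) \in \mathcal{E}$ in expectation over $\mathbf{x}_i$. Thus agents are incentivized to agree regarding their decisions, but not entire functions. This modification of consensus yields a penalty functional amenable to efficient computations, culminating in updates for each $i \in \mathcal{V}$ of the form:
$$f_{i,t+1} = (1 - \eta\lambda) f_{i,t} - \eta \Big[ \ell_i'(f_{i,t}(\mathbf{x}_{i,t}), y_{i,t}) + c \sum_{j : (i,j) \in \mathcal{E}} \big( f_{i,t}(\mathbf{x}_{i,t}) - f_{j,t}(\mathbf{x}_{i,t}) \big) \Big] \kappa(\mathbf{x}_{i,t}, \cdot) \qquad (17)$$
where $c > 0$ is a penalty coefficient that controls constraint violation, with exact consensus attained as $c \to \infty$. Moreover, by stacking the functions $f_i$ along $i \in \mathcal{V}$, [19] establishes that tradeoffs between convergence and memory akin to Table I hold for decentralized learning (16) when local functions are fed into local projection steps. Experiments then establish the practical usefulness of (17) for attaining state-of-the-art decentralized learning.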
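Decentralized learning with proximity constraints: When the hypothesis that all agents seek to learn a common global function is invalid, due to heterogeneity of agents’ observations or processing capabilities, imposing consensus (16) degrades decentralized learning [38]. Thus, we define a relaxation of consensus (16) called proximity constraints that incentivizes coordination without requiring agents’ decisions to coincide:
$$\{f_i^\star\}_{i \in \mathcal{V}} = \operatorname*{argmin}_{\{f_i\} \subset \mathcal{H}} \sum_{i \in \mathcal{V}} \Big( \mathbb{E}_{\mathbf{x}_i, y_i}\big[\ell_i(f_i(\mathbf{x}_i), y_i)\big] + \frac{\lambda}{2}\|f_i\|_{\mathcal{H}}^2 \Big) \quad \text{s.t.} \quad \mathbb{E}_{\mathbf{x}_i}\big[ h_{ij}(f_i(\mathbf{x}_i), f_j(\mathbf{x}_i)) \big] \leq \gamma_{ij} \ \text{ for all } (i, j) \in \mathcal{E} \qquad (18)$$
where $h_{ij}(f_i(\mathbf{x}_i), f_j(\mathbf{x}_i))$ is small when $f_i(\mathbf{x}_i)$ and $f_j(\mathbf{x}_i)$ are close, and $\gamma_{ij}$ defines a tolerance. This allows local solutions of (18) to be different at each node, and, for instance, to incorporate correlation priors into algorithm design. To solve (18), we propose a method based on Lagrangian relaxation, specifically, a functional stochastic variant of the Arrow-Hurwicz primal-dual (saddle point) method [20]. Its specific form is given as follows: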
(19)  
(20) 
The KOMP-based projection is applied to each local primal update (19), which permits us to trade off convergence accuracy and model complexity, recovering tradeoffs akin to Table I.
The guarantees for primal-dual methods in stochastic settings are given in terms of constant step-size mean convergence, due to technical challenges of obtaining a strict Lyapunov function for (19)-(20). Specifically, for a horizon $T$, a suitable selection of the constant step size and compression budget results in the respective suboptimality, measured in terms of the regularized penalty, and constraint violation attenuating with $T$ as
(21) 
Note that the quantities on the right of (21) aggregate terms obtained over $T$ iterations, but are still bounded by sublinear functions of $T$. In other words, the average optimality gap and constraint violation, obtained by dividing these bounds by $T$, approach zero for large $T$. In [20], the experimental merit of (19)-(20) is demonstrated for decentralized online problems where nonlinearity is inherent to the observation model.
IV Discussion and Open Problems
Algorithm 1 yields nearly optimal online solutions to nonparametric learning (Sec. II-A), while ensuring the memory never becomes unwieldy. Several open problems may be identified as a result, such as the selection of kernel hyperparameters to further optimize performance, of which a special case has recently been studied [39]. Moreover, time-varying problems where the decision variable is a function, as in trajectory optimization, remain unaddressed. On the practical side, the algorithms of Sec. II-A may be used for, e.g., online occupancy mapping-based localization amongst obstacles, dynamic phase retrieval, and beamforming in mobile autonomous networks.
The use of risk measures to overcome online overfitting (Sec. II-B) may yield online algorithms that are robust to unpredictable environmental effects, an ongoing challenge in indoor and urban localization [40], as well as to model mismatch in autonomous control [16]. Their wider use in machine learning may reduce the “brittleness” of deep learning as well.
For decentralized learning, numerous enhancements of the methods in Sec. III are possible, such as those which relax conditions on the network or the smoothness required for stability, incorporate agents’ ability to customize hyperparameters to local observations, and reduce the communications burden, to name a few. Online multi-agent learning with nonlinear models may pave the pathway for next-generation distributed intelligence.
The general principle of sparsifying a nonparametric learning algorithm as much as possible while ensuring a descent-like property also holds when one changes the metric, ambient space, and choice of learning update rule, as has been recently demonstrated for Gaussian processes [24]. Similar approaches are possible for Monte Carlo methods [14], and it is an open question which other statistical methods limited by the curse of dimensionality may be gracefully brought into the memory-efficient online setting through this perspective.
Overall, the methods discussed in this work echo Shannon-Nyquist sampling theorems for nonparametric learning. In particular, to estimate a (class-conditional or regression) probability density with some fixed bias, one only needs finitely many points, after which all additional training examples are redundant. Such a phenomenon may be used to employ nonparametric methods in streaming problems for future learning systems.
References
 [1] S. Haykin, Neural Netw. Prentice hall New York, 1994, vol. 2.
 [2] V. Tikhomirov, “On the representation of continuous functions of several variables as superpositions of continuous functions of one variable and addition,” in Selected Works of AN Kolmogorov. Springer, 1991, pp. 383–387.
 [3] S. Thrun, W. Burgard, and D. Fox, Probabilistic robotics. MIT press, 2005.
 [4] I. Atzeni, L. G. Ordóñez, G. Scutari, D. P. Palomar, and J. R. Fonollosa, “Demandside management via distributed energy generation and storage optimization,” IEEE Trans. Smart Grid, vol. 4, no. 2, pp. 866–876, 2013.
 [5] A. Ribeiro, “Ergodic stochastic optimization algorithms for wireless communication and networking,” IEEE Trans. Signal Process., vol. 58, no. 12, pp. 6369–6386, 2010.

[6]
S. Ghosal and A. Van der Vaart,
Fundamentals of nonparametric Bayesian inference
. Cambridge University Press, 2017, vol. 44.  [7] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” Ann. Stat., pp. 1171–1220, 2008.
 [8] P. M. Djuric, J. H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. F. Bugallo, and J. Miguez, “Particle filtering,” IEEE Signal Process. Mag., vol. 20, no. 5, pp. 19–38, 2003.
 [9] Z. Wang, K. Crammer, and S. Vucetic, “Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training,” J. Mach. Learn. Res., vol. 13, no. 1, pp. 3103–3131, 2012.
 [10] C. K. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Proc. of NeurIPS, 2001, pp. 682–688.
 [11] A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Proc. of NeurIPS, 2009, pp. 1313–1320.
 [12] A. Koppel, G. Warnell, E. Stump, and A. Ribeiro, “Parsimonious online learning with kernels via sparse projections in function space,” J. Mach. Learn. Res., vol. 20, no. 3, pp. 1–44, 2019.
 [13] A. S. Bedi, A. Koppel, and K. Rajawat, “Nonparametric compositional stochastic optimization,” arXiv preprint arXiv:1902.06011, 2019.
 [14] V. Elvira, J. Míguez, and P. M. Djurić, “Adapting the number of particles in sequential Monte Carlo methods through an online scheme for convergence assessment,” IEEE Trans. Signal Process., vol. 65, no. 7, pp. 1781–1794.
 [15] C. E. Garcia, D. M. Prett, and M. Morari, “Model predictive control: theory and practice - a survey,” Automatica, vol. 25, no. 3, pp. 335–348, 1989.
 [16] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration,” in Proc. of IEEE CDC, 2018, pp. 6059–6066.
 [17] G. K. Kaleh and R. Vallet, “Joint parameter estimation and symbol detection for linear or nonlinear unknown channels,” IEEE Trans. Commun., vol. 42, no. 7, pp. 2406–2413, 1994.
 [18] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Autom. Control, vol. 54, no. 1, pp. 48–61, 2009.
 [19] A. Koppel, S. Paternain, C. Richard, and A. Ribeiro, “Decentralized online learning with kernels,” IEEE Trans. Signal Process., vol. 66, no. 12, pp. 3240–3255, 2018.
 [20] H. Pradhan, A. S. Bedi, A. Koppel, and K. Rajawat, “Exact nonparametric decentralized online optimization,” in Proc. of IEEE GlobalSIP. IEEE, 2018, pp. 643–647.
 [21] W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An exact first-order algorithm for decentralized consensus optimization,” SIAM J. Opt., vol. 25, no. 2, pp. 944–966, 2015.
 [22] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Trans. Autom. Control, vol. 60, no. 3, pp. 601–615, 2015.
 [23] P. Wan and M. D. Lemmon, “Event-triggered distributed optimization in sensor networks,” in Proc. of IEEE IPSN, 2009, pp. 49–60.
 [24] A. Koppel, “Consistent online Gaussian process regression without the sample complexity bottleneck,” Proc. of IEEE ACC (to appear), 2019.
 [25] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online Learning with Kernels,” IEEE Trans. Signal Process., vol. 52, pp. 2165–2176, August 2004.
 [26] D. Needell, J. Tropp, and R. Vershynin, “Greedy signal recovery review,” in Proc. of IEEE Asilomar Conf. Signals, Systems and Computers, 2008, pp. 1048–1050.
 [27] Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least-squares algorithm,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, Aug 2004.
 [28] P. Vincent and Y. Bengio, “Kernel matching pursuit,” Machine Learning, vol. 48, no. 1, pp. 165–187, 2002.

 [29] J. Zhu and T. Hastie, “Kernel Logistic Regression and the Import Vector Machine,” Journal of Computational and Graphical Statistics, vol. 14, no. 1, pp. 185–205, 2005.
 [30] T. Le, T. Nguyen, V. Nguyen, and D. Phung, “Dual space gradient descent for online learning,” in Advances in Neural Information Processing Systems, 2016, pp. 4583–4591.
 [31] T. Le, V. Nguyen, T. D. Nguyen, and D. Phung, “Nonparametric budgeted stochastic gradient descent,” in Proc. of AISTATS, 2016, pp. 654–662.
 [32] Z. Wang and S. Vucetic, “Online passive-aggressive algorithms on a budget,” in Proc. of AISTATS, 2010.
 [33] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer series in statistics New York, 2001, vol. 1.
 [34] H. Levy and H. M. Markowitz, “Approximating expected utility by a function of mean and variance,” The American Economic Review, vol. 69, no. 3, pp. 308–317, 1979.
 [35] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath, “Coherent measures of risk,” Mathematical Finance, vol. 9, no. 3, pp. 203–228, 1999.
 [36] S. Ahmed, “Convexity and decomposition of mean-risk stochastic programs,” Math. Prog., vol. 106, no. 3, pp. 433–446, 2006.
 [37] M. Wang, E. X. Fang, and H. Liu, “Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions,” Math. Prog., vol. 161, no. 1-2, pp. 419–449, 2017.
 [38] A. Koppel, B. M. Sadler, and A. Ribeiro, “Proximity without consensus in online multi-agent optimization,” IEEE Trans. Signal Process., vol. 65, no. 12, pp. 3062–3077, 2017.
 [39] M. Peifer, L. F. Chamon, S. Paternain, and A. Ribeiro, “Sparse learning of parsimonious reproducing kernel Hilbert space models,” in Proc. of IEEE ICASSP. IEEE, 2019, pp. 3292–3296.
 [40] V. Elvira and I. Santamaria, “Multiple importance sampling for efficient symbol error rate estimation,” IEEE Signal Process. Lett., 2019.