Optimally Compressed Nonparametric Online Learning

09/25/2019, by Alec Koppel et al.

Batch training of machine learning models based on neural networks is now well established, whereas to date streaming methods are largely based on linear models. To go beyond linear in the online setting, nonparametric methods are of interest due to their universality and ability to stably incorporate new information via convexity or Bayes' Rule. Unfortunately, when used online, nonparametric methods suffer a classic "curse of dimensionality" which precludes their use: their complexity scales at least with the time index. We survey online compression tools which bring their memory under control while attaining approximate convergence. The asymptotic bias depends on a compression parameter which trades off memory and accuracy, echoing Shannon-Nyquist sampling for nonparametric statistics. Applications to autonomous robotics, communications, economics, and power are scoped, as well as extensions to multi-agent systems.


I Introduction

I-A Motivation

Modern machine learning is driven in large part by training nonlinear statistical models [1] on huge data sets held in cloud storage via stochastic optimization methods. This recipe has yielded advances across fields such as vision, acoustics, speech, and countless others. Of course, universal approximation theorems [2], the underpinning for the predictive capability of methods such as deep neural networks and nonparametric methods, have been known for some time. However, when accumulating enough data is difficult or if the observations arise from a non-stationary process, as in robotics [3], energy [4], and communications [5], online training is of interest. Specifically, for problems with dynamics, one would like the predictive power of universality while stably and quickly adapting to new data.

Bayesian and nonparametric methods [6] meet these specifications in the sense that they possess the universal approximation capability and may stably adapt to new data. In particular, tools under this umbrella such as kernel regression [7], Gaussian Processes, and particle filters/Monte Carlo methods [8] can stably (in a Lyapunov sense) incorporate new information as it arrives via functional variants of convex optimization or probabilistic updates via Bayes’ Rule. These facts motivate applying Bayesian and nonparametric methods to streaming problems.

Alas, Bayesian and nonparametric methods are limited by the curse of dimensionality. Typically, these tools require storing a probability density estimate that retains all past training examples together with a vector of weights that denotes the importance of each sample. Moreover, as time becomes large, their complexity approaches infinity. This bottleneck has led to a variety of approximation schemes, both offline and online, to ensure memory is under control. Typically one fixes some memory budget, and projects all additional training examples/particles onto the likelihood of the density estimate spanned by current points [9]. Alternatives include memory-efficient approximations of the associated functional [10] or probabilistic model [11].

I-B Significance

Until recently, a significant gap existed in the literature on memory-reduction design for Bayesian and nonparametric methods: how to provide a tunable tradeoff between statistical consistency (consistency is the statistical version of optimality, meaning that an empirical estimator converges to its population analogue, i.e., is asymptotically unbiased) and memory? Recently, the perception that nonparametric methods do not scale to streaming settings has been upended [12, 13]. The key insight of these methods is that one may fix an error neighborhood around the current density estimate and project it onto a nearby subspace with lower memory, thus sparsifying the sample path online and allowing the complexity of the density representation to be flexible and problem-dependent [14]. The performance guarantees of this approach echo Shannon-Nyquist sampling for nonparametric statistics. That is, the radius of the error neighborhood, which we henceforth call the compression budget, determines the asymptotic radius of convergence, i.e., the bias, as well as the complexity of the resulting density function.

Fig. 1: Generalized projection scheme for Bayesian and nonparametric methods in the online setting. A new sample arrives and is incorporated via either a functional stochastic gradient step, maximum a posteriori estimation, or Monte Carlo particle generation. Rather than allow the complexity to grow ad infinitum, the current statistical model is projected onto a lower-dimensional subspace via greedy compression which is at most $\epsilon$-far away according to some metric. The compression parameter $\epsilon$ trades off statistical consistency and memory requirements.

I-C Impact

Proposed approaches allow stable and memory-efficient online training of universal statistical models, a foundation upon which many theoretical and practical advances may be built. For instance, in autonomous control of robotic platforms, it is well-known that optimal control based on physical models [15] is limited to the domain for which the model remains valid. When used beyond this domain, a gap emerges between an a priori hypothesized physical model and the ground truth. Alternatively, such problems can be cast within the rubric of sequential learning where one seeks to regress on the model mismatch. Initial efforts toward this end have appeared recently [16], but substantial enhancements are possible, for instance, by increasing the descriptive capacity of the class of learned models and the adaptivity of autonomous behaviors to changing environments.

A similar problem arises in communications systems, where nonlinear adaptive filters can be used to mitigate signal saturation in amplifiers and detectors. Current techniques are limited to linearity for, e.g., equalization in receivers, noise cancellation, and beamforming in radar and sonar systems. This limitation arises because current channel state estimators [5], modulators, and localizers need to adapt with low latency, and at present only linear methods meet this requirement [17]. However, nonparametric and Bayesian methods would enable identification and tracking of modulation type from symbol estimates that exhibit more complex interaction effects, channel state estimation in the presence of nonlinearities arising from the environment, and localization problems where the entire distribution over possible source locations is of interest due to, e.g., a desire for confidence guarantees.

On the theory side, we note that efforts to develop optimization tools for multi-agent online learning have been mostly restricted to cases where agents learn linear statistical models [18], which exclude the state of the art machine learning methods. However, since kernel learning may be formulated as a stochastic convex problem over a function space, standard strategies, i.e., distributed gradient [19] and primal-dual methods [20], may be derived. These schemes allow individual agents to learn in a decentralized online manner a memory-efficient nonlinear interpolator as good as a centralized clairvoyant agent with all information in advance. Many enhancements are possible via recent advances in decentralized optimization [21], such as those which relax conditions on the network [22] and the amount of communications [23].

Apart from multi-agent systems, the general approach of sparsifying as much as possible while ensuring a descent property holds, may be applied to other nonparametric statistical tools such as Gaussian processes and Monte Carlo Methods [24, 14] by varying the ambient space and compression metric. Due to the need for brevity, we defer such a discussion to future work.

The rest of the paper is organized as follows. In the subsequent section, we detail supervised learning in the online nonparametric setting: in Section II we formulate the problem of supervised online training with kernels, define memory-affordable kernel regression algorithms for solving it in Section II-A, and extend this framework to online risk-aware learning in Section II-B. In Section III, we spotlight online nonparametric learning methodologies in decentralized settings. In Section IV, we conclude with a discussion of implications and open problems.

II Online Supervised Learning with Kernels

We begin by formalizing the problem of supervised learning as expected risk minimization (ERM), which is the foundation of filtering, prediction, and classification. Then, we detail how the problem specializes when the estimator admits a kernel parameterization. In ERM, we seek to learn a regressor so as to minimize a loss quantifying the merit of a statistical model averaged over data. Each point in the data set constitutes an input-output pair $(\mathbf{x}_n, y_n)$, an i.i.d. realization from the stationary distribution of the random pair $(\mathbf{x}, y)$ with $\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^p$ and $y \in \mathcal{Y}$. We consider problems where samples arrive sequentially in perpetuity, which is applicable to signal processing, communication, and visual perception.

Hereafter, we quantify the quality of an estimator function $f \in \mathcal{H}$ by a convex loss function $\ell : \mathcal{H} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{H}$ is a hypothesized function class. In particular, if we evaluate $f$ at feature vector $\mathbf{x}$, its merit is encapsulated by $\ell(f(\mathbf{x}), y)$. Then, we would like to select $f$ to have optimal model fitness on average over data, i.e., to minimize the statistical loss $L(f) := \mathbb{E}_{\mathbf{x},y}[\ell(f(\mathbf{x}), y)]$. We focus on minimizing the regularized loss as

$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}} \; R(f) := \mathbb{E}_{\mathbf{x},y}\big[\ell(f(\mathbf{x}), y)\big] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$   (1)

The preceding expression specializes to linear regression when $f(\mathbf{x}) = \mathbf{c}^\top \mathbf{x}$ for some $\mathbf{c} \in \mathbb{R}^p$, but here we focus on the case that the function class $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) – for background, see [7]. An RKHS consists of functions that have a basis expansion in terms of elements of $\mathcal{X}$ through a kernel $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ defined over inner products of data:

(i) $\langle f, \kappa(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = f(\mathbf{x})$ for all $\mathbf{x} \in \mathcal{X}$,  (ii) $\mathcal{H} = \overline{\operatorname{span}\{\kappa(\mathbf{x}, \cdot) : \mathbf{x} \in \mathcal{X}\}}$   (2)

where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ is the Hilbert inner product for $\mathcal{H}$. The kernel function is henceforth assumed positive semidefinite: $\sum_{m,n} a_m a_n \kappa(\mathbf{x}_m, \mathbf{x}_n) \geq 0$ for all finite collections $\{a_n\} \subset \mathbb{R}$ and $\{\mathbf{x}_n\} \subset \mathcal{X}$. Example kernels include the polynomial kernel $\kappa(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + b)^c$ and the Gaussian kernel $\kappa(\mathbf{x}, \mathbf{x}') = \exp\{-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\sigma^2)\}$, where $\sigma > 0$ denotes the bandwidth.
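To make the kernel definitions concrete, here is a minimal Python sketch (NumPy only; the function names are illustrative) that evaluates the Gaussian and polynomial kernels and checks that a Gram matrix built from the Gaussian kernel is positive semidefinite.

```python
import numpy as np

def gaussian_kernel(x, xp, bandwidth=1.0):
    # Gaussian kernel: exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * bandwidth ** 2))

def polynomial_kernel(x, xp, b=1.0, c=2):
    # Polynomial kernel: (x^T x' + b)^c
    return (x @ xp + b) ** c

# Build a Gram matrix from a few random samples and verify it is positive
# semidefinite (all eigenvalues nonnegative up to round-off).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
K = np.array([[gaussian_kernel(xi, xj) for xj in X] for xi in X])
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))
```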

The term reproducing comes from replacing $f$ by $\kappa(\mathbf{x}', \cdot)$ in (2)(i), which yields $\langle \kappa(\mathbf{x}', \cdot), \kappa(\mathbf{x}, \cdot) \rangle_{\mathcal{H}} = \kappa(\mathbf{x}, \mathbf{x}')$. We note that this reproducing property permits writing the inner product of the feature maps of two distinct vectors $\mathbf{x}$ and $\mathbf{x}'$ as the kernel evaluation $\langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle = \kappa(\mathbf{x}, \mathbf{x}')$. Here, $\phi(\mathbf{x}) := \kappa(\mathbf{x}, \cdot)$ denotes the feature map of vector $\mathbf{x}$. The preceding expression, the kernel trick, allows us to define arbitrary nonlinear relationships between data without ever computing $\phi(\mathbf{x})$ explicitly [7]. Moreover, (2)(ii) states that any function $f \in \mathcal{H}$ is a linear combination of kernel evaluations at the data points $\{\mathbf{x}_n\}$, which for empirical versions of (1) implies that the Representer Theorem holds [7]:

$f(\mathbf{x}) = \sum_{n=1}^{N} w_n \kappa(\mathbf{x}_n, \mathbf{x})$   (3)

Here $\{w_n\}_{n=1}^N \subset \mathbb{R}$ is a collection of weights and $N$ is the sample size. We note that the number of terms in the sum (3), hereafter referred to as the model order, coincides with the sample size, and hence grows unbounded as empirical approximations of (1) approach their population counterpart. This complexity bottleneck is a manifestation of the curse of dimensionality in nonparametric learning. Decades of research have attempted to overcome this issue through the design of memory-reduction techniques, most of which are not guaranteed to yield the minimizers of (1). In the following subsection, we detail a stochastic approximation algorithm which explicitly trades off model fitness with memory requirements, and provide applications.
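To illustrate the Representer Theorem (3) and the resulting growth of the model order, the following sketch (NumPy only, with hypothetical helper names) stores every observed sample as a dictionary element and evaluates the function as a weighted sum of kernel evaluations; the number of stored columns grows linearly with the number of samples processed.

```python
import numpy as np

def kernel_vector(D, x, bandwidth=1.0):
    # [k(d_1, x), ..., k(d_M, x)] for a dictionary D of shape (M, p)
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2.0 * bandwidth ** 2))

def f_eval(D, w, x):
    # Kernel expansion f(x) = sum_m w_m k(d_m, x), cf. the Representer Theorem (3)
    return w @ kernel_vector(D, x)

rng = np.random.default_rng(1)
D = np.empty((0, 2))   # dictionary of retained samples (model order x dimension)
w = np.empty(0)        # kernel expansion weights
for n in range(100):   # naive online storage: every sample becomes a dictionary element
    x_n = rng.normal(size=2)
    D = np.vstack([D, x_n])
    w = np.append(w, rng.normal() * 0.1)
print(D.shape[0], f_eval(D, w, rng.normal(size=2)))  # model order equals the sample count
```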

II-A Trading Off Consistency and Complexity

In this section, we derive an online algorithm to solve the problem in (1) through a functional variant of stochastic gradient descent (FSGD) [25]. Then, we detail how memory reduction may be attained through subspace projections [12]. We culminate by illuminating tradeoffs between memory and consistency both in theory and practice. We begin by noting that FSGD applied to the statistical loss (1), when the feasible set is an RKHS (2), takes the form [25]

$f_{t+1} = (1 - \eta\lambda) f_t - \eta \nabla_f \ell(f_t(\mathbf{x}_t), y_t) = (1 - \eta\lambda) f_t - \eta\, \ell'(f_t(\mathbf{x}_t), y_t)\, \kappa(\mathbf{x}_t, \cdot)$   (4)

where $\eta > 0$ is the constant step size and $\ell'$ denotes the derivative of $\ell$ with respect to its first (scalar) argument. The equality uses the chain rule, the reproducing property of the kernel (2)(i), and the definition of the functional stochastic gradient as in [25]. With initialization $f_0 = 0$, the Representer Theorem (3) in (4) allows one to rewrite (4) with dictionary and weight updates as

$\mathbf{D}_{t+1} = [\mathbf{D}_t, \; \mathbf{x}_t], \qquad \mathbf{w}_{t+1} = \big[(1 - \eta\lambda)\mathbf{w}_t, \; -\eta\, \ell'(f_t(\mathbf{x}_t), y_t)\big]$   (5)

We define the matrix of training examples $\mathbf{D}_t = [\mathbf{x}_1, \ldots, \mathbf{x}_{M_t}] \in \mathbb{R}^{p \times M_t}$ as the kernel dictionary, the kernel matrix as $\mathbf{K}_{\mathbf{D}_t,\mathbf{D}_t} \in \mathbb{R}^{M_t \times M_t}$, whose entries are the kernel evaluations $[\mathbf{K}_{\mathbf{D}_t,\mathbf{D}_t}]_{mn} = \kappa(\mathbf{x}_m, \mathbf{x}_n)$, and the empirical kernel map $\boldsymbol{\kappa}_{\mathbf{D}_t}(\cdot) = [\kappa(\mathbf{x}_1, \cdot), \ldots, \kappa(\mathbf{x}_{M_t}, \cdot)]^\top$ as the vector of kernel evaluations at the dictionary elements.
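As a concrete illustration of these objects, the snippet below forms a kernel dictionary, its kernel matrix, and the empirical kernel map for an assumed Gaussian kernel; names and dimensions are chosen for illustration rather than taken from [12].

```python
import numpy as np

def kernel(x, xp, sigma=1.0):
    # Gaussian kernel evaluation
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
D = rng.normal(size=(4, 3))         # kernel dictionary: M_t = 4 retained points in R^3
K_DD = np.array([[kernel(dm, dn) for dn in D] for dm in D])   # kernel matrix, M_t x M_t
x = rng.normal(size=3)
kappa_D = np.array([kernel(dm, x) for dm in D])               # empirical kernel map evaluated at x
w = rng.normal(size=4)              # expansion weights
f_x = w @ kappa_D                   # function evaluation via the Representer Theorem
print(K_DD.shape, kappa_D.shape, f_x)
```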

initialize $f_0 = 0$, $\mathbf{D}_0 = [\,]$, $\mathbf{w}_0 = [\,]$, i.e., the initial dictionary and coefficient vector are empty
for $t = 0, 1, 2, \ldots$ do
     Obtain independent training pair realization $(\mathbf{x}_t, y_t)$
     Compute unconstrained functional stochastic gradient step $\tilde{f}_{t+1} = (1 - \eta\lambda) f_t - \eta\, \ell'(f_t(\mathbf{x}_t), y_t)\, \kappa(\mathbf{x}_t, \cdot)$
     Revise dictionary $\tilde{\mathbf{D}}_{t+1} = [\mathbf{D}_t, \mathbf{x}_t]$ and weights $\tilde{\mathbf{w}}_{t+1} = [(1 - \eta\lambda)\mathbf{w}_t, \; -\eta\, \ell'(f_t(\mathbf{x}_t), y_t)]$
     Compute sparse function approximation via KOMP [26]: $(f_{t+1}, \mathbf{D}_{t+1}, \mathbf{w}_{t+1}) = \mathrm{KOMP}(\tilde{f}_{t+1}, \tilde{\mathbf{D}}_{t+1}, \tilde{\mathbf{w}}_{t+1}, \epsilon)$
end for
Algorithm 1 Parsimonious Online Learning with Kernels (POLK)
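To make the structure of Algorithm 1 concrete, the following Python sketch runs a POLK-style recursion for the square loss with an assumed Gaussian kernel; the weight-threshold pruning is a crude stand-in for the KOMP step detailed below, and all parameter values are illustrative rather than those of [12].

```python
import numpy as np

def kvec(D, x, sigma=1.0):
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def f_eval(D, w, x):
    return w @ kvec(D, x) if len(w) else 0.0

def prune(D, w, eps):
    # Crude stand-in for destructive KOMP: drop elements whose weight magnitude is
    # below the budget (k(d, d) = 1 for the Gaussian kernel), without re-weighting.
    keep = np.abs(w) > eps
    return D[keep], w[keep]

eta, lam, eps = 0.3, 0.1, 1e-2
D, w = np.empty((0, 2)), np.empty(0)
rng = np.random.default_rng(3)
for t in range(500):
    x_t = rng.normal(size=2)
    y_t = np.sin(x_t[0]) + 0.1 * rng.normal()      # streaming regression pair
    grad = f_eval(D, w, x_t) - y_t                 # l'(f(x), y) for the square loss
    w = (1.0 - eta * lam) * w                      # shrink old weights (FSGD step)
    D, w = np.vstack([D, x_t]), np.append(w, -eta * grad)   # append new dictionary element
    D, w = prune(D, w, eps)                        # compress to keep the model order in check
print(len(w))   # model order stays well below the 500 samples processed
```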

At each time $t$, the new sample $\mathbf{x}_t$ enters the current dictionary to obtain $\tilde{\mathbf{D}}_{t+1}$, and hence the model order $M_t$, i.e., the number of points in dictionary $\mathbf{D}_t$, tends to infinity as $t \to \infty$ when data is streaming. Existing strategies for online memory reduction include dropping past points when weights fall below a threshold [25], projecting functions onto fixed-size subspaces via spectral criteria [27] or the Hilbert norm [10], probabilistic kernel approximations [11], and others. A commonality to these methods is a capitulation on convergence in pursuit of memory reduction. In contrast, one may balance these criteria by projecting the FSGD sequence onto subspaces adaptively constructed from past data.

Model Order Control via Subspace Projections. To control the complexity growth, we propose approximating the function sequence of (4) by projecting it onto subspaces $\mathcal{H}_{\mathbf{D}} \subseteq \mathcal{H}$ of dimension $M$ spanned by the elements of a dictionary $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_M]$, i.e., $\mathcal{H}_{\mathbf{D}} = \operatorname{span}\{\kappa(\mathbf{d}_m, \cdot)\}_{m=1}^{M}$, where $\mathbf{K}_{\mathbf{D},\mathbf{D}}$ represents the kernel matrix obtained for dictionary $\mathbf{D}$. To ensure model parsimony, we enforce the number of elements $M_t$ in $\mathbf{D}_t$ to satisfy $M_t \ll t$.

To introduce the projection, note that the FSGD update (4) is such that $f_{t+1} \in \mathcal{H}_{\tilde{\mathbf{D}}_{t+1}}$ with $\tilde{\mathbf{D}}_{t+1} = [\mathbf{D}_t, \mathbf{x}_t]$. Instead, we use a dictionary $\mathbf{D}_{t+1}$ whose columns are chosen from $\{\mathbf{x}_1, \ldots, \mathbf{x}_t\}$. To be specific, we augment (4) by projection:

$f_{t+1} = \mathcal{P}_{\mathcal{H}_{\mathbf{D}_{t+1}}}\big[(1 - \eta\lambda) f_t - \eta \nabla_f \ell(f_t(\mathbf{x}_t), y_t)\big]$   (6)

where $\mathcal{H}_{\mathbf{D}_{t+1}}$ denotes the subspace associated with dictionary $\mathbf{D}_{t+1}$, and the right-hand side of (6) defines the projection operator. Parsimonious Online Learning with Kernels (POLK) (Algorithm 1) projects FSGD onto the subspaces stated in (6). Initially the function is $f_0 = 0$. Then, at each step, given sample $(\mathbf{x}_t, y_t)$ and step-size $\eta$, we take an unconstrained FSGD iterate $\tilde{f}_{t+1}$ which admits the parametric representation $\tilde{\mathbf{D}}_{t+1} = [\mathbf{D}_t, \mathbf{x}_t]$ and $\tilde{\mathbf{w}}_{t+1} = [(1 - \eta\lambda)\mathbf{w}_t, \, -\eta\, \ell'(f_t(\mathbf{x}_t), y_t)]$. These parameters are then fed into KOMP with approximation budget $\epsilon$, such that $\|f_{t+1} - \tilde{f}_{t+1}\|_{\mathcal{H}} \leq \epsilon$.

|                              | Diminishing step-size                                                     | Constant step-size                        |
| Step-size/Learning rate      | $\eta_t \to 0$ with $\sum_t \eta_t = \infty$, $\sum_t \eta_t^2 < \infty$  | constant $\eta$                           |
| Sparse Approximation Budget  | $\epsilon_t \to 0$ (coupled to $\eta_t$)                                  | constant $\epsilon$ (coupled to $\eta$)   |
| Regularization Condition     | $\eta_t < 1/\lambda$                                                      | $\eta < 1/\lambda$                        |
| Convergence Result           | $f_t \to f^\star$ a.s.                                                    | neighborhood of $f^\star$ a.s.            |
| Model Order Guarantee        | None                                                                      | Finite                                    |
TABLE I: Summary of convergence results for different parameter selections.

Parameterizing the Projection. The projection may be computed in terms of data and weights. To select the dictionary $\mathbf{D}_{t+1}$ for each $t$, we use greedy compression, specifically a destructive variant of kernel orthogonal matching pursuit (KOMP) [28] with budget $\epsilon$. The input to KOMP is the function $\tilde{f}_{t+1}$, parameterized by its kernel dictionary $\tilde{\mathbf{D}}_{t+1}$ and coefficient vector $\tilde{\mathbf{w}}_{t+1}$, where $\tilde{f}_{t+1}$ denotes the function updated by an un-projected FSGD step. The algorithm outputs $f_{t+1}$ with a lower model order. At each stage, the dictionary element is removed which contributes the least to the Hilbert-norm error of the original function $\tilde{f}_{t+1}$ when the remaining dictionary is used. Doing so yields an approximation $f_{t+1}$ of $\tilde{f}_{t+1}$ inside an $\epsilon$-neighborhood, $\|f_{t+1} - \tilde{f}_{t+1}\|_{\mathcal{H}} \leq \epsilon$. Then, the "energy" of removed points is re-weighted onto those remaining.
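A minimal sketch of the destructive matching-pursuit idea described above, assuming a Gaussian kernel and a least-squares re-weighting of the surviving elements; the function names and stopping rule are illustrative, not the exact KOMP routine of [26, 28].

```python
import numpy as np

def gram(D, sigma=1.0):
    # Gaussian kernel Gram matrix of the dictionary rows.
    sq = np.sum((D[:, None, :] - D[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def komp_destructive(D, w, eps, sigma=1.0):
    # Greedily remove dictionary elements while the Hilbert-norm error relative to the
    # original function stays <= eps, re-weighting the survivors by least squares.
    K = gram(D, sigma)
    idx = list(range(len(w)))          # indices of surviving elements
    v = w.copy()                       # coefficients on the survivors
    while len(idx) > 1:
        best_err, best_j, best_v = np.inf, None, None
        for j in range(len(idx)):
            keep = idx[:j] + idx[j + 1:]
            Kss = K[np.ix_(keep, keep)] + 1e-10 * np.eye(len(keep))  # regularize inversion
            cross = K[keep] @ w                                      # K_{keep,:} w
            v_keep = np.linalg.solve(Kss, cross)                     # least-squares reweighting
            err2 = w @ K @ w - cross @ v_keep                        # squared Hilbert-norm error
            if err2 < best_err:
                best_err, best_j, best_v = err2, j, v_keep
        if best_err > eps ** 2:        # removing any further element violates the budget
            break
        idx.pop(best_j)
        v = best_v
    return D[idx], v

# Tiny usage example on a random dictionary.
rng = np.random.default_rng(4)
D = rng.normal(size=(10, 2)); w = rng.normal(size=10) * 0.1
D_c, w_c = komp_destructive(D, w, eps=0.05)
print(len(w), "->", len(w_c))
```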

Algorithm 1 converges both under diminishing and constant step-size regimes [12]. When the learning rate satisfies $\eta_t < 1/\lambda$, where $\lambda$ is the regularization parameter, and is attenuating such that $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$, with an approximation budget $\epsilon_t$ that decays commensurately with the step-size, the iterates converge exactly to the minimizer of (1) almost surely. Practically speaking, this means that asymptotically the iterates generated by Algorithm 1 must have unbounded complexity to converge exactly. More interestingly, however, when a constant algorithm step-size $\eta < 1/\lambda$ is chosen, then under a constant budget $\epsilon$ selected commensurately with $\eta$, the function sequence converges to a neighborhood of the optimum $f^\star$ [cf. (1)] and has finite complexity. This tradeoff is summarized in Table I.

Fig. 2: Comparison of POLK and its competitors on Gaussian Mixtures for the multi-class kernel SVM, in terms of (a) empirical risk, (b) error rate, and (c) model order, when the model order parameter of the competitors is fixed and the parsimony constant of POLK is chosen accordingly. Observe that POLK achieves lower risk and higher accuracy than its competitors on this problem instance.

Experiments. We discuss experiments with Algorithm 1 for multi-class classification on the Gaussian Mixtures dataset (Fig. 3(a)) as in [29] for the case of the kernel SVM, and compare with several alternatives: budgeted stochastic gradient descent (BSGD) [9], a fixed-subspace projection method which requires a maximum model order a priori; Dual Space Gradient Descent (Dual) [30], a hybrid of FSGD with random features; nonparametric budgeted SGD (NPBSGD) [31], which combines a fixed subspace projection with random dropping; the Naive Online Regularized Minimization Algorithm (NORMA) [25], which truncates the memory to finite-horizon objectives; and Budgeted Passive-Aggressive (BPA) [32], which merges incoming points via nearest neighbors.

In Figure 2 we plot the empirical results of this experiment. POLK outperforms many of its competitors by an order of magnitude in terms of objective evaluation (Fig. 2(a)) and test-set error rate (Fig. 2(b)). The notable exception is NPBSGD, which comes close in terms of objective evaluation but less so in terms of test error. Moreover, because the marginal feature density contains finitely many modes, the optimal model order is on the order of the number of modes, and it is approximately learned by POLK (Fig. 2(c)). Several alternatives initialized with this parameter, on the other hand, do not converge. Moreover, POLK favorably trades off accuracy and sample complexity, reaching low error after only a modest number of samples. The final decision surface of this trial of POLK is shown in Fig. 3(b), where it can be seen that the selected kernel dictionary elements concentrate at modes of the class-conditional density. Next, we discuss modifications of Algorithm 1 that avoid overfitting via notions of risk from operations research.

Fig. 3: Visualization of the decision surfaces yielded by POLK for the multi-class kernel SVM and logistic regression on Gaussian Mixtures: (a) training data; (b) decision surface for the hinge loss. Training examples from distinct classes are assigned a unique color, and grid colors represent the classification decision made by the learned function. Bold black dots are kernel dictionary elements, which concentrate at the modes of the data distribution. Solid lines denote class label boundaries.

II-B Compositional and Risk-Aware Learning with Kernels

In this section, we explain how augmentations of (1) may incorporate risk-awareness into learning, motivated by bias-variance tradeoffs. In particular, given a particular sample path of data, one may learn a model overly sensitive to the peculiarities of the available observations, a phenomenon known as overfitting. To avoid overfitting offline, bootstrapping (data augmentation), cross-validation, or sparsity-promoting penalties are effective [33]. However, these approaches do not apply to streaming settings. For online problems, one must augment the objective itself to incorporate uncertainty, a topic extensively studied in econometrics [34]. Specifically, one may use coherent risk as a surrogate for error variance [35], which permits the derivation of online algorithms that do not overfit and are attuned to distributions with ill-conditioning or heavy tails, as in interference channels or visual inference with less-than-laboratory levels of cleanliness.

To clarify the motivation for risk-aware augmentations of (1), we first briefly review the bias-variance (estimation-approximation) tradeoff. Suppose we run some algorithm and obtain estimator $\hat{f} \in \mathcal{H}$. Then one would like the performance of $\hat{f}$ to approach the Bayes optimal $f^\dagger = \operatorname{argmin}_{f \in \mathcal{F}} \mathbb{E}_{\mathbf{x},y}[\ell(f(\mathbf{x}), y)]$, where $\mathcal{F}$ denotes the space of all functions that map data $\mathcal{X}$ to target variables $\mathcal{Y}$. The performance gap between $\hat{f}$ and $f^\dagger$ decomposes as [33]

$\mathbb{E}\big[\ell(\hat{f}(\mathbf{x}), y)\big] - \mathbb{E}\big[\ell(f^\dagger(\mathbf{x}), y)\big] = \Big(\mathbb{E}\big[\ell(\hat{f}(\mathbf{x}), y)\big] - \min_{f \in \mathcal{H}} \mathbb{E}\big[\ell(f(\mathbf{x}), y)\big]\Big) + \Big(\min_{f \in \mathcal{H}} \mathbb{E}\big[\ell(f(\mathbf{x}), y)\big] - \mathbb{E}\big[\ell(f^\dagger(\mathbf{x}), y)\big]\Big)$   (7)

by adding and subtracting the minimal value of (1) (ignoring regularization). Thus, the discrepancy decomposes into two terms: the estimation error, or bias, and the approximation error, or variance (approximation error is more general than variance, but for, e.g., the quadratic loss, the former reduces to the latter plus the noise of the data distribution; we conflate these quantities for ease of explanation, but they are different). The bias is minimized as the number of data points goes to infinity. On the other hand, universality implies the variance is null, but in practice, due to inherent unknown properties of the data and hyperparameter choices, it is positive. To avoid overfitting in the online setting, we propose accounting for error variance via a dispersion measure $D[\cdot]$, which yields a variant of supervised learning that accounts for approximation error [36]

$\min_{f \in \mathcal{H}} \; \mathbb{E}_{\mathbf{x},y}\big[\ell(f(\mathbf{x}), y)\big] + \theta\, D\big[\ell(f(\mathbf{x}), y)\big] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$   (8)

For example, the semivariance $D[Z] = \mathbb{E}\big[(Z - \mathbb{E}[Z])_+^2\big]$ is commonly used. Alternatives are the $p$-th order semideviation or the conditional value-at-risk (CVaR), which quantifies the loss function at tail-end quantiles of its distribution.
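To make these dispersion measures concrete, the sketch below computes empirical estimates of the semivariance and the CVaR from a sample of losses; the estimators and confidence level are illustrative choices rather than those of [35, 36].

```python
import numpy as np

def semivariance(losses):
    # Empirical semivariance: mean squared upside deviation E[(Z - E[Z])_+^2].
    dev = losses - losses.mean()
    return np.mean(np.maximum(dev, 0.0) ** 2)

def cvar(losses, alpha=0.9):
    # Empirical conditional value-at-risk: average loss in the worst (1 - alpha) tail.
    var = np.quantile(losses, alpha)          # value-at-risk at level alpha
    return losses[losses >= var].mean()

rng = np.random.default_rng(5)
losses = rng.standard_t(df=3, size=10_000) ** 2   # heavy-tailed nonnegative losses
print(semivariance(losses), cvar(losses, alpha=0.95))
```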

The choice of the risk-sensitivity parameter $\theta$ scales the emphasis on bias or variance error in (8), and its solutions, as compared with (1), are attuned to outliers and higher-order moments of the data distribution. Thus, (8) may be better equipped for classification with significant class overlap, or for regression (nonlinear filtering) with dips in signal-to-noise ratio. To derive solutions to (8), we begin by noting that it is a special case of compositional stochastic programming [37], given as

$\min_{f \in \mathcal{H}} \; L(f) := \mathbb{E}_{\xi}\Big[ h_{\xi}\big( \mathbb{E}_{\omega}[\, g_{\omega}(f)\,] \big) \Big] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$   (9)

where $g_{\omega}$ is an inner random map, $h_{\xi}$ is an outer random function, and $\xi$, $\omega$ denote random variables associated with the data.

Due to nested expectations, SGD no longer applies, and hence alternate tools are required, namely, stochastic quasi-gradient methods (SQG). Recently the behavior of SQG has been characterized in detail [37] – see references therein. Here we spotlight the use of such tools for nonparametric learning by generalizing SQG to RKHS, and applying matching pursuit-based projections [28]. Such an approach is the focus of [13], which provides a tunable tradeoff between convergence accuracy and required memory, again echoing Shannon-Nyquist sampling in nonparametric learning, but has the additional virtue that it admits an error variance which is controllable by the parameter $\theta$ in (8). We begin by noting that applying stochastic gradient descent to (9) requires access to the stochastic gradient

$\hat{\nabla}_f L(f) = \nabla_f g_{\omega_t}(f)\; h_{\xi_t}'\big( \mathbb{E}_{\omega}[ g_{\omega}(f) ] \big) + \lambda f$   (10)

However, the preceding expression is not available at individual training examples due to the expectation involved in the argument of $h_{\xi_t}'$. A second realization of $\omega$ is required to estimate the inner expectation. Instead, we use SQG, which defines a scalar sequence $u_t$ to track the instantaneous inner functions evaluated at the sample pairs $(\mathbf{x}_t, y_t)$:

$u_{t+1} = (1 - \beta)\, u_t + \beta\, g_{\omega_t}(f_t)$   (11)

with the intent of estimating the expectation $\mathbb{E}_{\omega}[g_{\omega}(f_t)]$. In (11), $\beta$ is a scalar learning rate chosen from the unit interval which may be either diminishing or constant. Then, we define a function sequence initialized as null, $f_0 = 0$, that we sequentially update using SQG:

$f_{t+1} = (1 - \eta\lambda) f_t - \eta\, h_{\xi_t}'(u_{t+1})\, \nabla_f g_{\omega_t}(f_t) = (1 - \eta\lambda) f_t - \eta\, h_{\xi_t}'(u_{t+1})\, g_{\omega_t}'(f_t(\mathbf{x}_t))\, \kappa(\mathbf{x}_t, \cdot)$   (12)

where $\eta > 0$ is a step-size parameter chosen as diminishing or constant, and the equality makes use of the chain rule and the reproducing kernel property (2)(i). Through the Representer Theorem (3), we then have parametric updates on the coefficient vector and kernel dictionary

$\mathbf{D}_{t+1} = [\mathbf{D}_t, \; \mathbf{x}_t], \qquad \mathbf{w}_{t+1} = \big[(1 - \eta\lambda)\mathbf{w}_t, \; -\eta\, h_{\xi_t}'(u_{t+1})\, g_{\omega_t}'(f_t(\mathbf{x}_t))\big]$   (13)

In (13), the kernel dictionary $\mathbf{D}_{t+1}$ parameterizing the function $f_{t+1}$ is a matrix which stacks past realizations of $\mathbf{x}_t$, and the coefficients $\mathbf{w}_{t+1}$ are the associated scalars in the kernel expansion (3), which are updated according to (13). The function update of (12) implies that the complexity of computing $f_{t+1}$ is $O(t)$, due to the fact that the number of columns in $\mathbf{D}_{t+1}$, or model order $M_t$, is $t$, and thus it is unsuitable for streaming settings. This computational cost is an inherent challenge of extending [37] to nonparametric kernel functions. To address this, one may project (12) onto low-dimensional subspaces in a manner similar to Algorithm 1 – see [12]. The end result, (12) operating in tandem with the projections defined in the previous section, is what we call Compositional Online Learning with Kernels (COLK), and is summarized as Algorithm 2. Its behavior trades off convergence accuracy and memory akin to Table I, and is studied in [13].

initialize $f_0 = 0$, $\mathbf{D}_0 = [\,]$, $\mathbf{w}_0 = [\,]$, $u_0 = 0$, i.e., the initial dictionary and coefficient vector are empty
for $t = 0, 1, 2, \ldots$ do
     Update auxiliary variable $u_{t+1}$ according to (11)
     Compute functional stochastic quasi-gradient step (12)
     Revise function parameters: dictionary and weights via (13)
     Compress parameterization via KOMP: $(f_{t+1}, \mathbf{D}_{t+1}, \mathbf{w}_{t+1}) = \mathrm{KOMP}(\tilde{f}_{t+1}, \tilde{\mathbf{D}}_{t+1}, \tilde{\mathbf{w}}_{t+1}, \epsilon)$
end for
Algorithm 2 Compositional Online Learning with Kernels
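The sketch below illustrates the structure of the recursion (11)-(13) for a variance-penalized square loss, i.e., one particular instance of (8), using an assumed magnitude-threshold compression in place of KOMP; parameter values and helper names are hypothetical, and the point is only to show how the auxiliary scalar removes the need for a second sample at each step.

```python
import numpy as np

def kvec(D, x, sigma=1.0):
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def f_eval(D, w, x):
    return w @ kvec(D, x) if len(w) else 0.0

eta, beta, lam, theta, eps = 0.1, 0.1, 0.05, 0.5, 1e-2
D, w, u = np.empty((0, 1)), np.empty(0), 0.0     # u tracks the expected loss, cf. (11)
rng = np.random.default_rng(7)
for t in range(2000):
    x_t = rng.uniform(-2, 2, size=1)
    y_t = np.sinc(x_t[0]) + (0.5 if rng.random() < 0.05 else 0.05) * rng.normal()  # occasional outliers
    err = f_eval(D, w, x_t) - y_t
    loss = 0.5 * err ** 2
    u = (1.0 - beta) * u + beta * loss           # auxiliary scalar update, cf. (11)
    scale = 1.0 + 2.0 * theta * (loss - u)       # quasi-gradient weight from the variance penalty
    w = (1.0 - eta * lam) * w                    # shrink existing weights, cf. (12)-(13)
    D, w = np.vstack([D, x_t]), np.append(w, -eta * scale * err)
    keep = np.abs(w) > eps                       # crude stand-in for the KOMP compression
    D, w = D[keep], w[keep]
print(len(w), u)   # model order and running estimate of the mean loss
```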

Experiments. Now we discuss a specialization to online regression, where model fitness is determined by the square loss $\ell(f(\mathbf{x}), y) = \frac{1}{2}(f(\mathbf{x}) - y)^2$. Due to the bias-variance tradeoff (7), we seek to minimize both the bias and the variance of the loss. To account for the variance, we quantify risk by the $p$-th order central moments:

$D\big[\ell(f(\mathbf{x}), y)\big] = \mathbb{E}\Big[\big(\ell(f(\mathbf{x}), y) - \mathbb{E}[\ell(f(\mathbf{x}), y)]\big)^p\Big]$   (14)

For the experiments, we select moments up to fourth order, so that the dispersion measure in (14), which is non-convex, corresponds to the variance, skewness, and kurtosis of the loss distribution. We can always convexify the dispersion measure via positive projections (semi-deviations); however, for simplicity, we omit the positive projections in the experiments.
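As a small illustration of the dispersion in (14) and its convexified counterpart, the snippet below computes empirical p-th central moments of a sample of losses together with the corresponding upper semideviations obtained via positive projection; the distribution and moment orders are examples rather than the exact experimental settings.

```python
import numpy as np

def central_moment(losses, p):
    # Empirical p-th central moment E[(Z - E[Z])^p] of the loss samples.
    return np.mean((losses - losses.mean()) ** p)

def upper_semideviation(losses, p):
    # Convexified counterpart via positive projection: E[(Z - E[Z])_+^p].
    return np.mean(np.maximum(losses - losses.mean(), 0.0) ** p)

rng = np.random.default_rng(6)
losses = rng.gamma(shape=2.0, scale=1.0, size=10_000)   # skewed loss distribution
for p in (2, 3, 4):   # variance-, skewness-, and kurtosis-type terms
    print(p, central_moment(losses, p), upper_semideviation(losses, p))
```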

We evaluate COLK on data whose distribution has higher-order effects, and compare its test accuracy against existing benchmarks that minimize only the bias. We inquire as to which methods overfit: COLK (Algorithm 2), as compared with BSGD [9], NPBSGD [31], and POLK [12]. We consider different training sets drawn from the same distribution. To generate the synthetic regression-outliers dataset, we used a fixed nonlinear function as the ground truth (a reasonable template for phase retrieval), and the targets are perturbed by additive zero-mean Gaussian noise. We first generate samples of the data, hold out a portion as the test set, and select subsets of the remainder at random to generate different training sets. We run COLK over these training sets with a Gaussian kernel of fixed bandwidth, constant step-size and compression budget, a variance coefficient weighting the dispersion measure, and a small mini-batch size; similarly, POLK is run with a comparable step-size and parsimony constant. We fix the kernel type and bandwidth, and the parameters that define the comparator algorithms are hand-tuned to optimize performance under the restriction that their complexity is comparable. We run these algorithms over the different training realizations and evaluate their test accuracy as well as its standard deviation.

Fig. 4: COLK compared to alternatives for online learning with kernels that only minimize bias, on the regression-outliers data: (a) visualization of the regression, (b) model order, (c) test error and standard deviation. COLK quantifies risk as variance, skewness, and kurtosis. We report test MSE averaged over the training runs, with its standard deviation as error bars. Outlier presence does not break learning stability, and test accuracy remains consistent, at the cost of increased complexity. COLK attains minimal error and deviation.

The advantage of minimizing the bias as well as the variance may be observed in Fig. 4(a), which plots the learned function for POLK and COLK for two training data sets. The POLK solution varies from one training set to another, while COLK is robust to this change. In Fig. 4(b) we plot the model order of the function sequence defined by COLK, and observe that it stabilizes over time regardless of the presence of outliers. These results substantiate the convergence behavior spotlighted in [13], which also contains additional experimental validation on real data.

III Decentralized Learning Methods

In domains such as autonomous networks of robots or smart devices, data is generated at the network edge. In order to gain the benefits of laws of large numbers in learning, aggregation of information is required. However, transmission of raw data over the network may be neither viable nor secure, motivating the need for decentralized processing. Here, the goal is for each agent, based on local observations, to learn online an estimator as good as a centralized one with access to all information in advance. To date, optimization tools for multi-agent online learning are predominantly focused on cases where agents learn linear statistical models [18]. However, since kernel learning may be formulated as a stochastic convex problem over a function space, standard strategies, i.e., distributed gradient [19] and primal-dual methods [20], may be derived. Doing so is the focus of this section, leveraging the projections proposed in the previous sections.

To formulate decentralized learning, we define some key quantities first. Consider an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $|\mathcal{V}|$ nodes and $|\mathcal{E}|$ edges. Each $i \in \mathcal{V}$ represents an agent in the network, who observes a distinct observation sequence $\{(\mathbf{x}_{i,t}, y_{i,t})\}$ and quantifies merit according to its local loss $\ell_i$. Based on their local data streams, the agents would like to learn as well as a clairvoyant agent which has access to global information for all time:

$f^\star = \operatorname*{argmin}_{f \in \mathcal{H}} \; \sum_{i \in \mathcal{V}} \mathbb{E}_{\mathbf{x}_i, y_i}\big[\ell_i(f(\mathbf{x}_i), y_i)\big] + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$   (15)

Decentralized learning with consensus constraints: Under the hypothesis that all agents seek to learn a common global function, i.e., agents' observations are uniformly relevant to one another, one would like to solve (15) in a decentralized manner. To do so, we define local copies $f_i \in \mathcal{H}$ of the global function, and reformulate (15) as

$\min_{\{f_i \in \mathcal{H}\}} \; \sum_{i \in \mathcal{V}} \mathbb{E}_{\mathbf{x}_i, y_i}\big[\ell_i(f_i(\mathbf{x}_i), y_i)\big] + \frac{\lambda}{2}\sum_{i \in \mathcal{V}}\|f_i\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad f_i = f_j \;\; \text{for all } (i, j) \in \mathcal{E}$   (16)

Imposing functional constraints of the form $f_i = f_j$ in (16) is challenging due to the fact that it involves computations independent of the data, and hence may operate outside the realm of the Representer Theorem (3). To alleviate this issue, we approximate consensus in the form $f_i(\mathbf{x}_i) = f_j(\mathbf{x}_i)$, which is imposed for $(i, j) \in \mathcal{E}$ in expectation over the local data $\mathbf{x}_i$. Thus agents are incentivized to agree regarding their decisions, but not their entire functions. This modification of consensus remarkably yields a penalty functional amenable to efficient computation, culminating in updates for each agent $i$ of the form:

$f_{i,t+1} = \mathcal{P}_{\mathcal{H}_{\mathbf{D}_{i,t+1}}}\Big[(1 - \eta\lambda) f_{i,t} - \eta\Big(\ell_i'\big(f_{i,t}(\mathbf{x}_{i,t}), y_{i,t}\big) + c \sum_{j \in n_i}\big(f_{i,t}(\mathbf{x}_{i,t}) - f_{j,t}(\mathbf{x}_{i,t})\big)\Big)\kappa(\mathbf{x}_{i,t}, \cdot)\Big]$   (17)

where $c > 0$ is a penalty coefficient that determines the constraint violation, with exact consensus attained as $c \to \infty$, and $n_i$ denotes the neighborhood of agent $i$. Moreover, by stacking the local functions, [19] establishes that tradeoffs between convergence and memory akin to Table I hold for decentralized learning (16) when the local functions are fed into local projection steps. Experiments then establish the practical usefulness of (17) for attaining state-of-the-art decentralized learning.
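The following sketch mirrors the structure of the penalty-based update (17) for two agents joined by a single edge with the square loss, with an assumed threshold rule standing in for the KOMP projection; the graph, penalty coefficient, and data model are illustrative assumptions.

```python
import numpy as np

def kvec(D, x, sigma=1.0):
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def f_eval(D, w, x):
    return w @ kvec(D, x) if len(w) else 0.0

# Two agents on one edge; both observe noisy samples of the same underlying signal.
eta, lam, c, eps = 0.2, 0.05, 1.0, 1e-2
agents = [{"D": np.empty((0, 1)), "w": np.empty(0)} for _ in range(2)]
rng = np.random.default_rng(8)
for t in range(1000):
    x = [rng.uniform(-2, 2, size=1) for _ in range(2)]
    y = [np.sin(2 * xi[0]) + 0.1 * rng.normal() for xi in x]
    # Evaluate own and neighbor predictions at each local sample before any update.
    own = [f_eval(agents[i]["D"], agents[i]["w"], x[i]) for i in range(2)]
    nbr = [f_eval(agents[1 - i]["D"], agents[1 - i]["w"], x[i]) for i in range(2)]
    for i, a in enumerate(agents):
        grad = (own[i] - y[i]) + c * (own[i] - nbr[i])  # local loss derivative + consensus penalty
        a["w"] = (1.0 - eta * lam) * a["w"]
        a["D"] = np.vstack([a["D"], x[i]])
        a["w"] = np.append(a["w"], -eta * grad)
        keep = np.abs(a["w"]) > eps                     # stand-in for the KOMP projection
        a["D"], a["w"] = a["D"][keep], a["w"][keep]
print([len(a["w"]) for a in agents])
```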

Decentralized learning with proximity constraints: When the hypothesis that all agents seek to learn a common global function is invalid, due to heterogeneity of agents’ observations or processing capabilities, imposing consensus (16) degrades decentralized learning [38].

Thus, we define a relaxation of consensus (16) called proximity constraints that incentivizes coordination without requiring agents' decisions to coincide:

$\min_{\{f_i \in \mathcal{H}\}} \; \sum_{i \in \mathcal{V}} \mathbb{E}_{\mathbf{x}_i, y_i}\big[\ell_i(f_i(\mathbf{x}_i), y_i)\big] + \frac{\lambda}{2}\sum_{i \in \mathcal{V}}\|f_i\|_{\mathcal{H}}^2 \quad \text{s.t.} \quad \mathbb{E}_{\mathbf{x}_i}\big[h_{ij}\big(f_i(\mathbf{x}_i), f_j(\mathbf{x}_i)\big)\big] \leq \gamma_{ij} \;\; \text{for all } (i, j) \in \mathcal{E}$   (18)

where $h_{ij}$ is small when $f_i(\mathbf{x}_i)$ and $f_j(\mathbf{x}_i)$ are close, and $\gamma_{ij} > 0$ defines a tolerance. This allows local solutions of (18) to be different at each node and, for instance, to incorporate correlation priors into algorithm design. To solve (18), we propose a method based on Lagrangian relaxation, specifically a functional stochastic variant of the Arrow-Hurwicz primal-dual (saddle point) method [20]. Its specific form is given as follows:

$f_{i,t+1} = \mathcal{P}_{\mathcal{H}_{\mathbf{D}_{i,t+1}}}\Big[(1 - \eta\lambda) f_{i,t} - \eta\Big(\ell_i'\big(f_{i,t}(\mathbf{x}_{i,t}), y_{i,t}\big) + \sum_{j \in n_i} \mu_{ij,t}\, \partial_1 h_{ij}\big(f_{i,t}(\mathbf{x}_{i,t}), f_{j,t}(\mathbf{x}_{i,t})\big)\Big)\kappa(\mathbf{x}_{i,t}, \cdot)\Big]$   (19)
$\mu_{ij,t+1} = \Big[\mu_{ij,t} + \eta\big(h_{ij}\big(f_{i,t}(\mathbf{x}_{i,t}), f_{j,t}(\mathbf{x}_{i,t})\big) - \gamma_{ij}\big)\Big]_{+}$   (20)

where $\mu_{ij,t} \geq 0$ are dual variables associated with the proximity constraints and $\partial_1 h_{ij}$ denotes the derivative of $h_{ij}$ with respect to its first argument.

The KOMP-based projection is applied to each local primal update (19), which permits us to trade off convergence accuracy and model complexity, recovering tradeoffs akin to Table I.
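The sketch below mirrors the structure of the primal-dual updates (19)-(20) for two agents, taking the proximity function to be the squared difference of predictions with a scalar tolerance; the proximity function, parameter values, and compression rule are assumptions for illustration, not the implementation of [20].

```python
import numpy as np

def kvec(D, x, sigma=1.0):
    return np.exp(-np.sum((D - x) ** 2, axis=1) / (2.0 * sigma ** 2))

def f_eval(D, w, x):
    return w @ kvec(D, x) if len(w) else 0.0

eta, lam, gamma, eps = 0.2, 0.05, 0.01, 1e-2
agents = [{"D": np.empty((0, 1)), "w": np.empty(0)} for _ in range(2)]
mu = np.zeros(2)                       # one dual variable per agent for its single edge
rng = np.random.default_rng(9)
for t in range(1000):
    x = [rng.uniform(-2, 2, size=1) for _ in range(2)]
    y = [np.sin(2 * xi[0]) + (0.3 if i == 1 else 0.1) * rng.normal() for i, xi in enumerate(x)]
    # Evaluate own and neighbor predictions at the local sample before updating (synchronous step).
    own = [f_eval(agents[i]["D"], agents[i]["w"], x[i]) for i in range(2)]
    nbr = [f_eval(agents[1 - i]["D"], agents[1 - i]["w"], x[i]) for i in range(2)]
    for i, a in enumerate(agents):
        diff = own[i] - nbr[i]
        # Primal step, cf. (19): loss derivative plus the dual-weighted proximity derivative.
        grad = (own[i] - y[i]) + mu[i] * 2.0 * diff
        a["w"] = (1.0 - eta * lam) * a["w"]
        a["D"] = np.vstack([a["D"], x[i]])
        a["w"] = np.append(a["w"], -eta * grad)
        keep = np.abs(a["w"]) > eps                 # stand-in for the KOMP-based projection
        a["D"], a["w"] = a["D"][keep], a["w"][keep]
        # Dual step, cf. (20): ascent on the constraint slack, projected to stay nonnegative.
        mu[i] = max(0.0, mu[i] + eta * (diff ** 2 - gamma))
print([len(a["w"]) for a in agents], mu)
```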

The guarantees for primal-dual methods in stochastic settings are given in terms of constant step-size convergence in mean, due to technical challenges of obtaining a strict Lyapunov function for (19)-(20). Specifically, let $\hat{L}$ denote the regularized penalty (the objective of (18)) evaluated at the stacked local functions. Then, for a horizon $T$, an appropriate step-size and compression-budget selection (both chosen as functions of $T$) results in respective sub-optimality and constraint violation that accumulate only sublinearly with $T$:

$\sum_{t=1}^{T} \mathbb{E}\big[\hat{L}(f_t) - \hat{L}(f^\star)\big] \leq O(T^{\alpha}), \qquad \sum_{t=1}^{T} \sum_{(i,j) \in \mathcal{E}} \mathbb{E}\Big[h_{ij}\big(f_{i,t}(\mathbf{x}_{i,t}), f_{j,t}(\mathbf{x}_{i,t})\big) - \gamma_{ij}\Big]_{+} \leq O(T^{\beta}), \qquad \alpha, \beta < 1$   (21)

Note that the quantities in (21) aggregate terms obtained over $T$ iterations, but are still bounded by sublinear functions of $T$. In other words, the average optimality gap and constraint violation approach zero for large $T$. In [20], the experimental merit of (19)-(20) is demonstrated for decentralized online problems where nonlinearity is inherent to the observation model.

IV Discussion and Open Problems

Algorithm 1 yields nearly optimal online solutions to nonparametric learning problems (Sec. II-A) while ensuring the memory never becomes unwieldy. Several open problems may be identified as a result, such as the selection of kernel hyper-parameters to further optimize performance, of which a special case has recently been studied [39]. Moreover, time-varying problems where the decision variable is a function, as in trajectory optimization, remain unaddressed. On the practical side, the algorithms developed in this section may be used for, e.g., online occupancy-mapping-based localization amongst obstacles, dynamic phase retrieval, and beamforming in mobile autonomous networks.

Risk measures that overcome online overfitting may be used to attain online algorithms that are robust to unpredictable environmental effects (Sec. II-B), an ongoing challenge in indoor and urban localization [40], as well as model mismatch in autonomous control [16]. Their use more widely in machine learning may reduce the "brittleness" of deep learning as well.

For decentralized learning, numerous enhancements of the methods in Sec. III are possible, such as those which relax conditions on the network, the smoothness required for stability, incorporation of agents’ ability to customize hyper-parameters to local observations, and reductions of communications burden, to name a few. Online multi-agent learning with nonlinear models may pave the pathway for next-generation distributed intelligence.

The general principle of sparsifying a nonparametric learning algorithm as much as possible while ensuring a descent-like property also holds when one changes the metric, ambient space, and choice of learning update rule, as has been recently demonstrated for Gaussian Processes [24]. Similar approaches are possible for Monte Carlo methods [14], and it is an open question which other statistical methods limited by the curse of dimensionality may be gracefully brought into the memory-efficient online setting through this perspective.

Overall, the methods discussed in this work echo Shannon-Nyquist sampling theorems for nonparametric learning. In particular, to estimate a (class-conditional or regression) probability density with some fixed bias, one only needs finitely many points, after which all additional training examples are redundant. Such a phenomenon may be used to employ nonparametric methods in streaming problems for future learning systems.

References

  • [1] S. Haykin, Neural Networks.   Prentice Hall, New York, 1994, vol. 2.
  • [2] V. Tikhomirov, “On the representation of continuous functions of several variables as superpositions of continuous functions of one variable and addition,” in Selected Works of AN Kolmogorov.   Springer, 1991, pp. 383–387.
  • [3] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics.   MIT Press, 2005.
  • [4] I. Atzeni, L. G. Ordóñez, G. Scutari, D. P. Palomar, and J. R. Fonollosa, “Demand-side management via distributed energy generation and storage optimization,” IEEE Trans. Smart Grid, vol. 4, no. 2, pp. 866–876, 2013.
  • [5] A. Ribeiro, “Ergodic stochastic optimization algorithms for wireless communication and networking,” IEEE Trans. Signal Process., vol. 58, no. 12, pp. 6369–6386, 2010.
  • [6] S. Ghosal and A. Van der Vaart, Fundamentals of Nonparametric Bayesian Inference.   Cambridge University Press, 2017, vol. 44.
  • [7] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” Ann. Stat., vol. 36, no. 3, pp. 1171–1220, 2008.
  • [8] P. M. Djuric, J. H. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. F. Bugallo, and J. Miguez, “Particle filtering,” IEEE Signal Process. Mag., vol. 20, no. 5, pp. 19–38, 2003.
  • [9] Z. Wang, K. Crammer, and S. Vucetic, “Breaking the curse of kernelization: Budgeted stochastic gradient descent for large-scale SVM training,” J. Mach. Learn. Res., vol. 13, no. 1, pp. 3103–3131, 2012.
  • [10] C. K. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” in Proc. of NeurIPS, 2001, pp. 682–688.
  • [11] A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Proc. of NeurIPS, 2009, pp. 1313–1320.
  • [12] A. Koppel, G. Warnell, E. Stump, and A. Ribeiro, “Parsimonious online learning with kernels via sparse projections in function space,” J. Mach. Learn Res., vol. 20, no. 3, pp. 1–44, 2019.
  • [13] A. S. Bedi, A. Koppel, and K. Rajawat, “Nonparametric compositional stochastic optimization,” arXiv preprint arXiv:1902.06011, 2019.
  • [14] V. Elvira, J. Míguez, and P. M. Djurić, “Adapting the number of particles in sequential Monte Carlo methods through an online scheme for convergence assessment,” IEEE Trans. Signal Process., vol. 65, no. 7, pp. 1781–1794, 2017.
  • [15] C. E. Garcia, D. M. Prett, and M. Morari, “Model predictive control: theory and practice- a survey,” Automatica, vol. 25, no. 3, pp. 335–348, 1989.
  • [16] T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause, “Learning-based model predictive control for safe exploration,” in Proc. of IEEE CDC, 2018, pp. 6059–6066.
  • [17] G. K. Kaleh and R. Vallet, “Joint parameter estimation and symbol detection for linear or nonlinear unknown channels,” IEEE Trans. Commun., vol. 42, no. 7, pp. 2406–2413, 1994.
  • [18] A. Nedic and A. Ozdaglar, “Distributed subgradient methods for multi-agent optimization,” IEEE Trans. Autom. Control, vol. 1, no. 54, pp. 48–61, 2009.
  • [19] A. Koppel, S. Paternain, C. Richard, and A. Ribeiro, “Decentralized online learning with kernels,” IEEE Trans. Signal Process., vol. 66, no. 12, pp. 3240–3255, 2018.
  • [20] H. Pradhan, A. S. Bedi, A. Koppel, and K. Rajawat, “Exact nonparametric decentralized online optimization,” in Proc. of IEEE GlobalSIP, 2018, pp. 643–647.
  • [21] W. Shi, Q. Ling, G. Wu, and W. Yin, “Extra: An exact first-order algorithm for decentralized consensus optimization,” SIAM J. Opt., vol. 25, no. 2, pp. 944–966, 2015.
  • [22] A. Nedić and A. Olshevsky, “Distributed optimization over time-varying directed graphs,” IEEE Trans. Autom. Control, vol. 60, no. 3, pp. 601–615, 2015.
  • [23] P. Wan and M. D. Lemmon, “Event-triggered distributed optimization in sensor networks,” in Proc. of IEEE IPSN, 2009, pp. 49–60.
  • [24] A. Koppel, “Consistent online Gaussian process regression without the sample complexity bottleneck,” in Proc. of IEEE ACC (to appear), 2019.
  • [25] J. Kivinen, A. J. Smola, and R. C. Williamson, “Online Learning with Kernels,” IEEE Trans. Signal Process., vol. 52, pp. 2165–2176, August 2004.
  • [26] D. Needell, J. Tropp, and R. Vershynin, “Greedy signal recovery review,” in Proc. of IEEE Asilomar Conf. Signals, Systems and Computers, 2008, pp. 1048–1050.
  • [27] Y. Engel, S. Mannor, and R. Meir, “The kernel recursive least-squares algorithm,” IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2275–2285, Aug 2004.
  • [28] P. Vincent and Y. Bengio, “Kernel matching pursuit,” Machine Learning, vol. 48, no. 1, pp. 165–187, 2002.
  • [29] J. Zhu and T. Hastie, “Kernel Logistic Regression and the Import Vector Machine,” Journal of Computational and Graphical Statistics, vol. 14, no. 1, pp. 185–205, 2005.
  • [30] T. Le, T. Nguyen, V. Nguyen, and D. Phung, “Dual space gradient descent for online learning,” in Proc. of NeurIPS, 2016, pp. 4583–4591.
  • [31] T. Le, V. Nguyen, T. D. Nguyen, and D. Phung, “Nonparametric budgeted stochastic gradient descent,” in Proc. of AISTAT, 2016, pp. 654–662.
  • [32] Z. Wang and S. Vucetic, “Online passive-aggressive algorithms on a budget,” in Proc. of AISTAT, 2010.
  • [33] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning.   Springer Series in Statistics, New York, 2001, vol. 1.
  • [34] H. Levy and H. M. Markowitz, “Approximating expected utility by a function of mean and variance,” The American Economic Review, vol. 69, no. 3, pp. 308–317, 1979.
  • [35] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath, “Coherent measures of risk,” Mathematical finance, vol. 9, no. 3, pp. 203–228, 1999.
  • [36] S. Ahmed, “Convexity and decomposition of mean-risk stochastic programs,” Math. Prog., vol. 106, no. 3, pp. 433–446, 2006.
  • [37] M. Wang, E. X. Fang, and H. Liu, “Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions,” Math. Prog., vol. 161, no. 1-2, pp. 419–449, 2017.
  • [38] A. Koppel, B. M. Sadler, and A. Ribeiro, “Proximity without consensus in online multiagent optimization,” IEEE Trans. Signal Process., vol. 65, no. 12, pp. 3062–3077, 2017.
  • [39] M. Peifer, L. F. Chamon, S. Paternain, and A. Ribeiro, “Sparse learning of parsimonious reproducing kernel Hilbert space models,” in Proc. of IEEE ICASSP, 2019, pp. 3292–3296.
  • [40] V. Elvira and I. Santamaria, “Multiple importance sampling for efficient symbol error rate estimation,” IEEE Signal Process. Lett., 2019.