1. Introduction
In the past few decades, there has been a rapid growth of interest in automated learning from data across various scientific fields including statistics [49], engineering [27], computer science [20, 36], mathematics, and many more. An overview of machine learning problems in a wide range of contexts (statistical learning theory, pattern recognition, system identification, deep learning, and so on) can be found in [15, 5, 1, 17]. One of the main paradigms is to learn an unknown target function from a given collection of input-output pairs (supervised learning), which can be rephrased as the problem of finding an approximation of a multidimensional function. For example, in [32, 31], the authors demonstrated a connection between approximation theory and regularization with feedforward multilayer networks. In general, learning a smooth function from data is ill-posed unless a priori information about either the data structure or the generating function is provided [47, 30, 12]. One of the well-known methods to make the learning problem well-posed is to exploit additional properties of the target function [16]. For example, if the target function depends only on a few active directions associated with a suitable random matrix, the function can be recovered from a small number of samples [12]. On the other hand, many well-known learning methods restrict the target function to a particular function class (such as radial basis functions, projection pursuit, feedforward neural networks, and tensor product methods) and add a penalty (such as Tikhonov regularization or a sparsity constraint) to the associated parameter estimation problem.
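As a concrete illustration of the penalized estimation just mentioned, the following minimal sketch fits a linear-in-parameters model with Tikhonov (ridge) regularization. The feature matrix, noise level, and regularization weight are our own illustrative choices, not quantities from this paper.

```python
import numpy as np

def tikhonov_fit(A, y, lam=1e-2):
    # Solve argmin_c ||A c - y||_2^2 + lam * ||c||_2^2 via the normal equations
    K = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ y)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))            # hypothetical feature matrix
c_true = rng.standard_normal(10)
y = A @ c_true + 0.01 * rng.standard_normal(50)
c_hat = tikhonov_fit(A, y)
```

The penalty shrinks the estimate toward zero, trading a small bias for stability when the feature matrix is ill-conditioned; a sparsity constraint would replace the squared penalty with an absolute-value one.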
Recently, sparse models combined with data-driven methods have been investigated intensively for learning nonlinear partial differential equations, nonlinear dynamical systems, and graph-based networks. The model selection problem for dynamical systems from time series dates back to [8], where the authors investigate concepts from dynamical systems theory to recover the underlying structure from data. In [52], the authors construct a sampling matrix from the data matrix and its powers to recover the ordinary differential equations and find an optimal Kronecker product representation for the governing equations. Furthermore, based on the observation that many governing equations have a sparse representation with respect to high-dimensional polynomial spaces, the authors in [4] developed the SINDy algorithm, which uses that sampling matrix and a sequential least-squares thresholding algorithm to recover the governing equations of some unknown dynamical systems. The convergence of the SINDy algorithm is provided in [53]. A group-sparse model was proposed in [42] to learn governing equations from a family of dynamical systems with bifurcation parameters. By exploiting the cyclic structure of many nonlinear differential equations, the authors in [43] proposed an approach to identify the active basis terms using fewer random samples (in some cases on the order of a few snapshots). For the noisy case, the authors in [40] use the integral formulation of the differential equation to reduce the effect of noise and identify the model from a smoother basis set. To learn a nonlinear partial differential equation from a spatiotemporal dataset, the authors in [39] proposed a LASSO-based approach using a dictionary of partial derivatives. In [37], the authors developed an adaptive ridge-regression version of [4] for learning nonlinear PDEs, while in [33] a hidden physics model based on Gaussian processes was presented. On the other hand, the data are often contaminated by noise, contain outliers, have missing values, or come in a limited number of samples. When the given data are limited, several works address learning problems ranging from sampling strategies in high-dimensional dynamics using random initial conditions [41], to a weighted minimization on lower sets [34, 6], model predictive control using SINDy [21], and sample complexity reduction for linear time-invariant systems [11]. In [44], the authors proposed a method to approximate an unknown function from noisy measurements via sequential approximation. Geometric methods, such as [25], can be used to approximate functions in high dimensions when the data concentrate on lower-dimensional sets.

In supervised learning analysis, the input data are usually assumed to be independent and identically distributed (i.i.d.). However, this assumption does not hold in many applications such as speech recognition, medical diagnosis, signal processing, computational biology, and financial prediction. Alternatively, for non-i.i.d. processes satisfying certain mixing conditions, various reconstruction results have been established in different contexts. The convergence rates of several machine learning algorithms have been studied for non-i.i.d. data; examples include the weighted average algorithm [9], least squares support vector machines (LS-SVMs) [18], and one-vs-all multiclass plug-in classifiers [10]. In [50], the authors discussed several mixing conditions for weakly dependent observations which guarantee the consistency and asymptotic normality of the nonlinear least squares estimator. Minimum complexity regression estimators for dependent and strongly mixing observations were proposed in [29] using certain Bernstein-type inequalities for dependent observations. In [38], a conditionally i.i.d. model for pattern recognition was proposed, where the inputs are conditionally independent given the output labels. In [46], the authors proved that if the data-generating process satisfies a certain law of large numbers, then support vector machines are consistent. In [19], a Bernstein-type inequality for geometrically mixing processes is established and applied to deduce an oracle inequality for generic regularized empirical risk minimization algorithms. Using a strong central limit theorem for chaotic data and compressed sensing results, the authors in [48] proved a reconstruction guarantee for sparse recovery of the governing equations of three-dimensional chaotic systems with outliers. The common technique in the works mentioned above is the application of either a central limit theorem or a suitable concentration inequality to the given data.

In this work, we study the problem of learning nonlinear functions from identically distributed (but not necessarily independent) data that are corrupted by outliers and/or contaminated by noise. By expressing the target function in the multivariate polynomial space, the learning problem is recast as a sparse robust linear regression problem where we incorporate both the unknown coefficients and the corruptions in a basis pursuit framework. The main contribution of our paper is a reconstruction guarantee for the associated optimization problem, where the (augmented) sampling matrix is formed from the data matrix, its powers, and the identity matrix. Although the data may not be i.i.d., we prove that the sampling matrix satisfies the null space property, provided that the data lie in a compact set and satisfy a suitable concentration inequality. Consequently, the basis pursuit problem is guaranteed to have a unique solution and to be stable with respect to noise.
The paper is organized as follows. In Section 2, we explain the problem setting. In Section 3, we first recall the theory from compressive sensing, then present the theoretical reconstruction guarantees. In Section 4, we state the recovery results for various types of data including i.i.d. data, exponentially strongly mixing data, geometrically mixing data, and uniformly ergodic Markov chains. The numerical implementations and results are described in Section 5. We discuss the conclusion and future work in Section 6.
2. Problem Statement
We would like to learn a function from data points , where is corrupted data, is the uncorrupted part, represents the corruption, and denotes noise. We say that is an outlier if the corruption is nonzero. Assume that the function of interest is a multivariate polynomial of degree at most :
Let , , be the matrix where the rows are , and be the data matrix,
Then we form the dictionary matrix from data,
(2.1) 
where is the maximal number of multivariate monomials of degree at most .
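The construction of such a monomial dictionary can be sketched directly. The helper below (our own illustrative code, with variable names that are not the paper's notation) enumerates all monomials of the input variables up to a given total degree; for an m-by-d data matrix and maximal degree p it produces binom(d + p, p) columns, matching the count above.

```python
import numpy as np
from itertools import combinations_with_replacement

def monomial_dictionary(X, p):
    """Dictionary whose columns are all monomials of the d input variables
    up to total degree p (constant column included)."""
    m, d = X.shape
    cols = [np.ones(m)]                       # degree-0 monomial
    for deg in range(1, p + 1):
        for idx in combinations_with_replacement(range(d), deg):
            # idx repeats a variable once per power, e.g. (0, 0) -> x_0^2
            cols.append(np.prod(X[:, idx], axis=1))
    return np.column_stack(cols)

X = np.random.default_rng(0).uniform(-1.0, 1.0, size=(20, 3))
Phi = monomial_dictionary(X, p=2)             # 20 x 10 dictionary
```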
Denoting the coefficient vector and , we can reformulate our problem as follows:
Find such that .
Without corruptions and with an arbitrary noise vector , the problem is classically solvable by least squares regression once . With corruptions, whose locations can be arbitrary but are unknown beforehand, if and at least of the measurements are uncorrupted, then one could in theory do a regression on each of the subsets of measurements and retain the set with the smallest error; however, this is an infeasible combinatorial algorithm. Thus, the convex relaxation of this combinatorial algorithm is a natural choice of reconstruction algorithm:
(2.2) 
On the other hand, if the polynomial coefficients are sparse or the polynomial function can be approximated by a sparse polynomial, the learning problem can be recast as follows:
(2.3) 
or, more generally, as the corrupted sensing problem [24, 14, 22],
(2.4) 
For the remainder of the paper, we denote the sparsity level of by , and the row-sparsity level of by . In the noiseless case (), the constraint reduces to an exact linear system.
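To make the augmented formulation concrete, here is a small sketch of the corrupted basis pursuit problem: the dictionary is stacked with the identity, and the combined coefficient-corruption vector is recovered by minimizing its entrywise absolute sum, cast as a linear program via the standard positive/negative split. The Gaussian dictionary, dimensions, supports, and outlier value are illustrative assumptions only, not the paper's setup.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """min ||z||_1 subject to A z = y, via the LP split z = u - v, u, v >= 0."""
    m, n = A.shape
    cost = np.ones(2 * n)
    A_eq = np.hstack([A, -A])
    res = linprog(cost, A_eq=A_eq, b_eq=y,
                  bounds=[(0, None)] * (2 * n), method="highs")
    u, v = res.x[:n], res.x[n:]
    return u - v

rng = np.random.default_rng(1)
m, K = 60, 20
Phi = rng.standard_normal((m, K))             # stand-in for the dictionary
c_true = np.zeros(K); c_true[[2, 7]] = [1.5, -2.0]   # sparse coefficients
e_true = np.zeros(m); e_true[5] = 3.0                # one outlier
y = Phi @ c_true + e_true
A = np.hstack([Phi, np.eye(m)])               # augmented sampling matrix
z = basis_pursuit(A, y)
c_hat, e_hat = z[:K], z[K:]
```

By construction the recovered vector is feasible and its combined absolute sum is no larger than that of the true coefficient-corruption pair, which is what the theory below turns into an exact recovery statement.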
3. Reconstruction Guarantee Analysis
Before presenting the properties of the matrix and theoretical guarantees for the corresponding optimization problems, we first recall some results from compressive sensing including the null space property and the stable null space property (see [13] for a comprehensive overview).
3.1. Theory from Compressive Sensing
Definition 3.1.
A matrix is said to satisfy

the null space property of order if
for any set with .

the stable null space property of order with constant if
for any set with .
Proposition 3.2 (Recovery guarantee given null space property).
Given a matrix , every sparse vector with is the unique solution of
(3.1) 
if and only if satisfies the null space property of order s.
Proposition 3.3 (Recovery guarantee given stable null space property).
Suppose a matrix satisfies the stable null space property of order with constant . Then, for any with , a solution of the optimization problem (3.1) approximates the vector with error
The null space property for the matrix , along with the existence of an sparse solution to the underdetermined system of equations, is a necessary and sufficient condition for sparse solutions of the NP-hard minimization problem,
to be exactly recovered via the minimization (3.1). On the other hand, the stable null space property of the matrix guarantees that any solution, sparse or not, can be recovered up to an error governed by its distance to sparse vectors.
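The null space property cannot be certified efficiently in general, but a randomized search over the kernel can refute it numerically. The sketch below (our own illustrative helper, not from the paper) samples null-space vectors normalized in the absolute-sum norm and reports the worst observed value of ||v_S||_1 - ||v_{S^c}||_1 over all supports of size s: a nonnegative value disproves the null space property of order s, while a negative value is only evidence in its favor, not a proof.

```python
import numpy as np
from itertools import combinations
from scipy.linalg import null_space

def nsp_worst_case(A, s, n_samples=500, seed=0):
    """Largest observed ||v_S||_1 - ||v_{S^c}||_1 over sampled v in ker(A)
    and all supports |S| = s; the NSP of order s requires this to be
    negative for every nonzero null-space vector."""
    rng = np.random.default_rng(seed)
    N = null_space(A)                    # orthonormal basis of ker(A)
    n = A.shape[1]
    worst = -1.0
    for _ in range(n_samples):
        v = N @ rng.standard_normal(N.shape[1])
        v /= np.linalg.norm(v, 1)        # now ||v_{S^c}||_1 = 1 - ||v_S||_1
        for S in combinations(range(n), s):
            worst = max(worst, 2.0 * np.abs(v[list(S)]).sum() - 1.0)
    return worst

A = np.random.default_rng(1).standard_normal((8, 12))
w = nsp_worst_case(A, s=1)
```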
3.2. Theoretical Guarantees
We will show that if the uncorrupted data satisfy an appropriate concentration inequality and their common distribution is nondegenerate (that is, if implies
contains infinitely many elements), then the polynomial coefficients of the unknown function as well as the location of the outliers can be exactly recovered with high probability from the unique solution of the
minimization problem (2.3), provided that the output values are exact. When the output values contain dense noise, we show that every solution of the associated optimization problem can be approximated by a sparse solution. To begin with, we will show that the matrix , where is constructed from all monomials up to degree , satisfies the null space property.
Theorem 3.4.
Fix . Consider where the uncorrupted data are bounded by
and identically distributed according to a nondegenerate probability distribution
, and the corruption is bounded by and row sparse. Let , and let be a function such that (3.2) 
when is large enough and is some chosen constant. Assume that satisfies the following concentration inequality:
(3.3) 
for any and any bounded Borel function . Here .
Then, there is a constant depending only on , and , so that when satisfies:
(3.4)  
the matrix , where is the dictionary matrix (2.1), satisfies the null space property of order with probability at least .
Proof.
For each , define as follows:
We first evaluate the lower bound for the summation . For any nonzero , we have . Indeed, if , then almost surely. Since is nondegenerate, there are infinitely many such that . This implies which is a contradiction. Therefore, for any .
On the other hand, since the set is compact and nonempty, we can apply the extreme value theorem for the continuous function to get the following bound:
for some constant . Note that depends on , and .
According to a well-known result on the covering number (see, for example, Appendix C.2 of [13]), there exists a finite set of points in of cardinality
such that
Applying the union bound on and using the assumption , we derive:
provided that
(3.5) 
Hence,
Therefore, for any , we have:
(3.6) 
with probability at least .
For each there exists so that . Applying Hölder’s inequality for with , we obtain:
Combining with the inequality (3.6), we obtain
with probability at least , provided that
(3.7) 
By linearity, we have in the same event,
(3.8) 
Next, we will estimate the lower bound for , where . Denote , where is defined as follows
Applying Hölder’s inequality, we have
Similarly, we have
Therefore,
Since , we deduce and
(3.9) 
Thus, in the event that (3.8) holds, combining it with (3.9) yields
(3.10)  
provided moreover that
Now, we are ready to verify the null space property condition for in the event that (3.8) holds. Let be an arbitrary set of size and . Let be the last entries of , and
Since , and . Using the inequality (3.2), we have
On the other hand, using the inequality (3.10), we obtain
Then when satisfies (3.5), (3.7), and
(3.11) 
we have , for any . That completes our proof. ∎
Remark 3.5.

Since for any with probability , we conclude that the matrix is of full column rank.

From the proof, we also derive that if , the matrix satisfies the partial null space property of order (see [3], Definition 3.1).
Combining with the reconstruction results from compressed sensing (see Proposition 3.2 and Proposition 3.3), we immediately obtain the following reconstruction guarantees.
Theorem 3.6.
Fix . Suppose we observe corrupted measurements
where and satisfy the assumptions in Theorem 3.4, and is a sparse multivariate polynomial with at most monomial terms of degree at most . Let , , be the dictionary matrix (2.1), and let be the unknown polynomial coefficients of . The problem can be recast as
for some .

When , then . Suppose ; then there is a constant depending only on , and , so that when satisfies (3.4), the polynomial coefficients of as well as the vector can be exactly recovered with probability from the unique solution of the minimization problem:
Remark 3.7.

The same result in Theorem 3.6 can be extended immediately to learn a system of high-dimensional polynomial functions with the same coefficient matrix, where each is a multivariate polynomial of degree at most :

By considering a slight modification of the matrix , , we can verify that also satisfies the null space property, provided that is sufficiently large. Indeed, every can be written as . Then with the lower bound on , we can immediately show , provided that
(3.13) 
Hence, the corrupted compressed sensing problem
will have a unique solution.
4. Recovery Results for Various Types of Data
In this section, we apply our results to several popular types of dependent data. Indeed, we only need to verify that these types of data satisfy the required concentration inequality in Theorem 3.4. For the sake of simplicity, we state the recovery results for the noiseless case of (i.e., when ).
4.1. Independent and Identically Distributed (i.i.d.) Data
In [45]
, the authors provide the following Bernstein inequality for i.i.d. random variables:
Lemma 4.1.
If are i.i.d. random variables with , then the following probability inequality holds
(4.1) 
where
and is any bounded Borel function.
In this case, the function in the concentration inequality (3.3) is
and satisfies the condition (3.2) for any constant , when is large enough. Indeed, the condition on can be rewritten as
(4.2) 
If the maximal polynomial degree is fixed, the smaller is, the smaller is needed to satisfy the inequality (4.2).
As a result, we have the following recovery result for i.i.d. data.
Theorem 4.2.
Fix . Suppose we observe corrupted measurements
where the uncorrupted data are i.i.d. according to a nondegenerate distribution and bounded by ; the corruption is bounded by and row sparse; and is a sparse multivariate polynomial with at most monomials of degree at most . Then, when satisfies (3.4) and (4.2), the polynomial coefficients of the function can be exactly recovered and the outliers can be successfully detected from the unique solution of (2.3) with high probability.
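The Bernstein inequality of Lemma 4.1 can be checked empirically. The sketch below compares the classical Bernstein tail bound for bounded i.i.d. variables, P(|mean - E X| >= t) <= 2 exp(-m t^2 / (2 sigma^2 + 2 B t / 3)), against a Monte Carlo estimate of the tail probability; the uniform test distribution, sample size, and threshold are our own illustrative choices.

```python
import numpy as np

def bernstein_bound(t, m, sigma2, B):
    # Classical Bernstein bound for i.i.d. X with |X - E X| <= B, Var X = sigma2
    return 2.0 * np.exp(-m * t**2 / (2.0 * sigma2 + 2.0 * B * t / 3.0))

rng = np.random.default_rng(0)
m, trials, t = 200, 5000, 0.1
X = rng.uniform(-1.0, 1.0, size=(trials, m))   # E X = 0, |X| <= 1, Var X = 1/3
emp = np.mean(np.abs(X.mean(axis=1)) >= t)     # Monte Carlo tail estimate
bound = bernstein_bound(t, m, sigma2=1.0 / 3.0, B=1.0)
```

As expected, the empirical tail frequency sits below the Bernstein bound, which is itself far below the trivial bound of one.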
4.2. Exponentially Strongly Mixing Data
We first recall the definition of mixing coefficients and a concentration inequality for mixing. For a stationary stochastic process , define (see [35, 29])
The stochastic process is said to be exponentially strongly mixing if
for some and , where the constants and are assumed to be known. Note that strong mixing implies asymptotic independence between observations separated by sufficiently long time intervals.
In [29], the authors proved the following concentration inequality for exponentially strongly mixing processes:
Lemma 4.3.
If are stationary exponentially strongly mixing with
, then the following probability inequality holds for sufficiently large:
(4.3) 
where
and is any bounded Borel function.
Hence the concentration inequality (3.3) is satisfied with
Since
(4.4) 
for any when is large enough, we have the following recovery result for exponentially strongly mixing data.
Theorem 4.4.
Fix . Suppose we observe corrupted measurements
where the uncorrupted data are stationary exponentially strongly mixing and bounded by ; the corruption is bounded by and row sparse; and is a sparse multivariate polynomial with at most monomials of degree at most . If the stationary distribution of is nondegenerate, then when is sufficiently large and satisfies Equation (3.4) and Equation (4.4), the polynomial coefficients of the function can be exactly recovered and the outliers can be successfully detected from the unique solution of (2.3) with high probability.
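A standard example of an exponentially strongly mixing process is a stationary Gaussian AR(1) chain with autoregression coefficient of magnitude below one. The sketch below (illustrative parameters only) compares the deviation of empirical means of a bounded function along such a chain with the i.i.d. case having the same marginal: the dependent means still concentrate, consistent with the theory above, only with larger typical deviations.

```python
import numpy as np

def ar1_path(m, a, rng):
    # Stationary Gaussian AR(1): x_t = a * x_{t-1} + w_t with w_t ~ N(0, 1)
    x = np.empty(m)
    x[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - a * a))   # stationary start
    for t in range(1, m):
        x[t] = a * x[t - 1] + rng.normal()
    return x

rng = np.random.default_rng(0)
a, m, trials = 0.7, 400, 1000
sigma = 1.0 / np.sqrt(1.0 - a * a)        # std of the stationary marginal
f = np.tanh                                # a bounded Borel test function
dev_mix = np.array([abs(f(ar1_path(m, a, rng)).mean()) for _ in range(trials)])
dev_iid = np.array([abs(f(rng.normal(scale=sigma, size=m)).mean()) for _ in range(trials)])
```

Positive temporal correlation inflates the variance of the sample mean, so the dependent deviations are typically larger than the i.i.d. ones, which is exactly why the mixing rate enters the sample-size requirement.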
4.3. Geometrically (Time-Reversed) Mixing Data
The mixing processes were introduced in [28] to cover many common dynamical systems that are not necessarily mixing, such as Lasota-Yorke maps, unimodal maps, and piecewise expanding maps in higher dimensions. Moreover, the geometrically mixing processes are strongly related to some well-known results on the decay of correlations for dynamical systems (see [19]).
Let be an valued stationary process on . For a seminorm on a vector space of bounded measurable functions that satisfies , we define the norm by .
Let and be the σ-algebras generated by and , respectively. Then, the mixing coefficient is
and the timereversed mixing coefficient is
A sequence of random variables is called geometrically (time-reversed) mixing if
for some constants , and . The following concentration inequality for stationary geometrically (time-reversed) mixing processes is a direct consequence of the Bernstein inequality presented in [19].
Lemma 4.5.
Let be a stationary geometrically (time-reversed) mixing process. Consider a function such that , , and . Then, for sufficiently large we have
(4.5) 
In this case, the concentration inequality (3.3) holds for
and satisfies the condition (3.2) for any when is large enough. Hence, we have the recovery result for geometrically (time-reversed) mixing data.
Theorem 4.6.
Fix . Suppose we observe corrupted measurements