Recovery guarantees for polynomial approximation from dependent data with outliers

Learning nonlinear systems from noisy, limited, and/or dependent data is an important task across various scientific fields including statistics, engineering, computer science, mathematics, and many more. In general, this learning task is ill-posed; however, additional information about the data's structure or about the behavior of the unknown function can make the task well-posed. In this work, we study the problem of learning nonlinear functions from corrupted and dependent data. The learning problem is recast as a sparse robust linear regression problem where we incorporate both the unknown coefficients and the corruptions in a basis pursuit framework. The main contribution of our paper is to provide a reconstruction guarantee for the associated ℓ_1-optimization problem where the sampling matrix is formed from dependent data. Specifically, we prove that the sampling matrix satisfies the null space property and the stable null space property, provided that the data are compact and satisfy a suitable concentration inequality. We show that our recovery results are applicable to various types of dependent data such as exponentially strongly α-mixing data, geometrically C-mixing data, and uniformly ergodic Markov chains. Our theoretical results are verified via several numerical simulations.

1. Introduction

In the past few decades, there has been a rapid growth of interest in automated learning from data across various scientific fields including statistics [49], engineering [27], computer science [20, 36], mathematics, and many more. An overview of machine learning problems in a wide range of contexts (statistical learning theory, pattern recognition, system identification, deep learning, and so on) can be found in [15, 5, 1, 17]. One of the main paradigms is to learn an unknown target function from a given collection of input-output pairs (supervised learning), which can be rephrased as the problem of finding an approximation of a multi-dimensional function. For example, in [32, 31], the authors demonstrated a connection between approximation theory and regularization with feedforward multilayer networks. In general, learning a smooth function from data is ill-posed unless a priori information about either the data structure or the generating function is provided [47, 30, 12].

One of the well-known methods to make the learning problem well-posed is to exploit additional properties of the target function [16]. For example, if the target function depends only on a few active directions associated with a suitable random matrix, it can be recovered from a small number of samples [12]. On the other hand, many well-known learning methods restrict the target function to a particular function class (such as radial basis functions, projection pursuit, feed-forward neural networks, and tensor product methods) and add a penalty (such as Tikhonov regularization or sparsity constraints) to the associated parameter estimation problem.

Recently, sparse models combined with data-driven methods have been investigated intensively for learning nonlinear partial differential equations, nonlinear dynamical systems, and graph-based networks. The model selection problem for dynamical systems from time series dates back to [8], where the authors investigated concepts from dynamical systems theory to recover the underlying structure from data. In [52], the authors construct a sampling matrix from the data matrix and its powers to recover ordinary differential equations and find an optimal Kronecker product representation for the governing equations. Furthermore, based on the observation that many governing equations have a sparse representation with respect to high-dimensional polynomial spaces, the authors in [4] developed the SINDy algorithm, which uses that sampling matrix and a sequential least-squares thresholding algorithm to recover the governing equations of unknown dynamical systems. The convergence of the SINDy algorithm is provided in [53]. A group-sparse model was proposed in [42] to learn governing equations from a family of dynamical systems with bifurcation parameters. By exploiting the cyclic structure of many nonlinear differential equations, the authors in [43] proposed an approach to identify the active basis terms using fewer random samples (in some cases on the order of a few snapshots). For the noisy case, the authors in [40] use the integral formulation of the differential equation to reduce the effect of noise and identify the model from a smoother basis set. To learn a nonlinear partial differential equation from spatio-temporal data, the authors in [39] proposed a LASSO-based approach using a dictionary of partial derivatives. In [37], the authors developed an adaptive ridge-regression version of [4] for learning nonlinear PDEs, while in [33] a hidden physics model based on Gaussian processes was presented.

On the other hand, the data are often contaminated by noise, contain outliers, have missing values, or are limited in quantity. When the given data are limited, several works address learning problems ranging from sampling strategies in high-dimensional dynamics using random initial conditions [41], to a weighted ℓ_1-minimization on lower sets [34, 6], model predictive control using SINDy [21], and sample complexity reduction for linear time-invariant systems [11]. In [44], the authors proposed a method to approximate an unknown function from noisy measurements via sequential approximation. Geometric methods, such as [25], can be used to approximate functions in high dimensions when the data concentrate on lower-dimensional sets.

In classical supervised learning analysis, the input data are assumed to be independent and identically distributed (i.i.d.). However, this assumption does not hold in many applications such as speech recognition, medical diagnosis, signal processing, computational biology, and financial prediction. Alternatively, for non-i.i.d. processes satisfying certain mixing conditions, various reconstruction results have been established in different contexts. The convergence rates of several machine learning algorithms have been studied for non-i.i.d. data; examples include the weighted average algorithm [9], least squares support vector machines (LS-SVMs) [18], and one-vs-all multiclass plug-in classifiers [10]. In [50], the authors discussed several mixing conditions for weakly dependent observations which guarantee the consistency and asymptotic normality of the nonlinear least squares estimator. Minimum complexity regression estimators for m-dependent observations and strongly mixing observations were proposed in [29] using certain Bernstein-type inequalities for dependent observations. In [38], a conditionally i.i.d. model for pattern recognition was proposed, where the inputs are conditionally independent given the output labels. In [46], the authors proved that if the data-generating process satisfies a certain law of large numbers, support vector machines are consistent. In [19], a Bernstein-type inequality for geometrically C-mixing processes is established and applied to deduce an oracle inequality for generic regularized empirical risk minimization algorithms. Using a strong central limit theorem for chaotic data and compressed sensing results, the authors in [48] proved a reconstruction guarantee for sparse recovery of governing equations of three-dimensional chaotic systems with outliers. The common technique in these works is the application of either a central limit theorem or a suitable concentration inequality for the given data.

In this work, we study the problem of learning nonlinear functions from identically distributed (but not necessarily independent) data that are corrupted by outliers and/or contaminated by noise. By expressing the target function in the multivariate polynomial space, the learning problem is recast as a sparse robust linear regression problem where we incorporate both the unknown coefficients and the corruptions in a basis pursuit framework. The main contribution of our paper is to provide a reconstruction guarantee for the associated ℓ_1-optimization problem where the (augmented) sampling matrix is formed from the data matrix, its powers, and the identity matrix. Although the data may not be i.i.d., we prove that the sampling matrix satisfies the null space property, provided that the data are compact and satisfy a suitable concentration inequality. Consequently, the basis pursuit problem is guaranteed to have a unique solution and to be stable with respect to noise.

The paper is organized as follows. In Section 2, we explain the problem setting. In Section 3, we first recall the relevant theory from compressive sensing and then present the theoretical reconstruction guarantees. In Section 4, we state the recovery results for various types of data including i.i.d. data, exponentially strongly α-mixing data, geometrically C-mixing data, and uniformly ergodic Markov chains. The numerical implementations and results are described in Section 5. We discuss conclusions and future work in Section 6.

2. Problem Statement

We would like to learn a function from data points , where is corrupted data, is the uncorrupted part, represents the corruption, and denotes noise. We say that is an outlier if the corruption is non-zero. Assume that the function of interest is a multivariate polynomial of degree at most :

Let , , be the matrix where the rows are , and be the data matrix,

Then we form the dictionary matrix from data,

(2.1)

where is the maximal number of -multivariate monomials of degree at most .
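To make the construction concrete, the following sketch evaluates all monomials of total degree at most a given bound at the data points; the names monomial_dictionary, U, and max_degree are illustrative and not taken from the paper.

import numpy as np
from itertools import combinations_with_replacement

def monomial_dictionary(U, max_degree):
    # U is an (m, d) array whose rows are the data points; the returned
    # matrix has one column per monomial of total degree <= max_degree,
    # including the constant term.
    m, d = U.shape
    columns = [np.ones(m)]
    for degree in range(1, max_degree + 1):
        # each multiset of variable indices defines one monomial
        for combo in combinations_with_replacement(range(d), degree):
            columns.append(np.prod(U[:, list(combo)], axis=1))
    return np.column_stack(columns)

# example: 100 points in dimension 3, monomials up to degree 2
U = np.random.uniform(-1, 1, size=(100, 3))
Phi = monomial_dictionary(U, max_degree=2)
print(Phi.shape)  # (100, 10), since there are C(3+2, 2) = 10 such monomials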

Denoting the coefficient vector and , we can reformulate our problem as follows:

Find such that .


Without corruptions and with arbitrary noise vector , the problem is classically solvable by least squares regression once . With corruptions, whose locations can be arbitrary but are unknown beforehand, if and at least of the measurements are uncorrupted, then one could in theory perform a regression on each of the subsets of measurements and retain the set with the smallest error; however, this is an infeasible combinatorial algorithm. Thus, the convex relaxation of this combinatorial algorithm is a natural choice for a reconstruction algorithm:

(2.2)

On the other hand, if the polynomial coefficients are sparse or the polynomial function can be approximated by a sparse polynomial, the learning problem can be recast as follows:

(2.3)

or, more generally, as the corrupted sensing problem [24, 14, 22],

(2.4)

For the remainder of the paper, we denote the sparsity level of by , and the row-sparsity level of by . In the noiseless case (), the equality constraint is imposed exactly.
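As an illustration of how a formulation of this type can be solved in practice, the sketch below minimizes the ℓ_1 norm of the stacked coefficient and corruption vectors subject to an equality constraint, using the cvxpy modeling library; the equal weighting of the two ℓ_1 terms (lam = 1) is an illustrative assumption rather than the paper's specific choice.

import numpy as np
import cvxpy as cp

def corrupted_basis_pursuit(Phi, y, lam=1.0):
    # Minimize ||c||_1 + lam * ||e||_1 subject to Phi @ c + e == y, where
    # c collects the polynomial coefficients and e the per-sample corruptions.
    m, N = Phi.shape
    c = cp.Variable(N)
    e = cp.Variable(m)
    problem = cp.Problem(cp.Minimize(cp.norm1(c) + lam * cp.norm1(e)),
                         [Phi @ c + e == y])
    problem.solve()
    return c.value, e.value

Stacking c and e corresponds to applying the augmented matrix formed from the dictionary and the identity to the vector of unknowns, which is the structure analyzed in the next section.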

3. Reconstruction Guarantee Analysis

Before presenting the properties of the matrix and the theoretical guarantees for the corresponding ℓ_1-optimization problems, we first recall some results from compressive sensing, including the null space property and the stable null space property (see [13] for a comprehensive overview).

3.1. Theory from Compressive Sensing

Definition 3.1.

A matrix is said to satisfy

  • the null space property of order s if ‖v_S‖_1 < ‖v_{S^c}‖_1 for every nonzero vector v in its null space and for any set S with |S| ≤ s;

  • the stable null space property of order s with constant 0 < ρ < 1 if ‖v_S‖_1 ≤ ρ ‖v_{S^c}‖_1 for every vector v in its null space and for any set S with |S| ≤ s.

Proposition 3.2 (Recovery guarantee given null space property).

Given a matrix , every -sparse vector with is the unique solution of

(3.1)

if and only if satisfies the null space property of order s.

Proposition 3.3 (Recovery guarantee given stable null space property).

Suppose a matrix satisfies the stable null space property of order with constant . Then, for any with , a solution of the optimization problem (3.1) approximates the vector with -error
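In the standard formulation of this result (see [13]), stated here in generic notation for a target vector x and a minimizer x^# of (3.1), the error bound takes the form

\[
\| x - x^{\#} \|_1 \;\le\; \frac{2(1+\rho)}{1-\rho}\, \sigma_s(x)_1,
\]

where ρ is the stable null space constant and σ_s(x)_1 denotes the ℓ_1-distance of x to the set of s-sparse vectors.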

The null space property for the matrix , along with the existence of an -sparse solution to the underdetermined system of equations, is a necessary and sufficient condition for sparse solutions of the NP-hard ℓ_0-minimization problem,

to be exactly recovered via the ℓ_1-minimization (3.1). On the other hand, the stable null space property of the matrix guarantees that any solution, sparse or not, can be recovered up to an error governed by its distance to the set of s-sparse vectors.
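Checking the null space property of a given matrix directly is computationally hard in general; a common empirical surrogate, sketched below, is to test whether randomly drawn s-sparse vectors are recovered exactly by ℓ_1-minimization, which by Proposition 3.2 succeeds for every s-sparse vector precisely when the null space property of order s holds. The function name, trial count, and tolerance are illustrative.

import numpy as np
import cvxpy as cp

def empirical_recovery_rate(A, s, trials=50, tol=1e-6, seed=0):
    # Fraction of random s-sparse vectors recovered exactly by basis pursuit.
    rng = np.random.default_rng(seed)
    m, N = A.shape
    successes = 0
    for _ in range(trials):
        x = np.zeros(N)
        support = rng.choice(N, size=s, replace=False)
        x[support] = rng.standard_normal(s)
        z = cp.Variable(N)
        cp.Problem(cp.Minimize(cp.norm1(z)), [A @ z == A @ x]).solve()
        if np.linalg.norm(z.value - x) <= tol * max(1.0, np.linalg.norm(x)):
            successes += 1
    return successes / trials

# example: a Gaussian matrix typically recovers vectors of small sparsity
A = np.random.default_rng(1).standard_normal((60, 120)) / np.sqrt(60)
print(empirical_recovery_rate(A, s=5))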

3.2. Theoretical Guarantees

We will show that if the uncorrupted data satisfy an appropriate concentration inequality and their common distribution is non-degenerate (that is, if implies contains infinitely many elements), then the polynomial coefficients of the unknown function as well as the locations of the outliers can be exactly recovered with high probability from the unique solution of the ℓ_1-minimization problem (2.3), provided that the output values are exact. When the output values contain dense noise, we show that every solution of the associated optimization problem can be approximated by a sparse solution.

To begin with, we will show that the matrix , where is constructed from all monomials up to degree , satisfies the null space property.

Theorem 3.4.

Fix . Consider where the uncorrupted data are -bounded by and identically distributed according to a non-degenerate probability distribution , and the corruption is -bounded by and -row sparse. Let , and let be a function such that

(3.2)

when is large enough and is some chosen constant. Assume that satisfies the following concentration inequality:

(3.3)

for any and any bounded Borel function . Here .

Then, there is a constant depending only on , and , so that when satisfies:

(3.4)

the matrix , where is the dictionary matrix (2.1), satisfies the null space property of order with probability at least .

Proof.

For each , define as follows:

We first evaluate the lower bound for the summation . For any non-zero , we have . Indeed, if , then -almost surely. Since is non-degenerate, there are infinitely many such that . This implies which is a contradiction. Therefore, for any .

On the other hand, since the set is compact and nonempty, we can apply the extreme value theorem for the continuous function to get the following bound:

for some constant . Note that depends on , and .

According to a well-known result on the covering number (for example, see Appendix C.2, [13]), there exists a finite set of points in of cardinality

such that

Applying the union bound on and using the assumption , we derive:

provided that

(3.5)

Hence,

Therefore, for any , we have:

(3.6)

with probability at least .

For each there exists so that . Applying Hölder's inequality for with , we obtain:

Combining with the inequality (3.6), we obtain

with probability at least , provided that

(3.7)

By linearity, we have in the same event,

(3.8)

Next, we will estimate the lower bound for , where . Denote , where is defined as follows

Applying Hölder's inequality, we have

Similarly, we have

Therefore,

Since , we deduce and

(3.9)

Thus, in the event that (3.8) holds, combining with (3.9) we have

(3.10)

provided moreover that

Now, we are ready to verify the null space property condition for in the event that (3.8) holds. Let be an arbitrary set of size and . Denote by the last entries of , and

Since , and . Using the inequality (3.2), we have

On the other hand, using the inequality (3.10), we obtain

Then when satisfies (3.5), (3.7), and

(3.11)

we have , for any . That completes our proof. ∎

Remark 3.5.
  • Since for any with probability , we conclude that the matrix is of full column rank.

  • From the proof, we also derive that if , the matrix satisfies the partial null space property of order (see [3], Definition 3.1).

  • If we keep the conditions (3.5) and (3.7), and change the condition (3.11) to

    (3.12)

    then , for any and any set with . It means satisfies the stable null space property of order .

Combining with the reconstruction results from compressed sensing (see Proposition 3.2 and Proposition 3.3), we immediately obtain the following reconstruction guarantees.

Theorem 3.6.

Fix . Suppose we observe corrupted measurements

where and satisfy the assumptions in Theorem 3.4, and is a sparse multivariate polynomial with at most monomial terms of degree at most . Let , , be the dictionary matrix (2.1), and let be the unknown polynomial coefficients of . The problem can be recast as

for some .

  • When , then . Suppose ; then there is a constant depending only on , and , so that when satisfies (3.4), the polynomial coefficients of as well as the vector can be exactly recovered with probability from the unique solution to the ℓ_1-minimization problem:

  • When and is not necessarily sparse, if satisfies (3.12), (3.5), and (3.7), a solution to the ℓ_1-minimization (2.3) approximates the true solution with -error:

    where is the best -term approximation (vector of largest-magnitude entries) of and is the stable null space constant of the matrix .

Remark 3.7.
  • The partial -minimization problem in [48]

    is a special case of problem (2.3) with . In other words, given corrupted input-output data where the corruption measurements are -sparse, we can recover the polynomial function that fits the given data and detect the outliers correctly.

  • The same result in Theorem 3.6 can be extended immediately to learn a system of high-dimensional polynomial functions with the same coefficient matrix (a column-wise sketch is given after this remark), where each is a multivariate polynomial of degree at most :

  • By considering a slight modification of the matrix , , we can verify that also satisfies the null space property, provided that is sufficiently large. Indeed, every can be written as . Then with the lower bound on , we can immediately show , provided that

    (3.13)

    Hence, the corrupted compressed sensing problem

    will have a unique solution.
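Regarding the extension to a system of polynomial functions in the second item above, a minimal column-wise sketch is as follows, reusing the corrupted_basis_pursuit routine sketched in Section 2 (the names recover_system and Y are illustrative):

import numpy as np

def recover_system(Phi, Y, lam=1.0):
    # Y has one column per output component; solve one corrupted basis
    # pursuit per column against the shared dictionary Phi.
    results = [corrupted_basis_pursuit(Phi, Y[:, j], lam) for j in range(Y.shape[1])]
    C = np.column_stack([c for c, _ in results])
    E = np.column_stack([e for _, e in results])
    return C, E  # coefficient matrix and corruption matrix

Since the corruptions are row sparse, a joint formulation with a row-group penalty on the corruption matrix (for example, the sum of row ℓ_2 norms) could additionally exploit the shared outlier locations; the column-wise version above is kept only for simplicity.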

4. Recovery Results for Various Types of Data

In this section, we apply our results to several popular types of dependent data. Indeed, we only need to verify that these types of data satisfy the required concentration inequality in Theorem 3.4. For the sake of simplicity, we state the recovery results for the noiseless case of (i.e., when ).

4.1. Independent and Identically Distributed (i.i.d.) Data

In [45], the authors provide the following Bernstein inequality for i.i.d. random variables:

Lemma 4.1.

If are i.i.d. random variables with , then the following probability inequality holds

(4.1)

where

and is any bounded Borel function.
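For orientation, a standard form of Bernstein's inequality for bounded i.i.d. variables reads as follows (generic notation; σ² denotes the variance of f(Z_1), M an almost sure bound on |f(Z_i) − E f(Z_1)|, and the precise constants in [45] may differ):

\[
\mathbb{P}\left( \left| \frac{1}{m} \sum_{i=1}^{m} \big( f(Z_i) - \mathbb{E} f(Z_1) \big) \right| \ge \varepsilon \right)
\le 2 \exp\left( - \frac{m \varepsilon^2}{2\sigma^2 + \tfrac{2}{3} M \varepsilon} \right).
\]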

In this case, the function in the concentration inequality (3.3) is

and satisfies the condition (3.2) for any constant , when is large enough. Indeed, the condition on can be re-written as

(4.2)

If the maximal polynomial degree is fixed, the smaller is, the smaller is needed to satisfy the inequality (4.2).

As a result, we have the following recovery result for i.i.d. data.

Theorem 4.2.

Fix . Suppose we observe corrupted measurements

where the uncorrupted data are i.i.d. according to a non-degenerate distribution and -bounded by ; the corruption is -bounded by and -row sparse; and is a sparse multivariate polynomial with at most monomials of degree at most . Then, when satisfies (3.4) and (4.2), the polynomial coefficients of the function can be exactly recovered and the outliers can be successfully detected from the unique solution of (2.3) with high probability.
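A minimal numerical illustration of this statement, reusing the monomial_dictionary and corrupted_basis_pursuit sketches from the earlier sections (all parameter values are arbitrary choices for the example, not the quantities required by the theorem):

import numpy as np

rng = np.random.default_rng(0)

# i.i.d. uniform data on [-1, 1]^3 and a sparse quadratic ground truth
U = rng.uniform(-1, 1, size=(200, 3))
Phi = monomial_dictionary(U, max_degree=2)
c_true = np.zeros(Phi.shape[1])
c_true[[1, 4, 7]] = [2.0, -1.5, 0.5]

# corrupt a few outputs with large outliers
y = Phi @ c_true
outliers = rng.choice(200, size=8, replace=False)
y[outliers] += rng.normal(0.0, 20.0, size=8)

c_hat, e_hat = corrupted_basis_pursuit(Phi, y)
print(np.max(np.abs(c_hat - c_true)))                  # coefficient error
print(np.sort(np.flatnonzero(np.abs(e_hat) > 1e-3)))   # detected outlier locations
print(np.sort(outliers))                               # true outlier locations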

4.2. Exponentially Strongly α-mixing Data

We first recall the definition of α-mixing coefficients and a concentration inequality for α-mixing processes. For a stationary stochastic process , define (see [35, 29])

The stochastic process is said to be exponentially strongly α-mixing if

for some and , where the constants and are assumed to be known. Note that strong mixing implies asymptotic independence over sufficiently large time.
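For experimentation, a standard example of an exponentially strongly α-mixing sequence is a stationary first-order autoregressive process; the sketch below generates such data (the parameter a and the tanh transformation, used to obtain bounded samples while preserving the mixing rate, are illustrative choices rather than constructions from the paper).

import numpy as np

def ar1_mixing_data(m, d, a=0.5, seed=0):
    # Stationary AR(1) process u_{i+1} = a * u_i + w_i with |a| < 1 and
    # standard Gaussian innovations; such a chain is geometrically ergodic
    # and hence exponentially strongly alpha-mixing.
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(d) / np.sqrt(1 - a**2)  # draw from the stationary law
    samples = np.empty((m, d))
    for i in range(m):
        u = a * u + rng.standard_normal(d)
        samples[i] = u
    # a fixed bounded transformation keeps the data compact without
    # increasing the mixing coefficients
    return np.tanh(samples)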

In [29], the authors proved the following concentration inequality for exponentially strongly α-mixing processes:

Lemma 4.3.

If are stationary exponentially strongly α-mixing with , then the following probability inequality holds for sufficiently large:

(4.3)

where

and is any bounded Borel function.

Hence the concentration inequality (3.3) is satisfied with

Since

(4.4)

for any when is large enough, we have the recovery result for exponentially strongly α-mixing data.

Theorem 4.4.

Fix . Suppose we observe corrupted measurements

where the uncorrupted data are stationary exponentially strongly α-mixing and -bounded by ; the corruption is -bounded by and -row sparse; and is a sparse multivariate polynomial with at most monomials of degree at most . If the stationary distribution of is non-degenerate, then when is sufficiently large and satisfies (3.4) and (4.4), the polynomial coefficients of the function can be exactly recovered and the outliers can be successfully detected from the unique solution of (2.3) with high probability.

4.3. Geometrically (time-reversed) C-mixing Data

The C-mixing processes were introduced in [28] to cover many common dynamical systems that are not necessarily α-mixing, such as Lasota-Yorke maps, uni-modal maps, and piecewise expanding maps in higher dimensions. Moreover, geometrically C-mixing processes are strongly related to well-known results on the decay of correlations for dynamical systems (see [19]).

Let be an -valued stationary process on . For a semi-norm on a vector space of bounded measurable functions that satisfies , we define the C-norm by .

Let and be the σ-algebras generated by and , respectively. Then, the C-mixing coefficient is

and the time-reversed C-mixing coefficient is

A sequence of random variables is called geometrically (time-reversed) C-mixing if

for some constants , and . The following concentration inequality for stationary geometrically (time-reversed) C-mixing processes is a direct consequence of the Bernstein inequality presented in [19].

Lemma 4.5.

Let be a stationary geometrically (time-reversed) C-mixing process. Consider a function such that , , and . Then, for sufficiently large we have

(4.5)

In this case, the concentration inequality (3.3) holds for

and satisfies the condition (3.2) for any when is large enough. Hence, we have the recovery result for geometrically (time-reversed) C-mixing data.

Theorem 4.6.

Fix . Suppose we observe corrupted measurements