    # Nonlinear generalization of the single index model

Single index model is a powerful yet simple model, widely used in statistics, machine learning, and other scientific fields. It models the regression function as g(<a,x>), where a is an unknown index vector and x are the features. This paper deals with a nonlinear generalization of this framework to allow for a regressor that uses multiple index vectors, adapting to local changes in the responses. To do so we exploit the conditional distribution over function-driven partitions, and use linear regression to locally estimate index vectors. We then regress by applying a kNN type estimator that uses a localized proxy of the geodesic metric. We present theoretical guarantees for estimation of local index vectors and out-of-sample prediction, and demonstrate the performance of our method with experiments on synthetic and real-world data sets, comparing it with state-of-the-art methods.

## Code Repositories

### local_sim_experiments

Experiment suite for local sim paper

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Many problems in data analysis can be formulated as learning a function from a given data set in a high-dimensional space. Due to the curse of dimensionality, accurate regression on high-dimensional functions typically requires a number of samples that scales exponentially with the ambient dimension

[Sto82]. A common approach to mitigating these effects is to impose structural assumptions on the data. Indeed, a number of recent advances in data analysis and numerical simulation are based on the observation that high-dimensional, real-world data is inherently structured, and that the relationship between the features and the responses is of a lower dimensional nature [AC09].

The most direct such model, which has become an important prior for many statistical and machine learning problems, assumes a -dimensional relationship, written as

 f(X)=g(⟨a,X⟩), (1)

where the features and responses are related through an unknown index vector , and an unknown monotonic function . The model (1), called the single index model (SIM), first appeared in economic and statistical communities in the early 90s [Ich93, HH96]. Moreover, it provides a basis for more complex models such as multi-index models [HTFF05, LZC05]

[LBH15]. In particular, even though the monotonicity assumption on

might seem restrictive, note that most neural networks use monotonic activation functions, such as ReLU or the sigmoid.

An assumption commonly shared by SIM and its generalizations is that there is a single lower dimensional linear space that accounts for the complexity in relating and . While simple, this assumption is only a first level approximation and is rarely observed in real-world regression problems. The goal of this paper is to relax the global linearity assumption present in model (1) in order to locally adapt to changes in the relationship between and . Specifically, we propose the nonlinear single index model (NSIM), defined by

 f(X)=g(πγ(X)), (2)

where represents the orthogonal projection a simple -curve , defined by

 (3)

and is a bi-Lipschitz function. Since bi-Lipschitz functions are also monotonic, SIM is a special example of (2), where .

Before formally describing our assumptions and detailing the approach, let us begin with a couple of comments. Recall that smooth curves can be locally approximated by affine approximations, i.e. where is a local tangent vector of . Then (2) can be seen as a family of problems where iterates over pieces of that are approximately affine. Now notice that the model and the monotonicity of imply an equivalence between and . Therefore, instead of iterating over approximately affine pieces of , we can equivalently iterate over a partition of consisting of disjoint intervals , and assert that (2) can be split up into localized SIM problems

 E[Y|X,f(X)∈Rj]≈gj(⟨aj,X⟩),j=1,…,J, (4)

where the tangent vectors now play the role of index vectors in (1).

In Figure 1 we study the effects of such an approach on several UCI data sets. Namely, for each data set we partition the data into sets, as detailed above, learn a SIM estimator on each of the sets, and then plot the generalization error of the resulting estimator (i.e

the one described in this paper) as a function of the hyperparameter

. Given sufficient amount of data, we can observe that replacing (1) with its localized counterpart (4) often results in better estimation. For example, on the Yacht data set the generalization error improves by more than 30 percent for

. Notice though that increasing the number of localized pieces does not always improve the performance. This can mostly be attributed to the fact that splitting the original data set into disjoint subgroups reduces the number of samples within each group, which has a detrimental effect on the variance of the estimator. In other words, we face a typical bias-variance trade-off, implying that the hyperparameter

needs to be carefully selected.

We will now provide an overview of relevant literature and describe the details of our approach.

#### Related work.

To the best of our knowledge, relaxations of SIM have not yet been considered in the given form. However, three research areas are highly relevant: linear single- and multi-index models; nonlinear sufficient dimension reduction; manifold regression. Below we provide a short summary and overview of the most significant and relevant achievements in each of these fields.

Single- and multi index models have been extensively researched, and we therefore, restrict ourselves to conceptually related work. Most studies focus only on the estimation of index vector(s). Along those lines, the most similar work is [HJS01] where the authors use iterative local linear regression to estimate the index vector . The locality is enforced by kernel weights, which are initially set to be spherical around the estimation point, and then iteratively reshaped so that the isolines resemble level set boundaries of a strictly monotonous link function . The approach has been extended to the multi-index case [DJS08].

Another relevant line of work are methods based on inverse regression that began with the introduction of sliced inverse regression (SIR) [Li91], and was followed up by [DC00] (SAVE), [Li92] (PHD), [XTLZ02] (MAVE), [LZC05] (Contour regression), [LW07]

(Directional regression), and others. These methods use estimates of inverse moments

, to estimate the index vector (or the sufficient dimension reduction space).

There are also several methods that simultaneously learn the link function and the index vector. We mention Isotron [KS09] and Slisotron [KKSK11], that iteratively update the link function using isotonic regression, and the vector using the function estimate; [GRB17] that uses a similar approach but additionally assumes sparsity of the index vector; [KP16] that uses an iterative procedure and spline estimates; and [Rad15] that proposes higher dimensional splines.

The work and mathematical understanding of nonlinear sufficient dimension reduction is still in its early stages and there are many open questions. Most of the existing studies consider kernelized versions of linear estimators (such as SIR or SAVE) to globally linearize the problem in feature space, and then apply well-known methods regression methods, see [LAL11, LLC13, Wu08, YHL09].

Finally, model (2) can be considered from the viewpoint of manifold regression, where the goal is to estimate a function defined on the data. Manifold regression methods, such as [BL07, Kpo11, MWZ10], generally assume that the marginal distribution of is either supported on or in its close vicinity. As a consequence, Euclidean distances can be used to locally approximate the geodesic metric. This is a strong assumption which is implicitly or explicitly leveraged by all manifold regression techniques, and presents a breaking point for their effective use. In this work, we instead consider distributions that are spread in all directions of the ambient space around the curve . Therefore, geodesic proximity cannot be inferred from Euclidean distances and we need to use a localized proxy for the geodesic distance.

#### Main idea and estimation procedure for the NSIM model.

The model (2) increases the flexibility of the ordinary SIM (1) by allowing for varying index vectors, corresponding to different regimes of the response . Consequently, a natural approach would be to partition the data into several groups, based on , and assume a SIM in each subgroup. Since SIM estimators have been extensively studied we can rely on well-known methods to locally estimate the index vector. In particular, our approach follows three steps.

In the first step we partition the data set into sets and through a disjoint union of the responses, for intervals , and set

 Yj:=Y∩Rj,Xj:={Xi∈X:Yi∈Yj}. (5)

We refer to sets as level sets, since they can be defined as in the noise-free case. The optimal method for partitioning as depends on the marginal distribution of , and is best chosen after inspecting the empirical density. For example, we recommend using dyadic cells of if the density of is roughly uniform, and stochastically equivalent blocks222These ensure equal number of samples (up to ) in each level set.

if the probability mass is unevenly distributed. Also, it can be useful to first transform

’s, e.g. using a transform, to obtain more balanced level sets.

In the second step we compute estimates of local index vectors by using linear regression on and . Namely, let be the local empirical covariance matrix. Then, set , where is defined by

 (6)

or equivalently,

 ^bj:=^Σ†j^E(Xj,Yj)((Y−^EYjY)(X−^EXjX)). (7)

We note that the slope vector of linear regression has been used in [HJS01] to estimate the index vector for SIM. Our analysis of is different since the nonlinearity of the geometry induces an additional approximation error. To reduce the approximation error the number of level sets needs to grow with and achieving the optimal error with requires a carefully balanced relationship , see Section 3.

In the final step we use a kNN-type estimator to predict for out-of-sample . Since the make-or-break point of kNN-estimators regards how distances between and training samples are measured, the critical point of this step is about the selection of an appropriate distance function. The issue is that the optimal choice (the geodesic metric on ) is not available since is not known, and the naive choice (the Euclidean metric) generally leads to estimation rates that depend on the ambient dimension, and thus the curse of dimensionality.

To develop a proxy metric, consider now the ordinary SIM (1). In this case the geodesic metric is equivalent to the Euclidean distance of projected samples, and training the kNN estimator on projected samples achieves optimal univariate regression rates if the index vector approximates the true index vector with a sufficiently high rate. In other words is a good proxy for the geodesic metric. The NSIM case is more challenging because we have different index vectors to choose from, and cannot be a priori assigned to any level set since we do not know . Still, we can show that the quantity (referred to as proxy metric)

 Δ(x∗,Xi):=∣∣⟨^aj(Xi),x∗−Xi⟩∣∣,where j(Xi) is such that Xi∈Xj(Xi), (8)

approximates the geodesic metric reasonably well, provided that the piece of connecting and is linear enough, see (19). This motivates the following estimator: let contain the closest training samples of with respect to within a Euclidean ball around of radius . Then we set

 ^fk(x∗):=^ENΔ(x∗,k,η)Y=1k∑Xi∈NΔ(x∗,k,η)Yi. (9)

As we will discuss in Section 3.4, the radius plays a dual role. It needs to be large enough so that there are sufficiently many samples to choose neighbors from, but also small enough that so (8) is still a good proxy for the geodesic metric. The entire estimation approach is summarized in Algorithm 1.

#### Computational complexity.

The first two steps, i.e. partitioning and computing tangents, are dominated by , which is mostly due to forming covariance matrices and computing the generalized inverse. The out-of-sample prediction has a complexity of operations per evaluation.

#### Contributions.

In this work, we provide a thorough introduction and theoretical analysis of NSIM model (2) by considering the estimation problem from data samples , where is the unknown underlying distribution of . Our main result concerns the estimation of local index vectors, respectively the tangent field of . Moreover, we provide a simple, and computationally efficient algorithm to estimate that achieves nearly-optimal theoretical guarantees in some scenarios. Finally, we validate our theoretical results using synthetic data experiments, and we show that the additional flexibility of the NSIM allows superior performance over the SIM on some real data sets.

The theoretical results of this work can be summarized as follows:

• Theorem 3 ensures accurate approximations of localized index vectors, namely we show333 is the distribution induced by for any arc-length parametrization of . in the regime where is negligible compared to , and the corresponding piece of can be well approximated by an affine space.

• Corollary 10 ensures a consistent estimation of with nearly-optimal univariate regression rate for piecewise constant estimators in the noise-free setting ().

• Corollary 11 considers the case of bounded noise and ensures in general. In case of an ordinary SIM however, we recover the optimal univariate regression rate for piecewise constant estimators if the noise is sufficiently small.

#### Organization of the paper.

We begin in Section 2 by introducing the model, the non-parametric assumptions on the distribution, data generating function, and function noise. In Section 3 we present the theoretical analysis of our estimators, starting with error rates for learning the geometry (i.e. the localized index vectors) and concluding with the guarantees for prediction accuracy. Most of the proofs needed for this section are postponed to the Appendix. In Section 4 we present numerical experiments on synthetic data (to validate the theoretical results) and real-world data sets (to compare with state-of-the-art estimators). Section 5 concludes with a discussion about our results and future directions.

#### General notation.

denotes the standard Euclidean norm for vectors, and the spectral norm for matrices. denotes the geodesic metric on . Provided that is an arc-length parametrization, this means . For a discrete set of points we use to denote the number of elements: . On the other hand, if is a connected subsegment of , then denotes the length of the segment ; and if is an interval, is the length of the interval. Here an interval always refers to a closed and connected subset of the real line. The Moore-Penrose inverse of a matrix is denoted by . denotes a ball of radius around a point , with respect to a metric .

We denote by and

the expectation and covariance matrix of a random variable

, and by and the sample mean and sample covariance over all samples. denotes the sample mean over all samples that belong to a subset , that is, We use

to denote the joint probability distribution of

, and to denote the marginal, i.e., the probability distribution of , and is its support. The abbreviation a.s. is used as a shorthand for almost sure events (with respect to implicit random vectors).

## 2 Theoretical framework for the NSIM model

Whereas the description of the NSIM model in Section 1 may convey its basic motives, it is certainly not adequate for the purposes of a rigorous investigation. Due to NSIM’s scope, it is relatively easy to construct examples that fit the model but for which estimation from finite samples is not possible. Thus, our goal now is to define a framework that allows a theoretical analysis, yet is broad enough to encompass both the single index model and its nonlinear relaxation. In the following we describe the assumptions on the function class, the underlying nonlinearity, and on the distribution of the data set.

#### Regularity assumptions for f and Im(γ).

We consider simple, simply-connected and smooth curves, i.e. that admit an arc-length parametrization with . We assume that its curvature and torsion are bounded and denote and . We consider with a bounded Hessian, , that satisfy for some -bi-Lipschitz function , that is

 L−1fdγ(v,v′)≤∣∣g(v)−g(v′)∣∣≤Lfdγ(v,v′), for all v,v′∈Im(γ). (10)

Through rescaling we can always assume . We can, without loss of generality, align with , i.e. choose an orientation such that . Another important quantity is the reach of - the largest such that any point at distance less than from has a unique nearest point on [Fed59]. This ensures that is well defined for all within the reach, i.e. such that .

#### Distributional assumptions.

The following assumptions 1 - 5 characterize the family of distributions that are admissible to the analysis in the next section. They are, apart from 1, standard and mostly related to commonly used assumptions. At the end of the section we provide a family of distributions that satisfies all assumptions related to the marginal .

Our first assumption asserts that is centered in the middle of the distribution and reads

1. [label=(A0)]

2. a.s.

By rewriting it as , we see that it is inspired by the linear conditional mean assumption444This asserts that for any , and some , where is the index space. from multi-index model literature [Li91, DC00], which holds, for example, for any elliptical distribution. Moreover, it also implies the identifiability of by the distribution of in the following sense.

###### Lemma 1.

Let , with orthogonal projections defined according to (3), and let be a random vector such that and are a.s. unique. Let and be measurable and injective. If , and a.s., then a.s.

###### Proof.

The assumption implies that we have a.s.

 0=E[X−πD(X)|f(X)]−E[X−πD′(X)|f(X)]=E[πD′(X)|f(X)]−E[πD(X)|f(X)]. (11)

Since conditioning on an injective function of a random variable is equivalent with conditioning on the random variable itself555This can be proven by applying twice for , and , , where is the -algebra spanned by a random variable ., we get

 E[πD′(X)|f(X)]=E[πD′(X)|g(πD′(X))]=E[πD′(X)|πD′(X)]=πD′(X),

and similarly . Plugging into (11) the claim follows. ∎

Let us now split the random vector into , its component on , and , the component orthogonal to , see Figure 2. Random vectors and are the content of next two assumptions.

1. [label=(A0)]

2. holds almost surely.

3. There are such that holds for any . Denote .

These assumptions are standard for estimation problems involving nonlinear geometries, such as manifold estimation or manifold regression. In particular, some version of 2 is required because the projection , and consequently the function , is in general not uniquely defined for points outside the reach of .

For an interval we will in the rest of the paper denote conditional moments such as and by and .

Our fourth assumption is the one that is most specific to the NSIM and reveals a limitation of our approach. Let be the induced random variable, taking values in , and for an interval , define the mean . The orthogonal projection onto the normal space at is denoted as , see Figure 2. We assume there exist universal constants such that for all intervals with we have

1. [label=(A4.0), leftmargin = [   A4.1]]

2. For all , with , we have ,

3. .

Assumption 1

defines a lower bound for non-zero eigenvalues of the normal component of the localized covariance matrix. This will be critical for ensuring that linear regression selects the local gradient (which is a tangent of

) instead of the local curvature vector of , if is small enough. Specifically, the requirement reads (see Theorem 3), which implies that a larger provides better differentiation between local gradients and local curvature vectors. We also note that for fixed, depends on the dimensionality because

 B2≥E[∥W∥2|R]=Tr(Cov(W|R))≈Tr(Cov(PR,⊥X|R))>rank(Cov(PR,⊥X|R))C⊥.

For a full rank this implies .

On the other hand, assumption 2 is a nonlinear relaxation of the commonly used constant conditional covariance assumption for multi-index model estimation, cf. [DC00, LZC05, LW07]. It is perhaps best understood as a Lipschitz-continuity condition of the sliced conditional covariance , seen as a function of , with respect to and . The parameter acts as an amplifier of curvature effects, but does not play a crucial role in the analysis.

Our last assumption concerns the function noise .

1. [label=(A0)]

2. Let , where , and almost surely. Furthermore, whenever we have .

The joint indepedence implies , and with 1 we get . By the law of total covariance (see (42)) it follows that for any interval . The independence assumption also implies , but when conditioning on the random variables and become increasingly dependent as tends to . Since holds without any assumptions, 5 is trivially satisfied when replacing by . However, this encompasses very general scenarios, e.g. neglecting any potential symmetries in the distribution of , which is our main motivation to quantify the effect more precisely.

Finally, we provide a family of marginal distributions that satisfies 1 - 2. The proof is in Appendix A.3.

###### Lemma 2.

Let be a -Lipschitz map such that the columns of form an orthonormal basis of the normal space of at . Let , for some random vector , with , , and . Then satisfies 1, 2, 1, 2 with , , and .

## 3 Main results

In this section we conduct a theoretical analysis of the proposed estimator (9). We begin in Section 3.2 by providing guarantees for the estimation of local index vectors in terms of the number of samples and , the number of level sets. In Section 3.3 we use these results to show that the estimator (9) is accurate on almost linear pieces of the curve. To extend the estimator to the entire curve in Section 3.4 we establish results for the proxy metric in a restricted search space around an out-of-sample point .

### 3.1 Preliminary remarks

The main results are in Theorem 3, and in Corollaries 10 and 11. Generally speaking, our analysis naturally follows the steps that motivated the construction of the estimator:

1. [label=Step 0, leftmargin=[Step 3], topsep = 0pt, itemsep = 0ex]

2. Partition ’s according to a dyadic partitioning of the range666Technically, we ought to use , and to account for noise at the boundaries, but for the sake of simplicity we assume is thresholded to , such that for all . , i.e.

 (12)
3. Estimate local index vectors by using (7) in each partition.

4. Regression on unseen data using , where is a set denoting the -nearest neighbors of according to the proxy geodesic metric (8).

Denote the tangent at by , and let . Before providing details of the analysis we begin with a couple of comments regarding 1 - 3.

First, to balance bias and variance of the estimator the number of level sets in 1 has to be carefully chosen. Recall that local index vector are defined as , where is a solution of local linear regression (7). The variance of the estimator is mainly due to the estimation error , and decreases as and decrease. The bias of the estimator is on one hand due to the approximation of by piecewise linear spaces, and on the other due to (first term in (15)). Both components decrease only with .

Our analysis shows that achieves the optimal balance, implying that there are two regimes. In the first regime, where , increasing will simultaneously reduce both the bias and the variance of the estimator. In the second regime the function noise precludes further decreasing (since we cannot decrease the length of the corresponding segment of that contains .) In other words, the noise level imposes a lower bound on , and the bias does not completely vanish in the nonlinear case (cf. Lemma 6).

Second, the analysis of 2 is valid in a regime where satisfies certain demands:

• ensuring that the covariance in the direction of the tangent does not vanish: we require , which is guaranteed if (Lemma 18). Otherwise, it would be impossible to approximate using linear regression;

• solving nonlinear problems with linear methods: we need to isolate sufficiently short portions of so that the corresponding local geometry can be well approximated by a linear space and that the influence of the Hessian normal to is weak compared to the magnitude of the gradient (which we implicitly use to estimate local index vectors).

To quantify the first requirement we introduce a constant which is, up to absolute constants (see Lemma 28 for full details), defined by

The first factor is a technical requirement that arises from the nonlinearity of the problem. The second factor on the other hand reflects the fact that if the conditional marginal measure (induced by ) could be supported on a set of Lebesgue measure zero.

In what follows we focus our attention on how our estimates depend on fundamental parameters in the model: the sample size , length of (resp. number of level sets ), curvature and torsion, and . We use and to absorb positive multiplicative factors for terms that decay with the total sample size or with the local sample size , and will mostly omit other constants such as the Lipschitz constant or . Details regarding other constants that affect the model, and higher order terms in and in , are deferred to the Appendix.

For we set , . For an interval we define , and , where is the infimum of all connected pieces of such that .

### 3.2 Local direction approximation via linear regression

Define where is the a.s. unique level set with and is given by (7). The error in estimating the local index vectors can be decomposed as

 ∥^a(Xi)−a(Xi)∥≤∥∥^aj(Xi)−aj(Xi)∥∥+∥∥aj(Xi)−a(Xi)∥∥≤∥∥^aj(Xi)−aj(Xi)∥∥+κRj∣∣S(Rj)∣∣. (13)

Note that assuming the distribution of

is equivalent to a uniform distribution (

i.e. 3 holds), the second term, which we refer to as the quantization error, cannot be improved. We will now bound the first term, which improves as decreases and as the number of corresponding data samples increases.

###### Theorem 3.

Let , and be a closed interval with777The precise requirement on is restated in Theorem 29 in the Appendix, and is defined in (72).

 2σε<|R|≲max{12KRLf,1CT2(LH+Cε),1KR√C⊥CT,Rmax}, (14)

Assume we have i.i.d. copies of , where . Denote and . For , with , we have with probability of at least

 ∥^a−a(¯γR)∥≲CT2C⊥(KR|R|2+u2|R|√NR). (15)
###### Remark 4 (Special cases of Theorem 3).

1. [topsep=-10pt,itemsep=5pt,parsep=0pt,partopsep=-10pt, leftmargin =    ]

2. If corresponds to a flat piece of the curve the first term in (15) vanishes. Therefore, provided that is small,

is an unbiased estimator of

, with convergence rate . This result covers the usual single index model, and our estimation rate matches that of known estimators (e.g. [HJS01]).

3. In the noise-free case we can remove the lower bound restriction on in (14). Theorem 3 then implies , provided . Note that a similar rate is achieved in the noisy case, in the regime .

We now use a union bound argument and Theorem 3 to bound . The first step is to ensure that there are sufficiently many samples (simultaneously) in all sets .

###### Lemma 5.

Assume we have i.i.d. copies of . Let and be a partitioning according to (12). Then for any we have with probability at least

 minj=1,…,J∣∣Xj∣∣≳(J−1−2σϵ)N|I|u2. (16)
###### Proof.

We first show that for each set there exists a segment of a minimal length such that implies almost surely. Then we use Lemma 15 to conclude the result. So let arbitrary and denote , . Then , and thus there exists a segment with such that . Now we use Lemma 15 to get

 P(minj∈J∣∣Xj∣∣≥J−1−2σεLfc−VN|I|u2−2)≥P({Vi}Ni=1 is a% |I|u2c−VN-net)≥1−exp(−u2).

We will now combine Lemma 5 and Theorem 3 to bound the estimation error for local index vectors in all level sets.

###### Lemma 6.

Assume we have i.i.d. copies of . Let and be a partitioning according to (12) such that satisfies (14). Moreover, assume satisfies

 √NLB:= ⎷(J−1−2σϵ)N|I|log2(J)≳max{CT2,1C⊥}log(D)u3. (17)

Then for we have with probability at least

 maxi=1,…,N∥^a(Xi)−a(Xi)∥≲CT2C⊥u3√NLBJ+KJ. (18)

Let us make a couple of comments about the implications of this lemma before writing down its proof. First, the terms in the bound on the right hand side of (18) can also be written in a local form, i.e., global curvature and torsion bounds can be replaced with curvature and torsion bounds on segments of the curve that correspond to a specific sample . In other words, the learning of index vectors is consistent on locally linear pieces.

Second, (17) and (18) suggest that the choice of that achieves the optimal error bound is , where is large enough so that (17) is satisfied. Looking at (18), this implies that in order to decrease the error we ought to increase as long as (i.e. subdivide the data set into a larger number of subsets), while keeping the number of samples within each subset constant (up to log-factors and lower bounded by ). The rationale behind this is that further subdividing the data set not only reduces the approximation error (which is caused by the curvature), but it also reduces the variance in the linear regression part of the problem, i.e. when estimating by . Note also that in (18), compared to (15), we lose an order in , which is due to the last term in (13), i.e. using quantization to approximate the entire tangent field over the respective level set.

###### Proof of Lemma 6.

Lemma 5 and (17) imply that with probability at least we have

 minj=1,…,J√∣∣Xj∣∣≳√NLBlog(J