# Approximation by finite mixtures of continuous density functions that vanish at infinity

It is often claimed that, given sufficiently many components, finite mixture models can approximate any other probability density function (pdf) to an arbitrary degree of accuracy. Unfortunately, the nature of this approximation result is often left unclear. We prove that finite mixture models constructed from pdfs in C_0 can be used to approximate various classes of approximands in a number of different modes. That is, we prove that approximands in C_0 can be uniformly approximated, approximands in C_b can be uniformly approximated on compact sets, and approximands in L_p can be approximated with respect to the L_p norm, for p∈[1,∞). Furthermore, we also prove that measurable functions can be approximated almost everywhere.


## 1 Introduction

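Let $x$ be an element of the Euclidean space defined by $\mathbb{R}^n$ and the norm $\|\cdot\|_2$, for some $n\in\mathbb{N}$. Let $f:\mathbb{R}^n\to\mathbb{R}$ be a function such that $f\ge 0$ everywhere and $\int f\,\mathrm{d}\lambda=1$, where $\lambda$ is the Lebesgue measure. We say that $f$ is a probability density function (pdf) on the domain $\mathbb{R}^n$ (a qualifier that we drop from here on). Let $g$ be another pdf and, for each $m\in\mathbb{N}$, define the functional class

$$\mathcal{M}_m^g=\left\{h:h(x)=\sum_{i=1}^m c_i\frac{1}{\sigma_i^n}g\left(\frac{x-\mu_i}{\sigma_i}\right),\ \mu_i\in\mathbb{R}^n,\ \sigma_i\in\mathbb{R}_+,\ c\in\mathbb{S}^{m-1},\ i\in[m]\right\},$$

where $c^\top=(c_1,\dots,c_m)$, $\mathbb{R}_+=(0,\infty)$,

$$\mathbb{S}^{m-1}=\left\{c\in\mathbb{R}^m:\sum_{i=1}^m c_i=1\text{ and }c_i\ge 0,\ \forall i\in[m]\right\},$$

$[m]=\{1,\dots,m\}$, and $(\cdot)^\top$ is the matrix transposition operator. We say that any $h\in\mathcal{M}_m^g$ is a location-scale finite mixture of the pdf $g$.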
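As a minimal numerical sketch of the definition above, the following evaluates a location-scale finite mixture in the case $n=1$. The choice of the standard normal pdf for $g$, the particular weights, locations and scales, and the helper names `location_scale_mixture` and `riemann_integral` are all illustrative assumptions and do not come from the text.

```python
import math

def std_normal_pdf(x):
    """Standard normal pdf for n = 1; an illustrative choice of the component g."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def location_scale_mixture(x, g, weights, locations, scales):
    """Evaluate h(x) = sum_i c_i * (1 / sigma_i) * g((x - mu_i) / sigma_i) for n = 1."""
    # The weight vector c must lie in the probability simplex S^{m-1}.
    if any(c < 0 for c in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be non-negative and sum to one")
    # Every scale sigma_i must be strictly positive.
    if any(s <= 0 for s in scales):
        raise ValueError("scales must be strictly positive")
    return sum(c * g((x - mu) / s) / s
               for c, mu, s in zip(weights, locations, scales))

def riemann_integral(func, lower, upper, steps):
    """Midpoint Riemann sum of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (j + 0.5) * width) for j in range(steps))

# Hypothetical three-component mixture (m = 3).
weights, locations, scales = [0.25, 0.5, 0.25], [-3.0, 0.0, 4.0], [0.5, 1.0, 2.0]

def h(x):
    return location_scale_mixture(x, std_normal_pdf, weights, locations, scales)

# A member of the class is itself a pdf, so its mass should be close to one.
total_mass = riemann_integral(h, -30.0, 30.0, 20000)
print(total_mass)
```

Because the weights lie in the simplex and each scaled component integrates to one, the computed mass should agree with one up to quadrature error.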

The study of pdfs in the class $\mathcal{M}_m^g$ is an evergreen area of applied and technical research in statistics. We point the interested reader to the many comprehensive books on the topic, such as Everitt and Hand (1981), Titterington, Smith and Makov (1985), McLachlan and Basford (1988), Lindsay (1995), McLachlan and Peel (2000), Frühwirth-Schnatter (2006), Schlattmann (2009), Mengersen, Robert and Titterington (2011), and Frühwirth-Schnatter, Celeux and Robert (2019).

Much of the popularity of finite mixture models stems from the folk theorem, which states that for any density $f$, there exists an $h\in\mathcal{M}_m^g$, for some sufficiently large number of components $m$, such that $h$ approximates $f$ arbitrarily closely, in some sense. Examples of this folk theorem come in statements such as: “provided the number of component densities is not bounded above, certain forms of mixture can be used to provide arbitrarily close approximation to a given probability distribution” (Titterington, Smith and Makov, 1985, p. 50), “the [mixture] model forms can fit any distribution and significantly increase model fit” (Walker and Ben-Akiva, 2011, p. 173), and “a mixture model can approximate almost any distribution” (Yona, 2011, p. 500). Other statements conveying the same sentiment are reported in Nguyen and McLachlan (2019). The reported statements are vague, and little is ever made clear regarding the technical nature of the folk theorem.

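In order to proceed, we require the following definitions. We say that $f$ is compactly supported on $K\subset\mathbb{R}^n$ if $K$ is compact and $\mathbf{1}_{K^\complement}f=0$, where $\mathbf{1}_X$ is the indicator function that takes value 1 when $x\in X$ and 0 elsewhere, and $(\cdot)^\complement$ is the set complement operator (i.e., $X^\complement=\mathbb{R}^n\setminus X$). Here, $X$ is a generic subset of $\mathbb{R}^n$. Furthermore, we say that $f\in\mathcal{L}_p(X)$, for any $1\le p<\infty$, if

$$\|f\|_{\mathcal{L}_p(X)}=\left(\int\left|\mathbf{1}_Xf\right|^p\mathrm{d}\lambda\right)^{1/p}<\infty,$$

and for $p=\infty$, if

$$\|f\|_{\mathcal{L}_\infty(X)}=\inf\left\{a\ge 0:\lambda\left(\left\{x\in X:|f(x)|>a\right\}\right)=0\right\}<\infty,$$

where we call $\|\cdot\|_{\mathcal{L}_p(X)}$ the $\mathcal{L}_p$ norm on $X$. When $X=\mathbb{R}^n$, we shall write $\|\cdot\|_{\mathcal{L}_p(\mathbb{R}^n)}=\|\cdot\|_{\mathcal{L}_p}$. In addition, we define the so-called Kullback-Leibler divergence (see Kullback and Leibler, 1951) between any two pdfs $f$ and $g$ on $X$ as

$$\mathrm{KL}_X(f,g)=\int\mathbf{1}_Xf\log\left(\frac{f}{g}\right)\mathrm{d}\lambda.$$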
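The quantities defined above can be approximated numerically. The sketch below does so for $n=1$ on an interval $X=[-12,12]$, using midpoint Riemann sums. The two normal densities, the interval, and the function names `lp_norm`, `sup_norm` and `kl_divergence` are illustrative assumptions. Note also that a maximum over grid points only approximates the essential supremum, which is reasonable here because the function is continuous.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    """Univariate normal pdf; used only as an illustrative example."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def midpoints(lower, upper, steps):
    """Midpoints of an equal-width partition of [lower, upper], and the cell width."""
    width = (upper - lower) / steps
    return [lower + (j + 0.5) * width for j in range(steps)], width

def lp_norm(func, lower, upper, p, steps=20000):
    """Approximate the L_p(X) norm on X = [lower, upper], for 1 <= p < infinity."""
    pts, width = midpoints(lower, upper, steps)
    return (width * sum(abs(func(x)) ** p for x in pts)) ** (1.0 / p)

def sup_norm(func, lower, upper, steps=20000):
    """Grid approximation of the L_infinity(X) norm of a continuous function."""
    pts, _ = midpoints(lower, upper, steps)
    return max(abs(func(x)) for x in pts)

def kl_divergence(f, g, lower, upper, steps=20000):
    """Approximate KL_X(f, g), assuming f and g are strictly positive on X."""
    pts, width = midpoints(lower, upper, steps)
    return width * sum(f(x) * math.log(f(x) / g(x)) for x in pts)

def f(x):
    return normal_pdf(x, 0.0, 1.0)

def g(x):
    return normal_pdf(x, 1.0, 1.0)

l2 = lp_norm(f, -12.0, 12.0, p=2)
linf = sup_norm(f, -12.0, 12.0)
kl = kl_divergence(f, g, -12.0, 12.0)
print(l2, linf, kl)
```

For two unit-variance normal densities whose means differ by one, the Kullback-Leibler divergence over the whole line has the closed form $1/2$, which gives a simple check on the quadrature.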

In Nguyen and McLachlan (2019), the approximation of pdfs by the class $\mathcal{M}_m^g$ was explored in a restrictive setting. Let $\{h_m^g\}$ be a sequence of functions that draw elements from the nested sequence of sets $\{\mathcal{M}_m^g\}$ (i.e., $h_1^g\in\mathcal{M}_1^g$, $h_2^g\in\mathcal{M}_2^g$, and so on). The following result of Zeevi and Meir (1997) was presented in Nguyen and McLachlan (2019), along with a collection of its implications, such as the results of Li and Barron (1999) and Rakhlin, Panchenko and Mukherjee (2005).

###### Theorem 1 (Zeevi and Meir, 1997).

If and are pdfs and is compact, then there exists a sequence such that

$$\lim_{m\to\infty}\left\|f-h_m^g\right\|_{\mathcal{L}_2(K)}=0\quad\text{and}\quad\lim_{m\to\infty}\mathrm{KL}_K\left(f,h_m^g\right)=0.$$

Although powerful, this result is restrictive in the sense that it only permits approximation in the $\mathcal{L}_2$ norm on compact sets $K$, and only allows for approximation of functions $f$ that are strictly positive on $K$. In general, other modes of approximation are desirable. In particular, approximation in $\mathcal{L}_p$ for $p=1$ or $p=\infty$ is of interest, where the latter case is generally referred to as uniform approximation. Furthermore, the strict-positivity assumption and the restriction to compact sets limit the scope of applicability of Theorem 1. An example of an interesting application of extensions beyond Theorem 1 is within the approximation framework of Devroye and Lugosi (2000).

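Let $g$ again be a pdf. Then, for each $m\in\mathbb{N}$, we define

$$\mathcal{N}_m^g=\left\{h:h(x)=\sum_{i=1}^m c_i\frac{1}{\sigma_i^n}g\left(\frac{x-\mu_i}{\sigma_i}\right),\ \mu_i\in\mathbb{R}^n,\ \sigma_i\in\mathbb{R}_+,\ c_i\in\mathbb{R},\ i\in[m]\right\},$$

which we call the set of location-scale linear combinations of the pdf $g$. In the past, results regarding approximations of pdfs via functions $\eta\in\mathcal{N}_m^g$ have been more forthcoming. For example, consider the case of $g=\phi$, where

$$\phi(x)=(2\pi)^{-n/2}\exp\left(-\|x\|_2^2/2\right)\tag{1}$$

is the standard normal pdf. Denote the class of continuous functions with support on $\mathbb{R}^n$ by $\mathcal{C}$. We have the result that for every pdf $f\in\mathcal{C}$, compact set $K\subset\mathbb{R}^n$, and $\epsilon>0$, there exists an $m\in\mathbb{N}$ and $\eta\in\mathcal{N}_m^\phi$, such that $\|f-\eta\|_{\mathcal{L}_\infty(K)}<\epsilon$ (Sandberg, 2001, Lem. 1). Furthermore, upon defining the set of continuous functions that vanish at infinity by

$$\mathcal{C}_0=\left\{f\in\mathcal{C}:\forall\epsilon>0,\ \exists\text{ a compact }K\subset\mathbb{R}^n,\text{ such that }\|f\|_{\mathcal{L}_\infty(K^\complement)}<\epsilon\right\},$$

we also have the result: for every pdf $f\in\mathcal{C}_0$ and $\epsilon>0$, there exists an $m\in\mathbb{N}$ and $\eta\in\mathcal{N}_m^\phi$, such that $\|f-\eta\|_{\mathcal{L}_\infty}<\epsilon$ (Sandberg, 2001, Thm. 2). Both of the results from Sandberg (2001) are simple implications of the famous Stone-Weierstrass theorem (cf. Stone, 1948 and De Branges, 1959).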

To the best of our knowledge, the strongest available claim that is made regarding the folk theorem, within a probabilistic or statistical context, is that of DasGupta (2008, Thm. 33.2). Let $\{\eta_m^g\}$ be a sequence of functions that draw elements from the nested sequence of sets $\{\mathcal{N}_m^g\}$, in the same manner as $\{h_m^g\}$. We paraphrase the claim without loss of fidelity, as follows.

###### Claim 1.

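If $f,g\in\mathcal{C}$ are pdfs and $K\subset\mathbb{R}^n$ is compact, then there exists a sequence $\{\eta_m^g\}$, such that

$$\lim_{m\to\infty}\left\|f-\eta_m^g\right\|_{\mathcal{L}_\infty(K)}=0.$$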

Unfortunately, the proof of Claim 1 is not provided within DasGupta (2008). The only reference of the result is to an undisclosed location in Cheney and Light (2000), which, upon investigation, can be inferred to be Theorem 5 of Cheney and Light (2000, Ch. 20). It is further notable that there is no proof provided for the theorem. Instead, it is stated that the proof is similar to that of Theorem 1 in Cheney and Light (2000, Ch. 24), which is a reproduction of the proof for Xu, Light and Cheney (1993, Lem. 3.1).

There is a major problem in applying the proof technique of Xu, Light and Cheney (1993, Lem. 3.1) in order to prove Claim 1. The proof of Xu, Light and Cheney (1993, Lem. 3.1) critically depends upon the statement that “there is no loss of generality in assuming that for ”. Here, for , . The assumption is necessary in order to write any convolution with and an arbitrary continuous function as an integral over a compact domain, and then to use a Riemann sum to approximate such an integral. Subsequently, such a proof technique does not work outside the class of continuous functions that are compactly supported on . Thus, one cannot verify Claim 1 from the materials of Xu, Light and Cheney (1993), Cheney and Light (2000), and DasGupta (2008), alone.

Some recent results in the spirit of Claim 1 have been obtained by Nestoridis and Stefanopoulos (2007) and Nestoridis, Schmutzhard and Stefanopoulos (2011), using methods from the study of universal series (see, e.g., Nestoridis and Papadimitropoulos, 2005).

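Let

$$\mathcal{W}=\left\{f\in\mathcal{C}_0:\sum_{y\in\mathbb{Z}^n}\sup_{x\in[0,1]^n}|f(x+y)|<\infty\right\}$$

denote the so-called Wiener’s algebra (see, e.g., Feichtinger, 1977) and let

$$\mathcal{V}=\left\{f\in\mathcal{C}_0:\forall x\in\mathbb{R}^n,\ |f(x)|\le\beta\left(1+\|x\|_2\right)^{-n-\theta},\ \beta,\theta\in\mathbb{R}_+\right\}$$

be a class of functions with tails decaying at a faster rate than $\|x\|_2^{-n}$.

In Nestoridis, Schmutzhard and Stefanopoulos (2011), it is noted that $\mathcal{V}\subset\mathcal{W}$. Further, let

$$\mathcal{C}_c=\left\{f\in\mathcal{C}:\exists\text{ a compact set }K,\text{ such that }\mathbf{1}_{K^\complement}f=0\right\}$$

denote the set of compactly supported continuous functions. The following theorem was proved in Nestoridis and Stefanopoulos (2007).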

###### Theorem 2 (Nestoridis and Stefanopoulos, 2007, Thm. 3.2).

If , then the following statements hold.

• For any , there exists a sequence , such that

• For any , there exists a sequence , such that

$$\lim_{m\to\infty}\left\|f-\eta_m^g\right\|_{\mathcal{L}_\infty}=0.$$
• For any and , there exists a sequence , such that

$$\lim_{m\to\infty}\left\|f-\eta_m^g\right\|_{\mathcal{L}_p}=0.$$
• For any measurable , there exists a sequence , such that

$$\lim_{m\to\infty}\eta_m^g=f,\text{ almost everywhere.}$$
• If is a Borel measure on , then for any , there exists a sequence , such that

$$\lim_{m\to\infty}\eta_m^g=f,\text{ almost everywhere, with respect to }\nu.$$

The result was then improved upon in Nestoridis, Schmutzhard and Stefanopoulos (2011), whereupon the more general space $\mathcal{W}$ was taken as a replacement for $\mathcal{V}$ in Theorem 2. Denote the class of bounded continuous functions by $\mathcal{C}_b$. The following theorem was proved in Nestoridis, Schmutzhard and Stefanopoulos (2011).

###### Theorem 3 (Nestoridis et al., 2011, Thm. 3.2).

If , then the following statements are true.

• The conclusion of Theorem 2(a) holds, with replaced by .

• The conclusions of Theorem 2(b)–(e) hold.

• For any and compact , there exists a sequence , such that

$$\lim_{m\to\infty}\left\|f-\eta_m^g\right\|_{\mathcal{L}_\infty(K)}=0.$$

Utilizing the techniques from Nestoridis and Stefanopoulos (2007), Bacharoglou (2010) proved a similar set of results to Theorem 2, under the restriction that $f$ is a non-negative function with support $\mathbb{R}$, using $g=\phi$ (i.e., $g$ has form (1), where $n=1$) and taking $\{h_m^\phi\}$ as the approximating sequence, instead of $\{\eta_m^g\}$. That is, the following result is obtained.

###### Theorem 4 (Bacharoglou, 2010, Cor. 2.5).

If , then the following statements are true.

• For any pdf , there exists a sequence , such that

• For any , such that , there exists a sequence , such that

$$\lim_{m\to\infty}\left\|f-h_m^\phi\right\|_{\mathcal{L}_\infty}=0.$$
• For any and , such that , there exists a sequence , such that

$$\lim_{m\to\infty}\left\|f-h_m^\phi\right\|_{\mathcal{L}_p}=0.$$
• For any measurable , there exists a sequence , such that

$$\lim_{m\to\infty}h_m^\phi=f,\text{ almost everywhere.}$$
• For any pdf , there exists a sequence , such that

$$\lim_{m\to\infty}\left\|f-h_m^\phi\right\|_{\mathcal{L}_1}=0.$$

To the best of our knowledge, Theorem 4 is the most complete characterization of the approximating capabilities of the mixture of normal distributions. However, it is restrictive in two ways. First, it does not permit characterization of approximation via the class $\mathcal{M}_m^g$ for any $g$ except the normal pdf $\phi$. Although $\phi$ is traditionally the most common choice for $g$ in practice, the modern mixture model literature has seen the use of many more exotic component pdfs, such as the student-t pdf and its skew and modified variants (see, e.g., Peel and McLachlan, 2000, Forbes and Wraith, 2013, and Lee and McLachlan, 2016). Thus, its use is somewhat limited in the modern context. Second, modern applications tend to call for $n>1$, further restricting the impact of the result as a theoretical bulwark for finite mixture modeling in practice. A remark in Bacharoglou (2010) states that the result can be generalized to the case where $g\in\mathcal{V}$ instead of $g=\phi$. However, no suggestions were proposed regarding the generalization of Theorem 4 to the case of $n>1$.

In this article, we prove a novel set of results that largely generalize Theorem 4. Using techniques inspired by Donahue et al. (1997) and Cheney and Light (2000), we are able to obtain a set of results regarding the approximation capability of the class of mixture models $\mathcal{M}_m^g$, when $g\in\mathcal{C}_0$ or $g\in\mathcal{V}$, and for any $n\in\mathbb{N}$. By definition of $\mathcal{V}$, the majority of our results extend beyond the proposed possible generalizations of Theorem 4. The remainder of the article is devoted to proving the following theorem.

###### Theorem 5 (Main result).

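If we assume that $f$ and $g$ are pdfs and that $g\in\mathcal{C}_0$, then the following statements are true.

• (a) For any $f\in\mathcal{C}_0$, there exists a sequence $\{h_m^g\}$, such that

$$\lim_{m\to\infty}\left\|f-h_m^g\right\|_{\mathcal{L}_\infty}=0.$$

• (b) For any $f\in\mathcal{C}_b$ and compact $K\subset\mathbb{R}^n$, there exists a sequence $\{h_m^g\}$, such that

$$\lim_{m\to\infty}\left\|f-h_m^g\right\|_{\mathcal{L}_\infty(K)}=0.$$

• (c) For any $p\in(1,\infty)$ and $f\in\mathcal{L}_p$, there exists a sequence $\{h_m^g\}$, such that

$$\lim_{m\to\infty}\left\|f-h_m^g\right\|_{\mathcal{L}_p}=0.$$

• (d) For any measurable $f$, there exists a sequence $\{h_m^g\}$, such that

$$\lim_{m\to\infty}h_m^g=f,\text{ almost everywhere.}$$

• (e) If $\nu$ is a Borel measure on $\mathbb{R}^n$, then for any $\nu$-measurable $f$, there exists a sequence $\{h_m^g\}$, such that

$$\lim_{m\to\infty}h_m^g=f,\text{ almost everywhere, with respect to }\nu.$$

If we assume instead that $g\in\mathcal{V}$, then the following statement is also true.

• (f) For any $f\in\mathcal{L}_1$, there exists a sequence $\{h_m^g\}$, such that

$$\lim_{m\to\infty}\left\|f-h_m^g\right\|_{\mathcal{L}_1}=0.$$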

The article proceeds as follows. The separate parts of Theorem 5 are proved in the subsections of Section 2. Comments and discussion are provided in Section 3. Necessary technical lemmas and results are also included, for reference, in Appendix A.

## 2 Main result

### 2.1 Technical preliminaries

Before we begin to present the main theorem, we establish some technical results regarding our class of component densities in $\mathcal{C}_0$. Denote the convolution of two functions $g$ and $f$ by $g\star f$, whenever it is defined, and denote the sequence of dilates of $g$ by $\{g_k:g_k(x)=k^ng(kx),\ k\in\mathbb{N}\}$. The following result is an alternative to Lemma 5 and Corollary 1. Here, we replace the boundedness assumption on the approximand in the aforementioned results by a vanishing-at-infinity assumption.

###### Lemma 1.

Let be a pdf and , such that . Then,

$$\lim_{k\to\infty}\left\|g_k\star f-f\right\|_{\mathcal{L}_\infty}=0.$$
###### Proof.

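It suffices to show that for any $\epsilon>0$, there exists a $k(\epsilon)\in\mathbb{N}$, such that $\|g_k\star f-f\|_{\mathcal{L}_\infty}<\epsilon$, for all $k\ge k(\epsilon)$. By Lemma 6, $f$ is bounded, and thus $\|f\|_{\mathcal{L}_\infty}<\infty$. By making the substitution $z=kx$, we obtain

$$\int g_k(x)\,\mathrm{d}\lambda=\int k^ng(kx)\,\mathrm{d}\lambda=\int g(z)\,\mathrm{d}\lambda=1,$$

for each $k$. By Corollary 1, we obtain $\lim_{k\to\infty}\int\mathbf{1}_{\{x:\|x\|_2>\delta\}}g_k\,\mathrm{d}\lambda=0$ for any $\delta>0$, and thus we can choose a $k(\epsilon)$, such that, for all $k\ge k(\epsilon)$,

$$\int\mathbf{1}_{\{x:\|x\|_2>\delta\}}g_k\,\mathrm{d}\lambda<\frac{\epsilon}{4\|f\|_{\mathcal{L}_\infty}}.$$

Since $g_k$ is a pdf, we have

$$\left|(g_k\star f)(x)-f(x)\right|=\left|\int g_k(y)\left[f(x-y)-f(x)\right]\mathrm{d}\lambda(y)\right|\le\int g_k(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y).$$

By uniform continuity, for any $\epsilon>0$, there exists a $\delta(\epsilon)>0$ such that $|f(x)-f(y)|<\epsilon/2$, for any $x,y\in\mathbb{R}^n$, such that $\|x-y\|_2\le\delta(\epsilon)$ (Lemma 6). Thus, on the one hand, for any $\delta(\epsilon)$, we can pick a $k(\epsilon)$ such that, for all $k\ge k(\epsilon)$,

$$\begin{aligned}
\int\mathbf{1}_{\{y:\|y\|_2>\delta(\epsilon)\}}g_k(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y)
&\le 2\|f\|_{\mathcal{L}_\infty}\int\mathbf{1}_{\{y:\|y\|_2>\delta(\epsilon)\}}g_k\,\mathrm{d}\lambda\\
&\le 2\|f\|_{\mathcal{L}_\infty}\times\frac{\epsilon}{4\|f\|_{\mathcal{L}_\infty}}=\frac{\epsilon}{2},
\end{aligned}\tag{2}$$

and on the other hand

$$\begin{aligned}
\int\mathbf{1}_{\{y:\|y\|_2\le\delta(\epsilon)\}}g_k(y)\left|f(x-y)-f(x)\right|\mathrm{d}\lambda(y)
&\le\frac{\epsilon}{2}\int\mathbf{1}_{\{y:\|y\|_2\le\delta(\epsilon)\}}g_k\,\mathrm{d}\lambda\\
&\le\frac{\epsilon}{2}\times 1=\frac{\epsilon}{2}.
\end{aligned}\tag{3}$$

The proof is completed by summing (2) and (3). ∎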
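The conclusion of Lemma 1 can be observed numerically. The sketch below is a hedged illustration only: it assumes $n=1$, takes $g$ to be the standard normal pdf and $f$ to be the triangular function $f(x)=\max(0,1-|x|)$, which is continuous and vanishes at infinity, and approximates the convolution by a midpoint Riemann sum. The helper names `dilate`, `convolve_at` and `sup_error`, the grid, and the truncation of the integral to $|y|\le 8/k$ are assumptions made for the example.

```python
import math

def std_normal_pdf(x):
    """Standard normal pdf for n = 1; an illustrative choice of g."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def triangle(x):
    """f(x) = max(0, 1 - |x|): continuous and vanishing at infinity."""
    return max(0.0, 1.0 - abs(x))

def dilate(g, k):
    """Return the dilate g_k(x) = k^n g(kx), here with n = 1."""
    return lambda x: k * g(k * x)

def convolve_at(gk, f, x, half_width, steps=2000):
    """Midpoint approximation of (g_k * f)(x) = int g_k(y) f(x - y) dy.

    The integral is truncated to |y| <= half_width, outside of which g_k is negligible.
    """
    width = 2.0 * half_width / steps
    total = 0.0
    for j in range(steps):
        y = -half_width + (j + 0.5) * width
        total += gk(y) * f(x - y)
    return width * total

def sup_error(k, grid):
    """Grid approximation of the uniform distance between g_k * f and f."""
    gk = dilate(std_normal_pdf, k)
    half_width = 8.0 / k  # g_k has standard deviation 1 / k
    return max(abs(convolve_at(gk, triangle, x, half_width) - triangle(x))
               for x in grid)

grid = [-3.0 + 0.1 * j for j in range(61)]
errors = {k: sup_error(k, grid) for k in (1, 4, 16)}
for k, err in errors.items():
    print(k, err)
```

In this example the largest error occurs at the kink at the origin and shrinks roughly in proportion to $1/k$, in line with the uniform convergence stated in the lemma.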

###### Lemma 2.

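If $f\in\mathcal{C}_0$ is such that $f\ge 0$, and $\epsilon>0$, then there exists an $h\in\mathcal{C}_c$, such that $0\le h\le f$, and

$$\|f-h\|_{\mathcal{L}_\infty}<\epsilon.$$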
###### Proof.

Since , there exists a compact such that . By Lemma 7, there exists some , such that and . Let , which implies that and . Furthermore, notice that and , by construction. The proof is completed by observing that

For any $\delta>0$ and uniformly continuous function $f$, let

$$w(f,\delta)=\sup_{\{x,y\in\mathbb{R}^n:\|x-y\|_2\le\delta\}}|f(x)-f(y)|$$

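denote the modulus of continuity of $f$. Furthermore, define the diameter of a set $X\subset\mathbb{R}^n$ by $\mathrm{diam}(X)=\sup_{x,y\in X}\|x-y\|_2$ and denote an open ball, centered at $x\in\mathbb{R}^n$ with radius $r>0$, by $B(x,r)=\{y\in\mathbb{R}^n:\|x-y\|_2<r\}$.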

Notice that the class $\mathcal{M}_m^g$ can be parameterized as

$$\mathcal{M}_m^g=\left\{h:h(x)=\sum_{i=1}^m c_ik_i^ng(k_ix-z_i),\ z_i\in\mathbb{R}^n,\ k_i\in\mathbb{R}_+,\ c\in\mathbb{S}^{m-1},\ i\in[m]\right\},$$

where $k_i=1/\sigma_i$ and $z_i=\mu_i/\sigma_i$. The following result is the primary mechanism that permits us to construct finite mixture approximations for convolutions of form $g_k\star h$. The argument is motivated by the approaches taken in Theorem 1 in Cheney and Light (2000, Ch. 24), Nestoridis and Stefanopoulos (2007, Lem. 3.1), and Nestoridis, Schmutzhard and Stefanopoulos (2011, Thm. 3.1).

###### Lemma 3.

Let and be pdfs. Furthermore, let be compact and , where and . Then for any , there exists a sequence , such that

$$\lim_{m\to\infty}\left\|g_k\star h-h_m^g\right\|_{\mathcal{L}_\infty}=0.$$
###### Proof.

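It suffices to show that for any $k\in\mathbb{N}$ and $\epsilon>0$, there exists a sufficiently large $m(\epsilon)\in\mathbb{N}$ so that, for all $m\ge m(\epsilon)$, there is an $h_m^g\in\mathcal{M}_m^g$ such that

$$\left\|g_k\star h-h_m^g\right\|_{\mathcal{L}_\infty}<\epsilon.\tag{4}$$

For any $x\in\mathbb{R}^n$, we can write

$$\begin{aligned}
(g_k\star h)(x)&=\int g_k(x-y)h(y)\,\mathrm{d}\lambda(y)=\int\mathbf{1}_{\{y:y\in K\}}g_k(x-y)h(y)\,\mathrm{d}\lambda(y)\\
&=\int\mathbf{1}_{\{y:y\in K\}}k^ng(kx-ky)h(y)\,\mathrm{d}\lambda(y)=\int\mathbf{1}_{\{z:z\in kK\}}g(kx-z)h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z).
\end{aligned}$$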

Here, is continuous image of a compact set, and hence is compact (cf. Rudin, 1976, Thm. 4.14). By Lemma 8, for any , there exists (, ), such that . Further, if , then we have . We can obtain a disjoint covering of by taking and () and noting that , by construction (cf. Cheney and Light, Ch. 24). Furthermore, each is a Borel set and .

For convenience, let denote the disjoint covering, or partition, of . We seek to show that there exists an and , such that

$$\left\|g_k\star h-\sum_{i=1}^m c_ik_i^ng(k_ix-z_i)\right\|_{\mathcal{L}_\infty}<\epsilon,$$

where ,

$$c_i=k^{-n}\int\mathbf{1}_{\{z:z\in A_i^\delta\}}h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z),$$

and , for .

Further, and , with chosen as follows. By Lemma 6, for some positive . Then, . We may choose so that , so that

$$\left\|c_mk_m^ng(k_mx-z_m)\right\|_{\mathcal{L}_\infty}\le\frac{\epsilon}{2}.$$

Since , the sum of () satisfies the inequality

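$$\begin{aligned}
\sum_{i=1}^{m-1}c_i&=k^{-n}\sum_{i=1}^{m-1}\int\mathbf{1}_{\{z:z\in A_i^\delta\}}h\left(\frac{z}{k}\right)\mathrm{d}\lambda=k^{-n}\int\mathbf{1}_{\{z:z\in kK\}}h\left(\frac{z}{k}\right)\mathrm{d}\lambda\\
&=\int\mathbf{1}_{\{x:x\in K\}}h\,\mathrm{d}\lambda\le\int\mathbf{1}_{\{x:x\in K\}}f\,\mathrm{d}\lambda\le\int f\,\mathrm{d}\lambda=1.
\end{aligned}$$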

Thus, , and our construction implies that where

$$h_m^g(x)=\sum_{i=1}^m c_ik_i^ng(k_ix-z_i),\quad\forall x\in\mathbb{R}^n.$$

We can bound the left-hand side of (4) as follows:

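$$\begin{aligned}
\left\|g_k\star h-h_m^g\right\|_{\mathcal{L}_\infty}
&\le\left\|(g_k\star h)(x)-\sum_{i=1}^{m-1}c_ik_i^ng(k_ix-z_i)\right\|_{\mathcal{L}_\infty}+\left\|c_mk_m^ng(k_mx-z_m)\right\|_{\mathcal{L}_\infty}\\
&\le\left\|(g_k\star h)(x)-\sum_{i=1}^{m-1}c_ik_i^ng(k_ix-z_i)\right\|_{\mathcal{L}_\infty}+\frac{\epsilon}{2}\\
&=\left\|\int\mathbf{1}_{\{z:z\in kK\}}g(kx-z)h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z)-\sum_{i=1}^{m-1}\int\mathbf{1}_{\{z:z\in A_i^\delta\}}g(kx-z_i)h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z)\right\|_{\mathcal{L}_\infty}+\frac{\epsilon}{2}\\
&\le\sum_{i=1}^{m-1}\int\mathbf{1}_{\{z:z\in A_i^\delta\}}\left|g(kx-z)-g(kx-z_i)\right|h\left(\frac{z}{k}\right)\mathrm{d}\lambda(z)+\frac{\epsilon}{2}.
\end{aligned}\tag{5}$$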

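Since

$$\left\|kx-z-(kx-z_i)\right\|_2=\left\|z-z_i\right\|_2\le\mathrm{diam}\left(A_i^\delta\right)\le\delta,$$

we have $|g(kx-z)-g(kx-z_i)|\le w(g,\delta)$, for each $i\in[m-1]$ and $z\in A_i^\delta$. Since $\lim_{\delta\to 0}w(g,\delta)=0$ (cf. Makarov and Podkorytov, 2013, Thm. 4.7.3), we may choose a $\delta(\epsilon)$ so that $w(g,\delta(\epsilon))<\epsilon/(2k^n)$. We may proceed from (5) as follows:

$$\begin{aligned}
\left\|g_k\star h-h_m^g\right\|_{\mathcal{L}_\infty}
&\le w(g,\delta(\epsilon))\int\mathbf{1}_{\{z:z\in kK\}}h\left(\frac{z}{k}\right)\mathrm{d}\lambda+\frac{\epsilon}{2}\\
&=w(g,\delta(\epsilon))k^n\int h\,\mathrm{d}\lambda+\frac{\epsilon}{2}\le w(g,\delta(\epsilon))k^n+\frac{\epsilon}{2}\\
&<\frac{\epsilon}{2}+\frac{\epsilon}{2}=\epsilon.
\end{aligned}\tag{6}$$

To conclude the proof, it suffices to choose an appropriate sequence of partitions, for some large but finite $m$, so that (5) and (6) hold, which is possible by Lemma 8. ∎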
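The construction in the proof of Lemma 3 can be sketched numerically for $n=1$. The following is an illustration under stated assumptions rather than the general argument: $g$ is the standard normal pdf, $h(x)=0.8\max(0,1-|x|)$ is supported on $K=[-1,1]$ (so that $0\le h\le f$ for the triangular pdf $f$), the set $kK=[-k,k]$ is partitioned into equal-width cells, each $z_i$ is taken to be a cell midpoint, and the final component is centered at zero with a small dilation $k_m$. All function names are hypothetical.

```python
import math

def std_normal_pdf(x):
    """Standard normal pdf for n = 1; an illustrative choice of g."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def h(x):
    """Compactly supported approximand on K = [-1, 1], with integral 0.8."""
    return 0.8 * max(0.0, 1.0 - abs(x))

def midpoint_integral(func, lower, upper, steps):
    """Midpoint Riemann sum of func over [lower, upper]."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (j + 0.5) * width) for j in range(steps))

def build_mixture(k, cells, eps, sup_g):
    """Return (weights, dilations, shifts) with m = cells + 1 components."""
    width = 2.0 * k / cells
    weights, dilations, shifts = [], [], []
    for i in range(cells):
        lower = -k + i * width
        # c_i = k^{-n} * integral over the cell A_i of h(z / k) dz, with n = 1
        weights.append(midpoint_integral(lambda z: h(z / k), lower, lower + width, 20) / k)
        dilations.append(float(k))          # k_i = k
        shifts.append(lower + 0.5 * width)  # z_i is a point of A_i (its midpoint)
    # The final component carries the leftover weight, so the weights sum to one.
    residual = max(0.0, 1.0 - sum(weights))
    # A small k_m keeps sup_x c_m * k_m * g(k_m x - z_m) at most eps / 2.
    k_m = min(1.0, eps / (2.0 * residual * sup_g)) if residual > 0 else 1.0
    weights.append(residual)
    dilations.append(k_m)
    shifts.append(0.0)
    return weights, dilations, shifts

def mixture_at(x, g, weights, dilations, shifts):
    """Evaluate sum_i c_i * k_i * g(k_i x - z_i)."""
    return sum(c * ki * g(ki * x - zi) for c, ki, zi in zip(weights, dilations, shifts))

def convolution_at(x, g, k, steps=2000):
    """(g_k * h)(x) written as the integral over kK of g(kx - z) h(z / k) dz."""
    return midpoint_integral(lambda z: g(k * x - z) * h(z / k), -k, k, steps)

k, eps = 4, 0.02
sup_g = std_normal_pdf(0.0)
weights, dilations, shifts = build_mixture(k, cells=200, eps=eps, sup_g=sup_g)
grid = [-4.0 + 0.1 * j for j in range(81)]
error = max(abs(convolution_at(x, std_normal_pdf, k)
                - mixture_at(x, std_normal_pdf, weights, dilations, shifts))
            for x in grid)
print(error)
```

In this example most of the remaining error comes from the final, nearly flat component, which is bounded by $\epsilon/2$ by construction. The Riemann-sum part of the error is controlled by the modulus of continuity of $g$ over cells of small diameter, as in (6).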

For any $r>0$, let $\bar{B}_r=\{x\in\mathbb{R}^n:\|x\|_2\le r\}$ be a closed ball of radius $r$, centered at the origin.

###### Lemma 4.

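If $f\in\mathcal{L}_1$, such that $f\ge 0$, then

$$\lim_{r\to\infty}\left\|f-\mathbf{1}_{\bar{B}_r}f\right\|_{\mathcal{L}_1}=0.$$
###### Proof.

By construction, each element of the sequence $\{\mathbf{1}_{\bar{B}_r}f\}$ ($r\in\mathbb{N}$) is measurable, $0\le\mathbf{1}_{\bar{B}_r}f\le f$, and

$$\lim_{r\to\infty}\mathbf{1}_{\bar{B}_r}f=f,$$

point-wise. We obtain our conclusion via the Lebesgue dominated convergence theorem. ∎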
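A closed-form instance of Lemma 4 may help fix ideas. As an illustrative assumption, take $n=1$ and let $f$ be the standard Cauchy pdf, which is non-negative and integrable. Because $f\ge 0$, the $\mathcal{L}_1$ distance between $f$ and its truncation to $\bar{B}_r=[-r,r]$ is the mass of $f$ outside $[-r,r]$, which equals $1-(2/\pi)\arctan(r)$. The function name `cauchy_tail_mass` is hypothetical.

```python
import math

def cauchy_tail_mass(r):
    """L1 distance between the standard Cauchy pdf and its truncation to [-r, r].

    The distance is the integral of the pdf over {|x| > r}, that is,
    1 - (2 / pi) * atan(r), using the Cauchy distribution function.
    """
    return 1.0 - (2.0 / math.pi) * math.atan(r)

radii = [1.0, 10.0, 100.0, 1000.0]
tail_masses = [cauchy_tail_mass(r) for r in radii]
for r, mass in zip(radii, tail_masses):
    print(r, mass)
```

The truncation error decreases to zero as $r$ grows, although slowly for this heavy-tailed example, which is why the radius in later arguments must be chosen after the dilation index $k$ has been fixed.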

### 2.2 Proof of Theorem 5(a)

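We now proceed to prove each of the parts of Theorem 5. To prove Theorem 5(a), it suffices to show that for every $\epsilon>0$, there exists an $h_m^g\in\mathcal{M}_m^g$, such that $\|f-h_m^g\|_{\mathcal{L}_\infty}<\epsilon$.

Start by applying Lemma 2 to obtain $h\in\mathcal{C}_c$, such that $0\le h\le f$ and $\|f-h\|_{\mathcal{L}_\infty}<\epsilon/2$. Then, by the triangle inequality, we have

$$\left\|f-h_m^g\right\|_{\mathcal{L}_\infty}\le\|f-h\|_{\mathcal{L}_\infty}+\left\|h-h_m^g\right\|_{\mathcal{L}_\infty}<\frac{\epsilon}{2}+\left\|h-h_m^g\right\|_{\mathcal{L}_\infty}.\tag{7}$$

The goal is to find an $h_m^g$, such that $\|h-h_m^g\|_{\mathcal{L}_\infty}<\epsilon/2$. Since $h\in\mathcal{C}_c$, we may find a compact $K\subset\mathbb{R}^n$ such that $\mathbf{1}_{K^\complement}h=0$. Apply Lemma 1 to show the existence of a $k(\epsilon)$, such that

$$\|h-g_k\star h\|_{\mathcal{L}_\infty}<\frac{\epsilon}{4},$$

for all $k\ge k(\epsilon)$. With a fixed $k=k(\epsilon)$, apply Lemma 3 to show that there exists an $h_m^g\in\mathcal{M}_m^g$, such that

$$\left\|g_{k(\epsilon)}\star h-h_m^g\right\|_{\mathcal{L}_\infty}<\frac{\epsilon}{4}.$$

By the triangle inequality, we have

$$\left\|h-h_m^g\right\|_{\mathcal{L}_\infty}\le\left\|h-g_{k(\epsilon)}\star h\right\|_{\mathcal{L}_\infty}+\left\|g_{k(\epsilon)}\star h-h_m^g\right\|_{\mathcal{L}_\infty}<\frac{\epsilon}{4}+\frac{\epsilon}{4}=\frac{\epsilon}{2}.\tag{8}$$

The proof is completed by substituting (8) into (7).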

### 2.3 Proof of Theorem 5(b)

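For any $\epsilon>0$ and compact $K\subset\mathbb{R}^n$, it suffices to show that there exists a sufficiently large $m(\epsilon,K)\in\mathbb{N}$ so that, for all $m\ge m(\epsilon,K)$, there is an $h_m^g\in\mathcal{M}_m^g$ such that $\|f-h_m^g\|_{\mathcal{L}_\infty(K)}<\epsilon$.

By Lemma 5, we can find a $k(\epsilon)$, such that

$$\|f-g_k\star f\|_{\mathcal{L}_\infty(K)}<\frac{\epsilon}{3},\tag{9}$$

for every $k\ge k(\epsilon)$. Since $g\in\mathcal{C}_0$, we have $\|g\|_{\mathcal{L}_\infty}\le C$ for some positive $C$, by Lemma 6. For any $r>0$,

$$\begin{aligned}
\left\|g_k\star f-g_k\star\left(\mathbf{1}_{\bar{B}_r}f\right)\right\|_{\mathcal{L}_\infty}
&=\left\|g_k\star\left(\mathbf{1}_{\bar{B}_r^\complement}f\right)\right\|_{\mathcal{L}_\infty}\\
&=\left\|\int\left(\mathbf{1}_{\bar{B}_r^\complement}f\right)(y)\,k^ng(kx-ky)\,\mathrm{d}\lambda(y)\right\|_{\mathcal{L}_\infty}\\
&\le k^nC\int\left(\mathbf{1}_{\bar{B}_r^\complement}f\right)\mathrm{d}\lambda=k^nC\left\|f-\mathbf{1}_{\bar{B}_r}f\right\|_{\mathcal{L}_1}.
\end{aligned}\tag{10}$$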

For fixed , we may choose , using Lemma 4, so that and thus the final term of (10) is bounded from above by for all . Thus, for and,

 (11)

Using Lemma 3, with approximand , component density , compact set , , and with fixed, we have the existence of a density