 # A Learning Theory in Linear Systems under Compositional Models

We present a learning theory for the training of a linear system operator having an input compositional variable and propose a Bayesian inversion method for inferring the unknown variable from an output of a noisy linear system. We assume that we have partial or even no knowledge of the operator but have training data of input and ouput. A compositional variable satisfies the constraints that the elements of the variable are all non-negative and sum to unity. We quantified the uncertainty in the trained operator and present the convergence rates of training in explicit forms for several interesting cases under stochastic compositional models. The trained linear operator with the covariance matrix, estimated from the training set of pairs of ground-truth input and noisy output data, is further used in evaluation of posterior uncertainty of the solution. This posterior uncertainty clearly demonstrates uncertainty propagation from noisy training data and addresses possible mismatch between the true operator and the estimated one in the final solution.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Compositional models are used universally in interpretation of compositions or relative ratios of discrete sections. A compositional model satisfies the constraints that the model elements are all non-negative and sum to unity. For example, pie charts, populations ratio, financial portfolios, etc. adopt this model. A probability mass function for a discrete random variable can be also considered as a compositional variable. However, direct inference for compositional variables can be difficult. Instead, transformation methods are often used. For example, softmax functions or similar functions transform the intermediate solution vectors in a multi-dimensional infinite or confined space to compositional variables. The result compositional variables through transformations can be interpreted as probability functions but they are not derived from probability models. Furthermore, describing the probability space of the inferred compositional solution after transformation is generally difficult, given a likelihood or probability function for an observation. This work addresses these two problems. We assume that we have partial or even no knowledge of the operator but have training data of input and ouput. We present a training method from ground-truth compositional variables and corresponding noisy outputs from a linear system. Then, we quantify the uncertainty in the trained operator. For several stochastic compositional models, we explicitly derived the convergence rates. With an estimated covariance matrix, we present a method to evaluate posterior solutions under a Bayesian framework and this addresses a possible mismatch between the true system operator and the estimated one. The posterior density for the compositional variable clearly demonstrates uncertainty propagation coming from the noisy training data.

## Ii Problem Statement

The linear forward model for noisy measurements can be stated as

 s=Am+n, (1)

where is a compositional vector of fractions, is an observation vector, is an additive noise, and is a matrix that describes the generative process. Assuming a zero-mean multinormal noise model with covariance matrix , the likelihood function that relates a measurement with the input fractions is given by

 p(s|m,A)=1√(2π)T|Σ|exp[−12(s−Am)TΣ−1(s−Am)], (2)

where the superscript denotes the transpose operator, and

denotes the inverse of the covariance matrix. The likelihood function is a multivariate probability density function (PDF) that describes the probability of measuring

given any particular choice of and . PDFs are denoted herein by , and conditional PDFs are denoted by , where the distributions of the symbols to the left of ‘’ are conditional on the values of the symbols to the right. The likelihood function is also conditioned on a number of other aspects, including the form of the forward model, the sample set used to train the forward model, and the noise model. These will be omitted from our notation since we fix these conditions.

Following Bayesian inverse theory, Bayes’ theorem

 is applied to reverse the sense of conditionality in the likelihood function to yield an estimate of given and :

 p(m|s,A)=p(s|m,A)p(m)p(s). (3)

is called the prior distribution, and describes any knowledge about that is available prior to the measurement of . In this work, this prior includes constraints on such that that its elements are non-negative and that the fractions sum to unity. could also describe correlations between the fractions, which can be learned from a training set. describes the probability of measuring with respect to the forward model when considering all possible values of . Since is constant for a given forward model, likelihood and prior, it can be taken as a constant of proportionality that normalizes such that it integrates to unity, and thus it is often not computed in Bayesian inversion. However, it plays a role in computing the Bayesian evidence, the measure of model fitness that we will investigate in future .

is called the posterior distribution. It describes all the information on that can be inferred from and , given all of the assumptions on which it is conditioned. Regarding the uncertainty of , we can use the joint posterior distribution . Because we focus on the unknown compositional variable and its uncertainty, we derive a marginalized posterior distribution for after marginalizing

from the joint distribution. We can draw an ensemble from

to provide a representation of the uncertainty in the posterior estimate of .

## Iii Training

### Iii-a Estimating A

We can train from a training set of samples. Each sample is assumed to have fractions and the sum of elements in each sample is one. We denote the set . Each compositional vector is of length . The measurements are obtained for each sample and assigned to the set , which is the training set for .

Since we are now focusing on the estimation of , the likelihood function (2) is now made explicitly conditional on . In order to estimate from the training set, we apply Bayes’ theorem to find the posterior on given the measurements:

 p(A|S,M) =p(S|M,A)p(A|M)p(S|M) (4) ∝p(S|M,A)p(A|M), (5)

where the proportionality constant, , is dropped because it is independent of . Here is the prior on , and the likelihood function is given by

 (6)

Assuming a uniform prior for , the maximum a posteriori (MAP) estimator of is equivalent to a maximum-likelihood estimator (MLE) for , which is given by the value of that minimizes

 (7)

Taking the derivatives of with respect to the elements of [2, Eq. 88] and equating to zero yields the following equation for :

 −Σ−1N∑i=1(si−Ami)miT=0, (8)

where is the matrix of zeros of the same dimensions as . Since is the true covariance matrix and not an estimate based on a small number of samples, it has full rank and is thus invertible. The above equation can thus be simplified to

 AN∑i=1mimiT=N∑i=1simiT. (9)

The formal MAP solution for can thus be expressed in matrix form as

 ~A=SMT(MMT)−1, (10)

where is the matrix whose -th column is , is the matrix whose -th column is , and it is assumed that the training set is constructed such that the inverse of exists. If this inverse does not exist, then we require more samples in the training set.

### Iii-B Variance Analysis of the MAP Estimator of A

The MAP estimator (10), under our assumption of a uniform prior on , converges with increasing to the true unknown since it is also a maximum likelihood estimator. We investigate this convergence rate under the assumption that the noise is multinormal with covariance

. In other words, we perform an analysis of variance of the MAP estimator. A natural extension of this discussion is to design an experiment of fabricating

such that the noise in is optimally reduced.

Since we assume that the noise in each column of

is independent and follows a multivariate normal distribution with mean

and variance , i.e., , the -th column of is distributed as . Denoting the -th column of by , the distribution of -th column of is given by

 ~ai∼N(ai,∥bi∥2Σ), (11)

where is the -th column of . Thus, the estimator is unbiased and expectedly centered at the true unknown value and the variance scaling factor for is given by , where is the Euclidean norm.

Further analyzing this variance scaling factor,

 ri =∥bi∥2=∥MT(MMT)−1ei∥2 (12) (13) =[(MMT)−1]i,i (14) =[1N(1NN∑j=1mjmTj)−1]i,i, (15)

where is the -th column of the identity matrix , and is the -th diagonal entry of the enclosed matrix. When

is large, by the law of large numbers,

 ⎡⎣1N(1NN∑j=1mjmTj)−1⎤⎦i,i→1N[E[mmT]]−1i,i, (16)

where is an expectation of the enclosed matrix. The expectation is defined as

 E[mmT]=∫ΩmmTp(m)dm, (17)

where is the domain for , and is a probability density function describing the compositions in the training set. Therefore, assuming that is large, we have the following value for :

 ri=1N[E[mmT]]−1i,i. (18)

The variance scaling factor can be interpreted as the multiplicative ‘error’ factor in the estimation of through training; this is because the true value, which is a mean parameter , is unknown in practice, and . Thus, the uncertainty in is related to the uncertainty in through the scaling factor .

In the following two sections we explore the variance scaling factors that result from different choices of compositional mixtures in the training set. We first consider the case of repeated identical compositional mixtures, and use (14) to compute . We next consider the case where the compositional vectors are stochastic, and use (18) to compute . In the stochastic case, the probability density function in (17) describes the distribution used in sampling the compositional variables.

#### Iii-B1 Variance scaling factor for repeated mixtures

Assume that there are unique compositions and that each composition in the training set is repeated times. Putting each of these mixtures into a matrix partition , the for the full training set is defined by , with . Then (14) becomes

 ri=1K[(M0M0T)−1]i,i, (19)

where the compositions are chosen such that is invertible. Thus, the noise variance of the estimated reduces as with increasing repetitions of the compositions. Note that this convergence rate for repeated compositions is identical to that when ‘pure’ samples are used, i.e., when is the identity matrix, .

The variance scaling factor is given by the diagonal values of . As noted above, for pure samples this scaling constant is unity. Next consider binary mixtures of the form

 M0=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1/200…01/21/20…001/21/20⋮⋮⋱1/200001/21⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦, (20)

where adjacent pairs of the endmembers are mixed in equal mass portions, and the -th endmember is provided in pure form in order to make invertible. In this case the variance scaling factor has the form . , the maximum value of , varies with as . Also note that the minimum factor is . For example, with ten endmembers the maximum value is , meaning that the variance is 37 times larger than for a sample of pure compositions.

#### Iii-B2 Variance scaling factor for stochastic mixtures

For a stochastic mixture, each column of is a realization of a random vector subject to a probability density function that must honor the non negativity and summation constraints on . In practice, the variance scaling factor, , could be evaluated by means of Monte Carlo methods for any feasible distribution by simply sampling from that distribution and evaluating the expectation of the function of those samples given in (18). In the following analysis we consider an important special case in which

 E[U2j]=E[U2k] and E[UjUk]=E[UlUm], for j≠k,l≠m. (21)

Under these conditions we have derived simple formulas for for several distributions. Following the formula for in (18) and using the assumption in (21), the expectation is given by

 E[mmT]=αI+β11T, (22)

where and . By using the Sherman-Morrison matrix inversion formula ,

 (E[mmT])−1=1α[I−11Tβα+βM]. (23)

Note that this result holds for a general random vector that satisfies (21) without positivity or summation constraints. Therefore, considering the -th diagonal entry of the inverse matrix,

 ri=r=1Nαα+(M−1)βα+Mβ. (24)

Due to the symmetry assumptions in (22), all are identical. Note that and are functions of .

In Appendix B, we derive formulas for for several distributions that satisfy the non negativity and summation constraints on :

• Multinomial distribution: Each sample from this distribution uniformly chooses one pure endmember, with replacement, from the set of endmembers

• Double-multinomial distribution with or without replacement: Each sample from this double-multinomial distribution uniformly chooses two pure endmembers, with or without replacement, from the set of endmembers

• Uniform distribution over -dimensional simplex: Each sample from this distribution uniformly chooses fractions, subject to the non negativity and summation conditions

Formulas for the in these cases are presented in Table I along with the formulas for deterministic mixtures derived above.

#### Iii-B3 Discussion on the impact of mixture selection in variance reduction

We present the several variance scaling factors for several mixture selection/preparation strategies in Table I. The variance of the linear operator always decreases with the increasing number of training samples as . However, it grows with the increasing number of endmembers with rates that vary with the choice of stochastic mixture. The worst cases increase quadratically with ; these are the deterministic binary mixtures and the uniform stochastic mixtures. The best cases increase linearly with ; these are the repeated pure endmember sets and the multinomial and double-multinomial mixtures.

We verified all the obtained variance scaling factors presented in Table I with numerical simulations. For the -th endmember and -th observation index, we define a quantity as the ratio of the estimated variance and the theoretical variance:

 γj(t)=[Var(~aj)]trj[Σ]t, (25)

where is the unbiased covariance estimator that is numerically computed. We applied two noise levels for indices and for observation indices , we compute two versions of defined as follows:

 γj(Ik)=1rj[Σ]t⎛⎝1|Ik|∑t′∈Ik[Var(~aj)]t′⎞⎠ for k=1,2, (26)

where , is an indicator function, and is a cardinality of the set . For the multinomial case, we present these curves in Fig. 1 as an illustration of how large should be, possibly depending on , in order to satisfy the law of large numbers that was used in the derivation of (18). The quantity indicates that is large enough for (18) to be valid. From this figure we see that one-percent accuracy is achieved in this example case when . Fig. 1: Under the multinomial distribution on M simplex (see Appendix B), γ(2) curves for several M values. Note the convergence after N≈800.

#### Iii-B4 Intuition to best configure training sets

So far, we have investigated the behavior of the estimation variance in several cases. Interestingly, a theoretical approximation also gives us insight on the behavior of estimation variance. Rather than focusing on an individual variance scaling factors , consider the sum of these factors, from which we gain insight on how to find that would minimize the estimation variance. The sum of the scaling factors is

 ~r =M∑i=1ri =tr([(MMT)−1] (27) =tr[(N∑i=1mmT)−1], (28)

where is the trace operator, defined as the sum of the diagonal elements of the enclosed matrix. Minimizing (27) with respect to is not trivial, but optimization of the following quantity is easy and can provide an intuition on the optimization of the scaling factor with the extreme :

 tr(N∑i=1mmT)=N∑i=1∥m∥2, (29)

where we used properties of the trace operator . The right-hand side of (29) indicates that for in the simplex (see Appendix B), the maximum is obtained when for any , corresponding to pure samples, and the minimum is obtained when . The matrix inside in (28) is the inverse matrix of that inside in (29), so we might guess that this maximization argument could apply to minimization of the , and vice versa. Indeed, the collection of maximizers of (29), thus similar to multinomial case, is the minimizer of having rate. The minimizer of (29) is the expectation of a random vector when its distribution is uniform, and this uniform distribution is the worst case (maximizer) having quadratic rate .

### Iii-C Estimating the Noise Covariance and its Inverse

The noise covariance matrix plays the role of a weighting factor in the likelihood function. Any observational indices that are less conformant to the linear model will be poorly predicted by the linear model and can be downweighted by assuming larger noise variance to these indices. In this context, the covariance matrix contributes significantly to an improved estimation of .

Given our training set, the maximum-likelihood estimate of the noise covariance matrix is found from the residuals between the predicted and measured ouput measurements:

 ~Σ=1NN∑i=1ririT, (30)

where the residuals, are defined by . The tilde over serves as a reminder that is an estimate of . Through (30), the uncertainty in is consolidated into . Note that the sample covariance matrix has a factor instead of and we call it . The correlation coefficient between the observation index and is defined as

 ρi,j=~Σi,j/√~Σi,i~Σj,j, (31)

where indicates the value at -th row and -th column of .

In practice, estimating the noise covariance can be posed as a well-known problem of high dimensional covariance matrix estimation; is often much less than the dimension of , making rank deficient. This rank deficiency of impacts our estimation of because the likelihood function (2) is expressed in terms of the inverse of . Recognizing that the off-diagonal elements of are poorly estimated when the training set contains a small number of samples, we propose that only a certain portion of the entries close to diagonal entries, or simply just the diagonal entries, of be estimated. Thus, the impact of these spurious off-diagonal entries can be offset by accepting only the diagonal elements. A further advantage of this diagonalization is that the inverse of is now well defined, which is used in estimation of . However, in general this diagonalized version of may overestimate the uncertainty in because the correlations are removed by marginalizing them away.

More general approaches in high dimensional covariance matrix estimation than the above diagonalization approach include banding , tapering , and block banding  methods. Among these, tapering and block banding require additional knowledge such as the tapering parameter and block structure. The banding method requires only the width of the diagonal band. It assumes a simple model describing a noise structure, because one can expect a nonzero correlation between close observation indices and a zero correlation between far observation indices.

The banding estimator is defined as

 ^Σ=(~σsampleijI(|i−j|≤K)). (32)

Under the several conditions , the convergence rate is

 E∥^Σ−Σ∥2≤C(logTN)αα+1, (33)

where is a positive constant, is the dimension of the output, e.g. is a matrix, and is the decaying factor of off-diagonal entries such that

 |σij|≤M|i−j|−(α+1) (34)

for some and .

## Iv Inversion

Once and have been estimated from the training set of samples, a new output measurement, , can be inverted to create a Bayesian estimate of the associated using (3). In the following, we drop the tildes from and for notational simplicity and denote the mean of by .

In the composition model, the prior distribution must enforce on the elements of and , i.e., the elements of sum to unity. Within the support imposed by these constraints, we assume a uniform distribution for , thus noninformative, as its prior distribution .

### Iv-a Solution with A fixed

With fixed, combining the likelihood function of (2) with a prior yields a posterior estimate of :

 p(m|s)∝p(m)exp[−12(s−Am)TΣ−1(s−Am)], (35)

with for an -dimensional simplex (see Appendix B). The constant of proportionality is found by normalizing such that it integrates to unity.

The MAP solution of (35) is then given by the value of that satisfies

 ATΣ−1Am=ATΣ−1s, (36)

subject to the non-negativity and summation constraints. This has the form of a standard linear programming problem that can be solved via standard numerical libraries. One such numerical solver is the Matlab

 software toolbox fmincon for solving the linear optimization with linear constraints. The solution to (36) is guaranteed to be unique when has full rank, as was the case in all of our experiments.

However, even when this MAP solution is unique, there may be strong correlations between elements of . This indicates that some elements are either fundamentally unresolvable or only weakly resolvable by the linear system. Note that the summation constraint in the prior imposes a correlation between endmembers even when no correlation is imposed by the likelihood function. For example, when the composition vector contains only two endmembers, and , as increases, must decrease commensurately to preserve the summation constraint. It is critical to capture this correlation in the analysis to avoid mischaracterization of comnpositional variables.

The uncertainty of

cannot be simply described, as for Gaussian distributions, by the inverse of the Hermitian matrix

because the constraints impose the boundary of feasible vectors on the posterior. To describe the posterior uncertainty, we use the drawn samples from the posterior distribution via Monte Carlo sampling methods. One can perform sampling based on a truncated multinormal sampler handling (in-)equality constraints.

### Iv-B Solution with stochastic A

In this section, instead of using a point estimate of ignoring its uncertainty, we assume has stochastic perturbations centered on the MLE of . By following the Bayesian rule, we have the joint posterior distribution for and but will be integrated out so that the final solution for has the uncertainty terms coming from the training of . We present practical estimation methods at the end of this section using Markov chain Monte Carlo techinques.

Assuming the Gaussian noise and independent sample acquisition, a noise in a sample is statistically independent from the noise in another sample, i.e., the noise in from the training set is independent of in . The prior distribution of, or the MAP estimator for, can be represented111Strictly speaking, due to our derivation through the MAP estimator, is a function of for or

, thus it is a random matrix.

as

 p(A;M,S)∝exp[−12tr((A−A0)TC(A−A0)F)], (37)

where is the mean of , and is a diagonal matrix having for . Henceforth, we omit the set of the training inputs and measurements, , to simplify notations of . Since we do not know , in practice we set , which is the MLE solution for with given and and converges to the true mean of as increases.

The full posterior distribution is then

 (38)

And we integrate it with respect to to incorporate the uncertainty coming from the training of .

 p(m|s)=∫p(m,A|s)dA=∫p(m|A,s)p(A)dA. (39)

We note that

 ∫exp[−12(s−Am)TΣ−1(s−Am)]exp[−12tr((A−A0)TC(A−A0)F)]dA =dbT/2exp[−12(s−A0m)TCb(s−A0m)],

where . Therefore, the posterior marginal distribution for

 p(m|s)∝1b(m)T/2exp[−12(s−A0m)T(b(m)Σ)−1(s−A0m)]⋅I(m∈SM). (40)

This is not generally convex so is difficult to optimize. However, when and it locally looks like a convex function of . Also, because we can obtain the tight maximum bound for such that (see Appendix A)

 b≤1+max(rj)=:bmax. (41)

When and are small, is close to one, so the posterior marginal distribution behaves similarly to the case in the previous section, where the uncertainty of is not considered, e.g. the distribution for is a linearly-confined Gaussian distribution with mean and variance and the unbounded version variance for is but .

The added uncertainty coming from is the multiplicative factor and the worst/upper bound is . Note that, always since and . Thus, the uncertainty coming from always increases the uncertainty in estimating .

The optimization of can be done by using software packages, Markov chain Monte Carlo (MCMC) methods [7, 8], or variational Bayes approaches . In this paper, we propose a sampling technique in the MCMC framework to quantify uncertainties of . Because Gibbs’ sampling is difficult to apply in this complex distribution (40), our sampling strategy uses the Metropolis-Hastings algorithm  with proposal distribution for having variance on a simplex domain. The proposal distribution to draw a new sample given the previous sample is symmetric as follows

 q(m→m′) ∝N(m−m′;0,(A0TΣ−1A0)−1)⋅I(m′∈SM) =1Z(m)exp[−12((m−m′)T(A0TΣ−1A0)(m−m′))]⋅I(m′∈SM),

where is a Gaussian distribution and is a normalization constant and function of . The acceptance rate is then defined as follows

 A(m→m′)=min(1,acc(m→m′)), (42)

where

 acc(m→m′) =p(m′|s)q(m′→m)p(m|s)q(m→m′) =exp[−12(δ′T(b(m′)Σ)−1δ′−δT(b(m)Σ)−1δ−Tlog(b(m)b(m′)))]⋅Z(m)Z(m′) ≈exp[−12(δ′T(b(m′)Σ)−1δ′−δT(b(m)Σ)−1δ−Tlog(b(m)b(m′)))],

where , , and the approximation is valid when the jump distance is small enough for . If this is not the case or the composition vector is near the boundary of the simplex, then the approximation may not be good enough. Then we need to evaluate the constants empirically using MCMC methods. Another strategy to ensure the convergence of the Markov chain is to selectively evaluate the constants if the distance is larger than a pre-defined threshold. The expected benefit of using this sampling strategy is the high acceptance rate when the training uncertainty is small, thus efficient sampling is performed. In other words, the proposal distribution would look similar to the target distribution because is close to one when and is small.

## V Experiments

We performed several tests of the proposed training-based inversion method on synthetically generated data. We consider a linear spectral system for our simulations. For testing and visualization, we use three endmembers () for training and inversion, and (realistic dimension for high resolution spectral system) and used outputs from pure endmembers. One dataset is built by using synthetic spectra of pure minerals which have large weighted- spectral distances, so these minerals are expected to be easily separable. The weighted- distance between mineral spectra and is defined as where is the noise covariance. The noise covariance can be estimated from the training set, after generating noises using the true noise covariance. Fig. 2 presents the spectra of these easily separable kaolinite, chlorite, and kerogen because their three peaks after 2500 cm do not overlap. The other dataset is fabricated with the spectra of minerals which have small weighted- distances, so they are difficult to separate. Fig. 3 presents the spectra of minerals (quartz, K-spar, and Na-spar) that look difficult to separate. The spectra overlap in the wavenumber range of peaks and look alike in their shape. Fig. 2: Three easily separable spectra where three peaks above 2500 cm−1 are distinguishable. The covariance (not shown) we use in our examples is a diagonal matrix and the noise level is high below 1500 cm−1, thus suppressing signals in the region. Fig. 3: Three mineral spectra that are difficult to separate. All the peaks are under the highly noisy region before 1500 cm−1 and alike in terms of their locations.

### V-a Minerals that are easy to separate, with m=[1/3,1/3,1/3]

The first test data has a training set of easily separable end-members (pure minerals) from Fig. 2, and the true mineralogy is . The corresponding inversion result addressing the uncertainty in the trained is presented in Fig. 4

. We present the 2-dimensional sample 95% confidence interval (CI) or ellipse in green curve and our CI contains the true solution, whereas the solution that does not consider the uncertainty coming from the training of

has a large bias. Also, with various values of we empirically observed that the confidence ellipse (from the sample covariance structure) converges to the model, true covariance as the size of the training set increases. Fig. 4: The inversion of easily separable spectra from Fig. 2 using a training set of N=30. The MAP estimate is the green dot, 95% confidence interval is the green ellipse, true mineralogy is the red dot, and the blue dot is the estimated mineralogy without considering the uncertainty coming from estimation of A. This result demonstrates that the uncertainty coming from the training of A must be considered to compute the solution and avoid the large bias. Our CI (green ellipse) contains the true solution (red dot). We note that the uncertainty level is approximately ±0.01 or 1 weight percent.

One significant advantage of our proposed method with the stochastic model is visualized in Fig. 4 by comparing our solution (MAP and CI in green dot and ellipse, respectively) with the result from the conventional optimization analysis in blue dot. By ‘conventional’ approach, we mean the optimization approach with fixed, as described in Section IV-A. In this conventional approach, is the MLE and treated as a fixed parameter playing a similar role of in the stochastic model. This inversion is performed by maximizing the posterior density (35), which is equivalent to maximizing (2) with the same constraint, . Since the estimation problem with fixed can be posed as a convex programming, we used a software package, Matlab fmincon, with the constraint in maximizing the likelihood function, which produces a point estimate. The inversion result clearly indicates the need for our approach. Moreover, because of the large bias, the probable solution area from the fixed does not cover the true mineralogy unlike ours222This is not shown to avoid convoluted graphics but it is trivial to see that the solution is far away from the true mineralogy. The comparison concludes that without addressing the uncertainty of the standards obtained from the training, we may not be able to obtain solutions whose confidence interval contains the true mineralogy.

### V-B Minerals that are difficult to separate, with m=[1/3,1/3,1/3]

A similar test set is constructed (end-members in the training set, ) but with minerals of quartz, K-spar, and Na-spar that are difficult to separate (Fig. 3). Here, we used training samples due to the low resolution of the spectra in Fig. 3. The low resolution is caused by the overlapping peaks of the quartz, K-spar, and Na-spar, as well as the their peak locations in noise abundant region ( cm). We also note that for the sample noise covariance is close to the true covariance, thus verifying the theory of the estimation of the covariance matrix in Section III-C.

In the inversion result presented in Fig. 5, our CI (green ellipse) again contains the true solution (red dot). This time the solution not considering the uncertainty from training is close to the true mineralogy. However, the uncertainty level now is approximately or 10 weight percent, almost 10 times larger than the case of easily separable minerals, compared to Fig. 4. The separability of the minerals caused this large uncertainty. The size of the CI we evaluated can be thought of the uncertainty level of mineralogy; if the noise level is small, the size of the confidence ellipse derived from the mineralogy covariance shrinks, and if the minerals are easily separable, e.g., major non-overlapping peaks over 2000 cm, the covariance and ellipse size shrinks. Fig. 5: Inversion result using N=1000 training samples and the three spectra that are difficult to separate from Fig. 3. The MAP estimate is the green dot, 95% confidence interval is the green ellipse, true mineralogy is the red dot, and the blue dot is the estimated mineralogy without considering the uncertainty coming from estimation of A. Again, our CI (green ellipse) contains the true solution (red dot). The solution not considering the uncertainty from training is close to the true mineralogy. However, the uncertainty level now is approximately ±0.1 or 10 weight percent, almost 10 times larger than the case of easily separable minerals, compared to Fig. 4. The separability of the minerals caused this large uncertainty.

### V-C Minerals that are difficult to separate, with m from a uniform distribution on S3

To link the training error to the performance of the inversion in the worst case, we synthesize data with the following setting: we uniformly draw a sample as a true mineralogy made of three minerals difficult to separate (quartz, K-spar, Na-spar) from , later to test whether it is within 95% CI from our inversion. Under the uniform distribution, all the valid mineralogy vectors have the same chance of selection/sampling. For the training set, we obtain samples under the several structures: pure minerals, binary mixes, and uniformly distributed mineralogy. Under each structure we perform the inversion and evaluate the ‘inclusion ratio’ of the true mineralogy within 95% CI for all minerals. We vary from 30 to approximately 1,000. In Fig. 6, the evaluated performances as of inclusion ratio verify our prediction based on the error scaling factor . For pure mineral training samples ( = end-member), , for binary mix training samples, , and for uniformly drawn training samples, . Therefore, the pure mineral case performs the best and the other two cases perform similarly and this was forecasted before inversion using values. Fig. 6: The inclusion ratio of the true mineralogy within 95% CI for all minerals, considering the three spectra that are most difficult to separate (Fig. 4). In the simulation, the true mineralogy is uniformly drawn on S3. The performances presented here verifies our prediction based on the error factor r. For pure mineral training samples (m = end-member), r=M/N=3/N, for binary mix training samples, r=(2M+MM−1)/N=9/N, and for uniformly drawn training samples, r=M2/N=9/N. Therefore, the pure mineral set leads to the least error factor and the performance for the remaining two cases is comparable.

### V-D Discussions

Our proposed method can address uncertainty in mineralogy space, which comes from the three sources of the spectral uncertainty – spectral noise, the ambiguity/similarity of certain minerals such as feldspar and quartz, and the training error. The noise directly affects the degree of uncertainty in the inversion whereas the ambiguity of minerals requires a study in their individual spectra. The training error can be predicted and it will be resolved by using a large number of the training samples. To resolve this error, from our experience, the minimum should lead to under a certain structure of the training set, thus an value can be suggested. We performed a fundamental analysis of ambiguity of minerals by using the weighted distance measure on the standards to understand which minerals are difficult or easy to separate and their effect in reconstructed mineralogy. From our numerous experiments, we found that when the estimation of the mineral standards is close to the true mineral standards , e.g. in the worst test case, the inversion result is reliable even when the conventional approach with the fixed model is applied. Otherwise, it is better to apply the stochastic

model for the unbiased estimator. The estimation of

indeed affects the performance of uncertainty analysis in inversion. We can verify this effect by checking whether the 95% confidence interval centered at the estimated mineralogy captures the true mineralogy and by looking at the estimated covariance structure converging to the true covariance structure as the size of the training set increases.

## Vi Conclusion

We propose a novel training and inversion approaches for compositional variables. The developed theory addresses several aspects of uncertainty that can arise in analyzing linear forward models, especially regarding training of the system and reconstruction of the unkown compositions. The proposed training analysis quantifies the variance scaling factor for the estimation of . We evaluated this factor under several cases of either deterministic or stochastic training sets, thus accounting for the uncertainty propagated from the estimated standards in the training step. The derived posterior distribution for a compositional variable produce more robust solutions that can capture uncertainty or ambiguity of compositional vectors. Especially, the comparisons between using fixed and stochastic models, when there is an uncertainty in , demonstrate that without addressing the uncertainty of the estimated obtained from the training, we may not be able to obtain a solution whose confidence interval contains the true composition. When the estimate is close to the true (for large or small ), then a point estimate obtained from fixed model can be accurate. Our theory implies that when is small, the uncertainty amplification factor in the estimated operator is close to unity () and both the fixed and stochastic models behave identically.

## Appendix A Maximum bound for b in the posterior marginal distribution p(m|s)

The maximum of can be found as follows

 b :=1+M∑j=1m2jrj (43) ≤1+M∑j=1mjrj (44) ≤1+max(rj)=:bmax. (45)

The first inequality holds because for . The second inequality holds because and .

Furthermore, is the tight bound for . For example, if the sample is a pure endmember ( for some ) and the corresponding variance scaling factor for the endmember is the largest (), then . Therefore, the maximum is achieved in this case.

## Appendix B Distributions on an n-simplex

The composition vector is constrained such that its components are nonnegative and sum to unity. These constraints define a simplex set such that any feasible is in the simplex set. An -dimensional simplex, or -simplex, is defined by

 Sn={(x1,…,xn)∈Rn:n∑i=1xi=1,xi≥0 for ∀i}. (46)

Any feasible of length is in . Let the random members of be denoted , where the are random variables.

### B-a Correlation property of distributions on an n-simplex

For , we define the following:

 Yk =k∑i=1Ui, (47) dk =E[Y2k], and (48) σ =E[U21]. (49)

are dependent on , however for simplicity we omit this dependence in our notation. From the definition of the -simplex, it is clear that

 Yn =1, (50) dn =1, and (51) σ =d1. (52)

These can be considered as boundary conditions for a distribution on an -simplex.

One can derive a recursive equation for under the assumptions (21):

 dk =E[U21]+E[Y2k−1]+2E[U1(U2+⋯+Uk)] (53) =E[U21]+dk−1+2(k−1)E[U1U2] (54) =σ+dk−1+2(k−1)β. (55)

Applying this recursion to solve for yields

 dk=kσ+(k−1)kβ. (56)

Applying (51) into (56) when leads to

 1=nσ+(n−1)nβ (57)

This equation (57) relating and can be used to demonstrate an intuitive and interesting property on the correlation. The correlation between random variables and is defined as follows:

 Corr(U1,U2)=E[(U1−EU1)(U2−EU2)]√E(U1−EU1)2√E(U2−EU2)2 (58)

Under the following assumption,

 E[Uj]=E[Uk]=μ for j≠k, (59)

the mean is

 μ=1/n. (60)

Using (57) and (60), (58) becomes

 Corr(U1,U2) =β−μ2σ−μ2 (61) =−1n−1 (62)

This result matches our intuition; when a variable is increased by , the sum of the remaining variables is decreased by , thus the expectation of the decrease in one of the remaining variables would be .

### B-B Example distributions on an n-simplex

We consider three distributions on a -simplex that satisfy (21) and (59): the multinomial distribution , the double-multinomial distribution , and the uniform distribution .

#### B-B1 Multinomial distribution

If , then

 P(m=ej)=1/M for j=1,…,M, (63)

where is the -th column of the identity matrix. Under this distribution,

 E[U1U2]=0 and EU21=EU1=1/M, (64)