Compositional data are used in many fields because population ratios or fractions are easy to interpret. However, when the compositional data cannot be produced from the raw data or measurements by simple scaling or normalization with the whole population size, the process to produce such compositional outputs may not be straightforward. Here, we consider noisy outputs as our observations from an unknown linear or nonlinear system with the corresponding compositional variable inputs of interest. The pairs of inputs and outputs are used as a training set for artificial neural network (ANN) modeling to estimate the inverse of the unknown system. This trained inverse system can predict the unknown compositional input, given an output measurement from the original system. As our approach is based on ANNs, we do not directly estimate the forward observation model, as in classical inversion theory, but the inverse of the original system. The measurements, i.e., the outputs from the original system driven by compositional inputs, are then the inputs of our estimated inverse system, which predicts the original compositional inputs. We do not apply post-processing or ad-hoc approaches, such as truncation of the estimate followed by scaling, so that the final answer is a non-negative vector that sums to one. Rather, we directly apply non-negativity and scaling layers in the proposed ANNs.
We considered both linear observation models and several types of nonlinear models. For the linear cases, where we can theoretically analyze the optimal performance bounds, we demonstrated with our experiments that the performance of ANNs for the inversion of the linear model outputs can compete with the optimal bounds. For the nonlinear systems, where convex optimization methods are not well suited for these general cases, we could still present promising results compared to the error levels in the linear models, and we leave the comparative analysis with other feasible optimization methods for future work.
II Observation Models
We first define a compositional vector and then present a general observation model. Then, we will formulate more specific observation models. Examples of compositional data or vectors include population ratios, concentrations of chemicals in the air, and numerous survey statistics in percentages.
We define the compositional vector to be constrained such that its components are nonnegative and sum to unity. These constraints define a simplex set such that any compositional vector is in the simplex set. An -dimensional simplex, or simply -simplex, is defined by
Let be the th component of a compositional column vector , then it can be denoted by where is a transpose operator. Further decomposing it leads to in terms of its components with basis vectors , which is th column of identity matrix .
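As a lightweight sanity check of these constraints, a simplex membership test can be sketched as follows (a numpy-based illustration; the tolerance handles floating-point error):

```python
import numpy as np

def in_simplex(x, tol=1e-9):
    """Return True if x is a compositional vector:
    nonnegative components that sum to one (up to tol)."""
    x = np.asarray(x, dtype=float)
    return bool(np.all(x >= -tol) and abs(x.sum() - 1.0) <= tol)
```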
We now assume the following system , i.e., a forward observation model that generates our observation from the dimensional compositional input such that .
where , is a function from to , and is additive noise. (For multiplicative noise, taking the log transformation of the observation leads to the same formula.)
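This general observation model can be sketched as a generic simulator; here f stands in for the unknown system and sigma for the noise level (both hypothetical names for illustration):

```python
import numpy as np

def observe(f, x, sigma, rng):
    """Noisy observation y = f(x) + n, with i.i.d. zero-mean Gaussian noise n."""
    y_clean = np.asarray(f(np.asarray(x, dtype=float)), dtype=float)
    return y_clean + sigma * rng.standard_normal(y_clean.shape)
```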
In the rest of this chapter, we define specific forms of a nonlinear system with more restrictions as we proceed, finally leading to a linear model.
II-A General Systems
The system response from an input , without noise, is
The input is decomposed by using trivial bases s. If the system behaves nonlinearly or in non-parametric ways without closed forms, then for the characterization of the system and the inversion for the input given the output, mapping or non-parametric estimations, such as those based on nearest neighbors of pairs of input and output, could be working solutions. Training of ANNs is also possible as a candidate mapping solution.
For example, where , is a ceiling operator that maps to the integer domain, and are dimension-compatible matrices, and is a scalar constant .
II-B Systems with additivity
II-B1 A System with partial additivity
If the system holds partial additivity for several sets of groups, s, each of which is a set of component indices for the input vector , then
where is a function of the same dimension as but specific to the group and is a tuple of the components of in the indices in . Note that the s do not have to be disjoint, so the intersection of and for may be non-empty.
A special case of this system can be the multiplicative system with the constant vectors corresponding to th functions of , . This can be seen as a linear system with respect to s.
where is a constant vector and independent of and is a nonlinear scalar function of . Note that can be either invertible or non-invertible. For the special case of the latter, where is a thresholding operator, we can minimize the inevitable estimation bias by configuring the optimal (inversion) mapping rule from output to input. Refer to Appendix -B.
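Assuming the separable special case described here, i.e., a response of the form "sum over i of g_i(x_i) times a constant vector a_i" with the a_i as columns of a matrix A, a minimal sketch is:

```python
import numpy as np

def separable_response(A, gs, x):
    """Response y = sum_i g_i(x_i) * a_i, where a_i is the i-th column of A
    and g_i is a scalar nonlinearity applied to the i-th component of x."""
    gx = np.array([g(xi) for g, xi in zip(gs, x)])
    return A @ gx
```

With identity nonlinearities this reduces to the linear system y = A x, matching the remark in the text.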
This special case model can be practical because a general system on a simplex can be well approximated, if is differentiable, with a Taylor expansion. Even non-differentiable systems can be approximated by differentiable ones and can be decomposed. For a point , a general system response is with , for order ; note that the notation is ‘loosely’ defined in relating the order . For example, for , in the th term can be either , or . For more precisely defined terms, refer to Appendix -C.
An example of this model is the following:
with for and .
II-B2 An additive system with component-wise responses
If additivity holds for the system and the component-wise system response depends on the composition, then we model this system as the following.
where is a function of a scalar . For the th component, the system response depends on the composition of , such as shape change in the response. For example,
for a fixed index vector in observation . The peak location of this function is translated from to and the magnitude of the peak is scaled from to , as changes from 0 to 1.
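The composition-dependent shape change described here can be illustrated with a hypothetical Gaussian bump whose peak location and height both depend on the composition weight; the grid, endpoints, and width below are illustrative assumptions, not the paper's actual function:

```python
import numpy as np

def peak_response(x_i, idx, p0=0.0, p1=4.0, a0=1.0, a1=2.0, width=1.0):
    """Hypothetical component-wise response on an index grid idx:
    a Gaussian bump whose peak moves from p0 to p1 and whose height
    scales from a0 to a1 as the composition weight x_i goes from 0 to 1."""
    center = p0 + (p1 - p0) * x_i
    height = a0 + (a1 - a0) * x_i
    return height * np.exp(-0.5 * ((idx - center) / width) ** 2)
```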
II-B3 An additive system with fixed-shape component-wise responses and nonlinear scaling factors
If additivity holds for the system and the component-wise system response is a scaled version of a fixed shape characterized by the component, then we model this system as the following.
where is an arbitrary scalar function on the specific component of index and . For example, can be quadratic or piecewise continuous: where and zero elsewhere.
II-B4 A Linear System
II-C Systems with missing or noisy compositional vectors as obfuscating unknowns
Here, we do not assume complete knowledge of the dimension of the unknown compositional vector; we are either ignorant of a partial vector in some dimensions or interested in the compositional vector excluding this partial vector. In other words, we consider that the whole compositional vector comprises two components, and the measurement forward model is
We assume that we do not have knowledge of the existence of the obfuscating unknown vector or compositional noise vector , and equivalently we are interested in obtaining only . The training set consists of pairs without . In practice, such a compositional noise vector can come from environmental effects, which are difficult to measure but still affect even controlled experiments.
Note that this model includes a trivial but practical case where a constant bias is added to our observation, e.g., spectral offsets from the environment, such as contributions of environmental elements in X-ray based spectroscopy.
III Baseline Performance Analysis for Inversion
Considering the models introduced in the last chapter, we will provide analyses based on the loss functions, metrics, and obfuscating variables in this chapter. Because the inversion performance of nonlinear systems with the simplex constraint is difficult to analyze compared to the linear inversion without the constraint, we provide theoretical analysis or bounds for the linear case as surrogate ones.
III-A Loss functions and performance metrics
III-A1 Loss function with the compositional target
Ideally, we want to directly minimize some distance, as an estimation error, between the estimate and the true composition vector. In other words, the loss function in an ideal form can be used to minimize a distance between the true vector as the target and the estimated vector obtained from an estimator on the corresponding measurement , as seen below.
where both and satisfy the simplex constraints.
A trained system after minimizing (10) using a set of samples can produce a compositional estimate on a new measurement but this estimation is performed through mapping of the measurement as an input to the system, not by typical inversion. In this work, we will perform the optimization of the mapping function by minimizing the above distance using ANN on the training set under a given model order or hyper-parameters. The trained model retains estimated parameters such as weights and biases.
Considering possible convex optimization approaches, we note that it is difficult to formulate and efficiently solve a convex loss function with an explicit form of because of the simplex conditions. For example, the typical projection onto a simplex is not a convex function. The simplex constraint is linear, but applying the boundary conditions is not always trivial, especially in high dimensional spaces. To the best of our knowledge, efficient convex optimization algorithms guaranteeing globally optimal solutions are difficult to find. In contrast, ANNs are generally non-convex with nonlinear activation functions, but their training phase, if performed well with sufficient data, empirically yields good performance with a large number of training samples.
III-A2 Loss function with the measurement
In practice, or when testing the inversion of a measurement using the trained system, we cannot directly minimize the distance of the estimate from the true compositional input because the input is not known but is to be estimated. Therefore, many inversion methods do not use the ideal loss function of (10) with the unknown , but adopt loss functions of the measurements and the estimated projections on the observation domain, called projection errors. For practical optimization using measurements only, we will use the following loss function
with and distance .
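A minimal sketch of such a projection-error loss, assuming a squared l2 distance and a generic forward model f (both assumptions for illustration):

```python
import numpy as np

def projection_loss(f, x_hat, y):
    """Measurement-domain loss: squared l2 distance between the observation y
    and the re-projected estimate f(x_hat)."""
    r = np.asarray(y, dtype=float) - np.asarray(f(x_hat), dtype=float)
    return float(r @ r)
```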
The simplest case of this type of optimization is for the linear system and the unconstrained domain for , i.e., for . Standard classical linear regression methods can be used for this unconstrained optimization, minimizing the distance between the linear observation and the projection of the estimate.
We note a special case where training samples are used for the estimation of a linear system with the simplex constraint . That work does not cover nonlinear systems but shows how direct inversion is effectively done after training the linear system having compositional inputs as unknowns.
For this simplest case with linear systems , from the viewpoint of ANN approaches, the minimum structure is a shallow network where only one matrix of weights without bias is used. This weight matrix equals the pseudo-inverse of the linear system matrix , denoted by . However, we empirically confirmed that the ANN with this minimum order converges slowly, whereas higher-order models converge fast while guaranteeing the performance. Such higher orders seem redundant at first, but we experimentally observe that they converge and perform better and more consistently throughout our different experiments. In other words, the minimum possible structure in ANNs may not be the practically optimal one. We adopted this principle in our work.
III-A3 Performance metrics
For fair comparisons of different methods, we use the following metrics of (average of distances of errors) and (average of absolute deviations or errors) in percent (%).
where is the sample size and is a vector of component-wise absolute values.
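A sketch of these two metrics, under the assumption that each column of the matrices is one compositional vector and that the first metric averages l2 distances (our reading of the elided formulas):

```python
import numpy as np

def error_metrics(X_true, X_hat):
    """Average l2 error and average absolute deviation, both in percent.
    Each column of X_true / X_hat is one compositional vector."""
    diff = np.asarray(X_true, dtype=float) - np.asarray(X_hat, dtype=float)
    e_l2 = 100.0 * float(np.mean(np.linalg.norm(diff, axis=0)))  # avg l2 distance, %
    e_abs = 100.0 * float(np.mean(np.abs(diff)))                 # avg absolute deviation, %
    return e_l2, e_abs
```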
III-B Benchmark performance in linear systems
III-B1 Inversion with the knowledge of the dimension of unknowns
Here, we assume a linear system to produce a closed-form metric as a (surrogate) benchmark performance. Also, we assume complete knowledge of the dimension of the unknown compositional vector. We assume that is full-rank and overdetermined, , so is well defined.
Let by singular value decomposition and , where is an operator that vectorizes a matrix by extracting its diagonal entries. Let be the pseudo-inverse of .
The expected error in norm, , on unconstrained domain for is calculated as follows.
where is the trace operator. Therefore, the equation (11) becomes
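The closed-form expected error for the unconstrained pseudo-inverse, sigma^2 * trace((A^T A)^{-1}) = sigma^2 * sum_i 1/s_i^2 with singular values s_i, can be checked against a Monte Carlo simulation (a sketch with hypothetical dimensions and noise level):

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, sigma = 6, 3, 0.1                      # hypothetical sizes and noise level
A = rng.standard_normal((m, k))

# Closed form: E||A^+ n||^2 = sigma^2 * trace((A^T A)^{-1}) = sigma^2 * sum(1/s_i^2)
s = np.linalg.svd(A, compute_uv=False)
closed_form = sigma ** 2 * float(np.sum(1.0 / s ** 2))

# Monte Carlo estimate using the pseudo-inverse on pure-noise perturbations
Ap = np.linalg.pinv(A)
noise = sigma * rng.standard_normal((m, 20000))
mc = float(np.mean(np.sum((Ap @ noise) ** 2, axis=0)))
```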
III-B2 Inversion with missing or noisy compositional vectors as obfuscating unknowns
If we know that there can be obfuscating variables, then the standard simplex constraint for the estimated portion should be relaxed; we will have a sum-to-less-than-or-equal-to-one constraint instead of sum-to-one. Without knowing the dimension of missing or obfuscating variables, or if we simply ignore such variables, we can re-define the estimation error for a composition vector with the partial true vector of interest but without the noise vector in (9), by normalizing so that it satisfies the simplex constraint.
We provide an analysis of the impact of an obfuscating vector on inversion for linear systems. The observation model equation can be rewritten as
where . Therefore, in practice, without knowledge of even the existence of an obfuscating vector of missing variables, we seek a solution in a simplex where the linear system matrix is scaled by an unknown factor , from a measurement mixed with perturbed noise . The effective noise is generally centered at a non-zero vector and even correlated, even if is zero-mean and uncorrelated, because of the unknown system and the obfuscating vector . The obfuscating vector can be treated as either a fixed unknown or a stochastic quantity, which leads to correlated effective noise .
The loss is defined as the following.
Without knowing , to obtain , a ‘myopic’ estimator uses only , which is either given or estimated. A simple myopic estimator is , where projects any nonzero vector to , is a thresholding operator, and is a scaling operator.
The expected squared loss with an unconstrained pseudo-inverse of without projection is
where is a projection matrix of , is an orthogonal projection matrix of ; by SVD, has orthogonal basis vectors which span , with being the sum of the dimensions of and (),
follows Gaussian distribution with mean zero and covariance matrix,
is the largest eigenvalue of , and is the trace operator.
IV Experiments
We perform experiments based on examples following the models described in Section II. We start from simple models and proceed to more complex and nonlinear models.
IV-A Design and implementations
We implemented the designed simulations using Python 3.5 and extensively experimented with several objective functions, structures, tuning strategies, and different nonlinear and non-negative activation functions in ANNs.
First, to efficiently train ANNs and to generalize better, we include some redundancy in the structure. Indeed, minimal structures may not guarantee a good convergence rate, or sometimes fail to converge due to sensitivity, e.g., linear systems modeled using only weights directly linking input and output. Further redundancy to avoid overfitting, such as dropout layers, was tried but not used in our experiments because it did not improve the estimation or had little effect. Batch normalization layers are inserted between layers for efficient training.
To obtain compositional vectors as outputs of our estimators, we added a simplex projection, which is nonconvex, to the last layer in our ANNs. Here, we apply only rescaling of the vector, by dividing it by the sum of the vector components obtained from the previous layer, because the chosen activation function of the layer already guarantees non-negativity. We note that optimization of ANNs is a generally non-convex procedure, but with rich empirical guidelines to avoid local minima and achieve satisfactory performance.
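The non-negativity plus rescaling output described above can be sketched as follows (assuming a ReLU-style clamp supplies the non-negativity; the eps guard against a zero sum is our addition):

```python
import numpy as np

def simplex_output_layer(z, eps=1e-12):
    """Map pre-activations to a compositional vector: clamp to be
    non-negative, then rescale so the components sum to one."""
    z = np.maximum(np.asarray(z, dtype=float), 0.0)  # non-negativity
    return z / (z.sum() + eps)                       # sum-to-one rescaling
```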
As an objective function to minimize, we use a mean squared () distance between the ANN output and in the loss function to optimize the ANNs, after trying different distances such as mean absolute distance (using the distance), mean absolute percentage distance, categorical cross-entropy, soft-max types, etc. We empirically confirmed that using the distance achieves the best performance in terms of lowest estimation bias and fast convergence rate.
Among many optimizers or packages, we adopted the Adam optimizer for ANN training [1]. Also, we tried many tuning strategies, and the tuned parameters are mostly default values: , decay rate 0.01. The learning rates and batch sizes depend on the experiments and range from to and from 64 to , respectively. In the training stages, we checked the validation errors so that overfitted parameters are not used in testing.
We evaluated the performance mainly using compositional samples drawn according to the uniform distribution on a simplex, because this is the most scattered distribution, having the highest entropy in information theory under the volume measure. However, we added several tests with compositional samples drawn according to a mixture of concentrated distributions and the uniform distribution.
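The uniform distribution on the simplex is the flat Dirichlet distribution, i.e., Dirichlet(1, ..., 1), so such evaluation samples can be drawn as follows (a sketch):

```python
import numpy as np

def sample_simplex_uniform(k, n, seed=0):
    """Draw n compositional vectors uniformly on the (k-1)-simplex
    via the flat Dirichlet distribution; returns a (k, n) array."""
    rng = np.random.default_rng(seed)
    return rng.dirichlet(np.ones(k), size=n).T
```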
IV-B Simple linear systems
We perform the experiments on linear systems in low dimensional spaces of observations and unknowns. We set (the number of training samples) and (the number of testing samples). Thus, even if we do not know the system function (the multiplicative system matrix in this case), we know its dimension, and the matrix will be estimated using the training data.
We simulated the linear system matrix so that each of its entries was generated according to a standard Gaussian distribution. The training and test sets of compositional vectors are generated uniformly on the simplex . Let and be the matrices comprising the true label (compositional) vectors in the training and test sets, respectively.
The realistic linear model can be described with an additive noise as the following
where is a noise vector. The additive noise vector in (24) is generated such that each entry of the vector follows a zero-mean Gaussian distribution with standard deviation . The system responses in the training and test sets using the compositional inputs and are collected into the matrices and , respectively. The MLE (maximum likelihood estimate) of the system matrix is obtained as follows:
Using such an estimated linear system matrix, we perform inversion to estimate the unknown compositional vector from its system response.
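The least-squares (Gaussian MLE) estimate of the system matrix from the training pairs, i.e., the minimizer of the Frobenius-norm residual of Y minus A X, can be sketched as:

```python
import numpy as np

def estimate_system_matrix(X, Y):
    """MLE of A in Y ≈ A X under i.i.d. Gaussian noise:
    solves min_A ||Y - A X||_F via least squares (A_hat = Y X^+)."""
    A_hat_T, *_ = np.linalg.lstsq(np.asarray(X).T, np.asarray(Y).T, rcond=None)
    return A_hat_T.T
```

In the noiseless case this recovers the true matrix exactly, provided the compositional training inputs span the space.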
For the experiment with ANNs, we try two cases: an ANN with one layer vs. an ANN with multiple layers. We measure the estimation performance by evaluating the difference between the matrix of the test set and the matrix of the estimated compositional vectors obtained from . The error metric is precisely formulated by equation (12) in Section III-A3.
We note that the shallowest ANN will have nonunique optimal solutions depending on initialization or randomization. This is described in Appendix -A, and we do not experiment on this shallow structure.
IV-B1 ANN with 1 layer
We present several trivial ANN learning cases to demonstrate that our intuitions match the desired behaviors of the learned models. We omit reporting error values of these trivial cases. We first train this shallow ANN to learn the mapping from the compositional domain in to the output domain. The learned ANN should have a weight matrix related to the original linear system matrix. Below we discuss this, considering an optional bias term in ANNs and both the forward and inversion models.
Estimation of linear system matrix without a bias term: We model . The input is multiplied by the first ANN weight matrix and the distance between this vector and the desired system output is minimized. We experimentally observed that the trained mapping result was good, i.e., and the weight matrix as expected.
Estimation of linear system matrix with a bias term: The input is multiplied by the first ANN weight matrix and added to a bias term. We empirically obtained the same good results as above, but the weight matrix differs from the system matrix and the MLE because of the bias term in the ANN. Theoretically, if the distribution of the training samples covers all the possible domain space and goes to infinity, the bias terms will converge to zero and .
The above cases consider learning the forward model whereas the below cases consider learning the inversion so that the ANN can produce the compositional vector from a measurement .
(Inversion) Estimation of pseudo-compositional vector without a bias term: Similar to matrix inversion, we used a linear activation function after multiplying by a weight matrix. The trained ANN performs good inversion, and the result is comparable to using the inverse matrix of the estimated , i.e., . Thresholding and scaling operations are required to project the ANN output onto the simplex domain.
(Inversion) Estimation of pseudo-compositional vector with a bias term: Similar to the above case, we used a linear activation function after multiplying by a weight matrix but adding a bias. The trained ANN performs good inversion, and the result is comparable to using the inverse matrix of the estimated but with a constant term due to the introduced bias term in the model. Thresholding and scaling operations are required to project the ANN output onto the simplex domain.
(Inversion) Estimation of compositional vector without a bias term: We performed a similar experiment as above but added a mapping layer so that the ANN output is in a simplex. Then we do not need to apply thresholding and scaling operations to project the ANN output onto the simplex domain as done above. Throughout our experiments, we observed that this ANN shows good performance without the need for post-processing to map onto a simplex. (The softmax activation did not train well for this shallow-layered ANN in our experiments, even with batch normalization after the weight multiplication.)
From the observation of the above last case demonstrating good inversion with a projection layer, we can extend the model further by adding another layer before the projection.
IV-B2 ANN with multiple layers
To investigate the extendibility of ANNs with multiple, possibly deep, layers, we designed a two-layered ANN with the projection layer as the last layer. The first and second layers each have nodes, each followed by batch normalization and a sigmoid activation, and the last layer has nodes with ReLU activation followed by the scaling operation as the projection layer, because non-negativity is guaranteed by the previous activation function.
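A numpy sketch of this architecture's forward pass (batch normalization is omitted for brevity; the layer sizes and the toy constant initialization below are illustrative assumptions, not the trained values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(y, params):
    """Forward pass: two sigmoid hidden layers, then a ReLU output
    layer rescaled onto the simplex (sum-to-one projection)."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = sigmoid(W1 @ y + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    z = np.maximum(W3 @ h2 + b3, 0.0)   # non-negativity
    return z / (z.sum() + 1e-12)        # scaling onto the simplex

# toy sizes: 4-dim measurement -> 16 hidden -> 16 hidden -> 3-dim composition
shapes = [(16, 4), (16,), (16, 16), (16,), (3, 16), (3,)]
params = [0.1 * np.ones(s) for s in shapes]
```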
Note that the generated system matrix can have negative entries, as in the realization used throughout our applicable experiments, which has condition number 3.23 (the ratio of the largest singular value to the smallest).
The errors of (12) are
where the case uses the true system matrix for inversion, so the estimator is ; the case uses the MLE of the system matrix for inversion, so the estimator is ; and the case indicates results from the trained ANN. The three error values are comparable. The error from the ANN approach is slightly larger than the rest.
The difference between and is
where indicates a Frobenius norm. This small number implies that the MLE of the system matrix is accurate enough, and the benchmark performance with the MLE should be similar to the oracle case, as shown above.
We note that the theoretical bound for the unconstrained estimator (14) is
This is significantly larger than the error level of 0.57 that we obtain from several estimators, primarily because the simplex constraint, applied with a projection operator or scaling, seems to limit the variable ranges, unlike the unconstrained estimator. (We performed our experiments multiple times with different randomized realizations of the system matrix, so this trend of observation is valid.)
IV-C Simple nonlinear systems
We perform the experiments on several different nonlinear systems in low dimensional spaces of observations and unknowns. Most of these have dimensions of , unless explicitly stated, with (the number of training samples) and (the number of testing samples).
IV-C1 Nonlinear systems: invertible transformation on simplex variable
We designed the following particular nonlinear system, whose output is uniquely invertible to the original input in the absence of noise:
where has entries generated according to the standard Gaussian distribution.
The inverse function of is as the following:
where is not necessarily in a simplex and can be negative as an input argument of due to the presence of noise, thus requiring a non-negative projection for the square-root operation, and the third variable is bypassed as in .
The averaged errors in percentage are, again for ,
where the case uses the true system matrix for inversion, so the estimator is ; the case uses the MLE of the system matrix for inversion, so the estimator is ; and the case indicates results from the trained ANN without knowledge of . It is surprising that the ANN significantly beats the other two estimators. We may not directly compare the results coming from two different systems, this nonlinear system and the previous linear system. However, the gap between the errors of the ANN and the pseudo-inversion methods is clear, compared to the plain linear model in IV-B2, where the gap in errors from different methods is negligible. The only change added to the linear model is the additional nonlinear effect on by the function . Again, the benchmark case is similar to the oracle case because of the close proximity of to . It is noteworthy that the performance of these two has relatively degraded due to the nonlinear effects of , while the ANN performance relatively improved even without using the functional form of .
This result also implies that there must be an optimal estimator better than the above ‘oracle’ estimator, which should depend on the particular nonlinear function . The cascading inversion operation after the pseudo-inverse with the system matrix might be better combined, but the search for a better estimator, although interesting, is not in the scope of this work, and we leave it as future work.
IV-C2 Nonlinear systems: noninvertible transformation on simplex variable
Unlike the previous experiment above, we consider a partially noninvertible and nonlinear transformation on simplex variables. Because of the partial noninvertibility, the estimation has an unavoidable bias over the noninvertible space.
In our experiment, we apply and of equations 27 and 28, respectively, which perform transformations on the first two dimensions of . We added a noninvertible transformation with a thresholding operator on as below.
where is a threshold level. For numerical stability in , we use with a small positive number , and is the optimal inversion function minimizing the expected loss (see Appendix -B). For example, with is illustrated in Fig. 1. In our experiment, we used , so any value of less than two percent will be ignored, and .
The averaged errors in percentage are, again for ,
Again, direct comparison with the results from the other systems above may not be feasible due to the different system functions, but the superiority of the ANN approach is evident. The bias introduced by the thresholding effect derived in Appendix -B is , so the expected increase in error is not large.
IV-C3 Nonlinear systems: invertible transformation with an obfuscating variable
We added an obfuscating variable to the invertible system described in the above Section IV-C1. The dimension of the unknowns became .
We assume that this obfuscating variable is not dominant, in that its weight is not greater than . Generally, we can assume that the norm of the obfuscating variables is bounded. This is a reasonable assumption in practice, too, because unknown variables outside our consideration or interest do not significantly determine the observations. Otherwise, we would either include them in the model or research the physics to rebuild the model.
The errors from the oracle and benchmark estimators are calculated using equation (16) where contains only the scaled first 3 dimensional components such that .
In our experiment, we bound the obfuscating variable such that , which increases the estimation error less than introducing a thresholding operation with level 0.2 on one variable would, because , and all the averaged errors above are less than . In our simulation with samples, the test error increase in the ANN approach is slightly less than that in the other approach in Section IV-C1; however, this requires more investigation because their system functions are different, with different input vectors.
IV-C4 Nonlinear systems: noninvertible transformation with an obfuscating variable
IV-C5 Nonlinear systems: transformation with varying magnitudes
We define the following nonlinear system and experimented the ANN approach with as in Section IV-B2.
This case can have neither oracle nor benchmark inversion results because we cannot estimate the scale factor and the unknown variable simultaneously without good prior knowledge. This inversion is generally called blind deconvolution, or semi-blind or myopic deconvolution when some prior knowledge of the unknown or the system is available.
Our approach in this work estimates the inverse system in the ANN and the unknowns. The evaluated error shows a better result than the other previous cases.
This better performance would be due to the effectively increased signal-to-noise ratio (SNR); the minimum of the scaling factors was and of the factors were larger than 1, as seen in Fig. 2. The averaged norm is and , with being an empirical averaging operator here.
IV-C6 Nonlinear systems: transformation with added correlations of unknowns
We designed another type of nonlinear system with a nonlinear function mapping from the simplex to an auxiliary vector below.
In this system response, the information of is abundant, including its original value, while are transformed and multiplied with others. We have more redundant intermediate variables of 5 dimensions from , and the system matrix is enlarged from to , having more perturbations or variations in the outputs. However, a large training set can accurately estimate the inverse system and the unknowns. Because the number of training samples seems large enough, the performance is similar to the linear case and other nonlinear cases, as expected.
The oracle and benchmark cases are not evaluated because, without knowing the functional form or the intermediate dimension , the estimators cannot be formulated. In contrast, the ANN approach is agnostic to such knowledge of intermediate transformations and introduced correlations. If we assume this knowledge, then we can refer to the error levels in the linear system case in Section IV-B2, and these should be comparable with the above ANN performance.
IV-C7 Nonlinear systems: transformation with varying peak responses
We define the following nonlinear system and experimented the ANN approach with as in Section IV-B2.
where is an index vector and
This system response has varying magnitudes dependent on the composition weights s in s, and different shapes also dependent on the composition weights s in s. Therefore, this case is more general than the one presented in Section IV-C5. Fig. 3 shows the varying responses in shape or peak locations of the component-wise system functions as its argument changes, sampled at . has a moving peak centered at indices 1 to 5, and the magnitude slightly increases as increases from 0 to 1, while shows the opposite behavior in terms of peak locations and magnitudes. decreases slightly with shape changes as increases.
The result shown below, from the ANN approach, is comparable with other cases but direct comparisons do not make much sense because the systems are different.
Again, the oracle and benchmark cases are not evaluated, because formulating them is difficult even with the functional forms and parameter values, due to the complex nonlinearity. Instead, we provide the ratio of intensity, e.g., the norm, of the noiseless output of this system to that of the linear system.
where is an empirical averaging operator here, and is the same as used in Sections IV-C5 and IV-B2 (reported also in Section IV-C5), and . Considering only the amplified signal intensity, we would expect better performance, but the changing shapes must adversely affect the inversion performance.
IV-C8 Nonlinear systems: transformation with varying peak responses with added correlations of unknowns
We define a nonlinear system similar to the previous one, with , but with added correlated terms.
where is an index vector and
Fig. 4 shows the varying responses in shape or peak locations of the component-wise system functions as its argument changes, sampled at . are the same as in the previous system in Section IV-C7, but with , a function of the unknown compositional vector . According to this function and the given system responses, a small quantity in seems difficult to estimate because its information resides only in , where small quantities of correspond to attenuated system responses. This would degrade the inversion performance.
Also, comparing the number to that of the previous system in Section IV-C7, the added correlated terms did not help the inversion performance. Note that a direct comparison cannot be made because are now linear and squared functions of , respectively, not identity functions of as in Section IV-C7.
IV-D High dimensional linear systems
We experiment on high dimensional simplex variables. To simulate realistic experiments, we set to represent high dimensional spaces for the unknowns and observations. We set , and the designed system matrix in Fig. 5 has all nonnegative response curves. The designed system is given in Appendix -E. For the training of the ANN, new samples were generated every 100 epochs, because of memory limitations and to avoid overfitting. The samples in the training and test sets are drawn according to the uniform distribution. The ANN is designed and tuned to the same parameter values as in the previous experiments, with the complexity of the networks increased linearly as increases, in the double layers of nodes and another layer of nodes.
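As a minimal sketch of the non-negativity and scaling output stage used in the proposed ANNs, the following NumPy snippet shows one plausible realization (ReLU followed by sum normalization) and a smooth softmax alternative; the actual network layers are defined elsewhere in the paper, so this is illustrative only.

```python
import numpy as np

def nonneg_scaling(z):
    """Non-negativity (ReLU) followed by sum-to-one scaling onto the simplex."""
    z = np.maximum(z, 0.0)                      # non-negativity layer
    s = z.sum(axis=-1, keepdims=True)
    return z / np.where(s > 0.0, s, 1.0)        # scaling layer (avoid division by zero)

def softmax(z):
    """A smooth alternative that enforces both constraints at once."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

raw = np.array([1.5, -0.3, 0.8])                # arbitrary last-layer activations
p = nonneg_scaling(raw)                         # a valid composition: p >= 0, sum(p) = 1
```

Either variant guarantees the estimate is a non-negative vector summing to one, so no post-hoc truncation or rescaling is needed.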
From Fig. 5, the correlations of the components whose indices are 11–20 must be significant because their overall envelope shapes are similar except for the valley shapes. These components have their information residing in the valleys, not the envelopes, and the resulting high correlations are seen in the red block in Fig. 6. Because of the high correlations among components 11–20, their estimation errors are higher than those of components 1–10, as seen in Fig. 8.
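The effect described above can be reproduced with a toy system matrix whose last ten columns share a common envelope; the matrix here is synthetic, not the designed system of Fig. 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 50x20 system matrix: the last ten columns share one envelope
# shape plus small perturbations, mimicking the correlated components.
m, d = 50, 20
A = rng.random((m, d))
envelope = np.sin(np.linspace(0.0, np.pi, m))
A[:, 10:] = envelope[:, None] + 0.05 * rng.random((m, 10))

C = np.corrcoef(A, rowvar=False)     # column-wise correlation matrix
block = C[10:, 10:]                  # the highly correlated block
# off-diagonal entries of `block` are close to 1, as in the red block of Fig. 6
```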
The trained system matrix for the benchmark estimator is close to the true one because
The results on the test using the oracle and benchmark estimators are thus similar.
The nonnegative high dimensional matrix with the larger condition number, 360, compared to that of the low dimensional system, 3.23, degrades the performance from to more than errors. This can be seen visually in Fig. 5, where there are many overlapping, similarly shaped parts. However, the reported errors are still less than the theoretical bound for the unconstrained estimator (14),
Moreover, the ANN approach outperforms the other two. Compared to the low dimensional linear case in Section IV-B2, the difference in the errors is significant. This must come from the locality of the ANN approach, specific to the training set, and the globality of the methods based on matrix pseudo-inversion. In the experiment, even with uniform sampling on a simplex, the high dimensional simplex exhibits locality, with rare samples near the end-members () and relatively many samples away from them.
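The sensitivity of pseudo-inverse based (global) estimation to conditioning can be illustrated with a small synthetic check; the matrices below are generic stand-ins, not the paper's designed systems, and the noise level and sizes are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def pinv_error(A, noise=0.01, trials=200):
    """Mean relative error of the unconstrained pseudo-inverse estimate
    for y = A x + n with a compositional unknown x."""
    m, n = A.shape
    errs = []
    for _ in range(trials):
        x = rng.dirichlet(np.ones(n))                 # unknown composition
        y = A @ x + noise * rng.standard_normal(m)    # noisy observation
        xh = np.linalg.pinv(A) @ y                    # global matrix inversion
        errs.append(np.linalg.norm(xh - x) / np.linalg.norm(x))
    return float(np.mean(errs))

A_good = np.eye(5) + 0.1 * rng.random((5, 5))         # well conditioned
A_bad = A_good.copy()
A_bad[:, -1] = A_bad[:, -2] + 0.01 * rng.random(5)    # nearly dependent columns
# np.linalg.cond(A_bad) >> np.linalg.cond(A_good), and pinv_error grows with it
```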
High dimensional simplex spaces may seem counter-intuitive, particularly regarding the volume distribution. In fact, high dimensional simplices, along with other high dimensional polytopes, have their volume concentrated mostly near their surfaces, but near the corners, where the end-members are located, the volume diminishes as the dimension increases. This can also be demonstrated empirically by using a uniform sampler on a simplex (see Appendix -D1). This implies that, under the uniform distribution in a high dimensional simplex, the chance of drawing samples close to any end-members is negligible. However, in controlled experiments where observations are measured based on fabricated or designed samples on a simplex domain, also known as designed compositions, we can have the measurements corresponding to end-member compositions or pure contents of only one individual composition, i.e., for the th end-member. Therefore, we can add the observations from end-members into our training set if we believe that observations coming from near the end-members are expected in practice.
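This volume behavior can be checked empirically with a uniform simplex sampler (a Dirichlet distribution with unit parameters is uniform on the simplex); the dimensions and threshold below are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_near_end_members(dim, thresh=0.9, n=100_000):
    """Fraction of uniform simplex samples dominated by one component,
    i.e., lying near an end-member corner."""
    x = rng.dirichlet(np.ones(dim), size=n)   # uniform on the (dim-1)-simplex
    return float(np.mean(x.max(axis=1) > thresh))

low = frac_near_end_members(3)    # a few percent in low dimension
high = frac_near_end_members(20)  # essentially zero in high dimension
```

For three components the analytic value is 3 × 0.1² = 3%, while for twenty components the corresponding probability is astronomically small, matching the rarity of near-end-member samples discussed above.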
To test the locality of the ANN and the globality of the other two methods based on matrix inversion, we performed a simple test with observations only from the end-members. Here, for the benchmark estimator, the training and test sets coincide on these observations, while the ANN estimator was already trained using the training samples.
The oracle estimator is independent of the training set and uses the true matrix; its error is now much closer to, but still less than, . The benchmark uses the trained matrix and reuses it for testing, leading to a close-to-zero error, as expected. The ANN approach produces a significantly large error because extremely few of the training samples were close to any end-members. Therefore, in practice, if we believe a significant number of samples come from near the end-members, we should include such samples in the training data.
IV-E High dimensional nonlinear systems
We defined a high dimensional nonlinear system in Appendix -F, where obfuscating variables and mixture models are also considered. The system correlates some variables and transforms the original unknown vector nonlinearly with fractional polynomials and exponential functions, thresholding, and shape changes with moving peaks and valleys. In this section, we experimented with numerous ANN structures because of the higher complexity of the system: our base model with double layers of , where is the number of components of interest or assumed; double layers of , ; and convolutional neural networks (CNN) having a convolutional layer followed by either double layers of or feedforward networks.
Additionally, we tested two cases for the compositional distributions: one is the uniform distribution, and the other is a mixture model. In the designed mixture model, the mixture centers in percent are shown in Fig. 7, and the corresponding s, the sample proportions, and details are provided in Appendix -F. In the mixture model, there are still samples drawn from the uniform distribution. A drawn compositional vector can be truncated and normalized to satisfy the simplex condition. Also, in generating samples, we discard the samples whose , the obfuscating variables, are greater than . The resulting samples in , with the described specification, and the corresponding noisy measurements using the nonlinear system with the observational noise level , constitute the training and test sets. In the experiments using the mixtures, we randomly shuffled the samples in the training and test sets. We may retain the original compositional vector including the obfuscating variables in , which is used to synthesize the noisy measurements, but use without those variables for comparisons (Eq. 16). In other words, even if the noisy observations embed the effects of the obfuscating variables, we do not use the obfuscating variables for training, and testing considers only the normalized version of the variables excluding them.
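The described sampling pipeline (mixture draw, truncation, renormalization onto the simplex, and rejection on obfuscating variables) can be sketched as follows. All numeric settings here, including the centers, spread, obfuscating index, and threshold, are placeholders; the paper's actual values are given in its Appendix -F.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_mixture_samples(centers, spread, n, obf_idx, obf_max):
    """Draw compositions around mixture centers, truncate negatives,
    renormalize onto the simplex, and reject samples whose obfuscating
    components exceed the threshold."""
    out = []
    while len(out) < n:
        c = centers[rng.integers(len(centers))]
        x = np.maximum(c + spread * rng.standard_normal(c.shape), 0.0)  # truncate
        x = x / x.sum()                                                 # renormalize
        if np.all(x[obf_idx] <= obf_max):                               # rejection step
            out.append(x)
    return np.array(out)

centers = np.array([[0.5, 0.2, 0.2, 0.1],
                    [0.1, 0.4, 0.4, 0.1]])   # hypothetical mixture centers
X = draw_mixture_samples(centers, spread=0.05, n=100, obf_idx=[3], obf_max=0.2)
```

Every returned sample is a valid composition, and the last (obfuscating) component never exceeds the chosen threshold.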
The performance in high dimensional examples with is demonstrated by considering numerous cases of sample distributions, system types, and neural network structures. We added two nonlinear systems whose response is divided by its maximum or norm, resulting in added nonlinearity and slightly increased errors. We also tried convolutional neural networks (CNN), placing the convolutional layers before the double layers. The CNN layers consist of a layer of 32 nodes and another of 16, with kernel size 7, stride 3, and ReLU activation.
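The convolutional front end just described (32 then 16 filters, kernel size 7, stride 3, ReLU) can be sketched in plain NumPy; the input length and the random weights below are placeholders, not the trained parameters.

```python
import numpy as np

def conv1d_relu(x, w, b, stride):
    """Minimal valid-padding 1-D convolution with ReLU.
    x: (length, in_ch), w: (kernel, in_ch, out_ch), b: (out_ch,)."""
    k = w.shape[0]
    steps = (len(x) - k) // stride + 1
    out = np.stack([
        np.tensordot(x[i * stride:i * stride + k], w, axes=([0, 1], [0, 1])) + b
        for i in range(steps)
    ])
    return np.maximum(out, 0.0)                       # ReLU activation

rng = np.random.default_rng(0)
x = rng.random((40, 1))                               # placeholder observation vector
h1 = conv1d_relu(x, 0.1 * rng.standard_normal((7, 1, 32)), np.zeros(32), stride=3)
h2 = conv1d_relu(h1, 0.1 * rng.standard_normal((7, 32, 16)), np.zeros(16), stride=3)
# h2 would then be flattened and fed to the dense (double) layers
```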
For completeness, we included the results from linear systems in this section. For linear systems, , and for nonlinear systems, , and there are two obfuscating variables. The overall error is computed using Eq. (12) and reported in Table I. The component-wise error, the average absolute deviation, is computed using Eq. (13) and illustrated in Fig. 8.
We note that the two linear cases, along with the largest model with double layers of or larger, achieve the minimal errors, due to the lowest complexity and the adaptive power, respectively. The two cases with double layers of , whose errors are more than 3, possibly suffer from estimation bias or insufficient optimization, because the optimization with this simple ANN structure showed very slow convergence empirically through many trials of different optimizers, tunings, and techniques. In other words, the simplest ANN structure applied to nonlinear systems may have under-fitting or convergence problems in practice. In particular, component 14, corresponding to the signal of a moving peak, seems to be the most difficult variable to estimate in the simpler models, while models of the order of or larger do not exhibit such problems (Fig. 8). Generally, increasing the number of nodes, or in our experiments, improves stability and accuracy without causing over-fitting, given training on sufficient data. Adding a convolutional layer to our base structure with double layers of may help, but has not been extensively experimented with in our work.
| System type | samples | ANN type or method | error |
| --- | --- | --- | --- |
| linear | uniform | double layers of | 2.21 |
| nonlinear | mixture | double layers of | 10.51 |
| nonlinear | mixture | double layers of | 3.53 |
| nonlinear | mixture | double layers of | 2.16 |
| nonlinear | uniform | double layers of | 2.36 |
| nonlinear, divided by its max | uniform | double layers of | 2.98 |
| nonlinear, divided by its norm | uniform | double layers of | 6.53 |
| linear | uniform | CNN layers + double layers of | 2.45 |
| nonlinear | mixture | CNN layers + double layers of | 2.87 |
| linear | uniform | pseudo-inverse of & projection | 3.78 |
| linear | uniform | kNN (optimal k=11) | 2.54 |
| nonlinear | mixture | kNN (optimal k=11) | 6.54 |
| nonlinear | uniform | kNN (optimal k=11) | 13.21 |