Multidimensional data matrices are encountered in many disciplines of science as in biological domain studying the gene expression data Cheung et al. (2012); Golub et al. (1999), the geographical domain analysing the spatial earthquake data Van der Hilst et al. (2007), in financial market data for portfolio construction and assessment Jagannathan and Ma (2003) and many others domains. The complexity of matrices can be observed for example when a variable is informative only for a subset of data which render the application of some statistical methods too hard. Therefore, multidimensional data analysis Mardia et al. (1979)
refers to the process of summarizing data across multiple dimensions and presenting the results in a reduced dimension space. Several well-known methods exist which perform dimension reduction as principal component analysis, multiple correspondence analysis, multidimensional scaling, and others. A new method of multidimensional data analysis called multidimensional fitting (MDF) was introduced and studied inBerge et al. (2010) and Alawieh et al. (2016). This method is a new method of fitting distances used in data analysis and requires two observed matrices, the target matrix and the reference matrix. The idea of MDF method is to modify the coordinates given by the target matrix in order to fit the new distances calculated on these modified coordinates to distances given by the reference matrix. In Berge et al. (2010) and Alawieh et al. (2016), the authors consider the displacement vectors as deterministic vectors and the random effect that can be produced during the modification and can affect the interpretation of the modification significance is not taken into account.
In this article we want to introduce the random effect in the model of MDF method and to find then the real displacement vectors. Here, the minimization of the mean square error between the new distances and the reference distances performed in the deterministic model of MDF, to obtain the optimal values of displacement vectors, cannot be applied for the random model as the objective function here is a random variable. Therefore, we want to use different ways to find these vectors.
First of all, the random model of MDF is presented in Section 2, then in Sections 3 and 4, two ways to obtain the optimal values of displacement vectors are illustrated. After that, an application in the sensometric domain has been presented in Section 5 in order to fit the sensory profiles of products to consumers preferences of these products. Finally, we conclude our work in Section 6.
2 The random model of multidimensional fitting
Let us consider a target matrix which contains the coordinates of points in with in and a reference matrix which contains the distances between the same points. We note the modification function such as:
where for , the vectors and in are, respectively, the coordinate and the displacement vectors of point .
The problem of MDF is a mean square error minimization and the error, noted , defined by:
where and a real scaling variable. We note:
Owing to the negligence of random effects that can occur during modification, the interpretation of the displacements can be erroneous. Thereby, to tackle this problem we introduce the random effects in the modification function. So, the new modification function is given by:
where and in are, respectively, the fixed and random part of modification.
Contrary to what has been seen above, here is a random variable and not a deterministic value, for all , so the error cannot be minimized directly. Deterministic and stochastic optimization are presented to find the optimum value of vectors :
Deterministic optimization: by minimizing a function depending on vectors with consideration that the components of vectors , for all
, are independently and identically normally distributed.
Stochastic optimization: by simulating the error where the components of vectors for all are dependent and not normally distributed.
3 Calculation of by minimization
In this section, we suppose that the components of vector denoted , for all and , are -dimensional independent and identically distributed random variables where the vector is multivariate normally distributed with mean (the vector 0 in
is the null vector) and variance( is a strictly positive value to be fixed and
is the identity matrix).
We note a matrix that contains the distances between the points after modification. The objective function of the minimization problem called can be expressed in different forms. We cite below some of them:
In our work, we are interested to take . The expression of noted (as the mean square error cited above) can be rewritten as:
The problem here is to find the vectors () such that the minimum of under () is reached. The initial optimization problem (P) is defined by:
The optimal solution obtained from (P) is a solution assigns the smallest value to but moves too many points. So, it is a good solution from minimization standpoint, but awkward from parsimony standpoint.
A new optimization problem is presented to find the optimal vectors by taking into account the minimization of the expectation of and the parsimonious choice of displacements. A natural approach to obtain such sparsity solution is to use the number of non-zero displacements as a penalty. So, a penalty term can be defined, using -norm, as where
is the norm which measures the parsimony of the displacements of points. Thus, a new optimization problem called (P) is given by:
with is a positive regularization parameter to be chosen. It controls the trade-off between the minimization of the expectation of the error and the use of a parsimonious number of displacements.
3.1 Choice of regularization parameter
In different penalization problems as the penalized regression or penalized likelihood methods for high dimensional data analysisGreen (1987), the choice of regularization parameter is always crucial to lead good results attached to the problem at hand. Different methods have been introduced to find the good value of this parameter (see Hoerl and Kennard (1970),Golub et al. (1979)). Some practical approaches consist in comparing different models using a sequence of penalization parameter and then choose the best of them using some model selection criteria like Cross-validation (CV) (Allen (1974), Stone (1974)), Akaike Information Criterion (AIC) Akaike (1973) and Bayes Information Criterion (BIC) Schwarz (1978).
In our model, the choice of the value of is related to the number of displacements. With the same practical concept as the approaches presented in the literature, we want to solve the optimization problem (P) by taking different values of , and as our problem is related to the number of displacements so we choose a value of that takes into account the number of points that must be modified in our data to fit the references distances. This number of points can be computed from the data or fixed by the number of displacements that we want to perform. So, the chosen number of displacements can be taken by two ways:
through the posed problem,
using the data.
Obviously, first way is trivial. Indeed, it is sufficient an user or a company choose a fixed number of displacements that wish perform to find the desirable solution. Accordingly, fixing the number of displacements can be interesting to companies because in some cases a displacement can be difficult and expensive therefore it is suitable for them to fix at the beginning the number of displacements. For the second way, the number of displacements can be calculated using the data. Therefore, a criterion of points selection defined below is used to choose the number of displacements.
3.1.1 Criterion for selection points
The number of displacements is related to the number of points that are misplaced in their initial configuration and need movements to fit their reference distances. For that, we have developed a simple criterion based on the error calculated on the initial data before data modification. This criterion for selection of the points is denoted .
Indeed, for and , we calculate the following difference:
Note that is equivalent to with .
Then, for each , the criterion for selection points is defined as:
The values of are between and so, for fixed value which is chosen through the value of :
if , then is considered as correctly placed point,
else, is considered as misplaced point.
Now, in order to verify the interest of the modification of coordinates so as to approximate the distances, we want to perform a statistical test on the displacement vectors ().
3.2 Statistical test for the displacement vectors
In this section, we want to present a statistical test for the displacement vectors for all . This test is based on the hypothesis of displacements significance. Recall the error:
We note the initial error given by:
The two hypothesis of the statistical test are:
is the test statistic and the decision rule is the following:
To perform this test, we use the Bienaymé-Tchebychev inequality:
Moreover, we suppose that the random effect is injected in the observation so instead of observing we observe . Then, by choosing , the ratio can be considered as -value. So, if it is small than
then we reject the null hypothesiswith is the error of type I.
computation of expectation and variance of error are done in appendix. The results obtained in the appendix are valid for any values of . Therefore, under the hypothesis it is sufficient to replace by .
3.3 The optimization problem
Once hypothesis is rejected, the vectors () can be calculated by solving problem (P).
Using the results of appendix B, the expectation of the error has been calculated from the expectation of the non-central chi-squared and non-central chi distribution.
The expectation of the error is:
where and is the generalized Laguerre polynomial Filaseta and Lam (2000).
The optimization problem (P) can rewritten as:
4 Calculation of by simulation
In this section, we suppose that the components of are dependent or/and not necessarily normally distributed so the application of chi-squared and chi distributions becomes impossible. Therefore, we want to present an algorithm noted Algorithm which allows us to find the optimal vectors using Metropolis-Hastings Metropolis (1953).
4.1 Simulation tools
Different tools used in algorithm and associated to the generation of vectors in order to minimize the error are presented in the follow.
4.1.1 Identification of misplaced and correctly placed sets
The set of points can be divided into two subsets:
The first, noted , having size equal to . This subset contains the points that are correctly placed and should not be moved.
The second, noted , having size equal to . This subset contains the points that are misplaced and must be moved.
The criterion for points selection presented in Section 3.1.1 is used to construct these two subsets.
4.1.2 Movement of set
The subset contains the misplaced points that must be moved in order to fit the reference distances. In this section, the work is concentrated to find movements for the subset approaching as possible as the distances calculated after movements to the reference distances. The movements that can be applied to are translation, scaling and rotation. The scaling movement is not interest in our study as the subsets and are in the same scale seen that are derived from the same set of points and the scaling variable presented in the optimization problem of MDF is kept. Moreover, we suppose that the points inside are well concentrated so that the rotation movement can be neglected. That is why we are just interested on the translation movement. The translation of through a vector can be shown as the translation of each points in . So, the translation of a point is given by: where is the coordinate vector of point .
The translation movement is performed in such a way to approach the distances calculated after translation to the distances given by the reference matrix. Thus, find the vector return to solve the following optimization problem:
In order to simplify the problem , we suppose that for all and , and the problem becomes:
Relaxation of problem : The following problem can easily solved and provide a starting point to resolve . We have:
Using inequality (3), we can conclude that the optimal solution of the vector belongs to an hypersphere centered in with radius . As Equation (2) is never equal to zero during optimization, so we take it smaller than certain real value. Therefore, the optimal solution is guened to belong to an hypersphere with same center as hypersphere but with radius equal to with a small value in .
We suppose that vector
is uniformly distributed betweenand , so it is necessary to find the maximal value . This value is geometrically determined using Figure 1. Starting with a null value of vector , moves uniformly on the line passing by (the point where the vector is null) and to reach its maximum at the point , the far intersection between and hypersphere , hence the uniqueness of .
To calculate the maximal solution of vector , it is needed to find the far intersection of the line with hypersphere . The line has as direction vector the vector . So, the parametric equation of is equal to:
Furthermore, we have:
The intersection between and gives:
We are interested by the farthest intersection, thus we take By replacing in Equation (4), we obtain:
The values of can be proposed uniformly on the segment , so
4.1.3 Movement vectors generation
In practice, for all , we suppose that contains one point noted . The choice of this point is made by a multinomial distribution where for is as defined in section 3.1.1.
At instant , a point chosen as misplaced point must occur a movement through the uniform distribution such as with and are, respectively, the center and the radius of hypersphere obtained by taking . Thus, the movement of the point is equal to the movement at instant plus the new movement obtained by uniform distribution. Hence, we can write:
with is a generation value of the uniform distribution . Noted that the equation of hypersphere given by (3) in each instant depends of misplaced point .
We note the sequence of generated vectors in defined by: . Therefore, the passage from to occurs in a way to move:
4.1.4 Proposal distribution
A proposal distribution is needed in the Metropolis-Hastings algorithm defined below. This distribution is constructed by calculating the probability to pass fromto a new generate value of denoted and it is equal to the probability to choose a point multiplied by the probability of the movement of this point. So, the proposal distribution noted is given by:
with is the chosen point.
We can easily see that this proposal distribution is a probability density function as.
4.2 Calculation of using Metropolis-Hastings algorithm
We consider that the component of vector are dependent such that , with is the covariance matrix.
The Metropolis-Hastings algorithm allows us to build a Markov chain with a desired stationary distributionMetropolis (1953),Hastings (1970). The proposal distribution here is related to the choice of vectors for and it is given in paragraph 4.1.4. The target distribution is given by:
where is an application given by:
and is the density function of the normal distribution . The variable is the temperature parameter, to be fixed according to the value range of .
The algorithm of Metropolis-Hastings is given as follows:
Remark: The error
can be distributed through any other distribution other than the Gaussian distribution, so it is sufficient to generate the vectorusing this distribution instead the normal distribution in algorithm 1.
This random model of multidimensional fitting has been applied in the sensometrics domain. This relatively young domain concerns the analysis of data from sensory science in order to develop a product by linking sensory attributes to ingredients, benefits, values and emotional elements of the brand to design products that meet the sensory quality preferences of sensory-based consumer segments Meullenet (2007). This analysis of product characteristics among consumers gives an overview of the positive and negative aspects of products and aid the companies to better meet consumer tastes. So, the problem here is to fit the consumers scores to the product configuration given by the experts in order to find the ideal sensory profile of a product. Thus, two matrices are at disposal, one contains the consumer scores of products and the second the sensory profile of products given by the experts.
Several modelling techniques have been applied in sensory analysis domain like preference mapping which is the must popular of them. They can be divided into two methods: internal and external analysis Meilgaard et el. (2007). These methods have as objective to visually assess the relationship between the product space and patterns of preference Lawless and Heymann (1999). In our application, we want to use the random model of multidimensional fitting to match as well as possible the sensory profile to the consumers preference of a product. White corn tortilla chips and muscadine grape juice data sets are used in our application.
5.1 Data description
White corn tortilla chips data set has been studied in Meullenet (2007) where consumers rated commercially available toasted white corn tortilla chip products for overall liking, appearance liking, and flavor liking. The names of these tortilla chip products and their labels are given in the Table 2. Moreover, a group of Spectrum trained panelists evaluated apperance, flavor and texture attributes of tortilla chips using the Spectrum Method Meilgaard et el. (2007) (Sensory Spectrum Inc., Chantham, NJ, U.S.A.). This data set is available at "http://www.sensometric.org/datasets" and it is composed from consumers notes table and panelists notes table. The first table is constructed after asked each consumer to evaluate liking, appearance, flavor and texture of each tortilla chips sample on a 9-point dedonic scale and the saltiness on 5-point ’JustAboutRight’ (JAR) (for more information about the scale, visit "http://www.sensorysociety.org/knowledge/sspwiki/Pages/The209-point20Hedonic20Scale.aspx"). The second table is obtained after the evaluation of the panelists for flavor, texture and appearance attributes of all the chips and after that the calculation of the average score for each attribute. The total number of attributes studied in panelists notes table is , we note some of them: sweet, salt, sour, lime, astringent, grain complex, toasted corn, raw corn, masa, toasted grain…
The application of our method requires a target and reference matrices. The target matrix is given by the panelists notes table, so the dimension of this matrix is and the reference matrix is a matrix of dimension and contains the Euclidean distances between the different tortilla chip samples calculated using the consumers notes table.
Muscadine grape juice data set is well studied in Meullenet (2008), and it is composed from the scores of consumers and the average score for attributes given by panelists. This data is available at "http://www.sensometric.org/datasets". Consumers evaluated muscadine grape juices for overall impression, appearance, aroma, color, and flavor. The name of the studied muscadine grape cultivars are given in the Table 2. These juices are examined for aroma, basic tastes, aromatics, feeling factors by the group of Sensory Spectrum trained panelists. Likewise to white corn tortilla chips data set, this data set is composed from two tables: consumers notes table and panelists notes table. The first table contains the consumers evaluation of overall impression, appearance, aroma, color and flavor on the 9-point hedonic scale and the second one contains the average score for each attribute after evaluation of the panelists for the basic tastes, aromatics and feeling factors attributes for all muscadine juices.
The target matrix here is a matrix of dimension constructed by the average score given by the panelists and the reference matrix is a matrix of dimension constructed by the Euclidean distances between the consumers scores for the different cultivars of muscadine grape juices. To quote some of the studied attributes: sweet, sour, cooked muscadine, cooked grape, musty, green unripe, floral apple/pear, fermented …
5.2 Experimental setup
Random model of multidimensional fitting method is applied in the independent and dependent cases of the components of vectors for . The presence of Laguerre polynomial in the objective function of problem (P) complicates the optimization and makes the computation time too long. A way to simplify the optimization resolution is to calculate before the optimization a large set of Laguerre polynomial values corresponding to a large possible values of as the Laguerre polynomial presented in the expectation of the error is related to the value of . So we define a set in which contains many possible values of that can be used during the optimization. For all , the value of is proportional to the distance thus the set is related to the data set by the value of for all as the objective of our optimization problem is to approach to . Therefore, we define as follows: where is the mean of squared distances and is a value which gives the length of and related to the maximal value of in order to cover the largest possible values of . The increment between the elements of is taken equal to . After that, during optimization, each value of calculated with a particular and is replaced by the nearest value in and the Laguerre polynomial value associated to this value is injected directly in the objective function. This simplification gives results close to the results obtained directly by optimizing (P) and reduce thousandth times the resolution time.
Moreover, the choice of and the scale parameter are crucial to obtain good results. Therefore, the value of is taken equal to the mean of the standard deviation calculated on the target matrix , so we can write . Concerning the parameter , we calculate it using the following algorithm:
NLopt library (Version 2.4.2) Johnson (2008) implanted in language a free and open-source library is used to solve problem (P) as this problem is a non-linear and non-convex optimization problem. In this library, numerous algorithms exist to solve such non-linear optimization problems. In our application, we choose Sbplx algorithm which is based on Subplex method Rowan (1990) that is a generalization of Nelder-mead simplex.
Concerning simulation algorithm, the covariance matrix is given by the covariance of matrix multiply by a constant . The parameter presented in the proposal distribution is taken equal to . Concerning the temperature parameter, we take it equal to . Moreover, during simulation, the number of iterations is taken equal to .
5.3.1 Optimization results
First, we want to define the different values of parameters , and for the two data sets. Table 3 gives these values:
|White corn tortilla chips|
|Muscadine grape juices|
The values of for the two data sets are calculated using algorithm 2. Figure 2 depicts the trace plots of the value of at each iteration for the two data sets. We show clearly that the value of converge to an optimal value. Concerning the value of , a choice of for the two data set can be reasonable as the maximum value of the squared distances in the two data sets is in the range of .
After parameters determination, the statistical test developed in Section 3.2 has been applied to perform the interest of the displacements of the points. The values of the ratio calculated for the two data sets are given in the Table 4.
|White corn tortilla chips|
|Muscadine grape juices|
As the two values of for the two data sets are smaller than , so the statistical test is significant for therefore we reject the null hypothesis and we accept the alternative hypothesis . Thus, the movements of points and through vectors and are necessary to approach the distances to for all . After the statistical test, problem (P) has been solved using different values of regularization parameter .
Tables 6 and 6 show the different values of the expectation of error and the number of non-null displacements after optimization for different values of obtained after optimization. We remark that when increases, the number of displacements decreases and when becomes too large, the number of displacements tends to zero and nothing moves.
A way to choose the value of is to determine the number of misplaced points which must be moved to fit the distances. To find these misplaced points, we use the criterion of selection points presented in Section 2.
Table 7 shows the values of this criterion. We have seen that for a fixed real number between and , if we consider as misplaced point. So, by taking for the two data sets, we can detect misplaced points for white corn tortilla chips and for muscadine grape juices which is equivalent to values of for tortilla chips and for muscadine juices. Then, by referring to Tables 6 and 6, we choose the value of that gives a number of displacement close to that obtained using the criterion for tortilla chips and muscadine juices. Indeed, a value of equal to gives a number of displacements equal to displacements that is close to , so we can take . Similarly, we choose for muscadine juices data set. Noted that by changing the value of , we can detect more misplaced points so this choice must be reasonable.
Besides, if the number of desirable displacements is fixed by the user then it is not needed to compute the ratio and in the same way we can choose the value of . So, the choice of is always related to the objective which is aimed at.
The objective of the study is to determine the acceptable attributes categories of white corn tortilla chips and muscadine grape juices. Using our method we want to determine the product characteristics that must be changed to match with the consumers preference.
|Oily/ greasy lip|