Estimating finite mixtures of semi-Markov chains: an application to the segmentation of temporal sensory data

06/12/2018 ∙ by Hervé Cardot, et al. ∙ 0

In food science, it is of great interest to get information about the temporal perception of aliments to create new products, to modify existing ones or more generally to understand the perception mechanisms. Temporal Dominance of Sensations (TDS) is a technique to measure temporal perception which consists in choosing sequentially attributes describing a food product over tasting. This work introduces new statistical models based on finite mixtures of semi-Markov chains in order to describe data collected with the TDS protocol, allowing different temporal perceptions for a same product within a population. The identifiability of the parameters of such mixture models is discussed. A penalty is added to the log likelihood to ensure numerical stability and consistency of the EM algorithm used to fit the parameters. The BIC criterion is employed for determining the number of mixture components. Then, the individual qualitative trajectories are clustered by considering the MAP criterion. A simulation study confirms the good behavior of the proposed estimation procedure. The methodology is illustrated on an example of consumers perception of a Gouda cheese and assesses the existence of several behaviors in terms of perception of this product.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 2

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The development of food products is usually based on the measurement of product sensory perceptions from panels of consumers. Sensory perception while eating a food product has been acknowledged as a temporal process for 60 years (Neilson, 1957). Measuring the temporal sensory perception is a complex task and different approaches have been developed in sensory science (see Hort et al. (2017)). Recently a technique called Temporal Dominance of Sensations (TDS) has been introduced by Pineau et al. (2009). A review on TDS can be found in Schlich (2017). The panelists have to describe the tasted product by choosing which attribute, among a list composed of about ten items, corresponds to the most striking perception at a given time. This task results in sequences of attributes with choices and time of the choices. When an attribute is selected as dominant, it is considered as dominant until the panelist select another dominant attribute. At each time only one attribute can be dominant. An example of such an experiment for a chocolate tasting is presented in Figure 1 with data represented as bandplots.

Figure 1: Tasting of a chocolate with 70% of cocoa by 18 panelists, denoted by to , with 10 attributes. The bands represent the succession over time of the dominant attributes selected by each panelist while tasting this chocolate. The figure has been obtained by means of the TimeSens software (www.timesens.com)

Some simple methods are currently used to describe such qualitative temporal data. Most of them rely on the observation of TDS curves, which consist in representing the evolution along time of the proportions of the dominant attributes at a panel level. Even if this statistical approach can be very informative, such a tool only provides a mean panel overview and no information about the individual variability. Some quantitative analysis are used as complement (Galmarini et al., 2017) but these methods only consider dominance durations (the time spent as dominant for each attribute). None of these approaches takes into account the whole complexity of TDS data: choices of dominant attribute, order of the choices and dominance durations that are sojourn times in the successive dominant attributes. Recently, Franczak et al. (2015)

proposed to model TDS data with Markov chains. The Markov hypothesis, meaning that the probability of the next choice of dominant attribute only depends on the current dominant attribute, seems to be reasonable from a sensory perspective. However, the Markov hypothesis imposes strong restrictions on the sojourn time distribution which should be geometrically distributed when considering a discrete time process, or exponentially distributed when considering a continuous time process (see

e.g. Norris (1998) for a general presentation of Markov chains). In a recent paper by Lecuelle et al. (2018) it has been noted that the sojourn time distributions were not distributed according to a geometric law. Consequently, it has been proposed to model TDS data with semi-Markov chains (SMC) and it has been shown that allowing arbitrarily distributed sojourn times permits to get a better fit to the data. Note that approaches based on multivariate categorical data are not adapted for TDS data since we observe sequences with a random number of visited states (see Figure 1). SMC, or Markov renewal processes, which have been introduced more than sixty years ago (Lévy, 1956; Smith, 1955), are now widely used in numerous fields of science such as queuing theory, reliability and maintenance, survival analysis, performance evaluation, biology, DNA analysis, risk processes, insurance and finance or earthquake modeling (see e.g. Barbu and Limnios (2008) and references therein).

It has often been suggested by sensory scientists (Jaeger et al., 2017) that consumers form non homogeneous populations and heterogeneity in consumers’ food products perception has been established in Prutkin et al. (2000). To take into account heterogeneity among individuals and avoid conclusions on a non-existing "average consumer", consumer segmentation is a recommended strategy (Köster (2009), Meiselman (2013)). Introducing mixtures for modeling the different perceptions of a sample of panelists for a same product can be of real interest.

A mixture model (McLachlan and Peel, 2000; Melnykov and Maitra, 2010)

is a probabilistic model enabling to represent the presence of sub populations within an overall population. Finite mixture models are widely used in numerous fields of science such as biology or economy because they offer probabilistic tools for performing clustering. Mixture models are commonly used with the Gaussian distribution but they can also be used with any parametric model (see the numerous examples in

Frühwirth-Schnatter (2006) as well as Banfield and Raftery (1993) or McNicholas (2016)). For temporal data, mixtures of Markov chains have been used in different fields such as finance (Frydman, 2005), computer science (Song et al., 2009), road traffic estimation (Lawlor and Rabbat, 2017) or labor economy (Pamminger and Frühwirth-Schnatter, 2010). In continuous time and continuous response, Delattre et al. (2016)

introduce mixtures of stochastic differential equations and use a classification rule based on estimated posterior probabilities to cluster growth curves. However, as far as we know, the present work is the first one that considers mixtures of semi-Markov processes. The purpose of this article is to estimate mixtures of Semi-Markov chains, in discrete or continuous time, to perform a segmentation of a sample of panelists into groups with similar perceptions. The methodology developed in this article can be useful in many domains for which the aim is to analyze and perform a segmentation of panels of categorical trajectories.

Identifiability is a crucial issue for mixture models (see Titterington et al. (1985) and Frühwirth-Schnatter (2006)) and we show under general conditions that, when identifiable parametric models are considered for the distribution of sojourn times, the parameters of the model are identifiable up to label swapping. The estimation of the parameters is performed with the EM algorithm (McLachlan and Krishnan, 2008) in which a penalty may be added to avoid degenerate solutions. In our sensory analysis example, sojourn times are fitted with gamma distributions and as explained in Chen et al. (2016), the likelihood is generally unbounded in case of mixtures of gamma distributions. We thus consider a penalized likelihood criterion that leads to more stable estimates and permits to avoid degenerate solutions. The number of mixture components being generally unknown, an information criterion is employed to select the number of sub populations that should be considered (see Pamminger and Frühwirth-Schnatter (2010) for a discussion about model selection in the context of mixtures of Markov chains). Then, the observed trajectories can be clustered thanks to the maximum a posteriori probability (MAP) classification approach (see Frühwirth-Schnatter (2006)).

The proposed method is illustrated on a dataset from the European Sensory Network (Thomas et al., 2017). This dataset includes TDS data for 4 Gouda cheeses tasted by 665 consumers according to 10 attributes. A mixture of SMC with gamma sojourn time distributions is adjusted to fit the data.

The article is organized as follows. Section 2 presents the mixture models and discusses the identifiability issue. Section 3 presents the EM algorithm employed for the estimation of the parameters of the mixture, the proportions and the number of components. Section 4 evaluates the performances of the statistical methods through a simulation study and Section 5 provides an illustration of the proposed method on cheese tasting data. Concluding remarks and discussion are given in Section 6.

2 Stochastic model and notations

2.1 Markov renewal processes and finite mixtures of Markov renewal processes

Figure 2: Modeling of sequence (see Figure 1) with a Markov renewal process . The successive states chosen by the panelist are Crunchy, Cocoa, Melting and Sticky

Consider a finite state homogeneous Markov chain , taking values in the finite state space , with transition matrix , whose generic elements are , . Consider the random sequence made by the successive sojourn times in the visited states. For each , represents the sojourn time at state and takes values in if time, denoted by , is discrete and in if time is continuous. For , we denote by

, the cumulative distribution function of the sojourn time given the current and the next states of the random process

. We suppose that the random process satisfies the Markov property, for all , and ,

(1)

The process is called a Markov renewal process, whereas the stochastic process giving the state of the system at every time is called a semi-Markov process (see e.g. Pyke (1961) or Barbu and Limnios (2008)). For identifiability reasons, it is also supposed that , for all , so that at each jump, the system cannot remain in the same state. To avoid trajectories with an infinite number of visited states, we also suppose that the semi-Markov chain is regular (see Pyke (1961)). This is true for gamma distributed sojourn times considered in the application, and more generally under the very weak condition that the cumulative distribution function is continuous at 0 with . Finally, to completely characterize the law of

we define the vector

of initialization probabilities

(2)

The example given in Figure 2 describes the representation, in terms of semi-Markov trajectory, of the TDS sequence of the dataset presented in Figure 1.

The distribution of the semi-Markov process is completely characterized by the set of parameters and in the following its probability law is denoted by .

Let us consider now independent semi-Markov processes taking values in the same state space , and for , the initialization vector of probabilities , the transition matrix , and the cumulative distribution functions for the sojourn times . Denoting by , the probability of observing a Markov renewal process with parameters , we consider the finite mixture process whose law is given by

(3)

The following proposition states that a finite mixture of Markov renewal processes is a Markov renewal process.

Proposition 2.1

The process is a Markov renewal process with parameters

2.2 The identifiability issue

Identifiability of mixture models can be a complicated issue (see e.g. Teicher (1963), Yakowitz and Spragins (1968), Titterington et al. (1985) or Allman et al. (2009)). However, identifiability of the parameters of a stochastic model is a very important condition to ensure the convergence of estimation algorithms to a unique value. We consider here a parametric framework and we are interested in models defined by a family of distributions where is the parameter space and is a vector of parameters characterizing the probability distribution. We consider convex combinations of probability laws in , , with , and , for .

Adopting the same definition as in Yakowitz and Spragins (1968), we say that the finite mixtures are identifiable in the family if and only if the convex hull of has the uniqueness representation property:

(4)

implies and for each there is some such that and .

Moreover, it has been proven in Yakowitz and Spragins (1968) that finite mixtures of a family are identifiable if and only if the cumulative distribution functions of the elements are linearly independent.

We suppose from now that the family of distributions of the sojourn times is parametric, , with

. Classical parametric distributions of sojourn times are the negative binomial distribution if time is discrete (

), and exponential () or gamma distributions () if time is continuous. In our renewal Markov processes framework, a parameter will be of the form . It is shown in Teicher (1963) that finite mixtures of Gamma distributions are identifiable whereas it is proven in Yakowitz and Spragins (1968) that finite mixtures of exponential distributions as well as negative binomial distributions are also identifiable. More recently, it has been shown in Gupta et al. (2016), under technical assumptions, that almost all finite mixtures of Markov chains with states are identifiable provided that at least two consecutive transitions can be observed and the number of mixture components is not too large compared to the number of states, more precisely .

As shown in the next proposition, the identifiability of mixtures of renewal Markov processes can be assessed under weaker conditions than those required for mixtures of Markov chains because the sojourn time distributions are not directly related to transition probabilities and there is no need to introduce any particular condition on the number of mixture components or on the number of states of the Markov chain. See also Gassiat et al. (2016) for an intermediate identifiability result stated for Hidden Markov Chains, which does not impose any condition on the number of states and mixture components but is based on the knowledge of the law of at least two consecutive transitions.

For sake of simplicity we assume that all the initialization probabilities and all the transition probabilities are strictly positive. This ensures that all the sojourn time distributions can be observed by considering the law of .

We also need to add the following hypothesis which means that two subpopulations and cannot have exactly the same set of parameters for the distributions of duration times.

If this condition is not fulfilled, we may have two mixture components whose sojourn time distributions are exactly the same. In that case, we are not able to distinguish the two corresponding subpopulations according to their sojourn times.

Proposition 2.2

Suppose that the family of sojourn time distributions is identifiable and that hypotheses and hold. Then all the finite mixtures from the family can be identified when the law of the sequence drawn form a mixture of renewal Markov processes is known.

In other words, it is possible to identify the parameters of a finite mixture from provided that we can observe at least one transition and the first sojourn times and the first state. Note that the condition , which is also required in Gupta et al. (2016), ensures that all the possible transitions can be observed during the first transition. This hypothesis could be weakened by considering the law of mixture sequences with more than one transition. The condition on the transition probabilities that must be strictly positive is essentially of technical nature and allows to simplify the demonstration. Note that means that the transition from to is never observed so that we cannot associate a duration time distribution to the transition from state to state in mixture component . Without assumption , we should restrict the set of indices related to the sojourn times to the set corresponding to strictly positive transition probabilities.

3 Maximum likelihood estimation and model selection

Suppose we have a sample of independent consumers, for which we may consider independent and identically distributed replications of the tasting experiment. For each consumer , with , we thus get sequences , for , observed for and denoted by,

(5)

where is the random number of visited states by consumer during replication . We suppose that .

We suppose that the observed trajectories are drawn from a mixture of semi-Markov processes whose law is given in (3) and we aim at estimating the parameters which characterize the law of the mixture: the vector of mixture proportions , and , for which characterize the law of the semi-Markov processes for each mixture component. We suppose in this Section that the number of components is known.

3.1 The particular case of gamma distributed sojourn times with replications and no anticipation

In our sensory examples, sojourn times are positive and continuous random variables and suppose that they are distributed according to gamma distributions. The choice of the gamma distribution is motivated by its simplicity and its ability to fit sojourn time distributions with many different shapes. The density depends on two parameters, the shape parameter

and , and is defined as follows,

where is the gamma function. The corresponding expected value is

and the variance

.

We suppose, as in Lecuelle et al. (2018), that the sojourn time distribution only depends on the current state,

(6)

so that there is no anticipation, in some sense, of the next dominant attribute. This assumption, which seems relevant in a food tasting context, also allows us to deal with moderate size samples by reducing significantly the number of parameters to be estimated. In that case, hypothesis means that for each mixture components and , there is at least one state such that the two cumulative distributions and are not equal. If we denote by the number of parameters required to characterize each sojourn time distribution, we only need, with this simplification, to estimate parameters to characterize the sojourn time distributions instead of in the more general setting studied in previous Section. Note that form now on , which corresponds to the particular case of gamma distributed sojourn times.

3.2 The likelihood

By successive conditioning, the likelihood related to a statistical unit with independent replications drawn from a Markov renewal process with parameters can be written

(7)

where is the cumulative distribution function for a gamma random variable with parameters and ,

If we do not suppose anymore that the mixture component from which unit arises is known, the log likelihood under the mixture model of the trajectories becomes

(8)

where is the set of parameters of the mixture model. A direct maximization of the log-likelihood (8), according to is cumbersome and classical optimization algorithm are generally not suitable to deal with that kind of problem (see e.g McLachlan and Krishnan (2008)). The EM algorithm, presented below, is preferred because it allows the optimization procedure to be decomposed into two simple steps.

3.3 The EM algorithm

The Expectation Maximization (EM) algorithm is a very useful algorithm that has first been designed to perform maximum likelihood estimation for incomplete data problems (see

Dempster et al. (1977)). It is an iterative optimization technique of the likelihood that can be very effective for estimating mixture models by considering the unknown mixture components as missing observations (see McLachlan and Peel (2000)).

Let us introduce the missing mixture component indicators, , for , which are vectors with elements, composed of 1 one and zeros and that indicates from which component of the mixture the trajectory arises. In other words, if has been generated by the mixture component then and for . The complete data log-likelihood can be written as follows:

(9)

This function is much easier to maximize, according to , than the log-likelihood function given in (8).

An initial value of the parameters must be carefully chosen before starting the algorithm. The choice of the starting point can be of great importance and is discussed in Section 3.4. The EM algorithm proceeds iteratively according to the following scheme. Suppose an estimate of , denoted by , has been calculated at step , with .

E-step

The expectation (E) step consists in computing the expected log-likelihood of the complete data given the observed trajectories and the value of the parameters estimated during the previous iteration. We define

(10)

with , the conditional probability for to be generated by the component of a mixture model with parameters , where

is the value of the set of parameters computed at previous iteration. We get with Bayes theorem,

(11)

M-step

The maximization (M) step consists in updating the value of parameter given the expected values of , for and , by looking for the maximum, according to , of the function defined in (10). The mixture probabilities only appear in the second term at the righthand side of (10). The new estimates at step are obtained by solving

(12)

where is the Lagrange multiplier associated to the constraint . We get the standard solution, , with .

The Markov chain transition matrices and initialization probabilities as well as the parameters related to the sojourn time distributions are updated by maximizing the first term at the right-hand side of (10).

Thanks to the multiplicative structure of the likelihood (7) given the mixture component, the first term at the righthand side of (10) can be written as the sum of two distinct functions, where the first one only depends on the semi-Markov chains parameters whereas the second one only depends on the sojourn time distributions . Thus, these two sets of parameters can be estimated separately by maximizing each part of the log-likelihood during the M-step. Introducing again Lagrange multipliers, this yields the standard solution for the transition probabilities estimators as well as the initialization probabilities:

(13)

where is the number of transitions for trajectory .

It is shown in Chen et al. (2016) that for gamma mixture models the log likelihood is not bounded. Intuitively, the degeneracy comes from the fact that if the ratio , which corresponds to the expected sojourn time in state for mixture is kept constant, while is tending to infinity, then the corresponding variance will tend to zero and the corresponding gamma density, mimicking the Dirac distribution at , will not be bounded. Consequently, to avoid such degenerate solution, it may be preferable to introduce a penalization in the M-step that prevents the parameters from becoming too large. Thus, we add to function , defined in (10), a penalty similar to the penalty given in Chen et al. (2016) and defined as follows

(14)

Note that this penalty does not need to take into account the parameters of the gamma distributions. Its effect decreases as the sample size and the number of observed transitions increase.

Finally, the parameters of the sojourn time distributions can be estimated by maximizing the following expected partial penalized log-likelihood:

(15)

with classical optimization procedures.

Once the algorithm has converged, model-based clustering of the observed sequences is performed by considering the maximum a posteriori probability (MAP) criterion, defined as follows: if and otherwise.

3.4 Choosing the starting point of the EM algorithm

A crucial issue for the EM algorithm is the choice of the value of the starting point . It is shown in Galmarini et al. (2017)

that the time spent in each state provides interesting indicator to study TDS data. Thus, we have chosen to select the initial values of the EM algorithm by considering the Hartigan-Wong k-means algorithm

(Hartigan and Wong, 1979) applied to the dimensional vector of mean sojourn times in each state, with the Euclidean distance and

clusters. A heuristic justification can be given by the fact that the identification of the mixture components seems to be easier for sojourn times. Indeed as seen in the proof of Proposition 

2.2

, the sojourn time distribution of any finite mixture of Markov renewal processes is identifiable when the family of sojourn time distributions is identifiable. Then, the method of moments is employed to get the initial values for the transition matrices and for the parameters of the gamma distributions.

3.5 Selection of the number of mixture components

When the number of components is unknown, an information criterion can be used to select the number of mixture components (see McLachlan and Peel (2000) for a detailed presentation of the various approaches developed in the literature). Such information criteria rely on a compromise between the fit to the data and the complexity of the considered model, more complex models being less desirable. The Bayes Information Criterion (BIC), which has good asymptotic properties (see Keribin (2000)), is simple to compute and seems effective to select the number of components. We choose to define the BIC criterion as follows,

(16)

where is the estimation of when a mixture of components has been considered and is the number of free parameters to be estimated. Note that corresponds to the number of independent observations in classical mixture models and could be different in our setting of temporal data. Indeed, it is not completely clear which value should be considered for the sample size since, for each trajectory, we have observations that are correlated over time (see Pamminger and Frühwirth-Schnatter (2010) for a discussion in a similar context of mixtures of Markov chains) and we could also take account of the number of observed transitions. In the particular setting described in Section 3.1 and taking into account the fact that for all and all , we get , with for the two parameters gamma distribution. If there is one absorbing state in , as in the example in Section 5, and we suppose that it is not possible that the first observed state is this absorbing state, then .

Other popular criteria are the Akaike information criterion (AIC), in which the term in (16) which penalize the complexity of the model is replaced by and the corrected AIC, denoted by in which the term is replaced by .

4 A simulation study

A simulation study is conducted to evaluate the performances of the penalized and unpenalized EM algorithms under various mixture scenarios. We also measure the ability of the AIC and the BIC criteria to select the correct number of mixture components. Simulations are performed using the R language (R Development Core Team, 2018) and C++ with the Rcpp package (Eddelbuettel and François, 2011). Programs are available on request to the authors.

4.1 Simulation protocol and indicators of performance

In order to get realistic simulations, we simulated qualitative trajectories based on semi-Markov chains whose parameters were estimated on the real dataset presented in the Introduction of the paper. In that experiment, panelists evaluated three different chocolates, with a list of attributes, the first one with 70% of cocoa, the second one with 70% of cocoa too but sweeter than the first one and a third one with 90% of cocoa (see (Visalli et al., 2016) for a more detailed presentation of the data). An experiment related to the tasting of the first chocolate is presented in Figure 1. The components of the renewal Markov process corresponding to each chocolate are estimated by maximum likelihood (see Lecuelle et al. (2018)), considering gamma distributed sojourn times with no anticipation effect, as in Section 3.1. We also evaluate the effect of the introduction of the penalty (14) on the accuracy of the estimates.

First, we consider a known number of components equal to two, with simulated sequences of 4 or 10 transitions, without any absorbing state. We study two cases of mixtures: the fist one with two well separated sub populations (the chocolates with 70% and 90% of cocoa) and the second one with two populations with similar distributions (the two chocolates with 70% of cocoa).

Second, we assume that the number of components is unknown to evaluate the ability of the different information criteria presented in Section 3.5 to recover the true number of components in the population. We consider three different configurations. One with only one component (chocolate with 70% of chocolate), one with two well separated components (the chocolates with 70% and 90% of cocoa) and one with two similar components (the two chocolates with 70% of cocoa). The selection of the number of components is a difficult task and the information criteria do not always give good results with stochastic processes (see for example (Celeux and Durand, 2008)).

Thanks to our knowledge of these chocolates, we can assume that some transitions are not possible (occur with a probability zero). Taking this information into account, we can reduce the number of transition parameters to be estimated. We have 49 unknown probability transition parameters for the chocolate with 70% of cocoa, 62 unknown parameters when considering the two chocolates with 70% of cocoa and 69 unknown parameters when considering the chocolate with 70% of cocoa and the one with 90% of cocoa.

For simulating mixtures with components, the number of individuals belonging to each component is randomly selected thanks to a binomial law , meaning that . Then, for each type of chocolate, individual trajectories are simulated sequentially by selecting randomly the successive states and durations according to the estimated transition probabilities and dominance duration distributions. For each case, we simulated 500 datasets with sample of sizes , and and replications.

In order to avoid computation issues when estimating the parameters related to the gamma distributions, the values of are rounded to and the maximum likelihood estimation is only performed when there are more than 7 observations. Otherwise the gamma parameters are set to the values estimated on all the observations belonging to the corresponding mixture, independently of the state.

The number of maximal iterations of the EM algorithm is set to 100. A posteriori this was large enough because, for all the considered designs, convergence was achieved before 100 iterations.

To check if the transition matrices are well estimated, we consider the following relative error between the estimated transition matrices and the transition matrices used to generate the simulated data for component :

(17)

where is the squared Frobenius norm of matrix . A similar error is computed for the initial probabilities:

(18)

We also check if the estimated parameters of the sojourn time gamma distribution are well estimated by considering the following relative errors

(19)

4.2 Results

Parameters
With 4 transitions
Well separated components
.01(.01) .01(.02) .26(.12) .18(.10) .23(.75) .40(.91) .55(.14)
<.01(<.01) <.01(<.01) .06(.04) .04(.02) .04(.06) .08(.14) .50(.04)
<.01(<.01) <.01(<.01) .02(.01) .01(<.01) .01(.01) .02(.02) .50(.02)
Not well separated
.01(.01) .03(.04) .33(.13) .47(.15) .44(1.77) .70(3.00) .61(0.25)
<.01(<.01) .01(.03) .10(.08) .16(.16) .29(2.18) .38(2.93) .49(.12)
<.01(<.01) <.01(<.01) .02(.01) .01(.03) .03(.06) .04(.12) .50(.03)
With 10 transitions
Well separated components
.01(.01) .01(.01) .08(.06) .05(.04) .07(.12) .15(.21) .50(.10)
<.01(<.01) <.01(<.01) .02(.01) .01(<.01) .01(.01) .02(.02) .50(.04)
<.01(<.01) <.01(<.01) .01(<.01) <.01(<.01) <.01(<.01) .01(<.01) .50(.02)
Not well separated
.01(.01) .03(.07) .09(.06) .23(.19) .18(.41) .21(.85) .62(.19)
<.01(<.01) <.01(<.01) .02(.01) .02(.06) .03(.04) .03(.04) .53(.07)
<.01(<.01) <.01(<.01) .01(<.01) <.01(<.01) .01(.01) .01(.01) .50(.02)
Table 1: Parameter estimation errors when considering unpenalized EM for two clusters with and and with simulated sequences with 4 and 10 transitions and

repetitions. For each design, mean and standard deviation, in brackets, are computed considering 500 simulated datasets.

Parameters
With 4 transitions
Well separated components
.01(.01) .01(.02) .26(.13) .18(.10) .10(.07) .24(.15) .54(.15)
<.01(<.01) <.01(<.01) .06(.04) .04(.02) .03(.03) .06(.07) .50(.05)
<.01(<.01) <.01(<.01) .02(.01) .01(<.01) .01(<.01) .01(.01) .50(.02)
Not well separated
.01(.01) .04(.08) .32(.13) .48(.16) .11(.11) .22(.42) .61(.26)
<.01(<.01) .01(.02) .10(.07) .15(.16) .09(.19) .11(.22) .50(.12)
<.01(<.01) <.01(<.01) .02(.01) .01(.05) .03(.22) .04(.30) .50(.03)
With 10 transitions
Well separated components
.01(.01) .01(.01) .08(.07) .06(.05) .06(.04) .13(.11) .49(.11)
<.01(<.01) <.01(<.01) .01(.01) .01(.01) .01(.01) .02(.03) .50(.04)
<.01(<.01) <.01(<.01) .01(<.01) <.01(<.01) <.01(<.01) .01(<.01) .50(.02)
Not well separated
.01(.01) .02(.04) .10(.06) .20(.18) .09(.14) .10(.20) .62(.18)
<.01(<.01) <.01(<.01) .02(.01) .01(.04) .03(.02) .03(.03) .53(.07)
<.01(<.01) <.01(<.01) .01(<.01) <.01(<.01) .01(<.01) .01(<.01) .50(.02)
Table 2: Parameter estimation errors when considering penalized EM for two clusters with and and with simulated sequences with 4 and 10 transitions and repetitions. For each design, the mean and the standard deviation, in brackets, are computed considering 500 simulated datasets.

Parameter estimation errors, evaluated with (17), (18) and (19), are given in Table 1 for the unpenalized version of the EM algorithm and in Table 2 when the penalized version of the EM algorithm described in (15) is employed to estimate the parameters. We note that the introduction of the penalty allows to improve the accuracy of the estimates, especially for small samples, few transitions, or with clusters with similar distribution of the semi-Markov processes. Without penalty, we observe larger mean errors for the estimated parameters of the sojourn time distribution and high values for the standard deviations of the errors. For example, when with only 4 observed transitions and clusters that are not well separated, we obtain without penalty whereas this error is reduced to thanks to the introduction of the penalty. When the sample size gets larger ( or ) and the number of transitions is large both estimation procedures lead to similar results.

From now on, we will only consider estimates obtained with the penalized EM algorithm.

Well separated components Not well separated components
With 4 transitions
k-means .85(.07) .86(.05) .86(.03) .81(.09) .78(.08) .76(.06)
Mixture model .92(.07) .99(.02) 1(<.01) .82(.09) .93(.06) .98(.02)
With 10 transitions
k-means .87(.07) .89(.04) .90(.02) .83(.07) .84(.05) .85(.03)
Mixture model .97(.05) 1(.01) 1(<.01) .89(.09) .97(.05) 1(<.01)
Table 3: Correct classification rate for two clusters with well separated components and not well separated components with and and with length of simulated sequences equal to 4 and 10 transitions. For each design, mean and standard deviation, in brackets, are computed from 500 simulated datasets.

In our simulation context, we know for each trajectory which component of the mixture it belongs to and we can check if it has been assigned with the MAP criterion to the right component. The rate of correct classification is given in Table 3

. We note that overall, the rate of well classified trajectories is high with values ranging from

to . Model-based clustering substantially improves the classification accuracy compared to k-means, except for the more difficult case with a small sample (), 4 transitions and clusters not well separated, were both approaches do not perform well.

Selected number One component Two components
of components Well separated Not well separated
With 4 transitions
BIC
1 500 500 500 500 5 0 500 491 0
2 0 0 0 0 493 500 0 9 494
3 0 0 0 0 2 0 0 0 6
AIC
1 500 500 500 93 0 0 491 4 0
2 0 0 0 394 473 498 9 373 487
3 0 0 0 13 27 2 0 123 13
With 10 transitions
BIC
1 500 500 500 43 0 0 411 0 0
2 0 0 0 457 497 500 89 431 495
3 0 0 0 0 3 0 0 69 5
AIC
1 500 500 500 0 0 0 27 0 0
2 0 0 0 407 491 498 245 318 495
3 0 0 0 93 9 2 228 182 15
Table 4: Choice of the number of components with one component, two well separated components and two not well separated components and 4 or 10 observed transitions. The number of clusters selected by the BIC and the AIC are shown for 500 simulated datasets.

We present in Table 4 the number of components selected by the BIC and the AIC. Whatever the number of individuals, the BIC and the AIC select the correct number of components when there is only one component. With two well separated clusters, the BIC and the AIC generally give good results, except for the case with 4 transitions and where the BIC and, to a lesser degree, the AIC, select only one component rather than two.

When the two mixture components are not very different, the BIC and the AIC only provide effective criteria for selecting the number of components when the samples are large. The AIC often selects the same number of components as the BIC, but it sometimes selects too many components. For small samples and small number of transitions, the BIC criterion is more restrictive and tends to lead to an underestimation of the true number of components. Similar conclusions, in a different context, are found in Celeux and Durand (2008). The can only be used with large samples because of the too large number of parameters of the model. It does not perform better than the BIC and the AIC in this simulation study and the corresponding results are not shown here.

5 Clustering Temporal Dominance of Sensations for a Gouda cheese

We now study data resulting on an experiment of the European Sensory Network (ESN) aiming at measuring simultaneously perception and liking of Gouda cheeses (Thomas et al., 2017). A large panel of consumers from 6 european countries tasted 4 Gouda cheeses with different ages and fat contents according to the Temporal Dominance of Sensations protocol. A list of attributes was presented to the consumers on a computer screen. Panelists tasted successive bites so there is 3 sequences corresponding to the 3 repetitions for each panelist and for each product. In this sample, the mean number of transitions within a sequence is equal to 4.1 (see Table 14 in the Appendix).

Our goal in this study is to perform a segmentation of the panel, and to describe, if there are any, the differences of perceptions for a product. We only present the results for a young and low fat Gouda cheese, whose perception by consumers is more complex.

The maximal number of iterations of the EM algorithm is set to 400 because we observed that the algorithm requires more iterations to converge with this dataset. This can be explained by a higher complexity of the model than in the simulation study because all transitions are possible with these products.

1 2 3 4
BIC 87449.96 86720.10 86830.30 87295.43
AIC 86839.73 85494.05 84988.43 84837.74
86882.94 85710.59 85636.61 86554.71
Table 5: Values taken by the BIC, the AIC and the for a number of clusters ranging from 1 to 4 for the young and low fat Gouda cheese from the ESN dataset.

As shown in Table 5, all the information criteria approaches select at least two mixture components, showing the existence of different behaviors in the panel. The BIC suggests to consider two clusters and the suggests to consider three clusters but both take really close values for two and three components. That is why, we will examine these two cases in the following. The AIC suggests to consider at least 4 clusters, but as it well known, it is a less parsimonious criterion than the BIC and the which generally leads to consider a too large number of mixture components. With two components, the obtained clusters are respectively composed of 398 and 267 individuals, whereas with three components, the obtained clusters are respectively composed of 242, 209 and 214 individuals.

Cluster Estimates for the following attributes:
Bitter Cheese Dense Fatty Melting Milky Salty Sharp Sour Tender
hard cream
With 2 components
1 .03 .08 .44 .05 .03 .04 .03 .01 .02 .27
2 .02 .04 .39 .08 .05 .05 .03 .02 .01 .32
With 3 components
1 .02 .10 .15 .07 .06 .07 .04 .01 .02 .46
2 .02 .04 .41 .08 .04 .04 .02 .02 .01 .32
3 .03 .05 .75 .03 0 0 .03 .02 .02 .06

Table 6: Estimated initial probabilities for the young and low fat Gouda cheese from the ESN dataset, considering two or three mixture components.

The estimated initial probabilities are shown in Table 6. As expected in sensory studies, most of the panelists choose a texture attribute (Dense hard or Tender) as first dominant attribute. With two components, the initial probabilities are really close with only some small differences. On the other hand, with three components, large differences are observed between clusters especially for the attributes Dense hard and Tender. In cluster one, most of the panelists chose Tender as first attribute whereas in cluster three, most of the panelists chose Dense hard. In cluster two, both Dense hard and Tender have a high probability to be chosen as first attribute.

Figure 3: Estimated sojourn time distributions for the attributes Cheese, Dense hard and Tender when considering two mixture components for the young and low fat Gouda cheese.

Figure 4: Estimated sojourn time distributions for the attributes Cheese, Dense hard and Tender when considering three components for the young and low fat Gouda cheese.

Figure 3 presents the estimated gamma distributions of the sojourn times, with two components, for the attributes Cheese, Dense hard and Tender. Figure 4 presents the estimated sojourn time distributions when considering three components. We can note that for all clusters, there are only small differences between the estimated distributions of the different attributes. Then, we can observe that with two components, the estimated distributions are different between the two clusters, with higher probabilities for long durations in cluster one. With three components, the estimated distributions are really similar for the clusters one and three but are different from cluster two.

Figure 5: TDS graph for the young and low fat Gouda cheese.
Figure 6: TDS graphs for the young and low fat Gouda cheese when clustering into 2 segments.
Figure 7: TDS graphs for the young and low fat Gouda cheese when clustering into 3 segments.

TDS graphs of Figure 5-7 represent the most important transitions, namely those with a probability larger than 0.15 and with at least one half of the panelists having actually elicited the attribute of the product. The TDS graph from the whole group (Figure 5) suggests the existence of two different sequences of perception: the first one starts with the attribute "Dense Hard", transits to "Cheese" and ends, whereas the second one starts with "Tender" and either goes to "Cheese", then ends, or goes to "Melting" and then either ends or goes to "Cheese" before ending. It is interesting to note that the "Milky Cream" and "Salty" attributes are present on the graph, since elicited by 54% and 55 % of the panelists, but are reached by no arrows, since every transition to them occurred with a probability lower than 0.15. Considering the segmentation into two clusters, Figure 6 presents the two TDS graphs associated to each cluster. Both clusters start with a more or less balanced choice between "Dense Hard" and "Tender". From that point, panelists of Cluster 1 move to "Cheese" and then ends, whereas panelist of cluster 2 followed a more complex route. Indeed, those being on "Tender" can move to "Melting", Cheese" or "Fatty" and those on "Dense Hard" to the same but "Melting". Then their route to the end can be quite complex using some transitions to "Milky Cream" or Salty", the two attributes having too small probabilities at panel level to be reached. The fact that both group starts, as the whole panel, with a choice between opposite attribute "Tender" and "Dense Hard" is not satisfying and claims for investigating the decomposition into three clusters. Indeed, Clusters 1 and 3 in Figure 7 (segmentation into 3 groups) start respectively with "Tender" and "Dense Hard" and then follow a different route: Cluster going directly to "Cheese", whereas Cluster 1 first transits to "Melting", "Milky Cream" or "Fatty" before reaching the "Cheese" perception. Cluster 1 ends after "Cheese", whereas Cluster 3 can also transit first by "Salty". Cluster 2 in Figure 7 is rather similar to Cluster 2 in Figure 6 exhibiting a very complex perception path. Indeed, Table 14 shows that panelists in this cluster did an average number of transitions between 5 and 6, whereas panelists in the two other clusters made only an average number of transitions equal to 3. Thus this cluster is likely to gather panelists with different perceptions, but the algorithm was not able to split them while still improving the fit. Therefore, it is also possible that this cluster gathers "noisy panelists", namely panelists having not clearly understood the TDS task.

From a sensory perspective, the segmentation enables to model what is really perceived by the panelists instead of considering a mean panel overview, corresponding to the perception of none of the panelists. The observed differences between clusters can be explained both by real differences of perception and by differences of behavior with the TDS task. Such mixture models give the opportunity to further investigate on these new questions by for example examining the relation between perception and other variables such as age, sex or experience.

6 Concluding remarks

This research was motivated by the need of a segmentation method for temporal sensory data. For this purpose we have introduced a new mixture of semi-Markov chains which allows, thanks to a model-based clustering approach, to gather into homogeneous groups consumers having similar tasting perceptions. A penalized EM is introduced to estimate the parameters of the semi Markov chains and the mixture proportions. The evaluation of this estimation method on simulated data shows good performance, improving the segmentation obtained by the k-means algorithm, while providing much more information on individual behaviors. The results on real data show an interesting progress in TDS data analysis by offering the possibility of exhibiting different perceptions in a panel for a same product. The development of such segmentation approaches open new perspectives, both for understanding the perception mechanism and for studying how panelists used TDS and understand the TDS protocol.

The models presented in this paper may depend on a large number of parameters and so require to have large samples at hand to be estimated accurately. However, the real data analysis shows that only small differences seem to exist in a same cluster between the gamma distributions modeling sojourn times in the different states. If this hypothesis is verified, we could consider a more parsimonious model by estimating only one gamma distribution for all the states. As usual with unsupervised classification, choosing the number of clusters is a difficult task. The method used in this paper relies on information criteria and is not very effective for small samples. The BIC criterion seems to overestimate the model complexity whereas AIC has a tendency to select models with a too large complexity.

From a statistical perspective, this sensory modeling issue has given us the opportunity to study a new model for mixtures of qualitative trajectories which may have applications in many fields of science. The identifiability issue has been addressed under general conditions, considering parametric families of sojourn time distributions. From a methodological perspective, it also showed that introducing a penalty in the maximisation step of the EM algorithm improves the quality of the estimates. However some further investigations have to be done to determine whose penalty is the most effective. It would also be of great interest to check rigorously in a future work the consistency of such penalized maximum likelihood approach in the context of mixtures of semi-Markov chains and to study the asymptotic distribution of the estimators. This would permit to build confidence intervals and to test statistical hypotheses.

References

  • Allman et al. (2009) Allman, E. S., C. Matias, and J. A. Rhodes (2009). Identifiability of parameters in latent structure models with many observed variables. Ann. Statist. 37(6A), 3099–3132.
  • Banfield and Raftery (1993) Banfield, J. D. and A. E. Raftery (1993). Model-based gaussian and non-gaussian clustering. Biometrics 49(3), 803–821.
  • Barbu and Limnios (2008) Barbu, V. S. and N. Limnios (2008).

    Semi-Markov chains and hidden semi-Markov models toward applications : their use in reliability and DNA analysis

    .
    New York: Springer Science + Business Media.
  • Celeux and Durand (2008) Celeux, G. and J.-B. Durand (2008).

    Selecting hidden Markov model state number with cross-validated likelihood.

    Comput. Statist. 23, 541–564.
  • Chen et al. (2016) Chen, J., S. Li, and X. Tan (2016). Consistency of the penalized MLE for two-parameter gamma mixture models. Sci. China Math. 59(12), 2301–2318.
  • Delattre et al. (2016) Delattre, M., V. Genon-Catalot, and A. Samson (2016). Mixtures of stochastic differential equations with random effects: Application to data clustering. Journal of Statistical Planning and Inference 173, 109–124.
  • Dempster et al. (1977) Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via EM algorithm. Journal of the Royal Statistical Society Series B-Methodological 39(1), 1–38.
  • Eddelbuettel and François (2011) Eddelbuettel, D. and R. François (2011). Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40(8), 1–18.
  • Franczak et al. (2015) Franczak, B. C., R. P. Browne, P. D. McNicholas, J. C. Castura, and C. J. Findlay (2015). A Markov model for temporal dominance of sensations data. In 11th Pangborn symposium.
  • Frühwirth-Schnatter (2006) Frühwirth-Schnatter, S. (2006). Finite mixture and Markov switching models. Springer series in statistics. Springer.
  • Frydman (2005) Frydman, H. (2005). Estimation in the mixture of Markov chains moving with different speeds. Journal of the American Statistical Association 100(471), 1046–1053.
  • Galmarini et al. (2017) Galmarini, M. V., M. Visalli, and P. Schlich (2017). Advances in representation and analysis of mono and multi-intake temporal dominance of sensations data. Food Quality and Preference 56, 247–255.
  • Gassiat et al. (2016) Gassiat, E., A. Cleynen, and S. Robin (2016). Inference in finite space non parametric Hidden Markov Models and applications. Statistics and Computing 26, 61–71.
  • Gupta et al. (2016) Gupta, R., R. Kumar, and S. Vassilvitskii (2016). On mixtures of Markov chains. In D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 29, pp. 3441–3449. Curran Associates, Inc.
  • Hartigan and Wong (1979) Hartigan, J. and M. Wong (1979). Algorithm as 136: A k-means clustering algorithm. Applied Statistics 28, 100–108.
  • Hort et al. (2017) Hort, J., S. Kemp, and T. Hollowood (2017). Time-dependent measures of perception in sensory evaluation. Wiley Blackwell.
  • Jaeger et al. (2017) Jaeger, S. R., J. Hort, C. Porcherot, G. Ares, S. Pecore, and H. J. H. MacFie (2017). Future directions in sensory and consumer science: Four perspectives and audience voting. Food Quality and Preference 56, 301–309.
  • Keribin (2000) Keribin, C. (2000). Consistent estimation of the order of mixture models. Sankhya: The Indian J. of Statist., Serie A 62, 49–66.
  • Köster (2009) Köster, E. (2009). Diversity in the determinants of food choice: A psychological perspective. Food Quality and Preference 20, 70–82.
  • Lawlor and Rabbat (2017) Lawlor, S. and M. G. Rabbat (2017). Time-varying mixtures of Markov chains: An application to road traffic modeling. IEEE Transactions on Signal Processing 65(12), 3152–3167.
  • Lecuelle et al. (2018) Lecuelle, G., M. Visalli, H. Cardot, and P. Schlich (2018). Modeling temporal dominance of sensations with semi-Markov chains. Food Quality and Preference 67, 59–66.
  • Lévy (1956) Lévy, P. (1956). Processus semi-Markoviens. In Proceedings of the International Congress of Mathematicians, 1954, Amsterdam, vol. III, pp. 416–426. Erven P. Noordhoff N.V., Groningen; North-Holland Publishing Co., Amsterdam.
  • McLachlan and Krishnan (2008) McLachlan, G. J. and T. Krishnan (2008). The EM algorithm and extensions (2nd ed.). New York: Wiley Series in Probability and Statistics.
  • McLachlan and Peel (2000) McLachlan, G. J. and D. Peel (2000). Finite Mixture Models. New York: Wiley Series in Probability and Statistics.
  • McNicholas (2016) McNicholas, P. D. (2016). Model-based clustering. Journal of Classification 33(3), 331–373.
  • Meiselman (2013) Meiselman, H. (2013). The future in sensory/consumer research: evolving to a better science. Food Quality and Preference 27, 208–214.
  • Melnykov and Maitra (2010) Melnykov, V. and R. Maitra (2010). Finite mixture models and model-based clustering. Statistics Surveys 4, 80–116.
  • Neilson (1957) Neilson, A. (1957). Time-intensity studies. Drug & Cosmetic Industry 80, 452–453.
  • Norris (1998) Norris, J. R. (1998). Markov chains, Volume 2 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge. Reprint of 1997 original.
  • Pamminger and Frühwirth-Schnatter (2010) Pamminger, C. and S. Frühwirth-Schnatter (2010). Model-based clustering of categorical time series. Bayesian Anal. 5(2), 345–368.
  • Pineau et al. (2009) Pineau, N., P. Schlich, S. Cordelle, C. Mathonniere, S. Issanchou, A. Imbert, M. Rogeaux, P. Etievant, and E. Koster (2009). Temporal dominance of sensations: Construction of the tds curves and comparison with time-intensity. Food Quality and Preference 20(6), 450–455.
  • Prutkin et al. (2000) Prutkin, J., V. Duffy, E. L., K. Fast, E. Gardner, L. Lucchina, D. Snyder, K. Tie, J. Weiffenbach, and L. Bartoshuk (2000). Genetic variation and inferences about perceived taste intensity in mice and men. Physiology & Behavior 69, 161–173.
  • Pyke (1961) Pyke, R. (1961). Markov renewal processes: definitions and preliminary properties. Ann. Math. Statist. 32, 1231–1242.
  • R Development Core Team (2018) R Development Core Team (2018). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
  • Schlich (2017) Schlich, P. (2017). Temporal dominance of sensations (TDS): a new deal for temporal sensory analysis. Current Opinion in Food Science 15, 38–42.
  • Smith (1955) Smith, W. L. (1955). Regenerative stochastic processes. Proceedings of the Royal Society Series A 232, 6–31.
  • Song et al. (2009) Song, Y., A. Keromytis, and S. Stolfo (2009).

    Spectrogram: A mixture-of-Markov-chains model for anomaly detection in web traffic.

    In Proceedings of the Network and Distributed System Security Symposium, NDSS.
  • Teicher (1963) Teicher, H. (1963). Identifiability of finite mixtures. The Annals of Mathematical Statistics 34, 1265–1269.
  • Thomas et al. (2017) Thomas, A., M. Chambault, L. Dreyfuss, C. C. Gilbert, A. Hegyi, S. Henneberg, A. Knippertz, E. Kostyra, S. Kreme, A. P. Silva, and P. Schlich (2017). Measuring temporal liking simultaneously to temporal dominance of sensations in several intakes. an application to gouda cheeses in 6 europeans countries. Food Research International 99, 426–434.
  • Titterington et al. (1985) Titterington, D., A. Smith, and U. Makov (1985). Statistical analysis of finite mixture distributions. Wiley series in probability and mathematical statistics: Applied probability and statistics. Wiley.
  • Visalli et al. (2016) Visalli, M., C. Lange, L. Mallet, S. Cordelle, and P. Schlich (2016). Should I use touchscreen tablets rather than computers and mice in TDS trials? Food Quality and Preference 52, 11–16.
  • Yakowitz and Spragins (1968) Yakowitz, S. and J. Spragins (1968). On the identifiability of finite mixtures. The Annals of Mathematical Statistics 39, 209–214.

Appendix

A. Proofs

Proof of Proposition 2.1.
The proof is immediate. Consider the unobserved latent class variable , taking values in and satisfying for . The law of given can be expressed as . Thus,

We also clearly have that is a Markov chain, with transition probabilities,

and, for ,

Proof of Proposition 2.2.
We have, for each mixture component , , for , and . We introduce the dimension (functional) vector, for ,

With assumption (H1), all the coefficients are strictly positive. Since the family of sojourn time distributions is identifiable, we can deduce, with assumption (H2) that are linearly independent vectors of functions. Considering the characterization of identifiability established in Yakowitz and Spragins (1968), this implies that, up to label swapping, there is a unique way of writing the mixture distribution ,

For , denote by the dimensional vector of functions that can be identified from the knowledge of . Since, by assumption (H1), , we can determine the value of the set of parameters by comparing with , and we can write each component of as follows .

We prove now that the mixture probability , the initialization probabilities and the transition probabilities are uniquely determined when the set of coefficients is known. Since , we first note that

because and . Using the same trick again, we get that, for each ,