1 Introduction
Modelling hierarchical aspects in complex stochastic processes is an important research issue in many application domains. In a hierarchy, each level is an abstraction
of lower-level details. Consider, for example, a frequent human activity such as 'eat breakfast', which may include a series of more specific activities such as 'enter kitchen', 'go to cupboard', 'take cereal', 'wash dishes' and 'leave kitchen'. Each specific activity can be decomposed into finer details. Similarly, in natural language processing (NLP), syntax trees are inherently hierarchical. In the partial parsing task known as noun-phrase (NP) chunking
(Sang and Buchholz, 2000), there are three semantic levels: the sentence, noun-phrases and part-of-speech (POS) tags. In this setting, the sentence is a sequence of NPs and non-NPs, and each phrase is a subsequence of POS tags. A popular approach to dealing with hierarchical data is to build a cascaded model in which each level is modelled separately and the output of the lower level is used as the input of the level right above it (e.g. see (Oliver et al., 2004)). For instance, in NP chunking this approach first builds a POS tagger and then constructs a chunker that incorporates the output of the tagger. This approach is clearly suboptimal because the POS tagger takes no account of the NPs and the chunker is not aware of the reasoning of the tagger. In contrast, a noun-phrase is often very informative for inferring the POS tags belonging to the phrase. As a result, this layered approach often suffers from the so-called cascading error problem, as errors introduced at the lower layer propagate to higher levels.
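As a concrete (toy) illustration of this pipeline, the sketch below wires a stand-in POS tagger into a stand-in chunker. Both models, their lexicon and the tag heuristics are invented for illustration and are not taken from the paper:

```python
# Sketch of the cascaded (pipeline) approach: a POS tagger feeds a chunker.
# All models here are stand-in lookup tables, purely for illustration.

def pos_tag(words):
    # A toy "tagger": looks words up in a fixed lexicon.
    lexicon = {"dogs": "NNS", "bark": "VBP", "loud": "JJ", "the": "DT"}
    return [lexicon.get(w, "NN") for w in words]

def np_chunk(words, tags):
    # A toy "chunker": marks runs of DT/JJ/NN* tags as inside an NP.
    chunks = []
    for tag in tags:
        inside_np = tag in ("DT", "JJ") or tag.startswith("NN")
        chunks.append("NP" if inside_np else "O")
    return chunks

words = ["the", "dogs", "bark"]
tags = pos_tag(words)           # the tagger runs first, blind to the NP level
chunks = np_chunk(words, tags)  # the chunker inherits any tagging errors
```

If the tagger mislabelled 'bark' as a noun, the chunker would absorb it into the NP and has no way to recover: this is precisely the cascading error problem.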
A more holistic approach is to build a joint representation of all the levels. Formally, we are given a data observation and we need to model and infer the joint semantics. The main problem is to choose an appropriate representation so that inference can be efficient. In this paper, we are interested in a specific class of hierarchical models that supports both joint modelling and efficient inference. More specifically, the models of interest are recursive and sequential, in that each level is a sequence and each node in a sequence can be decomposed further into a subsequence of finer grain at the lower level.
There has been substantial investigation of these types of model, especially in the area of probabilistic context-free grammars (e.g. see (Manning and Schütze, 1999, Chapter 11)). However, grammars are often unbounded in depth and thus difficult to represent by graphical models. A more restricted version known as the hierarchical hidden Markov model (HHMM) (Fine et al., 1998) offers a clearer representation in that the depth is fixed and the semantic levels are well defined. Essentially, the HHMM is a nested hidden Markov model (HMM) in the sense that each state is a sub-HMM by itself.
These models share a common property in that they are generative, i.e. they assume that the data observation is generated by the hierarchical semantics. The generative models try to construct the joint distribution of the observation and the semantics. However, there are some drawbacks associated with this approach. First, the generative process being modelled is typically unknown and complicated. Second, given an observation, we are more often interested in inferring the semantics conditioned on that observation; since only this conditional distribution is needed, modelling the observation itself may be unnecessary. An attractive alternative is to model the conditional distribution directly, avoiding modelling the observation. This line of research has recently attracted much interest, largely triggered by the introduction of the conditional random field (CRF) (Lafferty et al., 2001). The advantages of the CRF are largely attributed to its discriminative nature, which allows arbitrary and long-range interdependent features.
In this paper we follow the HMM/HHMM path to generalise from chain-structured CRFs to nested CRFs. As a result, we construct a novel model called the Hierarchical Semi-Markov Conditional Random Field (HSCRF), which is an undirected conditional graphical model of nested Markov chains. Thus the HSCRF combines the discriminative nature of CRFs with the nested modelling of the HHMM.
To be more concrete, let us return to the NP chunking example. The problem can be modelled as a three-level HSCRF, where the root represents the sentence, the second level the NP process, and the bottom level the POS process. The root and the two processes are conditioned on the sequence of words in the sentence. Under the discriminative modelling of the HSCRF, rich contextual information such as the starting and ending of a phrase, the phrase length, and the distribution of words falling inside the phrase can be effectively encoded. Such encoding is much more difficult for HHMMs.
We then proceed to address several important issues. First, we show how to represent HSCRFs using a dynamic graphical model (e.g. see (Lauritzen, 1996)) which effectively encodes hierarchical and temporal semantics. For parameter learning, an efficient algorithm based on the Asymmetric Inside-Outside algorithm of Bui et al. (2004) is introduced. For inference, we generalise the Viterbi algorithm to decode the semantics from an observational sequence.
The common assumptions in discriminative learning and inference are that the training data is fully labelled and that the test data during inference is not labelled. We propose to relax these assumptions in that training labels may be only partially available, and we term this learning setting partial-supervision. Likewise, when some labels are given during inference, the algorithm should automatically adjust to meet the new constraints.
We demonstrate the effectiveness of HSCRFs in two applications: (i) segmenting and labelling activities of daily living (ADLs) in an indoor environment and (ii) jointly modelling noun-phrases and part-of-speech tags in shallow parsing. Our experimental results in the first application show that the HSCRFs are capable of learning rich, hierarchical activities with good accuracy and exhibit better performance than DCRFs and flat CRFs. Results for the partially observable case also demonstrate that a significant reduction of training labels still results in models that perform reasonably well. We also show that observing a small number of labels can significantly increase the accuracy during decoding. In shallow parsing, the HSCRFs can achieve higher accuracy than standard CRF-based techniques and the recent DCRFs.
To summarise, in this paper we claim the following contributions:

Introducing a novel Hierarchical Semi-Markov Conditional Random Field (HSCRF) to model complex hierarchical and nested Markovian processes in a discriminative framework,

Developing an efficient generalised Asymmetric Inside-Outside (AIO) algorithm for fully-supervised learning.

Generalising the Viterbi algorithm for decoding the most probable semantic labels and structure given an observational sequence.

Addressing the problem of partially-supervised learning and constrained inference.

Demonstration of the applicability of the HSCRFs for modelling human activities in the domain of home video surveillance and for shallow parsing of English.
Notations and Organisation
This paper makes use of a number of mathematical notations which we include in Table 1 for reference.
Notation  Description
  Subset of state variables from a given level down to a lower level, starting from a given time and ending at a later time, inclusive.
  Subset of ending indicators from a given level down to a lower level, starting from a given time and ending at a later time, inclusive.
  Set of state variables and ending indicators of a sub-model rooted at a given state and level, spanning a substring.
  Contextual clique.
  Time indices.
  Set of all ending time indices.
  State.
  State-persistence potential of a state at a given level, spanning a substring.
  Initialisation potential of a state at a given level and time, initialising a sub-state.
  Transition potential at a given level and time, from one state to another under the same parent.
  Ending potential of a state at a given level and time, receiving the return of control from its child.
  The global potential of a particular configuration.
  The number of state symbols at a given level.
  The symmetric inside mass for a state at a given level, spanning a substring.
  The full symmetric inside mass for a state at a given level, spanning a substring.
  The symmetric outside mass for a state at a given level, spanning a substring.
  The full symmetric outside mass for a state at a given level, spanning a substring.
  The asymmetric inside mass for a parent state starting at a given time, with a child state which returns control to the parent or transits to a new child state.
  The asymmetric outside mass, the counterpart of the asymmetric inside mass.
  Potential functions.
The rest of the paper is organised as follows. Section 2 reviews Conditional Random Fields. Section 3 continues with the HSCRF model definition and parameterisation. Section 4 defines the building blocks required for common inference tasks; these blocks are computed in Sections 4.2 and 4.3. Section 5 presents the generalised Viterbi algorithm. Parameter estimation follows in Section 6. Learning and inference with partially available labels are addressed in Section 7. Section 8 presents a method for numerical scaling to prevent numerical overflow. Section 9 documents experimental results. Section 10 concludes the paper.
2 Related Work
2.1 Hierarchical Modelling of Stochastic Processes
Hierarchical modelling of stochastic processes can be largely categorised as either graphical models extending the flat hidden Markov models (HMMs) (e.g., the layered HMM (Oliver et al., 2004), the abstract HMM (Bui et al., 2002), the hierarchical HMM (HHMM) (Fine et al., 1998; Bui et al., 2004), the DBN (Murphy, 2002)) or grammar-based models (e.g., the PCFG (Pereira and Schabes, 1992)). These models are all generative.
Recent developments in discriminative, hierarchical structures include extensions of flat CRFs (e.g. dynamic CRFs (DCRFs) (Sutton et al., 2007), hierarchical CRFs (Liao et al., 2007; Kumar and Hebert, 2005)) and conditional learning of grammars (e.g. see (Miyao and Tsujii, 2002; Clark and Curran, 2003)). The main problem of the DCRFs is that they are not scalable due to inference intractability. The hierarchical CRFs, on the other hand, are tractable but assume fixed tree structures, and are therefore not flexible enough to adapt to complex data. For example, in the noun-phrase chunking problem no prior tree structures are known. Rather, if such a structure exists, it can only be discovered after the model has been successfully built and learned.
The conditional probabilistic context-free grammar (CPCFG) appears to address both the tractability and dynamic structure issues: in CPCFGs, parsing a sentence takes time cubic in the sequence length. However, first, the context-free grammar does not limit the depth of the semantic hierarchy, making it unnecessarily difficult to map many hierarchical problems into its form. Second, it lacks a graphical model representation and thus does not enjoy the rich set of approximate inference techniques available for graphical models.
2.2 Hierarchical Hidden Markov Models
Hierarchical HMMs are generalisations of HMMs (Rabiner, 1989) in that a state in an HHMM may itself be a sub-HHMM. Thus, an HHMM is a nested Markov chain. During the model's temporal evolution, when a child Markov chain terminates, it returns control to its parent. Nothing from the terminated child chain is carried forward, so the parent state abstracts out everything belonging to it. Upon receiving the returned control, the parent then either transits to a new parent state (given that the grandparent has not finished), or terminates.
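The control flow described above can be sketched as a toy sampler. The state hierarchy, transition scheme and termination probabilities below are invented for illustration and are much simpler than a full HHMM parameterisation:

```python
import random

# Toy two-level HHMM sampler illustrating the control flow described above:
# a parent state runs a child Markov chain; when the child chain reaches its
# ending state, control returns to the parent, which then transits or ends.
# All states and probabilities here are made up for illustration.

CHILDREN = {"P1": ["a", "b", "c"], "P2": ["d", "e", "f", "g"]}
END_PROB = 0.3        # chance a child chain terminates at each step
PARENT_END = 0.5      # chance the parent chain terminates after a child ends

def run_child_chain(parent, rng):
    # Child chain: walk uniformly among the parent's children until ending.
    symbols, state = [], rng.choice(CHILDREN[parent])
    while True:
        symbols.append(state)          # each child state emits its symbol
        if rng.random() < END_PROB:    # enter the (implicit) ending state,
            return symbols             # returning control to the parent
        state = rng.choice(CHILDREN[parent])

def sample(rng):
    out, parent = [], "P1"
    while True:
        out.extend(run_child_chain(parent, rng))
        if rng.random() < PARENT_END:  # parent chain terminates
            return out
        parent = "P2" if parent == "P1" else "P1"  # parent transition

seq = sample(random.Random(0))
```

Note how nothing about the internal trajectory of a finished child chain is passed back to the parent, only the fact that it has ended.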
Figure 1 illustrates the state transition diagram of a two-level HHMM. At the top level there are two parent states; one has three children and the other has four. At the top level the transitions are between the two parent states, as in a normal directed Markov chain. Under each parent there are also transitions between child states, which depend only on the direct parent. There are special ending states (represented as shaded nodes in Figure 1) to signify the termination of the Markov chains. At each time step of the child Markov chain, a child will emit an observational symbol (not shown here).
The temporal evolution of the HHMM can be represented as a dynamic Bayesian network, which was first done in (Murphy and Paskin, 2002). Figure 2 depicts a DBN structure of 3 levels. The bottom level is often referred to as the production level. Associated with each state is an ending indicator to signify the termination of the state. Denote by $x^d_t$ and $e^d_t$ the state and ending indicator at level $d$ and time $t$, respectively. When $e^d_t = 0$, the state continues, i.e. $x^d_{t+1} = x^d_t$; and when $e^d_t = 1$, the state transits to a new state or back to itself. There are hierarchical consistency rules that must be ensured. Whenever a state persists (i.e. $e^d_t = 0$), all of the states above it must also persist (i.e. $e^{d'}_t = 0$ for all $d' < d$). Similarly, whenever a state ends (i.e. $e^d_t = 1$), all of the states below it must also end (i.e. $e^{d'}_t = 1$ for all $d' > d$). Inference and learning in HHMMs follow the Inside-Outside algorithm of probabilistic context-free grammars. Overall, the algorithm has time complexity cubic in the model length, with additional dependence on the depth of the model and the maximum size of the state space at each level.
When represented as a DBN, the whole stack of states can be collapsed into a 'mega-state' of a big HMM, and therefore inference can be carried out in time linear in the sequence length, at a cost exponential in the depth. This is efficient for a shallow model (i.e. the depth is small), but problematic for a deep model (i.e. the depth is large).
2.3 Conditional Random Fields
Denote by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ the graph, where $\mathcal{V}$ is the set of vertices and $\mathcal{E}$ is the set of edges. Associated with each vertex $v \in \mathcal{V}$ is a state variable $x_v$. Let $x$ be the joint state variable, i.e. $x = (x_v)_{v \in \mathcal{V}}$. Conditional random fields (CRFs) (Lafferty et al., 2001) define the conditional distribution $p(x|z)$ given the observation $z$ as follows
$p(x|z) = \frac{1}{Z(z)} \prod_c \phi_c(x_c, z)$   (1)
where $c$ indexes the cliques in the graph, $\phi_c(x_c, z)$ is a non-negative potential function defined over the clique $c$, and $Z(z) = \sum_x \prod_c \phi_c(x_c, z)$ is the partition function.
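For intuition, the definition above can be checked by brute force on a tiny chain-structured graph. The potentials below are arbitrary toy values chosen only for illustration:

```python
import itertools, math

# Brute-force illustration of Equation 1 on a tiny chain: p(x|z) is a product
# of pairwise clique potentials, normalised by the partition function Z.
# (The observation z is implicit here; the potentials are fixed toy numbers.)

STATES = [0, 1]
N = 4  # chain length

def potential(c, x_left, x_right):
    # phi_c for edge clique c: favour agreeing neighbours (a toy choice).
    return 2.0 if x_left == x_right else 1.0

def unnorm(x):
    # Product of clique potentials for a full assignment x.
    return math.prod(potential(c, x[c], x[c + 1]) for c in range(N - 1))

# Partition function: sum of unnormalised scores over all assignments.
Z = sum(unnorm(x) for x in itertools.product(STATES, repeat=N))
probs = {x: unnorm(x) / Z for x in itertools.product(STATES, repeat=N)}
```

Enumeration is exponential in the number of vertices; practical CRF implementations compute $Z$ and marginals by dynamic programming (belief propagation) instead.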
Let $\mathcal{D}$ be the set of observed state variables with the empirical distribution $\tilde{p}$, and let $\theta$ be the parameter vector. Learning in CRFs is typically performed by maximising the (log-)likelihood
$\mathcal{L}(\theta) = \sum_{(x,z)} \tilde{p}(x,z) \log p(x|z; \theta)$   (2)
The gradient of the log-likelihood can be computed as
$\nabla\mathcal{L} = \sum_c \Big[ \sum_{(x,z)} \tilde{p}(x,z) \nabla \log \phi_c(x_c, z) - \sum_z \tilde{p}(z) \sum_{x_c} p(x_c|z) \nabla \log \phi_c(x_c, z) \Big]$   (3)
Thus, the inference needed in CRF parameter estimation is the computation of the clique marginals $p(x_c|z)$.
Typically, CRFs are parameterised using log-linear models (also known as the exponential family, Gibbs distributions or maximum-entropy models), i.e. $\phi_c(x_c, z) = \exp(\theta^\top f_c(x_c, z))$, where $f_c(x_c, z)$ is the feature vector and $\theta$ is the vector of feature weights. The features are also known as sufficient statistics in the exponential family setting. Let $F(x, z) = \sum_c f_c(x_c, z)$ be the global feature. Equation 3 can be written as follows
$\nabla\mathcal{L} = \sum_{(x,z)} \tilde{p}(x,z) F(x,z) - \sum_z \tilde{p}(z) \sum_x p(x|z) F(x,z)$   (4)
$= \mathbb{E}_{\tilde{p}}[F] - \mathbb{E}_{p}[F]$   (5)
Thus, gradient-based maximum likelihood learning in the log-linear setting boils down to estimating the feature expectations, also known as the expected sufficient statistics (ESS).
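The identity "gradient = empirical minus expected features" can be verified by enumeration on a small model. The feature set and the data below are invented for illustration:

```python
import itertools, math

# Brute-force check of Equations 4-5 for a log-linear chain model:
# gradient = empirical feature count minus model-expected feature count.
# (The observation is implicit; features and data are toy choices.)

STATES = [0, 1]
N = 3

def feats(x):
    # Global feature F(x): [number of agreeing edges, number of 1-states].
    return [sum(x[i] == x[i + 1] for i in range(N - 1)), sum(x)]

def log_unnorm(x, theta):
    return sum(t * f for t, f in zip(theta, feats(x)))

def gradient(data, theta):
    space = list(itertools.product(STATES, repeat=N))
    Z = sum(math.exp(log_unnorm(x, theta)) for x in space)
    model_exp = [0.0, 0.0]
    for x in space:
        p = math.exp(log_unnorm(x, theta)) / Z
        for k, f in enumerate(feats(x)):
            model_exp[k] += p * f                 # E_p[F]
    emp = [sum(feats(x)[k] for x in data) / len(data) for k in range(2)]
    return [e - m for e, m in zip(emp, model_exp)]  # E_emp[F] - E_p[F]

grad = gradient([(1, 1, 1), (0, 0, 1)], [0.0, 0.0])
```

At $\theta = 0$ the model is uniform, so the gradient is simply the empirical feature average minus the uniform expectation.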
The probabilistic nature of CRFs allows incorporating hidden variables in a disciplined manner. Let $x = (v, h)$, where $v$ is the set of visible variables and $h$ is the set of hidden variables. The incomplete log-likelihood is given as
$\mathcal{L} = \sum_{(v,z)} \tilde{p}(v,z) \log p(v|z)$   (6)
where $p(v|z) = \sum_h p(v, h|z)$. The gradient reads
$\nabla\mathcal{L} = \mathbb{E}_{\tilde{p}(v,z)\,p(h|v,z)}[F] - \mathbb{E}_{\tilde{p}(z)\,p(v,h|z)}[F]$   (7)
There have been extensions to CRFs, which can be broadly grouped into two categories. The first category involves generalisations of the model representation, for example by extending CRFs to complex temporal structures as in dynamic CRFs (DCRFs) (Sutton et al., 2007), to segmental sequences as in Semi-Markov CRFs (SemiCRFs) (Sarawagi and Cohen, 2004), and to relational data (Taskar et al., 2002). The second category investigates learning schemes other than maximum likelihood, for example the perceptron (Collins, 2002) and the SVM (Taskar et al., 2004).
DCRFs and SemiCRFs are the most closely related to our HSCRFs. DCRFs are basically the conditional, undirected version of dynamic Bayesian networks (Murphy, 2002). The DCRFs introduce multiple levels of semantics, which help to represent more complex sequential data. The main drawback of the DCRFs is the intractability of inference, except for shallow models with small state spaces.
Similarly, the SemiCRFs are the conditional, undirected version of the Semi-Markov HMMs. These allow non-Markovian processes to be embedded in chain CRFs, thus making it possible to model process duration. Appendix C analyses the SemiCRFs in more detail.
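The segment-level dynamic programming that makes SemiCRFs tractable can be sketched as follows. The potential function and the maximum segment length are toy choices, not the paper's parameterisation:

```python
# Sketch of the SemiCRF forward recursion: the partition function sums over
# segmentations of the sequence, where each segment carries a label and a
# (possibly length-dependent) potential. psi below is a stand-in potential.

T, L, LABELS = 5, 3, ["NP", "O"]

def psi(y, start, end):
    # Segment potential for label y covering positions [start, end).
    # A toy choice: longer NP segments get slightly higher weight.
    return 1.0 + (0.1 * (end - start) if y == "NP" else 0.0)

def partition(T):
    alpha = [0.0] * (T + 1)   # alpha[t] sums over segmentations of [0, t)
    alpha[0] = 1.0
    for t in range(1, T + 1):
        # Sum over the label and length of the segment ending at t.
        for length in range(1, min(L, t) + 1):
            for y in LABELS:
                alpha[t] += alpha[t - length] * psi(y, t - length, t)
    return alpha[T]

Z = partition(T)
```

Each step sums over the label and the length of the last segment, so the recursion costs O(T L |Y|) instead of enumerating all labelled segmentations explicitly.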
Our HSCRFs deal with the inference problem of DCRFs by restricting attention to recursive processes, thus obtaining efficient inference via dynamic programming in the Inside-Outside family of algorithms. Furthermore, they generalise the SemiCRFs to model multiple levels of semantics. They also address partial labels by introducing appropriate constraints into the Inside-Outside algorithms.
3 Model Definition of HSCRF
Consider a hierarchically nested Markov process with $D$ levels. Then, as in the HHMMs (see Section 2.2), the parent state embeds a child Markov chain whose states may in turn contain child Markov chains. The family relation is defined in the model topology, which is a state hierarchy of depth $D$. The model has a set of states at each level $d \in \{1, ..., D\}$. For each state at level $d < D$, the topological structure also defines a set of children at level $d+1$; conversely, each child has a set of parents. Unlike the original HHMMs, where the child states belong exclusively to the parent, the HSCRFs allow arbitrary sharing of children between parents, as illustrated in Figure 3. This helps to avoid an explosive number of sub-states when $D$ is large, leading to fewer parameters and possibly less training data and time. The shared topology has been investigated in the context of HHMMs in (Bui et al., 2004).
The temporal evolution in the nested Markov processes with sequence length $T$ operates as follows:

As soon as a state is created at level $d < D$, it initialises a child state at level $d+1$. The initialisation continues downward until reaching the bottom level.

As soon as a child process at level $d+1$ ends, it returns control to its parent at level $d$, and in the case of $d > 1$, the parent either transits to a new parent state or ends, returning control to the grandparent at level $d-1$.
The main requirement for the hierarchical nesting is that the lifespan of the child process belongs exclusively to the lifespan of the parent. For example, suppose a parent process at level $d$ starts a new state at time $i$ and persists until time $j$. At time $i$ the parent initialises a child state, which continues until it ends at some intermediate time, at which point it transits to a new child state. The child process exits at time $j$, at which point control is returned to the parent. Upon receiving control, the parent state may transit to a new parent state, or end at $j$, returning control to the grandparent at level $d-1$.
We are now in a position to specify the nested Markov processes in a more formal way. Let us introduce a multi-level temporal graphical model of length $T$ with $D$ levels, starting from the top as 1 and the bottom as $D$ (Figure 4). At each level $d \in \{1, ..., D\}$ and time index $t \in \{1, ..., T\}$, there is a node representing a state variable $x^d_t$. Associated with each $x^d_t$ is an ending indicator $e^d_t$, which can be either 1 or 0 to signify whether the state ends or persists at $t$. The nesting nature of the HSCRFs is now realised by imposing specific constraints on the value assignment of the ending indicators (Figure 5).
The top state persists during the course of evolution, i.e. $e^1_{1:T-1} = 0$.
When a state finishes, all of its descendants must also finish,
i.e. $e^d_t = 1$ implies $e^{d'}_t = 1$ for all $d' > d$.
When a state persists, all of its ancestors must also persist,
i.e. $e^d_t = 0$ implies $e^{d'}_t = 0$ for all $d' < d$.
When a state transits, its parent must remain unchanged, i.e. $e^d_t = 1$, $e^{d-1}_t = 0$.
The bottom states do not persist, i.e. $e^D_t = 1$ for all $t$.
All states end at time $T$, i.e. $e^d_T = 1$ for all $d$.
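These constraints are purely mechanical and can be checked directly on a candidate assignment of ending indicators. The sketch below uses our own 0-based indexing and covers only the indicator constraints (the parent-unchanged rule also involves the state variables themselves, which are omitted here):

```python
# A mechanical check of the hierarchical constraints listed above, assuming a
# D x T binary matrix e where e[d][t] = 1 means the state at level d (0 = top,
# D-1 = bottom) ends at time t. Indexing conventions are ours, for sketching.

def is_valid(e):
    D, T = len(e), len(e[0])
    for t in range(T):
        # The top state persists until the final time step.
        if e[0][t] == 1 and t < T - 1:
            return False
        # Ending propagates downward (equivalently, persistence propagates
        # upward): a level may not end while the level below it persists.
        for d in range(D - 1):
            if e[d][t] == 1 and e[d + 1][t] == 0:
                return False
        # The bottom states never persist.
        if e[D - 1][t] != 1:
            return False
    # All states end at the last time step.
    return all(e[d][T - 1] == 1 for d in range(D))

# A valid 3-level assignment over T = 4: the middle level has two segments.
e_ok = [
    [0, 0, 0, 1],  # top
    [0, 1, 0, 1],  # middle
    [1, 1, 1, 1],  # bottom
]
```

Only assignments passing such a check correspond to well-formed nested segmentations, which is exactly the set the partition function of the HSCRF sums over.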
Thus, specific value assignments of the ending indicators provide contexts that realise the evolution of the model states in both hierarchical (vertical) and temporal (horizontal) directions. Each context at a given level, together with its associated state variables, forms a contextual clique, and we identify four contextual clique types:

State-persistence: This corresponds to the lifetime of a state at a given level (see Figure 6). Specifically, a context in which a state at level $d$ starts at time $i$ and ends at time $j$ (i.e. $e^d_{i-1} = 1$, $e^d_{i:j-1} = 0$ and $e^d_j = 1$), together with the state variables $x^d_{i:j}$, forms a contextual clique that specifies the lifespan $[i, j]$ of that state.

State-transition: This corresponds to a state at level $d$ at time $t$ transiting to a new state (see Figure 7a). Specifically, given a context in which the child ends while the parent persists (i.e. $e^{d-1}_t = 0$ and $e^d_t = 1$), the variables $(x^{d-1}_{t+1}, x^d_t, x^d_{t+1})$ together with this context form a contextual clique that specifies the transition of $x^d_t$ to $x^d_{t+1}$ at time $t$ under the same parent $x^{d-1}_{t+1}$.

State-initialisation: This corresponds to a state at level $d$ initialising a new child state at level $d+1$ at time $t$ (see Figure 7b). Specifically, given a context in which a new child chain starts at $t$, the variables $(x^d_t, x^{d+1}_t)$ together with this context form a contextual clique that specifies the initialisation at time $t$ from the parent $x^d_t$ to the child $x^{d+1}_t$.

State-ending: This corresponds to a state at level $d$ ending at time $t$ (see Figure 7c). Specifically, given a context in which the state ends at $t$ (i.e. $e^d_t = 1$), the variables $(x^d_t, x^{d+1}_t)$ together with this context form a contextual clique that specifies the ending of $x^d_t$ at time $t$ with the last child $x^{d+1}_t$.
In the HSCRF we are interested in the conditional setting, in which the entire set of state variables and ending indicators is conditioned on an observation sequence $z$. For example, in computational linguistics, the observation is often the sequence of words and the state variables might be the part-of-speech tags and the phrases.
To capture the correlation between these variables and such conditioning, we define a non-negative potential function over each contextual clique. Figure 8 shows the notation for the potentials corresponding to the four contextual clique types identified above. Details of the potential specification are described in Section 6.1.
Let $\zeta$ denote the set of all state variables and ending indicators that satisfies the set of hierarchical constraints in Figure 5. Let $\tau^d$ denote the ordered set of all ending time indices at level $d$, i.e. $t \in \tau^d$ if and only if $e^d_t = 1$. The joint potential $\Phi[\zeta, z]$ defined for each configuration $\zeta$ is the product of all contextual clique potentials over all ending time indices $t \in \tau^d$ and all semantic levels $d$:
(8)  
The conditional distribution is given as
$p(\zeta|z) = \frac{1}{Z(z)} \Phi[\zeta, z]$   (9)
where $\Phi[\zeta, z]$ is the joint potential of Equation 8 and $Z(z) = \sum_\zeta \Phi[\zeta, z]$ is the partition function for normalisation.
In what follows we omit $z$ for clarity, and implicitly use it as part of the partition function $Z$ and the potential $\Phi$. It should be noted that in the unconditional formulation there is only a single $Z$ for all data instances; in the conditional setting there is a $Z(z)$ for each data instance $z$.
Remarks: The temporal model of HSCRFs presented here is not a standard graphical model (Lauritzen, 1996), since the connectivity (and therefore the clique structures) is not fixed. The potentials are defined on-the-fly depending on the context of assignments of the ending indicators. Although the model topology is identical to that of the shared-structure HHMMs (Bui et al., 2004), the unrolled temporal representation is an undirected graph and the model distribution is formulated in a discriminative way. Furthermore, the state-persistence potentials capture duration information that is not available in the DBN representation of the HHMMs in (Murphy and Paskin, 2002).
The way the potentials are introduced may at first appear to resemble the clique templates in discriminative relational Markov networks (RMNs) (Taskar et al., 2002). It is, however, different, because the cliques in the HSCRFs are dynamic and context-dependent.
4 Asymmetric InsideOutside Algorithm
This section describes the core inference engine, called the Asymmetric Inside-Outside (AIO) algorithm, which is partly adapted from the generative, directed counterpart for HHMMs in (Bui et al., 2004). We now show how to compute the building blocks that are needed in most inference and learning tasks.
4.1 Building Blocks and Conditional Independence
4.1.1 Contextual Markov blankets
In this subsection we define elements that are building blocks for inference and learning. These building blocks are identified given the corresponding boundaries. Let us introduce two types of boundaries: the contextual symmetric and asymmetric Markov blankets.
Definition 1.
A symmetric Markov blanket at level $d$ for a state starting at time $i$ and ending at time $j$ is the following set
(10) 
Definition 2.
Let be a symmetric Markov blanket, we define and as follows
(11)  
(12) 
subject to . Further, we define
(13)  
(14) 
Figure 9a shows an example of a symmetric Markov blanket (represented by a doublearrowed line).
Definition 3.
An asymmetric Markov blanket at level $d$ for a parent state starting at time $i$ and a child state ending at time $j$ is the following set
(15) 
Definition 4.
Let be an asymmetric Markov blanket, we define and as follows
(16)  
(17) 
subject to and . Further, we define
(18)  
(19) 
Figure 9b shows an example of an asymmetric Markov blanket (represented by an arrowed line).
Remark: The concepts of contextual Markov blankets (or Markov blankets for short) are different from those in traditional Markov random fields and Bayesian networks because they are specific assignments of a subset of variables, rather than a collection of variables.
4.1.2 Conditional independence
Given these two definitions we have the following propositions of conditional independence.
Proposition 1.
and are conditionally independent given
(20) 
This proposition gives rise to the following factorisation
(21) 
Proposition 2.
and are conditionally independent given
(22) 
The following factorisation is a consequence of Proposition 2
(23)  
4.1.3 Symmetric Inside/Outside Masses
From Equation 12 we have . Since separates from , we can group local potentials in Equation 8 into three parts: , , and . By ‘grouping’ we mean to multiply all the local potentials belonging to a certain part, in the same way that we group all the local potentials belonging to the model in Equation 8. Note that although contains we do not group into . The same holds for .
By the definition of the state-persistence clique potential (Figure 8), Equation 8 can be replaced by
(24) 
There are two special cases: (1) when , for , and (2) when , for and . This factorisation plays an important role in efficient inference.
We now define a quantity called the symmetric inside mass, and another called the symmetric outside mass.
Definition 5.
Given a symmetric Markov blanket , the symmetric inside mass and the symmetric outside mass are defined as
(25)  
(26) 
As special cases we have and , and for , . For later use let us introduce the ‘full’ symmetric inside mass and the ‘full’ symmetric outside mass as
(27)  
(28) 
In the rest of the paper, when it is clear from the context, we will use inside mass as a shorthand for symmetric inside mass, outside mass for symmetric outside mass, full-inside mass for full-symmetric inside mass, and full-outside mass for full-symmetric outside mass.
Thus, from Equation 24 the partition function can be computed from the full-inside mass at the top level ($d = 1$)
(29)  
By a similar derivation, the partition function can also be computed from the full-outside mass at the bottom level ($d = D$)
(30) 
In fact, we will prove a more general way to compute the partition function in Appendix B
(31) 
for any level $d$ and any ending time $t \in \tau^d$. These relations are summarised in Figure 10.
Given the fact that the variables inside the symmetric Markov blanket are separated from the rest of the variables by the blanket, we have Proposition 3.
Proposition 3.
The following relations hold
(32)  
(33)  
(34) 
The proof of this proposition is given in Appendix A.2.
4.1.4 Asymmetric Inside/Outside Masses
Recall that we have introduced the concept of the asymmetric Markov blanket, which separates the variables inside it from those outside. Let us group all the local contextual clique potentials associated with the inside variables into one joint potential, and similarly group all local potentials associated with the outside variables into another joint potential. Note that the former includes the state-persistence potential.
Definition 6.
Given the asymmetric Markov blanket , the asymmetric inside mass and the asymmetric outside mass are defined as follows