Hierarchical Semi-Markov Conditional Random Fields for Recursive Sequential Data

by   Tran The Truyen, et al.
Curtin University
SRI International

Inspired by the hierarchical hidden Markov models (HHMM), we present the hierarchical semi-Markov conditional random field (HSCRF), a generalisation of embedded undirected Markov chains to model complex hierarchical, nested Markov processes. It is parameterised in a discriminative framework and has polynomial time algorithms for learning and inference. Importantly, we consider partially-supervised learning and propose algorithms for generalised partially-supervised learning and constrained inference. We demonstrate the HSCRF in two applications: (i) recognising human activities of daily living (ADLs) from indoor surveillance cameras, and (ii) noun-phrase chunking. We show that the HSCRF is capable of learning rich hierarchical models with reasonable accuracy in both fully and partially observed data cases.






1 Introduction

Modelling hierarchical aspects in complex stochastic processes is an important research issue in many application domains. In a hierarchy, each level is an abstraction of lower level details. Consider, for example, a frequent human activity such as ‘eat-breakfast’, which may include a series of more specific activities like ‘enter-kitchen’, ‘go-to-cupboard’, ‘take-cereal’, ‘wash-dishes’ and ‘leave-kitchen’. Each specific activity can be decomposed into finer details. Similarly, in natural language processing (NLP) syntax trees are inherently hierarchical. In a partial parsing task known as noun-phrase (NP) chunking (Sang and Buchholz, 2000), there are three semantic levels: the sentence, noun-phrases and part-of-speech (POS) tags. In this setting, the sentence is a sequence of NPs and non-NPs and each phrase is a sub-sequence of POS tags.

A popular approach to deal with hierarchical data is to build a cascaded model where each level is modelled separately, and the output of the lower level is used as the input of the level right above it (e.g. see (Oliver et al., 2004)). For instance, in NP chunking this approach first builds a POS tagger and then constructs a chunker that incorporates the output of the tagger. This approach is clearly sub-optimal because the POS tagger takes no information of the NPs and the chunker is not aware of the reasoning of the tagger. In contrast, a noun-phrase is often very informative to infer the POS tags belonging to the phrase. As a result, this layered approach often suffers from the so-called cascading error problem as the error introduced from the lower layer will propagate to higher levels.

A more holistic approach is to build a joint representation of all the levels. Formally, we are given a data observation and we need to model and infer the joint semantics of all the levels. The main problem is to choose an appropriate representation of these semantics so that inference can be efficient. In this paper, we are interested in a specific class of hierarchical models that supports both joint modelling and efficient inference. More specifically, the models of interest are recursive and sequential, in that each level is a sequence and each node in a sequence can be decomposed further into a sub-sequence of finer grain at the lower level.

There has been substantial investigation of these types of model, especially in the area of probabilistic context-free grammars (e.g. see Manning and Schütze, 1999, Chapter 11). However, grammars are often unbounded in depth and thus difficult to represent by graphical models. A more restricted version known as the hierarchical hidden Markov model (HHMM) (Fine et al., 1998) offers a clearer representation in that the depth is fixed and the semantic levels are well defined. Essentially, the HHMM is a nested hidden Markov model (HMM) in the sense that each state is a sub-HMM by itself.

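These models share a common property in that they are generative, i.e. they assume that the data observation is generated by the hierarchical semantics. The generative models try to construct the joint distribution of the semantics and the observation. However, there are some drawbacks associated with this approach. First, the generative process, i.e. the distribution of the observation given the semantics, is typically unknown and complicated. Second, given an observation, we are more often interested in inferring the semantics. Since the joint distribution factorises into the marginal distribution of the observation and the conditional distribution of the semantics given the observation, modelling the observation itself may be unnecessary.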

An attractive alternative is to model the conditional distribution of the semantics given the observation directly, avoiding the modelling of the observation itself. This line of research has recently attracted much interest, largely triggered by the introduction of the conditional random field (CRF) (Lafferty et al., 2001). The advantages of the CRF are largely attributed to its discriminative nature, which allows arbitrary and long-range interdependent features.

In this paper we follow the HMM/HHMM path to generalise from chain-structured CRFs to nested CRFs. As a result, we construct a novel model called Hierarchical Semi-Markov Conditional Random Field (HSCRF), which is an undirected conditional graphical model of nested Markov chains. Thus HSCRF is the combination of the discriminative nature of CRFs and the nested modelling of the HHMM.

To be more concrete let us return to the NP chunking example. The problem can be modelled as a three-level HSCRF, where the root represents the sentence, the second level the NP process, and the bottom level the POS process. The root and the two processes are conditioned on the sequence of words in the sentence. Under the discriminative modelling of the HSCRF, rich contextual information such as starting and ending of the phrase, the phrase length, and the distribution of words falling inside the phrase can be effectively encoded. On the other hand, such encoding is much more difficult for HHMMs.

We then address several issues in turn. First, we show how to represent HSCRFs using a dynamic graphical model (e.g. see (Lauritzen, 1996)) which effectively encodes hierarchical and temporal semantics. For parameter learning, an efficient algorithm based on the Asymmetric Inside-Outside algorithm of Bui et al. (2004) is introduced. For inference, we generalise the Viterbi algorithm to decode the semantics from an observational sequence.

The common assumptions in discriminative learning and inference are that the training data in learning is fully labelled, and the test data during inference is not labelled. We propose to relax these assumptions in that training labels may only be partially available, and we term the learning as partial-supervision. Likewise, when some labels are given during inference, the algorithm should automatically adjust to meet the new constraints.

We demonstrate the effectiveness of HSCRFs in two applications: (i) segmenting and labelling activities of daily living (ADLs) in an indoor environment and (ii) jointly modelling noun-phrases and parts of speech in shallow parsing. Our experimental results in the first application show that the HSCRFs are capable of learning rich, hierarchical activities with good accuracy and exhibit better performance when compared to DCRFs and flat-CRFs. Results for the partially observable case also demonstrate that a significant reduction of training labels still results in models that perform reasonably well. We also show that observing a small number of labels can significantly increase the accuracy during decoding. In shallow parsing, the HSCRFs can achieve higher accuracy than standard CRF-based techniques and the recent DCRFs.

To summarise, in this paper we claim the following contributions:

  • Introducing a novel Hierarchical Semi-Markov Conditional Random Field (HSCRF) to model complex hierarchical and nested Markovian processes in a discriminative framework.

  • Developing an efficient generalised Asymmetric Inside-Outside (AIO) algorithm for fully-supervised learning.

  • Generalising the Viterbi algorithm for decoding the most probable semantic labels and structure given an observational sequence.

  • Addressing the problem of partially-supervised learning and constrained inference.

  • Demonstrating the applicability of the HSCRFs for modelling human activities in the domain of home video surveillance and for shallow parsing of English.

Notations and Organisation

This paper makes use of a number of mathematical notations which we include in Table 1 for reference.

Notation Description
Subset of state variables from level down to level and starting from time and ending at time , inclusive.
Subset of ending indicators from level down to level and starting from time and ending at time , inclusive.
Set of state variables and ending indicators of a sub model rooted at , level , spanning a sub-string
Contextual clique
Time indices
Set of all ending time indices, e.g. if then
State-persistence potential of state , level , spanning
Initialisation potential of state at level , time initialising sub-state
Transition at level , time from state to under the same parent
Ending potential of state at level and time , and receiving the return control from the child
The global potential of a particular configuration
The number of state symbols at level
The symmetric inside mass for a state at level , spanning a substring
The full symmetric inside mass for a state at level , spanning a substring
The symmetric outside mass for a state at level , spanning a substring
The full symmetric outside mass for a state at level , spanning a substring
The asymmetric inside mass for a parent state at level , starting at and having a child-state which returns control to parent or transits to new child-state at
The asymmetric outside mass, as a counterpart of asymmetric inside mass
Potential functions.
Table 1: Notations used in this paper.

The rest of the paper is organised as follows. Section 2 reviews related work, including Conditional Random Fields. Section 3 continues with the HSCRF model definition and parameterisation. Section 4 defines building blocks required for common inference tasks. These blocks are computed in Sections 4.2 and 4.3. Section 5 presents the generalised Viterbi algorithm. Parameter estimation follows in Section 6. Learning and inference with partially available labels are addressed in Section 7. Section 8 presents a method for numerical scaling to prevent numerical overflow. Section 9 documents experimental results. Section 10 concludes the paper.

2 Related Work

2.1 Hierarchical Modelling of Stochastic Processes

Hierarchical modelling of stochastic processes can be largely categorised as either graphical models extending the flat hidden Markov models (HMM) (e.g., the layered HMM (Oliver et al., 2004), the abstract HMM (Bui et al., 2002), hierarchical HMM (HHMM) (Fine et al., 1998; Bui et al., 2004), DBN (Murphy, 2002)) or grammar-based models (e.g., PCFG (Pereira and Schabes, 1992)). These models are all generative.

Recent developments in discriminative, hierarchical structures include extensions of the flat CRFs (e.g. dynamic CRFs (DCRF) (Sutton et al., 2007), hierarchical CRFs (Liao et al., 2007; Kumar and Hebert, 2005)) and conditional learning of the grammars (e.g. see (Miyao and Tsujii, 2002; Clark and Curran, 2003)). The main problem of the DCRFs is that they are not scalable due to inference intractability. The hierarchical CRFs, on the other hand, are tractable but assume fixed tree structures, and therefore are not flexible enough to adapt to complex data. For example, in the noun-phrase chunking problem no prior tree structures are known. Rather, if such a structure exists, it can only be discovered after the model has been successfully built and learned.

The conditional probabilistic context-free grammar (C-PCFG) appears to address both tractability and dynamic structure issues. More precisely, in C-PCFGs it takes cubic time in sequence length to parse a sentence. However, the context-free grammar does not limit the depth of semantic hierarchy, thus making it unnecessarily difficult to map many hierarchical problems into its form. Secondly, it lacks a graphical model representation and thus does not enjoy the rich set of approximate inference techniques available in graphical models.

2.2 Hierarchical Hidden Markov Models

Hierarchical HMMs are generalisations of HMMs (Rabiner, 1989) in that a state in an HHMM may be a sub-HHMM. Thus, an HHMM is a nested Markov chain. In the temporal evolution of the model, when a child Markov chain terminates, it returns control to its parent. Nothing from the terminated child chain is carried forward. Thus, the parent state abstracts out everything belonging to it. Upon receiving the returned control, the parent either transits to a new parent state (given that the grandparent has not finished) or terminates.

Figure 1 illustrates the state transition diagram of a two-level HHMM. At the top level there are two parent states . The parent has three children, i.e. and has four, i.e. . At the top level the transitions are between and , as in a normal directed Markov chain. Under each parent there are also transitions between child states, which only depend on the direct parent (either or ). There are special ending states (represented as shaded nodes in Figure 1) to signify the termination of the Markov chains. At each time step of the child Markov chain, a child will emit an observational symbol (not shown here).

Figure 1: The state transition diagram of an HHMM.

The temporal evolution of the HHMM can be represented as a dynamic Bayesian network (DBN), which was first done in (Murphy and Paskin, 2002). Figure 2 depicts a DBN structure of 3 levels. The bottom level is often referred to as the production level. Associated with each state is an ending indicator to signify the termination of the state. Denote by $x^d_t$ and $e^d_t$ the state and ending indicator at level $d$ and time $t$, respectively. When $e^d_t = 0$, the state continues, i.e. $x^d_t = x^d_{t+1}$. And when $e^d_t = 1$, the state transits to a new state, or transits to itself. There are hierarchical consistency rules that must be ensured. Whenever a state persists (i.e. $e^d_t = 0$), all of the states above it must also persist (i.e. $e^{d'}_t = 0$ for all $d' < d$). Similarly, whenever a state ends (i.e. $e^d_t = 1$), all of the states below it must also end (i.e. $e^{d'}_t = 1$ for all $d' > d$).

Inference and learning in HHMMs follow the Inside-Outside algorithm of the probabilistic context-free grammars. Overall, the algorithm has time complexity where is the maximum size of the state space at each level, is the depth of the model and is the model length.

When representing as a DBN, the whole stack of states can be collapsed into a ‘mega-state’ of a big HMM, and therefore inference can be carried out in time. This is efficient for a shallow model (i.e. is small), but problematic for a deep model (i.e. is large).
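The growth of the collapsed state space can be illustrated with a small sketch. It assumes that every level has the same number of states and ignores the ending indicators; both are simplifying assumptions made for illustration, and the function name is hypothetical.

```python
def mega_state_counts(num_states_per_level, depth):
    """Upper bounds for the flattened chain.

    Returns (number of mega-states, number of mega-state transition entries),
    assuming every level has `num_states_per_level` states and ignoring
    ending indicators.
    """
    n_mega = num_states_per_level ** depth
    return n_mega, n_mega ** 2


# With 5 states per level the transition table grows very quickly with depth.
for depth in (2, 4, 8):
    print(depth, mega_state_counts(5, depth))
```

The count is exponential in the depth, which is why the flattened representation suits shallow models only.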

Figure 2: Dynamic Bayesian network representation of HHMMs.

2.3 Conditional Random Fields

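Denote by $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ the graph where $\mathcal{V}$ is the set of vertices, and $\mathcal{E}$ is the set of edges. Associated with each vertex $i$ is a state variable $x_i$. Let $x$ be the joint state variable, i.e. $x = (x_i)_{i \in \mathcal{V}}$. Conditional random fields (CRFs) (Lafferty et al., 2001) define a conditional distribution given the observation $z$ as follows

$$\Pr(x \mid z) = \frac{1}{Z(z)} \prod_{c} \phi_c(x_c, z) \tag{1}$$

where $c$ is the index of cliques in the graph, $\phi_c(x_c, z)$ is a non-negative potential function defined over the clique $c$, and $Z(z) = \sum_{x} \prod_{c} \phi_c(x_c, z)$ is the partition function.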
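As a concrete illustration of this definition, the following minimal sketch enumerates every labelling of a tiny linear-chain CRF, multiplies the clique potentials and normalises by the partition function. The potential tables are made-up numbers, the dependence on the observation is assumed to be folded into the node potentials, and brute-force enumeration is used only because the example is tiny.

```python
import itertools


def chain_crf_distribution(node_pot, edge_pot):
    """Return {labelling: probability} for a tiny linear-chain CRF.

    node_pot[t][s]  : non-negative potential of label s at position t
                      (the observation is assumed folded into these numbers).
    edge_pot[s][s2] : non-negative potential of the adjacent label pair (s, s2).
    """
    T = len(node_pot)
    S = len(node_pot[0])
    scores = {}
    for x in itertools.product(range(S), repeat=T):
        score = 1.0
        for t in range(T):          # node cliques
            score *= node_pot[t][x[t]]
        for t in range(T - 1):      # edge cliques
            score *= edge_pot[x[t]][x[t + 1]]
        scores[x] = score
    Z = sum(scores.values())        # partition function
    return {x: s / Z for x, s in scores.items()}


# Hypothetical potentials for a chain of length 3 with 2 labels.
node_pot = [[1.0, 2.0], [3.0, 1.0], [1.0, 1.0]]
edge_pot = [[2.0, 1.0], [1.0, 2.0]]
dist = chain_crf_distribution(node_pot, edge_pot)
```

Enumeration costs time exponential in the chain length, so it serves only as a reference against which dynamic-programming inference can be checked.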

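Let $\mathcal{D}$ be the set of observed state variables with the empirical distribution $\tilde{\Pr}(x)$, and $w$ be the parameter vector. Learning in CRFs is typically by maximising the (log) likelihood

$$w^* = \arg\max_w \mathcal{L}(w) = \arg\max_w \sum_{x \in \mathcal{D}} \tilde{\Pr}(x) \log \Pr(x \mid z; w) \tag{2}$$

The gradient of the log-likelihood can be computed as

$$\nabla \mathcal{L}(w) = \sum_{x \in \mathcal{D}} \tilde{\Pr}(x) \sum_{c} \Big( \nabla \log \phi_c(x_c, z) - \sum_{x'_c} \Pr(x'_c \mid z) \nabla \log \phi_c(x'_c, z) \Big) \tag{3}$$

where $\phi_c(x_c, z)$ is the potential of clique $c$ and $z$ is the observation. Thus, the inference needed in CRF parameter estimation is the computation of clique marginals $\Pr(x_c \mid z)$.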

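Typically, CRFs are parameterised using log-linear models (also known as exponential family, Gibbs distribution or Maximum Entropy model), i.e. $\phi_c(x_c, z) = \exp(w^\top f(x_c, z))$, where $f(\cdot)$ is the feature vector and $w$ is the vector of feature weights. The features are also known as sufficient statistics in the exponential family setting. Let $F(x, z) = \sum_c f(x_c, z)$ be the global feature. Equation 3 can be written as follows

$$\nabla \mathcal{L}(w) = \sum_{x \in \mathcal{D}} \tilde{\Pr}(x) \Big( F(x, z) - \sum_{x'} \Pr(x' \mid z) F(x', z) \Big)$$

where $\tilde{\Pr}(x)$ is the empirical distribution over the training set $\mathcal{D}$ and $\Pr(x' \mid z)$ is the model distribution. Thus gradient-based maximum likelihood learning in the log-linear setting boils down to estimating the feature expectations, also known as expected sufficient statistics (ESS).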
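The statement that the gradient reduces to observed minus expected features can be checked numerically on a toy model. The sketch below is an illustration under stated assumptions: the two features, the weights and the data are hypothetical, there is a single training instance, and expectations are computed by enumeration rather than by an efficient inference routine.

```python
import itertools
import math


def global_features(x, z):
    """Toy global feature vector F(x, z); both features are hypothetical."""
    agree = sum(1.0 for xt, zt in zip(x, z) if xt == zt)    # label equals observation
    smooth = sum(1.0 for a, b in zip(x, x[1:]) if a == b)   # neighbouring labels equal
    return [agree, smooth]


def score(w, x, z):
    """Log of the unnormalised potential of labelling x."""
    return sum(wk * fk for wk, fk in zip(w, global_features(x, z)))


def log_likelihood(w, x_obs, z, n_labels=2):
    configs = itertools.product(range(n_labels), repeat=len(z))
    log_Z = math.log(sum(math.exp(score(w, x, z)) for x in configs))
    return score(w, x_obs, z) - log_Z


def gradient(w, x_obs, z, n_labels=2):
    """Observed global feature minus its expectation under Pr(x | z)."""
    configs = list(itertools.product(range(n_labels), repeat=len(z)))
    weights = [math.exp(score(w, x, z)) for x in configs]
    Z = sum(weights)
    expected = [0.0] * len(w)
    for x, wt in zip(configs, weights):
        for k, fk in enumerate(global_features(x, z)):
            expected[k] += (wt / Z) * fk
    observed = global_features(x_obs, z)
    return [fo - fe for fo, fe in zip(observed, expected)]


w = [0.3, -0.2]          # hypothetical feature weights
z = (0, 1, 1)            # hypothetical observation sequence
x_obs = (0, 1, 0)        # hypothetical training labels
grad = gradient(w, x_obs, z)
```

Comparing `grad` against central finite differences of `log_likelihood` is a convenient sanity check when implementing a more efficient expectation routine.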

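The probabilistic nature of CRFs allows incorporating hidden variables in a disciplined manner. Let $x = (\vartheta, h)$, where $\vartheta$ is the set of visible variables, and $h$ is the set of hidden variables. The incomplete log-likelihood and its gradient are given as

$$\mathcal{L}(w) = \sum_{\vartheta \in \mathcal{D}} \tilde{\Pr}(\vartheta) \log \Pr(\vartheta \mid z) = \sum_{\vartheta \in \mathcal{D}} \tilde{\Pr}(\vartheta) \big( \log Z(\vartheta, z) - \log Z(z) \big)$$

where $Z(\vartheta, z) = \sum_{h} \prod_{c} \phi_c(\vartheta_c, h_c, z)$. In the log-linear setting with global feature $F$, the gradient reads

$$\nabla \mathcal{L}(w) = \sum_{\vartheta \in \mathcal{D}} \tilde{\Pr}(\vartheta) \Big( \sum_{h} \Pr(h \mid \vartheta, z) F(\vartheta, h, z) - \sum_{x} \Pr(x \mid z) F(x, z) \Big)$$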
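With hidden variables, the same computation is done twice: once with the visible labels clamped and once with nothing clamped. The sketch below illustrates this on a toy chain with binary labels; the features, weights and data are hypothetical, and sums are taken by enumeration purely for clarity.

```python
import itertools
import math


def global_features(x, z):
    """Toy global feature vector; both features are hypothetical."""
    agree = sum(1.0 for xt, zt in zip(x, z) if xt == zt)
    smooth = sum(1.0 for a, b in zip(x, x[1:]) if a == b)
    return [agree, smooth]


def clamped_sums(w, z, clamp):
    """Return (log Z, E[F]) over labellings consistent with `clamp`.

    clamp maps a position to its fixed (visible) label; all other
    positions are treated as hidden and summed out.
    """
    total = 0.0
    acc = [0.0, 0.0]
    for x in itertools.product(range(2), repeat=len(z)):
        if any(x[t] != s for t, s in clamp.items()):
            continue
        f = global_features(x, z)
        wt = math.exp(sum(wk * fk for wk, fk in zip(w, f)))
        total += wt
        acc = [a + wt * fk for a, fk in zip(acc, f)]
    return math.log(total), [a / total for a in acc]


def incomplete_log_likelihood(w, z, visible):
    log_Z_clamped, _ = clamped_sums(w, z, visible)
    log_Z_free, _ = clamped_sums(w, z, {})
    return log_Z_clamped - log_Z_free


def incomplete_gradient(w, z, visible):
    """Clamped feature expectation minus free feature expectation."""
    _, e_clamped = clamped_sums(w, z, visible)
    _, e_free = clamped_sums(w, z, {})
    return [a - b for a, b in zip(e_clamped, e_free)]


w = [0.4, 0.1]               # hypothetical weights
z = (1, 0, 0, 1)             # hypothetical observations
visible = {0: 1, 3: 0}       # positions 1 and 2 are hidden
grad = incomplete_gradient(w, z, visible)
```

When every position is clamped, the clamped expectation collapses to the observed feature vector and the fully supervised gradient is recovered.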


There have been extensions to CRFs, which can be broadly grouped into two categories. The first category involves generalisation of model representation, for example by extending CRFs for complex temporal structures as in Dynamic CRFs (DCRFs) (Sutton et al., 2007), segmental sequences as in Semi-Markov CRFs (SemiCRFs) (Sarawagi and Cohen, 2004), and relational data (Taskar et al., 2002). The second category investigates learning schemes other than maximum likelihood, for example perceptron (Collins, 2002) and SVM (Taskar et al., 2004).

DCRFs and SemiCRFs are the most closely related to our HSCRFs. DCRFs are basically the conditional, undirected version of the Dynamic Bayesian Networks (Murphy, 2002). The DCRFs introduce multiple levels of semantics, which help to represent more complex sequential data. The main drawback of the DCRFs is the intractability of inference, except for shallow models with small state spaces.

Similarly, the SemiCRFs are the conditional, undirected version of the semi-Markov HMMs. These allow non-Markovian processes to be embedded in the chain CRFs, thus making it possible to model process duration. Appendix C analyses the SemiCRFs in more detail.

Our HSCRFs deal with the inference problem of DCRFs by restricting the model to recursive processes, thus obtaining efficient inference via dynamic programming in the Inside-Outside family of algorithms. Furthermore, they generalise the SemiCRFs to model multiple levels of semantics. They also address partial labels by introducing appropriate constraints to the Inside-Outside algorithms.

3 Model Definition of HSCRF

Consider a hierarchically nested Markov process with levels. Then as in the HHMMs (see Section 2.2), the parent state embeds a child Markov chain whose states may in turn contain child Markov chains. The family relation is defined in the model topology, which is a state hierarchy of depth . The model has a set of states at each level , i.e. , where is the number of states at level . For each state where , the topological structure also defines a set of children . Conversely, each child has a set of parents . Unlike the original HHMMs where the child states belong exclusively to the parent, the HSCRFs allow arbitrary sharing of children between parents. For example, in Figure 3, , and . This helps to avoid an explosive number of sub-states when is large, leading to fewer parameters and possibly less training data and time. The shared topology has been investigated in the context of HHMMs in (Bui et al., 2004).

The temporal evolution in the nested Markov processes with sequence length of operates as follows:

  • As soon as a state is created at level , it initialises a child state at level . The initialisation continues downward until reaching the bottom level.

  • As soon as a child process at level ends, it returns control to its parent at level , and in the case of , the parent either transits to a new parent state or returns to the grand-parent at level .

The main requirement for the hierarchical nesting is that the life-span of the child process belongs exclusively to the life-span of the parent. For example, consider a parent process at level starts a new state at time and persists until time . At time the parent initialises a child state which continues until it ends at time , at which the child state transits to a new child state . The child process exits at time , at which the control from the child level is returned to the parent . Upon receiving the control the parent state may transit to a new parent state , or end at , returning the control to the grand-parent at level .

Figure 3: The shared topological structure.
Figure 4: The multi-level temporal model.

We are now in a position to specify the nested Markov processes in a more formal way. Let us introduce a multi-level temporal graphical model of length $T$ with $D$ levels, starting from the top as 1 and the bottom as $D$ (Figure 4). At each level $d \in [1, D]$ and time index $i \in [1, T]$, there is a node representing a state variable $x^d_i$. Associated with each $x^d_i$ is an ending indicator $e^d_i$ which can be either 1 or 0 to signify whether the state $x^d_i$ ends or persists at $i$. The nesting nature of the HSCRFs is now realised by imposing specific constraints on the value assignment of ending indicators (Figure 5).

   The top state persists during the course of evolution, i.e. $e^1_{1:T-1} = 0$.
   When a state finishes, all of its descendants must also finish,
    i.e. $e^d_i = 1$ implies $e^{d+1:D}_i = 1$.
   When a state persists, all of its ancestors must also persist,
    i.e. $e^d_i = 0$ implies $e^{1:d-1}_i = 0$.
   When a state transits, its parent must remain unchanged, i.e. $e^d_i = 1$, $e^{d-1}_i = 0$.
   The bottom states do not persist, i.e. $e^D_i = 1$ for all $i \in [1, T]$.
   All states end at $T$, i.e. $e^{1:D}_T = 1$.
Figure 5: Hierarchical constraints.
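The constraints on ending indicators can be checked mechanically. The sketch below is a minimal illustration that assumes the indicators are stored as a nested list with level 0 as the top and 0-based time, and that the model has at least two levels. It checks only the constraints that involve ending indicators alone; the rule about a parent remaining unchanged during a child transition concerns state variables and is not covered. The function name is hypothetical.

```python
def satisfies_hierarchical_constraints(e):
    """Check an ending-indicator assignment against the nesting constraints.

    e[d][t] is 1 if the state at level d (0 = top) ends at time t (0-based),
    and 0 if it persists. Assumes at least two levels.
    """
    D, T = len(e), len(e[0])
    for t in range(T):
        # The top state persists until the last time step.
        if t < T - 1 and e[0][t] != 0:
            return False
        # The bottom states never persist.
        if e[D - 1][t] != 1:
            return False
        # If a state ends, the level just below must end too. Applied to every
        # adjacent pair of levels this covers all descendants, and it is
        # equivalent to: if a state persists, all of its ancestors persist.
        for d in range(D - 1):
            if e[d][t] == 1 and e[d + 1][t] == 0:
                return False
    # All states end at the final time step.
    return all(e[d][T - 1] == 1 for d in range(D))


valid = [
    [0, 0, 0, 1],   # top level
    [0, 1, 0, 1],   # middle level
    [1, 1, 1, 1],   # bottom level
]
print(satisfies_hierarchical_constraints(valid))
```

Each column of a valid assignment is non-decreasing from top to bottom, which is a compact way to remember the second and third constraints.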

Thus, specific value assignments of ending indicators provide contexts that realise the evolution of the model states in both hierarchical (vertical) and temporal (horizontal) directions. Each context at a level and associated state variables form a contextual clique, and we identify four contextual clique types:

  • State-persistence : This corresponds to the life time of a state at a given level (see Figure 6). Specifically, given a context , then , is a contextual clique that specifies the life-span of any state .

  • State-transition : This corresponds to a state at level at time transiting to a new state (see Figure 7a). Specifically, given a context then is a contextual clique that specifies the transition of to at time under the same parent .

  • State-initialisation : This corresponds to a state at level initialising a new child state at level at time (see Figure 7b). Specifically, given a context , then is a contextual clique that specifies the initialisation at time from the parent to the child .

  • State-ending : This corresponds to a state at level to end at time (see Figure 7c). Specifically, given a context , then is a contextual clique that specifies the ending of at time with the last child .

Figure 6: An example of a state-persistence sub-graph.
Figure 7: Sub-graphs for state transition (left), initialisation (middle) and ending (right).

In the HSCRF we are interested in the conditional setting in which the entire state variables and ending indicators are conditioned on observational sequences . For example, in computational linguistics, the observation is often the sequence of words and the state variables might be the part-of-speech tags and the phrases.
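For the computational linguistics case, a chunked sentence can be turned into the state variables and ending indicators of a three-level model. The sketch below is only an illustration: the example sentence, the tag set, the root label "S" and the function name are hypothetical, and the input is assumed to be a non-empty list of non-empty chunks.

```python
def encode_chunked_sentence(chunks):
    """Encode a chunked sentence for a three-level model.

    chunks: list of (phrase_label, [(word, pos_tag), ...]) in sentence order.
    Returns (words, states, endings) where level 0 is the sentence root,
    level 1 holds phrase labels and level 2 holds POS tags.
    """
    words, phrase_row, pos_row, phrase_end = [], [], [], []
    for label, tokens in chunks:
        for k, (word, tag) in enumerate(tokens):
            words.append(word)
            phrase_row.append(label)
            pos_row.append(tag)
            # A phrase ends at its last word and persists before that.
            phrase_end.append(1 if k == len(tokens) - 1 else 0)
    T = len(words)
    states = [["S"] * T, phrase_row, pos_row]
    endings = [
        [0] * (T - 1) + [1],   # the root ends only at the last word
        phrase_end,
        [1] * T,               # POS states last exactly one step
    ]
    return words, states, endings


# Hypothetical example: "O" marks a non-NP phrase.
chunks = [
    ("NP", [("The", "DT"), ("cat", "NN")]),
    ("O", [("sat", "VBD"), ("on", "IN")]),
    ("NP", [("the", "DT"), ("mat", "NN")]),
]
words, states, endings = encode_chunked_sentence(chunks)
```

Because phrase boundaries are carried by the ending indicators, two adjacent phrases with the same label remain distinguishable without a separate begin/inside tagging scheme.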

To capture the correlation between variables and such conditioning, we define a non-negative potential function over each contextual clique. Figure 8 shows the notations for potentials that correspond to the four contextual clique types we have identified above. Details of potential specification are described in Section 6.1.

   where .
   where and .
   where .
   where .
Figure 8: Shorthands for contextual clique potentials.

Let denote the set of all variables that satisfies the set of hierarchical constraints in Figure 5. Let denote ordered set of all ending time indices at level , i.e. if then . The joint potential defined for each configuration is the product of all contextual clique potentials over all ending time indices and all semantic levels :


The conditional distribution is given as


where is the partition function for normalisation.

In what follows we omit the observation $z$ for clarity, and implicitly use it as part of the partition function $Z$ and the global potential. It should be noted that in the unconditional formulation, there is only a single $Z$ for all data instances. In the conditional setting there is a $Z(z)$ for each data instance $z$.

Remarks: The temporal model of HSCRFs presented here is not a standard graphical model (Lauritzen, 1996) since the connectivity (and therefore the clique structures) is not fixed. The potentials are defined on-the-fly depending on the context of assignments of ending indicators. Although the model topology is identical to that of shared-structure HHMMs (Bui et al., 2004), the unrolled temporal representation is an undirected graph and the model distribution is formulated in a discriminative way. Furthermore, the state-persistence potentials capture duration information that is not available in the DBN representation of the HHMMs in (Murphy and Paskin, 2002).

The way the potentials are introduced may at first appear to resemble the clique templates in the discriminative relational Markov networks (RMNs) (Taskar et al., 2002). It is, however, different because cliques in the HSCRFs are dynamic and context-dependent.

4 Asymmetric Inside-Outside Algorithm

This section describes a core inference engine called the Asymmetric Inside-Outside (AIO) algorithm, which is partly adapted from the generative, directed counterpart for HHMMs in (Bui et al., 2004). We now show how to compute the building blocks that are needed in most inference and learning tasks.

4.1 Building Blocks and Conditional Independence

Figure 9: (a) Symmetric Markov blanket, and (b) Asymmetric Markov blanket.

4.1.1 Contextual Markov blankets

In this subsection we define elements that are building blocks for inference and learning. These building blocks are identified given the corresponding boundaries. Let us introduce two types of boundaries: the contextual symmetric and asymmetric Markov blankets.

Definition 1.

A symmetric Markov blanket at level for a state starting at and ending at is the following set

Definition 2.

Let be a symmetric Markov blanket, we define and as follows


subject to . Further, we define


Figure 9a shows an example of a symmetric Markov blanket (represented by a double-arrowed line).

Definition 3.

An asymmetric Markov blanket at level for a parent state starting at and a child state ending at is the following set

Definition 4.

Let be an asymmetric Markov blanket, we define and as follows


subject to and . Further, we define


Figure 9b shows an example of an asymmetric Markov blanket (represented by an arrowed line).

Remark: The concepts of contextual Markov blankets (or Markov blankets for short) are different from those in traditional Markov random fields and Bayesian networks because they are specific assignments of a subset of variables, rather than a collection of variables.

4.1.2 Conditional independence

Given these two definitions we have the following propositions of conditional independence.

Proposition 1.

and are conditionally independent given


This proposition gives rise to the following factorisation

Proposition 2.

and are conditionally independent given


The following factorisation is a consequence of Proposition 2


The proofs of Propositions 1 and 2 are given in Appendix A.1.

4.1.3 Symmetric Inside/Outside Masses

From Equation 12 we have . Since separates from , we can group local potentials in Equation 8 into three parts: , , and . By ‘grouping’ we mean to multiply all the local potentials belonging to a certain part, in the same way that we group all the local potentials belonging to the model in Equation 8. Note that although contains we do not group into . The same holds for .

By definition of the state-persistence clique potential (Figure 8), we have . Thus Equation 8 can be replaced by


There are two special cases: (1) when , for , and (2) when , for and . This factorisation plays an important role in efficient inference.

We now define a quantity called symmetric inside mass , and another called symmetric outside mass .

Definition 5.

Given a symmetric Markov blanket , the symmetric inside mass and the symmetric outside mass are defined as


As special cases we have and , and for , . For later use let us introduce the ‘full’ symmetric inside mass and the ‘full’ symmetric outside mass as


In the rest of the paper, when it is clear from the context, we will use inside mass as a shorthand for symmetric inside mass, outside mass for symmetric outside mass, full-inside mass for full-symmetric inside mass, and full-outside mass for full-symmetric outside mass.

Thus, from Equation 24 the partition function can be computed from the full-inside mass at the top level ($d = 1$)


With a similar derivation, the partition function can also be computed from the full-outside mass at the bottom level ($d = D$)


In fact, we will prove a more general way to compute in Appendix B


for any and . These relations are summarised in Figure 10.

Figure 10: Computing the partition function from the full-inside mass and full-outside mass.
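The idea of reading the partition function off an inside-style quantity can be illustrated on a much simpler model. The sketch below is not the AIO algorithm: it is a one-level semi-Markov chain with a duration-dependent segment potential `R` and a label transition potential `A`, both hypothetical tables, and without initialisation or ending potentials. It computes the partition function by dynamic programming and by brute-force enumeration of all segmentations so that the two can be compared.

```python
import itertools


def partition_dp(T, n_labels, R, A):
    """Partition function of a flat semi-Markov chain over positions 0..T-1.

    R[s][L - 1] : potential of a segment with label s and length L.
    A[r][s]     : potential of a segment labelled r followed by one labelled s.
    inside[j][s] sums the potentials of all segmentations of positions
    0..j-1 whose last segment has label s.
    """
    inside = [[0.0] * n_labels for _ in range(T + 1)]
    for j in range(1, T + 1):
        for s in range(n_labels):
            total = 0.0
            for i in range(j):                  # last segment covers i..j-1
                seg = R[s][j - i - 1]
                if i == 0:
                    total += seg                # first segment, no transition
                else:
                    total += seg * sum(inside[i][r] * A[r][s]
                                       for r in range(n_labels))
            inside[j][s] = total
    return sum(inside[T])


def partition_brute_force(T, n_labels, R, A):
    """Same quantity by enumerating every segmentation and labelling."""
    Z = 0.0
    for cuts in itertools.product([0, 1], repeat=T - 1):
        # cuts[t] == 1 means a segment ends at position t; T-1 always ends one.
        ends = [t for t in range(T - 1) if cuts[t]] + [T - 1]
        lengths = [ends[0] + 1] + [b - a for a, b in zip(ends, ends[1:])]
        for labels in itertools.product(range(n_labels), repeat=len(lengths)):
            score = 1.0
            for k, (s, length) in enumerate(zip(labels, lengths)):
                score *= R[s][length - 1]
                if k > 0:
                    score *= A[labels[k - 1]][s]
            Z += score
    return Z


# Hypothetical potentials for 2 labels and a sequence of length 4.
R = [[1.0, 0.5, 0.25, 0.125], [2.0, 1.0, 0.5, 0.1]]
A = [[0.5, 1.5], [1.0, 2.0]]
Z_dp = partition_dp(4, 2, R, A)
Z_bf = partition_brute_force(4, 2, R, A)
```

The dynamic program takes time quadratic in the sequence length for this flat case, whereas the enumeration is exponential; agreement between the two on small inputs is a useful check before moving to the nested setting.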

Given the fact that is separated from the rest of variables by the symmetric Markov blanket , we have Proposition 3.

Proposition 3.

The following relations hold


The proof of this proposition is given in Appendix A.2.

4.1.4 Asymmetric Inside/Outside Masses

Recall that we have introduced the concept of asymmetric Markov blanket which separates and . Let us group all the local contextual clique potentials associated with and into a joint potential . Similarly, we group all local potentials associated with and into a joint potential . Note that includes the state-persistence potential .

Definition 6.

Given the asymmetric Markov blanket , the asymmetric inside mass and the asymmetric outside mass are defined as follows