A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes

02/16/2016 · by Weiyao Lin, et al. · Nanjing University, Shanghai Jiao Tong University, Microsoft

This paper addresses the problem of detecting coherent motions in crowd scenes and presents two applications in crowd scene understanding: semantic region detection and recurrent activity mining. It processes input motion fields (e.g., optical flow fields) and produces a coherent motion field, named the thermal energy field. The thermal energy field is able to capture both the motion correlation among particles and the motion trends of individual particles, both of which are helpful for discovering coherency among them. We further introduce a two-step clustering process to construct stable semantic regions from the extracted time-varying coherent motions. These semantic regions can be used to recognize pre-defined activities in crowd scenes. Finally, we introduce a cluster-and-merge process which automatically discovers recurrent activities in crowd scenes by clustering and merging the extracted coherent motions. Experiments on various videos demonstrate the effectiveness of our approach.




I Introduction

Coherent motions, which represent coherent movements of massive individual particles, are pervasive in natural and social scenarios. Examples include traffic flows and parades of people (cf. Figs. 1(a) and 2(a)). Since coherent motions can effectively decompose scenes into meaningful semantic parts and facilitate the analysis of complex crowd scenes, they are of increasing importance in crowd-scene understanding and activity recognition [2, 3, 4, 5, 6].

In this paper, we address the problem of detecting coherent motions in crowd scenes, and subsequently using them to understand input scenes. More specifically, we focus on 1) constructing an accurate coherent motion field to find coherent motions, 2) finding stable semantic regions based on the detected coherent motions and using them to recognize pre-defined activities (i.e., activities with labeled training data) in a crowd scene, and 3) automatically mining recurrent activities in a crowd scene based on the detected coherent motions and semantic regions.

First, constructing an accurate coherent motion field is crucial for detecting reliable coherent motions. In Fig. 1, (b) is the input motion field and (c) is the coherent motion field constructed from (b) using the proposed approach. In Fig. 1(b), the motion vectors of particles at the beginning of the Marathon queue differ greatly from those at the end, and there are many inaccurate optical flow vectors. Due to such variations and input errors, it is difficult to achieve satisfying coherent motion detection results directly from Fig. 1(b). However, by transferring Fig. 1(b) into a coherent motion field where the coherent motions among particles are suitably highlighted, as in Fig. 1(c), coherent motion detection is greatly facilitated. Although many algorithms have been proposed for coherent motion detection [7, 8, 9, 2], this problem is not yet effectively addressed. We argue that a good coherent motion field should 1) encode motion correlation among particles, so that particles with high correlations can be grouped into the same coherent region; and 2) maintain the motion information of individual particles, so that activities in crowd scenes can be effectively parsed from the extracted coherent motion field. Based on these intuitions, we propose a thermal-diffusion-based approach which extracts accurate coherent motion fields.

Figure 1: (a) Example frame of a Marathon video sequence, the red curve is the coherent motion region; (b) Input motion vector field of (a); (c) Coherent motion field from (b) using the proposed approach (Best viewed in color).
Figure 2: (a) Example time-varying coherent motions in a scene, where different coherent motions are circled by curves of different colors; (b) Constructed semantic regions for the scene in (a); (c) Recurrent activities for the scene in (a), where the arrows represent the major motion flows in each recurrent activity (best viewed in color).

Second, constructing meaningful semantic regions to describe the activity patterns in a scene is also essential. Coherent motions at different times may vary widely. In Fig. 2(a), the changing of traffic lights leads to different coherent motions. Coherent motions alone therefore may not effectively describe the overall semantic patterns of a scene. Instead, semantic regions need to be extracted from these time-varying coherent motions to achieve stable and meaningful semantic patterns, as in Fig. 2(b). However, most existing works focus only on detecting coherent motions at specific times, while the problem of handling time-varying coherent motions is less studied. We propose a two-step clustering process for this purpose.

Third, mining recurrent activities is another important issue. Many crowd scenes are composed of recurrent activities [10, 11, 12]. For example, the scene in Fig. 2 is composed of recurrent activities including vertical-motion and horizontal-motion activities, as in Fig. 2(c). Automatically mining these recurrent activities is important for understanding scene contents and their dynamics. Although much research has been done on parsing recurrent activities in low-crowd scenes [13, 14, 15, 16], this issue is not well addressed in crowd scenes, where reliable motion trajectories are unavailable. We propose a cluster-and-merge process which can effectively extract recurrent activities in crowd scenes.

Our contributions to crowd scene understanding and activity recognition are summarized as follows.

  1. We introduce a coarse-to-fine thermal diffusion process to transfer an input motion field into a thermal energy field (TEF), which is a more accurate coherent motion field. TEF effectively encodes both motion correlation among particles and motion trends of individual particles. To our knowledge, this is the first work that introduces thermal diffusion to detect coherent motions in crowd scenes. We also introduce a triangulation-based scheme to effectively identify coherent motion components from the TEF.

  2. We present a two-step clustering scheme to find semantic regions according to the correlations among coherent motions. The found semantic regions can effectively capture the activity patterns in a scene. Thus, good performance can be achieved when recognizing pre-defined crowd activities based on these semantic regions.

  3. We propose a cluster-and-merge process to automatically mine recurrent activities by clustering and merging the coherent motions. The obtained recurrent activities can accurately describe recurrent motion patterns in a crowd scene.

The remainder of this paper is organized as follows. Section II reviews related works. Section III describes the framework of the proposed approach. Sections IV to VI describe the details of our proposed thermal diffusion process, triangulation scheme, two-step clustering scheme, and cluster-and-merge process. Section VII shows the experimental results and Section VIII concludes the paper.

II Related Works

Many works [17, 7, 8, 9, 2, 18, 19, 20, 21, 22] have addressed coherent motion detection. However, due to the complex nature of crowd scenes, accurate detection of coherent motion fields remains an open problem. Cremers and Soatto [20] and Brox et al. [21] model the intensity variation of optical flow by an objective-functional minimization scheme. These methods are only suitable for motions with simple patterns and cannot effectively analyze complex crowd patterns such as the circular flow in Fig. 1(a). Other works introduce external spatial-temporal correlation traits to model the motion coherency among particles [7, 8, 9]. Since these methods model particle correlations in more precise ways, they can achieve more satisfying results. However, because most of them only consider short-distance motion correlation within a local region while neglecting long-distance correlation among distant particles, they have limitations in handling low-density or disconnected coherent motions where long-distance correlation is essential. Furthermore, without the information from distant particles, these methods are also less effective at identifying coherent motion regions when local coherent motion patterns are close to their neighboring backgrounds. One example of this kind of scenario is region B in Fig. 1(b).

There are also other works related to motion modeling. One line of related work is advanced optical flow estimation. These methods try to improve the estimation accuracy of the input motion field by including global constraints over particles [23, 24, 25, 26]. The focus of our approach is different. We focus on enhancing the correlation among coherent particles to facilitate coherent motion detection; thus, the motion vectors of coherent particles are enhanced even if their actual motions are small, such as region B in Figs. 1(b) and 1(c). In contrast, advanced optical flow estimation methods focus on estimating the actual motion of particles, and they are still less capable of producing precise results when applied to coherent motion detection.

Anisotropic-diffusion-based methods, used in image segmentation, are also related to our work [27, 28, 29]. Our approach differs from these methods in two ways. First, our approach not only embeds the motion correlation among particles, but also suitably maintains the original motion information from the input motion vector field. Comparatively, the anisotropic-diffusion-based methods focus on enhancing the correlation among particles while neglecting the particles’ original information. As aforementioned, maintaining particle motion information is important for parsing crowd scenes. More importantly, due to the complex nature of crowd scenes, many coherent region boundaries are vague, subtle, and hard to recognize. Simply applying anisotropic diffusion cannot identify the ideal boundaries. The proposed thermal diffusion process achieves more satisfying results by jointly modeling the motion direction, motion strength, and spatial correlation among particles.

Besides coherent motion detection, it is also important to utilize coherent motions to recognize pre-defined crowd activities. However, most existing coherent motion works only focus on the extraction of coherent motions, while the recognition of crowd activities is much less studied. In [17], Ali and Shah detected instability regions in a scene by comparing them with its normal coherent motions. However, they assume coherent motions to be stable, while in practice many coherent motions vary widely over time, making it difficult to construct stable normal coherent motions. Furthermore, besides the works on coherent motion, there are other works which directly extract global features from the entire scene to recognize crowd activities [3, 30]. However, since they do not consider the semantic-region correlations inside the scene, they have limitations in differentiating subtle differences among activities. Although there are some works [4, 31] which recognize crowd activities by segmenting scenes into semantic regions, our approach differs from them: it finds semantic regions by first extracting global coherent motion information, while these methods construct semantic regions from the particles’ local features. As will be shown later, information from coherent motions can effectively enhance the correlation among particles, resulting in more meaningful semantic regions that facilitate activity recognition.

Furthermore, pre-defining or labeling crowd activities requires substantial human labor, making it desirable to automatically discover activity patterns in a crowd video without human intervention. In [15], Morris and Trivedi clustered trajectories into groups and modeled the spatio-temporal dynamic patterns of each trajectory group by Hidden Markov Models. Wang et al. [13] and Hu et al. [14] further introduced Dirichlet processes to model the activity patterns of different trajectory groups. However, since these methods extract recurrent activities from motion trajectories, they are not suitable for crowd scenes where reliable trajectories are difficult to obtain. Besides using motion trajectories, other works find recurrent activities by extracting low-level or short-term motion features. For example, Zhou et al. [12] extracted fragments of trajectories (called tracklets) and utilized a Latent Dirichlet Allocation topic model to infer recurrent activities. Jagannadan et al. [11] and Emonet et al. [10] extracted low-level motion flows as motion descriptors and introduced a Probabilistic Latent Sequential Motif (PLSM) model to obtain recurrent activities. Although these methods can be applied to crowd scenes, they still have limitations in obtaining precise recurrent activity patterns in scenes with complex motions. Our approach differs from the previous methods in two ways: 1) our approach utilizes coherent motions to discover recurrent activities, and since coherent motions effectively capture the local activity pattern in each frame, more precise recurrent activities can be achieved; 2) our approach also extracts flow curves to describe and visualize recurrent activities, and compared with previous methods, which describe recurrent activities by trajectory clusters or probability densities, these flow curves visualize recurrent activity patterns in a clearer and more straightforward way.

III Overview of the Approach

Figure 3: The flowchart of the proposed approach (best viewed in color).

The framework of the proposed approach is shown in Fig. 3. First, optical flow fields [17, 32] are extracted from the input videos. Second, the coarse-to-fine thermal diffusion process transfers the input motion fields into coherent motion fields, i.e., thermal energy fields (TEFs). Third, the triangulation-based scheme is applied to identify coherent motions. Fourth, with the obtained coherent motions, the two-step clustering scheme clusters coherent motions from multiple TEFs and constructs semantic regions for the target scene. Finally, based on these semantic regions, we extract effective features to describe crowd activities in the scene and recognize pre-defined crowd activities accordingly. In parallel, the cluster-and-merge process is applied to the extracted coherent motions and semantic regions to discover recurrent activities in the target scene. These techniques are described in detail in the following sections.

IV Finding Coherent Motions

In order to find accurate coherent motions, it is important to construct a coherent motion field that highlights the motion correlation among particles while still maintaining the original motion information. To meet this requirement, we introduce a thermal diffusion process to model particle correlations. Given an input optical flow field, we view each particle (i.e., each pixel in a frame) as a “heat source” which can diffuse energy to influence other particles. By suitably modeling this thermal diffusion process, precise correlations among particles can be obtained. The formulation is motivated by the following intuitions:

  1. Particles farther from the heat source should receive less thermal energy;

  2. Particles residing in the motion direction of the heat source particle should receive more thermal energy;

  3. Heat source particles with larger motions should carry more thermal energy.

IV-A Thermal Diffusion Process

Based on the above discussions, we borrow the idea from physical thermal propagation [33] and model the thermal diffusion process by Eq. 1:


$$\frac{\partial E(\mathbf{p},t)}{\partial t}=\alpha\nabla^{2}E(\mathbf{p},t)+\mathbf{F}(\mathbf{v}_{\mathbf{p}})\quad(1)$$

where $E(\mathbf{p},t)$ is the thermal energy for the particle at location $\mathbf{p}$ after performing thermal diffusion for $t$ seconds, $\mathbf{v}_{\mathbf{p}}$ is the input motion vector for particle $\mathbf{p}$, $\mathbf{F}(\cdot)$ is the external-force term induced by this motion vector, and $\alpha$ is the propagation coefficient.

The first term in Eq. 1 models the propagation of thermal energies over free space, so that the spatial correlation among particles is properly enhanced during thermal diffusion. The second term can be viewed as an external force added on the particle to affect its diffusion behavior, which preserves the original motion patterns. The inclusion of this term is one of the major differences between the proposed approach and the anisotropic-diffusion methods [29]. Without the external-force term, Eq. 1 can be solved by:


$$E(\mathbf{p},t)=\sum_{\mathbf{p}'\in\Omega}E_{\mathbf{p}'\to\mathbf{p}}(t)\quad(2)$$

where $E(\mathbf{p},t)$ is the final diffused thermal energy for particle $\mathbf{p}$ after $t$ seconds, $\Omega$ is the set of all particles in the frame, and $W$ and $H$ are the width and height of the frame (so that $|\Omega|=W\cdot H$). The individual thermal energy $E_{\mathbf{p}'\to\mathbf{p}}(t)$ diffused from the heat source particle $\mathbf{p}'$ to particle $\mathbf{p}$ after $t$ seconds is defined as:


$$E_{\mathbf{p}'\to\mathbf{p}}(t)=\frac{\mathbf{v}_{\mathbf{p}'}}{4\pi\alpha t}\exp\!\left(-\frac{d^{2}(\mathbf{p},\mathbf{p}')}{4\alpha t}\right)\quad(3)$$

where $\mathbf{v}_{\mathbf{p}'}$ is the current motion pattern for the heat source particle $\mathbf{p}'$, initialized by its input motion vector, and $d(\mathbf{p},\mathbf{p}')$ is the distance between particles $\mathbf{p}$ and $\mathbf{p}'$. In this paper, we fix $t$ to be 1 to eliminate its effect.

However, when the external-force term in Eq. 1 is non-zero, it is difficult to obtain an exact solution. We therefore introduce an additional multiplicative term $e^{\beta\cos\theta_{\mathbf{p}'\to\mathbf{p}}}$ to approximate the influence of the force, where $\beta$ is a force propagation factor. Moreover, in order to prevent unrelated particles from accepting too much heat from $\mathbf{p}'$, we restrict the diffusion so that only highly correlated particles propagate energies to each other. The final individual thermal energy from $\mathbf{p}'$ to $\mathbf{p}$ is:


$$E_{\mathbf{p}'\to\mathbf{p}}=\frac{\mathbf{v}_{\mathbf{p}'}}{4\pi\alpha}\exp\!\left(-\frac{d^{2}(\mathbf{p},\mathbf{p}')}{4\alpha}\right)\exp\!\left(\beta\cos\theta_{\mathbf{p}'\to\mathbf{p}}\right)\quad(4)$$

if $\mathrm{sim}(\mathbf{v}_{\mathbf{p}},\mathbf{v}_{\mathbf{p}'})>\tau$, and $E_{\mathbf{p}'\to\mathbf{p}}=0$ otherwise, where $\mathbf{v}_{\mathbf{p}}$ and $\mathbf{v}_{\mathbf{p}'}$ are the input motion vectors of the current particle $\mathbf{p}$ and the heat source particle $\mathbf{p}'$, $\theta_{\mathbf{p}'\to\mathbf{p}}$ is the angle between $\mathbf{v}_{\mathbf{p}'}$ and the direction from $\mathbf{p}'$ to $\mathbf{p}$, $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, and $\tau$ is a threshold. In our experiments, $\alpha$, $\beta$, and $\tau$ are set to 0.2, 0.8, and 0.7, respectively, as decided from experimental statistics.

From Eq. 2, we see that the diffused thermal energy $E(\mathbf{p})$ is the summation of the energies from all other particles, which encodes the correlation between $\mathbf{p}$ and all other particles in the frame. Furthermore, in Eq. 4, the first term preserves the motion pattern of the heat source, the second term encodes the spatial correlation between the source and target particles, and the third term guarantees that particles along the motion direction of the heat source receive more thermal energy. The cosine-similarity condition in Eq. 4 ensures that particle $\mathbf{p}$ does not accept energy from $\mathbf{p}'$ if their input motion vectors are far different (i.e., less coherent) from each other. That is, Eq. 4 satisfies all three intuitions.
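To make the diffusion concrete, the following brute-force sketch evaluates the reconstructed form of Eqs. 2 and 4 (a Gaussian spatial decay, a directional boost along the source's motion, and a cosine-similarity gate). The function names are illustrative, not the paper's implementation; the constants mirror the reported 0.2 / 0.8 / 0.7 settings.

```python
import numpy as np

def individual_energy(p, p_src, v_p, v_src, alpha=0.2, beta=0.8, tau=0.7):
    """Thermal energy diffused from heat source p_src to particle p (Eq. 4 sketch)."""
    # coherence gate: incoherent particles exchange no energy
    denom = np.linalg.norm(v_p) * np.linalg.norm(v_src)
    if denom == 0 or np.dot(v_p, v_src) / denom <= tau:
        return np.zeros(2)
    d = p - p_src
    dist2 = float(d @ d)
    # angle between the source's motion and the direction from source to target
    r = np.linalg.norm(d) * np.linalg.norm(v_src)
    cos_theta = (d @ v_src) / r if r > 0 else 0.0
    return (v_src / (4 * np.pi * alpha)
            * np.exp(-dist2 / (4 * alpha))   # farther particles receive less energy
            * np.exp(beta * cos_theta))      # particles along the motion receive more

def thermal_energy_field(flow):
    """Sum the individual energies from every particle (Eq. 2), brute force."""
    H, W, _ = flow.shape
    tef = np.zeros_like(flow)
    coords = [(y, x) for y in range(H) for x in range(W)]
    for (y, x) in coords:
        p = np.array([x, y], dtype=float)
        for (ys, xs) in coords:
            tef[y, x] += individual_energy(p, np.array([xs, ys], dtype=float),
                                           flow[y, x], flow[ys, xs])
    return tef
```

Each particle sums energy from every other particle, so the cost is quadratic in the particle count; this naive loop is only meant to make the reconstructed formulation testable.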

Fig. 4 shows one example of the thermal diffusion process, which reveals that:

  1. Comparing Fig. 4(b) with Fig. 4(a), the original motion information is indeed preserved in the TEF. Moreover, the TEF further strengthens particle motion coherency through thermal diffusion, which integrates the influence among particles. Coherent motions become more recognizable, so more accurate coherent motion extraction can be achieved.

  2. From Fig. 4(c), we can see that the thermal energy of each heat source particle is propagated in a sector shape. Particles along the motion direction of the heat source (C and D) receive more energy than particles outside the motion direction (such as E). In Fig. 4(d), since particles on the lower side of the heat source B have small (cosine) motion similarities with B, they do not accept thermal energy.

Figure 4: (a),(b): An input optical flow field and its thermal energy field; (c),(d): Individual thermal diffusion results obtained by diffusing from a single heat source particle (A and B, respectively) to the entire field.

IV-B The Coarse-to-Fine Scheme

Although Eqs. 2 and 4 can effectively strengthen the coherency among particles, they are based on a single input motion field, so only short-term motion information is considered, which is volatile and noisy. Thus, we propose a coarse-to-fine scheme to include long-term motion information. The entire coarse-to-fine thermal diffusion process is described in Algorithm 1.

1:  Set the frame interval $d = d_{\max}$
2:  Calculate the input motion vector field $V_d$ with $d$-frame intervals
3:  Initialize the thermal energy field $E$ with $V_d$
4:  for $k = 1$ to $K$ do
5:     Use Eq. 2 to create the new thermal energy field $E$ based on $E$ and $V_d$
6:     Normalize the vector magnitudes in $E$
7:     Decrease the interval: $d = d - \Delta d$
8:     if $d \geq 1$ then
9:        Calculate $V_d$ with the new $d$
10:     end if
11:  end for
12:  Output $E$
Algorithm 1 Coarse-to-Fine Thermal Diffusion Process

The long-term motion vector field with a large frame interval is first calculated and used to create the thermal energy field. Then, the TEF is iteratively updated with shorter-term motion vector fields, i.e., with smaller frame intervals. Figs. 5(c) and 5(d) show the TEF results after different numbers of iterations. As more iterations are performed, motion information at more intervals is included in the thermal diffusion process, so more precise results are achieved in the TEF, as in Fig. 5(d). Fig. 1(c) shows another TEF result after the entire coarse-to-fine thermal diffusion scheme. We find that:

  1. TEF is an enhanced version of the input motion field, in which particles’ energy directions are similar to their original motion directions. Moreover, since the TEF includes both the motion correlation among particles and the short-/long-term motion information across frames, coherent motions are effectively strengthened and highlighted.

  2. As mentioned, input motion vectors may be disordered, e.g., region A in Fig. 5(b). However, the thermal energies from other particles can help regularize these disordered motion vectors and make them coherent, as in Fig. 5(c).

  3. Input motion vectors may be extremely small due to slow motion or occlusion by other objects (region B in Fig. 1(b) and region C in Fig. 5(b)). It is very difficult to include these particles in the coherent region by traditional methods [17, 7, 8, 9] because they are close to the background motion vectors. However, the TEF can strengthen these small motion vectors by diffusing thermal energies from distant particles with larger motions.

Figure 5: (a),(b): An input video frame and its input motion vector field; (c),(d): TEF results of Algorithm 1 after 1 and 3 iterations, respectively ($d_{\max}=5$ and $\Delta d=1$).
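The loop of Algorithm 1 can be sketched as follows. Here `diffuse` stands in for any implementation of the thermal diffusion of Eq. 2, and the frame-differencing placeholder for the interval-$d$ motion field is purely illustrative; a real system would call an optical-flow estimator instead.

```python
import numpy as np

def coarse_to_fine_tef(frames, d_max=5, d_step=1, diffuse=None):
    """Coarse-to-fine TEF construction (Algorithm 1 sketch).

    frames: list of 2-D grayscale frames (needs at least d_max + 1 of them).
    diffuse: optional callable applying the thermal diffusion of Eq. 2.
    """
    def flow_at_interval(d):
        # Placeholder motion field between frames[0] and frames[d]; swap in a
        # real optical-flow routine in practice.
        return np.dstack([frames[d] - frames[0], np.zeros_like(frames[0])])

    d = d_max
    tef = flow_at_interval(d)                    # start with the long-term field
    while d >= 1:
        if diffuse is not None:
            tef = diffuse(tef)                   # thermal diffusion (Eq. 2)
        norms = np.linalg.norm(tef, axis=2, keepdims=True)
        tef = tef / np.maximum(norms, 1e-9)      # normalize vector magnitudes
        d -= d_step
        if d >= 1:
            tef = tef + flow_at_interval(d)      # blend in shorter-term motion
    return tef
```

The iteration order matters: starting from the coarse (long-interval) field and refining with progressively shorter intervals lets stable long-term motion anchor the noisy short-term observations.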

IV-C Finding Coherent Motions through Triangulation

Coherent motion regions can be obtained by performing segmentation on the TEF. We propose a triangulation-based scheme with the following steps:

Step 1: Triangulation. In this step, we randomly sample particles from the entire scene and apply the triangulation process [34] to link the sampled particles. The block labeled “triangulation” in Fig. 3 shows one triangulation result, where the red dots are the sampled particles and the lines are the links created by the triangulation process [34].

Step 2: Boundary detection. We first obtain each triangulation link weight by:


$$w(\mathbf{p}_i,\mathbf{p}_j)=\left\|\mathbf{E}(\mathbf{p}_i)-\mathbf{E}(\mathbf{p}_j)\right\|\quad(5)$$

where $\mathbf{p}_i$ and $\mathbf{p}_j$ are two connected particles, and $\mathbf{E}(\mathbf{p}_i)$ and $\mathbf{E}(\mathbf{p}_j)$ are the thermal energy vectors of $\mathbf{p}_i$ and $\mathbf{p}_j$ in the TEF. A large weight is assigned if the connected particles are from different coherent motion regions (i.e., they have different thermal energy vectors). Thus, by thresholding the link weights, we can find links crossing the boundaries. The block labeled “detected region boundary” in Fig. 3 shows one boundary detection result after Step 2.

Step 3: Coherent motion segmentation. Coherent motions can then be segmented from the detected boundaries; we use the watershed algorithm [35]. The final coherent motions are shown in the block named “detected coherent motions” in Fig. 3.
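Steps 1 and 2 above can be sketched with SciPy's Delaunay triangulation. The link weight is taken as the norm of the thermal-energy-vector difference, consistent with the description above, and the cutoff `thresh` is an illustrative assumption.

```python
import numpy as np
from scipy.spatial import Delaunay

def boundary_links(points, tef_vectors, thresh=1.0):
    """Flag triangulation links that cross coherent-region boundaries.

    points: (N, 2) sampled particle coordinates.
    tef_vectors: (N, 2) thermal energy vectors at those particles.
    """
    tri = Delaunay(points)
    links = set()
    for simplex in tri.simplices:                 # each triangle contributes 3 edges
        for a, b in ((0, 1), (1, 2), (0, 2)):
            i, j = sorted((int(simplex[a]), int(simplex[b])))
            links.add((i, j))
    # links whose endpoints have very different energy vectors cross a boundary
    return [(i, j) for (i, j) in links
            if np.linalg.norm(tef_vectors[i] - tef_vectors[j]) > thresh]
```

The returned boundary links would then seed the watershed segmentation of Step 3.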

V Constructing Semantic Regions

With the extracted coherent motions, accurate motion information in a frame can be obtained. However, since coherent motions vary over time, it is essential to construct semantic regions from the time-varying coherent motions to capture the stable semantic patterns inside a scene. For this purpose, we propose a two-step clustering scheme. Assuming $N$ coherent motions $U_1,\dots,U_N$ in total, extracted from TEFs computed at $T$ different times, the two-step clustering scheme is:

Step 1: Cluster coherent motion regions. The similarity between two coherent motions $U_i$ and $U_j$ is computed as:


$$S(U_i,U_j)=\#\left\{(\mathbf{p},\mathbf{p}')\ \middle|\ \mathbf{p}\in\Phi_i,\ \mathbf{p}'\in\Phi_j,\ \mathrm{sim}\!\left(\mathbf{E}(\mathbf{p}),\mathbf{E}(\mathbf{p}')\right)>\tau\right\}\quad(6)$$

where $\#\{\cdot\}$ is the number of elements in a set and $\tau$ is a threshold, set to the same value as in Eq. 4 in our experiments. Furthermore, $\Phi_i$ and $\Phi_j$ are the sets of “indicative particles” for $U_i$ and $U_j$:

$$\Phi_i=\left\{\mathbf{p}\ \middle|\ \mathbf{p}\in\partial U_i,\ \mathrm{sim}\!\left(\mathbf{E}(\mathbf{p}),\mathbf{n}(\mathbf{p})\right)>\tau\right\}\quad(7)$$

where $\mathbf{n}(\mathbf{p})$ is the outer normal vector at $\mathbf{p}$, i.e., perpendicular to the boundary $\partial U_i$ and pointing outward from the coherent motion region, and $\tau$ is the same threshold as in the condition for Eq. 4. That is, only particles which lie on the boundary of the coherent motion region and whose thermal energy vectors sharply point outward from the region are selected as indicative particles. Thus, we avoid noisy particles and substantially reduce the required computation.

From Eq. 6, we can see that we first extract the indicative particles and then utilize only the high-correlation pairs; the total number of such pairs is the similarity value between the two coherent motions. It should be noted that the similarity is calculated between every pair of coherent motions, even if they belong to different TEFs.

Then, we construct a similarity graph for the coherent motions and perform spectral clustering [36] on this similarity graph, with the optimal number of clusters determined automatically; the resulting clusters are the grouped coherent regions.
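A direct sketch of the pair-counting similarity of Eq. 6, assuming the indicative particles' thermal energy vectors have already been collected into arrays (argument names are illustrative):

```python
import numpy as np

def coherent_similarity(vec_a, vec_b, tau=0.7):
    """Pair-counting similarity between two coherent motions (Eq. 6 sketch).

    vec_a, vec_b: thermal energy vectors of the indicative particles of the
    two coherent motions, shapes (Na, 2) and (Nb, 2). The similarity is the
    number of cross pairs whose cosine similarity exceeds tau.
    """
    count = 0
    for u in vec_a:
        for v in vec_b:
            denom = np.linalg.norm(u) * np.linalg.norm(v)
            if denom > 0 and (u @ v) / denom > tau:
                count += 1
    return count
```

Evaluating this for every pair of coherent motions, including pairs drawn from different TEFs, yields the similarity graph on which the spectral clustering operates.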

Figure 6: (a) Step 1: Coherent regions in the three TEFs have been assigned different cluster labels by Step 1 and are displayed in different colors; (b) Finding semantic regions by clustering the cluster label vectors of the particles (best viewed in color).
Figure 7: (a) Directly segmenting semantic regions according to the particles’ local features. (b) Segmenting semantic regions with the guidance of coherent motion clusters.

Step 2: Cluster to find semantic regions. Each coherent motion is assigned a cluster label in Step 1, as illustrated in Fig. 6(a). However, due to the variation of coherent motions at different times, there exist many ambiguous particles. For example, in Fig. 6(a), the yellow cross particle belongs to different coherent motion clusters in different TEFs. This makes it difficult to directly use the clustered coherent motion results to construct reliable semantic regions. To address this problem, we further propose to encode the particles in each TEF by the cluster labels of their affiliated coherent motions. By concatenating the cluster labels over different TEFs, we construct a “cluster label” vector for each particle, as in Fig. 6(a). With these label vectors, the same spectral clustering process as in Step 1 can be performed on the particles to achieve the final semantic regions, as in Fig. 6(b).

Compared with previous semantic region segmentation methods [4, 31], which perform clustering using local similarity among particles, our scheme utilizes the guidance of the global coherent motion clustering results to strengthen the correlations among particles. For example, in Fig. 7(a), when directly segmenting the particles by their local features, accuracy may be limited due to similar distances among particles. However, by utilizing cluster labels to encode the particles, similarities among particles can be suitably enhanced by the global coherent cluster information, as in Fig. 7(b). Thus, more precise segmentation results can be achieved.
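The cluster-label encoding of Step 2 amounts to stacking the per-TEF label maps into one vector per particle; a minimal sketch (the label-map representation, e.g. -1 for background, is an assumption for illustration):

```python
import numpy as np

def particle_label_vectors(label_maps):
    """Encode each particle by its coherent-motion cluster labels over T TEFs.

    label_maps: list of T (H, W) integer arrays, one per TEF, where each entry
    is the cluster label of the coherent motion the particle belongs to.
    Returns an (H*W, T) matrix with one "cluster label" vector per particle.
    """
    stacked = np.stack(label_maps, axis=-1)        # (H, W, T)
    return stacked.reshape(-1, stacked.shape[-1])  # one row per particle
```

The rows of this matrix are then fed to the same spectral clustering used in Step 1 to produce the final semantic regions.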

V-A Recognizing Pre-defined Activities

Based on the constructed semantic regions, we are able to recognize pre-defined activities (i.e., activities with labeled training data) in the scene. In this paper, we simply average the TEF vectors in each semantic region and concatenate these averaged TEF vectors into the final feature vector describing the activity pattern in a TEF. Then, a linear support vector machine (SVM) [37] is utilized to train and recognize pre-defined activities. Experimental results show that, with an accurate TEF and precise semantic regions, we can achieve satisfying results using this simple method.
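The feature construction described above can be sketched as follows; the mask-based region representation is an assumption, and the resulting vector would be fed to any linear SVM (e.g., scikit-learn's LinearSVC):

```python
import numpy as np

def activity_feature(tef, region_masks):
    """Average the TEF vectors inside each semantic region and concatenate.

    tef: (H, W, 2) thermal energy field.
    region_masks: list of boolean (H, W) masks, one per semantic region.
    Returns a 1-D feature vector of length 2 * len(region_masks).
    """
    feats = []
    for mask in region_masks:
        if mask.any():
            feats.append(tef[mask].mean(axis=0))   # mean 2-D vector in the region
        else:
            feats.append(np.zeros(2))              # empty region contributes zeros
    return np.concatenate(feats)
```

Averaging per semantic region rather than over the whole frame is what lets the classifier exploit where in the scene each motion occurs.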

V-B Merging Disconnected Coherent Motions

Since the TEF also includes long-distance correlations between distant particles, performing our clustering scheme has the further advantage of effectively merging disconnected coherent motions, which may be caused by occlusion from other objects or by low crowd density. For example, the two disconnected blue regions in the right-most TEF in Fig. 6(a) are merged into the same cluster by our approach. Note that this issue is not well studied in existing coherent motion research.

VI Mining Recurrent Activities

With the extracted coherent motions and constructed semantic regions, crowd activities can be recognized by constructing and pre-labeling training data, as in Section V-A. However, since pre-defining or labeling crowd activities takes substantial human labor, it is also desirable to automatically mine recurrent activity patterns in a crowd scene without human intervention. For this purpose, we propose a cluster-and-merge process which includes three steps: frame-level clustering, coherent motion merging, and flow curve extraction.

VI-A Frame-level Clustering

The frame-level clustering step clusters frames according to the extracted coherent motions and semantic regions, so that frames with the same recurrent activity pattern are organized into the same group. In this paper, we first calculate the inter-frame similarities for all frame pairs and then utilize spectral clustering [36] to cluster the frames according to these similarities.

In order to calculate the inter-frame similarity $S_F(m,n)$ between frames $m$ and $n$, the similarities between all coherent motions from frames $m$ and $n$ are first calculated using Eq. 6. Then, the inter-frame similarity can be obtained from these coherent motion similarities and the segmented semantic regions. More specifically, we define $S_F(m,n)$ as


$$S_F(m,n)=S_M(m,n)+S_U(m,n)\quad(8)$$

where $S_M(m,n)$ is the similarity for the matched coherent motion pairs between frames $m$ and $n$, and $S_U(m,n)$ accounts for the unmatched coherent motion regions in frames $m$ and $n$. $S_M(m,n)$ and $S_U(m,n)$ can be calculated by Eqs. 9 and 10.

First, $S_M(m,n)$ is defined as


$$S_M(m,n)=\frac{1}{\max(N_m,N_n)}\sum_{(U_i^m,U_j^n)\in\mathcal{M}}w_{ij}\,S(U_i^m,U_j^n)\quad(9)$$

where $S(U_i^m,U_j^n)$ is the similarity between coherent motion regions $U_i^m$ and $U_j^n$, $w_{ij}$ is the corresponding weight, $N_m$ and $N_n$ are the total numbers of coherent motion regions in frames $m$ and $n$, respectively, and $\mathcal{M}$ is the set of all matched coherent region pairs. In this paper, $\mathcal{M}$ and $w_{ij}$ are calculated by the Hungarian algorithm [38], which achieves optimal coherent motion matching results based on the input coherent motion similarities. Furthermore, in order to exclude dissimilar coherent motion pairs from the matching result, coherent motion pairs with small similarity values are deleted from $\mathcal{M}$. Fig. 8 shows an example of the matched coherent motion pairs.
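The matching itself can be sketched with SciPy's Hungarian solver (`linear_sum_assignment`); the pruning threshold `min_sim` is an illustrative stand-in for the paper's deletion of low-similarity pairs:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_coherent_motions(sim, min_sim=1.0):
    """Match coherent motions between two frames via the Hungarian algorithm.

    sim: (Nm, Nn) matrix of coherent-motion similarities (Eq. 6).
    Returns the matched pairs plus the unmatched region indices on each side.
    """
    # linear_sum_assignment minimizes cost, so negate to maximize similarity
    rows, cols = linear_sum_assignment(-sim)
    matched = [(i, j) for i, j in zip(rows, cols) if sim[i, j] >= min_sim]
    unmatched_m = sorted(set(range(sim.shape[0])) - {i for i, _ in matched})
    unmatched_n = sorted(set(range(sim.shape[1])) - {j for _, j in matched})
    return matched, unmatched_m, unmatched_n
```

The unmatched index sets returned here are exactly the inputs the unmatching-cost term of Eq. 10 operates on.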

Figure 8: Example of matched and unmatched coherent motion region sets (best viewed in color).

The next term, $S_U(m,n)$, is defined as


$$S_U(m,n)=-\left(\sum_{U\in\Psi_m}c(U)+\sum_{U\in\Psi_n}c(U)\right)\quad(10)$$

where $\Psi_m$ and $\Psi_n$ are the sets of unmatched coherent regions in frames $m$ and $n$, as shown in Fig. 8, and $c(U)$ is the unmatching cost for coherent motion region $U$:


$$c(U)=\frac{1}{\#\{r\,:\,R_r\cap U\neq\emptyset\}}\sum_{r:\,R_r\cap U\neq\emptyset}IC(R_r)\quad(11)$$

where $R_r$ is the $r$-th semantic region of the scene, the term $\#\{r : R_r\cap U\neq\emptyset\}$ represents the total number of semantic regions that overlap with the coherent motion region $U$, and $IC(R_r)$ is the importance cost measuring whether semantic region $R_r$ is important in distinguishing different recurrent activities. For example, assuming a scene includes two recurrent activities, as in Fig. 9, the semantic region on the right should have a larger importance cost, since the two recurrent activity patterns have different motion flows there. Comparatively, the semantic region on the left should have a smaller importance cost, since both recurrent activity patterns have similar flows there. Therefore, when calculating the similarity between frames $m$ and $n$, if there exists an unmatched coherent region inside the right-hand semantic region, a large importance cost is applied to reduce the inter-frame similarity, indicating that frames $m$ and $n$ have different recurrent activity patterns. On the contrary, if there exists an unmatched coherent region inside the left-hand semantic region, the inter-frame similarity is less affected, since a coherent region there is less indicative of the differences between recurrent activities.

Figure 9: Motion flows for two recurrent activities displayed over semantic regions.

To obtain the importance cost, we first perform a pre-clustering according to the matched coherent motion similarities, which roughly clusters frames into different recurrent activity groups. Then a vector is constructed for each semantic region whose k-th entry is the total number of coherent motions located in that region over the frames of the k-th pre-clustered recurrent activity group. Finally, the importance cost of a semantic region can be calculated by:



where the variance operation is taken over the entries of this vector and the result is normalized by the total number of frames to be clustered. According to Eq. 12, if coherent motions appear evenly in a semantic region for different recurrent activities, i.e., the variance is small, the region is less important in distinguishing different recurrent activities. On the contrary, if the appearance counts of coherent motions in the region vary strongly over different pre-clustered recurrent activity groups, a large importance cost is obtained, which increases the importance of the region. The complete process of frame-level clustering is illustrated in Algorithm 2.
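A minimal sketch of this importance cost, assuming the per-group appearance counts have already been gathered (the function and argument names are ours):

```python
import numpy as np

def importance_cost(appearance_counts, num_frames):
    """Importance cost of one semantic region (a sketch of Eq. 12).

    appearance_counts[k] = number of coherent motions falling inside the
    semantic region over the frames of the k-th pre-clustered activity
    group.  The variance over groups, normalized by the number of
    clustered frames, is large when the region behaves differently
    across activities and hence helps distinguish them.
    """
    return np.var(appearance_counts) / num_frames

# A region whose motions appear evenly across 4 groups is unimportant...
print(importance_cost([25, 25, 25, 25], 100))  # 0.0
# ...while an uneven region receives a larger cost.
print(importance_cost([80, 5, 10, 5], 100))    # 10.125
```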

Input: Coherent motion regions extracted for each frame, and the semantic regions of the scene
    Output: Recurrent activity groups, each including frames with similar recurrent activity patterns

1:  Calculate similarities between all coherent motion regions from different frames
2:  Calculate the matched-coherent-motion similarity term for all frame pairs based on these similarities
3:  Pre-cluster frames based on the matched-coherent-motion similarity term
4:  Calculate the importance cost for all semantic regions based on the pre-clustering result
5:  Calculate the unmatched-region cost term for all frame pairs according to their unmatched coherent regions
6:  Calculate inter-frame similarities for all frame pairs by combining the terms of lines 2 and 5
7:  Cluster frames into recurrent activity groups based on the inter-frame similarities
8:  Output the clustering result of line 7
Algorithm 2 Frame-level Clustering Process

Vi-B Coherent Motion Merging

After frame-level clustering, frames are grouped into different recurrent activity groups. Thus, by parsing the frames in each recurrent activity group, the complete motion patterns of each recurrent activity can be estimated. In this paper, we introduce a coherent motion merging step which merges similar coherent motions from the same recurrent activity group to obtain motion pattern regions. More specifically, we first apply the same operation as Step 1 of the two-step clustering scheme (Section V) to cluster coherent motion regions inside the same recurrent activity group. Then, coherent motions of the same cluster are merged together to form a motion pattern region. The merging process is described by Eq. 13 and Fig. 10.


M_c(z) = (1/n_c(z)) Σ_{U ∈ c} E_U(z)  if n_c(z)/N_c ≥ θ, and M_c(z) = 0 otherwise,  (13)

where c denotes the coherent motion cluster, M_c(z) is the merged motion vector result for cluster c at particle z, U is a coherent motion region belonging to cluster c, and N_c is the total number of coherent motion regions in cluster c. θ is a threshold which is set as 0.4 in our experiments. E_U(z) is the TEF thermal energy for region U at particle z; note that E_U(z) is set to 0 if particle z is outside the region of U. Finally, n_c(z) is the total number of non-zero TEF thermal energies at particle z over the regions belonging to c.

According to Eq. 13, the merged motion pattern region for a coherent motion cluster is basically the normalized summation over all coherent regions in the cluster. Besides, we further introduce the threshold to filter out noisy or isolated particles which have infrequent motions in the coherent motion cluster. An example of merged motion pattern regions is shown in Fig. 10.
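A minimal sketch of this merging, assuming the cluster's thermal energy fields are stacked into one array with zeros outside each region (the array layout and names are our own):

```python
import numpy as np

def merge_motion_pattern(energy_fields, theta=0.4):
    """Merge the TEFs of one coherent motion cluster (a sketch of Eq. 13).

    energy_fields: array of shape (N, H, W, 2) holding the thermal energy
    vector of each of the N coherent regions at every particle; entries
    are zero where a particle lies outside the region.
    """
    # Count, per particle, how many regions have a non-zero energy there.
    nonzero = np.any(energy_fields != 0, axis=-1)   # (N, H, W)
    counts = nonzero.sum(axis=0)                    # (H, W)
    # Normalized summation over all regions in the cluster.
    summed = energy_fields.sum(axis=0)              # (H, W, 2)
    merged = summed / np.maximum(counts, 1)[..., None]
    # Filter out particles that move in fewer than theta of the regions.
    merged[counts / len(energy_fields) < theta] = 0.0
    return merged

# Two regions over a 1x2 particle grid: both cover particle 0,
# only the second covers particle 1.
fields = np.array([[[[1.0, 0.0], [0.0, 0.0]]],
                   [[[1.0, 0.0], [0.0, 2.0]]]])    # shape (2, 1, 2, 2)
merged = merge_motion_pattern(fields, theta=0.6)
print(merged)  # particle 0 keeps its averaged vector; particle 1 is filtered
```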

Figure 10: Process of similar coherent motion merging. The frames are from the same recurrent activity group; the green coherent motion regions belong to one coherent motion cluster, and the blue coherent motion regions belong to another coherent motion cluster. (Best viewed in color.)

Vi-C Flow Curve Extraction

The motion pattern regions achieved in the previous step can represent the complete motion information for each recurrent activity. However, since motion pattern regions may overlap with each other and the contours of motion pattern regions may also be irregular, it is necessary to extract flow curves from these motion pattern regions such that recurrent activities can be more clearly described and visualized.

Our proposed flow curve extraction process is described by Algorithm 3 and Fig. 11. Our approach first sequentially cuts a motion pattern region into sub-regions along its motion direction, and then links the centroids of the sub-regions together to obtain the output flow curve. With the above process, the extracted flow curve can accurately capture the major motion flow of a motion pattern region. Furthermore, it should be noted that in step 5 of Algorithm 3, if the line perpendicular to the motion vector at the current segmentation point intersects a branched motion region (i.e., the motion region diverges around that point), multiple segmentation points are obtained, and the subsequent flow curve extraction is performed on each of them respectively. In this way, we can properly obtain branched flow curves at the branch region.

Input: A motion pattern region merged from a coherent region cluster
Output: A flow curve extracted from the motion pattern region

1:  Calculate the skeleton of the motion pattern region [35]
2:  Find the end point of the skeleton which is at the “backward” position relative to all other end points, where the “backward” direction is defined as the reversed direction of the motion flows in the region
3:  Set this end point as the current segmentation point
4:  while the current segmentation point is inside the region do
5:      Reset the current segmentation point to the middle point of the line perpendicular to the motion vector at it
6:      for each of a fixed number of movement steps do
7:          Move the point forward along the motion vector at its current position
8:      end for
9:      Take the resulting point as the next segmentation point
10:     Draw two straight lines perpendicular to the motion vectors at the current and the next segmentation points, respectively
11:     Calculate the centroid of the sub-region segmented by the lines of line 10
12:     Set the next segmentation point as the current segmentation point
13:  end while
14:  Sequentially link together all centroids achieved by line 11
15:  Smooth the linked curve
16:  Output the smoothed curve
Algorithm 3 Flow Curve Extraction
Figure 11: Flow curve extraction process. (a) Find segmentation points, draw straight lines to obtain sub-regions, and calculate the centroid of each sub-region; (b) link the centroids to obtain the extracted flow curve. (Best viewed in color.)
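A heavily simplified sketch of this idea, assuming a purely horizontal motion direction and omitting the skeleton, smoothing, and branch handling of Algorithm 3:

```python
import numpy as np

def flow_curve(mask, band=5):
    """Simplified flow-curve sketch: cut a motion pattern region into
    sub-regions along the (here: horizontal) motion direction and
    sequentially link the centroids of the sub-regions.

    mask: boolean (H, W) motion pattern region whose flow runs left to
    right; `band` is the width of each sub-region in particles.
    """
    ys, xs = np.nonzero(mask)
    curve = []
    for x0 in range(xs.min(), xs.max() + 1, band):
        # Particles of the sub-region between two perpendicular cuts.
        inside = (xs >= x0) & (xs < x0 + band)
        if inside.any():
            curve.append((float(xs[inside].mean()), float(ys[inside].mean())))
    return curve  # linked centroids (x, y)

# A horizontal band of particles: the extracted curve runs through its middle.
mask = np.zeros((10, 20), dtype=bool)
mask[3:7, :] = True
print(flow_curve(mask, band=5))  # [(2.0, 4.5), (7.0, 4.5), (12.0, 4.5), (17.0, 4.5)]
```

In the full algorithm the cuts follow the local motion vectors instead of a fixed axis, which is what allows curved and branched flows to be traced.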

Vii Experimental Results

Our approach is implemented in Matlab, and optical flow fields [32] are used as the input motion vector fields, with each pixel in the frame viewed as a particle. In order to obtain motion vector fields over multi-frame intervals, the particle advection method [17] is used, which tracks the movement of each particle over the frames of the interval.

Vii-a Results for Coherent Motion Detection

We perform experiments on a dataset including 30 different crowd videos collected from the UCF dataset [17], the UCSD dataset [39], the CUHK dataset [9], and our own collected set. This dataset covers various real-world crowd scene scenarios with both low- and high-density crowds and both rapid and slow motion flows. Some example frames of the dataset are shown in Fig. 12.

Figure 12: Coherent motion extraction results. (a): Ground Truth, (b): Results of our approach, (c): Results of [17], (d): Results of [7], (e): Results of [8], (f): Results of [9], (g): Results of [40], (h): Results of [28]. (Best viewed in color)

We compare our approach with four state-of-the-art coherent motion detection algorithms: The Lagrangian particle dynamics approach [17], the local-translation domain segmentation approach [7], the coherent-filtering approach [8], and the collectiveness measuring-based approach [9]. In order to further demonstrate the effectiveness of our approach, we also include the results of a general motion segmentation method [40] and an anisotropic-diffusion-based image segmentation method [28].

Qualitative comparison on coherent motion detection. Fig. 12 compares the coherent motion detection results of the different methods, with the manually labeled ground truth shown in the first column. From Fig. 12, we can see that our approach achieves better coherent motion extraction than the compared methods. For example, in sequence 1, our approach can effectively extract the circle-shaped coherent motion. Comparatively, the method in [17] can only detect part of the circle, while the methods in [8] and [9] fail to work since few reliable key points can be extracted from this over-crowded scene. For sequences 2 and 4, where multiple complex motion flows exist, our approach can still precisely detect the small and less differentiable coherent motions, such as the pink region on the bottom and the blue region on the top in sequence 2. The compared methods have low effectiveness in identifying these regions due to interference from the neighboring motion regions. In sequences 3 and 6, since the motions at the top of the frame are extremely small and close to the background, the compared methods fail to include these particles in the coherent motion region. However, in our approach, these small motions can be suitably strengthened and included through the thermal diffusion process. Furthermore, the methods in [40] and [28] do not show satisfying results, e.g., in sequences 5 and 6. This is because: (1) the crowd scenes are extremely complicated, so the extracted particle flows or trajectories become unreliable, making it difficult for the general motion segmentation method [40] to produce precise results; (2) since many coherent region boundaries in the crowd motion fields are rather vague and unrecognizable, good boundaries cannot be easily achieved without suitably utilizing the characteristics of the motion vector fields. Thus, simply applying an existing anisotropic-diffusion segmentation method [28] cannot achieve satisfying results.

Capability to handle disconnected coherent motions. Sequences 5-8 in Fig. 12 compare the algorithms’ capability in handling disconnected coherent motions. In sequence 7, we manually block one part of the coherent motion region while in sequences 5, 6, and 8, the red or green coherent motion regions are disconnected due to occlusion by other objects or low density. Since the disconnected regions are separated far from each other, most compared methods wrongly segment them into different coherent motion regions. However, with our thermal diffusion process and two-step clustering scheme, these regions can be successfully merged into one coherent region.

Quantitative comparison. Table I compares the quantitative results of the different methods. In Table I, the average Particle Error Rate (PER) and the average Coherent Number Error (CNE) over all the sequences in our dataset are compared to measure the overall accuracy of coherent motion detection. PER is calculated by PER = (# of wrong particles) / (total # of particles). CNE is calculated by CNE = (1/S) Σ_s |n_s − m_s|, where n_s and m_s are the numbers of detected and ground-truth coherent regions for sequence s, respectively, and S is the total number of sequences.
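These two error measures can be sketched directly from their definitions; the CNE form below, a mean absolute difference of region counts, is our reading of the definition above:

```python
def per(num_wrong, num_total):
    """Particle Error Rate, in percent: the fraction of wrongly
    labeled particles among all particles."""
    return 100.0 * num_wrong / num_total

def cne(detected_counts, gt_counts):
    """Coherent Number Error: mean absolute difference between the
    detected and ground-truth coherent region counts over sequences."""
    assert len(detected_counts) == len(gt_counts)
    return sum(abs(d - g) for d, g in zip(detected_counts, gt_counts)) / len(gt_counts)

print(per(78, 1000))             # 7.8
print(cne([3, 2, 4], [3, 3, 4]))
```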

         Proposed  [17]   [7]    [8]    [9]    [40]   [28]
PER (%)  7.8       32.5   19.5   25.6   24.1   66.4   21.4
CNE      0.14      1.24   0.93   1.05   0.96   1.78   0.84
Table I: Average PER and CNE for all sequences in the dataset.

Table I further demonstrates the effectiveness of our approach. In Table I, we can see that 1) Our approach can achieve smaller coherent detection error rates than the other methods, 2) Our approach can accurately obtain the coherent region numbers (close to the ground truth) while other methods often over-segment or under-segment the coherent regions.

Effect of different parameter values. Finally, Fig. 13 shows the results of our approach under different values of the two diffusion parameters in Eqs. 3 and 4. From Figs. (a)a to (c)c, we can see that the first parameter mainly governs the thermal diffusion distance: a small value makes the thermal energies diffuse farther and thus yields larger coherent motion regions, while increasing it shrinks the extracted coherent motion regions. Furthermore, the second parameter determines the directivity of thermal diffusion: as it increases, the diffused thermal energies concentrate more along the motion direction of the source heat particles; as it decreases, the thermal energies propagate more uniformly in all directions around the heat source particle. Thus, the boundaries shrink horizontally for larger values of this parameter, as in Fig. (e)e. However, note that in all examples in Fig. 13, our approach always suitably merges coherent regions together, even when they become disconnected as the parameter values change.

Figure 13: Coherent motion detection results of our approach under different parameter values.

Vii-B Results for Semantic Region Construction and Pre-defined Activity Recognition

We perform experiments on two crowd videos in our dataset, shown in the first and second rows of Fig. 14. 400 video clips are selected from each video, with each clip including 20 frames. Four crowd activities are defined for each video, and example frames of the crowd activities are shown in Fig. 14. Note that these videos are challenging in that: (1) the crowd density in the scene varies frequently, including both high-density clips, as in Fig. (d)d, and low-density clips, as in Fig. (c)c; (2) the motion patterns vary across the different activities, making it difficult to construct meaningful and stable semantic regions; (3) there are large numbers of irregular motions that disturb the normal motion patterns (e.g., people running red lights or bicycles following irregular paths); (4) the number of clips in the dataset is small, which increases the difficulty of constructing reliable semantic regions. Moreover, in order to further demonstrate the effectiveness of our approach, we also perform experiments on the public QMUL Junction dataset [41], where five crowd activities are defined, as shown in the third row of Fig. 14.

(a) HD
(b) HP
(c) BT
(d) VP
(e) VL
(f) BT
(g) HP
(h) HU
(i) VR
(j) HU
(k) VP
(l) HD
(m) VB
Figure 14: Example frames of the ground-truth activities in different videos. First and second rows: videos in our dataset; third row: video of the QMUL Junction dataset [41]. HD: Horizontal pass and down turn; HP: Horizontal pass; BT: Both turn; VP: Vertical pass; VL: Vertical pass and left turn; HU: Horizontal pass and up turn; VR: Vertical pass and right turn; VB: Vertical pass and both turn.

Accuracy on semantic region construction. For each video in Fig. 14, we randomly select 200 video clips and use them to construct the corresponding semantic regions. Fig. 15 compares the results of four methods: (1) Our approach (“Our”), (2) Directly cluster regions based on the particles’ TEF vectors (“Direct”, note that our approach differs from this method by clustering over the cluster label vectors), (3) Use [7] to achieve coherent motion regions and then apply our two-step clustering scheme to construct semantic regions (“[7]+Two-Step”, we show the results of [7] because in our experiments, [7] has the best semantic region construction results among the compared methods in Table I), (4) The activity-based scene segmentation method in [4] (“[4]”). We also show original scene images and plot all major activity flows to ease the comparison (“original scene”).

Fig. 15 shows that the methods utilizing “coherent motion cluster label” information (“Our” and “[7]+Two-Step”) create more meaningful semantic regions than the other methods, e.g., by successfully identifying the horizontal motion regions in the middle of the scene in Fig. (b)b. This shows that our cluster-label features can effectively strengthen the correlation among particles and thus facilitate semantic region construction. Furthermore, comparing our approach with the “[7]+Two-Step” method, it is obvious that the semantic regions obtained by our approach are more accurate (e.g., more precise semantic region boundaries and more meaningful segmentations of the scene). This further shows that more precise coherent motion detection leads to more accurate semantic regions.

(a) Original
(b) Our
(c) Direct
(d) [7]
(e) [4]
(f) Original
(g) Our
(h) Direct
(i) [7]
(j) [4]
(k) Original
(l) Our
(m) Direct
(n) [7]
(o) [4]
Figure 15: Constructed semantic regions of different methods for the videos in Fig. 14. The caption “[7]” denotes the method “[7]+Two step”. (Best viewed in color)

Performances on recognizing pre-defined activities. In order to recognize the pre-defined activities in Fig. 14, for each video, we randomly select 200 video clips and construct semantic regions by the methods in Fig. 15. After that, we derive features from the TEF and train SVM classifiers by the method in Section V-A. Finally, we perform recognition on the other 200 video clips of the same video. Besides, we also include the results of two additional methods: (1) a state-of-the-art dense-trajectory-based recognition method [3] (“Dense-Traj”); (2) a method which uses our semantic regions but derives the motion features in each semantic region from the input motion field, i.e., the optical flow (“Our+OF”). From the recognition accuracy shown in Table II, we observe that:

  1. Methods using more meaningful semantic regions (i.e., “Our”, “Our+OF”, and “[7]+Two-Step”) achieve better results than the other methods. This shows that suitable semantic region construction can greatly facilitate activity recognition.

  2. Approaches using the TEF (“Our”) achieve better results than those using the input motion field (“Our+OF”). This demonstrates that, compared with the input motion field, our TEF can more effectively represent the semantic regions’ motion patterns.

  3. The dense-trajectory method [3], which extracts global features, does not achieve satisfying results. This is because the global features still have limitations in capturing the subtle differences among activities. This further implies the usefulness of semantic region decomposition in analyzing crowd scenes.

                 Our (%)  Our+OF (%)  Direct (%)  [7]+Two-Step (%)  [4] (%)  Dense-Traj (%)
Fig. (a)a video  92.2     87.75       77.0        89.5              79.2     67.0
Fig. (e)e video  90.69    83.83       73.53       81.76             72.35    69.80
Fig. (i)i video  93.58    91.03       83.42       88.32             84.83    82.69
Table II: Recognition accuracy of different methods.
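As an illustration of this train-and-recognize pipeline, the sketch below fits a multi-class SVM on hypothetical per-region TEF features using scikit-learn's LIBSVM-backed `SVC`; all shapes, feature choices, and names here are assumptions for illustration, not the actual setup of Section V-A:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical per-clip features: e.g., a mean TEF vector per semantic
# region, flattened (4 regions x 2 components = 8 values per clip).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))      # 200 training clips
y_train = rng.integers(0, 4, size=200)   # 4 pre-defined activities
X_test = rng.normal(size=(200, 8))       # 200 held-out clips

clf = SVC(kernel="rbf")                  # LIBSVM-backed classifier [37]
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(pred.shape)  # (200,)
```

With real data, the recognition accuracy reported in Table II is simply the fraction of held-out clips whose predicted activity matches the ground truth.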

Vii-C Results for Recurrent Activity Mining

In this experiment, we use the same videos as in Fig. 14 for mining recurrent activities. For each video, we sample one frame per second, then calculate coherent motions for the sampled frames, and finally apply our cluster-and-merge process to obtain recurrent activity patterns. Note that the target of recurrent activity mining is to automatically discover recurrent activities from an input video without pre-defining activity types or pre-labeling training data. Ideally, a good activity mining approach should produce activity patterns similar to the human-observed activity types in Fig. 14.

Performances on frame-level clustering. For each video, we apply our frame-level clustering step to cluster the sampled frames into four recurrent activity groups. Our clustering results are compared with two methods: (1) Direct clustering. Directly clustering based on the TEF difference between two frames (i.e., use the summation of absolute thermal energy differences between the co-located particles in two TEFs as the inter-frame similarity). (2) Pre-clustering. Using the matched-coherent-motion similarities in Eq. 9 as the inter-frame similarity for clustering.
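The inter-frame distance of the "direct clustering" baseline can be sketched as follows, assuming each TEF is stored as an (H, W, 2) array of thermal energy vectors (the layout is our assumption):

```python
import numpy as np

def tef_difference(tef_a, tef_b):
    """Distance used by the 'direct clustering' baseline: the summation
    of absolute thermal energy differences between the co-located
    particles of two TEFs (each an (H, W, 2) array)."""
    return float(np.abs(tef_a - tef_b).sum())

# Two toy 2x2-particle TEFs that differ by one unit per vector component.
tef_a = np.zeros((2, 2, 2))
tef_b = np.ones((2, 2, 2))
print(tef_difference(tef_a, tef_b))  # 8.0
```

As discussed next, this pixel-wise distance is brittle: two frames of the same recurrent activity can show different parts of the activity flow and thus have very different TEFs.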

Fig. 16 compares the clustering confusion matrices of the different methods. From Fig. 16, we can see that since frames of the same recurrent activity may contain different parts of a complete activity flow (e.g., Fig. 10), their TEFs may differ substantially. Therefore, directly using the TEF difference for clustering (i.e., direct clustering) cannot achieve satisfying results. Comparatively, by including coherent motions to evaluate inter-frame similarities (i.e., “pre-clustering” and “our”), the clustering accuracy can be improved. However, the pre-clustering method still has limitations in differentiating similar recurrent activities, e.g., HP and HU in Figs (g)g and (h)h. Comparatively, by introducing the importance cost of semantic regions to measure the effects of unmatched coherent motions, our frame-level clustering approach has a stronger capability of differentiating similar recurrent activity patterns.

(a) Our
(b) Direct Clustering
(c) Pre-clustering
(d) Our
(e) Direct Clustering
(f) Pre-clustering
(g) Our
(h) Direct Clustering
(i) Pre-clustering
Figure 16: Confusion matrices of the clustering results. (The clustering results in the 1st, 2nd, and 3rd rows correspond to the videos in the 1st, 2nd, and 3rd rows of Fig. 14, respectively.)

Performances on coherent motion merging and flow curve extraction. Fig. 17 shows the results of our coherent motion merging and flow curve extraction steps. Besides, we also compare our approach with a state-of-the-art activity mining method which utilizes a Probabilistic Latent Sequential Motif (PLSM) model to discover recurrent activities [11]; its results are shown in the last rows of Figs. (a)a, (b)b, and (c)c. From Fig. 17, we can make the following observations:

  1. The recurrent activities mined by our approach are similar to the human-observed activity types in Fig. 14. This demonstrates that our proposed cluster-and-merge process can effectively discover the desired activity types from an input video.

  2. Note that although the clustering result of our frame-level clustering step is not 100 percent accurate (as in Fig. 16), the extracted flow curves are less affected by the wrongly clustered frames because: (i) the noisy or isolated thermal energy vectors from the wrongly clustered frames are filtered out by the threshold in Eq. 13; (ii) the flow curve extraction process further reduces the effects of wrong frames by dividing sub-regions to derive flow curves, as in Fig. (a)a.

  3. Comparing our approach with the PLSM-based method [11], we can see that: (i) by introducing coherent regions to measure inter-frame similarities and derive motion pattern regions, our approach achieves cleaner activity flows which are more coherent with the human-observed activity types in Fig. 14; comparatively, the results of the PLSM-based method still include noisy motion patterns, e.g., the last column in Fig. (b)b. (ii) Our approach can precisely differentiate motion flows inside a recurrent activity, whereas the PLSM-based method has limitations in differentiating motion flows when they are located close to each other, e.g., the second column in Fig. (a)a. (iii) The differences between similar recurrent activities are clearly differentiated and visualized by our approach, while they are less obvious in the results of the PLSM-based method, e.g., the third and fourth columns in Fig. (b)b.

(a) Recur. act. mining results for video of Fig. (a)a
(b) Recur. act. mining results for video of Fig. (e)e
(c) Recur. act. mining results for video of Fig. (i)i
Figure 17: Recurrent activity mining results. First rows in (a), (b), and (c): the merged motion pattern regions and the extracted flow curves by our approach; Second rows in (a), (b), and (c): the extracted flow curves of our approach displayed over the video frame; Third rows in (a), (b), and (c): the recurrent activities extracted by [11].

Viii Conclusion

In this paper, we study the problems of coherent motion detection, semantic region construction, and recurrent activity mining in crowd scenes. A thermal-diffusion-based algorithm and a two-step clustering scheme are introduced, which achieve more meaningful coherent motion and semantic region results. Based on the extracted coherent motions and semantic regions, a cluster-and-merge process is further proposed which can effectively discover desirable activity patterns from a crowd video. Experiments on various videos show that our approach achieves state-of-the-art performance.


  • [1] W. Wang, W. Lin, Y. Chen, J. Wu, J. Wang, and B. Sheng, “Finding coherent motions and semantic regions in crowd scenes: a diffusion and clustering approach,” European Conf. Computer Vision (ECCV), pp. 756–771, 2014.
  • [2] M. Hu, S. Ali, and M. Shah, “Learning motion patterns in crowded scenes using motion flow field,” Intl. Conf. Pattern Recognition (ICPR), pp. 1–5, 2008.
  • [3] J. Wu, Y. Zhang, and W. Lin, “Towards good practices for action video encoding,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 2577–2584, 2014.
  • [4] C. C. Loy, T. Xiang, and S. Gong, “Multi-camera activity correlation analysis,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1988–1995, 2009.
  • [5] T. Mei, Y. Rui, S. Li, and Q. Tian, “Multimedia search reranking: A literature survey,” ACM Computing Surveys, vol. 2, no. 3, pp. 1:1–1:36, 2012.
  • [6] H. Liu, T. Mei, J. Luo, H. Li, and S. Li, “Finding perfect rendezvous on the go: accurate mobile visual localization and its applications to routing,” ACM Intl. Conf. Multimedia (MM), pp. 9–18, 2012.
  • [7] S. Wu and H. Wong, “Crowd motion partitioning in a scattered motion field,” IEEE Trans. Systems, Man, and Cybernetics, vol. 42, no. 5, pp. 1443–1454, 2012.
  • [8] B. Zhou, X. Tang, and X. Wang, “Coherent filtering: detecting coherent motions from crowd clutters,” European Conf. Computer Vision (ECCV), pp. 857–871, 2012.
  • [9] B. Zhou, X. Wang, and X. Tang, “Measuring crowd collectiveness,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 3049–3056, 2013.
  • [10] R. Emonet, J. Varadarajan, and J.-M. Odobez, “Temporal analysis of motif mixtures using Dirichlet processes,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 36, no. 1, pp. 140–156, 2013.
  • [11] V. Jagannadan, R. Emonet, and J.-M. Odobez, “A sequential topic model for mining recurrent activities from long term video logs,” Intl. J. Computer Vision, vol. 103, no. 1, pp. 100–126, 2013.
  • [12] B. Zhou, X. Wang, and X. Tang, “Random field topic model for semantic region analysis in crowded scenes from tracklets,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 3441–3448, 2011.
  • [13] X. Wang, K. T. Ma, G. Ng, and E. Grimson, “Trajectory analysis and semantic region modeling using nonparametric Bayesian models,” Intl. J. Computer Vision, vol. 96, pp. 287–321, 2011.
  • [14] W. Hu, X. Li, G. Tian, S. Maybank, and Z. Zhang, “An incremental DPMM-based method for trajectory clustering, modeling, and retrieval,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1051–1065, 2013.
  • [15] B. Morris and M. Trivedi, “Trajectory learning for activity understanding: unsupervised, multilevel, and long-term adaptive approach,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 11, pp. 2287–2301, 2011.
  • [16] J. Nascimento, M. A. T. Figueiredo, and J. S. Marques, “Trajectory classification using switched dynamical Hidden Markov Models,” IEEE Trans. Image Processing, vol. 19, no. 5, pp. 1338–1348, 2010.
  • [17] S. Ali and M. Shah, “A Lagrangian particle dynamics approach for crowd flow segmentation and stability analysis,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1–6, 2007.
  • [18] R. Mehran, A. Oyama, and M. Shah, “Abnormal crowd behavior detection using social force model,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 935–942, 2009.
  • [19] B. Zhan, D. Monekosso, P. Remagnino, S. Velastin, and L. Xu, “Crowd analysis: a survey,” Machine Vision and Applications, vol. 19, no. 5-6, pp. 345–357, 2008.
  • [20] D. Cremers and S. Soatto, “Motion competition: A variational approach to piecewise parametric motion segmentation,” Intl. J. Computer Vision, vol. 62, no. 3, pp. 249–265, 2005.
  • [21] T. Brox, M. Rousson, R. Deriche, and J. Weickert, “Colour, texture, and motion in level set based segmentation and tracking,” Image and Vision Computing, vol. 28, no. 3, pp. 376–390, 2010.
  • [22] X. Cui, Q. Liu, M. Gao, and D. N. Metaxas, “Abnormal detection using interaction energy potentials,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 3161–3167, 2011.
  • [23] L. Li and Y. Yang, “Optical flow estimation for a periodic image sequence,” IEEE Trans. Image Processing, vol. 19, no. 1, pp. 1–10, 2010.
  • [24] L. Xu, J. Jia, and Y. Matsushita, “Motion detail preserving optical flow estimation.” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 9, pp. 1744–1757, 2012.
  • [25] D. Lin, E. Grimson, and J. Fisher, “Learning visual flows: a Lie algebraic approach,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 747–754, 2009.
  • [26] A. Bimbo, P. Nesi, and J. Sanz, “Optical flow computation using extended constraints,” IEEE Trans. Image Processing, vol. 5, no. 5, pp. 720–739, 1996.
  • [27] M. J. Black, G. Sapiro, D. H. Marimont, and D. Heeger, “Robust Anisotropic Diffusion,” IEEE Trans. Image Processing, vol. 7, no. 3, pp. 421–432, 1998.
  • [28] Y. Wu, Y. Wang, and Y. Jia, “Adaptive diffusion flow active contours for image segmentation,” Computer Vision and Image Understanding, vol. 117, no. 10, pp. 1421–1435, 2013.
  • [29] J. Weickert, Anisotropic diffusion in image processing.   Stuttgart: Teubner, 1998.
  • [30] T. Xu, P. Peng, X. Fang, C. Su, Y. Wang, Y. Tian, W. Zeng, and T. Huang, “Single and multiple view detection, tracking and video analysis in crowded environments,” Intl. Conf. Advanced Video and Signal-Based Surveillance (AVSS), pp. 494–499, 2012.
  • [31] J. Li, S. Gong, and T. Xiang, “Scene segmentation for behavior correlation,” European Conf. Computer Vision (ECCV), pp. 383–395, 2008.
  • [32] A. Bruh, J. Weickert, and C. Schnorr, “Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods,” Intl. J. Computer Vision, vol. 61, no. 3, pp. 211–231, 2005.
  • [33] H. Carslaw and J. Jaeger, Conduction of heat in solids.   Oxford Science Publications, 1959.
  • [34] H. Edelsbrunner and N. Shah, “Incremental topological flipping works for regular triangulations,” Algorithmica, vol. 15, no. 3, pp. 223–241, 1996.
  • [35] R. Gonzales and R. Woods, Digital image processing.   Prentice Hall, 2001.
  • [36] Z. Lu, X. Yang, W. Lin, H. Zha, and X. Chen, “Inferring user image search goals under the implicit guidance of users,” IEEE Trans. Circuits and Systems for Video Technology, vol. 24, no. 3, pp. 394–406, 2014.
  • [37] C. Chang and C. Lin, “LIBSVM: a library for support vector machines,” ACM Trans. Intelligent Systems Technology, vol. 2, no. 3, pp. 1–27, 2011.
  • [38] L. Lovasz and M. Plummer, Matching theory.   North Holland, 1986.
  • [39] http://www.svcl.ucsd.edu/projects/anomaly/.
  • [40] T. Brox and J. Malik, “Object segmentation by long term analysis of point trajectories,” European Conf. Computer Vision (ECCV), pp. 282–295, 2010.
  • [41] http://www.eecs.qmul.ac.uk/~ccloy/downloads_qmul_junction.html.