Explainable Video Action Reasoning via Prior Knowledge and State Transitions

08/28/2019 · by Tao Zhuo et al. · National University of Singapore

Human action analysis and understanding in videos is an important and challenging task. Although substantial progress has been made in past years, the explainability of existing methods is still limited. In this work, we propose a novel action reasoning framework that uses prior knowledge to explain semantic-level observations of video state changes. Our method takes advantage of both classical reasoning and modern deep learning approaches. Specifically, prior knowledge is defined as the information of a target video domain, including a set of objects, attributes and relationships, as well as the relevant actions defined by temporal attribute and relationship changes (i.e., state transitions). Given a video sequence, we first generate a scene graph on each frame to represent the concerned objects, attributes and relationships. These scene graphs are then linked by tracking objects across frames to form a spatio-temporal graph (also called a video graph), which represents semantic-level video states. Finally, by sequentially examining each state transition in the video graph, our method can detect and explain how actions are executed using the prior knowledge, much as a human would reason about them. Compared to previous works, the action reasoning results of our method can be explained by both logical rules and semantic-level observations of video content changes. In addition, the proposed method can detect multiple concurrent actions with detailed information, such as who (particular objects), when (time), where (object locations) and how (what kind of changes). Experiments on a re-annotated CAD-120 dataset show the effectiveness of our method.


1. Introduction

Figure 1. Video action reasoning with prior knowledge and semantic-level state transitions. All of the concerned objects, attributes and relationships are represented in the generated video graph; each attribute or relationship transition can be explained as a performed action by logical rules. Two actions, “open” (microwave_1, closed to open, frame 216) and “pick” (hand_2 and cloth_1, not_holding to holding, frame 242), are detected and explained by an attribute change and a relationship change, respectively.

Human action analysis and understanding in videos is an important problem in multimedia content analysis and a crucial component of human-machine interaction systems. Recently, with the success of deep learning in a variety of computer vision tasks, great progress has been achieved in video action recognition with various deep neural networks (Carreira and Zisserman, 2017; Wang et al., 2018a; Feichtenhofer et al., 2018; Tran et al., 2018; Wang et al., 2018c). Compared to early action recognition approaches (Intille and Bobick, 2001; Tran and Davis, 2008; Ijsselmuiden and Stiefelhagen, 2010; Morariu and Davis, 2011; Brendel et al., 2011) that perform logical reasoning with rules on low-level features (e.g., gradients, motion, location and trajectories), deep learning based methods can exploit the semantic-level information (e.g., attributes of an object and relationships between objects) in video frames. However, due to the lack of rules for logical reasoning, most existing deep learning based action recognition methods (Carreira and Zisserman, 2017; Wang et al., 2018a; Feichtenhofer et al., 2018; Tran et al., 2018) cannot provide detailed information to explain how an action is performed, such as who (particular objects), when (time), where (object locations) and how (what kind of changes). In this paper, we develop a novel video action reasoning framework that uses rules to understand and explain semantic-level video state changes, effectively bridging the gap between classical reasoning and modern deep learning based approaches.

Deep neural networks have been widely used in video action recognition from various perspectives. The popular two-stream convolutional networks (Carreira and Zisserman, 2017; Wang et al., 2018a; Feichtenhofer et al., 2018; Tran et al., 2018) can capture complementary information on appearance from still frames and motion between frames. Besides, spatio-temporal graphs with Recurrent Neural Networks (RNN) (Jain et al., 2016) or Graph Convolutional Networks (GCN) (Kipf and Welling, 2017; Wang and Gupta, 2018; Guo et al., 2018; Qi et al., 2018) focus on structured video representation. Recently, with the advances of deep learning in scene graph representation (Krishna et al., 2017; Dai et al., 2017; Xu et al., 2017), researchers have attempted to use attributes of an object and the relationships between objects for semantic-level video content understanding. For example, Alayrac et al. (Alayrac et al., 2017) automatically discovered the states of objects (e.g., empty/full of a cup) and the associated manipulation actions. Liu et al. (Liu et al., 2017b) jointly recognized object fluents (i.e., changeable attributes of objects and relationships between objects) and tasks in egocentric videos. Zhu et al. (Zhu et al., 2017) predicted a sequence of actions from visual semantic observations. Action recognition accuracy has thus improved significantly in the past few years. However, without rules for logical reasoning, the explainability of these deep learning based methods is limited: they cannot explain how an action is performed with detailed information, i.e., who, when, where and how. Moreover, during their training and testing stages, a video sequence is often assigned only a single action category label, and most of these methods are incapable of detecting concurrent actions in complex videos.

To develop an explainable video understanding framework, early approaches (Intille and Bobick, 2001; Tran and Davis, 2008; Ijsselmuiden and Stiefelhagen, 2010; Morariu and Davis, 2011; Brendel et al., 2011) often use first-order logic (Richardson and Domingos, 2006; Russell and Norvig, 2009) to predict performed actions or events. Given a video sequence and predefined rules on a spatio-temporal video representation, these algorithms first detect and track the concerned objects across the video sequence, and then apply first-order logic to detect the actions that occurred. Based on the rule-based action definitions, the time and location of a performed action can be determined on the spatio-temporal video representation. In addition, because of the flexibility of the logical reasoning framework, concurrent actions can also be detected (Morariu and Davis, 2011). However, due to the low-level image features (e.g., motion, location, foreground region) used in these early approaches, their explainability and practical applicability are still limited. For example, Morariu et al. (Morariu and Davis, 2011) only used location information (e.g., players, ball and hoop) to describe the observations of a basketball scene, which is insufficient to describe the “tumble” state of a player, because such a semantic state simply cannot be captured by locations.

In this paper, we propose a novel action reasoning framework that uses prior knowledge to explain semantic-level observations of video content changes. Prior knowledge is the information about a domain that can be used to solve problems in that domain (Russell and Norvig, 2009); in this paper, it consists of the concerned objects, attributes, relationships, and state-transition based action definitions. In the field of artificial intelligence (AI), an action indicates something done by an agent, and it can be observed as a high-level state transition (Russell and Norvig, 2009) from precondition to effect. The precondition defines the states in which the action can be executed, and the effect defines the result of executing the action (Russell and Norvig, 2009). A state here can be an attribute of an object (e.g., a microwave is closed or open) or a relationship between two objects (e.g., a hand is holding or not_holding a cup). Given prior knowledge with a set of rules for logical reasoning, performed actions can be detected on the semantic-level video content representation. Accordingly, we define two action reasoning models: an attribute-transition based model and a relationship-transition based model, or the AAR and RAR models for short. For example, when the attribute state of a microwave changes from closed (precondition) to open (effect), it can be inferred as an “open” action by the AAR model. Different from low-level feature representations (e.g., appearance, location and motion), the semantic-level state is explainable as it can be understood by humans.
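To make this state-transition view concrete, the following minimal Python sketch encodes a few attribute- and relationship-transition rules in the spirit of the AAR and RAR models and maps an observed transition to an action label; the rule entries and function names are illustrative assumptions, not the actual implementation.

```python
# Illustrative sketch of state-transition based action definitions.
# Attribute rules: (object_category, precondition, effect) -> action
AAR_RULES = {
    ("microwave", "closed", "open"): "open",
    ("microwave", "open", "closed"): "close",
}

# Relationship rules: (subject, object, precondition, effect) -> action
RAR_RULES = {
    ("hand", "cup", "not_holding", "holding"): "pick",
    ("hand", "cup", "holding", "not_holding"): "place",
}

def explain_attribute_transition(obj, before, after):
    """Map an observed attribute transition to an action, or 'null' if undefined."""
    return AAR_RULES.get((obj, before, after), "null")

def explain_relationship_transition(subj, obj, before, after):
    """Map an observed relationship transition to an action, or 'null' if undefined."""
    return RAR_RULES.get((subj, obj, before, after), "null")

print(explain_attribute_transition("microwave", "closed", "open"))               # -> open
print(explain_relationship_transition("hand", "cup", "not_holding", "holding"))  # -> pick
```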

Using the AAR and RAR models, we design a spatio-temporal graph for semantic-level video state representation, namely the video graph. Specifically, we extend the scene graph (Krishna et al., 2017) for images into a spatio-temporal structure for video graph generation. Each node in our video graph denotes an object and its attributes (including the object category, location and state, e.g., the closed state of a microwave); each edge represents a type of semantic relationship between two objects (e.g., a hand is holding a cup). By detecting and tracking the concerned objects across the video sequence, a spatio-temporal video graph can be generated by sequentially linking the scene graph of each frame throughout the video. By observing the video graph in temporal order, when the semantic-level video state changes, we can not only recognize the action category with the AAR and RAR models, but also explain what happened with detailed state transition information (i.e., who, when, where and how), as in the example shown in Figure 1. Moreover, since each state transition is independently detected and recognized, multiple concurrent actions can also be handled by the proposed method. To evaluate our method, we re-annotated the CAD-120 dataset with detailed object categories, locations, attributes, relationships and actions. Experimental results on this dataset demonstrate the effectiveness of the proposed method.

In summary, our contributions are as follows: (1) We propose a novel video action reasoning framework that uses prior knowledge to explain semantic-level video state changes. Compared to previous works, our action reasoning results can be explained by both logical rules and semantic-level observations of video content. (2) We design a video graph representation for semantic-level video content understanding, which can detect multiple concurrent actions and provide detailed information for reasoning about them. (3) We re-annotate the CAD-120 dataset with additional objects, attributes, relationships and actions for empirical studies. Experiments demonstrate the effectiveness of our method in terms of both accuracy and explainability.

2. Related Work

Action and knowledge representation. In the classical planning (Russell and Norvig, 2009) problem of AI, an action can be represented by a semantic-level state transition from precondition to effect. For example, in Action Description Language (ADL) (Pednault, 1994; Russell and Norvig, 2009), the load action can be defined as:

$$ \textit{Action}(\textit{Load}(c, p, a)): \quad \textsc{Precond}: At(c, a) \wedge At(p, a), \quad \textsc{Effect}: \neg At(c, a) \wedge In(c, p) $$

where $At(x, a)$ describes whether an object $x$ is at an airport $a$, and $In(c, p)$ denotes whether a freight $c$ is in an airplane $p$. Given the knowledge of a certain domain, such an action representation lifts the level of reasoning from propositional logic to a restricted subset of first-order logic (Russell and Norvig, 2009). Besides, for logical and flexible knowledge representation, a semantic network (Sowa, 2014; Russell and Norvig, 2009; Poole and Mackworth, 2010) is capable of representing objects, attributes of objects and relations among relevant objects in the real world. Here, a semantic network is a graph-based knowledge representation method in AI, where a node represents an object and an edge describes the relationship between two different objects. Inspired by semantic networks for knowledge representation and logical reasoning, we design a video graph for detailed semantic-level video content representation by extending the scene graph (Krishna et al., 2017) into a spatio-temporal structure. By observing the state changes over time, performed actions can be detected and explained by logical rules.

Video action recognition. Without rules for logical reasoning, many approaches employ hand-crafted (Klaser et al., 2008; Laptev, 2008; Wang and Schmid, 2013; Liu et al., 2017a) or deep-learned features (Simonyan and Zisserman, 2014a; Wang et al., 2018b; Feichtenhofer et al., 2017; Lee et al., 2017; Wang et al., 2018a; Feichtenhofer et al., 2018) of appearance and motion for action recognition. Recently, researchers have attempted to use semantic-level state changes (Fathi and Rehg, 2013; Fire and Zhu, 2015; Wang et al., 2016; Liu et al., 2017b; Alayrac et al., 2017; Wang and Gupta, 2018) for video analysis. For example, Liu et al. (Liu et al., 2017b) adopted unary fluents to represent attributes of a single object and binary fluents for pairs of objects in egocentric videos, and then used an LSTM (Graves, 2012) to recognize which action is performed. In addition, Recurrent Neural Networks (RNN) (Jain et al., 2016) and Graph Convolutional Networks (GCN) (Kipf and Welling, 2017; Wang and Gupta, 2018; Guo et al., 2018; Qi et al., 2018) have been used for structured video representation and action recognition in 2D or 3D scenes. Due to the absence of rules for logical reasoning, the explainability of these methods is limited.

Early logical reasoning based methods (Intille and Bobick, 2001; Tran and Davis, 2008; Ijsselmuiden and Stiefelhagen, 2010; Morariu and Davis, 2011; Brendel et al., 2011) often use logical rules to explain low-level features, such as motion, location and trajectories. Tran et al. (Tran and Davis, 2008) used Markov logic networks (Richardson and Domingos, 2006) and first-order logic for event recognition in surveillance domains; by observing the locations of cars and pedestrians in the scene, a set of actions can be inferred, such as “door opening” and “car leaving”. Brendel et al. (Brendel et al., 2011) introduced a probabilistic event logic method for interval-based event recognition, using a manually defined knowledge base to interpret low-level histogram of gradients (HoG) and histogram of flow (HoF) features. Different from these works, our method uses logical rules to explain semantic-level observations of video content changes.

Scene graph representation in images. To understand and describe the visual world in a single image, Krishna et al. (Krishna et al., 2017, 2018) proposed the scene graph representation for rich image content understanding. In (Krishna et al., 2017, 2018), an image is represented by a set of objects, attributes, and relationships. Given an input image, each object is represented by its category, location, and attributes that denote detailed object information (e.g., the age and gender of a man). A relationship connecting two different objects represents the semantic relation between the subject and the object (e.g., a man is holding a cup). To achieve robust relationship prediction with limited training samples, Lu et al. (Lu et al., 2016) proposed a visual relationship model with language priors. Dai et al. (Dai et al., 2017) developed a Deep Relational Network (DR-Net), which integrates a variety of cues: object categories, appearance, spatial configurations, and statistical relations.

Instead of using static image scene description, we track the concerned objects with their attributes and relationships by going frame-wise through the whole video sequence to construct a video graph, which is used to represent the video state (attribute and relationship) transitions.

3. Explainable Action Analysis

In this section, we introduce the problem definition, video graph generation, and how to reason about performed actions by integrating prior knowledge with logical rules.

Figure 2. An overview of the proposed video action reasoning framework. AAR denotes the attribute-based action reasoning model, RAR denotes the relationship-based action reasoning model, and $t$ is the state transition time. Based on the prior knowledge and the video graph representation, performed actions can be detected and explained with detailed information.

3.1. Problem definition

Given a video of the target domain (such as daily life), we define the prior knowledge in this domain as a set of concerned object categories $\mathcal{O}$, associated attributes $\mathcal{A}$, potential relationships $\mathcal{R}$ between each pair of objects, and the possible actions that might be performed among these objects. According to the state-transition (Pednault, 1994; Russell and Norvig, 2009; Poole and Mackworth, 2010) based action representation, two types of video state are considered for video action reasoning: attribute-based states and relationship-based states. Given a target video domain, the number of concerned objects and states is limited, and thus we can build a complete knowledge base (a knowledge base represents facts about the world that are stored by an agent (Russell and Norvig, 2009)) to explain all concerned state transitions.

Attribute. Let $\langle o_i, a_m \rangle$ be the state of object $o_i$ with attribute $a_m$, where the category of $o_i$ belongs to the concerned set $\mathcal{O}$ and $a_m \in \mathcal{A}$ denotes the $m$-th concerned attribute. An attribute-based action can then be defined as a valid attribute transition $(a_m \rightarrow a_n)$, where $a_m$ is the initial attribute as precondition and $a_n$ is the effect of executing the action.

Relationship. Similarly, let $\langle o_i, r_m, o_j \rangle$ be the state of objects $o_i$ and $o_j$ with semantic relationship $r_m$, where $r_m \in \mathcal{R}$ denotes the $m$-th concerned relationship. A relationship-based action can be defined as $(r_m \rightarrow r_n)$, where $r_m$ and $r_n$ denote the precondition and effect of a relationship, respectively.

Video graph. Given a video with $T$ frames, we assume there are $K$ concerned objects in this video domain. Each object in frame $t$ is represented by its category $c_i$ and a bounding box location $b_i^t$ as $o_i^t = (c_i, b_i^t)$. To describe the video state transitions, a video graph $G = \{G^1, G^2, \ldots, G^T\}$ is defined to represent the attributes and relationships of each object, where $G^t$ is a scene graph that denotes the state of video frame $t$.

Attribute-based action definition. Let $a^t(o_i)$ denote the attribute state of object $o_i$ at frame $t$, and $r^t(o_i, o_j)$ the relationship state between objects $o_i$ and $o_j$ at frame $t$. When $a^t(o_i) \neq a^{t+1}(o_i)$, an attribute transition $(a^t(o_i) \rightarrow a^{t+1}(o_i))$ has occurred, and the state transition time (i.e., when) is marked as $t+1$. The performed action can then be explained by the attribute-transition based action definitions, and the location (i.e., where) of the performed action is given by the bounding box $b_i^{t+1}$ of object $o_i$.

Relationship-based action definition. Similarly, a relationship-based action is defined by the relationship transition $(r^t(o_i, o_j) \rightarrow r^{t+1}(o_i, o_j))$, and its locations are given by $b_i^{t+1}$ and $b_j^{t+1}$ of objects $o_i$ and $o_j$, respectively.

Video action reasoning. Given a video and the prior knowledge about the concerned object categories $\mathcal{O}$, attributes $\mathcal{A}$, relationships $\mathcal{R}$, and action definitions specified by sets of state transitions $(a_m \rightarrow a_n)$ and $(r_m \rightarrow r_n)$, our target is to recognize and reason about the set of performed actions on the semantic-level video graph representation whenever $a^t(o_i) \neq a^{t+1}(o_i)$ or $r^t(o_i, o_j) \neq r^{t+1}(o_i, o_j)$, respectively.
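The reasoning target above can be summarized by a small sketch that scans per-frame states for transitions; the data layout (dictionaries of per-frame labels keyed by object id) is an assumption made for illustration.

```python
def scan_attribute_transitions(attr_states):
    """attr_states: dict mapping object_id -> list of per-frame attribute labels.
    Yields (object_id, frame t+1, precondition, effect) for every change."""
    for obj_id, states in attr_states.items():
        for t in range(len(states) - 1):
            if states[t] != states[t + 1]:
                yield obj_id, t + 1, states[t], states[t + 1]

def scan_relationship_transitions(rel_states):
    """rel_states: dict mapping (subject_id, object_id) -> list of per-frame relationship labels."""
    for (subj_id, obj_id), states in rel_states.items():
        for t in range(len(states) - 1):
            if states[t] != states[t + 1]:
                yield subj_id, obj_id, t + 1, states[t], states[t + 1]

# Example: microwave_1 is opened at frame 2.
attrs = {"microwave_1": ["closed", "closed", "open", "open"]}
print(list(scan_attribute_transitions(attrs)))  # [('microwave_1', 2, 'closed', 'open')]
```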

3.2. Overview of the Proposed Method

Similar to previous works (Morariu and Davis, 2011; Brendel et al., 2011) on explainable video analysis, we use manually defined prior knowledge as the logical rules for action reasoning. In addition, to overcome the limitations of the low-level features used in (Morariu and Davis, 2011; Brendel et al., 2011), we design a semantic-level video content representation method, namely the video graph. Figure 2 shows an overview of the proposed action reasoning framework.

Training stage. The training stage has two parts: (1) the state detectors, consisting of an object detector, an attribute detector (Eq. 1) and a relationship detector (Eq. 2); similar to scene graph generation in still images (Krishna et al., 2017; Lu et al., 2016; Dai et al., 2017), these state detectors are trained on annotated video states. (2) The action models, including the Attribute-based Action Reasoning (AAR) model (Eq. 3) and the Relationship-based Action Reasoning (RAR) model (Eq. 4). Different from many existing methods (Carreira and Zisserman, 2017; Wang et al., 2018a; Feichtenhofer et al., 2018) that learn the action model on well-annotated video clips (a single action label per video clip), the state-transition based action models are learned from the given prior knowledge (semantic-level action definitions).

Testing stage. Given a video sequence, we first employ the trained object, attribute and relationship detectors to detect the concerned objects, attributes and relationships in each frame. By tracking these objects across video frames, a video graph is generated for semantic-level state representation. The AAR and RAR models are then used to reason about the performed actions from attribute and relationship transitions, respectively. Since each state transition is explained by the logical rules independently, our method can obtain detailed action information from these state transitions, and it can also detect multiple actions, including concurrent actions, in complex videos.
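The testing stage can be outlined in code as follows; the detector, tracker and reasoning components are hypothetical placeholders standing in for the trained models described above, so this is a structural sketch rather than the actual implementation.

```python
def reason_actions(frames, object_detector, attribute_detector, relationship_detector,
                   tracker, aar_model, rar_model):
    """Hypothetical end-to-end testing loop: per-frame scene graphs -> video graph -> actions."""
    video_graph = []                                           # one scene graph (dict) per frame
    for t, frame in enumerate(frames):
        objects = tracker(object_detector(frame), t)           # tracked objects with ids and boxes
        attrs = {o["id"]: attribute_detector(frame, o) for o in objects}
        rels = {(s["id"], o["id"]): relationship_detector(frame, s, o)
                for s in objects for o in objects if s["id"] != o["id"]}
        video_graph.append({"objects": objects, "attrs": attrs, "rels": rels})

    actions = []
    for t in range(len(video_graph) - 1):
        cur, nxt = video_graph[t], video_graph[t + 1]
        for obj_id in cur["attrs"]:                            # attribute transitions -> AAR
            if obj_id in nxt["attrs"] and cur["attrs"][obj_id] != nxt["attrs"][obj_id]:
                actions.append(aar_model(obj_id, cur["attrs"][obj_id], nxt["attrs"][obj_id], t + 1))
        for pair in cur["rels"]:                               # relationship transitions -> RAR
            if pair in nxt["rels"] and cur["rels"][pair] != nxt["rels"][pair]:
                actions.append(rar_model(pair, cur["rels"][pair], nxt["rels"][pair], t + 1))
    return actions
```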

3.3. Video Graph Generation

Assuming that objects and their locations are available (He et al., 2016; Szegedy et al., 2017), as in scene graph generation (Krishna et al., 2017; Xu et al., 2017; Dai et al., 2017), the attribute and relationship detectors can be learned on still video frames.

Attribute detector. Let $O = \{o_1, o_2, \ldots, o_K\}$ denote the concerned objects in the whole video, and let $o_i^t = (c_i, b_i^t)$ be the $i$-th object with category $c_i$ and location $b_i^t$ in frame $t$. The same attribute category may show very different visual appearances on different objects; for example, the open states of a microwave and a bottle are quite different. To ensure the semantic representativeness of the trained attribute classifier, the object label needs to be incorporated into the attribute model. Given a set of images with annotated objects and attributes, the attribute of an object is predicted by the detector as:

$$ P(a_m \mid o_i^t) = \mathrm{softmax}\big( W_a \, ( \Theta \, \phi(b_i^t) + w_a \, c_i ) \big) \qquad (1) $$

where $W_a$ denotes the learned parameters that predict attribute probabilities, $a_m$ represents the $m$-th concerned attribute in the target domain, $\phi(b_i^t)$ is the visual feature extracted by a convolutional neural network from location $b_i^t$ of frame $t$, $c_i$ is the one-hot vector of the object category with dimension $|\mathcal{O}|$, $\Theta$ is the learned parameter for feature selection, and $w_a$ is the learned weight from the object category.
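A possible PyTorch-style realization of Eq. (1) is sketched below; the layer shapes, the additive fusion of the visual and category branches, and the feature dimension are assumptions for illustration rather than the authors' released code.

```python
import torch
import torch.nn as nn

class AttributeDetector(nn.Module):
    """Sketch of Eq. (1): combine a CNN feature of the object box with a
    one-hot object-category vector to predict attribute probabilities."""
    def __init__(self, feat_dim, num_categories, num_attributes, hidden=256):
        super().__init__()
        self.theta = nn.Linear(feat_dim, hidden)         # feature selection (Theta)
        self.w_cat = nn.Linear(num_categories, hidden)   # weights from object category (w_a)
        self.w_attr = nn.Linear(hidden, num_attributes)  # attribute prediction (W_a)

    def forward(self, box_feature, category_onehot):
        h = self.theta(box_feature) + self.w_cat(category_onehot)
        return torch.softmax(self.w_attr(h), dim=-1)

# Example with random inputs (batch of 2 objects, assumed 4096-d box features).
det = AttributeDetector(feat_dim=4096, num_categories=13, num_attributes=2)
probs = det(torch.randn(2, 4096), torch.eye(13)[:2])
print(probs.shape)  # torch.Size([2, 2])
```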

Relationship detector. Different from detailed image understanding with scene graphs (Krishna et al., 2017), we only focus on the relationships that can be used to generate the concerned actions. Similar to the attribute detector, we also incorporate the object categories as in Eq. 1. Let $o_i^t$ denote the subject and $o_j^t$ the object in frame $t$. The spatial relationship is predicted by the detector (Lu et al., 2016) as:

$$ P(r_n \mid o_i^t, o_j^t) = \mathrm{softmax}\big( W_r \, ( \Theta_r \, \phi(b_i^t \cup b_j^t) + w_r \, c_{ij} ) \big) \qquad (2) $$

where $W_r$ denotes the learned parameters that predict relationship probabilities, $r_n$ represents the $n$-th concerned relationship in the target domain, $\phi(b_i^t \cup b_j^t)$ is the feature extracted from the union region of the boxes $b_i^t$ and $b_j^t$ in frame $t$, $c_i$ and $c_j$ denote the one-hot vectors of the subject category and object category, respectively, $c_{ij} = [c_i; c_j]$ represents the concatenated category vector, $\Theta_r$ is the learned parameter for feature selection, and $w_r$ is the learned weight from both subject and object categories.
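The relationship detector of Eq. (2) can be sketched analogously, with the visual feature taken from the union box and the subject and object categories concatenated; again, the shapes and fusion choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RelationshipDetector(nn.Module):
    """Sketch of Eq. (2): union-box CNN feature plus concatenated subject/object
    category vectors, projected to relationship probabilities."""
    def __init__(self, feat_dim, num_categories, num_relationships, hidden=256):
        super().__init__()
        self.theta = nn.Linear(feat_dim, hidden)            # feature selection (Theta_r)
        self.w_cat = nn.Linear(2 * num_categories, hidden)  # weights from subject+object categories (w_r)
        self.w_rel = nn.Linear(hidden, num_relationships)   # relationship prediction (W_r)

    def forward(self, union_feature, subj_onehot, obj_onehot):
        cats = torch.cat([subj_onehot, obj_onehot], dim=-1)
        h = self.theta(union_feature) + self.w_cat(cats)
        return torch.softmax(self.w_rel(h), dim=-1)
```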

Video graph generation and refinement. Based on the detected objects, attributes and relationships, the semantic-level state of each individual frame can be represented by the generated scene graph. By tracking each object across all video frames, a video graph is further generated. State transitions can then be detected by observing the attribute and relationship changes over time, as illustrated in Figure 1. In addition, since the proposed video graph is a general and flexible video representation, it can also be used in other computer vision tasks, such as video summarization (Wang et al., 2012a, b) and video captioning (Xu et al., 2018).

In a complex video scene, the predicted attributes and relationships are sometimes inaccurate and need to be refined for more robust video graph generation. Under the assumption that state changes in a video sequence should be consistent, a sliding window of width $w$ is utilized to improve the quality of the generated video graph: when the same state is not continuously detected within the window, the detection is considered inaccurate and the latest reliable value is assigned to it. Figure 3 shows an example of refinement using a width of 3, where 0 denotes one state and 1 represents another possible state.

Figure 3. An example of state refinement (width $w = 3$).
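The refinement step can be implemented as a simple run-length filter over the per-frame state sequence; the sketch below follows the description above (a state must persist within the window to be accepted, otherwise the last stable value is carried forward), though the authors' exact rule may differ in detail.

```python
def refine_states(states, w=3):
    """Suppress state detections that do not persist for at least w consecutive
    frames by carrying forward the last stable value."""
    refined = list(states)
    last_stable = states[0]
    i = 0
    while i < len(states):
        j = i
        while j < len(states) and states[j] == states[i]:
            j += 1                           # j - i is the length of the current run
        if j - i >= w or i == 0:
            last_stable = states[i]          # accept the run as stable
        else:
            for k in range(i, j):            # too short: treat as detection noise
                refined[k] = last_stable
        i = j
    return refined

# Example (0/1 states as in Figure 3): short blips are smoothed away.
print(refine_states([0, 0, 0, 1, 0, 0, 1, 1, 1, 1], w=3))
# -> [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```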

3.4. Explainable Action Reasoning Models

We assume that the environment state does not change unless an action is performed. In the real world, a complex action can involve multiple objects, attribute transitions and relationship transitions. Such actions can be decomposed into a set of atomic propositions with first-order logic (Russell and Norvig, 2009; Sowa, 2014). For example, the “having_meal” activity can be defined by two atomic actions as: $\textit{having\_meal} \Leftarrow \textit{eat} \wedge \textit{drink}$. Therefore, in order to clearly introduce the proposed method, in this paper we mainly focus on the atomic action reasoning model that involves only one attribute or relationship transition.

Attribute-based action reasoning (AAR) model. The AAR model is used to detect the attribute transition of a node in the video graph, such as “open a microwave” (the attribute changes from closed to open). Considering both the object category and the state transition, the attribute-based action is formulated by a projection function as:

$$ P(\lambda_k \mid o_i, a_p \rightarrow a_e) = \mathrm{softmax}\big( W_{A} \, [\, c_i;\, a_p;\, a_e \,] \big) \qquad (3) $$

where $W_{A}$ is the learned parameter, $\lambda_k$ represents the $k$-th concerned action in the target video domain, $a_p$ is the attribute of the precondition, and $a_e$ is the attribute of the effect.

Relationship-based action reasoning (RAR) model. In contrast, the RAR model is used to detect the relationship transition of an edge in the video graph, such as “hand picks a cup” (the spatial relationship between hand and cup changes from not_holding to holding). Similar to the attribute-based action model, we build a conjunction of the subject and object categories with the relationship transition to distinguish the same action on different object categories. The relationship-based action is then formulated by a projection function as:

$$ P(\lambda_k \mid o_i, o_j, r_p \rightarrow r_e) = \mathrm{softmax}\big( W_{R} \, [\, c_i;\, c_j;\, r_p;\, r_e \,] \big) \qquad (4) $$

where $W_{R}$ is the learned parameter, $\lambda_k$ denotes the $k$-th concerned action in the target video domain, $r_p$ is the relationship of the precondition, and $r_e$ is the relationship of the effect.

Learning action models from prior knowledge. Since (1) the number of concerned objects and state transitions in a target video domain is often limited (Pednault, 1994) and (2) all of the potential actions can be represented by different state transitions, an appropriate action model can be learned from a complete knowledge base (i.e. state-transition based action definitions) (Russell and Norvig, 2009; Poole and Mackworth, 2010).
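Because every valid transition is enumerated in the action-definition tables, the AAR/RAR models can be fit directly from that knowledge base instead of annotated videos. The scikit-learn sketch below is purely illustrative: it builds training pairs from a few Table 2-style definitions and fits a classifier in the spirit of Eq. (3).

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# A few attribute-based definitions in the spirit of Table 2 (illustrative subset).
definitions = [
    ({"object": "microwave", "pre": "closed", "eff": "open"}, "open"),
    ({"object": "microwave", "pre": "open", "eff": "closed"}, "close"),
    ({"object": "bottle", "pre": "closed", "eff": "open"}, "open"),
    ({"object": "bottle", "pre": "open", "eff": "closed"}, "close"),
    ({"object": "microwave", "pre": "open", "eff": "open"}, "null"),
]

vec = DictVectorizer()
X = vec.fit_transform([d for d, _ in definitions])   # one-hot encode category and transition
y = [label for _, label in definitions]
aar = LogisticRegression(max_iter=1000).fit(X, y)    # AAR-style classifier over transitions

# Reason about an observed transition of a medicine-box; an unseen category falls back
# to the transition pattern alone.
query = vec.transform([{"object": "medicine-box", "pre": "closed", "eff": "open"}])
print(aar.predict(query))  # expected: ['open']
```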

4. Experiments

In this section, we first introduce the dataset and the prior knowledge definition used in this work, and then describe the implementation details of our method. Next, we report the accuracy of video graph generation and the action reasoning results, and demonstrate the explainability of our method. Finally, we discuss potential extensions of the proposed method.

4.1. Dataset and Prior Knowledge Definition

For explainable video action reasoning, annotated objects, attributes, relationships and actions are required to validate the effectiveness of our method. However, available datasets do not satisfy these requirements. Therefore, we construct a new dataset by re-annotating the CAD-120 dataset (Koppula et al., 2013) with detailed object locations, attributes, relationships and actions. The original CAD-120 is an RGB-D dataset captured with a Microsoft Kinect sensor (Shotton et al., 2011) that focuses on human activities of daily life (we only use the RGB images for action analysis; depth information is ignored in this work). The dataset contains 124 video sequences of 10 different high-level activities (such as arranging objects, having a meal, taking food) performed by 4 different subjects, and each activity was performed 3 or 4 times. In addition, each action is carried out on different objects, such as “pick a cup” and “pick a box”.

Table 1. Attributes of an object.
  Attribute     Object
  closed/open   medicine-box, microwave, bottle

Table 2. Attribute-based action definitions.
  Action   Attribute Transition   Object
  close    open → closed          medicine-box, microwave, bottle
  open     closed → open          medicine-box, microwave, bottle

As an extension, the re-annotated CAD-120 dataset (Koppula et al., 2013) consists of 551 video clips with 32,327 frames. For this target domain (daily life), 10 potential actions (including a null action, which means nothing changed/happened) have been defined, as well as a set of concerned objects, attributes and relationships. Since in daily life we often use the same term to describe an action (as well as an attribute or relationship) on different objects, such as “pick a book” and “pick an apple”, we follow this convention in the proposed framework.

Table 3. Relationships between two objects.
  Relationship            Subject     Object
  holding / not_holding   hand        box, medicine-box, bowl, cup, book, cloth, remote, apple, bottle, plate
  contacting / apart      head        bottle, bowl, cup, apple
  containing / separate   microwave   bowl, cup, cloth, box
Table 4. Relationship-based action definitions.
  Action      Relationship Transition   Subject     Object
  pick        not_holding → holding     hand        box, medicine-box, bowl, cup, book, cloth, remote, apple, bottle, plate
  place       holding → not_holding     hand        box, medicine-box, bowl, cup, book, cloth, remote, apple, bottle, plate
  drink       apart → contacting        head        cup, bottle
  eat         apart → contacting        head        apple, bowl
  micr_food   separate → containing     microwave   cup, box, bowl
  take_food   containing → separate     microwave   cup, box, bowl
  clean       separate → containing     microwave   cloth
Table 5. State recognition accuracy of the proposed method with (w) or without (w/o) object categories.
  State     closed   open   holding   not_holding   contacting   apart   containing   separate   Overall
  w/o obj   0.99     0.70   0.86      0.91          0.59         0.60    0.93         0.67       0.82
  w obj     0.98     0.74   0.82      0.96          0.80         0.96    0.95         0.96       0.94
Figure 4. Failure cases: (a) the attribute is falsely predicted as open; (b) the relationship is falsely predicted as holding.

Tables 1-4 present the prior knowledge about the target domain used in this work, and the action models can be trained from Tables 2 and 4 without any annotated videos. More explicitly, Tables 1 and 3 list the concerned object categories, attributes and relationships; Tables 2 and 4 define actions based on valid attribute transitions and relationship transitions, respectively. When a state does not change, or a state transition is not contained in Tables 2 and 4, null is used to denote that nothing happened. In summary, there are 13 object categories, 2 attributes, 6 relationships, 12 attribute-based transitions, 72 relationship-based transitions, and 10 actions in total. Based on this prior knowledge, we can generate video graphs for semantic-level video content understanding and further action reasoning.

4.2. Implementation Details

Based on the manually labeled data, the attribute and relationship detectors are each trained with a VGG-16 model (Simonyan and Zisserman, 2014b). For training the AAR and RAR action reasoning models, the learning rate is set to 0.01. For all detectors, categorical cross-entropy is used as the loss function and the Adam optimizer (Kingma and Ba, 2014) is used for optimization. Besides, the smoothing width $w$ is empirically set to 5 for robust video graph generation.

Figure 5. Examples of video action reasoning with detailed state transition information in complex videos.

4.3. Video Graph Generation

Given a video sequence with manually annotated objects on a few frames, we use a tracking algorithm (Lukezic et al., 2017) to estimate the locations of each object in the rest of the video. By applying the methods described in Sec. 3.3 for scene graph generation on each frame, a video graph is then generated. In this state-transition based action reasoning framework, when an action occurs, it can be detected from the temporal attribute or relationship changes (i.e., state transitions) in the generated video graph. Because our action models are learned from prior knowledge, the performance of our method depends on the accuracy of state detection: if more accurate video state detection results are provided, more accurate action reasoning results can be obtained.

To demonstrate the accuracy of state detection, we report both attribute and relationship recognition accuracy. Because the same attribute or relationship often appears on different object categories, its visual appearance can be very diverse, e.g., “a hand is holding a box” versus “a hand is holding a cup”. By taking the object categories into account in attribute and relationship modeling, the overall performance is effectively improved, as shown in Table 5.

For some challenging scenes, such as heavy occlusion and inconspicuous conditions, it is difficult to predict the states accurately. In the example shown in Figure 4(a), when the microwave is occluded by the girl, its attribute is incorrectly predicted as open. In another example, shown in Figure 4(b), the ground-truth relationship is not_holding, but it is improperly detected as holding because the hand is very close to the bowl. In practice, although the accuracy of state detection is not perfect, it still provides important information for action reasoning (see Sec. 4.4 and Sec. 4.5).

4.4. Explainable Action Reasoning

By observing each state transition in the video graph over time, performed actions can be detected and explained by the rule-based action reasoning models (i.e. AAR and RAR). Moreover, since each state transition in the video graph is detected and explained independently, detailed state transition information (i.e. who, when, where and how) can be obtained from the video graph, and multiple concurrent actions can also be detected.

To illustrate the advantages of the proposed method, action reasoning results on two long video sequences are reported. As shown in Figure 5, all of the concerned objects are represented in these two videos. When their attributes or relationships change, performed actions are detected and marked with the relevant objects at the video frames where the state transitions occur. In the first row of Figure 5, the man “picks cup_2” (not_holding to holding between hand_1 and cup_2) with “hand_1” at frame 35; he then uses “cup_2” to “drink” (apart to contacting between head_1 and cup_2) at frame 70. Later, at frame 180, the man “places cup_2” (holding to not_holding between hand_1 and cup_2) and “picks apple_1” with “hand_2” (not_holding to holding between hand_2 and apple_1) at the same time. Finally, he “eats apple_1” (apart to contacting between head_1 and apple_1) at frame 188. Note that although some objects are irrelevant to any action, such as the bowl (marked with a yellow rectangle) and the bottle (marked with a black rectangle), they still need to be considered in the action reasoning stage: because we do not know in advance what will happen in an input video, we need to monitor all the objects.

Similarly, the second row of Figure 5 shows another action reasoning result. The man “opens microwave_1” (closed to open of microwave_1) at frame 125; then the concurrent actions “hand_1 picks bowl_1” and “hand_2 picks bowl_1” are detected, followed by another pair of concurrent actions, “hand_2 places bowl_1” and “microwave_1 micr_food bowl_1”. In the end, the man “closes microwave_1” at frame 367.

As we can see, our method analyzes videos in a manner similar to human logical reasoning. Different from previous logical reasoning methods (Intille and Bobick, 2001; Tran and Davis, 2008; Ijsselmuiden and Stiefelhagen, 2010; Morariu and Davis, 2011; Brendel et al., 2011) and semantic-level video action recognition algorithms (Fathi and Rehg, 2013; Fire and Zhu, 2015; Wang et al., 2016; Liu et al., 2017b; Alayrac et al., 2017; Wang and Gupta, 2018), the explainability of our method is supported by both logical rules and semantic-level video content understanding. Moreover, Figure 5 demonstrates that the proposed method can detect multiple concurrent actions in complex videos and provide detailed action information to explain how those actions are executed.

Table 6. Comparison between the proposed method and TSN (Wang et al., 2018b) for single action recognition.
  Action         null   open   close   pick   place   drink   eat    micr_food   take_food   clean   Overall
  Video number   161    48     40      109    100     31      28     10          12          10      551
  TSN_RGB        0.92   0.46   0.45    0.53   0.57    0.74    0.21   0           0           0.33    0.42
  TSN_Flow       0.91   0.69   0.70    0.81   0.92    0.84    0.39   0.80        0.33        0.67    0.71
  TSN_Fusion     0.96   0.77   0.50    0.88   0.87    0.86    0.20   1.00        0.67        1.00    0.77
  Ours           0.84   0.83   0.88    0.68   0.86    1.00    0.89   0.50        0.67        0.83    0.80

4.5. Action Recognition Accuracy

Note that there are essential differences between the proposed action reasoning approach and many deep learning based action recognition methods (Simonyan and Zisserman, 2014a; Wang et al., 2018b; Feichtenhofer et al., 2017; Lee et al., 2017; Wang et al., 2018a; Feichtenhofer et al., 2018): (1) instead of predicting only a single action label, our method outputs multiple action labels with the relevant objects, attributes/relationships and the time of each state transition; (2) our action models are learned from semantic-level state-transition based definitions (the state detectors are trained on still images), and thus they do not need well-annotated video clips for training. To demonstrate the effectiveness of our action reasoning framework on action recognition, we divide the long video sequences into short clips, each of which contains only one action. These clips are used to evaluate our method with the Average Recall metric, which measures whether the performed actions are recognized.
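For reference, one common way to compute the per-action Average Recall used here is sketched below; the clip-level input format (one ground-truth action per clip, a set of predicted actions per clip) is an assumption made for illustration.

```python
from collections import defaultdict

def average_recall(gt_labels, predicted_labels):
    """Per-class recall (was the performed action recognized?), averaged over classes.
    gt_labels[i] is the single ground-truth action of clip i; predicted_labels[i] is a
    set of actions output for that clip (our method may output several)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gt, preds in zip(gt_labels, predicted_labels):
        totals[gt] += 1
        if gt in preds:
            hits[gt] += 1
    recalls = {c: hits[c] / totals[c] for c in totals}
    return sum(recalls.values()) / len(recalls), recalls

overall, per_class = average_recall(["open", "pick", "pick"],
                                    [{"open"}, {"pick", "place"}, {"null"}])
print(overall, per_class)  # 0.75 {'open': 1.0, 'pick': 0.5}
```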

We compare our method to a representative two-stream (appearance and motion) action recognition algorithm, TSN (Wang et al., 2018b), which adopts an end-to-end deep learning scheme with RGB frames and optical flow as a two-stream input. It is worth mentioning that TSN achieves state-of-the-art performance on the benchmark action recognition datasets UCF-101 (Soomro et al., 2012) and ActivityNet (Fabian Caba Heilbron and Niebles, 2015). Similar to other popular action recognition methods (Simonyan and Zisserman, 2014a; Feichtenhofer et al., 2017; Lee et al., 2017; Wang et al., 2018a; Feichtenhofer et al., 2018), the output of TSN is only an action label for a video sequence.

In our experiment, the TSN model is trained for 50 epochs for both the appearance and optical flow streams, and the best recognition results are reported for comparison. Let TSN_RGB denote the appearance based model of TSN, TSN_Flow the motion based model, and TSN_Fusion the fused model of appearance and motion. As shown in Table 6, our approach achieves very competitive performance in comparison to the different TSN models, especially for the “open”, “close”, “drink” and “eat” actions. The average recall of TSN_Flow is 0.71 and that of TSN_Fusion is 0.77, while the appearance based model TSN_RGB reaches only 0.42. The reason is that in the CAD-120 dataset (Koppula et al., 2013), all 4 subjects adopted similar movements to accomplish the same action; therefore, the motion component contributes disproportionately (0.71 versus 0.42 for appearance) to action recognition in these videos. Note that even without any motion information, the proposed method still achieves better recognition performance (0.80) than TSN_Flow (0.71) and TSN_Fusion (0.77).

In addition, due to inaccurate video state prediction in some challenging scenes (e.g., heavy occlusion), it is difficult to always generate a reliable video graph. Therefore, the multiple outputs of our method may contain some false positive actions, and the Average Precision is 0.52.

4.6. Discussion

In this work, instead of following the traditional action recognition strategy of pursuing performance improvements, we advocate a logical reasoning framework for action analysis and understanding in videos. Based on both logical rules and the semantic-level video graph representation, our method enjoys great flexibility and extensibility, making it applicable to more difficult activity reasoning tasks and other video domains, as discussed below.

Complex activity reasoning. This paper mainly focuses on the atomic action reasoning problem, where an action can be observed from a single attribute or relationship change in a video. In practice, many complex actions or activities involve a set of objects, attributes and relationships. Similar to previous works (Morariu and Davis, 2011; Brendel et al., 2011) that use prior knowledge and Markov Logic Networks (Russell and Norvig, 2009) for event modeling, our method can be extended in the same way to reason about complex activities. For example, the “having_meal” activity can be defined as: $\textit{having\_meal} \Leftarrow \textit{eat} \wedge \textit{drink}$. When both atomic actions eat and drink are detected, the complex activity “having_meal” can be inferred by first-order logic with predefined rules, as in the example shown in the first row of Figure 5.
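A complex-activity rule such as having_meal ⇐ eat ∧ drink can be checked directly against the set of detected atomic actions, as in the hypothetical sketch below; the rule base shown is illustrative and not part of the implemented system.

```python
# Complex activities as conjunctions of atomic actions (illustrative rule base).
ACTIVITY_RULES = {
    "having_meal": {"eat", "drink"},
}

def infer_activities(detected_actions):
    """Return every complex activity whose required atomic actions were all detected."""
    detected = set(detected_actions)
    return [activity for activity, parts in ACTIVITY_RULES.items()
            if parts.issubset(detected)]

print(infer_activities(["pick", "drink", "place", "pick", "eat"]))  # ['having_meal']
```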

Applications to other video domains. Since our action reasoning framework is based on prior knowledge and a semantic-level video graph representation, it can be easily adapted to a new video domain (such as sports) for explainable action reasoning, as long as the corresponding knowledge is available. In fact, the prior knowledge used in our framework is commonsense knowledge (as presented in Tables 1-4), which is easy to collect. For other video domains, the only requirement is to replace the prior knowledge and the relevant state detectors.

5. Conclusion and Future Work

We proposed an explainable video action reasoning framework based on prior knowledge and semantic-level state transitions. Given a certain video domain, a set of concerned objects, attributes and relationships can be defined based on commonsense knowledge, and the concerned actions can be explained by attribute and relationship changes in videos. During the testing stage, given a video, we first generate a scene graph on each video frame for semantic-level state representation, and then construct the video graph by sequentially linking the scene graphs through tracking objects across all video frames. In this way, our model can detect and explain performed actions by observing state transitions in the video graph. Compared to previous methods, the action reasoning results of our method can be explained by both logical rules and semantic-level video content understanding. Experiments on the daily-life dataset show that the proposed method not only recognizes performed actions, but also provides detailed information to explain how those actions are executed. Moreover, our method can handle multiple concurrent actions in complex videos in the same way as single actions.

In the future, we will construct datasets to empirically study the effectiveness of our framework on complex activity and event detection, as well as its extensibility to other domains. In addition, another interesting direction for future work is to automatically learn additional rules by using Probabilistic Inductive Logic Programming (De Raedt and Kersting, 2008).

Acknowledgments

This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore under its Strategic Capability Research Centres Funding Initiative, and the Agency for Science, Technology and Research (A*STAR) under its AME Programmatic Funding Scheme (#A18A2b0046). This research is also supported by the NSF of China 61571362, and grant 2018JM6015 of Natural Science Basic Research Plan in Shaanxi Province of China, as well as the Fundamental Research Funds for the Central Universities 3102019ZY1004. Zhiyong Cheng is the corresponding author.

References

  • J. Alayrac, J. Sivic, I. Laptev, and S. Lacoste-Julien (2017) Joint discovery of object states and manipulating actions. In ICCV, Cited by: §1, §2, §4.4.
  • W. Brendel, A. Fern, and S. Todorovic (2011) Probabilistic event logic for interval-based event recognition. In CVPR, Cited by: §1, §1, §2, §3.2, §4.4, §4.6.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, Cited by: §1, §1, §3.2.
  • B. Dai, Y. Zhang, and D. Lin (2017) Detecting visual relationships with deep relational networks. In CVPR, Cited by: §1, §2, §3.2, §3.3.
  • L. De Raedt and K. Kersting (2008) Probabilistic inductive logic programming. In Probabilistic Inductive Logic Programming, pp. 1–27. Cited by: §5.
  • B. G. Fabian Caba Heilbron and J. C. Niebles (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In CVPR, Cited by: §4.5.
  • A. Fathi and J. M. Rehg (2013) Modeling actions through state changes. In CVPR, Cited by: §2, §4.4.
  • C. Feichtenhofer, A. Pinz, R. P. Wildes, and A. Zisserman (2018) What have we learned from deep representations for action recognition?. In CVPR, Cited by: §1, §1, §2, §3.2, §4.5, §4.5.
  • C. Feichtenhofer, A. Pinz, and R. P. Wildes (2017) Spatiotemporal multiplier networks for video action recognition. In CVPR, Cited by: §2, §4.5, §4.5.
  • A. Fire and S. Zhu (2015) Learning perceptual causality from video. Transactions on Intelligent Systems and Technology 7 (2), pp. 23:1–23:22. Cited by: §2, §4.4.
  • A. Graves (2012) Supervised sequence labelling. In Supervised sequence labelling with recurrent neural networks, pp. 5–13. Cited by: §2.
  • M. Guo, E. Chou, D. Huang, S. Song, S. Yeung, and L. Fei-Fei (2018) Neural graph matching networks for fewshot 3d action recognition. In ECCV, Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §3.3.
  • J. Ijsselmuiden and R. Stiefelhagen (2010) Towards high-level human activity recognition through computer vision and temporal logic. In AAAI, Cited by: §1, §1, §2, §4.4.
  • S. S. Intille and A. F. Bobick (2001) Recognizing planned, multiperson action. Computer Vision and Image Understanding (CVIU) 81 (3), pp. 414–445. Cited by: §1, §1, §2, §4.4.
  • A. Jain, A. R. Zamir, S. Savarese, and A. Saxena (2016) Structural-rnn: deep learning on spatio-temporal graphs. In CVPR, pp. 5308–5317. Cited by: §1, §2.
  • D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.2.
  • T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1, §2.
  • A. Klaser, M. Marszalek, and C. Schmid (2008) A spatio-temporal descriptor based on 3d-gradients. In BMVC, External Links: Link Cited by: §2.
  • H. S. Koppula, R. Gupta, and A. Saxena (2013) Learning human activities and object affordances from rgb-d videos. The International Journal of Robotics Research 32 (8), pp. 951–970. Cited by: §4.1, §4.1, §4.5.
  • R. Krishna, I. Chami, M. Bernstein, and L. Fei-Fei (2018) Referring relationships. In CVPR, Cited by: §2.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, M. S. Bernstein, and L. Fei-Fei (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV) 123 (1), pp. 32–73. Cited by: §1, §1, §2, §2, §3.2, §3.3, §3.3.
  • I. Laptev (2008) Learning realistic human actions from movies. In CVPR, Cited by: §2.
  • I. Lee, D. Kim, S. Kang, and S. Lee (2017) Ensemble deep learning for skeleton-based action recognition using temporal sliding lstm networks. In ICCV, Cited by: §2, §4.5, §4.5.
  • A. Liu, Y. Su, W. Nie, and M. Kankanhalli (2017a) Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE transactions on pattern analysis and machine intelligence (TPAMI) 39 (1), pp. 102–114. Cited by: §2.
  • Y. Liu, P. Wei, and S. Zhu (2017b) Jointly recognizing object fluents and tasks in egocentric videos. In ICCV, Cited by: §1, §2, §4.4.
  • C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In ECCV, Cited by: §2, §3.2, §3.3.
  • A. Lukezic, T. Vojir, L. Čehovin Zajc, J. Matas, and M. Kristan (2017) Discriminative correlation filter with channel and spatial reliability. In CVPR, Cited by: §4.3.
  • V. I. Morariu and L. S. Davis (2011) Multi-agent event recognition in structured scenarios. In CVPR, Cited by: §1, §1, §2, §3.2, §4.4, §4.6.
  • E. P. Pednault (1994) ADL and the state-transition model of action. Journal of logic and computation 4 (5), pp. 467–512. Cited by: §2, §3.1, §3.4.
  • D. L. Poole and A. K. Mackworth (2010) Artificial intelligence: foundations of computational agents. Cambridge University Press. Cited by: §2, §3.1, §3.4.
  • S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV, pp. 401–417. Cited by: §1, §2.
  • M. Richardson and P. Domingos (2006) Markov logic networks. Machine learning 62 (1-2), pp. 107–136. Cited by: §1, §2.
  • S. Russell and P. Norvig (2009) Artificial intelligence: a modern approach. 3rd edition, Prentice Hall Press. Cited by: §1, §1, §2, §3.1, §3.4, §3.4, §4.6, footnote 1, footnote 2.
  • J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake (2011) Real-time human pose recognition in parts from single depth images. In CVPR, Cited by: §4.1.
  • K. Simonyan and A. Zisserman (2014a) Two-stream convolutional networks for action recognition in videos. In NeurIPS, pp. 568–576. Cited by: §2, §4.5, §4.5.
  • K. Simonyan and A. Zisserman (2014b) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. External Links: Link Cited by: §4.2.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402. Cited by: §4.5.
  • J. F. Sowa (2014) Principles of semantic networks: explorations in the representation of knowledge. Morgan Kaufmann. Cited by: §2, §3.4.
  • C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, Cited by: §3.3.
  • D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In CVPR, Cited by: §1, §1.
  • S. D. Tran and L. S. Davis (2008) Event modeling and recognition using markov logic networks. In ECCV, Cited by: §1, §1, §2, §4.4.
  • H. Wang and C. Schmid (2013) Action recognition with improved trajectories. In ICCV, Cited by: §2.
  • L. Wang, W. Li, W. Li, and L. Van Gool (2018a) Appearance-and-relation networks for video classification. In CVPR, Cited by: §1, §1, §2, §3.2, §4.5, §4.5.
  • L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2018b) Temporal segment networks for action recognition in videos. IEEE transactions on pattern analysis and machine intelligence (TPAMI). Cited by: §2, §4.5, §4.5, Table 6.
  • M. Wang, R. Hong, G. Li, Z. Zha, S. Yan, and T. Chua (2012a) Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia (TMM) 14 (4), pp. 975–985. Cited by: §3.3.
  • M. Wang, R. Hong, X. Yuan, S. Yan, and T. Chua (2012b) Movie2Comics: towards a lively video content presentation. IEEE Transactions on Multimedia (TMM) 14 (3-2), pp. 858–870. Cited by: §3.3.
  • M. Wang, C. Luo, B. Ni, J. Yuan, J. Wang, and S. Yan (2018c) First-person daily activity recognition with manipulated object proposals and non-linear feature fusion. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) 28 (10), pp. 2946–2955. Cited by: §1.
  • X. Wang, A. Farhadi, and A. Gupta (2016) Actions ~ transformations. In CVPR, Cited by: §2, §4.4.
  • X. Wang and A. Gupta (2018) Videos as space-time region graphs. In ECCV, Cited by: §1, §2, §4.4.
  • D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei (2017) Scene graph generation by iterative message passing. In CVPR, Cited by: §1, §3.3.
  • N. Xu, A. Liu, Y. Wong, Y. Zhang, W. Nie, Y. Su, and M. Kankanhalli (2018) Dual-stream recurrent neural network for video captioning. IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). Cited by: §3.3.
  • Y. Zhu, D. Gordon, E. Kolve, D. Fox, L. Fei-Fei, A. Gupta, R. Mottaghi, and A. Farhadi (2017) Visual semantic planning using deep successor representations. In ICCV, Cited by: §1.