HAPRec: Hybrid Activity and Plan Recognizer

by   Roger Granada, et al.

Computer-based assistants have recently attracted much interest due to their applicability to ambient assisted living. Such assistants have to detect and recognize the high-level activities and goals performed by the assisted human beings. In this work, we demonstrate activity recognition in an indoor environment in order to identify the goal the subject of the video is pursuing. Our hybrid approach combines an action recognition module and a goal recognition algorithm to identify the ultimate goal of the subject in the video.




1 Introduction

Activity recognition can be understood as the task of recognizing the individual actions that together form an interpretation of a movement being performed. Plan recognition, in turn, can be understood as the task of recognizing agent goals and plans based on observed interactions in an environment. These observed interactions can be either events provided by sensors or actions/activities performed by an agent. Although much research effort focuses on activity and plan recognition as separate challenges, comparatively less effort has focused on identifying higher-level plans from activities in video sequences, i.e., understanding the overarching goal of subjects within a video and making the correct inference from the observed activities. Rafferty et al. use a sensor-based approach to implement assistive smart homes. Their approach is based upon an intention recognition mechanism that uses sensors affixed to objects and an ontological rule-based goal recognition system. Massardi et al. perform plan recognition using plan libraries built from learned activities. Their top-down approach uses a particle filter with a population of plan trees to deal with noisy observations, producing a quick, reliable solution.

In this work, we develop a hybrid approach that combines activity and plan recognition to identify, from a set of candidate plans, which plan a human subject is pursuing based exclusively on still-camera video sequences. To recognize such a plan, we employ an activity recognition algorithm based on convolutional neural networks (CNN), which generates a sequence of activities that are checked for temporal consistency against a plan library using a symbolic plan recognition approach modified to work with a CNN. As supplemental material, we provide a video demonstration of our architecture.¹

¹Link to our video: https://youtu.be/eb_6I6dzrEE.

Figure 1: Pipeline of the hybrid architecture for activity and plan recognition.

2 A Hybrid Architecture for Activity and Plan Recognition

Our hybrid architecture is divided into two main parts: i) CNN-based activity recognition, and ii) CNN-backed symbolic plan recognition. The first part consists of training a Convolutional Neural Network (CNN) using video frames as input, with the activity being performed in the video as the expected output. Our CNN architecture is based on the GoogLeNet architecture [4] and computes a probability score for all possible classes (softmax output) for each frame. If two classes have high probabilities and the difference between them is lower than a threshold, we use a heuristic to disambiguate between them: if one of the two classes is equal to the class assigned to the last frame, the current frame receives the class of the last frame. Otherwise, the current frame receives the class with the highest probability, disregarding the threshold.
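The disambiguation heuristic above can be sketched as follows. This is a minimal illustration, not our actual implementation: the function name, the dictionary representation of the softmax output, and the threshold value are all hypothetical.

```python
def disambiguate(probs, prev_label, threshold=0.1):
    """Assign a class label to the current frame from per-class softmax scores.

    probs: dict mapping class label -> softmax probability (assumed format).
    prev_label: label assigned to the previous frame.
    threshold: assumed value; the paper does not fix a specific number.
    """
    # Take the two highest-scoring classes.
    (top, p_top), (second, p_second) = sorted(
        probs.items(), key=lambda kv: kv[1], reverse=True)[:2]
    # Ambiguous case: the two best scores are closer than the threshold.
    if p_top - p_second < threshold:
        # Prefer temporal continuity: keep the previous frame's label
        # if it is one of the two candidate classes.
        if prev_label in (top, second):
            return prev_label
    # Otherwise, fall back to the highest-probability class.
    return top
```

For example, with scores `{"baking": 0.46, "turning": 0.44}` and a previous label of `turning`, the scores are within the threshold and the frame keeps the label `turning`; if the previous label were `cutting`, the frame would receive `baking`.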

After using the CNN to identify the activity being performed, we use a plan recognizer that returns the set of possible plans that are temporally consistent with what has been recognized from the input frames. To perform plan recognition, we use a symbolic approach called Symbolic Behavior Recognition (SBR) [1]. SBR takes as input a plan library and a sequence of observations, in this case, a sequence of observed feature values. Feature values are used as a set of conditions to execute a plan-step in the plan library. To match observed features with plan-steps, SBR uses an efficient matching step based on a feature decision tree (FDT), which maps observable features to plan-step nodes in the plan library. As output, SBR returns a set of hypothesis plans, each of which achieves a top-level goal in the plan library. Instead of using the FDT to match observations with consistent plan-steps, we modify SBR and replace the FDT with the CNN-based activity recognition: given a video frame, the CNN returns the activity to which the frame corresponds, and we then feed this activity as an observation to SBR, as shown in Figure 1. Note that to recognize goals and plans using SBR, we must model a plan library containing the possible sequences of activities (i.e., plans) that achieve each goal. In this paper, the plan library is a model that contains a set of plans for achieving cooking menus.
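The consistency check at the core of this pipeline can be sketched as follows. Note that SBR itself operates over a hierarchical plan library with an efficient matching step [1]; this flat prefix-matching sketch only illustrates how observed activities prune the set of candidate goals, and all names and the plan-library format are hypothetical.

```python
def candidate_goals(plan_library, observed_steps):
    """Return top-level goals that remain temporally consistent with the
    plan-steps observed so far.

    plan_library: dict mapping goal -> list of plan-step sequences (assumed
    flat representation of the plan library).
    observed_steps: activities recognized so far, in order.
    """
    goals = set()
    for goal, plans in plan_library.items():
        # A goal stays a candidate if at least one of its plans starts
        # with exactly the sequence of observed plan-steps.
        if any(plan[:len(observed_steps)] == list(observed_steps)
               for plan in plans):
            goals.add(goal)
    return goals
```

For instance, with a toy library where Omelet is `[breaking, mixing, baking, turning]` and Scrambled-Egg is `[breaking, mixing, baking, mixing]`, observing `[breaking, mixing, baking]` leaves both as candidates, and the next recognized activity disambiguates between them.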

3 Application

We created HAPRec [2] to demonstrate that it is possible to perform goal recognition using CNN-based activity recognition and plan libraries with real-world data (images). To demonstrate our work, we use the activities from the ICPR 2012 Kitchen Scene Context based Gesture Recognition (KSCGR) dataset [3], which contains video sequences of five menus for cooking eggs in Japan: Ham and Eggs, Omelet, Scrambled Egg, Boiled Egg, and Kinshi-Tamago. Each menu is performed by 7 subjects: 5 actors in the training set and 2 actors in the evaluation set, i.e., 5 cooking scenes are available for each menu during training. The dataset comprises eight cooking gestures (breaking, mixing, baking, turning, cutting, boiling, seasoning, and peeling) plus a none label, which means that no activity is being performed in the current frame. We chose the KSCGR dataset since it annotates the activity being performed in each frame (e.g., breaking, baking, and turning) as well as the goal achieved in the whole video sequence (e.g., preparing Ham and Eggs, Omelet, Scrambled Egg, etc.). Thus, we can carry out activity recognition using the activities performed in each frame and plan recognition using the steps to achieve the recipe in each video. For recognizing goals and plans, we model a plan library containing knowledge of the agent's possible goals and plans based on the dataset, where each recipe is a top-level goal. Based on the videos from the training set, we model all possible plans for achieving each menu (i.e., top-level goal), considering a sequence of cooking gestures to be analogous to a sequence of plan-steps, i.e., a plan in the plan library. Figure 2 illustrates the demo screen, showing the current image of the dataset, its frame id, the action predicted in that frame (Baking), and the list of candidate goals (Omelet and Scrambled-Egg). On the right side are the sequence of plan-steps identified so far and the top-level goals, where the candidate goals are highlighted in green.

Figure 2: Demo screen showing the activity identified in the current frame, the plan-steps and the set of candidate goals.
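The construction of such a plan library from frame-level annotations can be sketched as follows. This is a simplified illustration of the modeling step, assuming the dataset is available as (menu, frame labels) pairs; the function names and data format are hypothetical.

```python
def frames_to_plan(frame_labels):
    """Collapse consecutive identical frame labels into an ordered
    sequence of plan-steps, dropping "none" frames."""
    plan = []
    for lab in frame_labels:
        if lab != "none" and (not plan or plan[-1] != lab):
            plan.append(lab)
    return plan

def build_plan_library(training_videos):
    """Build a plan library from training videos.

    training_videos: iterable of (menu, frame_labels) pairs (assumed format).
    Each distinct plan-step sequence observed for a menu becomes one plan
    under that menu's top-level goal.
    """
    library = {}
    for menu, frame_labels in training_videos:
        plan = frames_to_plan(frame_labels)
        # Avoid storing duplicate plans for the same goal.
        if plan not in library.setdefault(menu, []):
            library[menu].append(plan)
    return library
```

Collapsing consecutive identical labels reflects the assumption above that a cooking gesture spanning many frames corresponds to a single plan-step.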

4 Conclusion

We presented HAPRec, a tool that performs both activity and plan recognition using real-world data. Our architecture is based on CNNs and a modified symbolic approach to plan recognition. We demonstrated how the algorithm works by testing it in a kitchen-scene environment containing actions performed by subjects and plans (recipes).


This study was financed in part by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) and the CAPES/FAPERGS agreement (DOCFIX 04/2018). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU.


  • [1] D. Avrahami-Zilberbrand and G. A. Kaminka (2005) Fast and Complete Symbolic Plan Recognition. In IJCAI’05, Cited by: §2.
  • [2] R. Granada, R. F. Pereira, J. Monteiro, R. Barros, D. Ruiz, and F. Meneguzzi (2017) Hybrid Activity and Plan Recognition for Video Streams. In The AAAI-PAIR 2017, Cited by: §3.
  • [3] A. Shimada, K. Kondo, D. Deguchi, G. Morin, and H. Stern (2013) Kitchen Scene Context Based Gesture Recognition: A Contest in ICPR 2012. In Advances in Depth Image Analysis and Applications, pp. 168–185. Cited by: §3.
  • [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR 2015, Cited by: §2.