Improved Actor Relation Graph based Group Activity Recognition

10/24/2020 ∙ by Zijian Kuang, et al. ∙ University of Alberta

Video understanding aims to recognize and classify the different actions or activities that appear in a video. A lot of previous work, such as video captioning, has shown promising performance in producing a general understanding of video. However, it is still challenging to generate a fine-grained description of human actions and their interactions using state-of-the-art video captioning techniques. A detailed description of human actions and group activities is essential information that can be used in real-time CCTV video surveillance, health care, sports video analysis, etc. This study proposes a video understanding method that mainly focuses on group activity recognition by learning pair-wise actor appearance similarity and actor positions. We propose to use normalized cross-correlation (NCC) and the sum of absolute differences (SAD) to calculate the pair-wise appearance similarity and build the actor relation graph, which allows the graph convolutional network to learn how to classify group activities. We also propose to use MobileNet as the backbone to extract features from each video frame. A visualization model is further introduced to visualize each input video frame with predicted bounding boxes on each human object and the predicted individual actions and collective activity.




I Introduction

Video understanding is an extensively studied topic that is widely used in video content analysis [10]. Traditional video captioning techniques such as LSTM-YT [26] and S2VT [25] use recurrent neural networks, specifically LSTMs [24], to train models on video-sentence pairs [25, 14, 10, 26]. The models learn the association between the sequence of video frames and the sequence of words in a sentence to generate a description of the video [25]. Krishna et al. indicated that those video captioning approaches only work for short videos with a single major event [14]. Therefore, they introduced a new captioning module that uses contextual information from the timeline to describe all the events in a video clip [14]. However, video captioning approaches are still very limited in generating a detailed description of human actions and their interactions [8]. Recent studies in pose estimation, human action recognition, and group activity recognition show the capability to describe more detailed human actions and group activities [18, 2].

Human action recognition and group activity recognition are important problems in video understanding [27]. Action and activity recognition techniques have been widely applied in areas such as social behavior understanding, sports video analysis, and video surveillance. To better understand a video scene that includes multiple persons, it is essential to understand both each individual’s action and their collective activity. Actor Relation Graph (ARG) based group activity recognition is the state-of-the-art model that focuses on capturing the appearance and position relationships between the actors in the scene and performing action and group activity recognition [27]. In this paper, we propose an improved ARG-based model that is combined with an object detection model and a visualization model to achieve better video understanding.

We propose several approaches to improve the functionality and performance of the ARG-based model. Since the ARG-based model cannot perform human object detection and requires bounding boxes as an input, we will improve the model by combining it with the state-of-the-art human object detection method You Only Look Once (YOLO) [20]. To enhance human action and group activity recognition performance, we will try different approaches, such as increasing human object detection accuracy with YOLO, increasing processing speed by reducing the input image size, and applying ResNet in the CNN layer. We will also introduce a visualization model that plots each input video frame with predicted bounding boxes on each human object and the predicted “video captioning” (individual action descriptions and group activity description).

II Related Work

II.a Video Captioning with Sequence-to-sequence LSTM Models

In 2015, S. Venugopalan et al. proposed an end-to-end sequence-to-sequence model that exploits recurrent neural networks, specifically Long Short-Term Memory (LSTM [24]) networks. The model is trained on video-sentence pairs and learns to associate a sequence of frames in a video with a sequence of words in order to generate a description of the event in the video as a caption [25]. A stack of two LSTMs is used to learn the temporal structure of the frame sequence and the sequence model of the generated sentences. In this approach, the entire video sequence needs to be encoded using the LSTM network at the beginning. Long video sequences can lead to vanishing gradients and prevent the model from being trained successfully [14].

II.b You Only Look Once: Unified, Real-Time Object Detection

In 2016, J. Redmon et al. introduced YOLO, a unified model for object detection. It reframes object detection as a regression problem to spatially separated bounding boxes and their associated class probabilities [20]. A single neural network predicts both bounding boxes and class probabilities in the YOLO system. Inspired by the GoogLeNet model, YOLO is composed of 24 convolutional layers followed by two fully connected layers. During training, YOLO uses features from the entire image to predict bounding boxes, which enables it to reason globally and implicitly encode contextual information when making predictions. Moreover, YOLO learns a more generalized representation of objects, which helps it outperform other detection methods. Compared with other fast detectors such as Faster R-CNN [21], YOLO throws out the pipeline that generates potential bounding boxes and extracts features, which makes it faster and allows it to work as a general-purpose detector that identifies a variety of objects simultaneously. YOLO is fast at test time because it uses a single network evaluation, making it ideal for real-time computer vision applications. It achieves efficient performance in both fetching images from the camera and displaying the detections. However, YOLO struggles with small items that appear in groups because of its strong spatial constraints. It also struggles to identify objects in new or unusual configurations that it has not seen during training. YOLO’s loss function also needs improvement in how it distinguishes errors in a small bounding box from errors in a large bounding box.
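To make the regression formulation concrete, the sketch below computes the shape of the prediction tensor and decodes one predicted box into pixel coordinates. It assumes the configuration reported in the original YOLO paper (an S x S grid with S = 7, B = 2 boxes per cell, C = 20 classes, box centers relative to the grid cell, and box sizes relative to the whole image); the function names `yolo_output_shape` and `decode_box` are hypothetical and not part of any YOLO code base.

```python
def yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of the prediction tensor: each cell predicts B boxes
    (x, y, w, h, confidence) plus one set of C class probabilities."""
    return (grid_size, grid_size, boxes_per_cell * 5 + num_classes)


def decode_box(row, col, x, y, w, h, grid_size, img_w, img_h):
    """Convert one cell-relative prediction to (x1, y1, x2, y2) in pixels.

    x, y: box center as an offset inside grid cell (row, col), in [0, 1].
    w, h: box size as a fraction of the whole image, in [0, 1].
    """
    center_x = (col + x) / grid_size * img_w
    center_y = (row + y) / grid_size * img_h
    box_w = w * img_w
    box_h = h * img_h
    return (
        center_x - box_w / 2,
        center_y - box_h / 2,
        center_x + box_w / 2,
        center_y + box_h / 2,
    )


if __name__ == "__main__":
    print(yolo_output_shape())
    print(decode_box(3, 3, 0.5, 0.5, 0.2, 0.4, 7, 448, 448))
```

Because every cell emits a fixed number of boxes in one forward pass, no separate region proposal stage is needed, which is the source of the speed advantage described above.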

II.c Video Captioning with Dense-Captioning Events Models

In 2017, R. Krishna et al. introduced a Dense-Captioning Events (DCE) model that can detect multiple events and generate a description for each event using contextual information from past, concurrent, and future events in a single pass of the video [14]. In this paper, the process is divided into two steps: event detection and description of the detected events. The DCE model leverages a multi-scale variant of the deep action proposal model to localize temporal proposals of interest in short and long video sequences. In addition, a captioning LSTM model is introduced to exploit the context from the past and future with an attention mechanism.

II.d Real-time Multi-person 2D Human Pose Estimation

OpenPose is an open-source real-time system for 2D multi-person pose detection [2]. It is also widely used for detecting body and facial landmark points in video frames [12, 18, 19]. It produces a spatial encoding of pairwise relationships between body parts for a variable number of people, followed by a greedy bipartite graph matching that outputs the 2D keypoints for all people in the image. In this approach, both the prediction of part affinity fields (PAFs) and the detection of confidence maps are refined at each stage [2]. By doing this, real-time performance is improved while the accuracy of each component is maintained. The online OpenPose library supports joint detection of human body, hand, and facial keypoints on a single image, which provides 2D human pose estimation for our proposed system.

II.e Human Activity Recognition using OpenPose, Motion, and RNN

Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells are widely used for human action recognition among the emerging accessible human activity recognition methods. F. M. Noori et al. propose an approach that first extracts anatomical keypoints from RGB images using the OpenPose library, then obtains extra temporal motion features by considering the movements in consecutive video frames, and lastly classifies the features into associated activities using an RNN with LSTM [18]. Improved performance is shown when recognizing activities performed by several different subjects from various camera angles. However, their work on multi-person action classification is still in progress for accuracy improvement.

II.f Residual Attention-based LSTM Model

In 2018, X. P. Li et al. introduced a novel attention-based framework called Residual attention-based LSTM (Res-ATT [15]). This model benefits from the existing attention mechanism and further integrates residual mapping into a two-layer LSTM network to avoid losing information about previously generated words. The residual attention-based decoder model is designed with five separate parts: a sentence encoder, temporal attention, a visual and sentence feature fusion layer, a residual layer, and an MLP [15]. The sentence encoder is an LSTM layer that explores important syntactic information from a sentence. The temporal attention is designed to identify the importance of each frame. The visual and sentence feature fusion LSTM layer mixes natural language information with image features, and the residual layer is proposed to reduce the transmission loss. The MLP layer is used to predict the next word to generate a description in natural language [15].

II.g Learning Actor Relation Graphs for Group Activity Recognition

In 2019, J. Wu et al. proposed to use the Actor Relation Graph (ARG) to model relationships between actors and recognize group activity with multiple persons involved [27]. Using the ARG in a multi-person scene, the relation between actors with respect to appearance similarity and relative location is inferred and captured. Compared with using a CNN to extract person-level features and later aggregating them into a scene-level feature, or using an RNN to capture temporal information in densely sampled frames, learning with ARG is less computationally expensive and more flexible when dealing with variation in the group activity. Given a video sequence with bounding boxes and ground truth action labels for the actors in the scene, the trained network can recognize individual actions and group activity in a multi-person scene. For long-range video clips, ARG’s efficiency is improved by forcing a relational connection only in a local neighborhood and randomly dropping several frames while maintaining the diversity of the training samples and reducing the risk of overfitting. At the beginning of the training process, the actors’ features are first extracted by a CNN and RoIAlign [11] using the provided bounding boxes. After obtaining the feature vectors for the actors in the scene, multiple actor relation graphs are built to represent diverse information for the same set of actor features. Finally, a Graph Convolutional Network (GCN) is applied to perform learning and inference to recognize individual actions and group activity based on the ARG. Two classifiers, one for individual actions and one for group activity recognition, are applied to the pooled ARG. The scene-level representation is generated by max-pooling the individual actor representations and is later used for group activity classification.

III Literature Review

III.a Human Body Fall Recognition System

Falling represents a significant threat to older people, young kids, and people with disabilities. Therefore, fall detection is important in health care, on construction sites, and in smart home systems. In 2020, J. Mourey et al. introduced a human fall detection method that uses OpenCV to detect and report falls in real time [17]. The algorithm contains three stages. First, the model processes the input frame images and applies GMG and MOG2 for background subtraction. Then the human body is detected with a bounding box in each frame. Next, a fall is characterized by two main parameters: the change in width/height and the change in theta, the angle between the human body and the floor. Finally, if a change in these parameters is detected and either of them surpasses its threshold, the alert system is triggered [17].
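The final thresholding stage described above can be sketched as follows. This is only an illustration of the decision rule, assuming that an upstream step already supplies the bounding box size and the body angle theta for each frame; the `BodyObservation` type, the `is_fall` function, and both threshold values are hypothetical and are not taken from [17].

```python
from dataclasses import dataclass


@dataclass
class BodyObservation:
    width: float       # bounding box width in pixels
    height: float      # bounding box height in pixels
    theta_deg: float   # angle between the body and the floor, in degrees


def is_fall(prev, curr, ratio_change_threshold=0.5, theta_change_threshold=30.0):
    """Flag a fall when either change parameter exceeds its threshold.

    Both thresholds are illustrative placeholders, not values from the paper.
    """
    ratio_change = abs(curr.width / curr.height - prev.width / prev.height)
    theta_change = abs(curr.theta_deg - prev.theta_deg)
    return (ratio_change > ratio_change_threshold
            or theta_change > theta_change_threshold)


if __name__ == "__main__":
    standing = BodyObservation(width=40, height=120, theta_deg=85)
    lying = BodyObservation(width=120, height=40, theta_deg=10)
    print(is_fall(standing, lying))
```

In practice the comparison would likely be made over a short window of frames rather than two consecutive ones, so that a single noisy detection does not trigger the alert.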

III.b Using Participatory Design to Create A User Interface for Analyzing Pivotal Response Treatment Video Probes

Pivotal response treatment (PRT) focuses on helping to improve the communication skills of kids with autism. Video recordings of the interaction between the caregiver and the child are an essential tool for assessment. In 2020, C. D. C. Heath et al. proposed a prototype user interface that displays the PRT data extracted from video recordings in order to better analyze PRT implementation and provide feedback [12]. This prototype UI is built on an automatic data processing model that uses both video and audio assessment. In video processing, OpenPose [2] is used to detect body and facial features in each frame. The data is then used to train a support vector machine (SVM) classifier with three labels: attentive, inattentive, and shared attention. In audio processing, the model uses PyAudioAnalysis to extract features and uses two SVMs to classify the data as silence, noise, adult speech, or child vocalizations. The prototype is developed following agile software development methodologies in three sprint iterations. In the first sprint, based on literature review, observations, and discussions, a wireframe mock-up of the interface is designed along with examples of extracted data. In the second sprint, an alpha prototype UI is developed and evaluated using a think-aloud session. In the end, the beta prototype UI is designed based on the feedback from the previous think-aloud session.


III.c Integrating Active Face Tracking with Model-Based Coding

In 1999, A. Basu and L. Yin proposed a system that detects and tracks a talking face with an active camera and then automatically performs adaptation and animation on the detected face [9]. The paper proposes an advanced head silhouette generation method to detect the motion of a talking face. A spatiotemporal filter is then used to fuse the motion mask and complete the moving head detection. After the head region is detected, deformable template matching combined with color information extraction and the Hough transform is used to extract facial features such as the eyes and mouth. In the end, a 3D wireframe model is generated and fitted onto the moving face so that the positions of the eyes and mouth can be determined along with the entire face. They further introduced a “coarse-to-fine” adaptation algorithm to complete the facial feature adaptation process [9].

III.d Nose Shape Estimation and Tracking For Model-Based Coding

In previous research, most facial feature extraction methods focused on eye and mouth features. However, detection of the nostril and nose-side shape can also be used in facial expression recognition, since different expressions change the shape of the nose. In 2001, A. Basu and L. Yin proposed a feature detection method that focuses on nose shape recognition [28]. First, a two-stage region growing algorithm is developed to extract the feature blobs. Then two deformable templates are pre-defined to detect and track the nostril shape and nose-side shape. In the last step, a 3D wireframe model is matched onto the individual face to track the facial expressions using energy minimization [28].

III.e Perceptually Guided Fast Compression of 3D Motion Capture Data

Motion capture data is widely used in producing skeletal animations, and efficient compression of motion capture data is important for making the best use of environments with limited bandwidth and memory while preserving high animation quality. In 2001, A. Firouzmanesh, I. Cheng, and A. Basu proposed a method that achieves a better compression ratio with shorter compression and decompression times. The perceptually guided fast compression model optimizes the coefficient selection algorithm by considering two critical factors: the length of the bone connected to a joint and the variation in rotation [9]. The paper also suggests to other researchers that, besides bone length and variation in rotation, many other factors can affect the quality of animations, such as the distance from the camera, the horizontal and vertical velocity of the object, and the size of the limbs [9].

III.f QoE-Based Multi-Exposure Fusion in Hierarchical Multivariate Gaussian CRF

In 2013, A. Basu, I. Cheng, and R. Shen proposed a Hierarchical Multivariate Gaussian Conditional Random Field (HMGCRF) model, which is based on perceptual quality measures, to give viewers a better quality of experience on fused images. HMGCRF takes the human visual system into account and uses perceived local contrast and color saturation to improve the performance of Multi-Exposure Fusion (MEF). The proposed model applies a novel MEF method that exploits both contrast and color information to deliver maximum image detail. By using MEF techniques instead of HDR imaging techniques, minimal user intervention is required, and a visually appealing, improved image is built directly. With a source image given as input, each pixel’s contribution to the fused image is perceptually tuned by perceived local contrast and color saturation. It is the first time that a probabilistic model of the human visual system is brought into multi-exposure fusion to detect local contrast. Maximum local detail preservation is achieved using HMGCRF with perceptual quality measures, which yields more vivid colors in fused images [23].

III.g Airway Segmentation and Measurement in CT Images

In this work, A. Basu, I. Cheng et al. introduced a new strategy for detecting the boundary of slices of the upper airway and tracking the contour of the airway using Gradient Vector Flow (GVF) snakes. Traditional GVF snakes do not perform well for airway segmentation when applied directly to CT images. Therefore, the method proposed in this paper modifies the GVF algorithm with edge detection and snake-shifting steps. By applying edge detection before the GVF snakes and using snake-shifting techniques, prior knowledge of airway CT slices is utilized, and the model works more robustly. Prior knowledge of the shape of the airway makes it possible to detect the airway automatically in the first slice. The detected contour is then used as the snake initialization for the second slice, and so on [3]. A heuristic is also applied to differentiate bones from the airway by color to make sure the snake converges correctly. Following this, the airway volume is estimated based on the 3D model constructed from the automatically detected contours.

III.h A Framework for Adaptive Training and Games in Virtual Reality Rehabilitation Environments

Disabled individuals face many challenges when using electric power wheelchairs during their training and rehabilitation stages. Especially for children, equipment adjustment and an environment with specialists are essential but limited. Therefore, a new adaptive strategy for training and games in a virtual rehabilitation environment is introduced in this paper. By using virtual reality and a powered wheelchair simulator in training, patients can stay engaged in the training session for a longer period. The proposed rehabilitation system is flexible, low-cost, and focused on indoor environments for safer and more effective training. Both interactive and standard training modes are provided to ensure patient engagement. Compared with other approaches in the area, clinicians can design, build, and customize the interactive training environments for patients on this system, resulting in a more effective training process that responds dynamically to a user’s actions. With a framework based on Bayesian networks, the problem of intelligent and automatic adaptation to the various types and levels of training that patients undergo at different times is addressed as well. The same approach can also be applied to other spatial-based training applications.


III.i Eye Tracking and Animation for MPEG-4 Coding

A. Basu et al. proposed a heuristic that effectively improves facial feature detection and tracking algorithms, with a focus on eye movements. Since accurate localization and tracking of facial features are important to high-quality model-based coding (MPEG-4) systems, it is essential to automate the detection and tracking of facial features as well as the synthesis and analysis of facial expressions. In this paper, an improved initial localization process and simple image processing with the Hough transform and deformable templates are used to produce more accurate results [1]. By exploiting the color information of the eyes, the feature detection becomes more robust and faithful. Moreover, a methodology for eye movement synthesis is also presented. After extracting the contours of the iris and eyelids from the image sequence, the deformation of the 3D model’s eyes is computed and used to synthesize the real eye expression. Further extensions of the approach will be applied to lip movements and network strategies for a real-time system [1].

III.j Tile Priorities in Adaptive 360-Degree Video Streaming

Tiled streaming is currently a popular way to deliver 360-degree video, which allows users to experience an omnidirectional scene with a more immersive feeling. However, this kind of video streaming demands high network bandwidth and is affected by bandwidth fluctuations. To solve this problem, many researchers have optimized bandwidth use by presenting only the partial Field of View (FOV) seen by the user at high quality while the rest of the content is streamed at a lower rate. This strategy is called Viewport Dependent Streaming (VDS) and is further improved with video coding features. By partitioning the 360-degree video into smaller independent tiles, continuous streaming with a minimal but sufficient set of tiles is guaranteed to cover all parts of a user’s FOV. In this paper, an approach is introduced to assign priorities to tiles and decide which tiles in the viewport are streamed first under a bandwidth-limited condition. A priority map is constructed for each tile from the tile’s relative quality and relative order within the viewport. The tiles are later separated into foreground and background groups. Finally, the tiles of the 360-degree video to stream are chosen based on tile quality and display order. Graceful degradation is achieved without sacrificing the user QoE during this process. The minor bandwidth usage can degrade by 10% in the experiment [5].

III.k Domain Adaptive Fusion for Adaptive Image Classification

A. Dudley et al. proposed a domain adaptation algorithm, Domain Adaptive Fusion (DAF), which effectively bridges the gap between different source and target domains. In this paper, the DAF model uses an adversarial domain network to align features and reduce the domain discrepancy between the source and target domains, and then applies a semi-supervised algorithm followed by a domain fusion model [7]. For domain adaptation, a feature extractor is trained using a Domain Adversarial Neural Network (DANN) to minimize the discrepancy. After the domains are aligned, the DAF model is trained to fuse samples of source and target data and predict their corresponding fused labels. Finally, a standard loss function for semi-supervised learning and an L2 regularization loss are applied across the DAF network’s layers. In tests on two datasets, the DAF model makes the source and target domains less different and produces more defined clusters. Using this DAF model, the hypothesis that domain adaptation problems can be reduced to semi-supervised learning problems through domain alignment is validated.


III.l RCA-NET: Image Recovery Network with Channel Attention Group for Image Dehazing

In 2020, J. Du et al. proposed an image dehazing and channel attention model that uses an end-to-end pipeline and produces more realistic results [6]. This image recovery network uses channel attention (RCA-Net) to extract channel-wise features. The model then minimizes the reconstruction errors and derives a transmission model M(x). Next, M(x) is optimized through an image recovery network with channel attention. In the end, more realistic color and structural details are generated by the recovery network [6].

III.m Race Classification Based Iris Image Segmentation

Iris segmentation is important in biometric authentication and personal identification. Existing works are mostly restricted to specific iris databases and do not perform well across varied iris image databases. In 2020, X. Ke et al. proposed a race classification based iris image segmentation method [13]. This model first utilizes the local Gabor binary pattern (LGBP) with a support vector machine (SVM) to build a high-performance classifier. LGBP-SVM is used to divide iris images into human eye and non-human eye images. These two kinds of iris images are further segmented by algorithms based on the circular Hough transform. The results show that the race classification based iris segmentation model improves segmentation accuracy over existing works. It also produces promising performance in iris segmentation across various iris image databases [13].

III.n Semantic Learning for Image Compression (SLIC)

Image compression reduces the image size by a specific compression ratio while maintaining visual quality and limiting the complexity of the architecture. In this paper, K. Mahalingaiah et al. proposed a model that uses a deep learning Convolutional Neural Network (CNN) to enhance the quality of compressed images by understanding an image’s semantics. The proposed approach applies a CNN to enhance lossy compression. By modifying ResNet-50, replacing its last three layers, and integrating the encoder and decoder architecture of the Joint Photographic Experts Group (JPEG) standard, higher visual quality is achieved [16]. A semantic map is also produced to encode salient regions at a higher quality. Moreover, the model can detect multiple objects at various scales for different image qualities and resolutions. The model achieves a remarkable compression ratio and visual quality compared to compression standards such as JPEG [16].

IV Proposed Method

In this paper, we propose to use a human action and group activity recognition model to perform video understanding. The ARG-based model is a state-of-the-art human action and group activity recognition method. An overview of this ARG-based model is shown in Fig. 1.

Fig. 1: The original ARG-based group activity recognition framework. [27]

The model first extracts actor features from sampled video frames with manually labeled bounding boxes using a CNN and RoIAlign [11]. Next, it builds an N by d dimensional feature matrix, where a d-dimensional vector represents each actor’s bounding box and N is the total number of bounding boxes in the video frames. The actor relation graphs are then built to capture the appearance and position relationships between the actors in the scene. Afterward, the model uses Graph Convolutional Networks (GCN) to analyze the relationships between actors from the ARG. Finally, the original and relational features are aggregated and used by two separate classifiers to perform action and group activity recognition [27].
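The pipeline above can be illustrated with a small, dependency-free sketch: it builds one relation graph from an N by d feature matrix and the box centers, applies a single graph convolution, and max-pools a scene-level vector. Several simplifications are assumed here and are not claims about the implementation in [27]: the appearance term is a plain scaled dot product used as a stand-in similarity, the position term is a hard distance cutoff, there is no learned embedding, fusion is an element-wise sum, and all function names and the `max_distance` parameter are hypothetical.

```python
import math


def build_relation_graph(features, centers, max_distance):
    """Return an N x N row-normalized relation matrix.

    Actors farther apart than max_distance get zero relation; the rest are
    scored by a scaled dot product and normalized with a softmax per row.
    """
    n, d = len(features), len(features[0])
    graph = []
    for i in range(n):
        scores = []
        for j in range(n):
            if math.dist(centers[i], centers[j]) <= max_distance:
                dot = sum(a * b for a, b in zip(features[i], features[j]))
                scores.append(dot / math.sqrt(d))
            else:
                scores.append(None)  # masked out by the position constraint
        # An actor is always within distance 0 of itself, so the row is non-empty.
        top = max(s for s in scores if s is not None)
        exps = [0.0 if s is None else math.exp(s - top) for s in scores]
        total = sum(exps)
        graph.append([e / total for e in exps])
    return graph


def gcn_layer(graph, features, weight):
    """One graph convolution: ReLU(G @ X @ W), written with plain lists."""
    n, d_in, d_out = len(features), len(features[0]), len(weight[0])
    mixed = [[sum(graph[i][j] * features[j][k] for j in range(n))
              for k in range(d_in)] for i in range(n)]
    return [[max(0.0, sum(mixed[i][k] * weight[k][m] for k in range(d_in)))
             for m in range(d_out)] for i in range(n)]


def scene_representation(features, relational):
    """Sum original and relational features, then max-pool over actors.

    Requires both inputs to have the same shape (d_out == d).
    """
    fused = [[a + b for a, b in zip(f, r)] for f, r in zip(features, relational)]
    return [max(column) for column in zip(*fused)]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]      # N = 3 actors, d = 2
    centers = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]    # third actor is far away
    identity = [[1.0, 0.0], [0.0, 1.0]]
    g = build_relation_graph(feats, centers, max_distance=2.0)
    rel = gcn_layer(g, feats, identity)
    print(g)
    print(scene_representation(feats, rel))
```

The fused per-actor rows would feed the individual action classifier, and the pooled vector would feed the group activity classifier. Swapping in a different pair-wise similarity only requires changing the score computed inside `build_relation_graph`.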

In this paper, we use one of the public group activity recognition datasets, the collective activity dataset, to train and test our model [4]. This dataset has 52 video scenes, each of which includes multiple persons. The manually defined bounding box of each person and the ground truth of their actions are also labeled in each frame.

Although the ARG-based model achieves high-accuracy predictions on both human actions and group activities, there are still some potential areas for improvement. Our improved ARG-based human action and group activity recognition model is illustrated in Fig. 2.

Fig. 2: Our proposed improved ARG-based human action and group activity recognition model.

The ARG-based group activity recognition model requires a manually defined bounding box for each individual person in each frame as an input. This increases the manual work and prevents the model from processing real-time video directly. To solve this issue, we propose combining the YOLO detection model with the ARG model to handle real-time video, so that the frame images are the only input data required by our model.
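A minimal sketch of the hand-off between the detector and the ARG model is shown below. It assumes the detector output has already been parsed into (label, confidence, box) records; the `Detection` type, the `person_boxes` helper, the 0.5 confidence threshold, and the `max_actors` cap are hypothetical choices for illustration and are not part of YOLO or of the ARG code base.

```python
from typing import List, NamedTuple, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


class Detection(NamedTuple):
    label: str
    confidence: float
    box: Box


def person_boxes(detections: List[Detection],
                 min_confidence: float = 0.5,
                 max_actors: Optional[int] = None) -> List[Box]:
    """Keep confident 'person' detections, most confident first.

    The returned boxes play the role of the manually labeled bounding
    boxes that the ARG model otherwise expects as input.
    """
    people = [d for d in detections
              if d.label == "person" and d.confidence >= min_confidence]
    people.sort(key=lambda d: d.confidence, reverse=True)
    if max_actors is not None:
        people = people[:max_actors]
    return [d.box for d in people]


if __name__ == "__main__":
    frame_detections = [
        Detection("person", 0.91, (10, 20, 60, 180)),
        Detection("dog", 0.88, (70, 120, 110, 170)),
        Detection("person", 0.32, (200, 30, 240, 150)),
        Detection("person", 0.77, (120, 25, 170, 185)),
    ]
    print(person_boxes(frame_detections))
```

Capping the number of actors keeps the size of the relation graph bounded per frame, which may matter when the goal is real-time processing.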

On the collective activity dataset, the best accuracy this ARG-based model achieved after training and testing is 86%, and the training/testing process takes about 6-7 hours on four RTX 2080 Ti GPUs. There are several approaches we would like to implement and try in order to increase accuracy and processing speed: increasing human object detection accuracy with YOLO, increasing processing speed by reducing the input image size, and applying ResNet in the CNN layer.

The ARG-based model also does not generate a visualized output. It only generates a log file that records the predicted group activity for each scene. To give it a more visual result, we will introduce a visualization model that plots each input video frame with predicted bounding boxes on each human object and outputs the predicted individual action descriptions and group activity description.

V Timeline and Individual Responsibility

The proposed timeline is shown in Fig. 3. Each team member will follow this timeline, with tasks divided equally across literature review, project implementation, final report, and presentations.

Currently, Zijian and Xinran are focusing on proposing improvement methods and evaluating their feasibility. Both team members are configuring the environment and testing the source code of the baseline ARG-based model. A visualization model for enhancing the description of individual actions and group activity is also in progress.

Fig. 3: Project Timeline

As the next step, each team member will review 10 research papers in related areas such as group activity recognition and object detection. Every team member will also start working on code implementation. Zijian and Xinran will both work on improving the ARG-based model with the proposed strategies and on finding a potential new dataset for testing the existing and later work.

By the end of this course, the team expects to improve the system with the proposed methods and complete the implementation of the improvements to the existing ARG-based model. A final report will also be delivered at the end, including more detailed descriptions of our model and the results of our experiments.

VI Conclusion

The improved ARG-based model, combined with object detection and a visualization model, is expected to provide better video understanding and to improve the accuracy and efficiency of group activity detection. However, it is challenging to fuse an ARG-based model with object detection tools. Finding another appropriate dataset to examine the performance of the model is also essential to our work, although identifying a suitable group activity dataset with carefully annotated action labels may take additional time. More research and experiments will be conducted to assess the feasibility of our proposed approaches.


  • [1] S. Bernogger, L. Yin, A. Basu, and A. Pinz (1998) Eye tracking and animation for mpeg-4 coding. Proceedings. Fourteenth International Conference on Pattern Recognition (Cat. No.98EX170). External Links: Document Cited by: §III.i.
  • [2] Z. Cao, G. H. Martinez, T. Simon, S. Wei, and Y. A. Sheikh (2019) OpenPose: realtime multi-person 2d pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. External Links: Document Cited by: §I, §II.d, §III.b.
  • [3] I. Cheng, S. Nilufar, C. Flores-Mir, and A. Basu (2007) Airway segmentation and measurement in ct images. 2007 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. External Links: Document Cited by: §III.g.
  • [4] W. Choi, K. Shahid, and S. Savarese (2009) What are they doing? : collective activity classification using spatio-temporal relationship among people. 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops. External Links: Document Cited by: §IV.
  • [5] D. Curcio, A. Hourunranta, and E. B. Aksu (2020) Tile priorities in adaptive 360-degree video streaming. In Smart Multimedia, T. McDaniel, S. Berretti, I. D. D. Curcio, and A. Basu (Eds.), Cham, pp. 212–223. External Links: ISBN 978-3-030-54407-2 Cited by: §III.j.
  • [6] J. Du, J. Zhang, Z. Zhang, W. Tan, S. Song, and H. Zhou (2020) RCA-net: image recovery network with channel attention group for image dehazing. Lecture Notes in Computer Science Smart Multimedia, pp. 330–337. External Links: Document Cited by: §III.l.
  • [7] A. Dudley, B. Nagabandi, H. Venkateswara, and S. Panchanathan (2020) Domain adaptive fusion for adaptive image classification. In Smart Multimedia, T. McDaniel, S. Berretti, and A. Curcio (Eds.), Cham, pp. 357–371. External Links: ISBN 978-3-030-54407-2 Cited by: §III.k.
  • [8] B. Fernando, C. T. Y. Chet, and H. Bilen (2020) Weakly supervised gaussian networks for action detection. 2020 IEEE Winter Conference on Applications of Computer Vision (WACV). External Links: Document Cited by: §I.
  • [9] A. Firouzmanesh, I. Cheng, and A. Basu (2011) Perceptually guided fast compression of 3-d motion capture data. IEEE Transactions on Multimedia 13 (4), pp. 829–834. External Links: Document Cited by: §III.c, §III.e.
  • [10] L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen (2017) Video captioning with attention-based lstm and semantic consistency. IEEE Transactions on Multimedia 19 (9), pp. 2045–2055. External Links: Document Cited by: §I.
  • [11] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In 2017 IEEE International Conference on Computer Vision (ICCV), Vol. , pp. 2980–2988. Cited by: §II.g, §IV.
  • [12] C. D. C. Heath, T. D. C. Heath, T. D. C. Mcdaniel, H. D. C. Venkateswara, and S. D. C. Panchanathan (2020) Using participatory design to create a user interface for analyzing pivotal response treatment video probes. Lecture Notes in Computer Science Smart Multimedia, pp. 183–198. External Links: Document Cited by: §II.d, §III.b.
  • [13] X. Ke, L. An, Q. Pei, and X. Wang (2020) Race classification based iris image segmentation. Lecture Notes in Computer Science Smart Multimedia, pp. 383–393. External Links: Document Cited by: §III.m.
  • [14] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles (2017) Dense-captioning events in videos. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: Document Cited by: §I, §II.a, §II.c.
  • [15] Z. G. L. Gao, J. S. S. Hochreiter, T. T. A. Kojima, L. G. X. Li, L. G. J. Song, L. G. J. Song, HT. S. J. Song, X. L. X. Zhu, and L. Z. X. Zhu (1970-01) Residual attention-based lstm for video captioning. Springer US. External Links: Link Cited by: §II.f.
  • [16] K. Mahalingaiah, H. Sharma, P. Kaplish, and I. Cheng (2020) Semantic learning for image compression (slic). In Smart Multimedia, T. McDaniel, S. Berretti, I. D. D. Curcio, and A. Basu (Eds.), Cham, pp. 57–66. External Links: ISBN 978-3-030-54407-2 Cited by: §III.n.
  • [17] J. Mourey, A. Sehat Niaki, P. Kaplish, and R. Gupta (2020) Human body fall recognition system. In Smart Multimedia, T. McDaniel, S. Berretti, I. D. D. Curcio, and A. Basu (Eds.), Cham, pp. 372–380. External Links: ISBN 978-3-030-54407-2 Cited by: §III.a.
  • [18] F. M. Noori, B. Wallace, Md. Z. Uddin, and J. Torresen (2019) A robust human activity recognition approach using openpose, motion features, and deep recurrent neural network. Image Analysis Lecture Notes in Computer Science, pp. 299–310. External Links: Document Cited by: §I, §II.d, §II.e.
  • [19] Y. Raaj, H. Idrees, G. Hidalgo, and Y. Sheikh (2019) Efficient online multi-person 2d pose tracking with recurrent spatio-temporal affinity fields. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Document Cited by: §II.d.
  • [20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Document Cited by: §I, §II.b.
  • [21] S. Ren, K. He, R. Girshick, and J. Sun (2016) Faster r-cnn: towards real-time object detection with region proposal networks. External Links: 1506.01497 Cited by: §II.b.
  • [22] N. Rossol, I. Cheng, W. F. Bischof, and A. Basu (2011) A framework for adaptive training and games in virtual reality rehabilitation environments. Proceedings of the 10th International Conference on Virtual Reality Continuum and Its Applications in Industry - VRCAI 11. External Links: Document Cited by: §III.h.
  • [23] R. Shen, I. Cheng, and A. Basu (2013) QoE-based multi-exposure fusion in hierarchical multivariate gaussian crf. IEEE Transactions on Image Processing 22 (6), pp. 2469–2478. External Links: Document Cited by: §III.f.
  • [24] A. Sherstinsky (2020-01) Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm) network. North-Holland. External Links: Link Cited by: §I, §II.a.
  • [25] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko (2015) Sequence to sequence – video to text. 2015 IEEE International Conference on Computer Vision (ICCV). External Links: Document Cited by: §I, §II.a.
  • [26] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko (2015) Translating videos to natural language using deep recurrent neural networks. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. External Links: Document Cited by: §I.
  • [27] J. Wu, L. Wang, L. Wang, J. Guo, and G. Wu (2019) Learning actor relation graphs for group activity recognition. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). External Links: Document Cited by: Video Understanding based on Human Action and Group Activity Recognition, §I, §II.g, Fig. 1, §IV.
  • [28] L. Yin and A. Basu (2001) Nose shape estimation and tracking for model-based coding. 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221). External Links: Document Cited by: §III.d.