Multi-modal Egocentric Activity Recognition using Audio-Visual Features

by Mehmet Ali Arabaci, et al.

Egocentric activity recognition in first-person videos is of increasing importance for applications such as lifelogging, video summarization, assisted living, and activity tracking. Existing methods for this task interpret information from various sensors using pre-determined weights for each feature. In this work, we propose a new framework for egocentric activity recognition that combines audio-visual features with multiple kernel learning (MKL) and multiple kernel boosting (MKBoost). To that end, grid optical-flow, virtual-inertia, log-covariance, and cuboid features are first extracted from the video. The audio signal is characterized by a "supervector", obtained from Gaussian mixture modelling of frame-level features followed by maximum a posteriori (MAP) adaptation. The extracted multi-modal features are then adaptively fused by MKL classifiers, in which feature and kernel selection/weighting are performed jointly with recognition. The proposed framework was evaluated on a number of egocentric datasets. The results show that using multi-modal features with MKL outperforms existing methods.
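The audio "supervector" step in the abstract follows the common GMM-UBM recipe: fit a background Gaussian mixture on pooled frame-level features, MAP-adapt its means to each clip, and concatenate the adapted means into one fixed-length vector. A minimal sketch is shown below, assuming MFCC-like frame features and a relevance factor of 16; these specifics, and the helper name `gmm_supervector`, are illustrative assumptions rather than details taken from the paper.

```python
# Sketch of a GMM supervector via MAP mean adaptation (assumed recipe).
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to `frames` (T x D) and stack them."""
    post = ubm.predict_proba(frames)            # (T, C) frame posteriors
    n_c = post.sum(axis=0)                      # soft counts per component
    f_c = post.T @ frames                       # (C, D) first-order statistics
    alpha = (n_c / (n_c + relevance))[:, None]  # data-dependent adaptation weights
    safe_n = np.maximum(n_c, 1e-10)[:, None]    # avoid division by zero
    adapted = alpha * (f_c / safe_n) + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                      # supervector of length C * D

rng = np.random.default_rng(0)
background = rng.normal(size=(500, 13))         # pooled training frames (toy data)
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(background)
clip = rng.normal(loc=0.5, size=(120, 13))      # frame features from one audio clip
sv = gmm_supervector(ubm, clip)
print(sv.shape)                                 # 8 components x 13 dims = (104,)
```

Per-clip supervectors of this form can then feed one kernel of the MKL fusion alongside the visual features.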




