Identifying reasons for human actions in lifestyle vlogs.
We aim to automatically identify human action reasons in online videos. We focus on the widespread genre of lifestyle vlogs, in which people perform actions while verbally describing them. We introduce and make publicly available the WhyAct dataset, consisting of 1,077 visual actions manually annotated with their reasons. We describe a multimodal model that leverages visual and textual information to automatically infer the reasons corresponding to an action presented in the video.
Significant research effort has recently been devoted to the task of action recognition Carreira2017QuoVA; Shou2017CDCCN; Tran2018ACL; Chao2018RethinkingTF; Girdhar2019VideoAT; Feichtenhofer2019SlowFastNF. Action recognition works well when applied to well-defined, constrained scenarios, such as people following scripts and instructions Sigurdsson2016HollywoodIH; Miech2019HowTo100MLA; Tang2019COINAL, performing sports Soomro2012UCF101AD; Karpathy2014LargeScaleVC, or cooking Rohrbach2012ADF; Damen2018EPICKITCHENS; Damen2020RESCALING; Zhou2018TowardsAL. However, action recognition becomes limited and error-prone once the application space is opened to everyday life. This indicates that current action recognition systems rely mostly on pattern memorization and do not effectively understand the action, which makes them fragile and unable to adapt to new settings Sigurdsson2017WhatAA; Kong2018HumanAR. Research on how to improve action recognition in videos Sigurdsson2017WhatAA shows that recognition systems for actions with known intent achieve a significant increase in performance, as knowing the reason for performing an action is an important step toward understanding that action Tosi1991ATO; Gilovich2002HeuristicsAB.
In contrast to action recognition, research on causal reasoning about actions is just emerging in computational applications Vondrick2016PredictingMO; Yeo2018VisualCO; Zhang2020LearningCC; Fang2020Video2CommonsenseGC. Causal reasoning has direct applications in many real-life settings, for instance to understand the consequences of events (e.g., if “there is clutter,” “cleaning” is required), or to enable social reasoning (e.g., when “guests are expected,” “cleaning” may be needed – see Figure 1). Most of the work to date on causal systems has relied on semantic parsers to identify reasons He2017DeepSR; however, this approach does not work well in more realistic everyday settings. As an example, consider the statement “This is a mess and my friends are coming over. I need to start cleaning.” Current causal systems are unable to identify “this is a mess” and “friends are coming over” as reasons, and thus fail to use them as context for understanding the action of “cleaning.”
In this paper, we propose the task of multimodal action reason identification in everyday life scenarios. We collect a dataset of lifestyle vlogs from YouTube that reflect daily scenarios and are currently very challenging for systems to solve. Vloggers freely express themselves while performing common everyday activities such as cleaning, eating, cooking, and writing. Lifestyle vlogs present a person’s everyday routine: the vlogger visually records the activities they perform during a normal day and verbally expresses their intentions and feelings about those activities. Because of these characteristics, lifestyle vlogs are a rich data source for an in-depth study of human actions and the reasons behind them.
The paper makes four main contributions. First, we formalize the new task of multimodal action reason identification in online vlogs. Second, we introduce a new dataset, WhyAct, consisting of 1,077 (action, context, reasons) tuples manually labeled in online vlogs, covering 24 actions and their reasons drawn from ConceptNet as well as crowdsourcing contributions. Third, we propose several models to solve the task of human action reason identification, consisting of single-modality models based on the visual content and vlog transcripts, as well as a multimodal model using a fill-in-the-blanks strategy. Finally, we present an analysis of our new dataset, which points to rich avenues for future work on improving the tasks of reason identification and ultimately action recognition in online videos.
There are three areas of research related to our work: identifying action motivation, commonsense knowledge acquisition, and web supervision.
The research most closely related to our paper is the work that introduced the task of predicting motivations of actions by leveraging text Vondrick2016PredictingMO. Their method was applied to images from the COCO dataset Lin2014MicrosoftCC, while ours is focused on videos from YouTube. Other work on human action causality in the visual domain Yeo2018VisualCO; Zhang2020LearningCC relies on object detection and automatic image captioning as a way to represent videos and analyze visual causal relations. Research has also been carried out on detecting the intentions of human actions Pezzelle2020BeDT; the task definition differs from ours, however, as their goal is to automatically choose the correct action for a given image and intention. Other related work includes Synakowski2020AddingKT, a vision-based model for classifying actions as intentional or non-intentional, and Intentonomy Jia2020IntentonomyAD, a dataset on human intent behind images on Instagram.
Research on commonsense knowledge often relies on textual knowledge bases such as ConceptNet Speer2017ConceptNet5A, ATOMIC Sap2019ATOMICAA, COMET-ATOMIC 2020 Hwang2020COMETATOMIC2O, and more recently GLUCOSE Mostafazadeh2020GLUCOSEGA.
Recently, several of these textual knowledge bases have also been used for visual applications, to create more complex multimodal datasets and models Park2020VisualCOMETRA; Fang2020Video2CommonsenseGC; Song2020KVLBERTKE. VisualCOMET Park2020VisualCOMETRA is a dataset for visual commonsense reasoning tasks to predict events that might have happened before a given event, events that might happen next, as well as people intents at a given point in time. Their dataset is built on top of VCR zellers2019vcr, which consists of images of multiple people and activities. Video2Commonsense Fang2020Video2CommonsenseGC uses ATOMIC to extract from an input video a list of intentions that are provided as input to a system that generates video captions, as well as three types of commonsense descriptions (intention, effect, attribute). KVL-BERT Song2020KVLBERTKE proposes a knowledge enhanced cross-modal BERT model by introducing entities extracted from ConceptNet Speer2017ConceptNet5A into the input sentences, followed by testing their visual question answering model on the VCR benchmark zellers2019vcr. Unlike previous work that broadly addresses commonsense relations, we focus on the extraction and analysis of action reasons, which allows us to gain deeper insights for this relation type.
The space of current commonsense inference systems is often limited to one dataset at a time, e.g., COCO Lin2014MicrosoftCC, VCR zellers2019vcr, MSR-VTT Xu2016MSRVTTAL. In our work, we ask commonsense questions in the context of rich, unlimited, constantly evolving online videos from YouTube.
Previous work has leveraged webly-labeled data for the purpose of identifying commonsense knowledge. One of the most extensive efforts is NELL (Never Ending Language Learner) Mitchell2015NeverEndingL, a system that learns everyday knowledge by crawling the web, reading documents and analysing their linguistic patterns. A closely related effort is NEIL (Never Ending Image Learner), which learns commonsense knowledge from images on the web Chen2013Neil. Large scale video datasets Miech2019HowTo100MLA on instructional videos and lifestyle vlogs Fouhey2018FromLV; Ignat2019IdentifyingVA are other examples of web supervision. The latter are similar to our work as they analyse online vlogs, but unlike our work, their focus is on action detection and not on the reasons behind actions.
In order to develop and test models for recognizing reasons for human actions in videos, we need a manually annotated dataset. This section describes the WhyAct dataset of action reasons.
We start by compiling a set of lifestyle videos from YouTube, consisting of people performing their daily routine activities, such as cleaning, cooking, studying, relaxing, and others. We build a data gathering pipeline to automatically extract and filter videos and their transcripts.
We select five YouTube channels and download all the videos and their transcripts. The channels are selected to have good quality videos with automatically generated transcripts containing detailed verbal descriptions of the actions depicted. An analysis of the videos indicates that both the textual and visual information are rich sources for describing not only the actions, but why the actions in the videos are undertaken (action reasons). We present qualitative and quantitative analyses of our data in section 6.
We also collect a set of human actions and their reasons from ConceptNet Speer2017ConceptNet5A. Actions include verbs such as: clean, write, eat, and other verbs describing everyday activities. The actions are selected based on how many reasons are provided in ConceptNet and how likely they are to appear in our collected videos. For example, the action of cleaning is likely to appear in the vlog data, while the action of yawning is not.
After collecting the videos, actions and their corresponding reasons, the following data pre-processing steps are applied.
| Filtering step | # Actions |
|---|---|
| Actions with reasons in ConceptNet | 139 |
| Actions with at least 3 reasons in ConceptNet | 102 |
| Actions with at least 25 video-clips | 25 |
From ConceptNet, we select actions that have at least three reasons. The reasons in ConceptNet are marked by the “motivated by” relation. We further filter out the actions that appear fewer than 25 times in our video dataset, in order to ensure that each action has a significant number of instances.
We find that the reasons from ConceptNet are often very similar to each other, and thus easy to confound. For example, the reasons for the action clean are: “dirty”, “remove dirt”, “don’t like dirtiness”, “there dust”, “dirtiness unpleasant”, “dirt can make ill”, “things cleaner”, “messy”, “company was coming”. To address this issue, we apply agglomerative clustering Murtagh2014WardsHA to group similar reasons together. For instance, for the action clean, the following clusters are produced: [“dirty”, “remove dirt”, “there dust”, “things cleaner”], [“don’t like dirtiness”, “dirtiness unpleasant”, “dirt can make ill”], [“messy”], [“company was coming”]. Next, we manually select the most representative and clear reason from each cluster. We also correct any spelling mistakes and rename the reasons that are either too general or unclear (e.g., we rename “messy” to “declutter”). Finally, after the clustering and processing steps, we filter out all the actions that have fewer than three reasons.
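The clustering step can be sketched as follows. Note that this is only an illustration: the paper does not specify the distance function, so we use a character-bigram Jaccard distance and single-linkage merging, both assumptions made purely for this sketch.

```python
def bigrams(phrase):
    """Character bigrams of a phrase, ignoring spaces."""
    s = phrase.replace(" ", "")
    return {s[i:i + 2] for i in range(len(s) - 1)}

def distance(a, b):
    """Jaccard distance over character bigrams (illustrative choice)."""
    ba, bb = bigrams(a), bigrams(b)
    return 1.0 - len(ba & bb) / len(ba | bb)

def agglomerative_cluster(phrases, max_dist=0.8):
    """Single-linkage agglomerative clustering: repeatedly merge the two
    closest clusters until the closest remaining pair exceeds max_dist."""
    clusters = [[p] for p in phrases]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_dist:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

reasons = ["dirty", "remove dirt", "dirtiness unpleasant",
           "dirt can make ill", "messy", "company was coming"]
clusters = agglomerative_cluster(reasons)
```

With these toy choices, dirt-related phrases group together while “messy” and “company was coming” remain in their own clusters, mirroring the grouping described above.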
We show the statistics before and after the successive filtering steps in Table 1.
We want transcripts that reflect the reasons for performing one or more actions shown in the video. However, the majority of the transcripts contain mainly verbal descriptions of the action, which are not always helpful in determining their reason. We therefore implement a method to select candidate transcript sequences that contain at least one causal relation related to the actions shown in the video.
We start by automatically splitting the transcripts into sentences using spaCy spacy. Next, we select the sentences with at least one action from the final list of actions we collected from ConceptNet (see the previous section). For each selected sentence, we also collect its context consisting of the sentences before and after. We do this in order to increase the search space for the reasons for the actions mentioned in the selected sentences.
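The sentence selection and context-gathering step can be sketched as follows, assuming the transcript has already been split into sentences (with spaCy in our pipeline) and using a small illustrative subset of the action list; the real pipeline matches lemmatized action verbs rather than raw tokens.

```python
ACTIONS = {"clean", "cook", "write", "eat"}  # illustrative subset

def select_with_context(sentences):
    """Return (sentence, context) pairs for sentences that mention an
    action; the context is the sentence before and the sentence after."""
    selected = []
    for i, sent in enumerate(sentences):
        words = {w.strip(".,!?").lower() for w in sent.split()}
        if words & ACTIONS:
            before = sentences[i - 1] if i > 0 else ""
            after = sentences[i + 1] if i < len(sentences) - 1 else ""
            selected.append((sent, (before, after)))
    return selected
```

For example, on ["My friends are coming over.", "I need to clean the kitchen.", "It is such a mess."], only the second sentence is selected, with the two neighboring sentences as context.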
We want to keep the sentences that contain action reasons. We tried multiple methods to automatically determine the sentences most likely to include causal relations, using Semantic Role Labeling (SRL) Ouchi2018ASS, Open Information Extraction (OpenIE) Angeli2015LeveragingLS, and searching for causal markers. We found that SRL and OpenIE do not work well on our data, likely because the transcripts are noisier than the datasets these models were trained on. Most of the language in the transcripts does not follow simple patterns such as “I clean because it is dirty.” Instead, it consists of natural everyday speech such as “Look at how dirty this is, I think I should clean it.”
We find that a strategy sufficient for our purposes is to search for causal markers such as “because”, “since”, “so that is why”, “thus”, “therefore” in the sentence and the context, and constrain the distance between the actions and the markers to be less than 15 words – a threshold identified on development data. We thus keep all the transcript sentences and their context that contain at least one action and a causal marker within a distance of less than the threshold of 15 words.
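The marker-distance filter can be sketched as below. The exact tokenization we use is not essential to the idea; this sketch counts whitespace-separated words and strips basic punctuation.

```python
CAUSAL_MARKERS = ["because", "since", "so that is why", "thus", "therefore"]
MAX_DISTANCE = 15  # word-distance threshold, identified on development data

def keeps_causal_context(text, action):
    """True if `action` occurs within MAX_DISTANCE words of a causal marker."""
    tokens = text.lower().split()
    clean = [t.strip(".,!?") for t in tokens]
    action_positions = [i for i, t in enumerate(clean) if t == action]
    marker_positions = []
    for marker in CAUSAL_MARKERS:
        m = marker.split()
        for i in range(len(clean) - len(m) + 1):
            if clean[i:i + len(m)] == m:
                marker_positions.append(i)
    return any(abs(a - m) <= MAX_DISTANCE
               for a in action_positions for m in marker_positions)
```

A sentence such as “I need to clean because my friends are coming over” passes the filter, while a marker more than 15 words away from the action does not.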
As transcripts are temporally aligned with videos, we can obtain meaningful video clips related to the narration. We extract video clips corresponding to the sentences selected from transcripts (described in the section above).
We want video clips that show why the actions are being performed. Although there can be many actions along with reasons in the transcript, if they are not depicted in the video, we cannot leverage the video information in our task. Videos with low movement tend to show people sitting in front of the camera, describing their routine, but not performing the action they are talking about. We therefore remove clips that do not contain enough movement. We sample one out of every one hundred frames of the clip, and compute the 2D correlation coefficient between these sampled frames. If the median of the obtained values is greater than a certain threshold (0.8, selected on the development data), we filter out the clip. We also remove video-clips that are shorter than 10 seconds or longer than 3 minutes.
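The low-movement filter can be sketched with NumPy as follows; here we assume the correlation is computed between consecutive sampled frames (the pairing is not spelled out above), using Pearson correlation over the flattened pixel values.

```python
import numpy as np

def is_low_movement(frames, threshold=0.8, step=100):
    """Flag a clip as low-movement: sample one of every `step` frames,
    correlate consecutive sampled frames (flattened), and check whether
    the median correlation exceeds the threshold."""
    sampled = frames[::step]
    if len(sampled) < 2:
        return False
    corrs = [np.corrcoef(a.ravel(), b.ravel())[0, 1]
             for a, b in zip(sampled, sampled[1:])]
    return float(np.median(corrs)) > threshold
```

A clip of identical frames yields correlations of 1.0 and is filtered out, while frames with substantial changes yield low correlations and are kept.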
The resulting (video clip, action, reasons) tuples are annotated with the help of Amazon Mechanical Turk (AMT) workers. They are asked to identify: (1) the reasons shown or mentioned in the video clip for performing a given action; (2) how the reasons are identified in the video: whether they are mentioned verbally, shown visually, or both; (3) whether there are reasons other than the ones provided; (4) how confident the annotator is in their response. The guidelines and interface for annotation are shown in Figure 2. In addition to the guidelines, we also provide the annotators with a series of examples of completed assignments, with explanations for why the answers were selected. We present them in the supplemental material in Figure 6.
We add new action reasons proposed by the annotators if they are repeated at least three times in the collected answers and are not similar to the already existing ones.
Each assignment is completed by three different master annotators. We compute the agreement between the annotators using Fleiss Kappa Fleiss1971MeasuringNS and we obtain 0.6, which indicates a moderate agreement. Because the annotators can select multiple reasons, the agreement is computed per reason and then averaged.
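The agreement computation follows the standard Fleiss' kappa formula; a minimal implementation is sketched below (a reason's agreement here is computed over binary selected/not-selected counts per video, matching the per-reason computation described above).

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings` of shape (items, categories), where
    ratings[i, c] counts the annotators who assigned category c to item i.
    Each row must sum to the (fixed) number of raters."""
    ratings = np.asarray(ratings, dtype=float)
    n_items, _ = ratings.shape
    n_raters = ratings[0].sum()
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = (ratings * (ratings - 1)).sum(axis=1) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category distribution.
    p_j = ratings.sum(axis=0) / (n_items * n_raters)
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)
```

For three annotators in perfect agreement the kappa is 1.0; heavily split annotations push it toward (or below) zero.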
We also analyse how confident the workers are in their answers: for each video, we take the confidence level selected by the majority of workers. Out of 1,077 videos, the majority of workers are highly confident for 890 videos.
Table 2 shows statistics for our final dataset of video-clips and actions annotated with their reasons. Figure 1 shows a sample video and transcript, with annotations. Additional examples of annotated actions and their reasons can be seen in the supplemental material in Figure 8.
Given a video, an action, and a list of candidate action reasons, our goal is to determine the reasons mentioned or shown in the video. We develop a multimodal model that leverages both visual and textual information, and we compare its performance with several single-modality baselines.
The models we develop are unsupervised in that we are not learning any task-specific information from a training dataset. We use a validation set only to tune the hyper-parameters of the models.
To represent the textual data – transcripts and candidate reasons – we use sentence embeddings computed using the pre-trained model Sentence-BERT reimers-2019-sentence-bert.
In order to tie together the causal relations with both the textual and the visual information, we represent the video as a bag of object labels and a collection of video captions. For object detection, we use Detectron2 wu2019detectron2, a state-of-the-art object detection framework.
We generate automatic captions for the videos using a state-of-the-art dense captioning model BMT_Iashin_2020. The inputs to the model are visual features extracted with an I3D model pre-trained on Kinetics Carreira2017QuoVA, audio features extracted with a VGGish model 45611 pre-trained on YouTube-8M AbuElHaija2016YouTube8MAL, and caption tokens represented using GloVe Pennington2014GloveGV.
Using the representations described in Section 4.1, we implement several textual and visual models.
Given an action, a video transcript associated with the action, and a list of the candidate action reasons, we compute the cosine similarity between the textual representations of the transcript and all the candidate reasons. We predict as correct those reasons that have a cosine similarity with the transcript greater than a threshold of 0.1. The threshold is fine-tuned on development data.
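The thresholding logic of this baseline can be sketched as follows; toy vectors stand in for the Sentence-BERT embeddings of the transcript and candidate reasons.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similar_reasons(transcript_emb, reason_embs, threshold=0.1):
    """Return indices of candidate reasons whose embedding has a cosine
    similarity with the transcript embedding above the threshold."""
    return [i for i, r in enumerate(reason_embs)
            if cosine(transcript_emb, r) > threshold]
```

With a transcript embedding of [1, 0], a near-parallel reason embedding passes the threshold while orthogonal or opposite ones do not.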
Because the transcript might contain information that is unrelated to the action described or its reasons, we also develop a second version of this baseline. When computing the similarity, instead of using the whole transcript, we select only the part of the transcript that is in the vicinity of the causal markers (before and after a fixed number of words, fine-tuned on development data).
We use a pre-trained NLI model in a zero-shot setting Yin2019BenchmarkingZT, trained on MultiNLI N18-1101, a collection of sentence pairs annotated with textual entailment information.
The method works by posing the sequence to be classified as the NLI premise and constructing a hypothesis from each candidate label: given the transcript as a premise and the list of reasons as the hypotheses, each reason will receive a score that reflects the probability of entailment. For example, if we want to evaluate whether the label “declutter” is a reason for the action “cleaning”, we construct the hypothesis “The reason for cleaning is declutter.”
We use a threshold of 0.8 fine-tuned on the development data to filter the reasons that have a high entailment score with the transcript.
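The zero-shot recipe reduces to hypothesis templating plus thresholding, sketched below; `entail_prob` is a hypothetical stub standing in for a real NLI model's entailment probability.

```python
HYPOTHESIS_TEMPLATE = "The reason for {action} is {reason}."

def zero_shot_reasons(transcript, action, candidate_reasons,
                      entail_prob, threshold=0.8):
    """Keep each candidate reason whose templated hypothesis is entailed
    by the transcript (the premise) with probability above the threshold.
    `entail_prob(premise, hypothesis)` must be supplied by an NLI model."""
    kept = []
    for reason in candidate_reasons:
        hypothesis = HYPOTHESIS_TEMPLATE.format(action=action, reason=reason)
        if entail_prob(transcript, hypothesis) > threshold:
            kept.append(reason)
    return kept
```

For the visual variants described below, the premise string is simply replaced by the concatenated object labels or captions; the rest of the recipe is unchanged.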
We replace the transcript in the premise with a list of object labels detected from the video. The objects are detected using the Detectron2 model wu2019detectron2 on each video frame, at 1fps. We select only the objects that pass a confidence score of 0.7.
We replace the transcript in the premise with a list of video captions detected using the Bi-modal Transformer for Dense Video Captioning model BMT_Iashin_2020. The video captioning model generates captions for several time slots. We further filter the generated captions to remove redundant captions: if a time slot is heavily overlapped or even covered by another time slot, we only keep the caption of the longer time slot. We find that captions of longer time slots are also more informative and accurate compared to captions of shorter time slots.
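The redundancy filter over caption time slots can be sketched as follows; since the text above does not define “heavily overlapped” numerically, this sketch assumes it means the overlap covers more than a fixed fraction of the shorter slot.

```python
def filter_captions(captions, overlap_ratio=0.7):
    """Drop a caption whose time slot is heavily overlapped (or fully
    covered) by a longer slot. `captions` is a list of (start, end, text).
    The 0.7 overlap ratio is an assumption for illustration."""
    kept = []
    # Visit longest slots first, so longer captions win.
    for start, end, text in sorted(captions, key=lambda c: c[0] - c[1]):
        redundant = any(
            min(end, ke) - max(start, ks) > overlap_ratio * (end - start)
            for ks, ke, _ in kept)
        if not redundant:
            kept.append((start, end, text))
    return kept
```

A short slot covered by a longer one is dropped, while non-overlapping slots are all retained.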
To leverage information from both the visual and linguistic modalities, we propose a new model that recasts our task as a Cloze task and attempts to identify the action reasons by performing a fill-in-the-blanks prediction, similar to fitb, which fills in blanks corresponding to noun phrases in descriptions based on video clip content. Specifically, after each action mention for which we want to identify the reason, we add the text “because _____.” For instance, “I clean the windows” is replaced by “I clean the windows because _____”. We train a language model to compute the likelihood of filling in the blank with each of the candidate reasons. For this purpose, we use T5 t5, a pre-trained encoder-decoder transformer transformer model designed to fill in blanks with text.
To incorporate the visual data, we first obtain Kinetics-pre-trained I3D Carreira2017QuoVA RGB features at 25fps (from the average pooling layer). We input these features to the T5 encoder after the transcript text tokens. The text input is passed through an embedding layer (as in T5), while the video features are passed through a linear layer. Since T5 was not trained with this kind of input, we fine-tune it on unlabeled data from the same source, excluding data that contains the causal marker “because”. Note that this also helps the model specialize in filling in the blank with reasons. Finally, we fine-tune the model on the development data. We obtain the reasons for an action by computing the likelihood of each candidate reason and keeping those that pass a threshold selected based on the development data. The model architecture is shown in Figure 3.
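The blank-insertion and thresholding logic can be sketched as follows; `blank_likelihood` is a hypothetical stub standing in for the fine-tuned T5 scorer (which, in the full model, also consumes the projected I3D features).

```python
import re

def add_blank(sentence, action):
    """Append 'because _____' to the clause containing the action mention,
    e.g. 'I clean the windows' -> 'I clean the windows because _____'."""
    pattern = rf"([^.]*\b{re.escape(action)}\b[^.]*)"
    return re.sub(pattern, r"\1 because _____", sentence, count=1)

def fill_in_reasons(sentence, action, candidates, blank_likelihood,
                    threshold=0.5):
    """Keep candidate reasons whose fill-in likelihood passes the
    threshold. `blank_likelihood(text, candidate)` is supplied by the
    trained language model; the threshold is tuned on development data."""
    text = add_blank(sentence, action)
    return [c for c in candidates if blank_likelihood(text, c) > threshold]
```

The same text transformation serves the single-modality variant of the model, which scores the blank from the transcript alone.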
We also use our fill-in-the-blanks model in a single modality mode, where we apply it only on the transcript.
We consider as gold standard the labels selected by the majority of workers (at least two out of three workers).
For our experiments, we split the data across video-clips: 20% development and 80% test (see Table 3 for a breakdown of actions, reasons and video-clips in each set). We evaluate our systems as follows. For each action and corresponding video-clip, we compute the Accuracy, Precision, Recall and F1 scores between the gold standard and predicted labels. We then compute the average of the scores across actions. Because the annotated data is unbalanced (on average, 2 out of 6 candidate reasons per instance are selected as gold standard), the most representative metric is the F1 score. The average results are shown in Table 4. The results also vary by action: the per-action F1 scores of the best performing method are shown in the supplemental material in Figure 12.
Experiments on WhyAct reveal that both textual and visual modalities contribute to solving the task. The results demonstrate that the task is challenging, leaving room for improvement in future work.
Selecting the most frequent reason for each action on the test data achieves an average F1 of 40.64, with wide variation, ranging from a very low F1 for the action “writing” (7.66) to a high F1 for the action “cleaning” (55.42). Note however that the “most frequent reason” model makes use of data distributions that our models do not use (because our models are not trained). Furthermore, it is expected that for certain actions the distribution of reasons is unbalanced, as in everyday life some action reasons are much more common than others (e.g., for “cleaning”, “remove dirt” is a more common reason than “company was coming”).
| Method | Input | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Causal relations from transcript | Transcript | 50.85 | 30.40 | 68.91 | 39.73 |
| *Single modality models* | | | | | |
| Natural Language Inference | Transcript | 68.41 | 41.90 | 48.01 | 40.78 |
| Natural Language Inference | Video object labels | 54.49 | 31.70 | 59.93 | 36.79 |
| Natural Language Inference | Video dense captions | 49.18 | 29.54 | 68.47 | 37.40 |
| Natural Language Inference | Video object labels & dense captions | 36.93 | 27.34 | 87.97 | 39.11 |
| *Multimodal neural models* | | | | | |
| Fill-in-the-blanks | Video & Transcript | 32.60 | 27.56 | 94.76 | 41.11 |
We perform an analysis of the actions, reasons and video-clips in the WhyAct dataset. The distribution of actions and their reasons is shown in Figure 4. The supplemental material includes additional analyses: the distribution of actions and their number of reasons (Figure 11) and videos (Figure 10), and the distribution of actions and their worker agreement scores (Figure 9).
We also explore the content of the videos by analysing their transcripts. In particular, we look at the actions and their direct objects. For example, the action clean is depicted in various ways in the videos: “clean shower”, “clean body”, “clean makeup”, “clean dishes”. This action diversity ensures that the task is challenging and complex, covering the full spectrum of everyday activities. In Figure 5 we show what kind of actions are depicted in the videos: we extract all the verbs and their five most frequent direct objects using spaCy spacy, then cluster them by verb and plot them using t-distributed Stochastic Neighbor Embedding (t-SNE) Maaten08visualizingdata.
Finally, we analyse what kind of information is required for detecting the action reasons: what is verbally described, what is visually shown in the video, or the combination of visual and verbal cues. For this, we analyse the workers’ justifications for selecting the action reasons: whether the reasons were verbally mentioned in the video, visually shown, or both. For each video, we take the justification selected by the majority of workers. We find that the reasons for the actions can be inferred by relying only on the narration for less than half of the videos (496 / 1,077). For the remaining videos, the annotators answered that they relied either on the visual information alone (in 55 videos) or on both visual and audio information (in 423 videos). The remaining 103 videos do not have clear agreement among annotators on the modality used to indicate the action reasons. We believe that this imbalanced split might be a reason why the multimodal model does not perform as well as the text model. For future work, we want to collect more visual data that contains action reasons.
The reasons in WhyAct vary from specific (e.g., for the verb “fall”, possible reasons are: “tripped”, “ladder broke”, “rush”, “makeup fell”) to general (e.g., for the verb “play”, possible reasons are: “relax”, “entertain yourself”, “play an instrument”). We believe that a model can benefit from learning both general and specific reasons. From general reasons such as “relax”, a model can learn to extrapolate, generalize, and adapt to other actions for which those reasons might apply (e.g., “relax” can also be a reason for actions like “drink” or “read”) and use these general reasons to learn commonalities between these actions. On the other hand, from a specific reason like “ladder broke”, the model can learn precise, even if limited, information, which applies to very specific actions.
During the data annotation process, the workers had the choice to write comments about the task. From these comments we found that some difficulties with data annotation had to do with actions expressed through verbs that have multiple meanings and are sometimes used as figures of speech. For instance, the verb “jump” was often labeled by workers as “jumping means starting” or “jumping is a figure of speech here.” Because the majority of videos containing the verb “jump” are labeled like this, we decided to remove this verb from our initial list of 25 actions. Another verb that is used (only a few times) with multiple meanings is “fall” and some of the comments received from the workers are: “she mentions the season fall, not the action of falling,” “falling is falling into place,” “falling off the wagon, figure of speech.” These examples confirm how rich and complex the collected data is and how current state-of-the-art parsers are not sufficient to correctly process it.
In this paper, we addressed the task of detecting human action reasons in online videos. We explored the genre of lifestyle vlogs, and constructed WhyAct – a new dataset of 1,077 video-clips, actions and their reasons. We described and evaluated several textual and visual baselines and introduced a multimodal model that leverages both visual and textual information.
We built WhyAct and action reason detection models to address two problems important for the advance of action recognition systems: adaptability to changing visual and textual context, and processing the richness of unscripted natural language. In future work, we plan to experiment with our action reason detection models in action recognition systems to improve their performance.
The dataset and the code introduced in this paper are publicly available at https://github.com/MichiganNLP/vlog_action_reason.
Our dataset contains public YouTube vlogs, in which vloggers choose to share episodes of their daily life routine. They share not only how they perform certain actions, but also their opinions and feelings about different subjects. We use the videos to detect actions and their reasons, without relying on any information about the identity of the person such as gender, age or location.
The data can be used to better understand people’s lives, by looking at their daily routine and why they choose to perform certain actions. The data contains videos of men and women and sometimes children. The routine videos present mostly ideal routines and are not comprehensive of all people’s daily lives. Most of the people represented in the videos are middle class Americans.
In our data release, we only provide the YouTube URLs of the videos, so the creators of the videos always have the option to remove them. YouTube videos are a frequent source of data in research papers Miech2019HowTo100MLA; Fouhey2018FromLV; AbuElHaija2016YouTube8MAL, and we followed the typical process used by this previous work: compiling the data through the official YouTube API and only sharing the URLs of the videos. We have the rights to use our dataset in the way we are using it, and we bear responsibility in case of a violation of rights or terms of service.
We thank Pingxuan Huang for his help in improving the annotation user interface. This research was partially supported by a grant from the Automotive Research Center (ARC) at the University of Michigan.