Code accompanying EGO-TOPO: Environment Affordances from Egocentric Video (CVPR 2020)
First-person video naturally brings the use of a physical environment to the forefront, since it shows the camera wearer interacting fluidly in a space based on his intentions. However, current methods largely separate the observed actions from the persistent space itself. We introduce a model for environment affordances that is learned directly from egocentric video. The main idea is to gain a human-centric model of a physical space (such as a kitchen) that captures (1) the primary spatial zones of interaction and (2) the likely activities they support. Our approach decomposes a space into a topological map derived from first-person activity, organizing an ego-video into a series of visits to the different zones. Further, we show how to link zones across multiple related environments (e.g., from videos of multiple kitchens) to obtain a consolidated representation of environment functionality. On EPIC-Kitchens and EGTEA+, we demonstrate our approach for learning scene affordances and anticipating future actions in long-form video.
“The affordances of the environment are what it offers the animal, what it provides or furnishes… It implies the complementarity of the animal and the environment.”—James J. Gibson, 1979
In traditional third-person images and video, we see a moment in time captured intentionally by a photographer who paused to actively record the scene. As a result, scene understanding is largely about answering the who/where/what questions of recognition: what objects are present? is it an indoor/outdoor scene? where is the person and what are they doing? [55, 52, 72, 42, 73, 34, 69, 18].
In contrast, in video captured from a first-person “egocentric” point of view, we see the environment through the eyes of a person passively wearing a camera. The surroundings are tightly linked to the camera-wearer’s ongoing interactions with the environment. As a result, scene understanding in egocentric video also entails how questions: how can one use this space, now and in the future? what areas are most conducive to a given activity?
Despite this link between activities and environments, existing first-person video understanding models typically ignore that the underlying environment is a persistent physical space. They instead treat the video as fixed-sized chunks of frames to be fed to neural networks[46, 5, 14, 65, 48, 41]. Meanwhile, methods that do model the environment via dense geometric reconstructions [63, 20, 57] suffer from SLAM failures—common in quickly moving head-mounted video—and do not discriminate between those 3D structures that are relevant to human actions and those that are not (e.g., a cutting board on the counter versus a random patch of floor). We contend that neither the “pure video” nor the “pure 3D” perspective adequately captures the scene as an action-affording space.
Our goal is to build a model for an environment that captures how people use it. We introduce an approach called Ego-Topo that converts egocentric video into a topological map consisting of activity “zones” and their rough spatial proximity. Taking cues from Gibson’s vision above, each zone is a region of the environment that affords a coherent set of interactions, as opposed to a uniformly shaped region in 3D space. See Fig. 1.
Specifically, from egocentric video of people actively using a space, we link frames across time based on (1) the physical spaces they share and (2) the functions afforded by the zone, regardless of the actual physical location. For example, for the former criterion, a dishwasher loaded at the start of the video is linked to the same dishwasher when unloaded, and to the dishwasher on another day. For the latter, a trash can in one kitchen could link to the garbage disposal in another: though visually distinct, both locations allow for the same action—discarding food. See Fig. 3.
In this way, we re-organize egocentric video into “visits” to known zones, rather than a series of unconnected clips. We show how doing so allows us to reason about first-person behavior (e.g., what are the most likely actions a person will do in the future?) and the environment itself (e.g., what are the possible object interactions that are likely in a particular zone, even if not observed there yet?).
Our Ego-Topo approach offers advantages over the existing models discussed above. Unlike the “pure video” approach, it provides a concise, spatially structured representation of the past. Unlike the “pure 3D” approach, our map is defined organically by people’s use of the space.
We demonstrate our model on two key tasks: inferring likely object interactions in a novel view and anticipating the actions needed to complete a long-term activity in first-person video. These tasks illustrate how a vision system that can successfully reason about scenes’ functionality would contribute to applications in augmented reality (AR) and robotics. For example, an AR system that knows where actions are possible in the environment could interactively guide a person through a tutorial; a mobile robot able to learn from video how people use a zone would be primed to act without extensive exploration.
On two challenging egocentric datasets, EPIC and EGTEA+, we show the value of modeling the environment explicitly for egocentric video understanding tasks, leading to more robust scene affordance models, and improving over state-of-the-art long range action anticipation models.
Whereas the camera is a bystander in traditional third-person vision, in first-person or egocentric vision, the camera is worn by a person interacting with the surroundings firsthand. This special viewpoint offers an array of interesting challenges, such as detecting gaze [40, 29], monitoring human-object interactions [4, 6, 51], creating daily life activity summaries [44, 39, 70, 43], or inferring the camera wearer’s identity or body pose [28, 33]. The field has grown quickly in recent years, thanks in part to new ego-video benchmarks [5, 41, 54, 62].
Recent work to recognize or anticipate actions in egocentric video adopts state-of-the-art video models from third-person video, like two-stream networks [41, 46], 3DConv models [5, 53, 48], or recurrent networks [15, 16, 61, 65]. In contrast, our model grounds first-person activity in a persistent topological encoding of the environment. Methods that leverage SLAM together with egocentric video [20, 57, 63] for activity forecasting also allow spatial grounding, though in a metric manner and with the challenges discussed above, which we illustrate in our experiments.
Recent work explores ways to enrich video representations with more structure. Graph-based methods encode relationships between detected objects: nodes are objects or actors, and edges specify their spatio-temporal layout or semantic relationships (e.g., is-holding) [67, 3, 45, 71]. Architectures for composite activity learn to encode atomic action “primitives” that are aggregated over the full time extent of the video [17, 30, 31], memory-based models record a recurrent network’s state, and 3D convnets augmented with long-term feature banks provide temporal context. Unlike any of the above, our approach encodes video in a human-centric manner according to how people use a space. In our graphs, nodes are spatial zones and connectivity depends on a person’s visitation over time.
Traditional maps use simultaneous localization and mapping (SLAM) to obtain dense metric measurements, viewing a space in strictly geometric terms. Instead, recent work in embodied visual navigation explores learning-based maps that leverage both visual patterns as well as geometry, with the advantage of extrapolating to novel environments (e.g., [23, 22, 59, 26, 9]). Our approach shares this motivation. However, unlike any of the above, our approach analyzes egocentric video, as opposed to controlling a robotic agent. Furthermore, whereas existing maps are derived from a robot’s exploration, our maps are derived from human behavior.
., my office) can be recognized using supervised learning. In contrast, our approach automatically discovers zones of activity from ego-video, and it links action-related zones across multiple environments.
Affordances are often focused on objects, where the goal is to anticipate how an object could be used, for example by learning to model object manipulation [1, 4] or how people would grasp an object [37, 51, 10, 6]. People’s body pose can even improve object recognition [7, 19]. The affordances of scenes are less studied. Prior work explores how a third-person view of a scene suggests likely 3D body poses that would occur there [60, 66, 21] and vice versa. More closely related to our work, Action Maps estimate missing activity labels for regular grid cells in an environment, using matrix completion with object and scene similarities as side information. In contrast, our work considers affordances not strongly tied to a single object’s appearance, and we introduce a graph-based video encoding derived from our topological maps that benefits action anticipation.
We aim to organize egocentric video into a map of activity “zones”—regions that afford a coherent set of interactions—and ground the video as a series of visits to these zones. This representation offers a middle ground between the “pure video” and “pure 3D” approaches discussed above, which either ignore the underlying environment by treating video as fixed-sized chunks of frames, or sacrifice important semantics of human behavior by densely reconstructing the whole environment. Instead, our model reasons jointly about the environment and the agent: which parts of the environment are most relevant for human action, what interactions does each zone afford, and how actions at these zones accomplish a goal.
Our approach is best suited to long term activities in egocentric video where zones are repeatedly visited and used in multiple ways over time. This definition applies broadly to common household and workplace environments (e.g., office, kitchen, retail store, grocery). In this work, we study kitchen environments using two public ego-video datasets (EPIC and EGTEA+), since cooking activities entail frequent human-object interactions and repeated use of multiple zones. Our approach is not intended for third-person video, short video clips, or video where the environment is constantly changing (e.g., driving down a street).
Our approach works as follows. First, we train a zone localization network to discover commonly visited spaces from egocentric video (Sec. 3.1). Then, given a novel video, we use the network to assign video clips to zones and create a topological map (graph) for the environment. We further link zones based on their function across video instances to create consolidated maps (Sec. 3.2). Finally, we train models that leverage the resulting graphs to uncover environment affordances (Sec. 3.3) and anticipate future actions in long videos (Sec. 3.4).
We leverage egocentric video of human activity to discover important “zones” for action. At a glance, one might attempt to discover spatial zones based on visual clustering or geometric partitions. However, clustering visual features (e.g., from a pretrained CNN) is insufficient since manipulated objects often feature prominently in ego-video, making the features sensitive to the set of objects present. For example, a sink with a cutting-board being washed vs. the same sink at a different time filled with plates would cluster into different zones. On the other hand, SLAM localization is often unreliable due to quick motions characteristic of egocentric video (for example, on the EPIC Kitchens dataset, only a fraction of frames can be accurately registered with a state-of-the-art SLAM algorithm). Further, SLAM reconstructs all parts of the environment indiscriminately, without regard for their ties to human action or lack thereof, e.g., giving the same capacity to a kitchen sink area as it gives to a random wall.
To address these issues, we propose a zone discovery procedure that links views based on both their visual content and their visitation by the camera wearer. The basis for this procedure is a localization network that estimates the similarity of a pair of video frames, designed as follows.
We sample pairs of frames from videos that are segmented into a series of action clips. Two training frames are similar if (1) they are near in time (separated by fewer than 15 frames) or from the same action clip, or (2) there are at least 10 inlier keypoints consistent with their estimated homography. The former allows us to capture the spatial coherence revealed by the person’s behavior and his/her tendency to dwell by action-informative zones, while the latter allows us to capture repeated backgrounds despite significant foreground object changes. Dissimilar frames are temporally distant views with low visual feature similarity, or incidental views in which no actions occur. See Fig. 2. We use SuperPoint keypoint descriptors to estimate homographies, and Euclidean distance between pretrained ResNet-152 features for visual similarity.
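The sampled pairs are used to train a localization network $\mathcal{L}$, a Siamese network with a ResNet-18 backbone that predicts the probability $\mathcal{L}(f_t, f_{t'})$ that two frames $f_t, f_{t'}$ in an egocentric video belong to the same zone.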
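The positive-pair rule described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the `Frame` record, the function name, and the idea that the homography inlier count is computed elsewhere (e.g., from keypoint matches) are hypothetical, while the two thresholds (15 frames, 10 inliers) come from the text. The criterion for dissimilar pairs is separate and is not covered here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    index: int                # frame number within the video
    clip_id: Optional[int]    # enclosing action clip, or None if no action occurs


def is_similar_pair(a: Frame, b: Frame, homography_inliers: int,
                    max_gap: int = 15, min_inliers: int = 10) -> bool:
    """Return True if two frames should be treated as a positive (same-zone) pair."""
    # Criterion (1): near in time, or drawn from the same action clip.
    near_in_time = abs(a.index - b.index) < max_gap
    same_clip = a.clip_id is not None and a.clip_id == b.clip_id
    # Criterion (2): enough keypoints agree with the estimated homography.
    same_background = homography_inliers >= min_inliers
    return near_in_time or same_clip or same_background
```

Keeping the two criteria as separate booleans makes it easy to log which one produced each positive pair when inspecting the training set.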
Our localization network draws inspiration from the retrieval network employed in prior work to build maps for embodied agent navigation, and more generally from prior work leveraging temporal coherence to self-supervise image similarity [24, 49, 32]. However, whereas the network in that prior work is learned from view sequences generated by a randomly navigating agent, ours learns from ego-video taken by a human acting purposefully in an environment rich with object manipulation. In short, nearness in that work is strictly about physical reachability, whereas nearness in our model is about human interaction in the environment.
With a trained localization network, we process the stream of frames in a new untrimmed, unlabeled egocentric video to build a topological map of its environment. For a video $V$ with $T$ frames $(f_1, \ldots, f_T)$, we create a graph $G = (N, E)$ with nodes $N$ and edges $E$. Each node of the graph is a zone and records a collection of “visits”, i.e., clips from the egocentric video at that location. For example, a cutting board counter visited at two different times, for 7 and 38 frames each, will be represented by a single node $n \in N$ with two visits $\{v_1, v_2\}$. See Fig. 1.
We initialize the graph with a single node $n_1$ corresponding to a visit with just the first frame $f_1$. For each subsequent frame $f_t$, we compute the average frame-level similarity score $s_f$ for the frame compared to each of the nodes $n \in N$ using the localization network $\mathcal{L}$ from Sec. 3.1:
$$s_f(f_t, n) = \frac{1}{|n|} \sum_{v \in n} \mathcal{L}(f_t, f_v),$$
where $f_v$ is the center frame selected from each visit $v$ in node $n$, and $|n|$ is the number of visits in $n$. If the network is confident that the frame is similar to one of the nodes, it is merged with the highest scoring node $n^* = \arg\max_{n \in N} s_f(f_t, n)$. Alternatively, if the network is confident that this is a new location (very low $s_f$), a new node is created for that location, and an edge is created from the previously visited node. The frame is ignored if the network is uncertain about the frame. Algorithm 1 summarizes the construction algorithm. Further implementation details and the confidence threshold values can be found in Supp.
When all frames are processed, we are left with a graph of the environment per video where nodes correspond to zones where actions take place (and a list of visits to them) and the edges capture weak spatial connectivity between zones based on how people traverse them.
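The construction procedure described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: the function name, the two threshold values, and the rule that a visit is extended whenever a frame lands in the most recently visited node are assumptions, and `similarity` stands in for the trained localization network.

```python
from typing import Callable, List, Sequence, Set, Tuple

Visit = List[int]   # frame indices belonging to one visit
Node = List[Visit]  # a zone is a list of visits


def build_topological_graph(
    frames: Sequence,
    similarity: Callable[[object, object], float],
    merge_thresh: float = 0.7,  # hypothetical "confident match" threshold
    new_thresh: float = 0.3,    # hypothetical "confident new zone" threshold
) -> Tuple[List[Node], Set[Tuple[int, int]]]:
    nodes: List[Node] = [[[0]]]  # first node: one visit holding frame 0
    edges: Set[Tuple[int, int]] = set()
    last = 0                     # index of the most recently visited node

    for t in range(1, len(frames)):
        # Average similarity of frame t to the center frame of every visit.
        scores = []
        for visits in nodes:
            centers = [frames[v[len(v) // 2]] for v in visits]
            scores.append(sum(similarity(frames[t], c) for c in centers) / len(centers))
        best = max(range(len(nodes)), key=scores.__getitem__)

        if scores[best] >= merge_thresh:   # confident match: merge into best node
            target = best
        elif scores[best] <= new_thresh:   # confident novelty: create a new node
            nodes.append([])
            target = len(nodes) - 1
        else:                              # uncertain: ignore the frame
            continue

        if target == last:
            nodes[target][-1].append(t)    # same zone as before: extend the visit
        else:
            nodes[target].append([t])      # entered another zone: start a new visit
            edges.add((last, target))
            last = target
    return nodes, edges
```

With a toy similarity function that returns 1.0 for identical frames and 0.0 otherwise, a sequence that leaves a zone and returns to it produces one node with two visits, which is the behavior the text describes for the cutting board counter.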
Importantly, beyond per-video maps, our approach also creates cross-video and cross-environment maps that link spaces by their function. We show how to link zones across (1) multiple episodes in the same environment and (2) multiple environments with shared functionality. To do this, for each node we use a pretrained action/object classifier to compute the distributions of actions and active objects (objects involved in an interaction) that occur in all visits to that node. We then compute a node-level functional similarity score for each pair of nodes, based on the KL-divergence (KL) between their action distributions and between their active-object distributions. We score pairs of nodes across all kitchens, and perform hierarchical agglomerative clustering to link nodes with functional similarity. Details about the clustering algorithm are in Supp.
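As a rough illustration of functional linking, the sketch below scores node pairs with a KL-based similarity and then merges clusters greedily. Several choices here are assumptions rather than the paper's definitions: the exact form of the score (the negative mean of the two KL terms), the symmetrized average-linkage rule, the stopping threshold, and all function names. The paper defers its clustering details to the supplementary material.

```python
import math
from typing import List, Sequence, Tuple

NodeDist = Tuple[Sequence[float], Sequence[float]]  # (action dist, object dist)


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-8) -> float:
    """KL(p || q) for discrete distributions, smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def functional_similarity(ni: NodeDist, nj: NodeDist) -> float:
    """Assumed form: higher (closer to 0) means more similar function."""
    (a_i, o_i), (a_j, o_j) = ni, nj
    return -0.5 * (kl(a_i, a_j) + kl(o_i, o_j))


def link_nodes(nodes: List[NodeDist], threshold: float) -> List[List[int]]:
    """Greedy agglomerative clustering with symmetrized average linkage."""
    clusters = [[i] for i in range(len(nodes))]

    def score(c1: List[int], c2: List[int]) -> float:
        vals = [0.5 * (functional_similarity(nodes[i], nodes[j])
                       + functional_similarity(nodes[j], nodes[i]))
                for i in c1 for j in c2]
        return sum(vals) / len(vals)

    while len(clusters) > 1:
        best, a, b = max((score(clusters[a], clusters[b]), a, b)
                         for a in range(len(clusters))
                         for b in range(a + 1, len(clusters)))
        if best < threshold:      # no remaining pair is similar enough
            break
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters
```

In this toy form, two nodes whose action and object distributions are nearly identical (say, a gas stove and a hotplate) end up in one cluster, while a node with a very different distribution stays separate.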
Linking nodes in this way offers several benefits. First, not all parts of the kitchen are visited in every episode (video). We link zones across different episodes in the same kitchen to create a combined map of that kitchen that accounts for the persistent physical space underlying multiple video encounters. Second, we link zones across kitchens to create a consolidated kitchen map, which reveals how different kitchens relate to each other. For example, a gas stove in one kitchen could link to a hotplate in another, despite being visually dissimilar (see Fig. 3). Being able to draw such parallels is valuable when planning to act in a new unseen environment, as we will demonstrate below.
Next, we leverage the proposed topological graph to predict a zone’s affordances—all likely interactions possible at that zone. Learning scene affordances is especially important when an agent must use a previously unseen environment to perform a task. Humans seamlessly do this, e.g., cooking a meal in a friend’s house; we are interested in AR systems and robots that learn to do so by watching humans.
We know that egocentric video of people performing daily activities reveals how different parts of the space are used. Indeed, the actions observed per zone partially reveal its affordances. However, since each clip of an ego-video shows a zone being used only for a single interaction, it falls short of capturing all likely interactions at that location.
To overcome this limitation, our key insight is that linking zones within/across environments allows us to extrapolate labels for unseen interactions at seen zones, resulting in a more complete picture of affordances. In other words, having seen an interaction $(a_i, o_i)$ at a zone $n_j$ allows us to augment training for the affordance of $(a_i, o_i)$ at zone $n_k$, if zones $n_j$ and $n_k$ are functionally linked. See Fig. 4 (Left).
To this end, we treat the affordance learning problem as a multi-label classification task that maps image features to an $A$-dimensional binary indicator vector $y \in \{0, 1\}^A$, where $A$ is the number of possible interactions. We generate training data for this task using the topological affordance graphs defined in Sec. 3.2.
Specifically, we calculate node-level affordance labels $y_n$ for each node $n$:
$$y_n(k) = 1 \quad \text{for } k \in \bigcup_{v \in n} \mathcal{A}(v),$$
where $\mathcal{A}(v)$ is the set of all interactions that occur during visit $v$. Then, from each visit to a node $n$, we sample a frame, generate its frame features $x$, and use $y_n$ as the affordance multi-label target. We use a 2-layer MLP for the affordance classifier, followed by a linear classifier and a sigmoid function. The network is trained using binary cross entropy loss.
At test time, given an image in an environment, this classifier directly predicts its affordance probabilities. See Fig. 4 (Left). Critically, linking frames into zones and linking zones between environments allows us to share labels across instances in a manner that benefits affordance learning, better than models that link data purely based on geometric or visual nearness (cf. Sec. 4.1). Our Ego-Topo graph derives its connectivity naturally by observing human use of related environments.
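The node-level label sharing described in this section can be sketched as a set union over visits, optionally extended to all nodes linked in a consolidated graph. The function names and the integer encoding of interactions are hypothetical, and the sketch covers only target construction, not the classifier.

```python
from typing import Iterable, List, Sequence, Set


def node_affordance_labels(visit_interactions: Iterable[Set[int]],
                           num_interactions: int) -> List[int]:
    """Binary target with a 1 for every interaction seen in any visit to the node."""
    target = [0] * num_interactions
    for interactions in visit_interactions:
        for k in interactions:
            target[k] = 1
    return target


def linked_affordance_labels(cluster: Sequence[Sequence[Set[int]]],
                             num_interactions: int) -> List[int]:
    """Share labels across all nodes in one functionally linked cluster."""
    all_visits = [visit for node in cluster for visit in node]
    return node_affordance_labels(all_visits, num_interactions)


# A sink node visited three times, each visit showing a single interaction.
sink = [{0}, {2}, {0}]
# A second, functionally linked node from another kitchen.
other_sink = [{3}]
single_node_target = node_affordance_labels(sink, 5)
linked_target = linked_affordance_labels([sink, other_sink], 5)
```

Every frame sampled from any visit to the node would then be paired with the same target, so a frame showing only one interaction still carries labels for the others observed in that zone.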
Next, we leverage our topological affordance graphs for long horizon anticipation. In the anticipation task, we see a fraction of a long video (e.g., the first 25%), and from that we must predict what actions will be done in the future. Compared to affordance learning, which benefits from how zones are functionally related to enhance static image understanding, long range action anticipation is a video understanding task that leverages how objects are distributed among zones and how these zones are laid out to anticipate human behavior.
Recent action anticipation [13, 75, 15, 5, 53, 16, 61] predicts the immediate next action (e.g. in the next 1 second) rather than all future actions, for which an encoding of recent video information is sufficient. For long range anticipation, models need to first understand how much progress has been made on the composite activity so far, and then anticipate what actions need to be done in the future to complete it. For this, a structured representation of all past activity and affordances is essential.
Existing long range video understanding methods [30, 31, 68] build complex models over past clip features to aggregate information from the past, but do not model the environment explicitly, which we hypothesize is important for anticipating actions in long video. Our graphs provide a concise representation of observed activity, grounding frames in the spatial environment. We leverage this grounding to learn trends in interaction sequences.
Given an untrimmed video with $M$ interaction clips, each involving an action $a_1, \ldots, a_M$ with some object, we see the first $k$ clips (experiments sweep over values of $k$ to test seeing more or less video) and predict the future action labels as a $D$-dimensional binary vector $a_{k:M}$, where $D$ is the number of action classes and $a^d_{k:M} = 1$ for $d \in \{a_{k+1}, \ldots, a_M\}$.
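The anticipation target described above can be sketched in a few lines. The function name and the integer encoding of action classes are assumptions made for illustration.

```python
from typing import List, Sequence


def future_action_target(clip_actions: Sequence[int], num_seen: int,
                         num_classes: int) -> List[int]:
    """Multi-hot vector marking every action class that occurs after the seen clips."""
    target = [0] * num_classes
    for action in clip_actions[num_seen:]:
        target[action] = 1
    return target


# Five clips with action classes 0, 2, 2, 1, 3; the model observes the first two.
example_target = future_action_target([0, 2, 2, 1, 3], num_seen=2, num_classes=5)
```

Note that an action class seen only in the observed portion (class 0 in the example) is not part of the target, since the task is to predict what remains to be done.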
We generate the corresponding topological graph built up to $k$ clips, and extract features $x_n$ for each node using a 2-layer MLP over the averaged clip features sampled from visits to that node.
Actions at one node influence future activities in other nodes. To account for this, we enhance node features by integrating neighbor node information from the topological graph using a graph convolutional neural network (GCN):
$$g_n = \mathrm{ReLU}\Big(\sum_{n' \in \mathcal{N}_n} W^\top x_{n'} + b\Big),$$
where $x_{n'}$ is the feature of node $n'$, $\mathcal{N}_n$ are the neighbors of node $n$, and $W$, $b$ are learnable parameters of the GCN.
The updated GCN representation for each individual node is enriched with global scene context from neighboring nodes, allowing patterns in actions across locations to be learned. For example, vegetables that are taken out of the fridge in the past are likely to be washed in the sink later. The GCN node features are then averaged to derive a representation of the video. This is then fed to a linear classifier followed by a sigmoid to predict future action probabilities, trained using binary cross entropy loss.
At test time, given an untrimmed, unlabeled video showing the onset of a long composite activity, our model can predict the actions that will likely occur in the future to complete it. Fig. 4 (Right) illustrates the task. Further details are in Supp. As we will see in results, grounding ego-video clips in the real environment—rather than treat them as an arbitrary set of frames—provides a stronger video representation for anticipation.
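The graph-based video encoding can be sketched with a single neighbor-aggregation layer followed by averaging over nodes. This is an illustrative forward pass only, written with plain lists to stay dependency-free. The layer form (sum over neighbors, shared weight matrix, bias, ReLU), the choice of whether a node counts as its own neighbor, and all names are assumptions, and the weights here are fixed rather than learned.

```python
from typing import Dict, List, Sequence


def gcn_layer(x: Dict[int, Sequence[float]],
              neighbors: Dict[int, Sequence[int]],
              weight: Sequence[Sequence[float]],  # shape: d_in x d_out
              bias: Sequence[float]) -> Dict[int, List[float]]:
    """One graph convolution: ReLU(sum over neighbors of W^T x + b) per node."""
    out: Dict[int, List[float]] = {}
    d_out = len(bias)
    for n, nbrs in neighbors.items():
        agg = list(bias)
        for m in nbrs:
            for j in range(d_out):
                agg[j] += sum(weight[i][j] * x[m][i] for i in range(len(x[m])))
        out[n] = [max(0.0, value) for value in agg]
    return out


def video_representation(g: Dict[int, Sequence[float]]) -> List[float]:
    """Average the node features to obtain one vector for the whole video."""
    feats = list(g.values())
    return [sum(col) / len(feats) for col in zip(*feats)]


# Two zones: node 0 is connected to itself and node 1, node 1 only to itself.
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}
adjacency = {0: [0, 1], 1: [1]}
identity = [[1.0, 0.0], [0.0, 1.0]]
g = gcn_layer(features, adjacency, identity, bias=[0.0, -2.0])
video_feature = video_representation(g)
```

In a full model the averaged vector would feed a linear layer and a sigmoid to produce per-class future action probabilities, as the text describes.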
We evaluate the proposed topological graphs for scene affordance learning and action anticipation in long videos.
Datasets. We use two egocentric video datasets:
EGTEA Gaze+  contains videos of 32 subjects following 7 recipes in a single kitchen. Each video captures a complete dish being prepared (e.g., potato salad, pizza), with clips annotated for interactions (e.g., open drawer, cut tomato), spanning 53 objects and 19 actions.
EPIC-Kitchens  contains videos of daily kitchen activities, and is not limited to a single recipe. It is annotated for interactions spanning 352 objects and 125 actions. Compared to EGTEA+, EPIC is larger, unscripted, and collected across multiple kitchens.
The kitchen environment is an ideal setting for our experiments, and has been the subject of several recent egocentric datasets [5, 41, 38, 64, 58, 74]. Repeated interaction with different parts of the kitchen during complex, multi-step cooking activities is a rich domain for learning affordance and anticipation models.
In this section, we evaluate how linking actions in zones and across environments can benefit affordances.
Baselines. We compare the following methods:
ClipAction uses clip-level action labels to learn to recognize those afforded actions which it has seen at a given location during training.
ActionMaps  estimates affordances of locations via matrix completion with side-information. It assumes that nearby locations with similar appearance/objects have similar affordances. See Supp. for details.
SLAM trains an action affordance classifier with the same architecture as ours, and treats all frames associated with the same grid cell on the ground plane as positives for actions observed at any time in that grid cell. Locations are obtained from monocular SLAM, and the cell size is based on the typical scale of an interaction area following prior work. It shares our insight to link actions in the same location, but is limited to a uniformly defined location grid and cannot link different environments.
KMeans clusters action clips using their visual features alone. We select as many clusters as there are nodes in our consolidated graph to ensure fair comparison.
Ours We show the three variants from Sec. 3.2 which use maps built from a single video (Ours-S), multiple videos of the same kitchen (Ours-M), and a functionally linked, consolidated map across kitchens (Ours-C).
Note that all methods use the clip-level annotated data, in addition to data from linking actions/spaces. They see the same video frames during training, only they are organized and presented with labels according to the method.
We crowd-source annotations for afforded interactions. Each instance is a view from the environment, paired with all likely interactions at that location regardless of whether the view shows it (e.g., turn-on stove, take/put pan etc. at a stove). We collect 1020 instances spanning 75 interactions on EGTEA+ and 1155 instances over 120 interactions on EPIC (see Supp. for details). All methods are evaluated on this test set. We report mean average precision (mAP) over all afforded interactions, and separately for the rare and frequent ones (fewer than 10 and more than 100 training instances, respectively).
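For readers unfamiliar with the metric, the sketch below shows one common way to compute mAP for multi-label predictions. The paper does not state which average precision variant it uses, so the ranking-based definition here, the handling of classes with no positives, and the function names are all assumptions.

```python
from typing import List, Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of precision values measured at the rank of each positive instance."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(score_cols: Sequence[Sequence[float]],
                           label_cols: Sequence[Sequence[int]]) -> float:
    """Average AP over classes, skipping classes with no positive instance."""
    aps: List[float] = [average_precision(s, y)
                        for s, y in zip(score_cols, label_cols) if any(y)]
    return sum(aps) / len(aps) if aps else 0.0


# One interaction class scored on three test views; positives rank 1st and 3rd.
ap = average_precision([0.9, 0.8, 0.1], [1, 0, 1])
```

Reporting the metric separately for rare and frequent classes then amounts to averaging AP over the corresponding subsets of classes.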
Table 1 summarizes the results. By capturing the persistent environment in our discovered zones, and linking them across environments, our method outperforms all other methods on the affordance prediction task. All models perform better on EGTEA+, which has fewer interaction classes, contains only one kitchen, and has at least 30 training examples per afforded action (compared to EPIC where 10% of the actions have a single annotated clip).
SLAM and ActionMaps  rely on monocular SLAM, which introduces certain limitations. See Fig. 5 (Left). A single grid cell in the SLAM map reliably registers only small windows of smooth motion, often capturing only single action clips at each location. In addition, inherent scale ambiguities and uniformly shaped cells can result in incoherent activities placed in the same cell. Note that this limitation stands even if SLAM were perfect. Together, these factors hurt performance on both datasets, more severely affecting EGTEA+ due to the scarcity of SLAM data (only 6% accurately registered). Noisy localizations also affect the kernel computed by ActionMaps, which accounts for physical nearness as well as similarities in object/scene features. In contrast, a zone in our topological affordance graph corresponds to a coherent set of clips at different times, offering a more reliable and diverse set of actions to link, as seen in Fig. 5 (Right).
Clustering using purely visual features in KMeans helps consolidate information in EGTEA+ where all videos are in the same kitchen, but hurts performance where visual features are insufficient to capture coherent zones.
Linking actions to discovered zones in our topological graph results in consistent improvements on both datasets. Moreover, aligning spaces based on function in the consolidated graph (Ours-C) provides the largest improvement, especially for rare classes that may only be seen tied to a single location.
Fig. 3 and Fig. 5 show the diverse actions captured in each node of our graph. Multiple actions at different times and from different kitchens are linked to the same zone, thus overcoming the sparsity in demonstrations and translating to a strong training signal for our scene affordance model. Fig. 6 shows example affordance predictions.
Next we evaluate how the structure of our topological graph yields better video features for long term anticipation.
TrainDist simply outputs the distribution of actions performed in all training videos, to test if a few dominant actions are repeatedly done, regardless of the video.
I3D uniformly samples 64 clip features and averages them to generate a video feature.
| Method | EPIC All | EPIC Freq | EPIC Rare | EGTEA+ All | EGTEA+ Freq | EGTEA+ Rare |
| Ours w/o GCN | 34.6 | 55.3 | 24.9 | 72.5 | 79.5 | 54.2 |
While all the compared methods model temporal information, none explicitly model the persistent environment in video as we propose.
For evaluation, we use the first $k$% of each untrimmed video as input, and predict all actions in the remaining video. We sweep values of $k$ representing different anticipation horizons. We report mAP over all action classes, and in low-shot (rare) and many-shot (freq) settings.
Table 2 shows the results averaged over all values of $k$, and Fig. 7 plots results vs. $k$. Our model outperforms all other methods on EPIC, improving over the next strongest baseline by 2.4% mAP on all 125 action classes. On EGTEA+, our model matches the performance of models with complicated temporal aggregation schemes, and achieves the highest results for many-shot classes.
EGTEA+ has a less diverse action vocabulary with a fixed set of recipes. TrainDist, which simply outputs a fixed distribution of actions for every video, performs relatively well (59% mAP) compared to its counterpart on EPIC (only 16% mAP), highlighting that there is a core set of repeatedly performed actions in the dataset.
Among the methods that employ complex temporal aggregation schemes, Timeception improves over I3D on both datasets, though our method outperforms it on the larger EPIC dataset. Simple aggregation of node level information (Ours w/o GCN) still consistently outperforms most baselines. However, including the graph convolution operations is essential to outperform more complex models, which shows the benefit of encoding the physical layout and interactions between zones in our topological map.
Fig. 7 breaks down performance by anticipation horizon $k$. On EPIC, our model is uniformly better across all prediction horizons, and it excels at predicting actions further into the future. This highlights the benefit of our environment-aware video representation. On EGTEA+, our model outperforms all other models except ActionVLAD on short range settings, but performs slightly worse at $k$=50%. On the other hand, ActionVLAD falls short of all other methods on the more challenging EPIC data.
We proposed a method to produce a topological affordance graph from egocentric video of human activity, highlighting commonly used zones that afford coherent actions across multiple kitchen environments. Our experiments on scene affordance learning and long range anticipation demonstrate its viability as an enhanced representation of the environment gained from egocentric video. Future work can leverage the environment affordances to guide users in unfamiliar spaces with AR or allow robots to explore a new space through the lens of how it is likely used.
International Conference on Computer Vision (ICCV).
First-person activity forecasting with online inverse reinforcement learning. In ICCV.
This section contains supplementary material to support the main paper text. The contents include:
(§S2) Setup and details for crowdsourced affordance annotation on EPIC and EGTEA+.
We show examples of our graph construction process over time from egocentric videos following Algorithm 1 in the main paper. The end result is a topological map of the environment where nodes represent primary spatial zones of interaction, and edges represent commonly traversed paths between them. Further, the video demonstrates our affordance prediction results from Sec. 4.1 over the constructed topological graph. The video and interface to explore the topological graphs can be found on the project page.
Fig. S2 shows static examples of fully constructed topological maps from a single egocentric video from the test sets of EPIC and EGTEA+. Graphs built from long videos with repeated visits to nodes (P01_18, P22_07) result in a more complete picture of the environment. Short videos where only a few zones are visited (P31_14) can be linked to other graphs of the same kitchen (Sec. 3.2). The last panel shows a result on EGTEA+.
As mentioned in Sec. 4.1, we collect annotations of afforded interactions for EPIC and EGTEA+ video frames to evaluate our affordance learning methods. We present annotators with a single frame (the center frame) from a video clip and ask them to select all interactions that are likely to occur in the location shown. Note that these annotations are used exclusively for evaluating affordance models; the models themselves are trained using single-clip interaction labels (see Sec. 3.3).
On EPIC, we select 120 interactions (verb-noun pairs) over the 15 most frequent verbs and for common objects that afford multiple interactions. For EGTEA+, we select all 75 interactions provided by the dataset. A list of all these interactions is in Table S1. Each image is labeled by 5 distinct annotators, and only labels that 3 or more annotators agree on are retained. This results in 1,020 images for EGTEA+ and 1,155 images for EPIC. Our annotation interface is shown in Fig. S1 (top panel), and examples of resulting annotations are shown in Fig. S1 (bottom panel).
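The agreement rule above (5 annotators per image, keep labels chosen by at least 3) can be sketched as a simple vote count. This is a minimal illustration, not the annotation pipeline itself: the function name `aggregate_labels` and the example responses are hypothetical.

```python
from collections import Counter

def aggregate_labels(annotations, min_agreement=3):
    """Keep labels selected by at least `min_agreement` annotators.

    `annotations` holds one collection of labels per annotator.
    """
    votes = Counter()
    for labels in annotations:
        votes.update(set(labels))  # each annotator votes at most once per label
    return sorted(label for label, n in votes.items() if n >= min_agreement)

# Hypothetical responses from 5 annotators for one frame.
frame_annotations = [
    {"open fridge", "take milk"},
    {"open fridge", "take milk", "wash plate"},
    {"open fridge"},
    {"open fridge", "take milk"},
    {"close fridge"},
]
kept = aggregate_labels(frame_annotations)
print(kept)
```

Here "open fridge" receives 4 votes and "take milk" receives 3, so both are retained, while the two labels with a single vote are dropped.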
|EPIC||put/take: pan, spoon, lid, board:chopping, bag, oil, salt, towel:kitchen, scissors, butter; open/close: tap, cupboard, fridge, lid, bin, salt, kettle, milk, dishwasher, ketchup; wash: plate, spoon, pot, sponge, hob, microwave, oven, scissors, mushroom; cut: tomato, pepper, chicken, package, cucumber, chilli, ginger, sandwich, cake; mix: pan, onion, spatula, salt, egg, salad, coffee, stock; pour: pan:dust, onion, water, kettle, milk, rice, egg, coffee, liquid:washing, beer; throw: onion, bag, bottle, tomato, box, coffee, towel:kitchen, paper, napkin; dry: pan, plate, knife, lid, glass, fork, container, hob, maker:coffee; turn-on/off: kettle, oven, machine:washing, light, maker:coffee, processor:food, switch, candle; turn: pan, meat, kettle, hob, filter, sausage; shake: pan, hand, pot, glass, bag, filter, jar, towel; peel: lid, potato, carrot, peach, avocado, melon; squeeze: sponge, tomato, liquid:washing, lemon, lime, cream; press: bottle, garlic, dough, switch, button; fill: pan, glass, cup, bin, bottle, kettle, squash|
|EGTEA+||inspect/read: recipe; open: fridge, cabinet, condiment_container, drawer, fridge_drawer, bread_container, dishwasher, cheese_container, oil_container; cut: tomato, cucumber, carrot, onion, bell_pepper, lettuce, olive; turn-on: faucet; put: eating_utensil, tomato, condiment_container, cucumber, onion, plate, bowl, trash, bell_pepper, cooking_utensil, paper_towel, bread, pan, lettuce, pot, seasoning_container, cup, bread_container, cutting_board, sponge, cheese_container, oil_container, tomato_container, cheese, pasta_container, grocery_bag, egg; operate: stove, microwave; move-around: eating_utensil, bowl, bacon, pan, patty, pot; wash: eating_utensil, bowl, pan, pot, hand, cutting_board, strainer; spread: condiment; divide/pull-apart: onion, paper_towel, lettuce; clean/wipe: counter; mix: mixture, pasta, egg; pour: condiment, oil, seasoning, water; compress: sandwich; crack: egg; squeeze: washing_liquid|
As noted in our experiments in Sec. 4.1, our method performs better on low-shot classes. Fig. S3 shows a class-wise breakdown of improvements achieved by our model over the ClipAction model on the scene affordance task. Among the interactions, those involving objects that are typically tied to a single physical location, highlighted in red (e.g., fridges, stoves, taps etc.), are easy to predict, and do not improve much. Our method works especially well for interaction classes that occur in multiple locations (e.g., put/take spoons/butter, pour rice/egg etc.), which are linked in our topological graph.
Homography estimation details (Sec. 3.1). We generate SuperPoint keypoints using the pretrained model provided by the authors. For each pair of frames, we estimate the homography from random samples of 4 point correspondences, and use RANSAC to maximize the number of inliers. We use the inlier count as the measure of similarity.
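As an illustration of the inlier-count similarity, the following is a minimal pure-Python sketch of 4-point homography fitting inside a RANSAC loop. It assumes keypoints have already been matched into correspondences; the iteration count, the pixel tolerance `tol`, and all function names are assumptions, since the text does not state them or the library used.

```python
import random

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-9:
            return None  # degenerate sample, e.g. collinear points
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_homography(src, dst):
    """Fit the 8 free parameters of H (h33 fixed to 1) from 4 correspondences."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    return solve_linear(A, b)

def project(h, pt):
    """Apply homography parameters h to a 2D point; None if it maps to infinity."""
    x, y = pt
    w = h[6] * x + h[7] * y + 1.0
    if abs(w) < 1e-12:
        return None
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)

def ransac_inlier_count(src, dst, iters=200, tol=3.0, seed=0):
    """Largest number of correspondences explained by any sampled homography."""
    if len(src) < 4:
        return 0
    rng = random.Random(seed)
    best = 0
    for _ in range(iters):
        idx = rng.sample(range(len(src)), 4)
        h = fit_homography([src[i] for i in idx], [dst[i] for i in idx])
        if h is None:
            continue
        inliers = 0
        for p, q in zip(src, dst):
            proj = project(h, p)
            if proj and (proj[0] - q[0]) ** 2 + (proj[1] - q[1]) ** 2 <= tol ** 2:
                inliers += 1
        best = max(best, inliers)
    return best

# Synthetic example: 12 consistent correspondences plus 3 mismatches.
rng = random.Random(1)
true_h = [1.02, 0.01, 5.0, -0.02, 0.98, -3.0, 1e-5, 2e-5]
src = [(rng.uniform(0, 640), rng.uniform(0, 480)) for _ in range(12)]
dst = [project(true_h, p) for p in src]
src += [(50.0, 60.0), (300.0, 200.0), (600.0, 400.0)]
dst += [(500.0, 10.0), (20.0, 450.0), (100.0, 30.0)]
print(ransac_inlier_count(src, dst))
```

A higher count means the two frames are better explained by a single planar transformation, which is the sense in which inlier count serves as a view similarity.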
Similarity threshold and margin values in Algorithm 1. We fix our similarity threshold to ensure that only highly confident views are included in the graph. We select a large margin to make sure that irrelevant views are readily ignored.
Node linking details (Sec. 3.2). We use hierarchical agglomerative clustering to link nodes across different environments based on functional similarity. We set the similarity threshold below which nodes will not be linked to 40% of the average pairwise similarity over all nodes. We found that threshold values in the 40-60% range produced a similar number of clusters, while values outside this range resulted in either too few nodes being linked or all nodes collapsing into a single node.
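The thresholding rule can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not specify the linkage criterion, so average linkage is assumed here, and the similarity matrix and the name `link_nodes` are hypothetical.

```python
from itertools import combinations

def link_nodes(sim, fraction=0.4):
    """Agglomerative clustering over a symmetric similarity matrix.

    Clusters are merged greedily (average linkage, an assumption) until the
    best cluster-to-cluster similarity falls below
    fraction * mean pairwise similarity.
    """
    n = len(sim)
    if n < 2:
        return [[i] for i in range(n)]
    pairs = list(combinations(range(n), 2))
    threshold = fraction * sum(sim[i][j] for i, j in pairs) / len(pairs)
    clusters = [[i] for i in range(n)]

    def avg_sim(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (a, b), best = max(
            (((x, y), avg_sim(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1],
        )
        if best < threshold:
            break  # remaining clusters are too dissimilar to link
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters]

# Hypothetical functional similarities between 4 nodes from different kitchens.
sim = [
    [1.00, 0.90, 0.10, 0.05],
    [0.90, 1.00, 0.08, 0.10],
    [0.10, 0.08, 1.00, 0.80],
    [0.05, 0.10, 0.80, 1.00],
]
print(link_nodes(sim))  # [[0, 1], [2, 3]]
```

In this example the mean pairwise similarity is about 0.34, so the threshold is about 0.14: the two strongly similar pairs are linked, and the final merge (average similarity 0.0825) is rejected.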
Other details. We subsample all videos to 6 fps. To calculate the score in Equation 2, we average scores over a window of 9 frames around the current frame, and we uniformly sample a set of 20 frames from each visit for robust score estimates.
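A minimal sketch of the two operations described above: averaging over a 9-frame window and uniformly sampling 20 frames per visit. The function names are hypothetical, and two details are assumptions because the text does not give them: the window is centered and truncated at video boundaries, and uniform sampling takes the midpoint of each of 20 equal bins.

```python
def smooth_scores(scores, window=9):
    """Average each score over a centered window, truncated at the ends."""
    half = window // 2
    out = []
    for t in range(len(scores)):
        lo, hi = max(0, t - half), min(len(scores), t + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out

def uniform_sample(frames, k=20):
    """Pick k evenly spaced frames; return all frames if the visit is short."""
    n = len(frames)
    if n <= k:
        return list(frames)
    # Midpoint of each of k equal-sized bins over the visit.
    return [frames[(2 * i + 1) * n // (2 * k)] for i in range(k)]

# A single spike is spread across its 9-frame neighbourhood.
print(smooth_scores([0, 0, 0, 0, 9, 0, 0, 0, 0])[4])  # 1.0

# 20 frame indices drawn evenly from a 100-frame visit.
print(uniform_sample(list(range(100))))
```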
We next provide additional implementation and training details for our experiments in Sec. 4 of the main paper.
Affordance learning experiments in Sec. 4.1.
For all models, we use ImageNet-pretrained ResNet-152 features as frame feature inputs. As mentioned in Sec. 3.3, we use binary cross entropy (BCE) as our loss function. For original clips labeled with a single action label, we evaluate BCE for only the positive class, and mask out the loss contributions of all other classes. We optimize the model parameters using Adam with learning rate 1e-4, weight decay 1e-6, and batch size 256. All models are trained for 20 epochs, and the learning rate is annealed once to 1e-5 after 15 epochs.
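The masking of the loss for single-label clips can be sketched in plain Python as below. This is an illustration of the idea rather than the training code: the function names are hypothetical, and averaging over the unmasked classes is an assumption.

```python
import math

def masked_bce(logits, targets, mask):
    """Mean binary cross entropy over the classes where mask is 1."""
    total, count = 0.0, 0
    for z, y, m in zip(logits, targets, mask):
        if not m:
            continue  # masked classes contribute nothing to the loss
        # Numerically stable BCE on logits: max(z, 0) - z*y + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        count += 1
    return total / count if count else 0.0

def single_label_mask(num_classes, positive):
    """Mask that keeps only the labeled (positive) class of a clip."""
    return [1 if c == positive else 0 for c in range(num_classes)]

# A clip labeled only with class 0: the other classes are unknown, not negative.
logits = [2.0, -1.0, 0.5]
targets = [1, 0, 0]
mask = single_label_mask(3, positive=0)
print(masked_bce(logits, targets, mask))
```

The point of the mask is that an unlabeled interaction in a single-label clip is not evidence that the interaction is impossible at that location, so it should not be penalized as a negative.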
Long-term action anticipation experiments in Sec. 4.2. We pretrain an I3D model with a ResNet-50 backbone on the original clip-level action recognition task for both EPIC-Kitchens and EGTEA+. We then extract features from the pretrained I3D model for each set of 64 frames to serve as clip-level features. These features are used for all models in our long-term anticipation experiments.
Among the baselines, we implement TrainDist, I3D, RNN, and ActionVLAD. For Timeception, we import the authors' module (https://github.com/noureldien/timeception), and for VideoGraph, we directly use the authors' implementation (https://github.com/noureldien/videograph) with our features as input.
For EPIC, all models are trained for 100 epochs with the learning rate starting from 1e-3 and decreased by a factor of 0.1 after 80 epochs. We use Adam as the optimization method with weight decay 1e-5 and batch size 256. For the smaller EGTEA+ dataset, we follow the same settings, except we train for 50 epochs.
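The stated schedule can be written as a small step function. This is a sketch under assumptions: epochs are taken to be zero-indexed, the function name and the configuration layout are hypothetical, and the text does not say whether the drop point is rescaled for the shorter EGTEA+ schedule.

```python
def learning_rate(epoch, base_lr=1e-3, drop_epoch=80, factor=0.1):
    """Step schedule: base_lr before drop_epoch, base_lr * factor afterwards."""
    return base_lr * factor if epoch >= drop_epoch else base_lr

# Values as stated in the text; the dictionary layout is hypothetical.
CONFIGS = {
    "EPIC": {"epochs": 100, "weight_decay": 1e-5, "batch_size": 256},
    "EGTEA+": {"epochs": 50, "weight_decay": 1e-5, "batch_size": 256},
}

for name, cfg in CONFIGS.items():
    rates = sorted({learning_rate(e) for e in range(cfg["epochs"])}, reverse=True)
    print(name, rates)
```

Note that if the drop point is kept at 80 epochs, a 50-epoch run never reaches it and trains at the initial rate throughout.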
For the ActionMaps method, we follow Rhinehart and Kitani, making a few necessary modifications for our setting. We use cosine similarity between pretrained ResNet-152 features to measure semantic similarity between locations as side information, instead of object and scene classifier scores, to be consistent with the other evaluated methods. We use a latent dimension of 256 for the matrix factorization, and set for the RWNMF optimization objective in . We use location information in the similarity kernel only when it is available, falling back to just feature similarity when it is not (due to SLAM failures). We use this baseline in our experiments in Sec. 4.1.
We generate monocular SLAM trajectories for egocentric videos following the code and protocol of prior work. Specifically, we use ORB-SLAM2 to extract trajectories for the full video, and drop timesteps where tracking is unreliable or lost. We scale all trajectories by the maximum movement distance for each kitchen, so that (x, y) coordinates lie in [0, 1]. We create a uniform grid of squares, each with edge length 0.2. We use this grid to accumulate trajectories for the SLAM baseline and to construct the ActionMaps matrix in our experiments in Sec. 4.1. We use the same process for EPIC and EGTEA+, with camera parameters from the dataset authors.
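The normalization and gridding steps can be sketched as follows. This is a minimal illustration with hypothetical function names and trajectory values. It assumes that scaling by the maximum movement distance means shifting to the origin and dividing by the largest extent over both axes, which keeps the aspect ratio; with an edge length of 0.2 this gives a 5 x 5 grid.

```python
def normalize_trajectories(trajectories):
    """Shift and scale all (x, y) points of one kitchen into [0, 1]."""
    xs = [x for traj in trajectories for x, _ in traj]
    ys = [y for traj in trajectories for _, y in traj]
    x0, y0 = min(xs), min(ys)
    extent = max(max(xs) - x0, max(ys) - y0) or 1.0  # avoid division by zero
    return [[((x - x0) / extent, (y - y0) / extent) for x, y in traj]
            for traj in trajectories]

def cell_index(x, y, edge=0.2):
    """Map a normalized point to its (row, col) grid cell."""
    n = round(1 / edge)  # cells per axis, 5 for edge 0.2
    # Points exactly at 1.0 are clamped into the last cell.
    return min(int(y * n), n - 1), min(int(x * n), n - 1)

def occupancy_grid(trajectories, edge=0.2):
    """Count how many trajectory points fall into each grid cell."""
    n = round(1 / edge)
    grid = [[0] * n for _ in range(n)]
    for traj in normalize_trajectories(trajectories):
        for x, y in traj:
            row, col = cell_index(x, y, edge)
            grid[row][col] += 1
    return grid

# Hypothetical trajectory in arbitrary SLAM units.
trajs = [[(0.0, 0.0), (2.0, 1.0), (4.0, 2.0)]]
for row in occupancy_grid(trajs):
    print(row)
```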