Furniture assembly understanding is closely related to the broader field of action recognition, which the rise of deep learning has rapidly advanced [carreira2017quo]. However, deep learning models require vast amounts of training data and are often evaluated on large datasets of short video clips, typically extracted from YouTube [carreira2017quo, Kuehne11], that cover a set of arbitrary yet highly discriminative actions. Research on assembly understanding therefore lags behind generic action recognition, owing to the lack of datasets suitable for training such models and to further challenges such as the need to understand activities over longer timescales. Existing assembly datasets [toyer2017human] are limited to the classification of very few actions and focus on human pose and color information only.
We aim to enable research of assembly understanding and underlying perception algorithms under real-life conditions by creating diversity in the assembly environment, assemblers, furniture types and color, and body visibility. To this end, we present the novel IKEA ASM dataset, the first publicly available dataset with the following properties:
Multi-modality: Data is captured from multiple sensor modalities including color, depth, and surface normals. It also includes various semantic modalities including human pose and object instance segmentation.
Multi-view: Three calibrated camera views cover the work area to handle body, object and self occlusions.
Fine-grained: There is subtle distinction between objects (such as table top and shelf) and action categories (such as aligning, spinning in, and tightening a leg), which are all visually similar.
High diversity: The same furniture type is assembled in numerous ways and over varying time scales. Moreover, human subjects exhibit natural, yet unusual poses, not typically seen in human pose datasets.
Transferability: The straightforward data collection protocol and readily available furniture makes the dataset easy to reproduce worldwide and link to other tasks such as robotic manipulation of the same objects.
While the task of furniture assembly is simple and well-defined, several difficulties make inferring actions and detecting relevant objects challenging. First, unlike in standard activity recognition, the background provides no information for classifying the action, since all actions take place in the same environment. Second, the parts being assembled are symmetric and highly similar, requiring an understanding of context and the ability to track objects relative to other parts and sub-assemblies. Third, the strong visual similarity between actions and parts requires a higher-level understanding of the assembly process, with state information retained over long time periods.
On the other hand, the strong interplay between geometry and semantics in furniture assembly provides an opportunity to model and track the process. Moreover, cues obtained from the different semantic modalities, such as human pose and object types, combine to provide strong evidence for the activity being performed. Our dataset enables research along this direction where both semantics and geometry are important and where short-term feed-forward perception is insufficient to solve the problem.
The main contributions of this paper are: (1) the introduction of a novel furniture assembly dataset that includes multi-view, multi-modal annotated data; and (2) an evaluation of baseline methods for different tasks (action recognition, pose estimation, object instance segmentation and tracking) to establish performance benchmarks.
| Dataset | Year | Duration | #Videos | #Frames | Domain | Source | Views | Depth | Pose | Objects |
|---|---|---|---|---|---|---|---|---|---|---|
| YouCook [das2013thousand] | 2013 | 2h 20m | 88 | NA | cooking | YouTube | 1 | ✗ | ✗ | ✓ (bb) |
| MPII Cooking 2 [rohrbach2016recognizing] | 2016 | 8h | 273 | 2.88M | cooking | collected | 1 | ✗ | ✓ | ✗ |
| YouCook2 [zhou2018towards] | 2018 | 176h | 2000 | NA | cooking | YouTube | 1 | ✗ | ✗ | ✓ (bb) |
| EPIC-Kitchens [damen2018scaling] | 2018 | 55h | 432 | 11.5M | cooking | collected | 1 | ✗ | ✗ | ✓ (bb) |
| COIN [tang2019coin] | 2019 | 476h 38m | 11827 | NA | 180 tasks | YouTube | 1 | ✗ | ✗ | ✗ |
2 Background and related work
Related Datasets. The increasing popularity of action recognition in the computer vision community has led to the emergence of a wide range of action recognition datasets. One of the most prominent is Kinetics [carreira2017quo], a large-scale human action dataset collected from YouTube videos. It is two orders of magnitude larger than some predecessors, e.g., UCF101 [soomro2012ucf101] and HMDB51 [kuehne2011hmdb]. Additional notable datasets in this context are ActivityNet [caba2015activitynet] and Charades [sigurdsson2016hollywood], which include a wide range of daily human activities. These datasets, while very large in scale, are neither domain-specific nor task-oriented, and are mainly centered on single-view RGB data.
Instructional video datasets usually include domain-specific videos, e.g., cooking (MPII [rohrbach2012database], YouCook [das2013thousand], YouCook2 [zhou2018towards], EPIC-Kitchens [damen2018scaling]) and furniture assembly (IKEA-FA [toyer2017human]). These are most often characterized by fine-grained action labels and may include modalities in addition to the RGB stream, such as human pose and object bounding boxes. There are also more diverse variants, like the recent COIN [tang2019coin] dataset, which forgoes the additional modalities in favor of a larger scale.
The datasets most closely related to ours are Drive & Act [drive_and_act_2019_iccv] and NTU RGB+D [shahroudy2016ntu, Liu_2019_NTURGBD120]. Drive & Act is specific to the domain of in-car driver activity and contains multi-view, multi-modal data, including IR streams, pose, depth, and RGB. While the actors follow some instructions, their actions are not task-oriented in the traditional sense, and due to the large collection effort, the total number of videos is relatively low (30). Similarly, NTU RGB+D [shahroudy2016ntu] and its recent extension NTU RGB+D 120 [Liu_2019_NTURGBD120] contain three simultaneous RGB views, IR and depth streams, and 3D skeletons. However, the videos are very short (a few seconds), non-instructional, and focused on general activities, some of which are health related or involve human interaction. For a detailed quantitative comparison between the proposed and closely-related datasets, see Table 1.
Other notable work is the IKEA Furniture Assembly Environment [lee2019ikea], a simulated testbed for studying robotic manipulation that synthesizes robotic furniture assembly data for imitation learning. Our proposed dataset is complementary to this work, as it captures real-world data of humans that can be used for domain adaptation.
In this paper we propose a domain-specific instructional video dataset for furniture assembly with multi-view, multi-modal data, including fine-grained actions, human pose, and object instance segmentation and tracking labels.
We provide a short summary of methods used as benchmarks in the different dataset tasks including action recognition, instance segmentation, multiple object tracking and human pose estimation. For an extended summary, see the supplementary material.
Action Recognition. Current action recognition architectures for video data are largely image-based. The most prominent approach uses 3D convolutions to extract spatio-temporal features, and includes methods like convolutional 3D (C3D) [tran2015learning], which was the first to apply 3D convolutions in this context; pseudo-3D residual net (P3D ResNet) [Qiu_2017_ICCV], which leverages pre-trained 2D CNNs, utilizes residual connections, and simulates 3D convolutions; and the two-stream inflated 3D ConvNet (I3D) [carreira2017quo], which uses an inflated Inception module architecture and combines RGB and optical flow streams. Other approaches attempt to decouple visual variations by using a mid-level representation like human pose (skeletons). One idea is to use a spatial temporal graph CNN (ST-GCN) [yan2018spatial] to process the skeleton's complex structure. Another is to learn skeleton features combined with global co-occurrence patterns [li2018co].
Instance Segmentation. Early approaches to instance segmentation typically perform segment proposal and classification in two stages [pinheiro2015learning, dai2016instance, pinheiro2016learning], whereas more recent approaches tend to be faster and more accurate [he2017mask, li2017fully]. Most notably, Mask R-CNN [he2017mask] combines binary mask prediction with Faster R-CNN [ren2015faster], showing impressive performance. It predicts segmentation masks on a coarse grid, independent of the instance size and aspect ratio, which tends to produce coarse segmentations for instances occupying a larger part of the image. To alleviate this problem, approaches have been proposed that focus on the boundaries of larger instances, e.g., InstanceCut [kirillov2017instancecut], TensorMask [chen2019tensormask], and point-based prediction as in PointRend [kirillov2019pointrend].
Multiple Object Tracking (MOT). Tracking-by-detection is a common approach to multiple object tracking. MOT methods can be categorized as online or offline, depending on when decisions are made: in online tracking [saleh2020artist, wojke2017simple, bergmann2019tracking, chu2019famnet, xu2019spatial, kim2018multi], the tracker assigns detections to tracklets at every time-step, whereas in offline tracking [tang2017multiple, maksai2018eliminating] decisions about tracklets are made after observing the whole video. MOT approaches can also be divided into geometry-based [saleh2020artist, Bewley2016_sort] and appearance-based [chu2019famnet, bergmann2019tracking, xu2019spatial]. In our context, an application may be human-robot collaboration during furniture assembly, where the tracking system must make real-time online decisions [saleh2020artist, Bewley2016_sort]. In this scenario, IKEA furniture parts are almost textureless and share the same color and shape, so appearance information could be misleading. Additionally, the parts are rigid, non-deformable objects that move almost linearly within a short temporal window. As such, a simple, well-designed tracker that models linear motion [Bewley2016_sort] is a reasonable choice.
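The data-association step at the heart of such a geometry-based, tracking-by-detection scheme can be sketched as follows. This is a minimal illustration, not the SORT implementation: SORT uses Kalman-filtered box predictions and the Hungarian algorithm, whereas here we greedily match raw boxes by descending IoU, and the `iou_thresh` value is chosen for the example.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedily assign detections to existing tracks by descending IoU.
    Returns (track_idx, det_idx) pairs; in a full tracker, unmatched
    detections would spawn new track IDs."""
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap too little to match
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches
```

Because only box geometry enters the cost, the scheme is unaffected by the textureless, identically colored furniture parts that would confuse an appearance model.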
Human Pose Estimation. Multi-person 2D pose estimation methods can be divided into bottom-up (predict all joints first) [pishchulin2016deepcut, cao2017realtime, cao2019openpose, raaj2019efficient] or top-down (detect all person bounding boxes first) [he2017mask, fang2017rmpe, chen2018cascaded]. The popular OpenPose detector [cao2017realtime, cao2019openpose] assembles the skeleton using a joint detector and part affinity fields. This was extended to incorporate temporal multi-frame information in Spatio-Temporal Affinity Fields (STAF) [raaj2019efficient]. Mask R-CNN [he2017mask] is a notable top-down detection-based approach, where a keypoint regression head can be learned alongside the bounding box and segmentation heads. Monocular 3D human pose estimation methods can be categorized as being model-free [pavlakos2018ordinal, pavllo20193d] or model-based [kanazawa2018end, kanazawa2019learning, kolotouros2019convolutional, kocabas2020vibe]. The former include VideoPose3D [pavllo20193d] which estimates 3D joints via temporal convolutions over 2D joint detections in a video sequence. The latter approach predicts the parameters of a body model, often the SMPL model [loper2015smpl], such as the joint angles, shape parameters, and rotation. Some model-based approaches [kanazawa2018end, kanazawa2019learning, kocabas2020vibe] leverage adversarial learning to produce realistic body poses and motions. Therefore, they tend to generalize better to unseen datasets, and so we focus on these methods as benchmarks on our dataset.
3 The IKEA assembly dataset
The IKEA ASM video dataset, including all 371 examples and ground-truth annotations, will be made publicly available for download. It includes three RGB views, one depth stream, atomic actions, human poses, object segments, and extrinsic camera calibration. Additionally, we provide code for data processing, including depth-to-point-cloud conversion, surface normal estimation, visualization, and evaluation, in a dedicated GitHub repository.
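The depth-to-point-cloud conversion mentioned above can be sketched with standard pinhole back-projection. This is a minimal illustration, not the dataset's released code; the intrinsics `fx`, `fy`, `cx`, `cy` are placeholders for the actual calibration values distributed with the dataset.

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) to an N x 3 point cloud
    using the pinhole camera model. Pixels with zero depth (invalid
    Kinect readings) are dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx  # X = (u - cx) * Z / fx
    y = (v - cy) * z / fy  # Y = (v - cy) * Z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]
```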
Data collection. Our data collection hardware system is composed of three Kinect V2 cameras, oriented to collect front, side, and top views of the work area. In particular, the top-view camera is set to acquire the scene structure, while the front- and side-view cameras are placed at eye-level height (1.6m). The three cameras are triggered to capture the assembly activities simultaneously in real time (24 fps). To achieve real-time acquisition, multi-threaded processing is used to capture and save images on an Intel i7 8-core CPU, with an NVIDIA RTX 2080 Ti GPU used for data encoding.
To collect our IKEA ASM dataset, we ask 48 human subjects to assemble furniture in five different environments, such as offices, labs and family homes. In this way, the backgrounds are diverse in terms of layout, appearance and lighting conditions. The background is dynamic, containing moving people who are not relevant to the assembly process. These environments will force algorithms to focus on human action and furniture parts while ignoring the background clutter and other distractors. Moreover, to allow human pose diversity, we ask participants to conduct assembly either on the floor or on a table work surface. This yields a total of 10 camera configurations (two per environment).
Statistics. The IKEA ASM dataset consists of 371 unique assemblies of four different furniture types (side table, coffee table, TV bench, and drawer) in three different colors (white, oak, and black). There are in total 1113 RGB videos and 371 depth videos (top view). Figure 2 shows the video and individual action length distribution. Overall, the dataset contains 3,046,977 frames (35.27h) of footage with an average of 2735.2 frames per video (1.89min).
Figure 3 shows the atomic action distribution in the train and test sets. Each action class contains at least 20 clips. Due to the nature of the assemblies, there is a high imbalance (each table assembly contains four instances of leg assembly). The dataset contains a total of 16,764 annotated actions with an average of 150 frames per action (6sec). For a full list of action names and ids, see supplemental.
Data split. We aim to enable model training that generalizes to previously unseen environments and human subjects. However, there is a large overlap between subjects across the different scenes, and creating a split that holds out both simultaneously would discard a large portion of the data. Therefore, we propose an environment-based train/test split, i.e., test environments do not appear in the train set and vice versa. The train and test sets consist of 254 and 117 scans, respectively, where the test set includes environments 1 and 2 (family room and office). All benchmarks in Section 4 were conducted using this split. Additionally, we provide scripts to generate alternative data splits that hold out subjects, environments, or joint subject-environment combinations.
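An environment-based hold-out split of this kind can be sketched as follows. This is a minimal illustration, not the dataset's actual split script; the scan-id and environment names are hypothetical.

```python
def make_split(scans, test_envs=("env1", "env2")):
    """Hold out whole environments: scans recorded in `test_envs` go
    to the test set, everything else to the train set. `scans` maps
    scan id -> environment name."""
    train = [s for s, env in scans.items() if env not in test_envs]
    test = [s for s, env in scans.items() if env in test_envs]
    return train, test
```

The same pattern extends to subject-based or joint subject-environment splits by filtering on the corresponding metadata fields.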
Data annotation. We annotate our dataset with temporal and spatial information using pre-selected Amazon Mechanical Turk workers to ensure quality. Temporally, we specify the boundaries (start and end frame) of all atomic actions in the video from a pre-defined set. Actions involve interaction with specific object types (e.g., table leg).
Multiple spatial annotations are provided. First, we annotate instance-level segmentation of the objects involved in the assembly by drawing an enclosing polygon around each furniture part. Due to the size of the dataset, we manually annotate only 1% of the video frames, selected as keyframes that cover diverse object and human poses throughout each video, and provide pseudo ground-truth for the remainder (see §4.3). Visual inspection was used to confirm the quality of the pseudo ground-truth. For the same set of manually annotated frames, we also assign each furniture part a unique ID that preserves the part's identity throughout the entire video.
We also annotate the human skeleton of the subjects involved in the assembly. Here, we asked workers to annotate 12 body joints and five keypoints on the face. Due to occlusions by furniture, self-occlusions, and uncommon human poses, we include a confidence value between 1 and 3 along with each annotation. Each annotation was then visually inspected and re-worked if deemed to be of poor quality.
4 Experiments and benchmarks
We benchmark several state-of-the-art methods for the tasks of frame-wise action recognition, object instance segmentation and tracking, and human pose estimation.
4.1 Action recognition
We use three main metrics for evaluation. First, frame-wise accuracy (FA), the de facto standard for action recognition: we count the number of correctly classified frames, divide by the total number of frames in each video, and then average over all videos in the test set. Second, since the data is highly imbalanced, we also report the macro-recall, computed by separately evaluating recall for each category and then averaging. Third, we report the mean average precision (mAP), since all untrimmed videos contain multiple action labels. We compare several state-of-the-art methods for action recognition, including I3D [carreira2017quo], P3D ResNet [Qiu_2017_ICCV], C3D [tran2015learning], and frame-wise ResNet [he2016deep]. For each, we start with a pre-trained model and fine-tune it on the IKEA ASM dataset using the parameters provided in the original papers. To handle data imbalance, we use a weighted random sampler in which each class is weighted inversely proportional to its abundance in the dataset. Results are reported in Table 2 and show that P3D outperforms all other methods, consistent with its performance on other datasets. The results also demonstrate the difficulty of our dataset: I3D, for example, achieves an FA of 57.57%, compared to 68.4% on Kinetics and 63.64% on Drive&Act.
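The first two metrics can be sketched as follows. This is a minimal NumPy illustration of the definitions above, not the official evaluation code; `pred` and `gt` are per-frame class labels for one video.

```python
import numpy as np

def frame_accuracy(pred, gt):
    """Fraction of frames whose predicted action matches the label."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    return float((pred == gt).mean())

def macro_recall(pred, gt, num_classes):
    """Recall computed per class, then averaged with equal class
    weight, so rare actions count as much as frequent ones."""
    pred, gt = np.asarray(pred), np.asarray(gt)
    recalls = []
    for c in range(num_classes):
        mask = gt == c
        if mask.any():  # skip classes absent from the ground truth
            recalls.append(float((pred[mask] == c).mean()))
    return float(np.mean(recalls))
```

Macro-recall rewards models that do well on the many rare classes, whereas frame accuracy is dominated by the frequent leg-assembly actions.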
(Table 2 reports top-1 and top-3 frame accuracy, macro-recall, and mAP for each method.)
4.2 Multi-view and multi-modal action recognition
We further explore the effects of multi-view and multi-modal data using the I3D method. In Table 3 we report performance for different views and modalities, as well as their combinations, obtained by averaging softmax output scores. Combining views clearly boosts performance compared to the best single-view model, and combining views with pose gives a further increase. However, combining views, depth, and pose in the same manner results in a small performance drop, due to the inferior performance of the depth-based method. This suggests that action recognition in the 3D domain remains an open and challenging problem. The results also suggest that a combined, holistic approach using multi-view and multi-modal data, facilitated by our dataset, should be investigated in future work.
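The score-level fusion described above (averaging softmax outputs across views or modalities) can be sketched as a few lines of NumPy; this is an illustration of the fusion rule, not the experiment code.

```python
import numpy as np

def fuse_scores(score_list):
    """Late fusion: average per-view/per-modality softmax scores and
    take the argmax. Each entry of `score_list` is an
    (num_frames, num_classes) array of softmax outputs."""
    avg = np.mean(np.stack(score_list), axis=0)
    return avg.argmax(axis=1)
```

Because the rule is a simple average, a single weak stream (e.g., the depth-based model) can drag the fused prediction down, matching the small drop observed in Table 3.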
| Data type | Method / View | Frame acc. top 1 | Frame acc. top 3 | macro recall | mAP |
|---|---|---|---|---|---|
| Human pose | HCN [li2018co] | 37.75 | 63.07 | 26.18 | 22.14 |
| Human pose | ST-GCN [yan2018spatial] | 36.99 | 59.63 | 22.77 | 17.63 |
4.3 Instance segmentation
As discussed in Section 3, the dataset comes with manual instance segmentation annotation for 1% of the frames (manually selected keyframes that cover diverse object and human poses throughout each video). To evaluate the performance of existing instance segmentation methods on almost texture-less IKEA furniture, we train Mask R-CNN [he2017mask] with ResNet50, ResNet101, and ResNeXt101 backbones, all with a feature pyramid network (FPN) structure, on our dataset. We train each network using the implementation provided by the Detectron2 framework [wu2019detectron2]. Table 4 shows the instance segmentation accuracy for these baselines. As expected, the best performing model is Mask R-CNN with ResNeXt101-FPN, outperforming ResNet101-FPN and ResNet50-FPN by 3.8 and 7.8 AP, respectively.
Since the manual annotation covers only 1% of the whole dataset, we propose to extract pseudo-ground-truth automatically. To this end, we train 12 different Mask R-CNNs with a ResNet50-FPN backbone to overfit on subsets of the training set that cover similar environments and furniture. To achieve manual-like annotations with more accurate part boundaries, we find that training the models with PointRend [kirillov2019pointrend] as an additional head is essential. Figure 4 compares the automatically generated pseudo-ground-truth with and without the PointRend head. To evaluate the effectiveness of adding pseudo-ground-truth, we compare Mask R-CNN with ResNet50-FPN trained on 1% annotated data (i.e., manual annotations only) and on 20% annotated data (a combination of manual and automatically generated annotations), shown in Table 5(a), and observe a slight improvement. Note that any backbone architecture can benefit from the automatically generated pseudo-ground-truth.
We also investigate the contribution of adding a PointRend head to the Mask R-CNN with ResNet-50-FPN when training on 1% of manually annotated data. Table 5(b) shows that boundary refinement through point-based classification improves the overall instance segmentation performance. This table also clearly shows the effect of PointRend on estimating tighter bounding boxes.
Additionally, to evaluate the effect of furniture color and environment complexity, we report instance segmentation results partitioned by color (Table 5(c)) and by environment (Table 5(d)). Note that for both of these experiments we use the same model, trained on all furniture colors and environments available in the training set. Table 5(c) shows that oak furniture parts are the easiest to segment. White furniture parts, on the other hand, are the hardest, as they reflect the light in the scene more intensely; they may also be missed due to poor contrast against white work surfaces.
Although Mask R-CNN shows promising results in many scenarios, there are also failure cases, reflecting the real-world challenges introduced by our dataset. These failures are often due to (1) relatively high similarities between different furniture parts, e.g., front panel and rear panel of drawers illustrated in Figure 5(top row) and (2) relatively high similarities between furniture parts of interest and other parts of the environment which introduces false positives. An example of the latter can be seen in Figure 5(bottom row) where Mask R-CNN segmented part of the working surface as the shelf.
| Feature extractor | Annotation type | AP | AP50 | AP75 | table top | leg | shelf | side panel | front panel | bottom panel | rear panel |
| (a) Influence of adding pseudo GT |
| Manual + Pseudo GT | 60.1 | 77.7 | 66.1 | 62.6 | 77.8 | 69.9 |
| (b) Influence of PointRend head |
| (c) Color-based evaluation |
| (d) Environment-based evaluation |
| Env1 (Family Room) | 47.1 | 63.0 | 53.5 | 49.9 | 65.6 | 58.4 |
4.4 Multiple furniture part tracking
As motivated in Section 3, we utilize SORT [Bewley2016_sort] as a fast online multiple object tracking algorithm that only relies on geometric information in a class-agnostic manner. Given the detections predicted by the Mask R-CNN, SORT assigns IDs to each detected furniture part at each time-step.
To evaluate MOT performance, we use standard metrics [ristani2016performance, bernardin2008evaluating]. The main metric is MOTA, which combines three error sources: false positives (FP), false negatives (FN), and identity switches (IDs); a higher MOTA score implies better performance. Another important metric is IDF1, the ratio of correctly identified detections over the average number of ground-truth and computed detections. The numbers of identity switches (IDs), FP, and FN are also frequently reported. Furthermore, mostly tracked (MT) and mostly lost (ML), which are, respectively, the ratios of ground-truth trajectories that are covered or lost by the tracker for at least 80% of their life span, provide finer details on the performance of a tracking system. All metrics were computed using the official evaluation code provided by the MOTChallenge benchmark (https://motchallenge.net/).
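The headline MOTA metric reduces to a one-line formula over the three error counts, normalized by the total number of ground-truth detections:

```python
def mota(fp, fn, id_switches, num_gt):
    """MOTA = 1 - (FP + FN + IDs) / total ground-truth detections.
    Can be negative when errors outnumber ground-truth objects."""
    return 1.0 - (fp + fn + id_switches) / num_gt
```

Note that MOTA weights all three error types equally, which is why the text below can trace low scores back to the dominant error source (FN under occlusion, FP among distractor objects).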
Table 6 shows the performance of SORT on each test environment as well as on the entire test set. The results reflect the challenges introduced by each environment. For instance, Env1 (Family Room) provides a side view of the assembler and thus introduces many occlusions, clearly seen in the number of FN. Moreover, since the tracker may lose an occluded object for a reasonably long time, it may assign new IDs after the occlusion, affecting the mostly tracked (MT) ratio and IDF1. On the other hand, the front view provided in Env2 (Office) leads to fewer occlusions, and thus better identity preservation, reflected in IDF1 and MT. However, since the office environment contains irrelevant but similar parts, e.g., the desk partition or the work surface illustrated in Fig. 5 (bottom row), we observe considerably more FP, which further lowers MOTA.
| Env1 (Family Room) | 63.7 | 69.6 | 60.1 | 35.9 | 4.0 | 92 | 1152 | 382 |
4.5 Human pose
The dataset contains 2D human joint annotations in the COCO format [lin2014microsoft] for 1% of frames, the same keyframes selected for instance segmentation, which cover a diverse range of human poses across each video. As shown in Figure 6, the dataset contains many highly challenging and unusual poses, due to the nature of furniture assembly, particularly when performed on the floor. Many other factors also reduce the accuracy of pose estimation approaches, including self-occlusions, occlusions from furniture, baggy clothing, long hair, and human distractors in the background. We also obtain pseudo-ground-truth 3D annotations by fine-tuning a Mask R-CNN [he2017mask] 2D joint detector on the labeled data and triangulating the model's detections from the three calibrated camera views. As a verification step, the 3D points are backprojected to 2D and discarded if they lie more than 30 pixels from the most confident ground-truth annotations. The reprojection error of the true and pseudo ground-truth annotations is 7.12 pixels on the train set (83% of ground-truth joints detected) and 9.14 pixels on the test set (53% of ground-truth joints detected).
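The multi-view triangulation step can be sketched with standard linear (DLT) triangulation; this illustrates the general technique under idealized inputs, not the authors' exact pipeline (which also weights views by detector confidence and applies the 30-pixel backprojection check).

```python
import numpy as np

def triangulate(projections, points2d):
    """Linear (DLT) triangulation of one joint from N calibrated
    views. `projections` is a list of 3x4 camera matrices and
    `points2d` the matching (u, v) detections. Each view contributes
    two rows u*P[2]-P[0] and v*P[2]-P[1] to a homogeneous system,
    solved via SVD."""
    rows = []
    for P, (u, v) in zip(projections, points2d):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]          # null-space vector = homogeneous 3D point
    return X[:3] / X[3]  # dehomogenize
```

With the three camera views of the dataset, the system is over-determined, and the SVD solution is the least-squares estimate in the algebraic sense.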
To evaluate the performance of benchmark 2D human pose approaches, we perform inference with existing state-of-the-art models, pre-trained by the authors on the large COCO [lin2014microsoft] and MPII [andriluka20142d] datasets and fine-tuned on our annotated data. We compare OpenPose [cao2017realtime, cao2019openpose], Mask R-CNN [he2017mask] (with a ResNet-50-FPN backbone [lin2017feature]), and Spatio-Temporal Affinity Fields (STAF) [raaj2019efficient]. The first two operate on images, while the last operates on videos; all are multi-person pose estimation methods. This is required since our videos sometimes contain multiple people in a frame, with only the single assembler annotated. For fine-tuning, we trained the models for ten epochs with learning rates of 1 and 0.001 for OpenPose and Mask R-CNN, respectively. Since multiple people may be validly detected in many frames, we report results with respect to the best detected person per frame, that is, the one closest to the ground-truth keypoints.

We use standard error measures to evaluate 2D human pose methods: the 2D Mean Per Joint Position Error (MPJPE) in pixels, the Percentage of Correct Keypoints (PCK) [yang2012articulated], and the Area Under the Curve (AUC) as the PCK threshold varies up to a maximum of 100 pixels. A joint is considered correct if it is located within 10 pixels of the ground-truth position, which corresponds to approximately 0.5% of the image width (frames are 1920×1080). Absolute measures in pixel space are appropriate for this dataset because the subjects are positioned at an approximately fixed distance from the camera in all scenes. In computing these metrics, only confident ground-truth annotations are used, and only detected joints contribute to the mean error (for MPJPE).

The results for 2D human pose baselines on the IKEA ASM train and test sets are reported in Tables 7, 8, and 9. The best performing model is the fine-tuned Mask R-CNN, with an MPJPE of 11.5 pixels, a PCK @ 10 pixels of 64.3%, and an AUC of 87.8, revealing considerable room for improvement on this challenging data. The error analysis shows that upper body joints were detected accurately more often than lower body joints, likely due to the occluding table work surface in half the videos. In addition, female subjects were detected considerably less accurately than male subjects and account for almost all entirely missed detections.
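The 2D MPJPE and PCK measures can be sketched as follows; this is a minimal NumPy illustration of the definitions, not the official evaluation code, and the visibility mask stands in for the confident-annotation filter described above.

```python
import numpy as np

def mpjpe_2d(pred, gt, visible):
    """Mean per-joint position error in pixels over confident joints."""
    d = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    v = np.asarray(visible, dtype=bool)
    return float(d[v].mean())

def pck(pred, gt, visible, thresh=10.0):
    """Percentage of Correct Keypoints: a joint is correct if the
    prediction lies within `thresh` pixels of ground truth."""
    d = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    v = np.asarray(visible, dtype=bool)
    return float((d[v] <= thresh).mean())
```

Sweeping `thresh` from 0 to 100 pixels and integrating the resulting PCK curve yields the AUC figure reported in the tables.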
(Table 7 reports overall 2D pose results on the train and test sets.)
| Method | MPJPE (M / F) | PCK (M / F) | missed (M / F) | MPJPE (floor / table) | PCK (floor / table) | missed (floor / table) |
|---|---|---|---|---|---|---|
| OpenPose-pt [cao2017realtime] | 15.3 / 19.2 | 46.6 / 46.9 | 0 / 1 | 16.9 / 15.8 | 47.1 / 45.9 | 1 / 0 |
| OpenPose-ft [cao2017realtime] | 13.8 / 14.0 | 52.4 / 53.0 | 0 / 6 | 14.3 / 13.0 | 52.5 / 52.8 | 0 / 6 |
| MaskRCNN-pt [he2017mask] | 15.6 / 17.5 | 51.9 / 50.5 | 0 / 1 | 16.8 / 14.8 | 53.1 / 48.3 | 0 / 1 |
| MaskRCNN-ft [he2017mask] | 11.2 / 11.9 | 64.6 / 63.8 | 0 / 0 | 11.4 / 11.5 | 65.4 / 62.3 | 0 / 0 |
| STAF-pt [raaj2019efficient] | 17.6 / 24.1 | 40.7 / 42.1 | 1 / 1 | 19.3 / 20.3 | 39.3 / 44.6 | 1 / 1 |
To evaluate the performance of benchmark 3D human pose approaches, we perform inference with existing state-of-the-art models, pre-trained by the authors on large 3D pose datasets: Human Mesh and Motion Recovery (HMMR) [kanazawa2019learning], VideoPose3D (VP3D) [pavllo20193d], and VIBE [kocabas2020vibe]; all are video-based methods. To measure performance, we use the 3D Mean/median Per Joint Position Error (M/mPJPE), which computes the Euclidean distance in millimeters between the estimated and ground-truth 3D joints, averaged over all joints and frames; the Procrustes Aligned (PA) M/mPJPE, where the estimated and ground-truth skeletons are rigidly aligned and scaled before evaluation; and the Percentage of Correct Keypoints (PCK) [mehta2017monocular]. As in the Human3.6M dataset [ionescu2014humans36m], the MPJPE measure is calculated after aligning the centroids of the 3D points in common. The PCK threshold is set to 150mm, approximately half a head. The results for 3D human pose baselines on the IKEA ASM dataset are reported in Table 10. The best performing model is VIBE, with a median Procrustes-aligned PJPE of 153mm and a PA-PCK @ 150mm of 50%. The baseline methods perform significantly worse on our dataset than on standard human pose datasets, demonstrating its difficulty. For example, OpenPose's joint detector [wei2016convolutional] achieves a PCK of 88.5% on the MPII dataset [andriluka20142d] compared to 52.6% on ours, and VIBE has a PA-MPJPE of 41.4mm on the H36M dataset [ionescu2014humans36m] compared to 940mm on ours.
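The Procrustes-aligned error can be sketched via similarity Procrustes alignment (Kabsch with a scale factor): the predicted skeleton is optimally rotated, scaled, and translated onto the ground truth before the per-joint error is computed. This is a minimal NumPy illustration of the standard algorithm, not the official evaluation code.

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """MPJPE (same units as the input, e.g. mm) after similarity
    Procrustes alignment of `pred` (J x 3) onto `gt` (J x 3)."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    p, g = pred - mu_p, gt - mu_g            # center both skeletons
    U, s, Vt = np.linalg.svd(p.T @ g)        # cross-covariance SVD
    R = (U @ Vt).T                           # optimal rotation (Kabsch)
    if np.linalg.det(R) < 0:                 # fix improper reflection
        Vt[-1] *= -1
        s[-1] *= -1
        R = (U @ Vt).T
    scale = s.sum() / (p ** 2).sum()         # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    return float(np.linalg.norm(aligned - gt, axis=1).mean())
```

Because rotation, scale, and translation are factored out, PA-MPJPE isolates errors in the pose itself from errors in global placement, which is why it is much lower than the unaligned MPJPE on this dataset.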
(Table 10 reports 3D pose results on the train and test sets.)
In this paper, we introduced a large-scale, comprehensively labeled furniture assembly dataset for understanding task-oriented human activities with fine-grained actions and common parts. The proposed dataset can also serve as a challenging test-bed for underlying computer vision algorithms such as textureless object segmentation and tracking and multi-view human pose estimation. Furthermore, we report benchmark results of strong baseline methods on these tasks for ease of research comparison. Notably, since our dataset contains multi-view and multi-modal data, it enables the development and analysis of algorithms that exploit such data, further improving performance on these tasks. By recognizing human actions, poses, and object positions, we believe this dataset will facilitate the understanding of human-object interactions and lay the groundwork for the perceptual understanding required for long time-scale structured activities in real-world environments.
6.1 Extended related work
In this section, we provide an extended summary of related work for each of the tasks presented in the paper: action recognition, human pose estimation, object instance segmentation, and multi-object tracking.
Action Recognition Methods: Current action recognition architectures for video data are largely based on image-based models. These methods employ several strategies for utilizing the additional (temporal) dimension. One approach is to process the images separately using 2D CNNs and then average the classification results across the temporal domain [simonyan2014two]. Another approach uses an RNN instead [yue2015beyond, donahue2015long]. The most recent and most prominent approach uses 3D convolutions to extract spatio-temporal features. This family includes the convolutional 3D (C3D) method [tran2015learning], which was the first to apply 3D convolutions in this context; Pseudo-3D Residual Net (P3D ResNet) [Qiu_2017_ICCV], which leverages pretrained 2D CNNs with residual connections to simulate 3D convolutions; and the two-stream Inflated 3D ConvNet (I3D) [carreira2017quo], which uses an inflated Inception architecture and combines RGB and optical flow streams. Most recently, the SlowFast method [feichtenhofer2019slowfast] processes the video at two frame rates in separate pathways to obtain a unified representation.
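The first of these strategies, per-frame classification followed by temporal averaging, reduces to a few lines. The sketch below assumes per-frame logits produced by an arbitrary 2D image classifier; it is an illustration of the idea, not the pipeline of any cited method.

```python
import numpy as np

def temporal_average(frame_logits):
    """Average per-frame class scores over time.

    frame_logits: (T, num_classes) raw scores from a 2D CNN applied
    to each frame independently. Returns (num_classes,) video-level
    class probabilities (softmax per frame, then mean over frames).
    """
    e = np.exp(frame_logits - frame_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)
```

Averaging probabilities rather than logits keeps each frame's vote bounded, so a single over-confident frame cannot dominate the video-level prediction.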
Another approach for action recognition is to decouple the visual variations and use a mid-level representation like human pose (skeletons). Several approaches have been proposed to process the skeleton’s complex structure. One approach uses an LSTM [liu2016spatio]; another uses a spatial temporal graph CNN (ST-GCN) [yan2018spatial]. An alternative approach is to encode the skeleton joints and the temporal dynamics in a matrix and process it like an image using a CNN [du2015skeleton, ke2017new]. Similarly, the Hierarchical Co-occurrence Network (HCN) [li2018co] adopts a CNN to learn skeleton features while leveraging them to learn global co-occurrence patterns.
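The matrix-encoding idea can be illustrated with a minimal sketch: a skeleton sequence of T frames and J joints becomes a T x J "image" with the x/y/z coordinates as channels, which any 2D CNN can then consume. The normalization scheme below is an illustrative choice, not the one used in the cited works.

```python
import numpy as np

def skeleton_to_image(joints):
    """Encode a skeleton sequence as a pseudo-image.

    joints: (T, J, 3) array of 3D joint positions over T frames.
    Returns a (T, J, 3) uint8 image: rows index time, columns index
    joints, channels hold x/y/z, each scaled to [0, 255] over the clip.
    """
    lo = joints.min(axis=(0, 1), keepdims=True)
    hi = joints.max(axis=(0, 1), keepdims=True)
    img = (joints - lo) / np.maximum(hi - lo, 1e-8) * 255.0
    return img.astype(np.uint8)
```

Temporal dynamics then appear as vertical texture in the image, which standard 2D convolutions pick up without any explicit temporal modeling.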
Instance Segmentation.
Early approaches to instance segmentation usually combined segment proposal classification in a two-stage framework. For instance, given a number of instance proposals, DeepMask [pinheiro2015learning] and closely related works [dai2016instance, pinheiro2016learning] learn to propose instance segment candidates which are then passed through a classifier (e.g., Fast R-CNN). These approaches are usually slower due to the architecture design and tend to be less accurate compared to one-stage counterparts [he2017mask]. To form a single-stage framework, Li et al. [li2017fully] merged segment proposals and object detection into a fully convolutional instance segmentation network. Following this trend, Mask R-CNN [he2017mask] combines binary mask prediction with Faster R-CNN, showing impressive performance compared to prior work.
Mask R-CNN and other similar region-based approaches to instance segmentation [he2017mask] usually predict segmentation masks on a coarse grid, independent of the instance size and aspect ratio. While this leads to reasonable performance on small objects, around the size of the grid, it tends to produce coarse segmentations for instances occupying a larger part of the image. To alleviate this problem, approaches have been proposed that focus on the boundaries of larger instances, e.g., through pixel grouping to form larger masks [arnab2017pixelwise, liu2017sgn, kirillov2017instancecut] as in InstanceCut [kirillov2017instancecut], utilizing sliding windows on the boundaries or complex networks for high-resolution mask prediction as in TensorMask [chen2019tensormask], and point-based segmentation prediction as in PointRend [kirillov2019pointrend].
Multiple Object Tracking.
With the advances in object detection [redmon2016you, girshick2015fast, ren2015faster], tracking-by-detection is now a common approach for multiple object tracking (MOT). Mostly studied in the context of multiple person tracking, MOT can be considered from different aspects. It can be categorized into online or offline, depending on when the decisions are made. In online tracking [saleh2020artist, wojke2017simple, bergmann2019tracking, chu2019famnet, xu2019spatial, kim2018multi], the tracker assigns detections to tracklets at every time-step, whereas in offline tracking [tang2017multiple, maksai2018eliminating] decisions about the tracklets are made after observing the whole context of the video. Different MOT approaches can also be divided into geometry-based [saleh2020artist, Bewley2016_sort], appearance-based [chu2019famnet, bergmann2019tracking, xu2019spatial], and combinations of appearance and geometry information with social information [maksai2018eliminating, sadeghian2017tracking]. The choice of information to represent each object highly depends on the context and scenario. For instance, for general multiple person tracking, social and appearance information could be helpful, but in sport scenarios appearance information could be misleading. In our context, one common application is human-robot collaboration in IKEA furniture assembly, where the tracking system should be able to make its decisions in real-time in an online fashion [saleh2020artist, Bewley2016_sort]. Moreover, IKEA furniture parts are almost textureless and of the same color and shape, and thus appearance information could be misleading; one may therefore need to employ a completely geometry-based approach. Additionally, IKEA furniture parts are rigid, non-deformable objects that move almost linearly within a short temporal window.
Therefore, a simple, well-designed MOT that models linear motions [Bewley2016_sort] is a reasonable choice.
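A minimal sketch of such a geometry-only association step, in the spirit of SORT [Bewley2016_sort] but not its reference implementation, predicts each track's box under a constant-velocity model and matches detections to predictions by maximizing spatial IoU with the Hungarian algorithm. The data layout (boxes as `(x1, y1, x2, y2)` arrays, tracks as box/velocity pairs) is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def associate(tracks, detections, min_iou=0.3):
    """Match detections to constant-velocity track predictions.

    tracks: list of (box, velocity) pairs of (4,) arrays; the velocity
    is the per-frame displacement of the box corners.
    detections: list of (4,) box arrays for the current frame.
    Returns (track_idx, detection_idx) pairs with IoU >= min_iou.
    """
    preds = [box + vel for box, vel in tracks]  # linear motion prediction
    cost = np.array([[-iou(p, d) for d in detections] for p in preds])
    rows, cols = linear_sum_assignment(cost)    # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if -cost[r, c] >= min_iou]
```

Unmatched detections would spawn new tracklets and unmatched tracks would age out; a full tracker also smooths the motion model, e.g., with a Kalman filter as in SORT.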
Human Pose Estimation.
The large volume of work on human pose estimation precludes a comprehensive list; the reader is referred to two recent surveys on 2D and 3D human pose estimation [chen2020monocular, sarafianos20163d] and the references therein. Here, we will briefly discuss recent state-of-the-art approaches, including the baselines selected for our experiments. Multi-person 2D pose estimation methods can be divided into bottom-up (predict all joints first) [pishchulin2016deepcut, cao2017realtime, cao2019openpose, raaj2019efficient] or top-down (detect all person bounding boxes first) [he2017mask, fang2017rmpe, chen2018cascaded] approaches, with the former reaching real-time processing speeds and the latter having better performance. OpenPose [cao2017realtime, cao2019openpose] uses the CPM joint detector [wei2016convolutional] to predict candidate joint heatmaps and part affinity fields, encoding limb orientation, from which the skeletons can be assembled. This was extended to incorporate temporal multi-frame information in Spatio-Temporal Affinity Fields (STAF) [raaj2019efficient]. Mask R-CNN [he2017mask] is a notable top-down detection-based approach, where a keypoint regression head can be learned alongside the bounding box and segmentation heads. More recently, Cascade Pyramid Networks (CPN) [chen2018cascaded] were proposed, which use multi-scale feature maps and hard keypoint mining to improve position accuracy. Monocular 3D human pose estimation methods can be categorized as being model-free [pavlakos2018ordinal, pavllo20193d] or model-based [kanazawa2018end, kanazawa2019learning, kolotouros2019convolutional, kocabas2020vibe]. The former include VideoPose3D [pavllo20193d] which estimates 3D joints via temporal convolutions over 2D joint detections in a video sequence. The latter approach predicts the parameters of a body model, often the SMPL model [loper2015smpl], such as the joint angles, shape parameters, and rotation. 
For example, Kanazawa et al. [kanazawa2018end] trained an encoder to predict the SMPL parameters, using adversarial learning to encourage realistic body poses and shapes, and later extended this to video input [kanazawa2019learning]. Instead of estimating SMPL parameters, Kolotouros et al. [kolotouros2019convolutional] directly regress the locations of mesh vertices using graph convolutions. Finally, Kocabas et al. [kocabas2020vibe] proposed a video-based approach that uses adversarial learning to generate kinematically plausible motions. These model-based approaches tend to generalize better to unseen datasets, and so we focus on these methods as benchmarks on our dataset.
6.2 Dataset auxiliary data
| ID | Action | Object | Description |
| 0 | - | - | No Annotation (NA) |
| 1 | align | leg | align leg screw with table thread |
| 2 | align | side panel | align side panel holes with front panel dowels |
| 3 | attach | back panel | attach drawer back panel |
| 4 | attach | side panel | attach drawer side panel |
| 5 | attach | shelf | attach shelf to table |
| 8 | flip | table top | flip table top |
| 9 | insert | pin | insert drawer pin |
| 10 | lay down | back panel | lay down back panel |
| 11 | lay down | bottom panel | lay down bottom panel |
| 12 | lay down | front panel | lay down front panel |
| 13 | lay down | leg | lay down leg |
| 14 | lay down | shelf | lay down shelf |
| 15 | lay down | side panel | lay down side panel |
| 16 | lay down | table top | lay down table top |
| 17 | - | - | other (unavailable action class) |
| 18 | pick up | back panel | pick up back panel |
| 19 | pick up | bottom panel | pick up bottom panel |
| 20 | pick up | front panel | pick up front panel |
| 21 | pick up | leg | pick up leg |
| 22 | pick up | pin | pick up pin |
| 23 | pick up | shelf | pick up shelf |
| 24 | pick up | side panel | pick up side panel |
| 25 | pick up | table top | pick up table top |
| 26 | position | drawer | position the drawer right side up |
| 28 | push | table top | push table top |
| 30 | slide | bottom panel | slide bottom of drawer |
6.3 Additional results
6.3.1 Action recognition results
6.3.2 Action localization results
mAP at increasing temporal IoU thresholds:
| I3D combined views | 0.44 | 0.39 | 0.33 | 0.27 | 0.20 | 0.18 | 0.16 | 0.13 | 0.12 | 0.10 | 0.09 | 0.06 | 0.03 | 0.01 |
| I3D combined all | 0.38 | 0.33 | 0.28 | 0.21 | 0.18 | 0.15 | 0.14 | 0.12 | 0.09 | 0.08 | 0.07 | 0.04 | 0.02 | 0.01 |
In this section, we provide additional baseline results for the task of action localization. The goal in this task is to find and recognize all action instances within an untrimmed test video. The desired output here is a start and end frame for each action appearing in the video sequence. To evaluate performance on this task, we follow [caba2015activitynet] and compute the mean average precision (mAP) over all action classes. A prediction counts as a true positive if the intersection-over-union between the predicted and ground-truth temporal segments is greater than a given threshold.
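The per-prediction decision reduces to a temporal IoU test, sketched below; segments are `(start, end)` frame pairs and the threshold is left as a parameter. For disjoint segments the span-based union below is only the temporal hull, but the intersection is then zero, so the IoU is still correctly zero.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two temporal segments.

    pred, gt: (start, end) frame indices with start <= end.
    """
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """Apply the localization criterion at a given IoU threshold."""
    return temporal_iou(pred, gt) > threshold
```

Sweeping the threshold over a range of values and averaging the resulting per-class APs yields the mAP numbers reported above.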