Bonn Activity Maps: Dataset Description

The key prerequisite for unlocking the huge potential of current machine learning techniques is the availability of large databases that capture the complex relations of interest. Previous datasets focus on either 3D scene representations with semantic information, tracking of multiple persons and recognition of their actions, or activity recognition of a single person in captured 3D environments. We present Bonn Activity Maps, a large-scale dataset for human tracking, activity recognition and anticipation of multiple persons. Our dataset comprises four different scenes that have been recorded by time-synchronized cameras, each capturing the scene only partially; the reconstructed 3D models with semantic annotations; motion trajectories of individual people including 3D human poses; as well as human activity annotations. We utilize the annotations to generate activity likelihoods on the 3D models, called activity maps.





1 Introduction

Scene analysis, modeling and understanding are among the everlasting goals of computer vision, graphics and robotics. Besides the impressive improvements resulting from sophisticated, task-specific learning approaches, the major progress achieved in the variety of different tasks is rooted in the development of datasets that capture the complex relationships between different scene aspects. Examples include RGB-D datasets [10], datasets for human pose estimation and activity recognition [16, 7, 19, 15, 29, 34, 25, 18, 47, 38, 6, 44, 13, 24, 23, 30], material recognition [9, 12, 21, 46, 45, 2, 3, 27], semantic segmentation of RGB-D data [1, 36, 35] as well as 3D scenes [22, 26, 11, 37, 5, 8, 40, 48, 36, 1, 14, 43], and the combination of geometry, semantics and activity maps [32, 33], i.e. likelihood maps of actions performed in certain scene regions with certain objects, to anticipate human behavior (see Table 1). Our dataset differs from previous work [32, 33] since it contains scenes with up to 12 visible persons at the same time.

To further foster research in this domain, we introduce Bonn Activity Maps. This large-scale dataset combines various types of scene information, using kitchen scenarios and their adjacent common areas as an example (see Figures 1 and 2). In addition to multi-view video observation of the scene by pre-calibrated, time-synchronized cameras and accurate 3D models of the underlying scene geometry with semantic annotations, we provide annotations of the poses, motion and activities of humans in the respective scenes, where each person is assigned a unique id that remains consistent even if that person leaves the scene for several minutes. Each person can be assigned multiple activity labels, such as eat cake, make coffee or use smartphone. Since these labels are attached to a person moving through the 3D scene, they may transition between camera views, as each camera only covers a small area of the recording volume.

Figure 1: Bonn Activity Maps: Likelihood maps on the 3D model are shown for two exemplary activities, draw on whiteboard (blue) and make coffee (orange). We annotate activities in every camera view, where each person is assigned a unique id (left), and generate activity maps for each 3D model indicating the locality of the respective activity over the entire recording and over all actors (right).
Figure 2: Bonn Activity Maps: Trajectories of two actors are plotted into the 3D environment over the entire recording session.
Dataset | Camera Setup | 3D Environment Data | Activities | Max Persons | Human Motion | Year
Matterport3D [5] | 1 RGB-D | 90 scenes | - | - | - | 2017
ScanNet [8] | 1 RGB-D | 707 scenes | - | - | - | 2017
Replica [40] | 1 RGB-D | 18 scenes | - | - | - | 2019
Berkeley MHAD [29] | 12 RGB, 2 RGB-D | - | 11 | 1 | 660 instances | 2013
CAD-120 [19] | 1 RGB-D | - | 10 | 1 | 1,200 instances | 2013
Panoptic Studio [16] | 480 RGB, 5 RGB-D | - | 5 | 8 | 50 instances | 2015
NTU RGB+D [34] | 3 RGB-D (17 configs) | - | 60 | 2 | 56,880 instances | 2016
PKU-MMD [7] | 3 RGB-D | - | 51 | 2 | 21,545 instances | 2017
NTU RGB+D 120 [25] | 3 RGB-D (32 configs) | - | 120 | 2 | 114,480 instances | 2019
MMAct [18] | 4 RGB + 1 Ego. RGB | - | 37 | 1 | 36,764 instances | 2019
SceneGrok [32] | 1 RGB-D | 14 scenes | 7 | 1 | 1:51h video | 2014
PiGraphs [33] | 1 RGB-D | 30 scenes | 43 | 1 | 2:00h video | 2016
Bonn Activity Maps | 12 RGB, 1 RGB-D | 4 scenes | 60 | 12 | 5:22h video | 2019
Table 1: Overview of recent datasets for 3D scene and human activity understanding. While many datasets target only one of these aspects, there are very few that contain both types of scene information. The Bonn Activity Maps dataset is the only dataset that contains 3D data of the environment as well as activities and tracked poses of multiple persons.

In the following, we describe the data formats and how they are organized within the dataset (see Section 2) as well as the acquisition process (see Section 3). Please contact the authors for additional information or questions regarding the dataset. More information can be found on the dataset website.

2 Dataset

In this section, we describe Bonn Activity Maps and how the various data formats are organized within the dataset. It consists of recordings of four different kitchens and their surrounding environment at the University of Bonn. Figure 4 shows the 3D models of all four kitchens. The kitchens share common characteristics such as similar furniture, floor, and walls but provide very different scales and geometries. For each recording, we provide a folder with the following structure:


  • videos/: Contains camera recordings and their associated calibrations, described in Section 2.1.

  • trajectories/: Contains the human-annotated 3D trajectories of all individuals as well as their activity labels, described in Section 2.2.

  • models3d/: Contains the 3D models of the kitchens as well as 3D semantic segmentations of the data, described in Section 2.3 and Section 2.4.

In the following, we will describe each of the aspects in detail.

2.1 Video Recordings

One of the key aspects of the dataset is the set of activities captured in a kitchen environment. Each of the four chosen kitchens was recorded for about two hours by 12 Sony HDR-PJ410 cameras – the exact recording lengths are 01:26, 01:59, 01:57 and 02:00 hours. The cameras are placed at fixed positions and record at 25Hz. For each camera, we provide an MP4 file which is time-synchronized with all other cameras, as well as a file calibration.json which contains the extrinsic and intrinsic camera parameters. In total, every camera recorded over 660,000 frames, resulting in over 7,900,000 frames when summing over all views and all recordings. The 3D recording volume is spanned by the area that is covered by at least two cameras. In detail, we provide the following files, which are enumerated for each of the 12 cameras.


  • videos/

    • camera00/

      • calibration.json: List of ranges of camera parameters defined as follows:

        • start_frame: Starting frame from which this calibration is valid, with 0 being the first frame. The frame number corresponds to the frames in the recorded 25Hz video.

        • end_frame: Last frame for which this calibration is valid.

        • w: Width of the video in pixels.

        • h: Height of the video in pixels.

        • tvec: 3D vector that transforms the global 3D camera position to the local position in camera space (the origin). Part of the extrinsic calibration.

        • rvec: 3D vector that represents the camera's rotation in axis-angle format and transforms the global 3D camera orientation to the local orientation in camera space (z-direction). Part of the extrinsic calibration.

        • K: 3×3 intrinsic camera matrix containing the focal lengths, the principal point, as well as the skew parameter.

        • distCoef: Distortion coefficients in the OpenCV order (k1, k2, p1, p2, k3).

      • recording.mp4: The recorded video data of the kitchen, stored at 25Hz in MP4 format.

    • camera01/

    • camera11/

We use a list of ranges in calibration.json rather than a single entry because in one recording a camera was accidentally hit. We thus provide two calibrations for this camera, one before and one after the hit occurred.
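
As an illustration of how these calibration entries might be consumed, the sketch below selects the entry valid for a given frame and projects a world-space point into the image. It assumes calibration.json holds a JSON list of entries with the fields above, and it omits the lens-distortion step for brevity; OpenCV's cv2.Rodrigues and cv2.projectPoints provide equivalent, battle-tested routines.

```python
import json
import numpy as np

def load_calibration(path, frame):
    """Pick the calibration entry whose [start_frame, end_frame] range
    contains the given frame index."""
    with open(path) as f:
        entries = json.load(f)
    for entry in entries:
        if entry["start_frame"] <= frame <= entry["end_frame"]:
            return entry
    raise KeyError(f"no calibration covers frame {frame}")

def rodrigues(rvec):
    """Convert an axis-angle vector to a 3x3 rotation matrix."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = np.asarray(rvec, dtype=float) / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def project(point3d, calib):
    """Project a world-space 3D point into the image: u = K (R p + t)."""
    R = rodrigues(calib["rvec"])
    t = np.asarray(calib["tvec"], dtype=float)
    p_cam = R @ np.asarray(point3d, dtype=float) + t
    u = np.asarray(calib["K"], dtype=float) @ p_cam
    return u[:2] / u[2]   # pixel coordinates (before distortion)
```

The axis-angle handling mirrors the rvec convention described above; for sub-pixel accuracy, distCoef should additionally be applied to the normalized coordinates.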

During the recording, the number of persons in the camera views fluctuates strongly, ranging from a single person to multiple people in close proximity occluding each other. Some of the actors were given tasks such as preparing coffee or making a phone call, but the order and time of execution could be chosen freely. Furthermore, people were encouraged to participate in group activities or simply follow their normal routines during the recording. In comparison to other works, this results in very natural behavior, which was further helped by the long recording time relative to other datasets, as people got used to the cameras.

2.2 Person Trajectories

Figure 3: Generated 3D skeletons of persons (bottom right) projected into three camera views.

A further crucial aspect of the dataset are the provided 3D person trajectories. For each person that enters the recording volume, a unique id is created. Persons can enter and leave the scene, but their unique id stays the same during the complete recording. In case a person appears in multiple recordings, the unique id is kept to ensure cross-recording consistency. Furthermore, a 3D point is annotated in each frame in which a person is inside the recording volume. In addition, we provide 3D skeletons for each person, generated using the method by Tanke and Gall [42]. The skeletons follow the COCO keypoint challenge structure with 6 additional joints for the feet [4]. An example of the generated skeleton data in a crowded scene is shown in Figure 3. The total number of persons in a recording varies from 18 to 30, with 12 being the maximum number of persons seen at the same time in a scene. Furthermore, a total of 60 high-level activities were annotated, such as eat cake, make coffee or use smartphone. We provide a separate JSON file for each person which contains the following items:


  • trajectories/

    • person000.json

      • frames: Ordered list of all the frames where the person is visible during the recording. The frames correspond one-to-one to the frames of the 25Hz videos recording.mp4, described in Section 2.1.

      • activities: List of activities that the person is doing. This list has a one-to-one mapping to the frames property, but a person can do multiple activities at the same time. The list of all activities is described in Table 2.

      • positions: List of 3D points, representing the 3D location of the person. This list has a one-to-one mapping to the frames property and is human-annotated.

      • poses: List of 3D poses, representing the 3D pose of the person. This list has a one-to-one mapping to the frames property and is generated using the method by Tanke and Gall [42]. A pose consists of 24 joints each being represented by a 3D point.

    • person001.json
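
To illustrate how the per-frame positions and activity labels combine into an activity map of the kind shown in Figure 1, the sketch below histograms the floor-plane positions at which a chosen activity is active. The parallel-list layout follows the JSON fields above, while the grid extent and cell size are illustrative assumptions rather than dataset parameters.

```python
import numpy as np

def activity_map(positions, activities, target, cell=0.25, extent=5.0):
    """Accumulate the floor-plane (x, y) positions at which `target` is
    among the per-frame activity labels into a normalized 2D histogram.
    `positions` and `activities` are the parallel per-frame lists from a
    person's JSON file; `cell` and `extent` are in meters."""
    n = int(np.ceil(2 * extent / cell))
    grid = np.zeros((n, n))
    for pos, acts in zip(positions, activities):
        if target not in acts:
            continue
        i = int((pos[0] + extent) / cell)   # x -> row
        j = int((pos[1] + extent) / cell)   # y -> column
        if 0 <= i < n and 0 <= j < n:
            grid[i, j] += 1
    total = grid.sum()
    return grid / total if total > 0 else grid
```

Summing such grids over all persons in a recording yields a per-activity likelihood map over the scene.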

Carry cake | Carry cup | Carry kettle
Carry milk | Carry plate | Carry whiteboard eraser
Carry whiteboard marker | Check water in coffee machine | Clean countertop
Clean dish | Close cupboard | Close dishwasher
Close drawer | Close fridge | Cut cake
Draw on whiteboard | Drink | Eat cake
Eat fruit | Empty cup in sink | Empty ground coffee
Empty water from coffee machine | Erase on whiteboard | Fill coffee beans
Fill coffee water tank | Make coffee | Mark coffee in list
Open cupboard | Open dishwasher | Open drawer
Open fridge | Peel fruit | Place cake on plate
Place cake on table | Place cup onto coffee machine | Place in microwave
Place sheet onto whiteboard | Place water tank into coffee machine | Pour kettle
Pour milk | Press coffee button | Put cake in fridge
Put cup in microwave | Put in dishwasher | Put sugar in cup
Put teabag in cup | Put water in kettle | Read paper
Remove sheet from whiteboard | Start dishwasher | Start microwave
Take cake out of fridge | Take cup from coffee machine | Take out of microwave
Take teabag | Take water from sink | Take water tank from coffee machine
Use laptop | Use smartphone | Wash hands
Table 2: List of the 60 annotated activities.

2.3 3D Data Recordings

In addition to the videos of the recorded activities and the person trajectory data, we also provide RGB-D recordings and reconstructed 3D mesh data of the kitchen environments. For this, we used a Microsoft Kinect v2 sensor, which records RGB data with a native resolution of 1920×1080 as well as depth images with a resolution of 512×424, both at 30Hz. These input streams are time-synchronized by the sensor's firmware and cover all relevant objects and scene content. Across the four kitchen scenes, we recorded RGB-D image data which we used to reconstruct high-resolution reference 3D meshes. In particular, we provide the following data:


  • models3d/

    • kitchen00/

      • input_rgbd_stream/: Recorded RGB-D image data in TUM RGB-D dataset [41] format:

        • rgb/: Directory containing the recorded color data stored as 3-channel 8-bit PNG images.

        • depth/: Directory containing the recorded depth data stored as 1-channel 16-bit PNG images. Depth values, measured in meters, are scaled by a factor of 5000.

        • rgb.txt: Order of the RGB image sequence.

        • depth.txt: Order of the depth image sequence. These two sequences are already time-synchronized such that the i-th image in both text files corresponds to the same point in time during recording.

      • intrinsics.json: Intrinsic camera parameters of the RGB-D sensor

        • depth: Intrinsic parameters of the depth camera, i.e. w, h, K, and distCoef (see Section 2.1).

        • rgb: Intrinsic parameters of the RGB camera. Same format as for depth.

        • tvec: 3D translation vector between both cameras.

        • rvec: 3D rotation vector between both cameras in axis-angle format. The resulting rigid transformation maps a 3D point from the depth camera coordinate system to the one of the RGB camera.

      • initial_camera_pose.json: Initial 6-DOF camera pose relative to the calibration coordinate system (see Section 3.2) in JSON format.

        • tvec: 3D translation vector to the global coordinate system.

        • rvec: 3D rotation vector to the global coordinate system in axis-angle format.

      • reference_output_mesh.ply: Reference 3D mesh reconstructed from the RGB-D data and stored in PLY format.

    • kitchen01/

    • kitchen03/

To improve the usability and accessibility of the dataset, we adopt widely used file formats for data storage. For the RGB-D image sequences, we use the same format as the TUM RGB-D dataset [41], which is well known in the computer vision and robotics communities. Similar to the video data (see Section 2.1), we have chosen a simple format for the intrinsic camera parameters based on the OpenCV calibration toolkit API. Note that we provide the raw image data without any further pre-processing (besides the internal time-synchronization) to reduce the probability of introducing biases into the data. We therefore also provide the 6-DOF rigid transformation between the depth and RGB camera, which is required to register the two image streams.
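
As a small usage sketch for the depth data, the function below converts a TUM-format 16-bit depth image (meters × 5000, see above) into a point cloud in the depth camera frame, given the K matrix from intrinsics.json. Decoding the PNG itself (e.g. with imageio or Pillow) is left out.

```python
import numpy as np

def depth_to_points(depth_u16, K, scale=5000.0):
    """Back-project a 16-bit TUM-format depth image to 3D points in the
    depth camera frame. Zero depth values are treated as invalid."""
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    z = depth_u16.astype(np.float64) / scale   # raw values -> meters
    v, u = np.nonzero(z)                       # valid pixel coordinates
    zs = z[v, u]
    xs = (u - cx) * zs / fx                    # pinhole back-projection
    ys = (v - cy) * zs / fy
    return np.stack([xs, ys, zs], axis=1)
```

Applying the rigid rvec/tvec transformation from intrinsics.json to the result would move these points into the RGB camera frame.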

2.4 3D Data Segmentation

Figure 4: Semantic predictions: Colored reference meshes are reconstructed and provided as well as their semantically segmented counterparts where the color coding of the semantic labels corresponds to the ScanNet dataset [8].

The reconstructed 3D meshes are used for the prediction of semantic classes. We trained LatticeNet [31] on the ScanNet room dataset [8] and evaluated it on each kitchen mesh individually. Since LatticeNet can deal with raw point clouds, we ignore the connectivity information of the faces and only use the vertices of the mesh as input. As a result, we obtain per-vertex probability values over the 21 labels annotated in ScanNet, e.g. chairs, tables, windows, etc. We therefore provide a further PLY file for each scene in which every 3D point has an additional attribute for the class label with the highest probability and an attribute for the respective color according to the ScanNet color scheme.


  • models3d/

    • kitchen00/

      • reference_output_mesh_with_labels.ply: Reconstructed and additionally labelled reference 3D mesh stored in PLY format.

    • kitchen01/

    • kitchen03/

Results of the 3D meshes and their corresponding class label predictions are shown in Figure 4.
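
The per-vertex labeling step described above can be sketched as follows: an (n_vertices × 21) probability array is collapsed to argmax label ids, and a color is looked up per vertex. The two-class palette in the test merely stands in for the actual ScanNet color scheme.

```python
import numpy as np

def collapse_labels(probs, palette):
    """Reduce per-vertex class probabilities (one row per vertex, one
    column per class) to the most likely label id and its palette color."""
    probs = np.asarray(probs, dtype=float)
    labels = np.argmax(probs, axis=1)          # highest-probability class
    colors = np.asarray(palette)[labels]       # per-vertex color lookup
    return labels, colors
```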

3 Data Acquisition

Figure 5: Time synchronization of video recordings. We align the peaks of the unsynchronized videos (left) to ensure temporal consistency across different views.
Figure 6: Laser measuring device: We mounted a laser onto a camera gimbal to measure out-of-plane distances (right). The schematic overview of the apparatus (left) labels the center of the ball-bearing, the target point, the distance to the target measured by the laser, the top offset of the laser from the rotation axis, the forward offset, and the distance of the ball-bearing center to the ground.
Figure 7: Example frame of a calibrated camera. Small light-red circles represent 2D annotations of the landmark points that were hand-clicked, blue crosses represent the 3D positions of the landmarks which are projected into the calibrated camera view, and large red points represent the 3D locations of the other cameras projected into the camera view. Note that 3D positions in a camera view might be occluded by the scene.

In the following, we provide a description of the data acquisition process.

3.1 Synchronization

When working in multi-camera setups, the synchronization of the videos is essential to ensure temporal consistency across views. Unfortunately, the utilized off-the-shelf cameras do not provide hardware synchronization and, hence, we resort to the following approach: We record the videos independently and synchronize them afterwards in a post-processing step using the audio channel. To aid the offline synchronization, we produce sharp sounds using a clapperboard at the beginning and end of the recordings. We can then synchronize two signals (videos) a and b by finding the peak of their cross-correlation, which can be computed efficiently using the Fourier transform:

  (a ⋆ b)[k] = F⁻¹( F(a) · conj(F(b)) )[k],    k* = argmax_k (a ⋆ b)[k]

After finding the common start and end point, we cut the videos accordingly. Figure 5 shows an example of two videos before and after time synchronization.
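
The peak-of-cross-correlation step can be sketched with NumPy's FFT as below; the lag sign convention and the conversion to seconds via the audio sample rate are assumptions of this sketch.

```python
import numpy as np

def sync_offset(ref, sig):
    """Delay of `sig` relative to `ref`, in samples: the lag at which the
    cross-correlation of the two signals peaks, computed via FFT."""
    n = len(ref) + len(sig) - 1                 # full correlation length
    c = np.fft.irfft(np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n)), n)
    k = int(np.argmax(c))
    # lags 0..len(sig)-1 are stored first; negative lags wrap to the end
    return k if k < len(sig) else k - n
```

Dividing the returned offset by the audio sample rate gives the cut point in seconds.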

3.2 Calibration

To obtain the intrinsic camera parameters K and distCoef, we utilize a checkerboard and the well-known calibration algorithm by Zhang [49].

For the extrinsic parameter calibration, we have to solve the Perspective-n-Point (PnP) problem [20], for which we need a set of 3D points in world coordinates and their respective 2D positions in the camera image. Since each camera only covers a small section, we need to cover the complete recording volume with well-known 3D points. To ease the annotation work, we use easy-to-recognize landmark points such as corners of cupboards, whiteboards and fire warning signs. For cameras that mostly face uniform walls, we create artificial landmarks using colored tape.

To place these landmark points into a joint global coordinate space, we follow a multi-stage approach, since our scenes are non-convex and our measuring device only accurately measures in-plane distances. We first measure a set of base points which all reside on the ground plane (z = 0) to get an initial set of landmarks. For this, we pick a point on the ground to be the origin, then choose one point on the x- as well as one on the y-axis of our right-handed coordinate frame, and measure their distances from the origin. The positions of other points on the ground plane can then be found by gradient descent using the known positions of the base points and their distances to the target point. We use an off-the-shelf range measuring laser to measure the actual distances.

To extend the set of landmarks to arbitrary points in 3D space, we need to calculate the distances to at least three already known points. We obtain these distances with a laser measuring device that is mounted onto a camera gimbal as described in Figure 6. Here, we need to account for the gimbal geometry to get the correct distance d from the center of the ball-bearing to the target point t:

  d = sqrt((l + f)² + r²)

where r and f are the top and forward offset constants given by the gimbal, while l is the laser-measured distance from the laser apparatus at point p to the target point t. As the ball hinge is placed several millimeters above p, we also need to adjust for its height h above the ground when forming the constraints below. When we measure base points on the ground plane, we do not use the gimbal and the distance simplifies to d = l. For each known point p_i with corrected distance d_i, we can now set up the constraint

  ‖x − p_i‖ = d_i

with the height-adjusted positions p_i and distances d_i, i = 1, …, 3. To get the target position x, we solve the system with least-squared error

  E(x) = Σ_i ( ‖x − p_i‖ − d_i )²

by gradient descent. Figure 7 shows an example of a calibrated kitchen with various landmark points.
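
The gradient-descent solve for a landmark position can be sketched as follows; the step size and iteration count are illustrative choices, not the values used for the dataset.

```python
import numpy as np

def locate(known_points, distances, lr=0.1, iters=2000):
    """Recover an unknown 3D position from measured distances to known
    landmarks by gradient descent on E(x) = sum_i (||x - p_i|| - d_i)^2."""
    p = np.asarray(known_points, dtype=float)   # (n, 3) landmarks
    d = np.asarray(distances, dtype=float)      # (n,) measured distances
    x = p.mean(axis=0)                          # start at the centroid
    for _ in range(iters):
        diff = x - p                            # (n, 3)
        norms = np.linalg.norm(diff, axis=1)
        # dE/dx = sum_i 2 (||x - p_i|| - d_i) (x - p_i) / ||x - p_i||
        grad = 2.0 * ((norms - d) / np.maximum(norms, 1e-12))[:, None] * diff
        x = x - lr * grad.sum(axis=0)
    return x
```

With only three landmarks, two mirror solutions satisfy the distances; a fourth measurement disambiguates them.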

3.3 3D Data Recordings

Figure 8: Person annotation tool. The tool shows up to four camera views in which a person can be annotated. To create a 3D point, at least two views have to be utilized, while additional views can be used to obtain more accurate results. If a 2D point is selected in one or more views (top right and bottom left), epipolar lines are drawn in the other views to guide the annotator (top left and bottom right).

In addition to the video data, we also captured the scene using an RGB-D camera to obtain a 3D model of the scene geometry. For this, we used a Microsoft Kinect v2 to record the raw RGB-D data. In order to reconstruct 3D models from these input data, we use a standard volumetric 3D reconstruction approach [28, 17, 39]. Although significant advances in 3D reconstruction have been achieved [50], the reconstruction accuracy may still not be sufficient for certain parts of the model, e.g. in regions with challenging illumination conditions or only very few geometric and photometric features. Therefore, we provide both the captured RGB-D input streams, including the intrinsic parameters of our camera, as well as the reference 3D models. This allows users to directly use the reference 3D model for a fair and comparable evaluation of their technique, or to generate their own models with a different reconstruction approach to leverage improvements in accuracy. A common and widely used choice for the world coordinate system of the reconstructed model is the camera coordinate system of the first frame, i.e. choosing the identity as the initial camera pose. Since the calibration coordinate system is set as the reference system for the whole dataset, we instead align the 3D model with it and provide the respective transformation as the initial camera pose. Note that the RGB-D image acquisition was performed before the video recordings, so the resulting 3D models represent the state of the kitchen before any action has been performed.
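
To place a reconstructed mesh into the calibration coordinate system, the rvec/tvec pair from initial_camera_pose.json can be applied to its vertices. That this pair maps model-frame points into the calibration frame (rather than the inverse) is an assumption of this sketch.

```python
import numpy as np

def apply_initial_pose(vertices, pose):
    """Rigidly transform (n, 3) points with an axis-angle rvec and a tvec,
    as stored in initial_camera_pose.json: x' = R x + t."""
    rvec = np.asarray(pose["rvec"], dtype=float)
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        R = np.eye(3)                           # zero rotation
    else:
        k = rvec / theta                        # rotation axis
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        R = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    t = np.asarray(pose["tvec"], dtype=float)
    return np.asarray(vertices, dtype=float) @ R.T + t
```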

3.4 Data Annotation

Figure 9: Activity annotation tool. An annotator can choose an activity as well as its start and end point. Activities can overlap and are always associated with a person, who is represented as a 3D point.

For data annotation, we developed a new tool. Since each camera only covers a small portion of the recording volume, annotators must have easy access to the other views and be able to switch quickly between them. For annotating 3D objects, at least two views have to be utilized to allow triangulation. However, we allow annotating points in up to four views, as shown in Figure 8. For comfort and performance, the tool separates the videos into small chunks that can be easily annotated while retaining global information such as currently active actions and person ids; moving to the next or previous chunk is a single operation. Figure 8 shows how 3D person ids are annotated, while Figure 9 shows how activities are annotated. As activities are always associated with a 3D person trajectory, we first had to completely annotate the trajectories before labeling activities. The list of all 60 labeled activities can be seen in Table 2. Furthermore, an exemplary activity map as well as the trajectories of two persons are shown in Figures 1 and 2, respectively.

4 Summary

We presented Bonn Activity Maps, a large-scale dataset for human tracking, activity recognition and anticipation of multiple persons. Our dataset combines various types of scene information, using kitchen scenarios as an example: each recording contains time-synchronized video sequences of the scene, human-annotated 3D trajectories and generated 3D human poses of all individuals together with their activity labels, and reconstructed 3D environment data including semantically segmented counterparts.


The work has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) FOR 2535 Anticipating Human Behavior.


  • [1] I. Armeni, A. Sax, A. R. Zamir, and S. Savarese. Joint 2D-3D-Semantic Data for Indoor Scene Understanding. arXiv preprint arXiv:1702.01105, 2017.
  • [2] S. Bell, P. Upchurch, N. Snavely, and K. Bala. OpenSurfaces: A Richly Annotated Catalog of Surface Appearance. Transactions on Graphics, 2013.
  • [3] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material Recognition in the Wild with the Materials in Context Database. In Conference on Computer Vision and Pattern Recognition, 2015.
  • [4] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008, 2018.
  • [5] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Nießner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D Data in Indoor Environments. International Conference on 3D Vision, 2017.
  • [6] C. Chen, R. Jafari, and N. Kehtarnavaz. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In International Conference on Image Processing, 2015.
  • [7] L. Chunhui, H. Yueyu, L. Yanghao, S. Sijie, and L. Jiaying. PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding. arXiv preprint arXiv:1703.07475, 2017.
  • [8] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. In Conference on Computer Vision and Pattern Recognition, 2017.
  • [9] K. J. Dana, S. K. Nayar, B. van Ginneken, and J. J. Koenderink. Reflectance and texture of real-world surfaces. In Conference on Computer Vision and Pattern Recognition, 1997.
  • [10] M. Firman. RGBD datasets: Past, present and future. In Conference on Computer Vision and Pattern Recognition Workshops, 2016.
  • [11] M. Fisher, D. Ritchie, M. Savva, T. Funkhouser, and P. Hanrahan. Example-based Synthesis of 3D Object Arrangements. Transactions on Graphics, 2012.
  • [12] E. Hayman, B. Caputo, M. Fritz, and J.-O. Eklundh. On the Significance of Real-World Conditions for Material Classification. In European Conference on Computer Vision, 2004.
  • [13] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding. In Conference on Computer Vision and Pattern Recognition, 2015.
  • [14] B.-S. Hua, Q.-H. Pham, D. T. Nguyen, M.-K. Tran, L.-F. Yu, and S.-K. Yeung. SceneNN: A Scene Meshes Dataset with aNNotations. In International Conference on 3D Vision, 2016.
  • [15] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. Transactions on Pattern Analysis and Machine Intelligence, 2014.
  • [16] H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic Studio: A Massively Multiview System for Social Motion Capture. In International Conference on Computer Vision, 2015.
  • [17] O. Kähler, V. A. Prisacariu, C. Y. Ren, X. Sun, P. Torr, and D. Murray. Very High Frame Rate Volumetric Integration of Depth Images on Mobile Devices. Transactions on Visualization and Computer Graphics, 2015.
  • [18] Q. Kong, Z. Wu, Z. Deng, M. Klinkigt, B. Tong, and T. Murakami. MMAct: A Large-Scale Dataset for Cross Modal Human Action Understanding. In International Conference on Computer Vision, 2019.
  • [19] H. S. Koppula, R. Gupta, and A. Saxena. Learning Human Activities and Object Affordances from RGB-D Videos. International Journal of Robotics Research, 2013.
  • [20] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An Accurate O(n) Solution to the PnP Problem. International Journal of Computer Vision, 2009.
  • [21] W. Li and M. Fritz. Recognizing Materials from Virtual Examples. In European Conference on Computer Vision, 2012.
  • [22] W. Li, S. Saeedi, J. McCormac, R. Clark, D. Tzoumanikas, Q. Ye, Y. Huang, R. Tang, and S. Leutenegger. InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset. In British Machine Vision Conference, 2018.
  • [23] Y. Li, C. Lan, J. Xing, W. Zeng, C. Yuan, and J. Liu. Online Human Action Detection using Joint Classification-Regression Recurrent Neural Networks. In European Conference on Computer Vision, 2016.
  • [24] I. Lillo, A. Soto, and J. Carlos Niebles. Discriminative Hierarchical Modeling of Spatio-Temporally Composable Human Activities. In Conference on Computer Vision and Pattern Recognition, 2014.
  • [25] J. Liu, A. Shahroudy, M. Perez, G. Wang, L.-Y. Duan, and A. C. Kot. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [26] J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison. SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation? International Conference on Computer Vision, 2017.
  • [27] L. Murmann, M. Gharbi, M. Aittala, and F. Durand. A Dataset of Multi-Illumination Images in the Wild. In International Conference on Computer Vision, 2019.
  • [28] M. Nießner, M. Zollhöfer, S. Izadi, and M. Stamminger. Real-time 3D Reconstruction at Scale Using Voxel Hashing. Transactions on Graphics, 2013.
  • [29] F. Ofli, R. Chaudhry, G. Kurillo, R. Vidal, and R. Bajcsy. Berkeley MHAD: A comprehensive Multimodal Human Action Database. In Workshop on Applications of Computer Vision, 2013.
  • [30] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian. Histogram of Oriented Principal Components for Cross-View Action Recognition. Transactions on Pattern Analysis and Machine Intelligence, 2016.
  • [31] R. A. Rosu, P. Schütt, J. Quenzel, and S. Behnke. LatticeNet: Fast Point Cloud Segmentation Using Permutohedral Lattices. arXiv preprint arXiv:1912.05905, 2019.
  • [32] M. Savva, A. X. Chang, P. Hanrahan, M. Fisher, and M. Nießner. SceneGrok: Inferring Action Maps in 3D Environments. Transactions on Graphics, 2014.
  • [33] M. Savva, A. X. Chang, P. Hanrahan, M. Fisher, and M. Nießner. PiGraphs: Learning Interaction Snapshots from Observations. Transactions on Graphics, 2016.
  • [34] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In Conference on Computer Vision and Pattern Recognition, 2016.
  • [35] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor Segmentation and Support Inference from RGBD Images. In European Conference on Computer Vision, 2012.
  • [36] S. Song, S. P. Lichtenberg, and J. Xiao. SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite. In Conference on Computer Vision and Pattern Recognition, 2015.
  • [37] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser. Semantic Scene Completion from a Single Depth Image. Conference on Computer Vision and Pattern Recognition, 2017.
  • [38] E. H. Spriggs, F. De La Torre, and M. Hebert. Temporal Segmentation and Activity Classification from First-person Sensing. In Conference on Computer Vision and Pattern Recognition Workshops, 2009.
  • [39] P. Stotko, S. Krumpen, M. Weinmann, and R. Klein. Efficient 3D Reconstruction and Streaming for Group-Scale Multi-Client Live Telepresence. In International Symposium for Mixed and Augmented Reality, 2019.
  • [40] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe. The Replica Dataset: A Digital Replica of Indoor Spaces. arXiv preprint arXiv:1906.05797, 2019.
  • [41] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In International Conference on Intelligent Robots and Systems, 2012.
  • [42] J. Tanke and J. Gall. Iterative Greedy Matching for 3D Human Pose Tracking from Multiple Views. In German Conference on Pattern Recognition, 2019.
  • [43] M. A. Uy, Q.-H. Pham, B.-S. Hua, D. T. Nguyen, and S.-K. Yeung. Revisiting Point Cloud Classification: A New Benchmark Dataset and Classification Model on Real-World Data. In International Conference on Computer Vision, 2019.
  • [44] K. Wang, X. Wang, L. Lin, M. Wang, and W. Zuo. 3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks. In International Conference on Multimedia, 2014.
  • [45] T.-C. Wang, J.-Y. Zhu, E. Hiroaki, M. Chandraker, A. A. Efros, and R. Ramamoorthi. A 4D Light-Field Dataset and CNN Architectures for Material Recognition. In European Conference on Computer Vision, 2016.
  • [46] M. Weinmann, J. Gall, and R. Klein. Material classification based on training data synthesized using a btf database. In European Conference on Computer Vision, 2014.
  • [47] C. Wu, J. Zhang, S. Savarese, and A. Saxena. Watch-n-Patch: Unsupervised Understanding of Actions and Relations. In Conference on Computer Vision and Pattern Recognition, 2015.
  • [48] J. Xiao, A. Owens, and A. Torralba. SUN3D: A Database of Big Spaces Reconstructed using SfM and Object Labels. In International Conference on Computer Vision, 2013.
  • [49] Z. Zhang. A flexible new technique for camera calibration. Transactions on Pattern Analysis and Machine Intelligence, 2000.
  • [50] M. Zollhöfer, P. Stotko, A. Görlitz, C. Theobalt, M. Nießner, R. Klein, and A. Kolb. State of the Art on 3D Reconstruction with RGB-D Cameras. Computer Graphics Forum, 2018.