Understanding human behavior has been a subject of research for autonomous intelligent systems across many domains, from automated driving and mobile robotics to intelligent video surveillance and motion simulation. Human motion trajectories are a valuable learning and validation resource for a variety of tasks in these domains. For instance, they can be used for learning safe and efficient human-aware navigation, predicting the motion of people for improved interaction and service, and inferring motion regularities and detecting anomalies in the environment. Attention towards the trajectories, intentions and mobility patterns of people has increased considerably in the last decade.
Datasets of ground-level human trajectories, typically used for learning and benchmarking, include the ETH, Edinburgh and Stanford Drone datasets, recorded outdoors, and the indoor ATC, L-CAS and Central Station datasets (see Table I). While providing the basic input of motion trajectories, these datasets often lack relevant contextual information and desirable data properties, e.g. a map of static obstacles, coordinates of goal locations, social information such as the grouping of agents, high variety in the recorded behaviors, or long continuous tracking of each observed agent. Furthermore, most of the recordings are made outdoors, a robot is rarely present in the environment, and the ground truth pose annotation, whether automated or manual, is prone to artifacts and human errors.
In this paper we present a human-robot interaction study designed to collect human trajectories in a generic indoor social setting with extensive interaction between groups of people and a robot in a spacious environment with several obstacles. The locations of the obstacles and goal positions are set up to make navigation non-trivial and produce a rich variety of behaviors. The participants are tracked with a motion capture system; furthermore, several participants wear eye-tracking glasses. The “Tracking Human motion in the ÖRebro university” (THÖR) dataset (available at http://thor.oru.se), released publicly and free for non-commercial purposes, contains over 60 minutes of human motion in 395k frames, recorded at 100 Hz, 2531k people detections and over 600 individual and group trajectories between multiple resting points. In addition to the video stream from one of the eye-tracking headsets, the data includes 3D lidar scans and a video recording from stationary sensors. We quantitatively analyze the dataset using several metrics, such as tracking duration, perception noise, curvature and speed variation of the trajectories, and compare it to popular state-of-the-art datasets of human trajectories. Our analysis shows that THÖR offers more variety in recorded behavior, less noise, and longer durations of continuous tracking.
II Related Work
Recordings of human trajectories and eye gaze are useful for a number of research areas and tasks, both for machine learning and for benchmarking. Examples include person and group tracking [28, 16, 18], human-aware motion planning [11, 3, 27], motion behavior learning, human motion prediction [8, 30], human-robot interaction, video surveillance and collision risk assessment. In addition to basic trajectory data, state-of-the-art methods for tracking or motion prediction can also incorporate information about the environment, social grouping, head orientation or personal traits. For instance, Lau et al. estimate social grouping formations during tracking, and Rudenko et al. use group affiliation as a contextual cue to predict future motion. Unhelkar et al. use head orientation to disambiguate and recognize the typical motion patterns that people are following. Bera et al. and Ma et al. learn personal traits to determine interaction parameters between several people. To enable such research in terms of training data and benchmarking requirements, a state-of-the-art dataset should include this information.
Human trajectory data is also used for learning long-term mobility patterns, such as the CLiFF maps, to enable compliant flow-aware global motion planning and reasoning about long-term path hypotheses towards goals in distant map areas for which no observations are immediately available. Finally, eye gaze is a critical source of non-verbal information about human task and motion intent in human-robot collaboration, traffic maneuver prediction, spatial cognition and sign placement [10, 26, 1, 13, 7].
Existing datasets of human trajectories, commonly used in the literature, are summarized in Table I. With the exception of [6, 33, 34, 9], all datasets have been collected outdoors. Intuitively, patterns of human motion in indoor and outdoor environments are substantially different due to the scope of the environment and the typical intentions of people therein. Indoors, people navigate in loosely constrained but cluttered spaces with multiple goal points and many ways (e.g. from different homotopy classes) to reach a goal. This differs from their behavior outdoors, in either large obstacle-free pedestrian areas or relatively narrow sidewalks surrounded by various kinds of walkable and non-walkable surfaces. Among the indoor recordings, only [33, 9] introduce a robot navigating in the environment alongside humans. However, recording only from on-board sensors limits visibility and consequently restricts the perception radius. Furthermore, the ground truth positions of the recorded agents in all prior datasets were estimated from RGB(-D) or laser data. In contrast, we directly record the position of each person using a motion capture system, thus achieving higher accuracy of the ground truth data and complete coverage of the working environment at all times. Moreover, our dataset contains many additional contextual cues, such as social roles and groups of people, head orientations and gaze directions.
III Experimental Procedure
In order to collect motion data relevant for a broad spectrum of research areas, we designed an experiment that encourages social interactions among individuals, groups of people and the robot. The interactive setup assigns social roles and tasks so as to imitate typical activities found in populated spaces such as offices, train stations, shopping malls or airports. Its goal is to motivate participants to engage in natural and purposeful motion behaviors, as well as to create a rich variety of unscripted interactions. In this section we detail the system setup and the experiment design.
III-A System Setup
The experiment was performed in a spacious laboratory room and the adjacent utility room, separated by a glass wall (see the overview in Fig. 2). The laboratory room, where the motion capture system is installed, is mostly empty to allow for the maneuvering of large groups, but it also includes several constrained areas where obstacle avoidance and the choice of homotopy class are necessary. Goal positions are placed to force navigation through the room and generate frequent interactions in its center, while the placement of obstacles prevents walking between goals in a straight line.
To track the motion of the agents we used the Qualisys Oqus 7+ motion capture system (https://www.qualisys.com/hardware/5-6-7/) with 10 infrared cameras mounted on the perimeter of the room. The motion capture system covers the entire room volume apart from the rightmost part close to the podium entrance – a negligible loss given the experiment’s focus on the central part of the room. The system tracks small reflective markers at 100 Hz with fine spatial discretization. The coordinate frame origin is on the ground level in the middle of the room. For people tracking, the markers were arranged in distinctive 3D patterns on bicycle helmets, shown in Fig. 3. The motion capture system was calibrated before the experiment with a small average residual tracking error, and each helmet, as well as the robot, was defined in the system as a unique rigid body of markers, yielding its 6D position and orientation. Each participant was assigned an individual helmet for the duration of the experiment.
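The rigid-body step – recovering a 6D pose from a set of tracked markers – can be illustrated with the standard Kabsch algorithm. The sketch below is an illustrative reconstruction, not the Qualisys implementation; `template` is assumed to hold the calibrated marker layout of one helmet in its body frame.

```python
import numpy as np

def rigid_body_pose(template, observed):
    """Estimate the rotation R and translation t mapping a rigid marker
    template onto its observed 3D positions (Kabsch algorithm).

    template, observed: (N, 3) arrays of corresponding marker positions.
    Returns R (3x3) and t (3,) such that observed ~= template @ R.T + t.
    """
    c_t = template.mean(axis=0)                  # template centroid
    c_o = observed.mean(axis=0)                  # observed centroid
    H = (template - c_t).T @ (observed - c_o)    # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_o - R @ c_t
    return R, t
```

Given at least three non-collinear markers, the pose is unique, which is why each helmet carries a distinctive 3D marker pattern.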
For acquiring eye gaze data, four participants (helmet numbers 3, 6, 7 and 9) wore mobile eye-tracking headsets for the entire duration of the experiment; in this dataset, however, we only include the data from the Tobii Pro Glasses. The Tobii Pro Glasses also have a scene camera which records video at 25 fps, and a gaze-overlaid version of this video is included in the dataset. We synchronized the time server of the Qualisys system with the stationary Velodyne sensor and the eye-tracking glasses. Finally, we recorded a video of the experiment from a stationary camera mounted in a corner of the room.
The robot used in our experiment is a small Linde CitiTruck forklift robot, shown in Fig. 3. It was programmed to move in a socially unaware manner, following a pre-defined path around the room and adjusting neither its speed nor its trajectory to account for surrounding people. For safety reasons, the robot navigated at a low maximal speed and projected its current motion intent on the floor in front of it using a mounted beamer. A dedicated operator constantly monitored the experiment from a remote workstation to stop the robot in case of an emergency. The participants were made aware of the emergency stop button on the robot, should they be required to use it.
III-B Experiment Description
During the experiment the participants performed simple tasks, which required walking between several goal positions. To increase the variety of motion, interactions and behavioral patterns, we introduced several roles for the participants and created individual tasks for each role, summarized in Fig. 4.
The first role is the visitor, navigating alone or in groups of up to 5 people between four goal positions in the room. At each goal, visitors draw a random card indicating the next target. As each group was instructed to travel together, a group draws only one card at a time. We asked the visitors to talk and interact with the members of their group during the experiment, and we changed the group structure every 4–5 minutes. There are 6 visitors in our experiment. The second role is the worker, whose task is to receive and carry large boxes between the laboratory and the utility room. The workers wear yellow reflective vests. There are 2 workers in our experiment, one carrying boxes from the laboratory to the utility room, and the other vice versa. The third role is the inspector, who navigates alone between many additional targets in the environment, indicated by QR codes, in no particular order, stopping at each target to scan the code. We have one inspector in our experiment.
Several considerations motivate the introduction of the social roles. Firstly, with the motion of the visitors and the workers we introduce distinctive motion patterns in the environment, while the cards and the tasks make the motion focused and goal-oriented and prevent random wandering. The workers’ task allocation is such that idle standing and wandering behavior is also occasionally observed, embedded in their cyclical activity patterns. Furthermore, we expect that visitors navigating alone, visitors navigating in groups and workers carrying heavy boxes exhibit distinctive behaviors, so the grouping information and the social role cue (reflective vest) may improve intention and trajectory prediction. Finally, the motion of the inspector introduces irregular patterns in the environment, distinct from those of the majority of the visitors.
| Experiment | Round | Visitors (groups) | Workers (utility, lab) | Inspector | Duration |
| --- | --- | --- | --- | --- | --- |
| One obstacle | 1 | 6,7,5 + 8,2,4 | 3, 9 | 10 | 368 sec |
| One obstacle | 2 | 2,5,6,7 + 8,4 | 3, 9 | 10 | 257 sec |
| One obstacle | 3 | 6,7,8 + 4,5 + 2 | 3, 9 | 10 | 275 sec |
| One obstacle | 4 | 2,4,5,7,8 + 6 | 3, 9 | 10 | 315 sec |
| Moving robot | 1 | 4,5,6 + 3,7,9 | 2, 8 | 10 | 281 sec |
| Moving robot | 2 | 3,5,6,9 + 7,4 | 2, 8 | 10 | 259 sec |
| Moving robot | 3 | 5,7,9 + 4,6 + 3 | 2, 8 | 10 | 286 sec |
| Moving robot | 4 | 3,5,6,7,9 + 4 | 2, 8 | 10 | 279 sec |
| Moving robot | 5 | 3,6 + 4,9 + 5,7 | 2, 8 | 10 | 496 sec |
| Three obstacles | 1 | 2,3,8 + 6,7,9 | 5, 4 | 10 | 315 sec |
| Three obstacles | 2 | 2,8,9 + 3,6,7 | 5, 4 | 10 | 290 sec |
| Three obstacles | 3 | 2,3,7 + 8,9 + 6 | 5, 4 | 10 | 279 sec |
| Three obstacles | 4 | 2,3,6,7,9 + 8 | 5, 4 | 10 | 277 sec |
We prepared three variations of the experiment with different numbers of obstacles and different motion states of the robot. In the first variation, the robot is placed by a wall and does not move, and the environment has only one obstacle (see the layout in Fig. 2). The second variation introduces the moving robot, navigating around the obstacle (the trajectory of the robot is depicted in Fig. 4). The third variation features an additional obstacle and a stationary robot in the environment (see Fig. 2 with additional obstacles). We denote these variations as One obstacle, Moving robot and Three obstacles, respectively. In each variation of the experiment the group structure for the visitors was reassigned 4–5 times. Between variations, the roles were also reassigned. A summary of the experiment variations and durations is given in Table II.
Each round of the experiment started upon the moderator’s command, at which point the participants began executing their tasks. A round lasted for approximately four minutes and ended with another call from the moderator. To avoid artificial and unnatural motion due to knowing the true purpose of the experiment, we told the participants that the experiment was set up to validate the robot’s perceptive abilities, and that the motion capture data would be used to compare the perceived and actual positions of humans. Participants were asked not to communicate with us during the experiment. For safety and ethical reasons, we instructed the participants to act carefully near the robot, described as “autonomous industrial equipment” which does not stop if someone is in its way. An ethics approval was not required for our experiment as per institutional guidelines and the Swedish Ethical Review Act (SFS number: 2003:460). Written informed consent was obtained from all participants. Due to the relatively low weight of the robot used in this study and the safety precautions taken, there was no risk of harm to participants.
IV Results and Data Analysis
The THÖR dataset includes over 60 minutes of motion in 13 rounds of the three experiment variations. The recorded data contains over 395k frames at 100 Hz, 2531k human detections and 600+ individual and group trajectories between the goal positions. For each detected person, the 6D position and orientation of the helmet in the global coordinate frame is provided. Furthermore, the dataset includes the map of the static obstacles, goal coordinates and grouping information. We also share Matlab scripts for loading, plotting and animating the data. Additionally, eye gaze data is available for one of the participants (Helmet 9), as well as Velodyne scans from a static sensor and the recording from the stationary camera. We thoroughly inspected the motion capture data and manually cleaned it to remove occasional helmet ID switches and to recover several lost tracks. Afterwards, we applied an automated procedure to restore the lost positions of the helmets from incomplete sets of recognized markers. Fig. 5 shows a summary of the recorded trajectories.
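The track-restoration step can be illustrated with a simplified sketch. The actual procedure recovers poses from the partially recognized marker set; here, as a hypothetical stand-in, we only assume that short tracking dropouts appear as NaN rows in a (T, 3) position array and fill them by linear interpolation, leaving long gaps untouched.

```python
import numpy as np

def fill_short_gaps(positions, max_gap=10):
    """Linearly interpolate short dropout gaps (runs of NaN rows) in a
    tracked (T, 3) trajectory; gaps longer than max_gap frames are kept
    as NaN, since interpolating them would fabricate motion."""
    filled = positions.copy()
    valid = ~np.isnan(filled).any(axis=1)
    idx = np.flatnonzero(valid)                 # indices of valid frames
    for a, b in zip(idx[:-1], idx[1:]):
        gap = b - a - 1                         # number of missing frames
        if 0 < gap <= max_gap:
            # Interpolation weights strictly between the two endpoints.
            w = np.linspace(0.0, 1.0, gap + 2)[1:-1, None]
            filled[a + 1:b] = (1.0 - w) * filled[a] + w * filled[b]
    return filled
```

At 100 Hz, a `max_gap` of 10 frames corresponds to at most 0.1 s of missing data, short enough that linear motion is a reasonable local assumption.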
The THÖR dataset is recorded using a motion capture system, which yields more consistent tracking and more precise estimation of the ground truth positions, and therefore higher-quality trajectories, compared to the human detections from RGB-D or laser data typically used in existing datasets. For the quantitative analysis, we compare the recorded trajectories to several datasets which are often used for training and evaluating motion predictors for human environments. The popular ETH dataset is recorded outdoors in a pedestrian zone with a stationary downward-facing camera and is manually annotated. The Hotel sequence, used in our comparison, includes the coordinates of the 4 common goals in the environment and group information for walking pedestrians. The ATC dataset is recorded in a large shopping mall using multiple 3D range sensors, which allows for long tracking durations and the potential to capture interesting interactions between people; in addition to positions, it also includes facing angles. In this comparison we used the recordings from the 24th and 28th of October and the 14th of November. The Edinburgh dataset is recorded in a university campus yard using a downward-facing camera with variable detection frequency. For comparison we used the recordings from the 27th of September, 16th of December, 14th of January and 22nd of June.
For evaluating the quality of recorded trajectories we propose several metrics:
Tracking duration: the average length of continuous observation of a person; higher is better.
Trajectory curvature: the global curvature of the trajectory, caused by maneuvering of the agents in the presence of static and dynamic obstacles, measured on fixed-length intervals from the first, middle and last points of each interval. Higher curvature values correspond to more challenging, non-linear paths.
Perception noise: under the assumption that people move on smooth, non-jerky paths, we evaluate the local distortions of the recorded trajectory caused by the perception noise of the mocap system as the average absolute acceleration. Less noise is better.
Motion speed: the mean and standard deviation of velocities in the dataset, measured on fixed-length intervals. If the effect of perception noise on speed is negligible, a higher standard deviation means more diversity in the behavior of the observed agents, both in terms of individually preferred velocity and of compliance with other dynamic agents.
Minimal distance between people: the average Euclidean distance between the two closest observed people. This metric indicates the density of the recorded scenarios; lower values correspond to more crowded environments.
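The metrics above can be sketched in NumPy. This is a minimal illustration, not the authors' evaluation code: trajectories are assumed to be (T, 2) arrays sampled at a fixed interval `dt`, and "curvature from the first, middle and last points of an interval" is read as the Menger curvature of the circle through those three points.

```python
import numpy as np

def menger_curvature(p1, p2, p3):
    """Curvature of the circle through three 2D points:
    k = 4 * triangle_area / (|p1p2| * |p2p3| * |p1p3|)."""
    a = np.linalg.norm(p2 - p1)
    b = np.linalg.norm(p3 - p2)
    c = np.linalg.norm(p3 - p1)
    # 2D cross product gives twice the signed triangle area.
    cross = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    denom = a * b * c
    return 0.0 if denom == 0 else 2.0 * abs(cross) / denom

def perception_noise(traj, dt):
    """Average absolute acceleration of a (T, 2) trajectory,
    estimated via second-order finite differences."""
    acc = np.diff(traj, n=2, axis=0) / dt ** 2
    return float(np.linalg.norm(acc, axis=1).mean())

def speed_stats(traj, dt):
    """Mean and standard deviation of instantaneous speed."""
    v = np.linalg.norm(np.diff(traj, axis=0), axis=1) / dt
    return float(v.mean()), float(v.std())

def min_pairwise_distance(positions):
    """Smallest Euclidean distance among (N, 2) agent positions
    observed in a single frame."""
    diffs = positions[:, None, :] - positions[None, :, :]
    d = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(d, np.inf)     # ignore self-distances
    return float(d.min())
```

A straight constant-velocity track scores zero on both curvature and noise, so these metrics isolate maneuvering and sensor jitter from plain walking.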
The results of the evaluation are presented in Table III. Our dataset has sufficiently long trajectories with high curvature values, indicating that it includes more human-human and human-environment interactions than the existing datasets. Furthermore, despite the much higher recording frequency of THÖR (100 Hz) compared to the baselines such as ATC, the amount of perception noise in the trajectories is lower than in all baselines. The speed distribution shows that the range of observed velocities corresponds to the baselines, while the lower average velocity in combination with a high average curvature confirms the higher complexity of the recorded behaviors, because comfortable navigation on straight paths with constant velocity is not possible in the presence of static and dynamic obstacles. Finally, the high variance of the minimal distance between people (THÖR vs. ATC) shows that our dataset features both dense and sparse scenarios, similarly to ETH and Edinburgh.
An important advantage of THÖR in comparison to the prior art is the availability of rich interactions between the participants and groups in the presence of static obstacles and the moving robot. In this compact one-hour recording we observe numerous interesting situations, such as accelerating to overtake another person, cutting in front of someone, halting to let a large group pass, queuing for an occupied goal position, groups splitting and re-joining, choosing a sub-optimal motion trajectory from a different homotopy class because a narrow passage is blocked, and people hindering each other when walking in opposite directions. Fig. 6 illustrates several examples of such interactions.
V Conclusions

In this paper we presented a novel dataset of human motion trajectories, recorded in a controlled indoor environment. Aiming at applications in training and benchmarking human-aware intelligent systems, we designed the dataset to include a rich variety of human motion behaviors and interactions between individuals, groups and a mobile robot in an environment with static obstacles and several motion targets. Our dataset includes accurate motion capture data at high frequency, head orientations, eye gaze directions, and data from a stationary 3D lidar sensor and an RGB camera. Using a novel set of metrics for estimating dataset quality, we showed that it is less noisy and contains a higher variety of behavior than prior datasets.
This work has been partly funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732737 (ILIAD) and by the Swedish Knowledge Foundation under contract number 20140220 (AIR). We would like to thank Martin Magnusson for his invaluable support in our work with the motion capture system.
-  (2017) Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction 6 (1), pp. 25–63. Cited by: §II.
-  (2016) Social LSTM: human trajectory prediction in crowded spaces. In Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), pp. 961–971. Cited by: §II.
- (2015-05) Intention-aware online POMDP planning for autonomous driving in a crowd. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 454–460. Cited by: §II.
-  (2011) Stable multi-target tracking in real-time surveillance video. In Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), pp. 3457–3464. Cited by: TABLE I.
- Aggressive, tense, or shy? Identifying personality traits from crowd videos. In Proc. of the Int. Conf. on Artificial Intelligence (IJCAI), pp. 112–118. Cited by: §II.
-  (2013) Person tracking in large public spaces using 3-d range sensors. IEEE Trans. on Human-Machine Systems 43 (6), pp. 522–534. Cited by: TABLE I, §I, §II, §IV.
-  (2020) Bi-directional navigation intent communication using spatial augmented reality and eye-tracking glasses for improved safety in human–robot interaction. Robotics and Computer-Integrated Manufacturing 61, pp. 101830. Cited by: §II, §III-A.
-  (2012) Incremental learning of human social behaviors with feature-based spatial effects. In Proc. of the IEEE Int. Conf. on Intell. Robots and Syst. (IROS), pp. 2417–2422. Cited by: §II.
-  (2015) Real-time multisensor people tracking for human-robot spatial interaction. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), Works. on ML for Social Robo., Cited by: TABLE I, §II.
-  (2009) On the roles of eye gaze and head dynamics in predicting driver’s intent to change lanes. IEEE Transactions on Intelligent Transportation Systems 10 (3), pp. 453–462. Cited by: §II.
-  (2010) Probabilistic autonomous robot navigation in dynamic environments with human motion prediction. Int. Journal of Social Robotics 2 (1), pp. 79–94. Cited by: §II.
-  (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In , Cited by: TABLE I.
-  (2017) Eye tracking for spatial research: cognition, computation, challenges. Spatial Cognition & Computation 17 (1-2), pp. 1–19. Cited by: §II.
-  (2017) Enabling flow awareness for mobile robots in partially observable environments. IEEE Robotics and Automation Letters 2 (2), pp. 1093–1100. Cited by: §II.
-  (2017) A survey of methods for safe human-robot interaction. Foundations and Trends in Robotics 5 (4), pp. 261–349. Cited by: §II.
-  (2009) Tracking groups of people with a multi-model hypothesis tracker. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), Cited by: §II.
-  (2007) Crowds by example. In Computer Graphics Forum, Vol. 26, pp. 655–664. Cited by: TABLE I.
-  (2016) On multi-modal people tracking from mobile platforms in very crowded and dynamic environments. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), Cited by: §II.
-  (2019) Robust motion planning and safety benchmarking in human workspaces.. In SafeAI@ AAAI, Cited by: §II.
-  (2017) Forecasting interactive dynamics of pedestrians with fictitious play. In Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), pp. 4636–4644. Cited by: §II.
-  (2009) Statistical models of pedestrian behaviour in the forum. Master’s thesis, School of Informatics, University of Edinburgh. Cited by: TABLE I, §I, §IV.
-  (2018) Modelling and predicting rhythmic flow patterns in dynamic environments. In Annual Conf. Towards Autonom. Rob. Syst., pp. 135–146. Cited by: §II.
-  (2010) The walking behaviour of pedestrian social groups and its impact on crowd dynamics. PloS one 5 (4), pp. e10047. Cited by: Fig. 6.
-  (2011) A large-scale benchmark dataset for event recognition in surveillance video. In Proc. of the IEEE Conf. on Comp. Vis. and Pat. Rec. (CVPR), pp. 3153–3160. Cited by: TABLE I.
Learning socially normative robot navigation behaviors with bayesian inverse reinforcement learning. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), Cited by: §II.
-  (2016) Robot reading human gaze: why eye tracking is better than head tracking for human-robot collaboration. In Proc. of the IEEE Int. Conf. on Intell. Robots and Syst. (IROS), Cited by: §II.
-  (2017) Kinodynamic motion planning on gaussian mixture fields. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 6176–6181. Cited by: §II.
-  (2009) You’ll never walk alone: modeling social behavior for multi-target tracking. In Proc. of the IEEE Int. Conf. on Computer Vision (ICCV), pp. 261–268. Cited by: TABLE I, §I, §II, §IV.
-  (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In Proc. of the Europ. Conf. on Comp. Vision (ECCV), pp. 549–565. Cited by: TABLE I, §I.
-  (2018) Human motion prediction under social grouping constraints. In Proc. of the IEEE Int. Conf. on Intell. Robots and Syst. (IROS), Cited by: §II.
-  (2019) Human motion trajectory prediction: a survey. arXiv preprint arXiv:1905.06113. Cited by: §I, §II, §IV.
-  (2015) Human-robot co-navigation using anticipatory indicators of human walking motion. In Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA), pp. 6183–6190. Cited by: §II.
-  (2017) Online learning for human classification in 3D LiDAR-based tracking. In Proc. of the IEEE Int. Conf. on Intell. Robots and Syst. (IROS), pp. 864–871. Cited by: TABLE I, §I, §II.
-  (2012) Understanding collective crowd behaviors: learning a mixture model of dynamic pedestrian-agents. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2871–2878. Cited by: TABLE I, §I, §II.