Human Position Detection and Tracking with On-robot Time-of-Flight Laser Ranging Sensors

by Sarthak Arora, et al.
Rochester Institute of Technology

In this paper, we propose a simple methodology to detect the partial pose of a human occupying the manipulator workspace using only on-robot time-of-flight laser ranging sensors. The sensors are affixed to each link of the robot in a circular-array fashion, where each array comprises sixteen single-unit laser-ranging lidars. The detection is performed by an artificial neural network that takes a highly sparse 3-D point cloud as input and produces an estimate of the partial pose, i.e., the ground-projected frame of the human footprint. We also present a particle filter based approach to the tracking problem when the input data is unreliable. Finally, simulation results are presented and analyzed.




I Introduction

With the onset of human-robot collaboration (HRC), the interaction between the operator and the robot has become extremely human-centric. For any interaction to occur safely, information about the human/operator position with respect to the robot must be available. Such scenarios are common on factory floors and in indoor environments. The use of exteroceptive sensor systems such as [5] and [16] has enabled complete human tracking, including bio-mechanical information. It must be noted that these sensing systems are set up and mounted in the robot's environment and usually require calibration routines and planning of sensor placement around the concerned volume of operation. However, due to the densely cluttered nature of indoor environments and factory floors, occlusion becomes inevitable. To alleviate this problem, the use of exteroceptive sensors affixed to the robot is a viable option.

It can be verbose and confusing to refer to the aforementioned sensors by their designated terms. For convenience, the systems can be divided into two categories, similar to virtual-reality (VR) tracking systems. When the tracking system is completely self-contained within the VR headset, it is referred to as "inside-out" tracking; when it is completely external to the headset, it is referred to as "outside-in" tracking. Similarly, when a sensing system is affixed to the robot, it is convenient to call it an "inside-out" sensing system from the robot's perspective, and "outside-in" when the sensors are mounted in the environment. A similar idea was proposed in [8], where the authors classified "inside-out" & "outside-in" as intrinsic & extrinsic sensing systems respectively. They also laid the groundwork for the sensing system demonstrated in this work. In [9], several arguments were presented to demonstrate the similarity between the simulated and the physical versions of the time-of-flight laser-ranging sensing system. An ISO-compliant [1] safety algorithm was also presented in [9]; however, that approach only leveraged distance thresholds to design the controller. The motivation of this work is to augment the work done in [9] with a human position estimation system using solely the "inside-out" sensing system.

Fig. 1: UR10 robot with time-of-flight sensors mounted on each link; rays are shown when they register a hit. The observations from each lidar are used to estimate and track the human operator in the robot workspace.

Our approach is inspired by several works in the fields of object detection and vehicle tracking. In [7], the authors present a particle filter based approach to track a single passing vehicle using six ultrasonic sensors distributed around the ego vehicle. In [12], the authors design an Extended Kalman Filter using eight ultrasonic sensors mounted on each side of the vehicle to track other objects. It should be noted that the aforementioned approaches assume the vehicles to be completely on the ground and therefore track the ground projections of the objects. Leveraging this assumption, the authors of [17] generate the top view of 3-D point clouds and slice the clouds along the z-axis to obtain 2-D slices of the points. In [4], the authors use the top view of the point cloud as input to a deep learning model.

As our environment is indoor and controlled, our problem formulation is greatly simplified compared to the approaches mentioned above. It is assumed that the robot workspace is sporadically occupied by only one person, and the 3-D point clouds generated are down-projected to 2-D point maps for denser inputs with limited pose information. Ultimately, our work makes the following contributions:

  • A simple approach to human position estimation and tracking that solely uses "inside-out" sensing systems.

  • Combined with the work done in [9], the approach presented in this paper can be used as a perception strategy for the safety controller.

II Problem Formulation

II-A Setup & Assumptions

The system presented in this work comprises three circular arrays, each consisting of sixteen single-unit lidars with a field of view of 25 degrees per unit. The pose of the human occupying the workspace is considered to be a two-dimensional partial pose due to the limited sensing capability of the system.

The human pose is represented by the 2-D location of the human on the workspace floor. The pose is computed by considering only x & y with respect to a fixed reference frame. Therefore, it can be represented as:

    p_human = (x_World, y_World)

where "World" represents the fixed reference frame at the center of the workspace. It should be noted that the third coordinate (z) has been discarded.

It should be noted that the robot's base frame is created by projecting it directly on top of the world frame. As each sensor array is circular, it is mounted like a ring around each link of the robot such that there is zero translation and rotation between the ring center and the link center. This makes the sensors part of the robot's kinematic chain. Therefore, the distance reported by each single-unit lidar can be converted to a 3-D point observed with respect to the world frame. Each lidar unit can observe distances of up to 2 m. This can be imagined as a distributed lidar spread across the robot's body surface. The transformation of the distance d_ij reported by each lidar in each ring can be written as:

    p_ij^World = T_ij^World · (0, 0, d_ij)

where i & j represent the ring and the lidar index respectively.
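As a concrete sketch of this transformation (the helper name is hypothetical; the 4x4 homogeneous transform of each lidar would come from the robot's kinematic chain):

```python
import numpy as np

def lidar_point_in_world(T_world_lidar: np.ndarray, d: float) -> np.ndarray:
    """Convert a range reading d from one lidar unit into a 3-D point in the
    world frame, given that lidar's 4x4 homogeneous transform T_world_lidar
    obtained from the robot's kinematic chain (hypothetical helper)."""
    p_lidar = np.array([0.0, 0.0, d, 1.0])  # hit point along the sensor's z-axis
    return (T_world_lidar @ p_lidar)[:3]

# Example: a lidar 2 m above the floor, pointing straight down (-z of world)
T = np.array([[1.0,  0.0,  0.0, 0.0],
              [0.0, -1.0,  0.0, 0.0],
              [0.0,  0.0, -1.0, 2.0],
              [0.0,  0.0,  0.0, 1.0]])
print(lidar_point_in_world(T, 1.5))  # -> [0. 0. 0.5], a point 0.5 m above the floor
```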

Each lidar observation thus yields a 3-D point through the above-mentioned kinematic chain. However, as a lidar reports any observed distance, associating that data with a human subject can be difficult.

II-B Data Association

In [9], a method for self-occlusion detection is shown where a physics engine is used for data association. Only the lidar observations that are not associated with the robot are kept. As there is only one human occupying the workspace, it is safe to say that the remaining observations are associated with the human.

III Methodology

The lidar observations from each ring that are associated with the human are collected along with the human pose with respect to the "World" frame. The collected data is then split into training and test sets for training an artificial neural network. The output of the artificial neural network is then used as input to a particle filter that provides the tracking capability of the system.

III-A Data Collection & Processing

The data is collected by time synchronization of the entire system. As a simulator is used for the experiment, each lidar observation, the joint angles of the robot, and the human pose are hard time-synced with each other to obtain the complete state of the system at any given time.

III-B Artificial Neural Network

The artificial neural network learns the mapping from the lidar observations & joint angles of the robot to the human pose. The network receives an input vector x of size 54:

    x = [l_1; l_2; l_3; q]

Each element of x is a vertical (column) vector. Each l_i is a vector of size 16 x 1 whose elements are the observations reported by the lidars of ring i, and q is the vector of the robot's joint angles of size 6 x 1, giving 3 x 16 + 6 = 54 inputs.
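A minimal sketch of assembling this input vector, assuming each ring contributes its sixteen range readings and the UR10 contributes six joint angles (the function name and the per-ring scalar encoding are assumptions consistent with the stated input size of 54):

```python
import numpy as np

N_RINGS, N_LIDARS, N_JOINTS = 3, 16, 6  # three rings of 16 lidars, UR10 joints

def build_input(ring_ranges, joint_angles):
    """Stack per-ring lidar observations and joint angles into the
    54-element network input vector (3*16 ranges + 6 joint angles)."""
    assert len(ring_ranges) == N_RINGS
    x = np.concatenate([np.asarray(r, dtype=float) for r in ring_ranges]
                       + [np.asarray(joint_angles, dtype=float)])
    assert x.shape == (N_RINGS * N_LIDARS + N_JOINTS,)
    return x

x = build_input([np.zeros(16), np.ones(16), np.full(16, 2.0)], np.zeros(6))
print(x.shape)  # (54,)
```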

The output of the network is the human pose, which is essentially a vector of size 2 x 1. The network is essentially a regressor that takes the input vector and outputs a continuous value, where the ground truth is the human pose recorded during the simulation. Therefore, the network output is characterized by:

    y_k = f(x_k),  k = 1, ..., n

where n is the size of the training data.

The network essentially acts like a detector which regresses over the inputs to produce a continuous pose. It can be formulated in parameterized form:

    y = f(x; θ)

where θ denotes the network parameters.
III-C Network Training & Architecture

The network has a simple structure and comprises three hidden layers with ReLU [6] and tanh non-linear activations. The output layer of the network is the identity, i.e., it directly produces the logits at the output. The network also possesses a dropout [13] layer before the output layer to avoid overfitting.
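The described architecture can be sketched as a plain NumPy forward pass; the hidden widths, the exact placement of the tanh layer, and the dropout rate are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [54, 64, 64, 64, 2]  # 54 inputs, three hidden layers, 2-D pose output
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x, train=False, p_drop=0.5):
    """Three hidden layers (ReLU, ReLU, tanh), dropout before the identity
    output layer; the activation ordering is an assumption about the mix
    described in the text."""
    h = np.maximum(0.0, x @ weights[0] + biases[0])   # ReLU
    h = np.maximum(0.0, h @ weights[1] + biases[1])   # ReLU
    h = np.tanh(h @ weights[2] + biases[2])           # tanh
    if train:                                         # inverted dropout
        h *= (rng.random(h.shape) > p_drop) / (1.0 - p_drop)
    return h @ weights[3] + biases[3]                 # identity output (logits)

pose = forward(np.zeros(54))
print(pose.shape)  # (2,)
```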

The network loss function was chosen to be the root mean square error (RMSE), as the data is essentially pose data and we attempt to directly learn a mapping. The loss is given by:

    L = sqrt( (1/n) * sum_{k=1}^{n} || y_k - f(x_k; θ) ||^2 )
The input data for the network was subjected to gaussian noise, as the simulator provides ideal conditions, in order to simulate sensor noise. To make the network robust when estimating the human pose, some gaussian noise was also injected into the ground truth and data augmentation was performed.
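The RMSE loss and the gaussian-noise augmentation described above can be sketched as follows; the noise standard deviations are illustrative placeholders, not values from the paper:

```python
import numpy as np

def rmse(pred, truth):
    """Root-mean-square error over n pairs of 2-D poses."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    return float(np.sqrt(np.mean(np.sum((pred - truth) ** 2, axis=-1))))

def augment(x, y, sigma_x=0.01, sigma_y=0.005, rng=None):
    """Inject gaussian noise into an input vector x and its ground-truth
    pose y; the sigma values here are placeholders."""
    rng = rng if rng is not None else np.random.default_rng()
    return (x + rng.normal(0.0, sigma_x, x.shape),
            y + rng.normal(0.0, sigma_y, y.shape))

print(rmse([[1.0, 0.0]], [[0.0, 0.0]]))  # 1.0
```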

III-D Tracking with Particle Filter

The output of the neural network was used as input to a particle filter, where each particle is represented by a state vector:

    s = [x, y, vx, vy]^T

comprising the planar position and velocity of the human, consistent with the fixed-velocity motion model used by the filter.
The state vector was subjected to input noise as shown in [7]. The particle filter algorithm was implemented as done in [2]. At first, it seems obvious to directly fuse the reading from each single-unit lidar in the system; however, the single-unit lidars observe only a few different points on the human body at a time. The sensor independence would lead to biased estimation if the raw observations were passed to the filter. To overcome the sparsity in the observations, the joint angles of the robot were added to the neural network inputs.
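A minimal particle filter sketch under these assumptions (constant-velocity state [x, y, vx, vy], gaussian likelihood around the network's pose estimate, simple multinomial resampling; the noise scales and particle count are illustrative, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500                                   # particle count (tunable)
particles = rng.normal(0.0, 1.0, (N, 4))  # state [x, y, vx, vy] per particle

def predict(particles, dt, q=0.05):
    """Constant-velocity motion model with gaussian process noise."""
    out = particles.copy()
    out[:, 0] += out[:, 2] * dt
    out[:, 1] += out[:, 3] * dt
    out += rng.normal(0.0, q, out.shape)
    return out

def correct(particles, z, r=0.1):
    """Weight particles by the gaussian likelihood of the network estimate z,
    then resample (multinomial resampling kept simple here)."""
    d2 = np.sum((particles[:, :2] - z) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / r ** 2)
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

particles = predict(particles, dt=0.05)
particles = correct(particles, z=np.array([0.5, -0.2]))
print(particles.shape)  # (500, 4)
```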

initialize particles;
initialize neuralnet;
initialize timestep;
while true do
       if inputIsValid() then
              output = net(input);
              prediction = predict(particles, timestep);
              correction = correct(particles, output);
       else
              prediction = predict(particles, timestep);
       end if
end while
Algorithm 1 Estimation and Tracking Algorithm

Shown above is the algorithm being used. It should be noted that the algorithm first checks for a valid input before making a correction. An invalid input is obtained when no observations are reported; in that case the algorithm moves on to making predictions using only the motion model, which in this case is a fixed-velocity model.

IV Simulation Experiment Results

The experiment was performed by randomizing the robot and the human trajectories. In other words, the experiment was run for an arbitrary amount of time while data was collected from the simulator. The joint angles of the robot were driven by drawing angle values from a uniform distribution and passing those values to a joint controller. The human trajectory, on the other hand, was randomized by inducing random turns in the walking path of the human.

Fig. 2: A figure showing the trajectory of the human around the robot (located at (0,0)). It can be seen from the figure that the ground truth is biased away from the neural network and particle filter plots. This can be explained by the fact that the true human pose is always farther away than the outermost points on the human body surface observed by the lidars.

          Neural Net    Particle Filter
RMSE      0.1196        0.1061

After obtaining the RMSE values, it can be seen that the particle filter has a lower RMSE of 0.1061 compared to the neural net (0.1196), which is due to the innovation performed by the particle filter when no inputs were present; it can be thought of as imputing values when none are available. It can be seen from the results that the bias parameters of the neural network are essentially responsible for controlling the offset between the ground-truth trajectory and the predicted trajectory, while the weights control the significance of each lidar observation. The particle filter performance could be tuned primarily by changing the number of particles: higher counts generated a smoother trajectory, but increasing the particle count can greatly increase the computation time.

The neural network performance was greatly affected by introducing a tanh activation in the dense layers of the network. The network was trained using stochastic gradient descent [3] with Nesterov momentum [14]. A decaying learning rate was used with a decay factor of 0.000001, with the learning rate kept around 0.01. It was also observed that using RMSE as the loss function penalized the network much more heavily than mean squared error.
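A toy sketch of SGD with Nesterov momentum and a decaying learning-rate schedule; only the initial rate of 0.01 and the decay factor 1e-6 come from the text, while the decay formula, momentum value, and toy objective are assumptions:

```python
import numpy as np

def sgd_nesterov_step(w, grad_fn, v, lr, momentum=0.9):
    """One SGD update with Nesterov momentum: evaluate the gradient at the
    look-ahead point w + momentum*v, then update velocity and weights."""
    g = grad_fn(w + momentum * v)
    v = momentum * v - lr * g
    return w + v, v

def lr_at(step, lr0=0.01, decay=1e-6):
    """Decaying learning rate: start near 0.01, decay by a 1e-6 factor."""
    return lr0 / (1.0 + decay * step)

# Minimize the toy objective f(w) = ||w||^2 / 2 (gradient is w) as a sanity check
w, v = np.array([1.0, -1.0]), np.zeros(2)
for step in range(200):
    w, v = sgd_nesterov_step(w, lambda u: u, v, lr_at(step))
print(np.linalg.norm(w) < 1.0)  # True: the iterate contracts toward the minimum
```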

V Conclusion and Future Work

As can be seen from the simulation results, using a neural network as an estimator for sparse point clouds is possible, and it can be safely said that more powerful methods such as convolutional neural networks (CNNs) [10] could improve upon it. Using techniques to project point-cloud information into occupancy grids [15] can potentially create an opportunity for using CNNs. The approach could also benefit from applying sensor fusion directly to the lidar observations; however, it was noted that using the joint angles of the robot along with the observations greatly enhanced the network performance, because the joint angles act as a context vector in the network and were effective even when only one or a few 3-D points were available in the point cloud.


  • [1] ISO/TS 15066:2016. Robots and robotic devices - Collaborative robots. Cited by: §I.
  • [2] M. S. Arulampalam, S. Maskell, N. Gordon, and T. Clapp (2002-02) A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Transactions on Signal Processing 50 (2), pp. 174–188. External Links: ISSN 1053-587X, Document Cited by: §III-D.
  • [3] L. Bottou, F. E. Curtis, and J. Nocedal (2016-06) Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838 [cs, math, stat]. External Links: Link Cited by: §IV.
  • [4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2016-11) Multi-View 3d Object Detection Network for Autonomous Driving. arXiv:1611.07759 [cs]. Note: arXiv: 1611.07759 External Links: Link Cited by: §I.
  • [5] Flex 13 - An affordable motion capture camera. (en). External Links: Link Cited by: §I.
  • [6] V. Nair and G. E. Hinton (2010) Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML). Cited by: §III-C.
  • [7] P. Köhler, C. Connette, and A. Verl (2013-05) Vehicle tracking using ultrasonic sensors joined particle weighting. In 2013 IEEE International Conference on Robotics and Automation, pp. 2900–2905. External Links: Document Cited by: §I, §III-D.
  • [8] S. Kumar, C. Savur, and F. Sahin (2018-10) Dynamic Awareness of an Industrial Robotic Arm Using Time-of-Flight Laser-Ranging Sensors. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 2850–2857. External Links: Document Cited by: §I.
  • [9] S. Kumar, S. Arora, and F. Sahin (2019-04) Speed and separation monitoring using on-robot time–of–flight laser–ranging sensor arrays. Cited by: 2nd item, §I, §II-B.
  • [10] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner (1998-11) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. External Links: ISSN 0018-9219, Document Cited by: §V.
  • [11] Y. LeCun, Y. Bengio, and G. Hinton (2015-05) Deep learning. Nature 521 (7553), pp. 436–444 (en). External Links: ISSN 1476-4687, Link, Document Cited by: §I.
  • [12] S. E. Li, G. Li, J. Yu, C. Liu, B. Cheng, J. Wang, and K. Li (2018-01) Kalman filter-based tracking of moving objects using linear ultrasonic sensor array for road vehicles. Mechanical Systems and Signal Processing 98, pp. 173–189. External Links: ISSN 0888-3270, Link, Document Cited by: §I.
  • [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. External Links: Link Cited by: §III-C.
  • [14] I. Sutskever, J. Martens, G. Dahl, and G. Hinton (2013) On the Importance of Initialization and Momentum in Deep Learning. In Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, ICML13, pp. III–1139–III–1147. Note: event-place: Atlanta, GA, USA External Links: Link Cited by: §IV.
  • [15] S. Thrun (2001-10) Learning occupancy grids with forward models. In Proceedings 2001 IEEE/RSJ International Conference on Intelligent Robots and Systems. Expanding the Societal Role of Robotics in the the Next Millennium (Cat. No.01CH37180), Vol. 3, pp. 1676–1681 vol.3. External Links: Document Cited by: §V.
  • [16] VICON Motion Capture Systems. External Links: Link Cited by: §I.
  • [17] B. Yang, W. Luo, and R. Urtasun (2018) PIXOR: Real-time 3D Object Detection from Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7652–7660. Cited by: §I.