Metric Pose Estimation for Human-Machine Interaction Using Monocular Vision

The rapid growth of collaborative robotics in production requires new automation technologies that take human and machine equally into account. In this work, we describe a monocular camera-based system that detects human-machine interactions from a bird's-eye perspective. Our system predicts poses of humans and robots from a single wide-angle color image. Even though our approach works on 2D color input, we lift the majority of detections to a metric 3D space. Our system merges pose information with predefined virtual sensors to coordinate human-machine interactions. We demonstrate the advantages of our system in three use cases.



I Introduction

A central aspect of collaborative robotics is the reliable detection of human-machine interactions over large work spaces. Typical conventional sensors, such as light barriers and physical keys, do not provide the required level of perception or lead to non-intuitive operation. In addition, the integration of such sensors requires considerable amounts of time for rewiring and reprogramming [1].

Over the past years, several alternative sensor technologies have been proposed to close this gap: tactile sensors provide reliable touch gesture detection [7], whereas proximity sensors support human-robot safety [4] in the close vicinity of the robot. Recently, RGB-D sensors have been increasingly used to detect human poses [3]. These sensors provide dense depth data, but only up to a distance of a few meters. The light emitted by these sensors is easily scattered by shiny surfaces, rendering many measurements unusable.

Fig. 1: Our system coordinates human-machine interactions from a bird’s-eye perspective. We introduce a set of extensible virtual regions that act as replacements for physical sensors. 2D human/robot poses are recognized from color input images and are lifted into a metric 3D space. The geometric relation between humans, robots and virtual regions enables our system to recognize interaction patterns and trigger environmental reactions.

Our system is illustrated in Fig. 1. We propose to replace a range of conventional hardware sensors with a single monocular vision system that operates from a bird’s-eye perspective. In contrast to RGB-D systems, our approach processes pure 2D color data from a wide-angle camera. Despite the consequent loss of depth, we show that 3D metric pose information is recoverable for humans and robots using homographies. We fuse pose estimation with an extensible set of virtual sensors to determine interaction events. We demonstrate the system in three use cases involving multiple humans and robots (see footnote 1).

Footnote 1: Demonstration video

II Method

The architecture of our approach is depicted in Fig. 2. Our system first synthesizes rectilinear views [5] from a single panoramic color image. Rectilinear views preserve straight lines, which allows our system to build on machine learning models designed for such optics.
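As a rough illustration of view synthesis, the sketch below computes, for every pixel of a virtual pinhole (rectilinear) view, the location to sample in the wide-angle image. It assumes an equidistant fisheye model (r = f · θ); the source does not state which camera model is used, and the function name and all parameters are hypothetical.

```python
import numpy as np

def rectilinear_sampling_grid(width, height, focal_px, R,
                              fisheye_focal_px, fisheye_center):
    """For each pixel of a virtual pinhole view, return the (x, y) source
    location in a fisheye image. Assumes an equidistant model r = f * theta,
    which is an illustrative choice, not the model used by the authors."""
    # Pixel grid of the virtual view, centered on its principal point.
    u, v = np.meshgrid(np.arange(width) - (width - 1) / 2.0,
                       np.arange(height) - (height - 1) / 2.0)
    # Unit viewing rays of the virtual pinhole camera.
    rays = np.stack([u, v, np.full_like(u, focal_px)], axis=-1)
    rays = rays / np.linalg.norm(rays, axis=-1, keepdims=True)
    # Rotate the rays into the fisheye camera frame (R: view -> fisheye).
    rays = rays @ R.T
    # Angle to the optical axis and azimuth around it.
    theta = np.arccos(np.clip(rays[..., 2], -1.0, 1.0))
    phi = np.arctan2(rays[..., 1], rays[..., 0])
    # Equidistant projection into the fisheye image.
    r = fisheye_focal_px * theta
    cx, cy = fisheye_center
    return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=-1)
```

The resulting grid has shape (height, width, 2) and can be passed to any interpolating sampler to produce the rectilinear image. Choosing a different rotation R yields a view that looks at a different part of the work space.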

Next, we perform 2D human and robot pose estimation [2, 6] on each view (see Fig. 3a-c). These detections are then transformed into metric space by mapping pixel locations to world coordinates using plane homographies. In particular, a homography between ground and camera image plane allows us to convert pixel to metric ground coordinates. Since such a mapping is valid only for keypoints semantically close to the ground (such as feet positions), we additionally integrate statistical height measurements of an upright body model in order to map hip and shoulder keypoints. These extra points are used to predict body orientation and to stabilize localization in the presence of occlusions.
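The pixel-to-metric mapping can be sketched as follows. The example assumes a known 3x4 camera projection matrix P with the world Z axis pointing up from the ground plane; this is an assumption made for illustration, since the text only states that plane homographies are used. For a plane at height h above the ground, the homography from plane coordinates to pixels is [p1, p2, h·p3 + p4], where p1 to p4 are the columns of P. Setting h = 0 gives the ground homography for foot keypoints, while a statistical hip or shoulder height gives the mapping for those keypoints. The function names are hypothetical.

```python
import numpy as np

def plane_homography(P, height):
    """Return the 3x3 homography mapping pixels to metric (X, Y) coordinates
    on the horizontal plane Z = height, given a 3x4 projection matrix P."""
    # Plane -> image: substitute Z = height into x ~ P [X, Y, Z, 1]^T.
    H_img_from_plane = np.column_stack(
        [P[:, 0], P[:, 1], height * P[:, 2] + P[:, 3]])
    # Image -> plane is the inverse mapping.
    return np.linalg.inv(H_img_from_plane)

def pixels_to_plane(H, pixels):
    """Apply a pixel -> plane homography to an array of (x, y) pixels."""
    pixels = np.asarray(pixels, dtype=float).reshape(-1, 2)
    homog = np.hstack([pixels, np.ones((len(pixels), 1))])
    mapped = homog @ H.T
    # Dehomogenize to obtain metric plane coordinates.
    return mapped[:, :2] / mapped[:, 2:3]
```

A foot keypoint would be mapped with `plane_homography(P, 0.0)`, and a hip keypoint with `plane_homography(P, hip_height)`, where `hip_height` stands for the assumed statistical hip height of an upright person. The mapping is only as accurate as that height assumption, so it degrades for people who are bending or sitting.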

Once world coordinates are established, our system scans for events that arise from the interaction of humans with a predefined set of virtual sensors placed in ground coordinates. Each such event might then trigger one or more environmental reactions, depending on the application. Among others, our system supports the following virtual sensors: light barriers, (freeform) step-on sensor mats, proximity sensors and body orientation aware sensors.
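The listed virtual sensors reduce to simple geometric tests in ground coordinates. The following is a minimal sketch under that assumption; the function names, argument conventions, and thresholds are hypothetical and do not reflect the actual implementation.

```python
import math

def in_polygon(point, polygon):
    """Step-on mat: ray-casting test for a point inside a freeform polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges that a horizontal ray from the point crosses.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def _orient(a, b, c):
    """Signed area of the triangle (a, b, c); the sign gives the turn direction."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def crosses_barrier(prev_pos, curr_pos, barrier_a, barrier_b):
    """Light barrier: did the move prev_pos -> curr_pos cross the segment?"""
    d1 = _orient(barrier_a, barrier_b, prev_pos)
    d2 = _orient(barrier_a, barrier_b, curr_pos)
    d3 = _orient(prev_pos, curr_pos, barrier_a)
    d4 = _orient(prev_pos, curr_pos, barrier_b)
    return d1 * d2 < 0 and d3 * d4 < 0

def within_proximity(pos, center, radius):
    """Proximity sensor: is the position within a radius (meters) of center?"""
    return math.hypot(pos[0] - center[0], pos[1] - center[1]) <= radius

def is_facing(pos, heading_rad, target, max_angle_rad):
    """Orientation-aware sensor: is the body heading directed at the target?"""
    bearing = math.atan2(target[1] - pos[1], target[0] - pos[0])
    # Wrap the angular difference into [-pi, pi).
    diff = (bearing - heading_rad + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff) <= max_angle_rad
```

Each test takes metric ground positions, so the same sensor definitions can be reused regardless of where the camera is mounted. An event would be raised whenever the result of a test changes between consecutive frames.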

Fig. 2: System Overview. Our approach considers panoramic images as input (a); these are synthesized to form one or more rectilinear views (b). Deeply learned neural networks predict human and robot poses (c). A number of virtual regions raise location-aware events based on geometric relations to surrounding entities (d). These events in turn lead to application dependent environmental reactions (e).

Fig. 3: Use cases (UCs); (a.1) UC1 Entering the (green) rectangular sensor enables the robot to start; (a.2) the worker’s proximity to the robot controls its speed. (b) UC2 Regions are only sensitive to humans, but are not influenced by other objects. (c) UC3 Free-form region definition by humans.

III Demonstrations

We demonstrate the interaction potential of the AEYE system in three use cases (UCs; see footnote 1 for the video link):

UC1 focuses on human-robot interaction (see Fig. 3a). A human entering the rectangular region enables the robot to start. The robot’s speed is adjusted according to the distance to the closest worker. This contributes to creating safe operating environments for the interaction between man and machine.
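One simple way to realize distance-dependent speed control is a linear ramp between a stop distance and a full-speed distance. The sketch below illustrates this idea only; the thresholds, the function name, and the choice of a linear ramp are assumptions and are not taken from the described system.

```python
import math

def speed_factor(robot_xy, worker_positions,
                 stop_dist=0.5, full_speed_dist=2.0):
    """Return a speed scale in [0, 1] based on the closest worker.

    Distances are in meters on the ground plane. Both thresholds are
    hypothetical values chosen for illustration.
    """
    if not worker_positions:
        # Assumption: with no worker detected, speed is not reduced.
        return 1.0
    closest = min(math.dist(robot_xy, w) for w in worker_positions)
    # Linear ramp: 0 at stop_dist, 1 at full_speed_dist, clamped outside.
    ratio = (closest - stop_dist) / (full_speed_dist - stop_dist)
    return min(1.0, max(0.0, ratio))
```

With these example thresholds, a worker 1.25 m from the robot yields a factor of 0.5, and any worker closer than 0.5 m stops the robot. Using the closest worker rather than an average keeps the behavior conservative when several people are present.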

In UC2 we demonstrate several people interacting with multiple regions. The focus is on system robustness in the presence of occlusion and on the system’s ability to react to people only, avoiding false triggers caused by other objects (see Fig. 3b).

In UC3 we define virtual sensors by visual demonstration through human movements (see Fig. 3c). Teaching by demonstration significantly reduces the time required to create and program virtual regions. Note that action detection is currently not part of our system. The start and stop gestures are triggered by a wireless presenter in this use case.

IV Conclusion

We demonstrate that the challenging task of monitoring human-machine interactions on a metric scale is largely feasible using a single color camera. Combining recent deep learning results with traditional computer vision methods lifts many 2D detections to metric scale. This allows us to replace many physical sensors with virtual surrogates.


  • [1] S. Brown and A. Woods (2017) An operations management perspective on collaborative robotics. In Proceedings of the International Annual Conference of the American Society for Engineering Management, pp. 1–8. Cited by: §I.
  • [2] Z. Cao, T. Simon, S. Wei, and Y. Sheikh (2017) Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299. Cited by: §II.
  • [3] Z. Fang, J. Yuan, and N. Magnenat-Thalmann (2018) Understanding human-object interaction in RGB-D videos for human robot interaction. In Proceedings of Computer Graphics International 2018, pp. 163–167. Cited by: §I.
  • [4] M. Geiger and C. Waldschmidt (2019) 160-GHz radar proximity sensor with distributed and flexible antennas for collaborative robots. IEEE Access 7, pp. 14977–14984. Cited by: §I.
  • [5] C. Heindl, T. Ponitz, A. Pichler, and J. Scharinger (2018) Large area 3D human pose detection via stereo reconstruction in panoramic cameras. In Proceedings of the OAGM Workshop 2018, Cited by: §II.
  • [6] C. Heindl, S. Zambal, and J. Scharinger (2019) Learning to predict robot keypoints using artificially generated images. arXiv preprint arXiv:1907.01879. Cited by: §II.
  • [7] D. Silvera-Tawil, D. Rye, and M. Velonaki (2015) Artificial skin and tactile sensing for socially interactive robots: a review. Robotics and Autonomous Systems 63, pp. 230–243. Cited by: §I.