Enabling Tangible Interaction through Detection and Augmentation of Everyday Objects

by Thomas Kosch et al.
Universität München

Digital interaction with everyday objects has become popular since the proliferation of camera-based systems that detect and augment objects "just-in-time". Common systems use a vision-based approach to detect objects and display their functionalities to the user. Sensors, such as color and depth cameras, have become inexpensive and allow seamless environmental tracking in mobile as well as stationary settings. However, object detection in different contexts remains challenging, as it highly depends on environmental parameters and on the condition of the object itself. In this work, we present three tracking algorithms that we have employed in past research projects to track and recognize objects. We show how mobile and stationary augmented reality can be used to extend the functionalities of objects. We conclude by discussing how common items can provide user-defined tangible interaction beyond their regular functionality.






1 Introduction

Augmenting common items with digitized content to extend their functionalities has been the focus of past research in the domain of tangible user interfaces [9]. Here, objects are tracked by a system that displays visual cues or extends the functionality of the object itself [10]. By rotating, repositioning, or placing objects in defined positions, user-defined actions can be triggered. Thus, common items are augmented with functionalities that they do not provide by themselves.

Two modalities to display such augmented content have emerged. Smart glasses, such as the Microsoft HoloLens (www.microsoft.com/en-us/hololens, last accessed 2019-05-17), enable mobile use of augmented reality to display additional supporting content [4]. Furthermore, in-situ projection systems enable the augmentation of stationary workstations that can be used for practical exercises (see the teaser figure). While smart glasses are preferred in a mobile context, in-situ projections are suitable for stationary settings. Accordingly, mobile augmentation was preferred during practical physics exercises that required the students to move around [18], whereas industrial use cases [5] and social housing organizations [14, 15] found stationary settings more suitable. Furthermore, object augmentation provides cognitive alleviation, which has the potential to boost overall user performance and productivity [11, 12].

Both modalities use camera-based systems to recognize objects and enrich them with additional content. However, seamless object detection and augmentation poses challenges across use cases. In this work, we present the object detection strategies we employed in past research projects to enable object detection and augmentation, and we discuss the advantages and disadvantages of the different object tracking modalities. Finally, we present how user-defined tangibles can be created from everyday items by augmenting them with in-situ projections. We conclude with challenges that have to be considered when integrating ubiquitous object augmentation.

Figure 1:

Object detection using SURF. The positioned object is compared to a reference image. Feature extraction, such as provided by the SURF algorithm, shows the similarity of the images.

(a): Correctly positioned object. (b): A rotated object is not guaranteed to be detected relative to the reference image.

2 Object Tracking

To enable interaction with common items, suitable tracking systems and algorithms need to be employed. In the following, we present three object tracking strategies we have employed in past research.

2.1 SURF

The Speeded Up Robust Features (SURF) algorithm [2] recognizes points and areas of "interest" in images. Efficient implementations enable the processing of images in real time [3]. The algorithm has been used for object detection by comparing points of interest in a captured image to those of a reference image [2]. SURF can be employed with inexpensive hardware since it processes color images. However, its matching is not invariant to rotation and perspective changes of the scene, which requires objects to be in a position similar to the one the system expects (see Figure 1).
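The comparison against a reference image boils down to descriptor matching. The sketch below illustrates this step with plain numpy arrays standing in for real SURF descriptors; the function name and the 0.75 threshold of Lowe's ratio test are illustrative choices, not part of the SURF implementation itself:

```python
import numpy as np

def match_descriptors(query, reference, ratio=0.75):
    """Match feature descriptors by nearest neighbour with Lowe's ratio test.

    `reference` needs at least two descriptors for the ratio test to apply.
    """
    matches = []
    for i, q in enumerate(query):
        # Euclidean distance from this query descriptor to every reference one.
        dists = np.linalg.norm(reference - q, axis=1)
        order = np.argsort(dists)
        nearest, second = dists[order[0]], dists[order[1]]
        # Accept only unambiguous matches: the best candidate must be
        # clearly closer than the runner-up.
        if nearest < ratio * second:
            matches.append((i, int(order[0])))
    return matches
```

A high number of accepted matches between the captured and the reference image then indicates that the object is present in its expected pose.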

2.2 Depth Sensing

Figure: Infrared pattern of a Kinect v1 on a wooden plate [1].

A depth sensor, such as the Intel RealSense (www.intel.com/content/www/us/en/architecture-and-technology/realsense-overview.html, last accessed 2019-05-17) or the Kinect v2 (https://developer.microsoft.com/en-us/windows/kinect, last accessed 2019-05-17), provides a 3D representation of the objects it is pointed at. Objects are recognized by analyzing their shape. Two relevant sensing methods have emerged. The first projects an infrared pattern onto a surface (see the figure above). The depth sensor then measures how the pattern shifts from its perspective, which allows the distance to each point to be triangulated and the 3D structure of the surface to be reconstructed [1].
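The triangulation behind this pattern-based method follows the standard stereo relation Z = f · b / d: the further away a surface point is, the less the projected pattern shifts between projector and camera. A minimal sketch, assuming made-up sensor parameters (the real Kinect v1 calibration differs):

```python
import numpy as np

# Hypothetical sensor parameters for illustration only:
FOCAL_LENGTH_PX = 580.0  # focal length of the infrared camera, in pixels
BASELINE_M = 0.075       # distance between infrared projector and camera

def depth_from_disparity(disparity_px):
    """Triangulate depth from the observed shift of the projected pattern.

    A nearer surface shifts the pattern more between the projector's and
    the camera's viewpoint, so depth is inversely proportional to the
    measured disparity: Z = f * b / d.
    """
    disparity_px = np.asarray(disparity_px, dtype=float)
    return FOCAL_LENGTH_PX * BASELINE_M / disparity_px
```

With these parameters, a pattern shift of 43.5 pixels corresponds to a surface one meter away; applying the function to a whole disparity image reconstructs the depth map.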

Figure 2: Using YOLO to detect objects independent of their position. (a): Test image to evaluate a trained model. (b): Objects detected using YOLO. Blue bounding boxes denote the detected objects.

The second method uses a Time-of-Flight approach. Here, the round-trip time of an artificial light signal (i.e., infrared light) is measured between the sensor and a point on the surface. From the captured reflection, a 3D representation of the surface is created [8].
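The distance computation of the Time-of-Flight principle is simple: the emitted light travels to the surface and back, so the one-way distance is half the measured round-trip path. A minimal sketch:

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def tof_distance(round_trip_s):
    """Distance to a surface point from the round-trip time of a light pulse.

    The pulse covers the sensor-to-surface distance twice, hence the
    division by two: d = c * t / 2.
    """
    return SPEED_OF_LIGHT * round_trip_s / 2.0
```

The round-trip times involved are tiny (a surface one meter away reflects the pulse after roughly 6.7 nanoseconds), which is why practical sensors measure phase shifts of modulated light rather than timing individual pulses directly.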

Depth sensing is insensitive to lighting conditions. However, changes in perspective and rotation of objects may affect the overall detection quality. Thus, depth sensing is suitable for use cases where objects reside in stable positions.

2.3 You Only Look Once

The algorithm "You Only Look Once" (YOLO) [17] is a deep learning approach that detects objects regardless of their perspective and position (see Figure 2). It applies a single neural network to the full image, which predicts bounding boxes and the class probabilities of the features they contain. By evaluating those probabilities, the confidence of a correctly detected object is calculated. While YOLO is a robust real-time method for detecting objects regardless of their positioning and perspective, it requires an extensive training set beforehand. Furthermore, training a neural network on a large data set takes time and, depending on the use case, fast computational hardware to speed up the training process.
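Detectors of this kind emit many overlapping candidate boxes for the same object; a standard post-processing step (common to bounding-box detectors in general, not specific to our projects) is non-maximum suppression, which keeps only the highest-scoring box of each overlapping cluster. A self-contained sketch:

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box of each cluster of overlapping detections."""
    order = list(np.argsort(scores)[::-1])  # indices, best score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        # Drop every remaining box that overlaps the winner too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Applied to two near-identical boxes around one object and a third box elsewhere, the function keeps the higher-scoring duplicate and the distant box.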

3 Object Augmentation

Objects can be used as a visual cue for interaction or as an interaction device itself. In the following, we show implementations of tangible object augmentation we have conducted in the past.

3.1 Ambient Augmentation

Figure 3: Augmenting a workplace using in-situ projections. (a): A detected item selection bin is visually highlighted. (b): A projection on the working area depicts the final position of an assembly part.

After recognizing the type of object, cues can be used to implicitly guide the user through a series of actions. Figure 3 shows an augmented workspace that uses in-situ projection to guide the user through a series of assembly steps. By detecting the user's actions and the items on the workspace, in-situ projections are placed on the currently relevant bin or on the final spot for assembly. In-situ projections boost the overall performance of workers in industrial environments [7], and people with dementia and memory loss benefit from them as well [13].
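The guidance logic of such a system can be sketched as a small state machine that advances whenever the tracker reports a grasp into the correct bin. The plan, bin names, and coordinates below are invented for illustration and are not taken from the deployed systems:

```python
# Hypothetical assembly plan: each step names the bin to highlight and the
# target position (in projector pixels) where the part should be placed.
ASSEMBLY_PLAN = [
    {"bin": "screws", "target": (120, 80)},
    {"bin": "bracket", "target": (200, 40)},
]

class InSituGuide:
    """Advance through assembly steps as the worker's actions are detected."""

    def __init__(self, plan):
        self.plan = plan
        self.step = 0

    def current_cue(self):
        """Projection cue for the current step, or None when assembly is done."""
        if self.step >= len(self.plan):
            return None
        return self.plan[self.step]

    def on_pick_detected(self, bin_name):
        """Called by the object tracker when a grasp into a bin is recognised."""
        cue = self.current_cue()
        if cue is not None and bin_name == cue["bin"]:
            self.step += 1  # correct bin: move on to the next instruction
```

Grasps into the wrong bin leave the highlighted cue unchanged, which is exactly the implicit feedback the projection provides.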

3.2 User-defined Tangibles

Regular objects can be registered as user-defined tangibles that are made available for interaction [6]. For example, rotating (see Figure 4 (a)) or repositioning (see Figure 4 (b)) an object can be used to change the speaker volume.

After registering an object, a series of options is made available to the user, who can choose to interact with existing objects or register new ones. Such objects can be everyday items which do not implement any logic of their own. This transforms objects that are already around the user into user-defined tangibles offering just-in-time interaction.
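The binding between a registered object and its user-defined action can be sketched as a small registry that the tracking system notifies about observed manipulations. The class, the bottle identifier, and the volume mapping below are illustrative assumptions, not the implementation of [6]:

```python
class TangibleRegistry:
    """Map registered everyday objects to user-defined actions.

    Actions are plain callbacks; a tracking system would call `on_rotation`
    with the angle change it observed for a registered object.
    """

    def __init__(self):
        self.bindings = {}

    def register(self, object_id, action):
        self.bindings[object_id] = action

    def on_rotation(self, object_id, delta_degrees):
        action = self.bindings.get(object_id)
        if action is not None:  # unregistered objects are simply ignored
            action(delta_degrees)

# Example: a bottle acts as a volume knob (1 degree of rotation = 0.5 volume).
volume = {"level": 50.0}

def change_volume(delta_degrees):
    volume["level"] = min(100.0, max(0.0, volume["level"] + 0.5 * delta_degrees))

registry = TangibleRegistry()
registry.register("bottle", change_volume)
registry.on_rotation("bottle", 20.0)  # volume rises from 50.0 to 60.0
```

Registering a new object is then just another `register` call, which keeps the interaction vocabulary entirely user-defined.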

Figure 4: User-defined tangibles that use in-situ projections to provide feedback. (a): Rotating a bottle similar to a knob. (b): Using a pen as a slider [6].

4 Challenges and Future Work

Seamless object detection and augmentation in home and workplace settings are prone to certain challenges. In this work, we presented three strategies to detect and augment objects. However, choosing the right detection modality depends on the environment as well as on the properties of the object itself. For example, a depth sensor will struggle to detect flat objects, as they exhibit few distinctive 3D features. While a regular color camera can solve this problem, it is sensitive to the overall environmental illumination. In future work, we want to combine the definition and detection of user-defined tangibles by using an approach that combines color and depth images [16]. The combination of depth and color data provides a better approximation of the object type.
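One simple instance of such a combination is late fusion: each modality produces per-class scores, and a weighted sum decides the object type, so the weaker channel (depth for flat objects, color in poor lighting) can be down-weighted. This generic sketch is our own illustration, not the method of [16]:

```python
import numpy as np

def fuse_scores(color_scores, depth_scores, color_weight=0.5):
    """Late fusion of per-class scores from a colour and a depth classifier.

    `color_weight` is a hypothetical knob: raise it when depth is
    uninformative (e.g. flat objects), lower it in poor lighting.
    Returns the winning class index and the fused score vector.
    """
    color_scores = np.asarray(color_scores, dtype=float)
    depth_scores = np.asarray(depth_scores, dtype=float)
    fused = color_weight * color_scores + (1.0 - color_weight) * depth_scores
    return int(np.argmax(fused)), fused
```

For instance, when the depth classifier confidently votes for a different class than the color classifier, shifting the weight toward the color channel flips the fused decision accordingly.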

Furthermore, privacy and ethical considerations have to be taken into account. By using the presented camera-based approach, public and private spaces are recorded during user interaction. While users can give consent to process the collected data in private settings, public spaces and workplaces are more sensitive to privacy-related issues. In future work, we want to investigate those ethical ramifications. Ultimately, we will investigate design guidelines that explore how a camera-based approach can be conducted while minimally invading the user’s privacy.

5 Conclusion

In this work, we presented three object detection strategies that we have employed in past research projects and outlined the advantages and disadvantages we encountered with each. We showed how object detection and user-defined tangibles can be implemented to provide ambient or explicit interaction. Finally, we discussed challenges that have to be tackled before seamless object tracking can be enabled in home and work settings. Since common objects do not implement any logic, we believe that external object augmentation paves the way for ubiquitous tangible interaction at home, in public spaces, and at workplaces.


  • [1] M. R. Andersen, T. Jensen, P. Lisouski, A. K. Mortensen, M. K. Hansen, T. Gregersen, and P. Ahrendt (2012) Kinect depth sensor evaluation for computer vision applications. Aarhus University, pp. 1–37. Cited by: §2.2, §2.2.
  • [2] H. Bay, T. Tuytelaars, and L. Van Gool (2006) SURF: speeded up robust features. In Computer Vision – ECCV 2006, A. Leonardis, H. Bischof, and A. Pinz (Eds.), Berlin, Heidelberg, pp. 404–417. External Links: ISBN 978-3-540-33833-8 Cited by: §2.1.
  • [3] D. Bouris, A. Nikitakis, and I. Papaefstathiou (2010-05) Fast and efficient FPGA-based feature detection employing the SURF algorithm. In 2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 3–10. External Links: Document Cited by: §2.1.
  • [4] W. Dangelmaier, M. Fischer, J. Gausemeier, M. Grafe, C. Matysczok, and B. Mueck (2005) Virtual and augmented reality support for discrete manufacturing system simulation. Computers in Industry 56 (4), pp. 371 – 383. Note: The Digital Factory: An Instrument of the Present and the Future External Links: ISSN 0166-3615, Document, Link Cited by: §1.
  • [5] M. Funk, A. Bächler, L. Bächler, T. Kosch, T. Heidenreich, and A. Schmidt (2017) Working with augmented reality? a long-term analysis of in-situ instructions at the assembly workplace. In Proceedings of the 10th ACM International Conference on PErvasive Technologies Related to Assistive Environments, New York, NY, USA. External Links: Document Cited by: §1.
  • [6] M. Funk, O. Korn, and A. Schmidt (2014) An augmented workplace for enabling user-defined tangibles. In CHI ’14 Extended Abstracts on Human Factors in Computing Systems, CHI EA’14, New York, NY, USA, pp. 1285–1290. External Links: ISBN , Link, Document Cited by: Figure 4, §3.2.
  • [7] M. Funk, T. Kosch, and A. Schmidt (2016) Interactive worker assistance: comparing the effects of in-situ projection, head-mounted displays, tablet, and paper instructions. Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 934–939. External Links: ISBN 978-1-4503-4461-6, Link, Document Cited by: §3.1.
  • [8] B. Gokturk, H. Yalcin, and C. Bamji (2004-06) A time-of-flight depth sensor - system description, issues and solutions. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pp. 35–35. External Links: Document Cited by: §2.2.
  • [9] E. Hornecker and J. Buur (2006) Getting a grip on tangible interaction: a framework on physical space and social interaction. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’06, New York, NY, USA, pp. 437–446. External Links: ISBN 1-59593-372-7, Link, Document Cited by: §1.
  • [10] M. Kaltenbrunner and R. Bencina (2007) ReacTIVision: a computer-vision framework for table-based tangible interaction. In Proceedings of the 1st International Conference on Tangible and Embedded Interaction, TEI ’07, New York, NY, USA, pp. 69–74. External Links: ISBN 978-1-59593-619-6, Link, Document Cited by: §1.
  • [11] T. Kosch, Y. Abdelrahman, M. Funk, and A. Schmidt (2017) One size does not fit all - challenges of providing interactive worker assistance in industrial settings. Proceedings of the 2017 ACM International Joint Conference on Pervasive and Ubiquitous Computing. External Links: ISBN 978-1-4503-4461-6, Link, Document Cited by: §1.
  • [12] T. Kosch, M. Funk, A. Schmidt, and L. Chuang (2018) Identifying cognitive assistance with mobile electroencephalography: a case study with in-situ projections for manual assembly. In Proceedings of the 10th ACM SIGCHI symposium on Engineering interactive computing systems, External Links: Document Cited by: §1.
  • [13] T. Kosch, R. Kettner, M. Funk, and A. Schmidt (2016) Comparing tactile, auditory, and visual assembly error-feedback for workers with cognitive impairments. In Proceedings of the 18th international ACM SIGACCESS conference on Computers & accessibility, pp. . External Links: Document Cited by: §3.1.
  • [14] T. Kosch, K. Wennrich, D. Topp, M. Muntzinger, and A. Schmidt (2019) The digital cooking coach: using visual and auditory in-situ instructions to assist cognitively impaired during cooking. In Proceedings of the 12th ACM International Conference on PErvasive Technologies Related to Assistive Environments, New York, NY, USA. External Links: Document Cited by: §1.
  • [15] T. Kosch, P. Wozniak, E. Brady, and A. Schmidt (2018) Smart kitchens for people with cognitive impairments: a qualitative study of design requirements. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI ’18, New York, NY, USA. External Links: Document Cited by: §1.
  • [16] K. Lai, L. Bo, X. Ren, and D. Fox (2011-05) Sparse distance learning for object recognition combining rgb and depth information. In 2011 IEEE International Conference on Robotics and Automation, Vol. , pp. 4007–4013. External Links: Document, ISSN 1050-4729 Cited by: §4.
  • [17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016-06) You only look once: unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.3.
  • [18] M. P. Strzys, S. Kapp, M. Thees, P. Klein, P. Lukowicz, P. Knierim, A. Schmidt, and J. Kuhn (2018-03) Physics holo.lab learning experience: using smartglasses for augmented reality labwork to foster the concepts of heat conduction. European Journal of Physics 39 (3), pp. 035703. External Links: Document, Link Cited by: §1.