CAVE-AR: A VR User Interface to Interactively Design, Monitor, and Facilitate AR Experiences

09/14/2018 ∙ by Marco Cavallo, et al. ∙ 0

In this paper we propose CAVE-AR, a novel virtual reality (VR) system for authoring custom augmented reality (AR) experiences and interacting with participating users. We introduce an innovative technique to integrate different representations of the world, mixing geographical information, architectural features, and sensor data, allowing us to understand precisely how users are behaving within the AR experience. By taking advantage of this technique to "mix realities", our VR application provides the designer with tools to create and modify an AR application, even while other people are in the midst of using it. Our VR application further lets the designer track how users are behaving, preview what they are currently seeing, and interact with them through different channels. This enables new possibilities which range from simple debugging and testing to more complex forms of centralized task control, such as placing a virtual avatar in the AR experience to guide a user. In addition to describing details of how we create effective representations of the real world for enhanced AR experiences and our novel interaction modalities, we introduce two use cases demonstrating the potential of our approach. The first is an AR experience that enables users to discover historical information during an urban tour along the Chicago Riverwalk; the second is a novel scavenger hunt that places virtual objects within a real-world environment to facilitate solving complex multi-user puzzles. In both cases, the ability to develop and test the AR experience remotely greatly enhanced the design process and the novel interaction techniques greatly enhanced the overall user experience.




1 Introduction

There are many challenges involved in creating meaningful narratives that rely on overlaying virtual content on top of real-world objects. Currently, interfaces for facilitating these augmented reality (AR) experiences lack important functionality due both to technical challenges and to the difficulty of understanding ahead of time how users will respond to the various components that make up the experience.

A primary task in AR involves determining how best to overlay virtual content on top of real-world objects or within real-world spaces. Different approaches include identifying features on objects, keeping track of user and device locations, and incorporating additional sensor data from beacons, cellphones, or cameras to increase the accuracy of placing virtual content. The problem is exacerbated in dynamic settings containing many users within complex environments, where fully understanding the real-world environment and the location of the users within it in real time is not possible. Thus, many AR applications are event-based only: they create overlays on top of specific environmental features detected by a camera and determine only a relative pose estimate of the device, without accurately defining the position of virtual content in space. While effective in some situations, this limits the possibilities of the AR experience and the types of interactions a player can have with virtual objects. Another challenge involves registering objects to a specified real-world position, a task that greatly depends on the tracking accuracy achievable with the current technology. That is, heterogeneous devices may have different representations of virtual content and may be affected by different environmental conditions; because of this, AR applications generally do not allow interaction between multiple devices. In this paper, we describe an effective abstraction of multiple kinds of real-world data in order to enable an appropriate spatial definition of content that can be used, for example, to effectively position virtual objects within the real world and to clearly understand how users are behaving within an AR experience.

We introduce CAVE-AR, a novel system for facilitating both the design of complex AR experiences and a means for monitoring and communicating with users taking part in the experience. Both aspects of this system rely on the use of a virtual reality (VR) interface whereby a designer is able to remotely view and interact with all aspects of the AR experience, including 3D models of real-world buildings and other features, maps of the environment, live camera views, and rich sensor data from each player participating in the AR experience. Specifically, in this paper we present our implementation in a CAVE2 [cave2] immersive hybrid environment, but the technique could easily be applied to other VR systems, including portable head-mounted displays. Fundamentally, our system enables the designer to see what the player is seeing, leading to new ways to interface with the players in real-time and to monitor and modify their experience.

Below we first describe in detail the underlying technology that drives CAVE-AR, which allows us to integrate location data, feature detection, and sensor data, and which enables the VR interface for designing the AR experience remotely. We also describe how our VR interface allows the designer to move between and to integrate real-world views, augmented content, and live sensor data. Second, we describe how this technology can be used as a way to communicate with players participating in the AR experience. Specifically, we introduce new interaction techniques for generating virtual avatars (to provide advice or clues for the players) and for “debugging” the AR experience by monitoring the players as they walk through different scenes.

After describing the technological contributions and the interface design, we provide details about two different AR experiences developed using CAVE-AR that show our contribution in the context of real-world projects and that serve as initial evaluations for our approach of designing AR experiences via a VR interface. The first, Riverwalk [cavallo2016riverwalk], is part of an ongoing effort by the Chicago History Museum called “Chicago 0,0” to promote public access to archival photographs. This application shows a timeline of historical images on top of important locations alongside the river that runs through downtown Chicago. The second experience, DigitalQuest [cavallo2016digitalquest], is an AR version of the classical “scavenger hunt” game where players compete in teams in order to find virtual objects and to solve challenges associated with them and the real-world locations where they are situated.

2 Related Work

In the last decade, AR has emerged as an effective technology in many different contexts, including education, entertainment, medical, and engineering applications [azumasurvey, milgram]. Creating AR applications requires in-depth knowledge of several different technologies. Programming toolkits, such as ARToolkit [artoolkit], meant to facilitate the development of such applications, nonetheless require low-level skills that are often impractical for content developers to learn. Considering the potential of AR, effective authoring tools are needed so that developers and artists can quickly create and customize AR experiences. A basic challenge when editing AR content is the difficulty of previewing it, as most commonly the position of virtual content is related to a single device with specific characteristics. This makes it difficult to reason more generally about the AR experience for multiple users and to determine if the virtual overlay is correctly positioned onto a live video feed of a real-world scene. Many solutions to this problem involve prototyping AR experiences from within the augmented reality itself. For example, Rekimoto et al. [rekimoto] propose a method based on 2D printable binary matrix markers aimed at providing landmarks for registering information on a live camera stream. The idea of using binary markers for prototyping AR applications has been widely adopted since it provides reliable tracking in many situations. A downside to this approach is that it is necessary to modify the real-world environment through the placement of fiducial markers. Poupyrev et al. [tiles] propose an authoring interface aimed at an easy and effective spatial composition where digital objects can be arranged within a small AR environment. The designer, through the use of an HMD with a mounted camera, could see both real elements and virtual objects simultaneously and perform basic operations on them by combining multiple binary markers, used as physical controllers for the interface. Lee et al. [immersive] extend this concept of “tangible” AR interaction by adding cubic marker-based props for performing authoring tasks in indoor environments. Höllerer et al. [mars] propose instead a method for authoring outdoor AR experiences, leveraging a see-through HMD used in conjunction with a hand-held computer. As Etzold et al. [mipos] show, recent progress in smartphone technology has enabled easier and more accurate ways to interact with and position virtual content. Langlotz et al. [sketching] explore on-site modification of content through interaction via a mobile device. Alternative approaches include work by Haringer et al. [pragmatic], which extends the Microsoft PowerPoint XML protocol for defining the behavior of an AR application. Yang et al. [mobile] explore how to leverage mobile phone gestures to perform editing in small environments. However, as noted by Hampshire et al. [context], the development of an AR application is often a long and non-intuitive task which imposes significant limitations on the creative expression of the designer. Instead of using an augmented reality environment to author AR experiences, we propose a virtual reality editing environment. This allows us to forgo the use of markers entirely and to utilize a spatial representation of content through a partially modelled representation of the environment, making the development of AR applications independent of the tracking issues encountered in previous solutions.

3 Mixing Realities

CAVE-AR’s approach to designing AR experiences is based on the concept of mapping virtual world coordinates to real world locations (and vice-versa), and our system includes both a VR interface and a mobile framework that runs on a range of smartphones and tablets. Our strategy is related to the classical implementations of location-based augmented reality in which all augmented content is associated to specific geo-coordinates in the real world. As shown in Fig. 2, we define two “parallel worlds”, the real world (where the player holding a device running our mobile application is located) which is characterized by positions on Earth and concrete objects, and a virtual world, containing invisible content that is superimposed within the real world and observable only through, in our initial examples, the display on a player’s smartphone or tablet. By defining a one-to-one mapping where these two worlds overlap with an acceptable accuracy, we can define a wide range of interaction between real-world and virtual elements.

To accomplish this, we refer to the Earth’s coordinate system provided by the standard World Geodetic System [wgs84] with the aim of finding a correspondence between Earth coordinates and virtual units, mapping latitude, longitude, and height to a 3D vector in the virtual space. In our case, we define a horizontal plane to represent the ground and the axis perpendicular to that plane to represent height, increasing upwards, where height is measured from the ground at that specific position (or, e.g., relative to sea level). In order to define orientation, geographic North in our virtual world corresponds to the positive direction of one of the axes of the ground plane.

We define a fixed point on the ground plane as the origin of our coordinate system and then compute all the other points in relation to that one. The origin point in the virtual world could correspond, for example, to the initial location on Earth (latitude and longitude) of a user who is using our application. Using this expedient, we obtain an equivalence of 1 unit in the virtual world for every 1 meter in the real world.
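The mapping described above can be illustrated with a minimal sketch. The text does not state which map projection is used, so the code below assumes a simple local equirectangular approximation around the origin, which is only reasonable over short distances such as a city neighborhood. The function names and the (east, up, north) ordering of the result are hypothetical choices made for this example.

```python
import math

# WGS84 semi-major axis in meters. Treating the Earth as a sphere of this
# radius is an assumption of this sketch, not a detail taken from the text.
EARTH_RADIUS_M = 6_378_137.0


def geo_to_local(lat, lon, height, origin_lat, origin_lon):
    """Map geographic coordinates to (east, up, north) offsets in meters.

    The origin (origin_lat, origin_lon) maps to (0, height, 0), and
    1 virtual unit corresponds to 1 meter.
    """
    d_lat = math.radians(lat - origin_lat)
    d_lon = math.radians(lon - origin_lon)
    north = EARTH_RADIUS_M * d_lat
    # Meridians converge towards the poles, so scale by cos(latitude).
    east = EARTH_RADIUS_M * d_lon * math.cos(math.radians(origin_lat))
    return east, height, north


def local_to_geo(east, north, origin_lat, origin_lon):
    """Inverse mapping: offsets in meters back to (latitude, longitude)."""
    lat = origin_lat + math.degrees(north / EARTH_RADIUS_M)
    lon = origin_lon + math.degrees(
        east / (EARTH_RADIUS_M * math.cos(math.radians(origin_lat)))
    )
    return lat, lon
```

With this approximation, a point 0.001 degrees of latitude north of the origin lands roughly 111 virtual units away along the north axis, matching the 1 unit per meter equivalence.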

After mapping positions and rotations between the two worlds, we can represent our real-world player as a camera in the virtual world, assuming a position equivalent with the one of the player and a rotation equivalent with the orientation in space of the player’s device. Briefly, the camera in the virtual world moves with the player and represents the actual mobile camera of the device. By enforcing that the virtual camera has the same parameters (e.g. the field of view) as the real one, we can simply render onto the video stream provided by the mobile camera what is currently seen by the virtual camera, creating a sort of moving “window” on the virtual world that we will leverage to generate the AR effect.
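To make the role of matching camera parameters concrete, the following sketch uses a standard pinhole camera model. It derives the vertical field of view from a focal length expressed in pixels and projects a point given in the camera's own frame onto the image. The pinhole model, the function names, and the (right, up, forward) convention are assumptions of this example rather than details taken from the paper.

```python
import math


def vertical_fov_deg(focal_length_px, image_height_px):
    """Vertical field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(image_height_px / (2.0 * focal_length_px)))


def project_point(point_cam, focal_length_px, image_width_px, image_height_px):
    """Project a point in camera coordinates onto the image plane.

    point_cam is (right, up, forward) in meters relative to the camera.
    Returns pixel coordinates (u, v) with the origin at the top-left
    corner, or None when the point lies behind the camera.
    """
    right, up, forward = point_cam
    if forward <= 0.0:
        return None
    u = image_width_px / 2.0 + focal_length_px * right / forward
    # Image rows grow downwards, so "up" decreases the row index.
    v = image_height_px / 2.0 - focal_length_px * up / forward
    return u, v
```

If the virtual camera used a different focal length than the physical one, the same virtual point would project to a different pixel, and the overlay would drift away from the real-world feature it is meant to annotate.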

Figure 2: Mapping the two worlds. While a user walks in the real world, a virtual camera moves and rotates accordingly in the virtual scene, defining which elements need to be rendered on top of the live camera stream of the player’s device.

3.1 Mobile Implementation

CAVE-AR’s mobile application estimates, as accurately as possible, the position and orientation of the device in the world, and then renders the corresponding virtual content on top of the live video stream provided by the mobile camera. Unlike other AR approaches that rely solely on geolocation or on marker-based tracking, our hybrid implementation combines different types of tracking in order to achieve a reasonable estimation of the camera pose in a wide range of situations. In particular, we define the following three virtual “helper cameras”, which do not concretely exist but act as placeholders for the values computed by the same number of tracking methods:

  • The ARCamera, which represents the estimated pose of the device with respect to a fiducial. Our implementation is based on tracking natural features and computes the position and orientation of the device with respect to a pattern image detected by the device camera, leveraging image processing techniques. For this reason, the ARCamera is enabled only in presence of a fiducial that can be represented by any type of flat image present in the real world (e.g. a panel or the facade of a building). While this method would normally allow estimating only a relative pose of the camera, our approach involves the definition of the location and size of fiducials in space so as to provide an absolute positioning of the device in the real world. If these parameters are correctly set and a filtering function is applied, the ARCamera guarantees good accuracy.

  • The SensorCamera does not rely on computer vision techniques but leverages instead the sensors available on the mobile device, applying sensor fusion algorithms to estimate a full 6DOF camera pose by combining A-GPS, accelerometer, gyroscope, and compass data. Since the accuracy of the SensorCamera greatly depends on the current reliability of the sensors, it is generally well suited for open spaces with few sources of magnetic disturbances and for smoothing rotational movements once pattern-based tracking from the ARCamera has been lost. The main advantage of this helper camera is that it is always available. But despite its horizontal accuracy it is not well suited for small movements due to GPS accuracy and refresh rate.

  • The SLAMCamera implements a visual-based markerless SLAM technique aimed at creating a map of the features available in the environment at run-time, while at the same time estimating the relative pose of the device camera. In combination with the other two helper cameras, the SLAMCamera is fundamental for small movements and for situations in which environmental conditions do not allow the use of normal fiducials.

By intelligently combining the values represented by these three helper cameras and by smoothing the transitions between the activation of different tracking methods, we can always estimate the position and the rotation in space of a mobile device. Thus, we are always able to know where a user is pointing his device at and from which location in the world (within a certain accuracy threshold that depends on the current nature of the environment). Ultimately, our framework represents an abstraction for these three different types of augmented reality techniques that we are able to define within the same space.
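The text does not spell out how the three helper cameras are combined, so the following is only one plausible sketch. It assumes a fixed priority order (ARCamera, then SLAMCamera, then SensorCamera) and eases the rendered pose towards the selected estimate to smooth transitions between tracking methods. The `Pose` type, the priority order, and the blending factor are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Pose:
    position: Tuple[float, float, float]  # meters in the virtual world
    yaw_deg: float                        # heading, 0..360 degrees


# Assumed priority: most precise source first, always-available source last.
PRIORITY = ("ARCamera", "SLAMCamera", "SensorCamera")


def select_source(estimates: Dict[str, Optional[Pose]]) -> Optional[str]:
    """Return the name of the highest-priority helper camera with a pose."""
    for name in PRIORITY:
        if estimates.get(name) is not None:
            return name
    return None


def shortest_angle_delta(from_deg: float, to_deg: float) -> float:
    """Signed smallest rotation from one heading to another, in (-180, 180]."""
    return (to_deg - from_deg + 180.0) % 360.0 - 180.0


def blend(current: Pose, target: Pose, alpha: float) -> Pose:
    """Move a fraction alpha (0..1) of the way from current to target."""
    position = tuple(
        c + alpha * (t - c) for c, t in zip(current.position, target.position)
    )
    yaw = (current.yaw_deg
           + alpha * shortest_angle_delta(current.yaw_deg, target.yaw_deg)) % 360.0
    return Pose(position, yaw)
```

Calling `blend` once per frame with a small `alpha` avoids a visible jump when, for example, a fiducial is lost and the pose source switches from the ARCamera to the SensorCamera.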

3.2 Editor Implementation

The CAVE-AR editor application utilizes this “mapping two worlds” approach. This VR application recreates a virtual environment resembling the real one that we use to place digital content, and which is then available to players in the AR experience. This environment is characterized by 1-to-1 scale with the real world: this means that, with a certain approximation, a concrete object of width 2 meters needs to be represented in the virtual environment as an object of width 2 units and vice-versa. In the VR application, the real world is no longer represented by a live camera stream (as it is in the mobile application), but by digital content, as in a virtual simulation. So, in this environment, both real and virtual elements have a digital representation as 3D meshes. Ideally, the designer will have access to at least a partial representation of the real environment being augmented so that it is possible to see clearly how the virtual content is positioned in 3D space. As we explain in the use cases below, we relied on a 1-to-1 scaled, texturized 3D model of the entire city of Chicago that we used both for the Riverwalk and the Digital Quest experiences. We also discuss alternative implementations of our authoring tool that do not necessarily require 3D models of the environments.

3.3 Virtual Elements

Both our mobile application and our authoring tool involve reasoning about correspondences between virtual elements and the real world. The CAVE-AR editor allows a designer to insert virtual content in the scene so that a player using the mobile application will see that content in the location defined by the designer. We can imagine the mobile camera of the user as directly connected to a virtual camera moving around the scene built with the editor, but rendering only the virtual elements on top of the live video stream from the user's phone. In addition to the 3D representation of the environment, our editor facilitates working with virtual objects and with fiducials, which need to be accurately positioned in space according to their real-world position in order to provide higher precision for the pose estimated by the ARCamera. Virtual objects may represent any kind of content we would like to add to our AR experience; in particular, our current implementation enables the designer to insert the following elements: 2D images or videos, oriented in 3D space, that the user will be able to see only from a specific position defined by the angle of view; 3D meshes, including both static and animated models; and spatial audio, played according to the position and/or orientation of the user.

While virtual objects are perceivable from both the mobile and the editor application, virtual fiducials have a representation only in the authoring tool. Knowing the position of these flat, real-world elements allows us to determine a player’s position and enables the placement of virtual content with respect to a user or fiducial. That is, the designer is able to interactively define the features that are necessary to ensure a robust player experience. We currently support two types of virtual fiducials:

  • 2D pattern images used for visual tracking depicted in a similar way to the 2D images representing virtual content, with the difference that they will never be visible to the user and will only be used for tracking. Pattern images will be seen only by the designer and allow him or her to position and orientate them in space to improve tracking.

  • Placeholders for Bluetooth beacons, physical devices that can be put in the real world to allow an estimation of the position of the user indoors or to enable custom behaviours based on the proximity of the user to the device.

While an off-line editing of the elements characterizing the AR experience is sufficient for some applications, in other cases it may be useful instead to customize elements at run-time, enabling dynamic AR applications that change over time and adapt to user behavior. For example, we could improve at run-time the position of an object that turns out to be unreachable due to environmental constraints (traffic, construction, crowds, etc), that limit the accuracy and accessibility of mobile devices in a particular area. We can also change the position of elements when organizing collaborative, live AR events so that users will see different content based on external variables such as time, weather, and behavior of the other users.

4 Enabling New Interactions

In our CAVE2 implementation of CAVE-AR, interfacing with the editor application is performed through wireless controllers, such as an Xbox One controller or PlayStation Move controller. Tracking markers are placed on these controllers so that we can compute their position and orientation inside CAVE2 by leveraging its tracking system. We also provide natural gestural control using the PlayStation Move controller, which acts as a “wand” that allows the designer to point at and interact with the content within the AR experience.

In addition to object selection and manipulation, a fundamental operation performed through the controller is camera movement. Unlike headset-based VR applications, the camera in the scene does not rotate with the head of the user. Since the CAVE2 display provides a 320 degree field of view, the user can simply observe the rest of the scene from the lateral and back screens. However, there may be cases in which the designer would prefer not to rotate his or her head and to rotate the scene instead. The CAVE2 environment has only limited space in which to move, so controls to move the camera around the scene are also provided.

Figure 3: In the image on the left, coloured corners appear around an object to show that the controller is hovering over that particular content, which can now be selected; on the right, multiple objects have been selected at the same time and then scaled.

4.1 Placing Virtual Elements

Once the designer has inserted a virtual element inside the scene, a main task is to position it at the correct location. We provide a visual encoding to indicate when a virtual object can be selected: four light blue corners appear around the object the controller is pointing at, as shown in Fig. 3, left. Once the object is selected, the element changes color in order to make it distinguishable from the other ones, as shown in Fig. 3, right. At the same time, a window containing information about the object pops up on the CAVE2’s curved display showing the type of the object, its size in meters (height, width, depth), and its location (latitude and longitude). The user can then translate, rotate, or scale the object, clone the object, or delete the object from the scene. Multiple objects can be selected at the same time, and the above operations will be applied to all those selected. If any instances of the mobile application are running, the content modifications are immediately propagated to the players’ mobile devices.

4.2 Interacting with Players

The real-time editing of an AR experience can be very useful to adjust content and to correct design choices, but leads also to more interesting applications. In addition to seeing virtual objects and virtual fiducials, our editor lets us view the players and their data as they navigate the AR experience. We also make it possible to interact with the users participating in our mobile AR experience. This introduces many advantages for the designer, such as:

  • Knowing the position of users at run-time in order to study their behavior and modify future design decisions accordingly;

  • Correcting the placement of content if a user appears not to be able to reach it due to environmental constraints;

  • Previewing the user's augmented reality perspective directly from the editor, without the need to physically observe players while they use the mobile application;

  • Testing the AR experience before public deployment in order to more easily collect user data and to “debug” the AR solution as needed;

  • Experimenting with new ways of interaction between players and the designer or designers within the CAVE2 environment, creating a sort of “portal” interconnecting the designer and the people in the outer world.

4.2.1 Representing Players

Thanks to the information gathered through our mobile application, we are able to know in real time the geolocation, estimated horizontal accuracy, and mobile device orientation of a user. Additionally, we also know the field of view of the mobile camera as well as other statistical information about the device itself. By considering the horizontal position on the ground plane and the 3D orientation of the device, we have a total of 5 degrees of freedom. This provides us with sufficient information to represent the player within the VR editor application.

An avatar is instantiated each time a user connects to the CAVE-AR server for a particular AR experience (i.e., when starting the mobile application). The avatar is set to have human-like proportions and a default height of 170 cm. Its position in space is based on the geolocation estimated by the mobile application and is computed by using the same map projections mentioned in Section 3 (Mixing Realities). Since smoothing and filtering are already performed on the mobile application, in the editor we can simply use the precomputed values. Since we are not modifying these original values, we are in fact seeing exactly the same behavior as the player’s mobile application, which allows us to observe and resolve any inconsistencies. Currently, the avatar is characterized by two animations, standing and walking, activated depending on the variation of position over time.

It is very important to take into account the horizontal accuracy of the position of the avatar, as it is quite possible that the user is not in the exact position indicated by the avatar due to the various factors influencing the camera pose estimation. To address this problem, we define for each avatar an area which indicates the “probable” location of the user based on the known accuracy thresholds for our helper cameras. A red semi-transparent circle is shown beneath the feet of the avatar and has a diameter corresponding to the horizontal accuracy of the mobile device. This way we can expect the real position of the user to be anywhere inside the highlighted area, which gives a realistic and quantitative picture of the accuracy of the overall AR experience for that specific user.

Another important aspect of the avatar representation involves mapping the device orientation in space to the rotation of the head of the avatar. In the real world, the device of the user can be both moved and rotated in space, changing from situations where a device stays in the pocket of the player, to being held directly in front of the user’s eyes, to being held in the player’s hand as he or she navigates the real world. Since we are mostly interested in the moments in which the device is held for continuous use in front of the user’s eyes, we map the device’s orientation to the head of the avatar, rotating it correspondingly. The hardware mobile camera is represented in our application by a virtual camera placed orthogonally to the eyes of the avatar, with a field of view that matches the real one. While the rotation of the head is defined by the pose estimated by the mobile camera, the orientation along the vertical axis of the avatar itself takes into account different factors. In real life, the user may rotate the device up to a certain angle before actually rotating his or her body and moving his or her legs: since our main interest is the rotation of the head of the avatar, this is, for now, a secondary issue (especially considering we cannot guarantee a precise way to estimate the rotation of the body with respect to the head). However, in order to make the movements of the avatar realistic, we applied the unmodified estimated pose of the camera to the head, and rotated the body by applying a smoothing factor to the value estimated by the compass sensor of the mobile device: this way, the avatar first moves its head with real-time rotations and, after a certain amount of time, slowly rotates its body to match the orientation of the head.
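The delayed body rotation can be sketched with exponential smoothing of the compass heading. The paper mentions only “a smoothing factor”, so the exponential form, the time constant, and the function names below are assumptions made for illustration.

```python
import math


def shortest_angle_delta(from_deg, to_deg):
    """Signed smallest rotation from one heading to another, in degrees."""
    return (to_deg - from_deg + 180.0) % 360.0 - 180.0


def update_body_yaw(body_yaw_deg, compass_yaw_deg, dt, time_constant_s=1.0):
    """Ease the avatar's body heading towards the compass heading.

    dt is the time since the previous update in seconds. A larger
    time_constant_s (a hypothetical tuning parameter) makes the body
    lag further behind the head.
    """
    alpha = 1.0 - math.exp(-dt / time_constant_s)
    step = alpha * shortest_angle_delta(body_yaw_deg, compass_yaw_deg)
    return (body_yaw_deg + step) % 360.0
```

In this sketch the head would be set directly from the unmodified device pose on every frame, while `update_body_yaw` is called with the same frame time so that the body catches up gradually. Taking the shortest signed rotation keeps the body from spinning the long way round when the heading crosses North.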

4.2.2 Visualizing User Data

Inside the editor application it is possible to select avatars with the Wand controller in the same way this operation is performed with normal virtual elements. When an avatar is selected, a wireframe indicating the player’s camera frustum is also represented in front of the eyes of the avatar, allowing a more precise visualization of the orientation of the related user’s device. The “details window” that appears on the graphical user interface shows the geolocation of the user with latitude and longitude coordinates and some useful information about the user’s device:

  • Its model, operating system and screen resolution.

  • The field of view and resolution of the mobile camera.

  • The estimated horizontal accuracy of the device.

  • The framerate of the rendering and tracking threads running on the device processor.

  • The types of tracking currently used for determining the position and orientation of the device.

By taking into account this information, the designer can make some considerations about the technology currently used by the people running the mobile application, and if needed, make real-time decisions about whether or not to change aspects of the AR experience or to interact one-on-one with a particular player.

Figure 4: On the left, the user in the real world is seeing the augmented content through his mobile device; on the right, the designer is able to observe in real time from the editor application what that specific user is seeing. In particular, the red area below the avatar represents the estimated horizontal accuracy of the user, while the panel on the left displays some information about his mobile device. The right panel shows instead a comparison between two views: the upper one is the expected perspective of the user, while the lower one is a video stream of what he is actually seeing.

4.2.3 Assuming a Player’s Perspective

In order to provide a visual comparison between what the player sees and what we expect him or her to see, we also provide a simple form of image streaming between the mobile devices and the CAVE-AR editor application. When an avatar is selected, a window shows in two separate views both what the user is seeing according to the position of his or her avatar in the editor application and, on request, the live video stream from his or her mobile device, including the augmented content from his or her perspective. Since content is rendered on the device according to the estimated position and orientation of a user, which may be affected by accuracy errors, comparing these two views allows the designer to immediately understand if the user is perceiving the virtual content in the correct way and, if needed, to adjust the location of objects inside the scene. Having the possibility to activate the live video feed from the mobile camera makes it possible to identify external elements that are not modeled inside the virtual environment, such as weather conditions, passing people or cars, and modifications to the environment, all of which may significantly affect the AR experience. In Fig. 4 (right), it is possible to observe how the two views are located inside the GUI: the upper one renders how the user is expected to see a statue considering the estimated position of his device, while the lower one reflects how the user actually sees the augmented content.

To enable more accurate debugging of the overall AR experience, once an avatar is selected it is also possible to activate what we call “user perspective” mode: the camera of the editor application is placed at the position of the avatar and follows its movements, simulating a first-person view of the mobile application from the user’s perspective. Since CAVE2 has a cylindrical shape, representing the limited field of view of a mobile camera is not straightforward. To address this problem, we create a “mask” whose size is proportional to the field of view of the device. All content outside this mask is slightly darkened in order to highlight only what the player is currently seeing.

Another decision involves how to render the mask when a user rotates his or her device. A first option is to keep the mask fixed on the central screens of CAVE2 and rotate the surrounding scene according to the orientation of the device in space. An alternative is to leave the scene as it is and move the mask across different screens. The former solution has the advantage of leaving the point of interest fixed in front of the designer, but can create unpleasant rotations of the content rendered on all the other screens, especially in case of fast orientation changes; the latter instead requires the designer to follow the movements of the mask with his or her head, but has the advantage of not modifying the content rendered on the screens. A third possible solution is to implement the editor as an HMD application, leveraging the inherently first-person view of this technology: we briefly describe this alternative implementation in Section 6.
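
As a rough illustration of the first two options, the following Python sketch computes where the mask sits on a cylindrical display and how much the scene is rotated. It is not the CAVE-AR implementation: the names `layout_mask`, `is_darkened` and `MaskLayout` are hypothetical, angles are measured in degrees around the cylinder with 0 in front of the designer, and the 320-degree span is used only as a default value.

```python
from dataclasses import dataclass


@dataclass
class MaskLayout:
    scene_yaw_deg: float    # rotation applied to the virtual scene
    mask_center_deg: float  # where the mask is centered on the cylinder
    mask_width_deg: float   # angular width of the region left undarkened


def wrap_deg(angle: float) -> float:
    """Wrap an angle to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def layout_mask(device_yaw_deg: float, device_hfov_deg: float,
                fixed_mask: bool = True,
                display_span_deg: float = 320.0) -> MaskLayout:
    """Place the mask for a device with the given yaw and horizontal FOV.

    Assumption: the mask width equals the device's horizontal field of view,
    which is one way to make it "proportional" to that field of view.
    """
    width = min(device_hfov_deg, display_span_deg)
    yaw = wrap_deg(device_yaw_deg)
    if fixed_mask:
        # Option 1: the mask stays in front of the designer and the scene
        # is counter-rotated by the device yaw.
        return MaskLayout(wrap_deg(-yaw), 0.0, width)
    # Option 2: the scene stays still and the mask moves across the screens,
    # clamped so that it never leaves the physical display.
    limit = display_span_deg / 2.0 - width / 2.0
    center = max(-limit, min(limit, yaw))
    return MaskLayout(0.0, center, width)


def is_darkened(display_angle_deg: float, layout: MaskLayout) -> bool:
    """Return True if content at this display angle lies outside the mask."""
    offset = abs(wrap_deg(display_angle_deg - layout.mask_center_deg))
    return offset > layout.mask_width_deg / 2.0
```

With `fixed_mask=True` only `scene_yaw_deg` changes as the device turns, which corresponds to the rotating-scene option; with `fixed_mask=False` only `mask_center_deg` changes, which corresponds to the moving-mask option.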

4.2.4 Enabling Communication Channels

In addition to previewing what a user is able to see, we also consider the possibility for the people inside CAVE2 to communicate with the players in the real world and vice versa. From the designer’s perspective, it is possible to open an audio channel to a specific user by selecting a button on the detail window of the respective avatar: this connects the microphone of the Kinect [kinect] device inside CAVE2 to the speakers of the mobile phone, and the microphone of the phone to the CAVE2 sound system. This function is particularly useful for assigning tasks and coordinating users during an AR experience, or simply for giving them advice or further information. At the same time, it is important for each user in the outside world to always have a way to contact the designer in case of problems. Since opening too many audio channels at the same time could become problematic, we implemented a “request of communication” from the mobile device: the user can request to open a communication channel with CAVE2, in which case an overhead notification icon appears on his or her avatar and the designer decides whether to accept it. Alternatively, the user can send a message, which results in a similar notification with text displayed on top of the avatar.
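
The request-based flow described above can be sketched as a small state machine. This is only an illustrative model under stated assumptions: the class and method names (`CommunicationDesk`, `request_audio`, `decide`, `send_text`) are hypothetical, and the cap on simultaneously open audio channels is an assumed policy rather than one stated for CAVE-AR.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Avatar:
    user_id: str
    notification: Optional[str] = None  # icon label or text shown overhead


class CommunicationDesk:
    """Designer-side bookkeeping of audio requests and text messages."""

    def __init__(self, max_open_channels: int = 1) -> None:
        # Assumed policy: limit how many audio channels are open at once.
        self.max_open_channels = max_open_channels
        self.avatars: Dict[str, Avatar] = {}
        self.pending: List[str] = []
        self.open_channels: List[str] = []

    def connect(self, user_id: str) -> None:
        self.avatars[user_id] = Avatar(user_id)

    def request_audio(self, user_id: str) -> None:
        """Called when a mobile user asks to talk to the designer."""
        if user_id in self.pending or user_id in self.open_channels:
            return
        self.pending.append(user_id)
        self.avatars[user_id].notification = "AUDIO_REQUEST"

    def send_text(self, user_id: str, text: str) -> None:
        """Show a text message on top of the user's avatar."""
        self.avatars[user_id].notification = text

    def decide(self, user_id: str, accept: bool) -> bool:
        """Designer accepts or declines; returns True if a channel opened."""
        if user_id not in self.pending:
            return False
        if accept and len(self.open_channels) >= self.max_open_channels:
            return False  # request stays pending until a channel is free
        self.pending.remove(user_id)
        self.avatars[user_id].notification = None
        if accept:
            self.open_channels.append(user_id)
        return accept
```

Keeping the decision on the designer side means that a burst of requests only produces notifications on avatars and never opens audio streams on its own.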

Given the number of user-specific features, we found it useful to add a panel listing the users currently connected through the mobile application, along with their visualization on a top-down map of the environment: this way, people inside CAVE2 can easily track in a single view where individual users are and whether they are encountering problems while running the mobile application. By clicking on the name assigned to a user, the designer can move the camera of the editor directly to where the respective avatar is located, without having to navigate manually to that position. We note that the resolution and size of the displays in CAVE2 play a significant role in the implementation of the graphical interface, since they make it possible to cover over 320 degrees and to compare different views.

5 Use Cases: Two Example AR Experiences

We have used the CAVE-AR authoring tool to develop and edit two different AR applications, Riverwalk and DigitalQuest, both taking place in the city of Chicago. Though both AR experiences are based on the same mapping technique and helper cameras we introduced above, these two experiences have very different characteristics:

Riverwalk presents historical photographs of Chicago in an urban environment that is characterized by low sensor accuracy and thus requires the use of fiducials. The virtual content consists of 2D transparencies that need to be located in 3D space so that they overlap with particular views of the city. DigitalQuest, on the other hand, is a collaborative video game intended for organizing events, providing a digital version of the classic scavenger hunt. The virtual content associated with this application is mostly 3D content that does not need as precise a registration in space, but requires the definition of custom behaviors that depend on different in-game variables. Unlike Riverwalk, DigitalQuest is mostly intended for outdoor open spaces where mobile sensors are more accurate.

5.1 The Chicago 0,0 Riverwalk AR Experience

Many marker-less approaches to the presentation of 2D content through augmented reality have relied on two-dimensional correspondences between the image to be displayed and the corresponding pattern image to be tracked: when a specific pattern is detected by the mobile camera, the virtual content is simply rendered on top of it, without a notion of space. However, this approach has many limitations: for instance, it is not always possible to find a fiducial onto which to overlay the 2D content, which ultimately needs to be positioned in 3D space. This is the case for Chicago 0,0, where historical photos of the city need to be placed on top of buildings that sometimes no longer exist, or on top of elements that are too far from the user to be used as a pattern for tracking. Thanks to our approach, we can define the position and orientation of content in space independently of the tracking method used by the mobile application: each historical image of downtown Chicago is simply placed through the editor in the location from which it needs to be viewed. A simple comparison between the results of our approach and the usual marker-less one is shown in Fig. 5.

Overlaying 2D transparencies upon existing views of the city requires precisely matching the augmentation with elements of the environment, creating correspondences that are valid only from a particular perspective, which the user needs to assume in order to see the content correctly. While this would generally require several onsite tests to verify the actual overlap of content, the previewing feature of our authoring tool made it possible to see how each overlay would appear on the screen of a user running the Chicago 0,0 mobile application. This allowed us to design the whole application remotely, requiring us to build the executable only once for the final adjustments and thus saving a considerable amount of time.

Figure 5: The figure on the left shows how Chicago 0,0 would appear in Unity Editor if designed with the Vuforia image targets approach: virtual content does not have a notion of space and consists only of photos that need to be overlapped on their corresponding image targets, cluttering the view and making the work of the designer very difficult. On the right, a simple representation of how 2D virtual content is organized in 3D space thanks to our approach.

Since our mobile framework makes it possible for the user to visualize content precisely even after a tracking image has been lost, it has been possible for us to define a set of fiducials to be used in the urban environment to compensate for the inaccuracy of sensors. In many situations, the distance of virtual content from the position of the user and the lack of other convenient trackable features led us to use the facades of some buildings as fiducials, allowing our application to estimate the position of users even without a GPS signal. To achieve good accuracy with our method, it is fundamental to specify the position of the fiducials in space and their size. While normally we would need to find the geocoordinates of the buildings and estimate the height and width of their facades, our editor allows us to simply insert the fiducial inside the scene and scale it according to the dimensions of the 3D representation of a specific building: since the scale is 1-to-1, the position, orientation and size of the fiducial are automatically set by the application, without requiring the designer to calculate the parameters manually. Additionally, the feature for sticking elements to environment surfaces is very helpful for quickly aligning the fiducials to the buildings.
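
To make concrete why a 1-to-1 scene removes the manual calculation, the sketch below derives a fiducial's geocoordinates and physical size directly from its placement in a metric scene. It is a minimal illustration, not the mapping technique of CAVE-AR: it assumes a flat local tangent plane around a georeferenced scene origin (reasonable over a few city blocks), and `Fiducial`, `scene_to_geo` and `place_fiducial` are hypothetical names.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, spherical approximation


@dataclass
class Fiducial:
    lat_deg: float
    lon_deg: float
    heading_deg: float  # direction the facade faces, clockwise from north
    width_m: float
    height_m: float


def scene_to_geo(origin_lat_deg: float, origin_lon_deg: float,
                 east_m: float, north_m: float):
    """Convert metric scene offsets from the origin to latitude/longitude.

    Uses a local tangent-plane approximation, so it is only meant for
    offsets that are small compared with the Earth's radius.
    """
    lat = origin_lat_deg + math.degrees(north_m / EARTH_RADIUS_M)
    lon_scale = EARTH_RADIUS_M * math.cos(math.radians(origin_lat_deg))
    lon = origin_lon_deg + math.degrees(east_m / lon_scale)
    return lat, lon


def place_fiducial(origin_lat_deg: float, origin_lon_deg: float,
                   east_m: float, north_m: float, heading_deg: float,
                   width_m: float, height_m: float) -> Fiducial:
    """Build a fiducial record from where it was dropped in the scene.

    Because the scene is at 1:1 scale, the width and height measured on the
    3D building model are already the real-world size in meters.
    """
    lat, lon = scene_to_geo(origin_lat_deg, origin_lon_deg, east_m, north_m)
    return Fiducial(lat, lon, heading_deg % 360.0, width_m, height_m)
```

For example, `place_fiducial(41.88, -87.63, east_m=120.0, north_m=40.0, heading_deg=90.0, width_m=30.0, height_m=45.0)` returns a record in which the designer never typed a coordinate by hand; all numeric values here are made up for illustration.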

Figure 6: The image above shows a sample screenshot from the DigitalQuest application: a virtual object has appeared in front of a public sculpture, but the user still needs to get closer to activate its challenge. The bar at the upper-left corner indicates the score of the user, while the buttons at the upper-right corner respectively show available riddles and enable the map view.

5.2 The DigitalQuest AR Experience

Unlike Riverwalk, DigitalQuest is an application aimed at fast-paced AR experiences in which teams of users compete to solve the greatest number of challenges, which are connected to virtual objects located in the real world: when a user reaches a virtual object, an animation is displayed and a riddle is presented, optionally with additional multimedia content; if the puzzle is solved and the answer is entered correctly, the user gains a certain number of points and unlocks one or more new challenges. The definition of all these custom behaviours, ranging from the events connected to a single virtual object to the definition of which challenges unlock which others, is made possible by extending the standard protocol defined by our authoring tool. The designer can in fact associate with each virtual object a set of parameters that the mobile application interprets in order to apply the corresponding game functionality.
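
One way to picture such per-object parameters is as a small record per challenge plus a rule for which challenges each one unlocks. The sketch below is purely illustrative: the field names (`riddle`, `answer`, `points`, `unlocks`) and the `Quest` class are hypothetical and do not reproduce the actual protocol of the authoring tool.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Challenge:
    object_id: str   # virtual object the challenge is attached to
    riddle: str
    answer: str
    points: int
    unlocks: List[str] = field(default_factory=list)  # ids unlocked on success


class Quest:
    """Client-side interpretation of the per-object parameters."""

    def __init__(self, challenges: Iterable[Challenge],
                 initially_unlocked: Iterable[str]) -> None:
        self.challenges: Dict[str, Challenge] = {
            c.object_id: c for c in challenges
        }
        self.unlocked: Set[str] = set(initially_unlocked)
        self.solved: Set[str] = set()
        self.score = 0

    def submit(self, object_id: str, answer: str) -> bool:
        """Check an answer; on success add points and unlock new challenges."""
        challenge = self.challenges[object_id]
        if object_id not in self.unlocked or object_id in self.solved:
            return False
        # Assumed matching rule: ignore case and surrounding whitespace.
        if answer.strip().lower() != challenge.answer.strip().lower():
            return False
        self.solved.add(object_id)
        self.score += challenge.points
        self.unlocked.update(challenge.unlocks)
        return True
```

Because the unlock relation is plain data attached to each object, changing the order of a hunt amounts to editing parameters in the editor rather than rebuilding the mobile application.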

Since DigitalQuest does not require very accurate positioning of virtual objects and is mostly used in open spaces, the mobile application can often rely on sensors and dispense with fiducials when possible. Though in certain cases it would be enough to manually enter the desired geocoordinates for an object, our authoring tool still offers the designer more precise positioning, making it possible to preview how the object will look in a specific location and, if needed, to find more suitable nearby places for it.

As with Riverwalk, our authoring tool has been fundamental for remotely designing the AR experiences related to the DigitalQuest application; in particular, it enabled previewing the whole application from the perspective of the users, by mimicking the path they follow to solve the individual challenges. In addition to previewing, we also leveraged the real-time functionality of our editor to monitor the behavior of users during a Quest organized on the campus of the University of Illinois at Chicago: observing the time users spent solving certain riddles and finding particular objects raised many unexpected considerations that will be very useful for improving object placement and riddle difficulty. On top of this, the editor was also used in a few cases to alter the position of content at run-time. For instance, due to unexpected inaccuracy of a user’s device, an object that was expected to appear on the edge of a road was instead rendered in the middle of the street, made inaccessible by passing cars (Fig. 7). Once we moved the object back towards the user, he was able to reach it, and the users who came after him did not encounter the same problem. Similarly, an object that had been located in a narrow passage appeared inside a building, making it impossible to reach; moving the object to a nearby open position seemed to solve the problem. In another case, an object was deliberately moved away from a crowd of users who were trying to solve the same challenge at the same time: some of them still followed the object and continued solving the riddle, while the others dispersed and switched to a different nearby challenge.

Figure 7: The above picture shows how the horizontal inaccuracy, represented by the red area under the avatar, can significantly affect the positioning of virtual content on the mobile device. In this example, the object is rendered in the middle of the street, made inaccessible by passing cars. Thanks to our editor, the designer can detect and correct these kinds of issues in real time.

6 Future Work

Our ongoing research is focused both on producing new, more quantitative evaluations of our use cases and on exploring new features that our approach makes possible. We plan to further explore new creative AR experiences through the CAVE-AR editor, with particular interest in those that involve coordinating users and dynamically assigning tasks directly from inside CAVE2. At the same time, we are exploring the possibility of implementing object occlusion without the use of depth maps: since we already have a partial reconstruction of the environment, we can use it to render content more realistically, showing only the parts that should be visible to the user and hiding or blurring those that are behind walls, for example. Another important feature we are currently testing is a further step towards giving the designer more influence during the execution of an AR experience. Thanks to the tracking system offered by CAVE2, we are able to track the movements of the designer’s joints and render the designer as an avatar inside the virtual scene, placed at a desired position. This feature would enable many other interaction possibilities based on full-body communication between the designer and the users, who would see the designer represented on their mobile phones in front of them, giving advice or indications on how to proceed. Such a form of teleportation would be useful, for instance, for better coordinating tasks or, in the Riverwalk example, for guiding users and providing them with information when they get lost. Similarly, it would be possible to introduce avatars aimed at helping users, which are indirectly controlled by the designer and possibly driven by some form of AI.
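
The occlusion idea mentioned above can be illustrated with a simple visibility test against the reconstructed geometry. The following sketch is an assumption-laden simplification, not a description of the planned implementation: buildings are approximated by axis-aligned boxes, only the center of each virtual object is tested, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

Vec3 = Sequence[float]
Box = Tuple[Vec3, Vec3]  # (min corner, max corner) of an axis-aligned box


def segment_hits_box(start: Vec3, end: Vec3,
                     box_min: Vec3, box_max: Vec3) -> bool:
    """Slab test: does the segment from start to end intersect the box?"""
    t_enter, t_exit = 0.0, 1.0
    for s, e, lo, hi in zip(start, end, box_min, box_max):
        d = e - s
        if abs(d) < 1e-12:
            # Segment is parallel to this pair of faces.
            if s < lo or s > hi:
                return False
            continue
        t0, t1 = (lo - s) / d, (hi - s) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_enter = max(t_enter, t0)
        t_exit = min(t_exit, t1)
        if t_enter > t_exit:
            return False
    return True


def is_occluded(camera: Vec3, obj: Vec3, buildings: Iterable[Box]) -> bool:
    """Return True if any building box blocks the line of sight to obj."""
    return any(segment_hits_box(camera, obj, lo, hi) for lo, hi in buildings)
```

A renderer could use such a test to hide or blur an object whose line of sight from the estimated device position crosses a reconstructed building; per-pixel occlusion would instead require rendering the reconstruction into the depth buffer.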

Since not everybody has access to a CAVE2 on which to run the editor application, we have recently worked on some alternative versions of our authoring tool, which aim to maintain the same main features with fewer requirements. Among them, we can currently list the following alternative implementations:

  • Head mounted display (HMD) implementation: this implementation uses the same virtual scene that we presented for CAVE2, but relies on slightly different user interfaces. The designer is completely immersed inside the 3D representation of the real world, to which virtual objects are added, making it easier to recreate the perspective of a user from a first-person view. However, the need to stand or sit in a fixed position for potentially prolonged periods sometimes leads to fatigue and eye strain due to extended use of the HMD.

  • Unity Editor implementation: this is probably the simplest and cheapest implementation. It leverages Google Maps [googlemaps] and can be used directly inside Unity Editor, without requiring a complete 3D reconstruction of the environment. In this implementation we defined two different views: a User mode, a 1:1 scaled simulation where we can preview what the user would be able to see from a particular perspective, and a Map mode, a 1:100 scaled map representation with a 3D perspective top view of the previous mode.

  • Mobile implementation: this implementation is complementary to the Unity Editor one and makes it possible to adjust the positioning of content by going onsite and testing overlays directly on top of a live camera feed.

  • Web-based implementation: our idea here is to leverage the huge imagery dataset provided by Google Street View [streetview] in order to build an editor that allows the placement of virtual content in 3D inside widely available spherical panoramas, combined with a partial reconstruction of the environment based on depth maps.

7 Conclusion

In this paper we have presented how our novel CAVE-AR authoring tool can be effectively used to create and edit augmented reality experiences of different natures, leveraging a location-based approach that abstracts various flavors of AR and aims at partially reconstructing a virtual copy of the real world. With our method it is possible to bridge the gap between the two worlds more easily, enabling precise positioning of content in space and the possibility to preview what users will experience on their mobile devices. On top of this, the real-time editing features offered by our editor allow the designer to create dynamic AR experiences and at the same time to visualize the current behavior of users, represented as avatars that move according to the position and orientation of each user’s mobile device. The graphical interface enables the visualization of data related to the current instance of the application running on the device and makes it possible to compare what the user is currently seeing with what he or she is supposed to see, creating an opportunity to debug the overall AR experience, correct it in real time, and improve its design. Finally, the introduction of various ways of interacting with individual users enables the exploration of several forms of centralized task control, where the designer can direct independent or collaborative tasks or provide advice and assistance to the users.