Semantic-Aware Label Placement for Augmented Reality in Street View

by Jianqing Jia, et al.

In an augmented reality (AR) application, placing labels so that they are clear and readable without occluding critical information from the real world can be a challenging problem. This paper introduces a label placement technique for AR used in street view scenarios. We propose a semantic-aware, task-specific label placement method that identifies potentially important image regions through a novel feature map, which we refer to as the guidance map. Given an input image, its saliency information, semantic information and the task-specific importance prior are integrated into the guidance map for our labeling task. To learn the task prior, we created a label placement dataset that records users' labeling preferences, and we also use it for evaluation. Our solution encodes the constraints for placing labels in an optimization problem to obtain the final label layout, so that labels are placed in appropriate positions and are less likely to overlay important real-world objects in street view AR scenarios. The experimental validation shows the benefits of our method over previous solutions in AR street view navigation and similar applications.





1 Introduction

Augmented Reality (AR) technology enhances the physical world with digital information by overlaying visual information, for example in the form of labels ref1. Well-placed informative labels can provide accurate instructions for important objects such as landmarks and road signs. However, placing labels in the user’s view can be challenging, especially in dense scenes. Labels should not overlap one another or occlude real-world text and scene features. The readability of the text annotations themselves depends strongly on the background color and texture ref2 ; ref3. In dynamic computer-generated label layouts, labels should satisfy readability, unambiguity, aesthetics and frame coherence, requirements that are especially challenging in AR scenes ref4.

In common AR applications, computer-generated labels are usually registered to the objects’ geographical locations, which are typically given as points of interest (POI) with their respective GPS positions. Precise models of the environment are difficult to obtain, especially since they would have to include dynamic objects, and relying on them is problematic for the following reasons. First, such a model usually cannot capture the fine details that determine the background clutter in images. Second, a general model may not include all moving objects or be able to track them. For example, a driver ignoring a pedestrian may cause a traffic accident. Finally, inaccurate registration from sensor-based tracking may prevent productive use of the additional scene knowledge. To obtain an appropriate layout in all situations, an image-driven label placement method for AR is required.

Prior work on image-based label placement generally used a visual saliency map to highlight the prominent regions of an image that attract human attention. However, these methods often suffer from the following problems:

  • They ignore the semantic information in the user’s view, which can be very useful for understanding which regions interest humans.

  • Previous works are task-unaware, and make little use of the user’s understanding of the specific scene, interests and preferences. A saliency algorithm by itself has limitations with regard to label placement. Specifically, the aim of image saliency detection is to highlight visually salient regions in an image, whereas the most salient region may not always correspond to the most important region for a specific scenario such as driving.

  • To the best of our knowledge, there are no standard benchmarks for quantitatively evaluating label placement methods.

  • Previously used saliency models are typically outdated: e.g., the saliency detection method proposed by Achanta ref36 is the one most used in the label placement area, leaving many recent advances in saliency analysis WangLFSL19arXiv unexplored.

Keeping these limitations in mind, we propose an image-driven, semantic-aware label placement approach that produces an appropriate layout and improves the visual quality of annotated AR environments in street view scenarios. Unlike prior works, which mostly use the saliency map to highlight important regions for the label layout task, we introduce a new feature map called the guidance map to characterize the important regions; it can be regarded as a task-specific importance map. The guidance map consists of three parts, corresponding respectively to the saliency information, the semantic information, and the task-specific importance prior, which is automatically learned from the dataset. For the saliency information, we use a state-of-the-art deep-learning based saliency detection method instead of the traditional saliency detection algorithms, such as IG ref36, used by previous methods. For the semantic information, we use the deep-learning based semantic segmentation algorithm proposed by Chen ref51. For the task-specific importance prior, we created a manual label placement dataset and automatically learn the prior from it, which provides the task-specific adjustment to our system. The dataset also serves as a quantitative evaluation benchmark. Finally, the proposed evaluation metrics enable a quantitative comparison with state-of-the-art label placement methods.

In this work, we consider only external labeling, not internal labeling. Our labels are placed so that interference with task-specific important real-world information is reduced. Although we mainly focus on street view in this paper, our task-specific label placement framework is not restricted to applications like AR street view navigation; it can also be applied to other AR applications that lack scene knowledge.

This paper makes the following contributions:

  1. We introduce a new feature map called Guidance Map. In addition to using saliency information of the image, we add the semantic information and task-specific importance prior into the label placement optimization, to make the Guidance Map more aligned to the task specific important regions.

  2. We manually collect a label placement dataset. In contrast to earlier task-unaware saliency detection, we use the dataset to learn task-specific importance priors. Moreover, it serves as a quantitative evaluation benchmark.

  3. We integrate the latest state-of-the-art saliency model to further improve the label placement performance.

  4. We define the evaluation metrics in the experiment to quantitatively evaluate the label placement results.

The rest of the paper is organized as follows. In Section 2, we cover related work in the label placement field, especially image-based layout techniques. In Section 3, we introduce our semantic-aware task-specific label placement method and the details of data collection. Experimental results in Section 4 demonstrate the effectiveness of the proposed method. Finally, we conclude the paper and describe future work in Section 5.

2 Related Work

Annotation placement has been discussed in augmented reality view management (ARVM) ref27 that aims to determine positions and rendering styles of visual annotations based on the camera poses and the 3D scene information for augmented views. Previous ARVM approaches can be grouped into two categories: geometry-based ones and image-based ones.

In the geometry-based approaches, camera poses and 3D scene information are used to determine the annotation placement. Bell et al. ref6 proposed to manage the view in interactive 3D user interfaces by solving the point-feature annotation placement problem using greedy algorithms. Cmolik and Bittner ref8 place the annotations for 3D objects by combining greedy algorithms with fuzzy logic. Tenmoku et al. ref9 propose a view management method to emphasize objects viewed by end users, leveraging 3D models of the real scene to locate annotations. Makita et al. ref10 overlay annotations of objects such as pedestrians for wearable AR, making use of the positions and shapes of objects to compute the positions of annotations via minimization. Zhang et al. ref11 propose an annotating system for augmenting tourist videos by registering videos to 3D models and the label placement is achieved using a dynamic programming algorithm.

More recent work by Iwai et al. ref12 proposed an annotation arrangement method for projection-based AR applications. Each annotation is superimposed directly onto a nonplanar and textured surface based on positions by minimizing an energy function using a genetic algorithm. Tatzgern et al. ref13 cluster similar items and select a single representation for each cluster to reduce the annotation complexity of crowded objects. In a different work, Tatzgern et al. ref14 apply 3D geometric constraints to manage the annotation placement in the 3D object space and attain temporal coherence. Kishishita et al. perform experiments on two view management methods for a wide field of view augmented reality display.

Most prior works under the geometry-based approach formulate the view management problem as an optimization problem and propose different algorithms to obtain annotation positions in each frame. Some researchers focus on applying geometry-based approaches to display visual information for head-mounted display (HMD). For example, Lauber and Butz ref19 propose a layout management technique for HMD in cars. The annotations are rearranged based on the driver’s head rotations to avoid the label occlusions in the driver’s view. Orlosky et al. ref52 evaluate user tendencies through a see-through HMD to improve view management algorithms. A user tendency is found to arrange the label locations near the center of the viewing field.

A number of image-driven attempts have been made to improve label placement in augmented reality. These methods often use a weighted linear combination of factors that affect visibility ref7 ; ref10; however, how to find the optimum weights has not been shown in detail. Leykin et al. ref15 used a pattern recognition method to automatically determine the readability of annotations over textured backgrounds. However, the method does not include an explicit algorithm for choosing the appropriate position for the labels. Rosten et al. ref16 identify unimportant regions of an image using FAST features. The positions of annotations are decided by considering which image features would be occluded when the labels are placed in particular positions. However, the technique is demonstrated with only a few labels and for indoor scenes. Tanaka et al. ref17 proposed a color-based viewability estimation that uses the averages of the RGB components, the S component in HSV color space, and the Y component in YCbCr color space. Grasset et al. ref18 proposed a framework that combines visual saliency with edge analysis to identify image regions suitable for label placement. They used the algorithm proposed by Achanta et al. ref36 for computing the saliency maps. We think that the labeling results can be improved, especially for scenes like automated driving. Fujinami et al. ref56 ; ref57 proposed a view management method for spatial AR called VisLP, which places labels based on an estimation of visibility. It employs machine-learning techniques to estimate a visibility that reflects both the human’s subjective mental workload in reading information and objective measures of reading correctness in various projection conditions. Li et al. ref58 ; ref59 presented empirical results from experiments on the user’s visual perception of AR labeling layouts, reflecting the subjective preferences for different factors influencing the labeling result.

Unlike previous work, the approach presented in this paper considers not only the saliency information, but also the semantic information and the user labeling preferences in the label placement system. We also differ from previous task-unaware approaches by introducing a task-specific labeling framework and automatically learning the importance prior from the manually labeled dataset.

3 Our Proposed Method

3.1 Overview

In this section, we present our algorithm for semantic-aware task-specific label placement. Given a label, it needs to be placed in an appropriate position so as to avoid overlapping with other labels and salient regions from the user’s view. A fundamental difference between previous works and our method is the fact that we consider the semantic information and specific task in addition to the classic model. These factors are able to contribute to the label placement system. We introduce a new feature map called guidance map to highlight the important regions in the user’s view.

Figure 1: Flowchart of the proposed algorithm

Figure 1 shows the processing flow of our method. During the training phase, we collect a Manual Label Placement dataset and generate the task-specific importance prior from the semantic information and the manual labels. The importance prior contributes to the saliency detection to get the guidance maps of the original images. At the testing phase, the label placement starts when the new object is detected in the field of projection corresponding to the preset labels or a new label placement is requested by the application. For an input image and preset label information, we first obtain the saliency map, semantic map and edge map, all from the original images. The input label information includes the POI position and the label size. Then we analyze the saliency map and semantic map with the task-specific importance prior learned from the training phase to generate the Guidance Map of the image. Finally, we add the guidance map with the label information into the optimization layout solver to output the user’s view image with the labels in their appropriate positions.

3.2 Creating the manual label placement dataset

To the best of our knowledge, there is no standard dataset for label placement in AR. Since we use AR label placement in street view scenarios as the motivating application, we randomly select images from the Cityscapes Dataset ref46 to simulate the scenarios in an AR street view navigation application. We chose the task of AR street view navigation because urban scene images are widely used in the analysis of autonomous driving technology. Afterwards, we collect manual label information through an experiment to generate the Manual Label Placement dataset.

Data selection and scene simulation In AR-based navigation, labels are used in the users’ view to indicate the nearby restaurants, banks or other locations of interest. When collecting the dataset, the experiment participants are told to imagine the following task scenario: we are driving a car or walking on the road, and want to see additional information about the surrounding destinations of interest in an on-demand fashion. The AR street view navigation system connects the POI to the corresponding label with the help of a leader line, clearly associating the target location to the augmented overlaid information.

We randomly choose 300 pictures from the Cityscapes Dataset ref46 (180 for training and 120 for testing). The street scenes in this dataset are perfectly suited for the purpose of simulating a driving scenario. Moreover, the semantic annotation is very accurate such that the semantic information can be used in the ablation study.

We collect the manual placement data by simulating an AR street view navigation scenario to make it look as if the participants are sitting in a car and using equipment like AR-HUD. One of the most popular AR-HUD products produced by Continental inserts full-color graphics into the real road view in an approximately 130 cm-wide and over 60 cm-high section of the driver’s field of vision at a distance of 2.5 meters. The basis is provided by the digital micro-mirror device (DMD) technology, which is also used in digital cinema projectors, as shown in Figure 2.

Figure 2: Schematic diagram for AR navigation display in AR-HUD
Figure 3: Illustration for Manual Label Placement dataset collecting. Participants view the image with 8 labels, and use a mouse to drag the labels onto the appropriate place, just like the right image.

Every participant is positioned in front of the projection screen at a distance of 2.5 meters. The valid section of the field of view is 120cm wide and 60cm high. The participants are asked to use a mouse to place labels onto the position where they deem most appropriate.

Label configuration We need to set the POI point and its corresponding label information before the experiment. We first consider the number of target objects and related labels in the same street view image. If the number of labels in the user’s view is low (1 or 2 labels), the layout method can easily avoid the important regions in the user’s view because candidate positions are abundant. In this case, the labels can be placed almost arbitrarily. If the number of labels in the image is high (more than 10), the label placement task can prove daunting even for humans, as it is very difficult to decide on an appropriate label layout. Therefore, we showed the observers different images with the number of labels varying between 5 and 9, and also let them vote for the maximum number of labels that they find acceptable. Based on their selection, we made a trade-off between visual quality and algorithm validity, and set the number of labels in each view to 8. As for POI point selection, we try to find meaningful regions like gates or windows as the target objects.

When placing labels, the content of the labels may also affect the participants’ decisions. AR annotations can take various forms such as text, image, or video; we set the content of each label to the name of the simulated target location. We noticed that different names may affect users’ labeling decisions. To prevent users from regarding some labels as more important than others because of their textual content, we give all 8 labels in each image exactly the same name. Each label is styled with a black outline border, a white background and black text.

The size of the labels may also affect the user’s decision, so we set all eight labels to the same size. We showed examples of different label size styles and let the participants vote for their favorite. We finally set the text size to 12 pixels and the label size as pixels.

Collecting label placements We conduct the experiment to study how users manually place labels in specific scenes. Twenty users (aged 18 to 40, 11 females and 9 males) participated in the study. The participants are asked to use a mouse to place the labels at the positions they think most appropriate, as illustrated in Figure 3. In the experiment, all participants view the same 300 photos (180 for training and 120 for testing), and the order of the images is randomized for each participant. In total, we collect 48,000 labels (300 images × 8 labels × 20 participants).

3.3 Guidance Map generation

As we mentioned in Section 2, image-based AR label placement not only values the readability of the target objects and labels, but also takes into account the background and other important regions in the image of the user’s view. When placing labels, previous work mostly uses visual saliency to avoid important regions; however, the saliency detection methods used in previous works are outdated. Therefore, we ran different saliency algorithms ref36 ; ref42 ; ref43 on the images in our dataset. The results are shown in Figure 4.

Figure 4: Visual comparison of selected saliency detection.

We found that the saliency detection methods do not perform well in street view images. For the contrast-based traditional methods like IG ref36 and DRFI ref42 , the sky regions in the results are shown to be more salient than the surroundings, which is contrary to our expectations. Individuals tend to place labels onto uniform regions, and the sky region is obviously on the top of the list of positions we expect users to place the labels on. Also, the road regions are less salient than others, and sometimes the traffic sign is not salient. This will lead the label placement algorithm to place the labels onto road or traffic sign regions, which will affect the drivers’ view. So the saliency detection used in the previous works in street view image is not ideal.

Task-specific importance prior Based on our observations, the semantic information in the scene influences where users tend to place labels. In a specific task scenario, these tendencies can vary with the semantic region. For this reason, we integrate the semantic information with the saliency information in the label placement algorithm. The question is how to use the semantic information, and how to measure the users’ labeling preference for different semantic areas. We treat the users’ labeling preferences as a task-specific importance prior. To quantify this prior for manual label placement in the AR street view navigation scenario, we estimate it from human annotations using the collected data, specifically the 28,800 labels in the training set. We calculate the frequency of labels falling into regions of a specific semantic category.

We also think it is important to not only look at the label’s centroid for determining the semantic category that the label is placed in, but instead it is crucial to look at the accompanying category of each pixel belonging to a label. Just like the example shown in Figure 5, although the centroid of the label is related to the semantic category Foliage, most of the label area is in the Sky region. So we calculate the quota instead, as shown in Figure 5.

Figure 5: An incorrectly calculated example (left) and the implemented method of calculation (right).
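The per-pixel quota described above can be illustrated with a short sketch. It assumes the semantic map is available as a 2D grid of category names and that a label is an axis-aligned rectangle; the function name `label_quota` and the toy map are hypothetical and only serve to show why a fractional count per category arises.

```python
from collections import Counter


def label_quota(semantic_map, x0, y0, width, height):
    """Return {category: fraction of the label's pixels in that category}.

    semantic_map is a list of rows, each row a list of category names.
    (x0, y0) is the top-left pixel of the label rectangle (hypothetical layout).
    """
    counts = Counter()
    for y in range(y0, y0 + height):
        for x in range(x0, x0 + width):
            counts[semantic_map[y][x]] += 1
    total = width * height
    return {category: n / total for category, n in counts.items()}


# Toy 4x6 semantic map: mostly Sky at the top, Foliage to the lower right.
semantic_map = [
    ["Sky", "Sky", "Sky", "Sky", "Sky", "Sky"],
    ["Sky", "Sky", "Sky", "Sky", "Foliage", "Foliage"],
    ["Sky", "Sky", "Foliage", "Foliage", "Foliage", "Foliage"],
    ["Road", "Road", "Road", "Road", "Road", "Road"],
]

# A 4x3 label whose pixels are split between Sky and Foliage.
quota = label_quota(semantic_map, x0=1, y0=0, width=4, height=3)
print(quota)
```

Summing such quotas over all labels, rather than counting centroids, would explain why the actual label counts reported for each category are not whole numbers.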

We calculate the number of labels for each semantic category in the training set and call it the number of actual labels, $N_{a}$. However, this number cannot directly represent the manual placement tendencies, because different semantic categories appear with different frequencies in the dataset. For example, most of the drivers think the Bridge region is less important in their view and tend to put labels onto it. The Bridge semantic category appears only 9 times in our training set, so at most 1,440 labels (9 images × 8 labels × 20 participants) can be placed onto Bridge regions. We call 1,440 the number of potential labels, $N_{p}$, for the Bridge semantic category. We introduce a tendency factor $T$ defined as:

$$T = \frac{N_{a}}{N_{p}}$$

The ratio $T$ for the Bridge category is high enough that, even though $N_{a}$ for the Bridge category is very low, we can conclude that users tend to put labels onto the bridge region once they see it. The statistical placement tendencies for the different semantic categories are shown in Table 1.

Category        Potential labels   Actual labels   Tendency
Sky             28800              12569.43        43.64%
Foliage         28480              10840.36        38.06%
Building        28480              4956.41         17.40%
Bridge          1440               156.31          10.85%
Person          22720              97.52           0.43%
Bus             1120               3.04            0.27%
Motorcycle      1920               5.11            0.27%
Pole            28800              51.31           0.18%
Sidewalk        28800              47.21           0.16%
Car             27680              40.36           0.15%
Traffic Sign    27040              37.64           0.14%
Rider           5280               6.43            0.12%
Road            28000              15.76           0.05%
Bicycle         7840               2.28            0.03%
Traffic Light   13600              0.83            0.01%
Table 1: Placement tendencies for different semantic categories
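As a quick check of how the figures in Table 1 relate, the sketch below recomputes the tendency as the number of actual labels divided by the number of potential labels, with the potential count taken as images containing the category × 8 labels × 20 participants. The helper names are hypothetical; the numbers are copied from Table 1 and the text.

```python
PARTICIPANTS = 20        # users in the study
LABELS_PER_IMAGE = 8     # labels shown in each image


def potential_labels(images_with_category):
    """Upper bound on labels that could land on a category."""
    return images_with_category * LABELS_PER_IMAGE * PARTICIPANTS


def tendency(actual, potential):
    """Fraction of the potential labels that were actually placed there."""
    return actual / potential


# (potential labels, actual labels) for a few rows of Table 1.
table1 = {
    "Sky": (28800, 12569.43),
    "Foliage": (28480, 10840.36),
    "Building": (28480, 4956.41),
    "Bridge": (1440, 156.31),
}

tendencies = {cat: tendency(a, p) for cat, (p, a) in table1.items()}
for cat, t in tendencies.items():
    print(f"{cat}: {t:.2%}")
```

With 180 training images the bound is 28,800, which matches the categories that appear in every image, and 9 Bridge images give the 1,440 quoted in the text.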

From Table 1 and an interview that we conducted with the participants after the data collection, we can confirm that the statistical manual label placement tendencies are reasonable. As shown, Sky, Foliage, Building and Bridge are the categories most often chosen for placing labels. Sky and Foliage regions are very common in the dataset, and drivers tend to put labels on them because their color and texture are uniform and do not vary drastically. Sometimes these regions are more salient than their surroundings in the saliency map, leading saliency-based algorithms to avoid placing labels there. Conversely, even though Road regions make up a large proportion of the dataset, drivers do not like putting labels on them because they think the labels would adversely affect their driving. In spite of the Traffic Sign and Traffic Light’s appearance in almost every picture in the training dataset and their high value, drivers do not put labels on them. We conclude that this is because drivers consider these the most important objects to see while driving, and will not allow anything to cover them.

Guidance Map We use the semantic information and the preference of user’s label placement to adjust the saliency map to generate a Guidance Map, closely resembling users’ natural preferences to indicate the important regions in the view in a specific task. The Guidance Map, denoted by , can be conducted through the model:


where, for pixel location , represents the saliency map, represents the semantic map, is the maximum function for normalization, and is the task-specific importance prior weight learned in the training phase. In this study, from the statistic features for manual placement tendency, we define for different semantic regions as the following


Table 2 shows the task-specific category weights in this study. Some participants did place a few labels on traffic sign and traffic light regions. We attribute this to a lapse of attention rather than to preference, because those users expressed regret about these labels during our interview. For this reason, we set these two coefficients to 1.

Category        Tendency   Weight
Sky             43.64%     0.0000
Foliage         38.06%     0.1279
Building        17.40%     0.6013
Bridge          10.85%     0.7514
Person          0.43%      0.9901
Bus             0.27%      0.9938
Motorcycle      0.27%      0.9938
Pole            0.18%      0.9959
Sidewalk        0.16%      0.9963
Car             0.15%      0.9966
Traffic Sign    0.14%      1.0000
Rider           0.12%      0.9973
Road            0.05%      0.9989
Bicycle         0.03%      0.9993
Traffic Light   0.01%      1.0000
Table 2: Task-specific importance prior weight for different semantic categories
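The weights in Table 2 are numerically consistent with one minus the tendency normalized by the largest tendency, with Traffic Sign and Traffic Light then forced to 1. The sketch below reproduces the table under that reading; the formula is inferred from the tabulated values rather than quoted from the text, and the function name `prior_weights` is hypothetical.

```python
# Tendencies in percent, copied from Table 2.
tendency_percent = {
    "Sky": 43.64, "Foliage": 38.06, "Building": 17.40, "Bridge": 10.85,
    "Person": 0.43, "Bus": 0.27, "Motorcycle": 0.27, "Pole": 0.18,
    "Sidewalk": 0.16, "Car": 0.15, "Traffic Sign": 0.14, "Rider": 0.12,
    "Road": 0.05, "Bicycle": 0.03, "Traffic Light": 0.01,
}

# Categories whose weight is set to 1 by hand, as described in the text.
FORCED_TO_ONE = {"Traffic Sign", "Traffic Light"}


def prior_weights(tendencies, forced):
    """Assumed rule: weight = 1 - tendency / max tendency, then manual overrides."""
    t_max = max(tendencies.values())
    return {
        cat: 1.0 if cat in forced else 1.0 - t / t_max
        for cat, t in tendencies.items()
    }


weights = prior_weights(tendency_percent, FORCED_TO_ONE)
for cat, w in weights.items():
    print(f"{cat:14s} {w:.4f}")
```

Under this reading a weight near 0 marks a region users readily cover with labels, and a weight near 1 marks a region they protect.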
Figure 6: Qualitative result of semantic segmentation generated by Deeplabv3 (middle), the original image (left) and the ground truth (right).
Figure 7: Qualitative result of Guidance Map generated by the proposed method.

To obtain the important regions in the user’s view, we combine the saliency-detection algorithm with semantic information for the image of the user’s view. The pixel’s value indicates the importance of the region for the drivers. Also, we can see edges do not always show as significantly salient after running the saliency-detection algorithm. It is obvious that edges of objects should not be occluded by labels. For this purpose, we use the Canny algorithm to detect the edges of objects in the user’s view, serving as a supplement for the Guidance Map, obtaining the positions where annotations tend not to be placed.

In our system, we use the DSS saliency-detection algorithm proposed by Cheng ref43. The semantic information comes from the semantic segmentation result of Deeplabv3 ref51, which shows good performance on the Cityscapes dataset, as shown in Figure 6. Our Guidance Map is shown in Figure 7.
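One plausible way to combine the three ingredients is sketched below: each pixel's saliency is scaled by the prior weight of its semantic category, and the result is divided by its maximum so that values lie in [0, 1]. This is an assumption made for illustration, not the exact model of the paper; the function name `guidance_map` and the toy inputs are hypothetical, and the weights are taken from Table 2.

```python
def guidance_map(saliency, semantic, weights):
    """Assumed combination: saliency x category weight, normalized by the maximum.

    saliency: 2D list of floats in [0, 1]
    semantic: 2D list of category names with the same shape
    weights:  {category: task-specific importance prior weight}
    """
    raw = [
        [s * weights[c] for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(saliency, semantic)
    ]
    peak = max(max(row) for row in raw)
    if peak == 0:
        return raw  # nothing important anywhere; avoid dividing by zero
    return [[v / peak for v in row] for row in raw]


saliency = [[0.9, 0.2], [0.4, 0.8]]
semantic = [["Sky", "Road"], ["Building", "Traffic Sign"]]
weights = {"Sky": 0.0, "Road": 0.9989, "Building": 0.6013, "Traffic Sign": 1.0}

g = guidance_map(saliency, semantic, weights)
print(g)
```

In this toy case the highly salient sky pixel drops to zero importance, while the traffic sign becomes the most protected pixel, which is the behaviour the text asks of the Guidance Map.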

3.4 Optimization

Throughout the paper, for the -th image in the testing set ( images), we let the target objects in the -th image form a set , represent the -th target object, and is the total number of objects. Then we set the POI as the centroid of an object’s bounding box and named as . To present the visual information in our study, each has its own label , and the position of the label is . The set is defined as the set of labels in the -th image. Then the label placement problem aims to compute each .

We formulate the label placement problem as an energy minimization problem, with the energy function defined as


It consists of the label energy term () and the leader line energy term ().

Label energy term As shown in Eq. 6, the label energy term includes the energy of the label overlaying the Guidance Map and edge map, and the energy of the labels overlaying each other.


where are the weight coefficients, which are automatically learned from the Manual Label Placement dataset.

  1. Label overlaying Guidance Map (joint saliency & semantic)


    where the is the map which indicates the region of the annotation of (the label area pixel value is and the rest of the image is ), the image size is and the label size is .

  2. Label overlaying the edge map (Canny)


    where the is the edge map generated by Canny edge detector.

  3. Labels intersection
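The three label terms can be pictured with a small sketch. It assumes that an overlay term is the mean map value under the label rectangle and that the intersection term is the overlapping area of two label rectangles; both are illustrative choices, since the exact expressions are not reproduced here, and the function names are hypothetical.

```python
def overlay_energy(importance_map, rect):
    """Mean value of a map (Guidance Map or edge map) under a label rectangle.

    rect is (x0, y0, width, height) in pixel coordinates (assumed layout).
    """
    x0, y0, w, h = rect
    total = sum(
        importance_map[y][x]
        for y in range(y0, y0 + h)
        for x in range(x0, x0 + w)
    )
    return total / (w * h)


def overlap_area(a, b):
    """Area shared by two axis-aligned label rectangles (0 if disjoint)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = min(ax + aw, bx + bw) - max(ax, bx)
    dy = min(ay + ah, by + bh) - max(ay, by)
    return max(dx, 0) * max(dy, 0)


importance_map = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]

print(overlay_energy(importance_map, (1, 0, 2, 2)))  # half the label covers important pixels
print(overlap_area((0, 0, 4, 2), (2, 1, 4, 2)))      # partially overlapping labels
print(overlap_area((0, 0, 2, 2), (5, 5, 2, 2)))      # disjoint labels
```

A weighted sum of such terms would then penalize positions that cover important pixels, cover edges, or collide with other labels.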


Leader line energy term As shown in Eq. 10, the leader line energy term includes the energy of the leader line overlaying the Guidance Map, the leader line intersection, and the leader line length and orientation.


where are the weight coefficients, which are automatically learned from the Manual Label Placement dataset.

  1. Leader line overlaying Guidance Map


    where the is the map which indicate the region of the leader line between the label and the POI (the leader line pixel value is and the rest of the image is ).

  2. Leader line intersection

  3. Leader line length

  4. Leader line orientation


    where the is the angle between the leader line vector and the axis (the vertical alignment leader line is preferred).

We consider three algorithms for implementing the optimization: a greedy algorithm, simulated annealing and a force-based algorithm. We first rule out simulated annealing because its low efficiency makes it unsuitable for our AR scenario. The force-based algorithm implements the penalty factors as a set of forces, and the labels are moved in parallel in this force field; after a certain number of iterations, we obtain the labels’ final positions. When testing the force-based algorithm, we found the force field too complex to tune, and we could not find an appropriate number of iterations for the implementation. Therefore, we chose the greedy algorithm for our optimization. We sort the labels from left to right and from nearest to farthest. Then we iterate over the labels, compute the energy function value of each candidate position, and take the position with the minimum energy as the label’s final position.
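A minimal sketch of such a greedy pass is given below. It assumes each POI has a finite list of candidate label positions and that the energy of a candidate can be evaluated against the labels already placed; the candidate generator, the toy energy and all names are hypothetical stand-ins for the energy function of this section, and only the left-to-right part of the ordering is modelled.

```python
def greedy_layout(pois, candidates_for, energy):
    """Place labels one by one, each at its minimum-energy candidate.

    pois:           list of (x, y) points of interest
    candidates_for: function poi -> list of candidate label positions
    energy:         function (poi, position, placed) -> float
    """
    placed = []
    for poi in sorted(pois):  # left to right (by x, then y)
        best = min(candidates_for(poi), key=lambda pos: energy(poi, pos, placed))
        placed.append((poi, best))
    return placed


def candidates_for(poi):
    """Hypothetical candidates: above, right of, and left of the POI."""
    x, y = poi
    return [(x, y - 2), (x + 2, y), (x - 2, y)]


def toy_energy(poi, pos, placed):
    """Leader line length plus a large penalty for colliding with placed labels."""
    length = abs(pos[0] - poi[0]) + abs(pos[1] - poi[1])
    collisions = sum(
        1 for _, other in placed
        if abs(pos[0] - other[0]) < 3 and abs(pos[1] - other[1]) < 1
    )
    return length + 100 * collisions


layout = greedy_layout([(3, 5), (1, 5)], candidates_for, toy_energy)
print(layout)
```

Because every label is fixed as soon as it is placed, the cost is one energy evaluation per candidate per label, which is what makes the greedy choice attractive for an interactive AR setting; the price is that an early placement is never revisited.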

4 Experiments and Results

We apply our proposed method to the Manual Label Placement dataset, which has 180 images for training and 120 for testing. We use the training set to generate the task-specific semantic-aware weights and the coefficients in the energy function as , , and , , , . Afterwards, we generate the label layout on the testing dataset. Qualitative and quantitative comparison is conducted between the proposed method and state-of-the-art label placement algorithms.

4.1 Evaluation Metrics

To evaluate the performance of the annotation results, we apply four metrics: (1) the average distance from the centroid of the manual placements, (2) the average annotation overlapped area, (3) the leader line intersections, and (4) the leader line length.

The first metric is one of the most important. It assesses the difference between the computed result and the users' manual placement. In the testing dataset, 20 participants chose a position for each label in the images. We eliminate the two most isolated positions and take the centroid of the remaining 18 positions as the reference position. The metric is the average distance between each computed label position and its reference centroid.


A lower value means that the placement result is closer to the manual placement.
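The following sketch shows one way to compute this reference centroid and the resulting distance metric. The source does not say how isolation is measured, so the sketch assumes it is the mean Euclidean distance to all other manual placements. The function names are hypothetical.

```python
import numpy as np


def robust_centroid(positions, n_drop=2):
    """Centroid of manual placements after dropping the n_drop most isolated ones."""
    pts = np.asarray(positions, dtype=float)
    # Assumption: isolation = mean Euclidean distance to every other placement.
    pairwise = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    isolation = pairwise.sum(axis=1) / (len(pts) - 1)
    keep = np.argsort(isolation)[: len(pts) - n_drop]
    return pts[keep].mean(axis=0)


def mean_centroid_distance(placed, centroids):
    """Average Euclidean distance between computed label positions and their reference centroids."""
    placed = np.asarray(placed, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    return float(np.linalg.norm(placed - centroids, axis=1).mean())
```

With 20 placements per label and `n_drop=2`, the centroid is taken over the 18 least isolated positions, matching the procedure described above.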

The second metric assesses the severity of overlap involving both annotations and important regions.


A lower value means that fewer collisions occur between annotations and important regions.
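Since the exact definition of the overlap metric is not recoverable here, the sketch below only illustrates its two likely building blocks under stated assumptions: labels are axis-aligned rectangles `(x, y, w, h)` in pixel coordinates, and important regions are given as a binary mask. Both helper names are hypothetical.

```python
import numpy as np


def rect_overlap_area(a, b):
    """Intersection area of two axis-aligned rectangles given as (x, y, w, h)."""
    overlap_x = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    overlap_y = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    return overlap_x * overlap_y


def important_area_covered(label, important_mask):
    """Number of important pixels (mask value 1) hidden by a label rectangle (x, y, w, h)."""
    x, y, w, h = (int(v) for v in label)
    rows, cols = important_mask.shape
    # Clip the rectangle to the image before slicing.
    x0, x1 = max(x, 0), min(x + w, cols)
    y0, y1 = max(y, 0), min(y + h, rows)
    if x0 >= x1 or y0 >= y1:
        return 0
    return int(important_mask[y0:y1, x0:x1].sum())
```

Averaging such areas over all labels would give one plausible reading of an average overlapped area, but the normalization used in the paper is not stated in this section.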

The third metric estimates whether label leader lines intersect. To prevent confusion when reading the annotations, the leader lines should not intersect each other. This metric quantifies the severity of such intersections.


Fewer leader line intersections yield a lower value.
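One simple way to quantify intersections is to count the pairs of leader lines that cross. The sketch below uses the standard cross-product orientation test. Treating the metric as a raw pair count is an assumption, as are the function names and the choice not to count shared endpoints or collinear overlaps.

```python
from itertools import combinations


def _cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def segments_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 properly cross (touching endpoints are not counted)."""
    d1 = _cross(q1, q2, p1)
    d2 = _cross(q1, q2, p2)
    d3 = _cross(p1, p2, q1)
    d4 = _cross(p1, p2, q2)
    # Each segment's endpoints must lie strictly on opposite sides of the other segment.
    return d1 * d2 < 0 and d3 * d4 < 0


def count_intersections(leader_lines):
    """Number of crossing pairs among leader lines given as ((x0, y0), (x1, y1))."""
    return sum(
        segments_cross(a[0], a[1], b[0], b[1])
        for a, b in combinations(leader_lines, 2)
    )
```

The pairwise test is quadratic in the number of labels, which is acceptable for the small label counts typical of a single street view image.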

The last metric evaluates the leader line length. If a label is not close to its point of interest, users need to spend time tracing the related label. Moreover, the participants stated in the interview that they dislike long leader lines and consider 5-10 cm an acceptable leader line length. We therefore define an optimal length and compute the metric to quantify the similarity between each leader line length and this optimal length.


The closer the leader line length is to the optimal length, the smaller the value of this metric. The optimal length is set to a fixed value in our experiment.

4.2 Qualitative comparison with state of the art

Figure 8: The qualitative comparison between the proposed method and other state-of-the-art methods.

The comparisons with the state-of-the-art approaches help identify the contributions of the proposed approach. We compare the proposed approach to a baseline approach and three state-of-the-art approaches: height separation ref55, planar separation, and Grasset's method ref17. The baseline is the naive method, which overlays the computer-generated information directly onto the POIs in the user's view. Naturally, this approach leads to label occlusions. In the height separation method presented by Peterson et al. ref55, the annotation and the POI share the same coordinate along one axis. The technique iterates over the labels, and when it detects that two labels overlap, it moves the label related to the farthest POI up by half a label's height. Like the naive and the height separation methods, the planar separation technique does not analyze the image. With planar separation, if a label is detected to overlap another label, the technique evaluates new label positions at fixed separations and chooses the best one after evaluating a total of 36 positions. Additionally, we compare our technique to the layout algorithm presented by Grasset ref17. Qualitative results are shown in Figure 8.
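The height separation rule described above can be sketched as follows. This is an illustrative reading of the description, not Peterson et al.'s implementation: labels are hypothetical axis-aligned rectangles, the image y axis is assumed to grow downward (so moving up decreases `y`), and `max_passes` is an added safety guard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Label:
    x: float      # left edge
    y: float      # top edge (image coordinates, y grows downward)
    w: float
    h: float
    depth: float  # assumed distance of the associated POI from the viewer


def overlaps(a: Label, b: Label) -> bool:
    """True if the two label rectangles share a region of positive area."""
    return a.x < b.x + b.w and b.x < a.x + a.w and a.y < b.y + b.h and b.y < a.y + a.h


def height_separation(labels: List[Label], max_passes: int = 100) -> List[Label]:
    """Repeatedly lift the farther label of each overlapping pair by half its height."""
    for _ in range(max_passes):
        moved = False
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                if overlaps(a, b):
                    farther = a if a.depth > b.depth else b
                    farther.y -= farther.h / 2  # move up; x stays unchanged
                    moved = True
        if not moved:
            break
    return labels
```

Note that the rule only looks at label rectangles and depths, which is why it cannot avoid covering pedestrians or traffic signs.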

From the qualitative comparison, we can see that the naive approach, the height separation method, and the planar separation method inappropriately cover pedestrians and traffic signs. Moreover, the naive method produces significant overlap between labels, which seriously affects their readability and forces drivers to change the viewpoint to see the annotations. Grasset's method ref17 relies on the IG saliency algorithm to avoid occluding pedestrians and traffic signs, but it places a large number of labels on the road, which strongly affects the driver's vision in the AR street view navigation scene.

The above-mentioned problems are eliminated in our method. The labels do not cover important areas in the user's field of view, and they tend to be positioned on regions of uniform color and texture (sky, leaves, bridge), while avoiding the road as expected. Our method outperforms the other methods on our testing dataset.

4.3 Quantitative Results and Ablation Study

Using the evaluation metrics defined in Section 4.1, we quantitatively compare our method with the state-of-the-art approaches. The quantitative comparison is shown in Table 3.

Method               Centroid distance   Overlapped area   Intersections   Leader line length
Naive                105.16              38.74             0               0
Height Separation    86.39               83.30             0               10.73
Planar Separation    89.45               48.88             0               9.84
Grasset ref17        85.33               35.58             78              123.43
Proposed Method      61.78               8.50              12              26.99
Table 3: Quantitative Results of Different Methods

From the table, we can see that on the average distance from the manual-placement centroid, one of the most important evaluation metrics, our method has the lowest value: our label layout is closest to the Manual Label Placement benchmark and hence provides a better layout than the compared methods. On the overlapped area, another important metric, our method yields a much smaller value than the other methods, indicating that we successfully avoid occluding the salient areas in the user's view. As for the intersections and the leader line length, the first three methods naturally obtain lower values, since they keep labels on or close to their POIs. Our method performs much better than Grasset's method ref17 on these two metrics. To summarize, our method performs best on the two most important metrics and clearly improves on Grasset's method ref17 on the other two.

To validate the impact of the different components of the latest saliency model and the semantic information, we conduct ablation experiments on the testing dataset, shown in Table 4.

Method                    Centroid distance   Overlapped area   Intersections   Leader line length
IG (Grasset ref17)        85.33               35.58             78              123.43
IG+Deeplabv3              72.70               15.18             27              46.83
IG+Groundtruth            64.65               9.17              16              37.48
DSS+Deeplabv3 (Ours)      61.78               8.50              12              26.99
DSS+Groundtruth           57.65               7.16              12              26.49
Table 4: Ablation Study

The ablation study results indicate that adding semantic information to the IG algorithm improves the label layout on all quantitative metrics. When the accurate semantic segmentation ground truth is integrated directly, the improvement is even more pronounced. This indicates that semantic information is useful for the label placement problem. On the other hand, comparing DSS+Deeplabv3 (our method) with IG+Deeplabv3 shows that the more recent state-of-the-art saliency model also improves the label layout performance. These studies show that each ingredient brings an individual improvement and that they work together to produce better label layouts.

5 Conclusion

This paper presents a semantic-aware approach for label placement in AR street view scenarios. We introduce a new labeling algorithm that can be used for the future development of AR-HUD street view navigation applications. Compared to other label layout algorithms, our method has the following advantages: (1) Both semantic information and saliency detection are integrated into the label placement optimization to further improve the layout performance. (2) A label placement dataset provides a benchmark for quantitative evaluation. (3) Unlike previous task-unaware methods, our system provides a task-specific label placement framework.


References

  • (1) Azuma R.T., A survey of augmented reality, Presence: Teleoperators and Virtual Environments, 6(4), pp.355-385 (1997)
  • (2) Carmigniani J., Furht B., Anisetti M., Ceravolo P., Damiani E. and Ivkovic M., Augmented reality technologies, systems and applications, Multimedia tools and applications, 51(1), pp.341-377 (2011)
  • (3) Chang G., Morreale P. and Medicherla P., Applications of augmented reality systems in education, In Society for Information Technology Teacher Education International Conference Association for the Advancement of Computing in Education (AACE), pp. 1380-1385 (2010)
  • (4) Hartmann K., Götzelmann T., Ali K. and Strothotte T., Metrics for functional and aesthetic label layouts, In International Symposium on Smart Graphics, pp.115-126 (2005)
  • (5) Bell B., Feiner S. and Höllerer T., View management for virtual and augmented reality, In Proceedings of the 14th annual ACM symposium on User interface software and technology, pp.101-110 (2001)
  • (6) Azuma R. and Furmanski C., Evaluating label placement for augmented reality view management, In Proceedings of the 2nd IEEE/ACM international Symposium on Mixed and Augmented Reality, pp.66 (2003)
  • (7) Cmolík L. and Bittner J., Layout-aware optimization for interactive labeling of 3D models, Computers and Graphics, 34(4), pp.378-387 (2010)
  • (8) Tenmoku R., Kanbara M. and Yokoya N., Annotating user-viewed objects for wearable AR systems, In Proceedings of the 4th IEEE/ACM International Symposium on Mixed and Augmented Reality, pp.192-193 (2005)
  • (9) Makita K., Kanbara M. and Yokoya N., View management of annotations for wearable augmented reality, In 2009 IEEE International Conference on Multimedia and Expo, pp.982-985 (2009)
  • (10) Zhang B., Li Q., Chao H., Chen B., Ofek E. and Xu Y.Q., Annotating and navigating tourist videos, In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp.260-269 (2010)
  • (11) Iwai D., Yabiki T. and Sato K., View management of projected labels on nonplanar and textured surfaces, IEEE transactions on visualization and computer graphics, 19(8), pp.1415-1424 (2013)
  • (12) Tatzgern M., Kalkofen D. and Schmalstieg D., Dynamic compact visualizations for augmented reality, In 2013 IEEE Virtual Reality (VR), pp.3-6 (2013)
  • (13) Tatzgern M., Kalkofen D., Grasset R. and Schmalstieg D., Hedgehog labeling: View management techniques for external labels in 3D space, In 2014 IEEE Virtual Reality (VR), pp.27-32 (2014)
  • (14) Leykin A. and Tuceryan M., Automatic determination of text readability over textured backgrounds for augmented reality systems, In Third IEEE and ACM International Symposium on Mixed and Augmented Reality, pp.224-230 (2004)
  • (15) Rosten E., Reitmayr G. and Drummond T., Real-time video annotations for augmented reality, In International Symposium on Visual Computing, pp.294-302 (2005)
  • (16) Tanaka K., Kishino Y., Miyamae M., Terada T. and Nishio S., An information layout method for an optical see-through head mounted display focusing on the viewability, In 2008 7th IEEE/ACM International Symposium on Mixed and Augmented Reality, pp.139-142 (2008)
  • (17) Grasset R., Langlotz T., Kalkofen D., Tatzgern M. and Schmalstieg D., Image-driven view management for augmented reality browsers, In 2012 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp.177-186 (2012)
  • (18) Lauber F., Butz A., View management for driver assistance in an HMD, In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 1-6 (2013)
  • (19) Stein T. and Décoret X., Dynamic label placement for improved interactive exploration, In Proceedings of the 6th international symposium on Non-photorealistic animation and rendering, pp.15-21 (2008)
  • (20) Vollick I., Vogel D., Agrawala M. and Hertzmann A., Specifying label layout style by example, In Proceedings of the 20th annual ACM symposium on User interface software and technology, pp.221-230 (2007)
  • (21) Augmented reality head-up display, from:, Retrieved Apr 30, 2019, (2015)
  • (22) Rakholia N., Hegde S. and Hebbalaguppe R., Where to Place: A Real-Time Visual Saliency Based Label Placement for Augmented Reality Applications, In 2018 25th IEEE International Conference on Image Processing (ICIP), pp.604-608 (2018)
  • (23) Achanta R., Hemami S., Estrada F. and Susstrunk S., Frequency-tuned salient region detection, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, (2009)
  • (24) Jiang H., Wang J., Yuan Z., Wu Y., Zheng N. and Li S., Salient object detection: A discriminative regional feature integration approach, In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.2083-2090 (2013)
  • (25) Hou Q., Cheng M.M., Hu X., Borji A., Tu Z. and Torr P.H., Deeply supervised salient object detection with short connections, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.3203-3212 (2017)
  • (26) Wang W., Lai Q., Fu H., Shen J. and Ling H., Salient Object Detection in the Deep Learning Era: An In-Depth Survey, arXiv:1904.09146 (2019)
  • (27) Cordts M., Omran M., Ramos S., Rehfeld T., Enzweiler M., Benenson R., Franke U., Roth S. and Schiele B., The cityscapes dataset for semantic urban scene understanding, In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.3213-3223 (2016)
  • (28) Chen L.C., Papandreou G., Schroff F. and Adam H., Rethinking atrous convolution for semantic image segmentation, arXiv preprint, arXiv:1706.05587 (2017)
  • (29) Orlosky J., Kiyokawa K. and Takemura H., Towards intelligent view management: A study of manual text placement tendencies in mobile environments using video see-through displays, In 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp.281-282 (2013)
  • (30) Peterson S.D., Axholt M., Cooper M. and Ellis S.R., Evaluation of alternative label placement techniques in dynamic virtual environments, In International Symposium on Smart Graphics, pp.43-55 (2009)
  • (31) Ichihashi K. and Fujinami K., Estimating Visibility of Annotations for View Management in Spatial Augmented Reality Based on Machine-Learning Techniques, Sensors, 19(4), pp.939-966 (2019)
  • (32) Sato M. and Fujinami K., Nonoverlapped view management for augmented reality by tabletop projection, Journal of Visual Languages and Computing, 25(6), pp.891-902 (2014)
  • (33) Li G., Liu Y., and Wang Y., An empirical evaluation of labelling method in augmented reality, Proceedings of the 16th ACM SIGGRAPH International Conference on Virtual-Reality Continuum and its Applications in Industry, doi:10.1145/3284398.3284422 (2018)
  • (34) Li G., Liu Y., Wang Y. and Xu Z., Evaluation of labelling layout method for image-driven view management in augmented reality, Proceedings of the 29th Australian Conference on Computer-Human Interaction, pp.266-274 (2017)