Utilizing High-level Visual Feature for Indoor Shopping Mall Navigation

10/06/2016 ∙ by Ziwei Xu, et al.

Towards robust and convenient indoor shopping mall navigation, we propose a novel learning-based scheme to utilize the high-level visual information from the storefront images captured by personal devices of users. Specifically, we decompose the visual navigation problem into localization and map generation respectively. Given a storefront input image, a novel feature fusion scheme (denoted as FusionNet) is proposed by fusing the distinguishing DNN-based appearance feature and text feature for robust recognition of store brands, which serves for accurate localization. Regarding the map generation, we convert the user-captured indicator map of the shopping mall into a topological map by parsing the stores and their connectivity. Experimental results conducted on the real shopping malls demonstrate that the proposed system achieves robust localization and precise map generation, enabling accurate navigation.




1 Introduction

In the absence of a portable, low-cost positioning system comparable to GPS for outdoor localization, indoor positioning systems (IPS) have been a long-standing and attractive research topic. Infrastructure-based indoor positioning systems, which make use of pre-installed infrastructure such as RFID [1], configured fluorescent lights [2] or Wi-Fi access points [3], have achieved impressive performance in real scenes. Infrastructure-free IPS, on the contrary, is more flexible but also more challenging, and has attracted considerable attention. Resorting to image retrieval techniques, [4] [5] [6] proposed vision-based IPS that can tell a user's position from photos taken by smart-phones. However, all of these methods require an off-line database-building process which is time-consuming and costly.

Recent advances in robotics and computer vision also shed light on IPS. Simultaneous Localization and Mapping (SLAM) and Visual Odometry (VO) techniques, which can accurately estimate motion, make these methods promising for IPS. [7] proposed a monocular SLAM system which uses bag-of-words for place recognition [8]. [9] proposed an accurate monocular VO algorithm which runs at 55 FPS on an ARM platform. [10] proposed a real-time indoor and outdoor localization system based on visual-lidar odometry and mapping [11], which ranked first in the Microsoft Indoor Localization Competition in 2016 [12]. However, running SLAM or VO in practice means users have to record videos with cameras or laser transceivers.

To address this issue, Wang et al. [13] proposed an approach which relies on text recognition [14] [15] for shop candidate classification. Specifically, shops in an image are classified by text recognition and serve as landmarks for coarse-level localization (i.e., localization by shop classification). This approach is extendable and flexible, since it does not require a large amount of pre-captured data about the indoor scene beyond a pre-labelled floor plan.

Inspired by the idea of localization by shop classification, we propose a flexible and robust indoor shopping mall navigation system. This paper makes three major contributions. Firstly, we propose a novel style feature and fuse it with a text feature for robust classification. We show by experiment that our fusion framework, FusionNet, outperforms [13]. Moreover, we build a storefront image dataset and plan to make it publicly available. Secondly, we introduce a shopping instruction parsing method which automatically builds a topological map from photos taken by smart-phones. Thirdly, with robust shop recognition and automatic topological map construction, we put forward a flexible indoor navigation system which works successfully in real scenes. The architecture of our system is shown in Fig. 1.

Figure 1: System Architecture. Style features and text features are extracted from storefront images and fused for shop classification. Images of shopping instructions are parsed and a topological map of the shopping mall is constructed. Localization and navigation is performed based on the classification result and the topological map.

2 Localization by Classification

Localization by classification is the most common way that shoppers adopt when they get lost in a shopping mall. Recognizing a shop and matching it with a map can produce an instant estimation of location. In this section, we introduce our feature fusion method for shop classification.

2.1 Data Collection

We collected storefront images of 56 different shop classes, from Adidas to Zippo, from the Internet. The collected images cover a variety of styles, decorations, and color tones. For most classes, up to 88 images were collected, and even for rarely seen shop brands at least 20 images were available. We collected a total of 2,876 images, with 51 images per brand on average. Within each class, images were divided into a training set and a test set at a ratio of 4:1. Interested readers are referred to the supplemental material for a detailed description of the dataset.

2.2 Style Features for Classification

Many famous brands decorate their shops in a particular way so that customers can easily recognize and remember their storefronts. For example, Adidas stores usually have a black background decorated by unique black-white stripes, while the common decoration for Gucci shops is a golden color with a grid-patterned window. These visual patterns are distinctive for different brands, and are usually stable among shopping malls.

To utilize these visual patterns for shop recognition, we adopt transfer learning to learn discriminative visual representations. Specifically, random patches along with their shop brand ground truths are fetched from our dataset to fine-tune the AlexNet

[16]. The 4096-dimensional feature of the trained network (i.e., the output of the 7th layer) is used to represent shop styles.

In the testing phase, we randomly choose 16 patches in an image and calculate each patch’s feature vector. After this, a bin-wise max operation (we find that features obtained with bin-wise max operation outperform features that obtained with bin-wise average operation) is performed on each of the feature dimensions. A 4096-dimensional vector is thereby generated to represent style feature.
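This patch-pooling step can be sketched as follows; `extract_patch_feature` is a hypothetical stub standing in for the fine-tuned AlexNet fc7 forward pass, and the patch size is an illustrative choice rather than a value stated in the paper:

```python
import numpy as np

FEATURE_DIM = 4096  # dimensionality of the AlexNet fc7 output

def extract_patch_feature(patch, rng):
    # Stand-in for the fine-tuned CNN forward pass; a real system would
    # return the fc7 activations for this patch.
    return rng.random(FEATURE_DIM)

def style_feature(image, n_patches=16, patch_size=64, seed=0):
    """Bin-wise max over features of randomly sampled patches (Sec. 2.2)."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    feats = []
    for _ in range(n_patches):
        y = int(rng.integers(0, h - patch_size + 1))
        x = int(rng.integers(0, w - patch_size + 1))
        patch = image[y:y + patch_size, x:x + patch_size]
        feats.append(extract_patch_feature(patch, rng))
    # Bin-wise max pooling; max outperformed average in our experiments.
    return np.max(np.stack(feats), axis=0)
```

The bin-wise max keeps, for each of the 4096 dimensions, the strongest response across all sampled patches, so a distinctive local pattern anywhere on the storefront dominates the final descriptor.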

2.3 Improving Text Features

The previous work [13] made it feasible to utilize text detection and bag-of-N-grams features for shop recognition. However, in experiments on our dataset, issues such as false positive detections and irrelevant text made text features unreliable.

We addressed this problem by training a linear classifier to predict the reliability of text detection. Specifically, for each text detection result, the corresponding 10000-dimensional text feature plus 4 geometrical features (the width, height, scale and shape of the bounding box) are used as input. A logistic regression classifier is trained to reject unreliable text detections. In practice, this simple linear classifier filtered out most of the false positives.
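The reliability check can be sketched as below; the exact definitions of the "scale" and "shape" features are not spelled out here, so box area and aspect ratio are used as assumptions, and `weights`/`bias` stand for a trained logistic regression:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def detection_reliability(text_feature, box_w, box_h, weights, bias):
    """Score one text detection with a trained logistic regression.

    text_feature: 10000-d bag-of-N-grams vector for the detected text.
    The four geometric features (width, height, scale, shape) are appended;
    scale and shape are assumed here to be box area and aspect ratio.
    """
    x = list(text_feature) + [box_w, box_h, box_w * box_h, box_w / box_h]
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return sigmoid(z)  # reject the detection when this falls below 0.5
```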

Another modification on text-based classification is made in this paper. [13] designed the n-gram score as a linear summation of bag-of-N-grams feature with predefined 0-or-1 weights; the larger n-gram score means the higher likelihood of the text belonging to a class. In this paper, the n-gram score is generalized to a linear classifier with learned weights, and we use classification score to predict the class type. Intuitively, the learned weight would avoid putting large weights on grams like single letters (e.g., “a”, “b”, etc.) or the common combinations (e.g., “ti”, “la”, which occur in many brands), thus should be more discriminative.

To handle the problem of overfitting, we reduce the feature dimensionality by truncating the feature vector, keeping only components which correspond to n-grams with a length less than or equal to N. We will refer to N as the order of the truncated text feature. In experiments, we found that the order-2 text feature performs best on our dataset (please refer to Section 4.1 for details).
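A minimal sketch of the truncation, with a toy n-gram vocabulary (the real vocabulary is the 10000-entry bag-of-N-grams dictionary):

```python
def truncate_text_feature(feature, ngrams, order):
    """Keep only the components whose n-gram has length <= order."""
    return [f for f, g in zip(feature, ngrams) if len(g) <= order]

# Toy vocabulary: with order 2, the trigram component "adi" is dropped.
ngrams = ["a", "d", "ad", "di", "adi"]
feature = [3, 1, 2, 2, 1]
truncated = truncate_text_feature(feature, ngrams, 2)
```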

2.4 Fusion of Style and Text

Style and text features describe very different properties of a storefront. Therefore, fusing the two features should improve classification accuracy.

The most straightforward way to fuse them is to concatenate the two feature vectors into a new vector and then perform logistic regression on it. Logistic regression is a natural scheme because the last layer of the fine-tuned style CNN can be seen as a linear classifier, and because of the linear extension we made to the text score in Section 2.3. We will refer to this model as the early FusionNet (E-FusionNet) model.
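A sketch of the E-FusionNet scoring under these definitions; `weights` and `bias` stand for a trained classifier, and the single-score form below would be applied once per class in a multi-class setting:

```python
import math

def e_fusionnet_score(text_feature, style_feature, weights, bias):
    """Early fusion: one logistic classifier over the concatenated feature."""
    x = list(text_feature) + list(style_feature)
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```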

As shown by [17], early fusion models usually perform worse than late fusion models. Based on that observation, we also tested late fusion. Specifically, we obtain a normalized text score and a normalized style score from two separate classifiers and add the two scores together as the final class score. The late fusion model can be expressed as

s = λ σ(W_t x_t) + (1 − λ) σ(W_s x_s),   (1)

where s is the class score, x_t is the text feature vector, x_s is the style feature, W_t and W_s are the linear classifier weights learned separately from text features and style features, σ is the sigmoid activation function, and λ is a tunable parameter that controls the weights of the two scores. The selection of λ is discussed in Section 4.1. We will refer to the model in equation 1 as the L-FusionNet model.
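The late fusion score follows directly from this definition; the single-class sketch below uses one weight vector per branch (in practice there is one such score per class):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l_fusionnet_score(text_feature, style_feature, w_text, w_style, lam):
    """Late fusion: lam * sigmoid(text score) + (1 - lam) * sigmoid(style score)."""
    s_text = sigmoid(sum(w * v for w, v in zip(w_text, text_feature)))
    s_style = sigmoid(sum(w * v for w, v in zip(w_style, style_feature)))
    return lam * s_text + (1.0 - lam) * s_style
```

Setting lam to 1 or 0 recovers the text-only or style-only classifier respectively, which is the degenerate behavior examined in the experiments.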

3 Topological Map Construction

Shopping mall operators usually provide shoppers with sufficient topological information about the shopping mall in the form of shopping instructions. A typical set of shopping instructions consists of an indicator map and a list of shops. In this section, we introduce our method of building a topological map from shopping instructions.

3.1 Text Detection and Recognition

In most shopping mall indicator maps, there is a significant contrast between text elements and other components. Therefore, text can be extracted by detecting maximally stable extremal regions (MSERs) [18]. We detect MSERs in the image as connected components, which are then filtered based on their size, eccentricity and aspect ratio (i.e., w/h, where w is the width of the bounding box and h is the height). The filtered connected components are clustered using the run length smoothing algorithm (RLSA) [19] and then recognized by the open source tesseract-ocr software package [20]. After being localized and recognized, text and icons are removed using the inpainting methods introduced in [21] and [22].
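The geometric filtering of MSER components might look like the sketch below; all thresholds are illustrative assumptions rather than the paper's tuned values:

```python
def is_text_candidate(area, eccentricity, box_w, box_h,
                      min_area=30, max_area=5000,
                      max_ecc=0.98, min_ratio=0.1, max_ratio=10.0):
    """Filter an MSER connected component by size, eccentricity and
    aspect ratio (w/h). Thresholds here are illustrative assumptions."""
    ratio = box_w / box_h
    return (min_area <= area <= max_area
            and eccentricity <= max_ecc
            and min_ratio <= ratio <= max_ratio)
```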

3.2 Road and Shop Segmentation

We apply Statistical Region Merging (SRM) [23] for map segmentation. This segmentation method can be tuned by a Q value, which indicates the approximate number of expected segments. We are particularly interested in the road component, on which our topological map is built; therefore, we perform an initial SRM on the whole map with a small Q (16 in our experiment). This gives a coarse output in which the road component, shop blocks and other components, such as the background, are separated. To identify the road component, we calculate a road score for every component. The score is defined as

score(c) = H(c) + A(c) − d(c)

where c denotes the component, H(c) is the number of holes inside c, and A(c) is the area of c's circumscribed rectangle. d(c) is the distance between c's centroid and the center of the whole image. H(c) is defined as

H(c) = Σ_r 1[r is a hole enclosed by c]

where 1[·] is the indicator function. All these values are normalized into the same scale. The component with the largest road score is considered the road component.
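The road-component selection can be sketched in a few lines; the additive combination of the three normalized terms (holes and area raise the score, distance from the image center lowers it) is our reading of the description and should be treated as an assumption:

```python
def normalize(values):
    """Min-max normalize a list of values to [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def pick_road_component(holes, areas, dists):
    """Return the index of the component with the largest road score.

    holes: hole count per component; areas: circumscribed-rectangle area;
    dists: centroid-to-image-center distance. The additive combination of
    normalized terms is an assumption.
    """
    h, a, d = normalize(holes), normalize(areas), normalize(dists)
    scores = [hi + ai - di for hi, ai, di in zip(h, a, d)]
    return scores.index(max(scores))
```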

The rest of connected components are shop blocks. A second SRM is performed on the indicator map with a larger Q value (512 in our experiment) to separate these shop blocks. This operation is confined within the area wrapped by the road component’s circumscribed rectangle so that items on the background will not interfere with the subsequent road segmentation process.

The road component is further segmented for navigation. We treat each pixel of the road component as an observation spot. For each observation spot, the algorithm searches its neighborhood for shop blocks and stores the IDs of such blocks as landmarks of the spot. Pixels (spots) with the same landmarks are grouped together to form a node.
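The landmark-based grouping can be sketched as:

```python
def group_spots_into_nodes(spot_landmarks):
    """spot_landmarks maps a road pixel (x, y) to the frozenset of shop-block
    IDs found in its neighborhood; pixels sharing the same landmark set are
    merged into one topological node."""
    nodes = {}
    for pixel, landmarks in spot_landmarks.items():
        nodes.setdefault(landmarks, []).append(pixel)
    return nodes
```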

3.3 Shop List Parsing

When wandering in a shopping mall, a shopper is interested in the name of a shop rather than its ID number on the indicator map. Therefore, it is important to map shop IDs to shop names by parsing the shop lists.

A shop list is cut into several blocks using the Recursive XY-cut algorithm. Each of the blocks is a column of shop names, shop IDs or a mixture of both. We then split or merge segments so that each segment contains only one column of shop names and one column of shop ID numbers. Such segments are split into different lines where each line is a mixture of a shop name and a shop ID. A name-ID map is finally constructed between the name and the ID on such lines.
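A minimal sketch of the final name-ID pairing on the split lines; the "B12"-style ID pattern is a hypothetical example format, not the paper's specification:

```python
import re

def parse_name_id_lines(lines):
    """Build the name-ID map from OCR lines that each mix a shop name and a
    shop ID, e.g. 'Adidas B12'. The ID pattern here is an assumed format."""
    mapping = {}
    for line in lines:
        match = re.match(r"(.+?)\s+([A-Z]-?\d+)$", line.strip())
        if match:
            mapping[match.group(2)] = match.group(1)
    return mapping
```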

A topological map is constructed based on the topological and semantic information extracted in all steps above. An undirected weighted graph is constructed to represent the topological layout of the shopping mall. Each node has a landmark array storing IDs of all shop blocks in its neighborhood. The weight of an edge is the distance between the centroids of two nodes.

4 Experiments

Figure 2: 2-D embedding visualization of style feature using t-SNE [24]. Five groups of clustered images are shown in the colored boxes. The supplementary material includes a high-resolution version of this visualization.

4.1 Storefront Recognition

The accuracy for shop classification is the metric we are interested in. In our experiment, we compared our feature fusion models with style-only models and text-only models. Moreover, we compared our method with [13]. The result is shown in Table 1.

Style Features

For style feature extraction, we fine-tuned AlexNet on our dataset. Our implementation is based on the deep learning framework Caffe [25]. After 100,000 iterations of SGD optimization, with a learning rate of 1e-7 for all convolutional layers and 1e-3 for all fully connected layers, the test accuracy reached 44.67%. This classification accuracy on individual patches is satisfactory, given that some shop brands have very similar visual appearances and the accuracy of a random guess is below 2%. After the bin-wise max operation, our style feature achieved an accuracy of 66.14% with logistic regression, denoted by "Style+LR" in Table 1.

To better understand what “style” exactly represents, we project style features into a 2-D embedding by applying t-SNE[24], and place each dataset image on a 2-D location according to its style feature. As shown in the colored boxes in Fig. 2 in left-to-right, top-to-bottom order, we see some interesting patterns within some clusters, like “red blocks in image”, “thin doors in image”, “vertical/horizontal textures”, “black door plates” and “blue door plates”.

Text Features We tested [13]'s n-gram-score-based method with and without false positive detection (denoted by ALL and FPD in Table 1), with shop prediction accuracies of 50.26% and 51.83% respectively. In contrast, our text features with false positive rejection and a linear classifier (denoted by "+LR" in Table 1) achieved higher accuracy regardless of the order N. This performance improvement is likely due to the fact that the learned weights rely less on common grams and are therefore more discriminative. Interestingly, "+LR" achieved its best accuracy when N = 2. Our interpretation of this result is that the dictionary of all shop names is usually small and therefore does not require high-order text features to encode. High-order text features require a larger number of parameters during model training and can therefore cause overfitting when the training set is small.

Fusion As shown in Table 1, E-FusionNet models outperform the style-only and n-gram classification schemes by a large margin. The best E-FusionNet scheme (using order-2 text features) achieves an accuracy of 82.55%, which is consistent with the previous text feature experiment.

For L-FusionNet models, we set λ in equation 1 to the value that performs best on the training set. All 4 L-FusionNet models (using text features of order N = 1 to 4) reach their best training-set performance at the same λ (see Fig. 3). The L-FusionNet models' accuracies at this λ are shown in Table 1. As shown in the table, L-FusionNet reaches the highest accuracy of 86.39% with order-2 text features, and outperforms the E-FusionNet model significantly.

Overfitting could be the reason for the performance deterioration of E-FusionNet models, because the number of training samples (at most 2,307) is much smaller than the number of parameters required for model training (the concatenated feature dimensionality exceeds 14,000 when the full 10000-dimensional text feature is used). The L-FusionNet scheme suffers less from this issue, because the dimension of the style feature is 4096 and the dimension of the text feature can be reduced to 558 by truncation.

To see how λ affects L-FusionNet models, we further tested the L-FusionNet model with different λ values on the test set. As shown in Fig. 3, when λ reaches 1 or 0, the L-FusionNet model degenerates to a text-only or style-only model respectively, and performs poorly. The accuracy rises as λ moves away from either extreme. This observation shows that text features and style features compensate for each other's deficiencies when fused together.

Figure 3: Performance of the L-FusionNet model on our dataset under different configurations. N denotes the order of the text features.
Method | N = 1 | N = 2 | N = 3 | N = 4 | Best
Style+LR | – | – | – | – | 66.14
Wang [13] (ALL) | – | – | – | – | 50.26
Wang [13] (FPD) | – | – | – | – | 51.83
+LR | 60.03 | 61.78 | 61.43 | 61.08 | 61.78
E-FusionNet | 78.36 | 82.55 | 81.33 | 80.63 | 82.55
L-FusionNet | 85.17 | 86.39 | 85.86 | 85.69 | 86.39
Table 1: Comparison of shop classification accuracy (%) between [13]'s method, the style-only method, the text-only method, the early fusion (E-FusionNet) method and the late fusion (L-FusionNet) method on the test set.
Figure 4: Topological map construction on different indicator maps and simulated path planning. (a)(b) show two different indicator maps with their segmented road components and the resulting topological maps. (c) shows examples of path planning.

Figure 5: Real-scene localization. Yellow stars indicate averaged position estimation. Green circles indicate the ground truth.

4.2 Map Construction, Navigation and Localization

To show how our system works in real-life applications, we tested our topological map construction module on shop indicator maps collected from different shopping malls using a cellphone camera. For path planning, the standard Dijkstra algorithm is run on our topological map to find the shortest path between the origin and the destination. For localization, we recorded a video in a shopping mall using a camera mounted on a wheelbarrow. A few representative frames containing a storefront were picked out to test our system. Recognized shops are matched against the topological map constructed in Section 3, and a small group of nodes is retrieved as the location estimate.
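The path-planning step is standard Dijkstra on the weighted topological graph; a self-contained sketch on a toy graph (node names are illustrative):

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra on graph: node -> list of (neighbor, edge_weight).
    Returns the node list from src to dst, or None if unreachable."""
    dist = {src: 0.0}
    prev = {}
    pq = [(0.0, src)]
    done = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in done:
            continue
        done.add(u)
        if u == dst:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]
```

In our system, nodes are the landmark-grouped road segments and edge weights are centroid distances, so the returned node sequence is the route shown to the user.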

Experimental results show that our method works on indicator maps with different layouts and design styles (see Fig. 4). Because the information available to the system is limited, in many cases the user is located on several nodes near the ground-truth position. However, if we assume that the user takes photos near each shop and give larger weights to nodes closer to the corresponding shop block, a weighted average can be calculated to refine the output (see Fig. 5).
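The weighted-average refinement can be sketched as follows; the proximity-based weighting scheme is an assumption consistent with the description above:

```python
def refine_position(weighted_nodes):
    """weighted_nodes: list of ((x, y) node centroid, weight) pairs, where the
    weight is assumed to grow with proximity to the recognized shop block.
    Returns the weighted-average position estimate."""
    total = sum(w for _, w in weighted_nodes)
    x = sum(c[0] * w for c, w in weighted_nodes) / total
    y = sum(c[1] * w for c, w in weighted_nodes) / total
    return (x, y)
```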

5 Conclusion

In this article, we proposed an indoor positioning system that successfully works in shopping malls. We put forward a feature fusion scheme that fuses high-level style feature and text feature of a storefront image for accurate shop recognition. We designed an automatic method of interpreting shopping instructions for topological map construction. We showed by experiments that feature fusion can improve the accuracy of shop recognition and our system works well in real scenes.

While we see performance improvements from introducing early/late feature fusion, an end-to-end structure could be a better fusion scheme. Also, an abstract topological map may not be precise enough for complicated tasks; this problem could be handled by integrating low-level features through SLAM. We leave these questions for future research.


  • [1] Ahmed Wasif Reza and Tan Kim Geok, “Investigation of indoor location sensing via rfid reader network utilizing grid covering algorithm,” Wireless Personal Communications, vol. 49, no. 1, pp. 67–80, 2009.
  • [2] Xiaohan Liu, Hideo Makino, and Kenichi Mase, “Improved indoor location estimation using fluorescent light communication system with a nine-channel receiver,” IEICE Transactions, vol. 93-B, no. 11, pp. 2936–2944, 2010.
  • [3] N. Chang, R. Rashidzadeh, and M. Ahmadi, “Robust indoor positioning using differential wi-fi access points,” IEEE Transactions on Consumer Electronics, vol. 56, no. 3, pp. 1860–1867, Aug 2010.
  • [4] Robert Huitl, Georg Schroth, Sebastian Hilsenbeck, Florian Schweiger, and Eckehard Steinbach, “Virtual reference view generation for cbir-based visual pose estimation,” in ACMMM. ACM, 2012, pp. 993–996.
  • [5] Jason Zhi Liang, Nicholas Corso, Eric Turner, and Avideh Zakhor, “Image based localization in indoor environments,” in COM. Geo. IEEE, 2013, pp. 70–75.
  • [6] Kai Guan, Lin Ma, Xuezhi Tan, and Shizeng Guo, “Vision-based indoor localization approach based on surf and landmark,” in IWCMC. IEEE, 2016, pp. 655–659.
  • [7] Raul Mur-Artal, JMM Montiel, and Juan D Tardós, “Orb-slam: a versatile and accurate monocular slam system,” TOR, vol. 31, no. 5, pp. 1147–1163, 2015.
  • [8] Dorian Gálvez-López and Juan D Tardos, “Bags of binary words for fast place recognition in image sequences,” TOR, vol. 28, no. 5, pp. 1188–1197, 2012.
  • [9] Christian Forster, Matia Pizzoli, and Davide Scaramuzza, “Svo: Fast semi-direct monocular visual odometry,” in ICRA. IEEE, 2014, pp. 15–22.
  • [10] Ji Zhang, Volker Grabe, Brad Hamner, Dave Duggins, and Sanjiv Singh, “Compact, real-time localization without reliance on infrastructure.”
  • [11] Ji Zhang and Sanjiv Singh, “Visual-lidar odometry and mapping: Low-drift, robust, and fast,” in ICRA. IEEE, 2015, pp. 2174–2181.
  • [12] “Microsoft indoor localization competition,” http://ipsn.acm.org/2016/competition.html?v=1.
  • [13] Shenlong Wang, Sanja Fidler, and Raquel Urtasun, “Lost shopping! monocular localization in large indoor spaces,” in ICCV, 2015.
  • [14] Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet, “Multi-digit number recognition from street view imagery using deep convolutional neural networks,” 2013.
  • [15] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, “Synthetic data and artificial neural networks for natural scene text recognition,” 2014.
  • [16] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, pp. 1097–1105. Curran Associates, Inc., 2012.
  • [17] Cees GM Snoek, Marcel Worring, and Arnold WM Smeulders, “Early versus late fusion in semantic video analysis,” in Proceedings of the 13th annual ACM international conference on Multimedia. ACM, 2005, pp. 399–402.
  • [18] David Obdržálek, Stanislav Basovník, Lukáš Mach, and Andrej Mikulík, “Detecting scene elements using maximally stable colour regions,” in EUROBOT. Springer, 2009, pp. 107–115.
  • [19] Nikos Nikolaou, Michael Makridis, Basilis Gatos, Nikolaos Stamatopoulos, and Nikos Papamarkos, “Segmentation of historical machine-printed documents using adaptive run length smoothing and skeleton segmentation paths,” Image Vision Comput., vol. 28, no. 4, pp. 590–604, Apr. 2010.
  • [20] Ray Smith, “An overview of the tesseract ocr engine,” in ICDAR, 2007, pp. 629–633.
  • [21] Damien Garcia, “Robust smoothing of gridded data in one and higher dimensions with missing values,” Computational Statistics & Data Analysis, vol. 54, no. 4, pp. 1167 – 1178, 2010.
  • [22] Guojie Wang, Damien Garcia, Yi Liu, Richard De Jeu, and A Johannes Dolman, “A three-dimensional gap filling method for large geophysical datasets: Application to global satellite soil moisture observations,” Environmental Modelling & Software, vol. 30, pp. 139–142, 2012.
  • [23] Richard Nock and Frank Nielsen, “Statistical region merging,” TPAMI, vol. 26, no. 11, pp. 1452–1458, Nov. 2004.
  • [24] Laurens van der Maaten and Geoffrey Hinton, “Visualizing data using t-sne,”

    Journal of Machine Learning Research

    , vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [25] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACMMM. ACM, 2014, pp. 675–678.