Improving Place Recognition Using Dynamic Object Detection

by Juan Pablo Munoz, et al.
CUNY Law School

Traditional appearance-based place recognition algorithms based on handcrafted features have proven inadequate in environments with a significant presence of dynamic objects, that is, objects that may or may not be present in an agent's subsequent visits. Place representations built from features extracted with Deep Learning approaches have gained popularity for their robustness and for the accuracy of the algorithms that use them. Nevertheless, handcrafted features remain popular on devices with limited resources. This article presents a novel approach that improves place recognition in environments populated by dynamic objects by incorporating knowledge of these objects into the representations of places used for matching. The proposed approach fuses object detection and place description, Deep Learning and handcrafted features, while also reducing memory and storage requirements. Evaluated on both synthetic and real-world datasets, the proposed approach yields improved place recognition accuracy. Its adoption can significantly improve place recognition results in indoor and outdoor environments populated by dynamic objects and explored by devices with limited resources.




1 Introduction

Appearance-based place recognition is a crucial component of mapping, localization and navigation applications, which assist agents in their exploration of indoor and outdoor environments. By recognizing places, these agents can better plan their paths to a desired destination and/or correct errors when performing Simultaneous Localization and Mapping (SLAM). The importance of accurate and rapid visual place recognition is even more critical in situations where agents cannot rely on Global Positioning System (GPS) or other technologies to confirm that they are revisiting a place, such as in indoor environments.

Image-based approaches have proven to be robust methods for recognizing places Williams et al. (2009). When agents use appearance-based place recognition, they attempt to infer their location from matching information about their current environment, gathered by their visual sensors, with a database of information about previously-visited locations. State-of-the-art devices that use sophisticated methods for appearance-based place recognition have shown outstanding performance in mapping and localization tasks Lee and Dugan . Researchers have exploited the capabilities of these devices in a variety of applications, including indoor navigation Muñoz et al. (2016) Muñoz et al. (2017).

Indoor and outdoor places alike are usually populated with dynamic objects, that is, objects that are not guaranteed to be present or in the same location in future observations of the place. A significant presence of these dynamic objects can cause traditional appearance-based place recognition algorithms to fail. In this article, we present a novel approach that improves the representation of places by augmenting traditional image-based place representation schemes with high-level visual information about dynamic objects. Approaches based on feature tracking can detect dynamic objects that are moving at the time of an agent’s observation, but the approach we present accommodates both moving and motionless dynamic objects.

Our approach produces the following contributions:

  • conceptualization of the validity of a place representation based on the presence of dynamic objects. We describe how this notion of a valid place representation can be used to make efficiency improvements to traditional place recognition algorithms, and to measure the quality of an agent’s observation;

  • classification of existing place recognition techniques into rigid or flexible depending on the malleability of their place representation;

  • reduction in the size of the original representation used by suitable place recognition algorithms;

  • reduction in the size of the database of places visited by an agent;

  • reduction in the time required to match two places;

  • improvement in the accuracy of place recognition in environments populated by dynamic objects.

This article is organized as follows: Section 2 discusses related work in appearance-based place recognition, object classification, and localization. Section 3 describes the proposed method to improve place representations. Section 4 explains how the proposed method can be incorporated in state-of-the-art place recognition algorithms. Section 5 presents an evaluation of the proposed approach.

2 Related Work

2.1 Appearance-based Place Recognition

Appearance-based place recognition approaches have substantially improved their effectiveness in the past few years, but there is still room for improvement. Early approaches were only capable of deciding whether an agent was visiting a particular room based on multiple images taken from multiple different viewpoints Ulrich and Nourbakhsh (2000). More recently, sophisticated approaches are capable of localizing an agent with great accuracy based on a single image that is associated with a pose of the agent, e.g., Gálvez-López and Tardós (2012), Cummins and Newman (2011), Sünderhauf and Protzel (2011), Milford and Wyeth (2012), Johns and Yang (2013), Stumm et al. (2013), Pepperell et al. (2014), Arroyo et al. (2014), Siam and Zhang (2017). These latter approaches use sophisticated human-crafted feature detectors and descriptors to produce robust place representations. Lately, several feature detectors and binary descriptors, e.g., Learned Arrangements of Three Patch Codes (LATCH) Levi and Hassner (2016), have been proposed that produce compact and precise representations in a fraction of the time required by traditional approaches like Scale Invariant Feature Transform (SIFT) Lowe (1999) Lowe (2004) and Speeded-Up Robust Features (SURF) Bay et al. (2008). A breakthrough in local feature detection occurred when Features from Accelerated Segment Test (FAST) Rosten and Drummond (2006), a corner detector that applies machine learning techniques, was developed. The development of FAST was inspired by the Univalue Segment Assimilating Nucleus (USAN) principle Smith and Brady (1997), which had already been used to quickly detect corners. Improvements to the FAST detector produced Adaptive and Generic Corner Detection Based on the Accelerated Segment Test (AGAST) Mair et al. (2010), which uses a combination of generic decision trees instead of the environment-specific decision trees of the original FAST algorithm.

Along with the success of feature detection and description techniques (e.g., FAST, AGAST, SIFT, and SURF), the development of Bags of Visual Words Sivic and Zisserman (2003) Fei-fei et al. (2005) brought an efficient method for quantizing descriptors into “visual words” using a distance metric, e.g., Euclidean distance. A vector, the bag of visual words, can then represent the collection of visual words in an image. Matching images then becomes a problem of finding images that have the most similar arrangement of visual words. Several improvements to the bags of visual words approach have been proposed throughout the years, with the vocabulary tree being among the most successful Nistér and Stewénius (2006). FABMAP, a turning point in place recognition frameworks, used bags of words to perform place recognition by modeling the correlation of visual words in an agent’s observation Cummins and Newman (2011). Kejriwal et al. Kejriwal et al. (2016) proposed the use of an additional vocabulary of word pairs that has proven to be effective in dealing with the problem of perceptual aliasing.
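As a concrete illustration, descriptor quantization and bag construction can be sketched as follows. This is a minimal nearest-centroid version with a flat search rather than a vocabulary tree, and the function names are illustrative:

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Map each descriptor to the index of its nearest visual word.

    descriptors: (n, d) array of feature descriptors.
    vocabulary:  (k, d) array of cluster centers ("visual words").
    Returns an (n,) array of word indices under Euclidean distance.
    """
    # Pairwise squared Euclidean distances between descriptors and words.
    dists = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def bag_of_words(word_ids, k):
    """Histogram of visual-word occurrences: the bag-of-visual-words vector."""
    return np.bincount(word_ids, minlength=k)
```

In vocabulary-tree systems the flat argmin scan above is replaced by a descent through a hierarchical tree, so each lookup touches only a few nodes per level instead of every word.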

More recently, the advent of binary descriptors made it easier to implement real-time place recognition applications, since these descriptors require orders of magnitude less construction time than approaches like SIFT and SURF. The BRIEF-Gist Sünderhauf and Protzel (2011) approach to place recognition proved that using a very simple representation, composed of a very small number of Binary Robust Independent Elementary Features (BRIEF) Calonder et al. (2010) descriptors, could yield performance levels competitive with more sophisticated approaches, like FABMAP. Later, the Bags of Binary Words Gálvez-López and Tardós (2012) approach showed how BRIEF descriptors could be quantized into visual words to efficiently and accurately recognize places. The BRIEF descriptor is not invariant to rotation and scale, but more sophisticated binary descriptors have been proposed; e.g., Binary Robust Invariant Scalable Keypoints (BRISK) Leutenegger et al. (2011), Oriented FAST and Rotated BRIEF (ORB) Rublee et al. (2011), and Fast Retina Keypoint (FREAK) Alahi et al. (2012). The arrival of these more robust binary descriptors (some of them invariant to rotations and scale, and others more tolerant to slight changes in viewpoint and illumination), has made possible further developments in place recognition systems. Some approaches have included additional information to describe places. For instance, ABLE-S added depth information to the place representation in order to make it more robust Arroyo et al. (2014).

In the past few years, following the success that Deep Artificial Neural Networks have had in image classification Krizhevsky et al. (2012) Russakovsky et al. (2015), appearance-based place recognition approaches have incorporated Deep Learning techniques with interesting results Chen et al. (2013). For instance, approaches based on Convolutional Neural Networks (CNNs) have shown to be capable of achieving real-time place recognition with great accuracy Sunderhauf et al. (2015). We expect that Deep Learning techniques will continue to permeate place recognition in the near future. However, handcrafted feature detectors and descriptors are still fast and efficient solutions for place recognition systems. Deep Learning approaches require massive datasets for training that are not usually available for new environments in which place recognition will be performed.

In this article, we combine both techniques, i.e., traditional handcrafted feature detection and description with Deep Learning-based detection of objects. As opposed to Hou et al. (2017), in which the focus is on improving place recognition from within a CNN landmark-based framework, this article shows the inadequacy of approaches that rely on handcrafted feature detection and description, especially in environments with a significant presence of dynamic objects. Once the limitations of traditional approaches have been explained, we present effective solutions to overcome these limitations, solutions that can also be used on devices with limited resources. Furthermore, we show that by identifying and proposing solutions to the deficiencies of traditional approaches, we can also introduce useful notions, such as the validity of a place representation discussed in Section 4.3.

2.2 Recognition of Dynamic Objects

The problem of identifying dynamic objects in an agent’s visual observation is essentially a problem of image classification. The goal of image classification is to assign a class to the whole image or a portion of it (the area that contains the detected object). Traditionally, researchers have used handcrafted features to recognize objects. There have also been attempts at using biologically inspired approaches, e.g., Kostavelis et al. (2012), to recognize and classify objects using saliency maps.

Image classification approaches have reached very high accuracy in their prediction results in the past few years. These successes in image classification have been driven by embracing Deep Learning approaches, exemplified in the drastic reduction of image classification error rates in the ImageNet competition Russakovsky et al. (2015). While handcrafted feature detection and description were used in the past, those approaches were never able to achieve an error rate in the single digits. Deep Learning approaches, on the other hand, currently reach single-digit error rates in this competition.

Deep Learning approaches are being used in sophisticated object detection and localization approaches. Among the most efficient and popular object detectors are unified, single-shot detectors, e.g., You Only Look Once (YOLO) Redmon et al. (2016) Redmon and Farhadi (2017) and Single-shot Detector (SSD) Liu et al. (2016), and two-stage detectors, e.g., Region-based CNN (R-CNN), Fast R-CNN Girshick (2015), and Faster R-CNN Ren et al. (2017). Nowadays, object detection can be accomplished in real-time and with great accuracy.

Object recognition has also been discussed in the context of the growing interest for the construction of semantic maps, that is, maps that include high level information about places in addition to geometrical or topological information. Object recognition has been used in conjunction with other techniques, including place recognition, and it is often considered as an important step in the construction of semantic maps Kostavelis and Gasteratos (2017) Kostavelis and Gasteratos (2013).

3 Combining Place Recognition and Dynamic Object Detection

This article addresses the limitations of traditional appearance-based place recognition approaches, which have proven inadequate in assisting agents to recognize a previously visited place if the environment is densely populated by dynamic objects. Our focus is on agents that capture monocular images that may contain dynamic objects, i.e., objects that do not have a fixed position in a place, such as a cyclist or pedestrian, or objects that appear or disappear between observations of a place, such as a parked car. The next time that an agent visits the place, these dynamic objects may not be present or may be occupying a different space in the captured image. This unpredictability in the presence of dynamic objects can negatively affect traditional place recognition algorithms because features extracted from dynamic objects corrupt place representations. For instance, the presence of dynamic objects may change the set of features extracted from observations of the same place between visits. Additionally, the features extracted from dynamic objects may generate different geometric arrangements, negatively affecting additional verification steps.

Most place recognition algorithms use pose-based representations, that is, places are represented by a multiset, $P$, of $n$ descriptors, $d_i$, that are produced by an algorithm that extracts and describes salient features detected in the observation of each place, when the agent is at a particular pose:

$P = \{d_1, d_2, \ldots, d_n\}$  (1)

For instance, one version of the place recognition algorithm BRIEF-Gist Sünderhauf and Protzel (2011) represents a place with a single BRIEF descriptor generated from a predetermined keypoint at the center of a downsampled image. Each place representation is of size one ($n = 1$), and there is no guarantee that the single descriptor has not been generated from pixels that are part of a dynamic object in the scene. Another place recognition algorithm, FABMAP Cummins and Newman (2011), on the other hand, uses a vector of visual words. Each of these words is a quantized descriptor, and collectively they represent a place. Hence, the definition of a pose-based place representation above is a generalization of the representation used by different approaches to place recognition.

The hypothesis this article confirms is that for the purposes of place recognition, an ideal representation of a place must only include descriptors generated from visual features that can be reliably observed the next time an agent visits the place. That is, a place representation should not incorporate descriptors that are “dirty,” i.e., that describe, at least in part, dynamic objects present in the agent’s observation.

It is important to understand how dynamic objects affect descriptors in the place representation, which has the ultimate effect of corrupting the ideal representation of a place. We define the extent of the descriptor, $E(d)$, as the set of pixels, $\{p_1, \ldots, p_n\}$, in the original image, $I$, that were used to generate the descriptor of a detected local feature (Equation 2). The pixels in the extent of the descriptor may be used either directly or after being transformed, e.g., when applying a filter to the original image. Depending on the algorithm used to generate the descriptor, its extent may include the feature keypoint.

$d = FD(\{p_1, \ldots, p_n\}), \quad p_i \in I$  (2)

where $FD$ is the procedure (e.g., SIFT, SURF, FREAK, ORB) that takes a set of $n$ pixels (usually located around the detected keypoint) as input and produces a descriptor, $d$. For instance, the ORB descriptor compares pairs of pixels in the vicinity of the keypoint to generate a binary descriptor.
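A toy BRIEF-style $FD$ makes the notion of an extent concrete. The sampling pattern and function name below are illustrative, not the actual BRIEF or ORB pattern:

```python
import numpy as np

def brief_like_descriptor(image, keypoint, pairs):
    """Toy BRIEF-style FD: compare pairs of pixels sampled around a keypoint.

    image:    2D array of pixel intensities.
    keypoint: (row, col) of the detected feature.
    pairs:    list of ((dr1, dc1), (dr2, dc2)) offsets relative to the keypoint.
    Returns (bits, extent), where `extent` is the set of sampled pixel
    coordinates, i.e., the extent E(d) of the descriptor.
    """
    r, c = keypoint
    bits, extent = [], set()
    for (dr1, dc1), (dr2, dc2) in pairs:
        p1, p2 = (r + dr1, c + dc1), (r + dr2, c + dc2)
        extent.update([p1, p2])
        # Each intensity comparison contributes one bit to the descriptor.
        bits.append(1 if image[p1] < image[p2] else 0)
    return bits, extent
```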

We classify each of the pixels in the original image as being part of either dynamic or static objects. If the extent of the descriptor includes a pixel that belongs to a dynamic object, then we say that $d$ belongs to the class $DC$, the class of descriptors that are affected by dynamic objects. Otherwise, $d$ belongs to the class $SC$, that is, the class of descriptors that are generated only from pixels that lie in static objects in the original image. Hence, $P$ is the finite multiset formed by the union of the pairwise disjoint multisets $SC$ and $DC$,

$P = SC \cup DC, \quad SC \cap DC = \emptyset$  (3)
It may not always be effective to classify as a member of $DC$ a descriptor whose extent contains just a few pixels in dynamic objects. We can relax the definition of $DC$ by defining a sensitivity threshold, $\tau$, which indicates the proportion of the extent of the descriptor that may be affected by dynamic objects, i.e., the pixels that belong to dynamic objects. This threshold can be used to control how sensitive membership in $DC$ is. That is, if the proportion of pixels in the extent of descriptor $d$ that are in a dynamic object exceeds the sensitivity threshold, then the descriptor $d$ belongs to $DC$.
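Given a descriptor's extent and a per-pixel mask of dynamic objects, membership in $DC$ under the sensitivity threshold can be sketched as follows (a minimal illustration; the mask representation is an assumption):

```python
def classify_descriptor(extent, dynamic_mask, sensitivity=0.25):
    """Assign a descriptor to DC or SC based on its extent.

    extent:       iterable of (row, col) pixels used to build the descriptor.
    dynamic_mask: set of (row, col) pixels labelled as part of dynamic objects.
    The descriptor joins DC when the fraction of its extent that falls on
    dynamic pixels exceeds the sensitivity threshold.
    """
    extent = list(extent)
    affected = sum(1 for p in extent if p in dynamic_mask)
    return "DC" if affected / len(extent) > sensitivity else "SC"
```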

In the following section, we use the ideas in the paragraphs above to classify popular place recognition algorithms based on their place representations. Next, we show how our proposed approach overcomes the limitations of traditional place recognition algorithms in environments highly populated by dynamic objects. We evaluate our approach using the improved version of the algorithm. Later, in Section 4.2, we discuss how to use Deep Learning-based object detectors and common properties of feature descriptors, e.g., that they tend to be isotropic, to quickly estimate which descriptors belong to $DC$.
4 Incorporating Dynamic Objects into Place Recognition Algorithms

We begin by classifying which place recognition algorithms are suitable for our approach. For some place recognition approaches, applying our proposed approach will improve their place representation, and thus their performance, in environments populated by dynamic objects. Other place recognition systems, however, cannot avoid being negatively affected by dynamic objects; further, some systems offer no mechanism by which we can take into account information from dynamic objects present in the agent’s observation. We classify place representations used in place recognition algorithms as either rigid or flexible. We consider a place representation to be rigid when the representation does not allow an easy modification to remove the impact of dynamic objects present in the place. On the other hand, we consider a place representation to be flexible if it directly allows modification by incorporating information about dynamic objects observed by the agent. Table 1 lists a few popular place recognition approaches and how we classified the type of place representation that they use. For example, we classified both BRIEF-Gist Sünderhauf and Protzel (2011) and ABLE-S Arroyo et al. (2014) as rigid because the position of the keypoint(s) and the pixels that will be sampled to construct the descriptor (the extent of the descriptor) are predetermined. That is, because the underlying algorithm relies on each of these descriptors with predetermined locations, we cannot remove any descriptors, even if we determine them to be in $DC$.

Approach                                               Representation
FABMAP Cummins and Newman (2011)                       FLEXIBLE
BRIEF-Gist Sünderhauf and Protzel (2011)               RIGID
SeqSLAM Milford and Wyeth (2012)                       RIGID
Bags of Binary Words Gálvez-López and Tardós (2012)    FLEXIBLE
Cooc-Map Johns and Yang (2013)                         FLEXIBLE
COVISMAP Stumm et al. (2013)                           FLEXIBLE
SMART Pepperell et al. (2014)                          RIGID
ABLE-S Arroyo et al. (2014)                            RIGID
Fast-SeqSLAM Siam and Zhang (2017)                     RIGID
Table 1: Examples of place recognition algorithms and our classification of their respective place representations.

On the other hand, the Bags of Binary Words approach proposed by Gálvez-López et al. Gálvez-López and Tardós (2012) is one example from Table 1, in which it is possible to modify the place representation to take into account the presence of dynamic objects. We chose to extend this approach for the evaluation of our proposal. The following paragraphs present a summary of Gálvez-López’s Bags of Binary Words framework.

4.1 Bags of Binary Words

This approach was the first to use binary descriptors with the Bag of Visual Words paradigm. At first, this approach used the BRIEF descriptor, but there have also been implementations that use ORB descriptors Rublee et al. (2011), which have the benefit of being rotation invariant, something that BRIEF lacks.

In the Bags of Binary Words (BoBW) paradigm, first, a vocabulary tree is built from the discretization of the binary descriptor space. The final structure, a hierarchical tree, allows for efficiently matching place representations (i.e. bags of visual words). By using binary descriptors and the Hamming distance, BoBW is capable of reducing the computation time required for matching bags of visual words by one order of magnitude compared to the time required by other popular approaches, e.g., Cummins and Newman (2011) and Stumm et al. (2013).
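The speed advantage comes from the Hamming distance, which for binary descriptors packed into machine words reduces to an XOR followed by a population count. A minimal sketch:

```python
def hamming(d1: int, d2: int) -> int:
    """Hamming distance between two binary descriptors packed as integers.

    XOR isolates the differing bits; counting the set bits of the result
    gives the distance. This is the cheap metric that lets BoBW compare
    binary descriptors much faster than Euclidean-distance matching of
    floating-point descriptors like SIFT or SURF.
    """
    return bin(d1 ^ d2).count("1")
```

Real implementations operate on fixed-width descriptors (e.g., 256-bit ORB strings) using hardware popcount instructions, but the arithmetic is the same.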

The Bags of Binary Words approach uses an inverted index, a common structure in Bag of Visual Words approaches, to quickly find images where a particular word is present. Gálvez-López et al. augment this index to include the weight of the word in the image; thus, for each visual word $w_i$, the inverted index stores pairs $\langle I_t, v_{t,i} \rangle$, that is, word $w_i$ is present in the description (bag of words used to represent a place) of image $I_t$, and $v_{t,i}$ is the weight of the visual word in $I_t$,

$w_i \rightarrow \{\langle I_t, v_{t,i} \rangle, \ldots\}$  (4)
In addition to the inverted index, Gálvez-López et al. also introduce a direct index to store a reference to the features extracted from the image. This index plays an important role when checking for geometrical consistency. Using this index, Gálvez-López et al. can quickly access a subset of the features of the candidate image, and together with the features from the query image, they compute a fundamental matrix using Random Sample Consensus (RANSAC) Fischler and Bolles (1981). The direct index is used to avoid comparing all the features in the pair of images when verifying geometrical consistency.
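The two indexes can be sketched with plain dictionaries. The class and method names below are illustrative, not the DBoW2 API:

```python
from collections import defaultdict

class BoWDatabase:
    """Minimal sketch of the inverted and direct indexes described above.

    inverted: word id -> list of (image id, weight) pairs, for quickly
              finding candidate images that share words with a query.
    direct:   image id -> word id -> feature ids, used to restrict the
              geometric check to features quantized to the same words.
    """
    def __init__(self):
        self.inverted = defaultdict(list)
        self.direct = defaultdict(lambda: defaultdict(list))

    def add(self, image_id, words):
        """words: iterable of (word_id, weight, feature_id) triples."""
        for word_id, weight, feature_id in words:
            self.inverted[word_id].append((image_id, weight))
            self.direct[image_id][word_id].append(feature_id)

    def candidates(self, query_words):
        """Images sharing at least one visual word with the query."""
        seen = set()
        for word_id in query_words:
            seen.update(img for img, _ in self.inverted.get(word_id, []))
        return seen
```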

Gálvez-López et al. use an $L_1$-score (Equation 5) to measure the similarity between two binary bags of words, $v_1$ and $v_2$:

$s(v_1, v_2) = 1 - \frac{1}{2}\left\| \frac{v_1}{\|v_1\|} - \frac{v_2}{\|v_2\|} \right\|$  (5)

where $\|\cdot\|$ denotes the $L_1$ norm. This score is a scaled version of the score proposed by Nistér et al. in their seminal paper about creating hierarchical trees of words Nistér and Stewénius (2006).
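Assuming the $L_1$ formulation above, the score can be computed directly from two bag-of-words histograms:

```python
import numpy as np

def l1_score(v1, v2):
    """Similarity between two bag-of-words vectors (L1-score sketch).

    Both vectors are L1-normalized, so the score is 1 for identical word
    distributions and 0 for distributions with no words in common.
    """
    a = np.asarray(v1, dtype=float)
    b = np.asarray(v2, dtype=float)
    a /= np.abs(a).sum()
    b /= np.abs(b).sum()
    return 1.0 - 0.5 * np.abs(a - b).sum()
```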

4.2 Determining Whether a Descriptor is Affected by Dynamic Objects

To determine whether a descriptor, $d$, is a member of $DC$, the first step is to identify the areas occupied by dynamic objects in the image. A fast object detector, e.g., YOLO Redmon and Farhadi (2017), can be used to obtain the approximate area occupied by a dynamic object in real time. The object detector produces bounding boxes that roughly enclose the detected dynamic objects; with these boxes, we can determine approximately whether a descriptor is affected by a dynamic object above the sensitivity threshold. A naive approach consists of modifying the feature description algorithm to take into account the information provided by the bounding boxes, and to check whether the proportion of the extent of the affected descriptor exceeds the sensitivity threshold. Of course, in the case of some complex feature descriptors, modifying the algorithm in this way may dramatically increase computational costs.

Another approach is to use heuristics that take advantage of common properties of feature descriptor algorithms. For example, many feature descriptor algorithms sample locations in an isotropic manner around the feature keypoint. Hence, one heuristic is that if the keypoint is located inside the bounding box of a dynamic object, at least approximately 25% of the extent of the descriptor is affected by dynamic objects (the worst case occurs when the keypoint lies on a corner of the box). This heuristic gives us a useful approximation that allows us to assign all descriptors with keypoints inside a bounding box to the $DC$ class when the sensitivity threshold is set to 25%. Another heuristic, for when the sensitivity threshold is set to a higher value, uses the distance, $r$, from the keypoint to the furthest sampled point. If the keypoint is inside the bounding box and the distance from the keypoint to each of the corners of the box is greater than $r$, we can conclude that the proportion of the extent of the descriptor that is affected by dynamic objects is approximately 100%, and thus above the sensitivity threshold.
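Both heuristics can be sketched as follows, assuming axis-aligned boxes in (x_min, y_min, x_max, y_max) form; the function names are hypothetical:

```python
import math

def quarter_extent_in_box(keypoint, box):
    """Heuristic 1: keypoint inside the bounding box of a dynamic object.

    For isotropic descriptors, a keypoint inside the box implies that
    roughly a quarter of the extent overlaps the box (worst case: the
    keypoint sits on a box corner), so the descriptor can be assigned
    to DC whenever the sensitivity threshold is 25% or lower.
    """
    x, y = keypoint
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def full_extent_in_box(keypoint, r, box):
    """Heuristic 2: treat the extent as (approximately) fully affected.

    Applies when the keypoint is inside the box and every box corner is
    farther than r, the distance from the keypoint to its furthest
    sampled point.
    """
    if not quarter_extent_in_box(keypoint, box):
        return False
    x, y = keypoint
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]
    return all(math.hypot(x - cx, y - cy) > r for cx, cy in corners)
```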

Figure 1: Diagram of how incorporating the proposed procedure improves a place representation by taking into account high-level information from dynamic objects.

Figure 1 illustrates the steps taken to improve a place representation. The procedure receives a list of dynamic objects of interest to be detected in the captured images. Using the information from the object detector, place representations are modified to reduce the impact of descriptors that are affected by dynamic objects.

4.3 Valid Place Representation and Efficiency Improvements

Two or more observations of the same place in the real world can result in several different place representations. One reason is that these images may contain dynamic objects, which may alter the representation of the place, resulting in alternative representations. Ideally, once an agent has captured a digital image of a place, the generated representation should be robust enough to allow the agent to match it with a representation of a future observation of the place. Incorporating high-level information about dynamic objects when generating a place representation allows us to define the concept of a valid place representation,

$P \text{ is valid} \iff |SC| \geq placeThreshold$  (6)

A valid place representation is one that contains a number of descriptors from the class $SC$ that is above a threshold, $placeThreshold$. That is, each of the descriptors counted in the place representation has a proportion of its extent affected by dynamic objects that is below the sensitivity threshold defined in Section 3.
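A minimal validity check, with placeThreshold left as a tunable parameter as in the definition above (the default value is a hypothetical placeholder):

```python
def is_valid_representation(descriptor_classes, place_threshold=50):
    """Return True when a representation keeps enough reliable descriptors.

    descriptor_classes: iterable of 'SC'/'DC' labels, one per descriptor.
    A representation is valid when the count of SC descriptors reaches
    place_threshold; DC descriptors do not count toward validity.
    """
    return sum(1 for c in descriptor_classes if c == "SC") >= place_threshold
```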

Traditional place recognition algorithms do not discriminate between observations. They attempt to find a match in the database for each new observation, even when these observations produce a place representation with a small number of descriptors. Worse, as we have mentioned in this article, traditional place recognition algorithms do not take into account that, regardless of the number of descriptors in a place representation, some of those descriptors may be generated from dynamic objects, thereby misrepresenting the place in question. Keeping bad-quality place representations in the database increases its size and makes the system inefficient. Let us discuss next some examples of how the notion of a valid place representation may help a system overcome the limitations of traditional approaches, making the place recognition system more efficient.

We can use the important notion of a valid place representation to implement at least two kinds of efficiency improvements in our place recognition systems. First, we can avoid the costly procedure of attempting to match a place if its representation has been deemed invalid. Second, an agent can also decide not to store invalid place representations, resulting in reduced storage requirements. Further, by leaving only descriptors that belong to the class $SC$ in the place representation, the size of individual place representations may be reduced. These reductions together result in a smaller database, which is crucial for applications that are designed to explore large environments. A smaller database is also important for applications that run on devices with limited capabilities. Comparing two place representations that have been reduced in size also improves the time required to compute the matching score, that is, the score that will be used to decide whether or not an agent is revisiting a place.
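The two efficiency gates can be combined into a single sketch; match_fn and the list-based database are hypothetical stand-ins for the actual matching backend:

```python
def process_observation(representation, database, match_fn, place_threshold=50):
    """Gate matching and storage on the validity of a place representation.

    representation: iterable of (descriptor, 'SC'/'DC') pairs.
    database:       list of previously stored (reduced) representations.
    match_fn:       callable(reduced_representation, database) -> match.
    Invalid representations are neither matched nor stored; valid ones
    are stored in reduced form (SC descriptors only).
    """
    sc_only = [d for d, c in representation if c == "SC"]  # shrink the representation
    if len(sc_only) < place_threshold:
        return None           # invalid: skip the costly matching step entirely
    match = match_fn(sc_only, database)
    database.append(sc_only)  # store only the reduced, valid representation
    return match
```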

5 Evaluation

5.1 Experimental Configuration

The proposed approach was evaluated using a Dell Precision 5510 workstation running Ubuntu 16.04 LTS with 8 GiB of RAM, an Intel Core i7-6700HQ processor, and an Nvidia Quadro M1000M GPU. Two datasets were used in the evaluation: one with synthetic images (the Synthia dataset Ros et al. (2016)) and the other containing real-world images (the Málaga dataset Blanco et al. (2014)).

We used the SYNTHIA-RAND-CVPR16 subset of the Synthia dataset, which is a collection of photo-realistic frames taken every 10 meters as an agent moves in a virtual city. For each position, several frames are randomly generated using different configurations (illumination and textures), including a variation in the presence of different classes of dynamic objects. Figure 2 shows an example of the frames that correspond to one particular virtual location. In our evaluation with this dataset, we used the images from the front camera, a subset of 4,485 images. We configured our system for the high-level detection of the following dynamic objects: cars, trucks, motorcycles, bicycles (either moving or parked), and people (either standing on the sidewalks or walking). In the case of the real-world images from the Málaga dataset, we used the 17,300 images of subset #10, which were captured at 20 frames per second over 865 seconds by a vehicle moving through the Spanish city of Málaga.

We used the vocabulary of binary words created from ORB descriptors by Mur-Artal et al. Mur-Artal et al. (2015), and the implementation of Bags of Binary Words, DBoW2, by Gálvez-López and Tardós (2012). We tested our approach with several configurations of the object detection, place representation, and place recognition parameters; see Table 2. For the configurations that required geometric verification, we used the default values in the DBoW2 library.

The approximation of the space occupied by dynamic objects in an image is obtained from an object detection algorithm. For our evaluation, we used the You Only Look Once (YOLO) algorithm Redmon and Farhadi (2017). We modified the images to have an aspect ratio of 1:1 by cropping the sides from the center. We then applied YOLO, with weights pre-trained on the COCO dataset Lin et al. (2014), to the squared RGB images. YOLO provides localization information (coordinates of the center of the object, width, and height) and a confidence value for each detected dynamic object in real time. Other methods may provide more accurate information about the detected dynamic objects, but they usually cannot run in real time.
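The center crop and the coverage computation can be sketched as follows. The box format mirrors YOLO's (center, width, height, confidence) output, but the function names and the pixel-mask method of avoiding double-counting overlapping boxes are our own illustrative choices:

```python
def center_crop_square(width, height):
    """Return (x0, y0, side) of the centered 1:1 crop applied before YOLO."""
    side = min(width, height)
    return ((width - side) // 2, (height - side) // 2, side)

def dynamic_area_fraction(boxes, side, conf_threshold=0.20):
    """Fraction of a side x side image covered by detected dynamic objects.
    boxes: (cx, cy, w, h, confidence) tuples in pixels, YOLO-style.
    A boolean mask ensures overlapping boxes are not counted twice."""
    covered = [[False] * side for _ in range(side)]
    for cx, cy, w, h, conf in boxes:
        if conf < conf_threshold:   # one of the thresholds in Table 2
            continue
        x0, x1 = max(0, int(cx - w / 2)), min(side, int(cx + w / 2))
        y0, y1 = max(0, int(cy - h / 2)), min(side, int(cy + h / 2))
        for y in range(y0, y1):
            for x in range(x0, x1):
                covered[y][x] = True
    return sum(map(sum, covered)) / float(side * side)
```

This fraction is what drives both the descriptor pruning and the validity decision discussed in the previous section.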

Parameter | Values
ORB keypoints | 300, 500, 1000, 1500, 2000
Geometric verification | Disabled, level 0, exhaustive check
YOLO confidence threshold | 0.10, 0.20, 0.30, 0.40
Sensitivity threshold | 25%
Table 2: Configuration parameters for the evaluation.
Figure 2: Collage of images from the Synthia dataset corresponding to the same location with different illumination, textures, and dynamic objects.

5.2 Problem Formulation

In our evaluation, we considered the scenario in which an agent has already captured observations of several configurations of each place. What occurs when the agent is given a new representation of a place? Can the agent match that representation to one of the other representations of the same place in the database? The problem is illustrated in Figure 3. Our goal was to compare the performance of the traditional Bags of Binary Words method and our extended version, which incorporates information about dynamic objects.

Figure 3: The agent has to identify other place representations associated with the place observed in the query image.
Figure 4: Place recognition on the Synthia dataset. On the left is the current observation. The first column in the middle shows the candidates found by the Bags of Binary Words approach. The second column in the middle shows the candidates found by the extended approach, which incorporates knowledge about dynamic objects. The blue circle means that the candidate also passed geometric verification. On the right is the approximation of the space occupied by the dynamic objects in the image. The first candidate from our approach shows a correct prediction, even though the cars that are parked on the street are different from one observation to the next. The original approach fails to return a good match due to the presence of dynamic objects.
Configuration | All Images | Images with ≥10% Dyn. Obj. | Images with ≥20% Dyn. Obj. | Images with ≥30% Dyn. Obj.
keys Geom | BoBW BoBW+DO Δ% | BoBW BoBW+DO Δ% | BoBW BoBW+DO Δ% | BoBW BoBW+DO Δ%
300 NoGeom 42.9 47.78 11.38 39.94 48.51 21.46 39.51 49.59 25.51 38.56 50.9 32
300 Geo-0 0.76 0.42 -44.12 0.2 0.08 -60 0.16 0 -100 0.51 0 -100
300 Geo-1 2.92 2.23 -23.66 1.87 1.24 -34.04 2.03 1.3 -36 2.31 1.8 -22.22
300 Geo-2 7.22 7.98 10.49 5.22 6.7 28.24 4.72 6.1 29.31 4.63 6.17 33.33
300 Geo-6 23.75 23.14 -2.54 20.96 22.28 6.27 19.76 20.65 4.53 18.51 20.82 12.5
500 NoGeom 54 58.39 8.13 52.41 58.71 12.02 51.63 57.24 10.87 51.16 60.41 18.09
500 Geo-0 5.73 4.53 -21.01 4.66 3.47 -25.64 5.45 3.41 -37.31 7.97 4.63 -41.94
500 Geo-1 15.18 13.76 -9.4 13.31 12 -9.88 13.25 11.54 -12.88 15.17 12.85 -15.25
500 Geo-2 12.91 17.35 34.37 11.28 17.54 55.48 11.63 17.8 53.15 12.08 21.34 76.6
500 Geo-6 42.5 43.9 3.31 40.89 44.08 7.8 40.49 42.85 5.82 39.85 46.27 16.13
1000 NoGeom 63.95 68.38 6.94 62.93 68.23 8.42 62.03 68.13 9.83 61.44 67.61 10.04
1000 Geo-0 28.18 27.31 -3.09 26.94 26.19 -2.81 26.91 25.12 -6.65 30.33 28.79 -5.08
1000 Geo-1 28.74 34.27 19.24 27.3 34.28 25.55 28.29 34.88 23.28 32.39 38.3 18.25
1000 Geo-2 14.4 22.83 58.51 12.71 24.07 89.34 13.33 26.42 98.17 15.42 30.85 100
1000 Geo-6 61 64.93 6.43 59.43 64.29 8.18 58.37 64.47 10.45 58.87 65.81 11.79
1500 NoGeom 69.54 74.23 6.73 68.63 74.33 8.3 66.42 74.47 12.12 64.27 75.58 17.6
1500 Geo-0 37.17 41.07 10.5 35.55 40.97 15.25 34.47 41.06 19.1 34.45 43.44 26.12
1500 Geo-1 33 40.78 23.58 30.65 41.89 36.67 29.35 43.66 48.75 29.56 46.53 57.39
1500 Geo-2 16.95 25.93 53.03 14.03 27.46 95.74 13.25 29.92 125.77 13.11 34.7 164.71
1500 Geo-6 67.92 72.4 6.6 67.04 72.82 8.62 64.63 72.6 12.33 62.21 72.24 16.12
2000 NoGeom 73.04 76.74 5.07 71.86 76.76 6.82 71.95 77.32 7.46 73.52 79.18 7.69
2000 Geo-0 42.81 48.41 13.07 41.61 48.74 17.15 41.87 50.49 20.58 44.73 57.33 28.16
2000 Geo-1 35.05 43.55 24.24 31.89 45.64 43.12 31.79 49.51 55.75 34.7 56.3 62.22
2000 Geo-2 24.64 30.7 24.62 22.52 31.85 41.42 21.63 36.83 70.3 24.42 44.47 82.11
2000 Geo-6 72 75.3 4.58 70.75 74.81 5.75 70.98 75.45 6.3 72.24 77.12 6.76
Table 3: Place recognition results for the original Bags of Binary Words (BoBW) algorithm and our extended approach (BoBW + DO). Incorporating information about dynamic objects improves the recognition rate in all the configurations in which the recognition rate is greater than about 30%.
Figure 5: Percentage of correct place recognitions on the Synthia dataset. Red triangles correspond to the original Bags of Binary Words algorithm; green dots are the results when incorporating information about dynamic objects. Each row represents the approximate number of features extracted from each image (300, 500, 1000, 1500, and 2000), and each column represents the degree of geometric verification used (no geometric verification, geometric verification at level 1, and exhaustive geometric verification). As the percentage of the image area covered by dynamic objects increases, our approach yields increasingly better place recognition.

5.3 Results

Figure 4 compares the original Bags of Binary Words algorithm with our proposed extension. On the left is the agent's current observation; the two columns in the middle are the candidates returned by each version of the algorithm. The first column shows the candidates returned by the original approach, while the second shows those returned by our approach, which incorporates information about detected dynamic objects. The first candidate from our approach is a correct prediction, even though the cars parked on the street differ from one observation to the next. The original approach fails to return a good match due to the presence of dynamic objects. The blue circle indicates that the candidate also passed geometric verification. On the far right is the approximation of the dynamic objects detected in the observation.

Table 3 compares the results obtained by the original approach (BoBW) and the proposed extension that uses dynamic objects to improve the place representation (BoBW + DO). Taking information about dynamic objects into account improves the recognition results in all configurations in which the BoBW-only recognition accuracy is greater than about 30%. When we further limit the analysis to images with a minimum level of coverage by dynamic objects (10%, 20%, and 30%), our proposed approach performs increasingly better than the BoBW-only approach as the percentage of the image covered by dynamic objects grows. The table shows only a subset of the results, with YOLO's confidence threshold set to 0.20. Additional details are available in Muñoz (2018) and on a website that we have created to share our progress and additional findings.

Figure 5 visualizes these results: in most configurations, as the percentage of the image area covered by dynamic objects increases, our approach yields better place recognition than the Bags of Binary Words approach without dynamic object detection. These improvements confirm the significance of our approach: incorporating high-level information about dynamic objects improves the performance of existing place recognition algorithms in environments highly populated by dynamic objects. The gains are largest for images with a greater percentage of their area covered by dynamic objects. For instance, as shown in Table 3, with 2000 ORB features and geometric verification at level 1, the proposed approach improves place recognition accuracy by 43.12% on images with more than 10% of their area covered by dynamic objects. The improvement rises to 55.75% for images with more than 20% coverage, and to 62.22% for images with more than 30% coverage.
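The "+/-" columns of Table 3 are the relative improvement of BoBW+DO over BoBW, which can be reproduced directly from the reported accuracies (small discrepancies in the last digits stem from rounding in the published accuracy values):

```python
def relative_improvement(bobw, bobw_do):
    """Percent change of BoBW+DO over BoBW, as in Table 3's +/- columns."""
    return 100.0 * (bobw_do - bobw) / bobw

# 2000 ORB keypoints, geometric verification at level 1 (Geo-1), Table 3:
print(round(relative_improvement(31.89, 45.64), 2))  # 43.12 (>10% coverage)
print(round(relative_improvement(31.79, 49.51), 2))  # 55.74 (table: 55.75)
print(round(relative_improvement(34.70, 56.30), 2))  # 62.25 (table: 62.22)
```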

Figure 6: Comparison of databases generated using the Synthia dataset. The proposed approach significantly reduces the size of the database, and produces better recognition results than the version that uses the original place representation.

Figure 6 compares the databases generated after processing the Synthia dataset. The proposed approach generates much smaller databases in all configurations. For instance, setting the maximum number of ORB keypoints to 300 and disabling geometric verification reduces the database by 21.1%, from 94.36 MB to 74.44 MB. When geometric verification uses level 0 of the vocabulary tree, the reduction is 23.9% (from 209 MB to 159 MB). In the configuration that uses a maximum of 1500 ORB keypoints and no geometric verification, the reduction is 21% of the original size, saving 84.5 MB of storage space.

Figure 7: Place recognition times. Red lines correspond to the original Bags of Binary Words approach; green lines correspond to our extended approach. The graphs on the left and on the right correspond to two different ORB feature configurations. The first row does not use geometric verification, while the second row does. Dashed lines represent the average time for each method.

Reducing the size of place representations has an additional positive effect on the time required for the place recognition algorithm to find matches in the database. Figure 7 compares the time required to match places by the original approach (BoBW) and our extension (BoBW + DO). Our approach decreases this time by several milliseconds, depending on the selected configuration. However, our approach adds the cost of the object detection step, which is the most expensive stage: it took an average of 66 milliseconds per image, including resizing the image to 416x416 to meet the object detector's input requirements. The average detection time is expected to decrease when no image resizing is needed.
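A per-stage timing figure such as the 66 ms average above can be obtained by averaging wall-clock time over the dataset; a generic sketch (not the authors' actual measurement harness) is:

```python
import time

def average_stage_ms(stage_fn, images):
    """Average wall-clock time (ms) one pipeline stage spends per image,
    e.g. resizing plus object detection, or database matching."""
    start = time.perf_counter()
    for img in images:
        stage_fn(img)   # e.g. lambda img: detector.run(resize(img))
    return 1000.0 * (time.perf_counter() - start) / len(images)
```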

Figure 8: Dynamic objects behaving as static objects in the Málaga dataset. A-B and C-D: Several of the dynamic objects detected during the first visit, e.g., cars parked on the street, remain in the same place until the next visit of the agent, behaving as static objects.
Figure 9: Example from the KITTI dataset Geiger et al. (2013) of dynamic objects behaving as static objects. The agent revisits this place a few minutes later; most of the cars parked on the street are in the exact same place. Our approach may not be suitable for applications in which this situation is expected to arise frequently.

5.4 Real-World Dataset Insights

The Málaga urban dataset provides additional insights into the behavior of the proposed approach. Some segments of the route used for our evaluation were revisited by the agent just a few seconds after the previous visit. With such a short timespan between visits, many dynamic objects remained in the same place, thus behaving more like static objects. For instance, most of the cars that appeared parked during the first visit were also spotted in the following visits, as illustrated in Figure 8.

This characteristic is not unique to the Málaga urban dataset. Subsets of other popular datasets, e.g., the KITTI dataset, present similar characteristics, as illustrated in Figure 9: when the agent revisits a place 306.08 seconds later, it encounters dynamic objects that have not moved at all. The evaluation with the Málaga dataset confirms that the proposed approach thrives in applications where the agent is expected to explore a highly dynamic environment, or where enough time passes between visits for dynamic objects to actually behave as such. In this dataset, all 17,300 images were captured in a short amount of time, a little more than 14 minutes. Traditional place recognition algorithms that do not incorporate information about dynamic objects will include in their place representations the descriptors affected by dynamic objects; but since the agent revisits each place after a very short time, those descriptors are not really affected by the presence of dynamic objects and thus still contribute to matching the place.

Although the agent revisited some places in the Málaga dataset after a very short time, thus reducing the benefit of taking the presence of dynamic objects into account, the proposed approach detected the same number of loop closures. There are a total of 5 loops in subset #10 of the Málaga dataset. All of the closures for these loops were correctly detected by both the original BoBW approach and our extension, BoBW-DO, which takes dynamic objects into account. This is illustrated in Figure 10.

As expected, and as illustrated in Figure 11, the proposed approach also resulted in a smaller database. It produces similar recognition results while greatly reducing the system's storage requirements, generating much smaller databases in all configurations. For instance, setting the maximum number of ORB keypoints to 1500 and enabling exhaustive geometric verification reduces the database generated by BoBW by 14.3%, from 1705 MB to 1462 MB. When geometric verification uses level 0 of the vocabulary tree, the database size is reduced by 14.1%, from 3687 MB to 3166 MB.

Figure 10: Comparison of place recognition matches found by the original (left) and proposed (right) approaches. The extended approach detected the same loop closures as the original algorithm. The path traversed by the vehicle is in blue, while the places that have been correctly recognized when revisited are in red. Each loop closure in the subset of the Málaga dataset is indicated with an arrow.
Figure 11: Comparison of databases generated using the Málaga dataset. BoBW-Dynamic Objects performs as well as BoBW by recognizing the same number of loop closures, but with the additional benefit of reducing the size of the database.

6 Conclusions and Future Work

Appearance-based place recognition approaches are still plagued by several challenges that are rooted in the complexity of the real world and the limitations of visual sensors. One of those challenges is the intermittent presence of dynamic objects. In this article, we have presented an approach to reduce the negative impact of dynamic objects in place representations.

As explained in this article, the proposed approach introduces several benefits to place recognition, including reduced storage requirements alongside improved recognition accuracy. It can be used to improve the performance of suitable existing place recognition algorithms in environments where a significant presence of dynamic objects is expected.

We have classified traditional place recognition algorithms into having flexible or rigid representations. We have concluded that algorithms with flexible place representations would experience performance improvements by incorporating high-level information from dynamic objects into their place representations. Our evaluation uses the state-of-the-art Bags of Binary Words algorithm Gálvez-López and Tardós (2012). In the future, we anticipate applying our approach to other suitable algorithms to further substantiate the significance of this approach.

Modifying place representations based on the presence of dynamic objects in the observations may not generalize well to applications in which an agent revisits the environment after a very short time, primarily because most of the dynamic objects may not have moved since the previous visit, e.g., cars parked on the street. Figure 8 from the Málaga dataset and Figure 9 from the KITTI dataset illustrate these kinds of situations.

Future work will also explore improvements in the approximation of the area covered by detected dynamic objects, while maintaining the requirement of running in real time. Such improvements will allow a more precise identification of the proportion of each descriptor's extent that is affected by dynamic objects, and will further improve the resulting place representation.

Finally, we expect that information about dynamic objects could have additional applications. For example, this information could allow navigation modules to plan paths that avoid areas where there is a tendency toward a high presence of dynamic objects. The information about dynamic objects could also be used to determine the kind of place that an agent is visiting, which could also enrich navigation applications.


  • A. Alahi, R. Ortiz, and P. Vandergheynst (2012) FREAK: Fast Retina Keypoint. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 510–517.
  • R. Arroyo, P. F. Alcantarilla, L. M. Bergasa, J. J. Yebes, and S. Bronte (2014) Fast and effective visual place recognition using binary codes and disparity information. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3089–3094.
  • H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool (2008) Speeded-Up Robust Features (SURF). Computer Vision and Image Understanding 110 (3), pp. 346–359.
  • J. Blanco, F. Moreno, and J. Gonzalez-Jimenez (2014) The Málaga Urban Dataset: High-rate Stereo and Lidars in a realistic urban scenario. International Journal of Robotics Research 33 (2), pp. 207–214.
  • M. Calonder, V. Lepetit, C. Strecha, and P. Fua (2010) BRIEF: Binary Robust Independent Elementary Features. In European Conference on Computer Vision (ECCV), pp. 778–792.
  • Z. Chen, O. Lam, A. Jacobson, and M. Milford (2013) Convolutional Neural Network-based Place Recognition. In 2014 Australasian Conference on Robotics and Automation (ACRA 2014).
  • M. Cummins and P. Newman (2011) Appearance-only SLAM at large scale with FAB-MAP 2.0. The International Journal of Robotics Research 30 (9), pp. 1100–1123.
  • L. Fei-Fei and P. Perona (2005) A Bayesian Hierarchical Model for Learning Natural Scene Categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), Vol. 2, pp. 524–531.
  • M. A. Fischler and R. C. Bolles (1981) Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography. Communications of the ACM 24, pp. 381–395.
  • D. Gálvez-López and J. D. Tardós (2012) Bags of binary words for fast place recognition in image sequences. IEEE Transactions on Robotics 28 (5), pp. 1188–1197.
  • A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237.
  • R. Girshick (2015) Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV '15), pp. 1440–1448.
  • Y. Hou, H. Zhang, S. Zhou, and H. Zou (2017) Use of roadway scene semantic information and geometry-preserving landmark pairs to improve visual place recognition in changing environments. IEEE Access 5, pp. 7702–7713.
  • E. Johns and G. Z. Yang (2013) Feature Co-occurrence Maps: Appearance-based localisation throughout the day. In IEEE International Conference on Robotics and Automation, pp. 3212–3218.
  • N. Kejriwal, S. Kumar, and T. Shibata (2016) High performance loop closure detection using bag of word pairs. Robotics and Autonomous Systems 77, pp. 55–65.
  • I. Kostavelis, L. Nalpantidis, and A. Gasteratos (2012) Object recognition using saliency maps and HTM learning. In 2012 IEEE International Conference on Imaging Systems and Techniques, pp. 528–532.
  • I. Kostavelis and A. Gasteratos (2013) Learning spatially semantic representations for cognitive robot navigation. Robotics and Autonomous Systems 61 (12), pp. 1460–1475.
  • I. Kostavelis and A. Gasteratos (2017) Semantic maps from multiple visual cues. Expert Systems with Applications 68, pp. 45–57.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1–9.
  • J. C. Lee and R. Dugan. Google Project Tango.
  • S. Leutenegger, M. Chli, and R. Y. Siegwart (2011) BRISK: Binary Robust Invariant Scalable Keypoints. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2548–2555.
  • G. Levi and T. Hassner (2016) LATCH: Learned Arrangements of Three Patch Codes. In Winter Conference on Applications of Computer Vision (WACV).
  • T. Y. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common Objects in Context. In ECCV 2014: 13th European Conference, pp. 740–755.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: Single Shot MultiBox Detector. In Computer Vision – ECCV 2016, Part I, pp. 21–37.
  • D. G. Lowe (1999) Object recognition from local scale-invariant features. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 2, pp. 1150–1157.
  • D. G. Lowe (2004) Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60 (2), pp. 91–110.
  • E. Mair, G. D. Hager, D. Burschka, M. Suppa, and G. Hirzinger (2010) Adaptive and Generic Corner Detection Based on the Accelerated Segment Test. In European Conference on Computer Vision (ECCV '10).
  • M. J. Milford and G. F. Wyeth (2012) SeqSLAM: Visual route-based navigation for sunny summer days and stormy winter nights. In IEEE International Conference on Robotics and Automation, pp. 1643–1649.
  • J. P. Muñoz, B. Li, X. Rong, J. Xiao, Y. Tian, and A. Arditi (2016) Demo: Assisting Visually Impaired People Navigate Indoors. In IJCAI International Joint Conference on Artificial Intelligence, pp. 4260–4261.
  • J. P. Muñoz (2018) Collaborative Appearance-Based Place Recognition and Improving Place Recognition Using Detection of Dynamic Objects. Ph.D. Thesis, CUNY Academic Works.
  • J. P. Muñoz, B. Li, X. Rong, J. Xiao, Y. Tian, and A. Arditi (2017) An Assistive Indoor Navigation System for the Visually Impaired in Multi-Floor Environments. In 7th Annual IEEE International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (IEEE-CYBER 2017).
  • R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós (2015) ORB-SLAM: A Versatile and Accurate Monocular SLAM System. IEEE Transactions on Robotics 31 (5).
  • D. Nistér and H. Stewénius (2006) Scalable recognition with a vocabulary tree. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 2161–2168.
  • E. Pepperell, P. Corke, and M. Milford (2014) All-environment visual place recognition with SMART. In IEEE International Conference on Robotics and Automation (ICRA), pp. 1612–1618.
  • J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection. In CVPR 2016.
  • J. Redmon and A. Farhadi (2017) YOLO9000: Better, Faster, Stronger. In CVPR 2017.
  • S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149.
  • G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez (2016) The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3234–3243.
  • E. Rosten and T. Drummond (2006) Machine learning for high-speed corner detection. In European Conference on Computer Vision, Vol. 1, pp. 430–443.
  • E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: An efficient alternative to SIFT or SURF. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2564–2571.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
  • S. M. Siam and H. Zhang (2017) Fast-SeqSLAM: A fast appearance based place recognition algorithm. In IEEE International Conference on Robotics and Automation, pp. 5702–5708.
  • J. Sivic and A. Zisserman (2003) Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV '03), Vol. 2, pp. 1470–1477.
  • S. M. Smith and J. M. Brady (1997) SUSAN – A New Approach to Low Level Image Processing. International Journal of Computer Vision 23 (1), pp. 45–78.
  • E. Stumm, C. Mei, and S. Lacroix (2013) Probabilistic place recognition with covisibility maps. In IEEE International Conference on Intelligent Robots and Systems, pp. 4158–4163.
  • N. Sünderhauf, S. Shirazi, F. Dayoub, B. Upcroft, and M. Milford (2015) On the performance of ConvNet features for place recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4297–4304.
  • N. Sünderhauf and P. Protzel (2011) BRIEF-Gist – Closing the loop by simple means. In IEEE International Conference on Intelligent Robots and Systems, pp. 1234–1241.
  • I. Ulrich and I. Nourbakhsh (2000) Appearance-based place recognition for topological localization. In IEEE International Conference on Robotics and Automation, pp. 1023–1029.
  • B. Williams, M. Cummins, J. Neira, P. Newman, I. Reid, and J. Tardós (2009) A comparison of loop closing techniques in monocular SLAM. Robotics and Autonomous Systems 57 (12), pp. 1188–1197.