
EchoVPR: Echo State Networks for Visual Place Recognition

Recognising previously visited locations is an important, but unsolved, task in autonomous navigation. Current visual place recognition (VPR) benchmarks typically challenge models to recover the position of a query image (or images) from sequential datasets that include both spatial and temporal components. Recently, Echo State Network (ESN) varieties have proven particularly powerful at solving machine learning tasks that require spatio-temporal modelling. These networks are simple, yet powerful neural architectures that – exhibiting memory over multiple time-scales and non-linear high-dimensional representations – can discover temporal relations in the data while still maintaining linearity in the learning. In this paper, we present a series of ESNs and analyse their applicability to the VPR problem. We report that the addition of ESNs to pre-processed convolutional neural networks led to a dramatic boost in performance in comparison to non-recurrent networks in four standard benchmarks (GardensPoint, SPEDTest, ESSEX3IN1, Nordland) demonstrating that ESNs are able to capture the temporal structure inherent in VPR problems. Moreover, we show that ESNs can outperform class-leading VPR models which also exploit the sequential dynamics of the data. Finally, our results demonstrate that ESNs also improve generalisation abilities, robustness, and accuracy further supporting their suitability to VPR applications.





I Introduction

Visual Place Recognition (VPR) challenges algorithms to recognise previously visited places despite changes in appearance caused by illuminance, viewpoint, and weather conditions [lowry2015visual] (see Fig. 2 for example images). Unlike in many machine learning domains, typical VPR benchmarks require a position to be learned from images gathered during one route traversal and then recognised in images gathered during another traversal. There are therefore very few examples to learn from (typically only the images within a few metres of the correct location), which makes the task even more challenging. One approach is to recognise places based on matching single views using image processing methods to remove the variance between datasets. For instance, models have been developed that use different image descriptors to obtain meaningful image representations that are robust to visual change (e.g. AMOSNet [chen2017deep], DenseVLAD [torii2015place], and NetVLAD [arandjelovic2016netvlad]). While matching single images is successful in many benchmarks, it can suffer from the effects of aliasing, individual image corruption, or sampling mismatches between datasets (e.g. it is challenging to ensure that images sampled along the same route precisely overlap).

One way to improve performance is to exploit the temporal relationships inherent in images sampled along routes (see models by [milford2012seqslam, milford2013vision, hansen2014visual, kagioulis2020insect, zhu2020spatio, chancan2020hybrid]). Milford and Wyeth [milford2012seqslam] were the first to demonstrate improved VPR performance by matching sequences of images using a global search to overcome individual image mismatches. These methods often have an explicit encoding of speed to limit the image search space and/or store a stack of images to allow comparison of image sequences, both of which are undesirable for autonomous robots that may have limited memory and external sensing capabilities.

Fig. 1: An illustration of the EchoVPR framework. Echo State Networks (ESN) incorporate temporality while still maintaining real-time prediction capability, which is a key feature for a robotic system in real-world applications. Given one input image at a time (from snowy Nordland [sunderhauf2013we] in this example), an image descriptor (class-leading NetVLAD [arandjelovic2016netvlad]) provides a meaningful representation to the ESN to update the fixed reservoir.
Fig. 2: Example dataset images. Reference (top) and query (bottom) images from four VPR benchmarking datasets, from left to right; GardensPoint [glover2014day] and ESSEX3IN1 [zaffar2020memorable]: different viewpoint and illuminance conditions, SPEDTest [chen2018learning] and Nordland [sunderhauf2013we]: fixed viewpoint but different season and weather conditions.

Echo State Networks (ESN) [jaeger2007optimization] are a class of recurrent neural networks, ideally suited to addressing VPR problems without the need for additional support cues or input data caching, see Fig. 1. ESNs are a subset of reservoir computing models in which the reservoir neurons possess fixed, random and recurrent interconnections that sustain recent memories, i.e. echoes [jaeger2001echo], with the practical benefit that only the output layer weights require training. ESNs thus act as a temporal kernel [hermans2012recurrent] over a variety of time-scales, creating a form of working memory that dispenses with the need for input caching. They are therefore well-suited to temporal problems such as VPR and have excelled when applied to problems that involve sequential data, including dynamical system predictions [li2012chaotic, deihimi2012application] and robotic motion and navigation tasks [ploger2003echo, ishu2004identification, hartland2007using].

In this paper, we will therefore apply ESNs to VPR to see if these temporal networks can take advantage of the inherent structure of visual input, focusing in particular on two recent advances in ESNs. First, the application of neuron-specific learnable thresholds of reservoir activity results in an improved capacity and performance in comparison to traditional ESNs. Second, layering ESNs in a hierarchical framework facilitates learning of cues from different time-scales concurrently [jaeger2007discovering, gallicchio2018design, manneschi2021exploiting]. Such hierarchical ESNs, invoking multiple and diverse time-scales to enrich the dynamics of the system, have achieved class-leading performance in the permuted-sequential MNIST task [manneschi2021exploiting]. The best operational regime of such systems occurs when the first reservoir of neurons (the ones closer to the input signal) has faster time-scales than the 'deeper' ones. In this way, the first reservoirs can quickly adapt to changes in the external signal (i.e. the input) while deeper ESNs can maintain longer memory and react more slowly. We hypothesise that these advances can help in addressing complex VPR problems on real-world image datasets which require a large memory capacity (often containing a lot of redundant information between subsequent images) and have long and short time dependencies.

For recent reviews of the state-of-the-art in visual place recognition, refer to [lowry2015visual, masone2021survey, zhang2021visual], and for overviews of most prominent benchmarking datasets, model results, and recommended protocols, see [garg2021your, zaffar2021vpr].

The remainder of the paper is organised as follows: Section II summarises the VPR problem formulation and presents four varieties of ESNs (standard and hierarchical, with/without SpaRCe) that will be evaluated. Section IV compares the performance of these ESNs combined with a NetVLAD [arandjelovic2016netvlad] image descriptor against state-of-the-art single-view matching models (AMOSNet [chen2017deep] and DenseVLAD [torii2015place]) in three benchmark datasets (GardensPoint, SPEDTest, ESSEX3IN1). We then compare the (best) ESN approach to the current best sequence matching models (FlyNet+RNN & FlyNet+CANN [chancan2020hybrid]) in the highly challenging Nordland dataset. Section V places these results in the context of current methods and offers an outlook for future work as well as potential bio-inspired extensions.

II Methods

II-A Problem Formulation

VPR algorithms are provided with a sequence of places (in the form of images) sampled along a route, and are then asked to correctly match (within an acceptable threshold) the places to the image key-frames along the same route at a different time, see Fig. 2. (Footnote 1: The VPR challenge and recent models were summarised in VPR-Bench [zaffar2021vpr].) The input data is composed of videos where the network has to correctly infer the location, i.e. the image key-frame that is processed at the considered time. In all the tasks there are at least two sequences of images, one used as a training set (i.e. reference) and the other used as a test set (i.e. query), acquired by visiting the same locations and following the same path twice. Even though there is a one-to-one mapping between training and test samples, the latter is acquired by visiting the locations at different times, leading to differences in visual appearance, such as seasonal or illuminance changes as well as viewpoint changes. Often, perfect matching is not possible; hence, there can be a tolerance term that allows a close match to be accepted. A match is considered successful if the predicted key-frame lies within this tolerance of the correct key-frame.

Fig. 3: Scheme of the ESN models and the overall network architecture. A: ESN protocol. The input is fed to an ESN and the training process occurs on the read-out from the network representation. When the SpaRCe algorithm is adopted, additional thresholds are initialised and adapted through the gradient. B: Hierarchical ESN. The input is first processed by the first reservoir, which is then connected unidirectionally to a second ESN (tuned with different values of the hyper-parameters to exhibit diverse dynamical properties). As in A, learning occurs on the output weights defined from the representation of both reservoirs and on the thresholds when SpaRCe is adopted. C: Scheme of the overall model, composed of a pre-processing module (red boxes) and a reservoir model (blue boxes). In the pre-processing, an image is fed through a CNN (i.e. NetVLAD [arandjelovic2016netvlad]) and through a hidden layer (the input to the ESN), pre-trained to reduce the dimensionality of the NetVLAD output (4096 to 500) before it is fed into the reservoir system. The reservoir model can then be a single or hierarchical ESN with or without the SpaRCe model. Input images are perceived sequentially as a video, and the network has to correctly classify the location of the current image.

In our specific implementation, we consider supervised learning with the ESNs as a predictor, hence forming a classification problem. The number of read-out nodes is equal to the number of places and is therefore specific to the given dataset. The read-out nodes (the final and the only learnable layer) output a probability distribution over places for each given query image. The prediction (i.e. key-frame of the query) is the index of the read-out node with the highest probability.
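As a minimal sketch of this decision rule, the snippet below picks the most active read-out node as the predicted key-frame and counts a prediction as correct when it lies within a tolerance of the true key-frame. The function names and the absolute-difference test on key-frame indices are illustrative assumptions, not the authors' code.

```python
import numpy as np

def predict_places(read_out):
    """Return the index of the most active read-out node for each query image.

    read_out has shape (n_queries, n_places); each row is the read-out
    distribution over places for one query image.
    """
    return np.argmax(read_out, axis=1)

def accuracy_with_tolerance(predicted, true, tolerance=0):
    """Fraction of queries whose predicted key-frame index is within
    `tolerance` frames of the true index (assumed match criterion)."""
    predicted = np.asarray(predicted)
    true = np.asarray(true)
    return float(np.mean(np.abs(predicted - true) <= tolerance))

# Three query images, three candidate places.
read_out = np.array([[0.1, 0.7, 0.2],
                     [0.6, 0.3, 0.1],
                     [0.2, 0.2, 0.6]])
predicted = predict_places(read_out)
exact = accuracy_with_tolerance(predicted, [1, 1, 2], tolerance=0)
loose = accuracy_with_tolerance(predicted, [1, 1, 2], tolerance=1)
```

With a tolerance of zero the second query counts as a miss, while a tolerance of one frame accepts it as a close match.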

II-B Standard ESN

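An ESN is a reservoir of recurrently connected nodes, whose temporal dynamics evolve following [jaeger2007optimization]:

$$\mathbf{x}(t+\delta t) = (1-\alpha)\,\mathbf{x}(t) + \alpha f\big(\mathbf{h}(t)\big) \tag{1}$$

$$\mathbf{h}(t) = \gamma\,\mathbf{W}_{in}\,\mathbf{s}(t) + \rho\,\mathbf{W}\,\mathbf{x}(t) \tag{2}$$

where $\alpha$ is the leakage term and defines the rate of integration of information, $f$ is a non-linear activation function (usually $\tanh$), $\mathbf{s}(t)$ is the input signal, $\mathbf{W}_{in}$ is the input connectivity matrix, which is commonly drawn from a random Gaussian distribution, and $\gamma$ is a multiplicative factor of the external signal. The recurrent connectivity $\mathbf{W}$ is a sparse, random and fixed matrix whose eigenvalues are constrained inside the unit circle of the complex plane, with a hyper-parameter $\rho$ set to further control the spectral radius. As depicted in Fig. 3, learning occurs on the read-out weights $\mathbf{W}_{out}$ from a representation of the ESN dynamics through minimisation of a cost function:

$$\mathbf{y}(t) = \mathbf{W}_{out}\,\mathbf{x}(t) \tag{3}$$

$$E = \big\langle \lVert \tilde{\mathbf{y}}(t) - \mathbf{y}(t) \rVert^{2} \big\rangle \tag{4}$$

where $\tilde{\mathbf{y}}(t)$ is the target output. Optimisation of $\mathbf{W}_{out}$ can be accomplished through different techniques, such as ridge regression or iterative gradient descent methods.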
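The leaky-integrator update and the read-out training described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the connection density, the default values of `alpha` (leakage term), `gamma` (input factor) and `rho` (spectral-radius factor), and the ridge penalty `lam` are all hypothetical choices.

```python
import numpy as np

def make_reservoir(n_in, n_res, density=0.1, seed=0):
    """Draw the fixed input and recurrent matrices; neither is ever trained."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(size=(n_res, n_in))
    mask = rng.random((n_res, n_res)) < density        # sparse connectivity
    W = rng.normal(size=(n_res, n_res)) * mask
    # Rescale so the largest eigenvalue magnitude is one; rho then
    # sets the effective spectral radius inside run_esn.
    W /= np.max(np.abs(np.linalg.eigvals(W)))
    return W_in, W

def run_esn(S, W_in, W, alpha=0.5, gamma=1.0, rho=0.95):
    """Apply the leaky update to an input sequence S of shape (T, n_in)."""
    x = np.zeros(W.shape[0])
    states = []
    for s in S:
        h = gamma * (W_in @ s) + rho * (W @ x)
        x = (1.0 - alpha) * x + alpha * np.tanh(h)
        states.append(x.copy())
    return np.array(states)                            # shape (T, n_res)

def ridge_readout(X, Y, lam=1e-3):
    """Closed-form ridge regression for read-out weights, shape (n_out, n_res)."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y).T

W_in, W = make_reservoir(n_in=4, n_res=50)
S = np.random.default_rng(1).normal(size=(20, 4))      # toy input sequence
X = run_esn(S, W_in, W)
W_out = ridge_readout(X, np.eye(20))                    # one class per time step
```

Only `W_out` is fitted; `W_in` and `W` stay at their random initial values, which is what keeps the learning problem linear.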


II-C Hierarchical ESNs and SpaRCe

Recent works have started to analyse the benefits of reservoir computing systems composed of multiple ESNs. In these composed architectures, ESNs are connected hierarchically and are tuned differently to exhibit diverse dynamical properties. For instance, the values of the leakage term $\alpha^{(k)}$, where $k$ is the reservoir number, can vary for different networks, making it possible to regulate the time-scales at which diverse reservoirs operate. As a result, the overall system can be characterised by a wider range of time constants, with richer dynamics and improved memory abilities. Following the architecture in Fig. 3(b), the equations that describe a system of hierarchically connected reservoirs can be easily defined by generalising Eqs. (1-2),

$$\mathbf{x}^{(k)}(t+\delta t) = \big(1-\alpha^{(k)}\big)\,\mathbf{x}^{(k)}(t) + \alpha^{(k)} f\big(\mathbf{h}^{(k)}(t)\big) \tag{5}$$

$$\mathbf{h}^{(k)}(t) = \gamma^{(k)}\,\mathbf{W}_{in}^{(k)}\,\mathbf{s}(t) + \sum_{l} \rho^{(kl)}\,\mathbf{W}^{(kl)}\,\mathbf{x}^{(l)}(t) \tag{6}$$

where parameters have similar definitions to the ones in Eq. (1). In the hierarchical structure of Fig. 3(b), $\mathbf{W}^{(kl)} = 0$ if $l > k$ or $l < k-1$. In detail, $\mathbf{W}^{(kk)}$ indicates the recurrent connectivity of reservoir $k$ and needs to have a spectral radius smaller than one, while $\mathbf{W}^{(kl)}$, where $k \neq l$, is the connectivity among different reservoirs and can be drawn from any desirable distribution. In this work, we focus on a hierarchical structure of two ESNs with different values for the two leakage terms.
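To make the two-reservoir case concrete, here is a small NumPy sketch in which a fast first reservoir feeds a slow second one unidirectionally. It is a sketch under assumptions: the external input drives only the first reservoir (following the description of Fig. 3B), and the leak values `0.9` and `0.1`, the reservoir size, the connection density and the scaling of the inter-reservoir matrix are arbitrary illustrative choices.

```python
import numpy as np

def unit_radius_matrix(n, rng, density=0.1):
    """Sparse random matrix rescaled to a spectral radius of one."""
    W = rng.normal(size=(n, n)) * (rng.random((n, n)) < density)
    return W / np.max(np.abs(np.linalg.eigvals(W)))

def run_hierarchical_esn(S, n_res=100, alphas=(0.9, 0.1),
                         gamma=1.0, rho=0.95, seed=0):
    """Two leaky reservoirs connected first -> second only.

    S has shape (T, n_in). Returns the concatenated states of both
    reservoirs, shape (T, 2 * n_res), from which a read-out is defined.
    """
    rng = np.random.default_rng(seed)
    W_in = rng.normal(size=(n_res, S.shape[1]))
    W11 = unit_radius_matrix(n_res, rng)                # recurrent, reservoir 1
    W22 = unit_radius_matrix(n_res, rng)                # recurrent, reservoir 2
    W21 = rng.normal(size=(n_res, n_res)) / np.sqrt(n_res)  # reservoir 1 -> 2
    a1, a2 = alphas
    x1 = np.zeros(n_res)
    x2 = np.zeros(n_res)
    states = []
    for s in S:
        # Both pre-activations use the states from the previous step.
        h1 = gamma * (W_in @ s) + rho * (W11 @ x1)
        h2 = W21 @ x1 + rho * (W22 @ x2)
        x1 = (1.0 - a1) * x1 + a1 * np.tanh(h1)
        x2 = (1.0 - a2) * x2 + a2 * np.tanh(h2)
        states.append(np.concatenate([x1, x2]))
    return np.array(states)

S = np.random.default_rng(1).normal(size=(50, 4))
X = run_hierarchical_esn(S, n_res=40)
step_change = np.abs(np.diff(X, axis=0)).mean(axis=0)
fast_change = step_change[:40].mean()    # first reservoir
slow_change = step_change[40:].mean()    # second reservoir
```

With these leak values the second reservoir changes much less from one step to the next than the first, which is the separation of time-scales the hierarchy is meant to provide.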

While the exploitation of multiple ESNs can enrich the dynamics of the system by discovering temporal dependencies over multiple time-scales, the definition of sparse representations through the SpaRCe model [manneschi2021sparce] can enhance the capacity of the reservoir to learn associations by introducing specialised neurons through the definition of learnable thresholds. Considering the representation $\mathbf{x}$ from which the read-out is defined, as in Eq. (1), SpaRCe consists of the following normalisation operation:

$$\hat{x}_i = \mathrm{sign}(x_i)\,\mathrm{relu}\big(|x_i| - \theta_i\big) \tag{7}$$

$$\theta_i = P_n\big(|x_i|\big) + \tilde{\theta}_i \tag{8}$$

where $i$ indexes the dimension, $\mathrm{sign}$ is the sign function and $\mathrm{relu}$ is the rectified linear unit. The new read-out is defined from $\hat{\mathbf{x}}$, that is after the transformation given in Eq. (7) and (8), which leaves the dynamics of the system unaltered and can be easily applied to any reservoir representation. The threshold $\theta_i$ is composed of two factors: $P_n(|x_i|)$, i.e. the $n$-th percentile of $|x_i|$, which stands for the distribution of activities of dimension $i$ after the presentation of a number of samples with sufficient statistics, and a learnable part $\tilde{\theta}_i$, which is adapted through gradient descent and is initialised to arbitrarily small values at the beginning of training. The percentile $n$ can be considered as an additional interpretable hyper-parameter that controls the sparsity level of the network at the start of the training phase. (Footnote 2: For different methodologies to estimate the percentile operation, see
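The thresholding step can be illustrated with a short NumPy sketch. It assumes the percentile is taken per neuron over the absolute activities of a batch of recorded reservoir states; the function names are hypothetical, and the learnable part of the threshold is simply passed in as an array rather than trained.

```python
import numpy as np

def percentile_thresholds(recorded_states, n):
    """Per-neuron n-th percentile of |activity| over recorded states.

    recorded_states has shape (n_samples, n_neurons).
    """
    return np.percentile(np.abs(recorded_states), n, axis=0)

def sparce(state, theta_percentile, theta_learnable):
    """Zero every activity whose magnitude is below its threshold and
    shrink the rest towards zero, keeping the original sign."""
    theta = theta_percentile + theta_learnable
    return np.sign(state) * np.maximum(np.abs(state) - theta, 0.0)

rng = np.random.default_rng(0)
recorded = rng.normal(size=(1000, 20))        # stand-in for reservoir states
theta_p = percentile_thresholds(recorded, 80)
sparse_states = sparce(recorded, theta_p, np.zeros(20))
sparsity = float(np.mean(sparse_states == 0))

# A single state with a scalar threshold of 0.3 for every neuron.
example = sparce(np.array([0.5, -0.2, -0.9]), 0.3, 0.0)
```

Starting from the 80th percentile leaves roughly one neuron in five active at the beginning of training; gradient descent on the learnable part can then move each threshold individually.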


II-D Benchmarks and Pre-processing

Convolutional neural networks (CNN) are the best performing architectures for processing images and discovering high-level features from visual data. However, they are static and lack temporal dynamics. In contrast, recurrent connections can be fundamental for the considered tasks, where the driving signals are a succession of images acquired during the exploration of an environment. Thus, after a pre-processing module composed of NetVLAD [arandjelovic2016netvlad], a pre-trained CNN, we adopted a system composed of one or multiple ESNs. Considering that the reservoir computing paradigm is more effective when the reservoir expands the dimensionality of its corresponding input, we first decreased the dimensionality of the NetVLAD output (original dimension is 4096) by training a feedforward network composed of one hidden layer (with 500 nodes) on the considered classification task. This new representation is then considered as the input to the reservoir computing system, see Fig. 3(c). The reservoir is then trained to distinguish the different locations, which are processed successively in the natural order of acquisition by the overall architecture. The four reservoir computing models we study are summarised below. (Footnote 3: The source-code for ESN implementations can be found in

  • Echo State Network (ESN), where learning happens on the output weights only. The critical hyper-parameters of the system for the cases studied, which will be tuned, are the leakage term, the input factor and the learning rate.

  • Echo State Network with SpaRCe (ESN+SpaRCe), where thresholds are applied to the reservoir following Eq. (7) and learning occurs on the read-out weights and on the thresholds. The hyper-parameters are the same as for the standard ESN, with the addition of the starting percentile of Eq. (8).

  • Hierarchical ESN (H-ESN), composed of two reservoirs connected unidirectionally. The read-out is defined from both reservoirs and, as for the case of a single ESN, is subject to training. In this case, the number of hyper-parameters is theoretically more than doubled in comparison to a single ESN and it is practically challenging to perform an exhaustive tuning procedure of all of them. We selected the value of as the optimal one found for the single ESN and fixed , focusing on the tuning of . The constraint is justified by considering that the second reservoir would lose information that lives on fast time-scales if , leading to an overall system with slow reacting dynamics. On the contrary, if and , the first reservoir can react to rapid changes of the input and the second can maintain past temporal information, leading to a system that is robust to signals with both short and long temporal dependencies.

  • Hierarchical ESN and SpaRCe (H-ESN+SpaRCe), which is the same as a hierarchical reservoir, but with the addition of SpaRCe.

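The total number of reservoir nodes is . (Footnote 4: It is for the hierarchical models and for the Nordland.) Learning of the read-out weights and of the thresholds is accomplished through mini-batches and by minimisation of the softmax cross-entropy loss:

$$E = -\frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} \sum_{j} \tilde{y}_{ij}\,\log\!\left(\frac{e^{y_{ij}}}{\sum_{k} e^{y_{ik}}}\right) \tag{9}$$

where $N_{batch}$ is the minibatch size, $\mathbf{y}$ the output of the neural network, $\tilde{\mathbf{y}}$ the target output, and the indexes $i$ and $j$ correspond to the sample number and to the output node considered. The models are trained for up to epochs, i.e. each training image is passed times.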

Fig. 4: Comparison between different models. Reservoir computing models capture the temporal dynamics of the problem and improve on the performance of CNNs. ESN and ESN with SpaRCe are shown in blue-green colours, while the performance of static neural networks is reported in red-yellow colours. The performance of AMOSNet, DenseVLAD and NetVLAD was taken from [zaffar2021vpr], where image matching was achieved by computing distances among the representations. The two trained NetVLAD variants correspond to models in which a simple read-out or a hidden layer, respectively, was trained from the representation of the convolutional network. This was achieved through the minimisation of Eq. (9) on the specific task considered, similar to the approach used for ESNs. The bar plots for our method show average performance over trials.

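Specifically for the Nordland dataset, which is more challenging than the previous benchmarks, we used the sigmoid cross-entropy loss as the error function, which led to better performance:

$$E = -\frac{1}{N_{batch}} \sum_{i=1}^{N_{batch}} \sum_{j} \Big[\tilde{y}_{ij}\,\log \sigma(y_{ij}) + \big(1-\tilde{y}_{ij}\big)\,\log\big(1-\sigma(y_{ij})\big)\Big] \tag{10}$$

where $\sigma$ is the logistic sigmoid and the other terms have similar meaning to the ones of Eq. (9). The models are trained for up to a total of iterations.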
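The difference between the two error functions can be seen in a small NumPy sketch. The softmax version couples all read-out nodes through a shared normalisation, whereas the sigmoid version treats every node as an independent binary decision. The function names and the averaging over the mini-batch are assumptions made for illustration; both functions use standard numerically stable formulations.

```python
import numpy as np

def softmax_cross_entropy(outputs, targets):
    """Mean over the mini-batch of -sum_j target_j * log softmax(output)_j."""
    z = outputs - outputs.max(axis=1, keepdims=True)    # avoid overflow in exp
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-(targets * log_p).sum(axis=1).mean())

def sigmoid_cross_entropy(outputs, targets):
    """Mean over the mini-batch of the summed per-node binary cross-entropy.

    Uses max(z, 0) - z * t + log(1 + exp(-|z|)), which equals
    -[t * log(sigmoid(z)) + (1 - t) * log(1 - sigmoid(z))].
    """
    per_node = (np.maximum(outputs, 0.0) - outputs * targets
                + np.log1p(np.exp(-np.abs(outputs))))
    return float(per_node.sum(axis=1).mean())

# Two samples, four places, uninformative (all-zero) outputs.
outputs = np.zeros((2, 4))
targets = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 0.0]])
soft = softmax_cross_entropy(outputs, targets)
sigm = sigmoid_cross_entropy(outputs, targets)
```

For all-zero outputs the softmax loss equals the log of the number of places, while the sigmoid loss equals the number of places times log 2, since every node is penalised separately.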

III Experiments

III-A Datasets and Performance Metrics

We evaluate the performance of the proposed models on four standard benchmarks: GardensPoint [glover2014day], ESSEX3IN1 [zaffar2020memorable], SPEDTest [chen2018learning], and Nordland [sunderhauf2013we], using two metrics: prediction accuracy and precision-recall area-under-curve (AUC). GardensPoint consists of indoor, outdoor and natural environments with both viewpoint and conditional changes throughout the dataset. A tolerance of is acceptable. ESSEX3IN1 consists of images taken at the university campus and surroundings, focusing on perceptual aliasing and confusing places. There is no tolerance for this dataset; hence, the prediction has to be exact. SPEDTest consists of low-quality but high-depth images collected from CCTV cameras around the world; it includes environmental changes such as variations in weather, seasonal and illumination conditions. There is no tolerance for this dataset. Nordland consists of images taken during train traversals in four different seasons in Norway; the viewpoint angle is fixed although there is high variability in weather, season and illumination. A tolerance of is acceptable, the same as for the sequential models [chancan2020hybrid] we compare against (see Section IV-D for more details).

III-B Training ESNs and Hyper-parameter Tuning

The lack of a validation set for the considered tasks makes the selection of hyper-parameters challenging. This difficulty is emphasised by the small number of samples in the training set (i.e. one sample per place) and by the major statistical differences between training and test data. In particular, the seasonal difference in the acquisition of reference and query data leads to the possible presence or absence of snow and to shifts in colour intensities. In our preliminary experiments, different hyper-parameters would reach perfect accuracy (i.e. ) on the training set and degraded, variable performance on the test set. We believe that previous research works lack a clear methodology for overcoming the problem of hyper-parameter selection.

We tuned the hyper-parameters of the reservoir by using a small percentage (i.e. ) of samples of the test set as validation. In other words, while the read-out was always optimised from reference samples, hyper-parameters were optimised through grid search over the performance achieved on of the query data. Being aware of the limitations of this methodology, we will later show how it is possible to use the test set of one task as validation for another task with little loss of performance, demonstrating how the model can achieve generalisation abilities if the hyper-parameters are selected to be robust to non-excessive statistical changes (see Section IV-C).
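The selection protocol can be sketched as follows. This is a hypothetical illustration rather than the authors' procedure: the validation fraction is left as a parameter, the validation images are assumed to be drawn at random from the query set, and `score_fn` stands in for "train the read-out on the reference data with these hyper-parameters and return the accuracy on the validation queries".

```python
import itertools
import numpy as np

def split_query(n_query, val_fraction, seed=0):
    """Split query indices into a small validation part and the rest."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_query)
    n_val = max(1, int(round(val_fraction * n_query)))
    return np.sort(order[:n_val]), np.sort(order[n_val:])

def grid_search(grid, score_fn, val_idx):
    """Try every combination in `grid` and keep the best validation score.

    grid maps a hyper-parameter name to a list of candidate values;
    score_fn(params, val_idx) must return a number (higher is better).
    """
    best_score, best_params = None, None
    for values in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = score_fn(params, val_idx)
        if best_score is None or score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Toy stand-in for the real train-and-evaluate step.
def toy_score(params, val_idx):
    return -(params["alpha"] - 0.5) ** 2 - (params["gamma"] - 1.0) ** 2

val_idx, test_idx = split_query(n_query=100, val_fraction=0.1)
grid = {"alpha": [0.1, 0.5, 0.9], "gamma": [0.5, 1.0]}
best_score, best_params = grid_search(grid, toy_score, val_idx)
```

Because the query images form a sequence, the network would still be run over the whole query traversal in order; only the scoring is restricted to the validation indices.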

IV Results

IV-A Assessing ESN Utility to Visual Place Recognition

The performance of ESN and ESN+SpaRCe was first evaluated on three datasets (GardensPoint, SPEDTest and ESSEX3IN1). Fig. 4 shows that both ESN variants outperform state-of-the-art single-view matching models (including NetVLAD with read-out and hidden layers) in all three conditions. The ESN achieves mean accuracy scores of , and and mean AUC scores of , and . The addition of the SpaRCe layer provides additional improvement with accuracy scores of , and and mean AUC scores of , and .

IV-B Hierarchical Models for Performance Improvement

We then assessed whether a hierarchical ESN architecture would improve results in the challenging GardensPoint dataset. Fig. 5 shows that the introduction of hierarchical ESNs increased the median accuracy scores while decreasing their variance (ESN median: and std: 0 vs H-ESN+SpaRCe median: and std: ; both for trials). AUC scores showed little change but they were already close to the maximum possible () and thus there was little room for improvement. The improvement obtained with the hierarchical model suggests that the GardensPoint dataset contains longer temporal dependencies among images that cannot be captured by a single ESN. This result can be intuitively understood by comparing the sequences of images between the three datasets presented in Fig. 4. On inspection, the GardensPoint data are captured at a higher frame-rate than the other datasets, where images appear more static and more separated in time from each other. Consequently, GardensPoint has a more complex underlying temporal structure.

Fig. 5: Hierarchical models improve performance. More complex models (H-ESN and H-ESN+SpaRCe) yield higher and more robust performance. The box plots show the results over trials.

IV-C Generalisability Study

We also analysed the sensitivity of the ESN models with respect to hyper-parameter selection. Fig. 6 shows accuracy scores for hyper-parameters tuned by training the models on GardensPoint and kept fixed when training on SPEDTest and ESSEX3IN1. We chose the hyper-parameters from GardensPoint because generalisation is more likely to occur when the baseline task is more complex than the new tasks to which it is applied. Indeed, richer and more difficult datasets can lead neural networks to discover high-level features that are transferable to simpler datasets, while the contrary is difficult. Fig. 6 demonstrates how, even with sub-optimal hyper-parameters, the introduction of ESNs leads to higher performance in comparison to single-view matching models, including the NetVLAD variants with a trained read-out or hidden layer. Again, hierarchical ESNs provide a noticeable improvement in median accuracy and AUC scores as well as reducing variance. Moreover, the performance remains above for both accuracy and AUC compared to the virtually perfect scores achieved when hyper-parameters were tuned using the same dataset (see Fig. 4).

Fig. 6: Generalisability of hyper-parameter transfer. The proposed models show generalisation ability by maintaining performance even though the hyper-parameters were selected using a different dataset (GardensPoint). All four variants of ESN are well above the accuracy achieved by static models (horizontal lines). The box plots represent the distribution of trials.

IV-D Comparing ESN with sequential VPR models

Fig. 7: Comparison against state-of-the-art sequential models in the Nordland dataset. The ESN model, and in particular ESN+SpaRCe, shows class-leading performance on the Nordland dataset. The horizontal lines report the performance of FlyNet+RNN and FlyNet+CANN, taken from [chancan2020hybrid]. The box plots represent the distribution of trials.

In this section, we benchmark the performance of ESNs against state-of-the-art sequence matching VPR models. Specifically, we compare with two models recently reported to achieve strong performance [chancan2020hybrid] in the challenging Nordland dataset [sunderhauf2013we]. Both models use a bio-inspired feedforward neural network (FlyNet) to encode visual information and either a recurrent neural network (RNN) or a continuous attractor network (CANN) to introduce temporality. Fig. 7 shows accuracy scores of and for the standard ESN and ESN+SpaRCe respectively (no accuracy scores are available for comparison). For the AUC test, ESN achieves scores of , with SpaRCe improving results to . This compares favourably to both static view matching models (e.g. NetVLAD+HL) which score , and sequential models which score (FlyNet+RNN) and (FlyNet+CANN).

V Conclusions

In this paper, we have demonstrated the viability of ESNs as a solution to the VPR problem. All the ESN variants implemented achieve higher performance than single-view matching models (AMOSNet, DenseVLAD, NetVLAD, and NetVLAD with a trained read-out or hidden layer) in three benchmarking datasets (GardensPoint, SPEDTest, ESSEX3IN1). In the more challenging Nordland dataset, two of our models (single reservoir ESN and SpaRCe) achieved performance above or equal to the class-leading results achieved by sequential matching models (FlyNet+RNN and FlyNet+CANN). While performance is comparable, we note that the FlyNet models [chancan2020hybrid] have far fewer parameters. However, the ESNs do not require images to be cached during multiple comparisons and also serve to implicitly assess any velocity dependence through the temporal dynamics.

In terms of the recent ESN advances, namely hierarchical and SpaRCe, the results differ depending on the dataset. The addition of SpaRCe to the standard ESN improved performance considerably, showing how the introduction of sparse representations can efficiently help the classification process. The utilisation of hierarchical ESNs was beneficial in the GardensPoint dataset, but not for the larger and more challenging Nordland dataset. Hierarchical models have higher complexity, in terms of the number of hyper-parameters, and can overfit the training data. This is particularly an issue when considering the benchmarking VPR datasets, as there is only a 'single' sample to learn from (as opposed to standard machine learning datasets that have many samples per class, e.g. approximately samples per class for the well-known MNIST dataset). Preliminary analysis supports this hypothesis: hierarchical models achieved perfect scores on the Nordland training set (summer) but low performance when presented with the test set (winter). Such issues might be addressed by augmenting training data [shorten2019survey] (e.g. through artificial illuminance changes or weather effects) to supply a variety of real-world conditions.

While there are many ways to optimise the ESNs for the VPR problem, an intriguing future course of action is to take inspiration from invertebrate mini-brains that possess analogous structural motifs of both deep and shallow ESNs. A simple example is the insect mushroom body. This is considered the cognitive centre of the insect brain [menzel2001cognitive] and is necessary for learning relationships, sequences and patterns in honey bees [menzel2001cognitive, boitard2015gabaergic, devaud2015neural, cope2018abstract]. Structurally the mushroom body is a three-layer network with a compact input layer, an expanded middle layer of inter-neurons called Kenyon cells, and a small layer of output neurons [fahrbach2006structure]. The connections between the Kenyon cells and output neurons are plastic and modified by learning [gerber2004engram], and there are chemical and electrical synapses between the Kenyon cells [zheng2018complete, takemura2017connectome, liu2016gap]. These features are analogous to the recurrent connections in the reservoir layer of an ESN, and it has been hypothesised [manneschi2021sparce, manneschi2021exploiting] that these recurrent connections in the Kenyon cell layer could contribute to the reverberant activity of the mushroom body that supports forms of memory [cognigni2018right]. Given the similar structures, insights gained from neurobiology could help shape future ESN investigations and, in turn, analysis of the optimal structure for VPR could shed light on the function of different brain areas.

In practice, it is desirable that places are recognised from a single input image, allowing robots to truly solve the kidnapped robot problem. However, in the cases where such methods fail, traversing portions of a familiar path can help to disambiguate the input. ESNs provide a means to exploit such temporal dynamics using only visual data, but more powerful variants require tuning of a large number of parameters, which may not be possible when only a small number of training examples are provided. Other methods [milford2012seqslam, chancan2020hybrid] have focused on low-parameter models but often require additional cues such as velocity to focus the image search. Ensemble methods [hausler2019multi, fischer2020event] that combine these features are emerging and may provide the best of both worlds.

Finally, assessment of methods on robots in the real world is essential. This will not only challenge current approaches to be more robust but can also show some difficulties caused by the pre-collected datasets, such as continual learning or robotic safety.