I Introduction
Visual Place Recognition (VPR) challenges algorithms to recognise previously visited places despite changes in appearance caused by illuminance, viewpoint, and weather conditions [lowry2015visual] (see Fig. 2 for example images). Unlike in many machine learning domains, typical VPR benchmarks require learning places from images gathered during a single route traversal, which must then be matched against data from another traversal of the same route. There are thus very few examples to learn from (typically only the images within a few metres of the correct location), making the task even more challenging. One approach is to recognise places by matching single views, using image processing methods to remove the variance between datasets. For instance, models have been developed that use different image descriptors to obtain meaningful image representations that are robust to visual change (e.g. AMOSNet [chen2017deep], DenseVLAD [torii2015place], and NetVLAD [arandjelovic2016netvlad]). While matching single images is successful on many benchmarks, it can suffer from the effects of aliasing, individual image corruption, or sampling mismatches between datasets (e.g. it is challenging to ensure that images sampled along the same route precisely overlap). One way to improve performance is to exploit the temporal relationships inherent in images sampled along routes (see models by [milford2012seqslam, milford2013vision, hansen2014visual, kagioulis2020insect, zhu2020spatio, chancan2020hybrid]). Milford and Wyeth [milford2012seqslam] were the first to demonstrate improved VPR performance by matching sequences of images, using a global search to overcome individual image mismatches. These methods often have an explicit encoding of speed to limit the image search space and/or store a stack of images to allow comparison of image sequences: both of which are undesirable for autonomous robots that may have limited memory and external sensing capabilities.

[Fig. 2: example images from the benchmark datasets, illustrating appearance changes between traversals of the same places.]
Echo State Networks (ESNs) [jaeger2007optimization] are a class of recurrent neural networks ideally suited to addressing VPR problems without the need for additional support cues or input data caching (see Fig. 1). ESNs are a subset of reservoir computing models in which the reservoir neurons possess fixed, random and recurrent interconnections that sustain recent memories, i.e. echoes [jaeger2001echo], with the practical benefit that only the output layer weights require training. ESNs thus act as a temporal kernel [hermans2012recurrent] over a variety of time-scales, creating a form of working memory that dispenses with the need for input caching. They are therefore well-suited to temporal problems such as VPR and have excelled when applied to problems involving sequential data, including dynamical system predictions [li2012chaotic, deihimi2012application] and robotic motion and navigation tasks [ploger2003echo, ishu2004identification, hartland2007using].
In this paper, we therefore apply ESNs to VPR to see if these temporal networks can take advantage of the inherent structure of the visual input, focusing in particular on two recent advances in ESNs. First, the application of neuron-specific learnable thresholds on reservoir activity (SpaRCe) improves capacity and performance in comparison to traditional ESNs. Second, layering ESNs in a hierarchical framework facilitates learning of cues over different time-scales concurrently [jaeger2007discovering, gallicchio2018design, manneschi2021exploiting]. Such hierarchical ESNs, invoking multiple and diverse time-scales to enrich the dynamics of the system, have achieved class-leading performance on the permuted sequential MNIST task [manneschi2021exploiting]. The best operational regime of such systems occurs when the first reservoirs (the ones closer to the input signal) have faster time-scales than the 'deeper' ones. In this way, the first reservoirs can quickly adapt to changes in the external signal (i.e. the input) while deeper ESNs maintain longer memory and react more slowly. We hypothesise that these advances can help in addressing complex VPR problems on real-world image datasets, which require a large memory capacity (often containing much redundant information between subsequent images) and exhibit both long and short time dependencies.
For recent reviews of the state of the art in visual place recognition, refer to [lowry2015visual, masone2021survey, zhang2021visual]; for overviews of the most prominent benchmark datasets, model results, and recommended protocols, see [garg2021your, zaffar2021vpr].
The remainder of the paper is organised as follows: Section II summarises the VPR problem formulation and presents the four varieties of ESN (standard and hierarchical, with/without SpaRCe) that will be evaluated. Section III describes the benchmark datasets, performance metrics and training procedure. Section IV compares the performance of these ESNs, combined with a NetVLAD [arandjelovic2016netvlad] image descriptor, against state-of-the-art single-view matching models (AMOSNet [chen2017deep] and DenseVLAD [torii2015place]) on three benchmark datasets (GardensPoint, SPEDTest, ESSEX3IN1). We then compare the best ESN approach to the current best sequence matching models (FlyNet+RNN and FlyNet+CANN [chancan2020hybrid]) on the highly challenging Nordland dataset. Section V places these results in the context of current methods and offers an outlook for future work, including potential bio-inspired extensions.
II Methods
II-A Problem Formulation
VPR algorithms are provided with a sequence of places (in the form of images) sampled along a route and are asked to correctly match (within an acceptable tolerance) those places to the image key-frames from a traversal of the same route at a different time, see Fig. 2.¹ The input data are videos from which the network has to correctly infer the location, i.e. the image key-frame that is being processed at the considered time. In all the tasks there are at least two sequences of images, one used as a training set (the reference) and the other used as a test set (the query), acquired by visiting the same locations and following the same path twice. Even though there is a one-to-one mapping between training and test samples, the latter are acquired by visiting the locations at different times, leading to differences in visual appearance, such as seasonal, illuminance and viewpoint changes. Often, perfect matching is not possible; hence, a tolerance term $\zeta$ can allow a close match to be accepted. A match between the predicted key-frame $\hat{p}$ and the ground-truth key-frame $p$ is considered successful if $|\hat{p} - p| \le \zeta$.

¹The VPR challenge and recent models are summarised in VPR-Bench [zaffar2021vpr].
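To make the matching criterion concrete, here is a minimal Python sketch (the function and variable names are ours, not taken from any benchmark code):

```python
def is_match(pred_frame: int, true_frame: int, tolerance: int) -> bool:
    """A query is matched correctly if the predicted key-frame lies
    within `tolerance` frames of the ground-truth key-frame."""
    return abs(pred_frame - true_frame) <= tolerance
```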
[Figure: overview of the model architecture. Images are encoded by NetVLAD and passed through a hidden layer (the input to the ESN), pre-trained to reduce the dimensionality of the NetVLAD output (4096 to 500), before being fed into the reservoir system. The reservoir model can then be a single or hierarchical ESN, with or without the SpaRCe model. Input images are perceived sequentially as a video, and the network has to correctly classify the location of the current image.]
In our specific implementation, we consider supervised learning with the ESN as a predictor, hence forming a classification problem. The number of read-out nodes is equal to the number of places and is therefore specific to the given dataset. The read-out nodes (the final and only learnable layer) output a probability distribution over places, $\mathbf{y}$, for each given query image. The prediction (i.e. the key-frame of the query) is the index of the read-out node with the highest probability, i.e. $\hat{p} = \arg\max_i y_i$.

II-B Standard ESN
An ESN is a reservoir of recurrently connected nodes whose temporal dynamics evolve following [jaeger2007optimization]:
$\tilde{\mathbf{x}}(t) = f\big(\gamma\,\mathbf{W}^{\mathrm{in}}\,\mathbf{s}(t) + \rho\,\mathbf{W}\,\mathbf{x}(t-1)\big)$   (1)

$\mathbf{x}(t) = (1-\alpha)\,\mathbf{x}(t-1) + \alpha\,\tilde{\mathbf{x}}(t)$   (2)
where $\alpha$ is the leakage term and defines the rate of integration of information, $f$ is a non-linear activation function (usually $\tanh$), $\mathbf{s}(t)$ is the input signal, $\mathbf{W}^{\mathrm{in}}$ is the input connectivity matrix, which is commonly drawn from a random Gaussian distribution, and $\gamma$ is a multiplicative factor of the external signal. The recurrent connectivity $\mathbf{W}$ is a sparse, random and fixed matrix whose eigenvalues are constrained inside the unit circle of the imaginary plane, with a hyper-parameter $\rho$ (typically close to, but below, one) set to further control the spectral radius. As depicted in Fig. 3, learning occurs on the read-out weights $\mathbf{W}^{\mathrm{out}}$, applied to a representation of the ESN dynamics, through minimisation of a cost function:

$\mathbf{y}(t) = \mathbf{W}^{\mathrm{out}}\,\mathbf{x}(t)$   (3)

$E = \sum_{t}\big\lVert \mathbf{y}(t) - \tilde{\mathbf{y}}(t) \big\rVert^{2}$   (4)

where $\tilde{\mathbf{y}}(t)$ is the desired output at time $t$.
Optimisation of $\mathbf{W}^{\mathrm{out}}$ can be accomplished through different techniques, such as ridge regression or iterative gradient descent methods [lukovsevivcius2012practical].
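As an illustration, the following NumPy sketch implements the update of Eqs. (1)-(2) and a ridge-regression read-out as one option for Eqs. (3)-(4); the sizes and hyper-parameter values are placeholders, not the tuned values used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 500                    # reservoir size, input dimensionality (illustrative)
alpha, gamma, rho = 0.3, 1.0, 0.9   # leakage, input factor, spectral-radius factor

W_in = rng.normal(0.0, 1.0, (N, D))                              # fixed random Gaussian input weights
W = rng.normal(0.0, 1.0, (N, N)) * (rng.random((N, N)) < 0.05)   # sparse, random, fixed recurrent weights
W /= np.max(np.abs(np.linalg.eigvals(W)))                        # rescale to unit spectral radius; rho then
                                                                 # controls the effective radius in Eq. (1)

def esn_step(x, s):
    """One step of Eqs. (1)-(2): drive the reservoir, then leaky-integrate."""
    x_tilde = np.tanh(gamma * (W_in @ s) + rho * (W @ x))
    return (1.0 - alpha) * x + alpha * x_tilde

def ridge_readout(X, Y, reg=1e-6):
    """Closed-form ridge regression for W_out, one option for minimising Eq. (4).
    X: (T, N) collected reservoir states; Y: (T, K) one-hot targets."""
    return Y.T @ X @ np.linalg.inv(X.T @ X + reg * np.eye(X.shape[1]))
```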
II-C Hierarchical ESNs and SpaRCe
Recent works have started to analyse the benefits of reservoir computing systems composed of multiple ESNs. In these composite architectures, ESNs are connected hierarchically and are tuned differently to exhibit diverse dynamical properties. For instance, the value of the leakage term $\alpha^{(k)}$, where $k$ is the reservoir number, can vary across networks, allowing one to regulate the time-scales at which the different reservoirs operate. As a result, the overall system is characterised by a wider range of time constants, with richer dynamics and improved memory abilities. Following the architecture in Fig. 3(b), the equations that describe a system of hierarchically connected reservoirs can be defined by generalising Eqs. (1)-(2),
$\tilde{\mathbf{x}}^{(k)}(t) = f\big(\gamma^{(k)}\,\mathbf{W}^{(k,k-1)}\,\mathbf{x}^{(k-1)}(t) + \rho\,\mathbf{W}^{(k,k)}\,\mathbf{x}^{(k)}(t-1)\big)$   (5)

$\mathbf{x}^{(k)}(t) = \big(1-\alpha^{(k)}\big)\,\mathbf{x}^{(k)}(t-1) + \alpha^{(k)}\,\tilde{\mathbf{x}}^{(k)}(t)$   (6)

with $\mathbf{x}^{(0)}(t) \equiv \mathbf{s}(t)$,
where the parameters have definitions similar to those in Eqs. (1)-(2). In the hierarchical structure of Fig. 3(b), $\mathbf{W}^{(k,l)} = 0$ if $l \notin \{k-1, k\}$. In detail, $\mathbf{W}^{(k,k)}$ indicates the recurrent connectivity of reservoir $k$ and needs to have a spectral radius smaller than one, while $\mathbf{W}^{(k,k-1)}$ is the connectivity between different reservoirs and can be drawn from any desirable distribution. In this work, we focus on a hierarchical structure of two ESNs with different values for the two leakage terms.
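Reusing the definitions from the single-ESN sketch above, the two-reservoir update of Eqs. (5)-(6) can be sketched as follows (the leakage values and the inter-reservoir matrix `W21` are purely illustrative):

```python
# Illustrative two-reservoir chain (cf. Fig. 3(b)); W11 and W22 are scaled like W
# above (spectral radius below one), while W21 may follow any distribution.
alpha1, alpha2 = 0.9, 0.1            # fast first reservoir, slower second (illustrative)
W11, W22 = W.copy(), W.copy()
W21 = rng.normal(0.0, 1.0, (N, N))   # inter-reservoir connectivity

def hierarchical_step(x1, x2, s):
    """Eqs. (5)-(6) for k = 1, 2: reservoir 1 is driven by the input,
    reservoir 2 by the state of reservoir 1."""
    x1_tilde = np.tanh(gamma * (W_in @ s) + rho * (W11 @ x1))
    x2_tilde = np.tanh(W21 @ x1 + rho * (W22 @ x2))
    x1 = (1.0 - alpha1) * x1 + alpha1 * x1_tilde
    x2 = (1.0 - alpha2) * x2 + alpha2 * x2_tilde
    return x1, x2   # the read-out is defined on the concatenation [x1, x2]
```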
While the exploitation of multiple ESNs can enrich the dynamics of the system by discovering temporal dependencies over multiple time-scales, the definition of sparse representations through the SpaRCe model [manneschi2021sparce] can enhance the capacity of the reservoir to learn associations by introducing specialised neurons via learnable thresholds. Considering the representation $\mathbf{x}(t)$ from which the read-out is defined, as in Eqs. (1)-(2), SpaRCe consists of a normalisation operation followed by a thresholding:
$\hat{x}_i(t) = \dfrac{x_i(t)}{\sqrt{\langle x_i^2 \rangle}}$   (7)

$x^{\mathrm{sp}}_i(t) = \mathrm{sign}\big(\hat{x}_i(t)\big)\,\mathrm{ReLU}\big(\lvert\hat{x}_i(t)\rvert - \theta_i\big)$   (8)
where $i$ indicates the $i$-th dimension, $\mathrm{sign}$ is the sign function and $\mathrm{ReLU}$ is the rectified linear unit. Of course, the new read-out is defined from $\mathbf{x}^{\mathrm{sp}}$, that is, after the transformation given in Eqs. (7) and (8); this leaves the dynamics of the system unaltered and can easily be applied to any reservoir representation. The threshold is composed of two factors, $\theta_i = P_n(\lvert\hat{x}_i\rvert) + \theta^{\mathrm{learn}}_i$: the $n$-th percentile $P_n$ of the distribution of activities of dimension $i$ after the presentation of a number of samples with sufficient statistics, and a learnable part $\theta^{\mathrm{learn}}_i$, which is adapted through gradient descent and is initialised to arbitrarily small values at the beginning of training. The percentile $n$ can be considered an additional, interpretable hyper-parameter that controls the sparsity level of the network at the start of the training phase.²

²For different methodologies to estimate the percentile operation, see [manneschi2021sparce].
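A minimal sketch of the SpaRCe transformation in our own notation (the states passed in are assumed to be already normalised as in Eq. (7)):

```python
import numpy as np

def sparce_init(X_states, percentile):
    """Initial thresholds: the n-th percentile of |x_i| over collected states
    (X_states: samples x dimensions), plus a learnable part starting near zero."""
    theta_tilde = np.percentile(np.abs(X_states), percentile, axis=0)
    theta_learn = np.zeros(X_states.shape[1])   # adapted by gradient descent
    return theta_tilde, theta_learn

def sparce_transform(x, theta_tilde, theta_learn):
    """Eq. (8): sign-preserving soft threshold; the read-out is defined on the result."""
    theta = theta_tilde + theta_learn
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)
```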
II-D Benchmarks and Pre-processing
Convolutional neural networks (CNNs) are the best-performing architectures for processing images and discovering high-level features in visual data. However, they are static and lack temporal dynamics. In contrast, recurrent connections can be fundamental for the tasks considered here, where the driving signals are a succession of images acquired during the exploration of an environment. Thus, after a pre-processing module composed of NetVLAD [arandjelovic2016netvlad], a pre-trained CNN, we adopt a system composed of one or more ESNs. Considering that the reservoir computing paradigm is most effective when the reservoir expands the dimensionality of its input, we first decreased the dimensionality of the NetVLAD output (originally 4096) by training a feedforward network composed of one hidden layer (with 500 nodes) on the considered classification task. This new representation is then used as the input to the reservoir computing system, see Fig. 3(c). The reservoir is then trained to distinguish the different locations, which are processed by the overall architecture successively, in the natural order of acquisition. The four reservoir computing models we study are summarised in the list following the sketch below.³

³The source code for our ESN implementations can be found at https://github.com/anilozdemir/EchoVPR.
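A hedged PyTorch sketch of this pre-processing stage; the 4096-to-500 reduction follows the description above, while the choice of non-linearity and the pre-training details are our assumptions:

```python
import torch.nn as nn

n_places = 200   # number of reference places; dataset-specific, value illustrative

# Feedforward network pre-trained on the place-classification task to compress
# the 4096-d NetVLAD descriptor into the 500-d reservoir input.
pre_net = nn.Sequential(
    nn.Linear(4096, 500),      # hidden layer whose activations become s(t)
    nn.Tanh(),                 # assumed non-linearity
    nn.Linear(500, n_places),  # classification head used only during pre-training
)
# After pre-training, the head is discarded and the 500-d hidden activation
# is used as the input signal s(t) to the reservoir.
```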
- Echo State Network (ESN), where learning happens on the output weights only. The critical hyper-parameters of the system to be tuned are the leakage term, the input factor, and the learning rate.
- ESN with SpaRCe (ESN+SpaRCe), a single reservoir augmented with the learnable thresholds of Section II-C, where the thresholds are trained alongside the output weights.
- Hierarchical ESN (H-ESN), composed of two reservoirs connected unidirectionally. The read-out is defined from both reservoirs and, as for the single ESN, is subject to training. In this case, the number of hyper-parameters is theoretically more than double that of a single ESN, and it is practically challenging to perform an exhaustive tuning procedure over all of them. We selected the value of the input factor as the optimal one found for the single ESN and fixed one leakage term, focusing the tuning on the other under the constraint $\alpha^{(2)} < \alpha^{(1)}$. The constraint is justified by considering that the second reservoir would lose information that lives on fast time-scales if $\alpha^{(1)}$ were small, leading to an overall system with slow-reacting dynamics. On the contrary, if $\alpha^{(1)}$ is high and $\alpha^{(2)}$ is low, the first reservoir can react to rapid changes of the input and the second can maintain past temporal information, leading to a system that is robust to signals with both short and long temporal dependencies.
- Hierarchical ESN with SpaRCe (H-ESN+SpaRCe), which is the same as the hierarchical reservoir, but with the addition of SpaRCe.
The total number of reservoir nodes is fixed for each experiment⁴ and learning of $\mathbf{W}^{\mathrm{out}}$ and the thresholds $\boldsymbol{\theta}$ is accomplished through mini-batches and by minimisation of the softmax cross-entropy loss:
$E = -\dfrac{1}{N_b}\displaystyle\sum_{n=1}^{N_b}\sum_{k} \tilde{y}^{(n)}_k \,\log\Big(\mathrm{softmax}_k\big(\mathbf{y}^{(n)}\big)\Big)$   (9)
where $N_b$ is the minibatch size, $\mathbf{y}^{(n)}$ the output of the neural network, $\tilde{\mathbf{y}}^{(n)}$ the target output, and the indexes $n$ and $k$ correspond to the sample number and to the output node considered. The models are trained for up to a fixed number of epochs, i.e. each training image is passed through the network a corresponding number of times.

⁴Different totals are used for the hierarchical models and for the Nordland experiments.
Specifically for the Nordland dataset, which is more challenging than the previous benchmarks, we used the sigmoid cross-entropy loss as the error function, which led to better performance:
$E = -\dfrac{1}{N_b}\displaystyle\sum_{n=1}^{N_b}\sum_{k}\Big[\tilde{y}^{(n)}_k \log\sigma\big(y^{(n)}_k\big) + \big(1-\tilde{y}^{(n)}_k\big)\log\big(1-\sigma\big(y^{(n)}_k\big)\big)\Big]$   (10)
where the terms have meanings similar to those in Eq. (9) and $\sigma$ is the logistic sigmoid. The models are trained for up to a fixed total number of iterations.
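As a sketch, mini-batch training of the read-out under either loss could look as follows (PyTorch; all names are ours, and `W_out` is assumed to be a tensor created with `requires_grad=True`):

```python
import torch
import torch.nn.functional as F

def train_readout(W_out, states, targets, use_sigmoid=False, lr=1e-3, batch=16, epochs=10):
    """Mini-batch gradient descent on Eq. (9) (softmax) or Eq. (10) (sigmoid).
    states: (T, N) reservoir representations; targets: (T, K) one-hot labels."""
    opt = torch.optim.Adam([W_out], lr=lr)
    for _ in range(epochs):
        for i in range(0, states.shape[0], batch):
            x, y = states[i:i + batch], targets[i:i + batch]
            logits = x @ W_out.T
            if use_sigmoid:   # Eq. (10), used for Nordland
                loss = F.binary_cross_entropy_with_logits(logits, y)
            else:             # Eq. (9), softmax cross-entropy
                loss = F.cross_entropy(logits, y.argmax(dim=1))
            opt.zero_grad()
            loss.backward()
            opt.step()
```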
III Experiments
III-A Datasets and Performance Metrics
We evaluate the performance of the proposed models on four standard benchmarks: GardensPoint [glover2014day], ESSEX3IN1 [zaffar2020memorable], SPEDTest [chen2018learning], and Nordland [sunderhauf2013we], using two metrics: prediction accuracy and precision-recall area-under-curve (AUC). GardensPoint consists of indoor, outdoor and natural environments with both viewpoint and conditional changes throughout the dataset; a small frame tolerance around the ground truth is accepted. ESSEX3IN1 consists of images taken on a university campus and its surroundings, focusing on perceptual aliasing and confusing places; there is no tolerance for this dataset, hence the prediction has to be exact. SPEDTest consists of low-quality but high-depth images collected from CCTV cameras around the world; it includes environmental changes such as variations in weather, season and illumination; there is no tolerance for this dataset either. Nordland consists of images taken during train traversals in four different seasons in Norway; the viewpoint angle is fixed, although there is high weather, seasonal and illumination variability. The frame tolerance used is the same as for the sequential models [chancan2020hybrid] we compare against (see Section IV-D for more details).
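For concreteness, the AUC metric can be computed from best-match confidences with scikit-learn (a sketch under our own naming; `is_match` is the tolerance test from Section II-A):

```python
from sklearn.metrics import auc, precision_recall_curve

def pr_auc(match_scores, match_correct):
    """Precision-recall AUC: `match_scores` are the best-match confidences for
    each query; `match_correct` flags whether each match fell within tolerance
    (cf. `is_match` above)."""
    precision, recall, _ = precision_recall_curve(match_correct, match_scores)
    return auc(recall, precision)
```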
III-B Training ESNs and Hyper-parameter Tuning
The lack of a validation set for the considered tasks makes hyper-parameter selection challenging. This difficulty is emphasised by the small number of samples in the training set (i.e. one sample per place) and by the major statistical differences between training and test data. In particular, the seasonal difference in the acquisition of reference and query data leads to the possible presence or absence of snow and to shifts in colour intensities. In our preliminary experiments, many different hyper-parameter settings reached perfect accuracy on the training set yet gave degraded, variable performance on the test set. We believe previous works lack a clearly defined methodology for overcoming this hyper-parameter selection problem.
We tuned the hyper-parameters of the reservoir by using a small percentage of the test set as validation. In other words, while the read-out was always optimised on reference samples, hyper-parameters were optimised through grid search over the performance achieved on this small subset of the query data. Being aware of the limitations of this methodology, we later show that it is possible to use the test set of one task as validation for another task with little performance loss, demonstrating that the model can achieve generalisation if the hyper-parameters are selected to be robust to non-excessive statistical changes (see Section IV-C).
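The tuning protocol can be sketched as a simple grid search scored on the held-out query slice (the grids and function names here are illustrative, not the ranges used in our experiments):

```python
from itertools import product

def grid_search(train_fn, score_fn, val_queries):
    """Grid search over (alpha, gamma, lr): fit the read-out on the reference
    traversal, score on the small held-out slice of the query traversal."""
    best_params, best_score = None, -1.0
    for alpha, gamma, lr in product([0.1, 0.3, 0.9], [0.5, 1.0, 2.0], [1e-3, 1e-2]):
        model = train_fn(alpha=alpha, gamma=gamma, lr=lr)  # trained on reference only
        score = score_fn(model, val_queries)               # accuracy on held-out queries
        if score > best_score:
            best_params, best_score = (alpha, gamma, lr), score
    return best_params
```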
IV Results
IV-A Assessing ESN Utility for Visual Place Recognition
The performance of ESN and ESN+SpaRCe was first evaluated on three datasets (GardensPoint, SPEDTest and ESSEX3IN1). Fig. 4 shows that both ESN variants outperform state-of-the-art single-view matching models (including NetVLAD with read-out and hidden layers) in all three conditions, in both mean accuracy and mean AUC. The addition of the SpaRCe layer provides a further improvement on all three datasets.
IV-B Hierarchical Models for Performance Improvement
We then assessed whether a hierarchical ESN architecture would improve results on the challenging GardensPoint dataset. Fig. 5 shows that the introduction of hierarchical ESNs increased the median accuracy scores while decreasing their variance across trials (comparing ESN with H-ESN+SpaRCe). AUC scores showed little change, but they were already close to the maximum possible and thus there was little room for improvement. The performance improvement gained from the hierarchical model suggests that the GardensPoint dataset contains longer temporal dependencies among images that cannot be captured by a single ESN. This result can be intuitively understood by comparing the image sequences across the three datasets presented in Fig. 4: the GardensPoint data are captured at a higher frame-rate than the other datasets, where images appear more static and separated in time from each other. Consequently, GardensPoint has a more complex underlying temporal structure.

IV-C Generalisability Study
We also analysed the sensitivity of the ESN models with respect to hyper-parameter selection. Fig. 6 shows accuracy scores when hyper-parameters tuned on GardensPoint are retained while training on SPEDTest and ESSEX3IN1. We chose the hyper-parameters from GardensPoint because generalisation is more likely to occur when the baseline task is more complex than the new tasks to which it is applied: richer and more difficult datasets can lead neural networks to discover high-level features that transfer to simpler datasets, while the reverse is difficult. Fig. 6 demonstrates that, even with sub-optimal hyper-parameters, the introduction of ESNs leads to higher performance than the single-view matching NetVLAD variants. Again, hierarchical ESNs provide a noticeable improvement in median accuracy and AUC scores while once more reducing variance. Moreover, the performance remains high for both accuracy and AUC, close to the virtually perfect scores achieved when hyper-parameters were tuned on the same dataset (see Fig. 4).

IV-D Comparing ESNs with Sequential VPR Models

In this section, we benchmark the performance of ESNs against state-of-the-art sequence matching VPR models. Specifically, we compare with two models recently reported to achieve strong performance [chancan2020hybrid] on the challenging Nordland dataset [sunderhauf2013we]. Both models use a bio-inspired feedforward neural network (FlyNet) to encode visual information and either a recurrent neural network (RNN) or a continuous attractor neural network (CANN) to introduce temporality. Fig. 7 shows the accuracy scores for the standard ESN and ESN+SpaRCe (no accuracy scores are available for the comparison models). For the AUC test, the standard ESN performs well, with SpaRCe improving results further. This compares favourably to both static view matching models (e.g. NetVLAD+HL) and the sequential models FlyNet+RNN and FlyNet+CANN.
V Conclusions
In this paper, we have demonstrated the viability of ESNs as a solution to the VPR problem. All the ESN variants implemented achieve higher performance than the single-view matching models (AMOSNet, DenseVLAD, and the NetVLAD variants) on three benchmark datasets (GardensPoint, SPEDTest, ESSEX3IN1). On the more challenging Nordland dataset, two of our models (the single-reservoir ESN and ESN+SpaRCe) achieved performance above or equal to the class-leading results of the sequential matching models (FlyNet+RNN and FlyNet+CANN). While performance is comparable, we note that FlyNet [chancan2020hybrid] has many fewer parameters. However, the ESNs do not require images to be cached for multiple comparisons and also implicitly account for any velocity dependence through their temporal dynamics.
In terms of the recent ESN advances, namely the hierarchical architecture and SpaRCe, the results differ depending on the dataset. The addition of SpaRCe to the standard ESN improved performance considerably, showing how the introduction of sparse representations can efficiently help the classification process. The utilisation of hierarchical ESNs was beneficial for the GardensPoint dataset, but not for the larger and more challenging Nordland dataset. Hierarchical models have higher complexity in terms of the number of hyper-parameters and can overfit the training data. This is a particular issue for the benchmark VPR datasets, as there is only a 'single' sample per place to learn from (as opposed to standard machine learning datasets with many samples per class, e.g. thousands of samples per class for the well-known MNIST dataset). Preliminary analysis supports this hypothesis: hierarchical models achieved perfect scores on the Nordland training set (summer) but low performance when presented with the test set (winter). Such issues might be addressed by augmenting the training data [shorten2019survey] (e.g. through artificial illuminance changes or weather effects) to supply a variety of real-world conditions.
While there are many ways to optimise ESNs for the VPR problem, an intriguing future course of action is to take inspiration from invertebrate mini-brains that possess structural motifs analogous to both deep and shallow ESNs. A simple example is the insect mushroom body. This is considered the cognitive centre of the insect brain [menzel2001cognitive] and is necessary for learning relationships, sequences and patterns in honey bees [menzel2001cognitive, boitard2015gabaergic, devaud2015neural, cope2018abstract]. Structurally, the mushroom body is a three-layer network with a compact input layer, an expanded middle layer of inter-neurons called Kenyon cells, and a small layer of output neurons [fahrbach2006structure]. The connections between the Kenyon cells and the output neurons are plastic and modified by learning [gerber2004engram], and there are chemical and electrical synapses between the Kenyon cells [zheng2018complete, takemura2017connectome, liu2016gap]. These features are analogous to the recurrent connections in the reservoir layer of an ESN, and it has been hypothesised [manneschi2021sparce, manneschi2021exploiting] that these recurrent connections in the Kenyon cell layer could contribute to the reverberant activity of the mushroom body that supports forms of memory [cognigni2018right]. Given the similar structures, insights gained from neurobiology could help shape future ESN investigations and, in turn, analysis of the optimal structure for VPR could shed light on the function of different brain areas.
In practice, it is desirable that places are recognised from a single input image, allowing robots to truly solve the kidnapped-robot problem. However, in cases where such methods fail, traversing portions of a familiar path can help to disambiguate the input. ESNs provide a means to exploit such temporal dynamics using only visual data, but the more powerful variants require tuning of a large number of parameters, which may not be possible when only a small number of training examples are available. Other methods [milford2012seqslam, chancan2020hybrid] have focused on low-parameter models but often require additional cues, such as velocity, to focus the image search. Ensemble methods [hausler2019multi, fischer2020event] that combine these features are emerging and may provide the best of both worlds.
Finally, assessment of methods on robots in the real world is essential. This will not only challenge current approaches to be more robust, but will also expose difficulties that pre-collected datasets cannot, such as the need for continual learning or guarantees of robot safety.