Official PyTorch implementation of paper "A Hybrid Compact Neural Architecture for Visual Place Recognition" by M. Chancán (RA-L & ICRA 2020) https://doi.org/10.1109/LRA.2020.2967324
State-of-the-art algorithms for visual place recognition can be broadly split into two categories: computationally expensive deep-learning/image retrieval based techniques with minimal biological plausibility, and computationally cheap, biologically inspired models that yield poor performance in real-world environments. In this paper we present a new compact and high-performing system that bridges this divide for the first time. Our approach comprises two key components: FlyNet, a compact, sparse two-layer neural network inspired by fruit fly brain architectures, and a one-dimensional continuous attractor neural network (CANN). Our FlyNet+CANN network combines the compact pattern recognition capabilities of the FlyNet model with the powerful temporal filtering capabilities of an equally compact CANN, replicating entirely in a neural network implementation the functionality that yields high performance in algorithmic localization approaches like SeqSLAM. We evaluate our approach and compare it to three state-of-the-art methods on two benchmark real-world datasets with small viewpoint changes and extreme appearance variations including different times of day (afternoon to night), where it achieves an AUC performance of 87%.
This is an alternate implementation of "A Hybrid Compact Neural Architecture for Visual Place Recognition", which uses FlyNet, a neural architecture inspired by the brain of the fruit fly. This project also combines FlyNet with continuous attractor networks and compares FlyNet with other neural networks.
Performing visual localization reliably is a challenge for any robotic system operating autonomously over long time periods in real-world environments, due to viewpoint changes, perceptual aliasing (multiple places may look similar), and appearance variations over time (e.g. day/night cycles, weather/seasonal conditions). Convolutional neural networks (CNNs) [3, 4, 5] have been applied to this task with success, but are typically only usable in real time [6, 7, 8] with dedicated hardware (GPUs). In addition, vanilla CNNs pre-trained on benchmark datasets such as ImageNet generally neglect any temporal information between images.
To address these shortcomings, researchers have introduced recurrent models [10, 11, 12], but this has increased the complexity of the training process of CNN models, further limiting their deployability in a range of real-world applications, especially in scenarios with limited computational resources and/or training data.
In this work, we introduce FlyNet, a compact network architecture inspired by the fruit fly olfactory neural circuit, which is known to assign similar neural activity patterns to similar odors. Furthermore, FlyNet addresses the issue of temporal relationships between images from previously visited locations, since it is coupled with a temporal encoding model such as a recurrent neural network (RNN) or a continuous attractor neural network (CANN) (see Fig. 1). We show that our resulting network, FlyNet+CANN, achieves competitive performance on two benchmark robotic datasets but with far fewer parameters (see Fig. 2) and a smaller training time, storage and computation footprint than conventional CNN-based approaches. In Fig. 2 the area of each circle is proportional to the number of layers per model. The pre-trained networks used for feature extraction in MPF (VGG-16) and LoST-X (RefineNet) have 13 and 213 layers respectively, while FlyNet+CANN has only 2.
This paper is structured as follows. Section II provides a brief overview of related work on VPR and the biological inspiration for the FlyNet architecture; Section III describes the FlyNet architecture in detail; Section IV and Section V present the experiments and evaluations respectively, where our results are compared against three existing state-of-the-art techniques on two real-world benchmark datasets; and Section VI provides discussion around our results as well as future work.
This section outlines some key biological background for navigation in the insect brain, reviews the usage of deep learning-based approaches for localization and discusses recent developments in temporal filtering of sequential data to further improve performance.
Our understanding of how animals navigate using vision has been used as inspiration for designing effective VPR algorithms. RatSLAM is one example of this, using a model based on the rodent brain to perform visual SLAM over large environments for the first time. Likewise, researchers have developed a range of robotic navigation models based on other animals, including insects [16, 17, 18].
Insects such as ants, bees and wasps regularly return to specific locations that are of significance to them such as the nest, large food sources and the routes leading to them. It is typical for bees to range more than 6 km from the hive on a foraging trip . Furthermore, these insects use environmental cues, as well as visual and odor information for effective navigation [20, 21]. They can effectively encode multiple places in their memory and use visual features to identify and navigate between them [22, 23].
Flies also exhibit a similar ability to navigate [24, 25]. In fact, their brains share the same general structure, with the central complex being closely associated with navigation, orientation and spatial learning [27, 28]. VPR is, however, most likely mediated by processing within the “mushroom bodies”, a separate pair of structures within the brain. These regions are involved in classification, learning, and recognition of both olfactory and visual information in bees and ants. They receive densely coded and highly processed input from the sensory lobes, which then connects sparsely to the very large number of intrinsic neurons within the mushroom bodies. Their structure has been likened to a perceptron, and is considered optimal for learning and correctly classifying complex input.
These impressive capabilities, achieved with relatively small brains, make them attractive models for roboticists. For FlyNet, we take inspiration from the olfactory neural circuits found in the fruit fly to design our network architecture. Our focus here is primarily on taking high level inspiration from the size and structure of what is known of the fly brain and investigating the extent to which it can be used for VPR and localization, much as in the early RatSLAM work .
Over recent years CNNs have been applied with great success to a range of recognition problems, including the VPR problem addressed here. These models can handle many challenging environments with both visual appearance and viewpoint changes, as well as large-scale VPR problems [30, 31, 32, 3, 33]. Despite their success, these approaches typically rely on CNN models that are pre-trained on various computer vision datasets using millions of images [6, 7, 30, 5]. CNN models trained in an end-to-end fashion specifically for VPR tasks have also recently been proposed [30, 4]. However, they are still initialized from pre-trained network architectures, e.g., AlexNet, VGGNet, or ResNet, with slight modifications to the model architecture to perform VPR. All these systems share common undesirable characteristics with respect to their widespread deployability on robots, including large network sizes, significant computational and storage requirements, and onerous training requirements.
Temporal information has also been incorporated using recurrent layers such as Long Short-Term Memory (LSTM). These temporal-based approaches have been applied specifically to navigation and spatial localization in artificial agents. In a nice closure back to the inspiring biology, these approaches led to the emergence of grid-like representations in networks trained to perform path integration, resembling neural representations found in grid cells and other entorhinal cortex cell types in mammalian brains. One of the older approaches to filtering temporal information in a neural network incorporated continuous attractor neural networks (CANNs), with pre-assigned weight structures set up to model the activity dynamics of place and grid cells found in the rat brain. Other non-neural techniques have been developed, including SeqSLAM, which matches sequences of pre-processed frames from a video feed to provide an estimate of place, with a range of follow-on systems developed subsequently [41, 42, 43].
The work to date has captured many key aspects of the VPR problem, investigating complex but powerful deep learning-based approaches, biologically-inspired approaches that work in simulation or in small laboratory mazes, or larger mammalian-brain based models with competitive real-world robotics performance. In this work we attempt to merge the desirable properties of several of these techniques by developing a novel bio-inspired neural network architecture for VPR that, through being loosely based on insect brains, is extremely compact yet achieves competitive localization performance using the filtering capabilities of a continuous attractor neural network. We also show how our compact core FlyNet architecture can easily be adapted to other filtering techniques, including RNNs and SeqSLAM.
In this section, we present the FlyNet core algorithm, inspired by the fly algorithm introduced in , and its derived neural network architectures.
The fly olfactory system solves an essential computing problem, approximate similarity search, by assigning similar neural activity patterns to similar odors. This strategy is a three-step procedure performed as the input odor information passes through a three-layer neural circuit. First, the firing rates across the first layer are centered to the same mean for all odors (removing the concentration dependence). Second, the third layer is connected to the second layer by a sparse, binary random matrix (each output neuron receives and sums about 10% of the firing rates in the second layer). Third, only the highest-firing 5% of neurons across the third layer create a specific binary tag for the input odor (winner-take-all).
In this work, we leverage the fly algorithm described above, from a computer vision perspective, to propose Algorithm 1, the FlyNet Algorithm (FNA). Fig. 3 shows the FNA mapping between the input and output dimensions. The FNA computes random projections defined by a sparse, binary connection matrix (the scheme shows only a subset of these projections). The WTA step then creates a binary output tag, which is a compact representation of the input feature vector.
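The two FNA steps described above (a fixed sparse binary random projection, then winner-take-all binarization) can be sketched as follows. This is a minimal NumPy illustration, not the official implementation; the dimensions, connection fraction (10%) and WTA fraction (5%) follow the description in the text, while the input size is an arbitrary placeholder.

```python
import numpy as np

def flynet_algorithm(x, out_dim=64, conn_frac=0.1, wta_frac=0.05, seed=0):
    """Sketch of the FlyNet Algorithm (FNA): a fixed sparse binary random
    projection followed by a winner-take-all (WTA) binarization."""
    rng = np.random.default_rng(seed)
    # Sparse binary projection: each output unit sums ~10% of the inputs.
    W = (rng.random((out_dim, x.shape[0])) < conn_frac).astype(float)
    activations = W @ x
    # WTA: only the top 5% most active units fire, producing a binary tag.
    k = max(1, int(wta_frac * out_dim))
    tag = np.zeros(out_dim)
    tag[np.argsort(activations)[-k:]] = 1.0
    return tag

x = np.random.default_rng(1).random(1024)   # e.g. a flattened image vector
tag = flynet_algorithm(x)
print(int(tag.sum()))                       # active units: 5% of 64 -> 3
```

Because the projection matrix is fixed and binary, the encoder itself has no trainable parameters, which is the source of FlyNet's compactness.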
We utilize the proposed FNA as our core feature encoder to develop the compact neural network architectures described in the next subsection. We also perform an image pre-processing step before applying Algorithm 1. The details of this step are outlined in the following Section IV-C.
We implement three neural network architectures based on the FNA: a single-frame based system (FlyNet), and two multi-frame based (FlyNet+RNN, FlyNet+CANN) VPR systems with temporal filtering capabilities. For comparison purposes, we also deploy a FlyNet+SeqSLAM system, described in more detail in Section IV.
Our custom two-layer neural network architecture, FlyNet (see Fig. 4, left), takes a gray-scale input image and incorporates the FNA as a hidden layer. The FNA output then feeds into a fully-connected (FC) layer of 1000 units with a soft-max activation function in order to compute a class score for each input image to the network. In other words, the FNA maps the input images into a compact feature vector representation to perform VPR.
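A self-contained NumPy sketch of this forward pass (FNA hidden layer into a 1000-unit soft-max FC layer) is shown below. The layer sizes follow the text; the FC weights here are randomly initialized placeholders rather than trained values, and the input dimension is an illustrative assumption.

```python
import numpy as np

def fna(x, out_dim=64, conn_frac=0.1, wta_frac=0.05, seed=0):
    # FNA hidden layer: sparse binary random projection + winner-take-all.
    rng = np.random.default_rng(seed)
    W = (rng.random((out_dim, x.shape[0])) < conn_frac).astype(float)
    a = W @ x
    k = max(1, int(wta_frac * out_dim))
    tag = np.zeros(out_dim)
    tag[np.argsort(a)[-k:]] = 1.0
    return tag

def flynet_forward(x, W_fc, b_fc):
    # FC layer (1000 place classes) with soft-max over the binary FNA tag.
    logits = W_fc @ fna(x) + b_fc
    e = np.exp(logits - logits.max())          # numerically stable soft-max
    return e / e.sum()

rng = np.random.default_rng(2)
x = rng.random(1024)                           # flattened gray-scale image
W_fc = rng.standard_normal((1000, 64)) * 0.01  # placeholder (untrained) weights
b_fc = np.zeros(1000)
scores = flynet_forward(x, W_fc, b_fc)
print(scores.shape, round(float(scores.sum()), 6))  # (1000,) 1.0
```

Only the FC layer's weights would be trained; the FNA encoder remains fixed throughout.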
FlyNet vs. Fully-Connected Neural Network. A generalized version of our proposed FlyNet architecture was also considered during our experiments, namely one where the FNA hidden layer is replaced by an FC layer. The main difference between FlyNet and a conventional two-layer neural network is that FlyNet requires no trainable parameters in its hidden layer, encoding the input vector with a fixed sparse, binary matrix, whereas a conventional two-layer network would require training all of those parameters; the latter was therefore left as an avenue for future work.
We conducted additional experiments incorporating the algorithmic technique SeqSLAM  (described in Section II-C) on top of our single-frame FlyNet network in order to obtain a multi-frame system as a comparison method along with our temporal filtering based architectures (FlyNet+RNN and FlyNet+CANN) described in the following Subsections.
The first temporal filtering enhancement involved incorporating a vanilla RNN on top of the FlyNet architecture. We also investigated the use of more sophisticated types of recurrent layers such as a Gated Recurrent Unit (GRU) and LSTM; however, they showed no significant performance improvements despite having far more parameters. Fig. 4 (middle) illustrates one of our temporal filtering based neural architectures, FlyNet+RNN.
We implemented a variation of the CANN model introduced in , motivated by its suitability as a compact, neural network-based way to implement the filtering capabilities of SeqSLAM. As described in Section II-C, a CANN is a type of recurrent network that uses pre-assigned weight values within its configuration. Fig. 1 (middle) shows our detailed FlyNet+CANN implementation; see also Fig. 4 (right). As can be seen in Fig. 1 (middle), in contrast to an RNN, a unit in a CANN can excite or inhibit itself and nearby units using excitatory (arrows) or inhibitory (rounded) connections respectively. For this implementation, activity shifts in the network representing movement through the environment were implemented through a direct shift-and-copy action, although this could be implemented with more biologically faithful details such as velocity units and asymmetric connections, as in prior CANN research.
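The shift-and-copy dynamics with local excitation and global inhibition can be sketched as follows. This is an illustrative toy model, not the paper's tuned CANN: the weight profile, excitation/inhibition strengths, network size, and input location are all assumptions.

```python
import numpy as np

def cann_step(a, inp, excite=1.0, inhibit=0.2):
    """One update step of a 1D CANN sketch: shift-and-copy for motion,
    local excitation of neighboring units, external input injection,
    global inhibition, and normalization of the activity profile."""
    a = np.roll(a, 1)                            # shift-and-copy (movement)
    # Local excitation: each unit excites itself and its two neighbors.
    a = excite * (a + 0.5 * np.roll(a, 1) + 0.5 * np.roll(a, -1))
    a = a + inp                                  # external (e.g. FlyNet) evidence
    a = a - inhibit * a.sum() / a.size           # global inhibitory term
    a = np.clip(a, 0.0, None)                    # activities stay non-negative
    return a / (a.max() + 1e-9)                  # normalize peak to ~1

n = 100
a = np.zeros(n)
inp = np.zeros(n); inp[10] = 1.0                 # repeated evidence at place 10
for _ in range(5):
    a = cann_step(a, inp)
print(a.shape, round(float(a.max()), 3))         # (100,) 1.0
```

Repeated consistent input builds a stable activity bump, which is what gives the CANN its temporal filtering behavior against momentary mismatches.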
To provide a comprehensive evaluation of the capabilities of our proposed systems, we conducted extensive evaluations on two of the most widespread benchmarks used in VPR, the Nordland and Oxford RobotCar datasets, and compared against three state-of-the-art techniques. In this section we describe our network configurations, dataset preparation, the relevant pre-processing, and the comparison systems we implemented (code available at https://github.com/mchancan/flynet).
We test our four FlyNet baselines (described in Section III-C) in order to compare them, and further evaluate our best-performing network against current state-of-the-art VPR methods. We use the same number of units for corresponding layers across our models. For instance, the baseline architectures for both FlyNet and FlyNet+RNN use an FNA layer with 64 units and corresponding FC layers with 1000 units. The number of recurrent units in the RNN-based model was 512, while the CANN-based model uses 1002 units (see Table I). The Adam optimizer was chosen to train our baselines, with the learning rate set to 0.001 for all experiments.
Table I: FlyNet baseline architectures (number of layers, parameters, and neurons per model).
The experiments are performed on two benchmark datasets widely used in the VPR literature: Nordland and Oxford RobotCar.
The Nordland dataset, introduced in , comprises four single traverses of a train journey in northern Norway with extreme seasonal changes (spring, summer, fall, and winter). The dataset is primarily used to evaluate visual appearance change, as instantiated through its four-season coverage. In our experiments we used the four traverses to perform VPR at 1 fps as in . We use the summer subset for training our models, and the remainder for testing.
The Oxford RobotCar dataset  provides several traverses with different lighting (e.g. day, night) and weather (e.g. direct sun, overcast) conditions, captured from a car driving through the city of Oxford; the dataset implicitly contains various challenges of pose and occlusion such as pedestrians, vehicles, and bicycles. In our evaluations we used the same subsets as in , which include the two traverses with the most challenging illumination conditions, referred to here as day (overcast summer) and night (autumn), for testing, and overcast (autumn) for training. The variable distance between frames ranged from 0 to 5 meters.
Table II: Dataset characteristics (appearance and viewpoint changes).
The image preprocessing procedure we use to evaluate our FlyNet baselines comprises two steps: converting the images into a single channel (grayscale) with pixel values normalized to [0, 1], and resizing them to a reduced fixed resolution. The dataset length we use in all traverses is 1000 images (places). In contrast, the three state-of-the-art VPR methods we compare with were provided with the images at their original resolutions for both Nordland and Oxford RobotCar.
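The two preprocessing steps can be sketched as below. Note that the target resolution here (32x64) is an illustrative placeholder, since the exact resized dimensions are not stated in this excerpt, and the block-average resize (which assumes evenly dividing dimensions) stands in for whatever resizing routine the original pipeline used.

```python
import numpy as np

def preprocess(img_rgb, out_h=32, out_w=64):
    """Sketch of the two preprocessing steps: RGB -> gray-scale in [0, 1],
    then a block-average downsample to a fixed (hypothetical) resolution."""
    gray = img_rgb.mean(axis=2) / 255.0           # single channel, [0, 1]
    h, w = gray.shape
    # Crop so dimensions divide evenly, then average over blocks.
    gray = gray[: h - h % out_h, : w - w % out_w]
    bh, bw = gray.shape[0] // out_h, gray.shape[1] // out_w
    return gray.reshape(out_h, bh, out_w, bw).mean(axis=(1, 3))

img = np.random.default_rng(0).integers(0, 256, (128, 256, 3)).astype(float)
small = preprocess(img)
print(small.shape, bool(0.0 <= small.min() <= small.max() <= 1.0))  # (32, 64) True
```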
Our best-performing baseline, FlyNet+CANN, is compared to three state-of-the-art multi-frame VPR systems: the algorithmic technique SeqSLAM (without FlyNet attached), and two deep learning-based methods, LoST-X and the recent Multi-Process Fusion (MPF).
SeqSLAM shows state-of-the-art results on VPR under challenging appearance variations. We used the MATLAB implementation of SeqSLAM presented in , with a fixed sequence length and a threshold of 1 in order to obtain consistent and comparable results. The remaining SeqSLAM parameters retained their default values.
The multi-frame LoST-X pipeline  uses visual semantics to perform VPR with opposite viewpoints across day and night cycles. This method uses the RefineNet architecture  (a ResNet-101-based model) as a feature extractor, pre-trained on the Cityscapes dataset  for high-resolution semantic segmentation.
We evaluate the quantitative results of our best-performing network against the three multi-frame state-of-the-art methods (SeqSLAM, LoST-X, and MPF) using both precision-recall (PR) curves and area under the curve (AUC) metrics. The tolerance used to classify a query place as a correct match was being within 20 frames of the ground truth location on the Nordland dataset, and up to 50 meters (10 frames) away from the ground truth location on the Oxford RobotCar dataset, as per previous research.
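A generic sketch of this frame-tolerance evaluation is shown below: a retrieved match counts as correct if it lies within `tol` frames of ground truth, and the PR curve is swept over a match-confidence threshold. This is an illustration of the stated metric under a toy example, not the authors' evaluation code.

```python
import numpy as np

def pr_curve(pred_idx, scores, gt_idx, tol=20):
    """Precision-recall swept over a match-confidence threshold. A query's
    retrieved place is correct if within `tol` frames of ground truth."""
    correct = np.abs(pred_idx - gt_idx) <= tol
    order = np.argsort(-scores)               # most confident matches first
    tp = np.cumsum(correct[order])            # true positives per threshold
    precision = tp / np.arange(1, len(order) + 1)
    recall = tp / len(order)
    return precision, recall

def auc(precision, recall):
    # Area under the PR curve via trapezoidal integration over recall.
    return float(np.sum((recall[1:] - recall[:-1])
                        * (precision[1:] + precision[:-1]) / 2))

# Toy example: 100 queries; the 10 most confident retrievals are wrong.
gt = np.arange(100)
pred = gt.copy(); pred[:10] += 100
scores = np.linspace(1.0, 0.5, 100)           # descending confidence
p, r = pr_curve(pred, scores, gt)
print(p.shape, float(r[-1]))                  # (100,) 0.9
```

Note how confident wrong matches depress precision at low recall, which is exactly the regime the PR curves in Fig. 11 discriminate on.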
We trained our four FlyNet baselines in an end-to-end manner from scratch. On the Nordland dataset, we used the Summer traversal (reference) for training and the Fall/Winter traverses (query) for testing. On the Oxford RobotCar dataset, the Overcast traversal (reference) was chosen for training, and the Day/Night traverses (query) for testing. Similarly, the three state-of-the-art methods were evaluated using the same reference/query traverses, respectively.
Figs. 5 and 8 show the qualitative results for our best performing baseline (FlyNet+CANN) on both benchmark datasets. It can be seen in Fig. 5 that the system is able to correctly match places under significant seasonally-driven appearance changes (summer vs winter) on the Nordland dataset. Similarly, Fig. 8 shows again a correctly retrieved image sequence under drastic illumination changes (overcast vs night), even when facing occlusions such as vehicles.
The single-frame model, FlyNet, and its recurrent version, FlyNet+RNN, struggled in these conditions. FlyNet alone did not perform well: as shown in Fig. 9 (left), the best test recall on the Nordland dataset was below 40% (Summer vs Fall), and on Oxford RobotCar it was around 20% (Overcast vs Day). Integrating FlyNet with an RNN improved performance in some cases, see Fig. 9 (right), but on the other Nordland traverses (spring, winter) as well as on the Oxford RobotCar traverses (day, night) the RNN did not improve on the single-frame model.
Additionally, we compared our four FlyNet baselines’ performance by calculating the area under the curve (AUC) metric when trained on Summer and tested on Fall/Winter seasonal changes for the Nordland dataset (see Fig. 7) and when trained on Overcast and tested on Day/Night time changes for the Oxford RobotCar dataset (see Fig. 10). It can be seen that the FlyNet+CANN network outperformed the other baselines, and this is the network we choose to compare with the state-of-the-art methods in the next subsection.
Fig. 11 (left) shows the PR curves on the Nordland dataset. MPF is the best-performing VPR system, able to recall almost all places at 100% precision on both the Fall and Winter testing traverses, and also achieving higher AUC values (see Fig. 12 (left)). The deep learning-based method LoST-X, on the other hand, is not able to recall a single match at 100% precision on either testing traverse. In contrast, our FlyNet+CANN network achieves state-of-the-art performance comparable with SeqSLAM and MPF for all of the testing traverses, as can be seen in Fig. 12 (left).
Similarly, the PR performance on the Oxford RobotCar dataset is shown in Fig. 11 (right). Notably, the FlyNet+CANN baseline again achieves state-of-the-art results, comparable with the SeqSLAM, LoST-X, and MPF approaches. The FlyNet+CANN model maintains its performance even under the extreme appearance change caused by the overcast-night cycle, as seen in Fig. 11 (bottom right); this is reflected in Fig. 12 (right), where the FlyNet+CANN network shows higher AUC values than the remaining state-of-the-art techniques.
The computational cost of our best-performing VPR network (FlyNet+CANN) is compared with the three state-of-the-art methods (SeqSLAM, LoST-X, and MPF) in terms of the running time for (1) feature extraction by the neural network models, (2) feature matching between the reference and query traverses, and (3) the average processing time to compare and match a single query image against the 1000 database (reference) images, calculated as (Feat. Ext. + Feat. Match.)/1000. Table III shows that our FlyNet+CANN network is 6.5, 310, and 1.5 times faster than MPF, LoST-X, and SeqSLAM respectively in terms of average processing time (m: minutes, s: seconds).
| VPR system | Feat. Ext. | Feat. Match. | Average |
| --- | --- | --- | --- |
| FlyNet+CANN | 35s | 25s | 0.06s (16.66 Hz) |
| MPF | 1.9m | 4.6m | 0.39s (2.56 Hz) |
| LoST-X | 110m | 200m | 18.6s (0.05 Hz) |
| SeqSLAM | 50s | 40s | 0.09s (11.11 Hz) |
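The "Average" column follows the formula stated above, (Feat. Ext. + Feat. Match.)/1000; for example, for FlyNet+CANN:

```python
def avg_query_time(feat_ext_s, feat_match_s, n_db=1000):
    """Average time to match one query against the reference database:
    (feature extraction + feature matching) / number of database images."""
    return (feat_ext_s + feat_match_s) / n_db

# FlyNet+CANN from Table III: 35 s extraction + 25 s matching, 1000 places.
t = avg_query_time(35, 25)
print(t, round(1 / t, 2))  # 0.06 s/query, ~16.67 Hz
```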
The tested VPR systems were run with their default parameters and recommended configurations. We performed all our tests on an Ubuntu 16.04 LTS computer with a GeForce GTX 1080 Ti GPU. The SeqSLAM algorithm used CPU processing with MATLAB. Both LoST-X and MPF used the GPU for their RefineNet and VGG-16 pre-trained networks, respectively, as did our FlyNet+CANN network. The higher time required by the LoST-X framework was due to its intermediate checking and processing steps, which ran on the CPU.
Fig. 13 shows a comparison similar to that of Fig. 2, but with moderate appearance changes (overcast vs. day) on the Oxford RobotCar dataset. In both figures, again, the area of each circle is proportional to the number of layers per model, except for the SeqSLAM system, which performs an algorithmic matching procedure. We can see that state-of-the-art systems such as MPF, LoST-X and SeqSLAM achieve better results than in Fig. 2 in terms of AUC metrics (95%, 95% and 93% respectively), while FlyNet+CANN also presents competitive performance with 96%, as explained in Subsections V-B and V-C and shown in Fig. 11 (top right) and Fig. 12 (right).
In this paper, we presented a novel bio-inspired visual place recognition model based in part on the fruit fly brain and combined with a compact continuous attractor neural network. Our proposed system achieves competitive performance compared to benchmark systems with much larger network storage and compute footprints. It is also, to the best of our knowledge, the furthest an insect-based place recognition system has been pushed with respect to demonstrating real-world appearance-invariant place recognition without resorting to full deep learning architectures.
There are a number of promising avenues for future research. The first is the untapped source of architectural inspiration that can be drawn from further study of insect brains. Insects face tremendous pressure to minimize neural costs for metabolic reasons [48, 49]; they have to have the most efficient brains possible. For example, there is a small recurrent pathway in the honey bee mushroom body called the protocerebral calycal tract. It is implicated in sharpening representations in the mushroom bodies as well as in performing complex classification tasks, both capabilities that might enhance the versatility and utility of the system described here.
Future research bridging the divide between well-characterized insect neural circuits [59, 60] and recent deep neural network architectures and computational models of network dynamics related to spatial memory and navigation [61, 62] is likely to yield further performance and capability improvements, and may also shed new light on the functional purposes of these neural systems.
The authors would like to thank Jake Bruce, currently at Google DeepMind, for helpful discussions about potential ways to implement the FlyNet multi-frame baselines, which enabled the development of the FlyNet+RNN network, and also thank Stephen Hausler for helping to configure his recent Multi-Process Fusion (MPF) work for our state-of-the-art comparison.
Z. Chen, L. Liu, I. Sa, Z. Ge, and M. Chli, “Learning Context Flexible Attention Model for Long-Term Visual Place Recognition,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4015–4022, Oct. 2018.
in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence. AAAI Press, 2014, pp. 2564–2570.
M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.