Safe navigation of intelligent mobile robots in unstructured and unknown outdoor environments (e.g. in the search and rescue, agriculture, and mining sectors) requires perception systems which deliver a detailed understanding of the surroundings regardless of environmental factors such as weather and scene illumination. In many environments, some terrains are unsuitable to traverse and so robust route identification is a key problem to be solved. To that end, a variety of sensor technologies have been used for solving related problems, including cameras, lidar, sonar, audio, and radar.
Lidar- and vision-based terrain classification systems, despite their success in certain scenarios, are highly susceptible to inclement environmental or atmospheric conditions: heavy rain, fog, direct sunlight, and dust all greatly degrade the performance of these systems, thereby limiting their range of applications.
FMCW scanning radar, in contrast, operates robustly under such adverse conditions and additionally operates at ranges of up to many hundreds of metres, which relaxes the limit on the speed at which a robot can safely travel and facilitates longer planning horizons. Indeed, there is a burgeoning interest in exploiting FMCW radar to enable robust mobile autonomy, including ego-motion estimation [cen2018precise, aldera2019, 2019ITSC_aldera, Barnes2019MaskingByMoving, UnderTheRadarArXiv], localisation [KidnappedRadarArXiv, gadd2020lookaroundyou, tang2020rsl, UnderTheRadarArXiv], and scene understanding [williams2019listening, weston2019probably, kaul2020rssnet].
As a novel contribution to scene understanding with radar, this paper presents a system that detects permissible driving routes from raw radar scans. Specifically, it focusses on the methodology for obtaining labels and on a novel training procedure for the radar classifier.
Radar measurements are complex, containing significant multipath reflections, speckle noise, and other artefacts in addition to the radar’s internal noise characteristics [robo_radar]. This makes the interaction of the electromagnetic wave with the environment more complex than that of time-of-flight (ToF) lasers. As obtaining a labelled radar dataset for supervision, with each scan annotated on a bin-by-bin basis, is challenging and time consuming, we propose a weakly-supervised framework using an alternative sensing modality: audio.
Audio-based terrain classifiers can be used to predict the permissibility of a driving route when the route is characterised by its terrain (e.g. grass, gravel, asphalt). Predicting terrain from audio is possible as each interaction between the robot and the ground has a terrain-specific audio signature.
Audio offers two advantages over other modalities such as vision: first, audio is invariant to scene appearance and less affected by weather conditions, providing more stable and predictable results; second, audio is a one-dimensional signal, which eases the labelling process as the audio for each terrain can be collected separately.
Once the audio-based terrain classifier has been trained, we exploit it to weakly supervise the radar classifier training. Visual odometry (VO) and GPS are used to trace the trajectory of the robot on the radar scan as if it were a canvas (see Figure 1), and each traversed bin is classified by the audio classifier. In theory, with access to GPS, it should be possible to extract labels for the audio from OpenStreetMap (OSM). Thus, the system could be trained in a completely self-supervised fashion. We leave this to future work.
II Related Work
Mature techniques for identifying the driveable area of urban environments with cameras and lidar often learn to semantically segment the entire scene through the use of fully labelled datasets such as Cityscapes [cordts2016cityscapes] or by weak supervision and demonstration as in [barnes16]. In non-urban outdoor environments, path detection is closely related to the task of terrain classification [blas2008fast]. For the environment in which our system was trained and tested (University Parks, Oxford, https://www.parks.ox.ac.uk/home), all permissible driving routes belong to one terrain class (gravel), and so for this application the tasks of permissible driving route identification and terrain classification are equivalent.
Vision-based terrain classification is perhaps the most traditional approach due to its intuitiveness and affordability. In [jansen2005colour], colour segmentation is employed to identify different terrains, while [blas2008fast] performs both colour and texture segmentation for path detection. However, [jansen2005colour] also exposes problems arising from variations in illumination. Although these problems can be mitigated, when paired with environmental factors such as fog, heavy rain and dust clouds, these systems alone seem unfit for robust autonomy.
Lidar can be used to build successful terrain classifiers by observing the texture of the 3D point-cloud, as seen in [kragh2015object]. Lidar works well in low-light conditions; however, it suffers greatly in the presence of rain and fog, limiting its applicability in much the same way as vision.
As mentioned in Section I, audio can also be used for terrain classification. Terrain-specific audio signatures are invariant to scene appearance and much less influenced by weather conditions compared with vision- and lidar-based methods. The obvious disadvantage of this technique is that only the terrain the robot is currently operating on can be classified. As discussed in this paper, this characteristic can be leveraged for labelling purposes. [Valada2018] reports classification of nine different terrains with an accuracy of by leveraging advances in deep learning (DL) and using a convolutional neural network (CNN) classifier. The audio features used for the CNN classifier were spectrograms generated with the short-time Fourier transform (STFT).
Finally, radar is invariant to almost all environmental factors posed by even the most extreme environments, such as dusty underground mines and blizzards [brooker2007seeing, foessel1999short]. This is reflected in the literature, as extensive research has been done using millimetre-wave radar systems for odometry, obstacle detection, mapping and outdoor reconstruction [cen2018precise, heuer2014detection, robo_radar]. Less work, however, has been carried out to investigate radar’s performance on more comprehensive scene understanding tasks such as terrain classification or path identification. [reina2011radar] presents an outdoor ground segmentation technique using a millimetre-wave radar; however, the chosen method limits its range of operation.
Perhaps most similar to our work is the visual terrain classifier presented in [zurn2019self], which is also supervised by learned acoustic features. In our work, however, we focus on the usage of radar, which has advantages over vision in terms of robustness to both weather and illumination as well as sensor range. At the same time, this work exposes challenges specific to the modality, especially the high sparsity of labelling. This is overcome with a stronger focus on the training procedure for the proposed network, which explicitly promotes generalisation.
Our method is based on our early investigation described in [williams2019listening]. Learning to segment driveable routes in a radar scan in a supervised manner requires that the routes in each scan are labelled. For a dataset of sufficient size (in the order of thousands of training examples), doing this by hand is a prohibitively time-consuming process. We therefore opt to weakly supervise the training of a radar-based segmentation network with an audio-based classifier that is trained independently of the radar-based classifier. Audio is collected for each terrain separately (making labelling trivial) and used to train the audio classifier for later use.
Through the use of odometry and GPS, we obtain the data collection robot’s timestamped trajectory in the environment. The audio terrain classifier is then used to accurately predict the terrain at each timestamp. By combining both, we produce a terrain-labelled trajectory of the robot in the environment (depicted in Figure 1) which is used as sparse labelling. For the purpose of segmenting paths in our environment, only the terrain labels denoting gravel are required.
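The pairing of trajectory timestamps with audio predictions described above can be sketched as a nearest-timestamp lookup. This is a minimal illustration rather than the implementation used in this work: the function names `label_trajectory` and `gravel_mask`, the string class names, and the nearest-neighbour matching rule are all assumptions made for the example.

```python
import numpy as np

def label_trajectory(pose_times, audio_times, audio_labels):
    """Give each pose the label of the audio prediction closest in time.

    Assumes audio_times is sorted and holds at least two entries.
    """
    pose_times = np.asarray(pose_times, dtype=float)
    audio_times = np.asarray(audio_times, dtype=float)
    audio_labels = np.asarray(audio_labels)
    # Index of the first audio timestamp at or after each pose timestamp.
    right = np.clip(np.searchsorted(audio_times, pose_times),
                    1, len(audio_times) - 1)
    left = right - 1
    # Choose whichever neighbouring prediction is closer in time.
    use_left = (pose_times - audio_times[left]) <= (audio_times[right] - pose_times)
    return audio_labels[np.where(use_left, left, right)]

def gravel_mask(labels, path_class="gravel"):
    """Keep only the poses whose terrain label denotes a driveable path."""
    return np.asarray(labels) == path_class
```

Only the poses selected by `gravel_mask` would then be carried forward as sparse path labels.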
III-A Audio Classification
As audio is best interpreted as a sequence of frequencies correlated in time, we discuss its representation in the form of different types of spectrograms. As suggested in [Valada2018], spectrograms can be used as 1-channel images to feed into a CNN. This is effective because CNN classifiers succeed through their ability to learn features automatically from data containing local spatial correlations. By assuming local spatial correlations in a spectrogram, the classifier recognises the temporal correlation of characteristic audio frequencies.
Our CNN classifier follows a standard architecture with several convolutional layers and max-pooling for downsampling.
For audio representation, we assess the performance of three types of spectrograms (results found in Section V). The representations considered are: Spectrograms, Mel-frequency spectrograms and Gammatonegrams (see Figure 3).
Spectrograms are the simplest time-frequency diagrams and are generated directly by the STFT. Mel-frequency spectrograms and gammatonegrams are motivated by the idea that the human auditory system does not perceive pitch in a linear manner. For humans, lower frequencies are perceptually much more important than higher frequencies and this can be represented in time-frequency representations. Gammatonegrams extend this biological inspiration, using filter banks modelled on the human cochlea, and have been successfully used before in a robotics context [Marchegiani2018].
The implementation used to generate both spectrograms and mel-frequency spectrograms is courtesy of VOICEBOX: Speech Processing Toolbox for MATLAB (found at ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html; produced by Mike Brookes, Dept. Electrical and Electronic Engineering, Imperial College in 1997). The MATLAB toolbox Gammatone-like spectrograms (found at ee.columbia.edu/~dpwe/LabROSA/matlab/gammatonegram; produced by Dan Ellis, Dept. of Electrical Engineering, Columbia University in 2009) is used to generate gammatonegrams.
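To make the simplest of the three representations concrete, the following sketch computes a log-power spectrogram with an STFT in NumPy. It is not the VOICEBOX implementation referred to above; the function name `stft_spectrogram`, the Hann window, and the frame length and hop size are assumptions chosen only for illustration.

```python
import numpy as np

def stft_spectrogram(signal, frame_len=256, hop=128):
    """Log-power spectrogram: rows are frequency bins, columns are time frames."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    # One real FFT per frame gives the short-time spectrum.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Small offset avoids log(0); transpose to frequency-by-time layout.
    return 10.0 * np.log10(power + 1e-10).T
```

The resulting 2D array can be treated as a 1-channel image; a mel-frequency spectrogram or gammatonegram would additionally pool the frequency axis through a perceptually motivated filter bank.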
III-B From Audio to Labelled Radar
In order to project terrain labels from audio into radar scans, we make use of GPS and the visual odometry (VO) estimate on the platform. VO produces a locally accurate, smooth trajectory and contains important orientation estimates. Although the estimates are locally accurate, they tend to drift over longer distances. In contrast, GPS measurements are globally accurate, but suffer from significant noise, resulting in a non-smooth trajectory, and contain low-quality information about the orientation of the robot. In order to leverage the benefits of both techniques, we fuse these data streams using an extended Kalman filter (EKF).
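The fusion described above can be illustrated with a small planar EKF in which odometry increments drive the prediction step and GPS positions drive the correction step. This is a sketch under stated assumptions, not the filter used in this work: the class name `PoseEKF`, the three-element state (x, y, heading), the unicycle motion model, and the noise values are all hypothetical.

```python
import numpy as np

class PoseEKF:
    """Planar EKF: odometry increments predict, GPS positions correct."""

    def __init__(self, q=(0.01, 0.01, 0.001), r=4.0):
        self.x = np.zeros(3)              # state: [x, y, heading]
        self.P = np.eye(3)                # state covariance
        self.Q = np.diag(q)               # process noise (assumed values)
        self.R = np.eye(2) * r            # GPS noise (assumed value)
        self.H = np.array([[1.0, 0.0, 0.0],
                           [0.0, 1.0, 0.0]])  # GPS observes position only

    def predict(self, dist, dtheta):
        """Propagate with an odometry step: distance travelled and yaw change."""
        px, py, th = self.x
        self.x = np.array([px + dist * np.cos(th),
                           py + dist * np.sin(th),
                           th + dtheta])
        # Jacobian of the motion model with respect to the state.
        F = np.array([[1.0, 0.0, -dist * np.sin(th)],
                      [0.0, 1.0, dist * np.cos(th)],
                      [0.0, 0.0, 1.0]])
        self.P = F @ self.P @ F.T + self.Q

    def update(self, gps_xy):
        """Correct the state with a GPS position fix."""
        innovation = np.asarray(gps_xy, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innovation
        self.P = (np.eye(3) - K @ self.H) @ self.P
```

In this arrangement the smooth odometry carries the heading estimate, while each GPS fix bounds the accumulated drift in position.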
Once the robot’s trajectory has been generated, it is labelled using the audio classifier to predict the terrain for each timestamp. Finally, the labelled trajectory is fitted automatically to each radar scan using the position and orientation estimates from the EKF.
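Fitting the labelled trajectory to a scan amounts to expressing each labelled world position in the radar frame and finding the azimuth-range bin it falls in. The sketch below shows one way to do this; the function names, the counter-clockwise azimuth convention measured from the robot heading, and the uniform bin sizes are assumptions, and a real sensor may index azimuths differently.

```python
import numpy as np

def world_to_radar_bins(points_xy, radar_pose, range_res, n_azimuths, n_range_bins):
    """Map world-frame points to (azimuth index, range index) in a polar scan."""
    x, y, heading = radar_pose
    delta = np.asarray(points_xy, dtype=float) - np.array([x, y])
    rng = np.hypot(delta[:, 0], delta[:, 1])
    # Bearing relative to the radar heading, wrapped to [0, 2*pi).
    bearing = np.mod(np.arctan2(delta[:, 1], delta[:, 0]) - heading, 2 * np.pi)
    az_idx = np.floor(bearing / (2 * np.pi) * n_azimuths).astype(int) % n_azimuths
    rg_idx = np.floor(rng / range_res).astype(int)
    # Discard points beyond the maximum range of the scan.
    in_range = rg_idx < n_range_bins
    return az_idx[in_range], rg_idx[in_range]

def paint_labels(az_idx, rg_idx, n_azimuths, n_range_bins):
    """Build a sparse boolean label image from the traversed bins."""
    mask = np.zeros((n_azimuths, n_range_bins), dtype=bool)
    mask[az_idx, rg_idx] = True
    return mask
```

Because only traversed bins are painted, the resulting mask is sparse, which is the property the training procedure has to compensate for.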
III-C Radar Segmentation Training Procedure
The nature of the method used for collecting the labels means that the radar scans are both inexactly and sparsely labelled. The inexactness comes from measurement errors from the GPS and VO, and the sparsity comes from our inability to thoroughly traverse every driveable surface observed in the radar scans. This means that the training procedure must be designed such that the network can learn a more complex model than the labelling might immediately suggest.
To do this, data augmentation and a label propagation technique are used to design a two-stage curriculum learning procedure. As described in [curriculum_learning], the idea of curriculum learning is that neural networks perform better when presented with the most understandable training examples first. This is done in the first stage by limiting the network’s receptive field, showing the network only very small crops of the global scan. In this way, the network is restricted to simply learning what a path looks like and is relieved of learning more complex concepts such as scene context. By comparison to the more difficult task of simultaneously segmenting multiple paths in the global scan, the network generalises much better on the simpler task of segmenting small crops (as suggested in [curriculum_learning]). For this reason, we are able to generalise beyond the initially incomplete labelling (see Figure 1(a)). Before input to the network, crops are also flipped, rotated, elastically deformed and rescaled to expose the network to paths of different orientations, shapes and widths. This data augmentation promotes a broader understanding of what a path looks like, thus assisting with generalisation.
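The first curriculum stage can be illustrated by a sampler that draws a small crop and applies the same geometric transform to the scan and its labels. This is only a partial sketch: it covers random cropping, flipping and rotation by multiples of 90 degrees, and omits the elastic deformation and rescaling mentioned above. The function name `random_augmented_crop` and the crop size are assumptions.

```python
import numpy as np

def random_augmented_crop(scan, labels, crop=64, rng=None):
    """Sample one crop and apply identical flips and rotations to scan and labels."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = scan.shape
    # Pick a random top-left corner so the crop stays inside the scan.
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    img = scan[top:top + crop, left:left + crop]
    lab = labels[top:top + crop, left:left + crop]
    # Random horizontal flip, applied to both arrays.
    if rng.random() < 0.5:
        img, lab = np.fliplr(img), np.fliplr(lab)
    # Random rotation by a multiple of 90 degrees, applied to both arrays.
    k = int(rng.integers(0, 4))
    img, lab = np.rot90(img, k), np.rot90(lab, k)
    return img.copy(), lab.copy()
```

Keeping the transform identical for both arrays is what preserves the pixel-wise correspondence between radar returns and path labels.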
Upon completion of the first stage, the network accurately segments small sections of paths contained in crops of the global scan (whether initially labelled or not) but is unsuited to segmenting the whole scan. The second stage of the curriculum is therefore to train the network to segment a whole scan containing multiple paths in one forward pass. By combining the predictions of the network from stage one with the original labelling, we obtain a more complete and exact set of labels from which the network can be trained to complete the more complex task. The idea of using a trained network’s predictions to augment the labels is presented in a classification context in [labelrefinery]; however, we adapt it to a segmentation context (described in Section V-B).
For the segmentation network, we chose a U-Net architecture [ronneberger2015u], which has proven effective for segmentation of radar scans [aldera2019, weston2019probably]. A U-Net is a fully convolutional network (FCN) containing downsampling and upsampling paths, with skip connections between the paths to propagate fine detail.
IV Experimental Setup
This section discusses the platform and the dataset collected and used for training and testing of our system.
IV-A Platform and Sensors
A Clearpath Husky A200 robot was fitted with microphones and radar, for audio recording and route identification, and with cameras and GPS for odometry estimation. The audio data was recorded by using two Knowles omnidirectional boom microphones, mounted in proximity to the two front wheels, and an ALESIS IO4 audio interface, at a sampling frequency of and a resolution of 16 bits.
We employed a Navtech CTS350-X FMCW scanning radar, mounted on top of the platform with an axis of rotation perpendicular to the driving surface. The radar is characterised by an operating frequency of to , yielding up to range readings, each constituting one of the azimuth readings with a scan rotation rate of . The radar’s range resolution in short and long range configurations is and respectively, resulting in ranges of and
Images for VO were gathered by a Point Grey Bumblebee 2 camera, mounted facing the direction of motion on the front of the platform. GPS measurements were collected with a GlobalSat BU-353-S4 USB GPS Receiver.
As discussed in Section III, audio was collected for each terrain separately. It was recorded from both microphones for per terrain class, corresponding to approximately 7200 spectrograms per class (using a clip length of ). Audio for grass and gravel terrains was collected in University Parks and the asphalt terrain in the Radcliffe Observatory Quarter.
Datasets for training and testing the classifier were collected with the radar in both the long range and short range configurations to ensure the network performs well regardless of the specific radar configuration. We collected training data in two locations in University Parks, Oxford and testing data in two different locations in the same park. The audio classifier in combination with VO and GPS provides labelling for the training datasets. Figure 1(a) shows one location where the training dataset was collected, which comprises two paths surrounded by grass. As the radar scan covers an area of in its longest range configuration, it is impractical to traverse every path observed by the radar. For this reason, we leave the side path untraversed (and therefore unlabelled), such that we can test the segmentation network’s ability to generalise effectively.
This section presents experimental evidence of the efficacy of our system.
V-A Reliability of the Audio Supervisory Signal
An investigation was performed into the performance of the audio classifier using each different audio feature representation to determine which one would be used in the final classifier. In our experiments, the classifier is tested on a withheld testing dataset and predicts from three possible terrains: grass, gravel and asphalt. After averaging over multiple experiments, the accuracies for the spectrogram, mel-frequency spectrogram and gammatonegram were , , respectively (using a clip length of ). As the best performing feature representation, the gammatonegram was used to train the final audio terrain classifier.
Additionally, investigations into the audio clip length used to generate the gammatonegrams showed that the longer the clip length, the more accurate the terrain classifier. Whilst an intuitive result, this means a compromise between accuracy and system frequency is necessary. We chose a clip length of by balancing classification accuracy and other system frequencies (such as GPS update rate at ) to result in a classification frequency of .
Lastly, the final audio terrain classifier was tested on a dataset where the robot dynamically traversed gravel and grass for . Approximate hand-labels were generated by cross-referencing the predicted terrain at each of the 1320 GPS measurements with satellite imagery. Here, the audio terrain classifier performed the task with an accuracy of .
V-B Effective Supervision of Radar-only Segmentation
Firstly, a U-Net is trained on the training set shown in Figure 1(a) as stage one in the curriculum detailed in Section III. Trained on the simple task of segmenting crops out of a scan, the network effectively learns not only to reproduce the labelling but also to segment paths unlabelled in our datasets (see Figure 1(b)).
To generate the labels for the previously unlabelled sections of scans, the radar scan is divided into small sub-scans which are sequentially segmented by the trained network. To suppress spurious predictions, we randomly rotate each scan a small number of times and combine the predictions on each. Figure 1(a) shows an example of both the initial labelling and the generated labelling after stage 1.
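The rotate-and-combine step can be sketched as averaging predictions over several rotated copies of a scan and merging the result with the original sparse labels. The sketch makes several assumptions that are not taken from this work: rotations are restricted to multiples of 90 degrees rather than random angles, predictions are combined by averaging and thresholding, the tiling into sub-scans is omitted, and `predict` is a hypothetical stand-in for the stage-one network that returns per-pixel path probabilities.

```python
import numpy as np

def propagate_labels(scan, sparse_labels, predict, n_rotations=4, threshold=0.5):
    """Average predictions over rotated copies, then merge with the sparse labels."""
    votes = np.zeros(scan.shape, dtype=float)
    for k in range(n_rotations):
        # Predict on a rotated copy, then rotate the prediction back.
        rotated_pred = predict(np.rot90(scan, k))
        votes += np.rot90(rotated_pred, -k)
    mean_prob = votes / n_rotations
    # Keep confident predictions and everything that was originally labelled.
    return (mean_prob >= threshold) | np.asarray(sparse_labels, dtype=bool)
```

A prediction that appears under only one rotation contributes a low average and is discarded, whereas the originally traversed bins are always retained.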
Stage two of the curriculum involves fine-tuning the network with the newly generated dataset. We then test the network on datasets collected in two unseen locations with the radar in both long and short range configurations. Figure 4 shows both typical segmentations and some radar specific failure cases.
In both short and long range segmentations, the system is able to reliably detect driveable routes with a field of view and up to hundreds of metres away. In Figures 3(g) and 3(d), we observe that paths approximately away and occluded by trees are accurately segmented in a way that would not be possible using any other sensor modality. Figure 3(a) shows the network segmenting around pedestrians and Figure 3(d) shows a consistent path detection behind occluding trees.
Figures 3(p), 3(n) and 3(i) show examples where occluded sections of the scan are misclassified as paths. This problem may be ameliorated by enforcing temporal consistency. In Figure 3(k), the vertical disjoint in the radar scan is misidentified as driveable path. This artefact arises due to the motion of the radar during scan formation, and can be fixed by motion correction. Finally, the network understandably does not predict through large occlusions; however, this could be achieved by fitting cubic curves between path segments as in [Suleymanov2018].
The network correctly classified of pixels with an achieved IoU score of when evaluated on 25 hand-labelled unseen examples from the testing set. Comparing this with an IoU of achieved with cameras in [zurn2019self] and considering radar’s robustness to weather and illumination, we show the feasibility of our method for all-weather scene understanding.
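For reference, the two metrics quoted above can be computed from a predicted and a hand-labelled binary mask as follows. This is a generic sketch of pixel accuracy and intersection-over-union for a single foreground class; the function names, and the convention of returning 1.0 when both masks are empty, are assumptions.

```python
import numpy as np

def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the hand label."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    return float(np.mean(pred == target))

def iou(pred, target):
    """Intersection over union of the predicted and labelled path pixels."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)
```

Because paths occupy a small fraction of each scan, pixel accuracy is dominated by the background class, which is why IoU is the more informative of the two figures.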
During inference, our U-Net runs at and uses less than of GPU memory when processing scans. We take this to be indicative that a CPU implementation may be feasible for closed-loop autonomy.
VI Conclusions and Future Work
This paper presents a system that identifies permissible driving routes using scanning radar alone. With a specific focus on the methodology, the system is trained using an audio-leveraged automatic labelling procedure, followed by a curriculum designed to promote generalisation from sparse labelling. Qualitative results show that the network is capable of generalising effectively to the unseen testing set and to unlabelled areas of the training set. Quantitative results demonstrate the feasibility of our methodology for learning robust scene understanding from radar.
In the future, we plan to retrain and test the system on the all-weather platform described in [kyberd2019], as part of closed-loop autonomy. The proposed system will also be applied in off-road intelligent transportation contexts (the Sense Assess eXplain (SAX) project: https://ori.ox.ac.uk/projects/sense-assess-explain-sax).
This project is supported by the Assuring Autonomy International Programme, a partnership between Lloyd’s Register Foundation and the University of York, and UK EPSRC programme grant EP/M019918/1. Additionally, we would like to thank the Groundsmen and Officers of the University Parks as well as our partners at Navtech Radar.