Detecting Road Surface Wetness from Audio: A Deep Learning Approach

11/22/2015 ∙ by Irman Abdić, et al. ∙ Technische Universität München ∙ IEEE

We introduce a recurrent neural network architecture for automated road surface wetness detection from audio of tire-surface interaction. The robustness of our approach is evaluated on 785,826 bins of audio that span an extensive range of vehicle speeds, noises from the environment, road surface types, and pavement conditions including international roughness index (IRI) values from 25 in/mi to 1400 in/mi. The training and evaluation of the model are performed on different roads to minimize the impact of environmental and other external factors on the accuracy of the classification. We achieve an unweighted average recall (UAR) of 93.2 % across all vehicle speeds, including 0 mph. The classifier still works at 0 mph because the discriminating signal is present in the sound of other vehicles driving by.




I Introduction

(a) PCA for selected full wet and dry trips from Table I.
(b) PCA for a randomly-selected segment of road from wet and dry trips from Table I.
Fig. 1: PCA analysis for wet and dry road surfaces that illustrates a representative case where audio-based wetness detection is linearly separable for similar road type and vehicle speeds.

Wet pavement is responsible for 74 % of all weather-related crashes in the U.S. with over 380,000 injuries and 4,700 deaths per year [1]. Furthermore, wet roads often increase traffic congestion and result in infrastructure damage and supply chain disruptions [2]. From the perspective of driver safety, wetness detection is critical during the period after the precipitation has ended but while the road is still wet. Under these conditions, human estimation of road wetness and friction properties is less accurate than normal, especially in reduced visibility at night or in the presence of fog [3].


The automated detection of road conditions from audio may be an important component of next generation Advanced Driver Assistance Systems (ADAS) that have the potential to enhance driver safety [4]. Moreover, autonomous and semi-autonomous vehicles have to be aware of road conditions to automatically adapt vehicle speed when entering a curve or to keep a safe distance to the vehicle in front. There are numerous approaches that can detect whether a surface is wet or dry, but in the majority of cases they are not robust to the variation found in real-world datasets. The accuracy of video-based wetness prediction decreases significantly in poor lighting conditions (e.g., night, fog, smoke). Audio-based wetness prediction is heavily dependent upon surface type and vehicle speed, both of which are well represented in our dataset of 785,826 bins (feature vectors described in §III-B) [5]. We elucidate this dependence by visualizing the first two principal components for (1) two full trips and (2) a small 10-second section of road from (1). These two visualizations are shown in Fig. 1(a) and Fig. 1(b), respectively. The feature set we use is linearly separable for a specific road type and vehicle speed, as visualized in Fig. 1(b). However, given the nonlinear relation of our feature set for (1), visualized in Fig. 1(a), we applied Recurrent Neural Networks (RNNs), which can model and separate the data points.

II Related Work

Long short-term memory RNNs (LSTM-RNNs) have been successfully applied in many fields, from handwriting recognition to robotic heart surgery [6, 7]. In the audio context, LSTM-RNNs contributed to the development of better phoneme classification, speech enhancement, affect recognition from speech, animal species identification and finding temporal structure in music [8, 9, 10, 11, 12, 13]. However, to the best of our knowledge, LSTM-RNNs have not been applied to the task of road wetness detection.

Related work can be found in the video processing domain, where wetness detection has been studied with two camera set-ups: (1) a surveillance camera at night, and (2) a camera on-board a vehicle. The detection of road surface wetness using surveillance camera images at night relies on passing cars’ headlights as a lighting source that creates a reflection artifact on the road area [14]. On-board video cameras use polarization changes of reflections on road surfaces or spatio-temporal reflection models [15, 16, 17]. A recent study uses a near infrared (NIR) camera to classify several road conditions per pixel with high accuracy; the evaluation was done in both laboratory conditions and field experiments [18]. However, a drawback of video processing methods is that they require (1) an external illumination source to be present and (2) visibility conditions to be clear.

Another approach capable of detecting road wetness relies on 24-GHz automotive radar technology for detecting low-friction spots [19]. It analyzes backscattering properties of wet, dry, and icy asphalt in laboratory and field experiments.

Traditionally, audio analysis of the road-tire interaction has been done by examining tire noises of passing vehicles from a stationary microphone positioned on the side of the road. This kind of analysis reveals that tire speed, vertical tire load, inflation pressure and driving torque are primary contributors to tire sound in dry road conditions [20]. Acoustic-based vehicle detection methods, such as the one that uses bispectral entropy, have been applied in ground surveillance systems [21]. Other on-road audio collecting devices for surface analysis can be found in specialized vehicles for pavement quality evaluation (e.g., VOTERS [22]) and in vehicles instrumented for studying driver behavior in the context of automation (e.g., MIT RIDER [23]). Finally, road wetness has been studied from on-board audio of tire-surface interaction, where SVMs have been applied [5].

II-A Contribution

The method described in our paper improves the prediction accuracy of the method presented in [5] and expands the evaluation to a wider range of surface types and pavement conditions. Additionally, the present study is the first to apply LSTM-RNNs in this field. Moreover, we address three limitations of [5], in which (1) the model was trained and tested on the same road segment, (2) false predictions caused by the impact of pebbles on the vehicle chassis were ignored, and (3) audio segments associated with speeds below 18.6 mph were removed.

We trained and tested the model on different routes, and considered all predictions regardless of speed, pebble impacts or any other factor.

III Road Surface Wetness Classification

III-A Data Collection

For data collection purposes, we instrumented a 2014 Mercedes CLA with an inexpensive shotgun microphone behind the rear tire, as shown in Fig. 2. The gain level of the microphone and its distance from the tire were kept the same for the entire data collection process. Three different routes were selected. For each route, we drove the same exact path once during the rain (or immediately after) and another time when the road surface was completely dry, as shown in Fig. 3. We provide spectrograms in Fig. 4 for wet and dry road segments of the same route that highlight the difference in frequency response. The duration and length of trips ranged from 14 min to 30 min and 6.1 mi to 9.0 mi, respectively. The summary of the dataset is presented in Table I.

Fig. 2: Instrumented MIT AgeLab vehicle (left) and placement of the shotgun microphone behind the rear tire (right).
Fig. 3: Snapshots from the video of the forward roadway showing the same GPS location for a ‘wet’ trip 1 (left) and a ‘dry’ trip 1 (right). Footnote 1: A video of these trips is available at:
Fig. 4: Spectrograms for the wet trip 2 (left) and dry trip 2 (right) from the same route segment at the speed of approximately 20 mph.
Trip    Time     Distance   Avg Speed   Avg IRI
wet 1   26 min   9.0 mi     7.4 mph     267 in/mi
wet 2   16 min   6.4 mi     9.4 mph     189 in/mi
wet 3   14 min   6.1 mi     13.5 mph    142 in/mi
dry 1   30 min   9.0 mi     9.6 mph     267 in/mi
dry 2   14 min   6.4 mi     9.1 mph     189 in/mi
dry 3   18 min   6.1 mi     9.3 mph     142 in/mi
TABLE I: Statistics of the collected data for six trips: time, distance, average speed and average IRI.

The data collection was carried out in Cambridge and the Greater Boston area with different speeds, traffic conditions and pavement roughness. The latter is measured with the International Roughness Index (IRI) which represents pavement quality [24]. A histogram of IRI values for the collected dataset is presented in Fig. 5, wherein the unit of measurement is in inches per mile (in/mi). Our dataset contains values from 25 in/mi to 1400 in/mi, but in Fig. 5, values over 400 in/mi are aggregated into a single bin. According to the Massachusetts Department of Transportation (MassDOT) Road Inventory, the route we traveled is a combination of surface-treated road and bituminous concrete road [25].

Fig. 5: Histogram of the IRI distribution throughout the collected data.

III-B Features

Our aim was to model the whole spectrum along with the first order differences and then select a subset of features that discriminates our classes the best. We extracted Auditory Spectral Features (ASF) [13], which were computed by applying the short-time Fourier transform (STFT) using a frame size of 30 ms and a frame step of 10 ms. Furthermore, each STFT power spectrogram has been converted to the Mel-frequency scale using 26 triangular filters, obtaining the Mel spectrograms. To match the human perception of loudness, a logarithmic representation has been chosen. In addition, the positive first order differences were calculated from each Mel spectrogram.
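The logarithmic compression and the positive first order differences can be written compactly. The following is a sketch in assumed notation, not necessarily the exact formulas of the study: $M(n, m)$ denotes the Mel spectrogram at frame $n$ and Mel band $m$, the unit offset inside the logarithm is an assumed choice that keeps the expression finite for empty bands, and "positive" is read here as half-wave rectification.

```latex
\begin{align}
  M_{\log}(n, m) &= \log\bigl(M(n, m) + 1\bigr), \\
  D(n, m) &= \max\bigl(0,\; M_{\log}(n, m) - M_{\log}(n - 1, m)\bigr).
\end{align}
```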


The frame energy has also been included as a feature, which resulted in a total of 54 features [26]. To foster reproducibility, we use open-source software toolkits: (a) openSMILE, for extracting features from the audio, and (b) Weka 3, for feature evaluation with Information Gain (IG) and Correlation-based Feature Selection (CFS) to reduce the dimension of the feature space [27, 28].
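The construction of the per-frame feature vector can be sketched in a few lines of Python. This is a minimal illustration, not the openSMILE configuration used in the study. The function names `num_frames` and `asf_features`, the unit offset inside the logarithm, the reading of "positive" differences as half-wave rectification, and the split of the 54 dimensions into 26 log-Mel bands, 26 band differences, the frame energy and its difference are all assumptions made here; the text states only the total.

```python
import math


def num_frames(duration_ms, frame_ms=30, step_ms=10):
    """Number of full analysis frames that fit into a signal of the given length."""
    if duration_ms < frame_ms:
        return 0
    return 1 + (duration_ms - frame_ms) // step_ms


def asf_features(mel_frames, frame_energies):
    """Build one feature vector per frame.

    mel_frames: list of frames, each a list of non-negative Mel-band powers.
    frame_energies: list of non-negative per-frame energies (same length).
    """
    features = []
    prev_log_mel = None
    prev_log_energy = None
    for mel, energy in zip(mel_frames, frame_energies):
        # Logarithmic compression (assumed unit offset keeps log finite at zero).
        log_mel = [math.log(p + 1.0) for p in mel]
        log_energy = math.log(energy + 1.0)
        if prev_log_mel is None:
            # The first frame has no predecessor, so its differences are set to zero.
            delta_mel = [0.0] * len(mel)
            delta_energy = 0.0
        else:
            # Positive first order differences: keep increases, clip decreases to zero.
            delta_mel = [max(0.0, c - p) for c, p in zip(log_mel, prev_log_mel)]
            delta_energy = max(0.0, log_energy - prev_log_energy)
        features.append(log_mel + delta_mel + [log_energy, delta_energy])
        prev_log_mel, prev_log_energy = log_mel, log_energy
    return features
```

With 26 Mel bands per frame, each vector returned by `asf_features` has 26 + 26 + 2 = 54 entries, and one second of audio yields `num_frames(1000)` frames at the stated frame size and step.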

The IG feature evaluation is a univariate filter that calculates the worth of a feature by measuring the IG with respect to the class; it measures individual feature value but neglects redundancy [29, 30]. The output is a ranked list of features, from which we selected the N best features (N = 20 for the IG-20 set reported in §IV) and kept the whole feature set for comparison.

The CFS subset evaluation is a multivariate filter that searches for subsets of features that are highly correlated with the class while having low intercorrelation [28, 29, 31]. We used the BestFirst search algorithm in forward search mode (-D 1) with a threshold of 5 non-improving nodes (-N 5) to consider before terminating the search. The CFS subset evaluation returned a list of 5 features.
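To make the univariate IG ranking concrete, the sketch below scores each feature by the reduction in class entropy after discretizing that feature, then sorts the features by score. It is an illustration only and does not reproduce Weka's implementation: the equal-width binning, the parameter `n_bins`, and the function names are assumptions chosen for brevity.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(values, labels, n_bins=10):
    """IG of one numeric feature with respect to the class, using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # constant features fall into a single bin
    bins = [min(int((v - lo) / width), n_bins - 1) for v in values]
    groups = {}
    for b, y in zip(bins, labels):
        groups.setdefault(b, []).append(y)
    # Class entropy that remains once the bin of the feature is known.
    conditional = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(columns, labels, n_bins=10):
    """Return feature indices ordered from highest to lowest IG."""
    scores = [(information_gain(col, labels, n_bins), i) for i, col in enumerate(columns)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]
```

Taking the first 20 indices returned by `rank_features` would correspond to a set like IG-20. Because each feature is scored in isolation, two nearly identical features receive nearly identical scores, which is the redundancy that CFS is designed to penalize.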

III-C Classifier

In this work, we used a deep learning approach with initialized nets: LSTM and bi-directional LSTM (BLSTM) RNN architectures, which, in contrast to other RNNs, do not suffer from the problem of vanishing gradients [32]. The BLSTM is an extension of the LSTM architecture that allows for an additional backward pass if a look-ahead buffer may be used, which has proven successful in many applications [8].

In addition, we evaluated different parameters, such as the layout of LSTM and BLSTM hidden layers (54-54-54, 54-30-54, 156-256-156, 216-216-216, 216-316-216 neurons in the three hidden layers) and learning rates (1e-4, 1e-5, 1e-6). Initially, we chose a deep architecture with three hidden layers of the same size as the input vectors (54), before we ranked the features and reduced their dimensionality. In the next step, we investigated the effectiveness of internal feature compression and of augmenting the hidden layers to model more information. We used a feed-forward output layer with a logistic activation function and the sum of squared errors as the objective function. The experiments were carried out with the CURRENNT toolkit [33].
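The parameter sweep and the output stage can be summarized in code. The sketch below enumerates the combinations of architecture, hidden layout and learning rate listed above, and shows the logistic activation and sum-of-squared-errors objective in isolation. It does not use or imitate the CURRENNT configuration format; all names are hypothetical, and treating the sweep as a full cross product is an assumption.

```python
import itertools
import math

ARCHITECTURES = ["LSTM", "BLSTM"]
LAYOUTS = [(54, 54, 54), (54, 30, 54), (156, 256, 156), (216, 216, 216), (216, 316, 216)]
LEARNING_RATES = [1e-4, 1e-5, 1e-6]


def parameter_grid():
    """All combinations of architecture, hidden layout and learning rate."""
    return [
        {"architecture": arch, "layout": layout, "learning_rate": lr}
        for arch, layout, lr in itertools.product(ARCHITECTURES, LAYOUTS, LEARNING_RATES)
    ]


def logistic(x):
    """Logistic activation of the output unit, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sum_squared_error(outputs, targets):
    """Sum of squared errors between network outputs and 0/1 targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))
```

Under the cross-product assumption the grid contains 2 x 5 x 3 = 30 configurations, each of which would then be evaluated with the cross-validation scheme of the results section.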


IV Results

Table II shows the evaluation results in ascending order for the best 20 features that were selected with IG (IG-20), as described in §III-B, and trained with LSTM-RNNs. We present only the worst three and the best three results for RNNs; the other experiments were left out of the table. For every combination of parameters we conducted cross-validation on all three folds from Table I. For example, we leave out wet/dry 3 when training with wet/dry 1 and testing with wet/dry 2, and run six experiments in total. Furthermore, an average UAR was computed for results obtained from all speeds, including vehicle stationary mode. The best result, a UAR of 93.2 %, was achieved with the BLSTM network layout 216-216-216 and a learning rate of 1e-5.
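UAR is the mean of the per-class recalls, so the wet and dry classes count equally even when one of them contributes far more audio bins. The sketch below computes it and enumerates train/test route pairs. The function names are hypothetical, and reading the six experiments as the ordered pairs of distinct routes is an interpretation of the description above, not a confirmed detail.

```python
from itertools import permutations


def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, independent of class frequencies."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


def route_folds(routes=(1, 2, 3)):
    """Ordered (train_route, test_route) pairs; the remaining route is left out."""
    return list(permutations(routes, 2))
```

For a test set with eight wet bins that are all classified correctly and two dry bins of which one is misclassified, plain accuracy is 0.9 while the UAR is (1.0 + 0.5) / 2 = 0.75, which is why UAR is the more conservative measure for unbalanced trips.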

Additionally, we compared our results with the state-of-the-art approach of [5], which uses zero-norm minimization (L0) to select the four most promising features (L0-4) from 125 ms audio bins of 1/3 octave bands (the 5000 Hz, 1600 Hz, 630 Hz and 200 Hz frequency bands). We trained SVMs with Sequential Minimal Optimization (SMO) on our dataset and found the C parameter listed in Table II to give the best UAR of 67.4 %. Furthermore, experiments with SVMs and the IG-20 feature set were carried out and gave a best UAR of 78.8 %.

Network   Feature set   C (1e)    UAR (%)   DIFF
SVM       L0-4          3         67.4      +4.2
SVM       IG-20         3         78.8      +3.0

Network   Layout        LR (1e)   UAR (%)   DIFF
LSTM      216-216-216   4         66.3      -20.3
BLSTM     216-316-216   4         76.1      -10.5
LSTM      156-256-156   4         78.0      -8.6
BLSTM     216-316-216   5         92.6      +6.0
LSTM      216-216-216   5         92.6      +6.0
BLSTM     216-216-216   5         93.2      +6.6
TABLE II: Comparison of results (upper) that were obtained by applying the state-of-the-art approach of Alonso et al., and (lower) our approach with RNNs, both trained and tested on our dataset. The column LR is an abbreviation for Learning Rate, given as the negative exponent of ten (e.g., 5 denotes 1e-5), and the column DIFF is an abbreviation for difference from the mean UAR.

The mean UAR value for experiments with LSTM-RNNs is 86.6 % and the standard deviation equals 6.4. The mean UAR of all experiments with the BLSTM network is 87.0 %, while the mean UAR for experiments with the LSTM network is 86.0 %. The best-performing learning rate yields a mean UAR of 90.8 %, while the worst-performing learning rate achieves only 78.8 %.
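As a small worked example, the DIFF values of the RNN rows in Table II can be reproduced by subtracting the reported mean UAR of 86.6 % from each row. The names in the snippet are hypothetical; the numbers are copied from the table and the text.

```python
MEAN_UAR = 86.6  # mean UAR over all LSTM-RNN experiments, as reported in the text

RNN_ROWS = [
    ("LSTM", "216-216-216", 66.3),
    ("BLSTM", "216-316-216", 76.1),
    ("LSTM", "156-256-156", 78.0),
    ("BLSTM", "216-316-216", 92.6),
    ("LSTM", "216-216-216", 92.6),
    ("BLSTM", "216-216-216", 93.2),
]


def diff_from_mean(uar, mean=MEAN_UAR):
    """Difference between one experiment's UAR and the mean UAR, to one decimal."""
    return round(uar - mean, 1)


diffs = [diff_from_mean(uar) for _, _, uar in RNN_ROWS]
```

For instance, 93.2 - 86.6 = 6.6 for the best BLSTM configuration and 66.3 - 86.6 = -20.3 for the weakest LSTM configuration.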

Two out of three wet trips have a significantly higher number of false predictions (1) at the beginning, where the vehicle tires were dry before becoming wet from the surface, and (2) at the end of the trip, when the vehicle entered a parking lot with a relatively dry road surface.

(a) An 18 min long wet trip showing speed and false predictions.
(b) A 19 min long dry trip showing speed and false predictions.
Fig. 6: Graphs for wet and dry road surfaces for route two that clarify the correlation between low speed and inaccurate predictions.

In Fig. 6 we compare speed and false predictions of wet and dry trips for the same route, which has similar properties, as described in §III-A. One can observe that all false predictions of wet trip 2 in Fig. 6(a) occurred below the speed of 2.9 mph, whilst Fig. 6(b) depicts dry trip 2, which has only one false prediction, made when the vehicle is not moving. Therefore, discarding speeds below 2.9 mph improves the UAR for this route to 100 %. When we look only at speeds below 2.9 mph and ignore everything above, we are still able to attain 74.5 % UAR. The latter is possible only in the presence of ambient sounds, such as the noise of vehicles that are driving by.
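The speed-dependent analysis amounts to splitting the predictions at a speed threshold and scoring each part separately. A minimal sketch follows, assuming each prediction is stored as a `(speed_mph, true_label, predicted_label)` tuple; that record layout and the function names are hypothetical.

```python
def uar(pairs):
    """Mean per-class recall over (true_label, predicted_label) pairs."""
    totals, hits = {}, {}
    for true, pred in pairs:
        totals[true] = totals.get(true, 0) + 1
        hits[true] = hits.get(true, 0) + (true == pred)
    if not totals:
        return None  # no samples on this side of the threshold
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def uar_by_speed(records, threshold_mph=2.9):
    """UAR for predictions below the threshold and for those at or above it."""
    slow = [(t, p) for s, t, p in records if s < threshold_mph]
    fast = [(t, p) for s, t, p in records if s >= threshold_mph]
    return {"below": uar(slow), "at_or_above": uar(fast)}
```

Calling `uar_by_speed(records)` on the predictions of one route returns both figures at once, so the effect of discarding low-speed bins can be read off directly.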

V Conclusion

We proposed a deep learning approach based on LSTM-RNNs for detecting road wetness from audio of the tire-surface interaction and discriminating between wet and dry classes. This method is shown to be robust to vehicle speed, road type, and pavement quality on a dataset containing 785,826 bins of audio. It outperforms the state-of-the-art SVMs and achieves a 93.2 % UAR on the road wetness detection task across all vehicle speeds, including the more challenging speeds below 2.9 mph and vehicle stationary mode. In future work, we will augment the feature set for estimating the depth of water on the road surface and detecting hydroplaning conditions.


  • [1] Booz-Allen-Hamilton, “Ten-year averages from 2002 to 2012 based on nhtsa data,” US Department of Transportation - Federal Highway Administration, 2012. [Online]. Available:
  • [2] J. Andrey, B. Mills, M. Leahy, and J. Suggett, “Weather as a chronic hazard for road transportation in canadian cities,” Natural Hazards, vol. 28, no. 2-3, pp. 319–343, 2003.
  • [3] J. Andrey, B. Mills, and J. Vandermolen, “Weather information and road safety,” Institute for Catastrophic Loss Reduction, Toronto, Ontario, Canada, 2001.
  • [4] M. Mueller, “Sensor sensibility: Advanced driver assistance systems,” Vision Zero International, 2015.
  • [5] J. Alonso, J. López, I. Pavón, M. Recuero, C. Asensio, G. Arcas, and A. Bravo, “On-board wet road surface identification using tyre/road noise and support vector machines,” Applied Acoustics, vol. 76, pp. 407–415, 2014.
  • [6] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 31, no. 5, pp. 855–868, 2009.
  • [7] H. Mayer, F. Gomez, D. Wierstra, I. Nagy, A. Knoll, and J. Schmidhuber, “A system for robotic heart surgery that learns to tie knots using recurrent neural networks,” Advanced Robotics, vol. 22, no. 13-14, pp. 1521–1537, 2008.
  • [8] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
  • [9] M. Wöllmer, F. Eyben, S. Reiter, B. Schuller, C. Cox, E. Douglas-Cowie, and R. Cowie, “Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies.” in INTERSPEECH, vol. 2008, 2008, pp. 597–600.
  • [10] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “An experimental study on speech enhancement based on deep neural networks,” Signal Processing Letters, IEEE, vol. 21, no. 1, pp. 65–68, 2014.
  • [11] F. Weninger and B. Schuller, “Audio recognition in the wild: Static and dynamic classification on a real-world database of animal vocalizations,” in acoustics, speech and signal processing (ICASSP), 2011 IEEE international conference on.   IEEE, 2011, pp. 337–340.
  • [12] D. Eck and J. Schmidhuber, “Finding temporal structure in music: Blues improvisation with lstm recurrent networks,” in Neural Networks for Signal Processing, 2002. Proceedings of the 2002 12th IEEE Workshop on.   IEEE, 2002, pp. 747–756.
  • [13] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, “Multi-resolution Linear Prediction Based Features for Audio Onset Detection with Bidirectional LSTM Neural Networks,” in Proceedings 39th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2014.   Florence, Italy: IEEE, May 2014, pp. 2183–2187.
  • [14] Y. Horita, S. Kawai, T. Furukane, and K. Shibata, “Efficient distinction of road surface conditions using surveillance camera images in night time,” in Image Processing (ICIP), 2012 19th IEEE International Conference on.   IEEE, 2012, pp. 485–488.
  • [15] M. Yamada, T. Oshima, K. Ueda, I. Horiba, and S. Yamamoto, “A study of the road surface condition detection technique for deployment on a vehicle,” JSAE review, vol. 24, no. 2, pp. 183–188, 2003.
  • [16] M. Jokela, M. Kutila, and L. Le, “Road condition monitoring system based on a stereo camera,” in Intelligent Computer Communication and Processing, 2009. ICCP 2009. IEEE 5th International Conference on.   IEEE, 2009, pp. 423–428.
  • [17] M. Amthor, B. Hartmann, and J. Denzler, “Road condition estimation based on spatio-temporal reflection models,” pp. 3–15, 2015.
  • [18] P. Jonsson, J. Casselgren, and B. Thornberg, “Road surface status classification using spectral analysis of nir camera images,” Sensors Journal, IEEE, vol. 15, no. 3, pp. 1641–1656, 2015.
  • [19] V. V. Viikari, T. Varpula, and M. Kantanen, “Road-condition recognition using 24-ghz automotive radar,” Intelligent Transportation Systems, IEEE Transactions on, vol. 10, no. 4, pp. 639–648, 2009.
  • [20] K. Iwao and I. Yamazaki, “A study on the mechanism of tire/road noise,” JSAE review, vol. 17, no. 2, pp. 139–144, 1996.
  • [21] M. Bao, C. Zheng, X. Li, J. Yang, and J. Tian, “Acoustical vehicle detection based on bispectral entropy,” Signal Processing Letters, IEEE, vol. 16, no. 5, pp. 378–381, 2009.
  • [22] R. Birken, G. Schirner, and M. Wang, “Voters: design of a mobile multi-modal multi-sensor system,” in Proceedings of the Sixth International Workshop on Knowledge Discovery from Sensor Data.   ACM, 2012, pp. 8–15.
  • [23] L. Fridman, D. E. Brown, W. Angell, I. Abdić, B. Reimer, and H. Y. Noh, “Automated synchronization of driving data using vibration and steering events,” arXiv preprint arXiv:1510.06113, 2015.
  • [24] W. D. Paterson, “International roughness index: Relationship to other measures of roughness and riding quality,” Transportation Research Record, no. 1084, 1986.
  • [25] MassDOT, “Road inventory - massdot planning,” 2015. [Online]. Available:
  • [26] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, “Non-Linear Prediction with LSTM Recurrent Neural Networks for Acoustic Novelty Detection,” in Proceedings 2015 International Joint Conference on Neural Networks (IJCNN).   Killarney, Ireland: IEEE, July 2015, pp. 1–7.
  • [27] F. Eyben, F. Weninger, F. Gross, and B. Schuller, “Recent developments in opensmile, the munich open-source multimedia feature extractor,” in Proceedings of the 21st ACM international conference on Multimedia.   ACM, 2013, pp. 835–838.
  • [28] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The weka data mining software: an update,” ACM SIGKDD explorations newsletter, vol. 11, no. 1, pp. 10–18, 2009.
  • [29] A. G. Karegowda, A. Manjunath, and M. Jayaram, “Comparative study of attribute selection using gain ratio and correlation based feature selection,” International Journal of Information Technology and Knowledge Management, vol. 2, no. 2, pp. 271–277, 2010.
  • [30] M. Hall, G. Holmes et al., “Benchmarking attribute selection techniques for discrete class data mining,” Knowledge and Data Engineering, IEEE Transactions on, vol. 15, no. 6, pp. 1437–1447, 2003.
  • [31] M. A. Hall, “Correlation-based feature subset selection for machine learning,” Ph.D. dissertation, University of Waikato, Hamilton, New Zealand, 1998.
  • [32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [33] F. Weninger, J. Bergmann, and B. Schuller, “Introducing currennt - the munich open-source cuda recurrent neural network toolkit,” Journal of Machine Learning Research, no. 16, pp. 547–551, 2014.