Machine learning in acoustics: a review

05/11/2019 · by Michael J. Bianco, et al. · University of California, San Diego

Acoustic data provide scientific and engineering insights in fields ranging from biology and communications to ocean and Earth science. We survey the recent advances and transformative potential of machine learning (ML), including deep learning, in the field of acoustics. ML is a broad family of statistical techniques for automatically detecting and utilizing patterns in data. Relative to conventional acoustics and signal processing, ML is data-driven. Given sufficient training data, ML can discover complex relationships between features. With large volumes of training data, ML can discover models describing complex acoustic phenomena such as human speech and reverberation. ML in acoustics is rapidly developing with compelling results and significant future promise. We first introduce ML, then highlight ML developments in five acoustics research areas: source localization in speech processing, source localization in ocean acoustics, bioacoustics, seismic exploration, and environmental sounds in everyday scenes.


I Introduction

Acoustic data provide scientific and engineering insights in a very broad range of fields, including machine interpretation of human speech [vincent2018audio] and animal vocalizations [mellinger2016signal], ocean source localization [gemba2019robust; niu2017], and imaging geophysical structures in the ocean [gerstoft1996; jensen2011computational]. In all these fields, data analysis is complicated by a number of challenges, including data corruption, missing or sparse measurements, reverberation, and large data volumes. For example, multiple acoustic arrivals of a single event or utterance make source localization and speech interpretation difficult tasks for machines [traer2016statistics; vincent2018audio]. In many cases, such as acoustic tomography and bioacoustics, large volumes of data can be collected. The amount of human effort required to manually identify acoustic features and events rapidly becomes limiting as the size of the data sets increases. Further, patterns may exist in the data that are not easily ascertained by human cognition.

Machine learning (ML) techniques [jordan2015machine; lecun2015] have enabled broad advances in automated data processing and pattern recognition capabilities across many fields, including computer vision, image processing, speech processing, and (geo)physical science [kong2018machine; bergen2019machine]. ML in acoustics is a rapidly developing field, with many compelling solutions to the aforementioned acoustics challenges. The potential impact of ML-based techniques in the field of acoustics, and the recent attention they have received, motivates this review.

ML, broadly defined, is a family of statistical techniques for automatically detecting and utilizing patterns in data. The patterns obtained are used to predict future data or make decisions from uncertain measurements. In this way ML provides a means for machines to gain knowledge, or to 'learn' [bishop2006; murphy2012].

ML methods are often divided into two major categories: supervised and unsupervised learning. There is also a third category, called reinforcement learning, but we do not discuss it here. In supervised learning, the goal is to learn a predictive mapping from inputs to outputs given labeled input and output pairs. The labels can be categorical or real-valued scalars, for classification and regression respectively. In unsupervised learning, no labels are given, and the task is to discover interesting or useful structure within the data. An example of unsupervised learning is clustering analysis (e.g. K-means [macqueen1967some]). Other paradigms exist but are beyond the scope of this paper. Supervised and unsupervised modes can also be combined: semi-supervised and weakly supervised learning refer to settings where the labels give only partial or approximately correct information.
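As a concrete sketch of unsupervised learning, K-means clustering alternates between assigning points to the nearest cluster centroid and recomputing the centroids (Lloyd's algorithm). The NumPy implementation below is a minimal illustration, not the paper's code; the synthetic blobs and the fixed initialization are assumptions chosen for reproducibility:

```python
import numpy as np

def kmeans(X, K, init_idx, n_iter=100):
    """Lloyd's algorithm: alternate nearest-centroid assignment and
    centroid update until the centroids stop moving."""
    centroids = X[init_idx].copy()   # fixed initialization for reproducibility
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        new = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

# Two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])
labels, centroids = kmeans(X, K=2, init_idx=[0, 50])  # one seed per blob
```

In practice random restarts replace the fixed initialization, since Lloyd's algorithm only converges to a local optimum.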

Research in acoustics has mostly been concerned with developing high-level physical models and using these for modeling and inference of the environment, represented by the x-axis in Fig. 1. With increasing amounts of data, data-driven approaches have achieved enormous success in web-based applications, represented by the y-axis in Fig. 1. It is expected that, as more data become available in the physical sciences, we will be able to better combine advanced acoustic models with ML.

In ML, data representations and models are based primarily on the structure of the data, rather than on physical models or prior knowledge. ML can nonetheless build upon physical models and prior knowledge, improving data interpretation by finding data representations that are 'optimal' in some sense [goodfellow2016deep].

Thus the features in the data are fundamental to ML methods, whose performance depends on the quality of the features for the task at hand. The features can be latent factors, as determined in the classic principal components analysis (PCA) approach, which is useful for dimensionality reduction and many other purposes. More flexible latent representations include Gaussian mixture models (GMMs), obtained using the expectation-maximization (EM) procedure. Another example is the classic multi-layer perceptron (MLP) neural network model, which can be understood as a general function approximator. Deep neural networks (NNs) are specialized versions of MLPs, with many layers to increase their representation and generalization capabilities [lecun2015; goodfellow2016deep].

This review focuses on the significant advances ML has already provided in the field of acoustics. We first introduce ML theory, including deep learning (DL). Then we discuss applications and advances of the theory in five acoustics research areas. In Secs. II–IV, basic ML concepts are introduced, and some fundamental algorithms are developed. In Sec. V, the field of DL is introduced, and applications to acoustics are discussed. Next, an overview of ML theory and applications in the following fields is highlighted: speech enhancement (Sec. VI), ocean acoustic source localization (Sec. VII), bioacoustics (Sec. VIII), seismic exploration (Sec. IX), and reverberation and environmental sounds in everyday scenes (Sec. X). While the list of fields we cover and the treatment of ML theory are not exhaustive, we hope this article can serve as inspiration for future ML research in acoustics. There are excellent ML and signal processing textbooks, which are useful supplements to the material presented here [murphy2012; hastie2009; bishop2006; goodfellow2016deep; duda2012pattern; cohen2009speech; vincent2018audio; elad2010; mairal2014].

Figure 1: (Color Online) Acoustic insight can be improved by leveraging the strengths of both physical and ML-based, data-driven models. Analytic physical models (lower left) give basic insights about physical systems. More sophisticated models, reliant on computational methods (lower right), can model more complex phenomena. Whereas physical models are reliant on rules, which are updated by physical evidence (data), ML is purely data-driven (upper left). By augmenting ML methods with physical models to obtain hybrid models (upper right), a synergy of the strengths of physical intuition and data-driven insights can be obtained.

II Machine learning principles

ML is data-driven and, given sufficient training data, can discover more complex (useful) relationships between features than conventional methods. Classic signal processing techniques for modeling and predicting data are based on provable performance guarantees. These methods use simplifying assumptions such as Gaussian independent and identically distributed (iid) variables and 2nd-order statistics (covariance). ML methods allow for more complex models of the data interdependencies, which have shown state-of-the-art performance in a number of tasks compared to conventional methods, e.g. DL [lecun2015]. However, the increased flexibility of ML models comes with certain difficulties.

Often the complexity of ML models and the algorithms for training them makes analysis of their performance guarantees difficult and can hinder model interpretation. Further, ML models require significant amounts of training data, though we note that 'vast' quantities of training data are not required to take advantage of ML techniques. By the no-free-lunch theorem [wolpert1997no], models whose performance is maximized for one task will likely perform worse at others. Provided high performance is desired only for a specific task, and there is enough training data, the benefits of ML may outweigh these issues.

II.1 Inputs and outputs

In ML, we are often interested in training a model to produce a desired output y given inputs x,

y = f(x) + ε.   (1)

Here x = [x_1, …, x_N]ᵀ are the inputs and y = [y_1, …, y_P]ᵀ are the desired outputs. x is a set of observations (data) from which we would like to make some prediction or decision. The observation x contains N features, which are used to calculate the output. ŷ = f(x) is the predicted output, and f can be a linear or non-linear mapping from input to desired output. Finally, ε = y − ŷ is the error between the estimate ŷ and the desired value y, which can for example be noise. For training an ML model, we need many training examples. We define X = [x_1, …, x_M] and the corresponding outputs Y = [y_1, …, y_M] for the M observations.

The use of ML to obtain a desired output or action y from observations x, as described above, is called supervised learning (Sec. III). Often, we might instead wish to discover interesting or useful patterns in the data without explicitly specifying an output. This is called unsupervised learning (Sec. IV).

II.2 Supervised and unsupervised learning

ML methods generally can be categorized as either supervised or unsupervised learning tasks. In supervised learning, the task is to learn a predictive mapping from inputs to outputs given labeled input and output pairs. Supervised learning is the most widely used ML category, and includes familiar methods such as linear regression (and its regularized form, ridge regression) and nearest-neighbor classifiers, as well as more sophisticated support vector machine (SVM) and neural network (NN) models, sometimes referred to as artificial NNs due to their weak relationship to neural structure in the biological brain. In unsupervised learning, no labels are given, and the task is to discover interesting or useful structure within the data. This has many useful applications, including data visualization, exploratory data analysis, anomaly detection, and feature learning. Unsupervised methods such as PCA, K-means [macqueen1967some], and Gaussian mixture models (GMMs) have been used for decades. Newer methods include t-SNE [maaten2008visualizing], dictionary learning [tosic2011], and deep representations (e.g. autoencoders) [goodfellow2016deep]. An important point is that the results of unsupervised methods can be used either directly, such as for discovery of latent factors or data visualization, or as part of a supervised learning framework, where they supply transformed versions of the features to improve supervised learning performance.

II.3 Generalization: train and test data

Central to ML is the requirement that learned models must perform well on unobserved data, as well as observed data. The ability of the model to predict unseen data well is called generalization. We first discuss relevant terminology, then discuss how generalization of an ML model can be assessed.

Often the term complexity is used to denote the level of sophistication of the data relationships or the ML task. The ability of a particular ML model to approximate data of a particular complexity well is called its capacity. These terms are not strictly defined, but efforts have been made to formalize them mathematically. For example, the Vapnik-Chervonenkis (VC) dimension provides a means of quantifying model capacity in the case of binary classifiers [hastie2009]. Data complexity, in turn, can be interpreted as the number of dimensions in which useful relationships exist between features: higher complexity implies higher-dimensional relationships. We note that the capacity of an ML model can be limited by the quantity of training data.

Figure 2: (Color online) Model generalization with polynomial regression. (Top) The true signal, training data, and three of the polynomial regression results are shown. (Bottom) The root mean square error (RMSE) of the predicted training and test signals were estimated for each polynomial degree.

In general, ML models perform best when their capacity is suited to the complexity of the data provided and the task. For mismatched model-data/task complexities, two situations can arise. If a high-capacity model is used for a low-complexity task, the model will overfit, or learn the noise or idiosyncrasies of the training set. In the opposite scenario, a low-capacity model trained on a high-complexity task will tend to underfit the data, or not learn enough details of the physical model. Both overfitting and underfitting hurt ML model generalization. The behavior of the ML model on training and novel observations relative to the model (hyper)parameters can be used to determine the appropriate model complexity. We next discuss how this can be done. We note that underfitting and overfitting can be quantified using the bias and variance of the ML model [hastie2009].

To estimate the performance of ML models on unseen observations, and thereby assess their generalization, a set of test data drawn from the full data set can be excluded from the model training and used to estimate generalization given the current parameters. In many cases, the data used in developing the ML model are split repeatedly into different sets of training and test data using cross-validation techniques [kohavi1995].

The test data are used to adjust the model hyperparameters (e.g. regularization, priors, number of NN units/layers) to optimize generalization. The hyperparameters are model dependent, but generally govern the model's capacity to learn.

In Fig. 2, we illustrate the effect of model capacity on train and test error using polynomial regression. Train and test data (10 and 100 points) were generated from a sinusoid, y = sin(2πx), with additive Gaussian noise. Polynomial models of degree 0 to 9 were fit to the training data, and the root mean square errors of the test and train data predictions are compared: RMSE = ( (1/N) Σ_{n=1}^{N} (y_n − ŷ_n)² )^{1/2}, with N the number of samples (test or train) and ŷ_n the estimate of y_n. Increasing model capacity (complexity), as expected, decreases the training error, up to degree 9, where the degree plus intercept matches the number of training points (degrees of freedom). While increasing the complexity initially decreases the RMSE of the test data prediction, polynomial degrees greater than 5 increase the test error. Thus, it can be concluded that the model that generalizes best is degree 5. In ML applications on real data, the test/train error curves are generated using cross-validation to improve the robustness of the model selection.
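The train/test behavior described above can be reproduced in a few lines of NumPy. The sketch below mirrors the data sizes of Fig. 2 (10 training points, 100 test points), but the noise level and random seeds are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.2):
    """Noisy samples of the sinusoid y = sin(2*pi*x) on (0, 1)."""
    x = np.sort(rng.uniform(0, 1, n))
    return x, np.sin(2 * np.pi * x) + rng.normal(0, noise, n)

x_train, y_train = make_data(10)    # 10 training points, as in Fig. 2
x_test, y_test = make_data(100)     # 100 held-out test points

rmse_train, rmse_test = [], []
for degree in range(10):            # model capacities: degrees 0..9
    coeffs = np.polyfit(x_train, y_train, degree)
    for x, y, out in [(x_train, y_train, rmse_train),
                      (x_test, y_test, rmse_test)]:
        out.append(np.sqrt(np.mean((y - np.polyval(coeffs, x)) ** 2)))

# Training RMSE falls monotonically with capacity (degree 9 interpolates
# the 10 training points), but test RMSE eventually rises: overfitting.
```

Plotting `rmse_train` and `rmse_test` against degree reproduces the characteristic U-shaped test-error curve of Fig. 2 (bottom).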

II.4 Cross-validation

One popular cross-validation technique, called K-fold cross-validation [hastie2009], assesses model generalization by dividing the training data into K roughly equal-sized subgroups, called folds, excluding one fold from the model training, and calculating the error on the excluded fold. This procedure is executed K times, with the kth fold used as the test data and the remaining K−1 folds used for model training. With target values y_n assigned to folds by the mapping κ(n) and inputs x_n, the cross-validation error is

CV(θ) = (1/N) Σ_{n=1}^{N} L( y_n, f^{−κ(n)}(x_n, θ) ),   (2)

with f^{−κ(n)} the model learned using all folds except κ(n), θ the hyperparameters, and L a loss function. CV(θ) gives a curve describing the cross-validation (test) error as a function of the hyperparameters.
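A minimal NumPy sketch of Eq. (2), here used to compare polynomial degrees with a squared-error loss. The data set, loss, and fold count are illustrative assumptions, not from the paper:

```python
import numpy as np

def kfold_cv(x, y, degree, K=5, seed=0):
    """K-fold cross-validation error, Eq. (2), for a polynomial fit
    with squared-error loss."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), K)  # the mapping kappa(n)
    errs = []
    for k in range(K):
        test = folds[k]                              # held-out fold
        train = np.concatenate(folds[:k] + folds[k + 1:])
        coeffs = np.polyfit(x[train], y[train], degree)
        resid = y[test] - np.polyval(coeffs, x[test])
        errs.append(np.mean(resid ** 2))             # loss on the held-out fold
    return np.mean(errs)

# Noisy sinusoid; compare under-, well-, and over-parameterized models
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 40)
cv_errors = {d: kfold_cv(x, y, d) for d in [1, 3, 9]}
```

Scanning `kfold_cv` over a grid of hyperparameter values (here, the degree) traces out the CV(θ) curve described in the text.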

Figure 3: (Color online) Illustration of the curse of dimensionality. Ten uniformly distributed data points on the interval (0, 1) can be quite close in 1D (top, squares), but as the number of dimensions D increases, the distance between the points increases rapidly. This is shown for points in 2D (top, circles) and 3D (bottom). The increasing volume presents two issues. (1) Local methods (like K-means) break down with increasing dimension, since small neighborhoods in lower-dimensional space cover an increasingly small fraction of the volume as the dimension increases. (2) Assuming discrete feature values, the number of possible data configurations, and thereby the minimum number of training examples, increases with dimension D [hastie2009; goodfellow2016deep].

II.5 Curse of dimensionality

High-dimensional data also present challenges in ML, referred to as the 'curse of dimensionality'. Consider features uniformly distributed in D dimensions (see Fig. 3), with the feature values normalized to (0, 1). A neighborhood, described for example as a hypercube with edge length e, constitutes a decreasing fraction of the feature-space volume as D increases: to capture a fraction v of the volume, the required edge length is e_D(v) = v^{1/D}, which approaches 1 (the full axis range) as D grows. Similarly, data tend to become more sparsely distributed in high-dimensional space. The curse of dimensionality most strongly affects methods that depend on distance measures in feature space, such as K-means, since neighborhoods are no longer 'local'. Another result of the curse of dimensionality is the increased number of possible feature configurations, which may lead to ML models requiring more training data to learn representations.
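The edge-length formula e_D(v) = v^{1/D} is easy to verify numerically; the sketch below also estimates how pairwise distances between uniform points grow with dimension (the point counts and dimensions are illustrative choices):

```python
import numpy as np

# Edge length of a hypercube neighborhood covering a fraction v of the
# unit feature-space volume in D dimensions: e_D(v) = v**(1/D)
v = 0.01                      # capture 1% of the volume
edges = {D: v ** (1.0 / D) for D in [1, 2, 10, 100]}
# e.g. edges[10] ~ 0.63: a '1% neighborhood' must span ~63% of each axis,
# so it is no longer 'local' in any meaningful sense

# Mean pairwise distance between 10 uniform points grows with dimension
rng = np.random.default_rng(0)
mean_dist = {}
for D in [1, 10, 100]:
    X = rng.uniform(size=(10, D))
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    mean_dist[D] = d[np.triu_indices(10, k=1)].mean()
```

Both effects, neighborhoods swelling toward the full axis range and points drifting apart, are the quantitative face of the curse of dimensionality described above.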

With prior assumptions on the data, enforced as model constraints (e.g. total variation [chambolle2004] or ℓ1 regularization), training with smaller data sets is possible [goodfellow2016deep]. This is related to the concept of learning a manifold, or a lower-dimensional embedding of the salient features. While the manifold assumption is not always correct, it is at least approximately correct for processes involving images and sound (for more discussion, see Ref. [goodfellow2016deep], pp. 156–159).

II.5.1 Bayesian machine learning

A theoretically robust way to implement ML methods is to use the tools of probability, which have been a critical force in the development of modern science and engineering. Bayesian statistics provide a framework for integrating prior knowledge and uncertainty about physical systems into ML models. They also provide a convenient analysis of the uncertainty of estimated parameters. Naturally, Bayes' rule plays a fundamental role in many acoustic applications, especially in methods for estimating the parameters of model-based inverse methods. In the wider ML community, there are also efforts to make ML Bayesian and model-based; for a review see Ref. [ghahramani2015]. We here discuss the basic rules of probability as they relate to Bayesian analysis, and show how Bayes' rule can be used to estimate ML model parameters.

Figure 4: (Color online) Bayesian estimate of polynomial regression model parameters for the sinusoidal data from Fig. 2. Given prior knowledge and assumptions about the data, Bayesian parameter estimation can help prevent overfitting. It also provides statistics about the predictions. The mean of the prediction (blue line) is compared with the true signal (red) and the training data (blue dots, same as Fig. 2). The standard deviation (STD) of the prediction (light blue) is also given by the Bayesian estimate. The estimate uses prior knowledge of the noise standard deviation and a Gaussian prior on the model weights w.

Two simple rules of probability are of fundamental importance for Bayesian ML [bishop2006]. They are the sum rule

p(x) = Σ_y p(x, y),   (3)

and the product rule

p(x, y) = p(y|x) p(x).   (4)

Here the ML model inputs x and outputs y are uncertain quantities. The sum rule (3) states that the marginal distribution p(x) is obtained by summing the joint distribution p(x, y) over all values of y. The product rule (4) states that the joint distribution p(x, y) is obtained as the product of the conditional distribution p(y|x) and the marginal p(x).

Bayes' rule is obtained from the sum and product rules as

p(y|x) = p(x|y) p(y) / p(x),   (5)

which gives the model output y conditioned on the input x as the joint distribution p(x, y) = p(x|y) p(y) divided by the marginal p(x).

In ML, we need to choose an appropriate model (1) and estimate the model parameters θ that best give the desired output y from inputs x. This is the inverse problem. The distribution of the model parameters conditioned on the data is p(θ|x, y). From Bayes' rule (5) we have

p(θ|x, y) = p(y|x, θ) p(θ) / p(y|x),   (6)
p(y|x) = ∫ p(y|x, θ) p(θ) dθ.   (7)

Here p(θ) is the prior distribution on the parameters, p(y|x, θ) is called the likelihood, and p(θ|x, y) the posterior. The quantity p(y|x) is the distribution of the data, also called the evidence or Type II likelihood. Often it can be neglected, since for given data it is constant and we are concerned with inferring θ.

A Bayesian estimate of the parameters is obtained using (6). Assume a scalar linear model y = wᵀx + ε, with ε Gaussian noise with variance σ² and the parameters w the model weights (see Sec. III.1 for more details). A simple solution to the parameter estimate is obtained if we assume the prior on the weights is Gaussian, p(w) = N(w | μ₀, Σ₀), with mean μ₀ and covariance Σ₀, and a Gaussian likelihood p(y | X, w) = N(y | Xᵀw, σ²I). We then obtain (see Ref. [bishop2006], p. 93)

p(w | X, y) = N(w | μ_N, Σ_N),   (8)
μ_N = Σ_N ( Σ₀⁻¹ μ₀ + σ⁻² X y ),   (9)
Σ_N⁻¹ = Σ₀⁻¹ + σ⁻² X Xᵀ.   (10)

These formulas are very efficient for sequential estimation because the prior is conjugate, i.e. the posterior takes the same form as the prior. In acoustics this framework has been used for range estimation [michalopoulou2019] and for sparse estimation via the sparse Bayesian learning approach [Gemba2017SBL; nannuru2019]. In the latter, the sparsity is controlled by a diagonal prior covariance matrix, where entries with zero prior variance force the corresponding posterior variance and mean to zero.
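Equations (8)–(10) (with a zero-mean isotropic prior) take only a few lines to implement. The sketch below follows the spirit of the Fig. 4 fit, but the polynomial degree, precisions, and data are illustrative assumptions:

```python
import numpy as np

def bayes_linreg(Phi, t, alpha=1.0, beta=25.0):
    """Posterior N(m, S) over weights w for t = Phi @ w + noise, with
    zero-mean isotropic Gaussian prior (precision alpha) and Gaussian
    noise (precision beta); cf. Eqs. (8)-(10)."""
    M = Phi.shape[1]
    S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # posterior precision
    S = np.linalg.inv(S_inv)                         # posterior covariance
    m = beta * S @ Phi.T @ t                         # posterior mean
    return m, S

# Degree-5 polynomial features for noisy sinusoid data, as in Fig. 2
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 10))
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 10)
Phi = np.vander(x, 6, increasing=True)               # 1, x, ..., x^5
m, S = bayes_linreg(Phi, t)

# Predictive mean and variance at new inputs: noise term plus a
# parameter-uncertainty term phi^T S phi (the STD band of Fig. 4)
x_new = np.linspace(0, 1, 100)
Phi_new = np.vander(x_new, 6, increasing=True)
pred_mean = Phi_new @ m
pred_var = 1 / 25.0 + np.einsum('ij,jk,ik->i', Phi_new, S, Phi_new)
```

Unlike the point estimates of Fig. 2, the posterior covariance S directly yields the predictive uncertainty band plotted in Fig. 4.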

With prior knowledge and assumptions about the data, Bayesian approaches to parameter estimation can prevent overfitting. Further, Bayesian approaches provide the probability distribution of the target estimates. Fig. 4 shows a Bayesian estimate of the polynomial curve fit developed in Fig. 2. The mean and standard deviation of the predictions from the model are given. The Bayesian curve fitting is here performed assuming prior knowledge of the noise standard deviation and a Gaussian prior on the weights w. The hyperparameters can also be estimated from the data using empirical Bayes [gelman2013bayesian]. This is a counterpoint to the test/train error analysis (Fig. 2), where fewer assumptions are made about the data and the noise is unknown. We note that it is not always practical to formally implement Bayesian parameter estimation; where it is applicable, it well characterizes the ML results.

III Supervised learning

The goal of supervised learning is to learn a mapping from a set of inputs x to desired outputs y, given labeled input and output pairs (1). For this discussion, we focus on real-valued features and labels. The features in an observation x can be real, complex, or categorical (binary or integer). Based on the type of desired output y, supervised learning can be divided into two subcategories: regression and classification. When y is real or complex valued, the task is regression. When y is categorical, the task is called classification.

The methods of finding the function f are the core of ML methods, and the subject of this section. Generally, we prefer to use the tools of probability to find f, if practical. We can state the supervised ML task as maximizing the conditional distribution p(y|x). One example is the maximum a posteriori (MAP) estimator

ŷ = argmax_y p(y|x),   (11)

which gives the most probable value of y, corresponding to the mode of the distribution p(y|x) conditioned on the observed evidence x. While the MAP estimate can be considered Bayesian, it is really only a step toward a Bayesian treatment (see Sec. II.5.1), since MAP returns a point estimate rather than the posterior distribution.

In the following, we further describe regression and classification methods, and give some illustrative applications.

III.1 Linear regression, classification

We illustrate supervised ML with a simple method: linear regression. We develop a MAP formulation of linear regression in the context of DOA estimation in beamforming. In seismic and acoustic beamforming, waveforms are recorded on an array of receivers with the goal of finding their direction of arrival (DOA). The observations y are waveform measurements from N receivers, and the output is the DOA azimuth angle θ (see (1)). The relationship between DOA and array power is non-linear, but it is expressed as a linear problem by discretizing the array response over M candidate angles θ_m using basis functions a_m(θ_m), called steering vectors. The array observations are then expressed as a weighted sum of the steering vectors A = [a_1, …, a_M]. The weights w = [w_1, …, w_M]ᵀ relate the steering vectors to the observations y. We thus write the linear measurement model as

y = Aw + ε.   (12)

In the case of a single source, the DOA is the angle θ_m corresponding to max_m |w_m|. ε is noise (often Gaussian). We seek values of the weights w which minimize the difference between the left- and right-hand sides of (12).
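As a concrete sketch of the measurement model (12), consider a hypothetical 10-element uniform line array with half-wavelength element spacing; the array geometry, grid, and noise level below are illustrative assumptions. Conventional (delay-and-sum) beamforming then recovers a single DOA from y:

```python
import numpy as np

N = 10                                    # number of receivers
angles = np.deg2rad(np.arange(-90, 91))   # candidate DOAs, 1-degree grid
n = np.arange(N)
# Plane-wave steering vectors for half-wavelength element spacing:
# a_m = exp(j * pi * n * sin(theta_m)); the columns of A span the grid
A = np.exp(1j * np.pi * np.outer(n, np.sin(angles)))

# Single source at +20 degrees: y = A w + noise, w one-hot on the grid
rng = np.random.default_rng(0)
w = np.zeros(len(angles))
w[110] = 1.0                              # grid index of +20 degrees
y = A @ w + 0.05 * (rng.normal(size=N) + 1j * rng.normal(size=N))

# Conventional beamformer: power |a_m^H y|^2 peaks near the source DOA
power = np.abs(A.conj().T @ y) ** 2
doa_est = np.rad2deg(angles[power.argmax()])
```

The regularized estimators (17) and (19) discussed below replace this simple matched-filter scan with explicit solutions for the weight vector w.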

From Bayes’s rule (5), the posterior of the model is

(13)

with the likelihood and the prior. Assuming the noise Gaussian iid with zero-mean, with the identity,

(14)

with a constant and complex Gaussian. Maximizing the posterior, we obtain

(15)

Thus, the MAP estimate , is

(16)

Depending on the choice of probability density function for the prior p(w), different solutions are obtained. One popular choice is a Gaussian distribution. For p(w) Gaussian with zero mean, (16) becomes

ŵ₂ = argmin_w ‖y − Aw‖₂² + λ‖w‖₂²,   (17)

where λ is a regularization parameter determined by the noise variance σ² and the variance of w. This is the classic ℓ2-regularized least-squares estimate (a.k.a. damped least squares, or ridge regression) [rodgers2000; bishop2006; aster2013]. Eq. (17) has the analytic solution

ŵ₂ = (AᴴA + λI)⁻¹ Aᴴ y.   (18)
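A sketch of the closed-form solution (18) on a synthetic, real-valued system (the dimensions, noise level, and λ are illustrative assumptions):

```python
import numpy as np

def ridge(A, y, lam=0.1):
    """L2-regularized least squares, Eq. (18):
    w = (A^H A + lam * I)^{-1} A^H y."""
    M = A.shape[1]
    # Solve the regularized normal equations rather than forming the inverse
    return np.linalg.solve(A.conj().T @ A + lam * np.eye(M), A.conj().T @ y)

# Overdetermined noisy linear system with known true weights
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = A @ w_true + rng.normal(0, 0.1, 50)
w_hat = ridge(A, y, lam=0.1)
```

Using `np.linalg.solve` on the normal equations, rather than an explicit matrix inverse, is the standard numerically preferable route.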

Although the ℓ2 regularization in (17) is often convenient, it is sensitive to outliers in the data y. In the presence of outliers, or if the true weights w are sparse (e.g. few non-zero weights), a better prior is the Laplacian, which gives

ŵ₁ = argmin_w ‖y − Aw‖₂² + λ‖w‖₁,   (19)

where λ is a regularization parameter, determined by the scaling parameter of the Laplacian distribution [murphy2012]. Eq. (19) is called the ℓ1-regularized least-squares estimator of w. While the problem is convex, it has no analytic solution, though there are many practical algorithms for obtaining it [elad2010; mairal2014; gerstoft2015]. In sparse modeling, the ℓ1 regularization is considered a convex relaxation of the ℓ0 pseudo-norm and, under certain conditions, provides a good approximation to the ℓ0-norm; for a more detailed discussion see Refs. [elad2010; mairal2014]. The solution to (19) is also known as the LASSO [tibshirani1996], and forms the cornerstone of the field of compressive sensing (CS) [candes2006; donoho2006; gerstoft2018].

Whereas in the estimate obtained from (17) many of the coefficients are small, the estimate from (19) has only a few non-zero coefficients. Sparsity is a desirable property in many applications, including array processing [haykin2014; gerstoft2018] and image processing [mairal2014]. We give an example of ℓ1 (in CS) and ℓ2 regularization in the estimation of DOAs on a line array in Fig. 5.
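There is no closed form for (19), but simple iterative solvers exist. Below is a minimal iterative soft-thresholding (ISTA) sketch, one of the many practical algorithms referenced above, applied to a synthetic sparse-recovery problem; the dimensions, support, and λ are illustrative assumptions:

```python
import numpy as np

def ista(A, y, lam=0.01, n_iter=2000):
    """Iterative soft-thresholding for the LASSO, Eq. (19):
    min_w 0.5 * ||y - A w||_2^2 + lam * ||w||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = w + A.T @ (y - A @ w) / L      # gradient step on the data fit
        # Soft-thresholding (proximal step) produces exact zeros
        w = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)
    return w

# 3 non-zero weights out of 50, observed through 30 random projections
rng = np.random.default_rng(0)
A = rng.normal(size=(30, 50)) / np.sqrt(30)
w_true = np.zeros(50)
w_true[[5, 20, 40]] = [1.0, -0.8, 0.6]
y = A @ w_true
w_hat = ista(A, y)
```

The soft-thresholding step is what drives most coefficients exactly to zero, in contrast to the ridge solution (18), which merely shrinks them.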

Linear regression can be extended to the binary classification problem. Here, for binary classification, we have a single desired output y_m ∈ {0, 1} for each input x_m; the desired labels for the M observations form the row vector y = [y_1, …, y_M], modeled as

y = wᵀX + ε,   (20)

where w is the weights vector and X = [x_1, …, x_M] the matrix of inputs. Following the derivation of (17), the MAP estimate of the weights is given by

ŵ₂ = (XXᴴ + λI)⁻¹ X yᵀ,   (21)

with ŵ₂ the ridge regression estimate of the weights.

This ridge regression classifier is demonstrated for binary classification in Fig. 6 (top). The cyan class is y = 0 and red is y = 1; thus, the decision boundary (black line) is ŷ = 1/2. Points with ŷ < 1/2 are classified as y = 0, and points with ŷ ≥ 1/2 are classified as y = 1. In the case where each class is composed of a single Gaussian distribution (as in this example), the linear decision boundary can do well [hastie2009]. However, for more arbitrary distributions, such a linear decision boundary may not suffice, as shown by the poor classification results of the ridge classifier on the concentric class distributions in Fig. 6 (top-right).
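A minimal sketch of this classifier, Eq. (21) with an appended bias column; the two-Gaussian data below are an illustrative stand-in for the distributions of Fig. 6 (top-left):

```python
import numpy as np

# Two Gaussian classes with labels y in {0, 1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (50, 2)),   # class 0
               rng.normal(+1.0, 0.5, (50, 2))])  # class 1
y = np.concatenate([np.zeros(50), np.ones(50)])

# Ridge-regularized normal equations, with a column of ones for the bias
A = np.column_stack([X, np.ones(len(X))])
lam = 1.0
w = np.linalg.solve(A.T @ A + lam * np.eye(3), A.T @ y)

# Classify by thresholding the regression output at 1/2
y_hat = (A @ w > 0.5).astype(float)
accuracy = (y_hat == y).mean()
```

On well-separated Gaussian classes like these, the linear boundary is nearly perfect; replacing the blobs with concentric rings is exactly the failure case shown in Fig. 6 (top-right).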

Figure 5: (Color online) Beamformer direction of arrival (DOA) estimation using compressive sensing [xenaki2014] (CS, red) and least squares (conventional beamforming, dashed blue). Reproduced from Ref. [xenaki2014].

In the case of the concentric distribution, a non-linear decision boundary must be obtained. This can be done using many classification algorithms, including logistic regression and support vector machines (SVMs) [murphy2012]. In the following section we illustrate non-linear decision boundary estimation using SVMs.

III.2 Support vector machines

Thus far in our discussion of classification and regression, we have calculated the outputs based on feature vectors x in the raw feature dimension (classification) or on a transformed version of the inputs (beamforming, regression). Often, we can make classification methods more flexible by enlarging the feature space with non-linear transformations φ(x) of the inputs. These transformations can make data linearly separable in the transformed space that is not separable in the original feature space (see Fig. 6). However, for large feature expansions, the feature transform calculation can be computationally prohibitive.

Support vector machines (SVMs) can be used to perform classification and regression tasks where the transformed feature space is very large (potentially infinite). SVMs are based on maximum margin classifiers [murphy2012], and use a concept called the kernel trick to use potentially infinite-dimensional feature mappings with reasonable computational cost [bishop2006]. This uses kernel functions, relating the transforms of two feature vectors as k(x_i, x_j) = φ(x_i)ᵀφ(x_j). They can be interpreted as similarity measures of linear or non-linear transformations of the feature vectors x. Kernel functions can take many forms (see Ref. [bishop2006], pp. 291–323), but for this review we illustrate SVMs with the Gaussian radial basis function (RBF) kernel

k(x_i, x_j) = exp( −γ ‖x_i − x_j‖² ),   (22)

where γ controls the length scale of the kernel. The RBF kernel can also be used for regression, and is one example of the kernelization of an infinite-dimensional feature transform.
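Eq. (22) is straightforward to compute for a set of feature vectors; a small sketch (the points and γ are chosen purely for illustration):

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=0.5):
    """Gaussian RBF kernel, Eq. (22): k(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [3.0, 4.0]])
K = rbf_kernel(X, X)
# K is symmetric with unit diagonal (every point is maximally similar
# to itself) and decays toward 0 as points move farther apart
```

The resulting Gram matrix K is positive semi-definite, which is what allows the SVM dual (28) below to treat it as an inner product in the transformed space.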

SVMs can be easily formulated to take advantage of such kernel transformations. Below, we derive the maximum margin classifier of the SVM, following the arguments of Ref. [bishop2006], and show how kernels can be used to enhance classification.

Figure 6: (Color online) Binary classification of points from two-Gaussian and radially distributed classes (red, cyan) using ridge regression (top), support vector machines (SVMs) with radial basis functions (RBFs, middle) with support vectors (black circles), and a feed-forward NN (bottom). The SVM is more flexible than linear regression and can fit more general distributions using the kernel trick with, e.g., RBFs. The NN requires fewer data assumptions to separate the classes, instead using non-linear modeling to fit the distributions.

Initially, we assume linearly separable features (see Fig. 7) with classes y_m ∈ {−1, 1}. The class of the objects corresponding to the features x_m is determined by

ŷ_m = wᵀx_m + b,   (23)

with w and b the weights and bias. A decision hyperplane satisfying wᵀx + b = 0 is used to separate the classes. If x_m is above the hyperplane (wᵀx_m + b > 0), the estimated class label is ŷ_m = 1, whereas if x_m is below (wᵀx_m + b < 0), ŷ_m = −1. This gives the condition y_m(wᵀx_m + b) > 0 for correct classification. The margin is defined as the distance between the nearest features (Fig. 7) with different labels, x₁ and x₂. These points correspond to the equations wᵀx₁ + b = 1 and wᵀx₂ + b = −1. The difference between these equations, normalized by the weights ‖w‖, yields an expression for the margin:

(wᵀ/‖w‖)(x₁ − x₂) = 2/‖w‖.   (24)

The expression says that the projection of the difference of x₁ and x₂ on w/‖w‖ (the unit vector perpendicular to the hyperplane) is 2/‖w‖. Hence, the margin is 2/‖w‖.

The weights w and bias b are estimated by maximizing the margin, subject to the constraint that the points are correctly classified:

max_{w,b} 2/‖w‖  subject to  y_m(wᵀx_m + b) ≥ 1 ∀ m.   (25)

However, the term 2/‖w‖ makes (25) difficult to solve directly. Eq. (25) is therefore reformulated as a quadratic program, which can be solved using well-established techniques:

min_{w,b} ½‖w‖²  subject to  y_m(wᵀx_m + b) ≥ 1 ∀ m.   (26)

If the data are linearly non-separable (class overlapping), slack variables allows some of the training points to be misclassified.bishop2006 This gives

$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \dfrac{1}{2}\|\mathbf{w}\|_2^2 + C\sum_{n=1}^{N}\xi_n \quad \text{subject to} \quad t_n(\mathbf{w}^{\top}\mathbf{x}_n + b) \ge 1 - \xi_n,\ \ \xi_n \ge 0$    (27)

The parameter $C$ controls the trade-off between the slack variable penalty and the margin.

For non-linear classification problems, the quadratic program (27) can be kernelized, via a technique called the kernel trick,bishop2006 to make the data linearly separable in a non-linear space defined by the feature vectors $\boldsymbol{\phi}(\mathbf{x})$ in the kernel function $k(\mathbf{x}_n, \mathbf{x}_m) = \boldsymbol{\phi}(\mathbf{x}_n)^{\top}\boldsymbol{\phi}(\mathbf{x}_m)$. Eq. (27) can be rewritten using the Lagrangian dualbishop2006

$\max_{\mathbf{a}}\ \sum_{n=1}^{N} a_n - \dfrac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N} a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \quad \text{subject to} \quad 0 \le a_n \le C,\ \ \sum_{n=1}^{N} a_n t_n = 0$    (28)

Eq. (28) is solved as a quadratic programming problem. From the Karush-Kuhn-Tucker conditions,bishop2006 either $a_n = 0$ or $t_n y(\mathbf{x}_n) = 1 - \xi_n$. Points with $a_n = 0$ are not considered in the solution to (28). Thus, only points within the specified slack distance from the margin, with $a_n > 0$, participate in the prediction. These points are called support vectors.

In Fig. 6 we use SVM with the RBF kernel (22) to classify points where the true decision boundary is either linear or circular. The SVM result is compared with linear regression (Sec. III.1) and NNs (Sec. III.3). Where linear regression fails on the circular decision boundary, SVM with the RBF kernel separates the two classes well. The SVM example was implemented in Python using Scikit-learn.scikit-learn
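A minimal Scikit-learn sketch in the spirit of Fig. 6 (the data distribution, sample size, and hyperparameters here are illustrative assumptions, not the paper's exact setup): a circular decision boundary that defeats a linear model but is handled by the RBF kernel.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = (np.sum(X**2, axis=1) < 0.5).astype(int)  # circular true boundary

# A linear kernel cannot represent this geometry; the RBF kernel can.
linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear accuracy:", linear.score(X, y))
print("RBF accuracy:   ", rbf.score(X, y))
print("support vectors:", rbf.n_support_.sum())
```

The support vectors reported by `n_support_` are the training points with non-zero dual coefficients $a_n$ in (28).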

We note that the SVM does not provide probabilistic output, since it gives hard labels of data points rather than distributions, though its label uncertainties can be quantified heuristically.murphy2012

Because the SVM is a two-class model, a multi-class SVM with $K$ classes is created by training $K(K-1)/2$ models on all possible pairs of classes. Each point is then assigned to the class to which it is assigned most frequently across the pairwise models, until all points are assigned a class from $1$ to $K$. This approach is known as the “one-versus-one” scheme, although slight modifications have been introduced to reduce computational complexity.bishop2006 ; murphy2012
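The one-versus-one scheme can be seen directly in Scikit-learn, which trains the $K(K-1)/2$ pairwise classifiers internally (the three-blob data set below is an illustrative assumption):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
K = 3  # three well-separated Gaussian blobs
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (2, 0), (1, 2)]])
y = np.repeat(np.arange(K), 50)

# decision_function_shape='ovo' exposes one score per class pair,
# i.e. K*(K-1)/2 = 3 pairwise decision values for K = 3.
clf = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
print(clf.decision_function(X[:1]).shape)
```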

Figure 7: (Color online) A hyperplane learned by training an SVM in two dimensions.

III.3 Neural networks: multi-layer perceptron

Neural networks (NN) can overcome the limitations of linear models (linear regression, SVM) by learning a non-linear mapping of the inputs from the data over their network structure. Linear models are appealing because they can be fit efficiently and reliably, with solutions obtained in closed form or with convex optimization. However, they are limited to modeling linear functions. As we saw in previous sections, linear models can use non-linear features by prescribing basis functions (DOA estimation) or by mapping the features into a more useful space using kernels (SVM). Yet these prescribed feature mappings are limited since kernel mappings are generic, and based on the principle of local smoothness. Such general functions perform well for many tasks, but better performance can be obtained for specific tasks by training on specific data. NNs (and also dictionary learning, see Sec. IV) provide the algorithmic machinery to learn representations directly from data. lecun1998 ; lecun2015 ; goodfellow2016deep

The purpose of feed-forward NNs, also referred to as deep NNs (DNNs) or multi-layer perceptrons (MLPs), is to approximate functions (Eq. (1)). These models are called feed-forward because information flows only from the inputs (features) to the outputs (labels), through the intermediate calculations. When feedback connections are included in the network, the network is referred to as a recurrent neural network (RNN, for more details see Sec. V).

NNs are called networks because they are composed of a series of functions, associated with a directed graph. Each set of functions in the NN is referred to as a layer. The number of layers in the network (see Fig. 8), called the NN depth, is typically the number of hidden layers plus one (the output layer). The NN depth is one of the parameters that affect the capacity of the NN. The term deep learning refers to NNs with many layers.goodfellow2016deep

In Fig. 8, an example three-layer fully-connected NN is illustrated. The first layer, called the input layer, holds the features $\mathbf{x}$. The last layer, called the output layer, holds the target values, or labels, $\mathbf{y}$. The intervening layers of the NN, called hidden layers since the training data do not explicitly define their output, are $\mathbf{h}^{(1)}$ and $\mathbf{h}^{(2)}$. The circles in the network (see Fig. 8) represent network units.

The output of the network units in the hidden and output layers is a non-linear transformation of the inputs, called the activation. Common activation functions include softmax, sigmoid, hyperbolic tangent, and rectified linear units (ReLU). Activation functions are further discussed in Sec. V. Before the activation, a linear transformation is applied to the inputs

$z_j^{(1)} = \mathbf{w}_j^{(1)\top}\mathbf{x} + b_j^{(1)}$    (29)

with $z_j^{(1)}$ the input to the $j$th unit of the first hidden layer, and $\mathbf{w}_j^{(1)}$ and $b_j^{(1)}$ the weights and biases, which are to be learned. The output of the hidden unit is $h_j^{(1)} = g(z_j^{(1)})$, with $g$ the activation function. Similarly, for the second hidden layer,

$z_j^{(2)} = \mathbf{w}_j^{(2)\top}\mathbf{h}^{(1)} + b_j^{(2)}$    (30)

and $h_j^{(2)} = g(z_j^{(2)})$.
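The layer-by-layer computation above can be sketched in a few lines of NumPy (all sizes and the choice of tanh/softmax activations are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h1, n_h2, n_out = 4, 8, 8, 3   # assumed layer sizes

W1, b1 = rng.normal(size=(n_h1, n_in)), np.zeros(n_h1)
W2, b2 = rng.normal(size=(n_h2, n_h1)), np.zeros(n_h2)
W3, b3 = rng.normal(size=(n_out, n_h2)), np.zeros(n_out)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=n_in)          # input features
h1 = np.tanh(W1 @ x + b1)          # first hidden layer: g(Wx + b)
h2 = np.tanh(W2 @ h1 + b2)         # second hidden layer
y = softmax(W3 @ h2 + b3)          # output layer: class probabilities
print(y.sum())                     # softmax outputs sum to 1
```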

Figure 8: Feed-forward neural network (NN).

The NN architecture, combined with the series of small operations performed by the activation functions, makes the NN a general function approximator.hornik1989multilayer In fact, a NN with a single hidden layer can approximate nearly any continuous, real-valued function given a sufficient number of hidden units.goodfellow2016deep We here illustrate a NN with two hidden layers. Deeper NN architectures are discussed in Sec. V.

NN training is analogous to the methods we have previously discussed (e.g., least squares and SVM models): a loss function is constructed, the gradient of the function is evaluated using the training data, and from the gradient the model parameters are adjusted. A typical loss function $L$ for classification is the cross-entropy.goodfellow2016deep Given the target values (labels) $\mathbf{y}_n$ and input features $\mathbf{x}_n$, the average cross-entropy and weight estimate are given by

$\hat{\mathbf{W}} = \operatorname*{argmin}_{\mathbf{W}} L(\mathbf{W}), \qquad L(\mathbf{W}) = -\dfrac{1}{N}\sum_{n=1}^{N} \mathbf{y}_n^{\top} \ln \hat{\mathbf{y}}_n$    (31)

with $\mathbf{W}$ the matrix of the weights and $\hat{\mathbf{W}}$ its estimate. The gradient of the objective (31), $\nabla_{\mathbf{W}} L$, is obtained via backpropagation.rumelhart1988learning

Backpropagation uses the derivative chain rule to find the gradient of the cost with respect to the weights at each NN layer. With backpropagation, any of the numerous variants of gradient descent can be used to optimize the weights at all layers.

The gradient information from backpropagation is used to find the optimal weights. The simplest weight update is obtained by taking a small step in the direction of the negative gradient

$\mathbf{W} \leftarrow \mathbf{W} - \eta\,\nabla_{\mathbf{W}} L$    (32)

with $\eta$ called the learning rate, which controls the step size. Popular NN training algorithms are stochastic gradient descentgoodfellow2016deep and Adam (adaptive moment estimation).kingma2014adam
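To make backpropagation and the update (32) concrete, the following is a minimal sketch (the XOR task, layer sizes, learning rate, and iteration count are all illustrative assumptions): a one-hidden-layer network trained by plain gradient descent, with gradients computed via the chain rule.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
t = np.array([[0], [1], [1], [0]], float)   # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
eta = 0.5   # learning rate, Eq. (32)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

for step in range(5000):
    h = np.tanh(X @ W1 + b1)            # forward pass
    y = sigmoid(h @ W2 + b2)
    # backpropagation: chain rule from output error back through each layer
    d2 = y - t                          # cross-entropy + sigmoid gradient
    d1 = (d2 @ W2.T) * (1 - h**2)       # gradient through tanh
    W2 -= eta * h.T @ d2 / 4; b2 -= eta * d2.mean(0)
    W1 -= eta * X.T @ d1 / 4; b1 -= eta * d1.mean(0)

print((y > 0.5).astype(int).ravel())    # predictions should approach XOR
```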

IV Unsupervised learning

Unlike supervised learning, where target values or labels $\mathbf{y}$ are given, unsupervised learning deals only with modeling the features $\mathbf{x}$, with the goal of discovering interesting or useful structures in the data. The structures of the data, represented by the data model parameters $\boldsymbol{\theta}$, give probabilistic unsupervised learning models of the form $p(\mathbf{x}|\boldsymbol{\theta})$. This is in contrast to supervised models, which predict the probability of labels or regression values given the data and model: $p(\mathbf{y}|\mathbf{x}, \boldsymbol{\theta})$ (see Sec. III). We note that the distinction between unsupervised and supervised learning methods is not always clear. Generally, a learning problem can be considered unsupervised if there are no annotated examples or prediction targets provided.

The structures discovered in unsupervised learning serve many purposes. The models learned can, for example, indicate how features are grouped or define latent representations of the data such as the subspace or manifold which the data occupies in higher-dimensional space. Unsupervised learning methods for grouping features include clustering algorithms such as K-meansmacqueen1967some and Gaussian mixture models (GMM). Unsupervised methods for discovering latent models include principal components analysis (PCA), matrix factorization methods such as non-negative matrix factorization (NMF),lee2001algorithms ; hoyer2004non independent component analysis (ICA),hyvarinen2001 and dictionary learning.kreutz2003dictionary ; elad2010 ; tosic2011 ; mairal2014 Neural network models, called autoencoders, are also used for learning latent models.goodfellow2016deep Autoencoders can be understood as a non-linear generalization of PCA and, in the case of sparse regularization (see Sec. III), dictionary learning.

The aforementioned models obtained in unsupervised learning have many practical uses. Often, they are used to find the ‘best’ representation of the data for a desired task. A special class of K-means-based techniques, called vector quantization,gersho1991 was developed for lossy compression. In sparse modeling, dictionary learning seeks to learn the ‘best’ sparsifying dictionary of basis functions for a given class of data. In ocean acoustics, PCA (a.k.a. empirical orthogonal functions) has been used to constrain estimates of ocean sound speed profiles (SSPs), though methods based on sparse modeling and dictionary learning have provided alternative representations.bianco2016 ; bianco2017a Recently, dictionary-learning-based methods have been developed for travel time tomography.bianco2018b Aside from compression, such methods can be used for data restoration tasks such as denoising and inpainting. Methods developed for denoising and inpainting can also be extended to inverse problems more generally.

In the following, we illustrate unsupervised ML, highlighting PCA, EM with GMMs, K-means, dictionary learning, and autoencoders.

IV.1 Principal components analysis

For data visualization and compression, we are often interested in finding a subspace of the feature space which contains the most important feature correlations. This can be a subspace which contains the majority of the feature variance. PCA finds such a subspace by learning an orthogonal, linear transformation of the data. The principal components $\mathbf{p}_i$ of the features are obtained as the right eigenvectors of the design matrix $\mathbf{X}$, with

$\mathbf{X}^{\top}\mathbf{X}\,\mathbf{p}_i = \lambda_i \mathbf{p}_i$    (33)

where $\mathbf{p}_i$ are the principal components (eigenvectors) and $\lambda_i$ are the total variances of the data along the principal directions defined by $\mathbf{p}_i$, with $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_D$. This matrix factorization can be obtained using, for example, the singular value decomposition.hastie2009

In the coordinate system defined by the principal components $\mathbf{p}_i$, the first coordinate accounts for the highest portion of the overall variance in the data, and subsequent axes have equal or smaller contributions. Thus, truncating the resulting coordinate space yields a lower-dimensional representation that often captures a large portion of the data variance. This has benefits both for visualization and for modeling, as it can reduce the aforementioned curse of dimensionality (see Sec. II.5). Formally, the projection of the original features onto the principal components is

$\mathbf{z}_n = \mathbf{P}_M^{\top}\mathbf{x}_n$    (34)

with $\mathbf{P}_M$ the matrix of the first $M$ eigenvectors and $\mathbf{z}_n$ the lower-dimensional projection of the data. The features $\mathbf{x}_n$ can be approximated by

$\hat{\mathbf{x}}_n = \mathbf{P}_M\mathbf{z}_n$    (35)

which gives a compressed version of the data, $\hat{\mathbf{x}}_n$, which has less information than the original data (lossy compression).
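A short NumPy sketch of the projection and lossy reconstruction above, using the SVD of the centered design matrix (the synthetic correlated data set is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
# correlated 2-D features: nearly all variance lies along one direction
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.3]])
X -= X.mean(axis=0)                      # center the design matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:1].T                             # first principal component
Z = X @ P                                # low-dimensional projection
X_hat = Z @ P.T                          # lossy reconstruction

explained = s[0]**2 / np.sum(s**2)       # fraction of variance captured
print("variance explained by PC1:", round(explained, 3))
```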

PCA is a simple example of a learned representation that attempts to disentangle the unknown factors of the data variation. The principal components explain the correlation of the features, and form a coordinate system which de-correlates the features. While correlation is an important category of dependencies between features, we are often interested in learning representations that can disentangle more complicated, perhaps correlated, dependencies.

IV.2 Expectation maximization and Gaussian mixture models

Often, we would like to model the dependency between observed features. An efficient way of doing this is to assume that the observed variables are correlated because they are generated by a hidden or latent model. This can be understood as modeling a complicated probability distribution as a combination of simpler distributions, which must be estimated. Such models can be challenging to fit but offer advantages, including a compressed representation of the data. A popular latent modeling technique, Gaussian mixture models (GMMs),mclachlan2000finite models arbitrary probability distributions as a linear superposition of Gaussian densities.

The latent parameters of GMMs (and other mixture models) can be obtained using a non-linear optimization procedure called the expectation-maximization (EM) algorithm.dempster_maximum_1977 EM is an iterative technique which alternates between (1) finding the expected value of the latent factors given data and initialized parameters, and (2) optimizing parameter updates based on the latent factors from (1). We here derive EM in the context of GMMs and later show how it relates to other popular algorithms, like the K-means.macqueen1967some

For features $\mathbf{x}$, the GMM is

$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$    (36)

with $\pi_k$ the weights of the Gaussians in the mixture, and $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$ the mean and covariance of the $k$th Gaussian. The weights define the marginal distribution of a binary random vector $\mathbf{z}$, which gives the membership of data vector $\mathbf{x}$ to the $k$th Gaussian ($z_k \in \{0, 1\}$ and $\sum_k z_k = 1$). The weights are thus related to the marginal probabilities by $\pi_k = p(z_k = 1)$, giving

$p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}$    (37)

The weights must satisfy $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K}\pi_k = 1$ to be valid probabilities.

The conditional distribution of $\mathbf{x}$ given $\mathbf{z}$ is $p(\mathbf{x}\,|\,z_k = 1) = \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, which gives

$p(\mathbf{x}\,|\,\mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)^{z_k}$    (38)

The joint distribution is obtained using the product rule (4) with (37) and (38), and the marginal distribution (36) is obtained by summing the joint over all the states of $\mathbf{z}$ (sum rule, (3)):

$p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\,p(\mathbf{x}\,|\,\mathbf{z}) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$    (39)

with the parameters $\boldsymbol{\theta} = \{\pi_k, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\}_{k=1}^{K}$. Eq. (39) is equivalent to (36). To find the parameters, the log-likelihood $\ln p(\mathbf{x}\,|\,\boldsymbol{\theta})$ is maximized

$\hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\ \ln p(\mathbf{x}\,|\,\boldsymbol{\theta})$    (40)

For multiple observations $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_N]$, (40) becomes

$\hat{\boldsymbol{\theta}} = \operatorname*{argmax}_{\boldsymbol{\theta}}\ \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$    (41)

Eqs. (40) and (41) are challenging to solve because the logarithm cannot be pushed inside the summation over $k$.

In EM, a complete-data log-likelihood

$\ln p(\mathbf{X}, \mathbf{Z}\,|\,\boldsymbol{\theta})$    (42)

is used to define an auxiliary function $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$, which is the expectation of the likelihood evaluated assuming some knowledge of the parameters. The knowledge of the parameters is based on the previous or ‘old’ values $\boldsymbol{\theta}^{\text{old}}$. The EM algorithm is derived using the auxiliary function. For more details, please see Ref. murphy2012 [pp. 350–354]. Helpful discussion is also presented in Ref. bishop2006 [pp. 430–443].

The first step of EM, called the E-step (for expectation), estimates the responsibility $\gamma(z_{nk})$ of the $k$th Gaussian in reconstructing the $n$th data density, given the current parameters $\boldsymbol{\theta}^{\text{old}}$. From Bayes's rule, the E-step is

$\gamma(z_{nk}) = \dfrac{\pi_k\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\mathbf{x}_n\,|\,\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$    (43)

The second step of EM, called the M-step, updates the parameters by maximizing the auxiliary function $Q$, with the responsibilities from the E-step (43).bishop2006 ; ng2000cs229 The M-step estimates of $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$, and $\pi_k$ are

$\boldsymbol{\mu}_k^{\text{new}} = \dfrac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,\mathbf{x}_n$    (44)
$\boldsymbol{\Sigma}_k^{\text{new}} = \dfrac{1}{N_k}\sum_{n=1}^{N} \gamma(z_{nk})\,(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})^{\top}$    (45)
$\pi_k^{\text{new}} = \dfrac{N_k}{N}$    (46)

with $N_k = \sum_{n=1}^{N}\gamma(z_{nk})$ the weighted number of points in cluster $k$. The EM algorithm is run until an acceptable error has been obtained.
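The alternation between the E-step (43) and M-step (44)–(46) can be sketched for a two-component 1-D GMM (the synthetic data, initialization, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 300)])
N, K = len(x), 2

pi = np.full(K, 1 / K)                  # mixture weights
mu = np.array([-1.0, 1.0])              # initial means
var = np.ones(K)                        # initial variances

def gauss(x, m, v):
    return np.exp(-(x - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for _ in range(50):
    # E-step: responsibilities, Eq. (43)
    r = pi * gauss(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances, Eqs. (44)-(46)
    Nk = r.sum(axis=0)
    pi = Nk / N
    mu = (r * x[:, None]).sum(axis=0) / Nk
    var = (r * (x[:, None] - mu)**2).sum(axis=0) / Nk

print(np.sort(mu))   # the estimated means approach the true -2 and 3
```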

IV.3 K-means

The K-means algorithmmacqueen1967some is a method for discovering clusters of features in unlabeled data. The goal can be to estimate the number of clusters, or data compression (e.g., vector quantizationgersho1991 ). Like EM, K-means solves (41), except that, unlike EM, the weights $\pi_k$ and covariances $\boldsymbol{\Sigma}_k$ are fixed. Rather than the responsibility $\gamma(z_{nk})$ describing the posterior distribution of $z_{nk}$ (per (43)), in K-means the membership $r_{nk}$ is a ‘hard’ assignment (obtained in the limit of vanishing covariance; please see Ref. bishop2006 for more details):

$r_{nk} = \begin{cases} 1 & \text{if } k = \operatorname*{argmin}_{j} \|\mathbf{x}_n - \boldsymbol{\mu}_j\|_2^2 \\ 0 & \text{otherwise} \end{cases}$    (47)

Thus in K-means, each feature vector $\mathbf{x}_n$ is assigned to the nearest centroid $\boldsymbol{\mu}_k$. The distance measure is the Euclidean distance (defined by the $\ell_2$-norm in (47)). Based on the centroid membership of the features, the centroids are updated using the mean of the feature vectors in the cluster

$\boldsymbol{\mu}_k = \dfrac{\sum_{n} r_{nk}\,\mathbf{x}_n}{\sum_{n} r_{nk}}$    (48)

Sometimes the variances are also calculated. Thus, K-means is a two-step iterative algorithm which alternates between categorizing the features and updating the centroids. Like EM, K-means must be initialized, which can be done with random initial assignments. The number of clusters can be estimated using, for example, the gap statistic.hastie2009
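The two-step iteration (47)–(48) can be sketched in NumPy (the three-blob data and the deterministic initialization, one point from each blob, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 points each around three well-separated centers
X = np.vstack([rng.normal(c, 0.4, size=(100, 2))
               for c in [(-2, 0), (2, 0), (0, 3)]])
K = 3
mu = X[[0, 100, 200]].copy()   # one initial centroid from each blob

for _ in range(20):
    # assignment step: each point joins its nearest centroid, Eq. (47)
    d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
    z = d.argmin(axis=1)
    # update step: centroids move to the mean of their members, Eq. (48)
    mu = np.array([X[z == k].mean(axis=0) for k in range(K)])

print(np.round(mu[np.argsort(mu[:, 0])], 1))
```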

IV.4 Dictionary learning

In this section we introduce dictionary learning and discuss one classic dictionary learning method: the K-SVD algorithm.aharon2006 An important task in sparse modeling (see Sec. III) is obtaining a dictionary which can well model a given class of signals. There are a number of methods for dictionary design, which can be divided roughly into two classes: analytic and synthetic. Analytic dictionaries have columns, called atoms, which are derived from analytic functions such as wavelets or the discrete cosine transform (DCT).mallat1999 ; elad2010 Such dictionaries have useful properties, which allow them to obtain acceptable sparse representation performance for a broad range of data. However, if enough training examples of a specific class of data are available, a dictionary can be synthesized or learned directly from the data. Learned dictionaries, which are designed from specific instances of data using dictionary learning algorithms, often achieve greater reconstruction accuracy over analytic, generic dictionaries. Many dictionary learning algorithms are available.mairal2014

As discussed in Sec. III, sparse modeling assumes that a few (sparse) atoms from a dictionary $\mathbf{D}$ can adequately construct a given feature $\mathbf{x}$. With coefficients $\boldsymbol{\alpha}$, this is articulated as $\mathbf{x} \approx \mathbf{D}\boldsymbol{\alpha}$. The coefficients can be solved by

$\hat{\boldsymbol{\alpha}} = \operatorname*{argmin}_{\boldsymbol{\alpha}}\ \|\mathbf{x} - \mathbf{D}\boldsymbol{\alpha}\|_2^2 \quad \text{subject to} \quad \|\boldsymbol{\alpha}\|_0 \le T$    (49)

with $T$ the maximum number of non-zero coefficients. The penalty $\|\boldsymbol{\alpha}\|_0$ is the $\ell_0$-pseudo-norm, which counts the number of non-zero coefficients. Since least squares minimization with an $\ell_0$-norm penalty is non-convex (combinatorial), solving (49) exactly is often impractical. However, many fast approximate solution methods exist, including orthogonal matching pursuit (OMP)pati93 and sparse Bayesian learning (SBL).wipf2004
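A hedged sketch of OMP for (49) (the dictionary size and the synthetic 3-sparse signal are illustrative assumptions): greedily select the atom most correlated with the residual, then refit the selected atoms by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 50))
D /= np.linalg.norm(D, axis=0)           # unit-norm dictionary atoms
x_true = np.zeros(50)
x_true[[5, 17, 33]] = [1.5, -2.0, 1.0]   # 3-sparse ground truth
y = D @ x_true                           # signal built from 3 atoms

def omp(D, y, T):
    r, support = y.copy(), []
    for _ in range(T):
        support.append(int(np.argmax(np.abs(D.T @ r))))  # best-matching atom
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        r = y - D[:, support] @ coef                     # residual update
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

x_hat = omp(D, y, T=3)
print(np.nonzero(x_hat)[0])
```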

Eq. (49) can be modified to also solve for the dictionary $\mathbf{D}$:elad2010

$\{\hat{\mathbf{D}}, \hat{\mathbf{A}}\} = \operatorname*{argmin}_{\mathbf{D},\,\mathbf{A}}\ \|\mathbf{Y} - \mathbf{D}\mathbf{A}\|_F^2 \quad \text{subject to} \quad \|\boldsymbol{\alpha}_n\|_0 \le T \ \ \forall n$    (50)

with $\mathbf{Y} = [\mathbf{y}_1, \dots, \mathbf{y}_N]$ the training examples and $\mathbf{A} = [\boldsymbol{\alpha}_1, \dots, \boldsymbol{\alpha}_N]$ the coefficients for all examples. Eq. (50) is a bi-linear optimization problem for which no general practical algorithm exists.elad2010 However, it can be solved well using methods related to K-means. Clustering-based dictionary learning methodsmairal2014 are based on the alternating optimization concept introduced in K-means and EM. The operations of a dictionary learning algorithm are thus formulated as

  1. Sparse coding: Given dictionary $\mathbf{D}$, solve for the sparse coefficients $\mathbf{A}$ corresponding to the examples $\mathbf{Y}$.

  2. Dictionary update: Given coefficients $\mathbf{A}$, solve for the dictionary $\mathbf{D}$ which minimizes the reconstruction error for $\mathbf{Y}$.

This assumes an initial dictionary (the columns of which can be Gaussian noise). Sparse coding can be accomplished by OMP or other greedy methods. The dictionary update stage can be approached in a number of ways. We next briefly describe the classic K-SVD dictionary learning algorithmaharon2006 ; elad2010 to illustrate basic dictionary learning concepts. Like K-means, K-SVD learns $K$ latent prototypes of the data (in dictionary learning these are called atoms, whereas in K-means they are called centroids), but instead of learning them as the means of the data ‘clusters’, they are found using the SVD, since more than one atom may be used per data point.

In the K-SVD algorithm, dictionary atoms are learned based on the SVD of the reconstruction error caused by excluding the atoms from the sparse reconstruction. Expressing the dictionary coefficients as row vectors $\boldsymbol{\alpha}_T^j$ and $\boldsymbol{\alpha}_T^k$, which relate all examples $\mathbf{Y}$ to the atoms $\mathbf{d}_j$ and $\mathbf{d}_k$, respectively, the Frobenius-norm penalty from (50) is rewritten as

$\|\mathbf{Y} - \mathbf{D}\mathbf{A}\|_F^2 = \Big\|\mathbf{Y} - \sum_{j \ne k}\mathbf{d}_j\boldsymbol{\alpha}_T^j - \mathbf{d}_k\boldsymbol{\alpha}_T^k\Big\|_F^2 = \|\mathbf{E}_k - \mathbf{d}_k\boldsymbol{\alpha}_T^k\|_F^2$    (51)

where

$\mathbf{E}_k = \mathbf{Y} - \sum_{j \ne k}\mathbf{d}_j\boldsymbol{\alpha}_T^j$    (52)

and $\|\cdot\|_F$ is the Frobenius norm. Thus, in (51), the penalty is separated into an error term $\mathbf{E}_k$, which is the error for all examples if $\mathbf{d}_k$ is excluded from their reconstruction, and the product of the excluded atom $\mathbf{d}_k$ and coefficients $\boldsymbol{\alpha}_T^k$.

An update to the dictionary entry $\mathbf{d}_k$ and coefficients $\boldsymbol{\alpha}_T^k$ which minimizes (51) is found by taking the SVD of $\mathbf{E}_k$. However, many of the entries in $\boldsymbol{\alpha}_T^k$ are zero (corresponding to examples which do not use $\mathbf{d}_k$). To properly update $\mathbf{d}_k$ and $\boldsymbol{\alpha}_T^k$ with the SVD, (51) must be restricted to the examples which use $\mathbf{d}_k$:

$\|\mathbf{E}_k^R - \mathbf{d}_k\boldsymbol{\alpha}_R^k\|_F^2$    (53)

where $\mathbf{E}_k^R$ and $\boldsymbol{\alpha}_R^k$ are the entries in $\mathbf{E}_k$ and $\boldsymbol{\alpha}_T^k$, respectively, corresponding to examples which use $\mathbf{d}_k$, defined by the index set

$\omega_k = \{\,n : \alpha_T^k(n) \ne 0\,\}$    (54)

Thus for each K-SVD iteration, the dictionary entries and coefficients are sequentially updated using the SVD of $\mathbf{E}_k^R = \mathbf{U}\boldsymbol{\Delta}\mathbf{V}^{\top}$. The dictionary entry $\mathbf{d}_k$ is updated with the first column of $\mathbf{U}$, and the coefficient vector $\boldsymbol{\alpha}_R^k$ is updated as the product of the first singular value $\Delta_{11}$ with the first column of $\mathbf{V}$.
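A single atom update of this kind can be sketched as follows (all sizes and the random sparse codes are illustrative assumptions; a full K-SVD iteration would loop this over all atoms after a sparse coding step):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 30))            # training examples (columns)
D = rng.normal(size=(10, 5))
D /= np.linalg.norm(D, axis=0)           # unit-norm atoms
X = rng.normal(size=(5, 30)) * (rng.random((5, 30)) < 0.3)  # sparse codes

k = 2                                     # atom to update
omega = np.nonzero(X[k])[0]               # examples that actually use atom k
# residual with atom k removed, restricted to those examples
E_k = Y[:, omega] - D @ X[:, omega] + np.outer(D[:, k], X[k, omega])

U, s, Vt = np.linalg.svd(E_k, full_matrices=False)
D[:, k] = U[:, 0]                         # new atom: first left singular vector
X[k, omega] = s[0] * Vt[0]                # new coefficients for those examples

print(round(np.linalg.norm(D[:, k]), 6))  # prints 1.0: atom stays unit norm
```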

For the case when $T = 1$, K-SVD reduces to the K-means-based model called gain-shape vector quantization.gersho1991 ; elad2010 When $T = 1$, the penalty in (50) is minimized by the dictionary entry that has the largest inner product with the example.elad2010 Thus for $T = 1$, the atoms define radial partitions of the feature space. These partitions are shown in Fig. 9(b) for a hypothetical 2D random data set.

Other clustering-based dictionary learning methods are the method of optimal directionsengan2000 and the iterative thresholding and signed K-means algorithm.schnass2015 Alternative methods include online dictionary learning.mairal2009

Figure 9: (Color online) Partitioning of a Gaussian random distribution using (a) K-means with 5 centroids and (b) K-SVD dictionary learning with $T = 1$ and 5 atoms. In K-means, the centroids define Voronoi cells which divide the space based on Euclidean distance. In K-SVD, for $T = 1$, the atoms define radial partitions based on the inner product of the data vector with the atoms. Reproduced from Ref. bianco2017a.

IV.5 Autoencoder networks

Autoencoder networks are a special case of NNs (Sec. III), in which the desired output is an approximation of the input. Because they are designed to only approximate their input, autoencoders prioritize which aspects of the input should be copied. This allows them to learn useful properties of the data. Autoencoder NNs are used for dimensionality reduction and feature learning, and are a critical component of modern generative modeling.goodfellow2016deep They can also be used as a pretraining step for DNNs (see Sec. V.2). They can be viewed as a non-linear generalization of PCA and dictionary learning. Because of the non-linear encoder and decoder functions, autoencoders potentially learn more powerful feature representations than PCA or dictionary learning.

Like feed-forward NNs (Sec. III.3), activation functions are used on the outputs of the hidden layers (Fig. 8). In the case of an autoencoder with a single hidden layer, the hidden layer produces a code from the input features, and the output layer reconstructs the input from the code (see Fig. 8). The first half of the NN, which maps the inputs to the hidden units, is called the encoder. The second half, which maps the output of the hidden units to the output layer (with the same dimension as the input features), is called the decoder. The features learned in this single-layer network are the weights of the first layer.

If the code dimension is less than the input dimension, the autoencoder is called undercomplete. In having the code dimension less than the input, undercomplete networks are well suited to extract salient features since the representation of the inputs is ‘compressed’, like in PCA. However, if too much capacity is permitted in the encoder or decoder, undercomplete autoencoders will still fail to learn useful features.goodfellow2016deep

Depending on the task, a code dimension equal to or greater than the input dimension may be desirable. Autoencoders with code dimension greater than the input dimension are called overcomplete, and these codes exhibit redundancy similar to overcomplete dictionaries and CNNs. This can be useful, for example, for learning shift-invariant features. However, without regularization, such autoencoder architectures will fail to learn useful features. Sparsity regularization, similar to dictionary learning, can be used to train overcomplete autoencoder networks.goodfellow2016deep For more details and deeper discussion, please see Sec. V.
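As a concrete sketch (the data set, sizes, learning rate, and the choice of linear encoder/decoder are all illustrative assumptions), a minimal undercomplete autoencoder trained by gradient descent on its own reconstruction error; with linear activations it recovers a PCA-like subspace:

```python
import numpy as np

rng = np.random.default_rng(0)
# nearly 1-D data embedded in 2-D
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [0.5, 0.1]])

W_enc = rng.normal(size=(2, 1)) * 0.1   # encoder: 2 -> 1 (the 'code')
W_dec = rng.normal(size=(1, 2)) * 0.1   # decoder: 1 -> 2
eta = 0.01

for _ in range(2000):
    Z = X @ W_enc                        # encode
    X_hat = Z @ W_dec                    # decode (reconstruction)
    E = X_hat - X                        # reconstruction error
    W_dec -= eta * Z.T @ E / len(X)      # gradient descent on MSE
    W_enc -= eta * X.T @ (E @ W_dec.T) / len(X)

mse = np.mean((X - X_hat)**2)
print("reconstruction MSE:", round(mse, 4))
```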

V Deep learning

Deep learning (DL) refers to ML techniques that are based on a cascade of non-linear feature transforms trained during a learning step.deng2014deep In several fields of science, decades of research and engineering have led to elegant ways to model data. Nevertheless, the DL community argues that these models are too simplistic to capture the subtleties of the phenomena underlying the data. Often it is beneficial to learn the representation directly from a large collection of examples. Yet DL leverages a fundamental concept shared by many successful handcrafted features. Representations such as Mel frequency cepstrumstevens1937scale used in speech processing, or multi-scale waveletsmallat1989theory and SIFTlowe1999object used in image processing, all analyze the data by applying filter banks at different scales. DL mimics this by learning a cascade of features capturing information at different levels of abstraction. Non-linearities between these features allow deep NNs to learn complicated manifolds. Findings in neuroscience also suggest that mammal brains process information in a non-linear hierarchical way.hubel1962receptive

In short, a NN-based ML pipeline is considered DL if it satisfiesdeng2014deep : (i) features are not handcrafted but learned, (ii) features are organized in a hierarchical manner from low- to high-level abstraction, (iii) there are at least two layers of non-linear feature transformations. As an example, applying DL on a large corpus of texts must uncover meanings behind words, sentences and paragraphs (low-level) to further extract concepts such as lexical field, genre, and writing style (high-level). Likewise, DL has been used in acoustics.chakrabarty2017broadband ; ernst2018speech ; adavanne2019sound ; hershey2016deep ; mesaros2017dcase ; wang2018supervised

To comprehend DL, it is useful to look at what it is not. MLPs with one hidden layer (aka shallow NNs) are not deep, as they only learn one level of feature extraction. Similarly, non-linear SVMs are analogous to shallow NNs. Multi-scale wavelet representationsmallat2016understanding are a hierarchy of features (sub-bands), but the relationships between features are linear. When a classifier is used after transforming the data into a handcrafted representation, the architecture becomes deep, but it is not DL, as the first transformation is not learned.

Most DL architectures are deep NNs, such as MLPs, and trace back to the 1970s-80s. Nevertheless, over three decades, only a few deep architectures emerged, and these were limited to processing data of no more than a few hundred dimensions. Successful examples are the two handwritten digit classifiers: the Neocognitronfukushima1980neocognitron and LeNet5.lecun1998 Yet the success of DL started at the end of the 2000s with what is called the third wave of artificial NNs. This success is attributed to the large increase in available data and computation power, including parallel architectures and GPUs. Several open-source DL toolboxes(Torch, ; Tensorflow, ; chollet2015keras, ; vedaldi2015matconvnet, ) have also helped the community introduce a multitude of new strategies. These aim to overcome the limitations of back-propagation: its slowness and its tendency to get trapped in poor stationary points (local optima or saddle points). The following subsections describe some of these strategies; see Goodfellow et al., 2016goodfellow2016deep for an exhaustive review.

Figure 10: (Color Online) Illustration of the vanishing and exploding gradient problems. (a) The sigmoid and ReLU activation functions. (b) The loss as a function of the network weights when using sigmoid activation functions, shown as a ‘landscape’. Typical cost landscapes are hilly, with large plateaus delimited by cliffs. When estimates of the gradient are very small (vanishing), the network learns very slowly. In the case of exploding gradients, the model updates can overshoot optimal parameters.

V.1 Activation Functions and Rectifiers

The earliest multi-layer NNs used logistic sigmoids (Sec. III-c) or hyperbolic tangents for the non-linear activation function $g$:

$g(z) = \dfrac{1}{1 + e^{-z}} \quad \text{or} \quad g(z) = \tanh(z)$    (55)

where $z$ is an element of the vector of potentials at a given layer (the affine combination of the features from the previous layer), and $g$ applied elementwise gives the vector of features at that layer. For the sigmoid activation function in Fig. 10(a), the derivative is significantly non-zero only for $z$ near 0. With such functions, in a randomly initialized NN, half of the hidden units are expected to activate ($g(z) > 1/2$) for a given training example, but only a few will influence the gradient. In fact, many hidden units will have near-zero gradients for all training samples, and the parameters feeding those units will be updated slowly. This is called the vanishing gradient problem. A naïve repair is to increase the learning rate. However, parameter updates then become too large when the potential approaches 0, and the overall training procedure can become unstable: this is the exploding gradient problem. Fig. 10(b) illustrates these two problems. Shallow NNs are not much impacted by them, but they become very harmful in deep NNs. Back-propagation with such activation functions in deep NNs is slow, unstable, and leads to poor solutions.
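A back-of-the-envelope illustration of vanishing gradients (the depth of 20 layers is an illustrative assumption): the sigmoid derivative is at most 1/4, so a product of sigmoid derivatives across layers shrinks geometrically, while active ReLU units pass gradients through with slope 1.

```python
import numpy as np

def dsigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

depth = 20
# best case for sigmoid: every potential sits at z = 0, derivative 0.25
grad_sigmoid = np.prod([dsigmoid(0.0)] * depth)   # 0.25 ** depth
grad_relu = 1.0 ** depth                          # active ReLU path: slope 1
print(grad_sigmoid, grad_relu)
```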

Figure 11: (Color Online) The three steps for learning a deep classifier based on stacked autoencoders: (a) learn one shallow auto-encoder for each feature, (b) stack the shallow auto-encoders, (c) replace the decoder by a shallow classifier.

Rectifiers are activation functions that are nearly zero on the negative side and nearly linear on the positive side. The most popular is the rectified linear unit (ReLU)nair2010rectified defined as (see Fig. 10):

$g(z) = \max(0, z)$    (56)

While the derivative is zero for negative potentials $z < 0$, the derivative is one for $z > 0$ (though non-differentiable at 0, ReLU is continuous, and back-propagation then amounts to a sub-gradient descent). Thus, in a randomly initialized NN, half of the hidden units fire and the same half influence the gradient. Most units get significant gradients from at least half of the training samples, and all parameters in the NN are expected to be updated comparably at each epoch (the initial weights must be zero-mean with a variance that preserves the range of variation of the potentials across all NN layersglorot2010understanding ; he2015delving ). In practice, the use of rectifiers leads to a tremendous improvement in convergence. Regarding exploding gradients, an efficient solution called gradient clippingpascanu2012understanding simply consists of thresholding the gradient.
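One common form of such thresholding is clipping by global norm (the threshold value below is an illustrative assumption): if the combined gradient norm exceeds the threshold, all gradients are rescaled proportionally.

```python
import numpy as np

def clip_gradient(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their global norm <= max_norm."""
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / total)
    return [g * scale for g in grads]

g = [np.full(10, 3.0), np.full(5, -4.0)]   # global norm = sqrt(170) ~ 13
clipped = clip_gradient(g)
print(round(float(np.sqrt(sum(np.sum(c**2) for c in clipped))), 3))
```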

Figure 12: The first layer of a traditional CNN. For this illustration we chose a first hidden layer extracting 3 feature maps. The filters have the size .

V.2 Unsupervised Pretraining

Avoiding the gradient vanishing and exploding problems is not enough for back-propagation to avoid poor stationary points in deep NNs. A pioneering alternative, unsupervised pretraining, consists of learning a deep NN by successively training shallow architectures. This is achieved by deep belief NNs based on restricted Boltzmann machineshinton2006fast or stacked auto-encoders.bengio2007greedy Stacked auto-encoders are trained in a greedy unsupervised fashion. First, a shallow auto-encoder is trained to reproduce all training data, and the learned encoder is used to extract all of their features. Next, a shallow auto-encoder is trained to reproduce all of these features, hence yielding a second layer of feature extraction. The process can be repeated several times, but at each step it only involves training a shallow network. Afterwards, a deep auto-encoder is designed by stacking all shallow encoders and decoders, see Fig. 11. The deep encoder can finally be used as a feature extractor for a supervised learning task, for instance, by replacing the decoder with a shallow classifier. Today, deep NNs are trained end-to-end from scratch, but unsupervised pretraining was the first method to succeed on high-dimensional data.

V.3 End-to-End Training

Unlike unsupervised pretraining approaches, modern DL approaches train deep networks end-to-end. They rely on variants of gradient descent that aim to avoid poor stationary solutions. These approaches include stochastic gradient descent, mini-batch gradient descent, adaptive learning rates,duchi2011adaptive and momentum techniques.sutskever2013importance Among these concepts, two main notions emerged: (i) annealing, by randomly exploring configurations first and exploiting them next, and (ii) momentum, by combining gradient and velocity. Adam,kingma2014adam based on adaptive moment estimation, is currently the most popular optimization approach.

Figure 13: (Color Online) Deep CNN architecture for classifying an image into a thousand classes. Convolution layers create redundant information by increasing the number of channels in the tensors. ReLU is used to capture non-linearity in the data. Max-pooling operations reduce the spatial dimension to gain abstraction and robustness with regard to the exact location of objects. When the tensor becomes flat (i.e., the spatial dimension is reduced to a single element), each coefficient serves as input to a fully connected NN-based classifier. The feature dimensions, filter sizes, and number of output classes are only for illustration.

Gradient descent methods can fall into the closest local minimum, which leads to underfitting. On the contrary, stochastic gradient descent and its variants are expected to find solutions with lower loss and are more prone to overfitting. Overfitting occurs when learning a model with many degrees of freedom relative to the number of training samples. The curse of dimensionality (Sec. II.5) implies that, without assumptions on the data, the amount of training data should grow exponentially with the number of free parameters. In classical NNs, an output feature is influenced by all input features: each layer is fully connected (FC). Given an input of size N and a feature vector of size M, an FC layer is composed of N × M weights (see Sec. III-c). Since the signal size N can be large, FC NNs are prone to overfitting. Thus, special care should be taken in initializing the weights,glorot2010understanding ; he2015delving and specific strategies must be employed to provide some regularization, such as dropoutsrivastava2014dropout and batch-normalization.ioffe2015batch
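Dropout itself is simple to implement. A minimal sketch of the common "inverted dropout" variant (the drop probability and array sizes are illustrative): during training, each unit is zeroed with probability p_drop and survivors are rescaled, so that at test time the layer is simply the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, train=True):
    """Inverted dropout: randomly zero units during training and rescale the
    survivors so the expected activation matches test time (a no-op there)."""
    if not train:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = np.ones((4, 6))
h_train = dropout(h, p_drop=0.5)      # some units zeroed, survivors scaled to 2.0
h_test = dropout(h, train=False)      # unchanged
print(h_train)
print(h_test)
```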

v.4 Convolutional Neural Networks

Convolutional NNs (CNNs)fukushima1980neocognitron ; lecun1998 are an alternative to conventional, fully connected NNs for temporally or spatially correlated signals. They dramatically limit the number of model parameters and memory requirements by relying on two main concepts: local receptive fields and shared weights. In conventional NNs, for a given layer, every output interacts with every input. This results in an excessive number of weights for large input dimensions (the number of weights is N × M). In CNNs, each output unit is connected only to a subset of the inputs corresponding to a given filter position - the local receptive field. This reduces the number of multiplication operations on the forward pass for a single filter to roughly M × k, where the filter support k is typically much smaller than N and M. Further, for a given filter, the same weights are used for all receptive fields, so the number of parameters per filter is reduced from N × M to k.
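The savings are easy to quantify. A short sketch of the bookkeeping, with hypothetical layer sizes chosen only for illustration:

```python
# Hypothetical layer sizes, chosen only to illustrate the parameter counts.
n_in, n_out = 10_000, 10_000    # input/output feature size per channel
c_in, c_out = 16, 32            # input/output channels
k = 3                           # filter support

fc_weights = n_in * n_out                  # fully connected: every output sees every input
conv_weights = c_in * c_out * k + c_out    # 1-D conv: k weights per filter pair, plus biases

print(f"fully connected: {fc_weights:,} weights")
print(f"convolutional : {conv_weights:,} weights")
```

The convolutional count depends only on the number of channels and the filter support, not on the signal length.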

Weight sharing in CNNs gives another important property called shift invariance. Since, for a given filter, the weights are the same for all receptive fields, the filter must model signal content well even when it is shifted in space or time. The response to the same stimulus is unchanged whenever the stimulus occurs within overlapping receptive fields. Experiments in neuroscience reveal such behavior (denoted self-similar receptive fields) in simple cells of the mammalian visual cortex.hubel1962receptive This principle leads CNNs to apply convolution layers with linear filter banks to their inputs.

Fig. 12 provides an illustration of one convolution layer. The convolution layer applies three filters to an input signal to produce three feature maps. Denoting the i-th input feature map at layer l as h_i^l and the j-th output feature map at layer l+1 as h_j^{l+1}, a convolution layer at layer l produces C_{l+1} new feature maps from the C_l input feature maps as follows

h_j^{l+1} = Σ_{i=1}^{C_l} w_{i,j}^l * h_i^l + b_j^l,   (57)

where * is the discrete convolution, w_{i,j}^l are learned linear filters, b_j^l are learned scalar biases, j is an output channel index, and i an input channel index. Stacking all feature maps together, the set of hidden features is represented as a tensor where each channel corresponds to a given feature map.

For example, a spectrogram is represented by a tensor whose temporal dimension is the signal length and whose number of channels is the number of frequency sub-bands. Convolution layers preserve the spatial or temporal resolution of the input tensor but usually increase the number of channels: C_{l+1} ≥ C_l. This produces a redundant representation which allows for sparsity in the feature tensor. Only a few units should fire for a given stimulus, a concept that has also been influenced by vision research experiments.olshausen1997 Using tensors is common practice, allowing CNN architectures to be represented in a condensed way, see Fig. 13.
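A convolution layer in the spirit of Eq. (57) can be sketched in a few lines. This is a 1-D, loop-based illustration with a ReLU appended (the non-linearity shown in Fig. 13); all shapes are illustrative, and a real implementation would use an optimized library:

```python
import numpy as np

def conv_layer(h, w, b):
    """One 1-D convolution layer in the spirit of Eq. (57).
    h: (C_in, T) input feature maps, w: (C_in, C_out, k) filters,
    b: (C_out,) biases. Returns (C_out, T - k + 1) after a ReLU."""
    c_in, c_out, k = w.shape
    T = h.shape[1]
    out = np.zeros((c_out, T - k + 1))
    for j in range(c_out):                  # each output channel...
        for i in range(c_in):               # ...sums convolutions over input channels
            out[j] += np.convolve(h[i], w[i, j], mode="valid")
        out[j] += b[j]
    return np.maximum(out, 0.0)             # ReLU non-linearity

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 20))                # 2 input feature maps of length 20
w = rng.normal(size=(2, 3, 5))              # 3 output channels, filters of support 5
b = np.zeros(3)
print(conv_layer(h, w, b).shape)            # (3, 16)
```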

Local receptive fields impose that an output feature is influenced by only a small temporal or spatial region of the input feature tensor. This implies that each convolution is restricted to a small sliding centered kernel window of odd size k; for example, 3 × 3 filters are common practice for images. The number of parameters to learn for that layer is then proportional to C_l × C_{l+1} × k and is independent of the input signal size N. In practice, k and the numbers of channels are chosen small enough to be robust against overfitting; typically C_l and C_{l+1} are less than a few hundred. A byproduct is that processing becomes much faster for both learning and testing.

Applying L convolution layers of support size k increases the region of influence (called the effective receptive field) to a window of size L(k - 1) + 1. With only convolution layers, such an architecture must be very deep to capture long-range dependencies. For instance, using filters of size k = 3, an L-layer architecture processes inputs in sliding windows of only size 2L + 1.

To capture long-range dependencies, CNNs introduce a third concept: pooling. While convolution layers preserve the spatial or temporal resolution, pooling preserves the number of channels but reduces the signal resolution. Pooling is applied independently to each feature map as

h_j^{l+1} = pool(h_j^l),   (58)

such that h_j^{l+1} has a smaller resolution than h_j^l. Max-pooling of size 2 is commonly employed, replacing two successive values by their maximum in all directions. By alternating convolution and pooling layers, the effective receptive field grows exponentially with depth rather than linearly, so even a moderately deep architecture can capture long-range dependencies.
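Max-pooling of size 2 along one dimension can be sketched as follows (the array sizes are illustrative):

```python
import numpy as np

def max_pool(h, size=2):
    """Max-pooling of a (C, T) tensor along time: keep the number of
    channels, reduce the resolution by replacing each group of `size`
    successive values by their maximum."""
    C, T = h.shape
    T2 = T - T % size                 # drop trailing samples if T is not divisible
    return h[:, :T2].reshape(C, T2 // size, size).max(axis=2)

h = np.array([[1.0, 3.0, 2.0, 5.0],
              [4.0, 0.0, 6.0, 1.0]])
print(max_pool(h))   # [[3. 5.]
                     #  [4. 6.]]
```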

Pooling is grounded in neuroscientific findings regarding complex cells in the mammalian visual cortex.hubel1962receptive These cells condense information to gain invariance and robustness against small distortions of the same stimulus. Deeper tensors become more elongated, with more channels and smaller signal resolution. Hence, the deeper the architecture, the more robust the network becomes with respect to exact locations. Eventually the tensor becomes flat, meaning that it is reduced to a vector. Features in that tensor are no longer temporally or spatially related, and they can serve as input feature vectors for a classifier. The output tensor is not always exactly flat, but it is then mapped into a vector. In general, an MLP with two hidden FC layers is employed, and the architecture is trained end-to-end by backpropagation or its variants, see Fig. 13.

This type of architecture is typical of modern image classification NNs such as AlexNetkrizhevsky2012imagenet and ZFNet,zeiler2014visualizing but was already employed in the Neocognitronfukushima1980neocognitron and LeNet5.lecun1998 The main difference is that modern architectures can deal with data of much higher dimension, as they employ some of the aforementioned strategies such as rectifiers, Adam, dropout, and batch-normalization. A trend in DL is to make such CNNs as deep as possible with the fewest parameters by employing specific architectures such as inception modules,szegedy2015going depth-wise separable convolutions,sifre2014rigid skip connections,he2016deep and dense architectures.huang2017densely

Since 2012, such architectures have led to state-of-the-art classification in computer vision,krizhevsky2012imagenet even rivaling human performance on the ImageNet challenge.he2015delving Regarding acoustic applications, this architecture has been employed for broadband DOA estimation,chakrabarty2017broadband where each class corresponds to a given time frame.

v.5 Transfer Learning

Training a deep classifier from scratch requires a large labeled dataset. In many applications, such datasets are not available. An alternative is transfer learning.pratt1993discriminability Transfer learning reuses parts of a network that were trained on a large, potentially unrelated dataset in order to solve another ML task. The key idea is that the early stages of a deep network learn generic features that may be applicable to other tasks. Once a network has learned such a task, the feed-forward layers at the end of the network, which are tailored exclusively to the original task, can often be removed. They are then replaced with new classification or regression layers, and the learning process finds the appropriate weights of these final layers on the new task. If the earlier representation captured information relevant to the new task, the new layers can be learned with a much smaller dataset. Eventually, after the classifier has been trained, all layers can be slightly adjusted by performing a few backpropagation steps end-to-end (referred to as fine-tuning). Many modern DL techniques rely on this principle.
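The idea can be sketched with a toy stand-in: a frozen random projection plays the role of the reused pretrained layers, and only a new logistic-regression head is trained on the target task. Everything here (W_frozen, the data, the learning rate) is a hypothetical illustration, not a real pretrained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained feature extractor: a frozen random projection
# plays the role of the reused early layers.
W_frozen = rng.normal(size=(20, 8))
def features(x):
    return np.tanh(x @ W_frozen)          # frozen: never updated below

# Toy binary task whose labels depend on the extracted features.
X = rng.normal(size=(300, 20))
y = (features(X) @ rng.normal(size=8) > 0).astype(float)

# Train only the new classification head (logistic regression) on top.
F = features(X)                           # computed once; the extractor is frozen
w = np.zeros(8)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-F @ w))      # sigmoid predictions
    w -= 0.5 * F.T @ (p - y) / len(y)     # gradient step on the log-loss
acc = ((F @ w > 0) == (y > 0.5)).mean()
print(f"training accuracy: {acc:.2f}")
```

Fine-tuning would correspond to subsequently unfreezing W_frozen and taking a few small gradient steps on all weights.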

v.6 Specialized Architectures

Beyond classification, there exist myriad NN and CNN architectures. Fully convolutional and U-shaped architectures are typical choices for regression problems such as signal enhancement,zhang2017beyond segmentation,ronneberger2015u or object localization.dai2016r Recurrent NNsrumelhart1988learning (RNNs) are an alternative to classical feed-forward NNs for processing or producing sequences of variable length. In particular, long short-term memory networkshochreiter1997long (LSTMs) are a specific type of RNN that have produced remarkable results in several applications such as speech processing and natural language processing. Recently, NNs have gained much attention in unsupervised learning tasks. One key example is data generation with generative adversarial networksgoodfellow2014generative (GANs). The latter rely on an original idea grounded in game theory: a two-player game between a generative network and a discriminative one. The generator produces fake data from random seeds, while the discriminator aims to distinguish the fake data from the training data. The two NNs compete against each other: the generator tries to fool the discriminator so that the fake data cannot be distinguished from the training set.

v.7 Applications in Acoustics

DL has yielded promising advances in acoustics. Data-driven DL approaches can provide results competitive with conventional or hand-engineered methods in their respective fields. A challenge across all fields is the amount of available training data. To train deep NNs for, e.g., audio processing tasks, hours of training data may be required.vincent2018audio Since large amounts of training data might not be available, DL is not always practical, though scarcity of training data can be partly addressed by using synthetic training data.mesaros2017dcase ; niu2019deep In the following we highlight recent advances in the application of DL in acoustics, though our references are by no means complete.cakir2017convolutional ; mesaros2017dcase ; dibias ; adavanne2019sound ; trees2002optimum ; He2016ResNet ; hershey2016deep ; ernst2018speech ; parviainen2018time ; diment2017transfer ; shen2018natural ; nugraha2016multichannel ; Perotin2019CRNN
