Deep neural networks, and convolutional neural networks (CNNs) in particular, have proven quite effective at discerning hierarchical patterns in large datasets. Some examples include image classification [3] and speech recognition, among many others. Clearly, something is going very right in the design of CNNs. However, the principles that account for this success are still somewhat elusive, as is the construction of systems that work well with few examples.
The scattering transform was created to remedy these issues. In 2012, Stephane Mallat and Joan Bruna published both theoretical results and numerical implementations tying together convolutional neural networks and wavelet theory. They demonstrated that scattering networks built from wavelets and modulus nonlinearities are translation invariant in the limit of infinite scale, and Lipschitz continuous under non-uniform translation, i.e., translation by a displacement field with bounded gradient. Numerically, they achieved state-of-the-art results on image and texture classification problems.
More recent work from Wiatowski and Bölcskei has generalized the Lipschitz continuity result from wavelet transforms to frames and, more importantly, established that increasing the depth of the network also leads to translation invariant features. There have been a number of follow-up papers, including a discrete version of Wiatowski’s result, a related method on graphs, and a pseudo-inverse using phase retrieval. There have also been a number of papers using the scattering transform in such problems as fetal heart rate classification, age estimation from face images, and voice detection in the presence of transient noise.
In this paper, we apply the scattering transform of Mallat et al. to the problem of classifying underwater objects using sonar. This is motivated by previous work done by our group on the detection of unexploded ordnance (UXO) using synthetic aperture sonar (SAS). From this, we use the BAYEX14 dataset, which contains 14 different objects at various distances and rotations, partially submerged in a thin layer of mud on a sandy shallow ocean bed off Florida. Figure 1 shows examples of these objects. Additionally, we can generate synthetic examples using a fast solver for the Helmholtz equation in two regions with differing speeds of sound, provided by Ian Sammis and James Bremer [17, 18]. We use this to examine more closely the dependence of both classification and the output of the scattering transform on both material properties and shape variations.
The approximate translation invariance and Lipschitz continuity under small diffeomorphisms of the scattering transform mean that, for classes that are invariant under these transformations, members of the same class will be close together in the resulting space. So long as morphing from one class to another requires a diffeomorphism with large derivative, the classes will be well separated. Accordingly, we use a linear classifier on the output of the scattering transform. Additionally, because the scattering transform concentrates energy at coarser scales, and wavelets in general encourage sparsity for smooth signals with singularities, we use a sparse version of logistic regression with a LASSO penalty as our linear classifier. Specifically, we use a Julia wrapper around Glmnet.
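To make the idea of a sparse linear classifier concrete, the following is a minimal Python sketch of L1-penalized (LASSO) logistic regression. It is not the Glmnet implementation: Glmnet uses coordinate descent along a regularization path, whereas this sketch uses plain proximal gradient descent (ISTA) for brevity. The function names, the step size, and the toy data are all assumptions made for illustration.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink `value` toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_sparse_logistic(features, labels, penalty, step=0.1, iterations=2000):
    """Minimize mean logistic loss + penalty * ||weights||_1 by proximal gradient."""
    n, d = len(features), len(features[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            grad_b += error / n
            for j in range(d):
                grad_w[j] += error * x[j] / n
        bias -= step * grad_b  # the intercept is not penalized
        weights = [
            soft_threshold(w - step * g, step * penalty)
            for w, g in zip(weights, grad_w)
        ]
    return weights, bias


# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
features = [[-2.0, 0.3], [-1.0, -0.3], [-1.5, 0.1],
            [1.0, -0.1], [2.0, 0.3], [1.5, -0.3]]
labels = [0, 0, 0, 1, 1, 1]

sparse_weights, sparse_bias = fit_sparse_logistic(features, labels, penalty=0.05)
null_weights, null_bias = fit_sparse_logistic(features, labels, penalty=1.0)
```

With the small penalty, only the informative coefficient remains nonzero; with the large penalty, every coefficient is shrunk exactly to zero. This is the sparsity-inducing behaviour that motivates the LASSO penalty here.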
As a baseline against which to compare classification performance, we use the same linear classifier on the absolute value of the Fourier transform (AVFT) of the signal. One strong advantage of the AVFT is that it is completely invariant to translations, and it is the prototype of every linear filter which is invariant to translations [21, 22]. In addition to this invariance, the close ties between frequency and the speed of sound suggest that it should be sensitive to changes in the material. This will be examined in more depth in Section 4.
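The translation invariance of the AVFT can be checked numerically. The short Python sketch below uses a naive discrete Fourier transform and a circular shift, so the invariance is exact only for periodized signals; the helper names and the test signal are assumptions chosen for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def avft(signal):
    """Absolute value of the Fourier transform (AVFT) of a sequence."""
    return [abs(coefficient) for coefficient in dft(signal)]


def circular_shift(signal, shift):
    """Translate a periodic signal by `shift` samples."""
    n = len(signal)
    return [signal[(t - shift) % n] for t in range(n)]


# A short test signal: a positive spike followed by a negative spike.
signal = [0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
shifted = circular_shift(signal, 3)

# The shift only changes the phase of each coefficient, so the moduli agree.
max_difference = max(abs(a - b) for a, b in zip(avft(signal), avft(shifted)))
```

Here `max_difference` is zero up to floating-point rounding, even though `signal` and `shifted` differ sample by sample.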
Section 2 gives the basic set-up of a scattering transform and the approximate invariants established by previous theoretical work: translation invariance that increases with depth, and Lipschitz stability to nonuniform time and frequency translation. Section 3 describes the setup used to generate both the synthetic and real datasets. Section 5 gives the results of applying the ST and AVFT to perform binary classification on shape and material in the synthetic case, and between UXOs and various other objects in the real case.
2 Generalized Scattering Transform
A generalized scattering network (hereafter referred to simply as a scattering network or ST)[8, 6], has an architecture that is a continuous analog of a CNN. For a diagram, see Figure 2. First, at layer , we start with a family of generators for a translation invariant frame with frame bounds and , that is
for all . Here, is the translation operator, is the involution operator, and is some countable discrete index set, such as . This index set needs to tile the frequency plane in some way, for example by indexing scales and rotations. The frame atoms correspond to the receptive fields found in each layer of a CNN. In addition, at layer we define a Lipschitz-continuous operator with bound which satisfies . After these have been applied, the result is sub-sampled at a rate . Finally, the output at each layer is then generated by averaging with a specific atom , which is removed from the set of frame atoms.
The scattering transform of Mallat for corresponds to choosing for some finite rotation group , for some mother wavelet , its corresponding father wavelet, and a quality factor. The Lipschitz nonlinearity is . However, it lacks a sub-sampling factor, so . Explicitly, the generator corresponding to index is .
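The layered structure described above (filter, apply a modulus, then average to produce each layer's output) can be sketched in a few lines of Python. This is a toy illustration only, under several assumptions: Haar-type filters stand in for the Morlet wavelets, the output averaging is a global mean rather than a windowed father wavelet, convolution is circular, there is no sub-sampling, and every path of scales is retained. All names are hypothetical.

```python
def circular_convolve(signal, kernel):
    """Circular (periodic) convolution of `signal` with a short `kernel`."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(t - j) % n] for j in range(len(kernel)))
        for t in range(n)
    ]


def haar_filter(scale):
    """Stand-in band-pass filter (assumption): Haar wavelet of width 2**(scale + 1)."""
    half = 2 ** scale
    weight = 1.0 / (2 * half)
    return [weight] * half + [-weight] * half


def scattering_features(signal, scales, depth):
    """Map each path of scales (a tuple) to the average of its modulus cascade."""
    features = {}
    layer = {(): list(signal)}
    for level in range(depth + 1):
        next_layer = {}
        for path, values in layer.items():
            # Output of this layer: average the propagated signal.
            features[path] = sum(values) / len(values)
            if level < depth:
                for scale in scales:
                    filtered = circular_convolve(values, haar_filter(scale))
                    next_layer[path + (scale,)] = [abs(v) for v in filtered]
        layer = next_layer
    return features


signal = [0.0] * 16
signal[3], signal[4] = 1.0, -1.0
shifted = signal[-5:] + signal[:-5]  # circular translation by 5 samples

features = scattering_features(signal, scales=[0, 1, 2], depth=2)
features_shifted = scattering_features(shifted, scales=[0, 1, 2], depth=2)
```

Because the averaging in this sketch is global, the features of `signal` and `shifted` agree exactly; with a finite averaging window the invariance is only approximate. Only the zeroth-layer feature (the empty path) can be negative, since every later layer passes through a modulus.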
To get from layer to layer , we define the function by
Using this to define the value at a layer , we have a path coming along a path of indices ,
Choosing corresponds to audio signals, such as sonar. Unlike a CNN, in a Scattering transform each layer has output, including just the average , which is layer zero. For , the output is
while for the entire scattering transform,
2.1 Previous theory
There are two additional conditions that restrict the various operators in a given layer simultaneously. The first is the weak admissibility condition, which requires that the upper frame bound be sufficiently small compared to the sub-sampling factor and Lipschitz constants:
which can be achieved by scaling the .
The second is that the nonlinearities must commute with the translation operator, so . Most nonlinearities used for CNN’s are pointwise i.e., , so they certainly commute with . Given these constraints, we can now state the results of Wiatowski and Bölcskei precisely; the first is that the resulting features deform stably with respect to small frequency and space deformations:
 For frequency shift and space shift , define the operator . If , then there exists a independent of the choice of parameters for s.t. for all ,
where is the set of functions whose Fourier transforms are band limited to . Mallat shows a similar bound for the specific case that and is generated by an admissible mother wavelet with a number of conditions .
Their next result, translation invariance that increases with depth, is distinct from the one shown by Mallat where translation invariance increases with resolution:  Given the conditions above, for the layer’s features satisfy
If there is also a global bound on the decay of the Fourier transforms of the output features :
then we have the stronger result
To compare with the result from Mallat, that says that for an admissible mother wavelet , the scattering transform achieves perfect translation invariance as the lower bound on the scale goes to infinity:
3 Sonar Detection
The problem that we investigate with the scattering transform is the classification of objects partially buried on the sea floor using sonar data. This is motivated by the use of an unmanned underwater vehicle (UUV) equipped with sonar to detect unexploded ordnance. For this problem, we have both real and synthetic examples. The real examples consist of 14 partially buried objects at various distances and rotations, in a shallow mud layer on top of a sand ocean bed at a depth of 8 m. BAYEX14 simulated the path of a UUV by mounting a sensor/emitter on a rail and pinging the field of objects at short intervals; see Figure 3. So for each rotation of the object relative to the rail, there is a 2D wavefield, in which each signal corresponds to a location on the rail and is a function of the observation time.
The synthetic examples come from considering the 2D Helmholtz equation in the regions with differing speed of sound:
which gives the response to a sinusoidal signal with frequency on an object with , where is the speed of sound in the material, ranging from in air, to in water, to in aluminum. It is worth noting that this is idealized in several ways: the model we use is 2D, rather than 3D, the material is modeled as a fluid with only one layer, instead of a solid with multiple different components, and there is no representation of the ocean floor itself.
The signals sent out by the UUVs are not pure sinusoids. One can approximate the response to multi-frequency signals (e.g. Gabor functions or chirps) by integrating across frequencies. We use a fast solver created by Ian Sammis and James Bremer to synthesize a set of examples, where we can more explicitly test the dependence of the scattering transform on the material properties (corresponding to changing the speed of sound) and geometry. The current dataset we use was created by Vincent Bodin, a former summer intern supervised by the first author. The input/source signal is shown in Figure 5; zero padding is applied to each source signal to make the periodized version have approximately the same behavior as an input with finite support.
There are many design choices in testing the effectiveness of a classifier. As a baseline to compare against, we use the same Glmnet logistic classifier on the absolute value of the Fourier transform (AVFT), which is a simple classification technique that is translation invariant and sensitive to frequency shifts. To understand the generalization ability of the techniques, we split the data uniformly at random into two halves, one for training and one for testing, and repeated this 10 times; we refer to this as 10-fold cross validation. For the synthetic dataset, we also normalized the signals and added Gaussian white noise so that the SNR is 5 dB.
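The evaluation protocol just described can be sketched as follows. This is an illustrative Python sketch, not the code used for the experiments: it assumes that "normalized" means unit Euclidean norm and that the SNR is defined as the ratio of mean signal power to expected noise power. The function names and the sinusoidal test signal are hypothetical.

```python
import math
import random


def add_white_noise(signal, snr_db, rng):
    """Normalize `signal` to unit norm, then add Gaussian white noise at `snr_db`."""
    norm = math.sqrt(sum(v * v for v in signal))
    normalized = [v / norm for v in signal]
    signal_power = 1.0 / len(signal)  # mean squared value after normalization
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in normalized]


def random_half_splits(n_examples, n_repeats, rng):
    """Yield (train, test) index lists, each a fresh uniformly random half split."""
    for _ in range(n_repeats):
        order = list(range(n_examples))
        rng.shuffle(order)
        half = n_examples // 2
        yield order[:half], order[half:]


rng = random.Random(0)
signal = [math.sin(0.05 * t) for t in range(4096)]
noisy = add_white_noise(signal, 5.0, rng)
splits = list(random_half_splits(20, 10, rng))
```

Each of the 10 splits is drawn independently, so a given example may land in the test half several times. This differs from partitioning the data into 10 disjoint folds.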
For the synthetic data, there are two primary problems of interest. The first is determining the effects of varying shape on the scattering transform. An example of the 1D scattering transform for a triangle is in Figure 4. Since the energy at each layer decays exponentially with layer index, layers 1-3 have been scaled to match the intensity of the first layer. Note that only the zeroth layer has negative values; this is because the nonlinearity used by the scattering transform is an absolute value. In the figure, one can clearly see a time concentrated portion of the signal in layers 0, 1, and 2.
For the real dataset, the problem of interest is somewhat more ambiguous. In addition to a set of UXOs and a set of arbitrary objects, there are some UXO replicas, not all of which are made of the same material. As we will see in the synthetic case, the difference in the speed of sound has a much clearer effect on classification accuracy than shape, so it is not obvious how to treat these. We should expect correctly classifying non-UXOs to be more difficult, since as a class they do not have much in common: a SCUBA tank bears more resemblance to a UXO than it does to a rock.
4 Geometric properties
Ideally, the invariants discussed above would apply to transformations in the object domain rather than to signals. But translations of the object (or equivalently, of the observation rail) have a more complicated effect on the signal than simply translating the observation; at a minimum, translating away from the object will cause a decay in signal amplitude. The changes in the object domain we seek to understand are changes in object material, translations and rotations of the object/rail, and changes in geometry. A classifier for this problem should be invariant to translation and rotation, but sensitive to the geometry of the object and the material.
To understand more deeply how changes in the geometry affect the observed signals, we need to examine the solutions more closely. Let the input signal be fixed as .The ideal reconstruction of the response to , for a transmitter and receiver located at is
where is the solution, denotes the Fourier transform of . To define the observation rail, first we define an unrotated observation, for , so the rail has range and length . Then, a rotation of the object by an angle is equivalent to rotating the rail by , so define where is the appropriate rotation matrix. In this setup, , and .
4.1 Effects of the speed of sound
To determine the behavior of a fixed location as we change the speed of sound, we use common acoustic properties such as reflection coefficients and Snell’s law. The crudest possible assumption that still gives meaningful results is that internal angles are irrelevant, and only refraction, reflection, and distance matters. Going from with speed of sound to with speed of sound , the reflection coefficient is given by , while the refraction coefficient is , where the impedance is ( is the density of the material). The distance from the center to a given point on the line is just given by the Pythagorean theorem, . This means the initial peak occurs at . If the input peak has magnitude , then the first return peak should be approximately ; since for most relevant examples, this is positive. Setting , the next peak can be approximated as ; the sign of the second peak will flip. The third peak is and will flip sign again. Similarly .
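The peak pattern argued above can be illustrated numerically. The Python sketch below is a reconstruction under stated assumptions rather than the exact expressions of this section: it uses the standard normal-incidence pressure reflection and transmission coefficients, ignores geometric spreading and absorption, and models the k-th later peak as one transmission into the object, k internal reflections, and one transmission back out. The densities and sound speeds are approximate textbook values for water and aluminum, not values taken from the dataset.

```python
def impedance(density, sound_speed):
    """Acoustic impedance: density times speed of sound."""
    return density * sound_speed


def reflection(z_from, z_to):
    """Normal-incidence pressure reflection coefficient at an interface."""
    return (z_to - z_from) / (z_to + z_from)


def transmission(z_from, z_to):
    """Normal-incidence pressure transmission coefficient at an interface."""
    return 2.0 * z_to / (z_to + z_from)


def peak_amplitudes(z_outside, z_inside, input_amplitude, n_peaks):
    """Approximate return peaks under the crude model described in the lead-in."""
    r_out = reflection(z_outside, z_inside)   # reflection off the outer surface
    r_in = reflection(z_inside, z_outside)    # internal reflection, equals -r_out
    through = transmission(z_outside, z_inside) * transmission(z_inside, z_outside)
    peaks = [r_out * input_amplitude]
    for k in range(1, n_peaks):
        peaks.append(through * r_in ** k * input_amplitude)
    return peaks


# Approximate, illustrative material values (assumptions).
z_water = impedance(1000.0, 1500.0)
z_aluminum = impedance(2700.0, 6420.0)

peaks = peak_amplitudes(z_water, z_aluminum, input_amplitude=1.0, n_peaks=4)
first_decay = abs(peaks[1] / peaks[0])
later_decay = abs(peaks[2] / peaks[1])
```

For this impedance contrast the peaks alternate in sign, and the amplitude drop from the first peak to the second (`first_decay`) is much sharper than the drop between later consecutive peaks (`later_decay`). This ordering depends on the contrast being large, so it should not be read as a general law.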
From this, we should expect that the decay from the first peak to the second peak, , is larger than that between any further consecutive peaks, . One can see the sign flip clearly in Figure 6 (take care that the input signal (Figure 5) is a positive spike followed by a negative spike, so the second peak begins at for speed of sound ).
For classification purposes, for a fixed angle and position on the rail, the reflection of every peak but the first depends on both and the speed of sound . Further, the time between peaks should be , so the scale of the solutions should provide a strong indicator of the speed of sound in the material. We will indeed see that the absolute value of a Fourier transform (AVFT), which has access to scale information, is reasonably effective at this problem. However for shapes with approximately the same diameter, such as the triangle and shark-fin, none of the above is useful for discrimination.
4.2 Effects of rotation
The discussion so far has almost completely ignored the internal geometry of . For a rectangle, when facing the longer side, this approximation is reasonable, as Snell’s law matters when the internal angles depart significantly from . However, the problem of ray tracing in a region is non-trivial in its own right, requiring averaging over all paths.
To avoid this issue, instead of trying to directly construct properties of the observations, we can derive how they will change under rotation and translation. We can use the far field approximation to do so [23, Chpt. 4]. Since and the center frequency of is , we have that the range , the condition for far-field. Here we take the solution as approximately separable, so . In this representation, is a Bessel function of the first kind, and so to zero-th order is approximately .
Using this, we can examine the effect of rotation of the object (or the rail about the object). Suppose we know the solution at angles and , and we want to determine the solution at an angle between these. Every point in will have the same angle as a point on either or (or possibly both) if and are close enough that the paths cross, such as the case in Figure 7. This happens when , or in the synthetic dataset, ). Some geometry gives that the point has the same angle as if
for ; similar reasoning works for and with flipped signs. The distance changes from to , while the angle remains fixed, so plugging the zero-th order approximation to the Bessel function above into eq. 13,
where . So there are two effects on in the zero-th order approximation. The first is a phase shift by , under which both the AVFT and the ST are invariant. The second is a small amplitude modulation (for , this ranges from .94 to 1.077). However, since the error in this approximation is , we should only roughly expect this to hold, since most of the signals have amplitudes on the order of . In the full case of eq. 15, we have a ratio of Bessel functions whose argument depends linearly on the frequency . If , then this can definitely be written in the form of a non-constant frequency shift as in Section 2.1.
4.3 Effects of translation
Increasing the distance to the object from to is a similar transformation to rotation, since every point on the new rail corresponds to a point on the old rail, but with increased radius. If is the location on the new rail then the point with the same angle is just ; since , this is smaller than . Then we have the same sort of derivation as above, with a strictly linear function instead of eq. 14.
For translation along a given rail, the dependence on is unavoidable. For a given point , the radius is simply , while the angle is . We leave further discussion of the properties of for our future publication.
As noted before, for the linear classifier applied after the non-linear transform, we used sparse logistic regression as implemented in Glmnet. For the scattering transform, we used the ScatNet package as implemented by Mallat and his group.
For the synthetic dataset, we compared three transforms. The first was the absolute value of the Fourier transform (AVFT). The second, hereafter referred to as the coarser ST, was a three-layer scattering transform using Morlet wavelets, with a different quality factor in each layer. Increasing the quality factor corresponds to decreasing the rate of scaling of the mother wavelet, and thus gives more coefficients for the same frequency regime. The third, hereafter referred to as the finer ST, was another three-layer scattering transform with increased quality factors in all three layers. For each of these, we used the repeated random splits described in Section 3 (10-fold cross validation) to check the generalization of our results.
As in Section 4, we investigated two problems: shape and material discrimination. For the shape discrimination, we compared the triangle and the shark-fin, with the material speed of sound held fixed, while for the material discrimination, we fixed the shape to be a triangle and compared two different speeds of sound. To do this comparison, we use the receiver operating characteristic (ROC) curve, which shows the trade-off between false positives and true positives as we change the classification threshold; since each rate is computed within a single class, it is insensitive to skewed class sizes. A way of summarizing the ROC is the area under the curve (AUC), which simply integrates the total area underneath the curve; we use the trapezoidal approximation. This varies from 0.5 for random guessing to 1 for the ideal classifier, which makes no misclassifications.
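As a concrete illustration of the ROC and AUC computation, the following Python sketch sweeps the classification threshold over a set of scores and integrates the resulting curve with the trapezoidal rule. The function names and the example scores are hypothetical and are not drawn from the experiments.

```python
def roc_curve(scores, labels):
    """Return (false positive rates, true positive rates) over all thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example in a group of tied scores.
        if index + 1 == len(ranked) or ranked[index + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i + 1] - fpr[i]) * (tpr[i + 1] + tpr[i]) / 2.0
        for i in range(len(fpr) - 1)
    )


# Hypothetical classifier scores (higher means "more likely positive").
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]

fpr, tpr = roc_curve(scores, labels)
auc = auc_trapezoid(fpr, tpr)
```

In this example, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9. A classifier that assigns the same score to every example yields 0.5, and one that ranks every positive above every negative yields 1.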
The results for material discrimination are in Figure 8; the corresponding AUCs are for the AVFT, for the coarser ST, and for the finer ST. Fitting with the basic derivation in the geometry section, even the AVFT is capable of discriminating material effectively. Somewhat surprisingly, the coarser ST performed worse than the AVFT. This is likely because of insufficient frequency resolution, which the finer ST is able to achieve.
The results in shape discrimination are more definitive in demonstrating the effectiveness of the ST, as seen in Figure 9. The coarser scattering transform, with an AUC of .886, outperforms the AVFT with an AUC of . But the finer ST clearly outperforms both of these, with an AUC of , on par with the classification rates for the speed of sound problem.
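Table 1: Objects in the real dataset, grouped into the two classes used for classification.
| UXOs and replicas | Other objects |
|---|---|
| 155mm Howitzer with collar | 55-gallon drum, filled with water |
| 155mm Howitzer w/o collar | 2ft aluminum pipe |
| aluminum UXO replica | Scuba tank, water filled |
| steel UXO replica | |
| DEU trainer (mine-like object) | |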
For the real dataset, we compared two transforms, the AVFT and a two layer scattering transform with . We have split the data into a set of objects that are either UXOs or replicas, and a set of the other objects in the dataset, as listed in Table 1. In both classes, there are a variety of materials and shapes. Between classes, there are no similar shapes (as the shape is what determines if it is a replica rather than a UXO), but there are two with the same material (aluminum UXO replica vs aluminum pipe). The ST has an AUC of .9487, while the AVFT has an AUC of .8186. The AVFT actually did better on this problem than it had on the shape detection problem, suggesting that it is primarily the material properties of the UXOs that distinguish them.
In this paper, we have given initial results on understanding the problem of classifying sonar signals using the scattering transform. Using geometric arguments and the synthetic dataset, we have demonstrated that material detection is a considerably simpler problem than shape detection, and that the scattering transform is capable of solving both problems. We have given arguments for why the scattering transform may be suited to this problem by using previous theory, though it is difficult to accurately cast the transform in eq. 15 in the form of a non-linear frequency modulation. Further, we have given numerical evidence that the scattering transform works well on this problem. This suggests that there may be additional operators for which the scattering transform has the Lipschitz stability of eq. 7, which better characterize the object domain.
Acknowledgements. This research was supported in part by grants from ONR N00014-12-1-0177, N00014-16-1-2255, and from NSF DMS-1418779. We would like to thank Frank Crosby and Julia Gazagnaire of Naval Surface Warfare Center, Panama City, FL, for providing us with the real BAYEX14 dataset. The work on the synthetic dataset is extended from the work of our former intern Vincent Bodin (now at Sinequa) and our former postdoctoral researcher Ian Sammis (now at Google) who first generated the dataset. James Bremer (UC Davis) and Ian Sammis developed the fast Helmholtz equation solver that we used. Finally, we would like to thank Stephane Mallat and his group (ENS, France) for their ST codes, Simon Kornblith (MIT) for the Julia wrapper of the Glmnet Fortran code, which was in turn developed by Jerome Friedman, Trevor Hastie, Rob Tibshirani and Noah Simon (Stanford).
[1] LeCun, Y., Bengio, Y., and Hinton, G., “Deep learning,” Nature 521(7553), 436–444 (2015).
[2] Krizhevsky, A., Sutskever, I., and Hinton, G. E., “ImageNet classification with deep convolutional neural networks,” NIPS 25, 1097–1105 (2012).
[3] Sun, Y., Wang, X., and Tang, X., “Deep convolutional network cascade for facial point detection,” Proc. IEEE Trans. Pattern Anal. Mach. Intell., 3476–3483 (2013).
[4] Dahl, G. E., Sainath, T. N., and Hinton, G. E., “Improving deep neural networks for LVCSR using rectified linear units and dropout,” IEEE Int. Conf. on Acoustics, Speech and Signal Process., 8609–8613 (2013).
[5] Mallat, S., “Understanding deep convolutional networks,” Philos. Trans. A Math. Phys. Eng. Sci. 374 (2016).
[6] Mallat, S., “Group invariant scattering,” Comm. Pure Appl. Math. 65(10), 1331–1398 (2012).
[7] Bruna, J. and Mallat, S., “Classification with scattering operators,” IEEE Conf. Comp. Vision and Pattern Recog., 1561–1566 (2011).
[8] Wiatowski, T. and Bölcskei, H., “Deep convolutional neural networks based on semi-discrete frames,” IEEE Int. Symp. on Info. Theory, 1212–1216 (2015).
[9] Wiatowski, T., Tschannen, M., Stanic, A., Grohs, P., and Bölcskei, H., “Discrete deep feature extraction: A theory and new architectures,” Proc. Int. Conf. on Machine Learning (2016).
[10] Chen, X., Cheng, X., and Mallat, S., “Unsupervised deep Haar scattering on graphs,” Adv. Neural Info. Process. Syst. 27, 1709–1717 (2014).
[11] Mallat, S. and Waldspurger, I., “Phase retrieval for the Cauchy wavelet transform,” J. Fourier Anal. Appl. 21(6), 1251–1309 (2015).
[12] Chudáček, V., Andén, J., Mallat, S., Abry, P., and Doret, M., “Scattering transform for intrapartum fetal heart rate variability fractal analysis: a case-control study,” IEEE Trans. on Biomed. Eng. 61(4), 1100–1108 (2014).
[13] Chang, K.-Y. and Chen, C.-S., “A learning framework for age rank estimation based on face images with scattering transform,” IEEE Trans. on Image Process. 24(3), 785–798 (2015).
[14] Dov, D. and Cohen, I., “Voice activity detection in presence of transients using the scattering transform,” IEEE Conv. of Electrical Electronics Eng. in Israel, 1–5 (2014).
[15] Kargl, S., “Acoustic response of underwater munitions near a sediment interface: Measurement model comparisons and classification schemes,” Tech. Rep. MR-2231, SERDP and University of Washington Seattle Applied Physics Lab (2015).
[16] Marchand, B., Saito, N., and Xiao, H., “Classification of objects in synthetic aperture sonar images,” IEEE/SP Workshop on Stat. Signal Process. 14, 433–437 (2007).
[17] Bremer, J., “A fast direct solver for the integral equations of scattering theory on planar curves with corners,” J. Comput. Phys. 231(4), 1879–1899 (2012).
[18] Bremer, J. and Gimbutas, Z., “On the numerical evaluation of the singular integrals of scattering theory,” J. Comput. Phys. 251, 327–343 (2013).
[19] Hastie, T., Tibshirani, R., and Wainwright, M., [Statistical Learning with Sparsity: The Lasso and Generalizations], Chapman & Hall/CRC Press (2015).
[20] Friedman, J., Hastie, T., and Tibshirani, R., “Regularization paths for generalized linear models via coordinate descent,” J. Stat. Softw. 33(1), 1–22 (2010).
[21] Otsu, O., “An invariant theory of linear functionals as linear feature extractors,” Bulletin of the Electrotechnical Laboratory 37(10), 893–913 (1973).
[22] Amari, S., “Invariant structures of signal and feature space in pattern recognition problems,” RAAG Memoirs 4(1-2), 553–566 (1968).
[23] Etter, P. C., [Underwater Acoustic Modeling and Simulation], CRC Press, Boca Raton, FL (2013).
[24] Fawcett, T., “An introduction to ROC analysis,” Pattern Recogn. Lett. 27(8), 861–874 (2006).