The wavelet transform has several properties that make it a good choice for a feature representation, including a linear-time algorithm, perfect reconstruction, and the ability to tailor wavelet functions to the application. However, the wavelet transform is not widely used in the machine learning community. Instead, methods like the Fourier transform and its variants are often used (e.g., [graves2013speech]). We believe that one cause is the difficulty of designing and selecting appropriate wavelet functions. Wavelet filters are typically derived analytically using Fourier methods. Furthermore, there are many different wavelet functions to choose from. Without a deep understanding of wavelet theory, it can be difficult to know which wavelet to choose. This difficulty may lead many to stick to simpler methods.
We propose a method that learns wavelet functions directly from data using a neural network framework. As such, we can leverage the theoretical properties of the wavelet transform without the difficult task of designing or choosing a wavelet function. An advantage of this method is that we are able to learn directly from raw audio data. Learning from raw audio has shown success in audio generation [oord2016wavenet].
We are not the first to propose using wavelets in neural network architectures. Previous work has used fixed wavelet filters in neural networks, such as the wavelet network [zhang1992wavelet] and the scattering transform [mallat2012group]. Unlike our proposed method, these works do not learn wavelet functions from data.
One notable work on learning wavelets is [waveletGraphs]. Though the authors also propose learning wavelets from data, their approach differs from ours in several ways. First, they consider second generation wavelets instead of the traditional (first generation) wavelets considered here [sweldens1998lifting]. Second, their signals are defined over the vertices of graphs rather than on a regular one-dimensional grid, as the audio signals considered here are.
We begin our discussion with the wavelet transform. We will provide some mathematical background as well as outline the discrete wavelet transform algorithm. Next, we outline our proposed wavelet transform model. We show that we can represent the wavelet transform as a modified convolutional neural network. We then evaluate our model by demonstrating we can learn useful wavelet functions by using an architecture similar to traditional autoencoders [hinton2006reducing].
2 Wavelet transform
We choose to focus on a specific type of linear time-frequency transform known as the wavelet transform. The wavelet transform makes use of a dictionary of wavelet functions that are dilated and shifted versions of a mother wavelet. The mother wavelet, $\psi$, is constrained to have zero mean and unit norm. The dilated and scaled wavelet functions are of the form:
$$\psi_j[n] = \frac{1}{2^j}\,\psi\!\left(\frac{n}{2^j}\right) \tag{1}$$
where $n, j \in \mathbb{Z}$. The discrete wavelet transform is defined as
$$Wx[n, 2^j] = \sum_{m=0}^{N-1} x[m]\,\psi_j[m-n] \tag{2}$$
for a discrete real signal $x$ of length $N$.
The wavelet functions can be thought of as a bandpass filter bank. The wavelet transform is then a decomposition of a signal with this filter bank. Since the wavelets are bandpass, we require the notion of a lowpass scaling function that is the sum of all wavelets above a certain scale in order to fully represent the signal. We define the scaling function, $\phi$, such that its Fourier transform, $\hat{\phi}$, satisfies
$$|\hat{\phi}(\omega)|^2 = \int_1^{+\infty} \frac{|\hat{\psi}(s\omega)|^2}{s}\,ds \tag{3}$$
with the phase of $\hat{\phi}$ being arbitrary [mallat].
The discrete wavelet transform and its inverse can be computed via a fast decimating algorithm. Let us define two filters
$$h[n] = \left\langle \frac{1}{\sqrt{2}}\,\phi\!\left(\frac{t}{2}\right),\ \phi(t-n) \right\rangle \tag{4}$$
$$g[n] = \left\langle \frac{1}{\sqrt{2}}\,\psi\!\left(\frac{t}{2}\right),\ \phi(t-n) \right\rangle \tag{5}$$
The following equations connect the wavelet coefficients to the filters $h$ and $g$, and give rise to a recursive algorithm for computing the wavelet transform.
Wavelet Filter Bank Decomposition:
$$a_{j+1}[p] = \sum_{n} h[n-2p]\,a_j[n] \tag{6}$$
$$d_{j+1}[p] = \sum_{n} g[n-2p]\,a_j[n] \tag{7}$$
Wavelet Filter Bank Reconstruction:
$$a_j[p] = \sum_{n} h[p-2n]\,a_{j+1}[n] + \sum_{n} g[p-2n]\,d_{j+1}[n] \tag{8}$$
We call $a_j$ and $d_j$ the approximation and detail coefficients respectively. The detail coefficients are exactly the wavelet coefficients defined by Equation 2. As shown in Equations 6 and 7, the wavelet coefficients are computed by recursively computing the coefficients at each scale, with $a_0$ initialized with the signal $x$. At each step of the algorithm, the signal is split into high and low frequency components by convolving the approximation coefficients with $h$ (scaling filter) and $g$ (wavelet filter). The low frequency component becomes the input to the next step of the algorithm. Note that $a_j$ and $d_j$ are downsampled by a factor of two at each iteration. An advantage of this algorithm is that we only require two filters instead of an entire filter bank. The wavelet transform effectively partitions the signal into frequency bands defined by the wavelet functions. We can reconstruct a signal from its wavelet coefficients using Equation 8. We call the reconstruction algorithm the inverse discrete wavelet transform. A thorough treatment of the wavelet transform can be found in [mallat].
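The decomposition and reconstruction recursions described in this section can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes the standard filter bank forms $a_{j+1}[p] = \sum_n h[n-2p]\,a_j[n]$ and $d_{j+1}[p] = \sum_n g[n-2p]\,a_j[n]$, periodic (circular) boundary handling, and the Haar filters as a known orthogonal example. The function names `dwt_step`, `idwt_step`, `dwt`, and `idwt` are hypothetical.

```python
import numpy as np

def dwt_step(a, h, g):
    """One decomposition step; circular boundary handling is an assumption."""
    n, k = len(a), len(h)
    # Row p holds indices 2p, 2p+1, ..., 2p+k-1 (mod n): filter, then keep every 2nd output.
    idx = (2 * np.arange(n // 2)[:, None] + np.arange(k)[None, :]) % n
    return a[idx] @ h, a[idx] @ g

def idwt_step(a_next, d_next, h, g):
    """One reconstruction step: upsample by two, filter with h and g, and add."""
    n = 2 * len(a_next)
    a = np.zeros(n)
    for m in range(len(h)):
        pos = (2 * np.arange(n // 2) + m) % n  # unique positions for a fixed tap m
        a[pos] += h[m] * a_next + g[m] * d_next
    return a

def dwt(x, h, g, levels):
    """Return ([d_1, ..., d_J], a_J), finest scale first."""
    a, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        a, d = dwt_step(a, h, g)
        details.append(d)
    return details, a

def idwt(details, a, h, g):
    """Invert dwt by applying the reconstruction step from coarse to fine."""
    for d in reversed(details):
        a = idwt_step(a, d, h, g)
    return a

# Haar filters, used only as a known orthogonal example.
h = np.array([1.0, 1.0]) / np.sqrt(2)
g = np.array([1.0, -1.0]) / np.sqrt(2)
x = np.random.default_rng(0).standard_normal(16)
details, a = dwt(x, h, g, levels=3)
x_rec = idwt(details, a, h, g)
```

Each level halves the number of approximation coefficients, so the total work is proportional to the signal length, which is the linear-time property mentioned in the introduction.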
3 Proposed Model
We propose a method for learning wavelet functions by defining the discrete wavelet transform as a convolutional neural network (CNN). CNNs compute a feature representation of an input signal through a cascade of filters. They have seen success in many signal processing tasks, such as speech recognition and music classification [sainath2013deep, choi2017convolutional]. Generally, CNNs are not applied directly to raw audio data. Instead, a transform is first applied to the signal (such as the short-time Fourier transform). This representation is then fed into the network.
Our proposed method works directly on the raw audio signal. We accomplish this by implementing the discrete wavelet transform as a modified CNN. Figure 1 shows a graphical representation of our model, which consists of repeated applications of Equations 6 and 7. The parameters (or weights) of this network are the wavelet and scaling filters $g$ and $h$. Thus, the network computes the wavelet coefficients of a signal, but allows the wavelet filter to be learned from the data. We can similarly define an inverse network using Equation 8.
We can view our network as an unrolling of the discrete wavelet transform algorithm, similar to unrolling a recurrent neural network (RNN) [pascanu2013difficulty]. Unlike an RNN, our model takes the entire signal as input and reduces the scale at every layer through downsampling. Each layer of the network corresponds to one iteration of the algorithm. At each layer, the detail coefficients are passed directly to the final layer. The final layer output, denoted $W(x)$, is formed as a concatenation of all the computed detail coefficients and the final approximation coefficients. We propose that this network be used as an initial module of a larger neural network architecture. This would allow the larger architecture to take raw audio data as input, as opposed to some transformed version.
We restrict ourselves to quadrature mirror filters. That is, we set
$$g[n] = (-1)^n\,h[-n] \tag{9}$$
By making this restriction, we reduce our parameters to only the scaling filter $h$.
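As a small illustration of the quadrature mirror restriction, the sketch below derives a wavelet filter from a scaling filter. It assumes one common alternating-flip convention for a causal filter of even length $k$, $g[n] = (-1)^n h[k-1-n]$, which may differ from the paper's indexing by a shift. The function name `qmf` is hypothetical, and the Daubechies-2 coefficients are used only as a known example.

```python
import numpy as np

def qmf(h):
    """Wavelet filter from a scaling filter: g[n] = (-1)^n h[k-1-n] (assumed convention)."""
    h = np.asarray(h, dtype=float)
    return (-1.0) ** np.arange(len(h)) * h[::-1]

# Daubechies-2 scaling filter, a standard orthogonal example.
s3 = np.sqrt(3.0)
h_db2 = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2))
g_db2 = qmf(h_db2)
```

With this construction only the entries of `h` need to be stored and updated during training, since `g` is recomputed from `h` whenever it is needed.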
The model parameters will be learned by gradient descent. As such, we must introduce constraints that will guarantee the model learns wavelet filters. We define the wavelet constraints as
$$L_w(h, g) = (\|h\|_2 - 1)^2 + \left(\mu_h - \frac{\sqrt{2}}{k}\right)^2 + \mu_g^2 \tag{10}$$
where $\mu_h$ and $\mu_g$ are the means of $h$ and $g$ respectively, and $k$ is the length of the filters. The first two terms correspond to finite $L_2$ and $L_1$ norms respectively. The third term is a relaxed orthogonality constraint. Note that these are soft constraints, and thus the filters learned by the model are only approximately wavelet filters. See Figure 2 for examples of randomly chosen wavelet functions derived from filters that minimize Equation 10. We have not explored the connection between the space of wavelets that minimize Equation 10 and those of parameterized wavelet families [burrus1997introduction].
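The soft constraints can be sketched as a penalty function. The exact form below is an assumption pieced together from the description in the text (unit $L_2$ norm of $h$, mean of $h$ equal to $\sqrt{2}/k$, zero mean of $g$); the function name `wavelet_constraint` is hypothetical.

```python
import numpy as np

def wavelet_constraint(h, g):
    """Soft wavelet penalty (assumed form): unit norm of h, sum(h) = sqrt(2), sum(g) = 0."""
    h, g = np.asarray(h, dtype=float), np.asarray(g, dtype=float)
    k = len(h)
    unit_norm = (np.linalg.norm(h) - 1.0) ** 2
    mean_h = (h.mean() - np.sqrt(2.0) / k) ** 2
    mean_g = g.mean() ** 2
    return unit_norm + mean_h + mean_g

# The Haar filters satisfy all three conditions, so the penalty vanishes.
h_haar = np.array([1.0, 1.0]) / np.sqrt(2)
g_haar = np.array([1.0, -1.0]) / np.sqrt(2)
penalty_haar = wavelet_constraint(h_haar, g_haar)

# A random length-20 filter generally violates them.
h_rand = np.random.default_rng(0).standard_normal(20)
g_rand = (-1.0) ** np.arange(20) * h_rand[::-1]
penalty_rand = wavelet_constraint(h_rand, g_rand)
```

Because the penalty is only added to the loss rather than enforced exactly, a trained filter would typically give a small but nonzero value.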
We will evaluate our wavelet model by learning wavelet filters that give sparse representations. We achieve this by constructing an autoencoder as illustrated in Figure 3. Autoencoders are used in unsupervised learning in order to learn useful data representations [hinton2006reducing]. Our autoencoder is composed of a wavelet transform network followed by an inverse wavelet transform network. The loss function is made up of a reconstruction loss, a sparsity term, and the wavelet constraints. Let $\hat{x}_i$ denote the reconstructed signal, $W(x_i)$ the wavelet coefficients of $x_i$, and $L_w$ the wavelet constraints of Equation 10. The loss function is defined as
$$L(X; g, h) = \frac{1}{M}\sum_{i=1}^{M} \|x_i - \hat{x}_i\|_2^2 + \lambda_1 \frac{1}{M}\sum_{i=1}^{M} \|W(x_i)\|_1 + \lambda_2 L_w(h, g) \tag{11}$$
for a dataset $X = \{x_1, \dots, x_M\}$ of fixed length signals. In our experiments, we fix .
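A loss with the three components named above (reconstruction, sparsity, wavelet constraints) might be assembled as in the sketch below. The weighting and normalization are assumptions: `lam1` and `lam2` are placeholder weights whose values are not taken from the paper, and the function name `autoencoder_loss` is hypothetical.

```python
import numpy as np

def autoencoder_loss(X, X_hat, coeffs, h, g, lam1, lam2):
    """X, X_hat: (M, N) signals and reconstructions; coeffs: (M, N) wavelet coefficients."""
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))    # mean squared reconstruction error
    sparsity = np.mean(np.sum(np.abs(coeffs), axis=1))   # mean L1 norm of the coefficients
    k = len(h)
    constraint = ((np.linalg.norm(h) - 1.0) ** 2
                  + (np.mean(h) - np.sqrt(2.0) / k) ** 2
                  + np.mean(g) ** 2)                     # assumed soft wavelet penalty
    return recon + lam1 * sparsity + lam2 * constraint

# Two length-2 signals with their one-level Haar coefficients [a, d].
h = np.array([1.0, 1.0]) / np.sqrt(2)
g = np.array([1.0, -1.0]) / np.sqrt(2)
X = np.array([[1.0, 1.0], [2.0, 0.0]])
coeffs = np.stack([X @ h, X @ g], axis=1)
loss = autoencoder_loss(X, X, coeffs, h, g, lam1=0.5, lam2=0.5)  # placeholder weights
```

In this toy case the reconstruction is exact and the Haar filters satisfy the constraints, so only the sparsity term contributes to `loss`.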
We conducted experiments on synthetic and real data. The real data consists of segments taken from the MIDI aligned piano dataset (MAPS) [emiya2010multipitch]. The synthetic data consists of harmonic data generated from simple periodic waves. We construct a synthetic signal, $x$, from a base periodic wave function, $s$, as follows:
$$x(t) = \sum_{k=0}^{K-1} a_k\, s(2^k t + \varphi_k) \tag{12}$$
where $\varphi_k$ is a phase offset chosen uniformly at random, and $a_k$ is the $k$th harmonic indicator, which takes the value 1 with probability $p$. We considered three different base waves: sine, sawtooth, and square. A second type of synthetic signal was created similarly to Equation 12 by windowing the base wave at each scale with randomly centered Gaussian windows (multiple windows at each scale were allowed). The length of the learned filters is 20 in all trials. The length of each $x$ is 1024 and we set , , and . We implemented our model using Google's TensorFlow library [tensorflow-short]. We make use of the Adam algorithm for stochastic gradient descent [kingma2014adam]. We use a batch size of 32 and run the optimizer until convergence.
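The first type of synthetic signal might be generated as in the sketch below. The number of harmonics, the activation probability, and the time axis are illustrative assumptions, not values from the paper, and the names `synth_signal`, `sawtooth`, and `square` are hypothetical helpers written with NumPy only.

```python
import numpy as np

def sawtooth(t):
    """Sawtooth wave with period 2*pi and range [-1, 1)."""
    return 2.0 * ((t / (2 * np.pi)) % 1.0) - 1.0

def square(t):
    """Square wave with period 2*pi and values +1 or -1."""
    return np.where(np.sin(t) >= 0, 1.0, -1.0)

def synth_signal(base, n=1024, num_harmonics=5, p=0.5, rng=None):
    """Sum of randomly activated, randomly phased harmonics of a base wave."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.linspace(0.0, 2 * np.pi, n, endpoint=False)  # assumed time axis
    x = np.zeros(n)
    for k in range(num_harmonics):
        active = rng.random() < p              # harmonic indicator
        phase = rng.uniform(0.0, 2 * np.pi)    # random phase offset
        if active:
            x += base(2 ** k * t + phase)      # harmonic at scale 2^k
    return x

rng = np.random.default_rng(0)
batch = np.stack([synth_signal(np.sin, rng=rng) for _ in range(32)])
```

Passing `sawtooth` or `square` as `base` gives the other two wave types; the Gaussian-windowed variant is not covered by this sketch.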
The wavelet filters learned are unique to the type of data that is being reconstructed. Example wavelet functions are included in Figure 4. These functions are computed from the scaling filter coefficients using the cascade algorithm [strang1996wavelets]. Note that the learned functions are highly structured, unlike the random wavelet functions in Figure 2.
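One way to compute scaling and wavelet functions from filter coefficients is the iterated-filter form of the cascade algorithm, sketched below. It is a minimal illustration, assuming filters normalized so that the scaling filter sums to $\sqrt{2}$, and it returns samples spaced $2^{-\text{levels}}$ apart. The function names `upsample` and `cascade` are hypothetical, and the Haar filters serve only as a check.

```python
import numpy as np

def upsample(f, factor):
    """Insert factor-1 zeros between consecutive filter taps."""
    out = np.zeros((len(f) - 1) * factor + 1)
    out[::factor] = f
    return out

def cascade(h, g, levels):
    """Approximate phi and psi on a grid of spacing 2**-levels (levels >= 1)."""
    h, g = np.asarray(h, dtype=float), np.asarray(g, dtype=float)
    phi = np.array([1.0])
    # Refine the scaling function with progressively dilated copies of h.
    for i in range(levels - 1):
        phi = np.sqrt(2) * np.convolve(upsample(h, 2 ** i), phi)
    # The last stage uses g for the wavelet and h for the scaling function.
    psi = np.sqrt(2) * np.convolve(upsample(g, 2 ** (levels - 1)), phi)
    phi = np.sqrt(2) * np.convolve(upsample(h, 2 ** (levels - 1)), phi)
    return phi, psi

h_haar = np.array([1.0, 1.0]) / np.sqrt(2)
g_haar = np.array([1.0, -1.0]) / np.sqrt(2)
phi, psi = cascade(h_haar, g_haar, levels=3)
```

For the Haar filters this recovers the familiar box function and the step-shaped Haar wavelet; for longer filters, increasing `levels` gives a finer sampling of the limiting functions.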
In order to compare the learned wavelets to traditional wavelets, we will first define a distance measure between filters of length $k$:
$$\mathrm{dist}(h_1, h_2) = \min_{0 \le i < k} \left(1 - \frac{\langle h_1, h_2^{(i)} \rangle}{\|h_1\|_2\,\|h_2\|_2}\right) \tag{13}$$
where $h_2^{(i)}$ is $h_2$ circularly shifted by $i$ samples. This measure is the minimum cosine distance under all circular shifts of the filters. To compare filters of different lengths, we zero-pad the shorter filter to the length of the longer. We restrict our consideration to the following traditional wavelet families: Haar, Daubechies, Symlets, and Coiflets. The middle column of Figure 4 shows the closest traditional wavelet to the learned wavelets according to Equation 13. The distances are listed in the right column.
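The distance measure described above (minimum cosine distance over all circular shifts, with zero-padding for unequal lengths) can be sketched as follows. The function name `filter_distance` is hypothetical, and padding at the end of the shorter filter is an assumption.

```python
import numpy as np

def filter_distance(h1, h2):
    """Minimum cosine distance between h1 and all circular shifts of h2."""
    h1, h2 = np.asarray(h1, dtype=float), np.asarray(h2, dtype=float)
    k = max(len(h1), len(h2))
    h1 = np.pad(h1, (0, k - len(h1)))  # zero-pad the shorter filter
    h2 = np.pad(h2, (0, k - len(h2)))
    scale = np.linalg.norm(h1) * np.linalg.norm(h2)
    sims = [np.dot(h1, np.roll(h2, i)) / scale for i in range(k)]
    return 1.0 - max(sims)

h = np.array([1.0, 2.0, 3.0, 4.0])
d_shifted = filter_distance(h, np.roll(h, 1))        # a shifted copy of the same filter
d_orth = filter_distance([1.0, 1.0], [1.0, -1.0])    # orthogonal under every shift
d_padded = filter_distance([1.0, 1.0], [1.0, 1.0, 0.0, 0.0])
```

Taking the minimum over shifts makes the measure insensitive to where the filter taps sit within the window, so a delayed copy of a filter is at distance zero from the original.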
In order to determine how well the learned wavelets capture the structure of the training data signals, we will consider signals randomly generated from the learned wavelets. To generate signals we begin by sparsely populating wavelet coefficients from . The coefficients from the three highest frequency scales are then set to zero. Finally, the generated signal is obtained by performing an inverse wavelet transform of the sparse coefficients. Qualitative results are shown in Figure 5. Typical training examples are shown in the left column. Example generated signals are shown in the right column. Note that the generated signals have visually similar structure to the training examples. This provides evidence that the learned wavelets have captured the structure of the data.
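The generation procedure can be sketched as below. Several details are assumptions made for illustration: coefficients are drawn uniformly from $[-1, 1]$, each coefficient is nonzero with probability `density`, boundaries are handled circularly, and the Haar filters stand in for learned filters. The names `idwt_step` and `generate` are hypothetical.

```python
import numpy as np

def idwt_step(a_next, d_next, h, g):
    """One inverse step: upsample by two and filter with h and g (circular boundary)."""
    n = 2 * len(a_next)
    a = np.zeros(n)
    for m in range(len(h)):
        pos = (2 * np.arange(n // 2) + m) % n
        a[pos] += h[m] * a_next + g[m] * d_next
    return a

def sparse_coeffs(size, density, rng):
    """Mostly-zero vector with uniform [-1, 1] entries (assumed distribution)."""
    c = np.zeros(size)
    mask = rng.random(size) < density
    c[mask] = rng.uniform(-1.0, 1.0, mask.sum())
    return c

def generate(h, g, n=1024, levels=5, zero_scales=3, density=0.1, rng=None):
    """Inverse-transform sparse coefficients with the finest scales set to zero."""
    rng = np.random.default_rng() if rng is None else rng
    details = []
    for j in range(1, levels + 1):  # j = 1 is the highest frequency scale
        size = n >> j
        if j <= zero_scales:
            details.append(np.zeros(size))
        else:
            details.append(sparse_coeffs(size, density, rng))
    a = sparse_coeffs(n >> levels, density, rng)
    for d in reversed(details):     # reconstruct from coarse to fine
        a = idwt_step(a, d, h, g)
    return a

h = np.array([1.0, 1.0]) / np.sqrt(2)
g = np.array([1.0, -1.0]) / np.sqrt(2)
x_gen = generate(h, g, rng=np.random.default_rng(0))
```

With the three finest scales zeroed and Haar filters, the generated signal is constant on blocks of eight samples; learned filters would instead impose their own shape at those scales.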
We have proposed a new model capable of learning useful wavelet representations from data. We accomplish this by framing the wavelet transform as a modified CNN. We show that we can learn useful wavelet filters by gradient descent, as opposed to the traditional derivation of wavelets using Fourier methods. The learned wavelets are able to capture the structure of the data. We hope that our work leads to wider use of the wavelet transform in the machine learning community.
Framing our model as a neural network has the benefit of allowing us to leverage deep learning software frameworks, and also allows for simple integration into existing neural network architectures. An advantage of our method is the ability to learn directly from raw audio data, instead of relying on a fixed representation such as the Fourier transform.