Although deep neural networks (DNNs) have achieved remarkable success in various applications, e.g. audio classification, speech recognition and natural language processing, they usually suffer from two problems rooted in their inherently huge parameter space. First, most state-of-the-art deep architectures are prone to over-fitting, even when trained on large datasets (Simonyan & Zisserman, 2015; Szegedy et al., 2015). Second, since they usually consume large amounts of storage, memory and energy (Han et al., 2016), DNNs are difficult to embed into devices with limited memory and power (such as portable devices or chips). Most existing works reduce the computational budget through network pruning (Han et al., 2015; Anwar et al., 2017; Li et al., 2017; Collins & Kohli, 2014), filter factorization (Jaderberg et al., 2014; Lebedev et al., 2014), low-bit weight representation (Rastegari et al., 2016) and knowledge transfer (Hinton et al., 2015). Different from all the above works, which ignore the strong dependencies among weights and learn filters independently based on existing network architectures, this paper proposes to explicitly enforce parameter sharing among filters to more effectively learn compact and efficient deep networks.
In this paper, we propose a Weight Sampling deep neural network (WSNet) that significantly reduces both the model size and the computation cost of deep networks, achieving a more than 100× smaller size and up to 16× speedup with negligible performance drop, or even better performance than the baselines (i.e. conventional networks that learn filters independently). Specifically, WSNet is parameterized by layer-wise condensed filters, from which each filter participating in actual convolutions can be directly sampled along both the spatial and channel dimensions. Since condensed filters have far fewer parameters than the independently trained filters of conventional CNNs, learning by sampling from them makes WSNet a much more compact model than conventional CNNs. In addition, to reduce the ubiquitous computational redundancy in convolving overlapped filters with overlapped input patches, we propose an integral image based method that dramatically reduces the computation cost of WSNet in both training and inference. The integral image method is also advantageous in that it enables weight sampling with different filter sizes at little computational overhead, enhancing the learning capability of WSNet.
To demonstrate the efficacy of WSNet, we conduct extensive experiments on the challenging acoustic scene classification and music detection tasks. On all test datasets, including MusicDet200K (a self-collected dataset, detailed in Section 4), ESC-50 (Piczak, 2015a), UrbanSound8K (Salamon et al., 2014) and DCASE (Stowell et al., 2015a), WSNet significantly reduces the model sizes of the baselines by 100× with comparable or even higher classification accuracy. When compressed by more than 180×, WSNet suffers only a negligible accuracy drop. At the same time, WSNet also significantly reduces the computation cost (by up to 16×). Such results strongly evidence the capability of WSNet to learn compact and efficient networks. Note that although the experiments in this paper are mostly limited to 1D CNNs, the same approach can be naturally generalized to 2D CNNs, which we will explore in the future.
2 Related Works
2.1 Audio classification
In this paper we consider the acoustic scene classification (ASC) task as well as the music detection task. ASC aims to classify the surrounding environment in which an audio stream is generated, given the audio input (Barchiesi et al., 2015). It can be applied in many different applications such as audio tagging (Cai et al., 2006), audio collection management (Landone et al., 2007), robotic navigation (Chu et al., 2006), intelligent wearable interfaces (Xu et al., 2008) and context adaptive tasks (Schilit et al., 1994). Music detection is a related task that determines whether a small segment of audio is music or not. It is usually treated as a binary classification problem: given an audio segment as input, classify it into one of two categories, music or non-music.
Like in many other areas, convolutional neural networks (CNNs) have been widely used in audio classification tasks (Valenti et al., 2016; Salamon & Bello, 2017). SoundNet (Aytar et al., 2016) stands out among the different CNNs for sound classification for two reasons. First, it is trained on large-scale unlabeled sound data using visual information as a bridge, while many other networks are trained on smaller datasets. Second, SoundNet directly takes one-dimensional raw wave signals as input, so there is no need to compute time-consuming audio-specific features, e.g. MFCC (Pols et al., 1966; Davis & Mermelstein, 1980) and spectrograms (Flanagan, 1972). SoundNet has yielded significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene classification. In this paper, we demonstrate that the proposed WSNet achieves comparable or even better performance than SoundNet with significantly smaller size and faster speed.
2.2 Deep Model Compression and Acceleration
Early approaches to deep model compression include (LeCun et al., 1989; Hassibi & Stork, 1993), which prune the connections in networks based on second-order information. Most recent works in network compression adopt weight pruning (Han et al., 2015; Collins & Kohli, 2014; Anwar et al., 2017; Lebedev & Lempitsky, 2016; Kim et al., 2015; Luo et al., 2017; Li et al., 2017), filter factorization (Sindhwani et al., 2015; Denton et al., 2014) and weight quantization (Han et al., 2016). However, although those works reduce model size, they also suffer from large performance drops. Jin et al. (2016) propose an iterative hard thresholding method, but only achieve relatively small compression ratios. Gong et al. (2014) use a binning method that can only be applied to fully connected layers. Hinton et al. (2015) compress deep models by transferring the knowledge from pre-trained larger networks to smaller networks. In contrast, WSNet is able to learn compact representations for both convolutional and fully connected layers from scratch. The deep models learned by WSNet significantly reduce model size compared to the baselines, with comparable or even better performance.
In terms of deep model acceleration, the factorization and quantization methods listed above can also reduce computation latency at inference time. While irregular pruning (as done in most pruning methods (Han et al., 2016)) may even incur computational overhead, grouped pruning (Lebedev & Lempitsky, 2016) is able to accelerate networks. FFT (Mathieu et al., 2013) and LCNN (Bagherinezhad et al., 2016) are also used to speed up computation in practice. Comparatively, WSNet is superior in that it learns networks with both smaller model size and faster computation versus the baselines.
2.3 Efficient Model Design
WSNet presents a class of novel models that have the appealing properties of small model size and low computation cost. Some recently proposed efficient model architectures include the class of Inception models (Szegedy et al., 2015; Ioffe & Szegedy, 2015; Chollet, 2016), which adopt depthwise separable convolutions; the class of residual models (He et al., 2016; Xie et al., 2017; Chen et al., 2017), which use residual paths for efficient optimization; and factorized networks, which use fully factorized convolutions. MobileNet (Howard et al., 2017) and Flattened networks (Jin et al., 2014) are based on factorized convolutions. ShuffleNet (Zhang et al., 2017) uses group convolution and channel shuffle to reduce computational cost. Compared with the above works, WSNet presents a new model design strategy that is more flexible and generalizable: the parameters in deep networks can be obtained conveniently from a more compact representation, e.g. through the weight sampling method proposed in this paper, or through other more complex methods based on learned statistical models.
In this section, we describe the details of the proposed WSNet for 1D CNNs. First, the notations are introduced. Second, we elaborate on the core components of WSNet: weight sampling along the spatial dimension and the channel dimension. Third, we introduce denser weight sampling to enhance the learning capability of WSNet. Finally, we propose an integral image method for accelerating WSNet in both training and inference.
Before diving into the details, we first introduce the notations used in this paper. A traditional 1D convolution layer takes as input the feature map $X \in \mathbb{R}^{T\times M}$ and produces an output feature map $Y \in \mathbb{R}^{T\times N}$, where $T$, $M$ and $N$ denote the spatial length of the input, the number of input channels and the number of filters respectively. Note here we assume the output has the same spatial size as the input, which holds true by using zero-padded convolution. The 1D convolution kernel $K$ used in the actual convolution of WSNet has the shape $(L, M, N)$, where $L$ is the kernel size. Let $K_n \in \mathbb{R}^{L\times M}$ denote the $n$-th filter and $X_t \in \mathbb{R}^{L\times M}$ denote the input patch that spatially spans from $t$ to $t+L-1$; then the convolution, assuming stride one and zero padding, is computed as:

$$ y_{t,n} = \langle K_n, X_t \rangle, \qquad (1) $$

where $\langle\cdot,\cdot\rangle$ stands for the vector inner product (applied to the flattened tensors).

In WSNet, instead of learning each weight independently, $K$ is obtained by sampling from a learned condensed filter $\Phi$ which has the shape $(L^*, M^*)$. The goal of training WSNet is thus cast to learning more compact DNNs which satisfy the condition $L^* M^* < LMN$. To quantify the advantage of WSNet in achieving compact networks, we define the compactness of a learned layer in WSNet w.r.t. the conventional layer with independently learned weights as:

$$ \text{compactness} = \frac{LMN}{L^* M^*}. $$
In the following, we show that WSNet learns compact networks by sampling weights along two dimensions: the spatial dimension and the channel dimension.
3.2 Weight sampling
3.2.1 Along spatial dimension
In conventional CNNs, the filters in a layer are learned independently, which has two disadvantages. First, the resulting DNNs have a large number of parameters, which impedes their deployment on computation-resource-constrained platforms. Second, such over-parameterization makes the network prone to overfitting and to getting stuck in (extra introduced) local minima. To solve these two problems, a novel weight sampling method is proposed to efficiently reuse the weights among filters. Specifically, in each convolutional layer of WSNet, all convolutional filters are sampled from the condensed filter $\Phi$, as illustrated in Figure 1. By scanning the condensed filter with a window of size $L$ and stride $S$, we can sample out $N$ filters of size $L$. Formally, the relation between the spatial length of the condensed filter and that of the sampled filters is:

$$ L^* = (N-1)S + L. \qquad (2) $$

The compactness along the spatial dimension is $LN/L^* \approx L/S$. Note that since the minimal value of $S$ is 1, the minimal value of $L^*$ (i.e. the minimum spatial length of the condensed filter) is $N+L-1$, and the maximal achievable compactness is therefore approximately $L$ (when $N \gg L$).
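As an illustration, spatial sampling can be sketched in a few lines of NumPy (the function name and concrete shapes are ours, for illustration only, not part of any released WSNet code):

```python
import numpy as np

def spatial_sample(condensed, L, S):
    """Sample N = (L* - L) / S + 1 overlapping filters of spatial size L
    from a 1-D condensed filter of shape (L*, M) by sliding a window
    with stride S; neighboring filters share (L - S) rows of weights."""
    L_star, M = condensed.shape
    N = (L_star - L) // S + 1
    return np.stack([condensed[n * S : n * S + L] for n in range(N)])

# A condensed filter with L* = (N - 1) * S + L = 10 yields N = 4 filters.
phi = np.arange(10, dtype=np.float32).reshape(10, 1)  # L* = 10, M = 1
filters = spatial_sample(phi, L=4, S=2)               # shape (4, 4, 1)
```

During training, gradients from all sampled filters accumulate into the shared rows of the condensed filter, which is what enforces the parameter sharing.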
3.2.2 Along Channel dimension
Although it is experimentally verified that the weight sampling strategy can learn compact deep models with negligible loss of classification accuracy (see Section 4), the maximal compactness is limited by the filter size $L$, as mentioned in Section 3.2.1.
To seek more compact networks without such a limitation, we propose a channel sharing strategy for WSNet to learn by weight sampling along the channel dimension. As illustrated in Figure 1 (top panel), the actual filter used in convolution is generated by repeating the sampled channels $C$ times. The relation between the channel numbers of the filters before and after channel sampling is:

$$ M = C \cdot M^*. $$

Therefore, the compactness of WSNet along the channel dimension achieves $C$. As shown later in the experiments (Section 4), we observe that repeated weight sampling along the channel dimension significantly reduces the model size of WSNet without significant performance drop. One notable advantage of channel sharing is that the maximum compactness can be as large as $M$ (i.e. when the condensed filter has a channel number of 1), which paves the way for learning much more aggressively compressed models (e.g. more than 100× smaller than the baselines).
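Channel sampling amounts to tiling the condensed filter along the channel axis; a minimal sketch (with shapes of our own choosing):

```python
import numpy as np

def channel_sample(condensed, C):
    """Repeat the M* channels of the condensed filter C times so that a
    filter with M* channels serves a layer whose input has M = C * M*
    channels."""
    return np.tile(condensed, (1, C))

phi = np.arange(6, dtype=np.float32).reshape(3, 2)  # L = 3, M* = 2
kernel = channel_sample(phi, C=4)                   # shape (3, 8), M = 8
```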
The above analysis of weight sampling along the spatial/channel dimensions can be conveniently generalized from convolutional layers to fully connected layers. For a fully connected layer, we treat its weights as a flattened vector with a channel number of 1, along which spatial sampling (ref. Section 3.2.1) is performed to reduce the number of learnable parameters. For example, for the fully connected layer “fc1” in the baseline network in Table 1, the filter size, channel number and filter number are 1536, 1 and 256 respectively. We can therefore perform spatial sampling for “fc1” to learn a more compact representation. Compared with convolutional layers, which generally have small filter sizes and thus limited compactness along the spatial dimension, fully connected layers can achieve larger compactness along the spatial dimension without harming performance, as demonstrated by the experimental results (ref. Section 4.2).
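Using the “fc1” numbers above, a hypothetical sketch of spatial sampling applied to a fully connected layer (the stride S = 64 is our own choice for illustration, not a value from the paper):

```python
import numpy as np

# fc1: 256 output units, each with a 1536-dim weight vector, all sampled
# from one condensed vector of length L* = (N - 1) * S + L.
L, N, S = 1536, 256, 64
L_star = (N - 1) * S + L                  # 17856 learnable parameters
phi = np.random.randn(L_star).astype(np.float32)
W = np.stack([phi[n * S : n * S + L] for n in range(N)])  # (256, 1536)
# Compactness: (256 * 1536) / 17856, i.e. roughly 22x fewer parameters.
```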
3.3 Denser Weight Sampling
The performance of WSNet might be adversely affected when the size of the condensed filter is decreased aggressively (i.e. when $S$ and $C$ are large). To enhance the learning capability of WSNet, we can sample more filters for layers with significantly reduced sizes. Specifically, we use a smaller sampling stride $\tilde{S}$ ($\tilde{S} < S$) when performing spatial sampling, which yields $\tilde{N}$ ($\tilde{N} > N$) filters. In order to keep the shape of the weights in the following layer unchanged, we append a 1×1 convolution layer with shape $(1, \tilde{N}, N)$ to reduce the channels of the densely sampled output. It is experimentally verified in Section 4 that denser weight sampling can effectively improve the performance of WSNet. However, since it also brings extra parameters and computational cost to WSNet, denser weight sampling is only used in the lower layers of WSNet, whose filter numbers ($N$) are small. Besides, one can also conduct channel sampling on the added 1×1 convolution layers to further reduce their sizes.
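For 1-D feature maps, the appended 1×1 convolution is just a matrix product over the channel dimension; a sketch with assumed shapes (the values of T, N_tilde and N are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N_tilde, N = 100, 64, 32                   # assumed layer sizes
features = rng.standard_normal((T, N_tilde))  # output of the densely
                                              # sampled convolution
w_1x1 = rng.standard_normal((N_tilde, N))     # learned 1x1 conv weights
reduced = features @ w_1x1                    # back to N channels: (T, N)
```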
3.4 Efficient Computation with integral image
According to Equation 1, the computation cost, in terms of the number of multiplications and adds (i.e. mult-adds), of a conventional convolutional layer is:

$$ C_{\mathrm{conv}} = T \cdot L \cdot M \cdot N. $$
However, as illustrated in Figure 2, since all filters in a layer of WSNet are sampled from a condensed filter with stride $S$, calculating the convolution in the conventional way of Equation 1 incurs severe computational redundancy. Concretely, as can be seen from Eq. (1), each item in the output feature map is equal to the summation of $L$ inner products between the row vectors of the filter and the corresponding row vectors of the input patch. Therefore, when two overlapped filters sampled from the condensed filter convolve with two overlapped input windows (see Fig. 2), some partially repeated calculations exist (e.g. the calculations highlighted in green and indicated by arrows in Fig. 2). To eliminate such redundancy in convolution and speed up WSNet, we propose a novel integral image method that enables efficient computation via sharing computations.
We first calculate an inner product map $P$ which stores the inner products between each row vector in the input feature map (i.e. $\mathbf{x}_t$) and each row vector in the condensed filter (i.e. $\boldsymbol{\phi}_l$):

$$ P(t, l) = \langle \mathbf{x}_t, \boldsymbol{\phi}_l \rangle. \qquad (5) $$

The integral image for speeding up convolution is denoted as $I$. It has the same size as $P$ and can be conveniently obtained through the below recurrence:

$$ I(t, l) = I(t-1, l-1) + P(t, l). \qquad (6) $$

Based on $I$, each convolutional result can be obtained in time complexity of $O(1)$ as follows:

$$ y_{t,n} = I(t+L-1, nS+L-1) - I(t-1, nS-1). \qquad (7) $$
Recall that the $n$-th filter lies in the spatial range $[nS, nS+L-1]$ of the condensed filter $\Phi$. In Eq. (5)–Eq. (7), we omit the case of padding for clear description. When zero padding is applied, we can freely get the convolutional results for the padded areas even without using Eq. (7), since the corresponding entries of $I$ are zero.
Recall that $L$ is the filter size and $S$ is the pre-defined stride used when sampling filters from the condensed filter (ref. Eq. (2)).
In practice, we adopt a variant of the above method to further boost the computational efficiency of WSNet, as illustrated in Fig. 3. In Eq. (5), we would need to repeat $\Phi$ $C$ times along the channel dimension to match the channel number of the input $X$. However, we can instead first wrap the channels of $X$ by accumulating its values at intervals of $M^*$ along the channel dimension, yielding a thinner feature map $\hat{X}$ which has the same channel number as $\Phi$, i.e. $M^*$. Both Eq. (6) and Eq. (7) remain the same. The computational cost is then reduced to approximately $T(M + L^*M^* + N)$, which is far below the $T \cdot L \cdot M \cdot N$ cost of conventional convolution.
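Putting Eqs. (5)–(7) together, a reference implementation of the integral-image convolution might look as follows (variable names are ours; the input is assumed to be already channel-wrapped to $M^*$ channels, and padding is omitted):

```python
import numpy as np

def wsnet_conv(x, phi, L, S, N):
    """Convolve input x of shape (T, M*) with the N filters sampled from
    the condensed filter phi of shape (L*, M*), where L* = (N-1)*S + L,
    using the integral-image trick: each output costs O(1) on top of the
    shared inner-product map."""
    T = x.shape[0]
    L_star = phi.shape[0]
    P = x @ phi.T                       # Eq. (5): P[t, l] = <x_t, phi_l>
    I = np.zeros((T + 1, L_star + 1))   # Eq. (6), with a zero border
    for t in range(T):
        I[t + 1, 1:] = I[t, :-1] + P[t]
    # Eq. (7): each output is a difference of two integral-image entries.
    y = np.empty((T - L + 1, N))
    for n in range(N):
        s = n * S                       # filter n starts at row s of phi
        for t in range(T - L + 1):
            y[t, n] = I[t + L, s + L] - I[t, s]
    return y
```

Computing $P$ costs $T \cdot L^* \cdot M^*$ mult-adds instead of the $T \cdot L \cdot M \cdot N$ of direct convolution, which is where the speed-up reported in the experiments comes from.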
Finally, we note that the integral image method applied in WSNet naturally takes advantage of a property of weight sampling: redundant computations exist between overlapped filters and input patches. Different from other deep model speedup methods (Sindhwani et al., 2015; Denton et al., 2014), which require solving time-consuming optimization problems and incur performance drops, the integral image method can be seamlessly embedded in WSNet without negatively affecting the final performance.
In this section, we present the details and analysis of our experimental results. Extensive ablation studies are conducted to verify the effectiveness of the proposed WSNet in learning compact and efficient networks. On all tested datasets, WSNet is able to improve the classification performance over the baseline networks while using 100× smaller models. When using even smaller (e.g. 180× smaller) models, WSNet achieves comparable performance w.r.t. the baselines. In addition, WSNet achieves 2×–4× acceleration compared to the baselines with a much smaller model (more than 100× smaller).
4.1 Experimental Settings
We collect a large-scale music detection dataset (MusicDet200K) from publicly available platforms (e.g. Facebook, Twitter, etc.) for conducting experiments. For fair comparison with prior work, we also test WSNet on three standard, publicly available datasets, i.e. ESC-50, UrbanSound8K and DCASE. The details of the used datasets are as follows.
MusicDet200K aims to assign each sample a binary label indicating whether it is music or not. MusicDet200K has 238,000 annotated sound clips overall. Each has a duration of 4 seconds and is resampled to 16000 Hz and normalized (Piczak, 2015b). Among all samples, we use 200,000/20,000/18,000 as the train/val/test set. The samples labeled “non-music” account for 70% of all samples, meaning that trivially assigning all samples the label “non-music” yields a classification accuracy of 70%.
ESC-50 (Piczak, 2015a) is a collection of 2000 short (5 seconds) environmental recordings comprising 50 equally balanced classes of sound events in 5 major groups (animals, natural soundscapes and water sounds, human non-speech sounds, interior/domestic sounds and exterior/urban noises) divided into 5 folds for cross-validation. Following Aytar et al. (2016), we extract 10 sound clips from each recording with length of 1 second and time step of 0.5 second (i.e. two neighboring clips have 0.5 seconds overlapped). Therefore, in each cross-validation, the number of training samples is 16000. In testing, we average over ten clips of each recording for the final classification result.
UrbanSound8K (Salamon et al., 2014) is a collection of 8732 short (around 4 seconds) recordings of various urban sound sources (air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren and street music). As in ESC-50, we extract 8 clips with a time length of 1 second and a time step of 0.5 second from each recording. For recordings shorter than 1 second, we pad them with zeros and repeat 8 times (i.e. the time step is 0.5 second).
DCASE (Stowell et al., 2015a) is used in the Detection and Classification of Acoustic Scenes and Events Challenge (DCASE). It contains 10 acoustic scene categories, 10 training examples per category and 100 testing examples. Each sample is a 30 seconds audio recording. During training, we evenly extract 12 sound clips with time length of 5 seconds and time step of 2.5 seconds from each recording.
To demonstrate that WSNet is capable of learning more compact and efficient models than conventional CNNs, three evaluation criteria are used in our experiments: model size, the number of multiply and adds in calculation (mult-adds) and classification accuracy.
To test the scalability of WSNet to different network architectures (e.g. with or without fully connected layers), two baseline networks are used in the comparison. The baseline network used on MusicDet200K consists of 7 convolutional layers and 2 fully connected layers, with which we demonstrate the effectiveness of WSNet on both convolutional and fully connected layers. For fair comparison with prior work, we first modify the state-of-the-art SoundNet (Aytar et al., 2016) by applying pooling layers to all but the last convolutional layer. As can be seen in Table 5, this modification significantly boosts the performance of the original SoundNet. We then use the modified SoundNet as the baseline on all three public datasets. The architectures of these two baseline networks are shown in Table 1 and Table 2 respectively.
Similar to other works (Han et al., 2016; Rastegari et al., 2016), we apply weight quantization to further reduce the size of WSNet. Specifically, the weights in each layer are linearly quantized into $2^b$ bins, where $b$ is a pre-defined number of bits. By setting all weights in the same bin to the same value, we only need to store a small index of the shared weight for each weight. The size of each bin is calculated as $(\max(\Phi) - \min(\Phi))/2^b$. Given $2^b$ bins, we only need $b$ bits to encode each index. Assuming each weight in WSNet is represented by a 4-byte float (32 bits) without weight quantization, the ratio of each layer's size before and after weight quantization is $\frac{32 L^* M^*}{32 \cdot 2^b + b L^* M^*}$. Recall that $L^*$ and $M^*$ are the spatial size and the channel number of the condensed filter. Since the condition $L^* M^* \gg 2^b$ generally holds in most layers of WSNet, weight quantization is able to reduce the model size by a factor of approximately $32/b$. Different from (Han et al., 2016; Rastegari et al., 2016), which learn the quantization during training, we apply weight quantization to WSNet after its training. In the experiments, we find that such an off-line approach is sufficient to reduce model size without losing accuracy.
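A minimal post-training sketch of the linear quantization described above (with b = 8, so each index fits in one byte; function and variable names are ours):

```python
import numpy as np

def quantize(w, b=8):
    """Linearly quantize weights into 2**b evenly spaced bins and return
    the per-weight bin indices plus the shared codebook of bin values."""
    n_bins = 2 ** b
    lo, hi = float(w.min()), float(w.max())
    step = (hi - lo) / (n_bins - 1)
    idx = np.round((w - lo) / step).astype(np.uint8)  # b = 8 bits each
    codebook = (lo + step * np.arange(n_bins)).astype(np.float32)
    return idx, codebook

w = np.random.randn(64, 32).astype(np.float32)
idx, codebook = quantize(w)
w_hat = codebook[idx]  # dequantized weights, within step/2 of originals
```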
WSNet is implemented and trained from scratch in TensorFlow (Abadi et al., 2016). Following Aytar et al. (2016), the Adam (Kingma & Ba, 2014) optimizer with a fixed learning rate of 0.001, a momentum term of 0.9 and a batch size of 64 is used throughout the experiments. We initialize all weights with zero-mean Gaussian noise with a standard deviation of 0.01. In the network used on MusicDet200K, the dropout ratio of the dropout layers (Srivastava et al., 2014) after each fully connected layer is set to 0.8. The overall training takes 100,000 iterations.
4.2 Results and analysis
Through controlled experiments, we investigate the effect of each component of WSNet on model size, computational cost and classification accuracy. The comparative results for different settings of WSNet are listed in Table 3. For clarity, we name WSNets with different settings by combinations of the symbols S/C/SC/D/Q. Please refer to the caption of Table 3 for their detailed meanings.
(1) Spatial sampling. We test the performance of WSNet with different sampling strides in spatial sampling. As listed in Table 3, S2 and S4 slightly outperform the baseline in classification accuracy, possibly because they reduce overfitting of the models. When the sampling stride is 8, i.e. the compactness in the spatial dimension is 8 (ref. Section 3.2.1), the classification accuracy of S8 only drops slightly, by 0.6%. Note that the maximum compactness along the spatial dimension is equal to the filter size; thus for the layer “conv7”, which has a filter size of 4, the compactness is limited to 4 (highlighted by underline in Table 3) in S8. The above results clearly demonstrate that spatial sampling enables WSNet to learn significantly smaller models with comparable accuracy w.r.t. the baseline.
(2) Channel sampling. Three different compactness values along the channel dimension, i.e. 2, 4 and 8, are tested against the baselines. It can be observed from Table 3 that C2, C4 and C8 have linearly reduced model sizes without incurring noticeable drops in accuracy. In fact, C2 even improves the accuracy over the baseline, demonstrating the effectiveness of channel sampling in WSNet. When learning more compact models, C8 performs better than S8, which has the same compactness (in the spatial dimension), suggesting that channel sampling should be favored when the compactness along the spatial dimension is already high.
We then simultaneously perform weight sampling along both the spatial and channel dimensions. As demonstrated by the results of the SC models in Table 3, WSNet can learn highly compact models (more than 20× smaller than the baselines) without noticeable performance drop (less than 0.5%).
(3) Denser weight sampling. Denser weight sampling is used to enhance the learning capability of WSNet under aggressive compactness (i.e. when $S$ and $C$ are large) and to make up for the performance loss caused by sharing too many parameters among filters. As shown in Table 3, by sampling 2× more filters in conv1, conv2 and conv3, the SCD model significantly outperforms the corresponding SC model. The above results demonstrate the effectiveness of denser weight sampling in boosting the performance of WSNet.
(4) Integral image for efficient computation. As evidenced in the last column of Table 3, the proposed integral image method consistently reduces the computation cost of WSNet. For the SC model that is 23× smaller than the baseline, the computation cost (in terms of #mult-adds) is significantly reduced, by 16.4×. Due to the extra computation cost brought by the 1×1 convolutions in denser filter sampling, the SCD model achieves lower acceleration (3.8×). Group convolution (Xie et al., 2017) can be used to alleviate the computation cost of the added 1×1 convolution layers. We will explore this direction in our future work.
(5) Weight quantization. It can be observed from Table 3 that, by using 256 bins to represent each weight with one byte (i.e. 8 bits), the weight-quantized model is reduced to 1/168 of the baseline's model size while incurring only a 0.1% accuracy loss. The above result demonstrates that weight quantization is complementary to WSNet, and they can be used jointly to effectively reduce the model size. Since we do not explore using weight quantization to accelerate models in this paper, the WSNets before and after weight quantization have the same computational cost.
[Baseline rows from the comparison tables: all per-layer compactness values of the baselines are 1; the MusicDet200K baseline achieves 88.9% accuracy with a 3M model size (1×) and 1.2e9 mult-adds (1×), while the baseline used on the public datasets has a 3M model size (1×) and 2.4e8 mult-adds (1×).]
Table 5: Comparison with state-of-the-art methods on ESC-50.

| Model | Settings | Acc. (%) | Model size |
| --- | --- | --- | --- |
| baseline | scratch init.; provided data | 66.0 | 13.7M (1×) |
| Piczak ConvNet (Piczak, 2015b) | - | 64.5 | - |
| SoundNet (Aytar et al., 2016) | scratch init.; provided data | 51.1 | 1× |
| SoundNet (Aytar et al., 2016) | pre-training; extra data | 72.9 | 1× |
Table 6: Comparison with state-of-the-art methods on UrbanSound8K.

| Model | Settings | Acc. (%) | Model size |
| --- | --- | --- | --- |
| baseline | raw sound wave | 70.39 | 13.7M (1×) |
| Piczak ConvNet (Piczak, 2015b) | pre-computed features | 73.1 | - |
| Salamon (Salamon & Bello, 2015) | pre-computed features | 73.6 | 1× |
The comparison of WSNet with other state-of-the-art methods on ESC-50 is listed in Table 5. The settings of WSNet used on ESC-50, UrbanSound8K and DCASE are listed in Table 4. Compared with the baseline, WSNet is able to significantly reduce the model size of the baseline by 25× and 45×, while at the same time improving the accuracy of the baseline by 0.5% and 0.1% respectively. The computation costs of WSNet are listed in Table 4, from which one can observe that WSNet achieves higher computational efficiency by reducing the #mult-adds of the baseline by 2.3× and 2.4×, respectively. Such promising results again demonstrate the effectiveness of WSNet in learning compact and efficient networks. After applying weight quantization to WSNet, its model size is reduced to only 1/180 of the baseline while the accuracy only slightly drops by 0.2%. Compared with the SoundNet trained from scratch on the provided data, WSNet significantly outperforms its classification accuracy by over 10% with a more than 100× smaller model. Using a transfer learning approach, SoundNet (Aytar et al., 2016) pre-trained on a large number of unlabeled videos achieves better accuracy than WSNet. However, since the training method is orthogonal to WSNet, we believe that WSNet can achieve better performance by training in a similar way as SoundNet (Aytar et al., 2016) on a large amount of unlabeled video data.
We report the comparison of WSNet with state-of-the-art methods on UrbanSound8K in Table 6. It is again observed that WSNet significantly reduces the model size of the baseline while obtaining comparable results. Both Piczak (2015b) and Salamon & Bello (2015) use pre-computed mel-spectrogram features during training. In comparison, the proposed WSNet simply takes the raw waveforms of recordings as input, avoiding the time-consuming feature extraction done in Piczak (2015b); Salamon & Bello (2015).
As evidenced in Table 7, WSNet outperforms the classification accuracy of the baseline by 1% with a 100× smaller model. When using an even more compact model, i.e. 180× smaller in model size, the classification accuracy of WSNet is only one percentage point lower than the baseline (i.e. it has only one more incorrectly classified sample), verifying the effectiveness of WSNet. Compared with SoundNet (Aytar et al., 2016), which utilizes a large amount of unlabeled data during training, WSNet (SCDQ), which is 100× smaller, achieves comparable results using only the provided data.
Table 7: Comparison with state-of-the-art methods on DCASE.

| Model | Settings | Acc. (%) | Model size |
| --- | --- | --- | --- |
| baseline | scratch init.; provided data | 85 | 13.7M (1×) |
| RG (Rakotomamonjy & Gasso, 2015) | - | 69 | - |
| LTT (Li et al., 2013) | - | 72 | - |
| RNH (Roma et al., 2013) | - | 77 | - |
| Ensemble (Stowell et al., 2015b) | - | 78 | - |
| SoundNet (Aytar et al., 2016) | pre-training; extra data | 88 | 1× |
In this paper, we present a class of Weight Sampling networks (WSNet) that are highly compact and efficient. A novel weight sampling method is proposed to sample filters from condensed filters that are much smaller than the independently trained filters of conventional networks. The weight sampling is conducted along two dimensions of the condensed filters, i.e. spatial sampling and channel sampling. Taking advantage of the overlapping property of the filters in WSNet, we propose an integral image method for efficient computation. Extensive experiments on four audio classification datasets, including MusicDet200K, ESC-50, UrbanSound8K and DCASE, clearly demonstrate that WSNet can learn compact and efficient networks with competitive performance.
- Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- Anwar et al. (2017) Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung. Structured pruning of deep convolutional neural networks. J. Emerg. Technol. Comput. Syst., 13(3):32:1–32:18, February 2017. ISSN 1550-4832. doi: 10.1145/3005348. URL http://doi.acm.org/10.1145/3005348.
- Aytar et al. (2016) Yusuf Aytar, Carl Vondrick, and Antonio Torralba. Soundnet: Learning sound representations from unlabeled video. In NIPS, 2016.
- Bagherinezhad et al. (2016) Hessam Bagherinezhad, Mohammad Rastegari, and Ali Farhadi. Lcnn: Lookup-based convolutional neural network. arXiv preprint arXiv:1611.06473, 2016.
- Barchiesi et al. (2015) Daniele Barchiesi, Dimitrios Giannoulis, Dan Stowell, and Mark D Plumbley. Acoustic scene classification: Classifying environments from the sounds they produce. IEEE Signal Processing Magazine, 32(3):16–34, 2015.
- Cai et al. (2006) Rui Cai, Lie Lu, Alan Hanjalic, Hong-Jiang Zhang, and Lian-Hong Cai. A flexible framework for key audio effects detection and auditory context inference. IEEE Transactions on audio, speech, and language processing, 14(3):1026–1039, 2006.
- Chen et al. (2017) Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. arXiv preprint arXiv:1707.01629, 2017.
- Chollet (2016) François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2016.
- Chu et al. (2006) Selina Chu, Shrikanth Narayanan, C-C Jay Kuo, and Maja J Mataric. Where am i? scene recognition for mobile robots using audio features. In ICME, 2006.
- Collins & Kohli (2014) Maxwell D Collins and Pushmeet Kohli. Memory bounded deep convolutional networks. arXiv preprint arXiv:1412.1442, 2014.
- Davis & Mermelstein (1980) Steven Davis and Paul Mermelstein. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP, Aug. 1980.
- Denton et al. (2014) Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
- Flanagan (1972) James L Flanagan. Speech analysis, synthesis and perception. Springer-Verlag, 1972.
- Gong et al. (2014) Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
- Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In NIPS, 2015.
- Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016.
- Hassibi & Stork (1993) Babak Hassibi and David G Stork. Second order derivatives for network pruning: Optimal brain surgeon. Morgan Kaufmann, 1993.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
- Jin et al. (2014) Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014.
- Jin et al. (2016) Xiaojie Jin, Xiaotong Yuan, Jiashi Feng, and Shuicheng Yan. Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423, 2016.
- Kim et al. (2015) Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530, 2015.
- Kingma & Ba (2014) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
- Landone et al. (2007) Christian Landone, Joseph Harrop, and Josh Reiss. Enabling access to sound archives through integration, enrichment and retrieval: the easaier project. In ISMIR, 2007.
- Lebedev & Lempitsky (2016) Vadim Lebedev and Victor Lempitsky. Fast convnets using group-wise brain damage. In CVPR, 2016.
- Lebedev et al. (2014) Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553, 2014.
- LeCun et al. (1989) Yann LeCun, John S Denker, Sara A Solla, Richard E Howard, and Lawrence D Jackel. Optimal brain damage. In NIPS, 1989.
- Li et al. (2013) David Li, Jason Tam, and Derek Toub. Auditory scene classification using machine learning techniques. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2013.
- Li et al. (2017) Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. In ICLR, 2017.
- Luo et al. (2017) Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
- Mathieu et al. (2013) Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851, 2013.
- Piczak (2015a) Karol J Piczak. Esc: Dataset for environmental sound classification. In ACM MM, 2015a.
- Piczak (2015b) Karol J Piczak. Environmental sound classification with convolutional neural networks. In MLSP, 2015b.
- Pols et al. (1966) Louis CW Pols et al. Spectral analysis and identification of dutch vowels in monosyllabic words. dissertation, 1966.
- Rakotomamonjy & Gasso (2015) Alain Rakotomamonjy and Gilles Gasso. Histogram of gradients of time-frequency representations for audio scene classification. IEEE Transactions on Audio, Speech and Language Processing, 23(1):142–153, 2015.
- Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, 2016.
- Roma et al. (2013) Gerard Roma, Waldo Nogueira, Perfecto Herrera, and Roc de Boronat. Recurrence quantification analysis features for auditory scene classification. IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events, 2, 2013.
- Salamon & Bello (2015) Justin Salamon and Juan Pablo Bello. Unsupervised feature learning for urban sound classification. In ICASSP, 2015.
- Salamon & Bello (2017) Justin Salamon and Juan Pablo Bello. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3):279–283, 2017.
- Salamon et al. (2014) Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. A dataset and taxonomy for urban sound research. In ACM MM, 2014.
- Schilit et al. (1994) Bill Schilit, Norman Adams, and Roy Want. Context-aware computing applications. IEEE Mobile Computing Systems and Applications, pp. 85–90, 1994.
- Simonyan & Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
- Sindhwani et al. (2015) Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. Structured transforms for small-footprint deep learning. In NIPS, 2015.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15(1):1929–1958, 2014.
- Stowell et al. (2015a) Dan Stowell, Dimitrios Giannoulis, Emmanouil Benetos, Mathieu Lagrange, and Mark D Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, 2015a.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
- Valenti et al. (2016) Michele Valenti, Aleksandr Diment, Giambattista Parascandolo, Stefano Squartini, and Tuomas Virtanen. Dcase2016 acoustic scene classification using convolutional neural networks. Workshop of Detection and classification of Acoustic Scenes Events, pp. 95–99, 2016.
- Xie et al. (2017) Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
- Xu et al. (2008) Yangsheng Xu, Wen Jung Li, and Ka Keung Lee. Intelligent wearable interfaces. John Wiley & Sons, 2008.
- Zhang et al. (2017) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.