WSNet: Compact and Efficient Networks with Weight Sampling

11/28/2017 ∙ by Xiaojie Jin, et al. ∙ National University of Singapore

We present a new approach and a novel architecture, termed WSNet, for learning compact and efficient deep neural networks. Existing approaches conventionally learn full model parameters independently at first and then compress them via ad hoc processing such as model pruning or filter factorization. In contrast, WSNet learns model parameters by sampling from a compact set of learnable parameters, which naturally enforces parameter sharing throughout the learning process. We show that this novel weight sampling approach (and the induced WSNet) favorably promotes both weight sharing and computation sharing. It can more efficiently learn much smaller networks with competitive performance, compared to baseline networks with an equal number of convolution filters. Specifically, we consider learning compact and efficient 1D convolutional neural networks for audio classification. Extensive experiments on multiple audio classification datasets verify the effectiveness of WSNet. Combined with weight quantization, the resulting models are up to 180x smaller and theoretically up to 16x faster than the well-established baselines, without noticeable performance drop.




1 Introduction

Although deep neural networks (DNNs) have achieved remarkable success in various applications, e.g. audio classification, speech recognition and natural language processing, they usually suffer from the following two problems caused by their inherently huge parameter space. First, most state-of-the-art deep architectures are prone to over-fitting even when trained on large datasets (Simonyan & Zisserman, 2015; Szegedy et al., 2015). Second, as they usually consume large amounts of storage memory and energy (Han et al., 2016), DNNs are difficult to embed into devices with limited memory and power (such as portable devices or chips). Most existing works attempt to reduce the computational budget through network pruning (Han et al., 2015; Anwar et al., 2017; Li et al., 2017; Collins & Kohli, 2014), filter factorization (Jaderberg et al., 2014; Lebedev et al., 2014), low-bit representation of weights (Rastegari et al., 2016) and knowledge transfer (Hinton et al., 2015). Different from all the above works, which ignore the strong dependencies among weights and learn filters independently based on existing network architectures, this paper proposes to explicitly enforce parameter sharing among filters to more effectively learn compact and efficient deep networks.

In this paper, we propose a Weight Sampling deep neural network (i.e. WSNet) to significantly reduce both the model size and computation cost of deep networks, achieving more than 100× smaller size and up to 16× speedup at a negligible performance drop, or even better performance than the baselines (i.e. conventional networks that learn filters independently). Specifically, WSNet is parameterized by layer-wise condensed filters from which each filter participating in actual convolutions can be directly sampled, in both spatial and channel dimensions. Since condensed filters have far fewer parameters than the independently trained filters of conventional CNNs, learning by sampling from them makes WSNet a more compact model than conventional CNNs. In addition, to reduce the ubiquitous computational redundancy in convolving the overlapped filters and input patches, we propose an integral image based method to dramatically reduce the computation cost of WSNet in both training and inference. The integral image method also enables weight sampling with different filter sizes at little computational overhead, which enhances the learning capability of WSNet.

To demonstrate the efficacy of WSNet, we conduct extensive experiments on the challenging acoustic scene classification and music detection tasks. On all the test datasets, including MusicDet200K (a self-collected dataset, as detailed in Section 4), ESC-50 (Piczak, 2015a), UrbanSound8K (Salamon et al., 2014) and DCASE (Stowell et al., 2015a), WSNet reduces the model sizes of the baselines by 100× with comparable or even higher classification accuracy. When compressing by more than 180×, WSNet is subject to only a negligible accuracy drop. At the same time, WSNet also significantly reduces the computation cost (by up to 16×). Such results strongly evidence the capability of WSNet to learn compact and efficient networks. Note that although the experiments in this paper are mostly limited to 1D CNNs, the same approach can be naturally generalized to 2D CNNs, which we will explore in the future.

2 Related Works

2.1 Audio classification

In this paper we consider Acoustic Scene Classification (ASC) tasks as well as a music detection task. ASC aims to classify the surrounding environment in which an audio stream is generated, given the audio input (Barchiesi et al., 2015). It can be applied in many different applications such as audio tagging (Cai et al., 2006), audio collections management (Landone et al., 2007), robotic navigation (Chu et al., 2006), intelligent wearable interfaces (Xu et al., 2008), context adaptive tasks (Schilit et al., 1994), etc. Music detection is a related task that determines whether a small segment of audio is music or not. It is usually treated as a binary classification problem given an audio segment as input, i.e., to classify the segment into two categories, music or non-music.

As in many other areas, convolutional neural networks (CNNs) have been widely used in audio classification tasks (Valenti et al., 2016; Salamon & Bello, 2017). SoundNet (Aytar et al., 2016) stands out among different CNNs for sound classification for two reasons. First, it is trained on large-scale unlabeled sound data using visual information as a bridge, while many other networks are trained with smaller datasets. Second, SoundNet directly takes the one-dimensional raw wave signals as input, so there is no need to calculate time-consuming audio-specific features, e.g. MFCC (Pols et al., 1966; Davis & Mermelstein, 1980) and spectrogram (Flanagan, 1972). SoundNet has yielded significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene classification. In this paper, we demonstrate that the proposed WSNet achieves comparable or even better performance than SoundNet with significantly smaller size and faster speed.

2.2 Deep Model Compression and Acceleration

Early approaches for deep model compression include (LeCun et al., 1989; Hassibi & Stork, 1993), which prune the connections in networks based on second-order information. Most recent works in network compression adopt weight pruning (Han et al., 2015; Collins & Kohli, 2014; Anwar et al., 2017; Lebedev & Lempitsky, 2016; Kim et al., 2015; Luo et al., 2017; Li et al., 2017), filter factorization (Sindhwani et al., 2015; Denton et al., 2014) and weight quantization (Han et al., 2016). However, although those works reduce model size, they also suffer from a large performance drop. Jin et al. (2016) propose an iterative hard thresholding method, but only achieve relatively small compression ratios. Gong et al. (2014) use a binning method which can only be applied to fully connected layers. Hinton et al. (2015) compress deep models by transferring the knowledge from pre-trained larger networks to smaller networks. In contrast, WSNet is able to learn compact representations for both convolution layers and fully connected layers from scratch. The deep models learned by WSNet are significantly smaller than the baselines, with comparable or even better performance.

In terms of deep model acceleration, the factorization and quantization methods listed above can also reduce computation latency in inference. While irregular pruning (as done in most pruning methods (Han et al., 2016)) can even incur computational overhead, grouped pruning (Lebedev & Lempitsky, 2016) is able to accelerate networks. FFT (Mathieu et al., 2013) and LCNN (Bagherinezhad et al., 2016) are also used to speed up computation in practice. Comparatively, WSNet is superior in that it learns networks which have both smaller model size and faster computation than the baselines.

2.3 Efficient Model Design

WSNet presents a class of novel models that has the appealing properties of small model size and small computation cost. Some recently proposed efficient model architectures include the class of Inception models (Szegedy et al., 2015; Ioffe & Szegedy, 2015; Chollet, 2016), which adopt depthwise separable convolutions, the class of Residual models (He et al., 2016; Xie et al., 2017; Chen et al., 2017), which use residual paths for efficient optimization, and the factorized networks, which use fully factorized convolutions. MobileNet (Howard et al., 2017) and Flattened networks (Jin et al., 2014) are based on factorized convolutions. ShuffleNet (Zhang et al., 2017) uses group convolution and channel shuffle to reduce computational cost. Compared with the above works, WSNet presents a new model design strategy which is more flexible and generalizable: the parameters in deep networks can be obtained conveniently from a more compact representation, e.g. through the weight sampling method proposed in this paper or other more complex methods based on learned statistical models.

3 Method

In this section, we describe details of the proposed WSNet for 1D CNNs. First, the notations are introduced. Secondly, we elaborate on the core components in WSNet: weight sampling along the spatial dimension and channel dimension. Thirdly, we introduce the denser weight sampling to enhance the learning capability of WSNet. Finally, we propose an integral image method for accelerating WSNet in both training and inference.

3.1 Notations

Before diving into the details, we first introduce the notations used in this paper. The traditional 1D convolution layer takes as input the feature map $F \in \mathbb{R}^{T \times M}$ and produces an output feature map $G \in \mathbb{R}^{T \times N}$, where $(T, M, N)$ denote the spatial length of the input, the number of input channels and the number of filters respectively. Note that here we assume the output has the same spatial size as the input, which holds true when using zero-padded convolution. The 1D convolution kernel $K$ used in the actual convolution of WSNet has the shape $(L, M, N)$, where $L$ is the kernel size. Let $k_n$, $n \in \{0, \dots, N-1\}$, denote a filter and $f_t$, $t \in \{0, \dots, T-1\}$, denote an input patch that spatially spans from $t$ to $t + L - 1$; then the convolution, assuming stride one and zero padding, is computed as:

$$G_{t,n} = f_t \cdot k_n = \sum_{i=0}^{L-1} \sum_{m=0}^{M-1} F_{t+i,\,m} \, K_{i,\,m,\,n} \tag{1}$$

where $\cdot$ stands for the vector inner product.

In WSNet, instead of learning each weight independently, the kernel $K$ is obtained by sampling from a learned condensed filter $\Phi$ which has the shape $(L^*, M^*)$. The goal of training WSNet is thus cast as learning more compact DNNs which satisfy the condition $L^* M^* < LMN$. To quantify the advantage of WSNet in achieving compact networks, we define the compactness of $K$ in a learned layer of WSNet w.r.t. the conventional layer with independently learned weights as:

$$\frac{LMN}{L^* M^*}$$

In the following, we show that WSNet learns compact networks by sampling weights in two dimensions: the spatial dimension and the channel dimension.

Figure 1: Illustration of WSNet, which learns small condensed filters with weight sampling along two dimensions: the spatial dimension (bottom panel) and the channel dimension (top panel). The figure depicts the procedure of generating two consecutive filters (in pink and purple respectively) that convolve with the input. In spatial sampling, filters are extracted from the condensed filter with a stride of $S$. In channel sampling, the channels of each filter are sampled repeatedly $C$ times to match the number of input channels. Please refer to Section 3.2 for detailed explanations.

3.2 Weight sampling

3.2.1 Along spatial dimension

In conventional CNNs, the filters in a layer are learned independently, which has two disadvantages. First, the resulting DNNs have a large number of parameters, which impedes their deployment on platforms with constrained computational resources. Second, such over-parameterization makes the network prone to overfitting and to getting stuck in (additionally introduced) local minima. To solve these two problems, a novel weight sampling method is proposed to efficiently reuse the weights among filters. Specifically, in each convolutional layer of WSNet, all convolutional filters $K$ are sampled from the condensed filter $\Phi$, as illustrated in Figure 1. By scanning the condensed filter with a window size of $L$ and a stride of $S$, we sample out $N$ filters, each with a filter size of $L$. Formally, the relation between the spatial length $L^*$ of the condensed filter and the size of the sampled filters is:

$$L^* = L + (N - 1)\,S \tag{2}$$

The compactness along the spatial dimension is $\frac{LMN}{L^* M} = \frac{LN}{L + (N-1)S} \approx \frac{L}{S}$. Note that since the minimal value of $S$ is 1, the minimal value of $L^*$ (i.e. the minimum spatial length of the condensed filter) is $L + N - 1$ and the maximal achievable compactness is therefore $L$.
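The spatial sampling step can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: it assumes a single-channel condensed filter stored as a list, and the helper names `condensed_length` and `sample_filters_spatial` are hypothetical.

```python
def condensed_length(filter_size, num_filters, stride):
    """Spatial length of the condensed filter: L* = L + (N - 1) * S."""
    return filter_size + (num_filters - 1) * stride


def sample_filters_spatial(condensed, filter_size, stride):
    """Slide a window of length `filter_size` over `condensed` with step `stride`.

    Neighbouring filters share `filter_size - stride` weights.
    """
    last_start = len(condensed) - filter_size
    return [condensed[s:s + filter_size] for s in range(0, last_start + 1, stride)]


# Toy sizes chosen for illustration only (not taken from the paper).
L, N, S = 4, 3, 2
L_star = condensed_length(L, N, S)
condensed = list(range(L_star))        # stand-in for learned weights
filters = sample_filters_spatial(condensed, L, S)
spatial_compactness = (L * N) / L_star  # independent weights / shared weights
```

With these toy sizes, three filters of length 4 are drawn from 8 shared weights instead of 12 independent ones, so the compactness is below the asymptotic value of L / S = 2 because N is small.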

3.2.2 Along Channel dimension

Although it is experimentally verified that the weight sampling strategy can learn compact deep models with negligible loss of classification accuracy (see Section 4), the maximal compactness is limited by the filter size $L$, as mentioned in Section 3.2.1.

To seek more compact networks without such a limitation, we propose a channel sharing strategy for WSNet that learns by weight sampling along the channel dimension. As illustrated in Figure 1 (top panel), the actual filter used in convolution is generated by repeating the sampling $C$ times. The relation between the number of channels $M^*$ of the condensed filter before channel sampling and the number of channels $M$ of the filters after channel sampling is:

$$M = C \times M^* \tag{3}$$

Therefore, the compactness of WSNet along the channel dimension is $C$. As introduced later in Experiments (Section 4), we observe that the repeated weight sampling along the channel dimension significantly reduces the model size of WSNet without a significant performance drop. One notable advantage of channel sharing is that the maximum compactness can be as large as $M$ (i.e. when the condensed filter has a single channel), which paves the way for learning much smaller models (e.g. models more than 100× smaller than the baselines).
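Channel sampling can be illustrated in the same style. The sketch below assumes that repetition means tiling the condensed channels block by block, which is an assumption about the layout; `repeat_channels` is a hypothetical helper name.

```python
def repeat_channels(condensed_filter, repeats):
    """Tile the channels of a condensed filter.

    `condensed_filter` is a list of L rows, each holding M* channel weights.
    The result has L rows with M = repeats * M* channels each.
    """
    return [row * repeats for row in condensed_filter]


# Toy example: L = 3 spatial positions, M* = 2 condensed channels, C = 3 repeats.
condensed_filter = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = 3
full_filter = repeat_channels(condensed_filter, C)

M_star = len(condensed_filter[0])
M = len(full_filter[0])
channel_compactness = M / M_star  # equals C
```

Only the `L * M*` condensed weights are stored and trained; the tiled filter is a view used for the convolution.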

The above analysis of weight sampling along the spatial and channel dimensions can be conveniently generalized from convolution layers to fully connected layers. For a fully connected layer, we treat its weights as a flattened vector with a single channel, along which the spatial sampling (ref. Section 3.2.1) is performed to reduce the number of learnable parameters. For example, for the fully connected layer “fc1” in the baseline network in Table 1, the filter size, channel number and filter number are 1536, 1 and 256 respectively. We can therefore perform spatial sampling for “fc1” to learn a more compact representation. Compared with convolutional layers, which generally have small filter sizes and thus limited compactness along the spatial dimension, the fully connected layers can achieve larger compactness along the spatial dimension without harming the performance, as demonstrated in the experimental results (ref. Section 4.2).

3.3 Denser Weight Sampling

The performance of WSNet might be adversely affected when the size of the condensed filter is decreased aggressively (i.e. when the compactness along the spatial and channel dimensions is large). To enhance the learning capability of WSNet, we can sample more filters for layers with significantly reduced sizes. Specifically, we use a smaller sampling stride $\bar{S}$ ($\bar{S} < S$) when performing spatial sampling. In order to keep the shape of the weights unchanged in the following layer, we append a 1×1 convolution layer with the shape $(1, \bar{N}, N)$, where $\bar{N}$ is the number of densely sampled filters, to reduce the channels of the densely sampled filters. The experiments in Section 4 verify that denser weight sampling can effectively improve the performance of WSNet. However, since it also brings extra parameters and computational cost to WSNet, denser weight sampling is only used in the lower layers of WSNet, whose filter number ($N$) is small. Besides, one can also conduct channel sampling on the added 1×1 convolution layers to further reduce their sizes.

Figure 2: Illustration of efficient computation with the integral image in WSNet. The inner product map $P \in \mathbb{R}^{T \times L^*}$ stores the inner product of each row in $F$ and each column in $\Phi$ as in Eq. (5). The convolution result between a filter sampled from $\Phi$ and an input patch is then the summation of all values in the diagonal segment between $(u, v)$ and $(u + L - 1, v + L - 1)$ in $P$ (recall that $L$ is the convolutional filter size). Since there are repeated calculations when filters and input patches overlap, e.g. the green segment indicated by the arrow, we construct the integral image $I$ from $P$ according to Eq. (6). Based on $I$, the convolutional result between any sampled filter and input patch can be retrieved directly in time complexity of $O(1)$ according to Eq. (7), e.g. the result for the segment starting at $(u, v)$ is $I(u + L - 1, v + L - 1) - I(u - 1, v - 1)$. For notation definitions, please refer to Sec. 3.1. The comparisons of computation costs between WSNet and the baselines using conventional architectures are introduced in Section 3.4.

Figure 3: A variant of the integral image method used in practice, which is more efficient than that illustrated in Figure 2. Instead of repeatedly sampling along the channel dimension of $\Phi$ to convolve with the input $F$, we wrap the channels of $F$ by summing up the $C$ matrices that are evenly divided from $F$ along the channels, i.e. $\tilde{F}(i, j) = \sum_{c=0}^{C-1} F(i, j + cM^*)$. Since the number of channels of $\tilde{F}$ is only $1/C$ of that of $F$, the overall computation cost is reduced as demonstrated in Eq. (9).

3.4 Efficient Computation with integral image

According to Eq. (1), the computation cost in terms of the number of multiplications and additions (i.e. Mult-Adds) in a conventional convolutional layer is:

$$\text{Mult-Adds} = TMLN \tag{4}$$

However, as illustrated in Figure 2, since all filters in a layer of WSNet are sampled from a condensed filter $\Phi$ with stride $S$, calculating the results of convolution in the conventional way as in Eq. (1) incurs severe computational redundancy. Concretely, as can be seen from Eq. (1), one item in the output feature map is equal to the summation of $L$ inner products between row vectors of $F$ and column vectors of $\Phi$. Therefore, when two overlapping filters sampled from the condensed filter convolve with overlapping input windows, some calculations are partially repeated (e.g. the calculations highlighted in green and indicated by the arrow in Fig. 2). To eliminate such redundancy in convolution and speed up WSNet, we propose a novel integral image method that enables efficient computation by sharing computations.

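We first calculate an inner product map $P \in \mathbb{R}^{T \times L^*}$ which stores the inner products between each row vector in the input feature map (i.e. $F$) and each column vector in the condensed filter (i.e. $\Phi$):

$$P(u, v) = \langle F_u, \Phi_v \rangle, \quad u \in \{0, \dots, T-1\},\ v \in \{0, \dots, L^*-1\} \tag{5}$$

where $F_u$ is the channel vector of $F$ at spatial position $u$ and $\Phi_v$ is the channel vector of $\Phi$ at spatial position $v$.

The integral image for speeding up convolution is denoted as $I$. It has the same size as $P$ and can be conveniently obtained through the formulation below:

$$I(u, v) = \begin{cases} P(u, v), & u = 0 \text{ or } v = 0 \\ I(u-1, v-1) + P(u, v), & \text{otherwise} \end{cases} \tag{6}$$

Based on $I$, all convolutional results can be obtained in time complexity of $O(1)$ as follows:

$$G(u, n) = I(u + L - 1,\ nS + L - 1) - I(u - 1,\ nS - 1) \tag{7}$$

where $I(u, v)$ is taken to be 0 whenever $u < 0$ or $v < 0$. Recall that the $n$-th filter lies in the spatial range $(nS,\ nS + L - 1)$ in the condensed filter $\Phi$. In Eq. (5) to Eq. (7), we omit the case of padding for clarity. When zero padding is applied, the convolutional results for the padded areas come at no extra cost even without using Eq. (7), since the inner products involving zero-padded input rows are all zero.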
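The three steps (inner product map, diagonal integral image, constant-time lookup) can be checked against a direct convolution with a small pure-Python sketch. It makes several assumptions for illustration: no padding, 0-based indices, a condensed filter whose channel count already matches the input, and hypothetical function names.

```python
def inner_product_map(F, Phi):
    """P[u][v] = <F[u], Phi[v]> for every input row u and filter position v."""
    return [[sum(a * b for a, b in zip(f_row, phi_row)) for phi_row in Phi]
            for f_row in F]


def diagonal_integral(P):
    """I[u][v] = P[u][v] + I[u-1][v-1]; terms outside the map count as 0."""
    T, L_star = len(P), len(P[0])
    I = [[0] * L_star for _ in range(T)]
    for u in range(T):
        for v in range(L_star):
            prev = I[u - 1][v - 1] if u > 0 and v > 0 else 0
            I[u][v] = P[u][v] + prev
    return I


def conv_from_integral(I, L, S, N):
    """G[u][n] = I[u+L-1][nS+L-1] - I[u-1][nS-1] for unpadded positions."""
    G = []
    for u in range(len(I) - L + 1):
        row = []
        for n in range(N):
            v = n * S
            below = I[u - 1][v - 1] if u > 0 and v > 0 else 0
            row.append(I[u + L - 1][v + L - 1] - below)
        G.append(row)
    return G


def conv_direct(F, Phi, L, S, N):
    """Reference: convolve each sampled filter with each input patch directly."""
    M = len(F[0])
    return [[sum(F[u + i][m] * Phi[n * S + i][m]
                 for i in range(L) for m in range(M))
             for n in range(N)]
            for u in range(len(F) - L + 1)]


# Toy data: T = 5, M = 2, L = 2, S = 1, N = 3, hence L* = L + (N - 1) * S = 4.
F = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
Phi = [[1, 0], [0, 1], [2, 1], [1, 3]]
L, S, N = 2, 1, 3

I = diagonal_integral(inner_product_map(F, Phi))
G_fast = conv_from_integral(I, L, S, N)
G_ref = conv_direct(F, Phi, L, S, N)
```

Each output in `G_fast` costs one subtraction once `I` is built, whereas `conv_direct` recomputes the overlapping inner products for every filter.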

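Based on Eq. (5) to Eq. (7), the computation cost of the proposed integral image method is

$$\text{Mult-Adds} = \underbrace{TML^*}_{\text{Eq. (5)}} + \underbrace{TL^*}_{\text{Eq. (6)}} + \underbrace{TN}_{\text{Eq. (7)}} \tag{8}$$

Based on Eq. (4), Eq. (8) and Eq. (2), the theoretical acceleration ratio is

$$\frac{TMLN}{TML^* + TL^* + TN} \approx \frac{LN}{L^*} \approx \frac{L}{S}$$

Recall that $L$ is the filter size and $S$ is the pre-defined stride when sampling filters from the condensed filter $\Phi$ (ref. Eq. (2)).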
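To get a feel for how the acceleration depends on the filter size and the sampling stride, the operation counts can be tabulated with a small helper. The counting convention here is an assumption made for illustration: one multiply-add per product term in the inner product map, one addition per entry of the integral image, and one subtraction per output value. The layer sizes are likewise made up.

```python
def baseline_mult_adds(T, M, L, N):
    """Conventional 1D convolution: every output needs L * M multiply-adds."""
    return T * M * L * N


def integral_image_mult_adds(T, M, L, N, S):
    """Cost of the integral image method under the counting assumptions above."""
    L_star = L + (N - 1) * S       # spatial length of the condensed filter
    inner_products = T * M * L_star
    integral_image = T * L_star
    lookups = T * N
    return inner_products + integral_image + lookups


# Hypothetical layer: T = 1000 positions, M = 32 channels, L = 32, N = 64, S = 4.
T, M, L, N, S = 1000, 32, 32, 64, 4
baseline = baseline_mult_adds(T, M, L, N)
fast = integral_image_mult_adds(T, M, L, N, S)
speedup = baseline / fast
upper_bound = L / S
```

For this made-up layer the ratio stays somewhat below L / S, since the condensed filter is slightly longer than N * S and the integral image and lookups add lower-order terms.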

In practice, we adopt a variant of the above method to further boost the computation efficiency of WSNet, as illustrated in Fig. 3. In Eq. (5), we repeat $\Phi$ $C$ times along the channel dimension to make its number of channels equal to that of the input $F$. However, we can first wrap the channels of $F$ by accumulating the values at intervals of $M^*$ along its channel dimension into a thinner feature map $\tilde{F} \in \mathbb{R}^{T \times M^*}$ which has the same number of channels as $\Phi$, i.e. $\tilde{F}(u, v) = \sum_{c=0}^{C-1} F(u, v + cM^*)$. Both Eq. (6) and Eq. (7) remain the same. Then the computational cost is reduced to

$$\text{Mult-Adds} = TM + TM^*L^* + TL^* + TN \tag{9}$$

where the first item is the computational cost of wrapping the channels of $F$ to obtain $\tilde{F}$. Accordingly, by combining Eq. (9) and Eq. (4), the theoretical acceleration compared to the baseline is

$$\frac{TMLN}{TM + TM^*L^* + TL^* + TN} \approx \frac{MLN}{M^*L^*} \approx \frac{CL}{S}$$
Finally, we note that the integral image method applied in WSNet naturally takes advantage of a property of weight sampling: redundant computations exist between overlapped filters and input patches. Unlike other deep model speedup methods (Sindhwani et al., 2015; Denton et al., 2014), which require solving time-consuming optimization problems and incur a performance drop, the integral image method can be seamlessly embedded in WSNet without negatively affecting the final performance.
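The channel-wrapping variant rests on a simple identity: the inner product of an input row with a channel-tiled filter vector equals the inner product of the wrapped row with the condensed vector. The sketch below checks this on toy data; the block-tiling layout and the helper names are assumptions for illustration.

```python
def wrap_channels(F, m_star):
    """Fold M channels into m_star channels by summing entries m_star apart."""
    return [[sum(row[j::m_star]) for j in range(m_star)] for row in F]


def dot(a, b):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Toy input with T = 2 rows and M = 6 channels; condensed vector with M* = 2.
F = [[1, 2, 3, 4, 5, 6], [6, 5, 4, 3, 2, 1]]
phi_v = [2, -1]
C = 3

tiled_phi_v = phi_v * C            # what repeated channel sampling would build
F_wrapped = wrap_channels(F, len(phi_v))

full_products = [dot(row, tiled_phi_v) for row in F]
wrapped_products = [dot(row, phi_v) for row in F_wrapped]
```

The wrapped version needs M* products per entry of the inner product map instead of M, at the price of one pass over the input to accumulate the channels.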

4 Experiments

In this section, we present the details and analysis of the results of our experiments. Extensive ablation studies are conducted to verify the effectiveness of the proposed WSNet in learning compact and efficient networks. On all tested datasets, WSNet is able to improve the classification performance over the baseline networks while using 100× smaller models. When using an even smaller (e.g. 180×) model size, WSNet achieves comparable performance w.r.t. the baselines. In addition, WSNet achieves 2× to 4× acceleration compared to the baselines with a much smaller model (more than 100× smaller).

4.1 Experimental Settings


We collect a large-scale music detection dataset (MusicDet200K) from publicly available platforms (e.g. Facebook, Twitter, etc.) for conducting experiments. For fair comparison with previous literature, we also test WSNet on three standard, publicly available datasets, i.e. ESC-50, UrbanSound8K and DCASE. The details of the datasets used are as follows.

MusicDet200K aims to assign each sample a binary label indicating whether it is music or not. MusicDet200K has 238,000 annotated sound clips overall. Each has a duration of 4 seconds and is resampled to 16000 Hz and normalized (Piczak, 2015b). Among all samples, we use 200,000/20,000/18,000 as the train/val/test sets. The samples belonging to “non-music” account for 70% of all samples, which means that if we trivially assign all samples to “non-music”, the classification accuracy is 70%.

ESC-50 (Piczak, 2015a) is a collection of 2000 short (5 seconds) environmental recordings comprising 50 equally balanced classes of sound events in 5 major groups (animals, natural soundscapes and water sounds, human non-speech sounds, interior/domestic sounds and exterior/urban noises) divided into 5 folds for cross-validation. Following Aytar et al. (2016), we extract 10 sound clips from each recording with length of 1 second and time step of 0.5 second (i.e. two neighboring clips have 0.5 seconds overlapped). Therefore, in each cross-validation, the number of training samples is 16000. In testing, we average over ten clips of each recording for the final classification result.
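The test-time aggregation over clips can be sketched as follows. The text does not state which quantity is averaged, so this sketch assumes per-clip class probabilities; the function names and the scores are made up for illustration.

```python
def average_clip_scores(clip_scores):
    """Average per-class scores over all clips of one recording."""
    num_clips = len(clip_scores)
    num_classes = len(clip_scores[0])
    return [sum(clip[c] for clip in clip_scores) / num_clips
            for c in range(num_classes)]


def classify_recording(clip_scores):
    """Return the class index with the highest averaged score."""
    mean_scores = average_clip_scores(clip_scores)
    return max(range(len(mean_scores)), key=lambda c: mean_scores[c])


# Made-up scores for three clips of one recording over three classes.
clip_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
]
predicted = classify_recording(clip_scores)
```

Averaging lets a recording be classified correctly even when an individual clip, such as the first one above, favours another class.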

UrbanSound8K (Salamon et al., 2014) is a collection of 8732 short (around 4 seconds) recordings of various urban sound sources (air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren and street music). As in ESC-50, we extract 8 clips with a length of 1 second and a time step of 0.5 second from each recording. Recordings shorter than 1 second are padded with zeros and repeated 8 times (i.e. the time step is 0.5 second).

DCASE (Stowell et al., 2015a) is used in the Detection and Classification of Acoustic Scenes and Events Challenge (DCASE). It contains 10 acoustic scene categories, 10 training examples per category and 100 testing examples. Each sample is a 30 seconds audio recording. During training, we evenly extract 12 sound clips with time length of 5 seconds and time step of 2.5 seconds from each recording.

Evaluation criteria

To demonstrate that WSNet is capable of learning more compact and efficient models than conventional CNNs, three evaluation criteria are used in our experiments: model size, the number of multiply and adds in calculation (mult-adds) and classification accuracy.

Baseline networks

To test the scalability of WSNet to different network architectures (e.g. with or without fully connected layers), two baseline networks are used in the comparison. The baseline network used on MusicDet200K consists of 7 convolutional layers and 2 fully connected layers, with which we demonstrate the effectiveness of WSNet on both convolutional layers and fully connected layers. For fair comparison with previous literature, we first modify the state-of-the-art SoundNet (Aytar et al., 2016) by applying pooling layers to all but the last convolutional layer. As can be seen in Table 5, this modification significantly boosts the performance of the original SoundNet. We then use the modified SoundNet as the baseline on all three public datasets. The architectures of those two baseline networks are shown in Table 1 and Table 2 respectively.

Weight Quantization

Similar to other works (Han et al., 2016; Rastegari et al., 2016), we apply weight quantization to further reduce the size of WSNet. Specifically, the weights in each layer are linearly quantized to $q$ bins, where $q$ is a pre-defined number. By setting all weights in the same bin to the same value, we only need to store a small index of the shared weight for each weight. The size of each bin is calculated as $(\max(\Phi) - \min(\Phi)) / q$. Given $q$ bins, we only need $\log_2(q)$ bits to encode the index. Assuming each weight in WSNet is represented by a 4-byte floating point number (32 bits) without weight quantization, the ratio of each layer’s size before and after weight quantization is $\frac{32 L^* M^*}{L^* M^* \log_2(q) + 32q}$. Recall that $L^*$ and $M^*$ are the spatial size and the channel number of the condensed filter. Since the condition $L^* M^* \gg q$ generally holds in most layers of WSNet, weight quantization is able to reduce the model size by a factor of $\frac{32}{\log_2(q)}$. Different from (Han et al., 2016; Rastegari et al., 2016), which learn the quantization during training, we apply weight quantization to WSNet after its training. In the experiments, we find that such an off-line approach is sufficient to reduce model size without losing accuracy.
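A minimal sketch of such post-training linear quantization is given below. It assumes that each bin is represented by its centre value, which the text does not specify, and the function names are hypothetical.

```python
import math


def linear_quantize(weights, num_bins):
    """Map each weight to the index of an equal-width bin over [min, max].

    Returns the per-weight bin indices and the shared value of each bin
    (assumed here to be the bin centre).
    """
    lo, hi = min(weights), max(weights)
    width = (hi - lo) / num_bins
    if width == 0:
        return [0] * len(weights), [lo] * num_bins
    indices = [min(int((w - lo) / width), num_bins - 1) for w in weights]
    centers = [lo + (k + 0.5) * width for k in range(num_bins)]
    return indices, centers


def size_ratio(num_weights, num_bins, float_bits=32):
    """Layer size before quantization divided by size after quantization.

    After quantization each weight stores a log2(num_bins)-bit index and the
    layer stores one float per bin.
    """
    index_bits = math.log2(num_bins)
    return float_bits * num_weights / (num_weights * index_bits + float_bits * num_bins)


weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
indices, centers = linear_quantize(weights, num_bins=4)
dequantized = [centers[i] for i in indices]

# With 256 bins (one byte per index) and many weights, the ratio approaches 4.
ratio = size_ratio(num_weights=1_000_000, num_bins=256)
```

The ratio only approaches 32 / log2(q) when the number of weights is much larger than the number of bins, because the table of shared values must also be stored.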

| Layer | conv1 | conv2 | conv3 | conv4 | conv5 | conv6 | conv7 | fc1 | fc2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Filter sizes | 32 | 32 | 16 | 8 | 8 | 8 | 4 | 1536 | 256 |
| #Filters | 32 | 64 | 128 | 128 | 256 | 512 | 512 | 256 | 128 |
| Stride | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 |
| #Params | 1K | 65K | 130K | 130K | 260K | 1M | 1M | 390K | 33K |
| Params (%) | 0.03 | 2.1 | 4.2 | 4.2 | 8.4 | 33.7 | 33.7 | 12.6 | 1.1 |
| #Mult-Adds (×10^7) | 4.1 | 65.5 | 32.7 | 8.2 | 4.1 | 4.2 | 1.0 | 0.1 | 0.007 |
| Mult-Adds (%) | 3.4 | 54.5 | 27.3 | 6.8 | 3.4 | 3.5 | 0.9 | 0.1 | 0.005 |

Table 1: Baseline-1: configuration of the baseline network used on MusicDet200K. Each convolutional layer is followed by a nonlinearity layer (i.e. ReLU), a batch normalization layer and a pooling layer, which are omitted from the table for brevity. The strides of all pooling layers are 2. The padding strategies adopted for both convolutional layers and fully connected layers are all “size preserving”.
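The parameter counts in Table 1 can be reproduced as filter size × input channels × number of filters. The sketch below assumes a single-channel raw waveform as input to conv1 and treats each fully connected layer as a filter bank with one channel, as described in Section 3.2.2; biases and batch normalization parameters are ignored.

```python
# (name, filter size, input channels, number of filters)
LAYERS = [
    ("conv1", 32, 1, 32),      # input channel count of 1 is an assumption
    ("conv2", 32, 32, 64),
    ("conv3", 16, 64, 128),
    ("conv4", 8, 128, 128),
    ("conv5", 8, 128, 256),
    ("conv6", 8, 256, 512),
    ("conv7", 4, 512, 512),
    ("fc1", 1536, 1, 256),     # fully connected layers: flattened, one channel
    ("fc2", 256, 1, 128),
]

params = {name: size * channels * filters
          for name, size, channels, filters in LAYERS}
total_params = sum(params.values())
percentages = {name: round(100 * count / total_params, 1)
               for name, count in params.items()}
```

The total comes to roughly 3.1 million weights, in line with the 3M baseline model size quoted in Table 3, and conv6 and conv7 together hold about two thirds of them.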
| Layer | conv1 | conv2 | conv3 | conv4 | conv5 | conv6 | conv7 | conv8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Filter sizes | 64 | 32 | 16 | 8 | 4 | 4 | 4 | 8 |
| #Filters | 16 | 32 | 64 | 128 | 256 | 512 | 1024 | 1401 |
| Stride | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| #Params | 1K | 16K | 32K | 65K | 130K | 520K | 2M | 11M |
| Params (%) | 0.01 | 0.11 | 0.22 | 0.45 | 0.90 | 3.63 | 14.55 | 79.61 |
| #Mult-Adds (×10^7) | 2.3 | 9.0 | 4.5 | 2.3 | 1.2 | 1.2 | 1.2 | 2.3 |
| Mult-Adds (%) | 9.4 | 37.7 | 18.8 | 9.5 | 4.8 | 4.8 | 5.3 | 9.6 |

Table 2: Baseline-2: configuration of the baseline network used on ESC-50, UrbanSound8K and DCASE. This baseline is adapted from SoundNet (Aytar et al., 2016) as detailed in Section 4.1. For brevity, the nonlinearity layer (i.e. ReLU), batch normalization layer and pooling layer following each convolutional layer are omitted. The kernel sizes of the pooling layers following conv1-conv4 and conv5-conv7 are 8 and 4 respectively. The stride of all pooling layers is 2.
Implementation details

WSNet is implemented and trained from scratch in TensorFlow (Abadi et al., 2016). Following Aytar et al. (2016), the Adam optimizer (Kingma & Ba, 2014) with a fixed learning rate of 0.001, a momentum term of 0.9 and a batch size of 64 is used throughout the experiments. We initialize all the weights with zero-mean Gaussian noise with a standard deviation of 0.01. In the network used on MusicDet200K, the dropout ratio of the dropout layers (Srivastava et al., 2014) after each fully connected layer is set to 0.8. The overall training takes 100,000 iterations.

4.2 Results and analysis

4.2.1 MusicDet200K

Ablation analysis

Through controlled experiments, we investigate the effects of each component of WSNet on the model size, computational cost and classification accuracy. The comparative results for different settings of WSNet are listed in Table 3. For clarity, we name WSNets with different settings by combinations of the symbols S/C/SC/D/Q. Please refer to the caption of Table 3 for their detailed meanings.

(1) Spatial sampling. We test the performance of WSNet using different sampling strides in spatial sampling. As listed in Table 3, $S_2$ and $S_4$ slightly outperform the baseline in classification accuracy, possibly because they reduce overfitting. When the compactness in the spatial dimension is 8 (ref. Section 3.2.1), the classification accuracy of $S_8$ drops only slightly, by 0.6%. Note that the maximum compactness along the spatial dimension is equal to the filter size; thus for the layer “conv7”, which has a filter size of 4, the compactness is limited to 4 in $S_8$ (see Table 3). The above results clearly demonstrate that spatial sampling enables WSNet to learn significantly smaller models with accuracies comparable to the baseline.

(2) Channel sampling. Three different values of compactness along the channel dimension, i.e. 2, 4 and 8, are tested and compared with the baselines. It can be observed from Table 3 that $C_2$, $C_4$ and $C_8$ reduce the model size linearly without incurring a noticeable drop in accuracy. In fact, $C_2$ even improves the accuracy over the baseline, demonstrating the effectiveness of channel sampling in WSNet. When learning more compact models, $C_8$ performs better than $S_8$, which has the same compactness in the spatial dimension. This suggests that we should focus on channel sampling when the compactness along the spatial dimension is high.

We then simultaneously perform weight sampling on both the spatial and channel dimensions. As demonstrated by the results of $S_4C_4SC_4$ and $S_8C_8SC_8$, WSNet can learn highly compact models (more than 20× smaller than the baselines) without a noticeable performance drop (less than 0.5%).

(3) Denser weight sampling. Denser weight sampling is used to enhance the learning capability of WSNet under aggressive compactness (i.e. when the compactness along the spatial and channel dimensions is large) and to make up for the performance loss caused by sharing too many parameters among filters. As shown in Table 3, by sampling 2× more filters in conv1, conv2 and conv3, $S_8C_8SC_8D_2$ significantly outperforms $S_8C_8SC_8$. The above results demonstrate the effectiveness of denser weight sampling in boosting the performance of WSNet.

(4) Integral image for efficient computation. As evidenced by the last column of Table 3, the proposed integral image method consistently reduces the computation cost of WSNet. For $S_8C_8SC_8$, which is 23× smaller than the baseline, the computation cost (in terms of #mult-adds) is reduced by 16.4 times. Due to the extra computation cost brought by the 1×1 convolution in denser filter sampling, $S_8C_8SC_8D_2$ achieves a lower acceleration (3.8×). Group convolution (Xie et al., 2017) can be used to alleviate the computation cost of the added 1×1 convolution layers. We will explore this direction in our future work.

(5) Weight quantization. It can be observed from Table 3 that by using 256 bins to represent each weight with one byte (i.e. 8 bits), $S_8C_8SC_{15}D_2Q_4$ is reduced to 1/168 of the baseline’s model size while incurring only a 0.1% accuracy loss. This result demonstrates that weight quantization is complementary to WSNet and that they can be used jointly to effectively reduce the model size of WSNet. Since we do not explore using weight quantization to accelerate models in this paper, the WSNets before and after weight quantization have the same computational cost.

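| WSNet’s settings | conv{1-3} (S/C/A) | conv4 (S/C/A) | conv5 (S/C/A) | conv6 (S/C/A) | conv7 (S/C/A) | fc1, fc2 (SC) | Acc. | Model size | Mult-Adds |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 1/1/1 | 1/1/1 | 1/1/1 | 1/1/1 | 1/1/1 | 1 | 88.9 | 3M (1×) | 1.2e9 (1×) |
| $S_2$ | 2/1/1 | 2/1/1 | 2/1/1 | 2/1/1 | 2/1/1 | 2 | 89.0 | 2× | 1× |
| $S_4$ | 4/1/1 | 4/1/1 | 4/1/1 | 4/1/1 | 4/1/1 | 4 | 89.0 | 4× | 1.8× |
| $S_8$ | 8/1/1 | 8/1/1 | 8/1/1 | 8/1/1 | 4/1/1 | 8 | 88.3 | 5.7× | 3.4× |
| $C_2$ | 1/2/1 | 1/2/1 | 1/2/1 | 1/2/1 | 1/2/1 | 2 | 89.1 | 2× | 1× |
| $C_4$ | 1/4/1 | 1/4/1 | 1/4/1 | 1/4/1 | 1/4/1 | 4 | 88.7 | 4× | 1.4× |
| $C_8$ | 1/8/1 | 1/8/1 | 1/8/1 | 1/8/1 | 1/8/1 | 8 | 88.6 | 8× | 2.4× |
| $S_4C_4SC_4$ | 4/4/1 | 4/4/1 | 4/4/1 | 4/4/1 | 4/4/1 | 4 | 88.7 | 11.1× | 5.7× |
| $S_8C_8SC_8$ | 8/8/1 | 8/8/1 | 8/8/1 | 8/8/1 | 4/8/1 | 8 | 88.4 | 23× | 16.4× |
| $S_8C_8SC_8D_2$ | 8/8/2 | 8/8/1 | 8/8/1 | 8/8/1 | 4/8/1 | 8 | 89.2 | 20× | 3.8× |
| $S_8C_8SC_{15}D_2$ | 8/8/2 | 8/8/1 | 8/8/1 | 8/8/1 | 8/8/1 | 15 | 88.6 | 42× | 3.8× |
| $S_8C_8SC_8Q_4$ | 8/8/1 | 8/8/1 | 8/8/1 | 8/8/1 | 4/8/1 | 8 | 88.4 | 92× | 16.4× |
| $S_8C_8SC_{15}D_2Q_4$ | 8/8/2 | 8/8/1 | 8/8/1 | 8/8/1 | 8/8/1 | 15 | 88.5 | 168× | 3.8× |
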
Table 3: Ablation study of the effects of different WSNet settings on model size, computation cost (in terms of #mult-adds) and classification accuracy on MusicDet200K. For clarity, we name WSNets with different settings by combinations of the symbols S/C/SC/D/Q. “S” denotes weight sampling along the spatial dimension; “C” denotes weight sampling along the channel dimension; “SC” denotes weight sampling of fully connected layers, whose parameters can be seen as flattened vectors with a channel of 1; “D” denotes denser filter sampling; “Q” denotes weight quantization. When a symbol occurs in the name, the corresponding component is used in WSNet. The subscripts of S and C denote the maximum compactness (see Sec. 3.1 for the definition of compactness) on the spatial and channel dimensions over all layers; the subscript of D denotes the ratio of the number of filters in WSNet to that in the baseline; the subscript of SC denotes the compactness of the fully connected layers; and the subscript of Q denotes the ratio of WSNet’s size before and after weight quantization. To avoid confusion, SC only occurs in a name when both spatial and channel sampling are applied to the convolutional layers. The absolute model size and computational cost are provided for the baseline. For WSNet, we provide the ratio of the baseline’s model size to WSNet’s model size and the ratio of the baseline’s #Mult-Adds to WSNet’s #Mult-Adds.
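| WSNet’s settings | conv{1-4} (S/C/A) | conv5 (S/C/A) | conv6 (S/C/A) | conv7 (S/C/A) | conv8 (S/C/A) | Model size | Mult-Adds |
|---|---|---|---|---|---|---|---|
| Baseline | 1/1/1 | 1/1/1 | 1/1/1 | 1/1/1 | 1/1/1 | 3M (1×) | 2.4e8 (1×) |
| $S_8C_4D_2$ | 4/4/2 | 4/4/1 | 4/4/1 | 4/4/1 | 8/4/1 | 25× | 2.3× |
| $S_8C_8D_2$ | 4/4/2 | 4/4/1 | 4/4/1 | 4/8/1 | 8/8/1 | 45× | 2.4× |
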
Table 4: The configurations of the WSNets used on ESC-50, UrbanSound8K and DCASE. Please refer to Table 3 for the notation used in the “Settings” column. Since the input lengths for the baseline differ across datasets, we only provide the baseline’s #Mult-Adds for UrbanSound8K. Note that since the #Mult-Adds of each WSNet is reported as the ratio of the baseline’s #Mult-Adds to WSNet’s #Mult-Adds, the numbers for WSNets in the Mult-Adds column are the same for all datasets.
| Model | Settings | Acc. (%) | Model size |
|---|---|---|---|
| baseline | scratch init.; provided data | 66.0 | 13.7M (1×) |
| WSNet | $S_8C_4D_2$ | 66.5 | 25× |
| WSNet | $S_8C_4D_2Q_4$ | 66.25 | 100× |
| WSNet | $S_8C_8D_2$ | 66.1 | 45× |
| WSNet | $S_8C_8D_2Q_4$ | 65.8 | 180× |
| Piczak ConvNet (Piczak, 2015b) | - | 64.5 | - |
| SoundNet (Aytar et al., 2016) | scratch init.; provided data | 51.1 | 1× |
| SoundNet (Aytar et al., 2016) | pre-training; extra data | 72.9 | 1× |

Table 5: Comparison with state-of-the-art methods on ESC-50. All results of WSNet are obtained by 10-fold validation. Please refer to Table 3 for the notation used in the “Settings” column. The baseline used here is a simple modification of SoundNet with 8 convolution layers (refer to Section 4.1 for details); thus the two have the same model size.
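| Model | Settings | Acc. (%) | Model size |
|---|---|---|---|
| baseline | raw sound wave | 70.39 | 13.7M (1×) |
| WSNet | $S_8C_4D_2$ | 70.76 | 25× |
| WSNet | $S_8C_4D_2Q_4$ | 70.61 | 100× |
| WSNet | $S_8C_8D_2$ | 70.14 | 45× |
| WSNet | $S_8C_8D_2Q_4$ | 70.03 | 180× |
| Piczak ConvNet (Piczak, 2015b) | pre-computed features | 73.1 | - |
| Salamon (Salamon & Bello, 2015) | pre-computed features | 73.6 | 1× |
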
Table 6: Comparison with state-of-the-art methods on UrbanSound8K. All results of WSNet are obtained by 5-fold validation. Please refer to Table 3 for the notation used in the “Settings” column. While Piczak (2015b) and Salamon & Bello (2015) use complex and time-consuming pre-computed features (i.e. mel-spectrograms) as input, WSNet only uses the raw sound wave, which makes it more suitable for practical usage.

4.2.2 ESC-50

The comparison of WSNet with other state-of-the-art methods on ESC-50 is listed in Table 5. The settings of WSNet used on ESC-50, UrbanSound8K and DCASE are listed in Table 4. Compared with the baseline, WSNet significantly reduces the model size by 25× and 45×, while at the same time improving the accuracy by 0.5% and 0.1%, respectively. The computation costs of WSNet are listed in Table 4, from which one can observe that WSNet achieves higher computational efficiency by reducing the #Mult-Adds of the baseline by 2.3× and 2.4×, respectively. Such promising results again demonstrate the effectiveness of WSNet in learning compact and efficient networks. After applying weight quantization to WSNet, its model size is reduced to only 1/180 of the baseline while the accuracy drops only slightly, by 0.2%. Compared with SoundNet trained from scratch on the provided data, WSNets improve the classification accuracy by over 10% with more than 100× smaller models. Using a transfer learning approach, SoundNet (Aytar et al., 2016) pre-trained on a large number of unlabeled videos achieves better accuracy than WSNet. However, since the training method is orthogonal to WSNet, we believe that WSNet can achieve better performance by training in a similar way as SoundNet (Aytar et al., 2016) on a large amount of unlabeled video data.

4.2.3 UrbanSound8K

We report the comparison of WSNet with state-of-the-art methods on UrbanSound8K in Table 6. It is again observed that WSNet significantly reduces the model size of the baseline while obtaining comparable results. Both Piczak (2015b) and Salamon & Bello (2015) use pre-computed mel-spectrogram features during training. In comparison, the proposed WSNet simply takes the raw waveform of the recordings as input, avoiding the time-consuming feature extraction done in Piczak (2015b) and Salamon & Bello (2015).

4.2.4 DCASE

As evidenced in Table 7, WSNet outperforms the baseline’s classification accuracy by 1% with a 100× smaller model. When using an even more compact model, i.e. one that is 180× smaller, the classification accuracy of WSNet is only one percentage point lower than the baseline (i.e. it has only one more incorrectly classified sample), verifying the effectiveness of WSNet. Compared with SoundNet (Aytar et al., 2016), which utilizes a large amount of unlabeled data during training, WSNet ($S_8C_4D_2Q_4$), which is 100× smaller, achieves comparable results using only the provided data.

| Model | Settings | Acc. (%) | Model size |
|---|---|---|---|
| baseline | scratch init.; provided data | 85 | 13.7M (1×) |
| WSNet | $S_8C_4D_2$ | 86 | 25× |
| WSNet | $S_8C_4D_2Q_4$ | 86 | 100× |
| WSNet | $S_8C_8D_2$ | 84 | 45× |
| WSNet | $S_8C_8D_2Q_4$ | 84 | 180× |
| RG (Rakotomamonjy & Gasso, 2015) | - | 69 | - |
| LTT (Li et al., 2013) | - | 72 | - |
| RNH (Roma et al., 2013) | - | 77 | - |
| Ensemble (Stowell et al., 2015b) | - | 78 | - |
| SoundNet (Aytar et al., 2016) | pre-training; extra data | 88 | 1× |

Table 7: Comparison with state-of-the-art methods on DCASE. Note that there are only 100 samples in the testing set. Please refer to Table 3 for the notation used in the “Settings” column.

5 Conclusion

In this paper, we present a class of Weight Sampling networks (WSNet) which are highly compact and efficient. A novel weight sampling method is proposed to sample filters from condensed filters, which are much smaller than the independently trained filters in conventional networks. The weight sampling is conducted along two dimensions of the condensed filters, i.e. by spatial sampling and channel sampling. Taking advantage of the overlapping property of the filters in WSNet, we propose an integral image method for efficient computation. Extensive experiments on four audio classification datasets, namely MusicDet200K, ESC-50, UrbanSound8K and DCASE, clearly demonstrate that WSNet can learn compact and efficient networks with competitive performance.