I. Introduction
According to an analysis conducted by the World Health Organization [22], respiratory illnesses, which comprise lung cancer, tuberculosis, asthma, chronic obstructive pulmonary disease (COPD), and lower respiratory tract infection (LRTI), account for a high percentage of mortality worldwide. Indeed, annual records indicate that around 10, 65, and 334 million people currently suffer from tuberculosis (TB), chronic obstructive pulmonary disease (COPD), and asthma, respectively. Noticeably, about 1.4, 1.6, and 3 million people die from TB, lung cancer, and COPD each year. To deal with respiratory diseases, early detection is the key factor in increasing the effectiveness of treatment and limiting spread. In a respiratory examination, lung auscultation is an important part of diagnosing respiratory diseases. By listening to the sounds produced during lung auscultation, experts can recognize adventitious sounds (e.g., crackles and wheezes) in the respiratory cycle that usually occur in people suffering from pulmonary disorders. If automated methods can be developed to detect these anomalous sounds, they may be useful in enhancing the early detection of respiratory disease in the future. Although automated analysis of respiratory sounds was conducted early on [27, 30, 28], the research field initially attracted little attention. In recent years, however, it has drawn much attention thanks to the application of robust machine learning and deep learning techniques.
As regards the machine learning approach, proposed systems for respiratory sound analysis tend to rely upon frame-based representations. Most studies [20, 12] approached Mel-Frequency Cepstral Coefficients (MFCC), the most popular feature in the Automatic Speech Recognition (ASR) research field, to derive feature vectors. Using both spectral and temporal features, Melbye et al. [8] extracted five-dimensional feature vectors from the raw audio signal, comprising four features from the time domain (variance, range, and two sums of simple moving averages) and one feature from the frequency domain (spectrum mean). Meanwhile, Hanna et al. [3] first extracted spectral information (bark bands, energy bands, mel bands, MFCC, etc.), rhythm features (beat loudness, BPM, etc.), harmonicity and inharmonicity features, and tonal features (chord strength, tuning frequency, etc.). Next, they computed statistical values such as standard deviation, variance, minimum and maximum, median, mean, and the means and variances of the first and second derivatives of these features to maximize the chance of a correct feature representation. To further explore audio features, Mendes et al. [17] proposed using 35 different types of features, mainly drawn from Music Information Retrieval research. Inspired by the observation that only certain features mainly affect the final result, Datta et al. [5] first extracted various features such as power spectral density (PSD), FFT and Wavelet spectrograms, Mel-Frequency Cepstral Coefficients (MFCC), and Linear Frequency Cepstral Coefficients (LFCC). Next, they applied the Maximal Information Coefficient (MIC) [29] to score these features, thus selecting the most influential ones before feeding them into a classifier. Similarly, Kok et al. [12] applied the Wilcoxon Sum of Rank test to indicate which features among MFCC, Discrete Wavelet Transform (DWT), and time-domain features (the power, mean, variance, skewness, and kurtosis of the audio signal) mainly affect the final accuracy. Approaching image processing techniques, Sengupta et al.
[32] applied Local Binary Pattern (LBP) analysis to mel-frequency spectral coefficient (MFSC) spectrograms to capture the spectrogram's texture information. The LBP spectrogram is then converted into a histogram representation before being fed into a back-end classifier. Ordinarily, frame-based features, in vector form, are classified by traditional machine learning models such as Logistic Regression [17], K-Nearest Neighbor (KNN) [8, 32], and other classifiers [20, 21], [8, 5, 32, 33], [12, 8, 3]. As deep learning techniques have achieved strong and robust detection performance for general sounds [37], [16], feature extraction in this approach involves generating two-dimensional spectrograms that capture both temporal and spectral information and present a much wider time context than single-frame analysis. While there are a variety of spectrogram transformations, Mel-based methods such as log-Mel [34, 14, 1] and MFCC [2, 34, 24, 23, 18, 11] are the most popular. Some papers approached different spectrograms, such as the combination of two spectrograms (STFT and Wavelet) proposed by Minami et al. [19] and the optimized S-Transform in [4]. Current deep learning classifiers exploring spectrogram representations of respiratory sounds are mainly based on Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or hybrid architectures. As regards CNN-based networks, published papers presented diverse architectures such as LeNet6 [24, 2], VGG5 [14], two parallel VGG16s [19], and ResNet50 [4]. Inspired by the fact that adventitious respiratory sounds such as crackles and wheezes present certain temporal structures which RNN-based networks are able to capture, Perna and Tagarelli [23] conducted a comprehensive analysis using Long Short-Term Memory (LSTM) networks for both the task of classifying anomalous respiratory sounds and that of classifying respiratory diseases. Using both LSTM and Gated Recurrent Unit (GRU) cells, the learning components in RNN-based networks, Kochetov et al. [11] proposed a novel architecture, namely the Noise Masking Recurrent Neural Network, which aims to distinguish both noise and anomalous respiratory sounds. As regards the hybrid architectures proposed in [1, 19], a CNN is first used to map the spectrogram input to a time sequence. Next, LSTM [1] or GRU [19] cells are used to learn the structure of the sequence before it is sent to fully-connected layers for final classification. Compared with the machine learning approach, the state-of-the-art comparisons presented in [23, 4] indicate that deep learning classifiers are more robust and achieve better scores. However, deep-learning-based models have much more complicated architectures and thus require a large memory footprint when integrated into wearable devices or embedded systems for real-time applications. In other words, the state-of-the-art systems present a trade-off between model performance and model size. Additionally, although recent deep learning techniques achieve good performance in classifying respiratory sounds, it is hard to compare systems due to the use of different datasets, mainly collected by the authors themselves and often not publicly available.
In this paper, we propose robust deep learning frameworks evaluated on the ICBHI dataset [31], aiming to:

Compare our results to the state-of-the-art systems by using the published ICBHI dataset. Furthermore, as ICBHI is currently one of the biggest datasets of respiratory sounds, it is beneficial for making the proposed deep learning models general.

Provide a comprehensive analysis of various factors, such as type of spectrogram, overlapping/non-overlapping patches and patch size, data augmentation, etc., and thus propose two best deep learning models, each targeting an individual task of either anomaly respiratory sound classification or respiratory disease detection.

Solve the trade-off between model performance and model size by applying a Teacher-Student scheme. In particular, we consider the best deep learning model as the Teacher. We extract information from the Teacher model's middle layers and consider these values as soft labels. Next, we use the soft labels to train another, smaller model, called the Student. Eventually, we obtain a small-size model (the Student network trained with soft labels) showing performance similar to the Teacher model.
II. ICBHI Dataset and Proposed Tasks
II-A. ICBHI dataset
The 2017 International Conference on Biomedical Health Informatics (ICBHI) [31] provided a large dataset of respiratory sounds. In particular, it comprises 920 audio recordings totalling over 5.5 hours. The recordings have various lengths, from 10 to 90 s, and were recorded with a wide range of sampling frequencies, from 4 kHz to 44.1 kHz. The ICBHI dataset was collected from a total of 128 patients, each identified as being healthy or as exhibiting one of the following respiratory diseases or conditions: COPD, Bronchiectasis, Asthma, upper or lower respiratory tract infection, Pneumonia, or Bronchiolitis; the disease name is labelled on each audio recording. Within each audio recording, different types of respiratory cycle, called Crackle, Wheeze, Crackle & Wheeze, and Normal, are present. These cycles were labelled by experts with onset and offset times. Noticeably, the cycles have various lengths (from 0.2 s up to 16.2 s), and the number of cycles is unbalanced (1864, 886, 506, and 3642 cycles for Crackle, Wheeze, Crackle & Wheeze, and Normal, respectively).
II-B. Main tasks from the ICBHI dataset
Given this metadata, the ICBHI challenge is separated into two main tasks.
Task 1, referred to as respiratory anomaly classification, is separated into two subtasks. The first subtask aims to classify four different cycles (Crackle, Wheeze, Crackle & Wheeze, and Normal). The second subtask classifies the four types of cycles into two groups of Normal and Anomaly cycles (the latter consisting of Crackle, Wheeze, and Crackle & Wheeze). We name these tasks Task 1-1 and Task 1-2.
Task 2, referred to as respiratory disease prediction, also comprises two subtasks. The first subtask aims to classify audio recordings into groups of disease conditions: Healthy, Chronic Disease (i.e. COPD, Bronchiectasis, and Asthma), and Non-Chronic Disease (i.e. upper and lower respiratory tract infection, Pneumonia, and Bronchiolitis). The second subtask distinguishes two groups of healthy or unhealthy (i.e. the chronic and non-chronic disease groups combined). We name these tasks Task 2-1 and Task 2-2. While Task 1-1 and 1-2 are evaluated over respiratory cycles, Task 2-1 and 2-2 are evaluated over entire audio recordings.
II-C. Evaluation metrics and our settings
In this paper, we attempt all of the ICBHI challenge tasks mentioned above. To evaluate our systems on each task, we separate the ICBHI dataset (6898 respiratory cycles for Task 1-1 and 1-2, and 920 entire recordings for Task 2-1 and 2-2) into five folds for cross validation. We first introduce a baseline system and conduct experiments on it to identify the most influential factors over just the first fold. From the analysis of these factors, we propose the best system configurations and evaluate them over all folds. Note that we eventually propose two deep learning frameworks, each for an individual task of either anomaly cycle detection (Task 1-1 and 1-2) or respiratory disease detection (Task 2-1 and 2-2). To evaluate the proposed systems and compare them to the state of the art, we follow the ICBHI criteria and settings, and report results in terms of sensitivity, specificity, and the ICBHI score as defined in [31, 23] below,
Sensitivity = (C_c + C_w + C_b) / (N_c + N_w + N_b)   (1)

for classifying the four classes of cycles in Task 1-1,

Sensitivity = C_a / N_a   (2)

for the two groups of normal or adventitious cycles in Task 1-2, and

Specificity = C_n / N_n   (3)

where C and N are the numbers of correct inferences and total cases, respectively, and the subscripts c, w, b, a, and n denote the Crackle, Wheeze, Crackle & Wheeze, Anomaly, and Normal classes. Similarly, respiratory disease classification in Task 2-1 and 2-2 uses the criteria in the equations below,

Sensitivity = (C_c + C_nc) / (N_c + N_nc)   (4)

for the three groups of diseases in Task 2-1,

Sensitivity = C_u / N_u   (5)

for the two groups of healthy or unhealthy in Task 2-2, and

Specificity = C_h / N_h   (6)

where C and N are again the numbers of correct inferences and total cases, with subscripts c, nc, u, and h denoting the Chronic, Non-Chronic, Unhealthy, and Healthy groups.
The ICBHI score is computed as the average of Sensitivity and Specificity.
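As a concrete sketch, the minimal Python function below computes Sensitivity, Specificity, and the ICBHI score from reference and predicted labels. The function name and the label encoding (0 for the Normal/Healthy class) are illustrative assumptions, not part of the original system.

```python
import numpy as np

def icbhi_scores(y_true, y_pred, normal_label=0):
    """Compute (Sensitivity, Specificity, ICBHI score).

    Sensitivity: fraction of non-Normal cases predicted with their
    correct class label. Specificity: fraction of Normal cases
    predicted as Normal. ICBHI score: average of the two.
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    normal = y_true == normal_label
    specificity = float(np.mean(y_pred[normal] == normal_label))
    sensitivity = float(np.mean(y_pred[~normal] == y_true[~normal]))
    return sensitivity, specificity, (sensitivity + specificity) / 2
```

For example, with true labels [0, 1, 2, 0] and predictions [0, 1, 1, 0], one anomaly cycle of two is classified correctly (sensitivity 0.5) and both normal cycles are correct (specificity 1.0), giving an ICBHI score of 0.75.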
III. High-level System and Proposed Baseline Architecture
III-A. High-level system architecture
First, the high-level system architecture used for all tasks of anomaly sound and disease detection is introduced as described in Figure 1. As Figure 1 shows, the entire system is separated into two main steps, comprising a front-end feature extraction (the upper part) and back-end deep learning models (the lower part). In particular, cycles in Task 1 or entire audio recordings in Task 2 are transformed into spectrogram representations. The entire spectrogram is then split into image patches. Next, mixup data augmentation is applied to the image patches to generate new data before feeding them into the deep learning models for classification.
III-B. Proposed baseline architecture
From the high-level system architecture in Figure 1, it can be seen that a variety of factors affect a deep-learning-based system's performance, such as cycle length (only for Task 1-1 and 1-2), type of spectrogram, overlapping or non-overlapping splitting, patch size, and data augmentation. In fact, no research on respiratory sounds has analysed all of these factors. We therefore provide an intensive analysis and identify the most influential factors in this paper. To do this, we first introduce a baseline system architecture with the settings shown in Table I.
Factors  Setting

Re-sample  16 kHz
Cycle duration (only for Task 1)  5 s
Spectrogram  log-Mel
Patch splitting  non-overlap
Patch size  
Data augmentation  None
Deep learning model  CDNN-based architecture
By selecting only one option for each factor, we first re-sample all audio recordings to 16 kHz, since sample rates differ. Since cycle lengths also differ, we duplicate short respiratory cycles to ensure the input features have a minimum length (e.g. 5 s or longer; this is unnecessary for Task 2, which uses entire recordings). Next, each cycle or audio recording is transformed into a log-Mel spectrogram with window size = 1024 samples, hop size = 256, FFT length = 2048, and filter number = 64. The resulting spectrogram is then split without overlap into smaller patches. As data augmentation is one of the factors evaluated, we do not apply it in the baseline system. As regards the deep learning model used for the baseline system, we propose a CDNN network architecture, similar to VGG7, as shown in Table II. The CDNN contains 7 sub-blocks, comprising 6 Conv. Blocks and 1 Dense Block, which perform batch normalization (Bn), convolution (Cv [kernel size]), rectified linear units (Relu), average pooling (Ap [kernel size]), global average pooling (Gap), dropout (Dr (percentage dropped)), fully-connected (Fl), and final Softmax layers for classification. C is the number of categories classified, which depends on the specific task. In particular, we use two CDNN models, setting C to 4 and 3 for Task 1-1, 1-2 and Task 2-1, 2-2, respectively. Note that, as each model is used for one task, the parameters obtained for each model differ after training.
Architecture  layers  Output 

Input layer (image patch)  
Conv. Block 01  Bn  Cv []  Relu  Bn  Ap []  Dr ()  
Conv. Block 02  Bn  Cv []  Relu  Bn  Ap []  Dr ()  
Conv. Block 03  Bn  Cv []  Relu  Bn  Dr ()  
Conv. Block 04  Bn  Cv []  Relu  Bn  Ap []  Dr ()  
Conv. Block 05  Bn  Cv []  Relu  Bn  Dr ()  
Conv. Block 06  Bn  Cv []  Relu  Bn  Gap  Dr ()  
Dense Block  Fl  Softmax layer  C 
III-C. Experimental settings of the baseline system
We use the TensorFlow framework to build the CDNN models, with hyperparameters set to the Adam optimiser [10], the chosen number of epochs and batch size, and the cross-entropy loss below,

Loss_EN(θ) = -(1/N) Σ_{n=1}^{N} y_n log(ŷ_n(θ)) + (α/2) ||θ||₂²   (7)

where θ denotes all trainable parameters, N is the batch size, and α is a regularization constant. y_n and ŷ_n denote the expected and predicted results.
As an entire spectrogram or cycle is separated into patches, these are applied patch-by-patch to the CDNN model, which returns a posterior probability for each patch. The posterior probability of the entire spectrogram can then be computed by averaging all patches' posterior probabilities. Let p_c(n), with c being the category index and n indexing the N patches fed into the learning model, be the probability of a test sound instance; the mean classification probability is then denoted as p̄_c, where

p̄_c = (1/N) Σ_{n=1}^{N} p_c(n)   (8)

and the predicted label ŷ from the CDNN is determined using,

ŷ = argmax_c p̄_c   (9)
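The patch-level averaging and label decision of Equations (8) and (9) can be sketched as below; this is a minimal NumPy illustration, and the function name is our own.

```python
import numpy as np

def fuse_patch_probabilities(patch_probs):
    """Fuse per-patch posteriors into one prediction.

    patch_probs: array of shape (N, C) holding the CDNN's posterior
    probability for each of the N patches over C categories.
    Returns (mean_probs, predicted_label) per Eqs. (8) and (9).
    """
    patch_probs = np.asarray(patch_probs)
    mean_probs = patch_probs.mean(axis=0)           # Eq. (8): average over patches
    return mean_probs, int(np.argmax(mean_probs))   # Eq. (9): argmax over categories
```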
IV. Analysis of Influencing Factors
Using the baseline system proposed above, we conduct experiments to evaluate the effect of the factors mentioned and propose the best option for each of them in this section.
IV-A. Spectrogram analysis
As shown by our previous results on the DCASE sound-scene dataset [26, 25], the spectrogram is one of the most important factors affecting the final classification. Therefore, the effect of the spectrogram is first evaluated on all tasks. In particular, the baseline system's settings as described in Table I are retained, but the type of spectrogram is replaced in turn by log-Mel [15], Gammatone filter (Gamma.) [6], Mel-Frequency Cepstral Coefficients (MFCC) [15], and Constant Q Transform (CQT) [15].
The results shown in Figure 2 indicate that MFCC, log-Mel, and Gamma. performance are roughly equal in general and much better than CQT over all tasks. Taking the log-Mel scores as a reference, Figure 3 shows the performance differences between log-Mel and Gamma. or MFCC. It can be seen that Gamma. outperforms the other spectrograms on Task 1, reporting improvements of 4% and 3.2% compared with log-Mel and MFCC, respectively. However, both Gamma. and MFCC show slightly poorer performance than log-Mel on Task 2. From these results, we decide to use Gamma. for anomaly cycle detection (Task 1-1 and 1-2) and log-Mel for respiratory disease detection (Task 2-1 and 2-2). Note that Gamma. and log-Mel are used in all experiments presented below.
IV-B. Cycle length analysis
Respiratory cycles in the ICBHI dataset have diverse lengths, ranging from 0.2 s to 16.2 s, with 80% of cycles being shorter than 5 s. It is, therefore, interesting to understand how respiratory cycle length affects classification accuracy. Using the baseline proposed in Section III-C with the Gamma. spectrogram selected in Section IV-A, we evaluate cycle lengths from 3 to 8 s, called standard lengths. To deal with short cycles, we duplicate them to obtain new cycles that are equal to or longer than the standard lengths before transforming them into spectrograms.
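The duplication of short cycles can be sketched as follows; the function name and the default values (5 s minimum length at 16 kHz, matching the baseline settings) are illustrative assumptions.

```python
import numpy as np

def pad_cycle_by_repetition(cycle, min_len_s=5.0, sr=16000):
    """Repeat a short respiratory cycle until it is at least as long
    as the standard length, before spectrogram transformation.

    cycle: 1-D array of audio samples at sampling rate `sr`.
    Cycles already long enough are returned unchanged.
    """
    min_samples = int(min_len_s * sr)
    if len(cycle) >= min_samples:
        return cycle
    n_repeats = int(np.ceil(min_samples / len(cycle)))
    # Tiling may overshoot slightly, giving a cycle "equal to or
    # longer than" the standard length, as described above.
    return np.tile(cycle, n_repeats)
```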
Task  Cyc. Len.  Spec.  Sen.  ICBHI Score

1-1, 4-category  3 s  84.20  70.92  77.56
1-1, 4-category  4 s  83.52  71.54  77.53
1-1, 4-category  5 s  84.20  73.08  78.64
1-1, 4-category  6 s  81.45  73.07  77.27
1-1, 4-category  7 s  82.56  73.54  78.05
1-1, 4-category  8 s  86.95  69.84  78.40
1-2, 2-category  3 s  84.20  81.85  83.03
1-2, 2-category  4 s  83.52  83.54  83.53
1-2, 2-category  5 s  84.20  83.69  83.95
1-2, 2-category  6 s  81.45  81.46  82.81
1-2, 2-category  7 s  82.56  84.31  83.43
1-2, 2-category  8 s  86.95  80.92  83.94
Table III reports the Task 1-1 and 1-2 results for different cycle lengths. The best ICBHI scores are 78.64 and 83.95 for the 4- and 2-category subtasks respectively, with the cycle length set to 5 s (the same as the baseline's setting). Although cycle lengths of 7 or 8 s also show competitive results, we choose the shorter length of 5 s for further experiments, as it reduces running cost while obtaining the best results.
IV-C. Overlap/non-overlap splitting analysis
As the spectrogram representation of an entire cycle or audio recording is too long in the temporal dimension, it is split into smaller patches suitable for the back-end deep learning models. Overlapping splitting is considered useful for keeping the temporal sequence continuous. This section, therefore, evaluates whether overlapping splitting should be applied.
As the results in Figure 4 show, classifying anomaly cycles in subtasks 1-1 and 1-2 achieves the best scores of 76.59% and 78.57% respectively with overlapping patches. By contrast, non-overlapping splitting is more effective for Task 2-1 and 2-2 with entire recordings, reporting 78.64% and 83.95%, respectively.
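A minimal sketch of the two splitting strategies, assuming a (frequency, time) spectrogram array; the function name and arguments are our own, not from the original implementation.

```python
import numpy as np

def split_into_patches(spec, patch_width, hop=None):
    """Split a (freq, time) spectrogram into fixed-width patches.

    hop < patch_width yields overlapping patches (used here for
    Task 1); hop == patch_width (the default) yields non-overlapping
    ones (used here for Task 2). Trailing frames that do not fill a
    whole patch are dropped in this sketch.
    """
    if hop is None:
        hop = patch_width  # non-overlapping by default
    n_frames = spec.shape[1]
    starts = range(0, n_frames - patch_width + 1, hop)
    return np.stack([spec[:, s:s + patch_width] for s in starts])
```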
IV-D. Time resolution analysis
The baseline network operates on fixed-size patches, with the horizontal dimension denoting the time span of each feature. Since the features are sequential, the time span also sets their temporal resolution. To explore this, we adjust the patch width to 0.6 s, 1.2 s, 1.8 s, 2.4 s, and 3 s, then retrain and evaluate the performance of each configuration.
The results shown in Figure 5 indicate that, while a patch width of 1.2 s is the best choice for Task 1-1 and Task 1-2, with 78.64% and 83.95% respectively, Task 2-1 and 2-2 achieve their best scores of 80.63% and 84.58% with a patch width of 2.4 s.
IV-E. Data augmentation analysis
Data augmentation is useful for reinforcing the learning ability of deep learning models, as proven in [25, 26]. In this paper, we therefore apply a data augmentation method, namely mixup [36, 35, 38], and evaluate whether it is useful for respiratory sounds. Consider X1 and X2 as two image patches randomly selected from the set of original image patches, with labels y1 and y2 respectively; mixup data augmentation generates new image patches as in the equations below,

Xmp1 = λ X1 + (1 - λ) X2   (10)

Xmp2 = (1 - λ) X1 + λ X2   (11)

ymp1 = λ y1 + (1 - λ) y2   (12)

ymp2 = (1 - λ) y1 + λ y2   (13)

where the mixing coefficient λ is drawn from either a uniform or a beta distribution, and Xmp1 and Xmp2 are the two new image patches obtained by mixing X1 and X2 with the random coefficient λ. After mixup, the original data and the data generated by mixup are shuffled and fed into the proposed CDNN baseline, doubling the batch size and the training time. (Note that, as the categories classified in Task 1-1 and 1-2 comprise Crackle, Wheeze, Crackle & Wheeze, and Normal, each of the two original patches selected for mixup is drawn from either Normal or the group of anomaly cycles; this selection is random in Task 2-1 and 2-2.) As the new labels ymp1 and ymp2 of the two mixed patches are no longer one-hot, the Kullback-Leibler (KL) divergence loss [13] is used rather than the standard cross-entropy loss, as shown in the equation below,
Loss_KL(θ) = Σ_{n=1}^{N} y_n log(y_n / ŷ_n) + (α/2) ||θ||₂²   (14)

where Loss_KL is the KL loss function, θ denotes the trainable network parameters, α denotes the ℓ2-norm regularization coefficient, set to 0.0001, and N is the batch number. y_n and ŷ_n denote the ground truth and the network output, respectively. The results in Figure 6 indicate that applying mixup data augmentation effectively improves the classification accuracy on all tasks. Noticeably, it improves Task 2-1 and 2-2 significantly, by 8.5% and 8.2%, respectively.
V. Proposed Robust Deep Learning Frameworks
Factors  Anomaly cycle detection  Respiratory disease detection

Re-sample  16 kHz  16 kHz
Cycle duration  5 s  -
Spectrogram  Gamma.  log-Mel
Patch splitting  overlap  non-overlap
Patch size  
Data augmentation  mixup  mixup
Deep learning model  CDNN with MoE  CDNN with MoE
From the comprehensive analysis of influencing factors above, we propose two deep learning frameworks, each of which is applied to either anomaly cycle detection or respiratory disease detection, with the settings shown in Table IV. As regards the deep learning models, we use the same network architecture for all tasks, namely CDNN with MoE, an improvement of the CDNN baseline presented in the next section.
V-A. Improved deep learning CDNN model
Reviewing the CDNN baseline shown in Table II, the first six Conv. Blocks map the image patch input to condensed feature vectors (the output of the global average pooling layer in Conv. Block 06). These vectors are then classified by a fully-connected layer and a Softmax layer in the final Dense Block. Since the condensed feature vectors extracted by global average pooling across the channel dimension may not capture enough information, we extract more information by additionally using two other pooling layers: global max pooling and global conv. pooling. While global max pooling is widely used, the proposed global conv. pooling is an added convolutional layer with its kernel size set to the frequency and temporal dimensions and its filter number set to the channel dimension. In particular, the output tensor shape at the final convolutional layer in the CDNN baseline is [n, 8, 8, 512], where n is the batch size and 8, 8, and 512 are the frequency, temporal, and channel dimensions, respectively. We thus apply a convolutional layer with kernel size [8, 8] and 512 filters to this tensor, obtaining a 512-dimensional vector. By using the three types of pooling layers (global max pooling, global average pooling, and conv. pooling), we capture as much information as possible; each type of pooling layer extracts a 512-dimensional vector, as shown in Figure 8. The second improvement focuses on the Dense Block, which performs the final classification. In particular, we replace the Dense Block with an MoE block. A conventional MoE block comprises many experts and incorporates a gate network to decide which expert is applied to which input region, as shown in Figure 7. In our context, each 512-dimensional input vector extracted from the pooling layers mentioned goes through the experts. The experts' outputs are then gated before passing through a softmax to determine the final score. Each MoE expert comprises a fully-connected layer and a ReLU activation function; its input dimension is 512 and its output size is the number of categories classified. The gate network is implemented as a Softmax Gate, an additional fully-connected layer with a softmax activation function and a gating dimension equal to the number of experts. Let y_k be the output vectors of the K experts and g_k the outputs of the gate network, where Σ_k g_k = 1. The predicted output ȳ is then found as,

ȳ = Σ_{k=1}^{K} g_k y_k   (15)
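A minimal NumPy sketch of the MoE block's forward pass (Equation (15)); the weight shapes are illustrative assumptions and biases are omitted for brevity.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_forward(v, expert_ws, gate_w):
    """Gate-weighted mixture of expert outputs.

    v: pooled feature vector (e.g. 512-dim in the text).
    expert_ws: list of K weight matrices (dim x C), each an expert's
    fully-connected layer followed by ReLU.
    gate_w: (dim x K) weights of the softmax gate.
    """
    expert_outs = [np.maximum(v @ w, 0.0) for w in expert_ws]  # K expert outputs
    g = softmax(v @ gate_w)                                    # gate weights, sum to 1
    mixed = sum(gk * yk for gk, yk in zip(g, expert_outs))     # Eq. (15)
    return softmax(mixed)                                      # final class scores
```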
V-B. Experimental settings and accuracy fusion
As mentioned in Table IV, mixup data augmentation is used in the proposed frameworks, so the labels are no longer one-hot. Therefore, the Kullback-Leibler (KL) divergence loss [13] mentioned in Section IV-E and Equation (14) is used to train the proposed deep learning models (CDNN with MoE). We use the TensorFlow framework to build the model and set the learning rate to 0.0001, the epoch number to 100, and the batch size to 100, initializing the trainable parameters from a Normal distribution with mean = 0 and standard deviation = 0.1.
As three types of pooling layers are used, we apply a mean-fusion method to fuse the obtained posterior probabilities. Let p_c(n, m), with c being the category index, n indexing the N patches fed into the learning model, and m indexing the 3 types of pooling layers, be the probability of a test sound instance. The mean classification probability is then denoted as p̄_c, where

p̄_c = (1/(3N)) Σ_{m=1}^{3} Σ_{n=1}^{N} p_c(n, m)   (16)

and similarly the predicted label is determined as in Equation (9).
V-C. Performance compared to the state of the art
Task  Method  train/test  Spec.  Sen.  ICBHI Score

1-1, 4-class  Boosted Tree [3]  60/40  0.78  0.21  0.49
1-1, 4-class  CNN-RNN [1]  80/20  -  -  0.66
1-1, 4-class  CNN [24]  80/20  0.77  0.45  0.61
1-1, 4-class  MN-RNN [11]  five folds  0.74  0.56  0.65
1-1, 4-class  LSTM [23]  80/20  0.85  0.62  0.74
1-1, 4-class  Our system  five folds  0.87  0.74  0.81
1-2, 2-class  LSTM [23]  80/20  -  -  0.81
1-2, 2-class  CNN [14]  75/25  -  -  0.82
1-2, 2-class  Our system  five folds  0.86  0.85  0.86
2-1, 3-class  CNN [24]  80/20  0.76  0.89  0.83
2-1, 3-class  LSTM [23]  80/20  0.82  0.98  0.90
2-1, 3-class  Our system  five folds  0.86  0.95  0.91
2-2, 2-class  CNN-RNN [1]  60/40  -  -  0.71
2-2, 2-class  CNN [24]  80/20  0.78  0.97  0.88
2-2, 2-class  RUSBoost [12]  50/50  0.93  0.86  0.90
2-2, 2-class  LSTM [23]  80/20  0.82  0.99  0.91
2-2, 2-class  Our system  five folds  0.86  0.98  0.92
From the experimental analysis, we propose separate network configurations for ICBHI challenge Tasks 1 and 2, although both share the same deep learning model (CDNN+MoE).
Table V (top) compares the proposed Task 1 system against the state of the art, demonstrating the highest accuracies of 0.81 and 0.86 for the 4-category and 2-category subtasks, respectively. The Task 2 results (Table V, bottom) reveal accuracies of 0.91 and 0.92 for the 3-category and 2-category subtasks, respectively. These results indicate that our proposed deep learning frameworks outperform the state-of-the-art methods. However, the comparison may not be entirely exact due to the different train/test splits used over the dataset.
VI. Teacher-Student Scheme to Reduce Model Size for Respiratory Disease Detection
VI-A. Teacher-Student scheme
Inspired by the effective application of the Teacher-Student scheme to sound scenes and sound events [9, 7], we apply this technique to Task 2 to deal with the trade-off between model size and performance. The proposed Teacher-Student scheme, described in Figure 9, comprises two networks, namely the Teacher (the upper) and the Student (the lower). The Teacher network reuses the CDNN+MoE architecture introduced in Section V-A, using only global average pooling. The Student network comprises two Conv. Blocks (07 and 08) and a Dense Block, with the configuration denoted in Table VI. Note that the Conv. Blocks used in the Student network do not apply batch normalization or dropout layers. Training the Teacher-Student scheme is separated into two phases. The Teacher is trained first, and the output of its global average pooling layer is extracted. These extracted features, 512-dimensional vectors, are referred to as soft labels and are used to train the Student network. Given the soft labels from the Teacher network, we train the Student by combining two loss functions. The first, a Euclidean distance loss, aims to minimize the difference between the soft labels and the 512-dimensional vectors extracted from the output of the global average pooling layer of the Student network. Meanwhile, the second, a cross-entropy loss, is used for classifying the three groups of respiratory diseases. Eventually, the final loss is described below,
LOSS(θ) = λ Loss_CE(θ) + (1 - λ) Loss_EU(θ)   (17)

where Loss_CE and Loss_EU are the cross-entropy and Euclidean distance losses, respectively. The hyperparameter λ is experimentally set to 1/2, and θ denotes all trainable parameters.
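The combined Student loss of Equation (17) can be sketched as below for a single example; the function and argument names are illustrative assumptions, not from the original implementation.

```python
import numpy as np

def distill_loss(student_logits, hard_label, student_feat, soft_label, lam=0.5):
    """Combined Student loss, per Eq. (17).

    lam * cross-entropy on the disease-group label plus (1 - lam) *
    Euclidean distance between the Student's pooled feature vector
    and the Teacher's soft label (its 512-dim pooled features).
    """
    # cross-entropy on the softmax of the student logits
    z = np.asarray(student_logits, float)
    z = z - z.max()                              # numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))
    ce = -log_probs[hard_label]
    # Euclidean distance to the teacher's soft-label vector
    eu = np.linalg.norm(np.asarray(student_feat, float)
                        - np.asarray(soft_label, float))
    return lam * ce + (1 - lam) * eu
```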
Architecture  layers  Output 

Input layer (image patch)  
Conv. Block 07  Cv []  Relu  Ap []  
Conv. Block 08  Cv []  Relu  Gap  
Dense Block  Fl  Softmax layer  3 
VI-B. Experimental results
Task  Method  Spec.  Sen.  ICBHI Score

2-1, 3-class  Teacher  0.86  0.95  0.91
2-1, 3-class  Student only  0.43  0.94  0.68
2-1, 3-class  Student with soft labels  0.86  0.90  0.88
2-2, 2-class  Teacher  0.86  0.98  0.92
2-2, 2-class  Student only  0.43  0.99  0.71
2-2, 2-class  Student with soft labels  0.86  0.96  0.91
From the results in Table VII, the Student network trained with soft labels from the Teacher network achieves 0.88 and 0.91 on Task 2-1 and 2-2, respectively. Although the Student network cannot reach the Teacher's scores of 0.91 and 0.92 on Task 2-1 and 2-2, it significantly reduces the size of the inference model without reducing performance too much. Looking at the results for the Student network trained without soft labels, performance drops significantly, to 0.68 and 0.71 for Task 2-1 and 2-2, respectively. As a result, applying the Teacher-Student scheme yields a Student network with only 7296 trainable parameters, compared with 30912 in the Teacher network, which can be used effectively for detection in systems requiring a low parameter count.
VII. Conclusion
This paper has presented an exploration of deep learning models for detecting respiratory disease from audio recordings. By conducting intensive experiments over the ICBHI dataset, we propose deep learning frameworks for the four challenge tasks of respiratory sound classification. The proposed systems are shown to outperform the state of the art on all tasks. Furthermore, effectively applying the Teacher-Student scheme significantly reduces the model size used for the inference process while still achieving high performance. The experimental results obtained validate the application of deep learning for early diagnosis of respiratory disease.
References
 [1] (2020) Deep neural network for respiratory sound classification in wearable devices enabled by patient specific model tuning.. IEEE transactions on biomedical circuits and systems. Cited by: §I, TABLE V.
 [2] (2017) Classification of lung sounds using convolutional neural networks. EURASIP Journal on Image and Video Processing 2017 (1), pp. 65. Cited by: §I.
 [3] (2018) Automatic detection of patient with respiratory diseases using lung sound analysis. In Proc. CBMI, pp. 1–6. Cited by: §I, TABLE V.
 [4] (2019) Triple-classification of respiratory sounds using optimized S-transform and deep residual networks. IEEE Access 7, pp. 32845–32852. Cited by: §I, §I.
 [5] (2017) Automated lung sound analysis for detecting pulmonary abnormalities. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4594–4598. Cited by: §I.
 [6] Cited by: §IVA.
 [7] (2019) An adversarial feature distillation method for audio classification. IEEE Access 7, pp. 105319–105330. Cited by: §VIA.
 [8] (2017) Feature extraction for machine learning based crackle detection in lung sounds from a health survey. arXiv preprint arXiv:1706.00005. Cited by: §I.

 [9] (2019) Acoustic scene classification using teacher-student learning with soft-labels. arXiv preprint arXiv:1904.10135. Cited by: §VIA.
 [10] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IIIC.
 [11] (2018) Noise masking recurrent neural network for respiratory sound classification. In International Conference on Artificial Neural Networks, pp. 208–217. Cited by: §I, TABLE V.
 [12] (2019) A novel method for automatic identification of respiratory disease from acoustic recordings. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 2589–2592. Cited by: §I, TABLE V.
 [13] (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §IVE, §VB.
 [14] (2019) Detection of adventitious respiratory sounds based on convolutional neural network. In 2019 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), pp. 298–303. Cited by: §I, TABLE V.
 [15] (2015) Librosa: audio and music signal analysis in python. In Proceedings of The 14th Python in Science Conference, pp. 18–25. Cited by: §IVA.
 [16] (2018) Early detection of continuous and partial audio events using CNN. In Proc. INTERSPEECH, Cited by: §I.
 [17] (2016) Detection of crackle events using a multifeature approach. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3679–3683. Cited by: §I.
 [18] (2018) Crackle and breathing phase detection in lung sounds with deep bidirectional gated recurrent neural networks. In Proc. EMBC, pp. 356–359. Cited by: §I.
 [19] (2019) Automatic classification of largescale respiratory sound dataset based on convolutional neural network. In 2019 19th International Conference on Control, Automation and Systems (ICCAS), pp. 804–807. Cited by: §I.
 [20] (2014) Classification of healthy subjects and patients with pulmonary emphysema using continuous respiratory sounds. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 70–73. Cited by: §I.
 [21] (2019) Hidden Markov model-based asthmatic wheeze recognition algorithm leveraging the parallel ultra-low-power processor (PULP). In 2019 IEEE Sensors Applications Symposium, pp. 1–6. Cited by: §I.
 [22] Cited by: §I.
 [23] (2019) Deep auscultation: predicting respiratory anomalies and diseases via recurrent neural networks. In Proc. CBMS, pp. 50–55. Cited by: §I, §I, §IIC, TABLE V.
 [24] (2018) Convolutional neural networks learning from respiratory data. In Proc. BIBM, pp. 2109–2113. Cited by: §I, TABLE V.
 [25] (2019) Bag-of-features models based on C-DNN network for acoustic scene classification. In Proc. AES, Cited by: §IVA, §IVE.
 [26] (2019) A robust framework for acoustic scene classification. In Proc. INTERSPEECH, pp. 3634–3638. Cited by: §IVA, §IVE.
 [27] (2004) A simple computer-based measurement and analysis system of pulmonary auscultation sounds. Journal of Medical Systems 28 (6), pp. 665–672. Cited by: §I.
 [28] (2008) Analysis of respiratory sounds: state of the art. Clinical medicine. Circulatory, respiratory and pulmonary medicine 2, pp. CCRPM–S530. Cited by: §I.
 [29] (2011) Detecting novel associations in large data sets. science 334 (6062), pp. 1518–1524. Cited by: §I.
 [30] (2003) Automatic wheezing recognition in recorded lung sounds. In Proc. EMBC, Vol. 3, pp. 2535–2538. Cited by: §I.
 [31] (2018) A respiratory sound database for the development of automated classification. In Precision Medicine Powered by pHealth and Connected Health, pp. 33–37. Cited by: §I, §IIA, §IIC.
 [32] (2017) Lung sound classification using local binary pattern. arXiv preprint arXiv:1710.01703. Cited by: §I.
 [33] (2018) An automated lung sound preprocessing and classification system based on spectral analysis methods. In Precision Medicine Powered by pHealth and Connected Health, pp. 45–49. Cited by: §I.
 [34] (2019) Lung sound recognition algorithm based on VGGish-BiGRU. IEEE Access 7, pp. 139438–139449. Cited by: §I.
 [35] (2017) Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282. Cited by: §IVE.
 [36] (2018) Mixup-based acoustic scene classification using multi-channel convolutional neural network. In Pacific Rim Conference on Multimedia, pp. 14–23. Cited by: §IVE.
 [37] (2015) Robust sound event recognition using convolutional neural networks. In Proc. ICASSP, pp. 559–563. Cited by: §I.
 [38] (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §IVE.