Predicting Respiratory Anomalies and Diseases Using Deep Learning Models

by   Lam Pham, et al.

In this paper, we introduce robust deep learning frameworks that aim to detect respiratory diseases from respiratory sound inputs. The entire process begins with front-end feature extraction that transforms recordings into spectrograms. Next, a back-end deep learning model classifies the spectrogram features into categories of respiratory disease or anomaly. Experiments are conducted on the ICBHI benchmark dataset of respiratory sounds. Based on the experimental results, we make three main contributions toward lung-sound analysis. Firstly, we provide an extensive analysis of common factors (type of spectrogram, time resolution, cycle length, data augmentation, etc.) that affect final prediction accuracy in a deep learning based system. Secondly, we propose novel deep learning based frameworks that use the most influential factors identified; the proposed frameworks outperform state-of-the-art methods. Finally, we successfully apply the Teacher-Student scheme to address the trade-off between model performance and model size, which makes real-time applications more feasible.




I Introduction

According to an analysis conducted by the World Health Organization [22], respiratory illnesses, which comprise lung cancer, tuberculosis (TB), asthma, chronic obstructive pulmonary disease (COPD), and lower respiratory tract infection (LRTI), account for a high percentage of mortality worldwide. Annual records indicate that around 10, 65, and 334 million people currently suffer from TB, COPD, and asthma, respectively, and that about 1.4, 1.6, and 3 million people die from TB, lung cancer, and COPD each year. In dealing with respiratory diseases, early detection is key to increasing the effectiveness of treatment and limiting spread. Lung auscultation is an important part of a respiratory examination for diagnosing respiratory diseases. By listening to the sounds produced during lung auscultation, experts can recognize adventitious sounds (e.g., crackles and wheezes) in the respiratory cycle, which usually occur in people suffering from pulmonary disorders. If automated methods can be developed to detect these anomalous sounds, they may enhance the early detection of respiratory disease. Although automated analysis of respiratory sounds was attempted early on [27, 30, 28], the research field attracted little attention at first. It has drawn much more attention in recent years thanks to the application of robust machine learning and deep learning techniques.

As regards the machine learning approach, systems proposed for respiratory sound analysis tend to rely on frame-based representations. Most studies [20, 12] used Mel-Frequency Cepstral Coefficients (MFCC), the most popular feature in the Automatic Speech Recognition (ASR) research field, to derive feature vectors. Using both spectral and temporal features, Melbye et al. extracted five-dimensional feature vectors from the raw audio signal, comprising four features from the time domain (variance, range, and two sum-of-simple-moving-average features) and one feature from the frequency domain (spectrum mean). Meanwhile, Hanna et al. first extracted spectral information (bark bands, energy bands, mel bands, MFCC, etc.), rhythm features (beats loudness, bpm, etc.), harmonicity and inharmonicity features, and tonal features (chords strength, tuning frequency, etc.). Next, they computed statistical values from these features, such as the standard deviation, variance, minimum and maximum, median, mean, and the means and variances of the first and second derivatives, to maximize the chance of a correct feature representation. To further explore audio features, Mendes et al. [17] proposed using 35 different types of features, mainly drawn from Music Information Retrieval research. Motivated by the observation that only certain features strongly affect the final result, Datta et al. [5] first extracted various features such as power spectral density (PSD), FFT and wavelet spectrograms, MFCC, and Linear Frequency Cepstral Coefficients (LFCC). Next, they applied the Maximal Information Coefficient (MIC) [29] to score these features and selected the most influential ones before feeding them into a classifier. Similarly, Kok et al. [12] applied the Wilcoxon rank-sum test to indicate which features among MFCC, Discrete Wavelet Transform (DWT), and time-domain features (the power, mean, variance, skewness, and kurtosis of the audio signal) mainly affect the final accuracy. Adopting image processing techniques, Sengupta et al. applied Local Binary Pattern (LBP) analysis to the mel-frequency spectral coefficient (MFSC) spectrogram to capture its texture information. The LBP spectrogram is then converted into a histogram representation before being fed into a back-end classifier. Ordinarily, frame-based features, in the form of vectors, are classified by traditional machine learning models such as Logistic Regression, K-Nearest Neighbor (KNN) [8, 32], Hidden Markov Model [20, 21], Support Vector Machine [8, 5, 32, 33], or decision trees [12, 8, 3].

Deep learning techniques have achieved strong and robust detection performance for general sounds [37][16]. In these approaches, feature extraction generates two-dimensional spectrograms, which capture both temporal and spectral information and present a much wider time context than single-frame analysis. While there is a variety of spectrogram transformations, Mel-based methods such as log-Mel [34, 14, 1] and MFCC [2, 34, 24, 23, 18, 11] are the most popular. Some papers used different spectrograms, such as the combination of two spectrograms (STFT and wavelet) proposed by Minami et al. [19] or the optimized S-Transform in [4]. Current deep learning classifiers that exploit spectrogram representations of respiratory sounds are mainly based on Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), or hybrid architectures. As regards CNN-based networks, published papers presented diverse architectures such as Lenet6 [24, 2], VGG5 [14], two parallel VGG16 [19], and Resnet50 [4]. Motivated by the facts that adventitious respiratory sounds such as Crackle and Wheeze present certain temporal sequences and that RNN-based networks are able to capture these structures, Perna and Tagarelli [23] conducted a comprehensive analysis of a Long Short-term Memory (LSTM) network used for classifying both anomalous respiratory sounds and respiratory diseases. Using both LSTM and Gated Recurrent Unit (GRU) cells as learning components in an RNN-based network, Kochetov et al. [11] proposed a novel architecture, namely the Noise Masking Recurrent Neural Network, which aims to distinguish both noise and anomalous respiratory sounds. In the hybrid architectures proposed in [1, 19], a CNN is first used to map the spectrogram input to a time sequence. Next, LSTM [1] or GRU [19] cells are used to learn the structure of the sequence before it is sent to fully-connected layers for final classification.

Compared with the machine learning approach, the state-of-the-art comparisons presented in [23, 4] indicate that deep learning classifiers are more robust and more effective at achieving good scores. However, deep learning based models have much more complicated architectures and thus require a large amount of memory, which is a problem when large models are integrated into wearable devices or embedded systems for real-time applications. In other words, state-of-the-art systems present a trade-off between model performance and model size. Additionally, although recent deep learning techniques achieve good performance in classifying respiratory sounds, it is hard to compare systems because they use different datasets, which are mainly collected by the authors and often not publicly available.

In this paper, we propose robust deep learning frameworks, evaluated on the ICBHI dataset [31], that aim to:

  • Compare our results with state-of-the-art systems, which is possible because the ICBHI dataset is published. Furthermore, as ICBHI is currently one of the largest datasets of respiratory sounds, it helps the proposed deep learning models to generalize.

  • Provide a comprehensive analysis of various factors, such as type of spectrogram, overlap/non-overlap patches, patch size, and data augmentation, and thereby propose the two best deep learning models, one for each task of anomaly respiratory sound classification and respiratory disease detection.

  • Address the trade-off between model performance and model size by applying the Student-Teacher scheme. In particular, we take the best deep learning model as the Teacher. We extract information from a middle layer of the Teacher model and treat the values as soft labels. Next, we use the soft labels to train another, smaller model, called the Student. Eventually, we obtain a small model (the Student network trained with soft labels) that shows performance similar to the Teacher model.

II ICBHI dataset and our tasks proposed

II-A ICBHI dataset

The 2017 International Conference on Biomedical and Health Informatics (ICBHI) [31] provided a large dataset of respiratory sounds. It comprises 920 audio recordings totalling over 5.5 hours. The recordings vary in length from 10 to 90 s and were recorded with a wide range of sampling frequencies, from 4 kHz to 44.1 kHz. The ICBHI dataset was collected from a total of 128 patients, each identified as being healthy or as exhibiting one of the following respiratory diseases or conditions: COPD, Bronchiectasis, Asthma, Upper and Lower respiratory tract infection, Pneumonia, and Bronchiolitis. Each audio recording is labelled with the corresponding condition. Within each audio recording, different types of respiratory cycle, called Crackle, Wheeze, Crackle & Wheeze, and Normal, are present. These cycles were labelled by experts, who provided onset and offset times. Notably, the cycles vary in length (from 0.2 s up to 16.2 s), and the number of cycles per class is unbalanced (1864, 886, 506, and 3642 cycles for Crackle, Wheeze, Crackle & Wheeze, and Normal, respectively).

II-B Main tasks from ICBHI dataset

Given this metadata, the ICBHI challenge is separated into two main tasks.

Fig. 1: High-level system architecture

Task 1, referred to as respiratory anomaly classification, is separated into two sub-tasks. The first sub-task aims to classify four different cycles (Crackle, Wheeze, Crackle & Wheeze, and Normal). The second sub-task classifies the four types of cycles into two groups, Normal and Anomaly (the latter consisting of Crackle, Wheeze, and Crackle & Wheeze). We name these sub-tasks Task 1-1 and Task 1-2.

Task 2, referred to as respiratory disease prediction, also comprises two sub-tasks. The first sub-task aims to classify audio recordings into three groups of disease conditions: Healthy, Chronic Disease (i.e. COPD, Bronchiectasis, and Asthma), and Non-Chronic Disease (i.e. Upper and Lower respiratory tract infection, Pneumonia, and Bronchiolitis). The second sub-task classifies recordings into two groups, healthy or unhealthy (i.e. the chronic and non-chronic disease groups combined). We name these sub-tasks Task 2-1 and Task 2-2. While Task 1-1 and 1-2 are evaluated over respiratory cycles, Task 2-1 and 2-2 are evaluated over entire audio recordings.

II-C Evaluation metric and our setting

In this paper, we attempt all of the ICBHI challenge tasks described above. To evaluate our systems on each task, we separate the ICBHI dataset (6898 respiratory cycles for Task 1-1 and Task 1-2, and 920 entire recordings for Task 2-1 and Task 2-2) into five folds for cross validation. We first introduce a baseline system and conduct experiments on it, over just the first fold, to identify the most influential factors. From the analysis of these factors, we propose the best system configurations and evaluate them over all folds. Note that we eventually propose two deep learning frameworks, one for anomaly cycle detection (Task 1-1 and 1-2) and one for respiratory disease detection (Task 2-1 and 2-2). To evaluate the proposed systems and compare them with the state of the art, we follow the ICBHI criteria and settings, and report results in terms of sensitivity, specificity, and ICBHI score as defined in [31, 23] below,

$$\text{Sensitivity} = \frac{C_c + C_w + C_b}{N_c + N_w + N_b} \tag{1}$$

for classifying four classes of cycles in Task 1-1,

$$\text{Sensitivity} = \frac{C_{c,w,b}}{N_c + N_w + N_b} \tag{2}$$

for two groups of normal or adventitious cycles in Task 1-2, and

$$\text{Specificity} = \frac{C_n}{N_n} \tag{3}$$

where $C$ and $N$ are the number of correct inferences and the total number of cases, and the subscripts $c$, $w$, $b$, and $n$ denote Crackle, Wheeze, both Crackle & Wheeze, and Normal cycles; $C_{c,w,b}$ counts the adventitious cycles assigned to the Anomaly group. Similarly, respiratory disease classification in Task 2-1 and 2-2 uses the criteria below,

$$\text{Sensitivity} = \frac{C_{ch} + C_{nc}}{N_{ch} + N_{nc}} \tag{4}$$

for three groups of diseases in Task 2-1,

$$\text{Sensitivity} = \frac{C_{ch,nc}}{N_{ch} + N_{nc}} \tag{5}$$

for two groups of healthy or unhealthy in Task 2-2, and

$$\text{Specificity} = \frac{C_h}{N_h} \tag{6}$$

where $C$ and $N$ are again the number of correct inferences and the total number of cases, and the subscripts $ch$, $nc$, and $h$ denote the Chronic, Non-Chronic, and Healthy groups. The ICBHI score is computed by averaging Sensitivity and Specificity.
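The cycle-level metrics described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: it assumes that sensitivity is the fraction of adventitious cycles classified correctly, that specificity is the fraction of Normal cycles classified correctly, and that both groups are non-empty. The function name `icbhi_scores` and the string labels are hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical label names for the three adventitious classes.
ANOMALY = {"crackle", "wheeze", "both"}


def icbhi_scores(y_true: Sequence[str], y_pred: Sequence[str],
                 four_class: bool = True) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, ICBHI score) over respiratory cycles.

    four_class=True  : an anomalous cycle counts as correct only if its exact
                       class is predicted (Task 1-1 style).
    four_class=False : an anomalous cycle counts as correct if any anomalous
                       class is predicted (Task 1-2 style).
    Assumes at least one normal and one anomalous cycle in y_true.
    """
    n_normal = n_anomaly = c_normal = c_anomaly = 0
    for t, p in zip(y_true, y_pred):
        if t == "normal":
            n_normal += 1
            c_normal += int(p == "normal")
        else:
            n_anomaly += 1
            c_anomaly += int(p == t) if four_class else int(p in ANOMALY)
    sensitivity = c_anomaly / n_anomaly
    specificity = c_normal / n_normal
    return sensitivity, specificity, (sensitivity + specificity) / 2
```

As a sanity check on the averaging step, the 5 s row of Table III reports a specificity of 84.20 and a sensitivity of 73.08 for Task 1-1, whose mean is 78.64, the ICBHI score listed in that row.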

III High-level system and the baseline architecture proposed

III-A High-level system architecture

Firstly, the high-level system architecture used for all tasks of anomaly sound and disease detection is described in Figure 1. As Figure 1 shows, the entire system is separated into two main steps: front-end feature extraction (the upper part) and back-end deep learning models (the lower part). In particular, cycles in Task 1 or entire audio recordings in Task 2 are transformed into a spectrogram representation. The entire spectrogram is then split into image patches. Next, mixup data augmentation is applied to the image patches to generate new data before they are fed into deep learning models for classification.

III-B Baseline architecture proposed

From the high-level system architecture in Figure 1, it can be seen that a variety of factors affect the performance of a deep learning based system, such as cycle length (only for Task 1-1 and 1-2), type of spectrogram, overlap or non-overlap splitting, patch size, and data augmentation. No research on respiratory sounds has analysed all of these factors. We therefore provide an intensive analysis in this paper and identify the most influential factors. To do this, we first introduce a baseline system architecture with the settings shown in Table I.

| Factors | Setting |
|---|---|
| Re-sample | 16 kHz |
| Cycle duration (only for Task 1) | 5 s |
| Spectrogram | log-Mel |
| Patch splitting | non-overlap |
| Patch size | |
| Data augmentation | None |
| Deep learning model | C-DNN based architecture |

TABLE I: Baseline system setting proposed.

Selecting only one option for each factor, we first re-sample all audio recordings to 16 kHz, since the original sample rates differ. Since cycle lengths also differ, we duplicate short respiratory cycles to ensure that input features have a minimum length (e.g. 5 s or longer); this is unnecessary for Task 2, which uses entire recordings. Next, each cycle or audio recording is transformed into a log-Mel spectrogram with window size=1024 samples, hop size=256, FFT length=2048, and filter number=64. The resulting spectrogram is then split into smaller non-overlapping patches. As data augmentation is one of the factors evaluated, we do not apply this technique in the baseline system. As regards the deep learning model used for the baseline system, we propose a C-DNN network architecture, similar to VGG-7, as shown in Table II. The C-DNN contains 7 sub-blocks, comprising 6 Conv. Blocks and 1 Dense Block, which perform batch normalization (Bn), convolution (Cv[kernel size]), rectified linear units (Relu), average pooling (Ap[kernel size]), global average pooling (Gap), dropout (Dr (percentage drop)), a fully-connected layer (Fl), and a final Softmax layer for classification.

C is the number of categories classified, which depends on the specific task. In particular, we use two C-DNN models, with C set to 4 for Task 1-1 and 1-2 and to 3 for Task 2-1 and 2-2. Note that as a separate model is trained for each task, the parameters obtained after training differ between the models.

| Architecture | Layers | Output |
|---|---|---|
| Input layer (image patch) | | |
| Conv. Block 01 | Bn - Cv [] - Relu - Bn - Ap [] - Dr () | |
| Conv. Block 02 | Bn - Cv [] - Relu - Bn - Ap [] - Dr () | |
| Conv. Block 03 | Bn - Cv [] - Relu - Bn - Dr () | |
| Conv. Block 04 | Bn - Cv [] - Relu - Bn - Ap [] - Dr () | |
| Conv. Block 05 | Bn - Cv [] - Relu - Bn - Dr () | |
| Conv. Block 06 | Bn - Cv [] - Relu - Bn - Gap - Dr () | |
| Dense Block | Fl - Softmax layer | C |

TABLE II: C-DNN network architecture used in the baseline system

III-C Experimental setting of the baseline system

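We use the TensorFlow framework to build models, with hyperparameters comprising the Adam optimiser [10], the number of epochs, the batch size, and the cross-entropy loss below,

$$Loss_{EN}(\Theta) = -\frac{1}{N}\sum_{n=1}^{N} y_n \log(\hat{y}_n) + \frac{\lambda}{2}\|\Theta\|_2^2 \tag{7}$$

where $\Theta$ denotes all trainable parameters, $N$ is the batch size, and $\lambda$ is a constant. $y_n$ and $\hat{y}_n$ denote the expected and predicted results.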

An entire spectrogram or cycle is separated into patches, which are applied patch by patch to the C-DNN model; the model returns a posterior probability for each patch. The posterior probability of an entire spectrogram can then be computed by averaging the posterior probabilities of all patches. Let us consider $p^{n} = (p^{n}_1, p^{n}_2, \dots, p^{n}_C)$, with $C$ being the category number and $n$ indexing the $n$-th out of $N$ patches fed into the learning model, as the probability of a test sound instance. The mean classification probability is then denoted as $\bar{p} = (\bar{p}_1, \bar{p}_2, \dots, \bar{p}_C)$ where,

$$\bar{p}_c = \frac{1}{N}\sum_{n=1}^{N} p^{n}_c \quad \text{for } 1 \le c \le C \tag{8}$$

and the predicted label from the C-DNN is determined using,

$$\hat{y} = \arg\max(\bar{p}_1, \bar{p}_2, \dots, \bar{p}_C) \tag{9}$$
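The patch-level averaging and label selection described above can be illustrated with NumPy. This is a small sketch under the assumption that the model's softmax outputs for the patches of one recording or cycle are stacked in an array of shape (N, C); the function name `predict_from_patches` is hypothetical.

```python
from typing import Tuple

import numpy as np


def predict_from_patches(patch_probs: np.ndarray) -> Tuple[np.ndarray, int]:
    """Average per-patch posteriors and pick the most probable category.

    patch_probs: array of shape (N, C), one softmax output per patch.
    Returns the mean probability vector of shape (C,) and the predicted index.
    """
    mean_probs = patch_probs.mean(axis=0)      # average over the N patches
    return mean_probs, int(np.argmax(mean_probs))
```

For example, three patches with posteriors (0.6, 0.4), (0.2, 0.8), and (0.3, 0.7) give a mean of roughly (0.37, 0.63), so the second category is predicted even though the first patch favoured the first one.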


IV Analysis of influencing factors

Using the baseline system proposed above, we conduct experiments in this Section to evaluate the effect of the factors mentioned and to select the best option for each factor.

IV-A Spectrogram analysis

As shown by our previous work on the DCASE sound scene dataset [26, 25], the type of spectrogram is one of the most important factors affecting final classification. Therefore, the effect of the spectrogram on the baseline is evaluated first, on all tasks. In particular, the baseline setting described in Table I is retained, but the spectrogram is set in turn to log-Mel [15], Gammatone filter (Gamma.) [6], Mel-Frequency Cepstral Coefficient (MFCC) [15], and Constant Q Transform (CQT) [15].

Fig. 2: Baseline performance comparison with different spectrograms
Fig. 3: Performance comparison between log-Mel and Gamma., MFCC

The results shown in Figure 2 indicate that MFCC, log-Mel, and Gamma. perform similarly in general and much better than CQT on all tasks. Taking the log-Mel scores as the reference, Figure 3 shows the performance difference between log-Mel and Gamma. or MFCC. It can be seen that Gamma. outperforms the other spectrograms on Task 1, with an improvement of 4% and 3.2% compared with log-Mel and MFCC, respectively. However, both Gamma. and MFCC perform slightly worse than log-Mel on Task 2. Based on these results, we decide to use Gamma. for anomaly cycle detection (Task 1-1 and 1-2) and log-Mel for respiratory disease detection (Task 2-1 and 2-2). Note that Gamma. and log-Mel are used in this way for all experiments presented below.

IV-B Cycle length analysis

Respiratory cycles in the ICBHI dataset have diverse lengths, ranging from 0.2 s to 16.2 s, with 80% of cycles being shorter than 5 s. It is therefore interesting to understand how respiratory cycle length affects classification accuracy. Using the baseline proposed in Section III-C with the Gamma. spectrogram selected in Section IV-A, we evaluate cycle lengths from 3 to 8 s, called standard lengths. To deal with short cycles, we duplicate them to obtain new cycles that are equal to or longer than the standard length before transforming them into spectrograms.
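The duplication of short cycles can be sketched as below. This is one possible reading of the step, assuming whole repetitions of the waveform are concatenated until the standard length is reached or exceeded; the function name `pad_by_duplication` and its default values (16 kHz, 5 s) are illustrative choices taken from the baseline setting, not the authors' code.

```python
import numpy as np


def pad_by_duplication(cycle: np.ndarray, sample_rate: int = 16000,
                       min_seconds: float = 5.0) -> np.ndarray:
    """Repeat a short respiratory cycle until it spans at least min_seconds.

    cycle: 1-D waveform of one respiratory cycle.
    Cycles that are already long enough are returned unchanged.
    """
    if cycle.size == 0:
        raise ValueError("cannot duplicate an empty cycle")
    min_len = int(round(min_seconds * sample_rate))
    if cycle.size >= min_len:
        return cycle
    repeats = int(np.ceil(min_len / cycle.size))  # whole copies needed
    return np.tile(cycle, repeats)
```

With these defaults, a 2 s cycle (32000 samples) is repeated three times to give 6 s of audio, which satisfies the "equal to or longer than" condition for a 5 s standard length.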

| Task | Cyc. Len. | Spec. | Sen. | ICBHI Score |
|---|---|---|---|---|
| 1-1, 4-category | 3 s | 84.20 | 70.92 | 77.56 |
| 1-1, 4-category | 4 s | 83.52 | 71.54 | 77.53 |
| 1-1, 4-category | 5 s | 84.20 | 73.08 | 78.64 |
| 1-1, 4-category | 6 s | 81.45 | 73.07 | 77.27 |
| 1-1, 4-category | 7 s | 82.56 | 73.54 | 78.05 |
| 1-1, 4-category | 8 s | 86.95 | 69.84 | 78.40 |
| 1-2, 2-category | 3 s | 84.20 | 81.85 | 83.03 |
| 1-2, 2-category | 4 s | 83.52 | 83.54 | 83.53 |
| 1-2, 2-category | 5 s | 84.20 | 83.69 | 83.95 |
| 1-2, 2-category | 6 s | 81.45 | 81.46 | 82.81 |
| 1-2, 2-category | 7 s | 82.56 | 84.31 | 83.43 |
| 1-2, 2-category | 8 s | 86.95 | 80.92 | 83.94 |

TABLE III: Respiratory cycle length analysis on Task 1 (%)

Table III reports Task 1-1 and 1-2 results for different cycle lengths. The best ICBHI scores are 78.64 and 83.95 for the 4- and 2-category sub-tasks, respectively, with the cycle length set to 5 s (the same as the baseline setting). Although cycle lengths of 7 or 8 s also show competitive results, we choose the shorter length of 5 s for further experiments because it gives the best results and reduces running cost.

IV-C Overlap/non-overlap splitting analysis

As the spectrogram representation of an entire cycle or audio recording is too long in the temporal dimension, it is split into smaller patches that are suitable for the back-end deep learning models. Overlap splitting is considered useful for keeping the temporal sequence continuous. This Section therefore evaluates whether overlap splitting should be applied.
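The difference between the two splitting options can be made concrete with a short sketch. The patch width and the amount of overlap used by the authors are not stated in this text, so the values used in the usage note (a width of 64 frames and a hop of 32 frames) are assumptions for illustration only, as is the function name `split_patches`.

```python
import numpy as np


def split_patches(spec: np.ndarray, width: int, hop: int) -> np.ndarray:
    """Split a (freq, time) spectrogram into patches along the time axis.

    hop == width gives non-overlapping patches; hop < width gives overlap.
    Trailing frames that do not fill a whole patch are dropped.
    Returns an array of shape (num_patches, freq, width).
    """
    if width <= 0 or hop <= 0:
        raise ValueError("width and hop must be positive")
    if spec.shape[1] < width:
        raise ValueError("spectrogram is shorter than one patch")
    starts = range(0, spec.shape[1] - width + 1, hop)
    return np.stack([spec[:, s:s + width] for s in starts])
```

For a spectrogram with 64 frequency bins and 200 time frames, `split_patches(spec, 64, 64)` yields 3 non-overlapping patches, while `split_patches(spec, 64, 32)` yields 5 patches in which consecutive patches share half of their frames.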

Fig. 4: Performance comparison between non-overlap and overlap splitting option on all tasks

As the results in Figure 4 show, classifying anomaly cycles in Task 1-1 and 1-2 achieves the best scores of 76.59% and 78.57%, respectively, with overlapping patches. By contrast, non-overlap splitting is more effective for Task 2-1 and 2-2 with entire recordings, giving 78.64% and 83.95%, respectively.

IV-D Time resolution analysis

The baseline network operates on patches whose horizontal dimension denotes the time span of each feature. Because features are sequential, the time span also sets their temporal resolution. To explore this, we adjust the patch width to 0.6 s, 1.2 s, 1.8 s, 2.4 s, and 3 s by changing the patch dimension accordingly, then retrain and evaluate the performance of each setting.

Fig. 5: Performance comparison among time resolutions on all tasks

The results shown in Figure 5 indicate that a patch width of 1.2 s is the best choice for Task 1-1 and Task 1-2, with 78.64% and 83.95%, respectively, whereas Task 2-1 and 2-2 achieve their best scores of 80.63% and 84.58% with a patch width of 2.4 s.

IV-E Data augmentation analysis

Data augmentation is useful for strengthening the learning ability of deep learning models, as shown in [25, 26]. In this paper, we therefore apply a data augmentation method, namely mixup [36, 35, 38], and evaluate whether it is useful for respiratory sounds. Let $X_1$ and $X_2$ be two image patches randomly selected from the set of original image patches, with labels $y_1$ and $y_2$, respectively. Mixup data augmentation generates new image patches as in the Equations below,

$$X_{mp1} = \alpha X_1 + (1 - \alpha) X_2 \tag{10}$$

$$X_{mp2} = (1 - \alpha) X_1 + \alpha X_2 \tag{11}$$

$$y_{mp1} = \alpha y_1 + (1 - \alpha) y_2 \tag{12}$$

$$y_{mp2} = (1 - \alpha) y_1 + \alpha y_2 \tag{13}$$

where $\alpha$ is drawn from either a uniform distribution or a beta distribution, and $X_{mp1}$ and $X_{mp2}$ are two new image patches obtained by mixing $X_1$ and $X_2$ with the random mixing coefficient $\alpha$. After mixup, the original data and the data generated by mixup are shuffled and fed into the proposed C-DNN baseline, which doubles the batch size and increases the training time. Note that, as the categories classified in Task 1-1 and 1-2 comprise Crackle, Wheeze, Crackle & Wheeze, and Normal, the two original patches selected for mixup are each drawn from either the Normal class or the group of anomaly cycles. In Task 2-1 and 2-2, this selection is fully random.
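The mixing of two patches and their labels can be sketched as follows. This is an illustrative implementation of standard mixup producing two mixed samples per pair; the function name `mixup_pair`, the beta shape parameter of 0.4, and the choice to apply the same coefficient to patches and labels are assumptions, since the text does not give these details.

```python
import numpy as np


def mixup_pair(x1, y1, x2, y2, beta_shape: float = 0.4,
               use_beta: bool = True, rng=None):
    """Create two mixed patches and soft labels from two labelled patches.

    x1, x2: image patches of identical shape.
    y1, y2: one-hot label vectors of identical length.
    The mixing coefficient is drawn from Beta(beta_shape, beta_shape) or,
    if use_beta is False, from Uniform(0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    alpha = rng.beta(beta_shape, beta_shape) if use_beta else rng.uniform(0.0, 1.0)
    x_mp1 = alpha * x1 + (1.0 - alpha) * x2
    x_mp2 = (1.0 - alpha) * x1 + alpha * x2
    y_mp1 = alpha * y1 + (1.0 - alpha) * y2   # soft label, no longer one-hot
    y_mp2 = (1.0 - alpha) * y1 + alpha * y2
    return (x_mp1, y_mp1), (x_mp2, y_mp2)
```

Because the two outputs use complementary coefficients, their sum equals the sum of the two inputs, and each mixed label still sums to one, which is why a loss that accepts soft labels is needed afterwards.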

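Because mixup data augmentation is applied, the new labels $y_{mp1}$ and $y_{mp2}$ of the two mixup patches are no longer one-hot labels. The Kullback-Leibler (KL) divergence loss [13] is therefore used instead of the standard cross-entropy loss, as shown in the Equation below,

$$Loss_{KL}(\Theta) = \sum_{n=1}^{N} y_n \log\left(\frac{y_n}{\hat{y}_n}\right) + \frac{\lambda}{2}\|\Theta\|_2^2 \tag{14}$$

where $Loss_{KL}(\Theta)$ is the KL-loss function, $\Theta$ denotes the trainable network parameters, $\lambda$ denotes the $\ell_2$-norm regularization coefficient, set to 0.0001, and $N$ is the batch number. $y_n$ and $\hat{y}_n$ denote the ground-truth and the network output, respectively.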
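A KL-divergence loss with an L2 penalty on the parameters can be written compactly in NumPy. The sketch below assumes the divergence is summed over the batch and that the penalty is half the regularization coefficient times the squared L2 norm of all parameters; these details, along with the function name `kl_loss`, are assumptions rather than the authors' exact formulation.

```python
import numpy as np


def kl_loss(y_true, y_pred, params=(), reg: float = 1e-4, eps: float = 1e-12) -> float:
    """KL divergence between soft labels and predictions plus an L2 penalty.

    y_true, y_pred: arrays of shape (batch, C) whose rows sum to one.
    params: iterable of parameter arrays included in the L2 penalty.
    Entries where y_true is zero contribute nothing to the divergence.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, None)
    ratio = np.where(y_true > 0, y_true / y_pred, 1.0)  # log(1) = 0 for zero labels
    divergence = float(np.sum(y_true * np.log(ratio)))
    penalty = 0.5 * reg * sum(float(np.sum(np.square(p))) for p in params)
    return divergence + penalty
```

The divergence term is zero when the prediction matches the soft label exactly, so with mixup labels such as (0.7, 0.3) the network is rewarded for reproducing the mixing proportions rather than a single hard class.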

Fig. 6: Improve performance by using mixup data augmentation

The results in Figure 6 indicate that applying mixup data augmentation improves the classification accuracy on all tasks. Notably, it improves Task 2-1 and 2-2 by 8.5% and 8.2%, respectively.

V Proposed robust deep learning frameworks

| Factors | Anomaly cycles detection | Respiratory diseases detection |
|---|---|---|
| Re-sample | 16 kHz | 16 kHz |
| Cycle duration | 5 s | - |
| Spectrogram | Gamma. | log-Mel |
| Patch splitting | overlap | non-overlap |
| Patch size | | |
| Data augmentation | True | True |
| Deep learning model | C-DNN with MoE | C-DNN with MoE |

TABLE IV: Deep learning framework setting

From the comprehensive analysis of influencing factors in deep learning based systems above, we propose two deep learning frameworks, one for anomaly cycle detection and one for respiratory disease detection, with the settings shown in Table IV. As regards the deep learning model, we use the same network architecture for all tasks, namely C-DNN with MoE, which is an improvement of the C-DNN baseline and is presented in the next Section.

V-A Improved C-DNN deep learning model

Fig. 7: MoE block architecture
Fig. 8: Proposed deep learning model

Reviewing the C-DNN baseline shown in Table II, the first six Conv. Blocks map the image patch input to condensed feature vectors (the output of the global mean pooling layer in Conv. Block 06). These vectors are then classified by a fully-connected layer and a Softmax layer in the final Dense Block. Since the condensed feature vectors extracted by the global mean across the channel dimension may not capture enough information, we extract more information by separately using two other pooling layers: global max pooling and global conv. pooling. While global max pooling is widely used, the proposed global conv. pooling is an added convolutional layer with its kernel size set to the frequency and temporal dimensions and its filter number set to the channel dimension. In particular, the output shape of the tensor at the final convolutional layer in the C-DNN baseline is [n×8×8×512], where n is the batch size and 8, 8, and 512 are the frequency, temporal, and channel dimensions, respectively. We therefore apply a convolutional layer with a kernel size of [8×8] and 512 filters to this tensor, obtaining 512-dimensional vectors. By using three types of pooling layers (global max pooling, global mean pooling, and conv. pooling), we aim to capture as much information as possible. Each type of pooling layer extracts a 512-dimensional vector, as shown in Figure 8.

The second improvement focuses on the Dense Block architecture, which performs the final classification. In particular, we replace the Dense Block with an MoE block. A conventional MoE block architecture comprises many experts and incorporates a gate network to decide which expert is applied in which input region, as shown in Figure 7. In our context, the 512-dimensional input vector extracted from the pooling layers goes through the experts. Next, the expert outputs are gated before passing through a softmax to determine the final score. Each MoE expert comprises a fully-connected layer and a ReLU activation function. Its input dimension is 512 and its output size is the number of categories $C$ classified. The gate network is implemented as a Softmax Gate, an additional fully-connected layer with a softmax activation function and a gating dimension equal to the number of experts. Let $e_1, e_2, \dots, e_K \in \mathbb{R}^C$ be the output vectors of the $K$ experts, and $g_1, g_2, \dots, g_K$ be the outputs of the gate network, where $g_k > 0$ and $\sum_{k=1}^{K} g_k = 1$. The predicted output is then found as,

$$\hat{y} = \text{softmax}\left(\sum_{k=1}^{K} e_k g_k\right) \tag{15}$$
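The three pooling heads and the gated mixture of experts described above can be sketched with NumPy. The code assumes a channels-last feature map, treats the global conv. pooling as a convolution whose kernel covers the whole frequency-time map (so it reduces to one weighted sum per filter, with no bias), and takes all weight arrays as inputs. The function names, the argument layout, and the number of experts used in any call are hypothetical, since the text does not state them.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def pooled_features(feat: np.ndarray, conv_kernel: np.ndarray):
    """Return global mean, global max, and global conv. pooled vectors.

    feat: (n, F, T, ch) output of the last convolutional layer.
    conv_kernel: (F, T, ch, ch) kernel spanning the full F x T map.
    Each returned vector has shape (n, ch).
    """
    mean_vec = feat.mean(axis=(1, 2))
    max_vec = feat.max(axis=(1, 2))
    conv_vec = np.einsum("nftc,ftcd->nd", feat, conv_kernel)
    return mean_vec, max_vec, conv_vec


def moe_block(x, expert_w, expert_b, gate_w, gate_b) -> np.ndarray:
    """Mixture-of-experts classifier head.

    x: (n, D) pooled vectors; expert_w: (K, D, C); expert_b: (K, C);
    gate_w: (D, K); gate_b: (K,). Returns class probabilities of shape (n, C).
    """
    # Each expert is a fully-connected layer followed by ReLU.
    experts = np.maximum(np.einsum("nd,kdc->nkc", x, expert_w) + expert_b, 0.0)
    # The gate is a fully-connected layer with softmax over the K experts.
    gates = softmax(x @ gate_w + gate_b)
    # Gate-weighted sum of expert outputs, then a final softmax.
    return softmax(np.einsum("nkc,nk->nc", experts, gates))
```

Because the gate weights are positive and sum to one, the mixture is a convex combination of the expert outputs, so each expert can specialise on a region of the input space without changing the output dimension.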


V-B Experimental setting and accuracy fusion

As mixup data augmentation is used in the settings of Table IV for the proposed deep learning frameworks, the labels are no longer one-hot encoded. Therefore, the Kullback-Leibler (KL) divergence loss [13] described in Section IV-E and Equation (14) is used to train the proposed deep learning models (C-DNN with MoE). We use the TensorFlow framework to build the model and set learning rate=0.0001, epoch number=100, and batch size=100; trainable parameters are initialized from a Normal distribution with mean=0 and standard deviation=0.1.

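As three types of pooling layers are used, we apply a mean-fusion method to fuse the posterior probabilities obtained. Let us consider $p^{nm} = (p^{nm}_1, p^{nm}_2, \dots, p^{nm}_C)$, with $C$ being the category number, $n$ indexing the $n$-th out of $N$ patches fed into the learning model, and $m$ indexing the $m$-th out of 3 types of pooling layers, to be the probability of a test sound instance. The mean classification probability is then denoted as $\bar{p} = (\bar{p}_1, \bar{p}_2, \dots, \bar{p}_C)$ where,

$$\bar{p}_c = \frac{1}{3N}\sum_{m=1}^{3}\sum_{n=1}^{N} p^{nm}_c \quad \text{for } 1 \le c \le C \tag{16}$$

and similarly the predicted label is determined as in Equation (9).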
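The mean fusion over pooling types and patches amounts to a single average over two axes. The sketch below assumes the posteriors are stacked in an array of shape (M, N, C), with M pooling types, N patches, and C categories; the function name `fuse_predictions` is hypothetical.

```python
from typing import Tuple

import numpy as np


def fuse_predictions(probs: np.ndarray) -> Tuple[np.ndarray, int]:
    """Mean-fuse posteriors over pooling types and patches.

    probs: array of shape (M, N, C).
    Returns the fused probability vector of shape (C,) and the predicted index.
    """
    mean_probs = probs.mean(axis=(0, 1))   # average over pooling types and patches
    return mean_probs, int(np.argmax(mean_probs))
```

Averaging probabilities in this way gives every pooling branch and every patch equal weight in the final decision.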

V-C Performance compared to the state of the art

Fig. 9: Teacher-Student scheme architecture
| Task | Method | train/test | Spec. | Sen. | ICBHI Score |
|---|---|---|---|---|---|
| 1-1, 4-class | Boosted Tree [3] | 60/40 | 0.78 | 0.21 | 0.49 |
| 1-1, 4-class | CNN-RNN [1] | 80/20 | - | - | 0.66 |
| 1-1, 4-class | CNN [24] | 80/20 | 0.77 | 0.45 | 0.61 |
| 1-1, 4-class | MNRNN [11] | five folds | 0.74 | 0.56 | 0.65 |
| 1-1, 4-class | LSTM [23] | 80/20 | 0.85 | 0.62 | 0.74 |
| 1-1, 4-class | Our system | five folds | 0.87 | 0.74 | 0.81 |
| 1-2, 2-class | LSTM [23] | 80/20 | - | - | 0.81 |
| 1-2, 2-class | CNN [14] | 75/25 | - | - | 0.82 |
| 1-2, 2-class | Our system | five folds | 0.86 | 0.85 | 0.86 |
| 2-1, 3-class | CNN [24] | 80/20 | 0.76 | 0.89 | 0.83 |
| 2-1, 3-class | LSTM [23] | 80/20 | 0.82 | 0.98 | 0.90 |
| 2-1, 3-class | Our system | five folds | 0.86 | 0.95 | 0.91 |
| 2-2, 2-class | CNN-RNN [1] | 60/40 | - | - | 0.71 |
| 2-2, 2-class | CNN [24] | 80/20 | 0.78 | 0.97 | 0.88 |
| 2-2, 2-class | RUSBoost [12] | 50/50 | 0.93 | 0.86 | 0.90 |
| 2-2, 2-class | LSTM [23] | 80/20 | 0.82 | 0.99 | 0.91 |
| 2-2, 2-class | Our system | five folds | 0.86 | 0.98 | 0.92 |

TABLE V: Comparison against state-of-the-art systems

Based on the experimental analysis, we propose separate network configurations for ICBHI challenge Tasks 1 and 2, although both share the same deep learning model (C-DNN+MoE).

Table V (top) compares the proposed Task 1 system against the state of the art, demonstrating the highest ICBHI scores of 0.81 and 0.86 for the 4-category and 2-category sub-tasks, respectively. Task 2 results (Table V, bottom) show ICBHI scores of 0.91 and 0.92 for the 3-category and 2-category sub-tasks, respectively. These results indicate that our proposed deep learning frameworks outperform the state-of-the-art methods. However, the comparison is not exact, because the systems use different train/test splits of the dataset.

Vi Student-Teacher scheme to reduce model size for respiratory diseases detection

Vi-a Student-Teacher scheme

Inspiration from effectively applying Teacher-Student scheme on sound scenes and sound events [9, 7], we apply this technique on Task 2 to deal with the trade-off between model size and performance. The proposed Teacher-Student scheme as described in Figure 9 comprises of two networks, namely Teacher (the upper) and Student (the lower), respectively. Teacher network reuses the C-DNN+MoE architecture introduced in Section V-A with only using global mean pooling. As regards the Student network architecture, it comprises of two Conv. Block 07, 08, and a Dense Block with configuration as denoted in Table VI. Note that Conv. Bocks used in Student network do not apply Batchnorm and Dropout layers. To operate the scheme, training Teacher-Student scheme is separated into two phases. The Teacher is firstly trained, thus extract the output of global mean pooling layer. The extracted features, likely 512-dimensional vectors, are referred as to soft labels that will be used to train Student network. Since we obtain the soft labels from the Teacher network, we train the Student network by combining two loss functions. The first loss function used Euclidean distance aims to minimize difference between soft labels and 512-dimensional vectors extracted from the output of global mean pooling layer on the Student network. Meanwhile, the second Cross-Entropy loss function is used for classification of three groups of respiratory diseases. Eventually, the final loss is described below,


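$$\mathcal{L}(\Theta) = \lambda\,\mathcal{L}_{CE}(\Theta) + (1-\lambda)\,\mathcal{L}_{EU}(\Theta)$$

where $\mathcal{L}_{CE}$ and $\mathcal{L}_{EU}$ are the Cross-Entropy and Euclidean distance losses, respectively. The hyperparameter $\lambda$, which balances the two terms, is experimentally set to 1/2, and $\Theta$ denotes all trainable parameters.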
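As an illustration only, the combined Student loss can be sketched in NumPy as follows. The sketch assumes that the two terms are mixed as a convex combination with weight 1/2, that the Euclidean term is the unsquared L2 distance averaged over the batch, and that the Cross-Entropy term is computed from softmax outputs over three classes. The function names, the batch size and the random stand-in features are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(logits, labels):
    # Mean negative log-likelihood of the true class (hard labels).
    probs = softmax(logits)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())

def euclidean_loss(student_feat, teacher_feat):
    # Assumption: unsquared L2 distance per sample, averaged over the batch.
    return float(np.linalg.norm(student_feat - teacher_feat, axis=1).mean())

def student_loss(logits, labels, student_feat, teacher_feat, weight=0.5):
    # Assumption: convex combination of the two terms; weight = 1/2 as in the text.
    ce = cross_entropy(logits, labels)
    eu = euclidean_loss(student_feat, teacher_feat)
    return weight * ce + (1.0 - weight) * eu

# Toy batch: random stand-ins for the 512-dimensional pooled features.
rng = np.random.default_rng(0)
batch, feat_dim, n_classes = 4, 512, 3
teacher_feat = rng.normal(size=(batch, feat_dim))            # soft labels from the Teacher
student_feat = teacher_feat + 0.1 * rng.normal(size=(batch, feat_dim))
logits = rng.normal(size=(batch, n_classes))                 # Student class scores
labels = np.array([0, 1, 2, 1])                              # three disease groups

loss = student_loss(logits, labels, student_feat, teacher_feat)
print(f"combined Student loss: {loss:.4f}")
```

In a real training loop the Teacher features would be computed once with the trained Teacher held fixed, so that only the Student parameters receive gradients from both terms.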

| Architecture | Layers | Output |
|---|---|---|
| Input layer (image patch) | | |
| Conv. Block 07 | Cv [] - Relu - Ap [] | |
| Conv. Block 08 | Cv [] - Relu - Gap | |
| Dense Block | Fl - Softmax layer | 3 |
TABLE VI: Student network architecture

VI-B Experimental results

| Task | Method | Spec. | Sen. | ICBHI Score |
|---|---|---|---|---|
| 2-1, 3-class | Teacher | 0.86 | 0.95 | 0.91 |
| 2-1, 3-class | Student only | 0.43 | 0.94 | 0.68 |
| 2-1, 3-class | Student with soft labels | 0.86 | 0.90 | 0.88 |
| 2-2, 2-class | Teacher | 0.86 | 0.98 | 0.92 |
| 2-2, 2-class | Student only | 0.43 | 0.99 | 0.71 |
| 2-2, 2-class | Student with soft labels | 0.86 | 0.96 | 0.91 |
TABLE VII: Performance comparison among the Student network only, the Teacher network only, and the Teacher-Student scheme

As shown in Table VII, the Student network trained with soft labels from the Teacher network achieves ICBHI scores of 0.88 and 0.91 on Task 2-1 and Task 2-2, respectively. Although it does not reach the Teacher network's scores of 0.91 and 0.92, the Student network significantly reduces model size with only a small loss in performance. Without soft labels from the Teacher, the Student network's scores drop to 0.68 and 0.71 on Task 2-1 and Task 2-2, respectively. The Teacher-Student scheme therefore yields a Student network with 7296 trainable parameters, compared with 30912 in the Teacher network, which makes it suitable for detection on systems that require a low parameter count.
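The figures above can be cross-checked with a few lines of Python. This sketch assumes that the ICBHI score is the arithmetic mean of specificity and sensitivity, which the rows of Table VII are consistent with up to rounding. The dictionary layout and function name are illustrative only.

```python
def icbhi_score(specificity, sensitivity):
    # Assumption: ICBHI score = average of specificity and sensitivity.
    return (specificity + sensitivity) / 2.0

# (task, method) -> (Spec., Sen., reported ICBHI score), copied from Table VII.
table_vii = {
    ("2-1", "Teacher"): (0.86, 0.95, 0.91),
    ("2-1", "Student only"): (0.43, 0.94, 0.68),
    ("2-1", "Student with soft labels"): (0.86, 0.90, 0.88),
    ("2-2", "Teacher"): (0.86, 0.98, 0.92),
    ("2-2", "Student only"): (0.43, 0.99, 0.71),
    ("2-2", "Student with soft labels"): (0.86, 0.96, 0.91),
}

for (task, method), (spec, sen, reported) in table_vii.items():
    score = icbhi_score(spec, sen)
    print(f"Task {task:3s} {method:26s} computed={score:.3f} reported={reported:.2f}")

# Model size comparison using the parameter counts quoted in the text.
teacher_params, student_params = 30912, 7296
ratio = teacher_params / student_params
reduction = 1.0 - student_params / teacher_params
print(f"Student is {ratio:.2f}x smaller ({reduction:.1%} fewer trainable parameters)")
```

With the quoted counts, the Student network has roughly 4.2 times fewer trainable parameters than the Teacher, a reduction of about 76%.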

VII Conclusion

This paper has presented an exploration of deep learning models for detecting respiratory disease from auditory recordings. Through intensive experiments on the ICBHI dataset, we propose deep learning frameworks for four challenge tasks of respiratory sound classification. The proposed systems are shown to outperform the state of the art on all tasks. Furthermore, applying the Teacher-Student scheme significantly reduces the model size used for inference while still achieving high performance. The experimental results support the application of deep learning to early diagnosis of respiratory disease.


  • [1] J. Acharya and A. Basu (2020) Deep neural network for respiratory sound classification in wearable devices enabled by patient specific model tuning. IEEE Transactions on Biomedical Circuits and Systems. Cited by: §I, TABLE V.
  • [2] M. Aykanat, Ö. Kılıç, B. Kurt, and S. Saryal (2017) Classification of lung sounds using convolutional neural networks. EURASIP Journal on Image and Video Processing 2017 (1), pp. 65. Cited by: §I.
  • [3] G. Chambres, P. Hanna, and M. Desainte-Catherine (2018) Automatic detection of patient with respiratory diseases using lung sound analysis. In Proc. CBMI, pp. 1–6. Cited by: §I, TABLE V.
  • [4] H. Chen, X. Yuan, Z. Pei, M. Li, and J. Li (2019) Triple-classification of respiratory sounds using optimized S-transform and deep residual networks. IEEE Access 7, pp. 32845–32852. Cited by: §I, §I.
  • [5] S. Datta, A. D. Choudhury, P. Deshpande, S. Bhattacharya, and A. Pal (2017) Automated lung sound analysis for detecting pulmonary abnormalities. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 4594–4598. Cited by: §I.
  • [6] Cited by: §IV-A.
  • [7] L. Gao, H. Mi, B. Zhu, D. Feng, Y. Li, and Y. Peng (2019) An adversarial feature distillation method for audio classification. IEEE Access 7, pp. 105319–105330. Cited by: §VI-A.
  • [8] M. Grønnesby, J. C. A. Solis, E. Holsbø, H. Melbye, and L. A. Bongo (2017) Feature extraction for machine learning based crackle detection in lung sounds from a health survey. arXiv preprint arXiv:1706.00005. Cited by: §I.
  • [9] H. Heo, J. Jung, H. Shim, and H. Yu (2019) Acoustic scene classification using teacher-student learning with soft-labels. arXiv preprint arXiv:1904.10135. Cited by: §VI-A.
  • [10] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-C.
  • [11] K. Kochetov, E. Putin, M. Balashov, A. Filchenkov, and A. Shalyto (2018) Noise masking recurrent neural network for respiratory sound classification. In International Conference on Artificial Neural Networks, pp. 208–217. Cited by: §I, TABLE V.
  • [12] X. H. Kok, S. A. Imtiaz, and E. Rodriguez-Villegas (2019) A novel method for automatic identification of respiratory disease from acoustic recordings. In 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 2589–2592. Cited by: §I, TABLE V.
  • [13] S. Kullback and R. A. Leibler (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §IV-E, §V-B.
  • [14] R. Liu, S. Cai, K. Zhang, and N. Hu (2019) Detection of adventitious respiratory sounds based on convolutional neural network. In 2019 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), pp. 298–303. Cited by: §I, TABLE V.
  • [15] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto (2015) Librosa: audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, pp. 18–25. Cited by: §IV-A.
  • [16] I. McLoughlin, Y. Song, L. D. Pham, H. Pham, P. Ramaswamy, and L. Yue (2018) Early detection of continuous and partial audio events using CNN. In Proc. INTERSPEECH. Cited by: §I.
  • [17] L. Mendes, I. M. Vogiatzis, E. Perantoni, E. Kaimakamis, I. Chouvarda, N. Maglaveras, J. Henriques, P. Carvalho, and R. P. Paiva (2016) Detection of crackle events using a multi-feature approach. In 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3679–3683. Cited by: §I.
  • [18] E. Messner, M. Fediuk, P. Swatek, S. Scheidl, F. Smolle-Juttner, H. Olschewski, and F. Pernkopf (2018) Crackle and breathing phase detection in lung sounds with deep bidirectional gated recurrent neural networks. In Proc. EMBC, pp. 356–359. Cited by: §I.
  • [19] K. Minami, H. Lu, H. Kim, S. Mabu, Y. Hirano, and S. Kido (2019) Automatic classification of large-scale respiratory sound dataset based on convolutional neural network. In 2019 19th International Conference on Control, Automation and Systems (ICCAS), pp. 804–807. Cited by: §I.
  • [20] T. Okubo, N. Nakamura, M. Yamashita, and S. Matsunaga (2014) Classification of healthy subjects and patients with pulmonary emphysema using continuous respiratory sounds. In 2014 36th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 70–73. Cited by: §I.
  • [21] D. Oletic, M. Matijascic, V. Bilas, and M. Magno (2019) Hidden Markov model-based asthmatic wheeze recognition algorithm leveraging the parallel ultra-low-power processor (PULP). In 2019 IEEE Sensors Applications Symposium, pp. 1–6. Cited by: §I.
  • [22] Cited by: §I.
  • [23] D. Perna and A. Tagarelli (2019) Deep auscultation: predicting respiratory anomalies and diseases via recurrent neural networks. In Proc. CBMS, pp. 50–55. Cited by: §I, §I, §II-C, TABLE V.
  • [24] D. Perna (2018) Convolutional neural networks learning from respiratory data. In Proc. BIBM, pp. 2109–2113. Cited by: §I, TABLE V.
  • [25] L. Pham, I. McLoughlin, H. Phan, R. Palaniappan, and Y. Lang (2019) Bag-of-features models based on C-DNN network for acoustic scene classification. In Proc. AES. Cited by: §IV-A, §IV-E.
  • [26] L. Pham, I. McLoughlin, H. Phan, and R. Palaniappan (2019) A robust framework for acoustic scene classification. In Proc. INTERSPEECH, pp. 3634–3638. Cited by: §IV-A, §IV-E.
  • [27] H. Polat and İ. Güler (2004) A simple computer-based measurement and analysis system of pulmonary auscultation sounds. Journal of medical systems 28 (6), pp. 665–672. Cited by: §I.
  • [28] S. Reichert, R. Gass, C. Brandt, and E. Andrès (2008) Analysis of respiratory sounds: state of the art. Clinical medicine. Circulatory, respiratory and pulmonary medicine 2, pp. CCRPM–S530. Cited by: §I.
  • [29] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti (2011) Detecting novel associations in large data sets. Science 334 (6062), pp. 1518–1524. Cited by: §I.
  • [30] R. J. Riella, P. Nohama, R. F. Borges, and A. L. Stelle (2003) Automatic wheezing recognition in recorded lung sounds. In Proc. EMBC, Vol. 3, pp. 2535–2538. Cited by: §I.
  • [31] B. Rocha, D. Filos, L. Mendes, et al. (2018) A respiratory sound database for the development of automated classification. In Precision Medicine Powered by pHealth and Connected Health, pp. 33–37. Cited by: §I, §II-A, §II-C.
  • [32] N. Sengupta, M. Sahidullah, and G. Saha (2017) Lung sound classification using local binary pattern. arXiv preprint arXiv:1710.01703. Cited by: §I.
  • [33] G. Serbes, S. Ulukaya, and Y. P. Kahya (2018) An automated lung sound preprocessing and classification system based on spectral analysis methods. In Precision Medicine Powered by pHealth and Connected Health, pp. 45–49. Cited by: §I.
  • [34] L. Shi, K. Du, C. Zhang, H. Ma, and W. Yan (2019) Lung sound recognition algorithm based on VGGish-BiGRU. IEEE Access 7, pp. 139438–139449. Cited by: §I.
  • [35] Y. Tokozume, Y. Ushiku, and T. Harada (2017) Learning from between-class examples for deep sound recognition. arXiv preprint arXiv:1711.10282. Cited by: §IV-E.
  • [36] K. Xu, D. Feng, H. Mi, B. Zhu, D. Wang, L. Zhang, H. Cai, and S. Liu (2018) Mixup-based acoustic scene classification using multi-channel convolutional neural network. In Pacific Rim Conference on Multimedia, pp. 14–23. Cited by: §IV-E.
  • [37] H. Zhang, I. McLoughlin, and Y. Song (2015) Robust sound event recognition using convolutional neural networks. In Proc. ICASSP, pp. 559–563. Cited by: §I.
  • [38] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §IV-E.