In this study, respiratory sounds were prospectively collected from pediatric patients in actual clinical practice. In addition, pediatric pulmonologists with extensive clinical experience carefully recorded the respiratory sounds and verified them through blind validation. Therefore, our dataset is comparable to any gold standard for deep learning: it reflects the real world, has highly accurate sound descriptions, and was recorded at a high sample rate. We developed a deep-learning AI model to classify wheezing using CBAM in a CNN-based ResNet structure. This model performs well enough to be useful in actual clinical practice. We also found that adding tabular data to the deep-learning model improved its performance.
Recently, various methods have been proposed to improve the performance of deep-learning models for lung sound classification. Architectures based on CNNs, RNNs, and other methods have been proposed, and among them, several studies have identified the CNN as the most suitable for respiratory sound classification27,28. A CNN builds its neural network on convolutional operations and is used in fields as diverse as image, video, and natural language processing. Recently, CNNs have also been applied frequently to audio tasks, and many models have been derived by transforming and upgrading the basic CNN29.
The CNN model we adopted as a basic structure can extract rich features and learn efficiently as its layers deepen30. However, deeper layers may cause overfitting, which increases model complexity and reduces performance30. Several CNN-based hybrid models have been proposed to compensate for these problems and achieve optimal performance15,28,31. Most recently, a model that outperformed existing respiratory sound classifiers by adding an artificial noise addition technique to a standard CNN structure was proposed28. Moreover, another study achieved good performance by combining co-tuning and stochastic normalization techniques with a CNN-based pre-trained ResNet backbone15.
We aimed to achieve optimal performance by applying a CNN-based ResNet with skip connections. ResNet prevents overfitting and improves performance through residual learning and skip connections16. In addition to ResNet, various feature extractors, such as the inception network (InceptionNet), dense network (DenseNet), and visual geometry group network (VGGNet), have been proposed to address vanishing gradients and overfitting32. A recent study reported that VGG16 pre-trained on ImageNet had the best performance in detecting abnormal lung sounds, with an AUC of 0.93 and an accuracy of 86.5%11. We tested respiratory sound classification performance by applying the same models as those tested in our previous study. As a result, the ResNet we adopted performed best.
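The skip connection at the heart of residual learning can be sketched as follows. This is a minimal PyTorch illustration, not our actual implementation; the channel count and layer sizes are assumptions for the example. The block's input is added back to its convolutional output, so gradients can flow around the transformation and very deep stacks remain trainable:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two conv layers plus an identity skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back
```

Because the block learns only the residual (the difference from the identity), stacking many such blocks deepens the network without the degradation seen in plain CNNs.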
LSTM is a model of the RNN family used for data with sequential properties33. Since respiratory sound data can be viewed as time series with such properties, the LSTM model is also suitable for respiratory sound classification. Petmezas et al.31 used a hybrid CNN-LSTM structure and a focal loss to address the data imbalance: the lung sound data were input to the CNN, and its output was used as the input to the LSTM. In general, however, CNN models are known to learn features from audio data better than RNN models34,35. Our model comparison confirmed that the performance of a typical LSTM-family model is lower than that of a typical CNN-family model.
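The hybrid arrangement described above, a CNN front-end whose per-frame features feed an LSTM, can be sketched as follows. This is an illustrative PyTorch skeleton with assumed layer sizes, not the architecture of Petmezas et al.:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """CNN extracts per-frame features; the LSTM models their temporal order."""
    def __init__(self, cnn_out=32, hidden=64, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_out, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse frequency axis, keep time
        )
        self.lstm = nn.LSTM(cnn_out, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, time)
        f = self.cnn(spec)                    # (batch, cnn_out, 1, time)
        f = f.squeeze(2).transpose(1, 2)      # (batch, time, cnn_out)
        _, (h, _) = self.lstm(f)              # final hidden state summarizes the sequence
        return self.fc(h[-1])                 # (batch, n_classes)
```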
We improved performance by adding CBAM to the CNN. Attention mechanisms have recently been proposed to deal effectively with sequence problems36. An attention module uses weights to focus more on important parts and less on relatively unimportant parts36. In our study, CBAM was introduced to weight the regions of the mel spectrogram where the wheeze pattern exists, and accuracy improved by 1.7% compared with the model before its introduction. In addition, we constructed a multi-modal configuration that uses not only respiratory sound data but also tabular data, such as age and gender, for classification. This model improved performance compared with the model using only breathing sound data; the increase in the F1 score was the most notable. It can be inferred that adding tabular data to the algorithm helps address the problem of imbalanced data, although further research is required to confirm this hypothesis.
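CBAM's two stages, channel attention followed by spatial attention, can be sketched as below. This is a minimal PyTorch sketch of the published CBAM design, not our trained model; the channel count and reduction ratio are illustrative. The spatial stage is what lets the network up-weight the time-frequency regions of the mel spectrogram where a wheeze pattern appears:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Weight each feature channel using pooled global statistics."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))        # average-pooled descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))         # max-pooled descriptor
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w

class SpatialAttention(nn.Module):
    """Weight each time-frequency location of the feature map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)         # pool across channels
        mx = x.amax(dim=1, keepdim=True)
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w

class CBAM(nn.Module):
    """Channel attention then spatial attention, applied in sequence."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention()

    def forward(self, x):
        return self.sa(self.ca(x))
```

The module preserves the feature-map shape, so it can be dropped between ResNet blocks without other architectural changes.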
In previous studies, deep-learning models were trained using only audio data, without considering variables such as patient gender and age11,12. However, the characteristics of lung sounds may differ slightly depending on gender and age; to consider these together, we constructed a multi-modal model that includes tabular gender and age data. In addition, a previous study reported that a model combining tabular data with images addressed the class imbalance between normal and abnormal data in chest radiograph classification and improved classification as measured by sensitivity37. In our study, adding the MLP layer improved all performance metrics, including the F1 score, compared with the CNN alone.
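The multi-modal fusion described above, pooled CNN audio features concatenated with an MLP embedding of the tabular variables, can be sketched as follows. This is an illustrative PyTorch sketch with assumed dimensions (e.g., two tabular inputs for age and gender), not our exact network:

```python
import torch
import torch.nn as nn

class MultiModalClassifier(nn.Module):
    """Fuse CNN audio features with an MLP embedding of tabular data."""
    def __init__(self, audio_channels=32, tabular_dim=2, n_classes=2):
        super().__init__()
        self.audio_net = nn.Sequential(
            nn.Conv2d(1, audio_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> (batch, channels, 1, 1)
            nn.Flatten(),              # -> (batch, channels)
        )
        self.tabular_mlp = nn.Sequential(
            nn.Linear(tabular_dim, 16),
            nn.ReLU(inplace=True),
        )
        self.head = nn.Linear(audio_channels + 16, n_classes)

    def forward(self, spec, tab):      # spec: spectrogram, tab: (age, gender)
        feats = torch.cat([self.audio_net(spec), self.tabular_mlp(tab)], dim=1)
        return self.head(feats)
```

Because the two branches are fused only at the final classification head, either input pathway can be enlarged or swapped without retraining the other from scratch.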
Several previous studies on CNN-based AI for lung sound classification used an open-access sound database, such as that of the International Conference on Biomedical and Health Informatics (ICBHI) 201712,13,14. The ICBHI dataset contains a large number of respiratory sounds and events, including wheezes, crackles, cough, and noise38. However, such open-access data may carry selection bias. In fact, some sounds in the ICBHI dataset were collected in non-clinical environments, some are from healthy patients, and some have not been double-validated38. In addition, only certain sounds may be emphasized because of the short respiratory cycles of the recordings39. In particular, when examining an actual patient, crying and other breathing sounds may be auscultated. Research using open-access data is therefore difficult to apply in the real world. The audio data used in this study were recorded in an actual clinical setting and verified by experts to increase accuracy. Our database is thus an excellent gold standard for constructing AI models that are useful in clinical practice.
This study had several limitations. First, this was a single-center study with a small sample size. We split our data, using 80% for training and 20% for validation, and applied data augmentation and repeat padding to overcome the limited amount of data for deep learning. A large amount of real-world data needs to be collected through a multicenter prospective study in the future. In addition, the training dataset was imbalanced; we adopted the F1 score as a metric suited to this problem, and our model showed high performance. Second, our model is a binary classifier that identifies sounds containing wheezing. For real-time monitoring, a deep-learning model applicable to various breathing sounds needs to be developed through advances in data and AI performance. Finally, we did not collect patients’ diagnostic information. Diagnosis of lung disease is based on a comprehensive assessment of the patient’s clinical symptoms, laboratory test results, and breathing sounds. A model that diagnoses diseases and evaluates treatment response by integrating this information should be developed in future studies.
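Repeat padding, as used here to bring short clips up to a fixed training length, can be sketched as follows. This is a minimal NumPy version under our own assumptions; the function name and target length are illustrative:

```python
import numpy as np

def repeat_pad(signal, target_len):
    """Tile a short clip until it reaches target_len, then truncate.

    Unlike zero padding, repetition preserves the clip's spectral content
    across the whole padded window.
    """
    if len(signal) >= target_len:
        return signal[:target_len]          # long enough: just truncate
    reps = int(np.ceil(target_len / len(signal)))
    return np.tile(signal, reps)[:target_len]
```

For example, a 3-sample clip `[1, 2, 3]` padded to length 8 becomes `[1, 2, 3, 1, 2, 3, 1, 2]`.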