


      A Speech Disorder Detection Model Using Ensemble Learning Approach


            Abstract

Speech disorder detection (SDD) models can assist speech therapists in providing personalized treatment to individuals with speech impairment. Speech disorders (SDs) comprise a broad spectrum of problems that affect the production, articulation, fluency, and quality of speech. Prompt identification and timely intervention are essential for the efficient management and treatment of speech problems. However, current diagnostic techniques often depend on subjective evaluations by speech-language pathologists and may encounter challenges in terms of accessibility, scalability, and consistency. Limited datasets and substantial computational demands hamper the development of automated SDD models. Nevertheless, recent technological developments enable researchers to identify key factors for classifying voice samples, and such classification can reveal the severity of an SD. Ensemble learning (EL) integrates the predictions of multiple models to generate an optimal outcome. Hence, an EL-based SDD model is introduced in this study. A mel-spectrogram (MS) generator is built to produce images from the voice samples. The authors propose a feature engineering technique based on the ResNet-18 model for extracting crucial features from the MS. CatBoost and XGBoost models are employed to classify the features, and their outcomes are used to train a support vector machine model that makes the final prediction. The VOice ICarfEDerico II (VOICED) dataset is used to evaluate the generalizability of the proposed model. The experimental findings indicate that the recommended SDD model outperforms state-of-the-art models, yielding outstanding performance. This suggests that the model can assist speech therapists in offering customized speech therapies based on SD severity. The model's performance can be further improved using multilanguage voice samples.


            INTRODUCTION

Speech disorders (SDs) influence an individual's capacity to generate sounds, rhythm, and voice quality for communication (Chaiani et al., 2022). Various speech impairments can affect an individual's articulation, fluency, and voice quality, and may significantly impair communication (Xiong et al., 2018). These conditions may be present in childhood or develop during early adulthood. SDs can be broadly classified into articulation, fluency, voice, and language disorders (Shahin et al., 2019). Articulation disorders impair sound pronunciation: individuals may substitute, omit, distort, or add sounds, making speech unintelligible and hindering communication, as others find it challenging to understand them (Mulfari et al., 2022). Fluency disorders, including stuttering, disrupt the natural flow of speech; individuals with fluency disorders may hesitate, repeat, or prolong sounds, and may avoid conversation to prevent stuttering. Voice disorders entail impairments of pitch, loudness, or quality (Mahmoud et al., 2020). They may result from vocal nodules or cord paralysis, and the resulting difficulties can affect communication and social relations and lead to vocal fatigue. Language disorders involve problems with understanding and interpreting words (Smith et al., 2017; Cummins et al., 2018; Adeel et al., 2019; Shahamiri, 2021). They pose challenges with phrase construction, grammar, and vocabulary, and may affect academic, social, and cognitive development. Communicating, comprehending, and engaging in age-appropriate activities may be difficult for individuals with language disorders.

Children with speech-language disorders, especially receptive language disorders, have difficulty interpreting and processing verbal information and face challenges in the learning environment. Limited linguistic comprehension and memory may significantly restrict a child's capacity to engage in activities (Nossier et al., 2020a). Oral instructions comprising multiple words and phrases can be challenging for children with SDs. Studies show that communication ability significantly affects the development of reading skills. Individuals with severe speech impairments, or who are unable to communicate, may use augmentative and alternative communication (AAC) devices and assistive technology. AAC enables individuals to communicate using symbols, graphics, or synthetic voice on communication boards and electronic devices (Xu et al., 2014). Automatic speech recognition (ASR) aims to develop technology capable of comprehending and reacting to human speech (Tan and Wang, 2021). As a subfield of natural language processing and voice processing, it deals with processing written and spoken words (Jolad and Khanai, 2023). To enable the automated transcription of spoken words into text, ASR uses machine learning algorithms to analyze and interpret spoken language.

SD patients benefit from automated SD classifiers (Nossier et al., 2020b). Machine learning and artificial intelligence-based classifiers assist in the assessment, diagnosis, and management of speech impairments. Speech abnormalities may be detected in their initial stages using automated classifiers, which analyze speech patterns to detect pronunciation, fluency, and other SDs. Early detection enables expedited therapy. Automatic classifiers diagnose speech abnormalities objectively and consistently (Wang and Chen, 2018), whereas traditional evaluations may be subjective and clinician-dependent. Automated systems can conduct consistent evaluations using predetermined specifications, leading to reliable diagnoses. Automatic SD classifiers also enable remote monitoring, which is particularly crucial for telehealth or remote treatment sessions (Krecichwost et al., 2021; Vásquez-Correa et al., 2021; Sivakumar and Shankar, 2022). Therapists may track progress and adjust therapies by using the classifier to assess individuals' voice recordings. An automated classifier can assist in establishing customized treatment procedures by analyzing speech patterns: the technology can identify speech impairments and recommend training sessions to improve them. Speech exercises with automated classifiers can receive rapid feedback; real-time feedback helps individuals practice and transform their speaking habits, encouraging introspection and enhancing the learning process. A convolutional neural network (CNN) can extract crucial visual characteristics to identify speech problems, and its architecture, preprocessing, and training may be customized for the task (Espana-Bonet and Fonollosa, 2016). CNNs' convolutional layers learn hierarchies of localized features: they acquire visual patterns and edges from the input using convolutional filters. Convolutional operations are required to detect SD-related visual characteristics, and the network can capture more abstract and complicated characteristics with multiple convolutional blocks (Lai and Zheng, 2019), each typically comprising convolutional, activation, and pooling layers.

Ensemble learning (EL) assists in recognizing SDs by addressing generalization, robustness, and the diversity of speech data (Valles and Matin, 2021). Combining numerous models improves the performance and reliability of the detection system, making EL a valuable method for constructing reliable speech disorder detection (SDD) models. A model overfits when it performs well on training data but poorly on unseen data (Zhang et al., 2016). EL approaches, including bagging and boosting, combine models trained on multiple data subsets or with different weights to reduce overfitting. Noise and variation are inherent in speech data; EL can aggregate numerous model predictions to strengthen the detection process, allowing an EL-based model to handle varied speech patterns, accents, and background noise. Hence, the authors aim to build an SDD model using the EL approach. The contributions of this study are as follows:

• A mel-spectrogram (MS) generator based on a customized CNN model that uses the voice samples.

            • A feature engineering technique based on an enhanced ResNet-18 model.

            • An EL-based SDD model using fine-tuned CatBoost, XGBoost, and support vector machine (SVM) models.

The structure of this study is as follows: the Literature Review section outlines the significance of SDD in improving the speech of individuals with speech impairment. The next section describes the research methodology. The Results and Discussion section presents the outcomes of the study. Finally, the Conclusion section highlights the study's contribution to the SDD literature.

            LITERATURE REVIEW

Therapy for speech impairments depends on their nature and severity. Speech-language pathologists (SLPs) analyze, diagnose, and customize treatment plans for speech problems. Speech therapy is the gold standard for treating a wide range of speech impairments (Pravin and Palanivelan, 2021). The patient focuses on their voice quality, fluency, articulation, and overall speech output during treatment (Hameed et al., 2021). To address SDs, therapists employ drills, exercises, and interactive activities; for instance, individuals with articulation difficulties may participate in activities prioritizing accurate sound articulation. Specialized articulation therapy aims to assist individuals who struggle with voice generation (Suthar et al., 2022) by improving the positioning and movement of the speech organs. To support patients facing challenges with sounds and syllables, therapists employ a variety of games, exercises, and activities (Liu et al., 2017), and learning can be improved using visual aids and feedback. Treatment regimens are tailored to each patient's requirements and objectives, and treatment efficacy relies on the type and degree of the disorder.

ASR relies on the acoustic model to convey the relationship between audio signals and phonemes (Ariyanti et al., 2021); training such a model requires massive collections of audio recordings and transcriptions. The language model helps the ASR system comprehend context and the probability of word sequences, addressing grammar, syntax, and word frequency (Liu et al., 2023) and improving context-based word recognition. The ASR system learns to identify a set of words from a dictionary or vocabulary, which narrows the options for transcribing audible utterances into text. To extract characteristics from the speech stream, ASR systems frequently employ preprocessing procedures; methods such as Fourier analysis and mel-frequency cepstral coefficients render the speech signal suitable for machine learning algorithms. Decoding determines the most probable word or phoneme sequence from the incoming voice signal, and the transcription is generated by integrating acoustic, linguistic, and lexical data.
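For illustration, mel-frequency cepstral coefficients of the kind mentioned above can be extracted with the librosa library; the file path and coefficient count below are placeholders, not values from the reviewed studies.

import librosa

# Load the audio at its native sampling rate and compute 13 MFCCs per frame.
y, sr = librosa.load("speech_sample.wav", sr=None)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)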

Automated models frequently encounter difficulties when dealing with the varied nature of speech impairments (Liu et al., 2023), as numerous speech problems have unique characteristics. A knowledge gap exists in developing models that properly diagnose atypical or poorly understood speech problems. One drawback in training models for automated SD identification is the relatively limited availability of big, varied datasets; annotated datasets covering a variety of speech problems, age ranges, and language variants are crucial for effective models. SD identification may be improved by combining audio, video, and language information, but integrating and using multimodal inputs remains difficult. Many models are trained on language-specific datasets; consequently, their performance may vary between languages (Jain et al., 2021). A knowledge gap exists in developing models that generalize across languages and dialects. Speech difficulties are commonly associated with hearing or cognitive disability, and ongoing research focuses on how automated models may accurately identify speech impairments in persons with multiple co-occurring conditions.

            RESEARCH METHODOLOGY

An EL-based model is proposed to improve the performance of SDD. An MS generator is built to produce images from the audio samples. ResNet-18's residual learning principle and block architecture have inspired several image categorization and computer vision models. The pretrained ResNet-18 model is well suited to transfer learning, achieving excellent performance with minimal task-specific data in domains such as medical imaging. EL combines the predictions of multiple models, which often leads to better accuracy than individual models, and ensembles are less sensitive to noise and outliers because they can average out the errors and biases of individual models. However, ensembles involve training and combining multiple models, which can increase computational complexity, memory requirements, and training time. They also require tuning of hyperparameters, such as the number of base models, their architectures, and the method of combining predictions, which can be time-consuming and requires careful experimentation. Automated hyperparameter tuning methods, such as grid search or Bayesian optimization, can help find optimal hyperparameters for ensemble models. Post hoc interpretability techniques, such as feature importance analysis or model visualization, can provide insights into the contributions of individual models to the ensemble prediction.

SVM models are well known for achieving excellent accuracy in both binary and multiclass classification problems. They perform strongly even in high-dimensional feature spaces and are very efficient at distinguishing classes with intricate decision boundaries. SVM models provide a regularization parameter that effectively manages overfitting, making them resilient to noisy data and ensuring strong generalization on unseen data. They can handle several data types, such as numerical and categorical information, which makes them adaptable to various classification tasks, and they can also be modified for regression and outlier detection. SVM models yield distinct decision boundaries, which enhances their interpretability; furthermore, the support vectors, the data points closest to the decision boundary, play a vital role in establishing that boundary and in understanding the model's predictions.

CatBoost and XGBoost are widely used for classification and regression, and the characteristics of these models motivated the authors to apply them in SDD development. A feature engineering technique is proposed that uses the weights of the ResNet-18 model for feature extraction. The authors introduce the SDD model using CatBoost, XGBoost, and SVM models. The proposed SDD model is presented in Figure 1.

Figure 1: The recommended SDD model. Abbreviations: SDD, speech disorder detection; SVM, support vector machine.
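A minimal sketch of this ensemble design, assuming scikit-learn's StackingClassifier as the combiner: CatBoost and XGBoost serve as base learners on the extracted features, and their class-probability outputs train the SVM meta-learner. The hyperparameter values and variable names are placeholders, not the authors' configuration.

from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier

# Base learners operating on the ResNet-18 feature vectors.
base_learners = [
    ("catboost", CatBoostClassifier(iterations=500, depth=6, verbose=0)),
    ("xgboost", XGBClassifier(n_estimators=500, max_depth=6)),
]

# stack_method="predict_proba" feeds the base models' probability outputs to
# the SVM, mirroring the use of base-model outcomes to train the meta model.
sdd_model = StackingClassifier(
    estimators=base_learners,
    final_estimator=SVC(kernel="rbf"),
    stack_method="predict_proba",
    cv=5,
)
# sdd_model.fit(X_features, y_labels)  # X_features: hypothetical feature matrix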

            Data acquisition

The authors employ the VOice ICarfEDerico II (VOICED) dataset (Goldberger et al., 2000; Peng et al., 2023) to evaluate the generalizability of the proposed SDD model. The dataset contains 150 pathological and 58 healthy samples, recorded by 135 female and 73 male participants. A Samsung Galaxy S4 mobile device with a dedicated voice recorder application was used to record the participants' voices. The data owner positioned the device at a 45° angle, 20 cm from the participants, during voice recording. The participants were instructed to pronounce the vowel "a" at a fixed sound intensity, and each recording was extended to 5 seconds.

Voice preprocessing

The authors employed a preprocessing technique to normalize the voice samples. A total of 208 voice recordings were obtained from the dataset, each containing some anomalous segments. The authors employed the cubic interpolation (CI) technique to expand the samples and thereby increase the dataset size. Based on the study by Cesari et al. (2018), the authors applied a resampling ratio of 48 Hz. CI is a widely used resampling approach for audio: it determines intermediate values between audio samples to generate a continuous signal representation. The audio samples are divided into segments to apply the CI function, and anti-aliasing filters are employed to remove high frequencies. Equation 1 gives the cubic polynomial used to identify the intermediate values.

(1) P(x) = n + m(x - x_i) + k(x - x_i)^2 + q(x - x_i)^3,

where n, m, k, and q are the coefficients determined by the data points, x_i is the starting sample of the interpolation segment, and x is the point at which the intermediate value is computed.
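A minimal sketch of this resampling step, assuming SciPy's CubicSpline, which fits piecewise polynomials of exactly the form in Equation 1; the function name and rates are illustrative.

import numpy as np
from scipy.interpolate import CubicSpline

def resample_cubic(signal: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Generate intermediate values between the original audio samples."""
    t_orig = np.arange(len(signal)) / orig_sr
    spline = CubicSpline(t_orig, signal)  # fits n, m, k, q per segment (Eq. 1)
    t_new = np.arange(0.0, t_orig[-1], 1.0 / target_sr)
    return spline(t_new)

# When downsampling, an anti-aliasing low-pass filter would be applied first,
# as the authors note.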

To produce the MS, the authors use pretrained CNN model weights. They build a CNN model with four convolutional, batch normalization, and rectified linear unit (ReLU) layers, and initialize it with Fourier transform-based weights. In addition, they apply early stopping and weight pruning strategies to improve the CNN model's performance.
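As a point of reference, a conventional mel-spectrogram can be computed directly with librosa using the sampling rate (22,050 Hz) and hop length (52) reported in the Results and Discussion section; this standard transform stands in for the authors' CNN-based generator, which is not reproduced here.

import librosa
import numpy as np

def voice_to_mel_image(path: str, sr: int = 22050, hop_length: int = 52) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)  # log-scaled, image-like array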

            Feature engineering

To extract features, the lowest layers of the ResNet-18 model are used. ResNet-18 follows a multilayer architecture with residual blocks as its building components; each block has two or three convolutional layers plus a shortcut connection. This architecture allows deep networks to be trained efficiently. The lowest layers detect and represent the low-level and mid-level characteristics of the MS images, including edges, textures, and fundamental patterns. A series of feature maps is generated as the image passes through the convolutional layers and residual blocks; these feature maps show neuron activity at various levels and capture increasingly abstract aspects. The ReLU activation functions in the residual blocks provide nonlinearity, enabling the network to learn the complex data patterns of SD. Figure 2 shows the suggested feature engineering model using the ResNet-18 model.

Figure 2: The proposed feature extraction model. Abbreviation: CNN, convolutional neural network.
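A sketch of this idea with torchvision, assuming the stem and the first two residual stages stand in for the "lowest layers"; the exact cut point is an assumption, since the paper does not specify it.

import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)

# Keep the stem and the first two residual stages (low/mid-level features),
# then pool and flatten into a two-dimensional (batch, features) vector.
feature_extractor = nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2,
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
)

with torch.no_grad():
    feats = feature_extractor(torch.randn(1, 3, 224, 224))  # -> shape (1, 128)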

The authors use adaptive pruning to boost model sparsity and training efficiency. Weights, neurons, and connections are pruned adaptively at various training phases, making the process more responsive to model adjustments. When the model improves substantially on the training set but not on the validation set in the initial epochs, the adaptive pruning method prunes less significant weights rapidly to avoid overfitting. As validation performance increases and the model converges during training, the algorithm detects convergence and reduces the pruning rate to avoid deactivating essential weights. It thus adjusts the pruning rate to reach the required sparsity without affecting performance, carefully balancing model size and accuracy. After adaptive pruning, the model may be fine-tuned to recover from aggressive pruning; this entails retraining the pruned model on the original task with a decreased learning rate. In ResNet-18, fully connected (FC) layers follow a global average pooling layer, which reduces the spatial dimensions of the feature maps and thereby compacts the learned features. The authors removed the FC layer after training the model; a flatten layer and a reshape function are used to generate a two-dimensional vector.
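An illustrative adaptive magnitude-pruning step using torch.nn.utils.prune; the rates, the validation signal, and the function name are assumptions for the sketch, not the authors' exact procedure.

import torch.nn as nn
import torch.nn.utils.prune as prune

def adaptive_prune(model: nn.Module, val_improved: bool,
                   rate_fast: float = 0.2, rate_slow: float = 0.05) -> None:
    # Prune aggressively while validation performance stagnates, gently once
    # the model converges.
    rate = rate_slow if val_improved else rate_fast
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.l1_unstructured(module, name="weight", amount=rate)

# After training, prune.remove(module, "weight") makes the sparsity permanent;
# the pruned model is then fine-tuned with a reduced learning rate.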

            SD classification

CatBoost employs a fast and memory-efficient technique to classify the SD features. It handles categorical features effectively, improving the efficiency of SD identification, and uses random permutation of features to enhance generalizability; the model's sensitivity to feature order is reduced to prevent overfitting. In addition, the authors employ hyperband optimization to fine-tune the hyperparameters, including the number of iterations, tree depth, learning rate, and loss function.
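A hedged sketch of tuning those CatBoost hyperparameters with successive halving, a hyperband-style scheme available in scikit-learn; the search ranges are assumptions.

from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV
from catboost import CatBoostClassifier

param_space = {
    "iterations": [200, 500, 1000],
    "depth": [4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1],
    "loss_function": ["MultiClass"],
}
search = HalvingRandomSearchCV(
    CatBoostClassifier(verbose=0),
    param_space,
    factor=3,  # keep roughly the top third of configurations each round
    cv=3,
)
# search.fit(X_train, y_train); best_catboost = search.best_estimator_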

The XGBoost model is used to predict SD from the extracted features. It can handle complex relationships in the features and efficiently generate outcomes on larger datasets. Its feature evaluation functionality reduces the loss and maximizes accuracy, and its regularization controls model complexity in detecting SD. XGBoost uses gradient descent optimization to minimize the objective function, computing the gradient with respect to the model's predictions. The authors use an early stopping strategy to improve the model's performance.
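A sketch of XGBoost with early stopping on a validation split, assuming the scikit-learn wrapper of xgboost 1.6 or later; the parameter values are placeholders.

import xgboost as xgb

clf = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    reg_lambda=1.0,            # L2 regularization to control model complexity
    early_stopping_rounds=20,  # stop once validation loss stops improving
    eval_metric="mlogloss",
)
# clf.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)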

Furthermore, to extend the SVM model to multiclass classification, the authors employ the one-vs-all approach: multiple binary classifiers are trained so that the SVM model can generate multiclass outcomes. Randomized search is used to fine-tune the regularization parameter, balancing the model's prediction rate against its classification error.
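A sketch of the one-vs-all SVM with a randomized search over the regularization parameter C, using scikit-learn; the search range and variable names are assumptions.

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

ova_svm = OneVsRestClassifier(SVC(kernel="rbf", probability=True))
search = RandomizedSearchCV(
    ova_svm,
    param_distributions={"estimator__C": loguniform(1e-2, 1e2)},
    n_iter=20,
    cv=5,
)
# search.fit(base_model_outputs, y_train)  # inputs from the CatBoost/XGBoost stage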

            Evaluation metrics

Evaluating the performance of an EL-based SDD model is essential to determine its generalizability and efficacy, and several metrics can measure that performance from different perspectives. In this study, the authors employ accuracy, precision, recall, F1-score, Matthew's correlation coefficient, and Cohen's kappa to evaluate the model's generalizability on unseen data. In addition, computational strategies and uncertainty analysis are applied to measure the reliability of the proposed model in a resource-constrained environment.
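The listed metrics can be computed with scikit-learn as follows; y_true and y_pred denote the held-out labels and the model's predictions, and macro averaging is an assumption for the multiclass setting.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, cohen_kappa_score)

def evaluate(y_true, y_pred) -> dict:
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall":    recall_score(y_true, y_pred, average="macro"),
        "f1":        f1_score(y_true, y_pred, average="macro"),
        "mcc":       matthews_corrcoef(y_true, y_pred),
        "kappa":     cohen_kappa_score(y_true, y_pred),
    }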

RESULTS AND DISCUSSION

The proposed SDD is implemented on Windows 10 Professional with an NVIDIA A100 Tensor Core GPU. The PyTorch, Librosa, TensorFlow, and Keras libraries are used for model development. The source code for the CatBoost (https://github.com/catboost/catboost), XGBoost (https://github.com/dmlc/xgboost), and SVM (https://github.com/topics/support-vector-machine) models is obtained from their GitHub repositories. Learning rates of 1 × 10^-3 and 1 × 10^-4 are used to train the initial and final convolution layers with the ResNet-18 model's weights. A sampling rate of 22,050 Hz and a hop length of 52 are used to generate the MS from the audio samples. The details of the computational configuration are listed in Table 1.

Table 1: Computational configuration.

Primary parameters | Value
Number of epochs | 14
Batch size | 32
SVM kernel parameter | 2.14
SVM penalty factor for the loss function | 1
Number of convolutional layers for image generation | 5
Number of convolutional layers for feature extraction | 4
Regularization | L1 and L2

Abbreviation: SVM, support vector machine.

Table 2 shows the proposed model's classification performance for the individual classes. The MS generation process assisted the suggested model in identifying the disorders with optimal accuracy; for instance, the model obtained an accuracy of 99.5% with an F1-score of 99.6% for the healthy class. The findings outline a significant improvement in the model's performance. In addition, the multiclass performance of the recommended model is highlighted in Figure 3.

Table 2: Multiclass classification performance of the proposed model.

Classes | Accuracy | Precision | Recall | F1-score | MCC | Kappa
Healthy | 99.5 | 99.7 | 99.6 | 99.6 | 95.8 | 96.1
Hyperkinetic dysphonia | 99.7 | 99.5 | 99.8 | 99.6 | 96.7 | 95.7
Hypokinetic dysphonia | 99.8 | 99.8 | 99.8 | 99.8 | 95.9 | 96.3
Reflux laryngitis | 99.8 | 99.8 | 99.7 | 99.7 | 96.7 | 97.7
Average | 99.7 | 99.7 | 99.7 | 99.6 | 96.2 | 96.4

Abbreviation: MCC, Matthew's correlation coefficient.

Figure 3: Multiclass classification. Abbreviation: MCC, Matthew's correlation coefficient.

The batch-wise performance is given in Table 3. The results indicate no significant variation in the model's performance, showing that the model does not overfit the dataset. The suggested optimization techniques for the base models yielded a better outcome, and the one-vs-all approach supported the meta-model in identifying the individual classes.

Table 3: Batch-wise performance analysis.

Batches | Accuracy | Precision | Recall | F1-score | MCC | Kappa
4 | 97.1 | 97.0 | 97.2 | 97.1 | 90.4 | 90.1
8 | 97.6 | 96.8 | 96.7 | 96.7 | 91.5 | 90.6
12 | 98.5 | 97.5 | 97.1 | 97.3 | 93.4 | 93.1
16 | 99.1 | 98.1 | 98.3 | 98.2 | 94.1 | 95.4
20 | 99.7 | 99.7 | 99.7 | 99.6 | 96.2 | 96.4

Abbreviation: MCC, Matthew's correlation coefficient.

Table 4 presents the generalization performance of the SDD models. It is evident that the proposed model identified the four classes of SDs with outstanding performance. The recommended feature engineering assisted the proposed SD model in achieving an optimal result: the authors fine-tuned the ResNet-18 model to provide meaningful patterns to the base models. In contrast, the ResNet-18 model without hyperparameter optimization yielded lower accuracy. Thus, it is evident that the suggested hyperparameter optimization and adaptive pruning techniques improved the proposed model's performance. The findings of the comparative analysis are illustrated in Figure 4.

Table 4: Comparative analysis findings.

Models | Accuracy | Precision | Recall | F1-score | MCC | Kappa
OpenL3-SVM (Peng et al., 2023) | 99.5 | 99.7 | 99.6 | 99.6 | 95.3 | 90.9
VGGish-SVM (Peng et al., 2023) | 95.0 | 96.5 | 96.4 | 96.4 | 90.1 | 89.8
MobileNet V3 | 93.1 | 93.4 | 93.5 | 93.4 | 90.7 | 91.2
EfficientNet B7 | 94.1 | 94.0 | 94.3 | 94.1 | 96.1 | 95.8
ResNet-18 | 90.8 | 89.1 | 89.8 | 89.4 | 95.7 | 92.3
Proposed SDD | 99.7 | 99.7 | 99.7 | 99.6 | 96.2 | 96.4

Abbreviations: MCC, Matthew's correlation coefficient; SDD, speech disorder detection; SVM, support vector machine.

Figure 4: Findings of comparative analysis. Abbreviations: MCC, Matthew's correlation coefficient; SDD, speech disorder detection; SVM, support vector machine.

The reliability of the generated outcomes is shown in Table 5. The findings highlight that the proposed model requires fewer parameters and floating-point operations to achieve exceptional results, indicating that its outcomes are reliable and trustworthy. In contrast, the existing individual models demand more computational resources for output generation.

Table 5: Computational strategies and uncertainty analysis.

Models | Parameters (millions) | FLOPs (giga) | Loss | Standard deviation | Confidence interval | Testing time (seconds)
OpenL3-SVM (Peng et al., 2023) | 27 | 34 | 1.5 | 0.0005 | 96.4-96.7 | 121.75
VGGish-SVM (Peng et al., 2023) | 36 | 41 | 2.3 | 0.0003 | 97.8-97.9 | 186.43
MobileNet V3 | 29 | 48 | 2.7 | 0.0003 | 95.8-96.1 | 153.31
EfficientNet B7 | 31 | 38 | 2.5 | 0.0004 | 95.3-96.1 | 165.42
ResNet-18 | 42 | 54 | 2.8 | 0.0004 | 96.1-96.7 | 197.52
Proposed SDD | 21 | 31 | 1.3 | 0.0004 | 97.4-98.3 | 118.9

Abbreviations: FLOPs, floating point operations; SDD, speech disorder detection; SVM, support vector machine.

CatBoost and XGBoost combine multiple weak learners (decision trees) into a strong learner. The ensemble methods captured complementary patterns of SD and reduced model variance, leading to improved generalization performance. SVM, CatBoost, and XGBoost offer various hyperparameters that can be fine-tuned to optimize performance; techniques such as grid search or randomized search can efficiently explore the hyperparameter space and identify the optimal configuration for each algorithm.

From a clinical perspective, the proposed SD classifier can assist SLPs in improving individuals' speech. The ability to rapidly examine vast amounts of data allows therapists to concentrate on devising therapies and delivering individualized assistance to SD patients. Therapists and patients may monitor progress using the recommended model: the technology can measure speech progress through regular examinations, so the efficacy of therapies can be assessed and treatment plans modified. Automated SD classifiers simplify speech evaluation and assistance, and diagnostic tools and therapeutic treatments may become more accessible to rural or marginalized populations. Classifier data may also be helpful for future research: researchers may use massive datasets to determine speech problem prevalence, therapeutic efficacy, and demographic trends. Although automated SD classifiers have several advantages, they are intended to augment, not replace, SLPs' skills; a complete and successful SD support system requires both technology and human expertise.

EL boosts machine learning models' performance and resilience. Like other approaches, however, the proposed model has some limitations. It involves training and maintaining three models, which may increase computational complexity, especially if the base models are computationally expensive. The efficacy of the suggested model depends on the diversity of the individual base models. Ensemble models are more complicated and harder to understand than individual models, and interpreting ensemble predictions may be difficult. Despite these shortcomings, EL improves the stability and generalization of the proposed SDD model.

            CONCLUSION

A novel SDD model is proposed to identify SD using individuals' voices. The authors addressed the challenges using feature engineering and the EL approach. They generated MS images using a fine-tuned, Fourier transform-based CNN model. ResNet-18-based feature extraction and EL-based image classification helped the proposed SDD model avoid overfitting and bias. The generalization results reveal the significance of the recommended model in classifying SD: the model produced an outstanding result, outperforming the existing SDD and pretrained CNN models. Speech therapists can benefit from the proposed SDD model, and its implementation in healthcare centers can offer an effective environment for individuals with speech impairment. However, the authors faced a few challenges during model development. The ResNet-18 model required substantial training time to learn the intricate patterns of SD, data augmentation was required to improve the proposed model's performance, and diverse audio samples are necessary to increase the generalizability of the suggested SDD model. Despite promising results in research settings, integrating deep learning models into clinical workflows and achieving adoption by healthcare professionals remain challenging; bridging the gap between research and clinical practice requires addressing usability, scalability, and regulatory considerations. Enhancing the interpretability and explainability of deep learning models for SDD is crucial for gaining the trust of clinicians and patients. Future research should focus on developing transparent models and visualization techniques that provide insights into model predictions and decision-making processes.

            REFERENCES

            1. Adeel A, Gogate M, Hussain A, Whitmer WM. 2019. Lip-reading driven deep learning approach for speech enhancement. IEEE Trans. Emerg. Top. Comput. Intell. Vol. 5(3):481–490

            2. Ariyanti W, Hussain T, Wang JC, Wang CT, Fang SH, Tsao Y. 2021. Ensemble and multimodal learning for pathological voice classification. IEEE Sens. Lett. Vol. 5(7):1–4

            3. Cesari U, De Pietro G, Marciano E, Niri C, Sannino G, Verde L. 2018. A new database of healthy and pathological voices. Comput. Electr. Eng. Vol. 68:310–321

            4. Chaiani M, Selouani SA, Boudraa M, Yakoub MS. 2022. Voice disorder classification using speech enhancement and deep learning models. Biocybern. Biomed. Eng. Vol. 42(2):463–480

            5. Cummins N, Baird A, Schuller BW. 2018. Speech analysis for health: current state-of-the-art and the increasing impact of deep learning. Methods. Vol. 151:41–54

6. Espana-Bonet C, Fonollosa JA. 2016. Automatic speech recognition with deep neural networks for impaired speech. In: Proceedings of the Advances in Speech and Language Technologies for Iberian Languages: Third International Conference, IberSPEECH 2016; Lisbon, Portugal; 23-25 November 2016. Cham: Springer International Publishing. p. 97–107

7. Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, et al. 2000. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation. Vol. 101:e215–e220

            8. Hameed Z, Rehman WU, Khan W, Ullah N, Albogamy FR. 2021. Weighted hybrid feature reduction embedded with ensemble learning for speech data of Parkinson’s disease. Mathematics. Vol. 9(24):3172

9. Jain D, Mishra AK, Das SK. 2021. Machine learning based automatic prediction of Parkinson’s disease using speech features. In: Proceedings of International Conference on Artificial Intelligence and Applications: ICAIA 2020. Singapore: Springer. p. 351–362

            10. Jolad B, Khanai R. 2023. An approach for speech enhancement with dysarthric speech recognition using optimization based machine learning frameworks. Int. J. Speech Technol. Vol. 26(2):287–305

            11. Krecichwost M, Mocko N, Badura P. 2021. Automated detection of sigmatism using deep learning applied to multichannel speech signal. Biomed. Signal Process. Control. Vol. 68:102612

            12. Lai YH, Zheng WZ. 2019. Multi-objective learning based speech enhancement method to increase speech quality and intelligibility for hearing aid device users. Biomed. Signal Process. Control. Vol. 48:35–45

13. Liu Z, Li C, Gao X, Wang G, Yang J. 2017. Ensemble-based depression detection in speech. In: 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); Kansas City, MO, USA. IEEE. p. 975–980

14. Liu Z, Yu H, Li G, Chen Q, Ding Z, Feng L, et al. 2023. Ensemble learning with speaker embeddings in multiple speech task stimuli for depression detection. Front. Neurosci. Vol. 17:1141621

15. Mahmoud SS, Kumar A, Tang Y, Li Y, Gu X, Fu J, et al. 2020. An efficient deep learning based method for speech assessment of mandarin-speaking aphasic patients. IEEE J. Biomed. Health Inform. Vol. 24(11):3191–3202

            16. Mulfari D, La Placa D, Rovito C, Celesti A, Villari M. 2022. Deep learning applications in telerehabilitation speech therapy scenarios. Comput. Biol. Med. Vol. 148:105864

            17. Nossier SA, Wall J, Moniri M, Glackin C, Cannings N. 2020a. An experimental analysis of deep learning architectures for supervised speech enhancement. Electronics. Vol. 10(1):17

18. Nossier SA, Wall J, Moniri M, Glackin C, Cannings N. 2020b. A comparative study of time and frequency domain approaches to deep learning based speech enhancement. In: 2020 International Joint Conference on Neural Networks (IJCNN); Glasgow, UK; 19-24 July 2020. IEEE. p. 1–8

19. Peng X, Xu H, Liu J, Wang J, He C. 2023. Voice disorder classification using convolutional neural network based on deep transfer learning. Sci. Rep. Vol. 13:7264

            20. Pravin SC, Palanivelan M. 2021. A hybrid deep ensemble for speech disfluency classification. Circuits Syst. Signal Process. Vol. 40:3968–3995

            21. Shahamiri SR. 2021. Speech vision: an end-to-end deep learning-based dysarthric automatic speech recognition system. IEEE Trans. Neural Syst. Rehabil. Eng. Vol. 29:852–861

            22. Shahin M, Zafar U, Ahmed B. 2019. The automatic detection of speech disorders in children: challenges, opportunities, and preliminary results. IEEE J. Sel. Top. Signal Process. Vol. 14(2):400–412

            23. Sivakumar NYC, Shankar A. 2022. The speech-language processing model for managing the neuro-muscle disorder patients by using deep learning. NeuroQuantology. Vol. 20(8):918

24. Smith DV, Sneddon A, Ward L, Duenser A, Freyne J, Silvera-Tawil D, et al. 2017. Improving child speech disorder assessment by incorporating out-of-domain adult speech. In: Proceedings of Interspeech; Stockholm, Sweden; 20-24 August 2017. p. 2690–2694

            25. Suthar K, Yousefi Zowj F, Speights Atkins M, He QP. 2022. Feature engineering and machine learning for computer-assisted screening of children with speech disorders. PLoS Digit. Health. Vol. 1(5):e0000041

            26. Tan K, Wang D. 2021. Towards model compression for deep learning based speech enhancement. IEEE/ACM Trans. Audio Speech Lang. Process. Vol. 29:1785–1794

27. Valles D, Matin R. 2021. An audio processing approach using ensemble learning for speech-emotion recognition for children with ASD. In: 2021 IEEE World AI IoT Congress (AIIoT); Seattle, WA, USA; 10-13 May 2021. IEEE. p. 0055–0061

28. Vásquez-Correa JC, Rios-Urrego CD, Arias-Vergara T, Schuster M, Rusz J, Nöth E, et al. 2021. Transfer learning helps to improve the accuracy to classify patients with different speech disorders in different languages. Pattern Recognit. Lett. Vol. 150:272–279

            29. Wang D, Chen J. 2018. Supervised speech separation based on deep learning: an overview. IEEE/ACM Trans. Audio Speech Lang. Process. Vol. 26(10):1702–1726

30. Xiong F, Barker J, Christensen H. 2018. Deep learning of articulatory-based representations and applications for improving dysarthric speech recognition. In: Speech Communication; 13th ITG-Symposium; Oldenburg, Germany; 10-12 October 2018. VDE. p. 1–5

            31. Xu Y, Du J, Dai L-R, Lee C-H. 2014. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. Vol. 23(1):7–19

32. Zhang HH, Yang L, Liu Y, Wang P, Yin J, Li Y, et al. 2016. Classification of Parkinson’s disease utilizing multi-edit nearest-neighbor and ensemble learning algorithms with speech samples. Biomed. Eng. Online. Vol. 15:122

            Author and article information

            Journal
            jdr
            Journal of Disability Research
King Salman Centre for Disability Research (Riyadh, Saudi Arabia)
            1658-9912
2 April 2024
Volume: 3
Issue: 3
Article: e20240026
            Affiliations
[1] Department of Computer Science and Information Systems, College of Applied Sciences, AlMaarefa University, Ad Diriyah, Riyadh 13713, Saudi Arabia (https://ror.org/00s3s5518)
[2] Department of Documents and Archive, Center of Documents and Administrative Communication, King Faisal University, Hofuf 31982, Al-Ahsa, Saudi Arabia (https://ror.org/00dn43547)
            Author notes
            Author information
            https://orcid.org/0000-0002-1208-2678
            https://orcid.org/0000-0001-5445-7899
            Article
            10.57197/JDR-2024-0026
            Copyright © 2024 The Authors.

            This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY) 4.0, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.

            History
Received: 25 January 2024
Revised: 13 March 2024
Accepted: 14 March 2024
            Page count
            Figures: 4, Tables: 5, References: 32, Pages: 8
            Funding
            Funded by: King Salman Center for Disability Research
            Award ID: KSRG-2023-320
The authors extend their appreciation to the King Salman Center for Disability Research (funder ID: http://dx.doi.org/10.13039/501100019345) for funding this work through Research Group no. KSRG-2023-320.
            Categories

            Computer science
speech impairment, speech disorders, mel spectrogram, voice samples, ensemble learning, feature extraction, ResNet 18, deep learning
