INTRODUCTION
Speech disorders (SDs) affect an individual’s capacity to produce sounds, rhythm, and voice quality for communication (Chaiani et al., 2022). Various speech impairments can affect an individual’s articulation, fluency, and voice quality, and may significantly impair communication (Xiong et al., 2018). These conditions may be present in childhood or develop during early adulthood. SDs can be broadly classified into articulation, fluency, voice, and language disorders (Shahin et al., 2019). Articulation disorders impair sound pronunciation: individuals may substitute, omit, distort, or add sounds, making speech unintelligible and challenging for others to understand (Mulfari et al., 2022). Fluency disorders, including stuttering, disrupt the natural flow of speech. Individuals with fluency disorders may hesitate, repeat, or prolong sounds, and may avoid conversation altogether to prevent stuttering. Voice disorders entail impairments of pitch, loudness, or quality (Mahmoud et al., 2020). These conditions may result from vocal nodules or vocal cord paralysis; the resulting difficulties can affect communication and social relations and cause vocal fatigue. Language disorders involve problems with understanding and producing words (Smith et al., 2017; Cummins et al., 2018; Adeel et al., 2019; Shahamiri, 2021). They can cause challenges with phrase construction, grammar, and vocabulary, and academic, social, and cognitive development may be affected. Communicating, comprehending, and engaging in age-appropriate activities may be difficult for individuals with language disorders.
Children with speech-language disorders, especially receptive language disorders, have difficulty interpreting and processing verbal information and face challenges in learning environments. Limited linguistic comprehension and memory may significantly restrict a child’s capacity to engage in activities (Nossier et al., 2020a). Oral instructions involving multiple words and phrases can be challenging for children with SDs. Studies indicate that communication ability significantly affects the development of reading skills. Individuals with severe speech impairments, or those unable to communicate verbally, may use augmentative and alternative communication (AAC) devices and assistive technology. AAC enables individuals to communicate using symbols, graphics, or synthetic voice on communication boards and electronic devices (Xu et al., 2014). Automatic speech recognition (ASR) aims to develop technology capable of comprehending and responding to human speech (Tan and Wang, 2021). As a subfield of natural language processing and voice processing, it deals with processing written and spoken words (Jolad and Khanai, 2023). To enable the automated transcription of spoken words into text, it uses machine learning algorithms to analyze and interpret spoken language.
SD patients benefit from automated SD classifiers (Nossier et al., 2020b). Machine learning and artificial intelligence-based classifiers assist in the assessment, diagnosis, and management of speech impairments. Speech abnormalities may be detected at an early stage using automated classifiers, which analyze speech patterns to detect pronunciation, fluency, and other SDs; early detection enables earlier therapy. Automatic classifiers diagnose speech abnormalities objectively and consistently (Wang and Chen, 2018). Traditional evaluations may be subjective and clinician-dependent, whereas automated systems can conduct consistent evaluations using predetermined specifications, leading to reliable diagnoses. The utilization of automatic SD classifiers also enables remote monitoring, which is particularly crucial for telehealth or remote treatment sessions (Krecichwost et al., 2021; Vásquez-Correa et al., 2021; Sivakumar and Shankar, 2022). Therapists may track progress and adjust therapies by using the classifier to assess individuals’ voice recordings. An automated classifier can assist in establishing customized treatment procedures by analyzing speech patterns: the technology can identify speech impairments and recommend training sessions to address them. Automated classifiers can also provide rapid feedback during speech exercises. Real-time feedback helps individuals practice and adjust their speaking habits, encouraging reflection and enhancing the learning process. A convolutional neural network (CNN) can extract crucial visual characteristics to identify speech problems, and its architecture, preprocessing, and training may be customized for the task (Espana-Bonet and Fonollosa, 2016). CNNs’ convolutional layers learn hierarchies of localized features: they capture visual patterns and edges in the input using convolutional filters. Convolutional operations are essential for detecting SD-related visual characteristics.
The network can capture more abstract and complicated characteristics with multiple convolutional blocks (Lai and Zheng, 2019). Each block typically includes convolutional, activation, and pooling layers.
Ensemble learning (EL) assists in recognizing SDs by addressing generalization, robustness, and the diversity of speech data (Valles and Matin, 2021). Combining numerous models improves the performance and reliability of the detection system, making EL a valuable method for constructing reliable speech disorder detection (SDD) models. A model overfits when it performs well on training data but poorly on unseen data (Zhang et al., 2016). EL approaches, including bagging and boosting, combine models trained on multiple data subsets or weightings to reduce overfitting. Noise and variation are inherent in speech data; combining the predictions of numerous models can strengthen the detection process. An EL-based model can handle varied speech patterns, accents, and background noise. Hence, the authors aim to build an SDD model using the EL approach. The contributions of this study are as follows:
A mel-spectrogram (MS) generator based on the customized CNN model using the voice samples.
A feature engineering technique based on an enhanced ResNet-18 model.
An EL-based SDD model using fine-tuned CatBoost, XGBoost, and support vector machine (SVM) models.
The structure of this study is as follows: The Literature Review section offers the significance of SDD in improving the speech level of individuals with speech impairment. The next section describes the research methodology. The Results and Discussion section presents the outcome of the study. Finally, the Conclusion section highlights the study’s contribution to the SDD literature.
LITERATURE REVIEW
Therapy for speech impairments depends on their nature and severity. Speech-language pathologists (SLPs) analyze, diagnose, and customize treatment plans for speech problems. Speech therapy is the gold standard for treating a wide range of speech impairments (Pravin and Palanivelan, 2021). During treatment, the patient focuses on voice quality, fluency, articulation, and overall speech output (Hameed et al., 2021). To address SDs, therapists employ drills, exercises, and interactive activities. For instance, individuals with articulation difficulties may participate in activities prioritizing accurate sound articulation. Specialized articulation therapy aims to assist individuals who struggle with voice production (Suthar et al., 2022). Enhancing articulation improves the positioning and movement of the speech organs. To support patients facing challenges with sounds and syllables, therapists employ a variety of games, exercises, and activities (Liu et al., 2017). Learning can be improved with visual aids and feedback. Treatment regimens are tailored to each patient’s requirements and objectives, and treatment efficacy relies on the type and degree of the disorder.
ASR relies on an acoustic model to capture the relationship between audio signals and phonemes (Ariyanti et al., 2021). Training such a model requires massive collections of audio recordings and transcriptions. A language model helps the ASR system comprehend context and the probability of word sequences; it addresses linguistic grammar, syntax, and word frequency (Liu et al., 2023). Language models improve context-based word recognition. The ASR system learns to identify a set of words from a dictionary or vocabulary, which narrows the options for transcribing audible utterances into text. To extract characteristics from the speech stream, ASR systems frequently employ preprocessing procedures: methods such as Fourier analysis and mel-frequency cepstral coefficients render the speech signal suitable for machine learning algorithms. Decoding determines the most probable word or phoneme sequence from the incoming voice signal, and the transcription is generated by integrating acoustic, linguistic, and lexical information.
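The Fourier-analysis preprocessing step described above can be sketched as follows. This is a minimal illustration using SciPy’s short-time Fourier transform on a synthetic tone; the sampling rate, window length, and test signal are illustrative assumptions, not parameters from any of the reviewed systems.

```python
import numpy as np
from scipy.signal import stft

# a 0.5 s synthetic vowel-like tone at 220 Hz, sampled at 16 kHz (illustrative)
sr = 16000
t = np.arange(int(0.5 * sr)) / sr
x = np.sin(2 * np.pi * 220 * t)

# short-time Fourier analysis: 25 ms windows with a 10 ms hop
freqs, frames, Z = stft(x, fs=sr, nperseg=400, noverlap=240)
magnitude = np.abs(Z)                       # (201 frequency bins, n frames)
peak_hz = freqs[np.argmax(magnitude.mean(axis=1))]
print(peak_hz)   # within one 40 Hz frequency bin of the 220 Hz tone
```

Each column of `magnitude` is one frame of spectral features of the kind an acoustic model consumes; mel-frequency cepstral coefficients are derived from a mel-warped, log-compressed version of this representation.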
Automated models frequently encounter difficulties when dealing with the varied nature of speech impairments (Liu et al., 2023). Numerous speech problems have unique characteristics, and a knowledge gap exists in developing models that properly diagnose atypical or poorly understood speech problems. One drawback in training models for automated SD identification is the relatively limited availability of big, varied datasets. Annotated datasets covering a variety of speech problems, age ranges, and language variants are crucial for effective models. SD identification may be improved by combining audio, video, and language information; however, integrating and using multimodal inputs remains difficult. Many models are trained on language-specific datasets; consequently, their performance may vary between languages (Jain et al., 2021). A knowledge gap exists in developing models that generalize across languages and dialects. Speech difficulties are commonly associated with hearing or cognitive disabilities. Ongoing research focuses on understanding how automated models can accurately identify speech impairments in persons with multiple co-occurring conditions.
RESEARCH METHODOLOGY
An EL-based model is proposed to improve the performance of SDD. An MS generator is built to produce images from the audio samples. The residual learning principle and block architecture of ResNet-18 have inspired several image categorization and computer vision models. The pretrained ResNet-18 model is helpful for transfer learning: it achieves excellent performance with minimal task-specific data, as demonstrated in domains such as medical imaging. EL combines the predictions of multiple models, which often leads to better accuracy compared to individual models. Ensembles are less sensitive to noise and outliers in the data, as they can average out the errors and biases of individual models. However, ensembles typically involve training and combining multiple models, which can increase computational complexity, memory requirements, and training time. Ensembles often require tuning of hyperparameters, such as the number of base models, their architectures, and the method of combining predictions, which can be time-consuming and require careful experimentation. Automated hyperparameter tuning methods, such as grid search or Bayesian optimization, can help find optimal hyperparameters for ensemble models. Post hoc interpretability techniques, such as feature importance analysis or model visualization, can provide insights into the contributions of individual models to the ensemble prediction.
SVM models are well known for their capacity to achieve excellent accuracy in both binary and multiclass classification problems. They perform strongly even in high-dimensional spaces and are very effective at distinguishing between classes with intricate decision boundaries. SVM models provide a regularization parameter that effectively manages overfitting, making them resilient to noisy data and ensuring strong generalization on unseen data. They can handle several data types, including numerical and categorical information, which makes them highly adaptable for various classification tasks, and they may also be modified for regression and outlier detection applications. SVM models provide distinct decision boundaries, which enhances their interpretability. Furthermore, the support vectors, the data points in close proximity to the decision boundary, play a vital role in establishing that boundary and in understanding the model’s predictions.
CatBoost and XGBoost are widely used for classification and regression. The characteristics of these models have motivated the authors to apply them in SDD development. A feature engineering technique is proposed using the weights of the ResNet-18 model for the feature extraction. The authors introduced the SDD model using CatBoost, XGBoost, and SVM models. The proposed SDD model is presented in Figure 1.

The recommended SDD model. Abbreviations: SDD, speech disorder detection; SVM, support vector machine.
Data acquisition
The authors employ the VOice ICar fEDerico II (VOICED) dataset (Goldberger et al., 2000; Peng et al., 2023) to generalize the proposed SDD model. The dataset contains 150 pathological and 58 healthy samples, recorded by 135 females and 73 males. A Samsung Galaxy S4 mobile device with a dedicated voice recorder application was used to record the participants’ voices. The data owners positioned the device at 45° and 20 cm from the participants during recording. The participants were instructed to pronounce the vowel “a” at a fixed sound intensity. Each recording lasted up to 5 seconds.
Voice preprocessing
The authors employed a preprocessing technique to normalize the voice samples. A total of 208 voice recordings are obtained from the dataset, and each voice sample contains some anomalous segments. The authors employed the cubic interpolation (CI) technique to expand the samples in order to increase the dataset size. Based on the study by Cesari et al. (2018), the authors applied a resampling ratio of 48 Hz. CI is a widely used resampling approach for audio samples: it determines intermediate values between audio samples to generate a continuous signal representation. The audio samples are divided into segments to apply the CI function, and anti-aliasing filters are employed to remove high frequencies. Equation 1 reflects the cubic polynomial used to compute the intermediate value:

p(x) = nx³ + mx² + kx + q (1)

where n, m, k, and q are the coefficients identified from the data points, and x is the data point.
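The CI-based resampling can be sketched as follows. This is a minimal illustration using SciPy’s `CubicSpline`, which fits one cubic polynomial per segment in the sense of Equation 1; the `resample_cubic` helper, the sampling rates, and the test sine are illustrative assumptions, and the anti-aliasing filtering the authors apply is omitted.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_cubic(signal, orig_sr, target_sr):
    """Resample a 1-D signal by fitting piecewise cubic polynomials
    and evaluating them at the new sample instants."""
    t_orig = np.arange(len(signal)) / orig_sr
    n_new = int(round(len(signal) * target_sr / orig_sr))
    t_new = np.arange(n_new) / target_sr
    return CubicSpline(t_orig, signal)(t_new)

# toy example: upsample a 100 Hz sine from 8 kHz to 16 kHz
sr_in, sr_out = 8000, 16000
t = np.arange(sr_in) / sr_in
x = np.sin(2 * np.pi * 100 * t)
y = resample_cubic(x, sr_in, sr_out)
print(len(x), len(y))   # 8000 16000
```

Because the spline passes through every original sample, upsampling in this way expands the sample count while preserving the waveform shape.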
To produce the MS, the authors use pretrained CNN model weights. They build a CNN model with four convolutional, batch normalization, and rectified linear unit (ReLU) layers, and initialize it with Fourier transform-based CNN weights before training. In addition, they applied early stopping and weight pruning strategies to improve the CNN model’s performance.
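For reference, the underlying mel-spectrogram computation can be sketched from first principles with NumPy and SciPy. The FFT size, hop length, and mel-band count below are illustrative defaults, and this plain signal-processing sketch does not reproduce the authors’ Fourier transform-based CNN generator; it only shows what an MS image encodes.

```python
import numpy as np
from scipy.signal import stft

def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(x, sr, n_fft=1024, hop=256, n_mels=64):
    """Log-mel spectrogram: STFT power mapped through a triangular
    mel filterbank, then compressed to decibels."""
    _, _, Z = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    power = np.abs(Z) ** 2                       # (n_fft//2+1, frames)
    # mel-spaced band edges converted back to FFT bin indices
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c == l: c += 1                        # guard degenerate triangles
        if r <= c: r = c + 1
        fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return 10.0 * np.log10(np.maximum(fb @ power, 1e-10))  # dB scale

sr = 22050                                # sampling rate used in the study
t = np.arange(sr) / sr
S = mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr)
print(S.shape)   # (64 mel bands, n frames)
```

The resulting two-dimensional array is what gets rendered as the MS image consumed by the downstream networks.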
Feature engineering
In order to extract features, the lower layers of the ResNet-18 model are used. ResNet-18 follows a multilayer architecture using residual blocks as its building components. In addition to a shortcut connection, each block has two or three convolutional layers. The network architecture efficiently trains deep networks. The lower layers detect and represent low-level and mid-level characteristics observed in the MS images, including edges, textures, and fundamental patterns. A series of feature maps are generated as the image passes through the convolutional layers and residual blocks. These feature maps show neuron activity at various levels and capture progressively more abstract aspects. The residual blocks’ ReLU activation functions provide nonlinearity, enabling the network to learn the complicated data patterns of SD. Figure 2 shows the suggested feature engineering model using the ResNet-18 model.
The authors use adaptive pruning to boost model sparsity and training efficiency. Weights, neurons, and connections are pruned adaptively at various training phases so that pruning remains responsive to model adjustments. The adaptive pruning method observes that, in the initial epochs, the model improves substantially on the training set while not improving on the validation set; to avoid overfitting, it rapidly prunes less significant weights. As validation performance increases and the model converges during training, the algorithm detects convergence and reduces the pruning rate to avoid deactivating essential parameters. As the model converges, it adjusts the pruning rate to reach the required sparsity without affecting performance, carefully balancing model size against accuracy. After adaptive pruning, the model may be fine-tuned to recover from aggressive pruning; this entails retraining the pruned model on the original task with a decreased learning rate. In ResNet-18, a fully connected (FC) layer follows a global average pooling layer. The global average pooling layer reduces the spatial dimensions of the feature maps, shrinking the learned representation. The authors removed the FC layer after training the model, and a flatten operation with a reshape function is used to generate a two-dimensional feature vector.
SD classification
CatBoost employs a fast and memory-efficient technique to classify the SD features. It handles categorical features effectively and improves the efficiency of SD identification. It uses random permutation of features to enhance generalizability, reducing the model’s sensitivity to feature order and preventing overfitting. In addition, the authors employ hyperband optimization to fine-tune the hyperparameters, including iterations, depth, learning rate, and loss function.
The XGBoost model is used to predict SD using the extracted features. It can handle complex relationships in the features and efficiently generate outcomes on larger datasets. Its feature evaluation functionality reduces the loss and maximizes accuracy, while the regularization feature controls the model complexity in detecting SD. XGBoost uses gradient descent optimization to minimize the objective function, computing the gradient with respect to the model’s predictions. The authors use an early stopping strategy to improve the model’s performance.
Furthermore, to extend the SVM model to multiclass classification, the authors employ the one-vs-all approach: multiple binary classifiers are trained so that the SVM model can generate multiclass outcomes. Randomized search is used to fine-tune the regularization parameter to maintain the trade-off between the model’s prediction rate and classification error.
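The one-vs-all construction and the randomized search over the regularization parameter can be sketched as follows; the synthetic data, the RBF kernel choice, and the search range for C are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# synthetic stand-in for the extracted SD feature vectors (four classes)
X, y = make_classification(n_samples=300, n_features=20, n_classes=4,
                           n_informative=8, random_state=0)

# one-vs-all: one binary SVM per class; randomized search tunes C, the
# regularization parameter balancing margin width against training error
ovr = OneVsRestClassifier(SVC(kernel="rbf"))
search = RandomizedSearchCV(ovr, {"estimator__C": loguniform(1e-2, 1e2)},
                            n_iter=8, cv=3, random_state=0)
search.fit(X, y)
print(search.best_params_["estimator__C"], round(search.best_score_, 3))
```

At prediction time, each of the four binary SVMs scores the input and the class whose classifier is most confident wins.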
Evaluation metrics
Evaluating the performance of an EL-based SDD model is essential to determine its generalizability and efficacy. A number of metrics may be used to measure the model’s performance from multiple perspectives. In this study, the authors employ accuracy, precision, recall, F1-score, Matthew’s correlation coefficient, and Cohen’s kappa to evaluate the model’s generalizability on unseen data. In addition, computational strategies and uncertainty analysis are performed to measure the reliability of the proposed model in a resource-constrained environment.
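The six metrics can all be computed with scikit-learn; the labels below are a small hypothetical four-class example, not data from the study, and macro averaging is assumed for the per-class metrics.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, cohen_kappa_score)

# hypothetical four-class labels (0=healthy, 1-3=disorder classes)
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 1]

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print("accuracy :", acc)                                          # 0.75
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
print("f1       :", f1_score(y_true, y_pred, average="macro"))
print("mcc      :", mcc)
print("kappa    :", cohen_kappa_score(y_true, y_pred))
```

MCC and Cohen’s kappa are the more informative pair under class imbalance (here, 150 pathological versus 58 healthy samples), since both correct for chance agreement.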
RESULTS AND DISCUSSION
The proposed SDD is implemented on Windows 10 Professional with an NVIDIA A100 Tensor Core GPU. PyTorch, Librosa, TensorFlow, and Keras libraries are used for model development. The GitHub repositories of CatBoost (https://github.com/catboost/catboost), XGBoost (https://github.com/dmlc/xgboost), and SVM (https://github.com/topics/support-vector-machine) provide the source code for the respective models. Learning rates of 1 × 10⁻³ and 1 × 10⁻⁴ are used to train the initial and final convolutional layers with the ResNet-18 model’s weights. A sampling rate of 22,050 Hz and a hop length of 52 are used to generate the MS from the audio samples. The details of the computational configuration are listed in Table 1.
Computational configuration.
Primary parameters | Value |
---|---|
Number of epochs | 14 |
Batch size | 32 |
SVM kernel parameter | 2.14 |
SVM penalty factor for the loss function | 1 |
Number of convolutional layers for image generation | 5 |
Number of convolutional layers for feature extraction | 4 |
Regularization | L1 and L2 |
Abbreviation: SVM, support vector machine.
Table 2 shows the proposed model’s classification performance for the individual classes. The process of MS generation has assisted the suggested model in identifying the disorders with optimal accuracy. For instance, the model obtained an accuracy of 99.5% with an F1-score of 99.6% for the healthy class. The findings outline the significant improvement in the model’s performance. In addition, the multiclass performance of the recommended model is highlighted in Figure 3.
Multiclass classification performance of the proposed model.
Classes | Accuracy | Precision | Recall | F1-score | MCC | Kappa |
---|---|---|---|---|---|---|
Healthy | 99.5 | 99.7 | 99.6 | 99.6 | 95.8 | 96.1 |
Hyperkinetic dysphonia | 99.7 | 99.5 | 99.8 | 99.6 | 96.7 | 95.7 |
Hypokinetic dysphonia | 99.8 | 99.8 | 99.8 | 99.8 | 95.9 | 96.3 |
Reflux laryngitis | 99.8 | 99.8 | 99.7 | 99.7 | 96.7 | 97.7 |
Average | 99.7 | 99.7 | 99.7 | 99.6 | 96.2 | 96.4 |
Abbreviation: MCC, Matthew’s correlation coefficient.
The batch-wise performance is given in Table 3. The results indicate no significant variation in the model’s performance, showing that the model is not overfitting the dataset. The suggested optimization techniques for the base models yielded a better outcome. In addition, the one-vs-all approach supported the meta-model in identifying the individual classes.
Batch-wise performance analysis.
Batches | Accuracy | Precision | Recall | F1-score | MCC | Kappa |
---|---|---|---|---|---|---|
4 | 97.1 | 97.0 | 97.2 | 97.1 | 90.4 | 90.1 |
8 | 97.6 | 96.8 | 96.7 | 96.7 | 91.5 | 90.6 |
12 | 98.5 | 97.5 | 97.1 | 97.3 | 93.4 | 93.1 |
16 | 99.1 | 98.1 | 98.3 | 98.2 | 94.1 | 95.4 |
20 | 99.7 | 99.7 | 99.7 | 99.6 | 96.2 | 96.4 |
Abbreviation: MCC, Matthew’s correlation coefficient.
Table 4 presents the generalization performance of the SDD models. It is evident that the proposed model identified the four classes of SDs with outstanding performance. The recommended feature engineering assisted the proposed SD model in achieving an optimal result. The authors fine-tuned the ResNet-18 model in order to produce meaningful patterns for the base models. In contrast, the ResNet-18 model without hyperparameter optimization yielded lower accuracy. Thus, it is evident that the suggested hyperparameter optimization and adaptive pruning techniques improved the proposed model’s performance. The findings of the comparative analysis are illustrated in Figure 4.
Comparative analysis findings.
Models | Accuracy | Precision | Recall | F1-score | MCC | Kappa |
---|---|---|---|---|---|---|
OpenL3-SVM (Peng et al., 2023) | 99.5 | 99.7 | 99.6 | 99.6 | 95.3 | 90.9 |
VGGish-SVM (Peng et al., 2023) | 95.0 | 96.5 | 96.4 | 96.4 | 90.1 | 89.8 |
MobileNet V3 | 93.1 | 93.4 | 93.5 | 93.4 | 90.7 | 91.2 |
EfficientNet B7 | 94.1 | 94.0 | 94.3 | 94.1 | 96.1 | 95.8 |
ResNet-18 | 90.8 | 89.1 | 89.8 | 89.4 | 95.7 | 92.3 |
Proposed SDD | 99.7 | 99.7 | 99.7 | 99.6 | 96.2 | 96.4 |
Abbreviations: MCC, Matthew’s correlation coefficient; SDD, speech disorder detection; SVM, support vector machine.

Findings of comparative analysis. Abbreviations: MCC, Matthew’s correlation coefficient; SDD, speech disorder detection; SVM, support vector machine.
The reliability of the generated outcomes is revealed in Table 5. The findings highlight that the proposed model required fewer parameters and floating-point operations to achieve exceptional results, showing that the model’s outcomes are reliable and trustworthy. In contrast, the existing individual models demanded more computational resources for output generation.
Computational strategies and uncertainty analysis.
Models | Parameters (in millions) | FLOPs (in giga) | Loss | Standard deviation | Confidence interval | Testing time (in seconds) |
---|---|---|---|---|---|---|
OpenL3-SVM (Peng et al., 2023) | 27 | 34 | 1.5 | 0.0005 | 96.4-96.7 | 121.75 |
VGGish-SVM (Peng et al., 2023) | 36 | 41 | 2.3 | 0.0003 | 97.8-97.9 | 186.43 |
MobileNet V3 | 29 | 48 | 2.7 | 0.0003 | 95.8-96.1 | 153.31 |
EfficientNet B7 | 31 | 38 | 2.5 | 0.0004 | 95.3-96.1 | 165.42 |
ResNet-18 | 42 | 54 | 2.8 | 0.0004 | 96.1-96.7 | 197.52 |
Proposed SDD | 21 | 31 | 1.3 | 0.0004 | 97.4-98.3 | 118.9 |
Abbreviations: FLOPs, floating point operation; SDD, speech disorder detection; SVM, support vector machine.
CatBoost and XGBoost combine multiple weak learners (decision trees) to form a strong learner. The ensemble methods captured complementary patterns of SD and reduced model variance, leading to improved generalization performance. SVM, CatBoost, and XGBoost offer various hyperparameters that can be fine-tuned to optimize the model’s performance. Techniques such as grid search or randomized search can be used to efficiently search the hyperparameter space and identify the optimal configuration for each algorithm.

From a clinical perspective, the proposed SD classifier can assist SLPs in improving individuals’ speech. The ability to rapidly examine vast amounts of data allows therapists to concentrate on devising therapies and delivering individualized assistance to SD patients. Therapists and patients may monitor progress using the recommended model: the technology can measure speech progress through regular examinations, and the efficacy of therapies may be assessed and treatment plans modified using the proposed model. Speech evaluation and assistance are simplified using automated SD classifiers, and diagnostic tools and therapeutic treatments may become more accessible to rural or marginalized populations. Data from automated classifiers may also be helpful for future research: researchers may use massive datasets to determine speech problem prevalence, therapeutic efficacy, and demographic trends. Although automated SD classifiers have several advantages, they are intended to complement SLPs’ skills; a complete and successful SD support system requires both technology and human expertise.

EL boosts machine learning models’ performance and resilience. Like other approaches, the proposed model has some limitations. It involves training and maintaining three models, which may increase computational complexity, especially if the base models are computationally expensive.
The efficacy of the suggested model depends on the diversity of the individual base models. Ensemble models are more complicated and harder to understand than individual models, and interpreting ensemble predictions may be difficult. Despite these shortcomings, EL improves the stability and generalization of the proposed SDD model.
CONCLUSION
A novel SDD model is proposed to identify SDs using individuals’ voices. The authors addressed the challenges using the feature engineering and EL approach. They generated MS images using the fine-tuned Fourier transform-based CNN model. ResNet-18-based feature extraction and EL-based image classification have supported the proposed SDD model in preventing overfitting and bias. The generalization output has revealed the significance of the recommended model in classifying SD. The model produced an outstanding result by outperforming the existing SDD and pretrained CNN models. Speech therapists can benefit from the proposed SDD model, and its implementation in healthcare centers can offer an effective environment for individuals with speech impairment. However, the authors faced a few challenges during model development. The ResNet-18 model required substantial training time to learn the intricate patterns of SD. Data augmentation was required to improve the proposed model’s performance. In addition, diverse audio samples are necessary to increase the generalizability of the suggested SDD model. Despite promising results in research settings, the integration of deep learning models into clinical workflows and their adoption by healthcare professionals remain challenging. Bridging the gap between research and clinical practice requires addressing usability, scalability, and regulatory considerations. Enhancing the interpretability and explainability of deep learning models for SDD is crucial for gaining the trust of clinicians and patients. Future research should focus on developing transparent models and visualization techniques to provide insights into model predictions and decision-making processes.