INTRODUCTION
Neurons (Ma, 2023), also called nerve cells, play a significant role in controlling human movement and activity, such as walking, talking, and self-care. The neuron is the fundamental unit of the nervous system, responsible for carrying information to all parts and organs of the body. In particular, motor neurons (Hug et al., 2023; Limone et al., 2023) are specialized nerve cells that connect to the glands, muscles, and organs throughout the body. The degeneration and death of these motor neurons lead to amyotrophic lateral sclerosis (ALS) (Masrori and van Damme, 2020; Mead et al., 2023; Vidovic et al., 2023). In the past few years, the prevalence of ALS has been rising worldwide; Saudi Arabia, the United States, New Zealand, Uruguay, and several other countries are notably affected. To reduce the risk posed by ALS (Akçimen et al., 2023; Udine et al., 2023), it is necessary to detect the disease early so that appropriate treatment can be given; early detection also reduces the disease's severity and improves the ALS survival rate. Classical screening for ALS includes a blood test to identify ALS-related gene mutations (Akçimen et al., 2023; Suzuki et al., 2023). This is an expensive, time-intensive process that requires skilled clinicians. To resolve this, researchers have turned to artificial intelligence (AI) (Nakamori et al., 2023; Segura et al., 2023; Tavazzi et al., 2023) for the detection and classification of ALS. This technology, ranging from machine learning (ML) to deep learning (DL), is now a predominant technique in the healthcare industry for automating disease diagnosis. It reduces the risk of human error, increases speed, and minimizes the manpower required for screening.
Correspondingly, numerous research studies have concentrated on ALS screening through ML and DL. In general, existing research has focused on classifying neurological disorders such as ALS, Huntington's disease (HD), and Parkinson's disease (PD). In detecting ALS, most studies have focused on symptoms such as speech and behavioral changes. For instance, one conventional model deploys an ML-based method to classify ALS and to identify its health impacts; it achieved a high accuracy of 93.28% (Sekar et al., 2022). Similarly, traditional research uses linear discriminant analysis (LDA) to classify phonation production in ALS patients. Here, acoustic features such as jitter, mel-frequency cepstral coefficients (MFCC), pitch period entropy (PPE), and formants are used. Several feature selection algorithms were tested to identify the optimal feature subset for LDA. The results show that LASSO feature selection performed best, and LDA with five features attained a higher accuracy of 87.5% (Vashkevich and Rushkevich, 2021). Likewise, another conventional model classifies ALS through an ML architecture. It used linear and non-linear features and combined brain structure and function associated with ALS. Multimodal and unimodal random forest (RF) classifiers were implemented and evaluated using five-fold cross-validation. The results show that the unimodal classifier attained an accuracy of 61.66% and the multimodal classifier 66.82% (Thome et al., 2022). Nevertheless, although existing research has attained effective results, it has drawbacks such as limited accuracy and speed, overfitting, and the need for noise removal.
To resolve these limitations and enhance ALS screening, the proposed model employs progressive entropy weighted focal loss (PEWFL)-XGBoost to classify ALS and non-ALS. The research is carried out on the Kaggle ALS dataset, which comprises gene samples of both ALS and non-ALS subjects. The proposed method comprises preprocessing, data splitting, classification, and prediction phases. Initially, the data are loaded into the system. A normalization-based preprocessing method then prepares the dataset for classification, after which the data are divided into training and testing sets. Classification is performed with PEWFL-XGBoost using the training data. In the prediction phase, the proposed model is run on the testing data to estimate its performance. Finally, the efficiency of the method is calculated using performance metrics, and an internal comparison of conventional algorithms with the proposed model is carried out to show the effectiveness of the approach.
The major contributions of the proposed method are as follows:
To employ PEWFL-XGBoost through the Kaggle ALS dataset to enhance the classification accuracy of ALS and non-ALS.
To calculate the efficacy of the proposed model through performance metrics.
To evaluate the performance of the respective approach through internal comparison with conventional algorithms such as XGBoost, K-nearest neighbor (KNN), and RF.
Paper organization
The remainder of the paper is organized as follows. Existing research studies on the binary classification of ALS and non-ALS are analyzed in the Review of Literature section. The Proposed Methodology section presents the process of the proposed method. The outcomes attained by the employed method are presented in the Results and Discussion section. Finally, the conclusion and future work are presented in the Conclusion section.
REVIEW OF LITERATURE
Several ML- and DL-based techniques for the classification of ALS and non-ALS are analyzed in this section, and the problems identified in the existing research are also noted. One existing model was implemented using a linear support vector classifier for the classification of ALS. It used neuroimaging markers as age-corrected features, with a cohort consisting of 502 subjects; the data were taken from 404 patients with ALS and healthy controls (HC). The outcomes were further verified against a multilayer perceptron (MLP), and the experimental results signify the better performance of the existing model with better accuracy (Kocar et al., 2021). Similarly, a conventional method used a three-dimensional convolutional neural network (3D CNN) and convolutional long short-term memory (ConvLSTM) for the detection of neurodegenerative disorders (NDDs). Here, ConvLSTM combines the sequential-data handling of the LSTM technique with a CNN-based pattern detection system, and the 3D CNN operates on the output of ConvLSTM; the tensor generated by ConvLSTM has the shape of a 3D cell structure (Erdaş et al., 2021). Likewise, a classical model was implemented to predict ALS from images of patients' induced pluripotent stem cells (iPSCs) using a CNN-based framework. The experimental outcomes signify the effective performance of the system, with a receiver operating characteristic (ROC) value of 0.97 (Imamura et al., 2021). Another traditional system addressed the identification of NDDs such as ALS (Golini et al., 2020; Bjornevik et al., 2021; Neumann et al., 2021), HD, and PD from image data. For that purpose, a basic framework comprising three stages, namely preprocessing, classification, and feature transformation, was used, and the efficiency of the conventional model was demonstrated through the results (Lin et al., 2020).
Correspondingly, existing research has been designed for the detection of ALS through voice analysis, testing the stability of voice signals over the essential periods for the computation of perturbation measurements. The experimental outcomes state that the existing ALS detection attained better efficacy, with an accuracy of 86.7% (Vashkevich et al., 2019). Accordingly, a conventional system deployed an LSTM-based grid pattern approach to detect neurological diseases (Torres-Castillo et al., 2022; Bernhardt et al., 2023). Here, three distinct gait datasets were used, which hold recordings of vertical ground reaction force for diverse walking scenarios. L2 regularization and dropout techniques were applied to reduce overfitting, and a stochastic gradient optimizer was used to minimize the cost function. The multiclass classification results show that the conventional method attained effective performance, with an accuracy of 96.6% (Balaji et al., 2021). Likewise, a classical approach utilized a deep convolutional neural network (DCNN) and the BAT algorithm to classify ALS and myopathy. In this system, features are considered based on the time domain and the Wigner–Ville transformed time-frequency representation, taken from abnormal electromyography (EMG) signals; the BAT algorithm selects the extracted time-frequency features, which are then classified by the DCNN. The efficiency of the conventional model was signified by the results, with better computational time (Bakiya et al., 2023). Another traditional method implemented diverse feature optimization techniques, such as principal component analysis and a genetic algorithm, to classify neurological diseases, and further utilized diverse classification methods, including probabilistic, non-linear, and linear methods.
The input data comprise ALS (Black et al., 2015; Roy et al., 2020; Dodge et al., 2021), PD, and HD data extracted from a public database. The classification outcomes represent the better performance of the model (Aich et al., 2019).
Likewise, existing research has implemented an ML-based framework for classifying neurological diseases using a multidimensional analysis, with dimension reduction of the data performed through feature selection. Data were extracted from cohorts of over 231 patients, and the efficiency of the conventional model was depicted through the results (Gross et al., 2021). In the same way, a classical model was proposed for the detection of bulbar changes in ALS, processed through speech analysis on a cohort of patients; the experimental outcomes show that the existing model acquired better performance (Stegmann et al., 2020). Correspondingly, a traditional system designed a CNN-based architecture for the classification of NDDs (Chatterjee et al., 2019; French et al., 2019), such as ALS (Rahman et al., 2019), HD, and PD (Beyrami and Ghaderyan, 2020). It utilized the advantage of CNN with wavelet coherence spectrograms of gait synchronization to classify NDDs through gait force signals. Here, the features were extracted from an online gait database of NDDs. The existing method classifies NDDs through diverse gait patterns based on time-frequency gait force signals, achieving effective performance with an accuracy of 96.37% (Setiawan et al., 2022). Similarly, an existing method was deployed to detect C-terminal TDP-43 fragments in ALS. Highly sensitive mass spectrometry with parallel reaction monitoring was analyzed, with samples acquired from the Oxford Brain Bank. The efficiency of the traditional model was depicted through the outcomes (Feneberg et al., 2021).
Accordingly, classical research was implemented to detect NDDs through sparse coding and distance metrics, using symmetric features and a sparse non-negative least-squares classifier; the experimental results represented the better efficiency of the conventional system (Ghaderyan and Beyrami, 2020). Consequently, conventional research was designed for the molecular classification of ALS based on a CNN architecture. Here, the input image data are acquired by converting RNA expression values into pixels through the DeepInsight framework, and the pixels are mapped to genes. The existing model identified the genes that are linked with ALS, and the classification outcome depicted its better performance (Karim et al., 2021). Another traditional technique constructed a CNN-BiLSTM framework to classify ALS (Scialò et al., 2020), healthy controls, and PD from the raw speech waveform. It employed a data-driven method for learning from the raw speech waveform and used four diverse speech stimuli in the training and testing data. The experimental results show that the existing model accomplished better performance in the classification (Mallela et al., 2020).
Problem identification
Accuracy is a significant factor for evaluating model performance, as it depicts the proportion of accurately predicted results. Numerous research studies have attempted to achieve effective classification of ALS; however, they lack accuracy (Balaji et al., 2021; Vashkevich and Rushkevich, 2021; Sekar et al., 2022; Thome et al., 2022).
Some research focused on the detection of ALS, PD, and HD, whereas only limited research focused specifically on ALS classification (Aich et al., 2019; Setiawan et al., 2022).
In existing ALS detection research, symptoms of ALS have been used as the features for classification, whereas the primary cause, namely gene mutation, has been lacking in conventional research (Vashkevich et al., 2019; Mallela et al., 2020).
THE PROPOSED METHODOLOGY
In the current world, the severity of ALS is rising in many countries, affecting the lives of numerous people. To avoid the consequences of ALS and to reduce its severity, it is vital to identify the disease early. As classical screening is an expensive and time-consuming process, several research studies have focused on effective ALS screening through AI; however, these approaches still have limitations such as lack of accuracy, overfitting, and noise. To enhance ALS screening, the proposed research employs PEWFL-XGBoost to classify ALS and non-ALS using the Kaggle ALS dataset. Correspondingly, it is necessary to identify the primary cause of the disease for precise classification, as ALS is a highly hazardous neuron disease that weakens the lower and upper motor neurons. It starts with muscle weakness, spreads across the body, and worsens over time. The symptoms of ALS are depicted in Figure 1.
Primarily, ALS affects the nerve cells that control muscle movements in the body, called motor neurons. The motor neurons comprise two groups: upper motor neurons and lower motor neurons. The upper motor neurons extend from the brain to the spinal cord, whereas the lower motor neurons extend from the spinal cord to the muscles in the body. In ALS, both groups of motor neurons degenerate and die; as a result, they stop sending messages to the muscles, leading to muscle dysfunction. The main cause of ALS is genetic. Fundamentally, genes are a significant part of DNA, holding the instructions for producing the proteins needed by cells. A single neuron comprises 50 billion proteins serving significant purposes in the body. If the gene instructions differ from those needed to make healthy proteins, cells may produce defective proteins or too little protein, which can damage the cells and their DNA and lead to ALS. This is mainly due to changes in genes known as mutations. Moreover, antibodies designed against the specific protein α5 integrin can influence the disease's progression and hold potential as therapeutic agents. In addition, iPSC lines derived from ALS patients are tested with various therapies to gain insight into the molecular mechanism. Besides, plasmids are used to deliver oligonucleotides that target the genetic mutations, an approach familiar among ALS gene therapies. A gene change can arise on its own, or a mutated gene can be passed from parents to children. Figure 2 illustrates the causes of ALS.
To reduce the consequences of ALS, it is necessary to detect the disease early, which supports people in getting proper treatment and aids in slowing the development of the disease. Conventionally, blood tests are used to identify gene mutations associated with ALS, which is an agonizing, expensive, and time-intensive procedure. To resolve this issue, several existing researchers have attempted to attain effective prediction of ALS. Conversely, conventional methods have limitations such as lack of accuracy and speed, overfitting, and vanishing gradient problems. To solve these limitations, the proposed method employs PEWFL-XGBoost to classify ALS and non-ALS using the Kaggle ALS dataset. Figure 3 presents the illustrative framework of the respective model.

Schematic representation of the proposed work. Abbreviation: ALS, amyotrophic lateral sclerosis.
The proposed system utilizes the Kaggle ALS dataset to classify ALS through gene sequences. Various genes are related to ALS, such as LRRFIP1, THADA, RGS6, ZNF638, MMP23B, PLXNB1, RP3-377D14.1, USP34, and THADA.1. The respective approach utilizes the advantages of XGBoost and incorporates PEWFL into XGBoost to enhance the performance of ALS and non-ALS classification. To calculate the efficacy of the proposed model, performance metrics such as precision, ROC, F1-score, recall, and accuracy are used in the experiment. XGBoost is used because it trains ML models efficiently and is able to manage missing data and handle larger datasets, which is important for the ALS classification of gene sequences. PEWFL integrates the concepts of weighted cross-entropy and focal loss to improve the handling of imbalanced data; it is used to overcome the limitations of traditional focal loss and weighted cross-entropy, thereby improving the classification efficiency of XGBoost. The proposed ALS and non-ALS classification framework is presented in Figure 4.
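As a concrete illustration of the evaluation step, the metrics named above can be computed from the counts of true and false positives and negatives. This is a minimal pure-Python sketch; the labels below are illustrative, not values from the study, and the ROC computation is omitted for brevity.

```python
# Hedged sketch: accuracy, precision, recall, and F1-score for a binary
# ALS / non-ALS classification (1 = ALS, 0 = non-ALS). Toy labels only.

def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative labels, not results from the paper
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
m = binary_metrics(y_true, y_pred)
```

In practice a library such as scikit-learn would typically supply these metrics, including the ROC curve.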

Proposed structure of PEWFL-XGBoost model. Abbreviation: PEWFL, progressive entropy weighted focal loss.
Figure 4 depicts the respective model, which comprises several stages: dataset selection, normalization-based preprocessing, the training and testing split, classification with PEWFL-XGBoost, and the prediction phase. A brief description of each stage in the proposed research is presented below:
Dataset selection
The dataset for the projected system is gathered from the Kaggle website and is publicly available. For the assessment of the respective research, the Kaggle ALS dataset is utilized, as outlined in Table 8. The dataset includes data of both ALS and non-ALS subjects and comprises 45 samples, of which 30 are ALS samples and 15 are non-ALS samples. The inclusion criterion is gene samples predominantly related to ALS and non-ALS; no exclusion criteria were applied in the study. The genes in this dataset were selected based on the P value, and it contains both downregulated and upregulated genes. The official link of the Kaggle ALS dataset is as follows: https://www.kaggle.com/datasets/bhavithabairapureddy/c9orf72-ggggcc-expanded-repeats-ofalsgse68607 (Cooper-Knock et al., 2015).
Moreover, the most frequent genetic origin of familial ALS and frontotemporal dementia is the GGGGCC repeat expansion in the C9ORF72 gene, which is responsible for around 40% of familial ALS cases. These expansions result in disordered splicing, which is related to the severity of the disease. The increased repeats can cause the formation of RNA foci that trap RNA-binding proteins, causing disruptions in regular mRNA splicing and possibly overpowering cellular compensatory processes in the long run. This imbalance could play a role in the diverse characteristics and delayed onset commonly seen in patients. The pathogenesis is further complicated by the existence of toxic dipeptide repeat proteins generated by repeat-associated non-ATG translation. Moreover, studies have indicated a correlation between the length of the GGGGCC repeat expansion and changes in gene expression and splicing error rates, which impact the severity of ALS symptoms. Therefore, it is essential to comprehend these mechanisms in order to clarify the pathophysiology of C9ORF72-related diseases and create specific treatments.
Preprocessing
ML models generally use data preprocessing to clean, alter, and integrate the data to prepare them for classification. The main objective of data preprocessing is to enhance the quality of the data and make them more suitable for the classification task. In the respective research, before preprocessing, synthetic data are generated from the existing data. To attain this, the mean and standard deviation of each feature in the dataset are calculated, excluding the target column. For each feature, synthetic values are then produced by drawing random values from a distribution with the calculated mean and standard deviation. Subsequently, synthetic target values are generated by randomly sampling the original targets with replacement.
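The synthetic-sample step described above can be sketched as follows. The helper name make_synthetic and the toy feature values are illustrative, not from the study; only the per-feature Gaussian sampling and the resampling of targets with replacement follow the text.

```python
# Sketch: per-feature mean and standard deviation are estimated from the
# real data (target column excluded), new feature values are drawn from
# Gaussian distributions with those parameters, and synthetic targets are
# resampled from the original targets with replacement.
import random
import statistics

def make_synthetic(features, targets, n_new, seed=0):
    rng = random.Random(seed)
    n_cols = len(features[0])
    cols = [[row[j] for row in features] for j in range(n_cols)]
    mu = [statistics.mean(c) for c in cols]
    sd = [statistics.stdev(c) for c in cols]
    synth_x = [[rng.gauss(mu[j], sd[j]) for j in range(n_cols)]
               for _ in range(n_new)]
    synth_y = [rng.choice(targets) for _ in range(n_new)]  # with replacement
    return synth_x, synth_y

# Toy expression matrix: 3 samples, 2 gene features
x = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
y = [0, 1, 1]
sx, sy = make_synthetic(x, y, n_new=5)
```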
In the proposed method, normalization-based preprocessing is used to bring the dataset's features to a common scale, enhancing the performance and accuracy of the classification. The main purpose of the normalization technique is to remove potential distortions and biases caused by diverse scales among the features. Particularly, min-max scaler normalization is used in the respective model, which shrinks the feature values to a range between 0 and 1. It is processed by subtracting the minimum value of each feature and dividing by the feature range. The advantage of min-max scaling is that it preserves the relative distances and order of the data points.
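The min-max computation described above can be sketched per feature as follows; the toy expression values are illustrative only, and the zero-range guard is an added safety detail, not part of the original description.

```python
# Min-max normalization: subtract the feature minimum and divide by the
# feature range, mapping values into [0, 1]. Constant features are mapped
# to 0.0 to avoid division by zero (an assumption made for this sketch).
def min_max_scale(column):
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]

expression = [2.0, 4.0, 6.0, 10.0]   # toy gene-expression values
scaled = min_max_scale(expression)   # minimum maps to 0.0, maximum to 1.0
```

In practice scikit-learn's MinMaxScaler performs the same transformation column-wise.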
Data splitting
In ML, data splitting is used to avoid overfitting. Classically, ML uses the data splitting method to train models, where the training data are fed to the system to update the parameters during the training phase. After training, the test set is used to evaluate how well the proposed model handles new observations.
In the respective approach, the data are divided into two sets, training and testing, in the ratio of 80:20, indicating that 80% of the observations are used for training and 20% for testing. The training data are used to train the model, and the testing data are used to evaluate the model's performance.
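The 80:20 split described above can be sketched with the standard library as follows; in practice a scikit-learn train_test_split call would typically be used, and the data here are illustrative.

```python
# Sketch of a shuffled 80:20 train/test split. Indices are shuffled with a
# fixed seed for reproducibility; the first 80% become the training set.
import random

def train_test_split(samples, labels, test_ratio=0.2, seed=42):
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([samples[i] for i in train_idx], [samples[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

xs = [[float(i)] for i in range(10)]   # toy feature vectors
ys = [i % 2 for i in range(10)]        # toy binary labels
x_tr, x_te, y_tr, y_te = train_test_split(xs, ys)
```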
Classification
The proposed research employs ML-based PEWFL-XGBoost to enhance the classification results. The ML algorithm is trained on the gene sequences of the selected genes in the Kaggle ALS dataset. This section presents the details of the classification mechanism and the algorithms of the proposed approach. Besides, it describes the process and algorithms used for the internal comparison, namely XGBoost, KNN, and RF. The following section presents the classical XGBoost mechanism.
Conventional XGBoost classifier
Researchers have used XGBoost to resolve a variety of ML problems. It stands for extreme gradient boosting and was proposed by researchers at the University of Washington. It is a high-performance gradient boosting library used to efficiently train ML methods. Its significant features are the effective handling of larger datasets and the ability to handle missing values.
Figure 5 illustrates the structure of the traditional XGBoost algorithm. It comprises decision trees (DTs), which are formed consecutively. A substantial factor in the mechanism of XGBoost is the weight allocated to every variable. These weights are processed by the DT to identify the outcomes; if a variable's weight is identified wrongly by a tree, the variable is passed to another DT. XGBoost grows the number of DTs, where every tree tries to minimize the errors of the prior trees. The final prediction is based on the weighted sum of the individual tree predictions. The XGBoost approach is depicted in the following:
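The sequential error-correction idea described above can be illustrated with a small gradient-boosting loop over decision stumps. This is a didactic pure-Python sketch under squared loss, not the actual XGBoost implementation (which adds regularization and second-order gradients); all data and function names are illustrative.

```python
# Each stump is fitted to the residual errors left by the previous trees,
# and predictions are the (learning-rate weighted) sum over all stumps.
def fit_stump(x, r):
    # Pick the threshold minimizing the squared error of a two-leaf split.
    best = None
    for t in sorted(set(x)):
        left = [r[i] for i in range(len(x)) if x[i] <= t]
        right = [r[i] for i in range(len(x)) if x[i] > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - lm) ** 2 for v in left)
               + sum((v - rm) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm

def boost(x, y, n_trees=20, lr=0.5):
    pred = [0.0] * len(x)
    trees = []
    for _ in range(n_trees):
        residual = [y[i] - pred[i] for i in range(len(x))]
        stump = fit_stump(x, residual)      # fit the current errors
        trees.append(stump)
        pred = [pred[i] + lr * stump(x[i]) for i in range(len(x))]
    return lambda v: sum(lr * s(v) for s in trees)

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 1.0, 3.0, 3.0]
model = boost(x, y)
```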
Let input = {(x_i, y_i)} denote the training set, which includes m features and n samples (|input| = n, x_i ∈ R^m, y_i ∈ R). The additive function z of the tree ensemble model approximating the output is shown in Equation (1).
Here, the space of regression trees is signified as F and is depicted in Equation (2).
where q represents the tree structure, w depicts the weights of the leaf nodes, and T signifies the number of leaf nodes. The objective function of the algorithm is minimized to optimize the tree and reduce errors. The process involved in minimizing the objective function is shown in Equation (3).
A convex function is used to define the dissimilarity between the predicted and exact values. The convex function is represented as l, the measured value is depicted as y_i, and the predicted value is signified as y′_i. Iterations are used to reduce the errors, where t describes the iteration number. The complexity penalty factor for the regression tree is shown in Equation (4).
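Since the equation images are not reproduced here, the standard XGBoost formulation that Equations (1)–(4) describe can be written out as follows. This is a reconstruction from the surrounding definitions and the published XGBoost formulation, using the symbols of the text; the γ and λ penalty coefficients in Equation (4) are distinct from the focal-loss parameters used later.

```latex
% Eq. (1): additive tree-ensemble prediction over K regression trees
\hat{y}_i = z(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in F
% Eq. (2): space of regression trees with structure q and leaf weights w
F = \left\{ f(x) = w_{q(x)} \;\middle|\; q : \mathbb{R}^m \to \{1,\dots,T\},\; w \in \mathbb{R}^T \right\}
% Eq. (3): regularized objective minimized at iteration t
L^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)
% Eq. (4): complexity penalty of a tree
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \lVert w \rVert^{2}
```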
Nevertheless, although XGBoost is an effective classifier, it has a few drawbacks, such as overfitting, poor handling of smaller datasets, and the use of a large number of trees in the model. The major limitation of traditional XGBoost classification is class imbalance, which reduces classification accuracy. Hyperparameter tuning is another drawback of the classical XGBoost system.
Proposed PEWFL-XGBoost classifier
The proposed approach utilizes the PEWFL function to enhance the classification performance and overcome the limitations of the classical XGBoost. PEWFL is used to enhance performance on imbalanced datasets and to add robustness to noisy data. The proposed technique leads to better generalization to unseen data by encouraging the model to assign lower confidence to unreliable data. Based on their reliability levels, the proposed PEWFL permits flexibility in tuning and assigns importance across the dataset. The technique involves using PEWFL in conjunction with the XGBoost algorithm for the classification of ALS and non-ALS cases based on gene sequences. Figure 6 presents the classification process of PEWFL-XGBoost.

PEWFL-XGBoost classification method. Abbreviation: PEWFL, progressive entropy weighted focal loss.
In Figure 6, the classification mechanism comprises passing arguments, initializing values, PEWFL-based XGBoost, the alpha and gamma parameters, and the trained model. The process starts with passing the arguments, where each function argument becomes a variable with the assigned value. The values are then initialized to set the initial parameters of the network for training. In the weight initialization, PEWFL is applied to enhance the classification of the XGBoost model, with the alpha and gamma parameters used in the system. Finally, the proposed model is trained by running the input data through the algorithm and comparing the processed output against the sample output, which is used to evaluate the efficiency of the model. As a result, the accuracy of ALS classification is enhanced, while the approach avoids issues such as class imbalance, overfitting, and vanishing gradients that affect the traditional approach. By including PEWFL, PEWFL-XGBoost attains effective ALS classification, can handle smaller datasets, and handles noise, improving the reliability of the classification results.
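The weighting step described above can be sketched as follows, assuming a per-sample binary formulation; the function name pewfl_loss and the default alpha and gamma values are illustrative, not taken from the paper.

```python
# Hedged sketch of the PEWFL weighting: the focal scaling alpha*(1 - b)^gamma
# is replaced by the progressive factor alpha*(|y - b|)^gamma applied to the
# cross-entropy term. alpha and gamma are the tunable parameters passed to
# the classifier; defaults here are illustrative assumptions.
import math

def pewfl_loss(y, b, alpha=2.0, gamma=2.0, eps=1e-12):
    # binary cross-entropy for one sample (y is the label, b the probability)
    ce = -(y * math.log(b + eps) + (1 - y) * math.log(1 - b + eps))
    return alpha * abs(y - b) ** gamma * ce

# An easy sample (prediction close to the label) is down-weighted far more
# strongly than a hard, misclassified one.
easy = pewfl_loss(1.0, 0.95)
hard = pewfl_loss(1.0, 0.10)
```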
Figure 7 illustrates the architecture of the proposed PEWFL-XGBoost. To enhance the classification and resolve class imbalance, the respective approach tunes the loss function with a progressive weight factor and a modified focal loss. Traditionally, the challenges of focal loss and weighted cross-entropy include minimal weighting of the loss among the classes and the vanishing gradient, which prevent the system from acquiring higher accuracy. Focal loss is an effective technique for balancing the loss by scaling the loss used to classify the system's classes.
Nevertheless, it is inclined to the vanishing gradient during backpropagation. To resolve this problem, the respective approach uses a modified focal loss to enhance the accuracy of the classification; it modifies the loss-scaling technique of focal loss to be effective against the vanishing gradient problem. Moreover, a progressive weight factor is added in the proposed approach to weight the labels of negative classes and minimize the effect of the vanishing gradient. The process involved in PEWFL-XGBoost is described below:
Initially, the cross-entropy loss (loss_nce) calculates the deviation between the produced result b_n and the expected output for input n of the network, over the ith class among the total classes. It is shown in Equation (5).
In Equation (5), b_n = s(u_n), where u_i,n ∈ IR and b_i,n ∈ [0, 1]. The activation function of the classifier is depicted in Equation (6).
Correspondingly, to resolve the class imbalance problem, weighted cross-entropy is used as the loss function (loss_nce^weightCE) in the respective method. The loss function is represented in Equation (7).
The modified focal loss formulates the weighting factor as a dynamic value by using it as a function of the error between b_i,n and {y_i,n : y_i,n = 1}, as represented in Equation (8).
Equation (8) signifies the modified focal loss utilizing two extra scaling coefficients to control the amount of loss. The weighting factor (1 − b_i,n)^γ increases if γ ≥ 1: when the value of b_i,n is low, the weighting factor becomes large, which gradually increases the contribution of that sample to the overall loss. Conversely, the limitation of the modified focal loss appears as b_i,n → y_i,n: the gradient then becomes smaller than that of cross-entropy, leading to the vanishing gradient and reducing the system's performance. To resolve the vanishing gradient problem, a progressive weight factor is utilized in the proposed approach: the focal weight α(1 − b_i,n)^γ is modified to α(|y_i,n − b_i,n|)^γ, where α, γ ≥ 1. The progressive weight factor is applied to the cross-entropy loss, as represented in Equation (9). Adding the progressive weight factor results in a greater gradient and loss value than the original cross-entropy. The loss_nce value is compared with the focal loss, subject to {y_i,n, b_i,n : 0 ≤ y_i,n, b_i,n ≤ 1}.
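Based on the descriptions above, Equations (5)–(9) can be written out as follows. This is a reconstruction from the surrounding text, since the equation images are not reproduced here; C denotes the number of classes and w_i the class weight in the weighted cross-entropy.

```latex
% Eq. (5): cross-entropy for sample n over C classes
loss_{nce} = -\sum_{i=1}^{C} y_{i,n}\,\log b_{i,n}, \qquad b_n = s(u_n)
% Eq. (6): softmax activation of the classifier
b_{i,n} = \frac{e^{u_{i,n}}}{\sum_{j=1}^{C} e^{u_{j,n}}}
% Eq. (7): weighted cross-entropy with per-class weights w_i
loss_{nce}^{weightCE} = -\sum_{i=1}^{C} w_i\, y_{i,n}\,\log b_{i,n}
% Eq. (8): modified focal loss with scaling coefficients alpha and gamma
loss_{focal} = -\alpha \sum_{i=1}^{C} (1 - b_{i,n})^{\gamma}\, y_{i,n}\,\log b_{i,n}
% Eq. (9): PEWFL, the progressive weight factor replacing (1 - b_{i,n})
loss_{PEWFL} = -\alpha \sum_{i=1}^{C} \lvert y_{i,n} - b_{i,n}\rvert^{\gamma}\, y_{i,n}\,\log b_{i,n}
```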
The analysis shows that the modifications added to the focal loss enhance the classification performance and increase the accuracy by arresting the loss in class classification, while also improving the gradient condition. The algorithms used for the internal comparison are depicted in the following sections.
KNN
KNN is a supervised ML mechanism widely utilized in regression and classification problems. The algorithm finds the K nearest points to a particular observation and uses the average or the majority class of those neighbors as the predicted value. It is used in many applications, such as medicine, agriculture, and finance. Generally, it classifies unlabeled data by computing the distance between the unlabeled data point and all points in the dataset.
Further, it allocates the unlabeled data point to the class of the most similar labeled data by identifying patterns in the dataset. The most common distance measure in the algorithm is the Euclidean distance, where the distance between each test sample input = (input_1, input_2, …, input_n) and training data point x = (x_1, x_2, …, x_n) is computed through Equation (10).
Classification then proceeds by letting the K nearest neighbors vote: the class label assigned to the test sample is the one receiving the majority of votes among the K nearest neighbors, as depicted in Equation (11).
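A minimal sketch of Equations (10) and (11), Euclidean distance followed by majority voting, might look as follows; the function name and the toy data are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training points."""
    # Equation (10): Euclidean distance from x to every training point
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # indices of the k nearest neighbours
    nearest = np.argsort(d)[:k]
    # Equation (11): majority vote over the neighbours' class labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.0, 0.5])))  # -> 0
print(knn_predict(X, y, np.array([5.5, 5.0])))  # -> 1
```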
Here, xj denotes one of the K nearest neighbors of the test object in the training set, and E(xj, YK) indicates whether xj belongs to class YK. The weighted KNN, an enhanced version of the KNN algorithm based on class contribution, assigns a weight to every feature in the dataset, resulting in enhanced accuracy, as represented in Equation (12).
In Equation (12), pre_t is the average accuracy. Subsequently, the weight wi is obtained by normalizing the accuracy of the i-th feature dimension through Equation (13).
Moreover, the weight of every feature is used to compute the weighted Euclidean distance in Equation (14), where n represents the number of features.
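The weighted variant of Equations (13) and (14) can be sketched as below; the exact normalization used in the paper is not reproduced here, so the form of `normalize_weights` is an assumption.

```python
import numpy as np

def normalize_weights(per_feature_accuracy):
    """Normalize per-feature accuracies into weights (assumed form of Eq. 13)."""
    acc = np.asarray(per_feature_accuracy, dtype=float)
    return acc / acc.sum()

def weighted_euclidean(x, z, w):
    """Equation (14): Euclidean distance with each feature i scaled by w_i."""
    return np.sqrt(np.sum(w * (x - z) ** 2))

w = normalize_weights([3.0, 1.0])  # features weighted 0.75 / 0.25
d = weighted_euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0]), w)
print(w, d)
```

With all weights equal to 1 the plain Euclidean distance is recovered, so the weighting only rescales each feature's influence.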
The major advantages of KNN are its simple implementation, robustness to noise, and ability to handle complex patterns in the data. Correspondingly, the subsequent section describes the next algorithm used for the internal comparison.
Random forest
It is an ensemble learning technique widely used for regression and classification problems in ML. To enhance accuracy, the algorithm trains a large number of decision trees (DTs) and combines them for classification. Each DT is built from a random subset of features, and the final prediction is produced by majority voting or by averaging the predictions of the trees, which reduces the chance of overfitting. The major advantages of the RF algorithm are its ability to handle missing values and non-linear parameters and its robustness to noise. The pseudocode of the RF algorithm is shown in Pseudocode 1.
Pseudocode 1: Random forest
Input: training data of size N × P and the number of trees B
For every variable i ∈ P do
    For a = 1 to B:
        1. Draw a bootstrap sample b* of size M from the training data.
        2. Grow the random-forest tree Tree_a on roughly 2/3 of the sampled data.
        3. Classify the remaining 1/3 (the out-of-bag, OOB, samples) with Tree_a and compute the classification rate, namely accuracy_a.
        4. For variable i, permute its values, recompute the accuracy (accuracy_b), and subtract it from the original OOB accuracy (h_a = accuracy_a − accuracy_b); a larger drop indicates a more important variable.
    End for
    Aggregate the accuracy drops J_k over all trees and compute their variance:
        Ĵ = (1/B) Σ_{k=1..B} J_k and s_J^2 = (1/(B − 1)) Σ_{k=1..B} (J_k − Ĵ)^2
    Compute the importance of variable i: variable_i = Ĵ / s_J
End for
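The bootstrap/OOB/permutation steps of Pseudocode 1 correspond closely to what scikit-learn provides out of the box; the sketch below uses synthetic data in place of the gene-expression matrix (an assumption made purely for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the gene-expression data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Steps 1-3: each tree is grown on a bootstrap sample; oob_score_ is the
# accuracy estimated on the left-out (out-of-bag) rows.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
print(f"OOB accuracy: {rf.oob_score_:.3f}")

# Step 4: permute each variable and measure the resulting drop in accuracy.
imp = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
print("features ranked by importance:", ranking)
```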
Prediction phase
It is a technique commonly used by researchers to determine the efficiency of the proposed system. In the prediction phase, the algorithm is run on the test data, which reveals the performance of the proposed model. Finally, the system’s efficiency is calculated using performance metrics such as ROC, accuracy, F1-score, precision, and recall to evaluate the efficacy of the proposed model.
RESULTS AND DISCUSSION
This section briefly describes the outcomes attained by the proposed system in the classification of ALS and non-ALS. Further, exploratory data analysis (EDA), performance metrics, experimental outcomes, and comparative analysis of conventional methods are presented.
EDA
This section describes the data utilized in the proposed model, the Kaggle ALS dataset. EDA is used to inspect the data for better understanding.
Figure 8 presents the percentage and number of patients in the Kaggle ALS dataset. The pie chart represents the percentage of patients in the dataset, whereas the bar chart represents the number of patients. The pie chart shows that about 36% of the subjects have ALS and 64% are normal. Correspondingly, in the bar chart, 0 signifies patients with the disease, and 1 signifies patients without the disease. The data comprise 125 patients with the disease and 220 patients without the disease.

Visualization of patient counts in the Kaggle ALS dataset. Abbreviation: ALS, amyotrophic lateral sclerosis.
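The class distribution reported above can be reproduced from a label column; the column name `status` and its 0/1 coding are assumptions mirroring the reported counts.

```python
import pandas as pd

# Hypothetical label column: 0 = ALS (125 patients), 1 = non-ALS (220).
labels = pd.Series([0] * 125 + [1] * 220, name="status")
counts = labels.value_counts().sort_index()
share = (counts / counts.sum() * 100).round(1)
print(counts.to_dict())  # {0: 125, 1: 220}
print(share.to_dict())   # {0: 36.2, 1: 63.8}
```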
Figure 9 depicts the knowledge graph of the proposed research. Knowledge graphs are used in ML models to understand data integration. A knowledge graph comprises three major components: nodes, edges, and labels. The nodes signify the entities used to identify relationships in the data, the edges represent the similarity among the nodes, and the labels describe the relationship rules between edges and nodes.
Figure 10 presents the density plot of the gene expression. It is a graphical representation of the distribution of a numeric variable; the density plot, also called a kernel density estimate, depicts the probability density function. The genes used in the plot are THADA, RGS6, and LRRFIP1.
Figure 11 depicts the signal-to-noise ratio (SNR) pair plot of the data feature distribution, illustrating the relationships among the variables in the dataset. Such a plot is vital for understanding the data because it summarizes a large amount of data in a single figure.
Figure 12 presents the box plot of the utilized gene expression. It signifies the gene expression of every gene sample consecutive to data normalization.
Figure 13 illustrates the network graph of the proposed model. It is also called a node-link chart or link graph. It is used to understand, analyze, and visualize the relationship among the entities.
Correspondingly, blinding is a method that keeps the group labels of the analyzed data hidden from both the researchers and the patients. Hence, it minimizes bias and confirms the objectivity of the study.
Software and hardware configuration
The environmental configuration of obtaining the results for the proposed model is presented in Table 1, where hardware and software configurations used for implementing the results of the model are listed.
Performance metrics
The effectiveness of the proposed PEWFL-XGBoost is calculated with performance metrics such as F1-score, accuracy, recall, and precision.
F1-score: It is calculated as the harmonic mean of the precision and recall values. A higher F1-score indicates a more efficient classifier, as depicted in Equation (15):
Accuracy: It is stated as the ratio of correct identification in the system to complete system identification. The formula for accuracy is shown in Equation (16):
where TN is true negative, TP is true positive, FN is false negative, and FP is false positive.
Recall: It is the ratio of correctly identified positive outcomes to all actual positive outcomes. Recall, also called sensitivity, is represented by Equation (17):
where FN and TP are false negative and true positive, respectively.
Precision: Also called the positive predictive value, it is stated as the fraction of TPs to the sum of TPs and FPs and is given in Equation (18):
where FP is false positive and TP is true positive.
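Equations (15)–(18) reduce to simple arithmetic on the confusion-matrix counts; the helper below is a sketch (the function name is an assumption).

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Equation (16)
    precision = tp / (tp + fp)                          # Equation (18)
    recall = tp / (tp + fn)                             # Equation (17), sensitivity
    f1 = 2 * precision * recall / (precision + recall)  # Equation (15)
    return accuracy, precision, recall, f1

# Example with the confusion counts reported for PEWFL-XGBoost
# (TP = 39, FP = 1, FN = 1, TN = 63), consistent with the 0.98 overall scores.
print(classification_metrics(39, 1, 1, 63))
```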
Experimental results
This section describes the outcomes accomplished by the proposed research in classifying ALS and non-ALS with the Kaggle ALS dataset. Further, the results obtained in the internal comparison of the proposed PEWFL-XGBoost and conventional algorithms such as KNN, XGBoost, and RF are presented.
Figure 14 and Table 2 illustrate the results attained by the proposed research of the PEWFL-XGBoost system. Here, the outcomes accomplished for both ALS and non-ALS and the overall results are presented. The results of the classification for non-ALS attained an F1-score of 0.97, an accuracy rate of 0.98, a recall rate of 0.97, and a precision rate of 0.97. Correspondingly, the classification outcomes for ALS attained an F1-score of 0.98, an accuracy rate of 0.98, a recall rate of 0.98, and a precision rate of 0.98. Finally, the overall results of the classification state that the proposed model attained an F1-score of 0.98, an accuracy rate of 0.98, a recall rate of 0.98, and a precision rate of 0.98.

Visual representation of the PEWFL-XGBoost. Abbreviations: ALS, amyotrophic lateral sclerosis; PEWFL, progressive entropy weighted focal loss.
Performance of the proposed research PEWFL-XGBoost.
Model | Precision | Recall | F1-score | Accuracy |
---|---|---|---|---|
Non-ALS | 0.97 | 0.97 | 0.97 | 0.98 |
ALS | 0.98 | 0.98 | 0.98 | 0.98 |
PEWFL-XGBoost | 0.98 | 0.98 | 0.98 | 0.98 |
Abbreviations: ALS, amyotrophic lateral sclerosis; PEWFL, progressive entropy weighted focal loss.
Figure 15 and Table 3 depict the outcomes accomplished by the classical XGBoost system. Here, the results attained for both ALS and non-ALS and the overall outcomes are presented. The outcomes of the classification for non-ALS attained an F1-score of 0.72, an accuracy rate of 0.73, a recall rate of 0.72, and a precision rate of 0.73. Similarly, the results of the classification for ALS accomplished an F1-score of 0.72, an accuracy rate of 0.75, a recall rate of 0.84, and a precision rate of 0.75. Lastly, the overall results of the classification state that the classical XGBoost attained an F1-score of 0.75, an accuracy rate of 0.75, a recall rate of 0.71, and a precision rate of 0.72.

Illustration of the traditional XGBoost model. Abbreviation: ALS, amyotrophic lateral sclerosis.
Performance metrics of classical XGBoost.
Model | Precision | Recall | F1-score | Accuracy |
---|---|---|---|---|
Non-ALS | 0.73 | 0.72 | 0.72 | 0.73 |
ALS | 0.75 | 0.84 | 0.72 | 0.75 |
XG_Boost | 0.72 | 0.71 | 0.75 | 0.75 |
Abbreviation: ALS, amyotrophic lateral sclerosis.
Figure 16 and Table 4 show the outcomes accomplished by the RF system. Here, the outcomes accomplished for both ALS and non-ALS and the overall results are presented. The classification outcomes for non-ALS acquired an F1-score of 0.71, an accuracy rate of 0.71, a recall rate of 0.72, and a precision rate of 0.71. Likewise, the results of the classification for ALS attained an F1-score of 0.75, an accuracy rate of 0.72, a recall rate of 0.75, and a precision rate of 0.72. Finally, the overall outcomes of the classification state that the RF system attained an F1-score of 0.71, an accuracy rate of 0.74, a recall rate of 0.71, and a precision rate of 0.72.

Visual Overview of the RF model. Abbreviations: ALS, amyotrophic lateral sclerosis; RF, random forest.
Metrics of the RF technique.
Model | Precision | Recall | F1-score | Accuracy |
---|---|---|---|---|
Non-ALS | 0.71 | 0.72 | 0.71 | 0.71 |
ALS | 0.72 | 0.75 | 0.75 | 0.72 |
RF | 0.72 | 0.71 | 0.71 | 0.74 |
Abbreviations: ALS, amyotrophic lateral sclerosis; RF, random forest.
Figure 17 and Table 5 illustrate the results attained by the KNN model. Here, the results accomplished for both ALS and non-ALS and the overall outcomes are depicted. The classification results for non-ALS acquired an F1-score of 0.72, an accuracy rate of 0.71, a recall rate of 0.71, and a precision rate of 0.69. Similarly, the classification outcomes for ALS accomplished an F1-score of 0.72, an accuracy rate of 0.72, a recall rate of 0.89, and a precision rate of 0.73. Finally, the overall outcomes of the classification state that the KNN system acquired an F1-score of 0.72, an accuracy rate of 0.72, a recall rate of 0.72, and a precision rate of 0.73. Table 6 and Figure 18 present the overall outcomes of the internal comparison.

Graphical representation of the KNN model. Abbreviations: ALS, amyotrophic lateral sclerosis; KNN, K-nearest neighbor.
Performance of the KNN method.
Model | Precision | Recall | F1-score | Accuracy |
---|---|---|---|---|
Non-ALS | 0.69 | 0.71 | 0.72 | 0.71 |
ALS | 0.73 | 0.89 | 0.72 | 0.72 |
KNN | 0.73 | 0.72 | 0.72 | 0.72 |
Abbreviations: ALS, amyotrophic lateral sclerosis; KNN, K-nearest neighbor.
Internal comparison of the proposed and classical models.
Model | Precision | Recall | F1-score | Accuracy |
---|---|---|---|---|
XG_Boost | 0.54 | 0.59 | 0.53 | 0.59 |
RF | 0.51 | 0.61 | 0.48 | 0.61 |
KNN | 0.49 | 0.58 | 0.49 | 0.58 |
Proposed | 0.98 | 0.98 | 0.98 | 0.98 |
Abbreviations: ALS, amyotrophic lateral sclerosis; KNN, K-nearest neighbor; RF, random forest.

Comparative analysis of proposed against classical algorithms. Abbreviations: ALS, amyotrophic lateral sclerosis; KNN, K-nearest neighbor; RF, random forest.
In Figure 18 and Table 6, the efficiency of the classical algorithms with the proposed PEWFL-XGBoost is compared. From the conventional algorithms, a higher accuracy of 0.61 is attained by RF. The proposed system acquired an accuracy of 0.98, which is comparatively higher than that of the conventional algorithms, which reveals the effectiveness of the respective research.
Statistical report
The proposed model acquired better results than the other traditional algorithms. Moreover, the same dataset is examined to determine the stability of the model. The “LRRFIP1” variable is used for the investigation, and its statistical report is provided in Table 7.
Statistical report of the proposed model.
Test | Values |
---|---|
Shapiro test for group 0 | 0.20317941217179597 |
Shapiro test for group 1 | 0.19140789961680915 |
Levene’s test | 0.4135585674363751 |
t-Test results | |
t-statistic | 34.4585 |
P value | 5.9389e−33 |
Cohen’s d | 11.9870 |
χ2 test results | |
χ2 | 40.6125 |
P value | 1.8562e−10 |
Gene sequence and functions.
S. no. | Gene sequence | Functions | Gene mutation effects |
---|---|---|---|
1 | LRRFIP1 | It encodes a protein involved in the regulation of gene expression | Alzheimer’s disease and some neurodegenerative diseases |
2 | THADA | It encodes a protein involved in transcription and cell signaling | Cancer and some inflammatory diseases |
3 | RGS6 | It encodes a protein involved in the regulation of G-protein signaling | Cancer and heart diseases |
4 | ZNF638 | It encodes a protein involved in transcription and DNA repair | Inflammatory diseases and cancer |
5 | MMP23B | It encodes an enzyme involved in the breakdown of the extracellular matrix | Inflammatory diseases and cancer |
6 | PLXNB1 | It encodes a protein involved in synaptic transmission and cell signaling | Alzheimer’s disease and some neurodegenerative diseases |
7 | RP3-377D14.1 | It encodes a protein involved in the regulation of gene expression | Inflammatory diseases and cancer |
8 | USP34 | It encodes an enzyme involved in protein modification | Inflammatory diseases and cancer |
9 | THADA.1 | It encodes a variant of THADA | Inflammatory diseases and cancer |
10 | CYorf15A | It encodes a protein involved in cell growth | Inflammatory diseases and cancer |
11 | CHRNA6 | It encodes the alpha-6 subunit of the nicotinic acetylcholine receptor (nAChR) | Schizophrenia and Parkinson’s disease |
12 | THSD7B | It encodes a protein involved in the regulation of gene expression | Inflammatory diseases and cancer |
13 | THADA.2 | THADA variant gene | Inflammatory diseases and cancer |
14 | MTIF2 | It encodes a protein involved in the regulation of cell growth | Inflammatory diseases and cancer |
15 | FAM18B2 | It encodes a protein involved in the regulation of the immune system | Cancer and autoimmune disorders |
16 | AC092299.1 | Not characterized | — |
17 | AC069287.3 | Not characterized | — |
18 | ARMC10 | It encodes a protein involved in the regulation of the immune system | Cancer and autoimmune disorders |
20 | FAM123C | It encodes a protein involved in the regulation of the immune system | Cancer and autoimmune disorders |
21 | USP34.1 | USP34 variant | Inflammatory diseases and cancer |
22 | RP11-339I24.1 | Not characterized | — |
23 | ZKSCAN4 | It encodes a protein involved in the regulation of the immune system | Cancer and autoimmune disorders |
24 | LL22NC03-80A10.2 | It encodes a protein involved in cell growth | Inflammatory diseases and cancer |
25 | FAM123C.1 | FAM123C variant | Inflammatory diseases and cancer |
26 | RPL21 | It encodes a protein involved in the translation of RNA into protein | Muscular dystrophy and cancer |
27 | SR140 | It encodes a protein involved in the regulation of gene expression | Inflammatory diseases and cancer |
28 | MUC4 | It encodes a protein involved in mucus formation | Inflammatory diseases and cancer |
29 | EHBP1 | It encodes a protein involved in cell growth | Inflammatory diseases and cancer |
30 | FOXO4 | It encodes a protein involved in the regulation of cell growth | Inflammatory diseases and cancer |
31 | TRIM59 | It encodes a protein involved in the regulation of gene expression | Cancer and autoimmune disorders |
32 | MUC4.1 | MUC4 variant | Inflammatory diseases and cancer |
33 | MUC4.2 | MUC4 variant | Inflammatory diseases and cancer |
34 | USP39 | It encodes a protein involved in post-translational protein modification | Inflammatory diseases and cancer |
35 | LRRFIP1.1 | LRRFIP1 variant | Alzheimer’s disease and some neurodegenerative diseases |
36 | MMP23B.1 | MMP23B variant | Inflammatory diseases and cancer |
37 | GRM8 | It encodes a protein involved in taste perception | Autism and taste disorders |
38 | TRIM59.1 | TRIM59 variant | Autoimmune disorders and cancer |
Correspondingly, normality is evaluated using the Shapiro–Wilk test. The resulting P values are 0.203 for group 0 and 0.191 for group 1, both higher than the typical alpha level of 0.05. Hence, the null hypothesis of normality is not rejected, and the data in both groups are approximately normally distributed.
Similarly, Levene’s test of homogeneity of variance yields a P value of 0.414, which is higher than 0.05, so the hypothesis of identical variances among the groups holds. Additionally, a t-test is conducted to compare “LRRFIP1” between groups 0 and 1. The t-statistic is 34.46 with a P value of 5.94 × 10−33, which provides strong evidence against the null hypothesis. The effect size, given by Cohen’s d of 11.99, indicates a very large effect. Moreover, the χ2 test yields χ2 = 40.61 with a P value of 1.86 × 10−10. These findings indicate an association involving “LRRFIP1” and suggest that these variables deserve further consideration in future studies.
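The test battery above maps directly onto SciPy; the sketch below uses synthetic stand-ins for the “LRRFIP1” expression values (the group means and spreads are assumptions, chosen only to illustrate the calls).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group0 = rng.normal(loc=5.0, scale=0.5, size=60)  # assumed group-0 values
group1 = rng.normal(loc=8.0, scale=0.5, size=60)  # assumed group-1 values

# Shapiro-Wilk normality tests and Levene's test of equal variances
print("Shapiro p (group 0):", stats.shapiro(group0).pvalue)
print("Shapiro p (group 1):", stats.shapiro(group1).pvalue)
print("Levene p:", stats.levene(group0, group1).pvalue)

# Two-sample t-test and Cohen's d with a pooled standard deviation
t, p = stats.ttest_ind(group0, group1, equal_var=True)
pooled_sd = np.sqrt((group0.var(ddof=1) + group1.var(ddof=1)) / 2)
cohens_d = abs(group1.mean() - group0.mean()) / pooled_sd
print(f"t = {t:.2f}, p = {p:.2e}, Cohen's d = {cohens_d:.2f}")
```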
Moreover, power analysis is a statistical technique used to determine the minimal sample size needed to detect a statistically significant effect in a study. It helps researchers avoid underpowered studies that cannot detect a true effect and overpowered studies that waste available resources.
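A normal-approximation sketch of such a power calculation (not the paper's exact procedure) is shown below; the function name is an assumption.

```python
import math
from scipy.stats import norm

def sample_size_two_groups(d, alpha=0.05, power=0.8):
    """Approximate per-group sample size for a two-sample t-test
    with standardized effect size d (normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)  # critical value for two-sided alpha
    z_b = norm.ppf(power)          # quantile for the desired power
    return 2 * ((z_a + z_b) / d) ** 2

# A small effect (d = 0.2) needs hundreds of samples per group, while a
# very large effect like the reported Cohen's d of ~12 needs only a few.
print(math.ceil(sample_size_two_groups(0.2)))   # 393
print(math.ceil(sample_size_two_groups(12.0)))  # 1
```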
Performance analysis
This section projects and analyzes the performance of the proposed PEWFL-XGBoost, conventional XGBoost, RF model, and KNN system.
Figure 19 showcases both the confusion matrix and ROC curve for the KNN model, offering a detailed evaluation of its classification performance. The model’s confusion matrix displayed 3 TPs, 7 FPs, 37 FNs, and 57 TNs. These measurements demonstrate that the KNN model successfully identified certain positive instances but encountered difficulties with numerous FNs, indicating opportunities for enhancing sensitivity and overall accuracy. The ROC curve illustrates the performance of the model by demonstrating the trade-off between the true-positive rate and the false-positive rate across various thresholds. These results bring attention to both the advantages and disadvantages of the KNN model in classification tasks, offering insights into enhancing its capability to detect positive instances.

Performance with confusion matrix and ROC curve of the KNN classification. Abbreviations: KNN, K-nearest neighbor; ROC, receiver operating characteristic.
Figure 20 displays the confusion matrix and ROC curve for the RF method, providing a comprehensive evaluation of its classification performance. The accuracy of the model can be seen in the confusion matrix which includes 1 correct positive prediction, 2 incorrect positive predictions, 39 incorrect negative predictions, and 62 correct negative predictions. The values suggest that the RF model struggled to correctly detect positive instances, with few TPs and a relatively high FN rate. The ROC curve shows how effectively the model can balance sensitivity and specificity across different thresholds. Overall, these measurements offer valuable insights into the effectiveness of the RF method, highlighting areas requiring improvement for better classification outcomes in subsequent applications.

Performance assessment of confusion matrix and ROC curve of the RF model. Abbreviations: RF, random forest; ROC, receiver operating characteristic.
Figure 21 presents the confusion matrix and ROC curve for the conventional XGBoost model, providing a comprehensive assessment of its classification performance. The confusion matrix displays 7 TPs, 10 FPs, 33 FNs, and 54 TNs. These measurements indicate that while the model successfully identified certain positive cases, it also had a considerable number of FNs, suggesting issues with sensitivity. The ROC curve improves this analysis by visually displaying the trade-off between TP rate and FP rate across various threshold settings. Overall, these findings highlight the strengths and areas for improvement in the classification performance of the XGBoost model.

Evaluation of confusion matrix and ROC curve of the traditional XGBoost Model. Abbreviation: ROC, receiver operating characteristic.
Figure 22 presents the confusion matrix and ROC curve value of PEWFL-XGBoost. In the confusion matrix, the TP, FP, FN, and TN values are 39, 1, 1, and 63, respectively. Essentially, the confusion matrix is used to evaluate the classification performance. It is a significant parameter that presents the number of TP values, FP values, FN values, and TN values acquired in the classification. Similarly, the ROC curve is the graphical representation of the performance of the classification. The graph depicts the two significant parameters which are the FP rate and TP rate. Correspondingly, the overall analysis outcome shows that the proposed PEWFL-XGBoost accomplished a better outcome than the conventional algorithms.

Performance metrics of PEWFL-XGBoost model’s confusion matrix and ROC curve. Abbreviations: PEWFL, progressive entropy weighted focal loss; ROC, receiver operating characteristic.
Accordingly, the proposed model used PEWFL-XGBoost to classify ALS and non-ALS with the Kaggle ALS dataset. XGBoost is utilized for its speed and its capability to handle missing data. However, it has some limitations, such as hyperparameter tuning, handling of smaller datasets, and overfitting of data. To resolve these problems and enhance the classification performance, PEWFL is added to the XGBoost system. The outcomes show that the ALS and non-ALS classification of the proposed model attained an F1-score of 0.98, an accuracy rate of 0.98, a recall rate of 0.98, and a precision of 0.98. These results are higher than those acquired by conventional algorithms such as XGBoost, KNN, and RF. Therefore, the PEWFL-XGBoost system for the classification of ALS and non-ALS through the Kaggle ALS dataset achieved better outcomes with better accuracy, as verified by the results. Besides, the combined PEWFL-XGBoost system helps address the class imbalance, overfitting, and noise problems found in the classical methods.
Insights and discussion
The contribution of the proposed model and the dataset is illustrated in this section. Numerous research studies have focused on the detection of ALS through symptoms such as voice perturbation (Vashkevich et al., 2019; Stegmann et al., 2020) and behavioral screening (Tremolizzo et al., 2020). In order to reduce the disease severity, it is vital to detect the disease early by identifying the primary cause of ALS. Accordingly, the main cause of ALS is gene mutations, which can disturb the cells’ ability to make healthy proteins. As a result, neurons lack proteins, which leads to their degeneration and death; this death of motor neurons constitutes ALS disease. The functions controlled by these neurons, such as the muscle control required for walking and talking, are affected. The foremost symptoms of ALS are presented below:
Correspondingly, detecting the gene sequences associated with ALS is necessary to take precise treatment to reduce the consequences of the disease. For that purpose, the proposed research focused on classifying ALS and non-ALS through the Kaggle ALS dataset. This dataset comprises gene samples that are related to ALS and non-ALS.
Table 8 presents the gene sequences related and unrelated to ALS disease, together with their functions and mutation effects. The respective research utilized these gene sequences to classify ALS and non-ALS. Correspondingly, several researchers implemented the common classification of neurological diseases such as ALS, HD, and PD (Aich et al., 2019; Setiawan et al., 2022), while limited research focused on ALS-centered classification. The proposed research fills this gap with ALS-intensive research that classifies ALS and non-ALS. Moreover, accuracy is the primary factor in judging a model’s performance; several existing models attained effective results but lacked accuracy. The proposed classification, with its effective handling of the smaller dataset, reduced overfitting of data, and noise prevention, attains an accuracy of 0.98, which is higher than that of the conventional models. Besides, because the utilized dataset is insufficiently covered in existing research, the proposed model is analyzed against the classical algorithms; in this comparison, the respective method accomplished higher accuracy than the conventional algorithms, which reveals the efficacy of the proposed model. Correspondingly, the proposed system is planned to assist qualified doctors in providing effective ALS diagnosis, observing disease development, and planning treatment. Besides, it is envisioned to enhance the life quality of ALS patients. Overall, the proposed PEWFL-XGBoost improves the efficacy, accuracy, and speed of ALS classification while mitigating limitations of the classical methods, such as poor noise handling, overfitting, and class imbalance. It improves the understanding of the genetic factors that contribute to ALS through a molecular-based method for classification and helps mitigate the limitations of traditional ALS screening methods.
Case study discussion
Patient-specific diagnosis
Gene-level analysis with the proposed PEWFL-XGBoost identified dysregulation of genes encoding RNA-splicing proteins, with network dysfunction enriched across both the differentially expressed networks and the cell types. These genes overlapped significantly with an independently generated list of GGGGCC-repeat protein binding partners. At the exon level, splicing fidelity in lymphoblastoid cell lines was lower in C9ORF72-ALS patients than in non-C9ORF72 ALS patients. Patients with earlier disease progression had lower splicing consistency, and patients with shorter survival had a higher number of lymphoblastoid cells. In total, RNA was extracted from the lymphoblastoid cell lines of 56 ALS patients and 15 controls, and the C9ORF72 mutation was detected in 31 of the ALS patients.
CONCLUSION
The worldwide increasing rate of ALS affects the lives of an enormous number of people through excruciating issues such as muscle cramps, slurred speech, neuropathic pain, and breathing trouble. To reduce the effects of ALS, it is vital to detect the disease early; primary detection is essential for taking appropriate treatment to reduce the severity of the disease. Traditional blood test screening is a painful, expensive, and time-consuming procedure. To resolve this issue, numerous research studies have focused on the effective detection of ALS; nevertheless, they lacked accuracy and speed. Therefore, to achieve better detection of ALS, the proposed research employed PEWFL-XGBoost for the classification of ALS and non-ALS with the Kaggle ALS dataset. XGBoost was used for its speed and its ability to handle missing data. Although it is a strong classifier, it possesses some drawbacks, such as hyperparameter tuning, handling of smaller datasets, and overfitting of data. To resolve these limitations and improve the classification performance, PEWFL was added to the XGBoost system. Accordingly, the experimental results showed that PEWFL-XGBoost attained an F1-score of 0.98, an accuracy rate of 0.98, a recall rate of 0.98, and a precision rate of 0.98. Correspondingly, the outcomes of the internal comparison between the respective model and conventional algorithms such as KNN, XGBoost, and RF show the effective performance of the proposed model. Although the proposed model attained effective results, it has some limitations. Overfitting may occur because the model’s performance is influenced by both the amount and quality of data. ALS is complex and its effects on patients vary; while addressing class differences may help, it might not prevent mislabeling in unique cases. Additionally, healthcare professionals may have difficulty interpreting predictions because XGBoost is complex.
These elements underscore the significance of accurate validation in clinical applications. In the future, the proposed method can be extended to incorporate multi-omics data, providing a more comprehensive understanding of the molecular mechanisms underlying ALS, and to explore DL techniques for ALS classification using large-scale genomic data.