      Revolutionizing ALS Assessment: XGBoost Classification with Progressive Entropy Weighted-based Focal Loss on Gene Sequences


            Abstract

            In recent times, the global rise in the prevalence of amyotrophic lateral sclerosis (ALS) has profoundly affected the welfare of many people around the world. ALS is a lethal neurodegenerative disease (NDD) that damages the nerve cells in the brain and spinal cord and progressively removes a person’s ability to control muscle movements. It is necessary to detect the disease early to reduce its severity and to enhance the life expectancy of patients. Traditionally, ALS screening is handled by qualified physicians through blood tests, which is an expensive, painful, and time-consuming process. To resolve this limitation, several researchers have focused on ALS classification. However, existing approaches have a few drawbacks, such as lack of accuracy and speed, overfitting of data, and poor noise handling. To enhance the classification of ALS, the proposed approach employs progressive entropy weighted-based focal loss (PEWFL)-XGBoost on the Kaggle ALS dataset. XGBoost is used for its speed and its ability to manage missing data. Nevertheless, it has certain limitations, such as overfitting of data, hyperparameter tuning, and poor handling of smaller datasets. To resolve this, PEWFL is added to the XGBoost system to improve the classification performance. Correspondingly, the efficiency of the proposed system is calculated using performance metrics. Moreover, an internal comparison with classical algorithms such as XGBoost, K-nearest neighbor, and random forest discloses the efficacy of the respective model. The proposed system is envisioned to contribute to molecular genetics and neuroscience research and to assist neurologists in enhancing the diagnosis of ALS.


            INTRODUCTION

            Neurons (Ma, 2023), also called nerve cells, play a significant role in controlling human movement and activity, such as walking, talking, and self-care. They are the fundamental units of the nervous system, responsible for carrying information to all parts and organs of the body. Predominantly, motor neurons (Hug et al., 2023; Limone et al., 2023) are specialized nerve cells that connect the glands, muscles, and organs throughout the body. The degeneration and death of these motor neurons lead to amyotrophic lateral sclerosis (ALS) (Masrori and van Damme, 2020; Mead et al., 2023; Vidovic et al., 2023). In the past few years, there have been signs of increasing prevalence of ALS cases worldwide; in particular, Saudi Arabia, the United States, New Zealand, Uruguay, and several other countries are affected by the disease. In order to reduce the risk posed by ALS (Akçimen et al., 2023; Udine et al., 2023), it is necessary to detect the disease early so that appropriate treatment can be given. Besides, early detection reduces the disease’s severity and enhances the ALS survival rate. Classical screening of ALS includes a blood test to identify gene mutations in the body related to ALS (Akçimen et al., 2023; Suzuki et al., 2023). It is an expensive, time-intensive process and requires skilled doctors to perform the screening. To resolve this, researchers have focused on artificial intelligence (AI) (Nakamori et al., 2023; Segura et al., 2023; Tavazzi et al., 2023) for the detection and classification of ALS. This technology, ranging from machine learning (ML) to deep learning (DL), is the predominant technique in the healthcare industry for automating disease diagnosis. It reduces the risk of human error due to inaccuracy, enhances the speed, and minimizes the manpower required for disease screening.

            Correspondingly, numerous research studies have concentrated on ALS screening through ML and DL. Generally, the existing research concentrated on classifying neurological disorders such as ALS, Huntington’s disease (HD), and Parkinson’s disease (PD). Besides, in detecting ALS, most studies focused on detecting symptoms such as speech and behavioral changes. For instance, one conventional model deploys an ML-based method to classify ALS and is designed to identify the associated health impacts. The outcome of the existing model represents better performance with a higher accuracy of 93.28% (Sekar et al., 2022). Similarly, traditional research uses linear discriminant analysis (LDA) to classify phonation production in ALS patients. Here, acoustic features such as jitter, mel-frequency cepstral coefficients (MFCC), pitch period entropy (PPE), and formants are used. Several feature-selection algorithms were tested to identify the optimal feature subset for LDA. The result signifies that LASSO feature selection accomplished better performance, and LDA with five features attained a higher accuracy of 87.5% (Vashkevich and Rushkevich, 2021). Likewise, another conventional model projects the classification of ALS through an ML architecture. It used linear and non-linear features and combined measures of brain structure and function associated with ALS. In the classical system, multimodal and unimodal random forest (RF) classifiers were implemented to classify ALS and evaluated using five-fold cross-validation. From the results, it is signified that the unimodal classifier attained an accuracy of 61.66%, and the multimodal classifier acquired an accuracy of 66.82% (Thome et al., 2022). Nevertheless, although the existing research attained effective results, it has a few drawbacks such as lack of accuracy and speed, overfitting of data, and noise removal tasks.

            To resolve these limitations and enhance the screening of ALS, the proposed model employs progressive entropy weighted focal loss (PEWFL)-XGBoost to classify ALS and non-ALS. The respective research is carried out on the Kaggle ALS dataset, which comprises gene samples of both ALS and non-ALS. The proposed method comprises preprocessing, data splitting, classification, and prediction phases. Initially, the data are loaded into the system. Then, a preprocessing method based on normalization is used to prepare the dataset for classification, and the data are divided into training and testing sets. Next, classification is performed with PEWFL-XGBoost using the training data. In the prediction phase, the proposed model is run on the testing data to estimate its performance. Finally, the efficiency of the respective method is calculated using performance metrics to evaluate the model’s efficacy. Moreover, an internal comparison of the conventional algorithms with the proposed model is carried out to show the effectiveness of the respective approach.

            The major contributions of the proposed method are as follows:

            • To employ PEWFL-XGBoost through the Kaggle ALS dataset to enhance the classification accuracy of ALS and non-ALS.

            • To calculate the efficacy of the proposed model through performance metrics.

            • To evaluate the performance of the respective approach through internal comparison with conventional algorithms such as XGBoost, K-nearest neighbor (KNN), and RF.

            Paper organization

            The paper is organized based on the effectual approaches applied in the binary classification of ALS and non-ALS by analyzing the existing research studies discussed in the Review of Literature section. The Proposed Methodology section presents the process of the proposed method. Further, the outcomes attained by the employed method are depicted in the Results and Discussion section. Finally, the conclusion with the future work of the respective model is presented in the Conclusion section.

            REVIEW OF LITERATURE

            Several ML- and DL-based techniques for the classification of ALS and non-ALS are analyzed in this section, and the problems identified in the existing research are also mentioned. One existing model has been implemented using a linear support vector classifier for the classification of ALS. It used neuroimaging markers as age-corrected features for classification in a cohort of 502 subjects; the data were taken from 404 patients with ALS and healthy controls (HC). It was further verified against a multilayer perceptron (MLP). The experimental result signifies the better performance of the existing model with better accuracy (Kocar et al., 2021). Similarly, a conventional method used a three-dimensional convolutional neural network (3D CNN) and convolutional long short-term memory (ConvLSTM) for the detection of neurodegenerative disorders. Here, ConvLSTM combines the sequential-data handling of the LSTM technique with a CNN-based pattern detection system, and the 3D CNN operates on the output of ConvLSTM; the tensor generated by ConvLSTM has the shape of a 3D cell structure (Erdaş et al., 2021). Similarly, a classical model has been implemented for predicting ALS through images of patients’ induced pluripotent stem cells (iPSCs), carried out through a CNN-based framework. The experimental outcomes signify the effective performance of the system with a receiver operating characteristic (ROC) value of 0.97 (Imamura et al., 2021). Another traditional system has addressed the identification of ALS (Golini et al., 2020; Bjornevik et al., 2021; Neumann et al., 2021), HC, and PD through image data. For that purpose, a basic framework comprising three stages, preprocessing, classification, and feature transformation, has been used in the existing research. The better efficiency of the conventional model has been identified through the results (Lin et al., 2020).

            Correspondingly, existing research has been designed for the detection of ALS through voice analysis. It tested the stability of sustained voice signals over the essential periods for the computation of perturbation measurements. From the experimental outcomes, it is stated that better efficacy has been attained by the existing ALS detection with an accuracy of 86.7% (Vashkevich et al., 2019). Accordingly, a conventional system has deployed an LSTM-based grid pattern approach to detect neurological diseases (Torres-Castillo et al., 2022; Bernhardt et al., 2023). Here, three distinct gait datasets have been used, which hold recordings of vertical ground reaction force for diverse walking scenarios. Further, L2 regularization and dropout techniques have been used to reduce the overfitting of data, and a stochastic gradient optimizer has been used for minimizing the cost function. From the multiclass classification results, it is shown that the conventional method attained effective performance with an accuracy of 96.6% (Balaji et al., 2021). Likewise, a classical approach has utilized the advantages of a deep convolutional neural network (DCNN) and the BAT algorithm to classify ALS and myopathy. In this system, the features are based on the time domain and the Wigner–Ville transformed time frequency, taken from abnormal electromyography (EMG) signals. The DCNN has been used to select among the time-frequency features extracted by the BAT algorithm. The efficiency of the conventional model has been signified through the results with better computational time (Bakiya et al., 2023). Another traditional method has implemented diverse feature optimization techniques, such as principal component analysis and the genetic algorithm, to classify neurological diseases, and has utilized diverse classification methods such as probabilistic, non-linear, and linear methods. The input data comprise ALS (Black et al., 2015; Roy et al., 2020; Dodge et al., 2021), PD, and HD data extracted from a public database. The outcome of the classification represents the better performance of the model (Aich et al., 2019).

            Likewise, existing research has implemented an ML-based framework for classifying neurological diseases using a multidimensional analysis. Besides, the dimensionality of the data is reduced through feature selection. The data were extracted from cohorts of over 231 patients. The efficiency of the conventional model has been depicted through the results (Gross et al., 2021). In the same way, a classical model has been proposed for the detection of bulbar changes in ALS. It is processed through speech analysis with a cohort of patients. From the experimental outcomes, it is represented that the existing model acquired better performance (Stegmann et al., 2020). Correspondingly, a traditional system has designed a CNN-based architecture for the classification of NDDs (Chatterjee et al., 2019; French et al., 2019), such as ALS (Rahman et al., 2019), HD, and PD (Beyrami and Ghaderyan, 2020). It has utilized the advantage of CNN with the gain synchronization of the wavelet coherence spectrogram to classify NDDs through gait force signals. Here, the features have been extracted from an online gait database of NDDs. The existing method classifies the NDDs through diverse gait patterns based on the time-frequency gait force signals, achieving effective performance with an accuracy of 96.37% (Setiawan et al., 2022). Similarly, an existing method has been deployed to detect C-terminal TDP-43 fragments in ALS. A highly sensitive mass spectrometry approach based on parallel reaction monitoring has been analyzed, with samples acquired from the Oxford brain bank. The efficiency of the traditional model has been depicted through the outcomes (Feneberg et al., 2021).

            Accordingly, classical research has been implemented to detect NDDs through sparse coding and distance metrics. It has used symmetric features and a sparse non-negative least-squares classifier. The results of the experiment represented the better efficiency of the conventional system (Ghaderyan and Beyrami, 2020). Consequently, conventional research has been designed for the molecular classification of ALS based on a CNN architecture. Here, the input image data are acquired by converting RNA expression values into pixels through the DeepInsight framework, and the pixels are then mapped to the genes. The existing model has identified the genes that are linked with ALS. The outcome of the classification depicted the better performance of the existing model (Karim et al., 2021). A traditional technique has constructed a CNN-BiLSTM framework to classify ALS (Scialò et al., 2020), healthy controls, and PD from the raw speech waveform. It employed a data-driven method for learning from the raw speech waveform. Besides, the conventional method used four diverse speech stimuli in the training and testing data. From the experimental results, it is represented that the existing model accomplished better performance in the classification (Mallela et al., 2020).

            Problem identification

            THE PROPOSED METHODOLOGY

            In the current world, the severity of ALS is rising in many countries, affecting the lives of numerous people. To avoid the consequences of ALS and to reduce the disease severity, it is vital to identify the disease early. As classical screening is an expensive and time-consuming process, several research studies have focused on effective ALS screening through AI; however, these approaches also have limitations such as lack of accuracy, overfitting of data, and noise. To enhance ALS screening, the proposed research employs PEWFL-XGBoost to classify ALS and non-ALS using the Kaggle ALS dataset. Correspondingly, it is necessary to identify the primary cause of the disease for precise classification, as ALS is a highly hazardous motor neuron disease that weakens the lower and upper motor neurons. It starts with muscle weakness, spreads across the body, and worsens over time. The symptoms of ALS are depicted in Figure 1.

            Figure 1:

            Common symptoms of ALS. Abbreviation: ALS, amyotrophic lateral sclerosis.

            Primarily, ALS affects the nerve cells that control muscle movements in the body, called the motor neurons. The motor neurons comprise two groups: upper motor neurons and lower motor neurons. The upper motor neurons extend from the brain to the spinal cord, whereas the lower motor neurons extend from the spinal cord to the muscles of the body. In ALS, both groups of motor neurons degenerate and die. As a result, they stop sending messages to the muscles, leading to muscle dysfunction. The main cause of ALS is genetic factors. Fundamentally, genes are a significant part of DNA, which hold the instructions for producing the proteins for the cells. A single neuron comprises 50 billion proteins serving significant purposes in the body. If the gene instructions differ from those needed to make healthy proteins, cells may produce defective proteins or too little protein. This can affect the cells, damage the DNA, and lead to ALS. Such changes in a gene are known as mutations. Moreover, antibodies designed against the specific protein α5 integrin can influence the disease’s progression and hold potential as therapeutic agents. In addition, iPSC cell lines derived from ALS patients are tested with various therapies to gain insight into the molecular mechanism. Besides, plasmids are used to deliver oligonucleotides that target the genetic mutations, a familiar approach among ALS gene therapies. A change in a gene can arise on its own, or a mutated gene can be passed from parents to children. Figure 2 illustrates the causes of ALS.

            Figure 2:

            Identifying the origins of ALS. Abbreviation: ALS, amyotrophic lateral sclerosis.

            In order to reduce the consequences of ALS, it is necessary to detect the disease early to support people in getting proper treatment and aid in slowing the development of the disease. Conventionally, blood tests are used to identify gene mutations associated with ALS. It is an agonizing, expensive, and time-intensive procedure. To resolve the issue, several existing researchers attempted to attain effective prediction of ALS. Conversely, conventional methods have some limitations such as lack of accuracy and speed, overfitting and vanishing gradient problems. To solve the limitations, the proposed method employs PEWFL-XGBoost to classify ALS and non-ALS through the Kaggle ALS dataset. Figure 3 presents the illustrative framework of the respective model.

            Figure 3:

            Schematic representation of the proposed work. Abbreviation: ALS, amyotrophic lateral sclerosis.

            The proposed system utilizes the Kaggle ALS dataset to classify ALS through gene sequences. There are various genes related to ALS, such as LRRFIP1, THADA, RGS6, ZNF638, MMP23B, PLXNB1, RP3-377D14.1, USP34, and THADA.1. The respective approach utilizes the advantages of XGBoost and incorporates PEWFL into XGBoost to enhance the performance of ALS and non-ALS classification. To calculate the efficacy of the proposed model, performance metrics such as precision, ROC, F1-score, recall, and accuracy are used in the projected experiment. The proposed method uses XGBoost because it trains ML models efficiently, manages missing data, and handles larger datasets, which is important for classifying ALS from gene sequences. PEWFL integrates the concepts of weighted cross-entropy and focal loss to improve the handling of imbalanced data; the efficiency of classification in XGBoost is improved by using the PEWFL technique, which overcomes the limitations of traditional focal loss and weighted cross-entropy. The proposed ALS and non-ALS classification framework is presented in Figure 4.

            Figure 4:

            Proposed structure of PEWFL-XGBoost model. Abbreviation: PEWFL, progressive entropy weighted focal loss.

            Figure 4 depicts the respective model which comprises several stages such as selection of dataset, normalization-based preprocessing method, training and testing split, classification with PEWFL-XGBoost, and prediction phase. A brief description of each stage in the proposed research is presented below:

            Dataset selection

            The dataset for the projected system is gathered from the Kaggle website and is publicly available. For the assessment of the respective research, the Kaggle ALS dataset is utilized in this system, with its genes outlined in Table 8. The dataset includes data for both ALS and non-ALS. It comprises 45 samples, of which 30 are ALS samples and 15 are non-ALS samples. The inclusion criterion is gene samples predominantly related to ALS and non-ALS; no exclusion criteria are applied in the study. The genes in this dataset are selected based on the P value, and the set contains both downregulated and upregulated genes. The official link of the Kaggle ALS dataset is as follows: https://www.kaggle.com/datasets/bhavithabairapureddy/c9orf72-ggggcc-expanded-repeats-ofalsgse68607 (Cooper-Knock et al., 2015).

            Moreover, the most frequent genetic origin of familial ALS and frontotemporal dementia is the GGGGCC repeat expansion in the C9ORF72 gene, which is responsible for around 40% of familial ALS cases. These expansions result in disordered splicing, which is related to the severity of the disease. The increased repeats can cause the formation of RNA foci that trap RNA-binding proteins, which disrupts regular mRNA splicing and may overpower cellular compensatory processes in the long run. This imbalance could play a role in the diverse characteristics and delayed onset commonly seen in patients. The pathogenesis is further complicated by the existence of toxic dipeptide repeat proteins generated by repeat-associated non-ATG translation. Moreover, studies have indicated a correlation between the length of the GGGGCC repeat expansion and changes in gene expression and splicing error rates, which impact the severity of ALS symptoms. Therefore, it is essential to comprehend these mechanisms in order to clarify the pathophysiology of C9ORF72-related diseases and create specific treatments.

            Preprocessing

            ML models generally use data preprocessing to clean, transform, and integrate the data to prepare them for classification. The main objective of data preprocessing is to enhance the quality of the data and make them more suitable for the classification task. Initially, in the respective research, before preprocessing, synthetic data are generated from the existing data. To attain this, the mean and standard deviation of each feature in the dataset are calculated, excluding the target column. Then, for each feature, synthetic values are produced by drawing random values from a distribution parameterized by the calculated mean and standard deviation. Subsequently, synthetic target values are generated by randomly sampling the original targets with replacement.

            In the proposed method, normalization-based preprocessing is used to bring the dataset’s features to a common scale, enhancing the performance and accuracy of the classification. The main purpose of the normalization technique is to remove the potential distortions and biases caused by features with diverse scales. Particularly, min-max scaler normalization is used in the respective model, which shrinks the feature values to a range between 0 and 1. It is computed by subtracting the minimum value of each feature and dividing by the feature’s range. The advantage of min-max scaling is that it preserves the distances and order of the data points.
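
            The following minimal sketch illustrates the synthetic-data generation and min-max normalization steps described above. The column name "target" and the use of pandas/scikit-learn are assumptions made for illustration; the actual column names in the Kaggle ALS dataset may differ.

            import numpy as np
            import pandas as pd
            from sklearn.preprocessing import MinMaxScaler

            TARGET = "target"  # hypothetical label column; adjust to the real dataset

            def augment_with_synthetic(df, n_synthetic, seed=42):
                # Draw synthetic feature values from a normal distribution whose mean and
                # standard deviation are estimated per feature (target column excluded).
                rng = np.random.default_rng(seed)
                features = df.drop(columns=[TARGET])
                synth = pd.DataFrame({
                    col: rng.normal(features[col].mean(), features[col].std(), n_synthetic)
                    for col in features.columns
                })
                # Synthetic targets are resampled with replacement from the original labels.
                synth[TARGET] = rng.choice(df[TARGET].to_numpy(), size=n_synthetic, replace=True)
                return pd.concat([df, synth], ignore_index=True)

            def normalize_min_max(df):
                # Min-max scale every feature to [0, 1]; the target column is left untouched.
                scaled = df.copy()
                cols = [c for c in df.columns if c != TARGET]
                scaled[cols] = MinMaxScaler().fit_transform(df[cols])
                return scaled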

            Data splitting

            In ML, data splitting is used to avoid overfitting. Classically, ML uses the data splitting method to train models, where the training data are fed to the system to update the parameters during the training phase. After training, the test set is used to evaluate how the proposed model handles new observations.

            In the respective approach, the data are divided into two sets, the training and testing sets, in the ratio of 80:20. That is, 80% of the observations are used for training and 20% are used for testing. The training data are used to train the model, and the testing data are used to evaluate the model’s performance.
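
            A minimal sketch of the 80:20 split, assuming the preprocessed DataFrame data and the hypothetical target column from the previous snippet; the stratify option (keeping the class ratio similar in both partitions) and the random seed are illustrative choices, not settings stated in the text.

            from sklearn.model_selection import train_test_split

            # 80:20 hold-out split of the preprocessed data.
            X = data.drop(columns=[TARGET])
            y = data[TARGET]
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=0.20, random_state=42, stratify=y
            )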

            Classification

            The proposed research employs ML-based PEWFL-XGBoost to enhance the classification results. The ML algorithm is trained on the gene sequence of the selected genes in the Kaggle ALS dataset. This section presents the details of the classification mechanism and algorithms of the proposed approach. Besides, it signifies the process and algorithms used for the internal comparison, such as XGBoost, KNN, and RF. The following section presents the classical XGBoost mechanism.

            Conventional XGBoost classifier

            Researchers have used XGBoost to resolve a variety of ML problems. It stands for extreme gradient boosting and was proposed by researchers at the University of Washington. It is an optimized gradient boosting library used to train ML methods efficiently. The significant features of this algorithm are the effective handling of larger datasets and the ability to handle missing values.

            Figure 5 illustrates the structure of the traditional XGBoost algorithm. It comprises decision trees (DTs) that are built consecutively. A substantial factor in the XGBoost mechanism is the weight allocated to every variable. The weighted variables are fed to a DT to produce an outcome, and variables that the tree predicts wrongly are passed, with updated weights, to the next DT. XGBoost thus grows the number of DTs, where every new tree tries to minimize the errors of the prior trees. The final prediction is based on the weighted sum of the individual tree predictions. The XGBoost approach is described in the following:

            Figure 5:

            Flow chart of conventional XGBoost process.

            Let the input dataset be $\mathrm{input} = \{(x_i, y_i)\}$ with $n$ samples and $m$ features ($|\mathrm{input}| = n$, $x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$). The additive function of the tree ensemble model approximating the output is shown in Equation (1).

            (1) $\hat{y}_i = \phi(x_i) = \sum_{z=1}^{Z} f_z(x_i), \quad f_z \in F$

            Here, $F$ signifies the space of regression trees, which is depicted in Equation (2).

            (2) $F = \{\, f(x) = w_{q(x)} \,\} \quad (q: \mathbb{R}^m \rightarrow T,\; w \in \mathbb{R}^T),$

            where q represents the tree’s structure, w depicts the weight of leaf nodes, and T signifies the number of leaf nodes. The objective function of the algorithm is reduced to optimize the tree and minimize errors. The process involved in the reduction of the objective function is shown in Equation (3).

            (3) $L^{(t)} = \sum_{i=1}^{n} l\!\left(y_i,\; \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)$

            The convex function $l$ defines the dissimilarity between the calculated and exact values: the measured value is depicted as $y_i$ and the predicted value as $\hat{y}_i$. Iterations are used to reduce the errors, where $t$ denotes the iteration number. The complexity penalty for the regression tree is shown in Equation (4).

            (4) $\Omega(f_k) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2}$

            Nevertheless, although XGBoost is an effective classifier, it has a few drawbacks, such as overfitting, poor handling of smaller datasets, and the use of an enormous number of trees in the model. Besides, a major limitation of traditional XGBoost classification is class imbalance, which reduces the accuracy of the classification. Also, hyperparameter tuning is one of the main drawbacks of the classical XGBoost system.
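
            For reference, a minimal baseline using the conventional XGBoost classifier on the split produced earlier; the hyperparameter values shown are illustrative defaults, not settings reported in this work.

            from xgboost import XGBClassifier

            # Conventional XGBoost baseline used in the internal comparison.
            baseline = XGBClassifier(
                n_estimators=200,
                max_depth=4,
                learning_rate=0.1,
                eval_metric="logloss",
            )
            baseline.fit(X_train, y_train)
            print("Baseline test accuracy:", baseline.score(X_test, y_test))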

            Proposed PEWFL-XGBoost classifier

            The proposed approach utilizes the PEWFL function to enhance the classification performance and overcome the limitations of the classical XGBoost. PEWFL is used to enhance the performance on imbalanced datasets and to provide robustness to noisy data. The proposed technique leads to better generalization to unseen data by encouraging the model to assign lower confidence to unreliable samples. Based on their reliability levels, the proposed PEWFL permits flexibility in tuning and assigns importance within the dataset. This technique involves the utilization of PEWFL in conjunction with the XGBoost algorithm for the classification of ALS and non-ALS cases based on gene sequences. Figure 6 presents the classification process of PEWFL-XGBoost.

            Figure 6:

            PEWFL-XGBoost classification method. Abbreviation: PEWFL, progressive entropy weighted focal loss.

            Figure 6 depicts the classification mechanism, which comprises passing the arguments, initializing the values, PEWFL-based XGBoost, the alpha and gamma parameters, and the trained model. Primarily, the classification process starts with passing the arguments, where each function argument becomes a variable of the assigned value. Further, the values are initialized to set the initial parameters of the network for training the system. In the weight initialization, PEWFL is applied to enhance the classification method of the XGBoost model, and the alpha and gamma parameters are used in the system. Finally, the proposed model is trained by running the input data through the algorithm and comparing the processed output against the sample output, which is used to evaluate the efficiency of the model. Moreover, the accuracy of ALS classification is enhanced while, unlike the traditional approach, it is not affected by issues such as class imbalance, overfitting, and vanishing gradients. Similarly, by including PEWFL, PEWFL-XGBoost can attain effective ALS classification and can handle smaller datasets. Besides, noise is handled better, improving the reliability of the classification results.

            Figure 7 illustrates the architecture of the proposed PEWFL-XGBoost. To enhance the classification and resolve class imbalance, the respective approach tunes the loss function with a progressive weight factor and a modified focal loss. Traditionally, challenges in focal loss and weighted cross-entropy include minimal weighting of the loss among the classes and the vanishing gradient, which prevent the system from acquiring higher accuracy. Focal loss is an effective technique for balancing the loss by increasing the loss contribution of hard-to-classify classes.

            Figure 7:

            Framework of PEWFL-XGBoost. Abbreviation: PEWFL, progressive entropy weighted focal loss.

            Nevertheless, it is prone to the vanishing gradient problem in the process of backpropagation. To resolve this problem, the respective approach uses a modified focal loss to enhance the accuracy of the classification. It modifies the loss-scaling technique of focal loss to be effective against the vanishing gradient problem. Moreover, a progressive weight factor is added in the proposed approach to constrain the labels of the negative classes and minimize the effect of the vanishing gradient. The process involved in PEWFL-XGBoost is described below:

            Initially, the cross-entropy loss ($\mathrm{Loss}^{\mathrm{CE}}_{n}$) calculates the deviation between the produced result $b_n$ and the expected output for input $n$ of the network, summed over the $i$th class of the $c$ total classes. It is shown in Equation (5).

            (5) $\mathrm{Loss}^{\mathrm{CE}}_{n} = -\,y_n^{T}\log b_n = -\sum_{i}^{c} y_{i,n}\,\log(b_{i,n})$

            In Equation (5), $b_n = s(u_n)$, with $u_{i,n} \in \mathbb{R}$ and $b_{i,n} \in [0,1]$. Here, the activation function of the classifier is depicted in Equation (6).

            (6) $b_{i,n} = s(u_{i,n}) = \dfrac{e^{u_{i,n}}}{\sum_{w} e^{u_{w,n}}}$

            Correspondingly, to resolve the class imbalance problem, weighted cross-entropy is used as the loss function $\mathrm{Loss}^{\mathrm{wCE}}_{n}$ in the respective method. The loss function is represented in Equation (7).

            (7) $\mathrm{Loss}^{\mathrm{wCE}}_{n} = -\sum_{i}^{c} \omega_{i}\, y_{i,n}\,\log(b_{i,n})$

            The modified focal loss turns the weighting factor into a dynamic value by making it a function of the error between $b_{i,n}$ and $\{y_{i,n} : y_{i,n} = 1\}$, as represented in Equation (8).

            (8) $\mathrm{Loss}^{\mathrm{FL}}_{n} = -\,\alpha \sum_{i}^{c} (1 - b_{i,n})^{\gamma}\, y_{i,n}\,\log(b_{i,n})$

            Equation (8) signifies the modified focal loss, which uses two extra scaling coefficients to control the magnitude of the loss. Besides, the weighting factor $(1 - b_{i,n})^{\gamma}$ grows when $\gamma \geq 1$ and $b_{i,n}$ is low; hence, a lower $b_{i,n}$ produces a larger weight, which gradually increases that sample’s contribution to the loss. Conversely, the limitation of the modified focal loss arises as $b_{i,n}$ approaches $y_{i,n}$: the gradient becomes smaller than that of cross-entropy, which leads to the vanishing gradient and lowers the system’s performance. To resolve the vanishing gradient problem, a progressive weight factor is utilized in the proposed approach. The weight factor of the focal loss, $\alpha(1 - b_{i,n})^{\gamma}$, is modified to $\alpha(|y_{i,n} - b_{i,n}|)^{\gamma}$, where $\alpha, \gamma \geq 1$. The progressive weight factor is applied to the cross-entropy loss, as represented in Equation (9). Adding the progressive weight factor yields a greater gradient and loss value than the original cross-entropy; the $\mathrm{Loss}^{\mathrm{CE}}_{n}$ value is compared with the focal loss over the range $\{y_{i,n}, b_{i,n} : 0 \leq y_{i,n}, b_{i,n} \leq 1\}$.

            (9) $\mathrm{Loss}^{\mathrm{PEWFL}}_{n} = -\,\alpha\,(|y_{i,n} - b_{i,n}|)^{\gamma} \sum_{i}^{c} y_{i,n}\,\log(b_{i,n})$

            The analysis shows that the modifications added to the focal loss enhance the classification performance and increase the accuracy by controlling the loss assigned to each class, while also improving the gradient behavior. Correspondingly, the algorithms used for the internal comparison are depicted in the following sections.
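
            The paper does not give an explicit implementation, so the following is only a minimal sketch of how a PEWFL-style loss from Equation (9) could be wired into XGBoost as a custom binary objective. It treats the modulating weight α|y − b|^γ as fixed within each boosting iteration (a common simplification when deriving gradients for focal-type losses), so it is an illustration rather than the authors’ reference implementation; the parameter values are placeholders.

            import numpy as np
            import xgboost as xgb

            def pewfl_objective(alpha=1.0, gamma=2.0):
                # Custom XGBoost objective for binary classification implementing a
                # PEWFL-style loss: the usual log-loss gradient and Hessian are scaled by
                # the progressive weight alpha * |y - p|**gamma, which is treated as a
                # constant within each boosting round (a simplification of the full derivative).
                def objective(preds, dtrain):
                    y = dtrain.get_label()
                    p = 1.0 / (1.0 + np.exp(-preds))            # sigmoid of the raw margins
                    w = alpha * np.abs(y - p) ** gamma          # progressive weight factor
                    grad = w * (p - y)                          # weighted log-loss gradient
                    hess = np.maximum(w * p * (1.0 - p), 1e-6)  # weighted log-loss Hessian
                    return grad, hess
                return objective

            dtrain = xgb.DMatrix(X_train, label=y_train)
            dtest = xgb.DMatrix(X_test, label=y_test)
            booster = xgb.train(
                params={"max_depth": 4, "eta": 0.1},
                dtrain=dtrain,
                num_boost_round=200,
                obj=pewfl_objective(alpha=1.0, gamma=2.0),
            )
            pred_prob = 1.0 / (1.0 + np.exp(-booster.predict(dtest)))   # margins -> probabilities
            pred_label = (pred_prob >= 0.5).astype(int)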

            KNN

            KNN is a supervised ML mechanism widely utilized in regression and classification problems. The idea of the algorithm is to find the K nearest points to a particular observation and to use the average of those points (for regression) or their majority class (for classification) as the predicted value. It is used in many applications, such as medicine, agriculture, and finance. Generally, it classifies unlabeled data by computing the distance between the unlabeled data point and all points in the dataset.

            Further, it allocates the unlabeled data point to the class of the similarly labeled data, which is achieved by identifying patterns in the dataset. The most common distance measure in the algorithm is the Euclidean distance, where the distance between a test sample $\mathrm{input} = (\mathrm{input}_1, \mathrm{input}_2, \ldots, \mathrm{input}_n)$ and a training point $X = (x_1, x_2, \ldots, x_n)$ is computed through Equation (10).

            (10) $\mathrm{Distance}(\mathrm{input}, X) = \sqrt{\sum_{i=1}^{n} (\mathrm{input}_i - x_i)^{2}}$

            The classification is then processed based on the KNN, where the K nearest neighbors take part in a voting process. The class label that receives the majority of votes among the K nearest neighbors is assigned to the test sample, as depicted in Equation (11).

            (11) $c(\mathrm{input}_i) = \arg\max_{Y_K} \sum_{x_j \in \mathrm{KNN}} E(x_j, Y_K)$

            Here, $x_j$ is one of the K nearest neighbors of the test object in the training set, and $E(x_j, Y_K)$ signifies whether $x_j$ belongs to class $Y_K$. The weighted KNN, based on class contribution, is an enhanced version of the KNN algorithm. It assigns a weight to every feature in the dataset, resulting in enhanced accuracy, as represented in Equation (12).

            (12) $\mathrm{Disc}_i = 1 - (\mathrm{pre}_i - \mathrm{pre}_t)$

            From Equation (12), $\mathrm{pre}_t$ is the average accuracy. Subsequently, the weight $w_i$ is attained by normalizing the discriminability of the $i$th dimension through Equation (13).

            (13) $w_i = \dfrac{\mathrm{Disc}_i}{\sum_{i=1}^{n} \mathrm{Disc}_i}$

            Moreover, the weight of every feature is utilized to compute the weighted Euclidean distance using Equation (14), where the number of features is represented as $n$.

            (14) $\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} w_i\,(\mathrm{input}_i - x_i)^{2}}$

            The major advantages of KNN are simple implementation, better noise reduction tasks, and the ability to handle complex patterns in the data. Consistently, the subsequent section depicts the process in the algorithm used for the internal comparison.
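
            As a point of reference, the KNN baseline used in the internal comparison can be reproduced with a few lines; the neighbor count and Euclidean metric below are illustrative choices rather than settings reported in the text.

            from sklearn.neighbors import KNeighborsClassifier

            # KNN baseline on the same train/test split as the other models.
            knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
            knn.fit(X_train, y_train)
            print("KNN test accuracy:", knn.score(X_test, y_test))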

            Random forest

            RF is an ensemble learning technique widely used for regression and classification problems in ML. To enhance accuracy, the algorithm trains a large number of DTs and combines them for classification. Each DT is built on a random subset of features, and the final prediction is produced by majority voting or by averaging the predictions of the trees. This reduces the chance of overfitting in the network. The major advantages of the RF algorithm are its ability to handle missing values and non-linear parameters and its better noise removal. The pseudocode of the RF algorithm is shown in Pseudocode 1.

            Pseudocode 1: Random forest

            Input: training data of size N × P and number of trees B
            For every variable i ∈ {1, …, P} do
             For a = 1 to B:
              1. Draw a bootstrap sample b* of size M from the training data.
              2. Grow a random-forest tree Tree_a on roughly 2/3 of the sampled data.
              3. Classify the remaining out-of-bag (OOB) 1/3 with Tree_a and compute its classification rate, namely accuracy_a.
              4. For variable i, permute its values, recompute the accuracy (accuracy_b), and subtract it from the original OOB accuracy (h_a = accuracy_a − accuracy_b); a larger drop indicates a more important variable.
             End for
             Aggregate the accuracy differences over all trees and compute their variance:
              Ĵ = (1/B) Σ_{k=1}^{B} J_k  and  s_J² = (1/(B − 1)) Σ_{k=1}^{B} (J_k − Ĵ)²
             Compute the importance of variable i: variable_i = Ĵ / s_J
            End for
            Prediction phase

            The prediction phase is a commonly used technique for determining the efficiency of the proposed system. In this phase, the algorithm is run on the test data, which reveals the performance of the proposed model. Finally, the system’s efficiency is calculated using performance metrics such as ROC, accuracy, F1-score, precision, and recall to evaluate the efficacy of the proposed model.

            RESULTS AND DISCUSSION

            This section briefly describes the outcomes attained by the proposed system in the classification of ALS and non-ALS. Further, exploratory data analysis (EDA), performance metrics, experimental outcomes, and comparative analysis of conventional methods are presented.

            EDA

            This subsection describes the data utilized in the proposed model from the Kaggle ALS dataset. EDA is used to visualize the data for a better understanding.

            Figure 8 presents the percentage and number of patients used in the Kaggle ALS dataset. The pie chart represents the percentage of patients in the dataset, whereas the bar chart represents the number of patients. From the representation of the pie chart, over 36% of people have ALS, and 64% of people are normal. Correspondingly, in the bar chart, 0 signifies patients with the disease, and 1 signifies patients without the disease. It is identified that the data comprise 125 patients with the disease and 220 patients without the disease in the respective approach.

            Figure 8:

            Visualization of patients counts in the Kaggle ALS dataset. Abbreviation: ALS, amyotrophic lateral sclerosis.

            Figure 9 depicts the knowledge graph of the proposed research. A knowledge graph in ML models is used to understand data integration. It comprises three major components: nodes, edges, and labels. The nodes signify the entities used to identify relationships in the data, the edges represent the similarity among the nodes, and the labels describe the relationships defined by the nodes and edges.

            Figure 9:

            Knowledge graph representation of system.

            Figure 10 presents the density plot of the gene expression. It is the graphical representation of the dissemination of numeric variables. The density plot, also called kernel density estimate, depicts the probability density function. The genes used in the plot are THADA, RGS6, and LRRFIP1. Figure 11 depicts the signal to noise ratio (SNR) pair plot for data feature distribution.

            Figure 10:

            Density plot for gene expression analysis.

            Figure 11:

            Feature distribution with SNR pair plot.

            Figure 11 depicts the relationships among the variables in the dataset. It helps in understanding the data by illustrating a large amount of data in a single figure.

            Figure 12 presents the box plot of the utilized gene expression. It signifies the gene expression of every gene sample following data normalization.

            Figure 12:

            Box plot of gene expressions levels.

            Figure 13 illustrates the network graph of the proposed model. It is also called a node-link chart or link graph. It is used to understand, analyze, and visualize the relationship among the entities.

            Figure 13:

            Proposed model network graph.

            Correspondingly, blinding is a method that ensures the researchers and patients do not know the group labels of the data being analyzed. Hence, it can minimize bias and confirm the objectivity of the study.

            Software and hardware configuration

            The environmental configuration of obtaining the results for the proposed model is presented in Table 1, where hardware and software configurations used for implementing the results of the model are listed.

            Table 1:

            Environmental configuration.

            Hardware configuration | Software tools
            CPU: Intel Core i7-7700 @ 2.80 GHz | Windows 10
            GPU: GTX 1050 | Python 3.7
            RAM: 16 GB | Anaconda (Spyder)
            Performance metrics

            The effectiveness of the proposed PEWFL-XGBoost is calculated with performance metrics such as F1-score, accuracy, recall, and precision; a brief sketch computing these metrics is given after Equation (18).

            1. F1-score: It is calculated as the harmonic mean of the precision and recall values. A higher F1-score indicates a more efficient classifier. It is depicted in Equation (15):

              (15) $\text{F1-score} = \dfrac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$

            2. Accuracy: It is stated as the ratio of correct identifications to the total number of identifications made by the system. The formula for accuracy is shown in Equation (16):

              (16) $\text{Accuracy} = \dfrac{TP + TN}{TP + FP + TN + FN},$

              where TN is true negative, TP is true positive, FN is false negative, and FP is false positive.

            3. Recall: It is the ratio of correctly identified positive outcomes to all actual positive outcomes. Recall is also called sensitivity and is represented by Equation (17):

              (17) $\text{Recall} = \dfrac{TP}{TP + FN},$

              where FN and TP are false negative and true positive, respectively.

            4. Precision: It is also called the positive predictive value and is stated as the fraction of TPs to the sum of TPs and FPs, as given in Equation (18):

              (18) $\text{Precision} = \dfrac{TP}{TP + FP},$

              where FP is false positive and TP is true positive.
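
            A minimal sketch computing the metrics of Equations (15)-(18) plus the ROC area on the held-out test split; pred_label and pred_prob are the predictions produced in the earlier PEWFL-XGBoost sketch and are assumptions of that illustration.

            from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                         f1_score, roc_auc_score)

            # Evaluation of the trained model on the test split.
            print("Accuracy :", accuracy_score(y_test, pred_label))
            print("Precision:", precision_score(y_test, pred_label))
            print("Recall   :", recall_score(y_test, pred_label))
            print("F1-score :", f1_score(y_test, pred_label))
            print("ROC-AUC  :", roc_auc_score(y_test, pred_prob))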

            Experimental results

            This section describes the outcomes accomplished by the proposed research in classifying ALS and non-ALS with the Kaggle ALS dataset. Further, the results obtained in the internal comparison of the proposed PEWFL-XGBoost and conventional algorithms such as KNN, XGBoost, and RF are presented.

            Figure 14 and Table 2 illustrate the results attained by the proposed research of the PEWFL-XGBoost system. Here, the outcomes accomplished for both ALS and non-ALS and the overall results are presented. The results of the classification for non-ALS attained an F1-score of 0.97, an accuracy rate of 0.98, a recall rate of 0.97, and a precision rate of 0.97. Correspondingly, the classification outcomes for ALS attained an F1-score of 0.98, an accuracy rate of 0.98, a recall rate of 0.98, and a precision rate of 0.98. Finally, the overall results of the classification state that the proposed model attained an F1-score of 0.98, an accuracy rate of 0.98, a recall rate of 0.98, and a precision rate of 0.98.

            Figure 14:

            Visual representation of the PEWFL-XGBoost model. Abbreviations: ALS, amyotrophic lateral sclerosis; PEWFL, progressive entropy weighted focal loss.

            Table 2:

            Performance of the proposed research PEWFL-XGBoost.

            Model | Precision | Recall | F1-score | Accuracy
            Non-ALS | 0.97 | 0.97 | 0.97 | 0.98
            ALS | 0.98 | 0.98 | 0.98 | 0.98
            PEWFL-XGBoost | 0.98 | 0.98 | 0.98 | 0.98

            Abbreviations: ALS, amyotrophic lateral sclerosis; PEWFL, progressive entropy weighted focal loss.

            Figure 15 and Table 3 depict the outcomes accomplished by the classical XGBoost system. Here, the results attained for both ALS and non-ALS and the overall outcomes are presented. The outcomes of the classification for non-ALS attained an F1-score of 0.72, an accuracy rate of 0.73, a recall rate of 0.72, and a precision rate of 0.73. Similarly, the results of the classification for ALS accomplished an F1-score of 0.72, an accuracy rate of 0.75, a recall rate of 0.84, and a precision rate of 0.75. Lastly, the overall results of the classification state that the classical XGBoost attained an F1-score of 0.75, an accuracy rate of 0.75, a recall rate of 0.84, and a precision rate of 0.72.

            Figure 15:

            Illustration of the traditional XGBoost model. Abbreviation: ALS, amyotrophic lateral sclerosis.

            Table 3:

            Performance metrics of classical XGBoost.

            Model | Precision | Recall | F1-score | Accuracy
            Non-ALS | 0.73 | 0.72 | 0.72 | 0.73
            ALS | 0.75 | 0.84 | 0.72 | 0.75
            XG_Boost | 0.72 | 0.71 | 0.75 | 0.75

            Abbreviation: ALS, amyotrophic lateral sclerosis.

            Figure 16 and Table 4 show the outcomes accomplished by the RF system. Here, the outcomes accomplished for both ALS and non-ALS and the overall results are presented. The classification outcomes for non-ALS acquired an F1-score of 0.71, an accuracy rate of 0.71, a recall rate of 0.72, and a precision rate of 0.72. Likewise, the results of the classification for ALS attained an F1-score of 0.75, an accuracy rate of 0.72, a recall rate of 0.75, and a precision rate of 0.72. Finally, the overall outcomes of the classification state that the RF system attained an F1-score of 0.71, an accuracy rate of 0.74, a recall rate of 0.75, and a precision rate of 0.72.

            Figure 16:

            Visual Overview of the RF model. Abbreviations: ALS, amyotrophic lateral sclerosis; RF, random forest.

            Table 4:

            Metrics of the RF technique.

            Model | Precision | Recall | F1-score | Accuracy
            Non-ALS | 0.71 | 0.72 | 0.71 | 0.71
            ALS | 0.72 | 0.75 | 0.75 | 0.72
            RF | 0.72 | 0.71 | 0.71 | 0.74

            Abbreviations: ALS, amyotrophic lateral sclerosis; RF, random forest.

            Figure 17 and Table 5 illustrate the results attained by the KNN model. Here, the results accomplished for both ALS and non-ALS and the overall outcomes are depicted. The classification results for non-ALS acquired an F1-score of 0.72, an accuracy rate of 0.71, a recall rate of 0.71, and a precision rate of 0.68. Similarly, the classification outcomes for ALS accomplished an F1-score of 0.72, an accuracy rate of 0.72, a recall rate of 0.89, and a precision rate of 0.73. Finally, the overall outcomes of the classification state that the KNN system acquired an F1-score of 0.72, an accuracy rate of 0.72, a recall rate of 0.89, and a precision rate of 0.73. Table 6 and Figure 18 present the overall outcomes of the internal comparison.

            Figure 17:

            Graphical representation of the KNN model. Abbreviations: ALS, amyotrophic lateral sclerosis; KNN, K-nearest neighbor.

            Table 5:

            Performance of the KNN method.

            Model | Precision | Recall | F1-score | Accuracy
            Non-ALS | 0.69 | 0.71 | 0.72 | 0.71
            ALS | 0.73 | 0.89 | 0.72 | 0.72
            KNN | 0.73 | 0.72 | 0.72 | 0.72

            Abbreviations: ALS, amyotrophic lateral sclerosis; KNN, K-nearest neighbor.

            Table 6:

            Internal comparison of the proposed and classical models.

            Model | Precision | Recall | F1-score | Accuracy
            XG_Boost | 0.54 | 0.59 | 0.53 | 0.59
            RF | 0.51 | 0.61 | 0.48 | 0.61
            KNN | 0.49 | 0.58 | 0.49 | 0.58
            Proposed | 0.98 | 0.98 | 0.98 | 0.98

            Abbreviations: ALS, amyotrophic lateral sclerosis; KNN, K-nearest neighbor; RF, random forest.

            Figure 18:

            Comparative analysis of proposed against classical algorithms. Abbreviations: ALS, amyotrophic lateral sclerosis; KNN, K-nearest neighbor; RF, random forest.

            In Figure 18 and Table 6, the efficiency of the classical algorithms with the proposed PEWFL-XGBoost is compared. From the conventional algorithms, a higher accuracy of 0.61 is attained by RF. The proposed system acquired an accuracy of 0.98, which is comparatively higher than that of the conventional algorithms, which reveals the effectiveness of the respective research.

            Statistical report

            The proposed model has acquired better results when compared with the other traditional algorithms. Moreover, the same dataset is examined to determine the stability of the model. Besides, the “LRRFIP1” variable is used for the investigation, and its statistical report is provided in Table 7.

            Table 7:

            Statistical report of the proposed model.

            Test | Values
            Shapiro test for group 0 | 0.20317941217179597
            Shapiro test for group 1 | 0.19140789961680915
            Levene’s test | 0.4135585674363751
            t-Test results
             t-statistic | 34.4585
             P value | 5.9389e−33
             Cohen’s d | 11.9870
            χ2 test results
             χ2 | 40.6125
             P value | 1.8562e−10
            Table 8:

            Gene sequence and functions.

            S. no. | Gene sequence | Functions | Gene mutation effects
            1 | LRRFIP1 | Encodes a protein used in the management of gene expression | Alzheimer’s disease and some neurodegenerative diseases
            2 | THADA | Encodes a protein used in transcription and cell signaling | Cancer and some inflammatory diseases
            3 | RGS6 | Encodes a protein used in the management of G-protein-coupled receptor signaling | Cancer and heart diseases
            4 | ZNF638 | Encodes a protein used in transcription and repair of DNA | Inflammatory diseases and cancer
            5 | MMP23B | Encodes an enzyme used in the breakdown of the extracellular matrix | Inflammatory diseases and cancer
            6 | PLXNB1 | Encodes a protein used in synaptic transmission and cell signaling | Alzheimer’s disease and some neurodegenerative diseases
            7 | RP3-377D14.1 | Encodes a protein used in the management of gene expression | Inflammatory diseases and cancer
            8 | USP34 | Encodes an enzyme used in protein modifications | Inflammatory diseases and cancer
            9 | THADA.1 | Encodes a variant of THADA | Inflammatory diseases and cancer
            10 | CYorf15A | Encodes a protein used for the growth of cells | Inflammatory diseases and cancer
            11 | CHRNA6 | Encodes the alpha-6 subunit of the nicotinic acetylcholine receptor (nAChR) | Schizophrenia and Parkinson’s disease
            12 | THSD7B | Encodes a protein used in the management of gene expression | Inflammatory diseases and cancer
            13 | THADA.2 | THADA variant gene | Inflammatory diseases and cancer
            14 | MTIF2 | Encodes a protein used in the management of cell growth | Inflammatory diseases and cancer
            15 | FAM18B2 | Encodes a protein used in the management of the immune system | Cancer and autoimmune disorders
            16 | AC092299.1 | Not characterized |
            17 | AC069287.3 | Not characterized |
            18 | ARMC10 | Encodes a protein used in the management of the immune system | Cancer and autoimmune disorders
            20 | FAM123C | Encodes a protein used in the management of the immune system | Cancer and autoimmune disorders
            21 | USP34.1 | USP34 variant | Inflammatory diseases and cancer
            22 | RP11-339I24.1 | Not characterized |
            23 | ZKSCAN4 | Encodes a protein used in the management of the immune system | Cancer and autoimmune disorders
            24 | LL22NC03-80A10.2 | Encodes a protein used in cell growth | Inflammatory diseases and cancer
            25 | FAM123C.1 | FAM123C variant | Inflammatory diseases and cancer
            26 | RPL21 | Encodes a protein used in the translation of RNA into protein | Muscular dystrophy and cancer
            27 | SR140 | Encodes a protein used in the management of gene expression | Inflammatory diseases and cancer
            28 | MUC4 | Encodes a protein used in mucus formation | Inflammatory diseases and cancer
            29 | EHBP1 | Encodes a protein used in the growth of cells | Inflammatory diseases and cancer
            30 | FOXO4 | Encodes a protein used in the management of cell growth | Inflammatory diseases and cancer
            31 | TRIM59 | Encodes a protein used in the management of gene expression | Cancer and autoimmune disorders
            32 | MUC4.1 | MUC4 variant | Inflammatory diseases and cancer
            33 | MUC4.2 | MUC4 variant | Inflammatory diseases and cancer
            34 | USP39 | Encodes a protein used in post-translational protein alteration | Inflammatory diseases and cancer
            35 | LRRFIP1.1 | LRRFIP1 variant | Alzheimer’s disease and some neurodegenerative diseases
            36 | MMP23B.1 | MMP23B variant | Inflammatory diseases and cancer
            37 | GRM8 | Encodes a protein used in taste perception | Autism and taste disorders
            38 | TRIM59.1 | TRIM59 variant | Autoimmune disorders and cancer

Correspondingly, normality is evaluated using the Shapiro–Wilk test. The test yields P values of 0.203 for group 0 and 0.191 for group 1, both of which exceed the typical alpha level of 0.05. The null hypothesis of normality is therefore not rejected, indicating that the data in both groups are approximately normally distributed.

Similarly, Levene’s test for homogeneity of variance yields a P value of 0.414, which is higher than 0.05, so the assumption of equal variance between the groups holds. Additionally, an independent-samples t-test is conducted to compare “LRRFIP1” between groups 0 and 1. The t-statistic is 34.46 with a P value of 5.94 × 10⁻³³, which is far below 0.05 and provides strong evidence against the null hypothesis of equal means. The effect size, measured by Cohen’s d, is 11.99, indicating a very large effect. Moreover, the χ2 test yields a statistic of 40.61 with a P value of 1.86 × 10⁻¹⁰. Together, these findings indicate a strong association between “LRRFIP1” and the class labels and suggest that this variable warrants further consideration in future studies.
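For illustration, statistical checks of this kind can be reproduced with SciPy roughly as follows; this is a minimal sketch, and the DataFrame layout, the column names ("LRRFIP1", "label"), and the median split used for the chi-square test are assumptions rather than details taken from the study.

# Minimal sketch of the statistical checks described above, assuming the data
# have been loaded into a pandas DataFrame `df` with a gene expression column
# "LRRFIP1" and a binary class column "label" (0 / 1).
import numpy as np
import pandas as pd
from scipy import stats

def run_group_tests(df: pd.DataFrame, feature: str = "LRRFIP1", target: str = "label"):
    g0 = df.loc[df[target] == 0, feature].dropna()
    g1 = df.loc[df[target] == 1, feature].dropna()

    # Shapiro-Wilk normality test, one P value per group
    _, p_norm0 = stats.shapiro(g0)
    _, p_norm1 = stats.shapiro(g1)

    # Levene's test for homogeneity of variance
    _, p_levene = stats.levene(g0, g1)

    # Independent-samples t-test (equal variances assumed if Levene's P > 0.05)
    t_stat, p_t = stats.ttest_ind(g0, g1, equal_var=(p_levene > 0.05))

    # Cohen's d from the pooled standard deviation
    pooled_sd = np.sqrt(((len(g0) - 1) * g0.var(ddof=1) + (len(g1) - 1) * g1.var(ddof=1))
                        / (len(g0) + len(g1) - 2))
    cohens_d = (g0.mean() - g1.mean()) / pooled_sd

    # Chi-square test of association between a median split of the feature and the label
    binned = (df[feature] > df[feature].median()).astype(int)
    chi2, p_chi2, _, _ = stats.chi2_contingency(pd.crosstab(binned, df[target]))

    return {"shapiro_p": (p_norm0, p_norm1), "levene_p": p_levene,
            "t_test": (t_stat, p_t), "cohens_d": cohens_d, "chi2": (chi2, p_chi2)}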

Moreover, power analysis is a statistical technique used to determine the minimum sample size required to detect a statistically significant effect. It helps researchers avoid underpowered studies that cannot detect a true effect and overpowered studies that waste available resources.
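A power calculation of this kind can be sketched with statsmodels; the effect size, significance level, and target power used below are illustrative values, not figures reported in the study.

# Illustrative power analysis for a two-sample t-test using statsmodels.
# The effect size, alpha, and target power below are assumed example values.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.8,      # assumed (large) effect size
                                   alpha=0.05,            # significance level
                                   power=0.8,             # desired statistical power
                                   ratio=1.0,             # equal group sizes
                                   alternative="two-sided")
print(f"Minimum sample size per group: {n_per_group:.1f}")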

            Performance analysis

This section presents and analyzes the performance of the proposed PEWFL-XGBoost, the conventional XGBoost, the RF model, and the KNN system.

Figure 19 showcases the confusion matrix and ROC curve for the KNN model, offering a detailed evaluation of its classification performance. The confusion matrix contains 3 TPs, 7 FPs, 37 FNs, and 57 TNs. These values show that the KNN model identified only a few positive instances and produced numerous FNs, indicating considerable room for improving sensitivity and overall accuracy. The ROC curve illustrates the model’s behavior by depicting the trade-off between the true-positive rate and the false-positive rate across various thresholds. These results highlight both the strengths and the weaknesses of the KNN model in this classification task and offer insight into how its ability to detect positive instances could be improved.

            Figure 19:

            Performance with confusion matrix and ROC curve of the KNN classification. Abbreviations: KNN, K-nearest neighbor; ROC, receiver operating characteristic.
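For readers who wish to reproduce this style of evaluation, the following sketch shows how a confusion matrix and ROC curve of the kind reported in Figures 19-22 could be generated with scikit-learn; the names `model`, `X_test`, and `y_test` are placeholders for a fitted binary classifier and a held-out test split, not objects defined in the paper.

# Sketch of a confusion-matrix and ROC-curve evaluation with scikit-learn,
# assuming a fitted binary classifier `model` and a test split (X_test, y_test).
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, confusion_matrix

def plot_evaluation(model, X_test, y_test, title: str) -> None:
    y_pred = model.predict(X_test)

    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = confusion_matrix(y_test, y_pred)
    print(f"{title} confusion matrix:\n{cm}")

    fig, (ax_cm, ax_roc) = plt.subplots(1, 2, figsize=(10, 4))
    ConfusionMatrixDisplay(cm).plot(ax=ax_cm)
    # ROC curve: true-positive rate vs false-positive rate across thresholds.
    RocCurveDisplay.from_estimator(model, X_test, y_test, ax=ax_roc)
    fig.suptitle(title)
    plt.show()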

Figure 20 displays the confusion matrix and ROC curve for the RF method, providing a comprehensive evaluation of its classification performance. The confusion matrix includes 1 TP, 2 FPs, 39 FNs, and 62 TNs. These values suggest that the RF model struggled to detect positive instances, with few TPs and a high FN count. The ROC curve shows how effectively the model balances sensitivity and specificity across different thresholds. Overall, these measurements offer valuable insight into the effectiveness of the RF method and highlight areas requiring improvement for better classification outcomes in subsequent applications.

            Figure 20:

            Performance assessment of confusion matrix and ROC curve of the RF model. Abbreviations: RF, random forest; ROC, receiver operating characteristic.

Figure 21 presents the confusion matrix and ROC curve for the conventional XGBoost model, providing a comprehensive assessment of its classification performance. The confusion matrix displays 7 TPs, 10 FPs, 33 FNs, and 54 TNs. These values indicate that while the model successfully identified some positive cases, it also produced a considerable number of FNs, pointing to limited sensitivity. The ROC curve complements this analysis by visually displaying the trade-off between the TP rate and the FP rate across various threshold settings. Overall, these findings highlight both the strengths and the areas for improvement in the classification performance of the XGBoost model.

            Figure 21:

            Evaluation of confusion matrix and ROC curve of the traditional XGBoost Model. Abbreviation: ROC, receiver operating characteristic.

Figure 22 presents the confusion matrix and ROC curve of the proposed PEWFL-XGBoost. In the confusion matrix, the TP, FP, FN, and TN values are 39, 1, 1, and 63, respectively. The ROC curve plots the TP rate against the FP rate across thresholds and provides a graphical summary of the classification performance. With only one FP and one FN, the overall analysis shows that the proposed PEWFL-XGBoost achieves a markedly better outcome than the conventional algorithms.

            Figure 22:

            Performance metrics of PEWFL-XGBoost model’s confusion matrix and ROC curve. Abbreviations: PEWFL, progressive entropy weighted focal loss; ROC, receiver operating characteristic.
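As a quick check, the headline scores reported below for the proposed model can be recovered directly from the Figure 22 confusion matrix; the short calculation that follows uses only the four cell counts and rounds to approximately 0.98 for every metric.

# Worked check of the headline metrics from the Figure 22 confusion matrix
# (TP = 39, FP = 1, FN = 1, TN = 63); rounding to two decimals gives ~0.98.
tp, fp, fn, tn = 39, 1, 1, 63

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # 102 / 104 ≈ 0.981
precision = tp / (tp + fp)                                  # 39 / 40  = 0.975
recall    = tp / (tp + fn)                                  # 39 / 40  = 0.975
f1        = 2 * precision * recall / (precision + recall)   # ≈ 0.975

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")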

Accordingly, the proposed model uses PEWFL-XGBoost to classify ALS and non-ALS samples from the Kaggle ALS dataset. XGBoost is adopted for its speed and its ability to handle missing data; however, it has some limitations, such as sensitivity to hyperparameter tuning, difficulty with smaller datasets, and a tendency to overfit. To address these problems and enhance classification performance, PEWFL is added to the XGBoost system. The results show that the proposed model attains an F1-score of 0.98, an accuracy of 0.98, a recall of 0.98, and a precision of 0.98, which are higher than the results obtained by the conventional algorithms XGBoost, KNN, and RF. The PEWFL-XGBoost system for ALS/non-ALS classification on the Kaggle ALS dataset therefore achieves better outcomes with higher accuracy, as verified by the results. In addition, the combined PEWFL-XGBoost system helps address the class imbalance, overfitting, and noise problems present in the classical methods.
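Since the exact mathematical form of the progressive entropy weighted focal loss is not reproduced in this section, the following Python sketch only illustrates, under stated assumptions, how a focal loss with an entropy-based per-sample weight could be plugged into XGBoost as a custom objective; the weighting scheme, the GAMMA and ALPHA values, and the finite-difference derivatives are illustrative choices, not the authors’ exact formulation.

# Hedged sketch of an entropy-weighted focal loss as an XGBoost custom objective.
import numpy as np
import xgboost as xgb

GAMMA, ALPHA = 2.0, 0.75  # assumed focal-loss hyperparameters

def pewfl_objective(preds, dtrain):
    """Custom objective: entropy-weighted focal loss (illustrative only)."""
    y = dtrain.get_label()

    def per_sample_loss(margin):
        p = 1.0 / (1.0 + np.exp(-margin))                  # sigmoid of raw margin
        p = np.clip(p, 1e-7, 1 - 1e-7)
        pt = np.where(y == 1, p, 1 - p)                    # probability of the true class
        a = np.where(y == 1, ALPHA, 1 - ALPHA)             # class-balance factor
        entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
        weight = 1.0 + entropy                             # assumed "entropy weighting"
        return -weight * a * (1 - pt) ** GAMMA * np.log(pt)

    # Gradient and Hessian by central finite differences (keeps the sketch simple).
    eps = 1e-4
    grad = (per_sample_loss(preds + eps) - per_sample_loss(preds - eps)) / (2 * eps)
    hess = (per_sample_loss(preds + eps) - 2 * per_sample_loss(preds)
            + per_sample_loss(preds - eps)) / eps ** 2
    return grad, np.maximum(hess, 1e-6)                    # keep the Hessian positive

# Usage (assuming dtrain/dvalid are xgboost.DMatrix objects built from the dataset):
# booster = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain,
#                     num_boost_round=200, obj=pewfl_objective,
#                     evals=[(dvalid, "valid")])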

            Insights and discussion

The contribution of the proposed model and the dataset is illustrated in this section. Numerous studies have focused on detecting ALS through symptoms such as voice perturbation (Vashkevich et al., 2019; Stegmann et al., 2020) and behavioral screening (Tremolizzo et al., 2020). To reduce disease severity, it is vital to detect the disease early by identifying the primary cause of ALS. The main cause of ALS is gene mutations, which can disturb a cell’s ability to make healthy proteins. As a result, neurons lack the proteins they need, leading to their degeneration and death; this progressive loss of motor neurons characterizes ALS. All functions controlled by these neurons are affected, including muscle-dependent activities such as walking and talking. The foremost symptoms of ALS are listed below:

            • Slurred speech

            • Weakness of hands

            • Falling or tripping

• Difficulty thinking (cognitive changes)

• Breathing difficulties

Correspondingly, detecting the gene sequences associated with ALS is necessary so that precise treatment can be given to reduce the consequences of the disease. For that purpose, the proposed research focuses on classifying ALS and non-ALS using the Kaggle ALS dataset, which comprises gene samples related to ALS and non-ALS.

Table 8 presents the gene sequences related and unrelated to ALS, together with their functions and mutation effects. The present research uses these gene sequences to classify ALS and non-ALS. Several researchers have implemented joint classification of neurological diseases such as ALS, HD, and PD (Aich et al., 2019; Setiawan et al., 2022), whereas only limited research has focused specifically on ALS. The proposed research fills this gap with an ALS-focused classification of ALS versus non-ALS. Moreover, accuracy is the primary factor in assessing a model’s performance; several existing models attained effective results but lacked accuracy. The proposed classification, with effective handling of the smaller dataset, reduced overfitting, and noise prevention, attains an accuracy of 0.98, which is higher than that of the conventional models. Besides, the dataset used here has rarely been employed in existing research; for that reason, the proposed model is compared against classical algorithms. In this comparison, the proposed method achieves higher accuracy than the conventional algorithms, which reveals its efficacy. Correspondingly, the proposed system is intended to assist qualified doctors in providing an effective ALS diagnosis, monitoring disease progression, and planning treatment, and it is envisioned to enhance the quality of life of ALS patients. Overall, the proposed PEWFL-XGBoost improves the efficacy, accuracy, and speed of ALS classification while mitigating limitations of classical methods such as noise handling, overfitting, and class imbalance. It also improves the understanding of the genetic factors that contribute to ALS through a molecular-level approach to classification and helps overcome the limitations of traditional ALS screening methods.

            Case study discussion
            Patient-specific diagnosis

In a patient-level analysis, the gene-level results of the proposed PEWFL-XGBoost highlighted genes encoding RNA-splicing proteins whose network function was dysregulated, enriched across both the number of differentially expressed networks and cell types. These genes showed an important overlap with an independently generated list of GGGGCC repeat-binding protein partners. At the exon level, splicing fidelity in lymphoblastoid cell lines from C9ORF72-ALS patients was lower than in lines from non-C9ORF72 ALS patients. Patients with faster disease progression showed lower splicing consistency, and patients with shorter survival showed greater splicing dysregulation in their lymphoblastoid cells. In total, RNA was extracted from lymphoblastoid cell lines of 56 ALS patients and 15 controls, and the C9ORF72 mutation was detected in 31 of the ALS patients.

            CONCLUSION

The worldwide increase in the rate of ALS affects the lives of an enormous number of people, with excruciating issues such as muscle cramps, slurred speech, neuropathic pain, and breathing trouble. To reduce the effects of ALS, it is vital to detect the disease early; early detection is essential so that appropriate treatment can be started to reduce its severity. Traditional blood-test screening is a painful, expensive, and time-consuming procedure. To resolve this issue, numerous studies have focused on the effective detection of ALS but have lacked accuracy and speed. Therefore, to achieve better detection of ALS, the proposed research employed PEWFL-XGBoost to classify ALS and non-ALS samples from the Kaggle ALS dataset. XGBoost was used for its speed and its ability to handle missing data. Although it is a strong classifier, it has some drawbacks, such as sensitivity to hyperparameter tuning, difficulty with smaller datasets, and overfitting. To resolve these limitations and improve classification performance, PEWFL was added to the XGBoost system. The experimental results showed that PEWFL-XGBoost attained an F1-score of 0.98, an accuracy of 0.98, a recall of 0.98, and a precision of 0.98. Correspondingly, the internal comparison of the proposed model against conventional algorithms such as KNN, XGBoost, and RF demonstrates its effective performance. Although the proposed model attained effective results, it has some limitations. Overfitting may occur because the model’s performance is influenced by both the amount and the quality of data. ALS is complex and its effects on patients vary; while addressing class differences may help, it might not prevent mislabeling in unique cases. Additionally, healthcare professionals may have difficulty interpreting predictions because XGBoost is complex. These factors underscore the importance of rigorous validation in clinical applications. In the future, the proposed method can be extended and incorporated with multi-omics data to provide a more comprehensive understanding of the molecular mechanisms underlying ALS, and DL techniques can be explored for ALS classification using large-scale genomic data.

            CONFLICTS OF INTEREST

            The authors declared no conflicts of interest.

            REFERENCES

            1. Aich S, Joo M-i, Kim H-C, Park J. 2019. Improvisation of classification performance based on feature optimization for differentiation of Parkinson’s disease from other neurological diseases using gait characteristics. Int. J. Electr. Comput. Eng. Vol. 9(6):5176–5184. [Cross Ref]

            2. Akçimen F, Lopez ER, Landers JE, Nath A, Chiò A, Chia R, et al.. 2023. Amyotrophic lateral sclerosis: translating genetic discoveries into therapies. Nat. Rev. Genet. Vol. 24(9):642–658. [Cross Ref]

            3. Bakiya A, Anitha A, Sridevi T, Kamalanand K. 2023. Retracted: Classification of myopathy and amyotrophic lateral sclerosis electromyograms using bat algorithm and deep neural networks. Behav. Neurol. Vol. 2023:9769130. [Cross Ref]

            4. Balaji E, Brindha D, Elumalai VK, Vikrama R. 2021. Automatic and non-invasive Parkinson’s disease diagnosis and severity rating using LSTM network. Appl. Soft Comput. Vol. 108:107463. [Cross Ref]

            5. Bernhardt AM, Tiedt S, Teupser D, Dichgans M, Meyer B, Gempt J, et al.. 2023. A unified classification approach rating clinical utility of protein biomarkers across neurologic diseases. EBioMedicine. Vol. 89:104456. [Cross Ref]

            6. Beyrami SMG, Ghaderyan P. 2020. A robust, cost-effective and non-invasive computer-aided method for diagnosis three types of neurodegenerative diseases with gait signal analysis. Measurement. Vol. 156:107579. [Cross Ref]

            7. Bjornevik K, O’Reilly EJ, Molsberry S, Kolonel LN, Le Marchand L, Paganoni S, et al.. 2021. Prediagnostic neurofilament light chain levels in amyotrophic lateral sclerosis. Neurology. Vol. 97(15):e1466–e1474. [Cross Ref]

            8. Black M, Ganesh D, Madduri N. 2015. Predicting the rate of progression of the ALS disease. Front. Neurol.

            9. Chatterjee S, Samanta K, Choudhury NR, Bose R. 2019. Detection of myopathy and ALS electromyograms employing modified window Stockwell transform. IEEE Sens. Lett. Vol. 3(7):1–4. [Cross Ref]

            10. Cooper-Knock J, Bury JJ, Heath PR, Wyles M, Higginbottom A, Gelsthorpe C, et al.. 2015. C9ORF72 GGGGCC expanded repeats produce splicing dysregulation which correlates with disease severity in amyotrophic lateral sclerosis. PLoS One. Vol. 10(5):e0127376. [Cross Ref]

11. Dodge JC, Yu J, Sardi SP, Shihabuddin LS. 2021. Sterol auto-oxidation adversely affects human motor neuron viability and is a neuropathological feature of amyotrophic lateral sclerosis. Sci. Rep. Vol. 11(1):803. [Cross Ref]

            12. Erdaş ÇB, Sümer E, Kibaroğlu SJ. 2021. Neurodegenerative disease detection and severity prediction using deep learning approaches. Biomed. Signal Process. Control. Vol. 70:103069. [Cross Ref]

            13. Feneberg E, Charles PD, Finelli MJ, Scott C, Kessler BM, Fischer R, et al.. 2021. Detection and quantification of novel C-terminal TDP-43 fragments in ALS-TDP. Brain Pathol. Vol. 31(4):e12923. [Cross Ref]

            14. French RL, Grese ZR, Aligireddy H, Dhavale DD, Reeb AN, Kedia N, et al.. 2019. Detection of TAR DNA-binding protein 43 (TDP-43) oligomers as initial intermediate species during aggregate formation. J. Biol. Chem. Vol. 294(17):6696–6709. [Cross Ref]

            15. Ghaderyan P, Beyrami SM. 2020. Neurodegenerative diseases detection using distance metrics and sparse coding: a new perspective on gait symmetric features. Comput. Biol. Med. Vol. 120:103736. [Cross Ref]

            16. Golini E, Rigamonti M, Iannello F, De Rosa C, Scavizzi F, Raspa M, et al.. 2020. A non-invasive digital biomarker for the detection of rest disturbances in the SOD1G93A mouse model of ALS. Front. Neurosci. Vol. 14:896. [Cross Ref]

            17. Gross CC, Schulte-Mecklenbeck A, Madireddy L, Pawlitzki M, Strippel C, Räuber S, et al.. 2021. Classification of neurological diseases using multi-dimensional CSF analysis. Brain. Vol. 144(9):2625–2634. [Cross Ref]

            18. Hug F, Avrillon S, Ibáñez J, Farina D. 2023. Common synaptic input, synergies and size principle: control of spinal motor neurons for movement generation. J. Physiol. Vol. 601(1):11–20. [Cross Ref]

            19. Imamura K, Yada Y, Izumi Y, Morita M, Kawata A, Arisato T, et al.. 2021. Prediction model of amyotrophic lateral sclerosis by deep learning with patient induced pluripotent stem cells. Ann. Neurol. Vol. 89(6):1226–1233. [Cross Ref]

            20. Karim A, Su Z, West PK, Keon M; The NYGC ALS Consortium; Shamsani J, et al.. 2021. Molecular classification and interpretation of amyotrophic lateral sclerosis using deep convolution neural networks and Shapley values. Genes (Basel). Vol. 12(11):1754. [Cross Ref]

            21. Kocar TD, Behler A, Ludolph AC, Müller H-P, Kassubek J. 2021. Multiparametric microstructural MRI and machine learning classification yields high diagnostic accuracy in amyotrophic lateral sclerosis: proof of concept. Front. Neurol. Vol. 12:745475. [Cross Ref]

            22. Limone F, Guerra San Juan I, Mitchell JM, Smith JLM, Raghunathan K, Meyer D, et al.. 2023. Efficient generation of lower induced motor neurons by coupling Ngn2 expression with developmental cues. Cell Rep. Vol. 42(1):111896. [Cross Ref]

            23. Lin C-W, Wen T-C, Setiawan F. 2020. Evaluation of vertical ground reaction forces pattern visualization in neurodegenerative diseases identification using deep learning and recurrence plot image feature extraction. Sensors (Basel). Vol. 20(14):3857. [Cross Ref]

            24. Ma J. 2023. Biophysical neurons, energy, and synapse controllability: a review. J. Zhejiang Univ. Sci. A. Vol. 24(2):109–129. [Cross Ref]

25. Mallela J, Illa A, Belur Y, Atchayaram N, Yadav R, Reddy P, et al. 2020. Raw speech waveform based classification of patients with ALS, Parkinson’s disease and healthy controls using CNN-BLSTM. Proc. Interspeech 2020. p. 4586–4590. [Cross Ref]

            26. Masrori P, Van Damme P. 2020. Amyotrophic lateral sclerosis: a clinical review. Eur. J. Neurol. Vol. 27(10):1918–1929. [Cross Ref]

            27. Mead RJ, Shan N, Reiser HJ, Marshall F, Shaw PJ. 2023. Amyotrophic lateral sclerosis: a neurodegenerative disorder poised for successful therapeutic translation. Nat. Rev. Drug Discov. Vol. 22(3):185–212. [Cross Ref]

            28. Nakamori M, Ishikawa R, Watanabe T, Toko M, Naito H, Takahashi T, et al.. 2023. Swallowing sound evaluation using an electronic stethoscope and artificial intelligence analysis for patients with amyotrophic lateral sclerosis. Front. Neurol. Vol. 14:1212024. [Cross Ref]

            29. Neumann M, Roesler O, Liscombe J, Kothare H, Suendermann-Oeft D, Pautler D, et al.. 2021. Investigating the utility of multimodal conversational technology and audiovisual analytic measures for the assessment and monitoring of amyotrophic lateral sclerosis at scale. Proc. Interspeech. 4783–4787. [Cross Ref]

            30. Rahman MR, Islam T, Huq F, Quinn JM, Moni MA. 2019. Identification of molecular signatures and pathways common to blood cells and brain tissue of amyotrophic lateral sclerosis patients. Inform. Med. Unlocked. Vol. 16:100193. [Cross Ref]

            31. Roy SS, Samanta K, Modak S, Chatterjee S, Bose R. 2020. Cross spectrum aided deep feature extraction based neuromuscular disease detection framework. IEEE Sens. Lett. Vol. 4(6):1–4. [Cross Ref]

            32. Scialò C, Tran TH, Salzano G, Novi G, Caponnetto C, Chiò A, et al.. 2020. TDP-43 real-time quaking induced conversion reaction optimization and detection of seeding activity in CSF of amyotrophic lateral sclerosis and frontotemporal dementia patients. Brain Commun. Vol. 2(2):fcaa142. [Cross Ref]

33. Segura T, Medrano IH, Collazo S, Maté C, Sguera C, Del Rio-Bermudez C, et al. 2023. Symptoms timeline and outcomes in amyotrophic lateral sclerosis using artificial intelligence. Sci. Rep. Vol. 13(1):702. [Cross Ref]

            34. Sekar G, Sivakumar C, Logeshwaran J. 2022. NMLA: the smart detection of motor neuron disease and analyze the health impacts with neuro machine learning model. NeuroQuantology. Vol. 20(8):892–899. [Cross Ref]

            35. Setiawan F, Liu A-B, Lin C-W. 2022. Development of neuro-degenerative diseases’ gait classification algorithm using convolutional neural network and wavelet coherence spectrogram of gait synchronization. IEEE Access. Vol. 10:38137–38153. [Cross Ref]

36. Stegmann GM, Hahn S, Liss J, Shefner J, Rutkove S, Shelton K, et al. 2020. Early detection and tracking of bulbar changes in ALS via frequent and remote speech analysis. NPJ Digit. Med. Vol. 3(1):132. [Cross Ref]

            37. Suzuki N, Nishiyama A, Warita H, Aoki MJ. 2023. Genetics of amyotrophic lateral sclerosis: seeking therapeutic targets in the era of gene therapy. Hum. Genet. Vol. 68(3):131–152. [Cross Ref]

            38. Tavazzi E, Longato E, Vettoretti M, Aidos H, Trescato I, Roversi C, et al.. 2023. Artificial intelligence and statistical methods for stratification and prediction of progression in amyotrophic lateral sclerosis: a systematic review. Artif. Intell. Med. Vol. 142:102588. [Cross Ref]

            39. Thome J, Steinbach R, Grosskreutz J, Durstewitz D, Koppe G. 2022. Classification of amyotrophic lateral sclerosis by brain volume, connectivity, and network dynamics. Hum. Brain Mapp. Vol. 43(2):681–699. [Cross Ref]

            40. Torres-Castillo JR, Lopez-Lopez CO, Padilla-Castaneda MA. 2022. Neuromuscular disorders detection through time-frequency analysis and classification of multi-muscular EMG signals using Hilbert-Huang transform. Biomed. Signal Process. Control. Vol. 71:103037. [Cross Ref]

            41. Tremolizzo L, Lizio A, Santangelo G, Diamanti S, Lunetta C, Gerardi F, et al.. 2020. ALS Cognitive Behavioral Screen (ALS-CBS): normative values for the Italian population and clinical usability. Neurol. Sci. Vol. 41:835–841. [Cross Ref]

42. Udine E, Jain A, van Blitterswijk M. 2023. Advances in sequencing technologies for amyotrophic lateral sclerosis research. Mol. Neurodegener. Vol. 18(1):4. [Cross Ref]

            43. Vashkevich M, Rushkevich YJ. 2021. Classification of ALS patients based on acoustic analysis of sustained vowel phonations. Biomed. Signal Process. Control. Vol. 65:102350. [Cross Ref]

44. Vashkevich M, Petrovsky A, Rushkevich Y. 2019. Bulbar ALS detection based on analysis of voice perturbation and vibrato. 2019 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA). IEEE. p. 267–272. [Cross Ref]

            45. Vidovic M, Müschen LH, Brakemeier S, Machetanz G, Naumann M, Castro-Gomez S. 2023. Current state and future directions in the diagnosis of amyotrophic lateral sclerosis. Cells. Vol. 12(5):736. [Cross Ref]

            Author and article information

            Journal
            jdr
            Journal of Disability Research
            King Salman Centre for Disability Research (Riyadh, Saudi Arabia )
            1658-9912
03 January 2025
Volume: 4
Issue: 1
Article: e20240119
            Affiliations
            [1 ] Software Engineering Department, College of Computer Engineering and Sciences, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia ( https://ror.org/04jt46d36)
            [2 ] Department of Computer Science, College of Computer Engineering and Sciences in Al-Kharj, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia ( https://ror.org/04jt46d36)
            [3 ] Department of Computer Science and Information Systems, College of Applied Sciences, AlMaarefa University, Ad Diriyah, Riyadh, Saudi Arabia ( https://ror.org/00s3s5518)
            Author notes
Correspondence to: Abdullah Alqahtani*, e-mail: Aq.alqahtani@psau.edu.sa ; Ashit Kumar Dutta*, e-mail: adotta@um.edu.sa
            Author information
            https://orcid.org/0000-0002-2859-1629
            https://orcid.org/0000-0002-6584-7400
            Article
            10.57197/JDR-2024-0119
            2025 The Author(s).

            This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY) 4.0, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.

            History
            : 05 August 2024
            : 07 November 2024
            : 19 November 2024
            Page count
            Figures: 22, Tables: 9, References: 45, Pages: 22
            Funding
            Funded by: King Salman Center for Disability Research
            Award ID: KSRG-2023-395
            The authors extend their appreciation to the King Salman Center for Disability Research for funding this work through Research Group no. KSRG-2023-395.

Keywords: gene mutation, amyotrophic lateral sclerosis, artificial intelligence, machine learning, random forest, K-nearest neighbor, XGBoost
