
      Toward Robust Arabic Sign Language Recognition via Vision Transformers and Local Interpretable Model-agnostic Explanations Integration


            Abstract

            People with severe or substantial hearing loss find it difficult to communicate with others. Poor communication can have a significant impact on the mental health of deaf people. For individuals who are deaf or hard of hearing, sign language (SL) is the major mode of communication in their daily life. Motivated by the need to develop robust and interpretable models for the deaf community, this study presents a computer-aided diagnosis (CAD) framework for Arabic SL recognition. The interpretability and management of complicated spatial connections in SL images have been limited by prior studies using convolutional neural networks. To improve accuracy and offer model transparency, the proposed CAD framework incorporates state-of-the-art technologies such as local interpretable model-agnostic explanations (LIME) and vision transformers (ViTs). ViTs use self-attention mechanisms to interpret visuals in SL, capturing global dependencies. A stacking/voting strategy is then used to aggregate predictions from many ViT models, further optimizing the system. Two large datasets, the “ArSL21L: Arabic Sign Language Letter Dataset” and the “RGB Arabic Alphabets Sign Language Dataset,” totaling over 22,000 pictures, were used to validate this approach. Metrics including intersection over union, balanced accuracy, Youden’s index, Yule’s Q, F1 score, accuracy, precision, recall, and specificity were used to assess performance. The results show that the stacking method, which makes use of many ViT models, outperforms traditional models in every performance indicator and achieves an impressive accuracy of 99.46% and 99.88% on the ArSL21L and RGB datasets, respectively. For practical applications, interpretability is ensured by using LIME, which offers clear visual explanations for the model’s predictions.


            INTRODUCTION

Communicating with others is challenging for those who have severe or extensive hearing loss, since they are unable to hear other people. A deaf person’s mental health can be affected significantly by poor communication, which can lead to feelings of sadness, loneliness, and isolation. For persons who are deaf or hard of hearing, sign language (SL) is the primary means of communication in their everyday lives (Dabwan et al., 2023). Hand gestures and facial expressions produce the signs that make SL a complete and complex language (Dabwan et al., 2023). Arabic is the world’s fourth most spoken language. For those with speech and hearing impairments, Arabic SL (ArSL) was approved as the primary language in several Arab countries (Almasre and Al-Nuaim, 2020). ArSL is still in its early stages, despite Arabic being one of the international core languages (Younes et al., 2023). One of the most critical areas of research in computer vision and human–computer interaction is ArSL recognition (ArSLR). With this technology, people who are deaf or hard of hearing (hereafter referred to as mute) seek to close the communication gap and participate more fully in society. The term “mute” often denotes individuals who are entirely or partially unable to use vocal speech; however, it is vital to approach this term with sensitivity and awareness of its evolving context within disability discourse.

In the Kingdom of Saudi Arabia (KSA), like elsewhere, the mute community represents a significant population segment that encounters numerous challenges in accessing seamless communication. Reliable statistics are essential for understanding the scope of these challenges and driving the development of technological solutions like ArSLR systems. Such systems are not merely academic pursuits but are instrumental in fostering inclusivity, providing mute individuals with tools for education, healthcare, and social interaction. The prevalence of mute conditions within the country underscores the importance of ArSLR in the KSA. According to the latest available data, about 720,000 people in the KSA rely on SL as their primary means of communication (Faisal et al., 2023). According to the World Health Organization, 5% of people worldwide have hearing loss. By 2050, it is estimated that about 900 million people will be deaf globally, and 94 million of them will live in the Eastern Mediterranean (EM) region. Hearing loss costs the global economy $980 billion, of which $30 billion is attributable to the EM region (World Health Organization, 2021). These demographic data reinforce the need for robust ArSLR systems that facilitate effective communication, support autonomous living, and enhance the quality of life for the mute community, motivating researchers to propose new work in this field.

            Research motivation

            Some key drivers for this research are the following:

            • Improving accessibility: Enhancing communication for Arabic-speaking individuals with hearing impairments, allowing for seamless interaction with the hearing world.

            • Assistive technology development: Advancing ArSLR to develop effective assistive technologies, such as real-time translation systems and virtual SL interpreters.

            • Preserving and promoting ArSL: Supporting the documentation, preservation, and promotion of ArSL, contributing to the cultural heritage of Arab communities.

            • Advancing linguistic and cognitive understanding: Providing insights into linguistic and cognitive processes through the study of ArSL, informing broader research in linguistics and cognitive science.

            By advancing ArSLR technology, researchers and technologists have the potential to make a profound impact on the mute community, promoting equal opportunities and aiding in the eradication of communication barriers that have historically marginalized this group. Pursuing such technologies in the KSA is a matter of not only technological innovation but also social responsibility and commitment to equity and accessibility for all.

            Challenges facing ArSLR

            Several challenges face ArSLR research compared to other SL recognition systems:

            • Linguistic complexity: ArSL’s complex grammar, syntax, and rich morphology present significant challenges.

            • Lack of standardization: Regional and dialectal differences in ArSL complicate the development of universal recognition systems.

            • Limited data resources: The scarcity of high-quality, publicly available datasets for ArSL limits the development of robust recognition models.

            • Cultural and social factors: Variations in ArSL usage across Arab nations and between urban and rural groups add complexity.

            Artificial intelligence (AI) is a field that covers multiple disciplines that seek to enhance the creation of computer systems capable of doing tasks that typically require human intelligence (Haque et al., 2023; ZainEldin et al., 2024). Image classification plays a crucial role in various applications by extracting features, comprehending images, and interpreting and evaluating them (Poonguzhali et al., 2023). Convolutional neural networks (CNNs) are a foundational architecture in computer vision, particularly effective for tasks such as image classification and object detection. The popularity of the CNN model is rapidly increasing because of its impressive performance capabilities (Vaiyapuri et al., 2023). Their design, comprising convolutional layers, pooling layers, activation functions, and fully connected layers, allows them to learn and extract features from images. This makes them suitable for applications in ArSL, where precise feature extraction is crucial. However, CNNs have limitations, especially in capturing long-term dependencies and global contextual information, which are essential in SL interpretation. Moreover, CNNs can struggle with variable image sizes, often necessitating additional computational resources for processing (Alsayed et al., 2023). Transformers, with their self-attention mechanisms, address some of these limitations by enabling the modeling of long-range dependencies across input sequences. Originally designed for natural language processing, transformers process entire sequences in parallel, thus efficiently capturing the context and relationships crucial for language tasks (Brour and Benabbou, 2021).

            The vision transformer (ViT) model comprises several key components that distinguish its architecture from conventional CNNs. These components work together to process images as sequences of patches and enable the model to capture global dependencies (Alsulaiman et al., 2023; Zhang et al., 2023):

            • Patch embedding: In ViTs, the input image is divided into fixed-size patches. These patches are then flattened and mapped onto a higher dimensional space through a trainable linear projection to create patch embeddings.

            • Positional encoding: Since the model needs to account for the order of the input, positional encodings are added to the patch embeddings to retain positional information. This is crucial as the transformer architecture, unlike CNNs, does not inherently understand the order of the input sequence.

            • Transformer encoder: This is the core of the ViT; this consists of alternating layers of multi-head self-attention (MSA) mechanisms and feed-forward neural networks.

            • Multi-head self-attention: This mechanism allows the model to focus on different parts of the image and to capture complex relationships between patches regardless of their position in the sequence. It processes the input through multiple attention heads, each able to attend to different parts of the input sequence.

            • Layer normalization: applied before every block in the encoder (both self-attention and feed-forward blocks) and before the final output for stabilizing the training process.

            • Feed-forward networks (FFNs): consist of two layers of linear transformations with a non-linearity in between. The same FFN is applied to each position independently.

            • Classification head: At the top of the ViT, there is a classification head typically consisting of a linear layer that maps the transformer’s output to the desired number of classes. This is applied after the average pooling of the last encoder’s output.

            These components enable the ViT to process images by considering the entire image at once rather than through the local receptive fields used by CNNs, allowing for a more holistic understanding of the visual data.

            • Multi-head attention: The model can simultaneously focus on different image segments.

            • Positional encoding: giving the model information about the position of each patch in the image.

            • Feed-forward neural networks: further processing the information extracted by the attention mechanisms (Alsulaiman et al., 2023; Zhang et al., 2023).
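
            To make these components concrete, the following minimal PyTorch sketch wires together a patch embedding, learnable positional encodings, a small transformer encoder, and a classification head. It is an illustrative toy model only, not the architecture used in this study: a real ViT-Base has 12 encoder layers and a class token, whereas this sketch uses 4 layers and mean-pools the patch tokens.

```python
import torch
import torch.nn as nn

class SimpleViT(nn.Module):
    """Minimal ViT: patch embedding + positional encoding + encoder + classifier."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3,
                 embed_dim=768, depth=4, num_heads=12, num_classes=32):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: a strided convolution flattens each patch and
        # projects it to the embedding dimension (the linear projection).
        self.patch_embed = nn.Conv2d(in_ch, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        # Learnable positional encodings, one per patch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        # Transformer encoder: alternating multi-head self-attention and FFN blocks,
        # with layer normalization applied before each sub-block (norm_first=True).
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Classification head applied after average pooling of the encoder output.
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.patch_embed(x)              # (B, D, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, D) sequence of patch tokens
        x = x + self.pos_embed               # add positional information
        x = self.encoder(x)                  # self-attention over all patches
        return self.head(x.mean(dim=1))      # mean-pool tokens, then classify

logits = SimpleViT()(torch.randn(2, 3, 224, 224))   # -> shape (2, 32)
print(logits.shape)
```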

            The comparison of ViTs to CNNs highlights distinct advantages in the context of ArSLR (Brour and Benabbou, 2021; Alsayed et al., 2023; Alsulaiman et al., 2023; Zhang et al., 2023):

            • Global context understanding: ViTs can process an entire image by attending to all parts simultaneously, which is beneficial for interpreting the comprehensive gestures of ArSL, where context is key.

            • Scalability: Unlike CNNs, ViTs can handle inputs of various sizes more flexibly, as they do not require the image to conform to a specific dimension for processing.

            • Interpretable attention mechanism: The self-attention mechanism in ViTs allows for generating attention maps, which can offer insights into which parts of the image the model considers most important for a prediction.

            • Efficient transfer learning: ViTs have demonstrated their potential in generalizing across different tasks, especially when models pre-trained on large datasets are fine-tuned for specific tasks with less data.

            • Reduced bias to texture: ViTs focus on the entire image rather than local textures, which may lead to more robust generalization beyond the training data.

            In essence, while CNNs have been instrumental in computer vision, ViTs offer a compelling alternative for tasks like ArSL recognition, where understanding the global context and long-range dependencies within an image is critical (Brour and Benabbou, 2021; Alsayed et al., 2023; Alsulaiman et al., 2023; Zhang et al., 2023). This study addresses the nuanced challenge of multiclass classification of ArSL.

            Main contributions

            The main contributions of this study are as follows:

            • Development of a computer-aided diagnosis (CAD) framework: a CAD framework for ArSL classification using ViTs and local interpretable model-agnostic explanations (LIME).

            • Utilization of self-attention mechanisms: capturing global dependencies and optimizing performance through a stacking/voting strategy with multiple ViT models.

            • Provision of clear visual explanations: LIME integration provides clear visual explanations for model predictions, enhancing interpretability.

            • Achieving remarkable accuracy: achieving accuracy rates of 99.46% and 99.88% on the ArSL21L and RGB datasets, respectively, outperforming traditional models in all metrics.

            This paper is structured as follows. The Introduction section introduces the problem and its significance and provides an overview of CNNs, ViTs, and explainable artificial intelligence. The Related Studies section reviews current research on ArSL, identifying research gaps. The Methodology section details the methodology, including materials, methods, datasets, and performance metrics. The Experiments and Discussions section presents experimental results, and the Conclusions and Future Work section concludes the paper, summarizing findings and suggesting future research directions.

            RELATED STUDIES

            An overview of recent research on ArSLR is provided in this section. Several machine learning approaches interpret the image’s semantics and provide precise and accurate descriptions to identify the corresponding sign. Brour and Benabbou (2021) delivered an updated version of their system, ATLASLang, a feed-forward back-propagation artificial neural network-based system that translates Arabic text into ArSL. The system was trained on a dataset of around 9715 distinct sentences of different types (interrogative, affirmative, and imperative), and it was assessed on 73 basic sentences. The system was analyzed using the BLEU score, yielding an average 4-gram score of 0.79. A deep learning (DL) architecture-based real-time ArSLR system was presented by Alsaadi et al. (2022). After choosing a reliable scientific ArSL dataset, the system assembles the top DL architectures from current studies and evaluates them to select the architecture that yields the best results. These procedures allowed them to create a real-time recognition system. The experiments revealed that the AlexNet architecture performed best, attaining an accuracy of 94.81%. Zakariah et al. (2022) created a system that converts a visual hand dataset from ArSL into textual information. The dataset utilized consists of 54,049 images of ArSL alphabets, with 1500 images per class, each representing a distinct meaning through a hand gesture or sign. Several pre-trained models were used in the experiments employing the provided dataset. EfficientNetB4 was the most accurate model tested, with a training accuracy of 98% and a testing accuracy of 95%.

            Luqman (2023) proposed ArabSign, a continuous ArSL dataset. The proposed dataset comprises 9335 samples produced by 6 signers and annotated using ArSL and Arabic language structures. A Kinect V2 camera was used to capture the dataset; for every sentence, the camera concurrently records three types of information: color, depth, and skeleton joint points. Additionally, they proposed employing an encoder–decoder model to benchmark the dataset for continuous ArSLR. The results demonstrated that the encoder–decoder model performed better than the attention mechanism, with an average word error rate of 0.50 compared with 0.62 for the attention method. An ArSL dataset containing 8467 videos of 20 signs performed by various volunteers was presented by Balaha et al. (2023). They used CNNs and recurrent neural networks (RNNs) to create a new video detection and classification technique. They extracted information from video frames using double CNNs and concatenated them to form a sequence. The RNN was employed to determine the association between the sequences and make the final prediction. On the specified dataset, the proposed technique scored 98% and 92% on the validation and testing subsets, respectively. An ML model employing deer hunting optimization was suggested by Al-onazi et al. (2023) for ArSL gesture classification. The model pre-processes input gesture images and uses DenseNet169 to produce feature vectors. A multilayer perceptron (MLP) classifier detects and categorizes the presence of SL gestures, and the deer hunting optimization method is used to optimize the parameters of the MLP model. The proposed model attained a maximum accuracy of 92.88%.

            Dabwan et al. (2023) developed a technique for converting the ArSL-based visual hand dataset into written information. They built the model using EfficientNetB1 scaling and loaded it with weights pre-trained on ImageNet. Using a straightforward yet powerful compound coefficient, the model evenly scales the width, depth, and resolution parameters. The model attained an accuracy of 97.9%. From isolated RGB videos of Moroccan SL (MoSL), Boukdir et al. (2023) devised an encoding–decoding method that produces Arabic phrases at the character level. The architecture of the proposed model comprises a skeleton-pattern-based encoder that extracts spatiotemporal features and an encoder that processes the character sequence, followed by a decoder language model that predicts the output vocabulary. The proposed system was trained using an isolated MoSL dataset consisting of RGB videos of 125 MoSL signals. The hand and integrated body landmarks scored 85.53% and 91.82%, respectively, exceeding the pose-based evaluation, which yielded a score of 73%. For SL translation, Zhang et al. (2023) proposed a neural machine translation architecture with a lightweight dual-stream attention module and a multi-channel attention enhancement technique. They ran several tests on the challenging PHOENIX-Weather-2014T dataset and achieved best BLEU-4 scores of 24.50/25.33 on the development and test sets.

            Using a hybrid DL approach that combines long short-term memory and CNN, Alsolai et al. (2024) proposed an automated method for classifying and detecting SL. The technique uses DL and metaheuristic optimizers to identify and categorize various sign types. Feature vectors are generated using the MobileNet feature extractor, and the manta ray foraging optimization approach is used to tune its hyperparameters. The accuracy of the proposed method was 99.51%. A Qur’anic SL recognition approach was suggested by AbdElghfar et al. (2024) using a CNN. The model is based on the ArSLR system, which employs a subset of the ArSL2018 dataset. The collection included 24,137 images of the ArSL alphabet generated by over 40 people, with 14 letters representing the beginnings of the Qur’anic Surahs. The application identifies ArSL hand gestures that correspond to dashed Qur’anic letters. The model’s testing accuracy was 97.13%, while its training accuracy was 98.05%.

            Research gap

            The review of existing research in ArSLR reveals notable advancements yet identifies key areas for further exploration as follows:

            • Data resource limitations: The scarcity of publicly available, high-quality ArSL datasets hinders robust model development.

            • Adoption of ViTs: The use of ViTs in ArSLR is minimal. Their understanding of complex spatial relationships in images could significantly enhance recognition accuracy but remains underexplored.

            • Optimization through model stacking: Combining multiple ViT models to improve performance in ArSLR is novel and not widely documented, suggesting an area ripe for research.

            • Limited interpretability: Prior studies using CNNs lacked model interpretability tools, limiting prediction transparency.

            • Suboptimal accuracy: CNNs struggled to capture long-range dependencies and global contextual information crucial for interpreting SL.

            • Use of comprehensive evaluation metrics: Expanding the range of evaluation metrics beyond the standard set to include metrics like intersection over union (IoU) and Matthews correlation coefficient (MCC) could provide a deeper understanding of model performance.

            Addressing these gaps can lead to more accurate, interpretable, and reliable ArSLR systems, advancing the technology and its practical application for the deaf community. In this vein, this study utilized two primary large datasets: ArSL21L and RGB Arabic Alphabets Sign Language Dataset. Augmentation is used to increase diversity and improve model generalization. The methodology also adopts ViTs as the primary model architecture for image classification to capture global dependencies and intricate relationships between image patches. A stacking/voting strategy is then used to aggregate predictions from many ViT models, further optimizing the system. The methodology incorporates LIME for model interpretability. Various metrics, including accuracy, precision, recall, specificity, F1 score, IoU, balanced accuracy (BAC), MCC, Youden’s index, and Yule’s Q, are employed to assess different aspects of model performance.

            METHODOLOGY

            This section outlines the proposed methodology for multiclass classification of ArSL using the CAD architecture, incorporating ViTs and LIME. The framework is visually represented in Figure 1.

            Figure 1: The proposed comprehensive CAD framework for the multiclass classification of the sign language. Abbreviations: CAD, computer-aided diagnosis; LIME, local interpretable model-agnostic explanations; MLP, multilayer perceptron; XAI, explainable artificial intelligence.

            Step-by-step representation of the proposed framework

            The step-by-step execution of the proposed framework shown in Figure 2 is as follows:

            Figure 2: Step-by-step representation of the proposed comprehensive CAD framework for the multiclass classification of sign language. Abbreviations: CAD, computer-aided diagnosis; LIME, local interpretable model-agnostic explanations.

            • Dataset selection and preparation: The proposed framework utilizes two primary datasets: ArSL21L (over 14,000 images of 32 letter signs) and RGB Arabic Alphabets Sign Language Dataset (nearly 8000 labeled images of ArSL alphabets). Both datasets are meticulously annotated and publicly accessible.

            • Data pre-processing: The proposed framework applies various data augmentation techniques (random flipping, cropping, rotation, scaling, and translation) to increase diversity and improve model generalization. These techniques prevent overfitting and enhance the model’s adaptability to real-world challenges.

            • Model architecture: The proposed framework adopts ViTs as the primary model for image classification. ViTs utilize tokenization (dividing input images into fixed-size non-overlapping patches) and classification heads (predicting the class of input images based on processed tokens). This approach captures global dependencies and models intricate relationships between image patches, effectively handling the rich vocabulary of ArSL gestures and expressions.

            • Performance measurement: The proposed model performance is evaluated using various metrics: accuracy, precision, recall, specificity, F1 score, IoU, BAC, MCC, Youden’s index, and Yule’s Q. These metrics provide insights into overall correctness, class-wise performance, and robustness to imbalanced datasets, facilitating informed decisions and improvements.

            • Model interpretability: incorporated LIME to elucidate the reasons behind the model’s predictions. LIME creates slight variations around specific instances, observes prediction changes, and fits a simpler, interpretable model to these variations. This process provides clear explanations of the complex model’s decisions, revealing influential features in the decision-making process and enhancing transparency and trustworthiness.
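
            As an illustration of the first two steps, the snippet below loads a class-labeled image folder and wraps it in data loaders. The directory path and the assumption that each dataset is organized into one folder per letter class are hypothetical, not a description of the released datasets’ exact layout.

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Basic transform; augmentation is discussed in the Pre-processing section below.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical path: images arranged as <root>/<letter_class>/<image>.jpg
dataset = datasets.ImageFolder("data/ArSL21L", transform=transform)

# Simple 80/20 train/test split for illustration.
n_train = int(0.8 * len(dataset))
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train],
                                   generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)

print(f"{len(dataset.classes)} classes, "
      f"{len(train_set)} training and {len(test_set)} test images")
```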

            Materials
            ArSL21L: Arabic Sign Language Letter Dataset

            This dataset, which consists of 14,202 images featuring 32 letter signs captured from 50 individuals of diverse backgrounds, was presented and annotated by Batnasan et al. (2022). The dataset is available at https://data.mendeley.com/datasets/f63xhm286w/1.

            RGB Arabic Alphabets Sign Language Dataset

            It is the first publicly accessible RGB dataset of ArSL alphabets, consisting of 7857 meticulously labeled images. It was designed to support the development of practical ArSL classification models. The dataset, gathered from more than 200 participants under varied conditions, including different lighting, backgrounds, image orientations, sizes, and resolutions, was monitored, validated, and filtered by field experts to ensure a high-quality dataset (Al-Barham et al., 2023b). The dataset is available at https://www.kaggle.com/datasets/muhammadalbrham/rgb-arabic-alphabets-sign-language-dataset.

            Samples of the datasets utilized (i.e. ArSL21L and RGB) in the current study are presented in Figure 3. The first row is for the ArSL21L dataset, while the second row is for the RGB dataset. Four samples are shown from each dataset.

            Figure 3: Samples from the utilized datasets: “ArSL21L: Arabic Sign Language Letter Dataset” and “RGB Arabic Alphabets Sign Language Dataset.”

            Pre-processing

            The significance of data augmentation in increasing the diversity of the dataset cannot be overstated. This crucial technique, exemplified by random flipping and cropping methods, is vital in improving the training dataset’s richness and diversity. The utilization of random flipping introduces the model to a spectrum of orientation changes, mirroring the natural variability observed in hand positioning during image acquisition. This exposure contributes substantially to constructing a resilient and adaptable model capable of accurately classifying the different patterns, irrespective of their spatial orientation (Balaha et al., 2023).

            Additionally, the incorporation of random cropping simulates variations in image compositions. The model gains proficiency in recognizing features across different scales and spatial locations by integrating random cropping into the data augmentation pipeline. This enhancement enables effective generalization across the diverse range of images encountered in practical scenarios.

            Furthermore, techniques such as rotation, scaling, and translation can be applied to further augment the dataset. Rotation introduces variations in the orientation of the images, mimicking different perspectives from which the signs are captured. Scaling alters the size of the images, allowing the model to learn from samples captured at varying distances and zoom levels. Translation shifts the position of the images within the frame, replicating the displacements that may occur during image capture.

            Integrating data augmentation techniques, including random flipping, cropping, rotation, scaling, and translation, into the training process yields a twofold advantage. First, it is a preventive measure against overfitting by enriching the dataset with broader image variations, discouraging the model from memorizing specific instances. Second, it increases the model’s adaptability to real-world challenges, where the samples may exhibit various orientations, scales, and spatial distributions. Consequently, data augmentation emerges as an indispensable tool for fortifying the performance and robustness of sign language classification models, ensuring their reliability and generalization capabilities in real-world applications.
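
            As an illustration, the augmentation operations described above could be expressed with torchvision transforms roughly as follows; the specific parameter values are assumptions chosen for illustration, not the settings used in this study.

```python
from torchvision import transforms

# Assumed augmentation pipeline for training images (parameter values are illustrative).
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),              # random flipping
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)), # random cropping + scaling
    transforms.RandomAffine(degrees=15,                  # rotation
                            translate=(0.1, 0.1),        # translation
                            scale=(0.9, 1.1)),           # additional scaling
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Evaluation images are only resized and normalized, without augmentation.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```

            One design caveat worth noting: horizontal flipping mirrors handedness, so whether it is appropriate depends on whether mirrored versions of a sign remain valid for the letters in question.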

            ViT classification

            ViTs represent a groundbreaking departure from the traditional CNN paradigm in computer vision. Unlike CNNs, which have long dominated image processing tasks, ViTs utilize the power of transformers, initially designed for natural language processing tasks. This innovative approach uses transformers to capture global dependencies and intricate relationships between image patches, leading to state-of-the-art performance in various vision tasks (Raghu et al., 2021).

            At the core of ViTs lies the transformer architecture, comprising multiple layers of self-attention mechanisms and feed-forward neural networks (Park and Kim, 2022). The self-attention mechanism, a fundamental component of transformers, enables the model to weigh the importance of different patches when predicting features for a specific patch. Mathematically, the self-attention mechanism is represented in Equation (1), where Q, K, and V denote the query, key, and value matrices, respectively, and d_k represents the dimension of the key vectors. This attention mechanism allows ViTs to capture long-range dependencies and contextual information in images, addressing the limitations of CNNs, which primarily rely on localized receptive fields.

            (1) \text{Attention}(Q,K,V)=\text{SoftMax}\left(\frac{Q\times K^{T}}{\sqrt{d_{k}}}\right)\times V

            In the ViT architecture, the input image is divided into fixed-size, non-overlapping patches, which are then linearly embedded into sequences of vectors. Mathematically, this embedding process is represented in Equation (2), where X_patch represents the patches and W_proj denotes the learnable linear projection matrix. These sequences of vectors serve as the input to the transformer encoder, enabling the model to process the image sequentially.

            (2) X_{\text{linear}}=X_{\text{patch}}\times W_{\text{proj}}^{T}

            The transformer encoder consists of multiple layers comprising self-attention mechanisms and feed-forward neural networks. The output of each layer, as depicted in Equation (3), undergoes normalization and residual connections, allowing the model to iteratively refine the input data representation and capture increasingly complex patterns and relationships.

            (3) \text{Output}_{\text{layer}}=\text{LayerNorm}\left(\text{Attention}(\text{Input}_{\text{layer}})+\text{Input}_{\text{layer}}\right)

            Tokenization and classification heads are essential components within the ViT architecture. Tokenization involves converting the image patches into tokens, which are then processed by the transformer encoder. Mathematically, tokenization is represented in Equation (4), where W_token represents another learnable linear projection matrix. The classification head, often a simple linear layer, generates the final output by predicting the input image class based on the processed tokens. This process allows ViTs to effectively capture spatial and contextual information from images, enabling robust image classification.

            (4) \text{Tokens}=X_{\text{linear}}\times W_{\text{token}}^{T}
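
            To make Equations (1), (2), and (4) concrete, the short NumPy sketch below computes the patch projection, tokenization, and single-head scaled dot-product self-attention for a toy example; all dimensions are arbitrary and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 4 patches, flattened patch length 27 (e.g. 3x3x3), embedding size 8.
num_patches, patch_dim, d_model, d_k = 4, 27, 8, 8

X_patch = rng.normal(size=(num_patches, patch_dim))   # flattened patches
W_proj  = rng.normal(size=(d_model, patch_dim))       # learnable projection (Eq. 2)
W_token = rng.normal(size=(d_model, d_model))         # token projection (Eq. 4)

X_linear = X_patch @ W_proj.T        # Eq. (2): linear patch embeddings, shape (4, 8)
tokens   = X_linear @ W_token.T      # Eq. (4): tokens fed to the encoder, shape (4, 8)

# Single-head self-attention, Eq. (1): queries, keys, and values from the tokens.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = tokens @ W_q.T, tokens @ W_k.T, tokens @ W_v.T

scores  = Q @ K.T / np.sqrt(d_k)                                      # scaled dot products
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
attention_out = weights @ V                                           # shape (4, 8)

print(attention_out.shape)   # (4, 8): one contextualized vector per patch
```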

            In the ArSL context, ViTs offer promising opportunities for improving SL recognition and interpretation. ArSL involves a rich vocabulary of gestures and expressions, which ViTs can effectively model due to their ability to capture global dependencies and intricate patterns. By representing SL as a sequence of images, ViTs can process gestures’ spatial and temporal aspects, facilitating accurate recognition of distinct signs and their contextual nuances.

            The self-attention mechanism of transformers is particularly advantageous for discerning subtle variations in hand movements, facial expressions, and body language elements crucial for SL interpretation. ViTs’ adaptability to different resolutions and their capacity to model complex relationships make them well suited for the diverse and expressive nature of ArSL. Moreover, ViTs can be trained on large datasets of SL videos to learn the nuances and variations in signing, further enhancing their accuracy and robustness in real-world applications.

            In a multilingual and culturally diverse region like the Arabic-speaking world, effective interpretation of ArSL is essential for promoting social inclusion and communication among individuals within the deaf community. ViTs, with their ability to capture global context and nuanced details, hold promise for advancing accessibility and communication in this domain. Through continued research and development, ViTs can contribute to building more inclusive and supportive environments for individuals with hearing impairments, fostering greater understanding and connection within society.

            The ViT versions used in this study, namely Base-P16-224-In21K, Base-P16-224, and Large-P32-384, represent variations of the ViT architecture tailored for different tasks and datasets. The naming convention indicates key specifications such as the patch size (P16 for 16 × 16 patches and P32 for 32 × 32 patches), the input image resolution (224 or 384 pixels), and the model scale (Base or Large). “In21K” signifies pre-training on a large-scale dataset (ImageNet-21k) containing roughly 21,000 classes.

            The Base-P16-224-In21K variant, pre-trained on a diverse dataset with extensive class coverage, demonstrates robust feature extraction capabilities. This pre-training strategy enriches the model’s representation learning, enabling it to capture intricate patterns and features across various visual concepts. As a result, the Base-P16-224-In21K model exhibits high adaptability and generalization performance when fine-tuned on specific downstream tasks, such as SL alphabet classification.

            In contrast, the Base-P16-224 variant, although sharing a similar architecture with Base-P16-224-In21K, lacks the pre-training on the extensive 21K-class dataset. However, it still benefits from the transformer’s inherent capacity to learn hierarchical representations from input images. While potentially not as versatile as its In21K counterpart, this version offers a balance between computational efficiency and task-specific performance, making it suitable for various image classification tasks with moderate to high accuracy requirements.

            The Large-P32-384 model, characterized by a deeper architecture and a larger patch size, introduces additional capacity for capturing more intricate spatial dependencies and semantic information within images. This increased model capacity allows for more comprehensive feature extraction and representation learning, potentially enhancing performance, especially on complex datasets or tasks requiring fine-grained discrimination. However, the larger model size and computational demands increase training time and resource requirements.
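
            The paper does not state which implementation was used to instantiate these variants; as one possible realization, the timm library ships pre-trained checkpoints whose names roughly correspond to the three versions above. The identifiers below are assumptions and may differ between timm releases; num_classes=32 replaces each classification head with a 32-way layer for the ArSL letters.

```python
import timm

NUM_CLASSES = 32  # ArSL letter classes

# Assumed checkpoint identifiers; exact names vary between timm versions.
variant_names = [
    "vit_base_patch16_224_in21k",   # Base-P16-224-In21K
    "vit_base_patch16_224",         # Base-P16-224
    "vit_large_patch32_384",        # Large-P32-384 (expects 384x384 inputs)
]

models = {}
for name in variant_names:
    # pretrained=True downloads the pre-trained weights; the head is re-initialized
    # as a NUM_CLASSES-way linear layer ready for fine-tuning.
    models[name] = timm.create_model(name, pretrained=True, num_classes=NUM_CLASSES)

for name, model in models.items():
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```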

            Performance measurement

            Performance measurement is critical in assessing machine learning models’ effectiveness. Various metrics provide insights into different aspects of model performance, allowing for a comprehensive evaluation. Commonly used metrics include accuracy, precision, recall, specificity, F1 score, IoU, BAC, MCC, Youden’s index, and Yule’s Q. These metrics play a crucial role in quantifying the model’s ability to classify instances correctly and to handle imbalances in the dataset.

            Equation (5) represents the accuracy, which measures the overall correctness of predictions. Precision, as given by Equation (6), focuses on the accuracy of positive predictions, while recall, expressed in Equation (7), emphasizes the model’s ability to capture all positive instances. Specificity, detailed in Equation (8), assesses the model’s capability to identify negative instances correctly. The F1 score, shown in Equation (9), strikes a balance between precision and recall, making it valuable when there is an uneven class distribution.

            IoU, defined in Equation (10), quantifies the spatial overlap between predicted and actual regions, particularly relevant in segmentation tasks. BAC, as per Equation (11), provides a balanced measure of accuracy, considering both positive and negative class performance. MCC, introduced in Equation (12), considers true positives, true negatives, false positives, and false negatives, offering a balanced measure even in imbalanced datasets.

            Youden’s index, expressed in Equation (13), combines sensitivity and specificity, providing an overall measure of discrimination ability. Yule’s Q, detailed in Equation (14), assesses the association between predicted and actual classifications, making it useful in binary classification scenarios. By utilizing this array of performance metrics, practitioners can comprehensively evaluate the strengths and weaknesses of their machine learning models, facilitating informed decisions and improvements.

            (5) \text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}

            (6) \text{Precision}=\frac{TP}{TP+FP}

            (7) \text{Recall}=\frac{TP}{TP+FN}

            (8) \text{Specificity}=\frac{TN}{TN+FP}

            (9) F1=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}

            (10) \text{IoU}=\frac{TP}{TP+FP+FN}

            (11) \text{BAC}=\frac{1}{2}\times\left(\frac{TP}{TP+FN}+\frac{TN}{TN+FP}\right)

            (12) \text{MCC}=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)\times(TP+FN)\times(TN+FP)\times(TN+FN)}}

            (13) \text{Youden's index}=\text{Sensitivity}+\text{Specificity}-1

            (14) \text{Yule's Q}=\frac{(TP\times TN)-(FP\times FN)}{(TP\times TN)+(FP\times FN)}
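
            For reference, the metrics in Equations (5) to (14) can be computed directly from confusion counts. The helper below is a minimal sketch operating on one-vs-rest counts for a single class; in the multiclass setting such counts would typically be computed per class and then averaged.

```python
import math

def metrics_from_counts(tp, tn, fp, fn):
    """Compute Equations (5)-(14) from one-vs-rest confusion counts."""
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)                       # sensitivity
    specificity = tn / (tn + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),                  # Eq. (5)
        "precision":   precision,                                        # Eq. (6)
        "recall":      recall,                                           # Eq. (7)
        "specificity": specificity,                                      # Eq. (8)
        "f1":          2 * precision * recall / (precision + recall),    # Eq. (9)
        "iou":         tp / (tp + fp + fn),                              # Eq. (10)
        "bac":         0.5 * (recall + specificity),                     # Eq. (11)
        "mcc":         (tp * tn - fp * fn) / mcc_den,                    # Eq. (12)
        "youden":      recall + specificity - 1,                         # Eq. (13)
        "yule_q":      (tp * tn - fp * fn) / (tp * tn + fp * fn),        # Eq. (14)
    }

# Toy example: 90 true positives, 950 true negatives, 10 false positives, 5 false negatives.
print(metrics_from_counts(tp=90, tn=950, fp=10, fn=5))
```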

            LIME interpretability

            LIME is a technique designed to interpret the predictions made by complex machine learning models. In ArSLR, LIME elucidates the reasons behind a model’s specific prediction for a given sign. This interpretability is pivotal for maintaining the transparency and trustworthiness of machine learning models, which is particularly significant in applications like SL recognition, where understanding the model’s decision-making process is crucial for users (Aljadani et al., 2023).

            To illustrate how LIME operates in this context, let us consider a machine learning model trained to recognize ArSL. The model takes the features of a sign as input and outputs a prediction indicating the sign’s meaning. LIME helps understand this process by creating slight variations, or perturbations, around a particular instance (a specific sign). It then observes how the model’s predictions change in response to these perturbations. By fitting a simpler, interpretable model to these variations and their corresponding predictions, LIME provides a clear, understandable explanation of how the original complex model arrived at its decision.

            The original instance can be mathematically represented as x and the perturbed instances as x′. The model’s prediction for the original instance is denoted as f(x). LIME works by approximating f with a simpler, interpretable model g around the vicinity of x. The interpretability of this approach is achieved by minimizing the difference between f and g for the perturbed instances x′. This is formalized in Equation (15), where L is a loss function that measures the fidelity of g in approximating f, π_x is a proximity measure that indicates the closeness of the perturbed instances to the original instance, and Ω(g) is a complexity term that penalizes the complexity of the interpretable model g. In the specific case of ArSL, x represents the features of a sign, such as hand shapes, movements, and positions. LIME helps to reveal which of these features are most influential in the model’s prediction, thereby enhancing our understanding of the model’s behavior.

            (15) g=\underset{g}{\arg\min}\; L(f,g,\pi_{x})+\Omega(g)

            By breaking down the components of Equation (15), we have:

            L(f, g, π_x): This term is a loss function quantifying how well the interpretable model g approximates the original model f. It considers the predictions of f for the perturbed instances x′ and measures how closely g can replicate these predictions. The proximity measure π_x ensures that the focus remains on instances close to the original x, making the interpretation locally accurate.

            Ω(g): This is a complexity term that penalizes the complexity of the interpretable model g. The goal is to strike a balance between accuracy and simplicity, ensuring that g is a good approximation of f and easy to understand. For instance, g could be a simple linear model or a decision tree with a limited depth.

            In practical applications, LIME’s interpretability can be invaluable. For example, suppose the model predicts a sign to mean “hello” in an ArSLR system. LIME could reveal that the model bases this prediction primarily on the specific movement of the hand and the position relative to the body. By understanding these influential features, users and developers can gain confidence in the model’s reliability and make informed decisions about its deployment and further improvement.
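
            A minimal sketch of how such an explanation could be generated with the lime package is shown below. The classifier_fn here is a hypothetical stand-in for the trained ViT ensemble (it maps a batch of RGB images to class probabilities), and the image and parameter values are purely illustrative.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

rng = np.random.default_rng(0)

def classifier_fn(images):
    """Hypothetical stand-in for the trained ViT ensemble: maps a batch of RGB
    images (N, H, W, 3) to probabilities over 32 letter classes."""
    brightness = images.reshape(len(images), -1).mean(axis=1)            # crude feature
    logits = np.outer(brightness, np.linspace(-1.0, 1.0, 32)) / 255.0    # fake logits
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Placeholder for a sign image; in practice this would be a real ArSL sample.
image = (rng.random((224, 224, 3)) * 255).astype(np.uint8)

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, classifier_fn,
    top_labels=3,        # explain the three most probable letters
    hide_color=0,        # color used for "switched off" superpixels
    num_samples=200)     # number of perturbed images x' (kept small here)

# Mask of the superpixels that most influenced the top prediction.
img, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=True, num_features=5, hide_rest=False)
overlay = mark_boundaries(img / 255.0, mask)   # visual explanation, as in Figures 6-8
```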

            LIME’s use in this context fosters transparency and facilitates debugging and refining the model. If the explanations provided by LIME reveal unexpected or undesirable patterns in the model’s decision-making process, developers can take steps to address these issues, such as collecting more representative training data or adjusting the model’s architecture.

            In short, LIME is critical in demystifying the predictions of complex machine learning models used for ArSLR. By generating interpretable explanations, LIME enhances these models’ transparency, trustworthiness, and usability, making them more accessible and reliable for real-world applications.

            EXPERIMENTS AND DISCUSSIONS

            The study’s software setup revolves around Python, with Windows 11 as the operating system and Anaconda as the distribution platform. Hardware specifications include an NVIDIA Graphical Processing Unit (GPU) with 8 GB of memory, 256 GB of RAM, and an Intel Core i7 processor.

            The results presented in Table 1 showcase the performance metrics of different models evaluated on the “ArSL21L: Arabic Sign Language Letter Dataset” testing data. All models demonstrate exceptionally high accuracy, with values >99%. Precision values range from approximately 88% to 92%, indicating a high level of confidence in the correctness of positive predictions made by the models.

            Table 1: Performance metrics for different models assessed on the “ArSL21L: Arabic Sign Language Letter Dataset” testing data.

            Model               Accuracy (%)  Precision (%)  Recall (%)  Specificity (%)  F1 (%)  IoU (%)  BAC (%)  MCC (%)  Youden (%)  Yule (%)
            Base-P16-224-In21K  99.22         88.83          87.40       99.60            87.77   78.80    93.50    87.56    87.01       99.88
            Base-P16-224        99.35         89.99          89.66       99.67            89.70   81.63    94.66    89.44    89.33       99.91
            Large-P32-384       99.30         89.32          88.84       99.64            88.87   80.33    94.24    88.62    88.48       99.90
            Stacking            99.46         91.58          91.29       99.72            91.30   84.25    95.51    91.09    91.01       99.94

            The evaluation covers different metrics such as accuracy, precision, recall, specificity, F1, IoU, BAC, MCC, Youden’s index, and Yule’s Q.

            Abbreviations: BAC, balanced accuracy; F1, F1 score; IoU, intersection over union; MCC, Matthews correlation coefficient.

            Recall and specificity values exceed 87% and 99%, respectively, highlighting the models’ effectiveness in capturing positive instances while minimizing false positives. The F1 scores range from approximately 88% to 91%, and the IoU values range from approximately 79% to 84%, indicating a harmonious balance between precision and recall and a substantial overlap between predicted and ground truth regions. Overall, the models demonstrate commendable performance across these metrics.

            Starting with the individual models, “Base-P16-224-In21K” exhibits impressive results with a 99.22% accuracy. It demonstrates high precision (88.83%), recall (87.40%), specificity (99.60%), and F1 score (87.77%). Additionally, it achieves substantial values for IoU (78.80%), BAC (93.50%), MCC (87.56%), Youden’s index (87.01%), and Yule’s Q (99.88%).

            The “Base-P16-224” model performs similarly well, with a slightly higher accuracy of 99.35%. It maintains strong values across precision (89.99%), recall (89.66%), specificity (99.67%), and F1 score (89.70%). The model also excels in terms of IoU (81.63%), BAC (94.66%), MCC (89.44%), Youden’s index (89.33%), and Yule’s Q (99.91%).

            The “Large-P32-384” model achieves a commendable accuracy of 99.30%. It demonstrates competitive precision (89.32%), recall (88.84%), specificity (99.64%), and F1 score (88.87%). The model’s performance is evident in IoU (80.33%), BAC (94.24%), MCC (88.62%), Youden’s index (88.48%), and Yule’s Q (99.90%).

            Furthermore, the “stacking” approach surpasses individual models, reaching an accuracy of 99.46%. It showcases remarkable precision (91.58%), recall (91.29%), specificity (99.72%), and F1 score (91.30%). The model outperforms others in terms of IoU (84.25%), BAC (95.51%), MCC (91.09%), Youden’s index (91.01%), and Yule’s Q (99.94%). Figures 4 and 5 show the evaluation accuracy and loss curves during the different epochs.

            Figure 4: The evaluation accuracy curve during the learning process for the “ArSL21L: Arabic Sign Language Letter Dataset.” The light orange curve represents the records, while the orange curve represents the smoothed evaluation curve.

            Figure 5: The evaluation loss curve during the learning process for the “ArSL21L: Arabic Sign Language Letter Dataset.” The light orange curve represents the records, while the orange curve represents the smoothed evaluation curve.

            Comparing the performance of different models, it is evident that the stacking model outperforms the base and large models across most metrics. The stacking model achieves the highest values among all models evaluated, suggesting that the ensemble approach effectively improves classification performance and generalization ability.
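
            The paper does not spell out how the stacking/voting aggregation is implemented; one common realization is soft voting, where the per-class probabilities of the individual ViT models are averaged before taking the argmax. The sketch below illustrates both soft and hard (majority) voting on placeholder probability arrays.

```python
import numpy as np

# Per-model class probabilities for a batch of test images (placeholders standing in
# for the softmax outputs of Base-P16-224-In21K, Base-P16-224, and Large-P32-384).
rng = np.random.default_rng(0)
probs_per_model = [rng.dirichlet(np.ones(32), size=8) for _ in range(3)]  # 3 x (8, 32)

# Soft voting: average the probabilities, then take the most likely class per image.
ensemble_probs = np.mean(probs_per_model, axis=0)        # (8, 32)
ensemble_pred = ensemble_probs.argmax(axis=1)            # (8,)

# Hard (majority) voting is an alternative: each model casts one vote per image.
votes = np.stack([p.argmax(axis=1) for p in probs_per_model])             # (3, 8)
majority_pred = np.apply_along_axis(lambda v: np.bincount(v, minlength=32).argmax(),
                                    axis=0, arr=votes)                    # (8,)

print(ensemble_pred, majority_pred)
```

            Soft voting retains each model’s confidence, which is often why it edges out hard voting when the base models are reasonably well calibrated.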

            Insight: The results underscore the effectiveness of the models in accurately classifying ArSL letters. The high performance across various metrics demonstrates the potential of these models for practical applications, such as SL recognition systems. Additionally, the comparative analysis highlights the benefits of ensemble learning techniques, emphasizing the importance of exploring diverse model architectures and training strategies to achieve superior performance.

            The results presented in Table 2 illustrate the performance metrics of various models assessed on the “RGB Arabic Alphabets Sign Language Dataset” testing data. All models achieve remarkably high accuracy, exceeding 99%. Precision values range from approximately 96.78% to 98.06%, indicating a high level of confidence in positive predictions made by the models.

            Table 2: Performance metrics for different models assessed on the “RGB Arabic Alphabets Sign Language Dataset” testing data.

            Model               Accuracy (%)  Precision (%)  Recall (%)  Specificity (%)  F1 (%)  IoU (%)  BAC (%)  MCC (%)  Youden (%)  Yule (%)
            Base-P16-224-In21K  99.82         97.26          97.19       99.91            97.19   94.66    98.55    97.12    97.10       99.99
            Base-P16-224        99.81         97.10          97.00       99.91            97.00   94.31    98.45    96.93    96.90       99.99
            Large-P32-384       99.79         96.78          96.68       99.89            96.67   93.67    98.28    96.59    96.57       99.99
            Stacking            99.88         98.06          98.02       99.94            98.01   96.17    98.98    97.96    97.95       100.00

            The evaluation covers different metrics such as accuracy, precision, recall, specificity, F1, IoU, BAC, MCC, Youden’s index, and Yule’s Q.

            Abbreviations: BAC, balanced accuracy; F1, F1 score; IoU, intersection over union; MCC, Matthews correlation coefficient.

            Recall and specificity values surpass 96% and 99%, respectively, demonstrating the models’ ability to capture positive instances while minimizing false positives. The F1 scores range from approximately 96.67% to 98.01%, and the IoU values range from approximately 93.67% to 96.17%, indicating a balanced performance between precision and recall and a substantial overlap between predicted and ground truth regions. The models exhibit commendable performance across these metrics, with values consistently above 93%.

            Starting with the individual models, “Base-P16-224-In21K” showcases exceptional results with an accuracy of 99.82%. It demonstrates high precision (97.26%), recall (97.19%), specificity (99.91%), and F1 score (97.19%). The model also excels in terms of IoU (94.66%), BAC (98.55%), MCC (97.12%), Youden’s index (97.10%), and Yule’s Q (99.99%).

            The “Base-P16-224” model performs similarly well, achieving an accuracy of 99.81%. It maintains strong values across precision (97.10%), recall (97%), specificity (99.91%), and F1 score (97%). Additionally, the model excels in IoU (94.31%), BAC (98.45%), MCC (96.93%), Youden’s index (96.90%), and Yule’s Q (99.99%).

            The “Large-P32-384” model achieves a commendable accuracy of 99.79%. It demonstrates competitive precision (96.78%), recall (96.68%), specificity (99.89%), and F1 score (96.67%). The model’s performance is evident in IoU (93.67%), BAC (98.28%), MCC (96.59%), Youden’s index (96.57%), and Yule’s Q (99.99%).

            Furthermore, the “stacking” approach surpasses individual models, reaching an accuracy of 99.88%. It showcases remarkable precision (98.06%), recall (98.02%), specificity (99.94%), and F1 score (98.01%). The model outperforms others in terms of IoU (96.17%), BAC (98.98%), MCC (97.96%), Youden’s index (97.95%), and Yule’s Q (100%).

            Comparing the performance of different models, it is evident that the stacking model outperforms the base and large models across most metrics. The stacking model achieves the highest values among all models evaluated, indicating its superior classification performance and generalization ability.

            Insight: The results highlight the effectiveness of the models in accurately classifying Arabic alphabets in SL. The high performance across various metrics underscores the potential of these models for practical applications, such as SL recognition systems. Additionally, the comparative analysis demonstrates the benefits of ensemble learning techniques, emphasizing the importance of exploring diverse model architectures and training strategies to achieve superior performance.

            LIME interpretability: In Figure 6, a comprehensive LIME interpretability analysis is presented for a chosen sample from the “ArSL21L: Arabic Sign Language Letter Dataset.” The visual display encompasses the original image, an insightful interpretive explanation, a delineating mask highlighting critical regions or features, and a vivid heatmap providing a nuanced view of the intensity of influence across diverse areas. This analysis focuses specifically on the elucidation of the sign for the Arabic letter “ta,” capturing the intricacies of the hand gesture involving two fingers directed toward the upper left. Figure 7 is similar, but for the letter “ra.” Figure 8 is similar, but for the letter “bb.” These three examples suggest that the stacking approach is not unduly influenced by the background, lighting, or clothing.

            Figure 6: LIME interpretability analysis of a sample from the “ArSL21L: Arabic Sign Language Letter Dataset.” The display includes the original image, an interpretive explanation, a mask pinpointing highlighted regions or features, and a heatmap illustrating the intensity of influence across various areas. Specifically, this elucidates the sign for the letter “ta,” depicting the hand gesture with two fingers directed to the top left. Abbreviation: LIME, local interpretable model-agnostic explanations.

            Figure 7: LIME interpretability analysis of a sample from the “ArSL21L: Arabic Sign Language Letter Dataset.” The display includes the original image, an interpretive explanation, a mask pinpointing highlighted regions or features, and a heatmap illustrating the intensity of influence across various areas. Specifically, this elucidates the sign for the letter “ra,” depicting the gesture of the hand with one finger directed to the top left. Abbreviation: LIME, local interpretable model-agnostic explanations.

            Figure 8: LIME interpretability analysis of a sample from the “ArSL21L: Arabic Sign Language Letter Dataset.” The display includes the original image, an interpretive explanation, a mask pinpointing highlighted regions or features, and a heatmap illustrating the intensity of influence across various areas. Specifically, this elucidates the sign for the letter “bb,” depicting the hand gesture with one finger directed to the top right. Abbreviation: LIME, local interpretable model-agnostic explanations.

            Comparison of related studies

            Mazen and Ezz-Eldin (2024) recently addressed the growing need for ArSLR systems, which are crucial in facilitating communication between deaf individuals and the general population. Despite the prior lack of focus on ArSLR, their work introduced a pioneering image-based approach utilizing You Only Look Once v7 (YOLOv7) to develop an accurate ArSL alphabet detector and classifier. Utilizing the ArSL21L dataset, their proposed YOLOv7-medium model achieved remarkable performance, surpassing both YOLOv5m and YOLOv5l in terms of mAP0.5 and mAP0.5:0.95 scores. Additionally, the YOLOv7-tiny model outperformed YOLOv5s and YOLOv5m in terms of mAP0.5 and mAP0.5:0.95 scores, highlighting its efficacy in ArSLR tasks. Specifically, YOLOv5s exhibited the lowest mAP0.5 and mAP0.5:0.95 scores among the compared models.

            Moreover, Shin et al. (2024) undertook the task of improving communication accessibility for the deaf and hard-of-hearing communities by focusing on Korean SL (KSL) recognition in Korea. Despite previous efforts in SL recognition, little attention had been given to KSL alphabet recognition, resulting in significant performance limitations in existing systems due to ineffective features. The researchers introduced an innovative KSL recognition system employing a strategic fusion approach to address this gap. This approach combined joint skeleton-based handcrafted features and pixel-based ResNet-101 transfer learning features to overcome traditional system limitations. The proposed system consisted of two streams: one focused on capturing hand orientation information within KSL gestures using handcrafted features, while the other employed a DL-based ResNet-101 module to capture hierarchical representations of KSL alphabet signs. Combining information from both streams generated comprehensive representations of KSL gestures, resulting in improved accuracy. Extensive experiments with newly created KSL alphabet datasets and existing benchmark datasets demonstrated the superiority of the fusion approach in achieving high-performance accuracy in both KSL and other SLs, such as ArSL and ASL.
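The fusion idea can be pictured with a short, generic sketch (not the authors’ released code): pooled features from a pretrained ResNet-101 backbone are concatenated with precomputed handcrafted skeleton features before classification. The handcrafted feature dimension and layer sizes below are placeholder assumptions.

import torch
import torch.nn as nn
from torchvision import models

class TwoStreamFusion(nn.Module):
    # Generic two-stream sketch: pixel features from a pretrained ResNet-101
    # are concatenated with precomputed handcrafted (e.g. joint/skeleton)
    # features before a small classification head. Dimensions are placeholders.
    def __init__(self, num_classes, handcrafted_dim=63):
        super().__init__()
        backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        backbone.fc = nn.Identity()  # expose the 2048-d pooled features
        self.backbone = backbone
        self.head = nn.Sequential(
            nn.Linear(2048 + handcrafted_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, images, handcrafted_features):
        pixel_feats = self.backbone(images)                        # (N, 2048)
        fused = torch.cat([pixel_feats, handcrafted_features], 1)  # fuse streams
        return self.head(fused)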

Furthermore, Abdelhadi (2023) addressed the challenges faced by hearing-impaired individuals, as highlighted by the Ministry of Community Development database in the United Arab Emirates (UAE), which reported approximately 3065 people with hearing disabilities (Emirates News Agency—Ministry of Community Development). The communication barrier experienced by this demographic often necessitates the presence of SL interpreters, whose availability may be insufficient to meet the growing demand. Compounding the issue, the absence of a standardized SL dictionary in specialized schools can be attributed to the diglossic nature of the Arabic language, leading to the coexistence of various dialects. Furthermore, limited research on ArSL exacerbates the lack of unification in this domain. To address these challenges, they developed an Emirate SL (ESL) electronic dictionary (e-Dictionary) featuring four key functionalities: Dictation, Alpha Webcam, Vocabulary, and Spelling. Additionally, they curated two datasets comprising letters and vocabulary/sentences, recorded with an Azure Kinect and performed by four Emirati signers with hearing loss, totaling 127 signs and 50 sentences across 708 clips. All signs underwent rigorous review by the head of the Community Development Authority in the UAE for compliance. Integrating techniques such as Google’s Automatic Speech Recognition API, a YOLOv8 model trained on their dataset, and an algorithm inspired by the bag-of-words model, the ESL e-Dictionary demonstrated practical real-time utility on laptops.

Additionally, El Kharoua and Jiang (2024) presented a CNN model for Arabic Alphabet Sign Language (AASL) recognition, utilizing the AASL dataset. Recognizing the importance of communication for the hearing impaired, particularly within the Arabic-speaking deaf community, the study underscored the criticality of SL recognition systems. The proposed methodology yielded remarkable accuracy, with the CNN model achieving 99.9% accuracy on the training set and 97.4% on the validation set. Notably, the study established a high-accuracy AASL recognition model and provided valuable insights into effective dropout strategies.

Moreover, Al-Barham et al. (2023a) addressed communication challenges with the deaf community by integrating AI, which requires proficiency in various SLs. Their research introduced the RGB Arabic Alphabet SL (ArASL) dataset, the first publicly available high-quality RGB dataset for ArSL. Comprising 7856 carefully labeled RGB images of ArSL alphabets, the dataset aimed to facilitate the development of practical ArSL classification models. It was compiled with input from over 200 individuals, considering factors such as lighting conditions, backgrounds, image orientations, sizes, and resolutions, and underwent rigorous validation and filtering by domain experts to ensure reliability. Training four models on the ArASL dataset, the authors found that ResNet-18 achieved the highest accuracy of 96.77%.

Table 3 provides a comparative overview of the current study and related studies focusing on the ArSL21L dataset. Each row lists the study, year of publication, approach used, and corresponding results. Mazen and Ezz-Eldin (2024) employed YOLOv7 and achieved high precision, recall, and mean average precision (mAP) values. Shin et al. (2024) combined joint skeleton-based handcrafted features with pixel-based ResNet-101 transfer learning features, achieving notable accuracy. Abdelhadi (2023) utilized YOLOv5. In comparison, the current study adopted ViTs combined with LIME, achieving an accuracy of 99.46%. This comparison highlights the superior accuracy of the proposed approach relative to existing methods.

            Table 3:

            Comparison between the current study and a set of the related studies regarding the ArSL21L dataset.

Study | Year | Approach | Results
Mazen and Ezz-Eldin (2024) | 2024 | You Only Look Once v7 (YOLOv7) | 0.982 precision, 0.983 recall, and 0.992 mAP
Shin et al. (2024) | 2024 | Joint skeleton-based handcrafted features and pixel-based ResNet-101 transfer learning features | 92.60% accuracy
Abdelhadi (2023) | 2023 | YOLOv5 | 0.9787 precision, 0.9766 recall, and 0.8306 mAP
Current study | 2024 | Vision transformer + LIME | 99.46% accuracy (Table 1)

            Abbreviations: LIME, local interpretable model-agnostic explanations; mAP, mean average precision.

            Table 4 provides a comparative analysis between the current study and related research efforts on the RGB Arabic Alphabets Sign Language Dataset. It highlights key aspects such as the year of the study, the approach employed, and the achieved results in terms of accuracy. El Kharoua and Jiang (2024) utilized a CNN model, achieving an accuracy of 97.4%. Al-Barham et al. (2023a) employed a ResNet-18 model, resulting in a 96.77% accuracy. In contrast, the current study, conducted in 2024, adopted a ViT combined with LIME, achieving a significantly higher accuracy of 99.88%. This comparison underscores the effectiveness of the ViT approach in SL recognition on the RGB Arabic Alphabets Sign Language Dataset, outperforming previous methods in terms of accuracy.

            Table 4:

            Comparison between the current study and a set of the related studies regarding the RGB Arabic Alphabets Sign Language Dataset.

Study | Year | Approach | Results
El Kharoua and Jiang (2024) | 2024 | CNN model | 97.4% accuracy
Al-Barham et al. (2023a) | 2023 | ResNet-18 model | 96.77% accuracy
Current study | 2024 | Vision transformer + LIME | 99.88% accuracy (Table 2)

            Abbreviations: CNN, convolutional neural network; LIME, local interpretable model-agnostic explanations.

            Data augmentation alternatives

            In this study, we have utilized traditional data augmentation techniques, such as random flipping, cropping, rotation, scaling, and translation, to enhance our model’s generalization ability. These techniques have been widely adopted in image classification and have demonstrated significant improvements in model performance by enriching the training dataset with diverse variations of the input images.
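As a concrete illustration, a minimal torchvision sketch of such a pipeline is shown below; the 224 × 224 input size and the specific parameter values are assumptions for illustration rather than our exact experimental settings.

from torchvision import transforms

# Illustrative traditional augmentation pipeline (flipping, cropping,
# rotation, scaling, translation); sizes and ranges are assumed values.
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # random flipping
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # cropping
    transforms.RandomAffine(degrees=15,                   # rotation
                            translate=(0.1, 0.1),         # translation
                            scale=(0.9, 1.1)),            # scaling
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

In practice, a transform of this kind is attached to the training dataset so that each epoch presents slightly different random variations of the same images.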

We opted for traditional augmentation methods for several reasons. First, traditional augmentation techniques are well established and widely understood, making them more accessible for implementation and interpretation; this keeps our approach transparent and reproducible, in line with principles of scientific rigor. Second, traditional augmentation methods are computationally efficient, unlike generative alternatives, which can be computationally intensive and time-consuming, especially for large datasets. This consideration is crucial in resource-constrained settings where computational resources are limited.

Moreover, traditional augmentation techniques offer a high degree of control over the augmentation process, allowing us to tailor the augmentation parameters to the specific characteristics of our dataset and task. This level of control enables us to fine-tune the augmentation strategy to maximize its effectiveness in improving model generalization while minimizing the risk of introducing irrelevant or misleading variations. Additionally, traditional augmentation methods avoid the mode collapse and instability issues commonly associated with training generative models, ensuring a more stable and reliable training process.

While traditional augmentation methods are well established and widely understood, we acknowledge the potential of alternative approaches, such as generative adversarial networks (GANs), for generating synthetic images. GANs can create synthetic samples that closely mimic the real data distribution, thereby adding diversity to the training dataset.

            In future work, we plan to explore implementing GAN-based augmentation techniques to enhance our model’s generalization capabilities further. By utilizing GANs, we may generate more complex and diverse synthetic images, augmenting our dataset with additional variations that may not be easily achievable using traditional augmentation methods alone.

            CONCLUSIONS AND FUTURE WORK

This study introduces a novel CAD framework for multiclass classification of ArSL, addressing limitations of traditional CNNs by enhancing model robustness and interpretability for the deaf community. By combining ViTs with LIME, the framework improves accuracy and transparency, capturing global image dependencies through self-attention mechanisms. A stacking/voting scheme and data augmentation techniques such as flipping, cropping, and rotation further strengthen model generalization and resilience to varied SL presentations. Evaluated on the “ArSL21L: Arabic Sign Language Letter Dataset” and the “RGB Arabic Alphabets Sign Language Dataset,” totaling over 22,000 images, the framework demonstrated exceptional performance. Key metrics included accuracy, precision, recall, specificity, F1 score, IoU, BAC, MCC, Youden’s index, and Yule’s Q, with the framework achieving accuracies of 99.46% on the ArSL21L dataset and 99.88% on the RGB dataset, a clear improvement over traditional models across all performance indicators. The inclusion of LIME provides crucial interpretability through clear visual explanations of model predictions, which is essential for practical applications. Future work should address real-time interpretation, integration with wearable technologies for more natural interaction, and advanced data augmentation techniques such as GANs to further improve model robustness and adaptability, so that ArSLR technology better serves the needs of the deaf community.
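As a closing illustration of the voting component summarized above, the following minimal sketch shows soft voting (probability averaging) over several trained ViT classifiers; it is a simplified stand-in and does not reproduce the full stacking configuration used in our experiments.

import torch

@torch.no_grad()
def soft_vote(vit_models, images):
    # Average the softmax outputs of several independently trained ViT
    # classifiers (assumed to share the same label ordering and input
    # preprocessing) and return the winning class per image.
    probs = [torch.softmax(model(images), dim=1) for model in vit_models]
    return torch.stack(probs).mean(dim=0).argmax(dim=1)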

            COMPETING INTERESTS

All authors confirm that this manuscript has not been submitted to, and is not under review at, another journal or other publishing venue. The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

            REFERENCES

1. AbdElghfar HA, Ahmed AM, Alani AA, AbdElaal HM, Bouallegue B, Khattab MM, et al. 2024. QSLRS-CNN: Qur’anic sign language recognition system based on convolutional neural networks. Imaging Sci. J. Vol. 72(2):254–266

            2. Abdelhadi AA. 2023. Interactive emirate sign language e-dictionary based on deep learning recognition models. https://scholarworks.uaeu.ac.ae/all_theses/1022/

3. Al-Barham M, Alomari OA, Elnagar A. 2023a. Arabic sign language alphabet classification via transfer learning. International Conference on Emerging Trends and Applications in Artificial Intelligence; Springer. Turkey. p. 226–237

4. Al-Barham M, Alsharkawi A, Al-Yaman M, Al-Fetyani M, Elnagar A, SaAleek AA, et al. 2023b. RGB Arabic alphabets sign language dataset. https://www.kaggle.com/datasets/muhammadalbrham/rgb-arabic-alphabets-sign-language-dataset

            5. Aljadani A, Alharthi B, Farsi MA, Balaha HM, Badawy M, Elhosseini MA. 2023. Mathematical modeling and analysis of credit scoring using the LIME explainer: a comprehensive approach. Mathematics. Vol. 11(19):4055

            6. Almasre MA, Al-Nuaim H. 2020. A comparison of Arabic sign language dynamic gesture recognition models. Heliyon. Vol. 6(3):e03554

7. Al-onazi BB, Nour MK, Alshahran H, Elfaki MA, Alnfiai MM, Marzouk R, et al. 2023. Arabic sign language gesture classification using deer hunting optimization with machine learning model. Comput. Mater. Contin. Vol. 75(2):3413–3429

            8. Alsaadi Z, Alshamani E, Alrehaili M, Alrashdi AAD, Albelwi S, Elfaki AO. 2022. A real time Arabic sign language alphabets (ArSLA) recognition model using deep learning architecture. Computers. Vol. 11(5):78

            9. Alsayed A, Qadah TM, Arif M. 2023. A performance analysis of transformer-based deep learning models for Arabic image captioning. J. King Saud Univ. Comput. Inf. Sci. Vol. 35(9):101750

            10. Alsolai H, Alsolai L, Al-Wesabi FN, Othman M, Rizwanullah M, Abdelmageed AA. 2024. Automated sign language detection and classification using reptile search algorithm with hybrid deep learning. Heliyon. Vol. 10(1):e23252

11. Alsulaiman M, Faisal M, Mekhtiche M, Bencherif M, Alrayes T, Muhammad G, et al. 2023. Facilitating the communication with deaf people: building a largest Saudi sign language dataset. J. King Saud Univ. Comput. Inf. Sci. Vol. 35(8):101642

12. Balaha MM, El-Kady S, Balaha HM, Salama M, Emad E, Hassan M, et al. 2023. A vision-based deep learning approach for independent-users Arabic sign language interpretation. Multimed. Tools Appl. Vol. 82(5):6807–6826

13. Batnasan G, Gochoo M, Otgonbold ME, Alnajjar F, Shih TK. 2022. ArSL21L: Arabic Sign Language Letter Dataset benchmarking and an educational avatar for Metaverse applications. 2022 IEEE Global Engineering Education Conference (EDUCON); IEEE. Tunisia. p. 1814–1821

            14. Boukdir A, Benaddy M, El Meslouhi O, Kardouchi M, Akhloufi M. 2023. Character-level Arabic text generation from sign language video using encoder–decoder model. Displays. Vol. 76:102340

            15. Brour M, Benabbou A. 2021. ATLASLang NMT: Arabic text language into Arabic sign language neural machine translation. J. King Saud Univ. Comput. Inf. Sci. Vol. 33(9):1121–1131

            16. Chadha S, Kamenov K, Cieza A. 2021. The world report on hearing. Bull World Health Organ. Vol. 99(4):242–242A. [Cross Ref]

17. Dabwan BA, Jadhav ME, Ali YA, Olayah FA. 2023. Arabic sign language recognition using EfficientnetB1 and transfer learning technique. 2023 International Conference on IT Innovation and Knowledge Discovery (ITIKD); IEEE. Bahrain. p. 1–5

            18. El Kharoua R, Jiang X. 2024. Deep learning recognition for Arabic alphabet sign language RGB dataset. J. Comput. Commun. Vol. 12(3):32–51

19. Faisal M, Alsulaiman M, Mekhtiche M, Bencherif MA, Algabri M, Alrayes TBS, et al. 2023. Enabling two-way communication of deaf using Saudi sign language. IEEE Access. Vol. 11:135423–135434

            20. Haque MA, Ahmad S, Sonal D, Haque S, Kumar K, Rahman M. 2023. Analytical studies on the effectiveness of IoMT for healthcare systems. Iraqi J. Sci. Vol. 64(9):4719–4728

21. Luqman H. 2023. ArabSign: a multi-modality dataset and benchmark for continuous Arabic sign language recognition. 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG); IEEE. USA. p. 1–8

            22. Mazen F, Ezz-Eldin M. 2024. A novel image-based Arabic hand gestures recognition approach using YOLOv7 and ArSL21L. Fayoum Univ. J. Eng. Vol. 7(1):40–48

23. Park N, Kim S. 2022. How do vision transformers work? arXiv preprint arXiv:2202.06709. [Cross Ref]

24. Poonguzhali R, Ahmad S, Thiruvannamalai Sivasankar P, Anantha Babu S, Joshi P, Joshi GP, et al. 2023. Automated brain tumor diagnosis using deep residual U-net segmentation model. Comput. Mater. Contin. Vol. 74(1):2179–2194

            25. Raghu M, Unterthiner T, Kornblith S, Zhang C, Dosovitskiy A. 2021. Do vision transformers see like convolutional neural networks? Adv. Neural Inf. Process. Syst. Vol. 34:12116–12128

            26. Shin J, Miah ASM, Akiba Y, Hirooka K, Hassan N, Hwang YS. 2024. Korean sign language alphabet recognition through the integration of handcrafted and deep learning-based two-stream feature extraction approach. IEEE Access. Vol. 12:68303–68318

            27. Vaiyapuri T, Jaiganesh M, Ahmad S, Abdeljaber HAM, Yang E, Jeong S-Y. 2023. Ensemble learning driven computer-aided diagnosis model for brain tumor classification on magnetic resonance imaging. IEEE Access. Vol. 11:91398–91406

            28. Younes SM, Gamalel-Din SA, Rohaim MA, Elnabawy MA. 2023. Automatic translation of Arabic text to Arabic sign language using deep learning. J. Al-Azhar Univ. Eng. Sect. Vol. 18(68):566–579

29. ZainEldin H, Gamel SA, Talaat FM, Aljohani M, Baghdadi NA, Malki A, et al. 2024. Silent no more: a comprehensive review of artificial intelligence, deep learning, and machine learning in facilitating deaf and mute communication. Artif. Intell. Rev. Vol. 57(7):188

            30. Zakariah M, Alotaibi YA, Koundal D, Guo Y, Mamun Elahi M. 2022. Sign language recognition for Arabic alphabets using transfer learning technique. Comput. Intell. Neurosci. Vol. 2022:4567989

            31. Zhang H, Sun Y, Liu Z, Liu Q, Liu X, Jiang M, et al.. 2023. Heterogeneous attention based transformer for sign language translation. Appl. Soft Comput. Vol. 144:110526

            Author and article information

Journal
Journal of Disability Research (jdr)
King Salman Centre for Disability Research (Riyadh, Saudi Arabia)
ISSN: 1658-9912
Published: 02 November 2024
Volume 3, Issue 8: e20240092
            Affiliations
            [1 ]Nursing Management and Education Department, College of Nursing, Princess Nourah bint Abdulrahman University, Riyadh 11671, Saudi Arabia
            [2 ]Computer Engineering Department, Misr Higher Institute for Engineering and Technology, Mansoura 35516, Egypt
            [3 ]Computers and Control Systems Engineering Department, Faculty of Engineering, Mansoura University, Mansoura 46421, Egypt
            [4 ]Department of Electrical Engineering, College of Engineering, Taif University, Taif 21944, Saudi Arabia
            [5 ]Computer Science Department, College of Computer Science and Engineering, Taibah University, Yanbu 46421, Saudi Arabia
            [6 ]Department of Computer Science and Informatics, Applied College, Taibah University, Al-Madinah Al-Munawwarah, Madinah 41461, Saudi Arabia
            Author notes
Correspondence to: Mahmoud Badawy*, e-mail: mmbadawy@taibahu.edu.sa, Tel.: 00966565390037
            Author information
            https://orcid.org/0000-0002-0120-3235
            Article
            10.57197/JDR-2024-0092
            2024 The Author(s).

            This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY) 4.0, which permits unrestricted use, distribution and reproduction in any medium, provided the original author and source are credited.

History: 09 May 2024; 16 July 2024; 16 July 2024
            Page count
            Figures: 8, Tables: 4, References: 31, Pages: 15
            Funding
            Funded by: King Salman Center for Disability Research
            Award ID: KSRG-2023-435
            The authors extend their appreciation to the King Salman Center for Disability Research for funding this work through Research Group no. KSRG-2023-435 (funder ID: http://dx.doi.org/10.13039/501100019345).

Keywords: vision transformers, computer-aided diagnosis, deep learning, sign language
