INTRODUCTION
According to the World Health Organization (WHO), approximately 466 million people worldwide are affected by deafness and hearing impairments (Rastgoo et al., 2021). This demographic consists of 432 million adults and 34 million minors, together constituting 6.1% of the global population. Projections from the WHO in 2021 (Barioul et al., 2020) suggest that the number of individuals facing these challenges will exceed 900 million, or roughly 10% of the world's population, by 2050. These statistics underscore the critical importance of sign language and of the advances achieved in machine-based sign language translation systems (Camgoz et al., 2020). Accordingly, substantial scholarly effort has been dedicated to the study of sign language recognition.
Sign language, which was created to assist those with hearing and speech impairments, is an indispensable means of communication (Suneetha et al., 2021). Communication within a community typically relies on the written symbols and phonemes that constitute a spoken or written language (Rahman et al., 2019). Individuals with hearing and speech impairments, however, cannot use such conventional languages; instead, they depend on sign language. In contrast to spoken languages, sign languages lack a universally applied standard form, resulting in notable discrepancies among nations and geographical regions. Geographical divisions consequently produce sign languages that are not mutually intelligible.
American Sign Language (ASL), Indian Sign Language (ISL), and Italian Sign Language are just a few of the more than 120 distinct sign languages in existence worldwide (Aly et al., 2019). The authors opted to use ISL as the representative example in the system described here. The dataset generated for this system consists of 11 static signs, totaling 630 samples. These signs represent frequently used phrases such as "hello," "goodbye," and "good morning," among others. The system accepts natural gesture inputs and, after a sequence of preprocessing and processing stages, predicts the word or phrase represented by each gesture.
Conventional approaches to sign language recognition frequently rely on sensor-based systems, which require users to wear specialized equipment such as sensor-equipped gloves or colored gloves (Wadhawan and Kumar, 2020). These garments convey information to a motion capture system. Although effective, such sensor-based methods can be costly and intricate, requiring complex hardware configurations. Furthermore, these sensors are impractical for daily use because the user must wear them, they rely on a constant power supply, and they frequently involve bulky cables and additional apparatus (Wen et al., 2021). Vision-based methodologies, on the other hand, provide a more intuitive alternative. They use image processing and computer vision algorithms to analyze videos and images (Mittal et al., 2019), and machine learning and deep learning methods are then applied to classify and predict the processed data. A variety of neural network architectures, including the Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), form the foundation of vision-based systems. These technologies enable more natural and convenient sign language recognition, eliminating the need for specialized wearable equipment.
Research contribution
The proposed research offers several significant contributions in the field of sign language recognition, detailed as follows:
A thorough exploration of existing literature in sign language recognition is conducted to identify unaddressed research areas, thereby aiding in the formulation of specific research problems.
An advanced deep learning framework is proposed, integrating Custom Convolutional Neural Network (CCNN) and Temporal Convolutional Neural Network (TCNN). This model is designed to effectively classify sign language gestures, even in scenarios where the signer is not in close proximity.
The research encompasses a series of experiments designed to assess the efficacy of the proposed framework. The outcomes of these experiments are juxtaposed with the most recent advancements in the field. Empirical evidence strongly indicates that the proposed methodology surpasses existing methods in terms of performance, representing a noteworthy advancement in sign language recognition technology.
The remaining sections of the paper are structured as follows: Literature Review section provides an overview of pertinent research. The Proposed Methodology section offers a concise description of the proposed CNN-TCN. In the Experimental Results and Analysis section, we delve into the experimental evaluation and subsequent discussion. Finally, the Conclusion section concludes with a summary and outlines future directions for our work.
LITERATURE REVIEW
Over the past several decades, researchers have developed a variety of methods for recognizing sign language. This section offers an extensive analysis, focusing on the fundamental methodologies, applicable language domains, and the effectiveness of select cutting-edge techniques. Additionally, for broader insight, Table 1 presents a summarized overview of various other sign language recognition approaches.
Existing sign language recognition models.
Reference | Methodology | Dataset | Language | Accuracy (%) |
---|---|---|---|---|
Jiang et al. (2021) | CNN with 3D, RGB Modalities with GCN | Turkish Sign Language | Turkish Language | 92 |
Barbhuiya et al. (2021) | CNN and VGG16 | Hand Gesture Recognition Dataset | English Language | 94 |
Hasan et al. (2020) | Inception V3 architecture | English characters and digits | English Language | 97 |
Kembuan et al. (2020) | CNN architecture with the TensorFlow library | BISINDO Dataset | Indonesian Language | 98 |
Alawwad et al. (2021) | Faster Region-based Convolutional Neural Network (R-CNN), ResNet, VGG16 | Image-based Arabic Sign Language (ArSL) Dataset | Arabic Language | 92 |
Kamruzzaman (2020) | CNN with Leap Motion or Xbox Kinect | Arabic Hand Gesture Dataset | Arabic Language | 90 |
Althagafi et al. (2020) | CNN with Transfer Learning | RGB Arabic Image Dataset | Arabic Language | 92 |
One study developed a model to facilitate public access to the American Sign Language Lexicon Video Dataset (ASLLVD) (Golestani and Moghaddam, 2020). This extensive dataset comprises more than 3,300 videos of ASL signs, featuring synchronized segments that exhibit a wide range of sign language gestures in various environments and from diverse vantage points. The dataset incorporates comprehensive linguistic annotations, including gloss labels that mark the start and end times of signs, labels that specify the starting and ending hand shapes, and sign type classifications based on articulation and morphology. Each constituent of a compound sign is also annotated. Additional video sequences, footage captured from multiple camera angles, specialized software for extracting skin features, and unique numeric identifiers for each sign complete the dataset. This compilation of materials supports the training of computer vision-based sign language classifiers. In addition to emphasizing the intricacies of linguistic annotation and classification, the authors provide an example of the ASLLVD's use in a computer vision application.
In another study, the authors pioneered a transformer network for sign language recognition from video recordings, incorporating nonmanual elements such as eyebrow and mouth movements (De Coster et al., 2020). Each of these networks was designed with a distinct neural network architecture tailored to discern particular sign language signals. In a separate investigation, the authors introduced a multimodal sign language recognition system (Jiang et al., 2021). Their methodology harnessed a skeleton-based graph technique for the detection of isolated signs, and they introduced the SAM-SLR framework, specifically tailored for isolated sign recognition. Performance evaluations were conducted on the AUTSL dataset to validate their approach. Another study introduced the BLSTM-3DRN model as a dynamic sign language recognition solution (Liao et al., 2019); their approach was rigorously tested on the DEVISIGN_D dataset, which focuses primarily on Chinese sign language.
The authors of another work developed a novel approach for continuous sign language recognition, particularly in the context of syntax formation, by integrating I3D with a ResNet and a B-LSTM framework (Adaloglou and Chatzis, 2022). They applied this model to various datasets, including RGB + D data, and focused on Greek sign language, annotated at three distinct levels. In another study, Sharma et al. (2021) employed a transfer learning approach using a deep CNN architecture: the MobileNetV2 model, pretrained on the ImageNet dataset, was used to identify and classify 35 distinct symbols in ISL, and retraining the unfrozen layers of this model on the ISL dataset yielded significant gains in test performance. A neural network model designed specifically for mobile devices was introduced to distinguish the ASL alphabet from color images (Kasukurthi et al., 2019). By adopting the SqueezeNet architecture, which reduces the number of trainable parameters through 1 × 1 convolution filters, the model remains compact and suitable for deployment on mobile devices. Another study introduced the Temporal Relation Network (TRN), a module for learning and reasoning about temporal dependencies in videos (Zhou et al., 2018). It demonstrated superior performance in activity recognition tasks across three datasets, outperforming existing networks while also providing interpretable insights into visual common-sense knowledge. This methodology yielded a noteworthy accuracy of 83.29%.
The foregoing discussion leads to the conclusion that the majority of current methods for sign language recognition depend heavily on image processing and computational techniques. A common challenge observed is that improper application of deep learning algorithms can result in reduced accuracy, particularly when the signer is not closely positioned. The objective of this study is to create an exceptionally precise sign language classifier meticulously tailored for the recognition of English Sign Language (ESL) signs, encompassing both static and dynamic forms. The fundamental stages of the proposed methodology comprise data collection, preprocessing, feature selection, the establishment of the feature network, and culminate in the classification process.
PROPOSED METHODOLOGY
This section provides a comprehensive examination of the proposed hybrid CNN-TCN model. The hybrid model is constructed by integrating the Temporal Convolutional Network (TCN) and CNN elements to facilitate sign language recognition, as illustrated in Figure 1. This approach effectively combines CNNs’ spatial feature extraction capabilities with TCNs’ temporal sequence modeling. The subsequent section outlines the step-by-step procedure for the proposed model.

Proposed CNN-TCN model for sign language recognition. Abbreviation: CNN-TCN, Convolutional Neural Network-Temporal Convolutional Network.
Data preprocessing
The collected sign language data include videos that cover various signers, backgrounds, lighting conditions, and sign language dialects. In the first step, the data are prepared and cleaned through frame extraction, normalization, and labeling. For frame extraction, an event-driven strategy (Rana et al., 2023) is applied; this is a specialized process in video analysis that is particularly useful in contexts such as sign language recognition, where specific actions are the focus. Initially, an "event" is defined for each video, which could be a specific gesture, a change in hand position, or the start/end of a sign, together with its temporal extent, so that not only the presence of an event but also its duration is identified.
Each video frame or segment is further represented as a feature vector x_t, where t denotes the time or frame index. A function f(x_t) maps the feature vector to a binary decision indicating the presence of an event, as specified in Equation 1. For a given threshold θ, an event is detected whenever f(x_t) > θ, after which a sliding window over the sequence of frames x_{t−n}, …, x_{t+n} is applied to account for the temporal context. Once an event is detected, its start and end frames are identified and extracted for further analysis:

E(t) = 1 if f(x_t) > θ, and E(t) = 0 otherwise.   (1)
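To make the event-driven extraction step concrete, the following sketch illustrates one possible implementation. It assumes that a per-frame scoring function f(x_t) is available and that frames are read with OpenCV; the names frame_score and extract_event_frames, the default threshold, and the context width n are illustrative assumptions, not part of the original system.

```python
import cv2
import numpy as np

def frame_score(frame):
    """Hypothetical per-frame event score f(x_t); a motion- or pose-based score
    could be substituted here. A simple intensity statistic keeps the sketch self-contained."""
    return float(np.mean(frame)) / 255.0

def extract_event_frames(video_path, theta=0.5, n=5):
    """Return (start, end, frames) spans where the score exceeds theta,
    padded by n frames of temporal context on each side."""
    cap = cv2.VideoCapture(video_path)
    scores, frames = [], []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
        scores.append(frame_score(frame))
    cap.release()

    events, start = [], None
    for t, s in enumerate(scores):
        if s > theta and start is None:          # event onset: f(x_t) > theta
            start = t
        elif s <= theta and start is not None:   # event offset
            lo, hi = max(0, start - n), min(len(frames), t + n)
            events.append((lo, hi, frames[lo:hi]))
            start = None
    if start is not None:                        # event still running at the last frame
        events.append((max(0, start - n), len(frames), frames[max(0, start - n):]))
    return events
```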
Noise removal is an important aspect of data preprocessing, as it directly impacts the quality and reliability of the data used for analysis and modeling. By eliminating irrelevant or erroneous information, noise removal ensures that the data reflect accurate insights, thereby improving the performance of machine learning models by preventing overfitting. This step is essential for enabling accurate predictions and informed decision-making, as it allows models to generalize better to new, unseen data rather than memorizing the noise within the training dataset. The complete algorithm is shown in Algorithm 1.
Noise removal from image and video
Given:
    input_video_path: Path to the input video file.
    output_video_path: Path to the output video file.
    kernel_size: Size of the kernel for the median filter, typically an odd integer.
Variables:
    cap: Video capture object for the input video.
    out: Video writer object for the output video.
    frame: A single frame from the input video.
    gray_frame: Grayscale version of frame.
    filtered_frame: Frame after applying the median filter.
    fourcc: Codec used for the output video.
    fps: Frames per second for the output video.
    frame_size: Size of each video frame (width, height).
Initialize:
    cap ← cv2.VideoCapture(input_video_path)
    fourcc ← cv2.VideoWriter_fourcc("X", "V", "I", "D")
    fps ← 20.0 (or the frame rate of cap if it must match the input video)
    frame_size ← (640, 480) (or the frame size of cap if it must match the input video)
    out ← cv2.VideoWriter(output_video_path, fourcc, fps, frame_size, isColor = False) (isColor = False because grayscale frames are written)
Algorithm:
    While cap.isOpened() do
        ret, frame ← cap.read()
        If ret is True then
            gray_frame ← cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            filtered_frame ← cv2.medianBlur(gray_frame, kernel_size)
            out.write(filtered_frame)
        Else
            Break from the loop
    End While
Finalization:
    cap.release()
    out.release()
    cv2.destroyAllWindows()
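For readers who prefer executable code, the pseudocode above can be rendered in Python with OpenCV roughly as follows. This is a minimal sketch, assuming an XVID-encoded output, grayscale median filtering, and the same default values (20 fps, 640 × 480) used in the algorithm; these defaults are illustrative rather than requirements of the proposed system.

```python
import cv2

def denoise_video(input_video_path, output_video_path, kernel_size=5):
    """Apply a median filter to every frame of a video (Algorithm 1 in code form)."""
    cap = cv2.VideoCapture(input_video_path)
    fourcc = cv2.VideoWriter_fourcc(*"XVID")
    fps = cap.get(cv2.CAP_PROP_FPS) or 20.0          # fall back to 20 fps if unavailable
    frame_size = (640, 480)
    # isColor=False because single-channel (grayscale) frames are written after filtering.
    out = cv2.VideoWriter(output_video_path, fourcc, fps, frame_size, isColor=False)

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.resize(frame, frame_size)         # match the writer's frame size
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        filtered = cv2.medianBlur(gray, kernel_size)  # median filter suppresses salt-and-pepper noise
        out.write(filtered)

    cap.release()
    out.release()
```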
Feature extraction using CCNN
The construction of a custom CNN to extract features from video frames for sign language recognition sits at the intersection of machine learning and computer vision (Rana et al., 2023). This method identifies and interprets sign language gestures through the analysis of sequential video frames, a socially significant and challenging task. Algorithm 2 illustrates the overall procedure, which consists of preprocessing the video data, building and training the CNN, and then using the network to extract features that faithfully represent sign language gestures.
Feature extraction based on custom CNN
Input: initialize video frames from the video
for each layer in CNN:
    if layer is convolutional:
        apply convolution: O^l_{ij} = Σ_m Σ_n K^l_{mn} I^l_{(i+m)(j+n)}
        apply ReLU: I^{l+1} = max(0, O^l)
    else if layer is pooling:
        apply max pooling: P_{ij} = max(subregion of I)
    else if layer is fully connected:
        if not last layer:
            for each fully connected layer fc with weights W and bias b do
                apply dense operation: D^{(fc)} = W^{(fc)} v + b^{(fc)}
                apply ReLU: v = max(0, D^{(fc)})
        else:
            apply dense operation: D^{(fc)} = W^{(fc)} v + b^{(fc)}
            apply softmax: output = softmax(D^{(fc)})
return output from the final layer
The architecture of the CNN is pivotal, as it needs to capture the complexity and subtleties of hand movements and facial expressions that are integral to sign language ( Zhang et al., 2020). By leveraging the convolutional layers for automatic feature extraction and hierarchical pattern recognition, the network learns to discern intricate gestures from raw video frames. This sophisticated procedure not only pushes the boundaries of machine learning in the realm of natural language processing but also holds immense potential for bridging communication gaps for the deaf and hard-of-hearing community.
Initially, the input layer accepts raw video frames, where each frame is a three-dimensional (3D) tensor (height, width, color channels). The convolutional layers then extract spatial features from the input frames: filters (or kernels) slide over the input data to perform convolution operations. For a particular image I, the convolution layer can be defined as

O_{ij} = Σ_m Σ_n K_{mn} I_{(i+m)(j+n)},

where O is the output and K is the kernel. The ReLU activation function introduces nonlinearity into the model, enhancing its capacity to discern and learn intricate patterns; for an input x, it is defined as

f(x) = max(0, x).

The pooling layers, based on max pooling, reduce the spatial dimensions such that

P_{ij} = max(K_{ij}),

where K_{ij} is a subregion of the input. The fully connected layers then map the pooled features to the output through

y = Wx + b,

where y represents the output, W the weight matrix, x the input, and b the bias. Finally, batch normalization is applied after the convolutional layers to stabilize and speed up training.
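As an illustration of how such a custom CNN feature extractor might be assembled, the following Keras sketch stacks convolution, ReLU, batch normalization, max pooling, and dense layers as described above. The layer sizes, the 64 × 64 input resolution, and the num_classes and feature_dim values are assumptions made for the example, not the exact configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(64, 64, 3), num_classes=29, feature_dim=128):
    """A sketch of a CCNN feature extractor: Conv + ReLU + BatchNorm + MaxPool blocks,
    followed by dense layers; the penultimate layer serves as the per-frame feature vector."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):                       # three convolutional blocks
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)              # stabilizes and speeds up training
        x = layers.MaxPooling2D(pool_size=2)(x)         # halves the spatial dimensions
    x = layers.Flatten()(x)
    features = layers.Dense(feature_dim, activation="relu", name="frame_features")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(features)
    return models.Model(inputs, outputs)

# The penultimate layer can be reused as the frame-level feature extractor for the TCN stage.
model = build_custom_cnn()
feature_extractor = tf.keras.Model(model.input, model.get_layer("frame_features").output)
```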
Sign language recognition
The TCN receives a sequence of preprocessed frames from a sign language video. Each frame is represented by a set of features, which could include the position, orientation, and movement of hands, fingers, and possibly facial expressions. As the data pass through the TCN’s layers, the network extracts and learns temporal features. Due to the dilated convolutions, the network can capture long-range dependencies, crucial for understanding movements and gestures that span multiple frames. Due to the temporal learning phase in each convolutional layer, the network learns more abstract representations of the input data. Early layers might identify simple movements or positions, while deeper layers integrate this information to recognize complex gestures over time.
The causal nature ensures that the model’s prediction at time t is only influenced by data from time t and earlier, maintaining the temporal sequence’s integrity. Dilations allow the network to have a wider “view” of the input without increasing computational complexity significantly. In the final layers, after extracting and processing the features through its layers, the TCN feeds the data into one or more dense (fully connected) layers that act as classifiers. The dense layers map the learned features to specific sign language gestures or phrases. This is typically done through a softmax function if the task is classification, which provides a probability distribution over the possible sign language classes.
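A minimal TCN classifier along these lines can be sketched in Keras as follows, assuming each video has already been converted into a sequence of per-frame feature vectors (for example, by the CCNN sketch above). The number of filters, the dilation schedule, and seq_len are illustrative choices rather than the reported hyperparameters of the proposed model.

```python
from tensorflow.keras import layers, models

def build_tcn_classifier(seq_len=64, feature_dim=128, num_classes=29):
    """Stacked causal, dilated 1D convolutions over a sequence of frame features,
    followed by dense layers that map the learned temporal features to sign classes."""
    inputs = layers.Input(shape=(seq_len, feature_dim))
    x = inputs
    for dilation in (1, 2, 4, 8):                       # widening receptive field per layer
        x = layers.Conv1D(64, kernel_size=3, padding="causal",
                          dilation_rate=dilation, activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)              # collapse the temporal axis
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # probabilities over sign classes
    return models.Model(inputs, outputs)

tcn = build_tcn_classifier()
tcn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```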
EXPERIMENTAL RESULTS AND ANALYSIS
This section delves into the analysis of the conducted experiment, encompassing the description of performance metrics, baseline methods, dataset details, and the results obtained.
Dataset description
Three different types of dataset have been used for the analysis of the proposed model. The very first dataset (SL-DS-I) is the British Sign Language Open Broadcast Subtitles Large (BOBSL) dataset. It is an extensive collection of British Sign Language (BSL) resources having 1962 episodes. These episodes are paired with English subtitles and span a diverse array of genres—including horror; period dramas; medical shows; historical, natural, and scientific documentaries; sitcoms; and children’s programming—and shows about cooking, beauty, business, and travel. In total, the dataset showcases the work of 39 different signers. The SL-DS-II dataset comprises a series of images depicting the ASL alphabet, organized into 29 distinct folders corresponding to various classes. It includes a training set of 87,000 images, each with a resolution of 200 × 200 pixels. The 29 classes consist of 26 for the letters A-Z, and an additional three classes designated for the signs SPACE, DELETE, and NOTHING.
The last dataset, SL-DS-III, is the American Sign Language MNIST dataset, which is composed of multiple grayscale images, each with a resolution of 28 × 28 pixels. These images represent the ASL alphabet. The dataset details are shown in Table 2.
Dataset description.
Dataset ID | Name | Web link |
---|---|---|
SL-DS-I | British Sign Language Open Broadcast Subtitles Large (BOBSL) | https://paperswithcode.com/dataset/bobsl |
SL-DS-II | Video-based Images of Alphabets from the American Sign Language | https://shorturl.at/HIRXZ |
SL-DS-III | American Sign Language Alphabet Dataset | https://t.ly/yS-vY |
Experimental environment
The proposed model was implemented in Python using the TensorFlow and Keras libraries (Kamruzzaman, 2020). The experiments were carried out on a moderately priced CPU system with a 2.60 GHz Intel Core i5-3230 processor and 8 GB of memory.
Performance metrics
To evaluate the effectiveness of the proposed work, the following performance metrics have been used.
Confusion matrix: a 2 × 2 table that summarizes the four outputs of the trained classifier: True Positives, True Negatives, False Positives, and False Negatives.
The receiver operating characteristic (ROC) curve is a visual tool that demonstrates the capacity of a binary classification model to differentiate between classes.
Recognition speed: the time it takes for the system to identify and process a sign, measured from the moment the sign is presented until the system produces an output (e.g., a recognized sign, word, or phrase). The average recognition speed is computed as

Recognition speed = (1/n) Σ_{i=1}^{n} (T_{end,i} − T_{start,i}),

where n is the total number of signs or sequences tested, T_{start,i} is the start time of the ith sign's recognition process, and T_{end,i} is the end time of the ith sign's recognition process.
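The metrics above can be computed with standard tooling. The snippet below is a hedged example that uses scikit-learn for the confusion matrix and ROC/AUC and a simple timer for recognition speed; y_true, y_score, and recognize_sign are placeholders standing in for the experiment's actual labels, model scores, and inference routine.

```python
import time
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Placeholder ground truth and model scores for a binary decision (illustrative only).
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.55])
y_pred = (y_score > 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)                 # [[TN, FP], [FN, TP]]
fpr, tpr, _ = roc_curve(y_true, y_score)              # points on the ROC curve
roc_auc = auc(fpr, tpr)                               # area under the ROC curve

def recognize_sign(frames):
    """Hypothetical inference routine; replace with the CNN-TCN forward pass."""
    time.sleep(0.01)
    return "hello"

# Average recognition speed over n test sequences, as defined above.
durations = []
for frames in [None] * 10:                            # stand-in for n test sequences
    t_start = time.perf_counter()
    recognize_sign(frames)
    durations.append(time.perf_counter() - t_start)
avg_recognition_speed_ms = 1000 * sum(durations) / len(durations)
print(cm, roc_auc, avg_recognition_speed_ms)
```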
Baseline methods
For the purpose of evaluating the effectiveness and quality of the work that is being suggested, the following baseline approaches have been used:
The study carried out by Butter et al. (2023) presents an algorithm rooted in deep learning that possesses the capability to identify and detect words through the analysis of a person’s gestures.
A human gesture recognition (HGR) system was proposed by Hussain et al. (2023). The system is built on deep learning models, namely CNNs with fine-tuned Inception-v3 and EfficientNet-B0 networks. Their experiments showed that Inception-v3 achieved an accuracy of 90%, a precision of 0.93, a recall of 0.91, and an F1 score of 0.90.
Amin and Rizvi (2023): This paper discusses a smart prototype built on flex, accelerometer, and gyroscope sensors that is intended to record sign gestures. These sensors are attached to a glove to capture and compile datasets consisting of digits (0-10), alphabets (A-Z), and alphanumeric characters.
Environmental setup (real-time environment)
In the experimental evaluation of the CNN-TCN model for sign language recognition, the physical environment setup is meticulously designed to emulate real-world conditions and assess the model’s robustness to noise and occlusions. The experiment begins in a controlled laboratory setting with optimal conditions to establish baseline performance metrics. It then transitions to simulations of real-world environments by introducing background visual noise through variable lighting and irrelevant background movements, and simulating occlusions with objects partially obstructing the view of signing actions. Further complexity is added by testing in outdoor settings under different weather conditions and in public spaces with fluctuating background noise and dynamic backgrounds, using high-resolution cameras to capture detailed video. A diverse group of participants with varying signing styles and proficiency levels ensures the system’s adaptability and generalization capabilities are thoroughly evaluated. This setup aims to replicate the challenges faced in everyday scenarios, providing a comprehensive assessment of the CNN-TCN model’s practical effectiveness and efficiency in sign language recognition.
RESULTS
The first experiment, shown in Figure 2, presents the performance of the CNN-TCN across the different datasets in terms of precision, recall, accuracy, and F-measure. Notably, on the SL-DS-I dataset, the model achieved 94.21% precision, 93.25% recall, 95.25% accuracy, and a 93.69% F-measure. On the SL-DS-II dataset, the model performed even better, attaining a precision of 94.45%, a recall of 93.62%, an F-measure of 93.79%, and an accuracy of 96.02%. On the SL-DS-III dataset, the model demonstrated high efficiency with 93.45% precision, 93.13% recall, 94.67% accuracy, and a 93.22% F-measure. These results collectively underscore the effectiveness and efficiency of the proposed method in recognizing sign language gestures, as evidenced by its strong performance across the evaluation metrics.
The performance of the proposed approach is further assessed in Figure 3 through ROC curves, which plot the true positive rate (TPR) against the false positive rate (FPR) for each dataset. A key metric for evaluating model effectiveness is the area under the curve (AUC) of each ROC curve. Specifically, the model achieves an AUC of 0.81 for the SL-DS-I dataset, 0.83 for the SL-DS-II dataset, and 0.72 for the SL-DS-III dataset. For two of these datasets the ROC curves cover more than 80% of the area, while for the third they cover more than 72%. These AUC values indicate solid performance in distinguishing between positive and negative classes within the datasets, underscoring the model's accuracy and efficiency.

ROC curves: (a) SL-DS-I dataset, (b) SL-DS-II dataset, and (c) SL-DS-III dataset. Abbreviation: ROC curve, receiver operating characteristic curve.
In the first comparative analysis, six different combinations of standard deep learning models each having feature extraction and sign language recognition phases have been compared with CNN-TCN to test their performance. These configurations included a single-layer LSTM, a double-layer LSTM (LSTM-LSTM), a single-layer GRU, a double-layer GRU (GRU-GRU), a combination of GRU followed by LSTM (GRU-LSTM), and a combination of LSTM followed by GRU (LSTM-GRU).
The effectiveness of each model was evaluated using K-fold cross-validation, and the results are summarized in Table 3. Notably, the LSTM-GRU model exhibited the highest accuracy among the six configurations. The development of these models was carried out using the Keras/TensorFlow libraries, widely recognized for their flexibility and efficiency in building deep learning models.
Comparative analysis of CNN-TCN with standard deep learning models.
Model | Precision (%) | Recall (%) | Accuracy (%) | F score (%) |
---|---|---|---|---|
SL-DS-I dataset | ||||
LSTM-LSTM | 80.56 | 79.35 | 81.36 | 79.65 |
GRU-LSTM | 84.65 | 84.36 | 85.34 | 84.36 |
LSTM-GRU | 87.36 | 87.02 | 88.45 | 88.79 |
CNN-TCN | 94.21 | 93.25 | 95.25 | 93.69 |
SL-DS-II dataset | ||||
LSTM-LSTM | 81.65 | 81.25 | 82.69 | 81.65 |
GRU-LSTM | 85.69 | 85.45 | 86.14 | 85.02 |
LSTM-GRU | 88.36 | 88.25 | 89.36 | 88.96 |
CNN-TCN | 94.45 | 93.62 | 96.02 | 93.79 |
SL-DS-III dataset | ||||
LSTM-LSTM | 82.69 | 82.45 | 83.36 | 82.14 |
GRU-LSTM | 85.65 | 85.45 | 86.28 | 85.12 |
LSTM-GRU | 87.36 | 87.12 | 87.96 | 87.03 |
CNN-TCN | 93.45 | 93.13 | 94.67 | 93.22 |
Abbreviation: CNN-TCN, Convolutional Neural Network-Temporal Convolutional Network.
The primary application of the trained models was the identification of sign language. The study suggested that the performance of these models could be further enhanced by expanding the dataset size and including more samples per word. This approach would provide the models with a richer and more varied set of training examples, potentially leading to more effective and accurate sign language detection.
In the experimental setup designed to evaluate the robustness of the proposed CNN-TCN model for sign language recognition, a comprehensive real-world environment test has been conducted. This setup involved two distinct phases: the first under normal, controlled conditions to establish baseline performance metrics and the second under various challenging conditions, introducing factors such as visual noise and partial occlusions to mimic real-world disturbances. For each phase, we meticulously recorded the system’s accuracy and recognition speed, ensuring consistency in data collection methods across all tests. The accuracy was measured as the percentage of signs correctly recognized out of the total signs presented, while recognition speed was quantified as the average time taken from sign presentation to sign recognition.
To compute the robustness of the CNN-TCN model, the percentage change formula comparing the performance metrics (accuracy and recognition speed) obtained under normal conditions with those under the introduced noisy and occluded conditions has been used. This comparative analysis aimed to quantify the model’s resilience to real-world environmental challenges, providing insights into its practical effectiveness and efficiency in diverse settings. Table 4 presents the performance metrics under normal conditions and compares them with metrics under noisy and occluded conditions using the percentage change formula.
Real-world performance of the CNN-TCN using robustness and recognition speed.
Condition | Accuracy (%) | Recognition speed (ms) | Percentage change in accuracy | Percentage change in recognition speed |
---|---|---|---|---|
Normal | 95.0 | 200 | - | - |
With noise | 90.0 | 220 | −5.26% | +10.0% |
With occlusions | 88.0 | 230 | −7.37% | +15.0% |
Noise + occlusions | 85.0 | 250 | −10.53% | +25.0% |
Abbreviation: CNN-TCN, Convolutional Neural Network-Temporal Convolutional Network.
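For clarity, the percentage changes in Table 4 follow the standard relative-change definition with respect to the normal-condition baseline; a worked example for the accuracy under the noise condition is shown below.

\[
\Delta_{\%} = \frac{v_{\text{condition}} - v_{\text{normal}}}{v_{\text{normal}}} \times 100,
\qquad
\Delta_{\text{accuracy, noise}} = \frac{90.0 - 95.0}{95.0} \times 100 \approx -5.26\%.
\]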
The obtained results show that the CNN-TCN model's performance varies across the different environmental conditions while remaining robust in terms of both accuracy and speed. This setup highlights not only the model's capabilities under ideal conditions but also its adaptability and reliability in less-than-optimal environments, which is critical for real-world applications.
In a subsequent experiment, the effectiveness of the proposed model was benchmarked against standard baseline models presented in the Baseline Methods section. From the data in Figure 4, it was observed that the proposed model attained an accuracy of 82.43%. This performance was contrasted with that of two other approaches: Mateen et al., which achieved an accuracy of 78.56%, and Hussain et al., which recorded a slightly higher accuracy of 83.70%. Furthermore, the close proximity of the proposed model’s accuracy to that of Saad et al.’s model also underscores the competitiveness and viability of the proposed CNN-TCN. This comparison highlights the potential of the proposed model in applications where accuracy is a critical metric, especially considering the inherent differences in the underlying methodologies of these models.

Comparative analysis of CNN-TCN with baseline methods. Abbreviation: CNN-TCN, Convolutional Neural Network-Temporal Convolutional Network.
The CNN-TCN model represents a significant advancement in addressing the challenges faced by vision-based sign language recognition systems, particularly those related to sensor limitations and environmental variations. By leveraging the CNN component for sophisticated spatial feature extraction and the TCN for capturing temporal dynamics, this model offers enhanced robustness against common issues such as variable lighting, background clutter, and occlusions. Its ability to discern and prioritize relevant features from complex visual inputs enables effective sign interpretation even in less-than-ideal conditions, reducing the dependency on specialized hardware. This makes the CNN-TCN model not only more adaptable to a range of real-world environments but also more accessible, as it can deliver high performance with widely available camera technology. Furthermore, its scalable architecture ensures that it can be trained to recognize a wide array of signs and gestures, paving the way for broader applications in real-world settings where environmental control is limited. This blend of spatial and temporal analysis capabilities positions the CNN-TCN model as a highly effective solution for overcoming the inherent limitations of sensor-based, vision-driven sign language recognition systems.
The main constraint of the proposed work lies in its significant computational demands; training deep CNNs with multiple layers requires substantial GPU resources, restricting access for entities with limited computational power. Future research will aim to mitigate this limitation by enhancing the efficiency of both training and testing processes.
CONCLUSION
Sign language is an essential form of communication for people with hearing and speech challenges. They depend on visual means, mainly hand gestures and body language, to express their ideas and emotions in everyday interactions. Sign language is typically divided into two primary categories: digits (numbers) and characters (letters). This study introduces an innovative hybrid methodology that merges a TCNN with a CCNN for the automatic recognition of sign language. The efficacy of this system was thoroughly evaluated using three distinct benchmark datasets, encompassing isolated numbers and letters from both American and British sign languages, which are widely accessible and comprehensive resources. The CNN-TCN model incorporates multiple phases, including data collection, preprocessing (frame extraction, normalization, and labeling), feature extraction via the CCNN, and sequence modeling with the TCNN for final recognition. The outcomes of this research highlight the system's high accuracy, precision, recall, and F1 scores, achieving 91.67%, 93.64%, 91.67%, and 91.47%, respectively, for the American Sign Language dataset, and 97.33%, 97.89%, 97.33%, and 97.37% for digit recognition in the BSL dataset. These results validate the effectiveness and practicality of the CNN-TCN model in sign language recognition. Future studies will focus on reducing the model's computational demands by improving the efficiency of the training and testing phases.