INTRODUCTION
According to the World Health Organization (WHO), approximately 466 million people worldwide are affected by deafness and hearing impairments (Rastgoo et al., 2021). This demographic consists of 432 million adults and 34 million minors, together constituting 6.1% of the global population. Projections from the WHO in 2021 (Barioul et al., 2020) suggest that the number of individuals facing these challenges will exceed 900 million, or roughly 10% of the world's population, by 2050. These statistics underscore the critical importance of sign language and of the advances achieved in machine-based sign language translation systems (Camgoz et al., 2020). Accordingly, substantial scholarly effort has been dedicated to the study of sign language recognition.
Sign language, which was created to assist those with hearing and speech impairments, is an indispensable means of communication (Suneetha et al., 2021). Communication within a community typically relies on the written symbols and phonemes that constitute a spoken or written language (Rahman et al., 2019). Individuals with hearing and speech impairments, however, cannot use such conventional languages; instead, they depend on sign language. In contrast to spoken languages, sign languages lack a universally applied standard form, resulting in notable discrepancies among nations and geographical regions. Geographical divisions consequently produce sign languages that are not mutually intelligible.
American Sign Language (ASL), Indian Sign Language (ISL), and Italian Sign Language are just a few of the more than 120 distinct sign languages in existence worldwide (Aly et al., 2019). The authors opted to use ISL as the representative example in the system described here. The dataset generated for this system consists of 11 static signs, totaling 630 samples. These signs represent frequently used phrases such as "hello," "goodbye," and "good morning," among others. The system accepts natural gesture inputs and, after a sequence of preprocessing and processing stages, predicts the word or phrase represented by each gesture.
Conventional approaches to sign language recognition frequently rely on sensor-based systems, which require users to wear specialized equipment such as sensor-equipped gloves or colored gloves (Wadhawan and Kumar, 2020). These garments convey information to a motion capture system. Although effective, such sensor-based methods can be costly and intricate, requiring complex hardware configurations. Furthermore, these sensors are impractical for daily use because the user must wear them, they rely on a constant power supply, and they frequently involve bulky cables and additional apparatus (Wen et al., 2021). Vision-based methodologies, on the other hand, provide a more intuitive alternative. They use image processing and computer vision algorithms to analyze videos and images (Mittal et al., 2019), and machine learning and deep learning methods are then applied to classify and predict the processed data. A variety of neural network architectures, including the Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Recurrent Neural Network (RNN), form the foundation of vision-based systems. These technologies enable more natural and convenient sign language recognition, eliminating the need for specialized wearable equipment.
Research contribution
The proposed research offers several significant contributions in the field of sign language recognition, detailed as follows:
A thorough exploration of existing literature in sign language recognition is conducted to identify unaddressed research areas, thereby aiding in the formulation of specific research problems.
An advanced deep learning framework is proposed, integrating Custom Convolutional Neural Network (CCNN) and Temporal Convolutional Neural Network (TCNN). This model is designed to effectively classify sign language gestures, even in scenarios where the signer is not in close proximity.
The research encompasses a series of experiments designed to assess the efficacy of the proposed framework. The outcomes of these experiments are juxtaposed with the most recent advancements in the field. Empirical evidence strongly indicates that the proposed methodology surpasses existing methods in terms of performance, representing a noteworthy advancement in sign language recognition technology.
The remaining sections of the paper are structured as follows: Literature Review section provides an overview of pertinent research. The Proposed Methodology section offers a concise description of the proposed CNN-TCN. In the Experimental Results and Analysis section, we delve into the experimental evaluation and subsequent discussion. Finally, the Conclusion section concludes with a summary and outlines future directions for our work.
LITERATURE REVIEW
Over the past several decades, researchers have developed a variety of methods for recognizing sign language. This section offers an extensive analysis, focusing on the fundamental methodologies, applicable language domains, and the effectiveness of select cutting-edge techniques. Additionally, for broader insight, Table 1 presents a summarized overview of various other sign language recognition approaches.
Existing sign language recognition models.
Reference | Methodology | Dataset | Language | Accuracy (%) |
---|---|---|---|---|
Jiang et al. (2021) | CNN with 3D, RGB Modalities with GCN | Turkish Sign Language | Turkish Language | 92 |
Barbhuiya et al. (2021) | CNN and VGG16 | Hand Gesture Recognition Dataset | English Language | 94 |
Hasan et al. (2020) | Inception V3 architecture | English characters and digits | English Language | 97 |
Kembuan et al. (2020) | CNN architecture with the TensorFlow library | BISINDO Dataset | Indonesian Language | 98 |
Alawwad et al. (2021) | Faster Region-based Convolutional Neural Network (R-CNN), ResNet, VGG16 | Image-based Arabic Sign Language (ArSL) Dataset | Arabic Language | 92 |
Kamruzzaman (2020) | CNN with Leap Motion or Xbox Kinect | Arabic Hand Gesture Dataset | Arabic Language | 90 |
Althagafi et al. (2020) | CNN with Transfer Learning | RGB Arabic Image Dataset | Arabic Language | 92 |
One study developed a model to facilitate public access to the American Sign Language Lexicon Video Dataset (ASLLVD) (Golestani and Moghaddam, 2020). This extensive dataset comprises more than 3,300 videos of ASL signs, featuring synchronized segments that exhibit a wide range of sign language gestures in various environments and from diverse vantage points. The dataset incorporates comprehensive linguistic annotations, including gloss labels that mark the start and end times of signs, labels that specify the starting and ending hand shapes, and sign type classifications based on articulation and morphology. Each constituent of a compound sign is also annotated. Additional video sequences, footage captured from multiple camera angles, specialized software for extracting skin features, and unique numeric identifiers for each sign complete the dataset. This compilation of materials supports the training of computer vision-based sign language classifiers. In addition to emphasizing the intricacies of linguistic annotation and classification, the authors provide an example of the ASLLVD's use in a computer vision application.
In another study, the authors pioneered a transformer network for sign language recognition from video recordings, incorporating nonmanual elements such as eyebrow and mouth movements (De Coster et al., 2020). Each of these networks was designed with a distinct neural network architecture tailored to discern particular sign language signals. In a separate investigation, the authors introduced a multimodal sign language recognition system (Jiang et al., 2021). Their methodology harnessed a skeleton-based graph technique for the detection of isolated signs, and they introduced the SAM-SLR framework, specifically tailored for isolated sign recognition. Performance evaluations were conducted on the AUTSL dataset to validate their approach. Another study introduced the BLSTM-3DRN model as a dynamic sign language recognition solution (Liao et al., 2019); their approach was rigorously tested on the DEVISIGN_D dataset, which focuses primarily on Chinese sign language.
The authors of another work developed a novel approach for continuous sign language recognition, particularly in the context of syntax formation, by integrating I3D with a ResNet and a B-LSTM framework (Adaloglou and Chatzis, 2022). They applied this model to various datasets, including RGB + D data, and focused on Greek sign language, annotated at three distinct levels. In another study, Sharma et al. (2021) employed a transfer learning approach using a deep CNN architecture: the MobileNetV2 model, pretrained on the ImageNet dataset, was used to identify and classify 35 distinct symbols in ISL, and retraining the unfrozen layers of this model on the ISL dataset yielded significant gains in test performance. A neural network model designed specifically for mobile devices was introduced to distinguish the ASL alphabet from color images (Kasukurthi et al., 2019). By adopting the SqueezeNet architecture, which reduces the number of trainable parameters through 1 × 1 convolution filters, the model remains compact and suitable for deployment on mobile devices. Another study introduced the Temporal Relation Network (TRN), a module for learning and reasoning about temporal dependencies in videos (Zhou et al., 2018). It demonstrated superior performance in activity recognition tasks across three datasets, outperforming existing networks while also providing interpretable insights into visual common-sense knowledge. This methodology yielded a noteworthy accuracy of 83.29%.
The foregoing discussion leads to the conclusion that the majority of current methods for sign language recognition depend heavily on image processing and computational techniques. A common challenge observed is that improper application of deep learning algorithms can result in reduced accuracy, particularly when the signer is not closely positioned. The objective of this study is to create an exceptionally precise sign language classifier meticulously tailored for the recognition of English Sign Language (ESL) signs, encompassing both static and dynamic forms. The fundamental stages of the proposed methodology comprise data collection, preprocessing, feature selection, the establishment of the feature network, and culminate in the classification process.
PROPOSED METHODOLOGY
This section provides a comprehensive examination of the proposed hybrid CNN-TCN model. The hybrid model is constructed by integrating the Temporal Convolutional Network (TCN) and CNN elements to facilitate sign language recognition, as illustrated in Figure 1. This approach effectively combines CNNs’ spatial feature extraction capabilities with TCNs’ temporal sequence modeling. The subsequent section outlines the step-by-step procedure for the proposed model.

Proposed CNN-TCN model for sign language recognition. Abbreviation: CNN-TCN, Convolutional Neural Network-Temporal Convolutional Network.
Data preprocessing
The collected sign language data include videos that cover various signers, backgrounds, lighting conditions, and sign language dialects. In the first step, the data are prepared and cleaned through frame extraction, normalization, and labeling. For frame extraction, an event-driven strategy (Rana et al., 2023) is applied; this is a specialized process in video analysis that is particularly useful in contexts such as sign language recognition, where specific actions are the focus. Initially, an "event" is defined for each video, which could be a specific gesture, a change in hand position, or the start/end of a sign, together with its temporal extent, so that not only the presence of an event but also its duration is identified.
Each video frame or segment is further represented as a feature vector x_t, where t denotes the time or frame index. A function f(x_t) maps the feature vector to a binary decision indicating the presence of an event, as specified in Equation 1. For a given threshold θ, an event is detected whenever f(x_t) > θ, after which a sliding window over the sequence of frames x_{t−n}, …, x_{t+n} is applied to account for the temporal context. Once an event is detected, its start and end frames are identified and extracted for further analysis:

E(t) = 1 if f(x_t) > θ, and E(t) = 0 otherwise.   (1)
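To make the event-driven extraction step concrete, the following sketch illustrates one possible implementation. It assumes that a per-frame scoring function f(x_t) is available and that frames are read with OpenCV; the names frame_score and extract_event_frames, the default threshold, and the context width n are illustrative assumptions, not part of the original system.

```python
import cv2
import numpy as np

def frame_score(frame):
    """Hypothetical per-frame event score f(x_t); a motion- or pose-based score
    could be substituted here. A simple intensity statistic keeps the sketch self-contained."""
    return float(np.mean(frame)) / 255.0

def extract_event_frames(video_path, theta=0.5, n=5):
    """Return (start, end, frames) spans where the score exceeds theta,
    padded by n frames of temporal context on each side."""
    cap = cv2.VideoCapture(video_path)
    scores, frames = [], []
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
        scores.append(frame_score(frame))
    cap.release()

    events, start = [], None
    for t, s in enumerate(scores):
        if s > theta and start is None:          # event onset: f(x_t) > theta
            start = t
        elif s <= theta and start is not None:   # event offset
            lo, hi = max(0, start - n), min(len(frames), t + n)
            events.append((lo, hi, frames[lo:hi]))
            start = None
    if start is not None:                        # event still running at the last frame
        events.append((max(0, start - n), len(frames), frames[max(0, start - n):]))
    return events
```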
Noise removal is an important aspect of data preprocessing, as it directly impacts the quality and reliability of the data used for analysis and modeling. By eliminating irrelevant or erroneous information, noise removal ensures that the data reflect accurate insights, thereby improving the performance of machine learning models by preventing overfitting. This step is essential for enabling accurate predictions and informed decision-making, as it allows models to generalize better to new, unseen data rather than memorizing the noise within the training dataset. The complete algorithm is shown in Algorithm 1.
Noise removal from image and video
Given:
    input_video_path: Path to the input video file.
    output_video_path: Path to the output video file.
    kernel_size: Size of the kernel for the median filter, typically an odd integer.
Variables:
    cap: Video capture object for the input video.
    out: Video writer object for the output video.
    frame: A single frame from the input video.
    gray_frame: Grayscale version of frame.
    filtered_frame: Frame after applying the median filter.
    fourcc: Codec used for the output video.
    fps: Frames per second for the output video.
    frame_size: Size of each video frame (width, height).
Initialize:
    cap ← cv2.VideoCapture(input_video_path)
    fourcc ← cv2.VideoWriter_fourcc("X", "V", "I", "D")
    fps ← 20.0 (or the frame rate of cap if it must match the input video)
    frame_size ← (640, 480) (or the frame size of cap if it must match the input video)
    out ← cv2.VideoWriter(output_video_path, fourcc, fps, frame_size, isColor = False) (isColor = False because grayscale frames are written)
Algorithm:
    While cap.isOpened() do
        ret, frame ← cap.read()
        If ret is True then
            gray_frame ← cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            filtered_frame ← cv2.medianBlur(gray_frame, kernel_size)
            out.write(filtered_frame)
        Else
            Break from the loop
    End While
Finalization:
    cap.release()
    out.release()
    cv2.destroyAllWindows()
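For readers who prefer executable code, the pseudocode above can be rendered in Python with OpenCV roughly as follows. This is a minimal sketch, assuming an XVID-encoded output, grayscale median filtering, and the same default values (20 fps, 640 × 480) used in the algorithm; these defaults are illustrative rather than requirements of the proposed system.

```python
import cv2

def denoise_video(input_video_path, output_video_path, kernel_size=5):
    """Apply a median filter to every frame of a video (Algorithm 1 in code form)."""
    cap = cv2.VideoCapture(input_video_path)
    fourcc = cv2.VideoWriter_fourcc(*"XVID")
    fps = cap.get(cv2.CAP_PROP_FPS) or 20.0          # fall back to 20 fps if unavailable
    frame_size = (640, 480)
    # isColor=False because single-channel (grayscale) frames are written after filtering.
    out = cv2.VideoWriter(output_video_path, fourcc, fps, frame_size, isColor=False)

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frame = cv2.resize(frame, frame_size)         # match the writer's frame size
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        filtered = cv2.medianBlur(gray, kernel_size)  # median filter suppresses salt-and-pepper noise
        out.write(filtered)

    cap.release()
    out.release()
```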
Feature extraction using CCNN
The construction of a custom CNN to extract features from video frames for sign language recognition sits at the intersection of machine learning and computer vision (Rana et al., 2023). This method identifies and interprets sign language gestures through the analysis of sequential video frames, a socially significant and challenging task. Algorithm 2 illustrates the overall procedure, which consists of preprocessing the video data, building and training the CNN, and then using the network to extract features that faithfully represent sign language gestures.
Feature extraction based on custom CNN
Input: initialize video frames from the video
for each layer in CNN:
    if layer is convolutional:
        apply convolution: O^l_{ij} = Σ_m Σ_n K^l_{mn} I^l_{(i+m)(j+n)}
        apply ReLU: I^{l+1} = max(0, O^l)
    else if layer is pooling:
        apply max pooling: P_{ij} = max(subregion of I)
    else if layer is fully connected:
        if not last layer:
            for each fully connected layer fc with weights W and bias b do
                apply dense operation: D^{(fc)} = W^{(fc)} v + b^{(fc)}
                apply ReLU: v = max(0, D^{(fc)})
        else:
            apply dense operation: D^{(fc)} = W^{(fc)} v + b^{(fc)}
            apply softmax: output = softmax(D^{(fc)})
return output from the final layer
The architecture of the CNN is pivotal, as it needs to capture the complexity and subtleties of hand movements and facial expressions that are integral to sign language ( Zhang et al., 2020). By leveraging the convolutional layers for automatic feature extraction and hierarchical pattern recognition, the network learns to discern intricate gestures from raw video frames. This sophisticated procedure not only pushes the boundaries of machine learning in the realm of natural language processing but also holds immense potential for bridging communication gaps for the deaf and hard-of-hearing community.
Initially, the input layer accepts raw video frames, where each frame is a three-dimensional (3D) tensor (height, width, color channels). The convolutional layers then extract spatial features from the input frames: filters (or kernels) slide over the input data to perform convolution operations. For a particular image I, the convolution layer can be defined as

O_{ij} = Σ_m Σ_n K_{mn} I_{(i+m)(j+n)},

where O is the output and K is the kernel. The ReLU activation function introduces nonlinearity into the model, enhancing its capacity to discern and learn intricate patterns; for an input x, it is defined as

f(x) = max(0, x).

The pooling layers, based on max pooling, reduce the spatial dimensions such that

P_{ij} = max(K_{ij}),

where K_{ij} is a subregion of the input. The fully connected layers then map the pooled features to the output through

y = Wx + b,

where y represents the output, W the weight matrix, x the input, and b the bias. Finally, batch normalization is applied after the convolutional layers to stabilize and speed up training.
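As an illustration of how such a custom CNN feature extractor might be assembled, the following Keras sketch stacks convolution, ReLU, batch normalization, max pooling, and dense layers as described above. The layer sizes, the 64 × 64 input resolution, and the num_classes and feature_dim values are assumptions made for the example, not the exact configuration used in this work.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(64, 64, 3), num_classes=29, feature_dim=128):
    """A sketch of a CCNN feature extractor: Conv + ReLU + BatchNorm + MaxPool blocks,
    followed by dense layers; the penultimate layer serves as the per-frame feature vector."""
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):                       # three convolutional blocks
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)              # stabilizes and speeds up training
        x = layers.MaxPooling2D(pool_size=2)(x)         # halves the spatial dimensions
    x = layers.Flatten()(x)
    features = layers.Dense(feature_dim, activation="relu", name="frame_features")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(features)
    return models.Model(inputs, outputs)

# The penultimate layer can be reused as the frame-level feature extractor for the TCN stage.
model = build_custom_cnn()
feature_extractor = tf.keras.Model(model.input, model.get_layer("frame_features").output)
```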
Sign language recognition
The TCN receives a sequence of preprocessed frames from a sign language video. Each frame is represented by a set of features, which could include the position, orientation, and movement of hands, fingers, and possibly facial expressions. As the data pass through the TCN’s layers, the network extracts and learns temporal features. Due to the dilated convolutions, the network can capture long-range dependencies, crucial for understanding movements and gestures that span multiple frames. Due to the temporal learning phase in each convolutional layer, the network learns more abstract representations of the input data. Early layers might identify simple movements or positions, while deeper layers integrate this information to recognize complex gestures over time.
The causal nature ensures that the model’s prediction at time t is only influenced by data from time t and earlier, maintaining the temporal sequence’s integrity. Dilations allow the network to have a wider “view” of the input without increasing computational complexity significantly. In the final layers, after extracting and processing the features through its layers, the TCN feeds the data into one or more dense (fully connected) layers that act as classifiers. The dense layers map the learned features to specific sign language gestures or phrases. This is typically done through a softmax function if the task is classification, which provides a probability distribution over the possible sign language classes.
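A minimal TCN classifier along these lines can be sketched in Keras as follows, assuming each video has already been converted into a sequence of per-frame feature vectors (for example, by the CCNN sketch above). The number of filters, the dilation schedule, and seq_len are illustrative choices rather than the reported hyperparameters of the proposed model.

```python
from tensorflow.keras import layers, models

def build_tcn_classifier(seq_len=64, feature_dim=128, num_classes=29):
    """Stacked causal, dilated 1D convolutions over a sequence of frame features,
    followed by dense layers that map the learned temporal features to sign classes."""
    inputs = layers.Input(shape=(seq_len, feature_dim))
    x = inputs
    for dilation in (1, 2, 4, 8):                       # widening receptive field per layer
        x = layers.Conv1D(64, kernel_size=3, padding="causal",
                          dilation_rate=dilation, activation="relu")(x)
        x = layers.BatchNormalization()(x)
    x = layers.GlobalAveragePooling1D()(x)              # collapse the temporal axis
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)  # probabilities over sign classes
    return models.Model(inputs, outputs)

tcn = build_tcn_classifier()
tcn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```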
EXPERIMENTAL RESULTS AND ANALYSIS
This section delves into the analysis of the conducted experiment, encompassing the description of performance metrics, baseline methods, dataset details, and the results obtained.
Dataset description
Three different types of dataset have been used for the analysis of the proposed model. The very first dataset (SL-DS-I) is the British Sign Language Open Broadcast Subtitles Large (BOBSL) dataset. It is an extensive collection of British Sign Language (BSL) resources having 1962 episodes. These episodes are paired with English subtitles and span a diverse array of genres—including horror; period dramas; medical shows; historical, natural, and scientific documentaries; sitcoms; and children’s programming—and shows about cooking, beauty, business, and travel. In total, the dataset showcases the work of 39 different signers. The SL-DS-II dataset comprises a series of images depicting the ASL alphabet, organized into 29 distinct folders corresponding to various classes. It includes a training set of 87,000 images, each with a resolution of 200 × 200 pixels. The 29 classes consist of 26 for the letters A-Z, and an additional three classes designated for the signs SPACE, DELETE, and NOTHING.
The last dataset, SL-DS-III, is the American Sign Language MNIST dataset, which is composed of multiple grayscale images, each with a resolution of 28 × 28 pixels. These images represent the ASL alphabet. The dataset details are shown in Table 2.
Dataset description.
Dataset ID | Name | Web link |
---|---|---|
SL-DS-I | British Sign Language Open Broadcast Subtitles Large (BOBSL) | https://paperswithcode.com/dataset/bobsl |
SL-DS-II | Video-based Images of Alphabets from the American Sign Language | https://shorturl.at/HIRXZ |
SL-DS-III | American Sign Language Alphabet Dataset | https://t.ly/yS-vY |
Experimental environment
The proposed model was implemented in Python using the TensorFlow and Keras libraries (Kamruzzaman, 2020). The experiments were carried out on a moderately priced CPU system with a 2.60 GHz Intel Core i5-3230 processor and 8 GB of memory.
Performance metrics
To evaluate the effectiveness of the proposed work, the following performance metrics have been used.
Confusion matrix: a 2 × 2 table that summarizes the four outputs of the trained classifier: True Positives, True Negatives, False Positives, and False Negatives.
The receiver operating characteristic (ROC) curve is a visual tool that demonstrates the capacity of a binary classification model to differentiate between classes.
Recognition speed: the time it takes for the system to identify and process a sign, measured from the moment the sign is presented until the system produces an output (e.g., a recognized sign, word, or phrase). The average recognition speed is computed as

Recognition speed = (1/n) Σ_{i=1}^{n} (T_{end,i} − T_{start,i}),

where n is the total number of signs or sequences tested, T_{start,i} is the start time of the ith sign's recognition process, and T_{end,i} is the end time of the ith sign's recognition process.
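The metrics above can be computed with standard tooling. The snippet below is a hedged example that uses scikit-learn for the confusion matrix and ROC/AUC and a simple timer for recognition speed; y_true, y_score, and recognize_sign are placeholders standing in for the experiment's actual labels, model scores, and inference routine.

```python
import time
import numpy as np
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Placeholder ground truth and model scores for a binary decision (illustrative only).
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.55])
y_pred = (y_score > 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)                 # [[TN, FP], [FN, TP]]
fpr, tpr, _ = roc_curve(y_true, y_score)              # points on the ROC curve
roc_auc = auc(fpr, tpr)                               # area under the ROC curve

def recognize_sign(frames):
    """Hypothetical inference routine; replace with the CNN-TCN forward pass."""
    time.sleep(0.01)
    return "hello"

# Average recognition speed over n test sequences, as defined above.
durations = []
for frames in [None] * 10:                            # stand-in for n test sequences
    t_start = time.perf_counter()
    recognize_sign(frames)
    durations.append(time.perf_counter() - t_start)
avg_recognition_speed_ms = 1000 * sum(durations) / len(durations)
print(cm, roc_auc, avg_recognition_speed_ms)
```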
Baseline methods
For the purpose of evaluating the effectiveness and quality of the work that is being suggested, the following baseline approaches have been used:
The study carried out by Butter et al. (2023) presents an algorithm rooted in deep learning that possesses the capability to identify and detect words through the analysis of a person’s gestures.
A human gesture recognition (HGR) system was proposed by Hussain et al. (2023). The system is built on deep learning models, namely CNNs with fine-tuned Inception-v3 and EfficientNet-B0 networks. Their experiments showed that Inception-v3 achieved an accuracy of 90%, a precision of 0.93, a recall of 0.91, and an F1 score of 0.90.
Amin and Rizvi (2023): This paper discusses a smart prototype built on flex, accelerometer, and gyroscope sensors that is intended to record sign gestures. These sensors are attached to a glove to capture and compile datasets consisting of digits (0-10), alphabets (A-Z), and alphanumeric characters.
Environmental setup (real-time environment)
In the experimental evaluation of the CNN-TCN model for sign language recognition, the physical environment setup is meticulously designed to emulate real-world conditions and assess the model’s robustness to noise and occlusions. The experiment begins in a controlled laboratory setting with optimal conditions to establish baseline performance metrics. It then transitions to simulations of real-world environments by introducing background visual noise through variable lighting and irrelevant background movements, and simulating occlusions with objects partially obstructing the view of signing actions. Further complexity is added by testing in outdoor settings under different weather conditions and in public spaces with fluctuating background noise and dynamic backgrounds, using high-resolution cameras to capture detailed video. A diverse group of participants with varying signing styles and proficiency levels ensures the system’s adaptability and generalization capabilities are thoroughly evaluated. This setup aims to replicate the challenges faced in everyday scenarios, providing a comprehensive assessment of the CNN-TCN model’s practical effectiveness and efficiency in sign language recognition.
RESULTS
The first experiment, shown in Figure 2, presents the performance of the CNN-TCN across the different datasets in terms of precision, recall, accuracy, and F-measure. Notably, on the SL-DS-I dataset, the model achieved 94.21% precision, 93.25% recall, 95.25% accuracy, and a 93.69% F-measure. On the SL-DS-II dataset, the model performed even better, attaining a precision of 94.45%, a recall of 93.62%, an F-measure of 93.79%, and an accuracy of 96.02%. On the SL-DS-III dataset, the model demonstrated high efficiency with 93.45% precision, 93.13% recall, 94.67% accuracy, and a 93.22% F-measure. These results collectively underscore the effectiveness and efficiency of the proposed method in recognizing sign language gestures, as evidenced by its strong performance across the evaluation metrics.
The performance of the proposed approach is further assessed in Figure 3 through ROC curves, which plot the true positive rate (TPR) against the false positive rate (FPR) for each dataset. A key metric for evaluating model effectiveness is the area under the curve (AUC) of each ROC curve. Specifically, the model achieves an AUC of 0.81 for the SL-DS-I dataset, 0.83 for the SL-DS-II dataset, and 0.72 for the SL-DS-III dataset. For two of these datasets the ROC curves cover more than 80% of the area, while for the third they cover more than 72%. These AUC values indicate solid performance in distinguishing between positive and negative classes within the datasets, underscoring the model's accuracy and efficiency.

ROC curves: (a) SL-DS-I dataset, (b) SL-DS-II dataset, and (c) SL-DS-III dataset. Abbreviation: ROC curve, receiver operating characteristic curve.
In the first comparative analysis, six different combinations of standard deep learning models each having feature extraction and sign language recognition phases have been compared with CNN-TCN to test their performance. These configurations included a single-layer LSTM, a double-layer LSTM (LSTM-LSTM), a single-layer GRU, a double-layer GRU (GRU-GRU), a combination of GRU followed by LSTM (GRU-LSTM), and a combination of LSTM followed by GRU (LSTM-GRU).
The effectiveness of each model was evaluated using K-fold cross-validation, and the results are summarized in Table 3. Notably, the LSTM-GRU model exhibited the highest accuracy among the six configurations. The development of these models was carried out using the Keras/TensorFlow libraries, widely recognized for their flexibility and efficiency in building deep learning models.
Comparative analysis of CNN-TCN with standard deep learning models.
Model | Precision (%) | Recall (%) | Accuracy (%) | F score (%) |
---|---|---|---|---|
SL-DS-I dataset | ||||
LSTM-LSTM | 80.56 | 79.35 | 81.36 | 79.65 |
GRU-LSTM | 84.65 | 84.36 | 85.34 | 84.36 |
LSTM-GRU | 87.36 | 87.02 | 88.45 | 88.79 |
CNN-TCN | 94.21 | 93.25 | 95.25 | 93.69 |
SL-DS-II dataset | ||||
LSTM-LSTM | 81.65 | 81.25 | 82.69 | 81.65 |
GRU-LSTM | 85.69 | 85.45 | 86.14 | 85.02 |
LSTM-GRU | 88.36 | 88.25 | 89.36 | 88.96 |
CNN-TCN | 94.45 | 93.62 | 96.02 | 93.79 |
SL-DS-III dataset | ||||
LSTM-LSTM | 82.69 | 82.45 | 83.36 | 82.14 |
GRU-LSTM | 85.65 | 85.45 | 86.28 | 85.12 |
LSTM-GRU | 87.36 | 87.12 | 87.96 | 87.03 |
CNN-TCN | 93.45 | 93.13 | 94.67 | 93.22 |
Abbreviation: CNN-TCN, Convolutional Neural Network-Temporal Convolutional Network.
The primary application of the trained models was the identification of sign language. The study suggested that the performance of these models could be further enhanced by expanding the dataset size and including more samples per word. This approach would provide the models with a richer and more varied set of training examples, potentially leading to more effective and accurate sign language detection.
In the experimental setup designed to evaluate the robustness of the proposed CNN-TCN model for sign language recognition, a comprehensive real-world environment test has been conducted. This setup involved two distinct phases: the first under normal, controlled conditions to establish baseline performance metrics and the second under various challenging conditions, introducing factors such as visual noise and partial occlusions to mimic real-world disturbances. For each phase, we meticulously recorded the system’s accuracy and recognition speed, ensuring consistency in data collection methods across all tests. The accuracy was measured as the percentage of signs correctly recognized out of the total signs presented, while recognition speed was quantified as the average time taken from sign presentation to sign recognition.
To compute the robustness of the CNN-TCN model, the percentage change formula comparing the performance metrics (accuracy and recognition speed) obtained under normal conditions with those under the introduced noisy and occluded conditions has been used. This comparative analysis aimed to quantify the model’s resilience to real-world environmental challenges, providing insights into its practical effectiveness and efficiency in diverse settings. Table 4 presents the performance metrics under normal conditions and compares them with metrics under noisy and occluded conditions using the percentage change formula.
Real-world performance of the CNN-TCN using robustness and recognition speed.
Condition | Accuracy (%) | Recognition speed (ms) | Percentage change in accuracy | Percentage change in recognition speed |
---|---|---|---|---|
Normal | 95.0 | 200 | - | - |
With noise | 90.0 | 220 | −5.26% | +10.0% |
With occlusions | 88.0 | 230 | −7.37% | +15.0% |
Noise + occlusions | 85.0 | 250 | −10.53% | +25.0% |
Abbreviation: CNN-TCN, Convolutional Neural Network-Temporal Convolutional Network.
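For clarity, the percentage changes in Table 4 follow the standard relative-change definition with respect to the normal-condition baseline; a worked example for the accuracy under the noise condition is shown below.

\[
\Delta_{\%} = \frac{v_{\text{condition}} - v_{\text{normal}}}{v_{\text{normal}}} \times 100,
\qquad
\Delta_{\text{accuracy, noise}} = \frac{90.0 - 95.0}{95.0} \times 100 \approx -5.26\%.
\]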
The obtained results show that the CNN-TCN model's performance varies across the different environmental conditions while remaining robust in terms of both accuracy and speed. This setup highlights not only the model's capabilities under ideal conditions but also its adaptability and reliability in less-than-optimal environments, which is critical for real-world applications.
In a subsequent experiment, the effectiveness of the proposed model was benchmarked against standard baseline models presented in the Baseline Methods section. From the data in Figure 4, it was observed that the proposed model attained an accuracy of 82.43%. This performance was contrasted with that of two other approaches: Mateen et al., which achieved an accuracy of 78.56%, and Hussain et al., which recorded a slightly higher accuracy of 83.70%. Furthermore, the close proximity of the proposed model’s accuracy to that of Saad et al.’s model also underscores the competitiveness and viability of the proposed CNN-TCN. This comparison highlights the potential of the proposed model in applications where accuracy is a critical metric, especially considering the inherent differences in the underlying methodologies of these models.

Comparative analysis of CNN-TCN with baseline methods. Abbreviation: CNN-TCN, Convolutional Neural Network-Temporal Convolutional Network.
The CNN-TCN model represents a significant advancement in addressing the challenges faced by vision-based sign language recognition systems, particularly those related to sensor limitations and environmental variations. By leveraging the CNN component for sophisticated spatial feature extraction and the TCN for capturing temporal dynamics, this model offers enhanced robustness against common issues such as variable lighting, background clutter, and occlusions. Its ability to discern and prioritize relevant features from complex visual inputs enables effective sign interpretation even in less-than-ideal conditions, reducing the dependency on specialized hardware. This makes the CNN-TCN model not only more adaptable to a range of real-world environments but also more accessible, as it can deliver high performance with widely available camera technology. Furthermore, its scalable architecture ensures that it can be trained to recognize a wide array of signs and gestures, paving the way for broader applications in real-world settings where environmental control is limited. This blend of spatial and temporal analysis capabilities positions the CNN-TCN model as a highly effective solution for overcoming the inherent limitations of sensor-based, vision-driven sign language recognition systems.
The main constraint of the proposed work lies in its significant computational demands; training deep CNNs with multiple layers requires substantial GPU resources, restricting access for entities with limited computational power. Future research will aim to mitigate this limitation by enhancing the efficiency of both training and testing processes.
CONCLUSION
Sign language is an essential form of communication for people with hearing and speech challenges. They depend on visual means, mainly hand gestures and body language, to express their ideas and emotions in everyday interactions. Sign language is typically divided into two primary categories: digits (numbers) and characters (letters). This study introduces an innovative hybrid methodology that merges a TCNN with a CCNN for the automatic recognition of sign language. The efficacy of this system was thoroughly evaluated using three distinct benchmark datasets, encompassing isolated numbers and letters from both American and British sign languages, which are widely accessible and comprehensive resources. The CNN-TCN model incorporates multiple phases, including data collection, preprocessing (frame extraction, normalization, and labeling), feature extraction via the CCNN, and sequence modeling with the TCNN for final recognition. The outcomes of this research highlight the system's high accuracy, precision, recall, and F1 scores, achieving 91.67%, 93.64%, 91.67%, and 91.47%, respectively, for the American Sign Language dataset, and 97.33%, 97.89%, 97.33%, and 97.37% for digit recognition in the BSL dataset. These results validate the effectiveness and practicality of the CNN-TCN model in sign language recognition. Future studies will focus on reducing the model's computational demands by improving the efficiency of the training and testing phases.