INTRODUCTION
Speech technologies have the potential to greatly assist individuals with disabilities, contributing to their equality and inclusion in society and everyday life (Delić et al., 2013). Speech is a natural and intuitive form of human communication, which makes it an ideal medium for human–computer interaction: machines that can accurately understand and reproduce human speech would make that interaction straightforward. Among the various applications of speech technology, one of the most promising is the development of spoken dialogue systems, which enable users to access information in a simple, direct, and hands-free manner (Aggarwal and Dave, 2012; Zheng and Li, 2017; Ross et al., 2020; Alenizi and Al-Karawi, 2023a). This becomes particularly important when users have disabilities that prevent them from interacting with systems through standard input methods. However, achieving reliable automatic speech recognition (ASR) is not straightforward (Rosdi and Ainon, 2008; Al-Karawi, 2015; Al-Karawi and Ahmed, 2021). Variations in speech patterns among speakers, background noise, differences in speaking rate, and user mood can all introduce errors that significantly reduce recognition accuracy. As a result, many current systems are limited to controlled environments or specific user groups, often requiring particular microphone positioning or other constraints (Vieira et al., 2022), which leads to interfaces that feel unnatural and restrictive (Bedoya and Muñoz, 2012; Al-Karawi and Mohammed, 2019). Efforts to overcome these challenges are ongoing, with researchers and developers striving to improve the robustness and adaptability of ASR systems.
By addressing factors such as speaker variability, noise robustness, and user-centric design, the goal is to create more inclusive and user-friendly interfaces that accommodate a broader range of users, including those with disabilities. Despite the complexity of the task, advances in speech technology continue to pave the way for more seamless and effective human–computer interaction (Busatlic et al., 2017). As research progresses, speech recognition systems are expected to become more accurate, versatile, and accessible, enabling individuals with disabilities to benefit from natural and intuitive technological interactions (Delić et al., 2013; Al-Karawi, 2021). Over the years, numerous approaches have been proposed for ASR (Abushariah et al., 2010; Ibarra and Guerrero, 2010; Bedoya and Muñoz, 2012). Among these, the most robust methods are based on Hidden Markov Models (HMM) (Al-Karawi and Ahmed, 2021; Al-Karawi and Mohammed, 2021). While HMM-based systems have achieved high accuracy, they continue to face challenges related to high computational cost.
The high cost of commercial ASR systems and their copyright restrictions limit access for both users and researchers. To address these challenges, an in-house ASR system is needed: one that is simple, computationally efficient, accessible, reliable, and adaptable to any platform, reducing the computational burden and allowing for customization (Mohammed et al., 2021; Alenizi and Al-Karawi, 2023a). Our ASR system aims to provide a cost-effective, reliable, and accessible solution with modest computational requirements, promoting openness, collaboration, and innovation in speech recognition by removing the limitations of existing commercial systems (Mohammed et al., 2020; Al-Karawi, 2023). Our development approach also offers flexibility for system improvements, customization, and exploration, making speech recognition technology more accessible, affordable, and adaptable for a variety of applications. The remainder of this paper is organized as follows: the next sections cover speech recognition systems for disabled people, speech recognition techniques, the state of the art, the proposed model, experiments and results, and conclusions.
SPEECH RECOGNITION SYSTEMS FOR DISABLED PEOPLE
Technological advances, particularly in speech recognition, are transforming the lives of disadvantaged and disabled individuals, enhancing their daily lives and overall well-being (Noyes and Frankish, 1992). Speech recognition is the foundation of popular voice assistants such as Siri, Amazon Echo, and Google Assistant, enabling computers to understand and process spoken language (Jiang et al., 2000; Alenizi and Al-Karawi, 2023b). By allowing natural, intuitive communication through voice, it offers a transformative solution for individuals with disabilities or limitations that make traditional input methods difficult to use (Azam and Islam, 2015; Alenizi and Al-karawi, 2022). As the technology advances, it enables individuals to control devices, access information, perform tasks, and engage with digital platforms, enhancing independence and quality of life, and it holds promise for creating inclusive environments in which everyone can participate fully (Noyes et al., 1989). ASR, or voice recognition, converts human speech into machine-readable form and thereby assists disabled individuals with limited mobility or visual impairments. As it matures, it turns devices into digital assistants that improve efficiency and productivity, especially for those with limited upper-limb mobility.
Speech recognition can also assist older people and individuals with speech or hearing impairments (Gonzalez et al., 2016; Alenizi and Al-Karawi, 2023a). Considering the estimated 15 million disabled individuals in the United States alone, and millions more worldwide, the potential benefits are immense. By leveraging speech recognition, we can empower disabled individuals to navigate and interact with digital devices more effectively, fostering greater independence and inclusivity. Its widespread adoption promises to transform millions of lives, helping people overcome communication barriers and access the digital world (Isyanto et al., 2020; Al-Karawi and Mohammed, 2023).
SPEECH RECOGNITION TECHNIQUES
As the market continues to see the proliferation of voice assistants such as Siri on the iPhone and Amazon's Alexa, speech recognition has gained significant influence in our daily lives. At its core, speech recognition enables machines to hear, comprehend, and respond to the information conveyed through speech. The primary objective of ASR is to assess, extract, analyze, and recognize spoken speech to obtain meaningful information (Gaikwad, 2010). It is therefore important to understand the techniques involved in the comprehensive identification and understanding of speech. The speech recognition system comprises three main stages, as depicted in Figure 1: feature extraction, modeling, and performance evaluation.
STATE OF THE ART
A comprehensive analysis of the current state of ASR systems was conducted using the Tree of Science tool developed at Universidad Nacional de Colombia. This systematic review aimed to identify influential articles that have contributed to advances in the accessibility, accuracy, and efficiency of ASR systems. In the early stages of speech processing research, the short-term spectral amplitude technique, employing the minimum mean square error estimator, was widely used (Ephraim and Malah, 1985). Although complex, this algorithm offered higher accuracy than other methods available at the time. Subsequently, more robust approaches based on HMM gained prominence; these techniques incorporated mel frequency cepstral coefficients (MFCCs) for feature extraction (Gales and Young, 2008). HMM remains an essential method for large-vocabulary continuous speech recognition systems due to its reliable performance. Another notable method in the literature is PARADE (Periodic Component to Aperiodic Component Ratio-based Activity Detection), combined with a feature extraction technique known as SPADE (Subband-based Periodicity and Aperiodicity Decomposition) (Ishizuka et al., 2010); this approach has demonstrated significantly improved accuracy in word recognition. Dynamic time warping (DTW), a widely used algorithm known for its low computational cost, has been discussed by Zhang et al. (2014), although it is limited to small vocabularies. Current research efforts focus on achieving accurate speech recognition and on developing tools for word segmentation, that is, identifying the start and end points of individual words, in order to reduce the complexity associated with continuous speech recognition (Komatani et al., 2015). By examining the progress made by the scientific community, this systematic review provides valuable insight into the evolution of ASR systems. The adoption of robust methods such as HMM, along with advances in feature extraction techniques and word segmentation, has contributed to improved accuracy and efficiency. Continued research in this field aims to further improve recognition accuracy, expand vocabulary size, and develop tools that simplify the processing of continuous speech.
PROPOSED MODEL
We propose a system that recognizes a specific set of voice commands and can be seamlessly integrated with various applications, including virtual learning tools. Interacting with applications through voice commands can be a powerful tool for inclusion, especially when it offers user-friendly features and accessibility options for both end users and developers. Our model was explicitly designed for individuals with physical and sensory disabilities, focusing on enabling interaction with digital learning resources. While existing tools such as Job Access With Speech (JAWS) serve similar purposes, we aimed to create an experimental tool that could be easily integrated with other developments to address diverse needs. In particular, our system was incorporated into the global astrometry interferometer for astrophysics (GAIA) tools framework, which is dedicated to constructing accessible learning objects for individuals with visual disabilities. The initial version of GAIA tools includes various authoring tools, such as a dictionary, a text editor and reader, a learning game, and assessment through questionnaires.
Furthermore, it guides designers in developing learning objects and enables visually impaired users to interact effectively with them in educational activities (Gonzalez et al., 2016). By proposing this system, we aim to enhance accessibility and inclusivity in the learning environment, ensuring that disabled individuals can engage in educational activities more effectively. Integrating voice commands and interaction capabilities with the GAIA tools framework opens new possibilities for accessible learning and for further developments in this field. In the context of developing countries, where socioeconomic limitations are more prominent, tools like the one proposed here hold particular significance. The motivation for focusing on an audio recognition system as a complement to educational tools stems from the understanding that education plays a fundamental role in overcoming socioeconomic challenges. Currently, the system can recognize 10 isolated words in low-noise environments, and a straightforward process can expand it to cater to the specific needs of individual users.
Users only need to record new words to extend the system's database, allowing for customization. Although the system was initially designed to recognize words in Spanish, there is no limitation on adding new commands to the database in another language. The proposed system is divided into four stages, as illustrated in Figure 1. The first stage acquires the audio signal that contains the information to be recognized. A preprocessing stage then filters out noise and unwanted segments, such as silence at the beginning and end of the recording. The next stage is feature extraction, in which a matrix of MFCCs is calculated. Finally, the system computes the Euclidean distance between the feature matrices and the corresponding patterns stored in the database; this decision stage completes the recognition. By offering a customizable and efficient audio recognition system, we aim to address the specific needs of users in educational settings, particularly in developing countries. Such a tool can empower individuals by enhancing access to educational resources, thereby contributing to the overall socioeconomic development of these regions.
Speech samples
In this stage, the user speaks the word to be recognized. The system records an audio vector of t × Fs samples, where t is the duration of the recording and Fs is the sampling frequency. Here, the recording duration is 2 seconds and the sampling frequency is 44,100 Hz. The recording is in mono format, i.e., only one audio channel is captured.
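As a concrete illustration, the acquisition step could be reproduced with a short Python sketch such as the one below. The paper does not specify the capture library, so the use of `sounddevice` (and the helper name `record_command`) is an assumption; only the 2-second duration, 44,100 Hz sampling rate, and mono channel follow the description above.

```python
import numpy as np
import sounddevice as sd  # assumed third-party library; not named in the paper

FS = 44_100      # sampling frequency (Hz), as described above
DURATION = 2.0   # recording length in seconds

def record_command(duration=DURATION, fs=FS):
    """Record a mono audio vector of t * Fs samples."""
    n_samples = int(duration * fs)               # t * Fs = 88,200 samples
    audio = sd.rec(n_samples, samplerate=fs, channels=1, dtype="float64")
    sd.wait()                                    # block until the recording finishes
    return audio[:, 0]                           # flatten to a 1-D vector

if __name__ == "__main__":
    x = record_command()
    print(f"Recorded {x.size} samples ({x.size / FS:.1f} s of audio)")
```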
Preprocessing
The preprocessing stage comprises three steps: filtering, normalization, and silence suppression. To design the filter, a spectral analysis was conducted on multiple audio recordings to identify the noise frequencies expected in various environments (office, outdoors, home, etc.), and a general frequency range where useful information is expected was established. The best-performing filter was a finite impulse response (FIR) band-pass filter designed with a Hamming window, with a sampling frequency of 44,100 Hz and cut-off frequencies of 200 Hz and 8000 Hz. Once filtered, the signal is normalized to restrict its values to a standardized range between 0 and 1. This normalization step is particularly important when comparing audio samples with different amplitude ranges caused by varying speaker intensity or noise levels, and it facilitates the subsequent steps of the process. The final step is silence detection and suppression: the signal is segmented, the energy of each segment is calculated, and segments with energy below a defined threshold are considered non-useful and discarded. The energy threshold was determined by comparing noise energy values with those obtained from randomly pronounced words. This approach reduces the computational cost of the algorithm by excluding signal segments that provide no relevant information and may hinder recognition.
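The sketch below shows one possible realization of this preprocessing stage in Python, assuming SciPy for the FIR design. The 200–8000 Hz Hamming-window band-pass filter, the [0, 1] normalization, and the energy-based silence suppression follow the description above; the filter order, frame length, and energy threshold are illustrative values, since the paper does not report them.

```python
import numpy as np
from scipy.signal import firwin, lfilter

FS = 44_100

def preprocess(x, fs=FS, numtaps=257, frame_len=1024, energy_thresh=1e-4):
    """Band-pass filter, normalize to [0, 1], and drop low-energy (silent) frames.

    numtaps and energy_thresh are illustrative; the paper does not report
    the filter order or the exact energy threshold.
    """
    # 1) Hamming-window FIR band-pass filter with 200 Hz and 8000 Hz cut-offs.
    taps = firwin(numtaps, [200.0, 8000.0], pass_zero=False,
                  window="hamming", fs=fs)
    y = lfilter(taps, 1.0, x)

    # 2) Normalize the signal to the range [0, 1].
    y = (y - y.min()) / (y.max() - y.min() + 1e-12)

    # 3) Silence suppression: keep only frames whose energy exceeds the threshold.
    kept = []
    centered = y - y.mean()                       # remove the DC offset before measuring energy
    for start in range(0, len(y) - frame_len + 1, frame_len):
        frame = centered[start:start + frame_len]
        if np.mean(frame ** 2) > energy_thresh:   # mean-square energy of the frame
            kept.append(y[start:start + frame_len])
    return np.concatenate(kept) if kept else y
```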
Feature extraction
The time-domain waveform of a speech signal, representing amplitude variations over time, contains essential auditory information. However, to extract meaningful information from the waveform, the data of each segment must be condensed into a limited number of parameters, or features, while retaining the signal's discriminatory power (Delić et al., 2013; Delić et al., 2014; Ajibola Alim and Rashid, 2018; Alenizi and Al-Karawi, 2023b). As the number of input voice samples increases, the accuracy of speech recognition systems tends to decrease (Gaikwad, 2010; Virkar et al., 2020), which highlights the importance of feature extraction for accurate speech processing. Feature extraction represents a speech signal using a predetermined set of signal components that are more distinctive and reliable. These features should effectively capture the characteristics of each segment, enabling similar segments to be grouped by their shared characteristics (Shrawankar and Thakare, 2013). Feature extraction is an initial step in ASR preprocessing, or front-end signal processing. Over the years, several approaches have been developed for extracting features from audio signals, drawing on extensive research in mathematics, acoustics, and speech technology (Ajibola Alim and Rashid, 2018). Feature extraction is closely intertwined with model variable selection, another crucial aspect that can significantly affect the performance of a speech processing system; proper selection of relevant model variables is essential for accurate and reliable results. In our ASR system, MFCCs are employed as the feature representation for each command. These coefficients were identified in the state-of-the-art review as effective for achieving accurate results while maintaining low computational cost. Computing the MFCCs involves a series of steps.
First, a perceptually spaced triangular filter bank is applied to the discrete Fourier-transformed signal; this filter bank captures the important frequency components of the audio signal. The resulting filter-output energies undergo logarithmic compression. Next, the discrete cosine transform is applied to the logarithmically compressed filter-output energies. This transformation decorrelates the coefficients, making them more suitable for speech recognition tasks, and the MFCCs are the decorrelated parameters that result (Hossan et al., 2010; Delić et al., 2013; Alenizi and Al-Karawi, 2023a). In our implementation, the audio signal is segmented into intervals of 1024 samples, with a 410-sample overlap between segments to avoid information loss during windowing with a Hamming window in the time domain; this windowing attenuates the beginning and end of each segment. A total of 14 MFCCs (excluding the 0th coefficient) are calculated for each segment by applying the logarithmically compressed, perceptually spaced triangular filter bank to the discrete Fourier-transformed signal. In this case, 30 filters are used in the filter bank, with a low-end frequency of 0 and a high-end (normalized) frequency of 0.1815. The calculated MFCCs are stored in a matrix of size N × C, where N is the number of segments into which the audio signal was divided and C is the number of MFCCs calculated (14 in this case). The number of rows in the feature matrices may vary because the silence suppression applied to the recorded signals can result in different lengths. The parameters for calculating the MFCCs were selected based on the results obtained through various tests, including those described in the next section. For each command that the system can recognize, a "pattern matrix" of features is calculated and stored, representing the class of that command. Whenever a new recording is entered for recognition, a new matrix of features, the "new matrix," is calculated.
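A hedged sketch of this feature-extraction step is given below. It relies on `librosa`, which is an assumption rather than the authors' implementation, and it interprets the high-end value of 0.1815 as a fraction of the 44,100 Hz sampling rate (roughly 8 kHz); the frame length, overlap, number of filters, and number of coefficients follow the values stated above.

```python
import numpy as np
import librosa  # assumed here; the paper does not name the MFCC implementation used

FS = 44_100
FRAME_LEN = 1024           # samples per segment
OVERLAP = 410              # samples shared between consecutive segments
HOP = FRAME_LEN - OVERLAP  # 614-sample hop
N_FILTERS = 30             # triangular mel filters
N_COEFFS = 14              # MFCCs kept per segment (0th coefficient discarded)

def extract_features(y, fs=FS):
    """Return an N x C matrix of MFCCs (N segments, C = 14 coefficients)."""
    mfcc = librosa.feature.mfcc(
        y=y.astype(np.float32), sr=fs,
        n_mfcc=N_COEFFS + 1,          # compute 15 coefficients, then drop the 0th
        n_fft=FRAME_LEN, hop_length=HOP, window="hamming",
        n_mels=N_FILTERS, fmin=0.0,
        fmax=0.1815 * fs,             # ~8 kHz, reading 0.1815 as a fraction of Fs (assumption)
    )
    return mfcc[1:, :].T              # drop the 0th coefficient, transpose to N x C
```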
Decision stage
This stage associates the newly calculated matrix with a specific command, determining which word the recording corresponds to. This is achieved by comparing the new matrix with the pattern matrices and identifying the best match. Each row of a matrix represents one segment of the audio signal in terms of its MFCCs, so comparing each row of the new matrix with the rows of a pattern matrix is equivalent to comparing each segment of the new recording with the segments of the pattern recordings. A challenge arises from time alignment: the two recordings may differ in pronunciation duration, so directly aligned rows may correspond to different audible portions of the same word. To address this, an individual error is calculated as the Euclidean distance between one row of the new matrix and each row of the pattern matrix; the minimum of these distances is taken as the contribution of that row to the total error (Al-Karawi, 2015; Zheng and Li, 2017; Ross et al., 2020; Al-Karawi and Ahmed, 2021; Alenizi and Al-Karawi, 2023a). This process is repeated for every row of the new matrix. The total error is then computed for each class, and the class with the minimum total error indicates which command was pronounced in the entered recording. Although this process may appear complex, it is performed through algebraic operations rather than iterative calculations, which significantly reduces computational cost. Finally, the minimum total error must fall below a defined threshold to ensure reliability; if it exceeds this threshold, the user is prompted to repeat the command more clearly.
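The matching rule described above can be expressed in a few vectorized operations. The sketch below is a minimal illustration, not the authors' code: `scipy.spatial.distance.cdist` performs the row-by-row Euclidean comparisons as a single algebraic operation, and the rejection threshold value is an arbitrary placeholder since the paper does not report it.

```python
import numpy as np
from scipy.spatial.distance import cdist

def total_error(new_matrix, pattern_matrix):
    """Sum, over the rows of the new matrix, of the distance to the closest pattern row."""
    d = cdist(new_matrix, pattern_matrix, metric="euclidean")  # pairwise row distances
    return d.min(axis=1).sum()        # best match per segment, accumulated over segments

def classify(new_matrix, pattern_db, reject_threshold=250.0):
    """pattern_db maps command label -> pattern matrix; the threshold is illustrative."""
    errors = {label: total_error(new_matrix, pat) for label, pat in pattern_db.items()}
    best = min(errors, key=errors.get)
    if errors[best] > reject_threshold:
        return None                    # prompt the user to repeat the command
    return best
```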
EXPERIMENTS AND RESULTS
To evaluate the performance of the implemented model, a test was conducted involving 30 participants with diverse characteristics, including different genders and ages. The test was conducted in moderate noise environments such as study rooms and bedrooms. The experiment consisted of two parts: the first involved creating a database with the participants’ voices, and the second focused on validating the system. During the database-creation phase, participants were instructed to speak clearly and naturally while repeating 10 specific words in Spanish: open, back, center, right, left, save, home, help, view, and internet. Each participant repeated these words four times, resulting in 40 recordings per user. The system then determined whether it recognized the spoken word successfully in each attempt. The detailed results of these tests are presented in Tables 1 and 2.
Table 1. Elements of a confusion matrix.

| Actual values | Predicted positive | Predicted negative |
|---|---|---|
| Positive | True positive (TP) | False negative (FN) |
| Negative | False positive (FP) | True negative (TN) |
Table 2. Results of the tests.

| User | Number of hits | Percentage (%) |
|---|---|---|
| 1 | 36 | 90 |
| 2 | 39 | 97.5 |
| 3 | 34 | 85 |
| 4 | 36 | 90 |
| 5 | 34 | 86 |
| 6 | 36 | 91 |
| 7 | 36 | 91 |
| 8 | 40 | 100 |
| 9 | 34 | 86 |
| 10 | 32 | 80 |
| 11 | 32 | 80 |
| 12 | 34 | 86 |
| 13 | 39 | 97.6 |
| 14 | 39 | 97.5 |
| 15 | 34 | 86 |
| 16 | 39 | 97.6 |
| 17 | 38 | 95 |
| 18 | 32 | 81 |
| 19 | 32 | 81 |
| 20 | 32 | 81 |
| 21 | 27 | 68.6 |
| 22 | 36 | 66 |
| 23 | 36 | 91 |
| 24 | 28 | 71 |
| 25 | 36 | 91 |
| 26 | 34 | 86 |
| 27 | 39 | 97.6 |
| 28 | 34 | 86 |
| 29 | 33 | 83.5 |
| 30 | 38 | 95 |
Performance evaluation
Evaluating the performance of a classification model is crucial to assess its effectiveness in achieving the desired outcome. Performance evaluation metrics quantitatively assess the model's behavior on a test dataset, so selecting appropriate metrics is essential. Several metrics can be used, including the confusion matrix, accuracy, precision, sensitivity, specificity, and the F1 score. They are commonly computed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.$$
In these formulas and in the confusion matrix, TP denotes true positives, TN true negatives, FP false positives, and FN false negatives. The confusion matrix shows the proportions of correct and incorrect classifications for each class, pinpointing the classes that the trained models find most difficult to classify. TP and TN are the numbers of data points belonging to the positive and negative classes, respectively, that the model identifies correctly. Conversely, FP is the number of negatives erroneously classified as positive, and FN is the number of positives mistakenly classified as negative.
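For completeness, these definitions translate directly into a few lines of code; the function below simply encodes the standard formulas, and the counts passed in the example call are purely illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics computed from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity, "f1": f1}

# Hypothetical counts, used only to show the call; not results from the paper.
print(classification_metrics(tp=36, tn=350, fp=10, fn=4))
```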
CONCLUSIONS
The proposed ASR system serves as a tool to complement the development of educational platforms, aiming to enhance accessibility for individuals with visual and other disabilities. While its recognition accuracy may not match that of commercial systems, it offers several advantages: low computational cost, seamless integration with other platforms, ease of customization, and the ability to adapt to the specific needs of individual users. In addition, new words can be added to its database from a single recording, providing flexibility and easy updates. As future work, the authors plan to explore other characterization methods, such as autoregressive coefficients, and classification strategies such as HMM, to enable continuous speech recognition and enhance the system's interaction with applications.
Additionally, they aim to enhance the preprocessing stage by implementing more efficient filters, improving the system’s robustness, and mitigating issues related to tone, pronunciation, and noise variations. Furthermore, a key objective is to achieve system generalization, allowing it to recognize any user without the need for prior registration of their voices. This expansion will enhance the system’s usability and make it more accessible to a broader range of users.