SlowFast-TCN: A Deep Learning Approach for Visual Speech Recognition

Bibliographic Details
Main Authors: Nicole Yah Yie Ha, Lee-Yeng Ong, Meng-Chew Leow
Format: Article
Language: English
Published: Ital Publication, 2024-12-01
Series: Emerging Science Journal, Vol. 8, No. 6, pp. 2554-2569
ISSN: 2610-9182
DOI: 10.28991/ESJ-2024-08-06-024
Affiliation: Faculty of Information Science and Technology (FIST), Multimedia University, Jalan Ayer Keroh Lama, Melaka 75450, Malaysia
Subjects: visual speech recognition; temporal convolutional network; Lip Reading in the Wild; SlowFast network; homophemes
Online Access: https://ijournalse.org/index.php/ESJ/article/view/2670
Description

Visual Speech Recognition (VSR), commonly referred to as automated lip-reading, is an emerging technology that interprets speech by visually analyzing lip movements. A key challenge in VSR, known as the homopheme problem, arises when distinct words produce visually similar lip movements. Visemes are the basic visual units of speech, produced by the movements and positions of the lips, and they typically have shorter durations than words. Consequently, there is less temporal information for distinguishing between viseme classes, which increases visual ambiguity during classification. To address this challenge, viseme classification must not only extract spatial features from lip images but also handle visemes of varying durations and their temporal features. This study therefore proposes a new deep learning approach, SlowFast-TCN. A SlowFast network serves as the frontend architecture, extracting spatio-temporal features through its slow and fast pathways, while a Temporal Convolutional Network (TCN) serves as the backend architecture, learning from the frontend features to perform classification. A comparative ablation analysis dissects each component of the proposed SlowFast-TCN to evaluate its individual impact. The study uses a benchmark English-language dataset, Lip Reading in the Wild (LRW). Two subsets of LRW, comprising homopheme words and unique words, represent the homophemic and non-homophemic datasets, respectively. The proposed approach is also evaluated under varying lighting conditions to assess its performance in real-world scenarios, since illumination can significantly affect visual data. Key performance metrics, such as accuracy and loss, are used to evaluate effectiveness. The proposed approach outperforms traditional baseline models in accuracy while maintaining competitive execution time. Its dual-pathway architecture effectively captures both long-term dependencies and short-term motions, leading to better performance on both the homophemic and non-homophemic datasets. However, it is less robust under non-ideal lighting, indicating the need for further enhancements to handle diverse lighting scenarios.
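
The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of the SlowFast-TCN idea, not the authors' implementation: the layer widths, the temporal subsampling factor alpha, the average-pool fusion, and the three dilated TCN blocks are all illustrative assumptions.

```python
# Minimal sketch of a SlowFast frontend feeding a TCN backend.
# All hyperparameters here are assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class PathwayStem(nn.Module):
    """One 3D-CNN pathway: a small spatio-temporal feature extractor."""
    def __init__(self, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, out_channels, kernel_size=(3, 5, 5),
                      stride=(1, 2, 2), padding=(1, 2, 2), bias=False),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_channels, out_channels, kernel_size=(3, 3, 3),
                      stride=(1, 2, 2), padding=(1, 1, 1), bias=False),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )
        # Collapse the spatial dims but keep the temporal dim for the TCN.
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))

    def forward(self, x):                       # x: (B, 3, T, H, W)
        f = self.pool(self.net(x))              # (B, C, T, 1, 1)
        return f.flatten(2)                     # (B, C, T)


class TemporalBlock(nn.Module):
    """One dilated residual block of a Temporal Convolutional Network."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # 'same' padding for kernel_size=3
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, C, T)
        return self.relu(x + self.conv(x))


class SlowFastTCN(nn.Module):
    def __init__(self, num_classes: int, alpha: int = 4,
                 slow_ch: int = 64, fast_ch: int = 16):
        super().__init__()
        self.alpha = alpha                      # temporal subsampling of slow path
        self.slow = PathwayStem(slow_ch)
        self.fast = PathwayStem(fast_ch)
        ch = slow_ch + fast_ch
        self.tcn = nn.Sequential(*[TemporalBlock(ch, d) for d in (1, 2, 4)])
        self.head = nn.Linear(ch, num_classes)

    def forward(self, x):                       # x: (B, 3, T, H, W), alpha | T
        slow = self.slow(x[:, :, ::self.alpha])         # (B, Cs, T/alpha)
        fast = self.fast(x)                              # (B, Cf, T)
        # Fuse: stretch the slow features back to T steps, then concatenate.
        slow = slow.repeat_interleave(self.alpha, dim=2)
        seq = self.tcn(torch.cat([slow, fast], dim=1))   # (B, Cs+Cf, T)
        return self.head(seq.mean(dim=2))                # clip-level logits


if __name__ == "__main__":
    model = SlowFastTCN(num_classes=10)         # e.g., a 10-word LRW subset
    clip = torch.randn(2, 3, 28, 96, 96)        # 28-frame lip-crop clips
    print(model(clip).shape)                    # torch.Size([2, 10])
```

The division of labor mirrors the abstract's claim: the fast pathway keeps every frame but few channels to capture short-term lip motion, the slow pathway samples fewer frames with more channels to model longer-term context, and the TCN's dilated convolutions widen the temporal receptive field over the fused sequence.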
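
The abstract also reports evaluating the model under varying lighting conditions. The record does not say how illumination was varied, so the snippet below is a hypothetical gamma-and-gain perturbation one could use to probe that kind of robustness.

```python
# Hedged sketch: simulate under- and over-exposed lip clips for evaluation.
# The gamma/gain model is an assumption, not the paper's actual method.
import torch


def relight(clip: torch.Tensor, gamma: float, gain: float) -> torch.Tensor:
    """Apply a gamma curve and brightness gain to a clip with values in [0, 1].

    gamma > 1 darkens mid-tones, gamma < 1 brightens them; gain scales overall
    brightness. Output is clamped back to the valid [0, 1] range.
    """
    return (clip.clamp(0, 1) ** gamma * gain).clamp(0, 1)


# Example: evaluate the same clips under dim and bright conditions.
clips = torch.rand(2, 3, 28, 96, 96)
dim_clips = relight(clips, gamma=2.2, gain=0.6)      # under-exposed
bright_clips = relight(clips, gamma=0.7, gain=1.3)   # over-exposed
```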