SlowFast-TCN: A Deep Learning Approach for Visual Speech Recognition

Bibliographic Details
Main Authors: Nicole Yah Yie Ha, Lee-Yeng Ong, Meng-Chew Leow
Format: Article
Language: English
Published: Ital Publication, 2024-12-01
Series: Emerging Science Journal, Vol. 8, No. 6, pp. 2554-2569
ISSN: 2610-9182
DOI: 10.28991/ESJ-2024-08-06-024
Affiliation: Faculty of Information Science and Technology (FIST), Multimedia University, Jalan Ayer Keroh Lama, Melaka 75450, Malaysia
Subjects: visual speech recognition; temporal convolutional network; Lip Reading in the Wild; SlowFast network; homophemes
Online Access: https://ijournalse.org/index.php/ESJ/article/view/2670
Description

Visual Speech Recognition (VSR), commonly referred to as automated lip-reading, is an emerging technology that interprets speech by visually analyzing lip movements. A key challenge in VSR, known as the homopheme problem, arises when distinct words produce visually similar lip movements. Visemes are the basic visual units of speech, produced by the movements and positions of the lips, and they typically have shorter durations than words. Consequently, there is less temporal information for distinguishing between viseme classes, which increases visual ambiguity during classification. To address this challenge, viseme classification must not only extract spatial features from lip images but also handle visemes of varying durations and their temporal features. This study therefore proposes a new deep learning approach, SlowFast-TCN. A SlowFast network serves as the frontend architecture, extracting spatio-temporal features through its slow and fast pathways, while a Temporal Convolutional Network (TCN) serves as the backend architecture, learning from the frontend features to perform classification. A comparative ablation analysis dissects each component of the proposed SlowFast-TCN to evaluate its individual impact. The study uses a benchmark English-language dataset, Lip Reading in the Wild (LRW). Two subsets of LRW, comprising homopheme words and unique words, represent the homophemic and non-homophemic datasets, respectively. The proposed approach is also evaluated under varying lighting conditions to assess its performance in real-world scenarios, since illumination can significantly affect visual data. Key performance metrics, such as accuracy and loss, are used to evaluate effectiveness. The proposed approach outperforms traditional baseline models in accuracy while maintaining competitive execution time. Its dual-pathway architecture effectively captures both long-term dependencies and short-term motions, leading to better performance on both the homophemic and non-homophemic datasets. However, it is less robust under non-ideal lighting, indicating the need for further enhancements to handle diverse lighting scenarios.
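
The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of the SlowFast-TCN idea, not the authors' implementation: the layer widths, the temporal subsampling factor alpha, the average-pool fusion, and the three dilated TCN blocks are all illustrative assumptions.

```python
# Minimal sketch of a SlowFast frontend feeding a TCN backend.
# All hyperparameters here are assumptions, not the paper's configuration.
import torch
import torch.nn as nn


class PathwayStem(nn.Module):
    """One 3D-CNN pathway: a small spatio-temporal feature extractor."""
    def __init__(self, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, out_channels, kernel_size=(3, 5, 5),
                      stride=(1, 2, 2), padding=(1, 2, 2), bias=False),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(out_channels, out_channels, kernel_size=(3, 3, 3),
                      stride=(1, 2, 2), padding=(1, 1, 1), bias=False),
            nn.BatchNorm3d(out_channels),
            nn.ReLU(inplace=True),
        )
        # Collapse the spatial dims but keep the temporal dim for the TCN.
        self.pool = nn.AdaptiveAvgPool3d((None, 1, 1))

    def forward(self, x):                       # x: (B, 3, T, H, W)
        f = self.pool(self.net(x))              # (B, C, T, 1, 1)
        return f.flatten(2)                     # (B, C, T)


class TemporalBlock(nn.Module):
    """One dilated residual block of a Temporal Convolutional Network."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        pad = dilation  # 'same' padding for kernel_size=3
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, 3, padding=pad, dilation=dilation),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, C, T)
        return self.relu(x + self.conv(x))


class SlowFastTCN(nn.Module):
    def __init__(self, num_classes: int, alpha: int = 4,
                 slow_ch: int = 64, fast_ch: int = 16):
        super().__init__()
        self.alpha = alpha                      # temporal subsampling of slow path
        self.slow = PathwayStem(slow_ch)
        self.fast = PathwayStem(fast_ch)
        ch = slow_ch + fast_ch
        self.tcn = nn.Sequential(*[TemporalBlock(ch, d) for d in (1, 2, 4)])
        self.head = nn.Linear(ch, num_classes)

    def forward(self, x):                       # x: (B, 3, T, H, W), alpha | T
        slow = self.slow(x[:, :, ::self.alpha])         # (B, Cs, T/alpha)
        fast = self.fast(x)                              # (B, Cf, T)
        # Fuse: stretch the slow features back to T steps, then concatenate.
        slow = slow.repeat_interleave(self.alpha, dim=2)
        seq = self.tcn(torch.cat([slow, fast], dim=1))   # (B, Cs+Cf, T)
        return self.head(seq.mean(dim=2))                # clip-level logits


if __name__ == "__main__":
    model = SlowFastTCN(num_classes=10)         # e.g., a 10-word LRW subset
    clip = torch.randn(2, 3, 28, 96, 96)        # 28-frame lip-crop clips
    print(model(clip).shape)                    # torch.Size([2, 10])
```

The division of labor mirrors the abstract's claim: the fast pathway keeps every frame but few channels to capture short-term lip motion, the slow pathway samples fewer frames with more channels to model longer-term context, and the TCN's dilated convolutions widen the temporal receptive field over the fused sequence.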
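
The abstract also reports evaluating the model under varying lighting conditions. The record does not say how illumination was varied, so the snippet below is a hypothetical gamma-and-gain perturbation one could use to probe that kind of robustness.

```python
# Hedged sketch: simulate under- and over-exposed lip clips for evaluation.
# The gamma/gain model is an assumption, not the paper's actual method.
import torch


def relight(clip: torch.Tensor, gamma: float, gain: float) -> torch.Tensor:
    """Apply a gamma curve and brightness gain to a clip with values in [0, 1].

    gamma > 1 darkens mid-tones, gamma < 1 brightens them; gain scales overall
    brightness. Output is clamped back to the valid [0, 1] range.
    """
    return (clip.clamp(0, 1) ** gamma * gain).clamp(0, 1)


# Example: evaluate the same clips under dim and bright conditions.
clips = torch.rand(2, 3, 28, 96, 96)
dim_clips = relight(clips, gamma=2.2, gain=0.6)      # under-exposed
bright_clips = relight(clips, gamma=0.7, gain=1.3)   # over-exposed
```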