End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning
The loss of speech function following a laryngectomy usually leads to severe physiological and psychological distress for laryngectomees. In clinical practice, most laryngectomees retain intact upper tract articulatory organs, emphasizing the significance of speech rehabilitation that utilizes articulatory motion information to effectively restore speech.
Main Authors: | Fengji Li, Fei Shen, Ding Ma, Jie Zhou, Shaochuan Zhang, Li Wang, Fan Fan, Tao Liu, Xiaohong Chen, Tomoki Toda, Haijun Niu |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Transactions on Neural Systems and Rehabilitation Engineering |
Subjects: | Ultrasound tongue image; speech reconstruction; end-to-end; generative adversarial networks (GANs); Mandarin speech |
Online Access: | https://ieeexplore.ieee.org/document/10810495/ |
_version_ | 1841536206979989504 |
---|---|
author | Fengji Li, Fei Shen, Ding Ma, Jie Zhou, Shaochuan Zhang, Li Wang, Fan Fan, Tao Liu, Xiaohong Chen, Tomoki Toda, Haijun Niu |
author_facet | Fengji Li, Fei Shen, Ding Ma, Jie Zhou, Shaochuan Zhang, Li Wang, Fan Fan, Tao Liu, Xiaohong Chen, Tomoki Toda, Haijun Niu |
author_sort | Fengji Li |
collection | DOAJ |
description | The loss of speech function following a laryngectomy usually leads to severe physiological and psychological distress for laryngectomees. In clinical practice, most laryngectomees retain intact upper tract articulatory organs, emphasizing the significance of speech rehabilitation that utilizes articulatory motion information to effectively restore speech. This study proposed a deep learning-based end-to-end method for speech reconstruction using ultrasound tongue images. Initially, ultrasound tongue images and speech data were collected simultaneously with a designed Mandarin corpus. Subsequently, a speech reconstruction model was built based on adversarial neural networks. The model includes a pretrained feature extractor to process ultrasound images, an upsampling block to generate speech, and discriminators to ensure the similarity and fidelity of the reconstructed speech. Finally, both objective and subjective evaluations were conducted for the reconstructed speech. The reconstructed speech demonstrated high intelligibility in both Mandarin phonemes and tones. The character error rate of phonemes in automatic speech recognition was 0.2605, and the tone error rate obtained from dictation tests was 0.1784. Objective results showed high similarity between the reconstructed and ground truth speech. Subjective perception results also indicated an acceptable level of naturalness. The proposed method demonstrates its capability to reconstruct tonal Mandarin speech from ultrasound tongue images. However, future research should concentrate on specific conditions of laryngectomees, aiming to enhance and optimize model performance. This will be achieved by enlarging training datasets, investigating the impact of ultrasound tongue imaging parameters, and further refining this method. |
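The abstract reports a character error rate (CER) of 0.2605 for phonemes. For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of CER computed as Levenshtein edit distance normalized by reference length; the exact tokenization and ASR pipeline used in the paper are not specified in this record, so the function and example strings here are illustrative assumptions only.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # distances for the empty-reference row
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edits normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

if __name__ == "__main__":
    # Hypothetical pinyin-with-tone transcripts: one substituted character
    # over an 11-character reference.
    print(cer("ma1 ma2 ma3", "ma1 ma3 ma3"))
```

A CER of 0.2605 would then mean roughly one character edit per four reference characters, under this common definition of the metric.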
format | Article |
id | doaj-art-ca57405c679f4fc2867a8996777ac389 |
institution | Kabale University |
issn | 1534-4320 1558-0210 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Transactions on Neural Systems and Rehabilitation Engineering |
spelling | doaj-art-ca57405c679f4fc2867a8996777ac389; 2025-01-15T00:00:10Z; eng; IEEE; IEEE Transactions on Neural Systems and Rehabilitation Engineering; ISSN 1534-4320, 1558-0210; 2025-01-01; vol. 33, pp. 140-149; DOI 10.1109/TNSRE.2024.3520498; 10810495; End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning; Fengji Li (https://orcid.org/0009-0002-0426-7223), Fei Shen (https://orcid.org/0000-0002-7358-5033), Ding Ma (https://orcid.org/0009-0002-6564-4571), Jie Zhou, Shaochuan Zhang, Li Wang, Fan Fan (https://orcid.org/0000-0001-8708-040X), Tao Liu (https://orcid.org/0000-0002-7783-3073), Xiaohong Chen, Tomoki Toda (https://orcid.org/0000-0001-8146-1279), Haijun Niu (https://orcid.org/0000-0001-8891-6846); Affiliations: School of Biological Science and Medical Engineering, Beihang University, Beijing, China (Li, Shen, Zhou, Zhang, Wang, Fan, Liu, Niu); Graduate School of Informatics, Nagoya University, Nagoya, Japan (Ma); Department of Otolaryngology, Head and Neck Surgery, Beijing Tongren Hospital, Capital Medical University, Beijing, China (Chen); Information Technology Center, Nagoya University, Nagoya, Japan (Toda); https://ieeexplore.ieee.org/document/10810495/; Keywords: Ultrasound tongue image; speech reconstruction; end-to-end; generative adversarial networks (GANs); Mandarin speech |
spellingShingle | Fengji Li; Fei Shen; Ding Ma; Jie Zhou; Shaochuan Zhang; Li Wang; Fan Fan; Tao Liu; Xiaohong Chen; Tomoki Toda; Haijun Niu; End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning; IEEE Transactions on Neural Systems and Rehabilitation Engineering; Ultrasound tongue image; speech reconstruction; end-to-end; generative adversarial networks (GANs); Mandarin speech |
title | End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning |
title_full | End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning |
title_fullStr | End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning |
title_full_unstemmed | End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning |
title_short | End-to-End Mandarin Speech Reconstruction Based on Ultrasound Tongue Images Using Deep Learning |
title_sort | end to end mandarin speech reconstruction based on ultrasound tongue images using deep learning |
topic | Ultrasound tongue image speech reconstruction end-to-end generative adversarial networks (GANs) Mandarin speech |
url | https://ieeexplore.ieee.org/document/10810495/ |
work_keys_str_mv | AT fengjili endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT feishen endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT dingma endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT jiezhou endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT shaochuanzhang endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT liwang endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT fanfan endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT taoliu endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT xiaohongchen endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT tomokitoda endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning AT haijunniu endtoendmandarinspeechreconstructionbasedonultrasoundtongueimagesusingdeeplearning |