Dimensional Affective Speech Synthesis Based on Voice Conversion

Affective speech synthesis can promote more natural human–computer interaction. Previous studies in speech synthesis have used feature conversion to achieve natural affective speech, but they focused on adjusting prosodic features and typically relied on a discrete emotion model; few studies reflect the dimensional emotions expressed in daily life. To address these issues, we introduce a 2-dimensional valence–arousal emotion model into a speech synthesis system and, taking inspiration from voice conversion, convert both prosodic and spectral acoustic features to achieve dimensional emotional speech expression. First, the acoustic features corresponding to the input text are predicted by the front end of the speech synthesis system, or the acoustic features of the input speech are extracted by WORLD, a vocoder-based analysis–synthesis tool that produces prosodic and spectral features simultaneously. Then, the acoustic features of affective speech at different dimensional values are analyzed, and the fundamental frequency (F0) and spectral envelope parameters of the source speech are converted according to the average ratio between the acoustic features of the input speech and those of the target affective dimensions. Finally, the WORLD vocoder renders the converted emotion feature parameters into audio waveforms, realizing emotional speech synthesis at different dimensional values in the 2-dimensional valence–arousal space. Objective and subjective evaluations show that dimensional affective speech synthesized with this method is perceived well, especially in the arousal dimension.
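
As a concrete illustration of the analysis–conversion–synthesis loop described above, the following is a minimal sketch in Python using the pyworld bindings to the WORLD vocoder together with the soundfile library. The scaling factors f0_ratio and sp_ratio stand in for the paper's average ratios between the source speech and a target valence–arousal region; their values here are placeholders, not figures from the article, and convert_emotion is a hypothetical helper rather than the authors' code.

    # Minimal sketch of ratio-based emotion conversion with the WORLD vocoder.
    # Assumption: f0_ratio and sp_ratio approximate the average ratios between
    # the source speech and a target valence-arousal region; in the paper these
    # would be estimated from analyzed affective speech, not fixed constants.
    import numpy as np
    import pyworld as pw
    import soundfile as sf

    def convert_emotion(in_wav, out_wav, f0_ratio=1.2, sp_ratio=1.05):
        x, fs = sf.read(in_wav)
        if x.ndim > 1:                          # mix down to mono if needed
            x = x.mean(axis=1)
        x = np.ascontiguousarray(x, dtype=np.float64)   # WORLD expects float64

        # Analysis: F0 contour, spectral envelope, and aperiodicity
        f0, t = pw.harvest(x, fs)
        sp = pw.cheaptrick(x, f0, t, fs)
        ap = pw.d4c(x, f0, t, fs)

        # Conversion: scale the prosodic (F0) and spectral envelope parameters;
        # unvoiced frames (f0 == 0) are unaffected by the multiplication
        f0_conv = f0 * f0_ratio
        sp_conv = np.maximum(sp * sp_ratio, 1e-16)      # keep spectrum positive

        # Synthesis: render the converted parameters back to a waveform
        y = pw.synthesize(f0_conv, sp_conv, ap, fs)
        sf.write(out_wav, y, fs)

    # e.g., shift a neutral utterance toward higher arousal (file names illustrative)
    convert_emotion("neutral.wav", "high_arousal.wav", f0_ratio=1.2, sp_ratio=1.05)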

Bibliographic Details
Main Authors: Xin Zhang, Yaobin Wan, Wei Wang
Author Affiliation (all authors): College of Education Science, Nanjing Normal University, Nanjing 210097, China
Format: Article
Language: English
Published: American Association for the Advancement of Science (AAAS), 2024-01-01
Series: Intelligent Computing
ISSN: 2771-5892
DOI: 10.34133/icomputing.0092
Online Access: https://spj.science.org/doi/10.34133/icomputing.0092