Dimensional Affective Speech Synthesis Based on Voice Conversion
Affective speech synthesis can promote more natural human–computer interaction. Previous studies in speech synthesis have used feature conversion to achieve natural affective speech, but they focused on adjusting prosodic features and typically used a discrete emotion model; few studies reflect the dimensional emotions expressed in daily life. To address these issues, we introduce a 2-dimensional valence–arousal emotion model into a speech synthesis system and take inspiration from voice conversion, converting both prosodic and spectral acoustic features to achieve dimensional emotional speech expression. First, the acoustic features corresponding to the input text are predicted by the front end of the speech synthesis system, or the acoustic features of the input speech are extracted by WORLD, a vocoder-based analysis–synthesis tool that produces prosodic and spectral features simultaneously. Then, the acoustic features of affective speech at different dimensional values are analyzed, and the fundamental frequency and spectral envelope parameters of the source speech are converted according to the average ratio between the acoustic features of the input speech and those of the target affective dimension. Finally, the WORLD vocoder synthesizes the converted emotion feature parameters into audio waveforms, realizing emotional speech synthesis at different dimensional values in the 2-dimensional valence–arousal space. Objective and subjective evaluations show that dimensional affective speech synthesized with this method is perceived well, especially in the arousal dimension.
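The record does not include the authors' code, but the three-stage pipeline described above (WORLD analysis, ratio-based conversion of F0 and spectral envelope, WORLD synthesis) maps directly onto the pyworld Python bindings of the WORLD vocoder. The sketch below is illustrative only: the ratio values `f0_ratio` and `sp_ratio` and the file names are hypothetical stand-ins for the paper's corpus-derived average ratios between the source speech and the target point in the valence–arousal plane.

```python
# Minimal sketch of the analysis-conversion-synthesis pipeline described in
# the abstract, using pyworld (WORLD vocoder bindings). The conversion ratios
# are hypothetical placeholders, not the paper's corpus-derived values.
import numpy as np
import pyworld as pw
import soundfile as sf

def convert_affect(in_wav, out_wav, f0_ratio=1.2, sp_ratio=1.1):
    """Scale F0 and spectral envelope by fixed ratios, then resynthesize."""
    x, fs = sf.read(in_wav)
    if x.ndim > 1:                  # downmix stereo input to mono
        x = x.mean(axis=1)
    x = np.ascontiguousarray(x, dtype=np.float64)  # WORLD expects float64

    # Analysis: F0 contour, spectral envelope, and aperiodicity.
    f0, t = pw.harvest(x, fs)
    sp = pw.cheaptrick(x, f0, t, fs)
    ap = pw.d4c(x, f0, t, fs)

    # Conversion: scale the prosodic (F0) and spectral envelope parameters by
    # the average ratios associated with the target affective dimension.
    # Unvoiced frames (f0 == 0) stay at zero under multiplication.
    f0_conv = f0 * f0_ratio
    sp_conv = np.ascontiguousarray(sp * sp_ratio)

    # Synthesis: WORLD renders the converted parameters back into a waveform.
    y = pw.synthesize(f0_conv, sp_conv, ap, fs)
    sf.write(out_wav, y, fs)

convert_affect('neutral.wav', 'high_arousal.wav')
```

Raising the mean F0 and spectral energy in this way would shift a neutral utterance toward the high-arousal region of the valence–arousal plane; the per-dimension ratios the system actually uses are not specified in this record.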
| Main Authors: | Xin Zhang, Yaobin Wan, Wei Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | American Association for the Advancement of Science (AAAS), 2024-01-01 |
| Series: | Intelligent Computing |
| Online Access: | https://spj.science.org/doi/10.34133/icomputing.0092 |
| Field | Value |
|---|---|
| author | Xin Zhang; Yaobin Wan; Wei Wang |
| collection | DOAJ |
| format | Article |
| institution | Kabale University |
| issn | 2771-5892 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | American Association for the Advancement of Science (AAAS) |
| series | Intelligent Computing |
| affiliation | College of Education Science, Nanjing Normal University, Nanjing 210097, China (all three authors) |
| doi | 10.34133/icomputing.0092 |
| url | https://spj.science.org/doi/10.34133/icomputing.0092 |