Any-to-any voice conversion using representation separation auto-encoder

To address the difficulty of separating speaker personality characteristics from semantic content information in any-to-any voice conversion with non-parallel corpora, which leads to unsatisfactory performance, a voice conversion method called RSAE-VC (representation separation auto-enc...

Bibliographic Details
Main Authors:	Zhihua JIAN, Zixu ZHANG
Format:	Article
Language:	zho
Published:	Editorial Department of Journal on Communications 2024-02-01
Series:	Tongxin xuebao
Subjects:	voice conversion representation separation adaptive instance normalization self-content loss self-speaker loss
Online Access:	http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2024044/

_version_	1841540044772343808
author	Zhihua JIAN Zixu ZHANG
author_facet	Zhihua JIAN Zixu ZHANG
author_sort	Zhihua JIAN
collection	DOAJ
description	To address the difficulty of separating speaker personality characteristics from semantic content information in any-to-any voice conversion with non-parallel corpora, which leads to unsatisfactory performance, a voice conversion method called RSAE-VC (representation separation auto-encoder voice conversion) was proposed. The speaker's personality characteristics in the speech were regarded as time-invariant and the content information as time-variant, and instance normalization and an activation guidance layer were used in the encoder to separate them from each other. The decoder then synthesized the converted speech from the content information of the source speech and the personality characteristics of the target speech. The experimental results demonstrate that, compared with the AGAIN-VC (activation guidance and adaptive instance normalization voice conversion) method, RSAE-VC achieves average reductions of 3.11% and 2.41% in Mel cepstral distance and root mean square error of pitch frequency, respectively, and increases of 5.22% in MOS and 8.45% in ABX. In RSAE-VC, a self-content loss is applied so that the converted speech preserves more content information, and a self-speaker loss is used to better separate the speaker personality characteristics from the speech, which ensures that as little speaker personality information as possible remains in the content information and thereby improves conversion performance.
format	Article
id	doaj-art-627570ad73c9440881db81b98c566296
institution	Kabale University
issn	1000-436X
language	zho
publishDate	2024-02-01
publisher	Editorial Department of Journal on Communications
record_format	Article
series	Tongxin xuebao
spelling	doaj-art-627570ad73c9440881db81b98c566296 2025-01-14T06:22:07Z zho Editorial Department of Journal on Communications Tongxin xuebao 1000-436X 2024-02-01 4516217259383378 Any-to-any voice conversion using representation separation auto-encoder Zhihua JIAN Zixu ZHANG To address the difficulty of separating speaker personality characteristics from semantic content information in any-to-any voice conversion with non-parallel corpora, which leads to unsatisfactory performance, a voice conversion method called RSAE-VC (representation separation auto-encoder voice conversion) was proposed. The speaker's personality characteristics in the speech were regarded as time-invariant and the content information as time-variant, and instance normalization and an activation guidance layer were used in the encoder to separate them from each other. The decoder then synthesized the converted speech from the content information of the source speech and the personality characteristics of the target speech. The experimental results demonstrate that, compared with the AGAIN-VC (activation guidance and adaptive instance normalization voice conversion) method, RSAE-VC achieves average reductions of 3.11% and 2.41% in Mel cepstral distance and root mean square error of pitch frequency, respectively, and increases of 5.22% in MOS and 8.45% in ABX. In RSAE-VC, a self-content loss is applied so that the converted speech preserves more content information, and a self-speaker loss is used to better separate the speaker personality characteristics from the speech, which ensures that as little speaker personality information as possible remains in the content information and thereby improves conversion performance. http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2024044/ voice conversion representation separation adaptive instance normalization self-content loss self-speaker loss
spellingShingle	Zhihua JIAN Zixu ZHANG Any-to-any voice conversion using representation separation auto-encoder Tongxin xuebao voice conversion representation separation adaptive instance normalization self-content loss self-speaker loss
title	Any-to-any voice conversion using representation separation auto-encoder
title_full	Any-to-any voice conversion using representation separation auto-encoder
title_fullStr	Any-to-any voice conversion using representation separation auto-encoder
title_full_unstemmed	Any-to-any voice conversion using representation separation auto-encoder
title_short	Any-to-any voice conversion using representation separation auto-encoder
title_sort	any to any voice conversion using representation separation auto encoder
topic	voice conversion representation separation adaptive instance normalization self-content loss self-speaker loss
url	http://www.joconline.com.cn/zh/article/doi/10.11959/j.issn.1000-436x.2024044/
work_keys_str_mv	AT zhihuajian anytoanyvoiceconversionusingrepresentationseparationautoencoder AT zixuzhang anytoanyvoiceconversionusingrepresentationseparationautoencoder
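
The description field states that speaker characteristics are treated as time-invariant and content as time-variant, with instance normalization used to separate them. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: it assumes features shaped (channels, frames), takes per-channel mean and standard deviation over time as a stand-in for the speaker representation, and re-applies the target statistics in the style of adaptive instance normalization (named in the subject list). The function names, shapes, and random test data are hypothetical, and the activation guidance layer and both losses are omitted.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize each channel over the time axis.

    x has shape (channels, frames). Returns the normalized features
    (assumed here to carry content) plus the per-channel mean and
    standard deviation (assumed here to carry speaker information).
    """
    mean = x.mean(axis=1, keepdims=True)
    std = np.sqrt(x.var(axis=1, keepdims=True) + eps)
    return (x - mean) / std, mean, std

def adaptive_instance_norm(content, mean, std):
    """Re-apply another utterance's channel statistics to content features."""
    return content * std + mean

# Hypothetical random features standing in for encoder activations.
rng = np.random.default_rng(0)
source = rng.normal(2.0, 3.0, size=(4, 50))   # source utterance, 50 frames
target = rng.normal(-1.0, 0.5, size=(4, 80))  # target utterance, 80 frames

content, _, _ = instance_norm(source)          # drop source statistics
_, tgt_mean, tgt_std = instance_norm(target)   # keep only target statistics
converted = adaptive_instance_norm(content, tgt_mean, tgt_std)
```

In this sketch the converted features keep the frame count of the source utterance while their per-channel mean and scale follow the target utterance, which is the separation the abstract describes at a high level.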

Similar Items