Pretraining Enhanced RNN Transducer

Recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed to improve the RNN-T architecture; however, few studies exploit pretrained methods within this framework. In this paper, we introduce the pretrained acoustic extractor (PAE) and the pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder from two different latent representations: one extracted by the PAE from the raw waveform, and the other obtained from a filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous works, our approaches obtain pretrained representations for better model generalization. Evaluations on two large-scale datasets demonstrate that the proposed approaches outperform existing ones.
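
The abstract describes two fusion points: an acoustic encoder input built from both PAE latents and filter-bank features, and a joint network that additionally consumes a semantic feature from the PLN. The sketch below is illustrative only and is not the authors' released code; the module name FusedJoint, the dimensions, and the concatenation-based fusion are assumptions, written in PyTorch simply to make the two fusion points concrete.

```python
# Illustrative sketch, NOT the paper's implementation: assumed shapes,
# hypothetical names, and simple concatenation-based fusion in PyTorch.
import torch
import torch.nn as nn


class FusedJoint(nn.Module):
    """Transducer joint network fusing acoustic, label, and PLN semantic features."""

    def __init__(self, enc_dim, pred_dim, sem_dim, joint_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim + sem_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, enc, pred, sem):
        # enc:  (B, T, enc_dim)   acoustic encoder output
        # pred: (B, U, pred_dim)  prediction (label) network output
        # sem:  (B, U, sem_dim)   semantic feature from the pretrained linguistic network
        T, U = enc.size(1), pred.size(1)
        enc = enc.unsqueeze(2).expand(-1, -1, U, -1)     # (B, T, U, enc_dim)
        pred = pred.unsqueeze(1).expand(-1, T, -1, -1)   # (B, T, U, pred_dim)
        sem = sem.unsqueeze(1).expand(-1, T, -1, -1)     # (B, T, U, sem_dim)
        fused = torch.tanh(self.proj(torch.cat([enc, pred, sem], dim=-1)))
        return self.out(fused)                           # (B, T, U, vocab_size) logits


# Encoder-input fusion (assumed): frame-aligned PAE latents concatenated with
# filter-bank features before the Conformer-LSTM encoder.
pae_latent = torch.randn(2, 100, 256)                    # from the pretrained acoustic extractor
fbank = torch.randn(2, 100, 80)                          # standard filter-bank features
encoder_input = torch.cat([pae_latent, fbank], dim=-1)   # (2, 100, 336)

joint = FusedJoint(enc_dim=512, pred_dim=320, sem_dim=768, joint_dim=640, vocab_size=5000)
logits = joint(torch.randn(2, 100, 512),                 # encoder output
               torch.randn(2, 12, 320),                  # prediction network output
               torch.randn(2, 12, 768))                  # PLN semantic feature
print(logits.shape)                                      # torch.Size([2, 100, 12, 5000])
```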

Bibliographic Details
Main Authors: Junyu Lu, Rongzhong Lian, Di Jiang, Yuanfeng Song, Zhiyang Su, Victor Junqiu Wei, Lin Yang
Format: Article
Language: English
Published: Tsinghua University Press, 2024-12-01
Series: CAAI Artificial Intelligence Research
Subjects: pretraining; automatic speech recognition; self-supervised learning
Online Access: https://www.sciopen.com/article/10.26599/AIR.2024.9150039
ISSN: 2097-194X, 2097-3691
Volume 3, Article 9150039
DOI: 10.26599/AIR.2024.9150039
Author affiliations:
Junyu Lu, Rongzhong Lian, Di Jiang, Yuanfeng Song: WeBank Co., Ltd., Shenzhen 518000, China
Zhiyang Su, Victor Junqiu Wei, Lin Yang: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China