Pretraining Enhanced RNN Transducer

Recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed to improve the RNN-T architecture; however, few studies exploit pretrained methods within this framework. In this paper, we introduce the pretrained acoustic extractor (PAE) and the pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder from two different latent representations: one extracted by the PAE from the raw waveform, and the other obtained from a filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous works, our approaches obtain pretrained representations for better model generalization. Evaluations on two large-scale datasets demonstrate that the proposed approaches outperform existing ones.
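
The abstract describes two fusion points: an acoustic encoder input built from both PAE latents and filter-bank features, and a joint network that additionally consumes a semantic feature from the PLN. The sketch below is illustrative only and is not the authors' released code; the module name FusedJoint, the dimensions, and the concatenation-based fusion are assumptions, written in PyTorch simply to make the two fusion points concrete.

```python
# Illustrative sketch, NOT the paper's implementation: assumed shapes,
# hypothetical names, and simple concatenation-based fusion in PyTorch.
import torch
import torch.nn as nn


class FusedJoint(nn.Module):
    """Transducer joint network fusing acoustic, label, and PLN semantic features."""

    def __init__(self, enc_dim, pred_dim, sem_dim, joint_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(enc_dim + pred_dim + sem_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, enc, pred, sem):
        # enc:  (B, T, enc_dim)   acoustic encoder output
        # pred: (B, U, pred_dim)  prediction (label) network output
        # sem:  (B, U, sem_dim)   semantic feature from the pretrained linguistic network
        T, U = enc.size(1), pred.size(1)
        enc = enc.unsqueeze(2).expand(-1, -1, U, -1)     # (B, T, U, enc_dim)
        pred = pred.unsqueeze(1).expand(-1, T, -1, -1)   # (B, T, U, pred_dim)
        sem = sem.unsqueeze(1).expand(-1, T, -1, -1)     # (B, T, U, sem_dim)
        fused = torch.tanh(self.proj(torch.cat([enc, pred, sem], dim=-1)))
        return self.out(fused)                           # (B, T, U, vocab_size) logits


# Encoder-input fusion (assumed): frame-aligned PAE latents concatenated with
# filter-bank features before the Conformer-LSTM encoder.
pae_latent = torch.randn(2, 100, 256)                    # from the pretrained acoustic extractor
fbank = torch.randn(2, 100, 80)                          # standard filter-bank features
encoder_input = torch.cat([pae_latent, fbank], dim=-1)   # (2, 100, 336)

joint = FusedJoint(enc_dim=512, pred_dim=320, sem_dim=768, joint_dim=640, vocab_size=5000)
logits = joint(torch.randn(2, 100, 512),                 # encoder output
               torch.randn(2, 12, 320),                  # prediction network output
               torch.randn(2, 12, 768))                  # PLN semantic feature
print(logits.shape)                                      # torch.Size([2, 100, 12, 5000])
```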

Bibliographic Details
Main Authors: Junyu Lu, Rongzhong Lian, Di Jiang, Yuanfeng Song, Zhiyang Su, Victor Junqiu Wei, Lin Yang
Format: Article
Language: English
Published: Tsinghua University Press, 2024-12-01
Series: CAAI Artificial Intelligence Research
Subjects: pretraining; automatic speech recognition; self-supervised learning
Online Access: https://www.sciopen.com/article/10.26599/AIR.2024.9150039
ISSN: 2097-194X, 2097-3691
Volume 3, Article 9150039
DOI: 10.26599/AIR.2024.9150039
Author affiliations:
Junyu Lu, Rongzhong Lian, Di Jiang, Yuanfeng Song: WeBank Co., Ltd., Shenzhen 518000, China
Zhiyang Su, Victor Junqiu Wei, Lin Yang: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China