Pretraining Enhanced RNN Transducer

Bibliographic Details
Main Authors: Junyu Lu, Rongzhong Lian, Di Jiang, Yuanfeng Song, Zhiyang Su, Victor Junqiu Wei, Lin Yang
Format: Article
Language: English
Published: Tsinghua University Press, 2024-12-01
Series: CAAI Artificial Intelligence Research
Online Access: https://www.sciopen.com/article/10.26599/AIR.2024.9150039
Description
Summary: Recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed to boost the RNN-T architecture; however, few studies exploit the effectiveness of pretraining methods in this framework. In this paper, we introduce the pretrained acoustic extractor (PAE) and the pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder from two different latent representations: one extracted by the PAE from the raw waveform, and the other obtained by filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous works, our approaches obtain pretrained representations that improve model generalization. Evaluation on two large-scale datasets demonstrates that the proposed approaches outperform existing methods.
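
The abstract names two fusion points: concatenating PAE features with filter-bank features at the acoustic encoder input, and feeding a PLN semantic feature into the joint network alongside the encoder and prediction-network outputs. Below is a minimal PyTorch sketch of that structure, not the authors' implementation: the module name FusedJointNetwork, all dimensions, and the use of simple concatenation for both fusions are illustrative assumptions, since the abstract does not specify the exact fusion mechanism.

    import torch
    import torch.nn as nn

    class FusedJointNetwork(nn.Module):
        # RNN-T joint network extended with a semantic feature stream
        # (concatenation-based fusion is an assumption for illustration).
        def __init__(self, enc_dim, pred_dim, sem_dim, joint_dim, vocab_size):
            super().__init__()
            self.proj = nn.Linear(enc_dim + pred_dim + sem_dim, joint_dim)
            self.out = nn.Linear(joint_dim, vocab_size)

        def forward(self, enc, pred, sem):
            # enc: (B, T, enc_dim); pred: (B, U, pred_dim); sem: (B, U, sem_dim)
            T, U = enc.size(1), pred.size(1)
            enc = enc.unsqueeze(2).expand(-1, -1, U, -1)       # (B, T, U, enc_dim)
            lab = torch.cat([pred, sem], dim=-1).unsqueeze(1)  # (B, 1, U, pred+sem)
            lab = lab.expand(-1, T, -1, -1)                    # (B, T, U, pred+sem)
            joint = torch.tanh(self.proj(torch.cat([enc, lab], dim=-1)))
            return self.out(joint)                             # (B, T, U, vocab)

    B, T, U = 2, 100, 20
    pae_feats = torch.randn(B, T, 512)  # latent features from the pretrained acoustic extractor
    fbank = torch.randn(B, T, 80)       # filter-bank features, assumed frame-aligned with PAE output
    enc_input = torch.cat([pae_feats, fbank], dim=-1)  # fused acoustic encoder input, (B, T, 592)

    enc_out = torch.randn(B, T, 256)    # stand-in for the Conformer encoder output
    pred_out = torch.randn(B, U, 256)   # stand-in for the LSTM prediction network output
    sem = torch.randn(B, U, 128)        # semantic feature from the pretrained linguistic network

    logits = FusedJointNetwork(256, 256, 128, 320, 5000)(enc_out, pred_out, sem)
    print(logits.shape)                 # torch.Size([2, 100, 20, 5000])

The (B, T, U, vocab) logits tensor is the standard input to an RNN-T loss; the sketch only adds the extra semantic stream on the label side of the joint computation.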
ISSN: 2097-194X, 2097-3691