Pretraining Enhanced RNN Transducer
Recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed for boosting RNN-T architecture; however, few studies exploit the effectiveness of pretrained methods in this framework. In this paper, we introduce the pretrained acoustic extractor (PAE) and the pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder with two different latent representations: one extracted by PAE from the raw waveform, and the other obtained from filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous works, our approaches are able to obtain pretrained representations for better model generalization. Evaluation on two large-scale datasets has demonstrated that our proposed approaches yield better performance than existing approaches.
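The abstract describes two fusion points (PAE features joined with filter-bank features at the acoustic encoder input, and a PLN semantic feature added to the joint network) but gives no architectural details. The PyTorch sketch below is only a minimal illustration of how such fusions are commonly wired, assuming frame-aligned acoustic features, concatenation followed by a linear projection as the fusion operator, and made-up dimensions; the module names `FusedAcousticFrontend` and `JointNetworkWithPLN` are placeholders, not taken from the paper.

```python
# Hedged sketch of the two fusion points mentioned in the abstract.
# Dimensions, module names, and the choice of concatenation + projection
# as the fusion operator are illustrative assumptions, not paper details.
import torch
import torch.nn as nn


class FusedAcousticFrontend(nn.Module):
    """Fuses pretrained-extractor (PAE-like) features with filter-bank
    features before the acoustic encoder (assumes both are frame-aligned)."""

    def __init__(self, pae_dim=256, fbank_dim=80, encoder_dim=512):
        super().__init__()
        self.proj = nn.Linear(pae_dim + fbank_dim, encoder_dim)

    def forward(self, pae_feats, fbank_feats):
        # pae_feats:   (batch, frames, pae_dim)   from the pretrained extractor
        # fbank_feats: (batch, frames, fbank_dim) from filter-bank transformation
        return self.proj(torch.cat([pae_feats, fbank_feats], dim=-1))


class JointNetworkWithPLN(nn.Module):
    """RNN-T joint network that also consumes a semantic feature coming
    from a pretrained linguistic network (PLN-like) at each label step."""

    def __init__(self, enc_dim=512, pred_dim=512, pln_dim=256,
                 joint_dim=512, vocab_size=5000):
        super().__init__()
        self.fc = nn.Linear(enc_dim + pred_dim + pln_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab_size)

    def forward(self, enc, pred, pln):
        # enc: (B, T, 1, enc_dim), pred: (B, 1, U, pred_dim), pln: (B, 1, U, pln_dim)
        T, U = enc.size(1), pred.size(2)
        enc = enc.expand(-1, -1, U, -1)     # broadcast over label positions
        pred = pred.expand(-1, T, -1, -1)   # broadcast over time frames
        pln = pln.expand(-1, T, -1, -1)
        joint = torch.tanh(self.fc(torch.cat([enc, pred, pln], dim=-1)))
        return self.out(joint)              # (B, T, U, vocab) logits for the RNN-T loss


if __name__ == "__main__":
    frontend = FusedAcousticFrontend()
    joint = JointNetworkWithPLN()
    fused = frontend(torch.randn(2, 100, 256), torch.randn(2, 100, 80))
    logits = joint(torch.randn(2, 50, 1, 512), torch.randn(2, 1, 20, 512),
                   torch.randn(2, 1, 20, 256))
    print(fused.shape, logits.shape)  # torch.Size([2, 100, 512]) torch.Size([2, 50, 20, 5000])
```

The concatenate-and-project pattern is just one plausible choice; the paper may use a different fusion operator (e.g., a weighted sum or gating), and the actual Conformer encoder and LSTM prediction network are omitted here for brevity.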
Main Authors: | Junyu Lu, Rongzhong Lian, Di Jiang, Yuanfeng Song, Zhiyang Su, Victor Junqiu Wei, Lin Yang |
---|---|
Format: | Article |
Language: | English |
Published: | Tsinghua University Press, 2024-12-01 |
Series: | CAAI Artificial Intelligence Research |
Subjects: | pretraining; automatic speech recognition; self-supervised learning |
Online Access: | https://www.sciopen.com/article/10.26599/AIR.2024.9150039 |
author | Junyu Lu, Rongzhong Lian, Di Jiang, Yuanfeng Song, Zhiyang Su, Victor Junqiu Wei, Lin Yang |
collection | DOAJ |
description | Recurrent neural network transducer (RNN-T) is an important branch of current end-to-end automatic speech recognition (ASR). Various promising approaches have been designed for boosting RNN-T architecture; however, few studies exploit the effectiveness of pretrained methods in this framework. In this paper, we introduce the pretrained acoustic extractor (PAE) and the pretrained linguistic network (PLN) to enhance the Conformer long short-term memory (Conformer-LSTM) transducer. First, we construct the input of the acoustic encoder with two different latent representations: one extracted by PAE from the raw waveform, and the other obtained from filter-bank transformation. Second, we fuse an extra semantic feature from the PLN into the joint network to reduce illogical and homophonic errors. Compared with previous works, our approaches are able to obtain pretrained representations for better model generalization. Evaluation on two large-scale datasets has demonstrated that our proposed approaches yield better performance than existing approaches. |
format | Article |
id | doaj-art-b7b3ac275e4e45d0ba0b01f4f3f57f23 |
institution | Kabale University |
issn | 2097-194X 2097-3691 |
language | English |
publishDate | 2024-12-01 |
publisher | Tsinghua University Press |
record_format | Article |
series | CAAI Artificial Intelligence Research |
spelling | doaj-art-b7b3ac275e4e45d0ba0b01f4f3f57f23 (2025-01-10T06:44:32Z). Pretraining Enhanced RNN Transducer. Junyu Lu, Rongzhong Lian, Di Jiang, Yuanfeng Song (WeBank Co., Ltd., Shenzhen 518000, China); Zhiyang Su, Victor Junqiu Wei, Lin Yang (Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China). CAAI Artificial Intelligence Research, Tsinghua University Press, 2024-12-01, Vol. 3, Article 9150039, DOI 10.26599/AIR.2024.9150039, ISSN 2097-194X, 2097-3691 (English). Online: https://www.sciopen.com/article/10.26599/AIR.2024.9150039. Keywords: pretraining; automatic speech recognition; self-supervised learning. |
title | Pretraining Enhanced RNN Transducer |
topic | pretraining; automatic speech recognition; self-supervised learning |
url | https://www.sciopen.com/article/10.26599/AIR.2024.9150039 |