Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
Dialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advancements in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer-based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer employs a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse transform to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in parallel. Results on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach.
| Main Authors: | Ananya Angra, H. Muralikrishna, Dileep Aroor Dinesh, Veena Thenkanidiyoor |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Spoken dialect identification; wav2vec 2.0; feature representations; DS-TDNN |
| Online Access: | https://ieeexplore.ieee.org/document/10818458/ |
_version_ | 1841557025613414400 |
---|---|
author | Ananya Angra H. Muralikrishna Dileep Aroor Dinesh Veena Thenkanidiyoor |
author_facet | Ananya Angra H. Muralikrishna Dileep Aroor Dinesh Veena Thenkanidiyoor |
author_sort | Ananya Angra |
collection | DOAJ |
description | Dialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advancements in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer-based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer employs a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse transform to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in parallel. Results on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach. |
format | Article |
id | doaj-art-84b28370936c48ecb3d90e8f9af15ba6 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-84b28370936c48ecb3d90e8f9af15ba62025-01-07T00:02:32ZengIEEEIEEE Access2169-35362025-01-01133115312910.1109/ACCESS.2024.352395110818458Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect IdentificationAnanya Angra0https://orcid.org/0009-0008-7141-2877H. Muralikrishna1https://orcid.org/0000-0002-6340-4227Dileep Aroor Dinesh2Veena Thenkanidiyoor3MANAS Laboratory, SCEE, Indian Institute of Technology Mandi, Mandi, IndiaDepartment of Electronics and Communication Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, IndiaDepartment of CSE, Indian Institute of Technology Dharwad, Dharwad, IndiaDepartment of CSE, National Institute of Technology Goa, Ponda, Goa, IndiaDialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advancements in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer-based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer employs a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse transform to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in parallel. Results on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach.https://ieeexplore.ieee.org/document/10818458/Spoken dialect identificationwav2vec 2.0feature representationsDS-TDNN |
spellingShingle | Ananya Angra H. Muralikrishna Dileep Aroor Dinesh Veena Thenkanidiyoor Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification IEEE Access Spoken dialect identification wav2vec 2.0 feature representations DS-TDNN |
title | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_full | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_fullStr | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_full_unstemmed | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_short | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_sort | exploring aggregated wav2vec 2 0 features and dual stream tdnn for efficient spoken dialect identification |
topic | Spoken dialect identification wav2vec 2.0 feature representations DS-TDNN |
url | https://ieeexplore.ieee.org/document/10818458/ |
work_keys_str_mv | AT ananyaangra exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification AT hmuralikrishna exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification AT dileeparoordinesh exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification AT veenathenkanidiyoor exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification |
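The transform-domain filtering idea the abstract attributes to the GAF layer (a 1D discrete Fourier transform, elementwise multiplication by learnable filters, then the inverse transform) can be sketched as below. This is an illustrative sketch only, not the authors' implementation: the function name `global_filter`, the random filter initialization, and the shapes are assumptions, and the learnable filter `w` would in practice be a trained (and, per the abstract, sparsity-regularized) parameter.

```python
import numpy as np

def global_filter(x, w):
    """Transform-domain filtering: DFT -> elementwise filter -> inverse DFT.

    x: real-valued feature sequence, shape (T, D) (time frames x feature dims).
    w: complex filter weights, shape (T // 2 + 1, D); learnable in a real model.
    Returns a real array with the same shape as x.
    """
    X = np.fft.rfft(x, axis=0)                     # 1D DFT along the time axis
    Y = X * w                                      # frequency-domain filtering (global context)
    return np.fft.irfft(Y, n=x.shape[0], axis=0)   # back to the time domain

rng = np.random.default_rng(0)
T, D = 100, 8
x = rng.standard_normal((T, D))
# Randomly initialized here purely for illustration; a trained model would learn w.
w = rng.standard_normal((T // 2 + 1, D)) + 1j * rng.standard_normal((T // 2 + 1, D))
y = global_filter(x, w)
print(y.shape)  # (100, 8)
```

Because each frequency bin mixes information from every time frame, one multiplication in the transform domain touches the whole utterance, which is how such a layer captures global context without the quadratic cost of self-attention.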