Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
Dialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advancements in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer-based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer employs a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse transform to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in parallel. Results on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach.
| Main Authors: | Ananya Angra, H. Muralikrishna, Dileep Aroor Dinesh, Veena Thenkanidiyoor |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Spoken dialect identification; wav2vec 2.0; feature representations; DS-TDNN |
| Online Access: | https://ieeexplore.ieee.org/document/10818458/ |
_version_ | 1841557025613414400 |
---|---|
author | Ananya Angra H. Muralikrishna Dileep Aroor Dinesh Veena Thenkanidiyoor |
author_facet | Ananya Angra H. Muralikrishna Dileep Aroor Dinesh Veena Thenkanidiyoor |
author_sort | Ananya Angra |
collection | DOAJ |
description | Dialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advancements in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer-based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer employs a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse transform to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in parallel. Results on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach. |
format | Article |
id | doaj-art-84b28370936c48ecb3d90e8f9af15ba6 |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-84b28370936c48ecb3d90e8f9af15ba62025-01-07T00:02:32ZengIEEEIEEE Access2169-35362025-01-01133115312910.1109/ACCESS.2024.352395110818458Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect IdentificationAnanya Angra0https://orcid.org/0009-0008-7141-2877H. Muralikrishna1https://orcid.org/0000-0002-6340-4227Dileep Aroor Dinesh2Veena Thenkanidiyoor3MANAS Laboratory, SCEE, Indian Institute of Technology Mandi, Mandi, IndiaDepartment of Electronics and Communication Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, IndiaDepartment of CSE, Indian Institute of Technology Dharwad, Dharwad, IndiaDepartment of CSE, National Institute of Technology Goa, Ponda, Goa, IndiaDialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advancements in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer-based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer employs a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse transform to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in parallel. Results on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach.https://ieeexplore.ieee.org/document/10818458/Spoken dialect identificationwav2vec 2.0feature representationsDS-TDNN |
spellingShingle | Ananya Angra H. Muralikrishna Dileep Aroor Dinesh Veena Thenkanidiyoor Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification IEEE Access Spoken dialect identification wav2vec 2.0 feature representations DS-TDNN |
title | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_full | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_fullStr | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_full_unstemmed | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_short | Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification |
title_sort | exploring aggregated wav2vec 2 0 features and dual stream tdnn for efficient spoken dialect identification |
topic | Spoken dialect identification wav2vec 2.0 feature representations DS-TDNN |
url | https://ieeexplore.ieee.org/document/10818458/ |
work_keys_str_mv | AT ananyaangra exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification AT hmuralikrishna exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification AT dileeparoordinesh exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification AT veenathenkanidiyoor exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification |
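The transform-domain filtering idea the abstract attributes to the GAF layer (a 1D discrete Fourier transform, elementwise multiplication by learnable filters, then the inverse transform) can be sketched as below. This is an illustrative sketch only, not the authors' implementation: the function name `global_filter`, the random filter initialization, and the shapes are assumptions, and the learnable filter `w` would in practice be a trained (and, per the abstract, sparsity-regularized) parameter.

```python
import numpy as np

def global_filter(x, w):
    """Transform-domain filtering: DFT -> elementwise filter -> inverse DFT.

    x: real-valued feature sequence, shape (T, D) (time frames x feature dims).
    w: complex filter weights, shape (T // 2 + 1, D); learnable in a real model.
    Returns a real array with the same shape as x.
    """
    X = np.fft.rfft(x, axis=0)                     # 1D DFT along the time axis
    Y = X * w                                      # frequency-domain filtering (global context)
    return np.fft.irfft(Y, n=x.shape[0], axis=0)   # back to the time domain

rng = np.random.default_rng(0)
T, D = 100, 8
x = rng.standard_normal((T, D))
# Randomly initialized here purely for illustration; a trained model would learn w.
w = rng.standard_normal((T // 2 + 1, D)) + 1j * rng.standard_normal((T // 2 + 1, D))
y = global_filter(x, w)
print(y.shape)  # (100, 8)
```

Because each frequency bin mixes information from every time frame, one multiplication in the transform domain touches the whole utterance, which is how such a layer captures global context without the quadratic cost of self-attention.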