Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification

Dialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advances in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes the DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer applies a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in a parallel pattern. Results obtained on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach.


Bibliographic Details
Main Authors: Ananya Angra, H. Muralikrishna, Dileep Aroor Dinesh, Veena Thenkanidiyoor
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10818458/
_version_ 1841557025613414400
author Ananya Angra
H. Muralikrishna
Dileep Aroor Dinesh
Veena Thenkanidiyoor
author_facet Ananya Angra
H. Muralikrishna
Dileep Aroor Dinesh
Veena Thenkanidiyoor
author_sort Ananya Angra
collection DOAJ
description Dialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advances in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes the DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer applies a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in a parallel pattern. Results obtained on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach.
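The abstract does not spell out how the layer-wise wav2vec 2.0 features are combined. A minimal sketch of one common aggregation scheme, a learnable softmax-weighted sum over the stack of per-layer hidden states, is given below; the function names, the numpy stand-in for a learned tensor, and the (layers, time, dim) shapes are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def softmax(w):
    # Numerically stable softmax over a 1D weight vector.
    e = np.exp(w - w.max())
    return e / e.sum()

def aggregate_layers(layer_feats, weights):
    """Weighted sum over wav2vec 2.0 layer outputs.

    layer_feats: array of shape (L, T, D) - hidden states of all L layers.
    weights:     array of shape (L,) - learnable scalars, one per layer.
    Returns an aggregated representation of shape (T, D).
    """
    alpha = softmax(weights)                       # normalize layer weights
    return np.tensordot(alpha, layer_feats, axes=1)  # sum_l alpha[l] * feats[l]

# Toy usage: three "layers" of 4 frames x 3 dims; zero logits give a uniform mean.
feats = np.stack([np.full((4, 3), float(i)) for i in range(3)])
agg = aggregate_layers(feats, np.zeros(3))
```

In a real model the `weights` vector would be a trainable parameter updated jointly with the downstream DID classifier, so the network itself learns which transformer layers carry dialect cues.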
format Article
id doaj-art-84b28370936c48ecb3d90e8f9af15ba6
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-84b28370936c48ecb3d90e8f9af15ba6
indexed: 2025-01-07T00:02:32Z
language: eng
publisher: IEEE
series: IEEE Access (ISSN 2169-3536)
published: 2025-01-01, vol. 13, pp. 3115-3129
doi: 10.1109/ACCESS.2024.3523951
article no.: 10818458
title: Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
authors:
Ananya Angra (https://orcid.org/0009-0008-7141-2877), MANAS Laboratory, SCEE, Indian Institute of Technology Mandi, Mandi, India
H. Muralikrishna (https://orcid.org/0000-0002-6340-4227), Department of Electronics and Communication Engineering, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, India
Dileep Aroor Dinesh, Department of CSE, Indian Institute of Technology Dharwad, Dharwad, India
Veena Thenkanidiyoor, Department of CSE, National Institute of Technology Goa, Ponda, Goa, India
abstract: Dialect identification (DID) is a challenging task due to the high inter-class similarity between dialects. The efficiency of a DID system depends on how well the input features encode the DID-specific content that is spread across the utterance. In this paper, we explore different representations for efficient DID, motivated by recent advances in related areas. First, we propose to learn a representation by aggregating the layer-wise features from wav2vec 2.0, and we propose multiple approaches to combine these layer-wise features. Since different layers of wav2vec 2.0 are known to capture different acoustic-linguistic characteristics, such an aggregated representation encodes the DID-specific content more effectively. Following this, we explore the use of the recently proposed global-aware filter (GAF) layer based dual-stream time delay neural network (DS-TDNN) for DID. The GAF layer applies a set of learnable transform-domain filters between a 1D discrete Fourier transform and its inverse to capture global context, along with dynamic filtering and sparse regularization. The DS-TDNN has two separate input branches, one capturing global context and the other local context, which are combined in a parallel pattern. Results obtained on dialects of Kannada, Tamil, Konkani, and Marathi, four low-resource languages of India, show that the aggregated wav2vec 2.0 features perform better than the DS-TDNN approach.
url: https://ieeexplore.ieee.org/document/10818458/
topics: Spoken dialect identification; wav2vec 2.0; feature representations; DS-TDNN
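The GAF layer is only summarized in the abstract. The core idea, filtering frame-level features in the transform domain between a 1D DFT along time and its inverse, can be sketched as below. This is a simplified illustration under assumed (T, D) shapes; the dynamic filtering and sparse-regularization components mentioned in the abstract are omitted, and in the actual model the filter would be a complex-valued learnable parameter rather than the fixed array shown here.

```python
import numpy as np

def gaf_filter(x, filt):
    """Transform-domain filtering along the time axis (GAF-style sketch).

    x:    array of shape (T, D) - frame-level features over T time steps.
    filt: array of shape (T//2 + 1, D) - per-frequency-bin filter
          (learnable in a real model; fixed here for illustration).
    Returns filtered features of shape (T, D).
    """
    X = np.fft.rfft(x, axis=0)                      # 1D DFT over time
    Y = X * filt                                    # elementwise spectral filtering
    return np.fft.irfft(Y, n=x.shape[0], axis=0)    # inverse DFT back to time

# Toy usage: an all-pass (all-ones) filter should reconstruct the input exactly.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4))
y = gaf_filter(x, np.ones((9, 4)))   # rfft of length 16 yields 9 bins
```

Because every frequency bin mixes information from all T frames, a single such layer has a global receptive field over the utterance, which is what lets the DS-TDNN's global branch complement the local TDNN branch.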
spellingShingle Ananya Angra
H. Muralikrishna
Dileep Aroor Dinesh
Veena Thenkanidiyoor
Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
IEEE Access
Spoken dialect identification
wav2vec 2.0
feature representations
DS-TDNN
title Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
title_full Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
title_fullStr Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
title_full_unstemmed Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
title_short Exploring Aggregated wav2vec 2.0 Features and Dual-Stream TDNN for Efficient Spoken Dialect Identification
title_sort exploring aggregated wav2vec 2 0 features and dual stream tdnn for efficient spoken dialect identification
topic Spoken dialect identification
wav2vec 2.0
feature representations
DS-TDNN
url https://ieeexplore.ieee.org/document/10818458/
work_keys_str_mv AT ananyaangra exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification
AT hmuralikrishna exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification
AT dileeparoordinesh exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification
AT veenathenkanidiyoor exploringaggregatedwav2vec20featuresanddualstreamtdnnforefficientspokendialectidentification