A High-Performance Neural Network SoC for End-to-End Speaker Verification

Bibliographic Details
Main Authors: Tsung-Han Tsai, Meng-Jui Chiang
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10744044/
Description
Summary: Using neural networks to recognize a speaker's identity from their speech has become popular in the last few years. Among these methods, the x-vector extractor, which is based on time-delay neural networks (TDNN), handles noise better and generally achieves higher accuracy than earlier methods such as the Gaussian mixture model (GMM) and the support vector machine (SVM). This paper presents a system-on-chip (SoC) composed of a RISC-V CPU and a neural network accelerator module for x-vector-based speaker verification (SV). To ensure real-time latency and make the system implementable on edge devices, this work applies three steps to the x-vector model: size reduction, pruning, and compression. We focus on optimizing the data flow to exploit sparsity. In place of the conventional sparse matrix compression method, compressed sparse row (CSR), we propose the binary pointer compressed sparse row (BPCSR) method, which significantly improves latency and avoids the load balancing issue across the processing elements (PEs). We further design the neural network accelerator module, which stores the compressed parameters and computes the x-vector extractor, while the RISC-V CPU handles the remaining calculations, such as feature extraction and the classifier. The system was tested on the VoxCeleb dataset, containing 1251 test speakers, and achieved over 95% accuracy. Lastly, we synthesized the chip in TSMC 90 nm technology. It occupies an area of 15.5 mm² and consumes 97.88 mW during real-time identification.
ISSN: 2169-3536
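
Note on CSR: the summary compares the proposed BPCSR method against conventional compressed sparse row (CSR) storage. As background only, the following is a minimal Python sketch of conventional CSR applied to a pruned weight matrix. It does not show BPCSR, whose details are not given in this record. The function names `to_csr` and `csr_matvec` and the example matrix are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def to_csr(dense: Sequence[Sequence[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Convert a dense (pruned) weight matrix to conventional CSR arrays.

    Returns (values, col_idx, row_ptr); row_ptr[r]..row_ptr[r+1] delimits
    the nonzeros of row r. Hypothetical helper for illustration only.
    """
    values: List[float] = []
    col_idx: List[int] = []
    row_ptr: List[int] = [0]
    for row in dense:
        for j, w in enumerate(row):
            if w != 0:
                values.append(w)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values: Sequence[float], col_idx: Sequence[int],
               row_ptr: Sequence[int], x: Sequence[float]) -> List[float]:
    """Multiply a CSR matrix by a dense vector, touching only stored nonzeros."""
    y: List[float] = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# Example: a 3x4 weight matrix after pruning (zeros are pruned weights).
weights = [
    [0, 2, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 0, 0],
]
values, col_idx, row_ptr = to_csr(weights)
output = csr_matvec(values, col_idx, row_ptr, [1, 2, 3, 4])
```

In this example the rows hold one, two, and zero nonzeros respectively, so a design that assigns one row per processing element would give each PE a different amount of work. This is one plausible reading of the load balancing issue the summary mentions for CSR.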