Publications
2023
1.
Carofilis-Vasco, Andrés; Fernández-Robles, Laura; Alegre, Enrique; Fidalgo, Eduardo
MeWEHV: Mel and Wave Embeddings for Human Voice Tasks (Journal article)
In: IEEE Access, 2023 (Publisher: IEEE).
Tags: Accent Recognition, Embeddings, Speaker Identification, Speech Processing
@article{carofilis_mewehv_2023,
title = {MeWEHV: Mel and Wave Embeddings for Human Voice Tasks},
author = {Andrés Carofilis-Vasco and Laura Fernández-Robles and Enrique Alegre and Eduardo Fidalgo},
url = {https://ieeexplore.ieee.org/abstract/document/10198451},
year = {2023},
date = {2023-01-01},
urldate = {2023-01-01},
journal = {IEEE Access},
abstract = {This paper introduces MeWEHV, a model that generates robust embeddings for speech processing by combining two types of embeddings: one generated by a pre-trained raw audio waveform encoder and another extracted from Mel Frequency Cepstral Coefficients (MFCCs) using Convolutional Neural Networks (CNNs). We evaluate the performance of MeWEHV on three tasks: speaker identification, language identification, and accent identification. Various datasets were used, including VoxCeleb1 and VBHIR for speaker identification, VoxForge and LRE17 for language identification, and LASC and Common Voice for accent identification. The results show a significant performance increase in state-of-the-art embedding generation models with a low additional computational cost.},
note = {Publisher: IEEE},
keywords = {Accent Recognition, Embeddings, Speaker Identification, Speech Processing},
pubstate = {published},
tppubtype = {article}
}