Causal speech enhancement using dynamical-weighted loss and attention encoder-decoder recurrent neural network

Fahad Khalil Peracha; Muhammad Irfan Khattak; Nema Salem; Nasir Saleem

doi:10.1371/journal.pone.0285629

PLoS One. 2023 May 11;18(5):e0285629. doi: 10.1371/journal.pone.0285629. eCollection 2023.

Authors

Fahad Khalil Peracha¹, Muhammad Irfan Khattak¹, Nema Salem², Nasir Saleem¹

Affiliations

¹ Department of Electrical Engineering, University of Engineering and Technology, Peshawar, KPK, Pakistan.
² Electrical and Computer Engineering Department, Effat College of Engineering, Effat University, Jeddah, KSA.

Abstract

Speech enhancement (SE) reduces background noise in target speech and is applied at the front end of many real-world applications, including robust automatic speech recognition (ASR) and real-time mobile phone communication. SE systems are commonly integrated into mobile phones to improve speech quality and intelligibility, so they must operate with low latency and be optimized efficiently. This research focuses on single-microphone SE for real-time systems with improved optimization. We propose a causal data-driven model that uses an attention encoder-decoder long short-term memory (LSTM) network to estimate a time-frequency mask from noisy speech; the mask is then used to recover clean speech in real-time applications that require low-latency causal processing. The model combines the encoder-decoder LSTM with a causal attention mechanism. Furthermore, a dynamical-weighted (DW) loss function is proposed to improve model learning by varying the loss weight values. Experiments demonstrate that the proposed model consistently improves voice quality, intelligibility, and noise suppression. In the causal processing mode, the time-frequency suppression mask estimated by the LSTM outperforms the baseline model for unseen noise types. The proposed SE improved short-time objective intelligibility (STOI) by 2.64% over the baseline LSTM-IRM, 6.6% over LSTM-KF, 4.18% over DeepXi-KF, and 3.58% over DeepResGRU-KF. In addition, we examine word error rates (WERs) using Google's ASR. The results show that the error rate decreased from 46.33% for noisy signals to 13.11% with the proposed model, compared with 15.73% for LSTM and 14.97% for LSTM-KF.
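For readers unfamiliar with the WER figures quoted above, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation script and it does not call Google's ASR.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative helper (hypothetical name); tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a WER of 13.11% corresponds to roughly 13 word errors per 100 reference words. Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.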

Copyright: © 2023 Peracha et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Memory, Long-Term
Neural Networks, Computer
Noise
Speech Intelligibility
Speech Perception*
Speech*

Grants and funding

The author(s) received no specific funding for this work.