Decoding speech envelopes from electroencephalogram (EEG) signals holds potential as a research tool for objectively assessing auditory processing, which could contribute to future developments in hearing loss diagnosis. However, current methods struggle to achieve both high accuracy and interpretability. To address this, we propose a deep learning model, the auditory decoding transformer (ADT) network, for reconstructing speech envelopes from EEG signals. The ADT network uses spatio-temporal convolution for feature extraction, followed by a transformer decoder that decodes the speech envelopes. Through anticausal masking, the ADT attends only to current and future EEG features, matching the natural relationship between speech and EEG, in which the neural response follows the stimulus. Performance evaluation shows that the ADT network achieves average reconstruction scores of 0.168 and 0.167 on the SparrKULee and DTU datasets, respectively, rivaling those of other nonlinear models. Furthermore, by visualizing the weights of the spatio-temporal convolution layer as time-domain filters and brain topographies, and by performing an ablation study of the temporal convolution kernels, we analyze how the ADT network behaves when decoding speech envelopes. The results indicate that low-frequency (0.5-8 Hz) and high-frequency (14-32 Hz) EEG signals are more critical for envelope reconstruction than other frequency bands and that the active brain regions are primarily distributed bilaterally in the auditory cortex, consistent with previous research. Visualization of the attention scores further corroborates previous findings. In summary, the ADT network balances high performance and interpretability, making it a promising tool for studying neural speech envelope tracking.
Keywords: deep learning; electroencephalogram; interpretability; neural envelope tracking; transformer.
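The anticausal masking mentioned in the abstract can be illustrated with a minimal sketch. This is not the ADT implementation: the function names `anticausal_mask` and `masked_softmax`, the sequence length of 4, and the uniform attention scores are all hypothetical and chosen only for illustration. The sketch assumes that output step t may attend to input steps s with s >= t (current and future features) and that masked positions receive zero attention weight.

```python
import math

def anticausal_mask(n_steps):
    """Return an n_steps x n_steps boolean matrix (hypothetical helper).

    Entry [t][s] is True when output step t may attend to input step s,
    i.e. only when s >= t (current and future steps).
    """
    return [[s >= t for s in range(n_steps)] for t in range(n_steps)]

def masked_softmax(scores, mask):
    """Row-wise softmax in which disallowed positions get zero weight."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [x for x, keep in zip(row_scores, row_mask) if keep]
        peak = max(allowed)  # subtract the row maximum for numerical stability
        exps = [math.exp(x - peak) if keep else 0.0
                for x, keep in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Assumed toy input: 4 time steps with uniform (all-zero) attention scores.
mask = anticausal_mask(4)
scores = [[0.0] * 4 for _ in range(4)]
attn = masked_softmax(scores, mask)

for row in attn:
    print([round(w, 3) for w in row])
```

In this toy setting the first output step spreads its attention evenly over all four steps, while the last output step can attend only to itself. A causal mask, as used in autoregressive language models, would be the transpose of this pattern.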