Authors: Nikolai Lund Kühne, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan
Abstract: While attention-based architectures, such as Conformers, excel in speech
enhancement, they face challenges such as quadratic scaling with respect to input
sequence length. In contrast, the recently proposed Extended Long Short-Term
Memory (xLSTM) architecture offers linear scalability. However, xLSTM-based
models remain unexplored for speech enhancement. This paper introduces
xLSTM-SENet, the first xLSTM-based single-channel speech enhancement system. A
comparative analysis reveals that xLSTM, and notably even LSTM, can match or
outperform state-of-the-art Mamba- and Conformer-based systems across various
model sizes in speech enhancement on the VoiceBank+DEMAND dataset. Through
ablation studies, we identify key architectural design choices, such as
exponential gating and bidirectionality, that contribute to its effectiveness. Our
best xLSTM-based model, xLSTM-SENet2, outperforms state-of-the-art Mamba- and
Conformer-based systems on the VoiceBank+DEMAND dataset.
Source: http://arxiv.org/abs/2501.06146v1
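
For readers unfamiliar with the exponential gating highlighted in the ablation, the sketch below illustrates one way a stabilized, exponentially gated recurrent update can look, loosely following the sLSTM formulation from the xLSTM literature. The function and variable names are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def exp_gated_update(c_prev, n_prev, m_prev, i_tilde, f_tilde, z):
    """One stabilized, exponentially gated recurrent step (illustrative sketch,
    loosely in the style of the sLSTM; not the paper's code)."""
    log_f = torch.nn.functional.logsigmoid(f_tilde)   # log of sigmoid forget gate
    m = torch.maximum(log_f + m_prev, i_tilde)        # stabilizer keeps exponentials bounded
    i = torch.exp(i_tilde - m)                        # exponential input gate (stabilized)
    f = torch.exp(log_f + m_prev - m)                 # forget gate rescaled by stabilizer
    c = f * c_prev + i * z                            # cell state update
    n = f * n_prev + i                                # normalizer state
    h = c / n                                         # normalized hidden output
    return c, n, m, h

# Minimal usage with random tensors of shape (batch, hidden):
B, H = 2, 8
c = n = torch.zeros(B, H)
m = torch.full((B, H), -float("inf"))
for _ in range(5):                                    # unroll a few time steps
    i_tilde, f_tilde, z = torch.randn(B, H), torch.randn(B, H), torch.randn(B, H)
    c, n, m, h = exp_gated_update(c, n, m, i_tilde, f_tilde, z)
```

Bidirectionality, the other design choice named in the abstract, is commonly obtained by running a second such recurrence over the time-reversed sequence and combining the two outputs.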