Cross-Modal Knowledge Distillation with Multi-Stage Adaptive Feature Fusion for Speech Separation


Cunhang Fan, Wang Xiang, Jianhua Tao, Jiangyan Yi, Zhao Lv

School of Computer Science and Technology, Anhui University, Hefei, China


Abstract

Although audio-visual speech separation has achieved significant advances, it is often difficult to obtain the audio and visual modalities simultaneously in real scenarios, which leads to the problem of a missing visual modality. During the training phase, existing audio-visual datasets can be fully exploited to build an effective audio-visual speech separation model; the challenge is to use such multimodal training to improve a unimodal model that must operate without video at test time. To address this problem, this paper proposes a cross-modal knowledge distillation (CMKD) method for speech separation, which leverages a multimodal model to enhance a unimodal model via knowledge distillation. Specifically, during the training phase, a pre-trained audio-visual network serves as the teacher and an audio-only network serves as the student. The teacher, which receives the additional visual input, transfers knowledge to the student to improve the student's performance. During the test phase, only the audio-only student network performs speech separation. In addition, to further improve the teacher model, a multi-stage adaptive feature fusion (MAFF) method is proposed, in which global and local perspectives are combined to capture deep audio-visual correlations. We conducted extensive experiments on the audio-visual datasets LRS2, LRS3, and VoxCeleb2. The results demonstrate that the proposed method improves the audio-only student model by 15%-25% relative to the baseline in terms of SI-SNRi and SDRi.
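
As a concrete illustration of the training scheme described above, the following PyTorch sketch shows one CMKD training step under simplifying assumptions: `teacher` and `student` are assumed to be callable separation networks (audio-visual and audio-only, respectively), the distillation term is taken as an L1 loss between the teacher's and student's separated waveforms, and the weight `alpha` is arbitrary. None of these choices are claimed to be the exact recipe used in the paper.

```python
# Minimal CMKD training-step sketch (assumptions: output-level L1 distillation,
# fixed loss weight alpha; not the authors' exact configuration).
import torch
import torch.nn.functional as F


def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between estimated and reference waveforms."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) * ref / (ref.pow(2).sum(-1, keepdim=True) + eps)
    noise = est - proj
    si_snr = 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()


def cmkd_step(student, teacher, optimizer, mixture, lips, targets, alpha=0.5):
    """One training step: a frozen audio-visual teacher guides the audio-only student."""
    with torch.no_grad():                     # the teacher is pre-trained and kept fixed
        teacher_est = teacher(mixture, lips)  # audio-visual separation
    student_est = student(mixture)            # audio-only separation

    sep_loss = si_snr_loss(student_est, targets)   # supervised separation loss
    kd_loss = F.l1_loss(student_est, teacher_est)  # distillation from the teacher's outputs
    loss = sep_loss + alpha * kd_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time only `student(mixture)` is called, so no visual stream is required for inference.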

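The multi-stage adaptive feature fusion idea can likewise be sketched with a single gated fusion stage. The layer sizes and the specific combination of a frame-wise (local) gate with an utterance-level (global) gate below are illustrative assumptions rather than the paper's actual architecture; in the full model, several such stages would be stacked at different depths of the audio-visual teacher, which is what "multi-stage" refers to.

```python
# Illustrative single fusion stage with local (per-frame) and global (utterance-level)
# gates; the exact formulation in the paper may differ.
import torch
import torch.nn as nn


class AdaptiveFusionStage(nn.Module):
    def __init__(self, audio_dim, visual_dim):
        super().__init__()
        self.proj_v = nn.Linear(visual_dim, audio_dim)         # align visual features to the audio width
        self.local_gate = nn.Linear(2 * audio_dim, audio_dim)
        self.global_gate = nn.Linear(2 * audio_dim, audio_dim)

    def forward(self, audio, visual):
        # audio: [B, T, D_a], visual: [B, T, D_v], assumed time-aligned
        visual = self.proj_v(visual)
        joint = torch.cat([audio, visual], dim=-1)

        local_w = torch.sigmoid(self.local_gate(joint))                              # per-frame weights
        global_w = torch.sigmoid(self.global_gate(joint.mean(dim=1, keepdim=True)))  # utterance-level weights

        return audio + local_w * global_w * visual             # adaptively inject visual cues
```
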
Speech Samples

The models are evaluated on LRS2-2Mix:


Each example provides the following audio clips:
- Mixture input: the two-speaker mixture fed to the models
- Student: the audio-only baseline without distillation
- Student(CMKD): the audio-only model trained with cross-modal knowledge distillation
- Teacher: the audio-visual model
- Teacher(MAFF): the audio-visual model with multi-stage adaptive feature fusion
- Ground-Truth: the clean reference speech