Structured Temporal Regularization and Curriculum Optimization for Visual Speech Recognition

Paper · 2026

Venue: IEEE Transactions on Audio, Speech, and Language Processing (JCR Q1, IF 5.5) · Under review

Authors: Jiaqing Chen, Yifei Luo, Yahui Liu, Zhuoyang Qiu, Jing Chen, Hao Jiang

Authorship: First author

Keywords: Visual speech recognition, lip reading, low-resource learning, curriculum learning, Swin Transformer, temporal regularization

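  1. When labels are scarce, low-resource visual lip reading tends to overfit spatial shortcuts such as the speaker's face, so training accuracy is high while validation accuracy drops markedly. Introduced SimMIM self-supervised pre-training and upgraded to a Swin V2-style visual encoder with residual post-normalization and scaled cosine attention, which weakens identity memorization and gives the downstream temporal modeling a more robust visual representation.
  2. When the temporal branch keeps Batch Normalization, small batches and shifts in the speaker distribution introduce temporal covariate leakage and weaken cross-batch generalization. Replaced Batch Normalization with Group Normalization in the temporal path and tuned it jointly with the upgraded backbone, then added hierarchical temporal regularization, learnable mixed temporal pooling, and stage-wise curriculum learning, which makes the training and validation curves smoother and delays the onset of overfitting.
  3. The low-resource AICLD-500 benchmark lacks a strong baseline that both raises accuracy and narrows the generalization gap, so each module's contribution has to be verified through systematic ablation. Ran comparative experiments covering combinations of pre-training, backbone, regularization, and curriculum; the result is a 1.91 percentage-point gain over the SwinLip baseline and a markedly narrower train-validation generalization gap.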
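
The batch-dependence argument in item 2 can be illustrated with a minimal sketch. It is not the paper's implementation: it uses plain Python lists rather than a deep-learning framework, omits the learnable scale and shift, and the toy values and function names (`batch_norm`, `group_norm`) are assumptions made for illustration. It shows only that training-mode batch statistics make one clip's normalized features depend on whichever clips share its batch, whereas group statistics are computed per clip.

```python
import math


def _normalize(value, mean, var, eps):
    return (value - mean) / math.sqrt(var + eps)


def batch_norm(x, eps=1e-5):
    """Training-mode batch norm on x[b][c][t]: per-channel stats over batch and time.

    Illustrative only: no learnable affine parameters, no running statistics.
    """
    B, C, T = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * T for _ in range(C)] for _ in range(B)]
    for c in range(C):
        vals = [x[b][c][t] for b in range(B) for t in range(T)]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        for b in range(B):
            for t in range(T):
                out[b][c][t] = _normalize(x[b][c][t], mean, var, eps)
    return out


def group_norm(x, num_groups, eps=1e-5):
    """Group norm on x[b][c][t]: per-sample stats over each channel group and time."""
    B, C, T = len(x), len(x[0]), len(x[0][0])
    assert C % num_groups == 0, "channels must divide evenly into groups"
    size = C // num_groups
    out = [[[0.0] * T for _ in range(C)] for _ in range(B)]
    for b in range(B):
        for g in range(num_groups):
            chans = range(g * size, (g + 1) * size)
            vals = [x[b][c][t] for c in chans for t in range(T)]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            for c in chans:
                for t in range(T):
                    out[b][c][t] = _normalize(x[b][c][t], mean, var, eps)
    return out


# Made-up features: each clip has 2 channels x 3 frames.
clip = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
other_a = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
other_b = [[10.0, 20.0, 30.0], [-5.0, 0.0, 5.0]]

# Normalize the same clip alongside two different batch partners.
bn_a = batch_norm([clip, other_a])[0]
bn_b = batch_norm([clip, other_b])[0]
gn_a = group_norm([clip, other_a], num_groups=1)[0]
gn_b = group_norm([clip, other_b], num_groups=1)[0]

print("BatchNorm output changes with batch partner:", bn_a != bn_b)
print("GroupNorm output changes with batch partner:", gn_a != gn_b)
```

With these toy values the batch-normalized features of `clip` differ between the two batches, while the group-normalized features are identical. That is one reading of why per-sample statistics suit small batches with shifting speaker composition.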

摘要

Visual Speech Recognition (VSR) in low-resource settings is highly vulnerable to overfitting and shortcut learning, resulting in severe train-validation divergence. To address this issue, we propose a structured optimization framework that jointly improves representation stability, temporal modeling reliability, and training dynamics. Specifically, we integrate SimMIM-based self-supervised pre-training to reduce identity-dependent spatial memorization, migrate the visual backbone from Swin Transformer V1 to a Swin V2-style design with Residual-Post-Normalization and scaled cosine attention to stabilize deep feature propagation, and replace Batch Normalization with Group Normalization in temporal branches to avoid batch-induced temporal leakage. We further introduce hierarchical temporal regularization, learnable mixed temporal pooling, and a stage-wise curriculum strategy with dynamic augmentation and plateau-aware adaptation to progressively shift learning from early discriminative fitting to robust generalization. Extensive experiments on the low-resource AICLD-500 benchmark demonstrate that the proposed method achieves a state-of-the-art Top-1 accuracy of 25.67%, outperforming a strong SwinLip baseline by 1.91 absolute percentage points, while significantly narrowing the generalization gap. These results indicate that structured temporal regularization coupled with curriculum optimization provides an effective and scalable solution for robust VSR under data-scarce conditions.
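
The stage-wise curriculum with plateau-aware adaptation could look roughly like the sketch below. The abstract does not specify the stages, thresholds, or augmentation schedule, so everything here is hypothetical: the `Stage` fields, the `PlateauCurriculum` class, the `patience` and `min_delta` values, and the validation accuracies are placeholders rather than the authors' settings. The idea illustrated is that training moves to a stage with stronger augmentation and regularization once validation accuracy stops improving.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    aug_strength: float  # hypothetical knob: 0 = no augmentation, 1 = strongest
    reg_weight: float    # hypothetical weight on the temporal regularizer


class PlateauCurriculum:
    """Advance to the next stage when validation accuracy plateaus."""

    def __init__(self, stages, patience=2, min_delta=0.1):
        self.stages = stages
        self.patience = patience    # epochs without improvement before advancing
        self.min_delta = min_delta  # minimum gain that counts as improvement
        self.index = 0
        self.best = float("-inf")
        self.stale = 0

    @property
    def stage(self):
        return self.stages[self.index]

    def step(self, val_acc):
        """Record one epoch's validation accuracy; return the stage for the next epoch."""
        if val_acc > self.best + self.min_delta:
            self.best = val_acc
            self.stale = 0
        else:
            self.stale += 1
        if self.stale >= self.patience and self.index < len(self.stages) - 1:
            self.index += 1
            # Reset tracking so the harder stage gets its own plateau window.
            self.best = float("-inf")
            self.stale = 0
        return self.stage


stages = [
    Stage("fit", aug_strength=0.0, reg_weight=0.0),
    Stage("regularize", aug_strength=0.5, reg_weight=0.5),
    Stage("generalize", aug_strength=1.0, reg_weight=1.0),
]
curriculum = PlateauCurriculum(stages, patience=2, min_delta=0.1)

# Made-up validation accuracies (percent), one per epoch.
history = [10.0, 15.0, 15.05, 15.0, 16.0, 18.0, 18.0, 18.05]
names = [curriculum.step(acc).name for acc in history]
print(names)
```

In this toy run the curriculum leaves the first stage after two epochs without a gain of at least `min_delta`, and does the same for the second stage. A real schedule would also need to decide whether learning rate or augmentation changes gradually within a stage, which this sketch leaves out.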