Structured Temporal Regularization and Curriculum Optimization for Visual Speech Recognition

Paper · 2026

Venue: IEEE Transactions on Audio, Speech, and Language Processing (JCR Q1, IF 5.5) · Under review

Authors: Jiaqing Chen, Yifei Luo, Yahui Liu, Zhuoyang Qiu, Jing Chen, Hao Jiang

Authorship: First author

Keywords: Visual speech recognition, lip reading, low-resource learning, curriculum learning, Swin Transformer, temporal regularization

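  1. When labels are scarce, low-resource visual lip reading tends to overfit to spatial shortcuts such as speaker facial identity, giving high training accuracy but a marked drop on validation. Introduced SimMIM self-supervised pre-training and upgraded to a Swin V2-style visual encoder with Residual-Post-Normalization and scaled cosine attention, which weakens identity memorization and gives later temporal modeling a more robust visual representation.
  2. When the temporal branch keeps Batch Normalization, small batches and shifts in speaker distribution introduce temporal covariance leakage and weaken cross-batch generalization. Replaced Batch Normalization with Group Normalization in the temporal path and tuned it jointly with the upgraded backbone, then added hierarchical temporal regularization, learnable mixed temporal pooling, and stage-wise curriculum learning, which smooths the train-validation curves and delays the onset of overfitting.
  3. The AICLD-500 low-resource benchmark lacked a strong baseline that both improves accuracy and narrows the generalization gap, so each module's contribution had to be verified through systematic ablation. Organized comparative experiments covering combinations of pre-training, backbone, regularization, and curriculum; the final model improves on the SwinLip baseline by 1.91 percentage points and significantly narrows the train-validation generalization gap.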
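The normalization swap in item 2 can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names `group_norm` and `batch_norm`, the nested-list layout, the omission of learnable affine parameters, and the toy clips are all assumptions made for illustration. The point is only that batch statistics make one clip's features depend on its batch-mates, while group statistics are computed per sample.

```python
import math


def group_norm(sample, num_groups, eps=1e-5):
    """Normalize one sample laid out as [channels][time] with per-group statistics.

    Statistics come from this sample alone, so the result does not depend on
    which other clips share the batch. (Hypothetical helper, no affine terms.)
    """
    channels = len(sample)
    if channels % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    size = channels // num_groups
    out = []
    for g in range(num_groups):
        group = sample[g * size:(g + 1) * size]
        values = [v for row in group for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * scale for v in row] for row in group])
    return out


def batch_norm(batch, eps=1e-5):
    """Normalize [batch][channels][time] with per-channel statistics over batch and time."""
    channels = len(batch[0])
    out = [[None] * channels for _ in batch]
    for c in range(channels):
        values = [v for sample in batch for v in sample[c]]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        for i, sample in enumerate(batch):
            out[i][c] = [(v - mean) * scale for v in sample[c]]
    return out


# Toy data (illustrative only): 2 channels, 3 frames per clip.
clip = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]]
other_a = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
other_b = [[10.0, 20.0, 30.0], [5.0, 5.0, 5.0]]

gn_clip = group_norm(clip, num_groups=1)      # depends on `clip` only
bn_with_a = batch_norm([clip, other_a])[0]    # same clip, different batch-mates
bn_with_b = batch_norm([clip, other_b])[0]
```

Here `bn_with_a` and `bn_with_b` differ even though the clip is identical, whereas `gn_clip` is fixed for a given clip. In a real model the same swap would typically be made with a framework's built-in normalization layers rather than hand-written loops.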

Abstract

Visual Speech Recognition (VSR) in low-resource settings is highly vulnerable to overfitting and shortcut learning, resulting in severe train-validation divergence. To address this issue, we propose a structured optimization framework that jointly improves representation stability, temporal modeling reliability, and training dynamics. Specifically, we integrate SimMIM-based self-supervised pre-training to reduce identity-dependent spatial memorization, migrate the visual backbone from Swin Transformer V1 to a Swin V2-style design with Residual-Post-Normalization and scaled cosine attention to stabilize deep feature propagation, and replace Batch Normalization with Group Normalization in temporal branches to avoid batch-induced temporal leakage. We further introduce hierarchical temporal regularization, learnable mixed temporal pooling, and a stage-wise curriculum strategy with dynamic augmentation and plateau-aware adaptation to progressively shift learning from early discriminative fitting to robust generalization. Extensive experiments on the low-resource AICLD-500 benchmark demonstrate that the proposed method achieves a state-of-the-art Top-1 accuracy of 25.67%, outperforming a strong SwinLip baseline by 1.91 absolute percentage points, while significantly narrowing the generalization gap. These results indicate that structured temporal regularization coupled with curriculum optimization provides an effective and scalable solution for robust VSR under data-scarce conditions.
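The abstract does not spell out how the stage-wise curriculum with plateau-aware adaptation is scheduled. The sketch below shows one plausible reading, under stated assumptions: training moves to the next stage, with stronger augmentation, once validation accuracy has stopped improving for a few epochs. The class `PlateauCurriculum`, the `Stage` fields, the stage names, the `patience` and `min_delta` parameters, and the accuracy values are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    augment_strength: float  # hypothetical scalar in [0, 1]


class PlateauCurriculum:
    """Advance to the next stage when validation accuracy stops improving."""

    def __init__(self, stages, patience=2, min_delta=0.1):
        self.stages = stages
        self.patience = patience      # epochs without improvement before advancing
        self.min_delta = min_delta    # minimum gain that counts as improvement
        self.index = 0
        self.best = float("-inf")
        self.stale = 0

    @property
    def stage(self):
        return self.stages[self.index]

    def step(self, val_acc):
        """Record one epoch's validation accuracy; return the stage for the next epoch."""
        if val_acc > self.best + self.min_delta:
            self.best = val_acc
            self.stale = 0
        else:
            self.stale += 1
        # Plateau detected: move on unless already in the final stage.
        if self.stale >= self.patience and self.index < len(self.stages) - 1:
            self.index += 1
            self.stale = 0
        return self.stage


stages = [Stage("fit", 0.1), Stage("regularize", 0.5), Stage("generalize", 0.9)]
curriculum = PlateauCurriculum(stages, patience=2, min_delta=0.1)

# Made-up validation accuracies, one per epoch, purely for illustration.
history = [curriculum.step(acc).name for acc in [10.0, 15.0, 15.05, 15.02, 18.0, 18.0, 18.0]]
```

With these made-up numbers the schedule stays in the first stage for three epochs, moves to the second after two stale epochs, and reaches the last stage at the end. A training loop would read `curriculum.stage.augment_strength` each epoch to configure its augmentation pipeline.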