Structured Temporal Regularization and Curriculum Optimization for Visual Speech Recognition

Abstract

Visual Speech Recognition (VSR) in low-resource settings is highly vulnerable to overfitting and shortcut learning, which produces severe divergence between training and validation performance. To address this issue, we propose a structured optimization framework that jointly improves representation stability, temporal modeling reliability, and training dynamics. Specifically, we make three changes. First, we integrate SimMIM-based self-supervised pre-training to reduce identity-dependent spatial memorization. Second, we migrate the visual backbone from Swin Transformer V1 to a Swin V2-style design with residual post-normalization and scaled cosine attention to stabilize deep feature propagation. Third, we replace Batch Normalization with Group Normalization in the temporal branches to avoid batch-induced temporal leakage. We further introduce hierarchical temporal regularization, learnable mixed temporal pooling, and a stage-wise curriculum strategy with dynamic augmentation and plateau-aware adaptation, which progressively shifts learning from early discriminative fitting to robust generalization. Extensive experiments on the low-resource AICLD-500 benchmark show that the proposed method achieves a state-of-the-art Top-1 accuracy of 25.67%, outperforming a strong SwinLip baseline by 1.91 absolute percentage points while significantly narrowing the generalization gap. These results indicate that structured temporal regularization coupled with curriculum optimization provides an effective and scalable solution for robust VSR under data-scarce conditions.
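
As a rough illustration of the scaled cosine attention named in the abstract, the sketch below computes single-head attention weights from cosine similarity divided by a temperature, as in the Swin V2 formulation. It is a minimal pure-Python sketch, not the implementation used in this work: the function names, the fixed temperature `tau=0.1`, and the toy inputs are assumptions for illustration, and the relative position bias term is omitted.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_cosine_attention(queries, keys, values, tau=0.1, tau_min=0.01):
    """Single-head attention with logits cos(q, k) / tau.

    Assumption: tau is a fixed illustrative value here; in Swin V2 it is a
    learnable scalar kept above 0.01. The position bias is left out.
    """
    tau = max(tau, tau_min)  # keep the temperature away from zero
    dim = len(values[0])
    outputs = []
    for q in queries:
        logits = [cosine_similarity(q, k) / tau for k in keys]
        weights = softmax(logits)
        # Weighted sum of the value vectors, one output row per query.
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(dim)
        ])
    return outputs


# Toy usage: the same query direction at two very different magnitudes.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
small = scaled_cosine_attention([[1.0, 0.0]], keys, values)
large = scaled_cosine_attention([[100.0, 0.0]], keys, values)
```

Because the logits depend only on the angle between query and key, `small` and `large` agree up to floating-point error. Bounding the logits in this way is the property that motivates cosine attention for stabilizing deeper backbones.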

Specific Work Content

Summary

Algorithm design and system development for machine-vision-based lip-reading recognition