The Evolution of Visual Speech Recognition: From Deep Spatio-Temporal Modeling to LLM-Guided Reasoning
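- The visual speech recognition literature is published piecemeal, organized by model structure, and lacks an end-to-end view from preprocessing to decoding, which makes it hard for researchers to grasp what characterizes each stage of the field's technical evolution. Helped organize representative methods and typical architectures into five technological eras, summarizing the logic of the progression from statistical models and RNN/CNN to Transformers and LLM-guided reasoning, and producing a clear era-to-method comparison table that supports the narrative of the whole paper.
- Datasets and benchmarks use inconsistent evaluation protocols, and controlled/in-the-wild and monolingual/multilingual settings are mixed together, which hinders side-by-side comparison. Built a taxonomy organized by granularity, collection environment, language, and modality, and traced how benchmarks have developed from laboratory settings to large-scale in-the-wild and multilingual evaluation, so that readers can locate suitable data and evaluation settings by scenario.
- Merely listing methods offers little guidance for engineering deployment; robustness, low-resource settings, annotation quality, and long-form continuous speech remain common bottlenecks. Wrote the open problems and future directions section, summarizing issues such as environmental adaptability and deployment efficiency and connecting them to multimodal and large-model trends, to form a citable, systematic reference framework.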
Abstract
Visual Speech Recognition (VSR) has rapidly evolved from handcrafted feature pipelines to deep spatio-temporal architectures and, more recently, LLM-guided reasoning systems. This survey provides a systematic review of that evolution, covering core components of the VSR pipeline, including preprocessing, visual feature extraction, spatio-temporal enhancement, sequence modeling, and decoding. We organize representative methods into five technological eras and analyze their structural shifts from statistical models and recurrent networks to temporal convolutions, Transformer-based global attention, and LLM-empowered generative refinement. We further present a comprehensive dataset taxonomy across granularity, collection environment, language, and modality, and summarize benchmark trends from controlled settings to large-scale in-the-wild and multilingual scenarios. Comparative analysis highlights that while modern visual encoders and attention mechanisms significantly improve discriminative capability, intrinsic viseme ambiguity remains a central bottleneck for visual-only recognition, motivating stronger linguistic priors and multimodal integration. Finally, we discuss key open challenges, including robustness under real-world perturbations, low-resource language coverage, annotation quality, long-form continuous speech modeling, and deployment efficiency, and outline future directions toward reliability-aware decoding, language-agnostic transfer, and scalable multimodal VSR systems.