The Evolution of Visual Speech Recognition: From Deep Spatio-Temporal Modeling to LLM-Guided Reasoning
- The visual speech recognition literature is fragmented by model family and lacks an end-to-end view from preprocessing to decoding, which makes it hard to see how the field evolved in distinct stages. This survey maps five technological eras and their representative architectures, from statistical models and RNN/CNN stacks to Transformers and LLM-guided reasoning, and uses that progression to structure the review.
- Datasets and benchmarks follow inconsistent protocols, mixing controlled and in-the-wild recordings as well as monolingual and multilingual settings. The survey establishes a dataset taxonomy over granularity, collection environment, language, and modality, and traces the trend from laboratory benchmarks to large-scale in-the-wild multilingual evaluation, so that readers can locate suitable data and evaluation settings for a given scenario.
- A catalogue of methods alone does not guide deployment: robustness, low-resource languages, label noise, long-form speech, and efficiency remain shared bottlenecks. The survey closes with a discussion of open problems and future directions, covering environmental adaptability, deployment efficiency, and connections to multimodal and large-model trends, and offers readers a systematic, citable reference framework.
Abstract
Visual Speech Recognition (VSR) has rapidly evolved from handcrafted feature pipelines to deep spatio-temporal architectures and, more recently, to reasoning systems guided by large language models (LLMs). This survey provides a systematic review of that evolution, covering the core components of the VSR pipeline: preprocessing, visual feature extraction, spatio-temporal enhancement, sequence modeling, and decoding. We organize representative methods into five technological eras and analyze their structural shifts from statistical models and recurrent networks to temporal convolutions, Transformer-based global attention, and LLM-empowered generative refinement. We further present a comprehensive dataset taxonomy across granularity, collection environment, language, and modality, and summarize benchmark trends from controlled settings to large-scale in-the-wild and multilingual scenarios. Our comparative analysis shows that, while modern visual encoders and attention mechanisms significantly improve discriminative capability, intrinsic viseme ambiguity (distinct phonemes that produce near-identical lip shapes) remains a central bottleneck for visual-only recognition, motivating stronger linguistic priors and multimodal integration. Finally, we discuss key open challenges, including robustness under real-world perturbations, low-resource language coverage, annotation quality, long-form continuous speech modeling, and deployment efficiency, and we outline future directions toward reliability-aware decoding, language-agnostic transfer, and scalable multimodal VSR systems.