The Evolution of Visual Speech Recognition: From Deep Spatio-Temporal Modeling to LLM-Guided Reasoning

Abstract

Visual Speech Recognition (VSR) has rapidly evolved from handcrafted feature pipelines to deep spatio-temporal architectures and, more recently, LLM-guided reasoning systems. This survey provides a systematic review of that evolution, covering core components of the VSR pipeline, including preprocessing, visual feature extraction, spatio-temporal enhancement, sequence modeling, and decoding. We organize representative methods into five technological eras and analyze their structural shifts from statistical models and recurrent networks to temporal convolutions, Transformer-based global attention, and LLM-empowered generative refinement. We further present a comprehensive dataset taxonomy across granularity, collection environment, language, and modality, and summarize benchmark trends from controlled settings to large-scale in-the-wild and multilingual scenarios. Comparative analysis highlights that while modern visual encoders and attention mechanisms significantly improve discriminative capability, intrinsic viseme ambiguity remains a central bottleneck for visual-only recognition, motivating stronger linguistic priors and multimodal integration. Finally, we discuss key open challenges, including robustness under real-world perturbations, low-resource language coverage, annotation quality, long-form continuous speech modeling, and deployment efficiency, and outline future directions toward reliability-aware decoding, language-agnostic transfer, and scalable multimodal VSR systems.

Detailed contributions

Machine-vision lip reading: algorithm design and system development