Authors
Yufei Zhang, Jinlong Huang
Published in
Journal of visualized experiments : JoVE. Issue 231. May 26, 2026. Epub May 26, 2026.
Abstract
Speech disorders such as dysarthria pose significant challenges in clinical rehabilitation because of articulation difficulties and phoneme variability. Traditional automatic speech recognition (ASR) systems, which are typically trained on normative datasets, often fail to decode dysarthric speech accurately. Recent advances in artificial intelligence and deep learning have made it increasingly feasible to build multimodal architectures that merge auditory, articulatory, and visual information to capture complex speech patterns. In this review, we explore the convergence of transformers and temporal convolutional networks (TCNs) within multimodal frameworks to address phoneme labeling imprecision in dysarthric speech. We discuss how transformer-based contextual modeling and TCN-driven temporal precision can enhance phoneme boundary detection, classification, and rehabilitation feedback. The review also discusses the potential of such hybrid systems in individualized speech therapy, along with interpretability issues and clinical applications. By bridging clinical medicine and medical informatics, multimodal architectures offer a translational approach to enhancing therapeutic monitoring, communication aids, and speech intelligibility assessment for people with motor speech disorders.
PMID:
42296219
Bibliographic data and abstract were imported from PubMed on 16 Jun 2026.