From Noisy Neural Time Series to Structured Language: A Foundation Model for Imagined Speech Decoding from EEG signals
Abstract
Communicative brain–computer interfaces (BCIs) offer a promising pathway for restoring communication in patients affected by conditions such as amyotrophic lateral sclerosis (ALS). Among these paradigms, imagined speech decoding from non-invasive EEG is particularly attractive due to its portability and scalability. However, EEG constitutes a highly noisy neural time series, characterized by low signal-to-noise ratios, substantial inter-subject variability, and susceptibility to artifacts. Moreover, imagined speech arises from purely internal cognitive processes, producing weak and spatially diffused neural activity. Extracting structured semantic information from such signals remains a significant challenge. To address this challenge, we present NeuroSpeak, a JEPA-based framework for sentence-level imagined speech generation from non-invasive EEG. Our approach combines masked neural signal modeling with vector-quantized latent discretization to learn robust EEG representations, which are aligned with language embeddings using a predictive alignment objective and decoded into natural language via a pretrained sequence model. We train and evaluate our model on the large-scale CHISCO corpus comprising over 20,000 imagined speech sentences under a subject-agnostic evaluation setting. The proposed framework achieves a semantic similarity score of 47.70\% relative to ground-truth text, demonstrating generalization beyond subject-specific neural patterns. To the best of our knowledge, this represents the largest and most semantically diverse study of sentence-level imagined speech generation using non-invasive EEG.