Convolutional Tokenization Improves Transformers for Multi-Channel Time Series Classification
Abstract
Transformers have shown promise for time series modeling, yet they often underperform simpler CNN baselines on multi-channel physiological signals. We hypothesize that this gap stems from a mismatch between patch-based tokenization and the local structure inherent in such signals. We propose a hybrid architecture that replaces standard patch embedding with convolutional tokenization: a spatial attention module learns channel importance, and multi-scale temporal convolutions then extract local features that are passed to a transformer encoder. On 64-channel EEG classification with 109 subjects, our hybrid model achieves an F1-score of 81.1\%, outperforming pure transformers (76.1\%), CNN baselines (78.4\%), and LSTMs (72.7\%). These results suggest that incorporating convolutional inductive biases into the tokenization stage is important for transformers to perform well on multi-channel time series.
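
To make the tokenization stage concrete, the following is a minimal NumPy sketch of the pipeline described above: channel reweighting, then multi-scale temporal convolutions that emit one token per strided time step. It is an illustration under stated assumptions, not the model evaluated here. The softmax form of the channel attention, the kernel sizes (3, 7, 15), the stride of 8, the 16 filters per scale, the ReLU, and the names `channel_attention`, `temporal_conv`, and `conv_tokenize` are all hypothetical choices; the random weights stand in for learned parameters, and the transformer encoder that would consume the tokens is omitted.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(v - v.max())
    return e / e.sum()

def channel_attention(x, channel_logits):
    # x: (channels, time). Assumed form of spatial attention: one learnable
    # logit per channel, normalized with softmax and used to rescale channels.
    weights = softmax(channel_logits)
    return x * weights[:, None]

def temporal_conv(x, kernels, stride):
    # x: (channels, time); kernels: (filters, channels, width), width odd.
    # Zero-pad so windows are centered, then slide along time with `stride`.
    _, _, width = kernels.shape
    pad = width // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    starts = range(0, x.shape[1], stride)
    rows = [
        np.tensordot(kernels, xp[:, s:s + width], axes=([1, 2], [0, 1]))
        for s in starts
    ]
    return np.stack(rows)  # (num_tokens, filters)

def conv_tokenize(x, channel_logits, kernel_banks, stride=8):
    # Reweight channels, apply one filter bank per temporal scale, apply a
    # ReLU, and concatenate the scales into a single embedding per token.
    x = channel_attention(x, channel_logits)
    feats = [np.maximum(temporal_conv(x, k, stride), 0.0) for k in kernel_banks]
    return np.concatenate(feats, axis=1)  # (num_tokens, d_model)

rng = np.random.default_rng(0)
C, T = 64, 160                      # 64 channels; window length is an assumption
x = rng.standard_normal((C, T))
channel_logits = np.zeros(C)        # uniform channel weights before training
kernel_banks = [0.1 * rng.standard_normal((16, C, k)) for k in (3, 7, 15)]

tokens = conv_tokenize(x, channel_logits, kernel_banks, stride=8)
print(tokens.shape)  # (20, 48)
```

In this sketch each token summarizes a neighborhood of the signal at three temporal scales, in contrast to patch embedding, which linearly projects fixed non-overlapping segments. The resulting `(num_tokens, d_model)` array is the sequence a transformer encoder would take as input.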