Skip to yearly menu bar Skip to main content


Language-Interfaced Tabular Oversampling via Progressive Imputation and Self-Authentication

June Yong Yang · Geondo Park · Joowon Kim · Hyeongwon Jang · Eunho Yang

Halle B #99
[ ]
Thu 9 May 7:30 a.m. PDT — 9:30 a.m. PDT


Tabular data in the wild are frequently afflicted with class-imbalance, biasing machine learning model predictions towards major classes. A data-centric solution to this problem is oversampling - where the classes are balanced by adding synthetic minority samples via generative methods. However, although tabular generative models are capable of generating synthetic samples under a balanced distribution, their integrity suffers when the number of minority samples is low. To this end, pre-trained generative language models with rich prior knowledge are a fitting candidate for the task at hand. Nevertheless, an oversampling strategy tailored for tabular data that utilizes the extensive capabilities of such language models is yet to emerge. In this paper, we propose a novel oversampling framework for tabular data to channel the abilities of generative language models. By leveraging its conditional sampling capabilities, we synthesize minority samples by progressively masking the important features of the majority class samples and imputing them towards the minority distribution. To reduce the inclusion of imperfectly converted samples, we utilize the power of the language model itself to self-authenticate the labels of the samples generated by itself, sifting out ill-converted samples. Extensive experiments on a variety of datasets and imbalance ratios reveal that the proposed method successfully generates reliable minority samples to boost the performance of machine learning classifiers, even under heavy imbalance ratios.

Chat is not available.