Poster in the Workshop on Large Language Models for Agents
Self-Training Language Models in Arithmetic Reasoning
Marek Kadlčík · Michal Štefánik · Ondrej Sotolar · Vlastimil Martinek
Recent work shows the impressive efficiency of methods for modeling human preferences, but achieving further improvements with these methods requires costly human annotations of the quality of model outputs. In this work, we study the potential of preference optimization methods in self-training for arithmetic reasoning tasks, presenting a unique environment in which human judgments of model outputs can be substituted with automatic assessment against the reference correct results.

First, we curate and transform existing datasets to create Calc-X, a standardized collection of over 300,000 problems with step-by-step solutions. We use Calc-X to train models, which we call Calcformers, that interact with a calculator during inference. Calcformers achieve twice the accuracy of vanilla language model baselines and outperform previous tool-using models.

Finally, we apply self-training to Calcformers, assessing the model's outputs by checking their final results against the reference answers. We find that self-training methods can achieve substantial improvements on both in-domain and out-of-domain problems, but in this setting, preference optimization performs comparably to traditional supervised training. Our analyses show that preference optimization in self-training is prone to overfitting, which can be partially addressed by parameter-efficient training or deliberate method and parameter selection.
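To make the calculator interaction concrete, the sketch below shows one way a model can interleave generation with calls to an external calculator, assuming a causal language model and a tool-call markup of the form `<gadget>...</gadget>` / `<output>...</output>`. The tag names, stopping behavior, and helper functions are illustrative assumptions, not the released Calcformer interface.

```python
# Minimal sketch of calculator-interleaved decoding (illustrative assumptions:
# causal LM, <gadget>/<output> markup, generation stops after a tool call).
import re
from sympy import sympify


def run_calculator(expression: str) -> str:
    """Evaluate an arithmetic expression and return the result as text."""
    return str(sympify(expression))


def generate_with_calculator(model, tokenizer, prompt: str, max_calls: int = 10) -> str:
    """Alternate between model generation and calculator calls.

    Assumes each generation round halts after emitting '</gadget>' (e.g. via a
    stop sequence), so the text either ends with a pending tool call or with a
    finished solution.
    """
    text = prompt
    for _ in range(max_calls):
        inputs = tokenizer(text, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=128)
        # Keep only the newly generated tokens and append them to the context.
        new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
        text += tokenizer.decode(new_tokens, skip_special_tokens=True)
        call = re.search(r"<gadget>(.*?)</gadget>\s*$", text, re.DOTALL)
        if call is None:  # no pending calculator call: the solution is complete
            return text
        # Execute the call and feed the result back into the context.
        text += f"<output>{run_calculator(call.group(1))}</output>"
    return text
```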
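The self-training step relies on the fact that arithmetic problems come with reference answers: sampled solutions can be labeled correct or incorrect automatically, yielding data for supervised fine-tuning or preference pairs for methods such as DPO. The following sketch shows one possible data-collection routine; the field names, answer format, and helper functions are assumptions for illustration, not the paper's exact pipeline.

```python
# Minimal sketch of self-training data collection where correctness of the
# final answer replaces human preference labels (names and answer format are
# illustrative assumptions).
import random


def extract_final_answer(solution: str) -> str:
    """Pull the final answer out of a generated solution.

    Assumes solutions end with a line like 'Final answer: 42'.
    """
    for line in reversed(solution.strip().splitlines()):
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return ""


def collect_self_training_data(problems, sample_solution, n_samples: int = 16):
    """Sample solutions per problem and split them by correctness.

    Returns (sft_examples, preference_pairs):
      * sft_examples: correct solutions usable for supervised fine-tuning
      * preference_pairs: (prompt, chosen, rejected) triples usable for
        preference optimization methods such as DPO
    """
    sft_examples, preference_pairs = [], []
    for problem in problems:
        solutions = [sample_solution(problem["question"]) for _ in range(n_samples)]
        correct = [s for s in solutions
                   if extract_final_answer(s) == problem["reference_answer"]]
        incorrect = [s for s in solutions
                     if extract_final_answer(s) != problem["reference_answer"]]
        sft_examples += [{"prompt": problem["question"], "completion": s}
                         for s in correct]
        if correct and incorrect:
            preference_pairs.append({
                "prompt": problem["question"],
                "chosen": random.choice(correct),
                "rejected": random.choice(incorrect),
            })
    return sft_examples, preference_pairs
```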