Poster in Workshop: ICLR 2025 Workshop on Tackling Climate Change with Machine Learning: Data-Centric Approaches in ML for Climate Action
Large Language Models for Monitoring Dataset Mentions in Climate Research
Aivin V. Solatorio · Rafael Macalaba · James Liounis
Effective climate change research relies on diverse datasets to inform mitigation and adaptation strategies and policies. However, the ways in which these datasets are cited, used, and distributed remain poorly understood. This paper presents a machine learning framework that automates the detection and classification of dataset mentions in climate research papers. Leveraging large language models (LLMs), we generate a weakly supervised dataset through zero-shot extraction, quality assessment via an LLM-as-a-Judge, and refinement by a reasoning agent. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a smaller manually annotated subset to specialize in extracting data mentions. At inference, a ModernBERT-based classifier filters for dataset mentions, optimizing computational efficiency. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 in dataset extraction accuracy. As a framework for monitoring dataset mentions in climate research, this approach helps enhance transparency, identify data gaps, and support researchers, funders, and policymakers in improving data discoverability and usage for more informed decision-making.
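The inference-time design described in the abstract, where a lightweight classifier screens text before the more expensive extraction model runs, can be illustrated with a minimal sketch. This is not the authors' implementation: `keyword_filter` is a hypothetical stand-in for the ModernBERT-based classifier, `toy_extractor` is a hypothetical stand-in for the fine-tuned Phi-3.5-mini model, and the cue words and dataset names are illustrative assumptions only.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Mention:
    passage_index: int   # position of the passage in the input sequence
    dataset_name: str    # dataset name returned by the extractor


def keyword_filter(passage: str) -> bool:
    """Hypothetical stand-in for the classifier: a cheap yes/no screen."""
    cues = ("dataset", "survey", "census", "reanalysis")  # assumed cue words
    text = passage.lower()
    return any(cue in text for cue in cues)


def toy_extractor(passage: str) -> List[str]:
    """Hypothetical stand-in for the fine-tuned extraction model."""
    known = ("ERA5", "CMIP6")  # assumed names, for illustration only
    return [name for name in known if name in passage]


def extract_mentions(
    passages: Iterable[str],
    has_mention: Callable[[str], bool] = keyword_filter,
    extract: Callable[[str], List[str]] = toy_extractor,
) -> List[Mention]:
    """Run the costly extractor only on passages that pass the cheap filter."""
    mentions: List[Mention] = []
    for index, passage in enumerate(passages):
        if not has_mention(passage):
            continue  # skipped passages never reach the extractor
        for name in extract(passage):
            mentions.append(Mention(index, name))
    return mentions


passages = [
    "We use the ERA5 reanalysis for temperature.",
    "Climate policy is complex.",
    "CMIP6 is discussed elsewhere.",
]
result = extract_mentions(passages)
```

In this toy run only the first passage passes the filter, so the extractor is called once rather than three times. The third passage shows the trade-off of such a design: a filter that misses a passage also hides any mentions in it from the extractor, so the filter's recall bounds the recall of the whole pipeline.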