Poster in Workshop: AI4MAT-ICLR-2026: ICLR 2026 Workshop on AI for Accelerated Materials Design

Challenges and Vision For Standardization of Biopolymer Datasets for Machine Learning

Jessica Lalonde ⋅ Defne Circi ⋅ ⋅ Stefan Zauscher ⋅ L. Catherine

Abstract

Machine learning (ML) is transforming materials research, yet its potential for biopolymer discovery remains constrained by fragmented data and non-standardized reporting. Biopolymers differ significantly from synthetic polymers and require specialized approaches to represent their biosynthetic origins, hierarchical structures, and application-specific metrics. In this perspective, we identify three core challenges limiting biopolymer representation: information encoding, data quality, and data sharing. Unlike prior reviews on polymer informatics, this perspective focuses explicitly on biopolymer-specific challenges arising from biosynthetic variability, hierarchical structure, and environmental sensitivity, and outlines interoperable, ML-ready solutions tailored to each challenge. Recommendations include the design and adoption of biopolymer-specific fingerprinting frameworks, the development of hybrid data extraction strategies, and the expansion of Findable, Accessible, Interoperable, and Reusable (FAIR)-compliant repositories. We propose a foundation for defining interoperable, high-quality datasets that capture the full context of biopolymer materials. Standardized metadata, shared ontologies, and community-driven infrastructure will enable scalable, reproducible workflows and accelerate the ML-driven development of biopolymers.
