Poster in Workshop: AI4MAT-ICLR-2025: AI for Accelerated Materials Design
LeMat-Bulk: aggregating, and de-duplicating quantum chemistry materials databases
Martin Siron · Inel DJAFAR · Etienne du Fayet · Amandine Rossello · Ali Ramlaoui · Alexandre Duval
Keywords: [ dataset ] [ material-fingerprint ] [ materials-discovery ] [ crystals ]
Abstract:
The rapid expansion of materials science databases enables the training of predictive machine learning models that deliver fast, accurate estimates of materials properties, as well as generative models that explore the vast combinatorial space of material candidates. Initiatives like the Materials Project, OQMD, and Alexandria have greatly expanded the scope of computational materials science and fueled progress in the materials science community. However, they have also introduced challenges related to duplication, data integration, and interoperability, which complicate efforts to develop scalable machine learning models. To address these challenges, we introduce LeMat-Bulk, a unified dataset combining Density Functional Theory (DFT) calculations from the Materials Project, OQMD, and Alexandria. This dataset encompasses over 5.3 million materials across three DFT functionals, including the largest repository of PBEsol and SCAN functional calculations ($\sim$500k). Our methodology standardizes DFT calculations across databases with varying parameters, resolving inconsistencies and enhancing cross-compatibility. In addition, we propose and benchmark a hashing function (BAWL), building on Ongari et al. (2022), that generates identifiers for crystalline inorganic materials by capturing their structural and compositional properties.
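To illustrate the general idea of hash-based de-duplication across databases, the following is a minimal sketch and not the BAWL algorithm itself, whose details are not given in this abstract. It assumes a hypothetical entry format (a `composition` dictionary of element counts and a `spacegroup` number) and uses the space group number as a crude stand-in for structural information; the names `toy_fingerprint` and `deduplicate` are invented for this example.

```python
import hashlib
from functools import reduce
from math import gcd


def toy_fingerprint(composition, spacegroup):
    """Return a short identifier from a reduced formula and a space group number.

    Illustrative assumption only: a real material fingerprint would encode
    far richer structural information than the space group number.
    """
    # Reduce the formula so that Na4Cl4 and NaCl map to the same composition.
    divisor = reduce(gcd, composition.values())
    # Sort elements alphabetically so the key does not depend on input order.
    reduced = "".join(
        f"{element}{count // divisor}"
        for element, count in sorted(composition.items())
    )
    key = f"{reduced}|sg{spacegroup}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def deduplicate(entries):
    """Keep the first entry seen for each fingerprint."""
    unique = {}
    for entry in entries:
        fp = toy_fingerprint(entry["composition"], entry["spacegroup"])
        unique.setdefault(fp, entry)
    return unique


# Hypothetical entries from three sources; the first two describe the same material.
entries = [
    {"source": "db_a", "composition": {"Na": 1, "Cl": 1}, "spacegroup": 225},
    {"source": "db_b", "composition": {"Cl": 4, "Na": 4}, "spacegroup": 225},
    {"source": "db_c", "composition": {"Na": 1, "Cl": 1}, "spacegroup": 221},
]
unique = deduplicate(entries)
print(len(unique), "unique materials out of", len(entries), "entries")
```

In this toy setup the first two entries collapse to one identifier because they share a reduced formula and space group, while the third stays distinct. A fingerprint of this kind would wrongly merge different structures that happen to share both properties, which is why a practical identifier needs a more detailed structural description.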