Getting the Data Right: A Physics-Consistent, Calibrated Dataset for SEM-Based Defect Localization in PEM Fuel Cells
Abstract
Access to high-quality data is a key bottleneck for vision systems in scientific imaging, and publicly available datasets for defect localization in proton exchange membrane (PEM) fuel cells remain scarce. We present a curated grayscale scanning electron microscopy (SEM) dataset for single-class defect localization, consisting of 1,107 images with bounding-box annotations, fixed train/validation/test splits, and a single canonical annotation source to ensure reproducibility. A physics-consistent preprocessing pipeline removes acquisition artifacts, enforces spatial standardization, and applies global intensity normalization to mitigate shortcut learning from non-physical cues. Controlled learnability and augmentation ablations show that even physically plausible transformations, including 90° rotations, can degrade detection performance, which highlights the need for dataset-specific validation rather than heuristic augmentation. By providing a rigorously validated and transparent benchmark for SEM-based defect localization, this dataset supports reliable automated characterization workflows and helps relieve the data bottleneck in data-driven materials discovery and diagnostic pipelines.
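To make concrete the kind of transformation the augmentation ablations refer to, the following minimal sketch rotates a grayscale image by 90° counterclockwise and remaps its bounding boxes accordingly. It is an illustration only, not the pipeline used to build the dataset. The function name `rotate90_ccw`, the nested-list image representation, and the box format `(x_min, y_min, x_max, y_max)` in pixel coordinates with exclusive maximum edges are all assumptions made for this example.

```python
def rotate90_ccw(image, boxes):
    """Rotate a grayscale image and its bounding boxes 90 degrees counterclockwise.

    image: list of rows (H x W) of pixel values.
    boxes: list of (x_min, y_min, x_max, y_max) tuples; x indexes columns,
           y indexes rows, and the max edges are exclusive (assumed format).
    """
    height = len(image)
    width = len(image[0]) if height else 0

    # Pixel (row r, col c) moves to (row width - 1 - c, col r),
    # so the rotated image has shape W x H.
    rotated = [[image[r][width - 1 - i] for r in range(height)]
               for i in range(width)]

    # Box edges follow the same mapping: x' = y, y' = width - x.
    # The old x_max becomes the new y_min, which keeps min <= max.
    new_boxes = [(y0, width - x1, y1, width - x0)
                 for (x0, y0, x1, y1) in boxes]
    return rotated, new_boxes


if __name__ == "__main__":
    # Hypothetical 4 x 6 toy image with one bright 2 x 3 "defect".
    img = [[0] * 6 for _ in range(4)]
    for r in range(1, 3):
        for c in range(2, 5):
            img[r][c] = 1
    rot, rot_boxes = rotate90_ccw(img, [(2, 1, 5, 3)])
    print(len(rot), len(rot[0]), rot_boxes)
```

A rotation like this preserves annotation geometry exactly, so any change in detection performance after applying it would stem from the image content or acquisition characteristics rather than from label corruption. That is one reason such transformations still call for dataset-specific validation.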