Poster

Aria-MIDI: A Dataset of Piano MIDI Files for Symbolic Music Modeling

Louis Bradshaw · Simon Colton

Fri 25 Apr midnight PDT — 2:30 a.m. PDT

Abstract:

We introduce an extensive new dataset of MIDI files, created by transcribing audio recordings of piano performances into their constituent notes. Our data pipeline is multi-stage: a language model autonomously crawls and scores audio recordings from the internet based on their metadata, and an audio classifier then prunes and segments the results. The resulting dataset contains over one million distinct MIDI files, comprising roughly 100,000 hours of transcribed audio. We provide an in-depth analysis of our techniques, including statistical insights, and investigate the dataset's content by extracting metadata tags, which we also provide. The dataset is available at https://github.com/loubbrad/aria-midi.
