

Oral in Workshop: Generative and Experimental Perspectives for Biomolecular Design

A New Ultra-High-Throughput Assay for Measuring Protein Fitness

Vikram Sundar · Boqiang Tu · Lindsey Guan · Kevin Esvelt


Abstract: Machine learning (ML) for protein design frequently requires large datasets of protein fitness measurements generated by high-throughput experiments; however, publicly available protein fitness datasets generated by deep mutational scanning are noisy and include only $10^3$ to $10^5$ data points. In this work, we present DHARMA, a new protein fitness assay that uses molecular recording via base editors and high-throughput sequencing to measure the fitness of up to $10^6$ variants. To mitigate noise in DHARMA experiments, we design a Bayesian inference method, FLIGHTED, that denoises the output of a DHARMA experiment for downstream ML applications. Our results show that DHARMA and FLIGHTED can accurately measure protein fitness with calibrated errors. Using this technology, we generate a new fitness dataset of $160{,}000$ TEV protease variants and benchmark several standard ML models, including models built on protein language model embeddings, on this dataset. We find that dataset size is the single most important factor in determining ML model performance and that scaling up protein language models does not currently improve performance. DHARMA and FLIGHTED can help generate more large protein fitness datasets for the ML community.
