Keywords: [ model-based optimization ] [ protein ]
Recent work has demonstrated that deep neural networks can predict important properties such as fitness and stability from protein sequences via supervised learning. However, the use of learned deep neural network models for designing de novo proteins, with backbones from scratch, that maximize a certain fitness value remains under-explored. In this paper, we study the problem of designing proteins where the optimization must be carried out in a purely data-driven, "offline" manner, by utilizing databases of experimental data collected from wet lab evaluations. Synthesizing the proteins proposed by the algorithm in a wet lab, which incurs a large manual overhead for designers, is not allowed. Such an offline optimization problem requires that a practitioner make several design choices: a designer must decide what data distribution to train on and how their method will be evaluated, and must additionally devise workflows for tuning the optimization method they wish to use. We perform a systematic study of the various design choices that arise in protein design, grounded in the problem of optimizing for protein stability, and use these insights to propose workflows, protocols, and metrics to assist practitioners in effectively applying such data-driven approaches to protein design problems.