Virtual presentation / poster accept
Mole-BERT: Rethinking Pre-training Graph Neural Networks for Molecules
Jun Xia · Chengshuai Zhao · Bozhen Hu · Zhangyang Gao · Cheng Tan · Yue Liu · Siyuan Li · Stan Z Li
Keywords: [ graph neural networks ] [ Deep Learning and representational learning ]
Recent years have witnessed the prosperity of pre-training graph neural networks (GNNs) for molecules. Typically, atom types as node attributes are randomly masked, and GNNs are then trained to predict the masked types, as in AttrMask~\citep{hu2020strategies}, following the Masked Language Modeling (MLM) task of BERT~\citep{devlin2019bert}. However, unlike MLM with its large vocabulary, AttrMask pre-training does not learn informative molecular representations because the atom vocabulary is small and unbalanced. To address this problem, we propose a variant of VQ-VAE~\citep{van2017neural} as a context-aware tokenizer that encodes atom attributes into chemically meaningful discrete codes. This enlarges the atom vocabulary and mitigates the quantitative divergence between dominant atoms (e.g., carbon) and rare atoms (e.g., phosphorus). With the enlarged atom vocabulary, we propose a novel node-level pre-training task, dubbed Masked Atoms Modeling (\textbf{MAM}), which randomly masks some discrete codes and then pre-trains GNNs to predict them. MAM also mitigates another issue of AttrMask, namely negative transfer. It can be easily combined with various pre-training tasks to improve their performance. Furthermore, we propose triplet masked contrastive learning (\textbf{TMCL}) for graph-level pre-training to model the heterogeneous semantic similarity between molecules for effective molecule retrieval. MAM and TMCL constitute a novel pre-training framework, \textbf{Mole-BERT}, which can match or outperform state-of-the-art methods in a fully data-driven manner. We release the code at \textcolor{magenta}{\url{https://github.com/junxia97/Mole-BERT}}.
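As a rough illustration of the tokenize-then-mask idea behind MAM, the following minimal sketch assigns each atom the index of its nearest codebook vector (the quantization step of a VQ-VAE-style tokenizer) and then hides a random subset of those codes as prediction targets. It is not the released Mole-BERT implementation: the function names, the toy 2-D embeddings, the four-entry codebook, and the mask token `-1` are all assumptions made for this example. In practice the embeddings would come from a GNN encoder and the codebook would be learned.

```python
import random


def quantize(embedding, codebook):
    """Return the index of the codebook vector closest to `embedding`
    (squared Euclidean distance), i.e. the atom's discrete code."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(embedding, codebook[k]))


def tokenize(atom_embeddings, codebook):
    """Map every atom embedding to a discrete code."""
    return [quantize(e, codebook) for e in atom_embeddings]


def mask_codes(codes, mask_rate, mask_token=-1, seed=0):
    """Replace a random subset of codes with `mask_token`.

    Returns the masked input sequence and a dict {position: true code}
    that a model would be trained to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(codes)))
    masked_positions = sorted(rng.sample(range(len(codes)), n_mask))
    inputs = list(codes)
    for i in masked_positions:
        inputs[i] = mask_token
    targets = {i: codes[i] for i in masked_positions}
    return inputs, targets


# Hypothetical toy data: a 4-entry codebook and three context-aware atom
# embeddings (assumed to be produced by a GNN encoder, hard-coded here).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
atom_embeddings = [[0.1, 0.1], [0.9, 0.2], [0.2, 0.8]]

codes = tokenize(atom_embeddings, codebook)
inputs, targets = mask_codes(codes, mask_rate=0.34)
```

Because the codes depend on context-aware embeddings rather than on the raw atom type, two atoms of the same element can receive different codes, which is how a tokenizer of this kind can enlarge the effective vocabulary.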