Poster in Workshop: Integrating Generative and Experimental Platforms for Biomolecular Design
SweetBERT: exploring BERT-based models for IUPAC glycan nomenclature modeling
Irene Rubia-Rodríguez · Henrik Nielsen · Garry Gippert · Kristian Barrett · Bernard Henrissat · Ole Winther
Glycans are the most abundant biomolecules on Earth and participate in key processes in all living organisms. The chemical variability and topological complexity of their natural branched structures have been a challenge in computational glycobiology. As a tool for improving predictive models in glycobiology, we propose SweetBERT, a BERT-based language model for encoding glycan sequences that includes explicit information about the branching structure of the sequence. This is achieved by including a pseudo-graph representation in the input embeddings. The model's performance on downstream tasks underscores the promise of Transformer architectures in addressing the complexities of glycan representation.
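To make the idea of branch-aware input concrete, here is a minimal sketch of one way branching information could be extracted from an IUPAC-condensed glycan string. It is not the SweetBERT method, whose pseudo-graph representation is not detailed in this abstract. The tokenizer pattern, the function name `tokenize_with_depth`, and the use of bracket nesting depth as a structural feature are all assumptions made for illustration.

```python
import re

# Hypothetical tokenizer (an assumption, not SweetBERT's): splits an
# IUPAC-condensed string into branch brackets, linkages such as "(a1-3)",
# and monosaccharide names such as "Man".
TOKEN_RE = re.compile(r"\[|\]|\([^()]*\)|[^\[\]()]+")


def tokenize_with_depth(glycan):
    """Return (tokens, depths), where depths[i] is the bracket nesting
    level of tokens[i]. Brackets themselves are consumed, not emitted."""
    tokens, depths = [], []
    depth = 0
    for tok in TOKEN_RE.findall(glycan):
        if tok == "[":
            depth += 1  # entering a side branch
        elif tok == "]":
            depth -= 1  # returning to the parent chain
        else:
            tokens.append(tok)
            depths.append(depth)
    return tokens, depths


tokens, depths = tokenize_with_depth("Man(a1-3)[Man(a1-6)]Man")
print(tokens)  # ['Man', '(a1-3)', 'Man', '(a1-6)', 'Man']
print(depths)  # [0, 0, 1, 1, 0]
```

In a BERT-style model, a per-token index like `depths` could in principle be mapped to an extra learned embedding and summed with the token and position embeddings. Whether SweetBERT does anything similar is not stated here.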