Supervisors Maurits Dijkstra and Peter Bloem, with assistants Henriette Capel and Robin Weiler, worked on an NI Academy Project which resulted in a 2022 paper published in Scientific Reports. See: https://www.nature.com/articles/s41598-022-19608-4
Proteins are the building blocks of life, but predicting their properties from their sequences is a challenging task. It is one we need to solve, though, since that knowledge is used for drug discovery, for instance. Machine-learning methods can help, but they need labelled data, which is often scarce or expensive to obtain. A possible solution is to pre-train a model on unlabelled data, and then fine-tune it on specific tasks with limited labelled data. This approach has been very successful in natural language processing (NLP), where models like BERT and GPT-2 (now GPT-4) can capture high-level semantics and generalize across domains.
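To see why pre-training needs no labels, consider the BERT-style masked-token objective: hide a fraction of the tokens (here, amino acids) and train the model to recover them from context. The sketch below is a minimal illustration of that masking step, not the paper's actual pipeline; the `?` mask symbol and the `mask_sequence` helper are illustrative choices.

```python
import random

# The 20 standard amino acids, one letter each; a protein sequence is a
# string over this alphabet, analogous to a sentence of tokens in NLP.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "?"  # illustrative stand-in for BERT's [MASK] token

def mask_sequence(seq, mask_prob=0.15, rng=None):
    """BERT-style masking: hide ~15% of residues. The model is trained to
    predict the hidden residues, so no labels beyond the raw sequence are
    needed -- this is what makes pre-training on unlabelled data possible."""
    rng = rng or random.Random(0)
    masked, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < mask_prob:
            targets[i] = aa  # the residue the model must recover
            masked.append(MASK)
        else:
            masked.append(aa)
    return "".join(masked), targets

masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```

Fine-tuning then replaces this self-supervised objective with a small labelled dataset for the task of interest, reusing the pre-trained weights.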
But can we apply the same idea to proteins? Can we pre-train a model on protein sequences and fine-tune it for various tasks that require labelled data? And, most importantly for this paper: how do we effectively evaluate the quality of the learned protein representations? At the time, protein representation models were often evaluated on benchmarks with just one or two tasks.
In their academy project, the NI researchers developed ProteinGLUE, a benchmark suite for the evaluation of protein representations. It is a way to test how effective a given machine-learning approach is at predicting protein properties from their sequences. ProteinGLUE consists of seven per-amino-acid tasks (expanding on the one or two tasks typical of benchmarks at the time) that cover different aspects of protein structure and function, such as (for those familiar with the field) secondary structure, solvent accessibility, disorder, binding sites, and more.
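To make "per-amino-acid task" concrete: unlike whole-sequence classification, each residue gets its own label, so the prediction target has the same length as the sequence, and models are scored per position. The sequence, the 3-state secondary-structure labels (H = helix, E = strand, C = coil), and the helper below are illustrative, not data from the benchmark.

```python
# A per-amino-acid task pairs every residue with its own label.
sequence = "MKTAYIAKQR"
labels   = "CHHHHHEEEC"  # made-up 3-state secondary-structure labels
assert len(sequence) == len(labels)  # one label per residue, by construction

def per_residue_accuracy(pred, gold):
    """Per-amino-acid tasks are scored position by position,
    not with a single label for the whole sequence."""
    correct = sum(p == g for p, g in zip(pred, gold))
    return correct / len(gold)
```

A perfect prediction scores 1.0; getting 9 of 10 residues right scores 0.9, and so on for each of the seven tasks.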
The paper shows that pre-training yields higher performance on the downstream tasks than no pre-training (pink and light blue in the figure above). The x-axis shows the seven tasks the NI researchers included in their GLUE benchmark. However, the larger base model (dark blue and light blue) does not always outperform the smaller medium model, suggesting there is room for improvement in the pre-training methods (e.g. making the pre-training tasks harder). The paper also discusses some of the challenges and limitations of applying NLP techniques to proteins, such as vocabulary size, sequence length, and domain specificity.
ProteinGLUE is a valuable resource for the field of protein sequence-based property prediction. It provides a standardized way to compare different models and methods, and to assess their generalizability and usefulness. TAPE is an existing alternative benchmark, but it covers fewer protein-function-related tasks.
This research by the NI's academy opens up new avenues for work on self-supervised protein modelling, which could lead to a better understanding of existing proteins and the discovery of novel ones.