Recently, deep language models like BERT and GPT-2 have shown a remarkable ability to generalize across domains. Models pre-trained on large amounts of general-domain data yield representations that capture high-level semantics, and can be finetuned for domains where little data is available.
We will adapt deep language models from the natural language domain to the domain of protein sequences. Both domains use sequences of tokens from a finite alphabet, making it straightforward to apply existing language models without much adaptation. If this approach is successful, it will lead to representations of protein sequences which extract high-level semantic concepts from raw data, which may benefit drug-discovery, biomedical analysis, and biomolecular information retrieval.
Supervisors: Maurits Dijkstra & Peter Bloem