Transformer Networks in Gene Prediction

Title: Transformer Networks in Gene Prediction
Author: Viljamaa, Venla
Other contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2022
Language: eng
Thesis level: master's thesis
Degree program: Master's Programme in Data Science
Specialisation: no specialization
Abstract: In bioinformatics, new genomes are sequenced at an increasing rate. Before this data can be used in bioinformatics problems, it must be annotated. Genome annotation is a computational problem that has traditionally been approached with statistical methods such as the hidden Markov model (HMM). However, implementing these methods is often time-consuming and requires domain knowledge. Neural network-based approaches have also been developed for the task, but they typically require a large amount of pre-labeled data. Genomes and natural language share many properties, not least that both consist of letters. Genomes also have their own grammar, semantics, and context-dependent meanings, just like phrases in natural language. These similarities motivate the use of natural language processing (NLP) techniques in genome annotation. In recent years, pre-trained Transformer neural networks have been widely used in NLP. This thesis shows that, owing to the linguistic properties of genomic data, the Transformer network architecture is also suitable for gene prediction. The model used in the experiments, DNABERT, is pre-trained on the full human genome. Using task-specific labeled datasets, the model is then fine-tuned to classify DNA sequences into genes and non-genes. The main fine-tuning dataset is the genome of the bacterium Escherichia coli, but preliminary experiments are also performed on human chromosome data. The fine-tuned models are evaluated using accuracy, F1-score, and the Matthews correlation coefficient (MCC). A customized estimation method is developed, in which the predictions are compared to ground-truth labels at the nucleotide level. With this method, the best models achieve an accuracy of 90.15% and an MCC of 0.4683 on the Escherichia coli dataset. The model correctly classifies even the minority label, and the execution times are measured in minutes rather than hours. These results suggest that the NLP-based Transformer network is a powerful tool for learning the characteristics of gene and non-gene sequences.
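The nucleotide-level comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's actual evaluation code. It assumes binary labels (1 = gene, 0 = non-gene), treats "gene" as the positive class, and assumes each sequence-level prediction covers a fixed-length, non-overlapping window. The function names `expand_windows` and `nucleotide_metrics` are hypothetical.

```python
import math


def expand_windows(window_labels, window_size):
    """Repeat each window-level label once per nucleotide in that window.

    Assumption for illustration: windows are non-overlapping and equally long.
    """
    return [label for label in window_labels for _ in range(window_size)]


def nucleotide_metrics(truth, pred):
    """Compute accuracy, F1-score, and MCC over per-nucleotide binary labels."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must have the same length")

    # Confusion-matrix counts, with 1 (gene) as the positive class.
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0

    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN));
    # by convention it is set to 0 when the denominator vanishes.
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


if __name__ == "__main__":
    # Toy data: two windows of four nucleotides each.
    truth = expand_windows([1, 0], window_size=4)
    pred = [1, 1, 1, 0, 0, 0, 0, 1]
    print(nucleotide_metrics(truth, pred))
```

Because MCC uses all four confusion-matrix counts, it can stay moderate even when accuracy is high on an imbalanced dataset, which is one reason to report the two together.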
Subject: Transformer
Deep learning
Escherichia coli

Files in this item

File: Viljamaa_Venla_thesis_2022.pdf
Size: 2.238 Mb
Format: PDF
