Optical Maps in Genome Assembly and Long k-mer Extraction

Author: Leinonen, Miika
Record dates: 2024-05-07; 2024-05-14; 2024-05-07; 2024-05-24
URI: http://hdl.handle.net/10138/575223

Abstract:

Sequencing the entire human genome has been an ambitious endeavor, culminating in a gapless human genome sequence in 2022. While this accomplishment is noteworthy, the benefits of genome sequencing have long been recognized. A lot of work goes into gathering, processing, and analyzing sequencing data. In this thesis, we explore some of the behind-the-scenes technical aspects of bioinformatics. More specifically, we look at the challenges posed by the sequencing data itself and at tools that transform it into formats usable in downstream applications and analysis.

We start by looking at the genome assembly process and how it can be enhanced to obtain a more complete and accurate picture of the genome. Genome assembly is needed because sequencing data does not represent genomes completely accurately: the data is fragmented and contains errors. The goal of genome assembly is to reconstruct an accurate depiction of the underlying genome from the available imperfect data. The work in this thesis examines how this process can be improved with additional data in the form of optical maps. We propose a genome assembly pipeline that uses optical maps to produce higher-quality contigs.

The second part of this thesis focuses on the k-mer counting problem. Sequencing data is often split into substrings of a fixed length k, called k-mers. Doing so makes the data easier to process and enables analysis based on k-mer counts. Because of the errors in the sequencing data, finding longer k-mers accurately can be difficult. For this problem, we present two approaches that aim to find long k-mers accurately even in the presence of errors. The first solution works well when only substitution errors are present.
However, this is not a realistic situation, and our second solution is intended to address it. The proposed method works well compared to conventional k-mer counting, but it cannot compete with conventional counting when the reads are corrected beforehand.

The last problem discussed in this thesis is also related to k-mers. This time, instead of focusing on the accuracy of the k-mer counting process, we look into how to represent the k-mers as memory-efficiently as possible. When sequencing data is split into k-mers, much of the information is repeated across them. To take advantage of this, we present a dictionary solution for long k-mers in which the k-mers are stored implicitly. While our proposed data structure is slower than a plain hash table implementation, it can save a lot of space when processing long k-mers.

Abstract (translated from Finnish):

The value of information about the structure of genomes has been recognized for decades. Challenges remain, however, and many of them stem from the nature of sequencing data. The data that sequencing machines read from samples is not entirely error-free. In addition, the whole genome cannot be read at once. Instead, the data obtained as strings, the sequences, are fragments of the genome whose positions in the original genome are unknown. In this thesis we present methods that help put sequencing data to practical use.

We begin by examining genome assembly and how the process can be improved so that the genome assembled from the sequences is as accurate as possible. We present a method in which genome assembly is supported by optical maps. Optical maps let us estimate the relative order of the sequences, which makes it possible to correct errors during assembly. With our method we obtained better assembly results than with a method that did not use optical maps.

The next part of the thesis focuses on k-mer counting. The content of sequencing data is often represented and processed as k-mers, that is, strings of length k. We focus on long k-mers because they carry more information about the k-mers' original positions in the genome. Long k-mers are, however, particularly susceptible to errors in the data. We therefore developed a method for counting k-mers from sequencing data that can correct errors during the counting itself. Our first method works well on data containing only substitution errors. Our second method also accounts for insertions and deletions. With these methods we found more correct k-mers than the conventional k-mer counting process when the data had not been corrected beforehand.

The final part of the thesis addresses the memory-efficient representation of k-mers. We developed a hash-table-based method for counting long k-mers in which the k-mer strings need not be stored in memory in full. Our method reduced memory use compared with hash tables that store the full string of every k-mer.

Format: application/pdf
Language: eng
Rights: This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.
Subject: Computer Science
Title (Finnish): Optiset kartat genomin kasauksessa ja pitkien k-meerien laskenta
Identifier: URN:ISBN:978-952-84-0130-8
Type: Doctoral dissertation (article-based)