Browsing by Subject "ALIGNMENT"

Now showing items 1-20 of 55
  • Mehine, Miika; Khamaiseh, Sara; Ahvenainen, Terhi; Heikkinen, Tuomas; Äyräväinen, Anna; Pakarinen, Päivi; Härkki, Päivi; Pasanen, Annukka; Bützow, Ralf; Vahteristo, Pia (2020)
    Simple Summary: Uterine leiomyomas are benign smooth muscle tumors affecting millions of women globally. On a molecular level, leiomyomas can be classified into three main subtypes, each characterized by mutations affecting either MED12, HMGA2, or FH. Leiomyomas are still widely regarded as a single entity, although early observations suggest that different subtypes behave differently, in terms of both clinical outcomes and therapeutic requirements. The majority of classification studies on leiomyomas have been performed using fresh frozen tissue. Archival formalin-fixed paraffin-embedded (FFPE) tissue represents an invaluable source of biological material that can be studied retrospectively. Methods capable of generating high-quality data from FFPE material are in high demand. Here, we show that 3' RNA sequencing can accurately classify leiomyomas that have been stored as FFPE tissue in hospital archives for years. A targeted 3' RNA sequencing panel could provide researchers and clinicians with a cost-effective and scalable diagnostic tool for classifying smooth muscle tumors. Abstract: Uterine leiomyomas are benign smooth muscle tumors occurring in 70% of women of reproductive age. The majority of leiomyomas harbor one of three well-established genetic changes: a hotspot mutation in MED12, overexpression of HMGA2, or biallelic loss of FH. The majority of studies have classified leiomyomas by complex and costly methods, such as whole-genome sequencing, or by combining multiple traditional methods, such as immunohistochemistry and Sanger sequencing. The type of specimens and the amount of resources available often determine the choice. A more universal, cost-effective, and scalable method for classifying leiomyomas is needed. The aim of this study was to evaluate whether RNA sequencing can accurately classify formalin-fixed paraffin-embedded (FFPE) leiomyomas. We performed 3' RNA sequencing with 44 leiomyoma and 5 myometrium FFPE samples, revealing that the samples clustered according to the mutation status of MED12, HMGA2, and FH. Furthermore, we confirmed each subtype in a publicly available fresh frozen dataset. These results indicate that a targeted 3' RNA sequencing panel could serve as a cost-effective and robust tool for stratifying both fresh frozen and FFPE leiomyomas. This study also highlights 3' RNA sequencing as a promising method for studying the abundance of unexploited tissue material that is routinely stored in hospital archives.
  • Acosta, Nidia Obscura; Mäkinen, Veli; Tomescu, Alexandru I. (2018)
    Background: Reconstructing the genome of a species from short fragments is one of the oldest bioinformatics problems. Metagenomic assembly is a variant of the problem asking to reconstruct the circular genomes of all bacterial species present in a sequencing sample. This problem can be naturally formulated as finding a collection of circular walks of a directed graph G that together cover all nodes, or edges, of G. Approach: We address this problem with the "safe and complete" framework of Tomescu and Medvedev (Research in Computational Molecular Biology, 20th Annual Conference, RECOMB 9649: 152-163, 2016). An algorithm is called safe if it returns only those walks (also called safe) that appear as a subwalk in all metagenomic assembly solutions for G. A safe algorithm is called complete if it returns all safe walks of G. Results: We give graph-theoretic characterizations of the safe walks of G, and a safe and complete algorithm finding all safe walks of G. In the node-covering case, our algorithm runs in time O(m^2 + n^3), and in the edge-covering case it runs in time O(m^2 n); n and m denote the number of nodes and edges, respectively, of G. This algorithm constitutes the first theoretical tight upper bound on what can be safely assembled from metagenomic reads using this problem formulation.
  • Mukherjee, Kingshuk; Alipanahi, Bahar; Kahveci, Tamer; Salmela, Leena; Boucher, Christina (2019)
    Motivation: Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have been mainly used in a post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single-molecule optical maps, called Rmaps, or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data, optical maps and sequence data, which will facilitate the usage of optical maps in the sequence assembly step itself. Results: We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem, which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data.
  • Lamnidis, Thiseas C.; Majander, Kerttu; Jeong, Choongwon; Salmela, Elina; Wessman, Anna; Moiseyev, Vyacheslav; Khartanovich, Valery; Balanovsky, Oleg; Ongyerth, Matthias; Weihmann, Antje; Sajantila, Antti; Kelso, Janet; Pääbo, Svante; Onkamo, Päivi; Haak, Wolfgang; Krause, Johannes; Schiffels, Stephan (2018)
    European population history has been shaped by migrations of people, and their subsequent admixture. Recently, ancient DNA has brought new insights into European migration events linked to the advent of agriculture, and possibly to the spread of Indo-European languages. However, little is known about the ancient population history of north-eastern Europe, in particular about populations speaking Uralic languages, such as Finns and Saami. Here we analyse ancient genomic data from 11 individuals from Finland and north-western Russia. We show that the genetic makeup of northern Europe was shaped by migrations from Siberia that began at least 3500 years ago. This Siberian ancestry was subsequently admixed into many modern populations in the region, particularly into populations speaking Uralic languages today. Additionally, we show that ancestors of modern Saami inhabited a larger territory during the Iron Age, which adds to the historical and linguistic information about the population history of Finland.
  • Holden, Lindsay A.; Arumilli, Meharji; Hytonen, Marjo K.; Hundi, Sruthi; Salojärvi, Jarkko; Brown, Kim H.; Lohi, Hannes (2018)
    Dogs are excellent animal models for human disease. They have extensive veterinary histories, pedigrees, and a unique genetic system due to breeding practices. Despite these advantages, one factor limiting their usefulness is the canine genome reference (CGR), which was assembled using a single purebred Boxer. Although a common practice, this results in many high-quality reads remaining unmapped. To address this, whole-genome sequence data from three breeds, Border Collie (n = 26), Bearded Collie (n = 7), and Entlebucher Sennenhund (n = 8), were analyzed to identify novel, non-CGR genomic contigs using the previously validated pseudo-de novo assembly pipeline. We identified 256,957 novel contigs, and paired-end relationships together with BLAT scores provided 126,555 (49%) high-quality contigs with genomic coordinates containing 4.6 Mb of novel sequence absent from the CGR. These contigs close 12,503 known gaps, including 2.4 Mb containing partially missing sequences for 11.5% of Ensembl, 16.4% of RefSeq and 12.2% of canFam3.1+ CGR annotated genes and 1,748 unmapped contigs containing 2,366 novel gene variants. Examples for six disease-associated genes (SCARF2, RD3, COL9A3, FAM161A, RASGRP1 and DLX6) containing gaps or alternate splice variants missing from the CGR are also presented. These findings from non-reference breeds support the need for improvement of the current Boxer-only CGR to avoid missing important biological information. The inclusion of the missing gene sequences into the CGR will facilitate identification of putative disease mutations across diverse breeds and phenotypes.
  • Cheng, Lu; Walker, Alan W.; Corander, Jukka (2012)
  • Celorio-Mancera, Maria de la Paz; Rastas, Pasi; Steward, Rachel A.; Nylin, Soren; Wheat, Christopher W. (2021)
    The comma butterfly (Polygonia c-album, Nymphalidae, Lepidoptera) is a model insect species, most notably in the study of phenotypic plasticity and plant-insect coevolutionary interactions. In order to facilitate the integration of genomic tools with a diverse body of ecological and evolutionary research, we assembled the genome of a Swedish comma using 10X sequencing, scaffolding with mate-pair data, genome polishing, and assignment to linkage groups using a high-density linkage map. The resulting genome is 373 Mb in size, with a scaffold N50 of 11.7 Mb and contig N50 of 11.2 Mb. The genome contained 90.1% of single-copy Lepidopteran orthologs in a BUSCO analysis of 5,286 genes. A total of 21,004 gene models were annotated on the genome using RNA-Seq data from larval and adult tissue in combination with proteins from the Arthropoda database, resulting in a high-quality annotation for which functional annotations were generated. We further documented the quality of the chromosomal assembly via synteny assessment with Melitaea cinxia. The resulting annotated, chromosome-level genome will provide an important resource for investigating coevolutionary dynamics and comparative analyses in Lepidoptera.
  • Keller, Saskia; Hetzel, Udo; Sironen, Tarja; Korzyukov, Yegor; Vapalahti, Olli; Kipar, Anja; Hepojoki, Jussi (2017)
    Boid inclusion body disease (BIBD) is an often fatal disease affecting mainly constrictor snakes. BIBD has been associated with infection, and more recently with coinfection, by various reptarenavirus species (family Arenaviridae). Thus far BIBD has only been reported in captive snakes, and neither the incubation period nor the route of transmission is known. Herein we provide strong evidence that co-infecting reptarenavirus species can be vertically transmitted in Boa constrictor. In total we examined five B. constrictor clutches with offspring ranging in age from embryos over perinatal abortions to juveniles. The mother and/or father of each clutch were initially diagnosed with BIBD and/or reptarenavirus infection by detection of the pathognomonic inclusion bodies (IB) and/or reptarenaviral RNA. By applying next-generation sequencing and de novo sequence assembly we determined the "reptarenavirome" of each clutch, yielding several nearly complete L and S segments of multiple reptarenaviruses. We further confirmed vertical transmission of the co-infecting reptarenaviruses by species-specific RT-PCR from samples of parental animals and offspring. Curiously, not all offspring obtained the full parental "reptarenavirome". We extended our findings by an in vitro approach; cell cultures derived from embryonal samples rapidly developed IB and promoted replication of some or all parental viruses. In the tissues of embryos and perinatal abortions, viral antigen was sometimes detected, but IB were consistently seen only in the juvenile snakes from the age of 2 mo onwards. In addition to demonstrating vertical transmission of multiple species, our results also indicate that reptarenavirus infection induces BIBD over time in the offspring.
  • Rehman, Umar; Sultana, Nighat; Abdullah; Jamal, Abbas; Muzaffar, Maryam; Poczai, Péter (2021)
    Family Phyllanthaceae belongs to the eudicot order Malpighiales, and its species are herbs, shrubs, and trees that are mostly distributed in tropical regions. Here, we elucidate the molecular evolution of the chloroplast genome in Phyllanthaceae and identify the polymorphic loci for phylogenetic inference. We de novo assembled the chloroplast genomes of three Phyllanthaceae species, i.e., Phyllanthus emblica, Flueggea virosa, and Leptopus cordifolius, and compared them with six other previously reported genomes. All species comprised two inverted repeat regions (size range 23,921–27,128 bp) that separated large single-copy (83,627–89,932 bp) and small single-copy (17,424–19,441 bp) regions. Chloroplast genomes contained 111–112 unique genes, including 77–78 protein-coding, 30 tRNAs, and 4 rRNAs. The deletion/pseudogenization of rps16 genes was found in only two species. High variability was seen in the number of oligonucleotide repeats, while guanine-cytosine contents, codon usage, amino acid frequency, simple sequence repeats, synonymous and non-synonymous substitutions, and transition and transversion substitutions were similar. The transition substitutions were higher in coding sequences than in non-coding sequences. Phylogenetic analysis revealed the polyphyletic nature of the genus Phyllanthus. The polymorphic protein-coding genes, including rpl22, ycf1, matK, ndhF, and rps15, were also determined, which may be helpful for reconstructing the high-resolution phylogenetic tree of the family Phyllanthaceae. Overall, the study provides insight into the chloroplast genome evolution in Phyllanthaceae.
  • Abdullah; Mehmood, Furrukh; Heidari, Parviz; Ahmed, Ibrar; Poczai, Péter (2021)
    The genus Blumea (Asteroideae, Asteraceae) comprises about 100 species, including herbs, shrubs, and small trees. Previous studies have been unable to resolve taxonomic issues and the phylogeny of the genus Blumea due to the low polymorphism of molecular markers. Therefore, suitable polymorphic regions need to be identified. Here, we de novo assembled plastomes of the three Blumea species B. oxyodonta, B. tenella, and B. balsamifera and compared them with 25 other species of Asteroideae after correction of annotations. These species have quadripartite plastomes with similar gene content and genome organization comprising 113 genes, including 80 protein-coding, 29 transfer RNA, and 4 ribosomal RNA genes. The contraction and expansion of inverted repeats also show high similarities among the species. The comparative analysis of codon usage, amino acid frequency, microsatellite repeats, oligonucleotide repeats, and transition and transversion substitutions has revealed high resemblance among the newly assembled species of Blumea. We identified 10 highly polymorphic regions with nucleotide diversity above 0.02, including rps16-trnQ, ycf1, ndhF-rpl32, rps15, petN-psbM, and rpl32-trnL, and they may be suitable for the development of robust, authentic, and cost-effective markers for bar coding and inference of the phylogeny of the genus Blumea. Among these highly polymorphic regions, five regions also co-occurred with oligonucleotide repeats and support use of repeats as a proxy for the identification of polymorphic loci. The phylogenetic analysis revealed a close relationship between Blumea and Pluchea within the tribe Inuleae. Our study supports a sister relationship between “Astereae and Anthemideae,” while Gnaphalieae roots these two tribes, whereas in a previous study a sister relationship was reported between “Senecioneae and Anthemideae” and “Astereae and Gnaphalieae” using nuclear genome sequences. The conflicting phylogenetic signals observed at the tribal level between chloroplast and nuclear genome data require further investigation.
  • Abdullah; Henriquez, Claudia L.; Mehmood, Furrukh; Shahzadi, Iram; Ali, Zain; Waheed, Mohammad Tahir; Croat, Thomas B; Poczai, Péter; Ahmed, Ibrar (2020)
    The chloroplast genome provides insight into the evolution of plant species. We de novo assembled and annotated chloroplast genomes of four genera representing three subfamilies of Araceae: Lasia spinosa (Lasioideae), Stylochaeton bogneri, Zamioculcas zamiifolia (Zamioculcadoideae), and Orontium aquaticum (Orontioideae), and performed comparative genomics using these chloroplast genomes. The sizes of the chloroplast genomes ranged from 163,770 bp to 169,982 bp. These genomes comprise 113 unique genes, including 79 protein-coding, 4 rRNA, and 30 tRNA genes. Among these genes, 17–18 genes are duplicated in the inverted repeat (IR) regions, comprising 6–7 protein-coding (including trans-splicing gene rps12), 4 rRNA, and 7 tRNA genes. The total number of genes ranged between 130 and 131. The infA gene was found to be a pseudogene in all four genomes reported here. These genomes exhibited high similarities in codon usage, amino acid frequency, RNA editing sites, and microsatellites. The oligonucleotide repeats and junctions JSB (IRb/SSC) and JSA (SSC/IRa) were highly variable among the genomes. The patterns of IR contraction and expansion were shown to be homoplasious, and therefore unsuitable for phylogenetic analyses. Signatures of positive selection were seen in three genes in S. bogneri, including ycf2, clpP, and rpl36. This study is a valuable addition to the evolutionary history of chloroplast genome structure in Araceae.
  • Beier, Sebastian; Himmelbach, Axel; Colmsee, Christian; Zhang, Xiao-Qi; Barrero, Roberto A.; Zhang, Qisen; Li, Lin; Bayer, Micha; Bolser, Daniel; Taudien, Stefan; Groth, Marco; Felder, Marius; Hastie, Alex; Simkova, Hana; Stankova, Helena; Vrana, Jan; Chan, Saki; Munoz-Amatriain, Maria; Ounit, Rachid; Wanamaker, Steve; Schmutzer, Thomas; Aliyeva-Schnorr, Lala; Grasso, Stefano; Tanskanen, Jaakko; Sampath, Dharanya; Heavens, Darren; Cao, Sujie; Chapman, Brett; Dai, Fei; Han, Yong; Li, Hua; Li, Xuan; Lin, Chongyun; McCooke, John K.; Tan, Cong; Wang, Songbo; Yin, Shuya; Zhou, Gaofeng; Poland, Jesse A.; Bellgard, Matthew I.; Houben, Andreas; Dolezel, Jaroslav; Ayling, Sarah; Lonardi, Stefano; Langridge, Peter; Muehlbauer, Gary J.; Kersey, Paul; Clark, Matthew D.; Caccamo, Mario; Schulman, Alan H.; Platzer, Matthias; Close, Timothy J.; Hansson, Mats; Zhang, Guoping; Braumann, Ilka; Li, Chengdao; Waugh, Robbie; Scholz, Uwe; Stein, Nils; Mascher, Martin (2017)
    Barley (Hordeum vulgare L.) is a cereal grass mainly used as animal fodder and raw material for the malting industry. The map-based reference genome sequence of barley cv. 'Morex' was constructed by the International Barley Genome Sequencing Consortium (IBSC) using hierarchical shotgun sequencing. Here, we report the experimental and computational procedures to (i) sequence and assemble more than 80,000 bacterial artificial chromosome (BAC) clones along the minimum tiling path of a genome-wide physical map, (ii) find and validate overlaps between adjacent BACs, (iii) construct 4,265 non-redundant sequence scaffolds representing clusters of overlapping BACs, and (iv) order and orient these BAC clusters along the seven barley chromosomes using positional information provided by dense genetic maps, an optical map and chromosome conformation capture sequencing (Hi-C). Integrative access to these sequence and mapping resources is provided by the barley genome explorer (BARLEX).
  • Jouhten, Hanne; Ronkainen, Aki; Aakko, Juhani; Salminen, Seppo; Mattila, Eero; Arkkila, Perttu; Satokari, Reetta (2020)
    Fecal microbiota transplantation (FMT) is an effective treatment for recurrent Clostridioides difficile infection (rCDI) and is also being considered for treating other indications. Metagenomic studies have indicated that commensal donor bacteria may colonize FMT recipients, but cultivation has not been employed to verify strain-level colonization. We combined molecular profiling of Bifidobacterium populations with cultivation, molecular typing, and whole genome sequencing (WGS) to isolate and identify strains that were transferred from donors to recipients. Several Bifidobacterium strains from two donors were recovered from 13 recipients during the 1-year follow-up period after FMT. The strain identities were confirmed by WGS and comparative genomics. Our results show that specific donor-derived bifidobacteria can colonize rCDI patients for at least 1 year, and thus FMT may have long-term consequences for the recipient's microbiota and health. Conceptually, we demonstrate that FMT trials combined with microbial profiling can be used as a platform for discovering and isolating commensal strains with proven colonization capacity for potential therapeutic use.
  • Herranen, J.; Markkanen, J.; Muinonen, K. (2017)
    We establish a theoretical framework for solving the equations of motion for an arbitrarily shaped, inhomogeneous dust particle in the presence of radiation pressure. The repeated scattering problem involved is solved using a state-of-the-art volume integral equation-based T-matrix method. A Fortran implementation of the framework is used to solve the explicit time evolution of a homogeneous irregular sample geometry. The results are shown to be consistent with rigid body dynamics, between integrators, and comparable with predictions from an alignment efficiency potential map. Also, we demonstrate the explicit effect of single-particle dynamics on the observed polarization using the obtained orientational results.
  • BEEHIVE Collaboration; Wymant, Chris; Blanquart, Francois; Golubchik, Tanya; Gall, Astrid; Bakker, Margreet; Bezemer, Daniela; Croucher, Nicholas J.; Hall, Matthew; Hillebregt, Mariska; Ong, Swee Hoe; Ratmann, Oliver; Albert, Jan; Bannert, Norbert; Fellay, Jacques; Fransen, Katrien; Gourlay, Annabelle; Grabowski, M. Kate; Gunsenheimer-Bartmeyer, Barbara; Gunthard, Huldrych F.; Kivelä, Pia; Kouyos, Roger; Laeyendecker, Oliver; Liitsola, Kirsi; Meyer, Laurence; Porter, Kholoud; Ristola, Matti; van Sighem, Ard; Berkhout, Ben; Cornelissen, Marion; Kellam, Paul; Reiss, Peter; Fraser, Christophe (2018)
    Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from https://github.com/ChrisHIV/shiver.
  • Groussin, Mathieu; Poyet, Mathilde; Sistiaga, Ainara; Kearney, Sean M.; Moniz, Katya; Noel, Mary; Hooker, Jeff; Gibbons, Sean M.; Segurel, Laure; Froment, Alain; Mohamed, Rihlat Said; Fezeu, Alain; Juimo, Vanessa A.; Lafosse, Sophie; Tabe, Francis E.; Girard, Catherine; Iqaluk, Deborah; Nguyen, Le Thanh Tu; Shapiro, B. Jesse; Lehtimaki, Jenni; Ruokolainen, Lasse; Kettunen, Pinja P.; Vatanen, Tommi; Sigwazi, Shani; Mabulla, Audax; Dominguez-Rodrigo, Manuel; Nartey, Yvonne A.; Agyei-Nkansah, Adwoa; Duah, Amoako; Awuku, Yaw A.; Valles, Kenneth A.; Asibey, Shadrack O.; Afihene, Mary Y.; Roberts, Lewis R.; Plymoth, Amelie; Onyekwere, Charles A.; Summons, Roger E.; Xavier, Ramnik J.; Alm, Eric J. (2021)
    Industrialization has impacted the human gut ecosystem, resulting in altered microbiome composition and diversity. Whether bacterial genomes may also adapt to the industrialization of their host populations remains largely unexplored. Here, we investigate the extent to which the rates and targets of horizontal gene transfer (HGT) vary across thousands of bacterial strains from 15 human populations spanning a range of industrialization. We show that HGTs have accumulated in the microbiome over recent host generations and that HGT occurs at high frequency within individuals. Comparison across human populations reveals that industrialized lifestyles are associated with higher HGT rates and that the functions of HGTs are related to the level of host industrialization. Our results suggest that gut bacteria continuously acquire new functionality based on host lifestyle and that high rates of HGT may be a recent development in human history linked to industrialization.
  • Salmela, Leena; Mukherjee, Kingshuk; Puglisi, Simon J.; Muggli, Martin D.; Boucher, Christina (2020)
    Motivation: Optical mapping data is used in many core genomics applications, including structural variation detection, scaffolding assembled contigs and mis-assembly detection. However, the pervasiveness of spurious and deleted cut sites in the raw data, which are called Rmaps, makes assembly and alignment of them challenging. Although there exists another method to error correct Rmap data, named cOMet, it is unable to scale to even moderately large genomes. The challenge faced in error correction is in determining pairs of Rmaps that originate from the same region of the same genome. Results: We create an efficient method for determining pairs of Rmaps that contain significant overlaps between them. Our method relies on the novel and nontrivial adaptation and application of spaced seeds in the context of optical mapping, which allows for spurious and deleted cut sites to be accounted for. We apply our method to detecting and correcting these errors. The resulting error correction method, referred to as Elmeri, improves upon the results of state-of-the-art correction methods but in a fraction of the time. More specifically, cOMet required 9.9 CPU days to error correct Rmap data generated from the human genome, whereas Elmeri required less than 15 CPU hours and improved the quality of the Rmaps by more than four times compared to cOMet.
  • Mukherjee, Kingshuk; Rossi, Massimiliano; Salmela, Leena; Boucher, Christina (2021)
    Genome-wide optical maps are high-resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single-molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006), follows the overlap-layout-consensus (OLC) paradigm and is therefore unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) only successfully ran on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper, is written in C++ and is publicly available under GNU General Public License at .
  • Dhaygude, Kishor Uttam; Johansson, Helena; Kulmuni, Jonna Katharina; Sundström, Liselotte (2019)
    We present the genome organization and molecular characterization of the three Formica exsecta viruses, along with ORF predictions, and functional annotation of genes. The Formica exsecta virus-4 (FeV4; GenBank ID: MF287670) is a newly discovered negative-sense single-stranded RNA virus representing the first identified member of order Mononegavirales in ants, whereas the Formica exsecta virus-1 (FeV1; GenBank ID: KF500001), and the Formica exsecta virus-2 (FeV2; GenBank ID: KF500002) are positive single-stranded RNA viruses initially identified (but not characterized) in our earlier study. The new virus FeV4 was found by re-analyzing data from a study published earlier. The Formica exsecta virus-4 genome is 9,866 bp in size, with an overall G + C content of 44.92%, and contains five predicted open reading frames (ORFs). Our bioinformatics analysis indicates that gaps are absent and the ORFs are complete, which, based on our comparative genomics analysis, suggests that the genomes are complete. Following the characterization, we validate virus infection for FeV1, FeV2 and FeV4 for the first time in field-collected worker ants. Some colonies were infected by multiple viruses, and the viruses were observed to infect all castes, and multiple life stages of workers and queens. Finally, highly similar viruses were expressed in adult workers and queens of six other Formica species: F. fusca, F. pressilabris, F. pratensis, F. aquilonia, F. truncorum and F. cinerea. This research indicates that viruses can be shared between ant species, but further studies on viral transmission are needed to understand viral infection pathways.
  • Yoshida, Satoko; Kim, Seungill; Wafula, Eric K.; Tanskanen, Jaakko; Kim, Yong-Min; Honaas, Loren; Yang, Zhenzhen; Spallek, Thomas; Conn, Caitlin E.; Ichihashi, Yasunori; Cheong, Kyeongchae; Cui, Songkui; Der, Joshua P.; Gundlach, Heidrun; Jiao, Yuannian; Hori, Chiaki; Ishida, Juliane K.; Kasahara, Hiroyuki; Kiba, Takatoshi; Kim, Myung-Shin; Koo, Namjin; Laohavisit, Anuphon; Lee, Yong-Hwan; Lumba, Shelley; McCourt, Peter; Mortimer, Jenny C.; Mutuku, J. Musembi; Nomura, Takahito; Sasaki-Sekimoto, Yuko; Seto, Yoshiya; Wang, Yu; Wakatake, Takanori; Sakakibara, Hitoshi; Demura, Taku; Yamaguchi, Shinjiro; Yoneyama, Koichi; Manabe, Ri-ichiroh; Nelson, David C.; Schulman, Alan H.; Timko, Michael P.; DePamphilis, Claude W.; Choi, Doil; Shirasu, Ken (2019)
    Parasitic plants in the genus Striga, commonly known as witchweeds, cause major crop losses in sub-Saharan Africa and pose a threat to agriculture worldwide. An understanding of Striga parasite biology, which could lead to agricultural solutions, has been hampered by the lack of genome information. Here, we report the draft genome sequence of Striga asiatica with 34,577 predicted protein-coding genes, which reflects gene family contractions and expansions that are consistent with a three-phase model of parasitic plant genome evolution. Striga seeds germinate in response to host-derived strigolactones (SLs) and then develop a specialized penetration structure, the haustorium, to invade the host root. A family of SL receptors has undergone a striking expansion, suggesting a molecular basis for the evolution of broad host range among Striga spp. We found that genes involved in lateral root development in non-parasitic model species are coordinately induced during haustorium development in Striga, suggesting a pathway that was partly co-opted during the evolution of the haustorium. In addition, we found evidence for horizontal transfer of host genes as well as retrotransposons, indicating gene flow to S. asiatica from hosts. Our results provide valuable insights into the evolution of parasitism and a key resource for the future development of Striga control strategies.