Browsing by Subject "BURROWS-WHEELER TRANSFORM"


Now showing items 1-6 of 6
  • Trotta, Luca; Norberg, Anna; Taskinen, Mervi; Beziat, Vivien; Degerman, Sofie; Wartiovaara-Kautto, Ulla; Välimaa, Hannamari; Jahnukainen, Kirsi; Casanova, Jean-Laurent; Seppänen, Mikko; Saarela, Janna; Koskenvuo, Minna; Martelius, Timi (2018)
    Background: The telomere biology disorders (TBDs) include a range of multisystem diseases characterized by mucocutaneous symptoms and bone marrow failure. In dyskeratosis congenita (DKC), the clinical features of TBDs stem from the depletion of crucial stem cell populations in highly proliferative tissues, resulting from abnormal telomerase function. Due to the wide spectrum of clinical presentations and the lack of a conclusive laboratory test, it may be challenging to reach a clinical diagnosis, especially if patients lack the pathognomonic clinical features of TBDs. Methods: Clinical sequencing was performed on a cohort of patients presenting with variable immune phenotypes and lacking molecular diagnoses. Hypothesis-free whole-exome sequencing (WES) was selected in the absence of compelling diagnostic hints in patients with variable immunological and haematological conditions. Results: In four patients belonging to three families, we detected five novel variants in known TBD-causing genes (DKC1, TERT and RTEL1). In addition to the molecular findings, all four patients presented with shortened blood cell telomeres. These findings are consistent with the displayed TBD phenotypes and point towards the molecular diagnosis and subsequent clinical follow-up of the patients. Conclusions: Our results strongly support the utility of WES-based approaches for routine genetic diagnostics of TBD patients with heterogeneous or atypical clinical presentation who otherwise might remain undiagnosed.
  • Norri, Tuukka; Cazaux, Bastien; Kosolobov, Dmitry; Mäkinen, Veli (2019)
    Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold $L$ and a set $R=\{R_1,\ldots,R_m\}$ of $m$ strings (haplotype sequences), each having length $n$, the minimum segmentation problem for founder reconstruction is to partition $[1,n]$ into a set $P$ of disjoint segments such that each segment $[a,b] \in P$ has length at least $L$ and the number $d(a,b)=|\{R_i[a,b] : 1 \le i \le m\}|$ of distinct substrings at segment $[a,b]$ is minimized over $[a,b] \in P$. The distinct substrings in the segments represent founder blocks that can be concatenated to form $\max\{d(a,b) : [a,b] \in P\}$ founder sequences representing the original $R$ such that crossovers happen only at segment boundaries. Results: We give an $O(mn)$ time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier $O(mn^2)$ algorithm. Conclusions: Our improvement makes it possible to apply the formulation to an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence of its practicality. The implementation is available at https://github.com/tsnorri/founder-sequences.
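To make the segmentation formulation above concrete, here is a minimal sketch in Python. It is a naive dynamic program and not the O(mn) algorithm the abstract reports. It assumes the objective is to minimize the largest number of distinct substrings in any segment, which matches the stated founder count of max d(a,b). The function name `min_segmentation` and the 0-based half-open segment convention are illustrative choices, not taken from the authors' implementation.

```python
def min_segmentation(rows, min_len):
    """Naive DP for the minimum segmentation problem (illustrative only).

    rows    : list of equal-length strings (the aligned haplotypes)
    min_len : the threshold L, the minimum allowed segment length
    Returns (cost, segments), where segments are 0-based half-open
    intervals (start, end) and cost is the largest number of distinct
    substrings found in any single segment.
    """
    n = len(rows[0])
    if any(len(r) != n for r in rows):
        raise ValueError("all rows must have the same length")
    if n < min_len:
        raise ValueError("input is shorter than the minimum segment length")

    inf = float("inf")
    # best[j] = smallest achievable cost for the prefix [0, j)
    best = [inf] * (n + 1)
    prev = [-1] * (n + 1)  # start of the last segment in an optimal solution
    best[0] = 0

    for j in range(min_len, n + 1):
        for i in range(0, j - min_len + 1):
            if best[i] == inf:
                continue  # prefix [0, i) cannot be segmented
            distinct = len({r[i:j] for r in rows})  # d(i, j) for this segment
            cost = max(best[i], distinct)
            if cost < best[j]:
                best[j] = cost
                prev[j] = i

    # Walk the back-pointers to recover the segment boundaries.
    segments = []
    j = n
    while j > 0:
        segments.append((prev[j], j))
        j = prev[j]
    segments.reverse()
    return best[n], segments


cost, segments = min_segmentation(["ACGT", "ACGA", "TCGT"], 2)
print(cost, segments)
```

On this toy input, a single segment would need three founders, while splitting at position 2 needs only two per segment. The sketch recomputes each substring set from scratch, so its running time is far from linear.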
  • Belazzougui, Djamal; Cunial, Fabio; Karkkainen, Juha; Makinen, Veli (2020)
    The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and by detecting common substructures by querying the index. Efficient string indexing and analysis in small space also lies at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing. We report the following advances in string indexing and analysis. We show that the BWT of a string $T \in \{1, \ldots, \sigma\}^n$ can be built in deterministic $O(n)$ time using just $O(n \log \sigma)$ bits of space, where $\sigma$ [...]. We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized $O(n)$ time and in $O(n \log \sigma)$ bits of space. The previously fastest construction algorithms for BWT, compressed suffix array and compressed suffix tree, which used $O(n \log \sigma)$ bits of space, took $O(n \log \log \sigma)$ time for the first two structures and $O(n \log^{\epsilon} n)$ time for the third, where $\epsilon$ is any positive constant smaller than one. Alternatively, the BWT could previously be built in linear time if one was willing to spend $O(n \log \sigma \log \log_{\sigma} n)$ bits of space. In contrast to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.
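For readers unfamiliar with the transform itself, the following Python sketch shows what the BWT of a string is and how it can be inverted. It sorts all rotations explicitly, which is a textbook construction and not the linear-time, small-space algorithm described in the abstract. The function names and the `$` sentinel are assumptions made for illustration; the sentinel must not occur in the input.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform by sorting all rotations (naive)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last, sentinel="$"):
    """Recover the original text from its BWT.

    A stable sort of the positions of `last` by character yields the
    first column; order[k] is the row reached by rotating row k one
    step to the left.
    """
    n = len(last)
    order = sorted(range(n), key=lambda i: last[i])  # sorted() is stable
    row = last.index(sentinel)  # the row equal to text + sentinel
    out = []
    for _ in range(n - 1):
        row = order[row]
        out.append(last[row])
    return "".join(out)


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

The explicit rotation matrix takes quadratic space, which is exactly the cost that succinct constructions such as the ones in this abstract are designed to avoid.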
  • Jin, Long; Yu, Jian Ping; Yang, Zai Jun; Merilä, Juha; Liao, Wen Bo (2018)
    Hibernation is an effective energy conservation strategy that has been widely adopted by animals to cope with unpredictable environmental conditions. The liver, in particular, plays an important role in adaptive metabolic adjustment during hibernation. Mammalian studies have revealed that many genes involved in metabolism are differentially expressed during the hibernation period. However, the differentiation in global gene expression between active and torpid states in amphibians remains largely unknown. We analyzed gene expression in the liver of active and torpid Asiatic toads (Bufo gargarizans) using RNA-sequencing. In addition, we evaluated the differential expression of genes between females and males. A total of 1399 genes were identified as differentially expressed between active and torpid females. Of these, 395 genes showed significantly elevated expression in torpid females, including genes responding to stresses as well as contractile proteins. The remaining 1004 genes were significantly down-regulated in torpid females, most of which were involved in metabolic depression and shifts in energy utilization. Of the 715 differentially expressed genes between active and torpid males, 337 were up-regulated and 378 down-regulated. A total of 695 genes were differentially expressed between active females and males, of which 655 genes were significantly down-regulated in males. Similarly, 374 differentially expressed genes were identified between torpid females and males, with the expression of 252 genes (mostly contractile proteins) being significantly down-regulated in males. Our findings suggest that the expression of many genes in the liver of B. gargarizans is down-regulated during hibernation. Furthermore, there are marked sex differences in the levels of gene expression, with females showing elevated levels of gene expression as compared to males, as well as more marked down-regulation of gene expression in torpid males than females.
  • Valenzuela, Daniel; Norri, Tuukka; Välimäki, Niko; Pitkänen, Esa; Mäkinen, Veli (2018)
    Background: A typical human genome differs from the reference genome at 4-5 million sites. This diversity is increasingly catalogued in repositories such as ExAC/gnomAD, consisting of >15,000 whole genomes and >126,000 exome sequences from different individuals. Despite this enormous diversity, resequencing data workflows are still based on a single human reference genome. Identification and genotyping of genetic variants is typically carried out on short-read data aligned to a single reference, disregarding the underlying variation. Results: We propose a new unified framework for variant calling with short-read data utilizing a representation of human genetic variation - a pan-genomic reference. We provide a modular pipeline that can be seamlessly incorporated into existing sequencing data analysis workflows. Our tool is open source and available online: https://gitlab.com/dvalenzu/PanVC. Conclusions: Our experiments show that by replacing a standard human reference with a pan-genomic one we achieve an improvement in single-nucleotide variant calling accuracy and in short indel calling accuracy over the widely adopted Genome Analysis Toolkit (GATK) in difficult genomic regions.
  • Seo, Seung Bum; Zeng, Xiangpei; King, Jonathan L.; Larue, Bobby L.; Assidi, Mourad; Al-Qahtani, Mohamed H.; Sajantila, Antti; Budowle, Bruce (2015)
    Background: Massively parallel sequencing (MPS) technologies have the capacity to sequence targeted regions or whole genomes of multiple nucleic acid samples with high coverage by sequencing millions of DNA fragments simultaneously. Compared with Sanger sequencing, MPS can also reduce labor and cost on a per nucleotide basis and indeed on a per sample basis. In this study, whole genomes of human mitochondria (mtGenome) were sequenced on the Personal Genome Machine (PGM™) (Life Technologies, San Francisco, CA), the output data were assessed, and the results were compared with data previously generated on the MiSeq™ (Illumina, San Diego, CA). The objectives of this paper were to determine the feasibility, accuracy, and reliability of sequence data obtained from the PGM. Results: 24 samples were multiplexed (in groups of six) and sequenced on the 314 chip, which has a throughput of at least 10 megabases. The depth of coverage pattern was similar among all 24 samples; however, the coverage across the genome varied. For strand bias, the average ratio of coverage between the forward and reverse strands at each nucleotide position indicated that two-thirds of the positions of the genome had ratios that were greater than 0.5. A few sites had more extreme strand bias. Another observation was that 156 positions had a false deletion rate greater than 0.15 in one or more individuals. There were 31-98 SNP mtGenome variants observed per sample for the 24 samples analyzed. A total of 1237 SNP variants were concordant between the results from the PGM and MiSeq. The quality scores for haplogroup assignment for all 24 samples ranged between 88.8% and 100%. Conclusions: In this study, mtDNA sequence data generated from the PGM were analyzed and the output evaluated. Depth of coverage variation and strand bias were identified but generally were infrequent and did not impact the reliability of variant calls. Multiplexing of samples was demonstrated, which can improve throughput and reduce cost per sample analyzed. Overall, the results of this study, based on orthogonal concordance testing and phylogenetic scrutiny, supported that whole mtGenome sequence data with high accuracy can be obtained using the PGM platform.