Browsing by Subject "Sequencing"

  • Emameh, Reza Zolfaghari; Kuuslahti, Marianne; Nosrati, Hassan; Lohi, Hannes; Parkkila, Seppo (2020)
    Background: The inaccuracy of DNA sequence data is becoming a serious problem, as the amount of molecular data is multiplying rapidly and expectations are high for big data to revolutionize life sciences and health care. In this study, we investigated the accuracy of DNA sequence data from commonly used databases using carbonic anhydrase (CA) gene sequences as generic targets. CAs are ancient metalloenzymes that are present in all unicellular and multicellular living organisms. Among the eight distinct families of CAs, including alpha, beta, gamma, delta, zeta, eta, theta, and iota, only alpha-CAs have been reported in vertebrates. Results: By an in silico analysis performed on the NCBI and Ensembl databases, we identified several beta- and gamma-CA sequences in vertebrates, including Homo sapiens, Mus musculus, Felis catus, Lipotes vexillifer, Pantholops hodgsonii, Hippocampus comes, Hucho hucho, Oncorhynchus tshawytscha, Xenopus tropicalis, and Rhinolophus sinicus. Polymerase chain reaction (PCR) analysis of genomic DNA persistently failed to amplify positive beta- or gamma-CA gene sequences when Mus musculus and Felis catus DNA samples were used as templates. Further BLAST homology searches of the database-derived "vertebrate" beta- and gamma-CA sequences revealed that the identified sequences were presumably derived from gut microbiota, environmental microbiomes, or grassland ecosystems. Conclusions: Our results highlight the need for more accurate and fast curation systems for DNA databases. The mined data must be carefully reconciled with our best knowledge of sequences to improve the accuracy of DNA data for publication.
  • Zolfaghari Emameh, Reza; Kuuslahti, Marianne; Nosrati, Hassan; Lohi, Hannes; Parkkila, Seppo (BioMed Central, 2020)
    Background: The inaccuracy of DNA sequence data is becoming a serious problem, as the amount of molecular data is multiplying rapidly and expectations are high for big data to revolutionize life sciences and health care. In this study, we investigated the accuracy of DNA sequence data from commonly used databases using carbonic anhydrase (CA) gene sequences as generic targets. CAs are ancient metalloenzymes that are present in all unicellular and multicellular living organisms. Among the eight distinct families of CAs, including α, β, γ, δ, ζ, η, θ, and ι, only α-CAs have been reported in vertebrates. Results: By an in silico analysis performed on the NCBI and Ensembl databases, we identified several β- and γ-CA sequences in vertebrates, including Homo sapiens, Mus musculus, Felis catus, Lipotes vexillifer, Pantholops hodgsonii, Hippocampus comes, Hucho hucho, Oncorhynchus tshawytscha, Xenopus tropicalis, and Rhinolophus sinicus. Polymerase chain reaction (PCR) analysis of genomic DNA persistently failed to amplify positive β- or γ-CA gene sequences when Mus musculus and Felis catus DNA samples were used as templates. Further BLAST homology searches of the database-derived “vertebrate” β- and γ-CA sequences revealed that the identified sequences were presumably derived from gut microbiota, environmental microbiomes, or grassland ecosystems. Conclusions: Our results highlight the need for more accurate and fast curation systems for DNA databases. The mined data must be carefully reconciled with our best knowledge of sequences to improve the accuracy of DNA data for publication.
  • Somervuo, Panu; Koskinen, Patrik; Mei, Peng; Holm, Liisa; Auvinen, Petri; Paulin, Lars (2018)
    Background: Current high-throughput sequencing platforms provide capacity to sequence multiple samples in parallel. Different samples are labeled by attaching a short sample-specific nucleotide sequence, a barcode, to each DNA molecule prior to pooling them into a mix containing a number of libraries to be sequenced simultaneously. After sequencing, the samples are binned by identifying the barcode sequence within each sequence read. In order to tolerate sequencing errors, barcodes should be sufficiently far apart from each other in sequence space. An additional constraint due to both nucleotide usage and basecalling accuracy is that the proportion of different nucleotides should be in balance in each barcode position. The number of samples to be mixed in each sequencing run may vary, and this introduces the problem of how to select the best subset of available barcodes at a sequencing core facility for each sequencing run. There are plenty of tools available for de novo barcode design, but they are not suitable for subset selection. Results: We have developed a tool which can be used for three different tasks: 1) selecting an optimal barcode set from a larger set of candidates, 2) checking the compatibility of a user-defined set of barcodes, e.g. whether two or more libraries with existing barcodes can be combined in a single sequencing pool, and 3) augmenting an existing set of barcodes. In our approach the selection process is formulated as a minimization problem. We define the cost function and a set of constraints and use integer programming to solve the resulting combinatorial problem. Based on the desired number of barcodes to be selected and the set of candidate sequences given by the user, the necessary constraints are automatically generated and the optimal solution can be found. The method is implemented in the C programming language, and a web interface is available at http://ekhidna2.biocenter.helsinki.fi/barcosel. Conclusions: Increasing capacity of sequencing platforms raises the challenge of mixing barcodes. Our method allows the user to select a given number of barcodes from a larger existing barcode set so that both sequencing errors are tolerated and the nucleotide balance is optimized. The tool is easy to access via a web browser.
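    The two criteria described in this abstract (barcodes far enough apart in sequence space, and balanced nucleotide usage at each barcode position) can be illustrated with a minimal Python sketch. This is not the authors' tool: the published method uses integer programming and is implemented in C, whereas the sketch below swaps in a brute-force search over subsets that is only practical for small candidate sets. The function names, the use of Hamming distance as the distance measure, and the imbalance score are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def imbalance(barcodes):
    """Hypothetical cost: for each position, the gap between the most and
    least frequent nucleotide (A, C, G, T), summed over all positions."""
    total = 0
    for column in zip(*barcodes):
        counts = Counter(column)
        per_base = [counts.get(base, 0) for base in "ACGT"]
        total += max(per_base) - min(per_base)
    return total


def select_subset(candidates, k, min_distance):
    """Brute-force stand-in for the integer program: return the k-subset
    with the lowest imbalance whose barcodes are all at least
    min_distance apart, or None if no subset qualifies."""
    best = None
    for subset in combinations(candidates, k):
        if min_pairwise_distance(subset) < min_distance:
            continue
        cost = imbalance(subset)
        if best is None or cost < best[0]:
            best = (cost, subset)
    return None if best is None else list(best[1])
```

    The same helpers cover the compatibility check mentioned in the abstract: calling min_pairwise_distance and imbalance on the union of two existing barcode sets indicates whether the libraries could share a pool under these assumed criteria.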
  • Somervuo, Panu; Koskinen, Patrik; Mei, Peng; Holm, Liisa; Auvinen, Petri; Paulin, Lars (BioMed Central, 2018)
    Background: Current high-throughput sequencing platforms provide capacity to sequence multiple samples in parallel. Different samples are labeled by attaching a short sample-specific nucleotide sequence, a barcode, to each DNA molecule prior to pooling them into a mix containing a number of libraries to be sequenced simultaneously. After sequencing, the samples are binned by identifying the barcode sequence within each sequence read. In order to tolerate sequencing errors, barcodes should be sufficiently far apart from each other in sequence space. An additional constraint due to both nucleotide usage and basecalling accuracy is that the proportion of different nucleotides should be in balance in each barcode position. The number of samples to be mixed in each sequencing run may vary, and this introduces the problem of how to select the best subset of available barcodes at a sequencing core facility for each sequencing run. There are plenty of tools available for de novo barcode design, but they are not suitable for subset selection. Results: We have developed a tool which can be used for three different tasks: 1) selecting an optimal barcode set from a larger set of candidates, 2) checking the compatibility of a user-defined set of barcodes, e.g. whether two or more libraries with existing barcodes can be combined in a single sequencing pool, and 3) augmenting an existing set of barcodes. In our approach the selection process is formulated as a minimization problem. We define the cost function and a set of constraints and use integer programming to solve the resulting combinatorial problem. Based on the desired number of barcodes to be selected and the set of candidate sequences given by the user, the necessary constraints are automatically generated and the optimal solution can be found. The method is implemented in the C programming language, and a web interface is available at http://ekhidna2.biocenter.helsinki.fi/barcosel. Conclusions: Increasing capacity of sequencing platforms raises the challenge of mixing barcodes. Our method allows the user to select a given number of barcodes from a larger existing barcode set so that both sequencing errors are tolerated and the nucleotide balance is optimized. The tool is easy to access via a web browser.
  • Ojala, Teija; Laine, Pia K. S.; Ahlroos, Terhi; Tanskanen, Jarna; Pitkanen, Saara; Salusjarvi, Tuomas; Kankainen, Matti; Tynkkynen, Soile; Paulin, Lars; Auvinen, Petri (2017)
    Propionibacterium freudenreichii is a commercially important bacterium that is essential for the development of the characteristic eyes and flavor of Swiss-type cheeses. These bacteria grow actively and produce large quantities of flavor compounds during cheese ripening at warm temperatures but also appear to contribute to the aroma development during the subsequent cold storage of cheese. Here, we advance our understanding of the role of P. freudenreichii in cheese ripening by presenting the 2.68-Mbp annotated genome sequence of P. freudenreichii ssp. shermanii JS and determining its global transcriptional profiles during industrial cheese-making using transcriptome sequencing. The annotation of the genome identified a total of 2377 protein-coding genes and revealed the presence of enzymes and pathways for formation of several flavor compounds. Based on transcriptome profiling, the expression of 348 protein-coding genes was altered between the warm and cold room ripening of cheese. Several genes related to propionate, acetate, and diacetyl/acetoin production had higher expression levels in the warm room, whereas a general slowing down of the metabolism and an activation of mobile genetic elements were seen in the cold room. A few ripening-related genes and genes involved in amino acid catabolism were induced or remained active in the cold room, indicating that strain JS also contributes to the aroma development during cold room ripening. In addition, we performed a comparative genomic analysis of strain JS and 29 other Propionibacterium strains of 10 different species, including an isolate of both P. freudenreichii subspecies freudenreichii and shermanii. Ortholog grouping of the predicted protein sequences revealed that close to 86% of the ortholog groups of strain JS, including a variety of ripening-related ortholog groups, were conserved across the P. freudenreichii isolates. Taken together, this study contributes to the understanding of the genomic basis of P. freudenreichii and sheds light on its activities during cheese ripening. (C) 2016 Elsevier B.V. All rights reserved.
  • Dikareva, Evgenia (Helsingin yliopisto, 2021)
    The gut microbiota has a major impact on human health and early-life development. Viruses infecting prokaryotes, called bacteriophages, are the most abundant group of the gut virome that shapes the prokaryotic community. They have been shown to interact with the human host directly, or indirectly by interfering with the gut bacterial community. Although many studies have explored the human gut virome in recent years and the field is under active investigation, no standardised protocols for high-throughput virome extraction or bioinformatic pipelines for sequence analysis are available. The first aim of this study was to compare the most promising methods for viral particle concentration (dithiothreitol (DTT) and polyethylene glycol (PEG)), followed by DNA extraction, and to scale the methods for a high-throughput procedure. The second aim was to compare four bioinformatics tools: Centrifuge, MetaPhlAn, Gut Virome Database (GVD) and a combination of Centrifuge, MetaPhlAn, VirFinder and Blast (Consensus) by analysing shotgun metagenome sequencing results of infant stool samples at three time points: 1, 6 and 12 months. The adjustments for high-throughput DNA extraction resulted in five protocols. The highest yield of DNA was achieved for 1- and 12-month samples with the PEG method. On the other hand, the DTT method was the best for 6-month samples. The infant’s age was the only significant factor driving the viral composition differences at the family level for the MetaPhlAn (p = 0.004), Centrifuge (p = 0.001) and Consensus (p = 0.001) methods. However, the number of annotated reads and the virome composition depended exclusively on the software used (p = 0.001). All the methods identified the phage families Siphoviridae, Podoviridae and Myoviridae. GVD was the only method that annotated up to 90% of reads to viruses.
    In conclusion, our results suggest that the PEG extraction method may be best suited for large-scale virome enrichment, as it gave the highest DNA yield, was suitable for high-throughput extractions and produced a virome with high variability in phage representation. For novel virus identification, the GVD method would be used further, as it annotated most of the reads to phages.
  • Nair, Preethy Sasidharan; Kuusi, Tuire; Ahvenainen, Minna; Philips, Anju K.; Järvelä, Irma (2019)
    Musical training and performance require precise integration of multisensory and motor centres of the human brain and can be regarded as an epigenetic modifier of brain functions. Numerous studies have identified structural and functional differences between the brains of musicians and non-musicians and superior cognitive functions in musicians. Recently, music-listening and performance have also been shown to affect the regulation of several genes, many of which were identified in songbird singing. MicroRNAs affect gene regulation and studying their expression may give new insights into the epigenetic effect of music. Here, we studied the effect of 2 hours of classical music-performance on peripheral blood microRNA expression in professional musicians with respect to a control activity without music for the same duration. As detecting transcriptomic changes in the functional human brain remains a challenge for geneticists, we used peripheral blood to study music-performance induced microRNA changes and interpreted the results in terms of potential effects on brain function, based on the current knowledge about microRNA function in blood and brain. We identified significant (FDR
  • Bell, Neil E.; Boore, Jeffrey L.; Mishler, Brent D.; Hyvonen, Jaakko (2014)
  • Icay, Katherine; Chen, Ping; Cervera Taboada, Alejandra; Rantanen, Ville; Lehtonen, Rainer; Hautaniemi, Sampsa (2016)
    Background: Large-scale sequencing experiments are complex and require a wide spectrum of computational tools to extract and interpret relevant biological information. This is especially true in projects where individual processing and integrated analysis of both small RNA and complementary RNA data is needed. Such studies would benefit from a computational workflow that is easy to implement and standardizes the processing and analysis of both sequenced data types. Results: We developed SePIA (Sequence Processing, Integration, and Analysis), a comprehensive small RNA and RNA workflow. It provides ready execution for over 20 commonly known RNA-seq tools on top of an established workflow engine and provides dynamic pipeline architecture to manage, individually analyze, and integrate both small RNA and RNA data. Implementation with Docker makes SePIA portable and easy to run. We demonstrate the workflow's extensive utility with two case studies involving three breast cancer datasets. SePIA is straightforward to configure and organizes results into a perusable HTML report. Furthermore, the underlying pipeline engine supports computational resource management for optimal performance. Conclusion: SePIA is an open-source workflow introducing standardized processing and analysis of RNA and small RNA data. SePIA's modular design enables robust customization to a given experiment while maintaining overall workflow structure.
  • Matsson, Hans; Soderhall, Cilla; Einarsdottir, Elisabet; Lamontagne, Maxime; Gudmundsson, Sanna; Backman, Helena; Lindberg, Anne; Ronmark, Eva; Kere, Juha; Sin, Don; Postma, Dirkje S.; Bosse, Yohan; Lundback, Bo; Klar, Joakim (2016)
    Background: Reduced lung function in patients with chronic obstructive pulmonary disease (COPD) is likely due to both environmental and genetic factors. We report here a targeted high-throughput DNA sequencing approach to identify new and previously known genetic variants in a set of candidate genes for COPD. Methods: Exons in 22 genes implicated in lung development as well as 61 genes and 10 genomic regions previously associated with COPD were sequenced using individual DNA samples from 68 cases with moderate or severe COPD and 66 controls matched for age, gender and smoking. Cases and controls were selected from the Obstructive Lung Disease in Northern Sweden (OLIN) studies. Results: In total, 37 genetic variants showed association with COPD (p < 0.05, uncorrected). Several variants previously discovered to be associated with COPD from genetic genome-wide analysis studies were replicated using our sample. Two high-risk variants were followed up for functional characterization in a large eQTL mapping study of 1,111 human lung specimens. The C allele of a synonymous variant, rs8040868, predicting a p.(S45=) in the gene for cholinergic receptor nicotinic alpha 3 (CHRNA3) was associated with COPD (p = 8.8 × 10⁻³). This association remained (p = 0.003 and OR = 1.4, 95% CI 1.1-1.7) when analysing all available cases and controls in OLIN (n = 1,534). The rs8040868 variant is in linkage disequilibrium with rs16969968, previously associated with COPD and altered expression of the CHRNA5 gene. A follow-up analysis for detection of expression quantitative trait loci revealed that rs8040868-C was significantly associated with decreased expression of the nearby gene cholinergic receptor, nicotinic, alpha 5 (CHRNA5) in lung tissue. Conclusion: Our data replicate previous results suggesting CHRNA5 as a candidate gene for COPD and rs8040868 as a risk variant for the development of COPD in the Swedish population.