Browsing by Subject "bioinformatics"

Now showing items 1-20 of 32
  • Peltola, Sanni (Helsingin yliopisto, 2019)
    In recent decades, ancient DNA recovered from old and degraded samples, such as bones and fossils, has presented novel prospects in the fields of genetics, archaeology and anthropology. In Finland, ancient DNA research is constrained by the poor preservation of bones: they are quickly degraded by acidic soils, limiting the age of DNA that can be recovered from physical remains. However, some soil components can bind DNA and thus protect the molecules from degradation. Ancient DNA from soils and sediments has previously been used to reconstruct paleoenvironments, to study ancient parasites and diet and to demonstrate the presence of a species at a given site, even when there are no visible fossils present. In this pilot study, I explored the potential of archaeological sediments as an alternative source of ancient human DNA. I collected sediment samples from five Finnish Neolithic Stone Age (6,000–4,000 years ago) settlement sites, located in woodland. In addition, I analysed a lakebed sample from a submerged Mesolithic (10,000–7,000 years ago) settlement site, and a soil sample from an Iron Age burial with bones present to compare DNA yields between the two materials. Soil samples were converted into Illumina sequencing libraries and enriched for human mtDNA. I analysed the sequencing data with a customised metagenomics-based bioinformatic analysis workflow. I also tested program performance with simulated data. The results suggested that human DNA preservation in Finnish archaeological sediments may be poor or very localised. I detected small amounts of human mtDNA in three Stone Age woodland settlement sites and a submerged Mesolithic settlement site. One Stone Age sample exhibited terminal damage patterns suggestive of DNA decay, but the time of deposition is difficult to estimate. Interestingly, no human DNA was recovered from the Iron Age burial soil, suggesting that body decomposition may not serve as a significant source of sedimentary ancient DNA. 
Additional complications may arise from the high inhibitor content of the soil and the abundance of microbial and other non-human DNA present in environmental samples. In the future, a more refined sampling approach, such as targeting microscopic bone fragments, could be a strategy worth trialling.
  • Arumilli, Meharji (Helsingin yliopisto, 2020)
    Since the annotation of the dog genome in 2005, dogs have emerged as excellent models of human disease. Many disease associations of variant alleles in homologous genes have been discovered in dogs, providing new therapeutic candidates for the corresponding human diseases as well as establishing preclinical large animal models. Significant progress in genetic studies has followed the move from microarrays to next generation sequencing, or the combination of the two approaches. This transition has required the development and application of novel bioinformatic approaches to facilitate genetics and genomics. This thesis established a variety of bioinformatic approaches and tools to facilitate canine genomics and disease gene discovery. In study I, the successful development and application of the bioinformatic pipelines resulted in the identification of causal variants of three new disease genes, SLC37A2, SCARF2 and FAM20C, of relevance to Caffey disease, van den Ende Gupta syndrome (VDEGS) and Raine syndrome in humans, respectively. In study II, novel genomic content was discovered through de novo assembly of genomic reads of Border Collies that did not map to the current canine genome reference. This study revealed sequences that filled the existing gaps in the reference genome and identified gene models that were missing from the reference. Overall, this study reveals novel genomic content to facilitate the improvement of the upcoming canine genome reference for disease variant allele discovery in candidate genes. In study III, a novel bioinformatic tool, webGQT, was successfully developed and piloted to handle and filter large amounts of next generation sequencing (NGS) data. This tool is intended to let non-bioinformatics users mine genetic information from millions to billions of variants among thousands of genomes. This tool has been successfully utilized in various disease genetics projects. 
In summary, new bioinformatic approaches have been successfully developed and applied in this thesis to facilitate both the transition of the field to the NGS era and disease gene discovery and genomics in dogs. The findings of this thesis have implications for veterinary research, diagnostics and human medicine, with novel candidate genes in three rare disorders.
  • Koski, Jessica (Helsingin yliopisto, 2021)
    Acute lymphoblastic leukemia (ALL) is a hematological malignancy that is characterized by uncontrolled proliferation and blocked maturation of lymphoid progenitor cells. It is divided into B- and T-cell types, both of which have multiple subtypes defined by different somatic genetic changes. Germline predisposition has also been found to play an important role in multiple hematological malignancies, and several germline variants that contribute to ALL risk have already been identified in pediatric and familial settings. Only a few studies have included adult ALL patients, but findings in acute myeloid leukemia, where germline predisposition was found to concern adult patients as well, have increased interest in studying adults. The prognosis of adult ALL patients is much worse than that of pediatric patients, and many still lack clear genetic markers for diagnosis. Thus, identifying genetic lesions affecting ALL development is important in order to improve treatments and prognosis. Germline studies can provide additional insight into the predisposition and development of ALL when there are no clear somatic biomarkers. Single nucleotide variants are usually of interest when identifying biomarkers from the genome, but structural variants can also be studied. Their coverage of the genome is higher than that of single nucleotide variants, which makes them suitable candidates for exploring association with prognosis. Copy number changes can be detected from next generation sequencing data, although detection specificity and sensitivity vary considerably between software tools. The current approach is to identify the most likely regions with copy number change by using multiple tools and to later validate the findings experimentally. In this thesis, the copy number changes in germline samples of 41 adult ALL patients were analyzed using ExomeDepth, CODEX2 and CNVkit.
  • Qian, Kui (Helsingin yliopisto, 2013)
    Human papillomaviruses (HPVs) form a large family among double stranded DNA (dsDNA) viruses, some types of which are the major causes of cervical cancer. HPV 16 is widely distributed and the most common high-risk HPV type, and approximately half of cervical cancers are associated with HPV type 16. Of the three HPV 16 encoded oncogenes, the function of E5 in regulating viral replication and pathogenesis is less well understood than that of E6 and E7. MicroRNAs (miRNAs) are important small noncoding RNA molecules that regulate a wide range of cellular functions. Some dsDNA viruses, such as SV40 and human polyomaviruses, have functional viral miRNAs. The functional and molecular similarities among dsDNA viruses suggest that HPV could encode viral miRNAs, which have not been validated thus far. The aim of this thesis was to study the functions of the host miRNAs in HPV 16 oncogene induction and identify novel HPV encoded viral miRNAs. We utilized microarray technology to investigate the effect of E5 on host miRNA and mRNA expression 0–96 hours after E5 induction in a cell line model. Among the differentially expressed cellular miRNAs, we further validated the expression of hsa-mir-146a, hsa-mir-203, and hsa-mir-324-5p and some of their target genes in a time series of 96 hours of E5 induction. Our results indicate that HPV E5 expression has an impact through complex regulatory patterns of gene expression in the host cells, and some of those genes are regulated by the E5 protein. Second, high throughput sequencing was used to identify virus-encoded miRNAs. We prepared small RNA sequencing libraries from ten HPV-associated cervical lesions, including cancer and two HPV-harboring cell lines. For more flexible analysis of the sequencing data, we developed miRSeqNovel, an R based workflow for miRNA sequencing data analysis, and applied it to the sequencing data to predict putative viral miRNAs, discovering nine putative papillomavirus encoded miRNAs. 
Viral miRNA validation was performed for five candidates, four of which were successfully validated by qPCR from cervical tissue samples and cell lines: two were encoded by HPV 16, one by HPV 38, and one by HPV 68. The expression of two HPV 16 miRNAs was further supported by in situ hybridization, and colocalization with p16INK4A staining, a marker of cervical neoplasia. Prediction of cellular target genes of HPV 16 encoded miRNAs suggests that they may play a role in cell cycle, immune functions, cell adhesion and migration, development and cancer, which were also among the functions targeted by the E5 regulated host cell mRNA and miRNAs. Two putative viral target sites for the two validated HPV 16 miRNAs were mapped to the E5 gene, one in the E1 gene, two in the L1 gene, and one in the long control region (LCR).
  • Reinikka, Siiri (Helsingin yliopisto, 2020)
    Endometrial polyps are one of the most common benign uterine lesions, affecting approximately 10% of all adult women. While endometrial polyps have a high prevalence, their molecular pathogenesis and genetic background are largely undefined. Accordingly, the aim of this thesis was to characterize the somatic mutational landscape of endometrial polyps – to identify mutations in cancer-associated genes, and to identify mutational signatures contributing towards the somatic mutational spectrum. The present study was conducted using whole exome sequencing of 23 endometrial polyps and 18 matching normal blood samples. Mutational signature analysis was conducted using MutationalPatterns and SigProfiler. Endometrial polyps were found to carry a varying number of somatic mutations in their exomes, most of them present at a low allelic fraction. Moreover, 43% (10/23) of the polyps were found to carry one to four cancer-associated mutations, including mutations in genes such as PIK3CA 17% (4/23), KRAS 13% (3/23) and ERBB1 9% (2/23), which are well-established cancer driver genes. Cancer-associated mutational signatures do not have a notable contribution towards the somatic mutational spectrum of endometrial polyps. However, a novel signature, ‘signature B’, characterized by T>G mutations, was found to affect a subset of polyp samples. To conclude, the whole exome sequencing of endometrial polyps revealed several mutations in cancer-associated genes and a novel mutational signature, which may contribute to the development of these benign tumours. However, further research is required to confirm and validate the novel signature, and to define the genetic alterations leading to polyp pathogenesis.
  • Jokinen, Vilja (Helsingin yliopisto, 2021)
    Uterine leiomyomas are benign smooth muscle tumors arising in the myometrium. They are very common, and the incidence in women is up to 70% by the age of 50. Usually, leiomyomas are asymptomatic, but some patients suffer from various symptoms, including abnormal uterine bleeding, pelvic pain, urinary frequency, and constipation. Uterine leiomyomas may also cause subfertility. Genetic alterations in the known driver genes MED12, HMGA2, FH, and COL4A5-6 account for about 90% of all leiomyomas. These initiator mutations result in distinct molecular subtypes of leiomyomas. The majority of whole-genome sequencing (WGS) studies analyzing chromosomal rearrangements have been performed using fresh frozen tissues. One aim of this study was to examine the feasibility of detecting chromosomal rearrangements from WGS data of formalin-fixed paraffin embedded (FFPE) tissue samples. Previous results from 3’RNA-sequencing data revealed a subset of uterine leiomyoma samples that displayed gene expression patterns similar to those of HMGA2-positive leiomyomas but were previously classified as HMGA2-negative by immunohistochemistry. According to 3’RNA-sequencing, all these tumors overexpressed PLAG1, and some of them overexpressed HMGA2 or HMGA1. Thus, the second aim of this study was to identify driver mutations in these leiomyoma samples using WGS. In this study, WGS was performed for 16 leiomyoma and 4 normal myometrium FFPE samples. The following bioinformatic tools were used to detect somatic alterations at multiple levels: Delly for chromosomal rearrangements, CNVkit for copy-number alterations, and Mutect for point mutations and small insertions and deletions. Sanger sequencing was used to validate findings. The quality of WGS data obtained from FFPE samples was sufficient for detecting chromosomal rearrangements, although the number of calls was quite high. We identified recurrent chromosomal rearrangements affecting HMGA2, HMGA1, and PLAG1, mutually exclusively. 
One sample did not harbor any of these rearrangements, but a deletion in COL4A5-6 was found. Biallelic loss of DEPDC5 was seen in one sample with an HMGA2 rearrangement and in another sample with an HMGA1 rearrangement. HMGA2 and HMGA1 encode architectural chromatin proteins regulating several transcription factors. It is well-known that HMGA2 upregulates PLAG1 expression. The structure and functionality of HMGA2 and HMGA1 are very similar and conserved, so HMGA1 may also regulate PLAG1 expression. The results of this study suggest that HMGA2 and HMGA1 drive tumorigenesis by regulating PLAG1, and thus, PLAG1 rearrangements resulting in PLAG1 overexpression can also drive tumorigenesis. A few samples, previously classified as HMGA2-negative by immunohistochemistry, were found to harbor HMGA2 rearrangements, suggesting that the proportion of HMGA2-positive leiomyomas might be underestimated in previous studies using immunohistochemistry. Only one study has previously reported biallelic inactivation of DEPDC5 in leiomyomas, and the results of this study support the idea that biallelic loss of DEPDC5 is a secondary driver event in uterine leiomyomas.
  • Pljusnin, Ilja (Helsingin yliopisto, 2020)
    The speed of DNA and RNA sequencing has long ago surpassed the capacity of laboratories to assign function to these sequences by direct experiment. Fortunately, function and other information can be effectively transferred to novel data from previously accumulated knowledge by sequence homology. This has resulted in the development of hundreds of novel homology-based methods. However, the tendency of method developers to be overoptimistic about their own results, biases in the evaluation metrics used to rank methods, inconsistency between different rankings and evaluation metrics, and the misplaced popularity of methods relative to their performance all indicate that, in many cases, clear knowledge of the comparative performance of different methods is lacking. This has two main consequences. First, researchers use suboptimal tools. Second, method development may go astray because the merits used for guiding method optimization are biased or unclear. To avoid these difficulties, further research is needed into the methodology of evaluation and comparative studies. One core approach for transferring function by sequence homology is to create a multiple sequence alignment (MSA) that represents a given group of similar sequences. The resulting alignment can be applied to annotate novel sequences using profile hidden Markov models (HMMs), to create phylogenetic trees or to compare structural features. The application of MSAs and profile HMMs for genome annotation was explored in publication (I). Creating MSAs has been addressed by a vast field of research; however, there is a lack of independent comparative studies and no comparative studies of alignment strategies. In publication (II), a novel modular MSA aligner was implemented to aid in comparative evaluation of different MSA strategies. Different MSA strategies were then compared to each other and to the state-of-the-art MSA software on three benchmark databases. 
Another core approach has been to combine homology searches with assignment of annotation terms from a controlled vocabulary such as the Gene Ontology (GO). Hundreds of methods that assign GO terms to novel sequences have been introduced. The research community has also invested in the objective evaluation of these methods via third party competitions. However, the evaluation metrics and merits used in these competitions are still under active debate and need further research and development. In publication (III), a novel framework was introduced for the development of unbiased high-quality evaluation metrics. By testing 37 variations of popular metrics, our approach revealed strong differences between metrics, a list of clearly biased metrics, and a list of high-quality metrics that are well suited for the evaluation of GO annotations. In summary, this thesis presents novel frameworks and implementation platforms for comparative evaluation of two important classes of homology-based methods: MSA aligners and GO sequence classifiers. These results will be instrumental for developing more accurate MSA aligners, for eliminating many forms of bias inherent in contemporary evaluation protocols, for producing informative method rankings for non-specialist users and for guiding method development towards merits that truly reflect the utility of the designed tools.
  • Ta, Hung (Helsingin yliopisto, 2012)
    Living systems, which are composed of biological components such as molecules, cells, organisms or entire species, are dynamic and complex. Their behaviors are difficult to study with respect to the properties of individual elements. To study their behaviors, we use quantitative techniques in the "omic" fields such as genomics, bioinformatics and proteomics to measure the behavior of groups of interacting components, and we use mathematical and computational modeling to describe and predict their dynamical behavior. The first step in the understanding of a biological system is to investigate how its individual elements interact with each other. This step consists of drawing a static wiring diagram that connects the individual parts. Experimental techniques are designed to observe interactions among the biological components in the laboratory, while computational approaches are designed to predict interactions among the individual elements based on their properties. In the first part of this thesis, we present techniques for network inference that are particularly targeted at protein-protein interaction networks. These techniques include comparative genomics, structure-based and biological context methods, and integrated frameworks. We evaluate and compare the prediction methods that have been most often used for domain-domain interactions and we discuss the limitations of the methods and data resources. We introduce the concept of the Enhanced Phylogenetic Tree, which is a new graphical presentation of the evolutionary history of protein families; then, we propose a novel method for assigning functional linkages to proteins. This method was applied to predicting both human and yeast protein functional linkages. The next step is to obtain insights into the dynamical aspects of the biological systems. 
One of the overarching goals of systems biology is to understand the emergent properties of living systems, i.e., to understand how the individual components of a system come together to form distinct, collective and interactive properties and functions. The emergent properties of a system are neither found in nor directly deducible from the lower-level properties of that system. An example of an emergent property is synchronization, a dynamical state of complex network systems in which the individual components of the systems behave coherently, almost in unison. In the second part of the thesis, we apply computational modeling to mimic and simplify real-life complex systems. We focus on clarifying how the network topology determines the initiation and propagation of synchronization. A simple but efficient method is proposed to reconstruct network structures from functional behaviors for oscillatory systems such as the brain. We study the feasibility of network reconstruction systematically for different regimes of coupling and for different network topologies. We utilize the Kuramoto model, an interacting system of oscillators, which is simple but relevant enough to address our questions.
  • Gaye, Amadou; Marcon, Yannick; Isaeva, Julia; LaFlamme, Philippe; Turner, Andrew; Jones, Elinor M.; Minion, Joel; Boyd, Andrew W.; Newby, Christopher J.; Nuotio, Marja-Liisa; Wilson, Rebecca; Butters, Oliver; Murtagh, Barnaby; Demir, Ipek; Doiron, Dany; Giepmans, Lisette; Wallace, Susan E.; Budin-Ljosne, Isabelle; Schmidt, Carsten Oliver; Boffetta, Paolo; Boniol, Mathieu; Bota, Maria; Carter, Kim W.; deKlerk, Nick; Dibben, Chris; Francis, Richard W.; Hiekkalinna, Tero; Hveem, Kristian; Kvaloy, Kirsti; Millar, Sean; Perry, Ivan J.; Peters, Annette; Phillips, Catherine M.; Popham, Frank; Raab, Gillian; Reischl, Eva; Sheehan, Nuala; Waldenberger, Melanie; Perola, Markus; van den Heuvel, Edwin; Macleod, John; Knoppers, Bartha M.; Stolk, Ronald P.; Fortier, Isabel; Harris, Jennifer R.; Woffenbuttel, Bruce H. R.; Murtagh, Madeleine J.; Ferretti, Vincent; Burton, Paul R. (2014)
  • Taleb, Kawther; Lauridsen, Eva; Daugaard-Jensen, Jette; Nieminen, Pekka; Kreiborg, Sven (2018)
    Background Dentinogenesis imperfecta (DI) is a rare debilitating hereditary disorder affecting dentin formation and causing loss of the overlying enamel. Clinically, DI sufferers have a discolored and weakened dentition with an increased risk of fracture. The aims of this study were to assess genotype-phenotype findings in three families with DI-II with special reference to mutations in the DSPP gene and clinical, histological, and imaging manifestations. Methods Nine patients participated in the study (two from family A, four from family B, and three from family C). Buccal swab samples were collected from all participants and extracted for genomic DNA. Clinical and radiographic examinations had been performed longitudinally, and the dental status was documented using photographic images. Four extracted and decalcified tooth samples were prepared for histological analysis to assess dysplastic manifestations in the dentin. Optical coherence tomography (OCT) was applied to study the health of enamel tissue from in vivo images, and the effect of the mutation on the function and structure of the DSPP gene was analyzed using bioinformatics software programs. Results The direct DNA sequence analysis revealed three distinct mutations, one of which was a novel finding. The mutations caused dominant phenotypes, presumably by interference with signal peptide processing and protein secretion. The clinical and radiographic disturbances in the permanent dentition indicated interfamilial variability in DI-II manifestations; however, no significant intrafamilial variability was observed. Conclusion The different mutations in the DSPP gene were accompanied by distinct phenotypes. Enamel defects suggested a deficit in preameloblast function during the early stages of amelogenesis.
  • Parnanen, Katariina M. M.; Hultman, Jenni; Markkanen, Melina; Satokari, Reetta; Rautava, Samuli; Lamendella, Regina; Wright, Justin; McLimans, Christopher J.; Kelleher, Shannon L.; Virta, Marko P. (2022)
    Background Infants are at a high risk of acquiring fatal infections, and their treatment relies on functioning antibiotics. Antibiotic resistance genes (ARGs) are present in high numbers in antibiotic-naive infants' gut microbiomes, and infant mortality caused by resistant infections is high. The role of antibiotics in shaping the infant resistome has been studied, but there is limited knowledge on other factors that affect the antibiotic resistance burden of the infant gut. Objectives Our objectives were to determine the impact of early exposure to formula on the ARG load in neonates and infants born either preterm or full term. Our hypotheses were that diet causes a selective pressure that influences the microbial community of the infant gut, and formula exposure would increase the abundance of taxa that carry ARGs. Methods Cross-sectionally sampled gut metagenomes of 46 neonates were used to build a generalized linear model to determine the impact of diet on ARG loads in neonates. The model was cross-validated using neonate metagenomes gathered from public databases using our custom statistical pipeline for cross-validation. Results Formula-fed neonates had higher relative abundances of opportunistic pathogens such as Staphylococcus aureus, Staphylococcus epidermidis, Klebsiella pneumoniae, Klebsiella oxytoca, and Clostridioides difficile. The relative abundance of ARGs carried by gut bacteria was 69% higher in the formula-receiving group (fold change, 1.69; 95% CI: 1.12-2.55; P = 0.013; n = 180) compared to exclusively human milk-fed infants. The formula-fed infants also had significantly less typical infant bacteria, such as Bifidobacteria, that have potential health benefits. 
Conclusions The novel finding that formula exposure is correlated with a higher neonatal ARG burden provides a basis for clinicians to consider feeding mode, in addition to antibiotic use, during the first months of life to minimize the proliferation of antibiotic-resistant gut bacteria in infants.
  • BEEHIVE Collaboration; Wymant, Chris; Blanquart, Francois; Golubchik, Tanya; Gall, Astrid; Bakker, Margreet; Bezemer, Daniela; Croucher, Nicholas J.; Hall, Matthew; Hillebregt, Mariska; Ong, Swee Hoe; Ratmann, Oliver; Albert, Jan; Bannert, Norbert; Fellay, Jacques; Fransen, Katrien; Gourlay, Annabelle; Grabowski, M. Kate; Gunsenheimer-Bartmeyer, Barbara; Gunthard, Huldrych F.; Kivelä, Pia; Kouyos, Roger; Laeyendecker, Oliver; Liitsola, Kirsi; Meyer, Laurence; Porter, Kholoud; Ristola, Matti; van Sighem, Ard; Berkhout, Ben; Cornelissen, Marion; Kellam, Paul; Reiss, Peter; Fraser, Christophe (2018)
    Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems, we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. 
We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from
  • Koivunen, Sampo (Helsingin yliopisto, 2019)
    The Oxford Nanopore MinION is a third generation sequencer utilizing nanopore sequencing technology. The nanopore sequencing method allows sequencing of either DNA or RNA strands as they pass through the membrane-embedded nanopores. By measuring the corresponding fluctuations in the ion flow passing through the nanopore, the passing strands can be sequenced directly without additional secondary reactions or measurements. MinION sequencing has characteristics distinctly different from those of the market leaders of the sequencing field. The small form factor of the device further sets it apart from the alternatives. However, the technology has only been on the market for a very short time, and thus very few gold standards regarding its capabilities or usage have been established. This thesis describes our experiences testing the capabilities of the MinION sequencer both before its commercial release, as a part of a special early access program, and in our continued experiments with the device following its commercial launch. The main results of this study include successfully sequencing and aligning E. coli and human gDNA samples to their respective reference genomes. Using our sequencing and analysis pipeline specifically tuned to the MinION, we were able to sequence the entire E. coli genome on a single MinION flow cell with an average depth of around 180. Over the course of the thesis project, the MinION sequencing protocol was evaluated and optimized in order to determine whether it has the potential to achieve our ultimate goal of reliably sequencing the previously inaccessible regions of the human genome. The possibility of augmenting the sequencing protocol by adding pre-sequencing target enrichment was also explored. Ultimately, we were able to confirm that the MinION sequencer can be used to sequence long DNA fragments from a multitude of sample types. 
The majority of the produced reads could successfully be aligned against a reference genome. However, the limited yield and sequencing quality of a single experiment does limit the applicability of the method for more complicated genomic studies. These issues can be addressed with various techniques, chiefly target enrichment, but adapting such methods into the sequencing pipeline has its own challenges.
  • Arsin, Sila (Helsingin yliopisto, 2019)
    Mycosporines and mycosporine-like amino acids (MAAs) are small molecules that provide UV protection in a broad range of organisms. Cyanobacteria produce a diverse set of MAA chemical variants, many of which are glycosylated. Even though the biosynthetic pathway for the production of a common cyanobacterial MAA, shinorine, is known, the biosynthetic origins of the glycosylated variants remain unclear. In this work, bioinformatics analyses were performed to catalogue the genetic diversity encoded in the MAA gene clusters in cyanobacterial genomes and identify a set of enzymes that might be involved in MAA biosynthesis. A total of 211 cyanobacterial genomes were found to contain the MAA gene cluster, with six containing glycosyltransferase genes within the gene cluster. Afterwards, 38 strains from the University of Helsinki Culture Collection were tested for the production of MAAs using QTOF-LC/MS analyses. This resulted in the identification of several novel glycosylated MAA chemical variants from Nostoc sp. UHCC 0302, which contained a 7.4 kb MAA biosynthetic gene cluster consisting of 7 genes, including two for glycosyltransferases and one for dioxygenase. Heterologous expression of this gene cluster in Escherichia coli TOP10 resulted in the production of a glycosylated porphyra-334 variant of 509 m/z by the transformant cells, showing that colanic acid biosynthesis glycosyltransferases can catalyse the addition of hexose to MAAs. These results suggested a biosynthetic route for the production of glycosylated MAAs in cyanobacteria and allowed us to propose a putative role for dioxygenases in MAA biosynthesis. Further characterization of additional glycosyltransferases is necessary to improve our understanding of glycosylated MAA biosynthesis and functionality, which could be applied to large scale processes and be used in industrial applications.
  • Mäklin, Tommi; Kallonen, Teemu; David, Sophia; Boinett, Christine J.; Pascoe, Ben; Méric, Guillaume; Aanensen, David M.; Feil, Edward J.; Baker, Stephen; Parkhill, Julian; Sheppard, Samuel K.; Corander, Jukka; Honkela, Antti (2021)
    Determining the composition of bacterial communities beyond the level of a genus or species is challenging because of the considerable overlap between genomes representing close relatives. Here, we present the mSWEEP pipeline for identifying and estimating the relative sequence abundances of bacterial lineages from plate sweeps of enrichment cultures. mSWEEP leverages biologically grouped sequence assembly databases, applying probabilistic modelling, and provides controls for false positive results. Using sequencing data from major pathogens, we demonstrate significant improvements in lineage quantification and detection accuracy. Our pipeline facilitates investigating cultures comprising mixtures of bacteria, and opens up a new field of plate sweep metagenomics.
  • Wang, Xiao-Jun; Gao, Jing; Wang, Zhuo; Yu, Qin (2021)
    Background: Lung adenocarcinoma (LUAD) is a common lung cancer with high mortality, and microRNAs (miRNAs) play a vital role in its regulation. Multiple messenger RNAs (mRNAs) involved in LUAD tumorigenesis and progression may be regulated by miRNAs. However, the miRNA-mRNA regulatory network involved in LUAD has not been fully elucidated. Methods: Differentially expressed miRNAs and mRNAs were derived from The Cancer Genome Atlas (TCGA) dataset for tissue samples and from our microarray data for plasma (GSE151963). Common differentially expressed (Co-DE) miRNAs were then obtained by intersecting the two datasets. Co-DEmRNAs were confirmed as the overlap between the targeted mRNAs and the DEmRNAs in TCGA. A miRNA-mRNA regulatory network was constructed using Cytoscape. The top five miRNAs by degree in the network were identified as hub miRNAs. The functions and signaling pathways associated with the hub miRNA-targeted genes were revealed through Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. The key mRNAs in the protein-protein interaction (PPI) network were identified using the STRING database and CytoHubba. Survival analyses were performed using Gene Expression Profiling Interactive Analysis (GEPIA). Results: The miRNA-mRNA regulatory network consists of 19 Co-DEmiRNAs and 760 Co-DEmRNAs. Five miRNAs in the network (miR-539-5p, miR-656-3p, miR-2110, let-7b-5p, and miR-92b-3p) were identified as hub miRNAs by degree (>100). A total of 677 Co-DEmRNAs were targets of the five hub miRNAs, and their roles were examined in the GO and KEGG pathway analyses (inclusion criteria: 836 and 48, respectively). The PPI network and Cytoscape analyses revealed that the top ten key mRNAs were NOTCH1, MMP2, IGF1, KDR, SPP1, FLT1, HGF, TEK, ANGPT1, and PDGFB. SPP1 and HGF emerged as hub genes through survival analysis. 
High SPP1 expression indicated poor survival, whereas HGF was positively associated with survival outcomes in LUAD. Conclusion: This study investigated a miRNA-mRNA regulatory network associated with LUAD, exploring the hub miRNAs and the potential functions of mRNAs in the network. These findings contribute to identifying new prognostic markers and therapeutic targets for LUAD patients in clinical settings.
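The degree-based selection of hub miRNAs described in the abstract above can be illustrated with a minimal sketch. This is not the authors' Cytoscape workflow: the function name `hub_mirnas`, its parameters, and the toy edge list are assumptions made purely for illustration, and the pairings are invented. Degree here means the number of distinct target mRNAs connected to a miRNA.

```python
from collections import defaultdict


def hub_mirnas(edges, top_n=5, min_degree=0):
    """Rank miRNAs by degree (number of distinct target mRNAs).

    `edges` is an iterable of (miRNA, mRNA) pairs. Only miRNAs whose
    degree is strictly greater than `min_degree` are kept, and at most
    `top_n` of them are returned as (miRNA, degree) tuples.
    """
    targets = defaultdict(set)
    for mirna, mrna in edges:
        targets[mirna].add(mrna)  # a set ignores duplicate edges
    # Sort by descending degree, then by name so ties are deterministic.
    ranked = sorted(targets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [(m, len(t)) for m, t in ranked if len(t) > min_degree][:top_n]


# Hypothetical toy network; names and pairings are invented.
edges = [
    ("miR-A", "GENE1"), ("miR-A", "GENE2"), ("miR-A", "GENE3"),
    ("miR-B", "GENE1"), ("miR-B", "GENE2"),
    ("miR-C", "GENE4"),
    ("miR-A", "GENE1"),  # duplicate edge, counted once
]

print(hub_mirnas(edges, top_n=2))
```

With the settings reported in the abstract, the call would correspond to `top_n=5` and `min_degree=100`; the real edge list would come from the network built from Co-DEmiRNAs and their target Co-DEmRNAs.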
  • Facciotto, Chiara (Helsingin yliopisto, 2021)
    Overcoming drug resistance in cancer is one of the most pressing issues in oncology. The last century saw a dramatic increase in the discovery of new cancer therapies, so much so that chemotherapeutic agents and immunotherapies are now, alone or in combination, the backbone of treatment for many cancers. Despite the increased rate of treatment success brought by these regimens, cancer patients can become resistant to these drugs. This leads to disease relapse, hindering patient survival. Drug resistance remains the primary cause of death in most advanced-stage cancer patients. The molecular mechanisms responsible for the development of a resistance phenotype in cancer cells are complex and include both genetic and epigenetic alterations. Since drug resistance is a multifactorial phenomenon, we used a systems biology approach to investigate it on different fronts. Specifically, we developed a high-throughput drug screening method to test new drug combinations, identifying epigenetic inhibitors able to sensitize lymphoma cells to doxorubicin. We also implemented a bioinformatic pipeline which combines multiple omics data to identify genes and pathways driving platinum response across multiple cancers. We then developed a method to compute differential methylation between cancer samples with varying and unknown tumor purity, which we used to investigate DNA methylation changes linked to drug resistance in ovarian cancer and lymphoma. Finally, we created a workflow management system to build complex bioinformatic pipelines and aid researchers in the analysis of high-throughput biomedical data. By combining laboratory biology experiments and computational analyses, we gained a broader understanding of the cellular mechanisms behind immunochemotherapy failure. Moreover, we were able to identify novel biomarkers associated with platinum response in multiple cancers, as well as new drug combinations able to overcome immunochemotherapy resistance in lymphoma cells. 
The in vitro and in silico methods presented in this thesis can not only assist researchers in the cancer field, but are broadly applicable to other fields of biomedical research. Overall, this work is an important stepping stone in both understanding and overcoming drug resistance in cancer, and has great potential to improve outcomes for cancer patients in the future.
  • Scheinin, Ilari (Helsingfors universitet, 2011)
    Ewing sarcoma is an aggressive and poorly differentiated malignancy of bone and soft tissue. It primarily affects children, adolescents, and young adults, with a slight male predominance. It is characterized by a translocation between chromosomes 11 and 22 resulting in the EWSR1-FLI1 fusion transcription factor. The aim of this study is to identify putative Ewing sarcoma target genes through an integrative analysis of three microarray data sets. Array comparative genomic hybridization is used to measure changes in DNA copy number, and analyzed to detect common chromosomal aberrations. mRNA and miRNA microarrays are used to measure expression of protein-coding and miRNA genes, and these results are integrated with the copy number data. Chromosomal aberrations typically also contain bystanders in addition to the driving tumor suppressors and oncogenes, and integration with expression helps to identify the true targets. Correlation between expression of miRNAs and their predicted target mRNAs is also evaluated to assess the results of post-transcriptional miRNA regulation on mRNA levels. The highest frequencies of copy number gains were identified in chromosome 8, 1q, and X. Losses were most frequent in 9p21.3, which also showed an enrichment of copy number breakpoints relative to the rest of the genome. Copy number losses in 9p21.3 were found to have a statistically significant effect on the expression of MTAP, but not on CDKN2A, which is a known tumor suppressor in the same locus. MTAP was also down-regulated in the Ewing sarcoma cell lines compared to mesenchymal stem cells. Genes exhibiting elevated expression in association with copy number gains and up-regulation compared to the reference samples included DCAF7, ENO2, MTCP1, and STK40. Differentially expressed miRNAs were detected by comparing Ewing sarcoma cell lines against mesenchymal stem cells. 
21 up-regulated and 32 down-regulated miRNAs were identified, including miR-145, which has been previously linked to Ewing sarcoma. The EWSR1-FLI1 fusion gene represses miR-145, which in turn targets FLI1, forming a mutually repressive feedback loop. In addition to showing higher expression linked to copy number gains and relative to mesenchymal stem cells, STK40 was also found to be a target of four different miRNAs that were all down-regulated in Ewing sarcoma cell lines compared to the reference samples. SLCO5A1 was identified as the only up-regulated gene within a frequently gained region in chromosome 8. This region was gained in over 90 % of the cell lines, and also with a higher frequency than the neighboring regions. In addition, SLCO5A1 was found to be a target of three miRNAs that were down-regulated compared to the mesenchymal stem cells.
  • Faraji, Sahar; Heidari, Parviz; Amouei, Hoorieh; Filiz, Ertugrul; Poczai, Peter (2021)
    Various kinds of primary metabolism in plants are modulated through sulfate metabolism, and sulfotransferases (SOTs), which are engaged in sulfur metabolism, catalyze sulfonation reactions. In this study, a genome-wide approach was utilized for the recognition and characterization of SOT family genes in the significant nutritional crop potato (Solanum tuberosum L.). Twenty-nine putative StSOT genes were identified in the potato genome and were mapped onto the nine S. tuberosum chromosomes. The protein motif structure revealed two highly conserved 5′-phosphosulfate-binding (5′ PSB) regions and a 3′-phosphate-binding (3′ PB) motif that are essential for sulfotransferase activities. The protein-protein interaction networks also revealed an interesting interaction between SOTs and other proteins, such as PRTase, APS-kinase, protein phosphatase, and APRs, involved in sulfur compound biosynthesis and the regulation of flavonoid and brassinosteroid metabolic processes. This suggests the importance of sulfotransferases for proper potato growth and development and stress responses. Notably, homology modeling of StSOT proteins and docking analysis of their ligand-binding sites revealed the presence of proline, glycine, serine, and lysine in their active sites. An expression assay of StSOT genes via potato RNA-Seq data suggested engagement of these gene family members in plants' growth and extension and responses to various hormones and biotic or abiotic stimuli. Our predictions may be informative for the functional characterization of the SOT genes in potato and other nutritional crops.
  • Inouye, Michael; Kettunen, Johannes; Soininen, Pasi; Silander, Kaisa; Ripatti, Samuli; Kumpula, Linda S.; Hämäläinen, Eija; Jousilahti, Pekka; Kangas, Antti J.; Männistö, Satu; Savolainen, Markku J.; Jula, Antti; Leiviskä, Jaana; Palotie, Aarno; Salomaa, Veikko; Perola, Markus; Ala-Korpela, Mika; Peltonen, Leena (2010)