Browsing by Subject "ALGORITHM"

Sort by: Order: Results:

Now showing items 1-20 of 49
  • Murillo-Ramos, Leidys; Brehm, Gunnar; Sihvonen, Pasi; Hausmann, Axel; Holm, Sille; Reza Ghanavi, Hamid; Õunap, Erki; Truuverk, Andro; Staude, Hermann; Friedrich, Egbert; Tammaru, Toomas; Wahlberg, Niklas (2019)
    Our study aims to investigate the relationships of the major lineages within the moth family Geometridae, with a focus on the poorly studied Oenochrominae-Desmobathrinae complex, and to translate some of the results into a coherent subfamilial and tribal level classification for the family. We analyzed a molecular dataset of 1,206 Geometroidea terminal taxa from all biogeographical regions comprising up to 11 molecular markers that includes one mitochondria) (COI) and 10 protein-coding nuclear gene regions (wingless, ArgK, MDH, RpS5, GAPDH, IDH, Ca-ATPase, Nex9, EF-1 alpha, CAD). The molecular data set was analyzed using maximum likelihood as implemented in IQ-TREE and RAxML. We found high support for the subfamilies Larentiinae, Geometrinae and Ennominae in their traditional scopes. Sterrhinae becomes monophyletic only if Ergavia Walker, Ametris Hubner and Macrotes Westwood, which are currently placed in Oenochrominae, are formally transferred to Sterrhinae. Desmobathrinae and Oenochrominae are found to be polyphyletic. The concepts of Oenochrominae and Desmobathrinae required major revision and, after appropriate rearrangements, these groups also form monophyletic subfamily-level entities. Oenochrominae s.str. as originally conceived by Guenee is phylogenetically distant from Epidesmia and its close relatives. The latter is hereby described as the subfamily Epidesmiinae Murillo-Ramos, Sihvonen & Brehm, subfam. nov. Epidesmiinae are a lineage of "slender-bodied Oenochrominae" that include the genera Ecphyas Turner, Systatica Turner, Adeixis Warren, Dichromodes Guenee, Phrixocomes Turner, Abraxaphantes Warren, Epidesmia Duncan & Westwood and Phrataria Walker. Archiearinae are monophyletic when Dirce and Acalyphes are formally transferred to Ennominae. We also found that many tribes were para- or polyphyletic and therefore propose tens of taxonomic changes at the tribe and subfamily levels. Archaeobalbini stat. rev. Viidalepp (Geometrinae) is raised from synonymy with Pseudoterpnini Warren to tribal rank. Chlorodontoperini Murillo-Ramos, Sihvonen & Brehm, trib. nov. and Drepanogynini Murillo-Ramos, Sihvonen & Brehm, trib. nov. are described as new tribes in Geometrinae and Ennominae, respectively.
  • Wang, Yanhao; Li, Yuchen; Fan, Ju; Ye, Chang; Chai, Mingke (2021)
    Graphs are commonly used for representing complex structures such as social relationships, biological interactions, and knowledge bases. In many scenarios, graphs not only represent topological relationships but also store the attributes that denote the semantics associated with their vertices and edges, known as attributed graphs. Attributed graphs can meet demands for a wide range of applications, and thus a variety of queries on attributed graphs have been proposed. However, these diverse types of attributed graph queries have not been systematically investigated yet. In this paper, we provide an extensive survey of several typical types of attributed graph queries. We propose a taxonomy of attributed graph queries based on query inputs and outputs. We summarize the definitions of queries that fall into each category and present a fine-grained classification of queries within each category by analyzing the semantics and algorithmic motivations behind these queries. Moreover, we discuss the insights of how existing studies address the technical challenges of query processing and outline several promising future research directions.
  • Mukherjee, Kingshuk; Alipanahi, Bahar; Kahveci, Tamer; Salmela, Leena; Boucher, Christina (2019)
    Motivation: Optical maps are high-resolution restriction maps (Rmaps) that give a unique numeric representation to a genome. Used in concert with sequence reads, they provide a useful tool for genome assembly and for discovering structural variations and rearrangements. Although they have been a regular feature of modern genome assembly projects, optical maps have been mainly used in post-processing step and not in the genome assembly process itself. Several methods have been proposed for pairwise alignment of single molecule optical maps-called Rmaps, or for aligning optical maps to assembled reads. However, the problem of aligning an Rmap to a graph representing the sequence data of the same genome has not been studied before. Such an alignment provides a mapping between two sets of data: optical maps and sequence data which will facilitate the usage of optical maps in the sequence assembly step itself. Results: We define the problem of aligning an Rmap to a de Bruijn graph and present the first algorithm for solving this problem which is based on a seed-and-extend approach. We demonstrate that our method is capable of aligning 73% of Rmaps generated from the Escherichia coli genome to the de Bruijn graph constructed from short reads generated from the same genome. We validate the alignments and show that our method achieves an accuracy of 99.6%. We also show that our method scales to larger genomes. In particular, we show that 76% of Rmaps can be aligned to the de Bruijn graph in the case of human data.
  • Purisha, Zenith; Karhula, Sakari S.; Ketola, Juuso H.; Rimpeläinen, Juho; Nieminen, Miika T.; Saarakkala, Simo; Kröger, Heikki; Siltanen, Samuli (2019)
    X-ray tomography is a reliable tool for determining the inner structure of 3-D object with penetrating X-rays. However, traditional reconstruction methods, such as Feldkamp-Davis-Kress (FDK), require dense angular sampling in the data acquisition phase leading to long measurement times, especially in X-ray micro-tomography to obtain high-resolution scans. Acquiring less data using greater angular steps is an obvious way for speeding up the process and avoiding the need to save huge data sets. However, computing 3-D reconstruction from such a sparsely sampled data set is difficult because the measurement data are usually contaminated by errors, and linear measurement models do not contain sufficient information to solve the problem in practice. An automatic regularization method is proposed for robust reconstruction, based on enforcing sparsity in the 3-D shearlet transform domain. The inputs of the algorithm are the projection data and a priori known expected degree of sparsity, denoted as 0 <C-pr
  • Barylski, Jacub; Enault, François; Dutilh, Bas E.; Schuller, Margo B.P.; Edwards, Robert A.; Gillis, Annika; Klumpp, Jochen; Knezevic, Petar; Krupovic, Mart; Kuhn, Jens H.; Lavigne, Rob; Oksanen, Hanna M; Sullivan, Matthew B.; Jang, Ho Bin; Simmonds, Peter; Aiewsakun, Pakorn; Wittmann, Johannes; Tolstoy, Igor; Brister, J. Rodney; Kropinki, Andrew; Adriaenssens, Evelien M. (2020)
    Tailed bacteriophages are the most abundant and diverse viruses in the world, with genome sizes ranging from 10 kbp to over 500 kbp. Yet, due to historical reasons, all this diversity is confined to a single virus order-Caudovirales, composed of just four families: Myoviridae, Siphoviridae, Podoviridae, and the newly created Ackermannviridae family. In recent years, this morphology-based classification scheme has started to crumble under the constant flood of phage sequences, revealing that tailed phages are even more genetically diverse than once thought. This prompted us, the Bacterial and Archaeal Viruses Subcommittee of the International Committee on Taxonomy of Viruses (ICTV), to consider overall reorganization of phage taxonomy. In this study, we used a wide range of complementary methods-including comparative genomics, core genome analysis, and marker gene phylogenetics-to show that the group of Bacillus phage SPO1-related viruses previously classified into the Spounavirinae subfamily, is clearly distinct from other members of the family Myoviridae and its diversity deserves the rank of an autonomous family. Thus, we removed this group from the Myoviridae family and created the family Herelleviridae-a new taxon of the same rank. In the process of the taxon evaluation, we explored the feasibility of different demarcation criteria and critically evaluated the usefulness of our methods for phage classification. The convergence of results, drawing a consistent and comprehensive picture of a new family with associated subfamilies, regardless of method, demonstrates that the tools applied here are particularly useful in phage taxonomy. We are convinced that creation of this novel family is a crucial milestone toward much-needed reclassification in the Caudovirales order.
  • But, Anna; Wang, Haining; Mannisto, Satu; Pukkala, Eero; Haukka, Jari (2014)
  • Popov, Georgi; Bačić, Goran; Mattinen, Miika; Manner, Toni; Lindström, Hannu; Seppänen, Heli; Suihkonen, Sami; Vehkamäki, Marko; Kemell, Marianna; Jalkanen, Pasi; Mizohata, Kenichiro; Räisänen, Jyrki; Leskelä, Markku; Koivula, Hanna Maarit; Barry, Seán T.; Ritala, Mikko (2020)
    Atomic layer deposition (ALD) is a viable method for depositing functional, passivating, and encapsulating layers on top of halide perovskites. Studies in that area have only focused on metal oxides, despite a great number of materials that can be made with ALD. This work demonstrates that, in addition to oxides, other ALD processes can be compatible with the perovskites. We describe two new ALD processes for lead sulfide. These processes operate at low deposition temperatures (45-155 degrees C) that have been inaccessible to previous ALD PbS processes. Our processes rely on volatile and reactive lead precursors Pb(dbda) (dbda = rac-N-2,N-3-di-tertbutylbutane-2,3-diamide) and Pb(btsa)(2) (btsa = bis(trimethylsilyl)amide) as well as H2S. These precursors produce high quality PbS thin films that are uniform, crystalline, and pure. The films exhibit p- type conductivity and good mobilities of 10-70 cm(2) V-1 s(-1). Low deposition temperatures enable direct ALD of PbS onto a halide perovskite CH3NH3PbI3 (MAPI) without its decomposition. The stability of MAPI in ambient air is greatly improved by capping with ALD PbS. More generally, these new processes offer valuable alternatives for PbS-based devices, and we hope that this study will inspire more studies on ALD of non-oxides on halide perovskites.
  • Rautiainen, Mikko; Mäkinen, Veli; Marschall, Tobias (2019)
    Motivation: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. Results: We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers' bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with vertical bar V vertical bar nodes and vertical bar E vertical bar edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(vertical bar V vertical bar+(sic)m/w(sic)vertical bar E vertical bar logw) for acyclic graphs and O(vertical bar V vertical bar+m vertical bar E vertical bar logw) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm.
  • Liu, Jinxiu; Heiskanen, Janne; Maeda, Eduardo Eiji; Pellikka, Petri K. E. (2018)
    West African savannas are subject to regular fires, which have impacts on vegetation structure, biodiversity and carbon balance. An efficient and accurate mapping of burned area associated with seasonal fires can greatly benefit decision making in land management. Since coarse resolution burned area products cannot meet the accuracy needed for fire management and climate modelling at local scales, the medium resolution Landsat data is a promising alternative for local scale studies. In this study, we developed an algorithm for continuous monitoring of annual burned areas using Landsat time series. The algorithm is based on burned pixel detection using harmonic model fitting with Landsat time series and breakpoint identification in the time series data. This approach was tested in a savanna area in southern Burkina Faso using 281 images acquired between October 2000 and April 2016. An overall accuracy of 79.2% was obtained with balanced omission and commission errors. This represents a significant improvement in comparison with MODIS burned area product (67.6%), which had more omission errors than commission errors, indicating underestimation of the total burned area. By observing the spatial distribution of burned areas, we found that the Landsat based method misclassified cropland and cloud shadows as burned areas due to the similar spectral response, and MODIS burned area product omitted small and fragmented burned areas. The proposed algorithm is flexible and robust against decreased data availability caused by clouds and Landsat 7 missing lines, therefore having a high potential for being applied in other landscapes in future studies.
  • Irfan, Furqan B.; Consunji, Rafael; El-Menyar, Ayman; George, Pooja; Peralta, Ruben; Al-Thani, Hassan; Thomas, Stephen Hodges; Alinier, Guillaume; Shuaib, Ashfaq; Al-Suwaidi, Jassim; Singh, Rajvir; Castren, Maaret; Cameron, Peter A.; Djarv, Therese (2017)
    Background: Traumatic cardiac arrest studies have reported improved survival rates recently, ranging from 1.7-7.5%. This population-based nationwide study aims to describe the epidemiology, interventions and outcomes, and determine predictors of survival from out-of-hospital traumatic cardiac arrest (OHTCA) in Qatar. Methods: An observational retrospective population-based study was conducted on OHTCA patients in Qatar, from January 2010 to December 2015. Traumatic cardiac arrest was redefined to include out-of-hospital traumatic cardiac arrest (OHTCA) and in-hospital traumatic cardiac arrest (IHTCA). Results: A total of 410 OHTCA patients were included in the 6-year study period. The mean annual crude incidence rate of OHTCA was 4.0 per 100,000 population, in Qatar. OHTCA mostly occurred in males with a median age of 33. There was a preponderance of blunt injuries (94.3%) and head injuries (66.3%). Overall, the survival rate was 2.4%. Shockable rhythm, prehospital external hemorrhage control, in-hospital blood transfusion, and surgery were associated with higher odds of survival. Adrenaline (Epinephrine) lowered the odds of survival. Conclusion: The incidence of OHTCA was less than expected, with a low rate of survival. Thoracotomy was not associated with improved survival while Adrenaline administration lowered survival in OHTCA patients with majority blunt injuries. Interventions to enable early prehospital control of hemorrhage, blood transfusion, thoracostomy and surgery improved survival. (C) 2017 Elsevier B.V. All rights reserved.
  • Kallio, M. Aleksi; Tuimala, Jarno T.; Hupponen, Taavi; Klemela, Petri; Gentile, Massimiliano; Scheinin, Ilari; Koski, Mikko; Kaki, Janne; Korpelainen, Eija I. (2011)
  • Abera, Temesgen; Heiskanen, Janne; Pellikka, Petri; Adhikari, Hari; Maeda, Eduardo (2020)
    Bushlands (Acacia-Commiphora) constitute the largest and one of the most threatened ecosystems in East Africa. Although several studies have investigated the climatic impacts of land changes on local and global climate, the main focus has been on forest loss and the impacts of bushland clearing thus remain poorly understood. Measuring the impacts of bushland loss on local climate is challenging given that changes often occur at fragmented and small patches. Here, we apply high-resolution satellite imagery and land surface flux modeling approaches to unveil the impacts of bushland clearing on surface biophysical properties and its associated effects on surface energy balance and land surface temperature. Our results show that bushland clearing leads to an average reduction in evapotranspiration of 0.4 mm day(-1). The changes in surface biophysical properties affected the surface energy balance components with different magnitude. The reduction in latent heat flux was stronger than other surface energy fluxes and resulted in an average net increase in daytime land surface temperature (LST) of up to 1.75 K. These results demonstrate the important impact of bushland-to-cropland conversion on the local climate, as they reveal increases in LST of a magnitude comparable to those caused by forest loss. This finding highlights the necessity of bushland conservation for regulating the land surface temperature in East Africa and, at the same time, warns of the climatic impacts of clearing bushlands for agriculture. (c) 2020 The Authors. Published by Elsevier B.V.
  • Purisha, Zenith; Rimpeläinen, Juho; Bubba, Tatiana; Siltanen, Samuli (2018)
    Tomographic reconstruction is an ill-posed inverse problem that calls for regularization. One possibility is to require sparsity of the unknown in an orthonormal wavelet basis. This, in turn, can be achieved by variational regularization, where the penalty term is the sum of the absolute values of the wavelet coefficients. The primal-dual fixed point algorithm showed that the minimizer of the variational regularization functional can be computed iteratively using a soft-thresholding operation. Choosing the soft-thresholding parameter mu > 0 is analogous to the notoriously difficult problem of picking the optimal regularization parameter in Tikhonov regularization. Here, a novel automatic method is introduced for choosing mu, based on a control algorithm driving the sparsity of the reconstruction to an a priori known ratio of nonzero versus zero wavelet coefficients in the unknown.
  • Neittaanmäki-Perttu, Noora; Gronroos, Mari; Jeskanen, Leila; Polonen, Ilkka; Ranki, Annamari; Saksela, Olli; Snellman, Erna (2015)
    Lentigo maligna (LM) is an in situ form of melanoma which can progress into invasive lentigo maligna melanoma (LMM). Variations in the pigmentation and thus visibility of the tumour make assessment of lesion borders challenging. We tested hyperspectral imaging system (HIS) in in vivo preoperative delineation of LM and LMM margins. We compared lesion margins delineated by HIS with those estimated clinically, and confirmed histologically. A total of 14 LMs and 5 LIVIMs in 19 patients were included. HIS analysis matched the histopathological analysis in 18/19 (94.7%) cases while in 1/19 (5.3%) cases HIS showed lesion extension not confirmed by histopathology (false positives). Compared to clinical examination, HIS defined lesion borders more accurately in 10/19 (52.6%) of cases (wider, n=7 or smaller, n=3) while in 8/19 (42.1%) cases lesion borders were the same as delineated clinically as confirmed histologically. Thus, HIS is useful for the detection of subclinical LM/LMM borders.
  • Lassas, Matti; Saksala, Teemu (2019)
    Let (N, g) be a Riemannian manifold with the distance function d(x, y) and an open subset M subset of N. For x is an element of M we denote by D-x the distance difference function D-x:F x F -> R, given by D-x(z(1), z(2)) = d(x, z(1)) - d(x, z(2)), z(1), z(2) is an element of F = N \ M. We consider the inverse problem of determining the topological and the differentiable structure of the manifold M and the metric g vertical bar M on it when we are given the distance difference data, that is, the set F, the metric g vertical bar F, and the collection D(M) = {D-x; x is an element of M}. Moreover, we consider the embedded image D(M) of the manifold M, in the vector space C(F x F), as a representation of manifold M. The inverse problem of determining (M, g) from D(M) arises e.g. in the study of the wave equation on R x N when we observe in F the waves produced by spontaneous point sources at unknown points (t, x) is an element of R x M. Then D-x (z(1), z(2)) is the difference of the times when one observes at points z(1) and z(2) the wave produced by a point source at x that goes off at an unknown time. The problem has applications in hybrid inverse problems and in geophysical imaging.
  • Korpela, Jussi; Lassas, Matti; Oksanen, Lauri (2019)
    An inverse boundary value problem for the 1+1 dimensional wave equation (partial derivative(2)(t) - c(x)(2)partial derivative(2)(x))u(x,t) = 0, x is an element of R+ is considered. We give a discrete regularization strategy to recover wave speed c(x) when we are given the boundary value of the wave, u(0,t), that is produced by a single pulse-like source. The regularization strategy gives an approximative wave speed (c) over tilde, satisfying a Holder type estimate parallel to (c) over tilde - c parallel to
  • Silva, Milton; Pratas, Diogo; Pinho, Armando J. (2020)
    Background: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. Findings: We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of 2.4%, 7.1%, 6.1%, 5.8%, and 6.0%, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression in 12.4%, 11.7%, 10.8%, and 10.1% over the state of the art. The cost of this compression improvement is some additional computational time (1.7-3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. Conclusions: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at
  • Tuononen, Minttu; O'Connor, Ewan J.; Sinclair, Victoria A. (2019)
    The presence of clouds and their characteristics have a strong impact on the radiative balance of the Earth and on the amount of solar radiation reaching the Earth's surface. Many applications require accurate forecasts of surface radiation on weather timescales, for example solar energy and UV radiation forecasts. Here we investigate how operational forecasts of low and mid-level clouds affect the accuracy of solar radiation forecasts. A total of 4 years of cloud and solar radiation observations from one site in Helsinki, Finland, are analysed. Cloud observations are obtained from a ceilometer and therefore we first develop algorithms to reliably detect cloud base, precipitation, and fog. These new algorithms are widely applicable for both operational use and research, such as in-cloud icing detection for the wind energy industry and for aviation. The cloud and radiation observations are compared to forecasts from the Integrated Forecast System (IFS) run operationally and developed by the European Centre for Medium-Range Weather Forecasts (ECMWF). We develop methods to evaluate the skill of the cloud and radiation forecasts. These methods can potentially be extended to hundreds of sites globally. Over Helsinki, the measured global horizontal irradiance (GHI) is strongly influenced by its northerly location and the annual variation in cloudiness. Solar radiation forecast error is therefore larger in summer than in winter, but the relative error in the solar radiation forecast is more or less constant throughout the year. The mean overall bias in the GHI forecast is positive (8 W m(-2)). The observed and forecast distributions in cloud cover, at the spatial scales we are considering, are strongly skewed towards clear-sky and overcast situations. Cloud cover forecasts show more skill in winter when the cloud cover is predominantly overcast; in summer there are more clear-sky and broken cloud situations. A negative bias was found in forecast GHI for correctly forecast clear-sky cases and a positive bias in correctly forecast overcast cases. Temporal averaging improved the cloud cover forecast and hence decreased the solar radiation forecast error. The positive bias seen in overcast situations occurs when the model cloud has low values of liquid water path (LWP). We attribute this bias to the model having LWP values that are too low or the model optical properties for clouds with low LWP being incorrect.
  • Mukherjee, Kingshuk; Rossi, Massimiliano; Salmela, Leena; Boucher, Christina (2021)
    Genome wide optical maps are high resolution restriction maps that give a unique numeric representation to a genome. They are produced by assembling hundreds of thousands of single molecule optical maps, which are called Rmaps. Unfortunately, there are very few choices for assembling Rmap data. There exists only one publicly-available non-proprietary method for assembly and one proprietary software that is available via an executable. Furthermore, the publicly-available method, by Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006), follows the overlap-layout-consensus (OLC) paradigm, and therefore, is unable to scale for relatively large genomes. The algorithm behind the proprietary method, Bionano Genomics' Solve, is largely unknown. In this paper, we extend the definition of bi-labels in the paired de Bruijn graph to the context of optical mapping data, and present the first de Bruijn graph based method for Rmap assembly. We implement our approach, which we refer to as rmapper, and compare its performance against the assembler of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) and Solve by Bionano Genomics on data from three genomes: E. coli, human, and climbing perch fish (Anabas Testudineus). Our method was able to successfully run on all three genomes. The method of Valouev et al. (Proc Natl Acad Sci USA 103(43):15770-15775, 2006) only successfully ran on E. coli. Moreover, on the human genome rmapper was at least 130 times faster than Bionano Solve, used five times less memory and produced the highest genome fraction with zero mis-assemblies. Our software, rmapper is written in C++ and is publicly available under GNU General Public License at .
  • Lees, John A.; Harris, Simon R.; Tonkin-Hill, Gerry; Gladstone, Rebecca A.; Lo, Stephanie W.; Weiser, Jeffrey N.; Corander, Jukka; Bentley, Stephen D.; Croucher, Nicholas J. (2019)
    The routine use of genomics for disease surveillance provides the opportunity for high-resolution bacterial epidemiology. Current whole-genome clustering and multilocus typing approaches do not fully exploit core and accessory genomic variation, and they cannot both automatically identify, and subsequently expand, clusters of significantly similar isolates in large data sets spanning entire species. Here, we describe PopPUNK (Population Partitioning Using Nucleotide K-mers), a software implementing scalable and expandable annotation-and alignment-free methods for population analysis and clustering. Variable-length k-mer comparisons are used to distinguish isolates' divergence in shared sequence and gene content, which we demonstrate to be accurate over multiple orders of magnitude using data from both simulations and genomic collections representing 10 taxonomically widespread species. Connections between closely related isolates of the same strain are robustly identified, despite interspecies variation in the pairwise distance distributions that reflects species' diverse evolutionary patterns. PopPUNK can process 10(3)-10(4) genomes in a single batch, with minimal memory use and runtimes up to 200-fold faster than existing model-based methods. Clusters of strains remain consistent as new batches of genomes are added, which is achieved without needing to reanalyze all genomes de novo. This facilitates real-time surveillance with consistent cluster naming between studies and allows for outbreak detection using hundreds of genomes in minutes. Interactive visualization and online publication is streamlined through the automatic output of results to multiple platforms. PopPUNK has been designed as a flexible platform that addresses important issues with currently used whole-genome clustering and typing methods, and has potential uses across bacterial genetics and public health research.