LVQ-KNN: Composition-based DNA/RNA binning of short nucleotide sequences utilizing a prototype-based k-nearest neighbor approach

Belka, Ariane; Fischer, Mareike; Pohlmann, Anne; Beer, Martin; Höper, Dirk

doi:10.1016/j.virusres.2018.10.002

Peer-reviewed

Published

Unbiased sequencing is an emerging method for characterizing the microbiome in a sample and for detecting unrecognized pathogens. Many software tools are available for the taxonomic classification of such metagenomic datasets. Many of them achieve satisfactory sensitivity and specificity for known organisms, but they fail if the sample contains unknown organisms, which cannot be detected by similarity-based classification against available databases. Recognizing unknowns, however, is especially important for the detection of newly emerging pathogens, which are often RNA viruses. Here we present the composition-based analysis tool LVQ-KNN for binning unclassified nucleotide sequence reads into their provenance classes, DNA or RNA. In a 5-fold cross-validation, LVQ-KNN reached correct classification rates (CCR) of up to 99.9% for the classification into DNA/RNA. On real datasets, it reached CCRs of up to 94.5%. Compared with another composition-based analysis tool, LVQ-KNN achieved similar or better classification results. LVQ-KNN is a new tool for DNA/RNA classification of sequence reads from unbiased sequencing approaches that could be applicable to the detection of as yet unknown RNA viruses in metagenomic samples. The source code, training data, and test data for LVQ-KNN are available on GitHub (https://github.com/ab1989/LVQ-KNN).
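To make the terms in the abstract concrete, the following is a minimal Python sketch of composition-based binning in general: reads are turned into oligonucleotide (k-mer) frequency vectors, a k-nearest neighbor vote over labelled prototype vectors assigns DNA or RNA, and the CCR is the fraction of correct assignments. It is not taken from the LVQ-KNN repository. The function names, the oligonucleotide length, the Euclidean distance, and the neighbor count are all assumptions made for illustration, and the learning vector quantization step that would produce the prototypes is not shown.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Relative frequency of every k-mer over A/C/G/T (U is mapped to T).

    Hypothetical helper: the oligonucleotide length used by LVQ-KNN is not
    stated in the abstract, so k=2 is only a placeholder.
    """
    seq = seq.upper().replace("U", "T")
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)  # ignores k-mers containing N etc.
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


def knn_predict(query, prototypes, k=3):
    """Majority vote among the k prototypes closest to `query`.

    `prototypes` is a list of (vector, label) pairs. In a prototype-based
    approach these would come from a training step such as LVQ; here they
    are simply passed in. Euclidean distance is an assumption.
    """
    nearest = sorted(prototypes, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def correct_classification_rate(predicted, actual):
    """CCR: share of reads whose predicted class equals the true class."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)
```

With an odd neighbor count and two classes, the vote cannot tie, which is why `k=3` is used as the default in this sketch.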

Files

Restricted access

Classification

Published in: Virus Research
Vol. 258, pp. 55-63
Volume: 258
Date of publication: 4 October 2018
DOI: 10.1016/j.virusres.2018.10.002
Language: English
Resource type: Text
Keywords: composition-based analysis; oligonucleotides; metagenomics; learning vector quantization algorithm; k-nearest neighbor method; cross validation
DDC subject group of the DNB: 570 Life sciences, biology
Link URL: https://www.sciencedirect.com/science/article/pii/S0168170218303848
Link URL: https://github.com/ab1989/LVQ-KNN
Institution: Friedrich-Loeffler-Institut, Institut für Virusdiagnostik
Physical location: SD/2018/385