CARE 2.0: reducing false-positive sequencing error corrections using machine learning
- Publication type:
- Journal article
- Metadata:
-
- Authors
- Felix Kallenborn
- Julian Cascitti
- Bertil Schmidt
- Author URL
- https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=fis-test-1&SrcAuth=WosAPI&KeyUT=WOS:000810679500002&DestLinkType=FullRecord&DestApp=WOS_CPL
- DOI
- 10.1186/s12859-022-04754-3
- External identifiers
- Clarivate Analytics Document Solution ID: 2C2BI
- PubMed Identifier: 35698033
- ISSN
- 1471-2105
- Issue
- 1
- Journal
- BMC BIOINFORMATICS
- Keywords
- Next-generation sequencing
- Error correction
- Machine learning
- Article number
- ARTN 227
- Publication date
- 2022
- Status
- Published
- Title
- CARE 2.0: reducing false-positive sequencing error corrections using machine learning
- Sub types
- Article
- Volume
- 23
Data source: Web of Science (Lite)
- Other metadata sources:
-
- Abstract
- Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have a negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.
- Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations, its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders of magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 achieves high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads, CARE 2.0 generates only 1.2M false positives (FPs) and 801.4M true positives (TPs) at a highly competitive runtime, while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.
- Conclusion: False-positive corrections can negatively influence downstream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, it produces higher-quality datasets that improve k-mer analysis and de-novo assembly on real-world data, demonstrating the applicability of machine learning techniques to sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.
- Authors
- Felix Kallenborn
- Julian Cascitti
- Bertil Schmidt
- DOI
- 10.1186/s12859-022-04754-3
- eISSN
- 1471-2105
- Issue
- 1
- Journal
- BMC Bioinformatics
- Language
- en
- Article number
- 227
- Online publication date
- 2022
- Publication date
- 2022
- Status
- Published
- Publisher
- Springer Science and Business Media LLC
- Publisher URL
- http://dx.doi.org/10.1186/s12859-022-04754-3
- Date of data capture
- 2023
- Title
- CARE 2.0: reducing false-positive sequencing error corrections using machine learning
- Volume
- 23
Data source: Crossref
- Addresses
- Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany. kallenborn@uni-mainz.de.
- Authors
- Felix Kallenborn
- Julian Cascitti
- Bertil Schmidt
- DOI
- 10.1186/s12859-022-04754-3
- eISSN
- 1471-2105
- External identifiers
- PubMed Identifier: 35698033
- PubMed Central ID: PMC9195321
- Funding acknowledgements
- DeCoDeML Project by Rhein-Main-University Network:
- Johannes Gutenberg-Universität Mainz:
- Open access
- true
- ISSN
- 1471-2105
- Issue
- 1
- Journal
- BMC bioinformatics
- Keywords
- Humans
- Sequence Alignment
- Sequence Analysis, DNA
- Algorithms
- Software
- High-Throughput Nucleotide Sequencing
- Machine Learning
- Language
- eng
- Medium
- Electronic
- Online publication date
- 2022
- Open access status
- Open Access
- Pagination
- 227
- Publication date
- 2022
- Status
- Published
- Publisher licence
- CC BY
- Date of data capture
- 2022
- Title
- CARE 2.0: reducing false-positive sequencing error corrections using machine learning.
- Sub types
- research-article
- Journal Article
- Volume
- 23
Files
https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/s12859-022-04754-3
https://europepmc.org/articles/PMC9195321?pdf=render
Data source: Europe PubMed Central
- Date of acceptance
- 2022
- Authors
- Felix Kallenborn
- Julian Cascitti
- Bertil Schmidt
- Author URL
- https://www.ncbi.nlm.nih.gov/pubmed/35698033
- DOI
- 10.1186/s12859-022-04754-3
- eISSN
- 1471-2105
- External identifiers
- PubMed Central ID: PMC9195321
- Issue
- 1
- Journal
- BMC Bioinformatics
- Keywords
- Error correction
- Machine learning
- Next-generation sequencing
- Algorithms
- High-Throughput Nucleotide Sequencing
- Humans
- Machine Learning
- Sequence Alignment
- Sequence Analysis, DNA
- Software
- Language
- eng
- Country
- England
- Pagination
- 227
- PII
- 10.1186/s12859-022-04754-3
- Publication date
- 2022
- Status
- Published online
- Date the record was made public
- 2022
- Title
- CARE 2.0: reducing false-positive sequencing error corrections using machine learning.
- Sub types
- Journal Article
- Volume
- 23
Data source: PubMed
- Authors
- Felix Kallenborn
- Julian Cascitti
- Bertil Schmidt
- Journal
- BMC Bioinform.
- Article number
- 1
- Pagination
- 227 - 227
- Publication date
- 2022
- Title
- CARE 2.0: reducing false-positive sequencing error corrections using machine learning.
- Volume
- 23
Data source: DBLP
- Author's licence
- CC-BY
- Authors
- Felix Kallenborn
- Julian Cascitti
- Bertil Schmidt
- Hosting institution
- Universitätsbibliothek Mainz
- Collections
- DFG-491381577-G
- Resource version
- Published version
- DOI
- 10.1186/s12859-022-04754-3
- Funding acknowledgements
- Funded by the Deutsche Forschungsgemeinschaft (DFG), project number 491381577
- File(s) embargoed
- false
- Open access
- true
- ISSN
- 1471-2105
- Journal
- BMC bioinformatics
- Keywords
- 004 Computer science
- 004 Data processing
- Language
- eng
- Open access status
- Open Access
- Pagination
- 227
- Publication date
- 2022
- Public URL
- https://openscience.ub.uni-mainz.de/handle/20.500.12030/8137
- Publisher
- Springer Nature
- Date of data capture
- 2022
- Date the record was made public
- 2022
- Access
- Public
- Title
- CARE 2.0 : reducing false-positive sequencing error corrections using machine learning
- Volume
- 23
Files
care_20__reducing_falsepositi-20221020145914717.pdf
Data source: OPENSCIENCE.UB
- Relationships:
- Owned by