CARE 2.0: reducing false-positive sequencing error corrections using machine learning

Publication type:

Journal article

Metadata:

Authors

Felix Kallenborn
Julian Cascitti
Bertil Schmidt

Author URL

https://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=fis-test-1&SrcAuth=WosAPI&KeyUT=WOS:000810679500002&DestLinkType=FullRecord&DestApp=WOS_CPL

DOI

10.1186/s12859-022-04754-3

External identifiers

Clarivate Analytics Document Solution ID: 2C2BI
PubMed Identifier: 35698033

ISSN

1471-2105

Publication issue

Journal

BMC BIOINFORMATICS

Keywords

Next-generation sequencing
Error correction
Machine learning

Article number

ARTN 227

Publication date

2022

Status

Published

Title

CARE 2.0: reducing false-positive sequencing error corrections using machine learning

Sub types

Article

Journal issue

Data source: Web of Science (Lite)

Other metadata sources:

Abstract

Background: Next-generation sequencing pipelines often perform error correction as a preprocessing step to obtain cleaned input data. State-of-the-art error correction programs are able to reliably detect and correct the majority of sequencing errors. However, they also introduce new errors by making false-positive corrections. These correction mistakes can have negative impact on downstream analysis, such as k-mer statistics, de-novo assembly, and variant calling. This motivates the need for more precise error correction tools.

Results: We present CARE 2.0, a context-aware read error correction tool based on multiple sequence alignment targeting Illumina datasets. In addition to a number of newly introduced optimizations its most significant change is the replacement of CARE 1.0's hand-crafted correction conditions with a novel classifier based on random decision forests trained on Illumina data. This results in up to two orders-of-magnitude fewer false-positive corrections compared to other state-of-the-art error correction software. At the same time, CARE 2.0 is able to achieve high numbers of true-positive corrections comparable to its competitors. On a simulated full human dataset with 914M reads CARE 2.0 generates only 1.2M false positives (FPs) (and 801.4M true positives (TPs)) at a highly competitive runtime while the best corrections achieved by other state-of-the-art tools contain at least 3.9M FPs and at most 814.5M TPs. Better de-novo assembly and improved k-mer analysis show the applicability of CARE 2.0 to real-world data.

Conclusion: False-positive corrections can negatively influence down-stream analysis. The precision of CARE 2.0 greatly reduces the number of those corrections compared to other state-of-the-art programs including BFC, Karect, Musket, Bcool, SGA, and Lighter. Thus, higher-quality datasets are produced which improve k-mer analysis and de-novo assembly in real-world datasets which demonstrates the applicability of machine learning techniques in the context of sequencing read error correction. CARE 2.0 is written in C++/CUDA for Linux systems and can be run on the CPU as well as on CUDA-enabled GPUs. It is available at https://github.com/fkallen/CARE.
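
The TP and FP counts quoted in the abstract can be turned into precision figures with a few lines of arithmetic. The sketch below is illustrative only and is not part of CARE: the function and variable names are assumptions, and the only inputs are the four counts stated in the abstract (in millions). Because the abstract gives the competitors' numbers as bounds ("at least 3.9M FPs", "at most 814.5M TPs") that may come from different tools, combining them yields an optimistic upper bound on competitor precision, not a measured value for any single tool.

```python
def precision(tp: float, fp: float) -> float:
    """Fraction of performed corrections that were correct: TP / (TP + FP)."""
    if tp + fp == 0:
        raise ValueError("no corrections were made")
    return tp / (tp + fp)


# Counts in millions of corrections, as quoted in the abstract
# (simulated full human dataset, 914M reads).
CARE2_TP, CARE2_FP = 801.4, 1.2

# Bounds quoted for the other tools. They may stem from different tools,
# so pairing them gives an upper bound on precision (assumption of this sketch).
OTHERS_MAX_TP, OTHERS_MIN_FP = 814.5, 3.9

care2_precision = precision(CARE2_TP, CARE2_FP)
others_precision_bound = precision(OTHERS_MAX_TP, OTHERS_MIN_FP)

# Factor by which CARE 2.0 has fewer FPs than the lowest FP count
# reported for the other tools on this dataset.
fp_reduction = OTHERS_MIN_FP / CARE2_FP

if __name__ == "__main__":
    print(f"CARE 2.0 precision:          {care2_precision:.4%}")
    print(f"Other tools, upper bound:    {others_precision_bound:.4%}")
    print(f"FP reduction factor (>=):    {fp_reduction:.2f}")
```

On this dataset the quoted counts correspond to a precision of roughly 99.85% for CARE 2.0 versus at most roughly 99.52% for the other tools, with at least 3.25 times fewer false positives. The "up to two orders of magnitude" figure in the abstract refers to other comparisons not itemized in this record.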

Authors

Felix Kallenborn
Julian Cascitti
Bertil Schmidt

DOI

10.1186/s12859-022-04754-3

eISSN

1471-2105

Publication issue

Journal

BMC Bioinformatics

Language

Article number

227

Online publication date

2022

Publication date

2022

Status

Published

Publisher

Springer Science and Business Media LLC

Publisher URL

http://dx.doi.org/10.1186/s12859-022-04754-3

Data capture date

2023

Title

CARE 2.0: reducing false-positive sequencing error corrections using machine learning

Journal issue

Data source: Crossref

Addresses

Department of Computer Science, Johannes Gutenberg University Mainz, Mainz, Germany. kallenborn@uni-mainz.de.

Authors

Felix Kallenborn
Julian Cascitti
Bertil Schmidt

DOI

10.1186/s12859-022-04754-3

eISSN

1471-2105

External identifiers

PubMed Identifier: 35698033
PubMed Central ID: PMC9195321

Funding acknowledgements

DeCoDeML Project by Rhein-Main-University Network:
Johannes Gutenberg-Universität Mainz:

Open access

true

ISSN

1471-2105

Publication issue

Journal

BMC bioinformatics

Keywords

Humans
Sequence Alignment
Sequence Analysis, DNA
Algorithms
Software
High-Throughput Nucleotide Sequencing
Machine Learning

Language

eng

Medium

Electronic

Online publication date

2022

Open access status

Open Access

Pagination

227

Publication date

2022

Status

Published

Publisher licence

CC BY

Data capture date

2022

Title

CARE 2.0: reducing false-positive sequencing error corrections using machine learning.

Sub types

research-article
Journal Article

Journal issue

Files

https://bmcbioinformatics.biomedcentral.com/counter/pdf/10.1186/s12859-022-04754-3 https://europepmc.org/articles/PMC9195321?pdf=render

Data source: Europe PubMed Central

Date of acceptance

2022

Authors

Felix Kallenborn
Julian Cascitti
Bertil Schmidt

Author URL

https://www.ncbi.nlm.nih.gov/pubmed/35698033

DOI

10.1186/s12859-022-04754-3

eISSN

1471-2105

External identifiers

PubMed Central ID: PMC9195321

Publication issue

Journal

BMC Bioinformatics

Keywords

Error correction
Machine learning
Next-generation sequencing
Algorithms
High-Throughput Nucleotide Sequencing
Humans
Machine Learning
Sequence Alignment
Sequence Analysis, DNA
Software

Language

eng

Country

England

Pagination

227

PII

10.1186/s12859-022-04754-3

Publication date

2022

Status

Published online

Date the record was made public

2022

Title

CARE 2.0: reducing false-positive sequencing error corrections using machine learning.

Sub types

Journal Article

Journal issue

Data source: PubMed

Authors

Felix Kallenborn
Julian Cascitti
Bertil Schmidt

Journal

BMC Bioinform.

Article number

Pagination

227 - 227

Publication date

2022

Title

CARE 2.0: reducing false-positive sequencing error corrections using machine learning.

Journal issue

Data source: DBLP

Author's licence

CC-BY

Authors

Felix Kallenborn
Julian Cascitti
Bertil Schmidt

Hosting institution

Universitätsbibliothek Mainz

Collections

DFG-491381577-G

Resource version

Published version

DOI

10.1186/s12859-022-04754-3

Funding acknowledgements

Funded by the Deutsche Forschungsgemeinschaft (DFG), project number 491381577

File(s) embargoed

false

Open access

true

ISSN

1471-2105

Journal

BMC bioinformatics

Keywords

004 Computer science
004 Data processing

Language

eng

Open access status

Open Access

Pagination

227

Publication date

2022

Public URL

https://openscience.ub.uni-mainz.de/handle/20.500.12030/8137

Publisher

Springer Nature

Data capture date

2022

Date the record was made public

2022

Access

Public

Title

CARE 2.0: reducing false-positive sequencing error corrections using machine learning

Journal issue

Files

care_20__reducing_falsepositi-20221020145914717.pdf

Data source: OPENSCIENCE.UB

Relationships:

Owned by

High Performance Computing
