Comprehensive benchmark and architectural analysis of deep learning models for nanopore sequencing basecalling

Abstract

Background

Nanopore-based DNA sequencing relies on basecalling the electric current signal. Basecalling requires neural networks to achieve competitive accuracies. To improve sequencing accuracy further, new models with new architectures are continuously proposed. However, benchmarking is currently not standardized, and the evaluation metrics and datasets used are defined on a per-publication basis, impeding progress in the field. This makes it impossible to distinguish data-driven from model-driven improvements.

Results

To standardize the process of benchmarking, we unified existing benchmarking datasets and defined a rigorous set of evaluation metrics. We benchmarked the seven latest basecaller models by recreating and analyzing their neural network architectures. Our results show that, overall, Bonito’s architecture is the best for basecalling. We find, however, that species bias in training can have a large impact on performance. Our comprehensive evaluation of 90 novel architectures demonstrates that different models excel at reducing different types of errors, and that recurrent neural networks (long short-term memory) and a conditional random field decoder are the main drivers of high-performing models.
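
As an illustration of what comparing "different types of errors" can involve, the following is a minimal sketch, not the metric set defined in the article. It assumes each basecalled read has already been aligned to its reference and that the alignment is available as an extended CIGAR string, in which "=" marks matches and "X" marks mismatches. The function name error_profile and the choice of denominator (matches + mismatches + insertions + deletions) are assumptions made for this example.

```python
import re
from collections import Counter

# One CIGAR operation: a run length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def error_profile(cigar):
    """Return per-read match/mismatch/insertion/deletion rates.

    Hypothetical helper: rates are fractions of the alignment length,
    defined here as =, X, I and D bases. Clipping (S, H), skips (N)
    and padding (P) are ignored.
    """
    counts = Counter()
    for length, op in _CIGAR_OP.findall(cigar):
        counts[op] += int(length)

    # "M" does not distinguish matches from mismatches, so it cannot be used.
    if counts["M"]:
        raise ValueError("extended CIGAR (=/X) required, found 'M' operations")

    aligned = counts["="] + counts["X"] + counts["I"] + counts["D"]
    if aligned == 0:
        raise ValueError("no aligned bases in CIGAR string")

    return {
        "match": counts["="] / aligned,
        "mismatch": counts["X"] / aligned,
        "insertion": counts["I"] / aligned,
        "deletion": counts["D"] / aligned,
    }


if __name__ == "__main__":
    # 8 matches, 1 mismatch, 2 inserted bases, 1 deleted base.
    print(error_profile("5=1X2I3=1D"))
```

Averaging such per-read profiles per model would show whether one model mainly reduces, say, deletions while another mainly reduces mismatches. Other denominators (for example, reference length) are equally defensible and would change the numbers.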

Conclusions

We believe that our work can facilitate the benchmarking of new basecaller tools and that the community can further expand on this work.

Supplementary Information

The online version contains supplementary material available at 10.1186/s13059-023-02903-2.

Author and article information

Contributors

Marc Pagès-Gallego: m.pagesgallego@umcutrecht.nl

ORCID: http://orcid.org/0000-0001-8888-5699

Jeroen de Ridder: j.deridder-4@umcutrecht.nl

Journal

Journal ID (nlm-ta): Genome Biol

Journal ID (iso-abbrev): Genome Biol

Title: Genome Biology

Publisher: BioMed Central (London)

ISSN (Print): 1474-7596

ISSN (Electronic): 1474-760X

Publication date (Electronic): 11 April 2023

Publication date PMC-release: 11 April 2023

Publication date Collection: 2023

Volume: 24

Electronic Location Identifier: 71

Affiliations

[1] GRID grid.7692.a, ISNI 0000000090126352, Center for Molecular Medicine, University Medical Center Utrecht, Universiteitsweg 100, 3584 CG Utrecht, The Netherlands

[2] GRID grid.499559.d, Oncode Institute, Utrecht, The Netherlands

Article

Publisher ID: 2903

DOI: 10.1186/s13059-023-02903-2

PMC ID: 10088207

PubMed ID: 37041647

SO-VID: 2bb5aa2a-8514-4286-96cd-d84a3cfc9639

License:

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

History

Date received: 1 August 2022

Date accepted: 20 March 2023

Funding

Funded by: FundRef http://dx.doi.org/10.13039/100016036, Health~Holland

Award ID: LSHM19029

Award Recipient: Marc Pagès-Gallego

Custom metadata

ScienceOpen disciplines: Genetics

Keywords: nanopore, basecalling, benchmark, deep learning
