SLOW5: a new file format enables massive acceleration of nanopore sequencing data analysis
Nanopore sequencing is an emerging genomic technology with great potential. However, the storage and analysis of nanopore sequencing data have become major bottlenecks preventing more widespread adoption in research and clinical genomics. Here, we elucidate an inherent limitation in the file format used to store raw nanopore data – known as FAST5 – that prevents efficient analysis on high-performance computing (HPC) systems. To overcome this, we have developed SLOW5, an alternative file format that permits efficient parallelisation and, thereby, acceleration of nanopore data analysis. For example, we show that using SLOW5 format instead of FAST5 reduces the time and cost of genome-wide DNA methylation profiling by an order of magnitude on common HPC systems (exceeding 30X in some cases), and delivers consistent improvements on a wide range of different architectures. With a simple, accessible file structure and significant reductions in size compared to FAST5 (~25% reduction for a typical human genome), SLOW5 format will deliver substantial benefits to all areas of the nanopore community.
Example results: With the maximum resource allocation available on Australia's National Computational Infrastructure, genome-wide DNA methylation profiling on a single ~30X human genome sequencing dataset runs for >14 days at a cost of >$500 when using FAST5 files with Nanopolish/f5c software. With SLOW5 format as the input, we were able to complete whole-genome methylation profiling on a single ~30X human dataset in just ~10.5 hours. On a typical ~30X human genome sequencing dataset, SLOW5 was ~25% smaller in size, equating to a reduction of ~300 Gbytes and a saving of ~$50 per month on Amazon cloud storage.
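The figures above can be cross-checked with some simple arithmetic. The short Python sketch below is only an illustration: it takes the approximate values quoted in the text (">14 days", "~10.5 hours", "~25%", "~300 Gbytes") at face value, and the variable names are ours rather than part of any SLOW5 software. Because the inputs are rounded bounds, the outputs are rough estimates, not measurements.

```python
# Approximate figures quoted in the text (rounded bounds, not exact measurements).
fast5_runtime_hours = 14 * 24    # ">14 days" with FAST5 input, taken as 14 days
slow5_runtime_hours = 10.5       # "~10.5 hours" with SLOW5 input

# Implied speed-up of methylation profiling when switching input format.
speedup = fast5_runtime_hours / slow5_runtime_hours

size_reduction_fraction = 0.25   # "~25% smaller"
size_saving_gb = 300             # "~300 Gbytes" saved

# Dataset sizes implied by those two numbers (assumption: both refer to the
# same ~30X human genome dataset).
fast5_size_gb = size_saving_gb / size_reduction_fraction
slow5_size_gb = fast5_size_gb - size_saving_gb

print(f"Implied speed-up: ~{speedup:.0f}X")
print(f"Implied FAST5 size: ~{fast5_size_gb:.0f} Gbytes")
print(f"Implied SLOW5 size: ~{slow5_size_gb:.0f} Gbytes")
```

Under these assumptions the runtime figures imply a speed-up of roughly 32X, consistent with the ">30X" stated above, and the storage figures imply a FAST5 dataset of roughly 1.2 Tbytes shrinking to roughly 900 Gbytes in SLOW5 format.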
SLOW5 format and all associated software are free and open source.
SLOW5 format specification documents: https://hasindu2008.github.io/slow5specs
Slow5lib: https://hasindu2008.github.io/slow5lib
Slow5tools: https://hasindu2008.github.io/slow5tools
Pre-print: https://www.biorxiv.org/content/10.1101/2021.06.29.450255v1