Lossless and reference-free compression of FASTQ/A files using GeneSqueeze

Abstract As sequencing becomes more accessible, there is an acute need for novel compression methods to efficiently store sequencing files. Omics analytics can leverage sequencing technologies to enhance biomedical research and individualize patient care, but sequencing files demand immense storage...

Bibliographic Details
Main Authors:	Foad Nazari, Sneh Patel, Melissa LaRocca, Alina Sansevich, Ryan Czarny, Giana Schena, Emma K. Murray
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-01-01
Series:	Scientific Reports
Subjects:	Genomic data compression, FASTQ, FASTA, Lossless compression, k-mer sequence, DNA
Online Access:	https://doi.org/10.1038/s41598-024-79258-6

_version_	1841559643638202368
author	Foad Nazari, Sneh Patel, Melissa LaRocca, Alina Sansevich, Ryan Czarny, Giana Schena, Emma K. Murray
author_facet	Foad Nazari, Sneh Patel, Melissa LaRocca, Alina Sansevich, Ryan Czarny, Giana Schena, Emma K. Murray
author_sort	Foad Nazari
collection	DOAJ
description	Abstract As sequencing becomes more accessible, there is an acute need for novel compression methods to efficiently store sequencing files. Omics analytics can leverage sequencing technologies to enhance biomedical research and individualize patient care, but sequencing files demand immense storage capabilities, particularly when sequencing is utilized for longitudinal studies. Addressing the storage challenges posed by these technologies is crucial for omics analytics to achieve their full potential. We present a novel lossless, reference-free compression algorithm, GeneSqueeze, that leverages the patterns inherent in the underlying components of FASTQ files to solve this need. GeneSqueeze’s benefits include an auto-tuning compression protocol based on each file’s distribution, lossless preservation of IUPAC nucleotides and read identifiers, and unrestricted FASTQ/A file attributes (i.e., read length, number of reads, or read identifier format). We compared GeneSqueeze to the general-purpose compressor, gzip, and to a domain-specific compressor, SPRING, to assess performance. Due to GeneSqueeze’s current Python implementation, GeneSqueeze underperformed as compared to gzip and SPRING in the time domain. GeneSqueeze and gzip achieved 100% lossless compression across all elements of the FASTQ files (i.e., the read identifier, sequence, quality score, and ‘+’ lines). GeneSqueeze and gzip compressed all files losslessly, while both SPRING’s traditional and lossless modes exhibited data loss of non-ACGTN IUPAC nucleotides and of metadata following the ‘+’ on the separator line. GeneSqueeze showed up to three times higher compression ratios as compared to gzip, regardless of read length, number of reads, or file size, and had comparable compression ratios to SPRING across a variety of factors. Overall, GeneSqueeze represents a competitive and specialized compression method for FASTQ/A files containing nucleotide sequences. As such, GeneSqueeze has the potential to significantly reduce the storage and transmission costs associated with large omics datasets without sacrificing data integrity.
format	Article
id	doaj-art-2139c1d4c80a4aa98bd39a947c81f986
institution	Kabale University
issn	2045-2322
language	English
publishDate	2025-01-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Reports
spelling	doaj-art-2139c1d4c80a4aa98bd39a947c81f986 | 2025-01-05T12:19:36Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2025-01-01 | 151119 | 10.1038/s41598-024-79258-6 | Lossless and reference-free compression of FASTQ/A files using GeneSqueeze | Foad Nazari (0); Sneh Patel (1); Melissa LaRocca (2); Alina Sansevich (3); Ryan Czarny (4); Giana Schena (5); Emma K. Murray (6) | Rajant Health Incorporated; Rajant Health Incorporated; Rajant Health Incorporated; Rajant Health Incorporated; Rajant Health Incorporated; Rajant Health Incorporated; Rajant Health Incorporated | https://doi.org/10.1038/s41598-024-79258-6 | Genomic data compression; FASTQ; FASTA; Lossless compression; k-mer sequence; DNA
title	Lossless and reference-free compression of FASTQ/A files using GeneSqueeze
title_sort	lossless and reference free compression of fastq a files using genesqueeze
topic	Genomic data compression, FASTQ, FASTA, Lossless compression, k-mer sequence, DNA
url	https://doi.org/10.1038/s41598-024-79258-6

Similar Items