Lossless and reference-free compression of FASTQ/A files using GeneSqueeze

Abstract As sequencing becomes more accessible, there is an acute need for novel compression methods to efficiently store sequencing files. Omics analytics can leverage sequencing technologies to enhance biomedical research and individualize patient care, but sequencing files demand immense storage...

Bibliographic Details
Main Authors: Foad Nazari, Sneh Patel, Melissa LaRocca, Alina Sansevich, Ryan Czarny, Giana Schena, Emma K. Murray
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Scientific Reports
Subjects: Genomic data compression; FASTQ; FASTA; Lossless compression; k-mer sequence; DNA
Online Access: https://doi.org/10.1038/s41598-024-79258-6
_version_ 1841559643638202368
author Foad Nazari
Sneh Patel
Melissa LaRocca
Alina Sansevich
Ryan Czarny
Giana Schena
Emma K. Murray
author_sort Foad Nazari
collection DOAJ
description Abstract: As sequencing becomes more accessible, there is an acute need for novel compression methods to efficiently store sequencing files. Omics analytics can leverage sequencing technologies to enhance biomedical research and individualize patient care, but sequencing files demand immense storage capabilities, particularly when sequencing is utilized for longitudinal studies. Addressing the storage challenges posed by these technologies is crucial for omics analytics to achieve their full potential. We present a novel lossless, reference-free compression algorithm, GeneSqueeze, that leverages the patterns inherent in the underlying components of FASTQ files to meet this need. GeneSqueeze’s benefits include an auto-tuning compression protocol based on each file’s distribution, lossless preservation of IUPAC nucleotides and read identifiers, and unrestricted FASTQ/A file attributes (i.e., read length, number of reads, or read identifier format). We compared GeneSqueeze to the general-purpose compressor gzip and to a domain-specific compressor, SPRING, to assess performance. Due to its current Python implementation, GeneSqueeze underperformed compared to gzip and SPRING in the time domain. GeneSqueeze and gzip achieved 100% lossless compression across all elements of the FASTQ files (i.e., the read identifier, sequence, quality score, and ‘+’ lines), while both SPRING’s traditional and lossless modes exhibited data loss of non-ACGTN IUPAC nucleotides and of metadata following the ‘+’ on the separator line. GeneSqueeze showed up to three times higher compression ratios compared to gzip, regardless of read length, number of reads, or file size, and had comparable compression ratios to SPRING across a variety of factors. Overall, GeneSqueeze represents a competitive and specialized compression method for FASTQ/A files containing nucleotide sequences. As such, GeneSqueeze has the potential to significantly reduce the storage and transmission costs associated with large omics datasets without sacrificing data integrity.
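The abstract's evaluation criteria (the four FASTQ line types, byte-exact losslessness, compression ratio, and non-ACGTN IUPAC nucleotides) can be illustrated with a minimal Python sketch. This is not GeneSqueeze's algorithm, which the record does not describe in detail; it only uses gzip, the general-purpose baseline named in the abstract, from the Python standard library. The function names, the `SAMPLE` data, and the definition of compression ratio as original size divided by compressed size are assumptions made for illustration.

```python
import gzip

# Hypothetical two-record FASTQ sample (not taken from the article).
# The first record carries non-ACGTN IUPAC codes (R, Y) and metadata after '+'.
SAMPLE = (
    "@read1 sample\nACGTNRYACGT\n+read1 sample\nIIIIIIIIIII\n"
    "@read2 sample\nACGTACGTACG\n+\nFFFFFFFFFFF\n"
)


def split_fastq(text):
    """Split FASTQ text into its four per-record line streams."""
    lines = text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected exactly 4 lines per FASTQ record")
    return {
        "identifiers": lines[0::4],  # '@' read identifier lines
        "sequences": lines[1::4],    # nucleotide sequences
        "separators": lines[2::4],   # '+' lines, optionally with metadata
        "qualities": lines[3::4],    # quality score strings
    }


def gzip_report(text):
    """Compress with gzip, then check byte-exact round trip and size ratio."""
    raw = text.encode("ascii")
    packed = gzip.compress(raw)
    restored = gzip.decompress(packed)
    return {
        # Assumed definition: original size / compressed size.
        "ratio": len(raw) / len(packed),
        "lossless": restored == raw,
    }


def non_acgtn_bases(streams):
    """Return the sorted set of sequence symbols outside A, C, G, T, N."""
    seen = {base for seq in streams["sequences"] for base in seq.upper()}
    return sorted(seen - set("ACGTN"))


if __name__ == "__main__":
    streams = split_fastq(SAMPLE)
    print(streams["separators"])
    print(non_acgtn_bases(streams))
    print(gzip_report(SAMPLE))
```

Comparing the `separators` and `sequences` streams before and after a round trip is one way to detect the two kinds of loss the abstract reports for SPRING. On an input as small as `SAMPLE` the ratio is not meaningful, because gzip's fixed header overhead dominates.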
format Article
id doaj-art-2139c1d4c80a4aa98bd39a947c81f986
institution Kabale University
issn 2045-2322
language English
publishDate 2025-01-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-2139c1d4c80a4aa98bd39a947c81f986 | 2025-01-05T12:19:36Z | eng | Nature Portfolio | Scientific Reports | 2045-2322 | 2025-01-01 | 15 | 1 | 1 | 19 | 10.1038/s41598-024-79258-6 | Lossless and reference-free compression of FASTQ/A files using GeneSqueeze | Foad Nazari; Sneh Patel; Melissa LaRocca; Alina Sansevich; Ryan Czarny; Giana Schena; Emma K. Murray | Rajant Health Incorporated (listed for each of the seven authors) | https://doi.org/10.1038/s41598-024-79258-6 | Genomic data compression; FASTQ; FASTA; Lossless compression; k-mer sequence; DNA
title Lossless and reference-free compression of FASTQ/A files using GeneSqueeze
title_sort lossless and reference free compression of fastq a files using genesqueeze
topic Genomic data compression
FASTQ
FASTA
Lossless compression
k-mer sequence
DNA
url https://doi.org/10.1038/s41598-024-79258-6