Compression rates of microbial genomes are associated with genome size and base composition

Bibliographic Details
Main Authors: Jon Bohlin, John H.-O. Pettersson
Format: Article
Language: English
Published: BioMed Central 2024-10-01
Series: Genomics & Informatics
Subjects: MBGC; ZPAQ; Microbial Genomics; Compression; Information potential; Base composition
Online Access: https://doi.org/10.1186/s44342-024-00018-z
author Jon Bohlin
John H.-O. Pettersson
collection DOAJ
description Abstract
Background: To what degree a string of symbols can be compressed reveals important details about its complexity. For instance, strings that are not compressible are random and carry a low information potential, while the opposite is true for highly compressible strings. We explore to what extent microbial genomes are amenable to compression, as they vary considerably with respect to both size and base composition. For instance, microbial genome sizes vary from less than 100,000 base pairs in symbionts to more than 10 million in soil-dwellers. Genomic base composition, often summarized as genomic AT or GC content due to the similar frequencies of adenine and thymine on the one hand and cytosine and guanine on the other, also varies substantially; the most extreme microbes can have genomes with AT content below 25% or above 85%. Base composition determines the frequency of DNA words, that is, oligonucleotides consisting of multiple nucleotides, and may therefore also influence compressibility. Using 4,713 RefSeq genomes, we used generalized additive models to examine the association between compressibility, measured with both a DNA-based (MBGC) and a general-purpose (ZPAQ) compression algorithm, and genome size, AT content, and genomic oligonucleotide usage variance (OUV).
Results: We find that genome size (p < 0.001) and OUV (p < 0.001) are both strongly associated with genome redundancy for both types of file compressor. The DNA-based MBGC compressor improved compression by approximately 3% on average relative to ZPAQ. Moreover, MBGC detected a significant (p < 0.001) difference in compression ratio between AT-poor and AT-rich genomes that was not detected with ZPAQ.
Conclusion: As lack of compressibility is equivalent to randomness, our findings suggest that smaller and AT-rich genomes may have accumulated more random mutations on average than larger and AT-poor genomes, which, in turn, were significantly more redundant. Moreover, we find that OUV is a strong proxy for genome compressibility in microbial genomes. The ZPAQ compressor was found to agree with the MBGC compressor, albeit with poorer performance, except regarding the difference in compressibility between AT-rich and AT-poor (GC-rich) genomes.
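The two quantities the abstract relates, AT content and compressibility, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the standard-library `lzma` compressor is used here only as a stand-in for MBGC and ZPAQ, the function names `at_content` and `compression_ratio` are hypothetical, and the two test sequences are synthetic rather than real genomes.

```python
import lzma
import random


def at_content(seq: str) -> float:
    """Fraction of A and T among the A/C/G/T bases in seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return (seq.count("A") + seq.count("T")) / acgt


def compression_ratio(seq: str) -> float:
    """Compressed size over original size in bytes (lower = more redundant).

    Assumption: lzma stands in for the MBGC/ZPAQ compressors named above.
    """
    raw = seq.upper().encode("ascii")
    if not raw:
        raise ValueError("empty sequence")
    return len(lzma.compress(raw, preset=9)) / len(raw)


# Synthetic illustration: a seeded pseudo-random sequence versus a
# highly repetitive one of the same length.
rng = random.Random(0)
random_seq = "".join(rng.choice("ACGT") for _ in range(20_000))
repetitive_seq = "ACGTTGCA" * 2_500

for name, seq in [("random", random_seq), ("repetitive", repetitive_seq)]:
    print(f"{name}: AT={at_content(seq):.3f} ratio={compression_ratio(seq):.3f}")
```

The repetitive sequence should compress to a much smaller fraction of its size than the pseudo-random one, which is the sense in which the abstract treats low compressibility as a sign of randomness.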
format Article
id doaj-art-14a9d9f437f14c479766a1eb1ae4be7d
institution Kabale University
issn 2234-0742
language English
publishDate 2024-10-01
publisher BioMed Central
record_format Article
series Genomics & Informatics
spelling doaj-art-14a9d9f437f14c479766a1eb1ae4be7d
2025-08-20T03:54:29Z
eng
BioMed Central
Genomics & Informatics
2234-0742
2024-10-01
22 1 1 9
10.1186/s44342-024-00018-z
Compression rates of microbial genomes are associated with genome size and base composition
Jon Bohlin (Norwegian Institute of Public Health, Domain for Infection Control, Section for Modeling and Bioinformatics)
John H.-O. Pettersson (Zoonosis Science Center, Clinical Microbiology, Department of Medical Sciences, University of Uppsala)
https://doi.org/10.1186/s44342-024-00018-z
MBGC; ZPAQ; Microbial Genomics; Compression; Information potential; Base composition
title Compression rates of microbial genomes are associated with genome size and base composition
topic MBGC
ZPAQ
Microbial Genomics
Compression
Information potential
Base composition
url https://doi.org/10.1186/s44342-024-00018-z