Development of Efficient Algorithms for Multiple Sequence Alignment of Large Genomes

Authors

  • Ngoun Sophorn, Department of Bioinformatics, Norton University, Phnom Penh

Keywords:

Multiple sequence alignment, large genomes, divide-and-conquer, syntenic blocks, profile HMM, sparse suffix arrays, wavelet trees, distributed computing, alignment accuracy, pan-genome.

Abstract

Multiple sequence alignment (MSA) of large genomes is a cornerstone of comparative genomics, evolutionary biology, and functional annotation. However, as genomes grow in size and number, existing MSA tools often become prohibitive in both time and memory. This work presents algorithmic strategies and data structures that improve both scalability and accuracy for large-genome MSA. We introduce a hybrid divide-and-conquer framework that partitions genomes into syntenic blocks using efficient anchoring heuristics and then refines each block locally with profile-profile hidden Markov model (HMM) alignment. Indexing with sparse suffix arrays and wavelet trees enables fast seed discovery and a compact representation of alignment profiles, reducing memory overhead from O(N × G) to O(N + G). Parallelization through map-reduce style workflows yields near-linear speed-up in distributed computing environments. Evaluated on simulated and real-world datasets, including human, mouse, and plant genomes totalling over 10 Gb, our algorithms run 3–5× faster and use up to 60% less memory than leading tools such as MAFFT, Clustal Omega, and Cactus, while maintaining alignment accuracy or improving it by 5–8% on standard metrics (SP-score, TC-score). We also demonstrate applications to pan-genome analysis of Arabidopsis thaliana accessions and to structural variant detection in mammalian genomes. Our approach makes feasible large-scale comparative analyses that were previously limited to small genomes or partial synteny. These innovations chart a path toward efficient and accurate genomic alignment in an era of ever-growing sequencing data.
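
The anchoring step that precedes block partitioning can be illustrated with a small sketch. The abstract does not specify the anchoring heuristic, so the code below is an assumption for illustration only: it uses a plain dictionary k-mer index in place of the sparse suffix arrays named above, takes k-mers that occur exactly once in each of two sequences as anchors, and keeps the longest collinear chain of anchors (a longest-increasing-subsequence computation). The function names `unique_kmer_anchors` and `longest_collinear_chain` are hypothetical and do not come from the paper.

```python
from bisect import bisect_left
from collections import defaultdict


def unique_kmer_anchors(a, b, k):
    """Return sorted (i, j) pairs where a k-mer occurs exactly once in a (at i)
    and exactly once in b (at j). A dict index stands in for a suffix array."""
    def index(seq):
        pos = defaultdict(list)
        for i in range(len(seq) - k + 1):
            pos[seq[i:i + k]].append(i)
        return pos

    ia, ib = index(a), index(b)
    return sorted(
        (p[0], ib[kmer][0])
        for kmer, p in ia.items()
        if len(p) == 1 and len(ib.get(kmer, ())) == 1
    )


def longest_collinear_chain(anchors):
    """Given anchors sorted by i, return the longest subset that is strictly
    increasing in j as well (longest increasing subsequence, O(n log n))."""
    tails = []      # tails[r]: smallest final j among chains of length r + 1
    tail_idx = []   # tail_idx[r]: index in anchors of that chain's last anchor
    prev = [-1] * len(anchors)  # back-pointers for chain reconstruction
    for n, (_, j) in enumerate(anchors):
        r = bisect_left(tails, j)
        if r == len(tails):
            tails.append(j)
            tail_idx.append(n)
        else:
            tails[r] = j
            tail_idx[r] = n
        prev[n] = tail_idx[r - 1] if r > 0 else -1

    chain = []
    n = tail_idx[-1] if tail_idx else -1
    while n != -1:
        chain.append(anchors[n])
        n = prev[n]
    return chain[::-1]


if __name__ == "__main__":
    anchors = unique_kmer_anchors("ACGTAC", "TACGTA", 3)
    print(longest_collinear_chain(anchors))
```

In a divide-and-conquer setting, the gaps between consecutive anchors in the chain would delimit the smaller regions handed to a local aligner; a production implementation would also need to handle reverse-complement matches, rearrangements, and more than two genomes, none of which this sketch attempts.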

Published

2025-07-20