Document Type : Original/Review Paper

Authors

1 Department of Computer Engineering, Islamic Azad University, Yazd Branch, Iran.

2 Department of Computer Engineering, Shiraz University,Iran.

3 Department of Computer Engineering, Islamic Azad University, Yazd Branch, Iran

4 Department of Systems Biology, Agricultural Biotechnology Research Institute of Iran, Agricultural Research, Education, and Extension Organization, Karaj, Tehran, Iran.

Abstract

Sequence alignment and genome mapping pose significant challenges, primarily focusing on speed and storage space requirements for mapped sequences. With the ever-increasing volume of DNA sequence data, it becomes imperative to develop efficient alignment methods that not only reduce storage demands but also offer rapid alignment. This study introduces the Parallel Sequence Alignment with a Hash-Based Model (PSALR) algorithm, specifically designed to enhance alignment speed and optimize storage space while maintaining utmost accuracy. In contrast to other algorithms like BLAST, PSALR efficiently indexes data using a hash table, resulting in reduced computational load and processing time. This algorithm utilizes data compression and packetization with conventional bandwidth sizes, distributing data among different nodes to reduce memory and transfer time. Upon receiving compressed data, nodes can seamlessly perform searching and mapping, eliminating the need for unpacking and decoding at the destination. As an additional innovation, PSALR not only divides sequences among processors but also breaks down large sequences into sub-sequences, forwarding them to nodes. This approach eliminates any restrictions on query length sent to nodes, and evaluation results are returned directly to the user without central node involvement. Another notable feature of PSALR is its utilization of overlapping sub-sequences within both query and reference sequences. This ensures that the search and mapping process includes all possible sub-sequences of the target sequence, rather than being limited to a subset. Performance tests indicate that the PSALR algorithm outperforms its counterparts, positioning it as a promising solution for efficient sequence alignment and genome mapping.

Keywords

Main Subjects

[1]  L. Hasan, Z. Al-Ars, and S. Vassiliadis, "Hardware acceleration of sequence alignment algorithms - an overview," in Proc. of the International Conference on Design & Technology of Integrated Systems in Nanoscale Era (DTIS), 2007, pp. 92-97, IEEE.
 
[2] P. Bawono et al., "Multiple sequence alignment," in Bioinformatics, Springer, 2017, pp. 167-189.
 
[3] S. B. Needleman and C. D. Wunsch, "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, no. 3, pp. 443-453, 1970.
 
[4]  J. D. G. De Herve et al., "A perceptual hash function to store and retrieve large scale DNA sequences," arXiv preprint arXiv:1412.5517, 2014.
 
[5] W. J. Wilbur and D. J. Lipman, "Rapid similarity searches of nucleic acid and protein data banks," Proceedings of the National Academy of Sciences, vol. 80, no. 3, pp. 726-730, 1983.
 
[6] J. Choi et al., "HIA: a genome mapper using hybrid index-based sequence alignment," Algorithms for Molecular Biology, vol. 10, no. 1, pp. 1-9, 2015.
 
[7] H. Li and N. Homer, "A survey of sequence alignment algorithms for next-generation sequencing," Briefings in Bioinformatics, vol. 11, no. 5, pp. 473-483, 2010.
 
[8] S. Bandyopadhyay and R. Mitra, "A parallel pairwise local sequence alignment algorithm," IEEE Transactions on NanoBioscience, vol. 8, no. 2, pp. 139-146, 2009.
 
[9] F. Mozafari et al., "Speeding up DNA sequence alignment by optical correlator," Optics & Laser Technology, vol. 108, pp. 124-135, 2018.
 
[10] H. Li and R. Durbin, "Fast and accurate short read alignment with Burrows–Wheeler transform," Bioinformatics, vol. 25, no. 14, pp. 1754-1760, 2009.
 
[11] R. Li et al., "SOAP: short oligonucleotide alignment program," Bioinformatics, vol. 24, no. 5, pp. 713-714, 2008.
 
[12] 1 B. Langmead, "Ultrafast and memory-efficient alignment of short DNA sequences to the human genome," Genome Biology, vol. 10, article R25, 2009.
 
[13] B. Langmead and S. L. Salzberg, "Fast gapped-read alignment with Bowtie 2," Nature Methods, vol. 9, no. 4, pp. 357-359, 2012.
 
[14]  S. Misra et al., "Anatomy of a hash-based long read sequence mapping algorithm for next generation DNA sequencing," Bioinformatics, vol. 27, no. 2, pp. 189-195, 2010.
 
[15] J. C. Mu et al., "Fast and accurate read alignment for resequencing," Bioinformatics, vol. 28, no. 18, pp. 2366-2373, 2012.
 
[16] S. F. Altschul et al., "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, no. 3, pp. 403-410, 1990.
 
[17] B. Ma, J. Tromp, and M. Li, "PatternHunter: faster and more sensitive homology search," Bioinformatics, vol. 18, no. 3, pp. 440-445, 2002.
 
[18] Z. Ning, A. J. Cox, and J. C. Mullikin, "SSAHA: a fast search method for large DNA databases," Genome Research, vol. 11, no. 10, pp. 1725-1729, 2001.
 
[19] F. J. Sedlazeck, P. Rescheneder, and A. Von Haeseler, "NextGenMap: fast and accurate read mapping in highly polymorphic genomes," Bioinformatics, vol. 29, no. 21, pp. 2790-2791, 2013.
 
[20] S. Canzar and S. L. Salzberg, "Short read mapping: An algorithmic tour," Proceedings of the IEEE, vol. 105, no. 3, pp. 436-458, 2017.
 
[21] H. Mohamadi et al., "ntHash: recursive nucleotide hashing," Bioinformatics, vol. 32, no. 22, pp. 3492-3494, 2016.
 
[22] T. D. Wu, "Bitpacking techniques for indexing genomes: II. Enhanced suffix arrays," Algorithms for Molecular Biology, vol. 11, pp. 1-16, 2016.
 
[23] D. Geng et al., "The implementation of KMP algorithm based on MPI+ OpenMP," in Proc. of the 9th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2012, IEEE.
 
[24] C. S. Kouzinopoulos, P. D. Michailidis, and K. G. Margaritis, "Performance study of parallel hybrid multiple pattern matching algorithms for biological sequences," in Proc. of the International Conference on Bioinformatics Models, Methods and Algorithms, 2012, SCITEPRESS.
 
[25] H. Li et al., "A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching," in Proc. IEEE 9th Symposium on Application Specific Processors (SASP), 2011, IEEE.
 
[26] Q. Xue, J. Xie, and J. S., "A parallel algorithm," in Proc. 2014 International Conference on Information Science, Electronics and Electrical Engineering, 2014.
 
[27] M. J. Chaisson and G. Tesler, "Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory," BMC Bioinformatics, vol. 13, p. 238, 2012.
 
[28] D. Peters, K. Qiu, and P. Liang, "Faster short DNA sequence alignment with parallel BWA," in AIP Conference Proceedings, 2011, American Institute of Physics.
 
[29] S. M. Rumble et al., "SHRiMP: accurate mapping of short color-space reads," PLoS Comput Biol, vol. 5, no. 5, e1000386, 2009.
 
[30] M. David et al., "SHRiMP2: sensitive yet practical short read mapping," Bioinformatics, vol. 27, no. 7, pp. 1011-1012, 2011.
 
[31] R. AlSaad, Q. Malluhi, and M. Abouelhoda, "Efficient parallel implementation of the SHRiMP sequence alignment tool using MapReduce," in Qatar Foundation Annual Research Forum Volume 2012 Issue 1, 2012, Hamad bin Khalifa University Press (HBKU Press).
 
[32] C.-M. Liu et al., "SOAP3: ultra-fast GPU-based parallel alignment tool for short reads," Bioinformatics, vol. 28, no. 6, pp. 878-879, 2012.
 
[33] P. Klus et al., "BarraCUDA—a fast short read sequence aligner using graphics processing units," BMC Research Notes, vol. 5, no. 1, p. 27, 2012.
 
[34] Y. Liu, B. Schmidt, and D. L. Maskell, "CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows–Wheeler transform," Bioinformatics, vol. 28, no. 14, pp. 1830-1837, 2012.
 
[35] T. Pan et al., "Kmerind: A flexible parallel library for K-mer indexing of biological sequences on distributed memory systems," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 16, no. 4, pp. 1117-1131, 2019.
 
[36] A. M. Esmat et al., "A parallel hash‐based method for local sequence alignment," Concurrency and Computation: Practice and Experience, vol. 2021, article e6568, 2021.
 
[37] H. Lin et al., "Efficient data access for parallel BLAST," in Proc. 19th IEEE International Parallel and Distributed Processing Symposium, 2005, IEEE.
 
[38] M. Nowicki, D. Bzhalava, and P. BaŁa, "Massively parallel implementation of sequence alignment with basic local alignment search tool using parallel computing in Java library," Journal of Computational Biology, vol. 25, no. 8, pp. 871-881, 2018.
[39] D. Dechev and A. Tae-Hyuk, "Using SST/Macro for effective analysis of MPI-based applications: Evaluating large-scale genomic sequence search," IEEE Access, vol. 1, pp. 428-435, 2013.
 
[40] T. Vijayaraghavan, A. Rajesh, and K. Sankaralingam, "MPU-BWM: Accelerating sequence alignment," IEEE Computer Architecture Letters, vol. 17, no. 2, pp. 179-182, 2018.
 
[41] H. Martinez et al., "Concurrent and accurate short read mapping on multicore processors," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 5, pp. 995-1007, 2015.
 
[42] J. W. Kim, E. Kim, and K. Park, "Fast matching method for DNA sequences," in Proc. International Symposium on Combinatorics, Algorithms, Probabilistic and Experimental Methodologies, 2007, Springer.
 
[43] A. Dobin et al., "STAR: ultrafast universal RNA-seq aligner," Bioinformatics, vol. 29, no. 1, pp. 15-21, 2013.
 
[44] Y. Chen, S. Yu, and M. Leng, "Parallel sequence alignment algorithm for clustering system," in Proc. International Conference on Programming Languages for Manufacturing, 2006, Springer.
 
[45] N. Homer, B. Merriman, and S. F. Nelson, "BFAST: an alignment tool for large scale genome resequencing," PloS One, vol. 4, no. 11, e7767, 2009.
 
[46] X. Yu and X. Liu, "Mapping RNA-seq reads to transcriptomes efficiently based on learning to hash method," Computers in Biology and Medicine, vol. 116, p. 103539, 2020.
 
[47] F. Peng et al., "New hash-based sequence alignment algorithm," in Proc. 2nd International Conference on Bioinformatics and Intelligent Computing, 2022.
 
[48] A. Joudaki et al., "Aligning distant sequences to graphs using long seed sketches," Genome Research, 2023, article gr.277659.123.
 
[49] H. Zhang et al., "ESA: An efficient sequence alignment algorithm for biological database search on Sunway TaihuLight," Parallel Computing, vol. 117, p. 103043, 2023.
 
[50] K. Xu, X. D. Kai, A. Müller, R. Kobus, B. Schmidt, and W. Liu, "FMapper: Scalable read mapper based on succinct hash index on SunWay TaihuLight," Journal of Parallel and Distributed Computing, vol. 161, p. 11, 2022.
 
[51] S. Suchindra, "New sequence alignment algorithm using AI rules and dynamic seeds," Bioscience & Engineering: An International Journal (BIOEJ), vol. 10, no. 1/2, 2023.
 
[52] G. Greenberg, A. N. Ravi, and I. Shomorony, "LexicHash: sequence similarity estimation via lexicographic comparison of hashes," Bioinformatics, 2023, article btad652.
[53] M. Zaharia et al., "Faster and more accurate sequence alignment with SNAP," arXiv preprint arXiv:1111.5572, 2011.
 
[54] S. Canzar and S. L. Salzberg, "Short read mapping: An algorithmic tour," Proceedings of the IEEE, vol. 105, no. 3, pp. 436-458, 2015.
 
[55] M. Shamsollahi, A. Badiee, and M. Ghazanfari, "Using combined descriptive and predictive methods of data mining for coronary artery disease prediction: A case study approach," Journal of AI and Data Mining, vol. 7, no. 1, pp. 47-58, 2019.