Document Type: Technical Paper

Authors

Computer Engineering Department, Yazd University, Yazd, Iran.

10.22044/jadm.2025.15493.2667

Abstract

In the era of massive data, analyzing bioinformatics data and discovering protein functions is very important. Sequencing techniques generate sequences at a rapidly increasing rate, leaving researchers with many proteins whose functions are unknown. One of the essential operations in bioinformatics is therefore the classification of sequences to characterize unknown proteins. Sequences can be classified by two kinds of methods: traditional and modern. Traditional methods rely on sequence alignment, which has a high computational cost. Modern methods instead extract features from the sequences and classify proteins from those features; DeepFam is one such method. This research improves the DeepFam model, with a special focus on extracting features that differentiate the sequences of different categories. As the model improved, the features tended to be more generic. The Grad-CAM method was used to analyze the extracted features and to interpret the layers of the improved network. We then used the fitting vector from the transformer model to check the performance of Grad-CAM. The COG database, a massive database of protein sequences, was used to check the accuracy of the presented method. We show that extracting more efficient features allows the conserved regions in the sequences to be discovered more accurately, which helps to classify the proteins better. One of the critical advantages of the presented method is that it retains the necessary flexibility as the number of categories increases, and its classification accuracy in three tests is higher than that of other methods.
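To make the role of Grad-CAM in the abstract concrete, the following is a minimal sketch of Grad-CAM applied to a one-dimensional convolutional classifier over a protein sequence. It is not the network described in this paper: the toy architecture (one convolution layer, global max pooling, one linear layer), the hand-built motif filters, the weights, and the names `one_hot`, `conv_relu`, and `grad_cam_1d` are all assumptions chosen for illustration. The sketch only shows how class-specific gradients can weight the feature maps so that positions supporting a class stand out.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(seq):
    """Encode a protein sequence as a list of 20-dimensional one-hot rows."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in seq]


def conv_relu(x, filters):
    """Valid 1D convolution followed by ReLU; returns maps[k][i].

    Assumes all filters share the same width (hypothetical simplification).
    """
    maps = []
    for f in filters:  # each filter has shape (width, 20)
        width = len(f)
        row = []
        for i in range(len(x) - width + 1):
            s = sum(f[j][c] * x[i + j][c]
                    for j in range(width) for c in range(len(AMINO_ACIDS)))
            row.append(max(0.0, s))
        maps.append(row)
    return maps


def grad_cam_1d(seq, filters, dense, target):
    """Grad-CAM over window positions for a toy conv -> max-pool -> linear model.

    The class score is sum_k dense[target][k] * max_i maps[k][i]. Its gradient
    with respect to maps[k][i] is dense[target][k] at the arg-max position and
    zero elsewhere, so averaging the gradient over positions gives
    alpha_k = dense[target][k] / n_positions.
    """
    maps = conv_relu(one_hot(seq), filters)
    n_pos = len(maps[0])
    alphas = [dense[target][k] / n_pos for k in range(len(maps))]
    # Weighted sum of feature maps, then ReLU to keep only positive evidence.
    return [max(0.0, sum(alphas[k] * maps[k][i] for k in range(len(maps))))
            for i in range(n_pos)]


# Illustrative setup: a one-hot motif doubles as a filter that fires on that motif.
filters = [one_hot("HW"), one_hot("AC")]
dense = [[1.0, -1.0]]  # one class: "HW" is evidence for it, "AC" against it
cam = grad_cam_1d("ACDHWKLM", filters, dense, target=0)
print(cam.index(max(cam)))  # window start with the strongest positive evidence
```

In this toy case the map peaks at the window covering the "HW" motif, while the window covering "AC" is suppressed by the ReLU because its filter carries a negative class weight. In a trained network the filters and weights would be learned, and the gradients would normally come from automatic differentiation rather than the closed form used here.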
