H.3. Artificial Intelligence
Naeimeh Mohammad Karimi; Mehdi Rezaeian
Abstract
In the era of massive data, analyzing biological sequence data and discovering the functions it encodes is very important. Sequencing techniques are generating sequences at a rapidly increasing rate, and researchers are faced with many sequences whose functions are unknown. One of the essential operations in bioinformatics is the classification of sequences to discover unknown proteins. There are two approaches to classifying sequences: traditional and modern. Traditional methods rely on sequence alignment, which has a high computational cost. Modern methods instead classify proteins using extracted features; DeepFam is one such method. This research improves the DeepFam model, with a special focus on extracting features that discriminate between sequences of different categories. As the model was improved, the extracted features tended to become more generic. The Grad-CAM method was used to analyze the extracted features and to interpret the layers of the improved network. We then used the fitting vector from the transformer model to check the performance of Grad-CAM. The COG database, a massive database of protein sequences, was used to evaluate the accuracy of the presented method. We show that extracting more efficient features allows the conserved regions in the sequences to be discovered more accurately, which in turn helps to classify the proteins better. One of the critical advantages of the presented method is that it retains the necessary flexibility as the number of categories increases, and its classification accuracy in three tests is higher than that of other methods.
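To illustrate the kind of analysis Grad-CAM provides for sequence models, the following is a minimal sketch of the standard Grad-CAM weighting applied along a one-dimensional sequence. It is not the authors' implementation. The function name `grad_cam_1d` and the toy inputs are hypothetical, and the sketch assumes that the feature maps of a convolutional layer and the gradients of the class score with respect to them have already been obtained from the network.

```python
def grad_cam_1d(feature_maps, gradients):
    """Compute a Grad-CAM importance profile along a sequence.

    feature_maps: K channels, each a list of L activations (assumed given).
    gradients:    K channels, each a list of L values holding the gradient of
                  the class score with respect to the matching activation.
    Returns a list of L scores scaled to [0, 1].
    """
    if len(feature_maps) != len(gradients):
        raise ValueError("feature_maps and gradients need the same channel count")

    length = len(feature_maps[0])

    # Channel weight = gradient averaged over all sequence positions.
    weights = [sum(channel) / len(channel) for channel in gradients]

    cam = []
    for pos in range(length):
        # Weighted sum of the channels at this position, followed by ReLU.
        score = sum(w * fm[pos] for w, fm in zip(weights, feature_maps))
        cam.append(max(score, 0.0))

    # Scale by the peak so that profiles are comparable across sequences.
    peak = max(cam)
    if peak > 0:
        cam = [value / peak for value in cam]
    return cam


# Toy example (hypothetical values): 2 channels over 4 sequence positions.
toy_maps = [[0.0, 1.0, 2.0, 0.0], [1.0, 0.0, 1.0, 3.0]]
toy_grads = [[0.5, 0.5, 0.5, 0.5], [-1.0, -1.0, -1.0, -1.0]]
profile = grad_cam_1d(toy_maps, toy_grads)
```

Positions with high scores in the resulting profile are those that contributed most to the predicted class; in a protein classifier, such positions would be candidates for conserved regions.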