Document Type: Original/Review Paper

Authors

School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran.

10.22044/jadm.2025.15383.2641

Abstract

Document similarity matching is essential for efficient text retrieval, plagiarism detection, and content analysis. Existing studies in this field fall into three categories: statistical analysis, deep learning, and hybrid approaches. However, to the best of our knowledge, none of them account for the importance of named entities. In this paper, we propose DOSTE, a method that first extracts named entities and then uses them to enhance document similarity matching through statistical and graph-based analysis. Empirical results indicate that emphasizing named entities improves matching quality: DOSTE raises average recall by 9% on average compared to baseline methods. Unlike LLM-based approaches, DOSTE does not require extensive GPU resources. In addition, a qualitative interpretation of the results suggests that DOSTE is particularly effective at identifying similarity in short documents and in complex document comparisons.
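To make the idea of emphasizing named entities concrete, the following is a minimal illustrative sketch, not the DOSTE implementation. It assumes a crude capitalization heuristic (`toy_entities`) as a stand-in for a real named-entity recognizer, a hypothetical boost factor `ENTITY_WEIGHT`, and plain term-frequency cosine similarity as the statistical measure; none of these names or values come from the paper, and the graph-based analysis is not covered.

```python
import math
import re
from collections import Counter

ENTITY_WEIGHT = 3.0  # hypothetical boost factor, not a value from the paper


def tokenize(text):
    """Split text into alphabetic word tokens, keeping the original casing."""
    return re.findall(r"[A-Za-z]+", text)


def toy_entities(text):
    """Crude stand-in for an NER model (assumption): treat capitalized words
    that do not start a sentence as named entities."""
    entities = set()
    for sentence in re.split(r"[.!?]+\s*", text):
        words = tokenize(sentence)
        for word in words[1:]:  # skip the sentence-initial word
            if word[0].isupper():
                entities.add(word.lower())
    return entities


def weighted_vector(text, entity_weight=ENTITY_WEIGHT):
    """Term-frequency vector in which entity terms are scaled by entity_weight."""
    entities = toy_entities(text)
    counts = Counter(word.lower() for word in tokenize(text))
    return {
        term: count * (entity_weight if term in entities else 1.0)
        for term, count in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def entity_weighted_similarity(a, b, entity_weight=ENTITY_WEIGHT):
    """Similarity of two documents with named-entity terms emphasized."""
    return cosine(weighted_vector(a, entity_weight),
                  weighted_vector(b, entity_weight))


if __name__ == "__main__":
    doc_a = "The committee met in Tehran to discuss research funding."
    doc_b = "Officials in Tehran announced a new transport plan."
    print("plain:   ", entity_weighted_similarity(doc_a, doc_b, 1.0))
    print("boosted: ", entity_weighted_similarity(doc_a, doc_b, 3.0))
```

In this toy setting, two short documents that share an entity but little other vocabulary score higher once the entity term is boosted, which is the intuition behind weighting named entities in short-document comparison. A practical system would replace the heuristic with a trained NER model.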
