Document and Text Processing
A. Shojaie; F. Safi-Esfahani
Abstract
With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. Applying search engines is one of the plagiarism detection techniques that converts plagiarism patterns to search queries. Generating suitable queries is the heart of this technique and existing ...
Read More
With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. Applying search engines is one of the plagiarism detection techniques that converts plagiarism patterns to search queries. Generating suitable queries is the heart of this technique and existing methods suffer from lack of producing accurate queries, Precision and Speed of retrieved results. This research proposes a framework called ParaMaker. It generates accurate paraphrases of any sentence, similar to human behaviors and sends them to a search engine to find the plagiarism patterns. For English language, ParaMaker was examined against six known methods with standard PAN2014 datasets. Results showed an improvement of 34% in terms of Recall parameter while Precision and Speed parameters were maintained. In Persian language, statements of suspicious documents were examined compared to an exact search approach. ParaMaker showed an improvement of at least 42% while Precision and Speed were maintained.
Document and Text Processing
F. Safi-Esfahani; Sh. Rakian; M.H. Nadimi-Shahraki
Abstract
Plagiarism which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them” poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories of direct, paraphrasing (rewriting), ...
Read More
Plagiarism which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them” poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories of direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-lingual plagiarism. In cross-lingual translation, writers meld a translation with their own words and ideas. Based on monolingual plagiarism detection methods, this paper ultimately intends to find a way to detect cross-lingual plagiarism. A framework called Multi-Lingual Plagiarism Detection (MLPD) has been presented for cross-lingual plagiarism analysis with ultimate objective of detection of plagiarism cases. English is the reference language and Persian materials are back translated using translation tools. The data for assessment of MLPD were obtained from English-Persian Mizan parallel corpus. Apache’s Solr was also applied to record the creep of the documents and their indexation. The accuracy mean of the proposed method revealed to be 98.82% when employing highly accurate translation tools which indicate the high accuracy of the proposed method. Also, Google translation service showed the accuracy mean to be 56.9%. These tests demonstrate that improved translation tools enhance the accuracy of the proposed method.