Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

Plagiarism detection, which protects authors' rights, is one of the critical problems in the field of text mining and attracts many researchers. It is considered a serious issue in higher academic institutions. Existing language-free tools do not yield reliable results because they ignore the special features of each language. Given the paucity of work on the Persian language and the lack of reliable Persian plagiarism checkers, a method is needed that improves the accuracy of detecting plagiarized Persian phrases. This work presents the PCP solution, a combinational method that, in addition to the meaning and stem of words, handles synonyms and pluralization, and applies a document tree representation in which the text is fingerprinted using word-based 3-grams. The grams obtained are extracted from the text, hashed with the BKDR hash function, and stored as the fingerprint of a document in the fingerprint repository of the reference documents, against which the suspicious documents are checked. The proposed PCP method is evaluated by eight experiments on seven different sets, which include suspicious documents and reference documents from the Hamshahri newspaper website. The results indicate that, in comparison with the localized "Winnowing" method, the proposed method improves the accuracy of detecting similar texts by 21.15% on average. In comparison with the language-free tool, the PCP method improves similarity detection by 31.65% on average.


Introduction
Nowadays plagiarism has become a cancer in the literary world. This important global issue is considered a serious crisis for higher academic institutions and even for freelance writing. The accessibility of digital documents on the World Wide Web makes it easy for swindlers to copy material from students and academics, allowing them to be promoted to high academic levels or grades without the required scientific background [1]. Plagiarism may include: replacing the original author's name; copying ideas, phrases, concepts, research proposals, articles, reports, computer program designs, websites, and internet and other electronic resources without citing the author's name; failing to cite quotations; false referencing or referencing nonexistent resources; translation plagiarism, where a translated text is submitted without reference to the original text; and artistic plagiarism, where media including images and videos are used in other works without proper reference to the resource [2,4]. There are two major approaches to reducing literary piracy: plagiarism detection and plagiarism prevention [3,4]. This work adopts the detection approach.
The position of this work within the field is shown in figure 1, where the relevant path through the hierarchy is marked in gray boxes. According to this tree diagram, plagiarism detection methods include manual methods and software tools, which are simple to implement and can be applied to plagiarism detection [3]. Software plagiarism detection is categorized by text homogeneity into monolingual plagiarism detection and cross-lingual plagiarism detection [2]. Monolingual plagiarism detection refers to a homogeneous environment such as English to English; nearly all existing systems are developed for this setting, and they are divided into the inherent and external types [2,4]. Cross-lingual plagiarism detection refers to detecting plagiarism across texts in different languages, such as English and Arabic. In this method, documents similar to the suspicious document are retrieved in a cross-lingual environment [5]. In inherent plagiarism detection, also named the stylometry-based method, there is no reference document, and only the suspicious document is examined [2]. The objective of inherent plagiarism detection is to identify potential plagiarism by analyzing changes in writing style [6].
In external plagiarism detection, also named the content-based method, a suspicious document is compared with a number of documents, and the text contents are analyzed based on their logical structure and the detection of similarities among the texts [2]. In this method, the text is investigated through textual features, which includes removing stop words from the text [7]. The common techniques built on the content-based method rely on an explicit comparison of the document contents. Most detection methods use stop word deletion [3]. The objective of this work is to improve the accuracy of detecting similarities among plagiarized phrases in Persian texts by using the stems of the words and a document tree representation, and by applying the fingerprinting technique to word-based 3-grams. The novel aspects of the proposed method are preprocessing operations that are more accurate than those of previous works, and the replacement of broken (irregular) plurals. Fingerprinting over the document tree representation introduces tree nodes whose key holds the hash value of a trio of their children. Therefore, in copy detection, only branches with the same hash values are considered, which prevents excessive searching. The rest of this article is organized as follows: a literature review is presented in section 2. The solution and the operations used for text preprocessing, document tree representation, text fingerprinting, and detection of suspected phrases are presented in section 3. The presented combinational method (PCP) is discussed in section 4, and, finally, the conclusion is presented in section 5.

Literature review
Language-free plagiarism detection tools are inefficient on texts in languages like Persian and Arabic, and the outputs of these tools are imprecise and unreliable because they do not consider the special features and structural complexities of these languages [3]. Hence, language-sensitive tools should be used. Despite the endeavors in this field in recent years, no up-to-date and efficient tool has been presented for Persian texts. ZiHayat and Basiri have presented a tool that measures the extent of copied phrases in Persian electronic documents through a native user interface based on the "Winnowing" algorithm [8]. The average accuracy of this tool is 64%, which is relatively low. More recent algorithms for document categorization and natural language processing could be adopted in order to improve the accuracy of this system [9]. Kamran et al. have also presented a tool for detecting plagiarism in Persian documents using the "Simhash" algorithm which, despite its low accuracy, is fast in detecting plagiarism in a large collection of texts. The inputs of the system are 300 reference articles and 25 suspicious articles, on which phrase similarities are detected using word-based grams and the "Simhash" and "Shingling" algorithms. The developers have concluded that for large sets of Persian documents, the "Simhash" algorithm is the more suitable method despite its low accuracy [10].
Mahmoodi and MahmmodiVarNamkhasti have proposed another tool for plagiarism detection: a precise tool for detecting plagiarism in short paragraphs [1]. The inputs of this tool are a suspicious document and a reference document, each consisting of a single paragraph, so it cannot detect plagiarism in documents with multiple paragraphs. Even granting its high accuracy on short paragraphs, if either input document contains more than one paragraph, the results are of low accuracy and unreliable.
Mahdavi et al. have adapted the vector space model to detect external plagiarism in Persian texts. In their article, 41 reference documents and 84 suspicious documents were created by the developers, and using the vector space model and the cosine similarity among them, the documents warranting more detailed processing were selected as the candidates. Next, a similarity coefficient shows the overlap of the 3-gram features comprising each document, through which the probable similarities are discovered. Because the vector of a document holds an entry for every feature, it requires both more memory and a long processing time when searching for similar documents; the size and number of features of this vector depend on the length and wording of the documents [11].
Rakian et al. have used a new fuzzy algorithm that considers the different levels of a hierarchical text and uses synonyms to determine the degree of similarity between two sentences in Persian texts, and hence to detect external plagiarism in Persian texts. Here, 1,000 reference documents and 400 suspicious documents were established, in which sentences that have been structurally changed and rewritten are recognizable. Candidate documents are retrieved based on the keywords of the text and divided into their constituent sentences, after which the potential similarities are detected by the fuzzy methods [12]. An increase in the number of sentence divisions can slow down the processing and increase the memory consumption.

Proposed combinational method (PCP)
Implementing this combinational method includes text preprocessing, document tree representation, text fingerprinting, and copy detection (see figure 2). In this study, the fingerprints are based upon 3-grams of the text created at the different levels of the document tree representation. This representation is obtained by traversing the tree bottom-up.
The final fingerprint of a document, formed from the paragraph-level hashes, is smaller in volume than the hashes produced at the level of the word-based 3-grams. The fingerprint of a document is thus compressed, which addresses the memory-consumption shortcoming of the approach presented in [3,7] and of similar works on other languages. In their fingerprinting scheme, the word-level hashes were copied into their parent nodes, which created a high volume of word-level hashes in the fingerprint of a document. Moreover, the fingerprinting scheme of the PCP method leads to a similarity detection approach that differs from the method proposed in [3,7].

Text preprocessing measure
Text preprocessing is run in order to clean the text and delete useless information from it, which raises the accuracy and reduces the time required for similarity detection.
According to figure 3, this measure includes the following steps:
1. Text segmentation: here, the text is separated into its constituent paragraphs.
2. Sentence tokenization: here, the constituent sentences of a paragraph are separated at the punctuation marks "?", "!", and ".", and the excess spaces in each paragraph are replaced by one space; therefore, it is assumed that all sentences are separated by a single space.
3. Word tokenization: here, for every sentence specified in the previous step, the boundaries of words and punctuation marks are determined, so that each sentence is broken into its constituent words.
4. Number replacement: here, number characters are replaced by the "#" sign, which makes similarity detection independent of the specific numbers in the text.
5. Word normalization: here, operations like removing ellipses (three dots) from the text, inserting a half-space between words and prefixes or suffixes such as "می، نمی، تر، ترین، ها", and finally, replacing the excess spaces with one space are applied to normalize the words.
6. Stop word removal: function words such as "به، اما، ولی، را، از، و" are among the most frequent words in the Persian language. They occur in all texts and must be ignored when assessing the similarity of texts because they carry no special semantic weight.
7. Broken plural replacement: in the Persian language, there are words that have the same stem but an irregular (broken) plural, like the word "اخبار", which is the plural of the word "خبر". It is worth mentioning that this step is being presented for the first time for the Persian language. The input of the function is a word of the document (see figure 4). If this word is a broken plural, it is replaced by its singular form, which homogenizes this class of words. This step requires a lexicon of plurals in the Persian language. For this purpose, the Persian Gate 6.0 plug-in, which is applied in natural language processing in [13], is used.
8. Synonym replacement: in the Persian language, there are words that have the same meaning but different stems, such as the word "پند", which has the synonyms "نصیحت، عبرت، رهنمون، موعظه، اندرز، وعظ". If such words occur in a sentence, all of them are replaced with a single root word, here "پند", which homogenizes this class of words in the text. The input of the function is a word of the document (see figure 5).
9. Part-of-speech (POS) tagging: here, the remaining words of the text are tagged with their grammatical types, like noun, verb, adverb, adjective, and punctuation mark [15]. This step is important for determining the stem of the words.
10. Stemming: here, the words are stemmed based on the tag given to them in the previous step, by removing prefixes, suffixes, and infixes from the word, respectively. In this manner, the different derivational and inflectional forms of a word do not affect similarity detection. For example, the words "می‌رود" and "رفته بود" are verbs with the stems "رفت" in the past and "رو" in the present. This process is made possible by the trained model in NHazm [16], which is a tool for processing Persian natural language in the Visual Studio environment.
11. Punctuation removal: in this step, all the writing signs and punctuation marks in the text are removed.
12. Lemmatization: in the final step, words are replaced with their dictionary forms. This step uses the tag and the stem of each word.
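A few of the steps above (sentence tokenization, number replacement, stop word removal, broken plural replacement, and synonym replacement) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the authors' C# implementation: the three small lexicons are hypothetical stand-ins for the full Persian lexicons used in the paper, each run of digits is assumed to map to a single "#", and normalization, POS tagging, stemming, and lemmatization are omitted.

```python
import re

# Hypothetical, tiny lexicons for illustration only; the paper relies on
# full Persian plural and synonym lexicons.
STOP_WORDS = {"به", "اما", "ولی", "را", "از", "و"}
BROKEN_PLURALS = {"اخبار": "خبر"}
SYNONYMS = {"نصیحت": "پند", "اندرز": "پند", "موعظه": "پند"}


def split_sentences(paragraph):
    """Split a paragraph at '?', '!' and '.', collapsing excess spaces."""
    parts = re.split(r"[?!.]", paragraph)
    return [" ".join(part.split()) for part in parts if part.strip()]


def preprocess_sentence(sentence):
    """Return the cleaned list of words of one sentence."""
    cleaned = []
    for word in sentence.split():
        # Number replacement (assumption: one '#' per run of digits).
        word = re.sub(r"\d+", "#", word)
        # Stop word removal.
        if word in STOP_WORDS:
            continue
        # Broken plural replacement, then synonym replacement.
        word = BROKEN_PLURALS.get(word, word)
        word = SYNONYMS.get(word, word)
        cleaned.append(word)
    return cleaned
```

Looking up each word in a dictionary keeps both replacement steps at constant time per word, which matters because they run over every word of every document in the repository.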

Fingerprinting
A document tree representation is applied in order to fingerprint a text. In the PCP approach, the fingerprint of the document is first determined at the word level: the text is divided into word-based 3-grams, and applying the hash function to them generates the fingerprint of the document at the level of the word-based 3-grams. In the next step, to produce the fingerprint of the document at the sentence level, the hashes generated for the 3-grams are themselves broken into 3-grams, and the hash function is applied to them. Finally, to create the final fingerprint of the document (at the paragraph level), the hashes generated at the sentence level are broken into 3-grams again, and the hash function is applied to them. The final fingerprint of a document, created from the tree representation by hashing 3-grams at each level, is smaller in volume than that of the approach presented in [3,7]. As shown in figure 6, the root of the tree is the document itself, the second level consists of all the refined paragraphs of the text, and the third level of the tree encompasses the sentences of each paragraph. The sentences are then divided into word-based 3-grams, and using a proper hash function, each 3-gram is converted into a number. In this manner, the processing speed of the copy detection operation is increased. Figure 7 shows an example of a document tree representation. It is important to select a hash function that minimizes the collisions caused by mapping different chunks to the same hash [6,10]. In this implementation, the BKDR hash function is used. This function processes the characters of a chunk in order, multiplying the running hash by a fixed value named the "seed", which usually has the value of 31, and adding the code of the next character. The seed value must be an odd number [6,10]: multiplication by an odd number is invertible in the fixed-width integer arithmetic used for the hash, so no information is discarded at each step and collisions become less likely.
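The word-level step can be illustrated with a short Python sketch of the BKDR hash and of word-based 3-gram extraction. The 31-bit mask, the overlapping (sliding-window) grams, and the joining of words with a single space are assumptions made for this illustration; the paper does not specify them, and the function names are hypothetical.

```python
def bkdr_hash(text, seed=31):
    """BKDR string hash: multiply the running hash by the seed and add
    the code of the next character.

    The result is kept within 31 bits (an assumption that mirrors the
    common C implementation, which masks with 0x7FFFFFFF).
    """
    value = 0
    for ch in text:
        value = (value * seed + ord(ch)) & 0x7FFFFFFF
    return value


def word_trigrams(words):
    """Return the overlapping 3-grams of a list of words (assumption:
    a sliding window with a step of one word)."""
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def trigram_hashes(words):
    """Hash every word-based 3-gram of a sentence."""
    return [bkdr_hash(" ".join(gram)) for gram in word_trigrams(words)]
```

For instance, `bkdr_hash("ab")` evaluates to 97 * 31 + 98 = 3105, since the character codes of "a" and "b" are 97 and 98.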
The steps for fingerprinting the above example are shown in figure 8. The fingerprint of this single-sentence paragraph is 25319069. According to figure 8, after all the words contained in the sentences are broken into 3-grams and hashed, the hash operations proceed at the sentence level. Through this procedure, the hashes obtained from the word-based 3-grams are broken into 3-grams at the sentence level of the tree, and a hash operation is run on them.
In the final step, the hashed 3-grams at the sentence level are in turn grouped into 3-grams and hashed at the paragraph level. Therefore, the document fingerprint obtained contains the paragraph-level hashes of the document.
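The three-level procedure can be sketched in Python as below. This is a hedged illustration, not the authors' implementation: the nested-list input layout, the treatment of sequences shorter than three items as a single gram, and the conversion of hashes to decimal strings before re-hashing are all assumptions, so the sketch is not expected to reproduce the fingerprint value reported for the paper's example.

```python
def bkdr_hash(text, seed=31):
    """BKDR hash kept within 31 bits (masking is an assumption)."""
    value = 0
    for ch in text:
        value = (value * seed + ord(ch)) & 0x7FFFFFFF
    return value


def hash_trigrams(items):
    """Hash each overlapping 3-gram of `items`.

    Assumption: a sequence with fewer than three items forms one gram.
    """
    items = [str(item) for item in items]
    if not items:
        return []
    if len(items) <= 3:
        grams = [items]
    else:
        grams = [items[i:i + 3] for i in range(len(items) - 2)]
    return [bkdr_hash(" ".join(gram)) for gram in grams]


def fingerprint(document):
    """Build the hashes of every level, bottom-up.

    `document` is a list of paragraphs, each a list of sentences,
    each a list of preprocessed words (hypothetical layout).
    """
    levels = {"words": [], "sentences": [], "paragraphs": []}
    for paragraph in document:
        sentence_level = []
        for sentence in paragraph:
            word_level = hash_trigrams(sentence)      # word-based 3-grams
            levels["words"].extend(word_level)
            sentence_level.extend(hash_trigrams(word_level))
        levels["sentences"].extend(sentence_level)
        levels["paragraphs"].extend(hash_trigrams(sentence_level))
    return levels
```

Only `levels["paragraphs"]` would need to be stored as the final fingerprint; the lower levels are kept here to make the reduction in volume from level to level visible.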

Copy detection
The main objective of the document tree representation is to save time during the similarity investigation and to prevent excessive comparisons. In the PCP method, the similarity detection approach is based upon the membership of the fingerprint at each level of the suspicious document in the fingerprint at the corresponding level of the reference document. For example, if a paragraph-level hash value of the suspicious document exists in the collection of paragraph-level hashes of the reference document, each one of the three sentence-level hashes that produced this hash is checked separately. Similarly, if a similarity exists at the sentence level, the hashes of the suspicious document at the sentence level are checked more precisely at the level of the word-based 3-grams. In other words, the hashes at the word-based 3-grams level that generated the sentence hashes are examined separately. Therefore, if there exists a similarity, the word-based 3-grams in the tree leaves are displayed to the user for the final decision. According to the pseudo-code in figure 9, the tree is surveyed by a top-down traversal, and the fingerprints of the two texts are evaluated starting at the document level. Because the hash function is not injective and can generate equal hashes for different phrases, the star-tagged parts are added to the code in order to confirm the final result; in general, they can be deleted from the algorithm. The lines of the pseudo-code are as follows:
1. The fingerprints of the reference document and the suspicious document are considered as the algorithm inputs.
2. The algorithm output, the "Similarity" variable, is set to "True" or "False" according to whether there is a similarity or dissimilarity in each one of the steps.
3. The similarity detection operation begins.
4. The following steps are repeated for all the paragraph-level hashes of the current document.
5. If the paragraph-level hashes of the suspicious document are a subset of the paragraph-level hashes of the reference document, the comparison process proceeds to the sentence level.
6. For each sentence-level hash of the suspicious document, the comparison process continues at the word level.
7. If the sentence-level hashes of the suspicious document are a subset of the sentence-level hashes of the reference document, then the comparison process continues at their word level.
8. For each word-level hash of the suspicious document, the comparison process continues at the 3-grams level.
9. If the 3-grams-level hashes of the suspicious document are a subset of the 3-grams-level hashes of the reference document,
10. a possible similarity is detected.
11, 12. Otherwise, the comparison process continues with the sentence-level hashes of the suspicious document.
13, 14. With reference to line 7, if the sentence-level hashes of the suspicious document are not a subset of the sentence-level hashes of the reference document, then the comparison operation continues with the paragraph-level hashes of the suspicious document.
15, 16. With reference to line 5, if the paragraph-level hashes of the suspicious document are not a subset of the paragraph-level hashes of the reference document, then the comparison operation stops.
17. The operation of similarity detection ends.
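The top-down pruning described in these steps can be sketched in Python as follows. The nested-dictionary layout of a fingerprint (paragraph hash, then sentence hash, then a set of 3-gram hashes) and the function name are hypothetical choices made for this illustration, and the optional re-check against hash collisions (the star-tagged parts) is omitted.

```python
def detect_copies(suspicious, reference):
    """Return (paragraph_hash, sentence_hash, gram_hash) triples that the
    suspicious document shares with the reference document.

    Both arguments use a hypothetical layout:
    {paragraph_hash: {sentence_hash: {gram_hash, ...}}}
    """
    ref_paragraphs = set(reference)
    ref_sentences = {s for sents in reference.values() for s in sents}
    ref_grams = {g for sents in reference.values()
                 for grams in sents.values() for g in grams}

    matches = []
    for p_hash, sentences in suspicious.items():
        if p_hash not in ref_paragraphs:
            continue  # prune the whole paragraph branch
        for s_hash, grams in sentences.items():
            if s_hash not in ref_sentences:
                continue  # prune the sentence branch
            for g_hash in sorted(grams):
                if g_hash in ref_grams:
                    matches.append((p_hash, s_hash, g_hash))
    return matches
```

Because membership in a Python set is tested in constant time on average, a branch whose paragraph-level hash does not occur in the reference document is discarded after a single lookup, which is the saving the tree representation aims for.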

Method evaluation
The method is implemented in the C# programming language, using its features, functions, and classes. The evaluation is carried out twice: once by comparing the similarity parameters with those of the localized "Winnowing" algorithm [9], and once against the Duplicate Content Checker tool, which implements text similarity detection and belongs to the language-free category [17].

Datasets
Evaluating the performance of the proposed PCP method requires a standard textual dataset. Therefore, seven sets of texts, each consisting of one suspicious text and one reference text, are collected from the standard Persian language dataset and the Hamshahri newspaper sources [18]. The specifications of these texts are tabulated in table 1.

Parameters
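Recall, Precision, and F-measure are the three important measures of the efficiency of plagiarism detection algorithms, in addition to the Jaccard Similarity Coefficient. These measures are calculated as in equations (1) to (4) [10]:

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (1)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (3)$$

$$\mathrm{Jaccard\ Similarity\ Coefficient} = \frac{|A \cap B|}{|A \cup B|} \qquad (4)$$

where TP is the number of cases that are correctly detected as a copy, FN is the number of cases that are falsely detected as original, FP is the number of cases that are falsely detected as a copy, and A and B are the sets of features of the two texts being compared [10].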
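As a small illustration, these measures can be computed in Python as sketched below. The sketch assumes the standard definitions of Recall, Precision, and F-measure in terms of TP, FN, and FP, and the standard set-based definition of the Jaccard coefficient; the function names and the choice to return 0.0 for empty denominators are assumptions of this example.

```python
def recall(tp, fn):
    """Share of the actual copies that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(tp, fp):
    """Share of the detected cases that are actual copies."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def jaccard(a, b):
    """Size of the intersection of two sets over the size of their union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For example, with 8 copies detected correctly, 2 copies missed, and 2 original passages flagged by mistake, both recall and precision are 8 / 10 = 0.8.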

Evaluation results
With respect to table 2, the proposed combinational method is examined in eight tests on seven random datasets created from the documents tabulated in table 1.
To illustrate the improved accuracy of similarity detection in Persian phrases, the similarity rate of each pair of tested documents is assessed by the "Winnowing" algorithm and by the PCP method, and the desired parameters are obtained. Then, to compare the proposed combinational method with the language-free tools, the similarity rate of every pair of suspicious and reference documents is calculated using the Duplicate Content Checker tool. The results obtained for these tests are tabulated in table 3.
The results in figure 10 show that by using this combinational method, where the meaning of each word and the replacement of broken plurals and synonyms are of concern, the average values of Recall, Precision, and F-measure are improved by 19.26%, 23.61%, and 20.58%, respectively; with the accuracy of plagiarism detection evaluated by these parameters, the average improvement in accuracy is 21.15%. The similarity coefficient of the two texts is improved by 21.13%, which gives a greater margin of confidence. Since the PCP method combines word stems with the tree representation of documents, the effectiveness of all the hashes generated in the fingerprint of a document is improved, which can increase the accuracy of the similarity detection process. In comparison with the similarity obtained from the language-free tool, this similarity measure is more reliable by 31.65% (see figure 11). Since language-free tools do not consider the form of the words and the specific characteristics of the Persian language in the text, they are not accurate enough in detecting similarity or dissimilarity in Persian texts. By contrast, the proposed method makes it possible to obtain more accurate results than the language-free method.
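The reported average improvement appears to be the arithmetic mean of the three parameter improvements; this reading is an assumption, although the numbers agree with it:

```latex
\frac{19.26\% + 23.61\% + 20.58\%}{3} = \frac{63.45\%}{3} = 21.15\%
```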
Since there is a direct relation between the document length and the time consumed, and since accurate preprocessing and tree representations are applied in this method, the time consumption naturally increases. This might be considered a drawback, though no new method is free of drawbacks. By its nature, fingerprinting needs a repository of reference fingerprints: the bigger the repository, the more memory storage it requires, which can also be considered a drawback. In addition, due to the nature of the fingerprinting technique, detecting text that has been restructured or whose word order has been changed is not possible.

Conclusion and future work
The PCP method proposed here is a combinational method based on the semantics of the words in the text and a tree representation of the document, accompanied by the fingerprinting technique applied to word-based 3-grams, which improves the accuracy of similarity detection for plagiarized phrases in Persian texts. The results obtained indicate that this combinational method improves the similarity coefficient of two texts by 21.13% because the word meanings and the replacement of broken plurals and synonyms are of concern. In comparison with the similarity obtained from the language-free tool, the calculated similarity measure is improved by 31.65% and is more reliable. This indicates the lack of accuracy of the language-free tools relative to the language-sensitive methods, especially the proposed combinational method. Among the proposals for improving this method is the use of data-mining algorithms that categorize documents automatically, which would prevent excess comparisons between texts with different themes.

Figure 2. Steps of detecting similarities in PCP method.

Figure 4. Pseudo-code of broken plural replacement (input: a word of the document; output: the singular word of the clean document).
Figure 5. Pseudo-code of synonym replacement (input: a word of the document; output: the root word of the clean document).
If the input word belongs to a series of synonymous words, it is replaced by their root word, which homogenizes this class of words. This step requires a lexicon of synonyms in the Persian language. Here, a comprehensive lexicon of synonyms and antonyms in the Persian language, the Raghoumi version, is used [14].

Figure 7.An example of a document tree representation.

Figure 10. Comparison of the PCP method and the localized Winnowing algorithm.

Figure 11. Comparison of the PCP method and the Duplicate Content Checker tool.