Selection of the Best K-Gram Value on Modified Rabin-Karp Algorithm

The Rabin-Karp algorithm detects similarity using hashing techniques. Related studies have modified the hashing process, but no previous study has investigated the best k value for the K-Gram process. At the stemming stage, the Nazief & Adriani algorithm is used to reduce words to their base forms. The researchers use several K-Gram values to determine the best one. The analysis was performed on the public Ukara Enhanced dataset obtained from Kaggle, with a total of 12215 records. The student essay answers comprise 258 records in group A and 305 in group B; each student's answer in a group is compared with the answers of every other member of that group. The results show that k = 3 performs best, producing the highest number of interpretations in the ranges 1-14% (little degree of similarity) and 15-50% (medium level of similarity), whereas k = 5, 7, and 9 produce the highest number of results in the range 0%-0.99% (document is different). However, when the compared essay answers have the 100% (exactly the same) interpretation, the k value in K-Gram does not affect the results.


INTRODUCTION
Research on determining the value of k has been widely studied, but the selection of the best k value in the application of the Rabin-Karp algorithm has not been done. An experiment is needed to find the best k and to determine the effect of the k value on the similarity results obtained when detecting document similarity. An essay is a test in the form of questions that expect answers to be thought out, clear, and written down. Each essay question generally has an answer that must be explained and improvised by each student, because the answer usually covers not only the understanding of the theory but also each student's personal opinion and explanation, which may have the same meaning or purpose but a different writing style. Text preprocessing can process the answer data from Ukara Enhanced; because the dataset is in Indonesian, the Nazief & Adriani stemming algorithm is used at the stemming stage to reduce the words in the students' answers to their base forms. The similarity value of each student's essay answer can then be identified with the Rabin-Karp algorithm by parsing the base words with the K-Gram method, converting the grams into hashes using a rolling hash, and matching the hash results with those of other students.
The preprocessing stage in text mining consists of case folding, tokenization, filtering, and stemming [1,2]. The Nazief & Adriani algorithm is used at the stemming stage of text preprocessing. This algorithm applies basic Indonesian morphological rules, checks collections of allowed and disallowed affixes, and uses a dictionary of Indonesian base words for comparison [3]. A. Jelita [4] lists several Indonesian stemming algorithms: Nazief and Adriani's algorithm, Arifin and Setiono's algorithm, Vega's algorithm, the algorithm of Ahmad, Yusoff, and Sembok, and Idris's algorithm. In testing, the Nazief & Adriani algorithm produced the best results, correctly stemming 93% of word occurrences in C_TR_MAJORITY, 92% in C_TR_UNIQUE, and 95.0% in C_TR_SUBJECTIVE.
Research by A. Rahmatulloh et al. [5] compared the performance of Porter and Nazief & Adriani stemming with the Winnowing algorithm for plagiarism detection. The study concluded that the Winnowing algorithm without stemming yielded 70.7% plagiarism similarity with a processing speed of 0.711 s, with Porter stemming 65.7% at 0.221 s, and with Nazief & Adriani stemming 70.5% at 0.476 s. Porter stemming reduces the plagiarism-similarity results very significantly while speeding up processing, whereas Nazief & Adriani stemming gives results close to those without stemming while also speeding up processing.
K-Gram is used to produce an ordered sequence of grams by cutting the text-preprocessing result into a collection of new strings, each obtained by concatenating consecutive characters of the preprocessed text with a length determined by the k value of K-Gram [6,7]. The Rabin-Karp algorithm builds its hashes using a rolling hash [8]. A rolling hash is a non-cryptographic hash function that allows rapid computation of the hash of each consecutive chunk; the computation is fast because the hash of a chunk reuses the hash of the previous chunk [9].
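The two steps above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the function names and the base value b = 11 are assumptions, and the rolling update reuses the previous hash exactly as described.

```python
# Hypothetical sketch of the K-Gram and rolling-hash steps.
def k_grams(text: str, k: int) -> list[str]:
    """Cut a preprocessed string into all overlapping substrings of length k."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]

def rolling_hashes(text: str, k: int, b: int = 11) -> list[int]:
    """Hash every k-gram as c1*b^(k-1) + ... + ck*b^0 (base b assumed),
    reusing the previous chunk's hash for each one-character shift."""
    if len(text) < k:
        return []
    h = 0
    for ch in text[:k]:          # hash of the first chunk, computed directly
        h = h * b + ord(ch)
    hashes = [h]
    high = b ** (k - 1)          # weight of the character that leaves the window
    for i in range(k, len(text)):
        # drop the leftmost character, shift, and append the new one
        h = (h - ord(text[i - k]) * high) * b + ord(text[i])
        hashes.append(h)
    return hashes

print(k_grams("mahasiswa", 3)[:3])  # ['mah', 'aha', 'has']
```

Because each new hash is derived from the previous one in O(1), hashing all grams of a document takes linear time rather than O(n·k).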
Musthofa and Yaqin [10] applied the Rabin-Karp algorithm to automatic answer correction by matching essay answers with answer keys; because manual correction requires a lot of time, an automatic answer-correction system was built to speed it up. Testing with a confusion matrix, using k = 3 and a dataset of 50 items, resulted in 90% accuracy and a 10% error rate. An automatic essay grading system [11] was studied with a dataset of Japanese answer documents that are romanized into romaji, because the input is in hiragana, katakana, or kanji. That study used the Winnowing algorithm, which applies hashing techniques, with window techniques for the fingerprint search. Testing the dataset with the parameters n = 2, w = 2, and p = 2, and varying n unlike p and w, produced some variations with accuracy below 80%; the parameter n is therefore better kept small. The research resulted in an average accuracy of 86.86%. B. Leonardo and S. Hansun [12] detected the similarity of documents to other documents obtained from searches on Google Search using the Rabin-Karp and Jaro-Winkler distance algorithms. Text-similarity testing with the Rabin-Karp algorithm produced an average percentage of 51% and required an average time of 0.594 minutes, whereas Jaro-Winkler distance produced an average of 35% and required an average time of 0.992 minutes; the Rabin-Karp algorithm is thus more effective than the Jaro-Winkler distance algorithm. M. Bicer and X. Zhang [13] researched the efficiency of the Boyer-Moore-Horspool, Rabin-Karp, Raita, and Double-Hash algorithms on string similarity.
Their results show that the Double-Hash algorithm is more efficient across 5 different tests: many patterns, timestamp patterns, very long patterns, very short patterns, and no patterns, with test durations of 5.63 s, 5.74 s, 5.67 s, 6.43 s, and 6.20 s. Subsequent research by D. D. Sinaga and S. Hansun [14] detected the similarity of Indonesian documents using a combination with the Confix-Stripping algorithm in the stemming process, so that prefixes and suffixes can be detected. The Rabin-Karp algorithm with stemming has an average processing time of 0.0123 s and an average accuracy rate of 89.1967%, while without the stemming process it has an average processing time of 0.0103 s.
The hashing process in the Rabin-Karp algorithm uses the modulo operation; by definition, modulo can produce the same value for different inputs, which affects accuracy because the resulting hashes are not unique, so different cases can share the same value. Previous research eliminated the modulo value from the hashing process, and the results increased the syntactic accuracy of word matching [15]. The Rabin-Karp algorithm matches data through the unique hashes formed by hashing each datum, and it is used to identify duplicate contents in a dataset [16,17]. After the unique hash values in the two compared data are found, the similarity value between them is calculated using the Dice Similarity Coefficient, which determines the similarity between two documents, two queries, or a document and a query [18,19]. This research aims to determine the k value in K-Gram, i.e. to select the best k value in the application of the modified Rabin-Karp algorithm, with modulo removed from the hashing process, for calculating the similarity between documents.

METHODS
This study uses the Ukara Enhanced student answer dataset from Kaggle. The data are processed with text preprocessing, using the Nazief & Adriani algorithm at the stemming stage. Base words are cut and grouped into new strings according to the value of k in K-Gram. The word cuts are converted to hashes using a rolling hash without modulo and then compared with the answers of other students using the Rabin-Karp algorithm. The similarity value is calculated with Dice's Similarity Coefficient and the results are interpreted. This analysis process is shown in Figure 1. To achieve the research aims, the flow diagram is divided into three processes: text preprocessing, the Rabin-Karp algorithm, and Dice's Similarity Coefficient. These three processes form the main methodology of this research.

Text Preprocessing
Primary data are taken from the Ukara Enhanced answers from Kaggle. After primary data collection, the next process is text preprocessing, which consists of case folding, tokenizing, filtering or stopword removal, and stemming [20]. Because this research uses Indonesian-language data, the stemming stage uses the Nazief & Adriani algorithm.
The research data come from the raw-text Ukara Enhanced answer dataset of student responses obtained from the Kaggle site; a total of 573 essay answers were obtained in groups A and B. Student answers have been labeled (true/false) in the dataset. The word processing covers only standard Indonesian according to the Kamus Besar Bahasa Indonesia (KBBI). This study does not consider spelling or writing errors in documents and does not handle synonyms.
Case folding converts the text to a standard form, i.e. lowercase letters, and removes characters other than letters. Tokenizing or parsing separates the text into words based on whitespace characters; tabulation and spaces are treated as separators between words. The filtering stage selects the important words and removes the less important ones. The Nazief & Adriani algorithm was developed with base-word dictionary lookup techniques and Indonesian morphological rules such as prefixes, infixes, suffixes, and confixes (combined affixes). The algorithm uses a dictionary of base words and supports recoding, rearranging words that undergo excessive stemming according to its rules [21].
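The preprocessing stages described above can be sketched as a small pipeline. This is a minimal illustration only: the stopword list and the base-word lookup table below are tiny assumed samples standing in for a real stopword list and for the full Nazief & Adriani algorithm.

```python
import re

# Assumed illustrative samples, not the real resources used in the study.
STOPWORDS = {"yang", "dan", "di", "ke", "dari"}
BASE_WORDS = {"mendaftar": "daftar", "potongan": "potong"}  # stemming lookup stub

def preprocess(text: str) -> list[str]:
    """Case folding, tokenizing, filtering, and (stubbed) stemming."""
    text = text.lower()                                  # case folding
    text = re.sub(r"[^a-z\s]", " ", text)                # remove non-letter characters
    tokens = text.split()                                # tokenizing on whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]   # filtering / stopword removal
    return [BASE_WORDS.get(t, t) for t in tokens]        # stemming via dictionary lookup

print(preprocess("Mahasiswa yang mendaftar!"))  # ['mahasiswa', 'daftar']
```

In the dictionary-lookup stub, a word not found in the table is kept as-is, mirroring the rule that a word is assumed to already be a base word when no stemming step succeeds.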

Rabin-Karp Algorithm
The Rabin-Karp algorithm is the simplest string-searching algorithm. It uses a hash function to discover potential occurrences of the pattern in the input text. For a text of length n and a pattern p of length m, its average and best-case running time is O(n + m) in space O(p), and its worst-case time is O(nm) in space O(m) [22]. The value of k is set to 3, 5, 7, and 9, so that each base word obtained is cut into pieces of length k in K-Gram, which are then converted into hashes by a rolling hash [23].
Text that has been grouped with K-Gram is converted into a hash using a rolling hash. Previous studies compared hashing with and without modulo; without modulo, the syntactic accuracy of word matching increases. Several similarity algorithms, such as Rabin-Karp and Winnowing, use the rolling hash technique [24].
H = c_1 * b^(k-1) + c_2 * b^(k-2) + ... + c_k * b^0 (1)

where H is the hash value, c_i is the ASCII value of the i-th character in the string, k is the string length, and b is the hash base value. The hash values obtained are then matched by the Rabin-Karp algorithm against the same hashes in the answers of other students. After the unique hash values in both documents are found, the hash values found in both are collected in the matching process (fingerprint). From the number of matching hashes and the total number of hashes in each document, the similarity value can be calculated [25].
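The fingerprint-matching step can be sketched as a set intersection over the two hash lists. This is an assumed simplification (the function name is hypothetical, and using sets collapses duplicate hashes within a document):

```python
# Hypothetical sketch of the Rabin-Karp fingerprint-matching step:
# count the hash values that appear in both documents' hash lists.
def shared_fingerprints(hashes_x: list[int], hashes_y: list[int]) -> int:
    """Number of distinct hash values common to both documents."""
    return len(set(hashes_x) & set(hashes_y))

print(shared_fingerprints([101, 202, 303], [303, 404, 505]))  # 1
```

The count returned here, together with the total number of hashes in each document, feeds directly into the similarity calculation described next.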

Dice's Similarity Coefficient
The similarity value of the hash comparison results is calculated using the formula of Dice's Similarity Coefficient.
Similarity = 2C / (X + Y) (2)

where C is the number of matching fingerprints, X is the number of fingerprints in document X, and Y is the number of fingerprints in document Y. After the percentage value of Dice's Similarity Coefficient is obtained, it is interpreted according to its value; the interpretation groups are shown in Table 1.

The next step is case folding (removing punctuation marks), followed by the tokenizing, filtering, and stemming stages. The stemming stage uses the Nazief & Adriani algorithm to determine the base form of a word with predetermined rules; if all steps are completed without success, the word is assumed to already be a base word. The results of the stemming process are shown in Table 2; for example, answer ID 1 of group 1 stems to ["mahasiswa", "daftar", "batu", "batu", "buruk", "prosedur"].

After text preprocessing comes the parsing step, in which each term that has passed preprocessing is cut into pieces per character using the K-Grams method. After the character cuts are known, each cut is hashed using the rolling hash. Take the gram "mah": m has an ASCII value of 109, a has 97, and h has 104.
After the hash of each K-Gram cut in every answer document is known, the hash results of each answer are compared with the hashes of the other answers. The hash results for k = 3, 5, 7, and 9 are shown in Table 3. The similarity is then calculated: one matching hash was found between document ID 11643, which has 27 hashes, and document ID 11644, which has 25 hashes, giving for K-Gram 3 a similarity of 2 × 1 / (27 + 25) ≈ 3.85%. The resulting values of Dice's Similarity Coefficient are shown in Table 4. The values are then converted into interpretations, displayed for each variant of the K-Gram value, so that the different interpretations for each k value are known; the interpretation results are shown in Table 5. Table 6 shows the interpreted similarity test results for group A, and the diagram of similarity possibilities for group A data is shown in Figure 2.
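The worked example above follows directly from formula (2). A minimal sketch (the function name is an assumption):

```python
# Dice's Similarity Coefficient as in formula (2): 2*C / (X + Y), as a percentage.
def dice_similarity(shared: int, total_x: int, total_y: int) -> float:
    """shared: matching fingerprints; total_x/total_y: fingerprints per document."""
    return 2 * shared / (total_x + total_y) * 100

# The example from the text: 1 shared hash, 27 and 25 hashes respectively.
print(round(dice_similarity(1, 27, 25), 2))  # 3.85
```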

Figure 2 Diagram Similarity Possibilities Data Group A
Comparison of the answer data in group A shows that at k = 3 the interpretation 1-14% (little degree of similarity) has the most members, as many as 13245, while at k = 5, 7, and 9 the interpretation 0%-0.99% (document is different) has the most members: 25332, 34288, and 41183. The interpreted similarity possibilities for the group B student essay answers are shown in Table 7.

Table 7 Similarity Possibilities of Group B Answers
Interpretation   k = 3   k = 5   k = 7   k = 9
0%-0.99%         11777   25332   34288   41183
1-14%            15222   13059    8220    2564
15-50%           16884    7073    3324    2239
51-99%            2376     795     427     273
100%               101     101     101     101

According to these results, the diagram of similarity possibilities for group B data is shown in Figure 3. In group B, at k = 3 the interpretation 15-50% (medium level of similarity) has the most members, as many as 16884, while at k = 5, 7, and 9 the interpretation 0%-0.99% (document is different) has the most members: 25332, 34288, and 41183.
Testing the group A and B datasets compares each student's essay answer with every other answer in the same group, yielding the chances of similar answers. Testing is done by varying the value of k at the K-Gram stage for groups A and B, detecting the possibility of matching answers between different students for each applied k value. Both tests conclude that k = 3 gives good results because it produces essay-similarity scores between students that spread evenly across the interpretations, whereas k = 5, 7, and 9 produce similarity values dominated by the interpretation 0%-0.99% (document is different). The value k = 3 dominates in several interpretations, detecting the similarity of essay answers among students in the 1-14% and 15-50% interpretations, while with k = 5, 7, and 9 the counts decrease in every interpretation. However, if the documents do have 100% in common, every test with the various k values gives the same result.

CONCLUSIONS
From the results of comparing student answer tests in groups A and B with k = 3, 5, 7, and 9 in K-Gram, it can be concluded that the k value in K-Gram affects the Dice Similarity Coefficient results. Previous studies applying the Rabin-Karp algorithm found a similar result: the N-Gram value also affects the similarity values. The value k = 3 performs best in detecting the similarity between students' essay answers, having the highest number of interpretations of 1-14% (little degree of similarity) and 15-50% (medium level of similarity), compared with k = 5, 7, and 9, which have the highest number of results in the interpretation 0%-0.99% (document is different). However, if the compared essay answers have the 100% (exactly the same) interpretation, the k value in K-Gram does not affect the results in any test.