Text Summarization in Multi Document Using Genetic Algorithm

Automatic text summarization produces a representation of a document that contains the essence or main focus of the document. In this study, summarization is performed automatically using the extraction method, which summarizes by copying the text considered most important or most informative from the source text into a summary [1]. Documents can be divided into two types: single documents and multi documents. A multi document is input that comes from many documents, from one or more sources, that have more than one main idea. This study aims to summarize text using a Genetic Algorithm that takes into account the text features extracted for each chromosome. The features used are sentence position, positive keywords, negative keywords, similarity between sentences, sentences containing entity words, sentences containing numbers, sentence length, connections between sentences, and the number of connections between sentences. The number of chromosomes used is half of the number of public complaints. The data consists of public complaints against the DIY government from February 2018 to July 2020, obtained from the e-lapor DIY website. In testing, the average precision is 1, the average recall is 0.71, and the average f-measure is 0.79.
Keywords: Automatic Text Summarization, Feature Extraction, DIY government, Genetic Algorithm.
ISSN (print): 1978-1520, ISSN (online): 2460-7258. IJCCS Vol. 15, No. 3, July 2021: 327-338.


INTRODUCTION
The government is essentially formed to serve the community and create conditions that allow each member [2]. Currently the government does not only provide services and receive complaints from the public face-to-face; many government websites have also been developed for this purpose. As these websites grow, reports accumulate, and it becomes necessary to summarize the complaints submitted by the public.
A summary is a representation of a document that contains the essence or main focus of the document. Documents can be divided into two types: single documents and multi documents. A multi document is input that comes from many documents, from one or more sources, that may share one main idea or have different main ideas. To make things easier for readers, a text or several texts can be summarized automatically. Automatic text summarization is a technique for producing a summary of a text automatically, using an application running on a computer, while retaining the important points of the original document [1].
Automatic text summarization has two approaches: the extraction method and the abstraction method. The extraction method summarizes by copying the text considered most important or most informative from the source text into a summary [1]. The copied text can be a main clause, main sentence, or main paragraph. The extraction method is used here to preserve the original wording of the actual reports and so minimize differences in meaning.
One method that has been developed for automatic text summarization is the genetic algorithm. Previous studies [3], [4] performed automatic text summarization of single documents in Indonesian using the genetic algorithm method. Building on this work, this study proposes a Genetic Algorithm that is expected to automatically summarize text in multiple documents using the extraction method.

System Architecture
At this stage, the information obtained is analyzed. The system architecture to be built has five parts: data collection, preprocessing, feature extraction, automatic text summarization, and evaluation. The architecture designed in this study is shown in Figure 1.

Data Description
The data in this study was obtained by scraping public complaints from the DIY e-report website from February 2018 to July 2020. A total of 1000 public complaints were used.

Preprocessing
The preprocessing stage cleans the public complaint data of noise to facilitate further processing. The preprocessing steps are as follows [5]. In the first step, the words in a complaint are made uniform by converting all letters to lowercase.

Tokenizing
This stage separates the words in a sentence into tokens, where each word in the sentence is delimited by a space.

Stopword Removal
This stage removes words that carry no influence (which, and, or, to, from, etc.).
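The three preprocessing steps above can be sketched as a single function. This is a minimal illustration, not the study's implementation: the function name `preprocess` is hypothetical, and the stopword set holds only the example words listed in the text, since the full list used in the study is not given.

```python
# Assumption: only the example stopwords named in the text are included here.
STOPWORDS = {"which", "and", "or", "to", "from"}


def preprocess(sentence, stopwords=STOPWORDS):
    """Lowercase a sentence, split it on whitespace, and drop stopwords."""
    lowered = sentence.lower()      # uniform lowercase
    tokens = lowered.split()        # tokenizing on spaces
    return [t for t in tokens if t not in stopwords]  # stopword removal


print(preprocess("Street lights AND roads from Bantul"))
```

Splitting on whitespace follows the description of tokenizing given above; punctuation handling is left out because the text does not describe it.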

Feature Extraction
Feature extraction is an important process for converting an unstructured textual format into a structured one. The features extracted in this study are the same as those used in [6].
a. Sentence Position (f1) The sentence position can be calculated using equation (1). It is assumed that the first sentence in each paragraph is the most important sentence.
b. Positive Keywords (f2) Positive keywords can be calculated using equation (2). Positive keywords are the words that appear most often in a sentence.
c. Negative Keywords (f3) Negative keywords can be calculated using equation (3). Negative keywords are the keywords that appear least often in a sentence.
d. Similarity Between Sentences (f4) The similarity between sentences can be calculated using equation (4). It is found by looking for words on the first chromosome that also appear on another chromosome.
e. Sentence Contains Entity (f5) Sentences containing entity words can be calculated using equation (5). These are sentences that contain names such as the name of an island, a person, an institution, a place, and so on.
f. Sentence Contains Number (f6) Sentences containing numbers can be calculated using equation (6). Sentences containing numbers are usually considered important.
g. Sentence Length (f7) The sentence length can be calculated using equation (7). This feature measures how long the sentence is.
h. Connection Between Sentences (f8) The connection between sentences can be calculated using equation (8). The connection between sentences is the number of sentences that have the same word as another sentence.
i. Number of Connections Between Sentences (f9) The number of connections between sentences can be calculated using equation (9). The connection between sentences is the number of sentences that have the same word as another sentence.
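As an illustration of how such features might be computed on tokenized sentences, the sketch below covers two of them. Both functions are hypothetical. `connections` follows the verbal description of f8 (counting other sentences that share a word), while `numeric_ratio` is only an assumed form of f6, since equations (1) to (9) are not reproduced in this text.

```python
def numeric_ratio(tokens):
    """Assumed form of f6: share of tokens that contain a digit."""
    if not tokens:
        return 0.0
    return sum(any(ch.isdigit() for ch in t) for t in tokens) / len(tokens)


def connections(index, sentences):
    """f8 as described in the text: how many other sentences share a word."""
    words = set(sentences[index])
    return sum(
        1
        for j, other in enumerate(sentences)
        if j != index and words & set(other)
    )


# Made-up tokenized complaints, for illustration only.
sentences = [
    ["street", "light", "broken"],
    ["light", "off", "3", "days"],
    ["covid", "aid"],
]
print(numeric_ratio(sentences[1]), connections(0, sentences))
```

Any normalization applied in the original equations (for example dividing by the number of sentences) is not modeled here.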

Genetic Algorithm
The genetic algorithm is a branch of evolutionary algorithms: an adaptive method used to search for a value in an optimization problem, based on the mechanism of natural selection among living creatures. The algorithm follows the principle of natural selection that "whoever is strong survives". By imitating this theory of evolution, genetic algorithms can be used to find solutions to real-world problems [7].

Generation Initialization
Generation initialization (iteration) determines how many iterations are run to obtain the best individual. In this study, 10, 100, 200, and 500 iterations were used to obtain the best individual.

Population Initialization
Population initialization is the stage where chromosome strings are randomized. The population size (popsize) is set at the beginning in multiples of 10, and the number of chromosomes in each individual is n / 2, where n is the number of chromosomes in the public complaint data. Examples of individual representations can be seen in Table 2. After the strings are randomized, the next stage takes the feature extraction results (F1-F9) that have been calculated for each chromosome of each individual [6].
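A minimal sketch of this initialization follows, under the assumption that each chromosome is identified by an integer index and that an individual holds n / 2 distinct indices drawn at random. The function name `init_population` and its parameters are illustrative and do not come from the study.

```python
import random


def init_population(n, pop_size, rng=random):
    """Build pop_size individuals, each a random sample of n // 2 indices."""
    size = n // 2
    return [rng.sample(range(n), size) for _ in range(pop_size)]


# Example: 10 candidate chromosomes, population size 20, fixed seed.
population = init_population(10, 20, random.Random(0))
print(len(population), len(population[0]))
```

Passing a seeded `random.Random` instance makes a run repeatable, which helps when comparing different iteration counts.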

Fitness Function
The fitness function determines how good the solution represented by an individual is; each individual in the population must have a comparative value (fitness). Through this comparison value, the best solution is obtained. In this study, the fitness value of each individual is calculated using equation (10) [8].
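Equation (10) is not reproduced in this text, so the following is only a sketch of one plausible reading. It assumes, in line with the description in the Fitness Results section, that each gene carries a sentence weight that is multiplied by that sentence's feature scores, and that the products are summed over the individual. The names `fitness`, `genes`, `weights`, and `features` are hypothetical.

```python
def fitness(genes, weights, features):
    """Sum, over the genes of an individual, of weight * total feature score.

    genes    : list of sentence indices making up the individual
    weights  : list of sentence weights, parallel to genes
    features : mapping from sentence index to its list of feature scores
    """
    return sum(w * sum(features[g]) for g, w in zip(genes, weights))


# Made-up feature scores for two sentences (two features each, for brevity).
features = {0: [1.0, 0.5], 1: [0.0, 2.0]}
print(fitness([0, 1], [2.0, 0.5], features))
```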

Selection
The selection process chooses the best individuals to be used as parents for crossover. The selection method used is the roulette wheel. On the roulette wheel, each individual is represented as a segment of the wheel. A probability value is calculated for each individual, expressing how likely that individual is to be selected. The higher the probability value of an individual, the more likely it is to be selected as a parent.
The way this method works is as follows:
1. The fitness value of each individual i is calculated, where i runs from the 1st to the nth individual.
2. The total fitness of all individuals is calculated.
3. The probability of each individual is calculated.
4. From this probability, the share of each individual on a scale of 1 to 100 is calculated.
5. A random number between 1 and 100 is generated.
6. From the random number generated, it is determined which individual is selected in the selection process.
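The steps above can be sketched as follows. This is an illustrative version, not the study's code: it assumes non-negative fitness values with a positive total, and it draws a continuous random number on a 0 to 100 scale rather than an integer from 1 to 100. The function name `roulette_select` is hypothetical.

```python
import random


def roulette_select(fitnesses, rng=random):
    """Return the index of one individual chosen by roulette wheel."""
    total = sum(fitnesses)               # step 2: total fitness
    draw = rng.uniform(0, 100)           # step 5: random number on the wheel
    cumulative = 0.0
    for i, f in enumerate(fitnesses):
        cumulative += 100 * f / total    # steps 3-4: share on a 0-100 scale
        if draw <= cumulative:           # step 6: segment containing the draw
            return i
    return len(fitnesses) - 1            # guard against rounding error


rng = random.Random(42)
picks = [roulette_select([1.0, 9.0], rng) for _ in range(1000)]
print(picks.count(0), picks.count(1))
```

Because selection is probabilistic, the fittest individual is favoured but not guaranteed to be chosen in any single draw.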

Crossover
Crossover, or cross breeding, is the stage that crosses two individuals in a population to produce two new individuals. In this implementation, the method used is single-point crossover. For single-point crossover, the chromosomes to be crossed are selected randomly.
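A generic single-point crossover can be sketched as below. It assumes two parents of equal length (at least two genes) represented as lists, with the cut point chosen at random; how the study handles sentences duplicated within a child is not described and is not addressed here. The function name is hypothetical.

```python
import random


def single_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length parents at one random cut point."""
    point = rng.randint(1, len(parent_a) - 1)   # cut strictly inside the list
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


children = single_point_crossover([1, 2, 3, 4], [5, 6, 7, 8], random.Random(0))
print(children)
```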

Mutations
Mutations are changes to the genes of an individual born in a new generation. At this stage, a number of new individuals are taken in proportion to the mutation probability. One gene is then randomly selected from each mutated individual and changed: a random value between -1 and 1 is added to the sentence weight of the selected gene.
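The gene-level change described above can be sketched as follows, assuming an individual's sentence weights are stored as a list of floats. The function name `mutate` is hypothetical, and the choice of which individuals to mutate (driven by the mutation probability) is left to the caller.

```python
import random


def mutate(weights, rng=random):
    """Return a copy of weights with one random gene shifted by [-1, 1]."""
    mutated = list(weights)                  # leave the original untouched
    gene = rng.randrange(len(mutated))       # pick one gene at random
    mutated[gene] += rng.uniform(-1.0, 1.0)  # add a random value in [-1, 1]
    return mutated


print(mutate([0.5, 0.5, 0.5], random.Random(0)))
```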

Elitism
Elitism is carried out using the continuous update method: the new generation is added to the old population to form a single pool of individuals, which is then ranked by fitness value. After ranking, only the top individuals survive, as many as the population size set at the beginning. The result of elitism becomes the input to the next iteration.
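This merge, rank, and truncate step can be sketched in a few lines. The function and parameter names are illustrative; `fitness_of` stands for whatever fitness function is in use and is passed in as a callable.

```python
def elitism(old_population, offspring, fitness_of, pop_size):
    """Merge both generations, rank by fitness, keep the top pop_size."""
    pool = old_population + offspring
    ranked = sorted(pool, key=fitness_of, reverse=True)
    return ranked[:pop_size]


# Toy example: individuals are lists of numbers and fitness is their sum.
survivors = elitism([[1], [5]], [[3], [9]], sum, 2)
print(survivors)
```

Because the old population competes with its offspring, the best fitness in the population cannot decrease from one iteration to the next under this scheme.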

Selection of the Best Individual
Selection of the best individual is done by sorting each individual based on the highest fitness value. The individual with the highest fitness value is the best individual among other individuals in the same population. The best individual will be used as the result of the summary.

Result of Population Generation
The population is generated by taking random sentences from the documents for each week. Selected sentences are stored in an array. The resulting generation array is shown in Figure 3.

Fitness Results
Once each individual has been formed, its fitness value is calculated. The fitness value is obtained by multiplying the sentence weight by the sentence features. The fitness results can be seen in Figure 4.

Best Individual Results
Individuals who have the highest fitness value are considered to be the best individuals. The best individual results can be seen in Figure 5.

Figure 5 Best Individual Results
The graph in Figure 7 shows the fitness of each generation. The fitness value starts to stabilize in the second iteration, because from then on the individual selected in each generation has a fitness value of 525126.085450387. The original data can be seen in Figure 6, and the best individual, mapped back to its original sentences, can be seen in Figure 7.

Results of Summarization for Every Category
The summary results for each category are obtained from the best individual summary results. After the best individual summary is obtained, the next step is to categorize it manually. The summary results for each category can be seen in Figure 8.

Testing
This section discusses the testing phase of the system. Testing measures system performance in terms of precision, recall, and f-measure on the summaries of public complaints produced by the genetic algorithm. The test results show which number of iterations and which population size produce the best precision, recall, and f-measure.
The public complaint data used for this test was taken from the DIY e-report website from February 2018 to July 2020 and consists of 297 public complaints. Not all parameter combinations could be tested because of limited time and resources, so the tests in this study used 20 individuals and 20 iterations. The test results can be seen in Table 3. Each experiment at this stage used a different data sample, taken randomly from 2018 to 2020.
The experimental results in Figure 9 all have the same precision value. This is because the categorization is still done manually and the precision value is obtained from the intersection between the system summary results and the labeled data, so the resulting precision reaches a perfect value.
The experimental results in Figure 10 have different recall values. The lowest recall value is in experiment four, with a recall value of 0.2. This is because the labeling process is still done manually, so that when summarizing by category there are many public complaints whose label cannot be detected. The highest recall values are in experiments seven, eight and nine, because those experiments contained only one category, so the label was always detected.
Figure 11 shows the f-measure values of the same experiments. The f-measure value is determined by the precision value and the recall value. If both precision and recall are high, the resulting f-measure is high; conversely, if both are low, the resulting f-measure is low. Because the precision value is the same in every experiment and the lowest recall value is in experiment four, the lowest f-measure value is also in experiment four, with an f-measure value of 0.4, and the highest f-measure value is in experiments seven, eight and nine, with an f-measure value of 1.
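For reference, the three measures can be computed from the intersection between the system output and the labeled data as sketched below. This uses the standard definitions, with the f-measure taken as the harmonic mean of precision and recall, and assumes both inputs are collections of item identifiers; the function name and the example identifiers are hypothetical and are not data from the study.

```python
def precision_recall_f(system, reference):
    """Compute precision, recall and f-measure from two collections of items."""
    system, reference = set(system), set(reference)
    overlap = len(system & reference)    # items found in both sets
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up identifiers: every system item is correct, half the labels are found.
print(precision_recall_f({"c1", "c2"}, {"c1", "c2", "c3", "c4"}))
```

When every item returned by the system is in the labeled data, precision is 1 regardless of how many labeled items are missed, which matches the pattern of constant precision and varying recall reported above.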

CONCLUSIONS
Based on the test results obtained, the following conclusions can be drawn:
1. The fitness value produced in each generation is not always stable. This is caused by the roulette wheel selection method, under which an individual with a high fitness value may still not be selected even though its selection probability is high.
2. Most of the complaints lodged by the public in 2018 and 2019 relate to streets and lights, either street lights or traffic lights, because these facilities are used every day. In 2020, the category with the most public complaints is covid, because covid began to enter Indonesia, and the Yogyakarta area in particular, at the beginning of 2020. Covid is a big problem for the community and also for the government.
3. Based on the test results, the summaries produced by the genetic algorithm system are meaningful and in accordance with the contents of the documents. The best f-measure produced by the system using 20 individuals and 20 iterations is 1, in experiments seven, eight and nine. The main reason the f-measure is perfect is that the category labeling was still done manually by the researchers.

SUGGESTIONS
Based on the research conducted, the suggestions that can be implemented for further research are as follows: