Optimizing Big Data to Reduce Miss-Targeting of the Family Hope Program (PKH) in Sidoarjo District with a Machine Learning Approach

Machine learning approaches have been used to solve various problems. The Family Hope Program (PKH) has experienced miss-targeting. This study aims to compare the results of big data processing by SIKS-NG with those of machine learning, based on the same data and measurement indicators. The Averaged Neural Network algorithm gave the best output compared with the other algorithms. Both SIKS-NG and machine learning were tested using confusion matrix evaluation with the following 3 indicators: 1) Accuracy was 72.40% for SIKS-NG and increased to 81.18% for machine learning; 2) Precision was already high for SIKS-NG at 91.01%, and increased to 95.37% once the data were processed with machine learning; 3) Recall was 75.49% for SIKS-NG, while machine learning obtained a higher value of 82.19%. Thus, machine learning has been shown to reduce miss-targeting and can be used as an alternative recommendation in automatic decision making and innovative management practices in government circles.
Keywords: Family Hope Program, Miss-Targeting, Big Data, Machine Learning
ISSN (print): 1978-1520, ISSN (online): 2460-7258. IJCCS Vol. 15, No. 1, January 2021: 99 – 110


INTRODUCTION
Under the Regulation of the Minister of Social Affairs of the Republic of Indonesia Number 1 of 2018, the Family Hope Program (PKH) is a planned, targeted, and sustainable social protection program. Poverty level data is one of the considerations for determining PKH areas. Accurate poverty data is an important aspect of supporting a poverty reduction strategy [1]. Therefore, the government needs to encourage regular data sharing and data transparency as a requirement for prospective PKH beneficiaries. Azizah, Mahmudah and Kriswibowo (2020) argue that the government's political will is essential to minimizing the increase in poverty in villages [2]. In practice, however, PKH data collection is often inaccurate, so that PKH does not reach the poor who really need it. Even among people already registered in the Integrated Social Welfare Data (DTKS), there are still poor people who have not received PKH assistance. On the other hand, some well-off people still receive PKH assistance. The inaccurate data has caused social jealousy in the community, and the data has not been integrated systematically.
Miss-targeting is the main challenge of the PKH program. The slow handling of complaints about invalid data at the regional level is confirmed by statements from the SMERU Research Institute in katadata.co.id [3]; from Anwar Sadad, Deputy Chairperson of the Regional People's Representative Council of East Java Province 2019-2024, at kominfo.jatimprov.go.id [4]; and from M Dhamroni Chudlori, Deputy Chairman of the Sidoarjo DPRD Covid-19 Handling Committee, in republikjatim.com [5]. In addition, according to Kompas.com [6], the Ombudsman of the Republic of Indonesia (ORI) has received 817 complaints from the public regarding data manipulation in the distribution of social assistance during the Covid-19 pandemic. There are 2 types of errors in analyzing miss-targeting, namely under coverage and leakage [7].
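The two error types can be illustrated with a minimal Python sketch. The formulas are an assumption, since the text does not define them: here under coverage is taken as the share of eligible households that receive no benefit, and leakage as the share of beneficiaries who are not eligible. The records and field names are hypothetical.

```python
def targeting_errors(households):
    """Return (under_coverage, leakage) as fractions.

    Assumed definitions (not taken from the paper):
      under coverage = eligible households without benefit / all eligible
      leakage        = ineligible households with benefit  / all beneficiaries
    """
    eligible = [h for h in households if h["eligible"]]
    beneficiaries = [h for h in households if h["receives_pkh"]]
    excluded = [h for h in eligible if not h["receives_pkh"]]
    wrongly_included = [h for h in beneficiaries if not h["eligible"]]
    under_coverage = len(excluded) / len(eligible) if eligible else 0.0
    leakage = len(wrongly_included) / len(beneficiaries) if beneficiaries else 0.0
    return under_coverage, leakage


# Hypothetical toy data: 4 eligible households, 3 beneficiaries.
sample = [
    {"eligible": True, "receives_pkh": True},
    {"eligible": True, "receives_pkh": True},
    {"eligible": True, "receives_pkh": False},   # counted as under coverage
    {"eligible": True, "receives_pkh": False},   # counted as under coverage
    {"eligible": False, "receives_pkh": True},   # counted as leakage
    {"eligible": False, "receives_pkh": False},
]
under_coverage, leakage = targeting_errors(sample)
```

Other studies normalize leakage by the number of ineligible households instead; the denominator should be stated whenever the two rates are reported.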
The situation above indicates that poverty reduction strategies must be made effective, efficient, and transparent through the application of Big Data technology with a machine learning approach. Big Data analytics assists in finding valuable decisions by understanding data patterns with the help of machine learning algorithms [8]. There are several opportunities to use Big Data in the public sector, including obtaining feedback and public responses from government service information systems and from social media as a basis for policy making and improving public services [9]. The term machine learning is used in connection with the growth and availability of large amounts of data, whether structured or unstructured. Machine learning is a sub-field of artificial intelligence that is widely researched and used to solve various problems [10].
Research conducted by Fitriani aimed to determine the eligibility of PKH beneficiaries by comparing the C4.5 and Naïve Bayes algorithms using the RapidMiner tool [11]. The data covered 1,109 residents. The results show that the C4.5 algorithm had the highest accuracy, 91.25%, with an AUC of 0.930, while the Naïve Bayes method had an accuracy of 87.11% and an AUC of 0.923. In another study, Sugianto and Maulana found that classification with the Naïve Bayes algorithm gave an accuracy of 58.29%, precision of 92.90%, recall of 21.84%, AUC of 0.765, and F-measure of 34.42%, while the Decision Tree algorithm gave an accuracy of 73.97%, precision of 85.04%, recall of 61.92%, AUC of 0.746, and F-measure of 71.17% [12].
Building on this previous research, and with attention to the validity of the data used in machine learning, this study focuses on determining the performance of machine learning algorithms with deep learning characteristics. It also compares the resulting level of miss-targeting of PKH recipients with that of the data processed by the Indonesian Ministry of Social Affairs using the Social Welfare Information System-Next Generation (SIKS-NG).
The data and measurement indicators used are the same; only the data processing tools differ. The tool used in this research is RStudio, an integrated development environment (IDE) for the R programming language and statistical analysis, which is supported by many packages and serves as an interpreter. Machine learning mechanisms for large-scale multidimensional data from multiple sources are indispensable, because they facilitate a more accurate determination of the poor [13]. Machine learning as a field of artificial intelligence in Indonesia is still dominated by actors in the business sector. As with big data, this is because many players in the business sector already have well-established data processing infrastructure [14]. There are strong reasons to believe that intelligent data analysis with machine learning will become more widespread as a necessary ingredient of technological advances, especially in the formulation of public policies. Therefore, this study aims to find a model that can reduce the error rate of PKH aid distribution in Sidoarjo Regency, so that it can contribute to national development.

Problem
The problem examined in this research concerns the use of big data as a source, namely data on PKH recipients in Sidoarjo Regency. In practice, the program is often miss-targeted, which has caused conflict. Big Data analytics helps in finding valuable decisions by understanding data patterns with the help of machine learning algorithms [8]. The machine learning approach has become one of the mainstays of information technology, supported by the large amount of data available.

Data Collection
The main data processed come only from Tanggulangin District, with 5,688 poor people, and Candi District, with 7,214 poor people. The poverty data of the other sub-districts are not ready to be released because the data are very sensitive, highly confidential, and have not been anonymized (de-identified); the confidentiality of personal data therefore needs to be guaranteed.
There are 14 variables to determine poor households. The 14 variables used in this study are depicted in Table 1.
2. Type of residential floor made of soil / bamboo / cheap wood
3. Type of residential wall made of bamboo / thatch / low-quality wood / walls without plaster
4. No defecation facilities, or facilities shared with other households
5. Household lighting source does not use electricity
6. Drinking water comes from wells / unprotected springs / rivers / rainwater
8. Consuming meat / milk / chicken once a week
9. Purchasing a set of clothes only once a year
10. Having meals once or twice a day
11. Unable to pay medical costs at the community health center / polyclinic
12. Source of income of the head of household is: farmer with a land area of 500 m², farm worker, fisherman, construction worker, plantation worker, and / or other occupation with an income below Rp 600,000 per month
13. Highest education of the head of household

Identification of Required Data
The data obtained from the Social Service Office of Sidoarjo Regency cover these 14 variables; however, when the data are verified and validated on the predetermined form, the variables branch into 70 sub-variables. Of these 70 sub-variables, for both the Tanggulangin District and Candi District poverty data, only 55 are used in machine learning, because the remaining data are sensitive and do not affect the result. The classification model has the following classes: decile 1, 2, 3, 4, and 4+. Decile 1 is the PKH recipient class, which is the focus of this study; decile 2 is the Non-Cash Food Assistance (BPNT) recipient class; deciles 3 and 4 are the Healthy Indonesia Card (KIS) class; and the last class, decile 4+, serves as reserve data in case the quota for social assistance has not been met. Deciles 2, 3, 4, and 4+ are used only as a comparison, considering that the data used are poverty data, which are the determining measure for all social protection programs.
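The class structure described above can be written down as a small lookup table. This Python sketch only restates the mapping from the text; the label strings, the function name, and the binary view of decile 1 versus all other classes are illustrative assumptions, not the encoding used in the study.

```python
# Decile class -> social protection program, as described in the text.
# The string labels are hypothetical; the source data may encode them differently.
DECILE_PROGRAM = {
    "1": "PKH",       # Family Hope Program, the focus of the study
    "2": "BPNT",      # Non-Cash Food Assistance
    "3": "KIS",       # Healthy Indonesia Card
    "4": "KIS",
    "4+": "reserve",  # reserve data if the quota has not been met
}


def is_pkh_target(decile_class):
    """Return True only for decile 1, the PKH recipient class."""
    return DECILE_PROGRAM.get(decile_class) == "PKH"


labels = ["1", "3", "4+", "1", "2"]  # hypothetical class labels
binary_target = [is_pkh_target(c) for c in labels]
```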

Data Pre-processing
After data preparation comes the data pre-processing stage. Before the data are ready for training and testing, pre-processing is needed so that the classifier works better [16]. At this stage, a statistical technique is applied, namely Principal Component Analysis (PCA).
This stage is used to visualize multivariate data. PCA helps detect target errors by showing how many points overlap in the data set, with the goal of avoiding errors in the target data obtained. In addition, PCA compresses the sub-variables without removing the original characteristics of the parent variables.
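To make the compression idea concrete, the following Python sketch finds the first principal component of a tiny data set by power iteration on the covariance matrix. It is an illustration under stated assumptions, not the routine used in the study (which worked in R): the function name, the two-column toy data, and the choice of power iteration are all hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    Power iteration on the sample covariance matrix. Assumes the data
    are not constant, so the covariance matrix is non-zero.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the component: one value replaces d columns.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical data in which the second column is exactly twice the first.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(rows)
```

Because the two toy columns are perfectly correlated, a single score per row carries all of the variation, which is the sense in which PCA compresses correlated sub-variables.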

Algorithm Selection
At this stage, the machine learning algorithm is selected. There are several types of algorithms in machine learning, such as Nearest Neighbor, Naive Bayes, KNN Classification, Support Vector Machine, AdaBoost, Random Forest, Decision Tree, Neural Network, Bayesian Networks, K-Means Clustering, and others [17], [18]. Several algorithms are compared to obtain the best result among them. This requires calling the caret library, available in RStudio, which automatically tunes the algorithm parameters to find the highest accuracy value.
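The idea of tuning a parameter by picking the value with the highest resampled accuracy can be sketched in plain Python. This is not the caret implementation and not the Averaged Neural Network used in the study: a small k-nearest-neighbour classifier stands in as the model, and the data, fold scheme, and function names are hypothetical.

```python
def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    # Ties are broken in favour of the smallest label.
    return max(sorted(votes), key=lambda label: votes[label])


def cv_accuracy(data, k, folds=4):
    """Pooled cross-validated accuracy; item i is assigned to fold i % folds."""
    correct = 0
    for f in range(folds):
        test = [item for i, item in enumerate(data) if i % folds == f]
        train = [item for i, item in enumerate(data) if i % folds != f]
        correct += sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(data)


# Hypothetical two-class data: class 0 near (0, 0), class 1 near (5, 5).
data = [
    ((0.0, 0.1), 0), ((5.0, 5.1), 1), ((0.2, 0.0), 0), ((5.2, 4.9), 1),
    ((0.1, 0.3), 0), ((4.8, 5.0), 1), ((0.3, 0.2), 0), ((5.1, 5.3), 1),
]
candidates = [1, 3, 5]
scores = {k: cv_accuracy(data, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])  # first maximum wins on ties
```

On this toy data a neighbourhood of 5 is larger than the number of same-class training points in each fold, so it scores worse than the smaller values; the same select-by-accuracy loop applies to any model with tunable parameters.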

Training
Training follows the algorithm selection stage: the selected algorithm model studies the patterns in the data given to it. In other words, the algorithm provides direction so that the trained machine can find correlations or learn patterns from the given data on its own.

Evaluation with test set
The evaluation aims to assess the classification model. The evaluation in this study uses a confusion matrix, from which three measures are taken: accuracy, precision, and recall. According to [19], a confusion matrix contains information about the actual classifications and the classifications predicted by a classification system. Accuracy and precision measure the performance of the model generated by the ANN algorithm, while recall measures the level of miss-targeting in PKH.
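The three measures follow directly from the four cells of a binary confusion matrix. The Python sketch below uses the standard definitions; the labels are hypothetical toy data, not the study's results, and treating "recommended for PKH" as the positive class is an assumption.

```python
def confusion_counts(actual, predicted, positive=True):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1   # recommended but not actually eligible
        elif a == positive:
            fn += 1   # eligible but not recommended
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluate(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall


# Hypothetical labels: True means "should receive PKH".
actual    = [True, True, True, True, False, False, False, False, True, False]
predicted = [True, True, True, False, False, False, True, False, True, False]
tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy, precision, recall = evaluate(tp, fp, fn, tn)
```

Under this reading, 1 minus recall is the share of eligible households that the classifier fails to recommend, which is why recall serves as the miss-targeting indicator.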

Results Evaluation
As explained in chapter III, the focus of this research is on the decile 1 category, namely PKH aid recipients. There are 5,688 poor people in Tanggulangin Subdistrict and 7,214 poor people in Candi Subdistrict, so the total population of the two sub-districts is 12,902 people. The picture in 4.8, however, shows a total population of 12,269. This is because, when the data were identified, several residents in the percentile sub-instrument were detected as 'NULL', that is, they had no percentile value (not zero but empty), which indicates that they were not included in the category of beneficiaries. A total of 633 people had no percentile value, so the data actually processed cover 12,269 people.
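The distinction between an empty percentile and a percentile of zero matters when filtering. The Python sketch below shows the filter on hypothetical records (the field names are assumptions) and then reproduces the totals reported in the text.

```python
# Hypothetical records; None stands for a 'NULL' (empty) percentile.
records = [
    {"id": 1, "percentile": 12},
    {"id": 2, "percentile": None},  # no percentile value: excluded
    {"id": 3, "percentile": 0},     # zero is a real value: kept
    {"id": 4, "percentile": 37},
]
# Test for None explicitly; a truthiness test would wrongly drop the zero.
usable = [r for r in records if r["percentile"] is not None]

# Totals reported in the text.
total_poor = 5688 + 7214          # Tanggulangin + Candi
without_percentile = 633
processed = total_poor - without_percentile
```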
In the SIKS-NG data processing for the decile 1 category, 1,946 poor people were recommended to receive PKH assistance, while 1,195 people were not recommended. In the machine learning data processing, by contrast, 1,255 people were recommended to receive PKH assistance, while 1,886 people were not recommended. The comparison of the results is shown in Figure 2. In addition, for high-dimensional data with a large number of variables, machine learning is able to recognize informative patterns in the data even when the variables are quite complex. This is evidenced by its higher accuracy and lower level of miss-targeting compared with the data processed using the SIKS-NG application.
The evaluation uses 3 main indicators, namely accuracy, precision, and recall. The comparison of the three can be seen in Figure 4.
Figure 7. Accuracy, precision, and recall values for SIKS-NG and Machine Learning

CONCLUSIONS
Based on the confusion matrix evaluation of the Averaged Neural Network model, the machine learning approach is superior to SIKS-NG on every indicator. First, the accuracy produced by machine learning is 81.18%, while that of SIKS-NG is 72.40%, an improvement of 8.78 percentage points. Second, the precision produced by machine learning is 95.37%, while that of SIKS-NG is 91.01%, an improvement of 4.36 percentage points. Third, recall is 82.19% for machine learning and 75% for SIKS-NG, an improvement of 7.19 percentage points. In addition, the number of recommended PKH recipients fell from 1,946 people to 1,255. Thus, in public governance, machine learning offers an innovative alternative that improves the efficiency of government management and is able to select PKH aid recipients more accurately.
Regardless of whether SIKS-NG, whose method is confidential, uses the same or a different approach, machine learning with the Averaged Neural Network model, which has a high level of accuracy, can be an alternative recommendation for automatic decision making and innovative management practices.