Outlier Detection in Credit Card Transactions Using the Local Outlier Factor (LOF) Algorithm



INTRODUCTION
Credit cards make a wide range of transactions easy, but they also open the door to threats such as fraud. Based on data from the Identity Theft Resource Center (ITRC) and CyberScout, credit card identity theft in the United States had increased by 29% by the end of June 2017. The ITRC estimated an increase from 1091 cases in 2016 to 1500 cases in 2017, with total losses of around $2530 to $3073. Laws governing this criminal activity are in place: if the perpetrator causes a loss of $500, the fine is $1000, and if the loss exceeds $500, the fine is $25000 with a prison sentence of 15 years. The severity of these penalties has not reduced the crime, so such cases remain a concern for both banks and customers.
Banks, as credit card service providers, have made every effort to handle these cases, but perpetrators can still stay ahead of those efforts. The main problem is delayed handling: the bank learns that a customer's credit card has been compromised only after the customer reports charges that do not match their usage. All credit card transactions, such as buying goods and paying installments, are stored in the bank's database, which eventually grows into a large body of data. To use this data for a specific purpose, an approach is needed to extract knowledge from it with artificial intelligence, namely data mining.
Some experts describe data mining as the analysis step of the knowledge discovery process in databases, or "knowledge discovery in databases", abbreviated as KDD. According to [1], data mining is a process that uses statistical techniques, mathematics, artificial intelligence, and machine learning to extract and identify useful information and related knowledge from large databases.
Credit card breaches tend to share a similar average transaction pattern, so anticipating a breach where possible would reduce the losses incurred. Although all transaction data has been stored, little information technology is involved in detecting transactions made by credit card thieves. Detection is not easy because it requires special methods. [2] used the K-Means algorithm to group data into clusters in order to detect outliers. [3] used the DBSCAN algorithm, a density-based approach, to form data groups and thereby find outliers.
Until now, outlier identification has relied on algorithms designed to group data rather than algorithms specifically intended for detecting outliers. For the case at hand, an algorithm is needed that is designed specifically to identify outliers, that achieves high accuracy, recall, and precision, and that can detect outliers in multivariate data. The local outlier factor (LOF) algorithm has been used by previous researchers to detect outliers in multivariate data with high accuracy, recall, and precision. [4] tested ten different datasets and obtained an average accuracy of 75.2%, with a highest accuracy of 98%, showing that the LOF algorithm can be used to detect outliers in multivariate data. [5] used the LOF algorithm to detect outliers in a computer network and obtained an accuracy of about 84%.
[6] tested outlier detection on Hong Kong street traffic datasets and obtained an accuracy of around 96%. Based on the criteria set out above, this study uses the local outlier factor (LOF) algorithm as the outlier detector. The research focuses on identifying transaction patterns from stored credit card transaction data, using outlier analysis. The expectation is that the LOF algorithm can identify outliers in credit card transaction data so that the results can serve as an early warning for the bank.

1 Data Collection/Retrieval
Data was collected by visiting the data.world website and searching for credit card transaction data; once the data was found, it was downloaded. The downloaded document is an Excel document, procurement-card-transactions-1q-2q-2017_1. An example of the collected dataset is shown in Table 1.
Table 1 Example dataset

2 System Analysis
The system built must meet the following predefined requirements:
1. The system can display outlier data visualizations for each customer.
2. The system displays the set of transaction data from each customer that the system deems to be outliers.
3. The user can change or add a minPts value, which is used as the parameter value of the LOF algorithm.
4. The user can add or import transaction data in Excel format.

3 Use Case Diagram
The use case diagram in Figure 1 shows that an employee can upload or import a collection of transaction data made by several customers in Excel format. Uploaded data is inserted into the database and then processed using the LOF algorithm. The outlier detection results, in the form of outlier data sets, are visualized. An employee can also change and input the value of MinPts for the LOF algorithm, which limits the range or number of nearest neighbors used to define the local neighborhood of a data object.

4 System Outline
The system built displays the transactions that the LOF algorithm considers outliers to the bank's employees. An explanation of the system is shown in Table 2 and the flowchart of the system is shown in Figure 2.

2 Data outlier: The results of outlier detection with the LOF algorithm will be used by the bank as an early warning of transactions that are likely to be fraudulent.
3 MinPts: The value of minPts can be changed by the user as a parameter of the LOF algorithm for detecting outliers. MinPts is the range or number of nearest neighbors used to define the local neighborhood of an object.
4 Import data: Users can add transaction data by using an Excel document.

The system displays the set of transaction data deemed to be outliers for each customer. The process starts with importing data, followed by preprocessing, which includes data cleaning, data transformation, and data standardization. The preprocessing results are inserted into a database and then processed by the LOF algorithm. The LOF algorithm produces a set of outlier transaction data, which is displayed on a page that can be reviewed for each customer.

5 Preprocessing
Preprocessing prepares data for the data mining process. It is carried out as follows:

5.1 Data Cleaning
Data cleaning is the process of eliminating or replacing null or missing values. In this research, data cleaning is done manually, either by filling in the empty value according to the type of the data attribute or by deleting the data row directly. For a categorical attribute, the most frequently occurring value is used, while a numeric attribute is assigned a value of one or the minimum value.
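The filling rule described above can be sketched in a few lines of Python. This is only an illustration: the paper performs cleaning manually, the function name fill_missing and the sample rows are hypothetical, and the sketch assumes the "minimum value" option for numeric attributes rather than the value of one.

```python
from collections import Counter


def fill_missing(rows, categorical, numeric):
    """Return copies of rows with None values filled in.

    Categorical attributes receive the most frequent observed value;
    numeric attributes receive the minimum observed value (an assumption,
    the text also allows a value of one).
    """
    fill = {}
    for col in categorical:
        seen = [r[col] for r in rows if r[col] is not None]
        fill[col] = Counter(seen).most_common(1)[0][0]
    for col in numeric:
        fill[col] = min(r[col] for r in rows if r[col] is not None)
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical sample rows, not taken from the dataset used in the paper.
rows = [
    {"category": "Fuel", "state": "AZ", "amount": 40.0},
    {"category": None, "state": "AZ", "amount": 12.5},
    {"category": "Fuel", "state": "CA", "amount": None},
    {"category": "Office", "state": None, "amount": 99.9},
]
cleaned = fill_missing(rows, ["category", "state"], ["amount"])
```

Deleting incomplete rows, the other option mentioned in the text, would simply filter out any row that contains a None value instead of filling it.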

5.2 Data Transformation
Data transformation is a process that converts or merges data into the appropriate format before it is processed in data mining. The system performs data transformation dynamically at the time of outlier detection.

5.3 Data Standardization
Standardization, or normalization, of data uses the Z-score method to rescale data when one of the attributes has much larger values than the others. In this research, only the amount attribute is standardized, because its values are greater than those of the other attributes.
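As a minimal sketch of Z-score standardization applied to the amount attribute, the following Python snippet subtracts the mean and divides by the standard deviation. The function name z_scores and the sample amounts are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which variant is used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # All values are identical, so every standardized value is zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical transaction amounts; the last one is far larger than the rest.
amounts = [10.0, 20.0, 30.0, 40.0, 1000.0]
scaled = z_scores(amounts)
```

After standardization the amount values are on a scale comparable with the 0 or 1 differences produced by the categorical attributes, which keeps a single large amount from dominating the distance calculation.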

6 Local Outlier Factor Algorithm
Based on [7], the local outlier factor (LOF) algorithm can be used to detect outliers in a set of data; the flowchart of the LOF algorithm is shown in Figure 3. The first phase of the algorithm requires an input dataset, i.e. the transaction data. Before outlier detection, preprocessing is performed to ensure that the processed data is of good quality, so that outlier detection yields high accuracy. The first preprocessing stage is data cleaning, which eliminates dirty data. The second stage is data transformation, which converts data that is initially in the form of sentences or words into numerical form so that calculations can be performed on it.

Figure 3 Flowchart of the local outlier factor algorithm
The third preprocessing stage is standardization, or normalization, of data with the Z-score method, which rescales the data when one of the attributes has much larger values than the others. The equation of the Z-score method is shown in Equation

Description of Equation 4:
If the value of the State attribute in the first row equals the value of the State attribute in the second row, the distance is 0 (zero); otherwise it is 1 (one).
For the amount attribute, the numeric distance is calculated with the Euclidean distance formula. The results of the mixed Euclidean distance calculation are sorted in ascending order. The lowest values are then taken, as many as the value of minPts, and the greatest of these is the K-distance value. Once the K-distance value of each object has been determined, the nearest neighbors of each object are determined from its K-distance value.
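The distance and K-distance steps can be sketched as follows. The way the per-attribute differences are combined (square root of the sum of squares) is an assumption, because Equation 4 itself is not reproduced here; the function names mixed_distance and k_distance and the sample records are hypothetical, and the amount values are assumed to be standardized already.

```python
import math


def mixed_distance(a, b):
    """Distance between two transactions with mixed attribute types.

    Categorical attributes contribute 0 when equal and 1 otherwise;
    the numeric amount contributes its plain difference.
    """
    cat = 0.0 if a["category"] == b["category"] else 1.0
    st = 0.0 if a["state"] == b["state"] else 1.0
    amt = a["amount"] - b["amount"]
    # Assumption: the parts are combined as in the Euclidean distance.
    return math.sqrt(cat ** 2 + st ** 2 + amt ** 2)


def k_distance(data, i, min_pts):
    """Distance from object i to its min_pts-th nearest other object."""
    dists = sorted(
        mixed_distance(data[i], data[j]) for j in range(len(data)) if j != i
    )
    return dists[min_pts - 1]


# Hypothetical records with already standardized amounts.
data = [
    {"category": "Fuel", "state": "AZ", "amount": 0.0},
    {"category": "Fuel", "state": "AZ", "amount": 0.1},
    {"category": "Fuel", "state": "CA", "amount": 0.2},
    {"category": "Office", "state": "AZ", "amount": 3.0},
]
kd = k_distance(data, 0, 2)
```

With min_pts = 2, the K-distance of the first record is its distance to the third record, because the second record is closer and the fourth is farther away.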
After the nearest neighbors of each object have been determined, the reachability distance and the local reachability density are calculated. The reachability distance of each object is calculated first, before the local reachability density. The purpose of the reachability distance is to treat objects in a homogeneous environment uniformly, so that the LOF value remains stable for objects in a uniform environment even when the value of minPts changes. The equation of the reachability distance is shown in Equation 5. In the equation descriptions, the number of neighbors of object p is the value of minPts.
The reachability distance is determined by taking the maximum of the mixed Euclidean distance computed for the selected object and the K-distance value of the object. The local reachability density of an object is calculated by summing its reachability distance values, dividing the sum by the value of minPts, and then taking the reciprocal (one divided by the result).
The final stage of the LOF algorithm is calculating the LOF value of each object, which is the average ratio of the local reachability density of its neighbors to its own, as shown in Equation 7. The LOF is a degree that determines whether an object is an outlier or not. It is denoted LOF_MinPts(p) and is the average ratio of the local reachability densities of the neighbors of p to that of p within a single range. For an object, the calculation sums the local reachability density of each of its neighbors, divides the sum by the object's own local reachability density, and then divides by minPts.
If the LOF value of an object is greater than the limit (threshold) value, the object is an outlier; if the LOF value is less than the limit value, the object is not an outlier. The limit value is the average of the LOF values.
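The three calculations above can be put together in a short Python sketch that works on a precomputed distance matrix. It is not the implementation used in this research: the function name lof_scores and the sample amounts are hypothetical, the reachability distance follows the standard LOF definition (the larger of the neighbor's K-distance and the actual distance), and the neighbor count is used in place of minPts so that ties at the K-distance are handled.

```python
from statistics import mean


def lof_scores(dist, min_pts):
    """Return the LOF value of every object, given a symmetric distance matrix."""
    n = len(dist)
    k_dist, neighbors = [], []
    for p in range(n):
        others = sorted((dist[p][o], o) for o in range(n) if o != p)
        kd = others[min_pts - 1][0]
        k_dist.append(kd)
        # Neighborhood: every object within the K-distance, ties included.
        neighbors.append([o for d, o in others if d <= kd])

    def reach_dist(p, o):
        # Assumption: standard definition, using the K-distance of neighbor o.
        return max(k_dist[o], dist[p][o])

    lrd = []
    for p in range(n):
        total = sum(reach_dist(p, o) for o in neighbors[p])
        # Reciprocal of the average reachability distance. A zero total only
        # occurs with many exact duplicates, which this sketch does not handle.
        lrd.append(len(neighbors[p]) / total if total > 0 else float("inf"))

    return [
        sum(lrd[o] for o in neighbors[p]) / lrd[p] / len(neighbors[p])
        for p in range(n)
    ]


# Hypothetical one-dimensional amounts; the last one lies far from the rest.
amounts = [10.0, 11.0, 12.0, 13.0, 80.0]
dist = [[abs(a - b) for b in amounts] for a in amounts]
scores = lof_scores(dist, min_pts=2)

# The limit value is the average of the LOF values, as described in the text.
limit = mean(scores)
outliers = [i for i, s in enumerate(scores) if s > limit]
```

In this small example the four clustered amounts obtain LOF values close to one, while the isolated amount obtains a much larger value and is the only object above the average.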

7 Literature Review
According to the Association of Certified Fraud Examiners (ACFE), fraud is an act to enrich oneself through the intentional misuse of an organization's resources or assets. Put another way, fraud is the illegal use of facilities, deliberately carried out through various forms of cheating, deception, or evasion by a particular person or organization. Losses caused by fraud harm a number of parties. [2] used the K-Means algorithm to detect outliers in credit card transactions; the results showed that the accuracy produced was relatively low. [3] used a density-based approach with the DBSCAN algorithm; the results were no different from those of previous researchers who used other clustering approaches.
[8] detected fraud in credit card transactions by prioritizing the time and number of transactions made by the credit card customer. The results indicate that the method can detect outliers in credit card transactions well. [9] used a Manhattan distance-based algorithm to detect outliers in credit card transactions. The results show that this approach can detect outliers well. [10] used a method different from the clustering approaches of previous research: a classification approach with the SVM algorithm as the outlier detector for credit card transactions. The results showed that the SVM algorithm could classify transaction data as outliers or not, but the resulting model is still weak because only a small amount of data, one hundred records, was used.

1 Testing of LOF, INFLO, and AVF algorithms
The LOF, INFLO, and AVF algorithms were tested using a sampling method and a confusion matrix on transaction data from five customers: Cassie Hunter with 721 transactions, Andrew Andrade with 351 transactions, Carina Orozco with 264 transactions, Araceli Delgado Ortiz with 245 transactions, and Cindy Escobar with 232 transactions. The average test results of the LOF, INFLO, and AVF algorithms on the transaction data of the five customers are shown in Table 3. Based on the average results in Table 3, the LOF algorithm scores higher than the other algorithms, with 96% accuracy, 98% recall, and 93% precision. The AVF algorithm has the lowest results, with 77% accuracy, 79% recall, and 62% precision.
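For reference, accuracy, recall, and precision follow from the four cells of a confusion matrix as sketched below. The function name confusion_metrics and the label lists are hypothetical and are not the data behind Table 3; True marks a transaction as an outlier.

```python
def confusion_metrics(actual, predicted):
    """Compute accuracy, recall, and precision from boolean outlier labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # outliers correctly flagged
    tn = sum(1 for a, p in pairs if not a and not p)  # normal data correctly passed
    fp = sum(1 for a, p in pairs if not a and p)      # normal data wrongly flagged
    fn = sum(1 for a, p in pairs if a and not p)      # outliers that were missed
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# Hypothetical labels for eight transactions.
actual = [True, True, False, False, False, True, False, False]
predicted = [True, False, False, True, False, True, False, False]
accuracy, recall, precision = confusion_metrics(actual, predicted)
```

Recall reflects how many true outliers were found, while precision reflects how many flagged transactions were really outliers, so the two can differ considerably when outliers are rare.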

2 Testing compute time of LOF, INFLO, and AVF algorithms
The computation time of the LOF, INFLO, and AVF algorithms was tested using transaction data from five customers: Cassie Hunter with 721 transactions, Andrew Andrade with 351 transactions, Carina Orozco with 264 transactions, Araceli Delgado Ortiz with 245 transactions, and Cindy Escobar with 232 transactions. The average computation time results of the LOF, INFLO, and AVF algorithms on the transaction data of the five customers are shown in Table 4.
Table 4 Average compute time test results for the five customers

3 Testing Attribute Combinations
Combinations of attributes of the credit card transaction data were tested to find the combination that yields the highest accuracy, recall, and precision with the local outlier factor (LOF) algorithm. The tests used transaction data from five customers: Cassie Hunter with 721 transactions, Andrew Andrade with 351 transactions, Carina Orozco with 264 transactions, Araceli Delgado Ortiz with 245 transactions, and Cindy Escobar with 232 transactions. The average results for the best attribute combination on the transaction data of the five customers are shown in Table 5. Table 5 shows that the combination of the category, amount, and state attributes gives average test results of 98% accuracy, 98% recall, and 98% precision.

4 Threshold Value
The threshold value in the LOF algorithm determines whether a data item is an outlier: if the LOF value is greater than the threshold, the data item is an outlier, and if it is less than or equal to the threshold, the data item is not an outlier, i.e. it is normal data. In the LOF algorithm, continuously increasing the threshold value reduces the amount of outlier data in the detection results, as the following results show.
Figure 4 shows the effect of increasing the threshold value from one to eight on Cassie Hunter's transaction data. At a threshold value of one, about 629 records are outliers, and as the threshold value is increased continuously, only 1 record is detected at a threshold value of eight. The experiment used 721 transaction records with the input parameter minPts = 236.
Figure 5 shows the effect of increasing the threshold value from one to four on Andrew Andrade's transaction data. At a threshold value of one, about 183 records are outliers, and as the threshold value is increased continuously, only 1 record is detected at a threshold value of four. The experiment used 351 transaction records with the input parameter minPts = 340.
Figure 6 shows the effect of increasing the threshold value from one to four on Carina Orozco's transaction data. At a threshold value of one, about 125 records are outliers, and as the threshold value is increased continuously, only 1 record is detected at a threshold value of four. The experiment used 264 transaction records with the input parameter minPts = 180.
Figure 7 shows the effect of increasing the threshold value from 1 to 1.8 on Araceli Delgado Ortiz's transaction data. At a threshold value of one, about 69 records are outliers, and as the threshold value is increased continuously, only 2 records are detected at a threshold value of 1.8. The experiment used 245 transaction records with the input parameter minPts = 215.
Figure 8 shows the effect of increasing the threshold value from 1 to 2 on Cindy Escobar's transaction data. At a threshold value of one, about 110 records are outliers, and as the threshold value is increased continuously, only 5 records are detected at a threshold value of 2. The experiment used 232 transaction records with the input parameter minPts = 160.
The input value of the minPts parameter affects the accuracy, recall, precision, and computation time of each algorithm. Based on the tests comparing the LOF, INFLO, and AVF algorithms, the LOF algorithm achieved the highest results: 96% accuracy, 98% recall, and 93% precision. In the computation time tests, the AVF algorithm was the fastest, with an average time of 6.29 seconds. The INFLO algorithm was second with an average time of 7.50 seconds, 0.13 seconds faster than the LOF algorithm at 7.63 seconds. Increasing the parameter values slows down the computation time of all algorithms.
The combination of the category, amount, and state attributes is the best attribute combination compared with the others, giving average results across the five customers of 98% accuracy, 98% recall, and 98% precision. Continuously increasing the boundary, or threshold, value of the local outlier factor (LOF) algorithm reduces the amount of outlier data detected.
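The effect of the threshold on the number of detected outliers can be illustrated with a small Python sketch. The LOF values below are hypothetical and are not taken from the experiments in Figures 4 to 8; the function name count_outliers is likewise only illustrative.

```python
def count_outliers(lof_values, threshold):
    """Count objects whose LOF value is strictly greater than the threshold."""
    return sum(1 for value in lof_values if value > threshold)


# Hypothetical LOF values for eight transactions.
lof_values = [0.9, 1.0, 1.1, 1.3, 1.8, 2.5, 4.2, 7.9]

# Sweep the threshold upwards and record how many outliers remain.
counts = {t: count_outliers(lof_values, t) for t in (1, 2, 4, 8)}
```

Because an object is only flagged when its LOF value exceeds the threshold, the count can never grow as the threshold rises, which matches the behaviour reported for all five customers.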
Future research is recommended to try other algorithms that take a different approach from the one in this research, and to add an optimization algorithm. The optimization algorithm would tune the LOF algorithm parameters so that users no longer need to change them manually. For testing, it is advisable to use transaction data that is already labeled as fraud or not. Further research is also expected to use credit card transactions from Indonesia, because they are likely to have transaction patterns that differ from the credit card transaction data used here.