OVERSAMPLING METHOD TO HANDLING IMBALANCED DATASETS PROBLEM IN BINARY LOGISTIC REGRESSION ALGORITHM

Class imbalance is a condition in which one class makes up a much larger share of the data than the other, which can degrade classification accuracy. Logistic regression is one data mining method that can be used for classification. The method used in this research is RWO-Sampling with a random replicate approach for generating synthetic data on discrete attributes. The results show that the method can handle the class imbalance problem: RWO-Sampling with the random replicate approach achieves better accuracy than both RWO-Sampling with the roulette approach and random oversampling (ROS). Compared with RWO-Sampling with the roulette approach, RWO-Sampling with the random replicate approach increases accuracy by an average of 15.55% per dataset. Compared with the ROS method, it increases accuracy by an average of 3.7% per dataset. Furthermore, in testing the underfitting problem in logistic regression, oversampling outperforms non-oversampling, with accuracy increasing by an average of 2.3% per dataset.


INTRODUCTION
Data imbalance, also known as class imbalance, is a condition in which the training data are unevenly distributed between one class and another [1]. The class that holds the smaller share of the data is called the minority class, whereas the class that holds the larger share is the majority class. Under class imbalance, the minority class is harder to predict than the majority class, yet the minority class sometimes carries the more important information. Overcoming this requires algorithms that can predict the correct class label so as to obtain high accuracy [2]. In data mining, the class imbalance problem is handled within classification. Several methods can be used for classification, one of which is Logistic Regression [3]. Logistic Regression has been shown to be a linear classifier that produces powerful classification and is very easy to apply (Lin et al., 2008). Its disadvantage is that it is vulnerable to underfitting when used on data with unbalanced classes, which can affect accuracy.
There are three approaches for dealing with unbalanced classes: the data-level approach, the algorithm-level approach, and the ensemble method [4]. The data-level approach consists of various resampling and data manipulation techniques that correct the skew in the class distribution of the training data. The algorithm-level approach adjusts the operation of existing algorithms so as to make the classifier more conducive to classifying the minority class. The algorithm-level approach and the ensemble method share the same goal, which is to improve the classifier without altering the data.
This research uses an oversampling method, RWO-Sampling with a random replicate approach, to generate synthetic data on discrete attributes, and uses Logistic Regression as the classification method. This approach is expected to overcome the less-than-optimal generation of synthetic data on discrete attributes, to overcome the class imbalance problem, and to improve the performance of Logistic Regression as the classifier.

METHODS
This research addresses problems in the RWO-Sampling method and in the binary logistic regression classification algorithm. With the roulette approach, RWO-Sampling generates synthetic data on discrete attributes less than optimally. In contrast to previous research, this study uses a random replicate approach for synthetic data generation on discrete attributes.

System Description
Random Walk Oversampling (RWO-Sampling) is one of the oversampling methods used in research on imbalanced datasets. In application, however, the method does not generate synthetic data on discrete attributes optimally. Based on the analysis carried out, the solution is to use the random replicate approach to generate synthetic data that match the discrete attribute. This study therefore uses the random replicate approach in place of the roulette approach used by the RWO-Sampling method.

Analysis of System Structure
In this study, the original dataset is grouped by class so as to separate the majority data from the minority data. This grouping is necessary because the study focuses on the minority data only. Each attribute of the minority class is then identified as either discrete or continuous. Once the attributes are identified, synthetic data are generated following the stages for each attribute type, and generation is repeated until the number of minority samples matches the number of majority samples. After the majority and minority data are balanced, classification is carried out with the binary logistic regression algorithm and evaluated using accuracy, AUC, F-measure, and G-mean.
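The evaluation measures named above can be illustrated with a short sketch. The following Python is a minimal example, not the study's implementation: it assumes the minority class is labelled 1, uses the common definitions of F-measure (harmonic mean of precision and recall) and G-mean (geometric mean of sensitivity and specificity), and omits AUC for brevity. The function names `confusion_counts` and `evaluate` are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, F-measure, and G-mean for binary predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    g_mean = math.sqrt(recall * specificity)
    return {"accuracy": accuracy, "f_measure": f_measure, "g_mean": g_mean}
```

For a classifier that always predicts the majority class on a dataset with 2 minority and 8 majority samples, this sketch gives an accuracy of 0.8 while F-measure and G-mean are both 0, which is why accuracy alone can be misleading on imbalanced data.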

RWO-Sampling Method
RWO-Sampling is an oversampling method that works by generating new data from the minority class. The new data are generated from the mean and standard deviation of the minority class data. RWO-Sampling works as follows: first, the dataset is grouped into minority and majority classes; next, the difference in size between the majority class and the minority class is calculated; then each attribute is identified as continuous or discrete.
Synthetic data on continuous attributes are then generated by calculating the mean μ_i and the standard deviation σ_i of each attribute, as shown in Equation (1) and Equation (2).

$$\mu_i = \frac{1}{n}\sum_{j=1}^{n} a_i(j) \qquad (1)$$

$$\sigma_i^2 = \frac{1}{n}\sum_{j=1}^{n}\left(a_i(j) - \mu_i\right)^2 \qquad (2)$$

where:
σ_i^2 : variance of the i-th attribute
μ_i : mean of the i-th attribute
a_i(j) : value of the i-th attribute in the j-th sample
n : number of samples

The new value is then estimated from σ_i using Equation (3).

$$a_i'(j) = a_i(j) - \frac{\sigma_i}{\sqrt{n}}\, r_j \qquad (3)$$

where:
a_i(j) : value of the i-th attribute in the j-th sample
r_j : value drawn from the normal distribution N(0,1) for sample j
σ_i : standard deviation of the i-th attribute
n : total number of samples

Equation (3) is called the random walk model and is used in the next step, the formation of new data on continuous attributes.
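The continuous-attribute step can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the random walk model has the form a_i'(j) = a_i(j) − (σ_i/√n)·r_j with r_j drawn from N(0,1), uses the population standard deviation (dividing by n), and cycles through the minority samples when more synthetic values are needed than there are samples. The function name `rwo_continuous` and its parameters are hypothetical.

```python
import math
import random
import statistics


def rwo_continuous(values, n_new, rng=None):
    """Generate n_new synthetic values for one continuous minority attribute."""
    rng = rng or random.Random()
    n = len(values)
    sigma = statistics.pstdev(values)   # population standard deviation (assumption)
    step = sigma / math.sqrt(n)         # scale of the random walk
    synthetic = []
    for k in range(n_new):
        base = values[k % n]            # cycle through the minority samples
        synthetic.append(base - step * rng.gauss(0.0, 1.0))
    return synthetic
```

Because each synthetic value is an original value shifted by zero-mean noise scaled by σ_i/√n, the synthetic values stay centred on the minority data; an attribute whose values are all identical has σ_i = 0 and is simply replicated.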
For discrete attributes, the first step is to calculate the probability of occurrence of each value; synthetic data are then generated using the roulette approach.
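The roulette approach for discrete attributes can be sketched as follows. This is a hedged illustration rather than the original implementation: it assumes that "roulette" means drawing each synthetic value with probability equal to its relative frequency in the minority class. The function name `roulette_discrete` is hypothetical.

```python
import random
from collections import Counter


def roulette_discrete(values, n_new, rng=None):
    """Draw n_new synthetic values with probability proportional to frequency."""
    rng = rng or random.Random()
    counts = Counter(values)
    categories = list(counts)
    # Probability of occurrence of each distinct value in the minority data.
    probabilities = [counts[c] / len(values) for c in categories]
    return rng.choices(categories, weights=probabilities, k=n_new)
```

For example, with minority values `[2, 2, 1, 1, 3]`, the values 1 and 2 each have probability 0.4 and the value 3 has probability 0.2.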

Binary Logistic Regression
According to [5], logistic regression is a statistical analysis method that describes the relationship between a categorical response variable, with two (binary) or more categories, and one or more predictor variables. A binary response variable takes only the value 1 for the presence of a characteristic and 0 for its absence. Logistic regression models are used to estimate the probability of an event and to compare the risk of the event occurring by accounting for the factors that influence it.
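The Bernoulli probability function is as follows.

$$f(y_i) = \pi_i^{y_i}\,(1 - \pi_i)^{1 - y_i}, \qquad y_i \in \{0, 1\}$$

where π_i is the probability of the i-th event and y_i is the i-th random variable. If y_i = 0 then f(y_i) = 1 − π_i, and if y_i = 1 then f(y_i) = π_i. The logistic regression model, whose values lie in the range between 0 and 1, is obtained using the logistic function as follows:

$$\pi(x) = \frac{\exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}{1 + \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)}$$

where x_1, ..., x_p are the predictor variables and β_0, ..., β_p are the regression coefficients. π(x) is a non-linear function, so it needs to be transformed into logit form to obtain a linear function, so that the relationship between the independent variables and the dependent variable can be seen.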
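The transformation into logit form can be written out as a short derivation. The symbols z, β_0, ..., β_p, and x_1, ..., x_p are standard notation chosen here for illustration and are an assumption about how the model is written, with z denoting the linear predictor.

```latex
% Linear predictor (notation assumed for illustration)
z = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p

% Logistic function and its complement
\pi(x) = \frac{e^{z}}{1 + e^{z}}, \qquad
1 - \pi(x) = \frac{1}{1 + e^{z}}

% The odds are the ratio of the two
\frac{\pi(x)}{1 - \pi(x)} = e^{z}

% Taking the natural logarithm gives the logit, which is linear in x
g(x) = \ln\!\left(\frac{\pi(x)}{1 - \pi(x)}\right)
     = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
```

The logit g(x) ranges over all real numbers while π(x) stays between 0 and 1, which is why the linear relationship is expressed on the logit scale.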

References
[6] conducted research on data with unbalanced classes. They proposed an oversampling method that first creates class boundaries and then generates new data from the minority class by calculating the mean and standard deviation of the data. The classification algorithms used were C4.5, Naive Bayes, and Neural Network, and the dataset used was the Diabetes dataset. Accuracy, F-measure, and G-mean were used to measure the performance of the proposed method, with 10-fold cross-validation as the validation method.
[7] studied data with unbalanced classes by introducing a method that creates boundaries between classes and then performs random undersampling (RUS) on the majority data and SMOTE on the minority data. The datasets used came from the UCI datasets. The classification algorithms used were KNN, BP, and Naive Bayes. Precision, recall, F-measure, and G-mean were used to measure the performance of the introduced method.
[8] conducted research on data with unbalanced classes and applied the AdaBoost model. Data in the majority class were removed using random undersampling (RUS), and new data in the minority class were generated using SMOTE. The datasets used were the HDDT collection and the KEEL collection. The classification algorithms used were Neural Network and Support Vector Machine (SVM). AUC, F-measure, and G-mean were used to measure the performance of the proposed method, with 5-fold cross-validation as the validation method.
[9] studied data with unbalanced classes and proposed a fuzzy ensemble method. The method creates boundaries between classes and generates new data from the minority class using random oversampling (ROS). The datasets used were real-world datasets. The classification algorithm used was Support Vector Machine (SVM). The validation method was 5-fold cross-validation, and G-mean and F-measure were used to measure the performance of the proposed method.
[10] conducted research on unbalanced classes by proposing a method that creates boundaries between classes and eliminates data that lie far from the established boundaries. The datasets used were real-world datasets. The classification algorithm used was Neural Network. G-mean and AUC were used to measure the performance of the proposed method, with 5-fold cross-validation as the validation method.
The present research handles data with unbalanced classes by creating class boundaries and using an oversampling method to generate new data from the minority class by calculating means and standard deviations. The datasets used in this study are the NASA MDP and UCI Repository datasets. Accuracy, AUC, and F-measure are used to measure the performance of the proposed method.

Illustration of synthetic data generation in continuous attribute
Using the dataset in Table 1, where X1 is defined as the continuous attribute.

Illustration of generating synthetic data in discrete attributes
Using the dataset in which X2 is defined as the discrete attribute:
- Take the discrete attribute data X2 = {2, 2, 1, 1, 3}.
- Draw one value at random from attribute X2, e.g. a_random = 2.
The value drawn in the previous step is then duplicated and used as the synthetic value of the discrete attribute.
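The random replicate step illustrated above can be sketched in a few lines of Python. This is a minimal sketch, assuming that each synthetic discrete value is obtained by picking one existing minority value uniformly at random and copying it; the function name `random_replicate` is hypothetical and not taken from the study.

```python
import random


def random_replicate(values, n_new, rng=None):
    """Create n_new synthetic discrete values by copying randomly chosen ones."""
    rng = rng or random.Random()
    # Each draw picks one existing minority value and duplicates it unchanged.
    return [rng.choice(values) for _ in range(n_new)]
```

Calling `random_replicate([2, 2, 1, 1, 3], 3)` returns three values, each of which is guaranteed to be one that already occurs in the minority data.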

Illustration of calculation with Logistic Regression
- Obtain the new weights by updating the previous weights; the results are displayed in Table 4. Then calculate the probability using the logistic function. The result of the probability calculation is 0.99. Because the probability is greater than 0.5, the sample belongs to class 1.
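The final classification step, computing a probability and comparing it with 0.5, can be sketched as follows. The weights, bias, and input values used with this sketch are hypothetical and do not reproduce the values in Table 4; the function names `predict_proba` and `predict_class` are likewise assumptions made for illustration.

```python
import math


def predict_proba(weights, bias, x):
    """Return the logistic probability of class 1 for one sample x."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    # Numerically stable logistic function for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_class(weights, bias, x, threshold=0.5):
    """Assign class 1 when the probability exceeds the threshold, else class 0."""
    return 1 if predict_proba(weights, bias, x) > threshold else 0
```

With hypothetical weights `[1.0, 2.0]`, bias `0.5`, and sample `[1.0, 2.0]`, the linear predictor is 5.5, the probability is well above 0.5, and the sample is assigned to class 1.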

CONCLUSIONS
Based on the results obtained, the following conclusions are drawn. The comparison of RWO-Sampling with the roulette approach, RWO-Sampling with the random replicate approach, and Random Oversampling (ROS) shows that RWO-Sampling with the random replicate approach produces more accurate classification. Relative to RWO-Sampling with the roulette approach, RWO-Sampling with the random replicate approach increases accuracy by an average of 15.55% per dataset. Relative to the ROS method, it increases accuracy by an average of 3.7% per dataset. The comparison of non-oversampling and oversampling shows that oversampling produces better classification, with accuracy increasing by an average of 2.3% per dataset. This indicates that the logistic regression algorithm is susceptible to underfitting when used on unbalanced datasets.