GSA to Obtain SVM Kernel Parameter for Thyroid Nodule Classification

Support Vector Machine (SVM) is one of the most popular methods for classification problems because its training yields a globally optimal solution. However, selecting an appropriate kernel and parameter values remains an obstacle. The problem can be addressed by searching for the best parameter values with an optimization procedure. In this research, the Gravitational Search Algorithm (GSA), an optimization algorithm inspired by mass interaction and Newton's law of gravity, is used to optimize the parameters of SVM. The research hybridizes GSA and SVM to increase system accuracy. The proposed approach was implemented to improve the classification of thyroid nodules. The data used are ultrasonography images of thyroid nodules obtained from RSUP Dr. Sardjito, Yogyakarta. The approach was evaluated by comparing the accuracy of SVM with default parameters against that of the proposed method. The experimental results show that applying GSA to SVM increases system accuracy: from 58.5366% to 89.4309% with the polynomial kernel, and from 41.4634% to 98.374% with the RBF kernel.


INTRODUCTION
In machine learning and pattern recognition, classification refers to an algorithmic procedure for assigning a given piece of input data to one of a given number of categories, based on quantitative information on one or more characteristics inherent in the items [1]. Classification and prediction play important roles in data mining, and classification problems arise in many applications such as computer vision, speech recognition, and natural language processing [2]. Many approaches have been developed to build classification models, including the support vector machine (SVM), which was developed by Vapnik and his colleagues [1]. SVMs have attracted many researchers from the machine learning (ML) and pattern classification community because of properties such as high generalization performance and globally optimal solutions [2]. SVM has been used in a range of problems including pattern recognition, bioinformatics, and text categorization. It classifies data with different class labels by determining a set of support vectors, which are members of the training set that outline a hyperplane in the feature space. SVM provides a generic mechanism that fits the hyperplane surface to the training data using a kernel function (e.g. linear, polynomial, or RBF); during training, support vectors are selected along the surface of this function [2].
One problem facing the user of SVM is how to choose a kernel and the specific parameters for that kernel. This is a crucial step in handling a learning task with SVM, since it has a heavy impact on classification accuracy. The parameters to be optimized include the penalty parameter C and the kernel function parameters, such as gamma (γ) for the radial basis function (RBF) kernel. In other words, the largest problems encountered in setting up the SVM model are how to select the kernel function and its parameter values [3].
The proposed algorithm is used to classify thyroid nodules as malignant or benign. In clinical practice, distinguishing malignant from benign thyroid nodules is crucial. Although the prevalence of malignant nodules is relatively small, only about 7-15% [4], knowing whether a lesion is benign or malignant affects the subsequent treatment: for example, if a malignant lesion is detected, a needle biopsy is necessary, and vice versa. A malignant lesion that is mistaken for a benign one is a serious problem because the patient will not receive appropriate therapy. Ultrasonography (USG) is an imaging modality that is sensitive in detecting thyroid nodules. Even though USG detects malignant nodules with a sensitivity of about 63-94%, a specificity of 61,95%, and an accuracy of 78-94%, its major disadvantage is that it is operator dependent, which makes the diagnosis subjective. The interpretation of the images depends on the operator's expertise and skills, and operator fatigue can also cause diagnostic errors. Moreover, noise and speckle in the image further affect the accuracy of the diagnosis [5]. Therefore, a computer-based system is needed to assist radiologists in classifying the malignancy of a thyroid nodule.

Literature Review

Image Processing
In general, digital image processing refers to the processing of two-dimensional images using a computer. Although an image carries a lot of information, its quality is often degraded: for example, it may contain noise, have colors that are too contrasting, lack sharpness, or be blurred. This makes the information in the image more difficult to interpret, so digital image processing is necessary [6]. Digital image processing is a technique to manipulate, modify, or increase the quality of an image in various ways [6]. Fig. 1 shows the complete stages of digital image processing.
Figure 1. Digital image processing complete stages [6]
Fig. 1 shows that there are 10 processes in digital image processing. In practice, not all stages are always carried out; this depends on the images and the objective of the system. In this research, the pre-processing stages are:
1. Image filtering and enhancement. Noise filtering is applied twice: an adaptive median filter for smoothing, and SRBF to remove speckle.
2. Segmentation. After enhancement, the main object of the image is separated from the background. The method used in this stage is Fast Global Minimization for Active Contour (FGMAC).
3. Representation and description. The main object is represented according to the characteristics of its edge, and features are then extracted from it. This research uses geometrical and statistical aspects to extract the features of the object. Eight features describe the characteristics of the nodule: convexity, solidity, aspect ratio, compactness, circularity, dispersion, tortuosity, and rectangularity.
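As an illustration of the representation and description stage, the sketch below computes three of the listed shape features from a nodule boundary given as a polygon. The formulas are common textbook definitions (circularity as 4πA/P², aspect ratio and rectangularity from the axis-aligned bounding box) and are assumptions here; the paper does not state which definitions it uses, and the function names are hypothetical.

```python
import math


def polygon_area(points):
    """Area of a simple polygon [(x, y), ...] via the shoelace formula."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths of the closed polygon."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += math.hypot(x2 - x1, y2 - y1)
    return total


def shape_features(points):
    """Assumed definitions; the paper's exact formulas may differ."""
    area = polygon_area(points)
    perimeter = polygon_perimeter(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width = max(xs) - min(xs)    # axis-aligned bounding box
    height = max(ys) - min(ys)
    return {
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
        "aspect_ratio": width / height,
        "rectangularity": area / (width * height),
    }


# A 2 x 2 square as a stand-in for a segmented nodule contour.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
features = shape_features(square)
```

In practice the contour would come from the segmentation result rather than a hand-written polygon, and the remaining features (convexity, solidity, compactness, dispersion, tortuosity) would be added in the same way.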

Support Vector Machine
SVM is a set of related supervised learning methods used for classification and regression. SVM can be defined as a system that uses the hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. SVM has properties that are not possessed by other learning machines, specifically in the process of finding the best hyperplane, which maximizes the margin after the non-linear input space is mapped to the feature space using the kernel rules [7].
The best hyperplane between the two classes is obtained by measuring the margin of the hyperplane and finding its maximum. The margin, shown by the red line in Fig. 2, is the distance between the hyperplane and the closest data points of each class. These closest points are called support vectors [8]. In real problems, the data generally cannot be separated perfectly by a linear boundary (they are not linearly separable), so a kernel function is needed. The kernel functions used in this research are shown in Table 1.
Table 1. SVM kernel (columns: Name of Kernel, Equation)
As mentioned, SVM is a classifier: given a set of training examples, each marked as belonging to one of two categories, the SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.
Given a training set of $D$ samples, the pair $(x_i, y_i)$ represents the i-th training input sample and its label, respectively, and $x_i = (x_{i1}, \dots, x_{ip})$ is a p-dimensional vector in the feature space. The generalized linear SVM finds an optimal separating hyperplane, which can be written as the set of points $x$ satisfying $w \cdot x + b = 0$, by solving the following optimization problem [7]:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{D} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, D$$
where $C$ is the penalty parameter, which controls the trade-off between the complexity of the decision function and the number of training examples that are misclassified, and $\xi_i$ is the non-negative slack variable. A good separation is achieved by the hyperplane that has the largest distance to the nearest training data points of any class (the so-called functional margin), since in general the larger the margin, the lower the generalization error of the classifier. This optimization model can be solved by introducing Lagrange multipliers $\alpha_i$ for its dual optimization model. After the optimal solution is obtained, the optimal hyperplane parameters $w^*$ and $b^*$ can be determined, and the indicator function (classifier) can be written as:
$$f(x) = \operatorname{sign}(w^* \cdot x + b^*)$$
In nonlinearly separable cases, the SVM maps the training points nonlinearly to a high-dimensional feature space using a kernel function $K(x_i, x_j)$, where linear separation may be possible. The crucial idea is to use kernels to reduce a complex classification task to one that can be solved with separating hyperplanes. After the kernel function is selected, the nonlinear SVM classifier becomes:
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{D} \alpha_i y_i K_{\mathrm{RBF}}(x_i, x) + b^*\right) \quad \text{for the RBF kernel} \quad (7)$$
$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{D} \alpha_i y_i K_{\mathrm{poly}}(x_i, x) + b^*\right) \quad \text{for the polynomial kernel} \quad (8)$$
where $K_{\mathrm{RBF}}$ and $K_{\mathrm{poly}}$ are the kernels listed in Table 1. The performance of an SVM can be controlled through $C$ and $\gamma$ for the RBF kernel, and $C$ and $d$ for the polynomial kernel; these are called hyperparameters. They influence the number of support vectors and the maximization of the margin of the SVM.
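To make the kernel classifier concrete, the following minimal sketch evaluates a decision function of the form sign(Σ α_i y_i K(x_i, x) + b) for given support vectors. The kernel forms are the commonly used ones, exp(-γ‖x - z‖²) for RBF and (x·z + 1)^d for polynomial; since the kernel table is not recoverable from the text, these exact forms, the offset of 1, and all function names are assumptions. The multipliers and bias in the usage lines are made-up values, not the result of training.

```python
import math


def rbf_kernel(x, z, gamma):
    """Assumed RBF form: exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def poly_kernel(x, z, d, coef0=1.0):
    """Assumed polynomial form: (x . z + coef0)^d."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** d


def decide(x, support_vectors, labels, alphas, b, kernel):
    """Return +1 or -1 from the kernel expansion over the support vectors."""
    score = sum(
        alpha * y * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return 1 if score + b >= 0 else -1


# Illustrative values only: two support vectors, one per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]


def kernel(a, b):
    return rbf_kernel(a, b, gamma=0.5)


prediction = decide((2.0, 2.0), support_vectors, labels, alphas, 0.0, kernel)
```

The hyperparameters that GSA tunes enter at two places: γ or d shape the kernel value itself, while C bounds the multipliers α_i during training.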

Gravitational Search Algorithm
The GSA was first introduced by Rashedi et al. as a new stochastic population-based heuristic optimization tool [9]. The approach is an iterative method that simulates mass interactions and moves through a multi-dimensional search space under the influence of gravitation. This heuristic algorithm is inspired by the Newtonian laws of gravity and motion [9]. GSA has been shown to be effective on a set of nonlinear benchmark functions, and the results confirm that it is a suitable tool for the optimization of engineering problems.
In GSA, agents are considered as objects and their performance is measured by their masses. All objects attract each other by the force of gravity, and this force causes a global movement of all objects towards the objects with heavier masses. The heavy masses, which correspond to good solutions, move more slowly than the lighter ones; this guarantees the exploitation step of the algorithm. In GSA, each mass (agent) has four specifications: position, inertial mass, active gravitational mass, and passive gravitational mass. The position of the mass corresponds to a solution of the problem, and its gravitational and inertial masses are determined using a fitness function. In other words, each mass represents a solution, and the algorithm is navigated by properly adjusting the gravitational and inertial masses. As time passes, the masses are expected to be attracted by the heaviest mass, which represents the optimum solution in the search space. Now, consider a system with N agents (masses). The position of the i-th agent is defined by:
$$X_i = (x_i^1, \dots, x_i^d, \dots, x_i^n), \quad i = 1, 2, \dots, N \quad (9)$$
where $x_i^d$ is the position of the i-th agent in the d-th dimension and $n$ is the dimension of the search space. The gravitational constant, $G$, is initialized at the beginning and is reduced with time to control the search accuracy. In other words, $G$ is a function of the initial value ($G_0$) and time ($t$):
$$G(t) = G(G_0, t) \quad (10)$$
Gravitational and inertial masses are simply calculated by the fitness evaluation. A heavier mass means a more efficient agent: better agents have higher attraction and move more slowly. Assuming the equality of the gravitational and inertial mass, the values of the masses are calculated using the map of fitness.
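Gravitational and inertial masses are updated using the following equations:
$$M_{ai} = M_{pi} = M_{ii} = M_i, \quad i = 1, 2, \dots, N \quad (11)$$
$$m_i(t) = \frac{\mathrm{fit}_i(t) - \mathrm{worst}(t)}{\mathrm{best}(t) - \mathrm{worst}(t)} \quad (12)$$
$$M_i(t) = \frac{m_i(t)}{\sum_{j=1}^{N} m_j(t)} \quad (13)$$
where $\mathrm{fit}_i(t)$ represents the fitness value of agent i at time t, and $\mathrm{best}(t)$ and $\mathrm{worst}(t)$ are defined as follows (maximization problem):
$$\mathrm{best}(t) = \max_{j \in \{1, \dots, N\}} \mathrm{fit}_j(t) \quad (14)$$
$$\mathrm{worst}(t) = \min_{j \in \{1, \dots, N\}} \mathrm{fit}_j(t) \quad (15)$$
At a specific time t, the force acting on mass i from mass j is defined as follows:
$$F_{ij}^d(t) = G(t) \, \frac{M_{pi}(t) \, M_{aj}(t)}{R_{ij}(t) + \varepsilon} \, \left(x_j^d(t) - x_i^d(t)\right) \quad (16)$$
where $M_{aj}$ is the active gravitational mass related to agent j, $M_{pi}$ is the passive gravitational mass related to agent i, $G(t)$ is the gravitational constant at time t, $\varepsilon$ is a small constant, and $R_{ij}(t)$ is the Euclidean distance between agents i and j:
$$R_{ij}(t) = \| X_i(t), X_j(t) \|_2 \quad (17)$$
The total force that acts on agent i in dimension d is a randomly weighted sum of the d-th components of the forces exerted by the other agents:
$$F_i^d(t) = \sum_{j=1, \, j \ne i}^{N} \mathrm{rand}_j \, F_{ij}^d(t) \quad (18)$$
where $\mathrm{rand}_j$ is a random number in the interval [0, 1]. Hence, by the law of motion, the acceleration of agent i at time t in direction d is given as follows:
$$a_i^d(t) = \frac{F_i^d(t)}{M_{ii}(t)} \quad (19)$$
where $M_{ii}$ is the inertial mass of the i-th agent. Furthermore, the next velocity of an agent is considered as a fraction of its current velocity added to its acceleration. Therefore, its velocity and position can be calculated as follows:
$$v_i^d(t+1) = \mathrm{rand}_i \, v_i^d(t) + a_i^d(t) \quad (20)$$
$$x_i^d(t+1) = x_i^d(t) + v_i^d(t+1) \quad (21)$$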
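The update rules described above can be sketched as a small maximizer in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the exponential decay of G, the values `g0=100` and `alpha=20`, the clipping of positions to the search bounds, and the function name `gsa_maximize` are all choices made for the example. Because the gravitational and inertial masses are taken to be equal, the passive mass of agent i cancels when the force is divided by its inertial mass, so the acceleration is accumulated directly.

```python
import math
import random


def gsa_maximize(fitness, bounds, n_agents=5, n_iter=25,
                 g0=100.0, alpha=20.0, eps=1e-12, seed=0):
    """Maximize `fitness` over a box; returns (best_position, best_fitness)."""
    rng = random.Random(seed)
    dim = len(bounds)
    X = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_agents)]
    V = [[0.0] * dim for _ in range(n_agents)]
    best_x, best_fit = None, -math.inf

    for t in range(n_iter):
        fit = [fitness(x) for x in X]
        for x, f in zip(X, fit):
            if f > best_fit:          # remember the best agent seen so far
                best_x, best_fit = list(x), f

        # Normalized masses from fitness, as in Eq. (12) and Eq. (13).
        best, worst = max(fit), min(fit)
        if best == worst:
            m = [1.0] * n_agents      # all agents equally good
        else:
            m = [(f - worst) / (best - worst) for f in fit]
        total = sum(m)
        M = [mi / total for mi in m]

        # Assumed schedule: G decays exponentially from g0.
        G = g0 * math.exp(-alpha * t / n_iter)

        # Randomly weighted attraction towards every other agent.
        A = []
        for i in range(n_agents):
            acc = [0.0] * dim
            for j in range(n_agents):
                if j == i:
                    continue
                R = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
                r = rng.random()
                for d in range(dim):
                    acc[d] += r * G * M[j] * (X[j][d] - X[i][d]) / (R + eps)
            A.append(acc)

        # Velocity and position update, clipped to the bounds (assumption).
        for i in range(n_agents):
            r = rng.random()
            for d, (lo, hi) in enumerate(bounds):
                V[i][d] = r * V[i][d] + A[i][d]
                X[i][d] = min(max(X[i][d] + V[i][d], lo), hi)

    return best_x, best_fit


def toy_fitness(p):
    """Hypothetical stand-in for accuracy, with its peak at (3, -1)."""
    return -((p[0] - 3.0) ** 2 + (p[1] + 1.0) ** 2)


best_position, best_value = gsa_maximize(toy_fitness, [(-10.0, 10.0)] * 2)
```

The defaults of 5 agents and 25 iterations mirror one of the settings reported in the experiments; in the GSA-SVM setting the toy fitness would be replaced by the classification accuracy of an SVM trained with the parameters encoded in the agent's position.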

The Detail Method of GSA-SVM System
There are two key factors when using GSA as an optimization algorithm: what to choose as the masses of the GSA, and how to define the fitness function that evaluates the goodness of an agent. This experiment seeks the kernel that performs best on this problem by comparing the results of the linear, polynomial, and RBF kernels.

Mass Representation
In this problem, each agent is represented by the combination of parameters of the chosen kernel. For example, with the polynomial kernel the agent is the combination of parameters C and d, while with the RBF kernel it is the combination of C and γ. The mass of each agent is calculated from its accuracy, which is obtained during the training process using Eq. (7) and Eq. (8). The value is then normalized using Eq. (12) and Eq. (13) for further processing.
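The mapping from an agent's position to kernel parameters can be sketched as follows. The search ranges, the rounding of d to an integer, and the names `SEARCH_SPACE` and `decode_agent` are hypothetical; the paper does not report the bounds it used.

```python
# Hypothetical search ranges per kernel; the paper does not state its bounds.
SEARCH_SPACE = {
    "linear": {"C": (0.01, 1000.0)},
    "polynomial": {"C": (0.01, 1000.0), "d": (1, 5)},
    "rbf": {"C": (0.01, 1000.0), "gamma": (0.0001, 10.0)},
}


def decode_agent(kernel, position):
    """Turn an agent's position vector into a dict of SVM parameters."""
    params = {}
    for (name, (lo, hi)), value in zip(SEARCH_SPACE[kernel].items(), position):
        value = min(max(value, lo), hi)   # keep the agent inside the range
        if name == "d":
            value = int(round(value))     # assumption: integer degree
        params[name] = value
    return params


# A polynomial-kernel agent at C = 557.0479 with a fractional degree
# coordinate of 1.2 decodes to degree 1.
poly_params = decode_agent("polynomial", [557.0479, 1.2])
```

Each agent therefore has one coordinate per tunable parameter: one for the linear kernel and two for the polynomial and RBF kernels.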

Fitness Function Definition
Classification accuracy is the criterion used to design the fitness function. Thus, the fitness function should be designed such that the heaviest mass is the agent with the highest classification accuracy.

The Proposed GSA-SVM
The detailed steps of the algorithm are shown in Fig. 4. In the experiment, the data set is split into two parts: 60% as the training set and the remaining 40% as the testing set. The training set is used to compare the performance of the kernels, while the testing set is left untouched.
In Fig. 4, the procedure begins with search space identification and initialization. Next, the training accuracy of each agent is evaluated as its fitness value. Then, after the agents' positions are updated, the stopping criterion is checked; once it is met, the search stops and the optimized parameter values are evaluated on the testing set.
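The data handling around these steps can be sketched as below: a 60/40 split, followed by a k-fold cross-validated accuracy (k = 5, as used in the experiments) that serves as an agent's fitness. The classifier is passed in as a callback, and `majority_fit_predict` is only a hypothetical stand-in for training an SVM with the agent's parameters; the shuffling, the interleaved fold assignment, and all names are assumptions of this sketch.

```python
import random


def split_60_40(samples, seed=0):
    """Shuffle once, then keep 60% for training and 40% for testing."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = int(round(0.6 * len(samples)))
    train = [samples[i] for i in idx[:cut]]
    test = [samples[i] for i in idx[cut:]]
    return train, test


def kfold_indices(n, k=5):
    """Interleaved folds: fold i holds indices i, i + k, i + 2k, ..."""
    return [list(range(i, n, k)) for i in range(k)]


def cv_accuracy(train, params, fit_predict, k=5):
    """Mean validation accuracy over k folds, used as the agent's fitness.

    fit_predict(train_part, val_inputs, params) must return predicted labels.
    """
    scores = []
    for fold in kfold_indices(len(train), k):
        held_out = set(fold)
        val = [train[i] for i in fold]
        rest = [train[i] for i in range(len(train)) if i not in held_out]
        preds = fit_predict(rest, [x for x, _ in val], params)
        correct = sum(p == y for p, (_, y) in zip(preds, val))
        scores.append(correct / len(val))
    return sum(scores) / k


def majority_fit_predict(train_part, val_inputs, params):
    """Hypothetical stand-in classifier: always predicts the majority label."""
    labels = [y for _, y in train_part]
    majority = max(set(labels), key=labels.count)
    return [majority] * len(val_inputs)


# Synthetic placeholder data: 160 (features, label) pairs.
data = [((float(i),), 1 if i % 4 else -1) for i in range(160)]
train_set, test_set = split_60_40(data)
fitness_value = cv_accuracy(train_set, {"C": 1.0}, majority_fit_predict)
```

Only `train_set` is passed to the fitness evaluation; `test_set` is held back until the search has finished, in line with the description above.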

Data Description
The data used in the experiments were all obtained from the Radiology Installation of Rumah Sakit Umum Pusat (RSUP) Dr. Sardjito, Yogyakarta, and cover the period from 2011 to 2014. In total, 160 images were obtained from previous research [10]. All images are in *.bmp RGB format.

Pre-processing
The data obtained from RSUP Dr. Sardjito are processed before being classified. Noise and other artifacts can interfere with the classification stage, so they should be removed first. The complete digital image processing stages in this research are shown in Fig. 5.
Figure 5. Flowchart of pre-process image data
According to Fig. 5, the first pre-processing step is to obtain the region of interest (ROI) of the image. The cropping was done manually by experts: Dr. Endang Sri Wulandari, a radiologist, and 2 assistants, accompanied by the researcher, who directed the use of the system. The result of the cropping process is shown in Fig. 6.
Figure 6. (a) Real image from USG, (b) Cropping result as ROI of the image
Noise reduction is done in two steps. First, an adaptive median filter is applied to the ROI, followed by a bilateral filter. Fig. 7 (b) shows the result of adaptive median filtering, while Fig. 7 (c) shows the result of bilateral filtering. The last image is the smoothest of the three.
After the image is segmented, the next step is the representation and description stage, in which the image features are extracted. Geometric and statistical characteristics were chosen because radiologists also assess the ultrasound image based on the shape and firmness of the nodule's edge. Each image is described by numerical feature values, stored in *.xlsx format to simplify the calculation, before the data are split into training and testing sets.

Experiment Results
The experiment compares the proposed GSA-SVM system with SVM using default parameters. For a fair comparison, the systems are compared on the testing set, while the training set is used to train the GSA-SVM version. First, the default SVM parameters were tested with all three kernels on the testing set; the results are shown in Table 2. As Table 2 shows, the performance of the system with default parameters is very low. As explained in the introduction, the difficulty in using SVM is choosing the appropriate kernel and its parameters for the data and problem at hand. Different data and different problems require different parameters, so the SVM parameters must be tuned for every case to reach the optimal solution. In this study, the search for the best parameters of each kernel is done using GSA.
The scenario starts with the polynomial kernel, which has parameters C and d. To obtain reliable results, k-fold cross-validation with k = 5 is applied during training. Training is run with different numbers of iterations and numbers of agents. Table 3 presents the results of GSA-SVM with the polynomial kernel; the highlighted value marks the best parameters obtained from the training set. With C = 557.0479 and d = 1, the accuracy reaches 89.79% on the training set and 84% on the validation set. For the RBF kernel, the optimum parameters were obtained in fold 1 of the first experiment, where the number of agents is 5 and the number of iterations is 25. The maximum accuracy of the training process is 62.67%, while the validation data reach 65%. Even though this is only a slight increase over the default, it is regarded as a big improvement. The final results are presented in Table 5. As Table 5 shows, the best parameters obtained from the training process give high accuracy for the system compared with the default parameter result in

CONCLUSIONS
Based on the experiment result, it can be concluded that Gravitational Search Algorithm (GSA) can be used to find optimal parameters in the SVM classifier. This is evidenced by the increased accuracy in the polynomial kernel from 58.5366% to 89.4309%, and 41.4634% to 98.374% in the RBF kernel.