The K-Means Clustering Algorithm With Semantic Similarity To Estimate The Cost of Hospitalization

K-means


INTRODUCTION
Clustering is the process of grouping data into clusters such that data within a cluster are highly similar, while data in different clusters have low similarity [1]. The distance measure used to quantify data similarity plays a very important role in the performance of the K-means algorithm [2]. Distance-based similarity measurement in K-means still has several weaknesses, such as paying little attention to the semantic meaning of the data. To overcome this problem, semantic similarity can be applied to measure the similarity between objects in clustering, so that semantic proximity is taken into account. Semantic similarity can be measured using an ontology, i.e. by measuring the distance between concepts in the ontology.
Several researchers have addressed the problem of distance-based proximity measurement by using semantics [3, 4]. The data used in those studies were only text-based and could not accommodate categorical data represented by a hierarchical model. The similarity of categorical data with a hierarchical model can be measured using the semantic similarity equations proposed by Girardi et al. [5], Leacock & Chodorow [6], and Rada et al. [7]. An example of categorical data with a hierarchical model is ICD-10, an international standard for classifying diseases and other health problems. In computer science, the ICD can be considered an ontology in a simple form, in which what matters is the hierarchy of concepts. An ontology in this form is often referred to as a terminology.
ICD-10 is used in hospitals as a guideline for determining the code of a patient's disease type. The similarity between patients' diseases can be seen from the proximity of their disease codes in ICD-10. The type of disease is one of the factors that determines a patient's cost of hospitalization. A patient who comes to the hospital for an examination visits either the Emergency Installation Unit (IGD), in an emergency, or the Polyclinic unit otherwise. Medical personnel conduct clinical, laboratory, and supporting examinations to establish the diagnosis, plan the initial management of the patient, and decide whether the patient will be hospitalized. If the patient is to be hospitalized, the medical staff provide a financial estimate to the patient's family, so that the family knows the estimated cost the patient will need. The Diagnosis-Related Group (DRG) system can be summarized as payment with unit costs per diagnosis, rather than unit costs per type of medical or non-medical service provided to the patient [8]. Patient costs can be estimated by clustering patient data that include the disease diagnosis, age, sex, and inpatient class rate.
In this study, patients were clustered so that they can be grouped according to the similarity of their features. The similarity between each data item and the centroids was measured using the semantic similarity of Girardi et al., Leacock & Chodorow, and Rada et al. for the diagnosis feature, which is coded with ICD-10, and Euclidean distance for the gender, age, and class rate features. Clustering patients can help hospital management or medical personnel when grouping patients into DRGs (Diagnosis-Related Groups) to determine the financing of health services.

Data Collection
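The amount of data used is 244 patient records. The patient data were divided into training data and test data: 171 records were used as training data and 73 records as test data. The data include the patient diagnosis, coded with ICD-10 (the international standard for the classification of diseases and other health problems), class rates, age, and gender.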

Data Normalization
There are two methods for data normalization, namely the range and var methods. The range method normalizes the data so that the values lie between 0 and 1. In this study, the data were normalized using the range method with the following formula:

$$x' = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (1)$$

where $x$ is the original value, $x'$ is the normalized value, and $x_{min}$ and $x_{max}$ are the minimum and maximum values of the column. For each column of the initial data, the minimum and maximum values are determined. The minimum is stored in the data min variable and the maximum in the data max variable. The parameters normalized using equation (1) are the inpatient class and the age of the patient.
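The range method can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `range_normalize`, the sample ages, and the handling of a constant column are assumptions.

```python
def range_normalize(values):
    """Scale a list of numbers to [0, 1] with the range (min-max) method."""
    data_min = min(values)
    data_max = max(values)
    span = data_max - data_min
    if span == 0:
        # Assumption: a constant column is mapped to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - data_min) / span for v in values]


ages = [20, 35, 50, 80]  # hypothetical patient ages
print(range_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```

The same function would be applied separately to each numeric column, for example age and inpatient class.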

Model Design
Developing the architectural model and conceptual design requires a requirements analysis stage, covering both data requirements and functional requirements. System analysis provides an understanding of the system to be developed and reveals its shortcomings, so as to produce a better system that meets user needs. The first step in the clustering process is to normalize the patient data, so that the patient diagnosis feature matches the code format used in the ontology. For example, a patient with diagnosis code A01.0 (Typhoid Fever) has the code normalized to A01_0. In addition to the diagnosis data, the age and patient class rate data were normalized using the range method. Similarity between data in the K-means algorithm was measured using semantic similarity and Euclidean distance. Semantic similarity was used for the diagnosis feature, while Euclidean distance was used for the gender, age, and insurance class rate features.
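The diagnosis code normalization described above amounts to a simple string rewrite. The sketch below is illustrative only: the function name is hypothetical, and trimming whitespace and upper-casing are assumptions that the text does not mention.

```python
def normalize_icd10_code(code):
    """Rewrite an ICD-10 code such as 'A01.0' into the ontology format 'A01_0'."""
    # Assumption: surrounding whitespace and letter case are cleaned up as well.
    return code.strip().upper().replace(".", "_")


print(normalize_icd10_code("A01.0"))  # A01_0
```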
The clustering process forms patient clusters whose members have highly similar features and have low similarity to data in other clusters. To assess the quality of the resulting clusters, testing was carried out using the silhouette coefficient method. The silhouette coefficient measures the quality and strength of clusters, showing how well each data item is placed in its cluster.
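As an illustration of how the silhouette coefficient can be computed when only pairwise distances are available (which is the case for a semantic measure), the following sketch works from a precomputed distance matrix. The function name, the toy data, and the convention of scoring singleton clusters as 0 are assumptions; at least two clusters are required.

```python
def silhouette_coefficient(dist, labels):
    """Mean silhouette value, given a full distance matrix and cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


points = [0, 1, 10, 11]  # hypothetical one-dimensional data
dist = [[abs(p - q) for q in points] for p in points]
print(silhouette_coefficient(dist, [0, 0, 1, 1]))
```

Values close to 1 indicate that items sit much closer to their own cluster than to the nearest other cluster.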
The testing process is used to determine the patient's cost estimate and to evaluate that estimate. Determining the patient cost estimate uses the same normalization method and similarity calculation as the clustering process. The normalized data of a new patient are first compared only with the centroid of each cluster in order to select a cluster. The selected cluster is the one whose centroid has the highest similarity to the new patient data. Hence, to determine the estimated cost for a new patient, the new patient data are matched against the members of a single cluster only. The cost estimate displayed by the system is a range of minimum and maximum costs, obtained from the patient data whose similarity values are greater than or equal to the set threshold value.

Semantic Similarity
This study used a semantic similarity measure to quantify the similarity between data in the clustering algorithm. Semantic similarity was chosen because distance calculations such as Euclidean distance cannot measure the semantic proximity between data. Semantic similarity is obtained by calculating the distance between concepts in an ontology.

Semantic similarity between concepts
The semantic similarity of two concepts was measured using the equations proposed by Girardi et al. [5], Leacock & Chodorow [6], and Rada et al. [7].

Semantic similarity of Girardi et al.
For two nodes (concepts) x and y represented in a hierarchical tree, the distance can be calculated with the equation:

$$d_c(x, y) = \frac{l_{x,y}}{d_x + d_y} \quad (2)$$

where:
$d_c(x, y)$ = the value of the distance between nodes x and y.
$l_{x,y}$ = the minimum number of edges between nodes x and y.
$d_x$ = the depth level of node x.
$d_y$ = the depth level of node y.
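The quantities defined above (minimum number of edges between two nodes and the depth of each node) can be computed on a small tree as follows. This is a sketch under stated assumptions: the distance is taken to be the path length normalized by the sum of the two depths, the hierarchy is a hand-built fragment of ICD-10 with a hypothetical `ROOT` node, and all function names are illustrative.

```python
# Child -> parent links for a tiny, hand-built fragment of the ICD-10 hierarchy.
PARENT = {
    "A00-B99": "ROOT",
    "A00-A09": "A00-B99",
    "A01": "A00-A09",
    "A02": "A00-A09",
    "A01_0": "A01",
    "A01_1": "A01",
    "A02_0": "A02",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Number of edges between `node` and the root."""
    return len(path_to_root(node)) - 1


def path_length(x, y):
    """Minimum number of edges between x and y (via their lowest common ancestor)."""
    steps_from_y = {n: i for i, n in enumerate(path_to_root(y))}
    for i, n in enumerate(path_to_root(x)):
        if n in steps_from_y:
            return i + steps_from_y[n]
    raise ValueError("nodes are not in the same tree")


def girardi_distance(x, y):
    """Assumed form: path length divided by the sum of the node depths."""
    if x == y:
        return 0.0
    return path_length(x, y) / (depth(x) + depth(y))


print(girardi_distance("A01_0", "A01_1"))  # siblings: 2 / (4 + 4) = 0.25
print(girardi_distance("A01_0", "A02_0"))  # cousins:  4 / (4 + 4) = 0.5
```

Codes under the same three-character category come out closer than codes that only share a block, which is the semantic proximity a flat comparison of code strings cannot express.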

Semantic similarity of Leacock & Chodorow
Leacock & Chodorow used the path length between two nodes to measure semantic similarity. The equation of the Leacock & Chodorow semantic similarity method is as follows:

$$sim(u, r) = -\log\left(\frac{length(u, r)}{2D}\right) \quad (3)$$

where:
length(u, r) = the shortest distance between node u and node r.
D = the maximum depth from the node to the root between node u and node r.

Semantic similarity of Rada et al.
The semantic similarity proposed by Rada et al. uses the shortest path distance and depth level to measure the similarity between concepts in the ontology. The distance between two concepts C1 and C2 is calculated as the shortest path that connects the concepts. The similarity between the two concepts C1 and C2 can be calculated as follows [7]:

$$sim(C_1, C_2) = 2 \times Max - length(C_1, C_2) \quad (4)$$

where:
Max = the maximum depth from node to root between nodes C1 and C2.
length(C1, C2) = the shortest distance between node C1 and node C2.
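Both path-based measures can be illustrated on a small tree. The sketch below assumes the common forms of the two measures, a negative logarithm of the path length over twice the maximum depth for Leacock & Chodorow and twice the maximum depth minus the path length for the Rada-based similarity. The hierarchy is a hand-built ICD-10 fragment with a hypothetical `ROOT` node, the maximum depth is taken over the whole toy tree, the natural logarithm is used, and the path length is clamped to at least 1 so that identical codes do not produce log(0). All of these are assumptions made for illustration.

```python
import math

# Child -> parent links for a tiny, hand-built fragment of the ICD-10 hierarchy.
PARENT = {
    "A00-B99": "ROOT",
    "A00-A09": "A00-B99",
    "A01": "A00-A09",
    "A02": "A00-A09",
    "A01_0": "A01",
    "A01_1": "A01",
    "A02_0": "A02",
}


def path_to_root(node):
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    return len(path_to_root(node)) - 1


def path_length(x, y):
    """Minimum number of edges between x and y."""
    steps_from_y = {n: i for i, n in enumerate(path_to_root(y))}
    for i, n in enumerate(path_to_root(x)):
        if n in steps_from_y:
            return i + steps_from_y[n]
    raise ValueError("nodes are not in the same tree")


# Assumption: D and Max are both taken as the deepest level of the toy tree.
MAX_DEPTH = max(depth(node) for node in PARENT)


def leacock_chodorow(u, r):
    length = max(path_length(u, r), 1)  # clamp to avoid log(0) for identical codes
    return -math.log(length / (2 * MAX_DEPTH))


def rada_similarity(c1, c2):
    return 2 * MAX_DEPTH - path_length(c1, c2)


print(leacock_chodorow("A01_0", "A01_1"), leacock_chodorow("A01_0", "A02_0"))
print(rada_similarity("A01_0", "A01_1"), rada_similarity("A01_0", "A02_0"))  # 6 4
```

In both measures a shorter path gives a larger similarity, so the sibling codes A01_0 and A01_1 score higher than A01_0 and A02_0.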

Semantic similarity between sets of concepts
The similarity between sets of concepts is calculated using equation (5), proposed by Girardi et al. [5], which gives the similarity value of X with Y, where X and Y are each a set of concepts.

Jaccard Similarity
Jaccard similarity is used to calculate the similarity between the diagnosis sets of two patients. Its value is the size of the intersection divided by the size of the union of the two sets. Jaccard distance measures the dissimilarity between data sets. It is the complement of the Jaccard coefficient, obtained by subtracting the Jaccard similarity from 1 [9]. The equation for calculating Jaccard similarity is as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \quad (6)$$

where A and B are the two sets being compared.
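A minimal sketch of Jaccard similarity and distance on sets of diagnosis codes follows. The function names, the example code sets, and the convention that two empty sets have similarity 1 are assumptions.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are treated as identical
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Dissimilarity: 1 minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


patient_1 = {"A01_0", "E11_9"}  # hypothetical diagnosis sets
patient_2 = {"A01_0", "I10"}
print(jaccard_similarity(patient_1, patient_2))  # one shared code out of three
```

Because only exact matches count, A01_0 and A01_1 contribute nothing to the intersection even though they are neighbours in the ICD-10 hierarchy.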

K-means Similarity
The K-means clustering algorithm is used here to group patients. In K-means, each data item must belong to exactly one cluster, but an item that belongs to one cluster at a given stage of the process may move to another cluster in a later step [10]. The first step in the K-means algorithm is to determine the number of patient clusters to be formed. The number of clusters determines the number of centroids: if three clusters are to be formed, three data items are chosen as centroids. The initial centroids in the first iteration are chosen randomly. Once the initial centroids have been selected, the similarity of each patient data item to each centroid is calculated.
Each patient data item is allocated to the cluster whose centroid it is most similar to. The similarity between patient data and a centroid is calculated using the semantic approach and Euclidean distance. Semantic similarity between concepts is calculated using equations (2), (3), and (4), and the semantic similarity between sets of concepts is calculated using equation (5). Jaccard similarity is calculated using equation (6). After all patient data have been allocated to a cluster, the next step is to check for convergence, either by comparing the cluster results of the previous iteration with those of the current iteration or by using the value of the specified objective function. If the results are the same, or if the change in the objective function value is below the specified threshold, the clustering has converged. If the results differ, or if the change in the objective function value is above the threshold, it has not converged, and another iteration is needed in which new centroids are determined from the data in each cluster. The new centroid is determined by looking for the average similarity value of all members in each cluster.
These steps are repeated until the membership of each cluster no longer changes or the change in the objective function falls below the threshold value, which is set to 0.1. The objective function used to check convergence is the Sum of Squared Errors (SSE), the sum of the squared distances between each data item and its cluster center [1]. The final result of this method is a grouping of patients in which the data within a cluster are highly similar.
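The iteration described above (assign, check the change in SSE against the threshold, recompute centroids) can be sketched as follows. This is a simplified illustration rather than the system in the paper: it uses only numeric features with Euclidean distance, omits the semantic diagnosis feature, takes the initial centroids as an argument instead of choosing them randomly, and checks convergence on the SSE change only. All names are hypothetical.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, centroids, threshold=0.1, max_iter=100):
    """Return (labels, centroids) once the SSE changes by less than `threshold`."""
    prev_sse = None
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Objective function: sum of squared distances to the assigned centroid.
        sse = sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
        if prev_sse is not None and abs(prev_sse - sse) < threshold:
            break
        prev_sse = sse
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        centroids = new_centroids
    return labels, centroids


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]  # hypothetical data
labels, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(labels, centroids)
```

To include the diagnosis feature, the distance function would have to combine a semantic measure on the ICD-10 codes with the Euclidean part, and the centroid update would need a rule for non-numeric features.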

The Process Of Estimating Patient Costs
The patient cost estimate generated by the system is a range, from the minimum to the maximum cost, obtained from the patients whose similarity value is greater than or equal to the threshold value.

Figure 3 Flowchart determining patient cost estimates
Figure 3 shows the sequence of processes for determining a patient cost estimate. For each patient whose cost is to be estimated, the first step is to measure the similarity between the patient's data and the centroid of each cluster. Comparing the data with the centroids narrows the search space, so that the similarity calculations needed to obtain the cost estimate are carried out in only one cluster, namely the cluster whose centroid has the highest similarity to the data of the patient being estimated. The estimated cost is obtained from the cluster members whose similarity values are greater than or equal to the threshold value. The system displays the estimate as a range of costs, namely the minimum and maximum costs.
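The two-stage lookup (pick the closest cluster by centroid, then filter its members by the threshold) can be sketched as below. The data layout, the function names, the toy one-dimensional similarity, the cost figures, and the choice to return `None` when no member passes the threshold are all assumptions made for illustration.

```python
def estimate_cost_range(new_patient, clusters, similarity, threshold):
    """Return (min_cost, max_cost) from sufficiently similar members of the best cluster.

    `clusters` is a list of dicts with a "centroid" and a list of "members",
    where each member is a (features, cost) pair.
    """
    # Stage 1: select the cluster whose centroid is most similar to the new patient.
    best = max(clusters, key=lambda c: similarity(new_patient, c["centroid"]))
    # Stage 2: keep only members at or above the similarity threshold.
    costs = [
        cost
        for features, cost in best["members"]
        if similarity(new_patient, features) >= threshold
    ]
    if not costs:
        return None  # assumption: no estimate when no member is similar enough
    return min(costs), max(costs)


def toy_similarity(a, b):
    # Hypothetical similarity on a single normalized feature in [0, 1].
    return 1.0 - abs(a - b)


clusters = [
    {"centroid": 0.2, "members": [(0.1, 1000), (0.25, 1500), (0.6, 4000)]},
    {"centroid": 0.8, "members": [(0.9, 9000)]},
]
print(estimate_cost_range(0.2, clusters, toy_similarity, threshold=0.8))  # (1000, 1500)
```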

Testing Scheme
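Tests were conducted to compare semantic similarity and Jaccard similarity as data measurement methods for clustering and cost estimation. The semantic data measurement method used the semantic similarity of Girardi et al., Leacock & Chodorow, and Rada et al., while the non-semantic data measurement method used Jaccard similarity. In this study, the amount of data used was 244 patient records, divided into 171 records used as training data and 73 records used as test data. The tests carried out in this study include:
1. Looking for the optimal number of clusters.
2. Measuring the accuracy of the proposed method.
3. Measuring the computational time of the clustering process.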

Determining the Number of Clusters
Table 1 shows the results of determining the number of clusters using the silhouette coefficient method. Based on the tests conducted, the highest average silhouette coefficient obtained with the semantic similarity of Girardi et al. is 0.72, with the number of clusters k=10. With the semantic similarity of Leacock & Chodorow it is 0.73 with k=10, and with the semantic similarity of Rada et al. it is 0.77 with k=10, while with Jaccard similarity it is 0.69 with k=10. For all three semantic similarity measurement methods, the best number of clusters is therefore 10.
System accuracy was measured by comparing the cost range estimated by the system with the actual costs incurred by the patient. The cost estimate displayed by the system is a range of minimum and maximum costs obtained from the patient data whose similarity values are greater than or equal to the set threshold value. If the actual cost incurred by the patient falls within the range estimated by the system, the estimate is counted as correct. The best accuracy is 91.78% for the three semantic similarity methods, whereas without semantic similarity the best accuracy is 84.93%.
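The accuracy criterion (an estimate counts as correct when the actual cost lies inside the estimated range) can be sketched as follows. The function name and the sample figures are hypothetical. As a side note, with 73 test patients the reported percentages are consistent with 67 and 62 correct estimates (67/73 ≈ 91.78%, 62/73 ≈ 84.93%). These counts are inferred from the percentages and are not stated in the text.

```python
def range_accuracy(estimated_ranges, actual_costs):
    """Fraction of patients whose actual cost falls inside the estimated range."""
    hits = sum(
        1
        for (low, high), actual in zip(estimated_ranges, actual_costs)
        if low <= actual <= high  # both bounds are treated as inclusive
    )
    return hits / len(actual_costs)


# Hypothetical estimated ranges and actual costs for three test patients.
ranges = [(1000, 1500), (2000, 3000), (500, 800)]
actual = [1200, 3500, 800]
print(range_accuracy(ranges, actual))  # 2 of 3 estimates are correct
```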

Evaluation of computing time
Table 3 compares the execution time, in seconds, of each data measurement method in clustering. Based on this comparison, the Jaccard similarity method has a lower average execution time than the semantic similarity methods. The non-semantic method takes less time because its formula is simpler: it only looks for equal items, without taking semantic proximity into account.

CONCLUSIONS
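Based on testing with the silhouette coefficient, measuring data with semantic similarity in the K-Means algorithm produces better quality clustering results than Jaccard similarity. The quality of the clustering results obtained with semantic similarity falls into the strong structure category. Of the three semantic similarity methods used, the semantic similarity method of Rada et al. produces clustering quality that is better than the semantic similarity of Girardi et al. and the semantic similarity of Leacock & Chodorow. The best accuracy is 91.78% for the three semantic similarity methods, whereas without semantic similarity the best accuracy is 84.93%.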

FUTURE WORKS
Based on these results, further research should use clustering methods other than K-Means, so that the clustering method with the best results can be identified and the estimates of patient costs can come closer to the actual costs.

Figure 1 System architecture

ISSN (print): 1978-1520, ISSN (online): 2460-7258. The K-Means Clustering Algorithm With Semantic Similarity... (Ida Bagus Gede Sarasvananda)

Figure 2 Flowchart of K-means using semantic similarity
IJCCS Vol. 13, No. 4, October 2019: 313-322

Table 1
Testing results for determining the number of clusters with the silhouette coefficient
Comparison of similarity measurement methods using semantic similarity of Girardi et al., semantic similarity of Leacock & Chodorow, semantic similarity of Rada et al., and Jaccard Similarity are presented in Table

Table 2
Comparison of accuracy for each similarity measurement method

Table 3
Comparison of execution time of clustering process