Classification of Tangerine (Citrus Reticulata Blanco) Quality Using Combination of GLCM, HSV, and K-NN

The quality of fruit production is very important because it determines sales value. Data from the Directorate General of Horticulture at the Ministry of Agriculture in 2017 showed that 94.3% of the total citrus yield consisted of tangerines. Quality classification by visual observation is strongly influenced by subjectivity: under conditions such as tired eyes or a very large number of oranges to sort, the process becomes inconsistent and takes a long time. Therefore, a technology is needed to accelerate the classification process and make it more objective. This study combines the Gray Level Co-occurrence Matrix (GLCM) method for texture features, Hue, Saturation, Value (HSV) for color features, and the k-Nearest Neighbor (k-NN) classification method. The data used were 60 images of rotten tangerines and 60 images of tangerines that are not rotten, divided using 4-fold cross-validation to find the best combination of training data and test data. Three main processes are carried out: preprocessing, feature extraction, and classification. This study achieved a highest accuracy of 80% from the combined GLCM and HSV feature extraction with k = 5 for k-NN.


INTRODUCTION
The quality of fruit production is very important because it is related to sales value, and horticultural commodities play an important role in national economic development. Data from the Directorate General of Horticulture at the Ministry of Agriculture in 2017 showed that tangerines accounted for 2,165,184 tons of a total citrus production of 2,295,310 tons. This means that 94.3% of all citrus fruits are tangerines [1]. Fruit quality can be judged from the size, color, and defects on the skin. Today, most quality classification is still done manually (by visual observation) by farmers and sellers. For citrus farmers this is easy, but the manual method undeniably depends on subjective conditions: with dry eyes or too many oranges to sort, the process becomes inconsistent and takes a long time.
Digital image processing is carried out on an image to obtain results that suit a particular need. With digital image processing, images of oranges can be processed to determine their quality, including whether the oranges are rotten or not. Digital image processing can therefore be one way to solve the problems stated above. Feature extraction is one of the important steps in digital image processing. With feature extraction, we can obtain the characteristics that represent each image.
Many previous studies have used digital image processing for fruit classification, covering both quality and maturity level. One of them classified the quality of tangerines using the Gray Level Co-occurrence Matrix (GLCM) method for texture features. The quality of tangerines was classified into 3 classes, namely Grade A, Grade B, and Grade Super. The Support Vector Machine (SVM) method was then used to identify whether or not a windowed image area is defective. That study achieved a best accuracy of 82.5% with 20 training data and a GLCM distance of 2 in the 45° direction [2]. Another study classified the maturity level of limes based on color features and k-Nearest Neighbor (k-NN) classification. The color features were the mean RGB values, and the k values used were 1, 3, 5, 7, and 9. The closest distance between training data and test data was searched using Euclidean distance and city-block distance. That study produced a highest accuracy of 92% with Euclidean distance, and k = 3 was the best k value [3]. Finally, a study that used a combination of feature extraction methods classified tomatoes using k-NN based on GLCM and the HSV color space. Using 100 data, consisting of 75 training data and 25 test data, it reached a highest accuracy of 100% with a GLCM p-value of 9 and a membership value (k) of 3 in k-NN. According to the experimental results, that study concluded that the proposed method can achieve the highest accuracy [4].
Among the previous studies described above, the combination of GLCM and HSV features obtained the highest accuracy. Therefore, this study uses GLCM to extract texture features and the HSV color space to extract color features. The k-NN method is used to classify the quality of tangerines as rotten or not rotten. This research is expected to accelerate the classification process and make it more objective.

Digital image processing
Digital image processing is a technical step for estimating the characteristics of objects in an image, measuring characteristics related to the geometry of the objects, and interpreting that geometry [5]. A digital image is formed through several stages: image acquisition, sampling, and quantization. Image acquisition, the initial step of image processing, converts a continuous image into digital form. The tool used for image acquisition is a device equipped with sensors that capture images and convert light energy into digital signals (camera, scanner, etc.). Sampling, also called digitization of the (x, y) coordinates, transforms a continuous (analog) image into a discrete image by dividing it into M columns and N rows. The greater the values of M and N, the finer the resulting digital image and the higher its resolution. Quantization, which is performed by the digital equipment (scanners, digital cameras, etc.), transforms the continuous amplitude values into discrete intensity levels. The amplitude is measured at the discrete coordinates produced by the sampling process [6]. As shown in Figure 1, this research uses images of rotten tangerines and images of tangerines that are not rotten.
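The sampling and quantization stages can be illustrated with a minimal sketch. It is not part of the original method: the function name sample_and_quantize, the synthetic scene function, and the grid sizes are all assumptions chosen for illustration.

```python
import math

def sample_and_quantize(f, rows, cols, levels=256):
    """Sample a continuous image f(x, y) on a rows x cols grid over the
    unit square, then quantize each amplitude in [0, 1] to an integer level."""
    image = []
    for r in range(rows):
        row = []
        for c in range(cols):
            x, y = c / cols, r / rows                 # sampling: discrete coordinates
            amplitude = min(max(f(x, y), 0.0), 1.0)   # clamp to the valid range
            level = min(int(amplitude * levels), levels - 1)  # quantization
            row.append(level)
        image.append(row)
    return image

def scene(x, y):
    """Hypothetical continuous intensity pattern with values in [0, 1]."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * x) * math.cos(2 * math.pi * y)

coarse = sample_and_quantize(scene, rows=4, cols=4, levels=4)     # low resolution
fine = sample_and_quantize(scene, rows=64, cols=64, levels=256)   # higher resolution
```

Increasing rows and cols corresponds to a finer sampling grid, while increasing levels corresponds to a finer quantization of intensity.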

2 Features Extraction
Feature extraction is one of the important steps in digital image processing. With feature extraction, we can obtain the characteristics that represent each image. This research extracts texture features using the Gray Level Co-occurrence Matrix (GLCM) method and color features using the Hue, Saturation, Value (HSV) method.

2.1 Gray Level Co-occurrence Matrix (GLCM)
GLCM is one of the established techniques for texture feature extraction and was proposed by Haralick in 1973. It shows how often a pixel with intensity value i, known as the reference pixel, occurs in a specific spatial relationship to a pixel with intensity value j, known as the neighbor pixel. Each element (i, j) of the matrix is therefore the number of occurrences of a pair of pixels with values i and j that are at a distance d from each other. The spatial relationship between two neighboring pixels can be specified with different offsets and angles; four spatial relationships are used here (0°, 45°, 90°, and 135°). Figure 2 illustrates the process of generating four symmetrical co-occurrence matrices with one neighboring pixel (d = 1) along the four directions, with offsets [0 1] for 0°, [-1 1] for 45°, [-1 0] for 90°, and [-1 -1] for 135° [7].

Figure 2 Co-occurrence matrix directions for extracting texture features
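As an illustration of the counting procedure, the following is a minimal pure-Python sketch of a symmetric co-occurrence matrix for d = 1. The function names, the small 4-level test image, and the choice to count each pair in both directions are assumptions made for this sketch, not details taken from the study.

```python
# Offsets (row step, column step) for d = 1, as listed in the text.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def glcm(image, angle, levels, symmetric=True):
    """Count how often gray level i (reference pixel) occurs next to
    gray level j (neighbor pixel) along the given direction."""
    dr, dc = OFFSETS[angle]
    rows, cols = len(image), len(image[0])
    matrix = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:   # skip pairs that leave the image
                i, j = image[r][c], image[nr][nc]
                matrix[i][j] += 1
                if symmetric:                       # also count the reversed pair
                    matrix[j][i] += 1
    return matrix

def normalize(matrix):
    """Turn counts into joint probabilities p(i, j) that sum to 1."""
    total = sum(sum(row) for row in matrix)
    return [[v / total for v in row] for row in matrix]

# Hypothetical 4 x 4 image with 4 gray levels (0..3).
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]

glcm_0 = glcm(IMAGE, angle=0, levels=4)
# glcm_0 == [[4, 2, 1, 0], [2, 4, 0, 0], [1, 0, 6, 1], [0, 0, 1, 2]]
p_0 = normalize(glcm_0)
```

In practice a grayscale image is usually reduced to a small number of gray levels before counting, so that the matrix stays compact.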
The co-occurrence matrix captures the texture properties but cannot be used directly as an analytical tool, for example to compare two textures. The matrix must be processed further to obtain numbers that can be used to classify textures. Haralick defined 14 textural features for classifying images, but only 5 of them are used in this study, namely ASM, contrast, IDM, entropy, and correlation. Figure 3 illustrates the steps in extracting texture features using the GLCM method.
Angular Second Moment (ASM) measures uniformity. ASM is high when pixel values are similar to each other, in other words when the pixel distribution is constant. The ASM value is calculated as shown in equation (1).
Contrast contains information about the spread (moment of inertia) of the elements of the image matrix. If the elements are located far from the main diagonal, the contrast value is large. Visually, contrast is a measure of the variation between the gray levels of an image area. The value can be calculated as shown in equation (2).
Inverse Difference Moment (IDM) shows the homogeneity of an image whose gray levels are similar. Equation (3) is used to calculate the IDM value.
Entropy shows the degree of disorder of an image. The entropy value is large when the gray-level transitions are spread evenly over the co-occurrence matrix and small when they are concentrated in a few entries. Equation (4) is used to calculate the entropy value.
Correlation shows a measure of the linear dependence of gray levels in an image, based on the joint probability of gray-level pairs, so that it can indicate the existence of a linear structure. The correlation value is calculated as shown in equations (5) to (9).
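The five features can be sketched from a normalized co-occurrence matrix p(i, j) using the standard Haralick definitions. This is a hedged illustration: the exact form of equations (1) to (9) is not reproduced here, and the function name, the base-2 logarithm, and the convention of returning 0 for the correlation of a constant image are assumptions.

```python
import math

def texture_features(p):
    """Compute ASM, contrast, IDM, entropy, and correlation from a
    normalized co-occurrence matrix p (entries sum to 1)."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]

    asm = sum(v * v for _, _, v in cells)
    contrast = sum((i - j) ** 2 * v for i, j, v in cells)
    idm = sum(v / (1 + (i - j) ** 2) for i, j, v in cells)
    entropy = -sum(v * math.log2(v) for _, _, v in cells if v > 0)

    # Means and standard deviations of the row and column marginals.
    mu_x = sum(i * v for i, _, v in cells)
    mu_y = sum(j * v for _, j, v in cells)
    sd_x = math.sqrt(sum((i - mu_x) ** 2 * v for i, _, v in cells))
    sd_y = math.sqrt(sum((j - mu_y) ** 2 * v for _, j, v in cells))
    covariance = sum((i - mu_x) * (j - mu_y) * v for i, j, v in cells)
    # Assumed convention: correlation is reported as 0 when a deviation is 0.
    correlation = covariance / (sd_x * sd_y) if sd_x * sd_y > 0 else 0.0

    return {"asm": asm, "contrast": contrast, "idm": idm,
            "entropy": entropy, "correlation": correlation}

# Two hypothetical 2-level matrices: one concentrated on the diagonal,
# one spread evenly over all entries.
diagonal = texture_features([[0.5, 0.0], [0.0, 0.5]])
uniform = texture_features([[0.25, 0.25], [0.25, 0.25]])
```

With these two inputs, the evenly spread matrix gives the larger entropy and the smaller ASM, while the diagonal matrix gives zero contrast and a correlation of 1.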

2.2 Hue, Saturation, Value (HSV)
The HSV color space is closer to how people experience color than the RGB color space. As hue (H) varies from 0 to 1.0, the corresponding colors vary from red, through yellow, green, cyan, blue, and magenta, back to red. As saturation (S) varies from 0 to 1.0, the corresponding colors (hues) vary from unsaturated (shades of gray) to fully saturated (no white component). As value (V), or brightness, varies from 0 to 1.0, the corresponding colors become increasingly brighter. Expressed as an angle, the hue component lies in the range 0° to 360° around a hexagon, as shown in Figure 4 [8].

Figure 4 HSV color space
The values of H, S, and V are obtained from R, G, and B as shown in equations (10) to (14) [9]. The conversion to HSV is performed on each pixel of an RGB image. After that, the averages of H (mean H), S (mean S), and V (mean V) are calculated. These average values characterize the color feature. Figure 5 illustrates the steps to convert RGB to HSV and calculate mean H, mean S, and mean V. In these equations:
r, g, b = the normalized values of R (red), G (green), B (blue).
R, G, B = the values of R (red), G (green), B (blue).
H, S, V = the values resulting from the conversion of R, G, B to H, S, V.
Figure 5 The steps to convert RGB to HSV and calculate mean H, mean S, mean V
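The per-pixel conversion and averaging can be sketched with the colorsys module from the Python standard library. This is only an approximation of the procedure: colorsys.rgb_to_hsv implements the common hexcone conversion with all three outputs in the range 0 to 1, which may differ in detail from equations (10) to (14). The function name mean_hsv and the sample pixels are assumptions.

```python
import colorsys

def mean_hsv(pixels):
    """Return (mean H, mean S, mean V) for an iterable of 8-bit (R, G, B) tuples."""
    count = 0
    sum_h = sum_s = sum_v = 0.0
    for red, green, blue in pixels:
        # colorsys expects channel values scaled to the range 0..1.
        h, s, v = colorsys.rgb_to_hsv(red / 255, green / 255, blue / 255)
        sum_h += h
        sum_s += s
        sum_v += v
        count += 1
    return sum_h / count, sum_s / count, sum_v / count

# Hypothetical two-pixel image: pure red and pure green.
features = mean_hsv([(255, 0, 0), (0, 255, 0)])
```

Because hue is an angle, a plain arithmetic mean can be misleading for colors close to red, where hue wraps around from 1.0 to 0; the sketch follows the simple averaging described in the text.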

3 4-fold Cross-Validation
The data set of 120 images is divided into training data and test data using the 4-fold cross-validation method. In this method, all data are divided equally into 4 groups, where 1 group becomes the test data while the other 3 become the training data [10]. From these 4 combinations of training and test data, validation results are obtained and the best combination is used in the other testing scenarios. Figure 6 shows the distribution of training data and test data by group, so that each combination has 30 data as test data and 90 data as training data.
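The division into four train/test combinations can be sketched as follows. The function name k_fold_splits and the use of contiguous index blocks are assumptions; the text does not state how images were assigned to groups or whether each group was balanced between the two classes.

```python
def k_fold_splits(n_samples, n_folds=4):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    fold_size = n_samples // n_folds
    indices = list(range(n_samples))
    splits = []
    for fold in range(n_folds):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = indices[start:stop]                 # one group is the test data
        train = indices[:start] + indices[stop:]   # the other groups are training data
        splits.append((train, test))
    return splits

splits = k_fold_splits(120, n_folds=4)
for train, test in splits:
    print(len(train), len(test))   # each fold has 90 training and 30 test items
```

If the images are stored class by class, shuffling or stratifying the indices before splitting would be needed so that every fold contains both classes.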

k-Nearest Neighbor (k-NN)
The k-Nearest Neighbor (k-NN) classification algorithm is one of the most popular approaches used by researchers and practitioners in the areas of pattern recognition and machine learning. Generally speaking, k-NN needs only one parameter to be adjusted, k, which represents how many closest neighbors are considered when classifying an unseen object. Once this parameter is set, two main approaches can be followed to classify an object: a majority vote of the k neighbors, or a weighted vote of the k neighbors that takes into account the distance of each neighbor from the object to classify [11].
For image classification, a method is needed to calculate the distance between the training image data and the test image data. One of the most popular is the Euclidean distance, which is discussed in the following subsection. Figure 7 illustrates the steps of k-NN.
Figure 7 The steps of k-NN
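The majority-vote variant of k-NN can be sketched in a few lines. The function names, the two-dimensional feature vectors, and the class labels below are hypothetical and serve only to illustrate the voting step; the weighted-vote variant is not shown.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_predict(train_features, train_labels, query, k=5):
    """Classify query by a majority vote among its k nearest training vectors."""
    distances = sorted(
        (euclidean(features, query), label)
        for features, label in zip(train_features, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set with two well-separated classes.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["rotten", "rotten", "rotten",
           "not rotten", "not rotten", "not rotten"]

print(knn_predict(train_x, train_y, (0.2, 0.1), k=3))   # rotten
print(knn_predict(train_x, train_y, (5.5, 5.5), k=3))   # not rotten
```

An odd k, such as the values 3, 5, and 7 used in this study, avoids tied votes in a two-class problem.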

4.1 Euclidean distance
Euclidean distance is used to determine the level of similarity or dissimilarity of two input vectors. The level of similarity takes the form of a score, and based on this score two feature vectors are judged to be similar or not. The Euclidean distance is the square root of the sum of the squared differences between the components of the two vectors, as shown in equation (15).
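In its standard form, the Euclidean distance between the feature vectors of image i and image j can be written as below. The symbol d_ij for the distance and n for the number of features are assumptions; f_ik and f_jk denote the value of feature k in image i and image j.

```latex
d_{ij} = \sqrt{\sum_{k=1}^{n} \left( f_{ik} - f_{jk} \right)^{2}}
```

As a worked example with hypothetical values, for f_i = (1, 2, 3) and f_j = (4, 6, 3) the distance is the square root of 9 + 16 + 0, which equals 5.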

5 Confusion Matrix
A confusion matrix is a table that describes the performance of a classifier or classification model. It contains information about the actual and predicted classifications made by the classifier, and this information is used to evaluate its performance. Several standard metrics can be calculated from the values in the confusion matrix: accuracy, precision, sensitivity, and specificity [12]. Table 1 shows the confusion matrix for the two-class classification problem.
ISSN (print): 1978-1520, ISSN (online): 2460-7258. IJCCS Vol. 13, No. 4, 2019: 357-368
Precision is the proportion of the data predicted as belonging to a class that actually belong to that class. Precision is calculated as shown in equation (17).
Specificity (true negative rate) is the proportion of the data not belonging to a class that are correctly predicted as not belonging to that class. Specificity is calculated as shown in equation (19).
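The four metrics can be sketched using their standard definitions, with a as the number of correct negative predictions, b the number of incorrect positive predictions, c the number of incorrect negative predictions, and d the number of correct positive predictions. The function names and the counts in the example are hypothetical; the raw counts behind the reported results are not given in the text.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a class has no members."""
    return numerator / denominator if denominator else 0.0

def confusion_metrics(a, b, c, d):
    """a = true negatives, b = false positives,
    c = false negatives, d = true positives."""
    return {
        "accuracy": safe_div(a + d, a + b + c + d),
        "precision": safe_div(d, b + d),
        "sensitivity": safe_div(d, c + d),
        "specificity": safe_div(a, a + b),
    }

# Hypothetical counts for a test fold of 30 images.
metrics = confusion_metrics(a=15, b=0, c=2, d=13)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
```

With these assumed counts the sketch prints an accuracy of 93%, a precision of 100%, a sensitivity of 87%, and a specificity of 100%.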

RESULTS AND DISCUSSION
In this study, several test scenarios were used to see which scenario provides the highest accuracy. First, the division of training data and test data was determined using 4-fold cross-validation. The division of data folds can be seen in Figure 6. Each data fold was tested using the combination of GLCM and HSV feature extraction and k-NN classification with k = 3, 5, and 7. The performance of each data fold can be seen in Table 2. From the data shown in Table 2, the best accuracy is obtained with data fold 4, so the other test scenarios use data fold 4.
The next test compares the system performance for the GLCM features, the HSV features, and the combination of GLCM and HSV features. The classification uses the k-NN method with k = 3, 5, and 7. In addition, the system is tested on test data that have no data labels (unlabeled). Table 3 shows the performance test results for the classification of tangerines. The highest performance value, 93%, is obtained from HSV feature extraction with unlabeled test data at k = 5 and 7.
Even though the performance of the combined feature extraction is lower than that of HSV alone, the combined extraction method still needs to be tested, because the two methods extract different features: GLCM extracts texture features, while HSV extracts color features. The object of this study is the tangerine, which is classified as rotten or not rotten. In visual observation, the clear differences between the two classes are changes in color, shape, and texture.

The system has a high sensitivity of 87% when tested with HSV feature extraction and unlabeled test data at k = 5 and 7. Specificity measures how well images that do not belong to a class are kept out of that class, so that one class does not affect the other. It is the counterpart of sensitivity. The highest specificity, 100%, is found in the HSV feature extraction test with unlabeled test data at k = 5 and 7. Precision is the ratio of true positive predictions to all positive predictions. The highest precision, 100%, is found in the HSV feature extraction test with unlabeled test data at k = 5 and 7. This shows that the class of tangerines that are not rotten does not affect the classification results for the rotten tangerine class.
Accuracy is the percentage of test data that are predicted correctly. The highest accuracy of this system, 93%, is found in the HSV feature extraction test with unlabeled test data at k = 5 and 7. This shows that the number of incorrectly classified data is small.

CONCLUSIONS
The system can classify the quality of tangerines as rotten or not rotten. The use of 4-fold cross-validation helps find the right combination of training data and test data to obtain good system performance. The combined GLCM and HSV feature extraction provides a best accuracy of 80% in tests with k = 5 using labeled test data, while the best overall accuracy of 93% is obtained from HSV feature extraction with k = 5 and 7 using unlabeled test data.

SUGGESTION
The rotten tangerine class has the highest sensitivity, 100%, in this test. Further study on combining these methods is needed so that the other types can also be classified properly. In this study, only one orange variety (tangerine) was used. In future work, this combined method can be used to classify the quality of more fruit varieties.

Figure 3 The steps of extracting GLCM features

Figure 6 The distribution of training data and test data by 4-fold cross-validation

Equation (15) gives the Euclidean distance between vector i and vector j, where:
f_ik = the value of feature k in image i.
f_jk = the value of feature k in image j.
Sensitivity (true positive rate) measures the proportion of the data belonging to a class that are predicted in accordance with the actual conditions. Sensitivity is calculated as shown in equation (18).

Table 1
In Table 1, a is the number of correct negative predictions, b is the number of incorrect positive predictions, c is the number of incorrect negative predictions, and d is the number of correct positive predictions. Accuracy is the percentage of predictions that are correct with respect to the actual conditions. Accuracy is calculated as shown in equation (16).

Table 2 Average 4-fold cross-validation test performance results

Table 3 Performance test results for the classification of tangerines