Dataset Splitting Techniques Comparison For Face Classification on CCTV Images

The performance of machine-learning classification models is influenced by many factors, one of which is the dataset splitting method. To avoid overfitting, it is important to apply a suitable dataset splitting strategy. This study presents a comparison of four dataset splitting techniques, namely Random Sub-sampling Validation (RSV), k-Fold Cross Validation (k-FCV), Bootstrap Validation (BV) and Moralis Lima Martin Validation (MLMV). The comparison is carried out for face classification on CCTV images using the Convolutional Neural Network (CNN) and Support Vector Machine (SVM) algorithms, and it is applied to two image datasets. The results are reviewed using model accuracy on the training set, validation set and test set, as well as the bias and variance of the model. The experiment shows that the k-FCV technique has more stable performance and provides high accuracy on the training set as well as good generalization on the validation set and test set. Meanwhile, data splitting using the MLMV technique performs worse than the other three techniques since it yields lower accuracy. This technique also shows higher bias and variance values and builds overfitted models, as seen especially in its validation-set results.


INTRODUCTION
The construction of classification models with a machine-learning algorithm involves many parameters that must be set carefully to produce a good model. The higher the complexity of the model, the more discriminating power it possesses, although the risk of overfitting also increases. Overfitting is a condition in which the model performs well on the training samples but poorly on unknown samples in the testing process. In other words, the model does not generalize well to unseen cases. Overfitting is a fundamental problem in supervised machine learning that may occur because the model tends to learn the entire training data, including noise, rather than the patterns or important hidden information in the data [1].
The framework of a machine-learning algorithm generally starts by dividing the data into three parts: training set, validation set and test set. The model is then built from the training set. At this stage the algorithm "learns", that is, it analyzes the training set in order to build a mathematical model that represents the given data. Next, the model is tested on the validation set in order to evaluate its performance. If the resulting performance is unacceptable, the learning step is revisited by changing the parameters to revise the model [2]. The predictions of the model on the validation set are used to determine the model's accuracy. However, measuring performance on the validation set sometimes produces a biased estimate. Therefore, an additional test set that is not used in the training and validation process is required [3].
There are many factors influencing model performance, including the algorithm, the characteristics of the data, the number of samples available, and the method used to divide the data. To build a model with good generalization, an appropriate dataset splitting strategy must be applied. Even a strong algorithm can perform poorly under unfavorable data splitting. Data splitting techniques are often based on simple random sampling using specified ratios, chosen for example by considering the amount and variation of the data. In general, the training set is given the larger portion so that the model has enough data to learn from.
The simplest method for splitting data into a training set and a validation set is the Hold-out Validation (HoV) technique. In this technique, the data is divided directly into three parts for training, validation and testing of the model, with adjusted proportions. The Random Subsampling Validation (RSV) method is a modification of HoV in which the data samples are drawn randomly and the split is repeated several times [4]. The validation techniques most often used in current studies are Cross Validation (CV) techniques. Research [5] uses the k-Fold Cross Validation (k-FCV) technique to divide brain image sample data for tumor diagnosis with the Support Vector Machine (SVM) algorithm. Research [6] uses k-FCV to validate models that predict software errors, whereas research [7] uses k-FCV to validate various algorithms for classifying vegetation physiognomy in satellite images. The CV technique is widely used because it is considered effective for relatively small samples. This is also the reason for using the Nested-CV technique to predict autism in research [8], where large numbers of samples are difficult to obtain. Research [9] applies the j-k-FCV technique, which is an optimization of the k-FCV method. In this technique, k-FCV is applied repeatedly in order to reduce the variance of the model, with an application in Natural Language Processing (NLP). Although k-FCV is widely used, research [10] suggests the Bootstrap Validation (BV) technique because it strikes a better balance between the bias and variance of the model.
Samples can be selected for dataset splitting in two ways: randomly and systematically. The idea of systematic sample selection arises from several considerations. One is that the distance or variation between the training set and the test set must be minimized [2], because an unequal distribution of feature variations makes the training set unrepresentative. Another idea is that a good training set should have high data variance so that the model can learn from data with sufficient diversity. This idea is used in the Kennard-Stone (KS) technique. According to research [3], systematic sampling such as KS generally gives a poor estimate of model performance. This might be because the samples taken for the training set are so representative that the samples left for the validation set are not. Research [11] introduces the Moralis-Lima-Martin (MLM) validation technique, which slightly modifies the KS algorithm by adding a "degree of randomness" factor. This is based on the concept that random factors will always arise because of variations in the environment.
Various data splitting techniques have been compared in many fields and with many algorithms. The CV technique has evolved and has many modifications. Research [12] applies several types of CV techniques and compares their performance in mapping geographical areas into several groups. Research [13] applies seven validation methods to nonstationary time series data and concludes that the Forward Validation scheme provides a better estimate. Research [3] also compares five validation techniques under various parameter settings. That study concludes that systematic sampling using the Kennard-Stone (KS) method and Sample Set Partitioning Based on Joint X-Y Distance (SPXY) gives a poor estimate of model performance.
Research [11] applied the MLMV technique to six infrared spectral datasets in the biomedical field and compared it with KS and random selection. The results show that MLMV gives better predictions than the other two techniques. Since the proposed technique is relatively new, no application of MLMV to other types of datasets has been reported yet. In addition, MLMV has not yet been compared with data splitting techniques other than KS and random selection.
The contribution of this paper is to apply the MLMV technique to another kind of dataset, namely CCTV image datasets, and to compare it with other data splitting techniques. The data splitting techniques used are RSV, k-FCV and BV, as the most widely used techniques, and the MLMV method, which results from modifying the KS algorithm. To avoid a biased conclusion, two machine-learning algorithms and two CCTV facial image datasets are used. The algorithms are Support Vector Machine (SVM) and Convolutional Neural Network (CNN), while the datasets are those obtained from research [14] and [15].

Research Design
This research focused on evaluating dataset splitting techniques by dividing the data into three parts: training set, validation set, and test set. The data were facial images that had already been segmented, so the classification algorithms could be applied directly. The general design of the research is shown in Figure 1.
As shown in Figure 1, the research started by taking 10% of the data as the test set. The test set was separated first so that it was truly unseen by the model and unaffected by the data splitting techniques. The remaining data were then partitioned into a training set and a validation set according to each data splitting technique. The training data were then processed with the CNN and SVM algorithms. This research did not tune the parameters of the classifiers. The validation sets were used to measure the performance of the trained models and to determine which validation technique produced the best estimate. Next, the models were evaluated on the test set. The results were analyzed by reviewing the accuracy, bias and variance values of the models across all experiments conducted.
The datasets used are dataset 1 (D1), taken from research [14], and dataset 2 (D2), taken from research [15]. From each dataset, 3000 face images are used. D1 consists of 22 classes. Its video is taken from the front, indoors, by CCTV under natural conditions and lighting. The CCTV video resolution is 960 x 720 and 640 x 480 at 10 fps. D2 is the Choke-Point dataset, which consists of 25 classes originating from portal 1 of 3. Different cameras are positioned above the portal to capture subjects walking naturally.
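The 10% hold-out step can be sketched as follows. This is a minimal illustration that assumes simple random selection; the source does not state how the test samples were chosen, and the function name, the seed and the placeholder data are hypothetical.

```python
import random

def hold_out_test_set(samples, test_fraction=0.10, seed=0):
    """Shuffle the samples and set aside a fixed fraction as the test set."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    # The first n_test shuffled items become the test set and are never
    # touched again by any of the splitting techniques.
    test_set = [samples[i] for i in indices[:n_test]]
    remainder = [samples[i] for i in indices[n_test:]]
    return remainder, test_set

# 3000 placeholder items stand in for the face images of one dataset.
remainder, test_set = hold_out_test_set(list(range(3000)))
```

Only `remainder` would then be passed to a splitting technique to form the training and validation sets.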

Data Splitting Techniques

Random Sub-sampling Validation (RSV)
The RSV method is a commonly used data splitting technique because it is simple and sometimes provides good performance. In the RSV technique, the dataset is divided into a training set, a validation set and a test set by randomly shuffling the data according to the specified sizes. This process is repeated several times to obtain the average model accuracy. In this study, the data for training and validation are split 80% and 20%, respectively.
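A minimal sketch of repeated random sub-sampling is given below. The number of repetitions is not reported in the source, so `n_repeats` is an assumption, and `evaluate` is a hypothetical callback that trains a model on the training indices and returns its validation accuracy.

```python
import random

def random_subsampling_splits(n_samples, n_repeats=10, train_fraction=0.8, seed=0):
    """Yield (training, validation) index lists from repeated random splits."""
    rng = random.Random(seed)
    n_train = round(n_samples * train_fraction)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random order for every repetition
        yield indices[:n_train], indices[n_train:]

def average_accuracy(splits, evaluate):
    """Average the accuracy returned by evaluate(training, validation)."""
    scores = [evaluate(training, validation) for training, validation in splits]
    return sum(scores) / len(scores)
```

Because every repetition shuffles independently, a sample may land in the validation set several times or not at all, which is the main difference from k-FCV.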

K-Fold Cross Validation (k-FCV)
The k-FCV method is the most popular data splitting technique because it provides good performance. It divides the data into k equal parts (called folds). One part is held out as the validation set and the remaining k-1 parts are used as the training set. This process is repeated k times, so that each part is used as the validation set exactly once. The predicted performance is then averaged over these experiments. In this study, k=5, so the training and validation proportions were 80% and 20%, respectively.
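The fold rotation can be sketched as below. This is an illustrative implementation rather than the code used in the study; the shuffling step and the seed are assumptions.

```python
import random

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield k (training, validation) index lists, one per fold."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation
```

With k=5, each validation fold holds 20% of the data and the other four folds form the 80% training set.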

Bootstrap Validation (BV)
Bootstrapping is a technique for estimating statistical parameters of an unknown data distribution. It is considered one of the good methods for model selection. Bootstrap randomly chooses n samples with replacement, meaning that the same sample can be selected several times. The selected samples are used as the training set and the unselected samples are used as the validation set. This process is repeated to obtain the average model accuracy. According to [10], on average 36.8% of the samples do not appear in a bootstrap sample because the sampling is done with replacement.
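The 36.8% figure follows from the probability that a given sample is missed in all n draws, (1 - 1/n)^n, which approaches 1/e ≈ 0.368 as n grows. The sketch below illustrates the split and checks that fraction empirically; the function names, sample size and repetition count are illustrative assumptions.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n indices with replacement; unselected indices form the validation set."""
    training = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(training)
    validation = [i for i in range(n_samples) if i not in chosen]
    return training, validation

def mean_out_of_bag_fraction(n_samples=1000, n_repeats=200, seed=0):
    """Average share of samples that never appear in a bootstrap sample."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_repeats):
        _, validation = bootstrap_split(n_samples, rng)
        total += len(validation) / n_samples
    return total / n_repeats
```

Unlike RSV and k-FCV, the training set here contains duplicates, and the size of the validation set varies from one repetition to the next.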

Moralis-Lima-Martin Validation (MLMV)
The Kennard-Stone (KS) algorithm, also known as Computer-Aided Design of Experiments (CADEX), is designed to select the most representative samples of a given dataset. In the first step, the Euclidean distance between each pair of samples is calculated. The pair with the largest distance is chosen and ranked as the most representative. Then, in each of the next steps, the remaining sample with the largest distance from the already selected samples is added as the sample of the next rank. This procedure is repeated until the number of samples on the list equals the number of samples needed in the training set. The selected samples are used as the training set, because a representative dataset is important for a good training process, and the remaining samples are used as the validation set. Unlike CV and bootstrapping, the KS algorithm splits the dataset only once, at the beginning.
Research [11] proposed the MLMV technique, which develops the KS technique by adding an element of random sampling to the KS algorithm. The purpose of the random factor is to make the data more natural, reflecting variations in the environment. The MLMV method exchanges samples using a 10% random factor. This value is inspired by the mutation probability in genetic algorithms, where 10% is the threshold used to maintain a balance between the degree of randomness and model convergence. In this study, the data for training and validation were split 80% and 20%, respectively. The distance between samples in this experiment was taken as the distance between the image histograms. An overview of the data splitting methods used is shown in Figure 2.
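The selection procedure can be sketched as follows. The sketch assumes the usual Kennard-Stone rule, in which "largest distance from the selected samples" means the largest minimum distance to any already selected sample, and it takes plain feature tuples as input rather than image histograms.

```python
import math

def kennard_stone(points, n_select):
    """Return (selected, remaining) index lists chosen by the max-min rule."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of points")
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Step 1: seed the selection with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [i for i in range(n) if i not in (first, second)]
    # Distance from each remaining sample to its nearest selected sample.
    min_dist = {i: min(dist[i][first], dist[i][second]) for i in remaining}
    # Next steps: repeatedly add the sample farthest from the selected set.
    while len(selected) < n_select:
        nxt = max(remaining, key=lambda i: min_dist[i])
        selected.append(nxt)
        remaining.remove(nxt)
        del min_dist[nxt]
        for i in remaining:
            min_dist[i] = min(min_dist[i], dist[i][nxt])
    return selected, remaining
```

The selected indices would form the training set and the remaining indices the validation set.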
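One possible reading of the 10% exchange is sketched below. The source does not specify what the 10% is taken of or how samples are swapped, so this sketch assumes that an equal number of samples moves in each direction (keeping the 80/20 sizes) and that the count is 10% of the smaller set. The function name and the seed are hypothetical, and the initial split is assumed to come from a KS-style selection.

```python
import random

def random_exchange(training, validation, rate=0.10, seed=0):
    """Swap a random share of samples between an existing training/validation split."""
    rng = random.Random(seed)
    # Assumption: the swap count is 10% of the smaller of the two sets.
    n_swap = round(rate * min(len(training), len(validation)))
    leaving_training = set(rng.sample(training, n_swap))
    leaving_validation = set(rng.sample(validation, n_swap))
    new_training = [i for i in training if i not in leaving_training]
    new_training += sorted(leaving_validation)
    new_validation = [i for i in validation if i not in leaving_validation]
    new_validation += sorted(leaving_training)
    return new_training, new_validation
```

With 80 training and 20 validation indices, this moves two samples in each direction, so the split sizes stay the same while the deterministic KS ordering is partly randomized.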

Classification Algorithms
The algorithms used to classify the face images are Convolutional Neural Network (CNN) and Support Vector Machine (SVM). Both are very popular classifiers because they can provide high classification accuracy. CNN is a deep learning algorithm that uses the concept of convolution to extract image patterns directly. This research uses a convolution [...] SVM is a classification algorithm that separates data with a hyperplane. The kernel used in this research is the Radial Basis Function (RBF). SVM is known as an algorithm that can find a balance between model accuracy and generalization to other samples [12]. This study did not compare the classification algorithms. The performance of the models could change if the parameters of the two algorithms were changed.
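For reference, the RBF kernel is commonly written as below, where x and x' are two feature vectors and γ > 0 controls the kernel width. The value of γ used in this research is not reported in the source, so the symbol is left unspecified here.

```latex
K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \,\lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\right), \qquad \gamma > 0
```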

RESULTS AND DISCUSSION
We conducted the experiment with two datasets, D1 and D2, and two classification algorithms, namely CNN and SVM. Figure 3 shows the results of training, validation and testing on the D1 dataset. In general, the accuracy charts form convex curves in which the accuracy rises up to a certain dataset size and then stagnates, decreases, or oscillates after reaching the peak. In the training charts, MLMV shows the best training accuracy when applied to the CNN algorithm, while BV shows the best training accuracy when applied to the SVM algorithm. The training accuracy differs only slightly among the validation techniques. The best training accuracy, 94.6%, is obtained at dataset size=2000 using the SVM algorithm and the BV technique. The lowest training accuracy, 65.7%, is obtained at dataset size=500 using the CNN algorithm and the BV technique.
In the validation charts, BV shows the highest validation accuracy when applied to the CNN algorithm, while k-FCV provides the highest validation accuracy when applied to the SVM algorithm. The difference in accuracy between RSV, k-FCV and BV is quite small, but there is a large gap to the MLMV technique. MLMV provides the lowest validation accuracy at almost all dataset sizes with both the CNN and SVM algorithms. The best validation accuracy, 89.2%, is obtained at dataset size=1500 using the SVM algorithm and the RSV technique. The lowest validation accuracy, 46.9%, is obtained at dataset size=500 using the CNN algorithm and the MLMV technique.
In the testing charts, k-FCV also shows the highest testing accuracy with both the CNN and SVM algorithms. The difference in testing accuracy among the four techniques is quite small with the SVM algorithm, but MLMV falls considerably behind with the CNN algorithm. MLMV provides the lowest testing accuracy with the CNN algorithm, whereas BV gives the lowest value with the SVM algorithm. The best testing accuracy, 91.7%, is obtained at dataset size=1500 using the SVM algorithm and the k-FCV technique. The lowest testing accuracy, 63.3%, is obtained at dataset size=500 using the CNN algorithm and the MLMV technique.
Figure 4 shows the results of training, validation and testing on the D2 dataset. In general, the accuracy on the D2 dataset is higher than on the D1 dataset. This may be because the resolution in D2 is higher than in D1. In the training charts, although the differences in accuracy are very small, BV generally shows the best training accuracy. The best training accuracy, 100%, is obtained at dataset size=500 using the SVM algorithm and the MLMV technique. The lowest training accuracy, 88.2%, is obtained at dataset size=2500 using the CNN algorithm and the MLMV technique.
In the validation charts, k-FCV and BV show the highest validation accuracy when applied to the CNN algorithm, while k-FCV provides the highest validation accuracy when applied to the SVM algorithm. As in D1, the difference in accuracy between RSV, k-FCV and BV is quite small, but the gap to the MLMV technique is quite large. MLMV provides the lowest validation accuracy at almost all dataset sizes with both the CNN and SVM algorithms. The best validation accuracy, 98.6%, is obtained at dataset size=1500 using the SVM algorithm with the RSV and k-FCV techniques. The lowest validation accuracy, 77.9%, is obtained at dataset size=2500 using the CNN algorithm and the BV technique.
In the testing charts, k-FCV and BV show the highest testing accuracy with the CNN algorithm, while k-FCV shows the highest testing accuracy with the SVM algorithm. The testing accuracy of the four techniques differs only slightly with the SVM algorithm, whereas MLMV falls considerably behind with the CNN algorithm. MLMV provides the lowest testing accuracy with the CNN algorithm. The best testing accuracy, 99.3%, is obtained at dataset size=500 using the SVM algorithm and the k-FCV technique. The lowest testing accuracy, 86%, is obtained at dataset size=500 using the CNN algorithm and the MLMV technique.
The overall average accuracy of training, validation and testing on the D1 and D2 datasets is shown in Table 1. The RSV data splitting technique does not show superior accuracy. There are two minimum values in the training process, when using SVM on D1 and when using CNN on D2. The k-FCV data splitting technique gives the minimum value in almost all training processes, by a slight margin. However, it provides the maximum accuracy in almost all validation and testing results. This means that although the technique gives a lower training accuracy than the other techniques, it generalizes well to the overall data; in other words, there is no significant overfitting or underfitting. The BV data splitting technique dominates when applied to the D2 dataset using CNN. Meanwhile, the MLMV data splitting technique gives the lowest accuracy in the validation and testing processes. This study therefore concludes that MLMV has lower performance than the other three techniques.
Figure 5. Training, validation and testing of the models using the MLMV technique
The overfitting phenomenon shows that the MLMV algorithm, which is based on KS with random mutation and aims to let models learn wide data variations, is unsuccessful. This is likely due to two things. First, the data may not be separated properly, so the purpose of selecting training data with varied feature samples is not achieved. The distance between image histograms may not represent the actual feature distance. The MLMV split could be made more representative by using a distance based on other features, such as the weights of a pre-trained CNN. It is also necessary to add the label distance between samples, for example by using the SPXY technique, because in this study data samples with a large feature distance come from the same label. Second, the features may be well distinguished, but the model learns too much from noise because the range of data variation is too large. As a result, the model performs well on the training set but not on the validation set. However, the actual measure of this technique can be seen in the test set accuracy, where the data is free from structured data splitting manipulation and the model predictions are made on more realistic data. Moreover, this technique also performs well and differs only slightly in accuracy from the other techniques when it is applied to a suitable dataset and classification algorithm, as with D2 and the SVM algorithm, where the testing accuracy is 96.3%.
Model performance can also be reviewed in terms of bias and variance. Lower bias and variance indicate better model performance. Table 2 compares the bias and variance values of the models for the various dataset splitting techniques. The k-FCV technique produces the smallest model bias, except when it is applied with CNN on the D2 dataset. It also produces the smallest variance when applied with the SVM algorithm. With the CNN algorithm, the smallest variance is obtained with the BV technique. The models using the MLMV technique consistently show the largest bias and variance values. This comparison shows that the MLMV technique has lower performance than the others.
Table 2. The comparison of bias and variance values

CONCLUSIONS
The comparison results show that the k-FCV technique has more stable performance. It provides high accuracy on the training set as well as good generalization on the validation set and test set. In general, it also produces lower bias and variance values than the other techniques. The BV data splitting technique is superior when applied to the D2 dataset using the CNN algorithm. The RSV technique does not show superior accuracy, but it differs only slightly from the best average accuracy. Meanwhile, the MLMV technique gives the lowest average value. Therefore, it can be concluded that in this study MLMV has lower performance than the other three techniques. It produces overfitted models, which is especially evident on the validation set. This may be due to a failure of systematic data splitting in building the training set, or to the model's failure to learn information when there is too much noise in the data. The low performance of the models using MLMV is also indicated by their high bias and variance values compared with the other techniques.