Comparison Non-Parametric Machine Learning Algorithms for Prediction of Employee Talent

Ordinal data are a type of categorical data whose features take values based on order or ranking. Machine learning methods are used in Human Resources Management to support decision-making based on objective data analysis rather than subjective judgment. The purpose of this study is to analyze the relationships between features and to determine whether features used as objective factors can classify and predict whether an employee is talented. This study uses a public dataset provided by IBM analytics. The dataset is analyzed with statistical tests and a confirmatory factor analysis validity test to determine the relationship or correlation between features when formulating the hypothesis tests. A model is then built by comparing four algorithms: Support Vector Machine, K-Nearest Neighbor, Decision Tree, and Artificial Neural Networks. The test results are expressed in the Confusion Matrix and the classification report of each model. The best evaluation is produced by the SVM algorithm, with identical Accuracy, Precision, and Recall values of 94.00%, Sensitivity of 93.28%, a False Positive rate of 4.62%, a False Negative rate of 6.72%, and an AUC-ROC value of 0.97, which falls in the excellent category for classification in the employee talent prediction model.
Keywords: non-parametric, machine learning, ordinal data, employee talent.
IJCCS Vol. 15, No. 4, October 2021: 403 – 414. ISSN (print): 1978-1520, ISSN (online): 2460-7258.


INTRODUCTION
Data mining methods have been applied in human resource management and show good prospects there. Data mining tools have a positive impact in supporting management and policy development in organizations. Machine learning is one technique that can provide important support for Human Resources Management (HRM) applications, which are usually limited by interpretations and subjective decisions based on employee behavior [1]. By adopting this technology, organizations gain efficiency and competitive advantage through the collection, management, and analysis of data, and improve the decision-making process used to achieve the organizational goals that have been set [1].
This study discusses the application of machine learning techniques in the HR department by analyzing a dataset provided by IBM analytics. This dataset was selected because its variables and attributes reflect an employee database and include the supporting variables and attributes owned by an organization; it consists of 35 variables and 1470 samples. Four nonparametric algorithms are used, namely Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), and Artificial Neural Networks (ANN). These four algorithms were selected based on: (1) the characteristics and types of data to be processed, (2) the number of variables and samples used, (3) their suitability for classification and prediction, and (4) the advantages and disadvantages each has in generating models during training and testing [2].
The objectives of this study are: (1) to analyze and compare the performance of nonparametric machine learning algorithms in classifying and predicting employee talent based on ordinal category datasets, (2) to produce predictive models based on the concept of talent management using tested variables, and (3) to determine whether the results of comparing nonparametric algorithms in classifying and predicting talented or non-talented employees can be used in objective decision-making. In addition, this research is useful in: (1) providing an alternative for developing concepts and application models in the talent management module, and (2) serving as material for evaluating and testing the relationships between variables, based on hypothesis testing by previous researchers, using machine learning methods and the Python programming language to study the employee talent prediction case.

METHODS
The very large volume of employee data and information (big data) owned by an organization can be analyzed using machine learning technology. Previous researchers have applied machine learning methods and algorithms in HRM and other applied sciences. Examples include predicting student activity level by comparing the SVM and DT algorithms on a dataset of 1530 samples [3]; comparing the performance of the DT, SVM, KNN, and Naïve Bayes (NB) algorithms in predicting student alcohol consumption on a dataset of 1024 samples [4]; "Maintain, and Evaluate student's performance" using the DT algorithm, Linear Regression, Multiple Regression, and Logistic Regression [5]; "Talent Identification in Soccer using a one-class SVM" for identifying prospective athletes in soccer [6]; and predicting the right candidate for the right job, based on the required qualities and the applicant's resume, using approximately 500 samples with the DT, Naïve Bayes, and CART algorithms [7].
Previous studies show that machine learning algorithms achieve a good level of accuracy in classification and prediction and can be applied in the research field to help make better decisions [7], [8], and that each algorithm has advantages and disadvantages in classification and prediction [3]-[5], [8]. Classification and prediction results are influenced by several factors, such as the number of training samples used, data types and characteristics, and the selection of appropriate algorithms and statistical methods [1], [8], [9]. No single algorithm is superior to all others for every problem, which is known as the "no free lunch" theorem for supervised machine learning.
One approach to statistical data processing is the use of nonparametric methods. The Wilcoxon Rank Sum test is a nonparametric statistical hypothesis test that compares two independent samples to assess whether their population mean ranks differ; its paired counterpart, the Wilcoxon signed-rank test, is the one used for two related samples, matched samples, or repeated measurements of one sample [10]. The Mann-Whitney test is a nonparametric test used to determine whether two independent samples of ordinal data come from populations that differ in central tendency, assuming the two populations have equally shaped distributions. The Kruskal Wallis test is a nonparametric test that assesses the difference between three or more groups of independent samples that are not normally distributed (ordinal or ranked data) [11]. The Confirmatory Factor Analysis (CFA) test is carried out to strengthen the results of the statistical tests by checking the previous hypothesis tests, that is, whether there is a relationship or correlation between the measured dependent and independent variables; it can also be used to determine the construct validity of the sample in a survey [12], [13].
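As a minimal sketch of how a rank-based test of this kind works, the Mann-Whitney U statistic can be computed from the pooled ranks of two independent samples, with tied values sharing their average rank. The two score lists below are hypothetical and are not taken from the IBM dataset, and the helper names `midranks` and `mann_whitney_u` are illustrative. The sketch stops at the statistic; in practice a library routine such as `scipy.stats.mannwhitneyu` would also supply the p-value.

```python
def midranks(values):
    """Return 1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1  # mean of 1-based positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples."""
    pooled = list(sample_a) + list(sample_b)
    ranks = midranks(pooled)
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical ordinal scores (1-4) for two independent groups of employees.
group_low = [1, 2, 2, 3, 3]
group_high = [2, 3, 3, 4, 4]
u_low, u_high = mann_whitney_u(group_low, group_high)
print(u_low, u_high)
```

A small U for one group relative to the product of the two sample sizes indicates that its values tend to rank below those of the other group.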
The Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC) are used to assess classification accuracy by comparing predictions with actual target values [6], [14]. The ROC curve describes model performance, or compares models, across the full range of classification thresholds; the AUC value lies in the interval from 0 to 1, as shown in Table 1.
In the work environment, employee job involvement relates to how a person manages their behavior at work and becomes part of the life cycle of an organization in achieving its goals. Employees who are engaged in their work feel that it is more meaningful when they can show better performance [15], [16]. Job satisfaction is very important in enabling an employee to bring out their abilities to the fullest at work [17].
Although talent management has a strategic role in a modern organization, little research has been done on the impact of talent management on employee performance with job satisfaction as a mediator [18]. Other research shows a close relationship between work-life balance, employee performance, and job satisfaction, and shows that work-life balance can improve employee performance through employee job satisfaction [19]. Another hypothesis states that the higher a person's job involvement, the higher their performance [20]. This is related to the conceptual model of Talent Management, in which employee recognition is related to employee performance and the concept of talent management is related to employee performance [21].
Based on the results of previous studies, the following hypotheses were formulated using the IBM analytics dataset:
a. H1: Is there a positive relationship between education and performance rating?
b. H2: Is there a positive relationship between environment satisfaction and performance rating?
c. H3: Is there a positive relationship between job involvement and performance rating?
In this study, the performance rating variable is used as the target in the classification process, and the other ordinal variables, namely education, environment satisfaction, job involvement, job level, job satisfaction, relationship satisfaction, and work-life balance, are used as predictors.

Nonparametric Statistical Test
The ordinal data used for the experiment go through statistical tests and a CFA test to strengthen hypothesis testing. Statistical tests were carried out on the ordinal data using a correlation coefficient to determine the correlation, or rank relationship, between two variables. After the statistical tests are carried out and conclusions are drawn from hypothesis testing, the CFA validity test is applied to test measured and unmeasured variables. The CFA test is limited to two checks. The Kaiser-Meyer-Olkin (KMO) test measures the sampling adequacy of each variable as a proportion, and a variable is considered adequate when KMO >= 0.5. Bartlett's test of sphericity determines whether there is a significant correlation between variables (p-value < 0.05 at α = 0.05) [12], [23].
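For reference, the two checks can be written in their standard textbook form. This is a sketch of the usual definitions, not a statement of how the study's software computed them. The symbols are assumptions introduced here: R is the correlation matrix of the p variables, n is the number of samples, and a_ij is the partial correlation between variables i and j given all the others.

```latex
% Overall Kaiser-Meyer-Olkin measure of sampling adequacy.
\[
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} a_{ij}^{2}},
\qquad
a_{ij} = -\frac{q_{ij}}{\sqrt{q_{ii}\, q_{jj}}},
\quad Q = (q_{ij}) = R^{-1}
\]
% Bartlett's test of sphericity (null hypothesis: R is the identity matrix).
\[
\chi^{2} = -\left(n - 1 - \frac{2p + 5}{6}\right) \ln \lvert R \rvert,
\qquad
\mathrm{df} = \frac{p(p - 1)}{2}
\]
```

KMO approaches 1 when the partial correlations are small relative to the raw correlations, and a large chi-square value with a small p-value rejects the hypothesis that the variables are uncorrelated.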

Data Testing
The pre-processing stage includes data cleaning, which ensures that no data are missing, null, or duplicated, and normalization (standardization) of the dataset by assigning values of 0 or 1. The next step is data selection, in which the relevant (ordinal) data are selected and the dataset is divided into training and test data with a ratio of 90%:10%, or 1323 and 147 samples. Training and testing are then carried out using the selected algorithm models.
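The split described above can be sketched with the standard library alone. The function names, the random seed, and the example column are hypothetical. The scaling step is one possible reading of the normalization, namely min-max rescaling to the 0 to 1 range; the source does not say which procedure was actually used.

```python
import random


def train_test_split_indices(n_samples, test_ratio, seed):
    """Shuffle indices 0..n_samples-1 and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


def min_max_scale(column):
    """Rescale a list of numbers linearly to the [0, 1] interval (assumed reading)."""
    low, high = min(column), max(column)
    if high == low:
        return [0.0 for _ in column]
    return [(value - low) / (high - low) for value in column]


# 1470 samples with a 90%:10% ratio, as stated in the text; the seed is arbitrary.
train_idx, test_idx = train_test_split_indices(n_samples=1470, test_ratio=0.10, seed=42)
print(len(train_idx), len(test_idx))

# Hypothetical ordinal column with levels 1 to 4.
scaled = min_max_scale([1, 2, 3, 4])
print(scaled)
```

Fixing the seed keeps the split reproducible, so the four models can be compared on the same test samples.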

Figure 1 Research Methodology Proposal
The testing process is carried out on the model formed from the training data, and further testing is carried out for evaluation. The proposed research methodology for the model training and testing stages is shown in Figure 1. The evaluation of the classification model on the test data produces the performance values of the best model in predicting true or false objects, displayed in the Confusion Matrix (CM) [24], the classification report, and the ROC-AUC curve. The CM consists of four cells, namely True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN), from which the evaluation parameters are calculated using the following formulas:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]
\[ \text{Recall (Sensitivity)} = \frac{TP}{TP + FN} \tag{3} \]
\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{4} \]
Accuracy, equation (1), is the ratio of correct predictions, for both talent and non-talent, to the entire sample. It answers the question "What percentage of the samples are correctly predicted as talent or non-talent?"
Precision, equation (2), is the ratio of correct talent predictions to all samples predicted as talent. It answers the question "What percentage of the samples predicted as talent are actually talent?"
Recall or Sensitivity, equation (3), is the ratio of correct talent predictions to all samples that are actually talent. It answers the question "What percentage of the samples that are actually talent are predicted as talent?"
Specificity, equation (4), is the ratio of correct non-talent predictions to all samples that are actually non-talent. It answers the question "What percentage of the samples that are actually non-talent are predicted as non-talent?"
The CM represents the predictions and the actual conditions of the data produced by each algorithm. The performance results of the four algorithm models are displayed in a CM in which the actual positive class is talent, the actual negative class is non-talent, positive predictions are talent predictions, and negative predictions are non-talent predictions, as shown in Table 2. Accuracy is used in the evaluation process to determine the ratio of correct predictions (true positive and true negative) to the overall data. Meanwhile, the AUC summarizes the ROC curve in a single number: it measures overall how well the model fits the data, and the greater the AUC value, the better the studied variables predict the event [25].
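One way to read the AUC is as the probability that a randomly chosen talent sample receives a higher score than a randomly chosen non-talent sample. The sketch below computes it directly from that definition, counting ties as one half. The labels and scores are hypothetical, and `auc_from_scores` is an illustrative helper rather than the routine used in the study.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = talent, 0 = non-talent) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_from_scores(labels, scores)
print(round(auc, 3))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a test set of a few hundred rows; a value of 0.5 corresponds to random ranking and 1.0 to perfect separation.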

RESULTS and DISCUSSION
This research uses the Python programming language. The input data come from the IBM Analytics dataset, and the dependent and independent variables are of ordinal type, using the

Statistical Test Result
At a significance level (α) of 0.05, the statistical tests applied to the dataset, namely the Mann Whitney U test, the Wilcoxon Rank Sum test, and the Kruskal Wallis H test, give a p-value < 0.05 for all independent variables. The conclusion of the hypothesis test on the correlation between the dependent and independent variables is that there is a close correlation or relationship between the independent variables (education, environment satisfaction, job involvement, job level, job satisfaction, relationship satisfaction, work-life balance) and the dependent variable (performance rating). Thus, the hypothesis that there is a positive relationship between the independent variables and the dependent variable can be accepted.
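As a reminder of what the Kruskal Wallis H test compares against α, its statistic in standard form is given below. The symbols are assumptions introduced here: N is the total number of samples, k the number of groups, n_j the size of group j, R_j the sum of pooled ranks in group j, and t_g the size of the g-th group of tied values. The tie-corrected form is included because ordinal data contain many ties.

```latex
\[
H = \frac{12}{N(N + 1)} \sum_{j=1}^{k} \frac{R_j^{2}}{n_j} - 3(N + 1),
\qquad
H_{\text{corrected}} = \frac{H}{1 - \dfrac{\sum_{g} \left(t_g^{3} - t_g\right)}{N^{3} - N}}
\]
% Under the null hypothesis H is approximately chi-square distributed with k - 1
% degrees of freedom; the null hypothesis is rejected when the p-value is below 0.05.
```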

Hypothesis Testing
Hypothesis testing uses performance rating as the dependent (target) variable and education, environment satisfaction, job involvement, job level, job satisfaction, relationship satisfaction, and work-life balance as the independent variables (predictors). Based on the statistical tests that have been carried out, the hypotheses are:
a. H1: There is a positive relationship between education and performance rating.
b. H2: There is a positive relationship between environment satisfaction and performance rating.
c. H3: There is a positive relationship between job involvement and performance rating.
d. H4: There is a positive relationship between job level and performance rating.
e. H5: There is a positive relationship between job satisfaction and performance rating.
f. H6: There is a positive relationship between relationship satisfaction and performance rating.
g. H7: There is a positive relationship between work-life balance and performance rating.
h. H8: There is a positive and convergent relationship between the job level and education variables and the other independent variables.
The KMO and Bartlett's test results are shown in Table 3. The KMO value is 0.501, which meets the sampling adequacy threshold (>= 0.500). Bartlett's test of sphericity gives a value of 41.257 with a p-value of 0.011 < 0.05, which indicates a significant correlation between variables. Together, these results mean that the variables forming the factors are adequate and can be analyzed further.
Table 4 shows the accuracy results on the training and testing data for each model. Accuracy increases after training with the models that were formed and tested using hyperparameter tuning.
Table 5 shows that, out of 249 testing samples, the ANN model produced 117 true positive and 112 true negative samples; that is, 117 samples were correctly predicted as talent and 112 were correctly predicted as non-talent. The false negative value is 13, meaning that 13 samples predicted as non-talent do not match the actual class, and the false positive value is 7, meaning that 7 samples predicted as talent do not match the actual class. The final performance of the ANN model is a precision of 0.92 and an accuracy of 0.92 on the test results, and a ROC curve with an AUC value of 0.97 (excellent classification), as shown in Figure 2.

DT Algorithm Model Performance
Out of 249 testing samples, the DT model produced 118 true positive and 89 true negative samples; that is, 118 samples were correctly predicted as talent and 89 were correctly predicted as non-talent, as shown in Table 6. The false negative value is 12, meaning that 12 samples predicted as non-talent do not match the actual class, and the false positive value is 30, meaning that 30 samples predicted as talent do not match the actual class. The final performance of the DT model is a precision of 0.84 and an accuracy of 0.83 on the test results, and a ROC curve with an AUC value of 0.85 (good classification), as shown in Figure 3.
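The counts reported for the ANN and DT models can be checked against equations (1) to (4) with a short sketch. It assumes that the mismatched non-talent predictions (13 and 12) are false negatives and the mismatched talent predictions (7 and 30) are false positives; the function name is illustrative. The per-class precision computed here need not equal the precision values quoted in the text, which may be averages over both classes in the classification report.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and specificity from confusion matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Counts as read from the text (assumed mapping of mismatches to FN and FP).
ann = classification_metrics(tp=117, tn=112, fp=7, fn=13)
dt = classification_metrics(tp=118, tn=89, fp=30, fn=12)

for name, metrics in (("ANN", ann), ("DT", dt)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Under this reading, both confusion matrices sum to the 249 testing samples, and the rounded accuracies agree with the 0.92 and 0.83 reported for the two models.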

SVM Algorithm Model Performance
Out of 249 testing samples, the SVM model produced 124 true positive and 111 true negative samples; that is, 124 samples were correctly predicted as talent and 111 were correctly predicted as non-talent, as shown in Table 8.

Model Comparison
The evaluation of model performance shows that the SVM algorithm has the highest accuracy, 94.00%, compared with the other algorithm models. This confirms that SVM makes more accurate predictions for the classification of talent and non-talent, as shown in Table 9. SVM has a precision of 94.00% and a recall of 94.00%, which are higher than those of the other models. SVM also has a specificity of 95.38%, higher than the other algorithm models, so its false positive rate is low, at 4.62%. In other words, SVM labels a non-talented sample as talent less often than it labels a talented sample as non-talent, and it makes this false positive error less often than the other models, as shown in Table 10. SVM also has an AUC value of 0.97 (excellent classification), the same value as the ANN algorithm; however, SVM is superior in terms of specificity and has a smaller false positive rate, as shown in Figure 6 and Figure 7.
Ordinal data have characteristics that require particular handling. Machine learning algorithms are one of the tools that can turn ordinal data into information usable for decision-making. In this study, four nonparametric machine learning models, namely SVM, KNN, DT, and ANN, were compared on the dataset, and the ordinal data went through nonparametric statistical tests and a CFA validity test in formulating the hypothesis tests. The results of hypothesis testing state that there is a correlation or relationship between the dependent variable and the independent variables, and that there is a variable that mediates this relationship.
It can be concluded that the ordinal data in the dataset can be analyzed using algorithm models to classify and predict. Based on the CM and the ROC-AUC curve, the prediction model with the best accuracy for talent or non-talent classification is the SVM algorithm, which produces an accuracy of 94.00%, an AUC of 0.97, and FPR and FNR values of 4.62% and 6.72%, two low error rates that differ only slightly.
For further research, the prediction models and the analysis of talent or non-talent classification can serve as a guide and a starting point for developing methods of classifying talented or non-talented employees using ordinal data. They can also be used as tools in preparing deep learning-based application systems for talent management. Using larger datasets, or data that are updated regularly, is highly recommended. With feature engineering techniques, data characteristics can be identified easily, and adding new features to the sample dataset can improve prediction results and accuracy.
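The two error rates follow directly from the specificity and sensitivity reported for the SVM model, as the short worked calculation below shows. The formulas are the standard definitions of the false positive rate and false negative rate.

```latex
\begin{align*}
\text{FPR} &= \frac{FP}{FP + TN} = 1 - \text{Specificity} = 1 - 0.9538 = 0.0462 \\
\text{FNR} &= \frac{FN}{FN + TP} = 1 - \text{Sensitivity} = 1 - 0.9328 = 0.0672
\end{align*}
```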