Word Analysis of Indonesian Keirsey Temperament

Personality shapes how our feelings relate to our patterns of action. This behavior changes with experience, formal education, and the surrounding environment. This work is based on the Keirsey Temperament Sorter, a personality questionnaire developed by David Keirsey. The model divides personality into four categories: Idealists, Rationals, Guardians, and Artisans. It is widely used to interpret professional tendencies, can contribute to recruitment and selection, and is a promising basis for analyzing social media data. Words were selected using the Chi-Square test at a 5% significance level. The accuracy of the lexicon approach is 34%, while the best machine learning approach, the Random Forest algorithm, reaches 69.59%.

Keywords: Keirsey, Temperament, Personality, Chi-Square


Background
Personality differentiates one person from another. It describes the combination of characteristics and qualities that form an individual's character, and it shapes how our feelings relate to our patterns of action. This behavior changes through learning, experience, formal education, and the environment. Personality has several useful applications in daily life. Type of personality can be found in the application for

Temperament Model
Temperament is a configuration of observable personality traits, such as communication, actions, attitudes, values, and talents. It denotes a set of innate characteristics particular to an individual and closely connected with biological or physiological determinants. Carl Gustav Jung introduced one of the essential concepts in the 1920s. Jung explained that every person's mind works through an interaction between attitudes and functions. Attitudes reflect the direction of psychic energy and may be Extraversion (E) or Introversion (I). Functions describe how people view the world: there are two ways to receive knowledge, Sensing (S) and iNtuition (N), and two ways to make decisions, Thinking (T) and Feeling (F). Later, Isabel Briggs Myers and Katharine Cook Briggs added a new pair to Jung's typology: Judgment (J) and Perception (P) [6]. This pair determines whether a person's approach to the outside world derives from a rational (Judging) or an irrational (Perceiving) function.
The temperament model proposed by David Keirsey [5] divides personality into four categories: Idealists, Rationals, Guardians, and Artisans. The model is widely used to interpret professional tendencies, can contribute to recruitment and selection, and is a promising basis for analyzing social media data. Keirsey [5] focused his research on the connection between the Myers-Briggs taxonomy and the evaluation of personality in practice, in terms of choices, behavior patterns, reasoning, and consistency. He believed that temperament determines the individual's personality, which is inherent and arises from the interaction of temperament with the environment. Hence, the categories are driven by the aspirations and interests that motivate us to survive, behave, move, and play a part in society [5]. He stated that expectations are linked more to perception (S-N), which is completely instinctive, than to decision-making (T-F), which is entirely logical. Sensing (S) is combined with Judgment (J) or Perception (P), whereas iNtuition (N) is combined with Feeling (F) or Thinking (T). This yields four categories of personality: Guardian (SJ), Artisan (SP), Idealist (NF), and Rational (NT).
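The SJ/SP/NF/NT rule described above can be sketched as a small lookup function. This is a minimal illustration, not the authors' code: the function name `keirsey_temperament` and the plural class labels are assumptions chosen to match the class names used in this paper.

```python
def keirsey_temperament(mbti: str) -> str:
    """Map a four-letter MBTI type to a Keirsey temperament class.

    Hypothetical helper: the name and return labels are illustrative.
    """
    t = mbti.strip().upper()
    valid = (
        len(t) == 4
        and t[0] in "EI"
        and t[1] in "SN"
        and t[2] in "TF"
        and t[3] in "JP"
    )
    if not valid:
        raise ValueError(f"not a valid MBTI type: {mbti!r}")
    if t[1] == "S":
        # Sensing types are split by the last letter: SJ or SP.
        return "Guardians" if t[3] == "J" else "Artisans"
    # Intuitive types are split by the third letter: NF or NT.
    return "Idealists" if t[2] == "F" else "Rationals"


for mbti_type in ["ISTJ", "ESFP", "INFJ", "ENTP"]:
    print(mbti_type, "->", keirsey_temperament(mbti_type))
```

Each of the 16 MBTI types falls into exactly one temperament, so the mapping needs only two of the four letters.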

Previous Works
Several studies have addressed automatic personality prediction. Lukito [1] developed an Indonesian MBTI personality classifier using three approaches (machine learning based, lexicon based, and grammatical rule based) with data from 97 users, split 84.5%:15.5% between training and test data. The Naive Bayes model performed better than the others, with an accuracy of 72.5% on the Introvert-Extrovert (IE) dimension. Next, Adi [7] built a classifier for the Indonesian Big-5 personality traits from 286 data points. Twelve features were extracted: the number of tweets, retweets, replies, followers, times retweeted, hashtags, following, quotes, URLs, favorites, and mentions, plus the tweet content. Each feature was labeled 1 for high and 0 for low. The work used a Decision Tree with four scenarios combining hyperparameter tuning, feature selection, and sampling, with an 80:20 train-test ratio. Meanwhile, a temperament prediction framework was proposed by Lima [8]. Its scenarios combine several models with Linguistic Inquiry and Word Count (LIWC), the Medical Research Council (MRC) Psycholinguistic Database, and oNLP. That work covers not only temperament but also MBTI prediction.
Relatively similar work was done by Fikry [9], who classified extroverted and introverted characters using features extracted from Twitter posts. The features are the number of tweets, URLs, hashtags, retweets, likes, mentions, and follows, the active ratio, mentions without retweet, replies, words on the profile, average words per tweet, tweet characters, emoticons/emoji, and media. Training used three proportions of training to test data: 70:30, 80:20, and 90:10. The accuracy appears good, but the scope is small, with only 60 users. Ong [10] also developed a Big Five personality classification system for the Indonesian language. It uses 12 features: the number of tweets, followers, following, favorites, retweets, retweeted tweets, quote tweets, mentions, replies, hashtags, and extracted tweet URLs, and the time difference between tweets. That work compared 12 scenarios that vary word weighting, topic modeling, stop words, and n-grams. Only 329 samples were used for training and 30 for testing.
In the Big Five personality classification by Jeremy [11], four feature extraction approaches were added. The research is based on metadata such as the number of followers, following, tweets, favorites, retweets, mentions, quotes, replies, and hashtags. By comparison, the approach does not achieve significant results without the added feature extraction. For the Big Five personality, the Naive Bayes and K-NN models obtain fairly good results, and the Sequential Minimal Optimization (SMO) model is the best in the classification process. This work did not use dimensionality reduction or feature selection. Utami [2] used an open-vocabulary approach to classify personality under the Dominant, Influence, Steadiness, and Compliant (DISC) model. An interesting part of that analysis is its use of synonyms for every word: the first synonym is weighted 0.85 and the second 0.35.

Based on the limitations in Table 1, this work uses several scenarios to classify personality under the Keirsey framework with machine learning models such as Logistic Regression, Naïve Bayes, KNN, and SVM, together with the SMOTE balancing method and Chi-Square feature selection. The research focuses on the words in each dimension of the temperament. It covers three points: (1) exploring the words of each temperament dimension, (2) the relationship between dimensions based on words, and (3) classification based on these words.
In summary, the contribution of this work is to use processed text data to explore and classify user personality under the Keirsey Temperament framework with two approaches: lexicon based and machine learning based. We applied different preprocessing techniques and combined feature extraction with Categorical Proportional Difference (CPD). This work is organized as follows. Section 1 discusses the background, Keirsey Temperament concepts, and recent research on automated personality prediction. Section 2 describes the methodology for exploration and classification. Section 3 presents and analyzes performance. Section 4 concludes this work and outlines future research.

METHODS
This part introduces the data, the preprocessing that turns the data into a lexicon, and the rules that assign each word to one of the classes: Idealists, Rationals, Guardians, or Artisans. The details are as follows:

Data
The data used in this work is Twitter personality data by Iskandar [12]. The data consists of two columns: the text and its MBTI label. The distribution of MBTI types in the data is shown in Figure 1.

Figure 1 Type MBTI Data

The MBTI personality types are grouped into four classes based on temperament roles: Idealists, Rationals, Guardians, and Artisans. The rules for mapping MBTI types to temperament roles are shown in Table 2. A summary of users by temperament role is shown in Table 3.

Table 3 Comparison Temperaments

No  Temperaments  Number of users
1   Idealists     172
2   Rationals     55
3   Guardians     47
4   Artisans      38

Table 3 shows that the temperament classes are not balanced. The data is dominated by Idealists, with 172 users, while each of the other classes has at most about one third of that number. Data balancing is therefore necessary to offset the dominance of the Idealists class.
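This paper names SMOTE as its balancing method. The core idea of SMOTE, creating a synthetic minority sample on the line segment between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: the function name `smote_sample`, the plain-list feature vectors, and the default `k` are all hypothetical.

```python
import random


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic point from minority-class feature vectors.

    Simplified sketch of the SMOTE interpolation step; expects at least
    two minority points. The name and signature are illustrative.
    """
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to base.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    # Random position along the segment from base to neighbour.
    gap = rng.random()
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy feature vectors for a minority class (made-up numbers).
artisans = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(smote_sample(artisans, k=2, rng=random.Random(0)))
```

Repeating this step until each smaller class matches the size of the largest class gives a balanced training set without duplicating existing users verbatim.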

Preprocessing
After the data was collected, the behavioral information was extracted from each user account, while the grammatical information was obtained from each user label. Each user is represented by this behavioral and grammatical information. As in most natural language processing research, the data must then be preprocessed.
The preprocessing steps are case folding, stop word removal, removal of non-alphabetic characters, stemming, word normalization, and translation into Indonesian.
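A few of the preprocessing steps listed above can be sketched in plain Python. The sketch covers case folding, character filtering, word normalization, and stop word removal only. Stemming and translation are left out because they depend on external Indonesian language resources. The stop word list and normalization table are tiny hypothetical examples, and reading the character-filtering step as "keep letters only" is an assumption.

```python
import re

# Hypothetical, tiny stop word list for illustration only.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}

# Hypothetical slang-to-standard normalization table.
NORMALIZATION = {"gk": "tidak", "yg": "yang", "aja": "saja"}


def preprocess(text, stop_words=STOP_WORDS, normalization=NORMALIZATION):
    """Return a list of cleaned tokens for one tweet."""
    text = text.lower()                      # case folding
    text = re.sub(r"[^a-z\s]", " ", text)    # keep letters and whitespace only
    tokens = text.split()
    tokens = [normalization.get(tok, tok) for tok in tokens]  # normalize words
    return [tok for tok in tokens if tok not in stop_words]   # drop stop words


print(preprocess("Aku gk suka yg ini!!! 123"))
```

Normalization runs before stop word removal so that slang forms of stop words (here `yg`) are also removed.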

TF-IDF and CPD
Feature extraction in this work uses Term Frequency (TF) and TF-IDF. TF is the number of times a word appears within a document. Inverse Document Frequency (IDF) measures the importance of a term t from the number of documents in the whole collection that contain t [14]. Categorical Proportional Difference (CPD) is a simple feature selection method for multiclass classification problems. It estimates how much a word contributes to separating a specific class from the other classes in a text corpus. CPD is defined in equation (1):

$$\mathrm{CPD}(w, c) = \frac{A_{w,c} - B_{w,c}}{A_{w,c} + B_{w,c}} \tag{1}$$

where $A_{w,c}$ is the number of documents of class $c$ that contain word $w$ (positive documents) and $B_{w,c}$ is the number of documents of the other classes that contain $w$ (negative documents). CPD thus counts the positive and negative documents of a single term and computes the relative difference between the two [15].
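Both weighting schemes can be illustrated with a short sketch. It assumes the plain tf × log(N/df) variant of TF-IDF (libraries such as scikit-learn default to a smoothed form, so their numbers differ) and the commonly cited form of CPD, (A − B) / (A + B), where A counts documents of the target class containing the term and B counts documents of all other classes containing it. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def cpd(docs, labels, term, target):
    """Categorical Proportional Difference of `term` for class `target`."""
    a = sum(1 for d, y in zip(docs, labels) if term in d and y == target)
    b = sum(1 for d, y in zip(docs, labels) if term in d and y != target)
    return (a - b) / (a + b) if a + b else 0.0


# Toy corpus (made-up tokens and labels).
docs = [["suka", "musik"], ["suka", "logika"], ["musik", "seni"]]
labels = ["Idealists", "Rationals", "Idealists"]
print(cpd(docs, labels, "musik", "Idealists"))
print(tf_idf(docs)[2])
```

CPD ranges from -1 to 1. A value of 1 means the term appears only in documents of the target class, and a value near 0 means the term is spread evenly and carries little class information.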

Analysis
Words were selected using the Chi-Square test at a 5% significance level. A lower error level keeps only words with a stronger association with the class label. This work analyzes the number of words for each user and the number of unique words before and after each word, so that each word can be assigned to a class label. As in other research on automated personality prediction, the classification models are evaluated by Accuracy, Precision, Recall, and F1-Score. More details about the evaluation measures are shown in Table 4.

Table 4

1. Accuracy = (TP + TN) / (TP + TN + FP + FN). Accuracy is used to evaluate the number of predicted labels that correspond to the actual labels.
2. Precision = TP / (TP + FP). Precision is the level of accuracy between the information requested by the user and the answer given by the system.
3. Recall = TP / (TP + FN). Recall is the success rate of the system in rediscovering information.
4. F1-Score = 2 × Precision × Recall / (Precision + Recall). The F1-Score is the harmonic mean of Precision and Recall.

Source: Willy [16]

Here TP is true positive, TN is true negative, FP is false positive, and FN is false negative.
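The Chi-Square word selection described in this section can be sketched for a single word as follows. The sketch assumes a 2 × 4 contingency table per word (users who use the word and users who do not, across the four temperament classes), which gives 3 degrees of freedom and a critical value of 7.815 at the 5% level. The table layout, the function names, and the counts are hypothetical and not taken from the data in this work.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Critical value of the chi-square distribution for alpha = 0.05 and
# 3 degrees of freedom: (2 rows - 1) * (4 classes - 1).
CRITICAL_005_DF3 = 7.815


def is_selected(table, critical=CRITICAL_005_DF3):
    """Keep the word if its statistic exceeds the critical value."""
    return chi_square(table) > critical


# Rows: users with the word, users without it.
# Columns: Idealists, Rationals, Guardians, Artisans (made-up counts).
word_table = [[30, 5, 5, 5], [10, 35, 35, 35]]
print(chi_square(word_table), is_selected(word_table))
```

A word used at the same rate in every class has a statistic of zero and is dropped, while a word concentrated in one class passes the threshold.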

Word Exploration
Preprocessing eliminates words in order to reduce noise in the data. After preprocessing, 310 words remain. Each word is then analyzed by three features: the number of times it occurs, the number of words before and after it, and the number of users who used it. This analysis aims to capture the context of each word and to group the words into the Idealists, Rationals, Artisans, or Guardians classes. These three features are used to analyze the correlation of each word with each class. The Pearson correlation results for these words are shown in Figure 2.

Based on the calculated weights for each class and the researchers' judgment of these words, 116 words are assigned to Idealists, 47 to Rationals, 59 to Guardians, and 88 to Artisans. To explore further, this work also maps the words onto a scatter plot of two variables generated using PCA, shown in Figure 3.

Figure 3 Word Temperament

The scatter plot in Figure 3 uses four colors: blue for words in the Idealists class, red for the Rationals class, green for the Artisans class, and purple for the Guardians class. The words are then grouped into keywords that describe them in general. The keywords generated for each class are shown in Table 5.

Classification
The categorized words are then evaluated through classification. Classification is done using two approaches: lexicon based and machine learning based.

Lexicon
This approach classifies text using the words that were filtered with Chi-Square. For each class, the occurrences of its words are counted and divided by the total number of words, which gives the percentage of Idealists, Rationals, Guardians, and Artisans words. The sentence is assigned to the class with the highest percentage.
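The lexicon rule above can be sketched as follows. The lexicon entries are hypothetical examples, not the 310 words selected in this work, and the tie-breaking rule (first class in list order) and the `None` result for text without lexicon words are assumptions.

```python
from collections import Counter

CLASSES = ["Idealists", "Rationals", "Guardians", "Artisans"]

# Hypothetical lexicon mapping a word to its temperament class.
LEXICON = {
    "cinta": "Idealists",
    "mimpi": "Idealists",
    "logika": "Rationals",
    "aturan": "Guardians",
    "seni": "Artisans",
}


def lexicon_scores(tokens, lexicon=LEXICON):
    """Share of all tokens that belong to each class."""
    counts = Counter(lexicon[tok] for tok in tokens if tok in lexicon)
    total = len(tokens)
    return {c: (counts[c] / total if total else 0.0) for c in CLASSES}


def classify(tokens, lexicon=LEXICON):
    """Return the class with the highest share, or None if no word matches."""
    scores = lexicon_scores(tokens, lexicon)
    best = max(CLASSES, key=lambda c: scores[c])  # ties go to the first class
    return best if scores[best] > 0 else None


tokens = ["cinta", "mimpi", "logika", "makan"]
print(lexicon_scores(tokens))
print(classify(tokens))
```

Because the rule only counts words, its result depends heavily on lexicon coverage, which is one plausible reason for the gap between the lexicon and machine learning results reported in this work.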

Machine Learning
The machine learning approach uses the words that have been cleaned of noise. In this part, we classify with three scenarios based on feature extraction. The models used are Multinomial Naïve Bayes, Random Forest, Logistic Regression, and SVM.
Classification was done on 198 users, divided into 160 users for training and 38 for testing. The accuracy of the lexicon approach is 34%; details are shown in Table 6. Table 6 shows an average precision of 34.25%, an average recall of 44.75%, and an F1-score of 37%. The best accuracy among the machine learning models is 69.59%, obtained with the Random Forest model. Its precision, recall, and F1-score are detailed in Table 7. Based on Table 7, the best machine learning approach is Random Forest, with a precision of 75.72%, a recall of 69.88%, and an F1-score of 69.98%. This best performance was obtained using TF-IDF features with CPD and the SMOTE balancing method.
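Averaged precision, recall, and F1-score such as those reported above can be computed from true and predicted labels as sketched below. The sketch assumes macro averaging (an unweighted mean over classes), which the paper does not state explicitly, and the function names and toy labels are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} using one-vs-rest counts."""
    metrics = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[c] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of precision, recall, and F1 over all classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Toy labels (made up), abbreviated class names.
y_true = ["I", "I", "R", "G"]
y_pred = ["I", "R", "R", "G"]
scores = per_class_metrics(y_true, y_pred, ["I", "R", "G"])
print(accuracy(y_true, y_pred), macro_average(scores))
```

With imbalanced classes such as those in Table 3, macro averages weight every temperament equally, so they can differ noticeably from accuracy.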

Ethics and Privacy
This study focuses only on analyzing words in social media based on the Keirsey temperament model. It therefore covers general topics and does not focus on users' private information. Kosinski et al. [17] explain that social media research may use publicly available user information without consent, provided that the data can be assumed to have been made public intentionally, that the data is anonymized after collection with no attempt to identify the users, and that there is no interaction or communication with the individuals in the sample.
From data collection through exploration to classification, the research remained focused on the ethics of social media research and on protecting the privacy of the Twitter users whose tweets were collected. Although Twitter data is publicly accessible, the researchers also protected these users by replacing their names with sample codes, so that the researchers could not judge individual users. This was done so that the focus stays on the words used, not on the Twitter users themselves [2].

CONCLUSIONS
This research was done to understand the behavior of users on social media through the words they use. We carried out an exploratory study aimed at understanding the potential of machine learning techniques for Keirsey Temperament prediction. We used data covering the 16 types of the Myers-Briggs typology and mapped them onto the Keirsey temperament model. This is based on the lexical hypothesis, which holds that most individual differences are encoded in language. The accuracy of the lexicon approach is 34%, while the best-performing machine learning approach, the Random Forest algorithm, reaches 69.59%.
An understanding of the Keirsey temperament framework can be used in various fields, such as professional guidance, leadership training, pedagogical approaches, group dynamics, sales training and customer service, audience profiling, self-understanding, educational aptitude and professional achievement, conflict resolution and stress management, and understanding decision making, among others. We would like to extend this research to new datasets from Twitter and other social media, test hypotheses about each user temperament, and use further feature extraction and deep learning models to obtain better results.