Social-Child-Case Document Clustering based on Topic Modeling using Latent Dirichlet Allocation

Children are the future of the nation, and the treatment and education they receive shape that future. Today, children face various kinds of social problems. To find the right solution to a problem, social workers usually consult social-child-case (SCC) documents to find similar past cases and adapt their solutions. However, reading a large number of documents to find similar cases is tedious and time-consuming. Hence, this work aims to group those documents according to case type. We use topic modeling with Latent Dirichlet Allocation (LDA) to extract topics from the documents and group them based on their similarities. The coherence score and perplexity graphs are used to determine the best model. The result is a model with 5 topics that match the targeted case types. It supports the reuse of knowledge about SCC handling by making documents with similar cases easier to find.
Keywords: clustering, text document, topic modeling, Latent Dirichlet Allocation, social-child cases
IJCCS Vol. 14, No. 2, April 2020: 179-188. ISSN (print): 1978-1520, ISSN (online): 2460-7258


INTRODUCTION
One of the most common categories of social problems in Indonesia concerns children. Kementerian Sosial Republik Indonesia (Kemensos RI, the Ministry of Social Affairs) groups children's social problems into several categories: abandoned babies, abandoned children, children in conflict with the law, children with disabilities, street children, abused children, and children in need of special protection [1]. Because children are the future of the nation, every problem related to children must be handled carefully. Based on Undang-undang Nomor 14 Tahun 2019 on Pekerja Sosial (Social Workers) and Peraturan Pemerintah Nomor 44 Tahun 2017 on Pelaksanaan Pengasuhan Anak (Implementation of Child Care), children's social cases are handled by government and private institutions through social workers.
When handling cases, social workers usually have to document them [2] in a particular format called the Social-Child-Case (SCC) document. In these documents, social workers record at least a description of the case and the actions that have been or will be taken to handle it. Because they hold information on how past cases were handled, these documents are valuable references for social workers deciding on a solution for the case they are currently responsible for. The assumption is that similar cases call for the same problem-solving approach. Therefore, social workers read the available SCC documents to find similar cases and determine a solution. The problem is that there are too many documents and they are not classified, so it is difficult for social workers to find documents describing a similar problem.
Thus, this research develops a computational approach to automatically categorize the SCC documents based on the similarity of their case descriptions. Since our dataset has no predefined labels or categories, we treat this work as a text/document clustering problem. Recent text clustering techniques are dominated by two major approaches: those based on artificial neural networks (ANN) and those based on ontologies. Wan et al. [3] used an ANN to construct word vectors from a large number of documents and applied the k-means algorithm to cluster the constructed vectors. Another ANN-based experiment was conducted by Saini et al. [4], who formulated a Self Organizing Map (SOM) to generate various genetic operations for reaching the best clusters over the iterations of the algorithm. Another approach to document clustering employs a pre-constructed ontology network, as done by Rupashinga & Park [5], Kang et al. [6], and Sandhiya & Sudarambal [7]. With ANNs and ontologies, document clustering reaches promising accuracy. Nevertheless, an ANN needs a large amount of collected data to perform effectively, and constructing an ontology network is tedious work. Since we have only a small amount of data and no pre-constructed ontology, we develop a simpler keyword-based approach. We assume that a similar set of keywords reflects a similar case description in the SCC documents. Based on this idea, we look for documents with similar case descriptions by extracting keywords from each document in the SCC corpus. One of the most popular computational approaches for extracting keywords from documents is topic modeling. Topic modeling works statistically by exploring documents and representing them as collections of frequently co-occurring terms [8].
Topic modeling has been widely used to provide a useful view of text collections in various domains, for example in journalism [9], information science [10] [11], and academia [12]. Meanwhile, some researchers [13]-[15] have used topic modeling for text clustering. According to [8] and [16], Latent Dirichlet Allocation (LDA) is the most popular topic modeling technique, and it is also the one used in [9]-[15]. Therefore, we use LDA topic modeling to create a topic model from a corpus of SCC documents. The model consists of topics, each with its keywords, and each topic is assumed to be a cluster. With the documents clustered, social workers can more easily find available documents describing a case similar to the current one.

Dataset
The dataset is a collection of SCC documents belonging to the Dinas Sosial DIY, written in Indonesian and available in digital form (doc). The documents were chosen manually, with the requirement that they contain the same features, which are treated as the attributes of a case. Two features are found in all documents: problem description and recommendation. Both are free descriptive text. As a result, 167 documents were selected, with text lengths ranging from 50 to 900 words.

Experimental Design
The experimental design is divided into two main parts. The first part is data preprocessing, and the second part is topic modeling of the preprocessed data.

Data Preprocessing
According to [17] and [18], preprocessing is an essential step in text processing because it gives the data a standard and consistent form, which affects the whole experimental result. As shown in Figure 1, we use a series of preprocessing steps consisting of 1) number and punctuation removal; 2) case-folding (turning all text into lowercase); 3) term identification; 4) stemming; 5) stopword removal; and 6) frequent n-gram identification. Among these, there are default steps (data cleaning and case-folding), an additional step (term identification), and optional steps (stemming, stopword removal, and frequent n-gram identification).
Term identification handles the many words in the text that have the same meaning but are written differently. We treat them as SCC terms with inconsistent writing, as shown in Table 1. The inconsistencies result from differences in abbreviations, word choice, letters, and spacing. Inconsistent writing can lead to errors in recognizing a term or word, which affects the whole experiment. Therefore, we add this step to make sure that the SCC terms are recognized consistently in the following steps. The next significant steps are stemming and stopword removal. In [19] and [20], Schofield found that in some instances stemming and stopword removal do not affect the topic model. Since they can even reduce the model's stability [19], these steps require some consideration. In this study, we experiment with both steps because the text addresses a fairly specific domain. For example, without stemming, some words with the same context (e.g. "pengasuhan", "diasuh", "mengasuh" and "asuhan") are recognized as different tokens. And without additional stopwords, some words that are general terms in this context (e.g. "klien", "kondisi" and "mengalami") appear frequently in many documents even though they carry no distinguishing meaning.
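The first five preprocessing steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the variant map is a hypothetical stand-in for Table 1, the lookup-based stemmer only covers the example words mentioned above and stands in for a real Indonesian stemmer, and the stopword set mixes the domain examples from the text with a few common Indonesian function words.

```python
import re

# Hypothetical variant map standing in for Table 1 (not the authors' list).
TERM_VARIANTS = {
    "sat pol pp": "satpol_pp",
    "satpol pp": "satpol_pp",
}

# Toy lookup stemmer built from the example words in the text;
# a real Indonesian stemmer would replace this dictionary.
STEMS = {"pengasuhan": "asuh", "diasuh": "asuh", "mengasuh": "asuh", "asuhan": "asuh"}

# Domain stopwords from the text plus a few common function words (assumed).
STOPWORDS = {"klien", "kondisi", "mengalami", "dan", "di", "yang"}


def preprocess(text):
    """Return the list of tokens left after steps 1 to 5."""
    text = re.sub(r"[^A-Za-z\s]", " ", text)       # 1) remove numbers and punctuation
    text = text.lower()                            # 2) case-folding
    text = re.sub(r"\s+", " ", text).strip()       # collapse repeated whitespace
    for variant, canonical in TERM_VARIANTS.items():
        # 3) term identification: map inconsistent spellings to one token
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    tokens = [STEMS.get(tok, tok) for tok in text.split()]    # 4) stemming
    return [tok for tok in tokens if tok not in STOPWORDS]    # 5) stopword removal
```

Calling `preprocess("Klien diasuh oleh Sat Pol PP, 2 kali.")` with these toy resources yields `["asuh", "oleh", "satpol_pp", "kali"]`. Punctuation is replaced by a space rather than deleted so that spellings such as "Satpol-PP" still reach the variant map as two words.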
Frequent n-gram identification is based on the assumption that a series of words that frequently appear together in the documents has a specific meaning and can serve as a text feature, as done in [21]. As shown in Table 2, the results written in italics are named entities. We obtained them by re-running this step experimentally while changing the threshold and min_count values. The higher the threshold, the fewer n-grams are produced. min_count is the minimum number of times a sequence of n words must occur. In this experiment, we use bigrams and trigrams, both with threshold = 75 and min_count = 3.
Table 2. N-gram
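The roles of threshold and min_count can be illustrated with a small pure-Python sketch. It assumes that the step follows the collocation score popularized by word2vec and used by default in Gensim's `Phrases` class, namely (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)). The paper does not state the scorer explicitly, so this is an assumption. The function names are hypothetical, and the vocabulary size is counted over single words only, which is a simplification.

```python
from collections import Counter


def find_bigrams(docs, min_count=3, threshold=75.0):
    """Return {(a, b): score} for adjacent word pairs scoring above threshold."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    kept = {}
    for (a, b), n_ab in bigrams.items():
        if n_ab < min_count:
            continue  # pair is too rare to be considered at all
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            kept[(a, b)] = score
    return kept


def merge_bigrams(tokens, kept):
    """Join each accepted pair into a single token such as 'lingkung_sosial'."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in kept:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Raising `threshold` shrinks the set of accepted pairs, which matches the behaviour described above. Trigrams can be obtained by running the same two functions again on the merged token lists. On a toy corpus the scores are far below 75, so a much smaller threshold is needed to see any pair accepted.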

LDA Topic Modeling
The final output of preprocessing is a corpus consisting of n-gram tokens. From this corpus, the topic model is built using the LDA algorithm. As a generative probabilistic model of a corpus, LDA assumes that each document is represented as a probability distribution over latent topics, and each topic is characterized by a distribution over words [22]. The words with the highest probability in each topic are usually used to determine what the topic is. In Figure 2, M represents the number of documents, while N represents the number of words in a document. The first level consists of the corpus-level parameters (α and β), which are assumed to be sampled once in the process of generating the corpus. The second level is the document-level variable (θ), which is sampled once per document. Finally, the word-level variables (z and w) are sampled once for each word in each document.

Figure 2. Graphical model representation of LDA
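The three levels of the generative process can be made concrete with a short sketch that generates one synthetic document. It is illustrative only: the function names, the toy vocabulary, and the fixed topic-word matrix `beta` are assumptions, and the Dirichlet draw is implemented by normalizing gamma samples so that only the standard library is needed.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document of n_words words under the LDA generative process."""
    theta = sample_dirichlet(alpha, rng)    # document level: sampled once per document
    topics, words = [], []
    for _ in range(n_words):                # word level: sampled once per word
        z = rng.choices(range(len(alpha)), weights=theta)[0]   # topic assignment z
        w = rng.choices(vocab, weights=beta[z])[0]             # word w drawn from topic z
        topics.append(z)
        words.append(w)
    return theta, topics, words
```

For example, with `alpha = [0.5, 0.5]`, a three-word vocabulary, and `beta = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]`, each generated word comes from whichever topic was drawn for its position. Fitting LDA reverses this process: it infers θ and the topic-word distributions from the observed words.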
We first built a topic model from the corpus and repeated this process several times, as was done in [23]. To generate a model, the LDA algorithm requires an input parameter (n) that sets the number of topics. Because the number of topics in the SCC documents was not known in advance, we chose the range of n based on the assumptions of experts (social workers). Based on these assumptions, we experimented with n = 2 to n = 10.
The best model (number of topics n) was determined with two measurements: the perplexity value [24] and the coherence score [25]. Perplexity captures the level of uncertainty in a model's predictions, with lower values indicating a better fit. In contrast, the coherence score indicates the level of semantic similarity between the words of a topic. The formulas for perplexity and the coherence score are shown in equations (I) [24] and (II) [26].
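In the notation commonly used for these measures, and assuming the standard forms from the literature, the equations can be written as follows. The symbols D_test (a held-out set of M documents), w_d and N_d (the words and the length of document d), and D(·) (the number of documents containing the given word or word pair) are assumptions of this sketch.

```latex
\begin{align}
\mathrm{perplexity}(D_{test}) &= \exp\left\{ -\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\} \tag{I} \\
\mathrm{coherence}(V) &= \sum_{(v_i, v_j) \in V} \mathrm{score}(v_i, v_j, \epsilon) \tag{II} \\
\mathrm{score}_{UMass}(v_i, v_j, \epsilon) &= \log \frac{D(v_i, v_j) + \epsilon}{D(v_j)} \tag{III}
\end{align}
```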
where V is the set of words describing the topic and ε is a smoothing factor that guarantees the score returns real numbers. Since this experiment uses no external corpus, the coherence score is calculated with the UMass metric, equation (III) [26].
Coherence score and perplexity are used to evaluate the proposed model. The coherence score peaked at n = 6 topics, while the perplexity reached its lowest value at n = 9. Therefore, to determine the best number of topics, we used a trade-off (intersection) between the two. The results of the LDA topic modeling experiment with n = 2 to n = 10 are shown in Figure 3. The figure shows that the perplexity and coherence score graphs intersect at a number of topics close to 5. Thus, the number of topics used in the next step is n = 5.
With n = 5, five document clusters were formed. Figure 4 visualizes the 5 clusters and shows the distance between them in two-dimensional space. In the Gensim library, the distance between clusters is visualized with a multidimensional scaling technique. Figure 4 shows that 3 of the 5 clusters are entirely separated without overlapping, while the other 2 clusters intersect slightly. The size of each cluster illustrates the number of documents it contains: the larger the cluster, the more documents it contains.
The words representing a cluster are the keywords of its documents, and this group of keywords is called the topic. Based on observation of the selected keywords of the 5 topics, and by referring to the documents in each cluster, we describe the dominant topic of each cluster in general terms. The resulting interpretation of the cluster labels (clusters 1-5) is summarized in Table 3.
In Table 3, the Dominant Keywords of each cluster are the 15 words with the highest coherence score. Some Dominant Keywords are also keywords in other clusters, while Unique Keywords are words that appear in only one cluster. These keywords are used as lexical identifiers. Although they do not directly provide semantic meaning, these lexical features give clues about the dominant topics of the documents in a cluster.
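The relation between the two keyword columns can be sketched in a few lines of Python: a unique keyword is a dominant keyword that occurs in exactly one cluster. The function name is hypothetical, and any keyword lists passed to it are the caller's own. The short lists in the usage note are abridged from the words discussed in the text and are not the full contents of Table 3.

```python
from collections import Counter


def unique_keywords(dominant):
    """Map each cluster to its dominant keywords that appear in no other cluster."""
    # Count in how many clusters each word occurs (set() guards against repeats).
    counts = Counter(word for words in dominant.values() for word in set(words))
    return {
        cluster: [word for word in words if counts[word] == 1]
        for cluster, words in dominant.items()
    }
```

For instance, given `{"BrH": ["tinggal", "rumah", "pergi", "ijin", "cari"], "AJ": ["sekolah", "uang", "amen", "pergi", "tinggal", "rumah"]}`, the shared words "tinggal", "rumah" and "pergi" are dropped from both clusters, leaving "ijin" and "cari" for BrH and "sekolah", "uang" and "amen" for AJ.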
The first cluster discusses families experiencing broken homes (BrH), which cause separation and abandonment. This is indicated by the words "tinggal", "rumah" and "pergi", which can be interpreted as the child leaving the house or family, or a family member taking the child away, with or without permission ("ijin"). The words "jalan" and "kerja" describe the consequences of leaving home, i.e., losing a home and having to earn a living to survive. The words "cari", "informasi", "temu" and "komunikasi" further indicate that in some cases there are efforts to reunite with the family.
The second cluster contains documents that share the case of unwanted births (KtD). This is shown by the words "bayi", "lahir", "hamil", "sah", "biologis" and "hubung". The word "biologis" indicates that the baby's biological father and mother are not bound in a legal marriage. The mother's pregnancy and the baby's presence are serious problems for her whole family. Some cases occur in economically weak families, which find it even more difficult to accept the baby. Therefore, in some cases, the babies are entrusted ("titip") or handed over ("serah") to children's social welfare institutions, especially those with services for abandoned babies and toddlers.
In the third cluster, the highlighted problem is children who are caught carrying out activities on the streets (AJ). Unlike the first cluster, the discussion in this cluster revolves around children who drop out ("putus") of school ("sekolah"), hang around, and do activities to earn money ("uang") on the streets ("jalan"), such as begging ("amen"). Most of them are then caught by officers ("satpol_pp") and end up in a shelter called "camp". Other findings in this cluster are the words "pergi", "tinggal" and "rumah", which are also keywords in the first cluster. This happens because some documents describe a broken home as the cause of the children's activities on the streets. The relationship between the AJ and BrH problems also appears in Figure 4, where the two clusters intersect slightly.
For the fourth cluster, the most obvious keywords are "bprsr", "bapas", "curi" and "polisi", which are generally considered characteristic of the children in conflict with the law problem. A case starts with a child who violates the law (for example "curi") and is then acted on ("tindak") by law enforcement ("polisi"). Following judicial decisions, the children receive rehabilitation in "bprsr" and/or "bapas". In several documents, the discussion even extends to the children's condition after completing rehabilitation and returning to the community, for example the process of preparing the social environment ("lingkung_sosial") so that the 'post-rehabilitation' children are not rejected but well accepted ("terima") by society ("warga").
Next, the clearest keywords of the fifth cluster are "bks" and "psbk", which are institutions providing rehabilitation services for the homeless, beggars, and psychotics (people with mental disorders). This type of case relates to parents with psychiatric disorders and/or a tendency toward homelessness or begging. These conditions lead to their children receiving improper care, suffering from growth disorders, or even developing deviant behavior. Some documents describe deviant behavior ranging from individual behavior in daily activities, e.g., urination ("bak") and defecation ("bab") habits, to social behavior in interactions with others, e.g., communication skills and a tendency toward conflict. The word "rujuk" represents a child or family who received rehabilitation from one institution but was then referred to another because of certain conditions. There are also documents mentioning destructive behavior ("rusak"), so that security measures ("tahan") are required.
Meanwhile, words such as "keluarga", "orangtua", "ayah" and "asuh" appear in several clusters, indicating that some documents contain information about the children's family background. According to [27], family engagement is an influential factor in the success of children's social welfare practices, so information about family conditions is very helpful. Besides providing clues about the causes of the children's problems, this information also helps in choosing the best intervention to solve them.
For comparison, an expert manually classified the same 167 SCC documents into the 5 cluster labels. The resulting proportion of documents under each of the 5 labels is illustrated in Figure 5. There is a clear difference of more than 10% in proportion for the KtD label, while for the other labels the difference is between 1% and 6%. Further observation of the documents in the KtD cluster found some that were not supposed to be in it (non-KtD). Judging from its keywords, one document fell into the KtD cluster because it mentioned the origin of the child and therefore had keywords such as "biologis" and "hamil". But because the child's origin was not the focus of the problem, the expert did not include that document in the KtD group. Other non-KtD documents might have been detected as KtD because they contain keywords such as "asuh", "orangtua", "keluarga", "bayi" or "ayah" appearing together, even though they also contain keywords such as "curi", "hukum", "aktivitas", "jalan", "pergi", "tinggal" and "rumah", which clearly characterize other clusters. This is somewhat puzzling, considering that the other clusters differ by less than 6% in proportion despite relatively similar findings.
The comparison shows that the model still has weaknesses in predicting the type of SCC. Possible factors affecting the prediction results are imperfectly cleaned preprocessing output, or the existence of other features that could serve as SCC identifiers besides frequently co-occurring keywords, for example keywords that experts assign to certain types of SCC. Therefore, to obtain better SCC similarity, the preprocessing can be improved and other candidate features from the resulting keywords explored further in the future.
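The per-label comparison of proportions can be expressed as a small helper. This is only a sketch of the arithmetic: the function name is hypothetical, and the label lists in the usage note are made-up toy values, not the paper's data.

```python
from collections import Counter


def proportion_gap(model_labels, expert_labels):
    """Per-label gap, in percentage points, between two labelings of the same documents."""
    if len(model_labels) != len(expert_labels):
        raise ValueError("both labelings must cover the same documents")
    n = len(model_labels)
    model, expert = Counter(model_labels), Counter(expert_labels)
    labels = sorted(set(model) | set(expert))
    # Positive values mean the model assigns more documents to the label than the expert.
    return {label: 100.0 * (model[label] - expert[label]) / n for label in labels}
```

With ten toy documents labeled `["KtD"] * 4 + ["BrH"] * 3 + ["AJ"] * 3` by the model and `["KtD"] * 2 + ["BrH"] * 4 + ["AJ"] * 4` by the expert, the gap is +20 points for KtD and -10 points for each of the other two labels. A proportion gap only compares group sizes. It does not show which individual documents were assigned differently, which is why the document-level inspection described above is still needed.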

CONCLUSIONS
From the experiments, LDA topic modeling gives promising results in clustering SCC documents according to topic similarity. The number of clusters is chosen using the graphs of coherence score and perplexity: the best clusters are found where the two plots intersect, which occurs as the number of topics approaches 5. The five clusters can be interpreted and labeled according to the targeted case types. This supports the reuse of knowledge about SCC handling by making it easier to find documents with the same case description. However, compared with the manual classification, there is still a large difference in one cluster. This difference could be caused by imperfectly cleaned preprocessing output, or by other features that could serve as SCC identifiers besides frequently co-occurring keywords. Therefore, to obtain better SCC similarity, the preprocessing can be improved and other candidate features from the resulting keywords explored further in the future.