Recommendation System for Thesis Topics Using Content-based Filtering

When pursuing a bachelor's degree, every student is required to complete a thesis in order to graduate from their major. During this process, however, students often have difficulty choosing a thesis topic. A recommendation system is therefore needed to classify thesis topics based on students' interests and abilities. This study developed a recommendation system for thesis topics using content-based filtering, in which students are asked to choose the courses they are interested in and to enter their grades in those courses. Once the required data has been collected, the system processes it and shows the titles and abstracts of the publications that fit the criteria. Two datasets are used in this research: lecturer publications from a 3-year period and syllabus data for the Computer Science UGM courses. The experiments show that the recommendation system has an average running time of 7.46 seconds and achieves an average of 83% across its objectives, which consist of relevance, novelty, serendipity, and increasing recommendation diversity.


INTRODUCTION
A thesis is a scientific essay that students must complete as part of the final requirements of their academic study [1]. This research focuses on undergraduate students, who must complete a thesis in order to graduate from their major.
Before starting work on the thesis, every student needs to find a topic. However, many UGM students, specifically final-year Computer Science students, are still confused about how to find a thesis topic that matches their interests. One reason is that many thesis topics are available, but they are not grouped by the subjects the students are interested in, and students find it hard to determine which subjects are needed for a given topic.
Some research has been conducted on recommendation systems for thesis topics, and the approaches vary from one researcher to another. For example, Daniati [2] used K-Means Clustering and Simple Additive Weighting to obtain thesis topics for each student, but the recommended topics are based on a predetermined lab. Another method that has been used is the Naïve Bayesian Classifier [3]. These studies recommend thesis topics only on the basis of a predetermined lab, which gives students no flexibility when they want to take topics from two or more labs.
Given the problems described above and building on the previous research, a system is needed that helps students group thesis topics by the subjects they are interested in. This research therefore built a recommendation system to help students determine their thesis topics based on their interests and grades. The scope of this research is limited to the Computer Science major at UGM.
The recommendation system is based on the courses the student is interested in and the student's grades in those courses. After collecting the required grades, the system compares the publications' abstracts with the courses' syllabi. It then displays a list of recommended publication titles based on the smallest distance between those documents. By using both the courses and their grades, we hope that the system will group topics more broadly and help students find thesis topics that match not only their interests but also their abilities.

Text preprocessing
Text preprocessing is a series of processes that clean the dataset (text) before it is processed or used in further steps. Text preprocessing is the most important part of Natural Language Processing [4]. The preprocessing flow is expected to remove unnecessary words and make the subsequent data processing more effective. Besides improving classification performance, text preprocessing can also speed up data processing. The system that has been built has 2 flows. The first flow is shown in figure 1.

Case folding & punctuation removal
A text document contains various characters, including letters, punctuation marks, numbers, and symbols. Case folding is therefore needed to make the letters uniform by converting them to lower case. In punctuation removal, every character other than a letter is removed and treated as a delimiter.
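The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `case_fold_and_strip` is hypothetical, and the sketch assumes plain ASCII letters, so accented characters would also be treated as delimiters.

```python
import re


def case_fold_and_strip(text: str) -> list[str]:
    """Lowercase the text, then split it on every run of non-letter characters."""
    lowered = text.lower()
    # Assumption: only a-z count as letters; digits, punctuation, and
    # symbols all act as delimiters between tokens.
    tokens = re.split(r"[^a-z]+", lowered)
    # re.split can leave empty strings at the edges; drop them.
    return [token for token in tokens if token]


print(case_fold_and_strip("Deep-Learning, 2nd Edition: CNNs & RNNs!"))
```

Note that digits are dropped as well, so "2nd" is reduced to the fragment "nd"; a later stop-word or minimum-length filter would be one way to discard such fragments.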

Stopword removal
Stopword removal is needed during text preprocessing. Stop words are defined as a set of words that are not related to the main subject, even though they are often used in the text [5]. Stopword removal reduces the total number of words in a document by 20-30% [4].
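As a rough sketch of this step, the filter below drops every token that appears in a stop-word set. The list `STOP_WORDS` is a tiny hypothetical sample chosen for illustration; the stop-word list actually used in this research is not stated here.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "is", "for", "and", "to"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


tokens = ["the", "design", "of", "a", "recommendation", "system", "for", "students"]
kept = remove_stop_words(tokens)
print(kept)
```

Using a set (here a `frozenset`) keeps each membership check constant-time, which matters when the filter runs over every token of every document.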

Lemmatization
Lemmatization is a text preprocessing step that determines the form of a word and changes it into its root word, that is, it finds the root of each word based on the context of the sentence [6]. The purpose of lemmatization is to optimize the text mining process. Lemmatization has the same goal as stemming: to get the root of each word. However, stemming looks for the root of a word by cutting off its prefix and suffix without paying attention to the context in which the word is used in a sentence, whereas lemmatization pays attention to the context and morphology of each word [7].
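The contrast between the two approaches can be illustrated with a toy example. Everything here is an assumption made for illustration: `naive_stem` is a crude suffix stripper, and `LEMMA_TABLE` with its part-of-speech tags is a hypothetical stand-in for the morphological dictionary a real lemmatizer would use.

```python
# Hypothetical lookup table standing in for a real morphological dictionary.
LEMMA_TABLE = {
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def naive_stem(word):
    """Strip a few common suffixes with no regard to context."""
    for suffix in ("ing", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word, pos):
    """Look up the root form using the word and its part of speech."""
    return LEMMA_TABLE.get((word, pos), word)


print(naive_stem("studies"), lemmatize("studies", "VERB"))
print(naive_stem("meeting"), lemmatize("meeting", "NOUN"))
```

The stemmer cuts "studies" down to a fragment that is not a dictionary word, and it always shortens "meeting", while the lookup keeps "meeting" intact when it is used as a noun.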

2 Recommendation system
In general, a recommendation system is one in which one or several users provide recommendations as input, which the system then collects and directs towards the best output according to the system [8]. In other cases, however, the value of a system lies in its ability to provide the best recommendations to the users who use it. The flow of the system is shown in figure 2.

1 Content-based Filtering
Content-based filtering is one of the methods commonly used to build a recommendation system. The output of this method depends heavily on the input from the user. Content-based filtering is one of the best methods for building a recommendation system with text as its dataset [9].

2 Term Frequency-Inverse Document Frequency
Term frequency is the number of times a term appears in a document [10]. To obtain the term frequency, we also need to count the total number of terms that appear in the document, as shown in equation (1), where Count(word,docs) is the number of appearances of the term word in document docs and the sum (sigma) of Count(word,docs) is the total number of terms that appear in the document. After obtaining the term frequency, we need to find the inverse document frequency of each term. Inverse document frequency is a value that indicates how often a term appears in a corpus. If a term appears often in the corpus, it is considered a less important term. The formula for the inverse document frequency [10] of a term is shown in equation (2).
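The computation described above can be sketched as follows. The term frequency follows the definition in the text: Count(word,docs) divided by the total number of terms in the document. For the inverse document frequency, the sketch assumes the common textbook form log(N / df), where N is the number of documents and df is the number of documents containing the term; the exact variant used in this research may differ. All function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Count(word, doc) divided by the total number of terms in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(corpus):
    """Assumed form: log(N / df), so terms found in many documents score lower."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vectors(corpus):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    idf = inverse_document_frequency(corpus)
    vocabulary = sorted(idf)
    vectors = []
    for tokens in corpus:
        tf = term_frequency(tokens)
        vectors.append([tf.get(term, 0.0) * idf[term] for term in vocabulary])
    return vocabulary, vectors


corpus = [
    ["neural", "network", "image"],
    ["network", "security", "protocol"],
    ["image", "processing", "image"],
]
vocabulary, vectors = tfidf_vectors(corpus)
print(vocabulary)
print(vectors[0])
```

Because every document is mapped onto the same sorted vocabulary, the resulting vectors have equal length and can be compared directly with a distance measure.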

3 Euclidean Distance
To compare the distance between 2 documents, this research uses the Euclidean distance. The data to be compared must be in vector format so that it fits the Euclidean distance equation. After obtaining the distance between each pair of documents, we need to find the weighted distance (using the course grade) with the formula in equation 5. The last step, after obtaining the DistanceWeight between each pair of documents, is to find the average distance between each abstract and the syllabi. The formula for this average distance is given in equation 6, where AverageDistance(Aj,S) is the average distance between abstract j and the syllabi, DistanceWeight(Aj,Si) is the weighted distance between abstract j and syllabus i, and Count(Si) is the number of syllabi taken by the student.
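The three steps above can be sketched as follows. The Euclidean distance and the average over Count(Si) follow the description in the text. The weighting step is an assumption: the mapping `GRADE_WEIGHT` and the choice to multiply the distance by it are hypothetical placeholders, since the weighting formula itself is not reproduced here.

```python
import math

# Hypothetical grade-to-weight mapping; the actual weighting is not given here.
GRADE_WEIGHT = {"A": 1.0, "B": 0.75, "C": 0.5}


def euclidean_distance(u, v):
    """Square root of the summed squared differences of two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def weighted_distance(abstract_vec, syllabus_vec, grade):
    """Assumed weighting: scale the distance by the weight of the course grade."""
    return GRADE_WEIGHT[grade] * euclidean_distance(abstract_vec, syllabus_vec)


def average_distance(abstract_vec, chosen_courses):
    """Sum of weighted distances divided by the number of chosen syllabi.

    chosen_courses is a list of (syllabus_vector, grade) pairs.
    """
    total = sum(
        weighted_distance(abstract_vec, syllabus_vec, grade)
        for syllabus_vec, grade in chosen_courses
    )
    return total / len(chosen_courses)


chosen_courses = [([0.0, 0.0], "A"), ([3.0, 4.0], "C")]
abstract_vec = [0.0, 0.0]
print(average_distance(abstract_vec, chosen_courses))
```

Under this assumed weighting, a course with a higher grade contributes more to the sum, so abstracts that lie close to the student's strongest courses end up with the smallest average distance.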

3 Evaluation
Two evaluations are measured in this research:
1. Running time: This evaluation checks the average time needed to run the system.
2. System objective: This evaluation checks relevance, novelty, serendipity, and increasing recommendation diversity.
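An average running time of this kind can be measured with a small timing helper. This is a sketch under assumptions: `average_running_time` is a hypothetical helper, and `dummy_recommendation` is a placeholder workload standing in for the real recommendation call.

```python
import time


def average_running_time(func, runs=5):
    """Call func several times and return the mean wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def dummy_recommendation():
    """Hypothetical stand-in for one full run of the recommendation system."""
    return sum(i * i for i in range(10_000))


average_seconds = average_running_time(dummy_recommendation, runs=3)
print(f"average running time: {average_seconds:.6f} s")
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock, which makes it better suited to measuring intervals than `time.time`.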

Dataset
Two datasets are used: the syllabi of the Computer Science UGM courses and the publications of Computer Science UGM lecturers from a 3-year period. The syllabus dataset consists of 48 courses (filtered to include only courses above the 3rd semester), and the publication dataset consists of 160 publications with their abstracts. The syllabus dataset has the columns No, Mata kuliah, and Silabus. After cleaning in the text preprocessing flow, an additional column, pp_silabus, is added. The Mata Kuliah and pp_silabus columns are used to retrieve the syllabi chosen by the students. The publication dataset has the columns No, Judul, and Abstract. After cleaning in the text preprocessing flow, an additional column, pp_abstract, is likewise added. The pp_abstract column is used to calculate the distance between each publication and the syllabi chosen by the students, and once the smallest distances are found, the Judul and Abstract columns are used to show the output of the system.
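The final selection step can be sketched as a sort over publication rows. The rows below are made-up placeholders that mirror the Judul and Abstract columns. The `distance` field is a hypothetical value assumed to have been computed beforehand from pp_abstract and the chosen syllabi, and the cut-off `k` is also an assumption.

```python
# Hypothetical rows; "distance" is assumed to be precomputed per publication.
publications = [
    {"No": 1, "Judul": "Title A", "Abstract": "Abstract A", "distance": 0.82},
    {"No": 2, "Judul": "Title B", "Abstract": "Abstract B", "distance": 0.35},
    {"No": 3, "Judul": "Title C", "Abstract": "Abstract C", "distance": 0.57},
]


def top_recommendations(rows, k=2):
    """Return (Judul, Abstract) for the k rows with the smallest distance."""
    ranked = sorted(rows, key=lambda row: row["distance"])
    return [(row["Judul"], row["Abstract"]) for row in ranked[:k]]


for title, abstract in top_recommendations(publications):
    print(title, "-", abstract)
```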

Experimental Environment
The experiment was conducted on a local computer and in a virtual environment (Google Colab). The local computer has an Intel Quad Core @2.50 GHz processor, 8 GB RAM, and an Nvidia GeForce 940 MX. The virtual environment is a Google Colab Notebook with 12.7 GB of virtual RAM.

System Display
The recommendation system is implemented as a web-based application. The front end is written in native JavaScript, and the back end in Python with the Flask framework. The first thing students see is a list of courses with checkboxes, as shown in figure 3. After collecting all the required data, the system shows the recommended publication titles and their abstracts based on the user input. The result is displayed in a modal, as shown in figure 5.

System objective result
The system objectives were evaluated by interviewing 30 randomly selected Computer Science students. Before the interview, the students were asked to test the system by choosing the courses they like and entering their grades. The interview consisted of 4 questions about the system objectives: relevance, novelty, serendipity, and increasing recommendation diversity. The results of the interview are shown in the table below.

CONCLUSIONS
As shown in the results and discussion, the system objectives were achieved with good scores: relevance 77.33%, novelty 90.67%, serendipity 80%, and increasing recommendation diversity 84%. On average, the system achieves 83% of its objectives. The system needs an average running time of 7.26 seconds, so recommendations are obtained fairly quickly.
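For reference, the reported average agrees with the unweighted mean of the four objective scores, assuming each objective is given equal weight:

```latex
\frac{77.33 + 90.67 + 80 + 84}{4} = \frac{332}{4} = 83\%
```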