Parallelization of Hybrid Content Based and Collaborative Filtering Method in Recommendation System With Apache Spark

hybrid collaborative filtering method


INTRODUCTION
In everyday life we are often faced with a large selection of items about which we have little knowledge. In such cases, a recommendation system provides recommendations on which items to select. These recommendations are expected to help users decide which items to choose, such as what to buy, which books to read, what music to listen to, or which films to watch [1]. Collaborative filtering is one of the popular algorithms used to build recommendation systems. It generates recommendations based on the ratings or behavior of other users in the system. As a widely adopted method in recommendation systems, collaborative filtering is divided into four approaches: user-based, item-based, model-based, and fusion-based. In practice, collaborative filtering is also divided into three types, namely memory-based, model-based, and hybrid collaborative filtering. A memory-based collaborative filtering algorithm works by using user ratings to find similar preferences between users or between items [2].
Collaborative filtering methods have been improved with the aim of increasing the accuracy of recommendations. One such improvement is to hybridize collaborative filtering with content-based methods. Collaborative filtering generates recommendations based on the ratings of active users, whereas content-based methods generate recommendations based on items that are similar to the items a user prefers. This hybrid technique has proven superior to traditional recommendation techniques. However, despite its advantage in recommendation quality, the hybrid method has a weakness in scalability. In the technical domain, scalability generally describes how system size and problem size affect the performance of machines and algorithms. Larger data volumes and more complex algorithms result in less optimal algorithm performance [3]. One indication of this scalability problem is that the time needed to provide recommendations to users increases as the data volume of the recommendation system grows.
The scalability of collaborative filtering recommendation systems is widely discussed in the literature, with various proposed methods and approaches, one of which is the scale-out approach. In the scale-out approach, additional computer nodes are used to run the recommendation system in order to obtain good scalability. Previous studies implemented scale-out using Hadoop MapReduce [4], [5], [6], [7] to obtain good scalability for traditional collaborative filtering recommendation systems. Another study [8] used Apache Spark to implement a scale-out approach for addressing the scalability problem of a recommendation system with traditional collaborative filtering. The use of Apache Spark in [8] was motivated by the observation that Hadoop MapReduce performs many read and write operations to the hard disk, which makes it less suitable for collaborative filtering algorithms with many iterative steps. Implementing the hybrid method on an Apache Spark cluster is expected to give better results because Apache Spark can process data in memory cached on each node in parallel.

METHODS
The architecture of the hybrid content-based and collaborative filtering model is shown in Figure 1. The first stage reads a dataset consisting of three types of data, namely movies, ratings, and tags. The next step calculates the content-based value, followed by the collaborative filtering value. Once both values have been obtained, the hybrid calculation is performed using the results of the content-based and collaborative filtering methods.
Testing is then carried out in parallel on an Apache Spark cluster. The model is run on each cluster scheme, each of which has a different number of worker nodes. The speedup obtained in each cluster scheme is then calculated.

Content Based Filtering
Content-based filtering is usually used to find similarities between documents using the terms they contain. In this study, content-based filtering is used to calculate the similarity of movies, with genres and tags as terms. The user preference is derived by combining the genres and tags of the movies that the user likes, and is then compared with each movie the user has not rated. The more similar an item is to the preference, the higher the value it obtains.

1. The first step of the content-based method is to classify terms. This is done by counting the number of terms that appear for each movieId.

2. The second step is to calculate the TF (Term Frequency) of each term in each movieId using equation (1).

$$\mathrm{tf}(t, d) = \frac{f(t, d)}{n(d)} \tag{1}$$

$f(t, d)$ is the number of occurrences of term $t$ in document $d$ and $n(d)$ is the number of words or terms contained in document $d$.

3. Next, TF-IDF is calculated by multiplying the TF value of the term in each movie by the IDF value of the term, as shown in equation (3).

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t) \tag{3}$$

$\mathrm{tf}(t, d)$ is the term frequency of term $t$ in document $d$ and $\mathrm{idf}(t)$ is the inverse document frequency of term $t$.

5. Calculate similarity using the cosine similarity approach, as shown in equation (4). The cosine similarity approach is often used to determine the proximity between text documents [9]. Cosine similarity measures the cosine of the angle between two vectors (or two documents in a vector space). The numerator is the dot product of the TF-IDF term values of each movieId with the TF-IDF term values of the preference. The denominator is the square root of the sum of the squared TF-IDF term values of the movieId, multiplied by the square root of the sum of the squared TF-IDF term values of the preference.

$$\cos(d, p) = \frac{\sum_{t} w_{t,d}\, w_{t,p}}{\sqrt{\sum_{t} w_{t,d}^{2}} \; \sqrt{\sum_{t} w_{t,p}^{2}}} \tag{4}$$

$w_{t,d}$ is the TF-IDF value of term $t$ in movie $d$ and $w_{t,p}$ is the TF-IDF value of term $t$ in the preference $p$.
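The content-based steps can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the study: the toy movies and their terms are hypothetical, all function names are invented for the example, and the IDF variant `log(N / df)` is an assumption, since the IDF formula is not given in the text.

```python
import math
from collections import Counter


def term_frequency(terms):
    """TF as in equation (1): term occurrences divided by the number of terms."""
    counts = Counter(terms)
    total = len(terms)
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(documents):
    """IDF as log(N / df). This common variant is an assumption."""
    n_docs = len(documents)
    doc_freq = Counter(t for terms in documents.values() for t in set(terms))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(terms, idf):
    """TF-IDF as in equation (3): TF multiplied by IDF for each term."""
    tf = term_frequency(terms)
    return {term: tf[term] * idf.get(term, 0.0) for term in tf}


def cosine_similarity(a, b):
    """Cosine similarity as in equation (4) for two sparse term vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: movieId -> genre and tag terms.
movies = {
    1: ["adventure", "animation", "pixar"],
    2: ["adventure", "fantasy"],
    3: ["romance", "drama"],
}

idf = inverse_document_frequency(movies)
vectors = {movie_id: tf_idf(terms, idf) for movie_id, terms in movies.items()}

# The preference is built from the terms of the movies the user likes (movie 1).
preference = tf_idf(movies[1], idf)
scores = {
    movie_id: cosine_similarity(vector, preference)
    for movie_id, vector in vectors.items()
    if movie_id != 1
}
```

In this toy example movie 2 shares the term "adventure" with the preference and therefore scores higher than movie 3, which shares no terms and scores zero.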

Collaborative Filtering
The collaborative filtering method uses ratings as the basis for rating prediction. This study uses collaborative filtering with an item-based approach. The item-based collaborative filtering algorithm was developed to cover the weaknesses found in user-based collaborative filtering [10]. The basic idea is to use the items that a user has already rated as the basis for calculating similarity, and then to select a group of items that are similar to those rated items. The similarity value is used as a weight when predicting the rating of the target item. Users thus receive movie recommendations that reflect the tendencies of other users.

1. The first step is to calculate the average rating of each movie, as shown in equation (5). The sum of the ratings of movie $i$, $\mathrm{SUM}(R_i)$, is divided by the number of ratings of movie $i$, $\mathrm{COUNT}(R_i)$.

$$\bar{R}_i = \frac{\mathrm{SUM}(R_i)}{\mathrm{COUNT}(R_i)} \tag{5}$$

2. Then the difference between each rating and the mean rating of its movie is calculated, as shown in equation (6).

$$d_{u,i} = R_{u,i} - \bar{R}_i \tag{6}$$

3. To calculate the Pearson correlation between two items, all ratings that do not have a counterpart from the same user are excluded from the calculation. If the set of users who have rated both items $i$ and $j$ is $U$, the Pearson correlation used to calculate the similarity of items $i$ and $j$, written $\mathrm{sim}(i, j)$, is shown in equation (7).

$$\mathrm{sim}(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_i)(R_{u,j} - \bar{R}_j)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_i)^{2}} \; \sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_j)^{2}}} \tag{7}$$

$R_{u,i}$ is the rating given by user $u$ to item $i$, while $R_{u,j}$ is the rating given by user $u$ to item $j$. $\bar{R}_i$ is the average rating of item $i$ and $\bar{R}_j$ is the average rating of item $j$. $U$ is the set of users who have rated both item $i$ and item $j$.

4. At the prediction stage, [10] proposes a weighted sum algorithm, shown in equation (8). As the name implies, the predicted rating of item $i$ by user $u$, written $P(u, i)$, is obtained by summing the ratings that user $u$ gave to the items in the item neighborhood. Each rating in the sum is weighted by $\mathrm{sim}(i, j)$, the similarity of item $i$ to item $j$.

$$P(u, i) = \frac{\sum_{j \in N} \mathrm{sim}(i, j)\, R_{u,j}}{\sum_{j \in N} |\mathrm{sim}(i, j)|} \tag{8}$$

$\mathrm{sim}(i, j)$ is the similarity between item $i$ and item $j$, $R_{u,j}$ is the rating given by user $u$ to item $j$, and $N$ is the item-neighborhood set.

5. The last stage is recommendation: items are sorted by predicted value and a number of items with the highest predicted values are selected. The recommended items have never been rated by the active user, so after receiving the recommendations the user is asked to provide feedback in the form of ratings.
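The item-based steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the study's implementation: the rating data and all function names are hypothetical, the weighted sum is normalized by the sum of absolute similarities, and only positively correlated neighbors are kept, which is a choice made for this example.

```python
import math

# Hypothetical ratings: ratings[user][item] = rating value.
ratings = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 4.0, "B": 5.0, "C": 2.0},
    "u3": {"A": 1.0, "B": 2.0, "C": 5.0},
    "u4": {"A": 5.0, "C": 1.0},  # u4 has not rated item B
}


def item_means(ratings):
    """Step 1: average rating of each item (sum of ratings / count of ratings)."""
    sums, counts = {}, {}
    for user_ratings in ratings.values():
        for item, value in user_ratings.items():
            sums[item] = sums.get(item, 0.0) + value
            counts[item] = counts.get(item, 0) + 1
    return {item: sums[item] / counts[item] for item in sums}


def pearson(ratings, means, i, j):
    """Steps 2-3: Pearson correlation over users who rated both i and j."""
    co_raters = [u for u, r in ratings.items() if i in r and j in r]
    num = sum((ratings[u][i] - means[i]) * (ratings[u][j] - means[j])
              for u in co_raters)
    den_i = math.sqrt(sum((ratings[u][i] - means[i]) ** 2 for u in co_raters))
    den_j = math.sqrt(sum((ratings[u][j] - means[j]) ** 2 for u in co_raters))
    if den_i == 0.0 or den_j == 0.0:
        return 0.0
    return num / (den_i * den_j)


def predict(ratings, means, user, target):
    """Step 4: weighted sum over the items the user has already rated."""
    num = den = 0.0
    for item, value in ratings[user].items():
        sim = pearson(ratings, means, target, item)
        if sim > 0:  # assumption: keep positively correlated neighbors only
            num += sim * value
            den += abs(sim)
    return num / den if den else None


def recommend(ratings, means, user, n):
    """Step 5: rank the user's unrated items by predicted rating, keep top n."""
    candidates = {}
    for item in means:
        if item not in ratings[user]:
            predicted = predict(ratings, means, user, item)
            if predicted is not None:
                candidates[item] = predicted
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:n]


means = item_means(ratings)
top_items = recommend(ratings, means, "u4", n=5)
```

In this toy data, item B correlates positively with item A and negatively with item C, so the prediction of B for user u4 is driven by the rating u4 gave to A.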

Hybrid
This hybrid technique linearly combines the results of several recommendation techniques. The rating prediction of each method is first calculated separately, and the results are then combined into one. [11] uses a weighted average formula to combine these results. This study also applies a linear combination of methods, but the combination is the sum of the products of each method's value and its weight, as shown in equation (9).

$$P = \sum_{k} w_k \, v_k \tag{9}$$

$P$ is the prediction value, $w_k$ is the weight of method $k$, and $v_k$ is the value calculated by method $k$.
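The linear combination can be sketched as a small helper. The function name `hybrid_score` and the method keys "cb" and "cf" are hypothetical, and the example weights are placeholders rather than the weights used in the study.

```python
def hybrid_score(values, weights):
    """Sum of each method's value multiplied by that method's weight."""
    if values.keys() != weights.keys():
        raise ValueError("every method needs both a value and a weight")
    return sum(weights[method] * values[method] for method in values)


# Hypothetical per-method values for one movie, both already on a 0-1 scale.
example = hybrid_score(
    values={"cb": 0.5, "cf": 0.75},
    weights={"cb": 1.0, "cf": 1.0},
)
```

Keeping values and weights in dictionaries keyed by method name makes it harder to pair a value with the wrong weight when more methods are added.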

Testing
In this study, clusters will be created using Google Cloud Dataproc services.Then the data will be stored on the Google Cloud Storage service.The architecture of the cluster created is shown in Figure 2.

Figure 2 Apache Spark cluster architecture
The stages are executed in sequence from the first stage to the last. A stage can stand alone or process the results of the previous stage. The dataset in the cluster is divided into several partitions, and the computation in a stage is performed on each partition in the form of a task. Tasks are run in parallel by the executors on each node. The driver communicates with a coordinator called the master, which manages the workers that run the executors. A worker, or slave, is an instance that hosts executors to run tasks. After SparkContext connects to the cluster manager, executors are allocated on each node to run processes and store data. The program code is then sent to the executors, and finally SparkContext sends tasks to the executors to run, as shown in Figure 3.
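The relationship between partitions, tasks, and executors can be illustrated with a standard-library analogy. The sketch below does not use Apache Spark: the names `split_into_partitions`, `task`, and `run_stage` are hypothetical, and a thread pool merely stands in for the executors, so it shows the structure of the computation rather than real cluster parallelism.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Divide the dataset into partitions (round-robin, for illustration)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def task(partition):
    """One task processes one partition: here, a partial sum and count."""
    return sum(partition), len(partition)


def run_stage(records, num_partitions, num_executors):
    """Run one task per partition in parallel, then merge the partial results."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(task, partitions))
    total = sum(s for s, _ in partial_results)
    count = sum(c for _, c in partial_results)
    return total / count


# Hypothetical rating values; the stage computes their mean.
mean_rating = run_stage([1.0, 2.0, 3.0, 4.0, 5.0], num_partitions=2, num_executors=2)
```

Each task returns only a small partial result, which mirrors how per-partition work is merged after the tasks of a stage finish.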

Figure 3 Job paths in the Apache Spark cluster [12]

RESULTS AND DISCUSSION

Dataset
The data used for testing is the open-source dataset obtained from MovieLens. The dataset used to generate recommendations consists of 671 users, 9066 movies, 1,222 tags, and 100,004 rating records. The movies dataset consists of three columns, namely movieId, title, and genres. The movieId column contains the movie id, the title column contains the movie title, and the genres column contains the movie genres separated by "|". The ratings dataset consists of four columns, namely userId, movieId, rating, and timestamp. The userId column contains the id of the user who gives a rating, the movieId column contains the id of the rated movie, and the rating column contains the rating value given by the user to the movie. The tags dataset consists of three columns, namely userId, movieId, and tag. The userId column contains the id of the user who tags a movie, the movieId column contains the id of the tagged movie, and the tag column contains the tag phrase that the user gives to the movie.
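The way genres and tags become terms for the content-based method can be sketched with Python's `csv` module. The rows below are illustrative samples in the layout described above, not the actual dataset; the function name `load_terms` is hypothetical, and the tag column is assumed to be named `tag`.

```python
import csv
import io

# Illustrative sample rows only; the tag rows are made up for this example.
MOVIES_CSV = """movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
"""

TAGS_CSV = """userId,movieId,tag
15,1,pixar
15,2,board game
"""


def load_terms(movies_csv, tags_csv):
    """Collect the genre and tag terms of each movieId."""
    terms = {}
    for row in csv.DictReader(io.StringIO(movies_csv)):
        # Genres are stored in one column, separated by "|".
        terms[int(row["movieId"])] = [g.lower() for g in row["genres"].split("|")]
    for row in csv.DictReader(io.StringIO(tags_csv)):
        terms.setdefault(int(row["movieId"]), []).append(row["tag"].lower())
    return terms


terms = load_terms(MOVIES_CSV, TAGS_CSV)
```

Multi-word tags are kept as a single term here, which matches the description of a tag as a phrase.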

Experimental Environment
The machine type used is n1-standard-1, which has one virtual CPU (2.3 GHz Intel Xeon E5), 100 GB of storage, and 3.75 GB of RAM. For testing purposes, several cluster schemes are used. Each has one master node but a different number of worker nodes, namely 2, 4, and 7. Settings not mentioned here use the default configuration. After the create command is issued, cluster creation takes some time, and the cluster can be used once its status is ready.
For comparison, a single-node computer is used with one virtual CPU (2.3 GHz Intel Xeon E5), 500 GB of storage, and 3.75 GB of RAM.

Calculation results of Hybrid Method
From all the calculations, the ten movieIds with the highest final scores were taken. The final value is obtained by multiplying the collaborative filtering value by its weight and adding the content-based value multiplied by its weight, as explained in equation (9). In addition, the collaborative filtering result is divided by 5 so that its range matches that of the content-based method, namely 0-1. The final result therefore has a range of 0-2. The greater the final result, the more strongly the movie is recommended. The calculated values are not always identical between experiments, as rounding differences often occur. However, these differences do not affect the recommendation results because the top ten movieIds and their order do not change.
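The scaling and ranking described above can be sketched as follows. The function names `final_scores` and `top_n` and the sample values are hypothetical, and the default weights of 1.0 are an assumption chosen so that the final score spans 0-2 as stated in the text.

```python
def final_scores(cf_predictions, cb_similarities, w_cf=1.0, w_cb=1.0, max_rating=5.0):
    """Scale the collaborative filtering value to 0-1, then combine with weights."""
    scores = {}
    for movie_id in cf_predictions.keys() & cb_similarities.keys():
        cf_scaled = cf_predictions[movie_id] / max_rating
        scores[movie_id] = w_cf * cf_scaled + w_cb * cb_similarities[movie_id]
    return scores


def top_n(scores, n=10):
    """Highest scores first; ties are broken by movieId for a stable order."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical values: predicted ratings (0-5) and cosine similarities (0-1).
cf = {10: 5.0, 20: 2.5, 30: 4.0}
cb = {10: 0.2, 20: 0.9, 30: 0.5}
ranking = top_n(final_scores(cf, cb), n=2)
```

Breaking ties by movieId is a design choice for this sketch; it keeps the ranking deterministic even when small rounding differences make two scores compare as equal.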

Experimental results on the cluster
To assess the scalability of the hybrid content-based and collaborative filtering method when run on clusters, the speedup is calculated using equation (10).

$$S = \frac{\bar{T}_{\min}}{\bar{T}_{w}} \tag{10}$$

$S$ is the speedup of the cluster, $\bar{T}_{\min}$ is the average running time of the cluster with the smallest number of workers, and $\bar{T}_{w}$ is the average running time of the cluster whose speedup is being calculated.

Several cluster schemes with different numbers of workers are used for testing, namely 1, 2, 4, and 7 workers. The data are processed in each cluster scheme using the hybrid collaborative filtering and content-based method. Testing is carried out ten times in each cluster scheme, so that 40 experiments were conducted in total, as shown in Table 2. The "w" column represents the number of workers and the "TRY (second)" columns represent each experiment performed. The average runtime is calculated for each cluster size. The speedup is then calculated by dividing the average runtime of the cluster with the smallest number of workers, in this case one worker, by the average runtime of the cluster in question, as shown in equation (10).

As shown in Table 3, the speedup increases as the number of worker nodes increases. Relative to the standalone Apache Spark runtime, the cluster with 2 workers has practically the same runtime, with a speedup of 1.003. The speedup is 2.913 for the cluster with 4 workers and 5.85 for the cluster with 7 workers. The speedup calculation for the cluster with 7 workers is shown in equations (11) and (12).

$$S_7 = \frac{\bar{T}_1}{\bar{T}_7} \tag{11}$$

$$S_7 = \frac{\bar{T}_1}{1779.9} = 5.85 \tag{12}$$

$S_7$ is the speedup obtained by the cluster with 7 workers, $\bar{T}_1$ is the average execution time of the cluster with 1 worker, and $\bar{T}_7$ is the average execution time of the cluster with 7 workers. Figure 4 shows the speedup graph of the hybrid collaborative filtering and content-based method on the Apache Spark cluster. Parallelization with Apache Spark gives good runtime results for the hybrid method, as shown by the speedup increasing with the number of workers in each cluster scheme. The cluster scheme with 2 workers has a speedup of 1.003 with an average runtime of 10384.3 seconds, the scheme with 4 workers has a speedup of 2.913 with an average runtime of 3574.8 seconds, and the scheme with 7 workers has a speedup of 5.85 with an average runtime of 1779.9 seconds.
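The speedup calculation can be checked with a short script. The average 1-worker runtime is not quoted in the text, so this sketch back-computes the baseline implied by each reported pair of runtime and speedup; the function name `speedup` and the variable names are hypothetical.

```python
def speedup(baseline_runtime, runtime):
    """Speedup: average baseline runtime divided by the cluster's average runtime."""
    if runtime <= 0:
        raise ValueError("runtime must be positive")
    return baseline_runtime / runtime


# Reported in the text: workers -> (average runtime in seconds, speedup).
reported = {2: (10384.3, 1.003), 4: (3574.8, 2.913), 7: (1779.9, 5.85)}

# Baseline (1-worker) runtime implied by each reported pair.
implied_baseline = {w: runtime * s for w, (runtime, s) in reported.items()}

# Spread between the implied baselines, in seconds.
spread = max(implied_baseline.values()) - min(implied_baseline.values())
```

The three implied baselines fall within a few seconds of each other (roughly 10412 to 10415 seconds), so the reported runtimes and speedups are mutually consistent up to rounding.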

Table 1 Calculation results of the hybrid collaborative filtering and content-based method

Table 2 Runtime testing of the hybrid collaborative filtering and content-based method

Table 3 Speedup calculation results of the hybrid collaborative filtering and content-based method