The MapReduce Model on Cascading Platform for Frequent Itemset Mining

The implementation of parallel algorithms has recently become an active research topic. Parallelism is well suited to large-scale data processing. Several parallel and distributed programming models exist, such as MapReduce, MPI, and CUDA. Implementing parallel programs becomes difficult as data size and complexity increase. Cascading provides a simple scheme on top of the Hadoop system, which implements the MapReduce model, to refactor, test, and execute a complex application and to convert an application to run on Hadoop. Frequent itemsets are the objects that appear most often in a dataset. Frequent Itemset Mining (FIM) requires complex computation, so it is a complicated problem on large-scale data. This paper discusses the implementation of the MapReduce model on Cascading for FIM. The experiment uses the Amazon product co-purchasing network metadata dataset. The experiment shows that the simple mechanism of Cascading, which resembles assembling a pipe system, can be used to solve the FIM problem. It gives time complexity O(n), which is more efficient than the non-parallel process with complexity O(n^2/m).

Keywords: Frequent Itemset Mining, MapReduce, Cascading

ISSN (print): 1978-1520, ISSN (online): 2460-7258. IJCCS Vol. 12, No. 2, July 2018: 149-160


Previous works
The fast growth of computer technology causes a tremendous increase in data. Frequent itemsets are objects that often appear in a dataset. Objects are said to be frequent if their number of appearances is greater than a specified support value. By finding the frequent itemsets in a system, the patterns of the system can be recognized. Frequent itemsets can be used to mine the relevant evidence of computer crime, crime trends, and connections among different crimes. This can help police detect cases and prevent crime with clues and criteria [1]. Frequent itemset mining also plays an important part in college library data analysis. The RFP-Growth algorithm was used to find the frequent itemsets in a college library database. There are a lot of redundant data in a library database. The mining process may generate intra-property frequent itemsets [2].
Frequent Itemset Mining (FIM) is the process of finding frequent itemsets by using data mining. FIM is a very interesting problem. Some research focuses on algorithms, such as the MRApriori algorithm [3], a parallel balanced mining algorithm for closed frequent itemsets based on MapReduce [4], a Hadoop-MapReduce model for handling massive datasets in mining infrequent itemsets [5], the Sequence-Growth algorithm on the MapReduce framework [6], a data partitioning strategy on Hadoop [7], and a mining algorithm for frequent itemsets based on MapReduce and FP-tree (the MAFIM algorithm) [8]. Other research focuses on implementing the algorithms for specific objects.
A substantial number of frequent itemset mining algorithms and their MapReduce implementations have been introduced and investigated [9]. The use of the Hadoop MapReduce framework makes the execution time linear in the number of transactions per batch. It was found that increasing the stock size did not have much impact on execution time. Execution time is also inversely proportional to the number of nodes [10]. The MapReduce framework can be used for mining frequent itemsets to gain greater scalability and speed in finding meaningful information in large datasets [11].
A deep review of different FIM techniques shows that current distributed FIM algorithms often suffer from generating huge intermediate data or from scanning the whole transaction database to identify the frequent itemsets [12]. The MapReduce framework has also been used to build collaborative filtering, which makes automatic predictions (filtering) about the interests of a user by collecting preference or taste information from many users (collaborating) [13]. Three MapReduce tasks have been implemented to complete the mining of big datasets, using the parallelism among the computing nodes of a cluster to improve the performance of frequent pattern mining on Hadoop clusters [14].
MapReduce is a programming model for distributed and parallel computing which is very suitable for large-scale data processing. MapReduce was originally developed by Google for parallel and distributed processing [15]. MapReduce was developed to work on thousands of machines and massive datasets [16].
The implementations of three Apriori algorithms based on Hadoop MapReduce, namely MRApriori, one-phase, and k-phases, have been compared [3]. The MRApriori algorithm took only two phases of MapReduce jobs to search for all frequent k-itemsets. Experimental results show that the MRApriori algorithm outperforms the other two algorithms.
A MapReduce-based balanced mining algorithm for closed frequent itemsets has been presented [4]. The algorithm adopts a greedy strategy to balance the parallel computation. The algorithm consists of three steps: parallel computation, global construction of the frequent list and group maps, and parallel mining of closed frequent itemsets. The experiment showed the effectiveness and scalability of closed FIM on large-scale data.
The MapReduce Apriori algorithm for FIM was used to speed up the response time [9]. The Parallel Improved Single Pass Ordered (PISPO) algorithm, based on a cloud-computing framework and MapReduce, has been proposed [4]. The algorithm improved the SPOTree, FP-Growth, and MapReduce algorithms. PISPO was used to find the frequent itemsets in electronic evidence.
There are many other applications that use FIM on Hadoop MapReduce. Among these are the generation of association rules in transactional data streams [10] and FIM in social network data [11].
MapReduce is a complex framework that is difficult to implement, even for software engineers. The Cascading platform may be used to simplify the process of writing program code. The Cascading libraries abstract the complex data flow of the MapReduce programming model [17].
This paper explores the use of the Cascading platform to simplify the MapReduce program code for the FIM problem. The program is then used to find the frequent itemsets of Amazon transaction data. The time needed to solve the problem is observed. The time needed by the parallel program implemented on the Cascading platform is compared with that of the non-parallel program. The effects of data size and support count on the execution time are also observed.

Related Works

MapReduce
MapReduce is a programming model for processing large-scale data. The MapReduce model has two main processes, namely the Map process and the Reduce process. Figure 1 shows the relation between the Map and Reduce processes. The MapReduce process begins by breaking up the input data into multiple data items. The Map function outputs one or more key-value pairs for each data item. The key-value pairs are then sorted and grouped by key. For each distinct key, the Reduce function processes the grouped values and outputs one or more key-value pairs to a file as the final result [18].
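The Map, sort-and-group, and Reduce steps described above can be illustrated with a minimal in-memory sketch. It is not Hadoop code; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `item_mapper`, `sum_reducer`) and the sample transactions are assumptions made only for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Sort the pairs by key and group the values that belong to each distinct key."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Call the reducer once per distinct key with all values grouped under that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def item_mapper(transaction):
    """Hypothetical mapper: emit <item, 1> for every item in a transaction."""
    return [(item, 1) for item in transaction]


def sum_reducer(key, values):
    """Hypothetical reducer: add up the values emitted for one key."""
    return sum(values)


# Made-up input data, already split into individual records.
transactions = [["A", "B"], ["A", "C"], ["A", "B", "C"]]
counts = reduce_phase(shuffle_phase(map_phase(transactions, item_mapper)), sum_reducer)
print(counts)
```

In a real cluster the three phases run on different machines and the grouped values are streamed rather than held in a dictionary; the sketch only mirrors the data flow.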

Hadoop
Hadoop is the most popular implementation of the MapReduce model. Hadoop is a software framework for reliable, scalable, parallel, and distributed computing [16]. The Hadoop framework consists of the libraries and utilities required by the other Hadoop modules, the Hadoop Distributed File System (HDFS), and Hadoop Yet Another Resource Negotiator (YARN). HDFS is a distributed file system that provides high-throughput access to application data. YARN is a framework for job scheduling and cluster resource management. YARN provides APIs for resource management. YARN also serves other application frameworks such as Spark and Tez. Hadoop MapReduce is a YARN-based system for large-scale parallel data processing. Figure 2 shows the Hadoop MapReduce model as a YARN-based system.

Cascading
Cascading is an application development platform for building big data applications on Hadoop. Cascading has a Java Application Programming Interface (API) that simplifies MapReduce-based programming on Hadoop. Cascading creates and executes complex data processing workflows on Hadoop. Cascading consists of APIs for data processing, integration, process design, and process scheduling. Cascading can be used directly once Hadoop has been installed [17]. Figure 2 shows the Hadoop MapReduce model.

Figure 2. The Hadoop MapReduce model as a YARN-based system [19]

Cascading does not change the mapper-reducer layer or the sub-system layer structure in Hadoop. Cascading provides an abstraction for the MapReduce programming model. The workflow used in Cascading is called "Source-Pipe-Sink". Figure 3 shows the workflow of Cascading.
Figure 3. The workflow of Cascading [20]

In the Cascading model, data is stored in the input part, called the "source". The data is then sent to the output part, called the "sink", through a path called the "pipe". Additional processes may be executed while the data flows from the "source" to the "sink".
A Cascading application may have many "flows". Every "flow" represents a physical plan, analogous to the scheduling topology on Hadoop. Every "pipe" has a head and a tail. A "flow" works independently of, and in parallel with, the other "flows". Cascading uses a tuple-centric data model. All data is represented as tuples. A tuple is a list of values. Tuples flow through the "pipe".
Cascading has pipe types that are defined as operations on the stream. Among these are the Each, Merge, GroupBy, Every, CoGroup, and HashJoin pipes. The Each operation works on individual tuples. It covers filtering, replacing values, and removing tuples. The Merge operation merges two or more streams. The GroupBy operation groups the tuples by a field and its value.
The grouping operation prepares the stream for processing by aggregator and buffer operations on each group, such as counting, totaling, or averaging. The Every operation works on the grouped tuple stream, that is, on the output of a GroupBy or CoGroup operation. CoGroup and HashJoin are grouping operations that combine two or more streams to obtain specific fields of the output stream.
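The source-pipe-sink workflow and the Each, Merge, GroupBy, and Every operations can be mimicked with plain Python generators. The sketch below is only an analogy and does not use the Cascading Java API; the function names (`each`, `merge`, `group_by`, `every`, `drop_empty_item`) and the sample tuples are assumptions introduced for illustration.

```python
from itertools import groupby
from operator import itemgetter


def each(stream, function):
    """Each stand-in: apply a function to every tuple; it may emit zero or more tuples."""
    for tup in stream:
        yield from function(tup)


def merge(*streams):
    """Merge stand-in: concatenate two or more tuple streams into one."""
    for stream in streams:
        yield from stream


def group_by(stream, key_index):
    """GroupBy stand-in: sort the tuples on one field and group equal field values."""
    ordered = sorted(stream, key=itemgetter(key_index))
    for key, group in groupby(ordered, key=itemgetter(key_index)):
        yield key, list(group)


def every(grouped, aggregator):
    """Every stand-in: run an aggregator once per group and emit (key, result)."""
    for key, group in grouped:
        yield key, aggregator(group)


def drop_empty_item(tup):
    """Hypothetical filter: remove tuples whose item field is empty."""
    return [tup] if tup[1] else []


# Two made-up sources of (customer, item) tuples.
source_a = [("c1", "A"), ("c1", "B"), ("c2", "A"), ("c3", "")]
source_b = [("c4", "B")]

pipe = merge(source_a, source_b)        # head of the pipe
pipe = each(pipe, drop_empty_item)      # per-tuple filtering
pipe = group_by(pipe, key_index=1)      # group on the item field
pipe = every(pipe, len)                 # count the tuples in each group
sink = list(pipe)                       # tail of the pipe writes to the sink
print(sink)
```

As in Cascading, nothing is computed until the tail of the pipe is consumed; here that happens when `list(pipe)` drains the generators.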

Frequent Itemset
Frequent itemsets are objects that often appear in a dataset. Objects are said to be frequent if their number of appearances is greater than the specified support value [3]. The appearances of each item in the transactions are counted. The support count is the number of transactions in which an item appears. Suppose n is an integer; L_n denotes the itemsets that contain n items. Table 2 shows the support count of the itemsets. If the minimum support count is 4, then the frequent itemsets are those shown in Table 3.
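A small worked example may make the support count concrete. The transactions below are made up for illustration and are not the data of Table 1; the helper names `support_counts` and `frequent_items` are likewise assumptions. The sketch treats the threshold as a minimum support count, so an item is kept when its count is at least 4.

```python
from collections import Counter

# Hypothetical transactions (not the paper's Table 1).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk"},
]


def support_counts(transactions):
    """Count, for every item, the number of transactions that contain it."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))  # set(): count an item once per transaction
    return counts


def frequent_items(counts, min_support):
    """Keep the items whose support count reaches the minimum support count."""
    return {item: count for item, count in counts.items() if count >= min_support}


counts = support_counts(transactions)
frequent = frequent_items(counts, min_support=4)
print(dict(counts))
print(frequent)
```

With these six transactions, bread and milk each appear five times and eggs three times, so only bread and milk are frequent at a minimum support count of 4.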

METHODS

This research focuses on the development of a parallel FIM application based on MapReduce by using Cascading. The application is used to find the frequent itemsets in the Amazon product co-purchasing network metadata [21]. The time needed for execution is observed. The effects of data size and support count are observed. The observations are used to determine the complexity.

Data preprocessing
The experiment uses the Amazon product co-purchasing network metadata. It is a 35.4 MB dataset that contains product metadata and review information about 548,552 different products, such as books, music CDs, DVDs, and VHS video tapes. For each product, the following information is available: title, sales rank, list of similar products, detailed product categorization, and product reviews (time, customer, rating, number of votes, number of people ...).

The first step of the experiment is transforming the experiment data into transaction data. The transaction data consists of two columns: the customer column and the ASIN (Amazon Standard Identification Number) column. This is carried out by using the MapReduce model. The Amazon dataset is loaded into the Hadoop Distributed File System (HDFS) for subsequent processing by Hadoop, which outputs key-value pairs of <Customer ID, Item purchased>. Figure 4 shows the data preprocessing.
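The transformation into <Customer ID, Item purchased> pairs can be sketched as a single parsing pass. The record layout below is an assumption: it supposes that each product block contains an `ASIN:` line followed by review lines carrying a `cutomer:` or `customer:` field. The sample text, the identifiers in it, and the function name `extract_pairs` are made up for illustration, and the sketch runs in memory rather than on HDFS.

```python
import re

# Assumed layout: "ASIN: <id>" starts a product, review lines carry "cutomer: <id>".
ASIN_RE = re.compile(r"^ASIN:\s*(\S+)")
CUSTOMER_RE = re.compile(r"cus?tomer:\s*(\S+)")  # accepts both spellings


def extract_pairs(lines):
    """Yield (customer_id, asin) for every review line that follows an ASIN line."""
    current_asin = None
    for line in lines:
        asin_match = ASIN_RE.match(line.strip())
        if asin_match:
            current_asin = asin_match.group(1)
            continue
        customer_match = CUSTOMER_RE.search(line)
        if customer_match and current_asin is not None:
            yield customer_match.group(1), current_asin


# Made-up sample in the assumed layout; the identifiers are not real.
sample = """Id:   1
ASIN: ASIN0001
  title: Example Book
  reviews: total: 2
    2000-7-28  cutomer: CUST1  rating: 5  votes: 10  helpful: 9
    2003-12-14  cutomer: CUST2  rating: 5  votes: 6  helpful: 5
Id:   2
ASIN: ASIN0002
    2001-12-16  cutomer: CUST1  rating: 4  votes: 5  helpful: 4
""".splitlines()

pairs = list(extract_pairs(sample))
print(pairs)
```

In the MapReduce version of this step, the same extraction logic would sit in the mapper, with the customer identifier as the output key.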

Algorithm design and program implementation
In this experiment, the L_1 and L_2 itemsets are mined from the transactional data. The transactional data are put into the Cascading input tap. The L_1 itemsets are mined while the transactional data flow from the input tap to the output tap. The Cascading output tap outputs the L_1 itemsets. This process is depicted in Figure 5.
The output of the process in Figure 5 is used as the input for finding the L_2 itemsets. In this process, HadoopDistributedCache is used to load the L_1 itemsets, followed by a process in the pipe that is the same as the process in Figure 5. The output of this process is L_2, where the key is a frequent 2-itemset and the value is its support count.

... based on the key. The results of this operation are the candidate itemsets (C_k) in the form of <Item, {1..n}>. Then, the reducer uses the Every operation to add up the values of each candidate itemset. The reducer outputs the key and its support count. The final L_1 result is the union of all reducer outputs. Figure 12 shows the detailed process of mining the L_2 itemsets by using MapReduce. The mappers give output in the form of <item1, item2, 1>. The GroupBy operations give output in the form of <item1, item2, {1..n}>. The reducers give output in the form of <frequent 2-itemset, count>. The union of these outputs gives the final L_2 result.
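The two mining passes described above can be sketched in memory as follows. This is an illustrative stand-in, not the Cascading implementation used in the experiment: `group_sum`, `mine_l1`, and `mine_l2` are hypothetical names, the `l1` dictionary plays the role of the cached L_1 result, and the purchases are made up. The sketch also assumes that a repeated <customer, item> pair counts only once.

```python
from collections import defaultdict
from itertools import combinations


def group_sum(pairs):
    """GroupBy + Every stand-in: group emitted (key, 1) pairs by key and add the values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def mine_l1(purchases, min_support):
    """Emit <item, 1> per distinct purchase, sum per item, keep counts >= min_support."""
    emitted = ((item, 1) for _customer, item in set(purchases))
    return {item: c for item, c in group_sum(emitted).items() if c >= min_support}


def mine_l2(purchases, l1, min_support):
    """Emit <(item1, item2), 1> per customer basket, using only items found in L_1."""
    baskets = defaultdict(set)
    for customer, item in purchases:
        if item in l1:  # l1 stands in for the cached L_1 itemsets
            baskets[customer].add(item)
    emitted = (
        (pair, 1)
        for items in baskets.values()
        for pair in combinations(sorted(items), 2)
    )
    return {pair: c for pair, c in group_sum(emitted).items() if c >= min_support}


# Made-up <customer, item> pairs.
purchases = [
    ("c1", "A"), ("c1", "B"),
    ("c2", "A"), ("c2", "B"), ("c2", "C"),
    ("c3", "A"), ("c3", "C"),
    ("c4", "B"),
]
l1 = mine_l1(purchases, min_support=2)
l2 = mine_l2(purchases, l1, min_support=2)
print(l1)
print(l2)
```

Filtering baskets by L_1 before generating pairs is what keeps the number of emitted candidate pairs small, since any pair containing an infrequent item cannot itself be frequent.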

Experiments
Two experiments have been done in this research [22]. The MapReduce and the non-MapReduce processes for L_1 and L_2 FIM have been observed. Four values of support count are used: 50, 75, 100, and 125. These support counts are applied to data of three different sizes. For each experiment, the time needed to accomplish the FIM process is observed.

Results
The first step of the experiment is transforming the transactional data into key-value pair data of CustomerID and itemID. This process produces 5,524,141 bytes of data consisting of 156,852 transactions. The L_2 FIM is mined from transactional data of three different sizes: 156,852, 78,426, and 39,213 transactions. Table 4 shows the experiment result. Figure 13 shows the comparison of L_2 FIM execution time on the non-MapReduce system. Figure 14 shows the comparison of L_2 FIM execution time on the MapReduce system. Figure 15 shows the comparison of all L_2 FIM execution times. The line at the bottom of Figure 15 actually represents all the execution times on the MapReduce system.

Discussions
Two processes of L_2 frequent itemset mining have been observed in the experiment: a non-MapReduce process and a MapReduce process. Both processes worked on data of three different sizes and four minimum support counts, namely 50, 75, 100, and 125.
Both the MapReduce and non-MapReduce processes give the same result but, as shown in Table 4, the times needed to accomplish the FIM are very different. The MapReduce system runs faster than the non-MapReduce system.
The change of the minimum support count significantly affects the time needed to accomplish the L_2 FIM on the non-MapReduce system, as shown in Figure 13, but not on the MapReduce system, as shown in Figure 14. Comparing the whole experiment result in Figure 15, the execution time of the non-MapReduce system increases in O(n^2) as the dataset size increases. On the other hand, it decreases in O(1/m) as the minimum support count increases. The time complexity of L_2 FIM for the non-MapReduce system is therefore O(n^2/m), with dataset size n and minimum support count m. The execution time of the MapReduce system increases in O(n) as the dataset size increases, but the minimum support count does not affect the execution time, as shown in Figure 14 and Figure 15.

Figure 4. Data preprocessing

Figure 7 shows the flowchart of mining L_k itemsets in the pipe. The implementation of the flowcharts in Figures 5, 6, and 7 starts by defining the input

The MapReduce Model on Cascading Platform for Frequent Itemset... (Nur Rokhman)

Figure 8. The definition of the input tap and the output tap.
CONCLUSIONS

Based on the experiment and the discussion, it can be concluded that:
1. The Cascading platform can be combined with Hadoop to implement MapReduce for mining the L_2 frequent itemsets.
2. The execution time of the L_2 frequent itemset mining with the Cascading platform is O(n), while that of the regular process is O(n^2/m), with dataset size n and minimum support count m.

Table 1. Examples of transaction data

Table 2. Support Count of Each Item L_n

Table 3. The Frequent Itemset with minimum support count 4

Table 4. L_2 FIM execution time