PFPMine: A parallel approach for discovering interacting data entities in data-intensive cloud workflows
Abstract: With the evolution of cloud computing, communities and companies have deployed their workflows on the cloud to support end-to-end business processes that are usually syndicated with other external services. To improve system efficiency and reduce energy consumption, data placement and backup strategies should be carefully designed. One of the most challenging problems is the discovery of interacting data entities in data-intensive workflows. To tackle this challenge, this paper presents a frequent pattern-based approach named FPMine for interacting data entity discovery in cloud workflows. A direct discriminative mining algorithm is first proposed to determine the minimum support threshold, based on which an FP-tree is constructed to formulate the frequent item pairs. Next, an FP-matrix is applied to avoid traversing the FP-trees during data entity discovery, and a pruning approach is introduced to reduce the redundancy of frequent item pairs. Furthermore, we propose a parallel data entity mining algorithm using the MapReduce framework, namely PFPMine, and then design a primitive data placement and backup strategy. Finally, we evaluate the efficiency of our approach through experiments on real-life data, which show that our approach can efficiently facilitate the discovery of interacting data entities for cloud workflows. Compared with the traditional FP-growth approach, we pay only a multiplicative factor for making our approach able to extract fine-grained frequent item pairs rather than frequent patterns, which can bring significant advantages to data placement. After parallelization, the PFPMine algorithm is more efficient than FP-growth on both sparse and dense datasets. The results show that PFPMine can reduce the running time by at least 25% and performs with significantly higher efficiency than the FP-growth approach.
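To make the notion of frequent item pairs under a minimum support threshold concrete, the following is a minimal sketch in a map/reduce style. It is not the FPMine or PFPMine algorithm from the paper: it counts pairs by brute force instead of building an FP-tree or FP-matrix, takes the threshold as a given parameter rather than deriving it, and simulates the map and reduce phases with plain Python functions. The names `map_pairs`, `reduce_pairs`, `frequent_pairs`, and the sample entities `d1` to `d4` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_pairs(transactions):
    """Map step (assumed design): count co-occurring item pairs in one partition."""
    counts = Counter()
    for items in transactions:
        # Sort and deduplicate so that (a, b) and (b, a) share one key.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


def reduce_pairs(partial_counts, min_support):
    """Reduce step: merge partition counts and keep pairs meeting min_support."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return {pair: n for pair, n in total.items() if n >= min_support}


def frequent_pairs(transactions, min_support, n_partitions=2):
    """Split the transactions, map each partition, then reduce the results."""
    partitions = [transactions[i::n_partitions] for i in range(n_partitions)]
    return reduce_pairs([map_pairs(p) for p in partitions], min_support)


# Hypothetical workflow runs, each listing the data entities it accessed.
runs = [
    ["d1", "d2", "d3"],
    ["d1", "d2"],
    ["d2", "d3"],
    ["d1", "d2", "d4"],
]
result = frequent_pairs(runs, min_support=2)
```

Pairs that survive the threshold are candidates for being placed or backed up together. In a real deployment the per-partition map calls would run on separate workers.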
© 2020 Elsevier B.V.
ISSN: 0167-739X
Volume, issue, pages: v 113, p474-487
Publication date: 2020-12-01
Impact factor: 6.125100
Journal ranking (for SCI journals, the Chinese Academy of Sciences partition): Zone 2
Indexed in: EI (Engineering Index)
Journal: Future Generation Computer Systems
Co-authors: 黄昱泽, 刘聪, 张呈宁
First author: 黄霁崴
Paper type: Journal article
Paper summary: 黄昱泽, 黄霁崴, 刘聪, 张呈宁, PFPMine: A parallel approach for discovering interacting data entities in data-intensive cloud workflows, Future Generation Computer Systems, 2020, v 113, p474-487
Paper title: PFPMine: A parallel approach for discovering interacting data entities in data-intensive cloud workflows