Workflow Provenance for Big Data: From Modelling to Reporting

Rayhan Ferdous, Banani Roy, Chanchal K. Roy, Kevin A. Schneider


Abstract
A scientific workflow management system (SWFMS) is an inherent part of Big Data analytics systems. Analyses in such data-intensive research using workflows are very costly. SWFMSs, or workflows, keep track of every step of execution through logs, which can later be used on demand. For example, in the case of errors, security breaches, or other conditions, we may need to trace back to previous steps or inspect intermediate data elements. This kind of logging is known as workflow provenance. However, because prominent workflows are domain specific and developed following different programming paradigms, their architectures, logging mechanisms, log contents, provenance queries, and so on differ significantly. As a result, the provenance technology of a workflow from one domain is not easily applicable in another domain. Given the lack of a general workflow provenance standard, we propose a programming model for automated workflow logging. The programming model is easy to implement and easily configurable by domain experts, independent of workflow users. We implement our workflow programming model for Bioinformatics research as an evaluation and collect workflow logs from the executions of various scientific pipelines. We then focus on some fundamental provenance questions, inspired by recent literature, from which many other complex provenance questions can be derived. Finally, end users are provided with insights discovered from the workflow provenance through online data visualization as a separate web service.
Cite:
Rayhan Ferdous, Banani Roy, Chanchal K. Roy, and Kevin A. Schneider. 2019. Workflow Provenance for Big Data: From Modelling to Reporting. Studies in Big Data:185–200.