What is Provenance in Big Data?
The term provenance refers to recorded information about the origins of data and the way that data has been processed.
Data provenance helps us assess the authenticity and quality of the data. Maintaining provenance for a huge volume of data can be complicated, because the data goes through multiple stages of processing.
To make provenance easier to manage, data is classified into three key states, based on whether it is being transmitted, being processed, or sitting in storage. These states are called data-in-motion, data-in-use, and data-at-rest respectively.
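To make the idea concrete, here is a minimal sketch of what a provenance record could look like. The class names, field names, and sample values are hypothetical and chosen only for illustration; real provenance systems track far more detail.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class DataState(Enum):
    IN_MOTION = "data-in-motion"  # being transmitted
    IN_USE = "data-in-use"        # being processed
    AT_REST = "data-at-rest"      # in storage


@dataclass
class ProvenanceRecord:
    data_id: str   # identifier of the data item (hypothetical naming)
    origin: str    # where the data came from
    state: DataState
    steps: List[str] = field(default_factory=list)  # processing applied so far

    def add_step(self, description: str, new_state: DataState) -> None:
        """Log one processing step and the state the data is in afterwards."""
        self.steps.append(description)
        self.state = new_state


record = ProvenanceRecord("orders-feed", "sensor-gateway", DataState.IN_MOTION)
record.add_step("parsed CSV rows", DataState.IN_USE)
record.add_step("written to warehouse", DataState.AT_REST)
print(record.state.value, record.steps)
```

The record answers the two questions provenance is about: where the data came from (`origin`) and how it was processed (`steps`).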
Data scientists face a few challenges while implementing and operating big data provenance. We will discuss those challenges in this post.
Big Data Provenance Architecture
Data provenance is useful for debugging data and transformations, evaluating data quality and trust, implementing access control for extracted data, auditing data, and creating a model for data authentication.
To understand the importance of data provenance in big data, let's take a look at the Big Data Provenance Architecture.
Big Data Provenance Challenges
Implementing data provenance is difficult, partly because big data analysis is workflow-driven and provenance has to be tracked across every step of the workflow.
These big data provenance challenges generally stem from the large volume of data, legacy tools that are application-driven, and distributed data-parallel (DDP) patterns.
We will list some of the key challenges based on how big data provenance is used.
One of the major challenges in big data provenance is high collection overhead.
Big data analytics often runs a huge number of online streaming datasets through multi-step models, and capturing provenance at every step increases the collection overhead.
We also have to factor in the computational cost of the analysis itself. If there are any discrepancies in the collected data, they could lead to misleading results.
And when the data and processing are distributed, the overhead problem only gets worse.
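A toy sketch shows how quickly collection overhead grows. It assumes the most fine-grained form of capture, one lineage entry per record per step; the function and step names are made up for this example.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[int], int]]


def run_with_provenance(records: List[int], steps: List[Step]):
    """Apply each step to every record, logging one lineage entry per record per step."""
    log = []  # entries: (step_name, input_value, output_value)
    current = list(records)
    for name, fn in steps:
        next_values = []
        for value in current:
            out = fn(value)
            log.append((name, value, out))
            next_values.append(out)
        current = next_values
    return current, log


steps = [
    ("double", lambda x: x * 2),
    ("increment", lambda x: x + 1),
    ("square", lambda x: x * x),
]
results, log = run_with_provenance([1, 2, 3, 4], steps)
print(results)   # [9, 25, 49, 81]
print(len(log))  # 12 entries for 4 input records
```

Four records and three steps already produce twelve log entries. The log grows with the number of records multiplied by the number of steps, which is why overhead becomes a concern at streaming scale.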
Many data scientists use MapReduce programming models to process their data.
The number of user-defined function calls can sometimes run into the millions, and the recorded provenance can grow larger than the original data.
Handling provenance of this size is very complex, and data scientists need to store it efficiently. They need to find a way to reduce its size without compromising what it can be used for.
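One possible way to shrink provenance without losing information is to store runs of consecutive record IDs as ranges instead of listing every ID. This is only an illustrative technique, not a method prescribed here, and the helper names are hypothetical.

```python
def compact_ids(ids):
    """Collapse record IDs into sorted, inclusive (start, end) ranges."""
    ranges = []
    for i in sorted(set(ids)):
        if ranges and i == ranges[-1][1] + 1:
            # Extend the current run by one.
            ranges[-1] = (ranges[-1][0], i)
        else:
            # Start a new run.
            ranges.append((i, i))
    return ranges


def contains(ranges, record_id):
    """Check whether a record ID is covered by the compacted ranges."""
    return any(lo <= record_id <= hi for lo, hi in ranges)


touched = compact_ids([3, 1, 2, 7, 8, 10])
print(touched)               # [(1, 3), (7, 8), (10, 10)]
print(contains(touched, 8))  # True
print(contains(touched, 9))  # False
```

The compacted form still answers the question "did this step touch record N?", so the provenance keeps its usefulness. The saving depends on how contiguous the IDs are; scattered IDs compress poorly.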
Reproducing an execution in a big data application is a complicated process.
Most existing data provenance systems record only the intermediate data generated during execution and its dependencies.
These systems often neglect an important aspect of reproducibility: execution environment information.
This execution environment information consists of the parameter configurations of the big data engines and the hardware information.
It is critical to execution performance, and it can also affect the final results.
As most big data systems don't record it, reproducing an execution accurately becomes difficult.
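As a rough sketch, an environment snapshot can be captured alongside the provenance using only the Python standard library. The engine parameter names below (`num_reducers`, `shuffle_buffer_mb`) are hypothetical placeholders, not settings of any particular engine.

```python
import json
import os
import platform


def capture_environment(engine_config):
    """Collect basic hardware/OS facts plus the engine parameters used for a run."""
    return {
        "python_version": platform.python_version(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "engine_config": dict(engine_config),  # copy so later changes don't leak in
    }


# Hypothetical engine parameters, for illustration only.
snapshot = capture_environment({"num_reducers": 8, "shuffle_buffer_mb": 256})
print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Storing a snapshot like this with each run gives a later reproduction attempt something to compare against, although a real system would need far more detail (library versions, cluster topology, and so on).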
Data scientists find the storage and integration of data provenance complicated.
The provenance of user-defined functions (UDFs) running on big data systems is normally saved on non-permanent nodes.
That leaves two choices: stitch the collected information together afterwards, or update it while the analysis is taking place.
The first choice is normally more efficient, but it requires an additional step: uploading the information before the computation nodes are freed.
The second choice involves a lot of communication overhead, but it can be useful for monitoring application progress. In both scenarios, additional steps are required to stitch the data together.
This makes provenance harder to store and integrate, and it is one of the major big data provenance challenges.
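The two choices can be sketched with a small simulation. Everything here is an assumption made for illustration: the in-memory `CentralStore` stands in for durable storage, and the number of write calls is used as a rough proxy for communication overhead.

```python
class CentralStore:
    """Stand-in for durable provenance storage (hypothetical, in-memory)."""

    def __init__(self):
        self.records = []
        self.write_calls = 0  # rough proxy for communication overhead

    def write(self, batch):
        self.write_calls += 1
        self.records.extend(batch)


class BufferedCollector:
    """Choice 1: keep provenance on the node, upload once before release."""

    def __init__(self, node, store):
        self.node = node
        self.store = store
        self.buffer = []

    def record(self, seq, event):
        self.buffer.append((seq, self.node, event))

    def release_node(self):
        # The extra step: upload before the computation node is freed.
        if self.buffer:
            self.store.write(self.buffer)
            self.buffer = []


class StreamingCollector:
    """Choice 2: push every provenance record while the analysis runs."""

    def __init__(self, node, store):
        self.node = node
        self.store = store

    def record(self, seq, event):
        self.store.write([(seq, self.node, event)])

    def release_node(self):
        pass  # nothing is left on the node


def stitch(store):
    """Merge records from all nodes into one timeline ordered by sequence number."""
    return sorted(store.records)


def simulate(collector_cls):
    store = CentralStore()
    a = collector_cls("node-a", store)
    b = collector_cls("node-b", store)
    a.record(1, "map chunk 0")
    b.record(2, "map chunk 1")
    a.record(3, "reduce key x")
    b.record(4, "reduce key y")
    a.release_node()
    b.release_node()
    return store.write_calls, stitch(store)


buffered_writes, buffered_timeline = simulate(BufferedCollector)
streaming_writes, streaming_timeline = simulate(StreamingCollector)
print(buffered_writes, streaming_writes)  # 2 4
```

Both strategies end with the same stitched timeline, but the streaming collector makes one write per record, while the buffered collector makes one write per node and loses everything if a node is freed before `release_node` runs.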
Even though implementing and operating big data provenance currently involves a lot of challenges, we cannot look past the fact that provenance is a vital cog in the big data analytics process.
The significance of provenance in big data is huge, and if you can overcome these challenges, you will be well on your way to a successful big data platform.