Automating Data Ingestion into Data Lakes from Various Data Sources

Issue

Building data pipelines and doing the related data engineering takes more than a third of the effort in all large reporting and data transformation projects. The work is time consuming because it requires either manually ingesting large amounts of data into a data lake or building individual ETL programs or engines. In addition, every new data source requires further manual ingestion effort.

Our Approach

DataBeat designed and built a data ingestion framework that is industry agnostic and scalable. It can automatically ingest any CSV or Excel file, data from relational databases, or unstructured data by reading the metadata of the file. The framework maintains data quality by running a series of data quality checks, and it captures audit information such as run statistics and error logs.
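
To make the idea of metadata-driven ingestion concrete, here is a minimal Python sketch. It is an illustration only and not DataBeat's implementation: the metadata layout (SOURCE_METADATA), the AuditRecord class, and the ingest_csv function are all hypothetical names, and the sketch assumes a CSV source with a header row. It shows how one generic routine could read column definitions from metadata, apply simple data quality checks (required values and type casts), and collect run statistics and error logs.

```python
import csv
import io
from dataclasses import dataclass, field

# Hypothetical metadata: one entry per source. Under this assumption,
# onboarding a new source means adding an entry here, not writing new code.
SOURCE_METADATA = {
    "orders": {
        "format": "csv",
        "columns": {"order_id": int, "customer": str, "amount": float},
        "required": ["order_id", "amount"],
    },
}


@dataclass
class AuditRecord:
    """Run statistics and error log for one ingestion run (illustrative)."""
    source: str
    rows_read: int = 0
    rows_loaded: int = 0
    rows_rejected: int = 0
    errors: list = field(default_factory=list)


def ingest_csv(source, text, metadata=SOURCE_METADATA):
    """Parse CSV text for `source` using its metadata.

    Returns (records, audit): records are dicts in a common format,
    audit holds the counts and per-row error messages.
    """
    meta = metadata[source]
    audit = AuditRecord(source=source)
    loaded = []
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        audit.rows_read += 1
        try:
            record = {}
            for column, cast in meta["columns"].items():
                raw = (row.get(column) or "").strip()
                if raw == "":
                    # Quality check 1: required columns must have a value.
                    if column in meta["required"]:
                        raise ValueError(f"missing required value for {column!r}")
                    record[column] = None
                else:
                    # Quality check 2: the value must cast to the declared type.
                    record[column] = cast(raw)
        except ValueError as exc:
            audit.rows_rejected += 1
            audit.errors.append(f"line {line_no}: {exc}")
            continue
        loaded.append(record)
        audit.rows_loaded += 1
    return loaded, audit


if __name__ == "__main__":
    sample = "order_id,customer,amount\n1,Acme,10.5\n2,,abc\n,Beta,3\n3,,7\n"
    rows, audit = ingest_csv("orders", sample)
    print(rows)
    print(audit)
```

In this sketch, rejected rows are logged rather than aborting the run, so the audit record can later be compared against source row counts. A real framework would also need readers for Excel files, relational databases, and unstructured data, which are omitted here.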

Impact
  • Reduces ingestion time from days to minutes by automating the ingestion of data from multiple data sources into a common data lake
  • Provides a generic audit, balance, and control framework along with lineage tracking
  • Stores different types of source data (CSV, Excel, tables) in a common format, which makes downstream consumption easier, and requires only incremental effort (i.e., adding metadata) to add a new data source