Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, along with SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
☆56 · May 6, 2023 · Updated 2 years ago
Alternatives and similar repositories for datapipelines-essentials-python
Users that are interested in datapipelines-essentials-python are comparing it to the libraries listed below
- A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousin… ☆15 · Apr 29, 2021 · Updated 4 years ago
- Developed an ETL pipeline for a Data Lake that extracts data from S3, processes the data using Spark, and loads the data back into S3 as … ☆17 · Oct 1, 2019 · Updated 6 years ago
- Contains solutions to interview questions ☆12 · May 18, 2018 · Updated 7 years ago
- A collection of Python utility functions ☆11 · Feb 11, 2026 · Updated 2 weeks ago
- Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark ☆11 · May 22, 2018 · Updated 7 years ago
- ☆26 · Jul 9, 2023 · Updated 2 years ago
- An ETL pipeline that extracts data from S3, stages them in Redshift, and transforms data into a set of dimensional tables ☆15 · May 5, 2020 · Updated 5 years ago
- Jupyter Notebook showing how to process Telecom datasets using PySpark (SparkSQL and DataFrames) and plotting the results using Matplotli… ☆16 · Dec 3, 2018 · Updated 7 years ago
- ETL using Python in Jupyter Notebook, loading CSV, cleaning data, and saving to SQL Database. ☆14 · Nov 17, 2020 · Updated 5 years ago
- Apache Spark 3 - Structured Streaming Course Material ☆126 · Aug 19, 2023 · Updated 2 years ago
- Simple ETL pipeline using Python ☆29 · May 22, 2023 · Updated 2 years ago
- This repository will help you to learn about databricks concept with the help of examples. It will include all the important topics which… ☆104 · Sep 26, 2025 · Updated 5 months ago
- 🔍Your Data Quality Detector / Gain insight into your data and get it ready for use before you start working with it 💡📊🛠💎 ☆16 · Aug 26, 2022 · Updated 3 years ago
- sparkql: Apache Spark SQL DataFrame schema management for sensible humans ☆12 · Sep 18, 2023 · Updated 2 years ago
- A micro JSON-based Data Store inspired by MongoDB. ☆13 · Feb 9, 2026 · Updated 2 weeks ago
- This is a compilation of Data Governance resources, examples, models and communities ☆19 · Apr 16, 2019 · Updated 6 years ago
- A repo to track data engineering projects ☆13 · Nov 11, 2022 · Updated 3 years ago
- Python and AirFlow - Data Pipeline Orchestration ☆16 · Aug 3, 2023 · Updated 2 years ago
- Extensible streaming ingestion pipeline on top of Apache Spark ☆46 · Jul 17, 2025 · Updated 7 months ago
- ☆20 · May 23, 2024 · Updated last year
- Data Lineage Tracing Library ☆23 · Nov 30, 2021 · Updated 4 years ago
- Data structures & algorithms implemented in Java and solutions to leetcode problems. ☆16 · Mar 18, 2024 · Updated last year
- Implementing best practices for PySpark ETL jobs and applications. ☆2,074 · Jan 1, 2023 · Updated 3 years ago
- Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow ☆163 · Jun 16, 2020 · Updated 5 years ago
- 😈Complete End to End ETL Pipeline with Spark, Airflow, & AWS ☆51 · Aug 23, 2019 · Updated 6 years ago
- Repository used for Spark Trainings ☆54 · Apr 21, 2023 · Updated 2 years ago
- Demonstration of using Apache Spark to build robust ETL pipelines while taking advantage of open source, general purpose cluster computin… ☆24 · Aug 11, 2023 · Updated 2 years ago
- A Scalable Data Cleaning Library for PySpark. ☆29 · Apr 4, 2019 · Updated 6 years ago
- Udacity Data Engineering Nanodegree Program ☆53 · Mar 4, 2021 · Updated 4 years ago
- A project with examples of using few commonly used data manipulation/processing/transformation APIs in Apache Spark 2.0.0 ☆25 · Aug 5, 2021 · Updated 4 years ago
- Stream/batch system with Hadoop, Spark on NYC taxi data | #DE ☆26 · Sep 27, 2025 · Updated 5 months ago
- This solution helps you deploy ETL jobs on data lake using CDK Pipelines. ☆69 · Aug 9, 2022 · Updated 3 years ago
- A production-grade data pipeline has been designed to automate the parsing of user search patterns to analyze user engagement. Extract d… ☆24 · Nov 22, 2021 · Updated 4 years ago
- Demo converting streamlit uber nyc rides to use duckdb ☆30 · Apr 9, 2023 · Updated 2 years ago
- All Data Engineering notebooks from Datacamp course ☆116 · Dec 11, 2019 · Updated 6 years ago
- Smart Automation Tool for building modern Data Lakes and Data Pipelines ☆122 · Updated this week
- ☆10 · Jun 29, 2021 · Updated 4 years ago
- ☆12 · May 28, 2024 · Updated last year
- A dynamic data completeness and accuracy library at enterprise scale for Apache Spark ☆29 · Nov 4, 2024 · Updated last year