yahwang / Awesome-Data-Engineering
(GitBook) A curated list of awesome Data Engineering resources
☆33 · Updated 2 months ago
Related projects:
- Playground for Lakehouse (Iceberg, Hudi, Spark, Flink, Trino, DBT, Airflow, Kafka, Debezium CDC) ☆40 · Updated 11 months ago
- Data engineering interviews Q&A for data community by data community ☆60 · Updated 4 years ago
- Design/Implement stream/batch architecture on NYC taxi data | #DE ☆26 · Updated 3 years ago
- A real-time event pipeline around Kafka Ecosystem for Chicago Transit Authority. ☆29 · Updated last year
- ☆35 · Updated 2 months ago
- Weekly Data Engineering Newsletter ☆93 · Updated 2 months ago
- ☆48 · Updated 2 years ago
- (project & tutorial) dag pipeline tests + ci/cd setup ☆84 · Updated 3 years ago
- Projects done in the Data Engineer Nanodegree Program by Udacity.com ☆83 · Updated last year
- Mastering Big Data Analytics with PySpark, Published by Packt ☆153 · Updated last month
- Code snippets for Data Engineering Design Patterns book ☆27 · Updated this week
- Developed a data pipeline to automate data warehouse ETL by building custom airflow operators that handle the extraction, transformation,… ☆89 · Updated 2 years ago
- ☆44 · Updated 2 years ago
- I am using confluent Kafka cluster to produce and consume scraped data. In this project, I've created a real-time data pipeline that uti… ☆28 · Updated last year
- Create a streaming data, transfer it to Kafka, modify it with PySpark, take it to ElasticSearch and MinIO ☆56 · Updated last year
- Spark data pipeline that processes movie ratings data. ☆26 · Updated last month
- Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validatio … ☆53 · Updated last year
- PySpark Cheatsheet ☆35 · Updated last year
- Full stack data engineering tools and infrastructure set-up ☆38 · Updated 3 years ago
- A Snowflake GPT Demo using SqlAlchemy ☆23 · Updated last year
- RedditR for Content Engagement and Recommendation ☆20 · Updated 6 years ago
- A data pipeline moving data from a Relational database system (RDBMS) to a Hadoop file system (HDFS). ☆15 · Updated 3 years ago
- Delta-Lake, ETL, Spark, Airflow ☆42 · Updated last year
- Dockerizing an Apache Spark Standalone Cluster ☆43 · Updated 2 years ago
- Simple stream processing pipeline ☆89 · Updated 3 months ago
- ☆7 · Updated 2 years ago
- PySpark data-pipeline testing and CICD ☆28 · Updated 3 years ago
- A repository of sample code to show data quality checking best practices using Airflow. ☆71 · Updated last year
- This repo contains commands that data engineers use in day to day work. ☆58 · Updated last year
- Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow ☆127 · Updated 4 years ago
- A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from loc… ☆20 · Updated 2 years ago