anthonywong611/Batch-ETL-with-AWS-EMR-and-MWAA

Create a data pipeline on AWS that runs batch processing on a Spark cluster provisioned by Amazon EMR. The ETL is orchestrated with managed Airflow (MWAA): it extracts data from S3, transforms the data using Spark, and loads the transformed data back to S3.
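
As a rough illustration of the extract-transform-load flow described above, the sketch below builds the step definition that tells an EMR cluster to run a PySpark script with `spark-submit`. It is not taken from the repository: the bucket name, script path, `--input`/`--output` arguments, and the `build_spark_step` helper are all hypothetical, chosen only to show the general shape of such a step.

```python
from typing import Any, Dict, List


def build_spark_step(name: str, script_uri: str,
                     input_uri: str, output_uri: str) -> Dict[str, Any]:
    """Return one EMR step definition that runs a PySpark script via spark-submit.

    The --input/--output flags are assumptions: they only work if the
    (hypothetical) script parses them itself.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR's generic command launcher
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,               # extract/transform/load logic lives here
                "--input", input_uri,     # raw data read from S3
                "--output", output_uri,   # transformed data written back to S3
            ],
        },
    }


BUCKET = "s3://example-etl-bucket"  # hypothetical bucket name

steps: List[Dict[str, Any]] = [
    build_spark_step(
        name="transform_raw_data",
        script_uri=f"{BUCKET}/scripts/transform.py",
        input_uri=f"{BUCKET}/raw/",
        output_uri=f"{BUCKET}/processed/",
    )
]

print(steps[0]["HadoopJarStep"]["Args"])
```

A list shaped like `steps` is what boto3's EMR client accepts in `add_job_flow_steps(JobFlowId=..., Steps=steps)`, and what the `steps` argument of Airflow's `EmrAddStepsOperator` expects, so an orchestrator such as MWAA could submit it to a running cluster.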

☆10

Alternatives and similar repositories for Batch-ETL-with-AWS-EMR-and-MWAA

Users that are interested in Batch-ETL-with-AWS-EMR-and-MWAA are comparing it to the libraries listed below.

EmilDanielsson / Player-Rating-Project
This project aims to rate football players using data and statistics recorded from the last match they participated in. Much of the code …
☆12 · Nov 22, 2021 · Updated 4 years ago
guidok91 / spark-movies-etl
Spark data pipeline that processes movie ratings data.
☆31 · Updated this week
pran4ajith / spark-twitter-streaming
A real-time streaming ETL pipeline for streaming and performing sentiment analysis on Twitter data using Apache Kafka, Apache Spark and D…
☆29 · Aug 8, 2020 · Updated 5 years ago
andreyshelopugin / GlickoSoccer
☆22 · Jul 29, 2024 · Updated 2 years ago
chongjason914 / forage-anz
Solution to Data at ANZ virtual internship on Forage
☆10 · May 30, 2021 · Updated 5 years ago
vsouza / spark-kinesis-redshift
Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark
☆11 · May 22, 2018 · Updated 8 years ago
im-nsk / Building-an-Automated-Weather-Data-Pipeline-with-Airflow-From-Ingestion-to-Data-Warehouse
This project focuses on building a robust data pipeline using Apache Airflow to automate the ingestion of weather data from the OpenWeath…
☆22 · Feb 3, 2026 · Updated 5 months ago
adityashrm21 / image-segmentation-pytorch
Image Segmentation using Fully Convolutional Networks in PyTorch
☆11 · May 16, 2019 · Updated 7 years ago
soccermatics / twelve-gpt-educational
☆36 · Feb 19, 2026 · Updated 5 months ago
nsoria1 / udacity-data-engineering
Repository created to host udacity data engineer exercises
☆11 · Mar 1, 2026 · Updated 4 months ago
morganmazouchi / Delta-Live-Tables-Hands-on-Workshop
Delta Live Tables Workshop Resources
☆17 · Feb 24, 2023 · Updated 3 years ago
shahidzikria / ADD-Net
Alzheimer’s Disease (AD) is a neurological brain disorder marked by dementia and neurological dysfunction that affects memory, behavioral…
☆16 · Aug 28, 2022 · Updated 3 years ago
jddunn / dementia-progression-analysis
Alzheimer's / dementia progression classifier for MRIs using CNNs and transfer learning
☆19 · Jan 22, 2018 · Updated 8 years ago
james-yap / color-palette
Online Color Palette (https://youtu.be/ig91zc-ERSE)
☆16 · Jun 24, 2022 · Updated 4 years ago
AnandDedha / aws-airflow-dataengineering-pipeline
☆21 · Jan 13, 2024 · Updated 2 years ago
jerzygangi / forklift
🚚 ETL for Spark and Airflow
☆25 · Mar 19, 2018 · Updated 8 years ago
shravan-kuchkula / dataEngineering
A repo to track data engineering projects
☆14 · Nov 11, 2022 · Updated 3 years ago
gregbeaumont / PowerPopHealth
Power Pop Health is a collection of content intended to simplify the process of ingesting and prepping Healthcare Open Data using Azure d…
☆18 · May 23, 2022 · Updated 4 years ago
chuqiaoshen / Git-Influencer
Insight Data Engineering project: A platform built in HDFS, Spark and Airflow to help you to find social influencers from GitHub Net…
☆16 · May 21, 2024 · Updated 2 years ago
AuFeld / Data_Engineering_Projects
A collection of data engineering projects: data modeling, ETL pipelines, data lakes, infrastructure configuration on AWS, data warehousin…
☆15 · Apr 29, 2021 · Updated 5 years ago
asatrya / airflow-etl-learn
This is a simple ETL using Airflow. First, we fetch data from API (extract). Then, we drop unused columns, convert to CSV, and validate (…
☆24 · Oct 12, 2019 · Updated 6 years ago
szymonzaczek / databricks-ci-cd
Databricks CI/CD using Azure DevOps
☆21 · Nov 1, 2022 · Updated 3 years ago
ozzieliu / python-tutorials
Tutorials of data science concepts and packages in Python
☆21 · Feb 11, 2016 · Updated 10 years ago
kunal333 / E2ESynapseDemo
☆27 · Mar 7, 2022 · Updated 4 years ago
tensorinfinitysip / a-PyTorch-Project-to-Transfer-Learning
Image Classification with transfer learning | a PyTorch Tutorial to Transfer Learning
☆21 · Jul 25, 2024 · Updated 2 years ago
AmadeusITGroup / spark-perf-hikes
Performance Hikes for Apache Spark
☆31 · May 20, 2026 · Updated 2 months ago
shravan-kuchkula / udacity-data-eng-proj4
Developed an ETL pipeline for a Data Lake that extracts data from S3, processes the data using Spark, and loads the data back into S3 as …
☆17 · Oct 1, 2019 · Updated 6 years ago
hyunjoonbok / PySpark
PySpark functions and utilities with examples. Assists ETL process of data modeling
☆103 · Dec 3, 2020 · Updated 5 years ago
federicorabanos / futbol-data-visualizacion
Repository with basic files and documents for football data visualization in Python
☆77 · May 26, 2025 · Updated last year
jonathanhayes / Tweepy-Twitter-Stream-Example
Tweepy Stream Example
☆19 · Apr 23, 2019 · Updated 7 years ago
waq-r / CodeSignal-Databases
CodeSignal CodeFights SQL Database queries
☆23 · Dec 26, 2019 · Updated 6 years ago
hadjdeh / football-data-analysis
Football Data Processing & Visualization
☆92 · Jun 1, 2024 · Updated 2 years ago
alanchn31 / Loan-Default-Prediction
Loan Default Prediction using PySpark, with jobs scheduled by Apache Airflow and Integration with Spark using Apache Livy
☆22 · Dec 26, 2020 · Updated 5 years ago
Hongclass / ITMD526
ITMD - 526 Data Warehousing
☆30 · May 9, 2016 · Updated 10 years ago
israel-dryer / Amazon-Scraper
A webscraper that captures search results data from www.amazon.com
☆25 · Dec 21, 2020 · Updated 5 years ago
AdeboyeML / UK_Accident_Traffic_ETL_Pipeline
This is a capstone project that entails building an end-to-end ETL (Extract-Transform-Load) Data pipeline which extracts UK accident and …
☆18 · Jun 6, 2020 · Updated 6 years ago
cmu-db / dbgym
Infrastructure for researching self-driving databases
☆32 · Jul 2, 2025 · Updated last year
ansin218 / pydata-london-2019
Tutorials and talks held from PyData London 2019
☆12 · Nov 22, 2022 · Updated 3 years ago
mattmurray / premier_league
Data and regressions on Premier League teams from 2000-01 through to 2016-17
☆11 · Jul 31, 2017 · Updated 8 years ago