jamesbyars/apache-spark-etl-pipeline-example

Demonstration of using Apache Spark to build robust ETL pipelines while taking advantage of open-source, general-purpose cluster computing.

☆24

Alternatives and similar repositories for apache-spark-etl-pipeline-example

Users that are interested in apache-spark-etl-pipeline-example are comparing it to the libraries listed below.

guidok91 / spark-movies-etl
Spark data pipeline that processes movie ratings data.
☆31 • Updated this week
yennanliu / spark-etl-pipeline
Various data stream/batch processing demos with Apache Spark in Scala 🚀
☆12 • Feb 28, 2020 • Updated 6 years ago
shravan-kuchkula / udacity-data-eng-proj4
Developed an ETL pipeline for a Data Lake that extracts data from S3, processes the data using Spark, and loads the data back into S3 as …
☆17 • Oct 1, 2019 • Updated 6 years ago
vectra-ai-research / pyspark-style-guide
Our style guide for writing readable and maintainable PySpark code.
☆17 • Dec 21, 2021 • Updated 4 years ago
vsouza / spark-kinesis-redshift
Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark
☆11 • May 22, 2018 • Updated 8 years ago
cloudera / cml-training
Example Python and R code for Cloudera Machine Learning (CML) training
☆14 • Dec 1, 2020 • Updated 5 years ago
avensolutions / spark-sql-etl-framework
Multi-stage, config driven, SQL based ETL framework using PySpark
☆26 • Sep 16, 2019 • Updated 6 years ago
gbraccialli / telco-cdr-monitoring
☆21 • Oct 6, 2016 • Updated 9 years ago
MicrosoftDocs / mslearn-cv-classify-bird-species
Data and source for Azure Computer Vision classify birds with Python SDK
☆11 • Jan 20, 2021 • Updated 5 years ago
supratim94336 / DataEngineeringCapstoneProject
😈 Complete end-to-end ETL pipeline with Spark, Airflow, & AWS
☆51 • Aug 23, 2019 • Updated 6 years ago
im-nsk / Building-an-Automated-Weather-Data-Pipeline-with-Airflow-From-Ingestion-to-Data-Warehouse
This project focuses on building a robust data pipeline using Apache Airflow to automate the ingestion of weather data from the OpenWeath…
☆22 • Feb 3, 2026 • Updated 5 months ago
big-data-lab-team / accident-prediction-montreal
☆12 • Dec 8, 2022 • Updated 3 years ago
hyunjoonbok / PySpark
PySpark functions and utilities with examples. Assists ETL process of data modeling
☆103 • Dec 3, 2020 • Updated 5 years ago
BFergerson / grakn-mythos
Sharable Grakn knowledge graphs
☆13 • Dec 28, 2022 • Updated 3 years ago
priye-1 / Real_time_End_to_End_Pipeline_using_Kafka
☆19 • May 27, 2023 • Updated 3 years ago
SourabhSinghRana / real-time_crypto_data_pipeline_using_kafka
I am using a Confluent Kafka cluster to produce and consume scraped data. In this project, I've created a real-time data pipeline that uti…
☆29 • May 2, 2023 • Updated 3 years ago
yennanliu / NYC_Taxi_Pipeline
Stream/batch system with Hadoop, Spark on NYC taxi data | #DE
☆26 • Apr 10, 2026 • Updated 3 months ago
tyo-nu / MINE-Database
Metabolic In silico Network Expansion (MINE) Database Construction and DB Logic
☆21 • Apr 21, 2026 • Updated 3 months ago
WiraDKP / pytorch_gru_speaker_diarization
Speaker Diarization using GRU in PyTorch
☆11 • Aug 29, 2020 • Updated 5 years ago
behindthelogics / EDA-Automobile-Dataset
How to get started with a machine learning or data science project: exploratory data analysis, step by step
☆12 • Oct 7, 2020 • Updated 5 years ago
vinniepsychosis / ETL-Apple-Health
This project involves an ETL (Extract, Transform, Load) process to analyze sleep data exported from Apple Health
☆29 • Apr 29, 2023 • Updated 3 years ago
panoramichq / dremio-bigquery-connector
BigQuery Data Connector for Dremio
☆12 • Sep 29, 2023 • Updated 2 years ago
crosslibs / transcribe-live-audio
Transcribe live audio using Google Cloud Speech to Text API
☆16 • Aug 14, 2018 • Updated 7 years ago
AdeboyeML / UK_Accident_Traffic_ETL_Pipeline
This is a capstone project that entails building an end-to-end ETL (Extract-Transform-Load) Data pipeline which extracts UK accident and …
☆18 • Jun 6, 2020 • Updated 6 years ago
ArpiteshSrivastava / spotify-data-engineering-project
In this project, we will build an ETL (Extract, Transform, Load) pipeline using the Spotify API on AWS. The pipeline will retrieve data fro…
☆25 • May 6, 2023 • Updated 3 years ago
berngp / mesos-spark-docker
Document and showcase how you can create Spark Applications which run inside Docker Containers using Apache Mesos.
☆28 • Feb 25, 2016 • Updated 10 years ago
EnesGokceDS / Amazon_Reviews_NLP_Capstone_Project
In this repository, you will find the entire NLP process, from scratch
☆16 • Sep 16, 2020 • Updated 5 years ago
renatogroffe / ASPNETCore2_Swagger
Example of using Swagger to document a REST API built with ASP.NET Core 2.0.
☆11 • Oct 5, 2017 • Updated 8 years ago
purushothamgowthu / lazyprogrammer-machine_learning_examples
☆16 • Aug 1, 2018 • Updated 7 years ago
combust / mleap-demo
Demonstration code for MLeap, both Jupyter notebooks and projects
☆24 • Aug 26, 2019 • Updated 6 years ago
pierrenodet / spark-ensemble
Ensemble Learning for Apache Spark 🌲
☆24 • Sep 3, 2024 • Updated last year
fastforwardlabs / cml_churn_demo_mlops
☆16 • May 1, 2023 • Updated 3 years ago
minerva-ml / steppy-toolkit
Curated set of transformers that make your work with steppy faster and more effective
☆23 • Nov 22, 2018 • Updated 7 years ago
PacktPublishing / Mastering-Elasticsearch-7.0
Mastering Elasticsearch 7.0, published by Packt
☆24 • Apr 30, 2023 • Updated 3 years ago
shravan-kuchkula / udacity-data-eng-proj2
A production-grade data pipeline has been designed to automate the parsing of user search patterns to analyze user engagement. Extract d…
☆24 • Nov 22, 2021 • Updated 4 years ago
ryfeus / s3-browser
Script that generates index.html files for an S3 bucket, making its contents browsable in a web browser.
☆13 • Feb 6, 2025 • Updated last year
shravan-kuchkula / udacity-data-eng-proj3
Built a stream processing data pipeline to get data from disparate systems into a dashboard using Kafka as an intermediary.
☆29 • Aug 14, 2023 • Updated 2 years ago
kalyanhadooptraining / kalyan-bigdata-realtime-projects
Big Data Real Time Projects
☆23 • Dec 4, 2017 • Updated 8 years ago
qooba / mlflow-feast
End to end mlflow with feast example
☆18 • May 18, 2021 • Updated 5 years ago