Demonstration of using Apache Spark to build robust ETL pipelines while taking advantage of open source, general purpose cluster computing.
☆24Aug 11, 2023Updated 2 years ago
Alternatives and similar repositories for apache-spark-etl-pipeline-example
Users that are interested in apache-spark-etl-pipeline-example are comparing it to the libraries listed below.
- Spark data pipeline that processes movie ratings data.☆31May 1, 2026Updated 3 weeks ago
- Various data stream/batch processing demos with Apache Spark in Scala 🚀☆12Feb 28, 2020Updated 6 years ago
- Developed an ETL pipeline for a Data Lake that extracts data from S3, processes the data using Spark, and loads the data back into S3 as …☆17Oct 1, 2019Updated 6 years ago
- ☆16Sep 17, 2017Updated 8 years ago
- Our style guide for writing readable and maintainable PySpark code.☆17Dec 21, 2021Updated 4 years ago
- Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark☆11May 22, 2018Updated 8 years ago
- Data and source for Azure Computer Vision classify birds with Python SDK☆11Jan 20, 2021Updated 5 years ago
- Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validatio…☆56May 6, 2023Updated 3 years ago
- Insight Data Engineering project: A platform built in HDFS, Spark and Airflow to help you to find social influencers from GitHub Net…☆16May 21, 2024Updated 2 years ago
- ☆16May 29, 2023Updated 3 years ago
- RedditR for Content Engagement and Recommendation☆18Dec 21, 2017Updated 8 years ago
- This project focuses on building a robust data pipeline using Apache Airflow to automate the ingestion of weather data from the OpenWeath…☆22Feb 3, 2026Updated 3 months ago
- PySpark functions and utilities with examples. Assists ETL process of data modeling☆104Dec 3, 2020Updated 5 years ago
- A batch processing data pipeline, using AWS resources (S3, EMR, Redshift, EC2, IAM), provisioned via Terraform, and orchestrated from loc…☆24May 14, 2022Updated 4 years ago
- I am using a Confluent Kafka cluster to produce and consume scraped data. In this project, I've created a real-time data pipeline that uti…☆29May 2, 2023Updated 3 years ago
- Welcome to my data engineering projects repository! Here you will find a collection of data engineering projects that I have worked on.☆24Apr 27, 2023Updated 3 years ago
- Sharable Grakn knowledge graphs☆14Dec 28, 2022Updated 3 years ago
- A data engineering project with Airflow, dbt, Terraform, GCP and much more!☆26Nov 8, 2022Updated 3 years ago
- Loan Default Prediction using PySpark, with jobs scheduled by Apache Airflow and Integration with Spark using Apache Livy☆22Dec 26, 2020Updated 5 years ago
- Speaker Diarization using GRU in PyTorch☆11Aug 29, 2020Updated 5 years ago
- This project involves an ETL (Extract, Transform, Load) process to analyze sleep data exported from Apple Health☆29Apr 29, 2023Updated 3 years ago
- BigQuery Data Connector for Dremio☆12Sep 29, 2023Updated 2 years ago
- Detailed notes and code to learn machine learning with Apache Spark.☆12Sep 24, 2018Updated 7 years ago
- Transcribe live audio using Google Cloud Speech to Text API☆16Aug 14, 2018Updated 7 years ago
- In this project, we will build an ETL (Extract, Transform, Load) pipeline using the Spotify API on AWS. The pipeline will retrieve data fro…☆25May 6, 2023Updated 3 years ago
- This is a capstone project that entails building an end-to-end ETL (Extract-Transform-Load) Data pipeline which extracts UK accident and …☆18Jun 6, 2020Updated 5 years ago
- Speaker Diarization is the first step in many early audio processing tasks and aims to solve the problem of "who spoke when". It therefore relies …☆12Dec 7, 2018Updated 7 years ago
- Example of using Swagger to document a REST API built with ASP.NET Core 2.0.☆11Oct 5, 2017Updated 8 years ago
- A production-grade data pipeline has been designed to automate the parsing of user search patterns to analyze user engagement. Extract d…☆24Nov 22, 2021Updated 4 years ago
- Document and showcase how you can create Spark Applications which run inside Docker Containers using Apache Mesos.☆28Feb 25, 2016Updated 10 years ago
- Developed a data pipeline to automate data warehouse ETL by building custom airflow operators that handle the extraction, transformation,…☆89Nov 22, 2021Updated 4 years ago
- This is the repository for my version of Kaldi for Dummies example.☆17Nov 18, 2018Updated 7 years ago
- A place for ACES employees to post notes on conferences they attended☆17Nov 14, 2016Updated 9 years ago
- Script that generates index.html files for an S3 bucket, enabling browsing in a web browser.☆13Feb 6, 2025Updated last year
- Analytics projects using Big Data eco-systems (Hadoop, Spark, Storm)☆17Dec 27, 2021Updated 4 years ago
- ☆16Aug 1, 2018Updated 7 years ago
- Spark on Kubernetes infrastructure Docker images repo☆37Oct 20, 2022Updated 3 years ago
- In this repository, you will find the full NLP process from scratch☆16Sep 16, 2020Updated 5 years ago
- A plugin to Apache Airflow to allow you to run Zip and UnZip commands as an Operator☆12Jul 26, 2023Updated 2 years ago