mrugankray / Big-Data-Cluster
The goal of this project is to build a Docker cluster that provides access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center, and pgAdmin. This cluster is intended solely for use in a development environment; do not run any production workloads on it.
☆59 · Updated last year
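As an illustration of how a development-only stack like this is typically declared, the sketch below shows a minimal `docker-compose.yml` for a small subset of the listed services. The image names, tags, and environment variables are assumptions for illustration, not the repository's actual configuration.

```yaml
# Minimal development-only sketch; images and settings are illustrative
# assumptions, not taken from the Big-Data-Cluster repository.
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: dev_only_password   # development only, never production
    ports:
      - "5432:5432"

  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    depends_on:
      - zookeeper
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      # single-broker dev cluster, so replication factor 1
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
```

A stack defined this way would be started with `docker compose up -d` and torn down with `docker compose down`.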
Alternatives and similar repositories for Big-Data-Cluster:
Users interested in Big-Data-Cluster are comparing it to the repositories listed below.
- Docker with Airflow and Spark standalone cluster ☆247 · Updated last year
- End-to-end data engineering project with Kafka, Airflow, Spark, Postgres, and Docker ☆76 · Updated 5 months ago
- This repository contains the code for a real-time election voting system. The system is built using Python, Kafka, Spark Streaming, Postgr… ☆34 · Updated last year
- PySpark functions and utilities with examples, to assist the ETL process of data modeling ☆99 · Updated 4 years ago
- This project shows how to capture changes from a Postgres database and stream them into Kafka ☆31 · Updated 8 months ago
- Apache Spark 3 - Structured Streaming Course Material ☆121 · Updated last year
- Projects done in the Data Engineer Nanodegree Program by Udacity.com ☆105 · Updated 2 years ago
- Sample project to demonstrate data engineering best practices ☆175 · Updated 11 months ago
- Ultimate guide for mastering Spark performance tuning and optimization concepts and for preparing for data engineering interviews ☆101 · Updated 8 months ago
- End-to-end data engineering project ☆53 · Updated 2 years ago
- velib-v2: An ETL pipeline that employs batch and streaming jobs using Spark, Kafka, Airflow, and other tools, all orchestrated with Docke… ☆18 · Updated 4 months ago
- Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow ☆135 · Updated 4 years ago
- Hadoop-Hive-Spark cluster + Jupyter on Docker ☆65 · Updated 3 weeks ago
- A template repository to create a data project with IaC, CI/CD, data migrations, and testing ☆254 · Updated 6 months ago
- Create streaming data, transfer it to Kafka, modify it with PySpark, and load it into Elasticsearch and MinIO ☆59 · Updated last year
- Local Environment to Practice Data Engineering ☆136 · Updated last month
- Stream processing pipeline from the Finnhub websocket using Spark, Kafka, Kubernetes, and more ☆317 · Updated last year
- Resources for the preparation course for the Databricks Data Engineer Associate certification exam ☆326 · Updated last month
- Spark all the ETL Pipelines ☆32 · Updated last year
- Get data from an API, run a scheduled script with Airflow, send the data to Kafka, consume it with Spark, then write to Cassandra ☆131 · Updated last year
- Generate a synthetic Spotify music stream dataset to create dashboards. The Spotify API generates fake event data emitted to Kafka. Spark consu… ☆67 · Updated last year
- Big Data engineering practice project, including ETL with Airflow and Spark using AWS S3 and EMR ☆80 · Updated 5 years ago
- This repository will contain all of the resources for the Mage component of the Data Engineering Zoomcamp: https://github.com/DataTalksCl… ☆98 · Updated 5 months ago
- Simple ETL pipeline using Python ☆25 · Updated last year
- An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Ka… ☆224 · Updated last year