The goal of this project is to build a Docker cluster that gives access to Hadoop, HDFS, Hive, PySpark, Sqoop, Airflow, Kafka, Flume, Postgres, Cassandra, Hue, Zeppelin, Kadmin, Kafka Control Center and pgAdmin. This cluster is intended solely for use in a development environment. Do not use it to run production workloads.
☆77 · Feb 27, 2023 · Updated 3 years ago
Alternatives and similar repositories for Big-Data-Cluster
Users that are interested in Big-Data-Cluster are comparing it to the libraries listed below.
- Hadoop-Hive-Spark cluster + Jupyter on Docker ☆86 · Jan 2, 2025 · Updated last year
- Repository for building docker image, with open-source applications ☆26 · Apr 23, 2024 · Updated last year
- Demonstrating practical SQL skills through a curated portfolio of solved problems from top coding platforms. ☆51 · Mar 18, 2026 · Updated last week
- Run Hadoop Cluster within Docker Containers. ☆16 · Mar 6, 2025 · Updated last year
- This project demonstrates real-time data streaming and processing architecture using Kafka, Spark Streaming, and Debezium for capturing C… ☆13 · Oct 24, 2024 · Updated last year
- ☆13 · May 1, 2024 · Updated last year
- ☆13 · Mar 24, 2023 · Updated 3 years ago
- ☆145 · Dec 27, 2024 · Updated last year
- Docker Big Data Tools: This docker-compose file is configured to run multiple nodes. This is a Hadoop Cluster that contains the necessary… ☆31 · Jul 6, 2021 · Updated 4 years ago
- Base hadoop/spark/bigdata image with advanced config loading scripts. ☆11 · Nov 3, 2020 · Updated 5 years ago
- ☆16 · Jul 9, 2017 · Updated 8 years ago
- Delta-Lake, ETL, Spark, Airflow ☆48 · Oct 9, 2022 · Updated 3 years ago
- This repo is for generating data from existing dataset to a file or producing dataset rows as message to kafka in a streaming manner. ☆21 · Jun 13, 2024 · Updated last year
- Public Docker Images for popular services ☆50 · Sep 7, 2025 · Updated 6 months ago
- Here I will be exploring various tools and methods that are used in data engineering process with Python. ☆21 · Jan 4, 2021 · Updated 5 years ago
- Big Data Ecosystem Docker ☆427 · Apr 29, 2023 · Updated 2 years ago
- This repo was created for Machine Learning with Python training sessions held over 3-5 days. ☆20 · Oct 9, 2020 · Updated 5 years ago
- An end-to-end, containerized data pipeline for near-real-time user event analytics using Kafka, ClickHouse, Airflow, and PySpark. Made to… ☆56 · Sep 12, 2025 · Updated 6 months ago
- Spark application to consume kafka events generated by a python producer. ☆12 · Aug 7, 2021 · Updated 4 years ago
- English Amazigh dictionary using React.js, Data Extraction from a PDF Dictionary using python ☆16 · Sep 25, 2024 · Updated last year
- Marshmallow serializer integration with pyspark ☆12 · Dec 29, 2023 · Updated 2 years ago
- Airflow Examples: code samples for Medium articles ☆14 · Jan 10, 2021 · Updated 5 years ago
- Data engineering mentorship program ☆184 · Feb 21, 2026 · Updated last month
- ☆12 · May 27, 2024 · Updated last year
- Extract, transform, and load data for analytic processing using AWS Glue ☆17 · May 2, 2021 · Updated 4 years ago
- A Python PySpark Project with Poetry ☆27 · Feb 17, 2026 · Updated last month
- Project - Data Processing and Analysis in Python Course ☆39 · Oct 10, 2018 · Updated 7 years ago
- Dockerizing an Apache Spark Standalone Cluster ☆42 · Jun 29, 2022 · Updated 3 years ago
- Data pipeline for extracting, transforming, and visualising Covid-19 data ☆14 · Apr 23, 2023 · Updated 2 years ago
- On-premises ELT Pipeline ☆31 · Jul 10, 2025 · Updated 8 months ago
- A ready to go Big Data cluster (Hadoop + Hadoop Streaming + Spark + PySpark) with Docker and Docker Swarm! ☆23 · May 20, 2025 · Updated 10 months ago
- ☆16 · Apr 1, 2024 · Updated last year
- Use `outlines` generators with Haystack. ☆15 · Mar 16, 2026 · Updated last week
- ☆12 · Jul 27, 2021 · Updated 4 years ago
- Spark implementation of Slowly Changing Dimension type 2 ☆11 · Jan 8, 2019 · Updated 7 years ago
- A minimal docker compose setup for experimenting with cloud agnostic Lakehouse Architectures Apache Spark with Hive Metastore + Delta Lak… ☆34 · Apr 17, 2024 · Updated last year
- The repository for my talk titled the same ☆15 · Nov 20, 2019 · Updated 6 years ago
- Small data engineering tutorial ☆10 · Oct 24, 2018 · Updated 7 years ago
- Now updated prior to the version on CRAN. ☆14 · Jan 9, 2024 · Updated 2 years ago