adaltas / spark-streaming-pyspark
Build and run Spark Structured Streaming pipelines on Hadoop using PySpark.
☆12 Updated 5 years ago
Related projects
Alternatives and complementary repositories for spark-streaming-pyspark
- Dockerizing an Apache Spark Standalone Cluster ☆43 Updated 2 years ago
- Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validatio… ☆53 Updated last year
- A curated list of awesome Databricks resources, including Spark ☆14 Updated 4 months ago
- An example CI/CD pipeline using GitHub Actions for doing continuous deployment of AWS Glue jobs built on PySpark and Jupyter Notebooks. ☆12 Updated 4 years ago
- A project for exploring how Great Expectations can be used to ensure data quality and validate batches within a data pipeline defined in … ☆21 Updated 2 years ago
- Full stack data engineering tools and infrastructure set-up ☆44 Updated 3 years ago
- Hadoop/Hive/Spark container to perform CI tests ☆11 Updated 3 years ago
- dbt (data build tool) projects targeting AWS analytics services (redshift, glue, emr, athena) and open table formats ☆25 Updated last year
- Example project for consuming an AWS Kinesis stream and saving the data to Amazon Redshift using Apache Spark ☆11 Updated 6 years ago
- Big Data Demystified meetup and blog examples ☆31 Updated 3 months ago
- ☆26 Updated 4 years ago
- PySpark boilerplate for running a production-ready data pipeline ☆28 Updated 3 years ago
- 📆 Run, schedule, and manage your dbt jobs using Kubernetes. ☆24 Updated 6 years ago
- ☆14 Updated 5 years ago
- Complete data engineering pipeline running on Minikube Kubernetes, Argo CD, Spark, Trino, S3, Delta lake, Postgres + Debezium CDC, MySQL,… ☆24 Updated 7 months ago
- Streaming Synthetic Sales Data Generator: Streaming sales data generator for Apache Kafka, written in Python ☆44 Updated last year
- ☆49 Updated 8 months ago
- ☆43 Updated 3 months ago
- Demonstration of using Apache Spark to build robust ETL pipelines while taking advantage of open source, general purpose cluster computin… ☆24 Updated last year
- ☆16 Updated last year
- dbt package for monitoring airflow DAGs and tasks ☆29 Updated this week
- Airflow training for the crunch conf ☆105 Updated 6 years ago
- Data validation library for PySpark 3.0.0 ☆34 Updated 2 years ago
- Nested Data (JSON/AVRO/XML) Parsing and Flattening in Spark ☆15 Updated 10 months ago
- Spark app to merge different schemas ☆23 Updated 3 years ago
- Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes ☆63 Updated 2 years ago
- ETL (Extract, Transform and Load) with the Spark Python API (PySpark) and Hadoop Distributed File System (HDFS) ☆14 Updated 5 years ago
- The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on… ☆26 Updated 2 years ago
- ☆23 Updated 3 years ago