adaltas / spark-streaming-pyspark

Build and run Spark Structured Streaming pipelines in Hadoop - project using PySpark.

☆13

Alternatives and similar repositories for spark-streaming-pyspark

Users who are interested in spark-streaming-pyspark are comparing it to the libraries listed below.

Wittline / apache-spark-docker
Dockerizing an Apache Spark Standalone Cluster
☆43 Updated 3 years ago
astronomer / airflow-data-quality-demo
A repository of sample code to show data quality checking best practices using Airflow.
☆77 Updated 2 years ago
sibytes / yetl
Yet Another (Spark) ETL Framework
☆21 Updated last year
vincentteyssier / apache-beam-tutorial
☆20 Updated 5 years ago
idealo / terraform-emr-pyspark
Quickstart PySpark with Anaconda on AWS/EMR using Terraform
☆47 Updated 5 months ago
microsoft / MonitoFi
MonitoFi: Health & Performance Monitor for your Apache NiFi
☆64 Updated last year
ssp-data / data-engineering-devops
Full stack data engineering tools and infrastructure set-up
☆53 Updated 4 years ago
Aiven-Labs / python-fake-data-producer-for-apache-kafka
The Python fake data producer for Apache Kafka® is a complete demo app allowing you to quickly produce JSON fake streaming datasets and …
☆85 Updated last year
godatadriven / airflow-helm
☆11 Updated 5 years ago
TomLous / medium-spark-k8s
Spark on Kubernetes using Helm
☆34 Updated 5 years ago
mehd-io / pyspark-boilerplate-mehdio
Pyspark boilerplate for running prod ready data pipeline
☆28 Updated 4 years ago
guidok91 / spark-movies-etl
Spark data pipeline that processes movie ratings data.
☆28 Updated last week
vim89 / datapipelines-essentials-python
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validatio…
☆55 Updated 2 years ago
RealKinetic / aws-glue-pipeline-example
An example CI/CD pipeline using GitHub Actions for doing continuous deployment of AWS Glue jobs built on PySpark and Jupyter Notebooks.
☆12 Updated 4 years ago
monte-carlo-data / data-observability-in-practice
Source code for the MC technical blog post "Data Observability in Practice Using SQL"
☆38 Updated 11 months ago
NeerajBhadani / bigdata-ml
☆23 Updated 2 years ago
avensolutions / spark-sql-etl-framework
Multi-stage, config driven, SQL based ETL framework using PySpark
☆25 Updated 5 years ago
sahilbhange / hive-sql-slowly-changing-dimension
Slowly Changing Dimension type 2 using Hive query language using exclusive join technique with ORC Hive tables, partitioned and clustered…
☆16 Updated 6 years ago
vincentclaes / glue-devcontainer
Glue VSCode devcontainer setup
☆14 Updated 2 years ago
ismaildawoodjee / GreatEx
A project for exploring how Great Expectations can be used to ensure data quality and validate batches within a data pipeline defined in …
☆21 Updated 2 years ago
velascoluis / dbt-ci-cd-gke
CICD pipeline that deploys a dbt image on a GKE cluster
☆11 Updated 3 years ago
itversity / data-engineering-spark
☆87 Updated 2 years ago
rvilla87 / ETL-PySpark
ETL (Extract, Transform and Load) with the Spark Python API (PySpark) and Hadoop Distributed File System (HDFS)
☆17 Updated 6 years ago
paiml / awsbigdata
AWS Big Data Certification
☆25 Updated 5 months ago
ongxuanhong / de02-pyspark-optimization
☆14 Updated 2 years ago
mikulskibartosz / check-engine
Data validation library for PySpark 3.0.0
☆33 Updated 2 years ago
rogeriomm / labtools-k8s
Complete data engineering pipeline running on Minikube Kubernetes, Argo CD, Spark, Trino, S3, Delta lake, Postgres+ Debezium CDC, MySQL,…
☆29 Updated last month
kaantas / spark-twitter-sentiment-analysis
Sentiment Analysis of a Twitter Topic with Spark Structured Streaming
☆55 Updated 6 years ago
josephmachado / e2e_datapipeline_test
Example repo to create end to end tests for data pipeline.
☆25 Updated last year
Wittline / pyspark-on-aws-emr
The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on…
☆27 Updated 3 years ago