DIYBigData / pyspark-benchmark
A lightweight benchmark utility for PySpark
☆20 Updated 6 years ago
Alternatives and similar repositories for pyspark-benchmark
Users who are interested in pyspark-benchmark are comparing it to the libraries listed below.
- Developed a data pipeline to automate data warehouse ETL by building custom airflow operators that handle the extraction, transformation,… ☆89 Updated 4 years ago
- A collection of data analysis projects done using PySpark via Jupyter notebooks. ☆10 Updated 3 years ago
- Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validatio… ☆56 Updated 2 years ago
- In-Memory Analytics with Apache Arrow, published by Packt ☆104 Updated last week
- A tutorial on how to get started with Presto. ☆55 Updated 4 years ago
- Resource for the book Trino: The Definitive Guide (and formerly Presto: The Definitive Guide) ☆231 Updated 3 years ago
- Dockerizing an Apache Spark Standalone Cluster ☆42 Updated 3 years ago
- One click deploy docker-compose with Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager) ☆120 Updated 4 years ago
- O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian ☆228 Updated 2 years ago
- Includes notes on using Apache Spark, with drill down on Spark for Physics, how to run TPCDS on PySpark, how to create histograms with S… ☆457 Updated last month
- Simple stream processing pipeline ☆110 Updated last year
- The source code for the book Modern Data Engineering with Apache Spark ☆39 Updated 3 years ago
- ☆110 Updated last year
- ☆90 Updated 3 years ago
- Magic to help Spark pipelines upgrade ☆34 Updated last year
- How to manage Slowly Changing Dimensions with Apache Hive ☆55 Updated 6 years ago
- A repo for all spark examples using Rapids Accelerator including ETL, ML/DL, etc. ☆167 Updated 2 weeks ago
- A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for … ☆141 Updated 5 years ago
- My Study guide used to pass the CRT020 Spark Certification exam ☆34 Updated 6 years ago
- The Internals of PySpark ☆27 Updated last year
- Tools for building, packaging, and OAP public cloud integrations such as AWS EMR, Google Dataproc and K8S. ☆18 Updated last year
- Benchmark data warehouses under Fivetran-like conditions ☆171 Updated 3 years ago
- Apache Spark Course Material ☆96 Updated 2 years ago
- ☆65 Updated last year
- PySpark Algorithms Book: https://www.amazon.com/dp/B07X4B2218/ref=sr_1_2 ☆88 Updated 6 years ago
- Classwork projects and home works done through Udacity data engineering nano degree ☆75 Updated 2 years ago
- Weekly Data Engineering Newsletter ☆96 Updated last year
- Data Engineering with Spark and Delta Lake ☆106 Updated 3 years ago
- [ARCHIVED] Moved to github.com/NVIDIA/spark-xgboost-examples ☆72 Updated 5 years ago
- Delta Lake examples ☆238 Updated last year