DIYBigData / pyspark-benchmark
A lightweight benchmark utility for PySpark
☆16 · Updated 4 years ago
Related projects
Alternatives and complementary repositories for pyspark-benchmark
- Various data stream/batch process demo with Apache Scala Spark ☆11 · Updated 4 years ago
- A collection of data analysis projects done using PySpark via Jupyter notebooks. ☆10 · Updated 2 years ago
- This repository contains code for Spark Streaming ☆21 · Updated 3 years ago
- Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validatio… ☆53 · Updated last year
- Interactive Notebooks that support the book ☆38 · Updated 4 years ago
- Repository used for Spark Trainings ☆53 · Updated last year
- XGBoost GPU accelerated on Spark example applications ☆52 · Updated 2 years ago
- Tools for building, packaging, and OAP public cloud integrations such as AWS EMR, Google Dataproc and K8S. ☆16 · Updated 7 months ago
- PySpark Cheatsheet ☆35 · Updated last year
- A repository for a PySpark Cookbook by Tomasz Drabas and Denny Lee ☆60 · Updated 6 years ago
- Spark and Delta Lake Workshop ☆22 · Updated 2 years ago
- Spark-Radiant is Apache Spark Performance and Cost Optimizer ☆25 · Updated 2 years ago
- Flowchart for debugging Spark applications ☆101 · Updated last month
- O'Reilly Book: [Data Algorithms with Spark] by Mahmoud Parsian ☆209 · Updated last year
- [ARCHIVED] Moved to github.com/NVIDIA/spark-xgboost-examples ☆70 · Updated 4 years ago
- Includes notes on using Apache Spark in general, notes on using Spark for Physics, how to run TPCDS on PySpark, how to create histograms … ☆424 · Updated 2 months ago
- A repo for all spark examples using Rapids Accelerator including ETL, ML/DL, etc. ☆127 · Updated this week
- Source code for the MC technical blog post "Data Observability in Practice Using SQL" ☆36 · Updated 4 months ago
- ☆49 · Updated 8 months ago
- Presentation about Pyspark and how Arrow makes it faster ☆22 · Updated 4 years ago
- A tutorial on how to get started with Presto. ☆56 · Updated 2 years ago
- PySpark-ETL ☆23 · Updated 4 years ago
- A real-time streaming ETL pipeline for streaming and performing sentiment analysis on Twitter data using Apache Kafka, Apache Spark and D… ☆29 · Updated 4 years ago
- Because it's never late to start taking notes and 'public' it... ☆60 · Updated this week
- ☆22 · Updated 2 years ago
- Use the TPC-DS benchmark to test Spark SQL performance ☆175 · Updated 4 years ago
- The Internals of PySpark ☆25 · Updated 2 months ago