Notes on using Apache Spark, with a drill-down on Spark for Physics, how to run TPC-DS on PySpark, and how to create histograms with Spark. Also includes tools for stress testing, measuring CPU performance, and I/O latency heat maps, plus Jupyter notebook examples for using various database systems.
☆459 · Dec 15, 2025 · Updated 2 months ago
Alternatives and similar repositories for Miscellaneous
Users that are interested in Miscellaneous are comparing it to the libraries listed below
- This is the development repository for sparkMeasure, a tool and library designed for efficient analysis and troubleshooting of Apache Spa… ☆816 · Updated this week
- Spark-Dashboard is a solution for monitoring Apache Spark jobs. This repository provides the tooling and configuration for deploying an A… ☆134 · Jan 5, 2026 · Updated last month
- Code and examples of how to write and deploy Apache Spark Plugins. Spark plugins allow running custom code on the executors as they are in… ☆94 · May 9, 2025 · Updated 9 months ago
- Qubole Sparklens tool for performance tuning Apache Spark ☆590 · Jun 26, 2024 · Updated last year
- The Internals of Delta Lake ☆188 · Nov 30, 2025 · Updated 3 months ago
- Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines. ☆1,521 · Updated this week
- Bulletproof Apache Spark jobs with fast root cause analysis of failures. ☆73 · Mar 14, 2021 · Updated 4 years ago
- Helpers & syntactic sugar for PySpark. ☆62 · Dec 4, 2025 · Updated 3 months ago
- A library that provides useful extensions to Apache Spark and PySpark. ☆232 · Jan 20, 2026 · Updated last month
- Spark metrics related custom classes and sinks (e.g. Prometheus) ☆188 · Aug 2, 2022 · Updated 3 years ago
- A Spark-based data comparison tool at scale which facilitates software development engineers to compare a plethora of pair combinations o… ☆52 · Jun 17, 2025 · Updated 8 months ago
- Enabling Spark Optimization through Cross-stack Monitoring and Visualization ☆47 · Aug 23, 2017 · Updated 8 years ago
- Native SQL Engine plugin for Spark SQL with vectorized SIMD optimizations. ☆257 · Feb 21, 2023 · Updated 3 years ago
- Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive ☆187 · Oct 15, 2025 · Updated 4 months ago
- A dynamic data completeness and accuracy library at enterprise scale for Apache Spark ☆29 · Nov 4, 2024 · Updated last year
- The Internals of Spark SQL ☆486 · Jan 25, 2026 · Updated last month
- Sample processing code using Spark 2.1+ and Scala ☆51 · Jun 28, 2020 · Updated 5 years ago
- The Internals of Apache Spark ☆1,540 · Jul 5, 2025 · Updated 7 months ago
- Monitor Apache Spark from Jupyter Notebook ☆172 · May 16, 2022 · Updated 3 years ago
- Hadoop Profiler, or hprofiler, is a tool which is able to analyze on- and off-CPU workloads on distributed computing environments. ☆24 · Jul 7, 2016 · Updated 9 years ago
- Jupyter magics and kernels for working with remote Spark clusters ☆1,362 · Sep 9, 2025 · Updated 5 months ago
- Scala API for Apache Spark SQL high-order functions ☆14 · Aug 4, 2023 · Updated 2 years ago
- Splash, a flexible Spark shuffle manager that supports user-defined storage backends for shuffle data storage and exchange ☆130 · Dec 19, 2024 · Updated last year
- An open source indexing subsystem that brings index-based query acceleration to Apache Spark™ and big data workloads. ☆431 · Jan 14, 2022 · Updated 4 years ago
- JVM Profiler Sending Metrics to Kafka, Console Output or Custom Reporter ☆1,806 · Jul 12, 2025 · Updated 7 months ago
- Remote shuffle service for Apache Spark to store shuffle data on remote servers. ☆334 · Sep 29, 2023 · Updated 2 years ago
- Apache DataFusion Comet Spark Accelerator ☆1,148 · Updated this week
- Dr. Elephant is a job and flow-level performance monitoring and tuning tool for Apache Hadoop and Apache Spark ☆1,371 · Aug 22, 2023 · Updated 2 years ago
- Apache Celeborn is an elastic and high-performance service for shuffle and spilled data. ☆1,039 · Updated this week
- This is a mirror of https://github.com/LucaCanali/sparkMeasure - sparkMeasure is a tool for performance troubleshooting of Apache Spark w… ☆16 · Oct 3, 2025 · Updated 5 months ago
- ## Auto-archived due to inactivity. ## Simple JVM Profiler Using StatsD and Other Metrics Backends ☆15 · Oct 3, 2023 · Updated 2 years ago
- A framework for writing performant user-defined functions (UDFs) that are portable across a variety of engines including Apache Spark, Ap… ☆304 · Oct 30, 2025 · Updated 4 months ago
- PySpark test helper methods with beautiful error messages ☆753 · Feb 25, 2026 · Updated last week
- Spark RAPIDS plugin - accelerate Apache Spark with GPUs ☆965 · Updated this week
- A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source. ☆347 · May 31, 2024 · Updated last year
- Read and write Parquet in Scala. Use Scala classes as schema. No need to start a cluster. ☆300 · Jul 13, 2025 · Updated 7 months ago
- A tool to get better debug info on spark's memory usage ☆42 · Aug 21, 2019 · Updated 6 years ago
- REST job server for Apache Spark ☆2,843 · Jul 8, 2025 · Updated 7 months ago
- Fybrik platform - Arrow/Flight module ☆15 · Aug 10, 2024 · Updated last year