KeithSSmith/spark-compaction

File compaction tool that runs on top of the Spark framework.

☆59

Alternatives and similar repositories for spark-compaction

Users who are interested in spark-compaction are comparing it to the libraries listed below.

asdaraujo / filecrush
Remedy small files by combining them into larger ones.
☆23 · Oct 31, 2018 · Updated 7 years ago
ExpediaGroup / datasqueeze
Hadoop utility to compact small files
☆18 · Feb 16, 2026 · Updated 5 months ago
hbutani / icebergSQL
Integration of Iceberg table management into Spark SQL
☆11 · Jan 21, 2020 · Updated 6 years ago
mislam77-git / examples
Examples for Apache Oozie book
☆18 · May 30, 2016 · Updated 10 years ago
piotr-kalanski / data-quality-monitoring
Data Quality Monitoring Tool
☆15 · Dec 5, 2017 · Updated 8 years ago
atomix / atomix-jepsen
Atomix Jepsen tests
☆14 · Feb 7, 2017 · Updated 9 years ago
cloudera-labs / envelope
Build configuration-driven ETL pipelines on Apache Spark
☆162 · Oct 4, 2022 · Updated 3 years ago
cloudera / kafka-examples
Kafka Examples repository.
☆44 · Feb 5, 2019 · Updated 7 years ago
bomeng / Heracles
High performance HBase / Spark SQL engine
☆28 · Jul 7, 2022 · Updated 4 years ago
wushujames / kafka-utilities
☆26 · Dec 18, 2019 · Updated 6 years ago
eselyavka / liquibase-impala
Liquibase extension to add Impala Database support
☆24 · Mar 8, 2022 · Updated 4 years ago
cloudacademy / beam
Mirror of Apache Beam
☆10 · Jan 27, 2021 · Updated 5 years ago
implydata / druid-hadoop-inputformat
Hadoop InputFormat for http://druid.io/
☆10 · Oct 26, 2016 · Updated 9 years ago
AbsaOSS / atum
A dynamic data completeness and accuracy library at enterprise scale for Apache Spark
☆30 · May 13, 2026 · Updated 2 months ago
allegro / camus-compressor
Camus Compressor merges files created by Camus and saves them in a compressed format.
☆13 · Mar 20, 2023 · Updated 3 years ago
ebonnal / delta-lake-ui
[student project] UI to run SQL on Delta Lake tables and visualize the variations of the result among tables versions
☆12 · Apr 21, 2020 · Updated 6 years ago
saltstack-formulas / keepalived-formula
☆12 · Apr 7, 2025 · Updated last year
jeoffreylim / maelstrom
Maelstrom is an open source Kafka integration with Spark that is designed to be developer friendly, high performance (millisecond stream …
☆21 · Feb 6, 2017 · Updated 9 years ago
spotify / ratatool
A tool for data sampling, data generation, and data diffing
☆349 · Mar 31, 2026 · Updated 3 months ago
lucidworks / solrj-nested-docs
Simple example of Solr Block Joins between Parents and Children, implemented in SolrJ
☆22 · Jul 2, 2014 · Updated 12 years ago
Varal7 / opendata-ratp
Demo for making use of RATP's real-time API
☆13 · May 3, 2017 · Updated 9 years ago
avensolutions / spark-sql-etl-framework
Multi-stage, config driven, SQL based ETL framework using PySpark
☆26 · Sep 16, 2019 · Updated 6 years ago
mchon89 / Google_App_Engine_Demo
Deploying a simple, customized Flask API in python via Google App Engine
☆13 · Aug 20, 2017 · Updated 8 years ago
ndolgov / experiments
Code examples for my blog posts
☆22 · Nov 7, 2018 · Updated 7 years ago
aravinthsci / Spark_Delta_Lake
Delta Lake Examples
☆11 · Apr 24, 2020 · Updated 6 years ago
OneCricketeer / gryllidae
Opinionated CNCF-based, Docker Compose setup for everything needed to develop a 12factor app
☆18 · Feb 23, 2022 · Updated 4 years ago
tmalaska / Spark.TableStatsExample
Simple Spark example of generating table stats for use of data quality checks
☆27 · Apr 28, 2017 · Updated 9 years ago
awslabs / apn-competency-helper
APN Designations template folder structure and presentation, including APN Competency Program and APN Service Delivery Program
☆22 · Feb 11, 2025 · Updated last year
ExpediaGroup / hello-streams
hello-streams :: Introducing the stream-first mindset
☆16 · Mar 5, 2024 · Updated 2 years ago
benwatson528 / intellij-avro-parquet-plugin
A Tool Window plugin for IntelliJ that displays Avro and Parquet files and their schemas in JSON.
☆54 · Jun 15, 2025 · Updated last year
kppotato / kafka_monitor
☆17 · May 5, 2018 · Updated 8 years ago
edwardcapriolo / filecrush
Remedy small files by combining them into larger ones.
☆196 · Jul 1, 2022 · Updated 4 years ago
ogrebgr / scram-sasl
Java implementation of the SCRAM SASL for both server and client plus examples
☆17 · Apr 18, 2021 · Updated 5 years ago
typesafehub / constructr-zookeeper
This library enables to use ZooKeeper as cluster coordinator in a ConstructR based cluster
☆12 · Dec 2, 2017 · Updated 8 years ago
BenFradet / struct-type-encoder
Deriving Spark DataFrame schemas from case classes
☆44 · Jun 24, 2024 · Updated 2 years ago
bigdatagenomics / utils
General utility code used across BDG products. Apache 2 licensed.
☆18 · Mar 17, 2026 · Updated 4 months ago
leoetlino / ratp-api
A modern API to get information from the RATP
☆13 · Jul 12, 2023 · Updated 3 years ago
dcos-labs / dcos-jupyterlab-service
JupyterLab Notebook for Mesosphere DC/OS
☆11 · Aug 6, 2019 · Updated 6 years ago
koeninger / kafka-exactly-once
☆242 · Jun 14, 2018 · Updated 8 years ago