awslabs/deequ

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

☆3,636

Alternatives and similar repositories for deequ

Users who are interested in deequ compare it to the libraries listed below.

awslabs / python-deequ
Python API for Deequ
☆823 · Updated this week
great-expectations / great_expectations
Always know what to expect from your data.
☆11,664 · Updated this week
amundsen-io / amundsen
Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting…
☆4,782 · Jul 1, 2026 · Updated 3 weeks ago
delta-io / delta
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Tr…
☆8,924 · Updated this week
sodadata / soda-core
Data Contracts engine for the modern data stack. https://www.soda.io
☆2,396 · Updated this week
mrpowers-io / spark-daria
Essential Spark extensions and helper methods ✨😲
☆767 · Jun 22, 2026 · Updated last month
apache / griffin
Mirror of Apache griffin
☆1,172 · Aug 3, 2025 · Updated 11 months ago
YotpoLtd / metorikku
A simplified, lightweight ETL Framework based on Apache Spark
☆588 · Jan 24, 2024 · Updated 2 years ago
OpenLineage / OpenLineage
An Open Standard for lineage metadata collection
☆2,557 · Updated this week
datahub-project / datahub
The Context Platform for your Data and AI Stack
☆12,320 · Updated this week
databricks / koalas
Koalas: pandas API on Apache Spark
☆3,371 · Mar 20, 2024 · Updated 2 years ago
MarquezProject / marquez
Collect, aggregate, and visualize a data ecosystem's metadata
☆2,245 · Updated this week
aws / aws-sdk-pandas
pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoD…
☆4,117 · Updated this week
holdenk / spark-testing-base
Base classes to use when writing tests with Spark
☆1,553 · Apr 20, 2026 · Updated 3 months ago
AbsaOSS / spline
Data Lineage Tracking And Visualization Solution
☆663 · Updated this week
apache / iceberg
Apache Iceberg
☆9,070 · Updated this week
qubole / sparklens
Qubole Sparklens tool for performance tuning Apache Spark
☆592 · Jun 26, 2024 · Updated 2 years ago
apache / hudi
Upserts, Deletes And Incremental Processing on Big Data.
☆6,192 · Updated this week
mrpowers-io / spark-fast-tests
Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)
☆458 · Apr 2, 2026 · Updated 3 months ago
feast-dev / feast
The Open Source Feature Store for AI/ML
☆7,152 · Updated this week
mrpowers-io / quinn
pyspark methods to enhance developer productivity 📣 👯 🎉
☆687 · Jun 9, 2026 · Updated last month
aws-samples / amazon-deequ-glue
Automated data quality suggestions and analysis with Deequ on AWS Glue
☆93 · Dec 29, 2022 · Updated 3 years ago
dbt-labs / dbt-core
dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build application…
☆13,495 · Updated this week
apache / kyuubi
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
☆2,353 · Updated this week
trinodb / trino
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
☆13,061 · Updated this week
projectnessie / nessie
Nessie: Transactional Catalog for Data Lakes with Git-like semantics
☆1,481 · Updated this week
Netflix / metacat
☆1,687 · Updated this week
typelevel / frameless
Expressive types for Spark.
☆898 · Updated this week
dagster-io / dagster
An orchestration platform for the development, production, and observation of data assets.
☆15,881 · Updated this week
combust / mleap
MLeap: Deploy ML Pipelines to Production
☆1,539 · Updated this week
kubeflow / spark-operator
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
☆3,142 · Updated this week
jupyter-incubator / sparkmagic
Jupyter magics and kernels for working with remote Spark clusters
☆1,364 · Sep 9, 2025 · Updated 10 months ago
re-data / re-data
re_data - fix data issues before your users & CEO would discover them 😊
☆1,566 · Apr 30, 2024 · Updated 2 years ago
LucaCanali / sparkMeasure
This repository contains the development code for sparkMeasure, an Apache Spark performance analysis and troubleshooting library. It simp…
☆827 · May 19, 2026 · Updated 2 months ago
polynote / polynote
A better notebook for Scala (and more)
☆4,596 · Jan 27, 2026 · Updated 5 months ago
fugue-project / fugue
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rew…
☆2,170 · May 19, 2026 · Updated 2 months ago
delta-io / delta-sharing
An open protocol for secure data sharing
☆952 · Updated this week
Netflix / metaflow
Build, Manage and Deploy AI/ML Systems
☆10,192 · Updated this week
spotify / scio
A Scala API for Apache Beam and Google Cloud Dataflow.
☆2,625 · Updated this week