aviatesk / deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
☆9 · Updated 3 years ago
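As a quick orientation, the snippet below is a minimal sketch of how a "unit test for data" is typically declared with deequ's `VerificationSuite` API; the DataFrame `orders` and its columns `id` and `status` are assumptions made for illustration only.

```scala
import com.amazon.deequ.VerificationSuite
import com.amazon.deequ.checks.{Check, CheckLevel, CheckStatus}
import org.apache.spark.sql.DataFrame

// Minimal sketch: run a small set of data-quality checks on a Spark DataFrame.
// `orders`, `id`, and `status` are hypothetical names used only for this example.
def ordersLookHealthy(orders: DataFrame): Boolean = {
  val result = VerificationSuite()
    .onData(orders)
    .addCheck(
      Check(CheckLevel.Error, "basic data quality")
        .hasSize(_ > 0)                                   // dataset is non-empty
        .isComplete("id")                                 // no nulls in id
        .isUnique("id")                                   // id is a unique key
        .isContainedIn("status", Array("open", "closed")) // status holds known values
    )
    .run()

  result.status == CheckStatus.Success
}
```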
Alternatives and similar repositories for deequ:
Users interested in deequ are comparing it to the libraries listed below.
- Paper: A Zero-rename committer for object stores ☆20 · Updated 3 years ago
- ☆22 · Updated 5 years ago
- A Giter8 template for scio ☆31 · Updated last month
- Bullet is a streaming query engine that can be plugged into any singular data stream using a Stream Processing framework like Apache Stor… ☆41 · Updated 2 years ago
- Dione - a Spark and HDFS indexing library ☆52 · Updated last year
- Mutation testing framework and code coverage for Hive SQL ☆24 · Updated 3 years ago
- Snowplow Enrichment jobs and library ☆22 · Updated last month
- Sketching data structures for Scala, including t-digest ☆15 · Updated 3 years ago
- Demonstration of a Hive Input Format for Iceberg ☆26 · Updated 4 years ago
- Collector for cloud-native web, mobile and event analytics, running on AWS and GCP ☆31 · Updated 3 weeks ago
- ☆14 · Updated last month
- Apache Beam Site ☆29 · Updated last month
- Data Sketches for Apache Spark ☆22 · Updated 2 years ago
- Set of tools for backup, compaction, and restoration of Apache Kafka® clusters ☆21 · Updated last week
- A small project to allow publishing data to Apache Kafka, Apache Pulsar, or any other target system ☆14 · Updated 4 years ago
- Bulletproof Apache Spark jobs with fast root cause analysis of failures. ☆72 · Updated 4 years ago
- A testing framework for Trino ☆26 · Updated this week
- Apache Amaterasu ☆56 · Updated 5 years ago
- Example of a tested Apache Flink application. ☆42 · Updated 5 years ago
- MinIO as local storage and DynamoDB as catalog ☆13 · Updated 10 months ago
- Snowflake Snowpark Java & Scala API ☆20 · Updated this week
- A temporary home for LinkedIn's changes to Apache Iceberg (incubating) ☆61 · Updated 3 months ago
- Provides functionality to build statistical models to repair dirty tabular data in Spark ☆12 · Updated last year
- Embedded PostgreSQL server for use in tests ☆9 · Updated 3 years ago
- Example Spark applications that run on Kubernetes and access GCP products, e.g., GCS, BigQuery, and Cloud Pub/Sub ☆37 · Updated 7 years ago
- Schema registry for CSV, TSV, JSON, Avro, and Parquet schemas. Supports schema inference and a GraphQL API. ☆111 · Updated 5 years ago
- A table-schema-less OLAP analytics engine for big data. ☆24 · Updated 11 months ago
- Kafka Streams + Memcached (e.g. AWS ElastiCache) for low-latency in-memory lookups ☆13 · Updated 5 years ago
- Scalable CDC pattern implemented using PySpark ☆18 · Updated 5 years ago
- An example of building a Kubernetes operator (for Flink) using the abstract-operator framework ☆26 · Updated 5 years ago