aviatesk / deequ
Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
☆9Updated 3 years ago
Alternatives and similar repositories for deequ
Users who are interested in deequ are comparing it to the libraries listed below.
- ☆22Updated 6 years ago
- Bullet is a streaming query engine that can be plugged into any singular data stream using a Stream Processing framework like Apache Stor…☆41Updated 2 years ago
- Paper: A Zero-rename committer for object stores☆20Updated 4 years ago
- Dione - a Spark and HDFS indexing library☆52Updated last year
- A Giter8 template for scio☆31Updated 4 months ago
- Mutation testing framework and code coverage for Hive SQL☆24Updated 4 years ago
- A testing framework for Trino☆26Updated 3 months ago
- Bulletproof Apache Spark jobs with fast root cause analysis of failures.☆72Updated 4 years ago
- Docker images for Presto integration testing☆35Updated last year
- A small project to allow publishing data to Apache Kafka, Apache Pulsar or any other target system☆14Updated 4 years ago
- minio as local storage and DynamoDB as catalog☆15Updated last year
- A command line client for consuming Postgres logical decoding events in the pgoutput format☆11Updated 11 months ago
- A JDBC driver that emulates Redshift-specific commands.☆62Updated 2 years ago
- Oxia Java client SDK☆17Updated last week
- The Internals of Apache Beam☆12Updated 5 years ago
- Example of a tested Apache Flink application.☆42Updated 5 years ago
- Cloud Storage Connector integrates Apache Pulsar with cloud storage.☆28Updated last month
- A library for strong, schema based conversion between 'natural' JSON documents and Avro☆18Updated last year
- Scalable CDC Pattern Implemented using PySpark☆18Updated 6 years ago
- Extensible streaming ingestion pipeline on top of Apache Spark☆45Updated last week
- Sketching data structures for scala, including t-digest☆15Updated 3 years ago
- CLI and Go Clients to manage Kafka components (Kafka Connect & SchemaRegistry)☆29Updated 8 years ago
- Nested array transformation helper extensions for Apache Spark☆37Updated last year
- A temporary home for LinkedIn's changes to Apache Iceberg (incubating)☆61Updated 6 months ago
- Auto-archived due to inactivity. Simple JVM Profiler Using StatsD and Other Metrics Backends☆15Updated last year
- Lenses.io CLI (command-line interface)☆37Updated 7 months ago
- Splittable Gzip codec for Hadoop☆70Updated this week
- Shunting Yard is a real-time data replication tool that copies data between Hive Metastores.☆20Updated 3 years ago
- Fluorite: Apache Calcite trace analyzer☆12Updated 6 years ago
- Data Catalog is a service for indexing parameterized, strongly-typed data artifacts across revisions. It also powers Flytes memoization s…☆54Updated last year