Nordstrom / bigdata-profiler

Profiles the data, validates the schema, runs data quality checks, and produces a report

☆20

Alternatives and similar repositories for bigdata-profiler

Users who are interested in bigdata-profiler are comparing it to the libraries listed below.

mikulskibartosz / check-engine
Data validation library for PySpark 3.0.0
☆33 Updated 2 years ago
bernhard-42 / pyspark-atlas
PySpark for ETL jobs including lineage to Apache Atlas in one script via code inspection
☆18 Updated 8 years ago
amundsen-io / amundsengremlin
Amundsen Gremlin
☆21 Updated 2 years ago
datamindedbe / lighthouse
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines an…
☆61 Updated 9 months ago
holdenk / high-performance-spark-examples
Examples for High Performance Spark
☆16 Updated 7 months ago
justhackit / spark-utils
☆10 Updated 3 years ago
godatadriven / airflow-training-skeleton
Skeleton project for Apache Airflow training participants to work on.
☆16 Updated 4 years ago
cordon-thiago / spark-schema-merge
Spark app to merge different schemas
☆23 Updated 4 years ago
jamesweakley / snowflake-rbgm
Rules based grant management for Snowflake
☆40 Updated 6 years ago
CoxAutomotiveDataSolutions / waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
☆75 Updated last year
funkyminds / cleanframes
type-class based data cleansing library for Apache Spark SQL
☆78 Updated 6 years ago
yodasco / pyspark-emr
A toolset to streamline running spark python on EMR
☆20 Updated 8 years ago
sibytes / yetl
Yet Another (Spark) ETL Framework
☆21 Updated last year
joerg-schneider / airtunnel
The sane way of building a data layer in Airflow
☆24 Updated 5 years ago
bartosz25 / spark-scala-playground
Sample processing code using Spark 2.1+ and Scala
☆51 Updated 4 years ago
ZuInnoTe / spark-hadoopoffice-ds
A Spark datasource for the HadoopOffice library
☆38 Updated 2 years ago
cartershanklin / hive-scd-examples
How to manage Slowly Changing Dimensions with Apache Hive
☆55 Updated 5 years ago
target / data-validator
A tool to validate data, built around Apache Spark.
☆101 Updated last month
holdenk / spark-upgrade
Magic to help Spark pipelines upgrade
☆35 Updated 8 months ago
Data-Engineering-Weekly / dataengineeringweekly
Weekly Data Engineering Newsletter
☆96 Updated 11 months ago
bartosz25 / spark-playground
Code snippets used in demos recorded for the blog.
☆37 Updated last week
yaooqinn / itachi
A library that brings useful functions from various modern database management systems to Apache Spark
☆59 Updated last year
davidgasquez / kubedbt
📆 Run, schedule, and manage your dbt jobs using Kubernetes.
☆24 Updated 6 years ago
saikrishnapujari / Spark-Nested-Data-Parser
Nested Data (JSON/AVRO/XML) Parsing and Flattening in Spark
☆16 Updated last year
venkatra / dbt_hacks
A bunch of hacks developed around dbt
☆48 Updated 5 years ago
avensolutions / cdc-at-scale-using-spark
Scalable CDC Pattern Implemented using PySpark
☆18 Updated 5 years ago
phatak-dev / spark-3.0-examples
Examples of Spark 3.0
☆47 Updated 4 years ago
tokern / dbcat
Data Catalog for Databases and Data Warehouses
☆35 Updated last year
dimajix / flowman
Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pi…
☆95 Updated this week
delta-incubator / deltaray
Delta reader for the Ray open-source toolkit for building ML applications
☆46 Updated last year