bernhard-42 / pyspark-atlas

PySpark for ETL jobs including lineage to Apache Atlas in one script via code inspection

☆18

Alternatives and similar repositories for pyspark-atlas

Users who are interested in pyspark-atlas are comparing it to the libraries listed below.

ExpediaGroup / circus-train
Circus Train is a dataset replication tool that copies Hive tables between clusters and clouds.
☆91Updated last year
swoop-inc / spark-records
Bulletproof Apache Spark jobs with fast root cause analysis of failures.
☆73Updated 4 years ago
yaooqinn / itachi
A library that brings useful functions from various modern database management systems to Apache Spark
☆60Updated 2 years ago
rssanders3 / airflow-spark-operator-plugin
A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator
☆73Updated 6 years ago
AbsaOSS / atum
A dynamic data completeness and accuracy library at enterprise scale for Apache Spark
☆29Updated last year
zalando-incubator / spark-json-schema
JSON schema parser for Apache Spark
☆82Updated 3 years ago
airbnb / sputnik
☆63Updated 6 years ago
HeartSaVioR / spark-state-tools
Spark Structured Streaming State Tools
☆34Updated 5 years ago
funkyminds / cleanframes
Type-class-based data cleansing library for Apache Spark SQL
☆78Updated 6 years ago
cerndb / SparkPlugins
Code and examples of how to write and deploy Apache Spark Plugins. Spark plugins allow running custom code on the executors as they are in…
☆94Updated 6 months ago
ing-bank / rokku
Rokku project. This project acts as a proxy on top of any S3 storage solution providing services like authentication, authorization, shor…
☆70Updated 2 months ago
FINRAOS / MegaSparkDiff
A Spark-based data comparison tool at scale which facilitates software development engineers to compare a plethora of pair combinations o…
☆52Updated 5 months ago
CoxAutomotiveDataSolutions / waimak
Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.
☆76Updated last year
intuit / superglue
Superglue is a lineage-tracking tool built to help visualize the propagation of data through complex pipelines composed of tables, jobs …
☆159Updated 2 years ago
KeithSSmith / spark-compaction
File compaction tool that runs on top of the Spark framework.
☆59Updated 6 years ago
hortonworks-spark / spark-schema-registry
Schema Registry integration for Apache Spark
☆40Updated 3 years ago
swoop-inc / spark-alchemy
Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive
☆185Updated last month
MarquezProject / marquez-airflow
Airflow support for Marquez
☆31Updated 4 years ago
rambler-digital-solutions / airflow-declarative
Airflow declarative DAGs via YAML
☆133Updated 2 years ago
FRosner / drunken-data-quality
Spark package for checking data quality
☆222Updated 5 years ago
ing-bank / apache-ranger-s3-plugin
Apache Ranger Plugin for S3
☆20Updated 2 years ago
miho120 / ambari-airflow-mpack
Ambari stack service for installing and managing Apache Airflow on HDP cluster
☆59Updated 7 years ago
AbsaOSS / hyperdrive
Extensible streaming ingestion pipeline on top of Apache Spark
☆46Updated 4 months ago
atlassian / themis
Autoscaling EMR clusters and Kinesis streams on Amazon Web Services (AWS)
☆47Updated last year
trivago / hive-lambda-sting
A small library of Hive UDFs using macros to process and manipulate complex types
☆15Updated last month
mayur2810 / sope
Apache Spark ETL Utilities
☆39Updated last year
palantir / spark-influx-sink
A Spark metrics sink that pushes to InfluxDB
☆51Updated 4 years ago
werneckpaiva / spark-to-tableau
Spark to Tableau Extractor library
☆19Updated 8 years ago
dimajix / flowman
Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pi…
☆97Updated last month
spektom / spark-flamegraph
Easy CPU Profiling for Apache Spark applications
☆48Updated 5 years ago