yodasco / pyspark-emr
A toolset to streamline running spark python on EMR
☆20 · Updated 8 years ago
Alternatives and similar repositories for pyspark-emr
Users that are interested in pyspark-emr are comparing it to the libraries listed below
- PySpark for ETL jobs including lineage to Apache Atlas in one script via code inspection ☆18 · Updated 8 years ago
- Examples for High Performance Spark ☆16 · Updated 7 months ago
- An example PySpark project with pytest ☆16 · Updated 7 years ago
- Data validation library for PySpark 3.0.0 ☆33 · Updated 2 years ago
- Quickstart PySpark with Anaconda on AWS/EMR using Terraform ☆47 · Updated 5 months ago
- AWS bootstrap scripts for Mozilla's flavoured Spark setup. ☆47 · Updated 5 years ago
- Airflow workflow management platform chef cookbook. ☆71 · Updated 5 years ago
- CLI tool to launch Spark jobs on AWS EMR ☆67 · Updated last year
- This code demonstrates the architecture featured on the AWS Big Data blog (https://aws.amazon.com/blogs/big-data/) which creates a concu… ☆75 · Updated 6 years ago
- Some class materials for a data processing course using PySpark ☆52 · Updated 2 years ago
- A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator ☆73 · Updated 5 years ago
- Quickstart PySpark with Anaconda on AWS/EMR ☆53 · Updated 8 years ago
- Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines an… ☆61 · Updated 9 months ago
- Profiles the data, validates the schema and runs data quality checks and produces a report ☆20 · Updated 6 years ago
- Real time and offline time series analysis with Spark, Spark Streaming and Storm ☆21 · Updated 4 years ago
- Hadoop Data Pipeline using Falcon ☆15 · Updated 9 years ago
- AWS Big Data Certification ☆25 · Updated 5 months ago
- HDF masterclass materials ☆28 · Updated 9 years ago
- Make your libraries magically appear in Databricks. ☆47 · Updated last year
- Example for an airflow plugin ☆49 · Updated 8 years ago
- Workshop for Hadoop Operations Best Practices ☆10 · Updated 10 years ago
- A rough prototype of a tool for discovering Apache Hive schemas from JSON documents. ☆42 · Updated last year
- A simple introduction to using spark ml pipelines ☆26 · Updated 7 years ago
- Bulletproof Apache Spark jobs with fast root cause analysis of failures. ☆72 · Updated 4 years ago
- Shunting Yard is a real-time data replication tool that copies data between Hive Metastores. ☆20 · Updated 3 years ago
- Code to be contributed to the Apache Airflow (incubating) project for ETL workflow management for integrating with the Snowflake Data War… ☆25 · Updated 7 years ago
- The open source version of the Amazon Redshift Cluster Management Guide. ☆48 · Updated 2 years ago
- The sane way of building a data layer in Airflow ☆24 · Updated 5 years ago
- Monitor Twitter stream for S&P 500 companies to identify & act on unexpected increases in tweet volume ☆39 · Updated 9 years ago
- A collection of airflow sample workflows for data processing on aws ☆12 · Updated 7 years ago