redapt / pyspark-s3-parquet-example
This repo demonstrates how to load a sample Parquet-formatted file from an AWS S3 bucket. A Python job is then submitted to an Apache Spark cluster running on AWS EMR; the job uses a SQLContext to register a DataFrame as a temporary table, which can then be queried with standard SQL.
☆19 · Updated 8 years ago
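A rough sketch of that flow is shown below, assuming the Spark 1.x-style SQLContext API that matches this repo's era; the bucket path, table name, and query are illustrative placeholders, not values taken from the repo.

```python
# Minimal sketch: read Parquet from S3, register a temp table, query it with SQL.
# The S3 path, table name, and query are hypothetical placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="pyspark-s3-parquet-example")
sqlContext = SQLContext(sc)

# Load the Parquet file from S3 into a DataFrame (EMR resolves s3:// via EMRFS).
df = sqlContext.read.parquet("s3://your-bucket/path/to/sample.parquet")

# Register the DataFrame as a temporary table so it can be queried with SQL.
df.registerTempTable("sample_table")

# Run an ordinary SQL query against the temporary table.
results = sqlContext.sql("SELECT * FROM sample_table LIMIT 10")
results.show()

sc.stop()
```

On EMR, a script like this would typically be submitted with spark-submit, either as an EMR step or from the master node.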
Related projects
Alternatives and complementary repositories for pyspark-s3-parquet-example
- Simple demonstration of how to build a complex real-time machine learning visualization tool. ☆16 · Updated 8 years ago
- Mastering Spark for Data Science, published by Packt. ☆46 · Updated last year
- A simple introduction to using Spark ML pipelines. ☆26 · Updated 6 years ago
- Ingest tweets with Kafka; use Spark to track popular hashtags and trendsetters for each hashtag. ☆29 · Updated 8 years ago
- PySpark phonetic and string matching algorithms. ☆35 · Updated 9 months ago
- This service is meant to simplify running Google Cloud operations, especially BigQuery tasks. This means you do not have to worry about … ☆46 · Updated 5 years ago
- AWS Big Data Certification. ☆25 · Updated last year
- How to do data science with Optimus, Spark, and Python. ☆18 · Updated 5 years ago
- A single Docker image that combines Neo4j Mazerunner and Apache Spark GraphX into a powerful all-in-one graph processing engine. ☆46 · Updated 5 years ago
- Using Luigi to create a machine learning pipeline using the Rossman Sales data from Kaggle. ☆33 · Updated 8 years ago
- Business Data Analysis by HiPIC of CalStateLA. ☆20 · Updated 6 years ago
- A Scalable Data Cleaning Library for PySpark. ☆26 · Updated 5 years ago
- An example PySpark project with pytest. ☆17 · Updated 7 years ago
- Airflow workflow management platform Chef cookbook. ☆68 · Updated 5 years ago
- Basic tutorial on using Apache Airflow. ☆35 · Updated 6 years ago
- Code examples for the Introduction to Kubeflow course. ☆13 · Updated 3 years ago
- Real-time and offline time series analysis with Spark, Spark Streaming, and Storm. ☆21 · Updated 4 years ago
- Public source code for the Batch Processing with Apache Beam (Python) online course. ☆19 · Updated 4 years ago
- Code and setup information for Introduction to Machine Learning with Spark. ☆12 · Updated 9 years ago
- PyConDE & PyData Berlin 2019 Airflow Workshop: Airflow for machine learning pipelines. ☆46 · Updated last year
- Spark NLP for Streamlit. ☆15 · Updated 3 years ago
- Udacity Data Pipeline Exercises. ☆15 · Updated 4 years ago
- Openscoring application for the Docker distributed applications platform. ☆10 · Updated 4 years ago
- Build a sentiment classifier using PL/Python on PostgreSQL, Greenplum Database, or Apache HAWQ. ☆8 · Updated 7 years ago
- Data validation library for PySpark 3.0.0. ☆34 · Updated 2 years ago