chuqiaoshen / Git-Influencer
Insight Data Engineering project: A platform built in HDFS, Spark and Airflow to help you to find social influencers from GitHub Network.
☆16 · Updated 5 months ago
Related projects
Alternatives and complementary repositories for Git-Influencer
- 🚨 Simple, self-contained fraud detection system built with Apache Kafka and Python ☆83 · Updated 5 years ago
- Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validatio… ☆53 · Updated last year
- The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on… ☆25 · Updated 2 years ago
- Data engineering interviews Q&A for data community by data community ☆61 · Updated 4 years ago
- Udacity Data Pipeline Exercises ☆15 · Updated 4 years ago
- Data Engineering pipeline hosted entirely in the AWS ecosystem utilizing DocumentDB as the database ☆13 · Updated 3 years ago
- A real-time event pipeline around Kafka Ecosystem for Chicago Transit Authority. ☆29 · Updated last year
- Basic tutorial of using Apache Airflow ☆35 · Updated 6 years ago
- Project files for the post: Running PySpark Applications on Amazon EMR using Apache Airflow: Using the new Amazon Managed Workflows for A… ☆41 · Updated 2 years ago
- Use Kafka and Apache Spark streaming to perform click stream analytics ☆76 · Updated 4 years ago
- Developed a data pipeline to automate data warehouse ETL by building custom airflow operators that handle the extraction, transformation,… ☆89 · Updated 2 years ago
- Public source code for the Batch Processing with Apache Beam (Python) online course ☆19 · Updated 4 years ago
- Design/Implement stream/batch architecture on NYC taxi data | #DE ☆26 · Updated 3 years ago
- Jupyter notebooks for pyspark tutorials given at University ☆104 · Updated 2 months ago
- AWS Big Data Certification ☆25 · Updated last year
- Learning from multiple companies in Silicon Valley. Netflix, Facebook, Google, Startups ☆16 · Updated 6 years ago
- Code examples for the Introduction to Kubeflow course ☆13 · Updated 3 years ago
- A code-based tutorial for production level data streaming with PySpark plus Optimus for data cleaning, Confluent Kafka, & Apache Drill u… ☆26 · Updated 5 years ago
- Sample Airflow DAGs to load data from the CovidTracking API to Snowflake via an AWS S3 intermediary. ☆16 · Updated 3 years ago
- Sentiment Analysis of a Twitter Topic with Spark Structured Streaming ☆55 · Updated 5 years ago
- Apache Spark Interview Question and Answers ☆21 · Updated 4 years ago
- ☆23 · Updated 5 years ago
- Because its never late to start taking notes and 'public' it... ☆60 · Updated 3 weeks ago
- How to build an awesome data engineering team ☆99 · Updated 5 years ago
- Full stack data engineering tools and infrastructure set-up ☆43 · Updated 3 years ago
- PySpark phonetic and string matching algorithms ☆35 · Updated 8 months ago
- A full data warehouse infrastructure with ETL pipelines running inside docker on Apache Airflow for data orchestration, AWS Redshift for … ☆132 · Updated 4 years ago
- Amazon Redshift Cookbook, Published by Packt ☆15 · Updated last year
- Blog post on ETL pipelines with Airflow ☆23 · Updated 4 years ago
- Challenge for those applying to the Software Engineer, Big Data position ☆34 · Updated 13 years ago