homeaway/datapull

Cloud based Data Platform based on Apache Spark

☆28

Alternatives and similar repositories for datapull

Users that are interested in datapull are comparing it to the libraries listed below.

AbsaOSS / hyperdrive
Extensible streaming ingestion pipeline on top of Apache Spark
☆47 · Jul 17, 2025 · Updated last year
amesar / docker-spark-hive-metastore
Spark and Hive docker containers sharing a common MySQL metastore
☆26 · Apr 17, 2020 · Updated 6 years ago
ExpediaGroup / shunting-yard
Shunting Yard is a real-time data replication tool that copies data between Hive Metastores.
☆20 · Oct 11, 2021 · Updated 4 years ago
netease-bigdata / ne-spark-courseware
NetEase Spark Courses
☆15 · Sep 4, 2018 · Updated 7 years ago
fraibacas / lakehouse-poc
Run an open-source data LakeHouse locally using Docker Compose
☆12 · May 31, 2024 · Updated 2 years ago
indix / sparkplug
Spark package to "plug" holes in data using SQL based rules ⚡️ 🔌
☆28 · May 15, 2020 · Updated 6 years ago
PacktPublishing / Django-with-Data-Science
Django with Data Science [Video], published by Packt
☆12 · Dec 15, 2025 · Updated 7 months ago
avensolutions / spark-sql-etl-framework
Multi-stage, config driven, SQL based ETL framework using PySpark
☆26 · Sep 16, 2019 · Updated 6 years ago
japerry911 / crypto-data-pipeline
Data Pipeline that utilizes GCP, Python 3.10, Prefect, and more.
☆10 · Jan 23, 2023 · Updated 3 years ago
richardanaya / spark_delta_lake
☆16 · Jun 27, 2020 · Updated 6 years ago
speedment / avro-mocker
Generate mock data based on an Apache Avro schema and specific cardinality settings
☆10 · Apr 16, 2018 · Updated 8 years ago
awslabs / amazon-s3-tagging-spark-util
☆12 · Oct 16, 2023 · Updated 2 years ago
hequn8128 / TableApiDemo
☆33 · Apr 23, 2019 · Updated 7 years ago
newfront / spark-intro-to-ml
A Gentle introduction to Machine Learning with Apache Spark
☆11 · Mar 2, 2026 · Updated 4 months ago
xavient / Data-Ingestion-Platform
☆51 · Jun 30, 2026 · Updated 3 weeks ago
openaire / vipe
Tool for visualizing Apache Oozie pipelines
☆13 · Feb 15, 2016 · Updated 10 years ago
aws-samples / amazon-emr-optimize-data-processing
Optimizing downstream data processing with Amazon Kinesis Data Firehose and Amazon EMR running Apache Spark
☆14 · Apr 14, 2023 · Updated 3 years ago
Pathairush / airflow_hive_spark_sqoop
A docker using the airflow with Hadoop ecosystem (hive, spark, and sqoop)
☆12 · May 2, 2021 · Updated 5 years ago
SaurabhChawla100 / spark-radiant
Spark-Radiant is Apache Spark Performance and Cost Optimizer
☆25 · Dec 31, 2024 · Updated last year
fpgmaas / stream-iot
An end-to-end workflow for processing streaming data on Azure.
☆17 · Sep 20, 2024 · Updated last year
jamartinh / Orange3-Spark
A set of widgets for Python's Orange Machine Learning to work with Apache Spark ML
☆15 · Dec 24, 2016 · Updated 9 years ago
timgent / data-flare
Data quality control tool built on spark and deequ
☆25 · May 9, 2026 · Updated 2 months ago
nil1729 / trino-jmx-monitoring
trino monitoring with JMX metrics through Prometheus and Grafana
☆17 · Aug 14, 2024 · Updated last year
ggear / cloudera-framework
☆11 · Feb 14, 2020 · Updated 6 years ago
ververica / lab-fraud-detection
Demo code for implementing and showcasing a Fraud Detection Engine with Apache Flink.
☆33 · Oct 20, 2022 · Updated 3 years ago
margitaii / pydeequ
Python API for Deequ
☆41 · Nov 10, 2020 · Updated 5 years ago
datafusion-contrib / datafusion-objectstore-hdfs
HDFS based on Java implementation as a remote ObjectStore for DataFusion
☆10 · Feb 13, 2024 · Updated 2 years ago
mark-hoffmann / fastteradata
Tools for faster and optimized interaction with Teradata and large datasets.
☆17 · Jul 11, 2018 · Updated 8 years ago
knaufk / enrichments-with-flink
Code Samples for my Ververica Webinar "99 Ways to Enrich Streaming Data with Apache Flink"
☆41 · Jan 4, 2022 · Updated 4 years ago
itsbigspark / data-engineering-blueprints
Patterns and concepts for building resilient data pipelines in Python and Scala
☆16 · Aug 27, 2024 · Updated last year
im-nsk / Building-an-Automated-Weather-Data-Pipeline-with-Airflow-From-Ingestion-to-Data-Warehouse
This project focuses on building a robust data pipeline using Apache Airflow to automate the ingestion of weather data from the OpenWeath…
☆22 · Feb 3, 2026 · Updated 5 months ago
javieraviles / spring-boot-redis-rest
API REST boilerplate using Spring Boot and Redis as database
☆13 · Dec 26, 2018 · Updated 7 years ago
A9HORA / Reflected-XSS-Mindmap
This repo contains mindmap and content regarding reflected xss.
☆11 · Aug 11, 2020 · Updated 5 years ago
swoop-inc / spark-records
Bulletproof Apache Spark jobs with fast root cause analysis of failures.
☆73 · Mar 14, 2021 · Updated 5 years ago
divyam-rai / simple-kafka-sasl-docker-python
Due to lack of resources on how to deploy kafka with simple SASL authentication (just username and password) and how to write producer an…
☆12 · Dec 29, 2021 · Updated 4 years ago
apache / kyuubi-client
Client libraries of end users of Apache Kyuubi
☆11 · May 15, 2026 · Updated 2 months ago
aljoscha / blog
Thoughts on things I find interesting.
☆17 · Dec 19, 2024 · Updated last year
shwethags / atlas-lineage
Example to create lineage in Atlas with sqoop and spark
☆14 · Apr 5, 2017 · Updated 9 years ago
target / data-validator
A tool to validate data, built around Apache Spark.
☆102 · Jun 15, 2026 · Updated last month