morsapaes / pyflink-nlp
Self-contained demo using PyFlink with Gensim+spaCy to find topics in the Flink User Mailing List. All you need is Docker! 🐳
☆20 Updated 2 years ago
Related projects:
- Streaming Synthetic Sales Data Generator: Streaming sales data generator for Apache Kafka, written in Python ☆43 Updated last year
- Spark on Kubernetes ☆105 Updated last year
- ☆50 Updated 9 months ago
- The Python fake data producer for Apache Kafka® is a complete demo app allowing you to quickly produce JSON fake streaming datasets and … ☆81 Updated 4 months ago
- Repository of helm charts for deploying DataHub on a Kubernetes cluster ☆160 Updated this week
- ☆232 Updated this week
- Trino dbt demo project to mix and load BigQuery data with and in a local PostgreSQL database ☆64 Updated 3 years ago
- (project & tutorial) dag pipeline tests + ci/cd setup ☆84 Updated 3 years ago
- One click deploy docker-compose with Kafka, Spark Streaming, Zeppelin UI and Monitoring (Grafana + Kafka Manager) ☆120 Updated 3 years ago
- Repo for all my code on the articles I post on Medium ☆105 Updated last year
- Airflow training for the crunch conf ☆105 Updated 5 years ago
- Learn how to add data validation and documentation to a data pipeline built with dbt and Airflow. ☆167 Updated 10 months ago
- Example for article Running Spark 3 with standalone Hive Metastore 3.0 ☆96 Updated last year
- ☆197 Updated last month
- ☆22 Updated 3 years ago
- Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes ☆63 Updated 2 years ago
- PySpark data-pipeline testing and CICD ☆28 Updated 3 years ago
- For a series of posts on Amazon MSK, Amazon EKS, and Amazon EMR ☆65 Updated 2 years ago
- ☆38 Updated this week
- A workspace to experiment with Apache Spark, Livy, and Airflow in a Docker environment. ☆39 Updated 3 years ago
- An Airflow docker image preconfigured to work well with Spark and Hadoop/EMR ☆171 Updated 10 months ago
- Sample Airflow DAGs ☆60 Updated last year
- Apache Hive Metastore as a Standalone server in Docker ☆64 Updated 3 weeks ago
- Support for generating modern platforms dynamically with services such as Kafka, Spark, Streamsets, HDFS, .... ☆68 Updated last week
- Multiple node Presto cluster on Docker container ☆120 Updated 2 years ago
- A Helm chart to install Apache Airflow on Kubernetes ☆274 Updated this week
- The Trino (https://trino.io/) adapter plugin for dbt (https://getdbt.com) ☆207 Updated this week
- Project files for the post: Running PySpark Applications on Amazon EMR: Methods for Interacting with PySpark on Amazon Elastic MapReduce. ☆38 Updated 2 years ago
- Atlas custom type definitions ☆16 Updated 3 years ago
- Automated data quality suggestions and analysis with Deequ on AWS Glue ☆83 Updated last year