databricks / koalas
Koalas: pandas API on Apache Spark
☆3,329 · Updated 5 months ago
Related projects:
- Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets. ☆3,252 · Updated last week
- Jupyter magics and kernels for working with remote Spark clusters ☆1,315 · Updated last month
- Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark ☆1,472 · Updated 2 weeks ago
- MLeap: Deploy ML Pipelines to Production ☆1,499 · Updated 2 months ago
- Deep Learning Pipelines for Apache Spark ☆1,989 · Updated last year
- 📚 Parameterize, execute, and analyze notebooks ☆5,789 · Updated 3 weeks ago
- Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting… ☆4,391 · Updated this week
- The Open Source Feature Store for Machine Learning ☆5,476 · Updated this week
- Always know what to expect from your data. ☆9,817 · Updated this week
- Build and manage real-life ML, AI, and data science projects with ease! ☆8,046 · Updated this week
- the portable Python dataframe library ☆5,064 · Updated this week
- Hopsworks - Data-Intensive AI platform with a Feature Store ☆1,135 · Updated last week
- A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rew… ☆1,968 · Updated last month
- Data quality testing for the modern data stack (SQL, Spark, and Pandas) https://www.soda.io ☆1,866 · Updated last week
- Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per s… ☆8,257 · Updated this week
- Python interface to Hive and Presto. 🐝 ☆1,670 · Updated last month
- The Internals of Apache Spark ☆1,461 · Updated this week
- Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet f… ☆1,778 · Updated 9 months ago
- A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner ☆2,515 · Updated 5 months ago
- Docker Apache Airflow ☆3,765 · Updated last year
- A Python package for manipulating 2-dimensional tabular data structures ☆1,807 · Updated 9 months ago
- Curated list of resources about Apache Airflow ☆3,653 · Updated 3 weeks ago
- State of the Art Natural Language Processing ☆3,808 · Updated this week
- ETL best practices with airflow, with examples ☆1,282 · Updated 2 years ago
- An Open Standard for lineage metadata collection ☆1,708 · Updated this week
- Dynamically generate Apache Airflow DAGs from YAML configuration files ☆1,158 · Updated last week
- Spark: The Definitive Guide's Code Repository ☆2,825 · Updated 4 years ago
- Dask tutorial ☆1,826 · Updated 10 months ago
- Parallel computing with task scheduling ☆12,405 · Updated this week
- (Deprecated) Scikit-learn integration package for Apache Spark ☆1,079 · Updated 4 years ago