ing-bank / spark-matcher
Record matching and entity resolution at scale in Spark
☆34 Updated last year
Alternatives and similar repositories for spark-matcher:
Users interested in spark-matcher are comparing it to the libraries listed below.
- Kedro Plugin to support running workflows on Kubeflow Pipelines ☆53 Updated 6 months ago
- An abstraction layer for parameter tuning ☆35 Updated 6 months ago
- Instant search for and access to many datasets in Pyspark. ☆34 Updated 2 years ago
- Powerful rapid automatic EDA and feature engineering library with a very easy to use API 🌟 ☆53 Updated 3 years ago
- hooqu is a library built on top of Pandas-like Dataframes for defining "unit tests for data". This is a spiritual port of Apache Deequ to… ☆29 Updated 3 months ago
- An End-to-End Evaluation Framework for Entity Resolution Systems ☆27 Updated last year
- Python package for deduplication/entity resolution using active learning ☆77 Updated 7 months ago
- Similarity encoding of dirty categorical variables (strings) ☆20 Updated 6 years ago
- 📈🔍 Lets Python do AB testing analysis ☆76 Updated 11 months ago
- PySpark phonetic and string matching algorithms ☆39 Updated last year
- Automatically transform all categorical, date-time, NLP variables to numeric in a single line of code for any data set any size. ☆64 Updated 2 months ago
- real-time data + ML pipeline ☆54 Updated this week
- Lambda Learner is a library for iterative incremental training of a class of supervised machine learning models. ☆42 Updated last year
- Spark implementation of computing Shapley Values using monte-carlo approximation ☆74 Updated 2 years ago
- Repository for my master thesis on automated string handling ☆16 Updated 3 years ago
- Demo of a supervised machine learning approach for Entity Resolution in graph using Neo4j GDS Link Prediction Pipelines ☆22 Updated 2 years ago
- Python library to explain Tree Ensemble models (TE) like XGBoost, using a rule list. ☆52 Updated 11 months ago
- This project focuses on DeepER, a deep learning framework for entity resolution (record deduplication). It examines how DeepER performs o… ☆46 Updated 6 years ago
- Build your feature store with macros right within your dbt repository ☆38 Updated 2 years ago
- Distributed Bayesian Entity Resolution in Apache Spark ☆57 Updated 3 years ago
- 🐍 Material for PyData Global 2021 Presentation: Effective Testing for Machine Learning Projects ☆81 Updated 3 years ago
- ☆16 Updated 4 years ago
- Pipeline components that support partial_fit. ☆45 Updated 8 months ago
- MinHash implementation in Python ☆11 Updated 7 months ago
- this repo might get accepted ☆28 Updated 4 years ago
- Lossless in-memory compression of pandas DataFrames and Series powered by the visions type system. Up to 10x less RAM needed for the same… ☆28 Updated 2 years ago
- In-Session Personalization Workshop for eCommerce, April 2021, and the MICES Workshop in June 2021. ☆22 Updated 3 years ago
- Tutorials for Fugue - A unified interface for distributed computing. Fugue executes SQL, Python, and Pandas code on Spark and Dask withou… ☆113 Updated last year
- A Scalable Data Cleaning Library for PySpark. ☆27 Updated 5 years ago
- Binding the GDELT universe in a Spark environment ☆23 Updated last year