david-siqi-liu / sparklyclean
Optimal distributed data deduplication and supervised learning pipeline using Apache Spark
☆10 Updated 4 years ago
Alternatives and similar repositories for sparklyclean:
Users who are interested in sparklyclean are comparing it to the libraries listed below.
- SparkER: an Entity Resolution framework for Apache Spark ☆63 Updated 10 months ago
- ☆15 Updated 2 years ago
- deep entity resolution lite version ☆11 Updated 5 years ago
- Distributed Bayesian Entity Resolution in Apache Spark ☆57 Updated 3 years ago
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,… ☆110 Updated 2 years ago
- T2K Match is a matching algorithm optimised to match millions of web tables to a central knowledge base. ☆21 Updated 6 years ago
- FlexMatcher is a schema matching package in Python which handles the problem of matching multiple schemas to a single mediated schema. ☆29 Updated 2 months ago
- An example of Spark and GraphX with Twitter as sample ☆19 Updated 8 years ago
- PySpark phonetic and string matching algorithms ☆39 Updated last year
- ☆11 Updated 7 years ago
- S2RDF (SPARQL on Spark for RDF) is a SPARQL query processor for Hadoop based on Spark SQL. It uses the relational interface of Spark for … ☆13 Updated 6 years ago
- A Scalable Data Cleaning Library for PySpark. ☆26 Updated 5 years ago
- ☆75 Updated last year
- An open source, high scalability toolkit in Java for Entity Resolution. ☆216 Updated 10 months ago
- Dremio Flight connector. Access Dremio using Arrow flight ☆40 Updated 4 years ago
- Apache-Spark based Data Flow(ETL) Framework which supports multiple read, write destinations of different types and also support multiple… ☆26 Updated 3 years ago
- Collection of some algorithms for entity resolution ☆28 Updated 9 years ago
- Real-time query spark and visualise it as graph. ☆24 Updated 7 years ago
- Apache NiFi NLP Processor ☆18 Updated last year
- A spark package to approximate the diameter of large graphs ☆15 Updated 7 years ago
- Fork of the Freely Extensible Biomedical Record Linkage program ☆24 Updated 8 years ago
- A Generalized Data Cleaning System ☆49 Updated 8 years ago
- Applications using Parallel Graph AnalytiX (PGX) from Oracle Labs ☆48 Updated 3 weeks ago
- A tool facilitating matching for any dataset discovery method. Also, an extensible experiment suite for state-of-the-art schema matching … ☆86 Updated 2 weeks ago
- JedAI-WebApp is a GUI that facilitates the execution of JedAI. JedAI is an open source, high scalability toolkit that offers out-of-the-b… ☆23 Updated last year
- Search relevance evaluation toolkit ☆31 Updated 2 years ago
- Condor allows for the specification of synopsis-based streaming jobs on top of general dataflow systems. Condor provides a collection of … ☆13 Updated 7 months ago
- UI for JedAI Toolkit ☆16 Updated 2 years ago
- Building blocks and patterns for building data prep transformations and feature engineering in Spark. ☆16 Updated 8 years ago
- ☆188 Updated 8 months ago