david-siqi-liu / sparklyclean
Optimal distributed data deduplication and supervised learning pipeline using Apache Spark
☆10Updated 4 years ago
Related projects
Alternatives and complementary repositories for sparklyclean
- ☆15Updated 2 years ago
- SparkER: an Entity Resolution framework for Apache Spark☆63Updated 7 months ago
- A Generalized Data Cleaning System☆49Updated 8 years ago
- End-to-End Deep Entity Resolution☆31Updated 3 years ago
- UI for JedAI Toolkit☆16Updated 2 years ago
- Deep entity resolution, lite version☆11Updated 5 years ago
- An example of Spark and GraphX using Twitter as sample data☆19Updated 7 years ago
- Sketch and LSH Index library for Java, including OPH methods as well as the Lazo method☆13Updated 11 months ago
- A library to store metadata of relational databases including the schema, statistics, and integrity constraints.☆25Updated 6 years ago
- Stanford Entity-Resolution Framework☆23Updated 6 years ago
- A Java framework to build a semantics-aware autoencoder neural network from a knowledge graph.☆13Updated 7 years ago
- T2K Match is a matching algorithm optimised to match millions of web tables to a central knowledge base.☆21Updated 6 years ago
- Twitter sentiment analysis using Spark and Stanford CoreNLP, with visualization using Elasticsearch and Kibana☆20Updated 6 years ago
- WInte.r is a Java framework for end-to-end data integration. The WInte.r framework implements well-known methods for data pre-processing,…☆110Updated 2 years ago
- Collection of some algorithms for entity resolution☆28Updated 9 years ago
- Text similarity based on Word2Vec vectors.☆10Updated 7 years ago
- 💻 CLI for reporting events to Faros platform☆14Updated last month
- A Scalable Data Cleaning Library for PySpark.☆26Updated 5 years ago
- ☆16Updated 8 years ago
- LSHDB is a parallel and distributed data engine, which relies on Locality-Sensitive Hashing and NoSQL systems, for performing record link…☆29Updated 2 years ago
- Query Spark in real time and visualise the results as a graph.☆24Updated 7 years ago
- Library of graph algorithms for Apache Giraph.☆8Updated 8 years ago
- A toy DuckDB-based time-series database☆14Updated 4 years ago
- Code examples for Google Natural Language API.☆13Updated 5 years ago
- Apache NiFi NLP Processor☆18Updated last year
- ☆13Updated last year
- Condor allows for the specification of synopsis-based streaming jobs on top of general dataflow systems. Condor provides a collection of …☆13Updated 5 months ago
- Rheem - a cross-platform data processing system☆5Updated 2 years ago
- FlexMatcher is a schema matching package in Python which handles the problem of matching multiple schemas to a single mediated schema.☆31Updated this week
- Explaining Inference Queries with Bayesian Optimization☆10Updated 3 years ago