Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 · Updated 2 years ago
Alternatives and similar repositories for deduplication
Users who are interested in deduplication compare it to the libraries listed below.
- ☆14 · Updated last year
- Indri search implementation on top of Lucene search engine ☆35 · Updated last year
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip… ☆72 · Updated this week
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 · Updated 2 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 · Updated 3 years ago
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML ☆63 · Updated 8 months ago
- 📑 Python Package to reconstruct the original continuous text from PDFs with language models ☆32 · Updated 2 years ago
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an… ☆22 · Updated 5 years ago
- An example of graph embeddings for wikipedia page recommendations ☆11 · Updated 4 years ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 · Updated 3 years ago
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆25 · Updated 4 years ago
- Efficiently search the most similar strings against the query in Python. ☆18 · Updated 4 months ago
- Open source library for few shot NLP ☆79 · Updated 2 years ago
- A sentence segmentation library with wide language support optimized for speed and utility. ☆68 · Updated 3 months ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆45 · Updated last year
- Documentation effort for the BookCorpus dataset ☆34 · Updated 4 years ago
- Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Sear… ☆86 · Updated 4 years ago
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo… ☆47 · Updated 2 years ago
- ☆20 · Updated 4 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 · Updated 7 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆34 · Updated last year
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆63 · Updated this week
- This repository provides various Python methods for finding and aggregating synonyms for an individual word or a list of words. ☆36 · Updated 2 years ago
- Fast edit distance Python extension written in Cython/C++. Supports Levenshtein distance and Damerau Optimal String Alignment (OSA) dista… ☆24 · Updated 4 months ago
- Fast fuzzy text search ☆11 · Updated 2 years ago
- Reproducing "Writing with Transformer" demo, using aitextgen/FastAPI in backend, Quill/React in frontend ☆27 · Updated 4 years ago
- ☆91 · Updated 3 years ago
- Use ML-Annotate to label data for machine learning purposes ☆109 · Updated 5 years ago
- Framework for information extraction from tables ☆41 · Updated 6 years ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆18 · Updated 3 years ago