Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated 2 years ago
Alternatives and similar repositories for deduplication
Users that are interested in deduplication are comparing it to the libraries listed below
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more ☆16 Updated 2 years ago
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML ☆63 Updated 8 months ago
- ☆14 Updated last year
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 Updated last month
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip… ☆73 Updated this week
- Rust python bindings for symspell ☆21 Updated last year
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an… ☆23 Updated 5 years ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 Updated 3 years ago
- Faster, modernized fork of the language identification tool langid.py ☆57 Updated 9 months ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆45 Updated last year
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution. ☆62 Updated this week
- 📑 Python Package to reconstruct the original continuous text from PDFs with language models ☆32 Updated 2 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 3 years ago
- A Domain Specific Language (DSL) for building language patterns. These can be later compiled into spaCy patterns, pure regex, or any othe… ☆68 Updated 2 years ago
- Python based Wikidata framework for easy dataframe extraction ☆45 Updated last year
- Sentence Embedding as a Service ☆15 Updated 2 months ago
- Indri search implementation on top of Lucene search engine ☆35 Updated last year
- A simple semantic search engine for scientific papers. ☆28 Updated 2 years ago
- A workflow system for Natural Language Processing. ☆21 Updated 5 years ago
- RaKUn 2.0 - A fast keyword detection algorithm ☆68 Updated last month
- ☆90 Updated 3 years ago
- Fast fuzzy text search ☆11 Updated 2 years ago
- Finds linguistic patterns effortlessly ☆38 Updated 2 years ago
- 🐍 Python bindings for the Hora Approximate Nearest Neighbor Search Algorithm library ☆72 Updated 3 years ago
- An index data structure for approximate string search. ☆23 Updated 6 years ago
- ReconNER, Debug annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data. ☆35 Updated 5 years ago
- Tool to extract the text from web article URLs and get word frequencies, entity recognition, automatic summaries, and more ☆20 Updated 6 years ago
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆25 Updated 4 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆33 Updated last year