Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated last year
Alternatives and similar repositories for deduplication
Users who are interested in deduplication are comparing it to the libraries listed below.
- ☆13 Updated last year
- Indri search implementation on top of Lucene search engine ☆34 Updated last year
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 3 years ago
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML ☆63 Updated 6 months ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- Tower Parse: Low-Resource Dependency Parsing via Hierarchical Source Selection ☆15 Updated 3 years ago
- Rust python bindings for symspell ☆19 Updated last year
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip… ☆70 Updated this week
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆33 Updated last year
- 🌸 Train floret vectors ☆18 Updated 2 years ago
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an… ☆23 Updated 4 years ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆44 Updated last year
- Faster, modernized fork of the language identification tool langid.py ☆56 Updated 8 months ago
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- Interpretable feature construction from taxonomies for text classification ☆18 Updated 3 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆52 Updated 5 years ago
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop. ☆12 Updated 4 years ago
- Finds linguistic patterns effortlessly ☆37 Updated last year
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction ☆24 Updated 3 years ago
- A simple semantic search engine for scientific papers. ☆28 Updated last year
- Fast edit distance Python extension written in Cython/C++. Supports Levenshtein distance and Damerau Optimal String Alignment (OSA) dista… ☆24 Updated 2 months ago
- GPT-jax based on the official huggingface library ☆13 Updated 4 years ago
- Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Sear… ☆86 Updated 4 years ago
- 📑 Python Package to reconstruct the original continuous text from PDFs with language models ☆32 Updated last year
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo… ☆47 Updated last year
- Multi-class text categorization using state-of-the-art pre-trained contextualized language models, e.g. BERT ☆23 Updated 2 years ago
- Translation demonstrator ☆34 Updated 5 years ago
- RaKUn 2.0 - A fast keyword detection algorithm ☆68 Updated this week
- Tokenization across languages. Useful as preprocessing for subword tokenization. ☆22 Updated 2 years ago
- ☆70 Updated 4 years ago