Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated last year
Alternatives and similar repositories for deduplication
Users who are interested in deduplication compare it to the libraries listed below.
- Indri search implementation on top of Lucene search engine ☆34 Updated last year
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 Updated last year
- Loadable spellfix1 extension for sqlite as python package ☆26 Updated last year
- Neural Solr = Solr 9 + Mighty Inference + Node ☆17 Updated 3 years ago
- A workflow system for Natural Language Processing. ☆21 Updated 5 years ago
- SuperMinHash: A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, Simhash and SimhashIndex ☆19 Updated 2 years ago
- Python module (C extension and plain python) implementing DAWG ☆20 Updated 3 years ago
- A fork of http://pydispatcher.sourceforge.net/ with PyPy support ☆16 Updated 7 years ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ☆26 Updated 10 months ago
- Custom Python functions for working with SQLite FTS4 ☆22 Updated 2 years ago
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL ☆29 Updated 2 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 Updated last year
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 3 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 Updated 7 years ago
- Fast Neural Machine Translation in C++ - development repository ☆19 Updated last year
- High performance multiplexed user fuse mounting ☆20 Updated 2 years ago
- ☆13 Updated last year
- Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Sear… ☆85 Updated 4 years ago
- An efficient data structure for fast string similarity searches ☆22 Updated 4 years ago
- Scripts to parse arxiv documents for NLP tasks ☆18 Updated 2 years ago
- Evaluation framework for document processing models and services. ☆21 Updated this week
- Encode and decode pairs of surrogate characters in Python 3 ☆10 Updated 3 years ago
- Webrecorders DevTools Protocol Automation Library ☆17 Updated 2 years ago
- Large-scale query-focused multi-document Summarization dataset ☆10 Updated 3 years ago
- Sentence Embedding as a Service ☆15 Updated last year
- Download, parse, and filter data from Court Listener, part of the FreeLaw projects. Data-ready for The-Pile. ☆11 Updated 2 years ago
- Interpretable feature construction from taxonomies for text classification ☆18 Updated 3 years ago
- A collection of prompts for use with the LLM CLI tool ☆16 Updated 2 years ago