Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated last year
Alternatives and similar repositories for deduplication:
Users that are interested in deduplication are comparing it to the libraries listed below
- Loadable spellfix1 extension for sqlite as python package ☆26 Updated 9 months ago
- Webrecorders DevTools Protocol Automation Library ☆17 Updated 2 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 Updated 7 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 3 years ago
- Indri search implementation on top of Lucene search engine ☆34 Updated 11 months ago
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆16 Updated 2 years ago
- Find duplicate text files. ☆13 Updated last month
- Interpretable feature construction from taxonomies for text classification ☆18 Updated 2 years ago
- ReconNER, Debug annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data. ☆34 Updated 4 years ago
- Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target. ☆47 Updated last month
- An efficient data structure for fast string similarity searches ☆22 Updated 4 years ago
- Documentation effort for the BookCorpus dataset ☆33 Updated 3 years ago
- Generate random date(time) in Python. ☆10 Updated 11 months ago
- Docker Compose based system for running remote browsers (including Flash and Java support) connected to web archives ☆14 Updated 3 years ago
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆14 Updated last year
- A tool to find all duplicates in large sets of text documents. ☆16 Updated 3 years ago
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML ☆62 Updated last month
- Datasette plugin for searching all searchable tables at once ☆22 Updated 5 months ago
- Efficiently computing & storing token n-grams from large corpora ☆18 Updated 4 months ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ☆25 Updated 6 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆44 Updated 7 years ago
- A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL ☆29 Updated 2 years ago
- A curated list of promising Web Data Extractors resources ☆28 Updated 5 years ago
- A file utility for accessing both local and remote files through a unified interface. ☆37 Updated last month
- A fork of http://pydispatcher.sourceforge.net/ with PyPy support ☆16 Updated 7 years ago
- Tokenization across languages. Useful as preprocessing for subword tokenization. ☆22 Updated 2 years ago
- Multi-Language Identification ☆29 Updated 6 months ago