Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆19Updated 2 years ago
Alternatives and similar repositories for deduplication
Users who are interested in deduplication are comparing it to the libraries listed below.
- ☆15Updated last year
- Indri search implementation on top of Lucene search engine☆35Updated last year
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML☆65Updated 10 months ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t…☆33Updated 2 years ago
- An open-source NLP library: fast text cleaning and preprocessing☆23Updated 4 years ago
- Analyze and extract Wikipedia article text and attributes and store them into an ElasticSearch index or to json files (multilingual suppo…☆47Updated 2 years ago
- Next-generation Punkt sentence boundary detection with zero dependencies☆24Updated 3 weeks ago
- 📑 Python Package to reconstruct the original continuous text from PDFs with language models☆32Updated 2 years ago
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an…☆23Updated 5 years ago
- A robust web archive analytics toolkit☆124Updated last month
- Faster, modernized fork of the language identification tool langid.py☆61Updated last year
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction☆24Updated 3 years ago
- Scripts to parse arxiv documents for NLP tasks☆19Updated 2 years ago
- A simple semantic search engine for scientific papers.☆28Updated 2 years ago
- 🐍 Python binding for the Hora Approximate Nearest Neighbor Search Algorithm library☆73Updated 4 years ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.☆45Updated last year
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆46Updated 8 years ago
- A natural language date parser. (Python version of chrono.js)☆25Updated 6 months ago
- A workflow system for Natural Language Processing.☆21Updated 6 years ago
- A file utility for accessing both local and remote files through a unified interface.☆44Updated 2 months ago
- Neural Solr = Solr 9 + Mighty Inference + Node☆18Updated 3 years ago
- code and data used to build a training dataset for dragnet models☆10Updated 5 years ago
- LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development☆20Updated 2 years ago
- Neural Elastic Inference and Search☆19Updated 6 years ago
- Python wrapper for Ferret☆45Updated 3 years ago
- Small python package to measure OCR quality and other related metrics.☆25Updated last year
- Extract dates from text☆66Updated 4 years ago
- Tower Parse: Low-Resource Dependency Parsing via Hierarchical Source Selection☆15Updated 4 years ago
- Prebuilt .whl files for MacOS + Linux of the Facebook FAISS library☆56Updated 3 years ago
- Sentence Embedding as a Service☆15Updated 5 months ago