Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆19Updated 2 years ago
Alternatives and similar repositories for deduplication
Users who are interested in deduplication are comparing it to the libraries listed below
- ☆15Updated last year
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t…☆33Updated 2 years ago
- Complex data extraction and orchestration framework designed for processing unstructured documents. It integrates AI-powered document pip…☆80Updated this week
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more☆17Updated 2 years ago
- This library builds a graph-representation of the content of PDFs. The graph is then clustered, resulting page segments are classified an…☆23Updated 5 years ago
- ☆92Updated 3 years ago
- Indri search implementation on top of Lucene search engine☆35Updated last year
- An open-source NLP library: fast text cleaning and preprocessing☆23Updated 4 years ago
- Tool that extracts text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more☆20Updated 6 years ago
- Fast edit distance Python extension written in Cython/C++. Supports Levenshtein distance and Damerau Optimal String Alignment (OSA) dista…☆25Updated 7 months ago
- Documentation effort for the BookCorpus dataset☆34Updated 4 years ago
- A cost estimator for OpenAI API calls in tqdm loops.☆20Updated last year
- A file utility for accessing both local and remote files through a unified interface.☆45Updated 3 weeks ago
- DocAI helps developers quickly build document, image and text processing pipelines using open source and cloud-based machine learning mod…☆20Updated 3 years ago
- Small Python package to measure OCR quality and other related metrics.☆25Updated last year
- 📑 Python Package to reconstruct the original continuous text from PDFs with language models☆32Updated 2 years ago
- Code and data used to build a training dataset for dragnet models☆10Updated 5 years ago
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML☆65Updated 11 months ago
- FAMIE: A Fast Active Learning Framework for Multilingual Information Extraction☆24Updated 3 years ago
- Next-generation Punkt sentence boundary detection with zero dependencies☆26Updated last month
- Sentence Embedding as a Service☆15Updated 6 months ago
- CoCrawler is a versatile web crawler built using modern tools and concurrency.☆191Updated 3 years ago
- ☆20Updated 4 years ago
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization.☆45Updated last year
- Neural search engine for discovering semantically similar Python repositories on GitHub☆27Updated last year
- LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development☆20Updated 2 years ago
- A robust web archive analytics toolkit☆126Updated 2 months ago
- A simple semantic search engine for scientific papers.☆28Updated 2 years ago
- ☆20Updated 4 years ago
- 🖍️ Highlight text in documents☆110Updated 8 months ago