Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18Updated last year
Alternatives and similar repositories for deduplication
Users that are interested in deduplication are comparing it to the libraries listed below
- code and data used to build a training dataset for dragnet models☆10Updated 4 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t…☆33Updated 2 years ago
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more☆16Updated 2 years ago
- Loadable spellfix1 extension for sqlite as python package☆26Updated last year
- Generate a SQLite database from Wikipedia & Wikidata dumps.☆35Updated last year
- Faster, modernized fork of the language identification tool langid.py☆56Updated 7 months ago
- Functional composable pipelines allowing clean separation of the business logic and its implementation☆11Updated last year
- Indri search implementation on top of Lucene search engine☆34Updated last year
- ☆18Updated 5 years ago
- https://mimesniff.spec.whatwg.org/ implementation for Python☆13Updated last year
- Sentence Embedding as a Service☆15Updated 2 weeks ago
- Python package of commonly used metrics for evaluating information retrieval models.☆25Updated 4 years ago
- Document level Attitude and Relation Extraction toolkit (AREkit) for sampling and processing large text collections with ML and for ML☆63Updated 5 months ago
- Simple and clean Python implementation of TextRank as per seminal paper by Rada Mihalcea and Paul Tarau. This implementation performs bot…☆11Updated 4 years ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki☆27Updated 11 months ago
- ☆17Updated 3 years ago
- Open Collaborative AI Driven Parser builder for Web Scraping, Data Extraction and Crawling, Knowledge Graph☆1Updated 5 months ago
- https://instances.social/instances.json☆23Updated 6 months ago
- 📑 Python Package to reconstruct the original continuous text from PDFs with language models☆32Updated last year
- A workflow system for Natural Language Processing.☆21Updated 5 years ago
- Generate nice CLI from a function signature.☆18Updated 2 years ago
- Yet another tool to search through your (exported) ChatGPT conversations☆12Updated 9 months ago
- Encode and decode pairs of surrogate characters in Python 3☆10Updated 3 years ago
- Tool to extract the text from web article URLs and get word frequencies, entity recognition, automatic summaries and more☆20Updated 6 years ago
- Datasette plugin for authenticating access using API tokens☆12Updated 10 months ago
- CLI based diff viewer☆23Updated 3 years ago
- ☆41Updated 6 months ago
- Fast Neural Machine Translation in C++ - development repository☆19Updated last year
- Indexing GDELT database into Elasticsearch, entire database including the -each 15 minutes- real time events☆13Updated 5 years ago
- A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL☆30Updated 2 years ago