Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 · Updated last year
Alternatives and similar repositories for deduplication
Users who are interested in deduplication are comparing it to the libraries listed below.
- Loadable spellfix1 extension for sqlite as python package ☆26 · Updated last year
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 · Updated last year
- code and data used to build a training dataset for dragnet models ☆10 · Updated 4 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆34 · Updated 2 years ago
- CLI based diff viewer ☆23 · Updated 3 years ago
- Indri search implementation on top of Lucene search engine ☆34 · Updated last year
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 · Updated last year
- An index data structure for approximate string search. ☆23 · Updated 6 years ago
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆24 · Updated 4 years ago
- Transform Oracle PL/SQL Code to Python ☆11 · Updated 11 years ago
- A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL ☆29 · Updated 2 years ago
- A fork of http://pydispatcher.sourceforge.net/ with PyPy support ☆16 · Updated 7 years ago
- Python wrapper for Ferret ☆41 · Updated 3 years ago
- Yet another tool to search through your (exported) ChatGPT conversations ☆12 · Updated 8 months ago
- A Ruia plugin for loading javascript - pyppeteer ☆18 · Updated 3 years ago
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more ☆16 · Updated 2 years ago
- Sentence Embedding as a Service ☆15 · Updated last year
- ☆17 · Updated 3 years ago
- Pipeline for converting PDFs to raw text with PaddleOCR ☆23 · Updated last year
- Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more ☆20 · Updated 6 years ago
- Deployment of pywb as a CommonCrawl Index Server ☆21 · Updated 7 years ago
- Find duplicate text files. ☆14 · Updated 4 months ago
- Hybrid Search (BM25 & Vector) with SQLite ☆18 · Updated 9 months ago
- ReconNER, Debug annotated Named Entity Recognition (NER) data for inconsistencies and get insights on improving the quality of your data. ☆35 · Updated 4 years ago
- Python SDK Client for ZincSearch ☆11 · Updated 2 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 · Updated 3 years ago
- framework for making streamcorpus data ☆11 · Updated 8 years ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆17 · Updated 2 years ago
- Travel back in time to debug your Python ⏰ 🐍 ☆10 · Updated 3 years ago
- Build a trie-structured regular expression from a list of words ☆21 · Updated 5 years ago