Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆16 · Updated last year
Related projects
Alternatives and complementary repositories for deduplication
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 · Updated 5 months ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆16 · Updated 2 years ago
- A curated list of awesome open source tools and commercial products to catalog, version, and manage data 🚀 ☆27 · Updated 2 years ago
- Fast and accurate natural language detection. Detector written in Python. Nito-ELD, ELD. ☆13 · Updated last year
- A file utility for accessing both local and remote files through a unified interface. ☆36 · Updated 3 months ago
- Loadable spellfix1 extension for sqlite as python package ☆25 · Updated 7 months ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 · Updated 3 years ago
- Application configuration and scripts for search on https://docs.vespa.ai/ ☆13 · Updated this week
- A simple semantic search engine for scientific papers. ☆27 · Updated last year
- Sentence Embedding as a Service ☆14 · Updated last year
- gRPC server for hnswlib ☆14 · Updated last year
- 🐍 A curated list of awesome python environment. ☆10 · Updated 4 years ago
- Datasette plugin for authenticating access using API tokens ☆12 · Updated 2 months ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆32 · Updated last year
- Python context manager to communicate with a subprocess using iterables: for when data is too big to fit in memory and has to be streamed ☆9 · Updated last month
- Efficiently computing & storing token n-grams from large corpora ☆15 · Updated last month
- alvd = A Lightweight Vald. A lightweight distributed vector search engine that works without K8s. ☆49 · Updated 3 years ago
- A utility for labeling clusters of text data. ☆28 · Updated 3 years ago
- Datasette plugin for searching all searchable tables at once ☆19 · Updated 2 months ago
- Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more ☆20 · Updated 5 years ago
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop. ☆9 · Updated 3 years ago
- framework for making streamcorpus data ☆11 · Updated 7 years ago
- Scripts supporting the development and serving the Roots Search Tool - https://hf.co/spaces/bigscience-data/roots-search ☆10 · Updated last year
- Documentation effort for the BookCorpus dataset ☆33 · Updated 3 years ago
- Python command line tool to manage multiple sites/apps/files with rsync. ☆84 · Updated last month
- My dot files in one place - extensively edited over time. Your mileage may vary ☆2 · Updated 8 years ago
- Tools to construct and process webgraphs from Common Crawl data ☆80 · Updated this week
- Tools for encoding Magic: The Gathering cards into a form suitable for AI text generation ☆18 · Updated 3 years ago
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆24 · Updated 4 years ago
- Interpretable feature construction from taxonomies for text classification ☆18 · Updated 2 years ago