Marcnuth / deduplication
Remove duplicate documents/videos/images via popular algorithms such as SimHash, SpotSig, Shingling, etc.
☆18 Updated last year
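For readers unfamiliar with the algorithms named in the description, the following is a minimal sketch of word shingling combined with a 64-bit SimHash fingerprint. It is an illustration under stated assumptions, not this repository's implementation or API: the function names `shingles`, `simhash`, and `hamming`, the 3-word shingle size, and the use of MD5 as the feature hash are all hypothetical choices made for the example.

```python
import hashlib

BITS = 64  # fingerprint width (assumption for this sketch)


def shingles(text, k=3):
    """Return the set of k-word shingles of `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single feature.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def simhash(features):
    """Compute a 64-bit SimHash fingerprint from an iterable of string features."""
    counts = [0] * BITS
    for feature in features:
        # Take the first 8 bytes of the MD5 digest as a 64-bit feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            # Each feature votes +1 or -1 on every bit position.
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumped over the lazy dog"
distance = hamming(simhash(shingles(doc_a)), simhash(shingles(doc_b)))
```

Near-duplicate texts tend to produce fingerprints with a small Hamming distance, so a deduplication pass would typically compare `distance` against a threshold; a suitable threshold depends on the data and is not specified here.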
Alternatives and similar repositories for deduplication:
Users who are interested in deduplication compare it to the libraries listed below.
- https://mimesniff.spec.whatwg.org/ implementation for Python ☆13 Updated last year
- A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL ☆29 Updated 2 years ago
- Indri search implementation on top of Lucene search engine ☆34 Updated last year
- Loadable spellfix1 extension for sqlite as python package ☆26 Updated last year
- Sentence Embedding as a Service ☆15 Updated last year
- code and data used to build a training dataset for dragnet models ☆10 Updated 4 years ago
- A file utility for accessing both local and remote files through a unified interface. ☆40 Updated last week
- Neural Elastic Inference and Search ☆19 Updated 5 years ago
- A summarization dataset consisting of over 17k open access business journal articles. ☆10 Updated 4 years ago
- Application configuration and scripts for search on https://docs.vespa.ai/ ☆12 Updated last month
- extract difference between two html pages ☆32 Updated 6 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- Functional composable pipelines allowing clean separation of the business logic and its implementation ☆11 Updated 10 months ago
- Neural Solr = Solr 9 + Mighty Inference + Node ☆17 Updated 2 years ago
- An open-source NLP library: fast text cleaning and preprocessing ☆23 Updated 3 years ago
- Faster, modernized fork of the language identification tool langid.py ☆55 Updated 5 months ago
- Integration between Reaction ECommerce and Accelerated Text to provide product descriptions for an e-shop. ☆12 Updated 4 years ago
- ☆16 Updated 3 years ago
- Detecting gibberish as a type of sentiment analysis with GPT2 ☆24 Updated 4 years ago
- Smaug-72B topped the Hugging Face LLM leaderboard and it’s the first model with an average score of 80, making it the world’s best open-s… ☆17 Updated this week
- ☆13 Updated 11 months ago
- Fast and accurate natural language detection. Detector written in Python. Nito-ELD, ELD. ☆17 Updated last year
- Summary Explorer is a tool to visually explore the state-of-the-art in text summarization. ☆44 Updated 11 months ago
- Find duplicate text files. ☆14 Updated 3 months ago
- A server code for serving BERT-based models for text classification. It is designed by SerpApi for heavy-load prototyping and production … ☆14 Updated last year
- Hybrid Search (BM25 & Vector) with SQLite ☆15 Updated 8 months ago
- SuperMinHash: A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, Simhash and SimhashIndex ☆19 Updated 2 years ago
- Indexing GDELT database into Elasticsearch, entire database including the -each 15 minutes- real time events ☆13 Updated 5 years ago
- Pluggable DSL that uses pipes to perform a series of linear transformations to extract data ☆16 Updated 9 months ago
- numpy ufuncs for vector similarity ☆14 Updated last year