MinishLab/semhash

Fast Multimodal Semantic Deduplication & Filtering

☆946

Alternatives and similar repositories for semhash

Users who are interested in semhash are comparing it to the libraries listed below.

MinishLab / vicinity
Lightweight Nearest Neighbors with Flexible Backends
☆347 · May 24, 2026 · Updated last month
MinishLab / tokenlearn
Pre-train Static Word Embeddings
☆108 · Jun 9, 2026 · Updated last month
MinishLab / model2vec
Fast State-of-the-Art Static Embeddings
☆2,159 · Jun 6, 2026 · Updated last month
lightonai / pylate
Late Interaction Models Training & Retrieval
☆875 · Jul 13, 2026 · Updated last week
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,334 · Jul 13, 2026 · Updated last week
huggingface / setfit
Efficient few-shot learning with Sentence Transformers
☆2,772 · May 26, 2026 · Updated last month
stephantul / pynife
Nearly Inference Free Embeddings: make your RAG queries 500x faster
☆80 · Apr 27, 2026 · Updated 2 months ago
Knowledgator / GLiClass
Generalist and Lightweight Model for Text Classification
☆233 · Updated this week
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,214 · Updated this week
urchade / GLiNER
Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts)
☆3,409 · Updated this week
Pringled / pyversity
Fast Diversification for Search & Retrieval
☆492 · May 24, 2026 · Updated last month
xhluca / bm25s
Fast BM25 search in Python, powered by Numpy and Numba
☆1,740 · Jul 7, 2026 · Updated 2 weeks ago
huggingface / text-clustering
Easily embed, cluster and semantically label text datasets
☆610 · Mar 28, 2024 · Updated 2 years ago
argilla-io / synthetic-data-generator
Build datasets using natural language
☆586 · Sep 19, 2025 · Updated 10 months ago
dottxt-ai / outlines
Structured Outputs
☆14,573 · Updated this week
argilla-io / argilla
Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
☆5,039 · Jul 13, 2026 · Updated last week
AnswerDotAI / RAGatouille
Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-…
☆3,939 · May 17, 2025 · Updated last year
qdrant / fastembed
Fast, Accurate, Lightweight Python library to make State of the Art Embedding
☆3,094 · Updated this week
Pleias / Pleias-RAG-Library
Python library to use Pleias-RAG models
☆72 · Jul 1, 2026 · Updated 2 weeks ago
AnswerDotAI / ModernBERT
Bringing BERT into modernity via both architecture changes and scaling
☆1,701 · Mar 1, 2026 · Updated 4 months ago
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,486 · Jun 29, 2026 · Updated 3 weeks ago
IBM / fastfit
FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes
☆220 · Sep 18, 2025 · Updated 10 months ago
mlabonne / llm-datasets
Curated list of datasets and tools for post-training.
☆4,699 · Apr 29, 2026 · Updated 2 months ago
MantisAI / sieves
Plug-and-play document AI with zero-shot models.
☆126 · May 11, 2026 · Updated 2 months ago
tomaarsen / SpanMarkerNER
SpanMarker for Named Entity Recognition
☆477 · Apr 10, 2026 · Updated 3 months ago
rasyosef / splade-index
Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba
☆38 · Oct 16, 2025 · Updated 9 months ago
huggingface / cosmopedia
☆572 · Nov 20, 2024 · Updated last year
lightonai / fast-plaid
High-Performance Engine for Multi-Vector Search
☆268 · May 28, 2026 · Updated last month
AnswerDotAI / rerankers
A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models.
☆1,624 · Dec 20, 2025 · Updated 7 months ago
qdrant / block-embeddings
Trainable embedding transformation for confidence estimation, feature extraction, explainability and conversion from dense to sparse.
☆28 · Jun 23, 2026 · Updated 3 weeks ago
mixedbread-ai / baguetter
Baguetter is a flexible, efficient, and hackable search engine library implemented in Python. It's designed for quickly benchmarking, imp…
☆210 · Aug 31, 2024 · Updated last year
MaartenGr / BERTopic
Leveraging BERT and c-TF-IDF to create easily interpretable topics.
☆7,748 · May 13, 2026 · Updated 2 months ago
Knowledgator / FlashDeBERTa
Truly flash implementation of DeBERTa disentangled attention mechanism.
☆90 · Feb 10, 2026 · Updated 5 months ago
davidberenstein1957 / fast-sentence-transformers
Simply, faster, sentence-transformers
☆144 · Aug 27, 2024 · Updated last year
AnswerDotAI / fastdata
☆160 · Dec 2, 2024 · Updated last year
webis-de / small-text
Active Learning for Text Classification in Python
☆646 · May 24, 2026 · Updated last month
MinishLab / model2vec-rs
Official Rust Implementation of Model2Vec
☆197 · May 24, 2026 · Updated last month
mixedbread-ai / batched
The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o…
☆161 · Jul 14, 2025 · Updated last year
x-tabdeveloping / turftopic
Robust and fast topic models with sentence-transformers.
☆118 · Updated this week