huggingface/text-clustering

Easily embed, cluster and semantically label text datasets

☆610

Alternatives and similar repositories for text-clustering

Users who are interested in text-clustering are comparing it to the libraries listed below.

huggingface / cosmopedia
☆572 · Nov 20, 2024 · Updated last year
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,219 · Updated this week
huggingface / llm-swarm
Manage scalable open LLM inference endpoints in Slurm clusters
☆289 · Jul 11, 2024 · Updated 2 years ago
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,341 · Updated this week
huggingface / lighteval
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
☆2,493 · Jun 29, 2026 · Updated 3 weeks ago
huggingface / setfit
Efficient few-shot learning with Sentence Transformers
☆2,775 · May 26, 2026 · Updated last month
huggingface / nanotron
Minimalistic large language model 3D-parallelism training
☆2,761 · May 26, 2026 · Updated last month
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆7,254 · Jun 17, 2026 · Updated last month
AnswerDotAI / rerankers
A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models.
☆1,625 · Dec 20, 2025 · Updated 7 months ago
BatsResearch / bonito
A lightweight library for generating synthetic instruction tuning datasets for your data without GPT.
☆831 · Jul 15, 2025 · Updated last year
huggingface / alignment-handbook
Robust recipes to align language models with human and AI preferences
☆5,641 · May 26, 2026 · Updated last month
MinishLab / semhash
Fast Multimodal Semantic Deduplication & Filtering
☆948 · May 24, 2026 · Updated last month
IlyasMoutawwakil / py-txi
A Python wrapper around HuggingFace's TGI (text-generation-inference) and TEI (text-embedding-inference) servers.
☆32 · Sep 19, 2025 · Updated 10 months ago
databricks / lilac
Curate better data for LLMs
☆1,072 · Mar 19, 2024 · Updated 2 years ago
google-research / deduplicate-text-datasets
☆1,270 · Jul 30, 2024 · Updated last year
mlabonne / llm-datasets
Curated list of datasets and tools for post-training.
☆4,703 · Apr 29, 2026 · Updated 2 months ago
AnswerDotAI / RAGatouille
Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-…
☆3,941 · May 17, 2025 · Updated last year
davanstrien / awesome-synthetic-datasets
awesome synthetic (text) datasets
☆335 · Jan 8, 2026 · Updated 6 months ago
argilla-io / argilla
Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
☆5,044 · Updated this week
datadreamer-dev / DataDreamer
DataDreamer: Prompt. Generate Synthetic Data. Train & Align Models. 🤖💤
☆1,115 · Feb 2, 2025 · Updated last year
charlesdedampierre / BunkaTopics
🗺️ Data Cleaning and Textual Data Visualization 🗺️
☆201 · May 23, 2025 · Updated last year
lightonai / pylate
Late Interaction Models Training & Retrieval
☆876 · Jul 13, 2026 · Updated last week
TutteInstitute / datamapplot
Creating beautiful plots of data maps
☆1,020 · Updated this week
mlfoundations / dclm
DataComp for Language Models
☆1,455 · Sep 9, 2025 · Updated 10 months ago
axolotl-ai-cloud / axolotl
Go ahead and axolotl questions
☆12,232 · Updated this week
stanford-futuredata / ColBERT
ColBERT: state-of-the-art neural search (SIGIR'20, TACL'21, NeurIPS'21, NAACL'22, CIKM'22, ACL'23, EMNLP'23)
☆3,902 · Oct 14, 2025 · Updated 9 months ago
MaartenGr / BERTopic
Leveraging BERT and c-TF-IDF to create easily interpretable topics.
☆7,750 · May 13, 2026 · Updated 2 months ago
xhluca / bm25s
Fast BM25 search in Python, powered by Numpy and Numba
☆1,741 · Updated this week
nomic-ai / contrastors
Train Models Contrastively in Pytorch
☆798 · Mar 26, 2025 · Updated last year
dottxt-ai / outlines
Structured Outputs
☆15,101 · Updated this week
microsoft / rho
Repo for Rho-1: Token-level Data Selection & Selective Pretraining of LLMs.
☆470 · Apr 18, 2024 · Updated 2 years ago
taylorai / galactic
data cleaning and curation for unstructured text
☆329 · Aug 6, 2024 · Updated last year
huggingface / text-generation-inference
Large Language Model Text Generation Inference
☆10,880 · Mar 21, 2026 · Updated 4 months ago
magpie-align / magpie
[ICLR 2025] Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing. Your efficient and high-quality synthetic data …
☆874 · Mar 17, 2025 · Updated last year
koaning / bulk
A Simple Bulk Labelling Tool
☆599 · Jul 29, 2025 · Updated 11 months ago
yifanzhang-pro / AutoMathText
[ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts (https://huggingface.co/papers…
☆92 · Nov 23, 2025 · Updated 7 months ago
urchade / GLiNER
Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts)
☆3,415 · Updated this week
tomaarsen / SpanMarkerNER
SpanMarker for Named Entity Recognition
☆477 · Apr 10, 2026 · Updated 3 months ago
explosion / curated-transformers
🤖 A PyTorch library of curated Transformer models and their composable components
☆892 · Apr 17, 2024 · Updated 2 years ago