segment-any-text/wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

☆1,320

Alternatives and similar repositories for wtpsplit

Users who are interested in wtpsplit are comparing it to the libraries listed below.

notAI-tech / deepsegment
A sentence segmenter that actually works!
☆304 · Aug 18, 2020 · Updated 5 years ago
superlinear-ai / wtpsplit-lite
✂️ Sentence segmentation with wtpsplit's state-of-the-art Segment any Text (SaT) models
☆39 · May 2, 2026 · Updated 2 months ago
nipunsadvilkar / pySBD
🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.
☆927 · Aug 20, 2024 · Updated last year
huggingface / setfit
Efficient few-shot learning with Sentence Transformers
☆2,777 · May 26, 2026 · Updated last month
dbmdz / deep-eos
General-Purpose Neural Networks for Sentence Boundary Detection
☆74 · Mar 27, 2023 · Updated 3 years ago
fnl / syntok
Text tokenization and sentence segmentation (segtok v2)
☆211 · Mar 12, 2022 · Updated 4 years ago
xhluca / bm25s
Fast BM25 search in Python, powered by Numpy and Numba
☆1,746 · Updated this week
mixedbread-ai / batched
The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o…
☆161 · Jul 14, 2025 · Updated last year
huggingface / sentence-transformers
State-of-the-Art Embeddings, Retrieval, and Reranking
☆18,944 · Updated this week
NorskRegnesentral / skweak
skweak: A software toolkit for weak supervision applied to NLP tasks
☆925 · Sep 2, 2024 · Updated last year
MaartenGr / BERTopic
Leveraging BERT and c-TF-IDF to create easily interpretable topics.
☆7,756 · May 13, 2026 · Updated 2 months ago
facebookresearch / SONAR
SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders.
☆899 · Oct 10, 2025 · Updated 9 months ago
nlp-uoregon / trankit
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
☆795 · Jul 22, 2025 · Updated last year
neuml / txtai
💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows
☆12,751 · Updated this week
AnswerDotAI / RAGatouille
Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-…
☆3,943 · May 17, 2025 · Updated last year
argilla-io / argilla
Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
☆5,051 · Updated this week
AnswerDotAI / rerankers
A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models.
☆1,626 · Dec 20, 2025 · Updated 7 months ago
OpenNMT / CTranslate2
Fast inference engine for Transformer models
☆4,585 · Jul 3, 2026 · Updated 3 weeks ago
huggingface / text-embeddings-inference
A blazing fast inference solution for text embeddings models
☆4,959 · Updated this week
tsproisl / SoMaJo
A tokenizer and sentence splitter for German and English web and social media texts.
☆153 · Dec 9, 2024 · Updated last year
notAI-tech / fastPunct
Punctuation restoration and spell correction experiments.
☆253 · Feb 25, 2021 · Updated 5 years ago
urchade / GLiNER
Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts)
☆3,428 · Updated this week
dottxt-ai / outlines
Structured Outputs
☆15,331 · Updated this week
oliverguhr / fullstop-deep-punctuation-prediction
A model that predicts the punctuation of English, Italian, French and German texts.
☆90 · Apr 21, 2026 · Updated 3 months ago
ELS-RD / transformer-deploy
Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models 🚀
☆1,690 · Oct 23, 2024 · Updated last year
facebookresearch / LASER
Language-Agnostic SEntence Representations
☆3,661 · May 2, 2024 · Updated 2 years ago
castorini / pyserini
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
☆2,102 · Jul 16, 2026 · Updated last week
qdrant / fastembed
Fast, Accurate, Lightweight Python library to make State of the Art Embedding
☆3,104 · Updated this week
MilaNLProc / contextualized-topic-models
A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coher…
☆1,272 · Jul 24, 2025 · Updated last year
malteos / clp-transfer
Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning
☆30 · Jan 25, 2023 · Updated 3 years ago
MinishLab / semhash
Fast Multimodal Semantic Deduplication & Filtering
☆953 · May 24, 2026 · Updated 2 months ago
makcedward / nlpaug
Data augmentation for NLP
☆4,663 · Updated this week
MaartenGr / KeyBERT
Minimal keyword extraction with BERT
☆4,207 · May 13, 2026 · Updated 2 months ago
pmbaumgartner / spacy-setfit-textcat
☆29 · Jun 23, 2022 · Updated 4 years ago
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,344 · Updated this week
webis-de / small-text
Active Learning for Text Classification in Python
☆646 · May 24, 2026 · Updated 2 months ago
HLasse / TextDescriptives
A Python library for calculating a large variety of metrics from text
☆366 · May 5, 2026 · Updated 2 months ago
cisnlp / simalign
[EMNLP 2020] Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)
☆398 · Nov 7, 2023 · Updated 2 years ago
laurieburchell / open-lid-dataset
Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023)
☆77 · Apr 1, 2025 · Updated last year