Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
☆1,245 · Feb 26, 2026 · Updated this week
Alternatives and similar repositories for wtpsplit
Users interested in wtpsplit are comparing it to the libraries listed below.
- A sentence segmenter that actually works! ☆304 · Aug 18, 2020 · Updated 5 years ago
- 🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box. ☆904 · Aug 20, 2024 · Updated last year
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆1,500 · Feb 17, 2026 · Updated last week
- Efficient few-shot learning with Sentence Transformers ☆2,688 · Dec 11, 2025 · Updated 2 months ago
- Text tokenization and sentence segmentation (segtok v2) ☆209 · Mar 12, 2022 · Updated 3 years ago
- State-of-the-Art Text Embeddings ☆18,323 · Updated this week
- General-Purpose Neural Networks for Sentence Boundary Detection ☆73 · Mar 27, 2023 · Updated 2 years ago
- Sentence segmentation with wtpsplit's state-of-the-art Segment any Text (SaT) models ☆36 · Oct 1, 2025 · Updated 5 months ago
- skweak: A software toolkit for weak supervision applied to NLP tasks ☆926 · Sep 2, 2024 · Updated last year
- Leveraging BERT and c-TF-IDF to create easily interpretable topics. ☆7,412 · Feb 20, 2026 · Updated last week
- Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing ☆789 · Jul 22, 2025 · Updated 7 months ago
- Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024 ☆2,837 · Feb 24, 2026 · Updated last week
- Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-… ☆3,859 · May 17, 2025 · Updated 9 months ago
- Fast, Accurate, Lightweight Python library to make State of the Art Embedding ☆2,744 · Jan 9, 2026 · Updated last month
- 💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows ☆12,210 · Feb 22, 2026 · Updated last week
- Onset-and-Offset-Aware Sound Event Detection ☆21 · Feb 10, 2025 · Updated last year
- Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. ☆2,023 · Feb 21, 2026 · Updated last week
- A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coher… ☆1,265 · Jul 24, 2025 · Updated 7 months ago
- Obtain Word Alignments using Pretrained Language Models (e.g., mBERT) ☆389 · Nov 7, 2023 · Updated 2 years ago
- Fast inference engine for Transformer models ☆4,326 · Feb 4, 2026 · Updated 3 weeks ago
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ☆1,599 · Dec 20, 2025 · Updated 2 months ago
- Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets ☆4,875 · Feb 23, 2026 · Updated last week
- Structured Outputs ☆13,456 · Feb 13, 2026 · Updated 2 weeks ago
- A blazing fast inference solution for text embeddings models ☆4,525 · Updated this week
- Minimal keyword extraction with BERT ☆4,116 · Feb 3, 2026 · Updated last month
- SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. ☆872 · Oct 10, 2025 · Updated 4 months ago
- The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o… ☆156 · Jul 14, 2025 · Updated 7 months ago
- Unofficial implementation of ConvNeXt-TTS powered by lightning ☆18 · Oct 20, 2024 · Updated last year
- Punctuation restoration and spell correction experiments. ☆252 · Feb 25, 2021 · Updated 5 years ago
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models ☆1,687 · Oct 23, 2024 · Updated last year
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ☆3,108 · Feb 23, 2026 · Updated last week
- Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker… ☆9,245 · Feb 20, 2026 · Updated last week
- Data augmentation for NLP ☆4,645 · Jun 24, 2024 · Updated last year
- Speech-To-Text forced-alignment Speech processing Universal PERformance Benchmark ☆35 · May 7, 2025 · Updated 9 months ago
- A model that predicts the punctuation of English, Italian, French and German texts. ☆83 · Feb 22, 2023 · Updated 3 years ago
- A Python library for calculating a large variety of metrics from text ☆360 · Jan 30, 2026 · Updated last month
- Language-Agnostic SEntence Representations ☆3,659 · May 2, 2024 · Updated last year
- Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders' ☆1,652 · Dec 4, 2025 · Updated 2 months ago
- Punctuation Restoration using Transformer Models for High-and Low-Resource Languages ☆227 · Jul 29, 2024 · Updated last year