Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
★1,279 · Apr 11, 2026 · Updated 3 weeks ago
Alternatives and similar repositories for wtpsplit
Users who are interested in wtpsplit are comparing it to the libraries listed below.
- A sentence segmenter that actually works! ★304 · Aug 18, 2020 · Updated 5 years ago
- pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box. ★914 · Aug 20, 2024 · Updated last year
- Sentence segmentation with wtpsplit's state-of-the-art Segment any Text (SaT) models ★38 · Updated this week
- Efficient few-shot learning with Sentence Transformers ★2,724 · Apr 17, 2026 · Updated 2 weeks ago
- Fast BM25 search in Python, powered by Numpy and Numba ★1,648 · Updated this week
- General-Purpose Neural Networks for Sentence Boundary Detection ★73 · Mar 27, 2023 · Updated 3 years ago
- Text tokenization and sentence segmentation (segtok v2) ★209 · Mar 12, 2022 · Updated 4 years ago
- The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o… ★160 · Jul 14, 2025 · Updated 9 months ago
- State-of-the-Art Text Embeddings ★18,615 · Updated this week
- skweak: A software toolkit for weak supervision applied to NLP tasks ★926 · Sep 2, 2024 · Updated last year
- Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) ★3,132 · Updated this week
- Leveraging BERT and c-TF-IDF to create easily interpretable topics. ★7,578 · Feb 20, 2026 · Updated 2 months ago
- Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing ★795 · Jul 22, 2025 · Updated 9 months ago
- SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. ★885 · Oct 10, 2025 · Updated 6 months ago
- 💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows ★12,453 · Updated this week
- Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-… ★3,914 · May 17, 2025 · Updated 11 months ago
- Fast inference engine for Transformer models ★4,457 · Feb 4, 2026 · Updated 3 months ago
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. ★1,612 · Dec 20, 2025 · Updated 4 months ago
- A blazing fast inference solution for text embeddings models ★4,755 · Apr 17, 2026 · Updated 2 weeks ago
- Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets ★4,954 · Apr 27, 2026 · Updated last week
- Fast, Accurate, Lightweight Python library to make State of the Art Embedding ★2,917 · Apr 21, 2026 · Updated last week
- Structured Outputs ★13,776 · Apr 16, 2026 · Updated 2 weeks ago
- A tokenizer and sentence splitter for German and English web and social media texts. ★153 · Dec 9, 2024 · Updated last year
- Onset-and-Offset-Aware Sound Event Detection ★22 · Feb 10, 2025 · Updated last year
- Speech-To-Text forced-alignment Speech processing Universal PERformance Benchmark ★36 · May 7, 2025 · Updated 11 months ago
- Punctuation restoration and spell correction experiments. ★253 · Feb 25, 2021 · Updated 5 years ago
- Efficient, scalable and enterprise-grade CPU/GPU inference server for 🤗 Hugging Face transformer models ★1,687 · Oct 23, 2024 · Updated last year
- A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coher… ★1,267 · Jul 24, 2025 · Updated 9 months ago
- Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. ★2,051 · Updated this week
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning ★30 · Jan 25, 2023 · Updated 3 years ago
- A model that predicts the punctuation of English, Italian, French and German texts. ★87 · Apr 21, 2026 · Updated last week
- Minimal keyword extraction with BERT ★4,163 · Feb 3, 2026 · Updated 3 months ago
- Data augmentation for NLP ★4,656 · Jun 24, 2024 · Updated last year
- Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi… ★3,199 · Apr 27, 2026 · Updated last week
- A Python library for calculating a large variety of metrics from text ★363 · Mar 20, 2026 · Updated last month
- Zero and Few shot named entity & relationships recognition ★402 · Sep 17, 2025 · Updated 7 months ago
- ★29 · Jun 23, 2022 · Updated 3 years ago
- [EMNLP 2020] Obtain Word Alignments using Pretrained Language Models (e.g., mBERT) ★394 · Nov 7, 2023 · Updated 2 years ago
- A simple command line tool to calculate WER for ASR. ★14 · Oct 14, 2024 · Updated last year