superlinear-ai/wtpsplit-lite

✂️ Sentence segmentation with wtpsplit's state-of-the-art Segment any Text (SaT) models

☆39

Alternatives and similar repositories for wtpsplit-lite

Users who are interested in wtpsplit-lite are comparing it to the libraries listed below.

fyvo / WMT-Biomed-Test
☆13 · Aug 23, 2024 · Updated last year
hucsmn / suffix_array
Suffix array construction and searching algorithms for in-memory binary data.
☆13 · Sep 10, 2022 · Updated 3 years ago
sanderland / script_tok
Code for the paper "BPE stays on SCRIPT", "Which Pieces Does Unigram Tokenization Really Need?" and MinGram
☆18 · Jun 26, 2026 · Updated 3 weeks ago
superlinear-ai / conformal-tights
👖 Conformal Tights adds conformal prediction of coherent quantiles and intervals to any scikit-learn regressor or Darts forecaster
☆118 · May 1, 2026 · Updated 2 months ago
superlinear-ai / python-gpu
🐳 Python GPU adds a minimal install of CUDA and cuDNN on top of the official python:3.x-slim base image
☆20 · Dec 20, 2024 · Updated last year
kensho-technologies / pathpiece
PathPiece tokenizer
☆14 · Nov 10, 2024 · Updated last year
segment-any-text / wtpsplit
Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
☆1,320 · Jul 6, 2026 · Updated 2 weeks ago
gautierdag / tokenizer-bench
Code for the paper "Getting the most out of your tokenizer for pre-training and domain adaptation"
☆22 · Feb 14, 2024 · Updated 2 years ago
cisnlp / mPLM-Sim
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
☆11 · Jan 19, 2024 · Updated 2 years ago
alea-institute / kl3m-data
KL3M training data collection and preprocessing
☆22 · Apr 14, 2025 · Updated last year
stefan-it / ukrainian-electra
Ukrainian ELECTRA model
☆12 · Mar 11, 2023 · Updated 3 years ago
VITA-Group / TAPE
[ICML'25] "Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding" by Jiajun Zhu, Peihao Wang, Ruisi…
☆15 · Jun 6, 2025 · Updated last year
rewicks / ersatz
☆51 · Jul 25, 2024 · Updated 2 years ago
hplt-project / OpusTrainer
Curriculum training
☆22 · Jun 25, 2025 · Updated last year
alea-institute / nupunkt
Next-generation Punkt sentence boundary detection with zero dependencies
☆32 · Nov 18, 2025 · Updated 8 months ago
sunnweiwei / MAIR
MAIR: A Massive Benchmark for Evaluating Instructed Retrieval. Evaluate your retrieval models on 126 diverse tasks. [EMNLP 2024]
☆28 · Nov 3, 2024 · Updated last year
google-research / metricx
☆146 · Jul 2, 2026 · Updated 3 weeks ago
LibreTranslate / nllu
No Language Left Unlocked: scalable backtranslation of NLLB models
☆14 · Aug 4, 2025 · Updated 11 months ago
impresso / named-entity-tutorial-dh2019
Tutorial on NE processing for Digital Humanities - DH Utrecht 2019
☆24 · Jul 18, 2019 · Updated 7 years ago
iPieter / llmq
A Scheduler for Batched LLM Inference
☆19 · Oct 5, 2025 · Updated 9 months ago
Pleias / OCRoscope
Small Python package to measure OCR quality and other related metrics.
☆26 · Feb 19, 2024 · Updated 2 years ago
AnesBenmerzoug / langsfer
A library for language transfer methods and algorithms.
☆16 · Feb 6, 2026 · Updated 5 months ago
yannikbenz / zeroe
From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks
☆15 · Feb 23, 2023 · Updated 3 years ago
LSX-UniWue / SuperGLEBer
German Language Understanding Evaluation Benchmark @NAACL24
☆22 · Dec 11, 2025 · Updated 7 months ago
fitnr / unwiki
Python module to remove wiki markup text.
☆10 · Jan 15, 2016 · Updated 10 years ago
jncsnlp / FSL-Multimodal-Rumor-Detection
☆11 · Feb 23, 2023 · Updated 3 years ago
hotco87 / gradio_ko_chat
☆12 · Apr 28, 2023 · Updated 3 years ago
scari / high_performance_python
Code for the book "High Performance Python" by Micha Gorelick and Ian Ozsvald with O'Reilly
☆11 · Jul 19, 2016 · Updated 10 years ago
DEFI-COLaF / LADaS
Layout Analysis Dataset with Segmonto (LADaS)
☆25 · May 29, 2026 · Updated last month
laurieburchell / open-lid-dataset
Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023)
☆77 · Apr 1, 2025 · Updated last year
pnuailab / parser
Korean sentence analysis system BCD-KL-Parser
☆10 · Jun 23, 2020 · Updated 6 years ago
mansicer / Q-Adapter
Implementation of ICLR 2025 paper "Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation"
☆18 · Oct 5, 2024 · Updated last year
cisnlp / GlotLID
[EMNLP 2023] 💬 Language Identification with Support for More Than 2000 Labels
☆212 · Apr 15, 2026 · Updated 3 months ago
mt-upc / SHAS
SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
☆44 · Feb 9, 2023 · Updated 3 years ago
Shentao-YANG / Preference_Grounded_Guidance
Source code for "Preference-grounded Token-level Guidance for Language Model Fine-tuning" (NeurIPS 2023).
☆17 · Jan 8, 2025 · Updated last year
fdschmidt93 / trident-nllb-llm2vec
Repository for "Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages"
☆15 · Oct 4, 2024 · Updated last year
sekstini / basedxl
☆18 · Mar 18, 2024 · Updated 2 years ago
GreycLab / gmic-py
Python binding for the G'MIC Image Processing Framework
☆11 · Nov 14, 2025 · Updated 8 months ago
CONE-MT / Lego-MT
☆10 · Mar 22, 2024 · Updated 2 years ago