segment-any-text / wtpsplit
Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
★1,222, updated last month
Alternatives and similar repositories for wtpsplit
Users who are interested in wtpsplit are comparing it to the libraries listed below.
- 🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box. (★894, updated last year)
- Fast lexical search implementing BM25 in Python using NumPy, Numba and SciPy (★1,447, updated 3 weeks ago)
- Integrating LLMs into structured NLP pipelines (★1,361, updated last year)
- Open neural machine translation models and web services (★760, updated last month)
- NeuSpell: A Neural Spelling Correction Toolkit (★702, updated 2 years ago)
- SpanMarker for Named Entity Recognition (★462, updated last year)
- Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024 (★2,680, updated this week)
- State-of-the-art LLM-based translation models. (★577, updated 9 months ago)
- Efficient few-shot learning with Sentence Transformers (★2,664, updated last month)
- A Collection of BM25 Algorithms in Python (★1,291, updated last year)
- Bringing BERT into modernity via both architecture changes and scaling (★1,607, updated 6 months ago)
- SGPT: GPT Sentence Embeddings for Semantic Search (★872, updated last year)
- 80x faster and 95% accurate language identification with Fasttext (★163, updated last year)
- Fast Semantic Text Deduplication & Filtering (★863, updated last week)
- SPLADE: sparse neural search (SIGIR21, SIGIR22) (★965, updated last year)
- Late Interaction Models Training & Retrieval (★679, updated this week)
- Training open neural machine translation models (★391, updated 9 months ago)
- Contextual word checker for better suggestions (not actively maintained) (★417, updated 11 months ago)
- Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing (★784, updated 5 months ago)
- Easily embed, cluster and semantically label text datasets (★589, updated last year)
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 (★185, updated last month)
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes (★211, updated 3 months ago)
- Text to sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder. (★256, updated 3 years ago)
- Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard (★569, updated 2 months ago)
- ⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x. (★589, updated 2 years ago)
- (★1,253, updated last year)
- String-to-String Algorithms for Natural Language Processing (★563, updated last year)
- A very simple news crawler with a funny name (★427, updated 3 weeks ago)
- A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models. (★1,587, updated 3 weeks ago)
- Single-document unsupervised keyword extraction (★1,809, updated last month)