segment-any-text / wtpsplit
Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
☆858, updated 3 weeks ago
Alternatives and similar repositories for wtpsplit:
Users who are interested in wtpsplit are comparing it to the libraries listed below.
- pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box. ☆831, updated 6 months ago
- ⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x. ☆572, updated last year
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy. ☆1,019, updated last month
- Integrating LLMs into structured NLP pipelines. ☆1,193, updated last month
- Open neural machine translation models and web services. ☆655, updated 2 months ago
- Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'. ☆1,420, updated 3 weeks ago
- Bringing BERT into modernity via both architecture changes and scaling. ☆1,208, updated this week
- SpanMarker for Named Entity Recognition. ☆417, updated last month
- Obtain Word Alignments using Pretrained Language Models (e.g., mBERT). ☆357, updated last year
- A neural word aligner based on multilingual BERT. ☆338, updated 2 years ago
- State-of-the-art LLM-based translation models. ☆486, updated 3 weeks ago
- SGPT: GPT Sentence Embeddings for Semantic Search. ☆861, updated last year
- A sentence segmenter that actually works! ☆304, updated 4 years ago
- A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets. ☆1,709, updated 2 weeks ago
- ☆349, updated last year
- A Neural Framework for MT Evaluation. ☆542, updated last month
- NeuSpell: A Neural Spelling Correction Toolkit. ☆684, updated last year
- Fast Semantic Text Deduplication. ☆532, updated last week
- Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard. ☆514, updated last week
- A Collection of BM25 Algorithms in Python. ☆1,107, updated 4 months ago
- The most accurate natural language detection library for Python, suitable for short text and mixed-language text. ☆1,241, updated last week
- 80x faster and 95% accurate language identification with Fasttext. ☆146, updated last year
- The pipeline for the OSCAR corpus. ☆166, updated last year
- A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB te… ☆267, updated last month
- Simply, faster, sentence-transformers. ☆141, updated 5 months ago
- Tools to download and cleanup Common Crawl data. ☆983, updated last year
- Evaluate your speech-to-text system with similarity measures such as word error rate (WER). ☆688, updated last week
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023. ☆116, updated 2 months ago
- BigTranslate: Augmenting Large Language Models with Multilingual Translation Capability over 100 Languages. ☆221, updated last year
- Multilingual sentence alignment using sentence embeddings. ☆108, updated 3 months ago