segment-any-text / wtpsplit
Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
☆897 · Updated 3 weeks ago
Alternatives and similar repositories for wtpsplit:
Users who are interested in wtpsplit are comparing it to the libraries listed below:
- 🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection module that works out of the box. ☆842 · Updated 7 months ago
- ⚡ Boost inference speed of T5 models by 5x & reduce the model size by 3x. ☆578 · Updated last year
- State-of-the-art LLM-based translation models. ☆508 · Updated 2 months ago
- 🦙 Integrating LLMs into structured NLP pipelines ☆1,218 · Updated 2 months ago
- SpanMarker for Named Entity Recognition ☆424 · Updated 2 months ago
- Evaluate your speech-to-text system with similarity measures such as word error rate (WER) ☆705 · Updated last month
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆1,078 · Updated last week
- ✔️ Contextual word checker for better suggestions (not actively maintained) ☆412 · Updated 2 months ago
- Tools to download and cleanup Common Crawl data ☆993 · Updated last year
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆124 · Updated 4 months ago
- Text to sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder. ☆243 · Updated 2 years ago
- NeuSpell: A Neural Spelling Correction Toolkit ☆691 · Updated last year
- SGPT: GPT Sentence Embeddings for Semantic Search ☆864 · Updated last year
- String-to-String Algorithms for Natural Language Processing ☆542 · Updated 8 months ago
- A model that predicts the punctuation of English, Italian, French and German texts. ☆80 · Updated 2 years ago
- A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB te… ☆270 · Updated 2 months ago
- Punctuation Restoration using Transformer Models for High- and Low-Resource Languages ☆211 · Updated 8 months ago
- Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing ☆748 · Updated 5 months ago
- Zero and Few shot named entity & relationships recognition ☆361 · Updated 4 months ago
- Segment documents into coherent parts using word embeddings. ☆149 · Updated 3 years ago
- A neural word aligner based on multilingual BERT ☆344 · Updated 3 years ago
- A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets. ☆1,751 · Updated last month
- Language model fine-tuning on NER with an easy interface and cross-domain evaluation. "T-NER: An All-Round Python Library for Transformer… ☆386 · Updated last year
- Open neural machine translation models and web services ☆671 · Updated 3 months ago
- Fast State-of-the-Art Static Embeddings ☆1,129 · Updated this week
- ☆503 · Updated 8 months ago
- SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. ☆721 · Updated 2 weeks ago
- ☆359 · Updated last year
- ☆352 · Updated last year
- The most accurate natural language detection library for Python, suitable for short text and mixed-language text ☆1,301 · Updated last week