segment-any-text / wtpsplit
Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
☆992 · Updated 3 weeks ago
Alternatives and similar repositories for wtpsplit:
Users who are interested in wtpsplit are comparing it to the libraries listed below:
- 🐍💯 pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box. ☆846 · Updated 8 months ago
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy ☆1,120 · Updated last week
- ⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x. ☆578 · Updated 2 years ago
- ☆510 · Updated 9 months ago
- NeuSpell: A Neural Spelling Correction Toolkit ☆692 · Updated last year
- Evaluate your speech-to-text system with similarity measures such as word error rate (WER) ☆716 · Updated 2 months ago
- SpanMarker for Named Entity Recognition ☆425 · Updated 3 months ago
- Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024 ☆1,955 · Updated 3 weeks ago
- Open neural machine translation models and web services ☆681 · Updated 4 months ago
- Bringing BERT into modernity via both architecture changes and scaling ☆1,329 · Updated last month
- Fast Semantic Text Deduplication ☆638 · Updated this week
- ☆354 · Updated last year
- Things you can do with the token embeddings of an LLM ☆1,437 · Updated 3 weeks ago
- Efficient few-shot learning with Sentence Transformers ☆2,452 · Updated 2 weeks ago
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆127 · Updated 4 months ago
- Fast inference engine for Transformer models ☆3,759 · Updated 2 weeks ago
- Minimal extension of OpenAI's Whisper adding speaker diarization with special tokens ☆493 · Updated last year
- SGPT: GPT Sentence Embeddings for Semantic Search ☆865 · Updated last year
- Easily embed, cluster and semantically label text datasets ☆526 · Updated last year
- A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets. ☆1,780 · Updated 2 months ago
- ☆360 · Updated last year
- Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. ☆1,809 · Updated 2 weeks ago
- Tools to download and cleanup Common Crawl data ☆1,001 · Updated 2 years ago
- Neural Search ☆354 · Updated last month
- Fast State-of-the-Art Static Embeddings ☆1,359 · Updated this week
- A Collection of BM25 Algorithms in Python ☆1,146 · Updated 6 months ago
- SPLADE: sparse neural search (SIGIR21, SIGIR22) ☆837 · Updated 11 months ago
- 🦙 Integrating LLMs into structured NLP pipelines ☆1,228 · Updated 3 months ago
- ☆1,206 · Updated 8 months ago
- Punctuation Restoration using Transformer Models for High-and Low-Resource Languages ☆212 · Updated 8 months ago