segment-any-text / wtpsplit
Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
☆1,071Updated this week
Alternatives and similar repositories for wtpsplit
Users who are interested in wtpsplit are comparing it to the libraries listed below.
- 🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection that works out-of-the-box.☆856Updated 10 months ago
- Fast lexical search implementing BM25 in Python using Numpy, Numba and Scipy☆1,213Updated 3 weeks ago
- ☆522Updated 11 months ago
- Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard☆547Updated 3 months ago
- SPLADE: sparse neural search (SIGIR21, SIGIR22)☆860Updated last year
- Open neural machine translation models and web services☆701Updated last week
- State-of-the-art LLM-based translation models.☆534Updated 2 months ago
- SpanMarker for Named Entity Recognition☆434Updated 5 months ago
- A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.☆1,850Updated 3 weeks ago
- ☆359Updated last year
- Tools to download and cleanup Common Crawl data☆1,016Updated 2 years ago
- Bringing BERT into modernity via both architecture changes and scaling☆1,419Updated last week
- 🦙 Integrating LLMs into structured NLP pipelines☆1,267Updated 5 months ago
- Code for 'LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders'☆1,545Updated 5 months ago
- A Collection of BM25 Algorithms in Python☆1,196Updated 8 months ago
- ☆164Updated last year
- 80x faster and 95% accurate language identification with Fasttext☆157Updated last year
- ⚡ boost inference speed of T5 models by 5x & reduce the model size by 3x.☆579Updated 2 years ago
- Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.☆1,878Updated this week
- Guideline following Large Language Model for Information Extraction☆380Updated 8 months ago
- Easily embed, cluster and semantically label text datasets☆552Updated last year
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes☆207Updated last month
- All-in-one text de-duplication☆690Updated last month
- Training open neural machine translation models☆367Updated 3 months ago
- Obtain Word Alignments using Pretrained Language Models (e.g., mBERT)☆363Updated last year
- Neural Search☆358Updated 3 months ago
- Efficient few-shot learning with Sentence Transformers☆2,509Updated 2 months ago
- String-to-String Algorithms for Natural Language Processing☆549Updated 11 months ago
- SGPT: GPT Sentence Embeddings for Semantic Search☆868Updated last year
- Late Interaction Models Training & Retrieval☆452Updated 2 weeks ago