wikimedia / sentencex
A sentence segmentation library with wide language support optimized for speed and utility.
☆71 Updated last week
Alternatives and similar repositories for sentencex
Users who are interested in sentencex are comparing it to the libraries listed below.
- Faster, modernized fork of the language identification tool langid.py ☆61 Updated 11 months ago
- Seed Machine Translation Data ☆33 Updated last year
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ☆179 Updated 5 months ago
- Library for fast text representation and classification. ☆31 Updated last year
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆52 Updated last month
- Extracts plain text, language identification and more metadata from WARC records ☆23 Updated last month
- Next-generation Punkt sentence boundary detection with zero dependencies ☆24 Updated 3 months ago
- The Open Parallel Corpus ☆77 Updated 2 months ago
- 80x faster and 95% accurate language identification with Fasttext ☆163 Updated last year
- Targeted language identifier, based on FastText and Hunspell. ☆37 Updated 2 months ago
- Multilingual sentence alignment using sentence embeddings ☆130 Updated last year
- Official implementation of the paper "CoEdIT: Text Editing by Task-Specific Instruction Tuning" (EMNLP 2023) ☆132 Updated last year
- A list of awesome Machine Translation frameworks, libraries, software and papers ☆192 Updated last year
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆63 Updated last year
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆75 Updated 7 months ago
- A modern, interlingual wordnet interface for Python ☆272 Updated this week
- Logical structure analysis for visually structured documents ☆92 Updated 3 years ago
- Efficient teacher-student models and scripts to make them ☆52 Updated last year
- Searching in-memory corpus with Corpus Query Language (CQL) ☆19 Updated 11 months ago
- Small Python package to measure OCR quality and other related metrics. ☆25 Updated last year
- Translate HTML using Argos Translate ☆53 Updated 2 years ago
- Transform TMX to text ☆28 Updated 2 years ago
- 🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction. ☆15 Updated 3 months ago
- ☆78 Updated 2 months ago
- Text to sentence splitter using a heuristic algorithm by Philipp Koehn and Josh Schroeder. ☆255 Updated 3 years ago
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx-like syntax. ☆59 Updated last year
- Tool to fix bitexts and tag near-duplicates for removal ☆33 Updated 2 months ago
- Python3 bindings for the Compact Language Detector v3 (CLD3) ☆154 Updated 2 years ago
- The pipeline for the OSCAR corpus ☆173 Updated last week
- OpusFilter - Parallel corpus processing toolkit ☆112 Updated last week