wikimedia / sentencex
A sentence segmentation library with wide language support optimized for speed and utility.
☆58 Updated 6 months ago
Alternatives and similar repositories for sentencex:
Users who are interested in sentencex are comparing it to the libraries listed below:
- Seed Machine Translation Data ☆30 Updated 3 months ago
- Transform TMX to text ☆28 Updated 2 years ago
- Multilingual sentence alignment using sentence embeddings ☆109 Updated 4 months ago
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆70 Updated 10 months ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆49 Updated last month
- Library for fast text representation and classification. ☆28 Updated last year
- Faster, modernized fork of the language identification tool langid.py ☆53 Updated 3 months ago
- Tool to fix bitexts and tag near-duplicates for removal ☆29 Updated 3 weeks ago
- Sentence aligner ☆110 Updated 3 years ago
- ☆25 Updated last year
- The Open Parallel Corpus ☆64 Updated this week
- ☆9 Updated last year
- Cython wrapper on Hunspell Dictionary ☆66 Updated 8 months ago
- ☆71 Updated this week
- Code for SaGe subword tokenizer (EACL 2023) ☆24 Updated 3 months ago
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests ☆41 Updated 2 years ago
- Efficient Low-Memory Aligner ☆142 Updated last month
- These are lists for a variety of languages containing words that are distinctive to each language. ☆35 Updated 2 years ago
- NTREX -- News Test References for MT Evaluation ☆81 Updated 8 months ago
- ☆45 Updated 7 months ago
- Small python package to measure OCR quality and other related metrics. ☆21 Updated last year
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst… ☆22 Updated 3 years ago
- MAMMOTH: MAssively Multilingual Modular Open Translation @ Helsinki ☆22 Updated 2 weeks ago
- List of corpora annotated for coreference for different languages ☆17 Updated 6 months ago
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆118 Updated 3 months ago
- Multilingual syllable annotation pipeline component for spacy ☆39 Updated last year
- The FLORES+ Machine Translation Benchmark ☆100 Updated 3 months ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ☆31 Updated last year
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆63 Updated 11 months ago
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus. ☆154 Updated 8 months ago