wikimedia / sentencex
A sentence segmentation library with wide language support optimized for speed and utility.
☆53 Updated 2 months ago
Related projects
Alternatives and complementary repositories for sentencex
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ☆67 Updated 6 months ago
- Faster, modernized fork of the language identification tool langid.py ☆48 Updated 5 months ago
- A python module for word inflections designed for use with spaCy. ☆92 Updated 4 years ago
- Transform TMX to text ☆29 Updated last year
- Multilingual sentence alignment using sentence embeddings ☆101 Updated 2 weeks ago
- ☆67 Updated 3 months ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆48 Updated 2 months ago
- Code for SaGe subword tokenizer (EACL 2023) ☆22 Updated this week
- Seed Machine Translation Data ☆30 Updated last week
- Tool to fix bitexts and tag near-duplicates for removal ☆29 Updated 3 months ago
- Bilingual sentence similarity classifier using Tensorflow ☆19 Updated 5 years ago
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ☆144 Updated this week
- MAMMOTH: MAssively Multilingual Modular Open Translation @ Helsinki ☆22 Updated this week
- A Python package for learning, evaluating, annotating, and extracting vector representations of construction grammars ☆34 Updated last month
- ParaNames: A multilingual resource for parallel names ☆30 Updated 6 months ago
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆62 Updated 8 months ago
- Python framework for processing Universal Dependencies data ☆57 Updated this week
- Efficient Low-Memory Aligner ☆139 Updated 2 months ago
- Python Finite-State Toolkit ☆45 Updated last week
- Library for fast text representation and classification. ☆28 Updated 10 months ago
- ☆22 Updated last year
- Source code for the Apple reproduction ☆31 Updated 3 years ago
- ☆32 Updated 2 years ago
- A tiny BERT for low-resource monolingual models ☆29 Updated last month
- OpusFilter - Parallel corpus processing toolkit ☆102 Updated 3 months ago
- A list of resources for conservation, development, and documentation of endangered, minority, and low or under-resourced human languages. ☆34 Updated last year
- Analyze Argumentation and Rhetorical Aspects in Scientific Writing. ☆19 Updated 2 years ago
- A modern, interlingual wordnet interface for Python ☆221 Updated last week
- Python module that identifies Chinese text as being Simplified or Traditional ☆86 Updated this week
- The Open Parallel Corpus ☆57 Updated last week
- Extracts plain text, language identification and more metadata from WARC records ☆20 Updated 3 months ago