uhermjakob / utoken
universal tokenizer
☆16 · Updated 4 years ago
Alternatives and similar repositories for utoken
Users who are interested in utoken are comparing it to the libraries listed below.
- Bilingual sentence similarity classifier using Tensorflow ☆24 · Updated 6 years ago
- Efficient teacher-student models and scripts to make them ☆53 · Updated 2 years ago
- A python module for word inflections designed for use with spaCy. ☆93 · Updated 5 years ago
- Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. Fo… ☆108 · Updated last month
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ☆56 · Updated 3 months ago
- Text tokenization and sentence segmentation (segtok v2) ☆208 · Updated 3 years ago
- ☆81 · Updated last month
- OpusFilter - Parallel corpus processing toolkit ☆115 · Updated 3 weeks ago
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst… ☆25 · Updated 4 years ago
- Tool for parsing and converting various span encoding schemes. ☆23 · Updated last year
- Efficient Low-Memory Aligner ☆146 · Updated 11 months ago
- This packages up data for the Open Multilingual Wordnet ☆59 · Updated 7 months ago
- MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology (+morpheme segmentation) ☆52 · Updated 2 years ago
- Master repo for the UniMorph project, includes the UniMorph schema and annotated data files ☆33 · Updated 6 years ago
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ☆182 · Updated 7 months ago
- A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages. ☆37 · Updated this week
- A modern, interlingual wordnet interface for Python ☆277 · Updated this week
- Curated corpus of parallel data derived from versions of the Bible provided by eBible.org. ☆79 · Updated 7 months ago
- A sentence segmentation library with wide language support optimized for speed and utility. ☆82 · Updated last month
- Translation demonstrator ☆36 · Updated 5 years ago
- UFSAC is a resource containing all WordNet Sense Annotated Corpora, and a Java library for manipulating them ☆38 · Updated 3 years ago
- Runnable morphological analysis tools from the UniMorph project ☆16 · Updated 7 years ago
- Wiktra - Python tool of Wiktionary Transliteration modules for 514 languages and its 102 different scripts (orthographies) ☆32 · Updated 6 months ago
- Transform TMX to text ☆28 · Updated 3 years ago
- Python Finite-State Toolkit ☆60 · Updated 2 weeks ago
- A python module for English lemmatization and inflection. ☆274 · Updated 2 years ago
- Text to sentence splitter using heuristic algorithm by Philipp Koehn and Josh Schroeder. ☆256 · Updated 3 years ago
- An advanced, extensible web front-end for the Manatee-open corpus search engine ☆78 · Updated this week
- Put together a multilingual corpus from a variety of sources. Used for wordfreq and word embeddings. ☆57 · Updated 4 years ago
- 🧪 Cutting-edge experimental spaCy components and features ☆105 · Updated last year