uhermjakob / utoken
universal tokenizer
☆15 Updated 3 years ago
Alternatives and similar repositories for utoken:
Users that are interested in utoken are comparing it to the libraries listed below
- Bilingual sentence similarity classifier using Tensorflow ☆19 Updated 5 years ago
- Code for SaGe subword tokenizer (EACL 2023) ☆22 Updated last month
- An English lexical database from the Big 🍎, let's go Mets baby love da Mets ☆14 Updated 2 months ago
- GOPHI: an AMR-to-English Verbalizer ☆11 Updated 4 years ago
- Efficient teacher-student models and scripts to make them ☆49 Updated last year
- Transform TMX to text ☆29 Updated 2 years ago
- An easy-to-use library to linguistically compare one sentence and its words to another, in the same language or a different one. For inst… ☆22 Updated 3 years ago
- bilingual dictionary extractor from parallel corpora ☆22 Updated 10 years ago
- English Resource Grammar ☆20 Updated 5 months ago
- Python framework for processing Universal Dependencies data ☆56 Updated 3 weeks ago
- Python Finite-State Toolkit ☆47 Updated last week
- Library and command line utility to do approximate string matching of a source against a bitext index and get matched source and target. ☆46 Updated 3 weeks ago
- A Python toolkit to generate a tokenized dump of Wikipedia for NLP ☆11 Updated 8 months ago
- A python module for word inflections designed for use with spaCy. ☆92 Updated 4 years ago
- Automatic extraction of edited sentences from text edition histories. ☆82 Updated 2 years ago
- Resource and Tool for Writing System Identification -- LREC 2024 ☆13 Updated 7 months ago
- A tiny BERT for low-resource monolingual models ☆31 Updated 3 months ago
- Source code for the Apple reproduction ☆31 Updated 3 years ago
- Translation demonstrator ☆29 Updated 4 years ago
- 💫 A spaCy package for Yohei Tamura's Rust tokenizations library ☆27 Updated last year
- Python 3 library for managing, annotating, and converting natural language corpuses using popular formats (CoNLL, ELAN, Praat, CSV, … ☆17 Updated 6 months ago
- Measure the similarity of text corpora for 74 languages ☆13 Updated 11 months ago
- Multilingual Open Text ☆25 Updated 2 months ago
- Extracts plain text, language identification and more metadata from WARC records ☆20 Updated 5 months ago
- The Mueller Report Corpus V 0.1 ☆11 Updated 4 years ago
- XL-AMR is a sequence-to-graph cross-lingual AMR parser that exploits transfer learning (EMNLP2020). ☆16 Updated 5 months ago
- A flexible sentence segmentation library using CRF model and regex rules ☆28 Updated 10 months ago
- Hugging Face and Pyserini interoperability ☆20 Updated last year
- MAGPIE: A sense-annotated corpus of potentially idiomatic expressions ☆26 Updated 4 years ago
- A python library / model for creating co-references between AMR graph nodes. ☆9 Updated 2 years ago