mingruimingrui / ICU-tokenizer
ICU based universal language tokenizer
☆33Jan 19, 2022Updated 4 years ago
Alternatives and similar repositories for ICU-tokenizer
Users that are interested in ICU-tokenizer are comparing it to the libraries listed below
- The implementation of CL-ReLKT (NAACL-2022)☆14Aug 31, 2022Updated 3 years ago
- downloads and parses subtitle dataset from opensubtitles.org☆16Apr 19, 2024Updated last year
- c++ mosestokenizer☆18Mar 13, 2024Updated last year
- Extensible DL-based automatic Arabic diacritization tool allowing the restoration of different types of diacritics.☆21Jul 25, 2023Updated 2 years ago
- Trigram files for 500+ languages☆25Mar 21, 2025Updated 10 months ago
- ☆22Jan 3, 2023Updated 3 years ago
- Multilingual Open Text☆25May 8, 2025Updated 9 months ago
- Reader Translator Generator - NMT toolkit based on pytorch☆32Sep 12, 2023Updated 2 years ago
- Wiktra - Python tool of Wiktionary Transliteration modules for 514 languages and its 102 different scripts (orthographies)☆34Jun 29, 2025Updated 7 months ago
- GENOT: Generative Neural Optimal Transport☆14Dec 18, 2024Updated last year
- Data source of the Energy Transition Model☆18Feb 5, 2026Updated last week
- Minangkabau NLP corpus. PACLIC 2020☆10Jun 7, 2021Updated 4 years ago
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests☆41Oct 14, 2022Updated 3 years ago
- Code for our ACL2021 paper Neural Machine Translation with Monolingual Translation Memory☆82Jun 12, 2023Updated 2 years ago
- ☆11Feb 1, 2024Updated 2 years ago
- MS Marco Entity Annotations Disambiguation☆13May 19, 2023Updated 2 years ago
- ☆12Nov 8, 2024Updated last year
- Scrape Youtube for videos and extract screenshots from the videos☆12Feb 12, 2021Updated 5 years ago
- ☆13Nov 5, 2024Updated last year
- ☆11Oct 13, 2023Updated 2 years ago
- The pipeline for the OSCAR corpus☆176Nov 9, 2025Updated 3 months ago
- Project AI Services will help deploy e2e AI use cases that solve business problems for Power Users.☆44Updated this week
- Open Science AI Tools for Systematic, Protocol-Based Literature Reviews☆16Jan 22, 2026Updated 3 weeks ago
- Fine-tuning Llama2-7b and other llms for categorising emails for Deutsche Bahn (German National Railways)☆13Oct 9, 2023Updated 2 years ago
- Official code for AAAI'20 paper "Merging Weak and Active Supervision for Semantic Parsing"☆11Dec 8, 2022Updated 3 years ago
- ☆10Sep 27, 2021Updated 4 years ago
- No-nonsense simple transliteration between writing systems, mostly of Semitic origin☆13Jun 29, 2025Updated 7 months ago
- A monolithic index that supports worst-case optimal joins (WCOJ) by providing all collation orders in a single redundancy eliminating dat…☆16Sep 18, 2025Updated 4 months ago
- Library for fast text representation and classification.☆10Apr 17, 2022Updated 3 years ago
- Tensorflow Operation Wrapper of cppjieba (Chinese Word Segmentation)☆10Oct 21, 2019Updated 6 years ago
- A Font with extensive coverage of Unicode13 as of March 2020 (part of Unicode Fonts for Ancient Scripts)☆15Mar 26, 2020Updated 5 years ago
- Use LLM to generate Obsidian timeline style Cornell notes☆11May 10, 2023Updated 2 years ago
- a script to convert ERNIE1.0 or ERNIE2.0 to transformers' BERT format☆10Mar 28, 2020Updated 5 years ago
- Python parser for UCUM (Unified Code for Units of Measure) incl. converter to pint units☆14Feb 1, 2026Updated last week
- FeedbackQA: Improving Question Answering Post-Deployment with Interactive Feedback☆12Jul 13, 2022Updated 3 years ago
- Some realistic tabular datasets for testing (CSV)☆21Mar 7, 2018Updated 7 years ago
- ☆17Nov 28, 2025Updated 2 months ago
- python project template for personal projects! 🙋‍♀️☆11Nov 28, 2020Updated 5 years ago
- ☆10Oct 20, 2022Updated 3 years ago