Word lists for a variety of languages, each containing words that are distinctive to that language.
☆41 · Apr 5, 2022 · Updated 3 years ago
Alternatives and similar repositories for TF-IDF-IIF-top100-wordlists
Users who are interested in TF-IDF-IIF-top100-wordlists are comparing it to the repositories listed below.
- Creating super-parallel corpora of more than 1,500 unique languages for NLP research ☆34 · Dec 8, 2022 · Updated 3 years ago
- GlotWeb: Web Indexing for Minority Languages (WWW 2026) ☆17 · Feb 27, 2026 · Updated last week
- Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition" ☆32 · Jun 20, 2023 · Updated 2 years ago
- This repository contains the source code and links to some datasets used in the CoNLL 2019 paper "Learning to Represent Bilingual Diction… ☆12 · Oct 1, 2020 · Updated 5 years ago
- ☆16 · Nov 20, 2023 · Updated 2 years ago
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders ☆14 · Apr 14, 2021 · Updated 4 years ago
- Toki Pona corpus for NLTK ☆15 · Dec 29, 2018 · Updated 7 years ago
- Generate rules from lists of words ☆16 · Jul 9, 2021 · Updated 4 years ago
- Downloads and parses the subtitle dataset from opensubtitles.org ☆15 · Apr 19, 2024 · Updated last year
- LaTeX Beamer theme ☆16 · Apr 25, 2025 · Updated 10 months ago
- Statistics on multilingual datasets ☆17 · Jul 12, 2022 · Updated 3 years ago
- Resource and Tool for Writing System Identification (Unicode 17.0) -- LREC 2024 ☆21 · Feb 17, 2026 · Updated 2 weeks ago
- Data Collection System For NLP/Speech Recognition ☆25 · Apr 20, 2021 · Updated 4 years ago
- Extensible DL-based automatic Arabic diacritization tool allowing the restoration of different types of diacritics. ☆21 · Jul 25, 2023 · Updated 2 years ago
- Using pretrained language models for biomedical knowledge graph completion. ☆47 · Oct 7, 2021 · Updated 4 years ago
- Source stories from the African Storybook Project in Markdown format ☆22 · Jan 25, 2026 · Updated last month
- Trigram files for 500+ languages ☆25 · Mar 21, 2025 · Updated 11 months ago
- Python package for Natural Language Processing (NLP), focused on low-resource languages spoken in Mexico. ☆23 · Sep 4, 2025 · Updated 6 months ago
- Multilingual Open Text ☆25 · May 8, 2025 · Updated 10 months ago
- A collection of Korean text datasets, ready to use with TensorFlow Datasets. ☆20 · Jun 8, 2022 · Updated 3 years ago
- ☆22 · Apr 8, 2022 · Updated 3 years ago
- The pipeline for the OSCAR corpus ☆176 · Nov 9, 2025 · Updated 4 months ago
- OpusFilter - Parallel corpus processing toolkit ☆115 · Feb 11, 2026 · Updated 3 weeks ago
- Korean BERT model using character tokenizer ☆27 · Apr 8, 2021 · Updated 4 years ago
- ☆25 · Jul 12, 2022 · Updated 3 years ago
- Multilingual and monolingual corpora focused on Caucasus languages, for Natural Language Processing (NLP) ☆36 · Nov 29, 2024 · Updated last year
- Targeted language identifier, based on FastText and Hunspell. ☆38 · Sep 4, 2025 · Updated 6 months ago
- This is a machine learning framework that enables developers to iterate fast over different ML architecture designs. ☆16 · Apr 20, 2020 · Updated 5 years ago
- Retrieve verses from bible.com/YouVersion. ☆38 · Feb 13, 2025 · Updated last year
- Large scale unannotated Korean corpus for unsupervised tasks. (e.g. Language modeling) ☆28 · Aug 11, 2019 · Updated 6 years ago
- Wiktra - Python tool of Wiktionary Transliteration modules for 514 languages and its 102 different scripts (orthographies) ☆34 · Jun 29, 2025 · Updated 8 months ago
- Translation demonstrator ☆37 · May 12, 2020 · Updated 5 years ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ☆36 · Feb 5, 2026 · Updated last month
- Training Transformers of Huggingface with KoNLPy ☆68 · Aug 28, 2020 · Updated 5 years ago
- Minangkabau NLP corpus. PACLIC 2020 ☆10 · Jun 7, 2021 · Updated 4 years ago
- Improving Word Translation via Two-Stage Contrastive Learning (ACL 2022). Keywords: Bilingual Lexicon Induction, Word Translation, Cross-… ☆36 · Jan 23, 2025 · Updated last year
- Research code for "What to Pre-Train on? Efficient Intermediate Task Selection", EMNLP 2021 ☆37 · Dec 21, 2021 · Updated 4 years ago
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests ☆41 · Oct 14, 2022 · Updated 3 years ago
- A set of pipelines for performing experiments on various NLP tasks with a focus on resource-poor/minority languages. ☆37 · Updated this week