google-research-datasets / TF-IDF-IIF-top100-wordlists
These are lists for a variety of languages containing words that are distinctive to each language.
⭐41 · Apr 5, 2022 · Updated 3 years ago
Alternatives and similar repositories for TF-IDF-IIF-top100-wordlists
Users who are interested in TF-IDF-IIF-top100-wordlists are comparing it to the repositories listed below.
- Creating super-parallel corpora of more than 1500+ unique languages for NLP research ⭐34 · Dec 8, 2022 · Updated 3 years ago
- 🕸 GlotWeb: Web Indexing for Low-Resource Languages -- under construction. ⭐17 · Aug 13, 2025 · Updated 6 months ago
- Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition" ⭐32 · Jun 20, 2023 · Updated 2 years ago
- This repository contains the source code and links to some datasets used in the CoNLL 2019 paper "Learning to Represent Bilingual Diction… ⭐12 · Oct 1, 2020 · Updated 5 years ago
- Toki Pona corpus for NLTK ⭐15 · Dec 29, 2018 · Updated 7 years ago
- ⭐16 · Nov 20, 2023 · Updated 2 years ago
- Explicit Alignment Objectives for Multilingual Bidirectional Encoders ⭐14 · Apr 14, 2021 · Updated 4 years ago
- generate rules from lists of words ⭐16 · Jul 9, 2021 · Updated 4 years ago
- Resource and Tool for Writing System Identification -- LREC 2024 ⭐21 · Dec 29, 2025 · Updated last month
- Latex Beamer Theme ⭐16 · Apr 25, 2025 · Updated 9 months ago
- Statistics on multilingual datasets ⭐17 · Jul 12, 2022 · Updated 3 years ago
- downloads and parses subtitle dataset from opensubtitles.org ⭐16 · Apr 19, 2024 · Updated last year
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ⭐74 · Apr 1, 2025 · Updated 10 months ago
- Data Collection System For NLP/Speech Recognition ⭐25 · Apr 20, 2021 · Updated 4 years ago
- ⭐17 · Feb 1, 2023 · Updated 3 years ago
- Extensible DL-based automatic Arabic diacritization tool allowing the restoration of different types of diacritics. ⭐21 · Jul 25, 2023 · Updated 2 years ago
- Using pretrained language models for biomedical knowledge graph completion. ⭐47 · Oct 7, 2021 · Updated 4 years ago
- Source stories from the African Storybook Project in Markdown format ⭐22 · Jan 25, 2026 · Updated 3 weeks ago
- Trigram files for 500+ languages ⭐25 · Mar 21, 2025 · Updated 10 months ago
- A library for fetching and reading Tatoeba's weekly exports ⭐24 · Feb 5, 2026 · Updated last week
- A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB te… ⭐295 · Updated this week
- Implementation of "SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages" paper, accepted to E… ⭐25 · Nov 4, 2022 · Updated 3 years ago
- A collection of Korean Text Datasets ready to use using Tensorflow-Datasets. ⭐20 · Jun 8, 2022 · Updated 3 years ago
- ⭐22 · Apr 8, 2022 · Updated 3 years ago
- The pipeline for the OSCAR corpus ⭐176 · Nov 9, 2025 · Updated 3 months ago
- Public repository for SemEval 2023 - Task 10 - Explainable Detection of Online Sexism (EDOS) ⭐25 · Apr 13, 2023 · Updated 2 years ago
- OpusFilter - Parallel corpus processing toolkit ⭐115 · Updated this week
- Korean BERT model using character tokenizer ⭐27 · Apr 8, 2021 · Updated 4 years ago
- This is a repository for NaijaSenti. A Lacuna Funded Project for the development of sentiment corpus for four Nigerian languages: Igbo, H… ⭐36 · Oct 14, 2025 · Updated 4 months ago
- ⭐25 · Jul 12, 2022 · Updated 3 years ago
- Caucasus languages focused multilingual and monolingual corpuses for Natural Language Processing (NLP) ⭐35 · Nov 29, 2024 · Updated last year
- Curated corpus of parallel data derived from versions of the Bible provided by eBible.org. ⭐81 · May 23, 2025 · Updated 8 months ago
- Targeted language identifier, based on FastText and Hunspell. ⭐38 · Sep 4, 2025 · Updated 5 months ago
- Retrieve verses from bible.com/YouVersion. ⭐38 · Feb 13, 2025 · Updated last year
- Large scale unannotated Korean corpus for unsupervised tasks. (e.g. Language modeling) ⭐28 · Aug 11, 2019 · Updated 6 years ago
- Wiktra - Python tool of Wiktionary Transliteration modules for 514 languages and its 102 different scripts (orthographies) ⭐34 · Jun 29, 2025 · Updated 7 months ago
- Translation demonstrator ⭐37 · May 12, 2020 · Updated 5 years ago
- ICU based universal language tokenizer ⭐33 · Jan 19, 2022 · Updated 4 years ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ⭐35 · Feb 5, 2026 · Updated last week