GlotWeb: Web Indexing for Minority Languages (WWW 2026)
★17 · Feb 27, 2026 · Updated last week
Alternatives and similar repositories for GlotWeb
Users who are interested in GlotWeb are comparing it to the libraries listed below
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models ★11 · Jan 19, 2024 · Updated 2 years ago
- Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment ★11 · Apr 6, 2025 · Updated 11 months ago
- GlotCC Dataset and Pipeline -- NeurIPS 2024 ★20 · Apr 6, 2025 · Updated 11 months ago
- Resource and Tool for Writing System Identification (Unicode 17.0) -- LREC 2024 ★21 · Feb 17, 2026 · Updated 2 weeks ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… ★26 · Feb 16, 2026 · Updated 2 weeks ago
- Residual Quantization Autoencoder, used for interpreting LLMs ★14 · Jan 1, 2025 · Updated last year
- KnowMAN: Weakly Supervised Multinomial Adversarial Networks ★12 · Nov 9, 2021 · Updated 4 years ago
- Evaluate language models using multiple choice items ★13 · Updated this week
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way ★18 · Nov 4, 2025 · Updated 4 months ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ★36 · Feb 5, 2026 · Updated last month
- A simple neural truecaser written in pytorch and allennlp. ★33 · Jun 17, 2024 · Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ★106 · Apr 20, 2024 · Updated last year
- A python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer). ★15 · Jun 4, 2024 · Updated last year
- A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining ★18 · Nov 26, 2023 · Updated 2 years ago
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ★188 · Nov 19, 2025 · Updated 3 months ago
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ★74 · Apr 1, 2025 · Updated 11 months ago
- Data Collection System For NLP/Speech Recognition ★25 · Apr 20, 2021 · Updated 4 years ago
- ★44 · Feb 11, 2026 · Updated 3 weeks ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ★57 · Feb 3, 2026 · Updated last month
- A fast python implementation of the SimHash algorithm. ★27 · Oct 27, 2021 · Updated 4 years ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. ★64 · Jul 29, 2024 · Updated last year
- Work with static vector models ★37 · Apr 21, 2025 · Updated 10 months ago
- Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition" ★32 · Jun 20, 2023 · Updated 2 years ago
- Python library for converting between BioNLP formats ★22 · Apr 20, 2023 · Updated 2 years ago
- A Directory of Online Newspaper Sources for 70+ Languages ★31 · Apr 15, 2021 · Updated 4 years ago
- Finite-state script normalization and processing utilities ★46 · Feb 25, 2026 · Updated last week
- Targeted language identifier, based on FastText and Hunspell. ★38 · Sep 4, 2025 · Updated 6 months ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" ★30 · Apr 2, 2022 · Updated 3 years ago
- ★14 · Jan 17, 2024 · Updated 2 years ago
- Code for pre-training CharacterBERT models (as well as BERT models). ★34 · Sep 6, 2021 · Updated 4 years ago
- These are lists for a variety of languages containing words that are distinctive to each language. ★41 · Apr 5, 2022 · Updated 3 years ago
- Code for EMNLP 2021 main conference paper "Dynamic Knowledge Distillation for Pre-trained Language Models" ★41 · Aug 9, 2022 · Updated 3 years ago
- Creating super-parallel corpora of more than 1500+ unique languages for NLP research ★34 · Dec 8, 2022 · Updated 3 years ago
- ★12 · Jun 5, 2019 · Updated 6 years ago
- Basis of FragDenStaat.de's "Koalitionstracker" ★15 · Jul 14, 2025 · Updated 7 months ago
- A repository for resources relating to NLP in the Balochi language ★19 · Jun 3, 2023 · Updated 2 years ago
- Curated list of awesome datasets for various table understanding tasks ★18 · Sep 5, 2025 · Updated 6 months ago
- ★10 · Oct 2, 2024 · Updated last year
- Translation of query languages to serialized KoralQuery protocol ★13 · Feb 23, 2026 · Updated last week