cisnlp / GlotWeb
GlotWeb: Web Indexing for Low-Resource Languages -- under construction.
☆17 · Aug 13, 2025 · Updated 6 months ago
Alternatives and similar repositories for GlotWeb
Users who are interested in GlotWeb are comparing it to the repositories listed below.
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models · ☆11 · Jan 19, 2024 · Updated 2 years ago
- Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment · ☆11 · Apr 6, 2025 · Updated 10 months ago
- GlotCC Dataset and Pipeline -- NeurIPS 2024 · ☆20 · Apr 6, 2025 · Updated 10 months ago
- Resource and Tool for Writing System Identification -- LREC 2024 · ☆21 · Dec 29, 2025 · Updated last month
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… · ☆27 · Updated this week
- Residual Quantization Autoencoder, used for interpreting LLMs · ☆14 · Jan 1, 2025 · Updated last year
- KnowMAN: Weakly Supervised Multinomial Adversarial Networks · ☆12 · Nov 9, 2021 · Updated 4 years ago
- Evaluate language models using multiple choice items · ☆13 · Jan 15, 2026 · Updated 3 weeks ago
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way · ☆18 · Nov 4, 2025 · Updated 3 months ago
- AfroLID, a powerful neural toolkit for African language identification which covers 517 African languages. · ☆35 · Feb 5, 2026 · Updated last week
- A simple neural truecaser written in pytorch and allennlp. · ☆33 · Jun 17, 2024 · Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 · ☆106 · Apr 20, 2024 · Updated last year
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 · ☆186 · Nov 19, 2025 · Updated 2 months ago
- A python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer). · ☆15 · Jun 4, 2024 · Updated last year
- A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining · ☆18 · Nov 26, 2023 · Updated 2 years ago
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) · ☆74 · Apr 1, 2025 · Updated 10 months ago
- Data Collection System For NLP/Speech Recognition · ☆25 · Apr 20, 2021 · Updated 4 years ago
- ☆43 · Jan 13, 2026 · Updated last month
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. · ☆57 · Feb 3, 2026 · Updated last week
- A fast python implementation of the SimHash algorithm. · ☆27 · Oct 27, 2021 · Updated 4 years ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. · ☆64 · Jul 29, 2024 · Updated last year
- Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition" · ☆32 · Jun 20, 2023 · Updated 2 years ago
- Python library for converting between BioNLP formats · ☆22 · Apr 20, 2023 · Updated 2 years ago
- A Directory of Online Newspaper Sources for 70+ Languages · ☆31 · Apr 15, 2021 · Updated 4 years ago
- Targeted language identifier, based on FastText and Hunspell. · ☆38 · Sep 4, 2025 · Updated 5 months ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" · ☆30 · Apr 2, 2022 · Updated 3 years ago
- Finite-state script normalization and processing utilities · ☆46 · Jan 14, 2026 · Updated 3 weeks ago
- Code for pre-training CharacterBERT models (as well as BERT models). · ☆34 · Sep 6, 2021 · Updated 4 years ago
- These are lists for a variety of languages containing words that are distinctive to each language. · ☆41 · Apr 5, 2022 · Updated 3 years ago
- Code for EMNLP 2021 main conference paper "Dynamic Knowledge Distillation for Pre-trained Language Models" · ☆41 · Aug 9, 2022 · Updated 3 years ago
- Creating super-parallel corpora of more than 1500 unique languages for NLP research · ☆34 · Dec 8, 2022 · Updated 3 years ago
- A repository for resources relating to NLP in the Balochi language · ☆19 · Jun 3, 2023 · Updated 2 years ago
- ☆12 · Jun 5, 2019 · Updated 6 years ago
- ☆10 · Oct 2, 2024 · Updated last year
- Linear Attention for Efficient Bidirectional Sequence Modeling · ☆15 · May 13, 2025 · Updated 9 months ago
- Translation of query languages to serialized KoralQuery protocol · ☆13 · Feb 2, 2026 · Updated last week
- Utilities to gather software metrics from tools (SONAR, etc) and store them into ElasticSearch for later display using Kibana. · ☆11 · Dec 31, 2017 · Updated 8 years ago
- Terminal tool that converts files encoding to UTF-8 · ☆10 · Oct 5, 2019 · Updated 6 years ago
- Basis of FragDenStaat.de's "Koalitionstracker" · ☆15 · Jul 14, 2025 · Updated 7 months ago