[WWW 2026] GlotWeb: Web Indexing for Minority Languages
★17 · Apr 14, 2026 · Updated 2 months ago
Alternatives and similar repositories for GlotWeb
Users who are interested in GlotWeb are comparing it to the libraries listed below.
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models ★11 · Jan 19, 2024 · Updated 2 years ago
- Residual Quantization Autoencoder, used for interpreting LLMs ★14 · Jan 1, 2025 · Updated last year
- [ACL 2025] Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment ★11 · Apr 6, 2025 · Updated last year
- [NeurIPS 2024] GlotCC Dataset and Pipeline ★20 · Apr 6, 2025 · Updated last year
- [LREC 2024] Resource and Tool for Writing System Identification ★22 · Mar 29, 2026 · Updated 2 months ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… ★27 · Feb 16, 2026 · Updated 4 months ago
- KnowMAN: Weakly Supervised Multinomial Adversarial Networks ★12 · Nov 9, 2021 · Updated 4 years ago
- [ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages ★107 · Apr 14, 2026 · Updated 2 months ago
- [EMNLP 2023] Language Identification with Support for More Than 2000 Labels ★207 · Apr 15, 2026 · Updated 2 months ago
- Evaluate language models using multiple choice items ★13 · Mar 6, 2026 · Updated 3 months ago
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way ★18 · Nov 4, 2025 · Updated 7 months ago
- [NAACL 2024] A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining ★18 · Nov 26, 2023 · Updated 2 years ago
- A Python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer). ★17 · Jun 4, 2024 · Updated 2 years ago
- AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages. ★39 · Feb 5, 2026 · Updated 4 months ago
- A simple neural truecaser written in PyTorch and AllenNLP. ★35 · Jun 17, 2024 · Updated last year
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) ★76 · Apr 1, 2025 · Updated last year
- ★13 · Aug 13, 2024 · Updated last year
- ParCourE - Parallel Corpus Explorer ★12 · Dec 27, 2021 · Updated 4 years ago
- A fast Python implementation of the SimHash algorithm. ★27 · Oct 27, 2021 · Updated 4 years ago
- Data Collection System For NLP/Speech Recognition ★25 · Apr 20, 2021 · Updated 5 years ago
- Work with static vector models ★39 · Apr 21, 2025 · Updated last year
- A Directory of Online Newspaper Sources for 70+ Languages ★31 · Apr 15, 2021 · Updated 5 years ago
- A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling. ★65 · Jul 29, 2024 · Updated last year
- A new Turkish Dependency Treebank in UD style ★16 · Aug 17, 2020 · Updated 5 years ago
- ★45 · Feb 11, 2026 · Updated 4 months ago
- Translation of query languages to serialized KoralQuery protocol ★15 · Jun 4, 2026 · Updated last week
- ★13 · Oct 31, 2025 · Updated 7 months ago
- OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models. ★58 · Feb 3, 2026 · Updated 4 months ago
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ★12 · Feb 27, 2023 · Updated 3 years ago
- Collection of Common Machine Translation Tools ★11 · Jul 26, 2022 · Updated 3 years ago
- The Flutter MotionPhotos Package to detect and extract the video content from the motion photos by https://ente.io ★19 · Nov 22, 2024 · Updated last year
- code and data used to build a training dataset for dragnet models ★10 · Nov 29, 2020 · Updated 5 years ago
- ★15 · Jan 10, 2022 · Updated 4 years ago
- Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition" ★32 · Jun 20, 2023 · Updated 2 years ago
- Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation" ★30 · Apr 2, 2022 · Updated 4 years ago
- Basis of FragDenStaat.de's "Koalitionstracker" ★15 · Jul 14, 2025 · Updated 11 months ago
- Terminal tool that converts files encoding to UTF-8 ★10 · Oct 5, 2019 · Updated 6 years ago
- Small string compression using smaz compression algorithm. Fast, because it's in C. Supports Python 3+ ★13 · Oct 18, 2025 · Updated 7 months ago
- A tiny server to run local inference on MLX model in the style of OpenAI ★13 · Jan 31, 2024 · Updated 2 years ago