Language Identification with Support for More Than 2000 Labels -- EMNLP 2023
★191 · Mar 27, 2026 · Updated this week
Alternatives and similar repositories for GlotLID
Users that are interested in GlotLID are comparing it to the libraries listed below.
- Resource and Tool for Writing System Identification (Unicode 17.0) -- LREC 2024 · ★21 · Feb 17, 2026 · Updated last month
- GlotWeb: Web Indexing for Minority Languages (WWW 2026) · ★17 · Feb 27, 2026 · Updated last month
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) · ★75 · Apr 1, 2025 · Updated 11 months ago
- A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining · ★18 · Nov 26, 2023 · Updated 2 years ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 · ★106 · Apr 20, 2024 · Updated last year
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models · ★11 · Jan 19, 2024 · Updated 2 years ago
- Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment · ★11 · Apr 6, 2025 · Updated 11 months ago
- ★13 · Aug 23, 2024 · Updated last year
- OpusFilter - Parallel corpus processing toolkit · ★115 · Feb 11, 2026 · Updated last month
- PathPiece tokenizer · ★14 · Nov 10, 2024 · Updated last year
- GlotCC Dataset and Pipeline -- NeurIPS 2024 · ★20 · Apr 6, 2025 · Updated 11 months ago
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" · ★36 · Jun 7, 2025 · Updated 9 months ago
- Tokenizers.js: A pure JS/TS implementation of today's most used tokenizers · ★40 · Mar 18, 2026 · Updated last week
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. · ★2,965 · Mar 16, 2026 · Updated 2 weeks ago
- GC4LM: A Colossal (Biased) language model for German · ★13 · May 2, 2021 · Updated 4 years ago
- ★232 · Oct 27, 2025 · Updated 5 months ago
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. · ★90 · Sep 12, 2024 · Updated last year
- ParCourE - Parallel Corpus Explorer · ★12 · Dec 27, 2021 · Updated 4 years ago
- A framework for evaluating Machine Translation models. · ★12 · May 26, 2025 · Updated 10 months ago
- NTREX -- News Test References for MT Evaluation · ★88 · Jun 5, 2024 · Updated last year
- ★12 · Mar 17, 2026 · Updated last week
- Do Multilingual Language Models Think Better in English? · ★42 · Aug 3, 2023 · Updated 2 years ago
- ★60 · Nov 18, 2025 · Updated 4 months ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. · ★64 · Jul 29, 2024 · Updated last year
- ★57 · Dec 27, 2025 · Updated 3 months ago
- ★269 · Aug 1, 2025 · Updated 7 months ago
- ★16 · Jan 14, 2022 · Updated 4 years ago
- POS for African languages · ★19 · Jun 25, 2025 · Updated 9 months ago
- Universal Romanizer that can convert any Unicode script to Roman (Latin) script · ★243 · Jul 26, 2024 · Updated last year
- Python source code for EMNLP 2021 Findings paper: "Subword Mapping and Anchoring Across Languages". · ★13 · Sep 17, 2021 · Updated 4 years ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… · ★27 · Feb 16, 2026 · Updated last month
- Hengam: An Adversarially Trained Transformer for Persian Temporal Tagging (AACL'22) · ★11 · Aug 25, 2023 · Updated 2 years ago
- Package to align tokens from different tokenizations. · ★16 · Mar 25, 2024 · Updated 2 years ago
- ParaNames: A multilingual resource for parallel names · ★40 · May 20, 2024 · Updated last year
- The official implementation of HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization · ★18 · Mar 7, 2025 · Updated last year
- ★16 · Jun 14, 2024 · Updated last year
- Code for Zero-Shot Tokenizer Transfer · ★143 · Jan 14, 2025 · Updated last year
- Minimum Bayes Risk Decoding for Hugging Face Transformers · ★60 · Jun 3, 2024 · Updated last year
- Pipeline for pulling and processing online language model pretraining data from the web · ★178 · Jul 31, 2023 · Updated 2 years ago