[EMNLP 2023] Language Identification with Support for More Than 2000 Labels
★197 · Apr 15, 2026 · Updated this week
Alternatives and similar repositories for GlotLID
Users who are interested in GlotLID are comparing it to the libraries listed below.
- Resource and Tool for Writing System Identification (Unicode 17.0) -- LREC 2024 · ★21 · Mar 29, 2026 · Updated 3 weeks ago
- GlotWeb: Web Indexing for Minority Languages (WWW 2026) · ★17 · Feb 27, 2026 · Updated last month
- Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023) · ★75 · Apr 1, 2025 · Updated last year
- [ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages · ★106 · Updated this week
- mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models · ★11 · Jan 19, 2024 · Updated 2 years ago
- [ACL 2025] Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment · ★11 · Apr 6, 2025 · Updated last year
- ★13 · Aug 23, 2024 · Updated last year
- OpusFilter - Parallel corpus processing toolkit · ★115 · Apr 8, 2026 · Updated last week
- PathPiece tokenizer · ★14 · Nov 10, 2024 · Updated last year
- [NeurIPS 2024] GlotCC Dataset and Pipeline · ★20 · Apr 6, 2025 · Updated last year
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" · ★36 · Jun 7, 2025 · Updated 10 months ago
- Tokenizers.js: A pure JS/TS implementation of today's most used tokenizers · ★47 · Mar 18, 2026 · Updated last month
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks. · ★2,983 · Apr 10, 2026 · Updated last week
- GC4LM: A Colossal (Biased) language model for German · ★13 · May 2, 2021 · Updated 4 years ago
- ★234 · Oct 27, 2025 · Updated 5 months ago
- SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects · ★23 · Jan 26, 2025 · Updated last year
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. · ★90 · Sep 12, 2024 · Updated last year
- A framework for evaluating Machine Translation models. · ★12 · May 26, 2025 · Updated 10 months ago
- NTREX -- News Test References for MT Evaluation · ★87 · Jun 5, 2024 · Updated last year
- ★12 · Mar 17, 2026 · Updated last month
- ★60 · Nov 18, 2025 · Updated 5 months ago
- ★15 · Oct 4, 2024 · Updated last year
- A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling. · ★64 · Jul 29, 2024 · Updated last year
- ★57 · Dec 27, 2025 · Updated 3 months ago
- POS for African languages · ★19 · Jun 25, 2025 · Updated 9 months ago
- Universal Romanizer that can convert any unicode script to roman (latin) script · ★244 · Jul 26, 2024 · Updated last year
- Python source code for EMNLP 2021 Findings paper: "Subword Mapping and Anchoring Across Languages". · ★13 · Sep 17, 2021 · Updated 4 years ago
- Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource… · ★27 · Feb 16, 2026 · Updated 2 months ago
- Hengam: An Adversarially Trained Transformer for Persian Temporal Tagging (AACL'22) · ★11 · Aug 25, 2023 · Updated 2 years ago
- Is In-Context Learning Sufficient for Instruction Following in LLMs? [ICLR 2025] · ★32 · Jan 23, 2025 · Updated last year
- Package to align tokens from different tokenizations. · ★16 · Mar 25, 2024 · Updated 2 years ago
- ParaNames: A multilingual resource for parallel names · ★40 · May 20, 2024 · Updated last year
- Web archiving utility library · ★11 · Mar 11, 2026 · Updated last month
- The official implementation of HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization · ★19 · Mar 7, 2025 · Updated last year
- ★16 · Jun 14, 2024 · Updated last year
- Code for Zero-Shot Tokenizer Transfer · ★144 · Jan 14, 2025 · Updated last year
- Minimum Bayes Risk Decoding for Hugging Face Transformers · ★60 · Jun 3, 2024 · Updated last year
- ★18 · Aug 30, 2025 · Updated 7 months ago
- Pipeline for pulling and processing online language model pretraining data from the web · ★179 · Jul 31, 2023 · Updated 2 years ago