cisnlp/GlotLID

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/cisnlp/GlotLID)

cisnlp / GlotLID

[EMNLP 2023] 💬 Language Identification with Support for More Than 2000 Labels

☆211

Alternatives and similar repositories for GlotLID

Users that are interested in GlotLID are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

cisnlp / GlotScript
View on GitHub
[LREC 2024] 🖋 Resource and Tool for Writing System Identification
☆22Mar 29, 2026Updated 3 months ago
cisnlp / GlotWeb
View on GitHub
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17Apr 14, 2026Updated 3 months ago
laurieburchell / open-lid-dataset
View on GitHub
Repository accompanying "An Open Dataset and Model for Language Identification" (Burchell et al., 2023)
☆77Apr 1, 2025Updated last year
cisnlp / multypo
View on GitHub
A Multilingual Keyboard Layout-Based Typo Generator
☆17Nov 23, 2025Updated 7 months ago
cisnlp / Glot500
View on GitHub
[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
☆107Apr 14, 2026Updated 3 months ago
End-to-end encrypted cloud storage - Proton Drive • Ad
Special offer: 40% Off Yearly / 80% Off First Month. Protect your most important files, photos, and documents from prying eyes.
cisnlp / ofa
View on GitHub
[NAACL 2024] A Framework aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining
☆18Nov 26, 2023Updated 2 years ago
cisnlp / MEXA
View on GitHub
[ACL 2025] 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
☆11Apr 6, 2025Updated last year
Helsinki-NLP / OpusFilter
View on GitHub
OpusFilter - Parallel corpus processing toolkit
☆115Jul 1, 2026Updated 2 weeks ago
konstantinjdobler / focus
View on GitHub
[EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models"
☆37Jun 7, 2025Updated last year
MaLA-LM / GlotEval
View on GitHub
GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way
☆18Nov 4, 2025Updated 8 months ago
cisnlp / mPLM-Sim
View on GitHub
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
☆11Jan 19, 2024Updated 2 years ago
dadelani / sib-200
View on GitHub
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
☆26May 20, 2026Updated 2 months ago
fyvo / WMT-Biomed-Test
View on GitHub
☆13Aug 23, 2024Updated last year
google-research / metricx
View on GitHub
☆146Jul 2, 2026Updated 2 weeks ago
Managed Database hosting by DigitalOcean • Ad
PostgreSQL, MySQL, MongoDB, Kafka, Valkey, and OpenSearch available. Automatically scale up storage and focus on building your apps.
bitextor / bicleaner-ai
View on GitHub
Bicleaner fork that uses neural networks
☆40Feb 23, 2026Updated 4 months ago
google-research / url-nlp
View on GitHub
☆273Aug 1, 2025Updated 11 months ago
huggingface / datatrove
View on GitHub
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,217Updated this week
kensho-technologies / pathpiece
View on GitHub
PathPiece tokenizer
☆14Nov 10, 2024Updated last year
swiss-ai / parity-aware-bpe
View on GitHub
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [ACL 2026]
☆19Apr 18, 2026Updated 3 months ago
bitextor / bifixer
View on GitHub
Tool to fix bitexts and tag near-duplicates for removal
☆35Sep 4, 2025Updated 10 months ago
kargaranamir / Hengam
View on GitHub
Hengam: An Adversarially Trained Transformer for Persian Temporal Tagging (AACL'22)
☆11Aug 25, 2023Updated 2 years ago
openlanguagedata / seed
View on GitHub
Seed Machine Translation Data
☆34Nov 12, 2024Updated last year
MicrosoftTranslator / NTREX
View on GitHub
NTREX -- News Test References for MT Evaluation
☆87Jun 5, 2024Updated 2 years ago
Managed hosting for WordPress and PHP on Cloudways • Ad
Managed hosting for WordPress, Magento, Laravel, or PHP apps, on multiple cloud providers. Deploy in minutes on Cloudways by DigitalOcean.
superlinear-ai / wtpsplit-lite
View on GitHub
✂️ Sentence segmentation with wtpsplit's state-of-the-art Segment any Text (SaT) models
☆39May 2, 2026Updated 2 months ago
neulab / AfricanVoices
View on GitHub
Hosts text-to-speech corpus and speech synthesizers for African languages.
☆19May 31, 2023Updated 3 years ago
openlanguagedata / flores
View on GitHub
The FLORES+ Machine Translation Benchmark
☆112Nov 12, 2024Updated last year
McGill-NLP / feedbackqa
View on GitHub
FeedbackQA: Improving Question Answering Post-Deployment with Interactive Feedback
☆12Jul 13, 2022Updated 4 years ago
facebookresearch / stopes
View on GitHub
A library for preparing data for machine translation research (monolingual preprocessing, bitext mining, etc.) built by the FAIR NLLB te…
☆309Updated this week
thammegowda / mtdata
View on GitHub
A tool that locates, downloads, and extracts machine translation corpora
☆165Apr 13, 2026Updated 3 months ago
oscar-project / ungoliant
View on GitHub
The pipeline for the OSCAR corpus
☆178Nov 9, 2025Updated 8 months ago
pemistahl / lingua-py
View on GitHub
The most accurate natural language detection library for Python, suitable for short text and mixed-language text
☆1,759Updated this week
stefan-it / gc4lm
View on GitHub
GC4LM: A Colossal (Biased) language model for German
☆13May 2, 2021Updated 5 years ago
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
allenai / duplodocus
View on GitHub
Tooling for exact and MinHash deduplication of large-scale text datasets
☆90Mar 24, 2026Updated 3 months ago
LibreTranslate / LexiLang
View on GitHub
Simple, fast dictionary-based language detector for short texts.
☆22Feb 5, 2026Updated 5 months ago
alpoktem / bible2speechDB
View on GitHub
Scripts to create speech corpora from open.bible
☆13Jan 3, 2022Updated 4 years ago
LAGoM-NLP / transtokenizer
View on GitHub
☆57Dec 27, 2025Updated 6 months ago
bminixhofer / zett
View on GitHub
Code for Zero-Shot Tokenizer Transfer
☆145Jan 14, 2025Updated last year
Waxal-Multilingual / speech-data
View on GitHub
This repository contains multi-modal speech data for African languages that can be used to train ASR and NLP models
☆17Aug 31, 2022Updated 3 years ago
google-research / xtreme-up
View on GitHub
☆53Jun 6, 2023Updated 3 years ago