cisnlp/GlotWeb

[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages

☆17

Alternatives and similar repositories for GlotWeb

Users that are interested in GlotWeb are comparing it to the libraries listed below.

cisnlp / mPLM-Sim
mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
☆11 · Jan 19, 2024 · Updated 2 years ago
harish-kamath / rqae
Residual Quantization Autoencoder, used for interpreting LLMs
☆14 · Jan 1, 2025 · Updated last year
cisnlp / MEXA
[ACL 2025] 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
☆11 · Apr 6, 2025 · Updated last year
cisnlp / ofa
[NAACL 2024] A framework that aims to wisely initialize unseen subword embeddings in PLMs for efficient large-scale continued pretraining
☆18 · Nov 26, 2023 · Updated 2 years ago
cisnlp / GlotScript
[LREC 2024] 🖋 Resource and Tool for Writing System Identification
☆22 · Mar 29, 2026 · Updated 3 months ago
swiss-ai / parity-aware-bpe
Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [ACL 2026]
☆20 · Apr 18, 2026 · Updated 3 months ago
cisnlp / Glot500
[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
☆107 · Apr 14, 2026 · Updated 3 months ago
lm-pub-quiz / lm-pub-quiz
Evaluate language models using multiple choice items
☆13 · Mar 6, 2026 · Updated 4 months ago
cisnlp / GlotLID
[EMNLP 2023] 💬 Language Identification with Support for More Than 2000 Labels
☆212 · Apr 15, 2026 · Updated 3 months ago
internetarchive / sandcrawler
Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
☆28 · Jul 31, 2024 · Updated last year
mainlp / germanic-lrl-corpora
Overview of corpora/datasets for Germanic low-resource languages and dialects. Accompanies "A Survey of Corpora for Germanic Low-Resource…
☆28 · Feb 16, 2026 · Updated 5 months ago
neuml / staticvectors
🔢 Work with static vector models
☆39 · Apr 21, 2025 · Updated last year
UBC-NLP / afrolid
AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages.
☆39 · Feb 5, 2026 · Updated 5 months ago
mayhewsw / pytorch-truecaser
A simple neural truecaser written in pytorch and allennlp.
☆35 · Jun 17, 2024 · Updated 2 years ago
amodaresi / MemLLM
☆13 · Aug 13, 2024 · Updated last year
cisnlp / multypo
A Multilingual Keyboard Layout-Based Typo Generator
☆17 · Nov 23, 2025 · Updated 8 months ago
cisnlp / parcoure
ParCourE - Parallel Corpus Explorer
☆12 · Dec 27, 2021 · Updated 4 years ago
hybridtheory / floc-simhash
A fast python implementation of the SimHash algorithm.
☆27 · Oct 27, 2021 · Updated 4 years ago
sb-b / BOUN-PARS
☆15 · Jan 10, 2022 · Updated 4 years ago
SAP-archive / portal
Implementation of the deep learning models with training and evaluation pipelines described in the paper "PORTAL: Scalable Tabular Founda…
☆15 · May 16, 2025 · Updated last year
rewire-online / multilingual-hatecheck
Röttger et al. (WOAH at NAACL 2022): "Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models"
☆17 · May 23, 2022 · Updated 4 years ago
papercopilot / iclr-insights
Insights from the ICLR Peer Review and Rebuttal Process
☆16 · Nov 24, 2025 · Updated 8 months ago
AmenRa / indxr
A Python utility for indexing file lines. Best demo honourable mention at ECIR 2024.
☆23 · Nov 9, 2025 · Updated 8 months ago
malteos / llm-datasets
A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling.
☆66 · Jul 29, 2024 · Updated last year
marcharaoui / RAG-from-scratch
Implement different RAG pipelines from scratch for your specific needs
☆16 · Jun 5, 2025 · Updated last year
KorAP / Koral
Translation of query languages to serialized KoralQuery protocol
☆15 · Updated this week
LoicGrobol / zeldarose
Train transformer-based models.
☆28 · Apr 12, 2026 · Updated 3 months ago
ZenMule / Praat_Scripting_Tutorial
Introduction to Praat scripting
☆15 · Apr 8, 2025 · Updated last year
KorAP / Tokenizer-Evaluation
Benchmark scripts for comparing different tokenizers and sentence segmenters of German
☆12 · Feb 27, 2023 · Updated 3 years ago
hplt-project / OpusCleaner
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
☆58 · Feb 3, 2026 · Updated 5 months ago
ymoslem / MT-Tools
Collection of Common Machine Translation Tools
☆11 · Jul 26, 2022 · Updated 4 years ago
originell / smaz-py3
Small string compression using smaz compression algorithm. Fast, because it's in C. Supports Python 3+
☆13 · Oct 18, 2025 · Updated 9 months ago
commoncrawl / web-languages
Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ …
☆71 · Jul 1, 2026 · Updated 3 weeks ago
pjox / gutf
Terminal tool that converts files encoding to UTF-8
☆10 · Oct 5, 2019 · Updated 6 years ago
Yinghao-Li / CHMM-ALT
Code for "BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition"
☆32 · Jun 20, 2023 · Updated 3 years ago
cindyxinyiwang / expand-via-lexicon-based-adaptation
Code for ACL 2022 paper "Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation"
☆29 · Apr 2, 2022 · Updated 4 years ago
cimeister / tokenizer-intrinsic-evals
TokEval: intrinsic quality metrics for tokenizers across natural language, code, and math
☆46 · Jul 4, 2026 · Updated 3 weeks ago
terryoo / ATDNet
☆12 · Jun 5, 2019 · Updated 7 years ago
google-research / nisaba
Finite-state script normalization and processing utilities
☆52 · Updated this week