ehsanasgari/1000Langs

Creating super-parallel corpora of more than 1,500 unique languages for NLP research

☆33

Alternatives and similar repositories for 1000Langs

Users who are interested in 1000Langs are comparing it to the libraries listed below.

dadelani / sib-200
SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects
☆26 · May 20, 2026 · Updated 2 months ago
alpoktem / bible2speechDB
Scripts to create speech corpora from open.bible
☆13 · Jan 3, 2022 · Updated 4 years ago
google-research-datasets / TF-IDF-IIF-top100-wordlists
These are lists for a variety of languages containing words that are distinctive to each language.
☆42 · Apr 5, 2022 · Updated 4 years ago
cisnlp / Glot500
[ACL 2023] Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages
☆107 · Apr 14, 2026 · Updated 3 months ago
gauthelo / kallaama-speech-dataset
A transcribed speech dataset in Wolof, Pulaar and Sereer, to support agriculture. Funded by Lacuna Fund.
☆20 · Mar 26, 2026 · Updated 3 months ago
awasthiabhijeet / Error-Driven-ASR-Personalization
Code for "Error-driven Fixed-Budget ASR Personalization for Accented Speakers" in ICASSP 2021
☆11 · Jun 13, 2021 · Updated 5 years ago
Kartikaggarwal98 / Indian_ParallelCorpus
Curated list of publicly available parallel corpora for Indian languages
☆36 · Jul 15, 2021 · Updated 5 years ago
cisnlp / GlotScript
[LREC 2024] 🖋 Resource and Tool for Writing System Identification
☆22 · Mar 29, 2026 · Updated 3 months ago
csikasote / BembaSpeech
This is an ASR corpus for the Bemba language. It contains read speech from diverse publicly available Bemba sources; Literature Books, Radio/…
☆41 · Jul 31, 2025 · Updated 11 months ago
wenkokke / dep2con
Several algorithms for converting dependency structures into constituency structures.
☆10 · Feb 7, 2022 · Updated 4 years ago
neulab / newlang-tech
A guide to building language technology in new languages.
☆59 · Feb 1, 2022 · Updated 4 years ago
antonisa / embeddings
Data and scripts for the proper evaluation of cross-lingual embeddings in multiple languages
☆15 · Apr 11, 2020 · Updated 6 years ago
christos-c / bible-corpus
A multilingual parallel corpus created from translations of the Bible.
☆197 · May 19, 2025 · Updated last year
WolofProcessing / online_wolof_data
Curate online Wolof text resources that can be used to build models
☆28 · Jun 25, 2026 · Updated 3 weeks ago
ElotlMX / py-elotl
Python package for Natural Language Processing (NLP), focused on low-resource languages spoken in Mexico.
☆24 · Sep 4, 2025 · Updated 10 months ago
tihu-nlp / tihudict
Tihu dictionary for the Persian language
☆13 · Sep 8, 2019 · Updated 6 years ago
TurkuNLP / wikibert
BERT models for many languages created from Wikipedia texts
☆33 · May 25, 2020 · Updated 6 years ago
abuccts / wikt2pron
A Python toolkit converting pronunciation in enwiktionary xml dump to cmudict format
☆34 · Jul 5, 2019 · Updated 7 years ago
cisnlp / GlotWeb
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17 · Apr 14, 2026 · Updated 3 months ago
kamperh / speech_dtw
Dynamic time warping (DTW) functions specifically for speech alignment.
☆30 · May 6, 2024 · Updated 2 years ago
cisnlp / GlotCC
[NeurIPS 2024] 🕸 GlotCC Dataset and Pipeline
☆21 · Apr 6, 2025 · Updated last year
mbanon / fastspell
Targeted language identifier, based on FastText and Hunspell.
☆38 · Sep 4, 2025 · Updated 10 months ago
Niger-Volta-LTI / yoruba-voice
Repo & Project for the Imminent Research Grant code & tasks
☆12 · May 20, 2024 · Updated 2 years ago
OpenNMT / Server
☆17 · Jan 2, 2017 · Updated 9 years ago
masakhane-io / lafand-mt
MAFAND-MT
☆63 · Jul 9, 2024 · Updated 2 years ago
mingruimingrui / fast-mosestokenizer
C++ mosestokenizer
☆18 · Mar 13, 2024 · Updated 2 years ago
amazon-science / proteno
This repository contains data used in the NAACL 2021 Paper - Proteno: Text Normalization with Limited Data for Fast Deployment in Text to…
☆45 · May 25, 2021 · Updated 5 years ago
google-research / url-nlp
☆273 · Aug 1, 2025 · Updated 11 months ago
acoli-repo / acoli-dicts
3000+ machine-readable open source dictionaries distributed by the Applied Computational Linguistics lab at the University of Augsburg, G…
☆17 · Jul 19, 2023 · Updated 3 years ago
muelletm / cistern
Open-source tools for morphological tagging, segmentation and stemming.
☆41 · Jul 11, 2019 · Updated 7 years ago
motazsaad / arwikiExtracts
Arabic Wikipedia Extracts
☆14 · Jun 16, 2022 · Updated 4 years ago
NickRuiz / power-asr
Phonetically-Oriented Word Error Rate
☆36 · May 4, 2019 · Updated 7 years ago
gpu-poor / gramvaani_hindi_asr
This repo contains the baseline model recipes and pre-trained model for the GramVaani Hindi ASR challenge
☆16 · Mar 26, 2022 · Updated 4 years ago
AliOsm / arabic-text-diacritization
Benchmark Arabic text diacritization dataset
☆78 · Apr 7, 2026 · Updated 3 months ago
EtienneAb3d / OpenNeuroSpell
OpenNeuroSpell contains parts of NeuroSpell (http://neurospell.com/en.php) released as open-source. More code will be published as soon a…
☆20 · Oct 29, 2024 · Updated last year
kbatsuren / wiktra
Wiktra - Python tool of Wiktionary Transliteration modules for 514 languages and their 102 different scripts (orthographies)
☆37 · Jun 29, 2025 · Updated last year
h9-tec / MorphBPE
☆17 · Jan 27, 2025 · Updated last year
morrisalp / taatiknet
Character-level conversion between Hebrew text and Latin transliteration using deep learning - a demonstration of seq2seq training.
☆16 · Jun 27, 2023 · Updated 3 years ago
rnd2110 / MorphAGram
A Language-Independent Unsupervised Morphological Segmentation Framework based on Adaptor Grammars
☆17 · Jun 14, 2024 · Updated 2 years ago