SEACrowd/seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.

☆104

Alternatives and similar repositories for seacrowd-datahub

Users who are interested in seacrowd-datahub are comparing it to the libraries listed below.

IndoNLP / nusa-writes
NusaWrites is an in-depth analysis of corpora collection strategy and a comprehensive language modeling benchmark for underrepresented an…
☆30 · Sep 27, 2024 · Updated last year
IndoNLP / nusa-crowd
A collaborative project to collect datasets in Indonesian languages.
☆286 · Jun 2, 2024 · Updated 2 years ago
HLTCHKUST / UniVaR
Official repository for paper "High-Dimension Human Value Representation in Large Language Models" (NAACL'25 Main)
☆23 · Jul 9, 2024 · Updated 2 years ago
UKPLab / maps
Multicultural Proverbs and Sayings
☆13 · Jan 11, 2025 · Updated last year
faridlazuarda / cultural-llm-papers
A curated list of research papers and resources on Cultural LLM.
☆53 · Sep 26, 2024 · Updated last year
IndoNLP / nusax
High-quality parallel resource on sentiment analysis for 10 low-resource Indonesian languages, English, and Indonesian (Outstanding Paper…
☆116 · May 8, 2023 · Updated 3 years ago
aisingapore / sealion
South-East Asia Large Language Models
☆418 · Jul 13, 2026 · Updated last week
Yale-LILY / ReasTAP
Data and Code for EMNLP 2022 paper "ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples"
☆15 · Jun 4, 2023 · Updated 3 years ago
datarubrics / datarubrics
DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in …
☆17 · Jun 6, 2025 · Updated last year
PyThaiNLP / thai-synonym
Synonyms for Thai (open source & open data)
☆18 · Dec 6, 2023 · Updated 2 years ago
zliucr / crosslingual-slu
EMNLP-2020: Cross-lingual Spoken Language Understanding with Regularized Representation Alignment
☆18 · Nov 21, 2020 · Updated 5 years ago
louisowen6 / quora_paraphrasing_id
Quora Paraphrasing Dataset Bahasa Indonesia Version
☆11 · Apr 18, 2021 · Updated 5 years ago
BatsResearch / LexC-Gen
Generate synthetic labeled data for extremely low-resource languages using bilingual lexicons.
☆20 · Oct 3, 2024 · Updated last year
petrabarus / if-itb-latex
LaTeX template for final year project / thesis report at Informatics Engineering ITB
☆102 · Jun 28, 2026 · Updated 3 weeks ago
DineshRaghu / multi-level-memory-network
Multi-Level Memory for Task Oriented Dialogs
☆15 · Jul 19, 2019 · Updated 7 years ago
wangjin0818 / word_embedding_refine
☆23 · Aug 13, 2018 · Updated 7 years ago
goldmermaid / mlrs
☆16 · Aug 6, 2019 · Updated 6 years ago
HLTCHKUST / Perplexity-FactChecking
Towards Few-Shot Fact-Checking via Perplexity
☆13 · Jun 11, 2021 · Updated 5 years ago
duytinvo / acl2016
Don't Count, Predict! An Automatic Approach to Learning Sentiment Lexicons for Short Text
☆13 · Jul 20, 2016 · Updated 10 years ago
KenrickLance / BalitaNLP-Dataset
Filipino multi-modal NLP dataset. Consists of 350k+ Filipino news articles and associated images
☆14 · Mar 11, 2025 · Updated last year
UBC-NLP / serengeti
SERENGETI: Massively Multilingual Language Models for Africa
☆17 · Oct 26, 2023 · Updated 2 years ago
indolem / IndoBERTweet
IndoBERTweet is the first large-scale pretrained model for Indonesian Twitter. Published at EMNLP 2021 (main conference)
☆76 · Sep 13, 2021 · Updated 4 years ago
ehsanasgari / 1000Langs
Creating super-parallel corpora of more than 1500+ unique languages for NLP research
☆33 · Dec 8, 2022 · Updated 3 years ago
princeton-nlp / align-mlm
☆13 · Nov 30, 2022 · Updated 3 years ago
indonesian-nlp / multilingual-asr
Multilingual Speech Recognition for Indonesian Languages
☆72 · Oct 5, 2022 · Updated 3 years ago
fajri91 / RSTExtractor
☆11 · Dec 8, 2022 · Updated 3 years ago
jcblaisecruz02 / Filipino-Text-Benchmarks
Open-source benchmark datasets and pretrained transformer models in the Filipino language.
☆67 · Aug 26, 2024 · Updated last year
anaistack / cefr-asag-corpus
A corpus of short answers written by learners of English and graded with CEFR levels
☆12 · Dec 17, 2021 · Updated 4 years ago
LazarusNLP / indonesian-sentence-embeddings
Embedding Representation for Indonesian Sentences!
☆26 · Aug 14, 2024 · Updated last year
mkaufma90 / word2vec-deps
Dependency based word embeddings
☆12 · Jun 19, 2017 · Updated 9 years ago
UBC-NLP / afrolid
AfroLID, a powerful neural toolkit for African languages identification which covers 517 African languages.
☆39 · Feb 5, 2026 · Updated 5 months ago
IndoNLP / indonlu
The first-ever vast natural language processing benchmark for Indonesian Language. We provide multiple downstream tasks, pre-trained Indo…
☆653 · Nov 16, 2024 · Updated last year
LaSTUS-TALN-UPF / TSAR-2022-Shared-Task
TSAR2022 Shared Task on Lexical Simplification - Datasets and Evaluation scripts
☆10 · Oct 27, 2022 · Updated 3 years ago
matthewgo / FilipinoStanfordPOSTagger
☆12 · Apr 26, 2020 · Updated 6 years ago
fajri91 / sum_liputan6
The first large-scale summarization corpus for the Indonesian language. AACL 2020.
☆38 · Mar 4, 2021 · Updated 5 years ago
HLTCHKUST / ke-dialogue
KE-Dialogue: Injecting knowledge graph into a fully end-to-end dialogue system.
☆46 · Jan 19, 2022 · Updated 4 years ago
THUNLP-MT / CODIS
Repo for paper "CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models".
☆13 · Oct 14, 2024 · Updated last year
indolem / indolem
IndoLEM is a comprehensive Indonesian NLU benchmark, comprising three pillars of NLP tasks: morpho-syntax, semantics, and discourse. Presented…
☆109 · Dec 14, 2020 · Updated 5 years ago
cohere-ai / magikarp
Code for the paper "Fishing for Magikarp"
☆191 · Jul 10, 2026 · Updated 2 weeks ago