SEACrowd / seacrowd-datahub
A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
☆82 Updated 3 months ago
Alternatives and similar repositories for seacrowd-datahub
Users who are interested in seacrowd-datahub compare it to the repositories listed below.
- NusaWrites is an in-depth analysis of corpora collection strategy and a comprehensive language modeling benchmark for underrepresented an… ☆25 Updated 7 months ago
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback ☆95 Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆100 Updated last year
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆131 Updated 5 months ago
- Welcome to our repository! This repository hosts the data on "IndoCollex: A Testbed for Morphological Transformation of Indonesian Word … ☆21 Updated 3 years ago
- A curated list of research papers and resources on Cultural LLM. ☆42 Updated 7 months ago
- Code for Multilingual Eval of Generative AI paper published at EMNLP 2023 ☆68 Updated last year
- [Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers ☆128 Updated last year
- High-quality parallel resource on sentiment analysis for 10 low-resource Indonesian languages, English, and Indonesian (Outstanding Paper… ☆99 Updated 2 years ago
- This repository contains materials for the SIGIR 2022 tutorial on opinion summarization. ☆34 Updated 2 years ago
- Resources for cultural NLP research ☆95 Updated 3 weeks ago
- Multilingual Large Language Models Evaluation Benchmark ☆123 Updated 8 months ago
- Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks. ☆108 Updated 2 months ago
- ☆90 Updated 5 months ago
- IndoNLI ☆19 Updated 3 years ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆71 Updated last year
- ☆161 Updated 10 months ago
- GEMBA — GPT Estimation Metric Based Assessment ☆118 Updated 9 months ago
- A Multilingual Replicable Instruction-Following Model ☆93 Updated last year
- A multilingual version of MS MARCO passage ranking dataset ☆145 Updated last year
- ☆209 Updated 2 months ago
- The pipeline for the OSCAR corpus ☆167 Updated last year
- MINERS ⛏️: The semantic retrieval benchmark for evaluating multilingual language models. (EMNLP 2024 Findings) ☆13 Updated 7 months ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data; it should also work with any Hugging Face text dataset. ☆93 Updated 2 years ago
- TUFS Asian Language Parallel Corpus ☆50 Updated 2 years ago
- ☆11 Updated last year
- Dataset from the paper "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering" (COLING 2022) ☆113 Updated 2 years ago
- Data and evaluation code for the paper WikiNEuRal: Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2… ☆67 Updated 2 years ago
- A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling. ☆59 Updated 9 months ago
- Tools for managing datasets for governance and training. ☆85 Updated 3 months ago