SEACrowd / seacrowd-datahub
A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
☆92 · Updated 8 months ago
Alternatives and similar repositories for seacrowd-datahub
Users who are interested in seacrowd-datahub are comparing it to the repositories listed below.
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback ☆97 · Updated 2 years ago
- A curated list of research papers and resources on Cultural LLM. ☆51 · Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆104 · Updated last year
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆162 · Updated 4 months ago
- ☆218 · Updated 2 months ago
- Multilingual Large Language Models Evaluation Benchmark ☆132 · Updated last year
- NusaWrites is an in-depth analysis of corpora collection strategy and a comprehensive language modeling benchmark for underrepresented an… ☆27 · Updated last year
- Code for Multilingual Eval of Generative AI paper published at EMNLP 2023 ☆70 · Updated last year
- ☆112 · Updated 10 months ago
- ☆169 · Updated last year
- Generate synthetic labeled data for extremely low-resource languages using bilingual lexicons. ☆18 · Updated last year
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆133 · Updated last year
- Benchmarking Large Language Models ☆99 · Updated 3 months ago
- A multilingual version of MS MARCO passage ranking dataset ☆144 · Updated 2 years ago
- A Multilingual Replicable Instruction-Following Model ☆95 · Updated 2 years ago
- Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: … ☆337 · Updated 2 years ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆73 · Updated last year
- Code and Data for "Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering" ☆86 · Updated last year
- Code for our WOAH@ACL 2021 Paper on Data Integration for Toxic Comment Classification: Making More Than 40 Datasets Easily Accessible in … ☆29 · Updated 3 years ago
- Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks. ☆117 · Updated 7 months ago
- Repo for the Belebele dataset, a massively multilingual reading comprehension dataset. ☆335 · Updated 10 months ago
- South-East Asia Large Language Models ☆360 · Updated this week
- Dataset from the paper "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering" (COLING 2022) ☆114 · Updated 2 years ago
- Implementation of ConGen: Unsupervised Control and Generalization Distillation For Sentence Representation (Findings of EMNLP 2022). ☆22 · Updated 2 years ago
- ☆171 · Updated 6 years ago
- ☆86 · Updated 6 months ago
- TimeLMs: Diachronic Language Models from Twitter ☆111 · Updated last year
- Inquisitive Parrots for Search ☆198 · Updated 4 months ago
- ☆79 · Updated last year
- Data for evaluating gender bias in coreference resolution systems. ☆80 · Updated 6 years ago