SEACrowd / seacrowd-datahub
A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
★85 · Updated 6 months ago
Alternatives and similar repositories for seacrowd-datahub
Users who are interested in seacrowd-datahub are comparing it to the libraries listed below.
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 · ★103 · Updated last year
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 · ★148 · Updated 2 months ago
- ★106 · Updated 8 months ago
- NusaWrites is an in-depth analysis of corpora collection strategy and a comprehensive language modeling benchmark for underrepresented an… · ★25 · Updated 10 months ago
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback · ★97 · Updated 2 years ago
- A curated list of research papers and resources on Cultural LLM. · ★46 · Updated 10 months ago
- ★168 · Updated last year
- Resources for cultural NLP research · ★101 · Updated 3 months ago
- Dataset from the paper "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering" (COLING 2022) · ★114 · Updated 2 years ago
- Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks. · ★112 · Updated 5 months ago
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers · ★132 · Updated last year
- This repository contains materials for the SIGIR 2022 tutorial on opinion summarization. · ★34 · Updated 3 years ago
- Code for Multilingual Eval of Generative AI paper published at EMNLP 2023 · ★70 · Updated last year
- A Multilingual Replicable Instruction-Following Model · ★94 · Updated 2 years ago
- ★218 · Updated 3 weeks ago
- The official code for PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization · ★157 · Updated 2 years ago
- Mr. TyDi is a multi-lingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. · ★78 · Updated 3 years ago
- We introduce MKQA, an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically … · ★185 · Updated 3 years ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language · ★73 · Updated last year
- Multilingual abstractive summarization dataset extracted from WikiHow. · ★94 · Updated 5 months ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. · ★59 · Updated last year
- Fine-tuning Open-Source LLMs for Adaptive Machine Translation · ★85 · Updated last month
- Machine learning models from Singapore's NLP research community · ★36 · Updated 2 years ago
- NTREX -- News Test References for MT Evaluation · ★85 · Updated last year
- Text Extraction Formulation + Feedback Loop for state-of-the-art WSD (EMNLP 2021) · ★53 · Updated 3 years ago
- A dataset focused on summarization of dialogs, which represents the rich domain of Twitter customer care conversations · ★33 · Updated last year
- Multilingual Large Language Models Evaluation Benchmark · ★129 · Updated last year
- ★37 · Updated 10 months ago
- Crosslingual Reasoning through Test-Time Scaling · ★19 · Updated 3 months ago
- ★100 · Updated last year