SEACrowd / seacrowd-datahub
A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
☆94 · Updated last year
Alternatives and similar repositories for seacrowd-datahub
Users who are interested in seacrowd-datahub are comparing it to the libraries listed below.
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆107 · Updated last year
- 🔬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆186 · Updated 2 months ago
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback ☆96 · Updated 2 years ago
- A curated list of research papers and resources on Cultural LLM. ☆53 · Updated last year
- ☆127 · Updated last week
- ☆182 · Updated last year
- Resources for cultural NLP research ☆113 · Updated 4 months ago
- Code for Multilingual Eval of Generative AI paper published at EMNLP 2023 ☆72 · Updated last year
- NusaWrites is an in-depth analysis of corpora collection strategy and a comprehensive language modeling benchmark for underrepresented an… ☆27 · Updated last year
- Dataset from the paper "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering" (COLING 2022) ☆118 · Updated 3 years ago
- TimeLMs: Diachronic Language Models from Twitter ☆112 · Updated last year
- ☆41 · Updated last year
- ☆256 · Updated 5 months ago
- A Multilingual Replicable Instruction-Following Model ☆95 · Updated 2 years ago
- Multilingual Large Language Models Evaluation Benchmark ☆133 · Updated last year
- South-East Asia Large Language Models ☆383 · Updated this week
- Repository for research in the field of Responsible NLP at Meta. ☆204 · Updated last week
- [Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers ☆136 · Updated last year
- Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks. ☆125 · Updated 3 months ago
- Ghostbuster: Detecting Text Ghostwritten by Large Language Models (NAACL 2024) ☆176 · Updated last year
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆74 · Updated last year
- Benchmarking Large Language Models ☆105 · Updated 7 months ago
- Code for Zero-Shot Tokenizer Transfer ☆142 · Updated last year
- Models for automatically transforming toxic text to neutral ☆35 · Updated 2 years ago
- Repo for the Belebele dataset, a massively multilingual reading comprehension dataset. ☆340 · Updated last year
- ☆80 · Updated last year
- This repository contains the code, data, and models of the paper titled "XL-Sum: Large-Scale Multilingual Abstractive Summarization for 4… ☆277 · Updated last year
- SeeGULL is a broad-coverage stereotype dataset in English containing stereotypes about identity groups spanning 178 countries across 8 di… ☆38 · Updated 2 years ago
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling. ☆64 · Updated last year
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets ☆226 · Updated last year