SEACrowd / seacrowd-datahub
A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
☆94 Updated 11 months ago
Alternatives and similar repositories for seacrowd-datahub
Users who are interested in seacrowd-datahub are comparing it to the libraries listed below.
- A curated list of research papers and resources on Cultural LLM. ☆52 Updated last year
- 🔬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆182 Updated last month
- ☆231 Updated 5 months ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆107 Updated last year
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback ☆96 Updated 2 years ago
- Code for Multilingual Eval of Generative AI paper published at EMNLP 2023 ☆71 Updated last year
- [Data + code] ExpertQA: Expert-Curated Questions and Attributed Answers ☆137 Updated last year
- Efficient Attention for Long Sequence Processing ☆98 Updated 2 years ago
- NusaWrites is an in-depth analysis of corpora collection strategy and a comprehensive language modeling benchmark for underrepresented an… ☆27 Updated last year
- ☆180 Updated last year
- Resources for cultural NLP research ☆113 Updated 3 months ago
- Fine-tuning Open-Source LLMs for Adaptive Machine Translation ☆90 Updated 6 months ago
- Dataset from the paper "Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering" (COLING 2022) ☆117 Updated 3 years ago
- A Multilingual Replicable Instruction-Following Model ☆95 Updated 2 years ago
- The official code for PRIMERA: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization ☆156 Updated 3 years ago
- Repository for research in the field of Responsible NLP at Meta. ☆204 Updated 7 months ago
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets ☆225 Updated last year
- ☆119 Updated last year
- ☆40 Updated last year
- Repo for the Belebele dataset, a massively multilingual reading comprehension dataset. ☆338 Updated last year
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆74 Updated last year
- Powerful unsupervised domain adaptation method for dense retrieval. Requires only unlabeled corpus and yields massive improvement: "GPL: … ☆338 Updated 2 years ago
- OpenNyAI is a mission aimed at developing open source software and datasets to catalyze the creation of AI-powered solutions to improve a… ☆42 Updated last year
- Benchmarking Large Language Models ☆104 Updated 6 months ago
- Ghostbuster: Detecting Text Ghostwritten by Large Language Models (NAACL 2024) ☆175 Updated last year
- ☆80 Updated last year
- Tools for evaluating the performance of MT metrics on data from recent WMT metrics shared tasks. ☆123 Updated 2 months ago
- TimeLMs: Diachronic Language Models from Twitter ☆111 Updated last year
- FBI: Finding Blindspots in LLM Evaluations with Interpretable Checklists ☆31 Updated 4 months ago
- Datasets collection and preprocessings framework for NLP extreme multitask learning ☆189 Updated 6 months ago