malteos / llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
☆56 Updated 7 months ago
Alternatives and similar repositories for llm-datasets:
Users interested in llm-datasets are comparing it to the repositories listed below.
- [Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers ☆126 Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆100 Updated 11 months ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆46 Updated last year
- Tools for managing datasets for governance and training. ☆82 Updated last month
- ☆72 Updated last year
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ☆93 Updated 2 years ago
- ☆26 Updated 3 weeks ago
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning ☆29 Updated 2 years ago
- ☆65 Updated last year
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆71 Updated last year
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆122 Updated 3 months ago
- Minimum Bayes Risk Decoding for Hugging Face Transformers ☆57 Updated 9 months ago
- ☆28 Updated last year
- German Alpaca Dataset (Cleaned + Translated) ☆24 Updated last year
- Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets. ☆73 Updated 5 months ago
- GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embeddings ☆38 Updated last year
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆105 Updated 10 months ago
- ☆97 Updated 2 years ago
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback ☆94 Updated last year
- Pre-train Static Word Embeddings ☆49 Updated 2 weeks ago
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" ☆30 Updated 5 months ago
- ☆57 Updated 5 months ago
- Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 q… ☆87 Updated last year
- T-Projection is a method to perform high-quality Annotation Projection of Sequence Labeling datasets. ☆12 Updated last year
- Datasets collection and preprocessings framework for NLP extreme multitask learning ☆175 Updated 2 months ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face ☆31 Updated last year
- Using open source LLMs to build synthetic datasets for direct preference optimization ☆59 Updated last year
- Efficient few-shot learning with cross-encoders. ☆49 Updated last year
- Generalist and Lightweight Model for Text Classification ☆91 Updated this week
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes ☆189 Updated 5 months ago