malteos / llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
☆56 Updated 5 months ago
Alternatives and similar repositories for llm-datasets:
Users who are interested in llm-datasets are comparing it to the libraries listed below.
- Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆110 Updated last month
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning ☆29 Updated last year
- ☆65 Updated last year
- Tools for managing datasets for governance and training. ☆81 Updated 3 weeks ago
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆99 Updated 8 months ago
- [Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers ☆124 Updated 10 months ago
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆70 Updated 10 months ago
- ☆83 Updated 4 months ago
- Datasets collection and preprocessings framework for NLP extreme multitask learning ☆170 Updated last week
- Pre-train Static Word Embeddings ☆32 Updated this week
- ☆56 Updated 3 months ago
- German Alpaca Dataset (Cleaned + Translated) ☆23 Updated last year
- Pretraining Efficiently on S2ORC! ☆147 Updated 2 months ago
- ☆78 Updated last month
- Fine-tune ModernBERT on a large Dataset with Custom Tokenizer Training ☆52 Updated 2 weeks ago
- FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes ☆181 Updated 3 months ago
- Using open source LLMs to build synthetic datasets for direct preference optimization ☆49 Updated 10 months ago
- Minimum Bayes Risk Decoding for Hugging Face Transformers ☆56 Updated 7 months ago
- ☆29 Updated 3 months ago
- Multilingual Large Language Models Evaluation Benchmark ☆115 Updated 4 months ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆45 Updated last year
- [EMNLP'23] Official Code for "FOCUS: Effective Embedding Initialization for Monolingual Specialization of Multilingual Models" ☆29 Updated 2 months ago
- GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embeddings ☆37 Updated 10 months ago
- Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets. ☆66 Updated 2 months ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ☆93 Updated last year
- Retrieval-Augmented Generation battle! ☆48 Updated last month
- ☆96 Updated 2 years ago
- Generalist and Lightweight Model for Text Classification ☆58 Updated 2 weeks ago
- The official repository for Efficient Long-Text Understanding Using Short-Text Models (Ivgi et al., 2022) paper ☆67 Updated last year