malteos / llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
☆59Updated 9 months ago
Alternatives and similar repositories for llm-datasets:
Users who are interested in llm-datasets are comparing it to the libraries listed below:
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning☆30Updated 2 years ago
- Truly flash implementation of DeBERTa disentangled attention mechanism.☆46Updated 3 weeks ago
- [Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers☆128Updated last year
- ☆45Updated 3 months ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la…☆48Updated last year
- Using open source LLMs to build synthetic datasets for direct preference optimization☆61Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023☆100Updated last year
- Tools for managing datasets for governance and training.☆85Updated 3 months ago
- ☆57Updated 7 months ago
- ☆65Updated last year
- BLOOM+1: Adapting BLOOM model to support a new unseen language☆71Updated last year
- Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets.☆76Updated 6 months ago
- GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embeddings☆42Updated last year
- Pretraining Efficiently on S2ORC!☆163Updated 6 months ago
- One-stop shop for running and fine-tuning transformer-based language models for retrieval☆53Updated 2 weeks ago
- Generalist and Lightweight Model for Text Classification☆123Updated this week
- ☆89Updated 4 months ago
- ☆86Updated last month
- Simple replication of [ColBERT-v1](https://arxiv.org/abs/2004.12832).☆80Updated last year
- No Parameter Left Behind: How Distillation and Model Size Affect Zero-Shot Retrieval☆29Updated 2 years ago
- ☆47Updated last year
- Retrieval-Augmented Generation battle!☆50Updated 4 months ago
- RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker