malteos / llm-datasets
A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.
☆53 Updated 3 months ago
Related projects
Alternatives and complementary repositories for llm-datasets
- Tools for managing datasets for governance and training. ☆78 Updated 3 weeks ago
- ☆27 Updated 3 months ago
- [Data + code] ExpertQA : Expert-Curated Questions and Attributed Answers ☆122 Updated 8 months ago
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning ☆29 Updated last year
- GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embeddings ☆37 Updated 8 months ago
- ☆51 Updated last year
- German Alpaca Dataset (Cleaned + Translated) ☆23 Updated last year
- ☆21 Updated last week
- Framework for unified summarisation and evaluation of English documents using state-of-the-art models and measures. ☆31 Updated 6 months ago
- SWIM-IR is a Synthetic Wikipedia-based Multilingual Information Retrieval training set with 28 million query-passage pairs spanning 33 la… ☆44 Updated last year
- Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages -- ACL 2023 ☆96 Updated 7 months ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ☆92 Updated last year
- ☆83 Updated 2 months ago
- ☆65 Updated last year
- ☆95 Updated last year
- This repository provides scripts for evaluating NLP models on the LEXTREME benchmark, a set of diverse multilingual tasks in legal NLP ☆20 Updated 10 months ago
- Dense hybrid representations for text retrieval ☆62 Updated last year
- Pretraining Efficiently on S2ORC! ☆136 Updated 3 weeks ago
- Do Multilingual Language Models Think Better in English? ☆41 Updated last year
- A Multilingual Replicable Instruction-Following Model ☆94 Updated last year
- Minimum Bayes Risk Decoding for Hugging Face Transformers ☆56 Updated 5 months ago
- INCOME: An Easy Repository for Training and Evaluation of Index Compression Methods in Dense Retrieval. Includes BPR and JPQ. ☆22 Updated last year
- GLADIS: A General and Large Acronym Disambiguation Benchmark (EACL 23) ☆13 Updated 4 months ago
- ☆19 Updated 3 years ago
- Semantically Structured Sentence Embeddings ☆67 Updated last month
- BLOOM+1: Adapting BLOOM model to support a new unseen language ☆70 Updated 8 months ago
- The official repository for Efficient Long-Text Understanding Using Short-Text Models (Ivgi et al., 2022) paper ☆67 Updated last year
- SPRINT Toolkit helps you evaluate diverse neural sparse models easily using a single click on any IR dataset. ☆42 Updated last year
- ☆73 Updated last year
- The corresponding code for our paper: "Exploring the Challenges of Open Domain Multi-Document Summarization". Do not hesitate to open an … ☆31 Updated last year