huggingface/olm-training

Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data; it should also work with any Hugging Face text dataset.

☆98

Alternatives and similar repositories for olm-training

Users who are interested in olm-training are comparing it to the repositories listed below.

huggingface / olm-datasets
Pipeline for pulling and processing online language model pretraining data from the web
☆179 · Jul 31, 2023 · Updated 2 years ago
ltgoslo / ltg-bert
LTG-Bert
☆34 · Jan 8, 2024 · Updated 2 years ago
lgessler / microbert
A tiny BERT for low-resource monolingual models
☆32 · Dec 24, 2025 · Updated 6 months ago
stefan-it / xlm-v-experiments
Experiments for XLM-V Transformers Integration
☆13 · Feb 8, 2023 · Updated 3 years ago
frankxu2004 / knnlm-why
Repo for ICML23 "Why do Nearest Neighbor Language Models Work?"
☆59 · Jan 12, 2023 · Updated 3 years ago
huggingface / hffs
**ARCHIVED** Filesystem interface to 🤗 Hub
☆60 · Apr 6, 2023 · Updated 3 years ago
joeljang / ELM
[ICML 2023] Exploring the Benefits of Training Expert Language Models over Instruction Tuning
☆99 · Apr 26, 2023 · Updated 3 years ago
EleutherAI / improved-t5
Experiments for efforts to train a new and improved t5
☆76 · Apr 15, 2024 · Updated 2 years ago
csarron / BTR
☆16 · Mar 3, 2024 · Updated 2 years ago
huggingface / large_language_model_training_playbook
An open collection of implementation tips, tricks and resources for training large language models
☆502 · Mar 8, 2023 · Updated 3 years ago
castorini / hf-spacerini
Plug-and-play Search Interfaces with Pyserini and Hugging Face
☆31 · Aug 5, 2023 · Updated 2 years ago
gonglinyuan / metro_t0
Code repo for "Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers" (ACL 2023)
☆22 · Nov 1, 2023 · Updated 2 years ago
huggingface / hf-endpoints-emulator
Local emulator for Hugging Face Inference Endpoints custom handlers
☆26 · May 28, 2026 · Updated last month
argilla-io / distilabel-spin-dibt
Repository containing the SPIN experiments on the DIBT 10k ranked prompts
☆24 · Mar 12, 2024 · Updated 2 years ago
NathanGodey / headless-lm
Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…
☆29 · Apr 17, 2024 · Updated 2 years ago
huggingface / model_card
☆30 · Sep 27, 2021 · Updated 4 years ago
hieudx149 / X-RetroMAE
Code Roberta version of RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder
☆10 · Mar 16, 2023 · Updated 3 years ago
yuzhaouoe / pretraining-data-packing
[ACL'24 Oral] Analysing The Impact of Sequence Composition on Language Model Pre-Training
☆24 · Aug 18, 2024 · Updated last year
studio-ousia / bpr
Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering
☆175 · Jun 6, 2021 · Updated 5 years ago
julien-c / trainer-proposal
☆13 · Mar 27, 2020 · Updated 6 years ago
facebookresearch / bart_ls
Long-context pretrained encoder-decoder models
☆97 · Oct 28, 2022 · Updated 3 years ago
martiansideofthemoon / rankgen
Official code and model checkpoints for our EMNLP 2022 paper "RankGen - Improving Text Generation with Large Ranking Models" (https://arx…
☆140 · Aug 2, 2023 · Updated 2 years ago
zbambergerNLP / principled-pre-training
A repository to get acquainted with basic training tasks in natural language processing and machine learning
☆11 · Dec 27, 2023 · Updated 2 years ago
google-research / t5x_retrieval
☆102 · Dec 17, 2022 · Updated 3 years ago
Ankush7890 / ssfinetuning
A package for fine-tuning pretrained NLP transformers using semi-supervised learning
☆14 · Oct 27, 2021 · Updated 4 years ago
allenai / staged-training
Staged Training for Transformer Language Models
☆33 · Mar 31, 2022 · Updated 4 years ago
huggingface / gaia
Hugging Face and Pyserini interoperability
☆20 · May 18, 2023 · Updated 3 years ago
ThomasScialom / T0_continual_learning
Adding new tasks to T0 without catastrophic forgetting
☆33 · Oct 20, 2022 · Updated 3 years ago
PiotrNawrot / nanoT5
Fast & Simple repository for pre-training and fine-tuning T5-style models
☆1,021 · Aug 21, 2024 · Updated last year
facebookresearch / ELECTRA-Fewshot-Learning
This repository contains the code for the paper "Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models".
☆48 · Jun 7, 2022 · Updated 4 years ago
RulinShao / retrieval-scaling
Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".
☆226 · Dec 16, 2025 · Updated 7 months ago
langtech-bsc / Wikiextractor-V2
Enhanced version of Wikiextractor: a Wikipedia dumps extractor
☆30 · Sep 17, 2025 · Updated 10 months ago
CPJKU / wechsel
Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.
☆92 · Sep 12, 2024 · Updated last year
neulab / knn-transformers
PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an…
☆287 · Oct 20, 2022 · Updated 3 years ago
huggingface / setfit
Efficient few-shot learning with Sentence Transformers
☆2,775 · May 26, 2026 · Updated last month
mosaicml / examples
Fast and flexible reference benchmarks
☆466 · Mar 25, 2026 · Updated 3 months ago
huggingface / datablations
Scaling Data-Constrained Language Models
☆344 · Jun 28, 2025 · Updated last year
euclaise / supertrainer2000
☆50 · Mar 14, 2024 · Updated 2 years ago
UKPLab / on-emergence
Code and files for the paper "Are Emergent Abilities in Large Language Models just In-Context Learning?"
☆33 · Jan 9, 2025 · Updated last year