Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data; it should also work with any Hugging Face text dataset.
☆98Feb 9, 2023Updated 3 years ago
Alternatives and similar repositories for olm-training
Users that are interested in olm-training are comparing it to the libraries listed below.
- Pipeline for pulling and processing online language model pretraining data from the web☆179Jul 31, 2023Updated 2 years ago
- LTG-Bert☆34Jan 8, 2024Updated 2 years ago
- A tiny BERT for low-resource monolingual models☆32Dec 24, 2025Updated 6 months ago
- Repo for ICML23 "Why do Nearest Neighbor Language Models Work?"☆59Jan 12, 2023Updated 3 years ago
- Experiments for XLM-V Transformers Integration☆13Feb 8, 2023Updated 3 years ago
- **ARCHIVED** Filesystem interface to 🤗 Hub☆60Apr 6, 2023Updated 3 years ago
- [ICML 2023] Exploring the Benefits of Training Expert Language Models over Instruction Tuning☆99Apr 26, 2023Updated 3 years ago
- decontamination☆36Mar 4, 2026Updated 3 months ago
- Experiments for efforts to train a new and improved t5☆76Apr 15, 2024Updated 2 years ago
- ☆16Mar 3, 2024Updated 2 years ago
- An open collection of implementation tips, tricks and resources for training large language models☆502Mar 8, 2023Updated 3 years ago
- Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…☆29Apr 17, 2024Updated 2 years ago
- Plug-and-play Search Interfaces with Pyserini and Hugging Face☆31Aug 5, 2023Updated 2 years ago
- LARCH: Large Language Model-based Automatic Readme Creation with Heuristics☆17Jul 1, 2023Updated 3 years ago
- Code repo for "Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers" (ACL 2023)☆22Nov 1, 2023Updated 2 years ago
- Local emulator for Hugging Face Inference Endpoints custom handlers☆26May 28, 2026Updated last month
- Repository containing the SPIN experiments on the DIBT 10k ranked prompts☆24Mar 12, 2024Updated 2 years ago
- ☆30Sep 27, 2021Updated 4 years ago
- A Streamlit app to add structured tags to a dataset card☆23Jun 30, 2022Updated 4 years ago
- Binary Passage Retriever (BPR) - an efficient passage retriever for open-domain question answering☆175Jun 6, 2021Updated 5 years ago
- [ACL'24 Oral] Analysing The Impact of Sequence Composition on Language Model Pre-Training☆23Aug 18, 2024Updated last year
- A repository to get acquainted with basic training tasks in natural language processing and machine learning☆11Dec 27, 2023Updated 2 years ago
- ☆13Mar 27, 2020Updated 6 years ago
- Long-context pretrained encoder-decoder models☆97Oct 28, 2022Updated 3 years ago
- ☆102Dec 17, 2022Updated 3 years ago
- A package for fine tuning of pretrained NLP transformers using Semi Supervised Learning☆14Oct 27, 2021Updated 4 years ago
- Staged Training for Transformer Language Models☆33Mar 31, 2022Updated 4 years ago
- Official code and model checkpoints for our EMNLP 2022 paper "RankGen - Improving Text Generation with Large Ranking Models" (https://arx…☆139Aug 2, 2023Updated 2 years ago
- Hugging Face and Pyserini interoperability☆20May 18, 2023Updated 3 years ago
- Adding new tasks to T0 without catastrophic forgetting☆33Oct 20, 2022Updated 3 years ago
- Fast & Simple repository for pre-training and fine-tuning T5-style models☆1,022Aug 21, 2024Updated last year
- This repository contains the code for the paper "Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models".☆48Jun 7, 2022Updated 4 years ago
- Code for the RoBERTa version of RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder☆10Mar 16, 2023Updated 3 years ago
- Code for WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models.☆90Sep 12, 2024Updated last year
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore".☆226Dec 16, 2025Updated 6 months ago
- PyTorch + HuggingFace code for RetoMaton: "Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval" (ICML 2022), including an…☆287Oct 20, 2022Updated 3 years ago
- Efficient few-shot learning with Sentence Transformers☆2,755May 26, 2026Updated last month
- Scaling Data-Constrained Language Models☆342Jun 28, 2025Updated last year
- Code and files for the paper "Are Emergent Abilities in Large Language Models just In-Context Learning?"☆33Jan 9, 2025Updated last year