EleutherAI / openwebtext2
☆86 Updated 2 years ago
Related projects
Alternatives and complementary repositories for openwebtext2
- ☆76 Updated 11 months ago
- Python tools for processing the stackexchange data dumps into a text dataset for Language Models ☆76 Updated 11 months ago
- ☆97 Updated 2 years ago
- Open source library for few shot NLP ☆77 Updated last year
- Tools for managing datasets for governance and training. ☆77 Updated last week
- Implementation of Marge, Pre-training via Paraphrasing, in Pytorch ☆75 Updated 3 years ago
- Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset. ☆92 Updated last year
- Experiments with generating opensource language model assistants ☆97 Updated last year
- Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188) ☆60 Updated last year
- XtremeDistil framework for distilling/compressing massive multilingual neural network models to tiny and efficient models for AI at scale ☆153 Updated 10 months ago
- Pipeline for pulling and processing online language model pretraining data from the web ☆174 Updated last year
- ☆95 Updated last year
- Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP ☆58 Updated 2 years ago
- The pipeline for the OSCAR corpus ☆162 Updated 10 months ago
- ☆31 Updated last year
- Code for the paper-"Mirostat: A Perplexity-Controlled Neural Text Decoding Algorithm" (https://arxiv.org/abs/2007.14966). ☆57 Updated 2 years ago
- Adversarial Training and SFT for Bot Safety Models ☆39 Updated last year
- Techniques used to run BLOOM at inference in parallel ☆37 Updated 2 years ago
- ☆179 Updated last year
- Official code and model checkpoints for our EMNLP 2022 paper "RankGen - Improving Text Generation with Large Ranking Models" (https://arx… ☆136 Updated last year
- ☆67 Updated 2 years ago
- Evaluation suite for large-scale language models. ☆123 Updated 3 years ago
- ☆73 Updated last year
- A file utility for accessing both local and remote files through a unified interface. ☆35 Updated 3 months ago
- Source code for the GPT-2 story generation models in the EMNLP 2020 paper "STORIUM: A Dataset and Evaluation Platform for Human-in-the-Lo… ☆38 Updated 9 months ago
- ☆86 Updated 2 years ago
- URL downloader supporting checkpointing and continuous checksumming. ☆19 Updated 11 months ago
- A library for squeakily cleaning and filtering language datasets. ☆45 Updated last year
- ☆46 Updated last month
- ☆147 Updated 3 years ago