Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.
☆756 · Dec 8, 2022 · Updated 3 years ago
Alternatives and similar repositories for openwebtext
Users who are interested in openwebtext are comparing it to the libraries listed below
- An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP. ☆391 · Mar 26, 2024 · Updated last year
- Dataset of GPT-2 outputs for research in detection, biases, and more ☆2,016 · Dec 13, 2023 · Updated 2 years ago
- Code for the paper "Language Models are Unsupervised Multitask Learners" ☆24,648 · Aug 14, 2024 · Updated last year
- Unsupervised text tokenizer for Neural Network-based text generation. ☆11,668 · Feb 22, 2026 · Updated last week
- PyTorch original implementation of Cross-lingual Language Model Pretraining. ☆2,924 · Feb 14, 2023 · Updated 3 years ago
- Conditional Transformer Language Model for Controllable Generation ☆1,884 · May 1, 2025 · Updated 10 months ago
- Tools to download and cleanup Common Crawl data ☆1,039 · Apr 25, 2023 · Updated 2 years ago
- Code for Defending Against Neural Fake News, https://rowanzellers.com/grover/ ☆919 · May 22, 2023 · Updated 2 years ago
- ☆1,636 · Apr 27, 2023 · Updated 2 years ago
- jiant is an nlp toolkit ☆1,674 · Jul 6, 2023 · Updated 2 years ago
- Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers" ☆1,610 · Aug 12, 2020 · Updated 5 years ago
- Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" ☆6,490 · Jan 14, 2026 · Updated last month
- ✨Fast Coreference Resolution in spaCy with Neural Networks ☆2,892 · Apr 13, 2023 · Updated 2 years ago
- Ongoing research training transformer models at scale