jcpeterson/openwebtext

Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.

☆766

Alternatives and similar repositories for openwebtext

Users who are interested in openwebtext are comparing it to the repositories listed below.

yet-another-account / openwebtext
An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.
☆392 · Mar 26, 2024 · Updated 2 years ago
openai / gpt-2-output-dataset
Dataset of GPT-2 outputs for research in detection, biases, and more
☆2,027 · Dec 13, 2023 · Updated 2 years ago
google / sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
☆11,996 · Updated this week
openai / gpt-2
Code for the paper "Language Models are Unsupervised Multitask Learners"
☆25,026 · Aug 14, 2024 · Updated last year
facebookresearch / XLM
PyTorch original implementation of Cross-lingual Language Model Pretraining.
☆2,923 · Feb 14, 2023 · Updated 3 years ago
salesforce / ctrl
Conditional Transformer Language Model for Controllable Generation
☆1,880 · May 1, 2025 · Updated last year
EleutherAI / the-pile
☆1,670 · Apr 27, 2023 · Updated 3 years ago
rowanz / grover
Code for Defending Against Neural Fake News, https://rowanzellers.com/grover/
☆916 · May 22, 2023 · Updated 3 years ago
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,047 · Apr 25, 2023 · Updated 3 years ago
soskek / bookcorpus
Crawl BookCorpus
☆865 · Jul 14, 2023 · Updated 3 years ago
facebookresearch / LASER
Language-Agnostic SEntence Representations
☆3,662 · May 2, 2024 · Updated 2 years ago
minimaxir / gpt-2-simple
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts
☆3,400 · Dec 14, 2022 · Updated 3 years ago
nshepperd / gpt-2
Code for the paper "Language Models are Unsupervised Multitask Learners"
☆1,144 · Oct 31, 2022 · Updated 3 years ago
NVIDIA / Megatron-LM
Ongoing research training transformer models at scale
☆17,265 · Updated this week
google-research / text-to-text-transfer-transformer
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
☆6,541 · Jul 8, 2026 · Updated 3 weeks ago
nyu-mll / jiant
jiant is an nlp toolkit
☆1,675 · Jul 6, 2023 · Updated 3 years ago
openai / sparse_attention
Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"
☆1,614 · Aug 12, 2020 · Updated 5 years ago
nyu-dl / bert-gen
☆323 · Dec 16, 2022 · Updated 3 years ago
huggingface / neuralcoref
✨Fast Coreference Resolution in spaCy with Neural Networks
☆2,893 · Apr 13, 2023 · Updated 3 years ago
VKCOM / YouTokenToMe
Unsupervised text tokenizer focused on computational efficiency
☆980 · Mar 29, 2024 · Updated 2 years ago
google-research / deduplicate-text-datasets
☆1,269 · Jul 30, 2024 · Updated 2 years ago
google-research / t5x
☆2,978 · Jul 9, 2026 · Updated 3 weeks ago
huggingface / olm-datasets
Pipeline for pulling and processing online language model pretraining data from the web
☆179 · Jul 31, 2023 · Updated 3 years ago
huggingface / hmtl
🌊HMTL: Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP
☆1,195 · Aug 1, 2023 · Updated 2 years ago
openai / finetune-transformer-lm
Code and model for the paper "Improving Language Understanding by Generative Pre-Training"
☆2,310 · Jan 25, 2019 · Updated 7 years ago
facebookresearch / fairseq
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
☆32,248 · Sep 30, 2025 · Updated 10 months ago
facebookresearch / ParlAI
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
☆10,626 · Nov 3, 2023 · Updated 2 years ago
commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Jul 21, 2026 · Updated last week
EleutherAI / openwebtext2
☆94 · Jul 16, 2022 · Updated 4 years ago
minimaxir / ctrl-gce
Set up the CTRL text-generating model on Google Compute Engine with just a few console commands.
☆151 · Oct 26, 2019 · Updated 6 years ago
namisan / mt-dnn
Multi-Task Deep Neural Networks for Natural Language Understanding
☆2,260 · Mar 7, 2024 · Updated 2 years ago
facebookresearch / ELI5
Scripts and links to recreate the ELI5 dataset.
☆324 · Aug 31, 2021 · Updated 4 years ago
facebookresearch / unlikelihood_training
Neural Text Generation with Unlikelihood Training
☆311 · Aug 31, 2021 · Updated 4 years ago
EleutherAI / gpt-neox
An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
☆7,448 · Jun 11, 2026 · Updated last month
facebookresearch / KILT
Library for Knowledge Intensive Language Tasks
☆979 · Mar 31, 2022 · Updated 4 years ago
facebookresearch / fairscale
PyTorch extensions for high performance and large scale training.
☆3,411 · Apr 26, 2025 · Updated last year
google-research / xtreme
XTREME is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models that covers 40 ty…
☆651 · Jan 4, 2023 · Updated 3 years ago
zihangdai / xlnet
XLNet: Generalized Autoregressive Pretraining for Language Understanding
☆6,185 · May 28, 2023 · Updated 3 years ago
google / BIG-bench
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
☆3,249 · Jul 19, 2024 · Updated 2 years ago