yet-another-account/openwebtext

An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.

☆392

Alternatives and similar repositories for openwebtext

Users who are interested in openwebtext are comparing it to the repositories listed below.

jcpeterson / openwebtext
Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.
☆766 · Dec 8, 2022 · Updated 3 years ago
graykode / gpt-2-Pytorch
Simple Text-Generator with OpenAI gpt-2 Pytorch Implementation
☆1,013 · Jul 8, 2019 · Updated 7 years ago
rowanz / grover
Code for Defending Against Neural Fake News, https://rowanzellers.com/grover/
☆917 · May 22, 2023 · Updated 3 years ago
nshepperd / gpt-2
Code for the paper "Language Models are Unsupervised Multitask Learners"
☆1,144 · Oct 31, 2022 · Updated 3 years ago
openai / gpt-2-output-dataset
Dataset of GPT-2 outputs for research in detection, biases, and more
☆2,026 · Dec 13, 2023 · Updated 2 years ago
Kyubyong / lm_finetuning
Language Model Fine-tuning for Moby Dick
☆42 · Mar 3, 2019 · Updated 7 years ago
shevisj / gpt-2_bot
This is a reddit bot based on OpenAI's GPT-2 117M model
☆99 · Aug 27, 2019 · Updated 6 years ago
facebookresearch / unlikelihood_training
Neural Text Generation with Unlikelihood Training
☆311 · Aug 31, 2021 · Updated 4 years ago
openai / gpt-2
Code for the paper "Language Models are Unsupervised Multitask Learners"
☆25,010 · Aug 14, 2024 · Updated last year
huggingface / pytorch-openai-transformer-lm
🐥 A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI
☆1,520 · Aug 9, 2021 · Updated 4 years ago
salesforce / ctrl
Conditional Transformer Language Model for Controllable Generation
☆1,881 · May 1, 2025 · Updated last year
monologg / korean-hate-speech-koelectra
Bias, Hate classification with KoELECTRA 👿
☆27 · Jun 12, 2023 · Updated 3 years ago
NVIDIA / Megatron-LM
Ongoing research training transformer models at scale
☆17,108 · Updated this week
minimaxir / gpt-2-simple
Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts
☆3,399 · Dec 14, 2022 · Updated 3 years ago
openai / finetune-transformer-lm
Code and model for the paper "Improving Language Understanding by Generative Pre-Training"
☆2,306 · Jan 25, 2019 · Updated 7 years ago
huggingface / transfer-learning-conv-ai
🦄 State-of-the-Art Conversational AI with Transfer Learning
☆1,755 · Jun 12, 2023 · Updated 3 years ago
facebookresearch / XLM
PyTorch original implementation of Cross-lingual Language Model Pretraining.
☆2,927 · Feb 14, 2023 · Updated 3 years ago
WikiExtractor / wikiextractor
A tool for extracting plain text from Wikipedia dumps
☆3,997 · Updated this week
nelson-liu / contextual-repr-analysis
A toolkit for evaluating the linguistic knowledge and transferability of contextual representations. Code for "Linguistic Knowledge and T…
☆212 · Oct 20, 2021 · Updated 4 years ago
glample / fastBPE
Fast BPE
☆677 · Jun 18, 2024 · Updated 2 years ago
akanyaani / gpt-2-tensorflow2.0
OpenAI GPT2 pre-training and sequence prediction implementation in Tensorflow 2.0
☆265 · Mar 25, 2023 · Updated 3 years ago
zihangdai / xlnet
XLNet: Generalized Autoregressive Pretraining for Language Understanding
☆6,180 · May 28, 2023 · Updated 3 years ago
nyu-dl / bert-gen
☆323 · Dec 16, 2022 · Updated 3 years ago
EleutherAI / openwebtext2
☆94 · Jul 16, 2022 · Updated 4 years ago
asyml / texar
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://…
☆2,390 · Aug 26, 2021 · Updated 4 years ago
openai / sparse_attention
Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"
☆1,615 · Aug 12, 2020 · Updated 5 years ago
allenai / tpu_pretrain
LM Pretraining with PyTorch/TPU
☆137 · Oct 24, 2019 · Updated 6 years ago
soskek / bookcorpus
Crawl BookCorpus
☆863 · Jul 14, 2023 · Updated 3 years ago
microsoft / DialoGPT
Large-scale pretraining for dialogue
☆2,422 · Oct 17, 2022 · Updated 3 years ago
lucidrains / CLAP
Contrastive Language-Audio Pretraining
☆15 · May 18, 2021 · Updated 5 years ago
google-deepmind / tvt
☆48 · Nov 22, 2019 · Updated 6 years ago
Skylion007 / OpenWebTextCorpus
☆22 · Jun 23, 2026 · Updated 3 weeks ago
tacchinotacchi / distil-bilstm
Scripts to train a bidirectional LSTM with knowledge distillation from BERT
☆159 · Nov 21, 2019 · Updated 6 years ago
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,045 · Apr 25, 2023 · Updated 3 years ago
chiphuyen / lazynlp
Library to scrape and clean web pages to create massive datasets.
☆2,270 · Nov 11, 2020 · Updated 5 years ago
google / sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
☆11,969 · Updated this week
rusiaaman / XLnet-gen
XLNet for generating language.
☆166 · Jan 30, 2021 · Updated 5 years ago
google-research / longt5
☆183 · May 26, 2023 · Updated 3 years ago
facebookresearch / ParlAI
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
☆10,630 · Nov 3, 2023 · Updated 2 years ago