EleutherAI / openwebtext2

☆92

Alternatives and similar repositories for openwebtext2

Users who are interested in openwebtext2 compare it to the repositories listed below.

leogao2 / lm_dataformat
☆78 Updated last year
EleutherAI / stackexchange-dataset
Python tools for processing the stackexchange data dumps into a text dataset for Language Models
☆85 Updated last year
salesforce / TaiChi
Open source library for few shot NLP
☆78 Updated 2 years ago
bigscience-workshop / data_tooling
Tools for managing datasets for governance and training.
☆87 Updated last week
huggingface / olm-datasets
Pipeline for pulling and processing online language model pretraining data from the web
☆178 Updated 2 years ago
microsoft / xtreme-distil-transformers
XtremeDistil framework for distilling/compressing massive multilingual neural network models to tiny and efficient models for AI at scale
☆157 Updated last year
AI21Labs / lm-evaluation
Evaluation suite for large-scale language models.
☆128 Updated 4 years ago
oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆175 Updated 3 weeks ago
huggingface / olm-training
Repo for training MLMs, CLMs, or T5-type models on the OLM pretraining data, but it should work with any hugging face text dataset.
☆96 Updated 2 years ago
bloomberg / minilmv2.bb
Our open source implementation of MiniLMv2 (https://aclanthology.org/2021.findings-acl.188)
☆61 Updated 2 years ago
Rallio67 / language-model-agents
Experiments with generating opensource language model assistants
☆97 Updated 2 years ago
google-research / t5x_retrieval
☆101 Updated 2 years ago
huggingface / tune
☆87 Updated 3 years ago
google-research / longt5
☆184 Updated 2 years ago
EleutherAI / lm_perplexity
☆160 Updated 4 years ago
shayne-longpre / a-pretrainers-guide
☆72 Updated 2 years ago
zphang / minimal-opt
☆67 Updated 3 years ago
CarperAI / InstructGPT
For experiments involving instruct gpt. Currently used for documenting open research questions.
☆71 Updated 3 years ago
stas00 / porting
Helper scripts and notes that were used while porting various nlp models
☆48 Updated 3 years ago
allenai / bff
☆38 Updated last year
lucidrains / marge-pytorch
Implementation of Marge, Pre-training via Paraphrasing, in Pytorch
☆76 Updated 4 years ago
cimeister / typical-sampling
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
☆81 Updated 3 years ago
google-research-datasets / seahorse
Seahorse is a dataset for multilingual, multi-faceted summarization evaluation. It consists of 96K summaries with human ratings along 6 q…
☆89 Updated last year
sunyt32 / torchscale
Transformers at any scale
☆42 Updated last year
google-research / dialog-inpainting
☆97 Updated 3 years ago
gsarti / t5-flax-gcp
Tutorial to pretrain & fine-tune a 🤗 Flax T5 model on a TPUv3-8 with GCP
☆58 Updated 3 years ago
martiansideofthemoon / rankgen
Official code and model checkpoints for our EMNLP 2022 paper "RankGen - Improving Text Generation with Large Ranking Models" (https://arx…
☆138 Updated 2 years ago
inspired-cognition / critique-apps
Apps built using Inspired Cognition's Critique.
☆57 Updated 2 years ago
huggingface / transformers_bloom_parallel
Techniques used to run BLOOM at inference in parallel
☆37 Updated 3 years ago
google-research-datasets / presto
A Multilingual Dataset for Parsing Realistic Task-Oriented Dialogs
☆115 Updated 2 years ago