soskek/bookcorpus


soskek / bookcorpus

Crawl BookCorpus

☆863

Alternatives and similar repositories for bookcorpus

Users who are interested in bookcorpus are comparing it to the repositories listed below.


sgraaf / Replicate-Toronto-BookCorpus
This repository contains code to replicate the no-longer publicly available Toronto BookCorpus dataset
☆49 · Apr 6, 2022 · Updated 4 years ago
EleutherAI / the-pile
☆1,670 · Apr 27, 2023 · Updated 3 years ago
WikiExtractor / wikiextractor
A tool for extracting plain text from Wikipedia dumps
☆3,998 · Updated this week
nyu-mll / jiant
jiant is an NLP toolkit
☆1,675 · Jul 6, 2023 · Updated 3 years ago
facebookresearch / SentAugment
SentAugment is a data augmentation technique for NLP that retrieves similar sentences from a large bank of sentences. It can be used in c…
☆359 · Feb 22, 2022 · Updated 4 years ago
rsennrich / subword-nmt
Unsupervised Word Segmentation for Neural Machine Translation and Text Generation
☆2,271 · Aug 7, 2024 · Updated last year
facebookresearch / XLM
PyTorch original implementation of Cross-lingual Language Model Pretraining.
☆2,926 · Feb 14, 2023 · Updated 3 years ago
hplt-project / sacremoses
Python port of Moses tokenizer, truecaser and normalizer
☆497 · Feb 6, 2026 · Updated 5 months ago
allenai / allennlp
An open-source NLP research library, built on PyTorch.
☆11,889 · Nov 22, 2022 · Updated 3 years ago
harvardnlp / urnng
☆179 · Jul 31, 2020 · Updated 5 years ago
facebookresearch / fairseq
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
☆32,245 · Sep 30, 2025 · Updated 9 months ago
facebookresearch / SentEval
A Python tool for evaluating the quality of sentence embeddings.
☆2,110 · Mar 19, 2024 · Updated 2 years ago
facebookresearch / cc_net
Tools to download and clean up Common Crawl data
☆1,046 · Apr 25, 2023 · Updated 3 years ago
facebookresearch / unlikelihood_training
Neural Text Generation with Unlikelihood Training
☆311 · Aug 31, 2021 · Updated 4 years ago
sheng-z / JOCI
Ordinal Common-sense Inference
☆27 · May 15, 2018 · Updated 8 years ago
jcpeterson / openwebtext
Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.
☆766 · Dec 8, 2022 · Updated 3 years ago
openai / finetune-transformer-lm
Code and model for the paper "Improving Language Understanding by Generative Pre-Training"
☆2,306 · Jan 25, 2019 · Updated 7 years ago
google-deepmind / pg19
☆260 · Feb 25, 2020 · Updated 6 years ago
zihangdai / xlnet
XLNet: Generalized Autoregressive Pretraining for Language Understanding
☆6,180 · May 28, 2023 · Updated 3 years ago
microsoft / MASS
MASS: Masked Sequence to Sequence Pre-training for Language Generation
☆1,117 · Nov 28, 2022 · Updated 3 years ago
google-research / text-to-text-transfer-transformer
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
☆6,538 · Jul 8, 2026 · Updated last week
facebookresearch / LAMA
LAnguage Model Analysis
☆1,391 · Jul 7, 2024 · Updated 2 years ago
salesforce / awd-lstm-lm
LSTM and QRNN Language Model Toolkit for PyTorch
☆1,990 · Feb 12, 2022 · Updated 4 years ago
glample / fastBPE
Fast BPE
☆677 · Jun 18, 2024 · Updated 2 years ago
nelson-liu / contextual-repr-analysis
A toolkit for evaluating the linguistic knowledge and transferability of contextual representations. Code for "Linguistic Knowledge and T…
☆212 · Oct 20, 2021 · Updated 4 years ago
tomohideshibata / BERT-related-papers
BERT-related papers
☆2,034 · Aug 12, 2023 · Updated 2 years ago
harvardnlp / pytorch-struct
Fast, general, and tested differentiable structured prediction in PyTorch
☆1,132 · Apr 20, 2022 · Updated 4 years ago
google / sentencepiece
Unsupervised text tokenizer for Neural Network-based text generation.
☆11,971 · Updated this week
Maluuba / nlg-eval
Evaluation code for various unsupervised automated metrics for Natural Language Generation.
☆1,391 · Aug 20, 2024 · Updated last year
luheng / lsgn
Labeled Span Graph Networks
☆118 · Jun 22, 2018 · Updated 8 years ago
facebookresearch / KILT
Library for Knowledge Intensive Language Tasks
☆978 · Mar 31, 2022 · Updated 4 years ago
OpenMindClub / awesome-gpt-dev
☆13 · May 10, 2023 · Updated 3 years ago
facebookresearch / LASER
Language-Agnostic SEntence Representations
☆3,661 · May 2, 2024 · Updated 2 years ago
marcotcr / checklist
Beyond Accuracy: Behavioral Testing of NLP models with CheckList
☆2,051 · Jan 9, 2024 · Updated 2 years ago
sacmehta / delight
DeLighT: Very Deep and Light-Weight Transformers
☆469 · Oct 16, 2020 · Updated 5 years ago
yet-another-account / openwebtext
An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.
☆392 · Mar 26, 2024 · Updated 2 years ago
sebastianruder / NLP-progress
Repository to track the progress in Natural Language Processing (NLP), including the datasets and the current state-of-the-art for the mo…
☆22,957 · Jul 28, 2024 · Updated last year
facebookresearch / adaptive-span
Transformer training code for sequential tasks
☆610 · Sep 14, 2021 · Updated 4 years ago
google-research-datasets / paws
This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, an…
☆570 · Jan 4, 2022 · Updated 4 years ago