facebookresearch/cc_net

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/facebookresearch/cc_net)

facebookresearch / cc_net

Tools to download and cleanup Common Crawl data

☆1,047

Alternatives and similar repositories for cc_net

Users that are interested in cc_net are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

google-research / deduplicate-text-datasets
View on GitHub
☆1,270Jul 30, 2024Updated last year
facebookresearch / XLM
View on GitHub
PyTorch original implementation of Cross-lingual Language Model Pretraining.
☆2,923Feb 14, 2023Updated 3 years ago
facebookresearch / LASER
View on GitHub
Language-Agnostic SEntence Representations
☆3,661May 2, 2024Updated 2 years ago
ChenghaoMou / text-dedup
View on GitHub
All-in-one text de-duplication
☆764Mar 9, 2026Updated 4 months ago
bigscience-workshop / data-preparation
View on GitHub
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆318Mar 20, 2023Updated 3 years ago
AI Agents on DigitalOcean Gradient AI Platform • Ad
Build production-ready AI agents using customizable tools or access multiple LLMs through a single endpoint. Create custom knowledge bases or connect external data.
togethercomputer / RedPajama-Data
View on GitHub
The RedPajama-Data repository contains code for preparing large datasets for training large language models.
☆4,970Jun 3, 2026Updated last month
google / sentencepiece
View on GitHub
Unsupervised text tokenizer for Neural Network-based text generation.
☆11,975Updated this week
google-research / text-to-text-transfer-transformer
View on GitHub
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
☆6,536Jul 8, 2026Updated 2 weeks ago
google-research / xtreme
View on GitHub
XTREME is a benchmark for the evaluation of the cross-lingual generalization ability of pre-trained multilingual models that covers 40 ty…
☆651Jan 4, 2023Updated 3 years ago
oscar-project / goclassy
View on GitHub
An asynchronous concurrent pipeline for classifying Common Crawl based on fastText's pipeline.
☆86Apr 21, 2021Updated 5 years ago
facebookresearch / SentAugment
View on GitHub
SentAugment is a data augmentation technique for NLP that retrieves similar sentences from a large bank of sentences. It can be used in c…
☆359Feb 22, 2022Updated 4 years ago
kpu / kenlm
View on GitHub
KenLM: Faster and Smaller Language Model Queries
☆2,792Mar 30, 2025Updated last year
huggingface / datatrove
View on GitHub
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,221Updated this week
NVIDIA / Megatron-LM
View on GitHub
Ongoing research training transformer models at scale
☆17,181Updated this week
Open source password manager - Proton Pass • Ad
Securely store, share, and autofill your credentials with Proton Pass, the end-to-end encrypted password manager trusted by millions.
oscar-project / ungoliant
View on GitHub
The pipeline for the OSCAR corpus
☆178Nov 9, 2025Updated 8 months ago
WikiExtractor / wikiextractor
View on GitHub
A tool for extracting plain text from Wikipedia dumps
☆3,996Updated this week
google-research / electra
View on GitHub
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
☆2,367Mar 23, 2024Updated 2 years ago
EleutherAI / the-pile
View on GitHub
☆1,670Apr 27, 2023Updated 3 years ago
google-research / bleurt
View on GitHub
BLEURT is a metric for Natural Language Generation based on transfer learning.
☆794Aug 4, 2023Updated 2 years ago
google-research / FLAN
View on GitHub
☆1,565Jul 2, 2026Updated 3 weeks ago
facebookresearch / fairseq
View on GitHub
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
☆32,251Sep 30, 2025Updated 9 months ago
google-research / multilingual-t5
View on GitHub
☆1,294Dec 15, 2022Updated 3 years ago
microsoft / DeBERTa
View on GitHub
The implementation of DeBERTa
☆2,241Sep 29, 2023Updated 2 years ago
Serverless GPU API endpoints on Runpod - Get Bonus Credits • Ad
Skip the infrastructure headaches. Auto-scaling, pay-as-you-go, no-ops approach lets you focus on innovating your application.
commoncrawl / cc-pyspark
View on GitHub
Process Common Crawl data with Python and Spark
☆457Mar 26, 2026Updated 3 months ago
rom1504 / cc2dataset
View on GitHub
Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
☆321Dec 9, 2023Updated 2 years ago
facebookresearch / MLQA
View on GitHub
New dataset
☆311Aug 31, 2021Updated 4 years ago
glample / fastBPE
View on GitHub
Fast BPE
☆677Jun 18, 2024Updated 2 years ago
facebookresearch / fairscale
View on GitHub
PyTorch extensions for high performance and large scale training.
☆3,411Apr 26, 2025Updated last year
bigscience-workshop / Megatron-DeepSpeed
View on GitHub
Ongoing research training transformer language models at scale, including: BERT & GPT-2
☆1,448Mar 20, 2024Updated 2 years ago
CarperAI / trlx
View on GitHub
A repo for distributed training of language models with Reinforcement Learning via Human Feedback (RLHF)
☆4,753Jan 8, 2024Updated 2 years ago
facebookresearch / KILT
View on GitHub
Library for Knowledge Intensive Language Tasks
☆978Mar 31, 2022Updated 4 years ago
anthropics / hh-rlhf
View on GitHub
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback"
☆1,852Jun 17, 2025Updated last year
Virtual machines for every use case on DigitalOcean • Ad
Get dependable uptime with 99.99% SLA, simple security tools, and predictable monthly pricing with DigitalOcean's virtual machines, called Droplets.
allenai / longformer
View on GitHub
Longformer: The Long-Document Transformer
☆2,201Feb 8, 2023Updated 3 years ago
facebookresearch / metaseq
View on GitHub
Repo for external large-scale work
☆6,550Apr 27, 2024Updated 2 years ago
huggingface / trl
View on GitHub
Train transformer language models with reinforcement learning.
☆18,913Updated this week
facebookresearch / unlikelihood_training
View on GitHub
Neural Text Generation with Unlikelihood Training
☆311Aug 31, 2021Updated 4 years ago
facebookresearch / flores
View on GitHub
Facebook Low Resource (FLoRes) MT Benchmark
☆771Nov 20, 2023Updated 2 years ago
alexa / bort
View on GitHub
Repository for the paper "Optimal Subarchitecture Extraction for BERT"
☆470Jun 22, 2022Updated 4 years ago
facebookresearch / DPR
View on GitHub
Dense Passage Retriever - is a set of tools and models for open domain Q&A task.
☆1,869Apr 6, 2023Updated 3 years ago