shjwudp / c4-dataset-script

Inspired by Google's C4, this is a series of Colossal Clean data cleaning scripts focused on CommonCrawl data processing, including Chinese data processing and the cleaning methods from MassiveText.

☆131

Alternatives and similar repositories for c4-dataset-script

Users who are interested in c4-dataset-script are comparing it to the repositories listed below

bigscience-workshop / data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆317 Updated 2 years ago
AI21Labs / Parallel-Context-Windows
☆105 Updated 2 years ago
LAION-AI / Open-Instruction-Generalist
Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks
☆209 Updated last year
raunak-agarwal / instruction-datasets
Datasets for Instruction Tuning of Large Language Models
☆259 Updated 2 years ago
dropreg / efficient_alpaca
The aim of this repository is to utilize LLaMA to reproduce and enhance the Stanford Alpaca
☆98 Updated 2 years ago
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆266 Updated last year
gpt4life / alpagasus
Unofficial implementation of AlpaGasus
☆93 Updated 2 years ago
thu-coai / PICL
Code for ACL2023 paper: Pre-Training to Learn in Context
☆106 Updated last year
keirp / OpenWebMath
☆166 Updated last year
gmftbyGMFTBY / Copyisallyouneed
[ICLR 2023] Codebase for Copy-Generator model, including an implementation of kNN-LM
☆189 Updated 10 months ago
THUDM / LongAlign
[EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs
☆256 Updated 11 months ago
THUDM / icetk
A unified tokenization tool for Images, Chinese and English.
☆153 Updated 2 years ago
HuangLK / transpeeder
train llama on a single A100 80G node using 🤗 transformers and 🚀 Deepspeed Pipeline Parallelism
☆225 Updated 2 years ago
icip-cas / ChatAlpaca
A Multi-Turn Dialogue Corpus based on Alpaca Instructions
☆177 Updated 2 years ago
OpenLMLab / LEval
[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark
☆391 Updated last year
OpenBMB / UltraFeedback
A large-scale, fine-grained, diverse preference dataset (and models).
☆356 Updated last year
bigai-nlco / LooGLE
ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models
☆192 Updated last year
Dahoas / reward-modeling
☆98 Updated 2 years ago
FreedomIntelligence / MultilingualSIFT
MultilingualSIFT: Multilingual Supervised Instruction Fine-tuning
☆94 Updated 2 years ago
Spico197 / Humpback
🐋 An unofficial implementation of Self-Alignment with Instruction Backtranslation.
☆138 Updated 7 months ago
liutiedong / goat
a Fine-tuned LLaMA that is Good at Arithmetic Tasks
☆178 Updated 2 years ago
zsc / llama_infer
Inference script for Meta's LLaMA models using Hugging Face wrapper
☆110 Updated 2 years ago
NormXU / Consistent-DynamicNTKRoPE
An Experiment on Dynamic NTK Scaling RoPE
☆64 Updated 2 years ago
yegcjs / mixinglaws
☆108 Updated 4 months ago
facebookresearch / SemDeDup
Code for "SemDeDup", a simple method for identifying and removing semantic duplicates from a dataset (data pairs which are semantically s…
☆147 Updated 2 years ago
OFA-Sys / InsTag
InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning
☆284 Updated 2 years ago
OpenLMLab / scaling-rope
code for Scaling Laws of RoPE-based Extrapolation
☆73 Updated 2 years ago
OpenBMB / InfiniteBench
Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718
☆358 Updated last year
FranxYao / Long-Context-Data-Engineering
Implementation of paper Data Engineering for Scaling Language Models to 128K Context
☆478 Updated last year
sangmichaelxie / doremi
Pytorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets
☆347 Updated last year