shjwudp / c4-dataset-script
Inspired by Google's C4 (Colossal Clean Crawled Corpus), this is a series of data-cleaning scripts for CommonCrawl processing, including Chinese data processing and the cleaning methods from MassiveText.
☆127 · Updated 2 years ago
Alternatives and similar repositories for c4-dataset-script
Users interested in c4-dataset-script are comparing it to the repositories listed below.
- [ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning ☆158 · Updated 9 months ago
- Unofficial implementation of AlpaGasus ☆91 · Updated last year
- ☆105 · Updated 2 years ago
- All available datasets for Instruction Tuning of Large Language Models ☆252 · Updated last year
- DSIR large-scale data selection framework for language model training ☆251 · Updated last year
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024] ☆144 · Updated 7 months ago
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆313 · Updated 2 years ago
- ☆101 · Updated 8 months ago
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆336 · Updated 8 months ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models ☆176 · Updated last year
- [ACL 2024] Long-Context Language Modeling with Parallel Encodings ☆154 · Updated last year
- [EMNLP 2024] LongAlign: A Recipe for Long Context Alignment of LLMs ☆250 · Updated 6 months ago
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆184 · Updated 8 months ago
- Code for ACL 2023 paper: Pre-Training to Learn in Context ☆108 · Updated 10 months ago
- [ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ☆96 · Updated 10 months ago
- An Experiment on Dynamic NTK Scaling RoPE ☆64 · Updated last year
- Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks ☆208 · Updated last year
- [ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long-context language model evaluation benchmark ☆379 · Updated 11 months ago
- ☆155 · Updated last year
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Length (ICLR 2024) ☆203 · Updated last year
- This project studies the performance and robustness of language models and task-adaptation methods. ☆149 · Updated last year
- [ACL 2024] FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models ☆101 · Updated last week
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition (NeurIPS 2024 Datasets and Benchmarks Track) ☆85 · Updated 4 months ago
- InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning ☆261 · Updated last year
- Code for Scaling Laws of RoPE-based Extrapolation ☆73 · Updated last year
- A large-scale, fine-grained, diverse preference dataset (and models). ☆341 · Updated last year
- MultilingualSIFT: Multilingual Supervised Instruction Fine-tuning ☆90 · Updated last year
- Official repository of NEFTune: Noisy Embeddings Improve Instruction Finetuning ☆396 · Updated last year
- A Multi-Turn Dialogue Corpus based on Alpaca Instructions ☆171 · Updated 2 years ago
- Official implementation of ACL 2025 Findings paper "Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Text…" ☆82 · Updated last month