shjwudp / c4-dataset-script
Inspired by Google's C4, this is a set of data-cleaning scripts for processing CommonCrawl data. It includes the Chinese data processing and cleaning methods from MassiveText.
☆118 · Updated last year
Related projects:
- Code used for sourcing and cleaning the BigScience ROOTS corpus ☆299 · Updated last year
- ☆99 · Updated last year
- Code for the paper "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens": https://arxiv.org/abs/2402.13718 ☆244 · Updated last week
- [ACL'24 Outstanding] Data and code for L-Eval, a comprehensive evaluation benchmark for long-context language models ☆342 · Updated 2 months ago
- LongAlign: A Recipe for Long Context Alignment Encompassing Data, Training, and Evaluation ☆194 · Updated 4 months ago
- PyTorch implementation of DoReMi, a method for optimizing the data mixture weights in language modeling datasets ☆292 · Updated 8 months ago
- What's In My Big Data (WIMBD) - a toolkit for analyzing large text datasets ☆174 · Updated last week
- DSIR large-scale data selection framework for language model training ☆221 · Updated 5 months ago
- Code and data for "Scaling Relationship on Learning Mathematical Reasoning with Large Language Models" ☆208 · Updated last week
- Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks ☆204 · Updated 8 months ago
- InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning ☆196 · Updated last year
- All available datasets for Instruction Tuning of Large Language Models ☆231 · Updated 9 months ago
- ☆110 · Updated 4 months ago
- [ACL 2024] Long-Context Language Modeling with Parallel Encodings ☆133 · Updated 3 months ago
- [ACL 2024] LooGLE: Long Context Evaluation for Long-Context Language Models ☆148 · Updated 6 months ago
- [ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning ☆337 · Updated 2 months ago
- [ACL'24] Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning ☆101 · Updated last week
- [EMNLP 2023] The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning ☆201 · Updated 10 months ago
- Unofficial implementation of AlpaGasus ☆83 · Updated 11 months ago
- [EMNLP 2023] Adapting Language Models to Compress Long Contexts ☆268 · Updated last week
- YuLan-IR: Information Retrieval Boosted LMs ☆211 · Updated 6 months ago
- https://acl2023-retrieval-lm.github.io/ ☆152 · Updated 11 months ago
- Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context" ☆416 · Updated 6 months ago
- ☆87 · Updated 4 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Length (ICLR 2024) ☆195 · Updated 3 months ago
- A Survey of Attributions for Large Language Models ☆155 · Updated 3 weeks ago
- A repository sharing the literature on long-context large language models, including methodologies and evaluation benchmarks ☆239 · Updated last month
- Code and data for "Lost in the Middle: How Language Models Use Long Contexts" ☆300 · Updated 8 months ago
- Reformatted Alignment ☆111 · Updated 4 months ago
- Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024] ☆467 · Updated 4 months ago