EleutherAI / pile_dedupe

Pile Deduplication Code

☆19

Alternatives and similar repositories for pile_dedupe

Users who are interested in pile_dedupe are comparing it to the repositories listed below.

McGill-NLP / retriever-lm-reasoning
Code for "Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model", EMNLP Findings 20…
☆28 Updated last year
nkandpa2 / long_tail_knowledge
Repo for the paper "Large Language Models Struggle to Learn Long-Tail Knowledge"
☆78 Updated 2 years ago
HazyResearch / skill-it
Skill-It! A Data-Driven Skills Framework for Understanding and Training Language Models
☆47 Updated 2 years ago
yuzhaouoe / pretraining-data-packing
[ACL'24 Oral] Analysing The Impact of Sequence Composition on Language Model Pre-Training
☆22 Updated last year
princeton-nlp / LM-Kernel-FT
A Kernel-Based View of Language Model Fine-Tuning https://arxiv.org/abs/2210.05643
☆78 Updated 2 years ago
allenai / easy-to-hard-generalization
Code for the arXiv preprint "The Unreasonable Effectiveness of Easy Training Data"
☆48 Updated last year
jzbjyb / ReAtt
Retrieval as Attention
☆82 Updated 2 years ago
bloomberg / MixCE-acl2023
Implementation of MixCE method described in ACL 2023 paper by Zhang et al.
☆20 Updated 2 years ago
swj0419 / in-context-pretraining
☆54 Updated last year
gmftbyGMFTBY / Rep-Dropout
[NeurIPS 2023] Repetition In Repetition Out: Towards Understanding Neural Text Degeneration from the Data Perspective
☆37 Updated 2 years ago
locuslab / scaling_laws_data_filtering
☆65 Updated last year
davidbrandfonbrener / color-filter-olmo
☆13 Updated this week
RUCAIBox / BAMBOO
☆35 Updated last year
thu-coai / PICL
Code for ACL2023 paper: Pre-Training to Learn in Context
☆107 Updated last year
casmlab / NPHardEval
Repository for NPHardEval, a quantified-dynamic benchmark of LLMs
☆59 Updated last year
wwxu21 / CUT
Source code of "Reasons to Reject? Aligning Language Models with Judgments"
☆58 Updated last year
csitfun / LogiCoT
the instructions and demonstrations for building a formal logical reasoning capable GLM
☆54 Updated last year
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100 Updated last year
Leooyii / LCEG
Long Context Extension and Generalization in LLMs
☆62 Updated last year
shankarp8 / knowledge_distillation
Repository for "Propagating Knowledge Updates to LMs Through Distillation" (NeurIPS 2023).
☆26 Updated last year
Hunter-DDM / stablemoe
Code for the ACL-2022 paper "StableMoE: Stable Routing Strategy for Mixture of Experts"
☆50 Updated 3 years ago
LuLuLuyi / LongHeads
[EMNLP'24] LongHeads: Multi-Head Attention is Secretly a Long Context Processor
☆31 Updated last year
princeton-pli / MeCo
Code for ICML 25 paper "Metadata Conditioning Accelerates Language Model Pre-training (MeCo)"
☆46 Updated 4 months ago
r-three / RAD
Reference implementation for Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model
☆43 Updated 3 weeks ago
allenai / data-efficient-finetuning
Code for paper 'Data-Efficient FineTuning'
☆28 Updated 2 years ago
GSYfate / knnlm-limits
Official code repo for paper "Great Memory, Shallow Reasoning: Limits of kNN-LMs"
☆24 Updated 6 months ago
sunyt32 / torchscale
Transformers at any scale
☆41 Updated last year
googleinterns / localizing-paragraph-memorization
☆15 Updated last year
hkust-nlp / llm-compression-intelligence
Official github repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
☆142 Updated last year
kernelmachine / silo-lm
SILO Language Models code repository
☆83 Updated last year