VITA-Group / Data-Efficient-ScalingLinks

[ICML 2023] "Data Efficient Neural Scaling Law via Model Reusing" by Peihao Wang, Rameswar Panda, Zhangyang Wang

☆14

Alternatives and similar repositories for Data-Efficient-Scaling

Users that are interested in Data-Efficient-Scaling are comparing it to the libraries listed below

Sorting:

Doraemonzzz / xmixers
Xmixers: A collection of SOTA efficient token/channel mixers
☆29Updated last month
dangxingyu / rnn-icrag
Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
☆27Updated last year
VITA-Group / Random-MoE-as-Dropout
[ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal…
☆55Updated 2 years ago
kamanphoebe / Look-into-MoEs
[NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models
☆55Updated 8 months ago
HazyResearch / prefix-linear-attention
☆56Updated last year
qiuzh20 / gated_attention
The official implementation for [NeurIPS2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink…
☆95Updated last month
chijames / KERPLE
☆19Updated 3 years ago
sustcsonglin / mamba-triton
☆48Updated last year
Infini-AI-Lab / GRESO
☆62Updated 3 months ago
IBM / selective-dense-state-space-model
Open-sourcing code associated with the AAAI-25 paper "On the Expressiveness and Length Generalization of Selective State-Space Models on …
☆15Updated last month
DRSY / KV_Compression
[EMNLP 2023]Context Compression for Auto-regressive Transformers with Sentinel Tokens
☆25Updated last year
HKUNLP / efficient-attention
[EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling
☆87Updated 2 years ago
Shwai-He / MEO
The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts (EMNLP 2023)":
☆38Updated last year
GATECH-EIC / Linearized-LLM
[ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
☆36Updated last year
yikangshen / megablocks
☆20Updated last year
Leooyii / LCEG
Long Context Extension and Generalization in LLMs
☆62Updated last year
john-hewitt / implicit-ins
Codebase for Instruction Following without Instruction Tuning
☆36Updated last year
sail-sg / dice
Official implementation of Bootstrapping Language Models via DPO Implicit Rewards
☆44Updated 6 months ago
berlino / gated_linear_attention
☆105Updated last year
princeton-nlp / Edge-Pruning
[NeurIPS 2024 Spotlight] Code and data for the paper "Finding Transformer Circuits with Edge Pruning".
☆61Updated 2 months ago
sjelassi / transformers_ssm_copy
☆33Updated last year
PKU-ML / LongPPL
Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?"
☆102Updated 2 weeks ago
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆61Updated last year
song-wx / SIFT
[ICML2024 Spotlight] Fine-Tuning Pre-trained Large Language Models Sparsely
☆22Updated last year
shawntan / stickbreaking-attention
Stick-breaking attention
☆61Updated 3 months ago
yegcjs / DiffusionLLM
Code for paper "Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning"
☆83Updated last year
gregorbachmann / Next-Token-Failures
☆103Updated last year
juvi21 / CoPE-cuda
Contextual Position Encoding but with some custom CUDA Kernels https://arxiv.org/abs/2405.18719
☆22Updated last year
Infini-AI-Lab / gsm_infinite
☆55Updated 4 months ago
kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.
☆48Updated 2 years ago