CerebrasResearch / reap
REAP: Router-weighted Expert Activation Pruning for SMoE compression
☆17 · Updated last week
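REAP prunes experts from sparse mixture-of-experts (SMoE) models using a router-weighted activation criterion. As a minimal sketch of that general idea only: the snippet below scores each expert by the mean of its router gate weight times its output norm over a calibration batch, then keeps the top-k experts. The saliency formula, tensor shapes, and the `prune_experts` helper are assumptions inferred from the method's name, not REAP's actual implementation.

```python
# Illustrative sketch of router-weighted expert pruning for one SMoE layer.
# The scoring rule here is a guess from the method's name, not REAP's code;
# `gate_probs` and `expert_outputs` are hypothetical calibration-time captures.
import torch

def prune_experts(gate_probs: torch.Tensor,      # [tokens, n_experts] router softmax
                  expert_outputs: torch.Tensor,  # [tokens, n_experts, d_model]
                  keep: int) -> torch.Tensor:
    # Saliency: average (router weight x expert output norm) per expert, so
    # experts that are both rarely routed to and low-impact score lowest.
    saliency = (gate_probs * expert_outputs.norm(dim=-1)).mean(dim=0)  # [n_experts]
    kept = torch.topk(saliency, keep).indices
    return torch.sort(kept).values  # sorted indices of experts to retain

# Example: keep 4 of 8 experts using statistics from a small calibration batch.
probs = torch.softmax(torch.randn(1024, 8), dim=-1)
outs = torch.randn(1024, 8, 256)
print(prune_experts(probs, outs, keep=4))
```

In a real pipeline, the retained indices would then be used to slice the layer's expert weights and to renormalize the router over the surviving experts.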
Alternatives and similar repositories for reap
Users interested in reap are comparing it to the libraries listed below.
- ☆152 · Updated 4 months ago
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆249 · Updated last year
- An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs ☆532 · Updated last week
- Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2. ☆165 · Updated last year
- Sparse inferencing for transformer-based LLMs ☆201 · Updated 2 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆550 · Updated 2 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆659 · Updated 6 months ago
- 1.58-bit LLaMA model ☆83 · Updated last year
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model. ☆344 · Updated 5 months ago
- LLM inference on consumer devices ☆124 · Updated 7 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆202 · Updated last year
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆374 · Updated 6 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated last year
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆306 · Updated 5 months ago
- Training-free post-training efficient sub-quadratic-complexity attention, implemented with OpenAI Triton ☆147 · Updated last week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆883 · Updated last month
- Advanced quantization algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU ☆668 · Updated this week
- Experimental BitNet implementation ☆73 · Updated 4 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆248 · Updated 8 months ago
- ☆561 · Updated 11 months ago
- Testing LLM reasoning abilities with family-relationship quizzes ☆62 · Updated 8 months ago
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆125 · Updated last week
- Official PyTorch implementation of Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆127 · Updated 2 months ago
- ☆102 · Updated this week
- Train your own small BitNet model ☆75 · Updated last year
- Local Qwen3 LLM inference in one easy-to-understand file of C source with no dependencies ☆139 · Updated 3 months ago
- Training code for ParetoQ, introduced in the paper "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ☆108 · Updated last week
- The homepage of the OneBit model quantization framework ☆193 · Updated 8 months ago
- ☆135 · Updated 5 months ago
- ☆83 · Updated 2 weeks ago