CerebrasResearch / reap
REAP: Router-weighted Expert Activation Pruning for SMoE compression
☆17 · Updated last week
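REAP prunes experts from sparse mixture-of-experts (SMoE) models using a router-weighted activation criterion. As a minimal sketch of that general idea only: the snippet below scores each expert by the mean of its router gate weight times its output norm over a calibration batch, then keeps the top-k experts. The saliency formula, tensor shapes, and the `prune_experts` helper are assumptions inferred from the method's name, not REAP's actual implementation.

```python
# Illustrative sketch of router-weighted expert pruning for one SMoE layer.
# The scoring rule here is a guess from the method's name, not REAP's code;
# `gate_probs` and `expert_outputs` are hypothetical calibration-time captures.
import torch

def prune_experts(gate_probs: torch.Tensor,      # [tokens, n_experts] router softmax
                  expert_outputs: torch.Tensor,  # [tokens, n_experts, d_model]
                  keep: int) -> torch.Tensor:
    # Saliency: average (router weight x expert output norm) per expert, so
    # experts that are both rarely routed to and low-impact score lowest.
    saliency = (gate_probs * expert_outputs.norm(dim=-1)).mean(dim=0)  # [n_experts]
    kept = torch.topk(saliency, keep).indices
    return torch.sort(kept).values  # sorted indices of experts to retain

# Example: keep 4 of 8 experts using statistics from a small calibration batch.
probs = torch.softmax(torch.randn(1024, 8), dim=-1)
outs = torch.randn(1024, 8, 256)
print(prune_experts(probs, outs, keep=4))
```

In a real pipeline, the retained indices would then be used to slice the layer's expert weights and to renormalize the router over the surviving experts.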
Alternatives and similar repositories for reap
Users interested in reap are comparing it to the libraries listed below.
- ☆152 · Updated 4 months ago
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆249 · Updated last year
- An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs ☆532 · Updated last week
- Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2. ☆165 · Updated last year
- Sparse inferencing for transformer-based LLMs ☆201 · Updated 2 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆550 · Updated 2 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆659 · Updated 6 months ago
- 1.58-bit LLaMA model ☆83 · Updated last year
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model. ☆344 · Updated 5 months ago
- LLM inference on consumer devices ☆124 · Updated 7 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆202 · Updated last year
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆374 · Updated 6 months ago
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" ☆154 · Updated last year
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆306 · Updated 5 months ago
- Training-free post-training efficient sub-quadratic-complexity attention, implemented with OpenAI Triton ☆147 · Updated last week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆883 · Updated last month
- Advanced quantization algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU ☆668 · Updated this week
- Experimental BitNet implementation ☆73 · Updated 4 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆248 · Updated 8 months ago
- ☆561 · Updated 11 months ago
- Testing LLM reasoning abilities with family-relationship quizzes ☆62 · Updated 8 months ago
- [NeurIPS'25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆125 · Updated last week
- Official PyTorch implementation of Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆127 · Updated 2 months ago
- ☆102 · Updated this week
- Train your own small BitNet model ☆75 · Updated last year
- Local Qwen3 LLM inference in one easy-to-understand file of C source with no dependencies ☆139 · Updated 3 months ago
- Training code for ParetoQ, introduced in the paper "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization" ☆108 · Updated last week
- The homepage of the OneBit model quantization framework ☆193 · Updated 8 months ago
- ☆135 · Updated 5 months ago
- ☆83 · Updated 2 weeks ago