LeanModels / DFloat11
DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference
☆600 · Updated 2 months ago
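For context, DFloat11 losslessly compresses BFloat16 weights to roughly 11 bits per parameter (about a 30% size reduction) by entropy-coding the exponent bits, decompressing them on-GPU during the forward pass. Below is a minimal usage sketch, assuming the `dfloat11` pip package and the `DFloat11Model.from_pretrained` loader described in the repo's README; the checkpoint name is illustrative and the API may differ across versions:

```python
# Minimal sketch of loading a DFloat11-compressed model.
# Assumes: `pip install dfloat11[cuda12]` and a DF11 checkpoint on the
# Hugging Face Hub; loader API and model ID follow the repo README and
# may differ in current releases.
import torch
from transformers import AutoTokenizer
from dfloat11 import DFloat11Model

model_id = "DFloat11/Llama-3.1-8B-Instruct-DF11"  # illustrative checkpoint name

# Weights stay compressed (~70% of BF16 size) in GPU memory; exponents
# are entropy-decoded on the fly during inference, so outputs are
# bit-identical to the original BF16 model.
model = DFloat11Model.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Lossless compression means", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```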
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the libraries listed below.
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆674 · Updated 9 months ago
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆222 · Updated last month
- Block Diffusion for Ultra-Fast Speculative Decoding ☆459 · Updated this week
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantization… ☆839 · Updated this week
- ☆163 · Updated 7 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆326 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences ☆394 · Updated 7 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆238 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆744 · Updated 5 months ago
- ☆1,278 · Updated 2 months ago
- ☆725 · Updated 2 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆523 · Updated 11 months ago
- Parallel Scaling Law for Language Models — Beyond Parameter and Inference Time Scaling ☆468 · Updated 8 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆902 · Updated this week
- dInfer: An Efficient Inference Framework for Diffusion Language Models ☆410 · Updated 3 weeks ago
- Sparse inferencing for transformer-based LLMs ☆218 · Updated 5 months ago
- NVIDIA Linux open GPU with P2P support ☆126 · Updated 2 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆252 · Updated last year
- Official implementation of Half-Quadratic Quantization (HQQ) ☆910 · Updated last month
- kernels, of the mega variety ☆657 · Updated 4 months ago
- [NeurIPS '25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆196 · Updated 2 weeks ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆203 · Updated 2 months ago
- LLM model quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPU vi… ☆1,007 · Updated this week
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training ☆626 · Updated this week
- [ICML 2025] SpargeAttention: a training-free sparse attention that accelerates any model's inference ☆921 · Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆201 · Updated last year
- LLM Inference on consumer devices ☆129 · Updated 10 months ago
- KV cache compression for high-throughput LLM inference ☆151 · Updated last year
- Accelerating MoE with IO- and Tile-aware Optimizations ☆563 · Updated 2 weeks ago
- The homepage of the OneBit model quantization framework ☆200 · Updated last year