LeanModels / DFloat11
DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference
☆600 · Updated 2 months ago
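For context on the title above: per the DFloat11 paper, the compression is lossless because the 8 exponent bits of BFloat16 weights are highly redundant, so entropy coding them (Huffman-style) shrinks each weight to roughly 11 bits while reconstruction stays bit-exact. Below is a minimal illustrative sketch, not the repository's implementation, assuming zero-mean Gaussian weights (typical of trained LLM layers); it estimates the per-field entropy of synthetic BF16 weights to show where the savings come from.

```python
import numpy as np

def field_entropy(values):
    """Empirical Shannon entropy (bits/symbol) of a non-negative int array."""
    counts = np.bincount(values)
    p = counts[counts > 0] / values.size
    return float(-(p * np.log2(p)).sum())

# Synthetic "weights": trained LLM weights are roughly zero-mean Gaussian
# (an assumption made here for illustration only).
w = (np.random.randn(1_000_000) * 0.02).astype(np.float32)

# BF16 is the top 16 bits of float32: 1 sign | 8 exponent | 7 mantissa bits.
bf16 = (w.view(np.uint32) >> 16).astype(np.int64)
sign     = (bf16 >> 15) & 0x1
exponent = (bf16 >> 7)  & 0xFF
mantissa =  bf16        & 0x7F

h = {name: field_entropy(f) for name, f in
     [("sign", sign), ("exponent", exponent), ("mantissa", mantissa)]}
print(h)  # exponent entropy lands far below the 8 bits actually stored
print(f"entropy-coded total: ~{sum(h.values()):.1f} bits/weight vs. 16 stored")
```

On this synthetic tensor the exponent field carries only about 2–3 bits of information against the 8 bits stored, while the sign and mantissa fields are near-incompressible; that is roughly why ~11 bits per weight is about the floor for a lossless scheme of this kind.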
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the libraries listed below.
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆674 · Updated 9 months ago
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza… ☆839 · Updated this week
- Block Diffusion for Ultra-Fast Speculative Decoding ☆459 · Updated this week
- ☆163 · Updated 7 months ago
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆222 · Updated last month
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆327 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences ☆394 · Updated 7 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆751 · Updated 5 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆238 · Updated this week
- KV cache compression for high-throughput LLM inference ☆151 · Updated last year
- NVIDIA Linux open GPU with P2P support ☆126 · Updated 2 months ago
- Sparse Inferencing for transformer-based LLMs ☆218 · Updated 5 months ago
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆1,007 · Updated this week
- ☆725 · Updated 2 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆523 · Updated 11 months ago
- LLM Inference on consumer devices ☆129 · Updated 10 months ago
- [ICML 2025] SpargeAttention: A training-free sparse attention that accelerates any model inference. ☆921 · Updated last month
- kernels, of the mega variety ☆665 · Updated last week
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆902 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆912 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆387 · Updated 9 months ago
- dInfer: An Efficient Inference Framework for Diffusion Language Models ☆410 · Updated last month
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆252 · Updated last year
- [NeurIPS '25 Oral] Query-agnostic KV cache eviction: 3–4× reduction in memory and 2× decrease in latency (Qwen3/2.5, Gemma3, LLaMA3) ☆196 · Updated 2 weeks ago
- Code for data-aware compression of DeepSeek models ☆69 · Updated last month
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆384 · Updated this week
- LLM KV cache compression made easy ☆866 · Updated last week
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆287 · Updated 3 months ago
- ☆71 · Updated 7 months ago
- [MLSys '25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys '25] LServe: Efficient Long-sequence LLM Se… ☆810 · Updated 11 months ago