LeanModels / DFloat11
DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference
☆589 · Updated last month
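For context, a minimal sketch of running inference with a DFloat11-compressed checkpoint, assuming the project's `dfloat11` package and its `DFloat11Model.from_pretrained` loader as described in the repository README; the checkpoint name and generation settings below are illustrative only.

```python
# Minimal sketch, assuming the `dfloat11` package from this repo and a CUDA GPU.
import torch
from transformers import AutoTokenizer
from dfloat11 import DFloat11Model  # assumed entry point, per the README

model_id = "DFloat11/Llama-3.1-8B-Instruct-DF11"  # illustrative checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads losslessly compressed BF16 weights (~70% of original size) and
# decompresses them on the fly during GPU inference.
model = DFloat11Model.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Lossless compression means", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```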
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the libraries listed below.
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆671 · Updated 8 months ago
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza… ☆806 · Updated this week
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆189 · Updated last month
- Efficient LLM Inference over Long Sequences ☆393 · Updated 6 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆233 · Updated this week
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆323 · Updated last month
- Block Diffusion for Ultra-Fast Speculative Decoding ☆188 · Updated last week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆730 · Updated 5 months ago
- ☆720 · Updated last month
- Sparse inferencing for transformer-based LLMs ☆217 · Updated 5 months ago
- ☆163 · Updated 6 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆888 · Updated this week
- [ICML 2025] SpargeAttention: A training-free sparse attention that accelerates any model inference ☆897 · Updated last week
- kernels, of the mega variety ☆640 · Updated 3 months ago
- ☆1,257 · Updated last month
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆368 · Updated last week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆519 · Updated 11 months ago
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆971 · Updated this week
- dInfer: An Efficient Inference Framework for Diffusion Language Models ☆385 · Updated last week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆905 · Updated 3 weeks ago
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆277 · Updated 2 months ago
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference Time Scaling ☆467 · Updated 7 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆201 · Updated last year
- KV cache compression for high-throughput LLM inference ☆148 · Updated 11 months ago
- TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight) ☆423 · Updated 3 months ago
- LLM inference on consumer devices ☆128 · Updated 9 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆249 · Updated 11 months ago
- An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs ☆616 · Updated last week
- LLM KV cache compression made easy ☆749 · Updated 3 weeks ago
- The homepage of the OneBit model quantization framework ☆198 · Updated 11 months ago