LeanModels / DFloat11
DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference
☆572 · Updated last month
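For context on the technique: DFloat11 losslessly compresses BF16 weights by exploiting the redundancy of the 8-bit exponent field, entropy-coding exponents while keeping sign and mantissa bits verbatim. The sketch below is a minimal illustration of that idea, not the DFloat11 codec; the helper names and the toy Gaussian weight distribution are assumptions.

```python
# Minimal sketch of the idea behind DFloat11-style lossless compression
# (illustrative only, not the actual codec; names are hypothetical).
import numpy as np

def to_bf16_bits(w_f32: np.ndarray) -> np.ndarray:
    """Truncate float32 to bfloat16 bit patterns (BF16 = top 16 bits)."""
    w = np.ascontiguousarray(w_f32, dtype=np.float32)
    return (w.view(np.uint32) >> 16).astype(np.uint16)

def effective_bits_per_weight(w_f32: np.ndarray) -> float:
    """Shannon bound: 1 sign bit + 7 mantissa bits + H(exponent) bits."""
    exponents = (to_bf16_bits(w_f32) >> 7) & 0xFF   # 8-bit BF16 exponent
    counts = np.bincount(exponents)                  # exponent histogram
    probs = counts[counts > 0] / exponents.size
    return 1 + 7 + float(-(probs * np.log2(probs)).sum())

# Trained weights cluster in a narrow magnitude range, so the exponent's
# entropy is far below its 8 stored bits; entropy-coding the exponent
# while keeping sign and mantissa verbatim is bit-exact (lossless).
w = (np.random.randn(1_000_000) * 0.02).astype(np.float32)
print(f"~{effective_bits_per_weight(w):.1f} effective bits per weight")
```

On weights with a trained-model-like spread, the exponent entropy lands near 2 to 3 bits, consistent with the roughly 11 effective bits that give DFloat11 its name.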
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the libraries listed below.
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆670 · Updated 7 months ago
- Advanced quantization toolkit for LLMs and VLMs. Support for WOQ, MXFP4, NVFP4, GGUF, adaptive schemes and seamless integration with Tra… ☆775 · Updated this week
- Efficient LLM Inference over Long Sequences ☆393 · Updated 5 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆220 · Updated this week
- ☆159 · Updated 6 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆716 · Updated 4 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆319 · Updated 3 weeks ago
- Sparse inferencing for transformer-based LLMs ☆215 · Updated 4 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆515 · Updated 10 months ago
- ☆1,233 · Updated last month
- LLM inference on consumer devices ☆128 · Updated 9 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆249 · Updated 10 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆864 · Updated last week
- LLM model quantization (compression) toolkit with HW acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆933 · Updated this week
- Kernels, of the mega variety ☆631 · Updated 2 months ago
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference-Time Scaling ☆463 · Updated 7 months ago
- KV cache compression for high-throughput LLM inference ☆148 · Updated 10 months ago
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆349 · Updated last week
- ☆716 · Updated 3 weeks ago
- A framework for efficient model inference with omni-modality models ☆1,335 · Updated this week
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆151 · Updated 2 weeks ago
- Load compute kernels from the Hub ☆352 · Updated last week
- NVIDIA Linux open GPU with P2P support ☆98 · Updated 2 weeks ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆198 · Updated 2 weeks ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ☆360 · Updated last year (a toy sketch of this mechanism follows the list)
- The homepage of the OneBit model quantization framework ☆196 · Updated 10 months ago
- Muon is Scalable for LLM Training ☆1,387 · Updated 4 months ago
- [ICML 2025] SpargeAttention: a training-free sparse attention that accelerates any model inference ☆852 · Updated last week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆902 · Updated this week
- Dream 7B, a large diffusion language model ☆1,115 · Updated last month
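On the memory-layers entry above: the toy PyTorch sketch below illustrates the mechanism under stated assumptions. A large trainable key/value table adds parameters, and each token reads only its top-k slots, so the added per-token compute stays small. The class name, sizes, and single flat table are hypothetical simplifications; the actual design scores candidate keys via a product-key factorization rather than the dense pass over all slots used here.

```python
# Toy memory layer: sparse top-k lookup into a trainable key/value table
# (simplified illustration; not the referenced repo's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMemoryLayer(nn.Module):
    def __init__(self, d_model: int, n_slots: int = 8192, k: int = 4):
        super().__init__()
        # Extra capacity lives in these tables, not in wider matmuls.
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) / d_model ** 0.5)
        self.values = nn.Parameter(torch.zeros(n_slots, d_model))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Score all slots (a product-key design would avoid this dense pass).
        scores = x @ self.keys.t()                   # (..., n_slots)
        top, idx = scores.topk(self.k, dim=-1)       # keep k slots per token
        gates = F.softmax(top, dim=-1)               # (..., k)
        picked = self.values[idx]                    # (..., k, d_model)
        # Residual update: many parameters, but only k value rows read.
        return x + (gates.unsqueeze(-1) * picked).sum(dim=-2)

layer = ToyMemoryLayer(d_model=64)
out = layer(torch.randn(2, 10, 64))                  # (batch, seq, d_model)
```

The design point is that parameter count scales with `n_slots` while the value-table reads scale only with `k`, which is why such layers add capacity at near-constant FLOPs.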