LeanModels / DFloat11
DFloat11: Lossless LLM Compression for Efficient GPU Inference
☆569 · Updated last week
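DFloat11's approach, per its paper, is to losslessly re-encode BFloat16 weights: the 8-bit exponent field is highly skewed across trained models, so entropy-coding it (Huffman-style) shrinks checkpoints to roughly 70% of their BF16 size while sign and mantissa bits are stored verbatim. Below is a minimal, unofficial sketch of that idea, not the repository's implementation; it assumes NumPy and uses a synthetic Gaussian tensor as a stand-in for real weights, so the printed ratio is only indicative.

```python
# Unofficial sketch: estimate the lossless compression ratio DFloat11-style
# encoding could reach on a weight tensor, by Huffman-coding the BF16
# exponent field and keeping sign + mantissa bits as-is.
import heapq
import numpy as np

def huffman_total_bits(counts):
    # Total Huffman code length equals the sum of all merged node weights.
    heap = [int(c) for c in counts if c > 0]
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

# Synthetic stand-in for a real BF16 checkpoint tensor (assumption, not real data).
w = np.random.randn(1_000_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)  # truncate f32 -> BF16 bit pattern
exponents = ((bf16 >> 7) & 0xFF).astype(np.uint8)   # 8-bit exponent field

coded_exp_bits = huffman_total_bits(np.bincount(exponents, minlength=256))
verbatim_bits = bf16.size * 8                       # 1 sign + 7 mantissa bits per value
ratio = (coded_exp_bits + verbatim_bits) / (bf16.size * 16)
print(f"estimated lossless size vs. BF16: {ratio:.1%}")  # ~70% on skewed exponents
```

The actual repository additionally provides GPU kernels so compressed weights can be decoded on the fly during inference; this sketch only estimates the storage ratio.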
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the libraries listed below.
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆668 · Updated 7 months ago
- Advanced quantization toolkit for LLMs and VLMs. Native support for WOQ, MXFP4, NVFP4, GGUF, Adaptive Bits and seamless integration with … ☆735 · Updated this week
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆316 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆210 · Updated 2 weeks ago
- Efficient LLM Inference over Long Sequences ☆392 · Updated 5 months ago
- ☆156 · Updated 5 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆700 · Updated 3 months ago
- Sparse inference for transformer-based LLMs ☆215 · Updated 3 months ago
- ☆1,215 · Updated 2 weeks ago
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference-Time Scaling ☆456 · Updated 6 months ago
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆129 · Updated 3 weeks ago
- A framework for efficient inference with omni-modality models ☆466 · Updated this week
- [ICML 2025] SpargeAttention: a training-free sparse attention method that accelerates inference for any model ☆798 · Updated 2 weeks ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆202 · Updated last year
- LLM inference on consumer devices ☆125 · Updated 8 months ago
- ☆709 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆507 · Updated 9 months ago
- The homepage of the OneBit model quantization framework ☆196 · Updated 9 months ago
- Checkpoint-engine is a simple middleware for updating model weights in LLM inference engines ☆851 · Updated last week
- Muon is Scalable for LLM Training ☆1,372 · Updated 4 months ago
- A distributed attention mechanism targeting linear scalability for ultra-long-context, heterogeneous-data training ☆570 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆894 · Updated last month
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆247 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆378 · Updated 7 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆249 · Updated 10 months ago
- KV cache compression for high-throughput LLM inference ☆145 · Updated 9 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ☆360 · Updated 11 months ago (a minimal sketch of this lookup mechanism follows the list)
- LLM model quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆900 · Updated this week
- QeRL enables RL for 32B LLMs on a single H100 GPU ☆455 · Updated this week
- TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight) ☆413 · Updated 2 months ago
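The memory-layers entry above describes a trainable key-value lookup that adds parameters without adding per-token FLOPs: each token reads only its top-k value slots, so parameter count grows with the number of slots while compute stays roughly fixed. A naive PyTorch sketch of that idea follows; all names and sizes are illustrative, and unlike real implementations (e.g., product-key memories) this simple version still scores every slot densely.

```python
# Naive memory-layer sketch: a sparse, trainable key-value lookup.
# Parameters scale with num_slots; each token only aggregates topk values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, dim=256, num_slots=16384, topk=4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * dim**-0.5)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * dim**-0.5)
        self.topk = topk

    def forward(self, x):                         # x: (batch, dim)
        scores = x @ self.keys.T                  # (batch, num_slots)
        top, idx = scores.topk(self.topk, dim=-1) # keep only the best k slots
        weights = F.softmax(top, dim=-1)          # sparse attention over k slots
        picked = self.values[idx]                 # (batch, k, dim)
        return (weights.unsqueeze(-1) * picked).sum(dim=1)

x = torch.randn(2, 256)
print(MemoryLayer()(x).shape)                     # torch.Size([2, 256])
```

To actually keep FLOPs flat as the slot count grows, the dense scoring step must be made sublinear, which is what product-key decomposition provides; this sketch omits that for brevity.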