LeanModels / DFloat11
DFloat11: Lossless LLM Compression for Efficient GPU Inference
☆536 · Updated 3 weeks ago
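DFloat11's core observation: the 8-bit exponent field of BFloat16 weights is highly skewed in trained LLMs, so entropy-coding the exponents (while keeping the sign and mantissa bits verbatim) losslessly shrinks weights from 16 to roughly 11 bits each. Below is a minimal Python sketch of that idea under stated assumptions: synthetic Gaussian weights stand in for a real checkpoint, and the Huffman coder only measures achievable code lengths rather than emitting a packed bitstream (which is what the repo's GPU decompression kernels actually handle).

```python
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(freqs):
    """Return {symbol: code_length} for a Huffman code over `freqs`."""
    # Each heap entry: (total_frequency, tie_breaker, {symbol: depth_so_far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Synthetic stand-in for trained weights: small-magnitude Gaussian values.
w = np.random.default_rng(0).normal(0.0, 0.02, 1_000_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)  # top 16 bits = BF16 pattern
exponent = (bf16 >> 7) & 0xFF                       # bits 14..7: exponent field

counts = Counter(exponent.tolist())
lengths = huffman_code_lengths(counts)
coded_exp_bits = sum(counts[s] * lengths[s] for s in counts) / len(exponent)
# 1 sign bit + 7 raw mantissa bits + entropy-coded exponent per weight.
print(f"avg bits/weight: {1 + 7 + coded_exp_bits:.2f} (raw BF16: 16)")
```

On real checkpoints the exponent entropy is low enough that the project reports roughly 70% of the original BF16 size with bit-identical outputs.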
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the libraries listed below.
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆658 · Updated 4 months ago
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. ☆631 · Updated this week
- Efficient LLM Inference over Long Sequences ☆391 · Updated 2 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆302 · Updated 4 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆633 · Updated last month
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆161 · Updated this week
- Sparse inference for transformer-based LLMs ☆197 · Updated last month
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference-Time Scaling ☆443 · Updated 4 months ago
- ☆150 · Updated 3 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆706 · Updated this week
- LLM Inference on consumer devices ☆124 · Updated 6 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆188 · Updated 3 months ago
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆784 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆491 · Updated 7 months ago
- Muon is Scalable for LLM Training ☆1,311 · Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆201 · Updated last year
- ☆639 · Updated 3 weeks ago
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆186 · Updated last month
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆245 · Updated 7 months ago
- A collection of tricks and tools to speed up transformer models ☆180 · Updated last week
- KV cache compression for high-throughput LLM inference ☆138 · Updated 7 months ago
- TransMLA: Multi-Head Latent Attention Is All You Need ☆356 · Updated 3 weeks ago
- SpargeAttention: a training-free sparse attention method that can accelerate inference for any model. ☆718 · Updated last month
- kernels, of the mega variety ☆496 · Updated 3 months ago
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× … ☆86 · Updated 2 weeks ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (see the sketch after this list) ☆343 · Updated 9 months ago
- LLM KV cache compression made easy ☆609 · Updated last week
- Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model ☆242 · Updated 3 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆756 · Updated 6 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆878 · Updated 2 weeks ago
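The memory-layers entry above describes its mechanism in one line; the sketch below makes it concrete. It is a naive top-k key-value memory layer in plain PyTorch with invented names and sizes, not Meta's implementation: the actual repo uses product-key lookup to avoid scoring every key, whereas this version scores all of them for clarity.

```python
import torch
import torch.nn as nn

class MemoryLayer(nn.Module):
    """Naive top-k key-value memory layer (illustrative, not the repo's code)."""

    def __init__(self, dim: int, num_slots: int = 4096, k: int = 4):
        super().__init__()
        self.query = nn.Linear(dim, dim, bias=False)
        # The "extra parameters": a large learned table of keys and values.
        self.keys = nn.Parameter(torch.randn(num_slots, dim) / dim**0.5)
        self.values = nn.Parameter(torch.randn(num_slots, dim) / dim**0.5)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query(x)                          # (batch, seq, dim)
        scores = q @ self.keys.T                   # (batch, seq, num_slots)
        top_scores, top_idx = scores.topk(self.k, dim=-1)
        weights = top_scores.softmax(dim=-1)       # (batch, seq, k)
        picked = self.values[top_idx]              # gather k value rows per token
        # Each token reads only k rows of the value table, so compute stays
        # roughly flat as num_slots (and the parameter count) grows.
        return x + (weights.unsqueeze(-1) * picked).sum(dim=-2)

x = torch.randn(2, 8, 64)
print(MemoryLayer(64)(x).shape)  # torch.Size([2, 8, 64])
```

The point of the design is that the `num_slots * dim` value table can grow very large while each token still touches only `k` rows, so added capacity does not translate into proportional per-token compute.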