LeanModels / DFloat11
DFloat11: Lossless LLM Compression for Efficient GPU Inference
☆559 · Updated 2 months ago
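For context on the headline claim: per the DFloat11 paper, BFloat16 weights compress losslessly to roughly 11 bits per weight because the 8-bit exponent field is highly skewed and can be entropy-coded, while the sign and mantissa bits are stored verbatim and everything is decoded on-GPU at inference time. The toy sketch below (plain NumPy plus a minimal Huffman coder; it is not the repository's implementation or API, and all names are illustrative) demonstrates the lossless round trip and where the savings come from:

```python
# Toy sketch of DFloat11-style lossless compression: entropy-code the
# skewed 8-bit exponent field of BFloat16, keep sign + mantissa verbatim.
# Illustrative only -- not the DFloat11 repo's code or API.
import heapq
from collections import Counter
import numpy as np

def huffman_code(freqs):
    """Build a Huffman code {symbol: bitstring} from symbol counts."""
    # Merge the two least-frequent groups until one tree remains,
    # prepending a bit to every symbol in each merged group.
    heap = [(n, i, [s]) for i, (s, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    code = {s: "" for s in freqs}
    tie = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)
        n2, _, right = heapq.heappop(heap)
        for s in left:
            code[s] = "0" + code[s]
        for s in right:
            code[s] = "1" + code[s]
        heapq.heappush(heap, (n1 + n2, tie, left + right))
        tie += 1
    return code

# Simulated BFloat16 weights: float32 with the low 16 bits dropped,
# viewed as uint16 (bit layout: sign:1 | exponent:8 | mantissa:7).
rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)

exponent = ((bf16 >> 7) & 0xFF).tolist()           # skewed -> entropy-code
sign_mant = ((bf16 >> 8) & 0x80) | (bf16 & 0x7F)   # 8 bits, kept verbatim

code = huffman_code(Counter(exponent))
bitstream = "".join(code[e] for e in exponent)

# Decode and verify the round trip is exactly lossless.
inv = {v: k for k, v in code.items()}
decoded, buf = [], ""
for bit in bitstream:
    buf += bit
    if buf in inv:
        decoded.append(inv[buf])
        buf = ""
rebuilt = (
    ((sign_mant & 0x80).astype(np.uint16) << 8)      # sign back to bit 15
    | (np.asarray(decoded, dtype=np.uint16) << 7)    # exponent, bits 14-7
    | (sign_mant & 0x7F)                             # mantissa, bits 6-0
)
assert np.array_equal(rebuilt, bf16)

bits_per_weight = 8 + len(bitstream) / len(w)  # verbatim bits + coded exponent
print(f"~{bits_per_weight:.1f} bits/weight vs 16 for BFloat16")
```

On real LLM weights the exponent histogram is typically even more concentrated than this toy Gaussian, which is what pushes the effective width down toward 11 bits; the repository's contribution is doing the decoding with GPU kernels fast enough for inference.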
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the repositories listed below.
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. ☆701 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆692 · Updated 2 months ago
- VPTQ, A Flexible and Extreme low-bit quantization algorithm ☆663 · Updated 6 months ago
- Efficient LLM Inference over Long Sequences ☆390 · Updated 4 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆308 · Updated 5 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆187 · Updated last week
- ☆152 · Updated 4 months ago
- ☆1,073 · Updated last week
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆860 · Updated last week
- ☆700 · Updated 3 weeks ago
- Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling ☆450 · Updated 5 months ago
- Sparse Inferencing for transformer based LLMs ☆201 · Updated 3 months ago
- [ICML2025] SpargeAttention: A training-free sparse attention that accelerates any model inference. ☆765 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆499 · Updated 9 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆820 · Updated this week
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training ☆550 · Updated this week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆249 · Updated 9 months ago
- The homepage of the OneBit model quantization framework. ☆194 · Updated 9 months ago
- LLM Inference on consumer devices ☆125 · Updated 7 months ago
- A Unified and Flexible Inference Engine with Hybrid Cache Acceleration and Parallelism for 🤗Diffusers. ☆542 · Updated this week
- Muon is Scalable for LLM Training ☆1,354 · Updated 3 months ago
- kernels, of the mega variety ☆597 · Updated last month
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training ☆546 · Updated 10 months ago
- Dream 7B, a large diffusion language model ☆1,054 · Updated last month
- https://wavespeed.ai/ Context parallel attention that accelerates DiT model inference with dynamic caching ☆385 · Updated 4 months ago
- End-to-end recipes for optimizing diffusion models with torchao and diffusers (inference and FP8 training). ☆380 · Updated 5 months ago
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆244 · Updated last week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆889 · Updated 2 weeks ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆376 · Updated 7 months ago
- LLM KV cache compression made easy ☆680 · Updated this week