LeanModels / DFloat11
DFloat11: Lossless LLM Compression for Efficient GPU Inference
☆536 · Updated 3 weeks ago
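DFloat11's core observation: the 8-bit exponent field of BFloat16 weights is highly skewed in trained LLMs, so entropy-coding the exponents (while keeping the sign and mantissa bits verbatim) losslessly shrinks weights from 16 to roughly 11 bits each. Below is a minimal Python sketch of that idea under stated assumptions: synthetic Gaussian weights stand in for a real checkpoint, and the Huffman coder only measures achievable code lengths rather than emitting a packed bitstream (which is what the repo's GPU decompression kernels actually handle).

```python
import heapq
from collections import Counter

import numpy as np

def huffman_code_lengths(freqs):
    """Return {symbol: code_length} for a Huffman code over `freqs`."""
    # Each heap entry: (total_frequency, tie_breaker, {symbol: depth_so_far}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every contained symbol one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

# Synthetic stand-in for trained weights: small-magnitude Gaussian values.
w = np.random.default_rng(0).normal(0.0, 0.02, 1_000_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)  # top 16 bits = BF16 pattern
exponent = (bf16 >> 7) & 0xFF                       # bits 14..7: exponent field

counts = Counter(exponent.tolist())
lengths = huffman_code_lengths(counts)
coded_exp_bits = sum(counts[s] * lengths[s] for s in counts) / len(exponent)
# 1 sign bit + 7 raw mantissa bits + entropy-coded exponent per weight.
print(f"avg bits/weight: {1 + 7 + coded_exp_bits:.2f} (raw BF16: 16)")
```

On real checkpoints the exponent entropy is low enough that the project reports roughly 70% of the original BF16 size with bit-identical outputs.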
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the libraries listed below.
- VPTQ: a flexible and extreme low-bit quantization algorithm ☆658 · Updated 4 months ago
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. ☆631 · Updated this week
- Efficient LLM Inference over Long Sequences ☆391 · Updated 2 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆302 · Updated 4 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆633 · Updated last month
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆161 · Updated this week
- Sparse inference for transformer-based LLMs ☆197 · Updated last month
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference-Time Scaling ☆443 · Updated 4 months ago
- ☆150 · Updated 3 months ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆706 · Updated this week
- LLM Inference on consumer devices ☆124 · Updated 6 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆188 · Updated 3 months ago
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆784 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆491 · Updated 7 months ago
- Muon is Scalable for LLM Training ☆1,311 · Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆201 · Updated last year
- ☆639 · Updated 3 weeks ago
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ☆186 · Updated last month
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆245 · Updated 7 months ago
- A collection of tricks and tools to speed up transformer models ☆180 · Updated last week
- KV cache compression for high-throughput LLM inference ☆138 · Updated 7 months ago
- TransMLA: Multi-Head Latent Attention Is All You Need ☆356 · Updated 3 weeks ago
- SpargeAttention: a training-free sparse attention method that can accelerate inference for any model. ☆718 · Updated last month
- kernels, of the mega variety ☆496 · Updated 3 months ago
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× … ☆86 · Updated 2 weeks ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (see the sketch after this list) ☆343 · Updated 9 months ago
- LLM KV cache compression made easy ☆609 · Updated last week
- Reverse Engineering Gemma 3n: Google's New Edge-Optimized Language Model ☆242 · Updated 3 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆756 · Updated 6 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆878 · Updated 2 weeks ago
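The memory-layers entry above describes its mechanism in one line; the sketch below makes it concrete. It is a naive top-k key-value memory layer in plain PyTorch with invented names and sizes, not Meta's implementation: the actual repo uses product-key lookup to avoid scoring every key, whereas this version scores all of them for clarity.

```python
import torch
import torch.nn as nn

class MemoryLayer(nn.Module):
    """Naive top-k key-value memory layer (illustrative, not the repo's code)."""

    def __init__(self, dim: int, num_slots: int = 4096, k: int = 4):
        super().__init__()
        self.query = nn.Linear(dim, dim, bias=False)
        # The "extra parameters": a large learned table of keys and values.
        self.keys = nn.Parameter(torch.randn(num_slots, dim) / dim**0.5)
        self.values = nn.Parameter(torch.randn(num_slots, dim) / dim**0.5)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.query(x)                          # (batch, seq, dim)
        scores = q @ self.keys.T                   # (batch, seq, num_slots)
        top_scores, top_idx = scores.topk(self.k, dim=-1)
        weights = top_scores.softmax(dim=-1)       # (batch, seq, k)
        picked = self.values[top_idx]              # gather k value rows per token
        # Each token reads only k rows of the value table, so compute stays
        # roughly flat as num_slots (and the parameter count) grows.
        return x + (weights.unsqueeze(-1) * picked).sum(dim=-2)

x = torch.randn(2, 8, 64)
print(MemoryLayer(64)(x).shape)  # torch.Size([2, 8, 64])
```

The point of the design is that the `num_slots * dim` value table can grow very large while each token still touches only `k` rows, so added capacity does not translate into proportional per-token compute.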