LeanModels / DFloat11
DFloat11: Lossless LLM Compression for Efficient GPU Inference
☆426 · Updated last month
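DFloat11 reports roughly 70% lossless compression of BFloat16 checkpoints by entropy-coding the 8-bit exponent field, whose distribution in trained weights is highly skewed. A minimal sketch of that idea, not the repo's API (the function name and the use of empirical entropy as a stand-in for an actual Huffman coder are illustrative assumptions):

```python
# Sketch (not the DFloat11 implementation): estimate the lossless
# compression headroom of BF16 weights by measuring the empirical entropy
# of the 8-bit exponent field, which is highly skewed in trained LLMs.
import numpy as np

def bf16_entropy_estimate(weights: np.ndarray) -> float:
    """Return estimated bits/weight if exponents were entropy-coded."""
    # Truncate float32 to BF16 by keeping the top 16 bits.
    bits = weights.astype(np.float32).view(np.uint32) >> 16
    # BF16 layout: bit 15 sign, bits 14..7 exponent, bits 6..0 mantissa.
    exponents = (bits >> 7) & 0xFF
    counts = np.bincount(exponents.ravel().astype(np.int64), minlength=256)
    probs = counts[counts > 0] / counts.sum()
    exp_entropy = -(probs * np.log2(probs)).sum()  # bits per exponent
    # Sign (1 bit) and mantissa (7 bits) stay verbatim; only the exponent
    # is entropy-coded, per DFloat11's reported design.
    return 1 + 7 + exp_entropy

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=1_000_000)  # stand-in for trained weights
print(f"~{bf16_entropy_estimate(w):.1f} bits/weight vs 16 for raw BF16")
```

Because the coding is entropy-based rather than lossy, decompression recovers the original BF16 bits exactly, which is what distinguishes this approach from the quantization toolkits listed below.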
Alternatives and similar repositories for DFloat11
Users interested in DFloat11 are comparing it to the libraries listed below.
- VPTQ: a flexible and extreme low-bit quantization algorithm☆643 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences☆378 · Updated 3 weeks ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads☆466 · Updated 4 months ago
- Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling☆402 · Updated last month
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models☆273 · Updated last month
- Model Compression Toolbox for Large Language Models and Diffusion Models☆501 · Updated 2 months ago
- Muon is Scalable for LLM Training☆1,081 · Updated 2 months ago
- kernels, of the mega variety☆406 · Updated 3 weeks ago
- A safetensors extension to efficiently store sparse quantized tensors on disk☆129 · Updated this week
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Transformers…☆525 · Updated this week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"☆239 · Updated 4 months ago
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training☆385 · Updated this week
- Sparse inferencing for transformer-based LLMs☆183 · Updated this week
- prime-rl is a codebase for decentralized async RL training at scale☆341 · Updated this week
- KV cache compression for high-throughput LLM inference☆131 · Updated 4 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving…☆705 · Updated 3 months ago
- TransMLA: Multi-Head Latent Attention Is All You Need☆310 · Updated this week
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs☆176 · Updated this week
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely… (see the sketch after this list)☆339 · Updated 6 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.☆198 · Updated 11 months ago
- Perplexity GPU Kernels☆375 · Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton☆322 · Updated last week
- Production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLang☆633 · Updated this week
- LLM KV cache compression made easy☆520 · Updated this week
- Dream 7B, a large diffusion language model☆774 · Updated last week
- LLM Inference on consumer devices☆119 · Updated 3 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques…☆151 · Updated this week
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" (see the ternary-quantization sketch below)☆154 · Updated 8 months ago
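The memory-layers entry above describes a trainable key-value lookup that adds capacity without proportional compute. A minimal PyTorch sketch of that mechanism (the class name, sizes, and dense scoring are illustrative assumptions, not that repo's API; real implementations use a product-key decomposition so scoring scales with the square root of the number of keys):

```python
# Illustrative memory layer: only the top-k of num_keys slots contribute
# per token, so added parameters do not add proportional FLOPs downstream.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    def __init__(self, dim: int, num_keys: int = 4096, topk: int = 8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * dim**-0.5)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * dim**-0.5)
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). Score every key, keep only the top-k.
        # (Product keys would avoid this dense scoring pass.)
        scores = x @ self.keys.t()                   # (b, s, num_keys)
        w, idx = scores.topk(self.topk, dim=-1)      # sparse selection
        w = F.softmax(w, dim=-1)
        v = self.values[idx]                         # (b, s, topk, dim)
        return x + (w.unsqueeze(-1) * v).sum(dim=-2) # residual lookup

x = torch.randn(2, 16, 64)
print(MemoryLayer(64)(x).shape)  # torch.Size([2, 16, 64])
```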
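The last entry implements the ternary scheme from "The Era of 1-bit LLMs" (BitNet b1.58). The paper's absmean quantizer is simple enough to sketch directly; this is a reference-style illustration under that paper's formula, not the listed repo's optimized kernels:

```python
# Absmean ternary quantizer from BitNet b1.58:
# W_q = RoundClip(W / (gamma + eps), -1, 1), gamma = mean(|W|).
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    """Quantize weights to {-1, 0, +1} with a per-tensor scale."""
    gamma = w.abs().mean()                         # absmean scale
    w_q = (w / (gamma + eps)).round().clamp_(-1, 1)
    return w_q, gamma                              # dequant: w_q * gamma

w = torch.randn(4, 4) * 0.05
w_q, gamma = absmean_ternary(w)
print(w_q)                             # ternary matrix
print((w_q * gamma - w).abs().max())   # worst-case quantization error
```

Unlike DFloat11's lossless coding above, this is lossy: the ternary weights trade exactness for extreme bit-width reduction and matmuls that reduce to additions.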