NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆307 · Updated last week
Alternatives and similar repositories for Star-Attention:
Users interested in Star-Attention are comparing it to the libraries listed below.
- LLM KV cache compression made easy ☆267 · Updated this week
- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ☆400 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆199 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆204 · Updated this week
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆281 · Updated 5 months ago
- Advanced Quantization Algorithm for LLMs/VLMs. This is the official implementation of "Optimize Weight Rounding via Signed Gradient Descent …" ☆271 · Updated this week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆313 · Updated 4 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆269 · Updated last month
- A family of compressed models obtained via pruning and knowledge distillation ☆299 · Updated last month
- This repository contains the experimental PyTorch native float8 training UX ☆213 · Updated 4 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024☆242Updated last week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models☆232Updated 2 months ago
- Ring attention implementation with flash attention☆606Updated last week
- scalable and robust tree-based speculative decoding algorithm☆322Updated 4 months ago
- KV cache compression for high-throughput LLM inference☆97Updated this week
- The official implementation of the paper <MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression>☆106Updated last week
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding☆100Updated 2 weeks ago
- [ICML 2024] CLLMs: Consistency Large Language Models☆360Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.☆181Updated 5 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
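
For orientation, several of these projects (Star-Attention itself, the two Ring Attention implementations, DuoAttention) revolve around blockwise long-context attention. Below is an illustrative, heavily simplified sketch of the two-phase idea behind Star Attention; it is not code from the NVIDIA repository. The function names, tensor shapes, and the omission of causal masking inside blocks are all assumptions made for brevity.

```python
import torch
import torch.nn.functional as F

def phase1_context_encoding(q, k, v, block_size):
    """Phase 1 (sketch): each context block attends only within itself,
    prefixed by the first ("anchor") block, so blocks can be encoded
    independently (and, in the real system, on separate hosts).
    Causal masking within blocks is omitted here for simplicity."""
    seq_len = q.shape[-2]
    anchor_k = k[..., :block_size, :]
    anchor_v = v[..., :block_size, :]
    outs = []
    for start in range(0, seq_len, block_size):
        end = min(start + block_size, seq_len)
        qb = q[..., start:end, :]
        if start == 0:
            kb, vb = anchor_k, anchor_v
        else:
            # Prepend the anchor block's keys/values to the local block.
            kb = torch.cat([anchor_k, k[..., start:end, :]], dim=-2)
            vb = torch.cat([anchor_v, v[..., start:end, :]], dim=-2)
        outs.append(F.scaled_dot_product_attention(qb, kb, vb))
    return torch.cat(outs, dim=-2)

def phase2_global_query(q_query, k_cache, v_cache):
    """Phase 2 (sketch): query/decode tokens attend globally to the full
    cached context built up in phase 1."""
    return F.scaled_dot_product_attention(q_query, k_cache, v_cache)

# Toy usage: (batch, heads, seq_len, head_dim) tensors, 256-token blocks.
B, H, L, D = 1, 4, 1024, 64
q = torch.randn(B, H, L, D)
k, v = torch.randn_like(q), torch.randn_like(q)
ctx = phase1_context_encoding(q, k, v, block_size=256)  # (1, 4, 1024, 64)
```

The point of the split is that phase 1 is embarrassingly parallel across blocks (no cross-block communication), while only the comparatively short phase 2 pays for global attention; for the actual distributed implementation, masking details, and accuracy results, see the Star-Attention repository above.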