hao-ai-lab / cse234-w25-PA
☆32 · Updated 2 months ago
Alternatives and similar repositories for cse234-w25-PA
Users interested in cse234-w25-PA are comparing it to the repositories listed below
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 6 months ago
- ☆93 · Updated last week
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆107 · Updated 2 weeks ago
- 16-fold memory access reduction with nearly no loss ☆94 · Updated 2 months ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆41 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆163 · Updated 10 months ago
- ☆129 · Updated 3 months ago
- ring-attention experiments ☆143 · Updated 7 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆126 · Updated 6 months ago
- A minimal implementation of vllm. ☆41 · Updated 10 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆158 · Updated 3 weeks ago
- [ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆214 · Updated 5 months ago
- ☆49 · Updated 2 weeks ago
- ☆70 · Updated 2 weeks ago
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques" (TMLR) ☆70 · Updated 2 months ago
- 🔥 A minimal training framework for scaling FLA models ☆146 · Updated 3 weeks ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆62 · Updated 4 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆46 · Updated 7 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆121 · Updated last week
- Triton-based implementation of Sparse Mixture of Experts. ☆217 · Updated 6 months ago
- kernels, of the mega variety ☆329 · Updated this week
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆48 · Updated 7 months ago
- Efficient Triton implementation of Native Sparse Attention. ☆155 · Updated 2 weeks ago
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [arXiv '25] ☆37 · Updated 3 weeks ago
- ☆85 · Updated 2 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 11 months ago
- ☆248 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ☆59 · Updated 7 months ago
- A scalable asynchronous reinforcement learning implementation with in-flight weight updates. ☆119 · Updated this week
- ☆46 · Updated 11 months ago