hao-ai-lab / cse234-w25-PA
☆44 · Updated 9 months ago
Alternatives and similar repositories for cse234-w25-PA
Users who are interested in cse234-w25-PA are comparing it to the libraries listed below.
- JAX backend for SGL ☆205 · Updated this week
- ring-attention experiments ☆160 · Updated last year
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆109 · Updated last month
- Accelerating MoE with IO and Tile-aware Optimizations ☆469 · Updated this week
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆55 · Updated last week
- Autonomous GPU Kernel Generation via Deep Agents ☆192 · Updated last week
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆159 · Updated 2 months ago
- Ship correct and fast LLM kernels to PyTorch ☆127 · Updated last week
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆135 · Updated last year
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆45 · Updated last year
- Cataloging released Triton kernels. ☆278 · Updated 3 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆175 · Updated last year
- Systems for GenAI ☆151 · Updated 8 months ago
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆245 · Updated last year
- fmchisel: Efficient Compression and Training Algorithms for Foundation Models ☆81 · Updated 2 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning ☆148 · Updated last month
- A minimal implementation of vLLM. ☆64 · Updated last year
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆66 · Updated last year
- ☆97 · Updated 9 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆135 · Updated 6 months ago
- 16-fold memory access reduction with nearly no loss ☆109 · Updated 9 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆106 · Updated 2 months ago
- ☆133 · Updated 6 months ago
- ☆268 · Updated this week
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆151 · Updated 10 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆257 · Updated 2 months ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆360 · Updated 5 months ago
- DeeperGEMM: crazy optimized version ☆73 · Updated 7 months ago
- ☆210 · Updated last month
- kernels, of the mega variety ☆634 · Updated 2 months ago