hao-ai-lab / cse234-w25-PALinks

☆41

Alternatives and similar repositories for cse234-w25-PA

Users that are interested in cse234-w25-PA are comparing it to the libraries listed below

Sorting:

sgl-project / sglang-jax
JAX backend for SGL
☆77Updated this week
gpu-mode / ring-attention
ring-attention experiments
☆153Updated last year
Infini-AI-Lab / MagicPIG
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
☆238Updated 10 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆143Updated this week
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆130Updated 10 months ago
RLsys-Foundation / TritonForge
🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation…
☆82Updated last week
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆338Updated 3 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆105Updated 6 months ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆261Updated last month
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆169Updated last year
meta-pytorch / BackendBench
How to ensure correctness and ship LLM generated kernels in PyTorch
☆66Updated this week
shadowpa0327 / Palu
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
☆142Updated 8 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆240Updated this week
MDK8888 / vllmini
A minimal implementation of vllm.
☆58Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆237Updated 11 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆246Updated 2 weeks ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆114Updated 2 weeks ago
feifeibear / DPSKV3MFU
Estimate MFU for DeepSeekV3
☆26Updated 9 months ago
Dao-AILab / grouped-latent-attention
☆130Updated 4 months ago
hao-ai-lab / vllm-ltr
[NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank
☆60Updated 11 months ago
ovg-project / kvcached
Virtualized Elastic KV Cache for Dynamic GPU Sharing and Beyond
☆103Updated last week
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆124Updated 4 months ago
zhuzilin / flash-attention-with-sink
☆39Updated 2 months ago
IST-DASLab / MoE-Quant
Code for data-aware compression of DeepSeek models
☆56Updated 4 months ago
ScalingIntelligence / hydragen
Hydragen: High-Throughput LLM Inference with Shared Prefixes
☆43Updated last year
meta-pytorch / kraken
Triton-based Symmetric Memory operators and examples
☆38Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆156Updated 6 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆72Updated 5 months ago
FasterDecoding / SnapKV
☆283Updated 3 months ago
yaof20 / Flash-RL
Implementation for FP8/INT8 Rollout for RL training without performence drop.
☆260Updated 3 weeks ago