Relaxed-System-Lab / HKUST-COMP6211J-2025fall
☆19 · Updated 2 weeks ago
Alternatives and similar repositories for HKUST-COMP6211J-2025fall
Users interested in HKUST-COMP6211J-2025fall are comparing it to the repositories listed below.
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆63 · Updated last week
- Code release for AdapMoE, accepted at ICCAD 2024 ☆34 · Updated 6 months ago
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NeurIPS'24) ☆43 · Updated 10 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆59 · Updated 7 months ago
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ☆232 · Updated 3 months ago
- GitHub repo for the ICLR 2025 paper "Fine-tuning Large Language Models with Sparse Matrices" ☆21 · Updated 6 months ago
- Summary of some awesome work for optimizing LLM inference ☆134 · Updated last week
- CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark ☆32 · Updated 4 months ago
- ☆58 · Updated last year
- Curated collection of papers on MoE model inference ☆296 · Updated 3 weeks ago
- ☆39 · Updated 3 months ago
- A Survey of Efficient Attention Methods: Hardware-efficient, Sparse, Compact, and Linear Attention ☆212 · Updated 2 months ago
- ☆131 · Updated 2 weeks ago
- A reading list on popular MLSys topics ☆16 · Updated 7 months ago
- Multi-Level Triton Runner supporting Python, IR, PTX, and cubin ☆76 · Updated last week
- Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA'25) ☆66 · Updated 6 months ago
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24) ☆157 · Updated last year
- DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit ☆75 · Updated this week
- [COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆24 · Updated last year
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (DAC'25) ☆16 · Updated last month
- ☆43 · Updated last year
- [HPCA'21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning ☆111 · Updated last year
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆85 · Updated 4 months ago
- MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24) ☆55 · Updated last year
- 16-fold memory access reduction with nearly no loss ☆106 · Updated 7 months ago
- LLM inference analyzer for different hardware platforms ☆94 · Updated 4 months ago
- A sparse attention kernel supporting mixed sparse patterns ☆355 · Updated 8 months ago
- Tile-based language built for AI computation across all scales ☆74 · Updated this week
- ☆58 · Updated last year
- ☆199 · Updated 2 weeks ago