YJHMITWEB / ExFlow
Explore Inter-layer Expert Affinity in MoE Model Inference
☆9 · Updated last year
Alternatives and similar repositories for ExFlow:
Users interested in ExFlow are comparing it to the libraries listed below.
- ☆53 · Updated last year
- ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NIPS'24) ☆37 · Updated 4 months ago
- A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache. ☆34 · Updated 2 weeks ago
- 16-fold memory access reduction with nearly no loss ☆91 · Updated last month
- ☆59 · Updated 10 months ago
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆19 · Updated 7 months ago
- ☆96 · Updated 5 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆44 · Updated last month
- LLM Inference analyzer for different hardware platforms ☆65 · Updated last week
- PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline". ☆85 · Updated last year
- Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines ☆17 · Updated last year
- An experimentation platform for LLM inference optimisation ☆29 · Updated 7 months ago
- ☆13 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆161 · Updated 9 months ago
- ☆74 · Updated 3 weeks ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆155 · Updated 7 months ago
- Stateful LLM Serving ☆65 · Updated last month
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆106 · Updated 3 weeks ago
- ☆15 · Updated last month
- ☆45 · Updated last year
- [ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆36 · Updated 3 weeks ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆37 · Updated this week
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆179 · Updated 2 weeks ago
- ☆35 · Updated 9 months ago
- Code release for AdapMoE accepted by ICCAD 2024 ☆19 · Updated last week
- Official Repo for "LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization" ☆29 · Updated last year
- ☆58 · Updated 2 weeks ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆276 · Updated 5 months ago
- [ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection ☆106 · Updated 2 months ago
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting" ☆54 · Updated 10 months ago