Ying1123 / llm-caching-multiplexing
☆19 · Updated last year
Related projects
Alternatives and complementary repositories for llm-caching-multiplexing
- A minimal implementation of vLLM. ☆30 · Updated 3 months ago
- ☆19 · Updated last year
- ☆38 · Updated 4 months ago
- NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference ☆61 · Updated last month
- Elixir: Train a Large Language Model on a Small GPU Cluster ☆13 · Updated last year
- A Cluster-Wide Model Manager to Accelerate DNN Training via Automated Training Warmup ☆34 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆34 · Updated 8 months ago
- An Attention Superoptimizer ☆20 · Updated 6 months ago
- TensorRT-LLM Benchmark Configuration ☆11 · Updated 3 months ago
- PyTorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference ☆26 · Updated 5 months ago
- [ICDCS 2023] DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining ☆12 · Updated 11 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆57 · Updated 5 months ago
- ☆24 · Updated 7 months ago
- PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline". ☆74 · Updated last year
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆13 · Updated 4 months ago
- Code associated with the paper **Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees**. ☆26 · Updated last year
- An auxiliary project analyzing the characteristics of KV in DiT Attention. ☆15 · Updated this week
- Official Repo for "LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization" ☆27 · Updated 8 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- ☆32 · Updated this week
- How much energy do GenAI models consume? ☆41 · Updated last month
- EE-LLM is a framework for large-scale training and inference of early-exit (EE) large language models (LLMs). ☆49 · Updated 5 months ago
- GPU operators for sparse tensor operations ☆29 · Updated 8 months ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆32 · Updated 3 months ago
- Repo for the paper "PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees" (CVPR 2024) ☆13 · Updated 3 months ago
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆53 · Updated last month
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025) ☆11 · Updated 5 months ago
- ☆23 · Updated last year
- ☆24 · Updated last year
- ☆46 · Updated 5 months ago