volcengine / veScale
A PyTorch Native LLM Training Framework
☆665 · Updated 2 months ago
Related projects
Alternatives and complementary repositories for veScale
- Disaggregated serving system for Large Language Models (LLMs). ☆359 · Updated 3 months ago
- FlashInfer: Kernel Library for LLM Serving ☆1,452 · Updated this week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆357 · Updated this week
- GLake: optimizing GPU memory management and IO transmission. ☆379 · Updated 3 months ago
- Ring attention implementation with flash attention ☆585 · Updated last week
- Zero Bubble Pipeline Parallelism ☆281 · Updated last week
- Analyzes LLM inference across aspects like computation, storage, transmission, and the hardware roofline mod… ☆311 · Updated 2 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆636 · Updated 2 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference ☆355 · Updated last week
- Large Language Model (LLM) Systems Paper List ☆645 · Updated this week
- ☆502 · Updated 2 months ago
- A large-scale simulation framework for LLM inference ☆277 · Updated last month
- ☆289 · Updated 7 months ago
- Efficient and easy multi-instance LLM serving ☆213 · Updated this week
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆238 · Updated last week
- Best practices for training LLaMA models in Megatron-LM ☆628 · Updated 10 months ago
- FlagScale is a large-model toolkit based on open-source projects. ☆169 · Updated this week
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. ☆545 · Updated last month
- Microsoft Automatic Mixed Precision Library ☆525 · Updated last month
- Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052 ☆457 · Updated 8 months ago
- A low-latency & high-throughput serving engine for LLMs ☆245 · Updated 2 months ago
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆1,120 · Updated 3 months ago
- FlagGems is an operator library for large language models implemented in the Triton language. ☆342 · Updated this week
- A fast communication-overlapping library for tensor parallelism on GPUs. ☆224 · Updated 3 weeks ago
- veRL: Volcano Engine Reinforcement Learning for LLM ☆318 · Updated this week
- A collection of memory-efficient attention operators implemented in the Triton language. ☆219 · Updated 5 months ago
- [NeurIPS'23] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. ☆391 · Updated 3 months ago
- Serverless LLM Serving for Everyone. ☆352 · Updated this week
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆624 · Updated 2 months ago
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆443 · Updated last week