hpcaitech / CachedEmbedding
A memory efficient DLRM training solution using ColossalAI
☆103 · Updated 2 years ago
Alternatives and similar repositories for CachedEmbedding:
Users interested in CachedEmbedding are comparing it to the libraries listed below.
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆106 · Updated 3 months ago
- ☆116 · Updated 11 months ago
- ☆60 · Updated this week
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆109 · Updated 2 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆65 · Updated 8 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆65 · Updated 5 months ago
- Inference framework for MoE layers based on TensorRT with Python binding ☆41 · Updated 3 years ago
- ☆176 · Updated 5 months ago
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆206 · Updated 6 months ago
- ☆100 · Updated 6 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks ☆92 · Updated 11 months ago
- A MoE implementation for PyTorch, [ATC'23] SmartMoE ☆61 · Updated last year
- Modular and structured prompt caching for low-latency LLM inference ☆88 · Updated 3 months ago
- ☆117 · Updated 10 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆237 · Updated 4 months ago
- ☆127 · Updated 2 months ago
- ☆69 · Updated 2 months ago
- ☆76 · Updated last year
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆296 · Updated last week
- An easy-to-use package for implementing SmoothQuant for LLMs ☆93 · Updated 9 months ago
- ☆44 · Updated 8 months ago
- ☆79 · Updated 2 years ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆85 · Updated 4 months ago
- LLM Serving Performance Evaluation Harness ☆70 · Updated this week
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆56 · Updated last year
- A minimal implementation of vllm ☆34 · Updated 7 months ago
- Elixir: Train a Large Language Model on a Small GPU Cluster ☆13 · Updated last year
- Triton-based implementation of Sparse Mixture of Experts ☆201 · Updated 3 months ago
- Vocabulary Parallelism ☆17 · Updated 3 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆68 · Updated 2 weeks ago