shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆185 · Updated last month
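For orientation: scattermoe provides fused Triton kernels for the scatter/gather steps of sparse Mixture-of-Experts layers. The sketch below is not the scattermoe API; it is a minimal plain-PyTorch reference of the top-k routing pattern that such kernels accelerate, with all names invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveTopKMoE(nn.Module):
    """Plain-PyTorch reference of top-k sparse MoE routing.

    Illustrative only: all names here are made up for this sketch; the
    per-expert gather/compute/scatter loop in forward() is the part that
    fused Triton kernels (as in scattermoe) are designed to speed up.
    """

    def __init__(self, d_model: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        gate_logits = self.router(x)                        # (num_tokens, n_experts)
        topk_vals, topk_idx = gate_logits.topk(self.k, -1)  # each token picks k experts
        topk_w = F.softmax(topk_vals, dim=-1)               # normalize over the chosen k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Which (token, slot) pairs routed to expert e?
            token_ids, slots = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            # Gather the routed tokens, run the expert, scatter back weighted.
            out.index_add_(0, token_ids,
                           topk_w[token_ids, slots, None] * expert(x[token_ids]))
        return out

moe = NaiveTopKMoE(d_model=64, d_ff=256, n_experts=8, k=2)
y = moe(torch.randn(32, 64))   # -> (32, 64)
```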
Related projects
Alternatives and complementary repositories for scattermoe
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆195 · Updated 3 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆211 · Updated last year
- ☆188 · Updated 6 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆211 · Updated 3 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆147 · Updated 4 months ago
- ☆132 · Updated last year
- ☆88 · Updated 2 months ago
- ☆96 · Updated last month
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆193 · Updated this week
- Cataloging released Triton kernels. ☆134 · Updated 2 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆187 · Updated this week
- Applied AI experiments and examples for PyTorch ☆166 · Updated 3 weeks ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆278 · Updated 4 months ago
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆78 · Updated this week
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆176 · Updated last month
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆143 · Updated this week
- [ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆241 · Updated last month
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆202 · Updated 2 weeks ago
- Collection of kernels written in the Triton language ☆68 · Updated 3 weeks ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆56 · Updated last month
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆214 · Updated 3 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆165 · Updated this week
- ☆156 · Updated last year
- ring-attention experiments ☆97 · Updated last month
- Understand and test language model architectures on synthetic tasks. ☆162 · Updated 6 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆305 · Updated 3 months ago
- Low-bit optimizers for PyTorch ☆119 · Updated last year
- Extensible collectives library in Triton ☆71 · Updated last month
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆188 · Updated 3 weeks ago