apple / ml-recurrent-drafter
☆61Updated 2 weeks ago
Related projects:
- Boosting 4-bit inference kernels with 2:4 Sparsity☆47Updated 2 weeks ago
- Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry☆36Updated 8 months ago
- ☆83Updated 3 weeks ago
- ☆50Updated 3 months ago
- Applied AI experiments and examples for PyTorch☆123Updated last month
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding☆55Updated this week
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs☆156Updated this week
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers☆183Updated last month
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models"☆57Updated 5 months ago
- Triton-based implementation of Sparse Mixture of Experts.☆166Updated 3 weeks ago
- ☆66Updated 3 months ago
- ring-attention experiments☆89Updated 5 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM☆134Updated 2 months ago
- Repository for CPU Kernel Generation for LLM Inference☆25Updated last year
- ☆75Updated this week
- Simple and fast low-bit matmul kernels in CUDA☆48Updated this week
- Code for Palu: Compressing KV-Cache with Low-Rank Projection☆39Updated this week
- Cataloging released Triton kernels.☆111Updated 3 weeks ago
- NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference☆58Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…☆150Updated last week
- Odysseus: Playground of LLM Sequence Parallelism☆50Updated 3 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN.☆66Updated 3 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving☆258Updated 2 months ago
- ☆124Updated last week
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…☆73Updated last month
- ☆164Updated 4 months ago
- This repository contains the experimental PyTorch native float8 training UX☆210Updated last month
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting☆60Updated 6 months ago
- ☆117Updated 7 months ago
- The official repo for "LLoCo: Learning Long Contexts Offline"☆104Updated 3 months ago