DS3Lab / CocktailSGD
☆24 · Updated last year
Related projects
Alternatives and complementary repositories for CocktailSGD
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆38 · Updated 10 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆147 · Updated 4 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆56 · Updated last month
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆57 · Updated this week
- Unit Scaling demo and experimentation code ☆16 · Updated 8 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization ☆87 · Updated last month
- Simple and efficient pytorch-native transformer training and inference (batched) ☆61 · Updated 7 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆145 · Updated this week
- Extensible collectives library in Triton ☆72 · Updated last month
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆87 · Updated 3 months ago
- NAACL '24 (Best Demo Paper Runner-Up) / MLSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference ☆61 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆187 · Updated this week
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆79 · Updated this week
- This repository contains code for the MicroAdam paper. ☆12 · Updated 4 months ago
- Experiment using Tangent to autodiff Triton ☆72 · Updated 9 months ago
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆31 · Updated 4 months ago
- Efficient LLM Inference Acceleration using Prompting ☆43 · Updated 3 weeks ago
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆79 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2. ☆89 · Updated last year