DS3Lab / CocktailSGD

☆27

Alternatives and similar repositories for CocktailSGD

Users who are interested in CocktailSGD compare it to the libraries listed below.

IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80 Updated 11 months ago
DS3Lab / Decentralized_FM_alpha
☆19 Updated 2 years ago
AnswerDotAI / cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆140 Updated 11 months ago
ScalingIntelligence / CATS
☆27 Updated 8 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆230 Updated 8 months ago
mayank31398 / ladder-residual-inference
☆14 Updated 3 weeks ago
IST-DASLab / SparseFinetuning
Repository for Sparse Finetuning of LLMs via modified version of the MosaicML llmfoundry
☆42 Updated last year
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆61 Updated 9 months ago
stanford-futuredata / stk
☆107 Updated 11 months ago
insuhan / hyper-attn
☆81 Updated last year
tanyuqian / redco
NAACL '24 (Best Demo Paper RunnerUp) / MlSys @ NeurIPS '23 - RedCoast: A Lightweight Tool to Automate Distributed Training and Inference
☆66 Updated 7 months ago
mengxiayu / LLMSuperWeight
Code for studying the super weight in LLM
☆114 Updated 8 months ago
Raincleared-Song / sparse_gpu_operator
GPU operators for sparse tensor operations
☆33 Updated last year
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆70 Updated last year
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆165 Updated last year
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆69 Updated 5 months ago
FasterDecoding / TEAL
☆137 Updated 5 months ago
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78 Updated last year
google-deepmind / asyncdiloco
☆45 Updated last year
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆71 Updated last year
epfml / dynamic-sparse-flash-attention
☆147 Updated 2 years ago
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80 Updated last year
mobiusml / low-rank-llama2
Low-Rank Llama Custom Training
☆23 Updated last year
IST-DASLab / MicroAdam
This repository contains code for the MicroAdam paper.
☆19 Updated 7 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆127 Updated 8 months ago
kssteven418 / BigLittleDecoder
[NeurIPS'23] Speculative Decoding with Big Little Decoder
☆93 Updated last year
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆56 Updated last week
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆82 Updated 3 weeks ago
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆44 Updated 10 months ago
DS3Lab / DT-FM
☆94 Updated 3 years ago