UofT-EcoSystem / TempoLinks

Memory footprint reduction for transformer models

☆11

Alternatives and similar repositories for Tempo

Users that are interested in Tempo are comparing it to the libraries listed below

Sorting:

LiuXiaoxuanPKU / GACT-ICML
☆43Updated 3 years ago
mit-han-lab / flash-moba
☆143Updated this week
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆106Updated 7 months ago
zhuohan123 / terapipe
☆77Updated 4 years ago
stanford-futuredata / stk
☆113Updated last year
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆217Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78Updated last year
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆51Updated 4 months ago
exists-forall / striped_attention
☆43Updated 2 years ago
INT-FlashAttention2024 / INT-FlashAttention
☆83Updated 9 months ago
microsoft / SparTA
☆159Updated last year
yuezhouhu / 2by4-pretrain
Efficient 2:4 sparse training algorithms and implementations
☆57Updated 11 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44Updated 11 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆127Updated 5 months ago
thunlp / TritonBench
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
☆95Updated 5 months ago
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48Updated 3 months ago
Gaffey / ExCP
Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking".
☆48Updated last year
tile-ai / TileOPs
☆57Updated last week
Dao-AILab / grouped-latent-attention
☆130Updated 5 months ago
zhuzilin / flash-attention-with-sink
☆39Updated 3 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆154Updated last month
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆123Updated this week
DachengLi1 / AMP
(NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters.
☆43Updated 3 years ago
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆54Updated 11 months ago
LiuXiaoxuanPKU / OSD
☆60Updated 11 months ago
RulinShao / FastCkpt
Python package for rematerialization-aware gradient checkpointing
☆27Updated 2 years ago
zhengzangw / Sequence-Scheduling
PyTorch implementation of paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline".
☆93Updated 2 years ago
fzyzcjy / torch_utils
Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocatio…
☆67Updated 2 months ago
tsinghua-ideal / Twilight
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆67Updated last week
thu-ml / Jetfire-INT8Training
☆60Updated last year