FasterDecoding/TEAL

Readme badge preview -

If you own this repo, copy the snippet below and add it to your README.md

[![RelatedRepos](https://img.shields.io/badge/related-repos-yellow)](https://relatedrepos.com/gh/FasterDecoding/TEAL)

FasterDecoding / TEAL

☆167

Alternatives and similar repositories for TEAL

Users that are interested in TEAL are comparing it to the libraries listed below. We may earn a commission when you buy through links labeled 'Ad' on this page.

Sorting:

ScalingIntelligence / CATS
View on GitHub
☆33Nov 11, 2024Updated last year
hdong920 / GRIFFIN
View on GitHub
☆40Aug 27, 2024Updated last year
SqueezeAILab / SqueezedAttention
View on GitHub
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆58Nov 20, 2024Updated last year
xxyux / SpInfer
View on GitHub
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆68Mar 25, 2025Updated last year
FMInference / DejaVu
View on GitHub
☆359Apr 2, 2024Updated 2 years ago
1-Click AI Models by DigitalOcean Gradient • Ad
Deploy popular AI models on DigitalOcean Gradient GPU virtual machines with just a single click. Zero configuration with optimized deployments.
luuyin / OWL
View on GitHub
Official Pytorch Implementation of "Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity"
☆81Jul 7, 2025Updated last year
mit-han-lab / Quest
View on GitHub
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆400Jul 10, 2025Updated last year
NVlabs / MaskLLM
View on GitHub
[NeurIPS 24 Spotlight] MaskLLM: Learnable Semi-structured Sparsity for Large Language Models
☆189Jan 1, 2025Updated last year
HanGuo97 / flute
View on GitHub
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
☆391Apr 13, 2025Updated last year
IST-DASLab / EvoPress
View on GitHub
☆43Jun 14, 2026Updated last month
ChenMnZ / PrefixQuant
View on GitHub
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆176Nov 26, 2025Updated 7 months ago
andy-yang-1 / DoubleSparse
View on GitHub
16-fold memory access reduction with nearly no loss
☆107Mar 26, 2025Updated last year
NonvolatileMemory / flash_tree_attn
View on GitHub
☆20Dec 24, 2024Updated last year
Dao-AILab / fast-hadamard-transform
View on GitHub
Fast Hadamard transform in CUDA, with a PyTorch interface
☆338Mar 10, 2026Updated 4 months ago
Deploy to Railway using AI coding agents - Free Credits Offer • Ad
Use Claude Code, Codex, OpenCode, and more. Autonomous software development now has the infrastructure to match with Railway.
CASIA-LMC-Lab / FLAP
View on GitHub
[AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models
☆76Jan 6, 2024Updated 2 years ago
IST-DASLab / Sparse-Marlin
View on GitHub
Boosting 4-bit inference kernels with 2:4 Sparsity
☆96Sep 4, 2024Updated last year
mit-han-lab / x-attention
View on GitHub
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆280Jul 6, 2025Updated last year
microsoft / SparTA
View on GitHub
☆167Jul 22, 2024Updated last year
mit-han-lab / Block-Sparse-Attention
View on GitHub
A sparse attention kernel supporting mix sparse patterns
☆534Jan 18, 2026Updated 6 months ago
IsaacRe / vllm-kvcompress
View on GitHub
KV cache compression for high-throughput LLM inference
☆158Feb 5, 2025Updated last year
FasterDecoding / BitDelta
View on GitHub
☆206Dec 5, 2024Updated last year
mit-han-lab / omniserve
View on GitHub
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆850Mar 6, 2025Updated last year
yuezhouhu / 2by4-pretrain
View on GitHub
Efficient 2:4 sparse training algorithms and implementations
☆62Dec 8, 2024Updated last year
Managed Kubernetes at scale on DigitalOcean • Ad
DigitalOcean Kubernetes includes the control plane, bandwidth allowance, container registry, automatic updates, and more for free.
raymin0223 / fast_robust_early_exit
View on GitHub
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
☆67Sep 28, 2024Updated last year
goodevening13 / aquakv
View on GitHub
☆21Apr 27, 2026Updated 2 months ago
OpenBitSys / BitDistiller
View on GitHub
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆139May 16, 2024Updated 2 years ago
Anonymous1252022 / fp4-all-the-way
View on GitHub
☆52May 20, 2025Updated last year
dropbox / gemlite
View on GitHub
Fast low-bit matmul kernels in Triton
☆477Updated this week
opengear-project / GEAR
View on GitHub
GEAR: An Efficient KV Cache Compression Recipefor Near-Lossless Generative Inference of LLM
☆183Jul 12, 2024Updated 2 years ago
locuslab / wanda
View on GitHub
A simple and effective LLM pruning approach.
☆868Aug 9, 2024Updated last year
spcl / QuaRot
View on GitHub
Code for Neurips24 paper: QuaRot, an end-to-end 4-bit inference of large language models.
☆523Nov 26, 2024Updated last year
AnswerDotAI / cold-compress
View on GitHub
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆153Aug 9, 2024Updated last year
Wordpress hosting with auto-scaling - Free Trial Offer • Ad
Fully Managed hosting for WordPress and WooCommerce businesses that need reliable, auto-scalable performance. Cloudways SafeUpdates now available.
jy-yuan / KIVI
View on GitHub
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆418Nov 20, 2025Updated 8 months ago
sunlex0717 / DissectingTensorCores
View on GitHub
☆114Apr 19, 2024Updated 2 years ago
sustcsonglin / disco-pointer
View on GitHub
Official Implementation of ACL2023: Don't Parse, Choose Spans! Continuous and Discontinuous Constituency Parsing via Autoregressive Span …
☆14Aug 25, 2023Updated 2 years ago
NJUNLP / MoE-LPR
View on GitHub
☆22Dec 11, 2024Updated last year
microsoft / MInference
View on GitHub
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention…
☆1,221Apr 8, 2026Updated 3 months ago
ByteDance-Seed / FlexPrefill
View on GitHub
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆170Oct 13, 2025Updated 9 months ago
VITA-Group / llm-kick
View on GitHub
[ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing llms: The truth is rarely pure and never simple.
☆27Apr 21, 2025Updated last year