facebookresearch / LayerSkip

Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024

☆344

Alternatives and similar repositories for LayerSkip

Users who are interested in LayerSkip are comparing it to the libraries listed below

NVlabs / Minitron
A family of compressed models obtained via pruning and knowledge distillation
☆352 Updated 11 months ago
FasterDecoding / BitDelta
☆202 Updated 10 months ago
facebookresearch / memory
Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars…
☆342 Updated 10 months ago
VITA-Group / Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
☆202 Updated last year
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆390 Updated 3 months ago
huggingface / picotron_tutorial
☆222 Updated 3 weeks ago
wolfecameron / nanoMoE
An extension of the nanoGPT repository for training small MOE models.
☆202 Updated 7 months ago
cornstarch-org / Cornstarch
☆107 Updated last month
lucidrains / speculative-decoding
Explorations into some recent techniques surrounding speculative decoding
☆288 Updated 10 months ago
CASE-Lab-UMD / LLM-Drop
The official implementation of the paper "What Matters in Transformers? Not All Attention is Needed".
☆177 Updated 6 months ago
mengxiayu / LLMSuperWeight
Code for studying the super weight in LLM
☆120 Updated 10 months ago
hao-ai-lab / Consistency_LLM
[ICML 2024] CLLMs: Consistency Large Language Models
☆405 Updated 11 months ago
NVlabs / hymba
☆201 Updated 10 months ago
hao-ai-lab / Dynasor
[NeurIPS 2025] Simple extension on vLLM to help you speed up reasoning models without training.
☆197 Updated 4 months ago
astramind-ai / Mixture-of-depths
Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆173 Updated last year
itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆162 Updated 6 months ago
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆270 Updated 3 months ago
changjonathanc / flex-nano-vllm
FlexAttention based, minimal vllm-style inference engine for fast Gemma 2 inference.
☆296 Updated 2 months ago
NVIDIA / kvpress
LLM KV cache compression made easy
☆660 Updated last week
microsoft / TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
☆446 Updated 9 months ago
jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆117 Updated last year
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆248 Updated 8 months ago
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆494 Updated 8 months ago
arcee-ai / PruneMe
Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models
☆249 Updated last year
snowflakedb / ArcticTraining
ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs)
☆231 Updated this week
allenai / OLMo-core
PyTorch building blocks for the OLMo ecosystem
☆307 Updated this week
eqimp / hogwild_llm
Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache
☆127 Updated 2 months ago
facebookresearch / RAM
A framework to study AI models in Reasoning, Alignment, and use of Memory (RAM).
☆295 Updated this week
arcee-ai / EvolKit
EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language M…
☆240 Updated 11 months ago
microsoft / ArchScale
Simple & Scalable Pretraining for Neural Architecture Research
☆297 Updated 2 months ago