PyTorch implementation of StableMask (ICML'24)
☆15 · Updated Jun 27, 2024
Alternatives and similar repositories for StableMask
Users interested in StableMask are comparing it to the libraries listed below.
- ☆24 · Updated Sep 25, 2024
- Recipe for training fully-featured self-supervised image JEPA models ☆12 · Updated Jun 4, 2025
- [Oral; NeurIPS OPT 2024] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers ☆15 · Updated Feb 12, 2026
- [EMNLP'24] LongHeads: Multi-Head Attention is Secretly a Long Context Processor ☆31 · Updated Apr 8, 2024
- [EMNLP 2024] Quantize LLMs to extremely low bit-widths and finetune the quantized LLMs ☆15 · Updated Jul 18, 2024
- ☆17 · Updated Jun 11, 2025
- ☆20 · Updated May 30, 2024
- Code for the EMNLP 2024 paper "A simple and effective L2 norm based method for KV Cache compression" ☆18 · Updated Dec 13, 2024
- PyTorch and NNsight implementation of AtP* (Kramar et al., 2024, DeepMind) ☆20 · Updated Jan 19, 2025
- The open-source materials for the paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity" ☆30 · Updated Nov 12, 2024
- Scalable and Stable Parallelization of Nonlinear RNNs ☆29 · Updated Oct 21, 2025
- Code for the paper "Function-Space Learning Rates" ☆25 · Updated Jun 3, 2025
- ☆23 · Updated Jun 18, 2024
- HGRN2: Gated Linear RNNs with State Expansion ☆56 · Updated Aug 20, 2024
- ☆28 · Updated Oct 28, 2024
- ☆23 · Updated Oct 15, 2024
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated Jun 6, 2024
- MiSS is a novel PEFT method that features a low-rank structure but introduces a new update mechanism distinct from LoRA, achieving an exc… ☆31 · Updated Jan 28, 2026
- [EMNLP 2023] Context Compression for Auto-regressive Transformers with Sentinel Tokens ☆25 · Updated Nov 6, 2023
- ☆27 · Updated May 3, 2024
- Checkpointable dataset utilities for foundation model training ☆32 · Updated Jan 29, 2024
- ☆106 · Updated Mar 9, 2024
- Code for "Optimizing DDPM Sampling with Shortcut Fine-Tuning" (https://arxiv.org/abs/2301.13362), ICML 2023 ☆30 · Updated Oct 6, 2023
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆27 · Updated Apr 17, 2024
- Multi-Layer Sparse Autoencoders (ICLR 2025) ☆29 · Updated Feb 6, 2026
- Official implementation of the transformer (TF) architecture suggested in the paper "Looped Transformers as Programmable Computers…" ☆30 · Updated Apr 8, 2023
- ☆34 · Updated Sep 10, 2024
- ☆29 · Updated Feb 27, 2024
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆35 · Updated Jun 12, 2024
- ☆35 · Updated Apr 12, 2024
- Flash Attention Triton kernel with support for second-order derivatives ☆142 · Updated Feb 4, 2026
- FeatureAlignment = Alignment + Mechanistic Interpretability ☆34 · Updated Mar 8, 2025
- ☆33 · Updated Nov 4, 2024
- BitLinear implementation ☆35 · Updated Jan 1, 2026
- [CVPR 2023] Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference ☆30 · Updated Mar 14, 2024
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer ☆41 · Updated Jan 29, 2026
- Open-source code for the paper "Retrieval Head Mechanistically Explains Long-Context Factuality" ☆231 · Updated Aug 2, 2024
- ☆35 · Updated Dec 12, 2023
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆36 · Updated Jun 7, 2024