facebookresearch / chai
CHAI is a library for dynamic pruning of attention heads for efficient LLM inference.
☆17 · Updated 8 months ago
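As background for the list below, here is a minimal PyTorch sketch of what dynamic attention-head pruning looks like in general: score each head at inference time and skip the computation for heads below a cutoff. This is an illustrative sketch only, not CHAI's actual algorithm or API; `head_scores` (a hypothetical entropy-based criterion) and `pruned_multihead_attention` are names invented for this example.

```python
import torch
import torch.nn.functional as F

def head_scores(q, k):
    # Hypothetical importance criterion: mean attention entropy per head.
    # Low-entropy ("peaky") heads are treated as important; high-entropy
    # heads become pruning candidates. q, k: (batch, heads, seq, head_dim).
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    entropy = -(attn * attn.clamp_min(1e-9).log()).sum(-1)  # (batch, heads, seq)
    return entropy.mean(dim=(0, 2))                         # (heads,)

def pruned_multihead_attention(q, k, v, keep_mask):
    # Compute attention only for heads where keep_mask is True; pruned heads
    # contribute zeros, so the concatenated output keeps a fixed width.
    b, h, s, d = q.shape
    out = torch.zeros_like(q)
    idx = keep_mask.nonzero(as_tuple=True)[0]  # indices of surviving heads
    if idx.numel() > 0:
        out[:, idx] = F.scaled_dot_product_attention(q[:, idx], k[:, idx], v[:, idx])
    return out.transpose(1, 2).reshape(b, s, h * d)  # (batch, seq, model_dim)

# Toy usage: prune the higher-entropy half of 8 heads at inference time.
q, k, v = (torch.randn(2, 8, 16, 64) for _ in range(3))
scores = head_scores(q, k)
keep = scores <= scores.median()  # keep the more "focused" heads
print(pruned_multihead_attention(q, k, v, keep).shape)  # torch.Size([2, 16, 512])
```

A real system would amortize the scoring step (e.g., decide per layer from calibration data or cached attention statistics) rather than recompute full attention maps to score heads, which would defeat the purpose of pruning.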
Alternatives and similar repositories for chai
Users interested in chai are comparing it to the repositories listed below
- Official code for the paper "Attention as a Hypernetwork"☆40 · Updated last year
- The implementation for the MLSys 2023 paper "Cuttlefish: Low-rank Model Training without All The Tuning"☆45 · Updated 2 years ago
- Code repository for the public reproduction of the language modelling experiments on "MatFormer: Nested Transformer for Elastic Inference…☆27 · Updated last year
- Official code implementation for "A Simple Early Exiting Framework for Accelerated Sampling in Diffusion Models"☆19 · Updated last year
- [ICML 2024] Official Repository for the paper "Transformers Get Stable: An End-to-End Signal Propagation Theory for Language Models"☆10 · Updated last year
- Triton implementation of bi-directional (non-causal) linear attention☆51 · Updated 6 months ago
- Code for "RSQ: Learning from Important Tokens Leads to Better Quantized LLMs"☆18 · Updated 2 months ago
- [Oral; NeurIPS OPT 2024] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers☆13 · Updated 4 months ago
- JAX Scalify: end-to-end scaled arithmetic☆16 · Updated 9 months ago
- The open-source materials for the paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity"☆25 · Updated 9 months ago
- Repository for "TESS-2: A Large-Scale, Generalist Diffusion Language Model"☆47 · Updated 5 months ago
- Here we will test various linear attention designs.☆62 · Updated last year
- Implementation of 2-simplicial attention proposed by Clift et al. (2019) and the recent attempt to make it practical in Fast and Simplex, Ro…☆43 · Updated 3 weeks ago
- ☆13 · Updated last year
- [NeurIPS 2023] Sparse Modular Activation for Efficient Sequence Modeling☆38 · Updated last year
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models☆33 · Updated last year
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"☆121 · Updated last month
- ☆53 · Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.☆48 · Updated 2 years ago
- OTOv1-v3, NeurIPS, ICLR, TMLR, DNN Training, Compression, Structured Pruning, Erasing Operators, CNN, Diffusion, LLM☆46 · Updated 10 months ago
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"☆119 · Updated last year
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"☆28 · Updated last year
- PyTorch implementation of StableMask (ICML'24)☆13 · Updated last year
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)☆24 · Updated last year
- ☆27 · Updated last year
- Official code implementation for "Preference Alignment with Flow Matching" (NeurIPS 2024)☆57 · Updated 9 months ago
- ☆15 · Updated 2 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs☆19 · Updated last year
- Tiny re-implementation of MDM in the style of LLaDA and the nano-gpt speedrun☆55 · Updated 5 months ago
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind☆127 · Updated 11 months ago