Qualcomm-AI-research / llm-surgeon
☆23 · Updated 5 months ago
Related projects
Alternatives and complementary repositories for llm-surgeon
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) · ☆79 · Updated last year
- Code for studying the super weight in LLMs · ☆16 · Updated last week
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry · ☆38 · Updated 10 months ago
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) · ☆53 · Updated last month
- Fast and memory-efficient exact attention · ☆27 · Updated last week
- Language models scale reliably with over-training and on downstream tasks · ☆94 · Updated 7 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" · ☆56 · Updated last month
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference · ☆37 · Updated this week
- Code repository for the public reproduction of the language modelling experiments on "MatFormer: Nested Transformer for Elastic Inference… · ☆18 · Updated last year
- Triton Implementation of HyperAttention Algorithm · ☆46 · Updated 11 months ago
- Official code for the paper "Attention as a Hypernetwork" · ☆23 · Updated 5 months ago
- Simple and efficient pytorch-native transformer training and inference (batched) · ☆61 · Updated 7 months ago
- Code for the paper "Why Transformers Need Adam: A Hessian Perspective" · ☆42 · Updated 6 months ago
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" · ☆24 · Updated 7 months ago
- This repo is based on https://github.com/jiaweizzhao/GaLore · ☆19 · Updated 2 months ago
- LLM KV cache compression made easy · ☆64 · Updated last week
- Using FlexAttention to compute attention with different masking patterns (see the sketch after this list) · ☆40 · Updated 2 months ago
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"☆28Updated 7 months ago
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode…☆78Updated 2 months ago
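For the FlexAttention entry above, here is a minimal sketch of what "different masking patterns" looks like in practice, using PyTorch's `torch.nn.attention.flex_attention` API (PyTorch >= 2.5). The shapes, the `causal_mask` function, and the CUDA device are illustrative assumptions, not code taken from the linked repo:

```python
# Minimal FlexAttention sketch (assumes PyTorch >= 2.5 and a CUDA device;
# shapes and the mask function are illustrative, not from the linked repo).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 4, 128, 64  # batch, heads, sequence length, head dim

def causal_mask(b, h, q_idx, kv_idx):
    # Allow each query position to attend only to itself and earlier positions.
    return q_idx >= kv_idx

# Precompute a block-sparse mask so fully masked-out blocks are skipped entirely.
block_mask = create_block_mask(causal_mask, B=B, H=H, Q_LEN=S, KV_LEN=S, device="cuda")

q, k, v = (torch.randn(B, H, S, D, device="cuda") for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)  # -> (B, H, S, D)
```

Swapping in a different `mask_mod` (sliding-window, prefix-LM, document masking) reuses the same `flex_attention` call unchanged; wrapping the call in `torch.compile` is what recovers fused-kernel performance.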