Qualcomm-AI-research / llm-surgeon
☆ 21 · Updated 3 months ago
Related projects:
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) · ☆ 77 · Updated last year
- Code accompanying the paper "Massive Activations in Large Language Models" · ☆ 104 · Updated 6 months ago
- A fused linear layer and cross-entropy loss, written for PyTorch in Triton (sketch after this list) · ☆ 48 · Updated last month
- Language models scale reliably with over-training and on downstream tasks · ☆ 91 · Updated 5 months ago
- Official implementation of Phi-Mamba, a MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models) · ☆ 61 · Updated last week
- PyTorch implementation of Soft MoE by Google Brain from "From Sparse to Soft Mixtures of Experts" (https://arxiv.org/pdf/2308.00951.pdf) (sketch after this list) · ☆ 62 · Updated 11 months ago
- Data for "Datamodels: Predicting Predictions with Training Data" · ☆ 87 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN (sketch after this list) · ☆ 66 · Updated 3 months ago
- Triton implementation of FlashAttention2 that adds custom masks (sketch after this list) · ☆ 62 · Updated last month
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) (sketch after this list) · ☆ 51 · Updated 3 months ago
- Implementation of the MLSys 2023 paper "Cuttlefish: Low-rank Model Training without All The Tuning" (sketch after this list) · ☆ 42 · Updated last year
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" · ☆ 63 · Updated 3 months ago
- Repository for the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" (sketch after this list) · ☆ 99 · Updated 6 months ago
- Code for "Training Neural Networks with Fixed Sparse Masks" (NeurIPS 2021) (sketch after this list) · ☆ 54 · Updated 2 years ago
- Code for "NOLA: Compressing LoRA using Linear Combination of Random Basis" (sketch after this list) · ☆ 46 · Updated 3 weeks ago
- Fine-tune Google's pre-trained ViT models from the Hugging Face model hub · ☆ 18 · Updated 3 years ago
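Sketches for the entries flagged above follow. First, the fused linear + cross-entropy loss: the point of the fusion is to avoid ever materializing the full (tokens × vocab) logit matrix. A minimal plain-PyTorch sketch of the same computation, chunked over rows rather than fused in Triton (function name and chunk size are illustrative assumptions, not the repo's API):

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, weight, targets, chunk=4096):
    # hidden: (N, d), weight: (V, d), targets: (N,)
    # Mean CE over logits = hidden @ weight.T, one chunk of rows at a time,
    # so at most (chunk, V) logits exist at once. The Triton fusion goes
    # further and never writes the logits to global memory at all.
    total = hidden.new_zeros(())
    for h, t in zip(hidden.split(chunk), targets.split(chunk)):
        total = total + F.cross_entropy(h @ weight.T, t, reduction="sum")
    return total / targets.numel()
```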
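Soft MoE replaces hard token-to-expert routing with soft dispatch and combine weights over expert slots. A minimal sketch following the paper's equations (module and parameter names are illustrative, not the linked implementation's API):

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    def __init__(self, dim, n_experts=4, slots_per_expert=1):
        super().__init__()
        self.slots_per_expert = slots_per_expert
        self.n_slots = n_experts * slots_per_expert
        self.phi = nn.Parameter(torch.randn(dim, self.n_slots) * dim ** -0.5)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                           nn.Linear(4 * dim, dim)) for _ in range(n_experts)]
        )

    def forward(self, x):                  # x: (batch, tokens, dim)
        logits = x @ self.phi              # (b, m, n_slots)
        d = logits.softmax(dim=1)          # dispatch: normalize over tokens
        c = logits.softmax(dim=2)          # combine: normalize over slots
        slots = torch.einsum("bms,bmd->bsd", d, x)   # slot inputs
        outs = torch.cat(
            [e(s) for e, s in zip(self.experts,
                                  slots.split(self.slots_per_expert, dim=1))],
            dim=1,
        )                                  # (b, n_slots, dim)
        return torch.einsum("bms,bsd->bmd", c, outs)
```

Because every token contributes to every slot, the layer is fully differentiable and needs no load-balancing losses, at the cost of losing hard sparsity.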
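SoftmaxN adds a constant n to the softmax denominator so attention can assign weight to "nothing", which is intended to damp outlier activations. A reference implementation of the formula; the linked repo supplies the fused CUDA/Triton kernels:

```python
import torch

def softmax_n(x: torch.Tensor, n: float = 1.0, dim: int = -1) -> torch.Tensor:
    # softmax_n(x)_i = exp(x_i) / (n + sum_j exp(x_j)); n = 0 is plain softmax.
    # Shift by the max for numerical stability; the +n term is shifted too.
    m = x.amax(dim=dim, keepdim=True)
    e = torch.exp(x - m)
    return e / (n * torch.exp(-m) + e.sum(dim=dim, keepdim=True))
```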
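For the custom-mask FlashAttention2 entry, these are the reference semantics the fused Triton kernel reproduces without materializing the (seq × seq) score matrix (a plain-PyTorch stand-in, not the repo's kernel):

```python
import torch
import torch.nn.functional as F

def attention_with_custom_mask(q, k, v, mask):
    # q, k, v: (batch, heads, seq, head_dim); mask: broadcastable boolean
    # tensor where True means "may attend". A fused FlashAttention kernel
    # computes the identical output while streaming over key/value blocks.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```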
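For the early-exiting entry: the basic idea is to decode from intermediate layers and stop as soon as the prediction is confident. A hedged sketch of that core loop only; the paper's actual contribution, synchronized parallel decoding across exited and non-exited tokens, is not shown, and all names are illustrative:

```python
import torch

@torch.no_grad()
def early_exit_next_token(layers, lm_head, h, threshold=0.9):
    # h: (1, seq, d) hidden states entering the decoder stack. After each
    # layer, decode the last position and exit once the top probability
    # clears the threshold, skipping the remaining layers.
    tok, depth = None, 0
    for depth, layer in enumerate(layers, start=1):
        h = layer(h)
        probs = lm_head(h[:, -1]).softmax(dim=-1)
        conf, tok = probs.max(dim=-1)
        if conf.item() >= threshold:
            break
    return tok, depth
```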
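Cuttlefish trains with factorized (low-rank) layers. A sketch of the factorization itself; the paper's automatic selection of the rank and of when to switch from full-rank warm-up to factorized training is omitted (names are illustrative):

```python
import torch
import torch.nn as nn

class FactorizedLinear(nn.Module):
    # Dense W (out x in) replaced by U @ V with rank r << min(out, in),
    # shrinking both parameter count and training FLOPs.
    def __init__(self, in_features: int, out_features: int, rank: int):
        super().__init__()
        self.v = nn.Linear(in_features, rank, bias=False)   # V: (r, in)
        self.u = nn.Linear(rank, out_features)              # U: (out, r)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.u(self.v(x))
```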
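Parallel decoding for translation iterates a whole block of draft tokens to a fixed point (Jacobi iteration) instead of generating one token per forward pass. A greedy-decoding sketch under a stated assumption about the model's outputs; not the repo's API:

```python
import torch

@torch.no_grad()
def jacobi_decode_block(model, prefix, block_len=8, max_iters=10, fill_id=0):
    # Guess a block of tokens, then refine every position in parallel: each
    # iteration costs one forward pass over the block, and the fixed point
    # coincides with greedy autoregressive decoding. `model(ids)` is assumed
    # to return next-token logits of shape (1, len, vocab).
    draft = torch.full((1, block_len), fill_id, dtype=torch.long)
    for _ in range(max_iters):
        logits = model(torch.cat([prefix, draft], dim=1))
        new = logits[:, prefix.size(1) - 1 : -1].argmax(dim=-1)
        if torch.equal(new, draft):   # converged: no position changed
            break
        draft = new
    return draft
```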
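Fixed sparse masks: choose a small, fixed subset of parameters before training and update only those. A sketch using gradient hooks; computing the importance scores (the paper uses empirical Fisher information) is left to the caller:

```python
import torch

def install_fixed_sparse_masks(model, scores, keep_fraction=0.005):
    # scores: dict mapping parameter name -> importance tensor of the same
    # shape. Keep the top-scoring entries and zero every other gradient, so
    # the set of trainable weights is fixed for the whole run.
    for name, p in model.named_parameters():
        s = scores[name].flatten()
        k = max(1, int(keep_fraction * s.numel()))
        mask = torch.zeros_like(s)
        mask[s.topk(k).indices] = 1.0
        p.register_hook(lambda g, m=mask.view_as(p): g * m)
```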
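NOLA compresses LoRA by expressing each low-rank factor as a linear combination of frozen random basis matrices, so only the mixture coefficients are trained, and the bases can be regenerated from a seed instead of stored. A hedged sketch (class name, rank, k, and the init scheme are illustrative assumptions):

```python
import torch
import torch.nn as nn

class NOLALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, k: int = 64, seed: int = 0):
        super().__init__()
        self.base = base.requires_grad_(False)
        out_f, in_f = base.weight.shape
        g = torch.Generator().manual_seed(seed)   # bases regenerable from seed
        self.register_buffer("A", torch.randn(k, out_f, rank, generator=g))
        self.register_buffer("B", torch.randn(k, rank, in_f, generator=g))
        self.alpha = nn.Parameter(torch.randn(k) / k)  # random init
        self.beta = nn.Parameter(torch.zeros(k))       # so delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        A = torch.einsum("k,kor->or", self.alpha, self.A)  # (out, rank)
        B = torch.einsum("k,kri->ri", self.beta, self.B)   # (rank, in)
        return self.base(x) + x @ (A @ B).T
```

Only 2k scalars per layer are stored here, versus rank × (in + out) values for plain LoRA.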