nanowell / Q-Sparse-LLM
My implementation of "Q-Sparse: All Large Language Models can be Fully Sparsely-Activated"
☆30 · Updated 3 months ago
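For context on what the implementation covers: the Q-Sparse paper sparsifies the activations entering each linear projection with a hard top-K mask, and trains through the mask with a straight-through estimator so gradients still reach every entry. Below is a minimal PyTorch sketch of that mechanism as I read the paper; it is illustrative only (the function name and tensor shapes are made up), not code from this repository.

```python
import torch

def topk_sparsify(x: torch.Tensor, k: int) -> torch.Tensor:
    # Hard top-K mask: keep the k largest-magnitude entries per row.
    _, idx = torch.topk(x.abs(), k, dim=-1)
    mask = torch.zeros_like(x).scatter(-1, idx, 1.0)
    sparse = x * mask
    # Straight-through estimator: the forward pass sees the sparsified
    # activations, while the backward pass treats the masking as
    # identity, so the gradient reaches all entries of x.
    return x + (sparse - x).detach()

# Illustrative use: sparsify activations before a linear projection.
x = torch.randn(4, 1024, requires_grad=True)   # (batch, hidden)
w = torch.randn(1024, 4096)
y = topk_sparsify(x, k=256) @ w                # ~25% of inputs active
y.sum().backward()                             # grads flow to all of x
```

The `x + (sparse - x).detach()` trick is what makes the hard mask trainable: the masked tensor is used in the forward computation, but autograd differentiates it as if it were the dense input.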
Related projects
Alternatives and complementary repositories for Q-Sparse-LLM
- GoldFinch and other hybrid transformer components ☆39 · Updated 4 months ago
- A repository for research on medium-sized language models. ☆74 · Updated 5 months ago
- ☆35 · Updated 3 weeks ago
- QuIP quantization ☆46 · Updated 8 months ago
- ☆62 · Updated 3 months ago
- From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, … ☆43 · Updated 4 months ago
- Implementation of "LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models"☆42Updated last week
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"☆92Updated last month
- Here we will test various linear attention designs.☆56Updated 6 months ago
- Collection of autoregressive model implementation☆67Updated this week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters☆104Updated last month
- Demonstration that finetuning RoPE model on larger sequences than the pre-trained model adapts the model context limit☆63Updated last year
- ☆63 · Updated last month
- ☆27 · Updated 5 months ago
- Triton implementation of the HyperAttention algorithm ☆46 · Updated 11 months ago
- This repository contains code for the MicroAdam paper. ☆12 · Updated 4 months ago
- Implementation of the paper "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention" from Google, in PyTorch ☆52 · Updated last week
- Script for processing OpenAI's PRM800K process supervision dataset into an Alpaca-style instruction-response format ☆27 · Updated last year
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry ☆38 · Updated 10 months ago
- Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks ☆129 · Updated 2 months ago
- SparseGPT + GPTQ compression of LLMs such as LLaMA, OPT, and Pythia ☆41 · Updated last year
- Using FlexAttention to compute attention with different masking patterns (a minimal sketch follows this list) ☆40 · Updated last month
- ☆40 · Updated 2 weeks ago
- This repo is based on https://github.com/jiaweizzhao/GaLore ☆19 · Updated 2 months ago
- ☆45 · Updated 9 months ago
- Linear Attention Sequence Parallelism (LASP) ☆64 · Updated 5 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" ☆56 · Updated last month
- ☆35 · Updated 9 months ago
- ☆46 · Updated last week
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
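On the FlexAttention entry above: FlexAttention (available in PyTorch 2.5+ under `torch.nn.attention.flex_attention`) expresses masking patterns as a `score_mod` callback applied to the attention scores. A minimal causal-masking sketch, with made-up tensor shapes:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Keep the score where the query position can attend to the key
    # position; otherwise force it to -inf so softmax gives it zero weight.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q = k = v = torch.randn(1, 8, 128, 64)  # (batch, heads, seq_len, head_dim)
out = flex_attention(q, k, v, score_mod=causal)
# In practice you would wrap flex_attention in torch.compile so the
# score_mod callback is fused into a single attention kernel.
```

Other masking patterns (sliding-window, prefix-LM, document masking) follow the same pattern: only the predicate inside the callback changes.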