YuchuanTian / DiJiang
[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear attention mechanism.
☆99 · Updated 7 months ago
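
The listing above describes the method in a single line, so for orientation here is a minimal, hedged sketch of what "DCT-based linear attention" generally means: queries and keys are passed through a Discrete Cosine Transform feature map φ so that softmax(QKᵀ)V can be approximated by φ(Q)(φ(K)ᵀV), which is linear in sequence length. The feature map, the ReLU positivity trick, the tensor shapes, and the function names below are illustrative assumptions, not the official DiJiang implementation (causal masking is also omitted for brevity).

```python
# Illustrative sketch of kernelized linear attention with a DCT-style feature map.
# NOT the official DiJiang code; all design choices here are assumptions.
import math
import torch


def dct_feature_map(x: torch.Tensor) -> torch.Tensor:
    """Project queries/keys through an orthonormal DCT-II basis, then clamp to be positive.

    x: (batch, heads, seq_len, head_dim)
    """
    d = x.shape[-1]
    n = torch.arange(d, device=x.device, dtype=x.dtype)
    k = n.view(-1, 1)
    # Orthonormal DCT-II basis: basis[k, n] = cos(pi/d * (n + 0.5) * k), scaled row-wise.
    basis = torch.cos(math.pi / d * (n + 0.5) * k) * math.sqrt(2.0 / d)
    basis[0] = basis[0] / math.sqrt(2.0)  # DC row gets the smaller orthonormal scale
    return torch.relu(x @ basis.T)  # positive features keep attention weights non-negative


def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Replace softmax(QK^T)V with phi(Q)(phi(K)^T V), avoiding the N x N attention matrix."""
    q, k = dct_feature_map(q), dct_feature_map(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                          # key/value summary, computed once
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)   # per-query normalizer
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)


if __name__ == "__main__":
    b, h, n, d = 2, 4, 128, 64
    out = linear_attention(torch.randn(b, h, n, d), torch.randn(b, h, n, d), torch.randn(b, h, n, d))
    print(out.shape)  # torch.Size([2, 4, 128, 64])
```

The point of the sketch is the cost profile: `kv` and the normalizer are accumulated in O(N·d²) rather than forming the O(N²) attention matrix, which is the property the repositories listed below also target in various ways.
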
Alternatives and similar repositories for DiJiang:
Users interested in DiJiang are comparing it to the repositories listed below.
- [ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models” ☆120 · Updated 2 weeks ago
- ☆98 · Updated 10 months ago
- Low-bit optimizers for PyTorch ☆125 · Updated last year
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆149 · Updated last month
- When it comes to optimizers, it's always better to be safe than sorry ☆166 · Updated last week
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆249 · Updated 9 months ago
- Code for the paper "Patch-Level Training for Large Language Models" ☆77 · Updated 2 months ago
- [EMNLP 2022] Official implementation of TransNormer in the EMNLP 2022 paper "The Devil in Linear Transformer" ☆59 · Updated last year
- ☆140 · Updated last year
- [ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti… ☆60 · Updated 9 months ago
- Linear Attention Sequence Parallelism (LASP) ☆76 · Updated 7 months ago
- ☆108 · Updated 4 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆96 · Updated 4 months ago
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind ☆115 · Updated 5 months ago
- ☆186 · Updated last year
- [ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models ☆76 · Updated 10 months ago
- ☆80 · Updated 4 months ago
- Some preliminary explorations of Mamba's context scaling. ☆209 · Updated 11 months ago
- Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM ☆54 · Updated last month
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆28 · Updated 7 months ago
- Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation" ☆123 · Updated 9 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆76 · Updated 2 months ago
- Converting Mixtral-8x7B to Mixtral-[1~7]x7B ☆20 · Updated 10 months ago
- Official implementation of TransNormerLLM: A Faster and Better LLM ☆238 · Updated last year
- Here we will test various linear attention designs. ☆58 · Updated 9 months ago
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes" ☆27 · Updated 10 months ago
- An algorithm for static activation quantization of LLMs ☆111 · Updated 2 weeks ago
- Official GitHub repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024] ☆130 · Updated 4 months ago
- ☆55 · Updated 2 weeks ago
- Code for the paper "Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning" ☆65 · Updated last year