YuchuanTian / DiJiangLinks

[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear attention mechanism.

☆102

Alternatives and similar repositories for DiJiang

Users that are interested in DiJiang are comparing it to the libraries listed below

Sorting:

YuchuanTian / RethinkTinyLM
[ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models”
☆122Updated 6 months ago
berlino / gated_linear_attention
☆106Updated last year
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆130Updated last year
shaochenze / PatchTrain
Code for paper "Patch-Level Training for Large Language Models"
☆86Updated 8 months ago
OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆85Updated last year
teelinsan / parallel-decoding
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
☆119Updated last year
RobertCsordas / moe_attention
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
☆98Updated 10 months ago
zhixuan-lin / forgetting-transformer
[ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"
☆118Updated last month
htqin / IR-QLoRA
[ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti…
☆67Updated last year
nbasyl / DoRA
Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation"
☆124Updated last year
ldery / Bonsai
Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"
☆28Updated last year
OpenNLPLab / Transnormer
[EMNLP 2022] Official implementation of Transnormer in our EMNLP 2022 paper - The Devil in Linear Transformer
☆61Updated 2 years ago
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆85Updated 7 months ago
xichen-fy / Fira
Fira: Can We Achieve Full-rank Training of LLMs Under Low-rank Constraint?
☆112Updated 9 months ago
DAMO-NLP-SG / CLEX
[ICLR 2024] CLEX: Continuous Length Extrapolation for Large Language Models
☆78Updated last year
kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.
☆48Updated 2 years ago
microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
☆30Updated last year
SalesforceAIResearch / GemFilter
☆83Updated 6 months ago
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆62Updated last year
itsnamgyu / block-transformer
Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)
☆160Updated 3 months ago
sustcsonglin / linear-attention-and-beyond-slides
☆79Updated 5 months ago
HazyResearch / prefix-linear-attention
☆56Updated last year
IST-DASLab / RoSA
Official implementation of the ICML 2024 paper RoSA (Robust Adaptation)
☆41Updated last year
OpenNLPLab / TransnormerLLM
Official implementation of TransNormerLLM: A Faster and Better LLM
☆247Updated last year
GATECH-EIC / Linearized-LLM
[ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
☆32Updated last year
RobertCsordas / moeut
☆83Updated 11 months ago
OpenSparseLLMs / MoM
☆95Updated 3 months ago
jxiw / MambaInLlama
[NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models
☆225Updated 3 months ago
TsinghuaC3I / Fourier-Position-Embedding
[ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization
☆82Updated 2 months ago
juvi21 / CoPE-cuda
Contextual Position Encoding but with some custom CUDA Kernels https://arxiv.org/abs/2405.18719
☆22Updated last year