transformer-vq / transformer_vqLinks

☆199

Alternatives and similar repositories for transformer_vq

Users that are interested in transformer_vq are comparing it to the libraries listed below

Sorting:

TsinghuaC3I / Fourier-Position-Embedding
[ICML 2025] Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization
☆103Updated 5 months ago
Chaos96 / fourierft
☆148Updated last year
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆333Updated 8 months ago
YuchuanTian / DiJiang
[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at…
☆104Updated last year
Outsider565 / LoRA-GA
☆213Updated last year
zhixuan-lin / forgetting-transformer
[ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning
☆133Updated 2 weeks ago
OpenNLPLab / TransnormerLLM
Official implementation of TransNormerLLM: A Faster and Better LLM
☆248Updated last year
berlino / gated_linear_attention
☆105Updated last year
OpenSparseLLMs / MoM
☆106Updated 2 months ago
OpenNLPLab / Transnormer
[EMNLP 2022] Official implementation of Transnormer in our EMNLP 2022 paper - The Devil in Linear Transformer
☆63Updated 2 years ago
nbasyl / DoRA
Official implementation of "DoRA: Weight-Decomposed Low-Rank Adaptation"
☆124Updated last year
fkodom / grouped-query-attention-pytorch
(Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …
☆182Updated last year
WailordHe / DenseSSM
A repository for DenseSSMs
☆89Updated last year
NVlabs / GatedDeltaNet
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
☆376Updated 2 months ago
VITA-Group / LiGO
[ICLR 2023] "Learning to Grow Pretrained Models for Efficient Transformer Training" by Peihao Wang, Rameswar Panda, Lucas Torroba Hennige…
☆92Updated last year
chuanyang-Zheng / DAPE
The this is the official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation"
☆39Updated last year
bojone / FSQ
Keras implement of Finite Scalar Quantization
☆85Updated 2 years ago
fla-org / flash-bidirectional-linear-attention
Triton implement of bi-directional (non-causal) linear attention
☆56Updated 9 months ago
HKUNLP / efficient-attention
[EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling
☆87Updated 2 years ago
thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆98Updated 11 months ago
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆216Updated last year
shaochenze / PatchTrain
Code for paper "Patch-Level Training for Large Language Models"
☆93Updated last week
OpenNLPLab / Tnn
[ICLR 2023] Official implementation of Transnormer in our ICLR 2023 paper - Toeplitz Neural Network for Sequence Modeling
☆80Updated last year
sustcsonglin / linear-attention-and-beyond-slides
☆95Updated 8 months ago
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆308Updated last week
htqin / IR-QLoRA
[ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti…
☆67Updated last year
lucidrains / soft-moe-pytorch
Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch
☆334Updated 7 months ago
zhijie-group / Discrete-Diffusion-Forcing
Discrete Diffusion Forcing (D2F): dLLMs Can Do Faster-Than-AR Inference
☆197Updated last month
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆150Updated last year
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆132Updated 2 years ago