itsnamgyu / block-transformerLinks

Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024)

☆160

Alternatives and similar repositories for block-transformer

Users that are interested in block-transformer are comparing it to the libraries listed below

Sorting:

jxiw / MambaInLlama
[NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models
☆225Updated 3 months ago
SalesforceAIResearch / GemFilter
☆83Updated 6 months ago
HanGuo97 / lq-lora
☆127Updated last year
lucidrains / coconut-pytorch
Implementation of 🥥 Coconut, Chain of Continuous Thought, in Pytorch
☆178Updated last month
wdlctc / mini-s
☆51Updated 9 months ago
FasterDecoding / BitDelta
☆199Updated 8 months ago
RobertCsordas / moe_attention
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
☆98Updated 10 months ago
whyNLP / LCKV
Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance…
☆151Updated 3 months ago
wuhy68 / Parameter-Efficient-MoE
Parameter-Efficient Sparsity Crafting From Dense to Mixture-of-Experts for Instruction Tuning on General Tasks (EMNLP'24)
☆146Updated 10 months ago
jeffreysijuntan / lloco
The official repo for "LLoCo: Learning Long Contexts Offline"
☆118Updated last year
minyoungg / LTE
☆68Updated last year
microsoft / LongRoPE
LongRoPE is a novel method that can extends the context window of pre-trained LLMs to an impressive 2048k tokens.
☆237Updated 11 months ago
astramind-ai / Mixture-of-depths
Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
☆167Updated last year
TRI-ML / linear_open_lm
A repository for research on medium sized language models.
☆78Updated last year
DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆141Updated this week
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆216Updated last year
RobertCsordas / moeut
☆83Updated 11 months ago
hahnyuan / PB-LLM
PB-LLM: Partially Binarized Large Language Models
☆153Updated last year
RWKV / RWKV-LM
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best…
☆51Updated 4 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆127Updated 8 months ago
sanyalsunny111 / LLM-Inheritune
This is the official repository for Inheritune.
☆112Updated 5 months ago
lucidrains / PEER-pytorch
Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind
☆127Updated 11 months ago
VITA-Group / Q-GaLore
Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients.
☆198Updated last year
Infini-AI-Lab / gsm_infinite
☆51Updated last month
kyleliang919 / Online-Subspace-Descent
[NeurIPS 2024] Low rank memory efficient optimizer without SVD
☆30Updated last month
ScalingIntelligence / large_language_monkeys
☆101Updated 10 months ago
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆244Updated 6 months ago
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆43Updated 7 months ago
schwartz-lab-NLP / TOVA
Token Omission Via Attention
☆128Updated 9 months ago
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆47Updated 3 months ago