kyegomez / AttentionIsOFFByOne

Implementation of "Attention Is Off By One" by Evan Miller

☆193

Alternatives and similar repositories for AttentionIsOFFByOne

Users who are interested in AttentionIsOFFByOne are comparing it to the libraries listed below.

OpenNLPLab / TransnormerLLM
Official implementation of TransNormerLLM: A Faster and Better LLM
☆247 Updated last year
OpenNLPLab / lightning-attention
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
☆323 Updated 5 months ago
fkodom / grouped-query-attention-pytorch
(Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …
☆173 Updated last year
transformer-vq / transformer_vq
☆196 Updated last year
nengwp / Lion-vs-Adam
Lion and Adam optimization comparison
☆62 Updated 2 years ago
bzhangGo / rmsnorm
Root Mean Square Layer Normalization
☆247 Updated 2 years ago
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆130 Updated last year
Arnav0400 / ViT-Slim
Official code for our CVPR'22 paper “Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space”
☆250 Updated last year
bojone / rerope
Rectified Rotary Position Embeddings
☆375 Updated last year
bojone / tiger
A Tight-fisted Optimizer
☆48 Updated 2 years ago
lucidrains / FLASH-pytorch
Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"
☆368 Updated last year
fkodom / yet-another-retnet
A simple but robust PyTorch implementation of RetNet from "Retentive Network: A Successor to Transformer for Large Language Models" (http…
☆106 Updated last year
lucidrains / soft-moe-pytorch
Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch
☆309 Updated 4 months ago
Outsider565 / LoRA-GA
☆204 Updated 9 months ago
YuchuanTian / DiJiang
[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at…
☆102 Updated last year
haochengxi / Train_Transformers_with_INT4
☆153 Updated 2 years ago
kyegomez / FlashAttention20
Get down and dirty with FlashAttention2.0 in pytorch, plug in and play no complex CUDA kernels
☆106 Updated 2 years ago
bobby-he / simplified_transformers
☆292 Updated 7 months ago
BlinkDL / RWKV-CUDA
The CUDA version of the RWKV language model ( https://github.com/BlinkDL/RWKV-LM )
☆221 Updated 7 months ago
QingruZhang / AdaLoRA
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (ICLR 2023).
☆338 Updated 2 years ago
lucidrains / memory-efficient-attention-pytorch
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"
☆379 Updated 2 years ago
bojone / FSQ
Keras implementation of Finite Scalar Quantization
☆78 Updated last year
google-research / vmoe
☆656 Updated last month
facebookresearch / LLM-QAT
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
☆305 Updated 5 months ago
OscarXZQ / weight-selection
☆182 Updated 10 months ago
shreyansh26 / FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
☆155 Updated 6 months ago
YuchuanTian / RethinkTinyLM
[ICML'24] The official implementation of “Rethinking Optimization and Architecture for Tiny Language Models”
☆122 Updated 6 months ago
WailordHe / DenseSSM
A repository for DenseSSMs
☆88 Updated last year
OpenNLPLab / Transnormer
[EMNLP 2022] Official implementation of Transnormer in our EMNLP 2022 paper - The Devil in Linear Transformer
☆61 Updated 2 years ago
syncdoth / RetNet
Huggingface compatible implementation of RetNet (Retentive Networks, https://arxiv.org/pdf/2307.08621.pdf) including parallel, recurrent,…
☆226 Updated last year