sjelassi / transformers_ssm_copyLinks

☆33

Alternatives and similar repositories for transformers_ssm_copy

Users that are interested in transformers_ssm_copy are comparing it to the libraries listed below

Sorting:

dangxingyu / rnn-icrag
Official repository of paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval"
☆27Updated last year
shawntan / stickbreaking-attention
Stick-breaking attention
☆61Updated 3 months ago
berlino / seq_icl
☆53Updated last year
Leooyii / LCEG
Long Context Extension and Generalization in LLMs
☆62Updated last year
qiuzh20 / gated_attention
The official implementation for [NeurIPS2025 Oral] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink…
☆99Updated last month
sustcsonglin / mamba-triton
☆48Updated last year
PKU-ML / LongPPL
Code for ICLR 2025 Paper "What is Wrong with Perplexity for Long-context Language Modeling?"
☆103Updated 2 weeks ago
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆84Updated last year
gregorbachmann / Next-Token-Failures
☆103Updated last year
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78Updated last year
jopetty / word-problem
Experiments on the impact of depth in transformers and SSMs.
☆36Updated last week
HazyResearch / prefix-linear-attention
☆56Updated last year
princeton-nlp / LM-Kernel-FT
A Kernel-Based View of Language Model Fine-Tuning https://arxiv.org/abs/2210.05643
☆78Updated 2 years ago
glassroom / heinsen_attention
Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)
☆24Updated last year
VITA-Group / Random-MoE-as-Dropout
[ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal…
☆55Updated 2 years ago
chijames / KERPLE
☆19Updated 3 years ago
GSYfate / knnlm-limits
Official code repo for paper "Great Memory, Shallow Reasoning: Limits of kNN-LMs"
☆24Updated 6 months ago
formll / resolving-scaling-law-discrepancies
☆20Updated last year
sail-sg / SkyLadder
The official repository for SkyLadder: Better and Faster Pretraining via Context Window Scheduling
☆35Updated 2 weeks ago
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆80Updated 2 years ago
Shark-NLP / CAB
☆31Updated 2 years ago
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47Updated last year
Infini-AI-Lab / Kinetics
Kinetics: Rethinking Test-Time Scaling Laws
☆81Updated 3 months ago
RobertCsordas / moeut
☆86Updated last year
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆69Updated last year
OpenNLPLab / HGRN2
HGRN2: Gated Linear RNNs with State Expansion
☆55Updated last year
Doraemonzzz / Awesome-Triton-Resources
Awesome Triton Resources
☆36Updated 6 months ago
princeton-nlp / Edge-Pruning
[NeurIPS 2024 Spotlight] Code and data for the paper "Finding Transformer Circuits with Edge Pruning".
☆61Updated 2 months ago
microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
☆29Updated last year
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100Updated last year