JonasGeiping / linear_cross_entropy_loss

A fusion of a linear layer and a cross-entropy loss, written for PyTorch in Triton.

☆69

Alternatives and similar repositories for linear_cross_entropy_loss

Users interested in linear_cross_entropy_loss are comparing it to the libraries listed below.

mgmalek / efficient_cross_entropy
☆121 Updated last year
berlino / seq_icl
☆53 Updated last year
mlfoundations / scaling
Language models scale reliably with over-training and on downstream tasks
☆100 Updated last year
epfml / dynamic-sparse-flash-attention
☆149 Updated 2 years ago
samsja / muon_fsdp_2
Muon fsdp 2
☆44 Updated 2 months ago
HazyResearch / based
Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
☆241 Updated 4 months ago
epfml / schedules-and-scaling
Code for NeurIPS 2024 Spotlight: "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations"
☆84 Updated 11 months ago
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78 Updated last year
JeanKaddour / NoTrainNoGain
Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)
☆80 Updated 2 years ago
insuhan / hyper-attn
☆83 Updated last year
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60 Updated last year
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆91 Updated 3 months ago
McGill-NLP / length-generalization
Code for the paper "The Impact of Positional Encoding on Length Generalization in Transformers", NeurIPS 2023
☆136 Updated last year
cloneofsimo / min-fsdp
☆91 Updated last year
kyo-takano / chinchilla
A toolkit for scaling law research ⚖
☆52 Updated 8 months ago
sustcsonglin / mamba-triton
☆48 Updated last year
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47 Updated last year
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆216 Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆195 Updated 4 months ago
epfml / llm-baselines
nanoGPT-like codebase for LLM training
☆107 Updated 5 months ago
HazyResearch / zoology
Understand and test language model architectures on synthetic tasks.
☆233 Updated 3 weeks ago
teelinsan / parallel-decoding
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
☆120 Updated last year
microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
☆29 Updated last year
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆246 Updated 3 weeks ago
microsoft / mutransformers
some common Huggingface transformers in maximal update parametrization (µP)
☆85 Updated 3 years ago
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆61 Updated last year
EleutherAI / nanoGPT-mup
The simplest, fastest repository for training/finetuning medium-sized GPTs.
☆166 Updated 3 months ago
HanGuo97 / lq-lora
☆127 Updated last year
kyleliang919 / Online-Subspace-Descent
[NeurIPS 2024] Low rank memory efficient optimizer without SVD
☆30 Updated 3 months ago
RobertCsordas / moeut
☆86 Updated last year