mgmalek/efficient_cross_entropy

mgmalek / efficient_cross_entropy

☆124

Alternatives and similar repositories for efficient_cross_entropy

Users who are interested in efficient_cross_entropy are comparing it to the libraries listed below.

JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆75 · Aug 2, 2024 · Updated last year
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆81 · Jan 22, 2024 · Updated 2 years ago
sustcsonglin / mamba-triton
☆52 · Jan 28, 2024 · Updated 2 years ago
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 · Oct 5, 2024 · Updated last year
srush / tangent
Source-to-Source Debuggable Derivatives in Pure Python
☆15 · Jan 23, 2024 · Updated 2 years ago
HazyResearch / train-tk
train with kittens!
☆66 · Oct 25, 2024 · Updated last year
bdusell / stack-attention
Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns"
☆18 · Mar 15, 2024 · Updated 2 years ago
Doraemonzzz / hgru2-pytorch
☆24 · Sep 25, 2024 · Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆257 · Jun 15, 2025 · Updated last year
ethansmith2000 / fsdp_optimizers
supporting pytorch FSDP for optimizers
☆84 · Dec 8, 2024 · Updated last year
cloneofsimo / min-max-gpt
Minimal (400 LOC) implementation Maximum (multi-node, FSDP) GPT training
☆132 · Apr 17, 2024 · Updated 2 years ago
stanford-futuredata / stk
☆113 · Aug 26, 2024 · Updated last year
glassroom / heinsen_attention
Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)
☆25 · Jun 6, 2024 · Updated 2 years ago
proger / nanokitchen
Parallel Associative Scan for Language Models
☆18 · Jan 8, 2024 · Updated 2 years ago
yikangshen / megablocks
☆20 · May 30, 2024 · Updated 2 years ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆101 · Sep 19, 2025 · Updated 10 months ago
naver-ai / tablevqabench
☆46 · May 21, 2024 · Updated 2 years ago
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47 · Sep 22, 2024 · Updated last year
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆605 · May 13, 2026 · Updated 2 months ago
proger / accelerated-scan
Accelerated First Order Parallel Associative Scan
☆198 · Jan 7, 2026 · Updated 6 months ago
EleutherAI / nanoGPT-mup
The simplest, fastest repository for training/finetuning medium-sized GPTs.
☆199 · Jan 19, 2026 · Updated 6 months ago
sjelassi / transformers_ssm_copy
☆40 · Feb 26, 2024 · Updated 2 years ago
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆29 · Mar 22, 2026 · Updated 4 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆281 · Oct 3, 2025 · Updated 9 months ago
iwiwi / epochraft
Checkpointable dataset utilities for foundation model training
☆32 · Jan 29, 2024 · Updated 2 years ago
HazyResearch / zoology
Understand and test language model architectures on synthetic tasks.
☆278 · Mar 22, 2026 · Updated 4 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆115 · Jun 28, 2025 · Updated last year
insuhan / hyper-attn
☆87 · Dec 1, 2023 · Updated 2 years ago
emalach / LinearLM
Code for the paper: https://arxiv.org/pdf/2309.06979.pdf
☆21 · Jul 29, 2024 · Updated last year
Doraemonzzz / xmixers
Xmixers: A collection of SOTA efficient token/channel mixers
☆28 · Sep 4, 2025 · Updated 10 months ago
0xWelt / VibeRL
VibeRL is a Reinforcement Learning framework built essentially through vibe coding with Kimi K2.
☆17 · Updated this week
shawntan / stickbreaking-attention
Stick-breaking attention
☆63 · Jul 1, 2025 · Updated last year
imoneoi / multipack
Multipack distributed sampler for fast padding-free training of LLMs
☆207 · Aug 10, 2024 · Updated last year
bentherien / mu_learned_optimization
[Poster; ICLR 2026] [Oral; Neurips OPT2024] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
☆16 · Apr 15, 2026 · Updated 3 months ago
zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆1,037 · Sep 10, 2025 · Updated 10 months ago
cloneofsimo / min-fsdp
☆93 · Jul 5, 2024 · Updated 2 years ago
HazyResearch / based
Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
☆256 · Jun 6, 2025 · Updated last year
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆62 · Apr 25, 2024 · Updated 2 years ago
proger / hippogriff
Griffin MQA + Hawk Linear RNN Hybrid
☆89 · Apr 13, 2026 · Updated 3 months ago