seungrokj / ai_sprint_paris
☆14 · Updated 2 months ago
Alternatives and similar repositories for ai_sprint_paris
Users interested in ai_sprint_paris are comparing it to the libraries listed below.
- PyTorch Single Controller ☆423 · Updated this week
- Attention Kernels for Symmetric Power Transformers ☆115 · Updated 3 weeks ago
- seqax = sequence modeling + JAX ☆167 · Updated 2 months ago
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ☆121 · Updated 2 weeks ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI ☆142 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆57 · Updated this week
- How to ensure correctness and ship LLM-generated kernels in PyTorch ☆58 · Updated last week
- A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism ☆141 · Updated 5 months ago
- Distributed pretraining of large language models (LLMs) on cloud TPU slices, with JAX and Equinox ☆24 · Updated 11 months ago
- Experiment of using Tangent to autodiff Triton ☆81 · Updated last year
- Minimal yet performant LLM examples in pure JAX ☆160 · Updated this week
- A zero-to-one guide on scaling modern transformers with n-dimensional parallelism ☆91 · Updated 3 weeks ago
- ☆224 · Updated 3 months ago
- Automatic differentiation for Triton kernels ☆11 · Updated last month
- A bunch of kernels that might make stuff slower 😉 ☆59 · Updated this week
- SIMD quantization kernels ☆87 · Updated 2 weeks ago
- ComputeEval: Evaluating Large Language Models for CUDA Code Generation. A framework designed to generate and evaluate CUDA code from Lar… ☆66 · Updated 3 months ago
- High-performance SGEMM on CUDA devices ☆101 · Updated 8 months ago
- NSA Triton kernels written with GPT-5 and Opus 4.1 ☆65 · Updated last month
- A JAX-native LLM post-training library ☆150 · Updated this week
- Scalable and stable parallelization of nonlinear RNNs ☆22 · Updated 3 weeks ago
- 🧱 Modula software package ☆239 · Updated last month
- ☆281 · Updated last year
- Custom Triton kernels for training Karpathy's nanoGPT ☆19 · Updated 11 months ago
- 📄 Small-batch-size training for language models ☆62 · Updated 3 weeks ago
- Learn CUDA with PyTorch ☆84 · Updated this week
- Latent Program Network (from the "Searching Latent Program Spaces" paper) ☆98 · Updated 6 months ago
- Experimental GPU language with meta-programming ☆23 · Updated last year
- Tree Attention: topology-aware decoding for long-context attention on GPU clusters ☆129 · Updated 9 months ago
- A parallel framework for training deep neural networks ☆63 · Updated 6 months ago