deepreinforce-ai / CUDA-L1Links

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

☆247

Alternatives and similar repositories for CUDA-L1

Users that are interested in CUDA-L1 are comparing it to the libraries listed below

Sorting:

NVlabs / QeRL
QeRL enables RL for 32B LLMs on a single H100 GPU.
☆459Updated last week
huggingface / kernel-builder
👷 Build compute kernels
☆192Updated this week
huggingface / kernels
Load compute kernels from the Hub
☆348Updated this week
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆392Updated 5 months ago
OpenMachine-ai / transformer-tricks
A collection of tricks and tools to speed up transformer models
☆189Updated last month
microsoft / ArchScale
Simple & Scalable Pretraining for Neural Architecture Research
☆304Updated last month
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated last year
NVIDIA-NeMo / Automodel
Pytorch Distributed native training library for LLMs/VLMs with OOTB Hugging Face support
☆194Updated this week
HanGuo97 / log-linear-attention
☆256Updated 6 months ago
IST-DASLab / QuEST
Work in progress.
☆75Updated 2 weeks ago
changjonathanc / flex-nano-vllm
FlexAttention based, minimal vllm-style inference engine for fast Gemma 2 inference.
☆320Updated last month
NX-AI / flashrnn
FlashRNN - Fast RNN Kernels with I/O Awareness
☆169Updated last month
IST-DASLab / llmq
Quantized LLM training in pure CUDA/C++.
☆220Updated last week
DeepAuto-AI / hip-attention
Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton.
☆148Updated last month
IST-DASLab / Quartet
☆111Updated 3 weeks ago
facebookresearch / llm-speedrunner
The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag…
☆112Updated 2 months ago
apple / ml-recurrent-drafter
☆219Updated 10 months ago
apple / ml-l3m
Large multi-modal models (L3M) pre-training.
☆222Updated 2 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆106Updated last month
Zyphra / Zamba2
PyTorch implementation of models from the Zamba2 series.
☆186Updated 10 months ago
bloc97 / DeMo
DeMo: Decoupled Momentum Optimization
☆197Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆223Updated 5 months ago
sayakpaul / nanoDiT
Just another reasonably minimal repo for class-conditional training of pixel-space diffusion transformers.
☆138Updated 6 months ago
NVlabs / hymba
☆203Updated 11 months ago
eqimp / hogwild_llm
Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache
☆135Updated 3 months ago
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆249Updated 10 months ago
Cornell-RelaxML / qtip
☆158Updated 5 months ago
BorealisAI / neuzip
Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep…
☆60Updated last year
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆214Updated this week
bluorion-com / ZClip
Official implementation of the paper: "ZClip: Adaptive Spike Mitigation for LLM Pre-Training".
☆139Updated 2 weeks ago