deepreinforce-ai / CUDA-L1
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
☆283 Updated 2 months ago
Alternatives and similar repositories for CUDA-L1
Users who are interested in CUDA-L1 are comparing it to the repositories listed below.
- QeRL enables RL for 32B LLMs on a single H100 GPU. ☆473 Updated last month
- Block Diffusion for Ultra-Fast Speculative Decoding ☆313 Updated 2 weeks ago
- Simple & Scalable Pretraining for Neural Architecture Research ☆306 Updated last month
- PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging Face support ☆245 Updated last week
- A collection of tricks and tools to speed up transformer models ☆194 Updated last month
- Official JAX implementation of End-to-End Test-Time Training for Long Context ☆297 Updated 3 weeks ago
- Load compute kernels from the Hub ☆376 Updated this week
- FlexAttention based, minimal vLLM-style inference engine for fast Gemma 2 inference. ☆331 Updated 2 months ago
- mHC kernels implemented in CUDA ☆217 Updated this week
- 👷 Build compute kernels ☆213 Updated this week
- ☆265 Updated 7 months ago
- Efficient LLM Inference over Long Sequences ☆393 Updated 6 months ago
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆147 Updated 2 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆110 Updated 3 months ago
- ☆114 Updated last week
- Quantized LLM training in pure CUDA/C++. ☆232 Updated last week
- Multi-Turn RL Training System with AgentTrainer for Language Model Game Reinforcement Learning ☆58 Updated last month
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ☆131 Updated last year
- Official implementation for Training LLMs with MXFP4 ☆116 Updated 8 months ago
- Dion optimizer algorithm ☆419 Updated this week
- ☆218 Updated 11 months ago
- PyTorch-native post-training at scale ☆595 Updated this week
- An early research stage expert-parallel load balancer for MoE models based on linear programming. ☆485 Updated 2 months ago
- Work in progress. ☆77 Updated last month
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆250 Updated 11 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆229 Updated 7 months ago
- Samples of good AI generated CUDA kernels ☆99 Updated 7 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆66 Updated 9 months ago
- Official PyTorch implementation for Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache ☆139 Updated 5 months ago
- ☆163 Updated 6 months ago