deepreinforce-ai / CUDA-L1
CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning
★247, updated last month
Alternatives and similar repositories for CUDA-L1
Users interested in CUDA-L1 are comparing it to the libraries listed below.
- QeRL enables RL for 32B LLMs on a single H100 GPU. (★459, updated last week)
- Build compute kernels (★192, updated this week)
- Load compute kernels from the Hub (★348, updated this week)
- Efficient LLM Inference over Long Sequences (★392, updated 5 months ago)
- A collection of tricks and tools to speed up transformer models (★189, updated last month)
- Simple & Scalable Pretraining for Neural Architecture Research (★304, updated last month)
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters (★130, updated last year)
- PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging Face support (★194, updated this week)
- (no description) (★256, updated 6 months ago)
- Work in progress. (★75, updated 2 weeks ago)
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference. (★320, updated last month)
- FlashRNN - Fast RNN Kernels with I/O Awareness (★169, updated last month)
- Quantized LLM training in pure CUDA/C++. (★220, updated last week)
- Training-free Post-training Efficient Sub-quadratic Complexity Attention, implemented with OpenAI Triton. (★148, updated last month)
- (no description) (★111, updated 3 weeks ago)
- The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in languag… (★112, updated 2 months ago)
- (no description) (★219, updated 10 months ago)
- Large multi-modal models (L3M) pre-training. (★222, updated 2 months ago)
- The evaluation framework for training-free sparse attention in LLMs (★106, updated last month)
- PyTorch implementation of models from the Zamba2 series. (★186, updated 10 months ago)
- DeMo: Decoupled Momentum Optimization (★197, updated last year)
- Flash-Muon: An Efficient Implementation of the Muon Optimizer (★223, updated 5 months ago)
- Just another reasonably minimal repo for class-conditional training of pixel-space diffusion transformers. (★138, updated 6 months ago)
- (no description) (★203, updated 11 months ago)
- Official PyTorch implementation of Hogwild! Inference: Parallel LLM Generation with a Concurrent Attention Cache (★135, updated 3 months ago)
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" (★249, updated 10 months ago)
- (no description) (★158, updated 5 months ago)
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep… (★60, updated last year)
- A safetensors extension to efficiently store sparse quantized tensors on disk (★214, updated this week)
- Official implementation of the paper "ZClip: Adaptive Spike Mitigation for LLM Pre-Training". (★139, updated 2 weeks ago)