FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference.
☆337 · Nov 2, 2025 · Updated 4 months ago
Alternatives and similar repositories for flex-nano-vllm
Users who are interested in flex-nano-vllm are comparing it to the libraries listed below.
- Minimalistic 4D-parallelism distributed training framework for educational purposes ☆2,125 · Aug 26, 2025 · Updated 7 months ago
- A PyTorch-native platform for training generative AI models ☆5,191 · Updated this week
- ☆21 · Mar 3, 2025 · Updated last year
- Simple (fast) transformer inference in PyTorch with torch.compile + lit-llama code ☆10 · Aug 29, 2023 · Updated 2 years ago
- Simple byte pair encoding mechanism used for tokenization, written purely in C ☆148 · Nov 11, 2024 · Updated last year
- Official repository for our work on micro-budget training of large-scale diffusion models. ☆1,555 · Jan 12, 2025 · Updated last year
- Implementing DeepSeek R1's GRPO algorithm from scratch ☆1,803 · Apr 18, 2025 · Updated 11 months ago
- The simplest, fastest repository for training/finetuning small-sized VLMs. ☆4,762 · Oct 27, 2025 · Updated 5 months ago
- NanoGPT (124M) in 2 minutes ☆5,003 · Mar 17, 2026 · Updated 2 weeks ago
- Simple MPI implementation for prototyping or learning ☆305 · Aug 6, 2025 · Updated 7 months ago
- Custom Triton kernels for training Karpathy's nanoGPT. ☆19 · Oct 21, 2024 · Updated last year
- Our library for RL environments + evals ☆3,930 · Updated this week
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆189 · Jan 19, 2026 · Updated 2 months ago
- CIFAR-10 speedrun: Trains to 94% accuracy in 1.98 seconds on a single NVIDIA A100 GPU. ☆68 · Oct 17, 2025 · Updated 5 months ago
- ☆92 · Jun 30, 2025 · Updated 9 months ago
- Single File, Single GPU, From Scratch, Efficient, Full Parameter Tuning library for "RL for LLMs" ☆605 · Oct 7, 2025 · Updated 5 months ago
- An implementation of PSGD Kron second-order optimizer for PyTorch ☆99 · Jul 24, 2025 · Updated 8 months ago
- A scalable asynchronous reinforcement learning implementation with in-flight weight updates. ☆382 · Mar 23, 2026 · Updated last week
- UNet diffusion model in pure CUDA ☆656 · Jun 28, 2024 · Updated last year
- Learn CUDA with PyTorch ☆257 · Updated this week
- ☆52 · May 19, 2025 · Updated 10 months ago
- Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling ☆477 · May 17, 2025 · Updated 10 months ago
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation ☆33 · Feb 26, 2026 · Updated last month
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines ☆927 · Feb 28, 2026 · Updated last month
- A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems. ☆3,812 · Mar 13, 2026 · Updated 2 weeks ago
- Cookbook of SGLang - Recipe ☆106 · Updated this week
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆90 · Nov 29, 2025 · Updated 4 months ago
- Tile primitives for speedy kernels ☆3,285 · Updated this week
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Oct 1, 2025 · Updated 5 months ago
- Official Code Repository for the paper "Continuous Diffusion Model for Language Modeling" (NeurIPS 2025). ☆70 · Sep 25, 2025 · Updated 6 months ago
- FlexAttention w/ FlashAttention3 Support ☆27 · Oct 5, 2024 · Updated last year
- Optimizing Causal LMs through GRPO with weighted reward functions and automated hyperparameter tuning using Optuna ☆59 · Oct 18, 2025 · Updated 5 months ago
- Official implementation of TBA for async LLM post-training. ☆29 · Nov 5, 2025 · Updated 4 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆598 · Aug 12, 2025 · Updated 7 months ago
- Optimized primitives for collective multi-GPU communication ☆10 · May 8, 2024 · Updated last year
- Helpful tools and examples for working with flex-attention ☆1,165 · Feb 8, 2026 · Updated last month
- Minimal reproduction of DeepSeek R1-Zero ☆12,991 · Feb 27, 2026 · Updated last month
- NSA Triton Kernels written with GPT5 and Opus 4.1 ☆71 · Aug 12, 2025 · Updated 7 months ago
- Minimal (400 LOC) implementation, maximum (multi-node, FSDP) GPT training ☆132 · Apr 17, 2024 · Updated last year