FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference.
☆345Nov 2, 2025Updated 6 months ago
Alternatives and similar repositories for flex-nano-vllm
Users that are interested in flex-nano-vllm are comparing it to the libraries listed below.
- Tools for visualizing neural nets☆20Jul 29, 2025Updated 9 months ago
- Minimalistic 4D-parallelism distributed training framework for educational purposes☆2,182Aug 26, 2025Updated 8 months ago
- Simple MPI implementation for prototyping or learning☆312Aug 6, 2025Updated 9 months ago
- A PyTorch native platform for training generative AI models☆5,339Updated this week
- ☆21Mar 3, 2025Updated last year
- Simple (fast) transformer inference in PyTorch with torch.compile + lit-llama code☆10Aug 29, 2023Updated 2 years ago
- The simplest, fastest repository for training/finetuning small-sized VLMs.☆4,861Oct 27, 2025Updated 6 months ago
- Simple byte-pair encoding mechanism for tokenization, written purely in C☆147Nov 11, 2024Updated last year
- Official repository for our work on micro-budget training of large-scale diffusion models.☆1,573Jan 12, 2025Updated last year
- A scalable asynchronous reinforcement learning implementation with in-flight weight updates.☆409Updated this week
- Implementing DeepSeek R1's GRPO algorithm from scratch☆1,844Apr 18, 2025Updated last year
- NanoGPT (124M) in 90 seconds☆5,233May 12, 2026Updated last week
- Custom triton kernels for training Karpathy's nanoGPT.☆19Oct 21, 2024Updated last year
- The simplest, fastest repository for training/finetuning medium-sized GPTs.☆191Jan 19, 2026Updated 3 months ago
- Our library for RL environments + evals☆4,106Updated this week
- ☆94Jun 30, 2025Updated 10 months ago
- Single File, Single GPU, From Scratch, Efficient, Full Parameter Tuning library for "RL for LLMs"☆611Oct 7, 2025Updated 7 months ago
- An implementation of PSGD Kron second-order optimizer for PyTorch☆100Jul 24, 2025Updated 9 months ago
- UNet diffusion model in pure CUDA☆657Jun 28, 2024Updated last year
- Archer2.0 evolves from its predecessor by introducing ASPO, which overcomes fundamental PPO-Clip limitations to prevent premature converg…☆31Oct 10, 2025Updated 7 months ago
- ☆52May 19, 2025Updated last year
- CIFAR-10 speedrun: Trains to 94% accuracy in 1.98 seconds on a single NVIDIA A100 GPU.☆76Oct 17, 2025Updated 7 months ago
- Learn CUDA with PyTorch☆300Updated this week
- Parallel Scaling Law for Language Model — Beyond Parameter and Inference Time Scaling☆479May 17, 2025Updated last year
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation☆34Feb 26, 2026Updated 2 months ago
- ☆19Oct 3, 2022Updated 3 years ago
- Checkpoint-engine is a simple middleware to update model weights in LLM inference engines☆953Feb 28, 2026Updated 2 months ago
- A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.☆4,186May 10, 2026Updated last week
- Official repository for Flash Local Linear Attention☆23Apr 23, 2026Updated 3 weeks ago
- Tile primitives for speedy kernels☆3,360May 11, 2026Updated last week
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning☆97Apr 20, 2026Updated 3 weeks ago
- Official Code Repository for the paper "Continuous Diffusion Model for Language Modeling" (NeurIPS 2025).☆72Sep 25, 2025Updated 7 months ago
- FlexAttention w/ FlashAttention3 Support☆27Oct 5, 2024Updated last year
- Official implementation of TBA for async LLM post-training.☆31Nov 5, 2025Updated 6 months ago
- Optimizing Causal LMs through GRPO with weighted reward functions and automated hyperparameter tuning using Optuna☆60Oct 18, 2025Updated 7 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.☆600Aug 12, 2025Updated 9 months ago
- Optimized primitives for collective multi-GPU communication☆10May 8, 2024Updated 2 years ago
- Cookbook of SGLang - Recipe☆134May 5, 2026Updated last week
- Helpful tools and examples for working with flex-attention☆1,187Apr 13, 2026Updated last month