changjonathanc / flex-nano-vllm
A minimal, FlexAttention-based, vLLM-style inference engine for fast Gemma 2 inference.
☆334 · Nov 2, 2025 · Updated 3 months ago
Alternatives and similar repositories for flex-nano-vllm
Users who are interested in flex-nano-vllm are comparing it to the libraries listed below.
- Simple (fast) transformer inference in PyTorch with torch.compile + lit-llama code ☆10 · Aug 29, 2023 · Updated 2 years ago
- Minimalistic 4D-parallelism distributed training framework for education purpose ☆2,076 · Aug 26, 2025 · Updated 5 months ago
- ☆21 · Mar 3, 2025 · Updated 11 months ago
- Simple MPI implementation for prototyping or learning ☆300 · Aug 6, 2025 · Updated 6 months ago
- A PyTorch native platform for training generative AI models ☆5,069 · Updated this week
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆187 · Jan 19, 2026 · Updated 3 weeks ago
- LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation ☆33 · Oct 11, 2025 · Updated 4 months ago
- Custom triton kernels for training Karpathy's nanoGPT. ☆19 · Oct 21, 2024 · Updated last year
- ☆52 · May 19, 2025 · Updated 8 months ago
- The simplest, fastest repository for training/finetuning small-sized VLMs. ☆4,647 · Oct 27, 2025 · Updated 3 months ago
- Fast and memory-efficient exact kmeans ☆138 · Updated this week
- ☆89 · Jun 30, 2025 · Updated 7 months ago
- Learn CUDA with PyTorch ☆200 · Feb 7, 2026 · Updated last week
- A scalable asynchronous reinforcement learning implementation with in-flight weight updates. ☆362 · Updated this week
- Parallel Scaling Law for Language Model: Beyond Parameter and Inference Time Scaling ☆469 · May 17, 2025 · Updated 8 months ago
- ☆20 · Jul 12, 2023 · Updated 2 years ago
- CIFAR-10 speedrun: Trains to 94% accuracy in 1.98 seconds on a single NVIDIA A100 GPU. ☆56 · Oct 17, 2025 · Updated 3 months ago
- Implementing DeepSeek R1's GRPO algorithm from scratch ☆1,762 · Apr 18, 2025 · Updated 9 months ago
- NanoGPT (124M) in 2 minutes ☆4,624 · Updated this week
- Cookbook of SGLang - Recipe ☆73 · Updated this week
- Helpful tools and examples for working with flex-attention ☆1,127 · Feb 8, 2026 · Updated last week
- utilities for batched llm calls with retries ☆44 · Updated this week
- [Developmental] Quarto Extension to Enable Google Colaboratory Links with Quarto Documents ☆15 · May 18, 2025 · Updated 8 months ago
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi… ☆23 · Oct 1, 2025 · Updated 4 months ago
- Our library for RL environments + evals ☆3,833 · Updated this week
- Spectral Sphere Optimizer ☆96 · Jan 14, 2026 · Updated last month
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆87 · Nov 29, 2025 · Updated 2 months ago
- UNet diffusion model in pure CUDA ☆661 · Jun 28, 2024 · Updated last year
- Single File, Single GPU, From Scratch, Efficient, Full Parameter Tuning library for "RL for LLMs" ☆593 · Oct 7, 2025 · Updated 4 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆92 · Jul 17, 2025 · Updated 6 months ago
- Tile primitives for speedy kernels ☆3,139 · Updated this week
- Official repository for our work on micro-budget training of large-scale diffusion models. ☆1,550 · Jan 12, 2025 · Updated last year
- JAX implementation of GPTQ quantization algorithm ☆10 · Jul 19, 2023 · Updated 2 years ago
- Code accompanying our ICML 2020 paper on choice set optimization in group decision-making. ☆11 · Jun 27, 2020 · Updated 5 years ago
- Optimized primitives for collective multi-GPU communication ☆10 · May 8, 2024 · Updated last year
- ICLR 2023: Learning to Extrapolate: A Transductive Approach ☆11 · Aug 15, 2023 · Updated 2 years ago
- Jax implementation of VIT-VQGAN ☆10 · Jan 25, 2024 · Updated 2 years ago
- An implementation of PSGD Kron second-order optimizer for PyTorch ☆98 · Jul 24, 2025 · Updated 6 months ago
- logit lens for VGGT ☆26 · Dec 2, 2025 · Updated 2 months ago